Code Kata - 6 - RK's Musings

Code Kata 6 - Anagrams

In England, I used to waste hour upon hour doing newspaper crosswords. As crossword fans will know, English cryptic crosswords have a totally different feel to their American counterparts: most clues involve punning or word play, and there are lots of anagrams to work through. For example, a recent Guardian crossword had:

Down: 21. Most unusual form of arrest (6) The hint is the phrase ‘form of,’ indicating that we’re looking for an anagram. Sure enough ‘arrest’ has six letters, and can be arranged nicely into ‘rarest,’ meaning ‘most unusual.’ (Other anagrams include raster, raters, Sartre, and starer)

A while back we had a thread on the Ruby mailing list about finding anagrams, and I’d like to resurrect it here. The challenge is fairly simple: given a file containing one word per line, print out all the combinations of words that are anagrams; each line in the output contains all the words from the input that are anagrams of each other. For example, your program might include in its output: \\

kinship pinkish
enlist inlets listen silent
boaster boaters borates
fresher refresh
sinks skins
knits stink
rots sort If you run this on the word list here you should find 2,530 sets of anagrams (a total of 5,680 words). Running on a larger dictionary (about 234k words) on my OSX box produces 15,048 sets of anagrams (including all-time favorites such as actaeonidae/donatiaceae, crepitant/pittancer, and (for those readers in Florida) electoral/recollate).

For added programming pleasure, find the longest words that are anagrams, and find the set of anagrams containing the most words (so “parsley players replays sparely” would not win, having only four words in the set). === My Workings

> setwd("C:/Cauldron/garage/Volatility/Learn/DeliberatePractice/Codekata/codekata_6")
> t1 = Sys.time()
> f <- read.delim("codekata_wordlist.txt", stringsAsFactors = FALSE)
> colnames(f) <- c("word")
> status <- as.data.frame(cbind(regexpr("[^ a-zA-Z]", f$word)))
> sanitized <- data.frame(word = f$word[c(status < 0)])
> sanitized$word <- tolower(sanitized$word)
> sanitized$word <- as.character(sanitized$word)
> recode <- function(x) {
     paste(sort(strsplit(x, "")[[1]]), sep = "", collapse = "")
 }
> test.sorted <- data.frame(word = sanitized, code = as.character(apply(sanitized,
     1, recode)))
> res <- split(test.sorted$word, test.sorted$code)
> count <- cbind(sapply(res, length))
> not.anagrams <- which(count[, 1] == 1)
> res[not.anagrams] <- NULL
> t2 = Sys.time()
> print(round(t2 - t1, 2))
Time difference of 4.27 secs
> res.output <- data.frame(anagrams = sapply(res, function(x) paste(x,
     collapse = ",")))
> res.output$count <- cbind(sapply(res, length))
> res.output <- res.output[order(res.output$count, decreasing = T),
     ]
> head(res.output)
                                        anagrams count
aeprs  pares,parse,pears,rapes,reaps,spare,spear     7
acerst caster,caters,crates,reacts,recast,traces     6
acert        caret,cater,crate,react,recta,trace     6
aerrst arrest,rarest,raster,raters,sartre,starer     6
opst               opts,post,pots,spot,stop,tops     6
abder              bared,beard,bread,debar,debra     5

The program has taken 4.27 But on Python it took 0.3 seconds . Is Python 14.23 times faster ?

Learnings

Good practice of using list structure functions like removing the list elements, use of split function where the result is a list, use of cbind and sapply.
Learnt that product of primes can be used as key in dictionary
Checked that Python can do the above kata in 0.3 seconds flat where R takes 5 seconds
While trying to do this Kata in Python, I used pyscript a fantastic IDE.