Code Kata 5 - Bloom Filters

There are many circumstances where we need to find out if something is a member of a set, and many algorithms for doing it. If the set is small, you can use bitmaps. When it gets larger, hashes are a useful technique. But when the sets get big, we start bumping into limitations. Holding 250,000 words in memory for a spell checker might be too big an overhead if your target environment is a PDA or cell phone. Keeping a list of web pages visited might be extravagant when you get up to tens of millions of pages.

Fortunately, there’s a technique that can help. Bloom filters are a 30-year-old statistical way of testing for membership in a set. They greatly reduce the amount of storage you need to represent the set, but at a price: they’ll sometimes report that something is in the set when it isn’t (but they’ll never do the opposite; if the filter says that the set doesn’t contain your object, you know that it doesn’t). And the nice thing is you can control the accuracy: the more memory you’re prepared to give the algorithm, the fewer false positives you get. I once wrote a spell checker for a PDP-11 which stored a dictionary of 80,000 words in 16 kbytes, and I very rarely saw it let through an incorrect word. (Update: I must have misremembered these figures, because they are not in line with the theory. Unfortunately, I can no longer read the 8" floppies holding the source, so I can’t get the correct numbers. Let’s just say that I got a decent-sized dictionary, along with the spell checker, all in under 64k.)

Bloom filters are very simple. Take a big array of bits, initially all zero. Then take the things you want to look up (in our case we’ll use a dictionary of words). Produce n independent hash values for each word. Each hash is a number which is used to set the corresponding bit in the array of bits. Sometimes there’ll be clashes, where a bit will already have been set by some other word. This doesn’t matter.
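
As a rough sketch in R of the loading step (this jumps ahead to the MD5-slicing trick suggested later in the kata; the hashes helper, its chunk sizes, and the sample words are illustrative choices, not part of the kata):

    library(digest)

    m <- 2^16                 # number of bits in the filter
    bits <- logical(m)        # the big array of bits, initially all zero

    # Stand-in for n independent hash functions: slice the 32-character
    # MD5 digest into four 7-hex-digit chunks (28 bits each, small enough
    # for strtoi) and reduce each one modulo m.
    hashes <- function(word, m) {
        hex <- digest(word, algo = "md5", serialize = FALSE)
        chunks <- substring(hex, seq(1, 28, by = 7), seq(7, 28, by = 7))
        (strtoi(chunks, base = 16L) %% m) + 1   # +1: R indexing is 1-based
    }

    # Loading: set the bit for every hash value of every word.
    for (w in c("apple", "banana", "cherry")) {
        bits[hashes(w, m)] <- TRUE
    }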

To check whether a new word is already in the dictionary, perform the same hashes on it that you used to load the bitmap. Then check to see if each of the bits corresponding to these hash values is set. If any bit is not set, then you never loaded that word in, and you can reject it.
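
Continuing the sketch above, the lookup side might look like this (maybe_contains is again an invented name):

    # A word can only be present if every one of its bits is set; a single
    # zero bit proves the word was never loaded.
    maybe_contains <- function(word, bits, m) {
        all(bits[hashes(word, m)])
    }

    maybe_contains("banana", bits, m)     # TRUE: loaded above
    maybe_contains("zeppelin", bits, m)   # FALSE, barring a false positive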

The Bloom filter reports a false positive when the hashes for a word all happen to correspond to bits that were set previously by other words. In practice this doesn’t happen too often, as long as the bitmap isn’t too heavily loaded with one-bits (clearly, if every bit is one, it’ll give a false positive on every lookup). There’s a discussion of the math behind Bloom filters at www.cs.wisc.edu/~cao/papers/summary-cache/node8.html.
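
For reference, the approximation given on that page: with m bits, k hashes per word, and n words loaded, the chance that an unknown word finds all k of its bits already set is roughly (1 - e^(-kn/m))^k. A couple of spot checks in R (fp_rate is just a throwaway name):

    fp_rate <- function(m, k, n) (1 - exp(-k * n / m))^k

    fp_rate(m = 2^16, k = 8, n = 7500)   # ~0.017: a heavily loaded filter
    fp_rate(m = 2^20, k = 8, n = 7500)   # ~9e-11: 16x the bits, vastly fewer false positives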

So, this kata is fairly straightforward. Implement a Bloom-filter-based spell checker. You’ll need some kind of bitmap, some hash functions, and a simple way of reading in the dictionary and then the words to check. For the hash functions, remember that you can always use something that generates a fairly long hash (such as MD5) and then take your smaller hash values by extracting sequences of bits from the result. On a Unix box you can find a list of words in /usr/dict/words (or possibly in /usr/share/dict/words). For others, I’ve put a word list up at pragprog.com/katadata/wordlist.txt.

Play with using different numbers of hashes, and with different bitmap sizes.
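
The approximation above makes this cheap to explore before writing any filter code, e.g. by sweeping the number of hashes at a fixed bitmap size:

    for (k in c(2, 4, 8, 16)) {
        cat(k, "hashes:", signif(fp_rate(m = 2^18, k = k, n = 7500), 3), "\n")
    }

More hashes help only up to a point: the optimum is around k = (m/n)·ln 2, beyond which the bitmap fills with one-bits and the false positive rate climbs again.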

Preliminary Thoughts

Well, I need to understand how to create a hash function in R, and then how to limit the range of the hash function to between 1 and m. Where do I get the code for a hash function? What is MD5? Is it just one function or a family of functions?

I think I need to experiment with the digest package.

My Workings

> library(digest)
> # Convert a hex string (such as an MD5 digest) into a binary string,
> # one character per bit.
> hex2bin <- function(x) {
     bin <- apply(outer(0:15, 3:0, function(x, y) x%/%(2^y)%%2),
         1, paste, collapse = "")
     names(bin) <- format.hexmode(0:15)
     sapply(strsplit(x, ""), function(x) paste(bin[x], collapse = ""))
 }
> # Convert a 16-bit binary string into its integer value (0 to 65535).
> bin2val <- function(input) {
     input.bits <- unlist(strsplit(input, NULL))
     base <- 2^(15:0)
     return(sum(base * as.numeric(input.bits)))
 }
> # Slice a 128-bit binary string into eight 16-bit chunks and convert
> # each to an integer, giving eight hash values per word. Note that each
> # value is at most 65535, so any bitmap positions beyond 2^16 go unused.
> bincutter <- function(input) {
     rel.indices <- cbind(seq(1, 128, by = 16), seq(0, 128, by = 16)[-1])
     x <- c(apply(rel.indices, 1, function(x) substr(input, x[1],
         x[2])))
     y <- sapply(x, bin2val)
     return(as.vector(y))
 }
> # Read the word list (header = FALSE, or the first word is silently
> # consumed as a column name) and keep only words made purely of letters.
> f <- read.delim("words.txt", header = FALSE, stringsAsFactors = FALSE)
> colnames(f) <- c("word")
> status <- as.data.frame(cbind(regexpr("[^ a-zA-Z]", f$word)))
> sanitized <- as.data.frame(f$word[c(status < 0)])
> colnames(sanitized) <- c("word")
> # Load the first 7500 words: hash each with MD5, expand the digest to
> # bits, and cut it into eight 16-bit hash values.
> model <- sanitized[1:7500, 1, drop = F]
> model$hex <- apply(model, 1, function(x) digest(x))
> model$bin <- apply(model, 1, function(x) hex2bin(x[2]))
> model.bin <- model[, c("bin"), drop = F]
> hashval <- t(apply(model.bin, 1, bincutter))
> n <- dim(model)[1]
> ratio <- 30
> m <- ratio * n
> # Reduce each hash value modulo m and shift to R's 1-based indexing.
> hashval <- hashval%%m + 1
> # Set the bit for every hash value of every loaded word.
> final.res <- numeric(m)
> final.res[c(hashval)] <- 1
> # Test against words 35001:38000, none of which were loaded, so every
> # reported hit is a false positive.
> test <- sanitized[35001:38000, 1, drop = F]
> test$hex <- apply(test, 1, function(x) digest(x))
> test$bin <- apply(test, 1, function(x) hex2bin(x[2]))
> test.bin <- test[, c("bin"), drop = F]
> testval <- t(apply(test.bin, 1, bincutter))
> testval <- testval%%m + 1
> # Report 1 (word may be present) only if all eight bits are set; any
> # zero bit means the word was definitely never loaded.
> test.word <- function(x) {
     check.sum <- sum(final.res[c(x)] == 0)
     if (check.sum > 0) {
         return(0)
     }
     else {
         return(1)
     }
 }
> test.results <- apply(testval, 1, test.word)
> table(test.results)
test.results
   0    1
2944   56
> mean(test.results == 1)
[1] 0.01866667

The false positive rate is around 1.9%. That is much higher than the formula above predicts for m = 30n and k = 8, and the code explains why: each 16-bit chunk can only index bits 0 to 65535, so only the first 2^16 positions of the bitmap are ever used. With an effective m of 65536, the predicted rate is about 1.7%, which is in line with what we observe.

Learnings

  • Found the digest package, which uses MD5 to give a hash function producing a 32-character hexadecimal string (a 128-bit hash)
  • Found a nice hack to convert hex to binary
  • Understood the concept of Bloom filters. Beautiful learning
  • Got some nice practice with R commands