Friday, September 5, 2014

Implementing the Bayes Spam Filter from the book "Doing Data Science"

I decided to have a try at implementing the spam filter described in "Doing Data Science" by Cathy O'Neil and Rachel Schutt. The book gives a fairly clear description of Bayes' Theorem, and a bash script which can calculate the probability of an individual word being spam, but it gives no real direction for actually coding a multi-word Bayes filter. It does give the formula such a filter is based on, but offers few clues on how one would code it, apart from taking the log of both sides and showing that Laplace smoothing is valid. This is not quite as user friendly as one might wish, but it does leave plenty to try out and learn.

First, we will need some data, so we might as well fetch the Enron emails that the book uses for its bash script.

wget 'http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz'

I am assuming that everyone is running Linux here and has the wget command installed. One could do this just as well from inside R, as sketched below. Next, we will need to unpack the data.

tar xzvf enron1.tar.gz
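
If you would rather stay inside R for both steps, something like the following should be equivalent (same URL and working directory assumed; this is just a sketch, not code from the book):

# Fetch and unpack the corpus without leaving R (download.file and untar are in base/utils).
url <- "http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz"
download.file(url, destfile = "enron1.tar.gz")
untar("enron1.tar.gz")   # creates ./enron1/ham and ./enron1/spam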

This will unpack everything into a directory called enron1, with a directory called spam and another called ham underneath it. All the email messages are stored as text files within each directory, one per message. Now, the formula they give for a multi-word spam filter looks like this:

$$p(x|c)=\prod_j\theta_{jc}^{x_j}(1-\theta_{jc})^{(1-x_j)}$$

You can code the formula directly in R with something like this:

pxGivenC <- function(theta, x) {
  not_x <- 1 - x
  not_theta <- 1 - theta
  pxc <- 1
  # Multiply the per-word factors together across the whole dictionary.
  for(i in seq_along(x)) {
    pxc <- pxc * theta[i]^x[i] * not_theta[i]^not_x[i]
  }
  return(pxc)
}

But this will not work as well as one might hope, because the differences in probabilities become so tiny that they fall off the end of the floating point representation. If one tries this approach on the Enron dataset, the model guesses that everything is ham, since the ratio of ham to spam is about 0.71 to 0.29. The book recommends taking the log of both sides and then solves out the sections of the resulting formula which do not change between individual email messages, but it doesn't really explain where the formula came from. So let me try to explain what I have managed to figure out so far. You might want to download a reference given in the book, "Spam Filtering with Naive Bayes - Which Naive Bayes?" It is a math-heavy explanation of the various ways one can implement Bayes' Theorem, but it explains where the equation above came from, which "Doing Data Science" does not.
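
A quick way to convince yourself of the underflow problem, using made-up numbers rather than the Enron counts:

# A dictionary of a few thousand words is enough to underflow double precision:
prod(rep(0.5, 2000))   # 0
0.5^2000               # also 0; the true value is around 1e-602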

In the book they take the log of both sides of the formula, which is apparently necessary to have any useful output, and get:

$$\log(p(x|c))=\sum_jx_j\log(\theta_j/(1-\theta_j))+\sum_j(\log(1-\theta_j))$$

Most of this formula doesn't vary with the individual messages, so it can be extracted and solved once ahead of time:

$$w_j=\log(\theta_j/(1-\theta_j))$$

$$w_0=\sum_j(\log(1-\theta_j))$$

So substituting we get:

$$\log(p(x|c))=\sum_jx_jw_j+w_0$$
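
In R this is only a few lines. Here is a sketch, assuming theta is the vector of per-word probabilities for one class and x is the 0/1 word-indicator vector for a message (the function name is my own):

# Log-domain version of p(x|c): sums instead of products, so no underflow.
logPxGivenC <- function(theta, x) {
  w  <- log(theta / (1 - theta))   # the w_j weights
  w0 <- sum(log(1 - theta))        # the constant w_0 term
  sum(x * w) + w0
}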

Now, what does all this mean, and what have we calculated in relation to Bayes' Theorem? First, we haven't calculated what we were actually looking for, which would be p(spam|x) for all the words in an email message taken together; instead we have calculated p(x|spam), the probability of finding this particular collection of words in an email message we know to be spam. This is evidently the hard part according to the book, but in my case at least, I found a few other confusing things.
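
For reference, Bayes' Theorem turns these class-conditional probabilities into the one we want, with the denominator summing over both classes:

$$p(\mathrm{spam}|x)=\frac{p(x|\mathrm{spam})\,p(\mathrm{spam})}{p(x|\mathrm{spam})\,p(\mathrm{spam})+p(x|\mathrm{ham})\,p(\mathrm{ham})}$$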

First, let's go over all the steps we need to take before we can even apply these formulas.

  • We need to count all the words that occur in spam and ham email and turn this into a dictionary with the counts of each word. But this is not a count of how many times we found a word in a spam message and how many times we found a word in a ham message. Instead, we are counting the number of spam messages we found this word at least once in, then counting the number of ham messages we found the same word at least once in, and saving these counts. This information is used to construct theta in the formula above. Theta is a long vector of probabilities of finding each and every word in the dictionary in a spam email. According to the book, it is constructed by this formula: (count of spam emails containing word)/(count of all emails containing word). On the other hand, the reference I mentioned above gives the formula as (count of spam emails containing word)/(count of spam emails). The second form would seem to be the actual probability p(word|spam), while I haven't quite figured out exactly what the first form is. Maybe it allows one to calculate the denominator of Bayes' Theorem more easily, but the authors do not explain this in the book.
  • After we have built a great deal of infrastructure to let us read in all the email messages of both classes and count how many messages of each class contain each word, we need to calculate theta by one formula or the other, and then calculate w and w0 for our training set. All these counts and variables are determined by our training data and have no dependence on any test email we are classifying. As near as I can tell, counting the number of training emails in which each word appears is the entire training process.
  • Next, we need to read in a test email, parse it into individual words, and construct an x vector of zeros which is as long as our dictionary. We will flip the x vector to one at the index of every dictionary word which we find in the email message we are testing.
  • Now we can finally load the values into the formula above and derive the equivalent of log(p(word|spam)) for all the words in our test email message. This produced outputs like -1084.43 for me, as well as many cases of -Inf and NaN. To avoid the NaN and -Inf values, one must ensure that none of the words have a probability of zero with regard to either ham or spam messages. To do this, one must alter the formula used to calculate theta by adding a small alpha and beta to the numerator and denominator of the ratio of counts (a sketch of this appears after the list). According to the book, alpha and beta should be some value < 0.2, while according to the online source alpha = 1 and beta = 2 is the correct choice. The first problem, the relatively large negative number, is more perplexing. This is the log of a probability, so to calculate the probability of finding our email message x given that we knew it was spam we would have to take the exponential of -1084.43, and that is zero according to R. Assuming we have set everything up right to this point, we will need to find some way of turning this into p(spam|x) that doesn't just involve running it through exp().
  • Since we can't just take the exponential, we will need to keep everything as a log until we have solved the denominator of Bayes' Theorem, and then hopefully we will have something that looks more like a probability. Looking online, I found a log identity that seems useful: log(a + b) = log(a) + log(1 + exp(log(b) - log(a))). This can be used to construct the denominator of Bayes' Theorem without taking the anti-log (see the sketch after this list), and so after a few algebra incidents we can see things which look like valid probabilities come out of our little dummy Bayes classifier.
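
Putting the last few bullets together, here is a rough sketch of the smoothed theta calculation and of the posterior computed entirely in log space. The function names are my own, and I have used the second form of theta (spam counts over spam messages) with the alpha = 1, beta = 2 smoothing from the reference paper; swap in the book's version if you prefer:

# Smoothed estimate of theta for one class, e.g. P(word | spam).
# wordCounts[j] = number of messages of that class containing word j at least once,
# nMessages = number of messages of that class in the training set.
smoothedTheta <- function(wordCounts, nMessages, alpha = 1, beta = 2) {
  (wordCounts + alpha) / (nMessages + beta)
}

# Posterior P(spam | x), computed without ever leaving log space.
# logPxSpam and logPxHam are log(p(x|c)) from the formula above;
# pSpam and pHam are the class priors (about 0.29 and 0.71 for enron1).
posteriorSpam <- function(logPxSpam, logPxHam, pSpam, pHam) {
  a <- logPxSpam + log(pSpam)   # log of the Bayes numerator
  b <- logPxHam  + log(pHam)    # log of the other denominator term
  # log(a' + b') = log(a') + log(1 + exp(log(b') - log(a'))),
  # factoring out the larger term so exp() cannot overflow.
  m <- max(a, b)
  logDenominator <- m + log(1 + exp(min(a, b) - m))
  exp(a - logDenominator)
}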

This is the code for a little dummy program which implements everything apart from the parsing. The code for the entire project is on github at https://github.com/joelkr/BayesFilter.git. The code is in R and it works, but the section which parses and builds the dictionary is seriously slow. So here is something I wrote to try to figure out why I wasn't getting valid probabilities out of my huge Rube Goldberg parsing contraption. It creates a dummy dictionary out of two-letter tokens and then constructs some email vectors, with spam emails mostly having odd tokens and ham messages mostly having even tokens:
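
Below is a minimal, self-contained sketch of that kind of dummy experiment; the token scheme, training-set sizes, and function names are my own guesses rather than the actual code from the repository:

set.seed(42)

# Dummy dictionary of two-letter tokens: "aa", "bb", ..., "tt".
dict <- paste0(letters[1:20], letters[1:20])
odd  <- seq(1, length(dict), by = 2)   # token indices that lean spam
even <- seq(2, length(dict), by = 2)   # token indices that lean ham

# Build one synthetic message as a 0/1 vector over the dictionary:
# five tokens drawn from the preferred set plus one stray token.
makeMessage <- function(preferred, other) {
  x <- numeric(length(dict))
  x[sample(preferred, 5)] <- 1
  x[sample(other, 1)]     <- 1
  x
}

nSpam <- 70; nHam <- 70
spamTrain <- t(replicate(nSpam, makeMessage(odd, even)))  # rows = messages
hamTrain  <- t(replicate(nHam,  makeMessage(even, odd)))

# "Training" is just counting, per word, how many messages of each class
# contain it, then smoothing with alpha = 1, beta = 2.
thetaSpam <- (colSums(spamTrain) + 1) / (nSpam + 2)
thetaHam  <- (colSums(hamTrain)  + 1) / (nHam  + 2)
pSpam <- nSpam / (nSpam + nHam)
pHam  <- nHam  / (nSpam + nHam)

# log(p(x|c)) exactly as in the formula derived above.
logPxGivenC <- function(theta, x) {
  sum(x * log(theta / (1 - theta))) + sum(log(1 - theta))
}

# P(spam|x) via Bayes' Theorem, keeping everything in log space.
classify <- function(x) {
  a <- logPxGivenC(thetaSpam, x) + log(pSpam)
  b <- logPxGivenC(thetaHam,  x) + log(pHam)
  m <- max(a, b)
  exp(a - (m + log(1 + exp(min(a, b) - m))))
}

classify(makeMessage(odd, even))   # should be close to 1 (spammy tokens)
classify(makeMessage(even, odd))   # should be close to 0 (hammy tokens)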
