You know that thing we call an array? Yes, that’s right, an ordered list of data. Each element of an array is numbered and accessed by its numeric index.
What if, however, instead of numbering the elements of an array we could name them? This element is named “Sue”, this one “Bob”, this one “Jane”, and so on and so forth. In programming, this kind of data structure is often referred to as an “associative array”, “map”, “hash” or “dictionary.” It’s a collection of key/value pairs. The key is Sue, the value is 24. It’s just like having a dictionary of words: when you look up, say, Sue, the definition is 24.
Associative arrays can be incredibly convenient for various applications. For example, you could keep a list of student IDs (student name/id) or a list of prices (product name/price) in a dictionary. The fundamental building block of just about every text analysis application is a concordance, a list of all words in a document along with how many times each word occurred. A dictionary is the perfect data structure to hold this information. Each element of the dictionary consists of a String paired with a number.
To start writing a concordance all we need is an empty object.
A value (in this case a count) can be paired with a word by naming the key as a String.
The above is just another way of writing:
We’ll need this new way since we’ll be pulling the names for the object as strings from a source text.
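As a minimal sketch of the two notations described above (the words and counts here are made up for illustration):

```javascript
// Start with an empty object to hold the concordance.
const concordance = {};

// Bracket notation: the key is given as a string.
concordance['the'] = 100;

// Dot notation reaches the same property.
console.log(concordance.the); // 100

// Bracket notation also works when the key is stored in a variable,
// which is what we need when the words come from a source text.
const word = 'rainbow';
concordance[word] = 5;
console.log(concordance.rainbow); // 5
```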
In the case of our examples, we’re going to take a text document, split it into an array of Strings and increase the value associated with a particular key (i.e. word) each time we encounter the same String. Let’s assume we have some text in a variable named data. First, we’ll split it into word “tokens”.
Then we’ll go through each one, one at a time.
The tricky thing here is we have to determine if each token (each element of the resulting array) is a new word or one we’ve already encountered. If it’s new, we need to set its initial count at 1. If it’s not, we need to increase its count by one.
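The counting steps above might be sketched as follows. The sample text in data and the regular expression used to split it are assumptions for illustration; a real tokenizer may need to treat apostrophes, hyphens and so on more carefully.

```javascript
// Hypothetical sample text standing in for a loaded document.
const data = 'The rainbow appears after the rain. The end.';

// Split on runs of non-word characters and drop empty strings.
const tokens = data.split(/\W+/).filter((token) => token.length > 0);

// Object.create(null) gives an object with no inherited keys, so a word
// like "constructor" cannot collide with a built-in property.
const concordance = Object.create(null);

for (const token of tokens) {
  const word = token.toLowerCase();
  if (concordance[word] === undefined) {
    // New word: start its count at 1.
    concordance[word] = 1;
  } else {
    // Seen before: increase its count by one.
    concordance[word]++;
  }
}

console.log(concordance.the); // 3
```

Lowercasing each token is a design choice: it makes “The” and “the” count as the same word.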
There, we now have a concordance object that stores all the words and their counts! The question, however, remains: what do we do with this thing?
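One thing we might want is a list of the words ordered by how often they appear. The keys of an object cannot be sorted directly, but an array of those keys can, using the sort() function available as part of the Array prototype. The tricky thing here is that the sort function expects as its argument a function itself!
We pass a comparison function to the sort() function itself. This function takes two arguments: a and b. It should return a positive number if element b should appear before a in the sorted result, a negative number if a should appear before b, and zero if the two can stay in either order.
Because only the sign of the returned number matters, the comparison can be condensed to a single subtraction of the two counts.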
Now that we have sorted keys, we can iterate over the concordance.
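A sketch of sorting the keys and then walking through them in order. The counts are hard-coded for illustration, and Object.keys() is just one way to build the array of keys; the linked example may build it differently.

```javascript
// Hypothetical counts standing in for a finished concordance.
const concordance = { rainbow: 5, the: 16, droplets: 3, and: 6 };

// Collect the keys into an array so they can be sorted.
const keys = Object.keys(concordance);

// Comparison function: a positive result puts b before a, so
// subtracting the counts sorts from most to least frequent.
keys.sort((a, b) => concordance[b] - concordance[a]);

// Iterate over the concordance in sorted order.
for (const key of keys) {
  console.log(key + ': ' + concordance[key]);
}
```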
Here is a text concordance example and its source code.
One common application of a text concordance is TF-IDF, or term frequency–inverse document frequency. Let’s consider a corpus of Wikipedia articles. Is there a way we could automatically generate keywords or tags for an article based on its word counts?
TF-IDF has two components. Term frequency is one that we are already quite familiar with. How frequent is a given term in a document? This is exactly what we calculated in the concordance. We could stop here and say that keyword generation is: “The words that appear most frequently are most important in a document.” While there is some merit to this idea, what we’ll see is that the most frequent words are just the words that appear frequently in all text: junk words like ‘to’, ‘a’, ‘and’, ‘you’, ‘me’, etc. Ironically, these junk words may hold the key to unlocking a world of information about a particular text. Nevertheless, these are clearly not related to a document’s subject matter as keywords.
TF-IDF takes a different approach. Yes, a word that appears frequently in a document (TF) is one key indicator. But adding in another indicator such as inverse document frequency (is it a word that rarely appears in other documents?) takes the junk words out of the equation. Let’s consider a Wikipedia article about rainbows. Here are some of the counts:
the: 16
and: 6
rainbow: 5
droplets: 3
Using this as a keyword score alone is not enough, since the highest-scoring word would be ‘the’. Now let’s say we looked at five other Wikipedia articles, for six in total. Let’s now count how many articles each of these words appears in at least once.
the: 6
and: 6
rainbow: 1
droplets: 1
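This is a somewhat obvious result: ‘the’ and ‘and’ appear in all six articles, while ‘rainbow’ and ‘droplets’ appear in only one. We could therefore compute a score for each of these by multiplying its count by the total number of articles divided by the number of articles it appears in: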
rainbow: 5 * (6/1) = 30
droplets: 3 * (6/1) = 18
the: 16 * (6/6) = 16
and: 6 * (6/6) = 6
Now we’re getting somewhere!
TF-IDF is meant to be run on a much larger corpus and in order to dampen the effect of the IDF value, a common solution is to use the logarithm of IDF.
rainbow: 5 * log(6/1) = 3.89
droplets: 3 * log(6/1) = 2.33
the: 16 * log(6/6) = 0.0
and: 6 * log(6/6) = 0.0
If logarithmic scale is new to you, this Khan Academy video may help. (Note how if a term appears in every single document the tf-idf score is always zero.)
We can improve this one more step by using not just the raw count of how many times a term (such as “rainbow”) appears in a document, but the ratio of its count to the total number of words in the document. This normalizes the score by document length. So if the total number of words in the article is 100, the score would now be:
rainbow: (5/100) * log(6/1) = 0.0389
droplets: (3/100) * log(6/1) = 0.0233
the: (16/100) * log(6/6) = 0.0
and: (6/100) * log(6/6) = 0.0
In the case of only examining this document it makes no difference, but if we were looking at the score for “rainbow” across multiple documents without this change the score would be biased towards longer documents.
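The scores above can be reproduced with a short sketch. The function name tfidf and its parameters are hypothetical, and Math.log10 is used because the numbers in this section correspond to a base-10 logarithm.

```javascript
// count: times the term appears in the document
// totalWords: total number of words in the document
// totalDocs: number of documents in the corpus
// docsWithTerm: number of documents containing the term at least once
function tfidf(count, totalWords, totalDocs, docsWithTerm) {
  const tf = count / totalWords;
  const idf = Math.log10(totalDocs / docsWithTerm);
  return tf * idf;
}

console.log(tfidf(5, 100, 6, 1).toFixed(4));  // rainbow: 0.0389
console.log(tfidf(3, 100, 6, 1).toFixed(4));  // droplets: 0.0233
console.log(tfidf(16, 100, 6, 6).toFixed(4)); // the: 0.0000
```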
For a wonderful example of TF-IDF out in the world, take a look at Nicholas Felton’s 2013 Annual Report.
p(A|B) = (p(B|A) * p(A)) / (p(B|A) * p(A) + p(B|~A) * p(~A))
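Consider the following scenario:
1% of students have ITPosis.
90% of students who have ITPosis test positive on the TID.
5% of students who do not have ITPosis also test positive on the TID.
You have received a positive TID. What is the likelihood you have ITPosis?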
As you might expect, there is a very precise answer to this question, but it’s probably not what you would initially guess. Bayesian reasoning is counter-intuitive and takes quite a bit of getting used to. In fact, when given a similar question related to breast cancer and mammograms, only 15% of doctors get the answer correct.
The answer, roughly 15.4%, is calculated via Bayes’ Theorem. Let’s look at it again with this scenario:
This video illustrates the problem quite nicely.
The problem our brains run into is those rascally 90% and 95% numbers. 90% of students who have the disease test positive and 95% of those who don’t test negative, so if I test positive, I should probably have it, right?!! The important thing to remember is that only 1% of students actually have the disease. Sure, testing positive increases the likelihood, but because 5% of students without the disease receive a false positive, it only increases the chances to about 15%. All of this is explained in incredibly thorough and wonderful detail in Eliezer Yudkowsky’s article An Intuitive Explanation of Bayesian Reasoning. My explanation is simply adapted from his.
By the way, we could have calculated it as follows:
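P(ITPosis | Positive TID) = (90% * 1%) / (90% * 1% + 5% * 99%)
This reads as “the probability that a positive TID means you have ITPosis” equals the probability of having ITPosis and testing positive, divided by the probability of testing positive at all (with or without ITPosis).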
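To check the arithmetic, the formula can be sketched as a small function. The name bayes and its parameter names are hypothetical, chosen to mirror p(A), p(B|A) and p(B|~A) in the formula.

```javascript
// pA: prior probability of A (having ITPosis)
// pBgivenA: probability of B (a positive TID) given A
// pBgivenNotA: probability of B given not A (a false positive)
function bayes(pA, pBgivenA, pBgivenNotA) {
  const numerator = pBgivenA * pA;
  const denominator = numerator + pBgivenNotA * (1 - pA);
  return numerator / denominator;
}

const result = bayes(0.01, 0.9, 0.05);
console.log(result.toFixed(4)); // 0.1538
```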
So why do we care? This type of reasoning can be applied quite nicely to text analysis. A common example is spam filtering. If we know the probability that a spam e-mail contains a specific word, we can calculate the likelihood that an e-mail is spam based on its concordance.
A wonderful resource for this approach is Paul Graham’s A Plan for Spam as well as Better Bayesian Filtering.
The example code that follows is not a perfect text classifier by any means. It’s a simple implementation of the idea that outlines the basic steps one might take to apply Bayesian Filtering to text.
The first thing we need to do is expand on the concordance example that stores a single number associated with each word. For classification, we’ll need to know things like how many times that word appears in spam e-mails versus good (aka ‘ham’) e-mails. And then we’ll need to use these values to calculate the probability that each word would appear in a spam or ham e-mail.
Instead of storing a single number like dictionary['the'] = 16; we now need to associate each key with an object that holds multiple data points. The process of running the filter works as follows:
Here’s how this might look:
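A minimal sketch of the counting step, assuming each dictionary entry is an object with fields named countA, countB, probA and probB. The field names, the train function and the sample texts are all hypothetical; the example code for this page may be organized differently.

```javascript
// Each key maps to an object holding several values for that word.
const dictionary = Object.create(null);

// Running totals of how many tokens were seen in each category.
const tokenCount = { A: 0, B: 0 };

// Count every word of a training text under category 'A' or 'B'.
function train(text, category) {
  const tokens = text.toLowerCase().split(/\W+/).filter((t) => t.length > 0);
  for (const token of tokens) {
    if (dictionary[token] === undefined) {
      dictionary[token] = { countA: 0, countB: 0, probA: 0, probB: 0 };
    }
    const entry = dictionary[token];
    if (category === 'A') {
      entry.countA++;
      tokenCount.A++;
    } else {
      entry.countB++;
      tokenCount.B++;
    }
  }
}

train('buy cheap pills now', 'A');  // e.g. spam
train('see you at lunch now', 'B'); // e.g. ham
console.log(dictionary.now.countA, dictionary.now.countB); // 1 1
```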
The above steps are repeated over and over again for all training documents. Once all the “training” files are read, the probabilities can be calculated for every word.
Once we’ve gone through the process of counting the occurrences in each category (‘A’ or ‘B’, spam or ham, etc.), we can then calculate the probabilities according to Bayes’ rule.
The above formula might look a little bit simpler to you than the original Bayes rule. This is because I am leaving out the “prior probability” and assuming that any document has a 50% chance of being category A or B.
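One way this simplified calculation might look, assuming each entry stores countA and countB and that the total number of tokens seen in each category is known. All names and numbers here are hypothetical. With a 50% prior for each category, the p(A) and p(~A) terms of Bayes’ rule cancel, leaving only the per-category word frequencies.

```javascript
// Hypothetical counts gathered during training.
const dictionary = {
  pills: { countA: 8, countB: 2 },
  lunch: { countA: 1, countB: 9 },
};
const totalA = 100; // total tokens seen in category A
const totalB = 100; // total tokens seen in category B

for (const key of Object.keys(dictionary)) {
  const word = dictionary[key];
  // How often the word shows up within each category:
  // p(word | A) and p(word | B).
  const freqA = word.countA / totalA;
  const freqB = word.countB / totalB;
  // Bayes' rule with equal priors.
  word.probA = freqA / (freqA + freqB);
  word.probB = 1 - word.probA;
}

console.log(dictionary.pills.probA.toFixed(2)); // 0.80
console.log(dictionary.lunch.probA.toFixed(2)); // 0.10
```

Dividing by the category totals first keeps a larger training set in one category from skewing every word toward that category.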
Now, all that is left to do is take a new document and compute the total probability for that document by combining the probabilities of its individual words, using the combined probability formula specified in Graham’s essay. For more about combined probability, here’s another resource.
Now we know the probability the document is in category A!
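Graham’s essay combines the individual word probabilities as the product of the probabilities, divided by that product plus the product of their complements. A sketch, with a hypothetical function name and made-up probabilities:

```javascript
// probs: array of per-word probabilities of belonging to category A
function combinedProbability(probs) {
  let productA = 1;    // p1 * p2 * ... * pn
  let productNotA = 1; // (1 - p1) * (1 - p2) * ... * (1 - pn)
  for (const p of probs) {
    productA *= p;
    productNotA *= 1 - p;
  }
  return productA / (productA + productNotA);
}

console.log(combinedProbability([0.8, 0.9]).toFixed(3)); // 0.973
console.log(combinedProbability([0.5, 0.5]).toFixed(3)); // 0.500
```

Note that a single word with a probability of exactly 0 or 1 would decide the result on its own, so in practice the per-word probabilities are usually kept slightly away from those extremes.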
One important aspect of this analysis that I’ve left out is the “interesting-ness” of any given word. An interesting-ness rating is defined as how different, say, the spam probability is from 0.5 (i.e. 50/50 is as boring as it gets), or the absolute value of probA - 0.5. Graham’s spam filter, for example, only uses the probabilities of the top 15 most interesting words. If you are looking for an exercise, you might try adding this feature to the Bayesian classifier example.
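As a starting point for that exercise, here is one possible sketch of picking the most interesting words. The function name mostInteresting and the sample probabilities are hypothetical.

```javascript
// Hypothetical per-word probabilities for the words of one document.
const probs = { pills: 0.95, the: 0.5, lunch: 0.1, now: 0.55, free: 0.8 };

// Return the n words whose probability is farthest from 0.5.
function mostInteresting(wordProbs, n) {
  return Object.keys(wordProbs)
    .sort((a, b) => Math.abs(wordProbs[b] - 0.5) - Math.abs(wordProbs[a] - 0.5))
    .slice(0, n);
}

console.log(mostInteresting(probs, 3).join(', ')); // pills, lunch, free
```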