Have you seen “to be or not to be” yet? Probably not. After all, a true monkey types totally at random, meaning there is an equal probability that any key will be pressed at any given time. The chances are incredibly slim we’ll get an exact sequence. But what if we could generate our own custom probability map for a keyboard (or a sequence of word tokens)? What kind of results could we achieve? Let’s consider the following scenario. What if we picked letters according to their actual frequency in the English language? Here’s a chart to get us started.
English letter frequency (alphabetic) by Nandhp Licensed under Public domain via Wikimedia Commons.
Letter Probability ------ ----------- a 8.167% b 1.492% c 2.782% d 4.253%
How might we pick ‘a’ 8% of the time and ‘d’ 4% of the time? Another way of stating this question might be: “How would be pick ‘a’ twice as often as ‘d’”? While there are a variety of solutions to this sort of problem a simple one that we will employ in our examples is the following. Imagine we had an array.
We could then pick one element from that array randomly as follows:
Each element in the case of an array with two spots has a 50% chance of being picked. But what if created the array as follows instead?
Now when picking a random element, there is a two out of three chance of picking ‘a’. In fact, ‘a’ is twice as likely to be picked than ‘d’. Using all the letter frequencies found in this JSON file, we could rewrite our random generator to build an array of letters, adding each letter to the array a number of times according to its frequency.
While we’ve probably increased the likelihood of randomly typing “to be or not to be” only slightly, we can nevertheless see how the quality of the results have changed. e’s appear a great deal more than z’s and so on and so forth.
Next, let’s take this idea of custom letter probabilities another step forward and examine the frequencies of letters neighboring other letters commonly in English. For example, how often does ‘h’ occur after ‘t’ versus ‘a’ or ‘r’ and so on and so forth?
This is exactly the approach of N-grams. An n-gram is a contiguous sequence of N elements. In the case of text, this might be N letters or N words or N phonemes or N syllables. For an example, here’s a function that returns all word N-grams for a given String (using a regex to split the text up into tokens):
Try it out below.
Looking at all the N-grams of a given text provides a strategy for generating text. Let’s go back to the phrase “to_be_or_not_to_be” (using an underscore instead of space for clarity). Let’s start with the simplest possible scenario and look at all N-grams where N=1 (which is admittedly a bit silly, but will be a useful demonstration.)
t o _ b e _ o r _ n o t _ t o _ b e
Now, let’s look at all the possible characters that appear after t:
t -> o t -> _ t -> o
From the above we can calculate that in this (very small) piece of text, an ‘o’ follows a ‘t’ 67% of the time and a space 33%. Now let’s look at ‘o’, where we next see a space 50% of the time, ‘r’ 25%, and ‘t’ 25%.
o -> _ o -> r o -> t o -> _
Now, imagine we next decided to generate new text based on these probabilities. We might start with ‘t’:
And then pick a new character based on the array associated in the
Now repeat for the next N-gram. And so and and so forth. Here are some sample outcomes (and the full code for producing these):
This technique is known as a Markov Chain. A Markov Chain can be described as a sequence of random “states” where each new state is conditional only on the previous state. An example of a Markov Chain is monopoly. The next state of the monopoly board depends on the current state and the roll of the dice. It doesn’t matter how we got to that current state, only what it is at the moment. A game like blackjack, for example, is different in that the deal of the cards is dependent on the history of many previous deals (assuming a single-deck not continuously shuffled.)
Using an N-gram model, can use a markov chain to generate text where each new word or character is dependent on the previous word (or character) or sequence of words (or characters). For example. given the phrase “I have to” we might say the next word is 50% likely to be “go”, 30% likely to be “run” and 20% likely to be “pee.” We can construct these word sequence probabilities based on a large corpus of source texts. Here, for example, is the full set of ngrams of order 2 and their possible outcomes in the phrase: “to be or not to be, that is the question.”
The process to generate the above
ngrams object is quite similar to the concordance we previously developed. Only this time, instead of pairing each token with a count, we are pairing each N-gram with an array of possible next characters. Let’s look at how this is built.
A small difference from the concordance you might notice above is the use of
if (ngrams[gram] === undefined). Both are valid strategies.
Once we’ve got the
ngrams object with all possibities mapped out we can start to generate text.
You might notice the
choice() function above. Just about all of our examples require the ability to pick a random element from an array over and over. To simplify this process, I’m emulating the python choice() function.
In the examples, you’ll find all of the above packaged into a MarkovGenerator object. Special thanks to Allison Parish’s excellent RWET examples which this is modeled after.