From Wikipedia: “Grammar is the study of the rules governing the use of a given natural language, and, as such, is a field of linguistics. Traditionally, grammar included morphology and syntax; in modern linguistics these subfields are complemented by phonetics, phonology, semantics, and pragmatics.”
A Context-Free Grammar is a set of rules that define the syntax of a language — what is and is not a valid sequence of “tokens”. Programming languages, for example, are context-free grammars — a compiler reads your code to make sure it conforms to specific rules and informs you of any errors. (These rules aren’t limited to the production of text, and can be used for music, images, etc.) A context-free grammar, however, isn’t robust enough to describe the complexities of natural language. After all, they have no context! Natural languages can be described using Context-sensitive grammars, a concept introduced by Chomsky in the 50s.
For our purposes here, we want to learn how to define our own context free grammars and generate text from them. I’m going to give a short explanation here, followed by code examples. For more detailed discussions, I would recommend Daniel Howe’s (creator of RiTa)explanation and Context-Free Grammars by Allison Parrish.
All “production rules” in a context-free grammar are of the form:
V -> w
V is a single “non-terminal” and
w is a sequence of “terminals” and “non-terminals”
A non-terminal symbol is simply another symbol which needs to be defined in another production rule. A terminal symbol does not need to be defined because that’s it, you’ve reached the end, what is here is what should be in the sentence itself. Here’s an incredibly simple CFG.
s -> a b a -> 'the' b -> c 'cat' c -> 'happy' c -> 'sad'
Anything in single quotes is a “terminal” element, meaning this is the end and no more substitutions are necessary. In the above “cat,” “happy,” and “sad” are all terminals. Anything that is not in quotes is non-terminal, or a “symbol.” The abve example has 4 symbols — s, a, b, c. The symbol “s” is commonly used to indicate “start” or “sentence.”
Notice how in the above set of rules the symbol c can be “happy” or “sad.” This is often express with an OR (pipe character) to combine these two rules:
c -> 'happy' | 'sad'
Again, this grammar is incredibly basic, and is just for demonstrating how the elements work. The only two valid “sentences” that conform to this grammar are:
the happy cat the sad cat
The process of generating the above two sentences goes something like:
a b which becomes
the c cat which then becomes
the happy cat or
the sad cat. Here, either “happy” or “sad” is picked randomly (with a 50% chance for each one.)
The above may look a little strange to you. Notice how the value for each key is an array of arrays. Each element of the larger array is one possible outcome which is an array itself: the list of elements in that outcome. I should also note that the is no distinction between a “symbol” and a “terminal.” Everything is just a String, if there is a rule associated with that String then it is a symbol, if not, it’s terminal.
A generated sentence from the CFG is commonly referred to as an “expansion” and is a bit tricky to implement. We could try to just use a loop, iterating over a sentence and executing the production rules. The nested nature of the rules, however, introduces a great deal of complexity. An elegant strategy, however, is to call a function recursively expanding the same sentence over and over again until there are no non-terminal elements left. Let’s assume we have an empty array and some seed (often called an “axiom”).
We can write a function that receives both the array and the seed and calls itself recursively if a rule that matches that seed String is present.
The full implementation of the CFG object can be found here. In addition, here are some sample grammars: creating a grammar in code, a grammar file, a grammar files as JSON, haiku grammar file as JSON. These grammars come from Allison Parrish and Daniel Howe.
Finally, a grammar file is something you can generate yourself based on some algorithm. For example, using this haiku grammar file from Daniel Howe, you could read in a source text, analyze the syllable counts of all the words, and rewrite the grammar file to reflect these new words. Here is a version of this grammar file using words from an Obama speech: generated_grammar.g.
Here is an example that uses a concordance to get al the words from a source file and the RiTa library to count syllables.