Markov Chains Part 1: N-Grams
After wanting to create a markov chain to generate random words for ages, I've recently had the time to actually write one :D Since I had a lot of fun writing it, I thought I'd share it here.
A markov chain, in simple terms, is an algorithm to take a bunch of input, and generate a virtually unlimited amount of output in the style of the input. If I put my 166 strong wordlist of sciencey words through a markov chain, I get a bunch of words like this:
a | b | c | d |
---|---|---|---|
raccession | bstrolaneu | aticl | lonicretiv |
mpliquadri | tagnetecta | subinverti | catorp |
ssignatten | attrotemic | surspertiv | tecommultr |
ndui | coiseceivi | horinversp | icreflerat |
landargeog | eograuxila | omplecessu | ginverceng |
evertionde | chartianua | spliqui | ydritangt |
grajecubst | ngintagorp | ombintrepe | mbithretec |
trounicabl | ombitagnai | ccensorbit | holialinai |
cessurspec | dui | mperaneuma | yptintivid |
ectru | llatividet | imaccellat | siondl |
tru | coo | treptinver | gnatiartia |
nictrivide | pneumagori | entansplan | uatellonic |
Obviously, some of the above aren't particularly readable, but the majority are ok (I could do with a longer input wordlist, I think).
To create our very own markov chain that can output words like the above, we need 2 parts: An n-gram generator, to take in the word list and convert it into a form that we can feed into the second part - the markov chain itself. In this post, I'm going to just look at the n-gram generator - I'll cover the markov chain itself in the second part of this mini-series.
An n-gram is best explained by example. Take the word refractive
, for example. Let's split it up into chunks:
ref
efr
fra
rac
act
cti
tiv
ive
See what I've done? I've taken the original word and split it into chunks of 3, but I've only moved along the word by 1 character at a time, so some characters have been duplicated. These are n-grams of order 3. The order, in the case of an n-gram, is the number of characters per chunk. We could use any order we like:
refra
efrac
fract
racti
activ
ctive
The order of the above is 5. If you're wondering how this could possibly be useful - don't worry: All will be explained in due time :-) For now though, writing all these n-grams out manually is rather annoying and tedious. Let's write some code!
Generating n-grams from a single word like we did above is actually pretty simple. Here's what I came up with:
/// <summary>
/// Generates a unique list of n-grams from the given string.
/// </summary>
/// <param name="str">The string to n-gram-ise.</param>
/// <param name="order">The order of n-gram to generate.</param>
/// <returns>A unique list of n-grams found in the specified string.</returns>
public static IEnumerable<string> GenerateFlat(string str, int order)
{
List<string> results = new List<string>();
for(int i = 0; i < str.Length - order; i++)
{
results.Add(str.Substring(i, order));
}
return results.Distinct();
}
I'm using C♯ here, but you can use whatever language you like. Basically, I enter a loop and crawl along the word, adding the n-grams I find to a list, which I then de-duplicate and return.
Generating n-grams for just one word is nice, but we need to process a whole bunch of words. Thankfully, that's easy to automate too with a sneaky overload:
/// <summary>
/// Generates a unique list of n-grams that the given list of words.
/// </summary>
/// <param name="words">The words to turn into n-grams.</param>
/// <param name="order">The order of n-gram to generate..</param>
/// <returns>A unique list of n-grams found in the given list of words.</returns>
public static IEnumerable<string> GenerateFlat(IEnumerable<string> words, int order)
{
List<string> results = new List<string>();
foreach(string word in words)
{
results.AddRange(GenerateFlat(word, order));
}
return results.Distinct();
}
All the above does is take a list of words, run them all through the n-gram generation method we wrote above and return the de-duplicated results. Here's a few that it generated from the same wordlist I used above in order 3:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|
hor | sig | ign | gna | str | tre | ren | ngt |
sol | old | lde | oli | sor | sou | oun | tel |
lla | sub | ubs | bst | tem | emp | mpe | atu |
tur | err | ert | thr | hre | dim | ime | men |
nsi | ack | cki | kin | raj | aje | jec | tor |
ans | nsa | sat | nsf | sfe | nsl | sla | slu |
luc | uce | nsm | smi | nsp | are | nsu | tan |
Next time, I'll show you my (unweighted) markov chain I've written that uses the n-grams generated by these methods.