Markov Chains Part 1: N-Grams

After wanting to create a markov chain to generate random words for ages, I've recently had the time to actually write one :D Since I had a lot of fun writing it, I thought I'd share it here.

A markov chain, in simple terms, is an algorithm to take a bunch of input, and generate a virtually unlimited amount of output in the style of the input. If I put my 166 strong wordlist of sciencey words through a markov chain, I get a bunch of words like this:

a b c d
raccession bstrolaneu aticl lonicretiv
mpliquadri tagnetecta subinverti catorp
ssignatten attrotemic surspertiv tecommultr
ndui coiseceivi horinversp icreflerat
landargeog eograuxila omplecessu ginverceng
evertionde chartianua spliqui ydritangt
grajecubst ngintagorp ombintrepe mbithretec
trounicabl ombitagnai ccensorbit holialinai
cessurspec dui mperaneuma yptintivid
ectru llatividet imaccellat siondl
tru coo treptinver gnatiartia
nictrivide pneumagori entansplan uatellonic

Obviously, some of the above aren't particularly readable, but the majority are ok (I could do with a longer input wordlist, I think).

To create our very own markov chain that can output words like the above, we need 2 parts: An n-gram generator, to take in the word list and convert it into a form that we can feed into the second part - the markov chain itself. In this post, I'm going to just look at the n-gram generator - I'll cover the markov chain itself in the second part of this mini-series.

An n-gram is best explained by example. Take the word refractive, for example. Let's split it up into chunks:


See what I've done? I've taken the original word and split it into chunks of 3, but I've only moved along the word by 1 character at a time, so some characters have been duplicated. These are n-grams of order 3. The order, in the case of an n-gram, is the number of characters per chunk. We could use any order we like:


The order of the above is 5. If you're wondering how this could possibly be useful - don't worry: All will be explained in due time :-) For now though, writing all these n-grams out manually is rather annoying and tedious. Let's write some code!

Generating n-grams from a single word like we did above is actually pretty simple. Here's what I came up with:

/// <summary>
/// Generates a unique list of n-grams from the given string.
/// </summary>
/// <param name="str">The string to n-gram-ise.</param>
/// <param name="order">The order of n-gram to generate.</param>
/// <returns>A unique list of n-grams found in the specified string.</returns>
public static IEnumerable<string> GenerateFlat(string str, int order)
    List<string> results = new List<string>();
    for(int i = 0; i < str.Length - order; i++)
        results.Add(str.Substring(i, order));
    return results.Distinct();

I'm using C♯ here, but you can use whatever language you like. Basically, I enter a loop and crawl along the word, adding the n-grams I find to a list, which I then de-duplicate and return.

Generating n-grams for just one word is nice, but we need to process a whole bunch of words. Thankfully, that's easy to automate too with a sneaky overload:

/// <summary>
/// Generates a unique list of n-grams that the given list of words.
/// </summary>
/// <param name="words">The words to turn into n-grams.</param>
/// <param name="order">The order of n-gram to generate..</param>
/// <returns>A unique list of n-grams found in the given list of words.</returns>
public static IEnumerable<string> GenerateFlat(IEnumerable<string> words, int order)
    List<string> results = new List<string>();
    foreach(string word in words)
        results.AddRange(GenerateFlat(word, order));
    return results.Distinct();

All the above does is take a list of words, run them all through the n-gram generation method we wrote above and return the de-duplicated results. Here's a few that it generated from the same wordlist I used above in order 3:

1 2 3 4 5 6 7 8
hor sig ign gna str tre ren ngt
sol old lde oli sor sou oun tel
lla sub ubs bst tem emp mpe atu
tur err ert thr hre dim ime men
nsi ack cki kin raj aje jec tor
ans nsa sat nsf sfe nsl sla slu
luc uce nsm smi nsp are nsu tan

Next time, I'll show you my (unweighted) markov chain I've written that uses the n-grams generated by these methods.

Sources and Further Reading

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blender blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing dns docker documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions freeside future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro minetest network networking nibriboard node.js open source operating systems optimisation own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference release releases rendering resource review rust searching secrets security series list server software sorting source code control statistics storage svg systemquery talks technical terminal textures thoughts three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 worldeditadditions xmpp xslt


Art by Mythdael