Starbeamrainbowlabs

Stardust
Blog

Markov Chains Part 1: N-Grams

After wanting to create a markov chain to generate random words for ages, I've recently had the time to actually write one :D Since I had a lot of fun writing it, I thought I'd share it here.

A markov chain, in simple terms, is an algorithm to take a bunch of input, and generate a virtually unlimited amount of output in the style of the input. If I put my 166 strong wordlist of sciencey words through a markov chain, I get a bunch of words like this:

a b c d
raccession bstrolaneu aticl lonicretiv
mpliquadri tagnetecta subinverti catorp
ssignatten attrotemic surspertiv tecommultr
ndui coiseceivi horinversp icreflerat
landargeog eograuxila omplecessu ginverceng
evertionde chartianua spliqui ydritangt
grajecubst ngintagorp ombintrepe mbithretec
trounicabl ombitagnai ccensorbit holialinai
cessurspec dui mperaneuma yptintivid
ectru llatividet imaccellat siondl
tru coo treptinver gnatiartia
nictrivide pneumagori entansplan uatellonic

Obviously, some of the above aren't particularly readable, but the majority are ok (I could do with a longer input wordlist, I think).

To create our very own markov chain that can output words like the above, we need 2 parts: An n-gram generator, to take in the word list and convert it into a form that we can feed into the second part - the markov chain itself. In this post, I'm going to just look at the n-gram generator - I'll cover the markov chain itself in the second part of this mini-series.

An n-gram is best explained by example. Take the word refractive, for example. Let's split it up into chunks:

ref
efr
fra
rac
act
cti
tiv
ive

See what I've done? I've taken the original word and split it into chunks of 3, but I've only moved along the word by 1 character at a time, so some characters have been duplicated. These are n-grams of order 3. The order, in the case of an n-gram, is the number of characters per chunk. We could use any order we like:

refra
efrac
fract
racti
activ
ctive

The order of the above is 5. If you're wondering how this could possibly be useful - don't worry: All will be explained in due time :-) For now though, writing all these n-grams out manually is rather annoying and tedious. Let's write some code!

Generating n-grams from a single word like we did above is actually pretty simple. Here's what I came up with:

/// <summary>
/// Generates a unique list of n-grams from the given string.
/// </summary>
/// <param name="str">The string to n-gram-ise.</param>
/// <param name="order">The order of n-gram to generate.</param>
/// <returns>A unique list of n-grams found in the specified string.</returns>
public static IEnumerable<string> GenerateFlat(string str, int order)
{
    List<string> results = new List<string>();
    for(int i = 0; i < str.Length - order; i++)
    {
        results.Add(str.Substring(i, order));
    }
    return results.Distinct();
}

I'm using C♯ here, but you can use whatever language you like. Basically, I enter a loop and crawl along the word, adding the n-grams I find to a list, which I then de-duplicate and return.

Generating n-grams for just one word is nice, but we need to process a whole bunch of words. Thankfully, that's easy to automate too with a sneaky overload:

/// <summary>
/// Generates a unique list of n-grams that the given list of words.
/// </summary>
/// <param name="words">The words to turn into n-grams.</param>
/// <param name="order">The order of n-gram to generate..</param>
/// <returns>A unique list of n-grams found in the given list of words.</returns>
public static IEnumerable<string> GenerateFlat(IEnumerable<string> words, int order)
{
    List<string> results = new List<string>();
    foreach(string word in words)
    {
        results.AddRange(GenerateFlat(word, order));
    }
    return results.Distinct();
}

All the above does is take a list of words, run them all through the n-gram generation method we wrote above and return the de-duplicated results. Here's a few that it generated from the same wordlist I used above in order 3:

1 2 3 4 5 6 7 8
hor sig ign gna str tre ren ngt
sol old lde oli sor sou oun tel
lla sub ubs bst tem emp mpe atu
tur err ert thr hre dim ime men
nsi ack cki kin raj aje jec tor
ans nsa sat nsf sfe nsl sla slu
luc uce nsm smi nsp are nsu tan

Next time, I'll show you my (unweighted) markov chain I've written that uses the n-grams generated by these methods.

Sources and Further Reading

Tag Cloud

3d account algorithms announcement archives arduino artificial intelligence assembly async audio bash batch blog bookmarklet booting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compiling css dailyprogrammer debugging demystification distributed computing downtime embedded systems encryption es6 features event experiment external first impressions future game github github gist graphics hardware hardware meetup holiday html html5 html5 canvas interfaces internet io.js jabber javascript js bin labs learning library linux low level lua maintenance network networking node.js operating systems performance photos php pixelbot portable privacy programming problems project projects prolog protocol protocols pseudo 3d python reddit reference release releases resource review rust secrets security series list server software sorting source code control statistics svg technical terminal textures three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtualisation visual web website windows windows 10 xmpp

Archive

Art by Mythdael