Markov Chains Part 2: Unweighted Chains
Hello and welcome to the second part of this mini-series about Markov chains. In the last part, I explained what an n-gram was, and how I went about generating them.
In this part, I'll get to the meat of the subject: the Markov chain itself. To start with (to simplify matters) I'll be looking at unweighted Markov chains.
A Markov chain, in essence, takes the n-grams we generated last time and picks one to start with. It then takes all but the first character of the n-gram it chose, and finds all the n-grams in its library that begin with that sequence of characters. After drawing up a list of suitable n-grams, it picks one at random, and tacks the last character of that n-gram onto the end of the first n-gram.
Then, it starts the whole process all over again with the 2nd n-gram it chose, and then the 3rd, and so on until it either a) hits a brick wall and can't find any suitable n-grams to use next, or b) reaches the desired length of word it was asked to generate.
An unweighted Markov chain, as I call it, does not take the frequency of the source n-grams in the original text into account - it just picks the next n-gram from the list randomly.
With explanations and introductions out of the way, let's get down to some code! Since the Markov chain is slightly more complicated than generating n-grams, I decided to write a class for it. Let's start with one of those, then:
using System;
using System.Collections.Generic;
using System.Linq;
namespace SBRL.Algorithms.MarkovGrams
{
    /// <summary>
    /// An unweighted character-based markov chain.
    /// </summary>
    public class UnweightedMarkovChain
    {
    }
}
I've also added a few using statements for later. Our new class is looking a bit bare. How about some methods to liven it up a bit?
/// <summary>
/// Creates a new character-based markov chain.
/// </summary>
/// <param name="inNgrams">The ngrams to populate the new markov chain with.</param>
public UnweightedMarkovChain(IEnumerable<string> inNgrams)
{
}
/// <summary>
/// Returns a random ngram that's currently loaded into this UnweightedMarkovChain.
/// </summary>
/// <returns>A random ngram from this UnweightedMarkovChain's cache of ngrams.</returns>
public string RandomNgram()
{
}
/// <summary>
/// Generates a new random string from the currently stored ngrams.
/// </summary>
/// <param name="length">
/// The target length of the string to generate.
/// Note that this is a target, not a fixed value - e.g. passing 2 when the n-gram order is 3 will
/// result in a string of length 3. Also, depending on the current ngrams this markov chain contains,
/// it may end up being cut short.
/// </param>
/// <returns>A new random string.</returns>
public string Generate(int length)
{
}
That's much better. Let's keep going - this time with some member variables:
/// <summary>
/// The random number generator
/// </summary>
Random rand = new Random();
/// <summary>
/// The ngrams that this markov chain currently contains.
/// </summary>
List<string> ngrams;
We'll need that random number generator later! As for the List<string>, we'll be using that to store our n-grams - but you probably figured that one out for yourself :P
The class isn't looking completely bare anymore, but we can still do something about those methods. Let's start with that constructor:
public UnweightedMarkovChain(IEnumerable<string> inNgrams)
{
    ngrams = new List<string>(inNgrams);
}
Easy peasy! It just turns the IEnumerable<string> into a List<string> and stores it. Let's do another one:
public string RandomNgram()
{
    return ngrams[rand.Next(0, ngrams.Count)];
}
We're on a roll here! This is another fairly simple method - it just picks a random n-gram from the list. We'll need this for our 3rd, and most important, method, Generate(). This one's a bit more complicated, so let's take it in a few stages. Firstly, we need an n-gram to start the whole thing off. We also need to return it at the end of the method.
string result = RandomNgram();
return result;
While we're at it, we'll also need a variable to keep track of the last n-gram in the chain, so we can find an appropriate match to come next.
string lastNgram = result;
Then we'll need a loop to keep adding n-grams to the chain. Since we're not entirely sure how long we'll be looping for (and we've got fairly complicated stop conditions, as far as that kind of thing goes), I decided to use a while loop here.
while(result.Length < length)
{
}
That's the first of our 2 stop conditions in place, too! We want to stop when the word we're working on reaches its desired length. Now, we can write the bit that works out which n-gram should come next! This bit goes inside the while loop we created above (as you might suspect). First, let's fetch a list of n-grams that would actually make sense coming next.
// The substring that the next ngram in the chain needs to start with
string nextStartsWith = lastNgram.Substring(1);
// Get a list of possible n-grams we could choose from next
List<string> nextNgrams = ngrams.FindAll(gram => gram.StartsWith(nextStartsWith));
With a lambda expression, that isn't too tough :-) Strictly speaking, FindAll() is a method on List<string> itself rather than part of LINQ (Language-INtegrated Query), but it works in much the same way as LINQ's Where(). If you haven't seen LINQ before, then I'd highly recommend you check it out! It makes sorting and searching datasets much easier. The above is quite simple - I just filter our list of n-grams through a function that extracts all the ones that start with the appropriate sequence of characters.
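For comparison, here's a small sketch that runs the same filter two ways: once with List<string>.FindAll(), and once with LINQ's Where() from System.Linq. The FilterDemo class, its method names and the sample 3-grams are all made up for illustration. I've also passed StringComparison.Ordinal to StartsWith() so that the match is a plain character-by-character comparison, which is my own choice rather than something the code above does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FilterDemo
{
    // Filters with List<T>.FindAll(), as in the snippet above
    public static List<string> WithFindAll(List<string> ngrams, string prefix)
    {
        return ngrams.FindAll(gram => gram.StartsWith(prefix, StringComparison.Ordinal));
    }

    // The same filter written with LINQ's Where() extension method
    public static List<string> WithWhere(IEnumerable<string> ngrams, string prefix)
    {
        return ngrams.Where(gram => gram.StartsWith(prefix, StringComparison.Ordinal)).ToList();
    }

    public static void Main()
    {
        // A made-up library of 3-grams
        List<string> library = new List<string> { "cat", "ate", "ath", "ten", "the" };
        // If the last n-gram was "cat", the next one has to start with "at"
        string nextStartsWith = "cat".Substring(1);
        Console.WriteLine(string.Join(", ", WithFindAll(library, nextStartsWith))); // ate, ath
        Console.WriteLine(string.Join(", ", WithWhere(library, nextStartsWith)));   // ate, ath
    }
}
```

Both versions give the same answer here. Where() works on any IEnumerable<string>, while FindAll() only exists on List<T>.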
It's at this point that we can insert the second of our two stopping conditions. If there aren't any possible n-grams to pick from, then we can't continue.
// If there aren't any choices left, we can't exactly keep adding to the new string any more :-(
if(nextNgrams.Count == 0)
    break;
With our list of possible n-grams, we're now in a position to pick one at random to add to the word. LINQ's ElementAt() does the job:
// Pick a random n-gram from the list
string nextNgram = nextNgrams.ElementAt(rand.Next(0, nextNgrams.Count));
This is another simple one - it just extracts the element at a random position in the list. In hindsight I could have used the indexer syntax here ([]), but it doesn't really matter :-)
Now that we've picked the next n-gram, we can add it to the word we're building:
// Add the last character from the n-gram to the string we're building
result += nextNgram[nextNgram.Length - 1];
And that's the Markov chain practically done! Oh, we mustn't forget to update the lastNgram variable (I forgot this when building it :P):
lastNgram = nextNgram;
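To see how the overlap between n-grams builds up a word, here's a hand-traced sketch of the loop body with the random picks replaced by fixed ones. The OverlapDemo class, the Extend() helper and the 3-grams are hypothetical, and are only there to show the append step on its own.

```csharp
using System;

public static class OverlapDemo
{
    // Mirrors the append step in the loop: only the final character
    // of the chosen n-gram gets added to the word so far.
    public static string Extend(string word, string chosenNgram)
    {
        return word + chosenNgram[chosenNgram.Length - 1];
    }

    public static void Main()
    {
        string word = "cat";        // the starting n-gram
        word = Extend(word, "ath"); // "ath" starts with "at", the tail of "cat"
        Console.WriteLine(word);    // cath
        word = Extend(word, "the"); // "the" starts with "th", the tail of "ath"
        Console.WriteLine(word);    // cathe
    }
}
```

Each step only adds one character, because the rest of the chosen n-gram is already in the word.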
And that wraps up our unweighted Markov chain. Here's the whole class in full:
using System;
using System.Collections.Generic;
using System.Linq;
namespace SBRL.Algorithms.MarkovGrams
{
    /// <summary>
    /// An unweighted character-based markov chain.
    /// </summary>
    public class UnweightedMarkovChain
    {
        /// <summary>
        /// The random number generator
        /// </summary>
        Random rand = new Random();
        /// <summary>
        /// The ngrams that this markov chain currently contains.
        /// </summary>
        List<string> ngrams;
        /// <summary>
        /// Creates a new character-based markov chain.
        /// </summary>
        /// <param name="inNgrams">The ngrams to populate the new markov chain with.</param>
        public UnweightedMarkovChain(IEnumerable<string> inNgrams)
        {
            ngrams = new List<string>(inNgrams);
        }
        /// <summary>
        /// Returns a random ngram that's currently loaded into this UnweightedMarkovChain.
        /// </summary>
        /// <returns>A random ngram from this UnweightedMarkovChain's cache of ngrams.</returns>
        public string RandomNgram()
        {
            return ngrams[rand.Next(0, ngrams.Count)];
        }
        /// <summary>
        /// Generates a new random string from the currently stored ngrams.
        /// </summary>
        /// <param name="length">
        /// The target length of the string to generate.
        /// Note that this is a target, not a fixed value - e.g. passing 2 when the n-gram order is 3 will
        /// result in a string of length 3. Also, depending on the current ngrams this markov chain contains,
        /// it may end up being cut short.
        /// </param>
        /// <returns>A new random string.</returns>
        public string Generate(int length)
        {
            string result = RandomNgram();
            string lastNgram = result;
            while(result.Length < length)
            {
                // The substring that the next ngram in the chain needs to start with
                string nextStartsWith = lastNgram.Substring(1);
                // Get a list of possible n-grams we could choose from next
                List<string> nextNgrams = ngrams.FindAll(gram => gram.StartsWith(nextStartsWith));
                // If there aren't any choices left, we can't exactly keep adding to the new string any more :-(
                if(nextNgrams.Count == 0)
                    break;
                // Pick a random n-gram from the list
                string nextNgram = nextNgrams.ElementAt(rand.Next(0, nextNgrams.Count));
                // Add the last character from the n-gram to the string we're building
                result += nextNgram[nextNgram.Length - 1];
                // Remember the n-gram we just used, so the next iteration can find one that follows it
                lastNgram = nextNgram;
            }
            return result;
        }
    }
}
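One thing worth keeping in mind when feeding n-grams into the class above: it picks uniformly from whatever list it's handed. If the same n-gram turns up several times in the input, it's more likely to be chosen, and the chain is no longer strictly unweighted. Here's a minimal sketch of stripping out the repeats first with LINQ's Distinct(). The DistinctDemo class, the Deduplicate() helper and the sample n-grams are made up for illustration, and whether you need this step at all depends on how your n-grams were generated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctDemo
{
    // Hypothetical helper: removes repeated n-grams so that every
    // remaining n-gram is equally likely to be picked at random.
    public static List<string> Deduplicate(IEnumerable<string> ngrams)
    {
        return ngrams.Distinct().ToList();
    }

    public static void Main()
    {
        // "the" appears three times in this made-up list
        List<string> raw = new List<string> { "the", "the", "tha", "the" };
        List<string> unique = Deduplicate(raw);
        Console.WriteLine(raw.Count);    // 4
        Console.WriteLine(unique.Count); // 2
    }
}
```

With the raw list, a uniform random pick lands on "the" three times out of four. After deduplicating, "the" and "tha" are equally likely.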
I've released the full code for my Markov generator (with a complete command line interface!) on my personal git server. The repository can be found here: sbrl/MarkovGrams. To finish this post off, I'll leave you with a few more words that I've generated using it :D
1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|
mecuc | uipes | jeraq | acrin | nnvit |
blerbopt | drsacoqu | yphortag | roirrcai | elurucon |
pnsemophiqub | omuayplisshi | udaisponctec | mocaltepraua | rcyptheticys |
eoigemmmpntartrc | rattismemaxthotr | hoaxtancurextudu | rrgtryseumaqutrc | hrpiniglucurutaj |