Archive

## Tag Cloud

3d account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio bash batch blog bookmarklet booting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features event experiment external first impressions future game github github gist graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet io.js jabber javascript js bin labs learning library linux low level lua maintenance manjaro network networking node.js operating systems performance photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit reference release releases resource review rust secrets security series list server software sorting source code control statistics storage svg technical terminal textures three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

## Shift-Reduce Parser Part 2: Building Furniture (1)

Hello and welcome! I got a bit distracted by other things as you can tell, but I'm back with part 2 of my series on building a shift-reduce parser. If you're not sure what I'm talking about, then I'd advise reading part 1 first and then coming back here. It might be a good idea to re-read it anyway, juts to refresh your memory :-)

Last time, we created some data classes to store the various rules and tokens that we'll be generating. Today, we're going to build on that and start turning a set of rules into a parse table. Let's introduce the rules we'll working with:

<start> ::= <expression>

<expression> ::= <expression> PLUS <value>
| <term>

<term> ::= <term> DIVIDE <value>
| <value>

<value> ::= <number>
| BRACKET_OPEN <expression> BRACKET_CLOSE

<number> ::= DIGIT
| <number> DIGIT

The above represents a very basic calculator-style syntax, which only supports adding and dividing. It's written in Backus-Naur Form, which is basically a standardised way of writing parsing rules.

To build a parse table, we first must understand what such a thing actually is. Let's take a look at an example:

state action goto
* + 0 1 $E B 0 s1 s2 3 4 1 r4 r4 r4 r4 r4 2 r5 r5 r5 r5 r5 3 s5 s6 goal 4 r3 r3 r3 r3 r3 5 s1 s2 7 6 s1 s2 8 7 r1 r1 r1 r1 r1 8 r2 r2 r2 r2 r2 _(Source: Adapted from the LR Parser on Wikipedia.)_ While it looks complex at first, let's break it down. There are 3 parts to this table: The state, the action, and the goto. The action and goto represent What should happen if a particular token is encountered. In this case, the input stream contains both terminal (i.e. DIGIT, DIVIDE, BRACKET_CLOSE, etc. in the case of our BNF above) and non-terminal (i.e. number, term, expression, etc. in the case of our BNF above) symbols - if understand it correctly, so there are actually 2 parts to the table here to make sure that both are handled correctly. We'll be connecting this to our lexer, which outputs only terminal symbols, so we should be ok I believe (if you know better, please post a comment below!). The state refers to the state in the table. As I've mentioned before, a given state may contain one or more configurations. It's these configurations that give rise to the actions in the table above, such as s2 (shift and then go to state 2) or r3 (reduce and jump to state 3). To use the table, the parser must know what state it's in, and then take a look across the top row for the next symbol it has in the token stream. Once found, it can follow it down to figure out what action it should take, as explained above. If there isn't an action in the box, then there must be an error in the input, as the table doesn't tell us what to do in this situation. To that end, we should try and generate a meaningful error message to help the user to find the mistake in the input (or the developer in the parser!). We're kind of getting ahead of ourselves here though. We need to build this table first, and to do that, we need to figure out which configurations go in which state. And, going down the rabbit hole, we need to know what a configuration is. Again, it's best if I demonstrate. Consider the following parsing rule from our example BNF at the beginning of this post: <value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE A single configuration represent a possible state of the parser at a particular instant in time. I could split that above rule up like so: <value> ::= BRACKET_OPEN * <expression> BRACKET_CLOSE <value> ::= BRACKET_OPEN <expression> * BRACKET_CLOSE <value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE * The asterisk represent where the parser might have gotten up to. Everything to the left is on the stack of the parser, and everything to the right hasn't happened yet. Since this isn't a top-level rule (in our example that honour goes to the rule for the start non-terminal), the parser will never be in a position where the first item there doesn't exist yet on the stack, so we can ignore the configuration in which the asterisk would be to the left of BRACKET_OPEN. Confused? Let me try and help here. Let's draw a diagram of how our parser is going to operate: _(Source: Made by me, but adapted from the LR Parser article on Wikipedia)_ Basically, the parser will be taking in the input token stream and either shift a new terminal token onto the stack, or reduce one or more existing tokens on the stack into a single non-terminal token, which replaces those existing tokens on the stack. The configurations above represent possible states of the stack (the bit to the left of the asterisk), and possible directions that the parser could take when parsing (the bit to th right of the asterisk). Finally, when the goal is reached, the output is returned to the caller (which, by the time we're done, should be a parse tree). Said tree can then be optimised and processed for whatever purpose we desire! With this knowledge, we can deduce that we can build the entire table by recursing over the tree of rules from the start state. That way, we'll visit every rule that we'll need to parse everything required to reach the goal state by recursing over all the rules for all the non-terminals referenced by all the rules we visit. We could even generate a warning if we detect that some rules don't connect to this 'tree'. Here's a tree of our example ruleset from the beginning of this post: It's a bit spaghetti-ish, but it should be legible enough :P This gives us an idea as to how we're going to tackle this. Taking into account the data classes we created in the last post, we need to make sure we keep the following in mind: 1. Since the main ShiftReduceParser class is going to hold the rules, the ParseTable class will need a reference to its parent ShiftReduceParser in order to query the rules. 2. In light of this, the SHiftReduceParser should be responsible for satisfying any queries the ParseTable has about rules - the ParseTable should not have to go looking & filtering the rule list held by ShiftReduceParser itself. 3. ParseTable will need a recursive method that will take a single top-level rule and recurse over it and its child rules (according to the tree I've talked about above) 4. This method in ParseTale will need to be extremely careful it doesn't get stuck in a loop. To that end, it'll have to keep track of whether it's already processed a rule or not. 5. It'll probably also have to keep track of which configurations it has added to the table class structure we defined in the last post to avoid adding rules twice. 6. Once ParseTable has figured out all the configurations and grouped them all into the right states, it will then have to recurse over the generated table and fill in all the shift / reduce / goal action(s) - not forgetting about the links to the other states they should point to. It's quite the laundry list! Thankfully, most of this is quite simple if we tackle it one step at a time. The most annoying bit is the grouping of configurations into states. This is done by looking at the token immediately before the asterisk in each configuration - all the configurations with the same token here will get grouped into the same state (while there are more complex algorithms that allow for more complex grammars, we'll stick with this for now as anything else makes my head hurt! Maybe in the future I'll look as figuring out precisely what kind of LR-style parser this is, and upgrading it to be a canonical LR(1) parser - the most advanced type I know of). This is quite a lot to take in, so I think I'll leave this post here for you to digest - and we'll get to writing some code in the next one. Found this useful? Spotted a mistake? Having trouble getting your head around it? Post a comment below! ## Shift-reduce Parser Part 1: First Steps Now that I've done the Languages and Compilers module at University, it's opened my eyes to a much better and more extensible way of handling complex text in a way that can easily be read by any subsequent code I write. To that end, I've found that at least 3 different various projects of mine could benefit from the inclusion of a shift-reduce parser, but I haven't been able to track one down for C♯ yet. With this in mind, I thought to myself: "Why not build one myself?" Sure, my Lecturer didn't go into too many details as to how they work, but it can't be that difficult, can it? Oh, I had no idea..... In this mini-series, I'm going to take you through the process of building a shift-reduce parser of your very own. As I write this, I haven't actually finished mine yet - I've just got to the important milestone of building a parse table! Thankfully, that's going to be a few posts away, as there's a fair amount of ground to cover until we get to that point. Warning: This series is not for the faint of heart! It's rather complicated, and extremely abstract - making it difficult to get your head around. I've had great difficulty getting mine around it - and ended up writing it in multiple stages. If you want to follow along, be prepared for lots of research, theory, and preparation! Let's start out by taking a look at what a shift-reduce parser does. If you haven't already, I'd recommend reading my previous compilers 101 post, which explains how to write a compiler, and the different stages involved. I'd also recommend checking out my earlier post on building a lexer, as it ties in nicely with the shift-reduce parser that we'll be building. In short, a shift-reduce parser compiles a set of BNF-style rules into a Parse Table, which it then utilises as a sort of state-machine when parsing a stream on input tokens. We'll take a look at this table compilation process in a future blog post. In this post, let's set up some data structures to help us along when we get to the compilation process in the next blog post. Here's the class structure we'll be going for: Let's start with a class to represent a single token in a rule: public enum ParserTokenClass { Terminal, NonTerminal } public struct ParserToken { public readonly ParserTokenClass Class; public readonly string Type; public ParserToken(ParserTokenClass inTokenType, string inType) { Class = inTokenType; Type = inType; } public override bool Equals(object obj) { ParserToken otherTokenType = (ParserToken)obj; return Class == otherTokenType.Class && Type == otherTokenType.Type; } public override int GetHashCode() { return$"{Class}:{Type}".GetHashCode();
}

public override string ToString()
{
string terminalDisplay = Class == ParserTokenClass.Terminal ? "T" : "NT";
return $"[ParserToken {terminalDisplay}: {Type}]"; } public static ParserToken NonTerm(string inType) { return new ParserToken(ParserTokenClass.NonTerminal, inType); } public static ParserToken Term(string inType) { return new ParserToken(ParserTokenClass.Terminal, inType); } } Pretty simple! A token in a rule can either be a terminal (basically a token straight from the lexer), or a non-terminal (a token that the parser reduces a set of other tokens into), and has a type - which we represent as a string. Unfortunately due to the complex comparisons we'll be doing later, it's a huge hassle to use an enum with a template class as I did in the lexer I built that I linked to earlier. Later on (once we've built the parse table), we'll extend this class to support attaching values and other such pieces of information to it, but for now we'll leave that out to aid simplicity. I also override Equals() and GetHashCode() in order to make comparing tokens easier later on. Overriding ToString() makes the debugging process much easier later, as we'll see in the next post! With a class to represent a token, we also need one to represent a rule. Let's create one now: public class ParserRule { /// <summary> /// A function to call when a reduce operation utilises this rule. /// </summary> public Action MatchingAction; public ParserToken LeftSide; public ParserToken[] RightSideSequence; public ParserRule(Action inMatchingAction, ParserToken inLeftSide, params ParserToken[] inRightSideSequence) { if (inLeftSide.Class != ParserTokenClass.NonTerminal) throw new ArgumentException("Error: The left-hand side must be a non-terminal token type."); MatchingAction = inMatchingAction; LeftSide = inLeftSide; RightSideSequence = inRightSideSequence; } public bool RightSideSequenceMatches(IEnumerable<ParserToken> otherRhs) { int i = 0; foreach (ParserToken nextToken in otherRhs) { if (!nextToken.Equals(RightSideSequence[i])) return false; i++; } return true; } public override string ToString() { StringBuilder result = new StringBuilder(); result.Append($"ParserRule: {LeftSide} = ");
foreach (ParserToken nextToken in RightSideSequence)
result.Append($" {nextToken}"); result.Append(";"); return result.ToString(); } } The above represents a single parser rule, such as <number> ::= <digit> <number>. Here we have the token on the left-hand-side (which we make sure is a non-terminal), and an array of tokens (which can be either terminal or non-terminal) for the right-hand-side. We also have an Action (which is basically a lamba function) that we'll call when we match against the rule, so that we have a place to hook into when we write code that actually does the tree building (not to be confused with the shift-reduce parser itself). Here I also add a method that we'll need later, which compares an array of tokens against the current rule, to see if they match - and we override ToString() here again to aid debugging. Now that we can represent tokens and rules, we can start thinking about representing configurations and states. Not sure what these are? All will be explained in the next post, don't worry :-) For now, A state can be seen as a row in the parse table, and it contains a number of configurations - which are like routes to different other states that the parser decides between, depending where it's gotten to in the token stream. public enum ParseTableAction { Shift, Reduce, Goal, Error } public class ParseTableConfiguration { public readonly ParserRule Rule; public readonly int RhsPosition; public ParseTableAction LinkingAction = ParseTableAction.Error; public ParseTableState LinkingState = null; public ParserToken TokenAfterDot { get { return Rule.RightSideSequence[RhsPosition]; } } public ParserToken TokenBeforeDot { get { return Rule.RightSideSequence[RhsPosition - 1]; } } /// <summary> /// Whether this configuration is the last in the sequence of configurations for the specified rule or not. /// </summary> /// <value><c>true</c> if is last in rule; otherwise, <c>false</c>.</value> public bool IsLastInRule { get { return RhsPosition > Rule.RightSideSequence.Length - 1; } } public ParseTableConfiguration(ParserRule inRule, int inRhsPosition) { Rule = inRule; RhsPosition = inRhsPosition; } public IEnumerable<ParserToken> GetParsedRhs() { return Rule.RightSideSequence.TakeWhile((ParserToken token, int index) => index <= RhsPosition); } public bool MatchesRhsSequence(ParserRule otherRule) { int i = 0; foreach (ParserToken nextToken in otherRule.RightSideSequence) { if (i > RhsPosition) break; if (!nextToken.Equals(otherRule.RightSideSequence[i])) return false; i++; } return true; } public override bool Equals(object obj) { ParseTableConfiguration otherConfig = obj as ParseTableConfiguration; if (otherConfig == null) return false; return Rule == otherConfig.Rule && RhsPosition == otherConfig.RhsPosition; } public override int GetHashCode() { return$"{Rule}:{RhsPosition}".GetHashCode();
}

public override string ToString()
{
StringBuilder result = new StringBuilder();

result.Append($"Configuration: {LinkingAction} "); if (LinkingState != null) result.Append($"to State {LinkingState.Id} ");
result.Append($"{Rule.LeftSide} = "); for (int i = 0; i <= Rule.RightSideSequence.Length; i++) { if (i == RhsPosition) result.Append(" * "); if (i == Rule.RightSideSequence.Length) continue; result.Append($"{Rule.RightSideSequence[i]} ");
}
result.Append(";");
return result.ToString();
}
}

This class is slightly more complicated. First, we define an enum that holds information about what the parser should do if it chooses this configuration. Then, we declare the configuration class itself. This entails specifying which parse rule we're deriving the configuration from, and both which tokens in the right-hand-side of the rule should have been parsed already, and which should still be somewhere in the token stream. Again, I'll explain this in more detail in the next post!

Then, we declare a few utility methods and properties to fetch different parts of the configuration's rule, such as the token to the immediate left and right of the right-hand-side position (which was represented as a dot . in the book I followed), all the tokens before the dot ., and whether a given rule matches this configuration in the basis of everything before the dot ..

Finally, I continue with the trend of overriding the equality checking methods and ToString(), as it makes a world of difference in the code coming up in future blog posts!

Now that we've got a class for configurations, the last one on our list is one for the states themselves. Let's do that now:

public class ParseTableState
{

public int Id {
get {
return ParentTable.FindStateId(this);
}
}

public List<ParseTableConfiguration> Configurations = new List<ParseTableConfiguration>();

public ParseTableState(ParseTable inParentTable)
{
ParentTable = inParentTable;
}

public override string ToString()
{
StringBuilder result = new StringBuilder();
foreach(ParseTableConfiguration nextConfiguration in Configurations)
result.AppendLine($" - {nextConfiguration}"); return result.ToString(); } } Much simpler than the configuration rule class, right? :P As I mentioned earlier, all a state consists of is a list of configurations in that state. In our case, we'll be assigning an id to the states in our parse table, so I include a property here that fetches a state's id from the parent parse table that it's part to make the later code as simple as possible. Still with me? Congratulations! You've got the beginnings of a shift-reduce parser. Next time, we'll expand on some of theory behind the things I've touched on in this post, and possibly look at building the start of the recursive parse table builder itself. Found this interesting? Confused about something? Comment below! ## Markov Chains Part 3: Weighted Chains Recently I remembered that I had all the pieces I need to make a weighted version of the unweighted markov chain I built in part 2 of this series - specifically the weighted random number generator I built shortly after the unweighted markov chain. With this in mind, I decided on a whim to put all the pieces of the puzzle together - and this post is the result! Where to start... hrm. I know. To start with, I needed to perform a minor upgrade tot he WeightedRandom class I had. If you haven't read the original post about it, I'd recommend doing so now. Finished reading that? Great! Lets talk about the changes I've made (they may show up there at the bottom, since I embedded it via GitHub Gist). Firstly, I needed a way to work out if the weighted random number generator was currently empty or not, leading me to add a Count property: public int Count { get { return weights.Count; } } With a count property in place, I also found I was going to need a way to dynamically swap out the weightings of the random number generator without creating a completely new instance - which would end up resetting the Random class instance it was working with, leading to a reduction in the quality in random numbers it uses under high load (see [this article]() for more information on that). To that end, I ended up refactoring the constructor into a pair of methods: SetContents, and a companion method ClearContents. Since the weight calculations happen when the items are first added to the generator, and I'd need to completely recalculate them if another item is added, I wasn't able to emulate the API for another existing class in .NET, such as the List class, as I like to do. Finally, I found later on I needed a way to initialise an empty weighted random generator, so I added a new empty constructor to facilitate that, along with an additional check in the Next() method that throws an InvalidOperationException if the generator is empty and you try to ask it to pick a random item. Here's the updated WeightedRandomNumberGenerator: (Can't see the above? Click here to view it on GitHub directly, or here for the raw code as plain text) With the weighted random number generator updated to properly support the future weighted markov chain, let's get down to the markov chain itself. Firstly, let's create a skeleton that's based on the UnweightedMarkovChain class I wrote in the last post: using System; using System.Collections.Generic; using System.Linq; using MarkovGrams.Utilities; using SBRL.Algorithms; namespace MarkovGrams { /// <summary> /// An unweighted character-based markov chain. /// </summary> public class WeightedMarkovChain { private WeightedRandom<string> wrandom = new WeightedRandom<string>(); /// <summary> /// The ngrams that this markov chain currently contains. /// </summary> Dictionary<string, double> ngrams; /// <summary> /// Creates a new character-based markov chain. /// </summary> /// <param name="inNgrams">The ngrams to populate the new markov chain with.</param> public WeightedMarkovChain(IEnumerable<string> inNgrams); /// <summary> /// Returns a random ngram that's currently loaded into this WeightedMarkovChain. /// </summary> /// <returns>A random ngram from this UnweightMarkovChain's cache of ngrams.</returns> public string RandomNgram(); /// <summary> /// Generates a new random string from the currently stored ngrams. /// </summary> /// <param name="length"> /// The length of ngram to generate. /// Note that this is a target, not a fixed value - e.g. passing 2 when the n-gram order is 3 will /// result in a string of length 3. Also, depending on the current ngrams this markov chain contains, /// it may end up being cut short. /// </param> /// <returns>A new random string.</returns> public string Generate(int length); } } As you can see, it is rather similar to the unweighted version. Fear not however, for the differences will become more apparent shortly. The only real difference so far is the extra private WeightedRandom<string> wrandom declaration at the top of the class. Let's change that though, by filling out the constructor: ngrams = new Dictionary<string, double>(); foreach (string ngram in inNgrams) { if (ngrams.ContainsKey(ngram)) ngrams[ngram]++; else ngrams.Add(ngram, 1); } Here, we read in the raw n-grams and a dictionary that represents the number of times that a given n-gram has been discovered. It's got to be a double there as the type value of the dictionary, as apparently the C♯ compiler isn't clever enough to convert a Dictionary<string, int> to a Dictionary<string, double>. Hrm. Maybe they'll fix that in the future (or if not, does anyone know why not-)? Anyway, let's move on to RandomNgram(). Here it is: if (wrandom.Count == 0) wrandom.SetContents(ngrams); return wrandom.Next(); Quite simple, right? Basically, we populate the weighted random generator if it's currently empty, and then we simply ask it for a random item. We're on a roll, here! Let's keep going with Generate(). Here's the first part: string result = RandomNgram(); string lastNgram = result; while(result.Length < length) { // ...... } Here, we declare an accumulator-like variable result to hold the word we're generating as we construct it, and another one to holdt he last n-gram we picked. We also create a while loop to make sure we keep adding to the word until we reach the desired length (we'll be adding a stop condition just in case we run into a brick wall later). Next, let's put some code inside that while loop. First up is the (re)population of the weighted random number generator: wrandom.ClearContents(); // The substring that the next ngram in the chain needs to start with string nextStartsWith = lastNgram.Substring(1); // Get a list of possible n-grams we could choose from next Dictionary<string, double> convNextNgrams = new Dictionary<string, double>(); ngrams.Where(gram_data => gram_data.Key.StartsWith(nextStartsWith)) .ForEach((KeyValuePair<string, double> ngramData) => convNextNgrams.Add(ngramData.Key, ngramData.Value)); Ah, good ol' Linq to the rescue again! But wait, what's that ForEach() call there? I don't remember that being in core .NET! You'd be right of course, but through the power of [extension methods]() one can extend a class with an additional method that can then be used as if it were an integral part of that class, when in reality that isn't the case! Here's my definition for that ForEach() extension method I used above: public static class LinqExtensions { public static void ForEach<T>(this IEnumerable<T> enumerable, Action<T> action) { foreach (T item in enumerable) { action(item); } } } Next, we need to add that stop condition we talked about earlier before I forget! Here it is: // If there aren't any choices left, we can't exactly keep adding to the new string any more :-( if(convNextNgrams.Count() == 0) break; Observant readers will notice that we haven't actually finished the (re)population process yet, so we should do that next. Once done, we can also obtain a random n-gram from the generator and process it: wrandom.SetContents(convNextNgrams); // Pick a random n-gram from the list string nextNgram = wrandom.Next(); // Add the last character from the n-gram to the string we're building result += nextNgram[nextNgram.Length - 1]; lastNgram = nextNgram; That completes my initial weighted markov chain implementation. Here's the class in full: using System; using System.Collections.Generic; using System.Linq; using MarkovGrams.Utilities; using SBRL.Algorithms; namespace MarkovGrams { /// <summary> /// An unweighted character-based markov chain. /// </summary> public class WeightedMarkovChain { private WeightedRandom<string> wrandom = new WeightedRandom<string>(); /// <summary> /// The ngrams that this markov chain currently contains. /// </summary> Dictionary<string, double> ngrams; /// <summary> /// Creates a new character-based markov chain. /// </summary> /// <param name="inNgrams">The ngrams to populate the new markov chain with.</param> public WeightedMarkovChain(IEnumerable<string> inNgrams) { ngrams = new Dictionary<string, double>(); foreach (string ngram in inNgrams) { if (ngrams.ContainsKey(ngram)) ngrams[ngram]++; else ngrams.Add(ngram, 1); } } /// <summary> /// Returns a random ngram that's currently loaded into this WeightedMarkovChain. /// </summary> /// <returns>A random ngram from this UnweightMarkovChain's cache of ngrams.</returns> public string RandomNgram() { if (wrandom.Count == 0) wrandom.SetContents(ngrams); return wrandom.Next(); } /// <summary> /// Generates a new random string from the currently stored ngrams. /// </summary> /// <param name="length"> /// The length of ngram to generate. /// Note that this is a target, not a fixed value - e.g. passing 2 when the n-gram order is 3 will /// result in a string of length 3. Also, depending on the current ngrams this markov chain contains, /// it may end up being cut short. /// </param> /// <returns>A new random string.</returns> public string Generate(int length) { string result = RandomNgram(); string lastNgram = result; while(result.Length < length) { wrandom.ClearContents(); // The substring that the next ngram in the chain needs to start with string nextStartsWith = lastNgram.Substring(1); // Get a list of possible n-grams we could choose from next Dictionary<string, double> convNextNgrams = new Dictionary<string, double>(); ngrams.Where(gram_data => gram_data.Key.StartsWith(nextStartsWith)) .ForEach((KeyValuePair<string, double> ngramData) => convNextNgrams.Add(ngramData.Key, ngramData.Value)); // If there aren't any choices left, we can't exactly keep adding to the new string any more :-( if(convNextNgrams.Count() == 0) break; wrandom.SetContents(convNextNgrams); // Pick a random n-gram from the list string nextNgram = wrandom.Next(); // Add the last character from the n-gram to the string we're building result += nextNgram[nextNgram.Length - 1]; lastNgram = nextNgram; } wrandom.ClearContents(); return result; } } } You can find it on my private git server here, if you're interested in any future improvements I might have made to it since writing this post. Speaking of which, I've got a few in mind - mainly refactoring both this class and it's unweighted cousin to utilise lists of objects instead of strings. This way, I'll be able to apply it to anything I like - such as sentence generation, music improvisation, and more! I'd also like to extend it such that I can specify the weights manually, giving me even more flexibility as to how I can put the engine to use. (Found a cool use for a Markov Chain? Comment about it below!) ### Sources and Further Reading ## Deterring spammers with a comment key system I recently found myself reimplementing the comment key system I use on this blog (I posted about it here) for a another purpose. Being more experienced now, my new implemention (which I should really put into use on this blog actually) is stand-alone in a separate file - so I'm blogging about it here both to help out anyone who reads this other than myself - and for myself as I know I'll forget otherwise :P The basic algorithm hasn't changed much since I first invented it: take the current timestamp, apply a bunch or arbitrary transformations to it, put it as a hidden field in the comment form, and then reverse the transformations on the comment key the user submits as part of the form to discover how long they had the page loaded for. Bots will have it loaded for either less than 10-ish seconds, or more than 24 hours. Humans will be somewhere in the middle - at least according to my observations! Of course, any determined spammer can easily bypass this system if they spend even a little bit of time analysing the system - but I'm banking on the fact that my blog is too low-value for a spammer to bother reverse-engineering my system to figure out how it works. This time, I chose to use simple XOR encryption, followed by reversing the string, followed by base64 encoding. It should be noted that XOR encryption is certainly not secure - but in this case it doesn't really matter. If my website becomes a high-enough value target for that to matter, I'll investigate proper AES encryption - which will probably be a separate post in and of itself, as a quick look revealed that it's rather involved - and will probably require quite a bit of research and experimentation working correctly. Let's take a look at the key generation function first: function key_generate($pass) {
$new_key = strval(time()); // Repeat the key so that it's long enough to XOR the key with$pass_enc = str_repeat($pass, (strlen($new_key) / strlen($pass)) + 1);$new_key = $new_key ^$pass_enc;
return base64_encode(strrev($new_key)); } As I explained above, this first XORs the timestamp against a provided 'passcode' of sorts, and then it reverses it, base64 encodes it, and then returns it. I discovered that I needed to repeat the passcode to make sure it's at least as long as the timestamp - because otherwise it cuts the timestamp short! Longer passwords are always desirable for certain, but I wanted to make sure I addressed it here - just in case I need to lift this algorithm from here for a future project. Next up is the decoding algorithm, that reverses the transformations we apply above:  function key_decode($key, $pass) {$key_dec = strrev(base64_decode($key)); // Repeat the key so that it's long enough to XOR the key with$pass_dec = str_repeat($pass, (strlen($key_dec) / strlen($pass)) + 1); return intval($key_dec ^ $pass_dec); } Very similar. Again, the XOR passphrase has to be repeated to make it long enough to apply to the whole encoded key without inadvertently chopping some off the end. Additionally, we also convert the timestamp back into an integer - since it is the number of seconds since the last UNIX epoch (1st January 1970 as of the time of typing). With the ability to create and decode keys, let's write a helper method to make the verification process a bit easier:  function key_verify($key, $pass,$min_age, $max_age) {$age = time() - key_decode($key,$pass);
return $age >=$min_age && $age <=$max_age;
}

It's fairly self-explanatory, really. It takes an encoded key, decodes it, and verifies that it's age lies between the specified bounds. Here's the code in full (it updates every time I update the code in the GitHub Gist):

(Above: The full comment key code. Can't see it? Check it out on GitHub Gist here.)

## A comparison of compression formats for storing JSON

Happy new year, everyone!

I've blogged about different aspects of a (not so?) little project of mine several times now (exhibits A, B, C, D, and finally E - even if I didn't know it at the time), but it appears that I end up running into all sorts of interesting problems that I invent cool solutions for. I also find myself doing a bunch of research that I'm surprised nobody's compiled into a single place yet - as is the case in this blog post. (Also, Happy 2018 everyone! First post of the year :D)

I've been refactoring the subsystem that saves a considerable amount of JSON data to a bunch of different files on disk. Obviously, I'm interested in minimising the amount of space this JSON data takes up on disk. As this saving process happens in the background on a separate thread, I'm not too concerned about performance - other than it can't be too slow. With this in mind, I've found myself testing a bunch of different compression algorithms. Let's introduce our test data:

1. A 17MiB card game dataset, as a single minified JSON file
2. A 40KiB 'live' specimen chunk's data, saved as a single minified JSON file.

I can't remember which card game the first dataset is from, but I do know that I found it in the awesome JSON datasets list. Next up, here's our cast of compression algorithms we'll be testing:

1. The venerable GZip
2. BZip2 - Apparently GZIP, but smaller and slightly more computationally expensive
3. XZ (the newer child of LZMA2)
4. 7zip

A colourful cast, to be sure! Let's run them through their paces - starting with the card game dataset. Here are the results I observe with each set to their default settings:

Format Size
Uncompressed 17M
gzip 2.4M
bzip2 1.6M
xz 1.3M
7zip 1.4M
brotli 1.3M

Very interesting. It looks like xz and brotli are tied in first place - though I observed that brotli took ages in comparison to all the other algorithms I tested - and upon closer inspection xz beat it by 17.3KiB. Numbers are all very well, but to really see what's going on here, let's plot it on a graph:

That's better! I can actually make some comparisons now. From this graph we can observe that gzip is the worst of the lot, followed by bzip2. 7zip is surprisingly in third place, but then again it is designed for multiple files, whereas the rest of them are designed for a single stream of data. In second place is the terribly slow brotli, and finally in first place is xz.

Hrm - very interesting. How do our algorithms measure up when confronted a smaller load though? Here are the results for the sample chunk data:

Format Size
Uncompressed 40K
gzip 5.4K
bzip2 4.3K
xz 4.6K
7zip 4.9K
Brotli 4.6K

Interesting results, to be sure, but I can't discern much from that. Let's plot a graph:

Very interesting. With smaller loads, it appears that bzip2 performs much better with smaller loads than any other algorithm. While gzip is still the worst performing algorithm, while xz and brotli, surprisingly, performed much worse than bzip2.

To that end, I'm think I'm going to be choosing bzip2 as my compression of choice for this job, as it produces the best results for the type of work I'm going to be doing.

I'm really surprised about brotli though, actually. I had high hopes for it, considering it's a new algorithm invented by Google. They claimed that it would provide xz-like compression with gzip-like speeds - but from what I'm seeing, it does anything but.

## Line Simplification: Visvalingam's Algorithm

(Above: A screenshot of the demo of my implementation of Visvalingam's line simplification algorithm. Link below!)

For a secret project of mine I've been working on since about February time (if I recall correctly), I've discovered that I could make some considerable use of a line simplification algorithm. The tricky thing is though that I need an implementation in both Javascript and C♯ - which will both return identical results.

Initially, I chose the Ramer-Douglas-Peucker Algorithm, but I ended up implementing Visvalingam's Algorithm instead, as I encountered issues with calculating the shortest distance from a point to a line reliably along with other algorithmic problems that I determined weren't worth the time to fix.

Visvalingam's algorithm is actually really simple. Suppose we take a line:

If we create a sliding window with a width of 3 and slide it along the list of points, then we get a set of triangles. To simplify the line, we can calculate the area of each of these triangles, and remove the centre point of the triangle with the smallest area.

Then we can continue removing the centre point of the smallest triangle until we reach a triangle with an area that's above a threshold we set - and this is Visvalingam's Algorithm.

Though I haven't written the C♯ version yet, I've completed the Javascript implementation - and created a demo for you to play around with! Here's a link:

Visvalingam's Algorithm Demo

Note that you'll need to enable ES6 Module support in your browser to get it to work, as I've used ES6 Modules whilst building it.

In Firefox this can be done by setting dom.moduleScripts.enabled to true in about:config, and in chrome by visiting chrome://flags/#enable-javascript-harmony (sorry, hyperlinks don't work for chrome:// urls IIRC!), enabling it, and restarting your browser.

It's open-source, of course - under the Mozilla Public License 2.0. You can find my code on GitHub - and pull requests are welcome :D

Finally, I've released it as an npm package. If you aren't aware of npm, it's really cool. It's the primary package manager for Javascript - I've written a blog post on this here.

Once I've written the C♯ version I'll have another bash at trying to get Nuget to package it. I think I know what the issue has been so far - so hopefully it works this time! If it does I'll blog about that too.

Found this useful? Think it's cool? Let me know in the comments below!

## Weekend Challenge: Detecting and Decoding Morse Code in an Audio File

Recently I received a message in morse code from a family member using this site. It said that the sender had hidden the message, so I was presented with 2 options: I could sit and decode the message by listening to it over and over again, or write a program to do it for me.

Naturally, as a computer science student and enthusiast, I chose the second option. My first problem: Capture a recording of the target morse code. This was easy - the audio-recorder package in the ubuntu repositories solved that one easily, as it has an option to record the audio output of my laptop.

Second problem: Figure out how to read the recording in and extract the samples in C♯. This one wasn't so easy. Amidst issues with flatpak and Monodevelop (flatpak is terrible!), I eventually found the NAudio (Codeplex, GitHub, NuGet) package to do the job. After some digging, I discovered that NAudio is actually really powerful! It's got some pretty advanced functions for handling audio that I'll have to explore at a later date.

Anyway, with a plan of action I set to work. - decided to work in reverse, so the first thing I needed was a chart that converted morse code into the latin alphabet. Wikipedia to the rescue:

With a handy-dandy conversion chart, it was relatively simple to create a class to handle the conversion from dots and dashes to the latin alphabet automatically:

using System;
using System.Collections.Generic;

namespace SBRL.Algorithms.MorseCodeTranslator
{
/// <summary>
/// A simple class to translate a morse code string into a normal string.
/// </summary>
/// <origin></origin>
/// <author>Starbeamrainbowlabs (https://starbeamrainbowlabs.com/)</author>
/// <changelog>
/// v0.1 - 26th May 2017:
///      - Creation! 😁
/// </changelog>
public static class MorseDecoder
{
/// <summary>
/// The morse code lookup table. Use the methods in this class is possible,
/// rather than accessing this lookup table directly!
/// </summary>
public static Dictionary<string, char> morseCodeLookup = new Dictionary<string, char>()
{
[".-"] = 'a',
["-..."] = 'b',
["-.-."] = 'c',
["-.."] = 'd',
["."] = 'e',
["..-."] = 'f',
["--."] = 'g',
["...."] = 'h',
[".."] = 'i',
[".---"] = 'j',
["-.-"] = 'k',
[".-.."] = 'l',
["--"] = 'm',
["-."] = 'n',
["---"] = 'o',
[".--."] = 'p',
["--.-"] = 'q',
[".-."] = 'r',
["..."] = 's',
["-"] = 't',
["..-"] = 'u',
["...-"] = 'v',
[".--"] = 'w',
["-..-"] = 'x',
["-.--"] = 'y',
["--.."] = 'z',
[".----"] = '1',
["..---"] = '2',
["...--"] = '3',
["....-"] = '4',
["....."] = '5',
["-...."] = '6',
["--..."] = '7',
["---.."] = '8',
["----."] = '9',
["-----"] = '0',
};

/// <summary>
/// Translates a single letter from morse code.
/// </summary>
/// <param name="morseSource">The morse code to translate.</param>
/// <returns>The translated letter.</returns>
public static char TranslateLetter(string morseSource)
{
return morseCodeLookup[morseSource.Trim()];
}

/// <summary>
/// Translates a string of space-separated morse code strings from morse code.
/// </summary>
/// <param name="morseSource">The morse code to translate.</param>
/// <returns>The translated word.</returns>
public static string TranslateWord(string morseSource)
{
string result = string.Empty;

string[] morseLetters = morseSource.Split(" ".ToCharArray());

foreach(string morseLetter in morseLetters)
result += TranslateLetter(morseLetter);

return result;
}

/// <summary>
/// Translates a list of morse-encoded words.
/// </summary>
/// <param name="morseSources">The morse-encoded words to decipher.</param>
/// <returns>The decoded text.</returns>
public static string TranslateText(IEnumerable<string> morseSources)
{
string result = string.Empty;
foreach(string morseSource in morseSources)
result += $"{TranslateWord(morseSource)} "; return result.Trim(); } } } That was easy! The next challenge to tackle was considerably more challenging though: Read in the audio file and analyse the samples. I came up with that I think is a rather ingenious design. It's best explained with a diagram: 1. Read the raw samples into a buffer. If there isn't enough space to hold it all at once, then we handle it in chunks. 2. Move a sliding-window along the raw buffer, with a width of 100 samples and sliding along 25 samples at a time. Extracts the maximum value from the window each time and places it in the windowed buffer. 3. Analyse the windowed buffer and extract context-free tokens that mark the start or end of a tone. 4. Convert the context-free tokens into ones that hold the starting point and length of the tones. 5. Analyse the contextual tokens to extract the morse code as a string 6. Decipher the morse code string It's a pretty complicated problem when you first think about it, but breaking it down into steps as I did in the above diagram really helps in figuring out how you're going to tackle it. I, however, ended up drawing the diagram after Id finished writing the program.... I appear to find it easy to break things down in my head - it's only when it gets too big to remember all at once or if I'm working with someone else that I draw diagrams :P Having drawn up an algorithm and 6 steps I needed to follow to create the program, I spent a happy afternoon writing some C♯. While the remainder of the algorithm is not too long (only ~202 lines), it's a bit too long to explain bit by bit here. I have uploaded the full program to a repository on my personal git server, which you can find here: sbrl/AudioMorseDecoder. If you're confused about any part of it, ask away in the comments below! Binaries available on request. I'll leave you with a pair of challenging messages of my own to decode. Try not to use my decoder - write your own! Message A (easy), Message B (hard) (hard message generated with cwwav) ## Let's build a weighted random number generator! Ever wondered how random loot in a dungeon is generated? Or how the rooms in a procedurally generated castle might be picked? Perhaps you need to skew the number of times an apple is picked by your game engine over a banana. If you've considered any of these things, then you want a weighted random number generator. In this post, I'll be showing you how I built one, and how you can build one too. If you're interested in trying to build one for yourself first though, then look away now! Come back when you're done (or stuck) to see my solution. To start with, let's consider what a weighted random number generator actually is. Let's say we've got 3 rewards for a treasure chest: a cool-looking shield, a health potion, and a fancy ring. We want to give the player 1 of the 3 when they option the chest, making sure that the health potion is more common than the others. We can represent that as a ratio:$3 : 4 : 3$. (Above: The ratio between the different items. See below for the explanation of the math!). In order to pick one of the 3 items using the ratio, we need to normalise the ratio so that it's between$0$and$1$. That's rather easy, as far as maths goes: All we have to do is convert each part of the ratio into a fraction, and that into a decimal. Let's calculate the denominator of the fraction first. That's easy-peasy too - we just add up all the parts of the ratio, as we want to represent each part as a fraction of a whole:$3 + 4 + 3 = 10$. With our denominator sorted, we can convert each part into a fraction: $$\frac{3}{10} + \frac{4}{10} + \frac{3}{10} = 1$$ Fractions are nice, but it's be better to have that as a decimal: $$0.3 + 0.4 + 0.3 = 10$$ That's much better. Now, with the initial theory out of the way, let's start writing a class for it. using System; using System.Collections.Generic; using System.Linq; namespace SBRL.Algorithms { public class WeightedRandom<ItemType> { protected Random rand = new Random(); protected Dictionary<double, ItemType> weights = new Dictionary<double, ItemType>(); /// <summary> /// Creates a new weighted random number generator. /// </summary> /// <param name="items">The dictionary of weights and their corresponding items.</param> public WeightedRandom(IDictionary<double, ItemType> items) { if(items.Count == 0) throw new ArgumentException("Error: The items dictionary provided is empty!"); double totalWeight = items.Keys.Aggregate((double a, double b) => a + b); foreach(KeyValuePair<double, ItemType> itemData in items) weights.Add(itemData.Key / totalWeight, itemData.Value); } } } I've created a template class here, to allow the caller to provide us with any type of item (so long as they are all the same). That's what the <ItemType> bit is on the end of the class name - it's the same syntax behind the List class: List<TreasureReward> rewards = new List<TreasureReward>() { TreasureReward.FromFile("./treasure/coolsword.txt"), TreasureReward.FromFile("./treasure/healthpotion.txt"), TreasureReward.FromFile("./treasure/fancyring.txt"), }; Next, let's go through that constructor bit by bit. First, we make sure that we actually have some weights in the first place: if(items.Count == 0) throw new ArgumentException("Error: The items dictionary provided is empty!"); Then, it's more Linq to the rescue in calculating the total of the weights we've been provided with: double totalWeight = items.Keys.Aggregate((double a, double b) => a + b); Finally, we loop over each of the items in the provided dictionary, dividing them by the sum of the weights and adding them to our internal dictionary of normalised weights. foreach(KeyValuePair<double, ItemType> itemData in items) weights.Add(itemData.Key / totalWeight, itemData.Value); Now that we've got our items loaded and the weights normalised, we can start picking things from our dictionary. For this part, I devised a sort of 'sliding window' algorithm to work out which item to pick. It's best explained through a series of whiteboard images: Basically, I have 2 variables: lower and higher. When I loop over each of the weights, I do the following things: 1. Add the current normalised weight to higher 2. Check if the target is between lower and higher a. If it is, then return the current item b. If not, then keep going 3. Bring lower up to the same value as higher 4. Loop around again until we find the weight in which the target lies. With that in mind, here's the code I cooked up: /// <summary> /// Picks a new random item from the list provided at initialisation, based /// on the weights assigned to them. /// </summary> /// <returns>A random item, picked according to the assigned weights.</returns> public ItemType Next() { double target = rand.NextDouble(); double lower = 0; double higher = 0; foreach(KeyValuePair<double, ItemType> weightData in weights) { higher += weightData.Key; if(target >= lower && target <= higher) return weightData.Value; lower += weightData.Key; } throw new Exception($"Error: Unable to find the weight that matches {target}");
}

That pretty much completes the class. While it seems daunting at first, it's actually quite easy once you get your head around it. Personally, I find whiteboards very useful in that regard! Here's the completed class:

Found this interesting? Got stuck? Have a suggestion for another cool algorithm I could implement? Comment below!

## Markov Chains Part 2: Unweighted Chains

Hello and welcome to the second part of this mini-series about markov chains. In the last part, I explained what an n-gram was, and how I went about generating them.

In this part, I'll get to the meat of the subject: The markov chain itself. To start with (to simplify matters) I'll be looking at unweighted markov chains.

A markov chain, in essence, takes the n-grams we generated last time, and picks one to start with. It then takes the all but the first character of the n-gram it chose, and finds all the n-grams in it's library that begin with that sequence of characters. After drawing up a list of suitable n-grams, it picks one at random, and tacks the last character in the n-gram it chose onto the end of the first n-gram.

Then, it starts the whole process all over again with the 2nd n-gram it chose, and then the 3rd, and so on until it either a) hits a brick wall and can't find any suitable n-grams to use next, or b) reaches the desired length of word it was asked to generate.

An unweighted markov chain, as I call it, does not take the frequency of the source n-grams in the original text into account - it just picks the next n-gram from the list randomly.

With explanations and introductions out of the way, let's get down to some code! Since the markov chain is slightly more complicated, I decided to write a class for it. Let's start with one of those, then:

using System;
using System.Collections.Generic;
using System.Linq;

namespace SBRL.Algorithms.MarkovGrams
{
/// <summary>
/// An unweighted character-based markov chain.
/// </summary>
public class UnweightedMarkovChain
{

}
}


I've also added a few using statements for later. Our new class is looking a bit bare. how about some methods to liven it up a bit?

/// <summary>
/// Creates a new character-based markov chain.
/// </summary>
/// <param name="inNgrams">The ngrams to populate the new markov chain with.</param>
public UnweightedMarkovChain(IEnumerable<string> inNgrams)
{

}

/// <summary>
/// Returns a random ngram that's currently loaded into this UnweightedMarkovChain.
/// </summary>
/// <returns>A random ngram from this UnweightMarkovChain's cache of ngrams.</returns>
public string RandomNgram()
{

}

/// <summary>
/// Generates a new random string from the currently stored ngrams.
/// </summary>
/// <param name="length">
/// The length of ngram to generate.
/// Note that this is a target, not a fixed value - e.g. passing 2 when the n-gram order is 3 will
/// result in a string of length 3. Also, depending on the current ngrams this markov chain contains,
/// it may end up being cut short.
/// </param>
/// <returns>A new random string.</returns>
public string Generate(int length)
{

}

That's much better. Let's keep going - this time with some member variables:

/// <summary>
/// The random number generator
/// </summary>
Random rand = new Random();

/// <summary>
/// The ngrams that this markov chain currently contains.
/// </summary>
List<string> ngrams;

We'll need that random number generator later! As for the List<string>, we'll be using that to store our n-grams - but you probably figured that one out for yourself :P

The class isn't looking completely bare anymore, but we can still do something about those methods. Let's start with that constructor:

public UnweightedMarkovChain(IEnumerable<string> inNgrams)
{
ngrams = new List<string>(inNgrams);
}

Easy peasy! It just turns the IEnumerable<string> into a List<string> and stores it. Let's do another one:

public string RandomNgram()
{
return ngrams[rand.Next(0, ngrams.Count)];
}

We're on a roll here! This is another fairly simple method - it just picks a random n-gram from the dictionary. We'll need this for our 3rd, and most important, method, Generate(). This one's a bit more complicated, so let's take it in a few stages. Firstly, we need an n-gram to start the whole thing off. We also need to return it at the end of the method.

string result = RandomNgram();

return result;

While we're at it, we'll also need a variable to keep track of the last n-gram in the chain, so we can find an appropriate match to come next.

string lastNgram = result;

Then we'll need a loop to keep adding n-grams to the chain. Since we're not entirely sure how long we'll be looping for (and we've got fairly complicated stop conditions, as far as that kind of thing goes), I decided to use a while loop here.

while(result.Length < length)
{

}

That's the first of our 2 stop conditions in place, too! We want to stop when the word we're working on reaches it's desired length. Now, we can write the bit that works out which n-gram should come next! This bit goes inside the while loop we created above (as you might suspect). First, let's fetch a list of n-grams that would actually make sense coming next.

// The substring that the next ngram in the chain needs to start with
string nextStartsWith = lastNgram.Substring(1);
// Get a list of possible n-grams we could choose from next
List<string> nextNgrams = ngrams.FindAll(gram => gram.StartsWith(nextStartsWith));

With a bit of Linq (Language-INtrgrated Query), that isn't too tough :-) If you haven't seen linq before, then I'd highly recommend you check it out! It makes sorting and searching datasets much easier. The above is quite simple - I just filter our list of n-grams through a function that extracts all the ones that start with the appropriate letter.

It's at this point that we can insert the second of our two stopping conditions. If there aren't any possible n-grams to pick from, then we can't continue.

// If there aren't any choices left, we can't exactly keep adding to the new string any more :-(
if(nextNgrams.Count == 0)
break;

With our list of possible n-grams, we're now in a position to pick one at random to add to the word. It's LINQ to the rescue again:

// Pick a random n-gram from the list
string nextNgram = nextNgrams.ElementAt(rand.Next(0, nextNgrams.Count));

This is another simple one - it just extract the element in the list at a random location in the list. In hindsight I could have used the array operator syntax here ([]), but it doesn't really matter :-)

Now that we've picked the next n-gram, we can add it to the word we're building:

// Add the last character from the n-gram to the string we're building
result += nextNgram[nextNgram.Length - 1];

and that's the markov chain practically done! Oh, we mustn't forget to update the lastNgram variable (I forgot this when building it :P):

lastNgram = nextNgram;

And that wraps up our unweighted markov chain. Here's the whole class in full:

using System;
using System.Collections.Generic;
using System.Linq;

namespace SBRL.Algorithms.MarkovGrams
{
/// <summary>
/// An unweighted character-based markov chain.
/// </summary>
public class UnweightedMarkovChain
{
/// <summary>
/// The random number generator
/// </summary>
Random rand = new Random();

/// <summary>
/// The ngrams that this markov chain currently contains.
/// </summary>
List<string> ngrams;

/// <summary>
/// Creates a new character-based markov chain.
/// </summary>
/// <param name="inNgrams">The ngrams to populate the new markov chain with.</param>
public UnweightedMarkovChain(IEnumerable<string> inNgrams)
{
ngrams = new List<string>(inNgrams);
}

/// <summary>
/// Returns a random ngram that's currently loaded into this UnweightedMarkovChain.
/// </summary>
/// <returns>A random ngram from this UnweightMarkovChain's cache of ngrams.</returns>
public string RandomNgram()
{
return ngrams[rand.Next(0, ngrams.Count)];
}

/// <summary>
/// Generates a new random string from the currently stored ngrams.
/// </summary>
/// <param name="length">
/// The length of ngram to generate.
/// Note that this is a target, not a fixed value - e.g. passing 2 when the n-gram order is 3 will
/// result in a string of length 3. Also, depending on the current ngrams this markov chain contains,
/// it may end up being cut short.
/// </param>
/// <returns>A new random string.</returns>
public string Generate(int length)
{
string result = RandomNgram();
string lastNgram = result;
while(result.Length < length)
{
// The substring that the next ngram in the chain needs to start with
string nextStartsWith = lastNgram.Substring(1);
// Get a list of possible n-grams we could choose from next
List<string> nextNgrams = ngrams.FindAll(gram => gram.StartsWith(nextStartsWith));
// If there aren't any choices left, we can't exactly keep adding to the new string any more :-(
if(nextNgrams.Count == 0)
break;
// Pick a random n-gram from the list
string nextNgram = nextNgrams.ElementAt(rand.Next(0, nextNgrams.Count));
// Add the last character from the n-gram to the string we're building
result += nextNgram[nextNgram.Length - 1];
lastNgram = nextNgram;
}

return result;
}
}
}

I've released the full code for my markov generator (with a complete command line interface!) on my personal git server. The repository can be found here: sbrl/MarkovGrams. To finish this post off, I'll leave you with a few more words that I've generated using it :D

1 2 3 4 5
mecuc uipes jeraq acrin nnvit
blerbopt drsacoqu yphortag roirrcai elurucon
pnsemophiqub omuayplisshi udaisponctec mocaltepraua rcyptheticys
eoigemmmpntartrc rattismemaxthotr hoaxtancurextudu rrgtryseumaqutrc hrpiniglucurutaj

## Markov Chains Part 1: N-Grams

After wanting to create a markov chain to generate random words for ages, I've recently had the time to actually write one :D Since I had a lot of fun writing it, I thought I'd share it here.

A markov chain, in simple terms, is an algorithm to take a bunch of input, and generate a virtually unlimited amount of output in the style of the input. If I put my 166 strong wordlist of sciencey words through a markov chain, I get a bunch of words like this:

a b c d
raccession bstrolaneu aticl lonicretiv
ssignatten attrotemic surspertiv tecommultr
ndui coiseceivi horinversp icreflerat
landargeog eograuxila omplecessu ginverceng
evertionde chartianua spliqui ydritangt
grajecubst ngintagorp ombintrepe mbithretec
trounicabl ombitagnai ccensorbit holialinai
cessurspec dui mperaneuma yptintivid
ectru llatividet imaccellat siondl
tru coo treptinver gnatiartia
nictrivide pneumagori entansplan uatellonic

Obviously, some of the above aren't particularly readable, but the majority are ok (I could do with a longer input wordlist, I think).

To create our very own markov chain that can output words like the above, we need 2 parts: An n-gram generator, to take in the word list and convert it into a form that we can feed into the second part - the markov chain itself. In this post, I'm going to just look at the n-gram generator - I'll cover the markov chain itself in the second part of this mini-series.

An n-gram is best explained by example. Take the word refractive, for example. Let's split it up into chunks:

ref
efr
fra
rac
act
cti
tiv
ive

See what I've done? I've taken the original word and split it into chunks of 3, but I've only moved along the word by 1 character at a time, so some characters have been duplicated. These are n-grams of order 3. The order, in the case of an n-gram, is the number of characters per chunk. We could use any order we like:

refra
efrac
fract
racti
activ
ctive

The order of the above is 5. If you're wondering how this could possibly be useful - don't worry: All will be explained in due time :-) For now though, writing all these n-grams out manually is rather annoying and tedious. Let's write some code!

Generating n-grams from a single word like we did above is actually pretty simple. Here's what I came up with:

/// <summary>
/// Generates a unique list of n-grams from the given string.
/// </summary>
/// <param name="str">The string to n-gram-ise.</param>
/// <param name="order">The order of n-gram to generate.</param>
/// <returns>A unique list of n-grams found in the specified string.</returns>
public static IEnumerable<string> GenerateFlat(string str, int order)
{
List<string> results = new List<string>();
for(int i = 0; i < str.Length - order; i++)
{
}
return results.Distinct();
}

I'm using C♯ here, but you can use whatever language you like. Basically, I enter a loop and crawl along the word, adding the n-grams I find to a list, which I then de-duplicate and return.

Generating n-grams for just one word is nice, but we need to process a whole bunch of words. Thankfully, that's easy to automate too with a sneaky overload:

/// <summary>
/// Generates a unique list of n-grams that the given list of words.
/// </summary>
/// <param name="words">The words to turn into n-grams.</param>
/// <param name="order">The order of n-gram to generate..</param>
/// <returns>A unique list of n-grams found in the given list of words.</returns>
public static IEnumerable<string> GenerateFlat(IEnumerable<string> words, int order)
{
List<string> results = new List<string>();
foreach(string word in words)
{
}
return results.Distinct();
}

All the above does is take a list of words, run them all through the n-gram generation method we wrote above and return the de-duplicated results. Here's a few that it generated from the same wordlist I used above in order 3:

1 2 3 4 5 6 7 8
hor sig ign gna str tre ren ngt
sol old lde oli sor sou oun tel
lla sub ubs bst tem emp mpe atu
tur err ert thr hre dim ime men
nsi ack cki kin raj aje jec tor
ans nsa sat nsf sfe nsl sla slu
luc uce nsm smi nsp are nsu tan

Next time, I'll show you my (unweighted) markov chain I've written that uses the n-grams generated by these methods.