Shift-reduce Parser Part 1: First Steps

Now that I've done the Languages and Compilers module at University, it's opened my eyes to a much better and more extensible way of handling complex text in a way that can easily be read by any subsequent code I write. To that end, I've found that at least 3 different various projects of mine could benefit from the inclusion of a shift-reduce parser, but I haven't been able to track one down for C♯ yet.

With this in mind, I thought to myself: "Why not build one myself?" Sure, my Lecturer didn't go into too many details as to how they work, but it can't be that difficult, can it? Oh, I had no idea.....

In this mini-series, I'm going to take you through the process of building a shift-reduce parser of your very own. As I write this, I haven't actually finished mine yet - I've just got to the important milestone of building a parse table! Thankfully, that's going to be a few posts away, as there's a fair amount of ground to cover until we get to that point.

Warning: This series is not for the faint of heart! It's rather complicated, and extremely abstract - making it difficult to get your head around. I've had great difficulty getting mine around it - and ended up writing it in multiple stages. If you want to follow along, be prepared for lots of research, theory, and preparation!

Let's start out by taking a look at what a shift-reduce parser does. If you haven't already, I'd recommend reading my previous compilers 101 post, which explains how to write a compiler, and the different stages involved. I'd also recommend checking out my earlier post on building a lexer, as it ties in nicely with the shift-reduce parser that we'll be building.

An overview of how a shift-reduce works.

In short, a shift-reduce parser compiles a set of BNF-style rules into a Parse Table, which it then utilises as a sort of state-machine when parsing a stream on input tokens. We'll take a look at this table compilation process in a future blog post. In this post, let's set up some data structures to help us along when we get to the compilation process in the next blog post. Here's the class structure we'll be going for:

An overview of the class structure we'll be creating in this blog post.

Let's start with a class to represent a single token in a rule:

public enum ParserTokenClass
{
    Terminal,
    NonTerminal
}

public struct ParserToken
{
    public readonly ParserTokenClass Class;
    public readonly string Type;

    public ParserToken(ParserTokenClass inTokenType, string inType)
    {
        Class = inTokenType;
        Type = inType;
    }

    public override bool Equals(object obj)
    {
        ParserToken otherTokenType = (ParserToken)obj;
        return Class == otherTokenType.Class && Type == otherTokenType.Type;
    }
    public override int GetHashCode()
    {
        return $"{Class}:{Type}".GetHashCode();
    }

    public override string ToString()
    {
        string terminalDisplay = Class == ParserTokenClass.Terminal ? "T" : "NT";
        return $"[ParserToken {terminalDisplay}: {Type}]";
    }

    public static ParserToken NonTerm(string inType)
    {
        return new ParserToken(ParserTokenClass.NonTerminal, inType);
    }
    public static ParserToken Term(string inType)
    {
        return new ParserToken(ParserTokenClass.Terminal, inType);
    }
}

Pretty simple! A token in a rule can either be a terminal (basically a token straight from the lexer), or a non-terminal (a token that the parser reduces a set of other tokens into), and has a type - which we represent as a string. Unfortunately due to the complex comparisons we'll be doing later, it's a huge hassle to use an enum with a template class as I did in the lexer I built that I linked to earlier.

Later on (once we've built the parse table), we'll extend this class to support attaching values and other such pieces of information to it, but for now we'll leave that out to aid simplicity.

I also override Equals() and GetHashCode() in order to make comparing tokens easier later on. Overriding ToString() makes the debugging process much easier later, as we'll see in the next post!

With a class to represent a token, we also need one to represent a rule. Let's create one now:

public class ParserRule
{
    /// <summary>
    /// A function to call when a reduce operation utilises this rule.
    /// </summary>
    public Action MatchingAction;
    public ParserToken LeftSide;
    public ParserToken[] RightSideSequence;

    public ParserRule(Action inMatchingAction, ParserToken inLeftSide, params ParserToken[] inRightSideSequence)
    {
        if (inLeftSide.Class != ParserTokenClass.NonTerminal)
            throw new ArgumentException("Error: The left-hand side must be a non-terminal token type.");

        MatchingAction = inMatchingAction;
        LeftSide = inLeftSide;
        RightSideSequence = inRightSideSequence;
    }

    public bool RightSideSequenceMatches(IEnumerable<ParserToken> otherRhs)
    {
        int i = 0;
        foreach (ParserToken nextToken in otherRhs)
        {
            if (!nextToken.Equals(RightSideSequence[i]))
                return false;

           i++;
        }
        return true;
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();
        result.Append($"ParserRule: {LeftSide} = ");
        foreach (ParserToken nextToken in RightSideSequence)
            result.Append($" {nextToken}");
        result.Append(";");
        return result.ToString();
    }
}

The above represents a single parser rule, such as <number> ::= <digit> <number>. Here we have the token on the left-hand-side (which we make sure is a non-terminal), and an array of tokens (which can be either terminal or non-terminal) for the right-hand-side. We also have an Action (which is basically a lamba function) that we'll call when we match against the rule, so that we have a place to hook into when we write code that actually does the tree building (not to be confused with the shift-reduce parser itself).

Here I also add a method that we'll need later, which compares an array of tokens against the current rule, to see if they match - and we override ToString() here again to aid debugging.

Now that we can represent tokens and rules, we can start thinking about representing configurations and states. Not sure what these are? All will be explained in the next post, don't worry :-) For now, A state can be seen as a row in the parse table, and it contains a number of configurations - which are like routes to different other states that the parser decides between, depending where it's gotten to in the token stream.

public enum ParseTableAction
{
    Shift,
    Reduce,
    Goal,
    Error
}

public class ParseTableConfiguration
{
    public readonly ParserRule Rule;
    public readonly int RhsPosition;

    public ParseTableAction LinkingAction = ParseTableAction.Error;
    public ParseTableState LinkingState = null;

    public ParserToken TokenAfterDot {
        get {
            return Rule.RightSideSequence[RhsPosition];
        }
    }
    public ParserToken TokenBeforeDot {
        get {
            return Rule.RightSideSequence[RhsPosition - 1];
        }
    }

    /// <summary>
    /// Whether this configuration is the last in the sequence of configurations for the specified rule or not.
    /// </summary>
    /// <value><c>true</c> if is last in rule; otherwise, <c>false</c>.</value>
    public bool IsLastInRule {
        get {
            return RhsPosition > Rule.RightSideSequence.Length - 1;
        }
    }

    public ParseTableConfiguration(ParserRule inRule, int inRhsPosition)
    {
        Rule = inRule;
        RhsPosition = inRhsPosition;
    }

    public IEnumerable<ParserToken> GetParsedRhs()
    {
        return Rule.RightSideSequence.TakeWhile((ParserToken token, int index) => index <= RhsPosition);
    }

    public bool MatchesRhsSequence(ParserRule otherRule)
    {
        int i = 0;
        foreach (ParserToken nextToken in otherRule.RightSideSequence)
        {
            if (i > RhsPosition)
                break;

            if (!nextToken.Equals(otherRule.RightSideSequence[i]))
                return false;

            i++;
        }
        return true;
    }

    public override bool Equals(object obj)
    {
        ParseTableConfiguration otherConfig = obj as ParseTableConfiguration;
        if (otherConfig == null) return false;
        return Rule == otherConfig.Rule && RhsPosition == otherConfig.RhsPosition;
    }
    public override int GetHashCode()
    {
        return $"{Rule}:{RhsPosition}".GetHashCode();
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();

        result.Append($"Configuration: {LinkingAction} ");
        if (LinkingState != null)
            result.Append($"to State {LinkingState.Id} ");
        result.Append($"{Rule.LeftSide} = ");

        for (int i = 0; i <= Rule.RightSideSequence.Length; i++)
        {
            if (i == RhsPosition)
                result.Append(" * ");
            if (i == Rule.RightSideSequence.Length)
                continue;
            result.Append($"{Rule.RightSideSequence[i]} ");
        }
        result.Append(";");
        return result.ToString();
    }
}

This class is slightly more complicated. First, we define an enum that holds information about what the parser should do if it chooses this configuration. Then, we declare the configuration class itself. This entails specifying which parse rule we're deriving the configuration from, and both which tokens in the right-hand-side of the rule should have been parsed already, and which should still be somewhere in the token stream. Again, I'll explain this in more detail in the next post!

Then, we declare a few utility methods and properties to fetch different parts of the configuration's rule, such as the token to the immediate left and right of the right-hand-side position (which was represented as a dot . in the book I followed), all the tokens before the dot ., and whether a given rule matches this configuration in the basis of everything before the dot ..

Finally, I continue with the trend of overriding the equality checking methods and ToString(), as it makes a world of difference in the code coming up in future blog posts!

Now that we've got a class for configurations, the last one on our list is one for the states themselves. Let's do that now:

public class ParseTableState
{
    public readonly ParseTable ParentTable;

    public int Id {
        get {
            return ParentTable.FindStateId(this);
        }
    }

    public List<ParseTableConfiguration> Configurations = new List<ParseTableConfiguration>();

    public ParseTableState(ParseTable inParentTable)
    {
        ParentTable = inParentTable;
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();
        foreach(ParseTableConfiguration nextConfiguration in Configurations)
            result.AppendLine($"     - {nextConfiguration}");
        return result.ToString();
    }
}

Much simpler than the configuration rule class, right? :P As I mentioned earlier, all a state consists of is a list of configurations in that state. In our case, we'll be assigning an id to the states in our parse table, so I include a property here that fetches a state's id from the parent parse table that it's part to make the later code as simple as possible.

Still with me? Congratulations! You've got the beginnings of a shift-reduce parser. Next time, we'll expand on some of theory behind the things I've touched on in this post, and possibly look at building the start of the recursive parse table builder itself.

Found this interesting? Confused about something? Comment below!

Type this	To get this	Notes
`bold text`	bold text	-
`_italics text_`	italics text	-
`~~deleted text~~`	~~deleted~~	-
`code text`	`code text`	Inserts some monospaced code. It is preferred that large blocks of code are linked to using a service such as Pastebin, Github Gists or Ideone.
`> Quote`	Quote	-
`[display text](//google.com)`	display text	Inserts a hyperlink. Please use responsibly. `[rel=nofollow]` is in use and spam will be deleted.
`---`		Inserts a horizontal line. The previous line must be blank.

Stardust
Blog