Building a line-by-line lexer in C#

So there I was. It was a lazy afternoon before my final exam of the semester, and I was idly looking through some old code. One thing led to another, and I ended up writing a line-based scanning lexer in C# - and I thought I'd share it here, pulling it apart and putting it back together again.

The aim was to build something regular-expression-based that would be flexible enough to be used in a wide range of applications - but not too difficult or confusing to pick up, use, and add to another project. The final code isn't currently available in a public repository (it's actually for a personal project - maybe I'll get around to refactoring it into a library if there's demand), but I'll still post the finished code at the end of this post.

To start with, let's define some data classes to hold some input and output information. Hrm. Let's see - we'll need a class to represent a rule that the lexer will utilise, and one to represent the tokens that our lexer will be emitting. Let's deal with the one for the rules first:

public class LexerRule<TokenType>
{
    public readonly TokenType Type;
    public readonly Regex RegEx;
    public bool Enabled { get; set; } = true;

    public LexerRule(TokenType inName, string inRegEx)
    {
        if (!typeof(TokenType).IsEnum)
            throw new ArgumentException($"Error: TokenType must be an enum - {typeof(TokenType)} passed");

        Type = inName;
        RegEx = new Regex(inRegEx);
    }

    public bool Toggle()
    {
        Enabled = !Enabled;
        return Enabled;
    }
}

Here I define a template (or generic) class that holds a regular expression, and associates it with a value from an enum. There's probably a better / cleaner way to make sure that TokenType is an enum, but for now this should serve its purpose just fine. I also add a simple Enabled boolean property - as we'll be adding support for dynamically enabling and disabling rules later on.
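To illustrate, here's how a rule might be built and toggled (borrowing the SimpleToken enum that appears later in this post):

LexerRule<SimpleToken> commentRule = new LexerRule<SimpleToken>(SimpleToken.Comment, @"//.*");
commentRule.Toggle(); // commentRule.Enabled is now false
// new LexerRule<int>(0, @"\d+") would throw an ArgumentException, since int isn't an enum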

Next up, let's tackle the class for the tokens that we're going to be emitting:

public class LexerToken<TokenType>
{
    public readonly bool IsNullMatch = false;
    public readonly LexerRule<TokenType> Rule = null;
    public readonly Match RegexMatch;

    public TokenType Type {
        get {
            // Rule is null for null-match tokens, so fall back to the enum's default value
            return Rule != null ? Rule.Type : default(TokenType);
        }
    }
    private string nullValueData;
    public string Value {
        get {
            return IsNullMatch ? nullValueData : RegexMatch.Value;
        }
    }

    public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
    {
        Rule = inRule;
        RegexMatch = inMatch;
    }
    public LexerToken(string unknownData)
    {
        IsNullMatch = true;
        nullValueData = unknownData;
    }

    public override string ToString()
    {
        return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
    }
}

A little more complex, but still manageable. It, like its LexerRule cousin, is also a template (or generic) class. It holds the type of token it is, and the regular expression Match object generated during the scanning process. There's also something strange going on with Value and nullValueData - this is so that we can emit tokens with an 'unknown' type for any text in between matches that doesn't belong to any known rule (more on that later).
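To make that concrete, here's a quick illustration of both constructors in action (again borrowing the SimpleToken enum from later in this post):

LexerRule<SimpleToken> rule = new LexerRule<SimpleToken>(SimpleToken.Comment, @"//.*");
LexerToken<SimpleToken> known = new LexerToken<SimpleToken>(rule, rule.RegEx.Match("// hello"));
Console.WriteLine(known); // [LexerToken: Type=Comment, Value=// hello]

LexerToken<SimpleToken> stray = new LexerToken<SimpleToken>("~~~");
Console.WriteLine(stray); // [LexerToken: Type=Unknown, Value=~~~] - Rule is null, so Type falls back to the default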

With our data classes in place, it's time to turn our attention to the lexer itself. Let's put together some scaffolding to get an idea as to how it's going to work:

public class Lexer<TokenType>
{
    public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();

    public int CurrentLineNumber { get; private set; } = 0;
    public int CurrentLinePos { get; private set; } = 0;
    public int TotalCharsScanned { get; private set; } = 0;

    private StreamReader textStream;

    public Lexer()
    {

    }

    public void AddRule(LexerRule<TokenType> newRule);
    public void AddRules(IEnumerable<LexerRule<TokenType>> newRules);

    public void Initialise(StreamReader reader);

    public IEnumerable<LexerToken<TokenType>> TokenStream();

    public void EnableRule(TokenType type);
    public void DisableRule(TokenType type);
    public void SetRule(TokenType type, bool state);
}

There - that should do the trick! CurrentLineNumber, CurrentLinePos, and TotalCharsScanned are properties to keep track of where we've got to, and textStream is the StreamReader we'll be reading data from. Then we've got some methods that will add new lexer rules to Rules, others that enable and disable rules by token type, a method to initialise the lexer with the correct textStream, and finally a generator method that will emit the tokens.

With our skeleton complete, let's fill out a few of those methods:

public void AddRule(LexerRule<TokenType> newRule)
{
    Rules.Add(newRule);
}
public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
{
    Rules.AddRange(newRules);
}

public void Initialise(StreamReader reader)
{
    textStream = reader;
}

public void EnableRule(TokenType type)
{
    SetRule(type, true);
}
public void DisableRule(TokenType type)
{
    SetRule(type, false);
}
public void SetRule(TokenType type, bool state)
{
    foreach (LexerRule<TokenType> rule in Rules)
    {
        // We have to do a string comparison here because of the generic type we're using in multiple nested
        // classes
        if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
            rule.Enabled = state;
            return;
        }
    }
}

Very cool. None of this is particularly exciting - apart from SetRule. In SetRule we have to convert the type argument to a string in order to compare it to the rules in the Rules list, as C♯ won't let us compare two values of an unconstrained generic type parameter with == directly - even though the TokenType here is the very same TokenType as on the LexerRule class!
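As an aside, I suspect EqualityComparer<T>.Default would sidestep the string comparison entirely, since it compares two values of an unconstrained generic type via their own Equals() implementation. Here's a sketch of what SetRule could look like that way (untested against the rest of the class):

public void SetRule(TokenType type, bool state)
{
    foreach (LexerRule<TokenType> rule in Rules)
    {
        // Compares the enum values directly - no Enum.GetName() round-trip needed
        if (EqualityComparer<TokenType>.Default.Equals(rule.Type, type)) {
            rule.Enabled = state;
            return;
        }
    }
}

Either way, this did give me an idea for a trio of additional methods to make manipulating rules easier: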

public void EnableRulesByPrefix(string tokenTypePrefix)
{
    SetRulesByPrefix(tokenTypePrefix, true);
}
public void DisableRulesByPrefix(string tokenTypePrefix)
{
    SetRulesByPrefix(tokenTypePrefix, false);
}
public void SetRulesByPrefix(string tokenTypePrefix, bool state)
{
    foreach (LexerRule<TokenType> rule in Rules)
    {
        // We have to do a string comparison here because of the generic type we're using in multiple nested
        // classes
        if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
        {
            rule.Enabled = state;
        }
    }
}

This set of methods lets us enable or disable rules in bulk, based on the prefix of their token type's name. For example, if I have the 3 rules CommentStart, CommentEnd, and FunctionStart, then calling EnableRulesByPrefix("Comment") will enable CommentStart and CommentEnd, but not FunctionStart.
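In code, that example looks something like this (the enum here is hypothetical):

public enum ExampleToken { Unknown = 0, CommentStart, CommentEnd, FunctionStart }

Lexer<ExampleToken> lexer = new Lexer<ExampleToken>();
// ...add a rule for each token type...
lexer.EnableRulesByPrefix("Comment"); // enables CommentStart and CommentEnd, but not FunctionStart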

With all the interfacing code out of the way, let's turn to the real meat of the subject: the TokenStream method. This is the method behind the magic. It's a bit complicated, so let's take it step-by-step. Firstly, we need to iterate over the lines in the StreamReader:

string nextLine;
List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
while ((nextLine = textStream.ReadLine()) != null)
{
    CurrentLinePos = 0;

    // .....

    CurrentLineNumber++;
    TotalCharsScanned += CurrentLinePos;
}

Fairly simple, right? I've used this construct a few times in the past. Before you ask, we'll get to matches in just a moment :-) Next, we need another while loop that iterates until we reach the end of the line:

while (CurrentLinePos < nextLine.Length)
{
    matches.Clear();
    foreach (LexerRule<TokenType> rule in Rules) {
        if (!rule.Enabled) continue;

        Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
        if (!nextMatch.Success) continue;

        matches.Add(new LexerToken<TokenType>(rule, nextMatch));
    }

    // .....
}

Also fairly easy to follow. Basically, we clear the matches list, and then attempt to find the next match from the current position on the line that we've reached (CurrentLinePos) for every rule - and we store all the successful matches for further inspection and processing. We also make sure we skip any disabled rules here, too.

If we don't find any matching rules, then that must mean that we can't match the rest of this line to any known token. In this case, we want to emit an unknown token:

if (matches.Count == 0) {
    yield return new LexerToken<TokenType>(nextLine.Substring(CurrentLinePos));
    break;
}

This is what that extra LexerToken constructor we created above is for. Note that we yield return here, instead of simply returning - this is very similar in construct to the yield statement in Javascript that I blogged about before, in that it allows you to maintain state inside a method and return multiple values in sequence.
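If you haven't met yield return in C♯ before, here's a tiny standalone example of a generator method - execution pauses at every yield return, and resumes from that exact point when the caller asks for the next value:

public static IEnumerable<int> Squares(int count)
{
    for (int i = 1; i <= count; i++)
        yield return i * i; // the loop variable survives between calls
}

// foreach (int n in Squares(4)) yields 1, 4, 9, 16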

By an 'unknown token', I am referring to the default value of the TokenType enum. Here's an example enum you might use with this lexer class:

public enum SimpleToken {
    Unknown = 0,

    Whitespace,

    Comment,
    BlockOpen,
    BlockClose,
}

Since Unknown is explicitly assigned the value 0, we can be absolutely certain that it's the default value of this enum.
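A quick way to convince yourself of this:

Console.WriteLine(default(SimpleToken)); // Unknown
Console.WriteLine((SimpleToken)0);       // Unknown - for an enum, default(T) is always the value 0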

With our list of potential matches in hand, we next need to sort it in order to work out which one we should prioritise. After deliberating and experimenting a bit, I came up with this:

matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
    // Sort by the offset of the match on the current line
    int result = a.RegexMatch.Index - b.RegexMatch.Index;
    // If they both start at the same position, then go with the longest one
    if (result == 0)
        result = b.RegexMatch.Length - a.RegexMatch.Length;

    return result;
});

Basically, this sorts the matches so that those closest to the current scanning position on the line (CurrentLinePos) are prioritised. If 2 or more matches tie based on this criterion, we prioritise the longest match. This seems to work for now - I can always change it later if it becomes a problem :P

With our matches sorted, we can now pick out the one we're going to emit next. Before we do so though, we need to take care of any characters between our current scanning position and the start of the next token:

LexerToken<TokenType> selectedToken = matches[0];
int selectedTokenOffset = selectedToken.RegexMatch.Index - CurrentLinePos;

if (selectedTokenOffset > 0) {
    yield return new LexerToken<TokenType>(nextLine.Substring(CurrentLinePos, selectedTokenOffset));
    CurrentLinePos += selectedTokenOffset;
}

This emits those extra characters as an unknown token, just as we did before. Finally, we can emit the token itself and continue on to the next iteration of the loop:

CurrentLinePos += selectedToken.RegexMatch.Length;
yield return selectedToken;

That concludes our TokenStream method - and with it this lexer! Here's the code in full:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace SBRL.Tools
{
    public class LexerRule<TokenType>
    {
        public readonly TokenType Type;
        public readonly Regex RegEx;
        public bool Enabled { get; set; } = true;

        public LexerRule(TokenType inName, string inRegEx)
        {
            if (!typeof(TokenType).IsEnum)
                throw new ArgumentException($"Error: TokenType must be an enum - {typeof(TokenType)} passed");

            Type = inName;
            RegEx = new Regex(inRegEx);
        }

        public bool Toggle()
        {
            Enabled = !Enabled;
            return Enabled;
        }
    }

    public class LexerToken<TokenType>
    {
        public readonly bool IsNullMatch = false;
        public readonly LexerRule<TokenType> Rule = null;
        public readonly Match RegexMatch;

        public TokenType Type {
            get {
                // Rule is null for null-match tokens, so fall back to the enum's default value
                return Rule != null ? Rule.Type : default(TokenType);
            }
        }
        private string nullValueData;
        public string Value {
            get {
                return IsNullMatch ? nullValueData : RegexMatch.Value;
            }
        }

        public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
        {
            Rule = inRule;
            RegexMatch = inMatch;
        }
        public LexerToken(string unknownData)
        {
            IsNullMatch = true;
            nullValueData = unknownData;
        }

        public override string ToString()
        {
            return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
        }
    }

    public class Lexer<TokenType>
    {
        public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();

        public bool Verbose { get; set; } = false;

        /// <summary>
        /// The number of the line that is currently being scanned.
        /// </summary>
        public int CurrentLineNumber { get; private set; } = 0;
        /// <summary>
        /// The number of characters on the current line that have been scanned.
        /// </summary>
        /// <value>The current line position.</value>
        public int CurrentLinePos { get; private set; } = 0;
        /// <summary>
        /// The total number of characters currently scanned by this lexer instance.
        /// Only updated every newline!
        /// </summary>
        public int TotalCharsScanned { get; private set; } = 0;

        private StreamReader textStream;

        public Lexer()
        {

        }

        public void AddRule(LexerRule<TokenType> newRule)
        {
            Rules.Add(newRule);
        }
        public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
        {
            Rules.AddRange(newRules);
        }

        public void Initialise(StreamReader reader)
        {
            textStream = reader;
        }

        public IEnumerable<LexerToken<TokenType>> TokenStream()
        {
            string nextLine;
            List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
            while ((nextLine = textStream.ReadLine()) != null)
            {
                CurrentLinePos = 0;

                while (CurrentLinePos < nextLine.Length)
                {
                    matches.Clear();
                    foreach (LexerRule<TokenType> rule in Rules) {
                        if (!rule.Enabled) continue;

                        Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
                        if (!nextMatch.Success) continue;

                        matches.Add(new LexerToken<TokenType>(rule, nextMatch));
                    }

                    if (matches.Count == 0) {
                        string unknownTokenContent = nextLine.Substring(CurrentLinePos);
                        if(Verbose) Console.WriteLine("[Unknown Token: No matches found for this line] {0}", unknownTokenContent);
                        yield return new LexerToken<TokenType>(unknownTokenContent);
                        break;
                    }

                    matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
                        // Sort by the offset of the match on the current line
                        int result = a.RegexMatch.Index - b.RegexMatch.Index;
                        // If they both start at the same position, then go with the longest one
                        if (result == 0)
                            result = b.RegexMatch.Length - a.RegexMatch.Length;

                        return result;
                    });
                    LexerToken<TokenType> selectedToken = matches[0];
                    int selectedTokenOffset = selectedToken.RegexMatch.Index - CurrentLinePos;

                    if (selectedTokenOffset > 0) {
                        string extraTokenContent = nextLine.Substring(CurrentLinePos, selectedTokenOffset);
                        CurrentLinePos += selectedTokenOffset;
                        if(Verbose) Console.WriteLine("[Unmatched content] '{0}'", extraTokenContent);
                        yield return new LexerToken<TokenType>(extraTokenContent);
                    }

                    CurrentLinePos += selectedToken.RegexMatch.Length;
                    if(Verbose) Console.WriteLine(selectedToken);
                    yield return selectedToken;
                }

                if(Verbose) Console.WriteLine("[Lexer] Next line");
                CurrentLineNumber++;
                TotalCharsScanned += CurrentLinePos;
            }
        }

        public void EnableRule(TokenType type)
        {
            SetRule(type, true);
        }
        public void DisableRule(TokenType type)
        {
            SetRule(type, false);
        }
        public void SetRule(TokenType type, bool state)
        {
            foreach (LexerRule<TokenType> rule in Rules)
            {
                // We have to do a string comparison here because of the generic type we're using in multiple nested
                // classes
                if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
                    rule.Enabled = state;
                    return;
                }
            }
        }

        public void EnableRulesByPrefix(string tokenTypePrefix)
        {
            SetRulesByPrefix(tokenTypePrefix, true);
        }
        public void DisableRulesByPrefix(string tokenTypePrefix)
        {
            SetRulesByPrefix(tokenTypePrefix, false);
        }
        public void SetRulesByPrefix(string tokenTypePrefix, bool state)
        {
            foreach (LexerRule<TokenType> rule in Rules)
            {
                // We have to do a string comparison here because of the generic type we're using in multiple nested
                // classes
                if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
                {
                    rule.Enabled = state;
                }
            }
        }
    }
}

It's a bit much to take in all at once, but hopefully by breaking it down into steps I've made it easier to understand how I built it, and how all the different pieces fit together. The only real difference between the above code and the code I walked through in this post is the Verbose property I added for testing purposes, and the associated Console.WriteLine calls. For fun, here's a very basic LOGO lexer. I've based it on what I remember from using MSWLogo / FMSLogo a long time ago (there seem to be many dialects around these days):

public enum LogoToken
{
    Unknown = 0,

    Whitespace,

    FunctionForwards,
    FunctionBackwards,
    FunctionLeft,
    FunctionRight,
    FunctionPenUp,
    FunctionPenDown,

    Number
}

public class LogoLexer : Lexer<LogoToken>
{
    public LogoLexer()
    {
        AddRules(new List<LexerRule<LogoToken>>() {
            new LexerRule<LogoToken>(LogoToken.Whitespace,    @"\s+"),

            new LexerRule<LogoToken>(LogoToken.FunctionForwards,    @"FD"),
            new LexerRule<LogoToken>(LogoToken.FunctionBackwards,    @"BK"),
            new LexerRule<LogoToken>(LogoToken.FunctionLeft,    @"LT"),
            new LexerRule<LogoToken>(LogoToken.FunctionRight,    @"RT"),

            new LexerRule<LogoToken>(LogoToken.FunctionPenUp,    @"PU"),
            new LexerRule<LogoToken>(LogoToken.FunctionPenDown,    @"PD"),

            new LexerRule<LogoToken>(LogoToken.Number,    @"\d+"),
        });
    }
}
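To actually run it over some input, you'll need a little driver - something like this minimal sketch will do (the MemoryStream wrapping is just to get a StreamReader over an in-memory string):

using System;
using System.IO;
using System.Text;

public static class LogoLexerDemo
{
    public static void Main()
    {
        string program = "FD 100 RT 90\nFD 50 PU RT 180\nBK 40 LT 45 FD 250\n";

        LogoLexer lexer = new LogoLexer();
        // Lexer.Initialise() wants a StreamReader, so wrap the string in a stream
        lexer.Initialise(new StreamReader(new MemoryStream(Encoding.UTF8.GetBytes(program))));

        foreach (LexerToken<LogoToken> token in lexer.TokenStream())
            Console.WriteLine(token);
    }
}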

Here's an example LOGO program that it parses:

FD 100 RT 90
FD 50 PU RT 180
BK 40 LT 45 FD 250

...and here's the output from lexing that example program:

[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=100]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=90]
[Lexer] Next line
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=50]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionPenUp, Value=PU]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=180]
[Lexer] Next line
[LexerToken: Type=FunctionBackwards, Value=BK]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=40]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionLeft, Value=LT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=45]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=250]
[Lexer] Next line
[Lexer] Next line

Very cool! This could easily be extended to support more of the LOGO syntax. As an exercise, can you extend it to support the REPEAT statement? At some point in the future, I might go even further and build a bottom-up left-to-right shift-reduce parser, and combine it with this lexer and some BNF to create a parse tree.

Enjoyed this post? Don't quite understand something? Think you could do better? Post a comment below!
