Starbeamrainbowlabs

About

Hello!

I am a computer science student who is in their third year at Hull University. I started out teaching myself about various web technologies, and then I managed to get a place at University, where I am now. I've done a year in industry too, which I found to be particuarly helpful in learning about the workplace and the world.

I currently know C# + Monogame / XNA (+ WPF), HTML5, CSS3, Javascript (ES6 + Node.js), PHP, C++, and a bit of Python. Oh yeah, and I can use XSLT too.

I love to experiment and learn about new things on a regular basis. You can find some of the things that I've done in the labs and code sections of this website, or on GitHub. My current projects are Pepperminty Wiki, an entire wiki engine in a single file (the source code is spread across multiple files - don't worry!), and a Prolog Visualisation Tool, although the latter is in its very early stages.

I can also be found in a number of other different places around the web. I've compiled a list of the places that I can remember below.

I can be contacted at the email address webmaster at starbeamrainbowlabs dot com. Suggestions, bug reports and constructive criticism are always welcome.

Blog

Blog Roll | Article Atom Feed | Mailing List


Latest Post

Compilers 101: Build your own flex + bison compiler in a few easy(?) steps

So you want to build your own compiler? Great! Don't know where to start? This guide should help! At University, we're building our own compiler for a custom programming language invented by our lecturer with a pair of GNU tools by the name of flex and bison - which I've blogged about before. Since that post, I've learnt a ton about how the whole process works, so I thought I'd write up a more coherent blog post on the subject :P

A diagram explaining the process of building a compiler. Explained below.

Stage 1: Planning

The whole process starts with railroad diagrams (also known as flowcharts) of the language you want to write a compiler for. Having an accurate set of railroad diagrams is essential to understanding precisely how the language is put together, which is rather useful for the next step.

Converting the railroad diagrams into plain BNF (Backus Naur Form). Unfortunately, bison doesn't support EBNF-like notation at the current time, so only plain-old BNF will do.

Stage 2: Lexing

With your railroad diagrams converted into BNF, you can start writing code! The first chunk of code that needs writing is the lexer. Lexing is what flex is good at - and involves converting the input source code into lexemes - discrete sequences of characters that match a particular pattern and can be assigned a particular category name, turning it into a token. Perhaps an example would help. Consider the following:

void do_awesome_stuff(int a, string b) {
    /* Code here */
}

The above can be turned into a sequence of tokens, not unlike the following (ignoring whitespace tokens, of course):

TYPE: void
IDENTIFIER: do_awesome_stuff
OPEN_BRACKET: (
TYPE: int
IDENTIFIER: a,
COMMA: ,
TYPE: string
IDENTIFIER: b
CLOSE_BRACKET: )
OPEN_BRACE: {
COMMENT: /* Code here */*
CLOSE_BRACE: }

See? We can identify 8 token types in the source string: TYPE, IDENTIFIER, COMMA, OPEN_BRACKET, CLOSE_BRACKET, OPEN_BRACE, COMMENT, and CLOSE_BRACE. These types and the rules to match them can be found by analysing a combination of the railroad diagrams and the BNF you created earlier.

Stage 3: Parser the first

With a lexer in hand, we can now look at writing the parser. This is done in two stages. The parser itself, and upgrading said parser to generate a parse tree.

Let's talk about the parser first. The parser can be largely created simply by running a few regular-expression find and replace rules on your BNF, actually. From there, it's just a case of adding the header and the footer to complete the document.

Let's take a look at some example BNF:

<instructions> ::= START <lines> END

<lines> ::= <lines> <line>
    | <line>

<line> ::= <command>

<command> ::= <cmd_name> <number>

<cmd_name> ::= FD
    | BK
    | LT
    | RT

<number> ::= <number> <digit>
    | <digit>

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The above matches something like the following:

START
FD 100
RT 180
FD 125
LT 90
BK 50
END

Very interesting (a virtual cookie is available for anyone who gets the reference as to what this grammar represents!). Let's look at converting that into something bison can understand:

instructions : START lines END
    ;

lines : lines line
    | line
    ;

line : command
    ;

command : cmd_name number
    ;

cmd_name : FD
    | BK
    | LT
    | RT
    ;

number : number digit
    | digit
    ;

digit : 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

That's looking much better already! Simply by using the regular expression substitutions:

  1. <([a-z_]+)> -> $1
  2. ::= -> :
  3. \n\n -> \n\t;\n\n

....we can get most of the way there to something that bison can understand. Next, we need to refactor it a bit to tell it which tokens are coming from the lexer (which I'll leave up to you to write as an exercise), so it doesn't get them confused with the rules - which are defined in the BNF(-like) rules.

Let's get rid of the rule for number and digit first, since we can do those in the lexer quite easily. Next, let's add a %token definition to the top to tell it which are coming from the lexer. It's good practice to define everything that comes from the lexer in uppercase, and everything that's a rule that exists only in bison in lowercase:

%start instructions
%token FD BK LT RT NUMBER START END

We've also defined a start symbol - the one which when bison reaches it, it knows that it's completed the parsing process, as bison is a bottom-up parser.

Lastly, we need to reference the lexer itself. Thankfully that's easy too by appending to your new bison file:

%%

#include "lex.yy.c"

Very nice. Don't forget about the new line at the end of the file - flex and bison will complain if it isn't present! Here's the completed bison file:

%start instructions
%token FD BK LT RT NUMBER START END

instructions : START lines END
    ;

lines : lines line
    | line
    ;

line : command
    ;

command : cmd_name NUMBER
    ;

cmd_name : FD
    | BK
    | LT
    | RT
    ;

%%

#include "lex.yy.c"

With a brand-new bison file completed, there's just one component of the parser left - a plain-old C file that calls it. Let's create one of those quickly:


#include 

int yyparse(void);

int main(void)
{       
    # if YYDEBUG == 1
    extern int yydebug;
    yydebug = 1;
    #endif

    return yyparse();
}

void yyerror(char *error_message)
{
    fprintf(stderr, "Error: %s\nExiting\n", error_message);
}

The highlighted lines enable a special debugging mode built-in to bison if the standard compile-time symbol YYDEBUG is specified - and bison is run with a few special parameters. Here's the sequence of commands needed to compile this:

flex lexer.l
bison -tv parser.y
gcc -Wall -Wextra -g parser.tab.c main.c -lfl -ly -DYYDEBUG -D_XOPEN_SOURCE=700

The gcc command is probably a bit long-winded, but it does several useful things for us:

  • Shows additional warnings just in case we've made a mistake that might be an issue later (-Wall -Wextra)
  • Include additional debugging information in the output file to allow debugging with gdb (the GNU Debugger) if necessary (-g)
  • Fix strange errors on some systems (-D_XOPEN_SOURCE=700)

If you're on a Windows system, you may need to remove the -ly - which appears to be required on the Linux machines I use - it tells gcc that we'll be referencing the bison library.

Stage 4: Parser again

Congratulations on getting this far! You've now got a lexer and a parser - so it's time to put them to use. This is done by utilising the parser to build a parse tree - a tree of nodes that represent the input. Here's an example tree:

An example parse tree.

As you can see, each high-level node references one or more lower-level nodes, and the structure of the tree represents the first 2 lines of the example input above. The nodes in yellow are lexical tokens that come directly from flex - these are called terminals, or leaf nodes. The ones in purple come from the bison rules (which we derived from the BNF we wrote at the beginning of this post) - and are non-terminals, or tree nodes.

With this in mind, let's introduce another feature or two of bison. Firstly, let's take a look at revising that %token declaration we created above:

%token FD BK LT RT START END
%token<val_num> NUMBER

The important bit here is the <val_num>. Here we tell bison that a value should be attached to the token NUMBER - and that it will be of type int. After telling bison that it should expect a value, we need to give it a place to put it. Let's write some more code to go just below the %token declarations:

%union {
    int val_num;
}

There we go! Excellent - we've got a place to put it. Now we just need to alter the lexer to convert the token value to an int and put it there. That's not too tough, thankfully - but if you're having trouble with it, here's a hint:

{number}        { yylval.val_num = atoi(yytext); return(NUMBER); }

Now we have it passing numbers correctly, let's look briefly at generating that parse tree. I'm not going to give the game away - just a few helpful hints as to what you need to do here - otherwise it's not as fun :P

Generating the parse tree can be considered the both the most challenging part of the experience (especially if you don't know what you're doing) and the easiest to deal with at same time. Knowing your stuff and your end goal before you start makes the whole process a lot easier.

The first major step is to create a struct that can represent a type of node in your parse tree. It might be useful to store several properties here - such as the node type (An enum will come in handy here), a spot for the value of a lexical token (or a reference to it in a symbol table if you have one), and references to child nodes in the parse tree.

The second major step of the process is to create a utility method that creates a new node of the tree on the heap, and then revise the bison file to get each rule to create new nodes on the tree in such a way that it creates a parse tree when it reaches start node (or top node of the tree - which, in the case of the above, is instructions). For the purposes of this post, I'll be using a method with this prototype:

TreeNode create_node(int item, int node_type, TreeNode left, TreeNode right);

Your tree node struct and subsequent creation method may vary. With this in hand, we can revise the bison rules we created above to create these nodes we've been talking about. Here's a quick pointer on how to revise the rule for command above:

command : cmd_name NUMBER   { $$ = create_node($2, NODE_COMMAND, $1, NULL); }
    ;

This might look a bit strange, but let's break it down. The bit in curly braces is some (almost) plain C code that creates the node and returns a pointer to it to bison. The $$ is the return value for that node - which, I might add reminds me of something I forgot above. We need to tell bison about our new tree node data type and which rules should return it:

%type<val_tnode> instructions lines line command cmd_name

/* And in %union { ... } ..... */
TreeNode val_tnode;

This is almost the same as the %token<val_num> we did before, but we're defining the return value of a rule this type - not a token. With that little interlude out of the way, let's return to the code above. $1 and $2 refer to the first and second items in the rule definition respectively - and hold the type that we defined above in the %token and %type directives. Since bison is a bottom-up parser - this means that by the time this code executes, all it's child nodes have (hopefully) been created - and we just have to tie them all up together with a new node. In the case of my little example above, $1 is of type TreeNode, and $2 is of type int (that is if I didn't make any mistakes further up!).

Stage 5: Blasting off to code generation and beyond

Phew! That's a lot of work. If you've read this far, thank you and well done! It's been a long journey for both you the reader and me the writer, but you're almost done.

While conceptually simple when broken down, the whole process actually gets rather involved - especially when writing the BNF and the parser (the latter of which can be a particular pain due to shift/reduce and reduce/reduce errors), and the amount of code and head scratching you've got to do to get to this point is enormous. My best advice is to take it slow and don't rush - you'll only cause most problems for yourself if you try and jump the gun. Make sure that each stage works as you intend before you continue - back-pedalling to fix an issue can be particularly annoying as it can be bothersome to work out which stage the bug is actually in.

The last step of the whole process is to actually do something with the parse tree we've worked so hard to create. Thankfully, that's not too difficult - as we can put some additional code in the { } block of the starting symbol to call methods that will do things like perform some optimisation, print the tree to the console - or generate some sweet code. While the actual generation of code is beyond the scope of this article, I may end up posting about some optimisation techniques you can use on a parse tree after I've finished fiddling with float handling, symbol tables, and initial code generation in my ACW (Assessed Course Work).

Found this useful? Found a bug in this post? Got a suggestion? Comment below! Since I don't have any real analytics on this blog besides the server logs, I've no idea if you've read it really unless you comment :P

Sources and Further Reading


By on

Labs

An implementation of Valvalingam's line simplification algorithm.
githubjsLine Simplifier
Share files and images on your computer with your friends!
githubjsGalleryShare
A night sky full of pretty twinkling stars.
sbrljsStarry Sky
A small gem I found in my archives. From 2013.
sbrljsArchives: Colour Picker
A bicycle riding through some procedural scrolling parallax hills.
sbrljsParallax Bicycle
A procedural castle generator I wrote for /r/proceduralgeneration's 2nd Monthly challenge.
githubjsProcedural Castles
A pen I created as a demo whilst writing a class to draw regular shapes.
codepenjsRotating Shapes
A pen I created as a demo whilst writing a class to draw smooth lines.
codepenjsSmooth Lines
A small experiment to get my head around how fractals work.
sbrljsFractal Shapes
An example of context.ellipse in action, written for a blog post.
sbrljsRipples
Get all the fun of the fair without the noise and the cold.
sbrljsBig Wheel
Some treasure is hidden on your screen. Can you find it using only your ears?
sbrljsAudio Treasure Hunter
A Voronoi Diagram Generator
sbrljsVoronoi Diagrams
A random snowflake generator
sbrljsSnowflake Generator
A fully functional wiki in a box.
sbrlphpPepperminty Wiki
Some clouds drifting across the screen, drawn via the HTML5 Canvas.
sbrljsHTML5 Canvas Clouds
A turtle based drawing program for your first forays into simple programming.
sbrljsImaanvas
A client side online tool for stitching strings of still images into an animated gif.
sbrljsGif Renderer
A set of parallax scrolling stars using the HTML5 Canvas
sbrljsParallax Scrolling Stars
An (almost) pure CSS spotlight demo.
jsbincss(Almost) Pure CSS Spotlight
A Javascript Bookmarklet to fade the unimportant parts of a page. Also features HTML5 fullscreen API integration.
sbrljsLightsout
A small script to trianglify (draw triangles on) an image.
sbrlhtmlImage Trianglifier

Code

An implementation of HTTP and Websockets in pure C&sharp; as a library. Does not use System.Net.HttpServer.
githubcsharpGliding Squirrel HTTP
A program that detects and decodes morse code embedded inside an audio file.
sbrlcsharpAudio Morse Decoder
The one and only C&sharp; class generator. Tired of typing the same old scaffolding out all the time? Give this tool a try.
sbrlcsharpCscz
A class modelled on StreamWriter that makes it easy to generate CSV files.
githubcsharpCSVWriter
A web based tool that generates diagrams based on Prolog traces.
githubjsProlog Visualisation Tool
A command line tool to generate random noise, written in C#
githubcsharpNoisebox
A (hopefully) better traceroute utility written in C#
githubcsharpTraceRoutePlus
An easier way to generate XML.
sbrlphpSimple XML Writer
A PHP based Atom feed generator.
sbrlphpPHP Atom Generator
A simple CodeMirror based javascript bookmarklet editor.
sbrlhtmlBookmarklet Playground

Tools

I find useful tools on the internet occasionally. I will list them here.

Brackets
ConEmu
WinSCP
7zip
Bfxr
I'm Only Resting
HxD
DCPicker
Art by Mythdael