Flexible Bison: Compiler Theory
One of the modules I've picked to do in my first semester of my third year at university is Lanuguages and their Compilers. Naturally, this entails building a compiler to compile a program that's written in a source language (spec provided, thankfully! :D) into plain old ANSI C.
The tools we're going to be using for this and the steps involved in actually compiling something into another language are somewhat complicated, and I'm having a bit of difficulty getting my head around the different steps a compiler goes through and how these steps relate to the tools we're going to be using. This blog post is my attempt to make sense of what I've learnt so far.
Firstly, let me introduce the tools I'll be using: GNU flex
and GNU bison
. Apparently they have a much shallower learning curve than other tools out there. At first, this doesn't appear to be the case - but the more I think about it the more I realise that this is true.
Flex, as far as I can tell, is a regular-expression based scanning tokeniser. In other words, it breaks down an input string into a series of tokens. It has a method that, when called, finds and returns the next token from the source string.
Bison uses tokenised output from flex to construct a parse tree. This parse tree is then optimised with redundant nodes removed, loops optimised, and other such tweaks. Finally, this optimised tree is then used to generate the output code.
With the cast introduced, I can get to the stages of a compiler:
- Lexical Analysis - Tokenisation
- Syntactical Analysis - Conversion of the token stream into a parse tree
- Semantic Analysis - Correction of the tree - e.g. automatic type conversion
- Intermediate Code Generation - Sometimes the compiler outputs sets of 3 values in a list of tuples. This was needed in older computers that couldn't hold all the steps of a compiler in memory at once! In my case, I'll be outputting the parse tree generated in step 3 I guess - but not to disk, as today we can have all the passes of the compiler in memory at the same time :D
- Optimisation - Redundant parts of the parse tree are removed etc. - loops are focused in particular
- Code Generation - The output code in the target language is generated here - whether that be in C (very common), Assembly, or another language.
This seems somewhat familiar. The Lexical Analysis phase seems to be rather similar to what flex
is designed for, and the Semantic Analysis stage appears to what bison
does. As for the other stages, I'm not really sure. I'm guessing that it'll become clear later as we build this compiler in stages - but I'm suspecting that we'll be writing them in plain C - unless I've missed something about bison
:P
If you've made it this far, thanks for reading! If this feels somewhat disorganised - then it probably is - after all, this is mainly to get it all straight in my own head :P
If you've got any questions, please ask away in the comments below :-)