Flexible Bison: Compiler Theory

One of the modules I've picked to do in my first semester of my third year at university is Lanuguages and their Compilers. Naturally, this entails building a compiler to compile a program that's written in a source language (spec provided, thankfully! :D) into plain old ANSI C.

The tools we're going to be using for this and the steps involved in actually compiling something into another language are somewhat complicated, and I'm having a bit of difficulty getting my head around the different steps a compiler goes through and how these steps relate to the tools we're going to be using. This blog post is my attempt to make sense of what I've learnt so far.

Firstly, let me introduce the tools I'll be using: GNU flex and GNU bison. Apparently they have a much shallower learning curve than other tools out there. At first, this doesn't appear to be the case - but the more I think about it the more I realise that this is true.

Flex, as far as I can tell, is a regular-expression based scanning tokeniser. In other words, it breaks down an input string into a series of tokens. It has a method that, when called, finds and returns the next token from the source string.

Bison uses tokenised output from flex to construct a parse tree. This parse tree is then optimised with redundant nodes removed, loops optimised, and other such tweaks. Finally, this optimised tree is then used to generate the output code.

With the cast introduced, I can get to the stages of a compiler:

Lexical Analysis - Tokenisation
Syntactical Analysis - Conversion of the token stream into a parse tree
Semantic Analysis - Correction of the tree - e.g. automatic type conversion
Intermediate Code Generation - Sometimes the compiler outputs sets of 3 values in a list of tuples. This was needed in older computers that couldn't hold all the steps of a compiler in memory at once! In my case, I'll be outputting the parse tree generated in step 3 I guess - but not to disk, as today we can have all the passes of the compiler in memory at the same time :D
Optimisation - Redundant parts of the parse tree are removed etc. - loops are focused in particular
Code Generation - The output code in the target language is generated here - whether that be in C (very common), Assembly, or another language.

This seems somewhat familiar. The Lexical Analysis phase seems to be rather similar to what flex is designed for, and the Semantic Analysis stage appears to what bison does. As for the other stages, I'm not really sure. I'm guessing that it'll become clear later as we build this compiler in stages - but I'm suspecting that we'll be writing them in plain C - unless I've missed something about bison :P

If you've made it this far, thanks for reading! If this feels somewhat disorganised - then it probably is - after all, this is mainly to get it all straight in my own head :P

If you've got any questions, please ask away in the comments below :-)

Comments

Bilal:

You're doing great, contributing to this industry is not what everyone do. It makes easier for all of us students to follow you guys who are senior in the field. thanks.

Type this	To get this	Notes
`bold text`	bold text	-
`_italics text_`	italics text	-
`~~deleted text~~`	~~deleted~~	-
`code text`	`code text`	Inserts some monospaced code. It is preferred that large blocks of code are linked to using a service such as Pastebin, Github Gists or Ideone.
`> Quote`	Quote	-
`[display text](//google.com)`	display text	Inserts a hyperlink. Please use responsibly. `[rel=nofollow]` is in use and spam will be deleted.
`---`		Inserts a horizontal line. The previous line must be blank.

Stardust
Blog