C# & .NET Terminology Demystified: A Glossary

After my last glossary post on LoRa, I thought I'd write another one on C♯ and .NET, as (in typical Microsoft fashion, it would seem) there's a lot of jargon floating around whose meaning is not always obvious.

If you're new to C♯ and the .NET ecosystems, I wouldn't recommend tackling all of this at once - especially the bottom ~3 definitions - with those in particular there's a lot to get your head around.

C♯

C♯ is an object-oriented programming language that was invented by Microsoft. It's cross-platform, and is usually written in an IDE (Integrated Development Environment), which has a deeper understanding of the code you write than a regular text editor. IDEs include Visual Studio (for Windows) and MonoDevelop (for everyone else).

Solution

A Solution (sometimes referred to as a Visual Studio Solution) is the top-level definition of a project, contained in a file ending in .sln. Each solution may contain one or more Project Files (not to be confused with the project you're working on itself), each of which gets compiled into a single binary. Each project may have its own dependencies too: whether they be a core standard library, another project, or a NuGet package.

Project

A project contains your code, and sits 1 level down from a solution file. Normally, a solution file will sit in the root directory of your repository, and the projects will each have their own sub-folders.

While each project has a single output file (be that a .dll class library or a standalone .exe executable), a project may have multiple dependencies - leading to many files in the build output folder.

The build process and dependency definitions for a project are defined in the .csproj file. This file is written in XML, and can be edited to perform advanced build steps, should you need to do something that the GUI of your IDE doesn't support. I've blogged about the structuring of this file before (see here, and also a bit more here), should you find yourself curious.
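As a rough illustration, a minimal sketch of the newer 'SDK-style' format might look something like the following (this is not generated from a real project - older Visual Studio and MonoDevelop projects contain rather more boilerplate, and the package name and version here are just placeholders):

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <!-- Produce a standalone executable rather than a class library -->
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp2.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <!-- An example NuGet package dependency (name and version are placeholders) -->
    <PackageReference Include="Newtonsoft.Json" Version="11.0.1" />
  </ItemGroup>
</Project>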

CIL

Known as the Common Intermediate Language, CIL is the binary format that C♯ (and also Visual Basic and F♯) code gets compiled into. From here, the .NET runtime (on Windows) or Mono (on macOS, Linux, etc.) can execute it to run the compiled project.

MSBuild

The build system for Solutions and Projects. It reads a .sln or .csproj (there are others for different languages, but I won't list them here) file and executes the defined build instructions.
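For example, on a machine with the build tools installed, building a whole solution from a terminal looks something like this (the solution filename is just a placeholder):

msbuild MySolution.sln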

.NET Framework

The .NET Framework is the standard library of C♯. It provides practically everything you'll need to perform most common tasks - although it does not provide a framework for constructing graphical user interfaces (GUIs). You can browse the API reference over at the official .NET API Browser.

WPF

The Windows Presentation Foundation is a Windows-only GUI framework. Powered by XAML (eXtensible Application Markup Language) definitions of what the GUI should look like, it provides everything you need to create a native-looking GUI on Windows.

It does not work on macOS and Linux. To create a cross-platform program that works on all 3 operating systems, you'll need to use an alternative GUI framework, such as XWT or Gtk# (also: Glade). A more complete list of cross-platform frameworks can be found here. It's worth noting that Windows Forms, although a tempting option, aren't as flexible as the other options listed here.

C♯ 7

The 7th version of the C♯ language specification. This includes the syntax of the language, but not the .NET Framework itself.

.NET Standard

A specification of the .NET Framework, but not the C♯ language. As of the time of typing, the latest version is 2.0, although version 1.6 is commonly used too. The intention here is to improve the cross-platform portability of .NET programs by defining a specification for a subset of the full .NET Framework standard library that all platforms will always be able to use. This includes Android and iOS through the use of Xamarin.

Note that all .NET Standard projects are class libraries. In order to create an executable, you'll have to add an additional Project to your Solution that references your .NET Standard class library.

ASP.NET

A web framework for .NET-based programming languages (in our case C♯). It allows you to write C♯ code to handle HTTP (and now WebSocket) requests in a similar manner to PHP, but differs in that your code still needs compiling. Compiled code is then managed by a web server, such as the IIS web server (on Windows).

With the release of .NET Core, ASP.NET is now obsolete.

.NET Core

Coming in 2 versions so far (1.0 and 2.0), .NET Core is the replacement for ASP.NET (though this is not its exclusive purpose). As far as I understand it, .NET Core is a modular runtime that allows programs targeting it to run on multiple platforms. Such programs can either be ASP.NET Core applications, or Universal Windows Platform applications for the Windows Store.

This question and answer appears to have the best explanation I've found so far. In particular, the displayed diagram is very helpful:

A diagram showing the structure of the .NET ecosystem, including .NET Core. See the links in the sources and further reading below for more information.

....along with the pair of official "Introducing" blog posts that I've included in the Sources and Further Reading section below.

Conclusion

We've looked at some of the confusing terminology in the .NET ecosystems, and examined each of them in turn. We started by defining and untangling the process by which your C♯ code is compiled and run, and then moved on to the different variants and specifications related to the .NET Framework and C♯.

As always, this is a starting point - not an ending point! I'd recommend doing some additional reading and experimentation to figure out all the details.

Found this helpful? Still confused? Spotted a mistake? Comment below!

Sources and Further Reading

LoRa Terminology Demystified: A Glossary

My 2 RFM95s on the lid of my project's box. More info in a future blog post coming soon!

(Above: My 2 RFM95s. One works, but the other doesn't yet....)

I've been doing some more experimenting with LoRa recently, as I've got 1 of my 2 RFM95s working (yay)! While the other is still giving me trouble (meaning that I can't have 1 transmit and the other receive yet :-/), I've still been able to experiment with other people's implementations.

To that end, I've been learning about a bunch of different words and concepts - and thought that I'd document them all here.

LoRa

The radio protocol itself is called LoRa, which stands for Long Range. It provides a chirp-based system (more on that later under Bandwidth) to allow 2 devices to communicate over great distances.

LoRaWAN

LoRaWAN builds on LoRa to provide a complete end-to-end protocol stack to allow Internet of Things (IoT) devices to communicate with an application server and each other. It provides:

  • Standard device classes (A, B, and C) with defined behaviours
    • Class A devices can only receive for a short time after transmitting
    • Class B devices receive on a regular, timed, basis - regardless of when they transmit
    • Class C devices send and receive whenever they like
  • The concept of a Gateway for picking up packets and forwarding them across the rest of the network (The Things Network is the largest open implementation to date - you should definitely check it out if you're thinking of using LoRa in a project)
  • Secure multiple-layered encryption of messages via AES

...amongst many other things.

The Things Network

The largest open implementation of LoRaWAN that I know of. If you hook into The Things Network's LoRaWAN network, then your messages will get delivered to and from your application server and LoRaWAN-enabled IoT device, wherever you are in the world (so long as you've got a connection to a gateway). It's often abbreviated to TTN.

Check out their website.

A coverage map for The Things Network.

(Above: A coverage map for The Things Network. The original can be found here)

Data Rate

The data rate is the speed at which a message is transmitted, measured in bits per second. LoRa itself is an 'unreliable' protocol (it doesn't guarantee that anyone will pick anything up at the other end). There are a number of preset data rates:

Code Speed (bits/second)
DR0 250
DR1 440
DR2 980
DR3 1760
DR4 3125
DR5 5470
DR6 11000
DR7 50000

_(Source: Exploratory Engineering: Data Rate and Spreading Factor)_

These values are a little different in different places - the above are for Europe on 868MHz.

Maximum Payload Size

Going hand-in-hand with the Data Rate, the Maximum Payload Size is the maximum number of bytes that can be transmitted in a single packet. If more than the maximum number of bytes needs to be transmitted, then it will be split across multiple packets - much like data is split according to the Maximum Transmission Unit (MTU) in TCP/IP networking.

With LoRa, the maximum payload size varies with the Data Rate - from 230 bytes at DR7 to just 59 at DR2 and below.

Spreading Factor

Often abbreviated to just SF, the Spreading Factor is also related to the Data Rate. In LoRa, the Spreading Factor refers to the duration of a single chirp. There are 6 defined Spreading Factors, ranging from SF7 (the fastest transmission speed) to SF12 (the slowest transmission speed).

Which one you use is up to you - and may be automatically determined by the driver library you use (it's always best to check). At first glance, it may seem optimal to choose SF7, but it's worth noting that the slower speeds achieved by the higher spreading factors can net you a longer range.

Data Rate Configuration bits / second Max payload size (bytes)
DR0 SF12/125kHz 250 59
DR1 SF11/125kHz 440 59
DR2 SF10/125kHz 980 59
DR3 SF9/125kHz 1 760 123
DR4 SF8/125kHz 3 125 230
DR5 SF7/125kHz 5 470 230
DR6 SF7/250kHz 11 000 230
DR7 FSK: 50kbps 50 000 230

_(Again, from Exploratory Engineering: Data Rate and Spreading Factor)_

Duty Cycle

A Duty Cycle is the amount of time something is active as a percentage of a total time. In the case of LoRa(/WAN?), there is an imposed 1% Duty Cycle, which means that you aren't allowed to be transmitting for more than 1% of the time - over the course of an hour, that works out to a total of just 36 seconds of air time.

Bandwidth

The Bandwidth is the range of frequencies across which LoRa transmits. The LoRa protocol itself uses a system of 'chirps', which are spread from one end of the Bandwidth to the other going either up (an up-chirp), or down (a down-chirp). LoRa has 3 bandwidths it uses: 125kHz, 250kHz, and 500kHz.

Some example LoRa chirps as described above.

(Some example LoRa Chirps. Source: This Article on Link Labs)

Frequency

Frequency is something that most of us are familiar with. Different wireless protocols utilise different frequencies - allowing them to go about their business in peace without interfering with each other. For example, 2.4GHz and 5GHz are used by WiFi, and 800MHz is one of the frequencies used by 4G.

In the case of LoRa, different frequencies are in use in different parts of the world. ~868MHz is used in Europe (433MHz can also be used, but I haven't heard of many people doing so), 915MHz is used in the US, and ~780MHz is used in China.

Location Frequency
Europe 863 - 870MHz
US 902 - 928MHz
China 779 - 787MHz

(Source: RF Wireless World)

Found this helpful? Still confused? Found a mistake? Comment below!

Sources and Further Reading

https://electronics.stackexchange.com/a/305287/180059

Proxies: What's the difference?

You've probably heard of proxies. Perhaps you used one when you were at school to access a website you weren't supposed to. But did you know that there are multiple different types of proxies that are used for different things? For example, did you know that a reverse proxy can perform load-balancing and caching for your web application? Or that a transparent proxy can be used to filter the traffic of your internet connection without you knowing (well, almost)? In this post, I'll be explaining the difference between the different types of proxy I'm aware of, why you'd want one, and how to detect their presence.

Reverse Proxies

A reverse proxy is one that, when it receives a request, repeats it to an upstream server. For example, I use nginx to reverse-proxy PHP requests to a backend PHP-FPM instance.

A diagram showing how a reverse proxy works. Basically: Client -> nginx (the reverse proxy) -> PHP-FPM (the server behind the reverse proxy).

Reverse proxies also come in really handy if you want to run multiple, perhaps unrelated, servers on a single machine with a single IP address, as they can reverse-proxy requests to the right place based on the requested subdomain. For example, on my server I not only serve my website (which in and of itself reverse-proxies PHP requests), but also my git server - which is a separate process listening on a different port behind my firewall.

Caching is another key feature of reverse proxies that comes in dead useful if you're running a medium to high traffic website. Instead of forwarding every single request to your backend for processing, if you've got a blog, for instance, you could cache the responses to requests for the posts themselves and serve them directly from the reverse proxy, leaving the slower backend free to process comments that people make, for example. Both nginx and Varnish have support for this. With this method, it's possible to serve 1000s of requests a minute from a very modestly sized virtual machine (say, 512MB RAM, 1 CPU) if configured correctly. Take that, Apache!

Finally, when 1 server isn't enough any more, you can get reverse proxies like nginx to act as a load balancer. In this scenario, there are multiple backend servers (probably running on different machines, with a fast internal LAN connecting them all), and a single front-facing load balancer sitting in front of them all, distributing requests to the backend servers. nginx in particular can get very fancy with the logic here, should you need that kind of control. It can even monitor the health of the backend application servers, and avoid sending any requests to unresponsive servers - giving them time to recover from a crash.

A diagram visualising the load-balancer explained above. A single nginx instance faces the internet, with multiple app servers behind it that it proxies requests to.
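As a rough sketch (the internal addresses here are made up), an nginx configuration for a load-balancing setup like the one in the diagram above might look something like this:

# Hypothetical pool of backend application servers on the internal LAN
upstream app_servers {
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

server {
    listen 80;
    server_name example.com;

    location / {
        # Hand each incoming request off to one of the backend servers
        proxy_pass http://app_servers;
    }
}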

Forward Proxies

Forward proxies are distinctly different to reverse proxies, in that they make requests to the destination the client wants to connect to, on the client's behalf. Such a proxy can be instituted for many reasons. Sometimes, it's for security reasons - for example to ensure that all those connecting to a backend local network are authenticated (authentication with a forward proxy is done via a set of special Proxy- HTTP headers). Other times, it's to preserve data on limited and/or expensive internet connections.

More often though, it's to censor and surveil the internet connection of the users on a network - and also to bypass such censoring. It is in this manner that HTTP(S) proxies have become so pervasive: companies, institutions, and (in rare cases) Internet Service Providers install forward proxies to censor the connections of their users, as such proxies usually only understand HTTP and HTTPS (clients request that a forward proxy retrieve something for them via a GET https://bobsrockets.net/ HTTP/1.1 request, for example). If you're curious though, some forward proxies these days support the CONNECT HTTP method, allowing one to set up a TLS connection with another server (whether that be an HTTPS, SSH, SMTPS, or other protocol server). In addition, the SOCKS protocol allows for arbitrary TCP connections to be proxied through as well.

Forward proxies nearly always require some client-side configuration. If you've wondered what the proxy settings are in your operating system and web browser's settings - this is what they're for.
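The same applies to individual programs too. As a minimal C♯ sketch (the proxy address is a placeholder - substitute whatever your network actually uses), explicitly pointing an HttpClient at a forward proxy looks something like this:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class ProxyExample
{
    static async Task Main()
    {
        // Route all of this client's requests through a (hypothetical) forward proxy
        HttpClientHandler handler = new HttpClientHandler() {
            Proxy = new WebProxy("http://proxy.example.com:3128"),
            UseProxy = true
        };

        using (HttpClient client = new HttpClient(handler))
        {
            string html = await client.GetStringAsync("https://bobsrockets.net/");
            Console.WriteLine($"Fetched {html.Length} characters via the proxy");
        }
    }
}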

Such proxies can usually be identified by the Via and other headers that they attach to outgoing requests, as per RFC 2616. Online tools exist that exploit this - allowing you to detect whether such a proxy exists.

Transparent Proxies

Transparent proxies are similar to forward proxies, but do not require any client-side configuration. Instead, they utilise clever networking tricks to intercept network traffic being sent to and from the clients on a network. In this manner, they can cache responses, filter content, and protect the users from attacks without the client necessarily being aware of their existence.

It is important to note here though that utilising a proxy is by no means a substitute for maintaining proper defences on your own computer, such as installing and configuring a firewall, ensuring your system has all the latest updates, and, if you're running Windows, ensuring you have an antivirus program installed and running (Windows 10 comes with one automatically these days).

Even though they don't usually attach the Via header (as they are supposed to), such proxies can usually be detected by cleverly designed tests that exploit their tendency to cache requests, thankfully.

Conclusion

So there you have it. We've taken a look at Forward proxies, and the benefits (and drawbacks) they can provide to users. We've also investigated Transparent proxies, and how to detect them. Finally, we've looked at Reverse proxies and the advantages they can provide to enable you to scale and structure your next great web (and other protocol! Nginx supports all sorts of other protocols besides HTTP(S)) application.

Job Scheduling on Linux

Scheduling jobs to happen at a later time on a Linux based machine can be somewhat confusing. Confused by 5 4 8-10/4 6/4 *? Baffled by 5 */4 * * *? All will be revealed!

cron

Scheduling jobs on a Linux machine can be done in several ways. Let's start with cron - the primary program that orchestrates the whole proceeding. Its name comes from the Greek word Chronos, which means time. By filling in a crontab (read cron-table), you can tell it what to do when. It's essentially a time-table of jobs you'd like it to run.

Your Linux machine should come with cron installed already. You can check if cron is installed and running by entering this command into your terminal:

if [[ "$(pgrep -c cron)" -gt 0 ]]; then echo "Cron is installed :D"; else echo "Cron is not installed :-("; fi

If it isn't installed or running, then you'll have to investigate why this isn't the case. The most common reason is that it isn't installed. It's normally in the official repositories for most distributions - on a Debian-based system, sudo apt install cron should suffice. Arch-based users may need to check to make sure that the system service is enabled, and do so manually.

With cron set up and ready to go, we can start adding jobs to it. This is done by way of a crontab, as explained above. Each user has their own crontab such that they can each configure their own individual sets of jobs. To edit it, type this:

crontab -e

This will open your favourite editor with your crontab ready for editing (if you'd like to change your editor, do sudo update-alternatives --config editor or change the EDITOR environment variable). You should see a bunch of lines like this:

# Edit this file to introduce tasks to be run by cron.
# 
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
# 
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
# 
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
# 
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
# 
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
# 
# For more information see the manual pages of crontab(5) and cron(8)
# 
# m h  dom mon dow   command

I'd advise you keep this for future reference - just in case you find yourself in a pinch later - so scroll down to the bottom and start adding your jobs there.

Let's look at the syntax for telling cron about a job next. This is best done by example:

0 1 * * 7   cd /root && /root/run-backup

This job, as you might have guessed, runs a custom backup script. It's one I wrote myself, but that's a story for another time (comment below if you'd like me to post about that). What we're interested in is the bit at the beginning: 0 1 * * 7. Scheduling a cron job is done by specifying 5 space-separated values. In the case of the above, the job will run at 1am every Sunday morning. The order is as follows:

  • Minute
  • Hour
  • Day of the Month
  • Month
  • Day of the week

For each of these values, a number of different specifiers can be used. For example, specifying an asterisk (*) will cause the job to run at every interval of that column - e.g. every minute or every hour. If you want to run something on every minute of the day (such as a logging or monitoring script), use * * * * *. Be aware of the system resources you can use up by doing that though!

Specifying a number will restrict it to a specific time in an interval. For example, 10 * * * * will run the job at 10 minutes past every hour, and 22 3 * * * will run a job at 03:22 in the morning every day (I find such times great for maintenance jobs).

Sometimes, every hour or every minute is too often. Cron can handle this too! For example 3 */2 * * * will run a job at 3 minutes past every second hour. You can alter this at your leisure: The value after the forward slash (/) decides the interval (i.e. */3 would be every third, */15 would be every 15th, etc.).

The last column, the day of the week, is an alternative to the day of the month column. It lets you specify, as you may assume, the day of the week a job should run on. This can be specified in 2 ways: with the numbers 0-6, or with 3-letter short codes such as MON or SAT. For example, 6 20 * * WED runs at 6 minutes past 8 in the evening on Wednesday, and 0 */4 * * 0 runs every 4th hour on a Sunday.
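Pulling those pieces together, here are a few more example job specs (the script paths are placeholders - substitute your own commands):

# At 10 minutes past every hour
10 * * * *      /usr/local/bin/example-script
# At 03:22 every morning
22 3 * * *      /usr/local/bin/example-maintenance
# At 6 minutes past 8 in the evening every Wednesday
6 20 * * WED    /usr/local/bin/example-report
# Every 4th hour on Sundays
0 */4 * * 0     /usr/local/bin/example-sunday-job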

The combinations are endless! Since it can be a bit confusing combining all the options to get what you want, crontab.guru is great for piecing cron-job specifications together. It describes your cron-job spec in plain English for you as you type!

crontab.guru showing a random cronjob spec.

(Above: crontab.guru displaying a random cronjob spec)

What if I turn my computer off?

Ok, so cron is all very well, but what if you turn your machine off? Well, if cron isn't running at the time a job should be run, then it won't get executed. For those of us who don't leave their laptops on all the time, all is not lost! It's time to introduce the second piece of software at our disposal.

Enter stage left: anacron. Built to be a complement to cron, anacron sets up 3 folders:

  • /etc/cron.daily
  • /etc/cron.weekly
  • /etc/cron.monthly

Any executable scripts in these folders will be run at daily, weekly, and monthly intervals respectively by anacron, and it respects the hash-bang (that #! line at the beginning of the script) too!

Most server systems do not come with anacron pre-installed, though it should be present in your distribution's official repositories. Once you've installed it, edit root's crontab (with sudo crontab -e if you can't remember how) and add a job that executes anacron every hour like so:

# Run anacron every hour
5 * * * *   /usr/sbin/anacron

This is important, as anacron does not in itself run all the time like cron does (a program that runs continuously in the background like this is called a daemon in the Linux world) - it needs a helping hand to get it to run.

If you've got more specific requirements, then anacron also has its own configuration file you can edit. It's found at /etc/anacrontab, and has a different syntax. In the anacron table, jobs follow the following pattern:

  • period - The interval, in days, that the job should run
  • delay - The offset, in minutes, that the job should run at
  • job identifier - A textual identifier (without spaces, of course) that identifies the job
  • command - The command that should be executed

You'll notice that there are 3 jobs specified already - one for each of the 3 folders mentioned above. You can specify your own jobs too. Here's an example:

# Do the weekly backup
7   20  run-backup  cd /root/data-shape-backup && ./do-backup;

The above job runs every 7 days, with an offset of 20 minutes. Note that I've included a comment (the line starting with a hash #) to remind myself as to what the job does - I'd recommend you always include such a comment for your own reference - whether you're using cron, anacron, or otherwise.

I'd also recommend that you test your anacron configuration file after editing it to ensure it's valid. This is done like so:

anacron -T

I'm not an administrator, can I still use this?

Sure you can! If you've got anacron installed (you could even compile it from source locally if you haven't) and want to specify some jobs for your local account, then that's easily done too. Just create an anacrontab file anywhere you please, and then in your regular crontab (crontab -e), tell anacron where you put it like this:

# Run anacron every hour
5 * * * *   /usr/sbin/anacron -t "path/to/anacrontab"

What about one-off jobs?

Good point. cron and anacron are great for repeating jobs, but what if you want to set up a one-off job to auto-disable your firewall before enabling it just in case you accidentally lock yourself out? Thankfully, there's even an answer for this use-case too: atd.

atd is similar to cron in that it runs a daemon in the background, but instead of executing jobs specified in a crontab, you tell it when you want it to execute a series of commands, and then enter the commands themselves. For example:

$ at now + 10 minutes
warning: commands will be executed using /bin/sh
at> echo -e "Testing"  
at> uptime
at> <EOT>
job 4 at Thu Jul 12 14:36:00 2018

In the above, I tell it to run the job 10 minutes from now, and enter a pair of commands. To end the command list, I hit CTRL + D on an empty line. The output of the job will be emailed to me automatically if I've got that set up (cron and anacron also do this).

Specifying a time can be somewhat fiddly, but it's also quite flexible:

  • at tomorrow
  • at now + 5 hours
  • at 16:06
  • at next month
  • at 2018 09 25

....and so on. Listing the current scheduled jobs is also just as easy:

atq

This will output a list of scheduled jobs that haven't been run yet. You can't see any jobs that aren't created by you unless you're root (use sudo), though. You can use the job ids listed here to cancel a job too:

# Remove job id 4:
atrm 4

Conclusion

That just about concludes this whirlwind tour of job scheduling on Linux systems. We've looked at how to schedule jobs with cron, and how to ensure our jobs get run - even when the target machine isn't turned on all the time with anacron. We've also looked at one-time jobs with atd, and how to manage the job queue.

As usual, this is a starting point - not an ending point! Job scheduling is just the beginning. From here, you can look at setting up automated backups. You could investigate setting up an email server, and how that integrates with cron. You can utilise cron to perform maintenance for your next great web (or other!) application. The possibilities are endless!

Found this useful? Still confused? Comment below!

Demystifying Inverted Indexes

The test texts below overlaying one another in different colours, with a magnifying glass centred on top. (The magnifying glass in the above banner came from openclipart)

After writing the post that will be released after this one, I realised that I made a critical assumption that everyone knew what an inverted index was. Upon looking for an appropriate tutorial online, I couldn't find one that was close enough to what I did in Pepperminty Wiki, so I decided to write my own.

First, some context. What's Pepperminty Wiki? Well, it's a complete wiki engine in a single file of PHP. The source files are obviously not a single file, but it builds into a single file - making it easy to drop into any PHP-enabled web server.

One of its features is a full-text search engine. A personal wiki of mine has ~75k words spread across ~550 pages, and it manages to search them all in just ~450ms! It does this with the aid of an inverted index - which I'll be explaining in this post.

First though, we need some data to index. How about the descriptions of some video games?

Kerbal Space Program

In KSP, you must build a space-worthy craft, capable of flying its crew out into space, without killing them. At your disposal is a collection of parts, which must be assembled to create a functional ship. Each part has its own function and will affect the way a ship flies (or doesn't). So strap yourself in, and get ready to try some Rocket Science!

 

Cross Code

Meet Lea as she logs into an MMO of the distant future. Follow her steps as she discovers a vast world, meets other players and overcomes all the challenges of the game.

 

Fort Meow

Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow.

 

Factory Balls

Factory balls is the brand new bonte game I already announced yesterday. Factory balls takes part in the game design competition over at jayisgames. The goal of the design competition was to create a 'ball physics'-themed game. I hope you enjoy it!

Very cool, this should provide us with plenty of data to experiment with. Firstly, let's consider indexing. Take the Factory Balls description. We can split it up into tokens like this:

factory balls is the brand new bonte game
i already announced yesterday factory balls takes
part in the game design competition over
at jayisgames the goal of the design
competition was to create a ball physics
themed game i hope you enjoy it

Notice how we've removed punctuation here, and made everything lowercase. This is important for the next step, as we want to make sure we consider Factory and factory to be the same word - otherwise when querying the index we'd have to remember to get the casing correct.

With our tokens sorted, we can now count them to create our index. It's like a sort of tally chart I guess, except we'll be including the offset in the text of every token in the list. We'll also be removing some of the most common words in the list that aren't very interesting - these are known as stop words. Here's an index generated from that Factory Balls text above:

Token Frequency Offsets
factory 2 0, 12
balls 2 1, 13
brand 1 4
new 1 5
bonte 1 6
game 3 7, 18, 37
i 2 8, 38
announced 1 10
yesterday 1 11
takes 1 14
design 2 19, 28
competition 2 20, 29
jayisgames 1 23
goal 1 25
create 1 32
ball 1 34
physics 1 35
themed 1 36
hope 1 39
enjoy 1 41

Very cool. Now we can generate an index for each page's content. The next step is to turn this into an inverted index. Basically, the difference between the normal index and an inverted index is that an entry in an inverted index contains not just the offsets for a single page, but all the pages that contain that token. For example, the Cross Code example above also contains the token game, so the inverted index entry for game would contain a list of offsets for both the Factory Balls and Cross Code pages.

Including the names of every page under every different token in the inverted index would be both inefficient computationally and cause the index to grow rather large, so we should assign each page a unique numerical id. Let's do that now:

Id Page Name
1 Kerbal Space Program
2 Cross Code
3 Fort Meow
4 Factory Balls

There - much better. In Pepperminty Wiki, this is handled by the ids class, which has a pair of public methods: getid($pagename) and getpagename($id). If an id can't be found for a page name, then a new id is created and added to the list (Pepperminty Wiki calls this the id index) transparently. Similarly, if a page name can't be found for an id, then null should be returned.

Now that we've got ids for our pages, let's look at generating that inverted index entry for game we talked about above. Here it is:

  • Term: game
Id Frequency Offsets
2 1 31
3 1 5
4 3 7, 18, 37

Note how there isn't an entry for page id 1, as the Kerbal Space Program page doesn't contain the token game.

This, in essence, is the basics of inverted indexes. A full inverted index will contain an entry for every token that's found in at least 1 source document - though the approach used here is far from the only way of doing it (I'm sure there are much more advanced ways of doing it for larger datasets, but this came to mind from reading a few web articles and is fairly straight-forward and easy to understand).
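To make the structure a little more concrete, here's a minimal C♯ sketch of the idea (this isn't Pepperminty Wiki's actual code - that's written in PHP - and it skips stop words, persistence, and everything else a real implementation needs):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class InvertedIndexSketch
{
    // token -> page id -> list of token offsets on that page
    static Dictionary<string, Dictionary<int, List<int>>> invertedIndex =
        new Dictionary<string, Dictionary<int, List<int>>>();

    static void IndexPage(int pageId, string content)
    {
        // Lowercase, strip punctuation, and split into tokens
        string[] tokens = Regex.Split(content.ToLower(), @"[^a-z0-9]+")
            .Where(token => token.Length > 0).ToArray();

        for (int offset = 0; offset < tokens.Length; offset++)
        {
            string token = tokens[offset];
            if (!invertedIndex.ContainsKey(token))
                invertedIndex[token] = new Dictionary<int, List<int>>();
            if (!invertedIndex[token].ContainsKey(pageId))
                invertedIndex[token][pageId] = new List<int>();
            invertedIndex[token][pageId].Add(offset);
        }
    }

    static void Main()
    {
        // Call IndexPage() once per page - here we just index Fort Meow (page id 3)
        IndexPage(3, "Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow.");

        // Look up the inverted index entry for a single token
        foreach (KeyValuePair<int, List<int>> entry in invertedIndex["game"])
            Console.WriteLine($"Page {entry.Key}: offsets {string.Join(", ", entry.Value)}");
    }
}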

Can you write a program that generates a full inverted index like I did in the example above? Try testing it on the test game descriptions at the start of this post.

You may also have noticed that the offsets used here are of the tokens in the list. If you wanted to generate contexts (like Duck Duck Go or Google do just below the title of a result), you'd need to use the character offsets from the source document instead. Can you extend your program to support querying the inverted index, generating contexts based on the inverted index too?

Liked this post? Got your own thoughts on the subject? Having trouble with the challenges at the end? Comment below!

Shift-Reduce Parser Part 2: Building Furniture (1)

Hello and welcome! I got a bit distracted by other things as you can tell, but I'm back with part 2 of my series on building a shift-reduce parser. If you're not sure what I'm talking about, then I'd advise reading part 1 first and then coming back here. It might be a good idea to re-read it anyway, just to refresh your memory :-)

The same flowchart from last time, but with the parse table section highlighted.

Last time, we created some data classes to store the various rules and tokens that we'll be generating. Today, we're going to build on that and start turning a set of rules into a parse table. Let's introduce the rules we'll be working with:

<start> ::= <expression>

<expression> ::= <expression> PLUS <value>
    | <term>

<term> ::= <term> DIVIDE <value>
    | <value>

<value> ::= <number>
    | BRACKET_OPEN <expression> BRACKET_CLOSE

<number> ::= DIGIT
    | <number> DIGIT

The above represents a very basic calculator-style syntax, which only supports adding and dividing. It's written in Backus-Naur Form, which is basically a standardised way of writing parsing rules.

To build a parse table, we first must understand what such a thing actually is. Let's take a look at an example:

state action goto
* + 0 1 $ E B
0 s1 s2 3 4
1 r4 r4 r4 r4 r4
2 r5 r5 r5 r5 r5
3 s5 s6 goal
4 r3 r3 r3 r3 r3
5 s1 s2 7
6 s1 s2 8
7 r1 r1 r1 r1 r1
8 r2 r2 r2 r2 r2

_(Source: Adapted from the LR Parser on Wikipedia.)_

While it looks complex at first, let's break it down. There are 3 parts to this table: the state, the action, and the goto. The action and goto represent what should happen if a particular token is encountered. In this case, the input stream contains both terminal (i.e. DIGIT, DIVIDE, BRACKET_CLOSE, etc. in the case of our BNF above) and non-terminal (i.e. number, term, expression, etc. in the case of our BNF above) symbols - if I understand it correctly - so there are actually 2 parts to the table here to make sure that both are handled correctly.

We'll be connecting this to our lexer, which outputs only terminal symbols, so we should be ok I believe (if you know better, please post a comment below!). The state refers to the state in the table. As I've mentioned before, a given state may contain one or more configurations. It's these configurations that give rise to the actions in the table above, such as s2 (shift and then go to state 2) or r3 (reduce using rule 3).

To use the table, the parser must know what state it's in, and then take a look across the top row for the next symbol it has in the token stream. Once found, it can follow it down to figure out what action it should take, as explained above. If there isn't an action in the box, then there must be an error in the input, as the table doesn't tell us what to do in this situation. To that end, we should try and generate a meaningful error message to help the user to find the mistake in the input (or the developer in the parser!).
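To make that a little more concrete, here's a very rough C♯-flavoured sketch of the loop just described (we won't be writing the real thing until later in the series). It assumes a tokenStream queue of ParserTokens and a table object; the lookups ActionFor(), NextState(), and RuleFor() are purely hypothetical stand-ins for whatever interface the finished ParseTable ends up exposing:

// Hypothetical sketch only - not the final parser!
Stack<ParserToken> symbolStack = new Stack<ParserToken>();
Stack<int> stateStack = new Stack<int>();
stateStack.Push(0); // start in state 0

while (true)
{
    ParserToken next = tokenStream.Peek();
    ParseTableAction action = table.ActionFor(stateStack.Peek(), next); // hypothetical lookup

    if (action == ParseTableAction.Shift)
    {
        symbolStack.Push(tokenStream.Dequeue());                    // consume the terminal...
        stateStack.Push(table.NextState(stateStack.Peek(), next));  // ...and follow the link to the next state
    }
    else if (action == ParseTableAction.Reduce)
    {
        ParserRule rule = table.RuleFor(stateStack.Peek(), next);   // hypothetical lookup
        for (int i = 0; i < rule.RightSideSequence.Length; i++)
        {
            symbolStack.Pop(); // pop the whole right-hand-side off the stack...
            stateStack.Pop();
        }
        symbolStack.Push(rule.LeftSide); // ...and replace it with the left-hand-side non-terminal
        stateStack.Push(table.NextState(stateStack.Peek(), rule.LeftSide)); // the 'goto' part of the table
    }
    else if (action == ParseTableAction.Goal)
    {
        break; // reached the goal - the input parsed successfully
    }
    else
    {
        throw new Exception($"Syntax error: unexpected {next}"); // no action defined for this state / token pair
    }
}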

We're kind of getting ahead of ourselves here though. We need to build this table first, and to do that, we need to figure out which configurations go in which state. And, going down the rabbit hole, we need to know what a configuration is. Again, it's best if I demonstrate. Consider the following parsing rule from our example BNF at the beginning of this post:

<value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE

A single configuration represents a possible state of the parser at a particular instant in time. I could split the above rule up like so:

<value> ::= BRACKET_OPEN * <expression> BRACKET_CLOSE
<value> ::= BRACKET_OPEN <expression> * BRACKET_CLOSE
<value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE *

The asterisk represents where the parser might have gotten up to. Everything to the left is on the stack of the parser, and everything to the right hasn't happened yet.

Since this isn't a top-level rule (in our example that honour goes to the rule for the start non-terminal), the parser will never be in a position where the first item there doesn't exist yet on the stack, so we can ignore the configuration in which the asterisk would be to the left of BRACKET_OPEN.

Confused? Let me try and help here. Let's draw a diagram of how our parser is going to operate:

_(Source: Made by me, but adapted from the LR Parser article on Wikipedia)_

Basically, the parser will be taking in the input token stream and either shift a new terminal token onto the stack, or reduce one or more existing tokens on the stack into a single non-terminal token, which replaces those existing tokens on the stack. The configurations above represent possible states of the stack (the bit to the left of the asterisk), and possible directions that the parser could take when parsing (the bit to the right of the asterisk).

Finally, when the goal is reached, the output is returned to the caller (which, by the time we're done, should be a parse tree). Said tree can then be optimised and processed for whatever purpose we desire!

With this knowledge, we can deduce that we can build the entire table by recursing over the tree of rules from the start state. That way, we'll visit every rule that we'll need to parse everything required to reach the goal state by recursing over all the rules for all the non-terminals referenced by all the rules we visit. We could even generate a warning if we detect that some rules don't connect to this 'tree'. Here's a tree of our example ruleset from the beginning of this post:

A tree diagram of the rules detailed near the beginning of this post.

It's a bit spaghetti-ish, but it should be legible enough :P This gives us an idea as to how we're going to tackle this. Taking into account the data classes we created in the last post, we need to make sure we keep the following in mind:

  1. Since the main ShiftReduceParser class is going to hold the rules, the ParseTable class will need a reference to its parent ShiftReduceParser in order to query the rules.
  2. In light of this, the ShiftReduceParser should be responsible for satisfying any queries the ParseTable has about rules - the ParseTable should not have to go looking & filtering the rule list held by ShiftReduceParser itself.
  3. ParseTable will need a recursive method that will take a single top-level rule and recurse over it and its child rules (according to the tree I've talked about above)
  4. This method in ParseTable will need to be extremely careful it doesn't get stuck in a loop. To that end, it'll have to keep track of whether it's already processed a rule or not.
  5. It'll probably also have to keep track of which configurations it has added to the table class structure we defined in the last post to avoid adding rules twice.
  6. Once ParseTable has figured out all the configurations and grouped them all into the right states, it will then have to recurse over the generated table and fill in all the shift / reduce / goal action(s) - not forgetting about the links to the other states they should point to.

It's quite the laundry list! Thankfully, most of this is quite simple if we tackle it one step at a time. The most annoying bit is the grouping of configurations into states. This is done by looking at the token immediately before the asterisk in each configuration - all the configurations with the same token here will get grouped into the same state (while there are more complex algorithms that allow for more complex grammars, we'll stick with this for now as anything else makes my head hurt! Maybe in the future I'll look at figuring out precisely what kind of LR-style parser this is, and upgrading it to be a canonical LR(1) parser - the most advanced type I know of).

This is quite a lot to take in, so I think I'll leave this post here for you to digest - and we'll get to writing some code in the next one.

Found this useful? Spotted a mistake? Having trouble getting your head around it? Post a comment below!

Shift-reduce Parser Part 1: First Steps

Now that I've done the Languages and Compilers module at University, it's opened my eyes to a much better and more extensible way of handling complex text, in a way that can easily be processed by any subsequent code I write. To that end, I've found that at least 3 different projects of mine could benefit from the inclusion of a shift-reduce parser, but I haven't been able to track one down for C♯ yet.

With this in mind, I thought to myself: "Why not build one myself?" Sure, my Lecturer didn't go into too many details as to how they work, but it can't be that difficult, can it? Oh, I had no idea.....

In this mini-series, I'm going to take you through the process of building a shift-reduce parser of your very own. As I write this, I haven't actually finished mine yet - I've just got to the important milestone of building a parse table! Thankfully, that's going to be a few posts away, as there's a fair amount of ground to cover until we get to that point.

Warning: This series is not for the faint of heart! It's rather complicated, and extremely abstract - making it difficult to get your head around. I've had great difficulty getting mine around it - and ended up writing it in multiple stages. If you want to follow along, be prepared for lots of research, theory, and preparation!

Let's start out by taking a look at what a shift-reduce parser does. If you haven't already, I'd recommend reading my previous compilers 101 post, which explains how to write a compiler, and the different stages involved. I'd also recommend checking out my earlier post on building a lexer, as it ties in nicely with the shift-reduce parser that we'll be building.

An overview of how a shift-reduce works.

In short, a shift-reduce parser compiles a set of BNF-style rules into a Parse Table, which it then utilises as a sort of state machine when parsing a stream of input tokens. We'll take a look at this table compilation process in a future blog post. In this post, let's set up some data structures to help us along when we get to the compilation process in the next blog post. Here's the class structure we'll be going for:

An overview of the class structure we'll be creating in this blog post.

Let's start with a class to represent a single token in a rule:

public enum ParserTokenClass
{
    Terminal,
    NonTerminal
}

public struct ParserToken
{
    public readonly ParserTokenClass Class;
    public readonly string Type;

    public ParserToken(ParserTokenClass inTokenType, string inType)
    {
        Class = inTokenType;
        Type = inType;
    }

    public override bool Equals(object obj)
    {
        ParserToken otherTokenType = (ParserToken)obj;
        return Class == otherTokenType.Class && Type == otherTokenType.Type;
    }
    public override int GetHashCode()
    {
        return $"{Class}:{Type}".GetHashCode();
    }

    public override string ToString()
    {
        string terminalDisplay = Class == ParserTokenClass.Terminal ? "T" : "NT";
        return $"[ParserToken {terminalDisplay}: {Type}]";
    }

    public static ParserToken NonTerm(string inType)
    {
        return new ParserToken(ParserTokenClass.NonTerminal, inType);
    }
    public static ParserToken Term(string inType)
    {
        return new ParserToken(ParserTokenClass.Terminal, inType);
    }
}

Pretty simple! A token in a rule can either be a terminal (basically a token straight from the lexer), or a non-terminal (a token that the parser reduces a set of other tokens into), and has a type - which we represent as a string. Unfortunately, due to the complex comparisons we'll be doing later, it's a huge hassle to use an enum with a generic class as I did in the lexer I built that I linked to earlier.

Later on (once we've built the parse table), we'll extend this class to support attaching values and other such pieces of information to it, but for now we'll leave that out to aid simplicity.

I also override Equals() and GetHashCode() in order to make comparing tokens easier later on. Overriding ToString() makes the debugging process much easier later, as we'll see in the next post!

With a class to represent a token, we also need one to represent a rule. Let's create one now:

public class ParserRule
{
    /// <summary>
    /// A function to call when a reduce operation utilises this rule.
    /// </summary>
    public Action MatchingAction;
    public ParserToken LeftSide;
    public ParserToken[] RightSideSequence;

    public ParserRule(Action inMatchingAction, ParserToken inLeftSide, params ParserToken[] inRightSideSequence)
    {
        if (inLeftSide.Class != ParserTokenClass.NonTerminal)
            throw new ArgumentException("Error: The left-hand side must be a non-terminal token type.");

        MatchingAction = inMatchingAction;
        LeftSide = inLeftSide;
        RightSideSequence = inRightSideSequence;
    }

    public bool RightSideSequenceMatches(IEnumerable<ParserToken> otherRhs)
    {
        int i = 0;
        foreach (ParserToken nextToken in otherRhs)
        {
            if (!nextToken.Equals(RightSideSequence[i]))
                return false;

           i++;
        }
        return true;
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();
        result.Append($"ParserRule: {LeftSide} = ");
        foreach (ParserToken nextToken in RightSideSequence)
            result.Append($" {nextToken}");
        result.Append(";");
        return result.ToString();
    }
}

The above represents a single parser rule, such as <number> ::= <digit> <number>. Here we have the token on the left-hand side (which we make sure is a non-terminal), and an array of tokens (which can be either terminal or non-terminal) for the right-hand side. We also have an Action (which is basically a lambda function) that we'll call when we match against the rule, so that we have a place to hook into when we write code that actually does the tree building (not to be confused with the shift-reduce parser itself).

Here I also add a method that we'll need later, which compares an array of tokens against the current rule, to see if they match - and we override ToString() here again to aid debugging.
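To tie this back to the BNF notation, here's how one of the rules from a calculator-style grammar (the same sort of grammar we'll be using in the next part of this series) might be expressed with these classes. The empty lambda is a placeholder for the tree-building code we'll attach later:

// <expression> ::= <expression> PLUS <value>
ParserRule additionRule = new ParserRule(
    () => { /* tree-building code will be attached here later */ },
    ParserToken.NonTerm("expression"),
    ParserToken.NonTerm("expression"),
    ParserToken.Term("PLUS"),
    ParserToken.NonTerm("value")
);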

Now that we can represent tokens and rules, we can start thinking about representing configurations and states. Not sure what these are? All will be explained in the next post, don't worry :-) For now, a state can be seen as a row in the parse table, and it contains a number of configurations - which are like routes to different other states that the parser decides between, depending on where it's gotten to in the token stream.

public enum ParseTableAction
{
    Shift,
    Reduce,
    Goal,
    Error
}

public class ParseTableConfiguration
{
    public readonly ParserRule Rule;
    public readonly int RhsPosition;

    public ParseTableAction LinkingAction = ParseTableAction.Error;
    public ParseTableState LinkingState = null;

    public ParserToken TokenAfterDot {
        get {
            return Rule.RightSideSequence[RhsPosition];
        }
    }
    public ParserToken TokenBeforeDot {
        get {
            return Rule.RightSideSequence[RhsPosition - 1];
        }
    }

    /// <summary>
    /// Whether this configuration is the last in the sequence of configurations for the specified rule or not.
    /// </summary>
    /// <value><c>true</c> if is last in rule; otherwise, <c>false</c>.</value>
    public bool IsLastInRule {
        get {
            return RhsPosition > Rule.RightSideSequence.Length - 1;
        }
    }

    public ParseTableConfiguration(ParserRule inRule, int inRhsPosition)
    {
        Rule = inRule;
        RhsPosition = inRhsPosition;
    }

    public IEnumerable<ParserToken> GetParsedRhs()
    {
        return Rule.RightSideSequence.TakeWhile((ParserToken token, int index) => index <= RhsPosition);
    }

    public bool MatchesRhsSequence(ParserRule otherRule)
    {
        int i = 0;
        foreach (ParserToken nextToken in otherRule.RightSideSequence)
        {
            if (i > RhsPosition)
                break;

            if (!nextToken.Equals(Rule.RightSideSequence[i])) // compare against this configuration's own rule
                return false;

            i++;
        }
        return true;
    }

    public override bool Equals(object obj)
    {
        ParseTableConfiguration otherConfig = obj as ParseTableConfiguration;
        if (otherConfig == null) return false;
        return Rule == otherConfig.Rule && RhsPosition == otherConfig.RhsPosition;
    }
    public override int GetHashCode()
    {
        return $"{Rule}:{RhsPosition}".GetHashCode();
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();

        result.Append($"Configuration: {LinkingAction} ");
        if (LinkingState != null)
            result.Append($"to State {LinkingState.Id} ");
        result.Append($"{Rule.LeftSide} = ");

        for (int i = 0; i <= Rule.RightSideSequence.Length; i++)
        {
            if (i == RhsPosition)
                result.Append(" * ");
            if (i == Rule.RightSideSequence.Length)
                continue;
            result.Append($"{Rule.RightSideSequence[i]} ");
        }
        result.Append(";");
        return result.ToString();
    }
}

This class is slightly more complicated. First, we define an enum that holds information about what the parser should do if it chooses this configuration. Then, we declare the configuration class itself. This entails specifying which parse rule we're deriving the configuration from, and both which tokens in the right-hand-side of the rule should have been parsed already, and which should still be somewhere in the token stream. Again, I'll explain this in more detail in the next post!

Then, we declare a few utility methods and properties to fetch different parts of the configuration's rule, such as the token to the immediate left and right of the right-hand-side position (which was represented as a dot . in the book I followed), all the tokens before the dot ., and whether a given rule matches this configuration on the basis of everything before the dot ..

Finally, I continue with the trend of overriding the equality checking methods and ToString(), as it makes a world of difference in the code coming up in future blog posts!

Now that we've got a class for configurations, the last one on our list is one for the states themselves. Let's do that now:

public class ParseTableState
{
    public readonly ParseTable ParentTable;

    public int Id {
        get {
            return ParentTable.FindStateId(this);
        }
    }

    public List<ParseTableConfiguration> Configurations = new List<ParseTableConfiguration>();

    public ParseTableState(ParseTable inParentTable)
    {
        ParentTable = inParentTable;
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();
        foreach(ParseTableConfiguration nextConfiguration in Configurations)
            result.AppendLine($"     - {nextConfiguration}");
        return result.ToString();
    }
}

Much simpler than the configuration class, right? :P As I mentioned earlier, all a state consists of is a list of configurations in that state. In our case, we'll be assigning an id to the states in our parse table, so I include a property here that fetches a state's id from the parent parse table that it's part of, to make the later code as simple as possible.

Still with me? Congratulations! You've got the beginnings of a shift-reduce parser. Next time, we'll expand on some of the theory behind the things I've touched on in this post, and possibly look at building the start of the recursive parse table builder itself.

Found this interesting? Confused about something? Comment below!

Securing a Linux Server Part 2: SSH

Wow, it's been a while since I posted something in this series! Last time, I took a look at the Uncomplicated Firewall, and how you can use it to control the traffic coming in (and going out) of your server. This time, I'm going to take a look at steps you can take to secure another vitally important part of most servers: SSH. Used by servers and their administrators across the world to talk to one another, if someone manages to get in who isn't supposed to, they could do all kinds of damage!

The first, and easiest, thing we can do to improve security is to prevent the root user from logging in. If you haven't done so already, you should create a new user on your server, set a good password, and give it superuser privileges. Login with the new user account, and then edit /etc/ssh/sshd_config, finding the line that says something like

PermitRootLogin yes

....and change it to

PermitRootLogin no

Once done, restart the ssh server. Your config might be slightly different (e.g. it might be PermitRootLogin without-password) - but the principle is the same. This adds an extra barrier to getting into your server, as now attackers must not only guess your password, but your username as well (some won't even bother, and keep trying to login to the root account :P).

Next, we can move SSH to a non-standard port. Some might argue that this isn't a good security measure to take and that it doesn't actually make your server more secure, but I find that it's still a good measure to take for 2 reasons: defence in depth, and preventing excessive CPU load from all the dumb bots that try to get in on the default port. With that in mind, let's make another modification to /etc/ssh/sshd_config. Make sure you test at every step you take, as if you lock yourself out, you'll have a hard time getting back in again....

Port 22

Change 22 in the above to any other number between about 1 and 65535. Next, make sure you've allowed the new port through your firewall! If you're using ufw, my previous post (link above) gives a helpful guide on how to do this. Once done, restart your SSH server again - and try logging in before you close your current session. That way, if you make a mistake, you can fix it through your existing session.

Once you're confident that you've got it right, you can close port 22 on your firewall.
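
If you're using ufw, the pair of commands will look something like this - assuming you picked 2222 as your new port (swap in whatever you actually chose; sudo ufw status will remind you how the original port 22 rule was added):

sudo ufw allow 2222/tcp
sudo ufw delete allow 22/tcp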

So we've created a new user account with a secure password (tip: use a password manager if you have trouble remembering it :-)), disabled root login, and moved the ssh port to another port number that's out of the way. Is there anything else we can do? Turns out there is.

Passwords are not the only way we can authenticate against an SSH server. Public-private keypairs can be used too - and, used correctly, they're both more secure and more convenient than passwords. You can generate your own public-private keypair like so:

ssh-keygen -t ed25519

It will ask you a few questions, such as a passphrase to encrypt the private key on disk, and where to save it. Once done, we need to tell ssh to use the new public-private keypair. This is fairly easy to do, actually (though it took me a while to figure out how!). Simply edit ~/.ssh/config (or create it if it doesn't exist), and create (or edit) an entry for your ssh server, making it look something like this:

Host bobsrockets.com
    Port            {port_number}
    IdentityFile    {path/to/private/keyfile}

It's the IdentityFile line that's important. The Port line simply means that you can type ssh bobsrockets.com (or whatever your server is called) and it will figure out the port number for you.
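
One thing to check before going any further: the public half of the keypair needs to be present on the server, in ~/.ssh/authorized_keys for the account you log in with. The ssh-copy-id utility will put it there for you - something along these lines, with the port, username, and hostname swapped for your own:

ssh-copy-id -i ~/.ssh/id_ed25519.pub -p 2222 bob@bobsrockets.com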

With a public-private keypair now in use, there's just one step left: disable password-based logins. I'd recommend trialling it for a while first to make sure you haven't messed anything up - because once you disable it, if you lose your private key, you won't be getting back in again any time soon!

Again, open /etc/ssh/sshd_config for editing. Find the line that starts with PasswordAuthentication, and comment it out with a hash symbol (#), if it isn't already. Directly below that line, add PasswordAuthentication no.
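
The relevant part of /etc/ssh/sshd_config should end up looking something like this:

#PasswordAuthentication yes
PasswordAuthentication no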

Once done, restart ssh for a final time, and check it works. If it does, congratulations! You've successfully secured your SSH server (to the best of my knowledge, of course). Got a tip I haven't covered here? Found a mistake? Let me know in a comment below!

Untangling MSBuild: MSBuild for normal people

I don't know about you, but I find the documentation on MSBuild to be rather confusing. Even their definition of what MSBuild is is a bit misleading:

MSBuild is the build system for Visual Studio.

Whilst having to pull apart the .csproj file of a project of mine and put it back together again to get it to do what I wanted, I spent a considerable amount of time reading Microsoft's (bad) documentation and various web tutorials on what MSBuild is and what it does. I'm bound to forget what I've learnt, so I'm detailing it here both to save myself the bother of looking everything up again and to make sense of everything I've read and experimented with myself.

Before continuing, you might find my previous post, Understanding your compiler: C#, an interesting read if you aren't already aware of some of the basics of Visual Studio solutions, project files, msbuild, and the C♯ compiler.

Let's start with a real definition. MSBuild is Microsoft's build framework that ties into Visual Studio, Mono, and basically anything in the .NET world. It has an XML-based syntax that allows one to describe what MSBuild has to do to build a project - optionally depending on other projects elsewhere in the file system. It's most commonly used to build C♯ projects - but it can be used to build other things, too.

The structure of your typical Visual Studio solution might look a bit like this:

The basic structure of a Visual Studio solution. Explained below.

As you can see, the .sln file references one or more .csproj files, each of which may reference other .csproj files - not all of which have to be tied to the same .sln file, although they usually are (this can be handy if you're using Git Submodules). The .sln file also specifies a default project to build.

.csproj isn't the only file extension recognised by MSBuild, either - there are others, such as .pssproj (PowerShell project), .vcxproj (Visual C++ project), and .targets (shared tasks + targets - we'll get to these later) - though the generic extension is simply .proj.

So far, I've observed that MSBuild is pretty intelligent about automatically detecting project / solution files in its working directory - you can call it with msbuild in a directory and most of the time it'll find and build the right project - even if it finds a .sln file that it has to parse first.
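
To illustrate, here are a few common ways of invoking it from the command line - the solution filename is made up, and /t (target) and /p (property) are the standard switches:

msbuild
msbuild AwesomeSolution.sln
msbuild AwesomeSolution.sln /t:Rebuild /p:Configuration=Release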

Let's get to the real meat of the subject: targets. Consider the following:

<?xml version="1.0" encoding="utf-8"?>
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003" DefaultTargets="Build" ToolsVersion="4.0">
    <Import Project="$(MSBuildBinPath)\Microsoft.CSharp.targets" />

    <Target Name="BeforeBuild">
        <Message Importance="high" Text="Before build executing" />
    </Target>
</Project>

I've simplified it a bit to make it a bit clearer, but in essence the above imports the predefined C♯ set of targets, which includes (amongst others) BeforeBuild, Build itself, and AfterBuild - the first of which is overridden by the local MSBuild project file. Sound complicated? It is a bit!

MSBuild uses a system of targets. When you ask it to do something, you're actually asking it to reach a particular target. By default, this is the Build target. Targets can have dependencies, too - in the case of the above, the Build target depends on the (blank) BeforeBuild target in the default C♯ target set, which is then overridden by the MSBuild project file above.
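
Targets can also declare their dependencies explicitly via the DependsOnTargets attribute. As a quick made-up example, a Package target that won't run until Build has finished might look like this:

<Target Name="Package" DependsOnTargets="Build">
    <Message Importance="high" Text="Packaging up the build output...." />
</Target>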

The second key component of MSBuild project files is tasks. I've used one in the example above - the Message task which, obviously enough, outputs a message to the build output. There are lots of other types of task, too:

  • Reference - reference another assembly or core namespace
  • Compile - specify a C♯ file to compile
  • EmbeddedResource - specify a file to include as an embedded resource
  • MakeDir - create a directory
  • MSBuild - recursively build another MSBuild project
  • Exec - execute a shell command (on Windows this is done with cmd.exe)
  • Copy - copy file(s) and/or director(y/ies) from one place to another

This is just a small sample of what's available - check out the MSBuild task reference for a full list and more information about how each is used. (Strictly speaking, Reference, Compile, and EmbeddedResource are item types that live inside an <ItemGroup>, rather than tasks - but you'll meet them in the same project files.) It's worth noting that an MSBuild target can include multiple tasks one after another - and they'll all be executed in sequence.
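
As a sketch of that last point, here's a made-up AfterBuild target that chains a few of the above together (the README.md file and the docs folder are purely examples):

<Target Name="AfterBuild">
    <MakeDir Directories="$(TargetDir)docs" />
    <Copy SourceFiles="README.md" DestinationFolder="$(TargetDir)docs" />
    <Message Importance="high" Text="Copied the readme into $(TargetDir)docs" />
</Target>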

Unfortunately, if you try to specify a wildcard in an EmbeddedResource directive, both Visual Studio and Monodevelop 'helpfully' try to auto-expand these wildcards to reference the actual files themselves. To avoid this, a clever hack can be instituted that uses the CreateItem task:

<CreateItem Include="$(ProjectDir)/obj/client_dist/**">
    <Output ItemName="EmbeddedResource" TaskParameter="Include" />
</CreateItem>

The above loops over all the files that are in $(ProjectDir)/obj/client_dist, and dynamically creates an EmbeddedResource directive for each - thereby preventing the annoying auto-expansion.

The $(ProjectDir) bit is a variable - the final key component of MSBuild project files. There are a number of built-in variables for different things, but you can also define your own.

  • $(ProjectDir) - The current project's root directory
  • $(SolutionDir) - The root directory of the current solution. Undefined if a project is being built directly.
  • $(TargetDir) - The output directory that the result of the build will be written to.
  • $(Configuration) - The selected configuration to build. Most commonly Debug or Release.

As with the tasks, there's a full list available that you can check out for more information.
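
Defining your own is done with a <PropertyGroup> (MSBuild's own name for these variables is properties). Here's a minimal made-up example - the ClientDistDir name is my own invention:

<PropertyGroup>
    <ClientDistDir>$(ProjectDir)obj/client_dist</ClientDistDir>
</PropertyGroup>

<Target Name="ShowClientDistDir">
    <Message Importance="high" Text="The client files live in $(ClientDistDir)" />
</Target>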

That's the basics of how MSBuild works, as far as I'm aware at the moment. If I discover anything else of note, I'll post again in the future about what I've learnt. If this has helped, please comment below! If you're still confused, let me know in the comments too, and I'll try my best to help you out :-)

Want a tutorial on Git Submodules? Let me know by posting a comment below!

Compilers 101: Build your own flex + bison compiler in a few easy(?) steps

So you want to build your own compiler? Great! Don't know where to start? This guide should help! At University, we're building our own compiler for a custom programming language invented by our lecturer with a pair of tools by the name of flex and bison - which I've blogged about before. Since that post, I've learnt a ton about how the whole process works, so I thought I'd write up a more coherent blog post on the subject :P

A diagram explaining the process of building a compiler. Explained below.

Stage 1: Planning

The whole process starts with railroad diagrams (also known as syntax diagrams) of the language you want to write a compiler for. Having an accurate set of railroad diagrams is essential to understanding precisely how the language is put together, which is rather useful for the next step.

That next step is converting the railroad diagrams into plain BNF (Backus-Naur Form). Unfortunately, bison doesn't support EBNF-like notation at the current time, so only plain-old BNF will do.

Stage 2: Lexing

With your railroad diagrams converted into BNF, you can start writing code! The first chunk of code that needs writing is the lexer. Lexing is what flex is good at - it involves converting the input source code into lexemes: discrete sequences of characters that match a particular pattern and can be assigned a particular category name, turning each into a token. Perhaps an example would help. Consider the following:

void do_awesome_stuff(int a, string b) {
    /* Code here */
}

The above can be turned into a sequence of tokens, not unlike the following (ignoring whitespace tokens, of course):

TYPE: void
IDENTIFIER: do_awesome_stuff
OPEN_BRACKET: (
TYPE: int
IDENTIFIER: a
COMMA: ,
TYPE: string
IDENTIFIER: b
CLOSE_BRACKET: )
OPEN_BRACE: {
COMMENT: /* Code here */
CLOSE_BRACE: }

See? We can identify 8 token types in the source string: TYPE, IDENTIFIER, COMMA, OPEN_BRACKET, CLOSE_BRACKET, OPEN_BRACE, COMMENT, and CLOSE_BRACE. These types and the rules to match them can be found by analysing a combination of the railroad diagrams and the BNF you created earlier.
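
To give you a rough idea, the rules section of a flex file for the above might contain patterns along these lines (this is only a sketch - the COMMENT rule in particular is fiddlier than it looks, so I've left it out, and your token names may well differ):

"void"|"int"|"string"      { return TYPE; }
[a-zA-Z_][a-zA-Z0-9_]*     { return IDENTIFIER; }
","                        { return COMMA; }
"("                        { return OPEN_BRACKET; }
")"                        { return CLOSE_BRACKET; }
"{"                        { return OPEN_BRACE; }
"}"                        { return CLOSE_BRACE; }

Note that the TYPE rule comes before IDENTIFIER - when two rules match the same length of input, flex picks the one defined first.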

Stage 3: Parser the first

With a lexer in hand, we can now look at writing the parser. This is done in two stages: writing the parser itself, and then upgrading said parser to generate a parse tree.

Let's talk about the parser first. The parser can be largely created simply by running a few regular-expression find and replace rules on your BNF, actually. From there, it's just a case of adding the header and the footer to complete the document.

Let's take a look at some example BNF:

<instructions> ::= START <lines> END

<lines> ::= <lines> <line>
    | <line>

<line> ::= <command>

<command> ::= <cmd_name> <number>

<cmd_name> ::= FD
    | BK
    | LT
    | RT

<number> ::= <number> <digit>
    | <digit>

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The above matches something like the following:

START
FD 100
RT 180
FD 125
LT 90
BK 50
END

Very interesting (a virtual cookie is available for anyone who gets the reference as to what this grammar represents!). Let's look at converting that into something bison can understand:

instructions : START lines END
    ;

lines : lines line
    | line
    ;

line : command
    ;

command : cmd_name number
    ;

cmd_name : FD
    | BK
    | LT
    | RT
    ;

number : number digit
    | digit
    ;

digit : 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

That's looking much better already! Simply by using the regular expression substitutions:

  1. <([a-z_]+)> -> $1
  2. ::= -> :
  3. \n\n -> \n\t;\n\n

....we can get most of the way there to something that bison can understand. Next, we need to refactor it a bit to tell it which tokens are coming from the lexer (which I'll leave up to you to write as an exercise), so it doesn't get them confused with the rules - which are defined in the BNF(-like) rules.

Let's get rid of the rules for number and digit first, since we can handle those in the lexer quite easily. Next, let's add a %token definition to the top to tell it which tokens are coming from the lexer. It's good practice to name everything that comes from the lexer in uppercase, and everything that's a rule that exists only in bison in lowercase:

%start instructions
%token FD BK LT RT NUMBER START END

We've also defined a start symbol - the rule that, once reached, tells bison that it has completed the parsing process (bison being a bottom-up parser, it works its way up towards the start symbol rather than down from it).

Lastly, we need to reference the lexer itself. Thankfully that's easy too by appending to your new bison file:

%%

#include "lex.yy.c"

Very nice. Don't forget about the new line at the end of the file - flex and bison will complain if it isn't present! Here's the completed bison file:

%start instructions
%token FD BK LT RT NUMBER START END

instructions : START lines END
    ;

lines : lines line
    | line
    ;

line : command
    ;

command : cmd_name NUMBER
    ;

cmd_name : FD
    | BK
    | LT
    | RT
    ;

%%

#include "lex.yy.c"

With a brand-new bison file completed, there's just one component of the parser left - a plain-old C file that calls it. Let's create one of those quickly:


#include <stdio.h>

int yyparse(void);

int main(void)
{
    #if YYDEBUG == 1
    extern int yydebug;
    yydebug = 1;
    #endif

    return yyparse();
}

void yyerror(char *error_message)
{
    fprintf(stderr, "Error: %s\nExiting\n", error_message);
}

The lines referring to yydebug enable a special debugging mode built into bison, which kicks in if the standard compile-time symbol YYDEBUG is defined - and bison is run with a few special parameters. Here's the sequence of commands needed to compile all this:

flex lexer.l
bison -tv parser.y
gcc -Wall -Wextra -g parser.tab.c main.c -lfl -ly -DYYDEBUG -D_XOPEN_SOURCE=700

The gcc command is probably a bit long-winded, but it does several useful things for us:

  • Show additional warnings just in case we've made a mistake that might be an issue later (-Wall -Wextra)
  • Include additional debugging information in the output file to allow debugging with gdb (the GNU Debugger) if necessary (-g)
  • Define the YYDEBUG symbol, switching on bison's debugging mode (-DYYDEBUG)
  • Fix strange errors on some systems (-D_XOPEN_SOURCE=700)

If you're on a Windows system, you may need to remove the -ly - which appears to be required on the Linux machines I use - it tells gcc that we'll be referencing the bison library.
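
Assuming everything compiled cleanly, gcc will produce an a.out that reads from standard input, so you can test the parser by piping a source file into it (the filename here is made up):

./a.out < program.txt

If you compiled with the debugging flags above, you'll also get a (very) verbose trace of every shift and reduce bison performs as it parses.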

Stage 4: Parser again

Congratulations on getting this far! You've now got a lexer and a parser - so it's time to put them to use. This is done by utilising the parser to build a parse tree - a tree of nodes that represent the input. Here's an example tree:

An example parse tree.

As you can see, each high-level node references one or more lower-level nodes, and the structure of the tree represents the first 2 lines of the example input above. The nodes in yellow are lexical tokens that come directly from flex - these are called terminals, or leaf nodes. The ones in purple come from the bison rules (which we derived from the BNF we wrote at the beginning of this post) - and are non-terminals, or tree nodes.

With this in mind, let's introduce another feature or two of bison. Firstly, let's take a look at revising that %token declaration we created above:

%token FD BK LT RT START END
%token<val_num> NUMBER

The important bit here is the <val_num>. Here we tell bison that a value should be attached to the token NUMBER - and that it will be of type int. After telling bison that it should expect a value, we need to give it a place to put it. Let's write some more code to go just below the %token declarations:

%union {
    int val_num;
}

There we go! Excellent - we've got a place to put it. Now we just need to alter the lexer to convert the token value to an int and put it there. That's not too tough, thankfully - but if you're having trouble with it, here's a hint:

{number}        { yylval.val_num = atoi(yytext); return(NUMBER); }

Now we have it passing numbers correctly, let's look briefly at generating that parse tree. I'm not going to give the game away - just a few helpful hints as to what you need to do here - otherwise it's not as fun :P

Generating the parse tree can be considered both the most challenging part of the experience (especially if you don't know what you're doing) and the easiest to deal with at the same time. Knowing your stuff and your end goal before you start makes the whole process a lot easier.

The first major step is to create a struct that can represent a type of node in your parse tree. It might be useful to store several properties here - such as the node type (an enum will come in handy here), a spot for the value of a lexical token (or a reference to it in a symbol table, if you have one), and references to child nodes in the parse tree.
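
To make that a bit more concrete, here's one possible shape for such a struct in C - the names (NodeType, TreeNode, and the NODE_* constants) are my own inventions, and yours will almost certainly differ, especially once symbol tables get involved:

typedef enum {
    NODE_INSTRUCTIONS,
    NODE_LINES,
    NODE_COMMAND,
    NODE_NUMBER
    /* ....one entry per kind of node you need */
} NodeType;

typedef struct tree_node {
    NodeType node_type;      /* Which kind of node this is */
    int value;               /* The value of a lexical token, if there is one */
    struct tree_node* left;  /* Child nodes - NULL when not needed */
    struct tree_node* right;
} *TreeNode;                 /* TreeNode itself is a pointer, matching the prototype below */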

The second major step of the process is to create a utility method that creates a new node of the tree on the heap, and then revise the bison file so that each rule creates new nodes in such a way that a full parse tree exists by the time the start symbol is reached (that's the top node of the tree - which, in the case of the above, is instructions). For the purposes of this post, I'll be using a method with this prototype:

TreeNode create_node(int item, int node_type, TreeNode left, TreeNode right);

Your tree node struct and subsequent creation method may vary. With this in hand, we can revise the bison rules we created above to create these nodes we've been talking about. Here's a quick pointer on how to revise the rule for command above:

command : cmd_name NUMBER   { $$ = create_node($2, NODE_COMMAND, $1, NULL); }
    ;

This might look a bit strange, but let's break it down. The bit in curly braces is some (almost) plain C code that creates the node and hands a pointer to it back to bison. The $$ is the return value for that rule - which, I might add, reminds me of something I forgot above. We need to tell bison about our new tree node data type and which rules should return it:

%type<val_tnode> instructions lines line command cmd_name

/* And in %union { ... } ..... */
TreeNode val_tnode;

This is almost the same as the %token<val_num> we did before, but here we're defining the return type of a rule - not a token. With that little interlude out of the way, let's return to the code above. $1 and $2 refer to the first and second items in the rule definition respectively - and hold the types that we defined above in the %token and %type directives. Since bison is a bottom-up parser, by the time this code executes all of the rule's child nodes have (hopefully) already been created - and we just have to tie them all together with a new node. In the case of my little example above, $1 is of type TreeNode, and $2 is of type int (that is, if I didn't make any mistakes further up!).
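
Following the same pattern, here's one way (of many - this is just a sketch, and NODE_LINES is another of my made-up constants) that the lines rule could be revised, with 0 standing in for a node that has no token value of its own:

lines : lines line      { $$ = create_node(0, NODE_LINES, $1, $2); }
    | line              { $$ = create_node(0, NODE_LINES, $1, NULL); }
    ;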

Stage 5: Blasting off to code generation and beyond

Phew! That's a lot of work. If you've read this far, thank you and well done! It's been a long journey for both you the reader and me the writer, but you're almost done.

While conceptually simple when broken down, the whole process actually gets rather involved - especially when writing the BNF and the parser (the latter of which can be a particular pain due to shift/reduce and reduce/reduce conflicts), and the amount of code and head-scratching you've got to do to get to this point is enormous. My best advice is to take it slow and don't rush - you'll only cause more problems for yourself if you try to jump the gun. Make sure that each stage works as you intend before you continue - back-pedalling to fix an issue can be particularly annoying, as it can be bothersome to work out which stage the bug is actually in.

The last step of the whole process is to actually do something with the parse tree we've worked so hard to create. Thankfully, that's not too difficult - as we can put some additional code in the { } block of the starting symbol to call methods that will do things like perform some optimisation, print the tree to the console - or generate some sweet code. While the actual generation of code is beyond the scope of this article, I may end up posting about some optimisation techniques you can use on a parse tree after I've finished fiddling with float handling, symbol tables, and initial code generation in my ACW (Assessed Course Work).

Found this useful? Found a bug in this post? Got a suggestion? Comment below! Since I don't have any real analytics on this blog besides the server logs, I've no idea if you've read it really unless you comment :P


Art by Mythdael