
Just another day: More spam, more defences

When this post is released I'll be in an exam, but I wanted to post again about the perfectly fascinating spam situation here on my blog. I've blogged about fending off spam on here before (exhibits a, b, c), but I find the problem of detecting it in a transparent manner - one that you as the reader don't notice - very interesting, so I think I'll write another post on the subject. I could use a service like Google's ReCAPTCHA, but that would be boring :P

Recently I've had a trio of spam comments make it all the way through my (rather extensive) checks and onto my blog here. I removed them, of course, but it still baffled me as to why they made it through.

It didn't take long to find out. When I was first implementing comments on here, I added a logger specifically for purposes such as this, which saves everything about the current environment state to a log file for later inspection - for both comments that make it through, and those that don't. It's not publicly available, of course (but if you'd like to take a look, just ask and I'll consider it). Upon isolating the entries for the spam comments, I discovered a few interesting things.

  • The comment keys were aged 21, 21, and 17 seconds respectively (the lower limit I have set is 10 seconds)
  • All 3 comments claimed that they were Firefox 57
  • 2 out of 3 comments used HTTP 1.0 (even though they claimed to be Firefox 57, and despite my server offering HTTP/1.1 and HTTP/2.0)
  • All 3 comments utilised HTTPS
  • The IP addresses that the comments came from were in Ukraine, Russia, and Canada (hey?) respectively
  • All 3 appear to be phishing scams, with a link leading to a likely malicious website
  • The 2 using HTTP/1.0 also asked my server to close the connection after sending a response
  • All 3 asked not to be tracked via the DNT HTTP header
  • The last comment had some really weird capitalisation. After consulting someone experienced on the subject, I learnt that the writer likely natively spoke an eastern language, such as Chinese

This was most interesting. From this, I can conclude:

  • The last comment was likely submitted by a Chinese operator - even though the source IP address is located in Ukraine
  • All three are spoofing their user agent string.
    • Firefox 57 uses HTTP/2.0 by default if you're really in a browser, and the spam comments utilised HTTP/1.0 and HTTP/1.1
    • Curiously, all of this took place over HTTPS. I'd be really curious to log which cipher was used for the connection here.
    • In light of this, if I knew more about HTTP client libraries, I could probably identify what software was really used to submit the spam comments (and possibly even what operating system it was running on). If you know, please comment below!

To combat this development, I thought of a few options. Firstly, raising the minimum comment age, whilst effective, may disrupt the user experience, which I don't want to do. Plus, the bot owners could just increase the delay even more. To that end, I decided not to do this.

Secondly, with the amount of data I've collected, I could probably write an AI that takes the environment in and spits out a 'spaminess' score, much like SpamAssassin and rspamd do for email. Perhaps a multi-weighted system would work, with a series of tests that add or take away from the final score? I might investigate upgrading my spam detection system to do this in the future, as it would not only block spam more effectively, but provide a more distilled overview of the characteristics of each comment submission than I have currently.
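Such a scoring system doesn't have to be complicated to get started with, either. Here's a minimal sketch of the weighted-scoring idea in Javascript - the checks, weights, and threshold are all made up for illustration, and this isn't the code that actually runs on this blog:

const checks = [
    { name: "http-1.0",     weight: 5, test: env => env.httpVersion === "1.0" },
    { name: "young-key",    weight: 3, test: env => env.keyAge < 20 },
    { name: "dubious-ua",   weight: 2, test: env => /Firefox\/57/.test(env.userAgent) && env.httpVersion !== "2.0" }
];

function spaminess(env) {
    // Sum the weights of every check that flags this submission
    return checks.reduce((score, check) => score + (check.test(env) ? check.weight : 0), 0);
}

// Block the comment if the total crosses a tuned threshold
const threshold = 6;
console.log(spaminess({ httpVersion: "1.0", keyAge: 17, userAgent: "Mozilla/5.0 ... Firefox/57.0" }) >= threshold); // true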

Lastly, I could block HTTP/1.0 requests. While not perfect (1 out of 3 requests used HTTP/1.1), it would still catch some more bots out without disrupting the user experience - as normal browsers (including text-based ones, IIRC) use HTTP/1.1 or above. HTTP/1.1 has been around since 1997 (over 20 years!), so if you're not using it by now - upgrade! For now, this is the best option I can see.

From today, if you try to submit a comment and get an HTTP 505 HTTP Version Not Supported error with a message saying something like this:

You sent your request via HTTP/1.0, but this is not supported for submitting comments due to high volume of spam. Please retry with HTTP/1.1 or higher.

...then you'll have to upgrade and / or reconfigure your web browser. Please let me know (my email address is on the homepage) if this causes any issues for anyone, and I'll help you out.
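For the curious, the check itself is very simple. Here's a quick sketch of the idea in Node.js - this blog isn't powered by Node.js, so this is just to show the principle:

const http = require("http");

const server = http.createServer((request, response) => {
    // request.httpVersion is a string such as "1.0" or "1.1"
    if(request.method === "POST" && parseFloat(request.httpVersion) < 1.1) {
        response.writeHead(505, { "content-type": "text/plain" });
        response.end("Please retry with HTTP/1.1 or higher.\n");
        return;
    }
    // ...normal comment handling would go here...
    response.end("ok\n");
});

server.listen(8080);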

Found this interesting? Know more about this? Got a better solution? Comment below!

Shift-Reduce Parser Part 2: Building Furniture (1)

Hello and welcome! I got a bit distracted by other things as you can tell, but I'm back with part 2 of my series on building a shift-reduce parser. If you're not sure what I'm talking about, then I'd advise reading part 1 first and then coming back here. It might be a good idea to re-read it anyway, just to refresh your memory :-)

The same flowchart from last time, but with the parse table section highlighted.

Last time, we created some data classes to store the various rules and tokens that we'll be generating. Today, we're going to build on that and start turning a set of rules into a parse table. Let's introduce the rules we'll be working with:

<start> ::= <expression>

<expression> ::= <expression> PLUS <value>
    | <term>

<term> ::= <term> DIVIDE <value>
    | <value>

<value> ::= <number>
    | BRACKET_OPEN <expression> BRACKET_CLOSE

<number> ::= DIGIT
    | <number> DIGIT

The above represents a very basic calculator-style syntax, which only supports adding and dividing. It's written in Backus-Naur Form, which is basically a standardised way of writing parsing rules.

To build a parse table, we first must understand what such a thing actually is. Let's take a look at an example:

      |          action          |   goto
state |  *    +    0    1    $   |  E    B
------+--------------------------+---------
  0   |            s1   s2       |  3    4
  1   |  r4   r4   r4   r4   r4  |
  2   |  r5   r5   r5   r5   r5  |
  3   |  s5   s6             goal|
  4   |  r3   r3   r3   r3   r3  |
  5   |            s1   s2       |       7
  6   |            s1   s2       |       8
  7   |  r1   r1   r1   r1   r1  |
  8   |  r2   r2   r2   r2   r2  |

_(Source: Adapted from the LR Parser on Wikipedia.)_

While it looks complex at first, let's break it down. There are 3 parts to this table: the state, the action, and the goto. The action and goto represent what should happen if a particular token is encountered. In this case, the input stream contains both terminal (i.e. DIGIT, DIVIDE, BRACKET_CLOSE, etc. in the case of our BNF above) and non-terminal (i.e. number, term, expression, etc. in the case of our BNF above) symbols - if I understand it correctly - so there are actually 2 parts to the table here to make sure that both are handled correctly.

We'll be connecting this to our lexer, which outputs only terminal symbols, so we should be ok I believe (if you know better, please post a comment below!). The state refers to the state in the table. As I've mentioned before, a given state may contain one or more configurations. It's these configurations that give rise to the actions in the table above, such as s2 (shift and then go to state 2) or r3 (reduce using rule 3).

To use the table, the parser must know what state it's in, and then take a look across the top row for the next symbol it has in the token stream. Once found, it can follow it down to figure out what action it should take, as explained above. If there isn't an action in the box, then there must be an error in the input, as the table doesn't tell us what to do in this situation. To that end, we should try and generate a meaningful error message to help the user to find the mistake in the input (or the developer in the parser!).
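To make that concrete, here's a rough sketch of that table-driven loop in Javascript. It's purely illustrative - the real implementation in this series will be in C♯, and the exact table layout assumed below (each state holding an action lookup keyed by terminal type and a goto lookup keyed by non-terminal type) is my own invention for the sake of the example:

function parse(table, tokens) {
    const stack = [ { state: 0 } ];     // a stack of { state, token } pairs
    let pos = 0;

    while(true) {
        const state = stack[stack.length - 1].state;
        const next = tokens[pos];       // assumes the stream ends with an end-of-input token
        const action = table[state].action[next.type];

        if(typeof action == "undefined")
            throw new Error(`Unexpected token '${next.type}' in state ${state}`);

        if(action.type == "shift") {            // e.g. s2: push the token, move to state 2
            stack.push({ state: action.to, token: next });
            pos++;
        }
        else if(action.type == "reduce") {      // e.g. r3: replace rule 3's right-hand-side with its left-hand-side
            const rule = action.rule;
            const children = stack.splice(-rule.rhs.length);
            const exposed = stack[stack.length - 1].state;
            stack.push({
                state: table[exposed].goto[rule.lhs],
                token: { type: rule.lhs, children: children.map(entry => entry.token) }
            });
        }
        else if(action.type == "goal") {        // we've reached the goal - return the parse tree
            return stack[stack.length - 1].token;
        }
    }
}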

We're kind of getting ahead of ourselves here though. We need to build this table first, and to do that, we need to figure out which configurations go in which state. And, going down the rabbit hole, we need to know what a configuration is. Again, it's best if I demonstrate. Consider the following parsing rule from our example BNF at the beginning of this post:

<value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE

A single configuration represents a possible state of the parser at a particular instant in time. I could split the above rule up like so:

<value> ::= BRACKET_OPEN * <expression> BRACKET_CLOSE
<value> ::= BRACKET_OPEN <expression> * BRACKET_CLOSE
<value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE *

The asterisk represents where the parser might have gotten up to. Everything to the left is on the stack of the parser, and everything to the right hasn't happened yet.

Since this isn't a top-level rule (in our example that honour goes to the rule for the start non-terminal), the parser will never be in a position where the first item there doesn't exist yet on the stack, so we can ignore the configuration in which the asterisk would be to the left of BRACKET_OPEN.

Confused? Let me try and help here. Let's draw a diagram of how our parser is going to operate:

_(Source: Made by me, but adapted from the LR Parser article on Wikipedia)_

Basically, the parser will be taking in the input token stream and either shift a new terminal token onto the stack, or reduce one or more existing tokens on the stack into a single non-terminal token, which replaces those existing tokens on the stack. The configurations above represent possible states of the stack (the bit to the left of the asterisk), and possible directions that the parser could take when parsing (the bit to the right of the asterisk).

Finally, when the goal is reached, the output is returned to the caller (which, by the time we're done, should be a parse tree). Said tree can then be optimised and processed for whatever purpose we desire!

With this knowledge, we can deduce that we can build the entire table by recursing over the tree of rules from the start state. That way, we'll visit every rule that we'll need to parse everything required to reach the goal state by recursing over all the rules for all the non-terminals referenced by all the rules we visit. We could even generate a warning if we detect that some rules don't connect to this 'tree'. Here's a tree of our example ruleset from the beginning of this post:

A tree diagram of the rules detailed near the beginning of this post.

It's a bit spaghetti-ish, but it should be legible enough :P This gives us an idea as to how we're going to tackle this. Taking into account the data classes we created in the last post, we need to make sure we keep the following in mind:

  1. Since the main ShiftReduceParser class is going to hold the rules, the ParseTable class will need a reference to its parent ShiftReduceParser in order to query the rules.
  2. In light of this, the ShiftReduceParser should be responsible for satisfying any queries the ParseTable has about rules - the ParseTable should not have to go looking through & filtering the rule list held by ShiftReduceParser itself.
  3. ParseTable will need a recursive method that will take a single top-level rule and recurse over it and its child rules (according to the tree I've talked about above)
  4. This method in ParseTable will need to be extremely careful that it doesn't get stuck in a loop. To that end, it'll have to keep track of whether it's already processed a rule or not.
  5. It'll probably also have to keep track of which configurations it has added to the table class structure we defined in the last post to avoid adding rules twice.
  6. Once ParseTable has figured out all the configurations and grouped them all into the right states, it will then have to recurse over the generated table and fill in all the shift / reduce / goal action(s) - not forgetting about the links to the other states they should point to.

It's quite the laundry list! Thankfully, most of this is quite simple if we tackle it one step at a time. The most annoying bit is the grouping of configurations into states. This is done by looking at the token immediately before the asterisk in each configuration - all the configurations with the same token here will get grouped into the same state (while there are more complex algorithms that allow for more complex grammars, we'll stick with this for now, as anything else makes my head hurt! Maybe in the future I'll look at figuring out precisely what kind of LR-style parser this is, and upgrading it to be a canonical LR(1) parser - the most advanced type I know of).
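To illustrate just the grouping step, it boils down to something along these lines - Javascript for brevity here (the real implementation next time will be in C♯), and the property names are assumptions based on the classes from the last post:

function groupConfigurations(configurations) {
    const groups = new Map();       // token type -> configurations for that state
    for(const config of configurations) {
        // Configurations with nothing to the left of the asterisk belong to the initial state
        const key = config.rhsPosition == 0 ? "<initial>" : config.tokenBeforeDot.type;
        if(!groups.has(key))
            groups.set(key, []);
        groups.get(key).push(config);
    }
    return [...groups.values()];    // each group will become one state in the table
}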

This is quite a lot to take in, so I think I'll leave this post here for you to digest - and we'll get to writing some code in the next one.

Found this useful? Spotted a mistake? Having trouble getting your head around it? Post a comment below!

Distributing work with Node.js

A graph of the data I generated by writing the scripts I talk about in this post. (Above: A pair of graphs generated with gnuplot from the data I crunched with the scripts I talk about in this blog post. Anti-aliased version - easier to pick details out [928.1 KiB])

I really like Node.js. For those not in the know, it's basically Javascript for servers - and it's brilliant at networking. Like really really good. Like C♯-beating good. Anyway, last week I had a 2-layer neural network for which I wanted to simulate all the different combinations of 1-64 nodes in both layers, as I wanted to generate a 3-dimensional surface graph of the error.

Since my neural network (which is also written in Node.js :P) has a command-line interface, I wrote a simple shell script to drive it in parallel, and set it going on a Raspberry Pi I have acting as a file server (it doesn't do much else most of the time). After doing some calculations, I determined that it would finish at 6:40am Thursday..... next week!

Of course, taking so long is no good at all if you need it done Thursday this week - so I set about writing a script that would parallelise it over the network. In the end I didn't actually include the data generated in my report for which I had the Thursday deadline, but it was a cool challenge nonetheless!

Server

To start with, I created a server script that would allocate work items, called nodecount-surface-server.js. The first job was to set things up and create a quick settings object and a work item generator:

#!/usr/bin/env node
// ^----| Shebang to make executing it on Linux easier

const http = require("http"); // We'll need this later

const settings = {
    port: 32000,
    min: 1,
    max: 64,
};
settings.start = [settings.min, settings.min];

function* work_items() {
    // Note the <= - we want to cover the full 1-64 range described above
    for(let a = settings.start[0]; a <= settings.max; a++) {
        for(let b = settings.start[1]; b <= settings.max; b++) {
            yield [a, b];
        }
    }
}

That function* is a generator. C♯ has them too - and they let a function return more than one item in an orderly fashion. In my case, it returns arrays of numbers which I use as the topology for my neural networks:

[1, 1]
[1, 2]
[1, 3]
[1, 4]
....

Next, I wrote the server itself. Since it was just a temporary script that was running on my local network, I didn't implement too many security measures - please bear this in mind if using or adapting it yourself!


function calculate_progress(work_item) {
    let i = (work_item[0]-1)*64 + (work_item[1]-1), max = settings.max * settings.max;
    return `${i} / ${max} ${(i/max*100).toFixed(2)}%`;
}

var work_generator = work_items();

const server = http.createServer((request, response) => {
    switch(request.method) {
        case "GET":
            let next = work_generator.next();
            let next_item = next.value;
            if(next.done)
                break;
            response.write(next_item.join("\t"));
            console.error(`[allocation] [${calculate_progress(next_item)}] ${next_item}`);
            break;
        case "POST":
            var body = "";
            request.on("data", (data) => body += data);
            request.on("end", () => {
                console.log(body);
                console.error(`[complete] ${body}`);
            })
            break;
    }
    response.end();
});
server.on("clientError", (error, socket) => {
    socket.end("HTTP/1.1 400 Bad Request\r\n\r\n"); // Blank line terminates the response head
});
server.listen(settings.port, () => { console.error(`Listening on ${settings.port}`); });

Basically, the server accepts 2 types of requests:

  • GET requests, which ask for work
  • POST requests, which respond with the results of a work item

In my case, I send out work items like this:

11  24

...and will be receiving work results like this:

11  24  0.2497276811644629

This means that I don't even need to keep track of which work item I'm receiving a result for! If I did though, I'd probably have some kind of ID-based system with a list of allocated work items which I could refer back to - and periodically iterate over to identify any items that got lost somewhere so I can add them to a reallocation queue.
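If I were to add that, the bookkeeping might look something like this - a hypothetical sketch, as none of it exists in the actual server script:

let next_id = 0;
const allocated = new Map();        // id -> { item, allocated_at }
const reallocation_queue = [];

function allocate(item) {
    const id = next_id++;
    allocated.set(id, { item, allocated_at: Date.now() });
    return { id, item };
}

function complete(id) {
    allocated.delete(id);           // the result came back, so forget about it
}

// Periodically sweep for work items that were allocated but never completed
setInterval(() => {
    const timeout = 10 * 60 * 1000; // say, 10 minutes
    for(const [id, record] of allocated) {
        if(Date.now() - record.allocated_at > timeout) {
            reallocation_queue.push(record.item);
            allocated.delete(id);
        }
    }
}, 60 * 1000);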

With that, the server was complete. It outputs the completed work item results to the standard output, and progress information to the standard error. This allows me to invoke it like this:

node ./nodecount-surface-server.js >results.tsv

Worker

Very cool. A server isn't much good without an army of workers ready and waiting to tear through the work items it's serving at breakneck speed though - and that's where the worker comes in. I started writing it in much the same way I did the server:

#!/usr/bin/env node
// ^----| Another shebang, just like the server

const http = require("http"); // We'll need this to talk to the server later
const child_process = require("child_process"); // This is used to spawn the neural network subprocess

const settings = {
    server: { host: "172.16.230.58", port: 32000 },
    worker_command: "./network.js --epochs 1000 --learning-rate 0.2 --topology {topology} <datasets/acw-2-set-10.txt 2>/dev/null"
};

That worker_command there in the settings object is the command I used to execute the neural network, with a placeholder {topology} which we find-and-replace just before execution. For obvious reasons (no plagiarism, thanks!) I can't release that script itself, but it isn't necessary in order to understand how the distributed work item system I've written works. It could just as well be any other command you like!

Next up is the work item executor itself. Since it obviously takes time to execute a work item (why else would I go to such lengths to process as many of them at once as possible :P), I take a callback as the 2nd argument (it's just like a delegate or Action in C♯):


function execute_item(data, callback) {
    let command = settings.worker_command.replace("{topology}", data.join(","));
    console.log(`[execute] ${command}`);
    let network_process = child_process.exec(command, (error, stdout, stderr) =>  {
        console.log(`[done] ${stdout.trim()}`);
        let result = stdout.trim().split(/\t|,/g);
        let payload = `${result[0]}\t${result[1]}\t${result[5]}`;

        let request = http.request({
            hostname: settings.server.host,
            port: settings.server.port,
            path: "/",
            method: "POST",
            headers: {
                "content-length": payload.length
            }
        }, (response) => {
            console.log(`[submitted] ${payload}`);
            callback();
        });
        request.write(payload);
        request.end();
    });
}

In the above I substitute in the work item array as a comma-separated list, execute the command as a subprocess, report the result back to the server, and then call the callback. To report the result back I use the http module built into Node.js, but if I were to tidy this up I would probably use an npm package like got instead, as it simplifies the code a lot and provides more features / better error handling / etc.
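For comparison, here's roughly what that POST might look like with got - an untested sketch, not something I actually ran:

const got = require("got");

async function submit_result(payload) {
    // Send the tab-separated result as the request body
    await got.post(`http://${settings.server.host}:${settings.server.port}/`, { body: payload });
    console.log(`[submitted] ${payload}`);
}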

A work item executor is no good without any work to do, so that's what I tackled next. I wrote another function that fetches work items from the server and executes them - wrapping the whole thing in a Promise to make looping it easier later:


function do_work() {
    return new Promise(function(resolve, reject) {
        let request = http.request({
            hostname: settings.server.host,
            port: settings.server.port,
            path: "/",
            method: "GET"
        }, (response) => {
            var body = "";
            response.on("data", (chunk) => body += chunk);
            response.on("end", () => {
                if(body.trim().length == 0) {
                    console.error(`No work item received. We're done!`);
                    process.exit();
                }
                let work_item = body.split(/\s+/).map((item) => parseInt(item.trim()));
                console.log(`[work item] ${work_item}`);
                execute_item(work_item, resolve);
            });
        });
        request.end();
    });
}

Awesome! It's really coming together. Doing just one work item isn't good enough though, so I took it to the next level:

function* do_lots_of_work() {
    while(true) {
        yield do_work();
    }
}

// From https://starbeamrainbowlabs.com/blog/article.php?article=posts/087-Advanced-Generators.html
function run_generator(g) {
    var it = g(), ret;

    (function iterate() {
        ret = it.next();
        ret.value.then(iterate);
    })();
}

run_generator(do_lots_of_work);

Much better. That completed the worker script - so all that remained was to set it going on as many machines as I could get my hands on, sit back, and watch it go :D

I did have some trouble with crashes at the end because there was no work left for them to do, but it didn't take (much) fiddling to figure out where the problem(s) lay.

Each instance of the worker script can max out a single core of a machine, so multiple instances of the worker script are needed per machine in order to fully utilise a single machine's resources. If I ever need to do this again, I'll probably make use of the built-in cluster module to simplify it, such that I only need to start a single instance of the worker script per machine instead of 1 for each core.
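Something along these lines, perhaps - a sketch only, which I haven't tested, and which assumes the work-fetching logic above is available in the same script:

const cluster = require("cluster");
const os = require("os");

if(cluster.isMaster) {
    for(let i = 0; i < os.cpus().length; i++)
        cluster.fork();                 // re-runs this script as a child process
} else {
    run_generator(do_lots_of_work);     // each child crunches work items independently
}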

Come to think of it, it would have looked really cool if I'd done it at University and employed a whole row of machines in a deserted lab doing the crunching - especially since it was for my report....

Liked this post? Got an improvement? Comment below!

AT24C64 EEPROM and the Arduino

For a project of mine I've bought a bunch of parts. One of those is a bunch of AT24C64 EEPROM chips - which are basically really small SD cards which, in this case, can store 8 KiB (64 Kib) of data - even when the power is switched off, as you'd expect.

I ended up having a bit of trouble getting it to work though, as the library I used appears to have been abandoned and doesn't seem to be in development any more. Still, it works well enough. Anyway, I thought I'd document my findings here for future reference, and to save you a bit of trouble if you find yourself in a similar situation!

The first issue I ran into was in trying to get the associated library to work. I kept getting errors like these:

sketch/eeprom.ino.cpp.o:(.text.loop+0x1c): undefined reference to `AT24CX::writeChars(unsigned int, char*, int)'
sketch/eeprom.ino.cpp.o:(.text.loop+0x20): undefined reference to `AT24CX::readChars(unsigned int, char*, int)'
sketch/eeprom.ino.cpp.o:(.text.loop+0x49): undefined reference to `AT24CX::writeChars(unsigned int, char*, int)'
sketch/eeprom.ino.cpp.o: In function `loop':

Strange. I thought I'd added the #include "lib/AT24Cx/AT24CX.h" to the top? Sure enough, I had. It turns out that the problem actually lay in the fact that I'd used a git submodule to bring in the AT24Cx library, such that the library was not located in the same folder as the .ino file - so the Arduino IDE, in all its wisdom, decided that including the library's .cpp files was hardly necessary O.o

The solution I found was to #include the .cpp explicitly like so:

#include "lib/AT24Cx/AT24CX.cpp"

The other issue I stumbled across was that it couldn't find my EEPROM chip via I2C. Even the demo I2C scanner couldn't find it! It turned out, after searching up a storm on DuckDuckGo, that I needed a pair of 1kΩ resistors stretching from the I2C pins to the +5V power rail. Here's a diagram I created in Fritzing to show you what I mean:

A circuit diagram showing my wiring for the AT24C64. Note the 1kΩ resistors going from the SCL and SDA pins to the +5V power rail on the breadboard.

(svg, fritzing file)

As usual with the various arduino test programs I find / write, you can get a hold of them in my main arduino repository on my personal git server.

Deep Sleep on ESP-Based Chips

If you're interested in the Arduino ecosystem, you've no doubt come across the Wemos family of boards. Based on the ESP8266, they have WiFi and TCP / UDP support built in! While that's very cool indeed for such a low-power device, in this post I'll be focusing on another cool aspect of the chipset, as I'm going to need it for a project in the nearish future (which I might blog about too!).

Curiously, the ESP chipset carries a unique ability to go into so-called 'deep sleep', which turns off everything but an internal counter, which emits a pulse on a specified pin when the sleep time is up. By wiring this specified pin to the RST (Reset) pin with a jumper cable, we can get it to automagically wake itself up after the deep sleep cycle is completed.

A Wemos with the aforementioned cable running from the D0 pin to the RST pin.

This is a lot simpler than the sleep modes available on other (non-ESP) chips - which are explained here for those interested. Here's an example:

void setup() {
    Serial.begin(9600);

    Serial.print("Initialising - ");

    pinMode(D0, WAKEUP_PULLUP);

    Serial.println("done");

    Serial.println("Waiting: ");
    for(int i = 0; i < 5; i++) {
        delay(1000);
        Serial.print(".");
    }
    Serial.println();

    Serial.println("Entering deep sleep. Goodnight!");
    ESP.deepSleep(5 * 1000000);
}

void loop() {

}

Nice and easy, right? You can see how you'd need to factor this into the design of your program before you get too far into it. Note that I multiply the number of seconds by 1000000 - this is because the sleep time is specified in microseconds - not milliseconds or seconds.

When in deep sleep, people have managed to reduce its power consumption down to ~100µA(!) - I'll update this post once I manage to make some measurements of my own.

That's about everything I wanted to mention - just to remind myself on how to do it in a few weeks time :-)

Source and Further Reading

Found this useful? Got a question? Comment below!

Representing clickable links with awkward characters in LaTeX

Hello again! As this semester draws to a close, I thought I'd make a quick post about links in references in LaTeX. I've discovered recently, with the help of a lecturer (thank you!), how to properly represent links in LaTeX references - as I've been having some issues with getting the ones with underscores _ and tildes ~ to display correctly.

For example, if I wanted to cite the Vulkan specification, I might do this in my BibTeX file:

@Misc{Vulkan2016,
    author = {{The Khronos Vulkan Working Group}},
    title = {Vulkan 1.0.31 - A Specification},
    howpublished = {Available online: https://www.khronos.org/registry/vulkan/specs/1.0/xhtml/vkspec.html [Accessed 15/10/2016]},
    year = {2016},
}

This is fine, but that link isn't clickable - and if it contained any awkward characters as described above, I might get weird compilation errors! The solution is to make sure you include hyperref in your main LaTeX file (in my report I do \usepackage[hidelinks]{hyperref} in the top-level .tex file), and then do this:

@Misc{Vulkan2016,
    author = {{The Khronos Vulkan Working Group}},
    title = {Vulkan 1.0.31 - A Specification},
    howpublished = {Available online: \url{https://www.khronos.org/registry/vulkan/specs/1.0/xhtml/vkspec.html} [Accessed 15/10/2016]},
    year = {2016},
}

Problem solved! :D

Found this useful? Still having issues? Got an even better solution? Post a comment below!

Shift-reduce Parser Part 1: First Steps

Now that I've done the Languages and Compilers module at University, my eyes have been opened to a much better and more extensible way of handling complex text, in a form that can easily be read by any subsequent code I write. To that end, I've found that at least 3 different projects of mine could benefit from the inclusion of a shift-reduce parser, but I haven't been able to track one down for C♯ yet.

With this in mind, I thought to myself: "Why not build one myself?" Sure, my Lecturer didn't go into too many details as to how they work, but it can't be that difficult, can it? Oh, I had no idea.....

In this mini-series, I'm going to take you through the process of building a shift-reduce parser of your very own. As I write this, I haven't actually finished mine yet - I've just got to the important milestone of building a parse table! Thankfully, that's going to be a few posts away, as there's a fair amount of ground to cover until we get to that point.

Warning: This series is not for the faint of heart! It's rather complicated, and extremely abstract - making it difficult to get your head around. I've had great difficulty getting mine around it - and ended up writing it in multiple stages. If you want to follow along, be prepared for lots of research, theory, and preparation!

Let's start out by taking a look at what a shift-reduce parser does. If you haven't already, I'd recommend reading my previous compilers 101 post, which explains how to write a compiler, and the different stages involved. I'd also recommend checking out my earlier post on building a lexer, as it ties in nicely with the shift-reduce parser that we'll be building.

An overview of how a shift-reduce parser works.

In short, a shift-reduce parser compiles a set of BNF-style rules into a Parse Table, which it then utilises as a sort of state-machine when parsing a stream of input tokens. We'll take a look at this table compilation process in a future blog post. For now, let's set up some data structures to help us along when we get to it. Here's the class structure we'll be going for:

An overview of the class structure we'll be creating in this blog post.

Let's start with a class to represent a single token in a rule:

public enum ParserTokenClass
{
    Terminal,
    NonTerminal
}

public struct ParserToken
{
    public readonly ParserTokenClass Class;
    public readonly string Type;

    public ParserToken(ParserTokenClass inTokenType, string inType)
    {
        Class = inTokenType;
        Type = inType;
    }

    public override bool Equals(object obj)
    {
        ParserToken otherTokenType = (ParserToken)obj;
        return Class == otherTokenType.Class && Type == otherTokenType.Type;
    }
    public override int GetHashCode()
    {
        return $"{Class}:{Type}".GetHashCode();
    }

    public override string ToString()
    {
        string terminalDisplay = Class == ParserTokenClass.Terminal ? "T" : "NT";
        return $"[ParserToken {terminalDisplay}: {Type}]";
    }

    public static ParserToken NonTerm(string inType)
    {
        return new ParserToken(ParserTokenClass.NonTerminal, inType);
    }
    public static ParserToken Term(string inType)
    {
        return new ParserToken(ParserTokenClass.Terminal, inType);
    }
}

Pretty simple! A token in a rule can either be a terminal (basically a token straight from the lexer), or a non-terminal (a token that the parser reduces a set of other tokens into), and has a type - which we represent as a string. Unfortunately due to the complex comparisons we'll be doing later, it's a huge hassle to use an enum with a template class as I did in the lexer I built that I linked to earlier.

Later on (once we've built the parse table), we'll extend this class to support attaching values and other such pieces of information to it, but for now we'll leave that out to aid simplicity.

I also override Equals() and GetHashCode() in order to make comparing tokens easier later on. Overriding ToString() makes the debugging process much easier later, as we'll see in the next post!

With a class to represent a token, we also need one to represent a rule. Let's create one now:

public class ParserRule
{
    /// <summary>
    /// A function to call when a reduce operation utilises this rule.
    /// </summary>
    public Action MatchingAction;
    public ParserToken LeftSide;
    public ParserToken[] RightSideSequence;

    public ParserRule(Action inMatchingAction, ParserToken inLeftSide, params ParserToken[] inRightSideSequence)
    {
        if (inLeftSide.Class != ParserTokenClass.NonTerminal)
            throw new ArgumentException("Error: The left-hand side must be a non-terminal token type.");

        MatchingAction = inMatchingAction;
        LeftSide = inLeftSide;
        RightSideSequence = inRightSideSequence;
    }

    public bool RightSideSequenceMatches(IEnumerable<ParserToken> otherRhs)
    {
        int i = 0;
        foreach (ParserToken nextToken in otherRhs)
        {
            if (!nextToken.Equals(RightSideSequence[i]))
                return false;

           i++;
        }
        return true;
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();
        result.Append($"ParserRule: {LeftSide} = ");
        foreach (ParserToken nextToken in RightSideSequence)
            result.Append($" {nextToken}");
        result.Append(";");
        return result.ToString();
    }
}

The above represents a single parser rule, such as <number> ::= <digit> <number>. Here we have the token on the left-hand-side (which we make sure is a non-terminal), and an array of tokens (which can be either terminal or non-terminal) for the right-hand-side. We also have an Action (which is basically a lambda function) that we'll call when we match against the rule, so that we have a place to hook into when we write code that actually does the tree building (not to be confused with the shift-reduce parser itself).

Here I also add a method that we'll need later, which compares an array of tokens against the current rule, to see if they match - and we override ToString() here again to aid debugging.

Now that we can represent tokens and rules, we can start thinking about representing configurations and states. Not sure what these are? All will be explained in the next post, don't worry :-) For now, a state can be seen as a row in the parse table, and it contains a number of configurations - which are like routes to different other states that the parser decides between, depending on where it's gotten to in the token stream.

public enum ParseTableAction
{
    Shift,
    Reduce,
    Goal,
    Error
}

public class ParseTableConfiguration
{
    public readonly ParserRule Rule;
    public readonly int RhsPosition;

    public ParseTableAction LinkingAction = ParseTableAction.Error;
    public ParseTableState LinkingState = null;

    public ParserToken TokenAfterDot {
        get {
            return Rule.RightSideSequence[RhsPosition];
        }
    }
    public ParserToken TokenBeforeDot {
        get {
            return Rule.RightSideSequence[RhsPosition - 1];
        }
    }

    /// <summary>
    /// Whether this configuration is the last in the sequence of configurations for the specified rule or not.
    /// </summary>
    /// <value><c>true</c> if is last in rule; otherwise, <c>false</c>.</value>
    public bool IsLastInRule {
        get {
            return RhsPosition > Rule.RightSideSequence.Length - 1;
        }
    }

    public ParseTableConfiguration(ParserRule inRule, int inRhsPosition)
    {
        Rule = inRule;
        RhsPosition = inRhsPosition;
    }

    public IEnumerable<ParserToken> GetParsedRhs()
    {
        return Rule.RightSideSequence.TakeWhile((ParserToken token, int index) => index <= RhsPosition);
    }

    public bool MatchesRhsSequence(ParserRule otherRule)
    {
        int i = 0;
        foreach (ParserToken nextToken in otherRule.RightSideSequence)
        {
            if (i > RhsPosition)
                break;

            if (!nextToken.Equals(Rule.RightSideSequence[i])) // Compare against this configuration's own rule
                return false;

            i++;
        }
        return true;
    }

    public override bool Equals(object obj)
    {
        ParseTableConfiguration otherConfig = obj as ParseTableConfiguration;
        if (otherConfig == null) return false;
        return Rule == otherConfig.Rule && RhsPosition == otherConfig.RhsPosition;
    }
    public override int GetHashCode()
    {
        return $"{Rule}:{RhsPosition}".GetHashCode();
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();

        result.Append($"Configuration: {LinkingAction} ");
        if (LinkingState != null)
            result.Append($"to State {LinkingState.Id} ");
        result.Append($"{Rule.LeftSide} = ");

        for (int i = 0; i <= Rule.RightSideSequence.Length; i++)
        {
            if (i == RhsPosition)
                result.Append(" * ");
            if (i == Rule.RightSideSequence.Length)
                continue;
            result.Append($"{Rule.RightSideSequence[i]} ");
        }
        result.Append(";");
        return result.ToString();
    }
}

This class is slightly more complicated. First, we define an enum that holds information about what the parser should do if it chooses this configuration. Then, we declare the configuration class itself. This entails specifying which parse rule we're deriving the configuration from, and both which tokens in the right-hand-side of the rule should have been parsed already, and which should still be somewhere in the token stream. Again, I'll explain this in more detail in the next post!

Then, we declare a few utility methods and properties to fetch different parts of the configuration's rule, such as the token to the immediate left and right of the right-hand-side position (which was represented as a dot . in the book I followed), all the tokens before the dot ., and whether a given rule matches this configuration on the basis of everything before the dot .

Finally, I continue with the trend of overriding the equality checking methods and ToString(), as it makes a world of difference in the code coming up in future blog posts!

Now that we've got a class for configurations, the last one on our list is one for the states themselves. Let's do that now:

public class ParseTableState
{
    public readonly ParseTable ParentTable;

    public int Id {
        get {
            return ParentTable.FindStateId(this);
        }
    }

    public List<ParseTableConfiguration> Configurations = new List<ParseTableConfiguration>();

    public ParseTableState(ParseTable inParentTable)
    {
        ParentTable = inParentTable;
    }

    public override string ToString()
    {
        StringBuilder result = new StringBuilder();
        foreach(ParseTableConfiguration nextConfiguration in Configurations)
            result.AppendLine($"     - {nextConfiguration}");
        return result.ToString();
    }
}

Much simpler than the configuration class, right? :P As I mentioned earlier, all a state consists of is a list of configurations in that state. In our case, we'll be assigning an id to the states in our parse table, so I include a property here that fetches a state's id from the parent parse table that it's part of, to keep the later code as simple as possible.

Still with me? Congratulations! You've got the beginnings of a shift-reduce parser. Next time, we'll expand on some of the theory behind the things I've touched on in this post, and possibly look at building the start of the recursive parse table builder itself.

Found this interesting? Confused about something? Comment below!

Markdown editors compared

Parts of the 3 markdown editors I'll be comparing in this post.

If you didn't know already, I write all my blog posts here in markdown. I've used several markdown editors over the years (wow it's strange to write that), and I thought I'd talk a little bit about the ones I've used, what I liked about them (and the things I didn't), and what my current preference is.

Firstly though, why would you want one? Couldn't you just use a regular text editor like Notepad++? Well, yes - but a dedicated editor has several benefits: Proper spell-checking for one, a live-preview for another, and other nice features that make the experience just that little bit better (I'm even writing one of my reports for University in Markdown, and I have to say that the experience is much more pleasurable than using Microsoft Word :P).

I like Markdown itself rather a lot too. First invented by John Gruber over on daringfireball.net, Markdown is a simple markup language that's inspired by the things that people already do in instant messaging and other text-based mediums. It's designed to be both easy to read and understand on its own, and easy to write in - such that it doesn't break your flow as a writer by requiring you to look up how to apply that particular bit of formatting (I find myself having to do that with LaTeX and others a lot).

A Screenshot of StackEdit.

(Above: A Screenshot of StackEdit.)

The first contender up is StackEdit. It's an in-browser offering, which saves its data to your local machine (or the cloud). It comes with a number of nice features - apart from not having to install it, of course - such as synchronised scrolling in the live-preview, and a 'publish' button to send your document to a number of different sources automatically.

Since I used it last (which was quite a while ago, actually), it appears to have received a sizeable update, updating the user-interface to be more polished and aesthetically pleasing, and adding a toggleable folder structure to the left-hand-side, amongst other things.

If you can't install anything or run portable programs from a flash drive, StackEdit would be my recommendation.

A screenshot of Classeur.

(Above: A Screenshot of Classeur.)

Next up on my list is Classeur. It's another browser-based offering, with many of the same features, just with a different UI. When I discovered it I was using Stack Edit, and at the time the interface of Classeur was vastly superior.

The main thing I don't like about it is that it's 'freemium' -- meaning that you get to keep about 100 documents on it, and then you either have to delete something or pay. While Markdown is just text documents I can keep on my computer, if I'm going to use a browser-based solution I would prefer to keep them all in the same place (though I never did hit this limit :P).

A screenshot of me writing this post in ghostwriter. Meta!

More recently, now that I've got a travel-laptop running Linux (and not Chrome OS - as nice as that was, I ended up out-growing it), I've been using ghostwriter. It's a desktop application for both Windows and Linux. While it doesn't have synchronised scrolling for the live-preview as StackEdit does, it allows you to save your text files to your local disk (or other mounted partition!), and open them as you would a spreadsheet or other file - in a way that you can't with a browser-based tool.

The interface is also highly customisable - if you don't like the built-in themes, you can write your own. You can also write your own stylesheet for exported documents too. In addition, it automatically detects multiple different markdown renderers that you may or may not have installed, allowing you to switch between them (and the inbuilt sundown processor) at will to get the exported document (e.g. HTML, PDF, EPUB, etc.) looking just the way you want it to.

For me, you can't beat the feeling of a native desktop application, so currently ghostwriter is my markdown editor of choice. If I can't use ghostwriter, I'll probably use StackEdit, with Classeur coming at the bottom of the pile.

If you're thinking of doing some writing, I'd highly suggest considering using a proper markdown editor such as the ones I've mentioned here. If you're not familiar with markdown, fear not! It's easy to learn, and all 3 of the editors featured here feature a quick-reference guide sidebar (or floating window) that you can enable to help you along.

Found this useful? Got a different editor of choice? Comment below!

An epic journey awaits: The hows and whys of DNS (and why DNS privacy is important)

The fancy 1.1.1.1 logo! Read on to find out more. (Above: The logo of Cloudflare's new announcement. Read on to find out more! Sourced from here.)

Hello! I hope everyone had a nice restful Easter. Cloudflare made an exciting announcement recently (more on that later), which inspired me to sit down and write about a vital, but invisible, part of the internet we know today.

It's called DNS (Domain Name System), and I'd like to take you on a journey - showing you what DNS is, how it works, how it can be exploited, and what we can do about it. After all, privacy is important! How does that relate to DNS, you ask? Well, I'll show you - but we're getting a little ahead of ourselves. Let's introduce DNS first. I'll explain what it is, how it works, and why we need it.

Enter Stage Left

DNS is, in many ways, the backbone of the modern internet. While it isn't directly responsible for delivering billions of packets across the internet every day like the Internet Protocol is, its role is still vitally important. DNS is responsible for translating domain names, such as starbeamrainbowlabs.com, bobsrockets.com, or billsboosters.net into an IP address that your device can connect to in order to do whatever else it needs to do.

It does this by sending a UDP datagram (comparison with TCP) to a DNS server to ask it for a specific type of response - usually the IP address associated with a specific domain name. The following query types are most common:

  • A - Returns the IPv4 address(es) associated with the specified domain name
  • AAAA - Same as A, but returns IPv6 addresses instead
  • CNAME - Acts as an alias to another domain name. Usually immediately followed by either an A or AAAA record in the DNS server's response to save time (a DNS server can return multiple items in a single response)
  • MX - A bit like a CNAME, but returns a prioritised list of domains that handle email for the specified domain.
  • TXT - Contains an arbitrary text string. Usually used for easter eggs or for domain ownership verification by various analytics services (e.g. Google Analytics, Bing Webmaster Tools, etc.)
  • NS - Specifies which DNS servers can be queried about the domain.
  • SOA - Specifies what the primary DNS server is that holds the authoritative copy of the DNS records for the specified domain.

Let's try it out

With that in mind, let's try some queries.

(Can't see the above asciicast? Try viewing it over on asciinema.org, or entering the below commands into a computer with DiG)

dig starbeamrainbowlabs.com
dig bbc.co.uk AAAA
dig cloudflare.com MX
dig contact.starbeamrainbowlabs.com TXT
dig github.com TXT

DiG is a command-line DNS client for Linux-like operating systems (if you don't have it already, try sudo apt install dnsutils, or equivalent for your distribution. If you're on Windows without access to a Linux-like machine, try following along with nslookup.). In the above asciicast I make a variety of queries for demonstrative purposes. Note the QUESTION SECTION and ANSWER SECTION bits - they tell us what the query was for, and what the response to that query was. For example, here's an extract from the question and answer sections respectively from the bbc.co.uk lookup in the asciicast:

;bbc.co.uk.         IN  AAAA
bbc.co.uk.      300 IN  AAAA    2a04:4e42:600::81

The bit in the question section is quite straightforward - it's asking for an AAAA record for bbc.co.uk.. The answer section is a bit more complicated. From left to right:

  • bbc.co.uk. - The domain name the response is for.
  • 300 - The time-to-live. In other words, the number of seconds that the response can be cached for.
  • IN - a legacy component. Stands for INternet - more information here
  • AAAA - The type of response record.
  • 2a04:4e42:600::81 - The IPv6 address that the domain name corresponds to.

Am I being spied on?

DNS works rather well, most of the time. The problems start to occur when you start thinking about privacy. With more websites than ever now being served over HTTPS, the data that we transfer between these websites and our devices is now much more secure - and can't be intercepted, analysed, or modified in transit.

DNS, however, is not currently encrypted - which poses a rather serious problem. Anyone able to get hold of your device's network traffic - such as another device on your network in promiscuous mode, your ISP, or literally anyone in between you and your DNS server - can spy on the DNS lookups your device is doing, and even poison your DNS cache - sending you to an attacker's website when you typed in a legitimate domain name!

DNS Cache Poisoning in Action

(Above: A DNS cache poisoning timing attack in action. The attacker responds with a spoofed UDP datagram before the original server has a chance to reply!)

Thankfully, after 35 years of DNS, the internet has some solutions to some of these problems. First up: DNSSEC. Often misunderstood, the protocol tries to prevent man-in-the-middle and timing attacks (such as the one shown in the diagram above) by cryptographically verifying the DNS records returned to the client. Though it's actually 20 years old already, it's still overly-complicated - and subsequently hasn't been rolled out by an awful lot of people. It's also rather weighty - requiring the transfer of cryptographic keys and other associated information.

Preventing cache poisoning is one thing, but it would be nice to prevent nosy onlookers from peering at the DNS queries we're making - and here's where it gets complicated. As of early 2018, there are currently no less than 3 competing standards to provide proper client-server connection encryption:

  • DNS-over-HTTPS - Basically a protocol for sending DNS requests via a standard HTTPS web server. As you can imagine, this can be rather weighty.
  • DNS-over-TLS - As the name implies - DNS queries over a raw TLS connection - which is, in short, a HTTPS connection without the HTTP bit. Now supported natively in Android.
  • DNSCurve - An augmentation to the existing DNS protocol that adds encryption by way of elliptic curves. The supposed official website appears to be a bit biased and inaccurate, so I'm linking to the Wikipedia article here.

A bit of a mess, isn't it? Furthermore, many applications don't yet have support for some (or any) of these protocols. In that regard, it's currently a waiting game. Still, it's interesting to compare the different approaches taken here. Most of these protocols carry significantly more weight than plain-old DNS - with DNS-over-HTTPS being the most weighty, and DNSCurve being the lightest, I should imagine.

I find it especially curious that DNS-over-HTTPS is as popular as it is. Surely it's a bit flawed if you've got to look up the domain name of the HTTPS server that you need to contact in order to do a 'secure' lookup? A safe is only as strong as its weakest point, after all....

But wait, there's more!

Encrypted and verified responses are all very well, but it's no good if the owner of the DNS server themselves is logging all the queries you send to them! Google's 8.8.8.8 service logs a percentage of queries made permanently to disk, and OpenDNS don't appear to have very many details on their website about what data they collect and what they don't!

Furthermore, some DNS servers (especially those controlled by ISPs) tend to have some domain names censored due to agreements with their country's government - preventing you from 'accessing' a website by stopping your device from figuring out where on the internet to talk to.

Clearly, these are serious issues - and the solutions boil down to trust. Who do you trust to send your DNS queries to? If you don't trust any of the aforementioned providers (Google Public DNS or OpenDNS), then you could always run a DNS resolver yourself.

How does it work if you run it yourself? Well, basically instead of your device sending queries to a remote DNS server, they send it to your personal DNS server instead. Your personal DNS server then performs a recursive resolve. Basically, this means that it traverses the requested domain name from right-to-left, analysing and resolving each part in turn. For example, gateway.discord.gg. would be resolved like so:

  • gg.
  • discord.
  • gateway.

For each successive part of the domain name, the DNS server asks the next server in the chain which other DNS servers hold the authoritative records for that part of the domain name (using SOA / NS records), and then repeats the cycle with the servers provided in the response.

Quite quickly you can see that there's an issue here - how does it know where to start? That's where root servers come in. They contain the authoritative information on the Internet's top-level domains. These servers can be queried to figure out which servers hold information about the various country codes (or other codes). The servers that these root servers point to can then be queried to ask who holds information about the various domain names you and I are used to typing in our address bars, such as seanssatellites.io, or billsboosters.edu for instance.
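If you'd like to watch this process happen for yourself, dig can perform exactly this kind of iterative resolution, starting from the root servers, with the +trace flag:

dig +trace gateway.discord.gg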

A simpler alternative

This brings me to the announcement by Cloudflare I mentioned at the beginning of this post. By now, you can probably guess what it is: they've set up a new public DNS server! Apparently, they did a deal with APNIC to let them study the garbage traffic that ends up at 1.1.1.1 in exchange for running a DNS server on it.

Either way, I think it's a brilliant thing for the Internet at large to have another public DNS network to choose from. Especially considering how privacy-conscious they appear to have been in setting it up: they never store client IP addresses, and they delete the anonymised logs after 24 hours. Assuming what they've said is true, I think that it's rather great. For my own personal reference, here are the IP addresses of Cloudflare's new service:

  • 1.1.1.1
  • 1.0.0.1
  • 2606:4700:4700::1111
  • 2606:4700:4700::1001

Conclusion

That brings us to the end of our journey through DNS. We've seen what DNS is and how it works. We've also seen how it can be attacked, and what is being done about it. Lastly, we've taken a look at how running your own recursive resolver works, and looked at Cloudflare's new service.

If you'd like to continue on and explore DNS further, I've left some links below.

Found this informative? Still confused about something? Comment below!

Sources and Further Reading

Read / Write Disk Performance Testing in Bash

Recently I needed to quickly (and non-destructively) test the read / write performance of a flash drive of mine. Naturally, I turned my attention to my terminal. This post is me documenting what I did so that I can remember for next time :P

Firstly, to test the speed of a disk, we need some data to test with. Since lots of small files will inevitably cause slowdowns due to the overhead of writing the file metadata and inode information to the superblock, it makes the most sense to use one gigantic file rather than tons of small ones. Here's what I did to generate a 1 Gigabyte file filled with zeroes:

dd if=/dev/zero of=/tmp/testfile.bin bs=1M count=1024

Cool. Next, we need to copy it to the target disk and measure the time it took. Then, since we know the size of the file (1073741824 bytes, to be exact), we can calculate the speed at which the copy took place. Here's my first attempt:

time dd if=/tmp/testfile.bin >testfile.bin

If you run this, you might find that it doesn't take very long at all, and you get a speed of something like ~250MiB / sec! While impressive, I seriously doubt that my flash drive has that kind of speed behind it. Typically, flash memory takes longer to write to and read from - and I'm pretty sure mine can't be read from that fast either. So what's going on?

Well, it turns out that Linux is caching the disk write operations in a buffer, and then doing them in the background for us. Whilst fine for ordinary operation, this doesn't give us an accurate representation of how fast it's actually writing to the disk. Thankfully, there's something we can do about this: Use the sync command. sync will flush all cached write operations to disk for us, giving us the actual time it took to write the 1 GiB file to disk. Here's the altered command:

sync;
time sh -c 'dd if=/tmp/testfile.bin >testfile.bin; sync'

Very cool! Now, we can just take the time it took and do some simple maths to calculate the write speed of our disk. What about the read speed though? Well, to test that, we'll first need to clear out the page cache - another one of Linux's (many) caches that holds portions of files that have recently been accessed for faster retrieval - because as before, we're not interested in the speed of the cache! Here's how to do that:

echo 1 | sudo tee /proc/sys/vm/drop_caches

With the correct cache cleared, we can test the read speed accurately. Here's how I did it:

time dd if=testfile.bin of=/dev/null

Fairly simple, right? At a later date I might figure out a way of automating this, but for the occasional use now and again this works just fine :)

Found this useful? Got a better way of doing it? Want to say hi? Post in the comments below!
