Starbeamrainbowlabs

Stardust
Blog


Archive


Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems performance photos php pixelbot portable privacy problem solving programming problems projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

Easy Node.JS Dependencies Updates

Once you've had a project around for a while, it's inevitable that dependency updates will become available. Unfortunately, npm (the Node Package Manager), while excellent at everything else, is completely terrible at notifying you about updates.

The solution to this is, of course, to use an external tool. Personally, I use npm-check, which is also installable via npm. It shows you a list of updates to your project's dependencies, like so:

(Can't see the above? View it directly on asciinema.org)

It even supports the packages that you've install globally too, which no other tool appears to do as far as I can tell (although it does appear to miss some packages, such as npm and itself). To install it, simply do this:

sudo npm install --global npm-check

Once done, you can then use it like this:

# List updates, but don't install them
npm-check
# Interactively choose the updates to install
npm-check -u
# Interactively check the globally-installed packages
sudo npm-check -gu

The tool also checks to see which of the dependencies are actually used, and prompts you to check the dependencies it think you're not using (it doesn't always get it right, so check carefully yourself before removing!). There's an argument to disable this behaviour:

npm-check --skip-unused

Speaking of npm package dependencies, the other big issue is security vulnerabilities. GitHub have recently started giving maintainers of projects notifications about security vulnerabilities in their dependencies, which I think is a brilliant idea.

Actually fixing said vulnerabilities is a whole other issue though. If you don't want to update all the dependencies of a project to the latest version (perhaps you're just doing a one-off contribution to a project or aren't very familiar with the codebase yet), there's another tool - this time built-in to npm - to help out - the npm audit subcommand.

# Check for reported security issues in your dependencies
npm audit
# Attempt to fix said issues by updating packages *just* enough
npm audit fix

(Can't see the above? View it directly on asciinema.org)

This helps out a ton with contributing to other projects. Issues arise when the vulnerabilities are not in packages you directly depend on, but instead in packages that you indirectly depend on via dependencies of the packages you've installed.

Thankfully the vulnerabilities in the above can all be traced back to development dependencies (and aren't essential for Peppermint Wiki itself), but it's rather troubling that I can't install updated packages because the packages I depend on haven't updated their dependencies.

I guess I'll be sending some pull requests to update some dependencies! To help me in this, the output of npm audit even displays the dependency graph of why a package is installed. If this isn't enough though, there's always the npm-why which, given a package name, will figure out why it's installed.

Found this interesting? Got a better solution? Comment below!

Generating Atom 1.0 Feeds with PHP (the proper way)

I've generated Atom feeds in PHP before, but recently I went on the hunt to discover if PHP has something like C♯'s XMLWriter class - and it turns out it does! Although poorly documented (probably why I didn't find it in the first place :P), it's actually quite logical and easy to pick up.

To this end, I thought I'd blog about how I used to write the Atom 1.0 feed generator for recent changes on Pepperminty Wiki that I've recently implemented (coming soon in the next release!), as it's so much cleaner than atom.gen.php that I blogged about before! It's safer too - as all the escaping is handled automatically by PHP - so there's no risk of an injection attack because I forgot to escape a character in my library code.

It ends up being a bit verbose, but a few helper methods (or a wrapper class?) should alleviate this nicely - I might refactor it at a later date.

To begin, we need to create an instance of the aptly-named XMLLWriter class. It's probable that you'll need the php-xml package installed in order to use this.

$xml = new XMLWriter();
$xml->openMemory();
$xml->setIndent(true); $xml->setIndentString("\t");
$xml->startDocument("1.0", "utf-8");

In short, the above creates a new XMLWriter instance, instructs it to write the XML to memory, enables pretty-printing of the output, and writes the standard XML document header.

With the document created, we can begin to generate the Atom feed. To figure out the format (I couldn't remember from when I wrote atom.gen.php - that was ages ago :P), I ended following this helpful guide on atomenabled.org. It seems familiar - I think I might have used it when I wrote atom.gen.php. To start, we need something like this:

<feed xmlns="http://www.w3.org/2005/Atom">
    ......
</feed>

In PHP, that translates to this:

$xml->startElement("feed");
$xml->writeAttribute("xmlns", "http://www.w3.org/2005/Atom");

$xml->endElement(); // </feed>

Next, we probably want to advertise how the Atom feed was generated. Useful for letting the curious know what a website is powered by, and for spotting usage of your code out in the wild!

Since I'm writing this for Pepperminty Wiki, I'm settling on something not unlike this:

<generator uri="https://github.com/sbrl/Pepperminty-Wiki/" version="v0.18-dev">Pepperminty Wiki</generator>

In PHP, this translates to this:

$xml->startElement("generator");
$xml->writeAttribute("uri", "https://github.com/sbrl/Pepperminty-Wiki/");
$xml->writeAttribute("version", $version); // A variable defined elsewhere in Pepperminty Wiki
$xml->text("Pepperminty Wiki");
$xml->endElement();

Next, we need to add a <link rel="self" /> tag. This informs clients as to where the feed was fetched from, and the canonical URL of the feed. I've done this:

xml->startElement("link");
$xml->writeAttribute("rel", "self");
$xml->writeAttribute("type", "application/atom+xml");
$xml->writeAttribute("href", full_url());
$xml->endElement();

That full_url() function is from StackOverflow, and calculates the full URI that was used to make a request. As Pepperminty Wiki can be run in any directory on ayn server, I can't pre-determine this url - hence the complexity.

Note also that I output type="application/atom+xml" here. This specifies the type of content that can be found at the supplied URL. The idea here is that if you represent the same data in different ways, you can advertise them all in a standard way, with other formats having rel="alternate". Pepperminty Wiki does this - generating the recent changes list in HTML, CSV, and JSON in addition to the new Atom feed I'm blogging about here (the idea is to make the wiki data as accessible and easy-to-parse as possible). Let's advertise those too:

$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");
$xml->writeAttribute("type", "text/html");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=html");
$xml->endElement();

$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");
$xml->writeAttribute("type", "application/json");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=json");
$xml->endElement();

$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");
$xml->writeAttribute("type", "text/csv");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=csv");
$xml->endElement();

Before we can output the articles themselves, there are a few more pieces of metadata left on our laundry list - namely <updated />, <id />, <icon />, <title />, and <subtitle />. There are others in the documentation too, but aren't essential (as far as I can tell) - and not appropriate in this specific case. Here's what they might look like:

<updated>2019-02-02T21:23:43+00:00</updated>
<id>https://wiki.bobsrockets.com/?action=recent-changes&amp;format=atom</id>
<icon>https://wiki.bobsrockets.com/rocket_logo.png</icon>
<title>Bob's Wiki - Recent Changes</title>
<subtitle>Recent Changes on Bob's Wiki</subtitle>

The <updated /> tag specifies when the feed was last updated. It's unclear as to whether it's the date/time the last change was made to the feed or the date/time the feed was generated, so I've gone with the latter. If this isn't correct, please let me know and I'll change it.

The <id /> element can contain anything, but it must be a globally-unique string that identifies this feed. I've seen other feeds use the canonical url - and I've gone to the trouble of calculating it for the <link rel="self" /> - so it seems a shame to not use it here too.

The remaining elements (<icon />, <title />, and <subtitle />) are pretty self explanatory - although it's worth mentioning that the icon must be square apparently. Let's whip those up with some more PHP:

$xml->writeElement("updated", date(DateTime::ATOM));
$xml->writeElement("id", full_url());
$xml->writeElement("icon", $settings->favicon);
$xml->writeElement("title", "$settings->sitename - Recent Changes");
$xml->writeElement("subtitle", "Recent Changes on $settings->sitename");

PHP even has a present for generating a date string in the correct format required by the spec :D $settings is an object containing the wiki settings that's a parsed form of peppermint.json, and contains useful things like the wiki's name, icon, etc.

Finally, with all the preamble done, we can turn to the articles themselves. In the case of Pepperminty Wiki, the final result will look something like this:

<entry>
    <title type="text">Edit to Compute Core by Sean</title>
    <id>https://seanssatellites.co.uk/wiki/?page=Compute%20Core</id>
    <updated>2019-01-29T10:21:43+00:00</updated>
    <content type="html">&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  Sean&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Compute Core&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Tue, 29 Jan 2019 10:21:43 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 1.36kb&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +1&lt;/li&gt;
&lt;/ul&gt;</content>
    <link rel="alternate" type="text/html" href="https://seanssatellites.co.uk/wiki/?page=Compute%20Core"/>
    <author>
        <name>Sean</name>
        <uri>https://seanssatellites.co.uk/wiki/?page=Users%2FSean</uri>
    </author>
</entry>

There are a bunch of elements here that deserve attention:

  • <title /> - The title of the article. Easy peasy!
  • <id /> - Just like the id of the feed itself, each article entry needs an id too. Here I've followed the same system I used for the feed, and given the url of the page content.
  • <updated /> - The last time the article was updated. Since this is part of a feed of recent changes, I've got this information readily at hand.
  • <content /> - The content to display. If the content is HTML, it must be escaped and type="html" present to indicate this.
  • <link rel="alternate" /> Same deal as above, but on an article-by-article level. In this case, it should link to the page the article content is from. In this case, I link to the page & revision of the change in question. In other cases, you might link to the blog post in question for example.
  • <author /> - Can contain <name />, <uri />, and <email />, and should indicate the author of the content. In this case, I use the name of the user that made the change, along with a link to their user page.

Here's all that in PHP:

foreach($recent_changes as $recent_change) {
    if(empty($recent_change->type))
        $recent_change->type = "edit";

    $xml->startElement("entry");

    // Change types: revert, edit, deletion, move, upload, comment
    $type = $recent_change->type;
    $url = "$full_url_stem?page=".rawurlencode($recent_change->page);

    $content = ".......";

    $xml->startElement("title");
    $xml->writeAttribute("type", "text");
    $xml->text("$type $recent_change->page by $recent_change->user");
    $xml->endElement();

    $xml->writeElement("id", $url);
    $xml->writeElement("updated", date(DateTime::ATOM, $recent_change->timestamp));

    $xml->startElement("content");
    $xml->writeAttribute("type", "html");
    $xml->text($content);
    $xml->endElement();

    $xml->startElement("link");
    $xml->writeAttribute("rel", "alternate");
    $xml->writeAttribute("type", "text/html");
    $xml->writeAttribute("href", $url);
    $xml->endElement();

    $xml->startElement("author");
    $xml->writeElement("name", $recent_change->user);
    $xml->writeElement("uri", "$full_url_stem?page=".rawurlencode("$settings->user_page_prefix/$recent_change->user"));
    $xml->endElement();

    $xml->endElement();
}

I've omitted the logic that generates the value of the <content /> tag, as it's not really relevant here (you can check it out here if you're curious :D).

This about finishes the XML we need to generate for our feed. To extract the XML from the XMLWriter, we can do this:

$atom_feed = $xml->flush();

Then we can do whatever we want to with the generated XML!

When the latest version of Pepperminty Wiki comes out, you'll be able to see a live demo here! Until then, you'll need to download a copy of the latest master version and experiment with it yourself. I'll also include a complete demo feed below:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <generator uri="https://github.com/sbrl/Pepperminty-Wiki/" version="v0.18-dev">Pepperminty Wiki</generator>
    <link rel="self" type="application/atom+xml" href="http://[::]:35623/?action=recent-changes&amp;format=atom&amp;count=3"/>
    <link rel="alternate" type="text/html" href="http://[::]:35623/?action=recent-changes&amp;format=html"/>
    <link rel="alternate" type="application/json" href="http://[::]:35623/?action=recent-changes&amp;format=json"/>
    <link rel="alternate" type="text/csv" href="http://[::]:35623/?action=recent-changes&amp;format=csv"/>
    <updated>2019-02-03T17:25:10+00:00</updated>
    <id>http://[::]:35623/?action=recent-changes&amp;format=atom&amp;count=3</id>
    <icon>data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAMAAAAoLQ9TAAAB3VBMVEXhERHbKCjeVVXjb2/kR0fhKirdHBziDg6qAADaHh7qLy/pdXXUNzfMAADYPj7ZPDzUNzfbHx/fERHpamrqMTHgExPdHx/bLCzhLS3fVFTjT0/ibm7kRkbiLi7aKirdISHeFBTqNDTpeHjgERHYJCTVODjYQkLaPj6/AADVOTnpbW3cIyPdFRXcJCThMjLiTU3ibW3fVVXaKyvcERH4ODj+8fH/////fHz+Fxf4KSn0UFD/CAj/AAD/Xl7/wMD/EhL//v70xMT/+Pj/iYn/HBz/g4P/IyP/Kyv/7Oz0QUH/9PT/+vr/ior/Dg7/vr7/aGj/QED/bGz/AQH/ERH/Jib/R0f/goL/0dH/qan/YWH/7e3/Cwv4R0f/MTH/enr/vLz/u7v/cHD/oKD/n5//aWn+9/f/k5P/0tL/trb/QUH/cXH/dHT/wsL/DQ3/p6f/DAz/1dX/XV3/kpL/i4v/Vlb/2Nj/9/f/pKT+7Oz/V1f/iIj/jIz/r6//Zmb/lZX/j4//T0//Dw/4MzP/GBj/+fn/o6P/TEz/xMT/b2//Tk7/OTn/HR3/hIT/ODj/Y2P/CQn/ZGT/6Oj0UlL/Gxv//f3/Bwf/YmL/6+v0w8P/Cgr/tbX0QkL+9fX4Pz/qNzd0dFHLAAAAAXRSTlMAQObYZgAAAAFiS0dEAIgFHUgAAAAJcEhZcwAACxMAAAsTAQCanBgAAAAHdElNRQfeCxINNSdmw510AAAA5ElEQVQYGQXBzSuDAQCA8eexKXOwmSZepa1JiPJxsJOrCwcnuchBjg4O/gr7D9zk4uAgJzvuMgcTpYxaUZvSm5mUj7TX7ycAqvoLIJBwStVbP0Hom1Z/ejoxrbaR1Jz6nWinbKWttGRgMSSjanPktRY6mB9WtRNTn7Ilh7LxnNpKq2/x5LnBitfz+hx0qxUaxhZ6vwqq9bx6f2XXvuUl9SVQS38NR7cvln3v15tZ9bQpuWDtZN3Lgh5DWJex3Y+z1KrVhw21+CiM74WZo83DiXq0dVBDYNJkFEU7WrwDAZhRtQrwDzwKQbT6GboLAAAAAElFTkSuQmCC</icon>
    <title>Pepperminty Wiki - Recent Changes</title>
    <subtitle>Recent Changes on Pepperminty Wiki</subtitle>
    <entry>
        <title type="text">Edit to Internal link by admin</title>
        <id>http://[::]:35623/?page=Internal%20link</id>
        <updated>2019-01-29T19:55:08+00:00</updated>
        <content type="html">&lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  admin&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Internal link&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Tue, 29 Jan 2019 19:55:08 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 2.11kb&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +2007&lt;/li&gt;
&lt;/ul&gt;</content>
        <link rel="alternate" type="text/html" href="http://[::]:35623/?page=Internal%20link"/>
        <author>
            <name>admin</name>
            <uri>http://[::]:35623/?page=Users%2FInternal%20link</uri>
        </author>
    </entry>
    <entry>
        <title type="text">Edit to Main Page by admin</title>
        <id>http://[::]:35623/?page=Main%20Page</id>
        <updated>2019-01-05T20:14:07+00:00</updated>
        <content type="html">&lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  admin&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Main Page&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Sat, 05 Jan 2019 20:14:07 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 317b&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +68&lt;/li&gt;
&lt;/ul&gt;</content>
        <link rel="alternate" type="text/html" href="http://[::]:35623/?page=Main%20Page"/>
        <author>
            <name>admin</name>
            <uri>http://[::]:35623/?page=Users%2FMain%20Page</uri>
        </author>
    </entry>
    <entry>
        <title type="text">Edit to Main Page by admin</title>
        <id>http://[::]:35623/?page=Main%20Page</id>
        <updated>2019-01-05T17:53:08+00:00</updated>
        <content type="html">&lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  admin&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Main Page&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Sat, 05 Jan 2019 17:53:08 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 249b&lt;/li&gt;
            &lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +31&lt;/li&gt;
&lt;/ul&gt;</content>
        <link rel="alternate" type="text/html" href="http://[::]:35623/?page=Main%20Page"/>
        <author>
            <name>admin</name>
            <uri>http://[::]:35623/?page=Users%2FMain%20Page</uri>
        </author>
    </entry>
</feed>

...this is from my local development instance.

Found this interesting? Confused about something? Want to say hi? Comment below!

Generating word searches for fun and profit

(Above: A Word search generated with the tool below)

A little while ago I was asked about generating a wordsearch in a custom shape. I thought to myself "someone has to have built this before...", and while I was right to an extent, I couldn't find one that let you use any shape you liked.

This, of course, was unacceptable! You've probably guessed it by now, but I went ahead and wrote my own :P

While I wrote it a little while ago, I apparently never got around to posting about it on here.

In short, it works by using an image you drop into the designated area on the page as the shape the word search should take. Each pixel is a single cell of the word search - with the alpha channel representing whether or not a character is allowed to be placed there (transparent means that it can't contain a character, and opaque means that it can).

Creating such an image is simple. Personally, I recommend Piskel or GIMP for this purpose.

Once done, you can start building a wordlist in the wordlist box at the right-hand-side. It should rebuild the word search as soon as you click out of the box. If it doesn't, then you've found a bug! Please report it here.

With the word search generated, you can use the Question Sheet and Answer Sheet links to open printable versions for export.

You can find my word search generator here:

I've generated a word search of the current tags in the tag cloud on this blog too: Question Sheet [50.3KiB], Answer Sheet [285.6KiB]

The most complicated part of this was probably the logistics behind rude word removal. Thankfully, I did't have to find and maintain such a list of words, as the futility npm package does this for me, but algorithmically guaranteeing that by censoring 1 rude word another is not accidentally created in another direction is a nasty problem.

If you're interested in a more technical breakdown of one (or several!) particular aspects of this - let me know! While writing about all of it would probably make for an awfully long post, a specific aspect or two should be more manageable.

In the future, I'll probably revisit this and add additional features to it, such as the ability to restrict which directions words are placed in, for example. If you've got a suggestion of your own, open an issue (or even better, open a pull request :D)!

Demystifying Inverted Indexes

The test texts below overlaying one another in different colours, with a magnifying glass on centred top. (The magnifying glass in the above banner came from openclipart)

After writing the post that will be released after this one, I realised that I made a critical assumption that everyone knew what an inverted index was. Upon looking for an appropriate tutorial online, I couldn't find one that was close enough to what I did in Pepperminty Wiki, so I decided to write my own.

First, some context. What's Pepperminty Wiki? Well, it's a complete wiki engine in a single file of PHP. The source files are obviously not a single file, but it builds into a single file - making it easy to drop into any PHP-enabled web server.

One of its features is a full-text search engine. A personal wiki of mine has ~75k words spread across ~550 pages, and it manages to search them all in just ~450ms! It does this with the aid of an inverted index - which I'll be explaining in this post.

First though, we need some data to index. How about the descriptions of some video games?

Kerbal Space Program

In KSP, you must build a space-worthy craft, capable of flying its crew out into space, without killing them. At your disposal is a collection of parts, which must be assembled to create a functional ship. Each part has its own function and will affect the way a ship flies (or doesn't). So strap yourself in, and get ready to try some Rocket Science!

 

Cross Code

Meet Lea as she logs into an MMO of the distant future. Follow her steps as she discovers a vast world, meets other players and overcomes all the challenges of the game.

 

Fort Meow

Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow.

 

Factory Balls

Factory balls is the brand new bonte game I already announced yesterday. Factory balls takes part in the game design competition over at jayisgames. The goal of the design competition was to create a 'ball physics'-themed game. I hope you enjoy it!

Very cool, this should provide us with plenty of data to experiment with. Firstly, let's consider indexing. Take the Factory Balls description. We can split it up into tokens like this:

T o k e n s V V
factory balls is the brand new bonte game
i already announced yesterday factory balls takes
part in the game design competition over
at jayisgames the goal of the design
competition was to create a ball physics
themed game i hope you enjoy it

Notice how we've removed punctuation here, and made everything lowercase. This is important for the next step, as we want to make sure we consider Factory and factory to be the same word - otherwise when querying the index we'd have to remember to get the casing correct.

With our tokens sorted, we can now count them to create our index. It's like a sort of tally chart I guess, except we'll be including the offset in the text of every token in the list. We'll also be removing some of the most common words in the list that aren't very interesting - these are known as stop words. Here's an index generated from that Factory Balls text above:

Token Frequency Offsets
factory 2 0, 12
balls 2 1, 13
brand 1 4
new 1 5
bonte 1 6
game 3 7, 18, 37
i 2 8, 38
announced 1 10
yesterday 1 11
takes 1 14
design 2 19, 28
competition 2 20, 29
jayisgames 1 23
goal 1 25
create 1 32
ball 1 34
physics 1 35
themed 1 36
hope 1 39
enjoy 1 41

Very cool. Now we can generate an index for each page's content. The next step is to turn this into an inverted index. Basically, the difference between the normal index and a inverted index is that an entry in an inverted index contains not just the offsets for a single page, but all the pages that contain that token. For example, the Cross-Code example above also contains the token game, so the inverted index entry for game would contain a list of offsets for both the Factory Balls and Cross-Code pages.

Including the names of every page under every different token in the inverted index would be both inefficient computationally and cause the index to grow rather large, so we should assign each page a unique numerical id. Let's do that now:

Id Page Name
1 Kerbal Space Program
2 Cross Code
3 Fort Meow
4 Factory Balls

There - much better. In Pepperminty Wiki, this is handled by the ids class, which has a pair of public methods: getid($pagename) and getpagename($id). If an id can't be found for a page name, then a new id is created and added to the list (Pepperminty Wiki calls this the id index) transparently. Similarly, if a page name can't be found for an id, then null should be returned.

Now that we've got ids for our pages, let's look at generating that inverted index entry for game we talked about above. Here it is:

  • Term: game
Id Frequency Offsets
2 1 31
3 1 5
4 3 5, 12, 23

Note how there isn't an entry for page id 1, as the Kerbal Space Program page doesn't contain the token game.

This, in essence, is the basics of inverted indexes. A full inverted index will contain an entry for every token that's found in at least 1 source document - though the approach used here is far from the only way of doing it (I'm sure there are much more advanced ways of doing it for larger datasets, but this came to mind from reading a few web articles and is fairly straight-forward and easy to understand).

Can you write a program that generates a full inverted index like I did in the example above? Try testing it on the test game descriptions at the start of this post.

You may also have noticed that the offsets used here are of the tokens in the list. If you wanted to generate contexts (like Duck Duck Go or Google do just below the title of a result), you'd need to use the character offsets from the source document instead. Can you extend your program to support querying the inverted index, generating contexts based on the inverted index too?

Liked this post? Got your own thoughts on the subject? Having trouble with the challenges at the end? Comment below!

Quest Get: Search large amounts of code!

A map  of the Linux Kernel source code.

(Above: A map of the Linux Kernel source code. Source: this post on medium.)

Recently I was working on a little project of mine (nope, not this for once! :P), and I needed a C♯ class I'd written a while ago. Being forgetful as I am, I had no idea which of my project I'd written it for. And so the quest began to find it! I did in the end, but it left me thinking whether there was a better way to search all my code quickly. This post is the culmination of everything I've discovered so far about the process of searching one's code.

Before I started, I already know about grep, which is built into almost every Linux system around. It's even available for Windows via the MSYS Tools. Unfortunately though, despite it's prevailance, it's not particularly good at searching large numbers of git repositories, as it keeps descending into the .git folder and displaying a whole load of useless results.

Something had to change. After asking reddit, I was introduced to OpenGrok. Written in Java, it indexes all of your code, and provides a web interface through which you can search it. Very nice. Unfortunately, I had trouble figuring out the logistics of actually getting it to run - and discovered that it takes multiple hours to set up correctly.

Moving on, I was re-introduced to ack, written in plain-old Perl, it apparently runs practically any system that Perl does - though it's not installed by default like grep is. Looking into it, I found it to be much like grep - only smarter. It ignores version control directories (like the .git folder ), and common package folders (like node_modules) by default, and even has a system by which results can be filtered by language (with support for hash-bangs too!). The results themselves are coloured by default - making it easy to skim through quickly. Coupled with the flexible configuration file system, ack makes for a wonderfully flexible way to search through large amounts of code quickly.

Though ack looks good, I still didn't have a way to search through all my code that scattered across multiple devices at once, so I kept looking. The next project I found (through alternative to actually) was Text Sherlock. It positions itself as an alternative to OpenGrok that's much simpler to configure.

True to its word, I managed to get a test instance set up running from my /tmp directory in 15 minutes - though it did take a while to index the code I had locally. It also took several seconds to consult its index when I entered a query. I suspect I could alleviate both of these issues by installing Xapian (an open-source high-performance search library), which it appears to have support for.

While the interface was cool, it didn't appear to allow me to tell it which directories not to index, so it ended trawling through all my .git directories - just like grep did. It also doesn't appear to multi-threaded - so it took much longer to index my code than it really needed to (I've got a solid-state drive and enough RAM for a few GBs of cache, so the indexing operation was CPU-bound, not I/O-bound).

In the end, I've rediscovered the awesome search tool ack, and taken a look at the current state of code search tools today. While I haven't yet found precisely what I'm looking for, I'm further forward than when I started.

Other honourable mentions include GNU Global (which apparently needs several GiBs per ~300MiB of source code for its generated static HTML web interface), insight.io (an IDE-like freemium cloud product that 'understands your code'), CodeQuery (only supports C, C++, Java, Python, Ruby, Javascript, and Go), and ripgrep (rust-based program, similar to ack and grep, feature comparison). The official ack website has a good page that contains more tools that are worth a look, too.

Got a cool way to search through all your code? Did this help you out? Comment below!

Building a line-by-line lexer in C#

So there I was. It was a lazy afternoon before my final exam of the semester, and I was idly looking through some old code. One thing led to another, and I ended up writing a line-based scanning lexer in C# - and I thought I'd share it here, pulling it apart and putting it back together again.

The aim was to build something regular expression based, that would be flexible enough that it could be used in a wide-range of applications - but not too difficult or confusing to pick up, use and add to another project. The final code isn't currently available in a public repository (it's actually for a personal project - maybe I'll get around to refactoring it into a library if there's the demand), but I'll still post the finished code at the end of this post.

To start with, let's define some data classes to hold some input and output information. Hrm. Let's see - we'll need a class to represent a rule that the lexer will utilise, and one to represent the tokens that our lexer will be emitting. Let's deal with the one for the rules first:

public class LexerRule<TokenType>
{
    public readonly TokenType Type;
    public readonly Regex RegEx;
    public bool Enabled { get; set; } = true;

    public LexerRule(TokenType inName, string inRegEx)
    {
        if (!typeof(TokenType).IsEnum)
            throw new ArgumentException($"Error: inName must be an enum - {typeof(TokenType)} passed");

        Type = inName;
        RegEx = new Regex(inRegEx);
    }

    public bool Toggle()
    {
        Enabled = !Enabled;
        return Enabled;
    }
}

Here I define a template (or generic) class that holds a regular expression, and associates it with a value from an enum. There's probably a better / cleaner way to make sure that TokenType is an enum, but for now this should serve it's purpose just fine. I also add a simple Enabled boolean property - as we'll be adding support for dynamically enabling and disabling rules later on.

Next up, let's tackle the class for the tokens that we're going to be emitting:

public class LexerToken<TokenType>
{
    public readonly bool IsNullMatch = false;
    public readonly LexerRule<TokenType> Rule = null;
    public readonly Match RegexMatch;

    public TokenType Type {
        get {
            try {
                return Rule.Type;
            }
            catch (NullReferenceException) {
                return default(TokenType);
            }
        }
    }
    private string nullValueData;
    public string Value {
        get {
            return IsNullMatch ? nullValueData : RegexMatch.Value;
        }
    }

    public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
    {
        Rule = inRule;
        RegexMatch = inMatch;
    }
    public LexerToken(string unknownData)
    {
        IsNullMatch = true;
        nullValueData = unknownData;
    }

    public override string ToString()
    {
        return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
    }
}

A little more complex, but still manageable. It, like it's LexerRule cousin, is also a template (or generic) class. It holds the type of token it is and the regular expression Match object generated during the scanning process. It also has something strange going on with Value and nullValueData - this is such that we can emit tokens with an 'unknown' type (more on that later) for the text in between that doesn't match any known rule. We'll be covering this later too.

With our data classes in place, it's time to turn our attention to the lexer itself. Let's put together some scaffolding to get an idea as to how it's going to work:

public class Lexer<TokenType>
{
    public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();

    public int CurrentLineNumber { get; private set; } = 0;
    public int CurrentLinePos { get; private set; } = 0;
    public int TotalCharsScanned { get; private set; } = 0;

    private StreamReader textStream;

    public Lexer()
    {

    }

    public void AddRule(LexerRule<TokenType> newRule);
    public void AddRules(IEnumerable<LexerRule<TokenType>> newRules);

    public void Initialise(StreamReader reader);

    public IEnumerable<LexerToken<TokenType>> TokenStream();

    public void EnableRule(TokenType type);
    public void DisableRule(TokenType type);
    public void SetRule(TokenType type, bool state);
}

There - that should do the trick! CurrentLineNumber, CurrentLinePos, and TotalCharsScanned are properties to keep track of where we've got to, and textStream is the StreamReader we'll be reading data from. Then, we've got some methods that will add new lexer rules to Rules enable and disable rules by token type, a method to initialise the lexer with the correct textStream, and finally a generator method that will emit the tokens.

With our skeleton complete, let's fill out a few of those methods:

public void AddRule(LexerRule<TokenType> newRule)
{
    Rules.Add(newRule);
}
public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
{
    Rules.AddRange(newRules);
}

public void Initialise(StreamReader reader)
{
    textStream = reader;
}

public void EnableRule(TokenType type)
{
    SetRule(type, true);
}
public void DisableRule(TokenType type)
{
    SetRule(type, false);
}
public void SetRule(TokenType type, bool state)
{
    foreach (LexerRule<TokenType> rule in Rules)
    {
        // We have to do a string comparison here because of the generic type we're using in multiple nested
        // classes
        if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
            rule.Enabled = state;
            return;
        }
    }
}

Very cool. None of this is particularly exciting - apart from SetBody. In SetBody we have to convert the type argument ot a string in order to compare it to the rules in the Rules list, as C♯ doesn't seem to understand that the TokenType on the LexerRule class is the same as the TokenType on the Lexer class - even though they have the same name! This did give me an idea for a trio of additional methods to make manipulating rules easier though:

public void EnableRulesByPrefix(string tokenTypePrefix)
{
    SetRulesByPrefix(tokenTypePrefix, true);
}
public void DisableRulesByPrefix(string tokenTypePrefix)
{
    SetRulesByPrefix(tokenTypePrefix, false);
}
public void SetRulesByPrefix(string tokenTypePrefix, bool state)
{
    foreach (LexerRule<TokenType> rule in Rules)
    {
        // We have to do a string comparison here because of the generic type we're using in multiple nested
        // classes
        if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
        {
            rule.Enabled = state;
        }
    }
}

This set of methods let us enable or disable rules based on what they are with. For example, if I have the 3 rules CommentStart, CommentEnd, and FunctionStart, then calling EnableRulesByPrefix("Comment") will enable CommentStart and CommentEnd, but not FunctionStart.

With all the interfacing code out of the way, let's turn tot he real meat of the subject: The TokenStream method. This is the method behind the magic. It's a bit complicated, so let's take it step-by-step. Firstly, we need to iterate over the lines in the StreamReader:

string nextLine;
List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
while ((nextLine = textStream.ReadLine()) != null)
{
    CurrentLinePos = 0;

    // .....

    CurrentLineNumber++;
    TotalCharsScanned += CurrentLinePos;
}

Fairly simple, right? I've used this construct a few times in the past. Before you ask, we'll get to matches in just a moment :-) Next, we need another while loop that iterates until we reach the end of the line:

while (CurrentLinePos < nextLine.Length)
{
    matches.Clear();
    foreach (LexerRule<TokenType> rule in Rules) {
        if (!rule.Enabled) continue;

        Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
        if (!nextMatch.Success) continue;

        matches.Add(new LexerToken<TokenType>(rule, nextMatch));
    }

    // .....
}

Also fairly easy to follow. Basically, we clear the matches list, and then attempt to find the next match from the current position on the line that we've reached (CurrentLinePos) for every rule - and we store all the successful matches for further inspection and processing. We also make sure we skip any disabled rules here, too.

If we don't find any matching rules, then that must mean that we can't match the rest of this line to any known token. In this case, we want to emit an unknown token:

if (matches.Count == 0) {
    yield return new LexerToken<TokenType>(nextLine.Substring(CurrentLinePos));
    break;
}

This is what that extra LexerToken constructor is for that we created above. Note that we yield return here, instead of simply returning - this is very similar in construct to the yield statement in Javascript that I blogged about before (and again here), in that they allow you to maintain state inside a method and return multiple values in sequence.

By an 'unknown token', I am referring to the default value of the TokenType enum. Here's an example enum you might use with this lexer class:

public enum SimpleToken {
    Unknown = 0,

    Whitespace,

    Comment,
    BlockOpen,
    BlockClose,
}

Since the value Unknown is explicitly assigned the index 0, we can be absolutely certain that it's the default value of this enum.

With our list of potential matches in hand, we next need to sort it in order to work our which one we should prioritise. After deliberating and experimenting a bit, I came up with this:

matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
    // Match of offset position position
    int result = nextLine.IndexOf(a.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture) -
                nextLine.IndexOf(b.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture);
    // If they both start at the same position, then go with the longest one
    if (result == 0)
        result = b.RegexMatch.Length - a.RegexMatch.Length;

    return result;
});

Basically, this sorts them so that the matches closest to the current scanning position on the list (CurrentLinePos) are prioritised. If 2 or more matches tie based on this criterion, we prioritise the longest match. This seems to work for now - I can always change it later if it becomes a problem :P

With our matches sorted, we can now pick our the one we're going t emit next. Before we do so though, we need to take care of any characters between our current scanning position and the start of the next token:

LexerToken<TokenType> selectedToken = matches[0];
int selectedTokenOffset = nextLine.IndexOf(selectedToken.RegexMatch.Value, CurrentLinePos) - CurrentLinePos;

if (selectedTokenOffset > 0) {
    CurrentLinePos += selectedTokenOffset;
    yield return new LexerToken<TokenType>(nextLine.Substring(CurrentLinePos, selectedTokenOffset));
}

This emits these additional characters as an unknown token as we did before. Finally, we can emit the token and continue onto the next iteration of the loop:

CurrentLinePos += selectedToken.RegexMatch.Length;
yield return selectedToken;

That concludes our TokenStream method - and with it this lexer! Here's the code in full:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace SBRL.Tools
{
    public class LexerRule<TokenType>
    {
        public readonly TokenType Type;
        public readonly Regex RegEx;
        public bool Enabled { get; set; } = true;

        public LexerRule(TokenType inName, string inRegEx)
        {
            if (!typeof(TokenType).IsEnum)
                throw new ArgumentException($"Error: inName must be an enum - {typeof(TokenType)} passed");

            Type = inName;
            RegEx = new Regex(inRegEx);
        }

        public bool Toggle()
        {
            Enabled = !Enabled;
            return Enabled;
        }
    }

    public class LexerToken<TokenType>
    {
        public readonly bool IsNullMatch = false;
        public readonly LexerRule<TokenType> Rule = null;
        public readonly Match RegexMatch;

        public TokenType Type {
            get {
                try {
                    return Rule.Type;
                }
                catch (NullReferenceException) {
                    return default(TokenType);
                }
            }
        }
        private string nullValueData;
        public string Value {
            get {
                return IsNullMatch ? nullValueData : RegexMatch.Value;
            }
        }

        public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
        {
            Rule = inRule;
            RegexMatch = inMatch;
        }
        public LexerToken(string unknownData)
        {
            IsNullMatch = true;
            nullValueData = unknownData;
        }

        public override string ToString()
        {
            return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
        }
    }

    public class Lexer<TokenType>
    {
        public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();

        public bool Verbose { get; set; } = false;

        /// <summary>
        /// The number of the line that currently being scanned.
        /// </summary>
        public int CurrentLineNumber { get; private set; } = 0;
        /// <summary>
        /// The number of characters on the current line that have been scanned.
        /// </summary>
        /// <value>The current line position.</value>
        public int CurrentLinePos { get; private set; } = 0;
        /// <summary>
        /// The total number of characters currently scanned by this lexer instance.
        /// Only updated every newline!
        /// </summary>
        public int TotalCharsScanned { get; private set; } = 0;

        private StreamReader textStream;

        public Lexer()
        {

        }

        public void AddRule(LexerRule<TokenType> newRule)
        {
            Rules.Add(newRule);
        }
        public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
        {
            Rules.AddRange(newRules);
        }

        public void Initialise(StreamReader reader)
        {
            textStream = reader;
        }

        public IEnumerable<LexerToken<TokenType>> TokenStream()
        {
            string nextLine;
            List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
            while ((nextLine = textStream.ReadLine()) != null)
            {
                CurrentLinePos = 0;

                while (CurrentLinePos < nextLine.Length)
                {
                    matches.Clear();
                    foreach (LexerRule<TokenType> rule in Rules) {
                        if (!rule.Enabled) continue;

                        Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
                        if (!nextMatch.Success) continue;

                        matches.Add(new LexerToken<TokenType>(rule, nextMatch));
                    }

                    if (matches.Count == 0) {
                        string unknownTokenContent = nextLine.Substring(CurrentLinePos);
                        if(Verbose) Console.WriteLine("[Unknown Token: No matches found for this line] {0}", unknownTokenContent);
                        yield return new LexerToken<TokenType>(unknownTokenContent);
                        break;
                    }

                    matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
                        // Match of offset position position
                        int result = nextLine.IndexOf(a.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture) -
                                    nextLine.IndexOf(b.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture);
                        // If they both start at the same position, then go with the longest one
                        if (result == 0)
                            result = b.RegexMatch.Length - a.RegexMatch.Length;

                        return result;
                    });
                    LexerToken<TokenType> selectedToken = matches[0];
                    int selectedTokenOffset = nextLine.IndexOf(selectedToken.RegexMatch.Value, CurrentLinePos) - CurrentLinePos;

                    if (selectedTokenOffset > 0) {
                        string extraTokenContent = nextLine.Substring(CurrentLinePos, selectedTokenOffset);
                        CurrentLinePos += selectedTokenOffset;
                        if(Verbose) Console.WriteLine("[Unmatched content] '{0}'", extraTokenContent);
                        yield return new LexerToken<TokenType>(extraTokenContent);
                    }

                    CurrentLinePos += selectedToken.RegexMatch.Length;
                    if(Verbose) Console.WriteLine(selectedToken);
                    yield return selectedToken;
                }

                if(Verbose) Console.WriteLine("[Lexer] Next line");
                CurrentLineNumber++;
                TotalCharsScanned += CurrentLinePos;
            }
        }

        public void EnableRule(TokenType type)
        {
            SetRule(type, true);
        }
        public void DisableRule(TokenType type)
        {
            SetRule(type, false);
        }
        public void SetRule(TokenType type, bool state)
        {
            foreach (LexerRule<TokenType> rule in Rules)
            {
                // We have to do a string comparison here because of the generic type we're using in multiple nested
                // classes
                if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
                    rule.Enabled = state;
                    return;
                }
            }
        }

        public void EnableRulesByPrefix(string tokenTypePrefix)
        {
            SetRulesByPrefix(tokenTypePrefix, true);
        }
        public void DisableRulesByPrefix(string tokenTypePrefix)
        {
            SetRulesByPrefix(tokenTypePrefix, false);
        }
        public void SetRulesByPrefix(string tokenTypePrefix, bool state)
        {
            foreach (LexerRule<TokenType> rule in Rules)
            {
                // We have to do a string comparison here because of the generic type we're using in multiple nested
                // classes
                if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
                {
                    rule.Enabled = state;
                }
            }
        }
    }
}

It's a bit much to take in all at once, but hopefully by breaking it down into steps I've made it easier to understand how I built it, and how all the different pieces fit together. The only real difference between the above code and the code I walked through in this post is the Verbose parameter I added for testing purposes, and the associated Console.WriteLine calls. For fun, here's a very basic LOGO (also here) lexer. I've based it on what I remember from using MSWLogo / FMSLogo a long time ago (there seem to be many dialects around these days):

public enum LogoToken
{
    Unknown = 0,

    Whitespace,

    FunctionForwards,
    FunctionBackwards,
    FunctionLeft,
    FunctionRight,
    FunctionPenUp,
    FunctionPenDown,

    Number
}

public class LogoLexer : Lexer<LogoToken>
{
    public LogoLexer()
    {
        AddRules(new List<LexerRule<LogoToken>>() {
            new LexerRule<LogoToken>(LogoToken.Whitespace,    @"\s+"),

            new LexerRule<LogoToken>(LogoToken.FunctionForwards,    @"FD"),
            new LexerRule<LogoToken>(LogoToken.FunctionBackwards,    @"BK"),
            new LexerRule<LogoToken>(LogoToken.FunctionLeft,    @"LT"),
            new LexerRule<LogoToken>(LogoToken.FunctionRight,    @"RT"),

            new LexerRule<LogoToken>(LogoToken.FunctionPenUp,    @"PU"),
            new LexerRule<LogoToken>(LogoToken.FunctionPenDown,    @"PD"),

            new LexerRule<LogoToken>(LogoToken.Number,    @"\d+"),
        });
    }
}

Here's an example LOGO program that it parses:

...and here's the output from lexing that example program:

[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=100]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=90]
[Lexer] Next line
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=50]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionPenUp, Value=PU]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=180]
[Lexer] Next line
[LexerToken: Type=FunctionBackwards, Value=BK]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=40]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionLeft, Value=LT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=45]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=250]
[Lexer] Next line
[Lexer] Next line

Very cool! This could easily be extended to support more of the LOGO syntax. As an exercise, can you extend it to support the REPEAT statement? At some point in the future, I might go even further and build a bottom-up left-to-right shift-reduce parser, and combine it with this lexer and some BNF to create a parse tree.

Enjoyed this post? Don't quite understand something? Think you could do better? Post a comment below!

GlidingSquirrel is now on NuGet with automatic API documentation!

(This post is a bit late as I'm rather under the weather.)

A while ago, I posted about a HTTP/1.0 server library I'd written in C♯ for a challenge. One thing led to another, and I've now ended up adding support for HTTP/1.1, WebSockets, and more....

While these enhancements have been driven mainly by a certain project of mine that keeps stealing all of my attention, they've been fun to add - and now that I've (finally!) worked out how to package it as a NuGet package, it's available on NuGet for everyone to use, under the Mozilla Public License 2.0!

Caution should be advised though, as I've not thoroughly tested it yet to weed out the slew of bugs and vulnerabilities I'm sure are hiding in there somewhere - it's designed mainly to sit behind a reverse-proxy such as Nginx (not that's it's any excuse, I know!) - and also I don't have a good set of test cases I can check it against (I thought for sure there would be at least one test suite out there for HTTP servers, considering the age of the protocol, but apparently not!).

With that out of the way, specifying dependencies didn't turn out to be that hard, actually. Here's the extra bit I added to the .nuspec file to specify the GlidingSquirrel's dependencies:

<dependencies>
    <dependency id="MimeSharp" version="1.0.0" />
    <dependency id="System.ValueTuple" version="4.4.0" />
</dependencies>

The above goes inside the <metadata> element. If you're curious about what this strange .nuspec file is and it's format, I've blogged about it a while back when I figured out how to package my TeleConsole Client - in which I explain my findings and how you can do it yourself too!

I've still not figured out the strange $version$ and $description$ things that are supposed to reference the appropriate values in the .csproj file (if you know how those work, please leave a comment below, as I'd love to know!) - as it keeps telling me that <description> cannot be empty.....

In addition to packaging it and putting it up on NuGet, I've also taken a look at automatic documentation generation. For a while I've been painstakingly documenting the entire project with intellisense comments for especially this purpose - which I'm glad I started early, as it's been a real pain and taken a rather decent chunk of time to do.

Anyway, with the last of the intellisense comments finished today, I set about generating automatic documentation - which turned out to be a surprisingly difficult thing to achieve.

  • Sandcastle, Microsoft's offering has been helpfully discontinued.
  • Sandcastle Help File Builder, the community open-source fork of Sandcastle, appears to be Windows-only (I use Linux myself).
  • DocFX, Microsoft's new offering, appears to be more of a static site generator that uses your code and a bunch of other things as input, rather than the simple HTML API documentation generator I was after.
  • I found Docxygen to produce an acceptable result out-of-the-box, I discovered that configuring it was not a trivial task - the initial configuration file the init action created was over 100 KiB!

I found a few other options too, but they were either Windows-only, or commercial offerings that I can't justify using for an open-source project.

When I was about to give up the search for the day, I stumbled across this page on generating documentation by the mono project, who just happen to be the ones behind mono, the cross-platform .NET runtime that runs on Linux, macOS, and, of course, Windows. They're also the ones who have built mcs, the C♯ compiler that compliments the mono runtime.

Apparently, they've also built a documentation generation that has the ability to export to HTML. While it's certainly nothing fancy, it does look like you've all the power you need at your fingertips to customise the look and feel by tweaking the XSLT stylesheet it renders with, should you need it.

After a bit of research, I found this pair of commands to build the documentation for me quite nicely:

mdoc update -o docs_xml/ -i GlidingSquirrel.xml GlidingSquirrel.dll
mdoc export-html --out docs/ docs_xml

The first command builds the multi-page XML tree from the XML documentation generated during the build process (make sure the "Generate xml documentation" option is ticked in the project build options!), and the second one transforms that into HTML that you can view in your browser.

With the commands figured out, I set about including it in my build process. Here's what I came up with:

<Target Name="AfterBuild">
    <Message Importance="high" Text="----------[ Building documentation ]----------" />

    <Exec Command="mdoc update -o docs_xml/ -i GlidingSquirrel.xml GlidingSquirrel.dll" WorkingDirectory="$(TargetDir)" IgnoreExitCode="true" />
    <Exec Command="mdoc export-html --out $(SolutionDir)/docs docs_xml" WorkingDirectory="$(TargetDir)" IgnoreExitCode="true" />
</Target>

This outputs the documentation to the docs/ folder in the root of the solution. If you're a bit confused as to how to utilise this in your own project, then perhaps the Untangling MSBuild post I wrote to figure out how MSBuild works can help! You can view the GlidingSquirrel's automatically-generated documentation here. It's nothing fancy (yet), but it certainly does the job. I'll have to look into prettifying it later!

Finding the distance to a (finite) line from a point in Javascript

A screenshot of the library I've written. Explanation below.

For a project of mine (which I might post about once it's more stable), I'm going to need a way to find the distance to a point from the mouse cursor to implement an eraser. I've attempted this problem before - but it didn't exactly go to plan. To that end, I decided to implement the algorithm on its own to start with - so that I could debug it properly without all the (numerous) moving parts of the project I'm writing it for getting in the way.

As you may have guessed since you're reading this post, it actually went rather well! Using the C++ implementation on this page as a reference, it didn't take more than an hour or two to get a reasonable implementation working - and it didn't take a huge amount of time to tidy it up into an npm package for everyone to use!

My implementation uses ES6 Modules - so you may need to enable them in about:config or chrome://flags if you haven't already (don't believe the pages online that say you need Firefox / Chrome nightly - it's available in stable, just disabled by default) before taking a look at the demo, which you can find here:

Line Distance Calculator

(Click and drag to draw a line - your distance from it is shown in the top left)

The code behind it is actually quite simple - just rather full of nasty maths that will give you a headache if you try and understand it all at once (I broke it down, which helped). The library exposes multiple methods to detect a point's distance from different kinds of line - one for multi-segmented lines (which I needed in the first place), one for a single (finite) line (which the multi-segmented line employs), and one for a single infinite line - which I implemented first, using this Wikipedia article - before finding that it was buggy because it was for an infinite line (even though the article's name is apparently correct)!

I've written up a usage guide if you're interested in playing around with it yourself.

I've also got another library that I've released recently (also for Nibriboard) that simplifies multi-segmented lines instead of finding the distance to them, which I may post about about soon too!

Update: Looks like I forgot that I've already posted about the other library! You can read about it here: Line Simplification: Visvalingam's Algorithm

Got a question? Wondering why I've gone to the trouble of implementing such an algorithm? Comment below - I'd love to hear your thoughts!

PixelBot Part 2: Devices need protocols, apparently

A selection of the technologies I'm using to put together my PixelHub server.

So there I was. I'd just got home, turned on my laptop, opened the Arduino IDE and Monodevelop, and then.... nothing. I knew I wanted my PixelBot to talk to the PixelHub I'd started writing, but I was confused as to how I could make it happen.

In this kind of situation, I realised that although I knew what I wanted them to do, I hadn't figured out the how. As it happens, when you're trying to get one (or more, in this case) different devices to talk to each other, there's something rather useful that helps them all to speak the same language: a protocol.

A protocol is a specification that defines the language that different devices use to talk to each other and exchange messages. Defining one before you start writing a networked program is probably a good idea - I find particularly helpful to write a specification for the protocol that the program(s) I'm writing, especially if their function(s) is/are complicated.

To this end, I've ended up spending a considerable amount of time drawing up the PixelHub Protocol - a specification document that defines how my PixelHub server is going to talk to a swarm of PixelBots. It might seem strange at first, but I decided on a (mostly) binary protocol.

Upon closer inspection though, (I hope) it makes a lot of sense. Since the Arduino is programmed using C++ as it has a limited amount of memory, it doesn't have any of the standard string manipulation function that you're used to in C♯. Since C++ is undoubtedly the harder of the 2 to write, I decided to make it easier to write the C++ rather than the C&sharp. Messages on the Arduino side are come in as a byte[] array, so (in theory) it should be easy to pick out certain known parts of the array and cast them into various different fundamental types.

With the specification written, the next step in my PixelBot journey is to actually implement it, which I'll be posting about in the next entry in this series!

My PixelBot is Connected! (Part 1)

After many attempts and solving many problems, I've finally gotten my Wemos-powered PixelBot to connect via TCP to a C# server component I've written. Since I experienced so many problems with it, I decided to post about it here to help others who want to do the same as I.

The first (and arguably most difficult) hurdle I came across was the lack of correct information. For one, the TCP client class is actually called WifiClient (which is confusing in and of itself). For another, there aren't any really good tutorials out there that show you what to do.

In addition, I wanted to build a (rather complicated as it turns out!) auto discovery system to allow my PixelBot to find the PixelHub (the server) automatically. As usual, there was even less information about this task! All I found was an outdated guide and rather simplistic example. I ended up inspecting the header file of the WiFiUDP class in order to figure it out.

In this post, I'm going to explain how I got the autodiscovery mechanism working. Before we begin, you need to understand what multicast UDP is and how it works. For those of you who don't, I've posted all about it.

There are several ways that one can go about writing an autodiscovery mechanism. While Rob Miles chose mDNS, I ended up approaching the problem from a different angle and writing my own protocol. It's best explained with a diagram:

A diagram explaining the PixelBot's autodiscovery mechanism.

  1. The PixelHub server has a beacon that sends out pings to tell the PixelBots where it is
  2. The PixelBots subscribe to the beacon ping channel when they want to find the server
  3. The PixelBots decode the incoming ping packets to find the discover of the server
  4. The PixelBots connect to the PixelHub server using the credentials that it found in the beacon pings it decoded!

In order to receive these beacon pings, we need to set up a UDP listener and wire it up to receive multicast packets. In the following example I'm multicasting on 239.62.148.30 on port 5050.

IPAddress beaconAddress = IPAddress(239, 62, 148, 30);
unsigned int beaconPort = 5050;

WiFiUDP UdpClient;
UdpClient.beginMulticast(WiFi.localIP(), beaconAddress, beaconPort);

After connecting to the multicast address, we then need to receive the pings:

// Create a buffer to hold the message 
byte datagramBuffer[datagramBufferSize];
// Prefill the datagram buffer with zeros for protection later
memset(datagramBuffer, '\0', datagramBufferSize);
while(true) {
    int datagramSize = UdpClient.parsePacket();
    Serial.print("Received datagram #");
    Serial.print(datagramSize);
    Serial.print(" bytes in size from ");
    Serial.print(UdpClient.remoteIP());
    Serial.print(":");
    Serial.print(UdpClient.remotePort());

    // Don't overflow the message buffer!
    if(datagramSize > datagramBufferSize) {
        Serial.println(", but the message is larger than the datagram buffer size.");
        continue;
    }
    // Read the message in now that we've verified that it won't blow our buffer up
    UdpClient.read(datagramBuffer, datagramSize);

    // Cheat and cast the datagram to a character array
    char* datagramStr = (char*)datagramBuffer;

}

That's rather complicated for such a simple task! Anyway, now we've got our beacon ping into a character array, we can start to pick it apart. Before we do, here's what my beacon pings look like:

server@10.0.40.66:5050
  ^        ^        ^
 Role IP Address   Port

Although it looks simple, in C++ (the language of the arduino) it's rather a pain. Here's what I came up with:

// Define the role of the remote server that we're looking for
char desiredRemoteRole[7] = { 's', 'e', 'r', 'v', 'e', 'r', '\0' };

// Find the positions of the key characters
int atPos = findChar(datagramStr, '@');
int colonPos = findChar(datagramStr, ':');

// Create some variables to store things in
char role[7];
char serverIp[16];
char serverPortText[7];
int serverPort = -1;
// Fill everything with zeroes to protect ourselves
memset(role, '\0', 7);
memset(serverIp, '\0', 16);
memset(serverPortText, '\0', 7);
// Extract the parts of the 
strncpy(role, datagramStr, atPos);
strncpy(serverIp, datagramStr + atPos + 1, colonPos - atPos - 1);
strncpy(serverPortText, datagramStr + colonPos + 1, datagramSize - colonPos - 1);

Serial.println("complete.");

// Print everything out to the serial console
Serial.print("atPos: "); Serial.println(atPos);
Serial.print("colonPos: "); Serial.println(colonPos);

Serial.print("Role: "); Serial.print(role); Serial.print(" ");
Serial.print("Remote IP: "); Serial.print(serverIp); Serial.print(" ");
Serial.print("Port number: "); Serial.print(serverPortText);
Serial.println();

// If the advertiser isn't playing the role of a server, then we're not interested
if(strcmp(role, desiredRemoteRole) != 0)
{
    Serial.print("Incompatible role "); Serial.print(role); Serial.println(".");
    continue;
}
Serial.println("Role ok!");
serverPort = atoi(serverPortText);

Phew! That's a lot of code! I've tried to annotate it the best I can. The important variables in the are serverIp the serverPort - they hold the IP address and the port number of the remote machine that we want to connect to.

I hope that's somewhat helpful. If there's anything you don't understand or need help with, please post a comment down below :-)

Next time, I'm going to show off the TCP connection bit of the system. Stay tuned :D

Art by Mythdael