Stardust Blog

Archive

Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code performance photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

Next Gen Search, Part 1: Backend Storage

I've got a bit of a thing about full-text search engines. I've talked about one in particular before for Pepperminty Wiki, and I was thinking about it again the other day - and how I could optimise it further.

If oyu haven't already, I do recommend reading my previous post on the curious question - as a number of things in this post might not make sense otherwise.

Between the time I wrote that last post and now, I've done quite a bit more digging into the root causes of that ~450ms search time, and I eventually determined that most of it was actually being spent deserialising the inverted index from the JSON file it's stored in back into memory.

This is horribly inefficient and is actually taking up 90% of that query time, so I got to thinking about what I could do about it.

My solution was multi-faceted, as I also (separately) developed a new search-term analysis system (I abbreviated to STAS, because it sounds cool :D) to add advanced query syntax such as -dog to exclude pages that contain the word dog and the like - which I also folded in at the same time as fixing the existing bottleneck.

As of the time of typing, the work on STAS is still ongoing. This doesn't mean that I can't blog about the rest of it though! I've recently-ish come into contact with key-value data stores, such as Redis and LevelDB (now RocksDB). They work rather like a C♯ dictionary or a browser's localStorage, in that they store values that are associated with unique keys (Redis is a bit of a special case though, for reasons that I won't get into here).

Such a data store would suit an inverted index surprisingly well. I devised a key system to represent an inverted index:

• The first key, |termindex|, is used to store a master list of words that have entries in the page index.
• The second key, term is simply the word itself (e.g. cat, chicken, rocket, etc.), and stores a list of ids of the pages that contain that word.
• The third and final key, term|pageid, is a word followed by a separator and finally the id of a page (e.g. bill|1, cat|45, boosters|69, etc.). These keys store the locations that a particular word appears at in a given document in the index. A separate id index is needed to convert between the page id and it's associated name - Pepperminty Wiki provides this functionality out-of-the-box.

The more I thought about it, the more I liked it. If I use a key-value data store in this manner, I can store the values as JSON objects - and then I only have to deserialise the parts of the index that I actually use. Furthermore, adding this extra layer of abstraction allows for some clever trickery to speed things up even more.

The problem here is that Pepperminty Wiki is supposed to be portable, so I try not to use any heavy external libraries or depend on odd PHP modules that not everyone will have installed. While a LevelDB extension for PHP does exist, it's not installed by default and it's a PECL module, which are awkward to install.

All isn't lost though, because it turns out that SQLite functions surprisingly well as a key-value data store:

CREATE TABLE store (
key TEXT UNIQUE NOT NULL,
value TEXT
);

Yes - it really is that simple! Now all we need is some code to create and interface with such a database. Some simple getters and setters should suffice!

(Can't see the above? Try a direct link.)

While this works, I quickly ran into several major problems:

• Every time I read from the data store I'm decoding JSON, which is expensive
• Every time I'm saving to the data store, I'm encoding to JSON, which is also expensive
• I'm reading and writing the same thing multiple times, which is very expensive
• Writing anything to the data store takes a long time

The (partial) solution here was to refactor it such that the json encoding is handled by the storage provider, and to implement a cache.

Such a cache could just be an associative array:

private $cache = []; Then, to fetch a value, we can do this: // If it's not in the cache, insert it if(!isset($this->cache[$key])) {$this->cache[$key] = [ "modified" => false, "value" => json_decode($this->query(
"SELECT value FROM store WHERE key = :key;",
[ "key" => $key ] )->fetchColumn()) ]; } return$this->cache[$key]["value"]; Notice how each item in the cache is also an associative array. This is so that we can flag items that have been modified in memory, such that when we next sync to disk we can write multiple changes all at once in a single batch. That turns the setter into something like this: if(!isset($this->cache[$key]))$this->cache[$key] = [];$this->cache[$key]["value"] =$value;
$this->cache[$key]["modified"] = true;

Very cool! Now all we need is a function to batch-write all the changes to disk. This isn't hard either:

foreach($this->cache as$key => $value_data) { // If it wasn't modified, there's no point in saving it, is there? if(!$value_data["modified"])
continue;

$this->query( "INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value)", [ "key" =>$key,
"value" => json_encode($value_data["value"]) ] ); } I'll get onto the cogs and wheels behind that query() function a bit later in this post. It's one of those utility functions that are great to have around that I keep copying from one project to the next. Much better, but still not great. Why does it still take ages to write all the changes to disk? Well, it turns out that by default SQLite wraps every INSERT in it's own transaction. If we wrap our code in an explicit transaction, we can seriously boost the speed even further: $this->db->beginTransaction();

// Do batch writing here

$this->db->commit(); Excellent (boomed the wizard). But wait, there's more! The PHP PDO database driver supports prepared statements, which is a fancy way of caching SQL statements and substituting values into them. We use these already, but since we only use a very limited number of SQL queries, we can squeak some additional performance out by caching them in their prepared forms (preparing them is relatively computationally expensive, after all). This is really easy to do. Let's create another associative array to store them in: private$query_cache = [];

Then, we can alter the query() function to look like this:

/**
* Makes a query against the database.
* @param   string  $sql The (potentially parametised) query to make. * @param array$variables  Optional. The variables to substitute into the SQL query.
* @return  \PDOStatement       The result of the query, as a PDOStatement.
*/
private function query(string $sql, array$variables = []) {
// Add to the query cache if it doesn't exist
if(!isset($this->query_cache[$sql]))
$this->query_cache[$sql] = $this->db->prepare($sql);

string result = Regex.IsMatch(type, @"range$") ? aiResult["start"] : aiResult["value"]; return DateTime.Parse(result); } private static Dictionary<string, string> unwindResult(ModelResult modelResult) { return (modelResult.Resolution["values"] as List<Dictionary<string, string>>)[0]; } Of course, it depends on your use-case as to precisely how you unwind it, but the above should be a good starting point. Once I've polished the bot I've written a bit, I might post about it on here. Found this interesting? Run into an issue? Got a neat use for it? Comment below! Using libsodium to upgrade the comment key system I've blogged about the comment key system I utilise on this blog to prevent spam before (see also). Today, I've given it another upgrade to make it harder for spammers to fake a comment key! In the last post, I transformed the comment key with a number of reversible operations - including a simple XOR password system. This is, of course, very insecure - especially since an attacker knows (or can at least guess) the content of the encrypted key, making it trivial (I suspect) to guess the password used for 'encryption'. The solution here, obviously, is to utilise a better encryption system. Since it's the 5th November and I'm not particularly keen on my website going poof like the fireworks tonight, let's do something about it! PHP 7.2+ comes with native libsodium support (those still using older versions of PHP can still follow along! Simply install the PECL module). libsodium bills itself as A modern, portable, easy to use crypto library. After my experiences investigating it, I'd certainly say that's true (although the official PHP documentation could do with, erm, existing). I used this documentation instead instead - it's quite ironic because I have actually base64-encoded the password....... Anyway, after doing some digging I found the quick reference, which explains how you can go about accomplishing common tasks with libsodium. For my comment key system, I want to encrypt my timestamp with a password - so I wanted the sodium_crypto_secretbox() and its associated sodium_crypto_secretbox_open() functions. This pair of functions, when used together, implement a secure symmetric key encryption system. In other words, they securely encrypt a lump of data with a password. They do, however, have 2 requirements that must be taken care of. Firstly, the password must be of a specific length. This is easy enough to accomplish, as PHP is kind enough to tell us how long it needs to be: /** * Generates a new random comment key system password. * @return string The base64-encoded password. */ function key_generate_password() { return base64_encode_safe(random_bytes(SODIUM_CRYPTO_SECRETBOX_KEYBYTES)); } Easy-peasy! base64_encode_safe isn't a built-in function - it's a utility function I wrote that we'll need later. For consistency, I've used it here too. Here it is, along with its counterpart: /** * Encodes the given data with base64, but replaces characters that could * get mangled in transit with safer equivalents. * @param mixed$data   The data to encode.
* @return  string  The data, encoded as transmit-safe base64.
*/
function base64_encode_safe($data) { return str_replace(["/", "+"], ["-", "_"], base64_encode($data));
}
/**
* Decodes data encoded with base64_encode_safe().
* @param   mixed   $base64 The data to decode. * @return string The data, decoded from transmit-safe base64. */ function base64_decode_safe($base64) {
return base64_decode(str_replace(["-", "_"], ["/", "+"], $base64)); } With that taken care of, we can look at the other requirement: a nonce. Standing for Number used ONCE, it's a sequence of random bytes that's used by the encryption algorithm. We don't need to keep it a secret, but we do need to to decrypt the data again at the other end, in addition to the password - and we do need to ensure that we generate a new one for every comment key. Thankfully, this is also easy to do in a similar manner to generating a password: $nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);

With everything in place, we can look at upgrading the comment key generator itself. It looks much simpler now:

/**
* Generates a new comment key.
* Note that the 'encryption' used is not secure - it's just simple XOR!
* It's mainly to better support verification in complex setups and
* serve as a nother annoying hurdle for spammers.
* @param  string $pass The password to encrypt with. Should be a base64-encoded string from key_generate_password(). * @return string A new comment key stamped at the current time. */ function key_generate($pass) {
$pass = base64_decode_safe($pass);
$nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);$encrypt = sodium_crypto_secretbox(
strval(time()), // The thing we want to encrypt
$nonce, // The nonce$pass // The password to encrypt with
);

return base64_encode_safe($nonce.$encrypt);
}

I bundle the nonce with the encrypted data here to ensure that the system continues to be stateless (i.e. we don't need to store any state information about a user on the server). I also encode the encrypted string with base64, as the encrypted strings contain lots of nasty characters (it's actually a binary byte array I suspect). This produces keys like this:

BOqDvr26XY9s8PhlmIZMIp6xCOZyfsh6J05S4Jp2cY3bL8ccf_oRgRMldrmzKk6RrnA=
Tt8H81tkJEqiJt-RvIstA_vz13LS8vjLgriSAvc1n5iKwHuEKjW93IMITikdOwr5-NY=
5NPvHg-l1GgcQ9ZbThZH7ixfKGqAtSBr5ggOFbN_wFRjo3OeJSWAcFNhQulYEVkzukI=

They are a bit long, but not unmanageable. In theory, I could make it a bit shorter by avoiding converting the integer output from time() to a string, but in my testing it didn't make much difference. I suspect that there's some sort of minimum length to the output string for some (probably good) reason.


php > var_dump(sodium_crypto_secretbox(time(), random_bytes(24), random_bytes(32)));
string(26) "GW$:���ߌ@]�+1b��������d%" php > var_dump(sodium_crypto_secretbox(strval(time()), random_bytes(24), random_bytes(32))); string(26) ":_}0H �E°9��Kn�p��ͧ��"  Now that we're encrypting comment keys properly, it isn't much good unless we can decrypt them as well! Let's do that too. The first step is to decode the base64 and re-split the nonce from the encrypted data: $pass = base64_decode_safe($pass);$key_enc_raw = base64_decode_safe($key);$nonce = mb_substr($key_enc_raw, 0, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, "8bit");$key_enc = mb_substr($key_enc_raw, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, null, "8bit"); This is derived from the example code. With the 2 separated once more, we can decrypt the key: $key_dec = sodium_crypto_secretbox_open($key_enc,$nonce, $pass); Apparently, according to the example code on the website I linked to above, if the return value isn't a string then the decryption failed. We should make sure we handle that when returning: if(!is_string($key_dec))
return null;
return intval($key_dec); That completes the decryption code. Here is in full: /** * Decodes the given key. * @param string$key  The key to decode.
* @param  string $pass The password to decode it with. * @return int|null The time that the key was generated, or null if the provided key was invalid. */ function key_decode($key, $pass) {$pass = base64_decode_safe($pass);$key_enc_raw = base64_decode_safe($key);$nonce = mb_substr($key_enc_raw, 0, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, "8bit");$key_enc = mb_substr($key_enc_raw, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, null, "8bit");$key_dec = sodium_crypto_secretbox_open($key_enc,$nonce, $pass); if(!is_string($key_dec)) return null;
return intval($key_dec); } Explicitly returning null in key_decode() requires a small change to key_verify(), in order to prevent it from thinking that a key is really old if the decryption fails (null is treated as 0 in arithmetic operations, apparently). Let's update key_verify() to handle that: /** * Verifies a key. * @param string$key        The key to decode and verify.
* @param   string  $pass The password to decode the key with. * @param int$min_age    The minimum required age for the key.
* @param   int     $max_age The maximum allowed age for the key. * @return bool Whether the key is valid or not. */ function key_verify($key, $pass,$min_age, $max_age) {$key_time = key_decode($key,$pass);
if($key_time == null) return false;$age = time() - $key_time; return$age >= $min_age &&$age <= $max_age; } A simple check is all that's needed. With that, the system is all updated! Time to generate a new password with the provided function and put it to use. You can do that directly from the PHP REPL (php -a):  php > require("lib/comment_system/comment_key.php"); php > var_dump(key_generate_password()); string(44) "S0bJpph5htdaKHZdVfo9FB6O4jOSzr3xZZ94c2Qcn44="  Obviously, this isn't the password I'm using on my blog :P I've got a few more ideas I'd like to play around with to deter spammers even more whilst maintaining the current transparency - so you may see another post soon on this subject :-) With everything complete, here's the complete script: (Above: The full comment key system code. Can't see it? Check it out on GitHub Gist here.) Found this useful? Got an improvement? Confused about something? Comment below! Sources and Further Reading Generating word searches for fun and profit (Above: A Word search generated with the tool below) A little while ago I was asked about generating a wordsearch in a custom shape. I thought to myself "someone has to have built this before...", and while I was right to an extent, I couldn't find one that let you use any shape you liked. This, of course, was unacceptable! You've probably guessed it by now, but I went ahead and wrote my own :P While I wrote it a little while ago, I apparently never got around to posting about it on here. In short, it works by using an image you drop into the designated area on the page as the shape the word search should take. Each pixel is a single cell of the word search - with the alpha channel representing whether or not a character is allowed to be placed there (transparent means that it can't contain a character, and opaque means that it can). Creating such an image is simple. Personally, I recommend Piskel or GIMP for this purpose. Once done, you can start building a wordlist in the wordlist box at the right-hand-side. It should rebuild the word search as soon as you click out of the box. If it doesn't, then you've found a bug! Please report it here. With the word search generated, you can use the Question Sheet and Answer Sheet links to open printable versions for export. You can find my word search generator here: I've generated a word search of the current tags in the tag cloud on this blog too: Question Sheet [50.3KiB], Answer Sheet [285.6KiB] The most complicated part of this was probably the logistics behind rude word removal. Thankfully, I did't have to find and maintain such a list of words, as the futility npm package does this for me, but algorithmically guaranteeing that by censoring 1 rude word another is not accidentally created in another direction is a nasty problem. If you're interested in a more technical breakdown of one (or several!) particular aspects of this - let me know! While writing about all of it would probably make for an awfully long post, a specific aspect or two should be more manageable. In the future, I'll probably revisit this and add additional features to it, such as the ability to restrict which directions words are placed in, for example. If you've got a suggestion of your own, open an issue (or even better, open a pull request :D)! Placeholder Image Generator (Above: An example output with debugging turned on from my placeholder generation service) For a quite a considerable amount of time now, I've been running my own placeholder image generation service here at starbeamrainbowlabs.com - complete with etag generation and custom colour selection. Although it's somewhat of an Easter Egg, it's not actually that hard to find if you know what you're looking for (hint: I use it on my home page, but you may need to view source to find it). I decided to post about it now because I've just finished fixing the angle GET parameter - and it was interesting (and hard) enough to warrant a post to remind myself how I did it for future reference. Comment below if you knew about it before this blog post came out! The script itself is split into 3 loose parts: • The initial settings / argument parsing • The polyfills and utility functions • The image generation. My aim here was to keep the script contained within a single file - so that it fits well in a gist (hence why it currently looks a bit messy!). It's probably best if show the code in full: (Having trouble viewing code above? Visit it directly here: placeholder.php) If you append ?help to the url, it will send back a plain text file detailing the different supported parameters and what they do: (Can't see the above? View it directly here) Aside from implementing the random value for fg_colour and bg_colour, the angle has been a huge pain. I use GD - a graphics drawing library that's bundled with practically every PHP installation ever - to draw the placeholder image, and when you ask it to draw some rotated text, it decides that it's going to pivot it around the bottom-left corner instead of the centre. Naturally, this creates some complications. Though some people on the PHP manual page said method (imagettftext) have attempetd to correct for this (exhibits a and b), neither of their solutions work for me. I don't know if it's because their code is just really old (13 and 5 years respectively as of the time of typing) or what. Anyway, I finally decided recently that enough was enough - and set about fixing it myself. Basing my code on the latter of the 2 pre-existing solutions, I tried fixing it - but ended up deleting most of it and starting again. It did give me some idea as to how to solve the problem though - all I needed to do was find the centre of where the text would be drawn when it is both not rotated and rotated and correct for it (these are represented by the green and blue crosses respectively on the image at the top of this post). PHP does provide a method for calculating the bounding box of some prospective text that you're thinking of drawing via imagettfbbox. Finding the centre of the box though sounded like a horrible maths-y problem that would take me ages to work through on a whiteboard first, but thankfully (after some searching around) Wikipedia had a really neat solution for finding the central point of any set of points. It calls it the centroid, and claims that the geometric centre of a set of points is simply the average of all the points involved. It just so happens that the geometric centre is precisely what I'm after: $$C=\frac{a+b+c+d+....}{N}$$ ...Where$C$is the geometric centre of the shape as an$\binom{x}{y}$vector,$a$,$b$,$c$,$d$, etc. are the input points as vectors, and$N$is the number of points in question. This was nice and easy to program: $bbox = imagettfbbox($size, 0,$fontfile, $text);$orig_centre = [
($bbox[0] +$bbox[2] + $bbox[4] +$bbox[6]) / 4,
($bbox[1] +$bbox[3] + $bbox[5] +$bbox[7]) / 4
];
$text_width =$bbox[2] - $bbox[0];$text_height = $bbox[3] -$bbox[5];

$bbox = imagettfbbox($size, $angle,$fontfile, $text);$rot_centre = [
($bbox[0] +$bbox[2] + $bbox[4] +$bbox[6]) / 4,
($bbox[1] +$bbox[3] + $bbox[5] +$bbox[7]) / 4
];

The odd and even indexes of $bbox there are referring to the$x$and$y$co-ordinates of the 4 corners of the bounding boxes - imagettfbbox outputs the co-ordinates in 1 single array for some reason. I also calculate the original width and height of the text - in order to perform an additional corrective translation later. With these in hand, I could calculate the actual position I need to draw it (indicated by the yellow cross in the image at the top of this post): $$delta = C_{rotated} - C_{original}$$ . $$pivot = \frac{pos_x - ( delta_x + \frac{text_width}{2})}{pos_y - delta_y + \frac{text_height}{2}}$$ In short, I calculate the distance between the 2 centre points calculated above, and then find the pivot point by taking this distance plus half the size of the text from the original central position we wanted to draw the text at. Here's that in PHP: // Calculate the x and y shifting needed$dx = $rot_centre[0] -$orig_centre[0];
$dy =$rot_centre[1] - $orig_centre[1]; // New pivot point$px = $x - ($dx + $text_width/2);$py = $y -$dy + $text_height/2; I generated a video of it in action (totally not because I thought it would look cool :P) with a combination of curl, and ffmpeg: (Having issues with the above video? Try downloading it directly, or viewing it on YouTube) This was really quite easy to generate. In a new folder, I executed the following: echo {0..360} | xargs -P3 -d' ' -n1 -I{} curl 'https://starbeamrainbowlabs.com/images/placeholder/?width=1920&height=1920&fg_colour=inverse&bg_colour=white&angle={}' -o {}.jpeg ffmpeg -r 60 -i %d.jpeg -q:v 3 /tmp/text-rotation.webm The first command downloads all the required frames (3 at a time), and the second stitches them together. The -q:v 3 bit of the ffmpeg command is of note - by default webm videos apparently have a really low quality - this corrects that. Lower is better, apparently - it goes up to about 40 I seem to remember reading somewhere. 3 to 5 is supposed to be the right range to get it to look ok without using too much disk space. That's about everything I wanted to talk about to remind myself about what I did. Let me know if there's anything you're confused about in the comments below, and I'll explain it in more detail. (Found this interesting? Comment below!) Sources and Further Reading Demystifying Inverted Indexes (The magnifying glass in the above banner came from openclipart) After writing the post that will be released after this one, I realised that I made a critical assumption that everyone knew what an inverted index was. Upon looking for an appropriate tutorial online, I couldn't find one that was close enough to what I did in Pepperminty Wiki, so I decided to write my own. First, some context. What's Pepperminty Wiki? Well, it's a complete wiki engine in a single file of PHP. The source files are obviously not a single file, but it builds into a single file - making it easy to drop into any PHP-enabled web server. One of its features is a full-text search engine. A personal wiki of mine has ~75k words spread across ~550 pages, and it manages to search them all in just ~450ms! It does this with the aid of an inverted index - which I'll be explaining in this post. First though, we need some data to index. How about the descriptions of some video games? Kerbal Space Program In KSP, you must build a space-worthy craft, capable of flying its crew out into space, without killing them. At your disposal is a collection of parts, which must be assembled to create a functional ship. Each part has its own function and will affect the way a ship flies (or doesn't). So strap yourself in, and get ready to try some Rocket Science! Cross Code Meet Lea as she logs into an MMO of the distant future. Follow her steps as she discovers a vast world, meets other players and overcomes all the challenges of the game. Fort Meow Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow. Factory Balls Factory balls is the brand new bonte game I already announced yesterday. Factory balls takes part in the game design competition over at jayisgames. The goal of the design competition was to create a 'ball physics'-themed game. I hope you enjoy it! Very cool, this should provide us with plenty of data to experiment with. Firstly, let's consider indexing. Take the Factory Balls description. We can split it up into tokens like this: T o k e n s V V factory balls is the brand new bonte game i already announced yesterday factory balls takes part in the game design competition over at jayisgames the goal of the design competition was to create a ball physics themed game i hope you enjoy it Notice how we've removed punctuation here, and made everything lowercase. This is important for the next step, as we want to make sure we consider Factory and factory to be the same word - otherwise when querying the index we'd have to remember to get the casing correct. With our tokens sorted, we can now count them to create our index. It's like a sort of tally chart I guess, except we'll be including the offset in the text of every token in the list. We'll also be removing some of the most common words in the list that aren't very interesting - these are known as stop words. Here's an index generated from that Factory Balls text above: Token Frequency Offsets factory 2 0, 12 balls 2 1, 13 brand 1 4 new 1 5 bonte 1 6 game 3 7, 18, 37 i 2 8, 38 announced 1 10 yesterday 1 11 takes 1 14 design 2 19, 28 competition 2 20, 29 jayisgames 1 23 goal 1 25 create 1 32 ball 1 34 physics 1 35 themed 1 36 hope 1 39 enjoy 1 41 Very cool. Now we can generate an index for each page's content. The next step is to turn this into an inverted index. Basically, the difference between the normal index and a inverted index is that an entry in an inverted index contains not just the offsets for a single page, but all the pages that contain that token. For example, the Cross-Code example above also contains the token game, so the inverted index entry for game would contain a list of offsets for both the Factory Balls and Cross-Code pages. Including the names of every page under every different token in the inverted index would be both inefficient computationally and cause the index to grow rather large, so we should assign each page a unique numerical id. Let's do that now: Id Page Name 1 Kerbal Space Program 2 Cross Code 3 Fort Meow 4 Factory Balls There - much better. In Pepperminty Wiki, this is handled by the ids class, which has a pair of public methods: getid($pagename) and getpagename($id). If an id can't be found for a page name, then a new id is created and added to the list (Pepperminty Wiki calls this the id index) transparently. Similarly, if a page name can't be found for an id, then null should be returned. Now that we've got ids for our pages, let's look at generating that inverted index entry for game we talked about above. Here it is: • Term: game Id Frequency Offsets 2 1 31 3 1 5 4 3 5, 12, 23 Note how there isn't an entry for page id 1, as the Kerbal Space Program page doesn't contain the token game. This, in essence, is the basics of inverted indexes. A full inverted index will contain an entry for every token that's found in at least 1 source document - though the approach used here is far from the only way of doing it (I'm sure there are much more advanced ways of doing it for larger datasets, but this came to mind from reading a few web articles and is fairly straight-forward and easy to understand). Can you write a program that generates a full inverted index like I did in the example above? Try testing it on the test game descriptions at the start of this post. You may also have noticed that the offsets used here are of the tokens in the list. If you wanted to generate contexts (like Duck Duck Go or Google do just below the title of a result), you'd need to use the character offsets from the source document instead. Can you extend your program to support querying the inverted index, generating contexts based on the inverted index too? Liked this post? Got your own thoughts on the subject? Having trouble with the challenges at the end? Comment below! Shift-Reduce Parser Part 2: Building Furniture (1) Hello and welcome! I got a bit distracted by other things as you can tell, but I'm back with part 2 of my series on building a shift-reduce parser. If you're not sure what I'm talking about, then I'd advise reading part 1 first and then coming back here. It might be a good idea to re-read it anyway, juts to refresh your memory :-) Last time, we created some data classes to store the various rules and tokens that we'll be generating. Today, we're going to build on that and start turning a set of rules into a parse table. Let's introduce the rules we'll working with: <start> ::= <expression> <expression> ::= <expression> PLUS <value> | <term> <term> ::= <term> DIVIDE <value> | <value> <value> ::= <number> | BRACKET_OPEN <expression> BRACKET_CLOSE <number> ::= DIGIT | <number> DIGIT The above represents a very basic calculator-style syntax, which only supports adding and dividing. It's written in Backus-Naur Form, which is basically a standardised way of writing parsing rules. To build a parse table, we first must understand what such a thing actually is. Let's take a look at an example: state action goto * + 0 1$ E B
0 s1 s2 3 4
1 r4 r4 r4 r4 r4
2 r5 r5 r5 r5 r5
3 s5 s6 goal
4 r3 r3 r3 r3 r3
5 s1 s2 7
6 s1 s2 8
7 r1 r1 r1 r1 r1
8 r2 r2 r2 r2 r2

_(Source: Adapted from the LR Parser on Wikipedia.)_

While it looks complex at first, let's break it down. There are 3 parts to this table: The state, the action, and the goto. The action and goto represent What should happen if a particular token is encountered. In this case, the input stream contains both terminal (i.e. DIGIT, DIVIDE, BRACKET_CLOSE, etc. in the case of our BNF above) and non-terminal (i.e. number, term, expression, etc. in the case of our BNF above) symbols - if understand it correctly, so there are actually 2 parts to the table here to make sure that both are handled correctly.

We'll be connecting this to our lexer, which outputs only terminal symbols, so we should be ok I believe (if you know better, please post a comment below!). The state refers to the state in the table. As I've mentioned before, a given state may contain one or more configurations. It's these configurations that give rise to the actions in the table above, such as s2 (shift and then go to state 2) or r3 (reduce and jump to state 3).

To use the table, the parser must know what state it's in, and then take a look across the top row for the next symbol it has in the token stream. Once found, it can follow it down to figure out what action it should take, as explained above. If there isn't an action in the box, then there must be an error in the input, as the table doesn't tell us what to do in this situation. To that end, we should try and generate a meaningful error message to help the user to find the mistake in the input (or the developer in the parser!).

We're kind of getting ahead of ourselves here though. We need to build this table first, and to do that, we need to figure out which configurations go in which state. And, going down the rabbit hole, we need to know what a configuration is. Again, it's best if I demonstrate. Consider the following parsing rule from our example BNF at the beginning of this post:

<value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE

A single configuration represent a possible state of the parser at a particular instant in time. I could split that above rule up like so:

<value> ::= BRACKET_OPEN * <expression> BRACKET_CLOSE
<value> ::= BRACKET_OPEN <expression> * BRACKET_CLOSE
<value> ::= BRACKET_OPEN <expression> BRACKET_CLOSE *

The asterisk represent where the parser might have gotten up to. Everything to the left is on the stack of the parser, and everything to the right hasn't happened yet.

Since this isn't a top-level rule (in our example that honour goes to the rule for the start non-terminal), the parser will never be in a position where the first item there doesn't exist yet on the stack, so we can ignore the configuration in which the asterisk would be to the left of BRACKET_OPEN.

Confused? Let me try and help here. Let's draw a diagram of how our parser is going to operate:

_(Source: Made by me, but adapted from the LR Parser article on Wikipedia)_

Basically, the parser will be taking in the input token stream and either shift a new terminal token onto the stack, or reduce one or more existing tokens on the stack into a single non-terminal token, which replaces those existing tokens on the stack. The configurations above represent possible states of the stack (the bit to the left of the asterisk), and possible directions that the parser could take when parsing (the bit to th right of the asterisk).

Finally, when the goal is reached, the output is returned to the caller (which, by the time we're done, should be a parse tree). Said tree can then be optimised and processed for whatever purpose we desire!

With this knowledge, we can deduce that we can build the entire table by recursing over the tree of rules from the start state. That way, we'll visit every rule that we'll need to parse everything required to reach the goal state by recursing over all the rules for all the non-terminals referenced by all the rules we visit. We could even generate a warning if we detect that some rules don't connect to this 'tree'. Here's a tree of our example ruleset from the beginning of this post:

It's a bit spaghetti-ish, but it should be legible enough :P This gives us an idea as to how we're going to tackle this. Taking into account the data classes we created in the last post, we need to make sure we keep the following in mind:

1. Since the main ShiftReduceParser class is going to hold the rules, the ParseTable class will need a reference to its parent ShiftReduceParser in order to query the rules.
2. In light of this, the SHiftReduceParser should be responsible for satisfying any queries the ParseTable has about rules - the ParseTable should not have to go looking & filtering the rule list held by ShiftReduceParser itself.
3. ParseTable will need a recursive method that will take a single top-level rule and recurse over it and its child rules (according to the tree I've talked about above)
4. This method in ParseTale will need to be extremely careful it doesn't get stuck in a loop. To that end, it'll have to keep track of whether it's already processed a rule or not.
5. It'll probably also have to keep track of which configurations it has added to the table class structure we defined in the last post to avoid adding rules twice.
6. Once ParseTable has figured out all the configurations and grouped them all into the right states, it will then have to recurse over the generated table and fill in all the shift / reduce / goal action(s) - not forgetting about the links to the other states they should point to.

It's quite the laundry list! Thankfully, most of this is quite simple if we tackle it one step at a time. The most annoying bit is the grouping of configurations into states. This is done by looking at the token immediately before the asterisk in each configuration - all the configurations with the same token here will get grouped into the same state (while there are more complex algorithms that allow for more complex grammars, we'll stick with this for now as anything else makes my head hurt! Maybe in the future I'll look as figuring out precisely what kind of LR-style parser this is, and upgrading it to be a canonical LR(1) parser - the most advanced type I know of).

This is quite a lot to take in, so I think I'll leave this post here for you to digest - and we'll get to writing some code in the next one.

Found this useful? Spotted a mistake? Having trouble getting your head around it? Post a comment below!

Shift-reduce Parser Part 1: First Steps

Now that I've done the Languages and Compilers module at University, it's opened my eyes to a much better and more extensible way of handling complex text in a way that can easily be read by any subsequent code I write. To that end, I've found that at least 3 different various projects of mine could benefit from the inclusion of a shift-reduce parser, but I haven't been able to track one down for C♯ yet.

With this in mind, I thought to myself: "Why not build one myself?" Sure, my Lecturer didn't go into too many details as to how they work, but it can't be that difficult, can it? Oh, I had no idea.....

In this mini-series, I'm going to take you through the process of building a shift-reduce parser of your very own. As I write this, I haven't actually finished mine yet - I've just got to the important milestone of building a parse table! Thankfully, that's going to be a few posts away, as there's a fair amount of ground to cover until we get to that point.

Warning: This series is not for the faint of heart! It's rather complicated, and extremely abstract - making it difficult to get your head around. I've had great difficulty getting mine around it - and ended up writing it in multiple stages. If you want to follow along, be prepared for lots of research, theory, and preparation!

Let's start out by taking a look at what a shift-reduce parser does. If you haven't already, I'd recommend reading my previous compilers 101 post, which explains how to write a compiler, and the different stages involved. I'd also recommend checking out my earlier post on building a lexer, as it ties in nicely with the shift-reduce parser that we'll be building.

In short, a shift-reduce parser compiles a set of BNF-style rules into a Parse Table, which it then utilises as a sort of state-machine when parsing a stream on input tokens. We'll take a look at this table compilation process in a future blog post. In this post, let's set up some data structures to help us along when we get to the compilation process in the next blog post. Here's the class structure we'll be going for:

Let's start with a class to represent a single token in a rule:

public enum ParserTokenClass
{
Terminal,
NonTerminal
}

public struct ParserToken
{
public readonly ParserTokenClass Class;
public readonly string Type;

public ParserToken(ParserTokenClass inTokenType, string inType)
{
Class = inTokenType;
Type = inType;
}

public override bool Equals(object obj)
{
ParserToken otherTokenType = (ParserToken)obj;
return Class == otherTokenType.Class && Type == otherTokenType.Type;
}
public override int GetHashCode()
{
return $"{Class}:{Type}".GetHashCode(); } public override string ToString() { string terminalDisplay = Class == ParserTokenClass.Terminal ? "T" : "NT"; return$"[ParserToken {terminalDisplay}: {Type}]";
}

public static ParserToken NonTerm(string inType)
{
return new ParserToken(ParserTokenClass.NonTerminal, inType);
}
public static ParserToken Term(string inType)
{
return new ParserToken(ParserTokenClass.Terminal, inType);
}
}

Pretty simple! A token in a rule can either be a terminal (basically a token straight from the lexer), or a non-terminal (a token that the parser reduces a set of other tokens into), and has a type - which we represent as a string. Unfortunately due to the complex comparisons we'll be doing later, it's a huge hassle to use an enum with a template class as I did in the lexer I built that I linked to earlier.

Later on (once we've built the parse table), we'll extend this class to support attaching values and other such pieces of information to it, but for now we'll leave that out to aid simplicity.

I also override Equals() and GetHashCode() in order to make comparing tokens easier later on. Overriding ToString() makes the debugging process much easier later, as we'll see in the next post!

With a class to represent a token, we also need one to represent a rule. Let's create one now:

public class ParserRule
{
/// <summary>
/// A function to call when a reduce operation utilises this rule.
/// </summary>
public Action MatchingAction;
public ParserToken LeftSide;
public ParserToken[] RightSideSequence;

public ParserRule(Action inMatchingAction, ParserToken inLeftSide, params ParserToken[] inRightSideSequence)
{
if (inLeftSide.Class != ParserTokenClass.NonTerminal)
throw new ArgumentException("Error: The left-hand side must be a non-terminal token type.");

MatchingAction = inMatchingAction;
LeftSide = inLeftSide;
RightSideSequence = inRightSideSequence;
}

public bool RightSideSequenceMatches(IEnumerable<ParserToken> otherRhs)
{
int i = 0;
foreach (ParserToken nextToken in otherRhs)
{
if (!nextToken.Equals(RightSideSequence[i]))
return false;

i++;
}
return true;
}

public override string ToString()
{
StringBuilder result = new StringBuilder();
result.Append($"ParserRule: {LeftSide} = "); foreach (ParserToken nextToken in RightSideSequence) result.Append($" {nextToken}");
result.Append(";");
return result.ToString();
}
}

The above represents a single parser rule, such as <number> ::= <digit> <number>. Here we have the token on the left-hand-side (which we make sure is a non-terminal), and an array of tokens (which can be either terminal or non-terminal) for the right-hand-side. We also have an Action (which is basically a lamba function) that we'll call when we match against the rule, so that we have a place to hook into when we write code that actually does the tree building (not to be confused with the shift-reduce parser itself).

Here I also add a method that we'll need later, which compares an array of tokens against the current rule, to see if they match - and we override ToString() here again to aid debugging.

Now that we can represent tokens and rules, we can start thinking about representing configurations and states. Not sure what these are? All will be explained in the next post, don't worry :-) For now, A state can be seen as a row in the parse table, and it contains a number of configurations - which are like routes to different other states that the parser decides between, depending where it's gotten to in the token stream.

public enum ParseTableAction
{
Shift,
Reduce,
Goal,
Error
}

public class ParseTableConfiguration
{
public readonly ParserRule Rule;
public readonly int RhsPosition;

public ParseTableAction LinkingAction = ParseTableAction.Error;
public ParseTableState LinkingState = null;

public ParserToken TokenAfterDot {
get {
return Rule.RightSideSequence[RhsPosition];
}
}
public ParserToken TokenBeforeDot {
get {
return Rule.RightSideSequence[RhsPosition - 1];
}
}

/// <summary>
/// Whether this configuration is the last in the sequence of configurations for the specified rule or not.
/// </summary>
/// <value><c>true</c> if is last in rule; otherwise, <c>false</c>.</value>
public bool IsLastInRule {
get {
return RhsPosition > Rule.RightSideSequence.Length - 1;
}
}

public ParseTableConfiguration(ParserRule inRule, int inRhsPosition)
{
Rule = inRule;
RhsPosition = inRhsPosition;
}

public IEnumerable<ParserToken> GetParsedRhs()
{
return Rule.RightSideSequence.TakeWhile((ParserToken token, int index) => index <= RhsPosition);
}

public bool MatchesRhsSequence(ParserRule otherRule)
{
int i = 0;
foreach (ParserToken nextToken in otherRule.RightSideSequence)
{
if (i > RhsPosition)
break;

if (!nextToken.Equals(otherRule.RightSideSequence[i]))
return false;

i++;
}
return true;
}

public override bool Equals(object obj)
{
ParseTableConfiguration otherConfig = obj as ParseTableConfiguration;
if (otherConfig == null) return false;
return Rule == otherConfig.Rule && RhsPosition == otherConfig.RhsPosition;
}
public override int GetHashCode()
{
return $"{Rule}:{RhsPosition}".GetHashCode(); } public override string ToString() { StringBuilder result = new StringBuilder(); result.Append($"Configuration: {LinkingAction} ");
if (LinkingState != null)
result.Append($"to State {LinkingState.Id} "); result.Append($"{Rule.LeftSide} = ");

for (int i = 0; i <= Rule.RightSideSequence.Length; i++)
{
if (i == RhsPosition)
result.Append(" * ");
if (i == Rule.RightSideSequence.Length)
continue;
result.Append($"{Rule.RightSideSequence[i]} "); } result.Append(";"); return result.ToString(); } } This class is slightly more complicated. First, we define an enum that holds information about what the parser should do if it chooses this configuration. Then, we declare the configuration class itself. This entails specifying which parse rule we're deriving the configuration from, and both which tokens in the right-hand-side of the rule should have been parsed already, and which should still be somewhere in the token stream. Again, I'll explain this in more detail in the next post! Then, we declare a few utility methods and properties to fetch different parts of the configuration's rule, such as the token to the immediate left and right of the right-hand-side position (which was represented as a dot . in the book I followed), all the tokens before the dot ., and whether a given rule matches this configuration in the basis of everything before the dot .. Finally, I continue with the trend of overriding the equality checking methods and ToString(), as it makes a world of difference in the code coming up in future blog posts! Now that we've got a class for configurations, the last one on our list is one for the states themselves. Let's do that now: public class ParseTableState { public readonly ParseTable ParentTable; public int Id { get { return ParentTable.FindStateId(this); } } public List<ParseTableConfiguration> Configurations = new List<ParseTableConfiguration>(); public ParseTableState(ParseTable inParentTable) { ParentTable = inParentTable; } public override string ToString() { StringBuilder result = new StringBuilder(); foreach(ParseTableConfiguration nextConfiguration in Configurations) result.AppendLine($"     - {nextConfiguration}");
return result.ToString();
}
}

Much simpler than the configuration rule class, right? :P As I mentioned earlier, all a state consists of is a list of configurations in that state. In our case, we'll be assigning an id to the states in our parse table, so I include a property here that fetches a state's id from the parent parse table that it's part to make the later code as simple as possible.

Still with me? Congratulations! You've got the beginnings of a shift-reduce parser. Next time, we'll expand on some of theory behind the things I've touched on in this post, and possibly look at building the start of the recursive parse table builder itself.

Found this interesting? Confused about something? Comment below!

Markov Chains Part 3: Weighted Chains

Recently I remembered that I had all the pieces I need to make a weighted version of the unweighted markov chain I built in part 2 of this series - specifically the weighted random number generator I built shortly after the unweighted markov chain. With this in mind, I decided on a whim to put all the pieces of the puzzle together - and this post is the result!

Where to start... hrm. I know. To start with, I needed to perform a minor upgrade tot he WeightedRandom class I had. If you haven't read the original post about it, I'd recommend doing so now.

Finished reading that? Great! Lets talk about the changes I've made (they may show up there at the bottom, since I embedded it via GitHub Gist). Firstly, I needed a way to work out if the weighted random number generator was currently empty or not, leading me to add a Count property:

public int Count {
get {
return weights.Count;
}
}

With a count property in place, I also found I was going to need a way to dynamically swap out the weightings of the random number generator without creating a completely new instance - which would end up resetting the Random class instance it was working with, leading to a reduction in the quality in random numbers it uses under high load (see [this article]() for more information on that).

To that end, I ended up refactoring the constructor into a pair of methods: SetContents, and a companion method ClearContents. Since the weight calculations happen when the items are first added to the generator, and I'd need to completely recalculate them if another item is added, I wasn't able to emulate the API for another existing class in .NET, such as the List class, as I like to do.

Finally, I found later on I needed a way to initialise an empty weighted random generator, so I added a new empty constructor to facilitate that, along with an additional check in the Next() method that throws an InvalidOperationException if the generator is empty and you try to ask it to pick a random item.

Here's the updated WeightedRandomNumberGenerator:

(Can't see the above? Click here to view it on GitHub directly, or here for the raw code as plain text)

With the weighted random number generator updated to properly support the future weighted markov chain, let's get down to the markov chain itself. Firstly, let's create a skeleton that's based on the UnweightedMarkovChain class I wrote in the last post:

using System;
using System.Collections.Generic;
using System.Linq;
using MarkovGrams.Utilities;
using SBRL.Algorithms;

namespace MarkovGrams
{
/// <summary>
/// An unweighted character-based markov chain.
/// </summary>
public class WeightedMarkovChain
{
private WeightedRandom<string> wrandom = new WeightedRandom<string>();

/// <summary>
/// The ngrams that this markov chain currently contains.
/// </summary>
Dictionary<string, double> ngrams;

/// <summary>
/// Creates a new character-based markov chain.
/// </summary>
/// <param name="inNgrams">The ngrams to populate the new markov chain with.</param>
public WeightedMarkovChain(IEnumerable<string> inNgrams);

/// <summary>
/// Returns a random ngram that's currently loaded into this WeightedMarkovChain.
/// </summary>
/// <returns>A random ngram from this UnweightMarkovChain's cache of ngrams.</returns>
public string RandomNgram();

/// <summary>
/// Generates a new random string from the currently stored ngrams.
/// </summary>
/// <param name="length">
/// The length of ngram to generate.
/// Note that this is a target, not a fixed value - e.g. passing 2 when the n-gram order is 3 will
/// result in a string of length 3. Also, depending on the current ngrams this markov chain contains,
/// it may end up being cut short.
/// </param>
/// <returns>A new random string.</returns>
public string Generate(int length);
}
}

As you can see, it is rather similar to the unweighted version. Fear not however, for the differences will become more apparent shortly. The only real difference so far is the extra private WeightedRandom<string> wrandom declaration at the top of the class. Let's change that though, by filling out the constructor:

ngrams = new Dictionary<string, double>();
foreach (string ngram in inNgrams)
{
if (ngrams.ContainsKey(ngram))
ngrams[ngram]++;
else
}

Here, we read in the raw n-grams and a dictionary that represents the number of times that a given n-gram has been discovered. It's got to be a double there as the type value of the dictionary, as apparently the C♯ compiler isn't clever enough to convert a Dictionary<string, int> to a Dictionary<string, double>. Hrm. Maybe they'll fix that in the future (or if not, does anyone know why not-)?

Anyway, let's move on to RandomNgram(). Here it is:

if (wrandom.Count == 0)
wrandom.SetContents(ngrams);
return wrandom.Next();

Quite simple, right? Basically, we populate the weighted random generator if it's currently empty, and then we simply ask it for a random item. We're on a roll, here! Let's keep going with Generate(). Here's the first part:

string result = RandomNgram();
string lastNgram = result;
while(result.Length < length)
{
// ......
}

Here, we declare an accumulator-like variable result to hold the word we're generating as we construct it, and another one to holdt he last n-gram we picked. We also create a while loop to make sure we keep adding to the word until we reach the desired length (we'll be adding a stop condition just in case we run into a brick wall later). Next, let's put some code inside that while loop. First up is the (re)population of the weighted random number generator:

wrandom.ClearContents();
// The substring that the next ngram in the chain needs to start with
string nextStartsWith = lastNgram.Substring(1);
// Get a list of possible n-grams we could choose from next
Dictionary<string, double> convNextNgrams = new Dictionary<string, double>();
ngrams.Where(gram_data => gram_data.Key.StartsWith(nextStartsWith))
.ForEach((KeyValuePair<string, double> ngramData) => convNextNgrams.Add(ngramData.Key, ngramData.Value));

Ah, good ol' Linq to the rescue again! But wait, what's that ForEach() call there? I don't remember that being in core .NET! You'd be right of course, but through the power of [extension methods]() one can extend a class with an additional method that can then be used as if it were an integral part of that class, when in reality that isn't the case! Here's my definition for that ForEach() extension method I used above:

public static class LinqExtensions
{
public static void ForEach<T>(this IEnumerable<T> enumerable, Action<T> action)
{
foreach (T item in enumerable)
{
action(item);
}
}
}

Next, we need to add that stop condition we talked about earlier before I forget! Here it is:

// If there aren't any choices left, we can't exactly keep adding to the new string any more :-(
if(convNextNgrams.Count() == 0)
break;

Observant readers will notice that we haven't actually finished the (re)population process yet, so we should do that next. Once done, we can also obtain a random n-gram from the generator and process it:

wrandom.SetContents(convNextNgrams);
// Pick a random n-gram from the list
string nextNgram = wrandom.Next();
// Add the last character from the n-gram to the string we're building
result += nextNgram[nextNgram.Length - 1];
lastNgram = nextNgram;

That completes my initial weighted markov chain implementation. Here's the class in full:

using System;
using System.Collections.Generic;
using System.Linq;
using MarkovGrams.Utilities;
using SBRL.Algorithms;

namespace MarkovGrams
{
/// <summary>
/// An unweighted character-based markov chain.
/// </summary>
public class WeightedMarkovChain
{
private WeightedRandom<string> wrandom = new WeightedRandom<string>();

/// <summary>
/// The ngrams that this markov chain currently contains.
/// </summary>
Dictionary<string, double> ngrams;

/// <summary>
/// Creates a new character-based markov chain.
/// </summary>
/// <param name="inNgrams">The ngrams to populate the new markov chain with.</param>
public WeightedMarkovChain(IEnumerable<string> inNgrams)
{
ngrams = new Dictionary<string, double>();
foreach (string ngram in inNgrams)
{
if (ngrams.ContainsKey(ngram))
ngrams[ngram]++;
else
}
}

/// <summary>
/// Returns a random ngram that's currently loaded into this WeightedMarkovChain.
/// </summary>
/// <returns>A random ngram from this UnweightMarkovChain's cache of ngrams.</returns>
public string RandomNgram()
{
if (wrandom.Count == 0)
wrandom.SetContents(ngrams);
return wrandom.Next();
}

/// <summary>
/// Generates a new random string from the currently stored ngrams.
/// </summary>
/// <param name="length">
/// The length of ngram to generate.
/// Note that this is a target, not a fixed value - e.g. passing 2 when the n-gram order is 3 will
/// result in a string of length 3. Also, depending on the current ngrams this markov chain contains,
/// it may end up being cut short.
/// </param>
/// <returns>A new random string.</returns>
public string Generate(int length)
{
string result = RandomNgram();
string lastNgram = result;
while(result.Length < length)
{
wrandom.ClearContents();
// The substring that the next ngram in the chain needs to start with
string nextStartsWith = lastNgram.Substring(1);
// Get a list of possible n-grams we could choose from next
Dictionary<string, double> convNextNgrams = new Dictionary<string, double>();
ngrams.Where(gram_data => gram_data.Key.StartsWith(nextStartsWith))
.ForEach((KeyValuePair<string, double> ngramData) => convNextNgrams.Add(ngramData.Key, ngramData.Value));
// If there aren't any choices left, we can't exactly keep adding to the new string any more :-(
if(convNextNgrams.Count() == 0)
break;
wrandom.SetContents(convNextNgrams);
// Pick a random n-gram from the list
string nextNgram = wrandom.Next();
// Add the last character from the n-gram to the string we're building
result += nextNgram[nextNgram.Length - 1];
lastNgram = nextNgram;
}
wrandom.ClearContents();
return result;
}
}
}

You can find it on my private git server here, if you're interested in any future improvements I might have made to it since writing this post. Speaking of which, I've got a few in mind - mainly refactoring both this class and it's unweighted cousin to utilise lists of objects instead of strings. This way, I'll be able to apply it to anything I like - such as sentence generation, music improvisation, and more!

I'd also like to extend it such that I can specify the weights manually, giving me even more flexibility as to how I can put the engine to use.

(Found a cool use for a Markov Chain? Comment about it below!)

Deterring spammers with a comment key system

I recently found myself reimplementing the comment key system I use on this blog (I posted about it here) for a another purpose. Being more experienced now, my new implemention (which I should really put into use on this blog actually) is stand-alone in a separate file - so I'm blogging about it here both to help out anyone who reads this other than myself - and for myself as I know I'll forget otherwise :P

The basic algorithm hasn't changed much since I first invented it: take the current timestamp, apply a bunch or arbitrary transformations to it, put it as a hidden field in the comment form, and then reverse the transformations on the comment key the user submits as part of the form to discover how long they had the page loaded for. Bots will have it loaded for either less than 10-ish seconds, or more than 24 hours. Humans will be somewhere in the middle - at least according to my observations!

Of course, any determined spammer can easily bypass this system if they spend even a little bit of time analysing the system - but I'm banking on the fact that my blog is too low-value for a spammer to bother reverse-engineering my system to figure out how it works.

This time, I chose to use simple XOR encryption, followed by reversing the string, followed by base64 encoding. It should be noted that XOR encryption is certainly not secure - but in this case it doesn't really matter. If my website becomes a high-enough value target for that to matter, I'll investigate proper AES encryption - which will probably be a separate post in and of itself, as a quick look revealed that it's rather involved - and will probably require quite a bit of research and experimentation working correctly.

Let's take a look at the key generation function first:

function key_generate($pass) {$new_key = strval(time());
// Repeat the key so that it's long enough to XOR the key with
$pass_enc = str_repeat($pass, (strlen($new_key) / strlen($pass)) + 1);
$new_key =$new_key ^ $pass_enc; return base64_encode(strrev($new_key));
}

As I explained above, this first XORs the timestamp against a provided 'passcode' of sorts, and then it reverses it, base64 encodes it, and then returns it. I discovered that I needed to repeat the passcode to make sure it's at least as long as the timestamp - because otherwise it cuts the timestamp short! Longer passwords are always desirable for certain, but I wanted to make sure I addressed it here - just in case I need to lift this algorithm from here for a future project.

Next up is the decoding algorithm, that reverses the transformations we apply above:

    function key_decode($key,$pass) {
$key_dec = strrev(base64_decode($key));
// Repeat the key so that it's long enough to XOR the key with
$pass_dec = str_repeat($pass, (strlen($key_dec) / strlen($pass)) + 1);
return intval($key_dec ^$pass_dec);
}

Very similar. Again, the XOR passphrase has to be repeated to make it long enough to apply to the whole encoded key without inadvertently chopping some off the end. Additionally, we also convert the timestamp back into an integer - since it is the number of seconds since the last UNIX epoch (1st January 1970 as of the time of typing).

With the ability to create and decode keys, let's write a helper method to make the verification process a bit easier:

    function key_verify($key,$pass, $min_age,$max_age) {
$age = time() - key_decode($key, $pass); return$age >= $min_age &&$age <= \$max_age;
}`

It's fairly self-explanatory, really. It takes an encoded key, decodes it, and verifies that it's age lies between the specified bounds. Here's the code in full (it updates every time I update the code in the GitHub Gist):

(Above: The full comment key code. Can't see it? Check it out on GitHub Gist here.)

Art by Mythdael