Starbeamrainbowlabs

Stardust
Blog


Archive


Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio bash batch blog bookmarklet booting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems performance photos php pixelbot portable privacy problem solving programming problems projects prolog protocol protocols pseudo 3d python reddit redis reference release releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg technical terminal textures three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

Easy AI with Microsoft.Text.Recognizers

I recently discovered that there's an XMPP client library (NuGet) for .NET that I overlooked a few months ago, and so I promptly investigated the building of a bot!

The actual bot itself needs some polishing before I post about it here, but in writing said bot I stumbled across a perfectly brilliant library - released by Microsoft of all companies - that can be used to automatically extract common data-types from a natural-language sentence.

While said library is the underpinnings of the Azure Bot Framework, it's actually free and open-source. To that end, I decided to experiment with it - and ended up writing this blog post.

Data types include (but are not limited to) dates and times (and ranges thereof), numbers, percentages, monetary amounts, email addresses, phone numbers, simple binary choices, and more!

While it also lands you with a terrific number of DLL dependencies in your build output folder, the result is totally worth it! How about pulling a DateTime from this:

in 5 minutes

or this:

the first Monday of January

or even this:

next Monday at half past six

Pretty cool, right? You can even pull multiple things out of the same sentence. For example, from the following:

The host 1.2.3.4 has been down 3 times over the last month - the last of which was from 5pm and lasted 30 minutes

It can extract an IP address (1.2.3.4), a number (3), and a few dates and times (last month, 5pm, 30 minutes).

I've written a test program that shows it in action. Here's a demo of it working:

(Can't see the asciicast above? View it on asciinema.org)

The source code is, of course, available on my personal Git server: Demos/TextRecogniserDemo

If you can't check out the repo, here's the basic gist. First, install the Microsoft.Recognizers.Text package(s) for the types of data that you'd like to recognise. Then, to recognise a date or time, do this:

List<ModelResult> result = DateTimeRecognizer.RecognizeDateTime(nextLine, Culture.English);

The awkward bit is unwinding the ModelResult to get at the actual data. The matched text is stored in the ModelResult.Resolution property, but that's a SortedDictionary<string, object>. The interesting property inside which is value, but depending on the data type you're recognising - that can be an array too! The best way I've found to decipher the data types is to print the value of ModelResult.Resolution as a string to the console:

Console.WriteLine(result[0].Resolution.ToString());

The .NET runtime will helpfully convert this into something like this:

System.Collections.Generic.SortedDictionary`2[System.String,System.Object]

Very helpful. Then we can continue to drill down:

Console.WriteLine(result[0].Resolution["values"]);

This produces this:

System.Collections.Generic.List`1[System.Collections.Generic.Dictionary`2[System.String,System.String]]

Quite a mouthful, right? By cross-referencing this against the JSON (thanks, Newtonsoft.JSON!), we can figure out how to drill the rest of the way. I ended up writing myself a pair of little utility methods for dates and times:

public static DateTime RecogniseDateTime(string source, out string rawString)
{
    List<ModelResult> aiResults = DateTimeRecognizer.RecognizeDateTime(source, Culture.English);
    if (aiResults.Count == 0)
        throw new Exception("Error: Couldn't recognise any dates or times in that source string.");

    /* Example contents of the below dictionary:
        [0]: {[timex, 2018-11-11T06:15]}
        [1]: {[type, datetime]}
        [2]: {[value, 2018-11-11 06:15:00]}
    */

    rawString = aiResults[0].Text;
    Dictionary<string, string> aiResult = unwindResult(aiResults[0]);
    string type = aiResult["type"];
    if (!(new string[] { "datetime", "date", "time", "datetimerange", "daterange", "timerange" }).Contains(type))
        throw new Exception($"Error: An invalid type of {type} was encountered ('datetime' expected).");

    string result = Regex.IsMatch(type, @"range$") ? aiResult["start"] : aiResult["value"];
    return DateTime.Parse(result);
}

private static Dictionary<string, string> unwindResult(ModelResult modelResult)
{
    return (modelResult.Resolution["values"] as List<Dictionary<string, string>>)[0];
}

Of course, it depends on your use-case as to precisely how you unwind it, but the above should be a good starting point.

Once I've polished the bot I've written a bit, I might post about it on here.

Found this interesting? Run into an issue? Got a neat use for it? Comment below!

Using libsodium to upgrade the comment key system

I've blogged about the comment key system I utilise on this blog to prevent spam before (see also). Today, I've given it another upgrade to make it harder for spammers to fake a comment key!

In the last post, I transformed the comment key with a number of reversible operations - including a simple XOR password system. This is, of course, very insecure - especially since an attacker knows (or can at least guess) the content of the encrypted key, making it trivial (I suspect) to guess the password used for 'encryption'.

The solution here, obviously, is to utilise a better encryption system. Since it's the 5th November and I'm not particularly keen on my website going poof like the fireworks tonight, let's do something about it! PHP 7.2+ comes with native libsodium support (those still using older versions of PHP can still follow along! Simply install the PECL module). libsodium bills itself as

A modern, portable, easy to use crypto library.

After my experiences investigating it, I'd certainly say that's true (although the official PHP documentation could do with, erm, existing). I used this documentation instead instead - it's quite ironic because I have actually base64-encoded the password.......

Anyway, after doing some digging I found the quick reference, which explains how you can go about accomplishing common tasks with libsodium. For my comment key system, I want to encrypt my timestamp with a password - so I wanted the sodium_crypto_secretbox() and its associated sodium_crypto_secretbox_open() functions.

This pair of functions, when used together, implement a secure symmetric key encryption system. In other words, they securely encrypt a lump of data with a password. They do, however, have 2 requirements that must be taken care of. Firstly, the password must be of a specific length. This is easy enough to accomplish, as PHP is kind enough to tell us how long it needs to be:

/**
 * Generates a new random comment key system password.
 * @return string   The base64-encoded password.
 */
function key_generate_password() {
    return base64_encode_safe(random_bytes(SODIUM_CRYPTO_SECRETBOX_KEYBYTES));
}

Easy-peasy! base64_encode_safe isn't a built-in function - it's a utility function I wrote that we'll need later. For consistency, I've used it here too. Here it is, along with its counterpart:

/**
 * Encodes the given data with base64, but replaces characters that could 
 * get mangled in transit with safer equivalents.
 * @param   mixed   $data   The data to encode.
 * @return  string  The data, encoded as transmit-safe base64.
 */
function base64_encode_safe($data) {
    return str_replace(["/", "+"], ["-", "_"], base64_encode($data));
}
/**
 * Decodes data encoded with base64_encode_safe().
 * @param   mixed   $base64     The data to decode.
 * @return  string  The data, decoded from transmit-safe base64.
 */
function base64_decode_safe($base64) {
    return base64_decode(str_replace(["-", "_"], ["/", "+"], $base64));
}

With that taken care of, we can look at the other requirement: a nonce. Standing for Number used ONCE, it's a sequence of random bytes that's used by the encryption algorithm. We don't need to keep it a secret, but we do need to to decrypt the data again at the other end, in addition to the password - and we do need to ensure that we generate a new one for every comment key. Thankfully, this is also easy to do in a similar manner to generating a password:

$nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);

With everything in place, we can look at upgrading the comment key generator itself. It looks much simpler now:

/**
 * Generates a new comment key.
 * Note that the 'encryption' used is not secure - it's just simple XOR!
 * It's mainly to better support verification in complex setups and 
 * serve as a nother annoying hurdle for spammers.
 * @param  string $pass The password to encrypt with. Should be a base64-encoded string from key_generate_password().
 * @return string       A new comment key stamped at the current time.
 */
function key_generate($pass) {
    $pass = base64_decode_safe($pass);
    $nonce = random_bytes(SODIUM_CRYPTO_SECRETBOX_NONCEBYTES);
    $encrypt = sodium_crypto_secretbox(
        strval(time()), // The thing we want to encrypt
        $nonce, // The nonce
        $pass // The password to encrypt with
    );

    return base64_encode_safe($nonce.$encrypt);
}

I bundle the nonce with the encrypted data here to ensure that the system continues to be stateless (i.e. we don't need to store any state information about a user on the server). I also encode the encrypted string with base64, as the encrypted strings contain lots of nasty characters (it's actually a binary byte array I suspect). This produces keys like this:

BOqDvr26XY9s8PhlmIZMIp6xCOZyfsh6J05S4Jp2cY3bL8ccf_oRgRMldrmzKk6RrnA=
Tt8H81tkJEqiJt-RvIstA_vz13LS8vjLgriSAvc1n5iKwHuEKjW93IMITikdOwr5-NY=
5NPvHg-l1GgcQ9ZbThZH7ixfKGqAtSBr5ggOFbN_wFRjo3OeJSWAcFNhQulYEVkzukI=

They are a bit long, but not unmanageable. In theory, I could make it a bit shorter by avoiding converting the integer output from time() to a string, but in my testing it didn't make much difference. I suspect that there's some sort of minimum length to the output string for some (probably good) reason.


php > var_dump(sodium_crypto_secretbox(time(), random_bytes(24), random_bytes(32)));
string(26) "GW$:���ߌ@]�+1b��������d%"
php > var_dump(sodium_crypto_secretbox(strval(time()), random_bytes(24), random_bytes(32)));
string(26) ":_}0H �E°9��Kn�p��ͧ��"

Now that we're encrypting comment keys properly, it isn't much good unless we can decrypt them as well! Let's do that too. The first step is to decode the base64 and re-split the nonce from the encrypted data:

$pass = base64_decode_safe($pass);
$key_enc_raw = base64_decode_safe($key);
$nonce = mb_substr($key_enc_raw, 0, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, "8bit");
$key_enc = mb_substr($key_enc_raw, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, null, "8bit");

This is derived from the example code. With the 2 separated once more, we can decrypt the key:

$key_dec = sodium_crypto_secretbox_open($key_enc, $nonce, $pass);

Apparently, according to the example code on the website I linked to above, if the return value isn't a string then the decryption failed. We should make sure we handle that when returning:

if(!is_string($key_dec))
    return null;
return intval($key_dec);

That completes the decryption code. Here is in full:

/**
 * Decodes the given key.
 * @param  string $key  The key to decode.
 * @param  string $pass The password to decode it with.
 * @return int|null     The time that the key was generated, or null if the provided key was invalid.
 */
function key_decode($key, $pass) {
    $pass = base64_decode_safe($pass);
    $key_enc_raw = base64_decode_safe($key);
    $nonce = mb_substr($key_enc_raw, 0, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, "8bit");
    $key_enc = mb_substr($key_enc_raw, SODIUM_CRYPTO_SECRETBOX_NONCEBYTES, null, "8bit");
    $key_dec = sodium_crypto_secretbox_open($key_enc, $nonce, $pass);
    if(!is_string($key_dec)) return null;
    return intval($key_dec);
}

Explicitly returning null in key_decode() requires a small change to key_verify(), in order to prevent it from thinking that a key is really old if the decryption fails (null is treated as 0 in arithmetic operations, apparently). Let's update key_verify() to handle that:

/**
 * Verifies a key.
 * @param   string  $key        The key to decode and verify.
 * @param   string  $pass       The password to decode the key with.
 * @param   int     $min_age    The minimum required age for the key.
 * @param   int     $max_age    The maximum allowed age for the key.
 * @return  bool                Whether the key is valid or not.
 */
function key_verify($key, $pass, $min_age, $max_age) {
    $key_time = key_decode($key, $pass);
    if($key_time == null) return false;
    $age = time() - $key_time;
    return $age >= $min_age && $age <= $max_age;
}

A simple check is all that's needed. With that, the system is all updated! Time to generate a new password with the provided function and put it to use. You can do that directly from the PHP REPL (php -a):


php > require("lib/comment_system/comment_key.php");
php > var_dump(key_generate_password());                                        string(44) "S0bJpph5htdaKHZdVfo9FB6O4jOSzr3xZZ94c2Qcn44="

Obviously, this isn't the password I'm using on my blog :P

I've got a few more ideas I'd like to play around with to deter spammers even more whilst maintaining the current transparency - so you may see another post soon on this subject :-)

With everything complete, here's the complete script:

(Above: The full comment key system code. Can't see it? Check it out on GitHub Gist here.)

Found this useful? Got an improvement? Confused about something? Comment below!

Sources and Further Reading

Markov Chains Part 4: Test Data

With a shiny-new markov chain engine (see parts 1, 2, and 3), I found that I had a distinct lack of test data to put through it. Obviously this was no good at all, so I decided to do something about it.

Initially, I started with a list of HTML colours (direct link; 8.6KiB), but that didn't produce very good output:

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --length 16
errobiartrawbear
frelecteringupsy
lectrictomadolbo
vendellorazanigh
arvanginklectrit
dighoonbottlaven
onadeestersweese
ndiu
llighoolequorain
indeesteadesomiu

I see a few problems here. Firstly, it's treating each word as it's entity, where in fact I'd like it to generate n-grams on a line-by-line basis. Thankfully, this is easy enough with my new --no-split option:

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --no-split --length 16
med carrylight b
jungin pe red dr
ureelloufts blue
uamoky bluellemo
trinaterry aupph
utatellon reep g
radolitter brast
bian reep mardar
ght burnse greep
atimson-phloungu

Hrm, that's still rather unreadable. What if we make the n-grams longer by bumping the order?

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --length 16 --order 4
on fuchsia blue 
rsity of carmili
e blossom per sp
ngel
ulean red au lav
as green yellowe
indigri
ly gray aspe
disco blus
berry pine blach

Better, but it looks like it's starting the generation process from inside the middle of words. We can fix that with my new --start-uppercase option, which ensures that each output always stars with an n-gram that begins with a capital letter. Unfortunately the wordlist is all lowercase:

air force blue
alice blue
alizarin crimson
almond
amaranth
amber
american rose
amethyst
android green
anti-flash white

This is an issue. The other problem is that with an order of 4 the choice-point ratio is dropping quite low - I got a low of just ~0.97 in my testing.

The choice-point ratio is a measure I came up with of the average number of different directions the engine could potential go in at each step of the generation process. I'd like to keep this number consistently higher than 2, at least - to ensure a good variety of output.

Greener Pastures

Rather than try fix that wordlist, let's go in search of something better. It looks like the CrossCode Wiki has a page that lists all the items in the entire game. That should do the trick! The only problem is extracting it though. Let's use a bit of bash! We can use curl to download the HTML of the page, and then xidel to parse out the text from the <a> tags inside tables. Here's what I came up with:

curl https://crosscode.gamepedia.com/Items | xidel --data - --css "table a"

This is a great start, but we've got blank lines in there, and the list isn't sorted alphabetically (not required, but makes it look nice :P). Let's fix that:

curl https://crosscode.gamepedia.com/Items | xidel --data - --css "table a" | awk "NF > 0" | sort

Very cool. Tacking wc -l on the end of the pipe chain I can we've got ourselves a list of 527(!) items! Here's a selection of input lines:

Rough Branch
Raw Meat
Shady Monocle
Blue Grass
Tengu Mask
Crystal Plate
Humming Razor
Everlasting Amber
Tracker Chip
Lawkeeper's Fist

Let's run it through the engine. After a bit of tweaking, I came up with this:

cat wordlists/Cross-Code-Items.txt | MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --start-uppercase --no-split --length 16 --order 3
Capt Keboossauci
Fajiz Keblathfin
King Steaf Sharp
Stintze Geakralt
Fruisty 'olipe F
Apper's TN
Prow Peptumn's C
Rus Recreetan Co
Veggiel Spiragma
Laver's Bolden M

That's quite interesting! With a choice-point ratio of ~5.6 at an order of 3, we've got a nice variable output. If we increase the order to 4, we get ~1.5 - ~2.3:

Edgy Hoo
Junk Petal Goggl
Red Metal Need C
Samurai Shel
Echor
Krystal Wated Li
Sweet Residu
Raw Stomper Thor
Purple Fruit Dev
Smokawa

It appears to be cutting off at the end though. Not sure what we can do about that (ideas welcome!). This looks interesting, but I'm not done yet. I'd like it to work on word-level too!

Going up a level

After making some pretty extensive changes, I managed to add support for this. Firstly, I needed to add support for word-level n-gram generation. Currently, I've done this with a new GenerationMode enum.

public enum GenerationMode
{
    CharacterLevel,
    WordLevel
}

Under the hood I've just used a few if statements. Fortunately, in the case of the weighted generator, only the bottom method needed adjusting:

/// <summary>
/// Generates a dictionary of weighted n-grams from the specified string.
/// </summary>
/// <param name="str">The string to n-gram-ise.</param>
/// <param name="order">The order of n-grams to generate.</param>
/// <returns>The weighted dictionary of ngrams.</returns>
private static void GenerateWeighted(string str, int order, GenerationMode mode, ref Dictionary<string, int> results)
{
    if (mode == GenerationMode.CharacterLevel) {
        for (int i = 0; i < str.Length - order; i++) {
            string ngram = str.Substring(i, order);
            if (!results.ContainsKey(ngram))
                results[ngram] = 0;
            results[ngram]++;
        }
    }
    else {
        string[] parts = str.Split(" ".ToCharArray());
        for (int i = 0; i < parts.Length - order; i++) {
            string ngram = string.Join(" ", parts.Skip(i).Take(order)).Trim();
            if (ngram.Trim().Length == 0) continue;
            if (!results.ContainsKey(ngram))
                results[ngram] = 0;
            results[ngram]++;
        }
    }
}

Full code available here. After that, the core generation algorithm was next. The biggest change - apart from adding a setting for the GenerationMode enum - was the main while loop. This was a case of updating the condition to count the number of words instead of the number of characters in word mode:

(Mode == GenerationMode.CharacterLevel ? result.Length : result.CountCharInstances(" ".ToCharArray()) + 1) < length

A simple ternary if statement did the trick. I ended up tweaking it a bit to optimise it - the above is the end result (full code available here). Instead of counting the words, I count the number fo spaces instead and add 1. That CountCharInstances() method there is an extension method I wrote to simplify things. Here it is:

public static int CountCharInstances(this string str, char[] targets)
{
    int result = 0;
    for (int i = 0; i < str.Length; i++) {
        for (int t = 0; t < targets.Length; t++)
            if (str[i] == targets[t]) result++;
    }
    return result;
}

Recursive issues

After making these changes, I needed some (more!) test data. Inspiration struck: I could run it recipe names! They've quite often got more than 1 word, but not too many. Searching for such a list proved to be a challenge though. My first thought was BBC Food, but their terms of service disallow scraping :-(

A couple of different websites later, and I found the Recipes Wikia. Thousands of recipes, just ready and waiting! Time to get to work scraping them. My first stop was, naturally, the sitemap (though how I found in the first place I really can't remember :P).

What I was greeted with, however, was a bit of a shock:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p1.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p2.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p3.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p4.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p5.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p6.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p7.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p8.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p9.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_14-p1.xml</loc></sitemap>
<sitemap><loc>https://services.wikia.com/discussions-sitemap/sitemap/3355</loc></sitemap>
</sitemapindex>
<!-- Generation time: 26ms -->
<!-- Generation date: 2018-10-25T10:14:26Z -->

Like who has a sitemap of sitemaps, anyways?! We better do something about this: Time for some more bash! Let's start by pulling out those sitemaps.

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc"

Easy peasy! Next up, we don't want that bottom one - as it appears to have a bunch of discussion pages and other junk in it. Let's strip it out before we even download it!

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0

With a list of sitemaps extract from the sitemap (completely coconuts I tell you) extracted, we need to download them all in turn and extract the page urls therein. This is, unfortunately, where it starts to get nasty. While a simple xargs call downloads them all easily enough (| xargs -n1 -I{} curl "{}" should do the trick), this outputs them all to stdout, and makes it very difficult for us to parse them.

I'd like to avoid shuffling things around on the file system if possible, as this introduces further complexity. We're not out of options yet though, as we can pull a subshell out of our proverbial hat:

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 | xargs -n1 -I{} sh -c 'curl {} | xidel --data - --css "loc"'

Yay! Now we're getting a list of urls to all the pages on the entire wiki:

http://recipes.wikia.com/wiki/Mexican_Black_Bean_Soup
http://recipes.wikia.com/wiki/Eggplant_and_Roasted_Garlic_Babakanoosh
http://recipes.wikia.com/wiki/Bathingan_bel_Khal_Wel_Thome
http://recipes.wikia.com/wiki/Lebanese_Tabbouleh
http://recipes.wikia.com/wiki/Lebanese_Hummus_Bi-tahini
http://recipes.wikia.com/wiki/Baba_Ghannooj
http://recipes.wikia.com/wiki/Lebanese_Falafel
http://recipes.wikia.com/wiki/Lebanese_Pita_Bread
http://recipes.wikia.com/wiki/Kebab_Koutbane
http://recipes.wikia.com/wiki/Moroccan_Yogurt_Dip

One problem though: We want recipes names, not urls! Let's do something about that. Our next special guest that inhabits our bottomless hat is the illustrious sed. Armed with the mystical power of find-and-replace, we can make short work of these urls:

... | sed -e 's/^.*\///g' -e 's/_/ /g'

The rest of the command is omitted for clarity. Here I've used 2 sed scripts: One to strip everything up to the last forward slash /, and another to replace the underscores _ with spaces. We're almost done, but there are a few annoying hoops left to jump through. Firstly, there are A bunch of unfortunate escape sequences lying around (I actually only discovered this when the engine started spitting out random ones :P). Also, there are far too many page names that contain the word Nutrient, oddly enough.

The latter is easy to deal with. A quick grep sorts it out:

... | grep -iv "Nutrient"

The former is awkward and annoying. As far as I can tell, there's no command I can call that will decode escape sequences. To this end, I wound up embedding some Python:

... | python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])"

This makes the whole thing much more intimidating that it would otherwise be. Lastly, I'd really like to sort the list and save it to a file. Compared to the above, this is chicken feed!

| sort >Dishes.txt

And there we have it. Bash is very much like lego bricks when you break it down. The trick is to build it up step-by-step until you've got something that does what you want it to :)

Here's the complete command:

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 | xargs -n1 -I{} sh -c 'curl {} | xidel --data - --css "loc"' | sed -e 's/^.*\///g' -e 's/_/ /g' | python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])" | grep -iv "Nutrient" | sort >Dishes.txt

After all that effort, I think we deserve something for our troubles! With ~42K(!) lines in the resulting file (42,039 to be exact as of the last time I ran the monster above :P), the output (after some tweaking, of course) is pretty sweet:

cat wordlists/Dishes.txt | mono --debug MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --words --start-uppercase --length 8
Lemon Lime Ginger
Seared Tomatoes and Summer Squash
Cabbage and Potato au
Endive stuffed with Lamb and Winter Vegetable
Stuffed Chicken Breasts in Yogurt Turmeric Sauce with
Blossoms on Tomato
Morning Shortcake with Whipped Cream Graham
Orange and Pineapple Honey
Mango Raspberry Bread
Tempura with a Southwestern
Rice Florentine with
Cabbage Slaw with Pecans and Parmesan
Pork Sandwiches with Tangy Sweet Barbecue
Tea with Lemongrass and Chile Rice
Butterscotch Mousse Cake with Fudge
Fish and Shrimp -
Cucumber Salad with Roast Garlic Avocado
Beans in the Slow
Apple-Cherry Vinaigrette Salad
California Avocado Chinese Chicken Noodle Soup with Arugula

...I really need to do something about that cutting off issue. Other than that, I'm pretty happy with the result! The choice-point ratio is really variable, but most of the time it's hovering around ~2.5-7.5, which is great! The output if I lower the order from 3 to 2 isn't too bad either:

Salata me Htapodi kai
Hot 'n' Cheese Sandwich with Green Fish and
Poisson au Feuilles de Milagros
Valentines Day Cookies with Tofu with Walnut Rice
Up Party Perfect Turkey Tetrazzini
Olives and Strawberry Pie with Iceberg Salad with
Mashed Sweet sauce for Your Mood with Dried
Zespri Gold Corn rice tofu, and Corn Roasted
California Avocado and Rice Casserole with Dilled Shrimp
Egyptian Tomato and Red Bell Peppers, Mango Fandango

This gives us a staggering average choice-point ratio of ~125! Success :D

One more level

After this, I wanted to push the limits of the engine, so see what it's capable of. The obvious choice here is Shakespeare's Complete Works (~5.85MiB). Pushing this through the engine required some work, as ~30 seconds is far too slow - namely optimising the pipeline as much as possible.

The Mono Profiler helped a lot here. With it, I discovered that string.StartsWith() is really slow. Like, ridiculously slow (though this is relative, since I'm calling it hundreds of thousand of times), as it's culture-aware. In our case, we can't be bothering with any of that, as it's not relevant anyway. The easiest solution is to write another extension method:

public static bool StartsWithFast(this string str, string target) {
    if (str.Length < target.Length) return false;
    return str.Substring(0, target.Length) == target;
}

string.Substring() is faster, so by utilising this instead of the regular string.StartsWith() yields us a perfectly gigantic boost! Next up, I noticed that I can probably parallelize the Linq query that builds the list of possible n-grams we can choose from next, so that it runs on all the CPU cores:

Parallel.ForEach(ngrams, (KeyValuePair<string, double> ngramData) => {
    if (!ngramData.Key.StartsWithFast(nextStartsWith)) return;
    if (!convNextNgrams.TryAdd(ngramData.Key, ngramData.Value))
        throw new Exception("Error: Failed to add to staging ngram concurrent dictionary");
});

Again, this netted a another huge gain. With this and a few other architectural changes, I was able to chop the time down to a mere ~4 seconds (for a standard 10 words)! In the future, I might experiment with selective unmanaged code via the unsafe keyword to see if I can do any better.

For now, it's fast enough to enjoy some random Shakespeare on-demand:

What should they since that to that tells me Nero
He doth it turn and and too, gentle
Ha! you shall not come hither; since that to provoke
ANTONY. No further, sir; a so, farewell, Sir
Bona, joins with
Go with your fingering,
From fairies and the are like an ass which is
The very life-blood of our blood, no, not with the
Alas, why is here-in which
Transform'd and weak'ned? Hath Bolingbroke

Very interesting. The choice-point ratios sit at ~20 and ~3 for orders 3 and 4 respectively, though I got as high as 188 for an order of 3 during my testing. Definitely plenty of test data here :P

Conclusion

My experiments took me to several other places - which, if I included them all here, would result in an post much much longer than this! I scripted the download of several other wordlists in download.sh (direct link, 4.2KiB), if you're interested, with ready-downloaded copies in the wordlists folder of the repository.

I would advise reading the table in the README that gives credit to where I sourced each list, because of course I didn't put any of them together myself - I just wrote the script :P

Particularly of note is the Starbound list, which contains a list of all the blocks and items from the game Starbound. I might post about that one separately, as it ended up being a most interesting adventure.

In the future, I'd like to look at implementing a linguistic drift algorithm, to try and improve the output of the engine. The guy over at Here Dragons Abound has a great post on the subject, which you should definitely read if you're interested.

Found this interesting? Got an idea for another wordlist I can push though my engine? Confused by something? Comment below!

Sources and Further Reading

Disassembling .NET Assemblies with Mono

As part of the Component-Based Architectures module on my University course, I've been looking at what makes the .NET ecosystem tick, and how .NET assemblies (i.e. .NET .exe / .dll files) are put together. In the process, we looked as disassembling .NET assemblies into the text-form of the Common Intermediate Language (CIL) that they contain. The instructions on how to do this were windows-specific though - so I thought I'd post about the process on Linux and other platforms here.

Our tool of choice will be Mono - but before we get to that we'll need something to disassemble. Here's a good candidate for the role:

using System;

namespace SBRL.Demo.Disassembly {
    static class Program {
        public static void Main(string[] args) {
            int a = int.Parse(Console.ReadLine()), b = 10;
            Console.WriteLine(
                "{0} + {1} = {2}",
                a, b,
                a + b
            );
        }
    }
}

Excellent. Let's compile it:

csc Program.cs

This should create a new Program.exe file in the current directory. Before we get to disassembling it, it's worth mentioning how the compilation and execution process works in .NET. It's best explained with the aid of a diagram:

Left-to-right flowchart: Multiple source languages get compiled into Common Intermediate Language, which is then executed by an execution environment.

As is depicted in the diagram above, source code in multiple languages get compiled (maybe not with the same compiler, of course) into Common Intermediate Language, or CIL. This CIL is then executed in an Execution Environment - which is usually a virtual machine (Nope! not as in Virtual Box and KVM. It's not a separate operating system as such, rather than a layer of abstraction), which may (or may not) decide to compile the CIL down into native code through a process called JIT (Just-In-Time compilation).

It's also worth mentioning here that the CIL code generated by the compiler is in binary form, as this take up less space and is (much) faster for the computer to operate on. After all, CIL is designed to be efficient for a computer to understand - not people!

We can make it more readable by disassembling it into it's textual equivalent. Doing so with Mono is actually quite simple:

monodis Program.exe >Program.il

Here I redirect the output to a file called Program.il for convenience, as my editor has a plugin for syntax-highlighting CIL. For those reading without access to Mono, here's what I got when disassembling the above program:

.assembly extern mscorlib
{
  .ver 4:0:0:0
  .publickeytoken = (B7 7A 5C 56 19 34 E0 89 ) // .z\V.4..
}
.assembly 'Program'
{
  .custom instance void class [mscorlib]System.Runtime.CompilerServices.CompilationRelaxationsAttribute::'.ctor'(int32) =  (01 00 08 00 00 00 00 00 ) // ........

  .custom instance void class [mscorlib]System.Runtime.CompilerServices.RuntimeCompatibilityAttribute::'.ctor'() =  (
        01 00 01 00 54 02 16 57 72 61 70 4E 6F 6E 45 78   // ....T..WrapNonEx
        63 65 70 74 69 6F 6E 54 68 72 6F 77 73 01       ) // ceptionThrows.

  .custom instance void class [mscorlib]System.Diagnostics.DebuggableAttribute::'.ctor'(valuetype [mscorlib]System.Diagnostics.DebuggableAttribute/DebuggingModes) =  (01 00 07 01 00 00 00 00 ) // ........

  .hash algorithm 0x00008004
  .ver  0:0:0:0
}
.module Program.exe // GUID = {D6162DAD-AD98-45B3-814F-C646C6DD7998}

.namespace SBRL.Demo.Disassembly
{
  .class private auto ansi beforefieldinit Program
    extends [mscorlib]System.Object
  {

    // method line 1
    .method public static hidebysig 
           default void Main (string[] args)  cil managed 
    {
        // Method begins at RVA 0x2050
    .entrypoint
    // Code size 47 (0x2f)
    .maxstack 5
    .locals init (
        int32   V_0,
        int32   V_1)
    IL_0000:  nop 
    IL_0001:  call string class [mscorlib]System.Console::ReadLine()
    IL_0006:  call int32 int32::Parse(string)
    IL_000b:  stloc.0 
    IL_000c:  ldc.i4.s 0x0a
    IL_000e:  stloc.1 
    IL_000f:  ldstr "{0} + {1} = {2}"
    IL_0014:  ldloc.0 
    IL_0015:  box [mscorlib]System.Int32
    IL_001a:  ldloc.1 
    IL_001b:  box [mscorlib]System.Int32
    IL_0020:  ldloc.0 
    IL_0021:  ldloc.1 
    IL_0022:  add 
    IL_0023:  box [mscorlib]System.Int32
    IL_0028:  call void class [mscorlib]System.Console::WriteLine(string, object, object, object)
    IL_002d:  nop 
    IL_002e:  ret 
    } // end of method Program::Main

    // method line 2
    .method public hidebysig specialname rtspecialname 
           instance default void '.ctor' ()  cil managed 
    {
        // Method begins at RVA 0x208b
    // Code size 8 (0x8)
    .maxstack 8
    IL_0000:  ldarg.0 
    IL_0001:  call instance void object::'.ctor'()
    IL_0006:  nop 
    IL_0007:  ret 
    } // end of method Program::.ctor

  } // end of class SBRL.Demo.Disassembly.Program
}

Very interesting. There are a few things of note here:

  • The metadata at the top of the CIL tells the execution environment a bunch of useful things about the assembly, such as the version number, the classes contained within (and their signatures), and a bunch of other random attributes.
  • An extra .ctor method has been generator for us automatically. It's the class' constructor, and it automagically calls the base constructor of the object class, since all classes are descended from object.
  • The ints a and b are boxed before being passed to Console.WriteLine. Exactly what this does and why is quite complicated, and best explained by this Stackoverflow answer.
  • We can deduce that CIL is a stack-based language form the add instruction, as it has no arguments.

I'd recommend that you explore this on your own with your own test programs. Try changing things and see what happens!

  • Try making the Program class static
  • Try refactoring the int.Parse(Console.ReadLine()) into it's own method. How is the variable returned?

This isn't all, though. We can also recompile the CIL back into an assembly with the ilasm code:

ilasm Program.il

This makes for some additional fun experiments:

  • See if you can find where b's value is defined, and change it
  • What happens if you alter the Console.WriteLine() format string so that it becomes invalid?
  • Can you get ilasm to reassemble an executable into a .dll library file?

Found this interesting? Discovered something cool? Comment below!

Sources and Further Reading

Geolocation Strategies

Surprisingly often you'll find yourself needing to know the physical location of a user - be it to calculate which time-zone they're in, when the sun sets, or even which building they're in for a guided tour. Such a question has multiple different answers and ways of approaching it, so I decided to blog about the ones I can think of for future reference.

Location-By-IP-Address

The first method that comes to mind is doing a lookup based on the user's IP address. This gives city / area / country level accuracy, which should be good enough for sunrise and sunset times, time-zone information, and so on. Several websites exist that provide this as a service - ipinfo.io comes to mind. You can even do this in the terminal:

curl https://ipinfo.io/ | jq --raw-output '.loc'

(You'll need jq installed to use this. Most Linux distributions come with it in their repositories!)

If a web service doesn't suit you, then several downloadable databases exist. Most you have to pay for to get the 'full' (and most accurate) version, but the free versions available seem to be pretty good up to country / time-zone level. Here are a few I've found:

Precision Matters

If you need something more precise (say for a driving-tracking app or something), then GPS is usually the way to go, although phones are the only device that commonly carry a sensor. Despite the name, GPS actually refers to just a single satellite constellation that was launched by the USA.

Others exist too, of course - Russia have GLONASS, China has BeiDou, and the EU now has Galileo just to name a few. Most phones will support at least 2 from this list out-of-the-box. There's a great app available for Android if you'd like to play around and explore your phone's GPS capabilities.

Utilising a device's GPS support is usually just a case of finding the appropriate API (and, in some cases, permissions) for your environment. For the web, there's the Geolocation API (note that it's for HTTPS websites only). On Android, there's android.location apparently.

Bring in the big data

GPS can be slow at getting a fix, and drain your users' battery. It also has issues in indoor environments. Thankfully, there's an alternative: WiFi-based location mapping. Though it only works where there are WiFi hotspots about (urban areas are best), it is much faster and doesn't use nearly as much power.

Pioneered by Google with their Google Play Services Location API, it sends a list of the strongest WiFi access points (their MAC addresses & signal strengths) that the device can detect off to a remote server, and, after consulting it's massive database of WiFi access points and their locations, it responds with a triangulation of the user's approximate location.

Of course, you probably don't want to be giving Google your location on a regular basis just to save battery power. To this end, Mozilla (the company behind Firefox!) has built their own crowd-sourced location service. Creatively named the Mozilla Location Service, it's continually improved by users of Firefox on their Phones who submit their location along with the detected WiFi networks for analysis on a semi-regular (and automated) basis. There's even a map showing the current coverage!

If WiFi isn't for you, then don't despair! The mobile network (1/2/3/4G) cell towers can also be used - if you've got the right kind of device (sorry, desktops). Databases of cell tower locations exist that can be used to triangulate a user's approximate position.

If you've got a limited indoor area that's already got WiFi that you need to cover, it shouldn't be too time-consuming to manually construct your own database instead.

For example, in a tour-guide Android app you could record the signal strengths in the pre-set locations that you want to take people in your tour, and use a Neural Network (libraries exist!) to decide which one the user is closest to by comparing the signal strengths & MAC addresses of the top 3 WiFi access points in the area.

Bluetooth to the rescue!

If that's still not good enough (say if there aren't any WiFi networks around), you'll probably have to start deploying some of your own infrastructure. Bluetooth beacons are a good place to start. In short, a bluetooth beacon is simply a Bluetooth device that advertises it's presence on a regular basis, along with both it's actual position and the transmitting power that it is currently using to advertise itself. By triangulating the signal from 2 or more beacons, the user's position can be determined.

Of course, this requires that you have a bunch of Bluetooth beacons to hand, which aren't exactly cheap! You could, of course, build your own with an Arduino, but there are still the parts to buy there.

Other worthy contenders

That just about covers all the main geolocation methods I can think of, though there are few others I've come across that sound interesting - but aren't particularly useful unless you've got some specific circumstances.

The first of these is eLoran. Similar in structure to GPS, eLoran has ground-based transmitter stations instead of space-based satellites. It's mainly used by ships at sea and aeroplanes, but I'm sure that it's used elsewhere too. Of course, you need a special receiver chip (and aerial, I should imagine) to use it.

Secondly, there's some research ongoing into utilising LoRa gateways to geolocate a device. I've blogged about LoRa before (and might do again now that I'm actually getting really close to having a pair for RFM95 chips working!) here and also here. There's a blog post (not written by me) explaining geolocation via LoRa here, but it's basically the same kind of thing as above with WiFi, GPS, eLoran, and mobile cell towers.

Conclusion

We've covered practically every geolocation method I can think of - from coarse IP Address-based solutions to much finer GPS and WiFi-based technologies. There are certainly more of them than I thought - each with its own benefits and drawbacks, making different technologies useful in different situations.

Found this interesting? Got another method I haven't thought of? Comment below!

Goodbye, Flash?

You're probably already aware that Adobe announced that they are ending support for flash at the end of 2020. You're browser has probably already disabled the plugin by default. This is good news for the web as a whole, but what about everything left behind?

What if I said that at the end of 2020 we'll lose an entire chapter of the Internet's history? I'm talking about the masses of content on the internet that's powered by flash. While the Internet Archive probably has a considerable portion of ti archived already, that's only the beginning of the issues at hand.

For example, much of the flash content of yore had what's called a site lock, which prevents it from working on any site other than specifically-approved domains. This was very useful in prevent thieves from taking off with valuable content, setting up a competing website, and earning tons of ad revenue from it, but it's a hindrance to archiving efforts.

The other problem at hand is that if Adobe end support for flash, it isn't going to stick around in our browsers for long after that - so we'll lose the ability to run all this flash content that makes up such a huge part of the Internet's history.

A scary thought, to be sure! Thankfully, all is not lost! Dedicated volunteers have set up a project called Flashpoint, which serves to not only archive all this content, but build a series fo programs that will allow people for generations to come to run said content - through the use of clever hacks such as an intelligent local proxy, and tools such as the standalone projector version of flash, Pale Moon, and other such software.

Flashpoint's logo. I don't own it!

If you're interested in checking it out, I suggest heading over to Flashpoint's website and downloading the Infinity version - this downloads the content you wish to run dynamically - instead of all at once (totalling ~34GiB O.o). There's even a master list (also here) listing everything archived so far.

C# & .NET Terminology Demystified: A Glossary

After my last glossary post on LoRa, I thought I'd write another one of C♯ and .NET, as (in typical Microsoft fashion it would seem), they're seems to be a lot of jargon floating around whose meaning is not always obvious.

If you're new to C♯ and the .NET ecosystems, I wouldn't recommend tackling all of this at once - especially the bottom ~3 definitions - with those in particular there's a lot to get your head around.

C♯

C♯ is an object-oriented programming language that was invented by Microsoft. It's cross-platform, and is usually written in an IDE (Integrated Development Environment), which has a deeper understanding of the code you write than a regular text editor. IDEs include Visual Studio (for Windows) and MonoDevelop (for everyone else).

Solution

A Solution (sometimes referred to as a Visual Studio Solution) is the top-level definition of a project, contained in a file ending in .sln. Each solution may contain one or more Project Files (not to be confused with the project you're working on itself), each of which gets compiled into a single binary. Each project may have its own dependencies too: whether they be a core standard library, another project, or a NuGet package.

Project

A project contains your code, and sits 1 level down from a solution file. Normally, a solution file will sit in the root directory of your repository, and the projects will each have their own sub-folders.

While each project has a single output file (be that a .dll class library or a standalone .exe executable), a project may have multiple dependencies - leading to many files in the build output folder.

The build process and dependency definitions for a project are defined in the .csproj file. This file is written in XML, and can be edited to perform advanced build steps, should you need to do something that the GUI of your IDE doesn't support. I've blogged about the structuring of this file before (see here, and also a bit more here), should you find yourself curious.

CIL

Known as Common Intermediate Language, CIL is the binary format that C♯ (also Visual Basic and F♯ code) code gets compiled into. From here, the .NET runtime (on Windows) or Mono (on macOS, Linux, etc.) can execute it to run the compiled project.

MSBuild

The build system for Solutions and Projects. It reads a .sln or .csproj (there are others for different languages, but I won't list them here) file and executes the defined build instructions.

.NET Framework

The .NET Framework is the standard library of C♯ it provides practically everything you'll need to perform most common tasks. It does not provide a framework for constructing GUIs and Graphical Interfaces. You can browse the API reference over at the official .NET API Browser.

WPF

The Windows Presentation Foundation is a Windows-only GUI framework. Powered by XAML (eXtensible Application Markup Language) definitions of what the GUI should look like, it provides everything you need to create a native-looking GUI on Windows.

It does not work on macOS and Linux. To create a cross-platform program that works on all 3 operating systems, you'll need to use an alternative GUI framework, such as XWT or Gtk# (also: Glade). A more complete list of cross-platform frameworks can be found here. It's worth noting that Windows Forms, although a tempting option, aren't as flexible as the other options listed here.

C♯ 7

The 7th version of the C♯ language specification. This includes the syntax of the language, but not the .NET Framework itself.

.NET Standard

A specification of the .NET Framework, but not the C♯ Language. As of the time of typing, the latest version is 2.0, although version 1.6 is commonly used too. The intention here is the improve cross-platform portability of .NET programs by defining a specification for a subset of the full .NET Framework standard library that all platforms will always be able to use. This includes Android and iOS through the use of Xamarin.

Note that all .NET Standard projects are class libraries. In order to create an executable, you'll have to add an additional Project to your Solution that references your .NET Standard class library.

ASP.NET

A web framework for .NET-based programming languages (in our case C♯). Allows you to write C♯ code to handle HTTP (and now WebSockets) requests in a similar manner to PHP, but different in that your code still needs compiling. Compiled code is then managed by a web server IIS web server (on Windows).

With the release of .NET Core, ASP.NET is now obsolete.

.NET Core

Coming in 2 versions so far (1.0 and 2.0), .NET Core is the replacement for ASP.NET (though this is not its exclusive purpose). As far as I understand it, .NET Core is a modular runtime that allows programs targeting it to run multiple platforms. Such programs can either be ASP.NET Core, or a Universal Windows Platform application for the Windows Store.

This question and answer appears to have the best explanation I've found so far. In particular, the displayed diagram is very helpful:

A diagram showing the structure of the .NET ecosystem, including .NET Core. See the links in the sources and further reading below for more information.

....along with the pair of official "Introducing" blog posts that I've included in the Sources and Further Reading section below.

Conclusion

We've looked at some of the confusing terminology in the .NET ecosystems, and examined each of them in turn. We started by defining and untangling the process by which your C♯ code is compiled and run, and then moved on to the different variants and specifications related to the .NET Framework and C♯.

As always, this is a starting point - not an ending point! I'd recommend doing some additional reading and experimentation to figure out all the details.

Found this helpful? Still confused? Spotted a mistake? Comment below!

Sources and Further Reading

Generating word searches for fun and profit

(Above: A Word search generated with the tool below)

A little while ago I was asked about generating a wordsearch in a custom shape. I thought to myself "someone has to have built this before...", and while I was right to an extent, I couldn't find one that let you use any shape you liked.

This, of course, was unacceptable! You've probably guessed it by now, but I went ahead and wrote my own :P

While I wrote it a little while ago, I apparently never got around to posting about it on here.

In short, it works by using an image you drop into the designated area on the page as the shape the word search should take. Each pixel is a single cell of the word search - with the alpha channel representing whether or not a character is allowed to be placed there (transparent means that it can't contain a character, and opaque means that it can).

Creating such an image is simple. Personally, I recommend Piskel or GIMP for this purpose.

Once done, you can start building a wordlist in the wordlist box at the right-hand-side. It should rebuild the word search as soon as you click out of the box. If it doesn't, then you've found a bug! Please report it here.

With the word search generated, you can use the Question Sheet and Answer Sheet links to open printable versions for export.

You can find my word search generator here:

I've generated a word search of the current tags in the tag cloud on this blog too: Question Sheet [50.3KiB], Answer Sheet [285.6KiB]

The most complicated part of this was probably the logistics behind rude word removal. Thankfully, I did't have to find and maintain such a list of words, as the futility npm package does this for me, but algorithmically guaranteeing that by censoring 1 rude word another is not accidentally created in another direction is a nasty problem.

If you're interested in a more technical breakdown of one (or several!) particular aspects of this - let me know! While writing about all of it would probably make for an awfully long post, a specific aspect or two should be more manageable.

In the future, I'll probably revisit this and add additional features to it, such as the ability to restrict which directions words are placed in, for example. If you've got a suggestion of your own, open an issue (or even better, open a pull request :D)!

Converting my timetable to ical with Node.JS and Nightmare

A photo of a nice beach in a small bay, taken from a hill off to the side. A small-leaved tree in the foreground frames the bottom-left, with white-crested waves breaking over the beach in the background, before riding steeply on the way inland.

(Source: Taken by me!)

My University timetable is a nightmare. I either have to use a terrible custom app for my phone, or an awkwardly-built website that feels like it's at least 10 years old!

Thankfully, it's not all doom and gloom. For a number of years now, I've been maintaining a Node.JS-based converter script that automatically pulls said timetable down from the JSON backend of the app - thanks to a friend who reverse-engineered said app. It then exports it as a .ical file that I can upload to my server & subscribe to in my pre-existing calendar.

Unfortunately, said backend changed quite dramatically recently, and broke my script. With the only alternative being the annoying timetable website that really don't like being scraped.

Where there's a will, there's a way though. Not to be deterred, I gave it a nightmare of my own: a scraper written with Nightmare.JS - a Node.JS library that acts, essentially, as a scriptable web-browser!

While the library has some kinks (especially with .wait("selector")), it worked well enough for me to implement a scraper that pulled down my timetable in HTML form, which I then proceeded to parse with cheerio.

The code is open-source (find it here!) - and as of this week I've updated it to work with the new update to the timetabling system this semester. A further update will be needed in early December time, which I'll also be pushing to the repository.

The README of the repository should contain adequate instructions for getting it running yourself, but if not, please open an issue!

Note that I am not responsible for anything that happens as a result of using this script! I would strongly recommend setting up the secure storage of your password if you intend to automate it. I've just written this to solve a problem in order to ensure that I can actually get to my lectures on time - and not an hour late or on the wrong week because I've misread the timetable (again)!

In the future, I'd like to experiment with other scriptable web-browser frameworks to compare them with my experiences with NightmareJS.

Found this interesting? Found a better way to do this? Comment below!

LoRa Terminology Demystified: A Glossary

My 2 RFM95s on the lid of my project's box. More info in a future blog post coming soon!

(Above: My 2 RFM95s. One works, but the other doesn't yet....)

I've been doing some more experimenting with LoRa recently, as I've got 1 of my 2 RFM95 working (yay)! While the other is still giving me trouble (meaning that I can't have 1 transmit and the other receive yet :-/), I've still been able to experiment with other people's implementations.

To that end, I've been learning about a bunch of different words and concepts - and thought that I'd document them all here.

LoRa

The radio protocol itself is called LoRa, which stands for Long Range. It provides a chirp-based system (more on that later under Bandwidth) to allow 2 devices to communicate over great distances.

LoRaWAN

LoRaWAN builds on LoRa to provide a complete end-to-end protocol stack to allow Internet of Things (IoT) devices to communicate with an application server and each other. It provides:

  • Standard device classes (A, B, and C) with defined behaviours
    • Class A devices can only receive for a short time after transmitting
    • Class B devices receive on a regular, timed, basis - regardless of when they transmit
    • Class C devices send and receive whenever they like
  • The concept of a Gateway for picking up packets and forwarding them across the rest of the network (The Things Network is the largest open implementation to date - you should definitely check it out if you're thinking of using LoRa in a project)
  • Secure multiple-layered encryption of messages via AES

...amongst many other things.

The Things Network

The largest open implementation of LoRaWAN that I know of. If you hook into The Things Network's LoRaWAN network, then your messages will get delivered to and from your application server and LoRaWAN-enabled IoT device, wherever you are in the world (so long as you've got a connection to a gateway). It's often abbreviated to TTN.

Check out their website.

A coverage map for The Things Network.

(Above: A coverage map for The Things Network. The original can be found here)

Data Rate

The data rate is the speed at which a message is transmitted. This is measured in bits-per-second, as LoRa itself is an 'unreliable' protocol (it doesn't guarantee that anyone will pick anything up at the other end). There are a number of preset data rates:

Code Speed (bits/second)
DR0 250
DR1 440
DR2 980
DR3 1760
DR4 3125
DR5 5470
DR6 11000
DR7 50000

_(Source: Exploratory Engineering: Data Rate and Spreading Factor)_

These values are a little different in different places - the above are for Europe on 868MHz.

Maximum Payload Size

Going hand-in-hand with the Data Rate, the Maximum Payload Size is the maximum number of bytes that can be transmitted in a single packet. If more than the maximum number of bytes needs to be transmitted, then it will be split across multiple packets - much like TCP's Maximum Transmission Unit (MTU), when it comes to that.

With LoRa, the maximum payload size varies with the Data Rate - from 230 bytes at DR7 to just 59 at DF2 and below.

Spreading Factor

Often abbreviated to just simply SF, the spreading factor is also related to the Data Rate. In LoRa, the Spreading Factor refers to the duration of a single chirp. There are 6 defined Spreading Factors: ranging from SF7 (the fastest transmission speed) to SF12 (the slowest transmission speed).

Which one you use is up to you - and may be automatically determined by the driver library you use (it's always best to check). At first glance, it may seem optimal to choose SF7, but it's worth noting that the slower speeds achieved by the higher spreading factors can net you a longer range.

Data Rate Configuration bits / second Max payload size (bytes)
DR0 SF12/125kHz 250 59
DR1 SF11/125kHz 440 59
DR2 SF10/125kHz 980 59
DR3 SF9/125kHz 1 760 123
DR4 SF8/125kHz 3 125 230
DR5 SF7/125kHz 5 470 230
DR6 SF7/250kHz 11 000 230
DR7 FSK: 50kpbs 50 000 230

_(Again, from Exploratory Engineering: Data Rate and Spreading Factor)_

Duty Cycle

A Duty Cycle is the amount of time something is active as a percentage of a total time. In the case of LoRa(/WAN?), there is an imposed 1% Duty Cycle, which means that you aren't allowed to be transmitting for more than 1% of the time.

Bandwidth

Often understood, the Bandwidth is the range of frequencies across which LoRa transmits. The LoRa protocol itself uses a system of 'chirps', which are spread form one end of the Bandwidth to the other going either up (an up-chirp), or down (a down-chirp). LoRahas 2 bandwidths it uses: 125kHz, 250kHz, and 500kHz.

Some example LoRa chirps as described above.

(Some example LoRa Chirps. Source: This Article on Link Labs)

Frequency

Frequency is something that most of us are familiar with. Different wireless protocols utilise different frequencies - allowing them to go about their business in peace without interfering with each other. For example, 2.4GHz and 5GHz are used by WiFi, and 800MHz is one of the frequencies used by 4G.

In the case of LoRa, different frequencies are in use in different parts of the world. ~868MHz is used in Europe (443MHz can also be used, but I haven't heard of many people doing so), 915MHz is used in the US, and ~780MHz is used in China.

Location Frequency
Europe 863 - 870MHz
US 902 - 928MHz
China 779 - 787MHz

(Source: RF Wireless World)

Found this helpful? Still confused? Found a mistake? Comment below!

Sources and Further Reading

https://electronics.stackexchange.com/a/305287/180059

Art by Mythdael