
What I've learnt from #LOWREZJAM 2018

The LOWREZJAM 2018 banner

(Above: The official LOWREZJAM 2018 logo)

While I didn't manage to finish my submission in time for #LOWREZJAM, I've had a ton of fun building it - and learnt a lot too! This post is a summary of my thoughts and the things I've learnt.

Firstly, completing a game for a game jam requires experience - and preparation - I think. As this was my first game jam, I didn't know how to prepare in order to finish my submission successfully - but for subsequent jams I now know what I need to prepare in advance. This mainly includes learning the technology(ies) and libraries I intend to use.


This time around, I went with building my own game engine. While this added a lot of extra work (which probably wasn't helpful) and contributed to the architectural issues that prevented me from finishing my submission in time, I found it a valuable experience in learning how a game engine is put together and how it works. This information, I feel, will help inform my choice in which other game engine I learn next. I've got my eye on Crafty.JS, Phaser (v3 vs the community edition, anyone? I'm confused), and perhaps MelonJS - your thoughts on the matter are greatly appreciated! I may blog my choice after exploring the available options at a later date.

I've also got some ideas as to how to improve my Sprite Packer, which I wrote a while ago. While it proved very useful in the jam, I think it could benefit from some tweaks to the algorithm to ensure it packs things in neater than it currently does.

I'd also like to investigate other things such as QuadTrees (used to quickly search an area for nearby objects - and also to compress images, apparently) and Particle Systems, as they look really interesting.

A visualisation of an example QuadTree from the page I linked to above.

I'd also like to talk a little about my motivation behind participating in this game jam. For me, there are a couple of reasons. Not only does it help me figure out what I need to work on (see above), but it's always been a personal dream of mine to make a game that tells a story. I find games to be an art form for communicating ideas, stories, and emotions in an interactive way that simply isn't possible with other media formats, such as films, for instance (films are great too though).

While I'll never be able to make a game of triple-A quality on my own, many developers have shown time and again that you don't need an army of game developers, artists, musicians, and more to make a great game.

As my art-related skills aren't amazing, I decided on participating in #LOWREZJAM, because lower-resolution sprites are much easier to draw (though I've had lots of help from the same artist who helped me design my website here). Coupled with the 2-week timespan (I can't even begin to fathom building a game in 4 hours - let alone the 24 hours from 3 Thing Game!), it was the perfect opportunity to participate and improve my skills. While I haven't managed to complete my entry this year, I'll certainly be back again for more next year!


Found this interesting? Had your own experience with a game jam? Comment below!

Acorn Validator

Edit: Corrected title and a bunch of grammatical mistakes. I typed this out on my phone with Monospace - and it seems that my keyboard (and Phone-Laptop Bluetooth connection!) leave something to be desired.....

Over the last week, I've been hard at work on an entry for #LOWREZJAM. While it's not finished yet (submission is on Thursday), I've found some time to write up a quick blog post about a particular aspect of it. Of course, I'll be blogging about the rest of it later once it's finished :D

The history of my entry is somewhat.... complicated. Originally, I started work on it a few months back as an independent project, but due to time constraints and other issues I was unable to get very far with it. Once I discovered #LOWREZJAM, I made the decision to throw away the code I had written and start again from (almost) scratch.

It is for this reason that I have a Javascript validator script lying around. I discovered when writing it originally that my editor Atom didn't have syntax validation support for Javascript. While there are extensions that do the job, it looked like a complicated business setting one up for just syntax checking (I don't want your code style guideline suggestions! I have my own style!). To this end, I wrote myself a quick bash script to automatically check the syntax of all my javascript files that I can then include as a build step - just before webpack.

Over the time I've been working on my #LOWREZJAM entry here, I've been tweaking and improving it - and thought I'd share it here. In the future, I'm considering turning it into a linting provider for the Atom editor I mentioned above (it depends on how complicated that is, and how good the documentation is to help me understand the process).

The script (which can be found in full at the bottom of this post) has 2 modes of operation. In the first mode, it acts as a co-ordinator process that starts all the sub-processes that validate the javascript. In the second mode, it validates a single file - outputting the result to the standard output, and also logging any errors in a special place where the co-ordinator process can find them later.

It decides which mode to operate in based on whether it receives an argument telling it which file to validate:

if [ "$1" != "" ]; then
    validate_file "$1" "$2";
    exit $?;
fi

If it detects an argument, then it calls the validate_file function and exits with the returned exit code.

If not, then the script continues into co-ordinator mode. In this mode it chains a bunch of commands together like lego bricks to start subprocesses to validate all of the javascript files it can find as fast as possible - acorn, the validator itself, can only check one file at a time, it would appear. It does this like so:

find . -not -path "./node_modules/" -not -path "./dist/" | grep -iP '\.mjs$' | xargs -n1 -I{} -P32 $0 "{}" "${counter_dirname}";

This looks complicated, but it can be broken down into smaller, easy-to-understand chunks. explainshell.com is rather good at demonstrating this. Particularly of note here is the $0. This variable holds the path to the currently executing script - allowing the co-ordinator to call itself in validator mode.

The validator function itself is also quite simple. In short, it runs the validator, storing the result in a variable. It also saves the exit code for later analysis. Once done, it writes the filename to the standard output, along with either a simple "ok" or the validator's error output - keeping things neat and tidy. Finally, if there was an error, it outputs a file to a temporary directory (whose name is determined by the co-ordinator and passed to sub-processes via the 2nd argument) with a name of its PID (the content doesn't matter - I just output 1, but anything will do). This allows the co-ordinator to count the number of errors that the subprocesses encounter, without having to deal with complicated locks arising from updating the value stored in a single file. Here's that in bash:

counter_dirname=$2;

# ....... 

# Use /dev/shm here since apparently while is in a subshell, so it can't modify variables in the main program O.o
    if ! [ "${validate_exit_code}" -eq 0 ]; then
        echo 1 >"${counter_dirname}/$$";
    fi

Once all the subprocesses have finished up, the co-ordinator counts up all the errors and outputs the total at the bottom:

error_count=$(ls ${counter_dirname} | wc -l);

echo 
echo Errors: $error_count
echo 

Finally, the co-ordinator cleans up after the subprocesses, and exits with the appropriate error code. This last bit is important for automation, as a non-zero exit code tells the parent process that it failed. My build script (which uses my lantern build engine, which deserves a post of its own) picks up on this and halts the build if any errors were found.

rm -rf "${counter_dirname}";

if [ ${error_count} -ne 0 ]; then
    exit 1;
fi

exit 0;

That's about all there is to it! The complete code can be found at the bottom of this post. To use it, you'll need to run npm install acorn in the directory that you save it to.

I've done my best to optimise it - it can process a dozen or so files in ~1 second - but I think I could do much better if I rewrote it in Node.JS, as I could then eliminate the subprocesses entirely by calling the acorn API directly (it's a Node.JS library) rather than spawning them via the CLI.

Found this useful? Got a better solution? Comment below!

#!/usr/bin/env sh

validate_file() {
    filename=$1;
    counter_dirname=$2;

    validate_result=$(node_modules/.bin/acorn --silent --allow-hash-bang --ecma9 --module $filename 2>&1);
    validate_exit_code=$?;
    validate_output=$([ ${validate_exit_code} -eq 0 ] && echo ok || echo ${validate_result});
    echo "${filename}: ${validate_output}";
    # Use /dev/shm here since apparently while is in a subshell, so it can't modify variables in the main program O.o
    if ! [ "${validate_exit_code}" -eq 0 ]; then
        echo 1 >"${counter_dirname}/$$";
    fi
}

if [ "$1" != "" ]; then
    validate_file "$1" "$2";
    exit $?;
fi

counter_dirname=$(mktemp -d -p /dev/shm/ -t acorn-validator.XXXXXXXXX.tmp);
# Parallelisation trick from https://stackoverflow.com/a/33058618/1460422
# Updated to use xargs
find . -not -path "./node_modules/" -not -path "./dist/" | grep -iP '\.mjs$' | xargs -n1 -I{} -P32 $0 "{}" "${counter_dirname}";

error_count=$(ls ${counter_dirname} | wc -l);

echo 
echo Errors: $error_count
echo 

rm -rf "${counter_dirname}";

if [ ${error_count} -ne 0 ]; then
    exit 1;
fi

exit 0;

Revolutionising CSS with Grids

(Source: The header of Mozilla's Firefox grid inspector landing page. Mozilla neither endorse this blog post nor probably know that it exists :P)

You may or may not have heard about it by now, but browsers now have support for a new feature of CSS - the Grid. I'd argue that it's the single greatest addition to CSS since it was created, and to that end I wanted to post about it here.

First though, let's look at where we've been, as it will help us understand what's so great about the new grid specification.

In the days of CSS 2, the venerable HTML <table> was the de-facto method for laying out your website - though some websites would use a frameset instead. This worked, but it wasn't very flexible. It also sort of 'abused' the HTML <table> - as a <table> is meant to display data, not provide a tool for layout purposes.

Curiously, I found that some educational courses called this an 'advanced' technique for building a website - I certainly wouldn't want a website laid out in a table!

To solve this, we got the float property. With this, we could get things to sit side by side - without having to use a table! This was much better than before, but still not very flexible.

Next up came the flexbox - this is a sort of 1-dimensional grid, allowing you to set out your content in multiple nested rows or columns. It's certainly worth taking the time to learn, as it's very useful for things like spacing the items in a navigation bar out evenly, for example.

This brings us to the Grid. Basically, it's a 2-dimensional flexbox, but with some added extra features. It's probably best explained with an example image:

The above is a pair of screenshots of a little project of mine (hint: it's related to this) with Firefox's Grid Inspector turned on. As you can see on the left, I can split the page up into multiple discrete parts, and assign different elements to each one. This is known as an explicit grid.

On the right, however, I have simply told it that I want 2 columns, each row must be at least 10em high, and that there should be a 1em gap between cells - and it has worked out that it needs 2 rows in order for every element to have a place. This is known as an implicit grid.

Together, they form a powerful framework for laying out a page. Gone are the days of having a million container elements all over the place confusing things - it is now possible to lay out a webpage with a beautifully flat HTML structure.

(Source: A Complete Guide to Grid on CSS Tricks)

If you're interested in creating a website for your next project, I encourage you to investigate the CSS Grid to lay out your page. With browser support like this:

Can I Use css-grid? Data on support for the css-grid feature across the major browsers from caniuse.com.

(Can I Use Embed from caniuse.bitsofco.de)

...there's really no reason not to use it! To get started, I'd recommend checking out CSS Tricks' A Complete Guide to Grid. It contains everything you need to know. If you prefer a more example-driven approach, then Grid by Example also provides a fairly comprehensive view.

(Found this useful? Got a brilliant resource not listed here? Found something even better? Comment below!)

Sources and Further Reading

Placeholder Image Generator

A 1920x1080 white-text-on-black placeholder image, covered with a bunch of brightly coloured debug crosses and coloured text in different (meaningful?) places.

(Above: An example output with debugging turned on from my placeholder generation service)

For quite a considerable amount of time now, I've been running my own placeholder image generation service here at starbeamrainbowlabs.com - complete with etag generation and custom colour selection. Although it's somewhat of an Easter Egg, it's not actually that hard to find if you know what you're looking for (hint: I use it on my home page, but you may need to view source to find it).

I decided to post about it now because I've just finished fixing the angle GET parameter - and it was interesting (and hard) enough to warrant a post to remind myself how I did it for future reference. Comment below if you knew about it before this blog post came out!

A randomly coloured placeholder image.

The script itself is split into 3 loose parts:

  • The initial settings / argument parsing
  • The polyfills and utility functions
  • The image generation.

My aim here was to keep the script contained within a single file - so that it fits well in a gist (hence why it currently looks a bit messy!). It's probably best if I show the code in full:

(Having trouble viewing code above? Visit it directly here: placeholder.php)

If you append ?help to the url, it will send back a plain text file detailing the different supported parameters and what they do:


(Can't see the above? View it directly here)

Aside from implementing the random value for fg_colour and bg_colour, the angle has been a huge pain. I use GD - a graphics drawing library that's bundled with practically every PHP installation ever - to draw the placeholder image, and when you ask it to draw some rotated text, it decides that it's going to pivot it around the bottom-left corner instead of the centre.

Naturally, this creates some complications. Though some people on the PHP manual page for said method (imagettftext) have attempted to correct for this (exhibits a and b), neither of their solutions worked for me. I don't know if it's because their code is just really old (13 and 5 years respectively as of the time of typing) or what.

Another randomly coloured placeholder image - this time thinner and with the text on a slight angle, with the text Enough! instead of the dimensions of the generated image.

Anyway, I finally decided recently that enough was enough - and set about fixing it myself. Basing my code on the latter of the 2 pre-existing solutions, I tried fixing it - but ended up deleting most of it and starting again. It did give me some idea as to how to solve the problem though - all I needed to do was find the centre of where the text would be drawn both when it isn't rotated and when it is, and correct for the difference (these are represented by the green and blue crosses respectively on the image at the top of this post).

PHP does provide a method for calculating the bounding box of some prospective text that you're thinking of drawing via imagettfbbox. Finding the centre of the box though sounded like a horrible maths-y problem that would take me ages to work through on a whiteboard first, but thankfully (after some searching around) Wikipedia had a really neat solution for finding the central point of any set of points.

It calls it the centroid, and claims that the geometric centre of a set of points is simply the average of all the points involved. It just so happens that the geometric centre is precisely what I'm after:

$$ C=\frac{a+b+c+d+....}{N} $$

...Where $C$ is the geometric centre of the shape as an $\binom{x}{y}$ vector, $a$, $b$, $c$, $d$, etc. are the input points as vectors, and $N$ is the number of points in question. This was nice and easy to program:

$bbox = imagettfbbox($size, 0, $fontfile, $text);
$orig_centre = [
    ($bbox[0] + $bbox[2] + $bbox[4] + $bbox[6]) / 4,
    ($bbox[1] + $bbox[3] + $bbox[5] + $bbox[7]) / 4
];
$text_width = $bbox[2] - $bbox[0];
$text_height = $bbox[3] - $bbox[5];

$bbox = imagettfbbox($size, $angle, $fontfile, $text);
$rot_centre = [
    ($bbox[0] + $bbox[2] + $bbox[4] + $bbox[6]) / 4,
    ($bbox[1] + $bbox[3] + $bbox[5] + $bbox[7]) / 4
];

The even and odd indexes of $bbox there refer to the $x$ and $y$ co-ordinates respectively of the 4 corners of the bounding boxes - imagettfbbox outputs the co-ordinates in 1 single array for some reason. I also calculate the original width and height of the text - in order to perform an additional corrective translation later.

With these in hand, I could calculate the actual position I need to draw it (indicated by the yellow cross in the image at the top of this post):

$$ delta = C_{rotated} - C_{original} $$

$$ pivot = \binom{pos_x - (delta_x + \frac{text\_width}{2})}{pos_y - delta_y + \frac{text\_height}{2}} $$

In short, I calculate the distance between the 2 centre points calculated above, and then find the pivot point by taking this distance plus half the size of the text from the original central position we wanted to draw the text at. Here's that in PHP:

// Calculate the x and y shifting needed
$dx = $rot_centre[0] - $orig_centre[0];
$dy = $rot_centre[1] - $orig_centre[1];

// New pivot point
$px = $x - ($dx + $text_width/2);
$py = $y - $dy + $text_height/2;
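
With the corrected pivot point in hand, the text can be drawn as normal. Here's a sketch of what the final call might look like - $image, $colour, and the other variables are assumed to have been set up earlier alongside the rest of the GD calls:

// Draw the rotated text about the corrected pivot point, rather than the
// bottom-left corner that GD would otherwise rotate around.
imagettftext($image, $size, $angle, (int)round($px), (int)round($py), $colour, $fontfile, $text);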

I generated a video of it in action (totally not because I thought it would look cool :P) with a combination of curl and ffmpeg:

(Having issues with the above video? Try downloading it directly, or viewing it on YouTube)

This was really quite easy to generate. In a new folder, I executed the following:

echo {0..360} | xargs -P3 -d' ' -n1 -I{} curl 'https://starbeamrainbowlabs.com/images/placeholder/?width=1920&height=1920&fg_colour=inverse&bg_colour=white&angle={}' -o {}.jpeg
ffmpeg -r 60 -i %d.jpeg -q:v 3 /tmp/text-rotation.webm

The first command downloads all the required frames (3 at a time), and the second stitches them together. The -q:v 3 bit of the ffmpeg command is of note - by default webm videos apparently have a really low quality - this corrects that. Lower is better, apparently - it goes up to about 40 I seem to remember reading somewhere. 3 to 5 is supposed to be the right range to get it to look ok without using too much disk space.

The word Conclusion in blue on a yellow background, at a 20 degree angle - generated with my placeholder image generation service :D

That's about everything I wanted to talk about to remind myself about what I did. Let me know if there's anything you're confused about in the comments below, and I'll explain it in more detail.

(Found this interesting? Comment below!)

Sources and Further Reading

Password Protect: Secure?

Everyone knows about passwords. Most people I've met react to creating a new account with a variety of negative emotions - mainly due to the annoying issue of having to create a new password (or even worse, re-use an old one!). Most people I've met also reuse at least one password several times!

This is obviously a bad thing, but what can we do about it? Perhaps, while we're at it we can solve that awkward problem of forgetting which password you've used too.

Before we tackle those questions, it's important to discuss what makes a good password. Perhaps you think of some of these rules:

A laughably bad password policy that demands a password between 8 and 10 characters in length.

(Source: this post on Password Shaming)

  • It has to contain a number (not always)
  • Some symbols make it more secure (it depends)
  • Changing it often is a good security measure (not as good as you'd think)
  • It shouldn't be longer than 32 characters (no! If longer is an option, take it!)
  • Changing out letters for numbers or symbols makes it more secure (these substitutions can be guessed by attackers)

Here's the big question though: What do any of these password rules achieve? Well, we want to ensure that only the owner of an account can access it.

What could be so bad?

Ok, so we've got our goal. Make sure only the owner of an account can access it. Sounds simple, right? Just let the user enter a secret that only they know, and then we can check if they still know that secret in the future in order to verify that they are the right person attempting to access the account.

This brings a number of problems:

  1. The user has to remember their password (more on this later)
  2. As a service provider, we have to store their password securely so that a hacker can't steal it

Point #1 here is bad enough (more on this later and what you can do about it though), but #2 is a real issue. A safe is only as secure as its lock, so how do we make sure we store passwords in such a way that nobody can steal them?

The answer: Make it so that even we can't read them! This sounds silly, but it really does work. By using a process called hashing, we can apply a series of complicated transformations to an input string - leaving us with an unintelligible output string.

Why is this useful? Because it's repeatable. By hashing the same string twice, we can get exactly the same output - allowing us to check if a user has entered their password correctly without actually storing it in plain-text! Very cool.
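
To illustrate what I mean, here's a quick sketch in PHP - using SHA-256 purely because it's easy to demo (as we'll see in a moment, plain SHA isn't actually a good choice for passwords):

// Hashing the same input twice always produces exactly the same output...
var_dump(hash("sha256", "hunter2") === hash("sha256", "hunter2")); // bool(true)

// ...but the output itself is an unintelligible fixed-length string:
echo hash("sha256", "hunter2"), "\n"; // 64 hexadecimal characters that reveal nothing about the input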

We're not out of the woods yet though. Picking the right hashing algorithm can be tricky. No doubt you've heard about sha1 - and maybe even sha2 and sha3. They stand for secure hashing algorithm, right? What if we used one of those?

More problems, unfortunately. It's too fast. Yep! The laptop upon which I'm writing this blog post can hash 157MB of data per second per core. I've got 4 cores, so that's 628MB per second if I maxed them all out. If I managed to steal the hash of your password, then I could try every single combination of characters (including non-printable ones) in the following amount of time:

Length (chars) Time to crack
1 ~0.04µs
2 ~12µs
3 ~3ms
4 ~803ms
5 ~204s
6 ~14.5 hours
7 ~154 days
8 ~107.5 years
9 ~27423 years
10 ~7 million years

It gets worse. With new innovations, hashing algorithms such as the SHA series (don't even start on MD5) can now be run massively parallel on a graphics card (or several), slashing these times by several orders of magnitude:

Length (chars) Time to Crack
8 1 minute
9 2 hours
10 1 week
11 2 years
12 2 centuries

(Source: This Coding Horror Blog Post)

I don't even have to buy a top-of-the-range card any more, either. I can rent one for as little as ~35p per hour from Google Cloud Compute - well within the reach of almost any script kiddie or wannabe hacker.

Many people also use the same password for several different accounts - and attackers exploit this mercilessly. If they've gained access to one of your accounts, they'll also try using the same password against other online services to see if they can gain access to any other accounts that may belong to you.

(If you're wondering, SHA2 and SHA3 are actually secure - they just aren't suitable for hashing passwords :P)

Furthermore, the particularly determined have gone to the trouble of pre-generating what we call rainbow tables. These tables contain pre-generated hashes for millions and millions of different combinations of characters - reducing the time required to crack a password to almost nothing!

Doing something about it

Obviously, if there wasn't anything we could do about it then everyone's accounts would have been hacked by now. Thankfully though, this isn't the case. We can combat the threat of brute-force attacks by using longer passwords (at least 12 characters - and preferably 16+), and that of rainbow tables by utilising a salt.

Basically, it's a long and unintelligible string that's added to the password before it's hashed - and is different for every password. Then, in order to check whether an entered password is identical to the stored one, we simply re-read the plain-text salt from storage and hash the newly-entered password with it.

Over the Rainbow

Pretty good right? Not so fast. Even though we've prevented attackers from using a rainbow table against our password hashes, we still have to store the salt in plain text, so our attacker can still try millions of different passwords a second if they manage to steal our password hashes.

The solution: Slow them down! Using bcrypt, we can specify a work factor when hashing a password. Higher work factors mean that it takes longer to hash a password. By making it so that it takes a consistent ~1 second to hash passwords, we can limit our attacker to trying 1 password per second instead of a few million. Furthermore, the bcrypt algorithm doesn't translate onto graphics cards very well at all - further increasing the amount of time it takes to crack our password hashes!
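
If you're working in PHP, this is built right into the language - there's no need to roll your own salting and hashing logic. Here's a minimal sketch (the cost value is just an example - it's worth benchmarking on your own hardware):

// Hash a new password with bcrypt. password_hash() generates and embeds a
// random salt automatically, and the cost option is the work factor - higher
// values take longer to compute.
$hash = password_hash("correct horse battery staple", PASSWORD_BCRYPT, [ "cost" => 12 ]);

// Later, when someone tries to log in:
$entered_password = "correct horse battery staple";
if (password_verify($entered_password, $hash))
    echo "Password ok!\n";
else
    echo "Access denied!\n";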

Combined with a salt from earlier, this is a pretty good way of storing passwords. There are other variants of this general algorithm too - including the newer Argon2 algorithm whose work factor increases the amount of memory required to calculate a hash as well as the amount of CPU power.

Still others include scrypt and PBKDF2 - the likes of which are discussed here (also see this answer).

Just getting started

This isn't the end of the road though. Far from it - we're just getting started. Next, the best hashing algorithm in the world doesn't help you if your password is terrible. There are many lists of common passwords extracted from data breaches from around the world - so if your password is in that list, you can expect your account to be hacked in seconds!

If you're unsure whether your password is good enough, then there are online tools available that you can use to measure the strength of your passwords. Some will even tell you if your password is on any of the common password lists.

Even so, there are other methods employed by those with questionable intent to gain access to your account. For example, why bother cracking your password when they can install a keylogger on your computer, or look over your shoulder to pick up your password as you're typing it?

Some simple steps are all that's required to keep keyloggers at bay. Firstly, keeping your computer (and all the software you have installed) up-to-date is critical for patching security holes. If you're using Linux, then you've already got a brilliant way of doing this - so long as you install all your software through your package manager.

If you're on Windows, making sure any auto-update options are turned on and installing updates when prompted is very important. Though it's not installed by default, Chocolatey brings Linux-style package management to your computer - allowing you to mass-update all the software you've got installed at once.

Secondly, Windows users should make sure they have an anti-virus program installed (Windows 10 comes with one built in, so no need to worry there) with up-to-date virus definitions. If you're not on Windows 10 yet, then Windows Defender can be installed and enabled. Alternatively, Avast does the job well enough (though somewhat noisily with lots of notifications). Both are free, so there's no excuse :P

Password Management

As I mentioned earlier in this post, re-using passwords is a very bad idea. According to dashlane, the average number of accounts registered against a single email address is a staggering 118! Having to remember a different password for all 118 accounts is clearly not sustainable (dashlane also notes that the average number of 'forgot password' emails per inbox in 2020 is estimated to be 22), so is there anything we can do about it?

Yes, as it turns out. As the number of online accounts people have has risen, so have the ways at your disposal to manage them. The best way I've found to manage my passwords is with a password manager. The general idea is that you create a password database that's encrypted with a (super-secure) master password, and then you store the credentials for all of your different accounts inside it.

Many password managers come with helper tools that automagically type them into a box at the touch of a button - eliminating the need to remember them at all. This means that you can use much longer and more secure passwords - protecting your online accounts from intrusion.

Personally, my preference is Keepass 2 - as it lets me save my password database to disk - so that I can set up my own automatic backup system (don't forget this step!) against my own infrastructure. Many others exist though too. This article has a great discussion on the merits (and dangers) of using one, along with recommendations at the end.

Of course, it's a trade-off. In using a password database, all of your passwords will be in one place (albeit encrypted with your master password). If an attacker gets hold of it and cracks the master password, then it's game over! Thankfully, there are several techniques employed to ensure the security of such a database. Firstly, a password hashing algorithm (as discussed above) is used to transform the master password before encrypting the database - adding a work factor. Secondly, keeping your database stored in a secure location (such as a flash drive in a safe place, or a secure remote server that you own) can also help.

Given that the chance of being burgled in the UK is about 2.5% (source), but the chance of being hacked is about 1 in 3 (source), I think I'd rather take my chances with a password database to keep my online accounts secure.

If you aren't comfortable with storing your passwords digitally yet though, fear not! You could buy a password book from loads of different online stores - and even most good high-street stationers such as W.H. Smith for under £10. Although storing really long and unintelligible passwords isn't really viable, you can still have a different one for each site - making your online accounts much more secure.

Beyond the Password

Having a good password is a great start, but is there anything else we can do? Well yes, as it turns out. Enter stage left: 2-factor authentication.

Given those statistics about burglary and hacking, ideally we want to tie our account security to the burglary statistic rather than the hacking one. 2-factor authentication does just this: it makes it such that you require not only something that you know (your password) to access your account, but also something that you have, such as your phone - or a small flash-drive like device called a hardware-security key.

Linking the two is a system of 6-digit codes. Basically, on your phone you scan a QR code representing a lump of data. This data is then used to generate a 6-digit code that you enter into the online service after entering your password. This code changes every 30 seconds or so according to a complicated algorithm, so the server - which knows the same lump of data - can also generate a code to see if it matches the one you entered.
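
If you're curious what that 'complicated algorithm' looks like, here's an illustrative PHP sketch of TOTP-style code generation. It's a simplification for demonstration purposes - real implementations also deal with base32-encoded secrets, clock-drift windows, and constant-time comparisons:

// Generate a 6-digit time-based one-time code from a shared secret.
function generate_totp_code(string $secret, int $period = 30, int $digits = 6): string {
    $counter = intdiv(time(), $period);                  // the current 30-second time step...
    $counter_bytes = pack("J", $counter);                 // ...as a 64-bit big-endian integer
    $hash = hash_hmac("sha1", $counter_bytes, $secret, true); // HMAC-SHA1 of the counter
    $offset = ord($hash[19]) & 0x0f;                       // dynamic truncation: the last nibble picks an offset
    $binary = ((ord($hash[$offset]) & 0x7f) << 24)
        | (ord($hash[$offset + 1]) << 16)
        | (ord($hash[$offset + 2]) << 8)
        | ord($hash[$offset + 3]);
    return str_pad((string)($binary % (10 ** $digits)), $digits, "0", STR_PAD_LEFT);
}

// The server runs the same calculation with the same secret and compares the
// result - usually allowing a step either side to account for clock drift.
echo generate_totp_code("a shared secret goes here"), "\n";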

In this fashion, attackers are prevented from accessing your account unless they not only crack your password, but also actively infect your phone.

(Above: A few hardware security keys. Onlykey is featured on the left, whilst a selection of Yubikeys are on the right.)

If you're really paranoid, special 'security keys' exist that can store your 2nd-factor information - and generate the codes securely without the source data ever leaving your device (though those are a separate topic for a different post - comment below if you'd like me to do a writeup on them or something!)

Conclusion

Phew. If you've reached the end of this post, then congratulations! We've covered a lot of content. We've looked a little at what makes a good password, and why that is. We've investigated how attackers attempt to gain access to your account, and what you can do to protect yourself. We've also considered the merits of various password management strategies. Finally, we've looked at 2-factor authentication, and how it is far more secure than even the strongest password.

As always, this post is a starting point - not an ending point! I'll list some useful articles below for further reading. I'd also recommend you consider taking the time to secure your online accounts better - for example changing their passwords, enabling 2-factor authentication, and implementing a password management strategy.

Found this interesting? Spotted a mistake? Got a great tip of your own? Comment below!

Sources and Further Reading

Job Scheduling on Linux

Scheduling jobs to happen at a later time on a Linux-based machine can be somewhat confusing. Confused by 5 4 8-10/4 6/4 *? Baffled by 5 */4 * * *? All will be revealed!

cron

Scheduling jobs on a Linux machine can be done in several ways. Let's start with cron - the primary program that orchestrates the whole proceeding. Its name comes from the Greek word Chronos, which means time. By filling in a crontab (read cron-table), you can tell it what to do when. It's essentially a time-table of jobs you'd like it to run.

Your Linux machine should come with cron installed already. You can check if cron is installed and running by entering this command into your terminal:

if [[ "$(pgrep -c cron)" -gt 0 ]]; then echo "Cron is installed :D"; else echo "Cron is not installed :-("; fi

If it isn't installed or running, then you'll have to investigate why this isn't the case. The most common reason is that it isn't installed. It's normally in the official repositories for most distributions - on Debian-based systems, sudo apt install cron should suffice. Arch-based users may need to check that the system service is enabled, and enable it manually if not.

With cron setup and ready to go, we can start adding jobs to it. This is done by way of a crontab, as explained above. Each user has their own crontab such that they can each configure their own individual sets of jobs. To edit it, type this:

crontab -e

This will open your favourite editor with your crontab ready for editing (if you'd like to change your editor, do sudo update-alternatives --config editor or change the EDITOR environment variable). You should see a bunch of lines like this:

# Edit this file to introduce tasks to be run by cron.
# 
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
# 
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
# 
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
# 
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
# 
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
# 
# For more information see the manual pages of crontab(5) and cron(8)
# 
# m h  dom mon dow   command

I'd advise you keep this for future reference - just in case you find yourself in a pinch later - so scroll down to the bottom and start adding your jobs there.

Let's look at the syntax for telling cron about a job next. This is best done by example:

0 1 * * 7   cd /root && /root/run-backup

This job, as you might have guessed, runs a custom backup script. It's one I wrote myself, but that's a story for another time (comment below if you'd like me to post about that). What we're interested in is the bit at the beginning: 0 1 * * 7. Scheduling a cron job is done by specifying 5 space-separated values. In the case of the above, the job will run at 1am every Sunday morning. The order is as follows:

  • Minute
  • Hour
  • Day of the Month
  • Month
  • Day of the week

For each of these values, a number of different specifiers can be used. For example, specifying an asterisk (*) will cause the job to run at every interval of that column - e.g. every minute or every hour. If you want to run something on every minute of the day (such as a logging or monitoring script), use * * * * *. Be aware of the system resources you can use up by doing that though!

Specifying a number will restrict it to a specific point in an interval. For example, 10 * * * * will run the job at 10 minutes past every hour, and 22 3 * * * will run a job at 03:22 in the morning every day (I find such times great for maintenance jobs).

Sometimes, every hour or every minute is too often. Cron can handle this too! For example 3 */2 * * * will run a job at 3 minutes past every second hour. You can alter this at your leisure: The value after the forward slash (/) decides the interval (i.e. */3 would be every third, */15 would be every 15th, etc.).

The last column, the day of the week, is an alternative to the day of the month column. It lets you specify, as you may assume, the day of the week a job should run on. This can be specified in 2 ways: with the numbers 0-6, or with 3-letter short codes such as MON or SAT. For example, 6 20 * * WED runs at 6 minutes past 8 in the evening on Wednesday, and 0 */4 * * 0 runs every 4th hour on a Sunday.
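
Putting those specifiers together, here's how the examples above might look as real crontab entries (the commands are just placeholders - substitute whatever you'd like to run):

# m  h   dom mon dow   command
10   *   *   *   *     /path/to/some-script        # 10 minutes past every hour
22   3   *   *   *     /path/to/maintenance-job    # at 03:22 every morning
3    */2 *   *   *     /path/to/another-job        # 3 minutes past every 2nd hour
6    20  *   *   WED   /path/to/wednesday-job      # at 20:06 every Wednesday
0    */4 *   *   0     /path/to/sunday-job         # every 4th hour on a Sunday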

The combinations are endless! Since it can be a bit confusing combining all the options to get what you want, crontab.guru is great for piecing cron-job specifications together. It describes your cron-job spec in plain English for you as you type!

crontab.guru showing a random cronjob spec.

(Above: crontab.guru displaying a random cronjob spec)

What if I turn my computer off?

Ok, so cron is all very well, but what if you turn your machine off? Well, if cron isn't running at the time a job should be run, then it won't get executed. For those of us who don't leave their laptops on all the time, all is not lost! It's time to introduce the second piece of software at our disposal.

Enter stage left: anacron. Built to be a complement to cron, anacron sets up 3 folders:

  • /etc/cron.daily
  • /etc/cron.weekly
  • /etc/cron.monthly

Any executable scripts in these folders will be run at daily, weekly, and monthly intervals respectively by anacron, and it respects the hash-bang (that #! line at the beginning of a script) too!

Most server systems do not come with anacron pre-installed, though it should be present in your distribution's official repositories. Once you've installed it, edit root's crontab (with sudo crontab -e if you can't remember how) and add a job that executes anacron every hour like so:

# Run anacron every hour
5 * * * *   /usr/sbin/anacron

This is important, as anacron does not run continuously in the background like cron does (a program that does this is called a daemon in the Linux world) - it needs a helping hand to get it to run.

If you've got more specific requirements, then anacron also has its own configuration file you can edit. It's found at /etc/anacrontab, and has a different syntax. In the anacron table, jobs follow the following pattern:

  • period - The interval, in days, that the job should run
  • delay - The offset, in minutes, that the job should run at
  • job identifier - A textual identifier (without spaces, of course) that identifies the job
  • command - The command that should be executed

You'll notice that there are 3 jobs specified already - one for each of the 3 folders mentioned above. You can specify your own jobs too. Here's an example:

# Do the weekly backup
7   20  run-backup  cd /root/data-shape-backup && ./do-backup;

The above job runs every 7 days, with an offset of 20 minutes. Note that I've included a comment (the line starting with a hash #) to remind myself as to what the job does - I'd recommend you always include such a comment for your own reference - whether you're using cron, anacron, or otherwise.

I'd also recommend that you test your anacron configuration file after editing it to ensure it's valid. This is done like so:

anacron -T

I'm not an administrator, can I still use this?

Sure you can! If you've got anacron installed (you could even compile it from source locally if you haven't) and want to specify some jobs for your local account, then that's easily done too. Just create an anacrontab file anywhere you please, and then in your regular crontab (crontab -e), tell anacron where you put it like this:

# Run anacron every hour
5 * * * *   /usr/sbin/anacron -t "path/to/anacrontab"

What about one-off jobs?

Good point. cron and anacron are great for repeating jobs, but what if you want to set up a one-off job to auto-disable your firewall before enabling it just in case you accidentally lock yourself out? Thankfully, there's even an answer for this use-case too: atd.

atd is similar to cron in that it runs a daemon in the background, but instead of executing jobs specified in a crontab, you tell it when you want it to execute a series of commands, and then enter the commands themselves. For example:

$ at now + 10 minutes
warning: commands will be executed using /bin/sh
at> echo -e "Testing"  
at> uptime
at> <EOT>
job 4 at Thu Jul 12 14:36:00 2018

In the above, I tell it to run the job 10 minutes from now, and enter a pair of commands. To end the command list, I hit CTRL + D on an empty line. The output of the job will be emailed to me automatically if I've got that set up (cron and anacron also do this).

Specifying a time can be somewhat fiddly, but it's also quite flexible:

  • at tomorrow
  • at now + 5 hours
  • at 16:06
  • at next month
  • at 2018 09 25

....and so on. Listing the current scheduled jobs is also just as easy:

atq

This will output a list of scheduled jobs that haven't been run yet. You can't see any jobs that aren't created by you unless you're root (use sudo), though. You can use the job ids listed here to cancel a job too:

# Remove job id 4:
atrm 4

Conclusion

That just about concludes this whirlwind tour of job scheduling on Linux systems. We've looked at how to schedule jobs with cron, and how to ensure our jobs get run - even when the target machine isn't turned on all the time with anacron. We've also looked at one-time jobs with atd, and how to manage the job queue.

As usual, this is a starting point - not an ending point! Job scheduling is just the beginning. From here, you can look at setting up automated backups. You could investigate setting up an email server, and how that integrates with cron. You can utilise cron to perform maintenance for your next great web (or other!) application. The possibilities are endless!

Found this useful? Still confused? Comment below!

Search Engine Optimisation: The curious question of efficiency

For one reason or another I found myself a few days ago inspecting the code behind Pepperminty Wiki's full-text search engine. What I found was interesting enough that I thought I'd blog about it.

Forget about that kind of Search Engine Optimisation (the horrible click-baity kind - if there's enough interest I'll blog about my thoughts there too) and cue the appropriate music - we're going on a field trip fraught with the perils of Unicode, page ids, transliteration, and more!

Firstly, I should probably mention a little about the starting point. The (personal) wiki that is exhibiting the slowness has ~75K words spread across 546 pages. Pepperminty Wiki manages to search all of this in about 2.8 seconds by way of an inverted index. If you haven't read my last post, you should do so now - it sets the stage for this one - and you'll be rather confused otherwise.

2.8 seconds is far too slow though. Let's do something about it! Before I can show you what I did to optimise it, there are several other things that need explaining. Let's look at Pepperminty Wiki's search system first. It's best explained with the aid of a diagram:

An SVG diagram explaining how Pepperminty Wiki's search system works. A textual explanation is given below.

In short, every page has a numerical id, which is tracked by the ids core class. The search system interacts with this during the indexing phase (that's a topic for another blog post) and during the query phase. The query phase works something like this:

  1. The inverted index is loaded from disk (in my personal wiki the inverted index is ~968k, and loads in ~128ms)
  2. The inverted index is queried for pages that match the tokenised query terms.
  3. The results returned from the query are ranked and sorted before being returned.
  4. A context is extracted from the source of each page in the results returned - just like Duck Duck Go or Google have a bit of text below the title of each result
  5. Said context has the search terms highlighted in it

It sounds complicated, but it really isn't. The complicated bit came when I tried to optimise it. To start with, I analysed how long each of the above steps was taking. The results were quite surprising:

  • Step #1 took ~128ms on average
  • Steps #2 & #3 took ~1200ms on average
  • Step #4 took ~1500ms on average(!)
  • Step #5 took a negligible amount of time

I did this by setting headers on the returned page. Timing things in PHP is relatively easy:

$start_time = microtime(true);

// Do work here

$end_time = microtime(true);

$time_taken_ms = round(($end_time - $start_time )*1000, 3);
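
The timing can then be attached to the response as a custom HTTP header, for example like this (the header name here is arbitrary - anything prefixed with x- will do):

header("x-time-taken: {$time_taken_ms}ms");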

This gave me a general idea as to what needed attention. I was surprised to learn that the context extractor was taking most of the time. At first, I thought that my weird and probably inefficient algorithm was to blame. There's no way it should be taking 1500ms!

So I set to work rewriting it to make it more optimal. Firstly, I tried something like this. Instead of multiple sub-loops, I figured out a way to do it with just 1 for loop and a few calls to mb_stripos_all().

Unfortunately, this did not have the desired effect. While it did shave about 50ms off the total time, it was far from what I'd hoped for. I tried refactoring it slightly again to use preg_match_all(), but it still didn't give me the speed boost I was after - only another 50ms or so.

To get some answers, I brought out the big guns and profiled it with XDebug.

Upon analysing the generated profile it immediately became clear what the issue was: transliteration. Transliteration is the process of removing the diacritics and other accents from a string to make it easier to compare with other strings. For example, café becomes cafe. In PHP this process is a bit funky. Here's what I do in Pepperminty Wiki:

$literator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);

$literator->transliterate($string);

(Originally from this StackOverflow answer)

Note that this requires the intl PHP extension (which should be installed & enabled by default). This does several things:

  • Converts the text to lowercase
  • Normalises it to UTF-8 (See this article for more information)
  • Casts Cyrillic characters to their Latin alphabet (i.e. a-z) phonetic equivalent
  • Removes all diacritics

In short, it preprocesses a chunk of text so that it can be easily used by the search system. In my case, I transliterate search queries before tokenising them, source texts before indexing them, and crucially: source texts before extracting contextual information.

The thing about this wonderful transliteration process is that, at least in PHP, it's really slow. Thinking about it, the answer was obvious. Why bother extracting offset information from the source text when the inverted index already contains that information?

The answer is: you don't! Upon refactoring the context extractor to utilise the inverted index, I managed to get it down to just ~59ms. Success!
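
To give a rough idea of the approach (an illustrative sketch rather than Pepperminty Wiki's actual code - it assumes we've already got a character offset for a match from the index):

// Pull a window of text from around a known match, instead of re-searching
// (and re-transliterating) the entire source document.
function extract_context_sketch(string $source, int $match_offset, int $radius = 100): string {
    $start = max(0, $match_offset - $radius);
    return mb_substr($source, $start, $radius * 2);
}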

Next up was the query system itself. 1200ms seems a bit high, so while I was at it, I analysed a profile of that as well. It turned out that a similar problem was occurring here too. Surprisingly, the page id system's getid($pagename) function was being really slow. I found 2 issues here.

Firstly, I was doing too much Unicode normalisation. In the page id system, I don't want to transliterate to remove diacritics, but I do want to make sure that all the diacritics and accents are represented in the same way.

If you didn't know, Unicode has both a single character for letters like é (e-acute), and a separate code-point for the acute accent itself, which gets merged into the preceding letter during rendering. This can cause a page to acquire 2 (or even more!) seemingly identical ids in the system, which has caused me a few headaches in the past! If you'd like to learn more, the article on Unicode normalisation I linked to above explains it in some detail. Thankfully, the solution is quite simple. Here's what Pepperminty Wiki does:

Normalizer::normalize($string, Normalizer::FORM_C)

This ensures that all accents and other weird characters are represented in the same way. As you might guess though, it's slow. I found that in the getid() function I was normalising both the page names I was iterating over in the index and the target page name I was searching for - on every single iteration of the loop. The solution here was simple:

  • Don't normalise the page names from the index - that's the job of the assign() protected method to ensure that they are suitably normalised when adding them in the first place
  • Normalise the target page name only once, and then use that when spinning through the index.
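
Here's a minimal sketch of what that ends up looking like (an illustration of the idea, not Pepperminty Wiki's actual getid() implementation):

// Normalise the target page name once, and trust that the index entries were
// already normalised by assign() when they were added.
function getid_sketch(array $id_index, string $pagename): ?int {
    $target = Normalizer::normalize($pagename, Normalizer::FORM_C); // done once, outside the loop
    foreach ($id_index as $id => $name) {
        if ($name === $target)
            return $id;
    }
    return null; // no id exists for this page name yet
}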

Implementing these simple changes brought the overall search time down to 700ms. The other thing to note here is the structure of the index. I show it in the diagram, but here it is again:

  • 1: Booster
  • 2: Rocket
  • 3: Satellite

The index is basically a hash-table mapping numerical ids to their page names. This is great for when you have an id and want to know what the name of the page associated with it is, but terrible for when you want to go in the other direction, as we need to do when performing a query!

I haven't quite decided what to do about this. Obviously, the implications on efficiency are significant whenever we need to convert a page name into its respective numerical id. The problem lies in the fact that the search query system travels in both directions: It needs to convert page ids into page names when unravelling the results from the inverted index, but it also needs to convert page names into their respective ids when searching the titles and tags in the page index (the index that contains information about all the pages on a wiki - not pictured in the diagram above).

I have several options that I can see immediately:

  • Maintain 2 indexes: One going in each direction. This would also bring a minor improvement to indexing new and updating existing content in the inverted index.
  • Use some fancy footwork to refactor the search query system to unwind the page ids into their respective page names before we search the pages' titles and tags.
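
The first of these options would look something like the following sketch (for illustration only - this isn't actual Pepperminty Wiki code):

// The existing index maps numerical ids to page names...
$ids = [ 1 => "Booster", 2 => "Rocket", 3 => "Satellite" ];
// ...so keeping a flipped copy alongside it makes name -> id lookups instant too.
$ids_reverse = array_flip($ids);

$pageid   = $ids_reverse["Rocket"] ?? null; // page name -> id, without a linear search
$pagename = $ids[$pageid] ?? null;          // id -> page name, as before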

While deciding what to do, I did manage to reduce the number of times I convert a page name into its respective id by only performing the conversion if I find a match in the page metadata. This brought the average search time down to ~455ms, which is perfectly fine for my needs at the moment.

In the future, I may come back to this and optimise it further - but as it stands I'm getting to the point of diminishing returns: Where every additional optimisation requires twice the amount of time to implement as the last, and only provides a marginal gain in speed.

To this end, it doesn't seem worth it to spend ages tackling this issue now. Pepperminty Wiki is written in such a way that I can come back later and turn the inner workings of any part of the system upside-down, and it doesn't have any effect on the rest of the system (most of the time, anyway.... :P).

If you do find the search system too slow after these optimisations are released in v0.17, I'd like to hear about it! Please open an issue and I'll investigate further - I certainly haven't reached the end of this particular lollipop.

Found this interesting? Learnt something? Got a better way of doing it? Comment below!

Demystifying Inverted Indexes

The test texts below overlaid on one another in different colours, with a magnifying glass centred on top. (The magnifying glass in the above banner came from openclipart)

After writing the post that will be released after this one, I realised that I made a critical assumption that everyone knew what an inverted index was. Upon looking for an appropriate tutorial online, I couldn't find one that was close enough to what I did in Pepperminty Wiki, so I decided to write my own.

First, some context. What's Pepperminty Wiki? Well, it's a complete wiki engine in a single file of PHP. The source files are obviously not a single file, but it builds into a single file - making it easy to drop into any PHP-enabled web server.

One of its features is a full-text search engine. A personal wiki of mine has ~75k words spread across ~550 pages, and it manages to search them all in just ~450ms! It does this with the aid of an inverted index - which I'll be explaining in this post.

First though, we need some data to index. How about the descriptions of some video games?

Kerbal Space Program

In KSP, you must build a space-worthy craft, capable of flying its crew out into space, without killing them. At your disposal is a collection of parts, which must be assembled to create a functional ship. Each part has its own function and will affect the way a ship flies (or doesn't). So strap yourself in, and get ready to try some Rocket Science!

 

Cross Code

Meet Lea as she logs into an MMO of the distant future. Follow her steps as she discovers a vast world, meets other players and overcomes all the challenges of the game.

 

Fort Meow

Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow.

 

Factory Balls

Factory balls is the brand new bonte game I already announced yesterday. Factory balls takes part in the game design competition over at jayisgames. The goal of the design competition was to create a 'ball physics'-themed game. I hope you enjoy it!

Very cool, this should provide us with plenty of data to experiment with. Firstly, let's consider indexing. Take the Factory Balls description. We can split it up into tokens like this:

Tokens
factory balls is the brand new bonte game
i already announced yesterday factory balls takes
part in the game design competition over
at jayisgames the goal of the design
competition was to create a ball physics
themed game i hope you enjoy it

Notice how we've removed punctuation here, and made everything lowercase. This is important for the next step, as we want to make sure we consider Factory and factory to be the same word - otherwise when querying the index we'd have to remember to get the casing correct.
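If you fancy following along in code, here's a minimal sketch of that tokenisation step. I've used JavaScript here purely for illustration - Pepperminty Wiki itself does this in PHP, and its real implementation differs:

```javascript
// Split a source text into lowercase tokens, stripping punctuation.
function tokenise(text) {
	return text.toLowerCase()              // "Factory" and "factory" become the same token
		.replace(/[^a-z0-9\s]/g, " ")      // turn punctuation into spaces
		.split(/\s+/)                      // split on runs of whitespace
		.filter(token => token.length > 0);
}

console.log(tokenise("Factory balls is the brand new bonte game I already announced yesterday."));
// → [ "factory", "balls", "is", "the", "brand", "new", "bonte", "game", "i", "already", ... ]
```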

With our tokens sorted, we can now count them to create our index. It's like a sort of tally chart, I guess, except we'll also be recording the offset of every occurrence of each token. We'll also be removing some of the most common words that aren't very interesting - these are known as stop words. Here's an index generated from that Factory Balls text above:

Token Frequency Offsets
factory 2 0, 12
balls 2 1, 13
brand 1 4
new 1 5
bonte 1 6
game 3 7, 18, 37
i 2 8, 38
announced 1 10
yesterday 1 11
takes 1 14
design 2 19, 28
competition 2 20, 29
jayisgames 1 23
goal 1 25
create 1 32
ball 1 34
physics 1 35
themed 1 36
hope 1 39
enjoy 1 41
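In code, building an index like the one above is a single pass over the token list, skipping the stop words and recording the offset of each occurrence. Continuing the illustrative JavaScript sketch (the stop word list here is a tiny made-up one - a real list would be rather longer):

```javascript
// A tiny illustrative stop word list - a real one would be much longer.
const stopWords = new Set(["is", "the", "in", "over", "at", "of", "was", "to", "a", "you", "it"]);

// Build an index of { token: { frequency, offsets } } from a token list.
function indexTokens(tokens) {
	const index = {};
	tokens.forEach((token, offset) => {
		if (stopWords.has(token)) return; // skip the boring words
		if (!index[token]) index[token] = { frequency: 0, offsets: [] };
		index[token].frequency++;
		index[token].offsets.push(offset); // offsets are positions in the raw token list
	});
	return index;
}

const factoryBallsIndex = indexTokens(tokenise(factoryBallsDescription));
// factoryBallsIndex.game → { frequency: 3, offsets: [ 7, 18, 37 ] }
```

Here factoryBallsDescription stands in for the full Factory Balls text quoted earlier.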

Very cool. Now we can generate an index like this for each page's content. The next step is to turn this into an inverted index. Basically, the difference between a normal index and an inverted index is that an entry in an inverted index contains not just the offsets for a single page, but those for all the pages that contain that token. For example, the Cross Code description above also contains the token game, so the inverted index entry for game would contain a list of offsets for both the Factory Balls and Cross Code pages.

Including the names of every page under every different token in the inverted index would be both inefficient computationally and cause the index to grow rather large, so we should assign each page a unique numerical id. Let's do that now:

Id Page Name
1 Kerbal Space Program
2 Cross Code
3 Fort Meow
4 Factory Balls

There - much better. In Pepperminty Wiki, this is handled by the ids class, which has a pair of public methods: getid($pagename) and getpagename($id). If an id can't be found for a page name, then a new id is created and added to the list (Pepperminty Wiki calls this the id index) transparently. Similarly, if a page name can't be found for an id, then null should be returned.
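In case that's a bit abstract, here's roughly what that id index behaviour looks like, sketched in JavaScript (the real ids class is PHP and works differently under the hood - this is just the behaviour described above):

```javascript
// A simplified sketch of an id index: page names <-> numerical ids.
class IdIndex {
	constructor() {
		this.names = []; // position in this array (+1) doubles as the page id
	}
	// Get the id for a page name, transparently creating a new id if needed.
	getid(pagename) {
		let id = this.names.indexOf(pagename) + 1;
		if (id === 0) {                 // not found - assign the next free id
			this.names.push(pagename);
			id = this.names.length;
		}
		return id;
	}
	// Get the page name for an id, or null if no such id exists.
	getpagename(id) {
		return this.names[id - 1] || null;
	}
}

const ids = new IdIndex();
["Kerbal Space Program", "Cross Code", "Fort Meow", "Factory Balls"].forEach(name => ids.getid(name));
console.log(ids.getid("Factory Balls")); // → 4
console.log(ids.getpagename(2));         // → "Cross Code"
console.log(ids.getpagename(99));        // → null
```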

Now that we've got ids for our pages, let's look at generating that inverted index entry for game we talked about above. Here it is:

  • Term: game
Id Frequency Offsets
2 1 31
3 1 5
4 3 7, 18, 37

Note how there isn't an entry for page id 1, as the Kerbal Space Program page doesn't contain the token game.

This, in essence, is the basics of inverted indexes. A full inverted index will contain an entry for every token that's found in at least 1 source document - though the approach used here is far from the only way of doing it (I'm sure there are much more advanced ways of doing it for larger datasets, but this came to mind from reading a few web articles and is fairly straight-forward and easy to understand).
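If you'd like to see the shape of it in code, merging the per-page indexes into an inverted one looks something like this - again just an illustrative JavaScript sketch built on the tokenise() and indexTokens() functions from earlier, not Pepperminty Wiki's actual implementation:

```javascript
// Merge per-page indexes into an inverted index:
// { token: { pageId: { frequency, offsets } } }
function invertIndexes(pageIndexes) { // pageIndexes: Map of pageId -> index from indexTokens()
	const inverted = {};
	for (const [pageId, index] of pageIndexes) {
		for (const [token, entry] of Object.entries(index)) {
			if (!inverted[token]) inverted[token] = {};
			inverted[token][pageId] = entry; // frequency + offsets for this particular page
		}
	}
	return inverted;
}

// The *Description variables stand in for the game descriptions quoted above.
const invertedIndex = invertIndexes(new Map([
	[1, indexTokens(tokenise(kspDescription))],
	[2, indexTokens(tokenise(crossCodeDescription))],
	[3, indexTokens(tokenise(fortMeowDescription))],
	[4, indexTokens(tokenise(factoryBallsDescription))]
]));
// invertedIndex.game → entries for page ids 2, 3, and 4 - but not 1
```

Querying the index and generating contexts are a different kettle of fish though - see the challenge below.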

Can you write a program that generates a full inverted index like I did in the example above? Try testing it on the test game descriptions at the start of this post.

You may also have noticed that the offsets used here are of the tokens in the list. If you wanted to generate contexts (like Duck Duck Go or Google do just below the title of a result), you'd need to use the character offsets from the source document instead. Can you extend your program to support querying the inverted index, and generating contexts from it too?

Liked this post? Got your own thoughts on the subject? Having trouble with the challenges at the end? Comment below!

Redis in Review: First Impressions

The red Redis logo backed by a yellow voronoi diagram

(Above: The Redis logo backed by a voronoi diagram generated by my specially-upgraded voronoi diagram generator :D)

I've known about Redis for a while now - but I've never gotten around to looking into it until now. A bit of background: I'm building a user management panel thingy for this project (yes, again. I'll be blogging about it for a bit more yet before it's ready). Since I want it to be a separate, and optional, component that you don't have to run in order to use the main program, I decided to make it a separate program in a different language (Node.js is rather good at doing this sort of thing I've found).

Anyway, whilst writing the login system, I found that I needed a place to store session tokens such that they persist. I considered a few different options here:

  1. A JSON file on disk - This wouldn't be particularly scalable if I found myself with lots of users. It also means I've got to read in and decode a bunch of JSON when my program starts up, which can get mighty fiddly. Furthermore, persisting it back to disk would also be rather awkward and inefficient.
  2. An SQL database of some description - Better, but still not great. The setup for a database is complicated, and requires a complex and resource-hungry process running at all times. Even if I used SQLite, I wouldn't be able to escape the complexities associated with database schemas - which I'd likely end up spending more time fiddling with than actually writing the program itself!

Then I remembered Redis. At first, I thought it was an in-memory data storage solution. While I was right, I discovered that there's a lot more to Redis than I first thought. For one, it does actually allow you to persist things to disk - and is very transparent about it too, allowing a great deal of control over how that persistence happens (also this post is a great read on the subject, even if it's a bit old now - it goes into a lot of depth about how the operating system handles and caches writes to disk).

When it comes to actually storing your data, Redis provides just a handful of delightfully simple data structures for you to model your data with. Unlike a traditional SQL-based database, storing data in Redis requires a different way of thinking. Rather than considering how data relates to one another and trying to represent it as a series of interconnected tables, in Redis you store things in a manner that's much closer to that which you might find in the program that you're writing.

The core of Redis is a key-value store - rather like localStorage in the browser. Unlike localStorage though, the following data types are provided:

  • A simple string value
  • A hash-table
  • A list of values (which can also be treated as a queue)
  • A set of unique strings (which can be optionally sorted)
  • Binary blobs (called bit arrays)
  • A HyperLogLog (a probabilistic data structure for estimating the number of unique items in a set)

The official Redis tutorial on the subject does a much better job of explaining this than I could (I have only been playing with it for about a week after all!), so I'd recommend you read it if you're curious.
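To give a flavour of what working with those data types looks like, here's a quick hypothetical sketch from Node.js. I'm assuming the node-redis client with its modern promise-based API here (older versions use callbacks instead), and the key names are made up purely for the example:

```javascript
import { createClient } from "redis";

const client = createClient(); // connects to localhost:6379 by default
await client.connect();

// A simple string value
await client.set("greeting", "Hello, Redis!");

// A hash-table - handy for storing an object-like record
await client.hSet("user:42", { username: "alice", theme: "dark" });

// A list of values, treated here as a queue
await client.rPush("jobs", "send-email");
const nextJob = await client.lPop("jobs");

// A set of unique strings
await client.sAdd("tags:user:42", ["admin", "beta-tester"]);

console.log(await client.get("greeting"), nextJob);
await client.quit();
```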

All this has some interesting effects. Firstly, the data storage system can be more closely representative of the data structures it is storing, allowing for clever data-specific optimisations to be made in ways that simply aren't possible with a traditional SQL-based system. To this end, a custom index can be created in Redis via simple values to accelerate the process of finding a particular hash-table given a particular piece of identifying information.
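As a concrete (and still hypothetical) example, continuing the sketch above: you could keep each user in a hash under a key like user:<id>, and maintain a plain string value per email address that points back at the id - a hand-rolled index that turns "find the user with this email address" into two quick key lookups:

```javascript
// Store the user record itself in a hash...
await client.hSet("user:42", { username: "alice", email: "alice@example.com" });
// ...and maintain a simple string value as an index over it.
await client.set("index:email:alice@example.com", "42");

// Later: look up a user by email address without scanning every user hash.
const userId = await client.get("index:email:alice@example.com");
const user = await client.hGetAll(`user:${userId}`);
```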

Secondly, it makes Redis very versatile. You could use it as a caching layer to store the results of queries made against a traditional SQL-based database. In that scenario, no persistence to disk would be required to make it as fast as possible. You could use it for storing session tokens and short-lived chunks of data, utilising the inbuilt EXPIRE command to tell Redis to automatically delete certain keys after a specified amount of time. You could use it to store all your data in and do away with an SQL server altogether. You could even do all of these things at the same time!
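The session token case I mentioned earlier is a nice example of this. Something along these lines (still a hypothetical sketch - token and sessionTokenFromRequest are placeholder variables) stores a token and lets Redis clean it up automatically:

```javascript
// Store a session token that Redis will delete automatically after 24 hours.
await client.set(`session:${token}`, JSON.stringify({ userId: 42 }), { EX: 60 * 60 * 24 });

// On each request: a null result means the session has expired or never existed.
const session = JSON.parse(await client.get(`session:${sessionTokenFromRequest}`) ?? "null");
```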

When storing data in Redis, it's important to make sure you've got the right mindset. I've found that despite learning about the different commands and data structures Redis has to offer, it's taken me a few goes to figure out how I'm going to store my program's data in a way that both makes sense and makes it easy to get it back out again. I'm certain that it'll take quite a bit of practice (and trial and error!) to figure out what works and what doesn't.

It's not the kind of thing that would make you want to drop your database system like a hot potato and move everything to Redis before the week is out. But it is a tool to keep in mind that can be used to solve a class of problems that SQL-based database servers have traditionally been very bad at handling - in particular high-throughput systems that need to store lots of mainly short-lived data (e.g. session tokens, etc.).

Found this interesting? Got your own thoughts on Redis? Post a comment below!

#movingtogitlab: What's up, Thoughts, and First Impressions

I'm moving some of my repositories to GitLab! Read on to find out why.

You've probably heard by now that GitHub has been bought by Microsoft. It was certainly huge news at the time! While I did tweet at the time (also here too), I've been waiting until I've gotten all of the repositories I've decided to move over to GitLab settled in before blogging about the experience and sharing my thoughts.

While I've got some of them settled in (you can tell which ones I've done because they are public on GitLab and I've deleted the GitHub repository in most cases), I've still got a fair few to go - there should be 13 in total on GitLab once I'm finished.

Since it's taking me longer than I anticipated to properly update them after the transfer (which in and of itself was actually really quick and painless - thanks GitLab!), I thought I'd blog about the experience so far - as it's probably going to be a little while before I've got all my repositories sorted :P

What's up?

Firstly, Microsoft. Buying GitHub. Who'd have thought it? I know that I was certainly surprised. I even had to check multiple websites to triple-check that I wasn't seeing things.... GitHub have certainly done an excellent job of hiding the fact that they were looking for a buyer. Upon further inspection, it appears that the issues GitHub have been facing are twofold. Firstly, they've been without a boss of the company for quite a while, and from the way things are looking, they are having a little bit of trouble finding a new one. Secondly, there's been talk that they haven't been making enough money.

Let's back up a second. How does GitHub actually make money? Good question. The answer is surprisingly simple: GitHub Enterprise - which is a version of GitHub for businesses that they can host on their own servers for a significant fee. From what I can tell, it's targeted at medium-to-large companies.

If they aren't making enough money to sustain themselves, then something obviously needs to be done about that, or the github.com service that open-source developers the world over rely on could suddenly vanish overnight O.o! This presents GitHub with a very awkward problem. In this case, they chose to seek a buyer, who could then perhaps help them to sell GitHub Enterprise better.

Looking at the alternative companies, I think that GitHub could have done a lot worse than Microsoft. Alternatives include Apple (who operate the gigantic walled gardens that are macOS and iOS), IBM (International Business Machines, who sell big and powerful mainframes to other big and powerful businesses. Most of the time, we - or I at least - don't hear from them, unless they are showing off some research or other that they've just completed), and Facebook (who don't have a particularly great reputation right now). The problem is that they need a big company with lots of money, because GitHub is itself a big company (so they'll be worth a lot of money, like it or not - and in order to sort out the money-making issue, they'll need cash in the meantime).

Microsoft have been quite active in the open-source scene as of late, and seem to really have changed their tune in recent years, with their open-sourcing of large parts of the .NET framework - and their code editor Visual Studio Code. While they aren't the best at naming things (I'm looking at you, .NET Core / .NET Standard / .....), they do appear to be more in-line with GitHub's goals and ethics than the alternative companies discussed above.

What's the problem?

So why do I have a problem with this deal? Clearly, Microsoft are the best of the bunch to have bought GitHub. It could be a lot worse. Microsoft have even announced that GitHub can continue to operate as a separate entity. The new boss of GitHub did an Ask Me Anything on reddit, and he seems like a really cool guy who genuinely wants the best for GitHub.

Well, it's complicated. For me, it all boils down to independence. GitHub is (or was) an independent company. To that end, they can make their own decisions, and aren't being told what to do - or at risk of being told what to do - by someone else. They aren't being influenced by anyone else. That's not to say that GitHub will definitely be influenced by Microsoft (though they probably will be) - and it's not to say that such influence is necessarily a bad thing.

Coupled with some 85 million open-source repositories, 28 million developers using the service, and 1.8 million businesses utilising github.com, it's a huge responsibility. In order to effectively serve the community that has grown around github.com, I feel that GitHub has to remain impartial. It's absolutely essential. The number of different workflows, tools, programs, operating systems, and more that those 28 million developers use will be staggering - and I feel that only a completely independent GitHub will truly be able to meet the needs of that community. Microsoft is great for some things, but taking on GitHub is asking them to actively support users of one of their services using their competitors' software, such as Linux - or devices running macOS.

What's the alternative then?

That's a huge ask, and I guess only time will tell whether they are able to pull it off. I find myself asking though: what's the alternative? If they are losing money as the rumours say, then what could they do about it?

Obviously, I don't and can't know everything that's going on inside GitHub - I can only read marketing-y web articles, theorise, and make guesses. I'm sure they've considered it already, but I don't see why they couldn't have entered a partnership with Microsoft instead. How about a partnership in which Microsoft helps sell GitHub Enterprise, and gets a cut of the profits in return? Microsoft have a much bigger base of enterprisey customers, so they'd be much better placed to do that part of things, and GitHub would be able to retain its independence.

Where does GitLab fit into this puzzle?

All of this brings me to GitLab. Similar to GitHub, GitLab offers free code hosting on gitlab.com, along with free continuous integration. Unlike GitHub though, GitLab's source code is open - and you can download and run your own instance if you like! They sell support packages to businesses - thereby making enough money to support the continued development of GitLab. I think they sell a few additional features with GitLab Enterprise too - but they aren't anything that the average user would want, only businesses. They also do a free package for students and open-source developers - all whilst staying an independent company. Very cool.

As of today, I'm moving a total of 13 of my smaller repositories to GitLab. My reasoning here is that I really don't want to keep all my eggs in one basket. By moving some repositories to GitLab, I can ensure that if one or the other goes in a direction I seriously object to, I've got an alternative that I'm familiar with that I can move all my repositories to in a hurry.

I've used GitLab before. Last academic year (2016 / 2017) I interned at a local company, where I helped them move over to Git as their primary version-control system. To that end, they chose GitLab as the server they'd use for the job - and so I got quite a bit of experience with it!

I'd use it for my personal git server but unfortunately it uses far too many resources - which I can't currently dedicate to it full-time - so I use Gitea, a lighter-weight alternative.

How does GitLab compare?

Back to the matter at hand. The experience with gitlab.com and my open-source repositories has been a hugely positive one so far! The migration process itself was completely painless - it just took a while, since there were thousands of other developers doing the exact same thing I was :P

Updating my repositories and adjusting all the links & references has been a tedious process so far - but that's nothing that GitLab can really help me with - I've gotta do it myself :-) GitLab's documentation has been really helpful for those times when I've needed some assistance to figure something out, with plenty of GitLab CI examples available to get me started.

If there's one thing I'd change, it's the releases (or tags) system. The system itself is fine, but the way the tags are presented is horrible. The text doesn't even wrap properly! A /latest shortcut URL would be welcome too.

You can't attach release binaries to tags either, which is also rather annoying. There is a workaround though - you can link directly to the GitLab CI (Continuous-Integration) artifacts instead, which seems to work just fine for my purposes so far. If it doesn't work so well in the future, I'm probably going to have to find an alternative solution for hosting release binaries - at least until they implement the feature (if at all).
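For reference, the workaround is roughly this shape: a .gitlab-ci.yml along these lines (the build command and paths are placeholders, of course) builds the project and exposes the output as CI artifacts, which can then be linked to directly in lieu of proper release attachments:

```yaml
# A minimal sketch: build the project and keep the binaries as downloadable artifacts.
build:
  stage: build
  script:
    - make              # or whatever your build command happens to be
  artifacts:
    paths:
      - bin/            # everything in here becomes downloadable from the job page
```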

Other than that, the interface, while very different to GitHub's, does feel appropriately polished and suitably easy to use. If anything, I feel as though it's rather difficult to explore any of the existing repositories on GitLab - I hadn't realised until using GitLab just how useful the new repository tags GitHub have implemented are in exploring others' repositories. I'm sure that there's a 3rd-party website that enables one to explore the repositories on gitlab.com much more effectively - I just haven't found it yet.

The sidebar when viewing a repository is quite handy - but a little unwieldy at times. Similarly, the 2-3 layers of navigation directly below the repository tagline are also rather unwieldy, and it's difficult to tell their differing functions apart. Perhaps a rethink here would bring these 2 different parts of the user interface together in a more cohesive and usable fashion?

All in all, gitlab.com is a great service for hosting my open-source repositories. The continuous integration services provided are superb - and completely unparalleled in my (admittedly limited) experience. The interface is functional, for the most part, and gets the job done - there are just a few rough edges in places that could do with a slight rethink.

Enjoyed this? Have your own opinion about what's happened to GitHub? Found a better service? Comment below!

Art by Mythdael