Starbeamrainbowlabs

Stardust
Blog


Archive


Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems performance photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures three thing game three.js tool tutorial twitter ubuntu university update upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

TCP (Client) Networking in Pure Bash

Recently I re-remembered about /dev/tcp - a virtual bash file system that allows you to directly connect to a remote TCP endpoint - without the use of nc / netcat / ncat.

While it only allows you to connect (no listening, sadly), it's still a great bash built-in that helps avoid awkward platform-specific issues.

Here's how you'd listen for a connection with netcat, sending as many random numbers as possible to the poor unsuspecting client:

netcat -l 0.0.0.0 6666 </dev/urandom

Here's how you'd traditionally connect to that via netcat:

netcat X.Y.Z.W 6666 | pv >/dev/null

The pv command there is not installed by default, but is a great tool that shows the amount of data flowing through a pipe. It's available in the repositories for most Linux distributions without any special configuration required - so sudo apt install pv should be enough on Debian-based distributions.

Now, let's look at how we'd do this with pure bash:

pv >/dev/null </dev/tcp/X.Y.Z.W/6666

Very neat! We've been able to eliminate an extra child process. The question is though: how do they compare performance-wise? Well, that depends on how we measure it. In my case, I measured a single connection, downloading data as fast as it can for 60 seconds.

Another test would be to open many connections and download lots of small files. While I haven't done that here, I theorise that the pure-bash method would win out, as it doesn't have to spawn lots of subprocesses.

In my test, I did this:

# Traditional method
timeout 60 nc X.Y.Z.W 6666 | pv >/dev/null
# Pure-Bash method
timeout 60 pv >/dev/null </dev/tcp/X.Y.Z.W/6666

The timeout command kills the process after a given number of seconds. The server I connected to was just this:

while true; do nc -l 0.0.0.0 6666 </dev/urandom; done

Running the above test, I got the following output:

$ timeout 60 pv >/dev/null </dev/tcp/172.16.230.58/6666
 652MiB 0:00:59 [11.2MiB/s] [                                      <=>         ]
$ timeout 60 nc 172.16.230.58 6666 | pv >/dev/null
 599MiB 0:01:00 [11.1MiB/s] [                                     <=>          ]
Method Total Data Transferred
Traditional 599MiB
Pure Bash 652MiB

As it turns out, the pure bash method is apparently faster - by ~8.8%. I think this might have something to do with the lack of the additional sub-process, or some other optimisation that bash can apply when doing the TCP networking itself.

Found this interesting? Got a cool use for it? Discovered another awesome bash built-in? Comment below!

Animated PNG for all!

I recently discovered that Animated PNGs are now supported by most major browsers:

Data on support for the apng feature across the major browsers from caniuse.com

I stumbled across the concept of an animated PNG a number of years ago (on caniuse.com actually if I remember right!), but at the time browser support was very bad (read: non-existent :P) - so I moved on to other things.

I ended up re-discovering it a few weeks ago (also through caniuse.com!), and since browser support is so much better now I decided that I just had to play around with it.

It hasn't disappointed. Traditional animated GIFs (Graphics Interchange Format for the curious) are limited to 256 colours, have limited transparency support (a pixel is either transparent, or it isn't - there's no translucency support), and don't compress terribly well.

Animated PNGs (Portable Network Graphics), on the other hand, let you enjoy all the features of a regular PNG (as many colours as you want, translucent pixels, etc.) while also supporting animation, and compressing better as well! Best of all, if a browser doesn't support the animated PNG standard, they will just see and render the first frame as a regular PNG and silently ignore the rest.

Let's take it out for a spin. For my test, I took an image and created a 'panning' animation from one side of it to the other. Here's the image I've used:

(Credit: The background on the download page for Mozilla's Firefox Nightly builds. It isn't available on the original source website anymore (and I've lost the link), but can still be found on various wallpaper websites.)

The first task is to generate the frames from the original image. I wrote a quick shell script for this purpose. Firstly, I defined a bunch of variables:

#!/usr/bin/env bash
set -e; # Crash if we hit an error

input_file="Input.png";
output_file="Output.apng"

# The maximum number of frames
max_frame=32;
end_x=960; end_y=450;
start_x=0; start_y=450;

crop_size_x=960; crop_size_y=540;

mkdir -p ./frames;
Variable Meaning
input_file The input file to generate frames from
output_file The file to write the animated png to
max_frame The number of frames (plus 1) to generate
start_x The starting position to pan from on the x axis
start_y The starting position to pan from on the y axis
end_x The ending position to pan to on the x axis
end_y The ending position to pan to on the y axis
crop_size_x The width of the cropped frames
crop_size_y The height of the cropped frames

It's worth noting here that it's probably a good idea to implement a proper CLI to this script at this point, but since it's currently only a one-off script I didn't bother. Perhaps in the future I'll come back and improve it if I find myself needing it again for some other purpose.

With the parameters set up (and a temporary directory created - note that you should use mktemp -d instead of the approach I've taken here!), we can then use a for loop to repeatedly call ImageMagick to crop the input image to generate our frames. This won't run in parallel unfortunately, but since it's only a few frames it shouldn't take too long to render. This is only a quick shell script after all :P


for ((frame=0; frame <= "${max_frame}"; frame++)); do
    this_x="$(calc -p "${start_x}+((${end_x}-${start_x})*(${frame}/${max_frame}))")";
    this_y="$(calc -p "${start_y}+((${end_y}-${start_y})*(${frame}/${max_frame}))")";
    percent="$(calc -p "round((${frame}/${max_frame})*100, 2)")";


    convert "${input_file}" -crop "${crop_size_x}x${crop_size_y}+${this_x}+${this_y}" "frames/Frame-$(printf "%02d" "${frame}").jpeg";

    echo -ne "${frame} / ${max_frame} (${percent}%)          \r";
done
echo ""; # Move to the line after the progress indicator

This looks complicated, but it really isn't. The for loop iterates over each of the frame numbers. We do some maths to calculate the (x, y) co-ordinates of the frame we want to extract, and then get ImageMagick's convert command to do the cropping for us. Finally we write a quick progress indicator out to stdout (\r resets the cursor to the beginning of the line, but doesn't go down to the next one).

The maths there is probably better represented like this:

$this_x=start_x+((end_x-start_x)*\frac{currentframe}{max_frame})$

$this_y=start_y+((end_y-start_y)*\frac{currentframe}{max_frame})$

Much better :-) In short, we convert the current frame number into a percentage (between 0 and 1) of how far through the animation we are and use that as a multiplier to find the distance between the starting and ending points.

I use the calc command-line tool (in the apcalc package on Ubuntu) here to do the calculations, because the bash built-in result=$(((4 + 5))) only supports integer-based maths, and I wanted floating-point support here.

With the frames generated, we only need to stitch them together to make the final animated png. Unfortunately, an easy-to-use tool does not yet exist (like gifsicle for GIFs) - so we'll have to make-do with ffpmeg (which, while very powerful, has a rather confusing CLI!). Here's how I stitched the frames together:

ffmpeg -r 10 -i frames/Frame-%02d.jpeg -plays 0 "${output_file}";
  • -r - The frame rate
  • -i - The input filename
  • -plays - The number of times to loop (0 = infinite; defaults to no looping if omitted)
  • "${output_file}" - The output file

Here's the final result:

(Filesize: ~2.98MiB)

Of course, it'd be cool to compare it to a traditional animated gif. Let's do that too! First, we'll need to convert the frames to gif (gifsicle, our tool of choice, doesn't support anything other than GIFs as far as I'm aware):

mogrify -format gif frames/*.jpeg

Easy-peasy! mogrify is another tool from ImageMagick that makes such in-place conversions trivial. Note that the frames themselves are stored as JPEGs because I was experiencing an issue whereby each of the frames apparently had a slightly different colour palette, and ffmpeg wasn't smart enough to correct for this - choosing instead to crash.

With the frames converted, we can make our animated GIF like so:

gifsicle --optimize --colors=256 --loopcount=10 --delay=10 frames/*.gif >Output.gif;

Lastly, we probably want to delete the intermediate frames:

rm -r ./frames

Here's the animated GIF version:

(Filesize: ~3.28MiB)

Woah! That's much bigger.

A graph comparing the sizes of the APNG and GIF versions of the animation.

(Generated (and then extracted & edited with the Firefox developer tools) from Meta-Chart)

It's ~9.7% bigger in fact! Though not a crazy amount, smaller resulting files are always good. I suspect that this effect will stack the more frames you have. Others have tested this too, finding pretty similar results to those that I've found here - though it does of course depend on your specific scenario.

With that observation, I'll end this blog post. The next time you think about inserting an animation into a web page or chat window, consider making it an Animated PNG.

Found this interesting? Found a cool CLI tool for manipulating APNGs? Having trouble? Comment below!

Sources and Further Reading

Bridging the gap between XMPP and shell scripts

In a previous post, I set up a semi-automated backup system for my Raspberry Pi using duplicity, sendxmpp, and an external drive. It's been working fabulously for a while now, but unfortunately the other week sendxmpp suddenly stopped working with no obvious explanation. Given the long list of arguments I had to pass it:

sendxmpp --file "${xmpp_config_file}" --resource "${xmpp_resource}" --tls --chatroom "${xmpp_target_chatroom}" ...........

....and the fact that I've had to tweak said arguments on a number of occasions, I thought it was time to switch it out for something better suited to the task at hand.

Unfortunately, finding such a tool proved to be a challenge. I even asked on Reddit - but nobody had anything that fit the bill (xmpp-bridge wouldn't compile correctly - and didn't support multi-user chatrooms anyway, and xmpppy was broken too).

If you're unsure as to what XMPP is, I'd recommend checkout out either this or this tutorial. They both give a great introduction to what it is, what it does, and how it works - and the rest of this post will make much more sense if you read that first :-)

To this end, I finally gave in and wrote my own tool, which I've called xmppbridge. It's a global Node.JS script that uses the simple-xmpp to forward the standard input to a given JID over XMPP - which can optionally be a group chat.

In this post, I'm going to look at how I put it together, some of the issues I ran into along the way, and how I solved them. If you're interested in how to install and use it, then the package page on npm will tell you everything you need to know:

xmppbridge on npm

Architectural Overview

The script consists of 3 files:

  • index.sh - Calls the main script with ES6 modules enabled
  • index.mjs - Parses the command-line arguments and environment variables out, and provides a nice CLI
  • XmppBridge.mjs - The bit that actually captures input from stdin and sends it via XMPP

Let's look at each of these in turn - starting with the command-line interface.

CLI Parsing

The CLI itself is relatively simple - and follows a paradigm I've used extensively in C♯ (although somewhat modified of course to get it to work in Node.JS, and without fancy ANSI colouring etc.).

#!/usr/bin/env node
"use strict";

import XmppBridge from './XmppBridge.mjs';

const settings = {
    jid: process.env.XMPP_JID,
    destination_jid: null,
    is_destination_groupchat: false,
    password: process.env.XMPP_PASSWORD
};

let extras = [];
// The first arg is the script name itself
for(let i = 1; i < process.argv.length; i++) {
    if(!process.argv[i].startsWith("-")) {
        extras.push(process.argv[i]);
        continue;
    }

    switch(process.argv[i]) {
        case "-h":
        case "--help":
            // ........
            break;

        // ........

        default:
            console.error(`Error: Unknown argument '${process.argv[i]}'.`);
            process.exit(2);
            break;
    }
}

We start with a shebang, telling Linux-based systems to execute the script with Node.JS. Following that, we import the XmppBridge class that's located in XmppBrdige.mjs (we'll come back to this later). Then, we define an object to hold our settings - and pull in the environment variables along with defining some defaults for other parameters.

With that setup, we can then parse the command-line arguments themselves - using the exact same paradigm I've used time and time again in C♯.

Once the command-line arguments are parsed, we validate the final settings to ensure that the user hasn't left any required parameters undefined:

for(let environment_varable of ["XMPP_JID", "XMPP_PASSWORD"]) {
    if(typeof process.env[environment_varable] == "undefined") {
        console.error(`Error: The environment variable ${environment_varable} wasn't found.`);
        process.exit(1);
    }
}

if(typeof settings.destination_jid != "string") {
    console.error("Error: No destination jid specified.");
    process.exit(5);
}

That's basically all that index.mjs does. All that's really left is passing the parameters to an instance of XmppBridge:

const bridge = new XmppBridge(
    settings.destination_jid,
    settings.is_destination_groupchat
);
bridge.start(settings.jid, settings.password);

Shebang Trouble

Because I've used ES6 modules here, currently Node must be informed of this via the --experimental-modules CLI argument like this:

node --experimental-modules ./index.mjs

If we're going to make this a global command-line tool via the bin directive in package.json, then we're going to have to ensure that this flag gets passed to Node and not our program. While we could alter the shebang, that comes with the awkward problem that not all systems (in fact relatively few) support using both env and passing arguments. For example, this:

#!/usr/bin/env node --experimental-modules

Wouldn't work, because env doesn't recognise that --experimental-modules is actually a command-line argument and not part of the binary name that it should search for. I did see some Linux systems support env -S to enable this functionality, but it's hardly portable and doesn't even appear to work all the time anyway - so we'll have to look for another solution.

Another way we could do it is by dropping the env entirely. We could do this:

#!/usr/local/bin/node --experimental-modules

...which would work fine on my system, but probably not on anyone else's if they haven't installed Node to the same place. Sadly, we'll have to throw this option out the window too. We've still got some tricks up our sleeve though - namely writing a bash wrapper script that will call node telling it to execute index.mjs with the correct arguments. After a little bit of fiddling, I came up with this:

#!/usr/bin/env bash
install_dir="$(dirname "$(readlink -f $0)")";
exec node --experimental-modules "${install_dir}/index.mjs" $@

2 things are at play here. Firstly, we have to deduce where the currently executing script actually lies - as npm uses a symbolic link to allow a global command-line tool to be 'found'. Said symbolic link gets put in /usr/local/bin/ (which is, by default, in the user's PATH), and links to where the script is actually installed to.

To figure out the directory that we've been installed to is (and hence the location of index.mjs), we need to dereference the symbolic link and strip the index.sh filename away. This can be done with a combination of readlink -f (dereferences the symbolic link), dirname (get the parent directory of a given file path), and $0 (holds the path to the currently executing script in most circumstances) - which, in the case of the above, gets put into the install_dir variable.

The other issue is passing all the existing command-line arguments to index.mjs unchanged. We do this with a combination of $@ (which refers to all the arguments passed to this script except the script name itself) and exec (which replaces the currently executing process with a new one - in this case it replaces the bash shell with node).

This approach let's us customise the CLI arguments, while still providing global access to our script. Here's an extract from xmppbridge's package.json showing how I specify that I want index.sh to be a global script:

{
    .....

    "bin": {
        "xmppbridge": "./index.sh"
    },

    .....
}

Bridging the Gap

Now that we've got Node calling our script correctly and the arguments parsed out, we can actually bridge the gap. This is as simple as some glue code between simple-xmpp and readline. simple-xmpp is an npm package that makes programmatic XMPP interaction fairly trivial (though I did have to look at examples in the GitHub repository to figure out how to send a message to a multi-user chatroom).

readline is a Node built-in that allows us to read the standard input line-by-line. It does other things too (and is great for interactive scripts amongst other things), but that's a tale for another time.

The first task is to create a new class for this to live in:

"use strict";

import readline from 'readline';

import xmpp from 'simple-xmpp';

class XmppBridge {

    /**
     * Creates a new XmppBridge instance.
     * @param   {string}    in_login_jid        The JID to login with.
     * @param   {string}    in_destination_jid  The JID to send stdin to.
     * @param   {Boolean}   in_is_groupchat     Whether the destination JID is a group chat or not.
     */
    constructor(in_destination_jid, in_is_groupchat) {
        // ....
    }
}

export default XmppBridge;

Very cool! That was easy. Next, we need to store those arguments and connect to the XMPP server in the constructor:

this.destination_jid = in_destination_jid;
this.is_destination_groupchat = in_is_groupchat;

this.client = xmpp;
this.client.on("online", this.on_connect.bind(this));
this.client.on("error", this.on_error.bind(this));
this.client.on("chat", ((_from, _message) => {
    // noop
}).bind(this));

I ended up having to define a chat event handler - even though it's pointless, as I ran into a nasty crash if I didn't do so (I suspect that this use-case wasn't considered by the original package developer).

The next area of interest is that online event handler. Note that I've bound the method to the current this context - this is important, as it would be able to access the class instance's properties otherwise. Let's take a look at the code for that handler:

console.log(`[XmppBridge] Connected as ${data.jid}.`);
if(this.is_destination_groupchat) {
    this.client.join(`${this.destination_jid}/bot_${data.jid.user}`);
}
this.stdin = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
    terminal: false
});
this.stdin.on("line", this.on_line_handler.bind(this));
this.stdin.on("close", this.on_stdin_close_handler.bind(this));

This is the point at which we open the standard input and start listening for things to send. We don't do it earlier, as we don't want to end up in a situation where we try sending something before we're connected!

If we're supposed to be sending to a multi-user chatroom, this is also the point at which it joins said room. This is required as you can't send a message to a room that you haven't joined.

The resource (the bit after the forward slash /), for a group chat, specifies the nickname that you want to give to yourself when joining. Here, I automatically set this to the user part of the JID that we used to login prefixed with bot_.

The connection itself is established in the start method:

start(jid, password) {
    this.client.connect({
        jid,
        password
    });
}

And every time we receive a line of input, we execute the send() method:

on_line_handler(line_text) {
    this.send(line_text);
}

I used a full method here, as initially I had some issues and wanted to debug which methods were being called. That send method looks like this:

send(message) {
    this.client.send(
        this.destination_jid,
        message,
        this.is_destination_groupchat
    );
}

The last event handler worth mentioning is the close event handler on the readline interface:

on_stdin_close_handler() {
    this.client.disconnect();
}

This just disconnects from the XMXPP server so that Node can exit cleanly.

That basically completes the script. In total, the entire XmppBridge.mjs class file is 72 lines. Not bad going!

You can install this tool for yourself with sudo npm install -g xmppbridge. I've documented how it use it in the README, so I'd recommend heading over there if you're interested in trying it out.

Found this interesting? Got a cool use for XMPP? Comment below!

Sources and Further Reading

Backing up to AWS S3 with duplicity

The server that this website runs on backs up automatically to the Simple Storage Service, provided by Amazon Web Services. Such an arrangement is actually fairly cheap - only ~20p/month! I realised recently that although I've blogged about duplicity before (where I discussed using an external hard drive), I never covered how I fully automate the process here on starbeamrainbowlabs.com.

A bunch of hard drives. (Above: A bunch of hard drives. The original can be found here.)

It's fairly similar in structure to the way it works backing up to an external hard drive - just with a few different components here and there, as the script that drives this is actually older than the one that backs up to an external hard drive.

To start, we'll need an AWS S3 bucket. I'm not going to cover how to do this here, as the AWS interface keeps changing, and this guide will likely become outdated quickly. Instead, the AWS S3 documentation has an official guide on how to create one. Make sure it's private, as you don't want anyone getting a hold of your backups!

With that done, you should have both an access key and a secret. Note these down in a file called .backup-password in a new directory that will hold the backup script like this:

#!/usr/bin/env bash
PASSPHRASE="INSERT_RANDOM_PASSWORD_HERE";
AWS_ACCESS_KEY_ID="INSERT_AWS_ACCESS_KEY_HERE";
AWS_SECRET_ACCESS_KEY="INSERT_AWS_SECRET_KEY_HERE";

The PASSPHRASE here should be a long and unintelligible string of random characters, and will encrypt your backups. Note that down somewhere safe too - preferably in your password manager or somewhere else at least as secure.

If you're on Linux, you should also set the permissions on the .backup-password file to ensure nobody gets access to it who shouldn't. Here's how I did it:

sudo chown root:root .backup-password
sudo chmod 0400 .backup-password

This ensures that only the root user is able to read the file - and nobody can write to it. With our secrets generated and safely stored, we can start writing the backup script itself. Let's start by reading in the secrets:

#!/usr/bin/env bash
source /root/.backup-password

I stored my .backup-password file in /root. Next, let's export these values. This enables the subprocesses we invoke to access these environment variables:

export PASSPHRASE;
export AWS_ACCESS_KEY_ID;
export AWS_SECRET_ACCESS_KEY;

Now it's time to do the backup itself! Here's what I do:

duplicity \
    --full-if-older-than 2M \
    --exclude /proc \
    --exclude /sys \
    --exclude /tmp \
    --exclude /dev \
    --exclude /mnt \
    --exclude /var/cache \
    --exclude /var/tmp \
    --exclude /var/backups \
    --exclude /srv/www-mail/rainloop/v \
    --s3-use-new-style --s3-european-buckets --s3-use-ia \
    / s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE

Compressed version:

duplicity --full-if-older-than 2M --exclude /proc --exclude /sys --exclude /tmp --exclude /dev --exclude /mnt --exclude /var/cache --exclude /var/tmp --exclude /var/backups --exclude /srv/www-mail/rainloop/v --s3-use-new-style --s3-european-buckets --s3-use-ia / s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE

This might look long and complicated, but it's mainly due to the large number of directories that I'm excluding from the backup. The key options here are --full-if-older-than 2M and --s3-use-ia, which specify I want a full backup to be done every 2 months and to use the infrequent access pricing tier to reduce costs.

The other important bit here is to replace INSERT_BUCKET_NAME_HERE with the name of the S3 bucket that you created.

Backing is all very well, but we want to remove old backups too - in order to avoid ridiculous bills (AWS are terrible for this - there's no way that you can set a hard spending limit! O.o). That's fairly easy to do:

duplicity remove-older-than 4M \
    --force \
    --s3-use-new-style --s3-european-buckets --s3-use-ia \
    s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE

Again, don't forget to replace INSERT_BUCKET_NAME_HERE with the name of your S3 bucket. Here, I specify I want all backups older than 4 months (the 4M bit) to be deleted.

It's worth noting here that it may not actually be able to remove backups older than 4 months here, as it can only delete a full backup if there are not incremental backups that depend on it. To this end, you'll need to plan for potentially storing (and being charged for) an extra backup cycle's worth of data. In my case, that's an extra 2 months worth of data.

That's the backup part of the script complete. If you want, you could finish up here and have a fully-working backup script. Personally, I want to know how much data is in my S3 bucket - so that I can get an idea as to how much I'll be charged when the bill comes through - and also so that I can see if anything's going wrong.

Unfortunately, this is a bit fiddly. Basically, we have to utilise the AWS command-line interface to recursively list the entire contents of our S3 bucket in summarising mode in order to get it to tell us what we want to know. Here's how to do that:

aws s3 ls s3://INSERT_BUCKET_BAME_HERE --recursive --human-readable --summarize

Don't forget to replace INSERT_BUCKET_BAME_HERE wiith your bucket's name. The output from this is somewhat verbose, so I ended up writing an awk script to process it and output something nicer. Said awk script looks like this:

/^\s*Total\s+Objects/ { parts[i++] = $3 }
/^\s*Total\s+Size/ { parts[i++] = $3; parts[i++] = $4; }
END {
    print(
        "AWS S3 Bucket Status:",
        parts[0], "objects, totalling "
        parts[1], parts[2]
    );
}

If we put all that together, it should look something like this:

aws s3 ls s3://INSERT_BUCKET_BAME_HERE --recursive --human-readable --summarize | awk '/^\s*Total\s+Objects/ { parts[i++] = $3 } /^\s*Total\s+Size/ { parts[i++] = $3; parts[i++] = $4; } END { print("AWS S3 Bucket Status:", parts[0], "objects, totalling " parts[1], parts[2]); }'

...it's a bit of a mess. Perhaps I should look at putting that awk script in a separate file :P Anyway, here's some example output:

AWS S3 Bucket Status: 602 objects, totalling 21.0 GiB Very nice indeed. To finish off, I'd rather like to know how long it took to do all this. Thankfully, bash has an inbuilt automatic variable that holds the number of seconds since the current process has started, so it's just a case of parsing this out into something readable:

echo "Done in $(($SECONDS / 3600))h $((($SECONDS / 60) % 60))m $(($SECONDS % 60))s.";

...I forget which Stackoverflow answer it was that showed this off, but if you know - please comment below and I'll update this to add credit. This should output something like this:

Done in 0h 12m 51s.

Awesome! We've now got a script that backs up to AWS S3, deletes old backups, and tells us both how much space on S3 is being used and how long the whole process took.

I'm including the entire script at the bottom of this post. I've changed it slightly to add a single variable for the bucket name - so there's only 1 place on line 9 (highlighted) you need to update there.

(Above: A Geopattern, tiled using the GNU Image Manipulation Program)


#!/usr/bin/env bash

# Make sure duplicity exists
test -x $(which duplicity) || exit 1;

# Pull in the password
. /root/.backup-password

AWS_S3_BUCKET_NAME="INSERT_BUCKET_NAME_HERE";

# Allow duplicity to access it
export PASSPHRASE;
export AWS_ACCESS_KEY_ID;
export AWS_SECRET_ACCESS_KEY;

# Actually do the backup
# Backup strategy:
# 1 x backup per week:
#   1 x full backup per 2 months
#   incremental backups in between
# S3 Bucket URI: https://${AWS_S3_BUCKET_NAME}/
echo [ $(date +%F%r) ] Performing backup.
duplicity --full-if-older-than 2M --exclude /proc --exclude /sys --exclude /tmp --exclude /dev --exclude /mnt --exclude /var/cache --exclude /var/tmp --exclude /var/backups --exclude /srv/www-mail/rainloop/v --s3-use-new-style --s3-european-buckets --s3-use-ia / s3://s3-eu-west-1.amazonaws.com/${AWS_S3_BUCKET_NAME}

# Remove old backups
# You have to plan for 1 extra full backup cycle when
# calculating space requirements - duplicity only
# removes a backup if it won't invalidate those further
# along the chain - the oldest backup will always be
# a full one.
echo [ $(date +%F%r) ] Backup complete. Removing old volumes.
duplicity remove-older-than 4M --force --encrypt-key F2A6D8B6 --s3-use-new-style --s3-european-buckets --s3-use-ia s3://s3-eu-west-1.amazonaws.com/${AWS_S3_BUCKET_NAME}
echo [ $(date +%F%r) ] Cleanup complete.

aws s3 ls s3://${AWS_S3_BUCKET_NAME} --recursive --human-readable --summarize | awk '/^\s*Total\s+Objects/ { parts[i++] = $3 } /^\s*Total\s+Size/ { parts[i++] = $3; parts[i++] = $4; } END { print("AWS S3 Bucket Status:", parts[0], "objects, totalling " parts[1], parts[2]); }'

echo "Done in $(($SECONDS / 3600))h $((($SECONDS / 60) % 60))m $(($SECONDS % 60))s.";

Markov Chains Part 4: Test Data

With a shiny-new markov chain engine (see parts 1, 2, and 3), I found that I had a distinct lack of test data to put through it. Obviously this was no good at all, so I decided to do something about it.

Initially, I started with a list of HTML colours (direct link; 8.6KiB), but that didn't produce very good output:

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --length 16
errobiartrawbear
frelecteringupsy
lectrictomadolbo
vendellorazanigh
arvanginklectrit
dighoonbottlaven
onadeestersweese
ndiu
llighoolequorain
indeesteadesomiu

I see a few problems here. Firstly, it's treating each word as it's entity, where in fact I'd like it to generate n-grams on a line-by-line basis. Thankfully, this is easy enough with my new --no-split option:

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --no-split --length 16
med carrylight b
jungin pe red dr
ureelloufts blue
uamoky bluellemo
trinaterry aupph
utatellon reep g
radolitter brast
bian reep mardar
ght burnse greep
atimson-phloungu

Hrm, that's still rather unreadable. What if we make the n-grams longer by bumping the order?

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --length 16 --order 4
on fuchsia blue 
rsity of carmili
e blossom per sp
ngel
ulean red au lav
as green yellowe
indigri
ly gray aspe
disco blus
berry pine blach

Better, but it looks like it's starting the generation process from inside the middle of words. We can fix that with my new --start-uppercase option, which ensures that each output always stars with an n-gram that begins with a capital letter. Unfortunately the wordlist is all lowercase:

air force blue
alice blue
alizarin crimson
almond
amaranth
amber
american rose
amethyst
android green
anti-flash white

This is an issue. The other problem is that with an order of 4 the choice-point ratio is dropping quite low - I got a low of just ~0.97 in my testing.

The choice-point ratio is a measure I came up with of the average number of different directions the engine could potential go in at each step of the generation process. I'd like to keep this number consistently higher than 2, at least - to ensure a good variety of output.

Greener Pastures

Rather than try fix that wordlist, let's go in search of something better. It looks like the CrossCode Wiki has a page that lists all the items in the entire game. That should do the trick! The only problem is extracting it though. Let's use a bit of bash! We can use curl to download the HTML of the page, and then xidel to parse out the text from the <a> tags inside tables. Here's what I came up with:

curl https://crosscode.gamepedia.com/Items | xidel --data - --css "table a"

This is a great start, but we've got blank lines in there, and the list isn't sorted alphabetically (not required, but makes it look nice :P). Let's fix that:

curl https://crosscode.gamepedia.com/Items | xidel --data - --css "table a" | awk "NF > 0" | sort

Very cool. Tacking wc -l on the end of the pipe chain I can we've got ourselves a list of 527(!) items! Here's a selection of input lines:

Rough Branch
Raw Meat
Shady Monocle
Blue Grass
Tengu Mask
Crystal Plate
Humming Razor
Everlasting Amber
Tracker Chip
Lawkeeper's Fist

Let's run it through the engine. After a bit of tweaking, I came up with this:

cat wordlists/Cross-Code-Items.txt | MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --start-uppercase --no-split --length 16 --order 3
Capt Keboossauci
Fajiz Keblathfin
King Steaf Sharp
Stintze Geakralt
Fruisty 'olipe F
Apper's TN
Prow Peptumn's C
Rus Recreetan Co
Veggiel Spiragma
Laver's Bolden M

That's quite interesting! With a choice-point ratio of ~5.6 at an order of 3, we've got a nice variable output. If we increase the order to 4, we get ~1.5 - ~2.3:

Edgy Hoo
Junk Petal Goggl
Red Metal Need C
Samurai Shel
Echor
Krystal Wated Li
Sweet Residu
Raw Stomper Thor
Purple Fruit Dev
Smokawa

It appears to be cutting off at the end though. Not sure what we can do about that (ideas welcome!). This looks interesting, but I'm not done yet. I'd like it to work on word-level too!

Going up a level

After making some pretty extensive changes, I managed to add support for this. Firstly, I needed to add support for word-level n-gram generation. Currently, I've done this with a new GenerationMode enum.

public enum GenerationMode
{
    CharacterLevel,
    WordLevel
}

Under the hood I've just used a few if statements. Fortunately, in the case of the weighted generator, only the bottom method needed adjusting:

/// <summary>
/// Generates a dictionary of weighted n-grams from the specified string.
/// </summary>
/// <param name="str">The string to n-gram-ise.</param>
/// <param name="order">The order of n-grams to generate.</param>
/// <returns>The weighted dictionary of ngrams.</returns>
private static void GenerateWeighted(string str, int order, GenerationMode mode, ref Dictionary<string, int> results)
{
    if (mode == GenerationMode.CharacterLevel) {
        for (int i = 0; i < str.Length - order; i++) {
            string ngram = str.Substring(i, order);
            if (!results.ContainsKey(ngram))
                results[ngram] = 0;
            results[ngram]++;
        }
    }
    else {
        string[] parts = str.Split(" ".ToCharArray());
        for (int i = 0; i < parts.Length - order; i++) {
            string ngram = string.Join(" ", parts.Skip(i).Take(order)).Trim();
            if (ngram.Trim().Length == 0) continue;
            if (!results.ContainsKey(ngram))
                results[ngram] = 0;
            results[ngram]++;
        }
    }
}

Full code available here. After that, the core generation algorithm was next. The biggest change - apart from adding a setting for the GenerationMode enum - was the main while loop. This was a case of updating the condition to count the number of words instead of the number of characters in word mode:

(Mode == GenerationMode.CharacterLevel ? result.Length : result.CountCharInstances(" ".ToCharArray()) + 1) < length

A simple ternary if statement did the trick. I ended up tweaking it a bit to optimise it - the above is the end result (full code available here). Instead of counting the words, I count the number fo spaces instead and add 1. That CountCharInstances() method there is an extension method I wrote to simplify things. Here it is:

public static int CountCharInstances(this string str, char[] targets)
{
    int result = 0;
    for (int i = 0; i < str.Length; i++) {
        for (int t = 0; t < targets.Length; t++)
            if (str[i] == targets[t]) result++;
    }
    return result;
}

Recursive issues

After making these changes, I needed some (more!) test data. Inspiration struck: I could run it recipe names! They've quite often got more than 1 word, but not too many. Searching for such a list proved to be a challenge though. My first thought was BBC Food, but their terms of service disallow scraping :-(

A couple of different websites later, and I found the Recipes Wikia. Thousands of recipes, just ready and waiting! Time to get to work scraping them. My first stop was, naturally, the sitemap (though how I found in the first place I really can't remember :P).

What I was greeted with, however, was a bit of a shock:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p1.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p2.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p3.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p4.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p5.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p6.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p7.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p8.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p9.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_14-p1.xml</loc></sitemap>
<sitemap><loc>https://services.wikia.com/discussions-sitemap/sitemap/3355</loc></sitemap>
</sitemapindex>
<!-- Generation time: 26ms -->
<!-- Generation date: 2018-10-25T10:14:26Z -->

Like who has a sitemap of sitemaps, anyways?! We better do something about this: Time for some more bash! Let's start by pulling out those sitemaps.

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc"

Easy peasy! Next up, we don't want that bottom one - as it appears to have a bunch of discussion pages and other junk in it. Let's strip it out before we even download it!

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0

With a list of sitemaps extract from the sitemap (completely coconuts I tell you) extracted, we need to download them all in turn and extract the page urls therein. This is, unfortunately, where it starts to get nasty. While a simple xargs call downloads them all easily enough (| xargs -n1 -I{} curl "{}" should do the trick), this outputs them all to stdout, and makes it very difficult for us to parse them.

I'd like to avoid shuffling things around on the file system if possible, as this introduces further complexity. We're not out of options yet though, as we can pull a subshell out of our proverbial hat:

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 | xargs -n1 -I{} sh -c 'curl {} | xidel --data - --css "loc"'

Yay! Now we're getting a list of urls to all the pages on the entire wiki:

http://recipes.wikia.com/wiki/Mexican_Black_Bean_Soup
http://recipes.wikia.com/wiki/Eggplant_and_Roasted_Garlic_Babakanoosh
http://recipes.wikia.com/wiki/Bathingan_bel_Khal_Wel_Thome
http://recipes.wikia.com/wiki/Lebanese_Tabbouleh
http://recipes.wikia.com/wiki/Lebanese_Hummus_Bi-tahini
http://recipes.wikia.com/wiki/Baba_Ghannooj
http://recipes.wikia.com/wiki/Lebanese_Falafel
http://recipes.wikia.com/wiki/Lebanese_Pita_Bread
http://recipes.wikia.com/wiki/Kebab_Koutbane
http://recipes.wikia.com/wiki/Moroccan_Yogurt_Dip

One problem though: We want recipes names, not urls! Let's do something about that. Our next special guest that inhabits our bottomless hat is the illustrious sed. Armed with the mystical power of find-and-replace, we can make short work of these urls:

... | sed -e 's/^.*\///g' -e 's/_/ /g'

The rest of the command is omitted for clarity. Here I've used 2 sed scripts: One to strip everything up to the last forward slash /, and another to replace the underscores _ with spaces. We're almost done, but there are a few annoying hoops left to jump through. Firstly, there are A bunch of unfortunate escape sequences lying around (I actually only discovered this when the engine started spitting out random ones :P). Also, there are far too many page names that contain the word Nutrient, oddly enough.

The latter is easy to deal with. A quick grep sorts it out:

... | grep -iv "Nutrient"

The former is awkward and annoying. As far as I can tell, there's no command I can call that will decode escape sequences. To this end, I wound up embedding some Python:

... | python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])"

This makes the whole thing much more intimidating that it would otherwise be. Lastly, I'd really like to sort the list and save it to a file. Compared to the above, this is chicken feed!

| sort >Dishes.txt

And there we have it. Bash is very much like lego bricks when you break it down. The trick is to build it up step-by-step until you've got something that does what you want it to :)

Here's the complete command:

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 | xargs -n1 -I{} sh -c 'curl {} | xidel --data - --css "loc"' | sed -e 's/^.*\///g' -e 's/_/ /g' | python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])" | grep -iv "Nutrient" | sort >Dishes.txt

After all that effort, I think we deserve something for our troubles! With ~42K(!) lines in the resulting file (42,039 to be exact as of the last time I ran the monster above :P), the output (after some tweaking, of course) is pretty sweet:

cat wordlists/Dishes.txt | mono --debug MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --words --start-uppercase --length 8
Lemon Lime Ginger
Seared Tomatoes and Summer Squash
Cabbage and Potato au
Endive stuffed with Lamb and Winter Vegetable
Stuffed Chicken Breasts in Yogurt Turmeric Sauce with
Blossoms on Tomato
Morning Shortcake with Whipped Cream Graham
Orange and Pineapple Honey
Mango Raspberry Bread
Tempura with a Southwestern
Rice Florentine with
Cabbage Slaw with Pecans and Parmesan
Pork Sandwiches with Tangy Sweet Barbecue
Tea with Lemongrass and Chile Rice
Butterscotch Mousse Cake with Fudge
Fish and Shrimp -
Cucumber Salad with Roast Garlic Avocado
Beans in the Slow
Apple-Cherry Vinaigrette Salad
California Avocado Chinese Chicken Noodle Soup with Arugula

...I really need to do something about that cutting off issue. Other than that, I'm pretty happy with the result! The choice-point ratio is really variable, but most of the time it's hovering around ~2.5-7.5, which is great! The output if I lower the order from 3 to 2 isn't too bad either:

Salata me Htapodi kai
Hot 'n' Cheese Sandwich with Green Fish and
Poisson au Feuilles de Milagros
Valentines Day Cookies with Tofu with Walnut Rice
Up Party Perfect Turkey Tetrazzini
Olives and Strawberry Pie with Iceberg Salad with
Mashed Sweet sauce for Your Mood with Dried
Zespri Gold Corn rice tofu, and Corn Roasted
California Avocado and Rice Casserole with Dilled Shrimp
Egyptian Tomato and Red Bell Peppers, Mango Fandango

This gives us a staggering average choice-point ratio of ~125! Success :D

One more level

After this, I wanted to push the limits of the engine, so see what it's capable of. The obvious choice here is Shakespeare's Complete Works (~5.85MiB). Pushing this through the engine required some work, as ~30 seconds is far too slow - namely optimising the pipeline as much as possible.

The Mono Profiler helped a lot here. With it, I discovered that string.StartsWith() is really slow. Like, ridiculously slow (though this is relative, since I'm calling it hundreds of thousand of times), as it's culture-aware. In our case, we can't be bothering with any of that, as it's not relevant anyway. The easiest solution is to write another extension method:

public static bool StartsWithFast(this string str, string target) {
    if (str.Length < target.Length) return false;
    return str.Substring(0, target.Length) == target;
}

string.Substring() is faster, so by utilising this instead of the regular string.StartsWith() yields us a perfectly gigantic boost! Next up, I noticed that I can probably parallelize the Linq query that builds the list of possible n-grams we can choose from next, so that it runs on all the CPU cores:

Parallel.ForEach(ngrams, (KeyValuePair<string, double> ngramData) => {
    if (!ngramData.Key.StartsWithFast(nextStartsWith)) return;
    if (!convNextNgrams.TryAdd(ngramData.Key, ngramData.Value))
        throw new Exception("Error: Failed to add to staging ngram concurrent dictionary");
});

Again, this netted a another huge gain. With this and a few other architectural changes, I was able to chop the time down to a mere ~4 seconds (for a standard 10 words)! In the future, I might experiment with selective unmanaged code via the unsafe keyword to see if I can do any better.

For now, it's fast enough to enjoy some random Shakespeare on-demand:

What should they since that to that tells me Nero
He doth it turn and and too, gentle
Ha! you shall not come hither; since that to provoke
ANTONY. No further, sir; a so, farewell, Sir
Bona, joins with
Go with your fingering,
From fairies and the are like an ass which is
The very life-blood of our blood, no, not with the
Alas, why is here-in which
Transform'd and weak'ned? Hath Bolingbroke

Very interesting. The choice-point ratios sit at ~20 and ~3 for orders 3 and 4 respectively, though I got as high as 188 for an order of 3 during my testing. Definitely plenty of test data here :P

Conclusion

My experiments took me to several other places - which, if I included them all here, would result in an post much much longer than this! I scripted the download of several other wordlists in download.sh (direct link, 4.2KiB), if you're interested, with ready-downloaded copies in the wordlists folder of the repository.

I would advise reading the table in the README that gives credit to where I sourced each list, because of course I didn't put any of them together myself - I just wrote the script :P

Particularly of note is the Starbound list, which contains a list of all the blocks and items from the game Starbound. I might post about that one separately, as it ended up being a most interesting adventure.

In the future, I'd like to look at implementing a linguistic drift algorithm, to try and improve the output of the engine. The guy over at Here Dragons Abound has a great post on the subject, which you should definitely read if you're interested.

Found this interesting? Got an idea for another wordlist I can push though my engine? Confused by something? Comment below!

Sources and Further Reading

Acorn Validator

Edit: Corrected title and a bunch of grammatical mistakes. I typed this out on my phone with Monospace - and I seems that my keyboard (and Phone-Laptop Bluetooth connection!) leave something to be desired.....

Over the last week, I've been hard at work on an entry for #LOWREZJAM. While it's not finished yet (submission is on Thursday), I've found some time to write up a quick blog post about a particular aspect of it. Of course, I'll be blogging about the rest of it later once it's finished :D

The history of my entry is somewhat.... complicated. Originally, I started work on it a few months back as an independent project, but due to time constraints and other issues I was unable to get very far with it. Once I discovered #LOWREZJAM, I made the decision to throw away the code I had written and start again from (almost) scratch.

It is for this reason that I have a Javascript validator script lying around. I discovered when writing it originally that my editor Atom didn't have syntax validation support for Javascript. While there are extensions that do the job, it looked like a complicated business setting one up for just syntax checking (I don't want your code style guideline suggestions! I have my own style!). To this end, I wrote myself a quick bash script to automatically check the syntax of all my javascript files that I can then include as a build step - just before webpack.

Over the time I've been working on my #LOWREZJAM entry here, I've been tweaking and improving it - and thought I'd share it here. In the future, I'm considering turning it into a linting provider for the Atom editor I mentioned above (it depends on how complicated that is, and how good the documentation is to help me understand the process).

The script (which can be found in full at the bottom of this post), has 2 modes of operation. In the first mode, it acts as a co-ordinator process that starts all the sub-processes that validate the javascript. In the second mode, it validates a single file - outputting the result to the standard output and also logging any errors in a special place that the co-ordinator process can find them later.

It decides which mode to operate in based on whether it recieves an argument telling it which file to validate:

if [ "$1" != "" ]; then
    validate_file "$1" "$2";
    exit $?;
fi

If it detects an argument, then it calls the validate_file function and exits with the returned exit code.

If not, then the script continues into co-ordinator mode. In this mode it chains a bunch of commands together like lego bricks to start subprocesses to validate all of the javascript files it can find a fast as possible - acorn, the validator itself, can only check one file as a time it would appear. It does this like so:

find . -not -path "./node_modules/" -not -path "./dist/" | grep -iP '\.mjs$' | xargs -n1 -I{} -P32 $0 "{}" "${counter_dirname}";

This looks complicated, but it can be broken down into smaller, easy-to-understand chunks. explainshell.com is rather good at demonstrating this. Particularly of note here is the $0. This variable holds the path to the currently executing script - allowing the co-ordinator to call itself in validator mode.

The validator function itself is also quite simple. In short, it runs the validator, storing the result in a variable. It then also saves the exit code for later analysis. Once done, it outputs to the standard output, and then also outputs the validator's output - but only is there was an error to keep things neat and tidy. Finally, if there was an error, it outputs a file to a temporary directory (whose name is determined by the co-ordinator and passed to sub-processes via the 2nd argument) with a name of its PID (the content doesn't matter - I just output 1, but anything will do). This allows the co-ordinator to count the number of errors that the subprocesses encounter, without having to deal with complicated locks arising from updating the value stored in a single file. Here's that in bash:

counter_dirname=$2;

# ....... 

# Use /dev/shm here since apparently while is in a subshell, so it can't modify variables in the main program O.o
    if ! [ "${validate_exit_code}" -eq 0 ]; then
        echo 1 &gt;"${counter_dirname}/$$";
    fi

Once all the subprocesses have finished up, the co-ordinator counts up all the errors and outputs the total at the bottom:

error_count=$(ls ${counter_dirname} | wc -l);

echo 
echo Errors: $error_count
echo 

Finally, the co-ordinator cleans up after the subprocesses, and exits with the appropriate error code. This last bit is important for automation, as a non-zero exit code tells the patent process that it failed. My build script (which uses my lantern build engine, which deserves a post of its own) picks up on this and halts the build if any errors were found.

rm -rf "${counter_dirname}";

if [ ${error_count} -ne 0 ]; then
    exit 1;
fi

exit 0;

That's about all there is to it! The complete code can be found at the bottom of this post. To use it, you'll need to run npm install acorn in the directory that you save it to.

I've done my best to optimize it - it can process a dozen or so files in ~1 second - but I think I can do much better if I rewrite it in Node.JS - as I can eliminate the subprocesses by calling the acorn API directly (it's a Node.JS library), rather than spawning many subprocesses via the CLI.

Found this useful? Got a better solution? Comment below!

#!/usr/bin/env sh

validate_file() {
    filename=$1;
    counter_dirname=$2;

    validate_result=$(node_modules/.bin/acorn --silent --allow-hash-bang --ecma9 --module $filename 2&gt;&amp;1);
    validate_exit_code=$?;
    validate_output=$([ ${validate_exit_code} -eq 0 ] &amp;&amp; echo ok || echo ${validate_result});
    echo "${filename}: ${validate_output}";
    # Use /dev/shm here since apparently while is in a subshell, so it can't modify variables in the main program O.o
    if ! [ "${validate_exit_code}" -eq 0 ]; then
        echo 1 &gt;"${counter_dirname}/$$";
    fi
}

if [ "$1" != "" ]; then
    validate_file "$1" "$2";
    exit $?;
fi

counter_dirname=$(mktemp -d -p /dev/shm/ -t acorn-validator.XXXXXXXXX.tmp);
# Parallelisation trick from https://stackoverflow.com/a/33058618/1460422
# Updated to use xargs
find . -not -path "./node_modules/" -not -path "./dist/" | grep -iP '\.mjs$' | xargs -n1 -I{} -P32 $0 "{}" "${counter_dirname}";

error_count=$(ls ${counter_dirname} | wc -l);

echo 
echo Errors: $error_count
echo 

rm -rf "${counter_dirname}";

if [ ${error_count} -ne 0 ]; then
    exit 1;
fi

exit 0;

Read / Write Disk Performance Testing in Bash

Recently I needed to quickly (and non-destructively) test the read / write performance of a flash drive of mine. Naturally, I turned my attention to my terminal. This post is me documenting what I did so that I can remember for next time :P

Firstly, to test the speed of a disk, we need some data to test with. Since lots of small files will inevitably cause slowdowns due to the overhead of writing the file metadata and inode information to the superblock, it makes the most sense to use one gigantic file rather than tons of small ones. Here's what I did to generate a 1 Gigabyte file filled with zeroes:

dd if=/dev/zero of=/tmp/testfile.bin bs=1M count=1024

Cool. Next, we need to copy it to the target disk and measure the time it took. Then, since we know the size of the file (1073741824 bytes, to be exact), we can calculate the speed at which the copy took place. Here's my first attempt:

time dd if=/tmp/testfile.bin >testfile.bin

If you run this, you might find that it doesn't take it very long at all, and you get a speed of something like ~250MiB / sec! While impressive, I seriously doubt that my flash drive has that kind of speed behind it. Typically, flash memory takes longer to write to and read from - and I'm pretty sure that it can't read from it that fast either. So what's going on?

Well, it turns out that Linux is caching the disk write operations in a buffer, and then doing them in the background for us. Whilst fine for ordinary operation, this doesn't give us an accurate representation of how fast it's actually writing to the disk. Thankfully, there's something we can do about this: Use the sync command. sync will flush all cached write operations to disk for us, giving us the actual time it took to write the 1 GiB file to disk. Here's the altered command:

sync;
time sh -c 'dd if=/tmp/testfile.bin >testfile.bin; sync'

Very cool! Now, we can just take the time it took and do some simple maths to calculate the write speed of our disk. What about the read speed though? Well, to test that, we'll first need to clear out the page cache - another one of Linux's (many) caches that holds portions of files that have recently been accessed for faster retrieval - because as before, we're not interested in the speed of the cache! Here's how to do that:

echo 1 | sudo tee /proc/sys/vm/drop_caches

With the correct cache cleared, we can test the read speed accurately. Here's how I did it:

time dd if=testfile.bin of=/dev/null

Fairly simple, right? At a later date I might figure out a way of automating this, but for the occasional use now and again this works just fine :)

Found this useful? Got a better way of doing it? Want to say hi? Post in the comments below!

Jump around a filesystem with a bit of bash

Your shell in the middle of a teleporter.

(Banner remixed from images found on openclipart)

I've seen things like jump, which allow you to bookmark places on your system so that you can return to them faster. The trouble is, I keep forgetting to use it. I open the terminal and realise that I need to be in a specific directory, and forget to bookmark it once I cd to it - or I forget that I bookmarked it and cd my way there anyway :P

To solve the problem, I thought I'd try implementing my own simplified system, under the name teleport, telepeek, and telepick. Obviously, we'll have to put these scripts in something like .bash_aliases as functions - otherwise it won't cd in the terminal itself. Let's start with teleport:

function teleport() {
    cd "$(find . -type d | grep -iP "$@" | head -n1)";
}

Not bad for a first attempt! Basically, it does a find to list all the subdirectories in the current directory, filters the results with the specified regex, and changes directory to the first result returned. Here's an example of how it's used:

~ $ teleport 'pep.*mint'
~/Documents/code/some/path/pepperminty-wiki/ $ 

We can certainly improve it though. Let's start by removing that head call:

function teleport() {
    cd "$(find . -type d | grep -m1 -iP "$@")";
}

What about all those Permission denied messages that pop up when you're jumping around places that you might not have permission to go everywhere? Let's suppress those too:

function teleport() {
    cd "$(find . -type d 2>/dev/null | grep -m1 -iP "$@")";
}

Much better. With a teleport command in hand, it might be nice to inspect the list of directories the find + grep combo finds. For that, let's invent a telepeek variant:

function telepeek() {
    find . -type d 2>/dev/null | grep -iP "$@" | less
}

Very cool. It doesn't have line numbers though, and they're useful. Let's fix that:

function telepeek() {
    find . -type d 2>/dev/null | grep -iP "$@" | less -N
}

Better, but I'd prefer them to be highlighted so that I can tell them apart from the directory paths. For that, we've got to change our approach to the problem:

function telepeek() {
    find . -type d 2>/dev/null | grep -iP "$@" | cat -n | sed 's/^[ 0-9]*[0-9]/\o033[34m&\o033[0m/' | less -R
}

By using a clever combination of cat -n to add the line numbers and a strange sed recipe (which I found in a comment on this Stack Overflow answer) to highlight the numbers themselves, we can get the result we want.

This telepeek command has given me an idea. Why not ask for an index to jump to after going to the trouble of displaying line numbers and jump to that directory? Let's cook up a telepick command!

function telepick() {
    telepeek $1;
    read -p "jump to index: " line_number;
    cd "$(find . -type d 2>/dev/null | grep -iP "$@" | sed "${line_number}q;d")";
}

That wasn't too hard. By using a few different commands rather like lego bricks, we can very easily create something that does what we want with minimal effort. The read -p "jump to index: " line_number bit fetches the index that the user wants to jump to, and sed comes to the rescue again to pick out the line number we're interested in with sed "${line_number}q;d".

Update April 20th 2018: I've updated the approach here to support spaces everywhere by adding additional quotes, and utilising $@ instead of $1.

Pepperminty Wiki CLI

The Pepperminty Wiki CLI. in a terminal window, with a peppermint overlaid in the top left of the image.

I've got a plan. Since I'm taking the Mobile Development module next semester, I'd like to create an Android app for Pepperminty Wiki that will let me edit one or more instances of Pepperminty Wiki while I'm, say, on a bus.

To this end, I'll need to make sure that Pepperminty Wiki itself is all ready to go - which primarily entail making sure that its REST API is suitably machine-friendly, so that I can pull down all the information I need in the app I build.

Testing this, however, is a bit of a challenge - since I haven't actually started the module yet. My solution, as you might have guessed by the title of this blog post, is to build a command-line interface (CLI) instead. I've been writing a few bash scripts recently, to I tried my hand at creating something that's slightly more polished. Here's a list of the features supported at the time of posting:

  • Listing all pages
  • Viewing a specific page
  • Listing all revisions of a page
  • Viewing a specific revision of a page

Support for searching is on the cards, but it's currently waiting on support for grabbing search results as json / plain text from Pepperminty Wiki itself.

I'll be updating it with other things too as I think of them, but if you'd like to give it a try now, then here's the source:

It should update dynamically as I update the script. Simply save it to a file called peppermint - and then you can run ./peppermint to get an overview of the commands it supports. To get detailed help on a specific command, simply run ./peppermint {command_name} to get additional help about that specific command - and additional help for that command will be displayed if it supports any further arguments (it will be executed directly if not).

Sound interesting? Any particular aspect of this script you'd like explaining in more detail? Want to help out? Leave a comment below!

Semi-automated backups with duplicity and an external drive

A bunch of hard drives. (Above: A bunch of hard drives. The original can be found here.)

Since I've recently got myself a raspberry pi to act as a server, I naturally needed a way to back it up. Not seeing anything completely to my tastes, I ended up putting something together that did the job for me. For this I used an external hard drive, duplicity, sendxmpp (sudo apt install sendxmpp), and a bit of bash.

Since it's gone rather well for me so far, I thought I'd write a blog post on how I did it. It still needs some tidying up, of course - but it works in it's current state, and perhaps it will help someone else put together their own system!

Step 1: Configuring the XMPP server

I use XMPP as my primary instant messaging server, so it's only natural that I'd want to integrate the system in with it to remind me when to plug in the external drive, and so that it can tell me when it's done and what happened. Since I use prosody as my XMPP server, I can execute the following on the server:

sudo prosodyctl adduser rasperrypi@bobsrockets.com

...and then enter a random password for the new account. From there, I set up a new private persistent multi-user chatroom for the messages to filter into, and set my client to always notify when a message is posted.

After that, it was a case of creating a new config file in a format that sendxmpp will understand:

rasperrypi@bobsrockets.com:5222 thesecurepassword

Step 2: Finding the id of the drive partition

With the XMPP side of things configured, next I needed a way to detect if the drie was plugged in or not. Thankfully all partitions have a unique id built-in, which you can use to see if it's plugged in or not. It's easy to find, too:

sudo blkid

The above will list all available partitions and their UUID - the unique id I mentioned. With that in hand, we can now check if it's plugged in or not with a cleverly crafted use of the readlink command:

readlink /dev/disk/by-uuid/${partition_uuid} 1>/dev/null 2>&2;
partition_found=$?
if [[ "${partition_found}" -eq "0" ]]; then
    echo "It's plugged in!";
else
    echo "It's not plugged in :-(";
fi

Simple, right? readlink has an exit code of 0 if it managed to read the symbolik link in /dev/disk/by-uuid ok, and 1 if it didn't. The symbolic links in /deve/disk/by-uuid are helpfuly created automatically for us :D From here, we can take it a step further to wait until the drive is plugged in:

# Wait until the drive is available
while true
do
    readlink "${partition_uuid}";

    if [[ "$?" -eq 0 ]]; then
        break
    fi

    sleep 1;
done

Step 3: Mounting and unmounting the drive

Raspberry Pis don't mount drive automatically, so we'll have do that ourselves. Thankfully, it's not so tough:

# Create the fodler to mount the drive into
mkdir -p ${backup_drive_mount_point};
# Mount it in read-write mode
mount "/dev/disk/by-uuid/${partition_uuid}" "${backup_drive_mount_point}" -o rw;

# Do backup thingy here

# Sync changes to disk
sync
# Unmount the drive
umount "${backup_drive_mount_point}";

Make sure you've got the ntfs-3g package installed if you want to back up to an NTFS volume (Raspberry Pis don't come with it by default!).

Step 4: Backup all teh things!

There are more steps involved in getting to this point than I thought there were, but if you've made it this far, than congrats! Have a virtual cookie :D 🍪

The next part is what you probably came here for: duplicity itself. I've had an interesting time getting this to work so far, actually. It's probably easier if I show you the duplicity commands I came up with first.

# Create the archive & temporary directories
mkdir -p /mnt/data_drive/.duplicity/{archives,tmp}/{os,data_drive}
# Do a new backup
PASSPHRASE=${encryption_password} duplicity --full-if-older-than 2M --archive-dir /mnt/data_drive/.duplicity/archives/os --tempdir /mnt/data_drive/.duplicity/tmp/os --exclude /proc --exclude /sys --exclude /tmp --exclude /dev --exclude /mnt --exclude /var/cache --exclude /var/tmp --exclude /var/backups / file://${backup_drive_mount_point}/duplicity-backups/os/
PASSPHRASE=${data_drive_encryption_password} duplicity --full-if-older-than 2M --archive-dir /mnt/data_drive/.duplicity/archives/data_drive --tempdir /mnt/data_drive/.duplicity/tmp/data_drive /mnt/data_drive --exclude '**.duplicity/**' file://${backup_drive_mount_point}/duplicity-backups/data_drive/

# Remove old backups
PASSPHRASE=${encryption_password} duplicity remove-older-than 6M --force --archive-dir /mnt/data_drive/.duplicity/archives/os file:///${backup_drive_mount_point}/duplicity-backups/os/
PASSPHRASE=${data_drive_encryption_password} duplicity remove-older-than 6M --force --archive-dir /mnt/data_drive/.duplicity/archives/data_drive file:///${backup_drive_mount_point}/duplicity-backups/data_drive/

Path names have been altered for privacy reasons. The first duplicity command in the above was fairly straight forward - backup everything, except a few folders with cache files / temporary / weird stuff in them (like /proc).

I ended up having to specify the archive and temporary directories here to be on another disk because the Raspberry Pi I'm running this on has a rather... limited capacity on it's internal micro SD card, so the default location for both isn't a good idea.

The second duplicity call is a little more complicated. It backs up the data disk I have attached to my Raspberry Pi to the external drive I've got plugged in that we're backing up to. The awkward bit comes when you realise that the archive and temporary directories are located on this same data-disk that we're trying to back up. To this end, I eventually found (through lots of fiddling) that you can exclude a folder duplicity via the --exclude '**.duplicity/**' syntax. I've no idea why it's different when you're not backing up the root of the filesystem, but it is (--exclude ./.duplicity/ didn't work, and neither did /mnt/data_drive/.duplicity/).

The final two duplicity calls just clean up and remove old backups that are older than 6 months, so that the drive doesn't fill up too much :-)

Step 5: What? Where? Who?

We've almost got every piece of the puzzle, but there's still one left: letting us know what's going on! This is a piece of cake in comparison to the above:

function xmpp_notify {
        echo $1 | sendxmpp --file "${xmpp_config_file}" --resource "${xmpp_resource}" --tls --chatroom "${xmpp_target_chatroom}"
}

Easy! All we have to do is point sendxmpp at our config file we created waaay in step #1, and tell it where the chatroom is that we'd like it to post messages in. With that, we can put all the pieces of the puzzle together:

#!/usr/bin/env bash

source .backup-settings

function xmpp_notify {
    echo $1 | sendxmpp --file "${xmpp_config_file}" --resource "${xmpp_resource}" --tls --chatroom "${xmpp_target_chatroom}"
}

xmpp_notify "Waiting for the backup disk to be plugged in.";

# Wait until the drive is available
while true
do
    readlink "${backup_drive_dev}";

    if [[ "$?" -eq 0 ]]; then
        break
    fi

    sleep 1;
done

xmpp_notify "Backup disk detected - mounting";

mkdir -p ${backup_drive_mount_point};

mount "${backup_drive_dev}" "${backup_drive_mount_point}" -o rw

xmpp_notify "Mounting complete - performing backup";

# Create the archive & temporary directories
mkdir -p /mnt/data_drive/.duplicity/{archives,tmp}/{os,data_drive}

echo '--- Root Filesystem ---' >/tmp/backup-status.txt
# Create the archive & temporary directories
mkdir -p /mnt/data_drive/.duplicity/{archives,tmp}/{os,data_drive}
# Do a new backup
PASSPHRASE=${encryption_password} duplicity --full-if-older-than 2M --archive-dir /mnt/data_drive/.duplicity/archives/os --tempdir /mnt/data_drive/.duplicity/tmp/os --exclude /proc --exclude /sys --exclude /tmp --exclude /dev --exclude /mnt --exclude /var/cache --exclude /var/tmp --exclude /var/backups / file://${backup_drive_mount_point}/duplicity-backups/os/ 2>&1 >>/tmp/backup-status.txt
echo '--- Data Disk ---' >>/tmp/backup-status.txt
PASSPHRASE=${data_drive_encryption_password} duplicity --full-if-older-than 2M --archive-dir /mnt/data_drive/.duplicity/archives/data_drive --tempdir /mnt/data_drive/.duplicity/tmp/data_drive /mnt/data_drive --exclude '**.duplicity/**' file://${backup_drive_mount_point}/duplicity-backups/data_drive/ 2>&1 >>/tmp/backup-status.txt

xmpp_notify "Backup complete!"
cat /tmp/backup-status.txt | sendxmpp --file "${xmpp_config_file}" --resource "${xmpp_resource}" --tls --chatroom "${xmpp_target_chatroom}"
rm /tmp/backup-status.txt

xmpp_notify "Performing cleanup."

PASSPHRASE=${encryption_password} duplicity remove-older-than 6M --force --archive-dir /mnt/data_drive/.duplicity/archives/os file:///${backup_drive_mount_point}/duplicity-backups/os/
PASSPHRASE=${data_drive_encryption_password} duplicity remove-older-than 6M --force --archive-dir /mnt/data_drive/.duplicity/archives/data_drive file:///${backup_drive_mount_point}/duplicity-backups/data_drive/

sync;
umount "${backup_drive_mount_point}";

xmpp_notify "Done! Backup completed. You can now remove the backup disk."

I've tweaked a few of the pieces to get them to work better together, and created a separate .backup-settings file to store all the settings in.

That completes my backup script! Found this useful? Got an improvement? Use a different strategy? Post a comment below!

Art by Mythdael