Archive

## Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

## Monitoring HTTP server response time with collectd and a bit of bash

In the spirit of the last few posts I've been making here (A and B), I'd like to talk a bit about collectd, which I use to monitor the status of my infrastructure. Currently this consists of the server you've connected to in order to view this webpage, and a Raspberry Pi that acts as a home file server.

I realised recently that monitoring the various services that I run (such as my personal git server for instance) would be a good idea, as I'd rather like to know when they go down or act abnormally.

As a first step towards this, I decided to configure my existing collectd setup to monitor the response time of the HTTP endpoints of these services. Later on, I can then configure some alerts to message me when something goes down.

My first thought was to check the plugin list to see if there was one that would do the trick. As you might have guessed by the title of this post, however, such an easy solution would be too uninteresting and not worthy of writing a blog post.

Since such a plugin doesn't (yet?) exist, I turned to the exec plugin instead.

In short, it lets you write a program that writes to the standard output in the collectd plain text protocol, which collectd then interprets and adds to whichever data storage backend you have configured.

Since shebangs are a thing on Linux, I could technically choose any language I have an interpreter installed for, but to keep things (relatively) simple, I chose Bash, the language your local terminal probably speaks (unless it speaks zsh or fish instead).

My priorities were to write a script that is:

1. Easy to reconfigure
2. Ultra lightweight

Bash supports associative arrays, so I can cover point #1 pretty easily like this:

declare -A targets=(
["main_website"]="https://starbeamrainbowlabs.com/"
["git"]="https://git.starbeamrainbowlabs.com/"
# .....
)

Excellent! Covering point #2 will be an on-going process that I'll need to keep in mind as I write this script. I found this GitHub repository a while back, which has served as a great reference point in the past. Here's hoping it'll be useful this time too!

It's important to note the structure of the script that we're trying to write. Collectd exec scripts have 2 main environment variables we need to take notice of:

• COLLECTD_HOSTNAME - The hostname of the local machine
• COLLECTD_INTERVAL - Interval at which we should collect data. Defined in collectd.conf.

The script should write to the standard output the values we've collected in the collectd plain text format every COLLECTD_INTERVAL. Collectd will automatically ensure that only 1 instance of our script is running at once, and will also automatically restart it if it crashes.

To run a command regularly at a set interval, we probably want a while loop like this:

while :; do
# Do our stuff here

sleep "${COLLECTD_INTERVAL}"; done This is a great start, but it isn't really compliant with objective #2 we defined above. sleep is actually a separate command that spawns a new process. That's an expensive operation, since it has to allocate memory for a new stack and create a new entry in the process table. We can avoid this by abusing the read command timeout, like this: # Pure-bash alternative to sleep. # Source: https://blog.dhampir.no/content/sleeping-without-a-subprocess-in-bash-and-how-to-sleep-forever snore() { local IFS; [[ -n "${_snore_fd:-}" ]] || exec {_snore_fd}<> <(:);
read ${1:+-t "$1"} -u $_snore_fd || :; } Thanks to bolt for this. Next, we need to iterate over the array of targets we defined above. We can do that with a for loop: while :; do for target in "${!targets[@]}"; do
check_target "${target}" "${targets[${target}]}" done snore "${COLLECTD_INTERVAL}";
done

Here we call a function check_target that will contain our main measurement logic. We've changed sleep to snore too - our new subprocess-less sleep alternative.

Note that we're calling check_target for each target one at a time. This is important for 2 reasons:

• We don't want to potentially skew the results by taking multiple measurements at once (e.g. if we want to measure multiple PHP applications that sit in the same process poll, or measure more applications than we have CPUs)
• It actually spawns a subprocess for each function invocation if we push them into the background with the & operator. As I've explained above, we want to try and avoid this to keep it lightweight.

Next, we need to figure out how to do the measuring. I'm going to do this with curl. First though, we need to setup the function and bring in the arguments:

# $1 - target name #$2 - url
check_target() {
local target_name="${1}" local url="${2}";

# ......
}

Excellent. Now, let's use curl to do the measurement itself:

curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}" "${url}"

This looks complicated (and it probably is to some extent), but let's break it down with the help of explainshell.

Part Meaning
-sS Squashes all output except for errors and the bits we want. Great for scripts like ours.
--user-agent Specifies the user agent string to use when making a request. All good internet citizens should specify a descriptive one (more on this later).
-o /dev/null We're not interested in the content we download, so this sends it straight to the bin.
--max-time 5 This sets a timeout of 5 seconds for the whole operation - after which curl will throw an error and return with exit code 28.
-w "%{http_code}\n%{time_total}" This allows us to pull out metadata about the request we're interested in. There's actually a whole range available, but for now I'm interested in how long it took and the response code returned
"${url}" Specifies the URL to send the request to. curl does actually support making more than 1 request at once, but utilising this functionality is out-of-scope for now (and we'd get skewed results because it re-uses connections - which is normally really helpful & performance boosting) To parse the output we get from curl, I found the readarray command after going a bit array mad at the beginning of this post. It pulls every line of input into a new slot in an array for us - and since we can control the delimiter between values with curl, it's perfect for parsing the output. Let's hook that up now: readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}" "${url}"); The weird command < <(another_command); syntax is process substitution. It's a bit like the another_command | command syntax, but a bit different. We need it here because readarray parses the values into a new array variable in the current context, and if we use the a | b syntax here, we instantly lose access to the variable it creates because a subprocess is spawned (and readarray is a bash builtin) - hence the weird process substitution. Now that we've got the output from curl parsed and ready to go, we need to handle failures next. This is a little on the nasty side, as by default bash won't give us the non-zero exit code from substituted processes. Hence, we need to tweak our already long arcane incantation a bit more: readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}\n" "${url}"; echo "${PIPESTATUS[*]}");

Thanks to this answer on StackOverflow for ${PIPESTATUS}. Now, we have array called result with 3 elements in it: Index Value 0 The HTTP response code 1 The time taken in seconds 2 The exit code of curl With this information, we can now detect errors and abort continuing if we detect one. We know there was an error if any of the following occur: • curl returned a non-zero exit code • The HTTP response code isn't 2XX or 3XX Let's implement that in bash: if [[ "${result[2]}" -ne 0 ]] || [[ "${result[0]}" -lt "200" ]] || [[ "${result[0]}" -gt "399" ]]; then
return
fi

Again, let's break it down:

• [[ "${result[2]}" -ne 0 ]] - Detect a non-zero exit code from curl • [[ "${result[0]}" -lt "200" ]] - Detect if the HTTP response code is less than 200
• [[ "${result[0]}" -gt "399" ]] - Detect if the HTTP response code is greater than 399 In the future, we probably want to output a notification here of some sort instead of just simply silently returning, but for now it's fine. Finally, we can now output the result in the right format for collectd to consume. Collectd operates on identifiers, values, and intervals. A bit of head-scratching and documentation reading later, and I determined the correct identifier format for the task. I wanted to have all the readings on the same graph so I could compare the different response times (just like the ping plugin does), so we want something like this: bobsrockets.com/http_services/response_time-TARGET_NAME ....where we replace bobsrockets.com with ${COLLECTD_HOSTNAME}, and TARGET_NAME with the name of the target we're measuring (${target_name} from above). We can do this like so: echo "PUTVAL \"${COLLECTD_HOSTNAME}/http_services/response_time-${target_name}\" interval=${COLLECTD_I
NTERVAL} N:${result[1]}"; Here's an example of it in action: PUTVAL "/http_services/response_time-git" interval=300.000 N:0.118283 PUTVAL "/http_services/response_time-main_website" interval=300.000 N:0.112073 It does seem to run through the items in the array in a rather strange order, but so long as it does iterate the whole lot, I don't really care. I'll include the full script at the bottom of this post, so all that's left to do is to point collectd at our new script like this in /etc/collectd.conf: LoadPlugin exec # ..... <Plugin exec> Exec "nobody:nogroup" "/etc/collectd/http_response_times.sh" "measure" </Plugin> I've added measure as an argument there for future-proofing, as it looks like we may have to run a separate instance of the script for sending notifications if I understand the documentation correctly (I need to do some research.....). Very cool. It's taken a few clever tricks, but we've managed to write an efficient script for measuring http response times. We've made it more efficient by exploiting read timeouts and other such things. While we won't gain a huge amount of speed from this (bash is pretty lightweight already - this script is weighing in at just ~3.64MiB of private RAM O.o), it will all add up over time - especially considering how often this will be running. In the future, I'll definitely want to take a look at implementing some alerts to notify me if a service is down - but that will be a separate post, as this one is getting quite long :P Found this interesting? Got another way of doing this? Curious about something? Comment below! ### Full Script #!/usr/bin/env bash set -o pipefail; # Variables: # COLLECTD_INTERVAL Interval at which to collect data # COLLECTD_HOSTNAME The hostname of the local machine declare -A targets=( ["main_website"]="https://starbeamrainbowlabs.com/" ["webmail"]="https://mail.starbeamrainbowlabs.com/" ["git"]="https://git.starbeamrainbowlabs.com/" ["nextcloud"]="https://nextcloud.starbeamrainbowlabs.com/" ) # These are only done once, so external commands are ok version="0.1+$(date +%Y%m%d -r $(readlink -f "${0}"))";

user_agent="HttpResponseTimeMeasurer/${version} (Collectd Exec Plugin;$(uname -sm)) bash/${BASH_VERSION} curl/$(curl --version | head -n1 | cut -f2 -d' ')";

# echo "${user_agent}" ############################################################################### # Pure-bash alternative to sleep. # Source: https://blog.dhampir.no/content/sleeping-without-a-subprocess-in-bash-and-how-to-sleep-forever snore() { local IFS; [[ -n "${_snore_fd:-}" ]] || exec {_snore_fd}<> <(:);
read ${1:+-t "$1"} -u $_snore_fd || :; } # Source: https://github.com/dylanaraps/pure-bash-bible#split-a-string-on-a-delimiter split() { # Usage: split "string" "delimiter" IFS=$'\n' read -d "" -ra arr <<< "${1//$2/$'\n'}" printf '%s\n' "${arr[@]}"
}

# Source: https://github.com/dylanaraps/pure-bash-bible#get-the-number-of-lines-in-a-file
# Altered to operate on the standard input.
count_lines() {
# Usage: lines <"file"
mapfile -tn 0 lines
printf '%s\n' "${#lines[@]}" } ############################################################################### #$1 - target name
# $2 - url check_target() { local target_name="${1}"
local url="${2}"; readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}\n" "${url}"; echo "${PIPESTATUS[*]}");

# 0 - http response code
# 1 - time taken
# 2 - curl exit code

# Make sure the exit code is non-zero - this includes if curl hits a timeout error
# Also ensure that the HTTP response code is valid - any 2xx or 3xx response code is ok
if [[ "${result[2]}" -ne 0 ]] || [[ "${result[0]}" -lt "200" ]] || [[ "${result[0]}" -gt "399" ]]; then return fi echo "PUTVAL \"${COLLECTD_HOSTNAME}/http_services/response_time-${target_name}\" interval=${COLLECTD_INTERVAL} N:${result[1]}"; } while :; do for target in "${!targets[@]}"; do
# NOTE: We don't use concurrency here because that spawns additional subprocesses, which we want to try & avoid. Even though it looks slower, it's actually more efficient (and we don't potentially skew the results by measuring multiple things at once)
check_target "${target}" "${targets[${target}]}" done snore "${COLLECTD_INTERVAL}";
done

## TCP (Client) Networking in Pure Bash

Recently I re-remembered about /dev/tcp - a virtual bash file system that allows you to directly connect to a remote TCP endpoint - without the use of nc / netcat / ncat.

While it only allows you to connect (no listening, sadly), it's still a great bash built-in that helps avoid awkward platform-specific issues.

Here's how you'd listen for a connection with netcat, sending as many random numbers as possible to the poor unsuspecting client:

netcat -l 0.0.0.0 6666 </dev/urandom

Here's how you'd traditionally connect to that via netcat:

netcat X.Y.Z.W 6666 | pv >/dev/null

The pv command there is not installed by default, but is a great tool that shows the amount of data flowing through a pipe. It's available in the repositories for most Linux distributions without any special configuration required - so sudo apt install pv should be enough on Debian-based distributions.

Now, let's look at how we'd do this with pure bash:

pv >/dev/null </dev/tcp/X.Y.Z.W/6666

Very neat! We've been able to eliminate an extra child process. The question is though: how do they compare performance-wise? Well, that depends on how we measure it. In my case, I measured a single connection, downloading data as fast as it can for 60 seconds.

Another test would be to open many connections and download lots of small files. While I haven't done that here, I theorise that the pure-bash method would win out, as it doesn't have to spawn lots of subprocesses.

In my test, I did this:

# Traditional method
timeout 60 nc X.Y.Z.W 6666 | pv >/dev/null
# Pure-Bash method
timeout 60 pv >/dev/null </dev/tcp/X.Y.Z.W/6666

The timeout command kills the process after a given number of seconds. The server I connected to was just this:

while true; do nc -l 0.0.0.0 6666 </dev/urandom; done

Running the above test, I got the following output:

$timeout 60 pv >/dev/null </dev/tcp/172.16.230.58/6666 652MiB 0:00:59 [11.2MiB/s] [ <=> ]$ timeout 60 nc 172.16.230.58 6666 | pv >/dev/null
599MiB 0:01:00 [11.1MiB/s] [                                     <=>          ]
Method Total Data Transferred
Pure Bash 652MiB

As it turns out, the pure bash method is apparently faster - by ~8.8%. I think this might have something to do with the lack of the additional sub-process, or some other optimisation that bash can apply when doing the TCP networking itself.

Found this interesting? Got a cool use for it? Discovered another awesome bash built-in? Comment below!

## Animated PNG for all!

I recently discovered that Animated PNGs are now supported by most major browsers:

I stumbled across the concept of an animated PNG a number of years ago (on caniuse.com actually if I remember right!), but at the time browser support was very bad (read: non-existent :P) - so I moved on to other things.

I ended up re-discovering it a few weeks ago (also through caniuse.com!), and since browser support is so much better now I decided that I just had to play around with it.

It hasn't disappointed. Traditional animated GIFs (Graphics Interchange Format for the curious) are limited to 256 colours, have limited transparency support (a pixel is either transparent, or it isn't - there's no translucency support), and don't compress terribly well.

Animated PNGs (Portable Network Graphics), on the other hand, let you enjoy all the features of a regular PNG (as many colours as you want, translucent pixels, etc.) while also supporting animation, and compressing better as well! Best of all, if a browser doesn't support the animated PNG standard, they will just see and render the first frame as a regular PNG and silently ignore the rest.

Let's take it out for a spin. For my test, I took an image and created a 'panning' animation from one side of it to the other. Here's the image I've used:

(Credit: The background on the download page for Mozilla's Firefox Nightly builds. It isn't available on the original source website anymore (and I've lost the link), but can still be found on various wallpaper websites.)

The first task is to generate the frames from the original image. I wrote a quick shell script for this purpose. Firstly, I defined a bunch of variables:

#!/usr/bin/env bash
set -e; # Crash if we hit an error

input_file="Input.png";
output_file="Output.apng"

# The maximum number of frames
max_frame=32;
end_x=960; end_y=450;
start_x=0; start_y=450;

crop_size_x=960; crop_size_y=540;

mkdir -p ./frames;
Variable Meaning
input_file The input file to generate frames from
output_file The file to write the animated png to
max_frame The number of frames (plus 1) to generate
start_x The starting position to pan from on the x axis
start_y The starting position to pan from on the y axis
end_x The ending position to pan to on the x axis
end_y The ending position to pan to on the y axis
crop_size_x The width of the cropped frames
crop_size_y The height of the cropped frames

It's worth noting here that it's probably a good idea to implement a proper CLI to this script at this point, but since it's currently only a one-off script I didn't bother. Perhaps in the future I'll come back and improve it if I find myself needing it again for some other purpose.

With the parameters set up (and a temporary directory created - note that you should use mktemp -d instead of the approach I've taken here!), we can then use a for loop to repeatedly call ImageMagick to crop the input image to generate our frames. This won't run in parallel unfortunately, but since it's only a few frames it shouldn't take too long to render. This is only a quick shell script after all :P


for ((frame=0; frame <= "${max_frame}"; frame++)); do this_x="$(calc -p "${start_x}+((${end_x}-${start_x})*(${frame}/${max_frame}))")"; this_y="$(calc -p "${start_y}+((${end_y}-${start_y})*(${frame}/${max_frame}))")"; percent="$(calc -p "round((${frame}/${max_frame})*100, 2)")";

convert "${input_file}" -crop "${crop_size_x}x${crop_size_y}+${this_x}+${this_y}" "frames/Frame-$(printf "%02d" "${frame}").jpeg"; echo -ne "${frame} / ${max_frame} (${percent}%)          \r";
done
echo ""; # Move to the line after the progress indicator

This looks complicated, but it really isn't. The for loop iterates over each of the frame numbers. We do some maths to calculate the (x, y) co-ordinates of the frame we want to extract, and then get ImageMagick's convert command to do the cropping for us. Finally we write a quick progress indicator out to stdout (\r resets the cursor to the beginning of the line, but doesn't go down to the next one).

The maths there is probably better represented like this:

$this_x=start_x+((end_x-start_x)*\frac{currentframe}{max_frame})$

$this_y=start_y+((end_y-start_y)*\frac{currentframe}{max_frame})$

Much better :-) In short, we convert the current frame number into a percentage (between 0 and 1) of how far through the animation we are and use that as a multiplier to find the distance between the starting and ending points.

I use the calc command-line tool (in the apcalc package on Ubuntu) here to do the calculations, because the bash built-in result=$(((4 + 5))) only supports integer-based maths, and I wanted floating-point support here. With the frames generated, we only need to stitch them together to make the final animated png. Unfortunately, an easy-to-use tool does not yet exist (like gifsicle for GIFs) - so we'll have to make-do with ffpmeg (which, while very powerful, has a rather confusing CLI!). Here's how I stitched the frames together: ffmpeg -r 10 -i frames/Frame-%02d.jpeg -plays 0 "${output_file}";
• -r - The frame rate
• -i - The input filename
• -plays - The number of times to loop (0 = infinite; defaults to no looping if omitted)
• "${output_file}" - The output file Here's the final result: (Filesize: ~2.98MiB) Of course, it'd be cool to compare it to a traditional animated gif. Let's do that too! First, we'll need to convert the frames to gif (gifsicle, our tool of choice, doesn't support anything other than GIFs as far as I'm aware): mogrify -format gif frames/*.jpeg Easy-peasy! mogrify is another tool from ImageMagick that makes such in-place conversions trivial. Note that the frames themselves are stored as JPEGs because I was experiencing an issue whereby each of the frames apparently had a slightly different colour palette, and ffmpeg wasn't smart enough to correct for this - choosing instead to crash. With the frames converted, we can make our animated GIF like so: gifsicle --optimize --colors=256 --loopcount=10 --delay=10 frames/*.gif >Output.gif; Lastly, we probably want to delete the intermediate frames: rm -r ./frames Here's the animated GIF version: (Filesize: ~3.28MiB) Woah! That's much bigger. (Generated (and then extracted & edited with the Firefox developer tools) from Meta-Chart) It's ~9.7% bigger in fact! Though not a crazy amount, smaller resulting files are always good. I suspect that this effect will stack the more frames you have. Others have tested this too, finding pretty similar results to those that I've found here - though it does of course depend on your specific scenario. With that observation, I'll end this blog post. The next time you think about inserting an animation into a web page or chat window, consider making it an Animated PNG. Found this interesting? Found a cool CLI tool for manipulating APNGs? Having trouble? Comment below! ### Sources and Further Reading ## Bridging the gap between XMPP and shell scripts In a previous post, I set up a semi-automated backup system for my Raspberry Pi using duplicity, sendxmpp, and an external drive. It's been working fabulously for a while now, but unfortunately the other week sendxmpp suddenly stopped working with no obvious explanation. Given the long list of arguments I had to pass it: sendxmpp --file "${xmpp_config_file}" --resource "${xmpp_resource}" --tls --chatroom "${xmpp_target_chatroom}" ...........

....and the fact that I've had to tweak said arguments on a number of occasions, I thought it was time to switch it out for something better suited to the task at hand.

Unfortunately, finding such a tool proved to be a challenge. I even asked on Reddit - but nobody had anything that fit the bill (xmpp-bridge wouldn't compile correctly - and didn't support multi-user chatrooms anyway, and xmpppy was broken too).

If you're unsure as to what XMPP is, I'd recommend checkout out either this or this tutorial. They both give a great introduction to what it is, what it does, and how it works - and the rest of this post will make much more sense if you read that first :-)

To this end, I finally gave in and wrote my own tool, which I've called xmppbridge. It's a global Node.JS script that uses the simple-xmpp to forward the standard input to a given JID over XMPP - which can optionally be a group chat.

In this post, I'm going to look at how I put it together, some of the issues I ran into along the way, and how I solved them. If you're interested in how to install and use it, then the package page on npm will tell you everything you need to know:

xmppbridge on npm

### Architectural Overview

The script consists of 3 files:

• index.sh - Calls the main script with ES6 modules enabled
• index.mjs - Parses the command-line arguments and environment variables out, and provides a nice CLI
• XmppBridge.mjs - The bit that actually captures input from stdin and sends it via XMPP

Let's look at each of these in turn - starting with the command-line interface.

### CLI Parsing

The CLI itself is relatively simple - and follows a paradigm I've used extensively in C♯ (although somewhat modified of course to get it to work in Node.JS, and without fancy ANSI colouring etc.).

#!/usr/bin/env node
"use strict";

import XmppBridge from './XmppBridge.mjs';

const settings = {
jid: process.env.XMPP_JID,
destination_jid: null,
is_destination_groupchat: false,
};

let extras = [];
// The first arg is the script name itself
for(let i = 1; i < process.argv.length; i++) {
if(!process.argv[i].startsWith("-")) {
extras.push(process.argv[i]);
continue;
}

switch(process.argv[i]) {
case "-h":
case "--help":
// ........
break;

// ........

default:
console.error(Error: Unknown argument '${process.argv[i]}'.); process.exit(2); break; } } We start with a shebang, telling Linux-based systems to execute the script with Node.JS. Following that, we import the XmppBridge class that's located in XmppBrdige.mjs (we'll come back to this later). Then, we define an object to hold our settings - and pull in the environment variables along with defining some defaults for other parameters. With that setup, we can then parse the command-line arguments themselves - using the exact same paradigm I've used time and time again in C♯. Once the command-line arguments are parsed, we validate the final settings to ensure that the user hasn't left any required parameters undefined: for(let environment_varable of ["XMPP_JID", "XMPP_PASSWORD"]) { if(typeof process.env[environment_varable] == "undefined") { console.error(Error: The environment variable${environment_varable} wasn't found.);
process.exit(1);
}
}

if(typeof settings.destination_jid != "string") {
console.error("Error: No destination jid specified.");
process.exit(5);
}

That's basically all that index.mjs does. All that's really left is passing the parameters to an instance of XmppBridge:

const bridge = new XmppBridge(
settings.destination_jid,
settings.is_destination_groupchat
);
bridge.start(settings.jid, settings.password);

### Shebang Trouble

Because I've used ES6 modules here, currently Node must be informed of this via the --experimental-modules CLI argument like this:

node --experimental-modules ./index.mjs

If we're going to make this a global command-line tool via the bin directive in package.json, then we're going to have to ensure that this flag gets passed to Node and not our program. While we could alter the shebang, that comes with the awkward problem that not all systems (in fact relatively few) support using both env and passing arguments. For example, this:

#!/usr/bin/env node --experimental-modules

Wouldn't work, because env doesn't recognise that --experimental-modules is actually a command-line argument and not part of the binary name that it should search for. I did see some Linux systems support env -S to enable this functionality, but it's hardly portable and doesn't even appear to work all the time anyway - so we'll have to look for another solution.

Another way we could do it is by dropping the env entirely. We could do this:

#!/usr/local/bin/node --experimental-modules

...which would work fine on my system, but probably not on anyone else's if they haven't installed Node to the same place. Sadly, we'll have to throw this option out the window too. We've still got some tricks up our sleeve though - namely writing a bash wrapper script that will call node telling it to execute index.mjs with the correct arguments. After a little bit of fiddling, I came up with this:

#!/usr/bin/env bash
install_dir="$(dirname "$(readlink -f $0)")"; exec node --experimental-modules "${install_dir}/index.mjs" $@ 2 things are at play here. Firstly, we have to deduce where the currently executing script actually lies - as npm uses a symbolic link to allow a global command-line tool to be 'found'. Said symbolic link gets put in /usr/local/bin/ (which is, by default, in the user's PATH), and links to where the script is actually installed to. To figure out the directory that we've been installed to is (and hence the location of index.mjs), we need to dereference the symbolic link and strip the index.sh filename away. This can be done with a combination of readlink -f (dereferences the symbolic link), dirname (get the parent directory of a given file path), and $0 (holds the path to the currently executing script in most circumstances) - which, in the case of the above, gets put into the install_dir variable.

The other issue is passing all the existing command-line arguments to index.mjs unchanged. We do this with a combination of $@ (which refers to all the arguments passed to this script except the script name itself) and exec (which replaces the currently executing process with a new one - in this case it replaces the bash shell with node). This approach let's us customise the CLI arguments, while still providing global access to our script. Here's an extract from xmppbridge's package.json showing how I specify that I want index.sh to be a global script: { ..... "bin": { "xmppbridge": "./index.sh" }, ..... } ### Bridging the Gap Now that we've got Node calling our script correctly and the arguments parsed out, we can actually bridge the gap. This is as simple as some glue code between simple-xmpp and readline. simple-xmpp is an npm package that makes programmatic XMPP interaction fairly trivial (though I did have to look at examples in the GitHub repository to figure out how to send a message to a multi-user chatroom). readline is a Node built-in that allows us to read the standard input line-by-line. It does other things too (and is great for interactive scripts amongst other things), but that's a tale for another time. The first task is to create a new class for this to live in: "use strict"; import readline from 'readline'; import xmpp from 'simple-xmpp'; class XmppBridge { /** * Creates a new XmppBridge instance. * @param {string} in_login_jid The JID to login with. * @param {string} in_destination_jid The JID to send stdin to. * @param {Boolean} in_is_groupchat Whether the destination JID is a group chat or not. */ constructor(in_destination_jid, in_is_groupchat) { // .... } } export default XmppBridge; Very cool! That was easy. Next, we need to store those arguments and connect to the XMPP server in the constructor: this.destination_jid = in_destination_jid; this.is_destination_groupchat = in_is_groupchat; this.client = xmpp; this.client.on("online", this.on_connect.bind(this)); this.client.on("error", this.on_error.bind(this)); this.client.on("chat", ((_from, _message) => { // noop }).bind(this)); I ended up having to define a chat event handler - even though it's pointless, as I ran into a nasty crash if I didn't do so (I suspect that this use-case wasn't considered by the original package developer). The next area of interest is that online event handler. Note that I've bound the method to the current this context - this is important, as it would be able to access the class instance's properties otherwise. Let's take a look at the code for that handler: console.log([XmppBridge] Connected as${data.jid}.);
if(this.is_destination_groupchat) {
this.client.join(${this.destination_jid}/bot_${data.jid.user});
}
input: process.stdin,
output: process.stdout,
terminal: false
});
this.stdin.on("line", this.on_line_handler.bind(this));
this.stdin.on("close", this.on_stdin_close_handler.bind(this));

This is the point at which we open the standard input and start listening for things to send. We don't do it earlier, as we don't want to end up in a situation where we try sending something before we're connected!

If we're supposed to be sending to a multi-user chatroom, this is also the point at which it joins said room. This is required as you can't send a message to a room that you haven't joined.

The resource (the bit after the forward slash /), for a group chat, specifies the nickname that you want to give to yourself when joining. Here, I automatically set this to the user part of the JID that we used to login prefixed with bot_.

The connection itself is established in the start method:

start(jid, password) {
this.client.connect({
jid,
});
}

And every time we receive a line of input, we execute the send() method:

on_line_handler(line_text) {
this.send(line_text);
}

I used a full method here, as initially I had some issues and wanted to debug which methods were being called. That send method looks like this:

send(message) {
this.client.send(
this.destination_jid,
message,
this.is_destination_groupchat
);
}

The last event handler worth mentioning is the close event handler on the readline interface:

on_stdin_close_handler() {
this.client.disconnect();
}

This just disconnects from the XMXPP server so that Node can exit cleanly.

That basically completes the script. In total, the entire XmppBridge.mjs class file is 72 lines. Not bad going!

You can install this tool for yourself with sudo npm install -g xmppbridge. I've documented how it use it in the README, so I'd recommend heading over there if you're interested in trying it out.

Found this interesting? Got a cool use for XMPP? Comment below!

## Backing up to AWS S3 with duplicity

The server that this website runs on backs up automatically to the Simple Storage Service, provided by Amazon Web Services. Such an arrangement is actually fairly cheap - only ~20p/month! I realised recently that although I've blogged about duplicity before (where I discussed using an external hard drive), I never covered how I fully automate the process here on starbeamrainbowlabs.com.

(Above: A bunch of hard drives. The original can be found here.)

It's fairly similar in structure to the way it works backing up to an external hard drive - just with a few different components here and there, as the script that drives this is actually older than the one that backs up to an external hard drive.

To start, we'll need an AWS S3 bucket. I'm not going to cover how to do this here, as the AWS interface keeps changing, and this guide will likely become outdated quickly. Instead, the AWS S3 documentation has an official guide on how to create one. Make sure it's private, as you don't want anyone getting a hold of your backups!

With that done, you should have both an access key and a secret. Note these down in a file called .backup-password in a new directory that will hold the backup script like this:

#!/usr/bin/env bash
AWS_ACCESS_KEY_ID="INSERT_AWS_ACCESS_KEY_HERE";
AWS_SECRET_ACCESS_KEY="INSERT_AWS_SECRET_KEY_HERE";

The PASSPHRASE here should be a long and unintelligible string of random characters, and will encrypt your backups. Note that down somewhere safe too - preferably in your password manager or somewhere else at least as secure.

If you're on Linux, you should also set the permissions on the .backup-password file to ensure nobody gets access to it who shouldn't. Here's how I did it:

sudo chown root:root .backup-password
sudo chmod 0400 .backup-password

This ensures that only the root user is able to read the file - and nobody can write to it. With our secrets generated and safely stored, we can start writing the backup script itself. Let's start by reading in the secrets:

#!/usr/bin/env bash
source /root/.backup-password

I stored my .backup-password file in /root. Next, let's export these values. This enables the subprocesses we invoke to access these environment variables:

export PASSPHRASE;
export AWS_ACCESS_KEY_ID;
export AWS_SECRET_ACCESS_KEY;

Now it's time to do the backup itself! Here's what I do:

duplicity \
--full-if-older-than 2M \
--exclude /proc \
--exclude /sys \
--exclude /tmp \
--exclude /dev \
--exclude /mnt \
--exclude /var/cache \
--exclude /var/tmp \
--exclude /var/backups \
--exclude /srv/www-mail/rainloop/v \
--s3-use-new-style --s3-european-buckets --s3-use-ia \
/ s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE

Compressed version:

duplicity --full-if-older-than 2M --exclude /proc --exclude /sys --exclude /tmp --exclude /dev --exclude /mnt --exclude /var/cache --exclude /var/tmp --exclude /var/backups --exclude /srv/www-mail/rainloop/v --s3-use-new-style --s3-european-buckets --s3-use-ia / s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE

This might look long and complicated, but it's mainly due to the large number of directories that I'm excluding from the backup. The key options here are --full-if-older-than 2M and --s3-use-ia, which specify I want a full backup to be done every 2 months and to use the infrequent access pricing tier to reduce costs.

The other important bit here is to replace INSERT_BUCKET_NAME_HERE with the name of the S3 bucket that you created.

Backing is all very well, but we want to remove old backups too - in order to avoid ridiculous bills (AWS are terrible for this - there's no way that you can set a hard spending limit! O.o). That's fairly easy to do:

duplicity remove-older-than 4M \
--force \
--s3-use-new-style --s3-european-buckets --s3-use-ia \
s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE

Again, don't forget to replace INSERT_BUCKET_NAME_HERE with the name of your S3 bucket. Here, I specify I want all backups older than 4 months (the 4M bit) to be deleted.

It's worth noting here that it may not actually be able to remove backups older than 4 months here, as it can only delete a full backup if there are not incremental backups that depend on it. To this end, you'll need to plan for potentially storing (and being charged for) an extra backup cycle's worth of data. In my case, that's an extra 2 months worth of data.

That's the backup part of the script complete. If you want, you could finish up here and have a fully-working backup script. Personally, I want to know how much data is in my S3 bucket - so that I can get an idea as to how much I'll be charged when the bill comes through - and also so that I can see if anything's going wrong.

Unfortunately, this is a bit fiddly. Basically, we have to utilise the AWS command-line interface to recursively list the entire contents of our S3 bucket in summarising mode in order to get it to tell us what we want to know. Here's how to do that:

aws s3 ls s3://INSERT_BUCKET_BAME_HERE --recursive --human-readable --summarize

Don't forget to replace INSERT_BUCKET_BAME_HERE wiith your bucket's name. The output from this is somewhat verbose, so I ended up writing an awk script to process it and output something nicer. Said awk script looks like this:

/^\s*Total\s+Objects/ { parts[i++] = $3 } /^\s*Total\s+Size/ { parts[i++] =$3; parts[i++] = $4; } END { print( "AWS S3 Bucket Status:", parts[0], "objects, totalling " parts[1], parts[2] ); } If we put all that together, it should look something like this: aws s3 ls s3://INSERT_BUCKET_BAME_HERE --recursive --human-readable --summarize | awk '/^\s*Total\s+Objects/ { parts[i++] =$3 } /^\s*Total\s+Size/ { parts[i++] = $3; parts[i++] =$4; } END { print("AWS S3 Bucket Status:", parts[0], "objects, totalling " parts[1], parts[2]); }'

...it's a bit of a mess. Perhaps I should look at putting that awk script in a separate file :P Anyway, here's some example output:

AWS S3 Bucket Status: 602 objects, totalling 21.0 GiB Very nice indeed. To finish off, I'd rather like to know how long it took to do all this. Thankfully, bash has an inbuilt automatic variable that holds the number of seconds since the current process has started, so it's just a case of parsing this out into something readable:

echo "Done in $(($SECONDS / 3600))h $((($SECONDS / 60) % 60))m $(($SECONDS % 60))s.";

...I forget which Stackoverflow answer it was that showed this off, but if you know - please comment below and I'll update this to add credit. This should output something like this:

Done in 0h 12m 51s.

Awesome! We've now got a script that backs up to AWS S3, deletes old backups, and tells us both how much space on S3 is being used and how long the whole process took.

I'm including the entire script at the bottom of this post. I've changed it slightly to add a single variable for the bucket name - so there's only 1 place on line 9 (highlighted) you need to update there.

(Above: A Geopattern, tiled using the GNU Image Manipulation Program)


#!/usr/bin/env bash

# Make sure duplicity exists
test -x $(which duplicity) || exit 1; # Pull in the password . /root/.backup-password AWS_S3_BUCKET_NAME="INSERT_BUCKET_NAME_HERE"; # Allow duplicity to access it export PASSPHRASE; export AWS_ACCESS_KEY_ID; export AWS_SECRET_ACCESS_KEY; # Actually do the backup # Backup strategy: # 1 x backup per week: # 1 x full backup per 2 months # incremental backups in between # S3 Bucket URI: https://${AWS_S3_BUCKET_NAME}/
echo [ $(date +%F%r) ] Performing backup. duplicity --full-if-older-than 2M --exclude /proc --exclude /sys --exclude /tmp --exclude /dev --exclude /mnt --exclude /var/cache --exclude /var/tmp --exclude /var/backups --exclude /srv/www-mail/rainloop/v --s3-use-new-style --s3-european-buckets --s3-use-ia / s3://s3-eu-west-1.amazonaws.com/${AWS_S3_BUCKET_NAME}

# Remove old backups
# You have to plan for 1 extra full backup cycle when
# calculating space requirements - duplicity only
# removes a backup if it won't invalidate those further
# along the chain - the oldest backup will always be
# a full one.
echo [ $(date +%F%r) ] Backup complete. Removing old volumes. duplicity remove-older-than 4M --force --encrypt-key F2A6D8B6 --s3-use-new-style --s3-european-buckets --s3-use-ia s3://s3-eu-west-1.amazonaws.com/${AWS_S3_BUCKET_NAME}
echo [ $(date +%F%r) ] Cleanup complete. aws s3 ls s3://${AWS_S3_BUCKET_NAME} --recursive --human-readable --summarize | awk '/^\s*Total\s+Objects/ { parts[i++] = $3 } /^\s*Total\s+Size/ { parts[i++] =$3; parts[i++] = $4; } END { print("AWS S3 Bucket Status:", parts[0], "objects, totalling " parts[1], parts[2]); }' echo "Done in$(($SECONDS / 3600))h$((($SECONDS / 60) % 60))m$(($SECONDS % 60))s.";  ## Markov Chains Part 4: Test Data With a shiny-new markov chain engine (see parts 1, 2, and 3), I found that I had a distinct lack of test data to put through it. Obviously this was no good at all, so I decided to do something about it. Initially, I started with a list of HTML colours (direct link; 8.6KiB), but that didn't produce very good output: MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --length 16 errobiartrawbear frelecteringupsy lectrictomadolbo vendellorazanigh arvanginklectrit dighoonbottlaven onadeestersweese ndiu llighoolequorain indeesteadesomiu I see a few problems here. Firstly, it's treating each word as it's entity, where in fact I'd like it to generate n-grams on a line-by-line basis. Thankfully, this is easy enough with my new --no-split option: MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --no-split --length 16 med carrylight b jungin pe red dr ureelloufts blue uamoky bluellemo trinaterry aupph utatellon reep g radolitter brast bian reep mardar ght burnse greep atimson-phloungu Hrm, that's still rather unreadable. What if we make the n-grams longer by bumping the order? MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --length 16 --order 4 on fuchsia blue rsity of carmili e blossom per sp ngel ulean red au lav as green yellowe indigri ly gray aspe disco blus berry pine blach Better, but it looks like it's starting the generation process from inside the middle of words. We can fix that with my new --start-uppercase option, which ensures that each output always stars with an n-gram that begins with a capital letter. Unfortunately the wordlist is all lowercase: air force blue alice blue alizarin crimson almond amaranth amber american rose amethyst android green anti-flash white This is an issue. The other problem is that with an order of 4 the choice-point ratio is dropping quite low - I got a low of just ~0.97 in my testing. The choice-point ratio is a measure I came up with of the average number of different directions the engine could potential go in at each step of the generation process. I'd like to keep this number consistently higher than 2, at least - to ensure a good variety of output. ### Greener Pastures Rather than try fix that wordlist, let's go in search of something better. It looks like the CrossCode Wiki has a page that lists all the items in the entire game. That should do the trick! The only problem is extracting it though. Let's use a bit of bash! We can use curl to download the HTML of the page, and then xidel to parse out the text from the <a> tags inside tables. Here's what I came up with: curl https://crosscode.gamepedia.com/Items | xidel --data - --css "table a" This is a great start, but we've got blank lines in there, and the list isn't sorted alphabetically (not required, but makes it look nice :P). Let's fix that: curl https://crosscode.gamepedia.com/Items | xidel --data - --css "table a" | awk "NF > 0" | sort Very cool. Tacking wc -l on the end of the pipe chain I can we've got ourselves a list of 527(!) items! Here's a selection of input lines: Rough Branch Raw Meat Shady Monocle Blue Grass Tengu Mask Crystal Plate Humming Razor Everlasting Amber Tracker Chip Lawkeeper's Fist Let's run it through the engine. After a bit of tweaking, I came up with this: cat wordlists/Cross-Code-Items.txt | MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --start-uppercase --no-split --length 16 --order 3 Capt Keboossauci Fajiz Keblathfin King Steaf Sharp Stintze Geakralt Fruisty 'olipe F Apper's TN Prow Peptumn's C Rus Recreetan Co Veggiel Spiragma Laver's Bolden M That's quite interesting! With a choice-point ratio of ~5.6 at an order of 3, we've got a nice variable output. If we increase the order to 4, we get ~1.5 - ~2.3: Edgy Hoo Junk Petal Goggl Red Metal Need C Samurai Shel Echor Krystal Wated Li Sweet Residu Raw Stomper Thor Purple Fruit Dev Smokawa It appears to be cutting off at the end though. Not sure what we can do about that (ideas welcome!). This looks interesting, but I'm not done yet. I'd like it to work on word-level too! ### Going up a level After making some pretty extensive changes, I managed to add support for this. Firstly, I needed to add support for word-level n-gram generation. Currently, I've done this with a new GenerationMode enum. public enum GenerationMode { CharacterLevel, WordLevel } Under the hood I've just used a few if statements. Fortunately, in the case of the weighted generator, only the bottom method needed adjusting: /// <summary> /// Generates a dictionary of weighted n-grams from the specified string. /// </summary> /// <param name="str">The string to n-gram-ise.</param> /// <param name="order">The order of n-grams to generate.</param> /// <returns>The weighted dictionary of ngrams.</returns> private static void GenerateWeighted(string str, int order, GenerationMode mode, ref Dictionary<string, int> results) { if (mode == GenerationMode.CharacterLevel) { for (int i = 0; i < str.Length - order; i++) { string ngram = str.Substring(i, order); if (!results.ContainsKey(ngram)) results[ngram] = 0; results[ngram]++; } } else { string[] parts = str.Split(" ".ToCharArray()); for (int i = 0; i < parts.Length - order; i++) { string ngram = string.Join(" ", parts.Skip(i).Take(order)).Trim(); if (ngram.Trim().Length == 0) continue; if (!results.ContainsKey(ngram)) results[ngram] = 0; results[ngram]++; } } } Full code available here. After that, the core generation algorithm was next. The biggest change - apart from adding a setting for the GenerationMode enum - was the main while loop. This was a case of updating the condition to count the number of words instead of the number of characters in word mode: (Mode == GenerationMode.CharacterLevel ? result.Length : result.CountCharInstances(" ".ToCharArray()) + 1) < length A simple ternary if statement did the trick. I ended up tweaking it a bit to optimise it - the above is the end result (full code available here). Instead of counting the words, I count the number fo spaces instead and add 1. That CountCharInstances() method there is an extension method I wrote to simplify things. Here it is: public static int CountCharInstances(this string str, char[] targets) { int result = 0; for (int i = 0; i < str.Length; i++) { for (int t = 0; t < targets.Length; t++) if (str[i] == targets[t]) result++; } return result; } ### Recursive issues After making these changes, I needed some (more!) test data. Inspiration struck: I could run it recipe names! They've quite often got more than 1 word, but not too many. Searching for such a list proved to be a challenge though. My first thought was BBC Food, but their terms of service disallow scraping :-( A couple of different websites later, and I found the Recipes Wikia. Thousands of recipes, just ready and waiting! Time to get to work scraping them. My first stop was, naturally, the sitemap (though how I found in the first place I really can't remember :P). What I was greeted with, however, was a bit of a shock: <?xml version="1.0" encoding="UTF-8"?> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p1.xml</loc></sitemap> <sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p2.xml</loc></sitemap> <sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p3.xml</loc></sitemap> <sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p4.xml</loc></sitemap> <sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p5.xml</loc></sitemap> <sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p6.xml</loc></sitemap> <sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p7.xml</loc></sitemap> <sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p8.xml</loc></sitemap> <sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p9.xml</loc></sitemap> <sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_14-p1.xml</loc></sitemap> <sitemap><loc>https://services.wikia.com/discussions-sitemap/sitemap/3355</loc></sitemap> </sitemapindex> <!-- Generation time: 26ms --> <!-- Generation date: 2018-10-25T10:14:26Z --> Like who has a sitemap of sitemaps, anyways?! We better do something about this: Time for some more bash! Let's start by pulling out those sitemaps. curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" Easy peasy! Next up, we don't want that bottom one - as it appears to have a bunch of discussion pages and other junk in it. Let's strip it out before we even download it! curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 With a list of sitemaps extract from the sitemap (completely coconuts I tell you) extracted, we need to download them all in turn and extract the page urls therein. This is, unfortunately, where it starts to get nasty. While a simple xargs call downloads them all easily enough (| xargs -n1 -I{} curl "{}" should do the trick), this outputs them all to stdout, and makes it very difficult for us to parse them. I'd like to avoid shuffling things around on the file system if possible, as this introduces further complexity. We're not out of options yet though, as we can pull a subshell out of our proverbial hat: curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 | xargs -n1 -I{} sh -c 'curl {} | xidel --data - --css "loc"' Yay! Now we're getting a list of urls to all the pages on the entire wiki: http://recipes.wikia.com/wiki/Mexican_Black_Bean_Soup http://recipes.wikia.com/wiki/Eggplant_and_Roasted_Garlic_Babakanoosh http://recipes.wikia.com/wiki/Bathingan_bel_Khal_Wel_Thome http://recipes.wikia.com/wiki/Lebanese_Tabbouleh http://recipes.wikia.com/wiki/Lebanese_Hummus_Bi-tahini http://recipes.wikia.com/wiki/Baba_Ghannooj http://recipes.wikia.com/wiki/Lebanese_Falafel http://recipes.wikia.com/wiki/Lebanese_Pita_Bread http://recipes.wikia.com/wiki/Kebab_Koutbane http://recipes.wikia.com/wiki/Moroccan_Yogurt_Dip One problem though: We want recipes names, not urls! Let's do something about that. Our next special guest that inhabits our bottomless hat is the illustrious sed. Armed with the mystical power of find-and-replace, we can make short work of these urls: ... | sed -e 's/^.*\///g' -e 's/_/ /g' The rest of the command is omitted for clarity. Here I've used 2 sed scripts: One to strip everything up to the last forward slash /, and another to replace the underscores _ with spaces. We're almost done, but there are a few annoying hoops left to jump through. Firstly, there are A bunch of unfortunate escape sequences lying around (I actually only discovered this when the engine started spitting out random ones :P). Also, there are far too many page names that contain the word Nutrient, oddly enough. The latter is easy to deal with. A quick grep sorts it out: ... | grep -iv "Nutrient" The former is awkward and annoying. As far as I can tell, there's no command I can call that will decode escape sequences. To this end, I wound up embedding some Python: ... | python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])" This makes the whole thing much more intimidating that it would otherwise be. Lastly, I'd really like to sort the list and save it to a file. Compared to the above, this is chicken feed! | sort >Dishes.txt And there we have it. Bash is very much like lego bricks when you break it down. The trick is to build it up step-by-step until you've got something that does what you want it to :) Here's the complete command: curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 | xargs -n1 -I{} sh -c 'curl {} | xidel --data - --css "loc"' | sed -e 's/^.*\///g' -e 's/_/ /g' | python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])" | grep -iv "Nutrient" | sort >Dishes.txt After all that effort, I think we deserve something for our troubles! With ~42K(!) lines in the resulting file (42,039 to be exact as of the last time I ran the monster above :P), the output (after some tweaking, of course) is pretty sweet: cat wordlists/Dishes.txt | mono --debug MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --words --start-uppercase --length 8 Lemon Lime Ginger Seared Tomatoes and Summer Squash Cabbage and Potato au Endive stuffed with Lamb and Winter Vegetable Stuffed Chicken Breasts in Yogurt Turmeric Sauce with Blossoms on Tomato Morning Shortcake with Whipped Cream Graham Orange and Pineapple Honey Mango Raspberry Bread Tempura with a Southwestern Rice Florentine with Cabbage Slaw with Pecans and Parmesan Pork Sandwiches with Tangy Sweet Barbecue Tea with Lemongrass and Chile Rice Butterscotch Mousse Cake with Fudge Fish and Shrimp - Cucumber Salad with Roast Garlic Avocado Beans in the Slow Apple-Cherry Vinaigrette Salad California Avocado Chinese Chicken Noodle Soup with Arugula ...I really need to do something about that cutting off issue. Other than that, I'm pretty happy with the result! The choice-point ratio is really variable, but most of the time it's hovering around ~2.5-7.5, which is great! The output if I lower the order from 3 to 2 isn't too bad either: Salata me Htapodi kai Hot 'n' Cheese Sandwich with Green Fish and Poisson au Feuilles de Milagros Valentines Day Cookies with Tofu with Walnut Rice Up Party Perfect Turkey Tetrazzini Olives and Strawberry Pie with Iceberg Salad with Mashed Sweet sauce for Your Mood with Dried Zespri Gold Corn rice tofu, and Corn Roasted California Avocado and Rice Casserole with Dilled Shrimp Egyptian Tomato and Red Bell Peppers, Mango Fandango This gives us a staggering average choice-point ratio of ~125! Success :D ### One more level After this, I wanted to push the limits of the engine, so see what it's capable of. The obvious choice here is Shakespeare's Complete Works (~5.85MiB). Pushing this through the engine required some work, as ~30 seconds is far too slow - namely optimising the pipeline as much as possible. The Mono Profiler helped a lot here. With it, I discovered that string.StartsWith() is really slow. Like, ridiculously slow (though this is relative, since I'm calling it hundreds of thousand of times), as it's culture-aware. In our case, we can't be bothering with any of that, as it's not relevant anyway. The easiest solution is to write another extension method: public static bool StartsWithFast(this string str, string target) { if (str.Length < target.Length) return false; return str.Substring(0, target.Length) == target; } string.Substring() is faster, so by utilising this instead of the regular string.StartsWith() yields us a perfectly gigantic boost! Next up, I noticed that I can probably parallelize the Linq query that builds the list of possible n-grams we can choose from next, so that it runs on all the CPU cores: Parallel.ForEach(ngrams, (KeyValuePair<string, double> ngramData) => { if (!ngramData.Key.StartsWithFast(nextStartsWith)) return; if (!convNextNgrams.TryAdd(ngramData.Key, ngramData.Value)) throw new Exception("Error: Failed to add to staging ngram concurrent dictionary"); }); Again, this netted a another huge gain. With this and a few other architectural changes, I was able to chop the time down to a mere ~4 seconds (for a standard 10 words)! In the future, I might experiment with selective unmanaged code via the unsafe keyword to see if I can do any better. For now, it's fast enough to enjoy some random Shakespeare on-demand: What should they since that to that tells me Nero He doth it turn and and too, gentle Ha! you shall not come hither; since that to provoke ANTONY. No further, sir; a so, farewell, Sir Bona, joins with Go with your fingering, From fairies and the are like an ass which is The very life-blood of our blood, no, not with the Alas, why is here-in which Transform'd and weak'ned? Hath Bolingbroke Very interesting. The choice-point ratios sit at ~20 and ~3 for orders 3 and 4 respectively, though I got as high as 188 for an order of 3 during my testing. Definitely plenty of test data here :P ### Conclusion My experiments took me to several other places - which, if I included them all here, would result in an post much much longer than this! I scripted the download of several other wordlists in download.sh (direct link, 4.2KiB), if you're interested, with ready-downloaded copies in the wordlists folder of the repository. I would advise reading the table in the README that gives credit to where I sourced each list, because of course I didn't put any of them together myself - I just wrote the script :P Particularly of note is the Starbound list, which contains a list of all the blocks and items from the game Starbound. I might post about that one separately, as it ended up being a most interesting adventure. In the future, I'd like to look at implementing a linguistic drift algorithm, to try and improve the output of the engine. The guy over at Here Dragons Abound has a great post on the subject, which you should definitely read if you're interested. Found this interesting? Got an idea for another wordlist I can push though my engine? Confused by something? Comment below! ### Sources and Further Reading ## Acorn Validator Edit: Corrected title and a bunch of grammatical mistakes. I typed this out on my phone with Monospace - and I seems that my keyboard (and Phone-Laptop Bluetooth connection!) leave something to be desired..... Over the last week, I've been hard at work on an entry for #LOWREZJAM. While it's not finished yet (submission is on Thursday), I've found some time to write up a quick blog post about a particular aspect of it. Of course, I'll be blogging about the rest of it later once it's finished :D The history of my entry is somewhat.... complicated. Originally, I started work on it a few months back as an independent project, but due to time constraints and other issues I was unable to get very far with it. Once I discovered #LOWREZJAM, I made the decision to throw away the code I had written and start again from (almost) scratch. It is for this reason that I have a Javascript validator script lying around. I discovered when writing it originally that my editor Atom didn't have syntax validation support for Javascript. While there are extensions that do the job, it looked like a complicated business setting one up for just syntax checking (I don't want your code style guideline suggestions! I have my own style!). To this end, I wrote myself a quick bash script to automatically check the syntax of all my javascript files that I can then include as a build step - just before webpack. Over the time I've been working on my #LOWREZJAM entry here, I've been tweaking and improving it - and thought I'd share it here. In the future, I'm considering turning it into a linting provider for the Atom editor I mentioned above (it depends on how complicated that is, and how good the documentation is to help me understand the process). The script (which can be found in full at the bottom of this post), has 2 modes of operation. In the first mode, it acts as a co-ordinator process that starts all the sub-processes that validate the javascript. In the second mode, it validates a single file - outputting the result to the standard output and also logging any errors in a special place that the co-ordinator process can find them later. It decides which mode to operate in based on whether it recieves an argument telling it which file to validate: if [ "$1" != "" ]; then
validate_file "$1" "$2";
exit $?; fi If it detects an argument, then it calls the validate_file function and exits with the returned exit code. If not, then the script continues into co-ordinator mode. In this mode it chains a bunch of commands together like lego bricks to start subprocesses to validate all of the javascript files it can find a fast as possible - acorn, the validator itself, can only check one file as a time it would appear. It does this like so: find . -not -path "./node_modules/" -not -path "./dist/" | grep -iP '\.mjs$' | xargs -n1 -I{} -P32 $0 "{}" "${counter_dirname}";

This looks complicated, but it can be broken down into smaller, easy-to-understand chunks. explainshell.com is rather good at demonstrating this. Particularly of note here is the $0. This variable holds the path to the currently executing script - allowing the co-ordinator to call itself in validator mode. The validator function itself is also quite simple. In short, it runs the validator, storing the result in a variable. It then also saves the exit code for later analysis. Once done, it outputs to the standard output, and then also outputs the validator's output - but only is there was an error to keep things neat and tidy. Finally, if there was an error, it outputs a file to a temporary directory (whose name is determined by the co-ordinator and passed to sub-processes via the 2nd argument) with a name of its PID (the content doesn't matter - I just output 1, but anything will do). This allows the co-ordinator to count the number of errors that the subprocesses encounter, without having to deal with complicated locks arising from updating the value stored in a single file. Here's that in bash: counter_dirname=$2;

# .......

# Use /dev/shm here since apparently while is in a subshell, so it can't modify variables in the main program O.o
if ! [ "${validate_exit_code}" -eq 0 ]; then echo 1 &gt;"${counter_dirname}/$$"; fi Once all the subprocesses have finished up, the co-ordinator counts up all the errors and outputs the total at the bottom: error_count=(ls {counter_dirname} | wc -l); echo echo Errors: error_count echo  Finally, the co-ordinator cleans up after the subprocesses, and exits with the appropriate error code. This last bit is important for automation, as a non-zero exit code tells the patent process that it failed. My build script (which uses my lantern build engine, which deserves a post of its own) picks up on this and halts the build if any errors were found. rm -rf "{counter_dirname}"; if [ {error_count} -ne 0 ]; then exit 1; fi exit 0; That's about all there is to it! The complete code can be found at the bottom of this post. To use it, you'll need to run npm install acorn in the directory that you save it to. I've done my best to optimize it - it can process a dozen or so files in ~1 second - but I think I can do much better if I rewrite it in Node.JS - as I can eliminate the subprocesses by calling the acorn API directly (it's a Node.JS library), rather than spawning many subprocesses via the CLI. Found this useful? Got a better solution? Comment below! #!/usr/bin/env sh validate_file() { filename=1; counter_dirname=2; validate_result=(node_modules/.bin/acorn --silent --allow-hash-bang --ecma9 --module filename 2&gt;&amp;1); validate_exit_code=?; validate_output=([ {validate_exit_code} -eq 0 ] &amp;&amp; echo ok || echo {validate_result}); echo "{filename}: {validate_output}"; # Use /dev/shm here since apparently while is in a subshell, so it can't modify variables in the main program O.o if ! [ "{validate_exit_code}" -eq 0 ]; then echo 1 &gt;"{counter_dirname}/$$";
fi
}

if [ "$1" != "" ]; then validate_file "$1" "$2"; exit$?;
fi

counter_dirname=$(mktemp -d -p /dev/shm/ -t acorn-validator.XXXXXXXXX.tmp); # Parallelisation trick from https://stackoverflow.com/a/33058618/1460422 # Updated to use xargs find . -not -path "./node_modules/" -not -path "./dist/" | grep -iP '\.mjs$' | xargs -n1 -I{} -P32 $0 "{}" "${counter_dirname}";

error_count=$(ls${counter_dirname} | wc -l);

echo
echo Errors: $error_count echo rm -rf "${counter_dirname}";

if [ ${error_count} -ne 0 ]; then exit 1; fi exit 0; ## Read / Write Disk Performance Testing in Bash Recently I needed to quickly (and non-destructively) test the read / write performance of a flash drive of mine. Naturally, I turned my attention to my terminal. This post is me documenting what I did so that I can remember for next time :P Firstly, to test the speed of a disk, we need some data to test with. Since lots of small files will inevitably cause slowdowns due to the overhead of writing the file metadata and inode information to the superblock, it makes the most sense to use one gigantic file rather than tons of small ones. Here's what I did to generate a 1 Gigabyte file filled with zeroes: dd if=/dev/zero of=/tmp/testfile.bin bs=1M count=1024 Cool. Next, we need to copy it to the target disk and measure the time it took. Then, since we know the size of the file (1073741824 bytes, to be exact), we can calculate the speed at which the copy took place. Here's my first attempt: time dd if=/tmp/testfile.bin >testfile.bin If you run this, you might find that it doesn't take it very long at all, and you get a speed of something like ~250MiB / sec! While impressive, I seriously doubt that my flash drive has that kind of speed behind it. Typically, flash memory takes longer to write to and read from - and I'm pretty sure that it can't read from it that fast either. So what's going on? Well, it turns out that Linux is caching the disk write operations in a buffer, and then doing them in the background for us. Whilst fine for ordinary operation, this doesn't give us an accurate representation of how fast it's actually writing to the disk. Thankfully, there's something we can do about this: Use the sync command. sync will flush all cached write operations to disk for us, giving us the actual time it took to write the 1 GiB file to disk. Here's the altered command: sync; time sh -c 'dd if=/tmp/testfile.bin >testfile.bin; sync' Very cool! Now, we can just take the time it took and do some simple maths to calculate the write speed of our disk. What about the read speed though? Well, to test that, we'll first need to clear out the page cache - another one of Linux's (many) caches that holds portions of files that have recently been accessed for faster retrieval - because as before, we're not interested in the speed of the cache! Here's how to do that: echo 1 | sudo tee /proc/sys/vm/drop_caches With the correct cache cleared, we can test the read speed accurately. Here's how I did it: time dd if=testfile.bin of=/dev/null Fairly simple, right? At a later date I might figure out a way of automating this, but for the occasional use now and again this works just fine :) Found this useful? Got a better way of doing it? Want to say hi? Post in the comments below! ## Jump around a filesystem with a bit of bash (Banner remixed from images found on openclipart) I've seen things like jump, which allow you to bookmark places on your system so that you can return to them faster. The trouble is, I keep forgetting to use it. I open the terminal and realise that I need to be in a specific directory, and forget to bookmark it once I cd to it - or I forget that I bookmarked it and cd my way there anyway :P To solve the problem, I thought I'd try implementing my own simplified system, under the name teleport, telepeek, and telepick. Obviously, we'll have to put these scripts in something like .bash_aliases as functions - otherwise it won't cd in the terminal itself. Let's start with teleport: function teleport() { cd "$(find . -type d | grep -iP "$@" | head -n1)"; } Not bad for a first attempt! Basically, it does a find to list all the subdirectories in the current directory, filters the results with the specified regex, and changes directory to the first result returned. Here's an example of how it's used: ~$ teleport 'pep.*mint'
~/Documents/code/some/path/pepperminty-wiki/ $ We can certainly improve it though. Let's start by removing that head call: function teleport() { cd "$(find . -type d | grep -m1 -iP "$@")"; } What about all those Permission denied messages that pop up when you're jumping around places that you might not have permission to go everywhere? Let's suppress those too: function teleport() { cd "$(find . -type d 2>/dev/null | grep -m1 -iP "$@")"; } Much better. With a teleport command in hand, it might be nice to inspect the list of directories the find + grep combo finds. For that, let's invent a telepeek variant: function telepeek() { find . -type d 2>/dev/null | grep -iP "$@" | less
}

Very cool. It doesn't have line numbers though, and they're useful. Let's fix that:

function telepeek() {
find . -type d 2>/dev/null | grep -iP "$@" | less -N } Better, but I'd prefer them to be highlighted so that I can tell them apart from the directory paths. For that, we've got to change our approach to the problem: function telepeek() { find . -type d 2>/dev/null | grep -iP "$@" | cat -n | sed 's/^[ 0-9]*[0-9]/\o033[34m&\o033[0m/' | less -R
}

By using a clever combination of cat -n to add the line numbers and a strange sed recipe (which I found in a comment on this Stack Overflow answer) to highlight the numbers themselves, we can get the result we want.

This telepeek command has given me an idea. Why not ask for an index to jump to after going to the trouble of displaying line numbers and jump to that directory? Let's cook up a telepick command!

function telepick() {
telepeek $1; read -p "jump to index: " line_number; cd "$(find . -type d 2>/dev/null | grep -iP "$@" | sed "${line_number}q;d")";
}

That wasn't too hard. By using a few different commands rather like lego bricks, we can very easily create something that does what we want with minimal effort. The read -p "jump to index: " line_number bit fetches the index that the user wants to jump to, and sed comes to the rescue again to pick out the line number we're interested in with sed "${line_number}q;d". Update April 20th 2018: I've updated the approach here to support spaces everywhere by adding additional quotes, and utilising $@ instead of \$1.

## Pepperminty Wiki CLI

I've got a plan. Since I'm taking the Mobile Development module next semester, I'd like to create an Android app for Pepperminty Wiki that will let me edit one or more instances of Pepperminty Wiki while I'm, say, on a bus.

To this end, I'll need to make sure that Pepperminty Wiki itself is all ready to go - which primarily entail making sure that its REST API is suitably machine-friendly, so that I can pull down all the information I need in the app I build.

Testing this, however, is a bit of a challenge - since I haven't actually started the module yet. My solution, as you might have guessed by the title of this blog post, is to build a command-line interface (CLI) instead. I've been writing a few bash scripts recently, to I tried my hand at creating something that's slightly more polished. Here's a list of the features supported at the time of posting:

• Listing all pages
• Viewing a specific page
• Listing all revisions of a page
• Viewing a specific revision of a page

Support for searching is on the cards, but it's currently waiting on support for grabbing search results as json / plain text from Pepperminty Wiki itself.

I'll be updating it with other things too as I think of them, but if you'd like to give it a try now, then here's the source:

It should update dynamically as I update the script. Simply save it to a file called peppermint - and then you can run ./peppermint to get an overview of the commands it supports. To get detailed help on a specific command, simply run ./peppermint {command_name}` to get additional help about that specific command - and additional help for that command will be displayed if it supports any further arguments (it will be executed directly if not).

Sound interesting? Any particular aspect of this script you'd like explaining in more detail? Want to help out? Leave a comment below!

Art by Mythdael