Archive

## Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

## Cluster, Part 2: Grand Designs

In the last part of this series, I talked about my plans for building an ARM-based cluster, because I'm growing out of the Raspberry Pi 3B+ I currently have at home. Since then, I have decided to focus on the compute cluster first, as I have a reasonable amount of room left on the 1TB WD PiDrive attached to my existing Raspberry Pi 3B+.

### Hardware

To this end, I have been busy ordering parts and organising things to get construction of the compute cluster side of things going. The most important part of the whole cluster is the compute boards themselves. I've decided to go with 4 x Raspberry Pi 4s with 4GB RAM each for the worker nodes, and 1 x Raspberry Pi 4 with 2GB of RAM as the controller (it would have been a 1GB RAM model, but a recent announcement changed my mind :D):

(Above: The Raspberry Pi 4s I'm going to be using. The colourful heatsink cases there are to dissipate heat passively if possible and reduce the need for the fan to run as often. The one with the smaller red heatsink is the controller node - I don't anticipate the load on that node being high enough to need a bigger more expensive heatsink)

My reasoning for Raspberry Pis is software support. They are hugely popular - and from experience I can tell that they are pretty well supported on the software side of things. Issues with hardware features not being supported by the operating system are minimal - and where issues do arise they are more often than not sorted out. Regular kernel security updates are also provided - something that isn't always a thing with Linux distributions for other boards I've noticed.

Although the nodes in the cluster are very important, they are far from the only component I'll need. I'll also need a way to power it all - for which I've settled on using a desktop ATX power supply (generously donated by the University).

(Above: The ATX power supply, with a few wires cut and other bits and bobs attached. As of this blog post I'm in the middle of wiring it up, so I haven't finished it yet)

This adds some additional complications though, because wiring an ATX power supply up to a fleet of Raspberry Pi 4s isn't as easy as it sounds. To do that, I've decided to wire the 5V and ground wires up to 5 USB type-a breakout boards, with a 3 amp self-resettable fuse on each live (red) wire. Then I can use 5 short type-a to type-c converter cables to power the Raspberry Pi 4s.

(Above: The extra bits and bobs laid out that I'll be using to wire the ATX power supply up to the USB type-a breakout boards. From left to right: 3A self-resettable fuses, 18 AWG wire, Wagos, header pins, and finally the USB type-a breakout boards themselves)

With power to the Raspberry Pis, the core compute hardware is in place. I still need a bunch of things around the edges though, such as a (very quiet) fan to keep it cool:

(Above: A Noctua NF-P14s redux-1200)

I found this particular fan on quietpc.com. While their prices and shipping are somewhat expensive (I didn't actually buy it from there - I got a better deal on Amazon instead), they are a great place to look into the different options available for really quiet fans. I'm pretty sensitive to noise, so having a quiet fan is an important part of my cluster design.

This one is the large 14cm model, so that it fits in front of all 5 Raspberry Pis if they are stood up on their sides and stacked horizontally. It takes 12 volts, so I'll be connecting it to the 12V rail from the ATX power supply. The fan speed is also controllable via PWM (pulse-width modulation), so I plan on using an Arduino (probably one of the Arduino Unos I've got lying around) to control it and present a serial interface or something to the Raspberry Pi that's acting as the controller node in the cluster.
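To give a concrete (if entirely hypothetical) sketch of what I mean by a serial interface, the controller node could ask the Arduino for a given fan speed with something as simple as this - assuming the Arduino shows up as /dev/ttyACM0 (as it usually does) and understands a made-up "speed 0-255" command:

stty -F /dev/ttyACM0 9600 raw;  # Set the baud rate of the serial port the Arduino presents
echo "speed 128" >/dev/ttyACM0; # Hypothetical protocol: ask for ~50% PWM duty cycle

The real protocol will of course depend on the sketch I end up writing for the Arduino itself.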

Lastly, another extremely important part of any cluster is a solid switch. Without a great switch at the base of the network, you'll have all sorts of connection issues and the performance of the cluster will be degraded significantly. I'm anticipating that I'll want to transfer significant amounts of data around very quickly (e.g. Docker container images, and later large blocks of data during a storage cluster rebalance).

For this reason, I've bought myself a Netgear GS116v2. While it's unmanaged (I can't currently afford a managed switch), it is gigabit and has an array of other features, such as energy efficient ethernet (802.3az), full duplex gigabit on all ports (i.e. 32Gbps of aggregate bandwidth, enough for every port to be transmitting and receiving gigabit at the same time), and a silent fanless design.

(Above: The switch I'll be using. I watched eBay and got it used for much less than it costs new)

### Networking

Hardware isn't the only thing I've been thinking about. While I've been waiting for packages to arrive, I've also been planning out the software I'm going to use and how I'm going to network all my Pis together.

My plans on the networking side of things are subject to significant change depending on how many responsibilities I can convince my home router to give up, but I have drawn up a network diagram showing what I'm currently aiming towards:

The cluster is represented on the left half of the diagram. This will probably entail some considerable persuasion of my router to pull off, but a quick look reveals that it's (probably) possible with some trial-and-error.

The idea is that the cluster gets a separate subnet from the rest of the home network. Then I can do strange stuff and fiddle with it (hopefully) without affecting everyone else on the network.
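To illustrate with completely made-up numbers (I haven't settled on actual address ranges yet), the controller node might be given a static IP in the cluster's subnet via /etc/dhcpcd.conf - which is what Raspbian uses for network configuration by default:

# /etc/dhcpcd.conf - example values only
interface eth0
static ip_address=172.16.230.100/24
static routers=172.16.230.1
static domain_name_servers=172.16.230.100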

### Software

Meanwhile, out of all the different aspects of building this cluster I've got the clearest picture as to the software I'm going to be using.

I've decided that I'm going to use a container-based system. I've looked at a number of different options (such as podman and Singularity) - but I'm currently of the opinion that Docker is the most suitable option for what I'm going for. It's not as enterprisey as Singularity, and it seems to be more mature than podman. It also has a huge library of prebuilt containers too - but for learning purposes I'm going to be writing almost all my container scripts from scratch - probably using some sort of Alpine Linux container as a base. If I ever run into a situation where Docker isn't suitable and I need something closer to a VM, I'll probably use LXC, which I believe sits on top of the same underlying container runtime that Docker does.
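As a quick taste of what working with this looks like, here's the stock Docker CLI pulling the official Alpine Linux image and starting a throwaway container with an interactive shell (nothing cluster-specific here):

sudo docker pull alpine;                 # Fetch the Alpine Linux base image
sudo docker run --rm -it alpine /bin/sh; # Start a disposable container with an interactive shell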

I'm anticipating that container-based tech is going to be great for managing the stuff that's running on my cluster - so you can expect more posts that go into some depth about how it all works and how I'm setting my system up in the future.

To complement my container-based tech, I'm also going to be using a workload orchestrator. The Viper High-Performance Computer I've recently gained access to has lots of nodes in it and uses Slurm for workload orchestration, but that seems more geared towards environments that have lots of jobs that each have a defined running time. Great for scientific simulations and other such things, but not so great for personal self-hosted applications and the like.

Instead, I'm probably going to use Nomad. It looks seriously cool, and an initial look at the documentation suggests that it's going to be much simpler and easier to understand than Kubernetes (see also), which seems to be the other main competing software in this space. It also seems to integrate well with other programs by the same company (Hashicorp), like Consul for service networking management (I'm hoping I can get DNS resolution for the services running on the cluster under control with it) and Vault for secret management (e.g. API keys, passwords, and other miscellaneous secrets) - all of which I'm going to install and experiment with (expect more on that soon).
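Handily, all 3 of those Hashicorp tools ship with a dev mode that spins up a self-contained single-node instance in the foreground, which should make the experimentation phase easy. If I'm reading the documentation right, it's along these lines:

nomad agent -dev;  # A combined Nomad server + client on localhost
consul agent -dev; # A single-node Consul cluster
vault server -dev; # An in-memory Vault instance (dev mode is definitely not for real secrets!)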

All of those for now will be backed by an NFS share on all nodes in the cluster for the persistent volumes attached to running containers.
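As a rough sketch of what that might look like (the paths and the subnet here are placeholders I've invented for illustration), the node holding the data would export a directory in /etc/exports, and every other node would then mount it:

# /etc/exports on the node that holds the data
/mnt/cluster-data 172.16.230.0/24(rw,sync,no_subtree_check)

sudo exportfs -ra; # Reload the exports table on the server
# ....then, on every node in the cluster:
sudo mount -t nfs controller:/mnt/cluster-data /mnt/cluster-data;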

On the controller node I mentioned earlier I'm also going to be running a few extra items to aid in the management of the cluster:

• A Docker registry, from which the worker nodes will be pulling containers for execution (worker nodes will not have access to the public Docker registry at hub.docker.com)
• An apt caching proxy - probably apt-cacher-ng. Since all the nodes in the cluster are going to be using the same OS, with the same packages installed and the same configuration settings etc, it doesn't make much sense for them to each download apt packages from the Internet separately - so I'll be caching them locally on the controller node (see the sketch after this list)
• Potentially some sort of reverse proxy that sits in front of all the services running on the cluster, but I haven't decided on how this will fit into the larger puzzle just yet (more research is required). I'm already very familiar with Nginx, but I've seen Traefik recommended for dynamic container-based setups, so I'm going to investigate that too.
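For the apt caching proxy mentioned above, pointing a node at it should only take a single line of apt configuration, since apt-cacher-ng listens on port 3142 by default (the hostname here is a placeholder for whatever the controller node ends up being called):

echo 'Acquire::http::Proxy "http://controller:3142";' | sudo tee /etc/apt/apt.conf.d/00proxy;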

That about covers my high-level design ideas. As of the time of typing, the next thing I need to do is organise a case for it all to go in, fix the loose connections in the screw terminals (not pictured; they arrived after I took the pictures), and then find a place to put it....

## Testing storage devices with f3

(Above: Some microSD cards. Thankfully none of these are fake, but you never know.....)

Always test storage devices after you buy them. I don't just mean check to see if they work (though that's a good idea too), but also that they can actually store the amount of stuff that they advertise they can.

Recently, I bought myself 5 64GB microSD cards for my cluster (more on this very soon in a future blog post!). The first thing I did when I got them was test them to make sure that they could actually store 64GB of stuff. My tool of choice was f3, which stands for Fight Flash Fraud or Fight Fake Flash. I'm glad I did - because 3 of them turned out to be faulty: 2 of them were actually 32GB cards in disguise, and 1 of them wouldn't mount at all.

While this might be my first experience with fake or faulty storage devices, it's hardly an uncommon occurrence. Everything from microSD cards to flash drives - and even regular hard drives! - may be faulty upon arrival - or worse, appear fine at first and then a few months down the line start corrupting random data for no reason.

f3 is a suite of tools for testing storage devices to make sure they function properly. They work best as a destructive test - i.e. one that destroys existing data on the disk - so if you've got some data on the target disk you want to test, now is the time to back it up (hopefully this is something you've been doing already - more on that in another post if there's the demand).

f3 consists of 3 principal tools:

• f3probe, which runs a fast test to check for issues (sadly I couldn't get this to work reliably)
• f3write, which fills a disk with test files
• f3read, which reads the test files back from disk and validates them

It's a real shame that I can't get f3probe to work reliably. Maybe at some point I'll implement my own version that writes data to every nth block of a device, to test it more quickly than the f3write/f3read mechanism I'll explain below (if anyone knows of a better tool that works on Linux, please comment below!).
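For reference, this is roughly how f3probe is supposed to be invoked, despite my trouble getting sensible results out of it. Note the --destructive flag - it overwrites whatever is on the device:

sudo f3probe --destructive /dev/sdX; # Replace /dev/sdX with the device under test - this wipes it!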

To test a device, you first need to write the test files to it. I've taken to reformatting the device as ext4 (the Linux filesystem) first:

sudo umount /dev/sdXY; # Unmount it if it's currently mounted
sudo mkfs.ext4 /dev/sdXY; # Format it to ext4

....where /dev/sdXY is the partition you want to format. This isn't mandatory, but it is a quick way of making sure a disk is empty.

Next, we need to write the test files to the device. To do this, it needs to be mounted. If your system hasn't mounted it automatically, you can do so by hand like this:

sudo mkdir -p /media/YOUR_USERNAME_HERE/SOME_NAME_HERE;
sudo mount /dev/sdXY /media/YOUR_USERNAME_HERE/SOME_NAME_HERE;

Then, fill it with test files:

f3write /media/YOUR_USERNAME_HERE/SOME_NAME_HERE

This might take a while - don't forget to replace the paths there with those specific to your setup. With the test files written to the disk, we need to read them back again to make sure they are valid:

f3read /media/YOUR_USERNAME_HERE/SOME_NAME_HERE

This will read them all back again, and then print a summary report at the bottom to tell you what it found. Ideally, it should show a big number of blocks as succeeded, and no blocks in any of the other failure categories.

Running multiple commands like this is effort though, so surely we can do better than this. With some simple shell scripting, we can run both commands at once:

location=/media/YOUR_USERNAME_HERE/SOME_NAME_HERE; f3write "${location}" && f3read "${location}"; alert

If you're on a machine with a graphical desktop, then the ; alert bit on the end should generate a desktop notification when it's done. For other users (e.g. over SSH), this should be removed. Just in case you have a graphical desktop (e.g. Ubuntu Desktop) and the alert bit doesn't work for you, append this to your ~/.bashrc file and restart your terminal:

# Add an "alert" alias for long running commands.  Use like so:
#   sleep 10; alert
alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'

....I forget where this is from exactly.

If you're not likely to be at your computer when it finishes, then there's still something you can do. Personally I use XMPP for personal messaging, so I thought it would be great if I could get a notification when it was done. Since I've already written xmppbridge for easily sending XMPP messages from the terminal, it was pretty trivial to write a shell script for my bin folder that would send me a message when the process was complete:

#!/usr/bin/env bash

# f3test: Runs f3 on the current directory.
#
# Usage:
#     f3test <destination JID>

destination="$1";

f3write .;
f3read .;

echo "Card testing complete in ${SECONDS}s" | xmppbridge --groupchat --destination "${destination}";

I called this script f3test, and put it in my ~/bin folder. To use it, first cd to the root of the device you want to test (/media/YOUR_USERNAME_HERE/SOME_NAME_HERE in the above examples), and then set a pair of environment variables to let it know how to login to an XMPP account to send a message:

export XMPP_JID="someone@bobsrockets.com"; # The JID to login with
export XMPP_PASSWORD="weN33dM0reBoost3rs"; # The password to use when logging in

....remove the --groupchat in the script if it's not a group chat you want it to send a message to (I have a personal group chat that's just between me and various bots that notify me about various aspects of the systems I manage).

If you don't have an XMPP account yet, you can get one at any public server in the XMPP directory, or run your own (see also snikket, which is a distribution of Prosody that's designed to be extremely easy to setup & run)!

Of course, you could just as easily swap the xmppbridge call there with a different command to send a message via a different channel. For example, mailx can send emails.

Found this interesting? Got a better tool? Need some help? Comment below!

## Installing libonig4 from source to fix php7.4-mbstring

I have several Raspberry Pis. The one I'd like to talk about today though is a 3B+, and for 1 reason or another it has PHP installed on it via the excellent deb.sury.org apt PPA for PHP.

Recently, I upgraded it to PHP 7.4. This was fine initially, but soon enough I started to get a warning that php-mbstring couldn't be installed, and that I had held broken packages. This was not a good sign, but after doing some digging it transpired that the package libonig4 was missing - and couldn't be installed, because it wasn't available in the Raspbian apt repositories. Awkward.

After doing some quick digging into the Ubuntu apt repositories, I discovered that while libonig4 does exist there, it isn't built for armhf (the architecture of the Raspberry Pi). Thankfully though, Ubuntu is open-source - so the source package was available, and the Debian tooling makes it relatively easy to build source packages once downloaded. Unfortunately I couldn't use the apt-get source command to download it, as I didn't have an Ubuntu machine to hand - but their website makes it easy to download packages: https://packages.ubuntu.com/bionic/libonig4

On here, you'll want to download the 3 source package files to a new directory. Then, extract the source files like so:

cd path/to/directory;
dpkg-source -x *.dsc;

Next, cd into the created directory, and build the source files into a bunch of .deb files:

cd libonig-6.7.0/;
dpkg-buildpackage --no-sign;

The --no-sign there is necessary, because otherwise I encountered errors where it tried to automatically sign the resulting package with the original author's secret key, which we obviously don't have access to! Once done (it might take a moment), a bunch of .deb files will be generated in the parent directory:

Filename | Description
-------- | -----------
libonig4_6.7.0-1_armhf.deb | The actual package itself
libonig4-dbgsym_6.7.0-1_armhf.deb | Debugging symbols generated in the build process
libonig-dev_6.7.0-1_armhf.deb | Development headers (in case you need to build another package against it)

Out of these 3, the top and bottom ones are probably the ones you want to install. This can be done like so:

sudo dpkg -i libonig4_6.7.0-1_armhf.deb;
sudo dpkg -i libonig-dev_6.7.0-1_armhf.deb;

This completes the process.
Now, we can install php7.4-mbstring as normal:

sudo apt install php7.4-mbstring

Success! This should solve the problem. I figured this out in part by following a Unix Stack Exchange answer that I have since lost, but I had to adapt the instructions significantly - so I decided to blog about it here.

Found this useful? Still encountering issues? Comment below!

## Variable-length fuzzy hashes with Nilsimsa for did you mean correction

Or, why fuzzy hashing isn't helpful for improving a search engine.

Welcome to another blog post about one of my special interests: search engines - specifically the implementation thereof :D

I've blogged about search engines before, in which I looked at taking my existing search engine implementation to the next level by switching to a SQLite-based key-value datastore backing and stress-testing it with ~5M words. Still not satisfied, I'm now turning my attention to query correction. Have you ever seen something like this when you make a typo in a search?

Surprisingly, this is actually quite challenging to achieve. While it's easy to determine whether a word contains a typo, it's hard to determine what the correct version of the word is. Consider a wordlist like this:

apple
orange
pear
grape
pineapple

If the user entered something like pinneapple, then it's obvious to us that the correct spelling would be pineapple - but in order to determine this algorithmically, you need an algorithm capable of determining how close 2 different words are to one another.

The most popular algorithm for this is called Levenshtein. Given 2 words a and b, it calculates the number of edits needed to turn a into b. For example, the edit distance between pinneapple and pineapple is 1. This is useful, but it still doesn't help us very much: we'd have to calculate the Levenshtein distance between the typo and every word in the list, which could easily run into millions of words for large wikis. That's obviously completely impractical, so we need a better idea.

In this post, I'm going to talk about my first attempt at solving this problem. I feel it's important to document failures as well as successes, so this is part 1 of a 2 part series.

The first order of business is to track down a Nilsimsa implementation in PHP - since it doesn't come built-in, and it's pretty complicated to implement. Thankfully, this isn't too hard - I found this one on GitHub.

Nilsimsa is a fuzzy hashing algorithm. This means that if you hash 2 similar words, then you'll get 2 similar hashes:

Word | Hash
---- | ----
pinneapple | 020c2312000800920004880000200002618200017c1021108200421018000404
pineapple | 0204239242000042000428018000213364820000d02421100200400018080200

If you look closely, you'll notice that the hashes are quite similar. My thinking is that if we vary the hash size, then words that are similar will have identical hashes, allowing the search space to be cut down significantly. The existing Nilsimsa implementation I've found doesn't support that though, so we'll need to alter it.

This didn't turn out to be too much of a problem. By removing some magic numbers and adding a class member variable, it seems to work like a charm:

(Can't view the above? Try this direct link.)

I removed the comparison functions since I'm not using them (yet?), and also added a static convenience method for generating hashes.
If I end up using this for large quantities of hashes, I may come back to it and make it resettable, to avoid having to create a new object for every hash. With this, we can get the variable-length hashes we wanted:

Bits | Hash
---- | ----
256 | 0a200240020004a180810950040a00d033828480cd16043246180e54444060a5
128 | 3ba286c0cf1604b3c6990f54444a60f5
64 | 02880ed0c40204b1
32 | 060a04f0
16 | 06d2
8 | 06

The number there is the number of bits in the hash, and the hex value is the hash itself. The algorithm defaults to 256-bit hashes.

Next, we need to determine which sized hash is best. The easiest way to do this is to take a list of typos, hash each typo and its correction, and count the number of pairs whose hashes are identical. Thankfully, there's a great dataset just for this purpose. Since it's formatted in CSV, we can download it and extract the typos and corrections in 1 go like this:

curl https://raw.githubusercontent.com/src-d/datasets/master/Typos/typos.csv | cut -d',' -f2-3 >typos.csv

There's also a much larger dataset too, but that one is formatted as JSON objects and would require a bunch of processing to get it into a format that would be useful here - and since this is just a relatively quick test to get a feel for how our idea works, I don't think it's too crucial that we use the larger dataset just yet.

With the dataset downloaded, we can run our test. First, we need to read the file in line-by-line for every hash length we want to test:

<?php
$handle = fopen("typos.csv", "r");

$sizes = [ 256, 128, 64, 32, 16, 8 ];
foreach($sizes as $size) {
	fseek($handle, 0); // Jump back to the beginning
	fgets($handle); // Skip the first line, since it's the header
	
	while(($line = fgets($handle)) !== false) {
		// Do something with the next line here
	}
}

PHP has an inbuilt function fgets() which gets the next line of input from a file handle, which is convenient. Next, we need to actually do the hashes and compare them:

<?php
// .....

$parts = explode(",", trim($line), 2);
if(strlen($parts[1]) < 3) {
	$skipped++;
	continue;
}

$hash_a = Nilsimsa::hash($parts[0], $size);
$hash_b = Nilsimsa::hash($parts[1], $size);

$count++;
if($hash_a == $hash_b) {
	$count_same++;
	$same[] = $parts;
}
else {
	$not_same[] = $parts;
}

echo("$count_same / $count ($skipped skipped)\r");

// .....

Finally, a bit of extra logic around the edges and we're ready for our test:

<?php
$handle = fopen("typos.csv", "r");
$line_count = lines_count($handle);
echo("$line_count lines total\n");

$sizes = [ 256, 128, 64, 32, 16, 8 ];
foreach($sizes as $size) {
	fseek($handle, 0);
	fgets($handle); // Skip the first line, since it's the header
	
	$count = 0; $count_same = 0; $skipped = 0;
	$same = []; $not_same = [];
	
	while(($line = fgets($handle)) !== false) {
		$parts = explode(",", trim($line), 2);
		if(strlen($parts[1]) < 3) {
			$skipped++;
			continue;
		}
		
		$hash_a = Nilsimsa::hash($parts[0], $size);
		$hash_b = Nilsimsa::hash($parts[1], $size);
		
		$count++;
		if($hash_a == $hash_b) {
			$count_same++;
			$same[] = $parts;
		}
		else
			$not_same[] = $parts;
		
		echo("$count_same / $count ($skipped skipped)\r");
	}
	
	file_put_contents("$size-same.csv", implode("\n", array_map(function ($el) {
		return implode(",", $el);
	}, $same)));
	file_put_contents("$size-not-same.csv", implode("\n", array_map(function ($el) {
		return implode(",", $el);
	}, $not_same)));
	
	echo(str_pad($size, 10)."→ $count_same / $count (".round(($count_same/$count)*100, 2)."%), $skipped skipped\n");
}

I'm writing the pairs that are the same and different to different files here for a visual inspection. I'm also skipping words that are less than 3 characters long, and that lines_count() function there is just a quick helper function for counting the number of lines in a file for the progress indicator (if you write a \r without a \n to the terminal, it'll reset to the beginning of the current line):

<?php
function lines_count($handle) : int {
	fseek($handle, 0);
	$count = 0;
	while(fgets($handle) !== false)
		$count++;
	return $count;
}

Unfortunately, the results of running the test aren't too promising. Even with the shortest hash the algorithm will generate without getting upset, only ~23% of typos generate the same hash as their correction:

7375 lines total
256       → 7 / 7322 (0.1%), 52 skipped
128       → 9 / 7322 (0.12%), 52 skipped
64        → 13 / 7322 (0.18%), 52 skipped
32        → 64 / 7322 (0.87%), 52 skipped
16        → 347 / 7322 (4.74%), 52 skipped
8         → 1689 / 7322 (23.07%), 52 skipped

Furthermore, digging deeper with an 8-bit hash, you start to get large numbers of words that share the same hash, which isn't ideal at all. A potential solution here would be to use the hamming distance (basically counting the number of bits that are different in a string of binary) to determine which hashes are similar to each other, like the Levenshtein distance does - but that still doesn't help us, as we'd then have a problem that's almost identical to the one we started with.

In the second part of this mini-series, I'm going to talk about how I ultimately solved this problem. While the algorithm I ultimately used (a BK-Tree, more on them next time) is certainly not the most efficient out there (it's O(log n) if I understand it correctly), it's very simple to implement and is much less complicated than SymSpell, which seems to be the most efficient algorithm that exists at the moment. Additionally, I have been able to optimise said algorithm to return results for a 172K wordlist in ~110ms, which is fine for my purposes.

Found this interesting? Got another algorithm I should check out? Got confused somewhere along the way? Comment below!

## The legend of the disappearing data in Node.js

Happy leap day! :D

(Above: A nice green tree frog - source)

Recently, I've been doing a bunch of work in Node.js streaming large amounts of data. For the most part the experience has been highly pleasurable, as Node.js makes it so easy! I have encountered a few pain points though, the most significant of which I'd like to talk about here.

In Node.js, streams come in 3 main forms:

• Readable Streams
• Writable Streams
• Transform Streams

In addition, you can either plug streams together with the .pipe() method, write to them directly with the .write() method, or any combination thereof - allowing you to build up a chain of streams that enables data to flow through your program.

The problems start when you try and write large amounts of data to a stream directly:

import fs from 'fs';

import do_work from 'somewhere';
import get_some_stream from 'somewhere_else';

let stream_in = get_some_stream();
let out = fs.createWriteStream("/tmp/test.txt");
for(let i = 0; i < 1000000; i++) {
	out.write(do_work(stream_in, i));
}

(Above: Just an example of writing lots of data to a stream)

When this happens, you start to lose random chunks of data. The reason for this is not obvious, but it is buried in the Node.js docs:

The writable.write() method writes some data to the stream, and calls the supplied callback once the data has been fully handled. If an error occurs, the callback may or may not be called with the error as its first argument. To reliably detect write errors, add a listener for the 'error' event.

This is a huge pain. It means that you have to wrap all write calls like this:

"use strict";

/**
 * Writes data to a stream, automatically waiting for the drain event if asked.
 * @param	{stream.Writable}			stream_out	The writable stream to write to.
 * @param	{string|Buffer|Uint8Array}	data		The data to write.
 * @return	{Promise}	A promise that resolves when writing is complete.
 */
function write_safe(stream_out, data) {
	return new Promise((resolve, reject) => {
		// Handle errors
		let handler_error = (error) => {
			stream_out.off("error", handler_error);
			reject(error);
		};
		stream_out.on("error", handler_error);
		
		if(stream_out.write(data)) {
			// We're good to go
			stream_out.off("error", handler_error);
			resolve();
		}
		else {
			// We need to wait for the drain event before continuing
			stream_out.once("drain", () => {
				stream_out.off("error", handler_error);
				resolve();
			});
		}
	});
}

export { write_safe };

Such a huge amount of boilerplate for such a simple task! Basically, if the .write() method returns false, you have to wait until the drain event is fired on the writable stream before continuing to write to it. The reason for this, I think, is that a false return value signals that the write buffer is full, and it needs to be drained before writing can continue. This is ok, but it would be nice if it was abstracted away behind a single method, such as the wrapper I've shown above. Something like an async stream.Writable.writeAsync() would be great, but it doesn't currently exist. I think I'm going to open an issue about it - since it seems very doable and just silly that it doesn't exist already.

## Rust Review Redux

It was aaaages ago that I first reviewed Rust. For those not in the know, Rust is a next-generation compiled language (similar to Go, but this is where they diverge) developed by Mozilla - out of a need to have a safer alternative to C++ for writing key components of Firefox in. Since then, I've obtained both a degree and a masters in computer science, and learnt a number of programming languages. I have been searching for a better alternative to C++ that's easier to use and doesn't fight you at every step - so I decided to give Rust another go.

After a few false starts, I managed to get going with starting to build a little web app (which will probably take a while until I can really show it off here). The tooling for the compiler is pretty good once you actually get it installed - although the installer itself is truly shocking(ly bad):

• rustup - Manages multiple versions of Rust installed (I haven't used it much yet; apparently it's like nvm, the Node Version Manager, but I don't use that either)
• cargo - Orchestrates the building of your project and the installation of dependencies, which are known as crates.
• rustc - The compiler itself. You probably won't interact with it directly much - instead going through cargo most of the time.

Together (and with the right Atom packages installed), they make for a relatively pleasant development experience.

I mention the installer in particular though, because it's awful. I noted a number of issues with it:

• The official website forces you to download an installation script that pipes to sh
• It will only install on a per-user basis (goodbye disk space, hello extra system config complexity)
• It doesn't even tell you how much disk space it's going to use (which wouldn't be an issue if they just setup an apt repository....)

These issues aside, other aspects of the experience were also worthy of note. First, the error messages the Rust compiler generates are actually useful. Much better than they were the last time I really dove into Rust, they provide you with much more detail as to what's gone wrong, and there's even a special rustc --explain ERROR_CODE command you can execute to get more detail about what went wrong, why, and how to fix it.
This as a feature is certainly helpful for me as a beginner Rust programmer, but I think it's also a pretty essential feature given Rust's weirdness as a language.

I'm seriously not kidding - Rust is a nutty language. For one, classes exist.... sort of - but only as structs, which are passed by reference (again, sort of) by default and may not contain methods. Methods are the job of an impl, which is short for an implementation. Implementations are a strange mix between C♯'s interfaces and multiple inheritance (as found in C++, I think?). And then there are traits, which I haven't really looked into fully yet, but are a mix between interfaces and abstract classes..... you get the picture.

Point is, all this funky strangeness that goes on in Rust makes it a very challenging language to learn. A challenge that I feel is worth persevering with, but a challenge nonetheless.

Rust does have a number of very powerful features that make it worth the effort, in my opinion. For example, it catches entire classes of critically nasty bugs that plague other low-level systems languages such as C and C++ - like use-after-free and the really awful concurrency race conditions - at compile time, which is incredible, if you ask me. Such bugs have been a serious bother to many high-profile software projects that exist today and have caused a number of security issues. Rust is a testament to what can be achieved when you start from scratch and fix these issues by designing them out of the language.

For the curious, it does this via a complex system of variable lifetimes, ownership, moving, and borrowing. I don't yet understand all the details, but the system enables the Rust compiler to trace the lifetime of a variable at compile time, so you get the benefit of having a garbage collector without any of the overhead, since it's all been done at compile time and built into your program that way.

This deep understanding of how data is passed around also yields performance and efficiency benefits too. C and C++ do not have such an understanding, so there are a number of performance optimisations the Rust compiler can make that would be considered far too dangerous for gcc to do. The net result of this is that sometimes code written in Rust will actually be faster than C and C++. This is a significant accomplishment, as the speed of C and C++ has been held as the gold standard for a long time (see exhibits A and B just for starters).

These are just some of the reasons that I'm persisting with learning Rust. So far, it seems like a "slow and steady wins the race" kinda deal - in that I'm taking it one concept at a time. There's a huge amount to take in, so I can't recommend that you try and do it all at once - time to consolidate what I've learnt so far is quite important, I've found.

Rust is absolutely one of the hardest languages I've tried to learn, as it reinvents a lot of concepts which have been a staple of programming languages for a long time. However, it also comes with key benefits: ease of use (once learnt, compared to C and C++), performance, and program execution safety at runtime (it was originally invented by Mozilla specifically to make Firefox a safer and faster browser, IIRC). To this end, I'm going to try my best to keep learning the language - and report back here at some point with cool stuff I've created (at the moment it's still in a state of flux and I'm refactoring heavily at each successive stage) :D

Edit: I've just remembered.
I do currently have 2 big issues with Rust: compilation time and disk space usage. When you install a dependency, it not only builds it from source, but also recursively builds all of its dependencies from source too. Not only does this take forever, but it also eats huge volumes of disk space for breakfast!

Found this interesting? Got some helpful advice or a question about Rust? Comment below!

## PhD Update 2: The experiment, the data, and the supercomputers

Welcome to another PhD update post. Since last time, a bunch of different things have happened - which I'll talk about here. In particular, 2 distinct strands have become evident: the reading papers and theory bit, and the writing code and testing stuff out bit.

At the moment, I'm focusing much more heavily on the writing code and experimental side of things, as I've recently gained access to the 1km resolution rainfall radar dataset from CEDA. While I'm not allowed to share any of the data I've now got, I'm pretty sure I'm safe to talk about how terribly it's formatted.

The data itself is organised with 1 directory per year, and 1 file per day inside those directories. Logical so far, right? Each of those files is a tar archive, inside which are the binary files that contain the data itself - with 1 file for every 5 minutes. These are compressed with gzip - which seems odd, since they could probably gain greater compression ratios if they were tarred first and compressed second (then the compression algorithm would be able to compress based on the similarities between files too).

The problems arise when you start to parse out the binary files themselves. They are in a proprietary format - which has 3 different versions that don't conform to the (limited) documentation. To this end, it's been proving somewhat of a challenge to parse them and extract the bits I'm interested in.

To tackle this, I've been using Node.js and a bunch of libraries from npm (noisy pirate mutiny? nearest phase modulator? nasty popsicle machine? nah, it's probably node package manager):

• binary-parser - For parsing the binary files themselves. Allows you to define the format of the file programmatically, and it'll parse it out into a nice object you can then manipulate.
• gunzip-maybe - A streaming library that unzips a gzip-compressed stream
• @icetee/ftp - An FTP client library for downloading the files (I know that FTP is insecure; that's all they offer at this time :-/)
• tar-stream - For parsing tar files
• nnng - Stands for No! Not National Grid!. Helps with the conversion between OS national grid references and regular longitude / latitude.

Aside from the binary file format, I encountered 3 main issues:

1. The data is only a rectangle when using ordnance survey national grid references
2. There's so much data, it needs to be streamed from the remote server
3. Generating a valid gzip file is harder than you expect

Problem 1 here took me a while to figure out. Since, as I mentioned, the documentation is rather limited, I spent much longer than I would have liked attempting to parse the data in latitude / longitude and finding it didn't work.

Problem 2 was rather interesting. Taking a cursory glance over the data beforehand revealed that each daily tar file was about 80MiB - and with roughly 5.7K days worth of data (the dataset appears to go back to May 2004-ish), it quickly became clear that I couldn't just download them all and process them later. It is for this reason that I chose Node.js in the first place for this.
For those who haven't encountered it before, Node.js is Javascript for the server - and it's brilliant for 2 main use-cases: networking and streaming data. Both of these are characteristics of the problem at hand - so the answer was obvious.

I'm still working on tweaking and improving my final solution, but as it stands, after implementing the extractor on its own, I've also implemented a wrapper that streams the tar archives from the FTP server, stream-reads the tar archives, streams the files in the tar archives into a gzip decompressor, parses the result, and then streams the interesting bits to disk via a gzip compressor.

That's a lot of streams. The great part about this is that I don't accidentally end up with huge chunks of binary files in memory. The only bits that can't be streamed are the binary file parser and the bit that extracts the interesting bits.

I'm still working on the last issue, but I've been encountering nasty problems with the built-in zlib gzip compressor transformation stream. When I send a SIGINT (Ctrl + C) to the Node.js process, it doesn't seem to want to finish writing the gzip file correctly - leading to invalid gzip files with chunks missing from the end. Since the zlib gzip transformation stream is so badly documented, I've ended up replacing it with a different solution that spawns a gzip child process instead (so you've got to have gzip installed on the machine you're running the script on, which shouldn't be a huge deal on Linux). This solution is better, but still requires some tweaks, because it transpires that Node.js automatically propagates signals it receives to child processes - before you've had a chance to tie up all your loose ends. Frustrating.

Even so, I'm hopeful that I've pretty much got it to a workable state for now - though I'll need to implement a daemon-type script at some point to automatically download and process the new files as they are uploaded - it is a living dataset that's constantly being added to, after all.

### Papers

The other strand (that's less active at the minute) is reading papers. Last time, I mentioned the summary papers I'd read, and the direction I was considering reading in. Since then, I've both read a number of new papers and talked to a bunch of very talented people on-campus - so I've got a bit of a better idea as to the direction I'm headed in now.

Firstly, I've looked into a few cutting-edge recurrent neural network types:

• Grid LSTMs - Basically multi-dimensional LSTMs
• Dilated LSTMs - Make LSTMs less computationally intensive and better at learning long-term relationships
• Transformer Neural Networks - more reading required here
• NARX Networks

Many of these recurrent neural network structures appear to show promise for mapping floods. The last experiment with a basic LSTM didn't go too well (it turned out to be hugely computationally expensive), but learning from that experiment, I've got lots of options for my next one.

A friend of mine managed to track down the paper behind Google's AI blog post - which turned out to be an interesting read. It transpires that despite the bold words in the blog post, the paper is more of an initial proposal for a research project than a completed project itself. Most of the work they've done actually uses a traditional physics-based model - which they've basically thrown Google-scale compute power at to make pretty graphs - and which they've then critically evaluated to identify a number of areas in which they can improve.
They've been a bit light on details - which is probably because they either haven't started yet, or don't want to divulge their sekrets.

I also saw another interesting paper from Google entitled "Machine Learning for Precipitation Nowcasting from Radar Images", which I found because it was reported on by Ars Technica. It describes the short-term forecasting of rain from rainfall radar in the US (I'm in the UK) using a convolutional neural network-based model (specifically U-Net, apparently - I have yet to read up on it). The model they use is comprised in part of convolutional neural network (CNN) layers that downsample 256x256 tiles to a smaller size, and then upscale them back to the original size. It has some extra connections that skip parts of the model too. They claim that their model manages to improve on existing approaches for up to 6 hours in advance - so their network structure seems somewhat promising as inspiration for my own research. Initial thoughts include theories as to whether I can use CNN layers like this to sandwich a more complex recurrent layer of some description that remembers the long-term relationships? I'll have to experiment.....

Found this interesting? Got a suggestion? Confused on a point? Comment below!

## I've got an apt repository, and you can too

Hey there!

In this post, I want to talk about my apt repository. I've had it for a while, but since it's been working well for me, I thought I'd announce it to the wider world on here.

For those not in the know, an apt repository is a repository of software in a particular format that the apt package manager (found on Debian-based distributions such as Ubuntu) uses to keep software on a machine up-to-date. The apt package manager queries all the repositories it has configured to find out what versions of which packages they have available, and then compares this with those locally installed. Any packages that are out of date then get upgraded, usually after prompting you to install the updates.

Linux distributions based on Debian come with a large repository of software, but it doesn't have everything. For this reason, extra repositories are often used to deliver updates to software automatically from third parties.

In my case, I've been finding increasingly that I want to deliver updates for software that isn't packaged for installation with apt to a number of different machines. Every time I got around to installing an update, it felt like it was time to install another - so naturally I got frustrated enough with it that I decided to automate my problems away by scripting my own apt repository!

My apt repository can be found here: https://starbeamrainbowlabs.com/

It comes in 2 parts. Firstly, there's the repository itself - which is managed by a script that's based on my lantern build engine. It's this that I'll be talking about in this post.

Secondly, I have a number of as-yet adhoc custom Laminar job scripts for automatically downloading various software projects from GitHub, such that all I have to do is run laminarc queue apt-softwarename and it'll automatically package the latest version and upload it to the repository itself, which has a cron job set to fold in all of the new packages at 2am every night. The specifics of this are best explained in another post. Currently this process requires me to login and run the laminarc command manually, but I intend to automate this too in the future (I'm currently waiting for a new release of beehive to fix a nasty bug for this).
Anyway, currently I have the following software packaged in my repository:

• Gossa - A simple HTTP file browser
• The Tiled Map Editor - An amazing 2D tile-based graphical map editor. You should sponsor the developer via any of the means on the Tiled Map Editor's website before using my apt package.
• tldr-missing-pages - A small utility script for finding tldr-pages to write
• webhook - A flexible webhook system that calls binaries and shell scripts when a HTTP call is made
• I've also got a pleaserun-based service file generator packaged for the above too, in the webhook-service package

Of course, more will be coming as and when I discover and start using cool software.

The repository itself is driven by a set of scripts. These scripts were inspired by a Stack Overflow post that I have since lost, but I made a number of usability improvements and rewrote it to use my lantern build engine as I described above. I call this improved script aptosaurus, because it sounds cool.

To use it, first clone the repository:

git clone https://git.starbeamrainbowlabs.com/sbrl/aptosaurus.git

Then, create a new GPG key to sign your packages with:

gpg --full-generate-key

Next, we need to export the new keypair to disk so that we can use it in scripts. Do that like this:

# Identify the key's ID in the list this prints out
gpg --list-secret-keys

# Export the secret key
gpg --export-secret-keys --armor INSERT_KEY_ID_HERE >secret.gpg
chmod 0600 secret.gpg # Don't forget to lock down the permissions

# Export the public key
gpg --export --armor INSERT_KEY_ID_HERE >public.gpg

Then, run the setup script:

./aptosaurus.sh setup

It should warn you if anything's amiss. With the setup complete, you can now put your .deb packages in the sources subdirectory. Once done, run the update command to fold them into the repository:

./aptosaurus.sh update

Now you've got your own repository! Your next step is to setup a static web server to serve the repo subdirectory (which contains the repo itself) to the world. Personally, I use Nginx with the following config:

server {
	listen 80;
	listen [::]:80;
	listen 443 ssl http2;
	listen [::]:443 ssl http2;

	server_name apt.starbeamrainbowlabs.com;

	ssl_certificate     /etc/letsencrypt/live/$server_name/fullchain.pem;
	ssl_certificate_key /etc/letsencrypt/live/$server_name/privkey.pem;

	#add_header strict-transport-security "max-age=31536000;";
	add_header x-xss-protection "1; mode=block";
	add_header x-frame-options "sameorigin";
	add_header link '<https://starbeamrainbowlabs.com$request_uri>; rel="canonical"';

	index   index.html;
	root    /srv/aptosaurus/repo;

	include /etc/nginx/snippets/letsencrypt.conf;

	autoindex   off;
	fancyindex  on;
	fancyindex_exact_size   off;

	#location ~ /.well-known {
	#   root    /srv/letsencrypt;
	#}
}

This requires the fancyindex module for Nginx, which can be installed with sudo apt install libnginx-mod-http-fancyindex on Ubuntu-based systems.

To add your new apt repository to a machine, simply follow the instructions for my repository, replacing the domain name and the key ids with yours.
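If you haven't added a third-party apt repository by hand before, the general shape of it looks something like this (with a placeholder domain and filenames - the exact instructions for any given repository, including mine, will differ slightly):

# Trust the repository's GPG signing key
curl -sSL https://apt.example.com/aptosaurus.asc | sudo apt-key add -;
# Register the repository itself
echo "deb https://apt.example.com/ ./" | sudo tee /etc/apt/sources.list.d/example.list;
sudo apt update;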

Hopefully this release announcement-turned-guide has been either interesting, helpful, or both! Do let me know in the comments if you encounter any issues. If there's enough interest, I'll migrate the code from my personal Git server to GitHub so that people can make contributions (express said interest in the comments below).

It's worth noting that this is only a very simple apt repository. Larger apt repositories are sectioned off into multiple categories by distribution and release status (e.g. the Ubuntu repositories have xenial, bionic, eoan, etc for the version of Ubuntu, and main, universe, multiverse, restricted, etc for the different categories of software).
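For comparison, a sources.list entry for a sectioned repository like Ubuntu's names both the release and the components to enable:

deb http://archive.ubuntu.com/ubuntu/ bionic main universe multiverse restricted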

If you setup your own simple apt repository using this guide, I'd love it if you could let me know with a comment below too.

## Switching TOTP providers from Authy to andOTP

Since I first started using 2-factor authentication with TOTP (Time based One Time Passwords), I've been using Authy to store my TOTP secrets. This has worked well for a number of years, but recently I decided that I wanted to change. This was for a number of reasons:

1. I've acquired a large number of TOTP secrets for various websites and services, and I'd like a better way of sorting the list
2. Most of the web services I have TOTP secrets for don't have an icon in Authy - and there are only so many times you can repeat the 6 generic colours before it becomes totally confusing
3. I'd like the backups of my TOTP secrets to be completely self-hosted (i.e. completely on my own infrastructure)

After asking on Reddit, I received a recommendation to use andOTP (F-Droid, Google Play). After installing it, I realised that I needed to export my TOTP secrets from Authy first.

Unfortunately, it turns out that this isn't an easy process. Many guides tell you to alter the code behind the official Authy Chrome app - and since I don't have Chrome installed (I'm a Firefox user :D), that's not particularly helpful.

Thankfully, all is not lost. During my research I found the authy project on GitHub - a command-line app written in Go that temporarily registers as a 'TOTP provider' with Authy and then exports all of your TOTP secrets to a standard text file of URIs.

These can then be imported into whatever TOTP-supporting authenticator app you like. Personally, I did this by generating QR codes for each URI and scanning them into my phone. When converted to QR codes, the URIs are actually in the same format as the ones you originally scanned when setting up each website in the first place - which makes importing them easy, even out of a walled garden.
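For the curious, these are otpauth: URIs - the same standard format most authenticator apps understand. They look something like this (with a made-up secret, of course):

otpauth://totp/bobsrockets.com:someone@bobsrockets.com?secret=JBSWY3DPEHPK3PXP&issuer=bobsrockets.com&digits=6&period=30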

Generating all those QR codes manually isn't much fun though, so I automated the process. This was pretty simple:

#!/usr/bin/env bash
exec 3<&0; # Copy stdin
while read -r url; do
	echo "${url}" | qr --error-correction=H;
	read -p "Press enter to continue" <&3; # Pipe in stdin, since we override it with the read loop
done <secrets.txt;

The exec 3<&0 bit copies the standard input to file descriptor 3 for later. Then we enter a while loop, and read in the file that contains the secrets and iterate over it.

For each line, we convert it to a QR code that displays in the terminal with VT-100 ANSI escape codes with the Python program qr.
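If you don't have qr installed, it's provided by the qrcode package on PyPI - if I remember rightly, something like this will get it:

pip3 install --user qrcode; # Provides the 'qr' command-line tool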

Finally, after generating each QR code we pause for a moment until we press the enter key, so that we can generate the QR codes 1 at a time. We pipe in file descriptor 3 here that we copied earlier, because inside the while loop the standard input is the file we're reading line-by-line and not the keyboard input.

With my secrets migrated, I set to work changing the labels, images, and tags for each of them. I'm impressed by the number of different icons it supports - and since it's open-source if there's one I really want that it doesn't have, I'm sure I can open a PR to add it. It also encrypts the TOTP secrets database at rest on disk, which is pretty great.

Lastly came the backups. It looks like andOTP is pretty flexible when it comes to backups - supporting plain text files as well as various forms of encrypted file. I opted for the latter, with GPG encryption instead of a password or PIN. I'm sure it'll come back to bite me later when I struggle to decrypt the database in an emergency because I find the gpg CLI terribly difficult to use - perhaps I should take multiple backups, encrypted with a long and difficult password too.
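Note to future self: when that emergency comes, decrypting such a backup on a machine that has the right private key available should be a single command - something like this (the filename here is a guess; check what andOTP actually calls its backup files):

gpg --output otp_accounts.json --decrypt otp_accounts.json.gpg;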

To encrypt the backups with GPG, you need to have a GPG provider installed on your phone. It recommended that I install OpenKeychain for managing my GPG private keys on Android, which I did. So far, it seems to be functioning as expected too - additionally providing me with a mechanism by which I can encrypt and decrypt files easily and perform other GPG-related tasks...... if only it was this easy in the Linux terminal!

Once setup, I saved my encrypted backups directly to my Nextcloud instance, since it turns out that in Android 10 (or maybe before? I'm not sure), if you have the Nextcloud app installed, it appears as a file system provider when saving things. I'm certainly not complaining!

While I'm still experimenting with my new setup, I'm pretty happy with it at the moment. I'm still considering how I can make my TOTP backups even more secure while not compromising the '2nd factor' nature of the thing, so it's possible I might post again in the future about that.

Next on my security / privacy todo list is to configure my Keepass database to use my Solo for authentication, and possibly figure out how I can get my phone to pretend to be a keyboard to input passwords into machines I don't have my password database configured on :D

Found this interesting? Got a suggestion? Comment below!

## Cluster, Part 1

At home, I have a Raspberry Pi 3B+ as a home file server. Lately though, I've been noticing that I've been starting to grow out of it (both in terms of compute capacity and storage) - so I've decided to get thinking early about what I can do about it.

I thought of 2 different options pretty quickly:

• Build a 'proper' server
• Build a cluster out of lots of small machines

While both of these options are perfectly viable and would serve my needs well, one of them is distinctly more interesting than the other - that being the cluster. While having a 'proper' server would be much simpler and would put all my system resources on the same box - and perhaps be slightly more power efficient, though I'd need tests to confirm, since ARM (the CPU architecture I'm planning on using for the cluster) is itself very power efficient - I like the idea of building a cluster for a number of reasons.

For one, I'll learn new skills setting it up and managing it. So far, I've been mostly managing my servers by hand. When you start to acquire a number of machines though, this quickly becomes unwieldy. I'd like to experiment with a containerisation technology (I'm not sure which one yet) and play around with distributing containers across hosts - and auto-restarting them on a different host if 1 host goes down. If this is decentralised, even better!

For another, having a single larger server is a single point of failure - which would be relatively expensive to replace. If I use lots of small machines instead, then if 1 dies then not only is it cheaper to replace, but it's also not as urgent since the other machines in the cluster can take over while I order a replacement.

Finally, having a cluster is just cool. Do we really need more of a reason than this?

With all this in mind, I've been thinking quite a bit about the architecture of such a cluster. I haven't bought anything yet (and probably won't for a while yet) - because as you may have guessed from the title of this post I've been running into a number of issues that all need researching.

First though let's talk about which machines I'm planning on using. I'm actually considering 2 clusters, to solve 2 different issues: compute and storage. Compute refers to running applications (e.g. Nextcloud etc), and storage refers to a distributed storage mechanism with multiple hosts - each with 1 drive attached - though I'm unsure about the storage cluster at this stage.

For the compute cluster, I'm leaning towards 4 x Raspberry Pi 4 with 4GiB of RAM each. For the storage cluster, I'm considering a number of different boards - the plan being 3 identical boards of whichever option I settle on.

I do seem to remember a board that had USB 3 onboard, which would be useful for connecting to the external drives. Currently the plan is to use SATA to USB converters to connect to internal HDDs (e.g. WD Greens) - but I have yet to find one that doesn't either include a power connector or split the power off into a separate USB cable (more on power later). This would all be backed up by a Gigabit switch of some description (so the Rock Pi S is not a particularly attractive option, since it would be limited to 100Mbps).

I've been using HackerBoards.com to discover different boards which may fit my project - but I'm not particularly satisfied with any of the options here so far. Specifically, I'm after Gigabit Ethernet and USB 3 on the same board if possible.

The next issue is software support. I've been bitten by this before, so I'm being extra cautious this time. I'm after a board that provides good software support, so that I can actually use all the hardware I've paid for.

The other thing relating to software that I'd like, if possible, is the ability to use a systemd-free operating system - just like before, when I selected Manjaro OpenRC (which is now called Artix Linux). Since I already have a number of systems using systemd, I would like to balance it out a bit with some systems that use something else. I don't really mind if this is OpenRC, S6, or RunIt - just that it's something different to broaden my skill set.

Unfortunately, it's been a challenge to locate a distribution of Linux that both has broad support for ARM SoCs and does not use systemd. I suspect that I may have to give up on this requirement, but I'm still holding out hope that there's a distribution out there that can do what I want - even if I have to prepare the system image myself (Alpine Linux looks potentially promising, but at the moment it's a huge challenge to figure out whether a given chipset is supported or not....). Either way, from my research it looks like having mainline Linux kernel support for my board of choice is critically important to ensure continued support and updates (both feature and security) in the future.

Lastly, I also have power problems. Specifically, how to power the cluster. The big problem is that the Raspberry Pi 4 requires up to 3A - instead of the usual 2.4A of the 3B+ model. Of course, it won't be drawing that much all the time, but it's apparently important that the power supply has a ceiling of 3A to avoid issues. The problem is that most multi-port chargers can barely provide 2A to each connected device - and I have not yet seen one that would provide 3A to 4+ devices while also supporting additional peripherals such as hard drives and the other supporting boards described above.

To this end, I may end up having to build my own power supply from an old ATX supply that you can find in an old desktop PC. These can generally supply plenty of power (though it's always best to check) - but the problem here is that I'd need to do a ton of research to make sure that I wire it up correctly and safely, to avoid issues there too (I'm scared of blowing a fuse or electrocuting someone etc).

This concludes my first blog post on my cluster plans. It may be a while until the next one, as I have lots more research to do before I can continue. Suggestions and tips are welcomed in the comments below.
