Starbeamrainbowlabs


I bought a 3d printer! | Ender 3 v2 in review

Hey there! Recently, I bought an Ender 3 v2 3d printer, and now that I've used it enough I can talk about how I've found it. In this post, I'll be covering my thoughts on the 2 parts of 3d printing: the printer itself, and the slicing software that you run models through to turn them into G-code that the 3d printer understands.

(Above: A photo of my ender 3 v2 in my loft.)

Let's start with the printer itself. Compared to my earlier efforts, it is immediately apparent that the design of the printer is considerably more robust. The frame has 2 pillars that are pretty much impossible to screw together incorrectly (it will be obvious at later steps if you do it incorrectly). The drive belt does not come pre-installed though, which I found to be the most frustrating part of the build.

The instruction booklet was noticeably more sparse and unclear than the Axis 3d instructions though - at some points I found myself having to look up some assembly instructions in order to understand them.

Once assembled, the printer was fairly easy to use. It has a colour display that as I understand it has a significantly higher resolution than previous models by Creality, which allows room for things like icons which greatly enhance the usability of the printer. I suspect that this upgrade may be due to the presence of a 32 bit microprocessor (likely an ARM) rather than an 8 bit one found in previous models, which is bound to come with more RAM etc as standard.

While the printer does come with a small amount of filament, I recommend buying a reel or 2 to use with it. Because the filament that comes with it is not on a reel, it easily tangles into nasty knots. I'm going to empty a reel of other filament first before winding the white filament that came with the printer back onto the empty reel.

Loading the filament takes practice. The advice in the booklet to cut a 45° angle on the end of the filament is really important, as otherwise it's impossible to load the filament past the NEMA motor filament loading mechanism. You have to get the end of the filament at just the right angle too to catch the PTFE tubing that leads to the hot end nozzle. This does get easier though with time - personally I think I need to make or purchase an extra clip-on light, because my printer is in my loft and it can be difficult to see what I'm doing when changing the filament.

Before you print, you have to manually level the bed of the printer. This is done by adjusting the wheels under the 4 corners of the build plate until a piece of plain paper on the build plate just gently scratches the tip of the nozzle. If the wheel in 1 corner doesn't appear to go far enough, try coming back to it and doing the other corners first. By adjusting the wheel in the other corners, the corner in question will be adjusted as well. It is for this reason it is also recommended to go around 2-3 times to make sure it's all level before beginning.

Printing itself is fairly simple. You insert the microSD card containing the G-code, preheat the nozzle to the right temperature, use the auto-home feature, and then select the file you want to print using the menu. I found that it's absolutely essential that you make sure that the build plate itself is as far back as possible - just touching the end-stop - as otherwise you get nasty loud belt grinding noises when it runs through the preamble at the beginning of the G-code that causes the hot end to move all the way to the front of the build plate and back again.

Once a print is complete, I've found the supplied scraping tool to be sufficient to extract prints from the print bed. It's much easier to wait between 5 and 10 minutes for the heated bed to cool down before attempting to scrape it off - many prints can just be lifted off the bed with no scraping required (tested with some PLA filament).

Speaking of the build plate, it has a glass surface on top. My research suggested that this leads to a much more even surface on the bottom of prints, and I've certainly found this to be true. While you do have to be careful not to scratch it, the glass build plate the Ender 3 v2 comes with as standard is a nice addition to the printer.

To summarise, the Ender 3 v2 is a really nice solid printer. It's well built and relatively easy to set up and use, though filament organisation and anti-tangling will be the first project you work on when you start printing.

(Above: 4 calibration cats! 3 in blue and 1 in white - the 3 in blue have some stringing issues. I'll definitely be printing more of these :D)

Ultimaker Cura

In order to print things, you need to use a slicer, which takes a 3D model (e.g. in a .obj or .stl file) and turns it into G-code that the printer understands. My choice here is Ultimaker Cura. While it's in the default Ubuntu repositories, I found the AppImage on GitHub to be more up-to-date, so I packaged it into my apt repository.

Since Cura doesn't appear to have explicit support for the Ender 3 v2 just yet, I've been using the Ender 3 Pro profile instead, which seems to work just fine.

Cura has a large number of features which are reasonably well organised for preparing prints. You can import the aforementioned .obj or .stl files and apply various transformations to imported models, such as translate, scale, and rotate. Cura also helpfully auto-snaps the bottom of a model to the virtual build plate for you, so you don't have to worry about getting the alignment right.

Saving and loading project files is annoying. It asks you to specify the place you want to save a project to every time you hit the save button and doesn't remember the location of the project file you saved to last (unlike e.g. a text editor, GIMP, LibreOffice, etc), which is really frustrating. I guess it's better than crashing on save like the version of Cura in the default Ubuntu repositories though, so I'll count that as a win?

It would also be helpful if there was a system for remembering or bookmarking the commonly adjusted settings. I've found that I usually need to adjust the same small set of settings over and over again, and it's a pain having to find them in the "expert" settings list or using the search bar.

The preview mode is also useful, as it shows you precisely what your printer will actually end up printing. Great for checking that text is large / thick enough, that parts are large enough to avoid losing detail, and for double checking that supports will print the way you expect (I recommend trying the tree mode for supports if you have to use them).

Given that I've used Blender before (exhibit a), it would be very nice to have the ability to customise keyboard shortcuts or, even better, have a Blender keyboard shortcut scheme I could enable. Hitting R, then X, then 90 is just 1 example of a whole range of keyboard shortcuts I keep trying to use in Cura, but Cura doesn't support them.

On the whole, Cura works well as a slicer. It provides many tweakable settings for adjusting things based on the filament you're using (I need to look into making a profile for the filament I use at the moment, as I'm sure the next roll of filament will require different settings). The 3d preview window is intuitive and easy to use. While the program does have some rough edges as mentioned above, these are minor issues that could easily be corrected.

Unethically disclosed vulnerabilities in Pepperminty Wiki: My perspective

Recently, I've made a new release of my PHP-based wiki engine Pepperminty Wiki - v0.23. This would not normally be notable, but as it turns out there were a number of security issues (the severity of which varies) that needed fixing. I fixed them of course, but the manner in which they were disclosed to me was less than ethical.

In this post, I want to explain what happened from my perspective, and why I'm rather frustrated with the way the reporter handled things.

It all started with issue #222 that was opened by @hmaverickadams. In that issue, they say that they have discovered a number of vulnerabilities:

Hi,

I am a penetration tester and discovered a couple of vulnerabilities within your application. I will be applying for CVE status on the findings, but would like to work with you on the issues if possible. I could not locate an email, so please feel free to shoot me your contact info if possible.

Thank you!

So far, so good! Seems responsible, right? It did to me too. For reference, CVE there refers to the Common Vulnerabilities and Exposures, a website that tracks vulnerabilities in software from across the globe.

I would have left it at that, but I decided to check out the GitHub projects that @hmaverickadams (henceforth "the reporter") had. To my surprise, I found these public GitHub repositories:

These appeared to be created just 1 day after issue #222 was opened against Pepperminty Wiki. I was on holiday at the time (3 weeks; and I haven't been checking my GitHub notifications as often as I perhaps should), so it took me 22 days to get to it. Generally speaking I would consider a minimum of 90 days with no response before publishing a vulnerability publicly like that - this is the core of the matter, but more on this later. Here are links to these vulnerabilities on the CVE website:

You may also ask yourself "what were the vulnerabilities in question in the first place?" - glad you asked! Let's take a look.

CVE-2021-38600

Described officially as "a stored Cross Site Scripting (XSS) vulnerability", this essentially means that you can convince Pepperminty Wiki to store some arbitrary HTML (which may contain a malicious script for example) and later serve it to some poor unsuspecting visitors.

In this particular vulnerability, the reporter found that when filling out the initial setup web form that appears the first time you load Pepperminty Wiki up with a wiki name that contains arbitrary HTML, Pepperminty Wiki will blindly serve this to users.

It sounds like a big issue, but to fill out the first run web form you need the site secret, which is generated randomly and stored in peppermint.json (which itself has a check to ensure it can't be loaded through the web server). Once you realise this, it isn't actually a big deal. In fact, Pepperminty Wiki has a number of settings that by design allow one to serve arbitrary HTML:

  • editing_message - a message that appears below the page editing form and before the submit button
  • admindisplaychar - inserts text (or arbitrary HTML) before the name of an administrator
  • footer_message - a message (that may contain arbitrary HTML) that is displayed at the bottom of every page

All of these can be modified either by a moderator in the site settings page, or through peppermint.json directly.

...so personally I don't class this as a vulnerability. Regardless, I've fixed this by running the wiki name through htmlentities() - but in doing so I speculate that some special characters (e.g. quotes) will no longer display properly because of how I fixed CVE-2021-38601 (see below) - I'll continue working on this.

CVE-2021-38601

This vulnerability is described as "a reflected Cross Site Scripting (XSS) vulnerability". This is similar to CVE-2021-38600, but instead of storing a value the attack makes use of various GET parameters. There are (were, since I've fixed it) examples of GET parameters that caused this issue, including action (sets the subcommand/action that should be taken - e.g. view, edit, etc) and page (the current page on the wiki you're looking at).

Unlike CVE-2021-38600, this is a more serious vulnerability. Someone could generate a malicious link to a Pepperminty Wiki instance that instead redirects you to an attacker-controlled website (i.e. by using location.href = "blah").

Fixing it though required me to do a comprehensive review of every single line of Pepperminty Wiki's codebase, which took me multiple hours of intense programming and was really rather unpleasant. The description by the reporter in the repo was quite unhelpful:

A reflected Cross Site Scripting (XSS) vulnerability exists on multiple parameters in version 0.23-dev of the Pepperminty-Wiki application that allows for arbitrary execution of JavaScript commands.

Affected page: http://localhost/index.php

Sample payload: https://localhost/index.php?action=<script>alert(1)</script>

CVE: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-38601

I discovered in my comprehensive review that action and page were not the only parameters affected - I fixed every instance I could find.
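To illustrate why escaping neutralises the sample payload above, here is a minimal shell sketch. Pepperminty Wiki itself is written in PHP and uses htmlentities(), so html_escape below is only a hypothetical stand-in for the same idea. It handles just the 5 most common special characters, not everything htmlentities() covers.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Pepperminty Wiki): escapes the characters
# that let a value break out of an HTML text or attribute context.
# The ampersand is replaced first, otherwise the entities produced by the
# later rules would themselves get escaped a second time.
# $1    The untrusted string to escape.
html_escape() {
    printf '%s' "${1}" | sed \
        -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g' \
        -e "s/'/\&#039;/g"
}
```

Calling html_escape '<script>alert(1)</script>' prints &lt;script&gt;alert(1)&lt;/script&gt;, which a browser displays as text instead of executing.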

Reaching out

To try and understand their side of the story and gain additional insight into the vulnerabilities they discovered, I attempted to reach out to them. First, I tried opening an issue on the above GitHub repositories they created.

Instead of replying to the issues though, the reporter deleted the issues I opened, and set it so that nobody could open issues on the repositories anymore! As of the time of typing I still do not have a response from the reporter about this.

Not to be deterred, I found a pair of twitter accounts they controlled and tweeted at them:

As you can probably guess, I haven't had a response yet (though I will of course update this blog post if I do). To make absolutely sure that I got through, I also filled out the contact form on their website - also to no avail so far.

With all this in mind, I get the impression the reporter does not want to talk to me - especially since they deleted the issues I opened against their repositories instead of replying to them. This is frustrating, because I was put in a really awkward position of having to deal with a zero day vulnerability as fast as I could after they publicly disclosed the vulnerabilities (worse still, I could tell that those repositories had some significant traffic since they have been starred by 7 + 4 people as of the time of typing).

Can't find an email address?

After this (and in between comprehensively reviewing Pepperminty Wiki's codebase), I also re-read the initial issue. When re-reading it, a particular sentence also struck me as odd:

I could not locate an email, so please feel free to shoot me your contact info if possible.

This is very strange, since I have the following paragraph in Pepperminty Wiki's README:

If you've found a security issue, please don't open an issue. Instead, get in touch privately - e.g. via Keybase or by email (security [at sign] starbeamrainbowlabs [replace me with a dot] com), and I'll try to respond ASAP.

I also have my website and email address on my GitHub profile, and my website lists:

  • My email address
  • My Keybase details
  • My Twitter account
  • My Stack Exchange account
  • My reddit account
  • My GPG/PGP key id

I don't have my Discord account on there, but I can chat over that too after first using one of the above.

With this in mind, I found it to be very strange that the reporter was unable to find a means of contact to use to responsibly disclose the vulnerabilities.

CVE confusion

Now that I've fixed the vulnerabilities, I'm somewhat confused about how I update the pair of CVEs. This website gives the following instructions:

  1. Identify the CNA that published the CVE Record by searching for the CVE Record on the CVE List.
  2. Locate the responsible CNA in the “Assigning CNA” field of the CVE Record.
  3. Contact the CNA using their preferred contact method to request the update.

In my case, the assigning CNA is stated as "N/A" - I assume it's the unresponsive reporter above. I'm confused here then about how I'm supposed to update the CVE, since I can't contact the original reporter.

If anyone can suggest a way in which I can update these CVEs to reflect what has happened and the fact that I've fixed them, I'd love to know - I would hate to leave those CVEs outdated as they may misinform someone. Contact details on my website homepage. You can also leave a comment on this blog post.

Conclusion

I'm upset not because the reporter found a vulnerability - it's great they even took the time to find it in the first place in my little small-time project! I'm upset because they failed to properly disclose the vulnerabilities by privately contacting me. In fact, they would have discovered that CVE-2021-38600 is not really a vulnerability at all.

I'm also upset because despite the effort I've gone to in order to reach out, instead of opening a civil and polite discussion about the whole issue I've instead been met with doors slammed in my face (i.e. issues deleted instead of being replied to).

I wanted to document my experiences here also to educate others about ethical vulnerability / security issue disclosure. Ethics, justice, and honesty are really important to me - and I'd like to try and avoid any accidental disclosures if at all possible.

If you find a vulnerability in someone's code (be it open or closed source), I would advise you to:

  1. Go in search of a security vulnerability disclosure policy (or, for open-source projects, search the README / find the contact details for the maintainers)
  2. Contact the authors of the code (or company, if commercial) to organise responsible disclosure
  3. When a patch has been written and tested, co-ordinate the release of the patch and the disclosure of the vulnerability

Please do not release the vulnerability publicly without first contacting the author (I suggest waiting 60 days and trying multiple methods of communication). This causes maintainers of projects (who in the case of open source are mostly volunteers who pour their time into projects without asking for anything in return) a lot of stress and anxiety, as I've discovered during this incident.

Given that I haven't experienced anything like this before and that I'm only human I'm sure that my response to this incident could use some work - but the manner in which these vulnerabilities were disclosed could use a lot of work too.

stl2png Nautilus Thumbnailer

Recently I've found myself working with STL files a lot more since I bought a 3d printer (more on that in a separate post!). The .obj format is technically better, but STL is still widely used. I'm a visual sort of person, and with this in mind I like to have previews of things in my file manager. When I found no good STL thumbnailers for Nautilus (the default file manager on Ubuntu), I set out to write my own.

In my research, I discovered that OpenSCAD can be used to generate a PNG image from an STL file if one writes a small .scad file wrapper (ref), and wanted to blog about it here.

First, a screenshot of it in action:

(Above: stl2png in action as a nautilus thumbnailer. STL credit: Entitled Goose from the Untitled Goose Game)

You can find installation instructions here:

https://github.com/sbrl/nautilus-thumbnailer-stl/#nautilus-thumbnailer-stl

The original inspiration for this was twofold:

From there, wrapping it in a shell script and turning it into a nautilus thumbnailer was not too challenging. To do that, I followed this guide, which was very helpful (though the update-mime-database bit was wrong - the filepath there needs to have the /packages suffix removed).

I did encounter a few issues though. Firstly finding the name for a suitable fallback icon was not easy - I resorted to browsing the contents of /usr/share/icons/gnome/256x256/mimetypes/ in my file manager, as this spec was not helpful because STL model files don't fit neatly into any of the categories there.

The other major issue was that the script worked fine when I called it manually, but failed when I tried to use it via the nautilus thumbnailing engine. It turned out that OpenSCAD couldn't open an OpenGL context, so as a quick hack I wrapped the openscad call in xvfb-run (from X Virtual FrameBuffer; sudo apt install xvfb).
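The approach can be sketched roughly as follows. This is not the actual stl2png script: the function names, the temporary directory handling, and the 256x256 image size are assumptions for illustration, and it assumes that openscad and xvfb-run are both installed.

```shell
#!/usr/bin/env bash
# Writes a one-line OpenSCAD wrapper that imports the given STL file.
# Note: paths containing double quotes or backslashes are not handled.
# $1    Path to the STL file.
# $2    Path to write the .scad wrapper to.
make_scad_wrapper() {
    printf 'import("%s");\n' "${1}" >"${2}"
}

# Renders an STL file to a PNG with OpenSCAD inside a virtual X server,
# so that it also works where no real display is available.
# $1    Absolute path to the STL file (the wrapper lives in another directory).
# $2    Path to the PNG file to write.
render_stl() {
    tmpdir="$(mktemp -d)" || return 1
    make_scad_wrapper "${1}" "${tmpdir}/wrapper.scad"
    xvfb-run --auto-servernum \
        openscad -o "${2}" --imgsize=256,256 "${tmpdir}/wrapper.scad"
    status="$?"
    rm -rf "${tmpdir}"
    return "${status}"
}
```

Something like render_stl /absolute/path/to/model.stl /tmp/preview.png would then produce the thumbnail image.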

With those issues sorted, it worked flawlessly. I also added optional oxipng (or optipng as an optional fallback) support for optimising the generated PNG image - this I found in casual testing saved between 70% and 90% on file sizes.

Found this interesting or helpful? Comment below! It really helps motivate me.

NAS Backups, Part 2: Btrfs send / receive

Hey there! In the first post of this series, I talked about my plan for a backup NAS to complement my main NAS. In this part, I'm going to show the pair of scripts I've developed to take care of backing up btrfs snapshots.

The first script is called snapshot-send.sh, and it:

  1. Calculates which snapshot it is that requires sending
  2. Uses SSH to remote into the backup NAS
  3. Pipes the output of btrfs send to snapshot-receive.sh on the backup NAS that is called with sudo

Note that while sudo is used for calling snapshot-receive.sh, the account used to SSH into the backup NAS doesn't have completely unrestricted sudo access. Instead, a sudo rule is used to restrict it to allow only specific commands to be called (without a password, as this is intended to be a completely automated and unattended system).

The second script is called snapshot-receive.sh, and it receives the output of btrfs send and pipes it to btrfs receive. It also has some extra logic to delete old snapshots and stuff like that.

Both of these are designed to be command line programs in their own right with a simple CLI, and useful error / help messages to assist in understanding it when I come back to it to fix an issue or extend it after many months.

snapshot-send.sh

As described above, snapshot-send.sh sends a btrfs snapshot to a remote host via SSH and the snapshot-receive.sh script.

Before we continue and look at it in detail, it is important to note that snapshot-send.sh depends on btrfs-snapshot-rotation. If you haven't already done so, you should set that up first before setting up my scripts here.

If you have btrfs-snapshot-rotation set up correctly, you should have something like this in your crontab:

# Btrfs automatic snapshots
0 * * * *       cronic /root/btrfs-snapshot-rotation/btrfs-snapshot /mnt/some_btrfs_filesystem/main /mnt/some_btrfs_filesystem/main/.snapshots hourly 8
0 2 * * *       cronic /root/btrfs-snapshot-rotation/btrfs-snapshot /mnt/some_btrfs_filesystem/main /mnt/some_btrfs_filesystem/main/.snapshots daily 4
0 2 * * 7       cronic /root/btrfs-snapshot-rotation/btrfs-snapshot /mnt/some_btrfs_filesystem/main /mnt/some_btrfs_filesystem/main/.snapshots weekly 4

I use cronic there to reduce unnecessary emails. I also have a subvolume there for the snapshots:

sudo btrfs subvolume create /mnt/some_btrfs_filesystem/main/.snapshots

Because Btrfs does not take a snapshot of any child subvolumes when it takes a snapshot, I can use this to keep all my snapshots organised and associated with the subvolume they are snapshots of.

If done right, ls /mnt/some_btrfs_filesystem/main/.snapshots should result in something like this:

2021-07-25T02:00:01+00:00-@weekly  2021-08-17T07:00:01+00:00-@hourly
2021-08-01T02:00:01+00:00-@weekly  2021-08-17T08:00:01+00:00-@hourly
2021-08-08T02:00:01+00:00-@weekly  2021-08-17T09:00:01+00:00-@hourly
2021-08-14T02:00:01+00:00-@daily   2021-08-17T10:00:01+00:00-@hourly
2021-08-15T02:00:01+00:00-@daily   2021-08-17T11:00:01+00:00-@hourly
2021-08-15T02:00:01+00:00-@weekly  2021-08-17T12:00:01+00:00-@hourly
2021-08-16T02:00:01+00:00-@daily   2021-08-17T13:00:01+00:00-@hourly
2021-08-17T02:00:01+00:00-@daily   last_sent_@daily.txt
2021-08-17T06:00:01+00:00-@hourly

Ignore the last_sent_@daily.txt there for now - it's created by snapshot-send.sh so that it can remember the name of the snapshot it last sent. We'll talk about it later.

With that out of the way, let's start going through snapshot-send.sh! First up is the CLI and associated error handling:

#!/usr/bin/env bash
set -e;

dir_source="${1}";
tag_source="${2}";
tag_dest="${3}";
loc_ssh_key="${4}";
remote_host="${5}";

if [[ -z "${remote_host}" ]]; then
    echo "This script sends btrfs snapshots to a remote host via SSH.
The script snapshot-receive must be present on the remote host in the PATH for this to work.
It pairs well with btrfs-snapshot-rotation: https://github.com/mmehnert/btrfs-snapshot-rotation
Usage:
    snapshot-send.sh <snapshot_dir> <source_tag_name> <dest_tag_name> <ssh_key> <user@example.com>

Where:
    <snapshot_dir> is the path to the directory containing the snapshots
    <source_tag_name> is the tag name to look for (see btrfs-snapshot-rotation).
    <dest_tag_name> is the tag name to use when sending to the remote. This must be unique across all snapshot rotations sent.
    <ssh_key> is the path to the ssh private key
    <user@example.com> is the user@host to connect to via SSH" >&2;
    exit 0;
fi

# $EUID = effective uid
if [[ "${EUID}" -ne 0 ]]; then
    echo "Error: This script must be run as root (currently running as effective uid ${EUID})" >&2;
    exit 5;
fi

if [[ ! -e "${loc_ssh_key}" ]]; then
    echo "Error: When looking for the ssh key, no file was found at '${loc_ssh_key}' (have you checked the spelling and file permissions?)." >&2;
    exit 1;
fi
if [[ ! -d "${dir_source}" ]]; then
    echo "Error: No source directory located at '${dir_source}' (have you checked the spelling and permissions?)" >&2;
    exit 2;
fi

###############################################################################

Pretty simple stuff. snapshot-send.sh is called like so:

snapshot-send.sh /absolute/path/to/snapshot_dir SOURCE_TAG DEST_TAG_NAME path/to/ssh_key user@example.com

A few things to unpack here.

  • /absolute/path/to/snapshot_dir is the path to the directory (i.e. btrfs subvolume) containing the snapshots we want to read, as described above.
  • SOURCE_TAG: Given the directory (subvolume) name of a snapshot (e.g. 2021-08-17T02:00:01+00:00-@daily), then the source tag is the bit at the end after the at sign @ - e.g. daily.
  • DEST_TAG_NAME: The tag name to give the snapshot on the backup NAS. Useful, because you might have multiple subvolumes you snapshot with btrfs-snapshot-rotation and they all might have snapshots with the daily tag.
  • path/to/ssh_key: The path to the (unencrypted!) SSH key to use to SSH into the remote backup NAS.
  • user@example.com: The user and hostname of the backup NAS to SSH into.
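As a small illustration of how the source tag relates to a snapshot's directory name, here's a sketch using shell parameter expansion. The snapshot_tag function is hypothetical and isn't part of snapshot-send.sh, which instead matches snapshots with a *@TAG glob.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the tag of a snapshot directory name, i.e.
# everything after the last at sign. If there is no at sign at all, the
# input is printed back unchanged.
# $1    The snapshot directory name (or a full path to it).
snapshot_tag() {
    printf '%s\n' "${1##*@}"
}
```

For example, snapshot_tag "2021-08-17T02:00:01+00:00-@daily" prints daily.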

This is a good time to sort out the remote user we're going to SSH into (we'll sort out snapshot-receive.sh and the sudo rules in the next section below).

Assuming that you already have a Btrfs filesystem set up and automounting on boot on the remote NAS, do this:


sudo useradd --system --home /absolute/path/to/btrfs-filesystem/backups backups
sudo groupadd backup-senders
sudo usermod -a -G backup-senders backups
cd /absolute/path/to/btrfs-filesystem/backups
sudo mkdir .ssh
sudo touch .ssh/authorized_keys
sudo chown -R backups:backups .ssh
sudo chmod -R u=rwX,g=rX,o-rwx .ssh

Then, on the main NAS, generate the SSH key:


mkdir -p /root/backups && cd /root/backups
ssh-keygen -t ed25519 -C backups@main-nas -f /root/backups/ssh_key_backup_nas_ed25519

Then, copy the generated SSH public key to the authorized_keys file on the backup NAS (located at /absolute/path/to/btrfs-filesystem/backups/.ssh/authorized_keys).
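One way to do the copying step, sketched under some assumptions: the public key (ssh-keygen writes it next to the private key with a .pub suffix) has already been transferred to the backup NAS by some other means, and the hypothetical add_authorized_key function below is then run there with enough privileges to write to the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: appends a public key to an authorized_keys file,
# unless that exact line is already present (so it's safe to run twice).
# $1    Path to the public key file (e.g. ssh_key_backup_nas_ed25519.pub)
# $2    Path to the authorized_keys file
add_authorized_key() {
    # -F: fixed strings, -x: match whole lines, -f: read patterns from file
    if grep -qxF -f "${1}" "${2}" 2>/dev/null; then
        return 0
    fi
    cat "${1}" >>"${2}"
}
```

Checking first means re-running your setup steps won't fill authorized_keys with duplicate entries.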

Now that's sorted, let's continue with snapshot-send.sh. Next up are a few miscellaneous functions:


# The filepath to the last sent text file that contains the name of the snapshot that was last sent to the remote.
# If this file doesn't exist, then we send a full snapshot to start with.
# We need to keep track of this because we need this information to know which
# snapshot we need to parent the latest snapshot from to send snapshots incrementally.
filepath_last_sent="${dir_source}/last_sent_@${tag_source}.txt";

## Logs a message to stdout.
# $*    The message to log.
log_msg() {
    echo "[ $(date +%Y-%m-%dT%H:%M:%S) ] remote/${HOSTNAME}: >>> ${*}";
}

## Lists all the currently available snapshots for the current source tag.
list_snapshots() {
    find "${dir_source}" -maxdepth 1 ! -path "${dir_source}" -name "*@${tag_source}" -type d;
}

## Returns an exit code of 0 if we've sent a snapshot, or 1 if we haven't.
have_sent() {
    if [[ ! -f "${filepath_last_sent}" ]]; then
        return 1;
    else
        return 0;
    fi
}

## Fetches the directory name of the last snapshot sent to the remote with the given tag name.
last_sent() {
    if [[ -f "${filepath_last_sent}" ]]; then
        cat "${filepath_last_sent}";
    fi
}

# Runs snapshot-receive on the remote host.
do_ssh() {
    ssh -o "ServerAliveInterval=900" -i "${loc_ssh_key}" "${remote_host}" sudo snapshot-receive "${tag_dest}";
}

Particularly of note is the filepath_last_sent variable - this is set to the path to that text file I mentioned earlier.

Other than that it's all pretty well commented, so let's continue on. Next, we need to determine the name of the latest snapshot:

latest_snapshot="$(list_snapshots | sort | tail -n1)";
# basename (not dirname) gives us the snapshot's own directory name
latest_snapshot_dirname="$(basename "${latest_snapshot}")";

With this information in hand we can compare it to the last snapshot name we sent. We store this in the text file mentioned above - the path to which is stored in the filepath_last_sent variable.

# last_sent prints nothing if the state file doesn't exist yet
if [[ "${latest_snapshot_dirname}" == "$(last_sent)" ]]; then
    if [[ -z "${FORCE_SEND}" ]]; then
        echo "We've sent the latest snapshot '${latest_snapshot_dirname}' already and the FORCE_SEND environment variable is empty or not specified, skipping";
        exit 0;
    else
        echo "We've sent it already, but sending it again since the FORCE_SEND environment variable is specified";
    fi
fi

If the latest snapshot has the same name as the one we last sent, we exit out - unless the FORCE_SEND environment variable is specified (to allow for an easy way to fix stuff if it goes wrong on the other end).
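To make that decision easier to try out in isolation (without btrfs, root, or SSH), here's a sketch that can be pointed at any directory. The function names latest_snapshot_name and should_send are hypothetical - they just restate the logic above rather than being lifted from snapshot-send.sh.

```shell
#!/usr/bin/env bash
# Prints the directory name (not the full path) of the newest snapshot with
# the given tag. ISO 8601 timestamps sort chronologically as plain text as
# long as they all share the same UTC offset, so sort + tail is enough.
# $1    Directory containing the snapshots
# $2    Tag name (e.g. daily)
latest_snapshot_name() {
    find "${1}" -maxdepth 1 ! -path "${1}" -name "*@${2}" -type d \
        | sort | tail -n1 | sed 's|.*/||'
}

# Returns 0 if the snapshot should be sent, or 1 if it can be skipped.
# $1    Name of the latest snapshot
# $2    Path to the last-sent state file
should_send() {
    # FORCE_SEND overrides everything else
    if [ -n "${FORCE_SEND:-}" ]; then return 0; fi
    # No state file means nothing has been sent yet
    if [ ! -f "${2}" ]; then return 0; fi
    # Otherwise, only send if the latest snapshot isn't the one last sent
    [ "${1}" != "$(cat "${2}")" ]
}
```

You can test it with a temporary directory containing a few empty directories named like the snapshots listed earlier.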

Now, we can actually send the snapshot to the remote:


if ! have_sent; then
    log_msg "Sending initial snapshot $(basename "${latest_snapshot}")";
    btrfs send "${latest_snapshot}" | do_ssh;
else
    parent_snapshot="${dir_source}/$(last_sent)";
    if [[ ! -d "${parent_snapshot}" ]]; then
        echo "Error: Failed to locate parent snapshot at '${parent_snapshot}'" >&2;
        exit 3;
    fi

    log_msg "Sending incremental snapshot $(basename "${latest_snapshot}") parent $(last_sent)";
    btrfs send -p "${parent_snapshot}" "${latest_snapshot}" | do_ssh;
fi

have_sent simply determines whether we have sent a snapshot before. We know this by checking whether the filepath_last_sent text file exists.

If we haven't, then we send a full snapshot rather than an incremental one. If we're sending an incremental one, then we find the parent snapshot (i.e. the one we last sent). If we can't find it, we generate an error (it's because of this that you need to store at least 2 snapshots at a time with btrfs-snapshot-rotation).
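
The choice between a full and an incremental send can be previewed with a dry run. The sketch below only prints the btrfs send command it would use rather than running it. plan_send is a hypothetical helper, and /mnt/snapshots and the snapshot names are placeholders.

```shell
# Hypothetical dry-run helper: prints the btrfs send command that would be run.
# $1: directory containing the snapshots
# $2: name of the latest snapshot
# $3: name of the last snapshot sent (empty if nothing has been sent yet)
plan_send() {
    if [ -z "${3:-}" ]; then
        # Nothing has been sent before, so the whole snapshot is sent
        echo "btrfs send ${1}/${2}";
    else
        # Incremental: only the differences relative to the parent are sent
        echo "btrfs send -p ${1}/${3} ${1}/${2}";
    fi
}

plan_send "/mnt/snapshots" "2021-01-02@daily" "";
plan_send "/mnt/snapshots" "2021-01-02@daily" "2021-01-01@daily";
```

The second call shows why the parent snapshot has to still exist locally: it is passed to btrfs send with -p as the base that the differences are calculated against.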

After sending a snapshot, we need to update the filepath_last_sent text file:

log_msg "Updating state information";
basename "${latest_snapshot}" >"${filepath_last_sent}";
log_msg "Snapshot sent successfully";

....and that concludes snapshot-send.sh! Once you've finished reading this blog post and testing your setup, put your snapshot-send.sh calls in a script in /etc/cron.daily or something.

snapshot-receive.sh

Next up is the receiving end of the system. The CLI for this script is much simpler, on account of sudo rules only allowing exact and specific commands (no wildcards or regex of any kind). I put snapshot-receive.sh in /usr/local/sbin and called it snapshot-receive.

Let's get started:

#!/usr/bin/env bash

# This script wraps btrfs receive so that it can be called by non-root users.
# It should be saved to '/usr/local/sbin/snapshot-receive' (without quotes, of course).
# The following entry needs to be put in the sudoers file:
# 
# %backup-senders   ALL=(ALL) NOPASSWD: /usr/local/sbin/snapshot-receive TAG_NAME
# 
# ....replacing TAG_NAME with the name of tag you want to allow. You'll need 1 line in your sudoers file per tag you want to allow.
# Edit your sudoers file like this:
# sudo visudo

# The ABSOLUTE path to the target directory to receive to.
target_dir="CHANGE_ME";

# The maximum number of backups to keep.
max_backups="7";

# Allow only alphanumeric characters in the tag
tag="$(echo "${1}" | tr -cd '[:alnum:]-_')";

snapshot-receive.sh only takes a single argument, and that's the tag it should use for the snapshot being received:


sudo snapshot-receive DEST_TAG_NAME

The target directory it should save snapshots to is stored as a variable at the top of the file (the target_dir there). You should change this based on your specific setup. It goes without saying, but the target directory needs to be a directory on a btrfs filesystem (preferably raid1, though as I've said before btrfs raid1 is a misnomer). We also ensure that the tag contains only safe characters for security.
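
To get a feel for what the tag filtering does, here is a small sketch that runs a few made-up inputs through the same kind of tr filter. sanitise_tag is a hypothetical wrapper, and the hyphen is placed last in the set here so that tr can't mistake it for part of a range.

```shell
# Hypothetical wrapper around the same kind of filter as the tag="$(...)" line above.
# Keeps only alphanumeric characters, underscores, and hyphens.
# $1: the raw tag to filter
sanitise_tag() {
    printf '%s' "${1}" | tr -cd '[:alnum:]_-';
}

sanitise_tag "daily-2021";    echo;  # kept as-is: daily-2021
sanitise_tag "../../etc";     echo;  # dots and slashes are dropped: etc
sanitise_tag "daily; reboot"; echo;  # semicolon and space are dropped: dailyreboot
```

Because slashes and dots never survive the filter, the tag can't be used to escape from the target directory.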

max_backups is the maximum number of snapshots to keep. Any older snapshots will be deleted.

Next, some error handling:

###############################################################################

# $EUID = effective uid
if [[ "${EUID}" -ne 0 ]]; then
    echo "Error: This script must be run as root (currently running as effective uid ${EUID})" >&2;
    exit 5;
fi

if [[ -z "${tag}" ]]; then
    echo "Error: No tag specified. It should be specified as the 1st and only argument, and may only contain alphanumeric characters." >&2;
    echo "Example:" >&2;
    echo "    snapshot-receive TAG_NAME_HERE" >&2;
    exit 4;
fi

Nothing too exciting. Continuing on, a pair of useful helper functions:


###############################################################################

## Logs a message to stdout.
# $*    The message to log.
log_msg() {
    echo "[ $(date +%Y-%m-%dT%H:%M:%S) ] remote/${HOSTNAME}: >>> ${*}";
}

list_backups() {
    find "${target_dir}/${tag}" -maxdepth 1 ! -path "${target_dir}/${tag}" -type d;
}

list_backups lists the snapshots with the given tag, and log_msg logs messages to stdout (not stderr unless there's an error, because otherwise cronic will dutifully send you an email every time the scripts execute). Next up, more error handling:

###############################################################################

if [[ "${target_dir}" == "CHANGE_ME" ]]; then
    echo "Error: target_dir was not changed from the default value." >&2;
    exit 1;
fi

if [[ ! -d "${target_dir}" ]]; then
    echo "Error: No directory was found at '${target_dir}'." >&2;
    exit 2;
fi

if [[ ! -d "${target_dir}/${tag}" ]]; then
    log_msg "Creating new directory at ${target_dir}/${tag}";
    mkdir "${target_dir}/${tag}";
fi

We check:

  • That the target directory was changed from the default CHANGE_ME value
  • That the target directory exists

We also create a subdirectory for the given tag if it doesn't exist already.

With the preamble completed, we can actually receive the snapshot:

log_msg "Launching btrfs in chroot mode";

time nice ionice -c Idle btrfs receive --chroot "${target_dir}/${tag}";

We use nice and ionice to reduce the priority of the receive to the lowest possible level. If you're using a Raspberry Pi (I have a Raspberry Pi 4 with 4GB RAM) like I am, this is important for stability (Pis tend to fall over otherwise). Don't worry if you experience some system crashes on your Pi when transferring the first snapshot - I've found that incremental snapshots don't cause the same issue.

We also use the chroot option there for increased security.

Now that the snapshot is transferred, we can delete old snapshots if we have too many:

backups_count="$(echo -e "$(list_backups)" | wc -l)";

log_msg "Btrfs finished, we now have ${backups_count} backups:";
list_backups;

while [[ "${backups_count}" -gt "${max_backups}" ]]; do
    oldest_backup="$(list_backups | sort | head -n1)";
    log_msg "Maximum number of backups is ${max_backups}, requesting removal of backup for $(basename "${oldest_backup}")";

    btrfs subvolume delete "${oldest_backup}";

    backups_count="$(echo -e "$(list_backups)" | wc -l)";
done

log_msg "Done, any removed backups will be deleted in the background";
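
The rotation loop above can be tried out safely without a btrfs filesystem. This is only a sketch: it uses plain directories with made-up names inside a temporary directory, and rmdir stands in for btrfs subvolume delete. list_dirs and prune_oldest are hypothetical names rather than functions from snapshot-receive.sh.

```shell
# Stand-in for list_backups: lists the immediate subdirectories of $1.
list_dirs() {
    find "${1}" -maxdepth 1 ! -path "${1}" -type d;
}

# Removes the oldest (first when sorted) directories in $1 until at most $2 remain.
prune_oldest() {
    prune_dir="${1}";
    prune_max="${2}";
    prune_count="$(list_dirs "${prune_dir}" | wc -l)";
    while [ "${prune_count}" -gt "${prune_max}" ]; do
        prune_target="$(list_dirs "${prune_dir}" | sort | head -n1)";
        # The real script would run: btrfs subvolume delete "${prune_target}"
        rmdir "${prune_target}";
        prune_count="$(list_dirs "${prune_dir}" | wc -l)";
    done
}

# Create 5 fake daily snapshots, then keep only the newest 3
demo_dir="$(mktemp -d)";
for day in 01 02 03 04 05; do
    mkdir "${demo_dir}/2021-01-${day}";
done

prune_oldest "${demo_dir}" 3;
remaining="$(ls "${demo_dir}" | paste -sd ' ' -)";
echo "Remaining: ${remaining}";

rm -r "${demo_dir}";
```

This relies on the snapshot names sorting in date order, which is why a timestamp-like naming scheme matters here.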

Sorted! The only thing left to do here is to set up those sudo rules. Let's do that now. Execute sudo visudo (as noted in the comment at the top of the script; it checks the syntax of /etc/sudoers before saving), and enter the following:

%backup-senders ALL=(ALL) NOPASSWD: /usr/local/sbin/snapshot-receive TAG_NAME

Replace TAG_NAME with the DEST_TAG_NAME you're using. You'll need 1 entry in /etc/sudoers for each DEST_TAG_NAME you're using.

We assign the rights to the backup-senders group we created earlier, of which the user we are going to SSH in with is a member. This makes the system more flexible should we want to extend it later.

Warning: A mistake in /etc/sudoers can leave you unable to use sudo! Make sure you have a root shell open in the background and that you test sudo again after making changes to ensure you haven't made a mistake.

That completes the setup of snapshot-receive.sh.

Conclusion

With snapshot-send.sh and snapshot-receive.sh, we now have a system for transferring snapshots from 1 host to another via SSH. If combined with full disk encryption (e.g. with LUKS), this provides a secure backup system with a number of desirable qualities:

  • The main NAS can't access the backups on the backup NAS (in case of ransomware)
  • Backups are encrypted during transfer (via SSH)
  • Backups are encrypted at rest (LUKS)

To further secure the backup NAS, one could:

  • Disable SSH password login
  • Automatically start / shutdown the backup NAS (though with full disk encryption when it boots up it would require manual intervention)

At the bottom of this post I've included the full scripts for you to copy and paste.

As it turns out, there will be 1 more post in this series, which will cover generating multiple streams of backups (e.g. weekly, monthly) from a single stream of e.g. daily backups on my backup NAS.

Full scripts

snapshot-send.sh

#!/usr/bin/env bash
set -e;

dir_source="${1}";
tag_source="${2}";
tag_dest="${3}";
loc_ssh_key="${4}";
remote_host="${5}";

if [[ -z "${remote_host}" ]]; then
    echo "This script sends btrfs snapshots to a remote host via SSH.
The script snapshot-receive must be present on the remote host in the PATH for this to work.
It pairs well with btrfs-snapshot-rotation: https://github.com/mmehnert/btrfs-snapshot-rotation
Usage:
    snapshot-send.sh <snapshot_dir> <source_tag_name> <dest_tag_name> <ssh_key> <user@example.com>

Where:
    <snapshot_dir> is the path to the directory containing the snapshots
    <source_tag_name> is the tag name to look for (see btrfs-snapshot-rotation).
    <dest_tag_name> is the tag name to use when sending to the remote. This must be unique across all snapshot rotations sent.
    <ssh_key> is the path to the ssh private key
    <user@example.com> is the user@host to connect to via SSH" >&2;
    exit 0;
fi

# $EUID = effective uid
if [[ "${EUID}" -ne 0 ]]; then
    echo "Error: This script must be run as root (currently running as effective uid ${EUID})" >&2;
    exit 5;
fi

if [[ ! -e "${loc_ssh_key}" ]]; then
    echo "Error: When looking for the ssh key, no file was found at '${loc_ssh_key}' (have you checked the spelling and file permissions?)." >&2;
    exit 1;
fi
if [[ ! -d "${dir_source}" ]]; then
    echo "Error: No source directory located at '${dir_source}' (have you checked the spelling and permissions?)" >&2;
    exit 2;
fi

###############################################################################

# The filepath to the last sent text file that contains the name of the snapshot that was last sent to the remote.
# If this file doesn't exist, then we send a full snapshot to start with.
# We need to keep track of this because we need this information to know which
# snapshot we need to parent the latest snapshot from to send snapshots incrementally.
filepath_last_sent="${dir_source}/last_sent_@${tag_source}.txt";

## Logs a message to stdout.
# $*    The message to log.
log_msg() {
    echo "[ $(date +%Y-%m-%dT%H:%M:%S) ] remote/${HOSTNAME}: >>> ${*}";
}

## Lists all the currently available snapshots for the current source tag.
list_snapshots() {
    find "${dir_source}" -maxdepth 1 ! -path "${dir_source}" -name "*@${tag_source}" -type d;
}

## Returns an exit code of 0 if we've sent a snapshot, or 1 if we haven't.
have_sent() {
    if [[ ! -f "${filepath_last_sent}" ]]; then
        return 1;
    else
        return 0;
    fi
}

## Fetches the directory name of the last snapshot sent to the remote with the given tag name.
last_sent() {
    if [[ -f "${filepath_last_sent}" ]]; then
        cat "${filepath_last_sent}";
    fi
}

do_ssh() {
    ssh -o "ServerAliveInterval=900" -i "${loc_ssh_key}" "${remote_host}" sudo snapshot-receive "${tag_dest}";
}

latest_snapshot="$(list_snapshots | sort | tail -n1)";
latest_snapshot_dirname="$(basename "${latest_snapshot}")";

if [[ "${latest_snapshot_dirname}" == "$(last_sent)" ]]; then
    if [[ -z "${FORCE_SEND}" ]]; then
        echo "We've sent the latest snapshot '${latest_snapshot_dirname}' already and the FORCE_SEND environment variable is empty or not specified, skipping";
        exit 0;
    else
        echo "We've sent it already, but sending it again since the FORCE_SEND environment variable is specified";
    fi
fi

if ! have_sent; then
    log_msg "Sending initial snapshot $(basename "${latest_snapshot}")";
    btrfs send "${latest_snapshot}" | do_ssh;
else
    parent_snapshot="${dir_source}/$(last_sent)";
    if [[ ! -d "${parent_snapshot}" ]]; then
        echo "Error: Failed to locate parent snapshot at '${parent_snapshot}'" >&2;
        exit 3;
    fi

    log_msg "Sending incremental snapshot $(basename "${latest_snapshot}") parent $(last_sent)";
    btrfs send -p "${parent_snapshot}" "${latest_snapshot}" | do_ssh;
fi


log_msg "Updating state information";
basename "${latest_snapshot}" >"${filepath_last_sent}";
log_msg "Snapshot sent successfully";

snapshot-receive.sh

#!/usr/bin/env bash

# This script wraps btrfs receive so that it can be called by non-root users.
# It should be saved to '/usr/local/sbin/snapshot-receive' (without quotes, of course).
# The following entry needs to be put in the sudoers file:
# 
# %backup-senders   ALL=(ALL) NOPASSWD: /usr/local/sbin/snapshot-receive TAG_NAME
# 
# ....replacing TAG_NAME with the name of tag you want to allow. You'll need 1 line in your sudoers file per tag you want to allow.
# Edit your sudoers file like this:
# sudo visudo

# The ABSOLUTE path to the target directory to receive to.
target_dir="CHANGE_ME";

# The maximum number of backups to keep.
max_backups="7";

# Allow only alphanumeric characters in the tag
tag="$(echo "${1}" | tr -cd '[:alnum:]-_')";

###############################################################################

# $EUID = effective uid
if [[ "${EUID}" -ne 0 ]]; then
    echo "Error: This script must be run as root (currently running as effective uid ${EUID})" >&2;
    exit 5;
fi

if [[ -z "${tag}" ]]; then
    echo "Error: No tag specified. It should be specified as the 1st and only argument, and may only contain alphanumeric characters." >&2;
    echo "Example:" >&2;
    echo "    snapshot-receive TAG_NAME_HERE" >&2;
    exit 4;
fi

###############################################################################

## Logs a message to stdout.
# $*    The message to log.
log_msg() {
    echo "[ $(date +%Y-%m-%dT%H:%M:%S) ] remote/${HOSTNAME}: >>> ${*}";
}

list_backups() {
    find "${target_dir}/${tag}" -maxdepth 1 ! -path "${target_dir}/${tag}" -type d;
}

###############################################################################

if [[ "${target_dir}" == "CHANGE_ME" ]]; then
    echo "Error: target_dir was not changed from the default value." >&2;
    exit 1;
fi

if [[ ! -d "${target_dir}" ]]; then
    echo "Error: No directory was found at '${target_dir}'." >&2;
    exit 2;
fi

if [[ ! -d "${target_dir}/${tag}" ]]; then
    log_msg "Creating new directory at ${target_dir}/${tag}";
    mkdir "${target_dir}/${tag}";
fi

log_msg "Launching btrfs in chroot mode";

time nice ionice -c Idle btrfs receive --chroot "${target_dir}/${tag}";

backups_count="$(echo -e "$(list_backups)" | wc -l)";

log_msg "Btrfs finished, we now have ${backups_count} backups:";
list_backups;

while [[ "${backups_count}" -gt "${max_backups}" ]]; do
    oldest_backup="$(list_backups | sort | head -n1)";
    log_msg "Maximum number of backups is ${max_backups}, requesting removal of backup for $(basename "${oldest_backup}")";

    btrfs subvolume delete "${oldest_backup}";

    backups_count="$(echo -e "$(list_backups)" | wc -l)";
done

log_msg "Done, any removed backups will be deleted in the background";

PhD, Update 9: Results?

Hey, it's time for another PhD update blog post! Since last time, I've been working on tweet classification mostly, but I have a small bit to update you on with the Temporal CNN.

Here's a list of posts in this series so far before we continue. If you haven't already, you should read them as they give some context for what I'll be talking about in this post.

Note that the results and graphs in this blog post are not final, and may change as I further evaluate the performance of the models I've trained (for example I recently found and fixed a bug where I misclassified all tweets without emojis as positive tweets by mistake).

Tweet classification

After trying a bunch of different tweaks and variants of models, I now have an AI model that classifies tweets. It's not perfect, but I am reaching ~80% validation accuracy.

The model itself is trained to predict the sentiment of a tweet, with the label coming from a preset list of emojis I compiled manually (by looking at a list of the emojis in the dataset itself, which I wrote a quick Bash one-liner to calculate). Each tweet is classified by adding up how many emojis in each category it has in it, and the category with the most emojis is then chosen as the label to predict. Here's the list of emojis I've compiled:

Category Emojis
positive ๐Ÿ˜‚๐Ÿ‘๐Ÿคฃโค๐Ÿ‘๐Ÿ˜Š๐Ÿ˜‰๐Ÿ˜๐Ÿ˜˜๐Ÿ˜๐Ÿค—๐Ÿ’•๐Ÿ˜€๐Ÿ’™๐Ÿ’ชโœ…๐Ÿ™Œ๐Ÿ’š๐Ÿ‘Œ๐Ÿ™‚๐Ÿ˜Ž๐Ÿ˜†๐Ÿ˜…โ˜บ๐Ÿ˜ƒ๐Ÿ˜ป๐Ÿ’–๐Ÿ’‹๐Ÿ’œ๐Ÿ˜น๐Ÿ˜œโ™ฅ๐Ÿ˜„๐Ÿ’›๐Ÿ˜ฝโœ”๐Ÿ‘‹๐Ÿ’—โœจ๐ŸŒน๐ŸŽ‰๐Ÿ‘Š๐Ÿ˜‹๐Ÿ˜๐ŸŒž๐Ÿ˜‡๐ŸŽถโญ๐Ÿ’ž๐Ÿ˜บ๐Ÿ˜ธ๐Ÿ–ค๐ŸŒธ๐Ÿ’๐Ÿ€๐ŸŒผ๐ŸŒŸ๐Ÿค๐ŸŒท๐Ÿฑ๐Ÿค“๐Ÿ˜Œ๐Ÿ˜›๐Ÿ˜™
negative ๐Ÿฅถโš ๐Ÿ’”๐Ÿ˜ฐ๐Ÿ™„๐Ÿ˜ฑ๐Ÿ˜ญ๐Ÿ˜ณ๐Ÿ˜ข๐Ÿ˜ฌ๐Ÿคฆ๐Ÿ˜ก๐Ÿ˜ฉ๐Ÿ™ˆ๐Ÿ˜”โ˜น๐Ÿ˜ฎโŒ๐Ÿ˜ฃ๐Ÿ˜•๐Ÿ˜ฒ๐Ÿ˜ฅ๐Ÿ˜ž๐Ÿ˜ซ๐Ÿ˜Ÿ๐Ÿ˜ด๐Ÿ™€๐Ÿ˜’๐Ÿ™๐Ÿ˜ ๐Ÿ˜ช๐Ÿ˜ฏ๐Ÿ˜จ๐Ÿ‘Ž๐Ÿคข๐Ÿ’€๐Ÿ˜ค๐Ÿ˜๐Ÿ˜–๐Ÿ˜๐Ÿ˜ˆ๐Ÿ˜‘๐Ÿ˜“๐Ÿ˜ฟ๐Ÿ˜ต๐Ÿ˜ง๐Ÿ˜ถ๐Ÿ˜ฆ๐Ÿคฅ๐Ÿค๐Ÿคง๐ŸŒงโ˜”๐Ÿ’จ๐ŸŒŠ๐Ÿ’ฆ๐Ÿšจ๐ŸŒฌ๐ŸŒชโ›ˆ๐ŸŒ€โ„๐Ÿ’ง๐ŸŒจโšก๐Ÿฆ†๐ŸŒฉ๐ŸŒฆ๐ŸŒ‚

These categories are stored in a tab separated values file (TSV), allowing the program I've written that trains new models to be completely customisable. Given that there were a significant number of tweets with flood-related emojis (🌧☔💨🌊💦🚨🌬🌪⛈🌀❄💧🌨⚡🦆🌩🌦🌂) in my dataset, I tried separating them out into their own class, but it didn't work very well - more on this later.

The dataset I've downloaded comes from a number of different queries and hashtags, and is comprised of just under 1.4 million tweets, of which ~90K tweets have at least 1 emoji in the above categories.

A pair of pie charts showing statistics about the twitter dataset.

Tweets were downloaded with a tool I implemented using a number of flood related hashtags and query strings - from generic ones like #flood to specific ones such as #StormChristoph.

Of the entire dataset, ~6.44% of tweets contain at least 1 emoji. Of those, ~49.57% are positive, and ~50.43% are negative:

Category Number of tweets
No emoji 1306061
Negative 45367
Positive 44585
Total tweets 1396013
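
The percentages quoted above can be re-derived from the counts in the table. The following is a quick sketch using shell arithmetic and awk; pct is a made-up helper name, and the counts are copied straight from the table.

```shell
# Counts copied from the table above
no_emoji=1306061;
negative=45367;
positive=44585;

total=$(( no_emoji + negative + positive ));
with_emoji=$(( negative + positive ));

# Prints $1 as a percentage of $2, to 2 decimal places.
pct() {
    awk -v n="${1}" -v d="${2}" 'BEGIN { printf "%.2f", 100 * n / d }';
}

echo "Total tweets: ${total}";
echo "With at least 1 emoji: ${with_emoji} ($(pct "${with_emoji}" "${total}")% of all tweets)";
echo "Positive: $(pct "${positive}" "${with_emoji}")% of tweets with an emoji";
echo "Negative: $(pct "${negative}" "${with_emoji}")% of tweets with an emoji";
```

The near 50/50 split matters for the accuracy figures: with classes this balanced, a model that always predicted one class would only reach about 50% accuracy.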

The models I've trained so far are either a transformer, or a 2-layer LSTM with 128 units per layer - plus a dense layer with a softmax activation function at the end for classification. If batch normalisation is enabled, every layer except the dense layer has a batch normalisation layer afterwards:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bidirectional (Bidirectional (None, 100, 256)          336896    
_________________________________________________________________
batch_normalization (BatchNo (None, 100, 256)          1024      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               394240    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
dense (Dense)                (None, 2)                 514       
=================================================================
Total params: 733,698
Trainable params: 732,674
Non-trainable params: 1,024
_________________________________________________________________
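
As a sanity check, the parameter counts in the summary can be reproduced by hand. This sketch assumes the standard LSTM layout (4 gates, each with input weights, recurrent weights, and a bias) and an input feature size of 200 per timestep. The latter isn't stated in this post: it's simply the value that makes the first layer's count come out to 336896.

```shell
# Parameters in a single LSTM layer: 4 gates, each with input weights,
# recurrent weights, and a bias vector.
# $1: input feature size, $2: number of units
lstm_params() {
    echo $(( 4 * (${2} * (${1} + ${2}) + ${2}) ));
}

# The bidirectional wrapper holds 2 separate LSTMs (forwards and backwards).
bidi_lstm_params() {
    echo $(( 2 * $(lstm_params "${1}" "${2}") ));
}

units=128;
input_size=200; # ASSUMPTION: inferred from the parameter count, not stated in the post

layer_1="$(bidi_lstm_params "${input_size}" "${units}")";
# The 2nd layer takes the 256 features (2 x 128) output by the 1st
layer_2="$(bidi_lstm_params "$(( 2 * units ))" "${units}")";
# Scale, offset, moving mean, and moving variance for each of the 256 features
batchnorm=$(( 4 * 2 * units ));
# 256 inputs x 2 classes, plus 2 biases
dense=$(( 2 * units * 2 + 2 ));
total=$(( layer_1 + batchnorm + layer_2 + batchnorm + dense ));

echo "bidirectional:   ${layer_1}";
echo "bidirectional_1: ${layer_2}";
echo "batch norm (x2): ${batchnorm}";
echo "dense:           ${dense}";
echo "total:           ${total}";
```

The 1,024 non-trainable parameters are the moving mean and moving variance of the 2 batch normalisation layers (2 x 2 x 256), which are updated during training but not by gradient descent.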

I tried using more units and varying the number of layers, but it didn't help much (in hindsight this was before I found the bug where tweets without emojis were misclassified as positive).

Now for the exciting part! I split the tweets that had at least 1 emoji into 2 datasets: 80% for training, and 20% for validation. I trained a few different models tweaking various different parameters, and got the following results:

2 categories: training accuracy

2 categories: validation accuracy

Each model has a code that is comprised of multiple parts:

  • c2: 2 categories
  • bidi: LSTM, bidirectional wrapper enabled
  • nobidi: LSTM, bidirectional wrapper disabled
  • batchnorm: LSTM, batch normalisation enabled
  • nobatchnorm: LSTM, batch normalisation disabled
  • transformer: Transformer (just the encoder part)

It seems that if you're an LSTM, being bidirectional does result in a boost to both stability and performance (performance here meaning how well the model works, rather than how fast it runs - that would be efficiency instead). Batch normalisation doesn't appear to help all that much - I speculate this is because the models I'm training really aren't all that deep, and batch normalisation apparently helps most with deep models (e.g. ResNet).

For completeness, here are the respective graphs for 3 categories (positive, negative, and flood):

3 Categories: Training accuracy

3 Categories: Validation accuracy

Unfortunately, 3 categories didn't work quite as well as I'd hoped. To understand why, I plotted some confusion matrices at epochs 11 (left) and 44 (right). For both the 2 and 3 category models I've picked out the charts for bidi-nobatchnorm, as it was the best performing (see above).

2 Categories: confusion matrices

3 Categories: confusion matrices

The 2 category model looks to be reasonably accurate with no enormous issues, as we expected from the ~80% accuracy from above. Twice the number of negative tweets are misclassified as positive compared to the other way around for some reason - it's possible that negative tweets are more difficult to interpret. Given the noisy dataset (it's social media and natural language, after all), I'm pretty happy with this result.

The 3 category model has fared less brilliantly. Although we can clearly see it has improved over the training process (and would likely continue to do so if I gave it longer than 50 epochs to train), its accuracy is really limited by its tendency to overfit - preventing it from performing as well as the 2 category model (which, incidentally, also overfits).

Although it classifies flood tweets fairly accurately, it seems to struggle more with positive and negative tweets. This leads me to suspect that training a model on flood / not flood might be worth trying - though as always I'm limited by a lack of data (I have only ~16K tweets that contain a flood emoji). Perhaps searching twitter for flood emojis directly would help gather additional data here.

Now that I have a model that works, I'm going to move on to answering some research questions with this model. To do so, I've upgraded my twitter-academic-downloader to download information about attached media - downloading attached media itself deserves its own blog post. After that, I'll be looking to publish my first ever scientific papery thing (don't know the difference between journal articles and other types of publication yet), so that is sure to be an adventure.

Temporal CNN(?)

In Temporal CNN land, things have been much slower. In my latest attempt to fix the accuracy issues I've described in detail in previous posts, I've implemented a vanilla convolutional autoencoder by following this tutorial (the links to the full code behind the tutorial are broken, and it's really annoying). This does image to image translation, so it would be ideal for modifying to support my use-case. After implementing it with Tensorflow.js, I hit a snag:

Autoencoder: expected vs actual

After trying all the loss functions I could think of and all sorts of other variations of the model, for sanity's sake I tried implementing the model in Python (the original language used in the blog post). I originally didn't want to do this, since all the data preprocessing code I've written so far is in Javascript / Node.js, but within half an hour and with some copying and pasting, I had a very quick model implemented. After training a teeny model for just 7 epochs on the Fashion MNIST dataset, I got this:

Autoencoder: epoch 7 in Tensorflow for Python

...so clearly there's something different in Tensorflow.js compared to Tensorflow for Python, but after comparing them dozens of times, I'm not sure I'll be able to spot the difference. It's a shame to give up on the Tensorflow.js version (it also has a very nice CLI I implemented), but given how long I've spent in the background on this problem, I'm going to move forwards with the Python version. To do this, I'm going to work on converting my existing data preprocessing code into a standalone CLI for converting the data into a format that I can consume in Python with minimal effort.

Once that's done, I'm going to expand the quick autoencoder Python script I implemented into a proper CLI program. This involves adding:

  • A CLI argument parser (argparse is my weapon of choice here for now in Python)
  • A settings system that has a default and a custom (TOML) config file
  • Saving multiple pieces of data to the output directory:
    • A summary of the model structure
    • The settings that are currently in use
    • A checkpoint for each epoch
    • A TSV-formatted metrics file

The existing script also spits out a sample image (such as the one above - just without the text) for each epoch too, which I've found to be very useful indeed. I'll definitely be looking into generating more visualisations on-the-fly as the model is training.

All of this will take a while (especially since this isn't the focus right now), so I'm a bit unsure if I'll get all of this done by the time I write the next post. If this plan works, then I'll probably have to come up with a new name for this model, since it's not really a Temporal CNN. Suggestions are welcome - my current thought is maybe "Temporal Convolutional Autoencoder".

Conclusion

Since the last post I've made on here, I've made much more progress on stuff than I was expecting to have done. I've now got a useful(?) model for classifying tweets, which I'm now going to move ahead with answering some research questions with (more on that in the next post - this one is long already haha). I've also got an Autoencoder training properly for image-to-image translation as a step towards getting 2D flood predictions working.

Looking forwards, I'm going to be answering research questions with my tweet classification model and preparing to write something to publish, and expanding on the Python autoencoder for more direct flood prediction.

Curious about the code behind any of these models and would like to check them out? Please get in touch. While I'm not sure I can share the tweet classifier itself yet (though I do plan on releasing it as open-source after a discussion with my supervisor when I publish), I can probably share my various autoencoder implementations, and I can absolutely give you a bunch of links to things that I've used and found helpful.

If you've found this interesting or have some suggestions, please comment below. It's really motivating to hear that what I'm doing is actually interesting / useful to people.

Tensorflow / Tensorflow.js in Review

For my PhD, I've been using both Tensorflow.js (Tensorflow for Javascript) and more recently Tensorflow for Python (including the bundled Keras) extensively for implementing multiple different models. Given the experiences I've had so far, I thought it was high time I put my thoughts to paper so to speak and write a blog post reviewing the 2 frameworks.

Tensorflow logo

Tensorflow for Python

Let's start with Tensorflow for Python. I haven't been using it as long as Tensorflow.js, but as far as I can tell they've done a great job of ensuring it comes with batteries included. It has layers that come in an enormous number of different flavours for doing everything you can possibly imagine - including building Transformers (though I ended up implementing the time signal encoding in my own custom layer).

Building custom layers is not particularly difficult either - though you do have to hunt around a bit for the correct documentation, and I haven't yet worked out all the bugs with loading model checkpoints that use custom layers back in again.

Handling data as a generic "tensor" that contains an n-dimension slab of data is - once you get used to it - a great way of working. It's not something I would recommend to the beginner however - rather I would recommend checking out Brain.js. It's easier to setup, and also more transparent / easier to understand what's going on.

Data preprocessing however is where things start to get complicated. Despite a good set of API reference docs to refer to, it's not clear how one is supposed to implement a performant data preprocessing pipeline. There are multiple methods for doing this (tf.data.Dataset, tf.keras.utils.Sequence, and others), and I have as of yet been unable to find a definitive guide on the subject.

Other small inconsistencies are also present, such as the Keras website and the Tensorflow API docs both documenting the Keras API, which in and of itself appears to be an abstraction of the Tensorflow API.... it gets confusing. Some love for the docs more generally is also needed, as I found some of the wording in places ambiguous as to what it meant - so I ended up guessing and having to work it out by experimentation.

By far the biggest issue I encountered though (aside from the data preprocessing pipeline, which is really confusing and frustrating) is that a highly specific version of CUDA is required for each version of Tensorflow. Thankfully, there's a table of CUDA / CuDNN versions to help you out, but it's still pretty annoying that you have to have a specific version. Blender manages to be CUDA enabled while supporting enough different versions of CUDA that I haven't had an issue on stock Ubuntu with the proprietary Nvidia drivers and the system CUDA version, so why can't Tensorflow do it too?

Tensorflow.js

This brings me on to Tensorflow.js, the Javascript bindings for libtensorflow (the underlying C++ library). This also has the specific version of CUDA issue, but in addition the version requirement documented in the README is often wrong, leaving you to make random wild guesses as to which version is required!

Despite this flaw, Tensorflow.js fixes a number of inconsistencies in Tensorflow for Python - I suspect that it was written after Tensorflow for Python was first implemented. The developers have effectively learnt valuable lessons from the Python version of Tensorflow, which has resulted in a coherent and cohesive API that makes much more sense than the API in Python. A great example of this is tf.data.Dataset: The data preprocessing pipeline in Tensorflow.js is well designed and easy to use. The Python version could learn a lot from this.

While Tensorflow.js doesn't have quite the same set of features (a number of prebuilt layers that exist in Python don't yet exist in Tensorflow.js), it still provides a reasonable set of features that satisfy most use-cases. I have noticed a few annoying inconsistencies in the loss functions though and how they behave - for example I implemented an autoencoder in Tensorflow.js, but it only returned black and white pixels - whereas Tensorflow for Python returned greyscale as intended.

Aside from improving CUDA support and adding more prebuilt layers, the other thing that Tensorflow.js struggles with is documentation. It has comparatively few guides with respect to Tensorflow for Python, and it also has an ambiguity problem in the API docs. This could be mostly resolved by being slightly more generous with explanations as to what things are and do. Adding some examples to the docs in question would also help - as would fixing the bug where it does not highlight the item you're currently viewing in the navigation page (DevDocs support would be awesome too).

Conclusion

Tensorflow for Python and Tensorflow.js are feature-filled and performant frameworks for machine learning and processing large datasets with GPU acceleration. Many tutorials are provided to help newcomers to the frameworks, but once you've followed a tutorial or 2 you're left very much on your own. A number of caveats and difficulties such as CUDA versions and confusing APIs / docs make mastering the frameworks difficult.

MIDI to Music Box score converter

Keeping with the theme of things I've forgotten to blog about (every time I think I've mentioned everything, I remember something else), in this post I'm going to talk about yet another project of mine that's been happily sitting around (and I've nearly blogged about but then forgot on at least 3 separate occasions): A MIDI file to music box conversion script. A while ago, I bought one of these customisable music boxes:

(Above: My music box (centre), and the associated hole punching device it came with (top left).)

The basic principle is that you punch holes in a piece of card, and then you feed it into the music box to make it play the notes you've punched into the card. Different variants exist with different numbers of notes - mine has a staggering 30 different notes it can play!

It has a few problems though:

  1. The note names on the card are wrong
  2. Figuring out which one is which is a problem
  3. Once you've punched a hole, it's permanent
  4. Punching the holes is fiddly

Solving problem #4 might take a bit more work (a future project out-of-scope of this post), but the other 3 are definitely solvable.

MIDI is a protocol (and a file format) for storing notes from a variety of different instruments (this is just the tip of the iceberg, but it's all that's needed for this post), and it just so happens that there's a C♯ library on NuGet called DryWetMIDI that can read MIDI files in and allow one to iterate over the notes inside.

By using this and a homebrew SVG writer class (another blog post topic for another time once I've tidied it up), I implemented a command-line program I've called MusicBoxConverter (the name needs work - comment below if you can think of a better one) that converts MIDI files to an SVG file, which can then be printed (carefully - full instructions on the project's page - see below).
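To give a feel for the idea, here is a minimal Python sketch of turning a list of notes into SVG circles. It is not the actual MusicBoxConverter code (which is written in C♯ on top of DryWetMIDI): the note list, spacing values, and the `notes_to_svg` function are all hypothetical placeholders, since the real layout of a 30 note music box isn't given in this post.

```python
# Hypothetical values: the real note layout and card dimensions of a
# 30 note music box are not given in this post.
SUPPORTED_NOTES = [60, 62, 64, 65, 67, 69, 71, 72]  # MIDI note numbers, lowest first
TRACK_SPACING_MM = 2.0  # assumed distance between adjacent note tracks
MM_PER_BEAT = 8.0       # assumed distance along the strip per beat
HOLE_RADIUS_MM = 0.9    # assumed radius of a punched hole


def notes_to_svg(notes):
    """Convert (beat, midi_note) pairs into an SVG document string.

    Notes the box cannot play are skipped and returned separately so
    the caller can report them.
    """
    circles = []
    skipped = []
    for beat, midi_note in notes:
        if midi_note not in SUPPORTED_NOTES:
            skipped.append((beat, midi_note))
            continue
        track = SUPPORTED_NOTES.index(midi_note)
        x = beat * MM_PER_BEAT
        y = track * TRACK_SPACING_MM
        circles.append(
            f'<circle cx="{x:.2f}" cy="{y:.2f}" r="{HOLE_RADIUS_MM}" fill="blue" />'
        )

    # Physical units (mm) on width / height plus a matching viewBox mean
    # that 1 SVG unit corresponds to 1mm when printed without rescaling.
    width = (max((beat for beat, _ in notes), default=0) + 1) * MM_PER_BEAT
    height = len(SUPPORTED_NOTES) * TRACK_SPACING_MM
    svg = (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}mm" height="{height}mm" viewBox="0 0 {width} {height}">'
        + "".join(circles)
        + "</svg>"
    )
    return svg, skipped
```

Calling `notes_to_svg([(0, 60), (1, 64), (2, 61)])` would draw 2 holes and report the third note as unplayable under the assumed note list.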

Before I continue, it's perhaps best to give an example of my (command line) program's output:

Example output

(Above: A lovely little tune from the TV classic The Clangers, converted for a 30 note music box by my MusicBoxConverter program. I do not own the song!)

The output from the program can (once printed to the correct size) then be cut out and pinned onto a piece of card for hole punching. I find paper clips work for this, but in future I might build myself an Arduino-powered device to punch holes for me.

The SVG generated has a bunch of useful features that make reading it easy:

  • Each note is marked by a blue circle with a red plus sign to help with accuracy (the hole puncher that came with mine has lines on it that make a plus shape over the viewing hole)
  • Each hole is numbered to aid with getting it the right way around
  • Big arrows in the background ensure you don't get it backwards
  • The orange bar indicates the beginning
  • The blue circle should always be at the bottom and the red circle at the top, so you don't get it upside down
  • The green lines should be lined up with the lines on the card

While it's still a bit tedious to punch the holes, it's much easier to adapt a score in Musescore, export it to MIDI, and push it through MusicBoxConverter than to do lots of trial-and-error punching directly unaided.

I won't explain how to use the program in this blog post in any detail, since it might change. However, the project's README explains in detail how to install and use it:

https://git.starbeamrainbowlabs.com/sbrl/MusicBoxConverter

Something worth a particular mention here is the printing process for the generated SVGs. It's somewhat complicated, but this is unfortunately necessary in order to avoid rescaling the SVG during the conversion process - otherwise it doesn't come out the right size to be compatible with the music box.

A number of improvements stand out to me that I could make to this project. While it only supports 30 note music boxes at the moment, it can easily be extended to support other types of music box (I just don't have them to hand).

Implementing a GUI is another possible improvement but would take a lot of work, and might be best served as a different project that uses my MusicBoxConverter CLI under the hood.

Finally, as I mentioned above, in a future project (once I have a 3D printer) I want to investigate building an automated hole punching machine to make punching the holes much easier.

If you make something cool with this program, please comment below! I'd love to hear from you (and I may feature you in the README of the project too).

Sources and further reading

  • Musescore - free and open source music notation program with a MIDI export function
  • MusicBoxConverter (name suggestions wanted in the comments below)

NAS Backups, Part 1: Overview

After building my nice NAS, the next thing on my sysadmin todo list was to ensure it is backed up. In this miniseries (probably 2 posts), I'm going to talk about the backup NAS that I've built. As of the time of typing, it is successfully backing up my Btrfs subvolumes.

In this first post, I'm going to give an overview of the hardware I'm using and the backup system I've put together. In future posts, I'll go into detail as to how the system works, and which areas I still need to work on moving forwards.

Personally, I find that the 3-2-1 backup strategy is a good rule of thumb:

  • 3 copies of the data
  • 2 different media
  • 1 off-site

What this means is that you should have 3 copies of your data, with 2 local copies and one remote copy in a different geographical location. To achieve this, I have first solved the problem of the local backup copy, since it's a lot easier and less complicated than the remote one. Although I've talked about backups before (see also), in this case my solution is slightly different - partly due to the amount of data involved, and partly due to additional security measures I'm trying to get into place.

Hardware

For hardware, I'm using a Raspberry Pi 4 with 4GB RAM (the same as the rest of my cluster), along with 2 x 4 TB USB 3 external hard drives. This is a fairly low-cost and low-performance solution. I don't particularly care how long it takes the backup to complete, and it's relatively cheap to replace if it fails without being unreliable (Raspberry Pis are tough little things).

Here's a diagram of how I've wired it up:

(Can't see the above? Try a direct link to the SVG. Created with drawio.)

I use USB Y-cables to power the hard drives directly from the USB power supply, as the Pi is unlikely to be able to supply enough power for multiple external hard drives on its own.

Important Note: As I've discovered with a different unrelated host on my network, if you do this you can back-power the Pi through the USB Y cable, and potentially corrupt the microSD card by doing so. Make sure you switch off the entire USB power supply at once, rather than unplug just the Pi's power cable!

For a power supply, I'm using an Anker 10 port device (though I bought through Amazon, since I wasn't aware that Anker had their own website) - the same one that powers my Pi cluster.

Strategy

To do the backup itself I'm using the fact that I store my data in Btrfs subvolumes and the btrfs send / btrfs receive commands to send my subvolumes to the remote backup host over SSH. This has a number of benefits:

  1. The host doing the backing up has no access to the resulting backups (so if it gets infected it can't also infect the backups)
  2. The backups are read-only Btrfs snapshots (so if the backup NAS gets infected my backups can't be altered without first creating a read-write snapshot)
  3. Backups are done incrementally to save time, but a full backup is done automatically on the first run (or if the local metadata is missing)
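As a rough illustration of the send / receive approach (the actual scripts are a topic for the next post, and this is not them), the sketch below builds the kind of shell pipeline involved. `btrfs send`, its `-p <parent>` option for incremental sends, and `btrfs receive` are real commands, but the `build_backup_pipeline` function, the host name, and all the paths are hypothetical.

```python
import shlex


def build_backup_pipeline(snapshot, remote_host, remote_dir, parent=None):
    """Return a shell pipeline that sends a read-only snapshot over SSH.

    If parent is given, an incremental send relative to that snapshot is
    used; otherwise a full send is done (e.g. on the very first run).
    """
    send = ["btrfs", "send"]
    if parent is not None:
        send += ["-p", parent]
    send.append(snapshot)

    # The receiving side runs on the backup host, so the whole remote
    # command is quoted as a single argument to ssh.
    receive = shlex.join(["btrfs", "receive", remote_dir])

    return "{} | ssh {} {}".format(
        shlex.join(send),
        shlex.quote(remote_host),
        shlex.quote(receive),
    )


# Hypothetical paths and host name, for illustration only.
print(build_backup_pipeline(
    "/mnt/data/.snapshots/2021-01-02",
    "backup-nas",
    "/mnt/backup",
    parent="/mnt/data/.snapshots/2021-01-01",
))
```

The function only builds the command string rather than running it, so it can be inspected before anything touches a real filesystem.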

While my previous backup solution using Restic for the server that sent you this web page has point #3 on my list above, it doesn't have points 1 and 2.

Restic does encrypt backups at rest though, which the system I'm setting up doesn't do unless you use LUKS to encrypt the underlying disks that Btrfs stores its data on. More on that in the future, as I have tentative plans to deal with my off-site backup problem using a similar technique to that which I've used here that also encrypts data at rest when a backup isn't taking place.

In the next post, I'll be diving into the implementation details for the backup system I've created and explaining it in more detail - including sharing the pair of scripts that I've developed that do the heavy lifting.

WorldEditAdditions: The story of the //convolve command

Sometimes, one finds a fascinating new use for an algorithm in a completely unrelated field to its original application. Such is the case with the algorithm behind the //convolve command in WorldEditAdditions which I recently blogged about. At the time, I had tried out the //smooth command provided by we_env, which smooths out rough edges in the terrain inside the defined WorldEdit region, but I was less than satisfied with its flexibility, configurability, and performance.

To this end, I set about devising a solution. My breakthrough in solving the problem occurred when I realised that the problem of terrain smoothing is much easier when you consider it as a 2D heightmap, which is itself suspiciously like a greyscale image. Image editors such as GIMP already have support for smoothing with their various blur filters, the most interesting of which for my use case was the gaussian blur filter. By applying a gaussian blur to the terrain heightmap, it would have the effect of smoothing the terrain - and thus //convolve was born.

If you're looking for a detailed explanation on how to use this command, you're probably in the wrong place - this post is more of a developer commentary on how it is implemented. For an explanation on how to use the command, please refer to the chat command reference.

The gaussian blur effect is applied by performing an operation called a convolution over the input 2D array. This operation is not specific to image editors (it can be found in AI too), and it calculates the new value for each pixel by taking a weighted sum of its existing value and those of all of its neighbours. It's perhaps best explained with a diagram:

All the weights in the kernel (sometimes called a filter) are floating-point numbers and should add up to 1. Since we look at each pixel's neighbours, we can't operate on the pixels around the edge of the image. Different implementations have different approaches to solving this problem:

  1. Drop them in the output image (i.e. the output image is slightly smaller than the input) - a CNN in AI does this by default in most frameworks
  2. Ignore the pixel and pass it straight through
  3. Take only a portion of the kernel and remap the weights so that we can use that instead
  4. Wrap around to the other side of the image

For my implementation of the //convolve command, the heightmap in my case is essentially infinite in all directions (although I obviously request a small subset thereof), so I chose option 2 here. I calculate the boundary given the size of the kernel and request an area with an extra border around it so I can operate on all the pixels inside the defined region.
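The border calculation can be sketched in a few lines of Python. This is an illustration under assumptions rather than the actual Lua code: `expand_region` is a hypothetical helper, and it assumes the kernel dimensions are odd so that the kernel has a well-defined centre.

```python
def expand_region(pos1, pos2, kernel_width, kernel_height):
    """Grow a 2D region, given as 2 (x, z) corners, by the kernel's border.

    Assumes odd kernel dimensions: a kernel that is N wide needs N // 2
    extra pixels on each side so that it fits around every pixel inside
    the original region.
    """
    border_x = kernel_width // 2
    border_z = kernel_height // 2
    min_x, max_x = sorted((pos1[0], pos2[0]))
    min_z, max_z = sorted((pos1[1], pos2[1]))
    return (min_x - border_x, min_z - border_z), (max_x + border_x, max_z + border_z)


# A 10x10 region with a kernel 5 wide and 3 high needs 2 extra columns
# on each side and 1 extra row at the top and bottom.
print(expand_region((0, 0), (9, 9), 5, 3))
```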

Let's look at the maths more specifically here:

Given an input image, for each pixel we multiply the kernel element-wise by that pixel and its neighbours, and then sum all the values in the resulting matrix to get the new pixel value.
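As a small worked example of that weighted sum for a single pixel, here is a Python sketch using a 3x3 kernel of equal weights. The `weighted_sum` function and the example heights are made up for illustration.

```python
def weighted_sum(neighbourhood, kernel):
    """Multiply each value by the matching kernel weight and sum the results."""
    return sum(
        value * weight
        for row_values, row_weights in zip(neighbourhood, kernel)
        for value, weight in zip(row_values, row_weights)
    )


# A 3x3 kernel of nine equal weights that add up to 1.
kernel = [[1 / 9] * 3 for _ in range(3)]

# Terrain heights for one pixel (the 19 in the centre) and its 8 neighbours.
neighbourhood = [
    [10, 10, 10],
    [10, 19, 10],
    [10, 10, 10],
]

# The spike of 19 is pulled down towards its neighbours: the 9 values
# sum to 99, so the new height works out at roughly 11.
new_height = weighted_sum(neighbourhood, kernel)
print(new_height)
```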

The specific weightings are of course configurable, and are represented by a rectangular (often square) 2D array. By manipulating these weightings, one can perform different kinds of operations. All of the following operations can be performed using this method:

Currently, WorldEditAdditions supports 3 different kernels in the //convolve command:

  • Box blur - frequently has artefacts
  • Gaussian Blur (additional parameter: sigma, which controls the 'blurriness') - the default, and also most complicated to implement
  • Pascal - A unique variant I derived from Pascal's Triangle. Works out as slightly sharper than Gaussian Blur without having artifacts like box blur.

After implementing the core convolution engine, writing functions to generate kernels of defined size for the above kernels was a relatively simple matter. If you know of another kernel that you think would be useful to have in WorldEditAdditions, I'm definitely open to implementing it.
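To illustrate what generating such kernels can look like, here is a Python sketch. The box and gaussian versions follow the standard definitions. The Pascal version is only a guess at one plausible construction (a row of Pascal's triangle multiplied by itself as an outer product), since the exact derivation used in WorldEditAdditions isn't described here. All function names are hypothetical.

```python
import math


def normalise(kernel):
    """Scale a 2D kernel so that all of its weights add up to 1."""
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def box_kernel(width, height):
    """Every weight is equal."""
    return normalise([[1.0] * width for _ in range(height)])


def gaussian_kernel(size, sigma):
    """Weights fall off with distance from the centre; sigma sets how quickly."""
    centre = (size - 1) / 2
    return normalise([
        [
            math.exp(-((x - centre) ** 2 + (z - centre) ** 2) / (2 * sigma ** 2))
            for x in range(size)
        ]
        for z in range(size)
    ])


def pascal_kernel(size):
    """ASSUMPTION: outer product of one row of Pascal's triangle with itself."""
    row = [math.comb(size - 1, k) for k in range(size)]
    return normalise([[a * b for b in row] for a in row])


for row in pascal_kernel(3):
    print(row)
```

Normalising at the end keeps the weights summing to 1 regardless of the kernel size, which stops the terrain as a whole drifting up or down.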

For the curious, the core convolutional filter can be found in convolve.lua. It takes 4 arguments:

  1. The heightmap - a flat ZERO indexed array of values. Think of a flat array containing image data like the HTML5 Canvas getImageData() function.
  2. The size of the heightmap in the form { x = width, z = height }
  3. The kernel (internally called a matrix), in the same form as the heightmap
  4. The size of the kernel in the same form as the size of the heightmap.
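A Python sketch of a function with the same 4 arguments might look like the following. It is an illustration of the technique and not a port of convolve.lua: edge pixels are passed through unchanged here, and the kernel is not flipped (which makes no difference for the symmetric kernels used for blurring).

```python
def convolve(heightmap, heightmap_size, kernel, kernel_size):
    """Convolve a flat, zero-indexed heightmap with a flat, zero-indexed kernel.

    Sizes are dicts of the form {"x": width, "z": height}. Pixels too
    close to the edge for the kernel to fit are passed through unchanged.
    """
    border_x = kernel_size["x"] // 2
    border_z = kernel_size["z"] // 2
    result = list(heightmap)  # copy, so the input values stay intact

    for z in range(border_z, heightmap_size["z"] - border_z):
        for x in range(border_x, heightmap_size["x"] - border_x):
            total = 0.0
            for kz in range(kernel_size["z"]):
                for kx in range(kernel_size["x"]):
                    source_x = x + kx - border_x
                    source_z = z + kz - border_z
                    weight = kernel[kz * kernel_size["x"] + kx]
                    value = heightmap[source_z * heightmap_size["x"] + source_x]
                    total += value * weight
            result[z * heightmap_size["x"] + x] = total
    return result


# A 3x3 heightmap with a spike in the middle, and a 3x3 kernel of equal weights.
heightmap = [10, 10, 10, 10, 19, 10, 10, 10, 10]
kernel = [1 / 9] * 9
smoothed = convolve(heightmap, {"x": 3, "z": 3}, kernel, {"x": 3, "z": 3})
print(smoothed)
```

Reading from the original array while writing to a copy matters: if values were overwritten in place, later pixels would be calculated from already-smoothed neighbours.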

Reflecting on the implementation, I'm pretty happy with it within the restrictions of the environment within which it runs. As you might have suspected based on my explanation above, convolution-based operations are highly parallelisable (there's a reason they are used in AI), and so there's a ton of scope for using multiple threads, SIMD, and more - but unfortunately the Lua (5.1! It's so frustrating since 5.4 is out now) environment in Minetest doesn't allow for threading or other optimisations to be made.

The other thing I might do something about soon is the syntax of the command. Currently it looks like this:

//convolve <kernel> [<width>[,<height>]] [<sigma>]

Given the way I've used this command in-the-field, I've found myself specifying the kernel width and the sigma value more often than I have found myself changing the kernel. With this observation in hand, it would be nice to figure out a better syntax here that reduces the amount of typing for everyday use-cases.
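For reference, the current syntax could be parsed along these lines. This Python sketch is not the real chat command parser: the default values are made up, and falling back to the width when the height is omitted is an assumption.

```python
# Hypothetical defaults: the real ones aren't stated in this post.
DEFAULT_SIZE = 5
DEFAULT_SIGMA = 2.0


def parse_convolve_args(params_text):
    """Parse the text after "//convolve" into a dict of settings.

    Follows the syntax: <kernel> [<width>[,<height>]] [<sigma>]
    """
    parts = params_text.split()
    if not parts:
        raise ValueError("a kernel name is required")

    settings = {
        "kernel": parts[0],
        "width": DEFAULT_SIZE,
        "height": DEFAULT_SIZE,
        "sigma": DEFAULT_SIGMA,
    }
    if len(parts) >= 2:
        width_text, _, height_text = parts[1].partition(",")
        settings["width"] = int(width_text)
        # ASSUMPTION: a missing height means a square kernel.
        settings["height"] = int(height_text) if height_text else settings["width"]
    if len(parts) >= 3:
        settings["sigma"] = float(parts[2])
    return settings


print(parse_convolve_args("gaussian 7,3 1.5"))
```

Because every argument is positional, changing only the sigma value still means typing out the kernel name and size first, which is the everyday friction described above.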

In a future post about WorldEditAdditions, I want to talk about a number of topics, such as the snowballs algorithm in the //erode command. I also potentially may post about the new //noise2d command I've been working on, but that might take a while (I'm still working on implementing a decent noise function for that, since at the time of typing the Perlin noise implementation I've ported is less than perfect).

WorldEditAdditions: More WorldEdit commands for Minetest

Personally, I see enormous creative potential in games such as Minetest. For those who aren't in the know, Minetest is a voxel sandbox game rather like the popular Minecraft. Being open-source though, it has a solid Lua Modding API that allows players with a bit of Lua knowledge to script their own mods with little difficulty (though Lua doesn't come with batteries included, but that's a topic for another day). Given the ease of making mods, Minetest puts much more emphasis on installing and using mods - in fact most content is obtained this way.

Personally I find creative building in Minetest a relaxing activity, and one of the mods I have found most useful for this is WorldEdit. Those familiar with Minecraft may already be aware of WorldEdit for Minecraft - Minetest has its own equivalent too. WorldEdit in both games provides an array of commands one can type into the chat window to perform various functions to manipulate the world - for example fill an area with blocks, create shapes such as spheres and pyramids, or replace 1 type of node with another.

Unfortunately though, WorldEdit for Minetest (henceforth simply WorldEdit) doesn't have quite the same feature set that WorldEdit for Minecraft does - so I decided to do something about it. Initially, I contributed a pull request to add node weighting support to the //mix command, but I quickly realised that the scale of the plans I had in mind wasn't exactly compatible with WorldEdit's codebase (as of the time of typing WorldEdit for Minetest's codebase consists of a relatively small number of very long files).

(Above: The WorldEditAdditions logo, kindly created by @VorTechnix with Blender 2.9)

To this end, I decided to create my own project in which I could build out the codebase in my own way, without waiting for lots of pull requests to be merged (which would probably put strain on the maintainers of WorldEdit).

The result of this is a new mod for Minetest which I've called WorldEditAdditions. Currently, it has over 35 additional commands (though by the time you read this that number has almost certainly grown!) and 2 handheld tools that extend the core feature set provided by WorldEdit, adding commands to do things like:

These are just a few of my favourites - I've implemented many more beyond those in this list. VorTechnix has also contributed a raft of improvements, from improvements to //torus to an entire suite of selection commands (//srect, //scol, //scube, and more). It has been an amazing experience to collaborate on an open-source project with someone on another continent (open-source is so cool like that).

A fundamental difference I decided on at the beginning when working on WorldEditAdditions was to split my new codebase into lots of loosely connected files, and have each file have a single purpose. If a file gets over 150 lines, then it's a candidate for being split up - for example every command has a backend (which sometimes itself spans multiple files, as in the case of //convolve and //erode) that actually manipulates the Minetest world and a front-end that handles the chat command parsing (this way we also get an API for free! Still working on documenting that properly though - if anyone knows of something like documentation for Lua please comment below). By doing this, I enable the codebase to scale in a way that the original WorldEdit codebase does not.

I've had a ton of fun so far implementing different commands and subsequently using them (it's so satisfying to see a new command finally working!), and they have already proved very useful in creative building. Evidently others think so too, as we've already had over 4800 downloads on ContentDB: ContentDB Shield listing the live number of downloads

Given the enormous amount of work I and others have put into WorldEditAdditions and the level of polish it is now achieving (on par with my other big project Pepperminty Wiki, which I've blogged about before), recently I also built a website for it:

The WorldEditAdditions website

You can visit it here: https://worldeditadditions.mooncarrot.space/

I built the website with Eleventy, as I did with the Pepperminty Wiki website (blog post). My experience with Eleventy this time around was much more positive than last time - more on this in a future blog post. For the observant, I did lift the fancy button code on the new WorldEditAdditions website from the Pepperminty Wiki website :D

The website has a number of functions:

  • Explaining what WorldEditAdditions is
  • Instructing people on how to install it
  • Showing people how to use it
  • Providing a central reference of commands

I think I covered all of these bases fairly well, but only time will tell how actual users find it (please comment below! It gives me a huge amount of motivation to continue working on stuff like this).

Several components of the WorldEditAdditions codebase deserve their own blog posts that have not yet got one - especially //erode and //convolve, so do look out for those at some point in the future.

Moving forwards with WorldEditAdditions, I want to bring it up to feature parity with WorldEdit for Minecraft. I also have a number of cool and unique commands in mind that I'm working on, such as //noise (apply an arbitrary 2d noise function to the height of the terrain), //mathapply (execute another command, but only apply changes to the nodes whose coordinates result in a number greater than 0.5 when pushed through a given arbitrary mathematical expression - I've already started implementing a mathematical expression parser in Lua with a recursive-descent parser, but have run into difficulties with operator precedence), and exporting Minetest regions to obj files that can then be imported into Blender (a collaboration with @VorTechnix).
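On the operator precedence point, a common way to handle it in a recursive-descent parser is to give each precedence level its own function, with lower-precedence functions calling higher-precedence ones. The Python sketch below shows that general technique. It is not the Lua parser mentioned above, and the grammar, operator set, and names are all assumptions.

```python
import re

TOKEN_PATTERN = re.compile(r"\s*(?:(\d+\.?\d*)|([A-Za-z_]\w*)|(.))")


def tokenise(text):
    tokens = []
    for number, name, symbol in TOKEN_PATTERN.findall(text):
        if number:
            tokens.append(("number", float(number)))
        elif name:
            tokens.append(("name", name))
        elif symbol.strip():  # ignore stray whitespace
            tokens.append(("symbol", symbol))
    return tokens


class Parser:
    def __init__(self, text, variables):
        self.tokens = tokenise(text)
        self.position = 0
        self.variables = variables

    def peek(self):
        if self.position < len(self.tokens):
            return self.tokens[self.position]
        return ("end", None)

    def accept(self, *symbols):
        kind, value = self.peek()
        if kind == "symbol" and value in symbols:
            self.position += 1
            return value
        return None

    def expression(self):  # lowest precedence: + and -
        value = self.term()
        while (operator := self.accept("+", "-")) is not None:
            right = self.term()
            value = value + right if operator == "+" else value - right
        return value

    def term(self):  # binds tighter: * and /
        value = self.unary()
        while (operator := self.accept("*", "/")) is not None:
            right = self.unary()
            value = value * right if operator == "*" else value / right
        return value

    def unary(self):  # negation
        if self.accept("-"):
            return -self.unary()
        return self.power()

    def power(self):  # tightest binary operator, right-associative
        base = self.primary()
        if self.accept("^"):
            return base ** self.unary()
        return base

    def primary(self):  # numbers, variables, and brackets
        kind, value = self.peek()
        self.position += 1
        if kind == "number":
            return value
        if kind == "name":
            return self.variables[value]
        if kind == "symbol" and value == "(":
            inner = self.expression()
            if self.accept(")") is None:
                raise SyntaxError("expected ')'")
            return inner
        raise SyntaxError(f"unexpected token {value!r}")


def evaluate(text, variables=None):
    parser = Parser(text, variables or {})
    result = parser.expression()
    if parser.peek()[0] != "end":
        raise SyntaxError("unexpected trailing input")
    return result


print(evaluate("1 + 2 * 3"), evaluate("(1 + 2) * 3"))
```

Under this scheme, a //mathapply-style check would evaluate the expression with a node's coordinates passed in as variables, and compare the result against 0.5.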

If WorldEditAdditions has caught your interest, you can get started by visiting the WorldEditAdditions website: https://worldeditadditions.mooncarrot.space/.

If you find it useful, please consider starring it on GitHub and leaving a review on ContentDB! If you run into difficulties, please open an issue and I'll help you out.

Art by Mythdael