
MIDI to Music Box score converter

Keeping with the theme of things I've forgotten to blog about (every time I think I've mentioned everything, I remember something else), in this post I'm going to talk about yet another project of mine that's been happily sitting around (and I've nearly blogged about but then forgot on at least 3 separate occasions): A MIDI file to music box conversion script. A while ago, I bought one of these customisable music boxes:

(Above: My music box (centre), and the associated hole punching device it came with (top left).)

The basic principle is that you punch holes in a piece of card, and then you feed it into the music box to make it play the notes you've punched into the card. Different variants exist with different numbers of notes - mine has a staggering 30 different notes it can play!

It has a few problems though:

  1. The note names on the card are wrong
  2. Figuring out which note is actually which is tricky
  3. Once you've punched a hole, it's permanent
  4. Punching the holes is fiddly

Solving problem #4 might take a bit more work (a future project out-of-scope of this post), but the other 3 are definitely solvable.

MIDI is a protocol (and a file format) for storing notes from a variety of different instruments (this is just the tip of the iceberg, but it's all that's needed for this post), and it just so happens that there's a C♯ library on NuGet called DryWetMIDI that can read MIDI files in and allow one to iterate over the notes inside.

By using this and a homebrew SVG writer class (another blog post topic for another time, once I've tidied it up), I implemented a command-line program I've called MusicBoxConverter (the name needs work - comment below if you can think of a better one) that converts a MIDI file into an SVG file, which can then be printed (carefully - full instructions on the project's page, see below).

Before I continue, it's perhaps best to give an example of my (command line) program's output:

Example output

(Above: A lovely little tune from the TV classic The Clangers, converted for a 30 note music box by my MusicBoxConverter program. I do not own the song!)

The output from the program can (once printed to the correct size) then be cut out and pinned onto a piece of card for hole punching. I find paper clips work for this, but in future I might build myself an Arduino-powered device to punch holes for me.

The SVG generated has a bunch of useful features that make reading it easy:

  • Each note is marked by a blue circle with a red plus sign to help with accuracy (the hole puncher that came with mine has lines on it that make a plus shape over the viewing hole)
  • Each hole is numbered to aid with getting it the right way around
  • Big arrows in the background ensure you don't get it backwards
  • The orange bar indicates the beginning
  • The blue circle should always be at the bottom and the red circle at the top, so you don't get it upside down
  • The green lines should be lined up with the lines on the card

While it's still a bit tedious to punch the holes, it's much easier to adapt a score in Musescore, export it to MIDI, and push it through MusicBoxConverter than it is to do lots of trial-and-error punching directly, unaided.

I won't explain how to use the program in this blog post in any detail, since it might change. However, the project's README explains in detail how to install and use it:

https://git.starbeamrainbowlabs.com/sbrl/MusicBoxConverter

Something worth a particular mention here is the printing process for the generated SVGs. It's somewhat complicated, but this is unfortunately necessary in order to avoid rescaling the SVG during the conversion process - otherwise it doesn't come out the right size to be compatible with the music box.

A number of improvements stand out to me that I could make to this project. While it only supports 30 note music boxes at the moment, it can easily be extended to support other types of music box (I just don't have them to hand).

Implementing a GUI is another possible improvement but would take a lot of work, and might be best served as a different project that uses my MusicBoxConverter CLI under the hood.

Finally, as I mentioned above, in a future project (once I have a 3D printer) I want to investigate building an automated hole punching machine to make punching the holes much easier.

If you make something cool with this program, please comment below! I'd love to hear from you (and I may feature you in the README of the project too).

Sources and further reading

  • Musescore - free and open source music notation program with a MIDI export function
  • MusicBoxConverter (name suggestions wanted in the comments below)

NAS Backups, Part 1: Overview

After building my nice NAS, the next thing on my sysadmin todo list was to ensure it is backed up. In this miniseries (probably 2 posts), I'm going to talk about the backup NAS that I've built. As of the time of typing, it is successfully backing up my Btrfs subvolumes.

In this first post, I'm going to give an overview of the hardware I'm using and the backup system I've put together. In future posts, I'll go into detail as to how the system works, and which areas I still need to work on moving forwards.

Personally, I find that the 3-2-1 backup strategy is a good rule of thumb:

  • 3 copies of the data
  • 2 different media
  • 1 off-site

What this means is that you should have 3 copies of your data: 2 local copies, and one remote copy in a different geographical location. To achieve this, I have solved the problem of the local backup copy first, since it's a lot easier and less complicated than the remote one. Although I've talked about backups before (see also), in this case my solution is slightly different - partly due to the amount of data involved, and partly due to additional security measures I'm trying to put into place.

Hardware

For hardware, I'm using a Raspberry Pi 4 with 4GB RAM (the same as the rest of my cluster), along with 2 x 4 TB USB 3 external hard drives. This is a fairly low-cost and low-performance solution. I don't particularly care how long the backup takes to complete, and the hardware is relatively cheap to replace if it fails - without being unreliable (Raspberry Pis are tough little things).

Here's a diagram of how I've wired it up:

(Can't see the above? Try a direct link to the SVG. Created with drawio.)

I use USB Y-cables to power the hard drives directly from the USB power supply, as the Pi is unlikely to be able to supply enough power for multiple external hard drives on its own.

Important Note: As I've discovered with a different unrelated host on my network, if you do this you can back-power the Pi through the USB Y cable, and potentially corrupt the microSD card by doing so. Make sure you switch off the entire USB power supply at once, rather than unplug just the Pi's power cable!

For a power supply, I'm using an Anker 10 port device (though I bought through Amazon, since I wasn't aware that Anker had their own website) - the same one that powers my Pi cluster.

Strategy

To do the backup itself, I take advantage of the fact that I store my data in Btrfs subvolumes, using the btrfs send / btrfs receive commands to send those subvolumes to the remote backup host over SSH (there's a rough sketch of the idea after the list below). This has a number of benefits:

  1. The host doing the backing up has no access to the resulting backups (so if it gets infected it can't also infect the backups)
  2. The backups are read-only Btrfs snapshots (so if the backup NAS gets infected my backups can't be altered without first creating a read-write snapshot)
  3. Backups are done incrementally to save time, but a full backup is done automatically on the first run (or if the local metadata is missing)
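
To give a flavour of how this works, here's a minimal sketch of the send / receive dance. The hostnames, paths, and snapshot names are made up, and the real scripts - which I'll share in the next post - do rather more than this:

# On the source NAS: take a read-only snapshot of the subvolume to back up
sudo btrfs subvolume snapshot -r /mnt/data /mnt/data/.snapshots/2021-09-01

# First run: send the entire snapshot to the backup NAS over SSH
sudo btrfs send /mnt/data/.snapshots/2021-09-01 | ssh root@backup-nas btrfs receive /mnt/backups/data

# Subsequent runs: send only the difference between the previous snapshot and the new one
sudo btrfs send -p /mnt/data/.snapshots/2021-09-01 /mnt/data/.snapshots/2021-09-08 | ssh root@backup-nas btrfs receive /mnt/backups/data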

While my previous backup solution - Restic, backing up the server that sent you this web page - has point #3 from the list above, it doesn't have points #1 and #2.

Restic does encrypt backups at rest though, which the system I'm setting up doesn't do unless you use LUKS to encrypt the underlying disks that Btrfs stores its data on. More on that in the future, as I have tentative plans to deal with my off-site backup problem using a technique similar to the one I've used here - one that also encrypts data at rest when a backup isn't taking place.
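
For reference, layering Btrfs on top of LUKS looks roughly like this - a sketch only, where /dev/sdX and the backups_crypt mapping name are placeholders, and luksFormat destroys whatever is already on the disk:

# Encrypt the raw disk with LUKS (wipes the disk!)
sudo cryptsetup luksFormat /dev/sdX

# Unlock it, creating /dev/mapper/backups_crypt
sudo cryptsetup open /dev/sdX backups_crypt

# Create a Btrfs filesystem on top of the unlocked mapping and mount it
sudo mkfs.btrfs --label backups /dev/mapper/backups_crypt
sudo mount /dev/mapper/backups_crypt /mnt/backups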

In the next post, I'll be diving into the implementation details for the backup system I've created and explaining it in more detail - including sharing the pair of scripts that I've developed that do the heavy lifting.

WorldEditAdditions: The story of the //convolve command

Sometimes, one finds a fascinating new use for an algorithm in a field completely unrelated to its original application. Such is the case with the algorithm behind the //convolve command in WorldEditAdditions, which I recently blogged about. At the time, I had tried out the //smooth command provided by we_env, which smooths out rough edges in the terrain inside the defined WorldEdit region, but I was less than satisfied with its flexibility, configurability, and performance.

To this end, I set about devising a solution. My breakthrough in solving the problem occurred when I realised that the problem of terrain smoothing is much easier when you consider it as a 2D heightmap, which is itself suspiciously like a greyscale image. Image editors such as GIMP already have support for smoothing with their various blur filters, the most interesting of which for my use case was the gaussian blur filter. By applying a gaussian blur to the terrain heightmap, it would have the effect of smoothing the terrain - and thus //convolve was born.

If you're looking for a detailed explanation on how to use this command, you're probably in the wrong place - this post is more of a developer commentary on how it is implemented. For an explanation on how to use the command, please refer to the chat command reference.

The gaussian blur effect is applied by performing an operation called a convolution over the input 2D array. This operation is not specific to image editors (it can be found in AI too), and it calculates the new value for each pixel by taking a weighted sum of its existing value and all of its neighbours. It's perhaps best explained with a diagram:

All the weights in the kernel (sometimes called a filter) are floating-point numbers and should add up to 1. Since we look at each pixel's neighbours, we can't operate on the pixels around the edge of the image. Different implementations have different approaches to solving this problem:

  1. Drop them in the output image (i.e. the output image is slightly smaller than the input) - a CNN in AI does this by default in most frameworks
  2. Ignore the pixel and pass it straight through
  3. Take only a portion of the kernel and remap the weights so that we can use that portion instead
  4. Wrap around to the other side of the image

For my implementation of the //convolve command, the heightmap is essentially infinite in all directions (although I obviously request only a small subset thereof), so I chose option 2 here. I calculate the boundary given the size of the kernel, and request an area with an extra border around it so that I can operate on all the pixels inside the defined region.

Let's look at the maths more specifically here:

Given an input image, for each pixel we multiply the kernel element-wise by that pixel and its neighbours, and then sum all the values in the resulting matrix to get the new pixel value.
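
In symbols: for a kernel $K$ measuring $(2a + 1) \times (2b + 1)$, the new value of the pixel at $(x, y)$ works out to the following (this is the textbook definition rather than a transcription of convolve.lua; strictly speaking it's a cross-correlation, but for the symmetric kernels used here flipping the kernel makes no difference):

$$ \text{output}(x, y) = \sum_{i=-a}^{a} \sum_{j=-b}^{b} K(i, j) \cdot \text{input}(x + i, y + j) $$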

The specific weightings are of course configurable, and are represented by a rectangular (often square) 2D array. By manipulating these weightings, one can perform different kinds of operation - blurring, sharpening, edge detection, and more.

Currently, WorldEditAdditions supports 3 different kernels in the //convolve command:

  • Box blur - frequently has artefacts
  • Gaussian Blur (additional parameter: sigma, which controls the 'blurriness') - the default, and also most complicated to implement
  • Pascal - A unique variant I derived from Pascal's Triangle. It works out slightly sharper than Gaussian Blur, without the artefacts of the box blur.
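
To give a concrete sense of what these look like, here are the textbook forms of the first two (illustrative only - not copied from the kernel generation code in WorldEditAdditions): a 3×3 box blur kernel weights every cell equally, while a Gaussian kernel samples the 2D Gaussian function and then normalises the weights so that they sum to 1:

$$ K_{\text{box}} = \frac{1}{9} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} \qquad K_{\text{gaussian}}(x, y) = \frac{1}{2 \pi \sigma^2} e^{-\frac{x^2 + y^2}{2 \sigma^2}} $$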

After implementing the core convolution engine, writing functions to generate each of the above kernels at a given size was a relatively simple matter. If you know of another kernel that you think would be useful to have in WorldEditAdditions, I'm definitely open to implementing it.

For the curious, the core convolutional filter can be found in convolve.lua. It takes 4 arguments:

  1. The heightmap - a flat ZERO indexed array of values. Think of a flat array containing image data like the HTML5 Canvas getImageData() function.
  2. The size of the heightmap in the form { x = width, z = height }
  3. The kernel (internally called a matrix), in the same form as the heightmap
  4. The size of the kernel in the same form as the size of the heightmap.

Reflecting on the implementation, I'm pretty happy with it within the restrictions of the environment in which it runs. As you might have suspected from my explanation above, convolution-based operations are highly parallelisable (there's a reason they are used in AI), so there's a ton of scope for using multiple threads, SIMD, and more - but unfortunately the Lua (5.1! It's so frustrating since 5.4 is out now) environment in Minetest doesn't allow for threading or other such optimisations to be made.

The other thing I might do something about soon is the syntax of the command. Currently it looks like this:

//convolve <kernel> [<width>[,<height>]] [<sigma>]
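
For example, assuming the Gaussian kernel is selected with the keyword gaussian, blurring the current selection with a 7×7 kernel and a sigma of 2 might look like this (the values here are purely illustrative):

//convolve gaussian 7,7 2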

Given the way I've used this command in-the-field, I've found myself specifying the kernel width and the sigma value more often than I have found myself changing the kernel. With this observation in hand, it would be nice to figure out a better syntax here that reduces the amount of typing for everyday use-cases.

In a future post about WorldEditAdditions, I want to talk about a number of topics, such as the snowballs algorithm in the //erode command. I also may post about the new //noise2d command I've been working on, but that might take a while (I'm still working on implementing a decent noise function for it, since at the time of typing the Perlin noise implementation I've ported is less than perfect).

WorldEditAdditions: More WorldEdit commands for Minetest

Personally, I see enormous creative potential in games such as Minetest. For those who aren't in the know, Minetest is a voxel sandbox game rather like the popular Minecraft. Being open-source though, it has a solid Lua Modding API that allows players with a bit of Lua knowledge to script their own mods with little difficulty (though Lua doesn't come with batteries included, but that's a topic for another day). Given the ease of making mods, Minetest puts much more emphasis on installing and using mods - in fact most content is obtained this way.

Personally I find creative building in Minetest a relaxing activity, and one of the mods I have found most useful for this is WorldEdit. Those familiar with Minecraft may already be aware of WorldEdit for Minecraft - Minetest has its own equivalent too. WorldEdit in both games provides an array of commands one can type into the chat window to perform various functions to manipulate the world - for example filling an area with blocks, creating shapes such as spheres and pyramids, or replacing 1 type of node with another.

Unfortunately though, WorldEdit for Minetest (henceforth simply WorldEdit) doesn't have quite the same feature set that WorldEdit for Minecraft does - so I decided to do something about it. Initially, I contributed a pull request to add node weighting support to the //mix command, but I quickly realised that the scale of the plans I had in mind wasn't exactly compatible with WorldEdit's codebase (as of the time of typing, WorldEdit for Minetest's codebase consists of a relatively small number of very long files).

(Above: The WorldEditAdditions logo, kindly created by @VorTechnix with Blender 2.9)

To this end, I decided to create my own project in which I could build out the codebase in my own way, without waiting for lots of pull requests to be merged (which would probably put strain on the maintainers of WorldEdit).

The result of this is a new mod for Minetest which I've called WorldEditAdditions. Currently, it has over 35 additional commands (though by the time you read this that number has almost certainly grown!) and 2 handheld tools that extend the core feature set provided by WorldEdit, adding commands to do things like smoothing terrain with configurable convolution kernels (//convolve) and simulating erosion (//erode).

These are just a few of my favourites - I've implemented many more beyond those. VorTechnix has also contributed a raft of improvements, from enhancements to //torus to an entire suite of selection commands (//srect, //scol, //scube, and more) - collaborating on an open-source project with someone on another continent has been an amazing experience (open source is so cool like that).

A fundamental difference I decided on at the beginning when working on WorldEditAdditions was to split my new codebase into lots of loosely connected files, with each file having a single purpose. If a file gets over 150 lines, then it's a candidate for being split up - for example, every command has a backend (which sometimes itself spans multiple files, as in the case of //convolve and //erode) that actually manipulates the Minetest world, and a front-end that handles the chat command parsing (this way we also get an API for free! I'm still working on documenting that properly though - if anyone knows of a good documentation generator for Lua, please comment below). By doing this, I enable the codebase to scale in a way that the original WorldEdit codebase does not.

I've had a ton of fun so far implementing different commands and subsequently using them (it's so satisfying to see a new command finally working!), and they have already proved very useful in creative building. Evidently others think so too, as we've already had over 4800 downloads on ContentDB:

ContentDB badge listing the live number of downloads

Given the enormous amount of work I and others have put into WorldEditAdditions and the level of polish it is now achieving (on par with my other big project Pepperminty Wiki, which I've blogged about before), recently I also built a website for it:

The WorldEditAdditions website

You can visit it here: https://worldeditadditions.mooncarrot.space/

I built the website with Eleventy, as I did with the Pepperminty Wiki website (blog post). My experience with Eleventy this time around was much more positive than last time - more on this in a future blog post. For the observant, I did lift the fancy button code on the new WorldEditAdditions website from the Pepperminty Wiki website :D

The website has a number of functions:

  • Explaining what WorldEditAdditions is
  • Instructing people on how to install it
  • Showing people how to use it
  • Providing a central reference of commands

I think I covered all of these bases fairly well, but only time will tell how actual users find it (please comment below! It gives me a huge amount of motivation to continue working on stuff like this).

Several components of the WorldEditAdditions codebase deserve their own blog posts that have not yet got one - especially //erode and //convolve, so do look out for those at some point in the future.

Moving forwards with WorldEditAdditions, I want to bring it up to feature parity with WorldEdit for Minecraft. I also have a number of cool and unique commands in mind that I'm working on, such as //noise (apply an arbitrary 2d noise function to the height of the terrain), //mathapply (execute another command, but only apply changes to the nodes whose coordinates result in a number greater than 0.5 when pushed through a given arbitrary mathematical expression - I've already started implementing a mathematical expression parser in Lua with a recursive-descent parser, but have run into difficulties with operator precedence), and exporting Minetest regions to .obj files that can then be imported into Blender (a collaboration with @VorTechnix).

If WorldEditAdditions has caught your interest, you can get started by visiting the WorldEditAdditions website: https://worldeditadditions.mooncarrot.space/.

If you find it useful, please consider starring it on GitHub and leaving a review on ContentDB! If you run into difficulties, please open an issue and I'll help you out.

simple-dash fork: now with directory support!

A while back (I still have all sorts of projects I've forgotten to blog about - with many more to come), I forked an excellent project called simple-dash, which is a web dashboard. You can configure it to display 1 or more links, and it presents them nice and cleanly in the middle of the page.

I don't make forks lightly, but in this case I liked the project a lot - I just wanted to add enough features that I felt I might be taking it in a different direction than the original project. The original project also hasn't been touched in 2+ years, and the author hasn't had any contributions on GitHub in that time either - so I think it's fair to say that any pull request I opened would be unlikely to be looked at (if the original author is reading this, I'm happy to open one!).

Anyway, before I continue too far, here's a screenshot of my improvements in action:

A screenshot of my improvements - explained in more detail below.

I use simple-dash in multiple places to provide a dashboard of links to the various services that I run, both so that I don't lose track of them and so that, in some cases, other people in my family can easily access said services.

I added a number of features here. The first is invisible: I completely re-implemented the layout to use the CSS Grid (see also: a, b). If you've played with CSS before but aren't yet aware of the CSS grid, I can thoroughly recommend you take a moment to investigate - it will blow you away and solve all your layout problems at the same time! In short, it's like a 2d version of the flexbox.

Since the original has full mobile support, I continued that trend in the rewrite with some CSS media queries that change the number of items per row based on the width of your screen.

The other invisible change is that I changed the language the configuration file is written in to TOML, which is a much more friendly language to write configuration files in.

Anyway, in terms of more visible changes, I also added the ability to set a background image as an alternative to the default random triangles background. Icons got the same treatment - gaining the ability to display an image instead of a Font Awesome icon (I hadn't actually used Font Awesome before, so this was an interesting experience - even if it was already set up in this project).

Last but certainly not least, I added the ability to group pages into folders. Here's a screenshot of what the contents of the folder in the top left looks like when opened:

simple-dash with a folder open

You can't see it here, but it's even animated! Link to a demo at the end of this post.

There were a number of different challenges to overcome to get this working right - it was not trivial at all. There are 2 components to it: the CSS to style it, and the Javascript to fiddle with the class list on the folder itself, adding / removing the active class so that I could distinguish between open and closed folders in the CSS, and also preventing the click event from propagating through to the links inside when the folder is closed.

Thinking about it, it may be possible to use a clever pointer-events: none to avoid the Javascript entirely.

The CSS does the heavy lifting here though. For inactive folders, I use a CSS grid with overflow: hidden to display the 1st 4 icons as a preview. When the folder becomes active, position: fixed breaks it out of the layout of the rest of the page (sadly, leaving a placeholder behind would require an additional html element), and the content reflows to use the same CSS as the main grid of tiles.

Through some CSS grid wizardry (you can do anything with CSS grid, it's amazing) and a container element, I can even fade out the rest of the page while the folder is open.

Clicking on the items in a folder when the folder is open takes you to their destination as usual, while clicking anywhere else closes the folder again.

I've got a demo running over here if you'd like to play around with it:

sbrl's simple-dash fork demo

The background is set to a random image from Unsplash. It loads fine for me, but sometimes it takes a moment.

If this looks like something you'd like to use for yourself, my fork is open-source! Check it out here:

sbrl/simple-dash on GitHub

You can find instructions on how to set it up for yourself in the README. You'll need npm to install dependencies - this should come bundled with Node.js. You can also find a lovingly-commented example configuration file here:

config.sample.toml

If you have any difficulties setting it up, want to request a feature, or even (gasp!) report a bug, please open an issue. While I do monitor the comments here on this blog, GitHub issues are a much better place to track bugs and feature requests.

Rendering Time plan / Gantt charts: hourgraph

I have a number of tools and other programs I've implemented but forgotten to blog about here - hourgraph is one such tool I stumbled across again today. I originally implemented it for my PhD panel 1 topic project analysis report, when I realised that not only had I already created a number of these charts by hand, but that I was going to have to create a bunch more in the future. As I usually do with most of the things I write, I open-sourced it in the hope that someone else will find it useful.

I've published it on NPM, so you can install it like this:

npm install --global hourgraph

You'll need Node.js installed, and Linux users will need to prefix the above with sudo.

The program takes in a TOML definition file. Here's an example:

width = 1500
height = 480
title = "Apples"

[[task]]
name = "Pick apples"
start = 0
duration = 3

[[task]]
name = "Make apple juice"
start = 2
duration = 2

[[task]]
name = "Enjoy!"
start = 4
duration = 4
colour = "hsl(46, 90%, 60%)"
ghost_colour = "hsla(46, 90%, 60%, 0.1)"

The full set of options is available in the default config file, which is loaded in to fill in the gaps for anything you haven't specified in your custom file.

Comprehensive usage instructions are found in the README, but you can render a new time plan chart thingy like this:

hourgraph --input path/to/input.toml --output path/to/output.svg

The above renders to this:

Hourgraph output

Personally, I find it's much easier to create charts like this by defining them in a simple text file that is then rendered into the actual thing. That way, I don't have to fiddle with the layout myself - it all comes out in the wash automatically.

For those interested in the code, it can be found here: https://github.com/sbrl/hourgraph

EmbedBox: Lightweight syntax-highlighted embeds

I was planning on posting about something else yesterday, but I wanted to show some GitLab code in a syntax-highlighted embed. When I wasn't able to figure out how to do that, I ended up writing EmbedBox.

The whole thing is best explained with an example. Have an embed:

(Can't see the above? Check out the original file here)

Pretty cool, right? The above is the default settings file for EmbedBox. Given any URL (e.g. https://raw.githubusercontent.com/sbrl/EmbedBox/master/src/settings.default.toml), it will generate a syntax-highlighted embed for it.

It does so using highlight.php to do the syntax-highlighting server-side, Stash PHP for the cache, and without any Javascript in the embed itself.

It comes with a web interface that generates the embed code given the input URL and a few other settings and shows a preview of what it'll look like.

EmbedBox is open-source too (under the Mozilla Public Licence 2.0), so you're welcome to setup your own instance!

To do so, check out the code here: https://github.com/sbrl/EmbedBox/

The installation instructions should be pretty straightforward in theory, but if you get stuck please open an issue.

Now that I've implemented EmbedBox, you can expect to see it appear in future blog posts. I'm planning to write about my organise-photos script in the near future, so expect a blog post about it soon.

Found this interesting? Got a suggestion? Want to say hi? Comment below!

Starting my PhD on the mapping of flooding

Specifically, using new technologies such as AI and the Internet of Things to map and predict where it's going to flood in real-time.

This year, I'm starting a 3 year funded PhD on dynamic flood risk mapping, as part of a cluster of water-related PhDs that are all being funded at the same time. I've got some initial ideas as to the direction I'm going to take it too, which I'd like to talk a little bit about in this post.

I've got no shortage of leads as to potential data sources I can explore to achieve this. Just some of the avenues I'm exploring include:

  • Monitoring street drains
  • Fitting local council vehicles with sensors
  • Analysing geotagged tweets with natural language processing techniques
  • Predicting rainfall with aggregate mobile phone signal strength information
  • Vehicle windscreen wipers
  • Setting up static sensors? Not sure on this one.

Also, I've talked to a few people and thought of some pre-existing data sources that might come in useful:

  • Elevation maps
  • Vegetation maps
    • Normalised difference vegetation index - Map of vegetation density that's already available
    • Normalised difference water index - detects water on the surface of the earth - including that contained within leaves
  • River levels

Finally, I've been thinking about what I'm going to do with all this data I'd potentially be collecting. First and foremost, I'm going to experiment with InfluxDB as a storage mechanism. Apparently it's supposed to be able to handle high read and write loads, so it sounds suitable at first glance.

Next, I'm probably going to wind up training an AI - possibly incrementally - to predict flooding. Unlike my summer project, I'm probably going to be using a more cutting-edge and exotic AI architecture.

I suspect I might end up with a multi-level system too - whereby I pre-analyse the incoming data and process it into a format that the AI will take. For example, if I end up using geotagged social media posts, those will very likely filter through an AI of some description that does the natural language processing first - the output of which will be (part of) the input (or training output?) for the next AI in the chain.

I've given some thought to training data too. My current thinking is that while some data sources might be used as inputs to a network of interconnected AIs, others are more likely to take on a corrective role - i.e. improving the accuracy of the AI and correcting the model to fit a situation as it unfolds.

All this will not only require huge amounts of data, but will also require a sophisticated system that's capable of training an AI on past datasets as if they were happening in real-time. I suppose in a way the training process is a bit like chronological history lessons at speed - catching the AI up to the present day so that it can predict flood risk in real-time.

Before all this though, my first and foremost task will be to analyse what people have done already to map and predict flood risk. Only by understanding the current state of things will I be able to improve on what's already out there.

Found this interesting? Got any tips? Comment below!

Summer Project Part 6: A matching bookend

Since the last post, I've completed the project - at least in its initial form. The IoT device I've been building is finished - along with the backend that receives the data and trains an AI on it. I've learnt a great deal whilst working on it. Unfortunately, I've been very short on time, so I haven't been able to blog about it as frequently as I'd have liked to.

Before we continue, it's perhaps best to show a diagram here of how the completed system works:

A colour-coded diagram. See the full explanation below. (Above: A diagram showing the typical workflow when using the completed system. Taken from my dissertation.)

Operation is divided up into 3 sections:

  1. Data collection: Using the Internet of Things (IoT) device to collect data.
  2. Processing the collected data: The data file on the microSD card of the device is folded into the main database (more on this later), and AIs are trained
  3. Viewing the final map in your browser.

Because the application is somewhat distributed in nature, it's not exactly obvious how the data flows across the application. Let's use another diagram to show how this works:

Another brightly-coloured diagram  - this time showing the data flow throughout the application. Explanation below. (Above: A diagram showing the data flow throughout the application. Taken from my dissertation.)

This diagram looks complicated, but it really isn't as complex as it looks.

  • First, the IoT device takes a reading. This includes the GPS location (latitude and longitude) and a random id (unsigned 32-bit integer)
  • This reading is both stored on a local SD card, and transmitted via LoRa
  • The copy transmitted via LoRa travels through The Things Network and is picked up by the TTN listener, which stores the reading in an SQLite database
  • Once data collection has been completed, the data on the microSD card is folded into the SQLite database generated by the TTN listener by the data processor. This ensures we have data points for places that the signal isn't, as well as where it is.
  • 1 AI is then trained per gateway. These AIs are trained on the signal strength data for their gateway, but also on the data for where the signal is not.
  • The trained AIs are serialised to disk with an index file, where the web interface picks them up.

Training the AIs was a bit of a pain. Getting the inputs, the outputs, and the AIs' topology right - to the point where the output is meaningful - was quite challenging, especially considering the limited amount of data I managed to collect. I'd like to tweak and improve on it further, if I have time.

I ended up getting quite a bit of noise in the final dataset that I trained the AIs with - simply because of the way I designed the IoT device - and because of the time constraints, I didn't have the time to go back and fix it. The device currently logs the data point to disk and transmits it via LoRa afterwards. When the battery was low, I noticed that it had a tendency to reset when transmitting via LoRa, leading to a reset-loop and lots of bogus readings on the microSD card. And because I don't store the timestamp, I can't even tell them apart!

I also had a bit of a problem with configuration files - specifically that they kept multiplying. At the end of the project, I now have no fewer than 3-4 different configuration files - all in different formats! I'd love to consolidate them into 1 single configuration file, to make configuration much easier for the end user.

  • 1 for the pin definitions and features to enable / disable in C++ #define statements for the IoT device
  • Another 1 in C++ for the private TTN LMiC settings
  • 1 in TOML for the server
  • 1 in Javascript for the web interface

Ideally, I want everything to be defined in the TOML configuration file, since I developed a system by which I have a default configuration file, and a custom one in which the user can override any of the default settings.

With all this said, the output isn't bad:

A screenshot of the web interface. See below.

(Above: A screenshot of the web interface, with the background map removed for privacy)

The AIs do appear to have learnt a circle of coverage around the gateways, which is somewhat frustrating - since the aim was to take obstructions into account too. I suspect that I need more data - and to figure out a way of training each AI on places where other gateways pick up a signal but the current gateway does not (currently each AI is only trained on positive readings for its own gateway, and negative ones where no gateway picks up the signal at all).

If anyone is interested in the code behind the project, I have now open-sourced it, and it is available here:

LoRaWAN Signal Mapping

The repository includes source code, user and hardware manuals, and a circuit diagram.

I may blog about this in the future, but it's likely to be a while. I'm starting a PhD this academic year, which is sure to be fun! Expect some posts about that in the near future.

Found this interesting? Comment below!

Next Gen Search, Part 2: Pushing the limits

In the last part, we looked at how I built a new backend for storing inverted indexes for Pepperminty Wiki, which allows for partial index deserialisation and other nice features that boost performance considerably.

Since the last post, I've completed work on the new search system - though there are a few bits around the edges that I still want to touch up and do some more work on.

In this post though, I want to talk about how I generated test data to give my full-text search engine something to chew on. I've done this before for my markov chain program I wrote a while back, and that was so much fun I did it again for my search engine here.

After scratching my head for a bit to think of a data source I could use, I came up with the perfect plan. Ages ago I downloaded a Wikipedia dump - just the content pages in Wikitext markup. Why not use that?

As it turns out, it was a rather good idea. Some processing of said dump was required to transform it into a format that Pepperminty Wiki can understand, though. Pepperminty Wiki stores pages on disk as flat text files in markdown, and indexes them in pageindex.json. If pageindex.json doesn't exist, then Pepperminty Wiki rebuilds it automagically by looking for content pages on disk.

This makes it easy to import batches of new pages into Pepperminty Wiki, so all we need to do is extract the wiki text, convert it to markdown, and import it! This ended up requiring a number of separate steps though, so let's take it 1 step at a time.

First, we need a Wikipedia database dump in the XML format. These are available from dumps.wikimedia.org. There are many different ones available, but I suggest grabbing one that has a filename similar to enwiki-20180201-pages-articles.xml - i.e. just content pages - no revision history, user pages, or additional extras. I think the most recent one as of the time of posting is downloadable here - though I'd warn you that it's 15.3GiB in size! You can see a list of available dump dates for the English Wikipedia here.
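
For reference, grabbing and decompressing such a dump looks something like this - the URL below is the standard "latest" alias on dumps.wikimedia.org, so substitute the filename if you pick a specific dated dump instead:

# Download the latest complete English Wikipedia articles dump (well over 15GiB compressed)
wget "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

# Decompress it, keeping the original .bz2 around (-k)
bzip2 -dk enwiki-latest-pages-articles.xml.bz2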

Now that we've got our dump, let's extract the pages from it. This is nice and easy to do with wikiextractor on GitHub:

nice -n20 wikiextractor enwiki-20180201-pages-articles.xml --no_templates --html --keep_tables --lists --links --sections --json --output wikipages --compress --bytes 25M >progress.log 2>&1 

This will parse the dump and output a number of compressed files to the wikipages directory. These will have 1 JSON object per line, each containing information about a single page on Wikipedia - with page content pre-converted to HTML for us. The next step is to extract the page content and save it to a file with the correct name. This ended up being somewhat complicated, so I wrote a quick Node.js script to do the job:

#!/usr/bin/env node

const readline = require("readline");
const fs = require("fs");


if(!fs.existsSync("pages"))
        fs.mkdirSync("pages", { mode: 0o755 });

// From https://stackoverflow.com/a/44195856/1460422
function html_unentities(encodedString) {
        var translate_re = /&(nbsp|amp|quot|lt|gt);/g;
        var translate = {
                "nbsp":" ",
                "amp" : "&",
                "quot": "\"",
                "lt"  : "<",
                "gt"  : ">"
        };
        return encodedString.replace(translate_re, function(match, entity) {
                return translate[entity];
        }).replace(/&#(\d+);/gi, function(match, numStr) {
                var num = parseInt(numStr, 10);
                return String.fromCharCode(num);
        });
}

const interface = readline.createInterface({
        input: process.stdin,
        //output: process.stdout
});

interface.on("line", (text) => {
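        // Parse the JSON object for this page, unescape the HTML entities in its content, and write it to disk named after the page title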
        const obj = JSON.parse(text);

        fs.writeFileSync(`pages/${obj.title.replace(/\//, "-")}.html`, html_unentities(obj.text));
        console.log(`${obj.id}\t${obj.title}`);
});

This basically takes the stream of JSON objects on the standard input, parses them, and saves the relevant content to disk. We can invoke it like so:

bzcat path/to/*.bz2 | ./parse.js

Don't forget to chmod +x parse.js if you get an error here. The other important thing about the above script is that we have to unescape the HTML entities (e.g. &gt;), because otherwise we'll have issues later with the HTML conversion, and page names will look odd. This is done by the html_unentities() function in the above script.

This should result in a directory containing a large number of files - 1 file per content page. This is much better, but we're still not quite there yet. Wikipedia uses wiki markup (which we converted to HTML with wikiextractor) and Pepperminty Wiki uses Markdown - the 2 of which are, despite all their similarities, inherently incompatible. Thankfully, pandoc is capable of converting from HTML to markdown.

Pandoc is great at this kind of thing - it uses an intermediate representation and allows you to convert almost any type of textual document format to any other format. Markdown to PDF, EPUB to plain text, ..... and HTML to markdown (just to name a few). It actually looks like it shares a number of features with traditional compilers like GCC.

Anyway, let's use it to convert our folder full of wikitext files to a folder full of markdown:

mkdir -p pages_md;
find pages/ -type f -name "*.html" -print0 | nice -n20 xargs -P4 -0 -n1 -I{} sh -c 'filename="{}"; title="${filename##*/}"; title="${title%.*}"; pandoc --from "html"  --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md"; echo "${title}";';

_(See this on explainshell.com - doesn't include the nice -n20 due to a bug on their end)_

This looks complicated, but it really isn't. Let's break it down a bit:

find pages/ -type f -name "*.html" -print0

This finds all the HTML files that we want to convert to Markdown, and delimits the output with a NUL byte - i.e. 0x0. This makes the next step easier:

... | nice -n20 xargs -P4 -0 -n1 -I{} sh -c '....'

This pipes the list into a new xargs instance (niced to a low priority with nice -n20), which will execute up to 4 commands at a time in parallel (-P4). xargs executes a command for each line of input it receives - in our case we're delimiting items with the NUL byte 0x0 instead though (-0). We also explicitly specify that we want 1 command per item of input (-n1), as xargs otherwise tries to optimise and do command file1 file2 file3 instead.

The sh -c bit is starting a subshell, in which we execute a small wrapper script that then calls pandoc. This is of course inefficient, but I couldn't find any way around spawning a subshell in this instance.

filename="{}";
title="${filename##*/}";
title="${title%.*}";
pandoc --from "html" --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md";
echo "${title}";

I've broken the sh -c subshell script down into multiple lines for readability. Simply put, it extracts the page title from the filename, converts the HTML to Markdown, and saves it to a new file in a different directory, with .md replacing the original .html extension.

When you put all these components together, you get a script that converts a folder full of HTML files to Markdown. Just like with the markov chains extraction I mentioned at the beginning of this post, Bash and shell scripting really is all about lego bricks. This is due in part to the Unix philosophy:

Make each program do one thing well.

There is more to it, but this is the most important point to remember. Many of the core utilities you'll find on the terminal follow this way of thinking.

There's 1 last thing we need to take care of before we have them in the right format though - we need to convert the [display text](page name) markdown-format links back into the Wikipedia [[internal link]] format that Pepperminty Wiki also uses.

Thankfully, another command-line tool I know of called repren is well-suited to this:

repren --from '\[([^\]]+)\]\(([^):]+)\)' --to '[[\1]]' pages_md/*.md

It took some fiddling, but I got all the escaping figured out and the above converts back into the [[internal link]] format well enough.

Now that we've got our folder full of markdown files, we need to extract a random portion of them to act as a test for Pepperminty Wiki - as the whole lot might be a bit much for it to handle (though if Pepperminty Wiki was capable of handling it all eventually that'd be awesome :D). Let's try 500 pages to start:

find path/to/wikipages/ -type f -name "*.md" -print0 | shuf --zero-terminated | head -n500 --zero-terminated | xargs -0 -n1 -I{} cp "{}" .

(See this on explainshell.com)

This is another lego-brick style command. Let's break it down too:

find path/to/wikipages/ -type f -name "*.md" -print0

This lists all the .md files in a directory, delimiting them with a NUL character, as before. It's better to do this than use ls, as find is explicitly designed to be machine-readable.

.... | shuf --zero-terminated

The shuf command randomly shuffles the input lines. In this case, we're telling it that the input is delimited by the NUL byte.

.... | head -n500 --zero-terminated

Similar deal here. head takes the top N lines of input, and discards the rest.

.... | xargs -0 -n1 -I{} cp "{}" .

Finally, xargs calls cp to copy the selected files to the current directory - which is, in this case, the root directory of my test Pepperminty Wiki instance.

Since I'm curious, let's now find out roughly how many words we're dealing with here:

cat data_test/*.md | wc --words
1593190

1.5 million words! That's a lot. I wonder how quickly we can search that?

A screenshot of the Pepperminty Wiki search results on the test wiki for the word food, showing the new dark theme coming soon!

24.8ms? Awesome! That's so much better than before. If you're wondering about the new coat of paint in the screenshot - Pepperminty Wiki is getting a dark theme, thanks to prefers-color-scheme :D

I wonder what happens if we push it to 2K pages?

Another screenshot, the same as before

This time we get ~120ms for 5.9M total words - wow! I wasn't expecting it to perform so well. At this scale, rebuilding the entire index is particularly costly - so if I were to push it even further, it would make sense to implement an incremental approach that spreads the work over multiple requests, assuming I can't squeeze any more performance out of the system as-is.

The last thing I want to do here is make a rough estimate of the time complexity of the search system as-is, given the data we have so far. This isn't particularly difficult to do.

Given the results above, we can calculate that at 1.5M total words, the search system gets through roughly ~60K words per millisecond of execution time. At 5.9M total words, it only manages ~49K words / ms of execution time - a drop of ~11K words / ms.

From this, we can speculate that for every million words added to a wiki, we can expect the throughput to drop by roughly ~2.5K words / ms of execution time - not bad! We'd need more data points to make any reasonable guess as to the Big-O complexity function that it conforms to. My guess would be something like $O(xN^2)$, where x is a constant between ~0.2 and 2.
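
Spelling that arithmetic out with the rounded figures from above:

$$ \frac{1.5\text{M words}}{24.8\text{ ms}} \approx 60\text{K words/ms} \qquad \frac{5.9\text{M words}}{120\text{ ms}} \approx 49\text{K words/ms} \qquad \frac{(60 - 49)\text{K words/ms}}{(5.9 - 1.5)\text{M words}} \approx 2.5\text{K words/ms per million words} $$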

Maybe at some point I'll go to the trouble of running enough tests to calculate it, but with all the variables that affect the execution time (number of pages, distribution of words across pages, etc.), I'm not in any hurry to calculate it. If you'd like to do so, go ahead and comment below!

Next time, I'll unveil the inner workings of the STAS: my new search-term analysis system.

Found this interesting? Got your own story about some cool code you've written to tell? Comment below!
