PhD Update 1: Directions

Welcome to my first PhD update post. I intend to post these at bimonthly intervals. In the last post, I talked a bit about my PhD project and my initial thoughts on it. Since then, I've done heaps of investigation into a number of different potential directions I could take the project. For reference, my PhD title is actually as follows:

Using the Internet of Things, Big Data, and AI to dynamically map flood risk.

There are 3 main elements to this project:

  • Big Data
  • Artificial Intelligence (AI)
  • The Internet of Things (IoT)

I'm pretty sure that each of them will have an important role to play in the final product - even if I'm not sure what those roles are just yet :P

Particularly of concern at the moment is this blog post by Google. It talks about how they've managed to significantly improve flood forecasting with AI, along with a seriously impressive visualisation to back it up - but I can't find a paper on it anywhere. I'm concerned that anything I try to do in the area won't be useful if they are already streets ahead of everyone else like that.

I guess one of the strong points I should try to hit is the concept of explainable AI if possible.

All the data sources!

As it stands right now, I'm currently evaluating various different potential data sources that I've managed to gain access to. My aim here is to evaluate how useful they will be in solving the wider problem - and whether they are useful enough to be worth investigating further.

Environment Agency

Some great people from the Environment Agency came into the University recently to chat with us about what they do. The discussion we had was very interesting - and they also asked if there was anything they could do to help our PhD projects along.

Seeing the opportunity, I jumped at the chance to get a hold of some of their historical datasets. They actually maintain a network of high-quality sensors across the country that monitor everything from rainfall to river statistics. While they have a real-time API that you can use to download recent measurements, it doesn't appear to go back further than March 2017. To this end, I asked for data from 2005 up to the end of 2017, so that I could get a clearer picture of the 2007 and 2013 floods for AI training purposes.
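
As an aside, the real-time API is pretty easy to poke at from the command line. A quick sketch (assuming curl and jq are installed - the endpoint here is from the public flood-monitoring API as I understand it, so double-check the official documentation before relying on it):

# List the measurement stations that report rainfall, and count how many there are
curl -s "https://environment.data.gov.uk/flood-monitoring/id/stations?parameter=rainfall" \
    | jq '.items | length'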

So far, this dataset has proved very useful, at least initially, as a testbed for training various kinds of AI as I learn PyTorch (see my recent post for how that has been going). I've started with a basic LSTM first. For reference, an LSTM is a neural network architecture that is good at processing time-series data - but it is quite computationally expensive to run.

Met Office

I've also been investigating the datasets that the Met Office provide. These chiefly appear to be in the form of their free DataPoint API. Particularly of interest are their rainfall radar images, which are 500x500 pixels and are released every 15 minutes. Sadly they are only available for a few hours at best, so you have to grab them fast if you want to be able to analyse particularly interesting ones later.

Annoyingly though, their API does not appear to give any hints as to the bounding boxes of these images - and neither can I find any information about this online. I posted in their support forum, but it doesn't appear that anyone actually monitors it - so at this point I suspect that I'm unlikely to receive a response. Without knowing the (lat, lng) co-ordinates of the images produced by the API, they are of little more use than pretty wall art.

Internet of Things

On the Internet of Things front, I'm already part of Connected Humber, which has a network of sensors set up that monitor everything from air quality to temperature, humidity, and air pressure. While these things aren't directly related to my project, the dataset that we're collecting as a group may very well come in handy as an input to a model of some description.

I'm pretty sure that I'll need to set up some additional custom sensors of my own at some point (probably soonish too) to collect the measurement readings that I'm missing from other pre-existing datasets.

Reading a library

Whilst I've been doing this, I've also been reading up a storm. I've started by reading into traditional physics-based flood modelling simulations (such as caesar-lisflood) - which appear to fall into a number of different categories, which in turn have sub-categories. It's quite a rabbit hole - but apparently I'm diving all the way down to the very bottom.

The most interesting paper on this subject I found was this one from 2017. It splits physics-based models up into 3 categories:

  • Empirical models (i.e. ones that just display sensor readings, calculate some statistics, and that's about it)
  • Hydrodynamic models - the best-known models that simulate water flow etc - can be categorised as either 1D, 2D, or 3D - also very computationally expensive - especially in higher dimensions
  • Simplified conceptual models - don't actually simulate water flow, but efficient enough to be used on large areas - also can be quite inaccurate with complex terrain etc.

As I'm going to be using artificial intelligence as the core of my project, it quickly became evident that this is just stage-setting for the actual kind of work I'll be doing. After winding my way through a bunch of other less interesting papers, I found my way to this paper from 2018 next, which is similar to the previous one I linked to - just for AI and flood modelling.

While I haven't yet had a chance to follow up on all the papers it references, it has a number of interesting points to keep in mind:

  • Artificial Intelligences need lots of diverse data points to train well
  • It's important to measure a trained network's ability to generalise what it's learnt to other situations it hasn't seen yet

The odd thing about this paper is that it claims that regular neural networks were better than recurrent neural network structures - despite the fact that it only cites a single old 2013 paper (which I haven't yet read). This led me on to read a few more papers - all of which were mildly interesting and had at least something to do with neural networks.

I certainly haven't read everything yet about flood modelling and AI, so I've got quite a way to go until I'm done in this department. Also of interest are 2 newer neural network architectures which I'm currently reading about.

Next steps

I want to continue to read about the above neural networks. I also want to implement a number of the networks I've read about in PyTorch to continue to learn the library.

Lastly, I want to continue to find new datasets to explore. If you're aware of a dataset that I haven't yet talked about on here, comment below!

PyTorch and the GPU: A tale of graphics cards

Recently, I've been learning PyTorch - which is an artificial intelligence / deep learning framework in Python. While I'm not personally a huge fan of Python, it seems to be the only library of its kind out there at the moment (and Tensorflow.js has terrible documentation) - so it would seem that I'm stuck with it.

Anyway, as I've been trying to learn it I inevitably came to the bit where I need to learn how to take advantage of a GPU to accelerate the neural network training process. I've been implementing a few test networks to see how it performs (my latest one is a simple LSTM, loosely following this tutorial).

In PyTorch, this isn't actually done for you automatically. The basic building blocks of PyTorch are tensors (potentially multi-dimensional arrays that hold data). Each tensor is bound to a specific compute device - by default the CPU (in which case the data is stored in regular RAM). To do the calculations on a graphics card, you need to bind the data to the GPU in order to load it into the GPU's own memory - so that the GPU can access it and do the calculation. The same goes for any models you create - they have to be explicitly loaded onto the GPU in order to run the calculations in the right place. Thankfully, this is fairly trivial:

tensor = torch.rand(3, 4)
tensor = tensor.to(COMPUTE_DEVICE)

....where COMPUTE_DEVICE is the PyTorch device object you want to load the tensor onto. I found that this works quite well for determining the device that the data should be loaded onto:

COMPUTE_DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Unfortunately, PyTorch (and seemingly every other AI framework out there) only supports a technology called CUDA for GPU acceleration. This is a proprietary Nvidia technology - which means that you can only use Nvidia GPUs for accelerated deep learning. Since I don't actually own an Nvidia GPU (far too expensive, and in my current laptop I have an AMD Radeon R7 M445 - and I don't plan on spending large sums of money to replace a perfectly good laptop), I've been investigating hardware at my University that I can use for development purposes - since this is directly related to my PhD, after all.

Initially, I've found a machine with an Nvidia GeForce GTX 650 in it. If you run torch.cuda.is_available(), it will tell you if CUDA is available or not:

print(torch.cuda.is_available()) # Prints True if CUDA is available

.....but, as always, there's got to be a catch. Just because CUDA is available doesn't mean to say that PyTorch can actually use it. After a bunch of testing, it transpired that PyTorch only supports CUDA devices with a compute capability of 3.5 or higher - and the GTX 650 has a compute capability of just 3.0. You can see where this is going. I found this webpage helpful - it lists all of Nvidia's GPUs and their compute capabilities.

You can also get PyTorch to tell you more about the CUDA device it has found:

def display_compute_device():
    """Displays information about the compute device that PyTorch is using."""

    log(f"Using device: {COMPUTE_DEVICE}", newline=False)
    if COMPUTE_DEVICE.type == 'cuda':
        print(" {0} [Memory: {1}GB allocated, {2}GB cached]".format(
            torch.cuda.get_device_name(0),
            round(torch.cuda.memory_allocated(0)/1024**3, 1),
            round(torch.cuda.memory_cached(0)/1024**3, 1)
        ))

    print()

If you execute the above method, it will tell you more about the compute device it has found. Note that you can actually make use of multiple compute devices at the same time - I just haven't done any research into that yet.

Crucially, it will also generate a warning message if your CUDA device is too old. To this end, I'll be doing some more investigating as to the resources that the Department of Computer Science has available for PhD students to use....

If anyone knows of an artificial intelligence framework that can take advantage of any GPU (e.g. via OpenCL, oneAPI, or other similar technologies), do get in touch. I'm very interested to explore other options.

Exporting an SQLite3 database to a directory of CSV files

Recently I was working with a dataset I acquired for my PhD, and to pre-process said dataset into something more sensible I imported it into an SQLite3 database. Once I was finished processing it, I then needed to export it again into regular CSV files so that I could do other things, such as plot it with GNUPlot, or import it into InfluxDB (more on InfluxDB in a later post).

With the help of Stack Overflow and the SQLite3 man page, this didn't prove to be too difficult. To export a single SQLite3 table to a CSV file, you do this:

sqlite3 -bail -header -csv "bobsrockets.sqlite3" "SELECT * FROM 'table_name';" >"path/to/output_file.csv";

This is great for a single table, but what if we want to export all the tables? Well, we can iterate over all the tables in an SQLite3 database like so:

while read table_name; do
    echo "Exporting ${table_name}";

    # Do stuff
done < <(sqlite3 "bobsrockets.sqlite3" ".tables");

If we combine this with the previous snippet, we can export all the tables like so:

while read table_name; do
    log "Exporting ${table_name}";

    sqlite3 -bail -header -csv "bobsrockets.sqlite3" "SELECT * FROM '${table_name}';" >"${table_name}.csv"; 
done < <(sqlite3 "bobsrockets.sqlite3" ".tables");

Cool! We can make it even better with some simple improvements though:

  1. It's a pain to have to edit the script every time we want to change the database we're exporting
  2. It would be nice to be able to specify the output directory without editing the script too

Satisfying both of these points isn't particularly challenging. 10 minutes of fiddling got me this, the final completed script:

#!/usr/bin/env bash
set -e; # Don't allow errors

show_usage() {
    echo -e "Usage:";
    echo -e "\t./sqlite2csv.sh {db_filename} {output_dir}";
}

log() {
    echo -e "[ $(date +"%F %T") ] ${@}";
}

###############################################################################

db_filename="${1}";
output_dir="${2}";

if [ -z "${db_filename}" ]; then
    echo "Error: No database filename specified.";
    show_usage; exit 1;
fi
if [ -z "${output_dir}" ]; then
    echo "Error: No output directory specified.";
    show_usage; exit 1;
fi

if [ ! -d "${output_dir}" ]; then
    mkdir -p "${output_dir}"; 
fi

log "Output directory is ${output_dir}";

while read table_name; do
    log "Exporting ${table_name}";

    sqlite3 -bail -header -csv "${db_filename}" "SELECT * FROM '${table_name}';" >"${output_dir}/${table_name}.csv";    
done < <(sqlite3 "${db_filename}" ".tables");

log "Complete!";

Found this useful? Comment below!

Pepperminty Wiki is 5 today!

....let's celebrate with the release of v0.20. I got a notification from my calendar system yesterday that Pepperminty Wiki's birthday is today, and since I did a beta release a few days ago and there haven't been any major issues, I thought I'd time the full release to coincide with its birthday.

I'm timing it from the first commit I ever made in Pepperminty Wiki's git repository. 5 years is a long time - and as a program Pepperminty Wiki has come such a long way since then.

Today, it's actually a really useful piece of open-source software, which is evidenced by the fact that people recommend it to others of their own accord. Seeing that, and hearing about where it's used, is really amazing - and gives me lots of motivation to improve Pepperminty Wiki even more.

While the number of commits a project has isn't always an indicator of its quality or how complete it is, it does usually give you a pretty good idea as to how much work has been done on it. At the time of writing Pepperminty Wiki has 1,415 commits, which is more than any other project I have ever worked on - past or present. The air quality web interface (which is now more of a general sensor web interface) is my 2nd place project unless I've missed one - and at 425 commits it doesn't even come close!

To summarise the features in the latest release:

  • 🌜 New automatic dark mode in the default theme! Uses prefers-color-scheme under-the-hood
  • 🌈 Added theme gallery! Read more here
  • Vastly improved search engine performance, with new advanced query syntax (with even more syntax along the way)
  • 🚁 Accessibility improvements - if you're a screen-reader or accessibility tool user, I want to hear from you if you think anything (big or small!) could be improved!

Personally, I'm most proud of the optimisations to the search engine. I've actually blogged about how I did it in a 3 part series and tested it on a test wiki with ~5.9M words - while search times vary depending on your input (the new -exclude syntax will actually speed up queries) and your server hardware, a single word query for ~5.0M word wikis takes ~50ms O.o

Unfortunately, this does mean that the search index will need to be rebuilt under the new format - and will be slightly larger than before. To get a progress bar for this operation, go to the master settings and click the rebuild button.

Another notable change is the new 'mega-menu' style more menu:

(Image: the new mega-menu style 'more' menu in the default theme.)

That menu has been bothering me for a while, and thanks to the kind people on Reddit, I've now got a solution.

Note that you'll need to delete nav_links_extra from your peppermint.json in order for it to take effect.

Please also test the theme gallery in particular. It's brand-new in this release and quite complicated under-the-hood, so I'd appreciate some extra eyes on that.

As for when I'll release v1.0, I'm not sure. As a program, Pepperminty Wiki is certainly stable enough to be used in production scenarios today - so perhaps incrementing the version number to v1.0 would be a good idea to reflect that. At the same time though, there are a number of missing features - most notably watchlists and further improvements to the page history system - so I'm not sure when I'll be confident enough to bump it to v1.0.

Either way, I'm pretty sure that I'll keep working on Pepperminty Wiki for years to come - I have no plans to cease development at this time. While Pepperminty Wiki releases don't move at the most rapid of paces, I aim to get about 2 releases out per year about 6 months apart from each other.

Special thanks to @SeanFromIT for reporting a number of bugs which have been squashed.

If you use Pepperminty Wiki, tweet me @SBRLabs! I'd love to hear about how you're using it.

Lastly, don't forget to take a backup of your wiki before updating. While I've made every effort to squash bugs, you can never be too careful :P

Check out v0.20 here:

Pepperminty Wiki v0.20

MDNS: Simple device addressing for home networks

We all know about DNS, and how it forms one of the foundations of the Internet. With a hierarchical system of caching DNS resolvers, it provides a scalable system by which domain names (such as starbeamrainbowlabs.com) can be translated into their associated IP address (such as 2001:41d0:e:74b::1 or 5.196.73.75). You can register your own domain name for a modest fee, and point it at a web server to host a website.
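
You can watch this translation happen from the command line with dig (a quick sketch, assuming the dnsutils / bind-utils package that provides dig is installed):

# Look up the IPv4 (A) and IPv6 (AAAA) records for a domain
dig +short A starbeamrainbowlabs.com
dig +short AAAA starbeamrainbowlabs.com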

But what about a local home network? In such an environment, where devices get switched on and off and enter and leave the network on a regular basis, manually specifying DNS records for devices which may even have dynamic IP addresses is a chore (and dynamic DNS solutions are complex to set up). Is there an easier way?

As I discovered the other day, it turns out the answer is yes - and it comes in the form of Multicast DNS, which abbreviates to MDNS. MDNS is a decentralised peer-to-peer protocol that lets devices on a small home network announce their names and their IP addresses in a standard fashion. It's also (almost) zero-configuration, so as long as UDP port 5353 is allowed through all your devices' firewalls, it should start working automatically.

Linux users will need avahi-daemon installed and running, which should be the default on popular distributions such as Ubuntu. Windows users with a recent build of Windows 10 should have it enabled by default too - and if I understand it right, macOS users should also have it enabled by default (though I don't have a mac, or a Windows machine, to check these on).

For example, if Bob has a home network with a file server on it, that file server might announce its name as bobsfiles. This is automatically translated to the fully-qualified domain name bobsfiles.local.. When Bill comes around to Bob's house and turns on his laptop, it will send a multicast DNS message out to ask all the supporting hosts on the network what their names and IP addresses are, and add them to its cache. Then, all Bill has to do is enter bobsfiles.local. into his web browser (or file manager, SSH client, or any other networked application) to connect to Bob's file server and access Bob's cool rocket designs and cat pictures.
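
If you want to check that MDNS resolution is actually working on a Linux box, a couple of quick commands will tell you (a sketch, assuming avahi-utils and the nss-mdns resolver module are installed - and using the example hostname from above):

# Resolve a .local name directly via the Avahi daemon
avahi-resolve --name bobsfiles.local
# Check that the system resolver (nsswitch + nss-mdns) can see it too
getent hosts bobsfiles.local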

This greatly simplifies the setup of a home network, and allows for pseudo-hostnames even in a local setting! Very cool. At some point, I'd like to refactor my home network to make better use of this - and have 1 MDNS name per service I'm running, rather than using subfolders for everything. This fits in nicely with some clustering plans I have on the horizon too.....

With a bit of fiddling, you can assign multiple MDNS names to a single host too. On Linux, you can use avahi-publish:

avahi-publish --address -R bobsrockets.local X.Y.Z.W

...where X.Y.Z.W is your local machine's IP address, and bobsrockets.local is the .local MDNS domain name you want to assign. Apparently this is a daemon process that needs to keep running in the background, which is a bit of a pain - but hopefully there's a better solution out there somewhere.
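
One way to make it a bit less painful in the meantime is to wrap avahi-publish in a systemd service, so that it starts on boot and restarts itself if it falls over. A rough sketch (the unit name and the /usr/bin/avahi-publish path here are my assumptions - adjust them, along with X.Y.Z.W and the alias, to suit):

sudo tee /etc/systemd/system/mdns-alias-bobsrockets.service >/dev/null <<'EOF'
[Unit]
Description=Publish extra MDNS alias bobsrockets.local
After=avahi-daemon.service
Requires=avahi-daemon.service

[Service]
# Same command as above, just run as a managed background service
ExecStart=/usr/bin/avahi-publish --address -R bobsrockets.local X.Y.Z.W
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now mdns-alias-bobsrockets.service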

Own your code, part 6: The Lantern Build Engine

It's time again for another installment in the own your code series! In the last post, we looked at the git post-receive hook that calls the main git-repo Laminar CI task, which is the core of our Continuous Integration system (which we discussed in the post before that). You can see all the posts in the series so far here.

In this post we're going to travel in the other direction, and look at the build script / task automation engine that I've developed to go hand-in-hand with the Laminar CI system - though it can and does stand on its own too.

Introducing the Lantern Build Engine! Finally, after far too long I'm going to formally post here about it.

Originally developed out of a need to automate the boring and repetitive parts of building and packing my assessed coursework (ACWs) at University, the lantern build engine is my personal task automation system. It's written in 100% Bash, and allows tasks to be easily defined like so:

task_dostuff() {
    task_begin "Doing a thing";
    do_work;
    task_end "$?" "Oops, do_work failed!";

    task_begin "Doing another thing";
    do_hard_work;
    task_end "$?" "Yikes! do_hard_work failed.";
}

When the above task is run, Lantern will automatically detect the dostuff task, since it's a bash function that's prefixed with task_. The task_begin and task_end calls there are 2 other bash functions, which generate pretty output to inform the user that a task is starting or ending. The $? there grabs the exit code from the last command - and if it failed, task_end will automatically display the provided error message.

Tasks are defined in a build.sh file, for which Lantern provides a template. Currently, the template file contains some additional logic such as the help text output if no tasks were specified - which is left-over from the time when Lantern was small enough to fit in the same file as the build tasks themselves.

I'm in the process of adding support for all the logic in the template file, so that I can cut down on the extra boilerplate there even further. After defining your tasks in a copy of the template build file, it's really easy to call them:

./build dostuff

Of course, don't forget to mark the copy of the template file executable with chmod +x ./build.

The above initial example only scratches the surface of what Lantern can do though. It can easily check to see if a given command is installed with check_command:

task_go-to-the-moon() {
    task_begin "Checking requirements";
    check_command git true;
    check_command node true;
    check_command npm true;
    task_end 0;
}

If any of the check_command calls fail, then an error message is printed and the build is terminated.

Work that needs doing in Lantern can be expressed with 3 levels of logical separation: stages, tasks, and subtasks:

task_build-rocket() {
    stage_begin "Preparation";

    task_begin "Gathering resources";
    gather_resources;
    task_end "$?" "Failed to gather resources";

    task_begin "Hiring engineers";
    hire_engineers;
    task_end "$?" "Failed to hire engineers";

    stage_end "$?";

    stage_begin "Building Rocket";
    build_rocket --size big --boosters 99;
    stage_end "$?";

    stage_begin "Launching rocket";
    task_begin "Preflight checks";
    subtask_begin "Checking fuel";
    check_fuel --level full;
    subtask_end "$?" "Error: The fuel tank isn't full!";
    subtask_begin "Loading snacks";
    load_items --type snacks --from warehouse;
    subtask_end "$?" "Error: Failed to load snacks!";
    task_end "$?";

    task_begin "Launching!";
    launch --countdown 10;
    task_end "$?";

    stage_end "$?";
}

Come to think of it, I should probably rename the function prefix from task to job. Stages, tasks, and subtasks each look different in the output - so it's down to personal preference as to which you use and where. Subtasks in particular are best for commands that don't produce any output of their own.

Popular services such as Travis CI have a thing where they display the versions of programs related to the build in the build transcript, like this:

$ uname -a
Linux MachineName 5.3.0-19-generic #20-Ubuntu SMP Fri Oct 18 09:04:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ node --version
v13.0.1
$ npm --version
6.12.1

Lantern provides support for this with the execute command. Prefixing commands with execute will cause them to be printed before being executed, just like the above:

task_prepare() {
    task_begin "Displaying environment details";
    execute uname -a;
    execute node --version;
    execute npm --version;
    task_end "$?";
}

As build tasks get more complicated, it's logical to split them up into multiple tasks that can be called independently and conditionally. Lantern makes this easy too:

task_build() {
    task_begin "Building";
    # Do build stuff here
    task_end "$?";
}
task_deploy() {
    task_begin "Deploying";
    # Do deploy stuff here
    task_end "$?";
}

task_all() {
    tasks_run build deploy;
}

The all task in the above runs both the build and deploy tasks. In fact, the template build script uses tasks_run at the very bottom to treat every argument passed to it as a task name, leading to the behaviour described above.

Lantern also provides an array of other useful functions to make expressing build sequences easy, concise, and readable - from custom colours to testing environment variables to see if they exist. It's all fully documented in the README of the project too.

As described 2 posts ago, the git-repo Laminar CI task (once it's spawned a hologram of itself) currently checks for the existence of a build or build.sh executable script in the root of the repository it is running on, and passes ci as the first and only argument.

This provides easy integration with Lantern, since Lantern build scripts can be called anything we like, and with a tasks_run call at the bottom as in the template file, we can simply define a ci Lantern task function that runs all our continuous integration jobs that we need to execute.

If you're interested in trying out Lantern for yourself, check out the repository!

https://gitlab.com/sbrl/lantern-build-engine#lantern-build-engine

Personally, I use it for everything from CI to rapid development environment setup.

This concludes my (epic) series about my git hosting and continuous integration. We've looked at git hosting, and taken a deep dive into integrating it into a continuous integration system, which we've augmented with a bunch of scripts of our own design. The system we've ended up with, while a lot of work to set up, is extremely flexible, allowing for modifications at will (for example, I have a webhook script that's similar to the git post-receive hook, but is designed to receive notifications from GitHub instead of Gitea and queues the git-repo task just the same).

I'll post a series list post soon. After that, I might blog about my personal apt repository that I've setup, which is somewhat related to this.

Own your code, part 5: git post-receive hook

In the last post, I took a deep dive into the master git-repo job that powers my entire build system. In the next few posts, I'm going to take a look at the bits around the edges that interact with this Laminar job - starting with the git post-receive hook in this post.

When you push commits to a git repository, the remote server does a bunch of work to integrate your changes into the remote master copy of the repository. At various points in the process, git allows you to run scripts to augment your repository, and potentially alter the way git ultimately processes the push. You can send content back to the pushing user too - which is how you get those messages on the command-line occasionally when you push to a GitHub repository.

In our case, we want to queue a new Laminar CI job when new commits are pushed to a private Gitea server (like mine, for instance). Doing this isn't particularly difficult, but we do need to collect a bunch of information about the environment we're running in so that we can correctly inform the git-repo task where it needs to pull the repository from, who pushed the commits, and which commits need testing.

In addition, we want to write 1 universal git post-receive hook script that will work everywhere - regardless of the server the repository is hosted on. Of course, on GitHub you can't run a script directly, but if I ever come into contact with another supporting git server, I want to minimise the amount of extra work I've got to do to hook it up.

Let's jump into the script:

#!/usr/bin/env bash
if [ "${GIT_HOST}" == "" ]; then
    GIT_HOST="git.starbeamrainbowlabs.com";
fi

Fairly standard stuff. Here we set a shebang and specify the GIT_HOST variable if it's not set already. This is mainly just a placeholder for the future, as explained above.

Next, we determine the git repository's URL, because I'm not sure that Gitea (my git server, for which this script is intended) actually tells you directly in a git post-receive hook. The post-receive hook script does actually support HTTPS, but this support isn't currently used, and I'm unsure how the git-repo Laminar CI job would handle an HTTPS URL:

# The url of the repository in question. SSH is recommended, as then you can use a deploy key.
# SSH:
GIT_REPO_URL="git@${GIT_HOST}:${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";
# HTTPS:
# git_repo_url="https://git.starbeamrainbowlabs.com/${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";

With the repository url determined, next on the list is the identity of the pusher. At this stage it's a simple matter of grabbing the value of 1 variable and putting it in another as we're only supporting Gitea at the moment, but in the future we may have some logic here to intelligently determine this value.

GIT_AUTHOR="${GITEA_PUSHER_NAME}";

With the basics taken care of, we can start getting to the more interesting bits. Before we do that though, we should define a few common settings:

###### Internal Settings ######

version="0.2";

# The job name to queue.
job_name="git-repo";

###############################

job_name refers to the name of the Laminar CI job that we should queue to process new commits. version is a value that we can increment should we iterate on this script in the future, so that we can then tell which repositories have the new version of the post-receive hook and which ones don't.

Next, we need to calculate the virtual name of the repository. This is used by the git-repo job to generate a 'hologram' copy of itself that acts differently, as explained in the previous post. This is done through a series of Bash transformations on the repository URL:

# 1. Make lowercase
repo_name_auto="${GIT_REPO_URL,,}";
# 2. Trim git@ & .git from url
repo_name_auto="${repo_name_auto/git@}";
repo_name_auto="${repo_name_auto/.git}";
# 3. Replace unknown characters to make it 'safe'
repo_name_auto="$(echo -n "${repo_name_auto}" | tr -c '[:lower:]' '-')";

The result is quite like 'slugification'. For example, this URL:

git@git.starbeamrainbowlabs.com:sbrl/Linux-101.git

...will get turned into this:

git-starbeamrainbowlabs-com-sbrl-linux----

I actually forgot to allow digits in step #3, but it's a bit awkward to change it at this point :P Maybe at some later time when I'm feeling bored I'll update it and fiddle with Laminar's data structures on disk to move all the affected repositories over to the new naming scheme.
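
For reference, the fix to the script itself would be trivial - it's migrating the existing job names that's the awkward part. Allowing digits should just be a case of adding another character class to the tr call (untested, so treat it as a sketch):

# 3. Replace unknown characters to make it 'safe' - this time keeping digits too
repo_name_auto="$(echo -n "${repo_name_auto}" | tr -c '[:lower:][:digit:]' '-')";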

Now that we've got everything in place, we can start to process the commits that the user has pushed. The documentation on how this is done in a post-receive hook is a bit sparse, so it took some experimenting before I had it right. Turns out that the information we need is provided on the standard input, so a while-read loop is needed to process it:

while read next_line
do
    # .....
done

For each line on the standard input, 3 variables are provided:

  • The old commit reference (i.e. the commit before the one that was pushed)
  • The new commit reference (i.e. the one that was pushed)
  • The name of the reference (usually the branch that the commit being pushed is on)

Commits on multiple branches can be pushed at once, so the name of the branch each commit is being pushed to is kind of important.

Anyway, I pull these into variables like so:

oldref="$(echo "${next_line}" | cut -d' ' -f1)";
newref="$(echo "${next_line}" | cut -d' ' -f2)";
refname="$(echo "${next_line}" | cut -d' ' -f3)";

I think there's some clever Bash trick I've used elsewhere that allows you to pull them all in at once in a single line, but I believe I implemented this before I discovered that trick.
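
For the curious, the trick in question is simply giving read multiple variable names - it splits each line on whitespace for you, so the loop could be condensed to something like this sketch:

while read -r oldref newref refname; do
    # oldref, newref, and refname are pulled straight out of each line of standard input
    echo "Processing ${refname}: ${oldref} -> ${newref}";
done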

With that all in place, we can now (finally) queue the Laminar CI job. This is quite a monster, as it needs to pass a considerable number of variables to the git-repo job itself:

LAMINAR_HOST="127.0.0.1:3100" LAMINAR_REASON="Push from ${GIT_AUTHOR} to ${GIT_REPO_URL}" laminarc queue "${job_name}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${newref}" GIT_REF_NAME="${refname}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_NAME="${repo_name_auto}";

Laminar CI's management socket listens on the abstract unix socket laminar (IIRC). Since you can't yet forward abstract sockets over SSH with OpenSSH, I opt to use a TCP socket instead. To this end, the LAMINAR_HOST prefix there is needed to tell laminarc where to find the management socket that it can use to talk to the Laminar daemon, laminard - since Gitea and Laminar CI run on different servers.

The LAMINAR_REASON there is the message that is displayed in the Laminar CI web interface. Said interface is read-only (by design), but very useful for inspecting what's going on. Messages like this add context as to why a given job was triggered.

Lastly, we should send a message to the pushing user, to let them know that a job has been queued. This can be done with a simple echo, as the standard output is sent back to the client:

echo "[Laminar git hook ${version}] Queued Laminar CI build ("${job_name}" -> ${repo_name_auto}).";

Note that we display the version number of the post-receive hook here. This is how I tell whether I need to go into the Gitea settings to update the hook or not.

With that, the post-receive hook script is complete. It takes a bunch of information lying around, transforms it into a common universal format, and then passes the information on to my continuous integration system - which is then responsible for building the code itself.

Here's the completed script:

#!/usr/bin/env bash

##############################
########## Settings ##########
##############################

# Useful environment variables (gitea):
#   GITEA_REPO_NAME         Repository name
#   GITEA_REPO_USER_NAME    Repo owner username
#   GITEA_PUSHER_NAME       The username that pushed the commits

#   GIT_HOST                Domain name the repo is hosted on. Default: git.starbeamrainbowlabs.com

if [ "${GIT_HOST}" == "" ]; then
    GIT_HOST="git.starbeamrainbowlabs.com";
fi

# The url of the repository in question. SSH is recommended, as then you can use a deploy key.
# SSH:
GIT_REPO_URL="git@${GIT_HOST}:${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";
# HTTPS:
# git_repo_url="https://git.starbeamrainbowlabs.com/${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";

# The user that pushed the commits
GIT_AUTHOR="${GITEA_PUSHER_NAME}";

##############################

###### Internal Settings ######

version="0.2";

# The job name to queue.
job_name="git-repo";

###############################

# 1. Make lowercase
repo_name_auto="${GIT_REPO_URL,,}";
# 2. Trim git@ & .git from url
repo_name_auto="${repo_name_auto/git@}";
repo_name_auto="${repo_name_auto/.git}";
# 3. Replace unknown characters to make it 'safe'
repo_name_auto="$(echo -n "${repo_name_auto}" | tr -c '[:lower:]' '-')";

while read next_line
do
    oldref="$(echo "${next_line}" | cut -d' ' -f1)";
    newref="$(echo "${next_line}" | cut -d' ' -f2)";
    refname="$(echo "${next_line}" | cut -d' ' -f3)";
    # echo "********";
    # echo "oldref: ${oldref}";
    # echo "newref: ${newref}";
    # echo "refname: ${refname}";
    # echo "********";

    LAMINAR_HOST="127.0.0.1:3100" LAMINAR_REASON="Push from ${GIT_AUTHOR} to ${GIT_REPO_URL}" laminarc queue "${job_name}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${newref}" GIT_REF_NAME="${refname}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_NAME="${repo_name_auto}";
    # GIT_REF_NAME and GIT_AUTHOR are used for the LAMINAR_REASON when the git-repo task recursively calls itself
    # GIT_REPO_NAME is used to auto-name hologram copies of the git-repo.run task when recursing
    echo "[Laminar git hook ${version}] Queued Laminar CI build ("${job_name}" -> ${repo_name_auto}).";
done

#cat -;
# YAY what we're after is on the first line of stdin! :D
# The format appears to be documented here: https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks#_server_side_hooks
# Line format:
# oldref newref refname
# There may be multiple lines that all need handling.

In the next post, I want to finally introduce my very own home-brew build engine: lantern. I've used it in over half a dozen different projects by now, so it's high time I talked about it a bit more formally.

Found this interesting? Spotted a mistake? Got a suggestion to improve it? Comment below!

Using Cloudflare for DNS is awesome

Finally, a decent DNS provider! You might not have noticed, but I switched starbeamrainbowlabs.com to Cloudflare DNS the other month. I meant to blog about it at the time, but forgot - so I'm doing it now :P

This comes after a succession of various DNS providers such as GoDaddy and Uniregistry who, while nice enough, didn't really provide what I was after.

The transfer process itself was really rather simple - much moreso than the transfer I did from GoDaddy to Uniregistry.

GoDaddy is way too expensive after the first year, and doesn't allow you to create many different DNS record types - instead preferring to roll them into various 'premium' products which I have neither the money nor the inclination to purchase. After all, you're only paying for a string of characters to be globally unique, and hosting for a small text file containing your DNS records!

Uniregistry is better, but they still don't support record types such as CAA, which let you whitelist who's allowed to issue SSL certificates for your domain.
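
As an aside, it's easy to check whether a domain has any CAA records set from the command line with dig. A quick sketch - with letsencrypt.org purely as an example issuer:

# Query a domain's CAA records; replace yourdomain.com with the domain to check
dig +short CAA yourdomain.com
# A typical record looks something like: 0 issue "letsencrypt.org"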

Cloudflare, however, support all the record types - even ones I've never heard of. It's so cool! They provide the service at cost price (which means it's much cheaper than Uniregistry, and certainly than GoDaddy), and they provide privacy as standard - at no extra cost! No individual should have to pay to hide their full name and address (I'm looking at you, GoDaddy).

You do have to be careful to set it to avoid proxying requests through Cloudflare for each DNS record you add, but this isn't a huge deal. Cloudflare's main business is improving the performance of your website by optimising it and serving it through their global network, after all - so I don't think I can fault them for setting it as the default :P

It's also slightly awkward that you can't actually buy domains through Cloudflare. You have to buy them elsewhere and transfer them in, which is a huge pain. I guess that if they let you buy domains directly, the rest of the domain-name trading business would collapse? Thoughtful of them I suppose, but considering that you can pay literally thousands of pounds for some domain names it does begin to make me wonder......

Anyway, I haven't yet moved starbeamrainbowlabs.co.uk over because Cloudflare don't support .co.uk domains yet - but if I haven't done so by the time this blog post comes out, I'll at least be setting its name servers to Cloudflare very soon (top tip! You can set the name servers for a domain you own to another provider like Cloudflare, even if the domain is registered with a different company like Uniregistry).

Found this interesting? Transferring your domain name over? Got another cool provider? Comment below!

Converting multiline text to an image in PHP

This post is late because I lost the post I had written when I tried to save it - I need to find a new markdown editor. Do let me know if you have any suggestions!

I was working on Pepperminty Wiki earlier, and as I was working on external diagram renderer support (coming soon! It's really cool) I needed to upgrade my errorimage() function to support multi-line text. Pepperminty Wiki's key feature is that it's portable and builds (with a custom build system) to a single file of PHP that's plug-and-play, so no nice easy libraries here!

Instead, said function uses GD to convert text to images. This is useful when you want to send back an error, but need to reply with an image - because the user has referenced the URL in an <img src="" /> tag, which obviously can't display text.

To this end, I found myself wanting to add multi-line text support. This was quite an interesting task, because errorimage() uses GD and Pepperminty Wiki needs to be portable, so I can't use imagettftext() with a nice font. Instead, it uses imagestring(), which draws things in a monospace font. This actually makes the whole process much easier - especially since imagefontwidth() and imagefontheight() give us the width and height of a character respectively.

Here's what the function looked like before:

/**
 * Creates an image containing the specified text.
 * Useful for sending errors back to the client.
 * @package feature-upload
 * @param   string  $text           The text to include in the image.
 * @param   int     $target_size    The target width to aim for when creating
 *                                  the image.
 * @return  resource                The handle to the generated GD image.
 */
function errorimage($text, $target_size = null)
{
    $width = 640;
    $height = 480;

    if(!empty($target_size))
    {
        $width = $target_size;
        $height = $target_size * (2 / 3);
    }

    $image = imagecreatetruecolor($width, $height);
    imagefill($image, 0, 0, imagecolorallocate($image, 238, 232, 242)); // Set the background to #eee8f2
    $fontwidth = imagefontwidth(3);
    imagestring($image, 3,
        ($width / 2) - (($fontwidth * mb_strlen($text)) / 2),
        ($height / 2) - (imagefontheight(3) / 2),
        $text,
        imagecolorallocate($image, 17, 17, 17) // #111111
    );

    return $image;
}

...it's based on a Stack Overflow answer that I can no longer locate. It takes in a string of text, and draws an image with the specified size. This had to change too - since the text might not fit in the image. The awkward thing here is that I needed to maintain the existing support for the $target_size variable, making the code a bit messier than it needed to be.

To start here, let's define a few extra variables to hold some settings:

$width = 0;
$height = 0;
$border_size = 10; // in px; has no effect if $target_size isn't null
$line_spacing = 2; // in px
$font_size = 5; // 1 - 5

Looking better already! Variables like these will help us tune it later (I'm picky). The font size for GD's built-in fonts is a value from 1 to 5, with higher values corresponding to larger font sizes. The $border_size is the number of pixels of padding we want around the text when we're in auto-sizing mode, to make it look neater. The $line_spacing is the number of extra pixels of space we should add between lines to make the text look better.

So, about that image size. We'll need the size of a character for that:

$font_width = imagefontwidth($font_size);   // in px
$font_height = imagefontheight($font_size); // in px

....and we'll need to split the input text into a list of lines too:

$text_lines = array_map("trim", explode("\n", $text));

We use an array_map() call here to ensure we chop the whitespace off the end of each line, because strange whitespace characters lying around will result in odd characters appearing in the output image. If the target size is set, then calculating the actual size of the image is easy:

if(!empty($target_size)) {
    $width = $target_size;
    $height = $target_size * (2 / 3);
}

If not, then we'll have to do some fancier footwork to count the maximum number of characters on a line to find the width of the image, and the number of lines we have for the image height:

else {
    $height = count($text_lines) * $font_height + 
        (count($text_lines) - 1) * $line_spacing +
        $border_size * 2;
    foreach($text_lines as $line)
        $width = max($width, $font_width * mb_strlen($line));
    $width += $border_size * 2;
}

Here we don't forget about the line spacing either - to get the number of gaps between the lines, we take the number of lines minus one. We also add the border as an offset to the width and height - multiplied by 2, because there's a border on both sides of the text.

Next, we need to create an image to draw the text to. This is largely the same as before:

$image = imagecreatetruecolor($width, $height);
imagefill($image, 0, 0, imagecolorallocate($image, 250, 249, 251)); // Set the background to #faf8fb

Now we're ready to draw the text itself. This now needs to be done with a loop, because we've got multiple lines of text to draw - and imagestring() doesn't support that, as we've discussed above. We also need to keep track of the index of the loop, so a temporary variable is required:

$i = 0;
foreach($text_lines as $line) {
    // ....

    $i++;
}

With the loop in place, we can make the call to imagestring():

imagestring($image, $font_size,
    ($width / 2) - (($font_width * mb_strlen($line)) / 2),
    $border_size + $i * ($font_height + $line_spacing),
    $line,
    imagecolorallocate($image, 68, 39, 113) // #442772
);

This looks full of maths, but it's really quite simple. Let's break it down. Lines #2 and #3 there are the $(x, y)$ co-ordinates of the top-left corner at which the text should be drawn. Let's look at them in turn.

The $x$ co-ordinate (($width / 2) - (($font_width * mb_strlen($line)) / 2)) centres the text on the row. Basically, we take the centre of the image ($width / 2), and take away half of the text width - which we calculate by taking the number of characters on the line, multiplying it by the width of a single character, and dividing it by 2.

The $y$ co-ordinate ($border_size + $i * ($font_height + $line_spacing)) is slightly different, because we need to account for the border at the top of the image. We take the font height, add the line spacing, and multiply it by the index of the text line that we're drawing. Since values start from 0 here, this will have no effect for the first line of text that we process, and it'll be drawn at the top of the image. We add to this the border width, to avoid drawing it inside the border that we've allocated around the image.
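
To make that concrete with some example numbers (assuming, purely for illustration, a character width of 9px and height of 15px): for a 200px-wide image and a 10-character line, the $x$ co-ordinate comes out as $200 / 2 - (9 \times 10) / 2 = 100 - 45 = 55$; and with a 10px border and 2px line spacing, the line at index 2 is drawn at $y = 10 + 2 \times (15 + 2) = 44$.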

Lastly, the imagecolorallocate() call there tells GD the colour that we want to draw the text in via RGB. I've added a comment there because my editor highlights certain colour formats with the actual colour they represent, which is cool.

All that's left here is to return the completed image:

return $image;

....then we can do something like this:

if(!empty($error)) {
    http_response_code(503);
    header("content-type: image/png");
    imagepng(errorimage("Error: Something went\nwrong!")); // Note: Don't ever send generic error messages like this one. It makes for a bad and frustrating user (and debugging) experience.
}

I'm including the completed upgrade to the function at the bottom of this blog post. Here's an example image that it can render:

(Image: example output of the upgraded errorimage() function.)

Found this interesting? Done some refactoring of your own recently? Comment below!

Saving space on Linux

While Linux is a whole lot lighter than Windows, there does come a point at which one has to look at reducing the amount of stuff that's on one's hard drive.

Thankfully, there are a number of possible things that we can do on Linux to find and delete large, bulky, and extraneous files, and I thought I'd post about them here.

Firstly, there's the Disk Usage Analyser, or baobab. It's a graphical interface that shows you what your hard drive looks like:

The disk usage analyser, showing my root partition.

Personally, I really appreciate the diagram on the right-hand side - it's a wonderfully visual way of displaying hard disk usage. By right clicking on a directory, you can send it to the recycle bin (don't forget to empty the recycle bin later! Recycle bins on Linux are per-user, so you'll need to make sure you empty them all - you can do root's by doing sudo nautilus and navigating to the recycle bin).

By default the program runs as your current user, so to analyse directories that your user can't read you'll need to launch it with sudo:

sudo baobab

If you don't have a GUI (e.g. if you're trying to clear out a server's hard disk), there's always ncdu. This, unlike the Disk Usage Analyser, isn't installed by default (at least on Ubuntu Server), so you'll need to install it:

sudo apt install ncdu

Then, you can get it to scan a partition:

sudo ncdu -x /

In the above, I'm scanning / (the root partition) with sudo - as not all the files are under my ownership. The -x ensures that ncdu doesn't cross partition boundaries and end up scanning something silly like /proc.

ncdu, showing my server's root partition.

By using these tools, not only was I able to clear out a bunch of files my systems don't need, but I also discovered that /var/log/journal was taking up 4GiB (!) on my laptop's disk. 4 GiB! On systems that use systemd, journald is used to store and manage some log files. It's strange, weird, and I'm not sure I like the opaque storage format, but there you go.

Unlike syslog and logrotate though, it doesn't appear to have a limit set on when it should delete logs. This has to be done manually:

# Show journald disk space usage beforehand
journalctl --disk-usage
sudo nano /etc/systemd/journald.conf
# Add "SystemMaxUse=500M" to the bottom
sudo systemctl kill --kill-who=main --signal=SIGUSR2 systemd-journald.service
sudo systemctl restart systemd-journald.service
# Show journald disk space usage afterwards
journalctl --disk-usage

(Source: this Unix StackExchange answer)

Found this helpful? Got another great tip to save space on disk? Comment below!
