Archive

## Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

## I've got an apt repository, and you can too

Hey there!

In this post, I want to talk about my apt repository. I've had it for a while, but since it's been working well for me I thought I'd announce it to wider world on here.

For those not in the know, an apt repository is a repository of software in a particular format that the apt package manager (found on Debian-based distributions such as Ubuntu) use to keep software on a machine up-to-date.

The apt package manager queries all repositories it has configured to find out what versions of which packages they have available, and then compares this with those locally installed. Any packages out of date then get upgraded, usually after prompting you to install the updates.

Linux distributions based on Debian come with a large repository of software, but it doesn't have everything. For this reason, extra repositories are often used to deliver updates to software automatically from third parties.

In my case, I've been finding increasingly that I've been wanting to deliver updates for software that isn't packaged for installation with apt to a number of different machines. Every time I get around to installing update it felt like it was time to install another, so naturally I got frustrated enough with it that I decided to automate my problems away by scripting my own apt repository!

My apt repository can be found here: https://starbeamrainbowlabs.com/

It comes in 2 parts. Firstly, there's the repository itself - which is managed by a script that's based on my lantern build engine. It's this I'll be talking about in this post.

Secondly, I have a number of as yet adhoc custom Laminar job scripts for automatically downloading various software projects from GitHub, such that all I have to do is run laminarc queue apt-softwarename and it'll automatically package the latest version and upload it to the repository itself, which has a cron job set to fold in all of the new packages at 2am every night. The specifics of this are best explain in another post.

Currently this process requires me to login and run the laminarc command manually, but I intend to automate this too in the future (I'm currently waiting for a new release of beehive to fix a nasty bug for this).

Anyway, currently I have the following software packaged in my repository:

• Gossa - A simple HTTP file browser
• The Tiled Map Editor - An amazing 2D tile-based graphical map editor. You should sponsor the developer via any of the means on the Tiled Map Editor's website before using my apt package.
• tldr-missing-pages - A small utility script for finding tldr-pages to write
• webhook - A flexible webhook system that calls binaries and shell scripts when a HTTP call is made
• I've also got a pleaserun-based service file generator packaged for this too in the webhook-service package

Of course, more will be coming as and when I discover and start using cool software.

The repository itself is driven by a set of scripts. These scripts were inspired by a stack overflow post that I have since lost, but I made a number of usability improvements and rewrote it to use my lantern build engine as I described above. I call this improved script aptosaurus, because it sounds cool.

To use it, first clone the repository:

git clone https://git.starbeamrainbowlabs.com/sbrl/aptosaurus.git

Then, create a new GPG key to sign your packages with:

gpg --full-generate-key

Next, we need to export the new keypair to disk so that we can use it in scripts. Do that like this:

# Identify the key's ID in the list this prints out
gpg --list-secret-keys
# Export the secret key
gpg --export-secret-keys --armor INSERT_KEY_ID_HERE >secret.gpg
chmod 0600 secret.gpg # Don't forget to lock down the permissions
# Export the public key
gpg --export --armor INSERT_KEY_ID_HERE >public.gpg

Then, run the setup script:

./aptosaurus.sh setup

It should warn you if anything's amiss.

With the setup complete, you can new put your .deb packages in the sources subdirectory. Once done, run the update command to fold them into the repository:

./aptosaurus.sh update

Now you've got your own repository! Your next step is to setup a static web server to serve the repo subdirectory (which contains the repo itself) to the world! Personally, I use Nginx with the following config:

server {
listen  80;
listen  [::]:80;
listen  443 ssl http2;
listen  [::]:443 ssl http2;

server_name apt.starbeamrainbowlabs.com;
ssl_certificate     /etc/letsencrypt/live/$server_name/fullchain.pem; ssl_certificate_key /etc/letsencrypt/live/$server_name/privkey.pem;

add_header link '<https://starbeamrainbowlabs.com$request_uri>; rel="canonical"'; index index.html; root /srv/aptosaurus/repo; include /etc/nginx/snippets/letsencrypt.conf; autoindex off; fancyindex on; fancyindex_exact_size off; fancyindex_header header.html; #location ~ /.well-known { # root /srv/letsencrypt; #} } This requires the fancyindex module for Nginx, which can be installed with sudo apt install libnginx-mod-http-fancyindex on Ubuntu-based systems. To add your new apt repository to a machine, simply follow the instructions for my repository, replacing the domain name and the key ids with yours. Hopefully this release announcement-turned-guide has been either interesting, helpful, or both! Do let me know in the comments if you encounter any issues. If there's enough interest I'll migrate the code to GitHub from my personal Git server if people want to make contributions (express said interest in the comments below). It's worth noting that this is only a very simply apt repository. Larger apt repositories are sectioned off into multiple categories by distribution and release status (e.g. the Ubuntu repositories have xenial, bionic, eoan, etc for the version number of Ubuntu, and main, universe, multiverse, restricted, etc for the different categories of software). If you setup your own simple apt repository using this guide, I'd love it if you could let me know with a comment below too. ## Own Your Code Series List Hey there! It's time for another series list. This time it's for my Own Your Code series, where I take a look into Gitea and Laminar CI. Following this series, I plan to also post about my apt repository, which is hosting a growing list of software - including the tiled map editor (support them with a donation if you can), gossa (a minimalist file browser interface), and webhook - if you find any issues, you can always get in touch. Anyway, here's the full list of posts in the series in the Own Your Code series: In the unlikely event I post another entry in this series, I'll come back and update this list. Most likely though I'll be posting related things standalone, rather than part of this series - so subscribe for updates with your favourite method if you'd like to stay up-to-date with my latest blog posts (Atom/RSS, Email, Twitter, Reddit, and Facebook are all supported - just ask if there's something missing). ## Exporting an SQLite3 database to a directory of CSV files Recently I was working with a dataset I acquired for my PhD, and to pre-process said dataset into something more sensible I imported it into an SQLite3 database. Once I was finished processing it, I then needed to export it again into regular CSV files so that I could do other things, such as plot it with GNUPlot, or import it into InfluxDB (more on InfluxDB in a later post). With the help of Stack Overflow and the SQLite3 man page, this didn't prove to be too difficult. To export a single SQLite3 table to a CSV file, you do this: sqlite3 -bail -header -csv "bobsrockets.sqlite3" "SELECT * FROM 'table_name';" >"path/to/output_file.csv"; This is great for a single table, but what if we want to export all the tables? Well, we can iterate over all the tables in an SQLite3 database like so: while read table_name; do echo "Exporting${table_name}";

# Do stuff
done < <(sqlite3 "bobsrockets.sqlite3" ".tables");

If we combine this with the previous snippet, we can export all the tables like so:

while read table_name; do
log "Exporting ${table_name}"; sqlite3 -bail -header -csv "bobsrockets.sqlite3" "SELECT * FROM '${table_name}';" >"${table_name}.csv"; done < <(sqlite3 "bobsrockets.sqlite3" ".tables"); Cool! We can make it even better with some simple improvements though: 1. It's a pain to have to edit the script every time we want to change the database we're exporting 2. It would be nice to be able to specify the output directory without editing the script too Satisfying both of these points isn't particularly challenging. 10 minutes of fiddling got this the final completed script: #!/usr/bin/env bash set -e; # Don't allow errors show_usage() { echo -e "Usage:"; echo -e "\t./sqlite2csv.sh {db_filename} {output_dir}"; } log() { echo -e "[$(date +"%F %T") ] ${@}"; } ############################################################################### db_filename="${1}";
output_dir="${2}"; if [ -z "${db_filename}" ]; then
echo "Error: No database filename specified.";
show_usage; exit;
fi
if [ -z "${output_dir}" ]; then echo "Error: No output directory specified."; show_usage; exit; fi if [ ! -d "${output_dir}" ]; then
mkdir -p "${output_dir}"; fi log "Output directory is${output_dir}";

log "Exporting ${table_name}"; sqlite3 -bail -header -csv "${db_filename}" "SELECT * FROM '${table_name}';" >"${output_dir}/${table_name}.csv"; done < <(sqlite3 "${db_filename}" ".tables");

log "Complete!";

Found this useful? Comment below!

## Own your code, part 6: The Lantern Build Engine

It's time again for another installment in the own your code series! In the last post, we looked at the git post-receive hook that calls the main git-repo Laminar CI task, which is the core of our Continuous Integration system (which we discussed in the post before that). You can see all the posts in the series so far here.

In this post we're going to travel in the other direction, and look at the build script / task automation engine that I've developed that goes hand-in-hand with the Laminar CI system - though it can and does stand on it's own too.

Introducing the Lantern Build Engine! Finally, after far too long I'm going to formally post here about it.

Originally developed out of a need to automate the boring and repetitive parts of building and packing my assessed coursework (ACWs) at University, the lantern build engine is my personal task automation system. It's written in 100% Bash, and allows tasks to be easily defined like so:

task_dostuff() {
do_work;
task_end "$?" "Oops, do_work failed!"; task_begin "Doing another thing"; do_hard_work; task_end "$?" "Yikes! do_hard_work failed.";
}

When the above task is run, Lantern will automatically detect the dustuff task, since it's a bash function that's prefixed with task_. The task_begin and task_end calls there are 2 other bash functions, which generate pretty output to inform the user that a task is starting or ending. The $? there grabs the exit code from the last command - and if it fails task_end will automatically display the provided error message. Tasks are defined in a build.sh file, for which Lantern provides a template. Currently, the template file contains some additional logic such as the help text output if no tasks were specified - which is left-over from the time when Lantern was small enough to fit in the same file as the build tasks themselves. I'm in the process of adding support for the all the logic in the template file, so that I can cut down on the extra boilerplate there even further. After defining your tasks in a copy of the template build file, it's really easy to call them: ./build dostuff Of course, don't forget to mark the copy of the template file executable with chmod +x ./build. The above initial example only scratches the surface of what Lantern can do though. It can easily check to see if a given command is installed with check_command: task_go-to-the-moon() { task_begin "Checking requirements"; check_command git true; check_command node true; check_command npm true; task_end 0; } If any of the check_command calls fail, then an error message is printed and the build terminated. Work that needs doing in Lantern can be expressed with 3 levels of logical separation: stages, tasks, and subtasks: task_build-rocket() { stage_begin "Preparation"; task_begin "Gathering resources"; gather_resources; task_end "$?" "Failed to gather resources";

hire_engineers;
task_end "$?" "Failed to hire engineers"; stage_end "$?";

stage_begin "Building Rocket";
build_rocket --size big --boosters 99;
stage_end "$?"; stage_begin "Launching rocket"; task_begin "Preflight checks"; subtask_begin "Checking fuel"; check_fuel --level full; subtask_end "$?" "Error: The fuel tank isn't full!";
subtask_end "$?" "Error: Failed to load snacks!"; task_end "$?";

launch --countdown 10;
task_end "$?"; stage_end "$?";
}

Come to think about it, I should probably rename the function prefix from task to job. Stages, tasks, and subtasks each look different in the output - so it's down to personal preference as to which one you use and where. Subtasks in particular are best for commands that don't return any output.

Popular services such as [Travis CI]() have a thing where in the build transcript they display the versions of related programs to the build, like this:

$uname -a Linux MachineName 5.3.0-19-generic #20-Ubuntu SMP Fri Oct 18 09:04:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux$ node --version
v13.0.1
$npm --version 6.12.1 Lantern provides support for this with the execute command. Prefixing commands with execute will cause them to be printed before being executed, just like the above: task_prepare() { task_begin "Displaying environment details"; execute uname -a; execute node --version; execute npm --version; task_end "$?";
}

As build tasks get more complicated, it's logical to split them up into multiple tasks that can be called independently and conditionally. Lantern makes this easy too:

task_build() {
# Do build stuff here
task_end "$?"; } task_deploy() { task_begin "Deploying"; # Do deploy stuff here task_end "$?";
}

}

The all task in the above runs both the build and deploy tasks. In fact, the template build script uses tasks_run at the very bottom to treat every argument passed to it as a task name, leading to the behaviour described above.

Lantern also provides an array of other useful functions to make expressing build sequences easy, concise, and readable - from custom colours to testing environment variables to see if they exist. It's all fully documented in the README of the project too.

As described 2 posts ago, the git-repo Laminar CI task (once it's spawned a hologram of itself) currently checks for the existence of a build or build.sh executable script in the root of the repository it is running on, and passes ci as the first and only argument.

This provides easy integration with Lantern, since Lantern build scripts can be called anything we like, and with a tasks_run call at the bottom as in the template file, we can simply define a ci Lantern task function that runs all our continuous integration jobs that we need to execute.

If you're interested in trying out Lantern for yourself, check out the repository!

https://gitlab.com/sbrl/lantern-build-engine#lantern-build-engine

Personally, I use it for everything from CI to rapid development environment setup.

This concludes my (epic) series about my git hosting and continuous integration. We've looked at git hosting, and taken a deep dive into integrating it into a continuous integration system, which we've augmented with a bunch of scripts of our own design. The system we've ended up with, while a lot of work to setup, is extremely flexible, allowing for modifications at will (for example, I have a webhook script that's similar to the git post-receive hook, but is designed to receive notifications from GitHub instead of Gitea and queue the git-repo just the same).

I'll post a series list post soon. After that, I might blog about my personal apt repository that I've setup, which is somewhat related to this.

In the last post, I took a deep dive into the master git-repo job that powers the my entire build system. In the next few posts, I'm going to take a look at the bits around the edges that interact with this laminar job - starting with the git post-receive hook in this post.

When you push commits to a git repository, the remote server does a bunch of work to integrate your changes into the remote master copy of the repository. At various points in the process, git allows you to run scripts to augment your repository, and potentially alter the way git ultimately processes the push. You can send content back to the pushing user too - which is how you get those messages on the command-line occasionally when you push to a GitHub repository.

In our case, we want to queue a new Laminar CI job when new commits are pushed to a private Gitea server, for instance (like mine). Doing this isn't particularly difficult, but we do need to collect a bunch of information about the environment we're running in so that we can correctly inform the git-repo task where it needs to pull the repository from, who pushed the commits, and which commits need testing.

In addition, we want to write 1 universal git post-receive hook script that will work everywhere - regardless of the server the repository is hosted on. Of course, on GitHub you can't run a script directly, but if I ever come into contact with another supporting git server, I want to minimise the amount of extra work I've got to do to hook it up.

Let's jump into the script:

#!/usr/bin/env bash
if [ "${GIT_HOST}" == "" ]; then GIT_HOST="git.starbeamrainbowlabs.com"; fi Fairly standard stuff. Here we set a shebang and specify the GIT_HOST variable if it's not set already. This is mainly just a placeholder for the future, as explained above. Next, we determine the git repository's url, because I'm not sure that Gitea (my git server, for which this script is intended) actually tells you directly in a git post-receive hook. The post-receive hook script does actually support HTTPS, but this support isn't currently used and I'm unsure how the git-repo Laminar CI job would handle a HTTPS url: # The url of the repository in question. SSH is recommended, as then you can use a deploy key. # SSH: GIT_REPO_URL="git@${GIT_HOST}:${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";
# HTTPS:
# git_repo_url="https://git.starbeamrainbowlabs.com/${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";

With the repository url determined, next on the list is the identity of the pusher. At this stage it's a simple matter of grabbing the value of 1 variable and putting it in another as we're only supporting Gitea at the moment, but in the future we may have some logic here to intelligently determine this value.

GIT_AUTHOR="${GITEA_PUSHER_NAME}"; With the basics taken care of, we can start getting to the more interesting bits. Before we do that though, we should define a few common settings: ###### Internal Settings ###### version="0.2"; # The job name to queue. job_name="git-repo"; ############################### job_name refers to the name of the Laminar CI job that we should queue to process new commits. version is a value that we can increment should we iterate on this script in the future, so that we can then tell which repositories have the new version of the post-receive hook and which ones don't. Next, we need to calculate the virtual name of the repository. This is used by the git-repo job to generate a 'hologram' copy of itself that acts differently, as explained in the previous post. This is done through a series of Bash transformations on the repository URL: # 1. Make lowercase repo_name_auto="${GIT_REPO_URL,,}";
# 2. Trim git@ & .git from url
repo_name_auto="${repo_name_auto/git@}"; repo_name_auto="${repo_name_auto/.git}";
# 3. Replace unknown characters to make it 'safe'
repo_name_auto="$(echo -n "${repo_name_auto}" | tr -c '[:lower:]' '-')";

The result is quite like 'slugification'. For example, this URL:

git@git.starbeamrainbowlabs.com:sbrl/Linux-101.git

...will get turned into this:

git-starbeamrainbowlabs-com-sbrl-linux----

I actually forgot to allow digits in step #3, but it's a bit awkward to change it at this point :P Maybe at some later time when I'm feeling bored I'll update it and fiddle with Laminar's data structures on disk to move all the affected repositories over to the new naming scheme.

Now that we've got everything in place, we can start to process the commits that the user has pushed. The documentation on how this is done in a post-receive hook is a bit sparse, so it took some experimenting before I had it right. Turns out that the information we need is provided on the standard input, so a while-read loop is needed to process it:

while read next_line
do
# .....
done

For each line on the standard input, 3 variables are provided:

• The old commit reference (i.e. the commit before the one that was pushed)
• The new commit reference (i.e. the one that was pushed)
• The name of the reference (usually the branch that the commit being pushed is on)

Commits on multiple branches can be pushed at once, so the name of the branch each commit is being pushed to is kind of important.

Anyway, I pull these into variables like so:

oldref="$(echo "${next_line}" | cut -d' ' -f1)";
newref="$(echo "${next_line}" | cut -d' ' -f2)";
refname="$(echo "${next_line}" | cut -d' ' -f3)";

I think there's some clever Bash trick I've used elsewhere that allows you to pull them all in at once in a single line, but I believe I implemented this before I discovered that trick.

With that all in place, we can now (finally) queue the Laminar CI job. This is quite a monster, as it needs to pass a considerable number of variables to the git-repo job itself:

LAMINAR_HOST="127.0.0.1:3100" LAMINAR_REASON="Push from ${GIT_AUTHOR} to${GIT_REPO_URL}" laminarc queue "${job_name}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${newref}" GIT_REF_NAME="${refname}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_NAME="${repo_name_auto}"; Laminar CI's management socket listens on the abstract unix socket laminar (IIRC). Since you can't yet forward abstract sockets over SSH with OpenSSH, I instead opt to use a TCP socket instead. To this end, the LAMINAR_HOST prefix there is needed to tell laminarc where to find the management socket that it can use to talk to the Laminar daemon, laminard - since Gitea and Laminar CI run on different servers. The LAMINAR_REASON there is the message that is displayed in the Laminar CI web interface. Said interface is read-only (by design), but very useful for inspecting what's going on. Messages like this add context as to why a given job was triggered. Lastly, we should send a message to the pushing user, to let them know that a job has been queued. This can be done with a simple echo, as the standard output is sent back to the client: echo "[Laminar git hook${version}] Queued Laminar CI build ("${job_name}" ->${repo_name_auto}).";

Note that we display the version number of the post-receive hook here. This is how I tell whether I need to give into the Gitea settings to update the hook or not.

With that, the post-receive hook script is complete. It takes a bunch of information lying around, transforms it into a common universal format, and then passes the information on to my continuous integration system - which is then responsible for building the code itself.

Here's the completed script:

#!/usr/bin/env bash

##############################
########## Settings ##########
##############################

# Useful environment variables (gitea):
#   GITEA_REPO_NAME         Repository name
#   GITEA_PUSHER_NAME       The username that pushed the commits

#   GIT_HOST                Domain name the repo is hosted on. Default: git.starbeamrainbowlabs.com

if [ "${GIT_HOST}" == "" ]; then GIT_HOST="git.starbeamrainbowlabs.com"; fi # The url of the repository in question. SSH is recommended, as then you can use a deploy key. # SSH: GIT_REPO_URL="git@${GIT_HOST}:${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";
# HTTPS:
# git_repo_url="https://git.starbeamrainbowlabs.com/${GITEA_REPO_USER_NAME}/${GITEA_REPO_NAME}.git";

# The user that pushed the commits
GIT_AUTHOR="${GITEA_PUSHER_NAME}"; ############################## ###### Internal Settings ###### version="0.2"; # The job name to queue. job_name="git-repo"; ############################### # 1. Make lowercase repo_name_auto="${GIT_REPO_URL,,}";
# 2. Trim git@ & .git from url
repo_name_auto="${repo_name_auto/git@}"; repo_name_auto="${repo_name_auto/.git}";
# 3. Replace unknown characters to make it 'safe'
repo_name_auto="$(echo -n "${repo_name_auto}" | tr -c '[:lower:]' '-')";

do
oldref="$(echo "${next_line}" | cut -d' ' -f1)";
newref="$(echo "${next_line}" | cut -d' ' -f2)";
refname="$(echo "${next_line}" | cut -d' ' -f3)";
# echo "********";
# echo "oldref: ${oldref}"; # echo "newref:${newref}";
# echo "refname: ${refname}"; # echo "********"; LAMINAR_HOST="127.0.0.1:3100" LAMINAR_REASON="Push from${GIT_AUTHOR} to ${GIT_REPO_URL}" laminarc queue "${job_name}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${newref}" GIT_REF_NAME="${refname}" GIT_AUTHOR="${GIT_AUTHOR}" GIT_REPO_NAME="${repo_name_auto}";
# GIT_REF_NAME and GIT_AUTHOR are used for the LAMINAR_REASON when the git-repo task recursively calls itself
# GIT_REPO_NAME is used to auto-name hologram copies of the git-repo.run task when recursing
echo "[Laminar git hook ${version}] Queued Laminar CI build ("${job_name}" -> ${repo_name_auto})."; done #cat -; # YAY what we're after is on the first line of stdin! :D # The format appears to be documented here: https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks#_server_side_hooks # Line format: # oldref newref refname # There may be multiple lines that all need handling. In the next post, I want to finally introduce my very own home-brew build engine: lantern. I've used it in over half a dozen different projects by now, so it's high time I talked about it a bit more formally. Found this interesting? Spotted a mistake? Got a suggestion to improve it? Comment below! ## Next Gen Search, Part 2: Pushing the limits In the last part, we looked at how I built a new backend for storing inverted indexes for Pepperminty Wiki, which allows for partial index deserialisation and other nice features that boost performance considerably. Since the last post, I've completed work on the new search system - though there are a few bits around the edges that I still want to touch up and do some more work on. In this post though, I want to talk about how I generated test data to give my full-text search engine something to chew on. I've done this before for my markov chain program I wrote a while back, and that was so much fun I did it again for my search engine here. After scratching my head for a bit to think of a data source I could use, I came up with the perfect plan. Ages ago I downloaded a Wikipedia dump - just the content pages in Wikitext markup. Why not use that? As it turns out, it was a rather good idea. Some processing of said dump was required though to transform it into a format that Pepperminty Wiki can understand, though. Pepperminty Wiki stores pages on disk as flat text files in markdown, and indexes them in pageindex.json. If pageindex.json doesn't exist, then Pepperminty Wiki rebuilds it automagically by looking for content pages on disk. This makes it easy to import batches of new pages into Pepperminty Wiki, so all we need to do is extract the wiki text, convert to markdown, and import! This ended up requiring a number of different separate steps though, so let's take it 1 at a time First, we need a Wikipedia database dump in the XML format. These are available from dumps.wikimedia.org. There are many different ones available, but I suggest grabbing one that has a filename similar to enwiki-20180201-pages-articles.xml - i.e. just content pages - no revision history, user pages, or additional extras. I think the most recent one as of the time of posting is downloadable here - though I'd warn you that it's 15.3GiB in size! You can see a list of available dump dates for the English Wikipedia here. Now that we've got our dump, let's extract the pages from it. This is nice and easy to do with wikiextractor on GitHub: nice -n20 wikiextractor enwiki-20180201-pages-articles.xml --no_templates --html --keep_tables --lists --links --sections --json --output wikipages --compress --bytes 25M >progress.log 2>&1  This will parse the dump and output a number of compressed files to the wikipages directory. These will have 1 JSON object per line, each containing information about a single page on Wikipedia - with page content pre-converted to HTML for us. The next step is to extract the page content and save it to a file with the correct name. This ended up being somewhat complicated, so I wrote a quick Node.js script to do the job: #!/usr/bin/env node const readline = require("readline"); const fs = require("fs"); if(!fs.existsSync("pages")) fs.mkdirSync("pages", { mode: 0o755 }); // From https://stackoverflow.com/a/44195856/1460422 function html_unentities(encodedString) { var translate_re = /&(nbsp|amp|quot|lt|gt);/g; var translate = { "nbsp":" ", "amp" : "&", "quot": "\"", "lt" : "<", "gt" : ">" }; return encodedString.replace(translate_re, function(match, entity) { return translate[entity]; }).replace(/&#(\d+);/gi, function(match, numStr) { var num = parseInt(numStr, 10); return String.fromCharCode(num); }); } const interface = readline.createInterface({ input: process.stdin, //output: process.stdout }); interface.on("line", (text) => { const obj = JSON.parse(text); fs.writeFileSync(pages/${obj.title.replace(/\//, "-")}.html, html_unentities(obj.text));
console.log(${obj.id}\t${obj.title});
});

This basically takes the stream of JSON object on the standard input, parses them, and saves the relevant content to disk. We can invoke it like so:

bzcat path/to/*.bz2 | ./parse.js

Don't forget to chmod +x parse.js if you get an error here. The other important thing about the above script ist hat we have to unescape the HTML entities (e.g. &gt;), because otherwise we'll have issues later with HTML conversation and page names will look odd. This is done by the html_unentities() function in the above script.

This should result in a directory containing a large number of files - 1 file per content page. This is much better, but we're still not quite there yet. Wikipedia uses wiki markup (which we converted to HTML with wikiextractor) and Pepperminty Wiki uses Markdown - the 2 of which are, despite all their similarities, inherently incompatible. Thankfully, pandoc is capable of converting from HTML to markdown.

Pandoc is great at this kind of thing - it uses an intermediate representation and allows you to convert almost any type of textual document format to any other format. Markdown to PDF, EPUB to plain text, ..... and HTML to markdown (just to name a few). It actually looks like it shares a number of features with traditional compilers like GCC.

Anyway, let's use it to convert our folder full of wikitext files to a folder full of markdown:

mkdir -p pages_md;
find pages/ -type f -name "*.html" -print0 | nice -n20 xargs -P4 -0 -n1 -I{} sh -c 'filename="{}"; title="${filename##*/}"; title="${title%.*}"; pandoc --from "html"  --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md"; echo "${title}";'; _(See this on explainshell.com - doesn't include the nice -n20 due to a bug on their end)_ This looks complicated, but it really isn't. Let's break it down a bit: find pages/ -type f -name "*.html" -print0 This finds all the HTML files that we want to convert to Markdown, and delimits the output with a NUL byte - i.e. 0x0. This makes the next step easier: ... | nice -n20 xargs -P4 -0 -n1 -I{} sh -c '....' This pushes a new xargs instance into the background, which will execute 4 commands at a time. xargs executes a command for each line of input it receives. In our case, we're delimiting with NUL 0x0 instead though. We explicitly specify that we can 1 command per line of input though, as xargs tries to optimise and do command file1 file2 file3 instead. The sh -c bit is starting a subshell, in which we execute a small wrapper script that then calls pandoc. This is of course inefficient, but I couldn't find any way around spawning a subshell in this instance. filename="{}"; title="${filename##*/}";
title="${title%.*}"; pandoc --from "html" --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md"; echo "${title}";

I've broken the sh -c subshell script down into multiple liens for readability. Simply put, it extracts the page title from the filename, converts the HTML to Markdown, and saves it to a new file in a different directory with the .md replacing the original .html extension.

When you put all these components together, you get a script that converts a folder full of HTML files to Markdown. Just like with the markov chains extraction I mentioned at the beginning of this post, Bash and shell scripting really is all about lego bricks. This is due in part to the Unix philosophy:

Make each program do one thing well.

There is more to it, but this is the most important point to remember. Many of the core utilities you'll find on the terminal follow this way of thinking.

There's 1 last thing we need to take care of before we have them in the right format though - we need to convert the [display text](page name) markdown-format links back into the Wikipedia [[internal link]] format that Pepperminty Wiki also uses.

Thankfully, another command-line tool I know of called repren is well-suited to this:

repren --from '$([^$]+)\]$([^):]+)$' --to '[[\1]]' pages_md/*.md

It took some fiddling, but I got all the escaping figured out and the above converts back into the [[internal link]] format well enough.

Now that we've got our folder full of markdown files, we need to extract a random portion of them to act as a test for Pepperminty Wiki - as the whole lot might be a bit much for it to handle (though if Pepperminty Wiki was capable of handling it all eventually that'd be awesome :D). Let's try 500 pages to start:

find path/to/wikipages/ -type f -name "*.md" -print0 | shuf --zero-terminated | head -n500 --zero-terminated | xargs -0 -n1 -I{} cp "{}" .

(See this on explainshell.com)

This is another lego-brick style command. Let's break it down too:

find path/to/wikipages/ -type f -name "*.md" -print0

This lists all the .md files in a directory, delimiting them with a NUL character, as before. It's better to do this than use ls, as find is explicitly designed to be machine-readable.

.... | shuf --zero-terminated

The shuf command randomly shuffles the input lines. In this case, we're telling it that the input is delimited by the NUL byte.

.... | head -n500 --zero-terminated

Similar deal here. head takes the top N lines of input, and discards the rest.

.... | xargs -0 -n1 -I{} cp "{}" .

Finally, xargs calls cp to copy the selected files to the current directory - which is, in this case, the root directory of my test Pepperminty Wiki instance.

Since I'm curious, let's now find out roughly how many words we're dealing with here:

cat data_test/*.md | wc --words
1593190


1.5 million words! That's a lot. I wonder how quickly we can search that?

24.8ms? Awesome! That's so much better than before. If you're wondering about new coat of paint in the screenshot - Pepperminty Wiki is getting dark theme, thanks to prefers-color-scheme :D

I wonder what happens if we push it to 2K pages?

This time we get ~120ms for 5.9M total words - wow! I wasn't expecting it to perform so well. At this scale, rebuilding the entire index is particularly costly - so if I was to push it even further it would make sense to implement an incremental approach that spreads the work over multiple requests, assuming I can't squeeze any more performance out the system as-is.

The last thing I want to do here is make a rough estimate about time time complexity the search system as-is, given the data we have so far. This isn't particularly difficult to do.

Given the results above, we can calculate that at 1.5M total words, an increase of ~60K total words results in an increase of 1ms of execution time. At 5.9M words, it's only ~49K words / ms of execution time - a drop of ~11K words / ms of execution time.

From this, we can speculate that for every million words total added to a wiki, we can expect a ~2.5K words / ms of execution time drop - not bad! We'd need more data points to make any reasonable guess as to the Big-O complexity function that it conforms to. My guess would be something like $O(xN^2)$, where x is a constant between ~0.2 and 2.

Maybe at some point I'll go to the trouble of running enough tests to calculate it, but with all the variables that affect the execution time (number of pages, distribution of words across pages, etc.), I'm not in any hurry to calculate it. If you'd like to do so, go ahead and comment below!

Next time, I'll unveil the inner working of the STAS: my new search-term analysis system.

Found this interesting? Got your own story about some cool code you've written to tell? Comment below!

## Own your code, part 4: Laminar CI

In the last post, I talked at a high level about the infrastructure behind my continuous integration and deployment system. In this post, I'm going to dive into the details of the Laminar CI job is the engine that drives the whole system.

Laminar CI is based on a concept of jobs. The docs explain it quite well, but in short each job is a file in the jobs folder with the file extension run and a shebang. In my case, I'm using Bash - and I'll continue to do so at regular intervals throughout this series.

Unlike most other setups, the Laminar CI job that we'll be writing here won't actually do any of the actual CI tasks itself - it will simply act as a proxy script to setup & manage the execution of the actual build system - which, in this case, will be the lantern build engine, an engine I wrote to aid me with automating repetitive tasks when working on my University ACWs (Assessed CourseWork).

Every job has it's own workspace, which acts as a common area to store and cache various files across all the runs of that job. Each run of a job also has it's very own private area too - which will be useful later on.

The first step in this proxy script is to extract the parameters of the run that we're supposed to be doing. For me, I store this in a number of environment variables, which are set when queuing the job run from the git post-receive (or web) hook:

Variable Example Description
GIT_REPO_NAME git-starbeamrainbowlabs-com-sbrl-rhinoreminds The safe name of the repository that we're running against, with potentially troublesome characters removed.
GIT_REF_NAME refs/heads/master Basically the branch that we're working on. Useful for logging purposes.
GIT_REPO_URL git@git.starbeamrainbowlabs.com:sbrl/rhinoreminds.git The URL of the repository that we're running against.
GIT_COMMIT_REF e23b2e0.... The exact commit to check out and build.
GIT_AUTHOR The friendly name of the author that pushed the commit. Useful for logging purposes.

Before we do anything else, we need to make sure that these variables are defined:

set -e; # Don't allow errors

# Check that all the right variables are present
if [ -z "${GIT_REPO_NAME}" ]; then echo -e "Error: The environment variable GIT_REPO_NAME isn't set." >&2; exit 1; fi if [ -z "${GIT_REF_NAME}" ]; then echo -e "Error: The environment variable GIT_REF_NAME isn't set." >&2; exit 1; fi
if [ -z "${GIT_REPO_URL}" ]; then echo -e "Error: The environment variable GIT_REPO_URL isn't set." >&2; exit 1; fi if [ -z "${GIT_COMMIT_REF}" ]; then echo -e "Error: The environment variable GIT_COMMIT_REF isn't set." >&2; exit 1; fi
if [ -z "${GIT_AUTHOR}" ]; then echo -e "Error: The environment variable GIT_AUTHOR isn't set." >&2; exit 1; fi There are a bunch of other variables that I'm omitting here, since they are dynamically determined by from the build variables. I extract many of these additional variables using regular expressions. For example: GIT_REF_TYPE="$(regex_match "${GIT_REF_NAME}" 'refs/([a-z]+)')"; GIT_REF_TYPE is the bit after the refs/ and before the actual branch or tag name. It basically tells us whether we're building against a branch or a tag. That regex_match function is a utility function that I found in the pure bash bible - which is an excellent resource on various tips and tricks to do common tasks without spawning subprocesses - and therefore obtaining superior performance and lower resource usage. Here it is: # @source https://github.com/dylanaraps/pure-bash-bible#use-regex-on-a-string # Usage: regex "string" "regex" regex_match() { [[$1 =~ $2 ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

Very cool. For completeness, here are the remainder of the secondary environment variables. Many of them aren't actually used directly - instead they are used indirectly by other scripts and lantern build engine tasks that we call from the main Laminar CI job.

if [[ "${GIT_REF_TYPE}" == "tags" ]]; then GIT_TAG_NAME="$(regex_match "${GIT_REF_NAME}" 'refs/tags/(.*)$')";
fi

# NOTE: These only work with SSH urls.
GIT_REPO_OWNER="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=:)[^/]+(?=/)')";
GIT_REPO_NAME_SHORT="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=/)[^/]+(?=\.git$)')"; GIT_SERVER_DOMAIN="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=@)[^/]+(?=:)')";  GIT_TAG_NAME is the name of the tag that we're building against - but only if we've been passed a tag as the GIT_REF_TYPE. The GIT_SERVER_DOMAIN is important for sending the status reports to the right place. Gitea supports a status API that we can hook into to report on how we're doing. You can see it in action here on my RhinoReminds repository. Those green ticks are the build status that was reported by the Laminar CI job that we're writing in this post. Unfortunately you won't be able to click on it to see the actual build output, as that is currently protected behind a username and password, since the Laminar CI web interface exposes all the git project I've currently got setup on it - including a number of private ones that I can't share. Anyway, with all our environment variables in order, it's time to do something with them. Before we do though, we should tell Gitea that we're starting the build process: send-status-gitea "${GIT_COMMIT_REF}" "pending" "Executing build....";

I haven't yet implemented support for sending notifications to GitHub, but it's on my todo list. In theory it's pretty easy to do - this is why I've got that GIT_SERVER_DOMAIN variable above in anticipation of this.

That send-status-gitea function there is another helper script I've written that does what you'd expect - it sends a status message to Gitea. It does this by using the environment variables we deduced earlier (that are also exported - though I didn't include that in the abovecode snippet) and curl.

There's still a bunch of stuff to get through in this post, so I'm going to omit the source of that script from this post for brevity. I've got no particular issue with releasing it though - if you're interested, contact me using the details on my homepage.

Next, we need to set an exit trap. This is a function that will run when the Bash process exits - regardless of whether this was because we finished our work successfully, or otherwise. This can be very useful to make absolutely sure that your script cleans up after itself. In our case, we're only going to be using it to report the build status back to Gitea:

# Runs on exit, no matter what
cleanup() {
original_exit_code="$?"; status="success"; description="Build${RUN} succeeded in $(human-duration "${SECONDS}").";
if [[ "${original_exit_code}" -ne "0" ]]; then status="failed"; description="Build failed with exit code${original_exit_code} after $(human-duration "${SECONDS}")";
fi

send-status-gitea "${GIT_COMMIT_REF}" "${status}" "${description}"; } trap cleanup EXIT; Very cool. The RUN variable there is provided by Laminar CI, and SECONDS is a bash built-in that tells us the number of seconds that the current Bash process has been running for. human-duration is yet another helper script because I like nice readable durations in my status messages - not something unreadable like Build 3 failed in 345 seconds. It's also somewhat verbose - I adapted it from this StackExchange answer. With that all out of the way, the next item on the list is to work out what job name we're running under. I've chosen git-repo for the name of the master 'virtual' job - that is to say the one whose entire purpose is to queue the actual job. That's pretty easy, since Laminar gives us an environment variable: if [ "${JOB}" == "git-repo" ]; then
# ...
fi

If the job name is git-repo, then we need to queue the actual job name. Since I don't want to have to manually alter the system every time I'm setting up a new repo on my CI system, I've automated the process with symbolic links. The main git-repo job creates a symbolic link to itself in the name of the repository that it's supposed to be running against, and then queues a new job to run itself under the different job name. This segment takes place nested in the above if statement:

# If the job file doesn't exist, create it
# We create a symlink here because this is a 'smart' job - whose
# behaviour changes dynamically based on the job name.
if [ ! -e "${LAMINAR_HOME}/cfg/jobs/${repo_job_name}.run" ]; then
pushd "${LAMINAR_HOME}/cfg/jobs"; ln -s "git-repo.run" "${repo_job_name}.run";
popd
fi

Once we're sure that the symbolic link is in place, we can queue the virtual copy:

# Queue our new hologram
LAMINAR_REASON="git push by ${GIT_AUTHOR} to${GIT_REF_NAME}" laminarc queue "${repo_job_name}" GIT_REPO_NAME="${GIT_REPO_NAME}" GIT_REF_NAME="${GIT_REF_NAME}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${GIT_COMMIT_REF}" GIT_AUTHOR="${GIT_AUTHOR}";
# If we got to here, we queued the hologram successfully
# Clear the trap, because we know that the trap for the hologram will fire
# This avoids sending a 2nd status to Gitea, linking the user to the wrong place
trap - EXIT;

exit 0;

This also ensures that if we make any changes to the main job file, all the copies will get updated automatically too. After all, they are only pointers to the actual job on disk.

Notice that we also clear the trap there before exiting - that's important, since we're queuing a copy of ourselves, we don't want to report the completed status before we've actually finished.

At this point, we can now look at what happens if the job name isn't git-repo. In this case, we need to do a few things:

1. Clone the git repository in question to the shared workspace (if it hasn't been done already)
2. Fetch new commits on the shared repository copy
3. Check out the right commit
4. Copy it to the run-specific directory
5. Execute the build script

Additionally, we need to ensure that points #1 to #4 are not done by multiple jobs that are running at the same time, since that would probably confuse things and induce weird and undesirable behaviour. This might happen if we push multiple commits at once, for example - since the git post-receive hook (which I'll be talking about in a future post) queues 1 run per commit.

We can make sure of this by using flock. It's actually a feature provided by the Linux Kernel, which allows a single process to obtain exclusive access to a resource on disk. Since each Laminar job has it's own workspace as described above, we can abuse this by doing an flock on the workspace directory. This will ensure that only 1 run per job is accessing the workspace area at once:

# Acquire a lock for this repo
exec 9<"${WORKSPACE}"; flock --exclusive 9; echo "[${SECONDS}] Lock acquired";

Nice. Next, we need to clone the repository into the shared workspace if we haven't already:

cd "${WORKSPACE}"; # If we haven't already, clone the repository git_directory="$(echo "${GIT_REPO_URL}" | grep -oP '(?<=/)(.+)(?=.git$)')";
if [ ! -d "${git_directory}" ]; then echo "[${SECONDS}] Cloning repository";
git clone "${GIT_REPO_URL}"; fi cd "${git_directory}";

Then, we need to fetch any new commits:

# Pull down any updates that are available
echo "[${SECONDS}] Downloading commits"; git fetch origin; ....and check out the one we're supposed to be building: # Checkout the commit we're interested in testing echo "[${SECONDS}] Checking out ${GIT_COMMIT_REF}"; git checkout "${GIT_COMMIT_REF}";

Then, we need to copy the repo to the run-specific directory. This is important, since the run might create new files - and we don't want multiple runs running in the same directory at the same time.


echo "[${SECONDS}] Linking source to run directory"; # Hard-link the repo content to the run directory # This is important because then we can allow multiple runs of the same repo at the same time without using extra disk space # -r Recursive mode # -a Preserve permissions # -l Hardlink instead of copy cp -ral ./ "${run_directory}";
# Don't forget the .git directory, .gitattributes, .gitmodules, .gitignore, etc.
# This is required for submodules and other functionality, but likely won't be edited - hence we can hardlink here (I think).
# NOTE: If we see weirdness with multiple runs at a time, then we'll need to do something about this.
cp -ral ./.git* "${run_directory}/.git"; I'm using hard linking here for efficiency - I'm banking on the fact that the build script I call isn't going to modify any existing files. Thinking about it, I should do a git reset --hard there just in case - though then I'd have all sorts of nasty issues with timing problems. So far, I haven't had any issues. If I do, then I'll just disable the hard linking and copy instead. This entire script assumes a trusted environment - i.e. it trusts that the code being executed is not malicious. To this end, it's only suitable for personal projects and the like. For it to be useful in untrusted environments, it would need to avoid hard linking and execute the build script inside a container - e.g. using LXD or Docker. Moving on, we next need to release that flock and return to the run-specific directory: # Go back to the job-specific run directory cd "${run_directory}";

# Release the lock
exec 9>&- # Close file descriptor 9 and release lock

echo "[${SECONDS}] Lock released"; At this point, we're all set up to run the build script. We need to find it first though. I've currently got 2 standards I'm using across my repositories: build and build.sh. This is easy to automate: build_script="./build"; if [ ! -x "${build_script}" ]; then build_script="./build.sh"; fi
# FUTURE: Add Makefile support here?
if [ ! -x "${build_script}" ]; then echo "[${SECONDS}] Error: Couldn't find the build script, or it wasn't marked as executable." >&2;
exit 1;
fi

Now that we know where it is, we can execute it. Before we do though, as a little extra I like to run shellcheck over it - since we assume that it's a shell script too (though it might call something that isn't a shell script):

echo "----------------------------------------------------------------";
echo "------------------ Shellcheck of build script ------------------";
set +e; # Allow shellcheck errors - we just warn about them
shellcheck "${build_script}"; set -e; echo "----------------------------------------------------------------"; I can highly recommend shellcheck - it finds a number of potential issues in both style and syntax that might cause your shell scripts to behave in unexpected ways. I've learnt a bunch about shell scripting and really improved my skills from using it on a regular basis. Finally, we can now actually execute the build script: echo "[${SECONDS}] Executing '${build_script} ci'"; nice -n10${build_script} ci

I pass the argument ci here, since the lantern build engine takes task names as arguments on the command line. If it's not a lantern script, then it can be interpreted as a helpful hint as to the environment that it's running in.

I also nice it to push it into the background, since I actually have my Laminar CI server running on a Raspberry Pi and it's resources are rather limited. I found oddly that I'd lose other essential services (e.g. SSH) if I didn't do this for some reason - since build tasks are usually quite computationally expensive.

That completes the build script. Of course, when the above finishes executing the trap that we set earlier will trigger and the build status reported. I'll include the full script at the bottom of this post.

This was a long post! We've taken a deep dive into the engine that powers my build system. In the next few posts, I'd like to talk about the git post-receive hook I've been mentioning that triggers this job. I'd also like to talk formally about the lantern build engine - what it is, where it came from, and how it works.

Found this interesting? Spotted a mistake? Got a suggestion? Confused about something? Comment below!

#!/usr/bin/env bash
set -e; # Don't allow errors

# Check that all the right variables are present
if [ -z "${GIT_REPO_NAME}" ]; then echo -e "Error: The environment variable GIT_REPO_NAME isn't set." >&2; exit 1; fi if [ -z "${GIT_REF_NAME}" ]; then echo -e "Error: The environment variable GIT_REF_NAME isn't set." >&2; exit 1; fi
if [ -z "${GIT_REPO_URL}" ]; then echo -e "Error: The environment variable GIT_REPO_URL isn't set." >&2; exit 1; fi if [ -z "${GIT_COMMIT_REF}" ]; then echo -e "Error: The environment variable GIT_COMMIT_REF isn't set." >&2; exit 1; fi
if [ -z "${GIT_AUTHOR}" ]; then echo -e "Error: The environment variable GIT_AUTHOR isn't set." >&2; exit 1; fi # It's checked directly anyway # shellcheck disable=SC1091 source source_regex_match.sh; GIT_REF_TYPE="$(regex_match "${GIT_REF_NAME}" 'refs/([a-z]+)')"; if [[ "${GIT_REF_TYPE}" == "tags" ]]; then
GIT_TAG_NAME="$(regex_match "${GIT_REF_NAME}" 'refs/tags/(.*)$')"; fi # NOTE: These only work with SSH urls. GIT_REPO_OWNER="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=:)[^/]+(?=/)')"; GIT_REPO_NAME_SHORT="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=/)[^/]+(?=\.git$)')";
GIT_SERVER_DOMAIN="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=@)[^/]+(?=:)')";

export GIT_REPO_OWNER GIT_REPO_NAME_SHORT GIT_SERVER_DOMAIN GIT_REF_TYPE GIT_TAG_NAME;

###############################################################################

# Example URL: git@git.starbeamrainbowlabs.com:sbrl/rhinoreminds.git
# Environment variables:
#   GIT_REPO_NAME           git-starbeamrainbowlabs-com-sbrl-rhinoreminds
#       Determined dynamically from GIT_REF_NAME.
#   GIT_TAG_NAME            v0.1.1-build7
#       Determined dynamically from GIT_REF_NAME, only set if GIT_REF_TYPE == "tags".
#   GIT_REPO_URL            git@git.starbeamrainbowlabs.com:sbrl/rhinoreminds.git
#   GIT_COMMIT_REF          e23b2e0f3c0b9f48effebca24db48d9a3f028a61
#   GIT_AUTHOR              bob
# Generated:
#   GIT_SERVER_DOMAIN       git.starbeamrainbowlabs.com
#   GIT_REPO_OWNER          sbrl
#   GIT_REPO_NAME_SHORT     rhinoreminds
#   GIT_RUN_SOURCE          github
#       Not always set. If not set then assume git.starbeamrainbowlabs.com

send-status-gitea "${GIT_COMMIT_REF}" "pending" "Executing build...."; # Runs on exit, no matter what cleanup() { original_exit_code="$?";

status="success";
description="Build ${RUN} succeeded in$(human-duration "${SECONDS}")."; if [[ "${original_exit_code}" -ne "0" ]]; then
status="failed";
description="Build failed with exit code ${original_exit_code} after$(human-duration "${SECONDS}")"; fi send-status-gitea "${GIT_COMMIT_REF}" "${status}" "${description}";
}

trap cleanup EXIT;

###############################################################################

repo_job_name="$(echo "${GIT_REPO_NAME}" | tr '/' '--')";
if [ "${JOB}" == "git-repo" ]; then # If the job file doesn't exist, create it # We create a symlink here because this is a 'smart' job - whose # behaviour changes dynamically based on the job name. if [ ! -e "${LAMINAR_HOME}/cfg/jobs/${repo_job_name}.run" ]; then pushd "${LAMINAR_HOME}/cfg/jobs";
ln -s "git-repo.run" "${repo_job_name}.run"; popd fi # Queue our new hologram LAMINAR_REASON="git push by${GIT_AUTHOR} to ${GIT_REF_NAME}" laminarc queue "${repo_job_name}" GIT_REPO_NAME="${GIT_REPO_NAME}" GIT_REF_NAME="${GIT_REF_NAME}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${GIT_COMMIT_REF}" GIT_AUTHOR="${GIT_AUTHOR}"; # If we got to here, we queued the hologram successfully # Clear the trap, because we know that the trap for the hologram will fire # This avoids sending a 2nd status to Gitea, linking the user to the wrong place trap - EXIT; exit 0; fi # We're running in hologram mode! # Remember the run directory - we'll need it later run_directory="$(pwd)";

# Important directories:
# $WORKSPACE Shared between all runs of a job #$run_directory    The initial directory a run lands in. Empty and run-specific.
# $ARCHIVE Also run-speicfic, but the contents is persisted after the run ends # Acquire a lock for this repo #laminarc lock "${JOB}-workspace";
exec 9<"${WORKSPACE}"; flock --exclusive 9; ############################################################################### # No need to allow errors here, because the lock will automagically be released # if the process crashes, as that'll close the file description anyway :P echo "[${SECONDS}] Lock acquired";

cd "${WORKSPACE}"; # If we haven't already, clone the repository git_directory="$(echo "${GIT_REPO_URL}" | grep -oP '(?<=/)(.+)(?=.git$)')";
if [ ! -d "${git_directory}" ]; then echo "[${SECONDS}] Cloning repository";
git clone "${GIT_REPO_URL}"; fi cd "${git_directory}";

# Pull down any updates that are available
echo "[${SECONDS}] Downloading commits"; git fetch origin; # Checkout the commit we're interested in testing echo "[${SECONDS}] Checking out ${GIT_COMMIT_REF}"; git checkout "${GIT_COMMIT_REF}";

echo "[${SECONDS}] Linking source to run directory"; # Hard-link the repo content to the run directory # This is important because then we can allow multiple runs of the same repo at the same time without using extra disk space # -r Recursive mode # -a Preserve permissions # -l Hardlink instead of copy cp -ral ./ "${run_directory}";
# Don't forget the .git directory, .gitattributes, .gitmodules, .gitignore, etc.
# This is required for submodules and other functionality, but likely won't be edited - hence we can hardlink here (I think).
# NOTE: If we see weirdness with multiple runs at a time, then we'll need to do something about this.
cp -ral ./.git* "${run_directory}/.git"; echo "[${SECONDS}] done";

# Go back to the job-specific run directory
cd "${run_directory}"; ############################################################################### # Release the lock exec 9>&- # Close file descriptor 9 and release lock #laminarc release "${JOB}-workspace";

echo "[${SECONDS}] Lock released"; echo "[${SECONDS}] Finding build script";

build_script="./build";
if [ ! -x "${build_script}" ]; then build_script="./build.sh"; fi # FUTURE: Add Makefile support here? if [ ! -x "${build_script}" ]; then
echo "[${SECONDS}] Error: Couldn't find the build script, or it wasn't marked as executable." >&2; exit 1; fi echo "[${SECONDS}] Executing '${build_script} ci'"; echo "----------------------------------------------------------------"; echo "------------------ Shellcheck of build script ------------------"; set +e; # Allow shellcheck errors - we just warn about them shellcheck "${build_script}";
set -e;
echo "----------------------------------------------------------------";

nice -n10 ${build_script} ci ## Monitoring HTTP server response time with collectd and a bit of bash In the spirit of the last few posts I've been making here (A and B), I'd like to talk a bit about collectd, which I use to monitor the status of my infrastructure. Currently this consists of the server you've connected to in order to view this webpage, and a Raspberry Pi that acts as a home file server. I realised recently that monitoring the various services that I run (such as my personal git server for instance) would be a good idea, as I'd rather like to know when they go down or act abnormally. As a first step towards this, I decided to configure my existing collectd setup to monitor the response time of the HTTP endpoints of these services. Later on, I can then configure some alerts to message me when something goes down. My first thought was to check the plugin list to see if there was one that would do the trick. As you might have guessed by the title of this post, however, such an easy solution would be too uninteresting and not worthy of writing a blog post. Since such a plugin doesn't (yet?) exist, I turned to the exec plugin instead. In short, it lets you write a program that writes to the standard output in the collectd plain text protocol, which collectd then interprets and adds to whichever data storage backend you have configured. Since shebangs are a thing on Linux, I could technically choose any language I have an interpreter installed for, but to keep things (relatively) simple, I chose Bash, the language your local terminal probably speaks (unless it speaks zsh or fish instead). My priorities were to write a script that is: 1. Easy to reconfigure 2. Ultra lightweight Bash supports associative arrays, so I can cover point #1 pretty easily like this: declare -A targets=( ["main_website"]="https://starbeamrainbowlabs.com/" ["git"]="https://git.starbeamrainbowlabs.com/" # ..... ) Excellent! Covering point #2 will be an on-going process that I'll need to keep in mind as I write this script. I found this GitHub repository a while back, which has served as a great reference point in the past. Here's hoping it'll be useful this time too! It's important to note the structure of the script that we're trying to write. Collectd exec scripts have 2 main environment variables we need to take notice of: • COLLECTD_HOSTNAME - The hostname of the local machine • COLLECTD_INTERVAL - Interval at which we should collect data. Defined in collectd.conf. The script should write to the standard output the values we've collected in the collectd plain text format every COLLECTD_INTERVAL. Collectd will automatically ensure that only 1 instance of our script is running at once, and will also automatically restart it if it crashes. To run a command regularly at a set interval, we probably want a while loop like this: while :; do # Do our stuff here sleep "${COLLECTD_INTERVAL}";
done

This is a great start, but it isn't really compliant with objective #2 we defined above. sleep is actually a separate command that spawns a new process. That's an expensive operation, since it has to allocate memory for a new stack and create a new entry in the process table.

We can avoid this by abusing the read command timeout, like this:

# Pure-bash alternative to sleep.
# Source: https://blog.dhampir.no/content/sleeping-without-a-subprocess-in-bash-and-how-to-sleep-forever
snore() {
local IFS;
[[ -n "${_snore_fd:-}" ]] || exec {_snore_fd}<> <(:); read${1:+-t "$1"} -u$_snore_fd || :;
}

Thanks to bolt for this.

Next, we need to iterate over the array of targets we defined above. We can do that with a for loop:

while :; do
for target in "${!targets[@]}"; do check_target "${target}" "${targets[${target}]}"
done

snore "${COLLECTD_INTERVAL}"; done Here we call a function check_target that will contain our main measurement logic. We've changed sleep to snore too - our new subprocess-less sleep alternative. Note that we're calling check_target for each target one at a time. This is important for 2 reasons: • We don't want to potentially skew the results by taking multiple measurements at once (e.g. if we want to measure multiple PHP applications that sit in the same process poll, or measure more applications than we have CPUs) • It actually spawns a subprocess for each function invocation if we push them into the background with the & operator. As I've explained above, we want to try and avoid this to keep it lightweight. Next, we need to figure out how to do the measuring. I'm going to do this with curl. First though, we need to setup the function and bring in the arguments: #$1 - target name
# $2 - url check_target() { local target_name="${1}"
local url="${2}"; # ...... } Excellent. Now, let's use curl to do the measurement itself: curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}" "${url}" This looks complicated (and it probably is to some extent), but let's break it down with the help of explainshell. Part Meaning -sS Squashes all output except for errors and the bits we want. Great for scripts like ours. --user-agent Specifies the user agent string to use when making a request. All good internet citizens should specify a descriptive one (more on this later). -o /dev/null We're not interested in the content we download, so this sends it straight to the bin. --max-time 5 This sets a timeout of 5 seconds for the whole operation - after which curl will throw an error and return with exit code 28. -w "%{http_code}\n%{time_total}" This allows us to pull out metadata about the request we're interested in. There's actually a whole range available, but for now I'm interested in how long it took and the response code returned "${url}" Specifies the URL to send the request to. curl does actually support making more than 1 request at once, but utilising this functionality is out-of-scope for now (and we'd get skewed results because it re-uses connections - which is normally really helpful & performance boosting)

To parse the output we get from curl, I found the readarray command after going a bit array mad at the beginning of this post. It pulls every line of input into a new slot in an array for us - and since we can control the delimiter between values with curl, it's perfect for parsing the output. Let's hook that up now:

readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}" "${url}");

The weird command < <(another_command); syntax is process substitution. It's a bit like the another_command | command syntax, but a bit different. We need it here because readarray parses the values into a new array variable in the current context, and if we use the a | b syntax here, we instantly lose access to the variable it creates because a subprocess is spawned (and readarray is a bash builtin) - hence the weird process substitution.

Now that we've got the output from curl parsed and ready to go, we need to handle failures next. This is a little on the nasty side, as by default bash won't give us the non-zero exit code from substituted processes. Hence, we need to tweak our already long arcane incantation a bit more:

readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}\n" "${url}"; echo "${PIPESTATUS[*]}"); Thanks to this answer on StackOverflow for ${PIPESTATUS}. Now, we have array called result with 3 elements in it:

Index Value
0 The HTTP response code
1 The time taken in seconds
2 The exit code of curl

With this information, we can now detect errors and abort continuing if we detect one. We know there was an error if any of the following occur:

• curl returned a non-zero exit code
• The HTTP response code isn't 2XX or 3XX

Let's implement that in bash:

if [[ "${result[2]}" -ne 0 ]] || [[ "${result[0]}" -lt "200" ]] || [[ "${result[0]}" -gt "399" ]]; then return fi Again, let's break it down: • [[ "${result[2]}" -ne 0 ]] - Detect a non-zero exit code from curl
• [[ "${result[0]}" -lt "200" ]] - Detect if the HTTP response code is less than 200 • [[ "${result[0]}" -gt "399" ]] - Detect if the HTTP response code is greater than 399

In the future, we probably want to output a notification here of some sort instead of just simply silently returning, but for now it's fine.

Finally, we can now output the result in the right format for collectd to consume. Collectd operates on identifiers, values, and intervals. A bit of head-scratching and documentation reading later, and I determined the correct identifier format for the task. I wanted to have all the readings on the same graph so I could compare the different response times (just like the ping plugin does), so we want something like this:

bobsrockets.com/http_services/response_time-TARGET_NAME

....where we replace bobsrockets.com with ${COLLECTD_HOSTNAME}, and TARGET_NAME with the name of the target we're measuring (${target_name} from above).

We can do this like so:

echo "PUTVAL \"${COLLECTD_HOSTNAME}/http_services/response_time-${target_name}\" interval=${COLLECTD_I NTERVAL} N:${result[1]}";

Here's an example of it in action:

PUTVAL "/http_services/response_time-git" interval=300.000 N:0.118283
PUTVAL "/http_services/response_time-main_website" interval=300.000 N:0.112073

It does seem to run through the items in the array in a rather strange order, but so long as it does iterate the whole lot, I don't really care.

I'll include the full script at the bottom of this post, so all that's left to do is to point collectd at our new script like this in /etc/collectd.conf:

LoadPlugin  exec

# .....

<Plugin exec>
Exec    "nobody:nogroup"        "/etc/collectd/http_response_times.sh"  "measure"
</Plugin>

I've added measure as an argument there for future-proofing, as it looks like we may have to run a separate instance of the script for sending notifications if I understand the documentation correctly (I need to do some research.....).

Very cool. It's taken a few clever tricks, but we've managed to write an efficient script for measuring http response times. We've made it more efficient by exploiting read timeouts and other such things. While we won't gain a huge amount of speed from this (bash is pretty lightweight already - this script is weighing in at just ~3.64MiB of private RAM O.o), it will all add up over time - especially considering how often this will be running.

In the future, I'll definitely want to take a look at implementing some alerts to notify me if a service is down - but that will be a separate post, as this one is getting quite long :P

Found this interesting? Got another way of doing this? Curious about something? Comment below!

### Full Script

#!/usr/bin/env bash
set -o pipefail;

# Variables:
#   COLLECTD_INTERVAL   Interval at which to collect data
#   COLLECTD_HOSTNAME   The hostname of the local machine

declare -A targets=(
["main_website"]="https://starbeamrainbowlabs.com/"
["webmail"]="https://mail.starbeamrainbowlabs.com/"
["git"]="https://git.starbeamrainbowlabs.com/"
["nextcloud"]="https://nextcloud.starbeamrainbowlabs.com/"
)
# These are only done once, so external commands are ok
version="0.1+$(date +%Y%m%d -r$(readlink -f "${0}"))"; user_agent="HttpResponseTimeMeasurer/${version} (Collectd Exec Plugin; $(uname -sm)) bash/${BASH_VERSION} curl/$(curl --version | head -n1 | cut -f2 -d' ')"; # echo "${user_agent}"

###############################################################################

# Pure-bash alternative to sleep.
# Source: https://blog.dhampir.no/content/sleeping-without-a-subprocess-in-bash-and-how-to-sleep-forever
snore() {
local IFS;
[[ -n "${_snore_fd:-}" ]] || exec {_snore_fd}<> <(:); read${1:+-t "$1"} -u$_snore_fd || :;
}

# Source: https://github.com/dylanaraps/pure-bash-bible#split-a-string-on-a-delimiter
split() {
# Usage: split "string" "delimiter"
IFS=$'\n' read -d "" -ra arr <<< "${1//$2/$'\n'}"
printf '%s\n' "${arr[@]}" } # Source: https://github.com/dylanaraps/pure-bash-bible#get-the-number-of-lines-in-a-file # Altered to operate on the standard input. count_lines() { # Usage: lines <"file" mapfile -tn 0 lines printf '%s\n' "${#lines[@]}"
}

###############################################################################

# $1 - target name #$2 - url
check_target() {
local target_name="${1}" local url="${2}";

readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}\n" "${url}"; echo "${PIPESTATUS[*]}"); # 0 - http response code # 1 - time taken # 2 - curl exit code # Make sure the exit code is non-zero - this includes if curl hits a timeout error # Also ensure that the HTTP response code is valid - any 2xx or 3xx response code is ok if [[ "${result[2]}" -ne 0 ]] || [[ "${result[0]}" -lt "200" ]] || [[ "${result[0]}" -gt "399" ]]; then
return
fi

echo "PUTVAL \"${COLLECTD_HOSTNAME}/http_services/response_time-${target_name}\" interval=${COLLECTD_INTERVAL} N:${result[1]}";
}

while :; do
for target in "${!targets[@]}"; do # NOTE: We don't use concurrency here because that spawns additional subprocesses, which we want to try & avoid. Even though it looks slower, it's actually more efficient (and we don't potentially skew the results by measuring multiple things at once) check_target "${target}" "${targets[${target}]}"
done

snore "${COLLECTD_INTERVAL}"; done ## TCP (Client) Networking in Pure Bash Recently I re-remembered about /dev/tcp - a virtual bash file system that allows you to directly connect to a remote TCP endpoint - without the use of nc / netcat / ncat. While it only allows you to connect (no listening, sadly), it's still a great bash built-in that helps avoid awkward platform-specific issues. Here's how you'd listen for a connection with netcat, sending as many random numbers as possible to the poor unsuspecting client: netcat -l 0.0.0.0 6666 </dev/urandom Here's how you'd traditionally connect to that via netcat: netcat X.Y.Z.W 6666 | pv >/dev/null The pv command there is not installed by default, but is a great tool that shows the amount of data flowing through a pipe. It's available in the repositories for most Linux distributions without any special configuration required - so sudo apt install pv should be enough on Debian-based distributions. Now, let's look at how we'd do this with pure bash: pv >/dev/null </dev/tcp/X.Y.Z.W/6666 Very neat! We've been able to eliminate an extra child process. The question is though: how do they compare performance-wise? Well, that depends on how we measure it. In my case, I measured a single connection, downloading data as fast as it can for 60 seconds. Another test would be to open many connections and download lots of small files. While I haven't done that here, I theorise that the pure-bash method would win out, as it doesn't have to spawn lots of subprocesses. In my test, I did this: # Traditional method timeout 60 nc X.Y.Z.W 6666 | pv >/dev/null # Pure-Bash method timeout 60 pv >/dev/null </dev/tcp/X.Y.Z.W/6666 The timeout command kills the process after a given number of seconds. The server I connected to was just this: while true; do nc -l 0.0.0.0 6666 </dev/urandom; done Running the above test, I got the following output: $ timeout 60 pv >/dev/null </dev/tcp/172.16.230.58/6666
652MiB 0:00:59 [11.2MiB/s] [                                      <=>         ]
$timeout 60 nc 172.16.230.58 6666 | pv >/dev/null 599MiB 0:01:00 [11.1MiB/s] [ <=> ] Method Total Data Transferred Traditional 599MiB Pure Bash 652MiB As it turns out, the pure bash method is apparently faster - by ~8.8%. I think this might have something to do with the lack of the additional sub-process, or some other optimisation that bash can apply when doing the TCP networking itself. Found this interesting? Got a cool use for it? Discovered another awesome bash built-in? Comment below! ## Animated PNG for all! I recently discovered that Animated PNGs are now supported by most major browsers: I stumbled across the concept of an animated PNG a number of years ago (on caniuse.com actually if I remember right!), but at the time browser support was very bad (read: non-existent :P) - so I moved on to other things. I ended up re-discovering it a few weeks ago (also through caniuse.com!), and since browser support is so much better now I decided that I just had to play around with it. It hasn't disappointed. Traditional animated GIFs (Graphics Interchange Format for the curious) are limited to 256 colours, have limited transparency support (a pixel is either transparent, or it isn't - there's no translucency support), and don't compress terribly well. Animated PNGs (Portable Network Graphics), on the other hand, let you enjoy all the features of a regular PNG (as many colours as you want, translucent pixels, etc.) while also supporting animation, and compressing better as well! Best of all, if a browser doesn't support the animated PNG standard, they will just see and render the first frame as a regular PNG and silently ignore the rest. Let's take it out for a spin. For my test, I took an image and created a 'panning' animation from one side of it to the other. Here's the image I've used: (Credit: The background on the download page for Mozilla's Firefox Nightly builds. It isn't available on the original source website anymore (and I've lost the link), but can still be found on various wallpaper websites.) The first task is to generate the frames from the original image. I wrote a quick shell script for this purpose. Firstly, I defined a bunch of variables: #!/usr/bin/env bash set -e; # Crash if we hit an error input_file="Input.png"; output_file="Output.apng" # The maximum number of frames max_frame=32; end_x=960; end_y=450; start_x=0; start_y=450; crop_size_x=960; crop_size_y=540; mkdir -p ./frames; Variable Meaning input_file The input file to generate frames from output_file The file to write the animated png to max_frame The number of frames (plus 1) to generate start_x The starting position to pan from on the x axis start_y The starting position to pan from on the y axis end_x The ending position to pan to on the x axis end_y The ending position to pan to on the y axis crop_size_x The width of the cropped frames crop_size_y The height of the cropped frames It's worth noting here that it's probably a good idea to implement a proper CLI to this script at this point, but since it's currently only a one-off script I didn't bother. Perhaps in the future I'll come back and improve it if I find myself needing it again for some other purpose. With the parameters set up (and a temporary directory created - note that you should use mktemp -d instead of the approach I've taken here!), we can then use a for loop to repeatedly call ImageMagick to crop the input image to generate our frames. This won't run in parallel unfortunately, but since it's only a few frames it shouldn't take too long to render. This is only a quick shell script after all :P  for ((frame=0; frame <= "${max_frame}"; frame++)); do
this_x="$(calc -p "${start_x}+((${end_x}-${start_x})*(${frame}/${max_frame}))")";
this_y="$(calc -p "${start_y}+((${end_y}-${start_y})*(${frame}/${max_frame}))")";
percent="$(calc -p "round((${frame}/${max_frame})*100, 2)")"; convert "${input_file}" -crop "${crop_size_x}x${crop_size_y}+${this_x}+${this_y}" "frames/Frame-$(printf "%02d" "${frame}").jpeg";

echo -ne "${frame} /${max_frame} (${percent}%) \r"; done echo ""; # Move to the line after the progress indicator This looks complicated, but it really isn't. The for loop iterates over each of the frame numbers. We do some maths to calculate the (x, y) co-ordinates of the frame we want to extract, and then get ImageMagick's convert command to do the cropping for us. Finally we write a quick progress indicator out to stdout (\r resets the cursor to the beginning of the line, but doesn't go down to the next one). The maths there is probably better represented like this:$this_x=start_x+((end_x-start_x)*\frac{currentframe}{max_frame})this_y=start_y+((end_y-start_y)*\frac{currentframe}{max_frame})$Much better :-) In short, we convert the current frame number into a percentage (between 0 and 1) of how far through the animation we are and use that as a multiplier to find the distance between the starting and ending points. I use the calc command-line tool (in the apcalc package on Ubuntu) here to do the calculations, because the bash built-in result=$(((4 + 5))) only supports integer-based maths, and I wanted floating-point support here.

With the frames generated, we only need to stitch them together to make the final animated png. Unfortunately, an easy-to-use tool does not yet exist (like gifsicle for GIFs) - so we'll have to make-do with ffpmeg (which, while very powerful, has a rather confusing CLI!). Here's how I stitched the frames together:

ffmpeg -r 10 -i frames/Frame-%02d.jpeg -plays 0 "${output_file}"; • -r - The frame rate • -i - The input filename • -plays - The number of times to loop (0 = infinite; defaults to no looping if omitted) • "${output_file}" - The output file

Here's the final result:

(Filesize: ~2.98MiB)

Of course, it'd be cool to compare it to a traditional animated gif. Let's do that too! First, we'll need to convert the frames to gif (gifsicle, our tool of choice, doesn't support anything other than GIFs as far as I'm aware):

mogrify -format gif frames/*.jpeg

Easy-peasy! mogrify is another tool from ImageMagick that makes such in-place conversions trivial. Note that the frames themselves are stored as JPEGs because I was experiencing an issue whereby each of the frames apparently had a slightly different colour palette, and ffmpeg wasn't smart enough to correct for this - choosing instead to crash.

With the frames converted, we can make our animated GIF like so:

gifsicle --optimize --colors=256 --loopcount=10 --delay=10 frames/*.gif >Output.gif;

Lastly, we probably want to delete the intermediate frames:

rm -r ./frames

Here's the animated GIF version:

(Filesize: ~3.28MiB)

Woah! That's much bigger.

(Generated (and then extracted & edited with the Firefox developer tools) from Meta-Chart)

It's ~9.7% bigger in fact! Though not a crazy amount, smaller resulting files are always good. I suspect that this effect will stack the more frames you have. Others have tested this too, finding pretty similar results to those that I've found here - though it does of course depend on your specific scenario.

With that observation, I'll end this blog post. The next time you think about inserting an animation into a web page or chat window, consider making it an Animated PNG.

Found this interesting? Found a cool CLI tool for manipulating APNGs? Having trouble? Comment below!