Archive

## Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

## Starting my PhD on the mapping of flooding

Specifically, using new technologies such as AI and the Internet of Things to map and predict where it's going to flood in real-time.

This year, I'm starting a 3 year funded PhD on dynamic flood risk mapping, as part of a cluster of water-related PhDs that are all being funded at the same time. I've got some initial ideas as to the direction I'm going to take it too, which I'd like to talk a little bit about in this post.

I've got no shortage of leads as to potential data sources I can explore to achieve this. Just some of the avenues I'm exploring include:

• Monitoring street drains
• Fitting local council vehicles with sensors
• Analysing geotagged tweets with natural language processing techniques
• Predicting rainfall with aggregate mobile phone signal strength information
• Vehicle windscreen wipers
• Setting up static sensors? Not sure on this one.

Also, I've talked to a few people and thought of some pre-existing data sources that might come in useful:

• Elevation maps
• Vegetation maps
• Normalised difference vegetation index - Map of vegetation density that's already available
• Normalised difference water index - detects water on the surface of the earth - including that contained within leaves
• River levels

Finally, I've been thinking about what I'm going to be doing with all this data I'd potentially be collecting. First and foremost, I'm going to experiment with InfluxDB as a storage mechanism. Apparently it's supposed to be able to handle high read and write loads, so it sounds suitable at first glance.
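
For illustration, writing a single data point to a (version 1.x) InfluxDB instance over its HTTP line-protocol API looks something like this - the database name, measurement, tag, and value here are entirely made up:

# Illustrative only - made-up database / measurement names, not part of the project
curl -i -XPOST 'http://localhost:8086/write?db=flood_sensors' \
	--data-binary 'river_level,station=example_station level=1.23'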

Next, I'm probably going to wind up training an AI - possibly incrementally - to predict flooding. Unlike my summer project, I'm probably going to be using a more cutting-edge and exotic AI architecture.

I suspect I might end up with a multi-level system too - whereby I pre-analyse the incoming data and process it into a format that the AI will take. For example, if I end up using geotagged social media posts, those will very likely filter through an AI of some description that does the natural language processing first - the output of which will be (part of) the input (or training output?) for the next AI in the chain.

I've given some thought to training data too. My current thinking is that while some data sources might be used as inputs to a network of interconnected AIs, others are more likely to take on a corrective role - i.e. improving the accuracy of the AI and correcting the model to fit a situation as it unfolds.

All this will not only require huge amounts of data, but will also require a sophisticated system that's capable of training an AI on past datasets as if they were happening in real-time. I suppose in a way the training process is a bit like chronological history lessons at speed - catching the AI up to the present day so that it can predict flood risk in real-time.

Before all this though, my first and foremost task will be to analyse what people have done already to map and predict flood risk. Only by understanding the current state of things will I be able to improve on what's already out there.

Found this interesting? Got any tips? Comment below!

## Summer Project Series List

At this point, it is basically the end of my summer project series - at least for a while, while I start my PhD (more on that in a future post). To this end, I'm releasing the series list for it.

## Summer Project Part 6: A matching bookend

Since the last post, I've completed the project - at least in its initial form. The IoT device I've been building is finished - along with the backend that receives the data and trains an AI on it. I've learnt a great deal whilst working on it. Unfortunately, I've been very short on time, so I haven't been able to blog about it as frequently as I'd have liked to.

Before we continue, it's perhaps best to show a diagram here of how the completed system works:

(Above: A diagram showing the typical workflow when using the completed system. Taken from my dissertation.)

Operation is divided up into 3 sections:

1. Data collection: Using the Internet of Things (IoT) device to collect data.
2. Processing the collected data: The data file on the microSD card of the device is folded into the main database (more on this later), and AIs are trained.
3. Viewing the final map in your browser.

Because the application is somewhat distributed in nature, it's not exactly obvious how the data flows across the application. Let's use another diagram to show how this works:

(Above: A diagram showing the data flow throughout the application. Taken from my dissertation.)

This diagram looks complicated, but it really isn't as complex as it looks.

• First, the IoT device takes a reading. This includes the GPS location (latitude and longitude) and a random id (unsigned 32-bit integer)
• This reading is both stored on a local SD card, and transmitted via LoRa
• The copy transmitted via LoRa travels through The Things Network and is picked up by the TTN listener, which stores the reading in an SQLite database
• Once data collection has been completed, the data on the microSD card is folded into the SQLite database generated by the TTN listener by the data processor. This ensures we have data points for places that the signal isn't, as well as where it is.
• 1 AI is then trained per gateway. These AIs are trained on the signal strength data for their gateway, but also on the data for where the signal is not.
• The trained AIs are serialised to disk with an index file, where the web interface picks them up.

Training the AIs was a bit of a pain. Getting the inputs and outputs right, along with the AIs' topology, to a point where the output is meaningful was quite challenging - especially considering the limited amount of data I managed to collect. I'd like to tweak and improve on it further, if I have time.

I ended up getting quite a bit of noise in the final dataset that I trained the AIs with - simply because of the way I designed the IoT device - and because of the time constraints, I didn't have the time to go back and fix it. The device currently logs the data point to disk and transmits it via LoRa afterwards. When the battery was low, I noticed that it had a tendency to reset when transmitting via LoRa, leading to a reset-loop and lots of bogus readings on the microSD card. And because I don't store the timestamp, I can't even tell them apart!

I also had a bit of a problem with configuration files - specifically that they kept multiplying. At the end of the project, I now have no fewer than 3-4 different configuration files - all in different formats! I'd love to consolidate them into 1 single configuration file, to make configuration much easier for the end user.

• 1 for the pin definitions and features to enable / disable in C++ #define statements for the IoT device
• Another 1 in C++ for the private TTN LMiC settings
• 1 in TOML for the server
• 1 in Javascript for the web interface

Ideally, I want everything to be defined in the TOML configuration file, since I developed a system by which I have a default configuration file, and a custom one in which the user can override any of the default settings.

With all this said, the output isn't bad:

(Above: A screenshot of the web interface, with the background map removed for privacy)

The AIs do appear to have learnt a circle of coverage around the gateways, which is somewhat frustrating - since the aim was to take obstructions into account too. I suspect that I need more data - and to figure out a way of training the AI on places where other gateways pick up a signal, but the current gateway does not (currently each AI is only trained on positive readings for its own gateway, and on negative ones where no gateway picks up the signal).

If anyone is interested in the code behind the project, I have now open-sourced it, and it is available here:

LoRaWAN Signal Mapping

The repository includes source code, user and hardware manuals, and a circuit diagram.

I may blog about this in the future, but it's likely to be a while. I'm starting a PhD this academic year, which is sure to be fun! Expect some posts about that in the near future.

Found this interesting? Comment below!

## Next Gen Search, Part 2: Pushing the limits

In the last part, we looked at how I built a new backend for storing inverted indexes for Pepperminty Wiki, which allows for partial index deserialisation and other nice features that boost performance considerably.

Since the last post, I've completed work on the new search system - though there are a few bits around the edges that I still want to touch up and do some more work on.

In this post though, I want to talk about how I generated test data to give my full-text search engine something to chew on. I've done this before for my markov chain program I wrote a while back, and that was so much fun I did it again for my search engine here.

After scratching my head for a bit to think of a data source I could use, I came up with the perfect plan. Ages ago I downloaded a Wikipedia dump - just the content pages in Wikitext markup. Why not use that?

As it turns out, it was a rather good idea. Some processing of said dump was required to transform it into a format that Pepperminty Wiki can understand, though. Pepperminty Wiki stores pages on disk as flat text files in markdown, and indexes them in pageindex.json. If pageindex.json doesn't exist, then Pepperminty Wiki rebuilds it automagically by looking for content pages on disk.

This makes it easy to import batches of new pages into Pepperminty Wiki, so all we need to do is extract the wiki text, convert it to markdown, and import! This ended up requiring a number of separate steps though, so let's take it 1 step at a time.

First, we need a Wikipedia database dump in the XML format. These are available from dumps.wikimedia.org. There are many different ones available, but I suggest grabbing one that has a filename similar to enwiki-20180201-pages-articles.xml - i.e. just content pages - no revision history, user pages, or additional extras. I think the most recent one as of the time of posting is downloadable here - though I'd warn you that it's 15.3GiB in size! You can see a list of available dump dates for the English Wikipedia here.
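
For example, grabbing a compressed dump from the command line and decompressing it might look something like this (the exact filename below is just an example - substitute the date of the dump you're after):

# Example only - pick the dump date you actually want
wget "https://dumps.wikimedia.org/enwiki/20180201/enwiki-20180201-pages-articles.xml.bz2"
bunzip2 --keep enwiki-20180201-pages-articles.xml.bz2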

Now that we've got our dump, let's extract the pages from it. This is nice and easy to do with wikiextractor on GitHub:

nice -n20 wikiextractor enwiki-20180201-pages-articles.xml --no_templates --html --keep_tables --lists --links --sections --json --output wikipages --compress --bytes 25M >progress.log 2>&1 

This will parse the dump and output a number of compressed files to the wikipages directory. These will have 1 JSON object per line, each containing information about a single page on Wikipedia - with page content pre-converted to HTML for us. The next step is to extract the page content and save it to a file with the correct name. This ended up being somewhat complicated, so I wrote a quick Node.js script to do the job:

#!/usr/bin/env node

const fs = require("fs");

if(!fs.existsSync("pages"))
fs.mkdirSync("pages", { mode: 0o755 });

// From https://stackoverflow.com/a/44195856/1460422
function html_unentities(encodedString) {
var translate_re = /&(nbsp|amp|quot|lt|gt);/g;
var translate = {
"nbsp":" ",
"amp" : "&",
"quot": "\"",
"lt"  : "<",
"gt"  : ">"
};
return encodedString.replace(translate_re, function(match, entity) {
return translate[entity];
}).replace(/&#(\d+);/gi, function(match, numStr) {
var num = parseInt(numStr, 10);
return String.fromCharCode(num);
});
}

const readline = require("readline");

// Read the JSON objects line-by-line from standard input
const interface = readline.createInterface({
input: process.stdin,
//output: process.stdout
});

interface.on("line", (text) => {
const obj = JSON.parse(text);

fs.writeFileSync(`pages/${obj.title.replace(/\//, "-")}.html`, html_unentities(obj.text));
console.log(`${obj.id}\t${obj.title}`);
});

This basically takes the stream of JSON objects on the standard input, parses them, and saves the relevant content to disk. We can invoke it like so:

bzcat path/to/*.bz2 | ./parse.js

Don't forget to chmod +x parse.js if you get an error here.

The other important thing about the above script is that we have to unescape the HTML entities (e.g. &gt;), because otherwise we'll have issues later with HTML conversion, and page names will look odd. This is done by the html_unentities() function in the above script.

This should result in a directory containing a large number of files - 1 file per content page. This is much better, but we're still not quite there yet. Wikipedia uses wiki markup (which we converted to HTML with wikiextractor) and Pepperminty Wiki uses Markdown - the 2 of which are, despite all their similarities, inherently incompatible. Thankfully, pandoc is capable of converting from HTML to markdown.

Pandoc is great at this kind of thing - it uses an intermediate representation and allows you to convert almost any type of textual document format to any other format. Markdown to PDF, EPUB to plain text, ..... and HTML to markdown (just to name a few). It actually looks like it shares a number of features with traditional compilers like GCC.

Anyway, let's use it to convert our folder full of HTML files to a folder full of markdown:

mkdir -p pages_md; find pages/ -type f -name "*.html" -print0 | nice -n20 xargs -P4 -0 -n1 -I{} sh -c 'filename="{}"; title="${filename##*/}"; title="${title%.*}"; pandoc --from "html" --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md"; echo "${title}";';

_(See this on explainshell.com - doesn't include the nice -n20 due to a bug on their end)_

This looks complicated, but it really isn't. Let's break it down a bit:

find pages/ -type f -name "*.html" -print0

This finds all the HTML files that we want to convert to Markdown, and delimits the output with a NUL byte - i.e. 0x0. This makes the next step easier:

... | nice -n20 xargs -P4 -0 -n1 -I{} sh -c '....'

This pipes the file list into a new xargs instance (niced down to a low priority), which will execute 4 commands at a time. xargs normally executes a command for each line of input it receives - though in our case, we're delimiting items with the NUL byte 0x0 instead. We also explicitly specify that we want 1 command per item of input, as xargs otherwise tries to optimise and do command file1 file2 file3 instead.
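
To illustrate the difference that -n1 makes, here's a trivial stand-alone example:

printf 'a\nb\nc\n' | xargs echo       # runs: echo a b c
printf 'a\nb\nc\n' | xargs -n1 echo   # runs: echo a; echo b; echo c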

The sh -c bit is starting a subshell, in which we execute a small wrapper script that then calls pandoc. This is of course inefficient, but I couldn't find any way around spawning a subshell in this instance.

filename="{}";
title="${filename##*/}"; title="${title%.*}";
pandoc --from "html" --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md";
echo "${title}"; I've broken the sh -c subshell script down into multiple liens for readability. Simply put, it extracts the page title from the filename, converts the HTML to Markdown, and saves it to a new file in a different directory with the .md replacing the original .html extension. When you put all these components together, you get a script that converts a folder full of HTML files to Markdown. Just like with the markov chains extraction I mentioned at the beginning of this post, Bash and shell scripting really is all about lego bricks. This is due in part to the Unix philosophy: Make each program do one thing well. There is more to it, but this is the most important point to remember. Many of the core utilities you'll find on the terminal follow this way of thinking. There's 1 last thing we need to take care of before we have them in the right format though - we need to convert the [display text](page name) markdown-format links back into the Wikipedia [[internal link]] format that Pepperminty Wiki also uses. Thankfully, another command-line tool I know of called repren is well-suited to this: repren --from '$([^$]+)\]$([^):]+)$' --to '[[\1]]' pages_md/*.md It took some fiddling, but I got all the escaping figured out and the above converts back into the [[internal link]] format well enough. Now that we've got our folder full of markdown files, we need to extract a random portion of them to act as a test for Pepperminty Wiki - as the whole lot might be a bit much for it to handle (though if Pepperminty Wiki was capable of handling it all eventually that'd be awesome :D). Let's try 500 pages to start: find path/to/wikipages/ -type f -name "*.md" -print0 | shuf --zero-terminated | head -n500 --zero-terminated | xargs -0 -n1 -I{} cp "{}" . (See this on explainshell.com) This is another lego-brick style command. Let's break it down too: find path/to/wikipages/ -type f -name "*.md" -print0 This lists all the .md files in a directory, delimiting them with a NUL character, as before. It's better to do this than use ls, as find is explicitly designed to be machine-readable. .... | shuf --zero-terminated The shuf command randomly shuffles the input lines. In this case, we're telling it that the input is delimited by the NUL byte. .... | head -n500 --zero-terminated Similar deal here. head takes the top N lines of input, and discards the rest. .... | xargs -0 -n1 -I{} cp "{}" . Finally, xargs calls cp to copy the selected files to the current directory - which is, in this case, the root directory of my test Pepperminty Wiki instance. Since I'm curious, let's now find out roughly how many words we're dealing with here: cat data_test/*.md | wc --words 1593190  1.5 million words! That's a lot. I wonder how quickly we can search that? 24.8ms? Awesome! That's so much better than before. If you're wondering about new coat of paint in the screenshot - Pepperminty Wiki is getting dark theme, thanks to prefers-color-scheme :D I wonder what happens if we push it to 2K pages? This time we get ~120ms for 5.9M total words - wow! I wasn't expecting it to perform so well. At this scale, rebuilding the entire index is particularly costly - so if I was to push it even further it would make sense to implement an incremental approach that spreads the work over multiple requests, assuming I can't squeeze any more performance out the system as-is. The last thing I want to do here is make a rough estimate about time time complexity the search system as-is, given the data we have so far. 
This isn't particularly difficult to do. Given the results above, we can calculate that at 1.5M total words, an increase of ~60K total words results in an increase of 1ms of execution time. At 5.9M words, it's only ~49K words / ms of execution time - a drop of ~11K words / ms of execution time. From this, we can speculate that for every million total words added to a wiki, we can expect a drop of ~2.5K words / ms of execution time - not bad!

We'd need more data points to make any reasonable guess as to the Big-O complexity function that it conforms to. My guess would be something like $O(xN^2)$, where x is a constant between ~0.2 and 2. Maybe at some point I'll go to the trouble of running enough tests to calculate it, but with all the variables that affect the execution time (number of pages, distribution of words across pages, etc.), I'm not in any hurry to calculate it. If you'd like to do so, go ahead and comment below!

Next time, I'll unveil the inner workings of STAS: my new search-term analysis system.

Found this interesting? Got your own story about some cool code you've written to tell? Comment below!

## Own your code, part 4: Laminar CI

In the last post, I talked at a high level about the infrastructure behind my continuous integration and deployment system. In this post, I'm going to dive into the details of the Laminar CI job that is the engine that drives the whole system.

Laminar CI is based on a concept of jobs. The docs explain it quite well, but in short each job is a file in the jobs folder with the file extension .run and a shebang. In my case, I'm using Bash - and I'll continue to do so at regular intervals throughout this series.

Unlike most other setups, the Laminar CI job that we'll be writing here won't actually do any of the actual CI tasks itself - it will simply act as a proxy script to set up & manage the execution of the actual build system - which, in this case, will be the lantern build engine, an engine I wrote to aid me with automating repetitive tasks when working on my University ACWs (Assessed CourseWork).

Every job has its own workspace, which acts as a common area to store and cache various files across all the runs of that job. Each run of a job also has its very own private area too - which will be useful later on.

The first step in this proxy script is to extract the parameters of the run that we're supposed to be doing. For me, I store this in a number of environment variables, which are set when queuing the job run from the git post-receive (or web) hook:

| Variable        | Example                                               | Description |
|-----------------|-------------------------------------------------------|-------------|
| GIT_REPO_NAME   | git-starbeamrainbowlabs-com-sbrl-rhinoreminds         | The safe name of the repository that we're running against, with potentially troublesome characters removed. |
| GIT_REF_NAME    | refs/heads/master                                     | Basically the branch that we're working on. Useful for logging purposes. |
| GIT_REPO_URL    | git@git.starbeamrainbowlabs.com:sbrl/rhinoreminds.git | The URL of the repository that we're running against. |
| GIT_COMMIT_REF  | e23b2e0....                                           | The exact commit to check out and build. |
| GIT_AUTHOR      |                                                       | The friendly name of the author that pushed the commit. Useful for logging purposes. |

Before we do anything else, we need to make sure that these variables are defined:

set -e; # Don't allow errors

# Check that all the right variables are present
if [ -z "${GIT_REPO_NAME}" ]; then echo -e "Error: The environment variable GIT_REPO_NAME isn't set." >&2; exit 1; fi
if [ -z "${GIT_REF_NAME}" ]; then echo -e "Error: The environment variable GIT_REF_NAME isn't set." >&2; exit 1; fi
if [ -z "${GIT_REPO_URL}" ]; then echo -e "Error: The environment variable GIT_REPO_URL isn't set." >&2; exit 1; fi
if [ -z "${GIT_COMMIT_REF}" ]; then echo -e "Error: The environment variable GIT_COMMIT_REF isn't set." >&2; exit 1; fi
if [ -z "${GIT_AUTHOR}" ]; then echo -e "Error: The environment variable GIT_AUTHOR isn't set." >&2; exit 1; fi

There are a bunch of other variables that I'm omitting here, since they are dynamically determined from the build variables. I extract many of these additional variables using regular expressions. For example:

GIT_REF_TYPE="$(regex_match "${GIT_REF_NAME}" 'refs/([a-z]+)')";

GIT_REF_TYPE is the bit after the refs/ and before the actual branch or tag name. It basically tells us whether we're building against a branch or a tag. That regex_match function is a utility function that I found in the pure bash bible - which is an excellent resource on various tips and tricks to do common tasks without spawning subprocesses - and therefore obtaining superior performance and lower resource usage. Here it is:

# @source https://github.com/dylanaraps/pure-bash-bible#use-regex-on-a-string
# Usage: regex "string" "regex"
regex_match() {
[[ $1 =~$2 ]] && printf '%s\n' "${BASH_REMATCH[1]}" } Very cool. For completeness, here are the remainder of the secondary environment variables. Many of them aren't actually used directly - instead they are used indirectly by other scripts and lantern build engine tasks that we call from the main Laminar CI job. if [[ "${GIT_REF_TYPE}" == "tags" ]]; then
GIT_TAG_NAME="$(regex_match "${GIT_REF_NAME}" 'refs/tags/(.*)$')"; fi # NOTE: These only work with SSH urls. GIT_REPO_OWNER="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=:)[^/]+(?=/)')"; GIT_REPO_NAME_SHORT="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=/)[^/]+(?=\.git$)')";
GIT_SERVER_DOMAIN="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=@)[^/]+(?=:)')";


GIT_TAG_NAME is the name of the tag that we're building against - but only if we've been passed a tag as the GIT_REF_TYPE.

The GIT_SERVER_DOMAIN is important for sending the status reports to the right place. Gitea supports a status API that we can hook into to report on how we're doing. You can see it in action here on my RhinoReminds repository. Those green ticks are the build status that was reported by the Laminar CI job that we're writing in this post. Unfortunately you won't be able to click on it to see the actual build output, as that is currently protected behind a username and password, since the Laminar CI web interface exposes all the git projects I've currently got set up on it - including a number of private ones that I can't share.

Anyway, with all our environment variables in order, it's time to do something with them. Before we do though, we should tell Gitea that we're starting the build process:

send-status-gitea "${GIT_COMMIT_REF}" "pending" "Executing build...."; I haven't yet implemented support for sending notifications to GitHub, but it's on my todo list. In theory it's pretty easy to do - this is why I've got that GIT_SERVER_DOMAIN variable above in anticipation of this. That send-status-gitea function there is another helper script I've written that does what you'd expect - it sends a status message to Gitea. It does this by using the environment variables we deduced earlier (that are also exported - though I didn't include that in the abovecode snippet) and curl. There's still a bunch of stuff to get through in this post, so I'm going to omit the source of that script from this post for brevity. I've got no particular issue with releasing it though - if you're interested, contact me using the details on my homepage. Next, we need to set an exit trap. This is a function that will run when the Bash process exits - regardless of whether this was because we finished our work successfully, or otherwise. This can be very useful to make absolutely sure that your script cleans up after itself. In our case, we're only going to be using it to report the build status back to Gitea: # Runs on exit, no matter what cleanup() { original_exit_code="$?";

status="success";
description="Build ${RUN} succeeded in$(human-duration "${SECONDS}")."; if [[ "${original_exit_code}" -ne "0" ]]; then
status="failed";
description="Build failed with exit code ${original_exit_code} after$(human-duration "${SECONDS}")"; fi send-status-gitea "${GIT_COMMIT_REF}" "${status}" "${description}";
}

trap cleanup EXIT;

Very cool. The RUN variable there is provided by Laminar CI, and SECONDS is a bash built-in that tells us the number of seconds that the current Bash process has been running for. human-duration is yet another helper script, which I wrote because I like nice readable durations in my status messages - not something unreadable like Build 3 failed in 345 seconds. It's also somewhat verbose - I adapted it from this StackExchange answer.
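
Neither helper is included in this post, but purely for illustration, here's a very rough sketch of the kind of thing they might do. It assumes Gitea's commit status API (POST /api/v1/repos/{owner}/{repo}/statuses/{sha}) and a hypothetical GITEA_TOKEN environment variable holding an API token - the real scripts are standalone executables and may well differ:

#!/usr/bin/env bash
# Rough sketch only - not the original helper scripts.

# Report a commit status to Gitea.
# Usage: send-status-gitea <commit_sha> <state> <description>
# Assumes GIT_SERVER_DOMAIN, GIT_REPO_OWNER, and GIT_REPO_NAME_SHORT are exported,
# and that GITEA_TOKEN (hypothetical name) contains an API token.
send-status-gitea() {
	local commit="$1"; local state="$2"; local description="$3";
	curl --silent --show-error --request POST \
		--header "Authorization: token ${GITEA_TOKEN}" \
		--header "Content-Type: application/json" \
		--data "{ \"state\": \"${state}\", \"description\": \"${description}\", \"context\": \"laminar\" }" \
		"https://${GIT_SERVER_DOMAIN}/api/v1/repos/${GIT_REPO_OWNER}/${GIT_REPO_NAME_SHORT}/statuses/${commit}";
}

# Format a number of seconds as something human-readable, e.g. '0h 5m 45s'.
human-duration() {
	local total="$1";
	printf '%dh %dm %ds' "$((total / 3600))" "$((total % 3600 / 60))" "$((total % 60))";
}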

With that all out of the way, the next item on the list is to work out what job name we're running under. I've chosen git-repo for the name of the master 'virtual' job - that is to say the one whose entire purpose is to queue the actual job. That's pretty easy, since Laminar gives us an environment variable:

if [ "${JOB}" == "git-repo" ]; then # ... fi If the job name is git-repo, then we need to queue the actual job name. Since I don't want to have to manually alter the system every time I'm setting up a new repo on my CI system, I've automated the process with symbolic links. The main git-repo job creates a symbolic link to itself in the name of the repository that it's supposed to be running against, and then queues a new job to run itself under the different job name. This segment takes place nested in the above if statement: # If the job file doesn't exist, create it # We create a symlink here because this is a 'smart' job - whose # behaviour changes dynamically based on the job name. if [ ! -e "${LAMINAR_HOME}/cfg/jobs/${repo_job_name}.run" ]; then pushd "${LAMINAR_HOME}/cfg/jobs";
ln -s "git-repo.run" "${repo_job_name}.run"; popd fi Once we're sure that the symbolic link is in place, we can queue the virtual copy: # Queue our new hologram LAMINAR_REASON="git push by${GIT_AUTHOR} to ${GIT_REF_NAME}" laminarc queue "${repo_job_name}" GIT_REPO_NAME="${GIT_REPO_NAME}" GIT_REF_NAME="${GIT_REF_NAME}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${GIT_COMMIT_REF}" GIT_AUTHOR="${GIT_AUTHOR}"; # If we got to here, we queued the hologram successfully # Clear the trap, because we know that the trap for the hologram will fire # This avoids sending a 2nd status to Gitea, linking the user to the wrong place trap - EXIT; exit 0; This also ensures that if we make any changes to the main job file, all the copies will get updated automatically too. After all, they are only pointers to the actual job on disk. Notice that we also clear the trap there before exiting - that's important, since we're queuing a copy of ourselves, we don't want to report the completed status before we've actually finished. At this point, we can now look at what happens if the job name isn't git-repo. In this case, we need to do a few things: 1. Clone the git repository in question to the shared workspace (if it hasn't been done already) 2. Fetch new commits on the shared repository copy 3. Check out the right commit 4. Copy it to the run-specific directory 5. Execute the build script Additionally, we need to ensure that points #1 to #4 are not done by multiple jobs that are running at the same time, since that would probably confuse things and induce weird and undesirable behaviour. This might happen if we push multiple commits at once, for example - since the git post-receive hook (which I'll be talking about in a future post) queues 1 run per commit. We can make sure of this by using flock. It's actually a feature provided by the Linux Kernel, which allows a single process to obtain exclusive access to a resource on disk. Since each Laminar job has it's own workspace as described above, we can abuse this by doing an flock on the workspace directory. This will ensure that only 1 run per job is accessing the workspace area at once: # Acquire a lock for this repo exec 9<"${WORKSPACE}";
flock --exclusive 9;

echo "[${SECONDS}] Lock acquired"; Nice. Next, we need to clone the repository into the shared workspace if we haven't already: cd "${WORKSPACE}";

# If we haven't already, clone the repository
git_directory="$(echo "${GIT_REPO_URL}" | grep -oP '(?<=/)(.+)(?=.git$)')"; if [ ! -d "${git_directory}" ]; then
echo "[${SECONDS}] Cloning repository"; git clone "${GIT_REPO_URL}";
fi
cd "${git_directory}"; Then, we need to fetch any new commits: # Pull down any updates that are available echo "[${SECONDS}] Downloading commits";
git fetch origin;

....and check out the one we're supposed to be building:

# Checkout the commit we're interested in testing
echo "[${SECONDS}] Checking out${GIT_COMMIT_REF}";
git checkout "${GIT_COMMIT_REF}"; Then, we need to copy the repo to the run-specific directory. This is important, since the run might create new files - and we don't want multiple runs running in the same directory at the same time.  echo "[${SECONDS}] Linking source to run directory";
# Hard-link the repo content to the run directory
# This is important because then we can allow multiple runs of the same repo at the same time without using extra disk space
# -r    Recursive mode
# -a    Preserve permissions
cp -ral ./ "${run_directory}"; # Don't forget the .git directory, .gitattributes, .gitmodules, .gitignore, etc. # This is required for submodules and other functionality, but likely won't be edited - hence we can hardlink here (I think). # NOTE: If we see weirdness with multiple runs at a time, then we'll need to do something about this. cp -ral ./.git* "${run_directory}/.git";

I'm using hard linking here for efficiency - I'm banking on the fact that the build script I call isn't going to modify any existing files. Thinking about it, I should do a git reset --hard there just in case - though then I'd have all sorts of nasty issues with timing problems.

So far, I haven't had any issues. If I do, then I'll just disable the hard linking and copy instead. This entire script assumes a trusted environment - i.e. it trusts that the code being executed is not malicious. To this end, it's only suitable for personal projects and the like.

For it to be useful in untrusted environments, it would need to avoid hard linking and execute the build script inside a container - e.g. using LXD or Docker.
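
Purely as an illustration of what that might look like with Docker (untested, and with a made-up image name - this isn't part of the real setup):

# Hypothetical sketch: run the build inside a throwaway container instead of on the host
# 'example-build-image' is a placeholder image name
docker run --rm \
	--volume "${run_directory}:/workspace" \
	--workdir /workspace \
	example-build-image \
	./build ci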

Moving on, we next need to release that flock and return to the run-specific directory:

# Go back to the job-specific run directory
cd "${run_directory}"; # Release the lock exec 9>&- # Close file descriptor 9 and release lock echo "[${SECONDS}] Lock released";

At this point, we're all set up to run the build script. We need to find it first though. I've currently got 2 standards I'm using across my repositories: build and build.sh. This is easy to automate:

build_script="./build";
if [ ! -x "${build_script}" ]; then build_script="./build.sh"; fi # FUTURE: Add Makefile support here? if [ ! -x "${build_script}" ]; then
echo "[${SECONDS}] Error: Couldn't find the build script, or it wasn't marked as executable." >&2; exit 1; fi Now that we know where it is, we can execute it. Before we do though, as a little extra I like to run shellcheck over it - since we assume that it's a shell script too (though it might call something that isn't a shell script): echo "----------------------------------------------------------------"; echo "------------------ Shellcheck of build script ------------------"; set +e; # Allow shellcheck errors - we just warn about them shellcheck "${build_script}";
set -e;
echo "----------------------------------------------------------------";

I can highly recommend shellcheck - it finds a number of potential issues in both style and syntax that might cause your shell scripts to behave in unexpected ways. I've learnt a bunch about shell scripting and really improved my skills from using it on a regular basis.
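
If you want to try it out yourself, it's packaged in the default repositories of most Linux distributions - for example:

sudo apt install shellcheck   # Debian / Ubuntu
shellcheck path/to/some_script.sh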

Finally, we can now actually execute the build script:

echo "[${SECONDS}] Executing '${build_script} ci'";

nice -n10 ${build_script} ci

I pass the argument ci here, since the lantern build engine takes task names as arguments on the command line. If it's not a lantern script, then it can be interpreted as a helpful hint as to the environment that it's running in.

I also nice it to lower its priority, since I actually have my Laminar CI server running on a Raspberry Pi and its resources are rather limited. Oddly, I found that I'd lose other essential services (e.g. SSH) if I didn't do this for some reason - since build tasks are usually quite computationally expensive.

That completes the build script. Of course, when the above finishes executing, the trap that we set earlier will trigger and the build status will be reported. I'll include the full script at the bottom of this post.

This was a long post! We've taken a deep dive into the engine that powers my build system. In the next few posts, I'd like to talk about the git post-receive hook I've been mentioning that triggers this job. I'd also like to talk formally about the lantern build engine - what it is, where it came from, and how it works.

Found this interesting? Spotted a mistake? Got a suggestion? Confused about something? Comment below!

#!/usr/bin/env bash
set -e; # Don't allow errors

# Check that all the right variables are present
if [ -z "${GIT_REPO_NAME}" ]; then echo -e "Error: The environment variable GIT_REPO_NAME isn't set." >&2; exit 1; fi
if [ -z "${GIT_REF_NAME}" ]; then echo -e "Error: The environment variable GIT_REF_NAME isn't set." >&2; exit 1; fi
if [ -z "${GIT_REPO_URL}" ]; then echo -e "Error: The environment variable GIT_REPO_URL isn't set." >&2; exit 1; fi
if [ -z "${GIT_COMMIT_REF}" ]; then echo -e "Error: The environment variable GIT_COMMIT_REF isn't set." >&2; exit 1; fi
if [ -z "${GIT_AUTHOR}" ]; then echo -e "Error: The environment variable GIT_AUTHOR isn't set." >&2; exit 1; fi

# It's checked directly anyway
# shellcheck disable=SC1091
source source_regex_match.sh;

GIT_REF_TYPE="$(regex_match "${GIT_REF_NAME}" 'refs/([a-z]+)')";

if [[ "${GIT_REF_TYPE}" == "tags" ]]; then GIT_TAG_NAME="$(regex_match "${GIT_REF_NAME}" 'refs/tags/(.*)$')";
fi

# NOTE: These only work with SSH urls.
GIT_REPO_OWNER="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=:)[^/]+(?=/)')";
GIT_REPO_NAME_SHORT="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=/)[^/]+(?=\.git$)')"; GIT_SERVER_DOMAIN="$(echo "${GIT_REPO_URL}" | grep -Po '(?<=@)[^/]+(?=:)')"; export GIT_REPO_OWNER GIT_REPO_NAME_SHORT GIT_SERVER_DOMAIN GIT_REF_TYPE GIT_TAG_NAME; ############################################################################### # Example URL: git@git.starbeamrainbowlabs.com:sbrl/rhinoreminds.git # Environment variables: # GIT_REPO_NAME git-starbeamrainbowlabs-com-sbrl-rhinoreminds # GIT_REF_NAME refs/heads/master, refs/tags/v0.1.1-build7 # GIT_REF_TYPE heads, tags # Determined dynamically from GIT_REF_NAME. # GIT_TAG_NAME v0.1.1-build7 # Determined dynamically from GIT_REF_NAME, only set if GIT_REF_TYPE == "tags". # GIT_REPO_URL git@git.starbeamrainbowlabs.com:sbrl/rhinoreminds.git # GIT_COMMIT_REF e23b2e0f3c0b9f48effebca24db48d9a3f028a61 # GIT_AUTHOR bob # Generated: # GIT_SERVER_DOMAIN git.starbeamrainbowlabs.com # GIT_REPO_OWNER sbrl # GIT_REPO_NAME_SHORT rhinoreminds # GIT_RUN_SOURCE github # Not always set. If not set then assume git.starbeamrainbowlabs.com send-status-gitea "${GIT_COMMIT_REF}" "pending" "Executing build....";

# Runs on exit, no matter what
cleanup() {
original_exit_code="$?"; status="success"; description="Build${RUN} succeeded in $(human-duration "${SECONDS}").";
if [[ "${original_exit_code}" -ne "0" ]]; then status="failed"; description="Build failed with exit code${original_exit_code} after $(human-duration "${SECONDS}")";
fi

send-status-gitea "${GIT_COMMIT_REF}" "${status}" "${description}"; } trap cleanup EXIT; ############################################################################### repo_job_name="$(echo "${GIT_REPO_NAME}" | tr '/' '--')"; if [ "${JOB}" == "git-repo" ]; then
# If the job file doesn't exist, create it
# We create a symlink here because this is a 'smart' job - whose
# behaviour changes dynamically based on the job name.
if [ ! -e "${LAMINAR_HOME}/cfg/jobs/${repo_job_name}.run" ]; then
pushd "${LAMINAR_HOME}/cfg/jobs"; ln -s "git-repo.run" "${repo_job_name}.run";
popd
fi

# Queue our new hologram
LAMINAR_REASON="git push by ${GIT_AUTHOR} to${GIT_REF_NAME}" laminarc queue "${repo_job_name}" GIT_REPO_NAME="${GIT_REPO_NAME}" GIT_REF_NAME="${GIT_REF_NAME}" GIT_REPO_URL="${GIT_REPO_URL}" GIT_COMMIT_REF="${GIT_COMMIT_REF}" GIT_AUTHOR="${GIT_AUTHOR}";
# If we got to here, we queued the hologram successfully
# Clear the trap, because we know that the trap for the hologram will fire
# This avoids sending a 2nd status to Gitea, linking the user to the wrong place
trap - EXIT;

exit 0;
fi

# We're running in hologram mode!

# Remember the run directory - we'll need it later
run_directory="$(pwd)"; # Important directories: #$WORKSPACE        Shared between all runs of a job
# $run_directory The initial directory a run lands in. Empty and run-specific. #$ARCHIVE          Also run-speicfic, but the contents is persisted after the run ends

# Acquire a lock for this repo
#laminarc lock "${JOB}-workspace"; exec 9<"${WORKSPACE}";
flock --exclusive 9;
###############################################################################
# No need to allow errors here, because the lock will automagically be released
# if the process crashes, as that'll close the file description anyway :P
echo "[${SECONDS}] Lock acquired"; cd "${WORKSPACE}";

# If we haven't already, clone the repository
git_directory="$(echo "${GIT_REPO_URL}" | grep -oP '(?<=/)(.+)(?=.git$)')"; if [ ! -d "${git_directory}" ]; then
echo "[${SECONDS}] Cloning repository"; git clone "${GIT_REPO_URL}";
fi
cd "${git_directory}"; # Pull down any updates that are available echo "[${SECONDS}] Downloading commits";
git fetch origin;
# Checkout the commit we're interested in testing
echo "[${SECONDS}] Checking out${GIT_COMMIT_REF}";
git checkout "${GIT_COMMIT_REF}"; echo "[${SECONDS}] Linking source to run directory";
# Hard-link the repo content to the run directory
# This is important because then we can allow multiple runs of the same repo at the same time without using extra disk space
# -r    Recursive mode
# -a    Preserve permissions
cp -ral ./ "${run_directory}"; # Don't forget the .git directory, .gitattributes, .gitmodules, .gitignore, etc. # This is required for submodules and other functionality, but likely won't be edited - hence we can hardlink here (I think). # NOTE: If we see weirdness with multiple runs at a time, then we'll need to do something about this. cp -ral ./.git* "${run_directory}/.git";
echo "[${SECONDS}] done"; # Go back to the job-specific run directory cd "${run_directory}";

###############################################################################
# Release the lock
exec 9>&- # Close file descriptor 9 and release lock
#laminarc release "${JOB}-workspace"; echo "[${SECONDS}] Lock released";

echo "[${SECONDS}] Finding build script"; build_script="./build"; if [ ! -x "${build_script}" ]; then build_script="./build.sh"; fi
# FUTURE: Add Makefile support here?
if [ ! -x "${build_script}" ]; then echo "[${SECONDS}] Error: Couldn't find the build script, or it wasn't marked as executable." >&2;
exit 1;
fi

echo "[${SECONDS}] Executing '${build_script} ci'";

echo "----------------------------------------------------------------";
echo "------------------ Shellcheck of build script ------------------";
set +e; # Allow shellcheck errors - we just warn about them
shellcheck "${build_script}"; set -e; echo "----------------------------------------------------------------"; nice -n10${build_script} ci

## Quick File Management with Gossa

Recently a family member needed to access some documents at a remote location that didn't support USB flash drives. Awkward to be sure, but I did some searching around and found a nice little solution that I thought I'd blog about here.

At first, I thought about setting up Filestash - but I discovered that only installation through Docker is officially supported (if it's written in Go, then shouldn't it end up as a single binary? What's Docker needed for?).

Docker might be great, but for a quick solution to an awkward issue I didn't really want to go to the trouble of installing Docker and figuring out all the awkward plumbing problems for the first time. It definitely appeared to me that it's better suited to a setup where you're already using Docker.

Anyway, I then discovered Gossa. It's also written in Go, and is basically a web interface that lets you upload, download, and rename files (click on a file or directory's icon to rename).

Is it basic? Yep.

Do the icons look like something from 1995? Sure.

(Is that Times New Roman I spy? I hope not)

Does it do the job? Absolutely.

For what it is, it's solved my problem fabulously - and it's so easy to setup! First, I downloaded the binary from the latest release for my CPU architecture, and put it somewhere on disk:

curl -o gossa -L https://github.com/pldubouilh/gossa/releases/download/v0.0.8/gossa-linux-arm

chmod +x gossa
sudo chown root: gossa
sudo mv gossa /usr/local/bin/gossa;

Then, I created a systemd service file to launch Gossa with the right options:

[Unit]
Description=Gossa File Manager (syncthing)
After=syslog.target rsyslog.service network.target

[Service]
Type=simple
User=gossa
Group=gossa
WorkingDirectory=/path/to/dir
ExecStart=/usr/local/bin/gossa -h [::1] -p 5700 -prefix /gossa/ /path/to/directory/to/serve
Restart=always

StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=gossa

[Install]
WantedBy=multi-user.target

_(Top tip! Use systemctl cat service_name to quickly see the service file definition for any given service)_

Here I start Gossa listening on the IPv6 local loopback address on port 5700, set the prefix to /gossa/ (I'm going to be reverse-proxying it later on using a subdirectory of a pre-existing subdomain), and send the standard output & error to syslog. Speaking of which, we should tell syslog what to do with the logs we send it. I put this in /etc/rsyslog.d/gossa.conf:

if $programname == 'gossa' then /var/log/gossa/gossa.log
if $programname == 'gossa' then stop

After that, I configured log rotate by putting this into /etc/logrotate.d/gossa:

/var/log/gossa/*.log {
daily
missingok
rotate 14
compress
delaycompress
notifempty
postrotate
invoke-rc.d rsyslog rotate >/dev/null
endscript
}

Very similar to the configuration I used for RhinoReminds, which I blogged about here.

Lastly, I configured Nginx on the machine I'm running this on to reverse-proxy to Gossa:

server {

# ....

location /gossa {
proxy_pass http://[::1]:5700;
}

# ....

}

I've configured authentication elsewhere in my Nginx server block to protect my installation against unauthorised access (and you probably should too). All that's left to do is start Gossa and reload Nginx:

sudo systemctl daemon-reload
sudo systemctl start gossa
# Check that Gossa is running
sudo systemctl status gossa

sudo nginx -t
sudo systemctl reload nginx

Note that reloading Nginx is more efficient than restarting it, since it doesn't kill the process - it only reloads the configuration from disk. It doesn't matter here, but in a production environment that receives a high volume of traffic, it's a great way to make configuration changes while avoiding dropping client connections.
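
As for the authentication I mentioned above, HTTP basic auth is one quick option if you don't have anything fancier to hand - something along these lines (the path and username are just examples; htpasswd comes from the apache2-utils package on Debian-based systems):

# Create a password file for HTTP basic authentication (example path / username)
sudo htpasswd -c /etc/nginx/.htpasswd some_username

# Then add these 2 directives inside the location /gossa block and reload Nginx again:
#     auth_basic "Restricted";
#     auth_basic_user_file /etc/nginx/.htpasswd;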

Found this interesting? Got another quick solution to an otherwise awkward issue? Comment below!

## Next Gen Search, Part 1: Backend Storage

I've got a bit of a thing about full-text search engines. I've talked about one in particular before for Pepperminty Wiki, and I was thinking about it again the other day - and how I could optimise it further.

If you haven't already, I do recommend reading my previous post on the curious question - as a number of things in this post might not make sense otherwise.

Between the time I wrote that last post and now, I've done quite a bit more digging into the root causes of that ~450ms search time, and I eventually determined that most of it was actually being spent deserialising the inverted index from the JSON file it's stored in back into memory.

This is horribly inefficient and is actually taking up 90% of that query time, so I got to thinking about what I could do about it.

My solution was multi-faceted, as I also (separately) developed a new search-term analysis system (which I abbreviated to STAS, because it sounds cool :D) to add advanced query syntax such as -dog to exclude pages that contain the word dog and the like - which I also folded in at the same time as fixing the existing bottleneck.

As of the time of typing, the work on STAS is still ongoing. This doesn't mean that I can't blog about the rest of it though! I've recently-ish come into contact with key-value data stores, such as Redis and LevelDB (now RocksDB). They work rather like a C♯ dictionary or a browser's localStorage, in that they store values that are associated with unique keys (Redis is a bit of a special case though, for reasons that I won't get into here).
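
If you haven't met one before, the core idea is simply values filed under unique keys - for example, poking at Redis with its bundled command-line client:

redis-cli SET greeting "hello, world"
redis-cli GET greeting
# "hello, world"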

Such a data store would suit an inverted index surprisingly well. I devised a key system to represent an inverted index:

• The first key, |termindex|, is used to store a master list of words that have entries in the page index.
• The second key, term, is simply the word itself (e.g. cat, chicken, rocket, etc.), and stores a list of ids of the pages that contain that word.
• The third and final key, term|pageid, is a word followed by a separator and finally the id of a page (e.g. bill|1, cat|45, boosters|69, etc.). These keys store the locations that a particular word appears at in a given document in the index. A separate id index is needed to convert between the page id and its associated name - Pepperminty Wiki provides this functionality out-of-the-box.

The more I thought about it, the more I liked it. If I use a key-value data store in this manner, I can store the values as JSON objects - and then I only have to deserialise the parts of the index that I actually use. Furthermore, adding this extra layer of abstraction allows for some clever trickery to speed things up even more.

The problem here is that Pepperminty Wiki is supposed to be portable, so I try not to use any heavy external libraries or depend on odd PHP modules that not everyone will have installed. While a LevelDB extension for PHP does exist, it's not installed by default and it's a PECL module, which are awkward to install.

All isn't lost though, because it turns out that SQLite functions surprisingly well as a key-value data store:

CREATE TABLE store (
key TEXT UNIQUE NOT NULL,
value TEXT
);
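
Purely as an illustration of the idea (the filename, keys, and values here are made up), such a table can be driven as a key-value store straight from the sqlite3 command-line shell:

# Illustrative only - made-up filename and data
sqlite3 index.sqlite "CREATE TABLE IF NOT EXISTS store (key TEXT UNIQUE NOT NULL, value TEXT);"
sqlite3 index.sqlite "INSERT OR REPLACE INTO store(key, value) VALUES ('cat', '[1,45,112]');"
sqlite3 index.sqlite "SELECT value FROM store WHERE key = 'cat';"
# prints: [1,45,112]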

Yes - it really is that simple! Now all we need is some code to create and interface with such a database. Some simple getters and setters should suffice!

(Can't see the above? Try a direct link.)

While this works, I quickly ran into several major problems:

• Every time I read from the data store I'm decoding JSON, which is expensive
• Every time I'm saving to the data store, I'm encoding to JSON, which is also expensive
• I'm reading and writing the same thing multiple times, which is very expensive
• Writing anything to the data store takes a long time

The (partial) solution here was to refactor it such that the json encoding is handled by the storage provider, and to implement a cache.

Such a cache could just be an associative array:

private $cache = [];

Then, to fetch a value, we can do this:

// If it's not in the cache, insert it
if(!isset($this->cache[$key])) {
$this->cache[$key] = [
"modified" => false,
"value" => json_decode($this->query(
"SELECT value FROM store WHERE key = :key;",
[ "key" => $key ]
)->fetchColumn())
];
}
return $this->cache[$key]["value"];

Notice how each item in the cache is also an associative array. This is so that we can flag items that have been modified in memory, such that when we next sync to disk we can write multiple changes all at once in a single batch.

That turns the setter into something like this:

if(!isset($this->cache[$key])) $this->cache[$key] = [];
$this->cache[$key]["value"] = $value;
$this->cache[$key]["modified"] = true;

Very cool! Now all we need is a function to batch-write all the changes to disk. This isn't hard either:

foreach($this->cache as $key => $value_data) {
// If it wasn't modified, there's no point in saving it, is there?
if(!$value_data["modified"])
continue;

$this->query(
"INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value)",
[
"key" => $key,
"value" => json_encode($value_data["value"])
]
);
}

I'll get onto the cogs and wheels behind that query() function a bit later in this post. It's one of those utility functions that are great to have around that I keep copying from one project to the next.

Much better, but still not great. Why does it still take ages to write all the changes to disk? Well, it turns out that by default SQLite wraps every INSERT in its own transaction. If we wrap our code in an explicit transaction, we can seriously boost the speed even further:

$this->db->beginTransaction();

// Do batch writing here

$this->db->commit();

Excellent (boomed the wizard). But wait, there's more! The PHP PDO database driver supports prepared statements, which is a fancy way of caching SQL statements and substituting values into them. We use these already, but since we only use a very limited number of SQL queries, we can squeak some additional performance out by caching them in their prepared forms (preparing them is relatively computationally expensive, after all).

This is really easy to do. Let's create another associative array to store them in:

private $query_cache = [];

Then, we can alter the query() function to look like this:

/**
* Makes a query against the database.
* @param   string  $sql        The (potentially parametised) query to make.
* @param   array   $variables  Optional. The variables to substitute into the SQL query.
* @return  \PDOStatement       The result of the query, as a PDOStatement.
*/
private function query(string $sql, array $variables = []) {
// Add to the query cache if it doesn't exist
if(!isset($this->query_cache[$sql]))
$this->query_cache[$sql] = $this->db->prepare($sql);
$this->query_cache[$sql]->execute($variables);

return $this->query_cache[$sql]; // fetchColumn(), fetchAll(), etc. are defined on the statement, not the return value of execute()
}

If a prepared statement for the given SQL exists in the query cache already, we re-use that again instead of preparing a brand-new one. This works especially well, since we perform a large number of queries with the same small set of SQL queries. Get operations all use the same SQL query, so why not cache it?

This completes our work on the backend storage for the next-generation search system I've built for Pepperminty Wiki. Not only will it boost the speed of Pepperminty Wiki's search engine, but it's also a cool reusable component, which I can apply to other applications to give them a boost, too!

Next time, we'll take a look at generating the test data required to stress-test the system. I also want to give an overview of the new search-term analysis system and associated changes to the page ranking system I implemented (and aspire to implement) too, but those may have to wait until another post (teaser: PageRank isn't really what I'm after).

(Can't see the above? Try a direct link.)

## Own your code, part 3: Shell scripting infrastructure

In the last post, I told the curious tale of my unreliable webhook. In the post before that, I talked about my Gitea-powered git server and how I set it up. In this one, we're going to back up a bit and look at setting up Laminar CI.

Laminar CI is a continuous integration program that takes a decidedly different approach to the one you see in solutions like GitLab CI and Travis. It takes a much more minimal approach, instead preferring to provide you with the tools you need and letting you get on with setting it up however you like and integrating it with whatever you like.

I recommend looking at its website and user manual to get a feel for how it works. In short, it lets you do things like this:

laminarc queue build-code
laminarc show-jobs

It is, of course, entirely command-line based. In order to integrate it with other services, webhooks are needed. In my case, I've used webhook for this purpose.

Before we get into that though, we should outline how the system we build should work. For me, I'm not happy with filling a folder with job scripts. I want my CI system to have the following properties:

• I want to have all the configuration files and scripts under version control
• I want to keep project-specific CI scripts in their appropriate repositories
• I don't want to have to alter the CI configuration every time I start a new project.
• The CI system should be stable - it shouldn't fall over if multiple jobs for the same repository are running at the same time.

Bold claims. Achieving this was actually quite complicated, and demanded a pretty sophisticated infrastructure that's comprised of multiple independent shell scripts. It's best explained with a diagram:

There are 3 different machines at play here:

1. The local computer, where we write our code
2. The git server, where the code is stored
3. The CI server, which runs continuous integration tasks

Somehow, we need to notify the CI server that it needs to do something when new commits end up at the git server. We can achieve this with a git post-receive hook, which is basically a shell script (yep, we'll be seeing a lot of those) that can perform some logic on the server just after a push is complete, but before the client that pushed has disconnected.
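
As a quick reminder of the shape of such a hook (purely illustrative - the real hook is the subject of a later post), git runs it after a push completes and feeds it one line per updated ref on its standard input:

#!/usr/bin/env bash
# Illustrative only - not the hook from this series
# Each input line has the form: <old-commit> <new-commit> <ref-name>
while read -r oldrev newrev refname; do
	echo "Push received for ${refname}: ${oldrev} -> ${newrev}";
done
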
In this post-receive hook, we need to trigger the webhook that notifies the CI server that there are new commits for it to test. GitHub makes this easy, as it provides a webhook system that you can configure via a GUI - but it doesn't let you set the webhook script directly as far as I know. Alternatively, should we run into issues with the webhook and we have control over the git server, we can trigger the CI build directly by writing a post-receive git hook that utilises SSH port forwarding. This is what I did in the end for my personal git server, though as I noted in part 2, I did end up working around the webhook issues so that I could have it work with GitHub too.

For the webhook to work, we'll need a receiving script that parses the JSON body of the webhook itself and queues the laminar job.

In order for the laminar job to work without modification when we add a new project, it will have to come in 2 parts. The first will be a generic 'virtual' or 'smart' job, which should create a symbolic link to itself under the name of the repository that we want to run CI tasks for. When called by Laminar under a repository-specific name, we want to run the CI tasks - but only on a copy of the main repository. Additionally, we don't want to re-clone the repository each time - that's slow and wastes bandwidth.

Finally, we need a unified standard for defining CI tasks in the repositories for which we want to enable continuous integration. This we can achieve with the use of the lantern build engine, which I'll talk about (and its history!) in a future post.

Putting all this together has been quite the undertaking - hence this series of blog posts! For now, since this blog post somehow seems to be getting rather long already and we've laid down the foundations of quite the complicated system, I think I'll leave this post here. In the next post, we'll look at building the core of the system: the main Laminar CI job that will organise the execution of project-specific CI tasks.

## Thinking about coding style

Upgrading my blog to support view counting got me thinking about programming styles. Do you put your braces on a separate line, or on the same one as your if statements? What about whitespace and new lines? And then there's even the casing of variable names to consider, such as snake_case, PascalCase, or camelCase. As if to add to the confusion, there are also paradigms to worry about: object-oriented, functional, or procedural?

Personally, I think it depends on the project you're working on as to what programming style you use. Depending on the project, I end up formatting my code completely differently - taking into account various factors such as the style of any pre-existing code, the language it's written in, and other things.

When I first implemented this blog, I used a fairly procedural programming style with snake_case variable naming. While I would certainly write it very differently if I implemented it now, when I add to it I try to ensure that the code I add follows a similar style, whilst simultaneously modernising the codebase little by little to make it easier to maintain.

However, when I work on Air Quality Web (I blogged about it here), I adopt a very different style. I write object-oriented code, with a combination of PascalCase for class names and snake_case for variables.
While it's important to remember that certain design patterns and code formatting decisions work better than others, I'm firmly of the opinion that there isn't a single 'right' way to program. While some people have tried to standardise code formatting, I'm not so sure that it's really worth the extra effort. After all, if you can read it and others can read and understand it too, does it really matter if all the whitespacing in the entire project is completely uniform?

## Setting up a Mosquitto MQTT server

I recently found myself setting up a mosquitto instance (yep, for this) due to a migration we're in the middle of doing, and it got quite interesting, so I thought I'd post about it here. This post is also partly documentation of what I did and why, just in case future people come across it and wonder how it's set up, though I have tried to make it fairly self-documenting.

At first, I started by doing sudo apt install mosquitto and seeing if it would work. I can't remember if it did or not, but it certainly didn't after I played around with the configuration files. To this end, I decided that enough was enough and I turned the entire configuration upside-down.

First up, I needed to disable the existing SysV init-based service that ships with the mosquitto package:

sudo systemctl stop mosquitto # Just in case
sudo systemctl disable mosquitto

Next, I wrote a new systemd service file:

[Unit]
Description=Mosquitto MQTT Broker
After=syslog.target rsyslog.target network.target

[Service]
Type=simple
PIDFile=/var/run/mosquitto/mosquitto.pid
User=mosquitto
PermissionsStartOnly=true
ExecStartPre=-/bin/mkdir /run/mosquitto
ExecStartPre=/bin/chown -R mosquitto:mosquitto /run/mosquitto
ExecStart=/usr/sbin/mosquitto --config-file /etc/mosquitto/mosquitto.conf
ExecReload=/bin/kill -s HUP $MAINPID

StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=mosquitto

[Install]
WantedBy=multi-user.target

This is broadly similar to the service file I developed in my earlier tutorial post, but it's slightly more complicated.

For one, I use PermissionsStartOnly=true and a series of ExecStartPre directives to allow mosquitto to create a PID file in a directory in /run. /run is a special directory on Linux for PID files and other such things, but normally only root can modify it. mosquitto will be running under the mosquitto user (surprise surprise), so we need to create a subdirectory for it and chown it so that it has write permissions.

A PID file is just a regular file on disk that contains the PID (Process IDentifier) number of the primary process of a system service. System service managers such as systemd and OpenRC use this number to manage the health of the service while it's running and send it various signals (such as to ask it to reload its configuration file).
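For example (not something I actually needed to do here, but just to illustrate what the PID file is for), you could use it by hand to ask mosquitto to reload its configuration:

# Send SIGHUP to the process whose PID is stored in the PID file;
# mosquitto re-reads its configuration file when it receives this signal
sudo kill -HUP "$(cat /run/mosquitto/mosquitto.pid)"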

With this in place, I then added an rsyslog definition at /etc/rsyslog.d/mosquitto.conf to tell it where to put the log files:

if $programname == 'kraggwapple' then /var/log/mosquitto/mosquitto.log
if $programname == 'kraggwapple' then stop

Thinking about it, I should probably check that a log rotation definition file is also in place.
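Something along these lines would probably do the job - a hypothetical /etc/logrotate.d/mosquitto, which I haven't actually tested against this setup:

/var/log/mosquitto/mosquitto.log {
	weekly
	rotate 8
	compress
	missingok
	notifempty
	copytruncate
}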

Just in case, I then chowned the pre-existing log files to ensure that rsyslog could read from & write to them:

sudo chown -R syslog: /var/log/mosquitto

Then, I filled out /etc/mosquitto/mosquitto.conf with a few extra directives and restarted the service. Here's the full configuration file:

# Place your local configuration in /etc/mosquitto/conf.d/
#
# A full description of the configuration file is at
# /usr/share/doc/mosquitto/examples/mosquitto.conf.example

# NOTE: We can't use tab characters here, as mosquitto doesn't like it.

pid_file /run/mosquitto/mosquitto.pid

# Persistence configuration
persistence true
persistence_location /var/lib/mosquitto/

# Not a file today, thanks
# Log files will actually end up at /var/log/mosquitto/mosquitto.log, but will go via syslog
# See /etc/rsyslog.d/mosquitto.conf
#log_dest file /var/log/mosquitto/mosquitto.log
log_dest syslog

include_dir /etc/mosquitto/conf.d

# Documentation: https://mosquitto.org/man/mosquitto-conf-5.html

# Require a username & password to connect....
allow_anonymous false
# ....which are stored in the following file
password_file /etc/mosquitto/mosquitto_users

# Make a log entry when a client connects & disconnects, to aid debugging
connection_messages true

# TLS configuration
# Disabled at the moment, since we don't yet have a letsencrypt cert
# NOTE: I don't think that the sensors currently connect over TLS. We should probably fix this.
# TODO: Point these at letsencrypt
#cafile /etc/mosquitto/certs/ca.crt
#certfile /etc/mosquitto/certs/hostname.localdomain.crt
#keyfile /etc/mosquitto/certs/hostname.localdomain.key

As you can tell, I've still got some work to do here - namely the TLS setup. It's a bit of a chicken-and-egg problem, because I need the domain name to be pointing at the MQTT server in order to get a Let's Encrypt TLS certificate, but that'll break all the sensors using the current one..... I'm sure I'll figure it out.
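For reference, once a certificate is sorted, the TLS side of things might end up looking something like this - a sketch only, since the listener port and certificate paths here are assumptions rather than part of the setup above:

# Hypothetical TLS listener
listener 8883
cafile /etc/mosquitto/certs/chain.pem
certfile /etc/mosquitto/certs/cert.pem
keyfile /etc/mosquitto/certs/privkey.pem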

But wait! We forgot the user accounts. Before I started the new service, I added some user accounts for client applications to connect with:

sudo mosquitto_passwd /etc/mosquitto/mosquitto_users username1
sudo mosquitto_passwd /etc/mosquitto/mosquitto_users username2

The mosquitto_passwd program prompts for a password - that way you don't end up with the passwords in your ~/.bash_history file.

With all that taken care of, I started the systemd service:

sudo systemctl daemon-reload
sudo systemctl start mosquitto-broker.service

Of course, I ended up doing a considerable amount of debugging in between all this - I've edited it down to make it more readable and fit better in a blog post :P

Lastly, because I'm paranoid, I double-checked that it was running with htop and netstat:


sudo netstat -peanut | grep -i mosquitto
tcp        0      0 0.0.0.0:1883            0.0.0.0:*               LISTEN      112        2676558    5246/mosquitto
tcp        0      0 x.y.z.w:1883           x.y.z.w:54657       ESTABLISHED 112        2870033    1234/mosquitto
tcp        0      0 x.y.z.w:1883           x.y.z.w:39365       ESTABLISHED 112        2987984    1234/mosquitto
tcp        0      0 x.y.z.w:1883           x.y.z.w:58428       ESTABLISHED 112        2999427    1234/mosquitto
tcp6       0      0 :::1883                 :::*                    LISTEN      112        2676559    1234/mosquitto


...no idea why it wants to connect to itself, but hey! Whatever floats its boat.
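As an extra sanity check (not something I show above - the topic name and credentials here are just placeholders), the mosquitto command-line clients can be used to make sure that authentication is behaving as expected:

# This should be refused, because allow_anonymous is set to false:
mosquitto_sub -h localhost -p 1883 -t 'test/#' -v

# This should connect, and print anything published to test/#:
mosquitto_sub -h localhost -p 1883 -u username1 -P 'the_password_here' -t 'test/#' -v

# ....and in another terminal, publish a test message:
mosquitto_pub -h localhost -p 1883 -u username1 -P 'the_password_here' -t 'test/hello' -m 'Hello, world!'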
