Archive

## Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blender blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing docker documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions freeside future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro minetest network networking nibriboard node.js open source operating systems optimisation own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference release releases rendering resource review rust searching secrets security series list server software sorting source code control statistics storage svg systemquery talks technical terminal textures thoughts three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 worldeditadditions xmpp xslt

## PhD Aside 2: Jupyter Lab / Notebook First Impressions

Hello there! I'm back with another PhD Aside blog post. In the last one, I devised an extremely complicated and ultimately pointless mechanism by which multiple Node.js processes can read from the same file handle at the same time. This post hopefully won't be quite as useless, as it's a cross with the other reviews / first impressions posts I've made previously.

I've had Jupyter on my radar for ages, but it's only very recently that I've actually given it a try. Despite being almost impossible to spell (though it does appear to be getting easier with time), both it's easy to install and extremely useful when plotting visualisations, so I wanted to talk about it here.

I tried Jupyter Lab, which is apparently more complicated than Jupyter Notebook. Personally though I'm not sure I see much of a difference, aside from a file manager sidebar in Jupyter Lab that is rather useful.

(Above: A Jupyter Lab session of mine, in which I was visualising embeddings from a pretrained CLIP model.)

Jupyter Lab is installed via pip (pip3 for apt-based systems): https://jupyter.org/install. Once installed, you can start a server with jupyter-lab in a terminal (or command line), and then it will automatically open a new tab in your browser that points to the server instance (http://localhost:8888/ by default).

Then, you can open 1 or more Jupyter Notebooks, which seem to be regular files (e.g. Javascript, Python, and more) but are split into 'cells', which can be run independently of one another. While these cells are usually run in order, there's nothing to say that you can't run them out of order, or indeed the same cell over and over again as you prototype a graph.

The output of each cell is displayed directly below it. Be that a console.log()/print() call or a graph visualisation (see the screenshot above), it seems to work just fine. It also saves the output of a cell to disk alongside the code in the Jupyter Notebook, can be a double-edged sword: On the one hand, it's very useful to have the plot and other output be displayed to remind you what you were working on, but on the other hand if the output somehow contains sensitive data, then you need to remember to clear it before saving & committing to git each time, which is a hassle. Similarly, every time the output changes the notebook file on disk also changes, which can result in unnecessary extra changes committed to git if you're not careful.

In the same vein, I have yet to find a way to define a variable in a notebook file whose value is not saved along with the notebook file, which I'd rather like since the e.g. tweets I work with for the social media side of my PhD are considered sensitive information, and so I don't want to commit them to a git repository which will no doubt end up open-source.

You can also import functions and classes from other files. Personally, I see Jupyter notebooks to be most useful when used in conjunction with an existing codebase: while you can put absolutely everything in your Jupyter notebook, I wouldn't recommend it as you'll end up with spaghetti code that's hard to understand or maintain - just like you would in a regular codebase in any other language.

Likewise, I wouldn't recommend implementing an AI model in a Jupyter notebook directly. While you can, it makes it complicated to train it on a headless server - which you'll likely want to do if you want to train a model at any scale.

The other minor annoyance is that by using Jupyter you end up forfeiting thee code intelligence of e.g. Atom or Visual Studio Code, which is a shame since a good editor can e.g. check syntax on the fly, inform you of unused variables, provide autocomplete, etc.

These issues aside, Jupyter is a great fit for plotting visualisations due to the very short improve → rerun → inspect/evaluate output loop. It's also a good fit for writing tutorials I suspect, as it apparently has support for markdown cells too. At some point, I may try writing a tutorial in Jupyter notebook, rendering it to regular markdown, and posting it here.

## Excluding domains from Encrypted DNS

Heya! I've got a quick tip for you that was annoying to look up. When using Encrypted DNS (either by DNS-over-TLS or DNS-over-HTTPS), your DNS requests will often go directly to Cloudflare or Google.

This is all well and good if you have a setup like my home network where DNS for my entire network goes through an Unbound instance which forwards to Cloudflare via Encrypted DNS (associated blog post; it's great for ensuring devices that don't support encrypted DNS are also secure), but things get more complicated if you're another network with Firefox on your laptop. In such a scenario, you most likely want Firefox configured with private/encrypted DNS enabled - but if you have domains on that network (e.g. if it's a network with split-horizon DNS with local Intranet sites), then it's awkward because you have to keep turning encrypted DNS on and off again.

A pretty specific situation that can be annoying and difficult to diagnose, to be sure. The easiest way to spot the issue is to see if the site you are accessing is local to (or hosted on) the network you're connected to, and check that while it doesn't work on your local device, but it does work on other devices on that network.

But no longer! I have discovered a setting in Firefox that allows you do set specific domains that resolved via your system's DNS resolver (for Linux users, that's what is specified in /etc/resolv.conf).

To edit it, first navigate to about:config and dismiss the warning. Then, find the network.trr.builtin-excluded-domains setting. By default for me it's localhost,local.

Once you've located it, you can add the domains you want to exclude from resolving via encrypted DNS to the comma-separated list. It supports wildcards too, so you can do something like this:

localhost,local,mooncarrot.space,*.mooncarrot.space

I'm sure that Chrome has a setting for this too, but I don't use it (for reasons that I could fill an entirely separate blog post with).

I'm mainly posting this for my own reference, but hopefully it helps others too :-)

## Tensorflow and PyTorch compared

Hey there! Since I've used both Tensorflow and PyTorch a bit now, I thought it was time to write a post comparing the two and their respective strengths and weaknesses.

For reference, I've used Tensorflow both for Javascript (less popular) and for Python (more popular) for a number of different models, relating to both my rainfall radar and social media halves to my PhD. While I definitely have less experience with PyTorch, I feel like I have a good enough grasp on it to get a first impression.

Firstly, let's talk about how PyTorch is different from Tensorflow, and what Tensorflow could learn from the former. The key thing I noticed about PyTorch is that it's easily the more flexible of the two. I'm pretty sure that you can create layers and even whole models that do not explicitly define the input and output shapes of the tensors they operate on - e.g. using CNN layers. This gives them a huge amount of power for handling variable sized images or sentences without additional padding, and would be rather useful in Tensorflow - where you must have a specific input shape for every layer.

Unfortunately, this comes at the cost of complexity. Whereas Tensorflow has a .fit() method, in PyTorch you have to implement it yourself - which, as you can imagine - result in a lot of additional code you have to write and test. This was quite the surprise to me when I first used PyTorch!

The other thing I like about PyTorch is the data processing pipeline and it's simplicity. It's easy to understand and essentially guides you to the most optimal solution all on it's own - leading to greater GPU usage, faster model training times, less waiting around, and tighter improve → run → evaluate & inspect → repeat loops.

While in most cases you need to know the number of items in your dataset in advance, this is not necessarily a bad thing - as it gently guides you to the realisation that by changing the way your dataset is stored, you can significantly improve CPU and disk utilisation by making your dataset more amenable to be processed in parallel.

Tensorflow on the other hand has a rather complicated data processing pipeline with multiple ways to do things and no clear guidance I could easily find on building a generic data processing pipeline that didn't make enormous assumptions like "Oh, you want to load images right? Just use this function!" - which really isn't helpful when you want to do something unusual for a research project.

Those tutorials I do find suggest you use a generator function, which can't be parallelised and makes training a model a slow and painful process. Things aren't completely without hope though - Tensorflow has a .map() method on their Dataset objects and also have a .interleave() method (if I recall correctly) to interleave multiple Dataset objects together - which I believe is a relatively recent addition. This is quite a clever way of doing things, if a bit more complicated than PyTorch's solution.

It would be nice though if the tf.data.AUTOTUNE feature for automatically managing the number of parallel workers to use when parallelising things was more intelligent. I recently discovered that it doesn't max out my CPU if I have multiple .map() calls I parallelise for example, when it really should look at the current CPU usage and notice that the CPU is sitting e.g. 50% idle.

Tensorflow for Python has a horrible API more generally. It's a confusing mess as there's both Tensorflow and the inbuilt Keras, which means that it's not obvious where that function you need is - or, indeed, which version thereof you want to call. I know it's a holdover from when Keras wasn't bundled with Tensorflow by default, but the API really should be imagined and tf.keras merged into the main tf namespace somehow.

It can also be unclear when you mix Tensorflow Tensors, numpy arrays and numbers, and plain Python numbers. In some cases, it's impossible to tell where one begins and the other ends, which can be annoying since they all behave differently, so you can in some cases get random error messages when you accidentally mix the types (e.g. "I want a Tensor, not a numpy array", or "I want a plain Python number, not a numpy number").

A great example of what's possible is demonstrated by Tensorflow's own Javascript bindings - to a point. They are much better organised than the Python library for Tensorflow, although they require explicit memory management and disposal of Tensors (which isn't necessarily a bad thing, though it's difficult to compare performance improvements without comparing apples and oranges).

The difficulties start though if you want to do anything in even remotely uncharted territory - Tensorflow.js doesn't have a very wide selection of layers like the Python bindings do (e.g. multi-headed attention). It also seems to have some a number of bugs, meaning you can't just port code from the Python bindings and expect it to work. For example, I tried implementing an autoencoder, but found that that it didn't work as I wanted it to - and for the life of me I couldn't find the bug at all (despite extensive searching).

Another annoyance with Tensorflow.js is that the documentation for exactly which CUDA version you need is very poor - and sometimes outright wrong! In addition, there's no table of versions and associated CUDA + CuDNN versions required like there is for Tensorflow for Python.

It is for these reasons that I find myself using Python much more regularly - even if I dislike Python as a language and ecosystem.

At some point, I'd love to build a generic Tensor library on top of GPU.js. It would naturally support almost any GPU (since GPU.js isn't limited to CUDA-capable devices like Tensorflow is - while you can recompile it with support for other GPUs, I don't recommend it unless you have lots of time on your hands), be applicable to everything from machine to simulation to cellular automata, and run in server, desktop, and browser environments with minimal to no changes to your codebase!

### Conclusion

There's no clear answer to whether you should use PyTorch or Tensorflow for your next project. As a rule of thumb, I suggest starting in Tensorflow due to the reduced boilerplate code, and use PyTorch if you find yourself with a wacky model that Tensorflow doesn't like very much - or you want to use a pretrained model that's only available in one or the other.

Having said this, I can certainly recommend experiencing both libraries, as there are valuable things to be learnt from both frameworks. Unfortunately, I can't recommend Tensorflow.js for anything more than basic tensor manipulations (which it is very good at, despite supporting only a limited range of GPUs without recompilation in Node.js) - even though it's API is nice and neat (and the Python bindings should take significant inspiration from it).

In the near future - one way or another - I will be posting about contrastive learning here soon. It's very cool indeed - I just need to wrap my head around and implement the loss function....

If you have experience with handling matrices, please get in touch as I'd really appreciate some assistance :P

## Minifying CSS, HTML, and more in eleventy static sites

I've built a few sites with eleventy, and one of the things that's been on my todo list is figure out a way to optimise everything.

With websites, it's very important that content loads as fast as possible. To achieve this, a number of strategies can be employed, such as enabling gzip to reduce data transferred. A common theme here in techniques used to improve page load times is either:

1. The number of requests made to the server / amount of data transferred
2. Improving JS / CSS parsing and execution performance

By far the most important thing we can do here with a static site like eleventy though is minifying HTML, CSS, Javascript (if you have any being served to the client), and everything else you serve to the client. In doing so, we can significantly reduce the amount of data transferred from the server to the client.

Build systems like esbuild are a good choice here, but if you have yourself an eleventy-based static site, then esbuild may be somewhat complicated and not best suited to the problem at hand (it's best at bundling JS + CSS assets, and doesn't like HTML very much).

To this end, a solution that is more integrated with eleventy is preferable to reduce complexity. The official eleventy docs suggest using clean-css to minify CSS, but this approach doesn't tackle HTML, and requires you to remember to use the cssmin filter every time.

With this in mind, in this post I want to show a much easier method of minifying CSS, HTML, and anything else (except non-svg images, but I have a solution for those too which I'll talk about in a future post if there's any interest) you can think of.

By using eleventy transforms, we can apply a minification filter to every file that Eleventy generates.

For this post, I'll assume that you already have an eleventy site you want to optimise. if you don't have one yet, I recommend the official docs as a starting point.

Let's start with the CSS. I assume you already have something like this in e.g. css.njk in your project:

---
---

{% include "css/patterns.css" %}
{% include "css/theme.css" %}
{% include "css/gallerybox.css" %}
{% include "css/smallscreens.css" %}
{% include "css/prism-custom.css" %}

This puts all your CSS into a single file. This is good, but we can do better. Let's install clean-css:

npm install --save clean-css

Then, open your .eleventy.js file for editing. Add the following:

// Add to your require() statements at the top of the file:
const CleanCSS = require("clean-css");
const is_production = typeof process.env.NODE_ENV === "string" && process.env.NODE_ENV === "production";

function do_minifycss(source, output_path) {
if(!output_path.endsWith(".css") || !is_production) return source;

const result = new CleanCSS({
level: 2
}).minify(source).styles.trim();
console.log(MINIFY ${output_path}, source.length, →, result.length, (${((1 - (result.length / source.length)) * 100).toFixed(2)}% reduction));
return result;
}

Finally, find the bit at the bottom of the file that looks like this:

module.exports = function(eleventyConfig) {

// Some stuff may be here

}

...and add the following to that function there:

eleventyConfig.addTransform("cssmin", do_minifycss);

In short, for every file that eleventy is just about to right to disk, it executes all the transforms it has registered. In our do_minifycss transform we register, we first ensure it's a .css file that eleventy is writing, and then check that the NODE_ENV environment variable is set to production. If these conditions are met, then we minify the source code we were passed before returning it.

This transform pattern is very useful, and can be applied to any file type you like. For example, we could also minify HTML. To do this, install the html-minifier-terser npm package like this:

npm install --save html-minifier-terser

Then, here's what to add to the .eleventy.js configuration file:

// At the top:
const { minify: minify_html } = require("html-minifier-terser");

// Somewhere in the middle:
async function do_minifyhtml(source, output_path) {
if(!output_path.endsWith(".html") || !is_production) return source;

const result = await minify_html(source, {
collapseBooleanAttributes: true,
collapseWhitespace: true,
collapseInlineTagWhitespace: true,
continueOnParseError: true,
decodeEntities: true,
keepClosingSlash: true,
minifyCSS: true,
quoteCharacter: ",
removeAttributeQuotes: true,
removeRedundantAttributes: true,
removeScriptTypeAttributes: true,
sortAttributes: true,
sortClassName: true,
useShortDoctype: true
});

console.log(MINIFY ${output_path}, source.length, →, result.length, (${((1 - (result.length / source.length)) * 100).toFixed(2)}% reduction));

return result;
}

Finally, add this to the module.exports = function.... at the bottom of the file as before:

eleventyConfig.addTransform("htmlmin", do_minifyhtml);

This follows the same pattern as we did for the CSS, but we instead use the HTML minifier html-minifier-terser as our minifier instead of the clean-css CSS minifier.

This pattern is repeatable over and over for other file types. For example, you could use something like JSON.stringify(JSON.parse(source)) to compress pretty-printed JSON, or wrap svgo to compress SVG images.

If there's a file format, there is probably a minifier for it. Got XML? try minify-xml. Lua (wow, that's an unusual website you've got there)? try luamin. PDF? I'm sure there's a minifier / compressor for those too.

Note that if you have a lot of Javascript, esbuild as I mentioned at the beginning of this post may be a better choice. for your Javascript (and potentially CSS).

The reason for this is that esbuild has the ability to tree-shake your Javascript. In other words, it identifies code that you aren't using, and throws it away. This can be very useful if you are using a number of libraries, as these can seriously bloat the size of your final Javascript file.

### Conclusion

The larger the site, the more of an effect you'll see by minifying your source code. In this post, I've shown you how to minify your source code in your eleventy sites. Other techniques that you can employ to further reduce load times include:

• Optimising images (I'll write a separate post on this if there's interest, as it can be quite involved)
• Reducing the number of domains the browser has to contact by serving external resources locally from your site
• This avoids the extra latency of setting up a brand-new connection to a new place, since multiple requests to the your own domain can re-use the same connection (and, with HTTP/2 enabled, multiplex multiple requests at once over a single connection)

Have you found a cool minifier or got a cool tip to optimise a static site? Please also share these below too.

## On the value of the open source community

Open source is a wonderful thing. With over 200 million repositories on GitHub alone and many many more of SourceHut, GitLab, and thousands of personal git server instances (like mine!) across the globe, there's no question that open source powers the world - look no further than NASA's Curiosity rover!

On a more personal level, open source means a lot to me too, and I wanted to talk a bit about that here. I think the oldest open source project I both started and am continuing to work on and improve would have to be Pepperminty Wiki. At 1.8K commits and the first commit way back in November 2014 (7 years 7 months ago, wow), I've probably poured thousands of hours into it and many other projects over the years.

While at the time it was just because it was something cool I wanted to work on, since then it's come to mean far more to me, and has helped me to develop very useful skills without me even realising it, and I can thoroughly recommend if you have some time to spare and you're beginning a journey into programming / computer science, it's definitely worth your time.

Documenting things. I really can't stress this one enough. Through working on open source projects, I've learnt the power of good documentation. You can write the best program in the world, but if it isn't documented well then nobody will know how to use it! The best test for documentation is when someone comes along and tries your program out without you present, as then in following your documentation they test it ensuring it's updated and accurate.

Writing good documentation is both an iterative process and takes practice but open source is a great place to do so. You don't even have to have written the program yourself - you can even help out another project and improve theirs. Chances are you've read the documentation for a free and open source program already - and if you've ever had a thought of how you'd improve it, don't hesitate to get involved!

Working (remotely) in a team. Speaking of, it's very common to collaborate in open source with people over the Internet - probably in different timezones to you - on everything from tracking down bugs to writing code to reviewing contributions. Doing so effectively can also be a learned skill, but one definitely worth having.

Having a portfolio for your CV. Personally this isn't something I really think about myself, but if you're looking to get a job of some description, then having an open source portfolio of work can definitely work in your favour, and demonstrate your skills to your potential new employer.

Reviewing contributions, resolving disputes. What you review and how often you do so depends greatly on the kind of project you're working on. For me, this is not something I do often on my personal projects, but I do it all the time in tldr-pages.

By reviewing and checking for issues and things other people might have missed, the quality of a project can be improved. It's a fine balance though between getting contributions merged and requesting improvements. On the one hand, suggesting improvements can be a good thing as previously described, but on the other it can cause unnecessary delays and frustrate everyone involved. As with all things on this list, it takes practice and continual adjustments to find the right balance.

Helping others. One of the things I love most about open source is how I can help other people out. tldr-pages has 39.1K stars as of the time of typing, and is a hugely useful resource that lots of people use daily. I've had comments from people thanking me for my work on Pepperminty Wiki and other projects that I've created and open-sourced. Knowing that I'm helping others out is very motivating for me to continue contributing :-)

### Conclusion

All these things are reasons why I'm proud to say that I'm a part of the open source community as a maintainer of multiple open source projects (both personal and tldr-pages). I'm especially grateful to everyone at tldr-pages (especially @waldyrious) for everything they taught me, and the chance I've been given to help out with the project.

While it hasn't been easy at times (helping to maintain a popular project like tldr-pages takes time and can be tedious in places), it's certainly something I'll be continuing to do and can thoroughly recommend to anyone who has the time to do so.

## Centralising logs with rsyslog

I manage quite a number of servers at this point, and something that's been on my mind for a while now is centralising all the log files generated by them. By this, specifically I mean that I want to automatically gather all logs generated by all the systems I manage into a single place in real time.

While there are enterprise-grade log management setups such as the ELK stack (elasticsearch, logstash, and kibana), as far as I'm aware they are all quite heavy and given my infrastructure is Raspberry Pi based (seriously, they use hardly any electricity at all compared to a regular desktop PC), with such a setup I would likely need multiple Pis to run it.

With this in mind, I'm opting for a different kind of log management system, which I'm basing on rsyslog (which is installed by default in most Linux distros) and lnav (which I've blogged about before: lnav basics tutorial), which runs much lighter, requiring only a fraction of a Raspberry Pi to operate, which is good since the Raspberry Pi I've dedicated to monitoring the rest of the infrastructure currently also handles:

1. Continuous Integration: Laminar (this will eventually be a Docker container on my Hashicorp Nomad cluster)
2. Collectd (Collectd is really easy to setup and runs so light, I love it)

I'm sure you might be asking yourself what the purpose of this is. My reasoning is fourfold:

1. Having all the logs in one place makes them easier to analyse all at once, without having to SSH into many different servers
2. If a box goes down, then I can read the logs from it before start attempting to fix it, giving me a heads up as to what the problem is (this, in conjunction with my collectd monitoring system)
3. On the Raspberry Pis I manage, this prolongs the life of the microSD cards by reducing the number of writes thereto
4. I gain a little bit of security, in that if a box is compromised, then unless the attacker also gains access to my logging server, then they can't erase their tracks as easily as might otherwise have done

With all this in mind, I thought that it's about time I actually did something about this. I've found that while the solution is actually really quite simple, it's not particularly easy to find, so I thought I'd post about it here.

In my setup, I'm going to be using a Raspberry Pi 4 4GB RAM I've dubbed eldarion, which is the successor to an earlier Raspberry Pi 3B+ that died some years prior I called elessar as the server upon which I centralise my logs. It has a 120GB SATA SSD attached in a case that used to house a WD PiDrive (they don't sell those anymore :-/) that I had lying around, which I've formatted with Btrfs.

Before we begin, let's outline the setup we're aiming for with a diagram to avoid confusion:

eldarion will host the rsyslog server (which is essentially just a reconfiguration of the existing rsyslog server it is most likely already running), while other servers connect using the syslog protocol via a TCP connection, which is encrypted with TLS, using the GnuTLS engine (the default built into rsyslog). TLS here is important, since logs are naturally rather sensitive as I'm sure you can imagine.

To follow along here, you will need a valid Let's Encrypt certificate. It just so happens that I have a web server hosting my collectd graph panel interface, so I'm using that.

Of course, rsyslog can be configured in arbitrarily complex ways (such as having clients send logs to servers that they themselves forward to yet other servers), but at least for now I'm keeping it (relatively) simple.

### Preparing the server

To start this process, we want to ensure the logs for the local system are stored in the right place. In my case, I have my SSD mounted to /mnt/eldarion-data2, so I want to put my logs in /mnt/eldarion-data2/syslog/localhost. There are 2 ways of accomplishing this:

1. Reconfigure rsyslog to save logs elsewhere
2. Be lazy, and bind mount the target location to /var/log

Since I'm feeling lazy today, I'm going to go with option 2 here. It's also a good idea if a program is badly written and decides it's a brilliant idea to write logs directly to /var/log itself instead of going through syslog.

If you're using DietPi, before you continue, do sudo dietpi-software and remove the existing logging system.

A bind mount is like a hard link of a directory, in that it makes a directory appear in multiple places at once. It acts as a separate "filesystem" though I assume to allow for avoiding infinite loops. They are also the tech behind volumes in Docker's backend containerd.

Open /etc/fstab for editing, and something like this on a new line:

/mnt/eldarion-data2/syslog/localhost    /var/log    none    auto,defaults,bind  0   0

..where /mnt/eldarion-data2/syslog/localhost is the location we want the data to be stored, and /var/log is the location we want to bind mount it to. Save and close /etc/fstab, and then mount the bind mount like so. Make sure /var/log is empty before mounting!

sudo mount /var/log

Next, we need to install some dependencies:

sudo apt install rsyslog rsyslog-gnutls

For some strange reason, TLS support is in a separate package on Debian-based systems. You'll need to investigate package names and translate this command for your distribution, of course.

### Configuring the server

Now we have that taken care of, we can actually configure our server. Open /etc/rsyslog.conf for editing, and at the top put this:

# The $Thing syntax is apparently 'legacy', but I can't find how else we're supposed to do this$DefaultNetstreamDriver gtls
$DefaultNetstreamDriverCAFile /etc/letsencrypt/live/mooncarrot.space/chain.pem$DefaultNetstreamDriverCertFile /etc/letsencrypt/live/mooncarrot.space/cert.pem
$DefaultNetstreamDriverKeyFile /etc/letsencrypt/live/mooncarrot.space/privkey.pem # StreamDriver.Mode=1 means TLS-only mode module(load="imtcp" MaxSessions="500" StreamDriver.Mode="1" StreamDriver.AuthMode="anon") input(type="imtcp" port="514")$template remote-incoming-logs,"/mnt/eldarion-data2/syslog/hosts/%HOSTNAME%/%PROGRAMNAME%.log"
*.* ?remote-incoming-logs

You'll need to edit these bits to match your own setup:

• /etc/letsencrypt/live/mooncarrot.space/: Path to the live directory there that contains the symlinks to the certs your Let's Encrypt client obtained for you
• /mnt/eldarion-data2/syslog/hosts: The path to the directory we want to store the logs in

Save and close this, and then restart your server like so:

sudo systemctl restart rsyslog.service

Then, check to see if there were any errors:

sudo systemctl status rsyslog.service

Lastly, I recommend assigning a DNS subdomain to the server hosting the logs, such as logs.mooncarrot.space in my case. A single server can have multiple domain names of course, and this just makes it convenient if we every move the rsyslog server elsewhere - as we won't have to go around and edit like a dozen config files (which would be very annoying and tedious).

### Configuring a client

Now that we have our rsyslog server setup, it should be relatively straightforward to configure a client box to send logs there. This is a 3 step process:

1. Configure the existing /var/log to be an in-memory tmpfs to avoid any potential writes to disk
2. Add a cron script to wipe /var/log every hour to avoid it getting full by accident
3. Reconfigure (and install, if necessary) rsyslog to send logs to our shiny new server rather than save them to disk

If you haven't already confgiured /var/log to be an in-memory tmpfs, it is relatively simple. If you're unsure whether it is or not, do df -h.

First, open /etc/fstab for editing, and add the following line somewhere:

tmpfs /var/log tmpfs size=50M,noatime,lazytime,nodev,nosuid,noexec,mode=1777

Then, save + close it, and mount /var/log. Again, make sure /var/log is empty before mounting! Weird things happen if you don't.

sudo mount /var/log

Secondly, save the following to /etc/cron.hourly/clear-logs:

#!/usr/bin/env bash
rm -rf /var/log/*

Then, mark it executable:

sudo chmod +x /etc/cron.hourly/clear-logs

Lastly, we can reconfigure rsyslog. The specifics of how you do this varies depending on what you want to achieve, but for a host where I want to send all the logs to the rsyslog server and avoid saving them to the local in-memory tmpfs at all, I have a config file like this:

#################
#### MODULES ####
#################

module(load="imuxsock") # provides support for local system logging
module(load="imklog")   # provides kernel logging support
#module(load="immark")  # provides --MARK-- message capability

###########################
#### GLOBAL DIRECTIVES ####
###########################

$IncludeConfig /etc/rsyslog.d/*.conf # Where to place spool and state files$WorkDirectory /var/spool/rsyslog

###############
#### RULES ####
###############
$DefaultNetstreamDriverCAFile /etc/ssl/isrg-root-x1-cross-signed.pem$DefaultNetstreamDriver         gtls
$ActionSendStreamDriverMode 1 # Require TLS$ActionSendStreamDriverAuthMode anon
*.* @@(o)logs.mooncarrot.space:514  # Forward everything to our rsyslog server

#
# Emergencies are sent to everybody logged in.
#
*.emerg             :omusrmsg:*

The rsyslog config file in question this needs to be saved to is located at /etc/rsyslog.conf. In this case, I replace the entire config file with the above, but you can pick and choose (e.g. on some hosts I want to save to the local disk and and to the rsyslog server).

Un the above you'll need to change the logs.mooncarrot.space bit - this should be the (sub)domain that you pointed at your rsyslog server earlier. The number after the colon (514) is the port number. The *.* tells it to send everything to the remote rsyslog server.

Before we're done here, we need to provide the rsyslog client with the CA certificate of the server (because, apparently, it isn't capable of ferreting around in /etc/ssl/certs like everyone else is). Since I'm using Let's Encrypt here, I downloaded their root certificate like this and it seemed to do the job:

sudo curl -sSL https://letsencrypt.org/certs/isrg-root-x1-cross-signed.pem -o /etc/ssl/isrg-root-x1-cross-signed.pem

Of course, one could generate their own CA and do mutual authentication for added security, but that's complicated, lots of effort, and probably unnecessary for my purposes as far as I can tell. I'll leave a link in the sources and further reading on how to do this if you're interested.

If you have a different setup, it's the \$DefaultNetstreamDriverCAFile in the above you need to change to point at your actual CA certificate.

With that all configured, we can now restart the rsyslog client:

sudo systemctl restart rsyslog.service

...and, of course, check to see if there were any errors:

sudo systemctl status rsyslog.service

Finally, we also need to configure logrotate to rotate all these new log files. First, install logrotate if the logrotate command doesn't exist:

sudo apt install logrotate

Then, place the following in the file /etc/logrotate.d/centralisedlogging:

/mnt/eldarion-data2/syslog/hosts/*/*.log {
rotate 12
weekly
missingok
notifempty
compress
delaycompress
}

Of course, you'll want to replace /mnt/eldarion-data2/syslog/hosts/ with the directory you're storing the logs from the remote server in, and also customise the log rotation. For example, the 12 there is the number of old log files to keep, and weekly can be swapped for daily or even monthly if you like.

### Conclusion

This has been a very quick whistle-stop tour of setting up an rsyslog server to centralise your logs. We've setup our rsyslog server to use a TLS encrypted connection to receive logs, which 1 or more clients can send logs to. We've also configured /var/log on both the server and the client to avoid awkward issues.

Moving forwards, I recommend reading my lnav basics tutorial blog post, which should be rather helpful in analysing the resulting log files.

lnav was not helpful however when I asked it to look at all the log files separately with sudo lnav */*.log, deciding to treat them as "generic logs" rather than "syslog logs", meaning that it didn't colour them properly, and also didn't allow for proper filter. To this end, it may be benefical to store all the logs in 1 file rather than in separate files. I'll keep an eye on this, and update this post if figure out how to convince lnav to treat them properly.

Another slightly snag with my approach here is that for some reason all the logs from elsewhere also end up in the generic /var/log/syslog file (hence how I found a 'workaround' the above issue), resulting in duplicated logs. I have yet to find a solution to this issue, but I'm also not sure whether I want to keep the logs in 1 big file or in many smaller files yet.

These issues aside, I'm pretty satisfied with the results. Together with my existing collectd-based monitoring system (which I'll blog about how I've set that up if there's any interest - collectd is really easy to use), this is another step towards greater transparency into the infrastructure I manage.

In the future, I want to investigate generating notifications alerts for issues in my infrastructure. These could come either from collectd, or from rsyslog, and I envision them going to a variety of places:

1. Email (a daily digest perhaps?)
2. XMPP (I've bridged to it from shell scripts before)

Given that my infrastructure is just something I run at home and I don't mind so much if it's down for a few hours, my focus here is not on notifying my as soon as possible, but notifying myself in a way that doesn't disturb me so I can check into it in my own time.

If you found this tutorial / guide useful, please do comment below! It's really cool and motivating to see that the stuff I post on here helps others out.

## How to pin an apt repository for preferential package installation

As described in my last post, pinning apt repositories is now necessary if you want to install Firefox from an apt repository (e.g. if you want to install Firefox Beta). This is not an especially difficult process, but it is significantly confusing, so I thought I'd write a post about it.

Pinning an apt repository means that even if there's a newer version of a package elsewhere, the 'older' version will still be installed from the apt repository you pin.

Be very careful with this technique. You can easily cause major issues with your system if you pin the wrong repository!

Firstly, you want to head to /etc/apt/sources.list.d/ and find the .list file for the repository you want to pin. Take note of the URL inside that file, and then run this command:

apt-cache policy

No root is necessary here, as it's still a read-only command. Depending on how many apt repositories you have installed in your system, there may be a significant amount of output. Find the lines that correspond to the apt repository you want to preferentially install from in this output. For this example, I'm going to pin the excellent nautilus-typeahead apt repository, so the bit I'm looking for looks like this:

999 http://ppa.launchpad.net/lubomir-brindza/nautilus-typeahead/ubuntu jammy/main amd64 Packages
origin ppa.launchpad.net

From here, take a note of the o= bit. In my case, it's o=LP-PPA-lubomir-brindza-nautilus-typeahead. Then, create a new file in /etc/apt/preferences.d with the following content:

Package: *
Pin-Priority: 1001

See that o=.... bit there? Replace it with the one for the repository you want to pin. The number there is the new priority of the repository. The numbers at the beginning of each line in the output of the apt-cache policy command are the priorities of your existing apt repositories, so this should give you an idea as to what number you need to use here - a higher number means a higher priority regardless of the version number of the packages contained therein.

Then, simply sudo apt update and sudo apt dist-upgrade, and apt should pick up the "upgrades" from your newly pinned repository! In some situations you may need to remove and reinstall the offending package if you encounter issues.

(Above: A slice of the official Ubuntu 22.04 Jammy Jellyfish wallpaper)

Hey there! Since Ubuntu 22.04 Jammy Jellyfish has recently been released, I've upgraded multiple machines to it, and I have enough to talk about that I thought it would be a good idea to write them up into a proper blog post for the benefit of others.

For reference, I've upgraded my main laptop on 20th May 2022 (10 days ago as of writing this psot), and I've also upgraded one of my desktops I use at University. I have yet to upgrade starbeamrainbowlabs.com - the server this blog post is hosted on - as I'm waiting for 22.04.1 for that (it would be very awkward indeed if the upgrade failed or there was some other issue I'm not yet aware of).

The official release notes for the Ubuntu 22.04 can be found here: https://discourse.ubuntu.com/t/jammy-jellyfish-release-notes/24668

There's also an official blog post that's ranked much higher in search engines, but it's not really very informative for me as I don't use the GNOME desktop - you're better off reading the real release notes above.

Thankfully, I have not encountered as many issues (so far!) with this update as I have with previous updates. While this update doesn't seem to change all that much aside from a few upgrades here and there, by far the biggest annoyance is shipping Firefox as a snapd by default.

Not only are they shipping it as a snap package, but they have bumped the epoch number, which means that the packages in the official firefox apt repository (beta users like me, use this one instead) are ignored in favour of the new snap package! I mean I get that shipping packages simplifies build systems for large projects like Firefox, but I have a number of issues with snapd:

• Extra disk space usage: every snap package has it's own version of it's dependencies
• Permissions: as far as I'm aware (please comment below if this is now fixed), there are permissions issues if you try to load a file from some places on disk when you're running an app installed via snapd, as it runs in a sandbox (this is also true of apps installed with flatpak). This makes using most applications completely impractical
• Ease of updates: A minor annoyance, but with apps installed via snap I have 2 different package managers to worry about
• Observability: Another minor concern, but with every package having it's own local dependencies, I'm makes it more difficult to observe and understand what's going on, and fix any potential issues

This aside, apt does allow for pinning apt repositories to work around this issue. I'll be posting a blog post on how this works more generally hopefully soon, but for now, you want to put this in a file at /etc/apt/preferences.d/firefox (after installing one of the above 2 apt repositories if you haven't done so already):

Package: *
Pin: release o=LP-PPA-mozillateam
Pin-Priority: 1001

...then run this sequence of commands:

sudo apt update
sudo apt purge firefox # This will *not* delete your user data - that's stored in your local user profile
sudo apt install firefox

The above works for both the stable and beta versions. Optionally: sudo apt purge snapd.

I also found this necessary for the wonderful nautilus-typeahead apt repository.

This was the most major issue I encountered. Other than this, I ran into a number of little things that are worth noting before you decide to upgrade. Firstly, for those who dual (or triple or even more!) boot, the version of the grub bootloader shipped with Ubuntu does not detect other bootable partitions!

Warning: os-prober will not be executed to detect other bootable partitions.
Systems on them will not be added to the GRUB boot configuration.
Check GRUB_DISABLE_OS_PROBER documentation entry.

....so if you do run more than a single OS on your system, make sure you correct this after upgrading.

Another thing is that, as usual, Ubuntu disables all third party apt repositories on upgrade. I strongly recommend paying very close attention to the list of packages that do-release-upgrade decides it needs to remove, as if you install e.g. Inkscape or Krita via an apt repository to get the latest versions thereof, you'll need to reinstall them after re-enabling your apt repositories. Personally, I say "no" to the reboot at the end of the upgrade process and fix my apt repositories before then running:

sudo apt update
sudo apt autoremove
sudo apt autoclean
# See also https://gitlab.com/sbrl/bin/-/blob/master/update-system

...and only then rebooting.

While GitHub's Atom seems to be more and more inactive these days as people move over to Visual Studio Code, I still find myself using it regularly as my primary code editor. Unfortunately, I encountered this bug, so I needed to edit /usr/share/applications/atom.desktop to add --no-sandbox to the execution line when starting Atom. The Exec= line in that file now reads:

Exec=env ATOM_DISABLE_SHELLING_OUT_FOR_ENVIRONMENT=false /usr/bin/atom --no-sandbox %F

This issue only occurred on 1 of the 2 systems I've upgraded though, so I'm not sure of the root cause. Other random issues I encountered:

• GDM has a truly awful shade of grey in the background now. This repository gives a way to fix this problem. Try to avoid an image that's too light in colour, as the white text of the lock screen becomes rather difficult to see.
• Speaking of backgrounds, the upgrade reset my desktop background on both machines I upgraded. Make sure you have a copy of it stored away somewhere, as you'll need it. lightdm (the login screen I use on my main laptop in place of gdm) seems to be fine though.
• tumbler - a d-bus thumbnailing service - was also automatically removed. This does not appear to have caused me any problems so far (though image previews now make transparent pixels appear white, which is really annoying and I haven't yet looked into a fix on that one), so I need to look into this one further.
• If you're a regular user of Memtest86+, it may disappear from your grub bootloader menu if you use EFI boot now for some strange reason.
• The colour scheme in the address bar of Nautilus (the file manager) seems a bit messed up for me, but this may have more to do with the desktop theme I'm using.

If I encounter any other issues while upgrading my servers in the future, I'll make another post here about it if it's a significant issue, or comment on/edit this post if it's a minor thing.

If you encounter any other issues upgrading that aren't mentioned here, please do leave a comment below with the issue you encountered and the solution / workaround you implemented to fix it.

## PhD Update 13: A half complete

*...almost! In the last post, I talked about the AAAI-22 doctoral consortium, the sentiment analysis models I've implemented, and finally LDA topic analysis. Before we continue to what I've been doing since then, here's a list of all the posts in this series so far:

As always, you can follow all my PhD-related blog posts in the PhD tag on my blog.

Since the last post, I've participated in both the AAAI-22 Doctoral Consortium and a Hackathon in AI for Sustainability! I've written separate posts about these topics to avoid cluttering this post, so if you're interested I can recommending checking those posts out:

### CLIP works.... kinda

In the last post, I mentioned I was implementing a sentiment analysis model based on CLIP. I've been doing this in PyTorch as the pretrained CLIP model is also implemented in PyTorch. This has caused a number of issues, since it requires a GPU with a CUDA compute capability index of 3.7+, which excludes a number of the GPUs I currently have access to, making things rather awkward. Thankfully, a few months ago I built a GPU server which have somehow forgotten to blog about, so I have been able to use this for the majority of the CLIP experiments I've been running.

Anyway, this process is now complete, so I can share a graph or two on the training progress:

These graphs are as always provisional and not final results, so please don't take them such. The graph on the left is the training accuracy, and the graph on the right is the validation accuracy. I used the ViT-B/32 variant of CLIP, with 512 units for 2 x dense layers after it before the final softmax dense layer that made the prediction (full model summary available upon request - please ensure you send requests by email from an official email account I can verify). What's astonishing here is CLIP's ability to 'zero-shot' - the ability to make a prediction in a target domain it hasn't seen yet with no additional training or fine tuning. It's one thing seeing it in a blog post, but quite another seeing it in person on your own dataset.

The reasoning for multiple lines here on each graph takes some explanation. Because the CLIP model is trained on tweets both with an image and an emoji, the number of tweets in my ~700K+ dataset of tweets that satisfy both of these requirements is only ~14K. With this in mind, I implemented a system to augment tweets that had a supported emoji but didn't have an image the image that CLIP thought best matched it. It was done with the following algorithm:

1. Rank each image against the tweet in question
2. Pick a random image from those CLIP has at least 75% confidence in
3. If it doesn't have at least 75% confidence in any image, pick the next best image

The reason for this somewhat convoluted algorithm is to avoid a situation where CLIP picks the same image for every tweet. With this in place I increased the size of the dataset up to a peak of ~55K (it should be higher still, but I have yet to find the bug even after combing through all related code multiple times), I could then train multiple CLIP models each with a different threshold as to how confident CLIP had to be in the augmented dataset - this is what's shown on in the above graphs.

From the graphs above, I can tell that interestingly any image is better than none at all - at least in terms of training accuracy. With a peak validation accuracy of 86.48% (vs 84.61% without dataset augmentation), this outstrips the transformer encoder I trained earlier by a fair margin.

It's cool to compare the validation accuracy, but what would be really fascinating (and also more objective) would be to compare this to human-labelled tweets as a ground truth. While I'm unsure if I can publish the exact results and details of this experiment at this time, I can say that the results were very surprising: the transformer encoder narrowly beat CLIP in accuracy when comparing them against the ~2K human-labelled tweets!

The effect of this is that the images may not contain much information that's useful when predicting the positive/negative sentiment, so attempts to extract information from the images likely need to use a different strategy. I speculate here that the reason it appeared to boost the validation accuracy of CLIP is that it assisted CLIP in figuring out what actually being asked of it - similar to the "prompt engineering" the authors of CLIP mention in their section on CLIP's limitations.

### Wrapping this half up

To wrap the social media half of my project up (for now at least), I'm writing a journal article to summarise the (sub)project. This will also include data and experiments from some of the students who participated in the Hackathon in AI for Sustainability 2022. I doubt that the journal I ultimately end up submitting to would like it very much if I release too many more details about this at this time, so a deeper discussion on the results, the journal I've chosen with my PhD supervisor's help to submit to, and the paper will have to wait until I finish it and it (hopefully!) gets accepted and published.

It's been slow-going on writing this journal article - both because it's my first one and because I'm drawing content together from many different sources, but I think I'm getting there.

Once I've finished writing this journal article, I believe I'll be turning my attention to the rainfall radar half of my project while I wait for a decision on whether it'll be published or not - so you can expect more on this in the next post in this series.

### The plan

Going on a bit of a tangent, the CLIP portion of the project has been very helpful in introducing me to how important optimising the data preprocessing pipeline is - especially the data augmentation part. By preprocessing in parallel and reshuffling some things, I was able to bump the average usage of my Nvidia GeForce 3060 GPU from around 10% to well over 80%, speeding up the process of augmenting the data from ~10 minutes per tweet to just 1.5 seconds per tweet! It's well worth spending a few hours on your data processing pipeline if you know you'll be training and retraining your model a bunch of times as you tweak it, as you could save yourself many hours of training time.

A number of key things to watch out for that I've found so far, in no particular order:

• Preprocessing data in parallel is very important. You can usually boost performance by as many times as you have CPU cores!
• Reading data from a stream makes it awkward to parallelise. It's much easier and simpler to handle e.g. 1 image per file than a stream of images in a single file.
• Image decoding is expensive, meaning that you'll most likely hit a CPU bottleneck if your model handles images. Ensuring images are JPEG can help, as PNGs are more expensive to decode.
• Similarly, the image decoder you use can significantly affect performance. I used simplejpeg, but I've heard that if you wrap Tensorflow's native image decoding in an input pipeline that can also be good as it can compile it into something more efficient. Test different methods with your own dataset to see which is best.
• Given that your preprocessing pipeline will run for every epoch, investigate if you can do any expensive steps just once before training begins.

In the future I'd like to write a blog post that more thoroughly compares PyTorch and Tensorflow now that I have more experience with both of them. They have different strengths and weaknesses which make them both good fits for different types of models and projects.

All this experience will be very useful indeed when I turn my attention back to the rainfall radar portion of my project. My current plan is to investigate training a CLIP model to comparatively train the rainfall radar + heightmap and the water depth data against one another. As of now I haven't looked into the specifics and details of how CLIP's training process actually works, but I'm hoping it's not too complicated to either re-use their code or implement my own.

In training such a CLIP model, it should in theory tell me whether there's any relationship between the two at all that a model can learn. If there is, then I can then move on to the next step and connect a decoder of some description to the model that will produce an image as an output. If anyone has any good resources on this, please do comment below as I'm rather unsure as to where to begin (I've tried an autoencoder design in the past for this model - albeit without CLIP - and it didn't go very well).

### Conclusion

Since last time, I've trained a bunch of CLIP models, and compared these (in more ways than one) to the transformer encoder I trained earlier. To extract useful information from images, a different strategy is likely needed as it doesn't appear that they contain much useful information about sentiment in the context of a flooding situation.

In training the CLIP models however, I've gained a lot of very valuable experience that will greatly help me in implementing an efficient model and pipeline for the rainfall radar half of my project. If I could go back and do this all again, I would have started the social media half of my project first, as it's taught me a whole bunch of very useful things that would have saved me a lot of time on my rainfall radar project....

If you've found this interesting, are confused about anything here, or have any suggestions, please do comment below! I'd love to hear from you.

## 500 posts - thank you!

500 posts is a lot. When I started writing back in 2014, I never imagined that I was make it to this milestone. I've thought for a while about what I wanted to do to celebrate, but couldn't think of anything specific - so I wanted to thank everyone who has supported me so far in my journey through University - first in my undergraduate course, then in my MSc course, and now in my PhD.

It was Rob Miles that first encouraged me to start a blog in the first year of my undergraduate course. A few weeks later, and I had gone from a coming soon page to building starbeamrainbowlabs.com, followed closely by this blog which I put together piece by piece.

The backend is actually written in PHP - though it is on my (seemingly endless :P) todo list to rewrite it as it's not particularly well written. I've made a start on this already by refactoring the commenting system (and adding more statistics), but I haven't touched the blog itself and the main website (particularly the CSS) much yet.

In total, over the last 499 posts (I'm still writing this post as of the time of typing) I've written 347,256 words in total, counted by doing cat *.md | tr -d -- '-{}();=><' | wc -w on all the markdown sources of the posts I've written. This is a mind boggling number! I suspect it's somewhat inflated by the code I include in my blog posts though.

On these, I've received 192 (probably) genuine top-level comments that aren't spam (not counting replies, which are difficult to count with jq, as the replies parameter isn't always present in my backend JSON files I store comments in). Each and every one of these has been helpful, and given me motivation to continue writing here - especially more recently on my PhD Update series.

I might have missed some spam comments, so do get in touch if you spot one.

From my first post way back on 29th June 2014 to this post in the present spans exactly 7 years, 10 months, 13 days, and 8 hours (or 2874 days and 8 hours), averaging 5 days 17 hours between each post overall.

I would like to thank everyone who has supported me on this incredible journey - especially my personal supervisor and also my PhD supervisor - both of whom have continuously assisted me with issues both large and small at all times of the day and year. The entire Department of Computer Science at the University of Hull - members both past and present - have all been very kind and helpful, and I'm deeply grateful to have had such a welcoming place to be.

Finally, thank you for reading. While I don't write posts on my blog here expecting that anyone will read them, it's amazing to see and hear about people finding them helpful :D

I can't say where I'm headed next after my PhD (the end of which is still some time away), but I can say that I'm committed to posting on this blog - so it won't be going anywhere any time soon :P