The NSD Authoritative DNS Server: What, why, and how

In a previous blog post, I explained how to set up Unbound, a recursive resolving DNS server. I demonstrated how to build a simple split-horizon DNS setup, and how to forward DNS requests to an upstream DNS server - potentially over DNS-over-TLS.

Recently, for reasons that are rather complicated, I found myself in an awkward situation which required an authoritative DNS server - and given my love of explaining complicated and rather niche concepts here on my blog, I thought this would be a fabulous opportunity to write a 2-part series :P

In this post, I'm going to outline the difference between a recursive resolver and an authoritative DNS server, and explain why you'd want one and how to set one up. I'll explain how it fits as a part of a wider system.

Go grab your snacks - you'll be learning more about DNS than you ever wanted to know....

DNS in a (small) nutshell

As I'm sure you know if you're reading this, DNS stands for the Domain Name System. It translates domain names (e.g. starbeamrainbowlabs.com.) into IP addresses (e.g. 5.196.73.75, or 2001:41d0:e:74b::1). Every network-connected system will make use of a DNS server at one point or another.

DNS functions on records. These define how a given domain name should be resolved to its corresponding IP address (or vice versa, but that's out of scope for this post). While there are many different types of DNS record, here's a quick reference for the most common ones you'll encounter when reading this post (see the example lookups just after this list).

  • A: As simple as it gets. An A record defines the corresponding IPv4 address for a domain name.
  • AAAA: Like an A record, but for IPv6.
  • CNAME: An alias, like a symlink in a filesystem [Linux] or a directory junction [Windows]
  • NS: Specifies the domain name of the authoritative DNS server that holds DNS records for this domain. See more on this below.
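If you want to see these record types in action, dig (part of the dnsutils package on apt-based distributions) can query each of them directly. These are just illustrative lookups reusing the domain names from earlier in this post - your output will of course differ:

dig +short starbeamrainbowlabs.com A
dig +short starbeamrainbowlabs.com AAAA
dig +short www.example.com CNAME
dig +short example.com NS

The +short flag strips out everything except the answer itself, which is handy for quick checks.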

A tale of 2 (DNS) servers

Consider your laptop, desktop, phone, or other device you're reading this on right now. Normally (if you are using DHCP, which is a story for another time), your router (which usually acts as the DHCP server on most home networks) will tell you what DNS server(s) to use.

The servers that your device talks to are what's known as recursive resolving DNS servers. These DNS servers do not have any DNS records themselves: their entire purpose is to ask other DNS servers to resolve queries for them.

At first this seems rather counterintuitive. Why bother when you can have a server that actually hosts the DNS records themselves and just ask that every time instead?

Given the size of the Internet today, this is unfortunately not possible. If we all used the same DNS server that hosted all DNS records, it would be drowned in DNS queries that even the best Internet connection would not be able to handle. It would also be a single point of failure - bringing the entire Internet crashing down every time maintenance was required.

To this end, a more scalable system was developed. By having multiple DNS servers between users and the authoritative DNS servers that actually hold the real DNS records, we can ensure the system scales virtually infinitely.

The next question that probably comes to mind is where the name recursive resolving DNS server comes from. It comes from the way these servers ask other DNS servers for the answer to a query, instead of answering based on records they hold locally (most recursive resolving DNS servers also keep a cache for performance, but that's also a tale for another time).

Some recursive resolving DNS servers - such as the one built into your home router - simply ask 1 or 2 upstream DNS servers, usually either provided by your ISP or manually set by you (I recommend 1.1.1.1/1.0.0.1), but others are truly recursive.

Take peppermint.mooncarrot.space. for example. If we had absolutely no idea where to start resolving this domain, we would first ask a DNS root server for help. Domain names are hierarchical in nature - sub.example.com. is a subdomain of example.com.. Likewise, mooncarrot.space. is a subdomain of space., which is itself a subdomain of ., the DNS root zone. It is no accident that all the domain names in this blog post have a dot at the end of them (try entering starbeamrainbowlabs.com. into your browser, and watch as your browser auto-hides the trailing dot .).

In this way, if we know the IP address of a DNS root server (e.g. 193.0.14.129, or 2001:7fd::1), we can recurse through this hierarchical tree to discover the IP address associated with a domain name we want to resolve.

First, we'd ask a root server to tell us the authoritative DNS server for the space. domain name. We do this by asking it for the NS record for the space. domain.

Once we know the address of the authoritative DNS server for space., we can ask it for the NS record for mooncarrot.space.. We may repeat this process a number of times - I'll omit the specific details of this for brevity (if anyone's interested, I can write a full deep dive post into this, how it works, and how it's kept secure - comment below) - and then we can finally ask the authoritative DNS server we've tracked down to resolve the domain name peppermint.mooncarrot.space. to an IP address for us (e.g. by asking for the associated A or AAAA record).
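If you want to watch this walk happen for yourself, dig can perform the whole recursion from the root servers for you. This is purely illustrative - the exact servers you see in the output will differ:

dig +trace peppermint.mooncarrot.space.

Each block of the output shows the referral handed back at one level of the hierarchy, right down to the final answer.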

Authoritative DNS servers

With this in mind, we can now move on to the main purpose of this post: setting up an authoritative DNS server. As you might have guessed by now, the purpose of an authoritative DNS server is to hold records about 1 or more domain names.

While most of the time the authoritative DNS server for your domain name will be either your registrar or someone like Cloudflare, there are a number of circumstances in which it can be useful to run your own authoritative DNS server(s) and not rely on your registrar:

  • If you need more control over the DNS records served for your domain than your registrar provides
  • Serving complex DNS records for a domain name on an internal network (split-horizon DNS)
  • Setting up your own dynamic DNS system (i.e. where you dynamically update the IP address(es) that a domain name resolves to via an API call)

Other situations certainly exist, but these are the ones that come to mind at the moment (comment below if you have any other uses for authoritative DNS servers).

The specific situation I found myself in was a combination of the latter 2 points here, so that's the context in which I'll be talking.

To set one up, we first need some software to do this. There are a number of DNS servers out there:

  • Bind9 [recursive; authoritative]
  • Unbound [recursive; not really authoritative; my favourite]
  • Dnsmasq [recursive]
  • systemd-resolved [recursive; it always breaks for me so I don't use it]

As mentioned, Unbound is my favourite, so for this post I'll be showing you how to use its equally cool sibling, nsd (Name Server Daemon).

The Name Server Daemon

Now that I've explained what an authoritative DNS server is and why it's important, I'll show you how to install and configure one, and then convince another recursive resolving DNS server that's under your control to ask your new authoritative DNS server instead of its default upstream to resolve DNS queries for a given domain name.

It goes without saying that I'll be using Linux here. If you haven't already, I strongly recommend using Linux for hosting a DNS server (or any other kind of server). You'll have a bad day if you don't.

I will also be assuming that you have a level of familiarity with the Linux terminal. If you don't, learn your terminal and then come back here.

nsd is available in all major distributions of Linux in the default repositories. Adjust as appropriate for your distribution:

sudo apt install nsd

nsd has 2 configuration files that are important. First is /etc/nsd/nsd.conf, which configures the nsd daemon itself. Let's do this one first. If there's an existing config file here, move it aside and then paste in something like this:

server:
    port: 5353

    server-count: 1
    username: nsd

    logfile: "/var/log/nsd.log"
    pidfile: "/run/nsd.pid"

    # The zonefile directive(s) below is prefixed by this path
    zonesdir: /etc/nsd/zones

zone:
    name: example.com
    zonefile: example.com.zone

...replace example.com with the domain name that you want the authoritative DNS server to serve DNS records for. You can also have multiple zone: blocks for different (sub)domains - even if those domain names are subdomains of others.

For example, I could have a zone: block for both example.com and dyn.example.com. This can be useful if you want to run your own dynamic DNS server, which will write out a full DNS zone file (a file that contains DNS records) without regard to any other DNS records that might have been in that DNS zone.
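For reference, the relevant part of /etc/nsd/nsd.conf for that example might look something like this - the zone file names are of course up to you:

zone:
    name: example.com
    zonefile: example.com.zone

zone:
    name: dyn.example.com
    zonefile: dyn.example.com.zone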

Also replace 5353 with the port you want nsd to listen on. In my case I have my authoritative DNS server running on the same box as the regular recursive resolver, so I've had to move the authoritative DNS server to a different port, as dnsmasq (the recursive DNS server I have running on this particular box) has already taken port 53.

Next up, create the directory /etc/nsd/zones, and then open up example.com.zone for editing inside that new directory. In here, we will put the actual DNS records we want nsd to serve.

The format of this file is governed by RFC1035 section 5 and RFC1034 section 3.6.1, but the nsd docs provide a simpler example. See also the wikipedia page on DNS zone files.

Here's an example:

; example.com.
$TTL 300
example.com. IN     SOA    a.root-servers.net. admin.example.com. (
                2022090501  ; Serial
                3H          ; refresh after 3 hours
                1H          ; retry after 1 hour
                1W          ; expire after 1 week
                1D)         ; minimum TTL of 1 day

; Name Server
IN  NS  dns.example.com.

@                   IN A        5.196.73.75
example.com.        IN AAAA     2001:41d0:e:74b::1
www                 IN CNAME    @
ci                  IN CNAME    @

Some notes about the format to help you understand it:

  • Make sure ALL your fully-qualified domain names have the trailing dot at the end otherwise you'll have a bad day.
  • $TTL 300 specifies the default TTL (Time To Live, or the time DNS records can be cached for) in seconds for all subsequent DNS records.
  • Replace example.com. with your domain name.
  • admin.example.com. should be the email address of the person responsible for the DNS zone file, with the @ replaced with a dot instead.
  • dns.example.com. in the NS record must be set to the domain name of the authoritative DNS server serving the zone file.
  • @ IN A 5.196.73.75 is the format for defining an A record (see the introduction to this blog post) for example.com. - @ is automatically replaced with the domain name in question - in this case example.com.
  • When declaring a record, if you don't add the trailing dot then it is assumed you're referring to a subdomain of the domain this DNS zone file is for - e.g. if you put www it assumes you mean www.example.com.

Once you're done, all that's left for configuring nsd is to start it up for the first time (and on boot). Do that like so:

sudo systemctl restart nsd
sudo systemctl enable nsd

Now, you should be able to query it to test it. I like to use dig for this:

dig -p 5353 +short @dns.example.com example.com

...this should return a result based on the DNS zone file you defined above. Replace 5353 with the port number your authoritative DNS server is running on, or omit -p 5353 altogether if it's running on port 53.

Try it out by updating your DNS zone file and reloading nsd: sudo systemctl reload nsd

Congratulations! You now have an authoritative DNS server under your control! This does not mean that it will be queried by any other DNS servers on your network though - read on.....

Integration with the rest of your network

The final part of this post will cover integrating an authoritative DNS server with another DNS server on your network - usually a recursive one. How you do this will vary depending on the target DNS server you want to convince to talk to your authoritative DNS server.

For Unbound:

I've actually covered this in a previous blog post. Simply update /etc/unbound/unbound.conf with a new block like this:

forward-zone:
    name: "example.com."
    forward-addr: 127.0.0.1@5353

...where example.com. is the domain name to forward for (WITH THE TRAILING DOT; and all subdomains thereof), 127.0.0.1 is the IP address of the authoritative DNS server, and 5353 is the port number of the authoritative DNS server.

Then, restart Unbound like so:

sudo systemctl restart unbound

For dnsmasq:

Dnsmasq's main config file is located at /etc/dnsmasq.conf, but there may be other config files located in /etc/dnsmasq.d/ that could interfere. Either way, update dnsmasq's config file with this directive:

server=/example.com./127.0.0.1#5353

...where example.com. is the domain name to forward for (WITH THE TRAILING DOT; and all subdomains thereof), 127.0.0.1 is the IP address of the authoritative DNS server, and 5353 is the port number of the authoritative DNS server.

If there's another server=/example.com./... directive elsewhere in your dnsmasq config, it may override your new definition.

Then, restart dnsmasq like so:

sudo systemctl restart dnsmasq

If there's another DNS server that I haven't included here that you use, please leave a comment on how to reconfigure it to forward a specific domain name to a different DNS server.

Conclusion

In this post, I've talked about the difference between an authoritative DNS server and a recursive resolving DNS server. I've shown why authoritative DNS servers are useful, and alluded to reasons why running your own authoritative DNS server can be beneficial.

In the second post in this 2-part miniseries, I'm going to go into detail on dynamic DNS, why it's useful, and how to set up a dynamic DNS server.

As always, this blog post is a starting point - not an ending point. DNS is a surprisingly deep subject: from DNS root hint files to mDNS (multicast DNS) to the various different DNS record types, there are many interesting and useful things to learn about it.

After all, it's always DNS..... especially when you don't think it is.

Sources and further reading

Mounting LVM partitions from the terminal on Linux

Hello there! Recently I found myself with the interesting task of mounting an LVM partition by hand. It wasn't completely straightforward and there was a bunch of guesswork involved, so I thought I'd document the process here.

For those who aren't aware, LVM stands for the Logical Volume Manager, and it's present on Linux systems to make managing partitions easier. It can:

  • Move and resize partitions while they are still mounted
  • Span multiple disks

....but to my knowledge it doesn't have any redundancy (use Btrfs) or encryption (use LUKS) built in. It is commonly used to manage the partitions on your Linux desktop, as then you don't need to reboot it into a live Linux environment to fiddle with your partitions as much.

LVM works on a layered system. There are 3 layers to it:

  1. Physical Volumes: Normal physical partitions (or whole disks) registered with LVM.
  2. Volume Groups: Pools made up of 1 or more physical volumes, from which logical volumes are allocated.
  3. Logical Volumes: The LVM-managed partitions themselves, which you format and mount.

In summary, logical volumes are part of a volume group, which spans 1 or more physical disks.

With this in mind, first list the available physical volumes and their associated volume groups, and identify which is the one you want to mount:

sudo vgdisplay

Notice the VG Size in the output. Comparing it with the output of lsblk -o NAME,RO,SIZE,RM,TYPE,MOUNTPOINT,LABEL,VENDOR,MODEL can be helpful to identify which one is which.
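As an aside, if you prefer a more compact view than vgdisplay, LVM also ships the pvs, vgs, and lvs commands, which print a one-line-per-item summary of the physical volumes, volume groups, and logical volumes respectively:

sudo pvs
sudo vgs
sudo lvs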

I encountered a situation where I had 2 with the same name - one from the host system I was working on, and another from the target disk I was trying to mount. In my situation each disk had its own volume group assigned to it, so I needed to rename one of the volume groups.

To do this, take the value of the VG UUID field of the volume group you want to rename from the output of sudo vgdisplay above, and then rename it like this:

sudo vgrename SOME_ID NEW_NAME

...for example, I did this:

sudo vgrename 5o1LoG-jFdv-v1Xm-m0Ca-vYmt-D5Wf-9AAFLm examplename

With that done, we can now locate the logical volume we want to mount. Do this by listing the logical volumes in the volume group you're interested in:

sudo lvdisplay vg_name

Note down the name of the logical volume you want to mount. Now we just need to figure out where it is actually located in /dev so that we can mount it. Despite the LV Path field appearing to show us this, it's not actually correct - at least on my system.

Instead, list the contents of /dev/mapper:

ls /dev/mapper

You should see the name of the logical volume that you want to mount in the form volumegroup-logicalvolumename. Once found, you should be able to mount it like so:

sudo mount /dev/mapper/volumegroup-logicalvolumename path/to/directory

...replacing path/to/directory with the path to the (empty) directory you want to mount it to.

If you can't find it, then it is probably because you plugged the drive in question in after you booted up. In this case, it's probable that the volume group is not active. You can check whether this is the case like so:

sudo lvscan

If it isn't active, then you can activate it like this:

sudo lvchange -a y vg_name

...replacing vg_name with the name of the volume group you want to activate. Once done, you can then mount the logical volume as I mentioned above.

Once you are done, unmounting it is a case of reversing these steps. First, unmount the partition:

sudo umount path/to/mount_point

Then, disable the volume group again:

sudo lvchange -a n vg_name

Finally, flush any cached writes to disk, just in case:

sync

Now, you can unplug the device from your machine.

That wraps up this quick tutorial. If you spot any mistakes in this, please do leave a comment below and I'll correct it.

PhD Update 14: An old enemy

Hello again! This post is rather late due to one thing and another, but I've finally gotten around to writing it. In the last post, I talked about the CLIP model I trained to predict sentiment using paired tweets and their associated images, and the augmentation system I devised to increase the size of the dataset. I also talked about the plan for a next-generation rainfall radar model, and a journal article I'm writing.

Before we begin though, let's start with the customary list of previous posts:

Since that last post, I've pretty much finished my initial draft of the journal article - though it is rather over the length limit - and I've also made a significant start on the rainfall radar model, which is what I will be focusing on in this blog post, as there isn't all that much to talk about with the journal article at the moment (I'm unsure how much I'm allowed to share). I will make a separate post when I (finally) publish the journal article.

Rainfall radar model, revisited

As you might remember, I have dealt with rainfall radar data before (exhibit A, B, C, D), and it didn't go too well. After the part of my PhD on social media, I have learnt a lot about AI models and how to build them. I have also learnt a lot about data preprocessing. With all this in hand, I am now better equipped to do battle once more with an old enemy: the 1.5M time step rainfall radar dataset.

For those who are somewhat confused, the dataset in question is in 2 dimensions (i.e. like greyscale images). It comprises 3 things:

  • A heightmap
  • Rainfall radar data every 5 minutes
  • Water depth information, calculated by HAIL-CAESAR and binarised to water / no water for each pixel with a simple threshold

Given that the rainfall radar dataset has an extremely restrictive licence, I am unfortunately unable to share sample images from the dataset here.

My first objective was to tame the beast. To do this, I needed to convert the data to .tfrecord.gz files (applying all the preprocessing transformations ahead of time) instead of the split .asc.stream.gz and .jsonl.gz files I was using. At first, I thought I could use a TextLineDataset (it even supports reading from gzipped files!), but the snag here is that Tensorflow does not have a JSON parsing function.

The reason this is a problem is due to the new way I am parsing my dataset. Before, I used tf.data.Dataset.from_generator() and a regular Python function, but I have since discovered that there is a much more efficient way of doing things. The key revelation here was that Tensorflow does not simply execute e.g. your custom layers and call .call() each time. Instead, it calls it once to construct a graph of operations, before compiling this into machine code that the GPU can understand. The implication of this is twofold:

  1. It is significantly more efficient to take advantage of Tensorflow's execution graph functionality where available
  2. Once any part of your dataset becomes a Tensor, it must stay a Tensor

This not only goes for custom layers, loss functions, etc, but it also goes for the dataset pipeline too! I strongly recommend using the .map() function on tf.data.Dataset with a tf.function. Avoid .from_generator() if you can possibly help it!
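To give a concrete flavour of what I mean, here's a minimal sketch of the kind of pipeline this enables. The feature names and the use of tf.io.parse_tensor here are illustrative assumptions, not my actual record layout:

import tensorflow as tf

# Hypothetical feature spec - the real layout depends on what was
# serialised into the .tfrecord.gz files in the first place.
feature_spec = {
    "rainfallradar": tf.io.FixedLenFeature([], tf.string),
    "waterdepth": tf.io.FixedLenFeature([], tf.string),
}

def parse_example(example):
    # This function gets traced into the Tensorflow execution graph by .map()
    parsed = tf.io.parse_single_example(example, feature_spec)
    rainfall = tf.io.parse_tensor(parsed["rainfallradar"], out_type=tf.float32)
    waterdepth = tf.io.parse_tensor(parsed["waterdepth"], out_type=tf.float32)
    return rainfall, waterdepth

filenames = tf.data.Dataset.list_files("path/to/dataset/*.tfrecord.gz")
dataset = (
    filenames
    .interleave(
        lambda filename: tf.data.TFRecordDataset(filename, compression_type="GZIP"),
        num_parallel_calls=tf.data.AUTOTUNE)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)
)

Because .interleave() and .map() both run inside the execution graph, Tensorflow can read and decode multiple files in parallel rather than waiting on a single Python generator.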

To take advantage of this, I needed to convert my dataset to a set of .tfrecord.gz files (to support parallel reading, esp. since Viper has a high read latency). Given my code to parse my dataset is in Javascript/Node.js, I first tried using the tfrecord npm package to write .tfrecord files in Javascript directly. This did not work out though, as it kept crashing. I also tried variant packages like tfrecords and tfrecord-stream and more, but none of them worked. In the end, I settled on a multi-step process:

  1. Convert split data into .jsonl.gz files, 4K records per file. Do all preprocessing / correction steps here.
  2. Make all records unique: hash all records in all files, mark records for deletion, then delete them from files
  3. Recompress .jsonl.gz files to 4K records per file
  4. Convert the .jsonl.gz files to .tfrecord.gz with Python child processes managed by Node.js

Overcomplicated? Perhaps. Do I have a single command I can execute to do all of this? Nope! Does it work? Absolutely :P

With the data converted I turned my attention to the model itself. As I have discussed previously, my current hypothesis is that the previous models failed because the relationship between the rainfall radar and water depth data is non-obvious (and that the model designs were terrible. 5K parameters? hahahaha, 5M parameters is probably the absolute minimum I would need). To this end, I will be first training a contrastive learning model to find relationships between the dataset items. Only then will I train a model to predict water depth, which I'll model as an image segmentation task (I have yet to find a segmentation decoder to implement, so suggestions here are welcome).

The first step here is to implement the contrastive learning algorithm. This is non-trivial however, so I implemented a test model using images from Reddit (r/cats, r/fish, and r/dogs) to test it and the visualisations that I will require to determine the effectiveness of the model. In doing this, I found that the contrastive learning algorithm described in the CLIP paper (Learning Transferable Visual Models From Natural Language Supervision) was wrong and completely different from the one in the accompanying code, and I couldn't find the training loop or core loss function at all - so I had to piece together something from a variety of different sources.

To visualise the model, I needed a new approach. While the loss function value over time plotted on a graph is useful, it's difficult to tell if the resulting embedded representation the model outputs is actually doing what it is supposed to. Reading online, there are 2 ways of visualising embedding representations I've found:

  1. Dimensionality reduction
  2. Parallel coordinates plot

I can even include here a cool plot that demonstrates both of them with the pretrained CLIP model I used in the social media half of my project:

The second one is the easier to explain, so I'll start with that. If you imagine that the output of the model is of shape [ batch_size, embedding_dim ] / [ 64, 200 ], then for every record in the dataset we can plot a line across a set of vertical lines, where each vertical line stands for a successive dimension of the embedding. This is what I have done in the plot on the right there.

The plot on the left uses the UMAP dimensionality reduction algorithm (paper), which to my knowledge is the best dimensionality reduction algorithm out there at the moment. For the uninitiated, a dimensionality reduction algorithm takes a vector with many dimensions - such as one with an embedding dimension of size 200 - and converts it into a lower-dimensional value (most commonly 2 or 3 dimensions) so that it can be plotted and visualised. This is particularly helpful in AI when you want to check if your model is actually doing what you expect.

I took some time to look into this, as there are a number of other algorithms out there and it seems like it's far too easy to pick the wrong one for the task. In short, there are 3 different algorithms you'll see most often:

  • PCA: Stands for Principal Component Analysis, and while popular it only supports linear transformations - which is a problem, since most AI models are non-linear.
  • tSNE: A non-linear alternative (designed for AI applications, in part) that is also rather popular. It does not preserve the global structure of the dataset (i.e. relationships and distances between different values) very well though.
  • UMAP: Stands for Uniform Manifold Approximation and Projection. It is designed as an alternative to tSNE and preserves global structure much better.

Sources for this are at the end of this post. If you're applying PCA or tSNE for dimensionality reduction in an AI context, consider switching it out to UMAP.
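For reference, using UMAP via the umap-learn Python package only takes a few lines. This is a minimal sketch, assuming embeddings and labels already exist as arrays:

import umap
import matplotlib.pyplot as plt

# embeddings: a [ num_items, embedding_dim ] array of model outputs (assumed to exist)
# labels: an integer class label per item, used only to colour the points (assumed to exist)
reducer = umap.UMAP(n_components=2)
embeddings_2d = reducer.fit_transform(embeddings)

plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], c=labels, s=4)
plt.show()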

In the plot above, it is obvious that the pretrained CLIP model can differentiate between the 3 types of pet that I gave it as a test dataset. The next step was to train a model with the contrastive learning and the test dataset.

To do this, I needed an encoder. In the test, I used ResNetV2, which is apparently an improved version of the ResNet architecture (I have yet to read the paper on it). Since implementing it though, I have come across an implementation of the state-of-the-art image encoder ConvNeXt (paper), so I'm using that in the main model. See my recent post on my image captioning project for more details on image encoders, but in short, to the best of my knowledge ConvNeXt is the current state of the art.

Anyway, when I plotted the output of this model, it gave me this plot:

I notice a few issues with this. Firstly and most obviously, the points are all jumbled up! It has not learnt the difference between cats, fish, and dogs. I suspect this is because the input to the test model I trained got 2 variants of the same image altered randomly in different ways (flipping, hue change, etc) rather than an image and a textual label. I'm not too worried though, 'cause the real model will have 2 different items as inputs - I was avoiding doing extra work here.

Secondly, the parallel coordinates plot does not show a whole lot of variance between the different items. This is more worrying, but I'm again hoping that this issue will fix itself when I give the model 'real pairs' of rainfall radar <-> water depth images (with the heightmap thrown in there somewhere probably, I haven't decided yet).

Finally, I plotted a UMAP graph with completely random points to ensure it represented them properly:

As you can see, it plots them in a roughly spherical shape with no clear form or separation between the points. I'm glad I did this, because at first I was passing the labels to the UMAP plotter in the wrong way, and it instead artificially moved the points into groups.

With the test model done, I have moved swiftly on to (pre)training the actual model itself. This is currently underway, so I don't have anything to show just yet (it is still training and I have yet to implement code to plot the output), but I can say that thanks to my realisations about Tensorflow's graph execution and keeping everything as tensors, I'm seeing a GPU utilisation of 95% and above at all times :D

Conclusion

I've got a journal article written, but it's overlength so my job there isn't quite done just yet. When it is published, I will definitely make a dedicated post here!

Now, I have moved from writing to implementing a new model to tackle the rainfall radar part of my project. By using contrastive learning, I hope to enable the model to learn the relationship between the rainfall radar data and the water depth information. Once I've trained a contrastive learning model, I'll attach and train another model for image segmentation to predict the water depth information.

If you know of any state-of-the-art image segmentation decoder AI architectures, please leave a comment below. Bonus points if I can configure it to have >= 5M parameters without running out of memory. I'm currently very unsure what I'm going to choose.

Additionally, if you have any suggestions for additional tests I can do to verify my contrastive learning model is actually learning something, please leave a comment below also. The difficulty is that while the loss value goes down, it's extremely difficult to tell whether what it's learning is actually sensible or not.

The plan to caption and index images

Something that has been on my mind for a while is the collection of photos that I take. At last count on my NAS I have 8564 pictures taken since I first got a phone to take them with, and many more belonging to other family members.

I have blogged before about a script I've written that automatically processes photographs and files them by year and month. It fixes the date taken, sets the thumbnail for rapid preview loading, automatically rotates them to be the right way up, losslessly optimises them, and more.

The one thing it can't do though is to help me locate a specific photo I'm after, so given my work with AI recently I have come up with a plan to do something about this, and I want to blog about it here.

By captioning the images with an AI, I plan to index the captions (and other image metadata) and have a web interface in the form of a search engine. In this blog post, I'm going to outline the AI I intend to use, and the architecture of the image search engine I have already made a start on implementing.

AI for image captioning

The core AI to do image captioning will be somewhat based on work I've done for my PhD. The first order of business was finding a dataset to train on, and I stumbled across Microsoft's Common Objects in Context dataset. The next and more interesting part was to devise a model architecture that translates an image into text.

When translating 1 thing (or state space) into another in AI, it is generally done with an encoder-decoder architecture. In my case here, that's an encoder for the image - to translate it into an embedded feature space - and a decoder to turn that embedded feature space into text.

There are many options for these - especially for encoding images - which I'll look at first. While doing my PhD, I've come across many different encoders for images, which I'd roughly categorise into 2 main categories:

Since the transformer model was invented, they have been widely considered to be the best option. Swin Transformers adapt this groundbreaking design for images - transformers originally handled text - from what I can tell better than the earlier Vision Transformer architecture.

On the other side, a number of encoders were invented before transformers were a thing - the most famous of which was ResNet (I think I have the right paper), which was basically just a bunch of CNN layers stacked on top of one another with a few extra bits like normalisation and skip connections.

Recently though, a new CNN-based architecture has emerged that draws inspiration from the strong points of transformers - and it's called ConvNeXt. Based on the numbers in the paper, it even beats the Swin Transformer model mentioned earlier. Best of all, it's much simpler in design, so it's relatively easy to implement. It is this model architecture I will be using.

For the text, things are straightforward in one respect - the model architecture I'll be using is a transformer (of course - I even implemented it myself from scratch!) - but the trouble is representation. Particularly the representation of the image caption we want the model to predict.

There are many approaches to this problem, but the one I'm going to try first is a word-based solution using one-hot encoding. There are about 27K unique words in the dataset, so I've assigned each one a unique number in a dictionary file. Then, I can turn this:

[ "a", "cat", "sat", "on", "a", "mat" ]

....into this:

[ 0, 1, 2, 3, 0, 4 ]

...then, the model would predict something like this:

[
    [ 1, 0, 0, 0, 0, 0, ... ],
    [ 0, 1, 0, 0, 0, 0, ... ],
    [ 0, 0, 1, 0, 0, 0, ... ],
    [ 0, 0, 0, 1, 0, 0, ... ],
    [ 1, 0, 0, 0, 0, 0, ... ],
    [ 0, 0, 0, 0, 1, 0, ... ]
]

...where each sub-array is a word.
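In code, the conversion might look something like this toy sketch - using a tiny hypothetical dictionary in place of the real 27K-word one:

import numpy as np

# Toy dictionary for illustration - the real one has ~27K entries
word_to_id = { "a": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4 }

caption = [ "a", "cat", "sat", "on", "a", "mat" ]
ids = np.array([ word_to_id[word] for word in caption ])    # [ 0, 1, 2, 3, 0, 4 ]

# One row per word, one column per dictionary entry
one_hot = np.zeros((len(ids), len(word_to_id)), dtype=np.float32)
one_hot[np.arange(len(ids)), ids] = 1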

This will, as you might suspect, use a lot of memory - especially with 27K words in the dictionary. By my calculations, with a batch size of 64 and a maximum caption length of 25, each output prediction tensor (64 × 25 × 27,000 values) will use a whopping 172.8 MB of memory as float32, or 86.4 MB as float16 (more on memory usage later).

I'm considering a variety of techniques to combat this if it becomes an issue. For example, reducing the dictionary size by discarding infrequently used words.

Another option would be to have the model predict GloVe vectors as an output and then compare the output to the GloVe dictionary to tell which one to pick. This would come with its own set of problems however, like lots of calculations to compare each word to every word in the dictionary.

My final thought was that I could maybe predict individual characters instead of full words. There would be more items in the sequence predicted, but each character would only have up to 255 choices (probably more like 36-ish), potentially saving memory.

I have already implemented this AI - I just need to debug and train it now. To summarise, here's a diagram:

The last problem with the AI though is memory usage. I plan on eventually running the AI on a Raspberry Pi, so much tuning will be required to reduce memory usage and latency as much as I can. In particular, I'll be trying out quantising my model and writing the persistent daemon to use Tensorflow Lite to reduce memory usage. Models train using the float32 data type - which uses 32 bits per value - but quantising them after training to use float16 (16 bits / value) or even uint8 (8 bits / value) would significantly reduce memory usage.
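If I do go down the Tensorflow Lite route, the post-training quantisation step is - in outline at least - pretty short. A sketch, assuming an already-trained Keras model in the variable model:

import tensorflow as tf

# model is assumed to be an already-trained Keras model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [ tf.lite.Optimize.DEFAULT ]
# Optionally quantise weights to float16 rather than the default 8-bit weight scheme
converter.target_spec.supported_types = [ tf.float16 ]

tflite_model = converter.convert()
with open("model.tflite", "wb") as handle:
    handle.write(tflite_model)

Whether the accuracy holds up after quantisation is something I'll have to test.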

Search engine and indexing

The second part of this is the search engine. The idea here is to index all my photos ahead of time, and then have a web interface I can use to search and filter them. The architecture I plan on using to achieve this is rather complicated, and best explained with a diagram:

The backend I have chosen for the index is called meilisearch. It's written in Rust, and provides advanced as-you-type search functionality. I chose it for 2 reasons:

  1. While I'd love to implement my own, meilisearch is an open source project where they have put more hours into making it cool than I ever would be able to
  2. Being a separate daemon means I can schedule it on my cluster as a separate task, which potentially might end up on a different machine

With this in mind, the search engine has 2 key parts to it: the crawler / indexer, and the HTTP server that serves the web interface. The web interface will talk to meilisearch to perform searches (not directly; requests will be proxied and transformed).

The crawler will periodically scan the disk for new, updated, and deleted files, and pass them on to the indexer queue. The indexer will do 4 things:

  1. Caption the image, by talking to a persistent Python child process via Inter Process Communication (IPC) - captions will be written as EXIF data to images
  2. Thumbnail images and store them in a cache (perhaps some kinda key-value store, as lots of little files on disk would be a disaster for disk space)
  3. Extract EXIF (and other) metadata
  4. Finally, push the metadata to meilisearch for indexing

Tasks 2 and 3 can be done in parallel, but the others will need to be done serially - though multiple images can of course be processed concurrently. I anticipate much asynchronous code here, which I'm rather looking forward to finishing writing :D

I already have a good start on the foundation of the search engine here. Once I've implemented enough that it's functional, I'll open source everything.

To finish this post, I have a mockup screenshot of what the main search page might look like:

Obviously the images are all placeholders (append ?help to this URL to see the help page) for now and I don't yet have a name for it (suggestions in the comments are most welcome!), but the rough idea is there.

Configuring an endlessh honeypot with rsyslog email notifications

Security is all about defence in depth, so I'm always looking for ways to better secure my home network. For example, I have cluster management traffic running over a Wireguard mesh VPN. Now, I'm turning my attention to the rest of my network.

To this end, while I have a guest network with wireless isolation enabled, I do not currently have a way to detect unauthorised devices connecting to my home WiFi network, or fake WiFi networks with the same name, etc. Detecting this is my next focus. While I've seen nzyme recently and it looks fantastic, it also looks more complicated to setup.

While I look into the documentation for nzyme, inspired by this reddit post I decided to setup a honeypot on my home network.

The goal of a honeypot is to detect threats moving around in a network. In my case, I want to detect if someone has connected to my network who shouldn't have done. Honeypots achieve this by pretending to be a popular service, but in reality they are there to collect information about potential threats.

To set one up, I found endlessh, which pretends to be an SSH server - but instead slowly sends an endless banner to the client, keeping the connection open as long as possible. It can also log connection attempts to syslog, which allows us to detect connections and send an alert.

Implementing this comes in 2 steps. First, we setup endlessh and configure it to log connection attempts. Then, we reconfigure rsyslog to send email alerts.

Setting up endlessh

I'm working on one of the Raspberry Pis running Raspberry Pi OS in my network, but this should work with other machines too.

If you're following along to implement this yourself, make sure you've moved SSH to another port number before you continue. We'll be configuring endlessh to listen on port 22 - the default port for SSH - since this is the port I imagine an automated network scanner would attempt to connect to by default if it were looking for SSH servers to crack.
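If you haven't moved SSH yet, it's normally a case of changing the Port directive in /etc/ssh/sshd_config and restarting the SSH daemon - something like this (pick your own port number, and check your firewall rules before you disconnect!):

sudo nano /etc/ssh/sshd_config     # change 'Port 22' to e.g. 'Port 2222'
sudo systemctl restart ssh         # the service is called 'sshd' on some distributions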

Conveniently, endlessh has a package in the default Debian repositories:

sudo apt install endlessh

...adjust this for your own package manager if you aren't on an apt-based system.

endlessh has a configuration file at /etc/endlessh/config by default. Open it up for editing, and make it look something like this:

# The port on which to listen for new SSH connections.
Port 22

# Set the detail level for the log.
#   0 = Quiet
#   1 = Standard, useful log messages
#   2 = Very noisy debugging information
LogLevel 1

Before we can start the endlessh service, we need to reconfigure it to allow it to listen on port 22, as this is a privileged port number. Doing this requires 2 steps. First, allow the binary to listen on privileged ports:

sudo setcap CAP_NET_BIND_SERVICE=+eip "$(which "endlessh")";

Then, if you are running systemd (most distributions do by default), execute the following command:

sudo systemctl edit endlessh.service

This will allow you to append some additional directives to the service definition for endlessh, without editing the original apt-managed systemd service file. Add the following, and then save and quit:

[Service]
AmbientCapabilities=CAP_NET_BIND_SERVICE
PrivateUsers=false

Finally, we can restart the endlessh service:

sudo systemctl restart endlessh
sudo systemctl enable --now endlessh

That completes the setup of endlessh!

Configuring rsyslog to send email alerts

The second part of this process is to send automatic alerts whenever anyone connects to our endlessh service. Since endlessh forwards logs to syslog by default, reconfiguring rsyslog to send the alerts seems like the logical choice. In my case, I'm going to send email alerts - but other ways of sending alerts do exist - I just haven't looked into them yet.

To do this requires that you have either a working email server (I followed the Ars Technica taking email back series, but whatever you do it's not for the faint of heart! Command line experience is definitely required - if you're looking for a nice first project, try a web server instead), or an email account you can use. Note that I do not recommend using your own personal email account, as you'll have to store the password in plain text!

In my case, I have my own email server, and I have forwarded port 25 down an SSH tunnel so that I can use it to send emails (in the future I want to configure a proper smart host that listens on port 25 and forwards emails by authenticating against my server properly, but that's for another time as I have yet to find a relay-only MTA that also listens on port 25).

In a previous post, I implemented centralised logging - so I'm going to be reconfiguring my main centralised rsyslog instance.

To do this, open up /etc/rsyslog.d/10-endlessh.conf for editing, and paste in something like this:

template (name="mailSubjectEndlessh" type="string" string="[HONEYPOT] endlessh connection on %hostname%")

if ( ($programname == 'endlessh') and (($msg contains "ACCEPT") or ($msg contains "CLOSE")) ) then {
    action(type="ommail" server="localhost" port="20205"
        mailfrom="sender@example.com"
        mailto=["bill@billsboosters.net"]
        subject.template="mailSubjectEndlessh"
        action.execonlyonceeveryinterval="3600"
    )
}

...where:

  • [HONEYPOT] endlessh connection on %hostname% is the subject name, and %hostname% is substituted for the actual hostname the honeypot is running on
  • sender@example.com is the address that you want to send the alert FROM
  • bill@billsboosters.net is the address that you want to send the alert TO
  • 3600 is the minimum interval between emails, in seconds. Log lines are not collected up - only 1 log line is sent at a time, and any others logged in between are ignored (handled as if the above email directive doesn't exist) until the given number of seconds expires - at which point it will email for the next log line that comes through, and the cycle repeats. If anyone knows how to change that, please leave a comment below.

Note that the template line is outside the if statement. This is important - I got a syntax error if I put it inside the if statement.

The if statement specifically looks for log messages with a tag of endlessh that contain either the substring ACCEPT or CLOSE. Only if those conditions are true will it send an email.

I have yet to learn how to configure rsyslog to authenticate while sending emails. I would suspect though that the easiest way of achieving this is to set up a local SMTP relay-only MTA (Mail Transfer Agent) that rsyslog can connect to and send emails through, and then the relay would authenticate against the real server and send the email on rsyslog's behalf. I have yet to find such an MTA other than Postfix however - which, while great, can be hugely complicated to set up. Other alternatives I've tried include:

....but they all implement sendmail and while that's useful they do not listen on port 25 (or any other port for that matter) as far as I can tell.

Anyway, the other file you need to edit is /etc/rsyslog.conf. Open it up for editing, and put this near the top:

module(load="ommail")

...this loads the mail output plugin that sends the emails.

Now that we've reconfigured rsyslog, we need to restart it:

sudo systemctl restart rsyslog

rsyslog is picky about its config file syntax, so make sure to check its status for error messages:

sudo systemctl status rsyslog

You can also use lnav to analyse your logs and find any error messages there too.

Conclusion

We've setup endlessh as a honeypot, and then reconfigured rsyslog to send email alerts. Test the system like so on your local machine:

ssh -vvv -p 22 someuser@yourserver

...and watch your inbox for the email alert that will follow shortly!

While this system isn't particularly useful on its own, it's a small part of a larger strategy for securing my network. It's also been a testing ground for me to configure rsyslog to send email alerts - something I may want my centralised rsyslog logging system to do for other things in the future.

If you've found this post useful or you have some suggestions, please leave a comment below!

Sources and further reading

PhD Aside 2: Jupyter Lab / Notebook First Impressions

Hello there! I'm back with another PhD Aside blog post. In the last one, I devised an extremely complicated and ultimately pointless mechanism by which multiple Node.js processes can read from the same file handle at the same time. This post hopefully won't be quite as useless, as it's a cross with the other reviews / first impressions posts I've made previously.

I've had Jupyter on my radar for ages, but it's only very recently that I've actually given it a try. Despite being almost impossible to spell (though it does appear to be getting easier with time), it's both easy to install and extremely useful when plotting visualisations, so I wanted to talk about it here.

I tried Jupyter Lab, which is apparently more complicated than Jupyter Notebook. Personally though I'm not sure I see much of a difference, aside from a file manager sidebar in Jupyter Lab that is rather useful.

(Above: A Jupyter Lab session of mine, in which I was visualising embeddings from a pretrained CLIP model.)

Jupyter Lab is installed via pip (pip3 for apt-based systems): https://jupyter.org/install. Once installed, you can start a server with jupyter-lab in a terminal (or command line), and then it will automatically open a new tab in your browser that points to the server instance (http://localhost:8888/ by default).
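In other words, something like this (the pip package is called jupyterlab, all one word):

pip3 install jupyterlab
jupyter-lab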

Then, you can open 1 or more Jupyter Notebooks, which look like regular code files (e.g. Javascript, Python, and more) but are split into 'cells', which can be run independently of one another. While these cells are usually run in order, there's nothing to say that you can't run them out of order, or indeed the same cell over and over again as you prototype a graph.

The output of each cell is displayed directly below it. Be that a console.log()/print() call or a graph visualisation (see the screenshot above), it seems to work just fine. It also saves the output of a cell to disk alongside the code in the Jupyter Notebook, which can be a double-edged sword: on the one hand, it's very useful to have the plot and other output displayed to remind you what you were working on, but on the other hand, if the output somehow contains sensitive data, then you need to remember to clear it before saving & committing to git each time, which is a hassle. Similarly, every time the output changes the notebook file on disk also changes, which can result in unnecessary extra changes committed to git if you're not careful.

In the same vein, I have yet to find a way to define a variable in a notebook file whose value is not saved along with the notebook file, which I'd rather like, since e.g. the tweets I work with for the social media side of my PhD are considered sensitive information, and so I don't want to commit them to a git repository which will no doubt end up open source.

You can also import functions and classes from other files. Personally, I see Jupyter notebooks to be most useful when used in conjunction with an existing codebase: while you can put absolutely everything in your Jupyter notebook, I wouldn't recommend it as you'll end up with spaghetti code that's hard to understand or maintain - just like you would in a regular codebase in any other language.

Likewise, I wouldn't recommend implementing an AI model in a Jupyter notebook directly. While you can, it makes it complicated to train it on a headless server - which you'll likely want to do if you want to train a model at any scale.

The other minor annoyance is that by using Jupyter you end up forfeiting the code intelligence of e.g. Atom or Visual Studio Code, which is a shame since a good editor can e.g. check syntax on the fly, inform you of unused variables, provide autocomplete, etc.

These issues aside, Jupyter is a great fit for plotting visualisations due to the very short improve → rerun → inspect/evaluate output loop. It's also a good fit for writing tutorials I suspect, as it apparently has support for markdown cells too. At some point, I may try writing a tutorial in Jupyter notebook, rendering it to regular markdown, and posting it here.

Excluding domains from Encrypted DNS

Heya! I've got a quick tip for you that was annoying to look up. When using Encrypted DNS (either by DNS-over-TLS or DNS-over-HTTPS), your DNS requests will often go directly to Cloudflare or Google.

This is all well and good if you have a setup like my home network, where DNS for my entire network goes through an Unbound instance which forwards to Cloudflare via Encrypted DNS (associated blog post; it's great for ensuring devices that don't support encrypted DNS are also secure), but things get more complicated if you're on another network with Firefox on your laptop. In such a scenario, you most likely want Firefox configured with private/encrypted DNS enabled - but if you have domains on that network (e.g. if it's a network with split-horizon DNS with local Intranet sites), then it's awkward because you have to keep turning encrypted DNS on and off again.

A pretty specific situation that can be annoying and difficult to diagnose, to be sure. The easiest way to spot the issue is to see if the site you are accessing is local to (or hosted on) the network you're connected to, and to check whether, while it doesn't work on your local device, it does work on other devices on that network.

But no longer! I have discovered a setting in Firefox that allows you to set specific domains that are resolved via your system's DNS resolver instead (for Linux users, that's what is specified in /etc/resolv.conf).

To edit it, first navigate to about:config and dismiss the warning. Then, find the network.trr.builtin-excluded-domains setting. By default for me it's localhost,local.

Once you've located it, you can add the domains you want to exclude from resolving via encrypted DNS to the comma-separated list. It supports wildcards too, so you can do something like this:

localhost,local,mooncarrot.space,*.mooncarrot.space

I'm sure that Chrome has a setting for this too, but I don't use it (for reasons that I could fill an entirely separate blog post with).

I'm mainly posting this for my own reference, but hopefully it helps others too :-)

Tensorflow and PyTorch compared

Hey there! Since I've used both Tensorflow and PyTorch a bit now, I thought it was time to write a post comparing the two and their respective strengths and weaknesses.

For reference, I've used Tensorflow both for Javascript (less popular) and for Python (more popular) for a number of different models, relating to both the rainfall radar and social media halves of my PhD. While I definitely have less experience with PyTorch, I feel like I have a good enough grasp on it to give a first impression.

Firstly, let's talk about how PyTorch is different from Tensorflow, and what Tensorflow could learn from the former. The key thing I noticed about PyTorch is that it's easily the more flexible of the two. I'm pretty sure that you can create layers and even whole models that do not explicitly define the input and output shapes of the tensors they operate on - e.g. using CNN layers. This gives them a huge amount of power for handling variable sized images or sentences without additional padding, and would be rather useful in Tensorflow - where you must have a specific input shape for every layer.

Unfortunately, this comes at the cost of complexity. Whereas Tensorflow has a .fit() method, in PyTorch you have to implement the training loop yourself - which, as you can imagine, results in a lot of additional code that you have to write and test (see the sketch below). This was quite the surprise to me when I first used PyTorch!
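For comparison, the core of what .fit() gives you for free looks roughly like this in PyTorch - a bare-bones sketch where model, dataloader, loss_fn, and epochs are assumed to already exist:

import torch

# A bare-bones sketch of the boilerplate PyTorch expects you to write yourself.
# model, dataloader, loss_fn, and epochs are assumed to already be defined.
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(epochs):
    for batch_input, batch_target in dataloader:
        optimiser.zero_grad()                     # clear gradients from the previous step
        prediction = model(batch_input)           # forward pass
        loss = loss_fn(prediction, batch_target)  # compute the loss
        loss.backward()                           # backpropagate
        optimiser.step()                          # update the weights

A real training loop also needs validation, checkpointing, metrics, and so on - all of which Tensorflow's .fit() handles for you.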

The other thing I like about PyTorch is the data processing pipeline and its simplicity. It's easy to understand and essentially guides you to the optimal solution all on its own - leading to greater GPU usage, faster model training times, less waiting around, and tighter improve → run → evaluate & inspect → repeat loops.

While in most cases you need to know the number of items in your dataset in advance, this is not necessarily a bad thing - as it gently guides you to the realisation that by changing the way your dataset is stored, you can significantly improve CPU and disk utilisation by making your dataset more amenable to be processed in parallel.

Tensorflow on the other hand has a rather complicated data processing pipeline with multiple ways to do things and no clear guidance I could easily find on building a generic data processing pipeline that didn't make enormous assumptions like "Oh, you want to load images right? Just use this function!" - which really isn't helpful when you want to do something unusual for a research project.

Those tutorials I do find suggest you use a generator function, which can't be parallelised and makes training a model a slow and painful process. Things aren't completely without hope though - Tensorflow has a .map() method on their Dataset objects and also have a .interleave() method (if I recall correctly) to interleave multiple Dataset objects together - which I believe is a relatively recent addition. This is quite a clever way of doing things, if a bit more complicated than PyTorch's solution.

It would be nice though if the tf.data.AUTOTUNE feature - which automatically manages the number of parallel workers to use when parallelising things - was more intelligent. For example, I recently discovered that it doesn't max out my CPU if I have multiple .map() calls to parallelise, when it really should look at the current CPU usage and notice that the CPU is sitting, say, 50% idle.
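For reference, here's roughly what the tf.data pattern I'm describing looks like. The shard filename pattern and the parse_record function are hypothetical stand-ins for however your dataset happens to be stored:

import tensorflow as tf

filenames = tf.data.Dataset.list_files("dataset/shard-*.tfrecord")

dataset = filenames.interleave(
    tf.data.TFRecordDataset,                 # read several shards concurrently
    num_parallel_calls=tf.data.AUTOTUNE
).map(
    parse_record,                            # decode / preprocess each record
    num_parallel_calls=tf.data.AUTOTUNE
).batch(32).prefetch(tf.data.AUTOTUNE)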

Tensorflow for Python has a horrible API more generally. It's a confusing mess, as there's both Tensorflow itself and the inbuilt Keras, which means that it's not obvious where the function you need lives - or, indeed, which version of it you want to call. I know it's a holdover from when Keras wasn't bundled with Tensorflow by default, but the API really should be reimagined and tf.keras merged into the main tf namespace somehow.

Things can also get confusing when you mix Tensorflow Tensors, numpy arrays and numpy numbers, and plain Python numbers. In some cases it's impossible to tell where one begins and another ends, which is annoying since they all behave differently - so you can sometimes get confusing error messages when you accidentally mix the types (e.g. "I want a Tensor, not a numpy array", or "I want a plain Python number, not a numpy number").

A great example of what's possible is demonstrated by Tensorflow's own Javascript bindings - to a point. They are much better organised than the Python library for Tensorflow, although they require explicit memory management and disposal of Tensors (which isn't necessarily a bad thing, though it's difficult to compare performance improvements without comparing apples and oranges).

The difficulties start though if you want to do anything in even remotely uncharted territory - Tensorflow.js doesn't have as wide a selection of layers as the Python bindings do (it's missing multi-headed attention, for example). It also seems to have a number of bugs, meaning you can't just port code from the Python bindings and expect it to work. For example, I tried implementing an autoencoder, but found that it didn't work as I wanted it to - and for the life of me I couldn't find the bug (despite extensive searching).

Another annoyance with Tensorflow.js is that the documentation for exactly which CUDA version you need is very poor - and sometimes outright wrong! In addition, there's no table of versions and associated CUDA + CuDNN versions required like there is for Tensorflow for Python.

It is for these reasons that I find myself using Python much more regularly - even if I dislike Python as a language and ecosystem.

At some point, I'd love to build a generic Tensor library on top of GPU.js. It would naturally support almost any GPU (since GPU.js isn't limited to CUDA-capable devices like Tensorflow is - while you can recompile Tensorflow with support for other GPUs, I don't recommend it unless you have lots of time on your hands), be applicable to everything from machine learning to simulation to cellular automata, and run in server, desktop, and browser environments with minimal to no changes to your codebase!

Conclusion

There's no clear answer to whether you should use PyTorch or Tensorflow for your next project. As a rule of thumb, I suggest starting in Tensorflow due to the reduced boilerplate code, and use PyTorch if you find yourself with a wacky model that Tensorflow doesn't like very much - or you want to use a pretrained model that's only available in one or the other.

Having said this, I can certainly recommend experiencing both libraries, as there are valuable things to be learnt from both frameworks. Unfortunately, I can't recommend Tensorflow.js for anything more than basic tensor manipulations (which it is very good at, despite supporting only a limited range of GPUs without recompilation in Node.js) - even though its API is nice and neat (and the Python bindings should take significant inspiration from it).

One way or another, I will be posting about contrastive learning here soon. It's very cool indeed - I just need to wrap my head around the loss function and implement it....

If you have experience with handling matrices, please get in touch as I'd really appreciate some assistance :P

Minifying CSS, HTML, and more in eleventy static sites

I've built a few sites with eleventy, and one of the things that's been on my todo list is figuring out a way to optimise everything.

With websites, it's very important that content loads as fast as possible. To achieve this, a number of strategies can be employed, such as enabling gzip compression to reduce the amount of data transferred. Techniques for improving page load times generally focus on one of two things:

  1. Reducing the number of requests made to the server and the amount of data transferred
  2. Improving JS / CSS parsing and execution performance

By far the most important thing we can do here with a static site like eleventy though is minifying HTML, CSS, Javascript (if you have any being served to the client), and everything else you serve to the client. In doing so, we can significantly reduce the amount of data transferred from the server to the client.

Build systems like esbuild are a good choice here, but if you have yourself an eleventy-based static site, then esbuild may be somewhat complicated and not best suited to the problem at hand (it's best at bundling JS + CSS assets, and doesn't like HTML very much).

To this end, a solution that is more integrated with eleventy is preferable to reduce complexity. The official eleventy docs suggest using clean-css to minify CSS, but this approach doesn't tackle HTML, and requires you to remember to use the cssmin filter every time.

With this in mind, in this post I want to show a much easier method of minifying CSS, HTML, and anything else (except non-svg images, but I have a solution for those too which I'll talk about in a future post if there's any interest) you can think of.

By using eleventy transforms, we can apply a minification filter to every file that Eleventy generates.

For this post, I'll assume that you already have an eleventy site you want to optimise. If you don't have one yet, I recommend the official docs as a starting point.

Let's start with the CSS. I assume you already have something like this in e.g. css.njk in your project:

---
permalink: theme.css
---

{% include "css/patterns.css" %}
{% include "css/theme.css" %}
{% include "css/gallerybox.css" %}
{% include "css/smallscreens.css" %}
{% include "css/prism-custom.css" %}

This puts all your CSS into a single file. This is good, but we can do better. Let's install clean-css:

npm install --save clean-css

Then, open your .eleventy.js file for editing. Add the following:

// Add to your require() statements at the top of the file:
const CleanCSS = require("clean-css");
const is_production = typeof process.env.NODE_ENV === "string" && process.env.NODE_ENV === "production";

function do_minifycss(source, output_path) {
    if(!output_path.endsWith(".css") || !is_production) return source;

    const result = new CleanCSS({
        level: 2
    }).minify(source).styles.trim();
    console.log(`MINIFY ${output_path}`, source.length, `→`, result.length, `(${((1 - (result.length / source.length)) * 100).toFixed(2)}% reduction)`);
    return result;
}

Finally, find the bit at the bottom of the file that looks like this:

module.exports = function(eleventyConfig) {

    // Some stuff may be here

}

...and add the following to that function there:

eleventyConfig.addTransform("cssmin", do_minifycss);

In short, just before eleventy writes each file to disk, it executes all the transforms it has registered. In the do_minifycss transform we registered, we first check that it's a .css file that eleventy is writing, and that the NODE_ENV environment variable is set to production. If both conditions are met, we minify the source code we were passed before returning it; otherwise, we return it unchanged.

This transform pattern is very useful, and can be applied to any file type you like. For example, we could also minify HTML. To do this, install the html-minifier-terser npm package like this:

npm install --save html-minifier-terser

Then, here's what to add to the .eleventy.js configuration file:

// At the top:
const { minify: minify_html } = require("html-minifier-terser");

// Somewhere in the middle:
async function do_minifyhtml(source, output_path) {
    if(!output_path.endsWith(".html") || !is_production) return source;

    const result = await minify_html(source, {
        collapseBooleanAttributes: true,
        collapseWhitespace: true,
        collapseInlineTagWhitespace: true,
        continueOnParseError: true,
        decodeEntities: true,
        keepClosingSlash: true,
        minifyCSS: true,
        quoteCharacter: `"`,
        removeComments: true,
        removeAttributeQuotes: true,
        removeRedundantAttributes: true,
        removeScriptTypeAttributes: true,
        removeStyleLinkTypeAttributes: true,
        sortAttributes: true,
        sortClassName: true,
        useShortDoctype: true
    });

    console.log(`MINIFY ${output_path}`, source.length, `→`, result.length, `(${((1 - (result.length / source.length)) * 100).toFixed(2)}% reduction)`);

    return result;
}

Finally, add this to the module.exports = function.... at the bottom of the file as before:

eleventyConfig.addTransform("htmlmin", do_minifyhtml);

This follows the same pattern as we used for the CSS, but this time html-minifier-terser does the minification instead of clean-css.

This pattern is repeatable over and over for other file types. For example, you could use something like JSON.stringify(JSON.parse(source)) to compress pretty-printed JSON, or wrap svgo to compress SVG images.

If there's a file format, there is probably a minifier for it. Got XML? Try minify-xml. Lua (wow, that's an unusual website you've got there)? Try luamin. PDF? I'm sure there's a minifier / compressor for those too.

Note that if you have a lot of Javascript, then esbuild - as I mentioned at the beginning of this post - may be a better choice for your Javascript (and potentially your CSS).

The reason for this is that esbuild has the ability to tree-shake your Javascript. In other words, it identifies code that you aren't using, and throws it away. This can be very useful if you are using a number of libraries, as these can seriously bloat the size of your final Javascript file.

Conclusion

The larger the site, the more of an effect you'll see by minifying your source code. In this post, I've shown you how to minify your source code in your eleventy sites. Other techniques that you can employ to further reduce load times include:

  • Optimising images (I'll write a separate post on this if there's interest, as it can be quite involved)
  • Reducing the number of domains the browser has to contact by serving external resources locally from your site
    • This avoids the extra latency of setting up a brand-new connection to a new place, since multiple requests to your own domain can re-use the same connection (and, with HTTP/2 enabled, multiplex multiple requests at once over a single connection)

Hopefully you've found this post useful. If you have, please do leave a comment below.

Have you found a cool minifier, or got a tip for optimising a static site? Please share those below too.


On the value of the open source community

Open source is a wonderful thing. With over 200 million repositories on GitHub alone, and many many more on SourceHut, GitLab, and thousands of personal git server instances (like mine!) across the globe, there's no question that open source powers the world - look no further than NASA's Curiosity rover!

On a more personal level, open source means a lot to me too, and I wanted to talk a bit about that here. I think the oldest open source project that I both started and am still working on and improving would have to be Pepperminty Wiki. With 1.8K commits and a first commit way back in November 2014 (7 years 7 months ago, wow), I've probably poured thousands of hours into it and many other projects over the years.

While at the time it was just something cool I wanted to work on, it has since come to mean far more to me, and has helped me develop very useful skills without me even realising it. If you have some time to spare and you're beginning a journey into programming / computer science, I can thoroughly recommend getting involved - it's definitely worth your time. Here are some of the skills it has helped me develop.

Documenting things. I really can't stress this one enough. Through working on open source projects, I've learnt the power of good documentation. You can write the best program in the world, but if it isn't documented well then nobody will know how to use it! The best test of documentation is when someone comes along and tries your program out without you present: by following your documentation, they test whether it's up to date and accurate.

Writing good documentation is an iterative process and takes practice, but open source is a great place to practise. You don't even have to have written the program yourself - you can help out another project and improve theirs. Chances are you've already read the documentation for a free and open source program - and if you've ever had a thought about how you'd improve it, don't hesitate to get involved!

Working (remotely) in a team. Speaking of getting involved, it's very common in open source to collaborate with people over the Internet - probably in different timezones to you - on everything from tracking down bugs to writing code to reviewing contributions. Doing so effectively is also a learned skill, but one that's definitely worth having.

Having a portfolio for your CV. This isn't something I really think about much myself, but if you're looking to get a job of some description, then having an open source portfolio of work can definitely work in your favour and demonstrate your skills to a potential new employer.

Reviewing contributions, resolving disputes. What you review and how often you do so depends greatly on the kind of project you're working on. For me, this is not something I do often on my personal projects, but I do it all the time in tldr-pages.

By reviewing and checking for issues and things other people might have missed, the quality of a project can be improved. It's a fine balance though between getting contributions merged and requesting improvements. On the one hand, suggesting improvements can be a good thing as previously described, but on the other it can cause unnecessary delays and frustrate everyone involved. As with all things on this list, it takes practice and continual adjustments to find the right balance.

Helping others. One of the things I love most about open source is how I can help other people out. tldr-pages has 39.1K stars as of the time of typing, and is a hugely useful resource that lots of people use daily. I've had comments from people thanking me for my work on Pepperminty Wiki and other projects that I've created and open-sourced. Knowing that I'm helping others out is very motivating for me to continue contributing :-)

Conclusion

All these things are reasons why I'm proud to say that I'm a part of the open source community as a maintainer of multiple open source projects (both personal ones and tldr-pages). I'm particularly grateful to everyone at tldr-pages (especially @waldyrious) for everything they've taught me, and for the chance I've been given to help out with the project.

While it hasn't been easy at times (helping to maintain a popular project like tldr-pages takes time and can be tedious in places), it's certainly something I'll be continuing to do and can thoroughly recommend to anyone who has the time to do so.
