Starbeamrainbowlabs

Stardust
Blog


Archive


Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases rendering resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

Cluster, Part 9: The Border Between | Load Balancing with Fabio

Hello again! It's been a while since the last one (mainly since I've been unsure about a few architectural things), but I'm now ready to continue writing about my setup. Before we continue, here's a refresher of everything we've done so far:

In this post, we're going to look at tying off our primary pipeline. So far, we've got job scheduling with Nomad, (superglue!) service discovery with Consul, and shared storage backed with NFS (although I'm going to revisit this eventually), with everything underpinned by a WireGuard mesh VPN with wesher.

In order to allow people to interact with services that are running on the cluster, we need something that will translate from the weird and strange world of anything running somewhere anywhere, and everywhere in-between into something that makes sense from an outside perspective. We want to have a single gateway by which we can control and manage access.

It is for these purposes that we're going to add Fabio to our stack. Its configuration is backed by Consul, and it is relatively simple and easy to understand. Having the config backed by Consul nets us multiple benefits:

  • It can run anywhere on the cluster we like in a pinch
  • We can configure new routes directly from a Nomad job spec file (although we still need to update the Unbound config)
  • The configuration of Vault gains additional data redundancy being stored on multiple nodes in the cluster

Like in previous parts of this series, Fabio isn't available to install with apt directly, so I've packaged it into my apt repository. If you haven't yet set up my apt repository, up-to-date instructions on how to do so can be found at the top of its main page - just click the aforementioned link (I'm not going to include instructions here, as they may go out of date at a later time).

Once you've set up my apt repository (or downloaded the Fabio binary manually, though I don't recommend that as it's more difficult to keep up-to-date), we can install Fabio like so:

sudo apt install fabio

This should be done on your primary (controller) node in your cluster. You can also do it on a secondary node too if you like to increase redundancy. To do this, just follow these instructions on both nodes one at a time. I'll be doing this soon myself: I've just been distracted with other things :P

Next, we need a service file. For systemd users (I'm using Raspbian at the moment), I have an apt package:

sudo apt install fabio-systemd

With this installed, we need to create a (very) minimal configuration file. Here it is:

proxy.addr = :80;proto=http
proxy.auth = name=admin;type=basic;file=/etc/fabio/auth.admin.htpasswd

Pretty short, right? This does 2 things:

  1. Tells Fabio to listen on port 80 for HTTP requests (we'll be tackling HTTPS in a separate post - we need Vault for that)
  2. Tells Fabio about the admin auth realm and where it can find the .htpasswd file that corresponds with it

Fabio's password authentication uses HTTP Basic Auth - which is insecure over unencrypted HTTP. Note that we'll be working towards improving the situation here and I'll insert a reminder when we arrive to change all your passwords where we do, but there are quite a number of obstacles between here and there we have to deal with first.

With this in mind, Take a copy of the above Fabio config file and write it to /etc/fabio/fabio.properties. Next, we need to generate that htpasswd file we reference in the config file. There are many tools out there that can be used for this purpose - for example the htpasswd tool in the apache2-utils package:

htpasswd /etc/fabio/auth.admin.htpasswd username

I like this authentication setup for Fabio, as it allows one to have a single easily configurable set of realms for different purposes if desired.

If you're setting up Fabio on multiple servers, you'll want to put your config file in your shared NFS storage and create a symlink at /etc/fabio/fabio.properties instead. Do that like this:

sudo ln -s /etc/fabio/fabio.properties /mnt/shared/config/fabio/fabio.properties

....update the /mnt/... path accordingly. Don't forget to adjust the /etc/fabio/auth.admin.htpasswd path too in fabio.properties as well.

Now that we've got the configuration file out of the way, we can start Fabio for the first time! Do that like this:

sudo systemctl start fabio.service
sudo systemctl enable fabio.service

Don't forget to punch a hole in the firewall:

sudo ufw allow 80/tcp comment fabio

Fabio is running - but it's not particularly useful, as we haven't configured any routes! Let's add some routes now.The first few routes we're going to add will be manual routes, which will allow us to tell Fabio about a static route we want it to add to it's routing table.

Fabio itself actually has a web interface, which will make a good first target for testing out our new cool toy. I mentioned earlier that Fabio gets its configuration from Consul - and it's now that we're going to take advantage of that. Consul isn't just a service discovery tool you see - it's a shared configuration manager too via a fancy hierarchical distributed key-value data store.

In this datastore Fabio looks in particular at the keys in the fabio directory. Create a new key under here with the Consul CLI like so:

consul kv put "fabio/fabio" 'route add fabio fabio.bobsrockets.com/ http://NODE_NAME.node.mooncarrot.space:9998 tags "mission-control" opts "auth=admin"'

Replace NODE_NAME with the name of the node you're running Fabio on, and yourdomain.com with a domain name you've bought. Once done, update your DNS config to point fabio.bobsrockets.com to the node that's running Fabio (you might want to refer back to my earlier post on Unbound - don't forget to restart unbound with sudo systemctl restart unbound).

When you have your DNS server updated, you should be able to point your browser at fabio.bobsrockets.com. No reloading of Fabio is needed - it picks up changes dynamically and automagically! It should prompt you for your password, and then you should see your the Fabio web interface. It should look something like this:

The Fabio web interface

As you can see, I've got a number of services running - including a few that I'm going to be blogging about soon-ish, such as Vault (but I haven't yet learnt how to use it :P) and Docker Registry UI (which is useful but has some issues - I'm going to see if HTTPS helps fix some of them up as I'm getting some errors in the dev tools about the SubtleCrypto API, which is only available in secure contexts).

Those services with IP addresses as the destination are defined through Nomad, and auto-update based on the host upon which they are running.

In the web interface you can click on overrides on the top bar to view and edit the configuration for the static routes you've got configured. You can't create new ones though, which is a shame.

Using the same technique as described above, you can create manual routes for Nomad and Consul - as they have web interfaces too! If you haven't already you'll need to enable it though with ui = true the Nomad and Consul server configuration files respectively though. For example, you could use these definitions:

route add nomad nomad.seanssatellites.io/ http://nomad.service.seanssatellites.io:4646 tags "mission-control" opts "auth=admin"
route add consul consul.billsboosters.space/ http://consul.service.billsboosters.space:8500 tags "mission-control" opts "auth=admin"

If you do the Consul one first, you can use the web interface to create the definition for Nomad :D

It's perhaps worth making a quick note of some parts of the above route definitions:

  • opts "auth=admin": This bit activates HTTP Basic Auth with the specified realm
  • consul.billsboosters.space/: This is the domain through which outside users will access the service. The trailing slash is very important.

From here, the last item on the list for this post are automatic routes via Nomad jobs. Since it's the only job we've got running on Nomad so far, let's use that as an example. Adding a Fabio route in this manner requires 3 steps:

  1. Find the service stanza in your Docker Registry Nomad job file, and edit the tags list to include a pair of tags something like urlprefix-registry.tillystelescopes.fr/ and auth=admin (again, the trailing slash is important, and the urlprefix- bit instructs Fabio that it's the domain name to route traffic from to the container).
  2. Save the edits to the Nomad job file and re-run it with nom job run path/to/file.nomad
  3. Update your DNS with a new record pointing registry.tillystelescopes.fr at the IP address(es) of the node(s) running Fabio

Also pretty simple to get used to, right? From here, step 4 of the official quickstart guide is useful. It explains about the different service tags (like the urlprefix- and auth=admin ones we created above) that are supported. Apparently raw TCP forwarding is also supported - though personally I'm waiting eagerly on UDP forwarding myself for some services I would like to run.

The rest of the Fabio docs are a bit of a mess, but I've found them more understandable than that of Traefik - the solution I investigated before turning to Fabio upon a recommendation from someone over in the r/selfhosted subreddit in frustration (whoever says "Traefik is simple!" is lying - I can't make sense of anything - it might as well be written in hieroglyphs.....).

Looking into the future, our path is diverging into 2 clear routes:

  • Getting services up and running on our new cluster
  • Securing said cluster to avoid attack

While relatively separate goals, they do intertwine at intervals. Moving forwards, we're going to be oscillating between these 2 goals. Likely topics include Vault (though it'll take several blog posts to realise any benefit from it at this point), and getting some Docker container infrastructure setup.

Speaking of Docker container infrastructure, if anyone has any ideas as to how to auto-rebuild docker containers and/or auto-restart Nomad jobs to keep them up-to-date, I'd love to know in a comment below. I'm currently scratching my head over that one....

Found this interesting? Got an idea that would improve on my setup? Confused about something? Comment below!

Cluster, Part 8: The Shoulders of Giants | NFS, Nomad, Docker Registry

Welcome back! It's been a bit of a while, but now I'm back with the next part of my cluster series. As a refresher, here's a list of all the parts in the series so far:

In this one, we're going to look at running our first job on our Nomad cluster! If you haven't read the previous posts in this series, you'll probably want to go back and read them now, as we're going to be building on the infrastructure we've setup and the groundwork we've laid in the previous posts in this series.

Before we get to that though, we need to sort out shared storage - as we don't know which node in the cluster tasks will be running on. In my case, I'll be setting up NFS. This is hardly the only solution to the issue though - other options include:

If you're going to choose NFS like me though, you should be warned that it's neither encrypted not authenticated. You should ensure that NFS is only run on a trusted network. If you don't have a trusted network, use the WireGuard Mesh VPN trick in part 4 of this series.

NFS: Server

Setting up a server is relatively easy. Simply install the relevant package:

sudo apt install nfs-kernel-server

....edit /etc/exports to look something like this:

/mnt/somedrive/subdirectory 10.1.2.0/24(rw,async,no_subtree_check)

/mnt/somedrive/subdirectory is the directory you'd like clients to be able to access, and 10.1.2.0/24 is the IP range that should be allowed to talk to your NFS server.

Next, open up the relevant ports in your firewall (I use UFW):

sudo ufw allow nfs

....and you're done! Pretty easy, right? Don't worry, it'll get harder later on :P

NFS: Client

The client, in theory, is relatively straightforward too. This must be done on all nodes in the cluster - except the node that's acting as the NFS server (although having the NFS server as a regular node in the cluster is probably a bad idea). First, install the relevant package:

sudo apt install nfs-common

Then, update /etc/fstab and add the following line:

10.1.2.10:/mnt/somedrive/subdirectory   /mnt/shared nfs auto,nofail,noatime,intr,tcp,bg,_netdev 0   0

Again, 10.1.2.10 is the IP of the NFS server, and /mnt/somedrive/subdirectory must match the directory exported by the server. Finally, /mnt/shared is the location that we're going to mount the directory from the NFS server to. Speaking of, we should create that directory:

sudo mkdir /mnt/shared

I have yet to properly tune the options there on both the client and the server. If I find that I have to change anything here, I'll both come back and edit this and mention it in a future post that I did.

From here, you should be able to mount the NFS share like so:

sudo mount /mnt/shared

You should see the files from the NFS server located in /mnt/shared. You should check to make sure that this auto-mounts it on boot too (that's what the auto and _netdev are supposed to do).

If you experience issues on boot (like me), you might see something like this buried in /var/log/syslog:

mount[586]: mount.nfs: Network is unreachable

....then we can quickly hack this by creating a script in the directory /etc/network/if-up.d. It should read something like this should fix the issue:

#!/usr/bin/env bash
mount /mnt/shared

Save this to /etc/network/if-up.d/cluster-shared-nfs for example, not forgetting to mark it as executable:

sudo chmod +x /etc/network/if-up.d/cluster-shared-nfs

Alternatively, there's autofs that can do this more intelligently if you prefer.

First Nomad Job: Docker Registry

Now that we've got shared storage online, it's time for the big moment. We're finally going to start our very first job on our Nomad cluster!

It's going to be a Docker registry, and in my very specific case I'm going to be marking it as insecure (gasp!) because it's only going to be accessible from the WireGuard VPN - which I figure provides the encryption and authentication for us to get started reasonably simply without jumping through too many hoops. In the future, I'll probably revisit this in a later post to tighten things up.

Tasks on a Nomad cluster take the form of a Nomad job file. These can written in JSON or HCL (Hashicorp Configuration Language). I'll be using HCL here, because it's easier to read and we're not after machine legibility yet at this stage.

Nomad job files work a little bit like Nginx config files, in that they have nested sequences of blocks in a hierarchical structure. They loosely follow the following pattern:

job > group > task

The job is the top-level block that contains everything else. tasks are the items that actually run on the cluster - e.g. a Docker container. groups are a way to logically group tasks in a job, and are not required as far as I can tell (but we'll use one here anyway just for illustrative purposes). Let's start with the job spec:

job "registry" {
    datacenters = ["dc1"]
    # The Docker registry *is* pretty important....
    priority = 80

    # If this task was a regular task, we'd use a constraint here instead & set the weight to -100
    affinity {
        attribute   = "${attr.class}"
        value       = "controller"
        weight      = 100
    }

    # .....

}

This defines a new job called registry, and it should be pretty straight forward. We don't need to worry about the datacenters definition there, because we've only got the 1 (so far?). We set a priority of 80, and get the job to prefer running on nodes with the controller class (though I observe that this hasn't actually made much of a difference to Nomad's scheduling algorithm at all).

Let's move on to the real meat of the job file: the task definition!

group "main" {
    task "registry" {
        driver = "docker"

        config {
            image = "registry:2"
            labels { group = "registry" }

            volumes = [
                "/mnt/shared/registry:/var/lib/registry"
            ]

            port_map {
                registry = 5000
            }
        }

        resources {
            network {
                port "registry" {
                    static = 5000
                }
            }
        }

        # .......
    }
}

There's quite a bit to unpack here. The task itself uses the Docker driver, which tells Nomad to run a Docker container.

In the config block, we define the Docker driver-specific settings. The docker image we're going to run is registry:2 where registry is the image name, and 2 is the tag. This will to automatically pulled from the Docker hub. Future tasks will pull docker images from our very own private Docker registry, which we're in the process of setting up :D

We also mount a directory into the Docker container to allow it to persist the images that we push to it. This is done through a volume, which is the Docker word for bind-mounting a specific directory on the host system into a given location inside the guest container. For me I'm (currently) going to store the Docker registry data at /mnt/shared/registry - you should update this if you want to store it elsewhere. Remember this this needs to be a location on your shared storage, as we don't know which node in the cluster the Docker registry is going to run on in advance.

The port_map allows us to tell Nomad the port(s) that our service inside the Docker container listens on, and attach a logical name to them. We can then expose them in the resources block. In this specific case, I'm forcing Nomad to statically allocate port 5000 on the host system to point to port 5000 inside the container, for reasons that will become apparent later. This is done with the static keyword there. If we didn't do this, Nomad would allocate a random port number (which is normally what we'd want, because then we can run lots of copies of the same thing at the same time on the same host).

The last block we need to add to complete the job spec file is the service block. with a service block, Nomad will inform Consul that a new service is running, which will then in turn allow us to query it via DNS.

service {
    name = "${TASK}"
    tags = [ "infrastructure" ]

    address_mode = "host"
    port = "registry"
    check {
        type        = "tcp"
        port        = "registry"
        interval    = "10s"
        timeout     = "3s"
    }

}

The service name here is pulled from the name of the task. We tell Consul about the port number by specifying the logical name we assigned to it earlier.

Finally, we add a health check, to allow Consul to keep an eye on the health of our Docker registry for us. This will appear as a green tick if all is well in the web interface, which we'll be getting to in a future post. The health check in question simply ensures that the Docker registry is listening via TCP on the port it should be.

Here's the completed job file:

job "registry" {
    datacenters = ["dc1"]
    # The Docker registry *is* pretty important....
    priority = 80

    # If this task was a regular task, we'd use a constraint here instead & set the weight to -100
    affinity {
        attribute   = "${attr.class}"
        value       = "controller"
        weight      = 100
    }

    group "main" {

        task "registry" {
            driver = "docker"

            config {
                image = "registry:2"
                labels { group = "registry" }

                volumes = [
                    "/mnt/shared/registry:/var/lib/registry"
                ]

                port_map {
                    registry = 5000
                }
            }

            resources {
                network {
                    port "registry" {
                        static = 5000
                    }
                }
            }

            service {
                name = "${TASK}"
                tags = [ "infrastructure" ]

                address_mode = "host"
                port = "registry"
                check {
                    type        = "tcp"
                    port        = "registry"
                    interval    = "10s"
                    timeout     = "3s"
                }

            }
        }

        // task "registry-web" {
        //  driver = "docker"
        // 
        //  config {
        //      // We're going to have to build our own - the Docker image on the Docker Hub is amd64 only :-/
        //      // See https://github.com/Joxit/docker-registry-ui
        //      image = ""
        //  }
        // }
    }
}

Save this to a file, and then run it on the cluster like so:

nomad job run path/to/job/file.nomad

I'm as of yet unsure as to whether Nomad needs the file to persist on disk to avoid it getting confused - so it's probably best to keep your job files in a permanent place on disk to avoid issues.

Give Nomad to start the job, and then you can check on it's status like so:

nomad job status

This will print a summary of the status of all jobs on the cluster. To get detailed information about our new job, do this:

nomad job status registry

It should show that 1 task is running, like this:

ID            = registry
Name          = registry
Submit Date   = 2020-04-26T01:23:37+01:00
Type          = service
Priority      = 80
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
main        0       0         1        5       6         1

Latest Deployment
ID          = ZZZZZZZZ
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
main        1        1       1        0          2020-06-17T22:03:58+01:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
XXXXXXXX  YYYYYYYY  main        4        run      running  6d2h ago  2d23h ago

Ignore the Failed, Complete, and Lost there in my output - I ran into some snags while learning the system and setting mine up :P

You should also be able to resolve the IP of your Docker registry via DNS:

dig +short registry.service.mooncarrot.space

mooncarrot.space is the root domain I've bought for my cluster. I highly recommend you do the same if you haven't already. Consul exposes all services under the service subdomain, so in the future you should be able to resolve the IP of all your services in the same way: service_name.service.DOMAIN_ROOT.

Take care to ensure that it's showing the right IP address here. In my case, it should be the IP address of the wgoverlay network interface. If it's showing the wrong IP address, you may need to carefully check the configuration of both Nomad and Consul. Specifically, start by checking the network_interface setting in the client block of your Nomad worker nodes from part 7 of this series.

Conclusion

We're getting there, slowly but surely. Today we've setup shared storage with NFS, and started our first Nomad job. In doing so, we've started to kick the tyres of everything we've installed so far:

  • wesher, our WireGuard Mesh VPN
  • Unbound, our DNS server
  • Consul, our service discovery superglue
  • Nomad, our task scheduler

Truly, we are standing on the shoulders of giants: a whole host of open-source software that thousands of people from across the globe have collaborated together to produce which makes this all possible.

Moving forwards, we're going to be putting that Docker registry to good use. More immediately, we're going to be setting up Fabio (who's documentation is only marginally better than Traefik's, but just good enough that I could figure out how to use it....) in order to take a peek at those cool web interfaces for Nomad and Consul that I keep talking about.

We're also going to be looking at setting up Vault for secret (and certificate, if all goes well) management.

Until then, happy cluster configuration! If you're confused about anything so far, please leave a comment below. If you've got a suggestion to make it even better, please comment also! I'd love to know.

Sources and further reading

Cluster, Part 7: Wrangling... boxes? | Expanding the Hashicorp stack with Docker and Nomad

Welcome back to part 7 of my cluster configuration series. Sorry this one's a bit late - the last one was a big undertaking, and I needed a bit of a rest :P

Anyway, I'm back at it with another part to my cluster series. For reference, here are all the posts in this series so far:

Don't forget that you can see all the latest posts in the cluster tag right here on my blog.

Last time, we lit the spark for the bonfire so to speak, that keeps track of what is running where. We also tied it into the internal DNS system that we setup in part 4, which will act as the binding fabric of our network.

In this post, we're going to be doing 4 very important things:

  • Installing Docker
  • Installing and configuring Hashicorp Nomad

This is set to be another complex blog post that builds on the previous ones in this series (remember that benign rabbit hole from a few blog posts ago?).

Above: Nomad is a bit like a railway network manager. It decides what is going to run where and at what time. Picture taken by me.

Installing Docker

Let's install Docker first. This should be relatively easy. According to the official Docker documentation, you can install Docker like so:

curl https://get.docker.com/ | sudo sh

I don't like piping to sh though (and neither should you), so we're going to be doing something more akin to the "install using the repository". As a reminder, I'm using Raspberry Pi 4s running Raspbian (well, DietPi - but that's a minor detail). If you're using a different distribution or CPU architecture, you'll need to read the documentation to figure out the specifics of installing Docker for your architecture.

For Raspberry Pi 4s at least, it looks a bit like this:

echo 'deb [arch=armhf] https://download.docker.com/linux/raspbian buster stable' | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt update
sudo apt install docker-ce

Don't forget that if you're running an apt caching server, you'll need to tweak that https to be plain-old http. For the curious, my automated script for my automated Ansible replacement (see the "A note about cluster management" in part 6) looks like this:

#!/usr/bin/env bash
RUN "echo 'deb [arch=armhf] http://download.docker.com/linux/raspbian buster stable' | sudo tee /etc/apt/sources.list.d/docker.list";
RUN "sudo apt-get update";
RUN "sudo apt-get install --yes docker-ce";

Docker should install without issue - note that you need to install it on all nodes in the cluster. We can't really do anything meaningful with it yet though, as we don't yet have Nomad installed. Let's move on and install that then!

Installing Hashicorp Nomad

Nomad is what's known as a workload orchestrator. This means that it, given a bunch of jobs, decides what is going to run where. If a host goes down, it is also responsible for shuffling things around to compensate.

Nomad works on the concept of 'jobs', which can be handled any any 1 of a number of drivers. In our case, we're going to be using the built-in Docker driver, as we want to manage the running of lots of Docker containers across multiple hosts in our cluster.

After installing Consul last time, we can build on that with Nomad. The 2 actually integrate really nicely with each other. Nomad will, by default, seek out a local Consul daemon, use it to discover other hosts in the cluster, and hang it's own cluster from Consul. Neat!

Also like Consul, Nomad functions with servers and clients. The servers all talk to each other via the Raft consensus algorithm, and the clients a lightweight daemons that do that the servers tell them to. I'm going to have 3 servers and 2 clients, in the following layout:

Host # Consul Nomad
1 Server Server
2 Server + Client Client
3 Server + Client Client
4 Client Server + Client
5 Client Server + Client

Just for the record, according to thee Nomad documentation it's not recommended that servers also act as clients, but I don't have enough hosts to avoid this yet.

With this in mind, let's install Nomad. Again, as last time, I've packaged Nomad in my at repository. If you haven't already, go and set it up now. Then, install nomad like so:

sudo apt install hashicorp-nomad

Also as last time, I've deliberately chosen a different name then the existing nomad package that you'll probably find in your distribution's repositories to avoid confusion during updates. If you're a systemd user, then I've also got a trio of packages that provide a systemd service file:

Package Name Config file location
hashicorp-nomad-systemd-server /etc/nomad/server.hcl
hashicorp-nomad-systemd-client /etc/nomad/client.hcl
hashicorp-nomad-systemd-both /etc/nomad/both.hcl

They all conflict with each other (such that you can only have 1 installed at a time), and the only difference between them is where the configuration file is located.

Install 1 of these (if required) now too with your package manager. If you're not a systemd user, consul your service manager's documentation and write a service definition. If you're willing, comment below and I'll include a note about it here!

Speaking of configuration files, we should write one for Nomad. Let's start off with the bits that will be common across all the config file variants:

bind_addr = "{{ GetInterfaceIP \"wgoverlay\" }}"

# Increase log verbosity
log_level = "INFO"

# Setup data dir
# The data directory used to store state and other persistent data. On client
# machines this is used to house allocation data such as downloaded artifacts
# used by drivers. On server nodes, the data dir is also used to store the
# replicated log.
data_dir = "/srv/nomad"

A few things to note here. log_level is mostly personal preference, so do whatever you like there. I'll probably tune it myself as I get more familiar with how everything works.

data_dir needs to be a path to a private root-owned directory on disk for the Nomad agent to store stuff locally to that node. It should not be shared with other nodes. If you installed one of the systemd packages above, /srv/nomad is created and properly permissed for you.

bind_addr tells Nomad which network interface to send management clustering traffic over. For me I'm using a WireGuard mesh VPN I setup in [part 4](), so I specify wgoverlay here.

Next, let's look at the server config:

# Enable the server
server {
    enabled = true

    # We've got 3 servers in the cluster at the moment
    bootstrap_expect = 3

    # Note that Nomad finds other servers automagically through the consul cluster

    # TODO: Enable this. Before we do we need to figure out how to move this sekret into vault though or something
    # encrypt = "SOME_VALUE_HERE"
}

Not much to see here. Don't forget to change the bootstrap_expect to a different value if you are going to have a different number of servers in your cluster (nodes that are just clients don't count).

Note that this isn't the complete server configuration file - you need to take both this and the above common bit to make the complete server configuration file.

Now, let's look at the client configuration:

client {
    enabled = true
    # Note that Nomad finds other servers automagically through the consul cluster

    # Just a worker, nothing special going on here
    node_class = "worker"

    # use wgoverlay for network fingerprinting and forwarding
    network_interface = "wgoverlay"

    # Nobody is allow to run as root - even if you *are* inside a container.....
    # For 1 thing it'll trigger a permission denied when writing to the NFS share
    options = {
        "user.blacklist" = "root"
    }
}

This is more interesting.

network_interface is really important if you're using a WireGuard mesh VPN like wesher that I setup and configured in part 4. By default, Nomad port forwards over all interfaces that make sense, and in this case gets it wrong.

This fixes that by telling it to explicitly port forward containers over the wgoverlay interface. If your network interface has a different name, this is the place to change it. It's a fairly common practice from what I can tell to have both a 'public' and a 'private' network in a cluster environment. The private network is usually trusted, and as such has lots of management traffic running over it. The public network is the one that's locked down that requests come in to from outside.

The "user.blacklist" = "root" here is a precaution that I may end up having to remove in future. It blocks any containers from running on this client from running as root inside the Docker container. This is actually worth remembering, because it's a bit of a security risk. This is a fail-safe to remind myself that it's a Bad Idea.

Apparently there are tactics that can be deployed to avoid running containers as root - even when you might think you need to. In addition, if there's no other way to avoid it, apparently there's a clever user namespace remapping trick one can deploy to avoid a container from having root privileges if it breaks out of it's container.

Another thing to note is that NFS shares often don't like you reading or writing files owned by root either, so if you're going to be saving data to a shared NFS partition like me, this is yet another reason to avoid root in your containers.

At this point it's also probably a good idea to talk a little bit about usernames - although we'll talk in more depth about this later. From my current understanding, the usernames inside a container aren't necessarily the same as those outside the container.

Every process runs under a specified username, but each username is backed by a given user id. It's this user id that is translated back into a username on the client machine when reading files from an NFS mount - hence why usernames in NFS shares can be somewhat odd.

Docker containers often have custom usernames created inside the containers for running processes inside the container with specific user ids. More on this later, but I plan to dive into this in the process of making my own Docker container images.

Anyway, we now have our configuration files for Nomad. For a client node, take the client config and the common config from the top of this section. For a server, take the server and common sections. For a node that's to act as both a client and a server, take all 3 sections.

Now that we've got that sorted, we should be able to start the Nomad agent:

sudo systemctl enable --now nomad.service

This is the same for all nodes in the cluster - regardless as to whether it's a client, a server, or both (this is also the reason that you can't have more than 1 of the systemd apt packages installed at once that I mentioned above).

If you're using the UFW firewall, then that will need configuring. For me, I'm allowing all traffic on the wgoverlay network interface that's acting as my trusted network:

sudo ufw allow in on wgoverlay

If you'd prefer not to do that, then you can allow only the specific ports through like so:

sudo ufw allow 4646/tcp comment nomad-http
sudo ufw allow 4647/tcp comment nomad-rpc
sudo ufw allow 4648/tcp comment nomad-serf

Note that this allows the traffic on all interfaces - these will need tweaking if you only want to allow the traffic in on a specific interface (which, depending on your setup, is probably a wise decision).

Anyway, you should now be able to ask the Nomad cluster for it's status like so:

nomad node status

...execute this from any server node in the cluster. It should give output like this:

ID        DC   Name         Class   Drain  Eligibility  Status
75188064  dc1  piano        worker  false  eligible     ready
9eb7a7a5  dc1  harpsicord   worker  false  eligible     ready
c0d23e71  dc1  saxophone    worker  false  eligible     ready
a837aaf4  dc1  violin       worker  false  eligible     ready

If you see this, you've successfully configured Nomad. Next, I recommend reading the Nomad tutorial and experimenting with some of the examples. In particular the Getting Started and Deploy and Manage Jobs topics are worth investigating.

Conclusion

In this post, we've installed Docker, and installed and configured Nomad. We've also touched briefly on some of the security considerations we need to be aware of when running things in Docker containers - much more on this in the future.

In future posts, we're going to look at setting up shared storage, so that jobs running on Nomad can be safely store state and execute on any client / worker node in the cluster while retaining access to said state information.

On the topic of Nomad, we're also going to look at running our first real job: a Docker registry, so that we can push our own custom Docker images to it when we've built them.

You may have noticed that both Nomad and Consul also come with a web interface. We're going to look at these too, but in order to do so we need a special container-aware reverse-proxy to act as a broker between 'cluster-space' (in which everything happens 'somewhere', and we don't really know nor do we particularly care where), and 'normal-network-space' (in which everything happens in clearly defined places).

I've actually been experiencing some issues with this, as I initially wanted to use Traefik for this purpose - but I ran into a number of serious difficulties with reading their (lack of) documentation. After getting thoroughly confused I'm now experimenting with Fabio (git repository) instead, which I'm getting on much better with. It's a shame really, I even got as far as writing the automated packaging script for Traefik - as evidenced by the traefik packages in my apt repository.

Until then though, happy cluster configuration! Feel free to post a comment below.

Found this interesting? Found a mistake? Confused about something? Comment below!

Sources and Further Reading

Cluster, Part 6: Superglue Service Discovery | Setting up Consul

Hey, welcome back to another weekly installment of cluster configuration for colossal computing control. Apparently I'm managing to keep this up as a weekly series every Wednesday.

Last week, we sorted out managing updates to the host machines in our cluster to keep them fully patched. We achieved this by firstly setting up an apt caching server with apt-cacher-ng. Then we configured our host machines to use it. Finally, we setup automated updates with unattended-upgrades so we don't have to keep installing them manually all the time.

For reference, here are all the posts in this series so far:

In this part, we're going to install and configure Consul - the first part of the Hashicorp stack. Consul doesn't sound especially exciting, but it is an extremely important part of our (diabolical? I never said that) plan. It serves a few purposes:

  • Clusters together, so Nomad (the task scheduler) can find other nodes
  • Keeps track of which services are running where

It uses the Raft Consensus Algorithm (like wesher from part 4; they actually use the same library under-the-hood it would appear) to provide a relatively decentralised approach to the problem, allowing for some nodes to fail without impacting the cluster's operation as a whole.

It also provides a DNS API, which we'll be tying into with Unbound later in this post.

Before continuing, you may find reading through the official Consul guides a useful exercise. Try out some of the examples too to get your head around what Consul is useful for.

(Above: Nasa's DSN dish in Canberra, Australia just before major renovations are carried out. Credit: NASA/Canberra Deep Space Communication Complex)

Installation and Preamble

To start with, we need to install it. I've done the hard work of packaging it already, so you can install it from my apt repository - which, if you've been following this series, you should have it setup already (if not, follow the link and read the instructions there).

Install consul like this:

sudo apt install consul

Next, we need a systemd service file. Again, I have packages in my apt repository for this. There are 2 packages:

  • hashicorp-consul-systemd-client
  • hashicorp-consul-systemd-server

The only difference between the 2 packages is where it reads the configuration file from. The client package reads from /etc/consul/client.hcl, and the server from /etc/consul/server.hcl. They also conflict with each other, so you can't install both at the same time. This is because - as far as I can tell - servers can expose the client interface in just the same way as any other normal client.

To get a feel for the bigger picture, let's talk architecture. Because Consul uses the Raft Consensus Algorithm, we'll need an odd number of servers to avoid issues (if you use an even number of servers, then you run the risk of a 'split brain', where there's no clear majority vote as to who's the current leader of the cluster). In my case, I have 5 Raspberry Pi 4s:

  • 1 x 2GB RAM (controller)
  • 4 x 4GB RAM (workers)

In this case, I'm going to use the controller as my first Consul server, and pick 2 of the workers at random to be my other 2, to make up 3 servers in total. Note that in future parts of this series you'll need to keep track of which ones are the servers, since we'll be doing this all over again for Nomad.

With this in mind, install the hashicorp-consul-systemd-server package on the nodes you'll be using as your servers, and the hashicorp-consul-systemd-client package on the rest.

A note about cluster management

This is probably a good point to talk about cluster management before continuing. Specifically the automation of said management. Personally, I have the goal of making the worker nodes completely disposable - which means completely automating the setup from installing the OS right up to being folded into the cluster and accepting jobs.

To do this, we'll need a tool to help us out. In my case, I've opted to write one from scratch using Bash shell scripts. This is not something I can recommend to anyone else, unless you want to gain an understanding of how such tools work. My inspiration was efs2, which appears to be a Go program - and Docker files. As an example, my job file for automating the install of a Consul client agent install looks like this:

#!/usr/bin/env bash

SCRIPT "${JOBFILE_DIR}/common.sh";

COPY "../consul/server.hcl" "/tmp/server.hcl"

RUN "sudo mv /tmp/server.hcl /etc/consul/server.hcl";
RUN "sudo chown root:root /etc/consul/server.hcl";
RUN "sudo apt-get update";
RUN "sudo apt-get install --yes hashicorp-consul-systemd-server";

RUN "sudo systemctl enable consul.service";
RUN "sudo systemctl restart consul.service";

...I'll be going through all steps in a moment. Of course, if there's the demand for it then I'll certainly write a post or 2 about my shell scripting setup here (comment below), but I still recommend another solution :P

Note that the firewall configuration is absent here - this is because I've set it to allow all traffic on the wgoverlay network interface, which I talked about in part 4. If you did want to configure the firewall, here are the rules you'd need to create:

sudo ufw allow 8301 comment consul-serf-lan;
sudo ufw allow 8300/tcp comment consul-rpc;
sudo ufw allow 8600 comment consul-dns;

Many other much more mature tools exist - you should use one of those instead of writing your own:

  • Ansible - uses YAML configuration files; organises things logically into 'playbooks' (personally I really seriously can't stand YAML, which is another reason for writing my own)
  • Puppet
  • and more

The other thing to be aware of is version control. You should absolutely put all your configuration files, scripts, Ansible playbooks, etc under version control. My preference is Git, but you can use anything you like. This will help provide a safety net in case you make an edit and break everything. It's also a pretty neat way to keep it all backed up by pushing it to a remote such as your own Git server (you do have automated backups, right?), GitHub, or GitLab.

Configuration

Now that we've got that sorted out, we need to deal with the configuration files. Let's do the server configuration file first. This is written in the Hashicorp Configuration Language. It's probably a good idea to get familiar with it - I have a feeling we'll be seeing a lot of it. Here's my full server configuration (at the time of typing, anyway - I'll try to keep this up-to-date).

bind_addr = "{{ GetInterfaceIP \"wgoverlay\" }}"

# When we have this many servers in the cluster, automatically run the first leadership election
# Remember that the Hashicorp stack uses the Raft consensus algorithm.
bootstrap_expect = 3
server = true
ui = true

client_addr = "127.0.0.1 {{ GetInterfaceIP \"docker0\" }}"

data_dir = "/srv/consul"
log_level = "INFO"

domain = "mooncarrot.space."


retry_join = [
    // "172.16.230.100"
    "bobsrockets",
    "seanssatellites",
    "tillystelescopes"
]

This might look rather unfamiliar. There's also a lot to talk about, so let's go through it bit by bit. If you haven't already, I do suggest coming up with an awesome naming scheme for your servers. You'll thank me later.

The other thing you'll need to do before continuing is buy a domain name. It sounds silly, but it's actually really important. As we talked about in part 3, we're going to be running our own internal DNS - and Consul is a huge part of this.

By default, Consul serves DNS under the .consul top-level-domain, which both unregistered and very bad practice (because it's unregistered). Someone could come along tomorrow and regsister and start using the .consul top-level domain, and then things would get awkward if you ever wanted to visit an external domain that ends in .consul that you're using internally.

I've chosen mooncarrot.space myself, but if you don't yet have one, I recommend taking your time and coming up with a really clever one you like - since you'll be using it for a long time - and updating it later is a pain in the behind. If you're looking for a recommendation for a DNS provider, I've been finding Gandi great so far (much better than GoDaddy, who have tons of hidden charges).

Once you've decided on a domain name (and bought it, if necessary), set it in the server configuration file via the domain directive:

domain = "mooncarrot.space."

Don't forget that trailing dot. As we learned in part 3, it's important, since it indicates an absolute domain name.

Also note that I'm using a subdomain of the domain in question here. This is because of an issue whereby I'm unable to get Unbound to forward that Consul is unable to resolve on to CloudFlare.

Another thing of note is the data_dir directive. Note that this is the data storage directive local to the specific node, not shared storage (we'll be tackling that in a future post).

The client_addr directive here tells Consul which network interfaces to bind the client API to. In our case, we're binding it to the local loopback (127.0.0.1) and the docker0 network interface by dynamically grabbing it's IP address - so that docker containers on the host can use the API.

The bind_addr directive is similar, but for the inter-node communication interfaces. This tells Consul that the other nodes in the Cluster are accessible over the wgoverlay interface that we setup in part 4. This is important, since Consul doesn't encrypt or authenticate it's traffic by default as far as I can tell - and I haven't yet found a good way to do this that doesn't involve putting a password directly into a configuration file.

In this way the WireGuard mesh VPN provides the encryption & authentication that Consul lacks by default (though I'm certainly going to be looking into it anyway).

bootstrap_expect is also interesting. If you've decided on a different number of Consul server nodes, then you should change this value to equal the number of server nodes you're going to have. 3, 5, and 7 are all good numbers - though don't go overboard. More servers means more overhead. Servers take more computing power than clients, so try not to have too many of them.

Finally, retry_join is also very important. It should contain the domain name of all the servers in the cluster. In my case, I'm using the be the hostnames of the other servers in the network, since Wesher (our WireGuard mesh VPN program) automatically adds the machine names of all the nodes in the VPN cluster to your /etc/hosts file. In this way we ensure that the Cluster always talks over the wgoverlay VPN network interface.

Oh yeah, and I should probably note here that your servers should not use FQDNs (Fully Qualified Domain Names) as their hostnames. I found out the hard way: it messes with Consul, and it ends up serving node IPs via DNS on something like harpsichord.mooncarrot.space.node.mooncarrot.space instead something sensible like harpsichord.node.mooncarrot.space. If anyone has a solution for this that doesn't involve using non-FQDNs as hostnames, I'd love to know (since FQDNs as hostnames is my preference).

That was a lot of words. Next, let's do the client configuration file:

bind_addr = "{{ GetInterfaceIP \"wgoverlay\" }}"

bootstrap = false
server = false

domain = "mooncarrot.space."

client_addr = "127.0.0.1 {{ GetInterfaceIP \"docker0\" }}"

data_dir = "/srv/consul"
log_level = "INFO"

retry_join = [
    "wopplefox",
    "spatterling",
    "sycadil"
]

Not much to talk about here, since the configuration is almost indentical to that of the server, except you don't have to tell it how many servers there are, and retry_join should contain the names of the servers that the client should try to join, as explained above.

Once you've copied the configuration files onto all the nodes in your cluster (/etc/consul/server.hcl for servers; /etc/consul/client.hcl for clients), it's now time to boot up the cluster. On all nodes (probably starting with the servers), do this:

# Enable the Consul service on boot
sudo systemctl enable consul.service
# Start the Consul service now
sudo systemctl start consul.service
# Or, you can do it all in 1 command:
sudo systemctl enable --now consul.service

It's probably a good idea to follow the cluster's progress along in the logs. On a server node, do this after starting the service & enabling start on boot:

sudo journalctl -u consul --follow

You'll probably see a number of error messages, but be patient - it can take a few minutes after starting Consul on all nodes for the first time for them to start talking to each other, sort themselves out, and calm down.

Now, you should have a working Consul cluster! On one of the server nodes, do this to list all the servers in the cluster:

consul members

If you like, you can also run this from your local machine. Simply install the consul package (but not the systemd service file), and make some configuration file adjustments. Update your ~/.bashrc on your local machine to add something like this:

export CONSUL_HTTP_ADDR=http://consul.service.mooncarrot.space:8500;

....replacing mooncarrot.space with your own domain, of course :P

Next, update the server configuration file to make the client_addr directive look like this:

client_addr = "127.0.0.1 {{ GetInterfaceIP \"docker0\" }} {{ GetInterfaceIP \"wgoverlay\" }}"

Upload the new version to your servers, and restart them one at a time (unless you're ok with downtime - I'm trying to practice avoiding downtime now so I know all the processes for later):

sudo systemctl restart consul.service

At this point, we've got a fully-functioning Consul cluster. I recommend following some of the official guides to learn more about how it works and what you can do with it.

Unbound

Before we finish for today, we've got 1 more task to complete. As I mentioned back in part 3, we're going to configure our DNS server to conditionally forward queries to Consul. The end result we're aiming for is best explained with a diagram:

A diagram showing how we're aiming for Unbound to resolve queries.

In short:

  1. Try the localzone data
  2. If nothing was found there (or it didn't match), see if it matches Consul's subdomain
  3. If so, forward the query to Consul and return the result
  4. If Consul couldn't resolve the query, forward it to CloudFlare via DNS-over-TLS

The only bit we're currently missing of this process is the Consul bit, which we're going to do now. Edit /etc/unbound/unbound.conf on your DNS server (mine is on my controller node), and insert the following:

###
# Consul
###
forward-zone:
    name: "node.mooncarrot.space."
    forward-addr: 127.0.0.1@8600
forward-zone:
    name: "service.mooncarrot.space."
    forward-addr: 127.0.0.1@8600

...replace mooncarrot.space. with your domain name (not forgetting the trailing dot, of course). Note here that we have 2 separate forward zones here.

Unfortunately, I can't seem to find a way to get Unbound to fall back to a more generic forward zone in the event that a more specific one is unable to resolve the query (I've tried both a forward-zone and a stub-zone). To this end, we need to define multiple more specific forward-zones if we want to be able to forward queries out to CloudFlare for additional DNS records. Here's an example:

  1. tuner.service.mooncarrot.space is an internal service that is resolved by Consul
  2. peppermint.mooncarrot.space is an externally defined DNS record defined with my registrar

If we then ask Unbound to resolve then both, only #1 will be resolved correctly. Unbound will do something like this for #2:

  • Check the local-zone for a record (not found)
  • Ask Consul (not found)
  • Return error

If you are not worried about defining DNS records in your registrar's web interface / whatever they use, then you can just do something like this instead:

###
# Consul
###
forward-zone:
    name: "mooncarrot.space."
    forward-addr: 127.0.0.1@8600

For advanced users, the Consul's documentation on the DNS interface is worth a read, as it gives the format of all DNS records Consul can service.

Note also here that the recursors configuration option is an alternative solution, but I don't see an option to force DNS-over-TLS queries there.

If you have a better solution, please get in touch by commenting below or answering my ServerFault question.

With this done, you should be able to ask Consul what the IP address of any node in the cluster is like so:

dig +short harpsichord.node.mooncarrot.space
dig +short grandpiano.node.mooncarrot.space
dig +short oboe.node.mooncarrot.space
dig +short some_machine_name_here.node.mooncarrot.space

Again, you'll need to replace mooncarrot.space of course with your domain name.

Conclusion

Phew! There was a lot of steps and moving parts to setup in this post. Take your time, and re-read this post a few times to make sure you've got all your ducks in a row. Make sure to test your new Consul cluster by reading the official guides as mentioned above too, as it'll cause lots of issues later if you've got bugs lurking around in Consul.

I can't promise that it's going to get easier in future posts - it's probably going to get a lot more complicated with lots more to keep track of, so make sure you have a solid understanding of what we've done so far before continuing.

To summarise, we've managed to setup a brand-new Consul cluster. We've configured Unbound to forward queries to Consul, to enable seamless service discovery (regardless of which host things are running on) later on.

We've also picked an automation framework for automating the setup and configuration of the various services and such we'll be configuring. I recommend taking some time to get to know your chosen framework - many have lots of built-in modules to make common tasks much easier. Try going back to previous posts in this series (links at the top of this post) and implementing them in your chosen framework.

Finally, we've learnt a bit about version control and it's importance in managing configuration files.

In the next few posts in this series (unless I get distracted - likely - or have a change of plans), we're going to be setting up Nomad, the task scheduler that will be responsible for managing what runs where and informing Consul of this. We're also going to be setting up and configuring a private Docker registry and Traefik - the latter of which is known as an edge router (more on that in future posts).

See you next time, when we dive further down into what's looking less like a rabbit hole and more like a cavernous sinkhole of epic proportions.

Found this useful? Confused about something? Got a suggestion? Comment below! It's really awesome to hear that my blog posts have helped someone else out.

Cluster, Part 5: Staying current | Automating apt updates and using apt-cacher-ng

Hey there! Welcome to another cluster blog post. In the last post, we looked at setting up a WireGuard mesh VPN as a trusted private network for management interfaces and inter-cluster communication. As a refresher, here's a list of all the parts in this series so far:

Before we get to the Hashicorp stack though (next week, I promise!), there's an important housekeeping matter we need to attend to: managing updates.

In Debian-based Linux distributions such as Raspbian (that I'm using on my Raspberry Pis), updates are installed through apt - and this covers everything from the kernel to the programs we have installed - hence the desire to package everything we're going to be using to enable easy installation and updating.

There are a number of different related command-line tools, but the ones we're interested in are apt (the easy-to-use front-end CLI) and apt-get (the original tool for installing updates).

There are 2 key issues we need to take care of:

  1. Automating the installation of packages updates
  2. Caching the latest packages locally to avoid needless refetching

Caching package updates locally with apt-cacher-ng

Issue #2 here is slightly easier to tackle, surprisingly enough, so we'll do that first. We want to cache the latest packages locally, because if we have lots of machines in our cluster (I have 5, all running Raspbian), then when they update they all have to download the package lists and the packages themselves from the remote sources each time. Not only is this bandwidth-inefficient, but it takes longer and puts more strain on the remote servers too.

For apt, this can be achieved through the use of apt-cacher-ng. Other distributions require different tools - in particular I'll be researching and tackling Alpine Linux's apk package manager in a future post in this series - since I intend to use Alpine Linux as my primary base image for my Docker containers (I also intend to build my own Docker containers from scratch too, so that will be an interesting experience that will likely result in a few posts too).

Anyway, installation is pretty simple:

sudo apt install apt-cacher-ng

Once done, there's a little bit of tuning we need to attend to. apt-cacher-ng by default listens for HTTP requests on TCP port 3142 and has an administrative interface at /acng-report.html. This admin interface is not, by default, secured - so this is something we should do before opening a hole in the firewall.

This can be done by editing the /etc/apt-cacher-ng/security.conf configuration file. It should read something like this:


# This file contains confidential data and should be protected with file
# permissions from being read by untrusted users.
#
# NOTE: permissions are fixated with dpkg-statoverride on Debian systems.
# Read its manual page for details.

# Basic authentication with username and password, required to
# visit pages with administrative functionality. Format: username:password

AdminAuth: username:password

....you may need to use sudo to view and edit it. Replace username and password with your own username and a long unguessable password that's not similar to any existing passwords you have (especially since it's stored in plain text!).

Then we can (re) start apt-cacher-ng:

sudo systemctl enable apt-cacher-ng
sudo systemctl restart apt-cacher-ng

The last thing we need to do here is to punch a hole through the firewall, if required. As I explained in the previous post, I'm using a WireGuard mesh VPN, so I'm allowing all traffic on that interface (for reasons that will - eventually - come clear), so I don't need to open a separate hole in my firewall unless I want other devices on my network to use it too (which wouldn't be a bad idea, all things considered).

Anyway, ufw can be configured like so:

sudo ufw allow 3142/tcp comment apt-cacher-ng

With the apt-cacher server installed and configured, you can now get apt to use it:

echo 'Acquire::http { Proxy \"http://X.Y.Z.W:3142\"; }' | sudo tee -a /etc/apt/apt.conf.d/proxy

....replacing X.Y.Z.W with the IP address (or hostname!) of your apt-cacher-ng server. Note that it will get upset if you use https anywhere in your apt sources, so you'll have to inspect /etc/apt/sources.list and all the files in /etc/apt/sources.list.d/ manually and update them.

Automatic updates with unattended-upgrades

Next on the list is installing updates automatically. This is useful because we don't want to have to manually install updates every day on every node in the cluster. There are positives and negatives about installing updates - I recommend giving the top of this article a read.

First, we need to install unattended-upgrades:

sudo apt install unattended-upgrades

Then, we need to edit the /etc/apt/apt.conf.d/50unattended-upgrades file - don't forget to sudo.

Unfortunately, I haven't yet automated this process (or properly developed a replacement configuration place that can be automatically placed on a target system by a script), so for now we'll have to do this manually (the mssh command might come in handy).

First, find the line that starts with Unattended-Upgrade::Origins-Pattern, and uncomment the lines that end in -updates, label=Debian, label=Debian-Security. For Raspberry Pi users, add the following lines in that bit too:

"origin=Raspbian,codename=${distro_codename},label=Raspbian";

// Additionally, for those running Raspbian on a Raspberry Pi,
// match packages from the Raspberry Pi Foundation as well.
"origin=Raspberry Pi Foundation,codename=${distro_codename},label=Raspberry Pi Foundation";

unattended-upgrades will only install packages that are matched by a list of origins. Unfortunately, the way that you specify which updates to install is a total mess, and it's not obvious how to configure it. I did find an Ask Ubuntu answer that explains how to get unattended-upgrades to install updates. If anyone knows of a cleaner way of doing this - I'd love to know.

The other decision to make here is whether you'd like your hosts to automatically reboot. This could be disruptive, so only enable it if you're sure that it won't interrupt any long-running tasks.

To enable it, find the line that starts with Unattended-Upgrade::Automatic-Reboot and set it to true (uncommenting it if necessary). Then find the Unattended-Upgrade::Automatic-Reboot-Time setting and set it to a time of day you're ok with it rebooting at - e.g. 03:00 for 3am in the morning - but take care to avoid all your servers from rebooting at the same time, as this might cause issues later.

A few other settings need to be updated too. Here are they are, with their correct values:

APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Download-Upgradeable-Packages "1";
APT::Periodic::Unattended-Upgrade "1";
APT::Periodic::AutocleanInterval "7";

Make sure you find the existing settings and update them, because otherwise if you just paste these in, they may get overridden. In short these settings:

  • Enable automatic updates to the package metadata indexes
  • Downloads upgradeable packages
  • Installs downloaded updates
  • Automatically cleans the cache up every 7 days

Once done, save and close that file. Finally, we need to enable and start the unattended-upgrades service:

sudo systemctl enable unattended-upgrades
sudo systemctl restart unattended-upgrades

To learn more about automatic upgrades, these guides might help shed some additional light on the subject:

Conclusion

In this post, we've taken a look at apt package caching and unattended-upgrades. In the future, I'm probably going to have to sit down and either find an alternative to unattended-upgrades that's easier to configure, or rewrite the entire configuration file and create my own version. Comments and suggestions on this are welcome in the comments.

In the next post, we'll be finally getting to the Hashicorp stack by installing and configuring Consul. Hold on to your hats - next week's post is significantly complicated.

Edit 2020-05-09: Add missing instructions on how to get apt to actually use an apt-cacher-ng server

Cluster, Part 4: Weaving Wormholes | Peer-to-Peer VPN with WireGuard

(Above: The WireGuard and wesher logos. Background photo: from Unsplash by Clint Adair.)

Hey - welcome back! Last week, we set Unbound up as our primary DNS server for our network. We also configured cluster member devices to use it for DNS resolution. As a recap, here are links to the all the posts in this series so far:

In this part, we're going to setup a WireGuard) peer-to-peer VPN. This is a good idea for several reasons:

  • It provides defence-in-depth
  • It protects non-encrypted / unprotected private services from the rest of the network

The latter point here is particularly important - especially if you've got other device on your network like I have. If you're somehow following along with this series with devices fancy enough to have multiple network interfaces, you can connect the 2nd network interface of every server to a separate switch, that doesn't connect to anywhere else. Don't forget that you'll need to setup a DHCP server on this new mini-network (or configure static IPs manually on each device, which I don't recommend) - but this is out-of-scope of this article.

In the absence of such an opportunity, a peer-to-peer VPN should do the trick just as well. We're going to be using WireGuard), which I discovered recently. It's very cool indeed - and it's apparently special enough to be merged directly into the Linux Kernel v5.6! It's been given high praise from security experts too.

What I like most is it's simplicity. It follows the UNIX Philosophy, and as such while it's very simple in the way it works, it can be used and applied in so many different ways to solve so many different problems.

With this in mind, let's get to installing it! If you're on an Ubuntu or Debian-based machine, then you should just be able to install it directly:

sudo apt install wireguard

Note that you might have to have your kernel's development headers installed if you experience issues. For Raspbian users (like me), installation is slightly more complicated. We'll need to setup the debian-backports apt repository to pull it in, since the Debian developers have backported it to the latest stable version of Debian (e.g. like a hotfix) - but Raspbian hasn't yet pulled it in. This is done in 2 steps:

# Add the debian-backports GPG key
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 04EE7237B7D453EC 648ACFD622F3D138
# Add the debian-backports apt repo
echo 'deb http://httpredir.debian.org/debian buster-backports main contrib non-free' | sudo tee /etc/apt/sources.list.d/debian-backports.list

Now, we should be able to update to download the new repository metadata:

sudo apt update

Next, we need to install the Raspberry Pi Linux kernel headers. Unlike other distributions (which use the linux-headers-generic package), this is done in slightly different way:

sudo apt install raspberrypi-kernel-headers

This might take a while. Once done, we can now install WireGuard itself:

sudo apt install wireguard

Again, this will take a while. Don't forget to pay close attention to the output - I've noticed that it's fond of throwing error messages, but not actually counting it as an error and ultimately claiming it completed successfully.

With this installed, we can now setup our WireGuard peer-to-peer VPN. WireGuard itself works on a public-private keypair per-device setup. A device first generates a keypair, and then the public key thereof needs copying to all other devices it wants to connect to. In this fashion, both a peer-to-peer setup (like we're after), and a client-server setup (like a more traditional VPN such as IPSec or OpenVPN) can be configured.

The overhead of configuring such a peer-to-peer WireGuard VPN is considerable though, since every time a device is added to the VPN every existing device needs updating to add the public key thereof to establish communications between the 2.

While researching an easier solution to the problem, I came across wesher, which does much of the heavy-lifting for you. It does of course come at the cost of slightly reduced security (since the entire VPN network is protected by a single pre-shared key) and reduced configurability, but from my experiences so far it works like a charm for my purposes - and it eases management too.

It is distributed as a single Go binary, that uses the Raft Consensus Algorithm (the same library that Nomad and Consul use actually) to manage cluster membership and provide self-healing properties. This makes is very easy for my to package into an apt package, which I've added to my apt repository.

The instructions to add my apt repository can be found on it's page, but here they are directly:

# Add the repository
echo "deb https://apt.starbeamrainbowlabs.com/ ./ # apt.starbeamrainbowlabs.com" | sudo tee /etc/apt/sources.list.d/sbrl.list
# Import the signing key
wget -q https://apt.starbeamrainbowlabs.com/aptosaurus.asc -O- | sudo apt-key add -
# Alternatively, import the signing key from a keyserver:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys D48D801C6A66A5D8
# Update apt's cache
sudo apt update

Don't forget that if you're using a caching server like apt-cacher-ng (which we'll be setting up in the next post in this series), you'll probably want to change the https in the first command there to regular http in order for the caching to take effect. Note that this doesn't affect the security of the downloaded packages, since apt will verify the integrity and GPG signature of all packages downloaded. It only affects the privacy of the connection used to download the packages themselves (more on this in a future post).

Once setup, install wesher like so:

sudo apt install wesher

If you've got a systemd-based system (as I have sadly with Raspbian), I provide a systemd service file in a separate package:

sudo apt install wesher-systemd

Don't forget to perform these steps on all the machines you want to enter the cluster. Next, we need to configure our firewall. I'm using UFW - I hope you set up something similar when you first configured your servers you're clustering (I recommend UFW. Note also that this tutorial series will not cover the basics like this - instead, I'll link to other tutorials and such as I think of them). For UFW users, do this:

sudo ufw allow 7946 comment wesher-gossip
sudo ufw allow 51820/udp comment wesher-wireguard

Wesher requires 2 ports: 1 for the clustering traffic for managing cluster membership, and another for the WireGuard traffic itself. Now, we can initialise the cluster. This has to be done on an interactive terminal for the first time, to avoid leaking the cluster's pre-shared key to log files. Do it like this in a terminal:

sudo wesher

It should tell you that it's generated a new cluster key - it will save it in a private configuration directory. Save this somewhere safe, such as in your password manager. Now, you can press Ctrl + C to quit, and start the systemd service:

sudo systemctl enable --now wesher.service

It's perhaps good practice to check that the service has started successfully:

sudo systemctl status wesher.service

Having 1 node setup is nice, but not very useful. Adding additional nodes to the cluster is a bit different. Follow this tutorial up to and including the installation of the wesher and wesher-systemd packages, and then instead of just doing sudo wesher, do this instead:

sudo wesher --cluster-key CLUSTER_KEY_HERE --join IP_OF_ANOTHER_NODE --overlay-net 172.31.250.0/16 --log-level info

...replacing CLUSTER_KEY_HERE with your cluster key (don't forget to prefix the entire command with a space to avoid it from entering your shell history file), and IP_OF_ANOTHER_NODE with the IP address of another node in the cluster on the external untrusted network. Note that the --overlay-net there is required because of the way I wrote the systemd service file in the wesher-systemd package. I did this:

[Unit]
Description=wesher - wireguard mesh builder
After=network-online.target
After=syslog.target
After=rsyslog.service

[Service]
EnvironmentFile=-/etc/default/wesher
ExecStart=/usr/local/sbin/wesher --overlay-net 172.31.250.0/16 --log-level info
Restart=on-failure
Type=simple
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=wesher

[Install]
WantedBy = multi-user.target

I explicitly specify the subnet of the VPN network here to avoid clashes with other networks in the 10.0.0.0/8 range. I don't have a network in that range, but I know that others do. Unfortunately there's currently a known bug that means that IP address collisions may occur, and the cluster can't currently sort them out. So you need to use a pretty large subnet for now to avoid such collisions (here's hoping this one is patched soon).

Note that if you want to set additional configuration options, you can do so in /etc/default/wesher in the format VAR_NAME=VALUE - 1 per line. A full reference of the supported environment variables can be found here.

Anyway, once wesher has joined the cluster on the new node, press Ctrl + C to exit and then start and enable the systemd service as before (note that wesher saves all the configuration details to a configuration directory, so the cluster key doesn't need to be provided more than once):

sudo systemctl enable --now wesher.service
sudo systemctl status wesher.service

With this, you should have a fully-functional Wireguard peer-to-peer VPN setup with wesher. You can ask a node as to what IP address it has on the VPN by using the following command:

ip address

The IP address should be shown next to the wgoverlay network interface.

The last thing to do here is to configure our Firewall. In my case, I'm using UFW, so the instructions I include here will be specific to UFW (if you use another firewall, translate these commands for your firewall, and command below!).

In my specific case, I want to take the unusual step of allowing all traffic in on the VPN. The reason for this will become apparent in a future post, but in short Nomad dynamically allocates outward-facing ports for services. There's a good reason for this I'll get into at the time, but for now, this is how you'd do that:

sudo ufw allow in on wgoverlay

Of course, we can add override rules here that block traffic if we need to. Note that this only allows in all traffic on the wgoverlay network interface and not all network interfaces as my previous blog about UFW would have done - i.e. like this:

sudo ufw allow 1234/tcp

In the next part, we'll take a look at setting up an apt caching server to improve performance and reduce Internet data usage when downloading system updates.

Found this useful? Spotted a mistake? Confused about something? Comment below! It really helps motivate me to write these posts.

Cluster, Part 3: Laying groundwork with Unbound as a DNS server

Looking up at the canopy of some trees

(Above: A picture from my wallpapers folder. If you took this, comment below and I'll credit you)

Welcome to another blog post about my cluster! Although I had to replace the ATX PSU I was planning on using to power the thing with a USB power supply instead, I've got all the Pis powered up and networked together now - so it's time to finally start on the really interesting bit!

In this post, I'm going to show you how to install Unbound as a DNS server. This will underpin the entire stack we're going to be setting up - and over the next few posts in this series, I'll take you through the entire process.

Starting with DNS is a logical choice, which comes with several benefits:

  • We get a minor performance boost by caching DNS queries locally
  • We get to configure our DNS server to make encrypted DNS-over-TLS queries on behalf of the entire network (even if some of the connected devices - e.g. phones don't support doing so)
  • If we later change our minds and want to shuffle around the IP addresses, it's not such a big deal as if we used IP addresses in all our config files

With this in mind, I starting with DNS before moving on to Docker and the Hashicorp stack:

A picture of the homepages of the Hashicorp stack

Before we begin, let's set out our goals:

  • We want a caching resolver, to avoid repeated requests across the Internet for the same query
  • We want to encrypt queries that leave the network via DNS-over-TLS
  • We want to be able to add our own custom DNS records for a domain of our choosing, for internal resolution only.

The last point there is particularly important. We want to resolve something like 172.16.230.100 to server1.bobsrockets.com internally, but not externally outside the network. This way we can include server1.bobsrockets.com in config files, and if the IP changes then we don't have to go back and edit all our config files - just reload or restart the relevant services.

Without further delay, let's start by installing unbound:

sudo apt install unbound

If you're on another system, translate this for your package manager. See this amazing wiki page for help on translating package manager commands :-)

Next, we need to write the config file. The default config file for me it located at /etc/unbound/unbound.conf:

# Unbound configuration file for Debian.
#
# See the unbound.conf(5) man page.
#
# See /usr/share/doc/unbound/examples/unbound.conf for a commented
# reference config file.
#
# The following line includes additional configuration files from the
# /etc/unbound/unbound.conf.d directory.
include: "/etc/unbound/unbound.conf.d/*.conf"

There's not a lot in the /etc/unbound/unbound.conf.d/ directory, so I'm going to be extending /etc/unbound/unbound.conf. First, we're going to define a section to get Unbound to forward requests via DNS-over-TLS:

forward-zone:
    name: "."
    # Cloudflare DNS
    forward-addr: 1.0.0.1@853
    # DNSlify - ref https://www.dnslify.com/services/resolver/
    forward-addr: 185.235.81.1@853
    forward-ssl-upstream: yes

The . name there simply means "everything". If you haven't seen it before, the fully-qualified domain name for seanssatellites.io for example is as follows:

seanssatellites.io.

Notice the extra trailing dot . there. That's really important, as it signifies the DNS root (not sure on it's technical name. Comment if you know it, and I'll update this). The io bit is the top-level domain (commonly abbreviated as TLD). seanssatellites is the actual domain bit that you buy.

It's a hierarchical structure, and apparently used to be inverted here in the UK before the formal standard was defined by the IETF (Internet Engineering Task Force) - of which RFC 1034 was a part.

Anyway, now that we've told Unbound to forward queries, the next order of business is to define a bunch of server settings to get it to behave the way we want it to:

server:
    interface: 0.0.0.0
    interface: ::0

    ip-freebind: yes

    # Access control - default is to deny everything apparently

    # The local network
    access-control: 172.16.230.0/24 allow
    # The docker interface
    access-control: 172.17.0.1/16 allow

    username: "unbound"

    harden-algo-downgrade: yes
    unwanted-reply-threshold: 10000000


    prefetch: yes

There's a lot going on here, so let's break it down.

Property Meaning
interface Tells unbound what interfaces to listen on. In this case I tell it to listen on all interfaces on both IPv4 and IPv6.
ip-freebind Tells it to try and listen on interfaces that aren't yet up. You probably don't need this, so you can remove it. I'm including it here because I'm currently unsure whether unbound will start before docker, which I'm going to be using extensively. In the future I'll probably test this and remove this directive.
access-control unbound has an access control system, which functions rather like a firewall from what I can tell. I haven't had the time yet to experiment (you'll be seeing that a lot), but once I've got my core cluster up and running I intend to experiment and play with it extensively, so expect more in the future from this.
username The name of the user on the system that unbound should run as.
harden-algo-downgrade Protect against downgrade attacks when making encrypted connections. For some reason the default is to set this to no, so I enable it here.
unwanted-reply-threshold Another security-hardening directive. If this many DNS replies are received that unbound didn't ask for, then take protective actions such as emptying the cache just in case of a DNS cache poisoning attack
prefetch Causes unbound to prefetch updated DNS records for cache entries that are about to expire. Should improve performance slightly.

If you have a flaky Internet connection, you can also get Unbound to return stale DNS cache entries if it can't reach the remote DNS server. Do that like this:

server:
    # Service expired cached responses, but only after a failed 
    # attempt to fetch from upstream, and 10 seconds after 
    # expiration. Retry every 10s to see if we can get a
    # response from upstream.
    serve-expired: yes
    serve-expired-ttl: 10
    serve-expired-ttl-reset: yes

With this, we should have a fully-functional DNS server. Enable it to start on boot and (re)start it now:

sudo systemctl enable unbound.service
sudo systemctl restart unbound.service

If it's not already started, the restart action will start it.

Internal DNS records

If you're like me and want some custom DNS records, then continue reading. Unbound has a pretty nifty way of declaring custom internal DNS records. Let's enable that now. First, you'll need a domain name that you want to return custom internal DNS records for. I recommend buying one - don't use an unregistered one, just in case someone else comes along and registers it.

Gandi are a pretty cool registrar - I can recommend them. Cloudflare are also cool, but they don't allow you to register several years at once yet - so they are probably best set as the name servers for your domain (that's a free service), leaving your domain name with a registrar like Gandi.

To return custom DNS records for a domain name, we need to tell unbound that it may contain private DNS records. Let's do that now:

server:
    private-domain: "mooncarrot.space"

This of course goes in /etc/unbound/unbound.conf, as before. See the bottom of this post for the completed configuration file.

Next, we need to define some DNS records:

server:
    local-zone: "mooncarrot.space." transparent
    local-data: "controller.mooncarrot.space.   IN A 172.16.230.100"
    local-data: "piano.mooncarrot.space.        IN A 172.16.230.101"
    local-data: "harpsichord.mooncarrot.space.  IN A 172.16.230.102"
    local-data: "saxophone.mooncarrot.space.    IN A 172.16.230.103"
    local-data: "bag.mooncarrot.space.          IN A 172.16.230.104"

    local-data-ptr: "172.16.230.100 controller.mooncarrot.space."
    local-data-ptr: "172.16.230.101 piano.mooncarrot.space."
    local-data-ptr: "172.16.230.102 harpsichord.mooncarrot.space."
    local-data-ptr: "172.16.230.103 saxophone.mooncarrot.space."
    local-data-ptr: "172.16.230.104 bag.mooncarrot.space."

The local-zone directive tells it that we're defining custom DNS records for the given domain name. The transparent bit tells it that if it can't resolve using the custom records, to forward it out to the Internet instead. Other interesting values include:

Value Meaning
deny Serve local data (if any), otherwise drop the query.
refuse Serve local data (if any), otherwise reply with an error.
static Serve local data, otherwise reply with an nxdomain or nodata answer (similar to the reponses you'd expect from a DNS server that's authoritative for the domain).
transparent Respond with local data, but resolve other queries normally if the answer isn't found locally.
redirect serves the zone data for any subdomain in the zone.
inform The same as transparent, but logs client IP addresses
inform_deny Drops queries and logs client IP addresses

Adapted from /usr/share/doc/unbound/examples/unbound.conf, the example Unbound configuration file.

Others exist too if you need even more control, like always_refuse (which always responds with an error message).

The local-data directives define the custom DNS records we want Unbound to return, in DNS records syntax (again, if there's an official name for the syntax leave a comment below). The local-data-ptr directive is a shortcut for defining PTR, or reverse DNS records - which resolve IP addresses to their respective domain names (useful for various things, but also commonly used as a step to verify email servers - comment below and I'll blog on lots of other shady and not so shady techniques used here).

With that, our configuration file is complete. Here's the full configuration file in it's entirety:

# Unbound configuration file for Debian.
#
# See the unbound.conf(5) man page.
#
# See /usr/share/doc/unbound/examples/unbound.conf for a commented
# reference config file.
#
# The following line includes additional configuration files from the
# /etc/unbound/unbound.conf.d directory.
include: "/etc/unbound/unbound.conf.d/*.conf"

server:
    interface: 0.0.0.0
    interface: ::0

    ip-freebind: yes

    # Access control - default is to deny everything apparently

    # The local network
    access-control: 172.16.230.0/24 allow
    # The docker interface
    access-control: 172.17.0.1/16 allow

    username: "unbound"

    harden-algo-downgrade: yes
    unwanted-reply-threshold: 10000000

    private-domain: "mooncarrot.space"

    prefetch: yes

    # ?????? https://www.internic.net/domain/named.cache

    # Service expired cached responses, but only after a failed 
    # attempt to fetch from upstream, and 10 seconds after 
    # expiration. Retry every 10s to see if we can get a
    # response from upstream.
    serve-expired: yes
    serve-expired-ttl: 10
    serve-expired-ttl-reset: yes

    local-zone: "mooncarrot.space." transparent
    local-data: "controller.mooncarrot.space.   IN A 172.16.230.100"
    local-data: "piano.mooncarrot.space.        IN A 172.16.230.101"
    local-data: "harpsichord.mooncarrot.space.  IN A 172.16.230.102"
    local-data: "saxophone.mooncarrot.space.    IN A 172.16.230.103"
    local-data: "bag.mooncarrot.space.          IN A 172.16.230.104"

    local-data-ptr: "172.16.230.100 controller.mooncarrot.space."
    local-data-ptr: "172.16.230.101 piano.mooncarrot.space."
    local-data-ptr: "172.16.230.102 harpsichord.mooncarrot.space."
    local-data-ptr: "172.16.230.103 saxophone.mooncarrot.space."
    local-data-ptr: "172.16.230.104 bag.mooncarrot.space."

    fast-server-permil: 500

forward-zone:
    name: "."
    # Cloudflare DNS
    forward-addr: 1.0.0.1@853
    # DNSlify - ref https://www.dnslify.com/services/resolver/
    forward-addr: 185.235.81.1@853
    forward-ssl-upstream: yes

Where do we go from here?

Remember that it's important to not just copy and paste a configuration file, but to understand what every single line of it does.

In a future post in this series, we'll be revising this to forward requests for *.service.mooncarrot.space to Consul, a clustered service discovery system that keeps track of what is running where and presents a DNS server as an interface (there are others).

In the next post, we'll (probably) be looking at setting up Consul - unless something else happens first :P Nomad should be next after that, followed closely by Vault.

Once I've got all that set up (which is a blog post series in and of itself!), I'll then look at encrypting all communications between all nodes in the cluster. After that, we'll (finally?) get to Docker and my plans there. Other things include an apt and apk (the Alpine Linux package manager) caching servers - which will have to be tackled separately.

Oh yeah, and I want to explore Traefik, which is apparently like Nginx, but intended for containerised infrastructure.

All this is definitely going to keep me very busy!

Found this interesting? Got a suggestion? Comment below!

Sources and Further Reading

Cluster, Part 2: Grand Designs

In the last part of this series, I talked about my plans for building an ARM-based cluster, because I'm growing out of the Raspberry Pi 3B+ I currently have at home. Since then, I have decided to focus on the compute cluster first, as I have a reasonable amount of room left on the 1tB WD Pidrive I have attached to my existing Raspberry Pi 3B+.

Hardware

To this end, I have been busy ordering parts and organising things to get construction of the compute cluster side of things going. The most important part of the whole cluster is the compute boards themselves. I've decided to go with 4 x Raspberry Pi 4s with 4GB RAM each for the worker nodes, and 1 x Raspberry Pi 4 with 2GB of RAM as the controller (it would have been a 1GB RAM model, but a recent announcement changed my mind :D):

(Above: The Raspberry Pi 4s I'm going to be using. The colourful heatsink cases there are to dissipate heat passively if possible and reduce the need for the fan to run as often. The one with the smaller red heatsink is the controller node - I don't anticipate the load on that node being high enough to need a bigger more expensive heatsink)

My reasoning for Raspberry Pis is software support. They are hugely popular - and from experience I can tell that they are pretty well supported on the software side of things. Issues with hardware features not being supported by the operating system are minimal - and where issues do arise they are more often than not sorted out. Regular kernel security updates are also provided - something that isn't always a thing with Linux distributions for other boards I've noticed.

Although the nodes in the cluster are very important, they are far from the only component I'll need. I'll also need a way to power it - which I've settled on an using a desktop ATX power supply (generously donated by University).

(Above: The ATX power supply, with a few wires cut and other bits and bobs attached. As of this blog post I'm in the middle of wiring it up, so I haven't finished it yet)

This adds some additional complications though, because wiring an ATX power supply up to a fleet of Raspberry Pi 4s isn't as easy as it sounds. To do that, I've decided to wire the 5V and ground wires up to 5 USB type-a breakout boards, with a 3 amp self-resettable fuse on each live (red) wire. Then I can use 5 short type-a to type-c converter cables to power the Raspberry Pi 4s.

(Above: The extra bits and bobs laid out that I'll be using to wire the ATX power supply up to the USB type-a breakout boards. From left to right: 3A self-resettable fuses, 18 AWG wire, Wagos, header pins, and finally the USB type-a breakout boards themselves)

With power to the Raspberry Pis, the core compute hardware is in place. I still need a bunch of things around the edges though, such as a (very quiet) fan to keep it cool:

(Above: A Noctua NF-P14s redux-1200)

I found this particular fan on quietpc.com. While their prices and shipping are somewhat expensive (I didn't actually buy it from there - I got a better deal on Amazon instead), they are a great place to look into the different options available for really quiet fans. I'm pretty sensitive to noise, so having a quiet fan is an important part of my cluster design.

This one is the large 14cm model, so that it fits in front of all 5 Raspberry Pis if they are stood up on their sides and stacked horizontally. It takes 12 volts, so I'll be connecting it to the 12V rail from the ATX power supply. The fan speed is also controllable via PWM (pulse-width modulation), so I plan on using an Arduino (probably one of the Arduino Unos I've got lying around) to control it and present a serial interface or something to the Raspberry Pi that's acting as the controller node in the cluster.

Lastly, another extremely important part of any cluster is a solid switch. Without a great switch at the base of the network, you'll have all sorts of connection issues and the performance of the cluster will be degraded significantly. I'm anticipating that I'll want to transfer significant amounts of data around very quickly (e.g. Docker container images, and later large blocks of data during a storage cluster rebalance).

For this reason, I've bought myself a Netgear GS116v2. While its unmanaged, I can't currently afford a more expensive managed switch at this time. It is however gigabit and also has an array of other features such as energy efficient ethernet (802.3az), full duplex gigabit (i.e. 32GB bandwidth available to all ports, which is enough for all ports to be transmitting and receiving gigabit at the same time), and a silent fanless design.

My Netgear GS116v2

(Above: The switch I'll be using. I watched eBay and got it used for much less than it's available new)

Networking

Hardware isn't the only thing I've been thinking about. While I've been waiting for packages to arrive, I've also been planning out the software I'm going to use and how I'm going to network all my Pis together.

My plans on the networking side of things are subject to significant change depending on how many responsibilities I can convince my home router to give up, but I have drawn up a network diagram showing what I'm currently aiming towards:

An ideal-case scenario network diagram. Explained below.

The cluster is represented on the left half of the diagram. This will probably entail some considerable persuasion of my router to pull off, but a quick look reveals that it's (probably) possible with some trial-and-error.

The idea is that I have a separate subnet for the cluster than the rest of the home network. Then I can do strange stuff and fiddle with it (hopefully) without affecting everyone else on the network.

Software

Meanwhile, out of all the different aspects of building this cluster I've got the clearest picture as to the software I'm going to be using.

I've decided that I'm going to use a container-based system. I've looked at a number of different options (such as podman and Singularity) - but I'm currently of the opinion that Docker is the most suitable option for what I'm going for. It's not as enterprisey as Singularity, and it seems to be more mature than podman. It also has a huge library of prebuilt containers too - but for learning purposes I'm going to be writing almost all my container scripts from scratch - probably using some sort of Alpine Linux container as a base. If I ever run into a situation where Docker isn't suitable and I need something closer to a VM, I'll probably use LXC, which I believe sits on top of the same underlying container runtime that Docker does.

I'm anticipating that container-based tech is going to be great for managing the stuff that's running on my cluster - so you can expect more posts that go into some depth about how it all works and how I'm setting my system up in the future.

To complement my container-based tech, I'm also going to be using a workload orchestrator. The Viper High-Performance Computer I've recently gained access to has lots of nodes in it and uses Slurm for workload orchestration, but that seems more geared towards environments that have lots of jobs that each have a defined running time. Great for scientific simulations and other such things, but not so great for personal self-hosted applications and the like.

Instead, I'm probably going to use Nomad. It looks seriously cool, and an initial look at the documentation reveals that it's probably going to be much simpler easier to understand than Kubernetes (see also), which seems to be the other competing software in the business. It also seems to integrate well with other programs done by the same company (Hashicorp) like Consul for service networking management (I'm hoping I can get DNS resolution for the services running on the cluster under control with it) and Vault for secret management (e.g. API keys, passwords, and other miscellaneous secrets) - all of which I'm going to install and experiment with (expect more on that soon).

All of those for now will be backed by an NFS share on all nodes in the cluster for the persistent volumes attached to running containers.

On the controller node I mentioned earlier I'm also going to be running a few extra items to aid in the management of the cluster:

  • A Docker registry, from which the worker nodes will be pulling containers for execution (worker nodes will not have access to the public Docker registry at hub.docker.com)
  • An apt caching proxy - probably apt-cacher-ng. Since all the nodes in the cluster are going to be using the same OS, have the same packages installed, and the same configuration settings etc, it doesn't make much sense for them to be downloading apt packages from the Internet every time - so I'll be caching them locally on the controller node
  • Potentially some sort of reverse proxy that sits in front of all the services running on the cluster, but I haven't decided on how this will fit into the larger puzzle just yet (more research is required). I'm already very familiar with Nginx, but I've seen Traefik recommended for dynamic container-based setups, so I'm going to investigate that too.

That about covers my high-level design ideas. As of the time of typing, the next thing I need to do is organise a case for it all to go in, fix the loose connections in the screw terminals (not pictured; they arrived after I took the pictures), and then find a place to put it....

Cluster, Part 1: Answers only lead to more questions

At home, I have a Raspberry Pi 3B+ as a home file server. Lately though, I've been noticing that I've been starting to grow out of it (both in terms of compute capacity and storage) - so I've decided to get thinking early about what I can do about it.

I thought of 2 different options pretty quickly:

  • Build a 'proper' server
  • Build a cluster instead

While both of these options are perfectly viable and would serve my needs well, one of them is distinctly more interesting than the other - that being a cluster. While having a 'proper' server would be much simpler, perhaps slightly more power efficient (though I would need tests to confirm, since ARM - the CPU architecture I'm planning on using for the cluster - is more power efficient), and would put all my system resources on the same box, I like the idea of building a cluster for a number of reasons.

For one, I'll learn new skills setting it up and managing it. So far, I've been mostly managing my servers by hand. When you start to acquire a number of machines though, this quickly becomes unwieldy. I'd like to experiment with a containerisation technology (I'm not sure which one yet) and play around with distributing containers across hosts - and auto-restarting them on a different host if 1 host goes down. If this is decentralised, even better!

For another, having a single larger server is a single point of failure - which would be relatively expensive to replace. If I use lots of small machines instead, then if 1 dies then not only is it cheaper to replace, but it's also not as urgent since the other machines in the cluster can take over while I order a replacement.

Finally, having a cluster is just cool. Do we really need more of a reason than this?

With all this in mind, I've been thinking quite a bit about the architecture of such a cluster. I haven't bought anything yet (and probably won't for a while yet) - because as you may have guessed from the title of this post I've been running into a number of issues that all need researching.

First though let's talk about which machines I'm planning on using. I'm actually considering 2 clusters, to solve 2 different issues: compute and storage. Compute refers to running applications (e.g. Nextcloud etc), and storage refers to a distributed storage mechanism with multiple hosts - each with 1 drive attached - though I'm unsure about the storage cluster at this stage.

For the compute cluster, I'm leaning towards 4 x Raspberry Pi 4 with 4GiB of RAM each. For the storage cluster, I'm considering a number of different boards. 3 identical boards of 1 of the following:

I do seem to remember a board that had USB 3 onboard, which would be useful for connecting to the external drives. Currently the plan is to use a SATA to USB converter connect to internal HDDs (e.g. WD Greens) - but I have yet to find one that doesn't include the power connector or splits the power off into a separate USB cable (more on power later). This would be all be backed up by a Gigabit switch of some description (so the Rock Pi S is not a particularly attractive option, since it would be limited to 100MiB).

I've been using HackerBoards.com to discover different boards which may fit my project - but I'm not particularly satisfied with any of the options here so far. Specifically, I'm after Gigabit Ethernet and USB 3 on the same board if possible.

The next issue is software support. I've been bitten by this before, so I'm being extra cautious this time. I'm after a board that provides good software support, so that I can actually use all the hardware I've paid for.

The other thing relating to software that I'd like if possible is the ability to use a systemd-free operating system. Just like before, when I selected Manjaro OpenRC (which is now called Artix Linux), since I already have a number of systems using systemd I would like to balance it out a bit with some systems that use something else. I don't really mind if this is OpenRC, S6, or RunIt - just that it's something different to broaden my skill set.

Unfortunately, it's been a challenge to locate a distribution of Linux that both has broad support for ARM SoCs and does not use systemd. I suspect that I may have to give up on this, but I'm still holding out hope that there's a distribution out there that can do what I want - even if I have to prepare the system image myself (Alpine Linux looks potentially promising, but at the moment it's a huge challenge to figure out whether a chipset supported or not....). Either way, from my research it looks like having mainline Linux kernel support fro my board of choice is critically important to ensure continued support and updates (both feature and security) in the future.

Lastly, I also have power problems. Specifically, how to power the cluster. The big problem is that the Raspberry Pi 4 requires 3A of power max - instead the usual 2.4A in the 3B+ model. Of course, it won't be using this all the time, but it's apparently important that the ceiling of the power supply is 3A to avoid issues. Problem is, most multi-port chargers can barely provide 2A to connected devices - and I have not yet seen one that would provide 3A to 4+ devices and support additional peripherals such as hard drives and other supporting boards as described above.

To this end, I may end up having to build my own power supply from an old ATX supply that you can find in an old desktop PC. These can generally supply plenty of power (though it's always best to check) - but the problem here is that I'd need to do a ton of research to make sure that I wire it up correctly and safely, to avoid issues there too (I'm scared of blowing a fuse or electrocuting someone etc).

This concludes my first blog post on my cluster plans. It may be a while until the next one, as I have lots more research to do before I can continue. Suggestions and tips are welcomed in the comments below.

Art by Mythdael