
Cluster, Part 7: Wrangling... boxes? | Expanding the Hashicorp stack with Docker and Nomad

Welcome back to part 7 of my cluster configuration series. Sorry this one's a bit late - the last one was a big undertaking, and I needed a bit of a rest :P

Anyway, I'm back at it with another part to my cluster series. For reference, here are all the posts in this series so far:

Don't forget that you can see all the latest posts in the cluster tag right here on my blog.

Last time, we lit the spark for the bonfire, so to speak, that keeps track of what is running where. We also tied it into the internal DNS system we set up in part 3, which will act as the binding fabric of our network.

In this post, we're going to be doing 2 very important things:

  • Installing Docker
  • Installing and configuring Hashicorp Nomad

This is set to be another complex blog post that builds on the previous ones in this series (remember that benign rabbit hole from a few blog posts ago?).

Above: Nomad is a bit like a railway network manager. It decides what is going to run where and at what time. Picture taken by me.

Installing Docker

Let's install Docker first. This should be relatively easy. According to the official Docker documentation, you can install Docker like so:

curl https://get.docker.com/ | sudo sh

I don't like piping to sh though (and neither should you), so we're going to be doing something more akin to the "install using the repository" method. As a reminder, I'm using Raspberry Pi 4s running Raspbian (well, DietPi - but that's a minor detail). If you're using a different distribution or CPU architecture, you'll need to read the documentation to figure out the specifics of installing Docker for your architecture.

For Raspberry Pi 4s at least, it looks a bit like this:

echo 'deb [arch=armhf] https://download.docker.com/linux/raspbian buster stable' | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt update
sudo apt install docker-ce

Don't forget that if you're running an apt caching server, you'll need to tweak that https to be plain-old http. For the curious, my script for my automated Ansible replacement (see the "A note about cluster management" section in part 6) looks like this:

#!/usr/bin/env bash
RUN "echo 'deb [arch=armhf] http://download.docker.com/linux/raspbian buster stable' | sudo tee /etc/apt/sources.list.d/docker.list";
RUN "sudo apt-get update";
RUN "sudo apt-get install --yes docker-ce";

Docker should install without issue - note that you need to install it on all nodes in the cluster. We can't really do anything meaningful with it yet though, as we don't yet have Nomad installed. Let's move on and install that then!
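
If you'd like a quick sanity check that Docker is working on each node before moving on, something like this should do (the hello-world image is just Docker's own tiny test image):

sudo docker --version
sudo docker run --rm hello-world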

Installing Hashicorp Nomad

Nomad is what's known as a workload orchestrator. This means that it, given a bunch of jobs, decides what is going to run where. If a host goes down, it is also responsible for shuffling things around to compensate.

Nomad works on the concept of 'jobs', which can be handled by any 1 of a number of drivers. In our case, we're going to be using the built-in Docker driver, as we want to manage the running of lots of Docker containers across multiple hosts in our cluster.

After installing Consul last time, we can build on that with Nomad. The 2 actually integrate really nicely with each other. Nomad will, by default, seek out a local Consul daemon, use it to discover other hosts in the cluster, and hang its own cluster from Consul. Neat!

Also like Consul, Nomad functions with servers and clients. The servers all talk to each other via the Raft consensus algorithm, and the clients are lightweight daemons that do what the servers tell them to. I'm going to have 3 servers and 2 clients, in the following layout:

Host #    Consul             Nomad
1         Server             Server
2         Server + Client    Client
3         Server + Client    Client
4         Client             Server + Client
5         Client             Server + Client

Just for the record, according to the Nomad documentation it's not recommended that servers also act as clients, but I don't have enough hosts to avoid this yet.

With this in mind, let's install Nomad. Again, as last time, I've packaged Nomad in my apt repository. If you haven't already, go and set it up now. Then, install Nomad like so:

sudo apt install hashicorp-nomad

Also as last time, I've deliberately chosen a different name than the existing nomad package that you'll probably find in your distribution's repositories to avoid confusion during updates. If you're a systemd user, then I've also got a trio of packages that provide a systemd service file:

Package Name                      Config file location
hashicorp-nomad-systemd-server    /etc/nomad/server.hcl
hashicorp-nomad-systemd-client    /etc/nomad/client.hcl
hashicorp-nomad-systemd-both      /etc/nomad/both.hcl

They all conflict with each other (such that you can only have 1 installed at a time), and the only difference between them is where the configuration file is located.

Install 1 of these (if required) now too with your package manager. If you're not a systemd user, consult your service manager's documentation and write a service definition. If you're willing, comment below and I'll include a note about it here!

Speaking of configuration files, we should write one for Nomad. Let's start off with the bits that will be common across all the config file variants:

bind_addr = "{{ GetInterfaceIP \"wgoverlay\" }}"

# Increase log verbosity
log_level = "INFO"

# Setup data dir
# The data directory used to store state and other persistent data. On client
# machines this is used to house allocation data such as downloaded artifacts
# used by drivers. On server nodes, the data dir is also used to store the
# replicated log.
data_dir = "/srv/nomad"

A few things to note here. log_level is mostly personal preference, so do whatever you like there. I'll probably tune it myself as I get more familiar with how everything works.

data_dir needs to be a path to a private root-owned directory on disk for the Nomad agent to store stuff locally to that node. It should not be shared with other nodes. If you installed one of the systemd packages above, /srv/nomad is created with the correct permissions for you.

bind_addr tells Nomad which network interface to send management clustering traffic over. For me, I'm using the WireGuard mesh VPN I set up in part 4, so I specify wgoverlay here.

Next, let's look at the server config:

# Enable the server
server {
    enabled = true

    # We've got 3 servers in the cluster at the moment
    bootstrap_expect = 3

    # Note that Nomad finds other servers automagically through the consul cluster

    # TODO: Enable this. Before we do we need to figure out how to move this sekret into vault though or something
    # encrypt = "SOME_VALUE_HERE"
}

Not much to see here. Don't forget to change the bootstrap_expect to a different value if you are going to have a different number of servers in your cluster (nodes that are just clients don't count).

Note that this isn't the complete server configuration file - you need to take both this and the above common bit to make the complete server configuration file.

Now, let's look at the client configuration:

client {
    enabled = true
    # Note that Nomad finds other servers automagically through the consul cluster

    # Just a worker, nothing special going on here
    node_class = "worker"

    # use wgoverlay for network fingerprinting and forwarding
    network_interface = "wgoverlay"

    # Nobody is allowed to run as root - even if you *are* inside a container.....
    # For 1 thing it'll trigger a permission denied when writing to the NFS share
    options = {
        "user.blacklist" = "root"
    }
}

This is more interesting.

network_interface is really important if you're using a WireGuard mesh VPN like wesher that I setup and configured in part 4. By default, Nomad port forwards over all interfaces that make sense, and in this case gets it wrong.

This fixes that by telling it to explicitly port forward containers over the wgoverlay interface. If your network interface has a different name, this is the place to change it. From what I can tell, it's fairly common practice to have both a 'public' and a 'private' network in a cluster environment. The private network is usually trusted, and as such has lots of management traffic running over it. The public network is the locked-down one that requests come in on from outside.

The "user.blacklist" = "root" here is a precaution that I may end up having to remove in future. It blocks any containers from running on this client from running as root inside the Docker container. This is actually worth remembering, because it's a bit of a security risk. This is a fail-safe to remind myself that it's a Bad Idea.

Apparently there are tactics that can be deployed to avoid running containers as root - even when you might think you need to. In addition, if there's no other way to avoid it, there's apparently a clever user namespace remapping trick one can deploy to prevent a container from having root privileges on the host if a process breaks out of its container.
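
As a rough sketch (I haven't tried this in combination with Nomad yet), Docker's user namespace remapping is a daemon-level option that lives in /etc/docker/daemon.json:

# Assumes you don't already have a /etc/docker/daemon.json - if you do, merge the key in by hand instead
echo '{ "userns-remap": "default" }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker.service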

Another thing to note is that NFS shares often don't like you reading or writing files owned by root either, so if you're going to be saving data to a shared NFS partition like me, this is yet another reason to avoid root in your containers.

At this point it's also probably a good idea to talk a little bit about usernames - although we'll talk in more depth about this later. From my current understanding, the usernames inside a container aren't necessarily the same as those outside the container.

Every process runs under a specified username, but each username is backed by a given user id. It's this user id that is translated back into a username on the client machine when reading files from an NFS mount - hence why usernames in NFS shares can be somewhat odd.

Docker containers often have custom usernames created inside the containers for running processes inside the container with specific user ids. More on this later, but I plan to dive into this in the process of making my own Docker container images.
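
To see the raw numbers involved for yourself, something like this illustrates the point (the alpine image and the NFS mount path here are just placeholders):

sudo docker run --rm alpine id                    # uid=0(root) - the default in many images
sudo docker run --rm --user 1000:1000 alpine id   # uid=1000 - there may be no username mapped to it inside the container at all
ls -ln /mnt/example-nfs-share                     # -n shows the raw numeric uid/gid that actually gets stored on the share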

Anyway, we now have our configuration files for Nomad. For a client node, take the client config and the common config from the top of this section. For a server, take the server and common sections. For a node that's to act as both a client and a server, take all 3 sections.

Now that we've got that sorted, we should be able to start the Nomad agent:

sudo systemctl enable --now nomad.service

This is the same for all nodes in the cluster - regardless of whether it's a client, a server, or both (this is also the reason, mentioned above, that you can't have more than 1 of the systemd apt packages installed at once).

If you're using the UFW firewall, then that will need configuring. For me, I'm allowing all traffic on the wgoverlay network interface that's acting as my trusted network:

sudo ufw allow in on wgoverlay

If you'd prefer not to do that, then you can allow only the specific ports through like so:

sudo ufw allow 4646/tcp comment nomad-http
sudo ufw allow 4647/tcp comment nomad-rpc
sudo ufw allow 4648/tcp comment nomad-serf

Note that this allows the traffic on all interfaces - these will need tweaking if you only want to allow the traffic in on a specific interface (which, depending on your setup, is probably a wise decision).

Anyway, you should now be able to ask the Nomad cluster for its status like so:

nomad node status

...execute this from any server node in the cluster. It should give output like this:

ID        DC   Name         Class   Drain  Eligibility  Status
75188064  dc1  piano        worker  false  eligible     ready
9eb7a7a5  dc1  harpsicord   worker  false  eligible     ready
c0d23e71  dc1  saxophone    worker  false  eligible     ready
a837aaf4  dc1  violin       worker  false  eligible     ready

If you see this, you've successfully configured Nomad. Next, I recommend reading the Nomad tutorial and experimenting with some of the examples. In particular the Getting Started and Deploy and Manage Jobs topics are worth investigating.
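
To give you a flavour, a minimal job specification for the Docker driver looks something like the sketch below. The job name and image are just examples, and the exact stanza layout (particularly the networking bit) varies between Nomad versions - so treat this as illustrative rather than copy-paste ready. Note also that with the user.blacklist setting from earlier, you'd want an image that doesn't run as root.

job "http-echo" {
  datacenters = ["dc1"]
  type        = "service"

  group "echo" {
    count = 1

    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-listen", ":8080", "-text", "Hello from Nomad!"]
      }

      resources {
        cpu    = 100 # MHz
        memory = 64  # MB

        network {
          port "http" {
            static = 8080
          }
        }
      }
    }
  }
}

You'd submit it with nomad job run http-echo.nomad, and check on it with nomad job status http-echo.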

Conclusion

In this post, we've installed Docker, and installed and configured Nomad. We've also touched briefly on some of the security considerations we need to be aware of when running things in Docker containers - much more on this in the future.

In future posts, we're going to look at setting up shared storage, so that jobs running on Nomad can safely store state and execute on any client / worker node in the cluster while retaining access to said state information.

On the topic of Nomad, we're also going to look at running our first real job: a Docker registry, so that we can push our own custom Docker images to it when we've built them.

You may have noticed that both Nomad and Consul also come with a web interface. We're going to look at these too, but in order to do so we need a special container-aware reverse-proxy to act as a broker between 'cluster-space' (in which everything happens 'somewhere', and we don't really know nor do we particularly care where), and 'normal-network-space' (in which everything happens in clearly defined places).

I've actually been experiencing some issues with this, as I initially wanted to use Traefik for this purpose - but I ran into a number of serious difficulties with their (lack of) documentation. After getting thoroughly confused, I'm now experimenting with Fabio (git repository) instead, which I'm getting on with much better. It's a shame really - I even got as far as writing the automated packaging script for Traefik, as evidenced by the traefik packages in my apt repository.

Until then though, happy cluster configuration! Feel free to post a comment below.

Found this interesting? Found a mistake? Confused about something? Comment below!

Sources and Further Reading

Cluster, Part 6: Superglue Service Discovery | Setting up Consul

Hey, welcome back to another weekly installment of cluster configuration for colossal computing control. Apparently I'm managing to keep this up as a weekly series every Wednesday.

Last week, we sorted out managing updates to the host machines in our cluster to keep them fully patched. We achieved this by firstly setting up an apt caching server with apt-cacher-ng. Then we configured our host machines to use it. Finally, we set up automated updates with unattended-upgrades so we don't have to keep installing them manually all the time.

For reference, here are all the posts in this series so far:

In this part, we're going to install and configure Consul - the first part of the Hashicorp stack. Consul doesn't sound especially exciting, but it is an extremely important part of our (diabolical? I never said that) plan. It serves a few purposes:

  • Clusters hosts together, so Nomad (the task scheduler) can find other nodes
  • Keeps track of which services are running where

It uses the Raft Consensus Algorithm (like wesher from part 4; they actually use the same library under-the-hood it would appear) to provide a relatively decentralised approach to the problem, allowing for some nodes to fail without impacting the cluster's operation as a whole.

It also provides a DNS API, which we'll be tying into with Unbound later in this post.

Before continuing, you may find reading through the official Consul guides a useful exercise. Try out some of the examples too to get your head around what Consul is useful for.

(Above: Nasa's DSN dish in Canberra, Australia just before major renovations are carried out. Credit: NASA/Canberra Deep Space Communication Complex)

Installation and Preamble

To start with, we need to install it. I've done the hard work of packaging it already, so you can install it from my apt repository - which, if you've been following this series, you should have set up already (if not, follow the link and read the instructions there).

Install consul like this:

sudo apt install consul

Next, we need a systemd service file. Again, I have packages in my apt repository for this. There are 2 packages:

  • hashicorp-consul-systemd-client
  • hashicorp-consul-systemd-server

The only difference between the 2 packages is where it reads the configuration file from. The client package reads from /etc/consul/client.hcl, and the server from /etc/consul/server.hcl. They also conflict with each other, so you can't install both at the same time. This is because - as far as I can tell - servers can expose the client interface in just the same way as any other normal client.

To get a feel for the bigger picture, let's talk architecture. Because Consul uses the Raft Consensus Algorithm, we'll need an odd number of servers to avoid issues (if you use an even number of servers, then you run the risk of a 'split brain', where there's no clear majority vote as to who's the current leader of the cluster). In my case, I have 5 Raspberry Pi 4s:

  • 1 x 2GB RAM (controller)
  • 4 x 4GB RAM (workers)

In this case, I'm going to use the controller as my first Consul server, and pick 2 of the workers at random to be my other 2, to make up 3 servers in total. Note that in future parts of this series you'll need to keep track of which ones are the servers, since we'll be doing this all over again for Nomad.

With this in mind, install the hashicorp-consul-systemd-server package on the nodes you'll be using as your servers, and the hashicorp-consul-systemd-client package on the rest.

A note about cluster management

This is probably a good point to talk about cluster management before continuing. Specifically the automation of said management. Personally, I have the goal of making the worker nodes completely disposable - which means completely automating the setup from installing the OS right up to being folded into the cluster and accepting jobs.

To do this, we'll need a tool to help us out. In my case, I've opted to write one from scratch using Bash shell scripts. This is not something I can recommend to anyone else, unless you want to gain an understanding of how such tools work. My inspiration was efs2, which appears to be a Go program - and Dockerfiles. As an example, my job file for automating the install of a Consul server agent looks like this:

#!/usr/bin/env bash

SCRIPT "${JOBFILE_DIR}/common.sh";

COPY "../consul/server.hcl" "/tmp/server.hcl"

RUN "sudo mv /tmp/server.hcl /etc/consul/server.hcl";
RUN "sudo chown root:root /etc/consul/server.hcl";
RUN "sudo apt-get update";
RUN "sudo apt-get install --yes hashicorp-consul-systemd-server";

RUN "sudo systemctl enable consul.service";
RUN "sudo systemctl restart consul.service";

...I'll be going through all steps in a moment. Of course, if there's the demand for it then I'll certainly write a post or 2 about my shell scripting setup here (comment below), but I still recommend another solution :P

Note that the firewall configuration is absent here - this is because I've set it to allow all traffic on the wgoverlay network interface, which I talked about in part 4. If you did want to configure the firewall, here are the rules you'd need to create:

sudo ufw allow 8301 comment consul-serf-lan;
sudo ufw allow 8300/tcp comment consul-rpc;
sudo ufw allow 8600 comment consul-dns;

Many other much more mature tools exist - you should use one of those instead of writing your own:

  • Ansible - uses YAML configuration files; organises things logically into 'playbooks' (personally I really seriously can't stand YAML, which is another reason for writing my own)
  • Puppet
  • and more

The other thing to be aware of is version control. You should absolutely put all your configuration files, scripts, Ansible playbooks, etc under version control. My preference is Git, but you can use anything you like. This will help provide a safety net in case you make an edit and break everything. It's also a pretty neat way to keep it all backed up by pushing it to a remote such as your own Git server (you do have automated backups, right?), GitHub, or GitLab.
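
If you've not used Git before, getting a directory of configuration files under version control and pushed to a remote takes only a handful of commands (the paths and remote URL here are just placeholders):

cd ~/cluster-config           # wherever you keep your config files and scripts
git init
git add .
git commit -m "Initial commit of cluster configuration"
git remote add origin git@git.example.com:you/cluster-config.git
git push -u origin master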

Configuration

Now that we've got that sorted out, we need to deal with the configuration files. Let's do the server configuration file first. This is written in the Hashicorp Configuration Language. It's probably a good idea to get familiar with it - I have a feeling we'll be seeing a lot of it. Here's my full server configuration (at the time of typing, anyway - I'll try to keep this up-to-date).

bind_addr = "{{ GetInterfaceIP \"wgoverlay\" }}"

# When we have this many servers in the cluster, automatically run the first leadership election
# Remember that the Hashicorp stack uses the Raft consensus algorithm.
bootstrap_expect = 3
server = true
ui = true

client_addr = "127.0.0.1 {{ GetInterfaceIP \"docker0\" }}"

data_dir = "/srv/consul"
log_level = "INFO"

domain = "mooncarrot.space."


retry_join = [
    // "172.16.230.100"
    "bobsrockets",
    "seanssatellites",
    "tillystelescopes"
]

This might look rather unfamiliar. There's also a lot to talk about, so let's go through it bit by bit. If you haven't already, I do suggest coming up with an awesome naming scheme for your servers. You'll thank me later.

The other thing you'll need to do before continuing is buy a domain name. It sounds silly, but it's actually really important. As we talked about in part 3, we're going to be running our own internal DNS - and Consul is a huge part of this.

By default, Consul serves DNS under the .consul top-level domain, which is both unregistered and very bad practice (precisely because it's unregistered). Someone could come along tomorrow, register the .consul top-level domain, and start using it - and then things would get awkward if an external domain ending in .consul clashed with one you're using internally.

I've chosen mooncarrot.space myself, but if you don't yet have one, I recommend taking your time and coming up with a really clever one you like - since you'll be using it for a long time - and updating it later is a pain in the behind. If you're looking for a recommendation for a DNS provider, I've been finding Gandi great so far (much better than GoDaddy, who have tons of hidden charges).

Once you've decided on a domain name (and bought it, if necessary), set it in the server configuration file via the domain directive:

domain = "mooncarrot.space."

Don't forget that trailing dot. As we learned in part 3, it's important, since it indicates an absolute domain name.

Also note that I'm using a subdomain of the domain in question here. This is because of an issue whereby I'm unable to get Unbound to forward queries that Consul is unable to resolve on to CloudFlare.

Another thing of note is the data_dir directive. Note that this is the data storage directive local to the specific node, not shared storage (we'll be tackling that in a future post).

The client_addr directive here tells Consul which network interfaces to bind the client API to. In our case, we're binding it to the local loopback (127.0.0.1) and the docker0 network interface by dynamically grabbing its IP address - so that docker containers on the host can use the API.

The bind_addr directive is similar, but for the inter-node communication interfaces. This tells Consul that the other nodes in the cluster are accessible over the wgoverlay interface that we set up in part 4. This is important, since Consul doesn't encrypt or authenticate its traffic by default as far as I can tell - and I haven't yet found a good way to do this that doesn't involve putting a password directly into a configuration file.

In this way the WireGuard mesh VPN provides the encryption & authentication that Consul lacks by default (though I'm certainly going to be looking into it anyway).

bootstrap_expect is also interesting. If you've decided on a different number of Consul server nodes, then you should change this value to equal the number of server nodes you're going to have. 3, 5, and 7 are all good numbers - though don't go overboard. More servers means more overhead. Servers take more computing power than clients, so try not to have too many of them.

Finally, retry_join is also very important. It should contain the domain name of all the servers in the cluster. In my case, I'm using the hostnames of the other servers in the network, since wesher (our WireGuard mesh VPN program) automatically adds the machine names of all the nodes in the VPN cluster to your /etc/hosts file. In this way we ensure that the cluster always talks over the wgoverlay VPN network interface.

Oh yeah, and I should probably note here that your servers should not use FQDNs (Fully Qualified Domain Names) as their hostnames. I found out the hard way: it messes with Consul, and it ends up serving node IPs via DNS on something like harpsichord.mooncarrot.space.node.mooncarrot.space instead of something sensible like harpsichord.node.mooncarrot.space. If anyone has a solution for this that doesn't involve using non-FQDNs as hostnames, I'd love to know (since FQDNs as hostnames is my preference).
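
If you want to check what your machines are currently using (and switch to a short hostname), something like this does the trick on systemd-based systems - the hostname here is just my own example:

hostname                                     # the current hostname
hostname --fqdn                              # the fully-qualified version, if one is configured
sudo hostnamectl set-hostname harpsichord    # a short name, rather than harpsichord.mooncarrot.space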

That was a lot of words. Next, let's do the client configuration file:

bind_addr = "{{ GetInterfaceIP \"wgoverlay\" }}"

bootstrap = false
server = false

domain = "mooncarrot.space."

client_addr = "127.0.0.1 {{ GetInterfaceIP \"docker0\" }}"

data_dir = "/srv/consul"
log_level = "INFO"

retry_join = [
    "wopplefox",
    "spatterling",
    "sycadil"
]

Not much to talk about here, since the configuration is almost identical to that of the server, except you don't have to tell it how many servers there are, and retry_join should contain the names of the servers that the client should try to join, as explained above.

Once you've copied the configuration files onto all the nodes in your cluster (/etc/consul/server.hcl for servers; /etc/consul/client.hcl for clients), it's now time to boot up the cluster. On all nodes (probably starting with the servers), do this:

# Enable the Consul service on boot
sudo systemctl enable consul.service
# Start the Consul service now
sudo systemctl start consul.service
# Or, you can do it all in 1 command:
sudo systemctl enable --now consul.service

It's probably a good idea to follow the cluster's progress along in the logs. On a server node, do this after starting the service & enabling start on boot:

sudo journalctl -u consul --follow

You'll probably see a number of error messages, but be patient - it can take a few minutes after starting Consul on all nodes for the first time for them to start talking to each other, sort themselves out, and calm down.

Now, you should have a working Consul cluster! On one of the server nodes, do this to list all the servers in the cluster:

consul members

If you like, you can also run this from your local machine. Simply install the consul package (but not the systemd service file), and make some configuration file adjustments. Update your ~/.bashrc on your local machine to add something like this:

export CONSUL_HTTP_ADDR=http://consul.service.mooncarrot.space:8500;

....replacing mooncarrot.space with your own domain, of course :P

Next, update the server configuration file to make the client_addr directive look like this:

client_addr = "127.0.0.1 {{ GetInterfaceIP \"docker0\" }} {{ GetInterfaceIP \"wgoverlay\" }}"

Upload the new version to your servers, and restart them one at a time (unless you're ok with downtime - I'm trying to practice avoiding downtime now so I know all the processes for later):

sudo systemctl restart consul.service

At this point, we've got a fully-functioning Consul cluster. I recommend following some of the official guides to learn more about how it works and what you can do with it.

Unbound

Before we finish for today, we've got 1 more task to complete. As I mentioned back in part 3, we're going to configure our DNS server to conditionally forward queries to Consul. The end result we're aiming for is best explained with a diagram:

A diagram showing how we're aiming for Unbound to resolve queries.

In short:

  1. Try the localzone data
  2. If nothing was found there (or it didn't match), see if it matches Consul's subdomain
  3. If so, forward the query to Consul and return the result
  4. If Consul couldn't resolve the query, forward it to CloudFlare via DNS-over-TLS

The only bit we're currently missing of this process is the Consul bit, which we're going to do now. Edit /etc/unbound/unbound.conf on your DNS server (mine is on my controller node), and insert the following:

###
# Consul
###
forward-zone:
    name: "node.mooncarrot.space."
    forward-addr: 127.0.0.1@8600
forward-zone:
    name: "service.mooncarrot.space."
    forward-addr: 127.0.0.1@8600

...replace mooncarrot.space. with your domain name (not forgetting the trailing dot, of course). Note here that we have 2 separate forward zones here.

Unfortunately, I can't seem to find a way to get Unbound to fall back to a more generic forward zone in the event that a more specific one is unable to resolve the query (I've tried both a forward-zone and a stub-zone). To this end, we need to define multiple more specific forward-zones if we want to be able to forward queries out to CloudFlare for additional DNS records. Here's an example:

  1. tuner.service.mooncarrot.space is an internal service that is resolved by Consul
  2. peppermint.mooncarrot.space is an externally defined DNS record defined with my registrar

If we then ask Unbound to resolve them both, only #1 will be resolved correctly. Unbound will do something like this for #2:

  • Check the local-zone for a record (not found)
  • Ask Consul (not found)
  • Return error

If you are not worried about defining DNS records in your registrar's web interface / whatever they use, then you can just do something like this instead:

###
# Consul
###
forward-zone:
    name: "mooncarrot.space."
    forward-addr: 127.0.0.1@8600

For advanced users, Consul's documentation on the DNS interface is worth a read, as it gives the format of all the DNS records Consul can serve.

Note also here that the recursors configuration option is an alternative solution, but I don't see an option to force DNS-over-TLS queries there.

If you have a better solution, please get in touch by commenting below or answering my ServerFault question.

With this done, you should be able to ask Consul what the IP address of any node in the cluster is like so:

dig +short harpsichord.node.mooncarrot.space
dig +short grandpiano.node.mooncarrot.space
dig +short oboe.node.mooncarrot.space
dig +short some_machine_name_here.node.mooncarrot.space

Again, you'll need to replace mooncarrot.space of course with your domain name.
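
Services registered with Consul can be queried in just the same way - and an SRV lookup will also tell you the port a service instance is listening on. The tuner service here is just an example name:

dig +short tuner.service.mooncarrot.space
dig +short tuner.service.mooncarrot.space SRV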

Conclusion

Phew! There were a lot of steps and moving parts to set up in this post. Take your time, and re-read this post a few times to make sure you've got all your ducks in a row. Make sure to test your new Consul cluster by reading the official guides as mentioned above too, as it'll cause lots of issues later if you've got bugs lurking around in Consul.

I can't promise that it's going to get easier in future posts - it's probably going to get a lot more complicated with lots more to keep track of, so make sure you have a solid understanding of what we've done so far before continuing.

To summarise, we've managed to setup a brand-new Consul cluster. We've configured Unbound to forward queries to Consul, to enable seamless service discovery (regardless of which host things are running on) later on.

We've also picked an automation framework for automating the setup and configuration of the various services and such we'll be configuring. I recommend taking some time to get to know your chosen framework - many have lots of built-in modules to make common tasks much easier. Try going back to previous posts in this series (links at the top of this post) and implementing them in your chosen framework.

Finally, we've learnt a bit about version control and its importance in managing configuration files.

In the next few posts in this series (unless I get distracted - likely - or have a change of plans), we're going to be setting up Nomad, the task scheduler that will be responsible for managing what runs where and informing Consul of this. We're also going to be setting up and configuring a private Docker registry and Traefik - the latter of which is known as an edge router (more on that in future posts).

See you next time, when we dive further down into what's looking less like a rabbit hole and more like a cavernous sinkhole of epic proportions.

Found this useful? Confused about something? Got a suggestion? Comment below! It's really awesome to hear that my blog posts have helped someone else out.

Pipes, /dev/shm, or a TCP socket: Which is faster?

I've been busy patching HAIL-CAESAR (a simplified 2D flood simulation program designed for HPC supercomputers) to make it more suitable for the scale of my PhD project, and as part of this I'm trying to use the standard input & output where possible to speed up data transfer for the pre and post-processing steps, since I need to convert the data to and from different formats.

As part of this, it crossed my mind that there are actually a number of different ways of getting data in and out of a program, so I decided to do a quick (relatively informal) test to see which was fastest.

In my actual project, I'm going to be doing the following data transfers:

  • From .jsonstream.gz files to a Node.js process
  • From the Node.js process to HAIL-CAESAR
  • From HAIL-CAESAR to another Node.js process (there's a LOT of data in this bit)
  • From that Node.js process to disk as PNG files

That's a lot of transferring. In particular the output of HAIL-CAESAR, which I'm currently writing directly to disk, appears to be absolutely enormous - due mainly to the hugely inefficient storage format used.

Anyway, the 3 mechanisms I'm putting to the test here are:

  • A pipe (e.g. writing to standard output)
  • Writing to a file in /dev/shm
  • A TCP socket

If anyone can think of any other mechanisms for rapid inter-process communication, please do get in touch by leaving a comment below.

Pipe

I'm simulating a pipe with the following code:

timeout --signal=SIGINT 30s dd if=/dev/zero status=progress | cat >/dev/null

The timeout --signal=SIGINT 30s bit lets it run for 30 seconds before stopping it with a SIGINT (the same as Ctrl + C). I'm reading from /dev/zero here, because I want to test the performance of the pipe and not be limited by the speed of random number generation if I were to use /dev/urandom.

Running this on my laptop resulted in a speed of ~396 MB/s.

/dev/shm

/dev/shm is the shared memory area on Linux - and is usually backed by a tmpfs file system (i.e. an in-memory ramdisk).

Here are the commands I'm using to test this:

dd if=/dev/zero of=/dev/shm/test-1gb bs=1024 count=1000000
dd if=/dev/shm/test-1gb of=/dev/null bs=1024 count=1000000

This writes a 1GB file to /dev/shm, and then reads it back again (to be consistent with the pipe test). To calculate the overall MB/s speed, we need to know the time it took to do the read and write operations. I observed the following:

Operation    Speed       Time
Write        692 MB/s    1.4788s
Read         890 MB/s    1.1501s

....so that's 2.6289s in total. Then, we can calculate the MB/s by dividing 1GB by the total time, giving us a total transfer speed of ~380 MB/s. This seemed quite variable though - as when I tested it the other day I got only ~273 MB/s.
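
If you want to reproduce the arithmetic, it's just the (roughly) 1000 MB transferred divided by the combined write + read time:

echo "scale=1; 1000 / (1.4788 + 1.1501)" | bc
# 380.3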

TCP Socket

Finally, to test a TCP socket, I devised the following:

nc -l 8888 >/dev/null &
timeout --signal=SIGINT 30s dd status=progress if=/dev/zero | nc 127.0.0.1 8888

The first line sets up the listener, and the 2nd line is the sender. As before with the pipe test, I'm stopping it after 30 seconds. It took a moment to stabilise, but towards the end it levelled off at about ~360 MB/s.

Conclusion

After running the 3 tests, the results were as follows:

Test          Speed
Pipe          396 MB/s
/dev/shm      380 MB/s
TCP Socket    360 MB/s

According to this, the pipe (i.e. writing to the standard output and reading from the standard input) is the fastest. This isn't particularly surprising (since the other methods have overhead), but interesting to test all the same. Here's a quick graph of that:

A quick bar chart of the above data

Of course, there are other considerations to take into account. For example, if you need scalable multi-core processing, then /dev/shm or TCP sockets (the latter especially, since Linux has a special mechanism for multiple processes to listen on the same port and load-balance between them) might be a better option - despite the additional overhead.

Other CPU architectures may have an effect on it too due to different CPU instructions being available - I ran these tests on Ubuntu 19.10 on the Intel Core i7-7500U in my laptop.

As of yet I'm unsure as to how much post-processing the data coming from HAIL-CAESAR will require - and whether it will require multiple processes to handle the load or not. I hope not - since HAIL-CAESAR is written in C++, and TCP sockets would be awkward and messy to implement (since you would probably have to use the low-level socket API, and I don't have any experience with networking in C++ yet) - and the HPC in question doesn't appear to have inotifywait installed to make listening for file writes on disk easier.

Cluster, Part 5: Staying current | Automating apt updates and using apt-cacher-ng

Hey there! Welcome to another cluster blog post. In the last post, we looked at setting up a WireGuard mesh VPN as a trusted private network for management interfaces and inter-cluster communication. As a refresher, here's a list of all the parts in this series so far:

Before we get to the Hashicorp stack though (next week, I promise!), there's an important housekeeping matter we need to attend to: managing updates.

In Debian-based Linux distributions such as Raspbian (that I'm using on my Raspberry Pis), updates are installed through apt - and this covers everything from the kernel to the programs we have installed - hence the desire to package everything we're going to be using to enable easy installation and updating.

There are a number of different related command-line tools, but the ones we're interested in are apt (the easy-to-use front-end CLI) and apt-get (the original tool for installing updates).

There are 2 key issues we need to take care of:

  1. Automating the installation of package updates
  2. Caching the latest packages locally to avoid needless refetching

Caching package updates locally with apt-cacher-ng

Issue #2 here is slightly easier to tackle, surprisingly enough, so we'll do that first. We want to cache the latest packages locally, because if we have lots of machines in our cluster (I have 5, all running Raspbian), then when they update they all have to download the package lists and the packages themselves from the remote sources each time. Not only is this bandwidth-inefficient, but it takes longer and puts more strain on the remote servers too.

For apt, this can be achieved through the use of apt-cacher-ng. Other distributions require different tools - in particular I'll be researching and tackling Alpine Linux's apk package manager in a future post in this series - since I intend to use Alpine Linux as my primary base image for my Docker containers (I also intend to build my own Docker containers from scratch too, so that will be an interesting experience that will likely result in a few posts too).

Anyway, installation is pretty simple:

sudo apt install apt-cacher-ng

Once done, there's a little bit of tuning we need to attend to. apt-cacher-ng by default listens for HTTP requests on TCP port 3142 and has an administrative interface at /acng-report.html. This admin interface is not secured by default - so securing it is something we should do before opening a hole in the firewall.

This can be done by editing the /etc/apt-cacher-ng/security.conf configuration file. It should read something like this:


# This file contains confidential data and should be protected with file
# permissions from being read by untrusted users.
#
# NOTE: permissions are fixated with dpkg-statoverride on Debian systems.
# Read its manual page for details.

# Basic authentication with username and password, required to
# visit pages with administrative functionality. Format: username:password

AdminAuth: username:password

....you may need to use sudo to view and edit it. Replace username and password with your own username and a long unguessable password that's not similar to any existing passwords you have (especially since it's stored in plain text!).

Then we can (re) start apt-cacher-ng:

sudo systemctl enable apt-cacher-ng
sudo systemctl restart apt-cacher-ng

The last thing we need to do here is to punch a hole through the firewall, if required. As I explained in the previous post, I'm using a WireGuard mesh VPN and allowing all traffic on that interface (for reasons that will - eventually - become clear), so I don't need to open a separate hole in my firewall unless I want other devices on my network to use it too (which wouldn't be a bad idea, all things considered).

Anyway, ufw can be configured like so:

sudo ufw allow 3142/tcp comment apt-cacher-ng

With the apt-cacher server installed and configured, you can now get apt to use it:

echo 'Acquire::http { Proxy "http://X.Y.Z.W:3142"; }' | sudo tee -a /etc/apt/apt.conf.d/proxy

....replacing X.Y.Z.W with the IP address (or hostname!) of your apt-cacher-ng server. Note that it will get upset if you use https anywhere in your apt sources, so you'll have to inspect /etc/apt/sources.list and all the files in /etc/apt/sources.list.d/ manually and update them.
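
A quick way to find any offending https sources is to grep across both locations:

grep -rn 'https://' /etc/apt/sources.list /etc/apt/sources.list.d/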

Automatic updates with unattended-upgrades

Next on the list is installing updates automatically. This is useful because we don't want to have to manually install updates every day on every node in the cluster. There are positives and negatives to installing updates automatically - I recommend giving the top of this article a read.

First, we need to install unattended-upgrades:

sudo apt install unattended-upgrades

Then, we need to edit the /etc/apt/apt.conf.d/50unattended-upgrades file - don't forget to sudo.

Unfortunately, I haven't yet automated this process (or properly developed a replacement configuration file that can be automatically placed on a target system by a script), so for now we'll have to do this manually (the mssh command might come in handy).

First, find the line that starts with Unattended-Upgrade::Origins-Pattern, and uncomment the lines that end in -updates, label=Debian, label=Debian-Security. For Raspberry Pi users, add the following lines in that bit too:

"origin=Raspbian,codename=${distro_codename},label=Raspbian";

// Additionally, for those running Raspbian on a Raspberry Pi,
// match packages from the Raspberry Pi Foundation as well.
"origin=Raspberry Pi Foundation,codename=${distro_codename},label=Raspberry Pi Foundation";

unattended-upgrades will only install packages that are matched by a list of origins. Unfortunately, the way that you specify which updates to install is a total mess, and it's not obvious how to configure it. I did find an Ask Ubuntu answer that explains how to get unattended-upgrades to install updates. If anyone knows of a cleaner way of doing this - I'd love to know.

The other decision to make here is whether you'd like your hosts to automatically reboot. This could be disruptive, so only enable it if you're sure that it won't interrupt any long-running tasks.

To enable it, find the line that starts with Unattended-Upgrade::Automatic-Reboot and set it to true (uncommenting it if necessary). Then find the Unattended-Upgrade::Automatic-Reboot-Time setting and set it to a time of day you're ok with it rebooting at - e.g. 03:00 for 3am - but take care to avoid all your servers rebooting at the same time, as this might cause issues later.
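
Once you've made those edits, the relevant lines in /etc/apt/apt.conf.d/50unattended-upgrades should end up looking something like this (pick a different time on each host to stagger the reboots):

Unattended-Upgrade::Automatic-Reboot "true";
Unattended-Upgrade::Automatic-Reboot-Time "03:00";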

A few other settings need to be updated too. Here they are, with their correct values:

APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Download-Upgradeable-Packages "1";
APT::Periodic::Unattended-Upgrade "1";
APT::Periodic::AutocleanInterval "7";

Make sure you find the existing settings and update them, because otherwise if you just paste these in, they may get overridden. In short these settings:

  • Enable automatic updates to the package metadata indexes
  • Download upgradeable packages
  • Install downloaded updates
  • Automatically clean up the cache every 7 days

Once done, save and close that file. Finally, we need to enable and start the unattended-upgrades service:

sudo systemctl enable unattended-upgrades
sudo systemctl restart unattended-upgrades
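
If you'd like to check that your origin patterns are being picked up without waiting for the next scheduled run, unattended-upgrades has a dry-run mode:

# Note the singular 'unattended-upgrade' - that's the name of the underlying binary
sudo unattended-upgrade --dry-run --debug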

To learn more about automatic upgrades, these guides might help shed some additional light on the subject:

Conclusion

In this post, we've taken a look at apt package caching and unattended-upgrades. In the future, I'm probably going to have to sit down and either find an alternative to unattended-upgrades that's easier to configure, or rewrite the entire configuration file and create my own version. Comments and suggestions on this are welcome in the comments.

In the next post, we'll be finally getting to the Hashicorp stack by installing and configuring Consul. Hold on to your hats - next week's post is significantly complicated.

Edit 2020-05-09: Add missing instructions on how to get apt to actually use an apt-cacher-ng server

Cluster, Part 4: Weaving Wormholes | Peer-to-Peer VPN with WireGuard

(Above: The WireGuard and wesher logos. Background photo: from Unsplash by Clint Adair.)

Hey - welcome back! Last week, we set Unbound up as our primary DNS server for our network. We also configured cluster member devices to use it for DNS resolution. As a recap, here are links to the all the posts in this series so far:

In this part, we're going to set up a WireGuard peer-to-peer VPN. This is a good idea for several reasons:

  • It provides defence-in-depth
  • It protects non-encrypted / unprotected private services from the rest of the network

The latter point here is particularly important - especially if you've got other devices on your network like I have. If you're somehow following along with this series with devices fancy enough to have multiple network interfaces, you can connect the 2nd network interface of every server to a separate switch that doesn't connect to anywhere else. Don't forget that you'll need to set up a DHCP server on this new mini-network (or configure static IPs manually on each device, which I don't recommend) - but that's out of scope for this article.

In the absence of such an opportunity, a peer-to-peer VPN should do the trick just as well. We're going to be using WireGuard, which I discovered recently. It's very cool indeed - and it's apparently special enough to be merged directly into the Linux kernel in v5.6! It's been given high praise from security experts too.

What I like most is its simplicity. It follows the UNIX Philosophy, and as such, while it's very simple in the way it works, it can be used and applied in so many different ways to solve so many different problems.

With this in mind, let's get to installing it! If you're on an Ubuntu or Debian-based machine, then you should just be able to install it directly:

sudo apt install wireguard

Note that you might have to have your kernel's development headers installed if you experience issues. For Raspbian users (like me), installation is slightly more complicated. We'll need to set up the debian-backports apt repository to pull it in, since the Debian developers have backported it to the latest stable version of Debian (a bit like a hotfix) - but Raspbian hasn't yet pulled it in. This is done in 2 steps:

# Add the debian-backports GPG key
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 04EE7237B7D453EC 648ACFD622F3D138
# Add the debian-backports apt repo
echo 'deb http://httpredir.debian.org/debian buster-backports main contrib non-free' | sudo tee /etc/apt/sources.list.d/debian-backports.list

Now, we should be able to update to download the new repository metadata:

sudo apt update

Next, we need to install the Raspberry Pi Linux kernel headers. Unlike other distributions (which use the linux-headers-generic package), this is done in a slightly different way:

sudo apt install raspberrypi-kernel-headers

This might take a while. Once done, we can now install WireGuard itself:

sudo apt install wireguard

Again, this will take a while. Don't forget to pay close attention to the output - I've noticed that it's fond of throwing up error messages that it doesn't actually count as errors, ultimately claiming that it completed successfully.

With this installed, we can now set up our WireGuard peer-to-peer VPN. WireGuard itself works on a public-private keypair per-device setup. A device first generates a keypair, and then its public key needs copying to all other devices it wants to connect to. In this fashion, both a peer-to-peer setup (like we're after) and a client-server setup (like a more traditional VPN such as IPSec or OpenVPN) can be configured.
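
We won't be managing keys by hand here (that's what wesher is for, below), but for reference, generating a keypair with the stock WireGuard tools looks like this:

# Generate a private key and derive the matching public key from it
umask 077
wg genkey | tee privatekey | wg pubkey > publickey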

The overhead of configuring such a peer-to-peer WireGuard VPN is considerable though, since every time a device is added to the VPN every existing device needs updating to add the public key thereof to establish communications between the 2.

While researching an easier solution to the problem, I came across wesher, which does much of the heavy-lifting for you. It does of course come at the cost of slightly reduced security (since the entire VPN network is protected by a single pre-shared key) and reduced configurability, but from my experiences so far it works like a charm for my purposes - and it eases management too.

It is distributed as a single Go binary that uses the Raft Consensus Algorithm (the same library that Nomad and Consul use, actually) to manage cluster membership and provide self-healing properties. This makes it very easy for me to package into an apt package, which I've added to my apt repository.

The instructions to add my apt repository can be found on its page, but here they are directly:

# Add the repository
echo "deb https://apt.starbeamrainbowlabs.com/ ./ # apt.starbeamrainbowlabs.com" | sudo tee /etc/apt/sources.list.d/sbrl.list
# Import the signing key
wget -q https://apt.starbeamrainbowlabs.com/aptosaurus.asc -O- | sudo apt-key add -
# Alternatively, import the signing key from a keyserver:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys D48D801C6A66A5D8
# Update apt's cache
sudo apt update

Don't forget that if you're using a caching server like apt-cacher-ng (which we'll be setting up in the next post in this series), you'll probably want to change the https in the first command there to regular http in order for the caching to take effect. Note that this doesn't affect the security of the downloaded packages, since apt will verify the integrity and GPG signature of all packages downloaded. It only affects the privacy of the connection used to download the packages themselves (more on this in a future post).

Once setup, install wesher like so:

sudo apt install wesher

If you've got a systemd-based system (as I have sadly with Raspbian), I provide a systemd service file in a separate package:

sudo apt install wesher-systemd

Don't forget to perform these steps on all the machines you want to enter the cluster. Next, we need to configure our firewall. I'm using UFW - I hope you set up something similar when you first configured the servers you're clustering (I recommend UFW; note also that this tutorial series will not cover basics like this - instead, I'll link to other tutorials and such as I think of them). For UFW users, do this:

sudo ufw allow 7946 comment wesher-gossip
sudo ufw allow 51820/udp comment wesher-wireguard

Wesher requires 2 ports: 1 for the clustering traffic for managing cluster membership, and another for the WireGuard traffic itself. Now, we can initialise the cluster. This has to be done on an interactive terminal for the first time, to avoid leaking the cluster's pre-shared key to log files. Do it like this in a terminal:

sudo wesher

It should tell you that it's generated a new cluster key - it will save it in a private configuration directory. Save this somewhere safe, such as in your password manager. Now, you can press Ctrl + C to quit, and start the systemd service:

sudo systemctl enable --now wesher.service

It's perhaps good practice to check that the service has started successfully:

sudo systemctl status wesher.service

Having 1 node set up is nice, but not very useful. Adding additional nodes to the cluster is a bit different. Follow this tutorial up to and including the installation of the wesher and wesher-systemd packages, and then instead of just running sudo wesher, do this:

sudo wesher --cluster-key CLUSTER_KEY_HERE --join IP_OF_ANOTHER_NODE --overlay-net 172.31.250.0/16 --log-level info

...replacing CLUSTER_KEY_HERE with your cluster key (don't forget to prefix the entire command with a space to prevent it from entering your shell history file), and IP_OF_ANOTHER_NODE with the IP address of another node in the cluster on the external untrusted network. Note that the --overlay-net there is required because of the way I wrote the systemd service file in the wesher-systemd package. I did this:

[Unit]
Description=wesher - wireguard mesh builder
After=network-online.target
After=syslog.target
After=rsyslog.service

[Service]
EnvironmentFile=-/etc/default/wesher
ExecStart=/usr/local/sbin/wesher --overlay-net 172.31.250.0/16 --log-level info
Restart=on-failure
Type=simple
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=wesher

[Install]
WantedBy = multi-user.target

I explicitly specify the subnet of the VPN network here to avoid clashes with other networks in the 10.0.0.0/8 range. I don't have a network in that range, but I know that others do. Unfortunately there's currently a known bug that means IP address collisions may occur, and the cluster can't yet sort them out itself - so you need to use a pretty large subnet for now to reduce the chance of such collisions (here's hoping this one is patched soon).

Note that if you want to set additional configuration options, you can do so in /etc/default/wesher in the format VAR_NAME=VALUE - 1 per line. A full reference of the supported environment variables can be found here.
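
As an illustration, my understanding is that /etc/default/wesher could contain something like the following - but double-check the exact variable names against the wesher documentation, as I haven't tested all of them:

# /etc/default/wesher
WESHER_JOIN=IP_OF_ANOTHER_NODE
WESHER_LOG_LEVEL=info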

Anyway, once wesher has joined the cluster on the new node, press Ctrl + C to exit and then start and enable the systemd service as before (note that wesher saves all the configuration details to a configuration directory, so the cluster key doesn't need to be provided more than once):

sudo systemctl enable --now wesher.service
sudo systemctl status wesher.service

With this, you should have a fully-functional WireGuard peer-to-peer VPN set up with wesher. You can ask a node what IP address it has on the VPN with the following command:

ip address

The IP address should be shown next to the wgoverlay network interface.
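
A quick way to check that the mesh is actually carrying traffic is to ping another node over the VPN, using the hostname that wesher added to /etc/hosts (substitute one of your own node names):

ping -c 3 bobsrockets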

The last thing to do here is to configure our firewall. In my case, I'm using UFW, so the instructions I include here will be specific to UFW (if you use another firewall, translate these commands for your firewall, and comment below!).

In my specific case, I want to take the unusual step of allowing all traffic in on the VPN. The reason for this will become apparent in a future post, but in short Nomad dynamically allocates outward-facing ports for services. There's a good reason for this I'll get into at the time, but for now, this is how you'd do that:

sudo ufw allow in on wgoverlay

Of course, we can add override rules here that block traffic if we need to. Note that this only allows in all traffic on the wgoverlay network interface, and not on all network interfaces as the rule from my previous blog post about UFW would have done - i.e. like this:

sudo ufw allow 1234/tcp
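
As an example of such an override: say you later decide that nothing on the VPN should be able to reach port 8080 on this node (a port I've picked purely for illustration). Since UFW evaluates rules in order, the deny rule needs to sit above the broad allow rule we just added, so we insert it at the top:

# Insert the deny rule at position 1 so it's evaluated before the blanket allow rule
sudo ufw insert 1 deny in on wgoverlay to any port 8080 proto tcp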

In the next part, we'll take a look at setting up an apt caching server to improve performance and reduce Internet data usage when downloading system updates.

Found this useful? Spotted a mistake? Confused about something? Comment below! It really helps motivate me to write these posts.

Cluster, Part 3: Laying groundwork with Unbound as a DNS server

Looking up at the canopy of some trees

(Above: A picture from my wallpapers folder. If you took this, comment below and I'll credit you)

Welcome to another blog post about my cluster! Although I had to replace the ATX PSU I was planning on using to power the thing with a USB power supply instead, I've got all the Pis powered up and networked together now - so it's time to finally start on the really interesting bit!

In this post, I'm going to show you how to install Unbound as a DNS server. This will underpin the entire stack we're going to be setting up - and over the next few posts in this series, I'll take you through the entire process.

Starting with DNS is a logical choice, which comes with several benefits:

  • We get a minor performance boost by caching DNS queries locally
  • We get to configure our DNS server to make encrypted DNS-over-TLS queries on behalf of the entire network (even if some of the connected devices - e.g. phones - don't support doing so)
  • If we later change our minds and want to shuffle around the IP addresses, it's not such a big deal as if we used IP addresses in all our config files

With this in mind, I'm starting with DNS before moving on to Docker and the Hashicorp stack:

A picture of the homepages of the Hashicorp stack

Before we begin, let's set out our goals:

  • We want a caching resolver, to avoid repeated requests across the Internet for the same query
  • We want to encrypt queries that leave the network via DNS-over-TLS
  • We want to be able to add our own custom DNS records for a domain of our choosing, for internal resolution only.

The last point there is particularly important. We want to resolve something like server1.bobsrockets.com to 172.16.230.100 internally, but not externally outside the network. This way we can include server1.bobsrockets.com in config files, and if the IP changes then we don't have to go back and edit all our config files - just reload or restart the relevant services.

Without further delay, let's start by installing unbound:

sudo apt install unbound

If you're on another system, translate this for your package manager. See this amazing wiki page for help on translating package manager commands :-)

Next, we need to write the config file. For me, the default config file is located at /etc/unbound/unbound.conf:

# Unbound configuration file for Debian.
#
# See the unbound.conf(5) man page.
#
# See /usr/share/doc/unbound/examples/unbound.conf for a commented
# reference config file.
#
# The following line includes additional configuration files from the
# /etc/unbound/unbound.conf.d directory.
include: "/etc/unbound/unbound.conf.d/*.conf"

There's not a lot in the /etc/unbound/unbound.conf.d/ directory, so I'm going to be extending /etc/unbound/unbound.conf. First, we're going to define a section to get Unbound to forward requests via DNS-over-TLS:

forward-zone:
    name: "."
    # Cloudflare DNS
    forward-addr: 1.0.0.1@853
    # DNSlify - ref https://www.dnslify.com/services/resolver/
    forward-addr: 185.235.81.1@853
    forward-ssl-upstream: yes

The . name there simply means "everything". If you haven't seen it before, the fully-qualified domain name for seanssatellites.io for example is as follows:

seanssatellites.io.

Notice the extra trailing dot . there. That's really important, as it signifies the DNS root zone - the very top of the DNS hierarchy. The io bit is the top-level domain (commonly abbreviated as TLD). seanssatellites is the actual domain bit that you buy.

It's a hierarchical structure, and apparently used to be inverted here in the UK before the formal standard was defined by the IETF (Internet Engineering Task Force) - of which RFC 1034 was a part.

Anyway, now that we've told Unbound to forward queries, the next order of business is to define a bunch of server settings to get it to behave the way we want it to:

server:
    interface: 0.0.0.0
    interface: ::0

    ip-freebind: yes

    # Access control - default is to deny everything apparently

    # The local network
    access-control: 172.16.230.0/24 allow
    # The docker interface
    access-control: 172.17.0.1/16 allow

    username: "unbound"

    harden-algo-downgrade: yes
    unwanted-reply-threshold: 10000000


    prefetch: yes

There's a lot going on here, so let's break it down.

Property Meaning
interface Tells unbound what interfaces to listen on. In this case I tell it to listen on all interfaces on both IPv4 and IPv6.
ip-freebind Tells it to try and listen on interfaces that aren't yet up. You probably don't need this, so you can remove it. I'm including it here because I'm currently unsure whether unbound will start before docker, which I'm going to be using extensively. In the future I'll probably test this and remove this directive.
access-control unbound has an access control system, which functions rather like a firewall from what I can tell. I haven't had the time yet to experiment (you'll be seeing that a lot), but once I've got my core cluster up and running I intend to experiment and play with it extensively, so expect more in the future from this.
username The name of the user on the system that unbound should run as.
harden-algo-downgrade Protect against downgrade attacks when making encrypted connections. For some reason the default is to set this to no, so I enable it here.
unwanted-reply-threshold Another security-hardening directive. If this many DNS replies are received that unbound didn't ask for, then take protective actions such as emptying the cache, just in case of a DNS cache poisoning attack.
prefetch Causes unbound to prefetch updated DNS records for cache entries that are about to expire. Should improve performance slightly.

If you have a flaky Internet connection, you can also get Unbound to return stale DNS cache entries if it can't reach the remote DNS server. Do that like this:

server:
    # Serve expired cached responses, but only after a failed 
    # attempt to fetch from upstream, and 10 seconds after 
    # expiration. Retry every 10s to see if we can get a
    # response from upstream.
    serve-expired: yes
    serve-expired-ttl: 10
    serve-expired-ttl-reset: yes
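
Before (re)starting Unbound, it's worth checking the config file for syntax errors. Handily, the unbound package ships with a tool for exactly this:

sudo unbound-checkconf
# ....or point it at a specific file:
sudo unbound-checkconf /etc/unbound/unbound.conf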

With this, we should have a fully-functional DNS server. Enable it to start on boot and (re)start it now:

sudo systemctl enable unbound.service
sudo systemctl restart unbound.service

If it's not already started, the restart action will start it.
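
To check that it's actually answering queries, point dig at it (dig lives in the dnsutils package on Debian-based systems) - a query like this should spit out an IP address:

# Ask our new DNS server directly to resolve an external domain
dig +short @127.0.0.1 starbeamrainbowlabs.com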

Internal DNS records

If you're like me and want some custom DNS records, then continue reading. Unbound has a pretty nifty way of declaring custom internal DNS records. Let's enable that now. First, you'll need a domain name that you want to return custom internal DNS records for. I recommend buying one - don't use an unregistered one, just in case someone else comes along and registers it.

Gandi are a pretty cool registrar - I can recommend them. Cloudflare are also cool, but they don't allow you to register several years at once yet - so they are probably best set as the name servers for your domain (that's a free service), leaving your domain name with a registrar like Gandi.

To return custom DNS records for a domain name, we need to tell unbound that it may contain private DNS records. Let's do that now:

server:
    private-domain: "mooncarrot.space"

This of course goes in /etc/unbound/unbound.conf, as before. See the bottom of this post for the completed configuration file.

Next, we need to define some DNS records:

server:
    local-zone: "mooncarrot.space." transparent
    local-data: "controller.mooncarrot.space.   IN A 172.16.230.100"
    local-data: "piano.mooncarrot.space.        IN A 172.16.230.101"
    local-data: "harpsichord.mooncarrot.space.  IN A 172.16.230.102"
    local-data: "saxophone.mooncarrot.space.    IN A 172.16.230.103"
    local-data: "bag.mooncarrot.space.          IN A 172.16.230.104"

    local-data-ptr: "172.16.230.100 controller.mooncarrot.space."
    local-data-ptr: "172.16.230.101 piano.mooncarrot.space."
    local-data-ptr: "172.16.230.102 harpsichord.mooncarrot.space."
    local-data-ptr: "172.16.230.103 saxophone.mooncarrot.space."
    local-data-ptr: "172.16.230.104 bag.mooncarrot.space."

The local-zone directive tells it that we're defining custom DNS records for the given domain name. The transparent bit tells it that if it can't resolve using the custom records, to forward it out to the Internet instead. Other interesting values include:

Value Meaning
deny Serve local data (if any), otherwise drop the query.
refuse Serve local data (if any), otherwise reply with an error.
static Serve local data, otherwise reply with an nxdomain or nodata answer (similar to the responses you'd expect from a DNS server that's authoritative for the domain).
transparent Respond with local data, but resolve other queries normally if the answer isn't found locally.
redirect Serves the zone data for any subdomain in the zone.
inform The same as transparent, but logs client IP addresses
inform_deny Drops queries and logs client IP addresses

Adapted from /usr/share/doc/unbound/examples/unbound.conf, the example Unbound configuration file.

Others exist too if you need even more control, like always_refuse (which always responds with an error message).

The local-data directives define the custom DNS records we want Unbound to return, written in standard DNS resource record syntax (the same syntax you'd use in a zone file). The local-data-ptr directive is a shortcut for defining PTR, or reverse DNS records - which resolve IP addresses to their respective domain names (useful for various things, but also commonly used as a step to verify email servers - comment below and I'll blog on lots of other shady and not so shady techniques used here).

With that, our configuration file is complete. Here's the full configuration file in its entirety:

# Unbound configuration file for Debian.
#
# See the unbound.conf(5) man page.
#
# See /usr/share/doc/unbound/examples/unbound.conf for a commented
# reference config file.
#
# The following line includes additional configuration files from the
# /etc/unbound/unbound.conf.d directory.
include: "/etc/unbound/unbound.conf.d/*.conf"

server:
    interface: 0.0.0.0
    interface: ::0

    ip-freebind: yes

    # Access control - default is to deny everything apparently

    # The local network
    access-control: 172.16.230.0/24 allow
    # The docker interface
    access-control: 172.17.0.1/16 allow

    username: "unbound"

    harden-algo-downgrade: yes
    unwanted-reply-threshold: 10000000

    private-domain: "mooncarrot.space"

    prefetch: yes

    # ?????? https://www.internic.net/domain/named.cache

    # Serve expired cached responses, but only after a failed 
    # attempt to fetch from upstream, and 10 seconds after 
    # expiration. Retry every 10s to see if we can get a
    # response from upstream.
    serve-expired: yes
    serve-expired-ttl: 10
    serve-expired-ttl-reset: yes

    local-zone: "mooncarrot.space." transparent
    local-data: "controller.mooncarrot.space.   IN A 172.16.230.100"
    local-data: "piano.mooncarrot.space.        IN A 172.16.230.101"
    local-data: "harpsichord.mooncarrot.space.  IN A 172.16.230.102"
    local-data: "saxophone.mooncarrot.space.    IN A 172.16.230.103"
    local-data: "bag.mooncarrot.space.          IN A 172.16.230.104"

    local-data-ptr: "172.16.230.100 controller.mooncarrot.space."
    local-data-ptr: "172.16.230.101 piano.mooncarrot.space."
    local-data-ptr: "172.16.230.102 harpsichord.mooncarrot.space."
    local-data-ptr: "172.16.230.103 saxophone.mooncarrot.space."
    local-data-ptr: "172.16.230.104 bag.mooncarrot.space."

    fast-server-permil: 500

forward-zone:
    name: "."
    # Cloudflare DNS
    forward-addr: 1.0.0.1@853
    # DNSlify - ref https://www.dnslify.com/services/resolver/
    forward-addr: 185.235.81.1@853
    forward-ssl-upstream: yes
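
After restarting Unbound to pick up the changes, it's also worth checking that the custom records - and their reverse counterparts - resolve as you'd expect (swap in your own names and IP addresses, of course):

# Forward lookup of a custom internal record
dig +short @127.0.0.1 controller.mooncarrot.space
# Reverse (PTR) lookup of the matching IP address
dig +short @127.0.0.1 -x 172.16.230.100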

Where do we go from here?

Remember that it's important to not just copy and paste a configuration file, but to understand what every single line of it does.

In a future post in this series, we'll be revising this to forward requests for *.service.mooncarrot.space to Consul, a clustered service discovery system that keeps track of what is running where and presents a DNS server as an interface (there are others).

In the next post, we'll (probably) be looking at setting up Consul - unless something else happens first :P Nomad should be next after that, followed closely by Vault.

Once I've got all that set up (which is a blog post series in and of itself!), I'll then look at encrypting all communications between all nodes in the cluster. After that, we'll (finally?) get to Docker and my plans there. Other things include apt and apk (the Alpine Linux package manager) caching servers - which will have to be tackled separately.

Oh yeah, and I want to explore Traefik, which is apparently like Nginx, but intended for containerised infrastructure.

All this is definitely going to keep me very busy!

Found this interesting? Got a suggestion? Comment below!


Measuring maximum RAM usage with Bash on Linux

While running a simulation on my University's Viper HPC, I found that I needed to measure the maximum RAM usage of a simulation that I was running. Since the solution wasn't particularly easy to find, I thought I'd quickly blog about it here.

Doing this isn't actually as easy as you might think. In the end, I used this:

/usr/bin/time --format 'Max RAM working set size: %Mk' command_here --foo bar

....replacing command_here --foo bar with the command you want to measure.

/usr/bin/time is a program that measures how long a command takes to execute, but it's evident that it measures a bunch of other different things as well. While the output format leaves something to be desired (hence the --format in the above), it does the job pretty well.

Note that /usr/bin/time is distinct from the time built-in you get in Bash. Depending on your shell, you may need to explicitly specify the full path to the time binary. In addition, if your system doesn't have the command (like Viper), you may need to copy the binary over from another system that does have it - just make sure it has the same CPU architecture.
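
The %M there is just one of the format specifiers that GNU time understands - %e (elapsed wall-clock time in seconds) and %P (percentage of CPU used) are handy too. For example:

/usr/bin/time --format 'Elapsed: %es, CPU: %P, Max RAM working set size: %Mk' command_here --foo bar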

I forget where it was that I found this solution, but if you comment below then I'll add the credit to this post if your post looks familiar.

As a quick extra, you can limit the time a command can execute for like this:

timeout 60 command_here

That will limit command_here to executing for only 60 seconds.
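
The two tools compose nicely as well. If you want both a time limit and the RAM measurement, wrap one in the other - the resource figures propagate up through the process tree, so this should still report the wrapped command's peak RAM usage (though it's worth sanity-checking on your own system):

# Kill command_here after 60 seconds, and report its peak RAM usage when it exits
/usr/bin/time --format 'Max RAM working set size: %Mk' timeout 60 command_here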

Found this interesting? Comment below!

How to hash and sign files with GPG and a bit of Bash

When making a release of your software or sending some important documents, it's pretty common practice (especially amongst larger projects) to distribute hashes and GPG signatures along with the release binaries themselves - Hashicorp's Nomad, which we'll use as an example below, is one project that does this.

This is great practice, since it allows downloaders to verify that their download has not been corrupted, and that it was you who released them and not some imposter.

In this post, I'm going to outline how you can do this too.

I've recently both verified a number of signatures and generated some of my own, so I thought I'd write up the process here.

Verification

Before we get into generating hashes and signatures, we should talk about verifying them - I've mentioned already that this is good practice, so it makes sense to cover verification first. Let's download Nomad version 0.10.5. You can grab the files here. Download the following files:

  • nomad_0.10.5_SHA256SUMS - The hashes themselves
  • nomad_0.10.5_SHA256SUMS.sig - The GPG signature of the above file
  • nomad_0.10.5_linux_arm.zip - Nomad itself, for the ARM architecture (feel free to pick whichever one you like and adapt these instructions accordingly)

Verifying the hashes is easier, so let's do that first. We can see from the filenames that we have SHA-256 hashes, so we'll want the sha256sum command. Windows users will need to use the Windows Subsystem for Linux, or set up an msys environment:

sha256sum --ignore-missing --check nomad_0.10.5_SHA256SUMS

It should output an OK message and return an exit code of zero (to check this in a script, you can do echo "$?" directly after running it to check the exit code). 99% of the time this check will succeed, but you'll be glad you checked on the 1% of occasions when it fails.
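
If you're verifying in a script rather than interactively, you probably want to bail out when the check fails instead of eyeballing the output. A minimal sketch:

if sha256sum --ignore-missing --check nomad_0.10.5_SHA256SUMS; then
    echo "Hashes check out";
else
    echo "Hash verification FAILED - do not use this download!" >&2;
    exit 1;
fi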

Checking the hashes here is the bit that ensures that the files haven't been corrupted. Next, we'll verify the GPG signature. This is the bit that ensures that the files we've downloaded were actually originally released by who we think they were. Do that like this:

gpg --verify nomad_0.10.5_SHA256SUMS.sig

Doing this, you may get an error message telling you that it can't verify the signature because you haven't got the public key of the signer imported into your keyring. To remedy this, look for the bit that tells you the key id (e.g. using RSA key 51852D87348FFC4C). Copy it, and then do this:

gpg --recv-keys 51852D87348FFC4C

This will download it and import the key id into your local GPG keyring. Then re-run the gpg --verify command above, and it should work.
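
For extra peace of mind, it's also worth comparing the key's full fingerprint against the fingerprint the project publishes on their website (the full fingerprint is a much stronger check than the short key id alone). You can display it like so:

gpg --fingerprint 51852D87348FFC4C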

Generation

Now that we know how to verify a signature, let's generate our own. First, put the files you want to hash into a directory and cd into it in your terminal. Then, let's generate the hash file:

# Hash files
find . -type f -not -name "*.SHA256" -print0 | xargs -0 -n1 -I{} -P"$(nproc)" sha256sum -b "{}" >HASHES.SHA256

We use find here to locate all the files (other than the hash file itself), and then pass them to xargs, which calls sha256sum to hash the files in question. Finally, we write the hashes to HASHES.SHA256.

Next, let's generate a GPG signature for the hash file. For this, you'll need a GPG key. That's out of scope of this post really, but this tutorial looks like it will show you how to do it. Note that in order for other people to verify the GPG signature you create, you'll probably need to upload your GPG public key to a keyserver (the article I link to shows you how to do this too).

Once done, generate the signature like this:

# Sign the hashes
gpg --sign --detach-sign --armor HASHES.SHA256

Specifically, we generate a detached signature here - meaning that it's in a separate file to the file that is being signed. The --armor there wraps the signature in ASCII armour - essentially base64 encoding with a plain-text header and footer - so that it's not a raw binary file that might confuse the uninitiated.

Finally, let's verify the signature we just created - just in case:

# Verify the signature, and check we used the right key
gpg --verify HASHES.SHA256.asc

If all's good, it should tell you that the signature is ok. If you've got multiple keys, ensure that you signed it with the correct key here. GPG will sign things with the key you have marked as the default key.
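
If GPG picks the wrong key, you can tell it explicitly which one to use with the --local-user (-u) option. For example (the email address here is just a placeholder for whichever of your keys you want to sign with):

gpg --sign --detach-sign --armor --local-user "you@example.com" HASHES.SHA256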

Found this interesting? Having trouble? Want to say hi? Comment below!

Testing storage devices with f3

(Above: Some microSD cards. Thankfully none of these are fake, but you never know.....)

Always test storage devices after you buy them. I don't just mean check to see if they work (though that's a good idea too), but also that they can actually store the amount of stuff that they advertise they can.

Recently, I bought myself 5 64GB microSD cards for my cluster (more on this very soon in a future blog post!). The first thing I did when I got them was test them to make sure that they could actually store 64GB of stuff. My tool of choice was f3, which stands for Fight Flash Fraud or Fight Fake Flash. I'm glad I did - because 3 of them turned out to be faulty. 2 of them were actually 32GB cards in disguise, and 1 of them wouldn't mount at all.

While this might be my first experience with fake or faulty storage devices, it's hardly an uncommon occurrence. Everything from microSD cards to flash drives - and even regular hard drives! - may be faulty upon arrival - or, worse, appear fine at first and then a few months down the line start corrupting random data for no reason.

f3 is a suite of tools for testing storage devices to make sure they function properly. They work best as a destructive test - i.e. one that destroys existing data on the disk - so if you've got some data on the target disk you want to test, now is the time to back it up (hopefully this is something you've been doing already - more on that in another post if there's the demand).

f3 consists of 3 principle tools:

  • f3probe, which runs a fast test to check for issues (sadly I couldn't get this to work reliably)
  • f3write, which fills a disk with test files
  • f3read, which reads the test files back from disk and validates them

It's a real shame that I can't get f3probe to work reliably. Maybe at some point I'll implement my own version that writes data to every nth block of a device, to test it more quickly than the f3write/f3read mechanism I'll explain below (if anyone knows of a better tool that works on Linux, please comment below!)

To test a device, you first need to write the test files to it. I've taken to reformatting the device as ext4 (the Linux filesystem) first:

sudo umount /dev/sdXY; # Unmount it if it's currently mounted
sudo mkfs.ext4 /dev/sdXY; # Format it to ext4

....where /dev/sdXY is the partition you want to format. This isn't mandatory, but it is a quick way of making sure a disk is empty.

Next, we need to write the test files to the device. If it isn't already, you'll need to mount it first. This can be done like so:

# If it's not mounted automatically:
sudo mkdir /media/YOUR_USERNAME_HERE/SOME_NAME_HERE;
sudo mount /dev/sdXY /media/YOUR_USERNAME_HERE/SOME_NAME_HERE;
f3write /media/YOUR_USERNAME_HERE/SOME_NAME_HERE

This might take a while - don't forget to replace the paths there with those specific to your setup. With the test files written to the disk, we need to read them back again to make sure they are valid:

f3read /media/YOUR_USERNAME_HERE/SOME_NAME_HERE

This will read them all back again, and then print a summary report at the bottom to tell you what it found. Ideally, it should show a big number of blocks as succeeded, and no blocks in any of the other failure categories.

Running multiple commands like this is effort though, so surely we can do better than this. With some simple shell scripting, we can run both commands at once:

location=/media/YOUR_USERNAME_HERE/SOME_NAME_HERE; f3write "${location}" && f3read "${location}"; alert

If you're on a machine with a graphical desktop, then the ; alert bit on the end should generate a desktop notification when it's done. For other users (e.g. over SSH), this should be removed. Just in case you have a graphical desktop (e.g. Ubuntu Desktop) and the alert bit doesn't work for you, append this to your ~/.bashrc file and restart your terminal:

# Add an "alert" alias for long running commands.  Use like so:
#   sleep 10; alert
alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'

....I forget where this is from exactly.

If you're not likely to be at your computer when it finishes, then there's still something you can do. Personally I use XMPP for personal messaging, so I thought it would be great if I could get a notification when it was done. Since I've already written xmppbridge for easily sending XMPP messages from the terminal, it was pretty trivial to write a shell script for my bin folder that would send my a message when the process was complete:

#!/usr/bin/env bash

# f3test: Runs f3 on the current directory.
# 
# Usage:
#     f3test "alerts@xmpp.example.com"
# 

destination="$1";

f3write .;
f3read .;

echo "Card testing complete in ${SECONDS}s" | xmppbridge --groupchat --destination "${destination}";

I called this script f3test, and put it in my ~/bin folder. To use it, first cd to the root of the device you want to test (/media/YOUR_USERNAME_HERE/SOME_NAME_HERE in the above examples), and then set a pair of environment variables to let it know how to log in to an XMPP account to send a message:

export XMPP_JID="someone@bobsrockets.com"; # The JID to login with.
export XMPP_PASSWORD="weN33dM0reBoost3rs"; # The password to use when logging in

...remove the --groupchat in the script if it's not a group chat you want it to send a message to (I have a personal group chat that's just between me and various bots that notify me about various aspects of the systems I manage). If you don't have an XMPP account yet, you can get one at any public server in the XMPP directory, or run your own (see also Snikket, which is a distribution of Prosody that's designed to be extremely easy to set up & run)!

Of course, you could just as easily swap the xmppbridge call there with a different command to send a message via a different channel. For example mailx can send emails.

Found this interesting? Got a better tool? Need some help? Comment below!

Installing libonig4 from source to fix php7.4-mbstring

I have several Raspberry Pis. The one I'd like to talk about today though is a 3B+, and for 1 reason or another it has PHP installed on it from the excellent deb.sury.org apt PPA for PHP. Recently, I've upgraded to PHP 7.4. This was fine initially, but soon enough I started to get a warning that php-mbstring couldn't be installed and that I had held broken packages.

This was not a good sign, but after doing some digging it transpired that the package libonig4 was missing - and couldn't be installed because it wasn't available in the Raspbian apt repositories. Awkward.

After doing some quick digging into the Ubuntu apt repositories, I discovered that while it does exist, it isn't built for armhf (the architecture of the Raspberry Pi).

Thankfully though, Ubuntu is open-source - so the source package was available. The Debian tooling makes it relatively easy to build source packages once downloaded too. Unfortunately I couldn't use the apt-get source command to download it as I didn't have an Ubuntu machine to hand, but their website makes it easy to download packages:

https://packages.ubuntu.com/bionic/libonig4

On here, you'll want to download the 3 source package files:

The source package download page

Download them to a new directory. Then, extract the source files like so:

cd path/to/directory;
dpkg-source -x *.dsc;

Next, cd into the created directory, and build the source files into a bunch of .deb files:

cd libonig-6.7.0/;
dpkg-buildpackage --no-sign;

The --no-sign there is necessary, because otherwise I encountered errors where it tried to automatically sign the resulting package with the original author's secret key, which we obviously don't have access to!
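
Note that dpkg-buildpackage also needs the usual Debian build tooling installed, along with the package's own build dependencies (listed in its debian/control file). If the build bails out complaining about missing packages, something along these lines should help - the exact list it asks for may differ on your system:

# Install the general-purpose build tools (dpkg-buildpackage itself comes from dpkg-dev)
sudo apt install build-essential debhelper;
# Run from inside the source directory to list any build dependencies still missing
dpkg-checkbuilddeps;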

Once done (it might take a moment), a bunch of .deb files will be generated in the parent directory:

Filename Description
libonig4_6.7.0-1_armhf.deb The actual package itself
libonig4-dbgsym_6.7.0-1_armhf.deb Debugging symbols generated in the build process
libonig-dev_6.7.0-1_armhf.deb Development headers (in case you need to build another package against it)

Out of these 3, the top and bottom ones are probably the ones you want to install. This can be done like so:

sudo dpkg -i libonig4_6.7.0-1_armhf.deb;
sudo dpkg -i libonig-dev_6.7.0-1_armhf.deb;
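
If dpkg complains about unmet dependencies at this point, apt can usually be persuaded to pull them in and finish the configuration:

sudo apt --fix-broken install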

This completes the process. Now, we can install php7.4-mbstring as normal:

sudo apt install php7.4-mbstring

Success! This should solve the problem. I figured this out in part by following a Unix Stackexchange answer that I have since lost, but I had to adapt the instructions significantly - so I decided to blog about it here.

Found this useful? Still encountering issues? Comment below!

Art by Mythdael