How (not) to recover a consul cluster
Hello again! I'm still getting used to a new part-time position at University, which I'm not quite ready to talk about yet, but in the meantime please bear with me as I shuffle my schedule around.
As I've explained previously on here, I have a Consul cluster (superglue service discovery!) that forms the backbone of my infrastructure at home. Recently, a small power cut knocked everything offline, and as the recovery process was quite interesting I thought I'd blog about it here.
The power cut happened at about 5pm, but I didn't discover the problem until a few hours later when I got home. While my NAS rebooted automatically afterwards, my collection of Raspberry Pis wasn't so lucky. I can only suspect that they were caught in some transient state. None of them responded when I pinged them, and later inspection of the logs on my collectd instance revealed that they were essentially non-functional until they were rebooted manually.
A side effect of this was that my Consul cluster (and, by extension, my Nomad cluster) was knocked offline.
Anyway, at first I rebooted only the controller host (which has both a Consul and a Nomad server running on it, but does not accept and run jobs). This rebooted just fine and came back online, so I then rebooted my monitoring box (which also runs continuous integration), and that came back online too.
Due to the rather awkward physical location I keep my cluster in along with the rest of the Pis, I decided to flip the power switch on the extension lead and restart all of my hosts at the same time.
While this worked... it also caused my cluster controller node to reboot again, which caused its Raft election term (epoch) number to increment one more time than the rest of the cluster's. This broke the quorum (agreement) of my cluster, and required manual intervention to resolve.
Raft quorum
To understand the specific issue here, we need to look at the Raft consensus algorithm. Raft is, as the name suggests, a consensus algorithm. Such an algorithm is useful when you have a cluster of servers that need to work together in a redundant, fault-tolerant fashion on some common task - in our case, Consul (service discovery) and Nomad (task scheduling).
The purpose of a Raft server is to maintain agreement amongst all nodes in a cluster as to the global state of an application. It does this using a distributed log that it replicates through a fancy but surprisingly simple algorithm.
At the core of this algorithm is the concept of a leader. The cluster leader is responsible for managing and committing updates to the global state, as well as sending out the global state to everyone else in the cluster. In the case of Consul, the Consul servers are the cluster (the clients simply connect back to whichever servers are available) - and I have 3 of them, since Raft needs a strict majority of nodes to agree on anything, which is why an odd number of servers is recommended. With 3 servers, any 2 of them form a majority, so the cluster can tolerate the loss of 1 server.
When the cluster first starts up, or the leader develops a fault (e.g. someone sets off a fork bomb on it just for giggles), an election occurs to decide on a new leader. The election term number (or epoch number) is incremented by one, and everyone votes on who the new leader should be. The node that receives votes from a majority of the cluster becomes the new leader, and quorum (agreement) is achieved across the entire cluster.
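To make that voting rule a bit more concrete, here's a deliberately over-simplified sketch in Go (the language Consul itself happens to be written in). It ignores networking, timeouts, and log comparison entirely, and all the names are hypothetical - it illustrates the majority-within-a-term idea described above, not Consul's actual implementation:

```go
package main

import "fmt"

// vote records whether a peer granted its vote, and in which election term.
type vote struct {
	term    int
	granted bool
}

// wonElection reports whether a candidate gathered votes from a strict
// majority of the cluster, counting only votes cast in the candidate's
// own election term.
func wonElection(candidateTerm, clusterSize int, votes []vote) bool {
	granted := 1 // a candidate always votes for itself
	for _, v := range votes {
		if v.term == candidateTerm && v.granted {
			granted++
		}
	}
	return granted > clusterSize/2
}

func main() {
	// A 3-server cluster: the candidate plus 2 peers. One peer's term has
	// run ahead (much like my rebooted controller node), so its vote does
	// not count towards this term - and the election fails.
	votes := []vote{
		{term: 6, granted: true},  // term mismatch: vote discarded
		{term: 5, granted: false}, // voted for a different candidate
	}
	fmt.Println(wonElection(5, 3, votes)) // false: only 1 valid vote out of 3
}
```

In a healthy 3-server cluster, at least 2 servers end up voting in the same term, the majority check passes, and a leader is elected.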
Consul and Raft
In the case of Consul, every server must cast a vote for the election to be considered valid; otherwise, the election is invalid and the process must begin again. Crucially, the election term number must also be the same across everyone voting.
In my case, because I started my cluster controller and then rebooted it before it had a chance to achieve quorum, it incremented its election term number one additional time compared to the rest of the cluster. This caused the cluster to fail to reach quorum: the other 2 nodes in the Consul server cluster considered the controller node's vote to be invalid, yet they still demanded that all servers vote to elect a new leader.
The practical effect of this was that, because the Consul cluster failed to agree on who the leader should be, the Nomad cluster (which hangs off the Consul cluster, using it so that its servers can find each other) also failed to start and reach quorum, which knocked all my jobs offline.
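As an aside, if you ever need to diagnose something like this yourself, Consul has a couple of built-in commands for inspecting the Raft state of any servers you can still reach: consul operator raft list-peers shows the servers the cluster knows about and their roles, and the raft section of consul info includes the node's current term and state (leader, follower, or candidate).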
The solution
Thankfully, the HashiCorp Consul documentation for this specific issue is fabulous:
To summarise:
- Boot the cluster as normal if it isn't booted already
- Stop the failed node
- Create a special config file (raft/peers.json) that will cause the failed node to drop its state and accept the state of the incomplete cluster, allowing it to rejoin and the cluster to regain collective quorum once more (see the example below)
The documentation for this recovery procedure is quite clear. While there is an option to recover a failed node if you still have a working cluster with a leader, in my case I didn't have one, so I had to take the alternative route.
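For reference, here's roughly what a raft/peers.json file looks like when the servers are running Raft protocol version 3 (the default in recent Consul releases). The IDs and addresses below are just placeholders for illustration - each id needs to match the node-id of the server it refers to (Consul keeps this in a node-id file inside its data directory), and address is the server's Raft/RPC endpoint (port 8300 by default):

```json
[
  {
    "id": "b7b44a9d-0316-4c3f-8c73-1f9d2f6f5a10",
    "address": "10.1.0.10:8300",
    "non_voter": false
  },
  {
    "id": "4a3c6f2e-9d0b-4f7d-b1f0-2e9c8a7d6c5b",
    "address": "10.1.0.11:8300",
    "non_voter": false
  },
  {
    "id": "0f1e2d3c-4b5a-4c6d-8a9b-acbdcedfe011",
    "address": "10.1.0.12:8300",
    "non_voter": false
  }
]
```

If your servers are still on the older Raft protocol version 2, the file is instead just a flat JSON array of "ip:port" strings.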
Conclusion
I've talked briefly about an interesting issue that caused my Consul cluster to break quorum, which inadvertently brought my entire infrastructure down until I resolved the issue.
While Consul is normally really quite resilient, you can break it if you aren't careful. Having an understanding of Raft, the underlying consensus algorithm, is very helpful when diagnosing and resolving issues - though the error messages and documentation I looked through were generally clear and helpful.