Considerations on monitoring infrastructure
I like Raspberry Pis. I like them so much that, by my count, I have at least 8 in operation at the time of typing, performing various functions for me - including a cluster for running various services.
Having Raspberry Pis and running services on servers is great, but once you have some infrastructure set up hosting something you care about, your thoughts naturally turn to mechanisms by which you can ensure that such infrastructure continues to run without incident - and that, if problems do occur, they can be diagnosed and fixed efficiently.
Such is the thought that is always on my mind when managing my own infrastructure, which sprawls across multiple physical locations. To this end, I thought I'd blog about what my monitoring system looks like - what its strengths are, and what it could do better.
A note before we begin: I continue to have a long-term commitment to posting on this blog - I have just started a part-time position alongside my PhD due to the end of my primary research period, which has been taking up a lot of my mental energy. Things should get slowly back to normal soon-ish.
Keep in mind as you read this that my situation may be different to your own. For example, monitoring a network primarily consisting of Raspberry Pis demands a very different approach than an enterprise setup does (if you're looking for a monitoring solution for a bunch of big powerful servers, I've heard the TICK stack is a good place to start).
Monitoring takes many forms and purposes. Broadly speaking, I split the monitoring I have on my infrastructure into the following categories:
- Logs (see my earlier post on Centralising logs with rsyslog)
- System resources (e.g. CPU/RAM/disk usage, etc.) - I use collectd for this
- Service health - I use Consul for my cluster, and Uptime Robot for this website
- Server health (e.g. whether a server is down or not, hanging due to a bad mount, etc.)
I've found that because there are multiple categories of things that need monitoring, there isn't a single one-size-fits-all solution to the problem - different tools are needed to monitor different things.
Logs - centralised rsyslog
At the moment, monitoring logs is a solved problem for me. I've talked about my setup previously, in which I have a centralised rsyslog server that receives and stores all logs from my entire infrastructure (barring a few select boxes I still need to enrol in this system). Storing logs nets me 2 things:
- The ability to reference them later (e.g. with lnav) in the event of an issue, for diagnostic purposes
- The ability to inspect the logs during routine maintenance for any anomalies, issues, or errors that might become problematic later if left unattended
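For anyone curious what the client side of this looks like, forwarding everything to a central box boils down to a single rsyslog rule - something like the following sketch, where the hostname and port are placeholders rather than my real setup:

```
# Hypothetical /etc/rsyslog.d/49-forward.conf on each client
# *.* matches every facility and priority; @@ forwards over TCP (a single @ would use UDP)
*.*  @@logs.example.internal:514
```

The full details are in the earlier post linked above.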
System information - collectd
Similarly, storing information about system resource usage - such as CPU load or disk usage for instance - is more useful than you'd think for spotting and pinpointing issues with one's infrastructure - be it a single server or an entire fleet. In my case, this also includes monitoring network latency (useful should my ISP encounter issues, as then I can identify if it's a me or a them problem) and HTTP response times.
For this, I use collectd, backed by rrd (round-robin database) files. These are fixed-size files containing ring buffers that it iteratively writes over, allowing efficient storage of up to 1 year's worth of history.
To visualise this in the browser, I use Collectd Graph Panel, which is unfortunately pretty much abandonware (I haven't found anything better).
To start with the strengths of this system, it's very computationally efficient. I have tried previously to setup a TICK (Telegraf, InfluxDB, Chronograf, and Kapacitor) stack on a Raspberry Pi, but it was way too heavy - especially considering the Raspberry Pi my monitoring system runs on is also my continuous integration server. Collectd, on the other hand, runs quietly in the background, barely using any resources at all.
Another strength is that it's easy and simple. You throw a config file at it (which could be easily standardised across an entire fleet of servers), and collectd will dutifully send encrypted system metrics to a given destination for you with minimal fuss. Meanwhile, the browser-based dashboard I use automatically plots graphs and displays them for you without any tedious creation of a custom dashboard.
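To give a flavour of what that config file looks like, here's a minimal client-side sketch that gathers a few basic metrics (plus network latency via the ping plugin) and ships them, encrypted, to a central server. The hostname, credentials, and ping target are placeholders, not my real config:

```
# Hypothetical snippet from /etc/collectd/collectd.conf on a client
LoadPlugin cpu
LoadPlugin memory
LoadPlugin df
LoadPlugin load

# Measure network latency to an upstream host
LoadPlugin ping
<Plugin ping>
    Host "1.1.1.1"
</Plugin>

# Ship everything to the central monitoring box, encrypted
LoadPlugin network
<Plugin network>
    <Server "monitor.example.internal" "25826">
        SecurityLevel "Encrypt"
        Username "metrics"
        Password "changeme"
    </Server>
</Plugin>
```

Roughly speaking, the central server then just needs the network plugin in Listen mode and the rrdtool plugin to write the incoming metrics to disk.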
Having a system monitor things is good, but having it notify you in the event of an anomaly is even better. While collectd does have the ability to generate and send notifications, its capacity to do this is unfortunately rather limited.
Another limitation of collectd is that accessing and processing the stored system metrics data is not a trivial process, since it's stored in rrd databases, the parsing of which is surprisingly difficult due to a lack of readily available libraries to do this. This makes it difficult to integrate it with other systems, such as n8n for example, which I have recently setup to replace some functions of IFTTT to automatically repost my blog posts here to Reddit and Discord.
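A partial workaround is shelling out to the rrdtool CLI, which can at least dump the consolidated data points from a given file - the path below is illustrative of collectd's usual hostname/plugin/type layout rather than copied from my setup:

```
# Print the last hour of data points from an rrd file
rrdtool fetch /var/lib/collectd/rrd/pi-worker1/load/load.rrd AVERAGE --start -1h
```

That still leaves you parsing text output though, which is hardly ideal for feeding into something like n8n.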
Collectd can, however, write to multiple destinations (e.g. MQTT), so I might look into this as an option for connecting it to some other program to deliver more flexible notifications about issues.
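If I go down that route, collectd's mqtt plugin can publish each metric to a broker as it arrives, along these lines (the broker details here are placeholders):

```
# Hypothetical addition to collectd.conf on the monitoring server
LoadPlugin mqtt
<Plugin mqtt>
    <Publish "monitoring">
        Host "mqtt.example.internal"
        Port 1883
        Prefix "collectd"
    </Publish>
</Plugin>
```

Anything that can subscribe to the broker could then watch the metric stream and fire off more flexible notifications when something looks off.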
Service health
Service health is probably what most people thought of when I said this blog post would be about monitoring. In many ways, it's one of the most important things to monitor - especially if other people rely on infrastructure which is managed by you.
Currently, I achieve this in 2 ways. Firstly, for services running on the server that hosts this website I have a free Uptime Robot account which monitors my server and website. It costs me nothing, and I get monitoring of my server from a completely separate location. In the event that my server or the services it monitors go down, I get an email telling me as much - and another email once everything comes back up again.
Secondly, for services running on my cluster I use Consul's inbuilt service monitoring functionality, though I don't yet have automated emails to notify me of failures (something I need to investigate a solution for).
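For reference, registering a health check with Consul is just a matter of adding a check block to the service definition. The service name, port, and endpoint below are made up for illustration:

```
{
  "service": {
    "name": "webapp",
    "port": 3000,
    "check": {
      "http": "http://localhost:3000/health",
      "interval": "30s",
      "timeout": "5s"
    }
  }
}
```

Consul then marks the service as critical if the endpoint stops returning a 2xx response, and the result shows up in its web UI and API - it's the "email me about it" step that I'm still missing.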
The monitoring system you choose here depends on your situation, but I strongly recommend having at least some form of external monitoring of whether your target boxes go down that can notify you of this. If your monitoring is hosted on the box that goes down, it's not really of much use...!
Monitoring service health more robustly and notifying myself about issues is currently on my todo list.
Server health
Server health ties into service health, and perhaps also system information too. Knowing which servers are up and which ones are down is important - not least because of the services running thereon.
The tricky part of this is that if a server goes down, it could be because of any one of a number of issues - ranging from a simple software/hardware failure, all the way up to an entire-building failure (e.g. a powercut) or a natural disaster. With this in mind, it's important to plan your monitoring carefully such that you still get notified in the event of a failure.
Conclusion
In this post, I've talked a bit about my monitoring infrastructure, and things to consider more generally when planning monitoring for new or existing infrastructure.
It's never too late to iteratively improve your infrastructure monitoring system - whether that's enrolling that box in the corner that never got added to the system, implementing a whole new kind of monitoring (e.g. centralised logging), or - in my case - working on more notifications for when things go wrong.
On a related note, what do your backups look like right now? Are they automated? Do they cover all your important data? Could you restore them quickly and efficiently?
If you've found this interesting, please leave a comment below!