Monitoring HTTP server response time with collectd and a bit of bash

In the spirit of the last few posts I've been making here (A and B), I'd like to talk a bit about collectd, which I use to monitor the status of my infrastructure. Currently this consists of the server you've connected to in order to view this webpage, and a Raspberry Pi that acts as a home file server.

I realised recently that monitoring the various services that I run (such as my personal git server for instance) would be a good idea, as I'd rather like to know when they go down or act abnormally.

As a first step towards this, I decided to configure my existing collectd setup to monitor the response time of the HTTP endpoints of these services. Later on, I can then configure some alerts to message me when something goes down.

My first thought was to check the plugin list to see if there was one that would do the trick. As you might have guessed by the title of this post, however, such an easy solution would be too uninteresting and not worthy of writing a blog post.

Since such a plugin doesn't (yet?) exist, I turned to the exec plugin instead.

In short, it lets you write a program that writes to the standard output in the collectd plain text protocol, which collectd then interprets and adds to whichever data storage backend you have configured.

Since shebangs are a thing on Linux, I could technically choose any language I have an interpreter installed for, but to keep things (relatively) simple, I chose Bash, the language your local terminal probably speaks (unless it speaks zsh or fish instead).

My priorities were to write a script that is:

Easy to reconfigure
Ultra lightweight

Bash supports associative arrays, so I can cover point #1 pretty easily like this:

declare -A targets=(
    ["main_website"]="https://starbeamrainbowlabs.com/"
    ["git"]="https://git.starbeamrainbowlabs.com/"
    # .....
)

Excellent! Covering point #2 will be an on-going process that I'll need to keep in mind as I write this script. I found this GitHub repository a while back, which has served as a great reference point in the past. Here's hoping it'll be useful this time too!

It's important to note the structure of the script that we're trying to write. Collectd exec scripts have 2 main environment variables we need to take notice of:

COLLECTD_HOSTNAME - The hostname of the local machine
COLLECTD_INTERVAL - Interval at which we should collect data. Defined in collectd.conf.

The script should write to the standard output the values we've collected in the collectd plain text format every COLLECTD_INTERVAL. Collectd will automatically ensure that only 1 instance of our script is running at once, and will also automatically restart it if it crashes.

To run a command regularly at a set interval, we probably want a while loop like this:

while :; do
    # Do our stuff here

    sleep "${COLLECTD_INTERVAL}";
done

This is a great start, but it isn't really compliant with objective #2 we defined above. sleep is actually a separate command that spawns a new process. That's an expensive operation, since it has to allocate memory for a new stack and create a new entry in the process table.

We can avoid this by abusing the read command timeout, like this:

# Pure-bash alternative to sleep.
# Source: https://blog.dhampir.no/content/sleeping-without-a-subprocess-in-bash-and-how-to-sleep-forever
snore() {
    local IFS;
    [[ -n "${_snore_fd:-}" ]] || exec {_snore_fd}<> <(:);
    read ${1:+-t "$1"} -u $_snore_fd || :;
}

Thanks to bolt for this.

Next, we need to iterate over the array of targets we defined above. We can do that with a for loop:

while :; do
    for target in "${!targets[@]}"; do
        check_target "${target}" "${targets[${target}]}"
    done

    snore "${COLLECTD_INTERVAL}";
done

Here we call a function check_target that will contain our main measurement logic. We've changed sleep to snore too - our new subprocess-less sleep alternative.

Note that we're calling check_target for each target one at a time. This is important for 2 reasons:

We don't want to potentially skew the results by taking multiple measurements at once (e.g. if we want to measure multiple PHP applications that sit in the same process poll, or measure more applications than we have CPUs)
It actually spawns a subprocess for each function invocation if we push them into the background with the & operator. As I've explained above, we want to try and avoid this to keep it lightweight.

Next, we need to figure out how to do the measuring. I'm going to do this with curl. First though, we need to setup the function and bring in the arguments:

# $1 - target name
# $2 - url
check_target() {
    local target_name="${1}"
    local url="${2}";

    # ......
}

Excellent. Now, let's use curl to do the measurement itself:

curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}" "${url}"

This looks complicated (and it probably is to some extent), but let's break it down with the help of explainshell.

Part	Meaning
`-sS`	Squashes all output except for errors and the bits we want. Great for scripts like ours.
`--user-agent`	Specifies the user agent string to use when making a request. All good internet citizens should specify a descriptive one (more on this later).
`-o /dev/null`	We're not interested in the content we download, so this sends it straight to the bin.
`--max-time 5`	This sets a timeout of 5 seconds for the whole operation - after which `curl` will throw an error and return with exit code 28.
`-w "%{http_code}\n%{time_total}"`	This allows us to pull out metadata about the request we're interested in. There's actually a whole range available, but for now I'm interested in how long it took and the response code returned
`"${url}"`	Specifies the URL to send the request to. `curl` does actually support making more than 1 request at once, but utilising this functionality is out-of-scope for now (and we'd get skewed results because it re-uses connections - which is normally really helpful & performance boosting)

To parse the output we get from curl, I found the readarray command after going a bit array mad at the beginning of this post. It pulls every line of input into a new slot in an array for us - and since we can control the delimiter between values with curl, it's perfect for parsing the output. Let's hook that up now:

readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}" "${url}");

The weird command < <(another_command); syntax is process substitution. It's a bit like the another_command | command syntax, but a bit different. We need it here because readarray parses the values into a new array variable in the current context, and if we use the a | b syntax here, we instantly lose access to the variable it creates because a subprocess is spawned (and readarray is a bash builtin) - hence the weird process substitution.

Now that we've got the output from curl parsed and ready to go, we need to handle failures next. This is a little on the nasty side, as by default bash won't give us the non-zero exit code from substituted processes. Hence, we need to tweak our already long arcane incantation a bit more:

readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}\n" "${url}"; echo "${PIPESTATUS[*]}");

Thanks to this answer on StackOverflow for ${PIPESTATUS}. Now, we have array called result with 3 elements in it:

Index	Value
0	The HTTP response code
1	The time taken in seconds
2	The exit code of `curl`

With this information, we can now detect errors and abort continuing if we detect one. We know there was an error if any of the following occur:

curl returned a non-zero exit code
The HTTP response code isn't 2XX or 3XX

Let's implement that in bash:

if [[ "${result[2]}" -ne 0 ]] || [[ "${result[0]}" -lt "200" ]] || [[ "${result[0]}" -gt "399" ]]; then
    return
fi

Again, let's break it down:

[[ "${result[2]}" -ne 0 ]] - Detect a non-zero exit code from curl
[[ "${result[0]}" -lt "200" ]] - Detect if the HTTP response code is less than 200
[[ "${result[0]}" -gt "399" ]] - Detect if the HTTP response code is greater than 399

In the future, we probably want to output a notification here of some sort instead of just simply silently returning, but for now it's fine.

Finally, we can now output the result in the right format for collectd to consume. Collectd operates on identifiers, values, and intervals. A bit of head-scratching and documentation reading later, and I determined the correct identifier format for the task. I wanted to have all the readings on the same graph so I could compare the different response times (just like the ping plugin does), so we want something like this:

bobsrockets.com/http_services/response_time-TARGET_NAME

....where we replace bobsrockets.com with ${COLLECTD_HOSTNAME}, and TARGET_NAME with the name of the target we're measuring (${target_name} from above).

We can do this like so:

echo "PUTVAL \"${COLLECTD_HOSTNAME}/http_services/response_time-${target_name}\" interval=${COLLECTD_I
NTERVAL} N:${result[1]}";

Here's an example of it in action:

PUTVAL "HOSTNAME_HERE/http_services/response_time-git" interval=300.000 N:0.118283
PUTVAL "HOSTNAME_HERE/http_services/response_time-main_website" interval=300.000 N:0.112073

It does seem to run through the items in the array in a rather strange order, but so long as it does iterate the whole lot, I don't really care.

I'll include the full script at the bottom of this post, so all that's left to do is to point collectd at our new script like this in /etc/collectd.conf:

LoadPlugin  exec

# .....

<Plugin exec>
    Exec    "nobody:nogroup"        "/etc/collectd/http_response_times.sh"  "measure"
</Plugin>

I've added measure as an argument there for future-proofing, as it looks like we may have to run a separate instance of the script for sending notifications if I understand the documentation correctly (I need to do some research.....).

Very cool. It's taken a few clever tricks, but we've managed to write an efficient script for measuring http response times. We've made it more efficient by exploiting read timeouts and other such things. While we won't gain a huge amount of speed from this (bash is pretty lightweight already - this script is weighing in at just ~3.64MiB of private RAM O.o), it will all add up over time - especially considering how often this will be running.

In the future, I'll definitely want to take a look at implementing some alerts to notify me if a service is down - but that will be a separate post, as this one is getting quite long :P

Found this interesting? Got another way of doing this? Curious about something? Comment below!

Full Script

#!/usr/bin/env bash
set -o pipefail;

# Variables:
#   COLLECTD_INTERVAL   Interval at which to collect data
#   COLLECTD_HOSTNAME   The hostname of the local machine

declare -A targets=(
    ["main_website"]="https://starbeamrainbowlabs.com/"
    ["webmail"]="https://mail.starbeamrainbowlabs.com/"
    ["git"]="https://git.starbeamrainbowlabs.com/"
    ["nextcloud"]="https://nextcloud.starbeamrainbowlabs.com/"
)
# These are only done once, so external commands are ok
version="0.1+$(date +%Y%m%d -r $(readlink -f "${0}"))";

user_agent="HttpResponseTimeMeasurer/${version} (Collectd Exec Plugin; $(uname -sm)) bash/${BASH_VERSION} curl/$(curl --version | head -n1 | cut -f2 -d' ')";

# echo "${user_agent}"

###############################################################################

# Pure-bash alternative to sleep.
# Source: https://blog.dhampir.no/content/sleeping-without-a-subprocess-in-bash-and-how-to-sleep-forever
snore() {
    local IFS;
    [[ -n "${_snore_fd:-}" ]] || exec {_snore_fd}<> <(:);
    read ${1:+-t "$1"} -u $_snore_fd || :;
}

# Source: https://github.com/dylanaraps/pure-bash-bible#split-a-string-on-a-delimiter
split() {
    # Usage: split "string" "delimiter"
    IFS=$'\n' read -d "" -ra arr <<< "${1//$2/$'\n'}"
    printf '%s\n' "${arr[@]}"
}

# Source: https://github.com/dylanaraps/pure-bash-bible#get-the-number-of-lines-in-a-file
# Altered to operate on the standard input.
count_lines() {
    # Usage: lines <"file"
    mapfile -tn 0 lines
    printf '%s\n' "${#lines[@]}"
}

###############################################################################

# $1 - target name
# $2 - url
check_target() {
    local target_name="${1}"
    local url="${2}";

    readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}\n" "${url}"; echo "${PIPESTATUS[*]}");

    # 0 - http response code
    # 1 - time taken
    # 2 - curl exit code

    # Make sure the exit code is non-zero - this includes if curl hits a timeout error
    # Also ensure that the HTTP response code is valid - any 2xx or 3xx response code is ok
    if [[ "${result[2]}" -ne 0 ]] || [[ "${result[0]}" -lt "200" ]] || [[ "${result[0]}" -gt "399" ]]; then
        return
    fi

    echo "PUTVAL \"${COLLECTD_HOSTNAME}/http_services/response_time-${target_name}\" interval=${COLLECTD_INTERVAL} N:${result[1]}";
}

while :; do
    for target in "${!targets[@]}"; do
        # NOTE: We don't use concurrency here because that spawns additional subprocesses, which we want to try & avoid. Even though it looks slower, it's actually more efficient (and we don't potentially skew the results by measuring multiple things at once)
        check_target "${target}" "${targets[${target}]}"
    done

    snore "${COLLECTD_INTERVAL}";
done

Type this	To get this	Notes
`bold text`	bold text	-
`_italics text_`	italics text	-
`~~deleted text~~`	~~deleted~~	-
`code text`	`code text`	Inserts some monospaced code. It is preferred that large blocks of code are linked to using a service such as Pastebin, Github Gists or Ideone.
`> Quote`	Quote	-
`[display text](//google.com)`	display text	Inserts a hyperlink. Please use responsibly. `[rel=nofollow]` is in use and spam will be deleted.
`---`		Inserts a horizontal line. The previous line must be blank.

Stardust
Blog