Monitoring latency / ping with Collectd and Bash
I use Collectd as the monitoring system for the devices I manage. As part of this, I use the Ping plugin to monitor latency to a number of different hosts, such as GitHub, the Raspberry Pi apt repo, 1.0.0.1, and this website.
I've noticed for a while that the ping plugin doesn't always work: even when I check to ensure that a host is pingable before I add it to the monitoring list, Collectd doesn't always manage to ping it - showing NaN instead.
Yesterday I finally decided that enough was enough, and that I was going to do something about it. I've blogged about using the exec plugin with a Bash script before in a previous post, in which I monitor HTTP response times with `curl`.
Using that script as a base, I adapted it to instead parse the output of the `ping` command and translate it into something that Collectd understands. If you haven't already, you'll want to go and read that post before continuing with this one.
The first order of business is to sort out the identifier we're going to use for the data in question. Collectd assigns an identifier to all the data coming in, and uses this to determine where it is stored on disk - and subsequently which graph it will appear in on screen in the front-end.
Such identifiers follow this pattern:
host/plugin-instance/type-instance
This can be broken down into the following parts:
Part | Meaning |
---|---|
`host` | The hostname of the machine from which the data was collected |
`plugin` | The name of the plugin that collected the data (e.g. `memory`, `disk`, `thermal`, etc.) |
`instance` | The instance name of the plugin, if the plugin is enabled multiple times |
`type` | The type of reading that was collected |
`instance` | If multiple readings for a given type are collected, this instance differentiates between them. |
Of note specifically here is the `type`, which must be one of a number of pre-defined values. These can be found in a text file located at `/usr/share/collectd/types.db`. In my case, my `types.db` file contains the following definitions for ping:
- `ping`: The average latency
- `ping_droprate`: The percentage of packets that were dropped
- `ping_stddev`: The standard deviation of the latency (lower = better; a high value here indicates potential network instability, and you may encounter issues in voice / video calls for example)
To this end, I've decided on the following identifier strings:
HOSTNAME_HERE/ping-exec/ping-TARGET_NAME
HOSTNAME_HERE/ping-exec/ping_droprate-TARGET_NAME
HOSTNAME_HERE/ping-exec/ping_stddev-TARGET_NAME
I'm using `exec` for the first instance here to cause it to store my ping results separately from the internal ping plugin. The second instance is the name of the target that is being pinged, resulting in multiple lines on the same graph.
To parse the output of the `ping` command, I've found it easiest if I push the output of the ping command to disk first, and then read it back afterwards. To do that, a temporary directory is needed:
temp_dir="$(mktemp --tmpdir="/dev/shm" -d "collectd-exec-ping-XXXXXXX")";
on_exit() {
rm -rf "${temp_dir}";
}
trap on_exit EXIT;
This creates a new temporary directory in `/dev/shm` (shared memory in RAM), and automatically deletes it when the script terminates by scheduling an exit trap.
Then, we can create a temporary file inside the new temporary directory and call the ping command:
tmpfile="$(mktemp --tmpdir="${temp_dir}" "ping-target-XXXXXXX")";
ping -O -c "${ping_count}" "${target}" >"${tmpfile}";
There are a number of variables in that second command. Let me break them down:

- `${ping_count}`: The number of pings to send (e.g. `3`)
- `${target}`: The target to ping (e.g. `starbeamrainbowlabs.com`)
- `${tmpfile}`: The temporary file to which to write the output
For reference, the output of the ping command looks something like this:
PING starbeamrainbowlabs.com (5.196.73.75) 56(84) bytes of data.
64 bytes from starbeamrainbowlabs.com (5.196.73.75): icmp_seq=1 ttl=55 time=28.6 ms
64 bytes from starbeamrainbowlabs.com (5.196.73.75): icmp_seq=2 ttl=55 time=15.1 ms
64 bytes from starbeamrainbowlabs.com (5.196.73.75): icmp_seq=3 ttl=55 time=18.9 ms
--- starbeamrainbowlabs.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 15.145/20.886/28.574/5.652 ms
We're only interested in the last 2 lines of output. Since this is a long-running script that will be executing every 5 minutes on a rather overburdened Raspberry Pi 3B+ :P, we want to minimise the number of subprocesses we spawn to keep the load down. To read the last 2 lines of the file into an array, we can do this:
mapfile -s "$((ping_count+3))" -t file_data <"${tmpfile}"
The `-s` flag here tells `mapfile` (a Bash built-in) to skip a given number of lines before reading from the file. Since we know the number of ping requests we sent, and that there are 3 additional lines (the header, the blank line, and the statistics separator) before the last 2 lines that we're interested in, we can calculate the number of lines we need to skip here.
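To sanity-check that skip arithmetic, here's a self-contained sketch using a fake copy of the sample output from earlier (the hostname `example.com` and the timings are made up):

```shell
#!/usr/bin/env bash
ping_count=3

# Write a fake copy of the sample ping output to a temporary file
tmpfile="$(mktemp)"
cat >"${tmpfile}" <<'EOF'
PING example.com (93.184.216.34) 56(84) bytes of data.
64 bytes from example.com: icmp_seq=1 ttl=55 time=28.6 ms
64 bytes from example.com: icmp_seq=2 ttl=55 time=15.1 ms
64 bytes from example.com: icmp_seq=3 ttl=55 time=18.9 ms

--- example.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 15.145/20.886/28.574/5.652 ms
EOF

# Skip the header, the 3 reply lines, the blank line, and the separator
# (ping_count + 3 = 6 lines), leaving just the last 2 lines
mapfile -s "$((ping_count+3))" -t file_data <"${tmpfile}"

echo "${file_data[0]}"   # the packet loss summary line
echo "${file_data[1]}"   # the rtt statistics line

rm "${tmpfile}"
```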
Next, we can now parse the last 2 lines of the file. The `read` command (which is also a Bash built-in, so it doesn't spawn a subprocess) is great for this purpose. Let's take it 1 line at a time:
read -r _ _ _ _ _ loss _ < <(echo "${file_data[0]}")
loss="${loss/\%}";
Here the `read` command splits the input on whitespace into multiple different variables. We are only interested in the packet loss here. While the other values might be interesting, Collectd (at least by default) doesn't have a definition in `types.db` for them, and I don't see any huge benefits from adding them anyway, so I use an underscore `_` to indicate, by common convention, that I'm not interested in those fields.
We then strip the percent sign `%` from the end of the packet loss value here too.
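As a quick illustration, here's that parsing applied to the summary line from the sample output above:

```shell
#!/usr/bin/env bash
# The packet loss summary line from the sample output
line="3 packets transmitted, 3 received, 0% packet loss, time 2002ms"

# read splits on whitespace; the 6th field is the loss percentage,
# and the final _ soaks up the rest of the line
read -r _ _ _ _ _ loss _ < <(echo "${line}")
loss="${loss/\%}";   # strip the trailing % sign

echo "${loss}"   # → 0
```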
Next, let's extract the statistics from the very last line:
read -r _ _ _ _ _ _ min avg max stdev _ < <(echo "${file_data[1]//\// }");
Here we replace all forward slashes in the input with spaces to allow `read` to split it properly. Then, we extract the 4 interesting values (although we can't actually log `min` and `max`).
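Applied to the statistics line from the sample output above, this extracts the values like so:

```shell
#!/usr/bin/env bash
# The statistics line from the sample output
line="rtt min/avg/max/mdev = 15.145/20.886/28.574/5.652 ms"

# Replace every forward slash with a space so read can split the values apart;
# the first 6 fields (rtt min avg max mdev =) are discarded
read -r _ _ _ _ _ _ min avg max stdev _ < <(echo "${line//\// }")

echo "${avg} ${stdev}"   # → 20.886 5.652
```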
With the values extracted, we can output the statistics we've collected in a format that Collectd understands:
echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping_droprate-${target}\" interval=${COLLECTD_INTERVAL} N:${loss}";
echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping-${target}\" interval=${COLLECTD_INTERVAL} N:${avg}";
echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping_stddev-${target}\" interval=${COLLECTD_INTERVAL} N:${stdev}";
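For example, with some hypothetical values plugged in (a hostname of `rpi3`, an interval of 300 seconds, and the statistics from the sample output above), the script would emit lines like this:

```
PUTVAL "rpi3/ping-exec/ping_droprate-starbeamrainbowlabs.com" interval=300 N:0
PUTVAL "rpi3/ping-exec/ping-starbeamrainbowlabs.com" interval=300 N:20.886
PUTVAL "rpi3/ping-exec/ping_stddev-starbeamrainbowlabs.com" interval=300 N:5.652
```

The `N` in the value field tells Collectd to use the current time as the timestamp.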
Finally, we mustn't forget to delete the temporary file:
rm "${tmpfile}";
Those are the major changes I made from the earlier HTTP response time monitor. The full script can be found at the bottom of this post. The settings that control the operation of the script are at the top, which allow you to change the list of hosts to ping and the number of ping requests to make.
Save it to something like `/etc/collectd/collectd-exec-ping.sh` (don't forget to `sudo chmod +x /etc/collectd/collectd-exec-ping.sh` it), and then append this to your `/etc/collectd/collectd.conf`:
<Plugin exec>
Exec "nobody:nogroup" "/etc/collectd/collectd-exec-ping.sh"
</Plugin>
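You can also test the script manually, outside of Collectd, by supplying the 2 environment variables it expects yourself (the interval here is just an example):

```
COLLECTD_HOSTNAME="$(hostname)" COLLECTD_INTERVAL=300 /etc/collectd/collectd-exec-ping.sh
```

It loops forever by design, so press Ctrl + C to stop it once you've seen a few PUTVAL lines appear.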
Final script
#!/usr/bin/env bash
set -o pipefail;
# Variables:
# COLLECTD_INTERVAL Interval at which to collect data
# COLLECTD_HOSTNAME The hostname of the local machine
declare targets=(
"starbeamrainbowlabs.com"
"github.com"
"reddit.com"
"raspbian.raspberrypi.org"
"1.0.0.1"
)
ping_count="3";
###############################################################################
# Pure-bash alternative to sleep.
# Source: https://blog.dhampir.no/content/sleeping-without-a-subprocess-in-bash-and-how-to-sleep-forever
snore() {
local IFS;
[[ -n "${_snore_fd:-}" ]] || exec {_snore_fd}<> <(:);
read ${1:+-t "$1"} -u $_snore_fd || :;
}
# Source: https://github.com/dylanaraps/pure-bash-bible#split-a-string-on-a-delimiter
split() {
# Usage: split "string" "delimiter"
IFS=$'\n' read -d "" -ra arr <<< "${1//$2/$'\n'}"
printf '%s\n' "${arr[@]}"
}
# Source: https://github.com/dylanaraps/pure-bash-bible#use-regex-on-a-string
regex() {
# Usage: regex "string" "regex"
[[ $1 =~ $2 ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}
# Source: https://github.com/dylanaraps/pure-bash-bible#get-the-number-of-lines-in-a-file
# Altered to operate on the standard input.
count_lines() {
# Usage: count_lines <"file"
mapfile -tn 0 lines
printf '%s\n' "${#lines[@]}"
}
# Source https://github.com/dylanaraps/pure-bash-bible#get-the-last-n-lines-of-a-file
tail() {
# Usage: tail "n" "file"
mapfile -tn 0 line < "$2"
printf '%s\n' "${line[@]: -$1}"
}
###############################################################################
temp_dir="$(mktemp --tmpdir="/dev/shm" -d "collectd-exec-ping-XXXXXXX")";
on_exit() {
rm -rf "${temp_dir}";
}
trap on_exit EXIT;
# $1 - target to ping
check_target() {
local target="${1}";
tmpfile="$(mktemp --tmpdir="${temp_dir}" "ping-target-XXXXXXX")";
ping -O -c "${ping_count}" "${target}" >"${tmpfile}";
mapfile -s "$((ping_count+3))" -t file_data <"${tmpfile}"
read -r _ _ _ _ _ loss _ < <(echo "${file_data[0]}")
loss="${loss/\%}";
read -r _ _ _ _ _ _ min avg max stdev _ < <(echo "${file_data[1]//\// }");
echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping_droprate-${target}\" interval=${COLLECTD_INTERVAL} N:${loss}";
echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping-${target}\" interval=${COLLECTD_INTERVAL} N:${avg}";
echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping_stddev-${target}\" interval=${COLLECTD_INTERVAL} N:${stdev}";
rm "${tmpfile}";
}
while :; do
for target in "${targets[@]}"; do
# NOTE: We don't use concurrency here because that spawns additional subprocesses, which we want to try & avoid. Even though it looks slower, it's actually more efficient (and we don't potentially skew the results by measuring multiple things at once)
check_target "${target}"
done
snore "${COLLECTD_INTERVAL}";
done