Starbeamrainbowlabs

Stardust
Blog


Archive


Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems projects prolog protocol protocols pseudo 3d python reddit redis reference releases rendering resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

Monitoring latency / ping with Collectd and Bash

I use Collectd as the monitoring system for the devices I manage. As part of this, I use the Ping plugin to monitor latency to a number of different hosts, such as GitHub, the raspberry pi apt repo, 1.0.0.1, and this website.

I've noticed for a while that the ping plugin doesn't always work: Even when I check to ensure that a host is pingable before I add it to the monitoring list, Collectd doesn't always manage to ping it - showing it as NaN instead.

Yesterday I finally decided that enough was enough, and that I was going to do something about it. I've blogged about using the exec plugin with a bash script before in a previous post, in which I monitor HTTP response times with curl.

Using that script as a base, I adapted it to instead parse the output of the ping command and translate it into something that Collectd understands. If you haven't already, you'll want to go and read that post before continuing with this one.

The first order of business is to sort out the identifier we're going to use for the data in question. Collectd assigns an identifier to all the the data coming in, and it uses this to determine where it is stored on disk - and subsequently which graph it will appear in on screen in the front-end.

Such identifiers follow this pattern:

host/plugin-instance/type-instance

This can be broken down into the following parts:

Part Meaning
host The hostname of the machine from which the data was collected
plugin The name of the plugin that collected the data (e.g. memory, disk, thermal, etc)
instance The instance name of the plugin, if the plugin is enabled multiple times
type The type of reading that was collected
instance If multiple readings for a given type are collected, this instance differentiates between them.

Of note specifically here are the type, which must be one of a number of pre-defined values, which can be found in a text file located at /usr/share/collectd/types.db. In my case, my types.db file contains the following definitions for ping:

  • ping: The average latency
  • ping_droprate: The percentage of packets that were dropped
  • ping_stddev: The standard deviation of the latency (lower = better; a high value here indicates potential network instability and you may encounter issues in voice / video calls for example)

To this end, I've decided on the following identifier strings:

HOSTNAME_HERE/ping-exec/ping-TARGET_NAME
HOSTNAME_HERE/ping-exec/ping_droprate-TARGET_NAME
HOSTNAME_HERE/ping-exec/ping_stddev-TARGET_NAME

I'm using exec for the first instance here to cause it to store my ping results separately from the internal ping plugin. The 2nd instance is the name of the target that is being pinged, resulting in multiple lines on the same graph.

To parse the output of the ping command, I've found it easiest if I push the output of the ping command to disk first, and then read it back afterwards. To do that, a temporary directory is needed:

temp_dir="$(mktemp --tmpdir="/dev/shm" -d "collectd-exec-ping-XXXXXXX")";

on_exit() {
    rm -rf "${temp_dir}";
}
trap on_exit EXIT;

This creates a new temporary directory in /dev/shm (shared memory in RAM), and automatically deletes it when the script terminates by scheduling an exit trap.

Then, we can create a temporary file inside the new temporary directory and call the ping command:

tmpfile="$(mktemp --tmpdir="${temp_dir}" "ping-target-XXXXXXX")";
ping -O -c "${ping_count}" "${target}" >"${tmpfile}";

A number of variables in that second command. Let me break that down:

  • ${ping_count}: The number of pings to send (e.g. 3)
  • ${target}: The target to ping (e.g. starbeamrainbowlabs.com)
  • ${tmpfile}: The temporary file to which to write the output

For reference, the output of the ping command looks something like this:

PING starbeamrainbowlabs.com (5.196.73.75) 56(84) bytes of data.
64 bytes from starbeamrainbowlabs.com (5.196.73.75): icmp_seq=1 ttl=55 time=28.6 ms
64 bytes from starbeamrainbowlabs.com (5.196.73.75): icmp_seq=2 ttl=55 time=15.1 ms
64 bytes from starbeamrainbowlabs.com (5.196.73.75): icmp_seq=3 ttl=55 time=18.9 ms

--- starbeamrainbowlabs.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 15.145/20.886/28.574/5.652 ms

We're only interested in the last 2 lines of output. Since this is a long-running script that is going to be executing every 5 minutes, to minimise load (it will be running on a rather overburdened Raspberry Pi 3B+ :P), we will be minimising the number of subprocesses we spawn. To read in the last 2 lines of the file into an array, we can do this:

mapfile -s "$((ping_count+3))" -t file_data <"${tmpfile}"

The -s here tells mapfile (a bash built-in) to skip a given number of lines before reading from the file. Since we know the number of ping requests we sent and that there are 3 additional lines that don't contain the ping request output before the last 2 lines that we're interested in, we can calculate the number of lines we need to skip here.

Next, we can now parse the last 2 lines of the file. The read command (which is also a bash built-in, so it doesn't spawn a subprocess) is great for this purpose. Let's take it 1 line at a time:

read -r _ _ _ _ _ loss _ < <(echo "${file_data[0]}")
loss="${loss/\%}";

Here the read command splits the input on whitespace into multiple different variables. We are only interested in the packet loss here. While the other values might be interesting, Collectd (at least by default) doesn't have a definition in types.db for them and I don't see any huge benefits from adding them anyway, so I use an underscore _ to indicate, by common convention, that I'm not interested in those fields.

We then strip the percent sign % from the end of the packet loss value here too.

Next, let's extract the statistics from the very last line:

read -r _ _ _ _ _ _ min avg max stdev _ < <(echo "${file_data[1]//\// }");

Here we replace all forward slashes in the input with a space to allow read to split it properly. Then, we extract the 4 interesting values (although we can't actually log min and max).

With the values extracted, we can output the statistics we've collected in a format that Collectd understands:

echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping_droprate-${target}\" interval=${COLLECTD_INTERVAL} N:${loss}";
echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping-${target}\" interval=${COLLECTD_INTERVAL} N:${avg}";
echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping_stddev-${target}\" interval=${COLLECTD_INTERVAL} N:${stdev}";

Finally, we mustn't forget to delete the temporary file:

rm "${tmpfile}";

Those are the major changes I made from the earlier HTTP response time monitor. The full script can be found at the bottom of this post. The settings that control the operation of the script are the top, which allow you to change the list of hosts to ping, and the number of ping requests to make.

Save it to something like /etc/collectd/collectd-exec-ping.sh (don't forget to sudo chmod +x /etc/collectd/collectd-exec-ping.sh it), and then append this to your /etc/collectd/collectd.conf:

<Plugin exec>
        Exec    "nobody:nogroup"        "/etc/collectd/collectd-exec-ping.sh"
</Plugin>

Final script

#!/usr/bin/env bash
set -o pipefail;

# Variables:
#   COLLECTD_INTERVAL   Interval at which to collect data
#   COLLECTD_HOSTNAME   The hostname of the local machine

declare targets=(
    "starbeamrainbowlabs.com"
    "github.com"
    "reddit.com"
    "raspbian.raspberrypi.org"
    "1.0.0.1"
)
ping_count="3";

###############################################################################

# Pure-bash alternative to sleep.
# Source: https://blog.dhampir.no/content/sleeping-without-a-subprocess-in-bash-and-how-to-sleep-forever
snore() {
    local IFS;
    [[ -n "${_snore_fd:-}" ]] || exec {_snore_fd}<> <(:);
    read ${1:+-t "$1"} -u $_snore_fd || :;
}

# Source: https://github.com/dylanaraps/pure-bash-bible#split-a-string-on-a-delimiter
split() {
    # Usage: split "string" "delimiter"
    IFS=$'\n' read -d "" -ra arr <<< "${1//$2/$'\n'}"
    printf '%s\n' "${arr[@]}"
}

# Source: https://github.com/dylanaraps/pure-bash-bible#use-regex-on-a-string
regex() {
    # Usage: regex "string" "regex"
    [[ $1 =~ $2 ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

# Source: https://github.com/dylanaraps/pure-bash-bible#get-the-number-of-lines-in-a-file
# Altered to operate on the standard input.
count_lines() {
    # Usage: count_lines <"file"
    mapfile -tn 0 lines
    printf '%s\n' "${#lines[@]}"
}

# Source https://github.com/dylanaraps/pure-bash-bible#get-the-last-n-lines-of-a-file
tail() {
    # Usage: tail "n" "file"
    mapfile -tn 0 line < "$2"
    printf '%s\n' "${line[@]: -$1}"
}

###############################################################################

temp_dir="$(mktemp --tmpdir="/dev/shm" -d "collectd-exec-ping-XXXXXXX")";

on_exit() {
    rm -rf "${temp_dir}";
}
trap on_exit EXIT;

# $1 - target name
# $2 - url
check_target() {
    local target="${1}";

    tmpfile="$(mktemp --tmpdir="${temp_dir}" "ping-target-XXXXXXX")";

    ping -O -c "${ping_count}" "${target}" >"${tmpfile}";

    # readarray -t result < <(curl -sS --user-agent "${user_agent}" -o /dev/null --max-time 5 -w "%{http_code}\n%{time_total}\n" "${url}"; echo "${PIPESTATUS[*]}");
    mapfile -s "$((ping_count+3))" -t file_data <"${tmpfile}"

    read -r _ _ _ _ _ loss _ < <(echo "${file_data[0]}")
    loss="${loss/\%}";
    read -r _ _ _ _ _ _ min avg max stdev _ < <(echo "${file_data[1]//\// }");


    echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping_droprate-${target}\" interval=${COLLECTD_INTERVAL} N:${loss}";
    echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping-${target}\" interval=${COLLECTD_INTERVAL} N:${avg}";
    echo "PUTVAL \"${COLLECTD_HOSTNAME}/ping-exec/ping_stddev-${target}\" interval=${COLLECTD_INTERVAL} N:${stdev}";

    rm "${tmpfile}";
}

while :; do
    for target in "${targets[@]}"; do
        # NOTE: We don't use concurrency here because that spawns additional subprocesses, which we want to try & avoid. Even though it looks slower, it's actually more efficient (and we don't potentially skew the results by measuring multiple things at once)
        check_target "${target}"
    done

    snore "${COLLECTD_INTERVAL}";
done

Making an auto-updated downmuxed copy of my music

I like to buy and own music. That way, if the service goes down, I still get to keep both my music and the rights thereto that I've paid for.

To this end, I maintain an offline collection of music tracks that I've purchased digitally. Recently, it's been growing quite large (~15GiB at the moment) - which is quite a bit of disk space. While this doesn't matter too much on my laptop, on my phone it's quite a different story.

For this reason, I wanted to keep a downmuxed copy of my music collection on my Raspberry Pi 3B+ file server that I can sync to my phone. Said Raspberry Pi already has ffmpeg installed, so I decided to write a script to automate the process. In this blog post, I'm going to walk you through the script itself and what it does - and how you can use it too.

I've decided on a standard downmuxed format of 256kbps MP3. You can choose anything you like - you just need to tweak the appropriate lines in the script.

First, let's outline what we want it to do:

  1. Convert anything that isn't an mp3 to 256kbps mp3 (e.g. ogg, flac)
  2. Downmux mp3 files that are at a bitrate higher than 256kbps
  3. Leave mp3s that are at a bitrate lower than (or equal to) 256kbps alone
  4. Convert and optimise album art to 256x256
  5. Copy any unknown files as-is
  6. If the file exists in the target directory already, don't re-convert it again
  7. Max out the system resources when downmuxing to get it done as fast as possible

With this in mind, let's start outlining a script:

#!/usr/bin/env bash

input="${DIR_INPUT:-/absolute/path/to/Music}"
output="${DIR_OUTPUT:-/absolute/path/to/Music-Portable}"

export input;
export output;

temp_dir="$(mktemp --tmpdir -d "portable-music-copy-XXXXXXX")";

on_exit() {
    rm -rf "${temp_dir}";
}
trap on_exit EXIT;

# Library functions go here

# $1    filename
process_file() {
    filename="${1}";
    extension="${filename##*.}";

    # Process file here
}

export temp_dir;
export -f process_file;

cd "${input}" || { echo "Error: Failed to cd to input directory"; exit 1; };
find  -type f -print0 | nice -n20 xargs -P "$(nproc)" -0 -n1 -I{} bash -c 'process_file "{}"';

# Cleanup here....

Very cool. At the top of the script, we define the input and output directories we're going to work on. We use the ${VARIABLE_NAME:-default_value} syntax to allow for changing the input and output directories on the fly with the DIR_INPUT and DIR_OUTPUT environment variables.

Next, we create a temporary directory, and define an exit trap to ensure it gets deleted when the script exits (regardless of whether the exit is clean or not).

Then, we define the main driver function that will process a single file. This is called by xargs a little further down - which takes the file list in from a find call. The cd is important there, because we want the file paths from find to be relative to the input directory for easier mangling later. The actual process_file call is wrapped in bash -c '', because being a bash function it can't be called by xargs directly - so we have to export -f it and wrap it as shown.

Next, we need to write some functions to handle converting different file types. First, let's write a simple copy function:

# $1    Source
# $2    Target
do_copy() {
    source="${1}";
    target="${2}";

    echo -n "cp ";
    cp "${source}" "${target}";
}

All it does is call cp, but it's nice to abstract like this so that if we wanted to add extra features (e.g. uploading via sftp or something) later, it's not as much of a bother.

We also need to downmux audio files and convert them to mp3. Let's write a function for that too:


# $1    Source
# $2    Target
do_downmux() {
    source="${1}";
    target="${2}";

    set +e;
    ffmpeg -hide_banner -loglevel warning -nostats -i "${source}" -vn -ar 44100 -b:a 256k -f mp3 "${target}";
    exit_code="${?}";
    if [[ "${exit_code}" -ne 0 ]] && [[ -f "${target}" ]]; then
        rm "${target}";
    fi
    return "${exit_code}";
}

It's got the same arguments signature as do_copy, but it downmuxes instead of copying directly. The line that does the magic is highlighted. It looks complicated, but it's actually pretty logical. Let's break down all those arguments:

Argument Purpose
-hide_banner Hides the really rather wordy banner at the top when ffmpeg starts up
-loglevel warning Hides everything but warning messages to avoid too much unreadable output when converting many tracks at once
-nostats As above
-i "${source}" Specifies the input file
-vn Strips any video tracks found
-ar 44100 Force the sampling rate to 44.1KHz, just in case it's sampled higher
-b:a 256k Sets the output bitrate to 256kbps (change this bit if you like)
-f mp3 Output as mp3
"${target}" Write the output to the target location

That's not so bad, right? After calling it, we also need to capture the exit code. If it's not 0, then ffmpeg encountered some kind of issue. If so, we delete any output files it creates and return the same exit code - which we handle elsewhere.

Finally, we need a function to optimise images. For this I'm using optipng and jpegoptim to handle optimising JPEGs and PNGs respectively, and ImageMagick for the resizing operation.


# $1    Source
compress_image() {
    source="${1}";

    temp_file_png="$(mktemp --tmpdir="${temp_dir}" XXXXXXX.png)";
    temp_file_jpeg="$(mktemp --tmpdir="${temp_dir}" XXXXXXX.jpeg)";

    convert "${source}" -resize 256x256\> "${temp_file_jpeg}" >&2 &
    convert "${source}" -resize 256x256\> "${temp_file_png}" >&2 &
    wait

    jpegoptim --quiet --all-progressive --preserve "${temp_file_jpeg}" >&2 &
    optipng -quiet -fix -preserve "${temp_file_png}" >&2 &
    wait

    read -r size_png _ < <(wc --bytes "${temp_file_png}");
    read -r size_jpeg _ < <(wc --bytes "${temp_file_jpeg}");
    if [[ "${size_png}" -gt "${size_jpeg}" ]]; then
        # JPEG is smaller
        rm -rf "${temp_file_png}";
        echo "${temp_file_jpeg}";
    else
        # PNG is smaller
        rm -rf "${temp_file_jpeg}";
        echo "${temp_file_png}";
    fi
}

Unlike the previous functions, this one only takes a source file in. It converts it using that temporary directory we created earlier, and echos the filename of the smallest format found.

It's done in 2 stages. First, the source file is resized to 256x256 (maintaining aspect ratio, and avoiding upscaling smaller images) and written as both a JPEG and a PNG.

Then, jpegoptim and optipng are called on the resulting files. Once done, the filesizes are compared and the filepath to the smallest of the 2 is echoed.

With these in place, we can now write the glue that binds them to the xargs call by filling out process_file. Before we do though, we need to tweak the export statements from earlier to export our library functions we've written - otherwise process_file won't be able to access them since it's wrapped in bash -c '' and xargs. Here's the full list of export directives (directly below the end of process_file):

export temp_dir;
export -f process_file;
export -f compress_image;
export -f do_downmux;
export -f do_copy;
# $1    filename
process_file() {
    filename="${1}";
    extension="${filename##*.}";

    orig_destination="${output}/${filename}";
    destination="${orig_destination}";

    echo -n "[file] ${filename}: ";

    do_downmux=false;
    # Downmux, but only the bitrate is above 256k
    if [[ "${extension}" == "flac" ]] || [[ "${extension}" == "ogg" ]] || [[ "${extension}" == "mp3" ]]; then
        probejson="$(ffprobe -hide_banner -v quiet -show_format -print_format json "${filename}")";
        is_above_256k="$(echo "${probejson}" | jq --raw-output '(.format.bit_rate | tonumber) > 256000')";
        exit_code="${?}";
        if [[ "${exit_code}" -ne 0 ]]; then
            echo -n "ffprobe failed; falling back on ";
            do_downmux=false;
        elif [[ "${is_above_256k}" == "true" ]]; then
            do_downmux=true;
        fi
    fi

    if [[ "${do_downmux}" == "true" ]]; then
        echo -n "downmuxing/";
        destination="${orig_destination%.*}.mp3";
    fi

    # ....
}

We use 2 variables to keep track of the destination location here, because we may or may not successfully manage to convert any given input file to a different format with a different file extension.

We also use ffprobe (part of ffmpeg) and jq (a JSON query and manipulation tool) on audio files to detect the bitrate of input files so that we can avoid remuxing files with a bitrate lower than 256kbps. Once we're determined that, we rewrite the destination filename to include the extension .mp3.

Next, we need to deal with the images. We do this in a preprocessing step that comes next:

case "${extension}" in 
    png|jpg|jpeg|JPG|JPEG )
        compressed_image="$(compress_image "${filename}")";
        compressed_extension="${compressed_image##*.}";
        destination="${orig_destination%.*}.${compressed_extension}";
        ;;
esac

If the file is an image, we run it through the image optimiser. Then we look at the file extension of the optimised image, and alter the destination filename accordingly.

if [[ -f "${destination}" ]] || [[ -f "${orig_destination}" ]]; then
    echo "exists in destination; skipping";
    return 0;
fi

destination_dir="$(dirname "${destination}")";
if [[ ! -d "${destination_dir}" ]]; then
    mkdir -p "${destination_dir}";
fi

Next, we look to see if there's a file in the destination already. If so, then we skip out and don't continue processing the file. If not, we make sure that the parent directory exists to avoid issues later.

case "${extension}" in
    flac|mp3|ogg )
        # Use ffmpeg, but only if necessary
        if [[ "${do_downmux}" == "false" ]]; then
            do_copy "${filename}" "${orig_destination}";
        else
            echo -n "ffmpeg ";
            do_downmux "${filename}" "${destination}";
            exit_code="$?";
            if [[ "${exit_code}" -ne 0 ]]; then
                echo "failed, exit code ${exit_code}: falling back on ";
                do_copy "${filename}" "${orig_destination}";
            fi
        fi
        ;;

    png|jpg|jpeg|JPG|JPEG )
        mv "${compressed_image}" "${destination}";
        ;;

    * )
        do_copy "${filename}" "${destination}";
        ;;
esac
echo "done";

Finally, we get to the main case statement that handles the different files. If it's an audio file, we run it through do_downmux (which we implemented earlier) - but only if it would benefit us. If it's an image, we move the converted image from the temporary directory that was optimised earlier, and if we can't tell what it is, then we just copy it over directly.

That's process_file completed. Now all we're missing are a few clean-up tasks that make it more cron friendly:

echo "[ ${SECONDS} ] Setting permissions";
chown -R root:root "${output}";
chmod -R 0644 "${output}";
chmod -R ugo+X "${output}";

echo "[ ${SECONDS} ] Portable music copy update complete";

This goes at the end of the file, and it reset the permissions on the output directory to avoid issues. This ensures that everyone can read it, but only root can write to it - as any modifications should be made it to the original version, and not the portable copy.

That completes this script. By understanding how it works, hopefully you'll be able to apply it to your own specific circumstances.

For example, you could call it via cron. Edit your crontab:

sudo crontab -e

...and paste in something like this:

5 4 * * *   /absolute/path/to/script.sh

This won't work if your device isn't turned on at the time, however. In that case, there is alternative. Simply drop the script (without an extension) into /etc/cron.daily or /etc/cron.weekly and mark it executable, and anacron will run your job every day or week respectively.

Anyway, here's the complete script:

Sources and Further Reading

Automatically organising & optimising photos and videos with Bash

As I promised recently, this post is about a script I implemented a while back that automatically organises and optimises the photos and videos that I take for me. Since I've been using it a while now and it seems stable, I thought I'd share it here in the hopes that it might be useful to someone else too.

I take quite a few photos and the odd video or two with my phone. These are automatically uploaded to a Raspberry Pi 3B+ that's acting as a file server on my home network with FolderSync (yes, it has ads, but it's the best I could find that does the job). Once uploaded to a folder, I then wanted a script that would automatically sort the uploaded images and videos into folders by year and month according to their date taken.

To do this, I implemented a script that uses exiftool (sudo apt install libimage-exiftool-perl I believe) to pull out the date taken from JPEGs and sort based on that. For other formats that don't support EXIF data, I take the last modified time with the date command and use that instead.

Before I do this though, I run my images through a few preprocessing tools:

  • PNGs are optimised with optipng (sudo apt install optipng)
  • JPEGs are optimised with jpegoptim (sudo apt install jpegoptim)
  • JPEGs are additionally automatically reoriented with mogrify -auto-orient from ImageMagick, as many cameras will set an EXIF tag for the rotation of an image without bothering to physically rotate the image itself

It's worth noting here that these preprocessing optimisation steps are lossless. In other words, no quality lost by performing these actions - it simply encodes the images more efficiently such that they use less disk space.

Once all these steps are complete, images and videos are sorted according to their date taken / last modified time as described above. That should end up looking a bit like this:

images
    + 2019
        + 07-July
            + image1.jpeg
    + 2020
        + 05-May
            + image2.png
            + image3.jpeg
        + 06-June
            + video1.mp4

Now that I've explained how it works, I can show you the script itself:

(Can't see the above? Check out the script directly on GitLab here: organise-photos)

The script operates on the current working directory. All images directly in the working directory will be sorted as described above. Once you've put it in a directory that is in your PATH, simply call it like this:

organise-photos

The script can be divided up into 3 distinct sections:

  1. The setup and initialisation
  2. The function that sorts individual files themselves into the right directory (handle_file - it's about half-way down)
  3. The preprocessing steps and the driver code that calls the above function.

So far, I've found that it's been working really rather well. During development and testing I did encounter a number of issues with the sorting system in handle_file that caused it to sort files into the wrong directory - which took me a while finally squash.

I'm always tweaking and improving it though. To that end, I have several plans to improve it.

Firstly, I want to optimise videos too. I'd like to store them in a standard format if possible. It's not that simple though, because some videos don't take well to being transcoded into a different format - indeed they can even take up more space than they did previously! In those cases it's probably worth discarding the attempt at transcoding the video to a more efficient format if it's larger than the original file.

I'd also like to reverse-geocode (see also the usage policy) the (latitude, longitude) geotags in my images to the name of the place that I took them, and append this data to the comment EXIF tag. This will make it easier to search for images based on location, rather than having to remember when I took them.

Finally, I'd also like to experiment with some form of AI object recognition with a similar goal as reverse-geocoding. By detecting the objects in my images and appending them to the comment EXIF tag, I can do things like search for "cat", and return all the images of cats I've taken so far.

I haven't started to look into AI much yet, but initial search results indicate that I might have an interesting time locating an AI that can identify a large number of different objects.

Anyway, my organise-photos script is available on GitLab in my personal bin folder that I commit to git if you'd like to take a closer look - suggestions and merge requests are welcome if you've got an idea that would make it even better :D

Sources and further reading

Website change detection with headless Firefox and ImageMagick

This wasn't the script I had in mind in the previous blog post (so you can look forward to another blog post about it), but have you ever wanted to know when a web page changes? If it does change, it's almost impossible to tell where on the page it's changed. Recently, I was thinking about the problem, and realised a few things:

  • Firefox can be operated headlessly (with --headless) to take screenshots
  • ImageMagick must be advanced enough to diff images

With this in mind, I set about implementing a script. Before we continue, here's an example diff image:

It's rather tall because of the webpage I chose, but the bits that have changed appear in red. The script I've written also generates an animated PNG showing the difference too:

Again, it's very tall because of the page I tested with, but I think it's pretty cool!

If you'd like to check the script out for yourself, you find it in the following git repository: sbrl/url-diff

For the curious, the script in question is written in Bash. It uses apcalc (available in Debian / Ubuntu based Linux distributions with sudo apt install apcalc) to crunch the numbers, and headless Firefox + Imagemagick as described above to take the screenshots and do the image processing. It should in theory work on Windows, but you'll need to jump through a number of hoops:

  • Install call url-diff.sh from [git bash]()
  • Install [ImageMagick]() and make sure the binaries are in your PATH
  • Install Firefox and make sure firefox is in your PATH
  • Explicitly set the URLDIFF_STORAGE_DIR environment variable when calling the script (do this by prefixing the command at the bottom of this post with URLDIFF_STORAGE_DIR=path/to/directory)

With my fancy new embed system, I can show you the code behind it:

(Can't see the above? Check it out in the git repository.)

I'm working on line numbers (sadly the author of highlight.js doesn't like them, so an alternative solution is required).

Anyway, the basic layout of the script is as follows:

  1. First, the settings are read in and the default values set
  2. Then, I define some utility functions.
    • The calculate_percentage_colour function is integral to the image change detection algorithm. It counts percentage of an image that is a given colour.
  3. Next, the help text is displayed if necessary
  4. The case statement that follows allows multiple subcommands to be implemented. Currently I only have a check subcommand, but you never know!
  5. Inside this case statement, the screenshots are taken and compared.
    • A new screenshot is taken with headless Firefox
    • If we don't have a screenshot stored away already, we stash the new screenshot and exit
    • If we do have a pre-existing screenshot, we continue with the comparison, starting by generating a diff image where pixels that have changed are given 1 colour, and pixels that haven't changed another
    • It's at this point that calculate_percentage_colour is called to calculate how much of the image has changed - the diff image is passed in and the changed pixels are counted
    • If more than 2% (by default) has changed, then we continue on to generate the output images
    • The first output image consists of the new screenshot with the diff image overlaid - this is generated with some ImageMagick wizardry: -compose over -composite
    • The second is an animated PNG comprised of the old and new screenshots. This is generated with ffmpeg - which supports animated PNGs
    • Finally, the old screenshot that we have stored away is replaced with the new one

It sounds more complicated than it is - hopefully my above explanation makes sense (post a comment below if you're confused about something!).

You can call the script like so:

git clone https://git.starbeamrainbowlabs.com/sbrl/url-diff.git
cd url-diff;
./url-diff.sh check URL_HERE path/to/output_diff.png path/to/output.apng

....replacing URL_HERE with the URL to check, and the paths with the places you'd like to write the output images to.

Ensuring a Linux machine's network connection stays up with Bash

Recently, I had the unpleasant experience of my Lab machine at University dropping offline. It has a tendency to do this randomly - and normally I'd just reboot it myself, but since I'm working from home at the moment it meant that I couldn't go in to fix it. This unfortunately meant that I was stuck waiting for a generous technician to go in and reboot it for me.

With access now restored I decided that I really didn't want this to happen again, so I've written a simple Bash script to resolve the issue.

It works by checking for an Internet connection every hour by pinging starbeamrainbowlabs.com - and if it doesn't manage to do so successfully, then it will reboot. A simple concept, but I discovered a number of things that needed considering while writing it:

  1. To avoid detecting transient network issues, we should make multiple attempts before giving up and rebooting
  2. Those multiple attempts need to be delayed to be effective
  3. We mustn't reboot more than once an hour to avoid getting into a 'reboot loop'
  4. If we're running an experiment, we need a way to temporarily delay it from doing it's checks that will resume automatically
  5. We could try and diagnose the network error or turn the networking of and on again, but if it gets stuck halfway through then we're locked out (very undesirable) - so it's easier / safer to just reboot

With these considerations in mind, I came up with this: ensure-network.sh (link to part of a GitHub Gist, as it's quite long)

This script requires Bash version 4+ and has a number of environment variables that can configure its behaviour:

Environment Variable Description
CHECK_EXTERNAL_HOST The domain name or IP address to ping to check the connection
CHECK_INTERVAL The interval to check the connection in seconds
CHECK_TIMEOUT Wait at most this long for a reply to our ping
CHECK_RETRIES Retry this many times before giving up and rebooting
CHECK_RETRY_DELAY Delay this many seconds in between retries
CHECK_DRY_RUN If true, then don't actually reboot (useful for testing)
CHECK_REBOOT_DELAY Leave at least this many minutes in between reboots
CHECK_POSTPONE_FILE If this file exists and has a recent last-modified time (mtime), don't actually reboot
CHECK_POSTPONE_MAXAGE The maximum age in minutes of the CHECK_POSTPONE_FILE to consider it fresh and avoid rebooting

With these environment variables, it covers all 4 points in the above list. To expand on CHECK_POSTPONE_FILE, if I'm running an experiment for example and I don't want it to reboot in the middle of said experiment, then I can simply run touch /path/to/postpone_file to delay network connection-related reboots for 7 days (by default). After this time, it will automatically start rebooting again if it drops off the network. This ensures that it will always restart monitoring eventually - as if I had a more manual system I'd forget to re-enable it and then loose access.

Another consideration is that the /var/cache directory must exist. This is because an empty tracking file is created there to keep track of when the last network connection-related reboot occurred.

With the script written, then next step is to have it run automatically on boot. For systemd-based systems such as my lab machine, a systemd service is the order of the day. This is relatively simple:

[Unit]
Description=Reboot if the network connection is down
After=network.target

[Service]
Type=simple
# Because it needs to be able to reboot
User=root
Group=root
EnvironmentFile=-/etc/default/ensure-network
ExecStartPre=/bin/sleep 60
ExecStart=/bin/bash "/usr/local/lib/ensure-network/ensure-network.sh"
SyslogIdentifier=ensure-access
StandardError=syslog
StandardOutput=syslog

[Install]
WantedBy=multi-user.target

(View the latest version in the GitHub Gist)

This assumes that the ensure-network.sh script is located at /usr/local/lib/ensure-network/ensure-network.sh. It also allows for an environment file to optionally be created at /etc/default/ensure-network, so that you can customise the parameters. Here's an example environment file:

CHECK_EXTERNAL_HOST=example.com
CHECK_INTERVAL=60

The above example environment file checks against example.com every minute instead of the default starbeamrainbowlabs.com every hour. You can, of course, specify any (or all) of the environment variables detailed above in the environment file if you wish.

That completes my setup - so hopefully I don't encounter any more network-related issues that lock me out of accessing my lab machine remotely! To install it yourself, you can do this:

# Create the directory for the script to live in
sudo mkdir /usr/local/lib/ensure-network
# Download the script & service file
sudo curl -L -O /usr/local/lib/ensure-network/ensure-network.sh https://gist.githubusercontent.com/sbrl/08e13f2ceedafe35ac7f8dbdfb8bfde7/raw/cc5ab4226472c08b09e448a257256936cc749193/ensure-network.sh
sudo curl -L -O /etc/systemd/system/ensure-network.service https://gist.githubusercontent.com/sbrl/08e13f2ceedafe35ac7f8dbdfb8bfde7/raw/adf5ed4009b3e1a09f857936fceb3581897072f4/ensure-network.service
# Start the service & enable it on boot
sudo systemctl daemon-reload
sudo systemctl start ensure-network.service
sudo systemctl enable ensure-network.service

You might need to replace the URLs there with the latest ones that download the raw content from the GitHub Gist.

Did you find this useful? Got a suggestion to make it better? Running into issues? Comment below!

Pipes, /dev/shm, or a TCP socket: Which is faster?

I've been busy patching HAIL-CAESAR (a simplified 2D flood simulation program designed for HPC supercomputers) to make it more suitable for the scale of my PhD project, and as part of this I'm trying to use the standard input & output where possible to speed up data transfer for the pre and post-processing steps, since I need to convert the data to and from different formats.

As part of this, it crossed my mind that there are actually a number of different ways of getting data in and out of a program, so I decided to do a quick (relatively informal) test to see which was fastest.

In my actual project, I'm going to be doing the following data transfers:

  • From .jsonstream.gz files to a Node.js process
  • From the Node.js process to HAIL-CAESAR
  • From HAIL-CAESAR to another Node.js process (there's a LOT of data in this bit)
  • From that Node.js process to disk as PNG files

That's a lot of transferring. In particular the output of HAIL-CAESAR, which I'm currently writing directly to disk, appears to be absolutely enormous - due mainly to the hugely inefficient storage format used.

Anyway, the 3 mechanisms I'm putting to the test here are:

  • A pipe (e.g. writing to standard output)
  • Writing to a file in /dev/shm
  • A TCP socket

If anyone can think of any other mechanisms for rapid inter-process communication, please do get in touch by leaving a comment below.

Pipe

I'm simulating a pipe with the following code:

timeout --signal=SIGINT 30s dd if=/dev/zero status=progress | cat >/dev/null

The timeout --signal=SIGINT 30s bit lets it run for 30 seconds before stopping it with a SIGINT (the same as Ctrl + C). I'm reading from /dev/zero here, because I want to test the performance of the pipe and not be limited by the speed of random number generation if I were to use /dev/urandom.

Running this on my laptop resulted in a speed of ~396 MB/s.

/dev/shm

/dev/shm is the shared memory area on Linux - and is usually backed by a tmpfs file system (i.e. an in-memory ramdisk).

Here are the command I'm using to test this:

dd if=/dev/zero of=/dev/shm/test-1gb bs=1024 count=1000000
dd if=/dev/shm/test-1gb of=/dev/null bs=1024 count=1000000

This writes a 1GB file to /dev/shm, and then reads it back again (to be consistent with the pipe test). To calculate the overall MB/s speed, we need to know the time it took to do the read and write operations. I observed the following:

Operation Speed Time
Write 692 MB/s 1.4788s
Read 890 MB/s 1.1501s

....so that's 2.6289s in total. Then, we can calculate the MB/s by dividing 1GB by the total time, giving us a total transfer speed of ~380 MB/s. This seemed quite variable though - as when I tested it the other day I got only ~273 MB/s.

TCP Socket

Finally, to test a TCP socket, I devised the following:

nc -l 8888 >/dev/null &
timeout --signal=SIGINT 30s dd status=progress if=/dev/zero | nc 127.0.0.1 8888

The first line sets up the listener, and the 2nd line is the sender. As before with the pipe test, I'm stopping it after 30 seconds. It took a moment to stabilise, but towards the end it levelled off at about ~360 MB/s.

Conclusion

After running the 3 tests, the results were as follows:

Test Speed
Pipe 396 MB/s
/dev/shm 380 MB/s
TCP Socket 360 MB/s

According to this, the pipe (i.e. writing to the standard output and reading from the standard input) is the fastest. This isn't particularly surprising (since the other methods have overhead), but interesting to test all the same. Here's a quick graph of that:

A quick bar chart of the above data

Of course, there are other considerations to take into account. For example, If you need scalable multi-core processing, then /dev/shm or TCP sockets (the latter especially since Linux has a special mechanism for multiple processes to listen on the same port and allow load-balancing between them) might be a better option - despite the additional overhead.

Other CPU architectures may have an effect on it too due to different CPU instructions being available - I ran these tests on Ubuntu 19.10 on the Intel Core i7-7500U in my laptop.

As of yet I'm unsure as to how much post-processing the data coming from HAIL-CAESAR will require - and whether it will require multiple processes to handle the load or not. I hope not - since HAIL-CAESAR is written in C++, and TCP sockets would be awkward and messy to implement (since you would probably have to use the low-level socket API, and I don't have any experience with networking in C++ yet) - and the HPC in question doesn't appear to have inotifywait installed to make listening for file writes on disk easier.

Measuring maximum RAM usage with Bash on Linux

While running a simulation on my University's Viper HPC, I found that I needed to measure the maximum RAM usage of a simulation that I was running. Since the solution wasn't particularly easy to find, I thought I'd quickly blog about it here.

Doing this isn't actually as easy as you might think. In the end, I used this:

/usr/bin/time --format 'Max RAM working set size: %Mk' command_here --foo bar

....replacing command_here --foo bar with the command you want to measure.

/usr/bin/time is a program that measures how long a command takes to execute, but it's evident that it measures a bunch of other different things as well. While the output format leaves something to be desired (hence the --format in the above), it does the job pretty well.

Note that /usr/bin/time is distinct from the time built-in you get in Bash. Depending on your shell, you may need to explicitly specify the full path to the time binary. In addition, if your system doesn't have the command (like Viper), you may need to copy it from another system that does that has the same CPU architecture.

I forget where it was that I found this solution, but if you comment below then I'll add the credit to this post if your post looks familiar.

As a quick extra, you can limit the time a command can execute for like this:

timeout 60 command_here

That will limit command_here to executing for only 60 seconds.

Found this interesting? Comment below!

How to hash and sign files with GPG and a bit of Bash

When making a release of your software or sending some important documents, it's pretty common practice (especially amongst larger projects) to distribute hashes and GPG signatures along with release binaries themselves. Example projects that do this include:

This is great practice, since it allows downloaders to verify that their download has not been corrupted, and that it was you who released them and not some imposter.

In this post, I'm going to outline how you can do this too.

I've recently both verified a number of signatures and generated some of my own too, so I thought I'd post about it here to show others how to do it too.

Verification

Before we get into the generation of hashes and signatures, we should first talk about verifying them. I'm mentioned this is good practice already, so it makes sense to briefly talk about verification first. First, let's download Nomad version 0.10.5. You can grab the files here. Download the following files:

  • nomad_0.10.5_SHA256SUMS - The hashes themselves
  • nomad_0.10.5_SHA256SUMS.sig - The GPG signature of the above file
  • nomad_0.10.5_linux_arm.zip - Nomad itself, for the ARM architecture (feel free to pick whichever one you like and adapt these instructions accordingly)

Verifying the hashes is easier, so let's do that first. We can see from the filenames that we have SHA 256 hashes, so we'll want the sha256sum command. Windows users will need to use the Windows Subsystem for Linux, or setup an msys environment:

sha256sum --ignore-missing --check nomad_0.10.5_SHA256SUMS

It should output an OK message and return an exit code zero (to check this in a script, you can do echo "$?" directly after running it to check the exit code). 99% of the time this check will succeed, but you'll be glad that you checked the 1% of the time it fails.

Checking the hashes here is the bit that ensures that the files haven't been corrupted. Next, we'll verify the GPG signature. This is the bit that ensures that the files we've downloaded were actually originally released by who we think they were. Do that like this:

gpg --verify nomad_0.10.5_SHA256SUMS.sig

Doing this, you may get an error message telling you that it can't verify the signature because you haven't got the public key of the signer imported into your keyring. To remedy this, look for the bit that tells you the key id (e.g. using RSA key 51852D87348FFC4C). Copy it, and then do this:

gpg --recv-keys 51852D87348FFC4C

This will download it and import the key id into your local GPG keyring. Then re-run the gpg --verify command above, and it should work.

Generation

Now that we know how to verify a signature, let's generate our own. First, put the files you want to hash into a directory and cd into it in your terminal. Then, let's generate the hash file:

# Hash files
find . -type f -not -name "*.SHA256" -print0 | xargs -0 -n1 -I{} -P"$(nproc)" sha256sum -b "{}" >HASHES.SHA256

We use find here to locate all the files (other than the hash file itself), and then pass them to xargs, which calls sha256sum to hash the files in question. Finally, we write the hashes to HASHES.SHA256.

Next, lets generate a GPG signature for the hash file. For this, you'll need a GPG key. That's out of scope of this post really, but this tutorial looks like it will show you how to do it. Note that in order for other people to verify the GPG signature you create, you'll probably need to upload your GPG public key to a keyserver (the article I link to shows you how to do this too).

Once done, generate the signature like this:

# Sign the hashes
gpg --sign --detach-sign --armor HASHES.SHA256

Specifically, we generate a detached signature here - meaning that it's in a separate file to the file that is being signed. The --armor there just means to wrap the signature in base64 encoding (I'm pretty sure) and some text so that it's not a raw binary file that might confuse the uninitiated.

Finally, let's verify the signature we just created - just in case:

# Verify the signature, and check we used the right key
gpg --verify HASHES.SHA256.asc

If all's good, it should tell you that the signature is ok. If you've got multiple keys, ensure that you signed it with the correct key here. GPG will sign things with the key you have marked as the default key.

Found this interesting? Having trouble? Want to say hi? Comment below!

I've got an apt repository, and you can too

Hey there!

In this post, I want to talk about my apt repository. I've had it for a while, but since it's been working well for me I thought I'd announce it to wider world on here.

For those not in the know, an apt repository is a repository of software in a particular format that the apt package manager (found on Debian-based distributions such as Ubuntu) use to keep software on a machine up-to-date.

The apt package manager queries all repositories it has configured to find out what versions of which packages they have available, and then compares this with those locally installed. Any packages out of date then get upgraded, usually after prompting you to install the updates.

Linux distributions based on Debian come with a large repository of software, but it doesn't have everything. For this reason, extra repositories are often used to deliver updates to software automatically from third parties.

In my case, I've been finding increasingly that I've been wanting to deliver updates for software that isn't packaged for installation with apt to a number of different machines. Every time I get around to installing update it felt like it was time to install another, so naturally I got frustrated enough with it that I decided to automate my problems away by scripting my own apt repository!

My apt repository can be found here: https://starbeamrainbowlabs.com/

It comes in 2 parts. Firstly, there's the repository itself - which is managed by a script that's based on my lantern build engine. It's this I'll be talking about in this post.

Secondly, I have a number of as yet adhoc custom Laminar job scripts for automatically downloading various software projects from GitHub, such that all I have to do is run laminarc queue apt-softwarename and it'll automatically package the latest version and upload it to the repository itself, which has a cron job set to fold in all of the new packages at 2am every night. The specifics of this are best explain in another post.

Currently this process requires me to login and run the laminarc command manually, but I intend to automate this too in the future (I'm currently waiting for a new release of beehive to fix a nasty bug for this).

Anyway, currently I have the following software packaged in my repository:

  • Gossa - A simple HTTP file browser
  • The Tiled Map Editor - An amazing 2D tile-based graphical map editor. You should sponsor the developer via any of the means on the Tiled Map Editor's website before using my apt package.
  • tldr-missing-pages - A small utility script for finding tldr-pages to write
  • webhook - A flexible webhook system that calls binaries and shell scripts when a HTTP call is made
    • I've also got a pleaserun-based service file generator packaged for this too in the webhook-service package

Of course, more will be coming as and when I discover and start using cool software.

The repository itself is driven by a set of scripts. These scripts were inspired by a stack overflow post that I have since lost, but I made a number of usability improvements and rewrote it to use my lantern build engine as I described above. I call this improved script aptosaurus, because it sounds cool.

To use it, first clone the repository:

git clone https://git.starbeamrainbowlabs.com/sbrl/aptosaurus.git

Then, create a new GPG key to sign your packages with:

gpg --full-generate-key

Next, we need to export the new keypair to disk so that we can use it in scripts. Do that like this:

# Identify the key's ID in the list this prints out
gpg --list-secret-keys
# Export the secret key
gpg --export-secret-keys --armor INSERT_KEY_ID_HERE >secret.gpg
chmod 0600 secret.gpg # Don't forget to lock down the permissions
# Export the public key
gpg --export --armor INSERT_KEY_ID_HERE >public.gpg

Then, run the setup script:

./aptosaurus.sh setup

It should warn you if anything's amiss.

With the setup complete, you can new put your .deb packages in the sources subdirectory. Once done, run the update command to fold them into the repository:

./aptosaurus.sh update

Now you've got your own repository! Your next step is to setup a static web server to serve the repo subdirectory (which contains the repo itself) to the world! Personally, I use Nginx with the following config:

server {
    listen  80;
    listen  [::]:80;
    listen  443 ssl http2;
    listen  [::]:443 ssl http2;

    server_name apt.starbeamrainbowlabs.com;
    ssl_certificate     /etc/letsencrypt/live/$server_name/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/$server_name/privkey.pem;

    #add_header strict-transport-security "max-age=31536000;";
    add_header x-xss-protection "1; mode=block";
    add_header x-frame-options  "sameorigin";
    add_header link '<https://starbeamrainbowlabs.com$request_uri>; rel="canonical"';

    index   index.html;
    root    /srv/aptosaurus/repo;

    include /etc/nginx/snippets/letsencrypt.conf;

    autoindex   off;
    fancyindex  on;
    fancyindex_exact_size   off;
    fancyindex_header   header.html;

    #location ~ /.well-known {
    #   root    /srv/letsencrypt;
    #}

}

This requires the fancyindex module for Nginx, which can be installed with sudo apt install libnginx-mod-http-fancyindex on Ubuntu-based systems.

To add your new apt repository to a machine, simply follow the instructions for my repository, replacing the domain name and the key ids with yours.

Hopefully this release announcement-turned-guide has been either interesting, helpful, or both! Do let me know in the comments if you encounter any issues. If there's enough interest I'll migrate the code to GitHub from my personal Git server if people want to make contributions (express said interest in the comments below).

It's worth noting that this is only a very simply apt repository. Larger apt repositories are sectioned off into multiple categories by distribution and release status (e.g. the Ubuntu repositories have xenial, bionic, eoan, etc for the version number of Ubuntu, and main, universe, multiverse, restricted, etc for the different categories of software).

If you setup your own simple apt repository using this guide, I'd love it if you could let me know with a comment below too.

Own Your Code Series List

Hey there! It's time for another series list. This time it's for my Own Your Code series, where I take a look into Gitea and Laminar CI.

Following this series, I plan to also post about my apt repository, which is hosting a growing list of software - including the tiled map editor (support them with a donation if you can), gossa (a minimalist file browser interface), and webhook - if you find any issues, you can always get in touch.

Anyway, here's the full list of posts in the series in the Own Your Code series:

In the unlikely event I post another entry in this series, I'll come back and update this list. Most likely though I'll be posting related things standalone, rather than part of this series - so subscribe for updates with your favourite method if you'd like to stay up-to-date with my latest blog posts (Atom/RSS, Email, Twitter, Reddit, and Facebook are all supported - just ask if there's something missing).

Art by Mythdael