Ensuring a Linux machine's network connection stays up with Bash
Recently, I had the unpleasant experience of my Lab machine at University dropping offline. It has a tendency to do this randomly - and normally I'd just reboot it myself, but since I'm working from home at the moment it meant that I couldn't go in to fix it. This unfortunately meant that I was stuck waiting for a generous technician to go in and reboot it for me.
With access now restored I decided that I really didn't want this to happen again, so I've written a simple Bash script to resolve the issue.
It works by checking for an Internet connection every hour by pinging starbeamrainbowlabs.com - and if it doesn't manage to do so successfully, then it will reboot. A simple concept, but I discovered a number of things that needed considering while writing it:
- To avoid detecting transient network issues, we should make multiple attempts before giving up and rebooting
- Those multiple attempts need to be delayed to be effective
- We mustn't reboot more than once an hour to avoid getting into a 'reboot loop'
- If we're running an experiment, we need a way to temporarily stop it from doing its checks, with the checks resuming automatically afterwards
- We could try and diagnose the network error or turn the networking off and on again, but if it gets stuck halfway through then we're locked out (very undesirable) - so it's easier / safer to just reboot
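To illustrate the first two points, here's a minimal sketch of a delayed retry loop. It is not the actual script from the Gist: the function names (probe_once, connection_is_up) are hypothetical, the timeout, retry, and delay defaults are assumptions, and I've treated CHECK_RETRIES as the total number of attempts.

```bash
#!/usr/bin/env bash
# Sketch only: not the actual ensure-network.sh from the Gist.
# The timeout, retry and delay defaults below are assumptions for illustration.
CHECK_EXTERNAL_HOST="${CHECK_EXTERNAL_HOST:-starbeamrainbowlabs.com}"
CHECK_TIMEOUT="${CHECK_TIMEOUT:-5}"           # seconds (assumed)
CHECK_RETRIES="${CHECK_RETRIES:-3}"           # total attempts (assumed)
CHECK_RETRY_DELAY="${CHECK_RETRY_DELAY:-60}"  # seconds (assumed)

# Send one ping and wait at most CHECK_TIMEOUT seconds for the reply.
probe_once() {
	ping -c 1 -W "${CHECK_TIMEOUT}" "${CHECK_EXTERNAL_HOST}" >/dev/null 2>&1
}

# Run a probe command (probe_once by default) up to CHECK_RETRIES times,
# pausing CHECK_RETRY_DELAY seconds between attempts.
# Returns 0 on the first success, or 1 if every attempt fails.
connection_is_up() {
	local probe="${1:-probe_once}"
	local attempt=1
	while [ "${attempt}" -le "${CHECK_RETRIES}" ]; do
		if "${probe}"; then
			return 0
		fi
		# No point in sleeping after the final failed attempt
		if [ "${attempt}" -lt "${CHECK_RETRIES}" ]; then
			sleep "${CHECK_RETRY_DELAY}"
		fi
		attempt=$((attempt + 1))
	done
	return 1
}
```

Only when connection_is_up returns a non-zero exit code would the caller go on to consider rebooting, so a single dropped ping never triggers a reboot on its own.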
With these considerations in mind, I came up with this: ensure-network.sh (link to part of a GitHub Gist, as it's quite long)
This script requires Bash version 4+ and has a number of environment variables that can configure its behaviour:
| Environment Variable | Description |
|---|---|
| CHECK_EXTERNAL_HOST | The domain name or IP address to ping to check the connection |
| CHECK_INTERVAL | The interval to check the connection in seconds |
| CHECK_TIMEOUT | Wait at most this long for a reply to our ping |
| CHECK_RETRIES | Retry this many times before giving up and rebooting |
| CHECK_RETRY_DELAY | Delay this many seconds in between retries |
| CHECK_DRY_RUN | If true, then don't actually reboot (useful for testing) |
| CHECK_REBOOT_DELAY | Leave at least this many minutes in between reboots |
| CHECK_POSTPONE_FILE | If this file exists and has a recent last-modified time (mtime), don't actually reboot |
| CHECK_POSTPONE_MAXAGE | The maximum age in minutes of the CHECK_POSTPONE_FILE to consider it fresh and avoid rebooting |
With these environment variables, it covers the first 4 points in the above list. To expand on CHECK_POSTPONE_FILE: if I'm running an experiment and I don't want the machine to reboot in the middle of it, then I can simply run touch /path/to/postpone_file to delay network connection-related reboots for 7 days (by default). After this time, it will automatically start rebooting again if it drops off the network. This ensures that it will always resume monitoring eventually - with a more manual system I'd forget to re-enable it and then lose access.
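The freshness check on the postpone file could be sketched like this. It's an illustration rather than the code from the Gist: postpone_is_active is a hypothetical function name, the default file path is made up, and it assumes a find that supports -maxdepth and -mmin (GNU and BSD find both do).

```bash
#!/usr/bin/env bash
# Sketch only: the default path here is hypothetical.
CHECK_POSTPONE_FILE="${CHECK_POSTPONE_FILE:-/tmp/ensure-network-postpone}"
CHECK_POSTPONE_MAXAGE="${CHECK_POSTPONE_MAXAGE:-10080}"  # minutes: 7 days

# Succeeds (exit code 0) if the postpone file exists and its mtime is
# less than CHECK_POSTPONE_MAXAGE minutes old.
postpone_is_active() {
	# find prints the path only when the file exists and is fresh enough;
	# for a missing or stale file it prints nothing.
	[ -n "$(find "${CHECK_POSTPONE_FILE}" -maxdepth 0 \
		-mmin "-${CHECK_POSTPONE_MAXAGE}" 2>/dev/null)" ]
}
```

Because the decision is based purely on the file's mtime, running touch on the file again simply restarts the postponement period, and doing nothing lets it expire by itself.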
Another consideration is that the /var/cache directory must exist. This is because an empty tracking file is created there to keep track of when the last network connection-related reboot occurred.
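As a rough sketch of how such a tracking file can guard against a reboot loop (again, not the actual script): the file name, the function names, and the 60 minute default are all assumptions, and CHECK_DRY_RUN deliberately defaults to true here so that trying the sketch out can't reboot anything.

```bash
#!/usr/bin/env bash
# Sketch only: the tracking file name and the defaults are assumptions.
REBOOT_TRACKING_FILE="${REBOOT_TRACKING_FILE:-/var/cache/ensure-network-last-reboot}"
CHECK_REBOOT_DELAY="${CHECK_REBOOT_DELAY:-60}"  # minutes (assumed default)
CHECK_DRY_RUN="${CHECK_DRY_RUN:-true}"          # safe default for this sketch

# Succeeds if the tracking file was modified less than
# CHECK_REBOOT_DELAY minutes ago.
rebooted_recently() {
	[ -n "$(find "${REBOOT_TRACKING_FILE}" -maxdepth 0 \
		-mmin "-${CHECK_REBOOT_DELAY}" 2>/dev/null)" ]
}

# Reboot unless we already did so recently.
# Returns 1 if the reboot was skipped.
request_reboot() {
	if rebooted_recently; then
		echo "Rebooted under ${CHECK_REBOOT_DELAY} minutes ago, skipping" >&2
		return 1
	fi
	# Update the mtime before rebooting so the record is there after boot.
	# Note that this sketch also records dry runs.
	touch "${REBOOT_TRACKING_FILE}"
	if [ "${CHECK_DRY_RUN}" = "true" ]; then
		echo "Dry run: would reboot now" >&2
		return 0
	fi
	systemctl reboot
}
```

The file's contents never matter, which is why an empty file is enough: only its last-modified time is consulted.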
With the script written, the next step is to have it run automatically on boot. For systemd-based systems such as my lab machine, a systemd service is the order of the day. This is relatively simple:
[Unit]
Description=Reboot if the network connection is down
After=network.target
[Service]
Type=simple
# Because it needs to be able to reboot
User=root
Group=root
EnvironmentFile=-/etc/default/ensure-network
ExecStartPre=/bin/sleep 60
ExecStart=/bin/bash "/usr/local/lib/ensure-network/ensure-network.sh"
SyslogIdentifier=ensure-access
StandardError=syslog
StandardOutput=syslog
[Install]
WantedBy=multi-user.target
(View the latest version in the GitHub Gist)
This assumes that the ensure-network.sh script is located at /usr/local/lib/ensure-network/ensure-network.sh. It also allows for an environment file to optionally be created at /etc/default/ensure-network, so that you can customise the parameters. Here's an example environment file:
CHECK_EXTERNAL_HOST=example.com
CHECK_INTERVAL=60
The above example environment file checks against example.com every minute instead of the default starbeamrainbowlabs.com every hour. You can, of course, specify any (or all) of the environment variables detailed above in the environment file if you wish.
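For the curious, systemd exports each line of the environment file as an environment variable for the service, so a script can fall back to a default for anything that isn't set. Here's a minimal sketch of that pattern using the three defaults mentioned in this post; the apply_defaults function name is hypothetical, and the real script may well structure this differently.

```bash
#!/usr/bin/env bash
# Fill in any setting that was not provided via the environment.
# ${VAR:-default} keeps an existing non-empty value and otherwise
# substitutes the default.
apply_defaults() {
	CHECK_EXTERNAL_HOST="${CHECK_EXTERNAL_HOST:-starbeamrainbowlabs.com}"
	CHECK_INTERVAL="${CHECK_INTERVAL:-3600}"                 # seconds: every hour
	CHECK_POSTPONE_MAXAGE="${CHECK_POSTPONE_MAXAGE:-10080}"  # minutes: 7 days
}
```

With the example environment file above, CHECK_INTERVAL would end up as 60 and CHECK_EXTERNAL_HOST as example.com, while CHECK_POSTPONE_MAXAGE would keep its default.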
That completes my setup - so hopefully I don't encounter any more network-related issues that lock me out of accessing my lab machine remotely! To install it yourself, you can do this:
# Create the directory for the script to live in
sudo mkdir /usr/local/lib/ensure-network
# Download the script & service file
sudo curl -L -o /usr/local/lib/ensure-network/ensure-network.sh https://gist.githubusercontent.com/sbrl/08e13f2ceedafe35ac7f8dbdfb8bfde7/raw/cc5ab4226472c08b09e448a257256936cc749193/ensure-network.sh
sudo curl -L -o /etc/systemd/system/ensure-network.service https://gist.githubusercontent.com/sbrl/08e13f2ceedafe35ac7f8dbdfb8bfde7/raw/adf5ed4009b3e1a09f857936fceb3581897072f4/ensure-network.service
# Start the service & enable it on boot
sudo systemctl daemon-reload
sudo systemctl start ensure-network.service
sudo systemctl enable ensure-network.service
You might need to replace the URLs there with the latest ones that download the raw content from the GitHub Gist.
Did you find this useful? Got a suggestion to make it better? Running into issues? Comment below!