Backing up to AWS S3 with duplicity
The server that this website runs on backs up automatically to the Simple Storage Service, provided by Amazon Web Services. Such an arrangement is actually fairly cheap - only ~20p/month! I realised recently that although I've blogged about duplicity before (where I discussed using an external hard drive), I never covered how I fully automate the process here on starbeamrainbowlabs.com.
(Above: A bunch of hard drives. The original can be found here.)
It's fairly similar in structure to the way it works when backing up to an external hard drive - just with a few different components here and there, as the script that drives this is actually older than the external hard drive one.
To start, we'll need an AWS S3 bucket. I'm not going to cover how to do this here, as the AWS interface keeps changing, and this guide will likely become outdated quickly. Instead, the AWS S3 documentation has an official guide on how to create one. Make sure it's private, as you don't want anyone getting a hold of your backups!
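(If you'd rather avoid the web console entirely, something along these lines ought to work with the aws command-line tool. This is just a sketch though - it assumes a reasonably recent version of the CLI, and the bucket name and region are placeholders to adjust to suit.)
# Create a bucket in eu-west-1 (bucket names are globally unique, so pick something nobody else has)
aws s3 mb s3://INSERT_BUCKET_NAME_HERE --region eu-west-1
# Block all public access to it, just to be on the safe side
aws s3api put-public-access-block \
    --bucket INSERT_BUCKET_NAME_HERE \
    --public-access-block-configuration "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"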
With that done, you should have both an access key and a secret. Note these down in a file called .backup-password in a new directory that will hold the backup script, like this:
#!/usr/bin/env bash
PASSPHRASE="INSERT_RANDOM_PASSWORD_HERE";
AWS_ACCESS_KEY_ID="INSERT_AWS_ACCESS_KEY_HERE";
AWS_SECRET_ACCESS_KEY="INSERT_AWS_SECRET_KEY_HERE";
The PASSPHRASE here should be a long and unintelligible string of random characters, and is what will be used to encrypt your backups. Note that down somewhere safe too - preferably in your password manager or somewhere else at least as secure.
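If you need a hand generating such a passphrase in the first place, something like this ought to do the trick (openssl is usually installed already, but any decent random string generator will do):
# Generate 48 random bytes and base64-encode them into a printable passphrase
openssl rand -base64 48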
If you're on Linux, you should also set the permissions on the .backup-password file to ensure nobody gets access to it who shouldn't. Here's how I did it:
sudo chown root:root .backup-password
sudo chmod 0400 .backup-password
This ensures that only the root user is able to read the file - and nobody can write to it.
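You can double-check that it worked like so (assuming GNU coreutils, which most Linux distributions ship with):
stat -c '%U:%G %a' .backup-password
# Should print: root:root 400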
With our secrets generated and safely stored, we can start writing the backup script itself. Let's start by reading in the secrets:
#!/usr/bin/env bash
source /root/.backup-password
I stored my .backup-password file in /root. Next, let's export these values. This enables the subprocesses we invoke to access these environment variables:
export PASSPHRASE;
export AWS_ACCESS_KEY_ID;
export AWS_SECRET_ACCESS_KEY;
Now it's time to do the backup itself! Here's what I do:
duplicity \
--full-if-older-than 2M \
--exclude /proc \
--exclude /sys \
--exclude /tmp \
--exclude /dev \
--exclude /mnt \
--exclude /var/cache \
--exclude /var/tmp \
--exclude /var/backups \
--exclude /srv/www-mail/rainloop/v \
--s3-use-new-style --s3-european-buckets --s3-use-ia \
/ s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE
Compressed version:
duplicity --full-if-older-than 2M --exclude /proc --exclude /sys --exclude /tmp --exclude /dev --exclude /mnt --exclude /var/cache --exclude /var/tmp --exclude /var/backups --exclude /srv/www-mail/rainloop/v --s3-use-new-style --s3-european-buckets --s3-use-ia / s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE
This might look long and complicated, but it's mainly due to the large number of directories that I'm excluding from the backup. The key options here are --full-if-older-than 2M and --s3-use-ia, which specify that I want a full backup to be done every 2 months and to use the infrequent access pricing tier to reduce costs.
The other important bit here is to replace INSERT_BUCKET_NAME_HERE with the name of the S3 bucket that you created.
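Once the first backup has finished, it's worth sanity-checking that duplicity can actually see it in the bucket. The collection-status subcommand does just that - here's a rough sketch of a one-off check (run as root so it can read the password file, and with the bucket name placeholder swapped out as before):
# Pull in the credentials and passphrase, then ask duplicity what it can see
source /root/.backup-password
export PASSPHRASE AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
duplicity collection-status \
    --s3-use-new-style --s3-european-buckets \
    s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE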
Backing up is all very well, but we want to remove old backups too - in order to avoid ridiculous bills (AWS are terrible for this - there's no way that you can set a hard spending limit! O.o). That's fairly easy to do:
duplicity remove-older-than 4M \
--force \
--s3-use-new-style --s3-european-buckets --s3-use-ia \
s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE
Again, don't forget to replace INSERT_BUCKET_NAME_HERE with the name of your S3 bucket. Here, I specify that I want all backups older than 4 months (the 4M bit) to be deleted.
It's worth noting that it may not actually be able to remove backups older than 4 months, as it can only delete a full backup if there are no incremental backups that depend on it. To this end, you'll need to plan for potentially storing (and being charged for) an extra backup cycle's worth of data. In my case, that's an extra 2 months' worth of data.
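If you'd rather think in terms of whole backup chains instead of age, duplicity also has a remove-all-but-n-full subcommand that keeps a set number of full backups (plus their incrementals) and deletes everything older. Something like this should be roughly equivalent to the above, though I'm using the age-based version myself:
# Keep the 2 most recent full backup chains and delete the rest
duplicity remove-all-but-n-full 2 \
    --force \
    --s3-use-new-style --s3-european-buckets --s3-use-ia \
    s3://s3-eu-west-1.amazonaws.com/INSERT_BUCKET_NAME_HERE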
That's the backup part of the script complete. If you want, you could finish up here and have a fully-working backup script. Personally, I want to know how much data is in my S3 bucket - so that I can get an idea as to how much I'll be charged when the bill comes through - and also so that I can see if anything's going wrong.
Unfortunately, this is a bit fiddly. Basically, we have to utilise the AWS command-line interface to recursively list the entire contents of our S3 bucket in summarising mode in order to get it to tell us what we want to know. Here's how to do that:
aws s3 ls s3://INSERT_BUCKET_NAME_HERE --recursive --human-readable --summarize
Don't forget to replace INSERT_BUCKET_NAME_HERE with your bucket's name. The output from this is somewhat verbose, so I ended up writing an awk script to process it and output something nicer. Said awk script looks like this:
# Grab the object count from the "Total Objects:" line
/^\s*Total\s+Objects/ { parts[i++] = $3 }
# Grab the size and its unit (e.g. GiB) from the "Total Size:" line
/^\s*Total\s+Size/ { parts[i++] = $3; parts[i++] = $4; }
END {
    print(
        "AWS S3 Bucket Status:",
        parts[0], "objects, totalling "
        parts[1], parts[2]
    );
}
If we put all that together, it should look something like this:
aws s3 ls s3://INSERT_BUCKET_NAME_HERE --recursive --human-readable --summarize | awk '/^\s*Total\s+Objects/ { parts[i++] = $3 } /^\s*Total\s+Size/ { parts[i++] = $3; parts[i++] = $4; } END { print("AWS S3 Bucket Status:", parts[0], "objects, totalling " parts[1], parts[2]); }'
...it's a bit of a mess. Perhaps I should look at putting that awk script in a separate file :P Anyway, here's some example output:
AWS S3 Bucket Status: 602 objects, totalling 21.0 GiB

Very nice indeed. To finish off, I'd rather like to know how long it took to do all this. Thankfully, bash has an inbuilt automatic variable that holds the number of seconds since the current process started, so it's just a case of parsing this out into something readable:
echo "Done in $(($SECONDS / 3600))h $((($SECONDS / 60) % 60))m $(($SECONDS % 60))s.";
...I forget which Stackoverflow answer it was that showed this off, but if you know - please comment below and I'll update this to add credit. This should output something like this:
Done in 0h 12m 51s.
Awesome! We've now got a script that backs up to AWS S3, deletes old backups, and tells us both how much space on S3 is being used and how long the whole process took.
I'm including the entire script at the bottom of this post. I've changed it slightly to add a single variable for the bucket name - so there's only 1 place on line 9 (highlighted) you need to update there.
(Above: A Geopattern, tiled using the GNU Image Manipulation Program)
#!/usr/bin/env bash
# Make sure duplicity exists
test -x "$(which duplicity)" || exit 1;
# Pull in the password
. /root/.backup-password
AWS_S3_BUCKET_NAME="INSERT_BUCKET_NAME_HERE";
# Allow duplicity to access it
export PASSPHRASE;
export AWS_ACCESS_KEY_ID;
export AWS_SECRET_ACCESS_KEY;
# Actually do the backup
# Backup strategy:
# 1 x backup per week:
# 1 x full backup per 2 months
# incremental backups in between
# S3 Bucket URI: https://${AWS_S3_BUCKET_NAME}/
echo "[ $(date '+%F %r') ] Performing backup."
duplicity --full-if-older-than 2M --exclude /proc --exclude /sys --exclude /tmp --exclude /dev --exclude /mnt --exclude /var/cache --exclude /var/tmp --exclude /var/backups --exclude /srv/www-mail/rainloop/v --s3-use-new-style --s3-european-buckets --s3-use-ia / s3://s3-eu-west-1.amazonaws.com/${AWS_S3_BUCKET_NAME}
# Remove old backups
# You have to plan for 1 extra full backup cycle when
# calculating space requirements - duplicity only
# removes a backup if it won't invalidate those further
# along the chain - the oldest backup will always be
# a full one.
echo "[ $(date '+%F %r') ] Backup complete. Removing old volumes."
duplicity remove-older-than 4M --force --encrypt-key F2A6D8B6 --s3-use-new-style --s3-european-buckets --s3-use-ia s3://s3-eu-west-1.amazonaws.com/${AWS_S3_BUCKET_NAME}
echo "[ $(date '+%F %r') ] Cleanup complete."
aws s3 ls s3://${AWS_S3_BUCKET_NAME} --recursive --human-readable --summarize | awk '/^\s*Total\s+Objects/ { parts[i++] = $3 } /^\s*Total\s+Size/ { parts[i++] = $3; parts[i++] = $4; } END { print("AWS S3 Bucket Status:", parts[0], "objects, totalling " parts[1], parts[2]); }'
echo "Done in $(($SECONDS / 3600))h $((($SECONDS / 60) % 60))m $(($SECONDS % 60))s.";