Archive

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing docker documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases rendering resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

NAS, Part 4: Time machines | Automatic snapshotting with btrfs-snapshot

In the last part in this series, I compared ZFS with Btrfs. I ended up choosing Btrfs because it was easier to install and came with a number of advantages. Since last time, I've now put Btrfs to work and have about ~1.3 TiB of data stored in it (much of which is from various devices across the network automatically backing up to it). Before we continue, here's a list of the parts in the series so far:

In this post, I'm going to talk about the automatic snapshotting I've setup. Btrfs supports creating snapshots, which are defined as subvolumes that are seeded with data from another subvolume (boundaries between subvolumes are not crossed). Most of the time, these are created to be read-only. In addition because of the copy-on-write system Btrfs uses, a snapshot takes no disk space on its own (other than that required to store the fact that it exists) - it only starts to consume disk space when files that it contains are modified in the original subvolume.

To this end, we can efficiently keep a rotating series of snapshots to serve as an initial safety net should a someone accidentally delete a file. Of course, we can't assume that snapshots will be ok as the only backup (I use Restic for that - I'm in the process of reconfiguring it for my new setup) - but they are still useful things to have.

To take a Btrfs snapshot, you can do this:

sudo btrfs subvolume snapshot -r path/to/source_subvolume path/to/target

The problem here, of course, is that you also need a way to delete old snapshots too. While I could roll my own solution for this, I figured that someone has already solved this problem - so it might save me some effort if I look for a pre-existing solution first.

After doing a bit of searching without success, I asked on Reddit, and the helpful folks there gave me a number of suggestions:

Of these 3, snapper seemed to be the most popular. From some reading, it appeared to be powerful and flexible - at the cost of being easy to understand. btrbk seemed to be feature-packed too, but in the end I decided on btrfs-snapshot.

btrfs-snapshot is designed to be used with cron. For example, I have something like this for one of my subvolumes in root user's crontab:

0 * * * *       /root/btrfs-snapshot-rotation/btrfs-snapshot path/to/subvolume path/to/subvolume/.snapshots hourly 8
0 2 * * *       /root/btrfs-snapshot-rotation/btrfs-snapshot path/to/subvolume path/to/subvolume/.snapshots daily 4
0 2 * * 7       /root/btrfs-snapshot-rotation/btrfs-snapshot path/to/subvolume path/to/subvolume/.snapshots weekly 4

Given a subvolume at path/to/subvolume, it creates the following snapshots in a nested subvolume in path/to/subvolume/.snapshots (which needs to be created manually: sudo btrfs subvolume create path/to/subvolume/.snapshots):

• 8 x hourly snapshots
• 4 x daily snapshots
• 4 x weekly snapshots

I find the system so beautifully simple and easy to understand. This is important for me in a system like this, as it has to be easy for me to understand when I inevitably come back to it months or even years later when I've forgotten how it works. The arguments to btrfs-snapshot are easy to understand, and are in the form path/to/source path/to/target tag_name number_of_snapshots_to_keep.

This has the added bonus that if a user deletes a file accidentally in our shared drive, they can retrieve it on their own from the .snapshots directory - without my intervention.

With this in place and the data (mostly) moved over, my NAS project is almost complete. The final task I have left to do is to setup a proper backup system with Restic to either a remote (e.g. Backblaze B2) or offline location (such as an external HDD).

The latter might prove to be a problem though, since the maximum amount of data I can store right now is 5.5 TiB and is only going to grow from there. Portable external hard drives I've seen online don't appear to go up that high, so I suspect I'll need to choose another plan.

Should I encounter some interesting issues when setting this final backup step up, I'll make an additional post in this series. If not though, this will probably be the last entry in this series. If you have any questions about my setup, please comment below! I'll dod my best to answer any questions.

NAS, Part 3: Decisions | Choosing a Filesystem

It's another entry in my NAS series! It's still 2020 for me as I type this, but I hope that 2021 is going well. Before we continue, I recommend checking out the previous posts in this series:

Part 1 in particular is useful for context as to the hardware I'm using. Part 2 is a review of my experience assembling the system. In this part, we're going to look at my choice of filesystem and OS.

I left off in the last post after I'd booted into the installer for Ubuntu Server 20.04. After running through that installer, I performed my collection of initial setup tasks for any server I manage:

• Setup an SSH server
• Enable UFW
• Setup my personal ~/bin folder
• Assign a static IP address (why won't you let me choose an IP, Netgear RAX120? Your UI lets me enter a custom IP, but it devices don't ultimately end up with the IP I tell you to assign to them....)
• Setup Collectd
• A number of other tasks I forget

With my basic setup completed, I also setup a few things specific to devices that have SMART-enabled storage devices:

• Setup an email relay (via autossh) for mail delivery
• Installed smartd (which sends you emails when there's something wrong with 1 your disks)
• Installed and configured hddtemp, and integrated it with collectd (a topic for another post, I did this for the first time)

With these out of the way and after making a mental note to sort out backups, I could now play with filesystems with a view to making a decision. The 2 contenders:

• (Open)ZFS
• Btrfs

Both of these filesystems are designed to be spread across multiple disks in what's known as a pool thereof. The idea behind them is to enable multiple disks to be presented to the user as a single big directory, with the complexities as to which disk(s) a file is/are stored on. They also come with extra nice features, such as checksumming (which allows them to detect corruption), snapshotting (taking snapshots of what the filesystem looks like at a given point in time), automatic data deduplication, compression, snapshot send / receiving, and more!

Overview: ZFS

ZFS is a filesystem originally developed by Sun Microsystems in 2001. Since then, it has been continually developed and improved. After Oracle bought Sun Microsystems in 2010, the source code for ZFS was closed - hence the OpenZFS fork was born. It's licenced under the CDDL, which isn't compatible with the GPLv2 used by the Linux Kernel. This causes some minor installation issues.

As a filesystem, it seems to be widely accepted to be rock solid and mature. It's used across the globe by home users and businesses both large and small to store huge volumes of data. Given its long history, it has proven its capability to store data safely.

It does however have some limitations. For one, it only has limited support for adding drives to a zpool (a pool of disks in the ZFS world), which is a problem for me - as I'd prefer to have the ability to add drives 1 at a time. It also has limited support for changing key options such as the compression algorithm later, as this will only affect new files - and the only way to recompress old files is to copy them in and out of the disk again.

Overview: Btrfs

Btrfs, or B-Tree File System is a newer filesystem that development upon which began in 2007, and was accepted into the Linux Kernel in 2009 with the release of version 1.0. It's licenced under the GPLv2, the same licence as the Linux Kernel. As of 2020, many different distributions of Linux ship with btrfs installed by default - even if it isn't the default filesystem (that's ext4 in most cases).

Unlike ZFS, Btrfs isn't as well-tested in production settings. In particular, it's raid5 and raid6 modes of operation are not well tested (though this isn't a problem, since raid1 operates at file/block level and not disk level as it does with ZFS, which enables us to use interesting setups like raid1 striped across 3 disks). Despite this, it does look to be stable enough - particularly as openSUSE has set it to be the default filesystem.

It has a number of tempting features over ZFS too. For example, it supports adding drives 1 at a time, and you can even convert your entire pool from 1 raid level to another dynamically while it's still mounted! The same goes for converting between compression algorithms - it's all done using a generic filter system.

Such a system is useful when adding new disks to the pool too, as they it can be used to rebalance data across all the disks present - allowing for new disks to be accounted for and faulty disks to be removed, preserving the integrity of the data while a replacement disk is ordered for example.

While btrfs does have a bold list of features that they'd like to implement, they haven't gotten around to all of them yet (the status of existing features can be found here). For example, while ZFS can use an SSD as a dedicated caching device, btrfs doesn't yet have this ability - and nobody appears to have claimed the task on the wiki.

Performance

Inspired by a recent Ars Technica article, I'd like to test the performance of the 2 filesystems at hand. I ran the following tests for reading and writing separately:

• 4k-random: Single 4KiB random read/write process
• 64k-random-16p: 16 parallel 64KiB random read/write processes
• 1m-random: Single 1MiB random write process

I did this for both ZFS in raid5 mode, and Btrfs in raid5 (though if I go with btrfs I'll be using raid1, as I later discovered - which I theorise would yield a minor performance improvement). I tested ZFS twice: once with gzip compression, and again with zstd compression. As far as I can tell, Btrfs doesn't have compression enabled by default. Other than the compression mode, no other tuning was done - all the settings were left at their defaults. Both filesystems were completely empty aside from the test files, which were created automatically in a chowned subdirectory by fio.

The graph uses a logarithmic scale. My initial impressions are that ZFS benefits from parallelisation to a much greater extent than btrfs - though I suspect that I may be CPU bound here, which is an unexpected finding. I may also be RAM-bound too, as I observed a significant increase in RAM usage when both filesystems were under load. Buying another 8GB would probably go a long way to alleviating that issue.

Other than that, zstd appears to provide a measurable performance improvement over gzip compression. Btrfs also appears to benefit from writing larger blocks over smaller ones.

Overall, some upgrades to my NAS are on the cards should I be unsatisfied with the performance in future:

• More RAM would assist in heavy i/o loads
• A better CPU would probably raise the peak throughput speeds - if I can figure out what to do with the old one

But for now, I'm perfectly content with these speeds. Especially since I have a single gigabit ethernet port on my storage NAS, I'm not going to need anything above 1000Mbps - which is 119.2 MiB/s if you'd like to compare against the graph above.

Conclusion

As for my final choice of filesystem, I think I'm going to go with btrfs. While I'm aware that it isn't as 'proven' as ZFS - and slightly less performant too - I have a number of reasons for this decision:

1. Btrfs allows you to add disks 1 at a time, and ZFS makes this difficult
2. Btrfs has the ability to convert to a different raid level at a later date if I change my mind
3. Btrfs is easier to install, since it's already built-in to Ubuntu Server 20.04.

NAS, Part 2: Assembly and Installation

Welcome back! This is part 2 of a series of posts about my new NAS (network attached storage) device I'm building. If you haven't read it yet, I recommend you go back and read part 1, in which I talk about the hardware I'm using.

Since the Fractal Design Node 804 case came first, I was able to install the parts into it as they arrived. First up was the motherboard (an ASUS PRIME B450M-A) and CPU (an AMD Athlon 3000G).

The motherboard was a pain. As I read, the middle panel of the case has some flex in it, so you've got to hold it in place with one hand we you're screwing the motherboard in. This in and of itself wasn't an issue at all, but the screws for the motherboard were really stiff. I think this was just the motherboard, but it was annoying.

Thankfully I managed it though, and then set to work installing the CPU. This went well - the CPU came with thermal paste on top already, so I didn't need to buy my own. The installation process for the stock CPU heatsink + fan was unfamiliar, which took me a moment to decipher how the mechanism worked.

Following this, I connected the front ports from the case up to the motherboard (consulting my motherboard's documentation showed me where I needed to plug these in - I remember this being something I struggled with when I last built an (old) PC when doing some IT technician work experience some years ago). The RAM - while a little stiff (to be expected) - went in fine too. I might buy another stick later if I run into memory pressure, but I thought a single 8GB stick would be a good place to start.

The case came with a dedicated fan controller board that has a high / medium / low switch on the back too, so I wired up the 3 included Noctua case fans to this instead of the slots on the motherboard. The CPU fan (nothing special yet - just the stock fan that came with the CPU) went into the motherboard though, as the fan controller didn't have room - and I thought that the motherboard would be better placed to control the speed of that one.

(Above: The inside of the 2 sides of the case. Left: The 'hot' side, Right: The 'cold' side.)

The case is split into 2 sides: 1 for 'hot' components (e.g. the motherboard and CPU), and another for 'cold' components (e.g. the HDDs and PSU). Next up were the hard disks - so I mounted the SSD for the operating system to the base of the case in the 'hot' side, as the carriage in the cold side fits only 3.5 inch disks, and my SSD is a 2.5 inch disk. While this made the cabling slightly awkward, it all worked out in the end.

For the 3.5 inch HDDs (for data storage), I found I was unable to mount them with the included pieces of bracket metal that allow you to put screws into the bottom set of holes - as the screws wouldn't fit through the top holes. I just left the metal bracket pieces out and mounted the HDDs directly into the carriage, and it seems to have worked well so far.

The PSU was uneventful too. It fit nicely into the space provided, and the semi-modular nature of the cables provided helped tremendously to avoid a mess of cables all over the place as I could remove the cables I didn't need.

Finally, the DVD writer had some stiff screws, but it seemed to mount well enough (just a note: I've been having an issue I need to investigate with this DVD drive whereby I can't take a copy of a disk - e.g. the documentation CD that came with my motherboard - with dd, as it reports an IO error. I need to investigate this further, so more on that in a later post).

The installation of the DVD drive completed the assembly process. To start it up for the first time, I connected my new NAS to my television temporarily so that I could see the screen. The machine booted fine, and I dove straight into the BIOS.

(Above: The BIOS that comes with the ASUS motherboard, before the clock was set by Ubuntu Server 20.04 - which I had yet to install)

Unlike my new laptop, the BIOS that comes with the ASUS motherboard is positively delightful. It has all the features you'd need, laid out in a friendly interface. I observed some minor input lag, but considering this is a BIOS we're talking about here I can definitely overlook that. It even has an online update feature, where you can plug in an Ethernet cable and download + install BIOS updates from the Internet.

I tweaked a few settings here, and then rebooted into my flash drive - onto which I loaded an Ubuntu Server 20.04 ISO. It booted into this without complaint (unlike a certain laptop I'm rather unhappy with at the moment), and then I selected the appropriate ISO and got to work installing the operating system (want your own multiboot flash drive? I've blogged about that already! :D).

In the next post, I'm going to talk about the filesystem I ultimately chose. I'm also going to show and discuss some performance tests I ran using fio following this Ars Technica guide.

A comparison of compression formats for storing JSON

Happy new year, everyone!

I've blogged about different aspects of a (not so?) little project of mine several times now (exhibits A, B, C, D, and finally E - even if I didn't know it at the time), but it appears that I end up running into all sorts of interesting problems that I invent cool solutions for. I also find myself doing a bunch of research that I'm surprised nobody's compiled into a single place yet - as is the case in this blog post. (Also, Happy 2018 everyone! First post of the year :D)

I've been refactoring the subsystem that saves a considerable amount of JSON data to a bunch of different files on disk. Obviously, I'm interested in minimising the amount of space this JSON data takes up on disk. As this saving process happens in the background on a separate thread, I'm not too concerned about performance - other than it can't be too slow. With this in mind, I've found myself testing a bunch of different compression algorithms. Let's introduce our test data:

1. A 17MiB card game dataset, as a single minified JSON file
2. A 40KiB 'live' specimen chunk's data, saved as a single minified JSON file.

I can't remember which card game the first dataset is from, but I do know that I found it in the awesome JSON datasets list. Next up, here's our cast of compression algorithms we'll be testing:

1. The venerable GZip
2. BZip2 - Apparently GZIP, but smaller and slightly more computationally expensive
3. XZ (the newer child of LZMA2)
4. 7zip

A colourful cast, to be sure! Let's run them through their paces - starting with the card game dataset. Here are the results I observe with each set to their default settings:

Format Size
Uncompressed 17M
gzip 2.4M
bzip2 1.6M
xz 1.3M
7zip 1.4M
brotli 1.3M

Very interesting. It looks like xz and brotli are tied in first place - though I observed that brotli took ages in comparison to all the other algorithms I tested - and upon closer inspection xz beat it by 17.3KiB. Numbers are all very well, but to really see what's going on here, let's plot it on a graph:

That's better! I can actually make some comparisons now. From this graph we can observe that gzip is the worst of the lot, followed by bzip2. 7zip is surprisingly in third place, but then again it is designed for multiple files, whereas the rest of them are designed for a single stream of data. In second place is the terribly slow brotli, and finally in first place is xz.

Hrm - very interesting. How do our algorithms measure up when confronted a smaller load though? Here are the results for the sample chunk data:

Format Size
Uncompressed 40K
gzip 5.4K
bzip2 4.3K
xz 4.6K
7zip 4.9K
Brotli 4.6K

Interesting results, to be sure, but I can't discern much from that. Let's plot a graph:

Very interesting. With smaller loads, it appears that bzip2 performs much better with smaller loads than any other algorithm. While gzip is still the worst performing algorithm, while xz and brotli, surprisingly, performed much worse than bzip2.

To that end, I'm think I'm going to be choosing bzip2 as my compression of choice for this job, as it produces the best results for the type of work I'm going to be doing.

I'm really surprised about brotli though, actually. I had high hopes for it, considering it's a new algorithm invented by Google. They claimed that it would provide xz-like compression with gzip-like speeds - but from what I'm seeing, it does anything but.