Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d account algorithms announcement architecture archives arduino artificial intelligence artix assembly async audio bash batch blog bookmarklet booting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features event experiment external first impressions future game github github gist graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet io.js jabber javascript js bin labs learning library linux low level lua maintenance manjaro network networking node.js operating systems performance photos php pixelbot portable privacy programming problems project projects prolog protocol protocols pseudo 3d python reddit reference release releases resource review rust secrets security series list server software sorting source code control statistics storage svg technical terminal textures three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

A comparison of compression formats for storing JSON

Happy new year, everyone!

I've blogged about different aspects of a (not so?) little project of mine several times now (exhibits A, B, C, D, and finally E - even if I didn't know it at the time), but it appears that I end up running into all sorts of interesting problems that I invent cool solutions for. I also find myself doing a bunch of research that I'm surprised nobody's compiled into a single place yet - as is the case in this blog post. (Also, Happy 2018 everyone! First post of the year :D)

I've been refactoring the subsystem that saves a considerable amount of JSON data to a bunch of different files on disk. Obviously, I'm interested in minimising the amount of space this JSON data takes up on disk. As this saving process happens in the background on a separate thread, I'm not too concerned about performance - other than it can't be too slow. With this in mind, I've found myself testing a bunch of different compression algorithms. Let's introduce our test data:

  1. A 17MiB card game dataset, as a single minified JSON file
  2. A 40KiB 'live' specimen chunk's data, saved as a single minified JSON file.

I can't remember which card game the first dataset is from, but I do know that I found it in the awesome JSON datasets list. Next up, here's our cast of compression algorithms we'll be testing:

  1. The venerable GZip
  2. BZip2 - Apparently GZIP, but smaller and slightly more computationally expensive
  3. XZ (the newer child of LZMA2)
  4. 7zip
  5. Google's Brotli

A colourful cast, to be sure! Let's run them through their paces - starting with the card game dataset. Here are the results I observe with each set to their default settings:

Format Size
Uncompressed 17M
gzip 2.4M
bzip2 1.6M
xz 1.3M
7zip 1.4M
brotli 1.3M

Very interesting. It looks like xz and brotli are tied in first place - though I observed that brotli took ages in comparison to all the other algorithms I tested - and upon closer inspection xz beat it by 17.3KiB. Numbers are all very well, but to really see what's going on here, let's plot it on a graph:

A graph of the data in the table above.

That's better! I can actually make some comparisons now. From this graph we can observe that gzip is the worst of the lot, followed by bzip2. 7zip is surprisingly in third place, but then again it is designed for multiple files, whereas the rest of them are designed for a single stream of data. In second place is the terribly slow brotli, and finally in first place is xz.

Hrm - very interesting. How do our algorithms measure up when confronted a smaller load though? Here are the results for the sample chunk data:

Format Size
Uncompressed 40K
gzip 5.4K
bzip2 4.3K
xz 4.6K
7zip 4.9K
Brotli 4.6K

Interesting results, to be sure, but I can't discern much from that. Let's plot a graph:

A graph of the data in the above table. Further explanation below.

Very interesting. With smaller loads, it appears that bzip2 performs much better with smaller loads than any other algorithm. While gzip is still the worst performing algorithm, while xz and brotli, surprisingly, performed much worse than bzip2.

To that end, I'm think I'm going to be choosing bzip2 as my compression of choice for this job, as it produces the best results for the type of work I'm going to be doing.

I'm really surprised about brotli though, actually. I had high hopes for it, considering it's a new algorithm invented by Google. They claimed that it would provide xz-like compression with gzip-like speeds - but from what I'm seeing, it does anything but.

Sources and Further Reading

Making Mathematical Art with C Sharp and PPM

The other day I wanted to (for some random reason) create some stripes. Having worked out a simple algorithm that would produce some rather nice stripes given an (x, y) pixel coordinate, I set out to write a small script that would generate some stripes for me.

I discovered that it wasn't as easy as I'd thought. Lockbits confused me, and I couldn't find a good enough example to learn from. Thankfully I caught wind of a ridiculously simple image format called PPM (Portable Pixel Map) that I could use to output a byte[] array of pixel data as a valid image.

Here's a diagram I made to illustrate the format:

The PPM Format.

The format basically consists of a ascii header, followed by a raw dump of a byte[] full of pixel data. The header contains several parts:

  1. The characters P6 (This is called the 'magic byte', and can be used to identify the type of content that a file contains)
  2. A single whitespace (I used \s in the diagram because that is the escape code for whitespace in a javascript regular expression)
  3. The width of the image, in ascii
  4. Another single whitespace
  5. The height of the image, in ascii
  6. Another single whitespace
  7. The maximum value that the red / green / blue pixels will go up to. This value will be considered 100% saturated. Normally, you'd want this to be 255.
  8. Another single whitespace - not shown on the diagram (oops); usually a new line (\n).
  9. The raw byte[] array of pixel data.

Once you've your pixel data as a PPM, you can then use something like imagemagick to convert it to a png with a command like mogrify -format png image.ppm or convert image.ppm image.png.

Some red stripes.

Using this method, you can generate almost anything using pure C#. Here's the code I used to generate the above stripes:

using System;
using System.IO;

public class EmptyClass
    #region Settings

    static string filename = "image.ppm";

    static int width = 1500;
    static int height = 400;

    static int stripeWidth = width / 30;
    static rgb stripeLowCol = new rgb(204, 0, 0);
    static rgb stripeHighCol = new rgb(255, 51, 51);

    static float multiplier = 1f;


    #region Image Generator

    public static void Main()
        byte[] pixelData = new byte[width * height * 3];
        for(int x = 0; x < width; ++x)
            for(int y = 0; y < height; ++y)
                int currentPixel = ((y * width) + x) * 3;
                pixelData[currentPixel] = redPixel(x, y);
                pixelData[currentPixel + 1] = greenPixel(x, y);
                pixelData[currentPixel + 2] = bluePixel(x, y);

        StreamWriter destination = new StreamWriter(filename);
        destination.Write("P6\n{0} {1}\n{2}\n", width, height, 255);
        destination.BaseStream.Write(pixelData, 0, pixelData.Length);


    #region Pixel value functions - edit these

    public static byte redPixel(int x, int y)
        return (byte)(((x + y) % stripeWidth < stripeWidth / 2 ? stripeLowCol.r : stripeHighCol.r) * multiplier);
    public static byte greenPixel(int x, int y)
        return (byte)(((x + y) % stripeWidth < stripeWidth / 2 ? stripeLowCol.g : stripeHighCol.g) * multiplier);
    public static byte bluePixel(int x, int y)
        return (byte)(((x + y) % stripeWidth < stripeWidth / 2 ? stripeLowCol.b : stripeHighCol.b) * multiplier);


#region Utility Classes
class rgb
    public byte r, g, b;
    public rgb(byte inCol)
        r = g = b = inCol;
    public rgb(byte inR, byte inG, byte inB)
        r = inR;
        g = inG;
        b = inB;

The settings at the top control the appearance of the output. filename is the filename to write the image to, width and height set the dimensions of the image, stripeWidth sets the width in pixels of each stripe, and stripeLowCol and stripeHighCol set the colour of the different stripes. The multiplier at the end isn't actually needed, but you can use it to brighten or dim the resulting image if you want.

Not content with stripes, I played around for a bit longer and came up with this:

Another mathematical picture.

Above: My second attempt at mathematical art. It looks better in my native image previewer...

The above actually consists of a 3 different functions - one for each channel. Here they are:

public static byte redPixel(int x, int y)
    return (byte)(Math.Sin(x / (width / (Math.PI * 10))) * 255 * Math.Sin(y / (height / (Math.PI*10))));
public static byte greenPixel(int x, int y)
    return (byte)(Math.Sin(x / (width / (Math.PI * 5))) * 128 * Math.Sin(y / (height / (Math.PI * 5))));
public static byte bluePixel(int x, int y)
    return (byte)((Math.Sin(x / (width / Math.PI)) * 52 * Math.Sin(y / (height / Math.PI))) + 25);

I don't actually know how it works (even though I wrote it strangely enough), but if you do know, please leave a comment down below!

Since it might be a bit difficult to see, here's an animated gif that shows each of the colour channels broken down:

Illusion Boxes Decoded

Lastly, I have rendered a larger copy of the above. You can view it here (Size: 4.8MB).

Have you made some interesting mathematical art? Post it in the comments below!

A Distributed solution to /r/dailyprogrammer Challenge #223

A little while ago, I wrote my first distributed computing program to solve the second optional challenge to /r/dailyprogrammer's #223rd problem. I wrote it in Javascript, because I know really know how to utilise the tcp/ip networking stack in any other language yet :(

Anyway, if you want to check out my source code, I have put it up on GitHub. Below I will explain how you can get started with the pair for scripts I wrote, and how they work

Getting Started

Getting started is easy. Just clone the repository & cd into the new directory:

git clone
cd dailyprogrammer-223

Then install the dependency:

npm install

Then you can start a server like so:

node server.js 4321 # starts a new server on port 4321

Or you can start a client like this:

node client.js

Note that I have used some ES6 features (check out my ES6: Features series), so if you are using Node.js and not io.js, you will need to tack on the --harmony flag in order to get them to work.

It works by mapping the search space (aaaaa to zzzzz) to a set of numbers, starting at 0. I used an ES6 generator to keep track of where we have got up to, wrapped in a function that handles reallocating work units that haven't been completed within a certain time limit. Each work unit consist of 16 words by default (though it can be changed), with the time limit set to 5 times the number of words in the block.

Next time, I will write the client so that it is compatible with a conventional browser - this should make it much easier for people to contribute their CPU time!

Art by Mythdael