## Summer Project Part 2: Random Number Analysis with Gnuplot

In my last post about my Masters Summer Project, I talked about my plans and what I'm doing. In this post, I want to talk about random number generator evaluation.

As part of the Arduino-based Internet of Things device that will be collecting the data, I need to generate high-quality random numbers in order to ensure that the unique ids I use in my project are both unpredictable and unique.

In order to generate such numbers, I've found a library that exploits the jitter in the inbuilt watchdog timer that's present in the Arduino Uno. It's got a manual which is worth a read, as it explains the concepts behind it quite well.

After some experimenting, I ended up with a program that would generate random numbers as fast as it could:

// Generate_Random_Numbers - This sketch makes use of the Entropy library
// to produce a sequence of random integers and floating point values.
// to demonstrate the use of the entropy library
//
// Copyright 2012 by Walter Anderson
//
// This file is part of Entropy, an Arduino library.
// Entropy is free software: you can redistribute it and/or modify
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// Entropy is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with Entropy.  If not, see <http://www.gnu.org/licenses/>.
//
// Edited by Starbeamrainbowlabs 2019

#include "Entropy.h"

void setup() {
Serial.begin(9600);

// This routine sets up the watch dog timer with interrupt handler to maintain a
// pool of real entropy for use in sketches. This mechanism is relatively slow
// since it will only produce a little less than two 32-bit random values per
// second.
Entropy.initialize();
}

void loop() {
uint32_t random_long;
random_long = Entropy.random();
Serial.println(random_long);
}

As you can tell, it's based on one of the examples. You may need to fiddle around with the imports in order to get it to work, because the Arduino IDE is terrible.

With this in place and uploaded to an Arduino, all I needed to do was log the serial console output to a file. Thankfully, this is actually really quite easy on Linux:

screen -S random-number-generator dd if=/dev/ttyACM0 of=random.txt bs=1

Since I connected the Arduino in question to a Raspberry Pi I have acting as a file server, I've included a screen call here that ensures that I can close the SSH session without it killing the command I'm executing - retaining the ability to 'reattach' to it later to check on it.

With it set off, I left it for a day or 2 until I had at least 1 MiB of random numbers. Once it was done, I ended up with a file looking a little like this:

216767155
986748290
455286059
1956258942
4245729381
3339111661
1821899502
3892736709
3658303796
2524261768
732282824
999812729
1312753534
2810553575
246363223
4106522438
260211625
1375011617
795481000
319056836

In total, it generated 134318 numbers for me to play with, which should be plenty to graph their distribution.

Graphing such a large amount of numbers requires a special kind of program. Since I've used it before, I reached for Gnuplot.

A histogram is probably the best kind of graph for this purpose, so I looked up how to get gnuplot to draw one and tweaked it for my own purposes.

I didn't realise that you could do arbitrary calculations inside a Gnuplot graph definition file, but apparently you can. The important bit below is the bin_width variable:

set key off
set border 3

# Add a vertical dotted line at x=0 to show centre (mean) of distribution.
#set yzeroaxis

# Each bar is half the (visual) width of its x-range.
#set boxwidth 0.05 absolute
#set style fill solid 1.0 noborder

bin_width = 171797751.16;

bin_number(x) = floor(x / bin_width)

rounded(x) = bin_width * ( bin_number(x) + 0.5 )

set terminal png linewidth 3 size 1920,1080

plot 'random.txt' using (rounded($1)):(1) smooth frequency with boxes It specifies the width of each bar on the graph. To work this out, we need to know the maximum number in the dataset. Then we can divide it by the target number of bins to get the width thereof. Time for some awk! awk 'BEGIN { a = 0; b=999999999999999 } { if ($1>0+a) a=$1; if ($1 < 0+b) b=$1; } END { print(a, b); }' <random.txt This looks like a bit of a mess, so let's unwind that awk script so that we can take a better look at it. BEGIN { max = 0; min = 999999999999999 } { if ($1 > max)
max = $1; if ($1 < min)
min = \$1;
}
END {
print("Max:", max, "Min:", min);
}

Much better. In short, it keeps a record of the maximum and minimum numbers it's seen so far, and updates them if it sees a better one. Let's run that over our random numbers:

awk -f minmax.awk 

Excellent. 4294943779 ÷ 25 = 171797751.16 - which is how I arrived at that value for the bin_width earlier.

Now we can render our graph:

gnuplot random.plt >histogram.png && optipng -o7 histogram.png

I always optimise images with either optipng or jpegoptim to save on storage space on the server, and bandwidth for readers - and in this case the difference was particularly noticeable. Here's the final graph:

As you can see, the number of numbers in each bin is pretty even, so we can reasonably conclude that the algorithm isn't too terrible.

What about uniqueness? Well, that's much easier to test than the distribution. If we count the numbers before and after removing duplicates, it should tell us how many duplicates there were. There's even a special command for it:

wc -l <random.txt
134318
sort <random.txt | uniq | wc -l
134317
sort <random.txt | uniq --repeated
1349455381

Very interesting. Out of ~134K numbers, there's only a single duplicate! I'm not sure whether that's a good thing or not, as I haven't profiled any other random number generated in this manner before.

Since I'm planning on taking 1 reading a minute for at least a week (that's 10080 readings), I'm not sure that I'm going to get close to running into this issue - but it's always good to be prepared I guess......

Found this interesting? Got a better way of doing it? Comment below!