## Stardust Blog

Archive

Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

## Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

## PhD Update 4: Ginormous Data

Hello again! In the last PhD update blog post, I talked about patching HAIL-CAESAR to improve performance and implementing a Temporal Convolutional Neural Net (Temporal CNN).

Since making that post, I've had my PhD Panel 1 (very useful, thanks to everyone who was on that panel!). I've also got an initial - albeit untested - implementation of a Temporal CNN. I've also been wrangling lots of data in more ways than one. I'm definitely seeing the Big Data aspect of my project title now.

### HAIL-CAESAR

I ran HAIL-CAESAR initially at 50m per pixel. This went ok, and generated lots of data out, but in 2 weeks of real time it barely hit 43 days worth of simulation time! The other issue I discovered due to the way I compressed the output of HAIL-CAESAR, for some reason it compressed the output files before HAIL-CAESAR had finished writing to them. This resulted in the data being cut off randomly in the output files for each time step.

Big problem - clearly another approach is in order.

To tackle these issues, I've done several things. Firstly, I patched HAIL-CAESAR again to support writing the output water depth files to the standard output. As a refresher, they are actually identical in format to the heightmap, which looks a bit like this:

ncols 4
nrows 3
xllcorner 400000
yllcorner 300000
cellsize 1000
1 2 3 4
1 1 2 3
0 1 1 2

The above is a 4x3 grid of points, with the bottom-left corner being at (400000, 300000) on the Ordnance Survey National Grid (I know, latitude / longitude would be so much better, but all the data I'm working with is on the OS national grid :-/). Each point represents a 1km square area.

To this end, I realised that it doesn't actually matter if I concatenate multiple files in this format together - I can guarantee that I can tell them apart. As soon as I detect a metadata line that the current file has already declared, then I know that the next file is starting and we're starting to read the next file along. To this end, I implemented a new Terrain50.ParseStream() function that is an async generator that will take a stream, and then iteratively yield the Terrain50 instances it parses out of the stream. In this way, I can split 1 big continuous stream back up again into the individual parts.

By patching HAIL-CAESAR such that it outputs the data in 1 continuous stream, it also means that I can pipe it to a single compression program. This has 2 benefits:

• It avoids the "compressing the individual files before HAIL-CAESAR is ready" problem (the observant might note that inotifywait would solve this issue neatly too, but it isn't installed on Viper)
• It allows for more efficient compression, as the compression program can use data from other time step files as context

Finding a compression tool was next. I wanted something CPU efficient, because I wanted to ensure that the maximum number of CPU cycles were dedicated to HAIL-CAESAR for running the simulation, rather than compressing the output it generates - since it is the bottleneck after all.

I ended up using lz4 in the end, an extremely fast compression algorithm. It compiles easily too, which is nice as I needed to compile it from source automatically on Viper.

With all this in place, I ran HAIL-CAESAR again 2 more times. The first run was at the same resolution as before, and generated 303 GiB (!) of data.

The second run was at 500m per pixel (10 times lower resolution), which generated 159 GiB (!) of data and, by my calculations, managed to run through ~4.3 years in simulation time in 5 days of real time. Some quick calculations suggest that to get through all 13 years of rainfall radar data I have it would take just over 11 days, so since I've got everything setup already, I'm going to be contacting the Viper administrators to ask about running a longer job to allow it to complete this process if possible.

### Temporal CNN Preprocessing

The other major thing I've been working on since the last post is the Temporal CNN. I've already got an initial implementation setup, and I'm currently in the process of ironing out all the bugs in it.

I ran into a number of interesting bugs. One of these was to do with incorrectly specifying the batch size (due to a typo), which resulted in the null values you may have noticed in the model summary in the last post. With those fixed, it looks much more sensible:

_________________________________________________________________
Layer (type)                 Output shape              Param #
=================================================================
conv3d_1 (Conv3D)            [32,2096,3476,124,64]     16064
_________________________________________________________________
conv3d_2 (Conv3D)            [32,1046,1736,60,64]      512064
_________________________________________________________________
conv3d_3 (Conv3D)            [32,521,866,28,64]        512064
_________________________________________________________________
pooling (AveragePooling3D)   [32,521,866,1,64]         0
_________________________________________________________________
reshape (Reshape)            [32,521,866,64]           0
_________________________________________________________________
conv2d_output (Conv2D)       [32,517,862,1]            1601
_________________________________________________________________
reshape_end (Reshape)        [32,517,862]              0
=================================================================
Total params: 1041793
Trainable params: 1041793
Non-trainable params: 0
_________________________________________________________________

This model is comprised of the following:

• 3 x 3D convolutional layers
• 1 x pooling layer to average out the temporal dimension
• 1 x reshaping layer to remove the redundant dimension
• 1 x 2D convolutional layer that will produce the output
• 1 x reshaping layer to remove another redundant dimension

I'll talk about this model in more detail in a future post in this series once I've managed to get it running and I've played around with it a bit.

Another significant one I ran into was to do with stacking tensors like an image. I ended up asking on Stack Overflow: How do I reorder the dimensions of a rank 3 tensor in Tensorflow.js?

The input to the above model is comprised of a sliding window that moves along the rainfall radar time steps. Each time step contains a 2D array, representing the amount of rain that has fallen in a given area. This needs to be combined with the heightmap, so that the AI model knows what the terrain that the rain is falling on looks like.

The heightmap doesn't change, but I'm including a copy of it with every rainfall radar time step because of the way the 3D convolutional layer works in Tensorflow.js. 2D convolutional layers in Tensorflow.js, for example, take in a 2D array of data as a tensor. They can also take in multiple channels though, much like pixels in an image. The pixels in an image might look something like this:

R1 G1 B1 A1 R2 G2 B2 A2 R3 G3 B3 A3 .....

As you might have seen in the Stack Overflow answer I linked to above, Tensorflow.js does support stacking multiple 2D tensors in this fashion. It is unfortunately extremely slow however. It is for this reason that I've been implementing a multi-process program to preprocess the data to do this stacking in advance.

As I'm writing this though, I've finally understood what the dataFormat option is for in the conv3d and conv2d layers is for, and I think I might have been barking up the wrong tree......

### What's next

From here, I'm going to investigate that dataFormat option for the TemporalCNN - it would hugely simplify my setup and remove the need for me to preprocess the data beforehand, since stacking tensors directly 1 after another is very quick - it's just stacking them along a different dimension that's slow.

I'm also hoping to do a longer run of that 500m per pixel HAIL-CAESAR simulation. More data is always good, right? :P

After I've looked into the dataFormat option, I'd really like to get the Temporal CNN set off training and all the bugs ironed out. I'm so close I can almost taste it!

Finally, if I have time, I want to go looking for a baseline model of sorts. By this, I mean an existing model that's the closest thing to the task I'm doing - even though they might not be as performant or designed for my specific task.

Found this interesting? Got a suggestion of something I could do better? Confused about something I've talked about? Comment below!

## PhD Aside: Reading a file descriptor line-by-line from multiple Node.js processes

Phew, that's a bit of a mouthful. We're taking a short break from the cluster series of posts (though those will be back next week I hope), because I've just run into a fascinating problem, the solution to which I thought I'd share here - since I didn't find a solution elsewhere on the web.

For my PhD, I've got a big old lump of data, and it all needs preprocessing before I train an AI model (or a variant thereof, since I'm effectively doing video-to-image translation). Unfortunately, one of the preprocessing steps is really slow. And because I'll naturally be training my AI for multiple epochs, the problem is multiplied.....

The solution, of course, is to do all the preprocessing up front such that I can just read the data in and push it directly into a Tensor in the right format. However, doing this on such a large dataset would take forever if I did the items 1 by 1. The thing is that Javascript isn't inherently multithreaded. I like this quote, as it describes the situation rather well:

In Javascript everything runs in parallel... except your code

In other words, when Node.js is reading or writing to and from the network, disk, or other places it can do lots of things at the same time because it does them asynchronously. The Javascript that gets executed though is only done on a single thread though.

This is great for io-bound tasks (such as a web server), as Node.js (a Javascript runtime) can handle many requests at the same time. On a side note, this is also the reason why Nginx is more efficient than Apache (because Nginx is event based too like Javascript, unlike Apache which is thread based).

It's not so great though for CPU bound tasks, such as the one I've got on my hands. All is not lost though, because Node.js has a number of useful functions inbuilt that we can use to tackle the issue.

Firstly, Node.js has a clever forking system. By using child_process.fork(), a single Node.js process can create multiple copies of itself to act as workers:

// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i < os.cpus().length; i++) {
workers.push(
child_process.fork("worker.mjs")
);
}
// worker.js
console.log(Hello, world from a child process!);

Very useful! The next much more sticky problem though is how to actually preprocess the data in a performant manner. In my specific case, I'm piping the data in from a shell script that decompresses a number of gzip archives in a specific order (as of the time of typing I have yet to implement this).

Because this is a single pipe we're talking about here, the question now arises of how to allow all the child processes to access the data that's coming in from the standard input of the master process.

I've actually encountered an issue like this one before. I initially tried reading it in on the master process, and then using worker.send(message) to send it to the worker processes for processing. This didn't end up working very well, because the master process became a bottleneck as it couldn't read from the standard input and send stuff to the workers fast enough.

With this in mind, I came up with a new plan. In Node.js, when you're forking to create a worker process, you can supply it with some custom file descriptors upon initialisation. So long as it has at least IPC (inter-process communication) channel for passing messages back and forth with the .send() and .on("message", (message) => ....) method and listeners, it doesn't actually care what you do with the others.

Cue file descriptor cloning:


// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i 

I've highlighted the key line here (line 10 for those who can't see it). Here we tell it to clone file descriptors 0, 1, and 2 - which refer to stdin, stdout, and stderr respectively. This allows the worker processes direct access to the master process' stdin, stdout, and stderr.

With this, we can read from the same pipe with as many worker processes as we like - so long as they do so 1 at a time.

With this sorted, it gives rise to the next issue: reading line-by-line. Packages exist on npm (such as nexline, my personal favourite) to read from a stream line-by-line, but they have the unfortunate side-effect of maintaining a read buffer. While this is great for performance, it's not so great in my situation because it ends up scrambling the input! This is because said read buffer would be local to each worker process, so when the next worker along reads, it will skip a random number of bytes and start reading from the next bit along.

This means that I need to implement a custom method that reads a single line from a given file descriptor without maintaining a read buffer. I came up with this:

import fs from 'fs';

//  .....

// Global buffer to avoid unnecessary memory churn
let buffer = Buffer.alloc(4096);
let i = 0;
while(true) {
let bytes_read = fs.readSync(fd, buffer, i, 1);
if(bytes_read !== 1 || buffer[i] == 0x0A) {
if(i == 0 && bytes_read == null) return null;
return buffer.toString("utf-8", 0, i); // This is not inclusive, so we can abuse it to trim the \n off the end
}

i++;
if(i == buffer.length) {
let new_buffer = new Buffer(Math.ceil(buffer.length * 1.5));
buffer.copy(new_buffer);
buffer = new_buffer;
}
}
}

I read from the given file descriptor character by character directly into a buffer. As soon as it detects a new line character (\n, or character code 0x0A), it returns the new line. If we run out of space in the buffer, then we create a new larger one, copy the old buffer's contents into it, and keep going.

I maintain a global buffer here, because this helps to avoid unnecessary memory churn. In my case, the lines I'm reading in a rather long (hence the need to clone the file descriptor in the first place), and if I didn't keep a shared buffer I'd be allocating and deallocating a new pretty large buffer every time.

This also has the nice side-effect that we keep the largest buffer we've had to use so far around for next time, avoiding the need for subsequent copies to larger and larger buffers.

Finally, we can also guarantee that it won't be a problem if we call this multiple times, because as I explained above Javascript is single-threaded, so if we call the function multiple times in quick succession each read will happen 1 after another.

With this chain of Node.js features, we can read a large amount of data from and efficiently process the content of a pipe. The trick from here is to implement a proper messaging and locking system to avoid reading from the stream at the same time, and avoid write to the standard output at the same time.

Taking this further, I ended up with this:

(Licence: Mozilla Public Licence 2.0)

This correctly ensures that only 1 worker process reads from the stream at the same time. It doesn't do anything with the result though except log a message to the console, but when I implement that I'll implement a similar messaging system to ensure that only 1 process writes to the output at once.

On that note, my data is also ordered, so I'll have to implement a complicated cache system // ordering system to ensure that I write them to the standard output in the same order I read them in. When I do implement that, I'll probably blog about that too....

The main problem I still have with this solution is that I'm reading from the input stream. I haven't done any proper testing, but I'm pretty sure that doing so will be really slow. I not sure I can avoid this though and read a few KiBs at a time, because I don't currently know of any way to put the extra characters back into the input stream.

If anyone has a solution to that that increases performance, I'd love to know. Leave a comment below!

## PhD Update 3: Simulating simulations with some success

Hey there! Welcome to another PhD update blog post. The last time I posted, I was still working away at getting the rainfall radar data downloader working as intended.

Thankfully, since then I've managed to get it to complete (wow, that too much longer than expected) - and I've now turned my attention to running the physics-based simulation, and beginning to implement the AI(s) that will (hopefully) implicitly learn the parameters of the model in question.

### Physics-based simulation patching

Getting to this point, as you might imagine, wasn't quite as straight-forward as I initially thought. The physics-based model I'm (currently) using is HAIL-CAESAR, a (supposedly) high-performance version of CAESAR-Lisflood (yes, it's SourceForge shudder). Unfortunately, the format for the rainfall data that it takes in is especially space inefficient - after writing a converter, I found that my 4.5GiB compress JSON stream files (1 per year) would have turned into about 66GB of uncompressed ASCII! Theoretically speaking by my calculations. I don't have that much disk space free - so clearly another approach is in order.

This approach I speak of is convincing HAIL-CAESAR to take the data in via the standard input. I initially tried using a FIFO (also known as a named pipe), but I ran into this bug in Node.js.

HAIL-CAESAR by default doesn't support taking the data in on the standard input though, so I had to patch HAIL-CAESAR to add support. I did this by getting it to interpret the filename - to mean "use the standard input instead", which from my previous experiences seem to be an unofficial convention that lots of other programs follow. Perhaps at some point soon I should consider contributing my patch back to HAIL-CAESAR for others to enjoy.

### Heightmap tweaking

With that sorted, I also had to mess around with the heightmap (I got this through my University's "Digimap" service thingy) I obtained to get it to be precisely the same size as the rainfall radar data I have.

It turned out that the service I got the heightmap from isn't smart enough to give you precisely the bit you asked for - instead giving you just the tiles that overlap the area you specify. In the end I found myself with ~170 separate tiles (some of which I had to get after the fact because I found I was missing some) - so I ended up implementing a program to stitch them all back together again.

That program ended up turning out much more complete as a separate whole than I thought it would. I'm pretty sure that these heightmap files I've been dealing with are in a standard format, but I'm not aware of its name (if you know, I'd love to hear from you - post a comment below!). It's for these reasons that I ended up releasing it as a pair of packages on npm.

You can find them here:

I'll probably make a separate blog post about them at some point soon. For the curious, the API docs (there's a link in the README of the library package too) are automatically updated with my Laminar CI setup :D

### Tensor trouble

It is with a considerable amount of anticipation that I'm finally reaching the really interesting part of this experiment. This week, I've started work on implementing a Temporal Convolutional Neural Network (see also this paper). A Temporal CNN is a network type I discovered recently that takes advantage of multiple 3-dimensional CNN layers to allow a CNN-based model to learn temporal-based relationships in a dataset.

I'm not sure how well it's going to work on my particular dataset, given that the existing papers I've found on it use it for classification-based tasks, but I'm pretty hopeful that, with some tweaking, it should perform pretty well. While I haven't yet finished writing up the dataset input logic, I have implemented the core model using the Tensorflow.js layers API:

In the end I've decided to give Tensorflow.js another go (I don't think I mentioned it, but it attempted to use it for my Master's summer project, but it didn't work out so well), since I realised that I've implemented a good portion of the data processing code in Javascript (Node.js) already (as mentioned above). Interestingly, HAIL-CAESAR spits out files in the same format as the heightmap I've been working with, which makes processing even easier!

### What's next

From here, I intend to finish up my Temporal CNN implementation and get it running on the data I have so far from the HAIL-CAESAR model (which isn't unfortunately a lot - so far I've only got ~8K 5-minute time-steps worth of output which, if I'm calculating correctly, is just 29 days worth of simulation). I'm probably going to have to swap HAIL-CAESAR out at some point though, because it's really slow. Or perhaps I just don't know how to use it properly (maybe I should find someone more experienced with it and ask them first).

Anyway, I'm also going to try implementing a model inspired by the Google rainfall radar nowcasting paper I mentioned in my last post in this series. With both of these implemented, I can start to compare them and see which one is better suited for the task of flood prediction. I might even implement the Grid LSTM model I saw too.

In addition, I have my PhD panel 1 review coming up soon too - so apparently I've got a list of things I need to do to prepare for that - including writing a ~5K word report. I'll probably do this pretty soon - I don't want to be rushing it at the last minute.

Found this interesting? Got a suggestion? Want to say hi? Comment below!

## PhD Update 2: The experiment, the data, and the supercomputers

Welcome to another PhD update post. Since last time, a bunch of different things have happened - which I'll talk about here. In particular, 2 distinct strands have become evident: The reading papers and theory bit - and the writing code and testing stuff out bit.

At the moment, I'm focusing much more heavily on the writing code and experimental side of things, as I've recently gained access to the 1km resolution rainfall radar dataset from CEDA. While I'm not allowed to share any data that I've now got, I'm pretty sure I'm safe to talk about how terribly it's formatted.

The data itself is with 1 directory per year, and 1 file per day inside those directories. Logical so far, right? Each of those files is a tar archive, inside which are the binary files that contain the data itself - with 1 file for every 5 minutes. These are compressed with gzip - which seems odd since they could probably gain greater compression ratios if they were tared first and compressed second (then the compression algorithm would be able to compress based on the similarities between files too).

The problems arise when you start to parse out the binary files themselves. They are in a propriety format - which has 3 different versions that don't conform to the (limited) documentation. To this end, it's been proving somewhat of a challenge to parse them and extract the bits I'm interested in.

To tackle this, I've been using Node.js and a bunch of libraries from npm (noisy pirate mutiny? nearest phase modulator? nasty popsicle machine? nah, it's probably node package manager):

• binary-parser - For parsing the binary files themselves. Allows you to define the format of the file programmatically, and it'll parse it out into a nice object you can then manipulate.
• gunzip-maybe - A streaming library that unzips a gzip-compressed stream
• @icetee/ftp - An FTP client library for downloading the files (I know that FTP is insecure, that's all they offer at this time :-/)
• tar-stream - For parsing tar files
• nnng - Stands for No! Not National Grid!. helps with the conversion between OS national grid references and regular longitude latitude.

Aside from the binary file format, I encountered 3 main issues:

1. The data is only a rectangle when using ordnance survey national grid references
2. There's so much data, it needs to be streamed from the remote server
3. Generating a valid gzip file is harder than you expect

Problem 1 here took me a while to figure out. Since as I mentioned the documentation is rather limited, I spent much longer than I would have liked attempting to parse the data in latitude longitude and finding it didn't work.

Problem 2 was rather interesting. Taking a cursory glance over the data before hand revealed that each daily tar file was about 80MiB - and with roughly 5.7K days worth of data (the dataset appears to go back to May 2004-ish), it quickly became clear that I couldn't just download them all and process them later.

It is for this reason that I chose Node.js in the first place for this. For those who haven't encountered it before, it's Javascript for the server - and it's brilliant for 2 main use-cases: networking and streaming data. Both of which were characteristics of the problem at hand - so the answer was obvious.

I'm still working on tweaking and improving my final solution, but as it stands after implementing the extractor on it's own, I've also implemented a wrapper that streams the tar archives from the FTP server, stream-reads the tar archives, streams the files in the tar archives into a gzip decompressor, parses the result, and then streams the interesting bits to disk as a disk object via a gzip compressor.

That's a lot of streams. The great part about this is that I don't accidentally end up with huge chunks of binary files in memory. The only bits that can't be streamed is the binary file parser and the bit that extracts the interesting bits.

I'm still working on the last issue, but I've been encountering nasty problems with the built-in zlib gzip compressor transformation stream. When I send a SIGINT (Ctrl + C) to the Node.js process, it doesn't seem to want to finish writing the gzip file correctly - leading to invalid gzip files with chunks missing from the end.

Since the zlib gzip transformation stream is so badly documented, I've ended up replacing it with a different solution that spawns a gzip child process instead (so you've got to have gzip installed on the machine you're running the script on, which shouldn't be a huge deal on Linux). This solution is better, but still requires some tweaks because it transpires that Node.js automatically propagates signals it receives to child processes - before you've had a chance to tie up all your loose ends. Frustrating.

Even so, I'm hopeful that I've pretty much got it to a workable state for now - though I'll need to implement a daemon-type script at some point to automatically download and process the new files as they are uploaded - it is a living dataset that's constantly being added to after all.

### Papers

The other strand (that's less active at the minute) is reading papers. Last time, I mentioned the summary papers I'd read, and the direction I was considering reading in. Since then, I've both read a number of new papers and talked to a bunch of very talented people on-campus - so I've got a little bit of a better idea as to the direction I'm headed in now.

Firstly, I've looked into a few cutting-edge recurrent neural network types:

• Grid LSTMs - Basically multi-dimensional LSTMs
• Diluted LSTMs - Makes LSTMs less computationally intensive and better at learning long-term relationships
• Transformer Neural Networks - more reading required here
• NARX Networks

Many of these recurrent neural network structures appear to show promise for mapping floods. The last experiment into a basic LSTM didn't go too well (it turned out to be hugely computationally expensive), but learning from that experiment I've got lots of options for my next one.

A friend of mine managed to track down the paper behind Google's AI blog post - which turned out to be an interesting read. It transpires that despite the bold words in the blog post, the paper is more of an initial proposal for a research project - rather than a completed project itself. Most of the work they've done is actually using a traditional physics-based model - which they've basically thrown Google-scale compute power at to make pretty graphs - which they've then critically evaluated and identified a number of areas in which they can improve. They've been a bit light on details - which is probably because they either haven't started - or don't want to divulge their sekrets.

I also saw another interesting paper from Google entitled "Machine Learning for Precipitation Nowcasting from Radar Images", which I found because it was reported on by Ars Technica. It describes the short-term forecasting of rain from rainfall radar in the US (I'm in the UK) using a convolutional neural network-based model (specifically U-Net, apparently - I have yet to read up on it).

The model they use is comprised in part of convolutional neural network (CNN) layers that downsample 256x256 tiles to a smaller size, and then upscale it back to the original size. It has some extra connections that skip part of the model too. They claim that their model manages to improve on existing approaches for up to 6 hours in advance - so their network structure seems somewhat promising as inspiration for my own research.

Initial thoughts include theories as to whether I can use CNN layers like this to sandwich a more complex recurrent layer of some description that remembers the long-term relationships? I'll have to experiment.....

Found this interesting? Got a suggestion? Confused on a point? Comment below!

## PhD Update 1: Directions

Welcome to my first PhD update post. I intend to post these at bimonthly intervals. In the last post, I talked a bit about my PhD project that I'm doing and my initial thoughts. Since then, I've done heaps of investigation into a number of different potential directions I could take the project. For reference, my PhD title is actually as follows:

Using the Internet of Things, Big Data, and AI to dynamically map flood risk.

There are 3 main elements to this project:

• Big Data
• Artificial Intelligence (AI)
• The Internet of Things (IoT)

I'm pretty sure that each of them will have an important role to play in the final product - even if I'm not sure what those roles are just yet :P

Particularly of concern at the moment is this blog post by Google. It talks about they've managed to significantly improve flood forecasting with AI along with a seriously impressive visualisation to back it up - but I can't find a paper on it anywhere. I'm concerned that anything I try to do in the area won't be useful if they are already streets ahead of everyone else like that.

I guess one of the strong points I should try to hit is the concept of explainable AI if possible.

### All the data sources!

As it stands right now, I'm currently evaluating various different potential data sources that I've managed to gain access to. My aim here is to evaluate how useful they will be in solving the wider problem - and whether they are useful enough to be worth investigating further.

#### Environment Agency

Some great people from the environment agency came into University recently to chat with us about what they did. The discussion we had was very interesting - but they also asked if there was anything they could do to help our PhD projects out.

Seeing the opportunity, I jumped at the chance to get a hold of some of their historical datasets. They actually maintain a network of high-quality sensors across the country that monitor everything from rainfall to river statistics. While they have a real-time API that you can use to download recent measurements, it doesn't appear to go back further than March 2017. To this end, I asked for data from 2005 up to the end of 2017, so that I could get a clearer picture of the 2007 and 2013 floods for AI training purposes.

So far, this dataset has proved very useful at least initially as a testbed for training various kinds of AI as I learn PyTorch (see my recent post for how that has been going - I've started with a basic LSTM first. For reference, an LSTM is a neural network architecture that is good at processing time-series data - but is quite computationally expensive to run.

#### Met Office

I've also been investigating the datasets that the Met Office provide. These chiefly appear to be in the form of their free DataPoint API. Particularly of interest are their rainfall radar images, which are 500x500 pixels and are released every 15 minutes. Sadly they are only available for a few hours at best, so you have to grab them fast if you want to be able to analyse particularly interesting ones later.

Annoyingly though, their API does not appear to give any hints as to the bounding boxes of these images - and neither can I find any information about this online. I posted in their support forum, but it doesn't appear that anyone actually monitors it - so at this point I suspect that I'm unlikely to receive a response. Without knowing the (lat, lng) co-ordinates of the images produced by the API, they are little more use than pretty wall art.

#### Internet of Things

On the Internet of Things front, I'm already part of Connected Humber, which have a network of sensors setup that are monitoring everything from air quality to temperature, humidity, and air pressure. While these things aren't directly related to my project, the dataset that we're collecting as a group may very well come in handy as an input to a model of some description.

I'm pretty sure that I'll need to setup some additional custom sensors of my own at some point (probably soonish too) to collect the measurement readings that I'm missing from other pre-existing datasets.

### Reading a library

Whilst I've been doing this, I've also been reading up a storm. I've started by reading into traditional physics-based flood modelling simulations (such as caesar-lisflood) - which appear to fall into a number of different categories, which also have sub-categories. It's quite a rabbit hole - but apparently I'm diving all the way down to the very bottom.

The most interesting paper on this subject I found was this one from 2017. It splits physics-based models up into 3 categories:

• Empirical models (i.e. ones that just display sensor readings, calculate some statistics, and that's about it)
• Hydrodynamic models - the best-known models that simulate water flow etc - can be categorised as either 1D, 2D, or 3D - also very computationally expensive - especially in higher dimensions
• Simplified conceptual models - don't actually simulate water flow, but efficient enough to be used on large areas - also can be quite inaccurate with complex terrain etc.

As I'm going to be using artificial intelligence as the core of my project, it quickly became evident that this is just stage-setting for the actual kind of work I'll be doing. After winding my way through a bunch of other less interesting papers, I found my way to this paper from 2018 next, which is similar to the previous one I linked to - just for AI and flood modelling.

While I haven't yet had a chance to follow up on all the interesting papers referenced, it has a number of interesting points to keep in mind:

• Artificial Intelligences need lots of diverse data points to train well
• It's important to measure a trained network's ability to generalise what it's learnt to other situations it hasn't seen yet

The odd thing about this paper is that it claims that regular neural networks were better than recurrent neural network structures - despite the fact that it is only citing a single old 2013 paper (which I haven't yet read). This led me on to read a few more papers - all of which were mildly interesting and had at least something to do with neural networks.

I certainly haven't read everything yet about flood modelling and AI, so I've got quite a way to go until I'm done in this department. Also of interest are 2 newer neural network architectures which I'm currently reading about:

### Next steps

I want to continue to read about the above neural networks. I also want to implement a number of the networks I've read about in PyTorch to continue to learn the library.

Lastly, I want to continue to find new datasets to explore. If you're aware of a dataset that I haven't yet talked about on here, comment below!

## Summer Project Part 5: When is a function not a function?

Another post! Looks like I'm on a roll in this series :P

In the last post, I looked at the box I designed that was ready for 3D printing. That process has now been completed, and I'm now in possession of an (almost) luminous orange and pink box that could almost glow in the dark.......

I also looked at the libraries that I'll be using and how to manage the (rather limited) amount of memory available in the AVR microprocessor.

Since last time, I've somehow managed to shave a further 6% program space off (though I'm not sure how I've done it), so most recently I've been implementing 2 additional features:

• An additional layer of AES encryption, to prevent The Things Network for having access to the decrypted data
• GPS delta checking (as I'm calling it), to avoid sending multiple messages when the device hasn't moved

After all was said and done, I'm now at 97% program space and 47% global variable RAM usage.

To implement the additional AES encryption layer, I abused LMiC's IDEETRON AES-128 (ECB mode) implementation, which is stored in src/aes/ideetron/AES-128_V10.cpp.

It's worth noting here that if you're doing crypto yourself, it's seriously not recommended that you use ECB mode. Please don't. The only reason that I used it here is because I already had an implementation to hand that was being compiled into my program, I didn't have the program space to add another one, and my messages all start with a random 32-bit unsigned integer that will provide a measure of protection against collision attacks and other nastiness.

Specifically, it's the method with this signature:

void lmic_aes_encrypt(unsigned char *Data, unsigned char *Key);

Since this is an internal LMiC function declared in a .cpp source file with no obvious header file twin, I needed to declare the prototype in my source code as above - as the method will only be discovered by the compiler when linking the object files together (see this page for more information about the C++ compilation process. While it's for regular Linux executable binaries, it still applies here since the Arduino toolchain spits out a very similar binary that's uploaded to the microprocessor via a programmer).

However, once I'd sorted out all the typing issues, I slammed into this error:

/tmp/ccOLIbBm.ltrans0.ltrans.o: In function transmit_send':
sketch/transmission.cpp:89: undefined reference to lmic_aes_encrypt(unsigned char*, unsigned char*)'
collect2: error: ld returned 1 exit status

Very strange. What's going on here? I declared that method via a prototype, didn't I?

Of course, it's not quite that simple. The thing is, the file I mentioned above isn't the first place that a prototype for that method is defined in LMiC. It's actually in other.c, line 35 as a C function. Since C and C++ (for all their similarities) are decidedly different, apparently to call a C function in C++ code you need to declare the function prototype as extern "C", like this:

extern "C" void lmic_aes_encrypt(unsigned char *Data, unsigned char *Key);

This cleaned the error right up. Turns out that even if a function body is defined in C++, what matters is where the original prototype is declared.

I'm hoping to release the source code, but I need to have a discussion with my supervisor about that at the end of the project.

Found this interesting? Come across some equally nasty bugs? Comment below!

## Summer Project Part 4: Threading the needle and compacting it down

In the last part, I put the circuit for the IoT device together and designed a box for said circuit to be housed inside of.

In this post, I'm going to talk a little bit about 3D printing, but I'm mostly going to discuss the software aspect of the firmware I'm writing for the Arduino Uno that's going to be control the whole operation out in the field.

Since last time, I've completed the design for the housing and sent it off to my University for 3D printing. They had some great suggestions for improving the design like making the walls slightly thicker (moving from 2mm to 4mm), and including an extra lip on the lid to keep it from shifting around. Here are some pictures:

(Left: The housing itself. Right: The lid. On the opposite side (not shown), the screw holes are indented.)

At the same time as handling sending the housing off to be 3D printed, I've also been busily iterating on the software that the Arduino will be running - and this is what I'd like to spend the majority of this post talking about.

I've been taking an iterative approach to writing it - adding a library, interfacing with it and getting it to do what I want on it's own, then integrating it into the main program.... and then compacting the whole thing down so that it'll fit inside the Arduino Uno. The thing is, the Uno is powered by an ATmega328P (datasheet). Which has 32K of program space and just 2K of RAM. Not much. At all.

The codebase I've built for the Uno is based on the following libraries.

• LMiC (the matthijskooijman fork) - the (rather heavy and needlessly complicated) LoRaWAN implementation
• Entropy for generating random numbers as explained in part 2
• TinyGPS, for decoding NMEA messages from the NEO-6M
• SdFat, for interfacing with microSD cards over SPI

### Memory Management

Packing the whole program into a 32K + 2K box is not exactly an easy challenge, I discovered. I chose to first deal with the RAM issue. This was greatly aided by the FreeMemory library, which tells you how much RAM you've got left at a given point in the execution of your program. While it's a bit outdated, it's still a useful tool. It works a bit like this:

#include <MemoryFree.h>;

void setup() {
Serial.begin(115200);
Serial.println(freeMemory, DEC);
char test[] = "Bobs Rockets";
Serial.println(freeMemory, DEC); // Should be lower than the above call
}

void loop() {
// Nothing here
}

It's worth taking a moment to revise the way stacks and heaps work - and the differences between how they work in the Arduino environment and on your desktop. This is going to get rather complicated quite quickly - so I'd advise reading this stack overflow answer first before continuing.

First, let's look at the locations in RAM for different types of allocation:

• Things on the stack
• Things on the heap
• Global variables

Unlike on the device you're reading this on, the Arduino does not support multiple processes - and therefore the entirety of the RAM available is allocated to your program.

Since I wasn't sure about preecisly how the Arduino does it (it's processor architecture-specific), I wrote a simple test program to tell me:

#include <Arduino.h>

struct Test {
uint32_t a;
char b;
};

Test global_var;

void setup() {
Serial.begin(115200);

Test stack;
Test* heap = new Test();

Serial.print(F("Stack location: "));
Serial.println((uint32_t)(&stack), DEC);

Serial.print(F("Heap location: "));
Serial.println((uint32_t)heap, DEC);

Serial.print(F("Global location: "));
Serial.println((uint32_t)&global_var, DEC);
}

void loop() {
// Nothing here
}

This prints the following for me:

Stack location: 2295
Heap location: 461
Global location: 284

From this we can deduce that global variables are located at the beginning of the RAM space, heap allocations go on top of globals, and the stack grows down starting from the end of RAM space. It's best explained with a diagram:

Now for the differences. On a normal machine running an operating system, there's an extra layer of abstraction between where things are actually located in RAM and where the operating system tells you they are located. This is known as virtual memory address translation (see also virtual memory, virtual address space).

It's a system whereby the operating system maintains a series of tables that map physical RAM to a virtual address space that the running processes actually use. Usually each process running on a system will have it's own table (but this doesn't mean that it will have it's own physical memory - see also shared memory, but this is a topic for another time). When a process accesses an area of memory with a virtual address, the operating system will transparently translate the address using the table to the actual location in RAM (or elsewhere) that the process wants to access.

This is important (and not only for security), because under normal operation a process will probably allocate and deallocate a bunch of different lumps of memory at different times. With a virtual address space, the operating system can defragment the physical RAM space in the background and move stuff around without disturbing currently running processes. Keeping the free memory contiguous speeds up future allocations, and ensures that if a process asks for a large block of contiguous memory the operating system will be able to allocate it without issue.

As I mentioned before though, the Arduino doesn't have a virtual memory system - partly because it doesn't support multiple processes (it would need an operating system for that). The side-effect here is that it doesn't defragment the physical RAM. Since C/C++ isn't a managed language, we don't get _heap compaction_ either like in .NET environments such as Mono.

All this leads us to an environment in which heap allocation needs to be done very carefully, in order to avoid fragmenting the heap and causing a stack crash. If an object somewhere in the middle of the heap is deallocated, the heap will not shrink until everything to the right of it is also deallocated. This post has a good explanation of the problem too.

Other things we need to consider are keeping global variables to a minimum, and trying to keep most things on the stack if we can help it (though this may slow the program down if it's copying things between stack frames all the time).

To this end, we need to choose the libraries we use with care - because they can easily break these guidelines that we've set for ourselves. For example, the inbuilt SD library is out, because it uses a global variable that eats over 50% of our available RAM - and it there's no way (that I can see at least) to reclaim that RAM once we're finished with it.

This is why I chose SdFat instead, because it's at least a little better at allowing us to reclaim most of the RAM it used once we're finished with it by letting the instance fall out of scope (though in my testing I never managed to reclaim all of the RAM it used afterwards).

Alternatives like µ-Fat do exist and are even lighter, but they have restrictions such as no appending to files for example - which would make the whole thing much more complicated since we'd have to pre-allocate the space for the file (which would get rather messy).

The other major tactic you can do to save RAM is to use the F() trick. Consider the following sketch:

#include

void setup() {
Serial.begin(115200);
Serial.println("Bills boosters controller, version 1");
}

void loop() {
// Nothing here
}

On the highlighted line we've got an innocent Serial.println() call. What's not obvious here is that the string literal here is actually copied to RAM before being passed to Serial.println() - using up a huge amount of our precious working memory. Wrapping it in the F() macro forces it to stay in your program's storage space:

Serial.println(F("Bills boosters controller, version 1"));

### Saving storage

With the RAM issue mostly dealt with, I then had to deal with the thorny issue of program space. Unfortunately, this is not as easy as saving RAM because we can't just 'unload' something when it's not needed.

My approach to reducing program storage space was twofold:

• Picking lightweight alternatives to libraries I needed
• Messing with flags of said libraries to avoid compiling parts of libraries I don't need

It is for these reasons that I ultimately went with TinyGPS instead of TinyGPS++, as it saved 1% or so of the program storage space.

It's also for this reason that I've disabled as much of LMiC as possible:

#define DISABLE_JOIN
#define DISABLE_PING
#define DISABLE_BEACONS
#define DISABLE_MCMD_DCAP_REQ
#define DISABLE_MCMD_DN2P_SET

This disables OTAA, Class B functionality (which I don't need anyway), receiving messaages, the duty cycle cap system (which I'm not sure works between reboots), and a bunch of other stuff that I'd probably find rather useful.

In the future, I'll probably dedicate an entire microcontroller to handling LoRaWAN functionality - so that I can still use the features I've had to disable here.

Even doing all this, I still had to trim down my Serial.println() calls and remove any non-essential code to bring it under the 32K limit. As of the time of typing, I've got jut 26 bytes to spare!

Next time, after tuning the TPL5110 down to the right value, we're probably going to switch gears and look at the server-side of things - and how I'm going to be storing the data I receive from the Arudino-based device I've built.

Found this interesting? Got a suggestion? Comment below!

## Summer Project Part 3: Putting it together

In the first post in this series, I outlined my plans for my Msc summer project and what I'm going to be doing. In the second post, I talked about random number generation for the data collection.

In this post, I'm going to give a general progress update - which will mostly centre around the Internet of Things device I'm building to collect the signal strength data.

Since the last post, I've got nearly all the parts I need for the project, except the TPL5111 power manager and 4 rechargeable AA batteries (which should be easy to come by - I'm sure I've got some lying around somewhere).

I've also wired the thing up, with a cable standing in for the TPL5111.

The power management board there technically doesn't need a breadboard, but it makes mounting it in the box easier.

I still need to splice the connector onto the battery box I had lying around with some soldering and electrical tape - I'll do that later this week.

The wiring there is kind of messy, but I've tested each device individually and they all appear to work as intended. Here's a clearer diagram of what's going on that drew up in Fritzing (sudo apt install fritzing for Linux users):

Speaking of mounting things in the box, I've discovered OpenSCAD thanks to help from a friend and have been busily working away at designing a box to put everything in that can be 3D printed:

I've just got the lid to do next (which I'm going to do after writing this blog post), and then I'm going to get it printed.

With this all done, it's time to start working on the transport for the messages - namely using LMIC to connection to the network and send the GPS location to the application server, which is also unfinished.

The lovely people at the hardware meetup have lent me a full 8-channel LoRaWAN gateway that's connected to The Things Network for my project, which will make this process a lot easier.

Next time, I'll likely talk about 3D printing and how I've been 'threading the needle', so to speak.

## Easy Paper Referencing and Research

For my Msc (Masters in Science) degree that I'm currently working towards, I'm having to do an increasing amount of research and 'official' referencing in my University's referencing style.

Unlike when I first started at University, however, by now I've developed a number of strategies to aid me in doing this referencing properly with minimal effort - and actually finding papers in the first place. To this end, I wanted to write a quick post on what I do for my own reference - and hopefully someone else will find it useful too (comment below!).

Although I really like using Markdown for writing blog posts and other documents, when I've got to do proper referencing I often end up using LaTeX instead. I first used LaTeX for my interim report in my final year of my undergraduate degree, and while I'm still looking for a decent renderer (I think I'm using TeX Live, and it's a pain), it's great for referencing because of BibTeX. With the template provided by my University, I can enter something like this:

@Misc{Techopedia2017,
author = {{Techopedia Inc.}},
year = {2017},
title = {What is Hamming Distance? - Definition from Techopedia},
howpublished = {Available online: \url{https://www.techopedia.com/definition/19723/hamming-distance} [Accessed 04/04/2019]}
}

...and it will automagically convert it into the right style according to the template my University has given me. Then I just reference it (\citep{Techopedia2017}), and I'm away!

For actually finding papers, I've been finding Google Scholar useful. It even has a button you can press that generates the BibTeX definition as shown above for you to paste into your references file!

Finally, I've just recently discovered Microsoft Academic. While I haven't used it too much yet, it seems to be a great alternative to Google Scholar. It's got a cool AI-based semantic search engine, and an interesting summary view of how a given paper relates to other papers to help you find more papers on a topic. It shows a papers references,the papers that cite that paper, and papers that it determines to be related - giving you plenty to choose from.

Using a combination of these approaches, I've been able to effectively focus on the finding and referencing papers, without getting bogged down too much with the semantics of how I should actually do the referencing itself.

Found this useful? Got another tip? Comment below!

## Delivering Linux 101

Achievement get: Deliver workshop!

At the beginning of my time here at University I never thought I'd be planning and leading the delivery an entire workshop on the basics of Linux. Assessed coursework presentations have nothing on this!

Overall, I think it went rather well, actually. About a dozen people attended in total, and most people seemed to manage to get near the end of the tasks I had prepared:

1. Installing Ubuntu
2. Installing Mono
3. Investigating Monodevelop

I think next time I want to better prepare for the gap when installing the operating system, as it took much longer than I expected. Perhaps choosing the "minimal" installation instead of the "normal" installation would help here?

Preparing some slides on things like the folder structure and layout, and re-ordering the slides about package management would would all help.

If I can't cut down on the installation time, pre-installed virtual machines would also work - but I'd like to keep the OS installation if possible to show that it's an easy process installing Ubuntu on their own machines.

Moving forwards, I've already received a bunch of feedback on what future sessions could contain:

1. Setting up remote access
• This would be SSH, which is already installed & pre-setup on a server installation of Ubuntu
2. Gaming
• I unsure precisely what's meant by this. Is it installation of various games? Or maybe it's configuration of various platforms such as Steam? Perhaps someone could elaborate on it?
3. Server installation & maintenance
• Installation is largely similar to a desktop
• I'd want to measure how long it takes to install, because much of the work with a server is the post-install tasks
• Perhaps looking into a pre-installed server might be beneficial here, but security would be a slight concern

I think for anything more advanced, I'll probably go with a lab sheet-style setup instead, so that people can work at their own pace - especially since something like server configuration has many different steps to it.

I'd certainly want a goal to work towards for such a session. I've had some ideas already:

• Setting up a web server
• Installing Nginx
• Writing and understanding configuration files
• Possibly some FastCGI? PHP / Python? Probably not, what with everything else
• Setting up a server to host a custom application
• Writing systemd service files
• Setting up log rotation

Common to both of these ideas would be:

• Basic terminal skills
• Basic hardening
• Using sudo instead of root if it doesn't come configured with that setup already
• Securing SSH
• Enabling UFW

I'm pretty sure I'll be doing another one of these sessions, although I'm unsure as to whether there's the demand for a repeat of this one.

If you've got any thoughts, let me know in the comments below!

Thanks also to @MoirkoB and everyone else who provided both time and resources to enable this to go ahead. Without, I'm sure it wouldn't have happened.

If you'd like to view the slide deck I used, you can do so here:

Linux 101 Slide Deck

If you missed it, but would like to be notified of future sessions, then fill out this Google Form:

Linux 101 Overflow

Art by Mythdael