PhD Update 4: Ginormous Data

Hello again! In the last PhD update blog post, I talked about patching HAIL-CAESAR to improve performance and implementing a Temporal Convolutional Neural Net (Temporal CNN).

Since making that post, I've had my PhD Panel 1 (very useful, thanks to everyone who was on that panel!). I've also got an initial - albeit untested - implementation of a Temporal CNN, and I've been wrangling lots of data in more ways than one. I'm definitely seeing the Big Data aspect of my project title now.

HAIL-CAESAR

I initially ran HAIL-CAESAR at 50m per pixel. This went ok and generated lots of data, but in 2 weeks of real time it barely hit 43 days' worth of simulation time! The other issue I discovered was with the way I compressed the output of HAIL-CAESAR: for some reason, the output files were compressed before HAIL-CAESAR had finished writing to them. This resulted in the data being cut off randomly in the output files for each time step.

Big problem - clearly another approach is in order.

To tackle these issues, I've done several things. Firstly, I patched HAIL-CAESAR again to support writing the output water depth files to the standard output. As a refresher, they are actually identical in format to the heightmap, which looks a bit like this:

ncols 4
nrows 3
xllcorner 400000
yllcorner 300000
cellsize 1000
1 2 3 4
1 1 2 3
0 1 1 2

The above is a 4x3 grid of points, with the bottom-left corner being at (400000, 300000) on the Ordnance Survey National Grid (I know, latitude / longitude would be so much better, but all the data I'm working with is on the OS national grid :-/). Each point represents a 1km square area.

I realised that it doesn't actually matter if I concatenate multiple files in this format together - I can guarantee that I can tell them apart. As soon as I detect a metadata line that the current file has already declared, I know that the next file is starting. To take advantage of this, I implemented a new Terrain50.ParseStream() function: an async generator that takes a stream and iteratively yields the Terrain50 instances it parses out of it. In this way, I can split 1 big continuous stream back up again into the individual parts.
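
As a rough illustration of that splitting rule, here's a minimal sketch. It is not the real Terrain50.ParseStream(): that function is async and reads from a stream, whereas this one is a plain generator over lines that have already been split. The names parseGrids and META_KEYS are made up for this example, and it only knows about the 5 metadata keys shown in the heightmap above.

```javascript
// Hypothetical sketch, not the real Terrain50.ParseStream().
// Only the 5 metadata keys from the example heightmap are recognised.
const META_KEYS = new Set(["ncols", "nrows", "xllcorner", "yllcorner", "cellsize"]);

function* parseGrids(lines) {
  let current = null;
  for (const raw of lines) {
    const line = raw.trim();
    if (line.length === 0) continue;
    const parts = line.split(/\s+/);
    const key = parts[0].toLowerCase();
    if (META_KEYS.has(key)) {
      // A metadata key that the current grid has already declared
      // means that the next grid has started.
      if (current !== null && key in current.meta) {
        yield current;
        current = null;
      }
      if (current === null) current = { meta: {}, data: [] };
      current.meta[key] = Number(parts[1]);
    } else {
      if (current === null) throw new Error("Data row found before any metadata");
      current.data.push(parts.map(Number));
    }
  }
  // Don't forget the last grid in the stream
  if (current !== null) yield current;
}

// Two tiny grids concatenated together
const concatenated = [
  "ncols 2", "nrows 1", "xllcorner 400000", "yllcorner 300000", "cellsize 1000", "1 2",
  "ncols 2", "nrows 1", "xllcorner 400000", "yllcorner 300000", "cellsize 1000", "3 4",
];
const grids = [...parseGrids(concatenated)];
console.log(grids.length); // 2
```

An async version would follow the same logic, but read its lines from the stream with for await instead.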

By patching HAIL-CAESAR such that it outputs the data in 1 continuous stream, it also means that I can pipe it to a single compression program. This has 2 benefits:

It avoids the "compressing the individual files before HAIL-CAESAR is ready" problem (the observant might note that inotifywait would solve this issue neatly too, but it isn't installed on Viper)
It allows for more efficient compression, as the compression program can use data from other time step files as context
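
The second benefit can be illustrated with a rough sketch. Note the assumptions: it uses gzip from Node.js' built-in zlib module as a stand-in (not lz4), and the grid values are made up. It compresses 2 nearly identical time steps separately, and then again as 1 concatenated stream. The function names here are hypothetical.

```javascript
const zlib = require("zlib");

// Deterministic made-up cell values (Park-Miller generator), so runs are repeatable.
function makeValues(seed, width, height) {
  let state = seed;
  const rows = [];
  for (let y = 0; y < height; y++) {
    const row = [];
    for (let x = 0; x < width; x++) {
      state = (state * 48271) % 2147483647;
      row.push(state % 100);
    }
    rows.push(row);
  }
  return rows;
}

// Serialise a 2D array in the heightmap format shown earlier.
function toText(values) {
  const header = [
    `ncols ${values[0].length}`,
    `nrows ${values.length}`,
    "xllcorner 400000",
    "yllcorner 300000",
    "cellsize 1000",
  ];
  return header.concat(values.map((row) => row.join(" "))).join("\n") + "\n";
}

// 2 consecutive "time steps" that differ only in their first row
const step1 = makeValues(1, 20, 20);
const step2 = step1.map((row, y) => (y === 0 ? row.map((v) => v + 1) : row));

// ASSUMPTION: gzip is used here only because it ships with Node.js; it is not lz4
const gzipSize = (text) => zlib.gzipSync(Buffer.from(text)).length;
const separate = gzipSize(toText(step1)) + gzipSize(toText(step2));
const combined = gzipSize(toText(step1) + toText(step2));
console.log({ separate, combined });
```

The combined figure should come out smaller, because the compressor can refer back to the first time step when encoding the second. How big the saving is with lz4 on real water depth data is a separate question that this sketch doesn't answer.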

Finding a compression tool was next. I wanted something CPU efficient, because I wanted to ensure that the maximum number of CPU cycles was dedicated to HAIL-CAESAR for running the simulation, rather than compressing the output it generates - since the simulation is the bottleneck after all.

I ended up using lz4, an extremely fast compression algorithm. It compiles easily too, which is nice as I needed to compile it from source automatically on Viper.

With all this in place, I ran HAIL-CAESAR again 2 more times. The first run was at the same resolution as before, and generated 303 GiB (!) of data.

The second run was at 500m per pixel (10 times lower resolution), which generated 159 GiB (!) of data and, by my calculations, managed to run through ~4.3 years in simulation time in 5 days of real time. Some quick calculations suggest that it would take just over 11 days to get through all 13 years of rainfall radar data I have. Since I've got everything set up already, I'm going to contact the Viper administrators to ask about running a longer job so that it can complete this process if possible.
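
To put the 2 resolutions side by side, here's a quick back-of-the-envelope sketch using only the figures quoted in this post. It assumes 365 days per year, and takes "2 weeks" to mean exactly 14 days.

```javascript
// Simulated days per day of real time, from the figures quoted in this post
const firstRun = 43 / 14;          // 50m per pixel: 43 simulated days in 2 weeks
const secondRun = (4.3 * 365) / 5; // 500m per pixel: ~4.3 simulated years in 5 days

console.log(firstRun.toFixed(1));               // 3.1
console.log(secondRun.toFixed(1));              // 313.9
console.log((secondRun / firstRun).toFixed(0)); // 102

// Going from 50m to 500m per pixel means 10x fewer cells along each axis
console.log((500 / 50) ** 2); // 100
```

So the lower resolution run got through simulation time roughly 100 times faster, which is about what the drop in the number of cells would suggest.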

Temporal CNN Preprocessing

The other major thing I've been working on since the last post is the Temporal CNN. I've already got an initial implementation set up, and I'm currently in the process of ironing out all the bugs in it.

I ran into a number of interesting bugs. One of these was to do with incorrectly specifying the batch size (due to a typo), which resulted in the null values you may have noticed in the model summary in the last post. With that fixed, it looks much more sensible:

_________________________________________________________________
Layer (type)                 Output shape              Param #   
=================================================================
conv3d_1 (Conv3D)            [32,2096,3476,124,64]     16064     
_________________________________________________________________
conv3d_2 (Conv3D)            [32,1046,1736,60,64]      512064    
_________________________________________________________________
conv3d_3 (Conv3D)            [32,521,866,28,64]        512064    
_________________________________________________________________
pooling (AveragePooling3D)   [32,521,866,1,64]         0         
_________________________________________________________________
reshape (Reshape)            [32,521,866,64]           0         
_________________________________________________________________
conv2d_output (Conv2D)       [32,517,862,1]            1601      
_________________________________________________________________
reshape_end (Reshape)        [32,517,862]              0         
=================================================================
Total params: 1041793
Trainable params: 1041793
Non-trainable params: 0
_________________________________________________________________

This model is comprised of the following:

3 x 3D convolutional layers
1 x pooling layer to average out the temporal dimension
1 x reshaping layer to remove the redundant dimension
1 x 2D convolutional layer that will produce the output
1 x reshaping layer to remove another redundant dimension
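
The shapes and parameter counts in that summary can be sanity-checked with a bit of arithmetic. The sketch below is an inference from the numbers, not something stated in the post: it assumes kernels that are 5 wide in every dimension, no padding, a stride of 2 for the 2nd and 3rd 3D convolutional layers, a stride of 1 for the 2D layer, and 2 input channels. The function names are made up for this example.

```javascript
// Output size along 1 dimension of a convolution without padding
function convOutputSize(size, kernel, stride) {
  return Math.floor((size - kernel) / stride) + 1;
}

// Trainable parameters: 1 weight per kernel cell per input channel, plus 1 bias, per filter
function convParams(kernelCells, inChannels, filters) {
  return (kernelCells * inChannels + 1) * filters;
}

// ASSUMPTION (inferred from the summary, not confirmed): kernel size 5 everywhere
const K = 5;

const paramCounts = [
  convParams(K ** 3, 2, 64),  // conv3d_1: 16064
  convParams(K ** 3, 64, 64), // conv3d_2: 512064
  convParams(K ** 3, 64, 64), // conv3d_3: 512064
  convParams(K ** 2, 64, 1),  // conv2d_output: 1601
];
console.log(paramCounts.reduce((a, b) => a + b, 0)); // 1041793

// ASSUMPTION: stride 2 in the 3D layers, stride 1 in the final 2D layer
const afterConv3d2 = [2096, 3476, 124].map((n) => convOutputSize(n, K, 2)); // 1046, 1736, 60
const afterConv3d3 = afterConv3d2.map((n) => convOutputSize(n, K, 2));      // 521, 866, 28
const afterConv2d = afterConv3d3.slice(0, 2).map((n) => convOutputSize(n, K, 1)); // 517, 862
console.log(afterConv3d2, afterConv3d3, afterConv2d);
```

Under those assumptions, every figure matches the summary above, including the total parameter count.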

I'll talk about this model in more detail in a future post in this series once I've managed to get it running and I've played around with it a bit.

Another significant bug I ran into was to do with stacking tensors like an image. I ended up asking on Stack Overflow: How do I reorder the dimensions of a rank 3 tensor in Tensorflow.js?

The input to the above model is comprised of a sliding window that moves along the rainfall radar time steps. Each time step contains a 2D array, representing the amount of rain that has fallen in a given area. This needs to be combined with the heightmap, so that the AI model knows what the terrain that the rain is falling on looks like.
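
The sliding window itself can be sketched as a small generator. This is a hypothetical helper rather than code from my project, and it assumes the window moves along 1 time step at a time - the window size in the example is arbitrary.

```javascript
// Hypothetical helper: yield every run of `size` consecutive time steps
function* slidingWindows(steps, size) {
  for (let start = 0; start + size <= steps.length; start++) {
    yield steps.slice(start, start + size);
  }
}

// Strings stand in for the 2D rainfall arrays here
const timeSteps = ["t0", "t1", "t2", "t3", "t4"];
const windows = [...slidingWindows(timeSteps, 3)];
console.log(windows.length); // 3
console.log(windows[1]);     // t1, t2, t3
```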

The heightmap doesn't change, but I'm including a copy of it with every rainfall radar time step because of the way the 3D convolutional layer works in Tensorflow.js. 2D convolutional layers in Tensorflow.js, for example, take in a 2D array of data as a tensor. They can also take in multiple channels though, much like pixels in an image. The pixels in an image might look something like this:

R1 G1 B1 A1 R2 G2 B2 A2 R3 G3 B3 A3 .....

As you might have seen in the Stack Overflow answer I linked to above, Tensorflow.js does support stacking multiple 2D tensors in this fashion. Unfortunately, it is extremely slow. It is for this reason that I've been implementing a multi-process program to preprocess the data to do this stacking in advance.
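
To make the pixel-style stacking concrete, here's a minimal sketch using plain JavaScript arrays rather than Tensorflow.js tensors. The function name is made up, and putting the rainfall value before the heightmap value in each cell is an arbitrary choice for the example.

```javascript
// Combine 2 grids of the same size into 1 grid with 2 channels per cell,
// in the same way that an image stores R, G, B and A together for each pixel.
function stackChannelsLast(rainfall, heightmap) {
  if (rainfall.length !== heightmap.length) {
    throw new Error("Grids must have the same number of rows");
  }
  return rainfall.map((row, y) => {
    if (row.length !== heightmap[y].length) {
      throw new Error("Grids must have the same number of columns");
    }
    return row.map((value, x) => [value, heightmap[y][x]]);
  });
}

const rainfall = [[0.1, 0.2], [0.3, 0.4]];
const heightmap = [[10, 20], [30, 40]];
const stacked = stackChannelsLast(rainfall, heightmap);

// Flattened, the values interleave: 0.1, 10, 0.2, 20, 0.3, 30, 0.4, 40
console.log(stacked.flat(2));
```

Every cell of both grids has to be visited and interleaved, which gives a feel for why doing this for lots of large grids is worth moving into a preprocessing step.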

As I'm writing this though, I've finally understood what the dataFormat option in the conv3d and conv2d layers is for, and I think I might have been barking up the wrong tree...

What's next

From here, I'm going to investigate that dataFormat option for the TemporalCNN - it would hugely simplify my setup and remove the need for me to preprocess the data beforehand, since stacking tensors directly 1 after another is very quick - it's just stacking them along a different dimension that's slow.

I'm also hoping to do a longer run of that 500m per pixel HAIL-CAESAR simulation. More data is always good, right? :P

After I've looked into the dataFormat option, I'd really like to get the Temporal CNN set off training and all the bugs ironed out. I'm so close I can almost taste it!

Finally, if I have time, I want to go looking for a baseline model of sorts. By this, I mean an existing model that's the closest thing to the task I'm doing - even though it might not be as performant or designed for my specific task.

Found this interesting? Got a suggestion of something I could do better? Confused about something I've talked about? Comment below!
