PhD Update 2: The experiment, the data, and the supercomputers
Welcome to another PhD update post. Since last time, a bunch of different things have happened, which I'll talk about here. In particular, two distinct strands have become evident: the reading-papers-and-theory bit, and the writing-code-and-testing-stuff-out bit.
At the moment, I'm focusing much more heavily on the writing code and experimental side of things, as I've recently gained access to the 1km resolution rainfall radar dataset from CEDA. While I'm not allowed to share any of the data I've now got, I'm pretty sure I'm safe to talk about how terribly it's formatted.
The data itself is organised with 1 directory per year, and 1 file per day inside those directories. Logical so far, right? Each of those files is a tar archive, inside which are the binary files that contain the data itself - with 1 file for every 5 minutes. These are compressed with gzip - which seems odd, since they could probably achieve greater compression ratios if they were tarred first and compressed second (then the compression algorithm would be able to exploit the similarities between files too).
The problems arise when you start to parse the binary files themselves. They are in a proprietary format - which has 3 different versions that don't conform to the (limited) documentation. To this end, it's proving something of a challenge to parse them and extract the bits I'm interested in.
To tackle this, I've been using Node.js and a bunch of libraries from npm (noisy pirate mutiny? nearest phase modulator? nasty popsicle machine? nah, it's probably node package manager):
- binary-parser - For parsing the binary files themselves. Allows you to define the format of the file programmatically, and it'll parse it out into a nice object you can then manipulate.
- gunzip-maybe - A streaming library that decompresses a gzip-compressed stream (and passes it straight through if it isn't gzipped)
- @icetee/ftp - An FTP client library for downloading the files (I know that FTP is insecure, that's all they offer at this time :-/)
- tar-stream - For parsing tar files
- nnng - Stands for No! Not National Grid!. Helps with the conversion between OS national grid references and regular latitude / longitude.
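binary-parser lets you declare a file layout and get a parsed object back; to illustrate the underlying task, here's a minimal sketch using only Node's built-in Buffer methods. The field names, offsets, and values are all invented - the real format is proprietary and differs (not least in having 3 versions):

```javascript
// Sketch of the kind of fixed-offset header parsing involved.
// Field names and offsets are invented - the real format differs.
function parseHeader(buf) {
	return {
		// Big-endian 16-bit integers read at fixed byte offsets
		year: buf.readUInt16BE(0),
		month: buf.readUInt16BE(2),
		day: buf.readUInt16BE(4),
		rows: buf.readUInt16BE(6),
		cols: buf.readUInt16BE(8),
	};
}

// Build a fake header to demonstrate
const header = Buffer.alloc(10);
header.writeUInt16BE(2019, 0);
header.writeUInt16BE(11, 2);
header.writeUInt16BE(30, 4);
header.writeUInt16BE(2175, 6); // made-up grid dimensions
header.writeUInt16BE(1725, 8);

console.log(parseHeader(header));
// → { year: 2019, month: 11, day: 30, rows: 2175, cols: 1725 }
```

binary-parser does essentially this, but lets you chain `.uint16()`-style declarations instead of hand-writing the offsets.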
Aside from the binary file format, I encountered 3 main issues:
- The data is only a rectangle when using Ordnance Survey national grid references
- There's so much data, it needs to be streamed from the remote server
- Generating a valid gzip file is harder than you expect
Problem 1 here took me a while to figure out. Since, as I mentioned, the documentation is rather limited, I spent much longer than I would have liked attempting to parse the data as latitude / longitude, only to find that it didn't work.
Problem 2 was rather interesting. Taking a cursory glance over the data beforehand revealed that each daily tar file was about 80MiB - and with roughly 5.7K days' worth of data (the dataset appears to go back to May 2004-ish), that's somewhere in the region of 450GiB in total. It quickly became clear that I couldn't just download them all and process them later.
I'm still working on tweaking and improving my final solution, but as it stands, after implementing the extractor on its own, I've also implemented a wrapper that streams the tar archives from the FTP server, stream-reads the tar archives, streams the files in the tar archives into a gzip decompressor, parses the result, and then streams the interesting bits back to disk via a gzip compressor.
That's a lot of streams. The great part about this is that I don't accidentally end up with huge chunks of binary files in memory. The only bits that can't be streamed are the binary file parser and the bit that extracts the interesting bits.
I'm still working on the last issue, but I've been encountering nasty problems with the built-in zlib gzip compressor transformation stream. When I send a SIGINT (Ctrl + C) to the Node.js process, it doesn't seem to want to finish writing the gzip file correctly - leading to invalid gzip files with chunks missing from the end.
Since the zlib gzip transformation stream is so badly documented, I've ended up replacing it with a different solution that spawns a gzip child process instead (so you've got to have gzip installed on the machine you're running the script on, which shouldn't be a huge deal on Linux). This solution is better, but it still requires some tweaks, because it transpires that a Ctrl + C in the terminal sends SIGINT to the whole foreground process group - so the child process receives it at the same time as the Node.js parent, before you've had a chance to tie up all your loose ends. Frustrating.
Even so, I'm hopeful that I've pretty much got it to a workable state for now - though I'll need to implement a daemon-type script at some point to automatically download and process the new files as they are uploaded - it is a living dataset that's constantly being added to after all.
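The core of such a daemon could be a simple poller that diffs the remote directory listing against what it has already processed. In this sketch, `listFiles` is a hypothetical stand-in for an FTP directory listing (via @icetee/ftp in my case), and `processFile` stands in for the download-and-extract pipeline described above:

```javascript
// Sketch of a simple polling daemon core. listFiles() and processFile()
// are hypothetical stand-ins supplied by the caller.
function makePoller(listFiles, processFile) {
	const seen = new Set();
	return async function poll() {
		for (const filename of await listFiles()) {
			if (seen.has(filename)) continue; // already processed
			seen.add(filename);
			await processFile(filename);
		}
	};
}

// Usage: check for newly-uploaded daily tar files every 6 hours
// const poll = makePoller(listFiles, processFile);
// setInterval(poll, 6 * 60 * 60 * 1000);
```

A real version would also persist the `seen` set to disk, so a restart doesn't reprocess 15 years of archives.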
The other strand (that's less active at the minute) is reading papers. Last time, I mentioned the summary papers I'd read, and the direction I was considering reading in. Since then, I've both read a number of new papers and talked to a bunch of very talented people on-campus - so I've got a little bit of a better idea as to the direction I'm headed in now.
Firstly, I've looked into a few cutting-edge recurrent neural network types:
- Grid LSTMs - Basically multi-dimensional LSTMs
- Dilated LSTMs - Make LSTMs less computationally intensive and better at learning long-term relationships
- Transformer Neural Networks - more reading required here
- NARX Networks
Many of these recurrent neural network structures appear to show promise for mapping floods. The last experiment into a basic LSTM didn't go too well (it turned out to be hugely computationally expensive), but learning from that experiment I've got lots of options for my next one.
A friend of mine managed to track down the paper behind Google's AI blog post - which turned out to be an interesting read. It transpires that despite the bold words in the blog post, the paper is more of an initial proposal for a research project than a completed project in itself. Most of the work they've done so far actually uses a traditional physics-based model, at which they've basically thrown Google-scale compute power to make pretty graphs. They've then critically evaluated the results and identified a number of areas in which they can improve - but they've been light on details, probably because they either haven't started yet, or don't want to divulge their sekrets.
I also saw another interesting paper from Google entitled "Machine Learning for Precipitation Nowcasting from Radar Images", which I found because it was reported on by Ars Technica. It describes the short-term forecasting of rain from rainfall radar in the US (I'm in the UK) using a convolutional neural network-based model (specifically U-Net, apparently - I have yet to read up on it).
The model they use is composed in part of convolutional neural network (CNN) layers that downsample 256x256 tiles to a smaller size, and then upscale them back to the original size, with some extra connections that skip parts of the model too. They claim that their model improves on existing approaches for forecasts up to 6 hours in advance - so their network structure seems somewhat promising as inspiration for my own research.
Initial thoughts include wondering whether I can use CNN layers like this to sandwich a more complex recurrent layer of some description that remembers the long-term relationships. I'll have to experiment...
Found this interesting? Got a suggestion? Confused on a point? Comment below!