PhD Update 5: Hyper optimisation and frustration
Hello there again! It's been longer than I anticipated since the last proper post in this series. Before I continue, here's a list of all the (proper) posts in this series so far:
- PhD Update 1: Directions
- PhD Update 2: The experiment, the data, and the supercomputers
- PhD Update 3: Simulating simulations with some success
I've haven't managed to get as much done since last time as I was hoping (partly due to the fact that I'm currently having to work from home, which is more challenging than I expected), but I have finished my implementation of the Temporal CNN, and am now working on hyperparameter optimisation. I've also fixed a number of issues in my rainfall radar data downloader and processing programs - which I'll talk about in more detail below
HAIL-CAESAR and the iterative improvements
Someone at the University recently approached me (if you are reading this and have a blog, comment below and I'll update) to ask if they could use my rainfall radar data downloader program to download some rainfall radar data for their project. Naturally, I helped them out. This turned out to be a great thing for me as well, as with their help I managed to uncover a number of very nasty issues with the data pipeline I had been building up to that point:
- The hydro index file that HAIL-CAESAR uses was completely scrambled
- The date on the data downloaded was a month out
- The data downloaded was (and still is) rotated by 90° on disk
- The data was out by a factor of 32
While fixing each of these bugs was a (relatively) simple process, I can't help but wonder how they managed to escape my notice until (for all but 1 of them) someone else told me about them.
The other issue was that because of the amount of data I'm working with, it took forever to re-run the program to test to see if I had managed to fix the problem - and if I had, I'd encounter another problem. This long iteration process makes implementing a new feature or fixing a bug a very time-consuming process.
Despite fixing all these issues, I'm still experiencing issues with my latest refactoring of the rainfall radar data downloader (namely a hang in the event system when reading tar files). My current thinking is that I'm going to completely reimplement it (using snippets from the old programif I need to use it again in the future, as it is currently neither particularly efficient (it's single-threaded) nor easy to bugfix (it's pretty complicated).
I've got an idea for a parallel system that processes each tar file separately first, and then only after all the tar files have been converted separately are they strung together into the actual files the existing implementation spits out currently.
Temporal CNN delight
Last time, I had just started my implementation of a Temporal CNN. This is now pretty much complete, and I've also been able to run it and get some results! Check out this graph:
This graph shows the root mean squared error when training on 1000 time steps of data (about 3 days 11 hours or so). Epochs are along the X axis, and the root mean squared error is on the Y axis.
A few things to note here. Firstly, the implementation I've come up with essentially does video-to-image translation. The original model in the paper I've linked to is demonstrates a classification task (specifically land use over time) - so what I'm doing is a little different.
Secondly, I've omitted the root mean squared error for the first epoch. It was so high that it made the rest of the graph impossible to see - hence the omission.
I'm pretty pleased with this result so far - as I have a nice downwards curve indicating that the model is (probably) learning something useful.
I am still rather nervous about the output though, as due to the way I've implemented the network I haven't actually been able to 'see' the output of the network at all as an image yet. Doing so would take a while to implement, so I haven't done so for now (although I really should do this soon). It would be really cool to see a short video (maybe at ~10fps) of the network output as the epochs move forwards to visualise the network training process.
Lastly, at the suggestion of my supervisor I've been working on hyperparameter optimisation. In short, this consists of training the model with random combinations of hyperparameters and seeing which ones work best.
A hyperparameter is a tunable parameter that controls an aspect of a model. In my case, I have 2 key hyperparameters I need to tune:
- Filter count: CNN layers in Tensorflow.js have a filter count associated with them. I theorise that increasing this will increase the model's ability to learn spatial information.
- Temporal depth: The number of time steps to push through the model at once. Increasing this will allow the model to make predictions based on events that occur further in the past.
My eventual aim here is to create a heatmap that has the above hyperparameters along the X and Y axes, and the colour showing the accuracy of the model that was trained - similar to the one I created previously.
To do this, I implemented a program that tries random combinations of hyperparameters - but never the same combination twice. It starts the model in a subprocess and passes the chosen filter count and temporal depth values in as CLI arguments, which the child process picks up, parses, and then trains a model based on. This CLI is the same one as the one I developed that generated the above graph in the previous section of this blog post.
This approach has the advantage that it isolates the model in a subprocess, so when the subprocess exits and a new one spawns for the next combination of hyperparameters, the environment is completely clean and there isn't anything that might interfere with it.
Unfortunately though, while I set off a run of this implementation before I took a 'holiday' - and even checked on it to ensure it was running as expected (multiple times) - it still managed to crash when I wasn't looking.
After some debugging, I discovered that problem was because the model ran out of memory while training. This was something I had expected - and used the
--unhandled-rejections=strict option for Node.js, which tells Node.js to crash and exit when an UnhandledPromiseRejection is thrown - like this one:
2020-08-04 15:31:26.174395: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at conv_grad_ops_3d.cc:1783 : Resource exhausted: OOM when allocating tensor with shape[8,2,104,348,210] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc (node:62355) UnhandledPromiseRejectionWarning: Error: Invalid TF_Status: 8 Message: OOM when allocating tensor with shape[8,2,104,348,210] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc at Object.<anonymous> (<anonymous>) .....
Unfortunately though, I used this flag on the parent process (that drives the hyperparameter optimisation) and not the child process - leading to a situation whereby the child process crashed due to the aforementioned error and just hangs around doing nothing. Even more frustatingly, the solution si as simple as doing a quick
export NODE_OPTIONS="--unhandled-rejections=strict" before running the hyperparameter optimisation program to ensure that the flag propagates to the child processes......
Very frustrating indeed - especially considering I calculate that it will take multiple weeks to gather enough data to create a meaningful heatmap.
Reading back over this post, I have got more done than I expected. I've started and finished my Temporal CNN implementation, and fixed lots of bugs in them existing code.
However, the long iteration times to test code I've written (despite using a small slice of the dataset to test with), the large datasets I'm working with, having work on a system remotely via SSH by pushing and pulling code with git (many times), working from home all the time, and the continued bugs I've been facing and will likely continue to face have caused and are causing significant unexpected slowdowns moving forwards.
At least the VPN is no longer dropping out every 5 minutes!