Starbeamrainbowlabs

Stardust
Blog


Archive


Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing docker documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro minetest network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases rendering resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

PhD, Update 8: Eggs in Baskets

I'm back again with another PhD update blog post! Before we begin, here's a list of all the parts in the series so far:

As in the previous post, progress since last time is split in 2: The Temporal CNN, and the social media side of things. I've started to split my time more evenly between the 2 sides, as it seems like the Temporal CNN is going to take lots more work than anticipated and I'd rather not put all my eggs in 1 basket.

Temporal CNN

As you might have guessed, the Temporal CNN still isn't learning anything, but at least now I think I know what the problem is. Since last time, I've done a bunch of debugging and tests to try and figure out what the problem is. During that process, I've managed to reach a record of ~20% accuracy, which at least gives me hope that it's going to work!

Specifically, I used the MNIST (alternative site) handwriting digit dataset with my "easy" task as explain in the previous post, but with a small difference: I pre-generated 2 random tensors to serve as the "below 5" and "5 and above" targets to predict instead of a pair of tensors filled with 0s or 1s respectively. The model didn't like this at all, so this is how I now know what the problem is.

For those interested, here's the laundry list of other things I've tried since last time:

  • Giving it more data (all of 2007, with the 2013 floods as validation; made things a bit worse)
  • Found and fixed a bug in data normalisation that managed to sneak through during the rewrite (reduced training times a touch)
  • Inverting the heightmap (helped a bit)
  • Making the model deeper (Gave me a full 5% accuracy increase from 15% to 20%!)

Knowing what the problem is though is 1 thing, but solving it is another matter entirely. Thankfully, my supervisor and I have a plan to look into using a modified version of the latter half of a variational autoencoder and squidge it onto the tail end of the Temporal CNN. If it works, then I'm imagining that we'll need a new name for the Temporal CNN (suggestions?), but I'll tackle that once I've finished revising the model.

For context, a variational autoencoder is a modified "vanilla" autoencoder, and is 1 of 2 different main classes of generative AI model architecture - the other being Generative Adversarial Networks (GAN). In contrast to a GAN, a variational autoencoder does image-to-image translation with a single model, and maps an input parameter space onto an output parameter space. It first encodes the input to the model into a smaller tensor of features, before upscaling that back into an image again. In this fashion, it can learn to translate between 2 different images - for example putting glasses on people's faces.

To do this, I'm going to implement a vanilla variational autoencoder using the MNIST dataset, and once I've done this I'll then lift part of the model structure and transpose it onto the top of my existing Temporal CNN - by doing it this way I'll ensure that I have a known-good model to work with that is definitely capable of image-to-image translation.

Social Media

In other news, I've started to make some real progress on the social media side of things. I've downloaded and anonymised some tweets (the code for which is open source on npm under the package name twitter-academic-downloader - I intend to write a separate blog post about it at some point soon-ish), and I've also put together an LSTM-based model to start looking at doing some text classification.

I decided to implement said model in Python instead of Javascript, because for what I can tell Tensorflow.js doesn't come with as many batteries included as Tensorflow for Python does for natural language processing-based tasks. This has caused some interesting adventures (and a number of frustrating crashes), but I think I'm starting to get the hang of it.

In particular it's interesting coming from Tensorflow.js (which is a later project), because it seems that Tensorflow for Python is much less cohesive and more disjointed as a library compared to Tensorflow.js, which has learnt and applied lessons from the Python implementation - resulting in a much more cohesive and well thought out API. A prime example of this is the tf.Dataset vs tf.keras.Sequence in the Python version, which isn't an issue in Tensorflow.js, as in the Javascript bindings we have a single tf.Dataset.

This aside, my next step here is to train a significantly sized model that's larger than the mini model with a single layer and 100 units I've been using for testing purposes (that's my task for this afternoon - which I've likely done by the time you're reading this post).

In terms of literature, I've read a bunch more papers on the subject since last time - but I still feel like I've got more to read. Recently I read a series of papers about word embeddings (converting words into numerical tensors), which was very interesting. The process has evolved over the years, starting from a simple dictionary mapping incrementing numbers to words, to training an AI to generate said representations in increasingly sophisticated ways (starting with word2vec, then moving on to in no particular order ELMo, GloVe, and finally BERT - transformers are pretty incredible models). It was a fascinating read - I can recommend it to anyone who's interested in natural language processing (along with this excellent post)

In the model I've implemented, I've ultimately decided to go with GloVe (Global Vectors for Word Representation), as the pre-trained model is simply a text file containing a lookup table one can read into a dictionary or hash table.

Conclusion

Things have been moving forwards - albeit slowly. I've got an idea as to how I can resolve the issues I've been facing with the Temporal CNN (pending a new name once I'm done with all the modifications and I know what the model architecture is going to be like), though it's going to take a lot of work.

Things are finally starting to move in social media land - hopefully the accuracy of the LSTM-based model will be higher than that of the mini model I trained, which was only 50% on a balanced dataset - no better than blind guessing!

See you again in 2 months or so, when hopefully I'll have some real results to show (though of course I'll be keeping up with weekly posts about other things in the meantime). If you have any comments or questions about any of this - please leave a comment below! I'd love to hear your thoughts.

Sources and further reading

A much easier way to install custom versions of Python

Recently, I wrote a rather extensive blog post about compiling Python from source: Installing Python, Keras, and Tensorflow from source.

Since then, I've learnt of multiple other different ways to do that which are much easier as it turns out to achieve that goal.

For context, the purpose of running a specific version of Python in the first place was because on my University's High-Performance Computer (HPC) Viper, it doesn't have a version of Python new enough to run the latest version of Tensorflow.

Using miniconda

After contacting the Viper team at the suggestion of my supervisor, I discovered that they already had a mechanism in place for specifying which version of Python to use. It seems obvious in hindsight - since they are sure to have been asked about this before, they already had a solution in the form of miniconda.

If you're lucky enough to have access to Viper, then you can load miniconda like so:

module load python/anaconda/4.6/miniconda/3.7

If you don't have access to Viper, then worry not. I've got other methods in store which might be better suited to your environment in later sections.

Once loaded, you can specify a version of Python like so:

conda create -n py38 python=3.8

The -n py38 specifies the name of the environment you'd like to create, and can be anything you like. Perhaps you could use the name of the project you're working on would be a good idea. The python=3.8 is the version of Python you want to use. You can list the versions of Python available like so:

conda search -f python

Then, to activate the new environment, do this:

conda init bash
conda activate py38
exec bash

Replace py38 with the name of the environment you created above.

Now, you should have the specific version of Python you wanted installed and ready to use. You can also install packages with pip, and it should all come out in the wash.

For Viper users, further information about miniconda can be found here: Applications/Miniconda Last

Gentoo Project Prefix

Another option I've been made aware of is Gentoo's Project Prefix. Essentially, it installs Gentoo (a distribution of Linux) inside a directory without root privileges. It doesn't work very well on Ubuntu, however due to this bug, but it should work on other systems.

They provide a bootstrap script that you can run that helps you bootstrap the system. It asks you a few questions, and then gets to work compiling everything required (since Gentoo is a distribution that compiles everything from source).

If you have multiple versions of gcc available, try telling it about a slightly older version of GCC if it fails to install.

If you can get it to install, a Gentoo Prefix install allows the installation whatever software you like!

pyenv

The last solution to the problem I'm aware of is pyenv. It automates the process of downloading and compiling specified versions of Python, and also updates you shell automatically. It does require some additional dependencies to be installed though, which could be somewhat awkward if you don't have sudo access to your system. I haven't actually tried it myself, but it may be worth looking into if the other 2 options don't work for you.

Conclusion

There's always more than 1 way to do something, and it's always worth asking if there's a better way if the way you're currently using seems hugely complicated.

Installing Python, Keras, and Tensorflow from source

I found myself in the interesting position recently of needing to compile Python from source. The reasoning behind this is complicated, but it boils down to a need to use Python with Tensorflow / Keras for some natural language processing AI, as Tensorflow.js isn't going to cut it for the next stage of my PhD.

The target upon which I'm aiming to be running things currently is Viper, my University's high-performance computer (HPC). Unfortunately, the version of Python on said HPC is rather old, which necessitated obtaining a later version. Since I obviously don't have sudo permissions on Viper, I couldn't use the default system package manager. Incredibly, pre-compiled Python binaries are not distributed for Linux either, which meant that I ended up compiling from source.

I am going to be assuming that you have a directory at $HOME/software in which we will be working. In there, there should be a number of subdirectories:

  • bin: For binaries, already added to your PATH
  • lib: For library files - we'll be configuring this correctly in this guide
  • repos: For git repositories we clone

Make sure you have your snacks - this was a long ride to figure out and write - and it's an equally long ride to follow. I recommend reading this all the way through before actually executing anything to get an overall idea as to the process you'll be following and the assumptions I've made to keep this post a reasonable length.

Setting up

Before we begin, we need some dependencies:

  • gcc - The compiler
  • git - For checking out the cpython git repository
  • readline - An optional dependency of cpython (presumably for the REPL)

On Viper, we can load these like so:

module load utilities/multi
module load gcc/10.2.0
module load readline/7.0

Compiling openssl

We also need to clone the openssl git repo and build it from source:

cd ~/software/repos
git clone git://git.openssl.org/openssl.git;    # Clone the git repo
cd openssl;                                     # cd into it
git checkout OpenSSL_1_1_1-stable;              # Checkout the latest stable branch (do git branch -a to list all branches; Python will complain at you during build if you choose the wrong one and tell you what versions it supports)
./config;                                       # Configure openssl ready for compilation
make -j "$(nproc)"                              # Build openssl

With openssl compiled, we need to copy the resulting binaries to our ~/software/lib directory:

cp lib*.so* ~/software/lib;
# We're done, cd back to the parent directory
cd ..;

To finish up openssl, we need to update some environment variables to let the C++ compiler and linker know about it, but we'll talk about those after dealing with another dependency that Python requires.

Compiling libffi

libffi is another dependency of Python that's needed if you want to use Tensorflow. To start, go to the libgffi GitHub releases page in your web browser, and copy the URL for the latest release file. It should look something like this:

https://github.com/libffi/libffi/releases/download/v3.3/libffi-3.3.tar.gz

Then, download it to the target system:

cd ~/software/lib
curl -OL URL_HERE

Note that we do it this way, because otherwise we'd have to run the autogen.sh script which requires yet more dependencies that you're unlikely to have installed.

Then extract it and delete the tar.gz file:

tar -xzf libffi-3.3.tar.gz
rm libffi-3.3.tar.gz

Now, we can configure and compile it:

./configure --prefix=$HOME/software
make -j "$(nproc)"

Before we install it, we need to create a quick alias:

cd ~/software;
ln -s lib lib64;
cd -;

libffi for some reason likes to install to the lib64 directory, rather than our pre-existing lib directory, so creating an alias makes it so that it installs to the right place.

Updating the environment

Now that we've dealt with the dependencies, we now need to update our environment so that the compiler knows where to find them. Do that like so:

export LD_LIBRARY_PATH="$HOME/software/lib:${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}";
export LDFLAGS="-L$HOME/software/lib -L$HOME/software/include $LDFLAGS";
export CPPFLAGS="-I$HOME/software/include -I$HOME/software/repos/openssl/include -I$HOME/software/repos/openssl/include/openssl $CPPFLAGS"

It is also advisable to update your ~/.bashrc with these settings, as you may need to come back and recompile a different version of Python in the future.

Personally, I have a file at ~/software/setup.sh which I run with source $HOME/software/setuop.sh in my ~/.bashrc file to keep things neat and tidy.

Compiling Python

Now that we have openssl and libffi compiled, we can turn our attention to Python. First, clone the cpython git repo:

git clone https://github.com/python/cpython.git
cd cpython;

Then, checkout the latest tag. This essentially checks out the latest stable release:

git checkout "$(git tag | grep -ivP '[ab]|rc' | tail -n1)"

Important: If you're intention is to use tensorflow, check the Tensorflow Install page for supported Python versions. It's probable that it doesn't yet support the latest version of Python, so you might need to checkout a different tag here. For some reason, Python is really bad at propagating new versions out to the community quickly.

Before we can start the compilation process, we need to configure it. We're going for performance, so execute the configure script like so:

./configure --with-lto --enable-optimizations --with-openssl=/absolute/path/to/openssl_repo_dir

Replace /absolute/path/to/openssl_repo with the absolute path to the above openssl repo.

Now, we're ready to compile Python. Do that like so:

make -j "$(nproc)"

This will take a while, but once it's done it should have built Python successfully. For a sanity check, we can also test it like so:

make -j "$(nproc)" test

The Python binary compiled should be called simply python, and be located in the root of the git repository. Now that we've compiled it, we need to make a few tweaks to ensure that our shell uses our newly compiled version by default and not the older version from the host system. Personally, I keep my ~/bin folder under version control, so I install host-specific to ~/software, and put ~/software/bin in my PATH like so:

export PATH=$HOME/software/bin

With this in mind, we need to create some symbolic links in ~/software/bin that point to our new Python installation:

cd $HOME/software/bin;
ln -s relative/path/to/python_binary python
ln -s relative/path/to/python_binary python3
ln -s relative/path/to/python_binary python3.9

Replace relative/path/to/python_binary with the relative path tot he Python binary we compiled above.

To finish up the Python installation, we need to get pip up and running, the Python package manager. We can do this using the inbuilt ensurepip module, which can bootstrap a pip installation for us:

python -m ensurepip --user

This bootstraps pip into our local user directory. This is probably what you want, since if you try and install directly the shebang incorrectly points to the system's version of Python, which doesn't exist.

Then, update your ~/.bash_aliases and add the following:

export LD_LIBRARY_PATH=/absolute/path/to/openssl_repo_dir/lib:$LD_LIBRARY_PATH;
alias pip='python -m pip'
alias pip3='python -m pip'

...replacing /absolute/path/to/openssl_repo_dir with the path to the openssl git repo we cloned earlier.

The next stage is to use virtualenv to locally install our Python packages that we want to use for our project. This is good practice, because it keeps our dependencies locally installed to a single project, so they don't clash with different versions in other projects.

Before we can use virtualenv though, we have to install it:

pip install virtualenv

Unfortunately, Python / pip is not very clever at detecting the actual Python installation location, so in order to actually use virtualenv, we have to use a wrapper script - because the [shebang]() in the main ~/.local/bin/virtualenv entrypoint does not use /usr/bin/env to auto-detect the python binary location. Save the following to ~/software/bin (or any other location that's in your PATH ahead of ~/.local/bin):

#!/usr/bin/env bash

exec python ~/.local/bin/virtualenv "$@"

For example:

# Write the script to disk
nano ~/software/bin/virtualenv;
# chmod it to make it executable
chmod +x ~/software/bin/virtualenv

Installing Keras and tensorflow-gpu

With all that out of the way, we can finally use virtualenv to install Keras and tensorflow-gpu. Let's create a new directory and create a virtual environment to install our packages in:

mkdir tensorflow-test
cd tensorflow-test;
virtualenv "$PWD";
source bin/activate;

Now, we can install Tensorflow & Keras:

pip install tensorflow-gpu

It's worth noting here that Keras is a dependency of Tensorflow.

Tensorflow has a number of alternate package names you might want to install instead depending on your situation:

  • tensorflow: Stable tensorflow without GPU support - i.e. it runs on the CPU instead.
  • tf-nightly-gpu: Nightly tensorflow for the GPU. Useful if your version of Python is newer than the version of Python supported by Tensorflow

Once you're done in the virtual environment, exit it like this:

deactivate

Phew, that was a huge amount of work! Hopefully this sheds some light on the maddenly complicated process of compiling Python from source. If you run into issues, you're welcome to comment below and I'll try to help you out - but you might be better off asking the Python community instead, as they've likely got more experience with Python than I have.

Sources and further reading

PhD, Update 7: Just out of reach

Oops! I must have forgotten about writing an entry for this series. Things have been complicated with the current situation, but I've got some time now to talk about what's been happening since my last post about my PhD. Before we continue though, here's a list of all the parts so far:

In this post, there are 2 different distinct areas to talk about. Firstly, the (limited) progress I've made on the Temporal CNN - and secondly the social media content.

Rainfall Radar / Temporal CNN

Things on the Temporal CNN front have been.... interesting. In the last post, I talked about how I was planning to update the model to use a cross-entropy loss function instead of mean squared error. In short, the idea here is to bin the water depth values we want to predict into a number of different categories, and then get the AI to predict which category each pixel belongs in.

The point of this is to allow for evaluating what the model is good at, and what it struggles with more effectively with the help of a confusion matrix.

Unfortunately, after going to a considerable amount of effort, the model hasn't yet been able to learn anything at all when using the cross-entropy loss error function. I've tried a whole array of different things by this point:

  • Using sparse / non-sparse categorical cross-entropy loss (ref)
  • Changing the number of filters in the model
  • Changing the format of the input data
  • Moving from average pooling → max pooling
  • Fiddling with the 2D CNN layer at the end of the model
  • ...and many more things - too many to list or remember here

Unfortunately, none of these have had a meaningful impact on the model's ability to learn anything. Despite this, we haven't run out of ideas yet. My current plan is to rebuild the model based on a known good model.

The known-good model in question is one I built earlier for a talk I did. It's purpose is classifying images, which it does fabulously with the well-known MNIST handwriting digits dataset. It's structure has 1 2D CNN layer, followed by a dense layer that outputs the probabilities as a 1D array:

See above for explanation

I devised 2 learning tasks to test the model with here. The "hard" task, which is to predict the exact digit in the picture, and the "easy" task - which is to predict whether the digit is greater than or equal to 5 or not (this simulates a binary cross-entropy task I've been trying my original model with). The original model works brilliantly with the hard task, gaining an easy 98% accuracy after 12 epochs.

After forking it and then refactoring significantly to decouple its various components, I started to modify the model's structure step by step to more closely match that of the Temporal CNN.

The initial results of this process (which only really got going on Tuesday 23rd February 2021) have been fascinating, as I've been running the MNIST dataset through it in between each step to check that it's still working as intended.

For example, I've discovered that the model has an intense dislike of pooling layers (both average and max). I suspect this might be because I'm not using it correctly, but I discovered that I could only get about 40% accuracy with the pooling layer in place, compared to ~99% without it.

Another thing I've done is removing the dense layer from the model, but this comes with its own set of problems though. The eventual goal is to do what is essentially image-to-video translation, so a key part of this process is to get the model to produce at least 2D tensor as an output instead of a 1D list of predictions for a single pixel.

To simulate this with the MNIST dataset, copied the output prediction. For the "hard" task, I copied the array of probabilities for each category into a cube, with 1 copy of the array for each pixel of the output. I found while doing this though that I got about 22% accuracy - though I suspect that the model was slower to converge than normal and if I'd maybe made the model a bit larger or let it train for longer, I'd be able to improve that somewhat.

It fared much better on the "easy" task though - easily achieving 99% accuracy fairly quickly with just 2 x 2D CNN layers in a row.

With these tests in mind, I'll be continuing the process of tweaking my new model bit by bit to match the original Temporal CNN, with the eventual goal of running my actual dataset through the model.

Social Media

I thought I'd be well into the social media part of my PhD by now, but things have been getting in the way (e.g. life stuff with respect to the current situation, and the temporal cnn being awkward) so I haven't yet been able to make a serious start on the social media side of things yet.

Still, I've been working away at the paperwork. I've now got ethical approval to work with publicly-available social media data (so long as I anonymise it, of course), and I've also been applying to get access to Twitter's new Academic API (which apparently went through successfully, but I'm currently troubleshooting the reason why it's asking me to apply for an account all over again).

I've also been reading a paper or 2, but since most of my energy has been spent elsewhere I have yet to dive seriously into this (I re-discovered the other day a folder full of interesting papers my supervisor sent me, so I'm going to dive into that as soon as I get a moment).

Papers looking into analysing social media data with advanced AI models appear to be in short supply - most papers I've read so far are either talking about analysing longer texts such as newspaper articles, or are using keyword-based and statistics-based methodologies to analyse data.

While this makes for an interesting research gap, I do feel slightly nervous that I've somehow missed something (which I guess I'll find out soon enough after reading some more papers). At any rate, my supervisor and I have some promising ideas and directions to look into moving forwards, so I'm not too worried here. I've also had some interesting discussions with people from the humanities side of my PhD (if you can call it that? I'm not sure what the right terminology is) over potential research questions too, so there's lots of scope here for investigation.

I anticipate that social media data isn't going to be as difficult a dataset to wrangle as the rainfall radar either (it's got to be better than a badly documented propriety binary format), as it's already encoded in JSON - so I'm not expecting I'll need to spend ages and ages writing programs to reformat and parse the data.

Conclusion

Things have been moving slowly recently, due in part to difficulties with the Temporal CNN, and due in part to life in general suddenly becoming rather challenging recently. Things are starting to calm down now though, so I'm starting to have more time to work on my PhD (but it's going to be a number of months yet until things are properly back to normal).

By changing tack with the Temporal CNN, I feel like I'm starting to make some more progress again, and the social media track of my PhD is showing lots of promise even though it's too early to tell exactly what direction I'll be heading in with it.

Hopefully by the time I make another post here in 2 months time, I'll have a working Temporal CNN and a start to the social media side of things - but this seems a tad ambitious based on how things have been going so far.

If you've got any comments or suggestions, do leave them below! I'd love to hear from you.

PhD Update 6: The road ahead

Hey, I'm back early with another post in my PhD series! Turns out there was a bit of mix up last time, and I misplaced update 4 - so I've renamed the duplicate update 4 to update 5. Before we continue, here are all the posts in this series so far:

Earlier today, I had my PhD panel 2. For those reading who don't know, the primary assessments for a PhD (at least in my case) take the form of 6 panels - in which you write a big report, and then your supervisors get together with you and discuss it. The even numbered ones at the end of each year are the more important ones, I'm led to believe - so I was understandably nervous.

Thankfully, it went well, and I ended up writing a report that's a third longer than my undergraduate dissertation at ~9k words O.o Anyway, I wanted to make another post in this series, as the process of writing the report (and research plan) for my panel has made me look at the bigger picture of where my PhD is going, and how it's all going to tie together - and I wanted to share this here.

Temporal CNN

In the last few posts, I've given some insights into the process I've been working through to train a Temporal CNN to predict the output of HAIL-CAESAR. This is starting to reach its conclusion - though there are a number of tasks I have yet to complete to close this chapter of the story. In particular, I've (finally!) got the results from the hyperparameter optimisation I've been doing. Let's check out a heatmap:

The aforementioned heatmap - explained below.

I plotted the above with GNUPlot, but ran into a number of issues (of course) while generating it, which I'd like to briefly discuss here. If you're interested in a more detailed discussion of this, please get in touch (I'll also need to know your real name and where you're working / studying, just in case) via a private communication method - such as my email address on my website.

Firstly, those who are particularly particularly perceptive will notice the rather strange filename for the above image. As indicated, I ran into some instability in my early stopping algorithm. For each epoch, the algorithm I implemented looked at the error from 3 epochs ago - and if it's greater than the error value for the epoch that's just finished it allows training to continue. Unfortunately, due to the nature of the task at hand, this sometimes caused it to stop training too soon. Further analysis of the results revealed that it sometimes allowed it to continue training too long as well (which is reflected in the above chart), but I'm baffled on that one.

For this reason, all hyperparameter combinations that didn't train for quite long enough were omitted in the above graph.

Secondly, there's a large area of the chart that isn't filled in. This is because both increasing the number of filters and the temporal depth increases memory usage - and the area that's unfilled is the area in which it failed to train because it ran out of GPU memory (I used 4 x Nvidia GeForce V100 - thanks very much to my department for providing this!).

Finally, the most important thing to note is that after reviewing other similar models, I discovered that this really isn't the best way to evaluate the performance of this kind of model. Rather, it would be better to consider this as a classification-based task instead. This can be achieved by creating a number of different bins for the water depth values, and assign a probability to each pixel for each water depth bin. The technical name for this is cross-entropy loss as far as I'm aware, which in my case I've been tending to prefix with pixel-based.

In this fashion, I can use things like a confusion matrix (can't find a good explanation of this, so I'll talk about it in a future blog post) to evaluate the performance of the model - which allows me to see the models strengths and weaknesses. What are its strengths? What are its weaknesses? I hope to find out - though I'm a little nervous about it, as I haven't yet looked into how to generate one with Tensorflow.js or tested my initial cross-entropy loss implementation with my kind of dataset yet.

Having a probability-based system like this should also make the output more useful. By this, I mean that having probabilities assigned to different water depths should be easier to interpret than a single value which nobody knows how the model came to that conclusion, or how uncertain the model is about the result.

Let's take look at 1 more set of graphs before we move on:

4 graphs showing the training / validation root mean squared error - see below for explanation.

In this set of 4 graphs, I've taken the training and validation root mean squared error from 4 different combinations of hyperparameters. These rather effectively illustrate the early stopping stability here (top left), and they also demonstrate some instability in the training process (all graphs). I'm guessing the latter is probably caused by insufficient training data - this I can remedy quite easily as I have plenty more besides the 5K time steps I trained on (~1.5M time steps to be precise), but in the interests of time I limited the training dataset so that I could train a variety of different models.

It also shows the shortcomings of the root mean-squared-error approach I've been using so far - is an average error of 2 biased towards deeper or shallower water? What can it predict well, and what does it struggle with?

My supervisor also wants me to write a publication on the Temporal CNN stuff I've done when it's complete, so that's going to be an interesting new experience for me too (I'm both excited and terribly nervous at the same time).

Social Media Analysis

Moving forwards, I'm hoping to complement my Temporal CNN model with some social media analysis (subject to ethical approval of course - filling out the paperwork for this is part of my task list for the rest of this week). If I get the go-ahead, my intention is to bring some natural language processing AI model to the subject of flood mapping using social media - as I've noticed that there's.... limited existing material on this so far.

I have yet to read up properly on the subject, but most recently I've found this paper rather interesting. It uses an unsupervised model to identify the topic(s) a tweet is talking about, which they then follow up with some statistical analysis.

I'm a little fuzzy on how they identify where tweets are talking about (especially since this paper mentions that only a very small percentage of tweets have a geotag attached, and even of these a fraction will be wrong or misleading for 1 reason or another), but the researchers put together a rather nice map by chaining together several statistical techniques techniques I'm currently unfamiliar with. It shows hot spots and cold spots, which indicate where the damage for an earthquake has occurred.

If possible, it would be very nice indeed to plot a flood on a map (stretch goal: in real-time!). I anticipate this to be a key issue I'll need to pay attention to.

In terms of my most immediate starting point, I'm going to be doing some more reading on the subject, and then have a chat with my supervisor about the next steps (she's particularly knowledgable about natural language processing :D).

Conclusion

In conclusion, I'm starting to come to the end of the Temporal CNN chapter of my PhD, although I suspect it's going to keep coming back over and over again to say hello. Moving forwards, I'm hoping to complement work I've done so far with some social media analysis (subject to ethical approval) using AI-based Natural Language Processing - which will probably consist of improving an existing model or something more directly computer science related (my existing work is classified more as a contribution to environmental sciences, apparently).

Look out for more posts in this series in the future, as I'm sure I'll have plenty to talk about soon (I might do another one in a month's time, or it might end up being nearer 2 months depending on circumstances).

PhD Update 5: Hyper optimisation and frustration

Hello there again! It's been longer than I anticipated since the last proper post in this series. Before I continue, here's a list of all the (proper) posts in this series so far:

I've haven't managed to get as much done since last time as I was hoping (partly due to the fact that I'm currently having to work from home, which is more challenging than I expected), but I have finished my implementation of the Temporal CNN, and am now working on hyperparameter optimisation. I've also fixed a number of issues in my rainfall radar data downloader and processing programs - which I'll talk about in more detail below

HAIL-CAESAR and the iterative improvements

Someone at the University recently approached me (if you are reading this and have a blog, comment below and I'll update) to ask if they could use my rainfall radar data downloader program to download some rainfall radar data for their project. Naturally, I helped them out. This turned out to be a great thing for me as well, as with their help I managed to uncover a number of very nasty issues with the data pipeline I had been building up to that point:

  • The hydro index file that HAIL-CAESAR uses was completely scrambled
  • The date on the data downloaded was a month out
  • The data downloaded was (and still is) rotated by 90° on disk
  • The data was out by a factor of 32

While fixing each of these bugs was a (relatively) simple process, I can't help but wonder how they managed to escape my notice until (for all but 1 of them) someone else told me about them.

The other issue was that because of the amount of data I'm working with, it took forever to re-run the program to test to see if I had managed to fix the problem - and if I had, I'd encounter another problem. This long iteration process makes implementing a new feature or fixing a bug a very time-consuming process.

Despite fixing all these issues, I'm still experiencing issues with my latest refactoring of the rainfall radar data downloader (namely a hang in the event system when reading tar files). My current thinking is that I'm going to completely reimplement it (using snippets from the old programif I need to use it again in the future, as it is currently neither particularly efficient (it's single-threaded) nor easy to bugfix (it's pretty complicated).

I've got an idea for a parallel system that processes each tar file separately first, and then only after all the tar files have been converted separately are they strung together into the actual files the existing implementation spits out currently.

Temporal CNN delight

Last time, I had just started my implementation of a Temporal CNN. This is now pretty much complete, and I've also been able to run it and get some results! Check out this graph:

A graph showing the root mean squared error while training my Temporal CNN implementation - more details below.

This graph shows the root mean squared error when training on 1000 time steps of data (about 3 days 11 hours or so). Epochs are along the X axis, and the root mean squared error is on the Y axis.

A few things to note here. Firstly, the implementation I've come up with essentially does video-to-image translation. The original model in the paper I've linked to is demonstrates a classification task (specifically land use over time) - so what I'm doing is a little different.

Secondly, I've omitted the root mean squared error for the first epoch. It was so high that it made the rest of the graph impossible to see - hence the omission.

I'm pretty pleased with this result so far - as I have a nice downwards curve indicating that the model is (probably) learning something useful.

I am still rather nervous about the output though, as due to the way I've implemented the network I haven't actually been able to 'see' the output of the network at all as an image yet. Doing so would take a while to implement, so I haven't done so for now (although I really should do this soon). It would be really cool to see a short video (maybe at ~10fps) of the network output as the epochs move forwards to visualise the network training process.

Hyperparameter frustrations

Lastly, at the suggestion of my supervisor I've been working on hyperparameter optimisation. In short, this consists of training the model with random combinations of hyperparameters and seeing which ones work best.

A hyperparameter is a tunable parameter that controls an aspect of a model. In my case, I have 2 key hyperparameters I need to tune:

  • Filter count: CNN layers in Tensorflow.js have a filter count associated with them. I theorise that increasing this will increase the model's ability to learn spatial information.
  • Temporal depth: The number of time steps to push through the model at once. Increasing this will allow the model to make predictions based on events that occur further in the past.

My eventual aim here is to create a heatmap that has the above hyperparameters along the X and Y axes, and the colour showing the accuracy of the model that was trained - similar to the one I created previously.

To do this, I implemented a program that tries random combinations of hyperparameters - but never the same combination twice. It starts the model in a subprocess and passes the chosen filter count and temporal depth values in as CLI arguments, which the child process picks up, parses, and then trains a model based on. This CLI is the same one as the one I developed that generated the above graph in the previous section of this blog post.

This approach has the advantage that it isolates the model in a subprocess, so when the subprocess exits and a new one spawns for the next combination of hyperparameters, the environment is completely clean and there isn't anything that might interfere with it.

Unfortunately though, while I set off a run of this implementation before I took a 'holiday' - and even checked on it to ensure it was running as expected (multiple times) - it still managed to crash when I wasn't looking.

After some debugging, I discovered that problem was because the model ran out of memory while training. This was something I had expected - and used the --unhandled-rejections=strict option for Node.js, which tells Node.js to crash and exit when an UnhandledPromiseRejection is thrown - like this one:

2020-08-04 15:31:26.174395: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at conv_grad_ops_3d.cc:1783 : Resource exhausted: OOM when allocating tensor with shape[8,2,104,348,210] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
(node:62355) UnhandledPromiseRejectionWarning: Error: Invalid TF_Status: 8
Message: OOM when allocating tensor with shape[8,2,104,348,210] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    at Object.<anonymous> (<anonymous>)
.....

Unfortunately though, I used this flag on the parent process (that drives the hyperparameter optimisation) and not the child process - leading to a situation whereby the child process crashed due to the aforementioned error and just hangs around doing nothing. Even more frustatingly, the solution si as simple as doing a quick export NODE_OPTIONS="--unhandled-rejections=strict" before running the hyperparameter optimisation program to ensure that the flag propagates to the child processes......

Very frustrating indeed - especially considering I calculate that it will take multiple weeks to gather enough data to create a meaningful heatmap.

Conclusion

Reading back over this post, I have got more done than I expected. I've started and finished my Temporal CNN implementation, and fixed lots of bugs in them existing code.

However, the long iteration times to test code I've written (despite using a small slice of the dataset to test with), the large datasets I'm working with, having work on a system remotely via SSH by pushing and pulling code with git (many times), working from home all the time, and the continued bugs I've been facing and will likely continue to face have caused and are causing significant unexpected slowdowns moving forwards.

At least the VPN is no longer dropping out every 5 minutes!

Sources and Further Reading

Partitioning and mounting a new disk using LVM

As I've been doing my PhD, I've been acquiring quite a lot of data that needs storing. To this end, I have acquired a new 2 TiB hard drive in my Lab PC. Naturally, this necessitates formatting it so that I can use it. Since I've been using LVM (Logical Volume Management) for my OS disk - so I decided to use it for my new disk too.

Unfortunately, I don't currently have GUI access to my Lab PC - instead for the past few months I've been limited to SSH access (which is still much better than nothing at all), so I can't really use any GUI tool to do this for me.

This provided me with a perfect opportunity to get into LVM through the terminal instead. As it turns out, it's not actually that bad. In this post, I'm going to take you through the process of formatting a fresh disk: from creating a partition table to mounting the LVM logical volume.

Let's start by partitioning the disk. For this, we'll use the fdisk CLI tool (install it on Debian-based systems with sudo apt install fdisk if it's not available already). It should be obvious, but for this tutorial root access is required to execute pretty much all the commands we'll be using.

Start fdisk like so:

sudo fdisk /dev/sdX

Replace X with the index of your disk (try lsblk - no sudo required - to disk your disks). fdisk works a bit like a shell. You enter letters (or short sequences) followed by hitting enter to give it commands. Enter the following sequence of commands:

  • g: Create new GPT partition table
  • n: Create new partition (allow it to fill the disk)
  • t: Change partition type
    • L: List all known partition types
    • 31: Change to a Linux LVM partition
  • p: Preview final partition setup
  • w: Write changes to disk and exit

Some commands need additional information - fdisk will prompt you here.

With our disk partitioned, we now need to get LVM organised. LVM has 3 different key concepts:

  • Physical Volumes: The physical disk partitions it should use as a storage area (e.g. we created an appropriate partition type above)
  • Volume Groups: Groups of 1 or more Physical Volumes
  • Logical Volumes: The volumes that you use, format, and mount - they are stored in Volume Groups.

To go with these, there are also 3 different classes of commands that LVM exposes: pv* commands for Physical Volumes, vg* for Volume Groups, and lv* for Logical Volumes.

With respect to Physical Volumes, these are physical partitions on disk. For legacy MSDOS partition tables, these must have a partition type of 8e. For newer GPT partition tables (such as the one we created above), these need the partition id 31 (Linux LVM) - as described above.

Next, we create a new volume group that holds our physical volume:

sudo vgcreate vg-pool-name /dev/sdXY

....replacing /dev/sdXY with the partition you want to add (again, lsblk is helpful here). Don't forget to change the name of the Volume Group to something more descriptive than vg-pool-name - though keeping the vg prefix is recommended.

If you like, you can display the current Volume Groups with the following command:

sudo vgdisplay

Then, create a new logical volume that uses all of the space in the new volume group:

sudo lvcreate -l 100%FREE -n lv-rocket-blueprints vg-pool-name

Again, replace vg-pool-name with the name of your Volume Group, and lv-rocket-blueprints with the desired name of your new logical volume. tldr (for which I review pull requests) has a nice page on lvcreate. If you have a tldr client installed, simply do this to see it:

tldr lvcreate

With our logical volume created, we can now format it. I'm going to format it as ext4, but you can format it as anything you like.

sudo mkfs.ext4 /dev/vg-pool-name/lv-rocket-blueprints

As before, replace vg-pool-name and lv-rocket-blueprints with your Volume Group and Logical Volume names respectively.

Finally, mount the newly formatted partition:

sudo mkdir /mnt/rocket-blueprints
sudo mount /dev/vg-extra-data/lv-rocket-blueprints /mnt/rocket-blueprints

You can mount it anywhere - though I'd recommend mounting it to somewhere in /mnt.

Auto-mounting LVM logical volumes

A common thing many (myself included) want to do is automatically mount an LVM partition on boot. This is fairly trivial to do with /etc/fstab.

To start, find the logical volume's id with the following command:

sudo blkid

It should be present as UUID="THE_UUID_HERE". Pick out only the UUID of the logical volume you want to automount here. As a side note, using the UUID is generally a better idea than the name, because the name of the partition (whether it's an LVM partition or a physical /dev/sdXY partition) might change, while the UUID always stays the same.

Before continuing, ensure that the partition is unmounted:

sudo umount /mnt/rocket-blueprints

Now, edit /etc/fstab (e.g. with sudo nano /etc/fstab), and append something like this following to the bottom of the file:

UUID=THE_UUID_YOU_FOUND_HERE    /mnt/rocket-blueprints  ext4    defaults,noauto 0   2

Replace THE_UUID_YOU_FOUND_HERE with the UUID you located with blkid, and /mnt/rocket-blueprints with the path to where you want to mount it to. If an empty directory doesn't already exist at the target mount point, don't forget to create it (e.g. with sudo mkdir /mnt/rocket-blueprints).

Save and close /etc/fstab, and then try mounting the partition using the /etc/fstab definition:

sudo mount /mnt/rocket-blueprints

If it works, edit /etc/fstab again and replace noauto with auto to automatically mount it on boot.

That's everything you need to know to get up and running with LVM. I've included my sources below - in particular check out the howtogeek.com tutorial, as it's not only very detailed, but it also has a cheat sheet containing most of the different LVM commands that are available.

Found this useful? Still having issues? Got a suggestion? Comment below!

Sources and further reading

Ensuring a Linux machine's network connection stays up with Bash

Recently, I had the unpleasant experience of my Lab machine at University dropping offline. It has a tendency to do this randomly - and normally I'd just reboot it myself, but since I'm working from home at the moment it meant that I couldn't go in to fix it. This unfortunately meant that I was stuck waiting for a generous technician to go in and reboot it for me.

With access now restored I decided that I really didn't want this to happen again, so I've written a simple Bash script to resolve the issue.

It works by checking for an Internet connection every hour by pinging starbeamrainbowlabs.com - and if it doesn't manage to do so successfully, then it will reboot. A simple concept, but I discovered a number of things that needed considering while writing it:

  1. To avoid detecting transient network issues, we should make multiple attempts before giving up and rebooting
  2. Those multiple attempts need to be delayed to be effective
  3. We mustn't reboot more than once an hour to avoid getting into a 'reboot loop'
  4. If we're running an experiment, we need a way to temporarily delay it from doing it's checks that will resume automatically
  5. We could try and diagnose the network error or turn the networking of and on again, but if it gets stuck halfway through then we're locked out (very undesirable) - so it's easier / safer to just reboot

With these considerations in mind, I came up with this: ensure-network.sh (link to part of a GitHub Gist, as it's quite long)

This script requires Bash version 4+ and has a number of environment variables that can configure its behaviour:

Environment Variable Description
CHECK_EXTERNAL_HOST The domain name or IP address to ping to check the connection
CHECK_INTERVAL The interval to check the connection in seconds
CHECK_TIMEOUT Wait at most this long for a reply to our ping
CHECK_RETRIES Retry this many times before giving up and rebooting
CHECK_RETRY_DELAY Delay this many seconds in between retries
CHECK_DRY_RUN If true, then don't actually reboot (useful for testing)
CHECK_REBOOT_DELAY Leave at least this many minutes in between reboots
CHECK_POSTPONE_FILE If this file exists and has a recent last-modified time (mtime), don't actually reboot
CHECK_POSTPONE_MAXAGE The maximum age in minutes of the CHECK_POSTPONE_FILE to consider it fresh and avoid rebooting

With these environment variables, it covers all 4 points in the above list. To expand on CHECK_POSTPONE_FILE, if I'm running an experiment for example and I don't want it to reboot in the middle of said experiment, then I can simply run touch /path/to/postpone_file to delay network connection-related reboots for 7 days (by default). After this time, it will automatically start rebooting again if it drops off the network. This ensures that it will always restart monitoring eventually - as if I had a more manual system I'd forget to re-enable it and then loose access.

Another consideration is that the /var/cache directory must exist. This is because an empty tracking file is created there to keep track of when the last network connection-related reboot occurred.

With the script written, then next step is to have it run automatically on boot. For systemd-based systems such as my lab machine, a systemd service is the order of the day. This is relatively simple:

[Unit]
Description=Reboot if the network connection is down
After=network.target

[Service]
Type=simple
# Because it needs to be able to reboot
User=root
Group=root
EnvironmentFile=-/etc/default/ensure-network
ExecStartPre=/bin/sleep 60
ExecStart=/bin/bash "/usr/local/lib/ensure-network/ensure-network.sh"
SyslogIdentifier=ensure-access
StandardError=syslog
StandardOutput=syslog

[Install]
WantedBy=multi-user.target

(View the latest version in the GitHub Gist)

This assumes that the ensure-network.sh script is located at /usr/local/lib/ensure-network/ensure-network.sh. It also allows for an environment file to optionally be created at /etc/default/ensure-network, so that you can customise the parameters. Here's an example environment file:

CHECK_EXTERNAL_HOST=example.com
CHECK_INTERVAL=60

The above example environment file checks against example.com every minute instead of the default starbeamrainbowlabs.com every hour. You can, of course, specify any (or all) of the environment variables detailed above in the environment file if you wish.

That completes my setup - so hopefully I don't encounter any more network-related issues that lock me out of accessing my lab machine remotely! To install it yourself, you can do this:

# Create the directory for the script to live in
sudo mkdir /usr/local/lib/ensure-network
# Download the script & service file
sudo curl -L -O /usr/local/lib/ensure-network/ensure-network.sh https://gist.githubusercontent.com/sbrl/08e13f2ceedafe35ac7f8dbdfb8bfde7/raw/cc5ab4226472c08b09e448a257256936cc749193/ensure-network.sh
sudo curl -L -O /etc/systemd/system/ensure-network.service https://gist.githubusercontent.com/sbrl/08e13f2ceedafe35ac7f8dbdfb8bfde7/raw/adf5ed4009b3e1a09f857936fceb3581897072f4/ensure-network.service
# Start the service & enable it on boot
sudo systemctl daemon-reload
sudo systemctl start ensure-network.service
sudo systemctl enable ensure-network.service

You might need to replace the URLs there with the latest ones that download the raw content from the GitHub Gist.

Did you find this useful? Got a suggestion to make it better? Running into issues? Comment below!

PhD Update 4: Ginormous Data

Hello again! In the last PhD update blog post, I talked about patching HAIL-CAESAR to improve performance and implementing a Temporal Convolutional Neural Net (Temporal CNN).

Since making that post, I've had my PhD Panel 1 (very useful, thanks to everyone who was on that panel!). I've also got an initial - albeit untested - implementation of a Temporal CNN. I've also been wrangling lots of data in more ways than one. I'm definitely seeing the Big Data aspect of my project title now.

HAIL-CAESAR

I ran HAIL-CAESAR initially at 50m per pixel. This went ok, and generated lots of data out, but in 2 weeks of real time it barely hit 43 days worth of simulation time! The other issue I discovered due to the way I compressed the output of HAIL-CAESAR, for some reason it compressed the output files before HAIL-CAESAR had finished writing to them. This resulted in the data being cut off randomly in the output files for each time step.

Big problem - clearly another approach is in order.

To tackle these issues, I've done several things. Firstly, I patched HAIL-CAESAR again to support writing the output water depth files to the standard output. As a refresher, they are actually identical in format to the heightmap, which looks a bit like this:

ncols 4
nrows 3
xllcorner 400000
yllcorner 300000
cellsize 1000
1 2 3 4
1 1 2 3
0 1 1 2

The above is a 4x3 grid of points, with the bottom-left corner being at (400000, 300000) on the Ordnance Survey National Grid (I know, latitude / longitude would be so much better, but all the data I'm working with is on the OS national grid :-/). Each point represents a 1km square area.

To this end, I realised that it doesn't actually matter if I concatenate multiple files in this format together - I can guarantee that I can tell them apart. As soon as I detect a metadata line that the current file has already declared, then I know that the next file is starting and we're starting to read the next file along. To this end, I implemented a new Terrain50.ParseStream() function that is an async generator that will take a stream, and then iteratively yield the Terrain50 instances it parses out of the stream. In this way, I can split 1 big continuous stream back up again into the individual parts.

By patching HAIL-CAESAR such that it outputs the data in 1 continuous stream, it also means that I can pipe it to a single compression program. This has 2 benefits:

  • It avoids the "compressing the individual files before HAIL-CAESAR is ready" problem (the observant might note that inotifywait would solve this issue neatly too, but it isn't installed on Viper)
  • It allows for more efficient compression, as the compression program can use data from other time step files as context

Finding a compression tool was next. I wanted something CPU efficient, because I wanted to ensure that the maximum number of CPU cycles were dedicated to HAIL-CAESAR for running the simulation, rather than compressing the output it generates - since it is the bottleneck after all.

I ended up using lz4 in the end, an extremely fast compression algorithm. It compiles easily too, which is nice as I needed to compile it from source automatically on Viper.

With all this in place, I ran HAIL-CAESAR again 2 more times. The first run was at the same resolution as before, and generated 303 GiB (!) of data.

The second run was at 500m per pixel (10 times lower resolution), which generated 159 GiB (!) of data and, by my calculations, managed to run through ~4.3 years in simulation time in 5 days of real time. Some quick calculations suggest that to get through all 13 years of rainfall radar data I have it would take just over 11 days, so since I've got everything setup already, I'm going to be contacting the Viper administrators to ask about running a longer job to allow it to complete this process if possible.

Temporal CNN Preprocessing

The other major thing I've been working on since the last post is the Temporal CNN. I've already got an initial implementation setup, and I'm currently in the process of ironing out all the bugs in it.

I ran into a number of interesting bugs. One of these was to do with incorrectly specifying the batch size (due to a typo), which resulted in the null values you may have noticed in the model summary in the last post. With those fixed, it looks much more sensible:

_________________________________________________________________
Layer (type)                 Output shape              Param #   
=================================================================
conv3d_1 (Conv3D)            [32,2096,3476,124,64]     16064     
_________________________________________________________________
conv3d_2 (Conv3D)            [32,1046,1736,60,64]      512064    
_________________________________________________________________
conv3d_3 (Conv3D)            [32,521,866,28,64]        512064    
_________________________________________________________________
pooling (AveragePooling3D)   [32,521,866,1,64]         0         
_________________________________________________________________
reshape (Reshape)            [32,521,866,64]           0         
_________________________________________________________________
conv2d_output (Conv2D)       [32,517,862,1]            1601      
_________________________________________________________________
reshape_end (Reshape)        [32,517,862]              0         
=================================================================
Total params: 1041793
Trainable params: 1041793
Non-trainable params: 0
_________________________________________________________________

This model is comprised of the following:

  • 3 x 3D convolutional layers
  • 1 x pooling layer to average out the temporal dimension
  • 1 x reshaping layer to remove the redundant dimension
  • 1 x 2D convolutional layer that will produce the output
  • 1 x reshaping layer to remove another redundant dimension

I'll talk about this model in more detail in a future post in this series once I've managed to get it running and I've played around with it a bit.

Another significant one I ran into was to do with stacking tensors like an image. I ended up asking on Stack Overflow: How do I reorder the dimensions of a rank 3 tensor in Tensorflow.js?

The input to the above model is comprised of a sliding window that moves along the rainfall radar time steps. Each time step contains a 2D array, representing the amount of rain that has fallen in a given area. This needs to be combined with the heightmap, so that the AI model knows what the terrain that the rain is falling on looks like.

The heightmap doesn't change, but I'm including a copy of it with every rainfall radar time step because of the way the 3D convolutional layer works in Tensorflow.js. 2D convolutional layers in Tensorflow.js, for example, take in a 2D array of data as a tensor. They can also take in multiple channels though, much like pixels in an image. The pixels in an image might look something like this:

R1 G1 B1 A1 R2 G2 B2 A2 R3 G3 B3 A3 .....

As you might have seen in the Stack Overflow answer I linked to above, Tensorflow.js does support stacking multiple 2D tensors in this fashion. It is unfortunately extremely slow however. It is for this reason that I've been implementing a multi-process program to preprocess the data to do this stacking in advance.

As I'm writing this though, I've finally understood what the dataFormat option is for in the conv3d and conv2d layers is for, and I think I might have been barking up the wrong tree......

What's next

From here, I'm going to investigate that dataFormat option for the TemporalCNN - it would hugely simplify my setup and remove the need for me to preprocess the data beforehand, since stacking tensors directly 1 after another is very quick - it's just stacking them along a different dimension that's slow.

I'm also hoping to do a longer run of that 500m per pixel HAIL-CAESAR simulation. More data is always good, right? :P

After I've looked into the dataFormat option, I'd really like to get the Temporal CNN set off training and all the bugs ironed out. I'm so close I can almost taste it!

Finally, if I have time, I want to go looking for a baseline model of sorts. By this, I mean an existing model that's the closest thing to the task I'm doing - even though they might not be as performant or designed for my specific task.

Found this interesting? Got a suggestion of something I could do better? Confused about something I've talked about? Comment below!

PhD Aside: Reading a file descriptor line-by-line from multiple Node.js processes

Phew, that's a bit of a mouthful. We're taking a short break from the cluster series of posts (though those will be back next week I hope), because I've just run into a fascinating problem, the solution to which I thought I'd share here - since I didn't find a solution elsewhere on the web.

For my PhD, I've got a big old lump of data, and it all needs preprocessing before I train an AI model (or a variant thereof, since I'm effectively doing video-to-image translation). Unfortunately, one of the preprocessing steps is really slow. And because I'll naturally be training my AI for multiple epochs, the problem is multiplied.....

The solution, of course, is to do all the preprocessing up front such that I can just read the data in and push it directly into a Tensor in the right format. However, doing this on such a large dataset would take forever if I did the items 1 by 1. The thing is that Javascript isn't inherently multithreaded. I like this quote, as it describes the situation rather well:

In Javascript everything runs in parallel... except your code

--Felix Geisendörfer

In other words, when Node.js is reading or writing to and from the network, disk, or other places it can do lots of things at the same time because it does them asynchronously. The Javascript that gets executed though is only done on a single thread though.

This is great for io-bound tasks (such as a web server), as Node.js (a Javascript runtime) can handle many requests at the same time. On a side note, this is also the reason why Nginx is more efficient than Apache (because Nginx is event based too like Javascript, unlike Apache which is thread based).

It's not so great though for CPU bound tasks, such as the one I've got on my hands. All is not lost though, because Node.js has a number of useful functions inbuilt that we can use to tackle the issue.

Firstly, Node.js has a clever forking system. By using child_process.fork(), a single Node.js process can create multiple copies of itself to act as workers:

// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i &lt; os.cpus().length; i++) {
    workers.push(
        child_process.fork("worker.mjs")
    );
}
// worker.js
console.log(`Hello, world from a child process!`);

Very useful! The next much more sticky problem though is how to actually preprocess the data in a performant manner. In my specific case, I'm piping the data in from a shell script that decompresses a number of gzip archives in a specific order (as of the time of typing I have yet to implement this).

Because this is a single pipe we're talking about here, the question now arises of how to allow all the child processes to access the data that's coming in from the standard input of the master process.

I've actually encountered an issue like this one before. I initially tried reading it in on the master process, and then using worker.send(message) to send it to the worker processes for processing. This didn't end up working very well, because the master process became a bottleneck as it couldn't read from the standard input and send stuff to the workers fast enough.

With this in mind, I came up with a new plan. In Node.js, when you're forking to create a worker process, you can supply it with some custom file descriptors upon initialisation. So long as it has at least IPC (inter-process communication) channel for passing messages back and forth with the .send() and .on("message", (message) => ....) method and listeners, it doesn't actually care what you do with the others.

Cue file descriptor cloning:


// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i 

I've highlighted the key line here (line 10 for those who can't see it). Here we tell it to clone file descriptors 0, 1, and 2 - which refer to stdin, stdout, and stderr respectively. This allows the worker processes direct access to the master process' stdin, stdout, and stderr.

With this, we can read from the same pipe with as many worker processes as we like - so long as they do so 1 at a time.

With this sorted, it gives rise to the next issue: reading line-by-line. Packages exist on npm (such as nexline, my personal favourite) to read from a stream line-by-line, but they have the unfortunate side-effect of maintaining a read buffer. While this is great for performance, it's not so great in my situation because it ends up scrambling the input! This is because said read buffer would be local to each worker process, so when the next worker along reads, it will skip a random number of bytes and start reading from the next bit along.

This means that I need to implement a custom method that reads a single line from a given file descriptor without maintaining a read buffer. I came up with this:

import fs from 'fs';

//  .....

// Global buffer to avoid unnecessary memory churn
let buffer = Buffer.alloc(4096);
function read_line_unbuffered(fd) {
    let i = 0;
    while(true) {
        let bytes_read = fs.readSync(fd, buffer, i, 1);
        if(bytes_read !== 1 || buffer[i] == 0x0A) {
            if(i == 0 && bytes_read == null) return null;
            return buffer.toString("utf-8", 0, i); // This is not inclusive, so we can abuse it to trim the \n off the end
        }

        i++;
        if(i == buffer.length) {
            let new_buffer = new Buffer(Math.ceil(buffer.length * 1.5));
            buffer.copy(new_buffer);
            buffer = new_buffer;
        }
    }
}

I read from the given file descriptor character by character directly into a buffer. As soon as it detects a new line character (\n, or character code 0x0A), it returns the new line. If we run out of space in the buffer, then we create a new larger one, copy the old buffer's contents into it, and keep going.

I maintain a global buffer here, because this helps to avoid unnecessary memory churn. In my case, the lines I'm reading in a rather long (hence the need to clone the file descriptor in the first place), and if I didn't keep a shared buffer I'd be allocating and deallocating a new pretty large buffer every time.

This also has the nice side-effect that we keep the largest buffer we've had to use so far around for next time, avoiding the need for subsequent copies to larger and larger buffers.

Finally, we can also guarantee that it won't be a problem if we call this multiple times, because as I explained above Javascript is single-threaded, so if we call the function multiple times in quick succession each read will happen 1 after another.

With this chain of Node.js features, we can read a large amount of data from and efficiently process the content of a pipe. The trick from here is to implement a proper messaging and locking system to avoid reading from the stream at the same time, and avoid write to the standard output at the same time.

Taking this further, I ended up with this:

(Licence: Mozilla Public Licence 2.0)

This correctly ensures that only 1 worker process reads from the stream at the same time. It doesn't do anything with the result though except log a message to the console, but when I implement that I'll implement a similar messaging system to ensure that only 1 process writes to the output at once.

On that note, my data is also ordered, so I'll have to implement a complicated cache system // ordering system to ensure that I write them to the standard output in the same order I read them in. When I do implement that, I'll probably blog about that too....

The main problem I still have with this solution is that I'm reading from the input stream. I haven't done any proper testing, but I'm pretty sure that doing so will be really slow. I not sure I can avoid this though and read a few KiBs at a time, because I don't currently know of any way to put the extra characters back into the input stream.

If anyone has a solution to that that increases performance, I'd love to know. Leave a comment below!

Art by Mythdael