PhD Update 14: An old enemy
Hello again! This post is rather late due to one thing and another, but I've finally gotten around to writing it. In the last post, I talked about the CLIP model I trained to predict sentiment using both twitter and their associated images in pairs, and the augmentation system I devised to increase the size of the dataset. I also talked about the plan for a next-generation rainfall radar model, and a journal article I'm writing.
Before we begin though, let's start with the customary list of previous posts:
- PhD Update 1: Directions
- PhD Update 2: The experiment, the data, and the supercomputers
- PhD Update 3: Simulating simulations with some success
- PhD Update 4: Ginormous Data
- PhD Update 5: Hyper optimisation and frustration
- PhD Update 6: The road ahead
- PhD Update 7: Just out of reach
- PhD Update 8: Eggs in Baskets
- PhD Update 9: Results?
- PhD Update 10: Sharing with the world
- PhD Update 11: Answers to our questions
- PhD Update 12: Is it enough?
- PhD Update 13: A half complete
Since that last post, I've pretty much finished my initial draft of the journal article - though it is rather overlength, and I've also made a significant start on the rainfall radar model, which is what I will be focusing on in this blog post as there isn't all that much to talk about with the journal article at the moment (I'm unsure how much I'm allowed to share). I will make a separate post when I (finally) publish the journal article.
Rainfall radar model, revisited
As you might remember, I have dealt with rainfall radar data before (exhibit A, B, C, D), and it didn't go too well. After the part of my PhD on social media, I have learnt a lot about AI models and how to build them. I have also learnt a lot about data preprocessing. With all this in hand, I am now better equipped to do battle once more with an old enemy: the 1.5M time step rainfall radar dataset.
For those who are somewhat confused, the dataset in question is in 2 dimensions (i.e. like greyscale images). It is comprised of 3 things:
- A heightmap
- Rainfall radar data every 5 minutes
- Water depth information, calculated by HAIL-CAESAR and binarised to water / no water for each pixel with a simple threshold
Given that the rainfall radar dataset has an extremely restrictive licence, I am unfortunately unable to share sample images from the dataset here.
My first objective was to tame the beast. To do this, I needed to convert the data to
.tfrecord.gz files (applying all the preprocessing transformations ahead of time) instead of the split
.jsonl.gz files I was using. At first, I thought I could use a
TextLineDataset (it even supports reading from gzipped files!), but the snag here is that Tensorflow does not have a JSON parsing function.
The reason this is a problem is due to the new way I am parsing my dataset. Before, I used
tf.data.Dataset.from_generator() and a regular Python function, but I have since discovered that there is a much more efficient way of doing things. The key revelation here was that Tensorflow does not just simply execute e.g. your custom layers you implement and call
.call() each time. No, instead it calls it once and constructs a graph of operations, before then compiling this into machine code that the GPU can understand. The implication of this is twofold:
- It is significantly more efficient to take advantage of Tensorflow's execution graph functionality where available
- Once your (any part of) dataset becomes a Tensor, it must stay a Tensor
This not only goes for custom layers, loss functions, etc, but it also goes for the dataset pipeline too! I strongly recommend using the
.map() function on
tf.data.Dataset with a
.from_generation() if you can possibly help it!
To take advantage of this, I needed to convert my dataset to a set of
tfrecord npm package to write
tfrecord-stream and more, but none of them worked. In the end, I settled on a multi-step process:
- Convert split data into
.jsonl.gzfiles, 4K records per file. Do all preprocessing / correction steps here.
- Make all records unique: hash all records in all files, mark records for deletion, then delete them from files
.jsonl.gzfiles to 4K records per file
.tfrecord.gzwith Python child processes managed by Node.js
Overcomplicated? Perhaps. Do I have a single command I can execute to do all of this? Nope! Does it work? Absolutely :P
With the data converted I turned my attention to the model itself. As I have discussed previously, my current hypothesis is that the previous models failed because the relationship between the rainfall radar and water depth data is non-obvious (and that the model designs were terrible. 5K parameters? hahahaha, 5M parameters is probably the absolute minimum I would need). To this end, I will be first training a contrastive learning model to find relationships between the dataset items. Only then will I train a model to predict water depth, which I'll model as an image segmentation task (I have yet to find a segmentation decoder to implement, so suggestions here are welcome).
The first step here is to implement the contrastive learning algorithm. This is non-trivial however, so I implemented a test model using images from Reddit (r/cats, r/fish, and r/dogs) to test it and test the visualisations that I will require to determine the effectiveness of the model. In doing this, I found that the algorithm for contrastive learning in the CLIP paper (Learning Transferable Visual Models From Natural Language Supervision) was wrong and completely different to that which is described in the code, and I couldn't find the training loop or core loss function at all - so I had to piece together something from a variety of different sources.
To visualise the model, I needed a new approach. While the loss function value over time plotted on a graph is useful, it's difficult to tell if the resulting embedded representation the model outputs is actually doing what it is supposed to. There Reading online, there are 2 ways of visualising embedding representations I've found:
- Dimensionality reduction
- Parallel coordinates plot
I can even include here a cool plot that demonstrates both of them with the pretrained CLIP model I used in the social media half of my project:
The second one is the easier to explain so I'll start with that. If you imagine that the output of the model is of shape
[ batch_size, embedding_dim ] /
[ 64, 200 ], then for every record in the dataset we can plot a line across a set of vertical lines, where each vertical line stands for each successive point in the dataset. This is what I have done in the plot on the right there.
The plot on the left uses the UMAP dimensionality reduction algorithm (paper), which to my knowledge is the best dimensionality reduction algorithm out there at the moment. For the uninitiated, a dimensionality reduction algorithm takes a vector with many dimensions - such one with an embedding dimension of size 200 - and converts it into a lower-dimensional value (e.g. in 2 or 3 dimensions most commonly) so that it can be plotted and visualised. This is particularly helpful in AI when you want to check if your model is actually doing what you expect.
I took some time to look into this, as there are a number of other algorithms out there and it seems like it's far too easy to pick the wrong one for the task. In short, there are 3 different algorithms you'll see most often:
- PCA: Stands for Principled Component Analysis, and while popular it does not support non-linear transformations, which is most AI models.
- tSNE: A non-linear alternative (designed for AI applications, in part) that is also rather popular. It does not preserve the global structure of the dataset (i.e. relationships and distances between different values) very well though.
- UMAP: Stands for Uniform Manifold Approximation and Projection. It is designed as an alternative to tSNE and preserves global structure much better.
Sources for this are at the end of this post. If you're applying PCA or tSNE for dimensionality reduction in an AI context, consider switching it out to UMAP.
In the plot above, it is obvious that the pretrained CLIP model can differentiate between the 3 types of pet that I gave it as a test dataset. The next step was to train a model with the contrastive learning and the test dataset.
To do this, I needed an encoder. In the test, I used [ResNetV2](), which is apparently an improved version of the ResNet architecture (I have yet to read the paper on it). Since I implemented it though, I discovered an implementation of the state-of-the-art image encoder [ConvNeXt]() that I discovered recently, so I'm using that in the main model. See my recent post on my image captioning project for more details on image encoders, but in short to the best of my knowledge ConvNeXt is the current state of the art.
Any, when I plot the output of this model it gave me this plot:
I notice a few issues with this. Firstly and most obviously, the points are all jumbled up! It has not learnt the difference between cats, fish, and dogs. I suspect this is because the input to the test model I trained got 2 variants of the same image altered randomly in different ways (flipping, hue change, etc) rather than an image and a textual label. I'm not too worried though, 'cause the real model will have 2 different items as inputs - I was avoiding doing extra work here.
Secondly, the parallel coordinates plot does not show a whole lot of variance between the different items. This is more worrying, but I'm again hoping that this issue will fix itself when I give the model 'real pairs' of rainfall radar <-> water depth images (with the heightmap thrown in there somewhere probably, I haven't decided yet).
Finally, I plotted a UMAP graph with completely random points to ensure it represented them properly:
As you can see, it plots them in a roughly spherical shape with no clear form or separation between the points. I'm glad I did this, because at first I was passing the labels to the UMAP plotter in the wrong way, and it instead artificially moved the points into groups.
With the test model done, I have moved swiftly on to (pretraining) actual model itself. This is currently underway so I don't have anything to show just yet (it is still training and I have yet to implement code to plot the output), but I can say that thanks to my realisations in Tensorflow graph execution as tensors, I'm seeing a GPU utilisation of 95% and above at all times :D
I've got a journal article written, but it's overlength so my job there isn't quite done just yet. When it is published, I will definitely make a dedicated post here!
Now, I have moved from writing to implementing a new model to tackle the rainfall radar part of my project. By using contrastive learning, I hope to enable the model to learn the relationship between the rainfall radar data and the water depth information. Once I've trained a contrastive learning model, I'll attach and train another model for image segmentation to predict the water depth information.
If you know of any state-of-the-art image segmentation decoder AI architectures, please leave a comment below. Bonus points if I can configure it to have >= 5M parameters without running out of memory. I'm currently very unsure what I'm going to choose.
Additionally, if you have any suggestions for additional tests I can do to verify my contrastive learning model is actually learning something, please leave a comment below also. The difficulty ist hat the while the loss value goes down, it's extremely difficult to tell whether what it's learning is actually sensible or not.