PhD Update 13: A half complete

*...almost! In the last post, I talked about the AAAI-22 doctoral consortium, the sentiment analysis models I've implemented, and finally LDA topic analysis. Before we continue to what I've been doing since then, here's a list of all the posts in this series so far:

As always, you can follow all my PhD-related blog posts in the PhD tag on my blog.

Since the last post, I've participated in both the AAAI-22 Doctoral Consortium and a Hackathon in AI for Sustainability! I've written separate posts about these topics to avoid cluttering this post, so if you're interested I can recommending checking those posts out:

CLIP works.... kinda

In the last post, I mentioned I was implementing a sentiment analysis model based on CLIP. I've been doing this in PyTorch as the pretrained CLIP model is also implemented in PyTorch. This has caused a number of issues, since it requires a GPU with a CUDA compute capability index of 3.7+, which excludes a number of the GPUs I currently have access to, making things rather awkward. Thankfully, a few months ago I built a GPU server which have somehow forgotten to blog about, so I have been able to use this for the majority of the CLIP experiments I've been running.

Anyway, this process is now complete, so I can share a graph or two on the training progress:

These graphs are as always provisional and not final results, so please don't take them such. The graph on the left is the training accuracy, and the graph on the right is the validation accuracy. I used the ViT-B/32 variant of CLIP, with 512 units for 2 x dense layers after it before the final softmax dense layer that made the prediction (full model summary available upon request - please ensure you send requests by email from an official email account I can verify). What's astonishing here is CLIP's ability to 'zero-shot' - the ability to make a prediction in a target domain it hasn't seen yet with no additional training or fine tuning. It's one thing seeing it in a blog post, but quite another seeing it in person on your own dataset.

The reasoning for multiple lines here on each graph takes some explanation. Because the CLIP model is trained on tweets both with an image and an emoji, the number of tweets in my ~700K+ dataset of tweets that satisfy both of these requirements is only ~14K. With this in mind, I implemented a system to augment tweets that had a supported emoji but didn't have an image the image that CLIP thought best matched it. It was done with the following algorithm:

Rank each image against the tweet in question
Pick a random image from those CLIP has at least 75% confidence in
If it doesn't have at least 75% confidence in any image, pick the next best image

The reason for this somewhat convoluted algorithm is to avoid a situation where CLIP picks the same image for every tweet. With this in place I increased the size of the dataset up to a peak of ~55K (it should be higher still, but I have yet to find the bug even after combing through all related code multiple times), I could then train multiple CLIP models each with a different threshold as to how confident CLIP had to be in the augmented dataset - this is what's shown on in the above graphs.

From the graphs above, I can tell that interestingly any image is better than none at all - at least in terms of training accuracy. With a peak validation accuracy of 86.48% (vs 84.61% without dataset augmentation), this outstrips the transformer encoder I trained earlier by a fair margin.

It's cool to compare the validation accuracy, but what would be really fascinating (and also more objective) would be to compare this to human-labelled tweets as a ground truth. While I'm unsure if I can publish the exact results and details of this experiment at this time, I can say that the results were very surprising: the transformer encoder narrowly beat CLIP in accuracy when comparing them against the ~2K human-labelled tweets!

The effect of this is that the images may not contain much information that's useful when predicting the positive/negative sentiment, so attempts to extract information from the images likely need to use a different strategy. I speculate here that the reason it appeared to boost the validation accuracy of CLIP is that it assisted CLIP in figuring out what actually being asked of it - similar to the "prompt engineering" the authors of CLIP mention in their section on CLIP's limitations.

Wrapping this half up

To wrap the social media half of my project up (for now at least), I'm writing a journal article to summarise the (sub)project. This will also include data and experiments from some of the students who participated in the Hackathon in AI for Sustainability 2022. I doubt that the journal I ultimately end up submitting to would like it very much if I release too many more details about this at this time, so a deeper discussion on the results, the journal I've chosen with my PhD supervisor's help to submit to, and the paper will have to wait until I finish it and it (hopefully!) gets accepted and published.

It's been slow-going on writing this journal article - both because it's my first one and because I'm drawing content together from many different sources, but I think I'm getting there.

Once I've finished writing this journal article, I believe I'll be turning my attention to the rainfall radar half of my project while I wait for a decision on whether it'll be published or not - so you can expect more on this in the next post in this series.

The plan

Going on a bit of a tangent, the CLIP portion of the project has been very helpful in introducing me to how important optimising the data preprocessing pipeline is - especially the data augmentation part. By preprocessing in parallel and reshuffling some things, I was able to bump the average usage of my Nvidia GeForce 3060 GPU from around 10% to well over 80%, speeding up the process of augmenting the data from ~10 minutes per tweet to just 1.5 seconds per tweet! It's well worth spending a few hours on your data processing pipeline if you know you'll be training and retraining your model a bunch of times as you tweak it, as you could save yourself many hours of training time.

A number of key things to watch out for that I've found so far, in no particular order:

Preprocessing data in parallel is very important. You can usually boost performance by as many times as you have CPU cores!
Reading data from a stream makes it awkward to parallelise. It's much easier and simpler to handle e.g. 1 image per file than a stream of images in a single file.
Image decoding is expensive, meaning that you'll most likely hit a CPU bottleneck if your model handles images. Ensuring images are JPEG can help, as PNGs are more expensive to decode.
- Similarly, the image decoder you use can significantly affect performance. I used simplejpeg, but I've heard that if you wrap Tensorflow's native image decoding in an input pipeline that can also be good as it can compile it into something more efficient. Test different methods with your own dataset to see which is best.
Given that your preprocessing pipeline will run for every epoch, investigate if you can do any expensive steps just once before training begins.

In the future I'd like to write a blog post that more thoroughly compares PyTorch and Tensorflow now that I have more experience with both of them. They have different strengths and weaknesses which make them both good fits for different types of models and projects.

All this experience will be very useful indeed when I turn my attention back to the rainfall radar portion of my project. My current plan is to investigate training a CLIP model to comparatively train the rainfall radar + heightmap and the water depth data against one another. As of now I haven't looked into the specifics and details of how CLIP's training process actually works, but I'm hoping it's not too complicated to either re-use their code or implement my own.

In training such a CLIP model, it should in theory tell me whether there's any relationship between the two at all that a model can learn. If there is, then I can then move on to the next step and connect a decoder of some description to the model that will produce an image as an output. If anyone has any good resources on this, please do comment below as I'm rather unsure as to where to begin (I've tried an autoencoder design in the past for this model - albeit without CLIP - and it didn't go very well).

Conclusion

Since last time, I've trained a bunch of CLIP models, and compared these (in more ways than one) to the transformer encoder I trained earlier. To extract useful information from images, a different strategy is likely needed as it doesn't appear that they contain much useful information about sentiment in the context of a flooding situation.

In training the CLIP models however, I've gained a lot of very valuable experience that will greatly help me in implementing an efficient model and pipeline for the rainfall radar half of my project. If I could go back and do this all again, I would have started the social media half of my project first, as it's taught me a whole bunch of very useful things that would have saved me a lot of time on my rainfall radar project....

If you've found this interesting, are confused about anything here, or have any suggestions, please do comment below! I'd love to hear from you.

Type this	To get this	Notes
`bold text`	bold text	-
`_italics text_`	italics text	-
`~~deleted text~~`	~~deleted~~	-
`code text`	`code text`	Inserts some monospaced code. It is preferred that large blocks of code are linked to using a service such as Pastebin, Github Gists or Ideone.
`> Quote`	Quote	-
`[display text](//google.com)`	display text	Inserts a hyperlink. Please use responsibly. `[rel=nofollow]` is in use and spam will be deleted.
`---`		Inserts a horizontal line. The previous line must be blank.

Stardust
Blog