Starbeamrainbowlabs


PhD Update 14: An old enemy

Hello again! This post is rather late due to one thing and another, but I've finally gotten around to writing it. In the last post, I talked about the CLIP model I trained to predict sentiment using both tweets and their associated images in pairs, and the augmentation system I devised to increase the size of the dataset. I also talked about the plan for a next-generation rainfall radar model, and a journal article I'm writing.

Before we begin though, let's start with the customary list of previous posts:

Since that last post, I've pretty much finished my initial draft of the journal article - though it is rather overlength, and I've also made a significant start on the rainfall radar model, which is what I will be focusing on in this blog post as there isn't all that much to talk about with the journal article at the moment (I'm unsure how much I'm allowed to share). I will make a separate post when I (finally) publish the journal article.

Rainfall radar model, revisited

As you might remember, I have dealt with rainfall radar data before (exhibit A, B, C, D), and it didn't go too well. After the part of my PhD on social media, I have learnt a lot about AI models and how to build them. I have also learnt a lot about data preprocessing. With all this in hand, I am now better equipped to do battle once more with an old enemy: the 1.5M time step rainfall radar dataset.

For those who are somewhat confused, the dataset in question is in 2 dimensions (i.e. like greyscale images). It is comprised of 3 things:

  • A heightmap
  • Rainfall radar data every 5 minutes
  • Water depth information, calculated by HAIL-CAESAR and binarised to water / no water for each pixel with a simple threshold

Given that the rainfall radar dataset has an extremely restrictive licence, I am unfortunately unable to share sample images from the dataset here.

My first objective was to tame the beast. To do this, I needed to convert the data to .tfrecord.gz files (applying all the preprocessing transformations ahead of time) instead of the split .asc.stream.gz and .jsonl.gz files I was using. At first, I thought I could use a TextLineDataset (it even supports reading from gzipped files!), but the snag here is that Tensorflow does not have a JSON parsing function.

The reason this is a problem is due to the new way I am parsing my dataset. Before, I used tf.data.Dataset.from_generator() and a regular Python function, but I have since discovered that there is a much more efficient way of doing things. The key revelation here was that Tensorflow does not simply execute e.g. the custom layers you implement by calling .call() every time. No, instead it calls it once and constructs a graph of operations, before then compiling this into machine code that the GPU can understand. The implication of this is twofold:

  1. It is significantly more efficient to take advantage of Tensorflow's execution graph functionality where available
  2. Once (any part of) your dataset becomes a Tensor, it must stay a Tensor

This not only goes for custom layers, loss functions, etc, but it also goes for the dataset pipeline too! I strongly recommend using the .map() function on tf.data.Dataset with a tf.function. Avoid .from_generator() if you can possibly help it!
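To make the idea concrete, here is a minimal sketch of a pipeline that stays in Tensors the whole way through. The random input tensors, the preprocess function, and the 0.1 threshold are placeholders invented for this example rather than anything from the real pipeline, and it assumes Tensorflow 2.4 or later (for tf.data.AUTOTUNE).

```python
import tensorflow as tf

# Hypothetical stand-ins for real data: 8 samples of 4x4 "images".
rainfall = tf.random.uniform([8, 4, 4])
water_depth = tf.random.uniform([8, 4, 4])


@tf.function
def preprocess(rain, depth):
    # Every step here is a Tensorflow op, so it can be traced into a graph
    # once and then reused for every item in the dataset.
    rain = tf.clip_by_value(rain, 0.0, 1.0)
    # Binarise to water / no water (0.1 is an arbitrary example threshold).
    water = tf.cast(depth > 0.1, tf.float32)
    return rain, water


dataset = (
    tf.data.Dataset.from_tensor_slices((rainfall, water_depth))
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)
)

for rain_batch, water_batch in dataset:
    print(rain_batch.shape, water_batch.shape)
```

Because preprocess only uses Tensorflow operations, nothing has to be converted back to a regular Python value part way through, which is the situation .from_generator() forces you into.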

To take advantage of this, I needed to convert my dataset to a set of .tfrecord.gz files (to support parallel reading, esp. since Viper has a high read latency). Given my code to parse my dataset is in Javascript/Node.js, I first tried using the tfrecord npm package to write .tfrecord files in Javascript directly. This did not work out though, as it kept crashing. I also tried variant packages like tfrecords and tfrecord-stream and more, but none of them worked. In the end, I settled on a multi-step process:

  1. Convert split data into .jsonl.gz files, 4K records per file. Do all preprocessing / correction steps here.
  2. Make all records unique: hash all records in all files, mark records for deletion, then delete them from files
  3. Recompress .jsonl.gz files to 4K records per file
  4. Convert .jsonl.gz to .tfrecord.gz with Python child processes managed by Node.js
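Steps 2 and 3 can be sketched roughly as follows. This is an illustration only: the real implementation is in Javascript and marks records for deletion before removing them, whereas this version streams everything through in a single pass. The function names are made up, and 4096 is just one reading of "4K".

```python
import gzip
import hashlib
import json
from pathlib import Path


def read_records(paths):
    """Yield one parsed JSON object per line from gzipped JSON Lines files."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def unique_records(records):
    """Skip any record whose canonical JSON hash has already been seen."""
    seen = set()
    for record in records:
        canonical = json.dumps(record, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(canonical).digest()
        if digest not in seen:
            seen.add(digest)
            yield record


def write_chunks(records, out_dir, per_file=4096):
    """Write records out as .jsonl.gz files holding at most per_file records."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    for i, record in enumerate(records):
        if i % per_file == 0:
            # Start a new output file every per_file records.
            if handle is not None:
                handle.close()
            path = out_dir / f"{len(written)}.jsonl.gz"
            handle = gzip.open(path, "wt", encoding="utf-8")
            written.append(path)
        handle.write(json.dumps(record) + "\n")
    if handle is not None:
        handle.close()
    return written
```

Chaining the three together, e.g. write_chunks(unique_records(read_records(paths)), "out"), deduplicates and rechunks in one go. Only the hashes are held in memory, not the records themselves.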

Overcomplicated? Perhaps. Do I have a single command I can execute to do all of this? Nope! Does it work? Absolutely :P

With the data converted I turned my attention to the model itself. As I have discussed previously, my current hypothesis is that the previous models failed because the relationship between the rainfall radar and water depth data is non-obvious (and that the model designs were terrible. 5K parameters? hahahaha, 5M parameters is probably the absolute minimum I would need). To this end, I will be first training a contrastive learning model to find relationships between the dataset items. Only then will I train a model to predict water depth, which I'll model as an image segmentation task (I have yet to find a segmentation decoder to implement, so suggestions here are welcome).

The first step here is to implement the contrastive learning algorithm. This is non-trivial however, so I implemented a test model using images from Reddit (r/cats, r/fish, and r/dogs) to test it and test the visualisations that I will require to determine the effectiveness of the model. In doing this, I found that the algorithm for contrastive learning in the CLIP paper (Learning Transferable Visual Models From Natural Language Supervision) was wrong and completely different to that which is described in the code, and I couldn't find the training loop or core loss function at all - so I had to piece together something from a variety of different sources.
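For reference, one common formulation of a CLIP-style contrastive loss is a symmetric cross-entropy over the matrix of cosine similarities between the 2 sets of embeddings. The NumPy sketch below shows that general idea only: it is not necessarily what I ended up piecing together, the function names are invented for the example, and the temperature of 0.07 is an assumed value.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric cross-entropy loss for paired embeddings.

    emb_a and emb_b both have shape [ batch_size, embedding_dim ], and
    row i of emb_a is the true partner of row i of emb_b.
    """
    # L2-normalise so that the dot product is the cosine similarity.
    emb_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    emb_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)

    # logits[i, j] = similarity between item i of A and item j of B.
    logits = emb_a @ emb_b.T / temperature

    # The correct pairs sit on the diagonal.
    idx = np.arange(logits.shape[0])
    loss_a = -log_softmax(logits, axis=1)[idx, idx].mean()  # A -> B
    loss_b = -log_softmax(logits, axis=0)[idx, idx].mean()  # B -> A
    return (loss_a + loss_b) / 2
```

The loss is low when each item is most similar to its own partner and high when it is more similar to some other item in the batch.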

To visualise the model, I needed a new approach. While the loss function value over time plotted on a graph is useful, it's difficult to tell if the resulting embedded representation the model outputs is actually doing what it is supposed to. Reading online, there are 2 ways of visualising embedding representations I've found:

  1. Dimensionality reduction
  2. Parallel coordinates plot

I can even include here a cool plot that demonstrates both of them with the pretrained CLIP model I used in the social media half of my project:

The second one is the easier to explain so I'll start with that. If you imagine that the output of the model is of shape [ batch_size, embedding_dim ] / [ 64, 200 ], then for every record in the dataset we can plot a line across a set of vertical lines, where each vertical line stands for each successive dimension of the embedding. This is what I have done in the plot on the right there.

The plot on the left uses the UMAP dimensionality reduction algorithm (paper), which to my knowledge is the best dimensionality reduction algorithm out there at the moment. For the uninitiated, a dimensionality reduction algorithm takes a vector with many dimensions - such as one with an embedding dimension of size 200 - and converts it into a lower-dimensional value (e.g. in 2 or 3 dimensions most commonly) so that it can be plotted and visualised. This is particularly helpful in AI when you want to check if your model is actually doing what you expect.

I took some time to look into this, as there are a number of other algorithms out there and it seems like it's far too easy to pick the wrong one for the task. In short, there are 3 different algorithms you'll see most often:

  • PCA: Stands for Principal Component Analysis, and while popular it only supports linear transformations, whereas most AI models learn non-linear ones.
  • tSNE: A non-linear alternative (designed for AI applications, in part) that is also rather popular. It does not preserve the global structure of the dataset (i.e. relationships and distances between different values) very well though.
  • UMAP: Stands for Uniform Manifold Approximation and Projection. It is designed as an alternative to tSNE and preserves global structure much better.

Sources for this are at the end of this post. If you're applying PCA or tSNE for dimensionality reduction in an AI context, consider switching it out to UMAP.
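To show what "linear" means for PCA, here is a small NumPy sketch. It uses PCA rather than UMAP purely because PCA fits in a few lines without extra libraries, and the random embeddings are placeholder data. The entire projection boils down to centring the data and doing 1 matrix multiplication, which is why it can't untangle non-linear structure.

```python
import numpy as np


def pca_project(embeddings, n_components=2):
    """Project [ batch_size, embedding_dim ] data onto its top principal axes."""
    centred = embeddings - embeddings.mean(axis=0)
    # The rows of vt are the principal directions, ordered from most to
    # least variance explained.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    # A single matrix multiplication: this is the whole transformation.
    return centred @ vt[:n_components].T


rng = np.random.default_rng(42)
embeddings = rng.normal(size=(64, 200))  # [ batch_size, embedding_dim ]
points = pca_project(embeddings)
print(points.shape)  # (64, 2)
```

tSNE and UMAP cannot be expressed as a single fixed matrix like this.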

In the plot above, it is obvious that the pretrained CLIP model can differentiate between the 3 types of pet that I gave it as a test dataset. The next step was to train a model with the contrastive learning and the test dataset.

To do this, I needed an encoder. In the test, I used ResNetV2, which is apparently an improved version of the ResNet architecture (I have yet to read the paper on it). Since I implemented it though, I have found an implementation of the image encoder ConvNeXt, so I'm using that in the main model. See my recent post on my image captioning project for more details on image encoders, but in short to the best of my knowledge ConvNeXt is the current state of the art.

Anyway, when I plotted the output of this model it gave me this plot:

I notice a few issues with this. Firstly and most obviously, the points are all jumbled up! It has not learnt the difference between cats, fish, and dogs. I suspect this is because the input to the test model I trained got 2 variants of the same image altered randomly in different ways (flipping, hue change, etc) rather than an image and a textual label. I'm not too worried though, 'cause the real model will have 2 different items as inputs - I was avoiding doing extra work here.

Secondly, the parallel coordinates plot does not show a whole lot of variance between the different items. This is more worrying, but I'm again hoping that this issue will fix itself when I give the model 'real pairs' of rainfall radar <-> water depth images (with the heightmap thrown in there somewhere probably, I haven't decided yet).

Finally, I plotted a UMAP graph with completely random points to ensure it represented them properly:

As you can see, it plots them in a roughly spherical shape with no clear form or separation between the points. I'm glad I did this, because at first I was passing the labels to the UMAP plotter in the wrong way, and it instead artificially moved the points into groups.

With the test model done, I have moved swiftly on to (pretraining) the actual model itself. This is currently underway so I don't have anything to show just yet (it is still training and I have yet to implement code to plot the output), but I can say that thanks to my realisations about Tensorflow graph execution and keeping everything as tensors, I'm seeing a GPU utilisation of 95% and above at all times :D

Conclusion

I've got a journal article written, but it's overlength so my job there isn't quite done just yet. When it is published, I will definitely make a dedicated post here!

Now, I have moved from writing to implementing a new model to tackle the rainfall radar part of my project. By using contrastive learning, I hope to enable the model to learn the relationship between the rainfall radar data and the water depth information. Once I've trained a contrastive learning model, I'll attach and train another model for image segmentation to predict the water depth information.

If you know of any state-of-the-art image segmentation decoder AI architectures, please leave a comment below. Bonus points if I can configure it to have >= 5M parameters without running out of memory. I'm currently very unsure what I'm going to choose.

Additionally, if you have any suggestions for additional tests I can do to verify my contrastive learning model is actually learning something, please leave a comment below also. The difficulty is that while the loss value goes down, it's extremely difficult to tell whether what it's learning is actually sensible or not.

PhD Aside 2: Jupyter Lab / Notebook First Impressions

Hello there! I'm back with another PhD Aside blog post. In the last one, I devised an extremely complicated and ultimately pointless mechanism by which multiple Node.js processes can read from the same file handle at the same time. This post hopefully won't be quite as useless, as it's a cross with the other reviews / first impressions posts I've made previously.

I've had Jupyter on my radar for ages, but it's only very recently that I've actually given it a try. Despite being almost impossible to spell (though it does appear to be getting easier with time), it's both easy to install and extremely useful when plotting visualisations, so I wanted to talk about it here.

I tried Jupyter Lab, which is apparently more complicated than Jupyter Notebook. Personally though I'm not sure I see much of a difference, aside from a file manager sidebar in Jupyter Lab that is rather useful.

(Above: A Jupyter Lab session of mine, in which I was visualising embeddings from a pretrained CLIP model.)

Jupyter Lab is installed via pip (pip3 for apt-based systems): https://jupyter.org/install. Once installed, you can start a server with jupyter-lab in a terminal (or command line), and then it will automatically open a new tab in your browser that points to the server instance (http://localhost:8888/ by default).

Then, you can open 1 or more Jupyter Notebooks, which seem to be regular files (e.g. Javascript, Python, and more) but are split into 'cells', which can be run independently of one another. While these cells are usually run in order, there's nothing to say that you can't run them out of order, or indeed the same cell over and over again as you prototype a graph.

The output of each cell is displayed directly below it. Be that a console.log()/print() call or a graph visualisation (see the screenshot above), it seems to work just fine. It also saves the output of a cell to disk alongside the code in the Jupyter Notebook, which can be a double-edged sword: on the one hand, it's very useful to have the plot and other output be displayed to remind you what you were working on, but on the other hand if the output somehow contains sensitive data, then you need to remember to clear it before saving & committing to git each time, which is a hassle. Similarly, every time the output changes the notebook file on disk also changes, which can result in unnecessary extra changes committed to git if you're not careful.
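Since a notebook file is just JSON on disk, clearing the saved output can be scripted. The sketch below assumes the version 4 notebook format, in which each code cell carries an outputs list and an execution_count. The strip_outputs function is a name invented for this example, not part of Jupyter.

```python
import json


def strip_outputs(notebook_path):
    """Remove saved cell outputs from a notebook file, in place."""
    with open(notebook_path, "r", encoding="utf-8") as handle:
        notebook = json.load(handle)

    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            # Drop the stored output and the run counter, keep the source.
            cell["outputs"] = []
            cell["execution_count"] = None

    with open(notebook_path, "w", encoding="utf-8") as handle:
        json.dump(notebook, handle, indent=1)
        handle.write("\n")
```

Running something like this before each commit would keep plots and printed data out of the repository, at the cost of losing the reminder of what the output looked like.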

In the same vein, I have yet to find a way to define a variable in a notebook file whose value is not saved along with the notebook file, which I'd rather like since the e.g. tweets I work with for the social media side of my PhD are considered sensitive information, and so I don't want to commit them to a git repository which will no doubt end up open-source.

You can also import functions and classes from other files. Personally, I see Jupyter notebooks to be most useful when used in conjunction with an existing codebase: while you can put absolutely everything in your Jupyter notebook, I wouldn't recommend it as you'll end up with spaghetti code that's hard to understand or maintain - just like you would in a regular codebase in any other language.

Likewise, I wouldn't recommend implementing an AI model in a Jupyter notebook directly. While you can, it makes it complicated to train it on a headless server - which you'll likely want to do if you want to train a model at any scale.

The other minor annoyance is that by using Jupyter you end up forfeiting the code intelligence of e.g. Atom or Visual Studio Code, which is a shame since a good editor can e.g. check syntax on the fly, inform you of unused variables, provide autocomplete, etc.

These issues aside, Jupyter is a great fit for plotting visualisations due to the very short improve → rerun → inspect/evaluate output loop. It's also a good fit for writing tutorials I suspect, as it apparently has support for markdown cells too. At some point, I may try writing a tutorial in Jupyter notebook, rendering it to regular markdown, and posting it here.

PhD Update 13: A half complete

...almost! In the last post, I talked about the AAAI-22 doctoral consortium, the sentiment analysis models I've implemented, and finally LDA topic analysis. Before we continue to what I've been doing since then, here's a list of all the posts in this series so far:

As always, you can follow all my PhD-related blog posts in the PhD tag on my blog.

Since the last post, I've participated in both the AAAI-22 Doctoral Consortium and a Hackathon in AI for Sustainability! I've written separate posts about these topics to avoid cluttering this post, so if you're interested I can recommend checking those posts out:

CLIP works.... kinda

In the last post, I mentioned I was implementing a sentiment analysis model based on CLIP. I've been doing this in PyTorch as the pretrained CLIP model is also implemented in PyTorch. This has caused a number of issues, since it requires a GPU with a CUDA compute capability index of 3.7+, which excludes a number of the GPUs I currently have access to, making things rather awkward. Thankfully, a few months ago I built a GPU server which I have somehow forgotten to blog about, so I have been able to use this for the majority of the CLIP experiments I've been running.

Anyway, this process is now complete, so I can share a graph or two on the training progress:

These graphs are as always provisional and not final results, so please don't take them as such. The graph on the left is the training accuracy, and the graph on the right is the validation accuracy. I used the ViT-B/32 variant of CLIP, followed by 2 dense layers of 512 units each before the final softmax dense layer that made the prediction (full model summary available upon request - please ensure you send requests by email from an official email account I can verify). What's astonishing here is CLIP's ability to 'zero-shot' - the ability to make a prediction in a target domain it hasn't seen yet with no additional training or fine tuning. It's one thing seeing it in a blog post, but quite another seeing it in person on your own dataset.

The reasoning for multiple lines here on each graph takes some explanation. Because the CLIP model is trained on tweets both with an image and an emoji, the number of tweets in my ~700K+ dataset of tweets that satisfy both of these requirements is only ~14K. With this in mind, I implemented a system to augment tweets that had a supported emoji but didn't have an image with the image that CLIP thought best matched it. It was done with the following algorithm:

  1. Rank each image against the tweet in question
  2. Pick a random image from those CLIP has at least 75% confidence in
  3. If it doesn't have at least 75% confidence in any image, pick the next best image
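The steps above can be sketched as follows. This is an illustrative reading only: pick_image and its arguments are names invented for the example, how the confidence values are computed is left out, and "the next best image" is interpreted here as the highest-ranked image overall.

```python
import random


def pick_image(confidences, threshold=0.75, rng=random):
    """Choose an image for one tweet.

    confidences maps a (hypothetical) image id to CLIP's confidence that
    the image matches the tweet, as a value between 0 and 1.
    """
    if not confidences:
        raise ValueError("no candidate images to choose from")

    # 1. Rank each image against the tweet, best match first.
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)

    # 2. Pick at random from the images that meet the threshold.
    confident = [image_id for image_id, score in ranked if score >= threshold]
    if confident:
        return rng.choice(confident)

    # 3. Otherwise fall back to the best-ranked image.
    return ranked[0][0]


print(pick_image({"flood.jpg": 0.91, "river.jpg": 0.80, "cat.jpg": 0.05}))
print(pick_image({"flood.jpg": 0.40, "cat.jpg": 0.05}))  # flood.jpg
```

The random choice in step 2 is what stops a single very generic image from being attached to every tweet.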

The reason for this somewhat convoluted algorithm is to avoid a situation where CLIP picks the same image for every tweet. With this in place I increased the size of the dataset up to a peak of ~55K (it should be higher still, but I have yet to find the bug even after combing through all related code multiple times). I could then train multiple CLIP models, each with a different threshold as to how confident CLIP had to be in the augmented dataset - this is what's shown in the above graphs.

From the graphs above, I can tell that interestingly any image is better than none at all - at least in terms of training accuracy. With a peak validation accuracy of 86.48% (vs 84.61% without dataset augmentation), this outstrips the transformer encoder I trained earlier by a fair margin.

It's cool to compare the validation accuracy, but what would be really fascinating (and also more objective) would be to compare this to human-labelled tweets as a ground truth. While I'm unsure if I can publish the exact results and details of this experiment at this time, I can say that the results were very surprising: the transformer encoder narrowly beat CLIP in accuracy when comparing them against the ~2K human-labelled tweets!

The implication of this is that the images may not contain much information that's useful when predicting the positive/negative sentiment, so attempts to extract information from the images likely need to use a different strategy. I speculate here that the reason it appeared to boost the validation accuracy of CLIP is that it assisted CLIP in figuring out what was actually being asked of it - similar to the "prompt engineering" the authors of CLIP mention in their section on CLIP's limitations.

Wrapping this half up

To wrap the social media half of my project up (for now at least), I'm writing a journal article to summarise the (sub)project. This will also include data and experiments from some of the students who participated in the Hackathon in AI for Sustainability 2022. I doubt that the journal I ultimately end up submitting to would like it very much if I release too many more details about this at this time, so a deeper discussion on the results, the journal I've chosen with my PhD supervisor's help to submit to, and the paper will have to wait until I finish it and it (hopefully!) gets accepted and published.

It's been slow-going on writing this journal article - both because it's my first one and because I'm drawing content together from many different sources, but I think I'm getting there.

Once I've finished writing this journal article, I believe I'll be turning my attention to the rainfall radar half of my project while I wait for a decision on whether it'll be published or not - so you can expect more on this in the next post in this series.

The plan

Going on a bit of a tangent, the CLIP portion of the project has been very helpful in introducing me to how important optimising the data preprocessing pipeline is - especially the data augmentation part. By preprocessing in parallel and reshuffling some things, I was able to bump the average usage of my Nvidia GeForce 3060 GPU from around 10% to well over 80%, speeding up the process of augmenting the data from ~10 minutes per tweet to just 1.5 seconds per tweet! It's well worth spending a few hours on your data processing pipeline if you know you'll be training and retraining your model a bunch of times as you tweak it, as you could save yourself many hours of training time.

A number of key things to watch out for that I've found so far, in no particular order:

  • Preprocessing data in parallel is very important. You can usually boost performance by as many times as you have CPU cores!
  • Reading data from a stream makes it awkward to parallelise. It's much easier and simpler to handle e.g. 1 image per file than a stream of images in a single file.
  • Image decoding is expensive, meaning that you'll most likely hit a CPU bottleneck if your model handles images. Ensuring images are JPEG can help, as PNGs are more expensive to decode.
    • Similarly, the image decoder you use can significantly affect performance. I used simplejpeg, but I've heard that if you wrap Tensorflow's native image decoding in an input pipeline that can also be good as it can compile it into something more efficient. Test different methods with your own dataset to see which is best.
  • Given that your preprocessing pipeline will run for every epoch, investigate if you can do any expensive steps just once before training begins.

In the future I'd like to write a blog post that more thoroughly compares PyTorch and Tensorflow now that I have more experience with both of them. They have different strengths and weaknesses which make them both good fits for different types of models and projects.

All this experience will be very useful indeed when I turn my attention back to the rainfall radar portion of my project. My current plan is to investigate training a CLIP model to comparatively train the rainfall radar + heightmap and the water depth data against one another. As of now I haven't looked into the specifics and details of how CLIP's training process actually works, but I'm hoping it's not too complicated to either re-use their code or implement my own.

In training such a CLIP model, it should in theory tell me whether there's any relationship between the two at all that a model can learn. If there is, then I can then move on to the next step and connect a decoder of some description to the model that will produce an image as an output. If anyone has any good resources on this, please do comment below as I'm rather unsure as to where to begin (I've tried an autoencoder design in the past for this model - albeit without CLIP - and it didn't go very well).

Conclusion

Since last time, I've trained a bunch of CLIP models, and compared these (in more ways than one) to the transformer encoder I trained earlier. To extract useful information from images, a different strategy is likely needed as it doesn't appear that they contain much useful information about sentiment in the context of a flooding situation.

In training the CLIP models however, I've gained a lot of very valuable experience that will greatly help me in implementing an efficient model and pipeline for the rainfall radar half of my project. If I could go back and do this all again, I would have started the social media half of my project first, as it's taught me a whole bunch of very useful things that would have saved me a lot of time on my rainfall radar project....

If you've found this interesting, are confused about anything here, or have any suggestions, please do comment below! I'd love to hear from you.

Hackathon in AI for Sustainability 2022

The other week, I took part in the Hackathon in AI for Sustainability 2022. While this was notable because it was my first hackathon, what was more important was that it was partially based on my research! For those who aren't aware, I'm currently doing a PhD at the University of Hull with the project title "Using Big Data and AI to Dynamically Predict Flood Risk". While part of it really hasn't gone according to plan (I do have a plan to fix it, I just need to find time to implement it), the second half of my project on social media has been coming together much more easily.

To this end, my supervisor asked me about a month ago whether I wanted to help organise a hackathon, so I took the plunge and said yes. The hackathon had 3 projects for attendees to choose from:

  • Project 1: Hedge identification from earth observation data with interpretable computer vision algorithms
  • Project 2: Monopile fatigue estimation from nonlinear waves using deep learning
  • Project 3: Live sentiment tracking during floods from social media data (my project!)

When doing research, I've found that there are often many more avenues to explore than there is time to explore them. To this end, a hackathon is an ideal time to explore these avenues that I have not had the time to explore previously.

To prepare, I put together some datasets of tweets and associated images - some from the models I've actually trained, and others (such as one based on the hashtag #StormFranklin) that I downloaded specially for the occasion. Alongside this, I also trained and prepared a model and some sample code for students to use as a starting point.

On the first day of the event, the leaders of the 3 projects presented the background and objectives of the 3 projects available for students to choose from, and then we headed to the lab to get started. While unfortunate technical issues were a problem for all 3 projects, we managed to find ways to work around them.

Over the next few days, the students participating in the hackathon tackled the 3 projects and explored different directions. At first, I wasn't really sure about what to do or how to help the students, but I soon started to figure out how I could assist students by explaining things, helping them with their problems, fetching and organising more data, and other such things.

While I can't speak for the other projects, the outputs of the hackathon for my project are fascinating insights into things I haven't had time to look into myself - and I anticipate that we may be able to draw them together into something more formal.

Just some of the approaches taken in my project include:

  • Automatically captioning images to extract additional information
  • Using other sentiment classification models to compare performance
    • VADER: A rule-based model that classifies to positive/negative/neutral
    • BART: A transformer model related to BERT (it pairs a BERT-style bidirectional encoder with an autoregressive decoder)
  • Resolving and inferring geolocations of tweets and plotting them on a map, with the goal of increasing relevance of tweets

The outputs of the hackathon have been beyond my wildest dreams, so I'm hugely thankful to all who participated in my project as part of the hackathon!

While I don't have many fancy visuals to show right now, I'll definitely keep you updated with progress on drawing it all together in my PhD Update blog post series.

A learning experience | AAAI-22 in review

Hey there! As you might have guessed, it's time for my review of the AAAI-22 conference(?) (Association for the Advancement of Artificial Intelligence) I attended recently. It's definitely been a learning experience, so I think I've got my thoughts in order in a way that means I can now write about them here.

Attending a conference has always been on the cards - right from the very beginning of my PhD - but it's only recently that I have had something substantial enough that it would be worth attending one. To this end, I wrote a 2 page paper last year and submitted it to the Doctoral Consortium, which is a satellite event that takes place slightly before the actual AAAI-22 conference. To my surprise I got accepted!

Unfortunately in January AAAI-22 was switched from being an in-person conference to being a virtual conference instead. While I appreciate and understand the reasons why they made that decision (safety must come first, after all), it made some things rather awkward. For example, the registration form didn't mention a timezone, so I had to reach out to the helpdesk to ask about it.

For some reason, the Doctoral Consortium wanted me to give a talk. While I was nervous beforehand, the talk itself seemed to go ok (even though I forgot to create a slide somewhere in the middle) - people seemed to find the subject interesting. They also assigned a virtual mentor to me as well, who was very helpful in checking my slide deck for me.

The other Doctoral Consortium talks were also really interesting. I think the one that stood out to me was "AI-Driven Road Condition Monitoring Across Multiple Nations" by Deeksha Arya, in which the presenter was using CNNs to detect damage to roads - and found that a model trained on data from 1 country didn't work so well in another - and talked about ways in which they were going to combat the issue. The talk on "Creating Interpretable Data-Driven Approaches for Tropical Cyclones Forecasting" by Fan Meng also sounded fascinating, but I didn't get a chance to attend on account of their session being when I was asleep.

As part of the conference, I also submitted a poster. I've actually done a poster session before, so I sort of knew what to expect with this one. After a brief hiccup and rescheduling of the poster session I was part of, I got a 35 minute slot to present my poster, and had some interesting conversations with people.

Technical issues were a constant theme throughout the event. While the Doctoral Consortium went well on Zoom (there was a last minute software change - I'm glad I took the night before to install and check multiple different video conferencing programs, otherwise I wouldn't have made it), the rest of the conference wasn't so lucky. AAAI-22 was held on something called VirtualChair / Gather.town, which as it turned out was not suited to the scale of the conference in question (200 people in each room? yikes). I found myself with the seemingly impossible task of using a website that was so laggy it was barely usable - even on my i7-10750H I bought back in 2020. While the helpdesk were helpful and suggested some things I could try, nothing seemed to help. This severely limited the benefit I could gain from the conference.

At times, there were also a number of communication issues that made the experience a stressful one. Some emails contradicted each other, and others were unclear - so I had to email the organisers at multiple points to request clarification. The wording on some of the forms (especially the registration form) left a lot to be desired. All in all, this led to a very large number of wasted hours figuring things out and going back and forth to resolve confusion.

It also seemed as though everyone assumed that I knew how a big conference like this worked and what each event was about, when this was not the case. For example, after the start of the conference I received an email saying that they hoped I'd been enjoying the plenary sessions, when I didn't know that plenary sessions existed, let alone what they were about. Perhaps in future it would be a good idea to distribute a beginner's guide to the conference - perhaps by email or something.

For future reference, my current understanding of the different events in a conference is as follows:

  • Doctoral Consortium: A series of talks - perhaps over several sessions - in which PhD students submit a 2 page paper in advance and then present their projects.
  • Workshop: A themed event in which a bunch of presenters submit longer papers and talk about their work
  • Tutorial: In which the organisers deliver content centred around a specific theme with the aim of educating the audience on a particular topic
  • Plenary session: While workshops and tutorials may run in parallel, plenary sessions are talks at a time when everyone can attend. They are designed to be general enough that they are applicable to the entire audience.
  • Poster session: A bunch of people create a poster about their research, and all of these posters are put up in a room. Then, researchers are designated specific sessions in which they stand by their poster and people come by and chat with them about their research. At other times, researchers are free to browse other researchers' posters.

Conclusion

Even though the direct benefit from talks, workshops, and other activities at the conference has been extremely limited due to technical, communication, and timezone issues, the experience of attending this conference has been a beneficial one. I've learnt about how a conference is structured, and also had the chance to present my research to a global audience for the first time!

In the future, I hope that I get the chance to attend my first actual conference as I feel I'm much better prepared, and have a better understanding as to what I'm getting myself in for.

PhD Update 12: Is it enough?

Hey there! It's another PhD update blog post! Sorry for the lack of posts here, the reason why will become apparent below. In the last post, I talked about the AAAI-22 Doctoral Consortium conference I'll be attending, and also about sentiment analysis of both tweets and images. Before we talk about progress since then, here's a list of all the posts in this series so far:

As in all the posts preceding this one, none of the things I present here are finalised and are subject to significant change as I double check everything. I can think of no better example of this than the image classification model I talked about last time - the accuracy of which has dropped from 75.6% to 60.7% after I fixed a bug....

AAAI-22 Doctoral Consortium

The most major thing in my calendar in the next few weeks is surely the AAAI-22 Doctoral Consortium. I've been given a complimentary registration to the main conference after I had a paper accepted, so as it turns out the next few weeks are going to be rather busy - as have been the last few weeks preparing for this conference.

Since the last post when I mentioned that AAAI-22 has been moved to be fully virtual, there have been a number of developments. Firstly, the specific nature of the doctoral consortium (and the wider conference) is starting to become clear. Not having been to a conference before (a theme which will come up a lot in this post), I'm not entirely sure what to expect, but as it turns out I was asked to create a poster that I'll be presenting in a poster session (whether this is part of AAAI-22 or the AAAI-22 Doctoral Consortium is unclear).

I've created one with baposter (or a variant thereof given to me by my supervisor some time ago) which I've now submitted. Presenting a poster in a poster session at a conference to lots of people I don't know is a rather terrifying thought though, so I've been increasingly anxious about this over the past few weeks.

This isn't the end of the story though, as I've also been asked to do a 20 minute presentation with 15 minutes for questions / discussion, the preparations for which have taken perhaps longer than I anticipated. Thankfully I've had prior experience presenting at my department's PGR seminars (which have also been online recently), so it's not as daunting as it would otherwise be, but as with the poster there's still the fear of presenting to lots of people I don't know - most of whom probably know more about AI than I do!

Despite these fears and other complications that you can expect from an international conference (timezones are such a pain sometimes), I'm looking forward to seeing what other people have done, and also hoping that the feedback I get from my presentation and poster isn't all negative :P

Sentiment analysis: a different perspective

Since the last post, a number of things have changed in my approach to sentiment analysis. The reason for this is - as I alluded to earlier in this post - that I fixed a bug in my model that predicts the sentiment of images from twitter, which caused the validation accuracy to drop from 75.6% to 60.7%. As of now, I have the following models implemented for sentiment analysis:

  1. Text sentiment prediction
    • Status: Implemented, and works rather well actually - top accuracy is currently 79.6%
    • Input: Tweet text
    • Output: Positive/negative sentiment
    • Architecture: Transformer encoder (previously: LSTM)
    • Labels: Emojis from text [manually sorted into 2 categories]
  2. Image sentiment prediction
    • Status: Implemented, but doesn't work very well
    • Input: Images attached to tweets
    • Output: Positive/negative sentiment
    • Architecture: ResNet50 (previously: CCT, but I couldn't get it to work)
    • Labels: Positive/negative sentiment predictions from model #1
  3. Combined sentiment prediction
    • Status: Under construction
    • Input: Tweet text, associated image
    • Output: Positive/negative sentiment
    • Architecture: CLIP → a few dense layers [provisionally]
    • Labels: Emojis from text, as in model #1

Not mentioned here of course is my rainfall radar model, but that's still waiting for me to return to it and implement a plan I came up with some months ago to fix it.

Of particular note here is the new CLIP-based model. After analysing model #2 (image sentiment prediction) and sampling some images from each category on a per-flood basis, it soon became clear that the output was very noisy and wasn't particularly useful, which, combined with its rather low performance, means that I'm pretty much considering it a failed attempt.

With this in mind, my supervisor suggested I look into CLIP. While I mentioned it in the paper I've submitted to AAAI-22, until recently I haven't had a clear picture of how it would be actually useful. As a model, CLIP trains on text-image pairs, and learns to identify which image belongs to which text caption. To do this, it has 2 encoders - 1 for the text, and 1 for the associated image.

You can probably guess where this is going, but my new plan here is to combine the text and images of the tweets I have downloaded into 1 single model rather than 2 separate ones. I figure that I can potentially take advantage of the encoders trained by the CLIP models to make a prediction. If I recall correctly, I've read a paper that has done this before with tweets from twitter - just not with CLIP - but unfortunately I can't locate the paper at this time (I'll edit this post if I do find it in the future - if you know which paper I'm talking about please do leave a DOI link in the comments below).

At the moment, I'm busy implementing the code to wrap the CLIP model and make it suitable for my specific learning task, but this is as of yet incomplete. As noted in the architecture above, I plan on concatenating the output from the CLIP encoders and passing it through a few dense layers (as the [ batch_size, concat, dim ] tensor is not compatible with a transformer, which operates on sequences), but I'm open to suggestions - please do comment below.
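To make the concatenate-then-dense idea a bit more concrete, here's a minimal pure-Python sketch. It is not the Tensorflow implementation described above: the tiny embedding size, the random untrained weights, and the names `dense`, `random_layer`, and `predict_sentiment` are all assumptions made up for illustration, and the toy vectors stand in for real CLIP encoder outputs.

```python
import math
import random


def relu(value):
    return max(0.0, value)


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def dense(inputs, weights, biases, activation):
    """Apply one fully-connected layer: activation(W*x + b)."""
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(total))
    return outputs


def random_layer(rng, n_in, n_out):
    """Hypothetical untrained weights, only here so the sketch runs."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def predict_sentiment(text_embedding, image_embedding, layers):
    """Concatenate both encoder outputs, then apply the dense layers."""
    hidden = list(text_embedding) + list(image_embedding)
    for weights, biases in layers[:-1]:
        hidden = dense(hidden, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(hidden, weights, biases, sigmoid)[0]


rng = random.Random(42)
EMBED_DIM = 8  # assumption: real CLIP embeddings are much larger than this
text_embedding = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]
image_embedding = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]
layers = [
    random_layer(rng, EMBED_DIM * 2, 4),  # hidden dense layer
    random_layer(rng, 4, 1),              # single sigmoid output
]
probability = predict_sentiment(text_embedding, image_embedding, layers)
print(f"P(positive) = {probability:.3f}")
```

The point of the sketch is only the shape of the data flow: two fixed-size vectors go in, they are joined end to end, and a single positive/negative probability comes out.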

My biggest concern here is that the tweet text will not be enough like an image caption for CLIP to produce useful results. If this is the case, I have some tricks up my sleeve:

  1. Extract any alt text associated with images (yes, twitter does let you do this for images) and use that instead - though I can't imagine that people have captioned images especially often.
  2. Train a new CLIP model from scratch on my dataset - potentially using this model architecture - this may require quite a time investment to get the model working as intended (Tensorflow can be confusing and difficult to debug with complex model architectures).

Topic analysis

Another thing I've explored is topic analysis using gensim's LdaModel. Essentially, as far as I can tell LDA is an unsupervised algorithm that groups the words found in the source input documents into a fixed number of related groups, each of which contains words which are often found in close proximity to one another.

As an example, if I train an LDA model for 20 groups, I get something like the following for some of those categories:

place   death   north   evacu   rise    drive   hng coverag toll    cumbria
come    us  ye  issu    london  set line    offic   old mean
rescu   dead    miss    uttarakhand leav    bbc texa    victim  kerala  defenc
flashflood  awai    world   car train   make    best    turn    school  boat
need    town    problem includ  effect  head    assist  sign    philippin   condit

(Full results may be available upon request, if context is provided)

As you can tell, the results of this are mixed. While it has grouped the words, the groups themselves aren't really what I was hoping for. The groups here seem to be quite generic, rather than being about specific things (words like rescu, dead, miss, and victim in a category), and also seem to be rather noisy (words like "north", "old", "set", "come", "us").

My first thought here was that instead of training the model on all the tweets I have, I might get better results if I train it on just a single flood at once - since some categories are dominated by flood-specific words (hurrican, hurricaneeta, hurricaneiota for example). Here's an extract from the results of that for 10 categories for the #StormDennis hashtag:

warn    flood   stormdenni  met issu    offic   water   tree    high    risk
good    hope    look    love    i’m game    morn    dai yellow  nice
ye  lol head    enjoi   tell    book    got chanc   definit valentin
mph gust    ireland coast   wors    wave    brace   wow doesn’t photo

(Again, full results may be available on request if context is provided)

Again, mixed results here - and still very noisy (though I'd expect nothing else from social media data), though it is nice to see something like gust, ireland, and coast there in the bottom category.

The goal here was to attempt to extract some useful information about which places have been affected and by how much by combining this with the sentiment analysis (see above), but my approach here doesn't seem to have captured what I intended. Still, it does seem that (for all tweets) there is some difference in sentiment on a per-category basis:

(Above: A chart of the sentiment of each LDA topic, using the 20 topic model that was trained on all available tweets.)
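A per-topic sentiment breakdown like the one in that chart could be computed along these lines. This is only a sketch: it assumes each tweet has already been assigned to a single dominant LDA topic and given a positive/negative label, and the `sentiment_by_topic` function and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def sentiment_by_topic(tweets):
    """Return {topic_id: percentage of tweets labelled positive}."""
    counts = defaultdict(Counter)
    for topic_id, sentiment in tweets:
        counts[topic_id][sentiment] += 1
    return {
        topic_id: 100.0 * tally["positive"] / sum(tally.values())
        for topic_id, tally in sorted(counts.items())
    }


# Hypothetical (dominant LDA topic, predicted sentiment) pairs
tweets = [
    (0, "negative"), (0, "negative"), (0, "positive"), (0, "negative"),
    (1, "positive"), (1, "positive"), (1, "negative"),
    (2, "negative"), (2, "negative"),
]
for topic_id, percent in sentiment_by_topic(tweets).items():
    print(f"topic {topic_id}: {percent:.1f}% positive")
```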

It seems as if there's definitely something going on here. I speculate that positive words are more often than not used near other positive words, and so categories are more likely to skew to 1 extreme or another - though acquiring proof of this theory would likely require a significant time investment.

While LDA topic analysis was an interesting diversion, I'm not sure how useful it is in this context. Still, it's a useful thing to have in my growing natural language processing toolkit (how did this happen) - and perhaps future research problems will benefit from it more.

Moving forwards, I could imagine instead of doing LDA topic analysis it might be beneficial to group tweets by place and perhaps run some sentiment analysis on that instead. This comes with its own set of problems of course (especially pinning down / inferring the location of the tweets in question), but this is not an insurmountable problem given strategies such as named entity recognition (which the twitter Academic API does for you, believe it or not).

Conclusion

While most of my time has been spent on preparing for the AAAI-22 conference, I have managed to do some investigating into the twitter data I've downloaded. Unfortunately, not all my methods have been a success (image sentiment analysis, LDA topic analysis), but these have served as useful exercises for both learning new techniques and understanding the dataset better.

Moving forwards, I'm going to implement a new CLIP-based model that I'm hoping will improve accuracy over my existing models - though I'm somewhat apprehensive that tweet text won't be descriptive enough for CLIP to produce a useful output.

With the AAAI-22 conference happening next week, I'll be sure to write up my experiences and post about the event here. I'm just hoping that I've done enough to make attending a conference like this worth it, and that what I have done is actually interesting to people :-)

Sources and further reading

PhD Update 11: Answers to our questions

Heya! It's time for another PhD update blog post. Sometimes, answers to the questions one asks come in the form of more questions. In this post, as I predicted in the last post I'll be talking mainly about my work with tweets from twitter, as I haven't yet had time to return to the Temporal CNN Autoencoder. Before we start though, here's a list of all the parts in this series so far:

As usual, none of the things I present here are finalised, and are subject to significant change as I double check everything.

Conferences part 2

In the last post, I talked about the conferences I have applied to and not applied to. In this one, I can now say that I have been accepted for the AAAI-22 Doctoral Consortium! It was going to be held in Vancouver, Canada - but has since been moved to be fully virtual and online. While I both understand the reasoning behind the decision and am relieved I don't have to travel, it is a bit of a shame that I won't get the chance to have those face-to-face conversations you don't get when in a video call.

Despite the move to being virtual, I'm still both excited to attend and mildly terrified about presenting the things I've been doing to a potentially large audience.

Looking forward, at the suggestion of my supervisor I plan to finish writing up that AI+HADR paper I mentioned in the last update as a journal article instead, and then submit that. While I'm working through the review process for that paper, I'll return to the Temporal Autoencoder / rainfall radar subproject and work on implementing my idea for it.

Tweet sentiment analysis

After the rather rushed analysis of the data in the last post, I've now taken the time to analyse the data more thoroughly. A number of things became apparent here, but first let's look at the research questions I've asked:

  1. Is there a more negative response to more sudden / severe floods?
  2. Can we classify images by the sentiment of the associated tweet?

Answering these questions has not been straightforward, but I'm now at a point where I have some preliminary answers (which, I stress, are not double checked and not peer reviewed).

Unfortunately, the answer to the first question there is that it can't easily be answered with the current data available. As it turns out, I have been unable to find an objective and consistent measure of how severe or sudden a flood was.

One might think that, say, the amount of insurance damages would be a good choice. This doesn't work out though, because not all flooding events have (public) damage estimates. Those that do are often measured against different goalposts: flood A might be measured in property damages, and flood B in economic impact.

Another measure I investigated was the number of homes destroyed or people displaced. This too as it turns out has multiple issues. For one, while for some floods multiple estimates are available they don't always agree. For a single given flood estimates might range from 800 to 2400 homes destroyed, and are often measured at different points in the history of the given flooding event, and it's sometimes unclear at what stage in a flood's lifecycle the estimate was made.

Even if such estimates were consistent (which they really aren't), there's another issue too: they are often limited to country borders. For example, a government agency may estimate the number of homes destroyed for their country, but not other countries. This is totally reasonable: a government is concerned first and foremost with the people within its borders. Sadly floods, storms, and hurricanes rarely discriminate across such borders. Take Storm Christoph for example. It hit the UK in January 2021, but after that it continued on and hit Scandinavia too.

The other question above is thankfully much easier to answer - it deserves its own section in this post though, so see below for that. It's not all bad news on the tweet sentiment analysis front though - I did find some unexpected results while I was analysing the data. To explain, let's look at a fancy new chart I've plotted:

(Above: Various floods and their overall sentiment - both with replies included and excluded.)

This fancy chart shows the overall sentiment of a number of different flooding events, with replies included (going down) and excluded (going up). What's fascinating here is that it appears to suggest that the replies contribute significantly towards the percentage of positive tweets. Perhaps people are most likely to tweet words of encouragement (e.g. "stay safe :hugs:")?

With this in mind, I correlated the total number of tweets made in a flood with the overall sentiment (as a percentage), and got a Pearson correlation coefficient of -0.54, which indicates a medium correlation. Apparently, if more people tweet about a flood, it's more likely to have a more positive overall sentiment. If replies are excluded, it works out to -0.31, which would indicate a weaker negative correlation.
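For reference, a Pearson correlation coefficient like the ones quoted above can be computed with a few lines of plain Python. This is a minimal sketch with a hypothetical `pearson` helper, and the sample numbers are made up for illustration; they are not values from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up numbers for illustration only: one entry per flood
tweet_counts = [1200, 5400, 800, 15000, 3100]
sentiment_percent = [62.0, 48.5, 66.0, 41.0, 55.0]
print(f"r = {pearson(tweet_counts, sentiment_percent):.2f}")
```

The result always lies between -1 and 1: values near 0 mean no linear relationship, and the sign says whether one quantity tends to rise or fall as the other rises.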

Image classification

Another thing I've been working on is classifying images associated with tweets using the sentiment of the tweet as a label. With roughly 175K images to work with, this has proved to be a useful exercise - resulting in ~75.23% validation accuracy (rising to 96.9% training accuracy by epoch 50%, suggesting it's overfitting and I need more data to improve validation accuracy any further) over 9 epochs. While unfortunately I can't share a sample of positive/negative images predicted by the model due to data privacy rules, I can talk about the structure of the model and show a confusion matrix.

I started out with a Compact Convolutional Transformer model as it looks very cool and has some significant benefits over more traditional model architectures, but there's a bug in my implementation somewhere I can't spot, and it only yields 10% to 20% accuracy on Fashion MNIST. To avoid wasting too much time, I'm now using a prebuilt ResNet50 initialised with random weights that takes in images in the size 128 x 128 pixels (images larger or smaller than this are automatically resized without preserving aspect ratio).

While the accuracy of the model is slightly lower than that of the model that predicts the sentiment of the tweets themselves, looking at samples that I quickly threw together with a bit of Bash and ImageMagick and the confusion matrix (see below) reveals that it is in fact doing something useful. Here's that Bash 2-liner:

shuf path/to/tweets-labelled.tsv | awk '/positive$/ { print("/absolute/path/to/media_dir/" $1); }' | head -n64 >/tmp/sample-pos.txt;
montage @/tmp/sample-pos.txt -geometry 960x540+10+10 -tile 8x8 /tmp/sample-pos.jpeg

The @path/to/file.txt syntax in the montage command call there reads a list of filepaths from a file instead of directly specifying them on the command line. By replacing positive in the awk filter with negative and sample-pos.txt with sample-neg.txt, the same procedure can also generate a random sample for the negative category.

From looking at a single sample of 64 positive images and 64 negative images generated using the method above, positive images generally include:

  • Cats (by far the most popular of course)
  • Cupcakes
  • People, some of whom are helping others in a flood

Whereas negative images generally include:

  • Floods
  • Damage to homes, buildings, etc

My next immediate step here is to plot a confusion matrix to better understand how the model is performing, as I'm slightly concerned that it's ignoring the minority class (in this case, positive tweets). I've mostly completed this already, but it's just not quite ready to show here yet as I need to double check and revise some stuff.
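To show why a confusion matrix helps with the minority class concern, here's a small sketch in plain Python. The `confusion_matrix` and `recall_per_class` helpers and the sample labels are hypothetical: they describe an imaginary model that almost always answers "negative", not the actual ResNet50 results.

```python
from collections import Counter

LABELS = ("positive", "negative")


def confusion_matrix(actual, predicted, labels=LABELS):
    """Count (actual, predicted) pairs: rows are actual, columns predicted."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def recall_per_class(matrix, labels=LABELS):
    """Fraction of each actual class that the model labelled correctly."""
    return {
        label: (row[index] / sum(row) if sum(row) else 0.0)
        for index, (label, row) in enumerate(zip(labels, matrix))
    }


# Hypothetical labels: 3 positive and 7 negative examples
actual = ["positive"] * 3 + ["negative"] * 7
predicted = ["negative", "negative", "positive"] + ["negative"] * 7

matrix = confusion_matrix(actual, predicted)
recall = recall_per_class(matrix)
print(matrix)
print(recall)
```

In this toy case 8 of the 10 predictions are right, so overall accuracy looks fine, yet only 1 of the 3 positive examples is found. That gap between overall accuracy and per-class recall is what the confusion matrix makes visible.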

Of course, given that the source dataset is very noisy (social media data generally is) and relatively difficult for AI models to understand, I think this is a good result.

From here, if I have time I'd like to combine this image classification model with the earlier tweet sentiment analyser model to create a single model that can more accurately predict the sentiment of both the text of a tweet and the associated image at the same time. To do this, I'm probably going to investigate and use CLIP - more on this in a future post.

Conclusion

While I still haven't done anything with the Temporal Autoencoder due to other priorities, I'm hoping to return to it once I've wrapped up this social media section of the project. I have made significant progress on analysing the social media data - both the textual tweets and the associated images, and I plan to combine the models I've trained to classify both text and images into a single model. It's not the end of the road yet though: while I've found some answers, they are just leading to more questions.

Add your blog to hullblogs.com

For many years, csblogs.com has aggregated posts from the blogs of current and past students of the University of Hull. Unfortunately, recently the site has started to generate random error messages. To fix this, Freeside (of which I'm the head, as it turns out) have decided to implement a replacement.

It looks rather cool if I do say so myself, so I thought I'd share it here. I'll talk first about how you can add your blog to it and why it's a good idea to start one yourself (and how you can do so for free!). Then, I'll talk a little bit at the end about how I built hullblogs.com and how it's put together.

@closebracket had the idea to call it hullblogs.com, and then I implemented the code for it:

(Above: A screenshot of hullblogs.com)

You can visit it here: hullblogs.com.

If you're a current or past student at the University of Hull, then we want to hear from you! If you've got a blog (and even if you haven't yet!), then you can add your blog by going to hullblogs.com, and then clicking "add your blog".

If you don't yet have a blog, then we have you covered. Our guide has a number of really easy ways to get started and host a blog for free! There are multiple ways to host a blog without paying a penny (no hidden charges after some time either).

But I don't have a blog!

If you're not convinced about setting up a blog and putting it on the Internet, there are a number of key reasons you should:

  • It's not about other people reading it. By setting up a blog, you don't need to have 100s of views. What matters is that you write for you - not for anyone else.
  • If only 1 person reads your blog, then it's worth it. Your personal blog and/or website makes a great addition to your CV. It's a wonderful way of documenting the things you're doing and learning while doing your degree and beyond - and a great way for potential employers to find more detail on all this information in 1 place.
  • Write for yourself. Personally, I find my blog a great place to post about difficult problems that I've encountered and solved. Then when I encounter a similar problem again, I can re-read my own blog post about it. Because I'm the one who wrote the post, I can anticipate what I may find confusing and explain that in more detail to help out future me.
  • Improve your technical writing skills. Technical writing - the process of writing about complex technical subjects (whether this is Computer Science or beyond) - is an invaluable skill. Writing documentation, communicating complex ideas, and helping others are all situations which can call for technical writing - and it's a great thing to put on your CV too. Just doing a thing isn't enough - being able to write about it and document it is really important when working, for example, on commercial (or even academic) work.

If you're not computer science oriented, then Wordpress, Blogger, or maybe Squarespace [unclear if squarespace is free or not] would be a great place to start your blogging journey. You can host a blog for free too!

If you are computer science oriented (I'd guess most of the people reading this probably are given the kinds of posts I write for my blog here), then I strongly recommend Eleventy. It's a static site generator, and there are a whole bunch of ready-made templates for you to use to get started quickly.

In fact, hullblogs.com itself is built with Eleventy, using feedme to parse RSS/Atom feeds.

Every night, the site will be rebuilt. During this process, it will download the RSS/Atom feed from all the blogs registered, and order the posts chronologically. If an image is associated with a post, this will also be downloaded - the site also looks for the first image in a post's content if an image isn't explicitly specified to be attached to a blog. Once ordered, the posts are then paginated (split into multiple pages) and the rest of the site is rendered.
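The ordering and pagination step can be pictured with a short sketch. This is written in Python purely for illustration and is not the site's actual code (hullblogs.com is built with Eleventy); the `paginate` function, the page size, the newest-first ordering, and the sample posts are all assumptions.

```python
from datetime import datetime


def paginate(posts, page_size):
    """Order posts newest-first, then split them into fixed-size pages."""
    ordered = sorted(posts, key=lambda post: post["published"], reverse=True)
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


# Hypothetical posts, as if already parsed out of several blogs' feeds
posts = [
    {"title": "Post A", "published": datetime(2021, 11, 1)},
    {"title": "Post B", "published": datetime(2021, 12, 24)},
    {"title": "Post C", "published": datetime(2021, 10, 5)},
    {"title": "Post D", "published": datetime(2021, 12, 2)},
    {"title": "Post E", "published": datetime(2021, 9, 17)},
]

pages = paginate(posts, page_size=2)
for number, page in enumerate(pages, start=1):
    print(number, [post["title"] for post in page])
```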

Being a static site, once the (re)build is complete hosting the site is simple and secure. Currently, Freeside hosts it using Nginx to statically serve it.

If your feed is malformed/contains errors or your blog is offline don't worry! hullblogs.com will transparently ignore your feed when it rebuilds. Then, once your blog is back up and running, it will add it back into the main hullblogs.com site the next night when it rebuilds again.

If you're interested in digging into the source code of the site, I've open sourced it on GitHub under the Apache 2.0 Licence: https://github.com/FreesideHull/hullblogs.com

If you've read this far, thanks so much for reading! We're after a better favicon logo for hullblogs.com. If you've got an idea, please do let us know by opening an issue.

PhD Update 10: Sharing with the world

Hey there - is it that time already? Another PhD update blog post! And in double digits too! In this one, I'll be talking mainly about the social media project I've been working on, as I haven't had time to work on the temporal CNN much since last time. I've also taken some time off in August, so technically this is only ~just over 1 month's worth of progress here.

Before we begin, here's a list of posts in this series so far. They give useful context - you probably won't understand this one without reading them first.

As with the last post, none of the graphs here are finalised, and are work in progress. There ~~may~~ probably are multiple nasty bugs that invalidate the results that I haven't found yet.

AAAI Doctoral Consortium 2022

The main thing I've done since last time is apply for the AAAI Doctoral Consortium 2022. As I understand it, this is a specific part of the main AAAI conference (Association for the Advancement of Artificial Intelligence) that is designed for PhD student researchers. To apply, you have to submit a cover page, your CV, and a 2 page thesis summary.

The cover page wasn't too bad, and updating my CV and getting it checked by the careers service at my university was simply a case of doing it. The thesis summary on the other hand was more of a challenge than I expected - it's quite difficult to summarise your entire 3 years of (planned) work in just 2 pages! It helped me to picture it as a high-level overview / conversation starter rather than a free-standing paper in its own right - even if it looks like one.

Although this took longer than I thought to prepare, I did in fact get my submission together in time. I was glad that I left some extra time at the end though, as the rules were both very strict for what you can and can't do with your paper and unclear since they were written for the main AAAI conference. In addition I found at least 2 mistakes in the instructions and rules which appeared to be left over from previous years and hadn't been updated.

AI + HADR 2021

The other conference which I attempted to apply for at the suggestion of my supervisor is AI + HADR 2021. This stands for Artificial Intelligence for Humanitarian Assistance and Disaster Response, and it's a virtual workshop that is happening in December 2021. The submission calls for a 6 page paper (5 for content, and 1 for references) on things related to disaster response.

My plan here was to write a paper on the social media work I've been doing and submit that, with the idea of talking about the existing sentiment analysis I've done (see update 9) and focusing on answering a research question using the sentiment analysis model I've implemented.

Unfortunately, things didn't go to plan as I ran out of time - even with my supervisor helping to write some of the parts of the paper. After finding a number of bugs in my data processing pipeline and failing to see any obvious trends in the sentiment analysis graphs I plotted, I realised that it was unlikely that I was going to be able to submit for it this time around.

The plan from here is to take some more time over it and answer my other research question instead, and tidy up and re-use our existing unfinished submission here for IJCNN 2022 (I think that's the right link) - the submission deadline for which I think is going to be in January 2022.

Social media

While I've spent time writing submissions for various conferences and workshops, I have also been doing a bit of social media data analysis too.

To start with, I implemented a new endpoint for labelling tweets using a given saved model checkpoint. After using this to label the various datasets I've acquired (with the best transformer-based model checkpoint I have, which I think is ~78% accurate - got to double check everything to make sure I haven't mixed anything up), I then got to work plotting some graphs. To start with, I plotted a simple bar graph of the overall sentiment of the different datasets I've downloaded.

(Above: A bar graph showing the overall sentiment of some of the floods in my dataset.)

I'm not really sure what to make of this - I suspect context-specific information is required to fully interpret this. I couldn't find any reliable context-specific information on short notice for the AI + HADR paper, so I'm going to keep looking if I can find the time to do so. Asking someone from the energy and environment institute may be a good idea.

After this, I binned the tweets over time and used this to plot a combined graph showing both the tweet frequency and sentiment over time using Gnuplot.

(Above: A graph showing the frequency (blue) and sentiment (green and red) of tweets over time.)
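As a rough illustration of the binning step (not the actual code behind these graphs), here is a minimal Python sketch. It assumes each labelled tweet is a pair of a timezone-aware timestamp and a sentiment label; the function name `bin_tweets`, the 1 hour bin width, and the example tweets are all made up for the example. The output is one tab-separated row per bin, which is the kind of file Gnuplot can read directly.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone


def bin_tweets(tweets, bin_width):
    """Group (created_at, label) pairs into fixed-width time bins."""
    width = bin_width.total_seconds()
    bins = defaultdict(Counter)
    for created_at, label in tweets:
        # Round the timestamp down to the start of the bin it falls in
        bin_start = created_at.timestamp() // width * width
        bins[datetime.fromtimestamp(bin_start, tz=timezone.utc)][label] += 1
    return dict(sorted(bins.items()))


# Made-up example tweets, not real data
tweets = [
    (datetime(2021, 1, 20, 10, 15, tzinfo=timezone.utc), "positive"),
    (datetime(2021, 1, 20, 10, 45, tzinfo=timezone.utc), "negative"),
    (datetime(2021, 1, 20, 11, 5, tzinfo=timezone.utc), "negative"),
]
binned = bin_tweets(tweets, timedelta(hours=1))

# One tab-separated row per bin: start time, tweet count, positive, negative
for bin_start, counts in binned.items():
    print(bin_start.isoformat(), sum(counts.values()),
          counts["positive"], counts["negative"], sep="\t")
```

The tweet count column gives the frequency line, and the positive / negative columns give the sentiment over time.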

Again, I'm not sure what to make of this. Even cropping out the long tail of people talking about the flooding afterwards, to make the sentiment during the actual event easier to see, doesn't seem to uncover any clear trends. It could be said that for Hurricane Iota the sentiment got more positive over time at the beginning, but this doesn't hold true for Storm Christoph and others - and without context-specific information it's difficult to tell if there are any meaningful conclusions that can be drawn here.

To this end, after talking with my supervisor we've got some idea of things I'm going to try - so more on this in an upcoming PhD update blog post.

Finally, I've also had a discussion with my supervisor and when I'm ready to publish something on my social media work, I'm going to make the code behind it open source (probably either GPLv3 or MPL-2.0). If you're interested, the code for downloading tweets using Twitter's Academic API is already open source: https://www.npmjs.com/package/twitter-academic-downloader.

Temporal CNN

I haven't really done anything on the Temporal CNN since last time, but I wanted to make sure it wasn't left out of this post! It's definitely still on the cards - the plan is that once I've got this data analysis done and some meaningful social media results, I'm going to return to the Temporal CNN and put the plan I described in the last post into action.

Conclusion

Since last time I've mainly been writing and analysing social media data. While I didn't manage to apply to AI + HADR 2021, I did manage to submit for AAAI Doctoral Consortium 2022 - I'll find out if I've been accepted on the 15th October 2021.

Next up, I'm going to be working on answering my other research question first - more on this in a later blog post. If I have time, I'll put some effort into the Temporal CNN - though I doubt I'll have anything to show on that front next time. Finally, I'm going to be arranging my PhD Panel 4 (where did all the time go?) - hopefully for before the end of November, availability of those involved permitting.

If you are finding this series of blog posts on my PhD interesting, please do comment below. It's great to see that the stuff I'm working on for my PhD is actually interesting to someone.

Sources and further reading

PhD, Update 9: Results?

Hey, it's time for another PhD update blog post! Since last time, I've been working on tweet classification mostly, but I have a small bit to update you on with the Temporal CNN.

Here's a list of posts in this series so far before we continue. If you haven't already, you should read them as they give some context for what I'll be talking about in this post.

Note that the results and graphs in this blog post are not final, and may change as I further evaluate the performance of the models I've trained (for example I recently found and fixed a bug where I misclassified all tweets without emojis as positive tweets by mistake).

Tweet classification

After trying a bunch of different tweaks and variants of models, I now have an AI model that classifies tweets. It's not perfect, but I am reaching ~80% validation accuracy.

The model itself is trained to predict the sentiment of a tweet, with the label coming from a preset list of emojis I compiled manually (by looking at a list of the emojis in the dataset itself, which I wrote a quick Bash one-liner to calculate). Each tweet is classified by adding up how many emojis in each category it has in it, and the category with the most emojis is then ultimately chosen as the label to predict. Here's the list of emojis I've compiled:

Category Emojis
positive 😂👍🤣❤👏😊😉😁😘😍🤗💕😀💙💪✅🙌💚👌🙂😎😆😅☺😃😻💖💋💜😹😜♥😄💛😽✔👋💗✨🌹🎉👊😋😏🌞😇🎶⭐💞😺😸🖤🌸💐🍀🌼🌟🤝🌷🐱🤓😌😛😙
negative 🥶⚠💔😰🙄😱😭😳😢😬🤦😡😩🙈😔☹😮❌😣😕😲😥😞😫😟😴🙀😒🙁😠😪😯😨👎🤢💀😤😐😖😝😈😑😓😿😵😧😶😦🤥🤐🤧🌧☔💨🌊💦🚨🌬🌪⛈🌀❄💧🌨⚡🦆🌩🌦🌂

These categories are stored in a tab separated values file (TSV), allowing the program I've written that trains new models to be completely customisable. Given that there were a significant number of tweets with flood-related emojis (🌧☔💨🌊💦🚨🌬🌪⛈🌀❄💧🌨⚡🦆🌩🌦🌂) in my dataset, I tried separating them out into their own class, but it didn't work very well - more on this later.
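To make the labelling rule concrete, here is a minimal Python sketch of the idea. This is an illustration rather than the actual implementation: the function names, the shortened category list, and the assumption that the TSV has one `category<TAB>emojis` row per category are all hypothetical. How ties are handled isn't described above, so this sketch simply leaves tied tweets unlabelled.

```python
import csv
import io
from collections import Counter

# Hypothetical, heavily shortened category file (the real one has many more emojis)
CATEGORIES_TSV = "positive\t😂👍😊\nnegative\t😭😡💔\n"


def load_categories(tsv_text):
    """Parse 'category<TAB>emojis' rows into an emoji -> category lookup."""
    lookup = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        category, emojis = row[0], row[1]
        # Simplification: treats every code point as one emoji, which is
        # not true for emojis made of several code points
        for emoji in emojis:
            lookup[emoji] = category
    return lookup


def label_tweet(text, lookup):
    """Return the category with the most emojis in the tweet, or None."""
    counts = Counter(lookup[char] for char in text if char in lookup)
    if not counts:
        return None  # no known emoji, so the tweet stays unlabelled
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: ties are dropped rather than guessed
    return ranked[0][0]


lookup = load_categories(CATEGORIES_TSV)
print(label_tweet("Roads are flooded again 😭😭 😂", lookup))
```

Because the categories live in a separate file, adding a third class (such as flood emojis) only means adding a row, not changing the code.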

The dataset I've downloaded comes from a number of different queries and hashtags, and is comprised of just under 1.4 million tweets, of which ~90K tweets have at least 1 emoji in the above categories.

A pair of pie charts showing statistics about the twitter dataset.

Tweets were downloaded with a tool I implemented using a number of flood related hashtags and query strings - from generic ones like #flood to specific ones such as #StormChristoph.

Of the entire dataset, ~6.44% contain at least 1 emoji. Of those, ~49.57% are positive, and ~50.43% are negative:

Category Number of tweets
No emoji 1306061
Negative 45367
Positive 44585
Total tweets 1396013
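The percentages quoted above follow directly from the counts in this table. A quick Python check (variable names are just for illustration):

```python
# Tweet counts copied from the table above
counts = {"No emoji": 1306061, "Negative": 45367, "Positive": 44585}

total = sum(counts.values())
with_emoji = counts["Negative"] + counts["Positive"]

# Share of all tweets that have at least 1 emoji from the categories
emoji_share = 100 * with_emoji / total
# Shares within the tweets that have an emoji
positive_share = 100 * counts["Positive"] / with_emoji
negative_share = 100 * counts["Negative"] / with_emoji

print(f"{total} tweets, {with_emoji} with an emoji ({emoji_share:.2f}%)")
print(f"positive: {positive_share:.2f}%, negative: {negative_share:.2f}%")
```

The two classes are very nearly balanced, which means plain accuracy is a reasonable metric to compare models with here.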

The models I've trained so far are either a transformer, or a 2-layer LSTM with 128 units per layer - plus a dense layer with a softmax activation function at the end for classification. If batch normalisation is enabled, every layer except the dense layer has a batch normalisation layer afterwards:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bidirectional (Bidirectional (None, 100, 256)          336896    
_________________________________________________________________
batch_normalization (BatchNo (None, 100, 256)          1024      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               394240    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
dense (Dense)                (None, 2)                 514       
=================================================================
Total params: 733,698
Trainable params: 732,674
Non-trainable params: 1,024
_________________________________________________________________
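The parameter counts in this summary can be reproduced by hand, which is a handy sanity check. The sketch below uses the standard formulas for LSTM, batch normalisation, and dense layers. Note that the input size of 200 per timestep is an assumption: it isn't stated above, but it is the value that makes the first layer's count work out.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * units * (input_dim + units + 1)


def bidirectional_lstm_params(input_dim, units):
    # A forwards and a backwards LSTM, with outputs concatenated
    return 2 * lstm_params(input_dim, units)


def batch_norm_params(features):
    # (trainable gamma + beta, non-trainable moving mean + variance)
    return 2 * features, 2 * features


def dense_params(input_dim, units):
    return input_dim * units + units


INPUT_DIM = 200  # assumption: inferred from the first layer's parameter count
UNITS = 128

layer_1 = bidirectional_lstm_params(INPUT_DIM, UNITS)
layer_2 = bidirectional_lstm_params(2 * UNITS, UNITS)
bn_trainable, bn_frozen = batch_norm_params(2 * UNITS)
dense = dense_params(2 * UNITS, 2)

total = layer_1 + layer_2 + 2 * (bn_trainable + bn_frozen) + dense
non_trainable = 2 * bn_frozen

print(layer_1, layer_2, dense, total, total - non_trainable, non_trainable)
```

The output size of 256 for each bidirectional layer comes from concatenating 2 x 128 units, which is also why the second layer and the dense layer both see 256 input features.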

I tried using more units and varying the number of layers, but it didn't help much (in hindsight this was before I found the bug where tweets without emojis were misclassified as positive).

Now for the exciting part! I split the tweets that had at least 1 emoji into 2 datasets: 80% for training, and 20% for validation. I trained a few different models tweaking various different parameters, and got the following results:

2 categories: training accuracy

2 categories: validation accuracy

Each model has a code that is comprised of multiple parts:

  • c2: 2 categories
  • bidi: LSTM, bidirectional wrapper enabled
  • nobidi: LSTM, bidirectional wrapper disabled
  • batchnorm: LSTM, batch normalisation enabled
  • nobatchnorm: LSTM, batch normalisation disabled
  • transformer: Transformer (just the encoder part)

It seems that if you're an LSTM, being bidirectional does result in a boost to both stability and performance (performance here meaning how well the model works, rather than how fast it runs - that would be efficiency instead). Batch normalisation doesn't appear to help all that much - I speculate this is because the models I'm training really aren't all that deep, and batch normalisation apparently helps most with deep models (e.g. ResNet).

For completeness, here are the respective graphs for 3 categories (positive, negative, and flood):

3 Categories: Training accuracy

3 Categories: Validation accuracy

Unfortunately, 3 categories didn't work quite as well as I'd hoped. To understand why, I plotted some confusion matrices at epochs 11 (left) and 44 (right). In both cases I've picked out the charts for bidi-nobatchnorm, as it was the best performing (see above).

2 Categories: confusion matrices

3 Categories: confusion matrices

The 2 category model looks to be reasonably accurate with no enormous issues, as we expected from the ~80% accuracy from above. Twice the number of negative tweets are misclassified as positive compared to the other way around for some reason - it's possible that negative tweets are more difficult to interpret. Given the noisy dataset (it's social media and natural language, after all), I'm pretty happy with this result.
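For anyone unfamiliar with confusion matrices, here is a minimal Python sketch of how one is built from a list of true labels and a list of predictions. The labels below are made up purely to show the layout (rows are the actual class, columns the predicted class) and are not the real results from the charts above.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are the actual class, columns are the predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


labels = ["positive", "negative"]
# Made-up toy data, not real model output
actual = ["positive", "positive", "negative", "negative", "negative"]
predicted = ["positive", "negative", "positive", "positive", "negative"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row, sep="\t")
```

In this toy example the bottom-left cell (negative tweets predicted as positive) is twice the top-right cell (positive tweets predicted as negative), which is the kind of asymmetry described above.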

The 3 category model has fared less brilliantly. Although we can clearly see it has improved over the training process (and would likely continue to do so if I gave it longer than 50 epochs to train), its accuracy is really limited by its tendency to overfit - preventing it from performing as well as the 2 category model (which, incidentally, also overfits).

Although it classifies flood tweets fairly accurately, it seems to struggle more with positive and negative tweets. This leads me to suspect that training a model on flood / not flood might be worth trying - though as always I'm limited by a lack of data (I have only ~16K tweets that contain a flood emoji). Perhaps searching twitter for flood emojis directly would help gather additional data here.

Now that I have a model that works, I'm going to move on to answering some research questions with this model. To do so, I've upgraded my twitter-academic-downloader to download information about attached media - downloading attached media itself deserves its own blog post. After that, I'll be looking to publish my first ever scientific papery thing (don't know the difference between journal articles and other types of publication yet), so that is sure to be an adventure.

Temporal CNN(?)

In Temporal CNN land, things have been much slower. In my latest attempt to fix the accuracy issues I've described in detail in previous posts, I've implemented a vanilla convolutional autoencoder by following this tutorial (the links to the full code behind the tutorial are broken, and it's really annoying). This does image to image translation, so it would be ideal to modify to support my use-case. After implementing it with Tensorflow.js, I hit a snag:

Autoencoder: expected vs actual

After trying all the loss functions I could think of and all sorts of other variations of the model, for sanity's sake I tried implementing the model in Python (the original language used in the blog post). I originally didn't want to do this, since all the data preprocessing code I've written so far is in Javascript / Node.js, but within half an hour with some copying and pasting, I had a very quick model implemented. After training a teeny model for just 7 epochs on the Fashion MNIST dataset, I got this:

Autoencoder: epoch 7 in Tensorflow for Python

...so clearly there's something different in Tensorflow.js compared to Tensorflow for Python, but after comparing them dozens of times, I'm not sure I'll be able to spot the difference. It's a shame to give up on the Tensorflow.js version (it also has a very nice CLI I implemented), but given how long I've spent in the background on this problem, I'm going to move forwards with the Python version. To do this, I'm going to work on converting my existing data preprocessing code into a standalone CLI for converting the data into a format that I can consume in Python with minimal effort.

Once that's done, I'm going to expand the quick autoencoder Python script I implemented into a proper CLI program. This involves adding:

  • A CLI argument parser (argparse is my weapon of choice here for now in Python)
  • A settings system that has a default and a custom (TOML) config file
  • Saving multiple pieces of data to the output directory:
    • A summary of the model structure
    • The settings that are currently in use
    • A checkpoint for each epoch
    • A TSV-formatted metrics file
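As a rough idea of what the skeleton of such a CLI could look like, here is a minimal Python sketch covering the argument parser, the default + custom settings, and the TSV metrics file. Everything here is hypothetical: the setting names, the `--config` and `--output` flags, and the metrics columns are placeholders, not the real program. `tomllib` is in the standard library from Python 3.11 onwards.

```python
import argparse
import copy
import csv
from pathlib import Path

# Hypothetical defaults, purely for illustration
DEFAULT_SETTINGS = {"train": {"epochs": 50, "batch_size": 64}}


def merge_settings(defaults, custom):
    """Recursively overlay custom settings on top of a copy of the defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_settings(config_path=None):
    """Return the defaults, overlaid with a custom TOML file if one is given."""
    if config_path is None:
        return merge_settings(DEFAULT_SETTINGS, {})
    import tomllib  # standard library from Python 3.11 onwards
    with open(config_path, "rb") as handle:
        return merge_settings(DEFAULT_SETTINGS, tomllib.load(handle))


def build_parser():
    parser = argparse.ArgumentParser(description="Train an autoencoder (sketch)")
    parser.add_argument("--config", type=Path, help="Custom TOML settings file")
    parser.add_argument("--output", type=Path, required=True,
                        help="Directory to save results to")
    return parser


def write_metrics(output_dir, rows):
    """Write one row of metrics per epoch to metrics.tsv in the output directory."""
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(output_dir / "metrics.tsv", "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["epoch", "loss", "val_loss"])
        writer.writerows(rows)
```

Keeping the defaults in one place and only overriding what the custom file mentions means the settings actually in use can be dumped to the output directory in full for every run.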

The existing script also spits out a sample image (such as the one above - just without the text) for each epoch too, which I've found to be very useful indeed. I'll definitely be looking into generating more visualisations on-the-fly as the model is training.

All of this will take a while (especially since this isn't the focus right now), so I'm a bit unsure if I'll get all of this done by the time I write the next post. If this plan works, then I'll probably have to come up with a new name for this model, since it's not really a Temporal CNN. Suggestions are welcome - my current thought is maybe "Temporal Convolutional Autoencoder".

Conclusion

Since the last post I've made on here, I've made much more progress on stuff than I was expecting to have done. I've now got a useful(?) model for classifying tweets, which I'm going to use to answer some research questions (more on that in the next post - this one is long already haha). I've also got an Autoencoder training properly for image-to-image translation as a step towards getting 2D flood predictions working.

Looking forwards, I'm going to be answering research questions with my tweet classification model and preparing to write something to publish, and expanding on the Python autoencoder for more direct flood prediction.

Curious about the code behind any of these models and would like to check them out? Please get in touch. While I'm not sure I can share the tweet classifier itself yet (though I do plan on releasing it as open source after a discussion with my supervisor when I publish), I can probably share my various autoencoder implementations, and I can absolutely give you a bunch of links to things that I've used and found helpful.

If you've found this interesting or have some suggestions, please comment below. It's really motivating to hear that what I'm doing is actually interesting / useful to people.

Sources and further reading
