PhD, Update 9: Results?
Hey, it's time for another PhD update blog post! Since last time, I've been working on tweet classification mostly, but I have a small bit to update you on with the Temporal CNN.
Here's a list of posts in this series so far before we continue. If you haven't already, you should read them, as they give some context for what I'll be talking about in this post.
- PhD Update 1: Directions
- PhD Update 2: The experiment, the data, and the supercomputers
- PhD Update 3: Simulating simulations with some success
- PhD Update 4: Ginormous Data
- PhD Update 5: Hyper optimisation and frustration
- PhD Update 6: The road ahead
- PhD Update 7: Just out of reach
- PhD Update 8: Eggs in Baskets
Note that the results and graphs in this blog post are not final, and may change as I further evaluate the performance of the models I've trained (for example, I recently found and fixed a bug where all tweets without emojis were misclassified as positive).
Tweet classification
After trying a bunch of different tweaks and variants of models, I now have an AI model that classifies tweets. It's not perfect, but I am reaching ~80% validation accuracy.
The model itself is trained to predict the sentiment of a tweet, with the label coming from a preset list of emojis I compiled manually (by looking at a list of emojis present in the dataset itself, which I calculated with a quick Bash one-liner). Each tweet is classified by adding up how many emojis from each category it contains, and the category with the most emojis is ultimately chosen as the label to predict. Here's the list of emojis I've compiled:
Category | Emojis |
---|---|
positive | ๐๐๐คฃโค๐๐๐๐๐๐๐ค๐๐๐๐ชโ ๐๐๐๐๐๐๐ โบ๐๐ป๐๐๐๐น๐โฅ๐๐๐ฝโ๐๐โจ๐น๐๐๐๐๐๐๐ถโญ๐๐บ๐ธ๐ค๐ธ๐๐๐ผ๐๐ค๐ท๐ฑ๐ค๐๐๐ |
negative | ๐ฅถโ ๐๐ฐ๐๐ฑ๐ญ๐ณ๐ข๐ฌ๐คฆ๐ก๐ฉ๐๐โน๐ฎโ๐ฃ๐๐ฒ๐ฅ๐๐ซ๐๐ด๐๐๐๐ ๐ช๐ฏ๐จ๐๐คข๐๐ค๐๐๐๐๐๐๐ฟ๐ต๐ง๐ถ๐ฆ๐คฅ๐ค๐คง๐งโ๐จ๐๐ฆ๐จ๐ฌ๐ชโ๐โ๐ง๐จโก๐ฆ๐ฉ๐ฆ๐ |
These categories are stored in a tab separated values file (TSV), allowing the program I've written that trains new models to be completely customisable. Given that there were a significant number of tweets with flood-related emojis (๐งโ๐จ๐๐ฆ๐จ๐ฌ๐ชโ๐โ๐ง๐จโก๐ฆ๐ฉ๐ฆ๐) in my dataset, I tried separating them out into their own class, but it didn't work very well - more on this later.
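To illustrate how the labelling works, here's a minimal sketch of the emoji-counting approach. The TSV filename and the tiny emoji sets below are placeholders for illustration rather than the real category file, and it glosses over complications like multi-codepoint emoji sequences:

```python
# Minimal sketch of labelling tweets by counting category emojis.
# The filename and emoji sets are illustrative placeholders.
import csv

def load_categories(path="categories.tsv"):
    # Each row: category name <tab> string of emojis belonging to that category
    with open(path, newline="") as handle:
        return {row[0]: set(row[1]) for row in csv.reader(handle, delimiter="\t")}

def label_tweet(text, categories):
    # Count the emojis from each category, then pick the category with the most
    counts = {name: sum(text.count(emoji) for emoji in emojis)
              for name, emojis in categories.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None  # None = no emoji, tweet is skipped

# Example usage with placeholder emojis
categories = {"positive": set("😀😂❤"), "negative": set("😭😡💔")}
print(label_tweet("Flooded again 😭😭 but the neighbours helped ❤", categories))
# -> "negative" (two negative emojis vs one positive)
```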
The dataset I've downloaded comes from a number of different queries and hashtags, and is comprised of just under 1.4 million tweets, of which ~90K tweets have at least 1 emoji in the above categories.
Tweets were downloaded with a tool I implemented, using a number of flood-related hashtags and query strings - from generic ones like `#flood` to specific ones such as `#StormChristoph`.
Of the entire dataset, ~6.44% of tweets contain at least 1 emoji. Of those, ~49.57% are positive, and ~50.43% are negative:
Category | Number of tweets |
---|---|
No emoji | 1306061 |
Negative | 45367 |
Positive | 44585 |
Total tweets | 1396013 |
The models I've trained so far are either a transformer, or a 2-layer LSTM with 128 units per layer - plus a dense layer with a softmax activation function at the end for classification. If batch normalisation is enabled, every layer except the dense layer has a batch normalisation layer after it:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bidirectional (Bidirectional (None, 100, 256) 336896
_________________________________________________________________
batch_normalization (BatchNo (None, 100, 256) 1024
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256) 394240
_________________________________________________________________
batch_normalization_1 (Batch (None, 256) 1024
_________________________________________________________________
dense (Dense) (None, 2) 514
=================================================================
Total params: 733,698
Trainable params: 732,674
Non-trainable params: 1,024
_________________________________________________________________
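For reference, here's roughly what that LSTM model looks like when defined in Keras. This is a hedged sketch rather than my exact training code - in particular, the sequence length (100) and embedding dimension (200) are inferred from the summary above:

```python
# Sketch of the 2-layer (optionally bidirectional) LSTM classifier described above.
# Sequence length and embedding dimension are assumptions inferred from the summary.
import tensorflow as tf

def make_model(seq_length=100, embedding_dim=200, units=128,
               categories=2, bidirectional=True, batch_norm=True):
    def wrap(layer):
        return tf.keras.layers.Bidirectional(layer) if bidirectional else layer

    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(seq_length, embedding_dim)))
    model.add(wrap(tf.keras.layers.LSTM(units, return_sequences=True)))
    if batch_norm:
        model.add(tf.keras.layers.BatchNormalization())
    model.add(wrap(tf.keras.layers.LSTM(units)))
    if batch_norm:
        model.add(tf.keras.layers.BatchNormalization())
    # Dense + softmax at the end for classification
    model.add(tf.keras.layers.Dense(categories, activation="softmax"))
    return model

model = make_model()
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```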
I tried using more units and varying the number of layers, but it didn't help much (in hindsight this was before I found the bug where tweets without emojis were misclassified as positive).
Now for the exciting part! I split the tweets that had at least 1 emoji into 2 datasets: 80% for training, and 20% for validation. I trained a few different models, tweaking various parameters, and got the following results:
Each model has a code name that is comprised of multiple parts:
- c2: 2 categories
- bidi: LSTM, bidirectional wrapper enabled
- nobidi: LSTM, bidirectional wrapper disabled
- batchnorm: LSTM, batch normalisation enabled
- nobatchnorm: LSTM, batch normalisation disabled
- transformer: Transformer (just the encoder part)
It seems that if you're an LSTM, being bidirectional does result in a boost to both stability and performance (performance here meaning how well the model works, rather than how fast it runs - that would be efficiency instead). Batch normalisation doesn't appear to help all that much - I speculate this is because the models I'm training really aren't all that deep, and batch normalisation apparently helps most with deep models (e.g. ResNet).
For completeness, here are the respective graphs for 3 categories (positive, negative, and flood):
Unfortunately, 3 categories didn't work quite as well as I'd hoped. To understand why, I plotted some confusion matrices at epochs 11 (left) and 44 (right). For both category counts I've picked out the charts for `bidi-nobatchnorm`, as it was the best performer (see above).
The 2 category model looks to be reasonably accurate, with no enormous issues, as we'd expect from the ~80% accuracy above. For some reason, twice as many negative tweets are misclassified as positive compared to the other way around - it's possible that negative tweets are more difficult to interpret. Given the noisy dataset (it's social media and natural language, after all), I'm pretty happy with this result.
The 3 category model has fared less brilliantly. Although we can clearly see it has improved over the training process (and would likely continue to do so if I gave it longer than 50 epochs to train), its accuracy is really limited by its tendency to overfit - preventing it from performing as well as the 2 category model (which, incidentally, also overfits).
Although it classifies flood tweets fairly accurately, it seems to struggle more with positive and negative tweets. This leads me to suspect that training a model on flood / not flood might be worth trying - though as always I'm limited by a lack of data (I have only ~16K tweets that contain a flood emoji). Perhaps searching Twitter for flood emojis directly would help gather additional data here.
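In case it's useful, here's a rough sketch of how confusion matrices like these can be computed and plotted from a trained model. The function and its arguments (`model`, `x_val`, `y_val`) are assumptions for illustration, not my actual evaluation code:

```python
# Hedged sketch: plot a confusion matrix for a trained classifier on the
# validation set. Argument names and class names are assumptions.
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def plot_confusion_matrix(model, x_val, y_val, class_names=("negative", "positive")):
    # Predicted class = argmax over the softmax outputs
    y_pred = np.argmax(model.predict(x_val), axis=-1)
    y_true = np.argmax(y_val, axis=-1)  # assumes one-hot encoded labels
    matrix = tf.math.confusion_matrix(y_true, y_pred).numpy()

    plt.imshow(matrix, cmap="Blues")
    plt.xticks(range(len(class_names)), class_names)
    plt.yticks(range(len(class_names)), class_names)
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    # Write the raw counts into each cell
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            plt.text(j, i, int(matrix[i, j]), ha="center", va="center")
    plt.show()
```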
Now that I have a model that works, I'm going to move on to answering some research questions with it. To do so, I've upgraded my `twitter-academic-downloader` to download information about attached media - downloading the attached media itself deserves its own blog post. After that, I'll be looking to publish my first ever scientific papery thing (I don't yet know the difference between journal articles and other types of publication), so that is sure to be an adventure.
Temporal CNN(?)
In Temporal CNN land, things have been much slower. In my latest attempt to fix the accuracy issues I've described in detail in previous posts, I've implemented a vanilla convolutional autoencoder by following this tutorial (the links to the full code behind the tutorial are broken, which is really annoying). This does image-to-image translation, so it should be ideal to modify to support my use-case. After implementing it with Tensorflow.js, I hit a snag:
After trying all the loss functions I could think of and all sorts of other variations of the model, for sanity's sake I tried implementing the model in Python (the original language used in the blog post). I originally didn't want to do this, since all the data preprocessing code I've written so far is in Javascript / Node.js, but within half an hour of copying and pasting, I had a very quick model implemented. After training a teeny model for just 7 epochs on the Fashion MNIST dataset, I got this:
...so clearly there's something different in Tensorflow.js compared to Tensorflow for Python, but after comparing them dozens of times, I'm not sure I'll be able to spot the difference. It's a shame to give up on the Tensorflow.js version (it also has a very nice CLI I implemented), but given how long I've spent in the background on this problem, I'm going to move forwards with the Python version. To that end, I'm going to work on converting my existing data preprocessing code into a standalone CLI for converting the data into a format that I can consume in Python with minimal effort.
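For the curious, a convolutional autoencoder along the lines of that tutorial looks roughly like the sketch below when sized for Fashion MNIST - the filter counts and training settings here are illustrative rather than the exact values I used:

```python
# Rough sketch of a small convolutional autoencoder for 28x28 greyscale images.
# Filter counts, epochs, and batch size are illustrative assumptions.
import tensorflow as tf

inputs = tf.keras.Input(shape=(28, 28, 1))

# Encoder: two conv + pooling stages down to a 7x7 feature map
x = tf.keras.layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
x = tf.keras.layers.MaxPooling2D(2, padding="same")(x)
x = tf.keras.layers.Conv2D(8, 3, activation="relu", padding="same")(x)
encoded = tf.keras.layers.MaxPooling2D(2, padding="same")(x)

# Decoder: mirror the encoder with upsampling back to 28x28
x = tf.keras.layers.Conv2D(8, 3, activation="relu", padding="same")(encoded)
x = tf.keras.layers.UpSampling2D(2)(x)
x = tf.keras.layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = tf.keras.layers.UpSampling2D(2)(x)
outputs = tf.keras.layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# Train the model to reconstruct its own inputs
(x_train, _), (x_test, _) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train[..., None] / 255.0
x_test = x_test[..., None] / 255.0
autoencoder.fit(x_train, x_train, epochs=7, batch_size=128,
                validation_data=(x_test, x_test))
```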
Once that's done, I'm going to expand the quick autoencoder Python script I implemented into a proper CLI program. This involves adding:
- A CLI argument parser (`argparse` is my weapon of choice here for now in Python)
- A settings system that has a default and a custom (TOML) config file (see the sketch after this list)
- Saving multiple pieces of data to the output directory:
- A summary of the model structure
- The settings that are currently in use
- A checkpoint for each epoch
- A TSV-formatted metrics file
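Here's a quick sketch of what the argument parsing and default + custom settings layering might look like - the filenames and setting handling here are hypothetical, not the final design:

```python
# Hedged sketch of the planned CLI plumbing: argparse for arguments plus a
# default + custom TOML settings file. Filenames and flags are placeholders.
import argparse
import toml  # assumes the `toml` package is installed

def parse_args():
    parser = argparse.ArgumentParser(description="Train the autoencoder")
    parser.add_argument("--config", default=None,
                        help="Path to a custom TOML config file")
    parser.add_argument("--output", required=True,
                        help="Directory to write checkpoints and metrics to")
    return parser.parse_args()

def load_settings(custom_path=None):
    # Start from the defaults, then overlay any custom values on top
    settings = toml.load("settings.default.toml")
    if custom_path is not None:
        settings.update(toml.load(custom_path))
    return settings

if __name__ == "__main__":
    args = parse_args()
    settings = load_settings(args.config)
    print(settings)
```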
The existing script also spits out a sample image (such as the one above - just without the text) for each epoch, which I've found to be very useful indeed. I'll definitely be looking into generating more visualisations on-the-fly as the model is training.
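One way I might do that on-the-fly visualisation is with a Keras callback that saves a sample prediction at the end of every epoch - sketched below with a placeholder sample input and output directory:

```python
# Hedged sketch: save one sample prediction image per epoch via a Keras callback.
# The sample input and output directory are placeholders.
import os
import tensorflow as tf
import matplotlib.pyplot as plt

class SampleImageCallback(tf.keras.callbacks.Callback):
    def __init__(self, sample_input, output_dir):
        super().__init__()
        self.sample_input = sample_input  # shape: (1, height, width, channels)
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)

    def on_epoch_end(self, epoch, logs=None):
        # Predict on the fixed sample and write it out as a greyscale PNG
        prediction = self.model.predict(self.sample_input)
        path = os.path.join(self.output_dir, f"sample_epoch_{epoch:03d}.png")
        plt.imsave(path, prediction[0, :, :, 0], cmap="gray")

# Hypothetical usage:
# autoencoder.fit(..., callbacks=[SampleImageCallback(x_test[:1], "output/samples")])
```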
All of this will take a while (especially since this isn't the focus right now), so I'm a bit unsure if I'll get all of this done by the time I write the next post. If this plan works, then I'll probably have to come up with a new name for this model, since it's not really a Temporal CNN. Suggestions are welcome - my current thought is maybe "Temporal Convolutional Autoencoder".
Conclusion
Since the last post I've made on here, I've made much more progress than I was expecting to. I've now got a useful(?) model for classifying tweets, which I'm going to use to answer some research questions (more on that in the next post - this one is long already haha). I've also got an autoencoder training properly for image-to-image translation, as a step towards getting 2D flood predictions working.
Looking forwards, I'm going to be answering research questions with my tweet classification model and preparing to write something to publish, and expanding on the Python autoencoder for more direct flood prediction.
Curious about the code behind any of these models and would like to check it out? Please get in touch. While I'm not sure I can share the tweet classifier itself yet (though I do plan on releasing it as open-source after a discussion with my supervisor when I publish), I can probably share my various autoencoder implementations, and I can absolutely give you a bunch of links to things that I've used and found helpful.
If you've found this interesting or have some suggestions, please comment below. It's really motivating to hear that what I'm doing is actually interesting / useful to people.
Sources and further reading
- twitter-academic-downloader - CLI tool I implemented to download tweets using Twitter's new Academic API
- ResNet
- Fashion MNIST dataset
- Building Autoencoders in Keras
- TOML: Tom's Obvious, Minimal Language