PhD, Update 9: Results?

Hey, it's time for another PhD update blog post! Since last time, I've been working on tweet classification mostly, but I have a small bit to update you on with the Temporal CNN.

Here's a list of posts in this series so far before we continue. If you haven't already, you should read them as they give some contact for what I'll be talking about in this post.

Note that the results and graphs in this blog post are not final, and may change as I further evaluate the performance of the models I've trained (for example I recently found and fixed a bug where I misclassified all tweets without emojis as positive tweets by mistake).

Tweet classification

After trying a bunch of different tweaks and variants of models, I now have an AI model that classifies tweets. It's not perfect, but I am reaching ~80% validation accuracy.

The model itself is trained to predict the sentiment of a tweet, with the label coming from a preset list of emojis I compiled manually (looking at a list of emojis in the dataset itself I wrote a quick Bash one-liner to calculate). Each tweet is classified by adding up how any emojis in each category it has in it, and the category with the most emojis is then ultimately chosen as the label to predict. Here's the list of emojis I've compiled:

Category	Emojis
positive	😂👍🤣❤👏😊😉😁😘😍🤗💕😀💙💪✅🙌💚👌🙂😎😆😅☺😃😻💖💋💜😹😜♥😄💛😽✔👋💗✨🌹🎉👊😋😏🌞😇🎶⭐💞😺😸🖤🌸💐🍀🌼🌟🤝🌷🐱🤓😌😛😙
negative	🥶⚠💔😰🙄😱😭😳😢😬🤦😡😩🙈😔☹😮❌😣😕😲😥😞😫😟😴🙀😒🙁😠😪😯😨👎🤢💀😤😐😖😝😈😑😓😿😵😧😶😦🤥🤐🤧🌧☔💨🌊💦🚨🌬🌪⛈🌀❄💧🌨⚡🦆🌩🌦🌂

These categories are stored in a tab separated values file (TSV), allowing the program I've written that trains new models to be completely customisable. Given that there were a significant number of tweets with flood-related emojis (🌧☔💨🌊💦🚨🌬🌪⛈🌀❄💧🌨⚡🦆🌩🌦🌂) in my dataset, I tried separating them out into their own class, but it didn't work very well - more on this later.

The dataset I've downloaded comes from a number of different queries and hashtags, and is comprised of just under 1.4 million tweets, of which ~90K tweets have at least 1 emoji in the above categories.

A pair of pie charts showing statistics about the twitter dataset.

Tweets were downloaded with a tool I implemented using a number of flood related hashtags and query strings - from generic ones like #flood to specific ones such as #StormChristoph.

Of the entire dataset, ~6.44% contain at least 1 emoji. Of those, ~49.57~ are positive, and ~50.434% are negative:

Category	Number of tweets
No emoji	1306061
Negative	45367
Positive	44585
Total tweets	1396013

The models I've trained so far are either a transformer, or an 2-layer LSTM with 128 units per layer - plus a dense layer with a softmax activation function at the end for classification. If batch normalisation, every layer except the dense layer has a batch normalisation layer afterwards:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bidirectional (Bidirectional (None, 100, 256)          336896    
_________________________________________________________________
batch_normalization (BatchNo (None, 100, 256)          1024      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               394240    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
dense (Dense)                (None, 2)                 514       
=================================================================
Total params: 733,698
Trainable params: 732,674
Non-trainable params: 1,024
_________________________________________________________________

I tried using more units and varying the number of layers, but it didn't help much (in hindsight this was before I found the bug where tweets without emojis were misclassified as positive).

Now for the exciting part! I split the tweets that had at least 1 emoji into 2 datasets: 80% for training, and 20% for validation. I trained a few different models tweaking various different parameters, and got the following results:

2 categories: training accuracy

2 categories: validation accuracy

Each model has a code that is comprised of multiple parts:

c2: 2 categories
bidi: LSTM, bidirectional wrapper enabled
nobidi: LSTM, bidirectional wrapper disabled
batchnorm: LSTM, batch normalisation enabled
nobatchnorm: LSTM, batch normalisation disabled
transformer: Transformer (just the encoder part)

It seems that if you're an LSTM being bidirectional does result in a boost to both stability and performance (performance here meaning how well the model works, rather than how fast it runs - that would be efficiency instead). Batch normalisation doesn't appear to help all that much - I speculate this is because the models I'm training really aren't all that deep, and batch normalisation apparently helps most with deep models (e.g. ResNet).

For completeness, here are the respective graphs for 3 categories (positive, negative, and flood):

3 Categories: Training accuracy

3 Categories: Validation accuracy

Unfortunately, 3 categories didn't work quite as well as I'd hoped. To understand why, I plotted some confusion matrices at epochs 11 (left) and 44 (right). For both categories I've picked out the charts for bidi-nobatchnorm, as it was the best performing (see above).

2 Categories: confusion matrices

3 Categories: confusion matrices

The 2 category model looks to be reasonably accurate with no enormous issues as we expected from the ~80% accuracy from above. Twice the number of negative tweets are misclassified as positive compared to the other way around of some reason - it's possible that negative tweets are more difficult to interpret. Given the noisy dataset (it's social media and natural language, after all), I'm pretty happy with this result.

The 3 category model has fared less brilliantly. Although we can clearly see it has improved over the training process (and would likely continue to do so if I gave it longer than 50 epochs to train), it's accuracy is really limited by it's tendency to overfit - preventing it from performing as well as the 2 category model (which, incidentally, also overfits).

Although it classifies flood tweets fairly accurately, it seems to struggle more with positive and negative tweets. This leads me to suspect that training a model on flood / not flood might be worth trying - though as always I'm limited by a lack of data (I have only ~16K tweets that contain a flood emoji). Perhaps searching twitter for flood emojis directly would help gather additional here.

Now that I have a model that works, I'm going to move on to answering some research questions with this model. To do so, I've upgraded my twitter-academic-downloader to download information about attached media - downloading attached media itself deserves its own blog post. After that, I'll be looking to publish my first ever scientific papery thing (don't know the difference between journal articles and other types of publication yet), so that is sure to be an adventure.

Temporal CNN(?)

In Temporal CNN land, things have been much slower. My latest attempt to fix the accuracy issues I've described in detail in previous posts, I've implemented a vanilla convolutional autoencoder by following this tutorial (the links to the full code behind the tutorial are broken, and it's really annoying). This does image to image translation, so it would be ideal for modifying it to support my use-case. After implementing it with Tensorflow.js, I hit a snag:

Autoencoder: expected vs actual

After trying all the loss functions I found think of and all sort of other variations of the model, for sanity's sake I tried implementing the model in Python (the original language using in the blog post). I originally didn't want to do this, since all the data preprocessing code I've written so far is all in Javascript / Node.js, but within half an hour with some copying and pasting, I had a very quick model implemented. After training a teeny model it for just 7 epochs on the Fashion MNIST dataset, I got this:

Autoencoder: epoch 7 in Tensorflow for Python

...so clearly there's something different in Tensorflow.js compared to Tensorflow for Python, but after comparing them dozens of times, I'm not sure I'll be able to spot the difference. It's a shame to give up on the Tensorflow.js version (it also has a very nice CLI I implemented), but given how long I've spent in the background on this problem, I'm going to move forwards with the Python version, I'm going to work on converting my existing data preprocessing code into a standalone CLI for converting the data into a format that I can consume in Python with minimal effort.

Once that's done, I'm going to expand the quick autoencoder Python script I implemented into a proper CLI program. This involves adding:

A CLI argument parser (argparse is my weapon of choice here for now in Python)
A settings system that has a default and a custom (TOML) config file
Saving multiple pieces of data to the output directory:
- A summary of the model structure
- The settings that are currently in use
- A checkpoint for each epoch
- A TSV-formatted metrics file

The existing script also spits out a sample image (such as the one above - just without the text) for each epoch too, which I've found to be very useful indeed. I'll definitely be looking into generating more visualisations on-the-fly as the model is training.

All of this will take a while (especially since this isn't the focus right now), so I'm a bit unsure if I'll get all of this done by the time I write the next post. If this plan works, then I'll probably have to come up with a new name for this model, since it's not really a Temporal CNN. Suggestions are welcome - my current thought is maybe "Temporal Convolutional Autoencoder".

Conclusion

Since the last post I've made on here, I've made much more progress on stuff than I was expecting to have done. I've now got a useful(?) model for classifying tweets, which I'm now going to move ahead with answering some research questions with (more on that in the next post - this one is long already haha). I've also got an Autoencoder training properly for image-to-image translation as a step towards getting 2D flood predictions working.

Looking forwards, I'm going to be answering research questions with my tweet classification model and preparing to write something to publish, and expanding on the Python autoencoder for more direct flood prediction.

Curious about the code behind any of these models and would like to check them out? Please get in touch. While I'm not sure I can share the tweet classifier itself yet (though I do plan on releasing as open-source after a discussion with my supervisor when I publish), I can probably share my various autoencoder implementations, and I can absolutely give you a bunch of links to things that I've used and found helpful.

If you've found this interesting or have some suggestions, please comment below. It's really motivating to hear that what I'm doing is actually interesting / useful to people.

Sources and further reading

twitter-academic-downloader - CLI tool I implemented to download tweets using Twitter's new Academic API
ResNet
Fashion MNIST dataset
- Formatted as PNG
Building Autoencoders in Keras
TOML: Tom's Obvious Markup Language

Type this	To get this	Notes
`bold text`	bold text	-
`_italics text_`	italics text	-
`~~deleted text~~`	~~deleted~~	-
`code text`	`code text`	Inserts some monospaced code. It is preferred that large blocks of code are linked to using a service such as Pastebin, Github Gists or Ideone.
`> Quote`	Quote	-
`[display text](//google.com)`	display text	Inserts a hyperlink. Please use responsibly. `[rel=nofollow]` is in use and spam will be deleted.
`---`		Inserts a horizontal line. The previous line must be blank.

Stardust
Blog