PhD, Update 9: Results?

Hey, it's time for another PhD update blog post! Since last time, I've been working on tweet classification mostly, but I have a small bit to update you on with the Temporal CNN.

Here's a list of posts in this series so far before we continue. If you haven't already, you should read them, as they give some context for what I'll be talking about in this post.

Note that the results and graphs in this blog post are not final, and may change as I further evaluate the performance of the models I've trained (for example, I recently found and fixed a bug where all tweets without emojis were misclassified as positive).

Tweet classification

After trying a bunch of different tweaks and variants of models, I now have an AI model that classifies tweets. It's not perfect, but I am reaching ~80% validation accuracy.

The model itself is trained to predict the sentiment of a tweet, with the label coming from a preset list of emojis I compiled manually (with the help of a quick Bash one-liner that tallied the emojis present in the dataset). Each tweet is labelled by counting how many emojis from each category it contains, and the category with the most emojis is chosen as the label to predict. Here's the list of emojis I've compiled:

Category Emojis
positive 😂👍🤣❤👏😊😉😍😘😁🤗💕😀💙💪✅🙌💚👌🙂😎😆😅☺😃😻💖💋💜😹😜♥😄💛😽✔👋💗✨🌹🎉👊😋😏🌞😇🎶⭐💞😺😸🖤🌸💐🍀🌼🌟🤍🌷🐱🤓😌😛😙
negative 🥶⚠💔😰🙄😱😭😳😢😬🤦😡😩🙈😔☹😮❌😣😕😲😥😞😫😟😴🙀😒🙁😠😪😯😨👎🤢💀😤😐😖😝😈😑😓😿😵😧😶😦🤥🤐🤧🌧☔💨🌊💦🚨🌬🌪⛈🌀❄💧🌨⚡🦆🌩🌦🌂

These categories are stored in a tab-separated values (TSV) file, making the training program I've written completely customisable. Given that there were a significant number of tweets with flood-related emojis (🌧☔💨🌊💦🚨🌬🌪⛈🌀❄💧🌨⚡🦆🌩🌦🌂) in my dataset, I tried separating them out into their own class, but it didn't work very well - more on this later.
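To illustrate, here's a minimal sketch of the labelling rule (the TSV format, function names, and the tiny example category sets are my own illustration, not the actual training code):

```python
# Sketch of the emoji-based labelling rule: count how many emojis from each
# category a tweet contains, and pick the category with the most.
# The TSV layout ("name<TAB>emojis") is an assumption for illustration.
def load_categories(path):
    categories = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            name, emojis = line.rstrip("\n").split("\t")
            categories[name] = set(emojis)
    return categories

def label_tweet(text, categories):
    # Tally emojis per category, then take the category with the highest count
    counts = {name: sum(text.count(e) for e in emojis)
              for name, emojis in categories.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None  # None → no emoji, tweet discarded

# Tiny example category sets, for demonstration only
categories = {"positive": set("😂👍😊"), "negative": set("💔😭😱")}
assert label_tweet("flooded again 💔😭", categories) == "negative"
```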

The dataset I've downloaded comes from a number of different queries and hashtags, and comprises just under 1.4 million tweets, of which ~90K have at least 1 emoji from the above categories.

A pair of pie charts showing statistics about the twitter dataset.

Tweets were downloaded with a tool I implemented, using a number of flood-related hashtags and query strings - from generic ones like #flood to specific ones such as #StormChristoph.

Of the entire dataset, ~6.44% of tweets contain at least 1 emoji. Of those, ~49.57% are positive, and ~50.43% are negative:

Category      Number of tweets
No emoji      1306061
Negative      45367
Positive      44585
Total tweets  1396013
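These percentages can be double-checked directly from the table:

```python
# Quick sanity check of the dataset percentages quoted above,
# computed from the counts in the table.
no_emoji, negative, positive = 1306061, 45367, 44585
total = no_emoji + negative + positive          # 1396013
with_emoji = negative + positive                # 89952

pct_with_emoji = 100 * with_emoji / total       # ≈ 6.44
pct_positive = 100 * positive / with_emoji      # ≈ 49.57
pct_negative = 100 * negative / with_emoji      # ≈ 50.43
```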

The models I've trained so far are either a transformer or a 2-layer LSTM with 128 units per layer, plus a dense layer with a softmax activation function at the end for classification. If batch normalisation is enabled, every layer except the dense layer is followed by a batch normalisation layer:

Model: "sequential"
Layer (type)                 Output Shape              Param #   
bidirectional (Bidirectional (None, 100, 256)          336896    
batch_normalization (BatchNo (None, 100, 256)          1024      
bidirectional_1 (Bidirection (None, 256)               394240    
batch_normalization_1 (Batch (None, 256)               1024      
dense (Dense)                (None, 2)                 514       
Total params: 733,698
Trainable params: 732,674
Non-trainable params: 1,024
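As a sketch, the summary above corresponds to roughly the following Keras model. The 100-token, 200-dimensional input embeddings are an assumption I've inferred from the parameter counts (they reproduce the 733,698 total exactly):

```python
# Sketch of the 2-layer bidirectional LSTM classifier from the summary above.
# The (100, 200) input shape is an inferred assumption, not confirmed.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(100, 200)),  # 100 tokens × 200-d embeddings (assumed)
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.BatchNormalization(),
    layers.Bidirectional(layers.LSTM(128)),
    layers.BatchNormalization(),
    layers.Dense(2, activation="softmax"),  # positive / negative
])
```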

I tried using more units and varying the number of layers, but it didn't help much (in hindsight this was before I found the bug where tweets without emojis were misclassified as positive).

Now for the exciting part! I split the tweets that had at least 1 emoji into 2 datasets: 80% for training, and 20% for validation. I trained a few different models tweaking various different parameters, and got the following results:

2 categories: training accuracy

2 categories: validation accuracy
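For reference, the 80/20 split can be sketched with scikit-learn's train_test_split (an assumed tool choice on my part - any splitter would do):

```python
# Sketch of an 80/20 train/validation split over the labelled tweets.
# The example tweets and labels here are made up for illustration.
from sklearn.model_selection import train_test_split

tweets = ["great day 😂", "flooded street 😭", "love this 👍", "so sad 💔"]
labels = ["positive", "negative", "positive", "negative"]

# test_size=0.2 reserves 20% of the labelled tweets for validation
train_x, val_x, train_y, val_y = train_test_split(
    tweets, labels, test_size=0.2, random_state=42)
```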

Each model has a code that comprises multiple parts:

It seems that for an LSTM, being bidirectional does result in a boost to both stability and performance (performance here meaning how well the model works, rather than how fast it runs - that would be efficiency instead). Batch normalisation doesn't appear to help all that much - I speculate this is because the models I'm training really aren't all that deep, and batch normalisation apparently helps most with deep models (e.g. ResNet).

For completeness, here are the respective graphs for 3 categories (positive, negative, and flood):

3 Categories: Training accuracy

3 Categories: Validation accuracy

Unfortunately, 3 categories didn't work quite as well as I'd hoped. To understand why, I plotted some confusion matrices at epochs 11 (left) and 44 (right). For both models I've picked out the charts for bidi-nobatchnorm, as it was the best performing (see above).

2 Categories: confusion matrices

3 Categories: confusion matrices
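Confusion matrices like these can be computed from model predictions with scikit-learn, for example (the labels and predictions below are made up for illustration):

```python
# Sketch: computing a 2-category confusion matrix from predictions.
from sklearn.metrics import confusion_matrix

y_true = ["positive", "negative", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative"]

# Rows are the true labels, columns the predictions,
# in the order given by the labels parameter.
matrix = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])
```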

The 2 category model looks to be reasonably accurate, with no enormous issues - as expected from the ~80% accuracy above. Twice as many negative tweets are misclassified as positive as the other way around, for some reason - it's possible that negative tweets are more difficult to interpret. Given the noisy dataset (it's social media and natural language, after all), I'm pretty happy with this result.

The 3 category model has fared less brilliantly. Although we can clearly see it has improved over the training process (and would likely continue to do so if I gave it longer than 50 epochs to train), its accuracy is really limited by its tendency to overfit - preventing it from performing as well as the 2 category model (which, incidentally, also overfits).

Although it classifies flood tweets fairly accurately, it seems to struggle more with positive and negative tweets. This leads me to suspect that training a model on flood / not flood might be worth trying - though, as always, I'm limited by a lack of data (only ~16K of my tweets contain a flood emoji). Perhaps searching twitter for flood emojis directly would help gather additional data here.

Now that I have a model that works, I'm going to move on to answering some research questions with it. To do so, I've upgraded my twitter-academic-downloader to download information about attached media - downloading the attached media itself deserves its own blog post. After that, I'll be looking to publish my first ever scientific paper-y thing (I don't yet know the difference between journal articles and other types of publication), so that is sure to be an adventure.

Temporal CNN(?)

In Temporal CNN land, things have been much slower. In my latest attempt to fix the accuracy issues I've described in detail in previous posts, I've implemented a vanilla convolutional autoencoder by following this tutorial (annoyingly, the links to the full code behind the tutorial are broken). It does image-to-image translation, so it should be ideal for modifying to support my use-case. After implementing it with Tensorflow.js, I hit a snag:

Autoencoder: expected vs actual

After trying all the loss functions I could think of and all sorts of other variations of the model, for sanity's sake I tried implementing the model in Python (the original language used in the blog post). I originally didn't want to do this, since all the data preprocessing code I've written so far is in Javascript / Node.js, but within half an hour and with some copying and pasting, I had a very quick model implemented. After training a teeny model for just 7 epochs on the Fashion MNIST dataset, I got this:

Autoencoder: epoch 7 in Tensorflow for Python

Clearly there's something different in Tensorflow.js compared to Tensorflow for Python, but after comparing the two implementations dozens of times, I'm not sure I'll ever be able to spot the difference. It's a shame to give up on the Tensorflow.js version (it also has a very nice CLI I implemented), but given how long I've spent on this problem in the background, I'm going to move forwards with the Python version. To that end, I'm going to work on converting my existing data preprocessing code into a standalone CLI that converts the data into a format I can consume in Python with minimal effort.
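For the curious, a vanilla convolutional autoencoder along these lines can be sketched in Keras as follows (the layer sizes are illustrative assumptions, not the exact ones from the tutorial; I use random stand-in data here, but Fashion MNIST slots in directly):

```python
# Minimal convolutional autoencoder sketch: encode 28x28 images down to a
# 7x7 bottleneck, then decode back to 28x28. Layer sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_autoencoder():
    inp = layers.Input(shape=(28, 28, 1))
    # Encoder: downsample 28x28 -> 14x14 -> 7x7
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)
    # Decoder: upsample 7x7 -> 14x14 -> 28x28
    x = layers.Conv2DTranspose(8, 3, strides=2, activation="relu", padding="same")(x)
    x = layers.Conv2DTranspose(16, 3, strides=2, activation="relu", padding="same")(x)
    out = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Train briefly on random stand-in data; swap in Fashion MNIST for real runs
autoencoder = build_autoencoder()
x = tf.random.uniform((32, 28, 28, 1))
autoencoder.fit(x, x, epochs=1, verbose=0)       # input == target
reconstructions = autoencoder.predict(x, verbose=0)
```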

Once that's done, I'm going to expand the quick autoencoder Python script I implemented into a proper CLI program. This involves adding:

The existing script also spits out a sample image for each epoch (such as the one above, just without the text), which I've found to be very useful indeed. I'll definitely be looking into generating more visualisations on-the-fly as the model is training.
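A per-epoch sample image like this can be produced with a custom Keras callback - here's a sketch (the output directory and filenames are assumptions for illustration):

```python
# Sketch of a Keras callback that writes one sample reconstruction per epoch,
# similar to the per-epoch sample images described above.
import os
import tensorflow as tf

class SampleImageCallback(tf.keras.callbacks.Callback):
    def __init__(self, sample, out_dir="samples"):
        super().__init__()
        self.sample = sample          # batch of inputs, shape (n, h, w, 1)
        self.out_dir = out_dir        # assumed output location

    def on_epoch_end(self, epoch, logs=None):
        os.makedirs(self.out_dir, exist_ok=True)
        # Reconstruct the sample batch and save the first image as a PNG
        recon = self.model.predict(self.sample, verbose=0)
        img = tf.cast(recon[0] * 255, tf.uint8)
        tf.io.write_file(
            os.path.join(self.out_dir, f"epoch_{epoch:03d}.png"),
            tf.io.encode_png(img),
        )
```

An instance then gets passed via model.fit(..., callbacks=[SampleImageCallback(sample_batch)]).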

All of this will take a while (especially since this isn't the focus right now), so I'm a bit unsure if I'll get all of this done by the time I write the next post. If this plan works, then I'll probably have to come up with a new name for this model, since it's not really a Temporal CNN. Suggestions are welcome - my current thought is maybe "Temporal Convolutional Autoencoder".


Since the last post I've made on here, I've made much more progress than I was expecting to. I've now got a useful(?) model for classifying tweets, which I'm going to use to answer some research questions (more on that in the next post - this one is long enough already haha). I've also got an autoencoder training properly for image-to-image translation, as a step towards getting 2D flood predictions working.

Looking forwards, I'm going to be answering research questions with my tweet classification model and preparing to write something to publish, and expanding on the Python autoencoder for more direct flood prediction.

Curious about the code behind any of these models and would like to check them out? Please get in touch. While I'm not sure I can share the tweet classifier itself yet (though I do plan on releasing it as open-source when I publish, after a discussion with my supervisor), I can probably share my various autoencoder implementations, and I can absolutely give you a bunch of links to things that I've used and found helpful.

If you've found this interesting or have some suggestions, please comment below. It's really motivating to hear that what I'm doing is actually interesting / useful to people.


