Starbeamrainbowlabs

Stardust
Blog



PhD Update 12: Is it enough?

Hey there! It's another PhD update blog post! Sorry for the lack of posts here; the reason why will become apparent below. In the last post, I talked about the AAAI-22 Doctoral Consortium conference I'll be attending, and also about sentiment analysis of both tweets and images. Before we talk about progress since then, here's a list of all the posts in this series so far:

As in all the posts preceding this one, none of the things I present here are finalised, and all are subject to significant change as I double check everything. I can think of no better example of this than the image classification model I talked about last time - the accuracy of which has dropped from 75.6% to 60.7% after I fixed a bug...

AAAI-22 Doctoral Consortium

The biggest thing in my calendar over the next few weeks is surely the AAAI-22 Doctoral Consortium. I've been given a complimentary registration to the main conference after I had a paper accepted, so as it turns out the next few weeks are going to be rather busy - as have been the last few weeks preparing for this conference.

Since the last post when I mentioned that AAAI-22 has been moved to be fully virtual, there have been a number of developments. Firstly, the specific nature of the doctoral consortium (and the wider conference) is starting to become clear. Not having been to a conference before (a theme which will come up a lot in this post), I'm not entirely sure what to expect, but as it turns out I was asked to create a poster that I'll be presenting in a poster session (whether this is part of AAAI-22 or the AAAI-22 Doctoral Consortium is unclear).

I've created one with baposter (or a variant thereof given to me by my supervisor some time ago) which I've now submitted. The thought of presenting a poster in a poster session at a conference to lots of people I don't know is rather terrifying though, so I've been increasingly anxious about this over the past few weeks.

This isn't the end of the story though, as I've also been asked to do a 20 minute presentation with 15 minutes for questions / discussion, the preparations for which have taken perhaps longer than I anticipated. Thankfully I've had prior experience presenting at my department's PGR seminars (which have also been online recently), so it's not as daunting as it would otherwise be, but as with the poster there's still the fear of presenting to lots of people I don't know - most of whom probably know more about AI than I do!

Despite these fears and other complications that you can expect from an international conference (timezones are such a pain sometimes), I'm looking forward to seeing what other people have done, and also hoping that the feedback I get from my presentation and poster isn't all negative :P

Sentiment analysis: a different perspective

Since the last post, a number of things have changed in my approach to sentiment analysis. The reason for this is - as I alluded to earlier in this post - that I fixed a bug in my model that predicts the sentiment of images from twitter, which caused the validation accuracy to drop from 75.6% to 60.7%. As of now, I have the following models implemented for sentiment analysis:

  1. Text sentiment prediction
    • Status: Implemented, and works rather well actually - top accuracy is currently 79.6%
    • Input: Tweet text
    • Output: Positive/negative sentiment
    • Architecture: Transformer encoder (previously: LSTM)
    • Labels: Emojis from text [manually sorted into 2 categories]
  2. Image sentiment prediction
    • Status: Implemented, but doesn't work very well
    • Input: Images attached to tweets
    • Output: Positive/negative sentiment
    • Architecture: ResNet50 (previously: CCT, but I couldn't get it to work)
    • Labels: Positive/negative sentiment predictions from model #1
  3. Combined sentiment prediction
    • Status: Under construction
    • Input: Tweet text, associated image
    • Output: Positive/negative sentiment
    • Architecture: CLIP → a few dense layers [provisionally]
    • Labels: Emojis from text, as in model #1
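The emoji-based labelling used by models #1 and #3 can be sketched roughly as follows. This is only a minimal Python illustration, not the actual pipeline: the 2 emoji sets and the `label_tweet` helper are hypothetical, and skipping tweets that contain emojis from both categories is an assumption.

```python
# Hypothetical emoji categories - the real lists were sorted by hand and aren't shown in this post
POSITIVE_EMOJIS = {"😀", "😍", "🎉", "👍"}
NEGATIVE_EMOJIS = {"😢", "😡", "😱", "👎"}

def label_tweet(text):
    """Return "positive", "negative", or None if the tweet can't be labelled."""
    has_positive = any(char in POSITIVE_EMOJIS for char in text)
    has_negative = any(char in NEGATIVE_EMOJIS for char in text)
    if has_positive == has_negative:
        # Either no known emoji at all, or emojis from both categories (assumed: skip these)
        return None
    return "positive" if has_positive else "negative"

tweets = [
    "Stay safe everyone 👍",
    "Our street is completely flooded 😢",
    "River levels are rising",
]
labelled = [(tweet, label_tweet(tweet)) for tweet in tweets]
```

Tweets for which `label_tweet` returns `None` would simply be left out of the training data in this sketch.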

Not mentioned here of course is my rainfall radar model, but that's still waiting for me to return to it and implement a plan I came up with some months ago to fix it.

Of particular note here is the new CLIP-based model. After analysing model #2 (image sentiment prediction) and sampling some images from each category on a per-flood basis, it soon became clear that the output was very noisy and wasn't particularly useful. Combined with its rather low performance, this means that I'm pretty much considering it a failed attempt.

With this in mind, my supervisor suggested I look into CLIP. While I mentioned it in the paper I've submitted to AAAI-22, until recently I hadn't had a clear picture of how it would actually be useful. As a model, CLIP trains on text-image pairs, and learns to identify which image belongs to which text caption. To do this, it has 2 encoders - 1 for the text, and 1 for the associated image.

You can probably guess where this is going, but my new plan here is to combine the text and images of the tweets I have downloaded into 1 single model rather than 2 separate ones. I figure that I can potentially take advantage of the encoders trained by the CLIP models to make a prediction. If I recall correctly, I've read a paper that has done this before with tweets from twitter - just not with CLIP - but unfortunately I can't locate the paper at this time (I'll edit this post if I do find it in the future - if you know which paper I'm talking about please do leave a DOI link in the comments below).

At the moment, I'm busy implementing the code to wrap the CLIP model and make it suitable for my specific learning task, but this is as of yet incomplete. As noted in the architecture above, I plan on concatenating the output from the CLIP encoders and passing it through a few dense layers (as the [ batch_size, concat, dim ] tensor is not compatible with a transformer, which operates on sequences), but I'm open to suggestions - please do comment below.
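To make the "concatenate, then a few dense layers" idea a bit more concrete, here's a toy forward pass in plain Python. It's a sketch under heavy assumptions: random vectors stand in for the outputs of the 2 CLIP encoders, the weights are random and untrained, and the layer sizes and helper names are all made up for illustration.

```python
import math
import random

def relu(value):
    return max(0.0, value)

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def dense(inputs, weights, biases, activation):
    """Apply 1 fully-connected layer to a single input vector."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + bias)
        for row, bias in zip(weights, biases)
    ]

def random_layer(rng, n_in, n_out):
    """Create untrained placeholder weights for a layer (n_in inputs, n_out units)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def predict_sentiment(text_embedding, image_embedding, layers):
    # Concatenate the 2 encoder outputs into 1 flat vector of length 2 * dim
    hidden = text_embedding + image_embedding
    for weights, biases in layers[:-1]:
        hidden = dense(hidden, weights, biases, relu)
    weights, biases = layers[-1]
    # A single sigmoid unit gives the probability of positive sentiment
    return dense(hidden, weights, biases, sigmoid)[0]

rng = random.Random(42)
dim = 8  # Arbitrary small size to keep the example readable
layers = [random_layer(rng, dim * 2, 16), random_layer(rng, 16, 1)]
text_embedding = [rng.gauss(0, 1) for _ in range(dim)]   # Stand-in for the text encoder output
image_embedding = [rng.gauss(0, 1) for _ in range(dim)]  # Stand-in for the image encoder output
probability = predict_sentiment(text_embedding, image_embedding, layers)
print(f"P(positive) = {probability:.3f}")
```

In a real model the dense layers would be trained on the emoji-derived labels, and whether the CLIP encoders themselves are frozen or fine-tuned is a separate design choice.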

My biggest concern here is that the tweet text will not be enough like an image caption for CLIP to produce useful results. If this is the case, I have some tricks up my sleeve:

  1. Extract any alt text associated with images (yes, twitter does let you do this for images) and use that instead - though I can't imagine that people have captioned images especially often.
  2. Train a new CLIP model from scratch on my dataset - potentially using this model architecture - this may require quite a time investment to get the model working as intended (Tensorflow can be confusing and difficult to debug with complex model architectures).

Topic analysis

Another thing I've explored is topic analysis using gensim's LdaModel. Essentially, as far as I can tell LDA is an unsupervised algorithm that groups the words found in the source input documents into a fixed number of related groups, each of which contains words which are often found in close proximity to one another.

As an example, if I train an LDA model for 20 groups, I get something like the following for some of those categories:

place       death  north    evacu        rise    drive  hng     coverag  toll       cumbria
come        us     ye       issu         london  set    line    offic    old        mean
rescu       dead   miss     uttarakhand  leav    bbc    texa    victim   kerala     defenc
flashflood  awai   world    car          train   make   best    turn     school     boat
need        town   problem  includ       effect  head   assist  sign     philippin  condit

(Full results may be available upon request, if context is provided)

As you can tell, the results of this are mixed. While it has grouped the words, the groups themselves aren't really what I was hoping for. The groups here seem to be quite generic, rather than being about specific things (words like rescu, dead, miss, and victim in a category), and also seem to be rather noisy (words like "north", "old", "set", "come", "us").

My first thought here was that instead of training the model on all the tweets I have, I might get better results if I train it on just a single flood at once - since some categories are dominated by flood-specific words (hurrican, hurricaneeta, hurricaneiota for example). Here's an extract from the results of that for 10 categories for the #StormDennis hashtag:

warn  flood  stormdenni  met    issu  offic  water  tree   high     risk
good  hope   look        love   i’m   game   morn   dai    yellow   nice
ye    lol    head        enjoi  tell  book   got    chanc  definit  valentin
mph   gust   ireland     coast  wors  wave   brace  wow    doesn’t  photo

(Again, full results may be available on request if context is provided)

Again, mixed results here - and still very noisy (though I'd expect nothing else from social media data), though it is nice to see something like gust, ireland, and coast there in the bottom category.

The goal here was to attempt to extract some useful information about which places have been affected and by how much by combining this with the sentiment analysis (see above), but my approach here doesn't seem to have captured what I intended. Still, it does seem that there is some difference in sentiment on a per-category basis (for all tweets):

(Above: A chart of the sentiment of each LDA topic, using the 20 topic model that was trained on all available tweets.)

It seems as if there's definitely something going on here. I speculate that positive words are more often than not used near other positive words, and so categories are more likely to skew to 1 extreme or another - though acquiring proof of this theory would likely require a significant time investment.
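For reference, the per-topic sentiment breakdown can be computed along these lines. This is a minimal sketch that assumes each tweet has already been assigned a single dominant topic ID and a positive/negative label; the `sentiment_by_topic` helper and the sample data are made up for illustration.

```python
from collections import defaultdict

def sentiment_by_topic(tweets):
    """Map each topic ID to the percentage of its tweets labelled positive.

    tweets: iterable of (topic_id, sentiment) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # topic_id → [positive count, total count]
    for topic_id, sentiment in tweets:
        counts[topic_id][1] += 1
        if sentiment == "positive":
            counts[topic_id][0] += 1
    return {
        topic_id: 100 * positive / total
        for topic_id, (positive, total) in sorted(counts.items())
    }

# Made-up sample: 4 tweets in topic 0, 4 tweets in topic 1
sample = [
    (0, "positive"), (0, "negative"), (0, "negative"), (0, "negative"),
    (1, "positive"), (1, "positive"), (1, "negative"), (1, "positive"),
]
percent_positive = sentiment_by_topic(sample)
```

Here topic 0 comes out at 25% positive and topic 1 at 75% positive, which is the kind of skew between categories that the chart above shows.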

While LDA topic analysis was an interesting diversion, I'm not sure how useful it is in this context. Still, it's a useful thing to have in my growing natural language processing toolkit (how did this happen) - and perhaps future research problems will benefit from it more.

Moving forwards, I could imagine instead of doing LDA topic analysis it might be beneficial to group tweets by place and perhaps run some sentiment analysis on that instead. This comes with its own set of problems of course (especially pinning down / inferring the location of the tweets in question), but this is not an insurmountable problem given strategies such as named entity recognition (which the twitter Academic API does for you, believe it or not).

Conclusion

While most of my time has been spent on preparing for the AAAI-22 conference, I have managed to do some investigating into the twitter data I've downloaded. Unfortunately, not all my methods have been a success (image sentiment analysis, LDA topic analysis), but these have served as useful exercises for both learning new techniques and understanding the dataset better.

Moving forwards, I'm going to implement a new CLIP-based model that I'm hoping will improve accuracy over my existing models - though I'm somewhat apprehensive that tweet text won't be descriptive enough for CLIP to produce a useful output.

With the AAAI-22 conference happening next week, I'll be sure to write up my experiences and post about the event here. I'm just hoping that I've done enough to make attending a conference like this worth it, and that what I have done is actually interesting to people :-)

PhD Update 11: Answers to our questions

Heya! It's time for another PhD update blog post. Sometimes, answers to the questions one asks come in the form of more questions. In this post, as I predicted in the last post, I'll be talking mainly about my work with tweets from twitter, as I haven't yet had time to return to the Temporal CNN Autoencoder. Before we start though, here's a list of all the parts in this series so far:

As usual, none of the things I present here are finalised, and are subject to significant change as I double check everything.

Conferences part 2

In the last post, I talked about the conferences I have applied to and not applied to. In this one, I can now say that I have been accepted for the AAAI-22 Doctoral Consortium! It was going to be held in Vancouver, Canada - but has since been moved to be fully virtual and online. While I both understand the reasoning behind the decision and am relieved I don't have to travel, it is a bit of a shame that I won't get the chance to have those face-to-face conversations you don't get when in a video call.

Despite the move to being virtual, I'm still both excited to attend and mildly terrified about presenting the things I've been doing to a potentially large audience.

Looking forward, at the suggestion of my supervisor I plan to finish writing up that AI+HADR paper I mentioned in the last update as a journal article instead, and then submit that. While I'm working through the review process for that paper, I'll return to the Temporal Autoencoder / rainfall radar subproject and work on implementing my idea for it.

Tweet sentiment analysis

After the rather rushed analysis of the data in the last post, I've now taken the time to analyse the data more thoroughly. A number of things became apparent here, but first let's look at the research questions I've asked:

  1. Is there a more negative response to more sudden / severe floods?
  2. Can we classify images by the sentiment of the associated tweet?

Answering these questions has not been straightforward, but I'm now at a point where I have some preliminary answers (which, I stress, are not double checked and not peer reviewed).

Unfortunately, the answer to the first question there is that it can't easily be answered with the current data available. As it turns out, I have been unable to find an objective and consistent measure of how severe or sudden a flood was.

One might think that, say, the amount of insurance damages would be a good choice. This doesn't work out though, because not all flooding events have (public) damage estimates. Those that do are often measured against different goalposts: flood A might be measured in property damages, and flood B in economic impact.

Another measure I investigated was the number of homes destroyed or people displaced. This too, as it turns out, has multiple issues. For one, while multiple estimates are available for some floods, they don't always agree. For a single given flood, estimates might range from 800 to 2400 homes destroyed, and it's sometimes unclear at what stage in a flood's lifecycle each estimate was made.

Even if such estimates were consistent (which they really aren't), there's another issue too: they are often limited to country borders. For example, a government agency may estimate the number of homes destroyed for their country, but not other countries. This is totally reasonable: a government is concerned first and foremost with the people within its borders. Sadly floods, storms, and hurricanes rarely discriminate across such borders. Take Storm Christoph for example. It hit the UK in January 2021, but after that it continued on and hit Scandinavia too.

The other question above is thankfully much easier to answer - it deserves its own section in this post though, so see below for that. It's not all bad news on the tweet sentiment analysis front though - I did find some unexpected results while I was analysing the data. To explain, let's look at a fancy new chart I've plotted:

(Above: Various floods and their overall sentiment - both with replies included and excluded.)

This fancy chart shows the overall sentiment of a number of different flooding events, with replies included (going down) and excluded (going up). What's fascinating here is that it appears to suggest that the replies contribute significantly towards the percentage of positive tweets. Perhaps people are most likely to tweet words of encouragement (e.g. "stay safe :hugs:")?

With this in mind, I correlated the total number of tweets made in a flood with the overall sentiment (as a percentage), and got a Pearson correlation coefficient of -0.54, which indicates a moderate negative correlation. Apparently, if more people tweet about a flood, it's more likely to have a more positive overall sentiment. If replies are excluded, it works out to -0.31, which would indicate a weaker negative correlation.
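For anyone unfamiliar with it, the Pearson correlation coefficient is the covariance of the 2 variables divided by the product of their standard deviations, giving a value between -1 and 1. Here's a small self-contained sketch of that calculation; the `pearson` helper is written for illustration and the numbers in the example are made up, not taken from my dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of 2 equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need 2 lists of the same length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)

# Made-up example: 1 value of each per flood
tweet_counts = [1000, 2000, 3000, 4000, 5000]
sentiment_percent = [60, 55, 58, 40, 42]
r = pearson(tweet_counts, sentiment_percent)
print(f"r = {r:.2f}")
```

With only a handful of floods to compare, a coefficient like this is quite sensitive to individual data points, so it's best treated as a hint rather than proof.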

Image classification

Another thing I've been working on is classifying images associated with tweets using the sentiment of the tweet as a label. With roughly 175K images to work with, this has proved to be a useful exercise - resulting in ~75.23% validation accuracy (rising to 96.9% training accuracy by epoch 50, suggesting it's overfitting and I need more data to improve validation accuracy any further) over 9 epochs. While unfortunately I can't share a sample of positive/negative images predicted by the model due to data privacy rules, I can talk about the structure of the model and show a confusion matrix.

I started out with a Compact Convolutional Transformer model as it looks very cool and has some significant benefits over more traditional model architectures, but there's a bug in my implementation somewhere I can't spot, and it only yields 10% to 20% accuracy on Fashion MNIST. To avoid wasting too much time, I'm now using a prebuilt ResNet50 initialised with random weights that takes in images of size 128 x 128 pixels (images larger or smaller than this are automatically resized without preserving aspect ratio).

While the accuracy of the model is slightly lower than that of the model that predicts the sentiment of the tweets themselves, looking at samples that I quickly threw together with a bit of Bash and ImageMagick and the confusion matrix (see below) reveals that it is in fact doing something useful. Here's that Bash 2-liner:

shuf path/to/tweets-labelled.tsv | awk '/positive$/ { print("/absolute/path/to/media_dir/" $1); }' | head -n64 >/tmp/sample-pos.txt;
montage @/tmp/sample-pos.txt -geometry 960x540+10+10 -tile 8x8 /tmp/sample-pos.jpeg

The @path/to/file.txt syntax in the montage command call there reads a list of filepaths from a file instead of directly specifying them on the command line. By replacing positive in the awk filter with negative and sample-pos.txt with sample-neg.txt, the same procedure can also generate a random sample for the negative category.

From looking at a single sample of 64 positive images and 64 negative images generated using the method above, positive images generally include:

  • Cats (by far the most popular of course)
  • Cupcakes
  • People, some of whom are helping others in a flood

Whereas negative images generally include:

  • Floods
  • Damage to homes, buildings, etc

My next immediate step here is to plot a confusion matrix to better understand how the model is performing, as I'm slightly concerned that it's ignoring the minority class (in this case, positive tweets). I've mostly completed this already, but it's just not quite ready to show here yet as I need to double check and revise some stuff.

Of course, given that the source dataset is very noisy (social media data generally is) and relatively difficult for AI models to understand, I think this is a good result.

From here, if I have time I'd like to combine this image classification model with the earlier tweet sentiment analyser model to create a single model that can more accurately predict the sentiment of both the text of a tweet and the associated image at the same time. To do this, I'm probably going to investigate and use CLIP - more on this in a future post.

Conclusion

While I still haven't done anything with the Temporal Autoencoder due to other priorities, I'm hoping to return to it once I've wrapped up this social media section of the project. I have made significant progress on analysing the social media data - both the textual tweets and the associated images, and I plan to combine the models I've trained to classify both text and images into a single model. It's not the end of the road yet though: while I've found some answers, they are just leading to more questions.

Add your blog to hullblogs.com

For many years, csblogs.com has aggregated posts from the blogs of current and past students of the University of Hull. Unfortunately, recently the site has started to generate random error messages. To fix this, Freeside (of which I'm the head, as it turns out) have decided to implement a replacement.

It looks rather cool if I do say so myself, so I thought I'd share it here. I'll talk first about how you can add your blog to it and why it's a good idea to start one yourself (and how you can do so for free!). Then, I'll talk a little bit at the end about how I built hullblogs.com and how it's put together.

@closebracket had the idea to call it hullblogs.com, and then I implemented the code for it:

(Above: A screenshot of hullblogs.com)

You can visit it here: hullblogs.com.

If you're a current or past student at the University of Hull, then we want to hear from you! If you've got a blog (and even if you haven't yet!), then you can add your blog by going to hullblogs.com, and then clicking "add your blog".

If you don't yet have a blog, then we have you covered. Our guide has a number of really easy ways to get started and host a blog for free! There are multiple ways to host a blog without paying a penny (no hidden charges after some time either).

But I don't have a blog!

If you're not convinced about setting up a blog and putting it on the Internet, there are a number of key reasons you should:

  • It's not about other people reading it. By setting up a blog, you don't need to have 100s of views. What matters is that you write for you - not for anyone else.
  • If only 1 person reads your blog, then it's worth it. Your personal blog and/or website makes a great addition to your CV. It's a wonderful way of documenting the things you're doing and learning while doing your degree and beyond - and a great way for potential employers to find more detail on all this information in 1 place.
  • Write for yourself. Personally, I find my blog a great place to post about difficult problems that I've encountered and solved. Then when I encounter a similar problem again, I can re-read my own blog post about it. Because I'm the one who wrote the post, I can anticipate what I may find confusing and explain that in more detail to help out future me.
  • Improve your technical writing skills. Technical writing - the process of writing about complex technical subjects (whether this is Computer Science or beyond) - is an invaluable skill. Writing documentation, communicating complex ideas, and helping others are all situations which can call for technical writing - and it's a great thing to put on your CV too. Just doing a thing isn't enough - being able to write about it and document it is really important when working, for example, on commercial (or even academic) work.

If you're not computer science oriented, then Wordpress, Blogger, or maybe Squarespace [unclear if squarespace is free or not] would be a great place to start your blogging journey. You can host a blog for free too!

If you are computer science oriented (I'd guess most of the people reading this probably are given the kinds of posts I write for my blog here), then I strongly recommend Eleventy. It's a static site generator, and there are a whole bunch of ready-made templates for you to use to get started quickly.

In fact, hullblogs.com itself is built with Eleventy, using feedme to parse RSS/Atom feeds.

Every night, the site will be rebuilt. During this process, it will download the RSS/Atom feed from all the blogs registered, and order the posts chronologically. If an image is associated with a post, this will also be downloaded - the site also looks for the first image in a post's content if an image isn't explicitly specified to be attached to a blog. Once ordered, the posts are then paginated (split into multiple pages) and the rest of the site is rendered.
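The ordering and pagination step is conceptually quite simple. The real site does this with Eleventy (so in JavaScript); the snippet below is just a Python sketch of the idea, with a hypothetical `paginate` helper, made-up posts, and the assumption that the newest posts come first.

```python
from datetime import datetime

def paginate(posts, page_size):
    """Order posts newest-first (an assumption), then split them into pages."""
    ordered = sorted(posts, key=lambda post: post["published"], reverse=True)
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]

# Made-up posts gathered from several feeds
posts = [
    {"title": "Older post", "published": datetime(2021, 10, 3)},
    {"title": "Newest post", "published": datetime(2021, 11, 20)},
    {"title": "Middle post", "published": datetime(2021, 11, 1)},
]
pages = paginate(posts, page_size=2)
for number, page in enumerate(pages, start=1):
    print(f"Page {number}: {[post['title'] for post in page]}")
```

Each inner list then corresponds to 1 rendered page of the site.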

Being a static site, once the (re)build is complete hosting the site is simple and secure. Currently, Freeside hosts it using Nginx to statically serve it.

If your feed is malformed/contains errors or your blog is offline don't worry! hullblogs.com will transparently ignore your feed when it rebuilds. Then, once your blog is back up and running, it will add it back into the main hullblogs.com site the next night when it rebuilds again.
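The "ignore broken feeds and carry on" behaviour boils down to wrapping the parsing of each feed in its own error handler. As before, this is a Python sketch of the idea rather than the site's actual code (which uses feedme in JavaScript); the `parse_feeds` helper and the sample feeds are hypothetical, and a feed that failed to download is represented here as `None`.

```python
import xml.etree.ElementTree as ET

def parse_feeds(feeds):
    """Parse every feed that is well-formed, and note the ones that were skipped.

    feeds: dict of blog name → raw XML string (or None if the download failed).
    """
    parsed, skipped = {}, []
    for name, raw_xml in feeds.items():
        if raw_xml is None:
            skipped.append(name)  # Blog was offline
            continue
        try:
            parsed[name] = ET.fromstring(raw_xml)
        except ET.ParseError:
            skipped.append(name)  # Feed is malformed
    return parsed, skipped

feeds = {
    "good": "<rss><channel><title>Good blog</title></channel></rss>",
    "broken": "<rss><channel>",
    "offline": None,
}
parsed, skipped = parse_feeds(feeds)
print(f"Parsed: {sorted(parsed)}, skipped: {skipped}")
```

Because a skipped feed is simply tried again on the next rebuild, nothing needs to be remembered between builds.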

If you're interested in digging into the source code of the site, I've open sourced it on GitHub under the Apache 2.0 Licence: https://github.com/FreesideHull/hullblogs.com

If you've read this far, thanks so much for reading! We're after a better favicon logo for hullblogs.com. If you've got an idea, please do let us know by opening an issue.

PhD Update 10: Sharing with the world

Hey there - is it that time already? Another PhD update blog post! And in double digits too! In this one, I'll be talking mainly about the social media project I've been working on, as I haven't had time to work on the temporal CNN much since last time. I've also taken some time off in August, so technically this is only just over 1 month's worth of progress here.

Before we begin, here's a list of posts in this series so far. They give useful context - you probably won't understand this one without reading them first.

As with the last post, none of the graphs here are finalised, and are work in progress. There ~~may~~ probably are multiple nasty bugs that invalidate the results that I haven't found yet.

AAAI Doctoral Consortium 2022

The main thing I've done since last time is apply for the AAAI Doctoral Consortium 2022. As I understand it, this is a specific part of the main AAAI conference (Association for the Advancement of Artificial Intelligence) that is designed for PhD student researchers. To apply, you have to submit a cover page, your CV, and a 2 page thesis summary.

The cover page wasn't too bad, and updating my CV and getting it checked by the careers service at my university was simply a case of doing it. The thesis summary on the other hand was more of a challenge than I expected - it's quite difficult to summarise your entire 3 years of (planned) work in just 2 pages! It helped me to picture it as a high-level overview / conversation starter rather than a free-standing paper in its own right - even if it looks like one.

Although this took longer than I thought to prepare, I did in fact get my submission together in time. I was glad that I left some extra time at the end though, as the rules were both very strict for what you can and can't do with your paper and unclear since they were written for the main AAAI conference. In addition I found at least 2 mistakes in the instructions and rules which appeared to be left over from previous years and hadn't been updated.

AI + HADR 2021

The other conference which I attempted to apply for at the suggestion of my supervisor is AI + HADR 2021. This stands for Artificial Intelligence for Humanitarian Assistance and Disaster Response, and it's a virtual workshop that is happening in December 2021. The submission calls for a 6 page paper (5 for content, and 1 for references) on things related to disaster response.

My plan here was to write a paper on the social media work I've been doing and submit that, with the idea of talking about the existing sentiment analysis I've done (see update 9) and focusing on answering a research question using the sentiment analysis model I've implemented.

Unfortunately, things didn't go to plan as I ran out of time - even with my supervisor helping to write some of the parts of the paper. After finding a number of bugs in my data processing pipeline and failing to see any obvious trends in the sentiment analysis graphs I plotted, I realised that it was unlikely that I was going to be able to submit for it this time around.

The plan from here is to take some more time over it and answer my other research question instead, and tidy up and re-use our existing unfinished submission here for IJCNN 2022 (I think that's the right link) - the submission deadline for which I think is going to be in January 2022.

Social media

While I've spent time writing submissions for various conferences and workshops, I have also been doing a bit of social media data analysis too.

To start with, I implemented a new endpoint for labelling tweets using a given saved model checkpoint. After using this to label the various datasets I've acquired (with the best transformer-based model checkpoint I have, which I think is ~78% accurate (got to double check everything to make sure I haven't mixed anything up)), I then got to work plotting some graphs. To start with, I plotted a simple bar graph of the overall sentiment of the different datasets I've downloaded.

(Above: A bar graph showing the overall sentiment of some of the floods in my dataset.)

I'm not really sure what to make of this - I suspect context-specific information is required to fully interpret this. I couldn't find any reliable context-specific information on short notice for the AI + HADR paper, so I'm going to keep looking if I can find the time to do so. Asking someone from the energy and environment institute may be a good idea.

After this, I binned the tweets over time and used this to plot a combined graph showing both the tweet frequency and sentiment over time using Gnuplot.
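Binning here just means rounding each tweet's timestamp down to the start of a fixed-size time window and counting how many positive and negative tweets land in each window. Below is a minimal sketch of that; the `bin_tweets` helper, the 1 hour bin size, and the sample tweets are all assumptions for illustration. The tab-separated output at the end is 1 possible format that a plotting tool such as Gnuplot could read.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

def bin_tweets(tweets, bin_seconds=3600):
    """Count positive and negative tweets per time bin.

    tweets: iterable of (timezone-aware datetime, sentiment) pairs.
    Returns a list of (bin_start, positive_count, negative_count), oldest first.
    """
    bins = defaultdict(Counter)
    for created_at, sentiment in tweets:
        timestamp = int(created_at.timestamp())
        # Round down to the start of the bin this tweet falls into
        bins[timestamp - timestamp % bin_seconds][sentiment] += 1
    return [
        (datetime.fromtimestamp(start, tz=timezone.utc), counts["positive"], counts["negative"])
        for start, counts in sorted(bins.items())
    ]

# Made-up sample tweets
tweets = [
    (datetime(2021, 1, 20, 9, 5, tzinfo=timezone.utc), "positive"),
    (datetime(2021, 1, 20, 9, 40, tzinfo=timezone.utc), "negative"),
    (datetime(2021, 1, 20, 10, 10, tzinfo=timezone.utc), "negative"),
]
rows = bin_tweets(tweets)
for bin_start, positive, negative in rows:
    print(f"{bin_start.isoformat()}\t{positive + negative}\t{positive}\t{negative}")
```

The tweet frequency for a bin is then just the sum of the 2 counts, which is why a single pass over the data is enough to plot both frequency and sentiment.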

(Above: A graph showing the frequency (blue) and sentiment (green and red) of tweets over time.)

Again here, I'm not sure what to make of this. Even cropping out the long tail of people talking about the flooding event afterwards, to make it easier to see the sentiment over time as the actual event occurred, doesn't seem to help uncover any clear trends. It could be said that for Hurricane Iota the sentiment got more positive over time at the beginning, but this does not really hold true for Storm Christoph and others - and without context-specific information it's difficult to tell if there are any meaningful conclusions that can be drawn here.

To this end, after talking with my supervisor we've got some ideas for things I'm going to try - so more on this in an upcoming PhD update blog post.

Finally, I've also had a discussion with my supervisor and when I'm ready to publish something on my social media work, I'm going to make the code behind it open source (probably either GPLv3 or MPL-2.0). If you're interested, the code for downloading tweets using Twitter's Academic API is already open source: https://www.npmjs.com/package/twitter-academic-downloader.

Temporal CNN

I haven't really done anything on the Temporal CNN since last time, but I wanted to make sure it wasn't left out of this post! It's definitely still on the cards - the plan is that once I've got this data analysis done and some meaningful social media results, I'm going to return to the Temporal CNN and put the plan I described in the last post into action.

Conclusion

Since last time I've mainly been writing and analysing social media data. While I didn't manage to apply to AI + HADR 2021, I did manage to submit for AAAI Doctoral Consortium 2022 - I'll find out if I've been accepted on the 15th October 2021.

Next up, I'm going to be working on answering my other research question first - more on this in a later blog post. If I have time, I'll put some effort into the Temporal CNN - though I doubt I'll have anything to show on that front next time. Finally, I'm going to be arranging my PhD Panel 4 (where did all the time go?) - hopefully for before the end of November, availability of those involved permitting.

If you are finding this series of blog posts on my PhD interesting, please do comment below. It's great to see that the stuff I'm working on for my PhD is actually interesting to someone.

Sources and further reading

PhD, Update 9: Results?

Hey, it's time for another PhD update blog post! Since last time, I've been working on tweet classification mostly, but I have a small bit to update you on with the Temporal CNN.

Here's a list of posts in this series so far before we continue. If you haven't already, you should read them as they give some context for what I'll be talking about in this post.

Note that the results and graphs in this blog post are not final, and may change as I further evaluate the performance of the models I've trained (for example I recently found and fixed a bug where I misclassified all tweets without emojis as positive tweets by mistake).

Tweet classification

After trying a bunch of different tweaks and variants of models, I now have an AI model that classifies tweets. It's not perfect, but I am reaching ~80% validation accuracy.

The model itself is trained to predict the sentiment of a tweet, with the label coming from a preset list of emojis I compiled manually (by looking at a list of the emojis in the dataset itself, which I wrote a quick Bash one-liner to calculate). Each tweet is classified by adding up how many emojis in each category it has in it, and the category with the most emojis is then ultimately chosen as the label to predict. Here's the list of emojis I've compiled:

Category Emojis
positive 😂👍🤣❤👏😊😉😁😘😍🤗💕😀💙💪✅🙌💚👌🙂😎😆😅☺😃😻💖💋💜😹😜♥😄💛😽✔👋💗✨🌹🎉👊😋😏🌞😇🎶⭐💞😺😸🖤🌸💐🍀🌼🌟🤝🌷🐱🤓😌😛😙
negative 🥶⚠💔😰🙄😱😭😳😢😬🤦😡😩🙈😔☹😮❌😣😕😲😥😞😫😟😴🙀😒🙁😠😪😯😨👎🤢💀😤😐😖😝😈😑😓😿😵😧😶😦🤥🤐🤧🌧☔💨🌊💦🚨🌬🌪⛈🌀❄💧🌨⚡🦆🌩🌦🌂

These categories are stored in a tab separated values file (TSV), allowing the program I've written that trains new models to be completely customisable. Given that there were a significant number of tweets with flood-related emojis (🌧☔💨🌊💦🚨🌬🌪⛈🌀❄💧🌨⚡🦆🌩🌦🌂) in my dataset, I tried separating them out into their own class, but it didn't work very well - more on this later.
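To make the labelling scheme a bit more concrete, here's a minimal Python sketch of it. This isn't the actual training program: the 2-column layout of the TSV file (category, then emojis, with no header row), the function names, and the choice to return no label when the counts are tied are all assumptions. It also treats every code point as 1 emoji, which is a simplification.

```python
# Variation selector + zero-width joiner: these are not emojis in their own right
IGNORED = {"\ufe0f", "\u200d"}

def load_categories(tsv_text):
    """Parses 'category<TAB>emojis' lines into {category: set of characters}."""
    categories = {}
    for line in tsv_text.splitlines():
        name, separator, emojis = line.partition("\t")
        if not separator:
            continue  # Skip blank or malformed lines
        categories[name.strip()] = set(emojis.strip()) - IGNORED
    return categories

def label_tweet(text, categories):
    """Returns the category with the most emojis in the text.

    Returns None if the text has no known emojis, or if the highest
    count is tied between categories (tie handling is an assumption).
    """
    counts = {
        name: sum(1 for character in text if character in emojis)
        for name, emojis in categories.items()
    }
    best = max(counts.values(), default=0)
    winners = [name for name, count in counts.items() if count == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]
```

Because the categories come from a file rather than being hard-coded, adding a 3rd category such as flood is just a matter of adding another line to the TSV file.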

The dataset I've downloaded comes from a number of different queries and hashtags, and is comprised of just under 1.4 million tweets, of which ~90K tweets have at least 1 emoji in the above categories.

A pair of pie charts showing statistics about the twitter dataset.

Tweets were downloaded with a tool I implemented using a number of flood related hashtags and query strings - from generic ones like #flood to specific ones such as #StormChristoph.

Of the entire dataset, ~6.44% contain at least 1 emoji. Of those, ~49.57% are positive, and ~50.43% are negative:

Category Number of tweets
No emoji 1306061
Negative 45367
Positive 44585
Total tweets 1396013
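As a quick sanity check, the percentages quoted above can be reproduced from the counts in the table. This is just the arithmetic written out in Python, using only the numbers given here:

```python
counts = {"No emoji": 1306061, "Negative": 45367, "Positive": 44585}

total = sum(counts.values())
with_emoji = counts["Negative"] + counts["Positive"]

# Share of the whole dataset that has at least 1 emoji
pct_with_emoji = 100 * with_emoji / total
# Split of the emoji-bearing tweets between the 2 categories
pct_positive = 100 * counts["Positive"] / with_emoji
pct_negative = 100 * counts["Negative"] / with_emoji

print(f"Total tweets: {total}")
print(f"With emoji: {with_emoji} ({pct_with_emoji:.2f}%)")
print(f"Positive: {pct_positive:.2f}%, negative: {pct_negative:.2f}%")
```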

The models I've trained so far are either a transformer, or a 2-layer LSTM with 128 units per layer - plus a dense layer with a softmax activation function at the end for classification. If batch normalisation is enabled, every layer except the dense layer has a batch normalisation layer afterwards:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bidirectional (Bidirectional (None, 100, 256)          336896    
_________________________________________________________________
batch_normalization (BatchNo (None, 100, 256)          1024      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               394240    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
dense (Dense)                (None, 2)                 514       
=================================================================
Total params: 733,698
Trainable params: 732,674
Non-trainable params: 1,024
_________________________________________________________________
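For anyone wondering where the numbers in the Param # column come from, they can be reproduced by hand with the standard parameter count formulas for LSTM, batch normalisation, and dense layers. Here's a sketch in plain Python. Note that the input size of 200 features per timestep is not stated in this post: it's an assumption, inferred because it's the value that reproduces the count for the first layer.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * units * (input_dim + units + 1)

def bidirectional_lstm_params(input_dim, units):
    # The bidirectional wrapper holds 2 independent LSTMs (forwards + backwards)
    return 2 * lstm_params(input_dim, units)

def batchnorm_params(features):
    # gamma + beta are trainable; the moving mean + variance are not
    return {"trainable": 2 * features, "non_trainable": 2 * features}

def dense_params(input_dim, units):
    # A weight per input-output pair, plus 1 bias per output
    return input_dim * units + units

INPUT_DIM = 200  # Assumption: inferred from the summary, not stated in the post
UNITS = 128

layers = [
    bidirectional_lstm_params(INPUT_DIM, UNITS),  # bidirectional
    sum(batchnorm_params(2 * UNITS).values()),    # batch_normalization
    bidirectional_lstm_params(2 * UNITS, UNITS),  # bidirectional_1
    sum(batchnorm_params(2 * UNITS).values()),    # batch_normalization_1
    dense_params(2 * UNITS, 2),                   # dense
]
total = sum(layers)
non_trainable = 2 * batchnorm_params(2 * UNITS)["non_trainable"]
print(layers, total, total - non_trainable, non_trainable)
```

The second LSTM layer takes 256 features as its input because the first bidirectional layer concatenates the outputs of its forwards and backwards halves (2 x 128).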

I tried using more units and varying the number of layers, but it didn't help much (in hindsight this was before I found the bug where tweets without emojis were misclassified as positive).

Now for the exciting part! I split the tweets that had at least 1 emoji into 2 datasets: 80% for training, and 20% for validation. I trained a few different models tweaking various different parameters, and got the following results:

2 categories: training accuracy

2 categories: validation accuracy

Each model has a code that is comprised of multiple parts:

  • c2: 2 categories
  • bidi: LSTM, bidirectional wrapper enabled
  • nobidi: LSTM, bidirectional wrapper disabled
  • batchnorm: LSTM, batch normalisation enabled
  • nobatchnorm: LSTM, batch normalisation disabled
  • transformer: Transformer (just the encoder part)

It seems that if you're an LSTM, being bidirectional does result in a boost to both stability and performance (performance here meaning how well the model works, rather than how fast it runs - that would be efficiency instead). Batch normalisation doesn't appear to help all that much - I speculate this is because the models I'm training really aren't all that deep, and batch normalisation apparently helps most with deep models (e.g. ResNet).

For completeness, here are the respective graphs for 3 categories (positive, negative, and flood):

3 Categories: Training accuracy

3 Categories: Validation accuracy

Unfortunately, 3 categories didn't work quite as well as I'd hoped. To understand why, I plotted some confusion matrices at epochs 11 (left) and 44 (right). For both categories I've picked out the charts for bidi-nobatchnorm, as it was the best performing (see above).

2 Categories: confusion matrices

3 Categories: confusion matrices

The 2 category model looks to be reasonably accurate with no enormous issues, as we expected from the ~80% accuracy from above. Twice the number of negative tweets are misclassified as positive compared to the other way around for some reason - it's possible that negative tweets are more difficult to interpret. Given the noisy dataset (it's social media and natural language, after all), I'm pretty happy with this result.

The 3 category model has fared less brilliantly. Although we can clearly see it has improved over the training process (and would likely continue to do so if I gave it longer than 50 epochs to train), its accuracy is really limited by its tendency to overfit - preventing it from performing as well as the 2 category model (which, incidentally, also overfits).

Although it classifies flood tweets fairly accurately, it seems to struggle more with positive and negative tweets. This leads me to suspect that training a model on flood / not flood might be worth trying - though as always I'm limited by a lack of data (I have only ~16K tweets that contain a flood emoji). Perhaps searching twitter for flood emojis directly would help gather additional data here.
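For readers who haven't come across confusion matrices before, here's a minimal Python sketch of how one can be computed from a list of true labels and a list of predictions. The function names are hypothetical and this isn't the code used to generate the charts above.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Returns a nested dict: matrix[actual_label][predicted_label] = count."""
    pairs = Counter(zip(actual, predicted))
    return {a: {p: pairs[(a, p)] for p in labels} for a in labels}

def accuracy(matrix):
    """The fraction of predictions that sit on the diagonal of the matrix."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total if total else 0.0
```

The off-diagonal cells are the interesting ones: a large count in matrix["negative"]["positive"], for example, would correspond to negative tweets being misclassified as positive.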

Now that I have a model that works, I'm going to move on to answering some research questions with this model. To do so, I've upgraded my twitter-academic-downloader to download information about attached media - downloading attached media itself deserves its own blog post. After that, I'll be looking to publish my first ever scientific papery thing (don't know the difference between journal articles and other types of publication yet), so that is sure to be an adventure.

Temporal CNN(?)

In Temporal CNN land, things have been much slower. In my latest attempt to fix the accuracy issues I've described in detail in previous posts, I've implemented a vanilla convolutional autoencoder by following this tutorial (the links to the full code behind the tutorial are broken, and it's really annoying). This does image to image translation, so it would be ideal for modifying to support my use-case. After implementing it with Tensorflow.js, I hit a snag:

Autoencoder: expected vs actual

After trying all the loss functions I could think of and all sorts of other variations of the model, for sanity's sake I tried implementing the model in Python (the original language used in the blog post). I originally didn't want to do this, since all the data preprocessing code I've written so far is in Javascript / Node.js, but within half an hour with some copying and pasting, I had a very quick model implemented. After training a teeny model for just 7 epochs on the Fashion MNIST dataset, I got this:

Autoencoder: epoch 7 in Tensorflow for Python

...so clearly there's something different in Tensorflow.js compared to Tensorflow for Python, but after comparing them dozens of times, I'm not sure I'll be able to spot the difference. It's a shame to give up on the Tensorflow.js version (it also has a very nice CLI I implemented), but given how long I've spent in the background on this problem, I'm going to move forwards with the Python version. To do this, I'm going to work on converting my existing data preprocessing code into a standalone CLI for converting the data into a format that I can consume in Python with minimal effort.

Once that's done, I'm going to expand the quick autoencoder Python script I implemented into a proper CLI program. This involves adding:

  • A CLI argument parser (argparse is my weapon of choice here for now in Python)
  • A settings system that has a default and a custom (TOML) config file
  • Saving multiple pieces of data to the output directory:
    • A summary of the model structure
    • The settings that are currently in use
    • A checkpoint for each epoch
    • A TSV-formatted metrics file
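The list above could be sketched out in Python something like this. To be clear, this is only a rough skeleton under my own assumptions: the setting names, default values, and function names are all hypothetical, and the checkpointing and model summary parts are left out since they depend on the model itself.

```python
import argparse

# Hypothetical defaults: the real setting names and values are not decided yet
DEFAULT_SETTINGS = {
    "train": {"epochs": 50, "batch_size": 64},
    "output": {"dir": "output"},
}

def build_parser():
    parser = argparse.ArgumentParser(description="Train an autoencoder (sketch).")
    parser.add_argument("--config", help="Path to a custom TOML settings file")
    parser.add_argument("--output", help="Output directory (overrides the settings)")
    return parser

def merge_settings(defaults, custom):
    """Recursively overlays the custom values on top of the defaults."""
    merged = dict(defaults)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = value
    return merged

def load_settings(config_path=None):
    custom = {}
    if config_path is not None:
        import tomllib  # In the standard library from Python 3.11 onwards
        with open(config_path, "rb") as handle:
            custom = tomllib.load(handle)
    return merge_settings(DEFAULT_SETTINGS, custom)

def metrics_row(epoch, metrics):
    """Formats 1 line of a TSV metrics file, e.g. to append after each epoch."""
    return "\t".join([str(epoch)] + [f"{value:.6f}" for value in metrics])
```

Keeping the defaults and the custom file separate means that the merged result is exactly "the settings that are currently in use", so it can be written straight to the output directory alongside everything else.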

The existing script also spits out a sample image (such as the one above - just without the text) for each epoch too, which I've found to be very useful indeed. I'll definitely be looking into generating more visualisations on-the-fly as the model is training.

All of this will take a while (especially since this isn't the focus right now), so I'm a bit unsure if I'll get all of this done by the time I write the next post. If this plan works, then I'll probably have to come up with a new name for this model, since it's not really a Temporal CNN. Suggestions are welcome - my current thought is maybe "Temporal Convolutional Autoencoder".

Conclusion

Since the last post I've made on here, I've made much more progress on stuff than I was expecting to have done. I've now got a useful(?) model for classifying tweets, which I'm now going to move ahead with answering some research questions with (more on that in the next post - this one is long already haha). I've also got an Autoencoder training properly for image-to-image translation as a step towards getting 2D flood predictions working.

Looking forwards, I'm going to be answering research questions with my tweet classification model and preparing to write something to publish, and expanding on the Python autoencoder for more direct flood prediction.

Curious about the code behind any of these models and would like to check them out? Please get in touch. While I'm not sure I can share the tweet classifier itself yet (though I do plan on releasing it as open source after a discussion with my supervisor when I publish), I can probably share my various autoencoder implementations, and I can absolutely give you a bunch of links to things that I've used and found helpful.

If you've found this interesting or have some suggestions, please comment below. It's really motivating to hear that what I'm doing is actually interesting / useful to people.

Sources and further reading

PhD, Update 8: Eggs in Baskets

I'm back again with another PhD update blog post! Before we begin, here's a list of all the parts in the series so far:

As in the previous post, progress since last time is split in 2: The Temporal CNN, and the social media side of things. I've started to split my time more evenly between the 2 sides, as it seems like the Temporal CNN is going to take lots more work than anticipated and I'd rather not put all my eggs in 1 basket.

Temporal CNN

As you might have guessed, the Temporal CNN still isn't learning anything, but at least now I think I know what the problem is. Since last time, I've done a bunch of debugging and tests to try and figure out what the problem is. During that process, I've managed to reach a record of ~20% accuracy, which at least gives me hope that it's going to work!

Specifically, I used the MNIST (alternative site) handwriting digit dataset with my "easy" task as explained in the previous post, but with a small difference: I pre-generated 2 random tensors to serve as the "below 5" and "5 and above" targets to predict instead of a pair of tensors filled with 0s or 1s respectively. The model didn't like this at all, so this is how I now know what the problem is.

For those interested, here's the laundry list of other things I've tried since last time:

  • Giving it more data (all of 2007, with the 2013 floods as validation; made things a bit worse)
  • Found and fixed a bug in data normalisation that managed to sneak through during the rewrite (reduced training times a touch)
  • Inverting the heightmap (helped a bit)
  • Making the model deeper (Gave me a full 5% accuracy increase from 15% to 20%!)

Knowing what the problem is though is 1 thing, but solving it is another matter entirely. Thankfully, my supervisor and I have a plan to look into using a modified version of the latter half of a variational autoencoder and squidge it onto the tail end of the Temporal CNN. If it works, then I'm imagining that we'll need a new name for the Temporal CNN (suggestions?), but I'll tackle that once I've finished revising the model.

For context, a variational autoencoder is a modified "vanilla" autoencoder, and is 1 of 2 different main classes of generative AI model architecture - the other being Generative Adversarial Networks (GAN). In contrast to a GAN, a variational autoencoder does image-to-image translation with a single model, and maps an input parameter space onto an output parameter space. It first encodes the input to the model into a smaller tensor of features, before upscaling that back into an image again. In this fashion, it can learn to translate between 2 different images - for example putting glasses on people's faces.

To do this, I'm going to implement a vanilla variational autoencoder using the MNIST dataset, and once I've done this I'll then lift part of the model structure and transpose it onto the top of my existing Temporal CNN - by doing it this way I'll ensure that I have a known-good model to work with that is definitely capable of image-to-image translation.

Social Media

In other news, I've started to make some real progress on the social media side of things. I've downloaded and anonymised some tweets (the code for which is open source on npm under the package name twitter-academic-downloader - I intend to write a separate blog post about it at some point soon-ish), and I've also put together an LSTM-based model to start looking at doing some text classification.

I decided to implement said model in Python instead of Javascript, because from what I can tell Tensorflow.js doesn't come with as many batteries included as Tensorflow for Python does for natural language processing-based tasks. This has caused some interesting adventures (and a number of frustrating crashes), but I think I'm starting to get the hang of it.

In particular it's interesting coming from Tensorflow.js (which is a later project), because it seems that Tensorflow for Python is much less cohesive and more disjointed as a library compared to Tensorflow.js, which has learnt and applied lessons from the Python implementation - resulting in a much more cohesive and well thought out API. A prime example of this is tf.data.Dataset vs tf.keras.utils.Sequence in the Python version, which isn't an issue in Tensorflow.js, as in the Javascript bindings we have a single tf.data.Dataset.

This aside, my next step here is to train a significantly sized model that's larger than the mini model with a single layer and 100 units I've been using for testing purposes (that's my task for this afternoon - which I've likely done by the time you're reading this post).

In terms of literature, I've read a bunch more papers on the subject since last time - but I still feel like I've got more to read. Recently I read a series of papers about word embeddings (converting words into numerical tensors), which was very interesting. The process has evolved over the years, starting from a simple dictionary mapping incrementing numbers to words, to training an AI to generate said representations in increasingly sophisticated ways (starting with word2vec, then moving on to, in no particular order, ELMo, GloVe, and finally BERT - transformers are pretty incredible models). It was a fascinating read - I can recommend it to anyone who's interested in natural language processing (along with this excellent post).

In the model I've implemented, I've ultimately decided to go with GloVe (Global Vectors for Word Representation), as the pre-trained model is simply a text file containing a lookup table one can read into a dictionary or hash table.
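To show just how simple that is, here's a minimal Python sketch of reading a GloVe text file into a dictionary. Each line of the file is a word followed by its space-separated vector values. The function names are my own for illustration, and mapping unknown words to a zero vector is an assumption rather than what my model necessarily does.

```python
def parse_glove(lines):
    """Builds a {word: [float, ...]} lookup table from GloVe-format lines."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # Skip blank lines
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table

def load_glove(filepath):
    """Reads a GloVe text file from disk into a lookup table."""
    with open(filepath, "r", encoding="utf-8") as handle:
        return parse_glove(handle)

def embed(tokens, table, dimensions):
    """Looks up each token, using a zero vector for unknown words (an assumption)."""
    unknown = [0.0] * dimensions
    return [table.get(token, unknown) for token in tokens]
```

The result of embed() is a list of vectors with 1 entry per token, which is the kind of numerical input a model such as an LSTM can consume.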

Conclusion

Things have been moving forwards - albeit slowly. I've got an idea as to how I can resolve the issues I've been facing with the Temporal CNN (pending a new name once I'm done with all the modifications and I know what the model architecture is going to be like), though it's going to take a lot of work.

Things are finally starting to move in social media land - hopefully the accuracy of the LSTM-based model will be higher than that of the mini model I trained, which was only 50% on a balanced dataset - no better than blind guessing!

See you again in 2 months or so, when hopefully I'll have some real results to show (though of course I'll be keeping up with weekly posts about other things in the meantime). If you have any comments or questions about any of this - please leave a comment below! I'd love to hear your thoughts.

Sources and further reading

A much easier way to install custom versions of Python

Recently, I wrote a rather extensive blog post about compiling Python from source: Installing Python, Keras, and Tensorflow from source.

Since then, I've learnt of several other ways to achieve that goal which, as it turns out, are much easier.

For context, I needed to run a specific version of Python in the first place because my University's High-Performance Computer (HPC), Viper, doesn't have a version of Python new enough to run the latest version of Tensorflow.

Using miniconda

After contacting the Viper team at the suggestion of my supervisor, I discovered that they already had a mechanism in place for specifying which version of Python to use. It seems obvious in hindsight - since they are sure to have been asked about this before, they already had a solution in the form of miniconda.

If you're lucky enough to have access to Viper, then you can load miniconda like so:

module load python/anaconda/4.6/miniconda/3.7

If you don't have access to Viper, then worry not. I've got other methods in store which might be better suited to your environment in later sections.

Once loaded, you can specify a version of Python like so:

conda create -n py python=3.8

The -n py specifies the name of the environment you'd like to create, and can be anything you like. Perhaps the name of the project you're working on would be a good idea. The python=3.8 is the version of Python you want to use. You can list the versions of Python available like so:

conda search -f python

Then, to activate the new environment, do this:

conda init bash
conda activate py
exec bash

Replace py with the name of the environment you created above.

Now, you should have the specific version of Python you wanted installed and ready to use.

Edit 2022-03-30: Added conda install pip step, as some systems don't natively have pip by default which causes issues.

The last thing we need to do here is to install pip inside the virtual conda environment. Do that like so:

conda install pip

You can also install packages with pip, and it should all come out in the wash.

For Viper users, further information about miniconda can be found here: Applications/Miniconda Last

Gentoo Project Prefix

Another option I've been made aware of is Gentoo's Project Prefix. Essentially, it installs Gentoo (a distribution of Linux) inside a directory without root privileges. It doesn't work very well on Ubuntu, however, due to this bug, but it should work on other systems.

They provide a bootstrap script that you can run that helps you bootstrap the system. It asks you a few questions, and then gets to work compiling everything required (since Gentoo is a distribution that compiles everything from source).

If you have multiple versions of gcc available, try telling it about a slightly older version of GCC if it fails to install.

If you can get it to install, a Gentoo Prefix install allows the installation of whatever software you like!

pyenv

The last solution to the problem I'm aware of is pyenv. It automates the process of downloading and compiling specified versions of Python, and also updates your shell automatically. It does require some additional dependencies to be installed though, which could be somewhat awkward if you don't have sudo access to your system. I haven't actually tried it myself, but it may be worth looking into if the other 2 options don't work for you.

Conclusion

There's always more than 1 way to do something, and it's always worth asking if there's a better way if the way you're currently using seems hugely complicated.

PhD, Update 7: Just out of reach

Oops! I must have forgotten about writing an entry for this series. Things have been complicated with the current situation, but I've got some time now to talk about what's been happening since my last post about my PhD. Before we continue though, here's a list of all the parts so far:

In this post, there are 2 distinct areas to talk about. Firstly, the (limited) progress I've made on the Temporal CNN - and secondly the social media content.

Rainfall Radar / Temporal CNN

Things on the Temporal CNN front have been.... interesting. In the last post, I talked about how I was planning to update the model to use a cross-entropy loss function instead of mean squared error. In short, the idea here is to bin the water depth values we want to predict into a number of different categories, and then get the AI to predict which category each pixel belongs in.

The point of this is to allow me to evaluate more effectively what the model is good at and what it struggles with, with the help of a confusion matrix.
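As a small illustration of the binning idea, here's a Python sketch that maps water depth values onto category indices. The bin boundaries here are hypothetical (the real ones aren't given in this post), as are the function names.

```python
from bisect import bisect_right

# Hypothetical bin boundaries in metres: the real ones are not stated here
BIN_EDGES = [0.1, 0.5, 1.0, 2.0]

def depth_to_class(depth, edges=BIN_EDGES):
    """Maps a water depth onto a category index in the range 0..len(edges)."""
    return bisect_right(edges, depth)

def bin_depth_map(depths, edges=BIN_EDGES):
    """Converts a 2D grid of depths into a 2D grid of category indices."""
    return [[depth_to_class(value, edges) for value in row] for row in depths]
```

With 4 boundaries there are 5 categories, so the model would predict 1 of 5 categories for each pixel rather than a single continuous depth value.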

Unfortunately, after going to a considerable amount of effort, the model hasn't yet been able to learn anything at all when using the cross-entropy loss function. I've tried a whole array of different things by this point:

  • Using sparse / non-sparse categorical cross-entropy loss (ref)
  • Changing the number of filters in the model
  • Changing the format of the input data
  • Moving from average pooling → max pooling
  • Fiddling with the 2D CNN layer at the end of the model
  • ...and many more things - too many to list or remember here

Unfortunately, none of these have had a meaningful impact on the model's ability to learn anything. Despite this, we haven't run out of ideas yet. My current plan is to rebuild the model based on a known good model.

The known-good model in question is one I built earlier for a talk I did. Its purpose is classifying images, which it does fabulously with the well-known MNIST handwriting digits dataset. Its structure has 1 2D CNN layer, followed by a dense layer that outputs the probabilities as a 1D array:

See above for explanation

I devised 2 learning tasks to test the model with here. The "hard" task, which is to predict the exact digit in the picture, and the "easy" task - which is to predict whether the digit is greater than or equal to 5 or not (this simulates a binary cross-entropy task I've been trying my original model with). The original model works brilliantly with the hard task, gaining an easy 98% accuracy after 12 epochs.

After forking it and then refactoring significantly to decouple its various components, I started to modify the model's structure step by step to more closely match that of the Temporal CNN.

The initial results of this process (which only really got going on Tuesday 23rd February 2021) have been fascinating, as I've been running the MNIST dataset through it in between each step to check that it's still working as intended.

For example, I've discovered that the model has an intense dislike of pooling layers (both average and max). I suspect this might be because I'm not using it correctly, but I discovered that I could only get about 40% accuracy with the pooling layer in place, compared to ~99% without it.

Another thing I've done is removing the dense layer from the model, but this comes with its own set of problems. The eventual goal is to do what is essentially image-to-video translation, so a key part of this process is to get the model to produce at least a 2D tensor as an output instead of a 1D list of predictions for a single pixel.

To simulate this with the MNIST dataset, I copied the output prediction. For the "hard" task, I copied the array of probabilities for each category into a cube, with 1 copy of the array for each pixel of the output. I found while doing this though that I got about 22% accuracy - though I suspect that the model was slower to converge than normal and if I'd maybe made the model a bit larger or let it train for longer, I'd be able to improve that somewhat.
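To make the shape of those targets a bit clearer, here's a small sketch in plain Python using nested lists rather than tensors. The function names and the output size are hypothetical, and the real thing operates on tensors rather than lists.

```python
def easy_label(digit):
    """The "easy" task: 1 if the digit is 5 or above, otherwise 0."""
    return 1 if digit >= 5 else 0

def one_hot(index, size):
    """The "hard" task target: a probability of 1 for the correct digit."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def tile_target(probabilities, height, width):
    """Copies a 1D array of probabilities to every pixel of the output,
    giving a height x width x categories cube as nested lists."""
    return [[list(probabilities) for _ in range(width)] for _ in range(height)]
```

For example, tile_target(one_hot(3, 10), 28, 28) would give a 28 x 28 x 10 cube in which every pixel says "this is a 3".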

It fared much better on the "easy" task though - easily achieving 99% accuracy fairly quickly with just 2 x 2D CNN layers in a row.

With these tests in mind, I'll be continuing the process of tweaking my new model bit by bit to match the original Temporal CNN, with the eventual goal of running my actual dataset through the model.

Social Media

I thought I'd be well into the social media part of my PhD by now, but things have been getting in the way (e.g. life stuff with respect to the current situation, and the Temporal CNN being awkward), so I haven't yet been able to make a serious start on the social media side of things.

Still, I've been working away at the paperwork. I've now got ethical approval to work with publicly-available social media data (so long as I anonymise it, of course), and I've also been applying to get access to Twitter's new Academic API (which apparently went through successfully, but I'm currently troubleshooting the reason why it's asking me to apply for an account all over again).

I've also been reading a paper or 2, but since most of my energy has been spent elsewhere I have yet to dive seriously into this (I re-discovered the other day a folder full of interesting papers my supervisor sent me, so I'm going to dive into that as soon as I get a moment).

Papers looking into analysing social media data with advanced AI models appear to be in short supply - most papers I've read so far are either talking about analysing longer texts such as newspaper articles, or are using keyword-based and statistics-based methodologies to analyse data.

While this makes for an interesting research gap, I do feel slightly nervous that I've somehow missed something (which I guess I'll find out soon enough after reading some more papers). At any rate, my supervisor and I have some promising ideas and directions to look into moving forwards, so I'm not too worried here. I've also had some interesting discussions with people from the humanities side of my PhD (if you can call it that? I'm not sure what the right terminology is) over potential research questions too, so there's lots of scope here for investigation.

I anticipate that social media data isn't going to be as difficult a dataset to wrangle as the rainfall radar either (it's got to be better than a badly documented proprietary binary format), as it's already encoded in JSON - so I'm not expecting I'll need to spend ages and ages writing programs to reformat and parse the data.

Conclusion

Things have been moving slowly recently, due in part to difficulties with the Temporal CNN, and due in part to life in general suddenly becoming rather challenging recently. Things are starting to calm down now though, so I'm starting to have more time to work on my PhD (but it's going to be a number of months yet until things are properly back to normal).

By changing tack with the Temporal CNN, I feel like I'm starting to make some more progress again, and the social media track of my PhD is showing lots of promise even though it's too early to tell exactly what direction I'll be heading in with it.

Hopefully by the time I make another post here in 2 months' time, I'll have a working Temporal CNN and a start to the social media side of things - but this seems a tad ambitious based on how things have been going so far.

If you've got any comments or suggestions, do leave them below! I'd love to hear from you.

PhD Update 6: The road ahead

Hey, I'm back early with another post in my PhD series! Turns out there was a bit of a mix up last time, and I misplaced update 4 - so I've renamed the duplicate update 4 to update 5. Before we continue, here are all the posts in this series so far:

Earlier today, I had my PhD panel 2. For those reading who don't know, the primary assessments for a PhD (at least in my case) take the form of 6 panels - in which you write a big report, and then your supervisors get together with you and discuss it. The even numbered ones at the end of each year are the more important ones, I'm led to believe - so I was understandably nervous.

Thankfully, it went well, and I ended up writing a report that's a third longer than my undergraduate dissertation at ~9k words O.o Anyway, I wanted to make another post in this series, as the process of writing the report (and research plan) for my panel has made me look at the bigger picture of where my PhD is going, and how it's all going to tie together - and I wanted to share this here.

Temporal CNN

In the last few posts, I've given some insights into the process I've been working through to train a Temporal CNN to predict the output of HAIL-CAESAR. This is starting to reach its conclusion - though there are a number of tasks I have yet to complete to close this chapter of the story. In particular, I've (finally!) got the results from the hyperparameter optimisation I've been doing. Let's check out a heatmap:

The aforementioned heatmap - explained below.

I plotted the above with GNUPlot, but ran into a number of issues (of course) while generating it, which I'd like to briefly discuss here. If you're interested in a more detailed discussion of this, please get in touch (I'll also need to know your real name and where you're working / studying, just in case) via a private communication method - such as my email address on my website.

Firstly, those who are particularly perceptive will notice the rather strange filename for the above image. As indicated, I ran into some instability in my early stopping algorithm. After each epoch, the algorithm I implemented looks at the error from 3 epochs ago - and if it's greater than the error value for the epoch that's just finished, it allows training to continue. Unfortunately, due to the nature of the task at hand, this sometimes caused it to stop training too soon. Further analysis of the results revealed that it sometimes allowed training to continue for too long as well (which is reflected in the above chart), but I'm baffled on that one.

For this reason, all hyperparameter combinations that didn't train for quite long enough were omitted in the above graph.
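The early stopping rule described above can be sketched roughly as follows. This is a minimal illustration rather than the actual implementation: the function name and the shape of the `errors` array are assumptions, and only the "compare against 3 epochs ago" rule comes from the text.

```javascript
/**
 * Decide whether training should continue after the epoch that just finished.
 * @param {number[]} errors   Error value for every epoch so far, oldest first (assumed layout).
 * @param {number}   lookback How many epochs back to compare against (3 in the post).
 * @returns {boolean} true to keep training, false to stop.
 */
function shouldContinueTraining(errors, lookback = 3) {
    // Not enough history to compare against yet, so keep going
    if (errors.length <= lookback) return true;

    const latest = errors[errors.length - 1];
    const earlier = errors[errors.length - 1 - lookback];

    // Continue only if the error from `lookback` epochs ago is greater than the latest one
    return earlier > latest;
}

// Steadily improving: training continues
console.log(shouldContinueTraining([10, 8, 7, 6])); // true

// One noisy spike (9) followed by recovery: 8 (3 epochs ago) is not greater than 8.2, so it stops
console.log(shouldContinueTraining([10, 8, 9, 8.5, 8.2])); // false
```

The second call hints at why a noisy error curve can trip a rule like this too soon: a single comparison against one earlier epoch is sensitive to whichever value happened to land there.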

Secondly, there's a large area of the chart that isn't filled in. This is because increasing either the number of filters or the temporal depth increases memory usage - and the unfilled area is where the model failed to train because it ran out of GPU memory (I used 4 x Nvidia Tesla V100 - thanks very much to my department for providing this!).
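To get a feel for the scale involved, here is a back-of-the-envelope sketch of how much memory a single float32 tensor needs. The shape is borrowed from the out-of-memory log quoted in update 5 further down this page; which axis corresponds to the filter count or the temporal depth is not stated there, so nothing below assumes it. The helper names are made up for illustration.

```javascript
// Bytes needed for one dense tensor of the given shape (4 bytes per float32 element)
function tensorBytes(shape, bytesPerElement = 4) {
    return shape.reduce((total, dim) => total * dim, 1) * bytesPerElement;
}

const toMiB = bytes => bytes / (1024 * 1024);

// Shape from the OOM log in update 5
const logged = [8, 2, 104, 348, 210];
console.log(toMiB(tensorBytes(logged)).toFixed(1)); // 463.9

// Doubling any single axis doubles the allocation
const doubled = [8, 4, 104, 348, 210];
console.log(toMiB(tensorBytes(doubled)).toFixed(1)); // 927.8
```

Because the size is a product of the axes, raising two of them at once multiplies the cost, and training holds many such tensors at the same time, which fits with the unfilled corner of the heatmap.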

Finally, the most important thing to note is that after reviewing other similar models, I discovered that this really isn't the best way to evaluate the performance of this kind of model. Rather, it would be better to consider this as a classification-based task instead. This can be achieved by creating a number of different bins for the water depth values, and assigning a probability to each pixel for each water depth bin. The technical name for the loss function used to train this kind of model is cross-entropy loss as far as I'm aware, which in my case I've been tending to prefix with pixel-based.
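The binning idea described above can be sketched in plain JavaScript as follows. This is only an illustration under stated assumptions: the bin edges (and the idea that they are in metres) are hypothetical, the function names are made up, and a real implementation would operate on tensors rather than arrays.

```javascript
// Hypothetical bin edges (assumed to be metres); 3 edges give 4 bins:
// [0, 0.1), [0.1, 0.5), [0.5, 1.0), [1.0, infinity)
const BIN_EDGES = [0.1, 0.5, 1.0];

// Map a water depth value to the index of the bin it falls into
function depthToBin(depth, edges = BIN_EDGES) {
    let bin = 0;
    while (bin < edges.length && depth >= edges[bin]) bin++;
    return bin;
}

// Cross-entropy for one pixel: -log of the probability given to the true bin
function pixelCrossEntropy(probabilities, trueBin) {
    // Clamp to avoid -log(0) = Infinity
    return -Math.log(Math.max(probabilities[trueBin], 1e-12));
}

// Mean cross-entropy over a flat list of pixels
function meanCrossEntropy(predicted, depths) {
    let total = 0;
    for (let i = 0; i < depths.length; i++)
        total += pixelCrossEntropy(predicted[i], depthToBin(depths[i]));
    return total / depths.length;
}

// 2 pixels: true depths 0.05 (bin 0) and 0.7 (bin 2)
const depths = [0.05, 0.7];
const predicted = [
    [0.7, 0.1, 0.1, 0.1],     // fairly confident, and correct
    [0.25, 0.25, 0.25, 0.25]  // completely unsure
];
console.log(meanCrossEntropy(predicted, depths).toFixed(3)); // 0.871
```

The unsure pixel contributes far more to the loss than the confident correct one, which is the behaviour that makes this loss a natural fit for per-pixel probabilities.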

In this fashion, I can use things like a confusion matrix (can't find a good explanation of this, so I'll talk about it in a future blog post) to evaluate the performance of the model - which allows me to see the model's strengths and weaknesses. What are its strengths? What are its weaknesses? I hope to find out - though I'm a little nervous about it, as I haven't yet looked into how to generate one with Tensorflow.js, or tested my initial cross-entropy loss implementation with my kind of dataset.
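For readers unfamiliar with the idea, a confusion matrix simply counts how often each true class was predicted as each class. The sketch below builds one in plain JavaScript from bin indices; it is not the Tensorflow.js way of doing it, and the function names and sample data are invented for illustration.

```javascript
// Rows are the true bin, columns are the predicted bin
function confusionMatrix(trueBins, predictedBins, binCount) {
    const matrix = Array.from({ length: binCount }, () => new Array(binCount).fill(0));
    for (let i = 0; i < trueBins.length; i++)
        matrix[trueBins[i]][predictedBins[i]]++;
    return matrix;
}

// Fraction of each true bin that was predicted correctly (per-bin recall)
function recallPerBin(matrix) {
    return matrix.map((row, i) => {
        const total = row.reduce((a, b) => a + b, 0);
        return total === 0 ? NaN : row[i] / total;
    });
}

// Invented example: 6 pixels, 3 water depth bins
const truth     = [0, 0, 1, 1, 2, 2];
const predicted = [0, 0, 1, 2, 2, 1];

const matrix = confusionMatrix(truth, predicted, 3);
console.log(matrix);               // [ [ 2, 0, 0 ], [ 0, 1, 1 ], [ 0, 1, 1 ] ]
console.log(recallPerBin(matrix)); // [ 1, 0.5, 0.5 ]
```

In this made-up case the off-diagonal counts show that bins 1 and 2 get confused with each other while bin 0 is always right, which is exactly the kind of strength and weakness a single error number hides.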

Having a probability-based system like this should also make the output more useful. By this, I mean that having probabilities assigned to different water depths should be easier to interpret than a single value, where nobody knows how the model came to that conclusion or how uncertain it is about the result.

Let's take a look at 1 more set of graphs before we move on:

4 graphs showing the training / validation root mean squared error - see below for explanation.

In this set of 4 graphs, I've taken the training and validation root mean squared error from 4 different combinations of hyperparameters. These rather effectively illustrate the early stopping instability here (top left), and they also demonstrate some instability in the training process (all graphs). I'm guessing the latter is probably caused by insufficient training data - this I can remedy quite easily as I have plenty more besides the 5K time steps I trained on (~1.5M time steps to be precise), but in the interests of time I limited the training dataset so that I could train a variety of different models.

It also shows the shortcomings of the root mean-squared-error approach I've been using so far - is an average error of 2 biased towards deeper or shallower water? What can it predict well, and what does it struggle with?
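A tiny worked example makes the point concrete. The depth values below are invented: one set of predictions is always 2 too deep and the other always 2 too shallow, yet the root mean squared error is identical for both.

```javascript
// Root mean squared error between two equal-length arrays
function rmse(predicted, actual) {
    const sumSquares = predicted.reduce((sum, p, i) => sum + (p - actual[i]) ** 2, 0);
    return Math.sqrt(sumSquares / actual.length);
}

// Mean signed error: positive means predictions are too deep on average
function meanSignedError(predicted, actual) {
    return predicted.reduce((sum, p, i) => sum + (p - actual[i]), 0) / actual.length;
}

const actual = [1, 3, 5, 7];                // invented water depths
const tooDeep = actual.map(v => v + 2);     // always overestimates by 2
const tooShallow = actual.map(v => v - 2);  // always underestimates by 2

console.log(rmse(tooDeep, actual), rmse(tooShallow, actual)); // 2 2
console.log(meanSignedError(tooDeep, actual));                // 2
console.log(meanSignedError(tooShallow, actual));             // -2
```

Squaring discards the sign of each error, so the direction of the bias only shows up in a separate measure such as the signed mean or a per-bin breakdown.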

My supervisor also wants me to write a publication on the Temporal CNN stuff I've done when it's complete, so that's going to be an interesting new experience for me too (I'm both excited and terribly nervous at the same time).

Social Media Analysis

Moving forwards, I'm hoping to complement my Temporal CNN model with some social media analysis (subject to ethical approval of course - filling out the paperwork for this is part of my task list for the rest of this week). If I get the go-ahead, my intention is to bring some natural language processing AI model to the subject of flood mapping using social media - as I've noticed that there's.... limited existing material on this so far.

I have yet to read up properly on the subject, but most recently I've found this paper rather interesting. It uses an unsupervised model to identify the topic(s) a tweet is talking about, which they then follow up with some statistical analysis.

I'm a little fuzzy on how they identify where tweets are talking about (especially since this paper mentions that only a very small percentage of tweets have a geotag attached, and even of these a fraction will be wrong or misleading for 1 reason or another), but the researchers put together a rather nice map by chaining together several statistical techniques I'm currently unfamiliar with. It shows hot spots and cold spots, which indicate where the damage from an earthquake has occurred.

If possible, it would be very nice indeed to plot a flood on a map (stretch goal: in real-time!). I anticipate that working out where tweets are talking about will be a key issue I'll need to pay attention to.

In terms of my most immediate starting point, I'm going to be doing some more reading on the subject, and then have a chat with my supervisor about the next steps (she's particularly knowledgeable about natural language processing :D).

Conclusion

In conclusion, I'm starting to come to the end of the Temporal CNN chapter of my PhD, although I suspect it's going to keep coming back over and over again to say hello. Moving forwards, I'm hoping to complement work I've done so far with some social media analysis (subject to ethical approval) using AI-based Natural Language Processing - which will probably consist of improving an existing model or something more directly computer science related (my existing work is classified more as a contribution to environmental sciences, apparently).

Look out for more posts in this series in the future, as I'm sure I'll have plenty to talk about soon (I might do another one in a month's time, or it might end up being nearer 2 months depending on circumstances).

PhD Update 5: Hyper optimisation and frustration

Hello there again! It's been longer than I anticipated since the last proper post in this series. Before I continue, here's a list of all the (proper) posts in this series so far:

I haven't managed to get as much done since last time as I was hoping (partly due to the fact that I'm currently having to work from home, which is more challenging than I expected), but I have finished my implementation of the Temporal CNN, and am now working on hyperparameter optimisation. I've also fixed a number of issues in my rainfall radar data downloader and processing programs - which I'll talk about in more detail below.

HAIL-CAESAR and the iterative improvements

Someone at the University recently approached me (if you are reading this and have a blog, comment below and I'll update) to ask if they could use my rainfall radar data downloader program to download some rainfall radar data for their project. Naturally, I helped them out. This turned out to be a great thing for me as well, as with their help I managed to uncover a number of very nasty issues with the data pipeline I had been building up to that point:

  • The hydro index file that HAIL-CAESAR uses was completely scrambled
  • The date on the data downloaded was a month out
  • The data downloaded was (and still is) rotated by 90° on disk
  • The data was out by a factor of 32

While fixing each of these bugs was a (relatively) simple process, I can't help but wonder how they managed to escape my notice until (for all but 1 of them) someone else told me about them.

The other issue was that because of the amount of data I'm working with, it took forever to re-run the program to test to see if I had managed to fix the problem - and if I had, I'd encounter another problem. This long iteration process makes implementing a new feature or fixing a bug a very time-consuming process.

Despite fixing all these issues, I'm still experiencing issues with my latest refactoring of the rainfall radar data downloader (namely a hang in the event system when reading tar files). My current thinking is that I'm going to completely reimplement it (using snippets from the old program) if I need to use it again in the future, as it is currently neither particularly efficient (it's single-threaded) nor easy to bugfix (it's pretty complicated).

I've got an idea for a parallel system that processes each tar file separately first, and then only after all the tar files have been converted separately are they strung together into the actual files the existing implementation spits out currently.
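The two-phase idea above might look something like the following sketch. Everything here is hypothetical: `convertOne` stands in for whatever converts a single tar file, the concurrency limit of 4 is arbitrary, and the joining step just concatenates in-memory arrays where a real version would write files. Note also that promises only overlap I/O in Node.js, so CPU-heavy conversion would need `convertOne` to hand off to a child process or worker thread.

```javascript
// Run `worker` over `items` with at most `limit` in flight, keeping results in input order
async function mapWithLimit(items, limit, worker) {
    const results = new Array(items.length);
    let next = 0;
    async function runner() {
        while (next < items.length) {
            const index = next++; // safe: JavaScript runs this without interruption
            results[index] = await worker(items[index], index);
        }
    }
    const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
    await Promise.all(runners);
    return results;
}

async function convertAll(tarPaths, convertOne, limit = 4) {
    // Phase 1: every tar file is converted separately
    const parts = await mapWithLimit(tarPaths, limit, convertOne);
    // Phase 2: only after all parts exist are they strung together, in the original order
    return parts.flat();
}

// Demo with a fake converter that finishes in a scrambled order
const fakeConvert = (path, index) => new Promise(resolve =>
    setTimeout(() => resolve([`${path}:a`, `${path}:b`]), (3 - index) * 10));

convertAll(["1.tar", "2.tar", "3.tar"], fakeConvert, 2)
    .then(joined => console.log(joined));
```

Keeping the results indexed by input position means the final join is deterministic even though the individual conversions finish in whatever order they like.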

Temporal CNN delight

Last time, I had just started my implementation of a Temporal CNN. This is now pretty much complete, and I've also been able to run it and get some results! Check out this graph:

A graph showing the root mean squared error while training my Temporal CNN implementation - more details below.

This graph shows the root mean squared error when training on 1000 time steps of data (about 3 days 11 hours or so). Epochs are along the X axis, and the root mean squared error is on the Y axis.
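As a quick check of that duration: the interval between time steps isn't stated here, but the quoted figure works out if each step covers 5 minutes, so that is the assumption in the sketch below.

```javascript
// Assumption: 1 time step = 5 minutes (inferred from the quoted duration, not stated in the post)
const MINUTES_PER_STEP = 5;

function stepsToDuration(steps, minutesPerStep = MINUTES_PER_STEP) {
    const totalMinutes = steps * minutesPerStep;
    return {
        days: Math.floor(totalMinutes / 1440),
        hours: Math.floor((totalMinutes % 1440) / 60),
        minutes: totalMinutes % 60
    };
}

console.log(stepsToDuration(1000)); // { days: 3, hours: 11, minutes: 20 }
```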

A few things to note here. Firstly, the implementation I've come up with essentially does video-to-image translation. The original model in the paper I've linked to demonstrates a classification task (specifically land use over time) - so what I'm doing is a little different.

Secondly, I've omitted the root mean squared error for the first epoch. It was so high that it made the rest of the graph impossible to see - hence the omission.

I'm pretty pleased with this result so far - as I have a nice downwards curve indicating that the model is (probably) learning something useful.

I am still rather nervous about the output though, as due to the way I've implemented the network I haven't actually been able to 'see' the output of the network at all as an image yet. Doing so would take a while to implement, so I haven't done so for now (although I really should do this soon). It would be really cool to see a short video (maybe at ~10fps) of the network output as the epochs move forwards to visualise the network training process.

Hyperparameter frustrations

Lastly, at the suggestion of my supervisor I've been working on hyperparameter optimisation. In short, this consists of training the model with random combinations of hyperparameters and seeing which ones work best.

A hyperparameter is a tunable parameter that controls an aspect of a model. In my case, I have 2 key hyperparameters I need to tune:

  • Filter count: CNN layers in Tensorflow.js have a filter count associated with them. I theorise that increasing this will increase the model's ability to learn spatial information.
  • Temporal depth: The number of time steps to push through the model at once. Increasing this will allow the model to make predictions based on events that occur further in the past.

My eventual aim here is to create a heatmap that has the above hyperparameters along the X and Y axes, and the colour showing the accuracy of the model that was trained - similar to the one I created previously.

To do this, I implemented a program that tries random combinations of hyperparameters - but never the same combination twice. It starts the model in a subprocess and passes the chosen filter count and temporal depth values in as CLI arguments, which the child process picks up, parses, and then trains a model based on. This CLI is the same one as the one I developed that generated the above graph in the previous section of this blog post.

This approach has the advantage that it isolates the model in a subprocess, so when the subprocess exits and a new one spawns for the next combination of hyperparameters, the environment is completely clean and there isn't anything that might interfere with it.
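A stripped-down sketch of this kind of driver is shown below. It is not the actual program: the script path and the `--filters` / `--temporal-depth` flag names are placeholders, and the candidate values would come from wherever the search space is defined. Only the overall shape (random untried combination, fresh subprocess per run) follows the description above.

```javascript
const { spawn } = require("child_process");

// Pick a random (filters, depth) pair that hasn't been tried yet; null once all are used
function pickUntried(filterCounts, temporalDepths, tried, random = Math.random) {
    const remaining = [];
    for (const filters of filterCounts)
        for (const depth of temporalDepths)
            if (!tried.has(`${filters}:${depth}`)) remaining.push({ filters, depth });
    if (remaining.length === 0) return null;

    const choice = remaining[Math.floor(random() * remaining.length)];
    tried.add(`${choice.filters}:${choice.depth}`);
    return choice;
}

// Train one combination in a fresh Node.js child process (placeholder flag names)
function trainInSubprocess(script, { filters, depth }) {
    return new Promise((resolve, reject) => {
        const child = spawn(process.execPath, [
            script, "--filters", String(filters), "--temporal-depth", String(depth)
        ], { stdio: "inherit" });
        child.on("error", reject);
        child.on("exit", (code, signal) => resolve({ filters, depth, code, signal }));
    });
}

// Keep going until every combination has been tried once
async function search(script, filterCounts, temporalDepths) {
    const tried = new Set();
    let combination;
    while ((combination = pickUntried(filterCounts, temporalDepths, tried)) !== null)
        console.log(await trainInSubprocess(script, combination));
}
```

Calling something like `search("./train.js", [8, 16, 32], [2, 4, 8])` would then work through all 9 combinations in a random order, with `./train.js` being a stand-in for the real training CLI.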

Unfortunately though, while I set off a run of this implementation before I took a 'holiday' - and even checked on it to ensure it was running as expected (multiple times) - it still managed to crash when I wasn't looking.

After some debugging, I discovered that the problem was that the model ran out of memory while training. This was something I had expected - so I used the --unhandled-rejections=strict option for Node.js, which tells Node.js to crash and exit when a promise rejection goes unhandled - like this one:

2020-08-04 15:31:26.174395: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at conv_grad_ops_3d.cc:1783 : Resource exhausted: OOM when allocating tensor with shape[8,2,104,348,210] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
(node:62355) UnhandledPromiseRejectionWarning: Error: Invalid TF_Status: 8
Message: OOM when allocating tensor with shape[8,2,104,348,210] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    at Object.<anonymous> (<anonymous>)
.....

Unfortunately though, I used this flag on the parent process (that drives the hyperparameter optimisation) and not the child process - leading to a situation whereby the child process hit the aforementioned error and then just hung around doing nothing. Even more frustratingly, the solution is as simple as doing a quick export NODE_OPTIONS="--unhandled-rejections=strict" before running the hyperparameter optimisation program to ensure that the flag propagates to the child processes......
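An alternative to exporting the variable in the shell is to set it explicitly when spawning the child, so it can't be forgotten. The sketch below shows one way this might be done; it isn't what the post describes doing, and the function names and script path are hypothetical.

```javascript
const { spawnSync } = require("child_process");

const STRICT_FLAG = "--unhandled-rejections=strict";

// Copy of the parent's environment with the flag added to NODE_OPTIONS (if not already there)
function childEnv(parentEnv = process.env) {
    const existing = parentEnv.NODE_OPTIONS || "";
    if (existing.includes(STRICT_FLAG)) return { ...parentEnv };
    return { ...parentEnv, NODE_OPTIONS: `${existing} ${STRICT_FLAG}`.trim() };
}

// Run a child script to completion and report its exit status (null if killed by a signal)
function runChild(script, args = []) {
    const result = spawnSync(process.execPath, [script, ...args], {
        env: childEnv(),
        stdio: "inherit"
    });
    return result.status;
}

console.log(childEnv({}).NODE_OPTIONS); // --unhandled-rejections=strict
```

With the flag in place the child exits with a non-zero status on an unhandled rejection, so a driver could check the value returned by `runChild("./train.js")` (placeholder path) and move on to the next combination rather than waiting forever.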

Very frustrating indeed - especially considering I calculate that it will take multiple weeks to gather enough data to create a meaningful heatmap.

Conclusion

Reading back over this post, I have got more done than I expected. I've started and finished my Temporal CNN implementation, and fixed lots of bugs in my existing code.

However, the long iteration times to test code I've written (despite using a small slice of the dataset to test with), the large datasets I'm working with, having to work on a system remotely via SSH by pushing and pulling code with git (many times), working from home all the time, and the continued bugs I've been facing and will likely continue to face have caused and are causing significant unexpected slowdowns moving forwards.

At least the VPN is no longer dropping out every 5 minutes!

Sources and Further Reading

Art by Mythdael