PhD Update 12: Is it enough?
Hey there! It's another PhD update blog post! Sorry for the lack of posts here; the reason why will become apparent below. In the last post, I talked about the AAAI-22 Doctoral Consortium I'll be attending, and also about sentiment analysis of both tweets and images. Before we talk about progress since then, here's a list of all the posts in this series so far:
- PhD Update 1: Directions
- PhD Update 2: The experiment, the data, and the supercomputers
- PhD Update 3: Simulating simulations with some success
- PhD Update 4: Ginormous Data
- PhD Update 5: Hyper optimisation and frustration
- PhD Update 6: The road ahead
- PhD Update 7: Just out of reach
- PhD Update 8: Eggs in Baskets
- PhD Update 9: Results?
- PhD Update 10: Sharing with the world
- PhD Update 11: Answers to our questions
As in all the posts preceding this one, none of the things I present here are finalised, and everything is subject to significant change as I double-check it. I can think of no better example of this than the image classification model I talked about last time - the accuracy of which has dropped from 75.6% to 60.7% after I fixed a bug...
AAAI-22 Doctoral Consortium
The biggest thing in my calendar over the next few weeks is surely the AAAI-22 Doctoral Consortium. I've been given complimentary registration to the main conference after having a paper accepted, so the next few weeks are going to be rather busy - as have been the last few weeks of preparation.
Since the last post, in which I mentioned that AAAI-22 has been moved to be fully virtual, there have been a number of developments. Firstly, the specific nature of the doctoral consortium (and the wider conference) is starting to become clear. Not having been to a conference before (a theme which will come up a lot in this post), I'm not entirely sure what to expect, but as it turns out I've been asked to create a poster that I'll be presenting in a poster session (whether this is part of AAAI-22 or the AAAI-22 Doctoral Consortium is unclear).
I've created one with baposter (or a variant thereof given to me by my supervisor some time ago), which I've now submitted. Presenting a poster at a conference to lots of people I don't know is a rather terrifying prospect though, so I've been increasingly anxious about it over the past few weeks.
This isn't the end of the story though, as I've also been asked to do a 20 minute presentation with 15 minutes for questions / discussion, the preparations for which have taken perhaps longer than I anticipated. Thankfully I've presented at my department's PGR seminars before (which have also been online recently), so it's not as daunting as it would otherwise be, but as with the poster there's still the fear of presenting to lots of people I don't know - most of whom probably know more about AI than I do!
Despite these fears and the other complications you can expect from an international conference (timezones are such a pain sometimes), I'm looking forward to seeing what other people have done, and also hoping that the feedback I get on my presentation and poster isn't all negative :P
Sentiment analysis: a different perspective
Since the last post, a number of things have changed in my approach to sentiment analysis. The reason for this is - as I alluded to earlier in this post - that I fixed a bug in my model that predicts the sentiment of images from Twitter, causing the validation accuracy to drop from 75.6% to 60.7%. As of now, I have the following models implemented for sentiment analysis:
- Text sentiment prediction
- Status: Implemented, and works rather well actually - top accuracy is currently 79.6%
- Input: Tweet text
- Output: Positive/negative sentiment
- Architecture: Transformer encoder (previously: LSTM)
- Labels: Emojis from text [manually sorted into 2 categories]
- Image sentiment prediction
- Status: Implemented, but doesn't work very well
- Input: Images attached to tweets
- Output: Positive/negative sentiment
- Architecture: ResNet50 (previously: CCT, but I couldn't get it to work) - see the sketch after this list
- Labels: Positive/negative sentiment predictions from model #1
- Combined sentiment prediction
- Status: Under construction
- Input: Tweet text, associated image
- Output: Positive/negative sentiment
- Architecture: CLIP → a few dense layers [provisionally]
- Labels: Emojis from text, as in model #1
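To make the image model a little more concrete, here's a rough Keras sketch of the kind of ResNet50-based classifier used in model #2. The input resolution, the use of ImageNet weights, and the size of the dense head are placeholders for illustration, not the exact setup.

```python
import tensorflow as tf

# Minimal sketch of a ResNet50-based binary sentiment classifier (model #2).
# Input size, ImageNet weights, and the dense head are illustrative assumptions.
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3)
)

x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
x = tf.keras.layers.Dense(128, activation="relu")(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # positive / negative

model = tf.keras.Model(inputs=base.input, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```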
Not mentioned here of course is my rainfall radar model, but that's still waiting for me to return to it and implement a plan I came up with some months ago to fix it.
Of particular note here is the new CLIP-based model. After analysing model #2 (image sentiment prediction) and sampling some images from each category on a per-flood basis, it soon became clear that the output was very noisy and wasn't particularly useful, which - combined with its rather low performance - means that I'm pretty much considering it a failed attempt.
With this in mind, my supervisor suggested I look into CLIP. While I mentioned it in the paper I've submitted to AAAI-22, until recently I didn't have a clear picture of how it would actually be useful. As a model, CLIP trains on text-image pairs, and learns to identify which image belongs to which text caption. To do this, it has 2 encoders - 1 for the text, and 1 for the associated image.
You can probably guess where this is going, but my new plan here is to combine the text and images of the tweets I have downloaded into a single model rather than 2 separate ones. I figure that I can potentially take advantage of the encoders trained by the CLIP model to make a prediction. If I recall correctly, I've read a paper that has done this before with tweets from Twitter - just not with CLIP - but unfortunately I can't locate it at this time (I'll edit this post if I do find it in the future - if you know which paper I'm talking about, please do leave a DOI link in the comments below).
At the moment, I'm busy implementing the code to wrap the CLIP model and make it suitable for my specific learning task, but this is as yet incomplete. As noted in the architecture above, I plan on concatenating the output from the CLIP encoders and passing it through a few dense layers (as the [ batch_size, concat, dim ] tensor is not compatible with a transformer, which operates on sequences), but I'm open to suggestions - please do comment below.
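To illustrate the plan, here's a rough Keras sketch of what the classification head might look like, assuming the CLIP text and image embeddings have already been computed (an embedding size of 512 is assumed here, as for CLIP's ViT-B/32 variant) - the wrapper around CLIP itself is omitted.

```python
import tensorflow as tf

EMBED_DIM = 512  # assumed CLIP embedding size (e.g. the ViT-B/32 variant)

# Precomputed CLIP embeddings for the tweet text and the attached image.
text_embedding = tf.keras.Input(shape=(EMBED_DIM,), name="clip_text_embedding")
image_embedding = tf.keras.Input(shape=(EMBED_DIM,), name="clip_image_embedding")

# Concatenate the two embeddings and pass them through a few dense layers.
x = tf.keras.layers.Concatenate()([text_embedding, image_embedding])
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # positive / negative

model = tf.keras.Model(inputs=[text_embedding, image_embedding], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```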
My biggest concern here is that the tweet text will not be enough like an image caption for CLIP to produce useful results. If this is the case, I have some tricks up my sleeve:
- Extract any alt text associated with images (yes, twitter does let you do this for images) and use that instead - though I can't imagine that people have captioned images especially often.
- Train a new CLIP model from scratch on my dataset - potentially using this model architecture - this may require quite a time investment to get the model working as intended (Tensorflow can be confusing and difficult to debug with complex model architectures).
Topic analysis
Another thing I've explored is topic analysis using gensim's LdaModel. Essentially, as far as I can tell, LDA is an unsupervised algorithm that groups the words found in the source documents into a fixed number of related groups, each of which contains words that are often found in close proximity to one another.
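In gensim this boils down to something like the following - the tiny token lists here are made up purely for illustration; the real input is the stemmed tweet text.

```python
from gensim import corpora
from gensim.models import LdaModel

# Tiny, made-up set of tokenised and stemmed tweets - for illustration only.
docs = [
    ["flood", "rescu", "boat", "street", "water"],
    ["storm", "warn", "rain", "coast", "wind"],
    ["school", "close", "flood", "road", "car"],
]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words per tweet

# Group the vocabulary into a fixed number of related groups (topics).
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=10)

# Print the top 10 words in every topic.
for topic_id, words in lda.show_topics(num_topics=-1, num_words=10, formatted=False):
    print(topic_id, " ".join(word for word, _ in words))
```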
As an example, if I train an LDA model for 20 groups, I get something like the following for some of those categories:
place death north evacu rise drive hng coverag toll cumbria
come us ye issu london set line offic old mean
rescu dead miss uttarakhand leav bbc texa victim kerala defenc
flashflood awai world car train make best turn school boat
need town problem includ effect head assist sign philippin condit
(Full results may be available upon request, if context is provided)
As you can tell, the results of this are mixed. While it has grouped the words, the groups themselves aren't really what I was hoping for. The groups here seem to be quite generic rather than being about specific things (words like rescu, dead, miss, and victim in a single category), and also seem to be rather noisy (words like "north", "old", "set", "come", "us").
My first thought here was that instead of training the model on all the tweets I have, I might get better results if I train it on just a single flood at once - since some categories are dominated by flood-specific words (hurrican, hurricaneeta, hurricaneiota, for example). Here's an extract from the results of that, with 10 categories, for the #StormDennis hashtag:
warn flood stormdenni met issu offic water tree high risk
good hope look love i’m game morn dai yellow nice
ye lol head enjoi tell book got chanc definit valentin
mph gust ireland coast wors wave brace wow doesn’t photo
(Again, full results may be available on request if context is provided)
Again, mixed results here - and still very noisy (though I'd expect nothing else from social media data), though it is nice to see something like gust, ireland, and coast there in the bottom category.
The goal here was to attempt to extract some useful information about which places have been affected and by how much, by combining this with the sentiment analysis (see above), but my approach doesn't seem to have captured what I intended. Still, it does seem that across all tweets there is some difference in sentiment on a per-category basis:
(Above: A chart of the sentiment of each LDA topic, using the 20-topic model that was trained on all available tweets.)
It seems as if there's definitely something going on here. I speculate that positive words are more often than not used near other positive words, so categories are more likely to skew to one extreme or the other - though acquiring proof of this theory would likely require a significant time investment.
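For reference, the per-topic sentiment in the chart boils down to a simple aggregation - something like the sketch below, where the topic assignments and sentiment values are made-up stand-ins for the real model outputs.

```python
import pandas as pd

# Made-up stand-in data: one row per tweet, with its dominant LDA topic
# and the sentiment predicted by model #1 (1 = positive, 0 = negative).
tweets = pd.DataFrame({
    "topic":     [0, 0, 0, 1, 1, 2, 2, 2],
    "sentiment": [1, 0, 1, 1, 1, 0, 0, 1],
})

# Proportion of positive tweets in each topic - roughly what the chart above plots.
positive_by_topic = tweets.groupby("topic")["sentiment"].mean()
print(positive_by_topic)
```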
While LDA topic analysis was an interesting diversion, I'm not sure how useful it is in this context. Still, it's a useful thing to have in my growing natural language processing toolkit (how did this happen) - and perhaps future research problems will benefit from it more.
Moving forwards, I could imagine that instead of doing LDA topic analysis it might be beneficial to group tweets by place and perhaps run some sentiment analysis on that instead. This comes with its own set of problems of course (especially pinning down / inferring the location of the tweets in question), but none of them are insurmountable given strategies such as named entity recognition (which the Twitter Academic API does for you, believe it or not).
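As a quick sketch of the named entity recognition route (using spaCy here purely for illustration - the entity annotations from the Twitter Academic API would do a similar job, and the example tweet is made up):

```python
import spacy

# spaCy is used here purely to illustrate named entity recognition;
# the model name and example tweet are assumptions for the sketch.
nlp = spacy.load("en_core_web_sm")

tweet = "Flooding reported across Hebden Bridge and much of West Yorkshire tonight."
doc = nlp(tweet)

# Keep only geopolitical / location entities as candidate places.
places = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
print(places)
```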
Conclusion
While most of my time has been spent on preparing for the AAAI-22 conference, I have managed to do some investigating into the twitter data I've downloaded. Unfortunately, not all my methods have been a success (image sentiment analysis, LDA topic analysis), but these have served as useful exercises for both learning new techniques and understanding the dataset better.
Moving forwards, I'm going to implement a new CLIP-based model that I'm hoping will improve accuracy over my existing models - though I'm somewhat apprehensive that tweet text won't be descriptive enough for CLIP to produce a useful output.
With the AAAI-22 conference happening next week, I'll be sure to write up my experiences and post about the event here. I'm just hoping that I've done enough to make attending a conference like this worth it, and that what I have done is actually interesting to people :-)
Sources and further reading
- gensim
- AAAI-22 Doctoral Consortium (event is timezoned to UTC-8, registration is required)
- AAAI-22
- twitter-academic-downloader
- Twitter Academic Research track
- Deep Residual Learning for Image Recognition - the ResNet paper
- CLIP