Stardust | Starbeamrainbowlabs

PhD Update 20: Like a bad smell.....

Hi again! Another wild blog post appeared. PhD corrections, I have come to realise, have a habit of hanging around long past the time you want to have had them finished and done.

Before we get into all of that though, here's the customary list of posts:

See also Doing a 3-way dataset split in Tensorflow // PhD Aside 3, which I posted since the last one of these PhD update blog posts.

Things have not been easy over the last 9 months, but I am making it through. I can't promise when blog posts will come, but know that I have a lot of ideas and it's just a case of having the energy to write them. See also my fediverse account @sbrl@fediscience.org (rss feed) for smaller updates in between times.

Corrections

One of the things I did not realise when I started my PhD was just how much work you are expected to do outside of your main scholarshiped research period. There's writing the thesis, doing the viva, and, of course, the corrections afterwards.

While I'm not sure how much I can share about the corrections I've been given I can say that the process of completing them has been both long and annoying.

More importantly, it is now coming to a close!

Yep, that's right: I'm almost done with my corrections! I just need a laundry list of people to check them and say they are okay, and then I can finally get this PhD thing over with and move on to cooler researchy things that I'll talk about later in this blog post.

As with my viva, the corrections I received were mainly organisational and big picture stuff in nature, though I also had my fair share of experiment redos to complete (ewwww), which have been very time consuming.

Once everything is done, I do plan on making my thesis available for free here on my website if my institution will let me. I doubt it will get indexed in any scholarly search engines any time soon, but hopefully it will be useful to someone here.

Speaking of, I'd like to share a Cool Graph™ I created whilst doing my corrections:

A grid of graphs showing the stability of the metrics for my rainfall radar models over 7 runs. There are areas shaded on the graphs to show the standard deviation and min/max values for each epoch

(Above: A grid of graphs showing the stability of the metrics for my rainfall radar models over 7 runs)

This is, as the caption suggests, a random cross-validation (because anything else would be far too complicated to implement) run of my rainfall radar model I have mentioned before. This was one of the things I was asked to do in my corrections.

The code behind this is kinda cool, as it aggregates a metrics.tsv file from every experiment directory in in a given directory.

As I write this I realise that's kinda confusing, so let me show you what the directory of experiments actually looks like:

+ 202412_crossval-stbl7
    + 2024-12-12_deeplabv3+_rainfall_csgpu_ri_celdice_lr0.00001_us2_t0.1_bs32_crossval-stbl7-A
    + 2024-12-12_deeplabv3+_rainfall_csgpu_ri_celdice_lr0.00001_us2_t0.1_bs32_crossval-stbl7-B
    + 2024-12-12_deeplabv3+_rainfall_csgpu_ri_celdice_lr0.00001_us2_t0.1_bs32_crossval-stbl7-C
    + 2024-12-12_deeplabv3+_rainfall_csgpu_ri_celdice_lr0.00001_us2_t0.1_bs32_crossval-stbl7-D
    + 2024-12-12_deeplabv3+_rainfall_csgpu_ri_celdice_lr0.00001_us2_t0.1_bs32_crossval-stbl7-E
    + 2024-12-12_deeplabv3+_rainfall_csgpu_ri_celdice_lr0.00001_us2_t0.1_bs32_crossval-stbl7-F
    + 2024-12-12_deeplabv3+_rainfall_csgpu_ri_celdice_lr0.00001_us2_t0.1_bs32_crossval-stbl7-G

The experiment series directory (202412_crossval-stbl7) contains 7 different runs, matching the stbl7 part of the experiment series name crossval-stbl7.

The directory names there look pretty complicated, but it's actually just made of up all the parameters that I'm currently interested in for that experiment series. Each part is separated by an underscore _.

So, to break down 2024-12-12_deeplabv3+_rainfall_csgpu_ri_celdice_lr0.00001_us2_t0.1_bs32_crossval-stbl7-C:

2024-12-12: The date the (individual) experiment was run. These all ran in parallel on a HPC at my university, hence all the dates are the same even though it takes more than 24 hours to train the rainfall radar model.
deeplabv3+: The architectural backbone of the model in question - this time DeepLabV3+.
rainfall: Identifier code for the project, rainfall is the traditional informal and short name I gave to the rainfall radar model.
csgpu: The place it was trained on. In this case csgpu is a small HPC cluster in my department.
ri: This marks the start of the experiment hyperparams I'm interested in, in no particular order (but the order usually remains constant for a given project). ri stands for Remove Isolated.
celdice: The loss function. Cross-Entropy Loss + Dice loss.
lr0.00001: The Learning Rate, this time 0.00001
us2: UpScale 2 - a property of the model in which it upscales the input x2 and downscales it just before the output. This improves the fidelity of the output at the cost of a higher memory usage.
t0.1: Threshold of 0.1 - the delimiter between water and no water. At some point, I want to split into multiple bins.
bs32: Batch Size of 32.
crossval-stbl7-C: The experiment series code - see above.
- crossval-stbl: The main part of the experiment series code
- 7: The number of cross-validation runs
- C: The differentiator. In this case it's part C of the experiment series since they are all the same, but usually each model has something unique about it, e.g. regresstest-regress vs regresstest-class.

Hmmm, looking at this it might be a bit more complicated a system than I first expected, but it makes sense to me. I wonder if I've blogged about how I organise experiments already? If not, that should go on the todo list.

Anyway, this is the foundation of my entire organisational system for running experiments. I've developed quite an intricate system since I started running experiments in 2020, but fundamentally it is based on the principle of preserving as much information about any given experiment that I've run as possible, as I am sure to need it later.

Even if I don't think I'll need it!

In fact, especially if I don't think I'll need it, because I've been bitten enough times to know that it's not a case of if, it's most certainly a case of when.

Corrections? What corrections?

Not to get too distracted, while I don't think the University would like it very much if I shared my exact list of corrections, it boiled down to the following basic principles, in no particular order:

There wasn't a clear narrative carrying the problem forwards through the thesis
They wanted more experiments running to confirm the stability of the models trained - hence this post on a 3-way split since they wanted a 3-way split, and also hence the stability testing done to produce the graph above amongst others
They wanted a regression model training for the rainfall radar model and a comparative analysis against my existing classification-based approach

It doesn't sound like much, but it has been quite a lot of work to get to this point, especially since I have been doing more teaching than I expected starting in September last year. I'm glad now that I applied for a 6 month extension and for the help of the people around me (and 2 people in particular - not sure if I can mention your names, but you know who you are), otherwise I would have run out of time to complete my corrections long ago.

Future research

Now that my corrections are (hopefully) coming to and end and I'm starting to get a handle on the teaching I've been asked to do (wow, and that isn't even the half of it), I'm finally starting to get myself into a place in which I can FINALLY start to look forwards to some more research that is actually useful, as opposed to making seemingly endless corrections to my thesis (the social media chapter in particular I can do SO MUCH BETTER).

First of all are real improvements to my rainfall radar model. These improvements largely fall into a few categories:

Analysing and improving the model's ability to actually predict floods, and applying sample weighting (psst, secret second graph for those of you who are still reading!) to hopefully measuably improve my model's ability to make actually useful predictions
Swapping out the physics-based model the model is trained on because it's bad and I didn't prepare the data very well and it's all bad
Exanding the model's ability to predict multiple bins instead of just a binarised water/no-water situation

These are not necessarily in order, but I imagine I'll likely tackle them in something like this order.

On the social media side, I know that I can do so much better than my social media paper which somehow has 41 citations (just HOW??!). Binary sentiment analysis is cute and all, but at the intersection of AI, disaster situational awareness, and user interface (UI) design and user experience (UX) I believe that I can do much better in the organisation of unstructured data.

With the use of contemporary AI algorithms and UI/UX, the extraction and presentation of richer information should be possible.

Even though these research plans won't be part of my PhD, I will still continue blogging about it! Who knows, I might even start a new long-running blog post series to mark the beginning of a new era in my life.

And, of course, I'll continue to share Cool Graphs™!

BlueSky

BlueSky: As a last thing, I'm going to blog about it at some point but I'm now using bridgy fed to allow you to follow me on Bluesky! I'm @sbrl.fediscience.org.ap.brid.gy, and to interact with me you'll need to follow @ap.brid.gy.

BlueSky seems to be becoming very popular, especially my circles - but while it pretends to be decentralised, it isn't. For this reason and others, my primary social media will remain on the Fediverse to ensure and preserve the long-term viability of my account.

I encourage you to join the fediverse too - it's a nice and friendly place :D

Final thoughts

It has been a long road, but I am finally nearing the end of one book and the beginning of another. This is not the last post in this series - I have at least 1 more planned. When I have the energy, I want to talk about my experiences learning to teach (I'm doing a course called PCAP right now, as it was a stipulation of my contract) in what may be a longer blog post than I expect.

I'm looking forward to continuing my research journey and blogging along the way right here at my stardust blog (I think this is the first time I've mentioned my blog's name!).

I'll see you next time, in what might be one of the last blog posts in this series: PhD Update 21: Where the water meets the sky.

--Starbeamrainbowlabs

Happy new year 2025! o/

Heya there!

This is just a short post to say hi, and happy new year! I hope everyone had a great christmas / winter break.

As I battle a bit of burnout from doing rather too much teaching related tasks in a short space of time (hoping to do a detailed post on my teaching experiences soon), I have actually achieved a lot this past year despite this:

Submitted my thesis and had my viva
Completed designed and delivered an entire 20 credit module on Linux + sysadmin from scratch in the span of ~4 months
Presented at a conference - NLDL-2024 - for the first time - to a room of 250 people O.o
Much open source work spread across many repositories

Looking ahead

Looking ahead, a significant portion of my energy is going to be spent on getting my corrections done for my PhD. It is requiring the re-running of a bunch of experiments with a grab-bag of new features being added to the codebases, which if you have been following the commit history have been arriving slowly.

I'll hopefully be at a good enough point soon-ish to write another PhD update blog post about this.

Last year, I said that I hoped 2024 would be the year I finally finish my PhD. It was..... kinda sorta maybe okay-not-really. This time I really do want 2025 to be the year I finally finish this PhD....... I'm ready to just be done with the stupid thing now.

In 2025 I want to dedicate more time to blogging here. 2024 has not been a great year for this blog, so I want to try and change that this year. I have had lots of ideas for blog posts..... I just haven't had the energy to write them. Hopefully this will mean lots of cool new blog posts about things I've learnt and found!

If you're interested in keeping up to date with what I've been up to, I can recommend following me on the fediverse (@sbrl@fediscience.org). I post smaller stuff there that either isn't bug enough for a blog post, or I don't have the energy to blog about at the time.

Mastodon (the fediverse software used by the instance I'm on) has a built-in RSS feed, if that's your jam: https://fediscience.org/@sbrl.rss

I want to do some incremental improvements to my website here soon - including tidying up and finishing the list of researchy things I've been doing on my homepage. Nothing too ground breaking (though I have bigger plans for a better backend to this blog, but I need lotsa time to impleent it)

Final thoughts

This last year has been rather stressful and emotional in many different ways - including some I have not mentioned here (like people I know very well leaving the University, including my primary supervisor, though I am maintaining regular and normal contact). I hope that 2025 is less stressful than 2024.

If there's something you'd like me to blog about that I've been doing that I haven't blogged about yet, I've probably forgotten about it. Please get in touch by leaving a comment below!

A cute wooden bauble in the dark on a christmas tree from a few years ago. It has a snowman against the night sky with a postbox with a bird sitting on it and some trees and stuff. The multicoloured christmas lights are turned on and shining brightly. I like this bauble very much.

(Source: Taken by me. See alt text for detailed description.)

Thanks for sticking with me for these last 10 years. There is ALWAYS hope. Especially when you can't see it.

--Starbeamrainbowlabs, your friendly but very tired blogger

Ducks

Heya!

I like ducks. Have some ducks:

Some duck wallpaper

(Source: Unknown. If you're the creator and can prove it, comment below and I'll attribute properly)

....teaching is not easy, and I don't like preparing content week-by-week (i.e. preparing content to teach the next week) very much at all.

I recognise this is the longest break in blog posts there has been since I began this on 29th June 2014 (wow, it has been 10 years here already?!). Both energy and time are extraordinarily low at the moment (I have not had even a moment to work on my PhD corrections in over a month at this point).

However, there is hope that this is not a permanent state of affairs (and if I have anything to say about it, it won't be).

Hopefully in a few weeks things should improve to the point that I have energy to work on my PhD and post here again.

I've got several cool ideas for posts that I want to write:

Most obviously, I want to write a blog post about my experiences teaching
I've found a really neat solution to a 3-way split of a dataset in Tensorflow

And others that are a thing in the background.

Just a short check in to let everyone know that while I am very exhausted I am also very determined to keep this blog going as a permanent thing.

It's not much of a 10 year celebration, but if you've been reading here for a while I thank you SO MUCH for continuing to stick around, even if you don't comment.

There is always hope.

--Starbeamrainbowlabs

Teaching this September

Hello!

Believe it or not, I'm going to be teachificatinating a thing at University this semester, which starts at the end of this month and lasts until around December-ish time (yeah, I'm surprised too).

It's called Secure Digital Infrastructure, and I'll be teaching Linux and system administration skills, so that includes the following sorta-areas:

Bash (Learn your terminal (or command line))
Linux file and account permissions (no posts I could find yet - plenty of resources online tho)
Practical networking and web server configuration (see also The NSD Authoritative DNS Server: What, why, and how and part 2, An epic journey awaits: The hows and whys of DNS (and why DNS privacy is important), Securing a Linux Server Part 1: Firewall, and others)
Docker (see also Cluster, Part 10: Dockerisification | Writing Dockerfiles)
Clustering - at least the theory (see also Cluster Series List)
Reverse proxies (see also related posts: Securing your port-forwarded reverse proxy and the unreasonably popular How to set up a WebDav share with Nginx)

(related posts aren't necessarily the exact content I'm going to cover, but are related)

To this end, it is quite stressful and is taking significantly more energy than I expected to prepare for this.

I definitely want to talk about it here, but that will likely happen after the fact - probably some time in January or February.

Please be patient with me as I navigate this new and unexpected experience :-)

--Starbeamrainbowlabs

PhD Update 19: The Reckoning

The inevitability of all PhDs. At first it seems distant and ephemeral, but it is also the inescapable and unavoidable destination for all on the epic journey of the PhD.

Sit down and listen as I tell my own tale of the event I speak of.

I am, of course, talking about the PhD Viva. It differs from country to country, but here in the UK the viva is an "exam" that happens a few months after you have submitted your thesis (PhD Update 18: The end and the beginning). Unlike across the pond in the US, in the UK vivas are a much more private affair, with only you, the chair, and your internal and external examiners normally attending.

In my case, that was 2 externals (as I am also staff, ref Achievement get: Experimental Officer Position!), an internal, and of course the chair. I won't name them as I'm unsure of policy there, but they were experts in the field and very kind people.

I write this a few weeks removed from the actual event (see also my post on Fediscience at the time), and I thought that my viva itself deserved a special entry in this series dedicated to it.

My purpose in this post is to talk about my experience as honestly and candidly as I can, and offer some helpful advice from someone who has now been through the process.

The Structure

The viva itself took about 4 hours. It's actually a pretty complicated affair: all your examiners (both internal and external) have to read your thesis and come up with a list of questions (hidden from you of course). Then, on the day but before you enter the room they have to debate who is going to ask what to avoid duplication.

In practice this usually means that the examiners will meet in the morning to discuss, before having lunch and then convening for the actual viva bit where they ask the questions. In my case, I entered the room to meet the examiners and say hi, before leaving again for them to sort out who was going to ask what.

Then, the main part of the viva simply consists of you answering all the questions that they have for you. Once all the questions are answered, then the viva is done.

You are usually allowed a copy of your thesis in one form or another to assist you while answering their questions. The exact form this will take varies from institution to institution, so I recommended always checking this with someone in charge (e.g. the Doctoral College in my case) well in advance - you don't want to be hit with paperwork and confusion minutes before your viva is scheduled to start!

After the questions, you leave the room again for the examiners to deliberate over what the outcome will be, before calling you back into the room to give you the news.

Once they have done this: the whole thing is over and you can go sleep (trust me, you will not want to do anything else).

My experience

As I alluded to in the aforementioned post on fediscience (a node in the fediverse), I found the viva a significantly intense experience - and one I'm not keen on repeating any time soon. I strongly recommend having someone nearby as emotional support for after the viva and during those periods when you have to step out of the room. I am not ashamed to admit that there were tears after the exam had ended.

More of the questions than I expected focused on the 'big picture' kinda stuff, like how my research questions linked in with the rest of the thesis, and how the thesis flowed. I was prepared for technical questions -- and there were some technical questions -- but the 'fluffy stuff' kinda questions caught me a little off guard. For example, there were some questions about my introduction and how while I introduced the subject matter well, the jump into the technical stuff with the research questions was quite jarring, with concepts mentioned that weren't introduced beforehand.

To this end, I can recommend looking over the 'big picture' stuff beforehand so that you are prepared for questions that quiz you on your motivations for doing your research in the first place and question different aspects research questions.

It can also feel quite demoralising, being questioned for hours on what has been your entire life for multiple years. It can feel like all you have done is pointless, and you need to start over. While it is sure that you could improve upon your methods if you started from scratch, remember that you have worked hard to get to this point! You have discovered things that were not known to the world before your research began, and that is a significant accomplishment!

Try not to think too hard about the corrections you will need to make once the viva is done. Institutions differ, but in my case it is the job of the chair to compile the list of corrections and then send them to you (in one form or another). The list of corrections - even if they are explained to you verbally when you go back in to receive the result - may surprise you.

Outcome

As I am sure that most of you reading this are wondering, what was my result?! Before I tell you, I will preface the answer to your burning question with a list of the possible outcomes:

Pass with no corrections (extremely rare)
Pass with X months corrections (common, where X is a multiple of 3)
Fail (also extremely rare)

In my case, I passed with corrections!

It is complicated by the fact that while the panel decided that I had 6 months of corrections to do, I am not able to spend 100% of my time doing them. To this end, it is currently undefined how long I will have to do them - paperwork is still being sorted out.

The reasons for this are many, but chief among them is that I will be doing some teaching in September - more to come on my experience doing that in a separate post (series?) just as soon as I have clarified what I can talk about and what I can't.

I have yet to recieve a list of the corrections themselves (although I have not checked my email recently as I'm on holiday now as I write this), but it is likely that the corrections will include re-running some experiments - a process I have begun already.

Looking ahead

So here we are. I have passed my viva with corrections! This is not the end of this series - I will keep everyone updated in future posts as I work through the corrections.

I also intend to write a post or two about my experience learning to teach - a (side)quest that I am currently persuing in my capacity as Experimental Officer (research is still my focus - don't worry!)

Hopefully this post has provided some helpful insight into the process of the PhD viva - and my experience in mine.

The viva is not a destination: only a waypoint on a longer journey.

If you have any questions, I am happy to anwser them in the comments, and chat on the fediverse and via other related channels.

PhD Update 18: The end and the beginning

Hello! It has been a while. Things have been most certainly happening, and I'm sorry I haven't had the energy to update my blog here as often as I'd like. Most notably, I submitted my thesis last week (gasp!)! This does not mean the end of this series though - see below.

Before we continue, here's our traditional list of past posts:

Since last time, that detecting persuasive tactic challenge has ended too, and we have a paper going through at the moment: BDA at SemEval-2024 Task 4: Detection of Persuasion in Memes Across Languages with Ensemble Learning and External Knowledge.

Theeeeeeeeeeeeesis

Hi! A wild thesis appeared! Final counts are 35,417 words, 443 separate sources, 167 pages, and 50 pages of bibliography - making that 217 pages in total. No wonder it took so long to write! I submitted at 2:35pm BST on Friday 10th May 2024.

I. can. finally. rest.

It has been such a long process, and taken a lot of energy to complete it, especially since large amounts of formal academic writing isn't usually my thing. I would like to extend a heartfelt thanks especially to my supervisor for being there from beginning to end and beyond to support me through this endeavour - and everyone else who has helped out in one way or another (you know who you are).

Next step is the viva, which will be some time in July. I know who my examiners are going to be, but I'm unsure whether it would be wise to say here. Between now and then, I want to ~~stalk~~ investigate my examiners' research histories, which should give me an insight into their perspective on my research.

Once the viva is done, I expect to have a bunch of corrections to do. Once those are completed, I will to the best of my ability be releasing my thesis for all to read for free. I still need to talk to people to figure out how to do that, but rest assured that if you can't get enough of my research via the papers I've written for some reason, then my thesis will not be far behind.

Coming to the end of my PhD and submitting my thesis has been surprisingly emotionally demanding, so I thank everyone who is still here for sticking around and being patient as I navigate these unfamiliar events.

Researchy things

While my PhD may be coming to a close (I still can't believe this is happening), I have confirmed that I will have dedicated time for research-related activities. Yay!

This means, of course, that as one ending draws near, a new beginning is also starting. Today's task after writing this post is to readificate around my chosen idea to figure out where there's a gap in existing research for me to make a meaningful contribution. In a very real way, it's almost like I am searching for directions as I did in my very first post in this series.

My idea is connected to the social media research that I did previously on multimodal natural language processing of flooding tweets and images with respect to sentiment analysis (it sounded better in my head).

Specifically, I think I can do better than just sentiment analysis. Imagine an image of a street that's partially underwater. Is there a rescue team on a boat rescuing someone? What about the person on the roof waving for help? Perhaps it's a bridge that's about to be swept away, or a tree that has fallen down? Can we both identify these things in images and map them to physical locations?

Existing approaches to e.g. detect where the water is in the image are prone to misidentifying water that is infact where it should be for once, such as in rivers and lakes. To this end, I propose looking for the people and things in the water rather than the water itself and go for a people-centred approach to flood information management.

I imagine that while I'll probably use data from social media I already have (getting a hold of new data from social media is very difficult at the moment) - filtered for memes and misinformation this time - if you know of any relevant sources of data or datasets, I'm absolutely interested and please get in touch. It would be helpful but not required if it's related to a specific natural disaster event (I'm currently looking at floods, branching out to others is absolutely possible and on the cards but I will need to submit a new ethics form for that before touching any data).

Another challenge I anticipate is that of unlabelled data. It is often the case that large volumes of data are generated during an unfolding natural disaster, and processing it all can be a challenge. To this end, somehow I want my approach here to make sense of unlabelled images. Of course, generalist foundational models like CLIP are great, but lack the ability to be specific and accurate enough with natural disaster images.

I also intend that this idea would be applicable to images from a range of sources, and not just with respect to social media. I don't know what those sources could be just yet, but if you have some ideas, please let me know.

Finally, I am particularly interested if you or someone you know are in any way involved in natural disaster management. What kinds of challenges do you face? Would this be in any way useful? Please do get in touch either in the comments below or sending me an email (my email address is on the homepage of this website).

Persuasive tactics challenge

The research group I'm part of were successful in completing the SemEval Task 4: Multilingual Detection of Persuasion Techniques in Memes! I implemented the 'late fusion engine', which is a fancy name for an algorithm that uses in basic probability to combine categorical predictions from multiple different models depending on how accurate each model was on a per-category basis.

I'm unsure of the status of the paper, but I think it's been through peer-review so you can find that here: BDA at SemEval-2024 Task 4: Detection of Persuasion in Memes Across Languages with Ensemble Learning and External Knowledge.

I wasn't the lead on that challenge, but I believe the lead person (a friend of mine, if you are reading this and want me to link to somewhere here get in touch) on that project will be going to mexico to present it.

Teaching

I'm still not sure what I can say and what I can't, but starting in september I have been asked to teach a module on basic system administration skills. It's a rather daunting prospect, but I have a bunch of people much more experienced than me to guide me through the process. At the moment the plan is for 21 lecture-ish things, 9 labs, and the assessment stuff, so I'm rather nervous about preparing all of this content.

Of course, as a disclaimer nothing written in this section should be taken as absolute. (Hopefully) more information at some point, though unfortunately I doubt that I would be allowed to share the content created given it's University course material.

As always though, if there's a specific topic that lies anywhere within my expertise that you'd like explaining, I'm happy to write a blog post about it (in my own time, of course).

Conclusion

We've taken a little look at what is been going on since I last posted, and while this post has been rather talky (will try for some kewl graphics next time!), nonetheless I hope this has been an interesting read. I've submitted my thesis, started initial readificating for my next research project - which we've explored the ideas here, helped out a group research challenge project thingy, and been invited to do some teaching!

Hopefully the next post in this series will come out on time - long-term the plan is to absolutely continue blogging about the research I'm doing.

Until next time, the journey continues!

(Oh yeah! and finally finally, to the person who asked a question by email about this old post (I think?), I'm sorry for the delay and I'll try to get back to you soon.)

Defining AI: Word embeddings

Hey there! It's been a while. After writing my thesis for the better part of a year, I've been getting a bit burnt out on writing - so unfortunately I had to take a break from writing for this blog. My thesis is almost complete though - more on this in the next post in the PhD update blog post series. Other higher-effort posts are coming (including the belated 2nd post on NLDL-2024), but in the meantime I thought I'd start a series on defining various AI-related concepts. Each post is intended to be relatively short in length to make them easier to write.

Normal scheduling will resume soon :-)

Banner showing the text 'Defining AI' on top on transclucent white vertical stripes against a voronoi diagram in white against a pink/purple background. 3 progressively larger circles are present on the right-hand side.

As you can tell by the title of this blog post, the topic for today is word embeddings.

AI models operate fundamentally on numerical values and mathematics - so naturally to process text one has to encode said text into a numerical format before it can be shoved through any kind of model.

This process of converting text to a numerically-encoded value is called word embedding!

As you might expect, there are many ways of doing this. Often, this involves looking at what other words a given word often appears next to. This could be for example framed as a task in which a model has to predict a word given the words immediately before and after it in a sentence, and then take the output of the last layer before the output layer as the embedding (word2vec).

Other models use matrix math to calculate this instead, producing a dictionary file as an output (GloVe | paper). Still others use large models that are trained to predict randomly masked words, and process entire sentences at once (BERT and friends) - though these are computationally expensive since every bit of text you want to embed has to get pushed through the model.

Then there's contrastive learning approaches. More on contrastive learning later in the series if anyone's interested, but essentially it learns by comparing pairs of things. This can lead to a higher-level representation of the input text, which can increase performance in some circumstances, and other fascinating side effects that I won't go into in this post. Chief among these is CLIP (blog post).

The idea here is that semantically similar words wind up having similar sorts of numbers in their numerical representation (we call this a vector). This is best illustrated with a diagram:

Words embedded with GloVe and displayed in a heatmap

I used GloVe to embed some words with GloVe (really easy to use since it's literally just a dictionary), and then used cosine distance to compute the similarity between the different words. Once done, I plotted this in a heatmap.

As you can see, rain and water are quite similar (1 = identical; 0 = completely different), but rain and unrelated are not really alike at all.

That's about the long and short of word embeddings. As always with these things, you can go into an enormous amount of detail, but I have to cut it off somewhere.

Are there any AI-related concepts or questions you would like answering? Leave a comment below and I'll write another post in this series to answer your question.

Website update: Share2Fediverse, and you can do it too!

Heya! Got another short port for you here. You night notice that on all posts now there's a new share button (those buttons that take you to difference places with a link to this site to share it elsewhere) that looks like this:

The 5-pointed rainbow fediverse logo

If you haven't seen it before, this is the logo for the Fediverse, a decentralised network of servers and software that all interoperate (find out more here: https://fedi.tips/).

Since creating my Mastodon account, I've wanted some way to allow everyone here to share my posts on the Fediverse if they feel that way inclined. Unlike other centralised social media platforms like Reddit etc though, the Fediverse doesn't have a 'central' server that you can link to.

To this end, you need a landing page to act as a middleman. There are a few options out there already (e.g. share2fedi), but I wanted something specific and static, so I built my own solution. It looks like this:

(Above: A screenshot of Share2Fediverse.)

It's basically a bit of HTML + CSS for styling, a splash of Javascript to make the interface function and remember the instance + software you select for next time via localStorage.

Check it out at this demo link:

https://starbeamrainbowlabs.com/share2fediverse/#text=The%20fediverse%20is%20cool!%20%E2%9C%A8

Currently, it supports sharing to Mastodon, GNU Social, and Diaspora. As it turns out, finding the share url (e.g. for Mastodon on fediscience.org it's https://fediscience.org/share?text=some%20text%20here) is more difficult than it sounds, as I haven't found it to be well advertised. I'd love to add e.g. Pixelfed, Kbin, GoToSocial, Pleroma, and more.... but I need the share URL! If you know the share URL for any piece of Fediverse software, please do leave a comment below.

If you're interested in the source code, you can find it here:

https://github.com/sbrl/Share2Fediverse/

...if you'd really like to help out, you could even open a pull request! The file you want to edit is src/lib/software_db.mjs - though if you leave a comment here or open an issue I'll pick it up and add any requests.

See you on the Fediverse! o/

500 posts - thank you!

Looking up into a blossom tree against a blue sky.

500 posts is a lot. When I started writing back in 2014, I never imagined that I was make it to this milestone. I've thought for a while about what I wanted to do to celebrate, but couldn't think of anything specific - so I wanted to thank everyone who has supported me so far in my journey through University - first in my undergraduate course, then in my MSc course, and now in my PhD.

It was Rob Miles that first encouraged me to start a blog in the first year of my undergraduate course. A few weeks later, and I had gone from a coming soon page to building starbeamrainbowlabs.com, followed closely by this blog which I put together piece by piece.

The backend is actually written in PHP - though it is on my (seemingly endless :P) todo list to rewrite it as it's not particularly well written. I've made a start on this already by refactoring the commenting system (and adding more statistics), but I haven't touched the blog itself and the main website (particularly the CSS) much yet.

In total, over the last 499 posts (I'm still writing this post as of the time of typing) I've written 347,256 words in total, counted by doing cat *.md | tr -d -- '-{}\[\]();=><' | wc -w on all the markdown sources of the posts I've written. This is a mind boggling number! I suspect it's somewhat inflated by the code I include in my blog posts though.

On these, I've received 192 (probably) genuine top-level comments that aren't spam (not counting replies, which are difficult to count with jq, as the replies parameter isn't always present in my backend JSON files I store comments in). Each and every one of these has been helpful, and given me motivation to continue writing here - especially more recently on my PhD Update series.

I might have missed some spam comments, so do get in touch if you spot one.

From my first post way back on 29th June 2014 to this post in the present spans exactly 7 years, 10 months, 13 days, and 8 hours (or 2874 days and 8 hours), averaging 5 days 17 hours between each post overall.

I would like to thank everyone who has supported me on this incredible journey - especially my personal supervisor and also my PhD supervisor - both of whom have continuously assisted me with issues both large and small at all times of the day and year. The entire Department of Computer Science at the University of Hull - members both past and present - have all been very kind and helpful, and I'm deeply grateful to have had such a welcoming place to be.

Finally, thank you for reading. While I don't write posts on my blog here expecting that anyone will read them, it's amazing to see and hear about people finding them helpful :D

I can't say where I'm headed next after my PhD (the end of which is still some time away), but I can say that I'm committed to posting on this blog - so it won't be going anywhere any time soon :P

If there's a specific topic you'd like me to cover (and I haven't already done so), please do leave a comment below.

A ladybird in a hawthorn bush.

Website Update: Tools section

A while ago I noticed that the tools section of my website was horribly outdated, so recently I decided to do something about it. It was still largely displaying tools from when I still used Windows as my primary operating system, which was a long time ago now!

The new revision changes it to display icons only instead of icons and screenshots, as screenshots aren't always helpful for some of the tools I now use - and it also makes it easier to keep the section updated in the future.

I also switched to use a tab separated values file (TSV file) instead of a JSON file for the backend data file that the tools list is generated from, as a TSV file is much more convenient to hand edit than a JSON file (JSON is awesome for lots of things and I use it all the time - it's just not as useful here). If you're interested, you can view the source TSV file here: tools.tsv

I'm still filling out the list (each item in the list also necessitates an update to my personal logo collection), but it's already a huge improvement from the old tools list. Things like GitHub's Atom IDE, Ubuntu, Mozilla Firefox, and KeePass2 are currently my daily drivers - so it's great to have them listed there.

Check out the new tools section here: https://starbeamrainbowlabs.com/#tools

Stardust Blog

Tag Cloud

PhD Update 20: Like a bad smell.....

Corrections

Corrections? What corrections?

Future research

BlueSky

Final thoughts

Happy new year 2025! o/

Looking ahead

Final thoughts

Ducks

Teaching this September

PhD Update 19: The Reckoning

The Structure

My experience

Outcome

Looking ahead

PhD Update 18: The end and the beginning

Theeeeeeeeeeeeesis

Researchy things

Persuasive tactics challenge

Teaching

Conclusion

Defining AI: Word embeddings

Website update: Share2Fediverse, and you can do it too!

500 posts - thank you!

Website Update: Tools section

Stardust
Blog