The plan to caption and index images
Something that has been on my mind for a while is the photos that I take. At last count, my NAS holds 8564 pictures that I have taken since I first got a phone to take them with, and many more belonging to other family members.
I have blogged before about a script I've written that automatically processes photographs and files them by year and month. It fixes the date taken, sets the thumbnail for rapid preview loading, automatically rotates them to be the right way up, losslessly optimises them, and more.
The one thing it can't do though is to help me locate a specific photo I'm after, so given my work with AI recently I have come up with a plan to do something about this, and I want to blog about it here.
By captioning the images with an AI, I plan to index the captions (and other image metadata) and have a web interface in the form of a search engine. In this blog post, I'm going to outline the AI I intend to use, and the architecture of the image search engine I have already made a start on implementing.
AI for image captioning
The core AI to do image captioning will be somewhat based on work I've done for my PhD. The first order of business was finding a dataset to train on, and I stumbled across Microsoft's Common Objects in Context dataset. The next and more interesting part was to devise a model architecture that translates an image into text.
When translating 1 thing (or state space) into another in AI, it is generally done with an encoder-decoder architecture. In my case here, that's an encoder for the image - to translate it into an embedded feature space - and a decoder to turn that embedded feature space into text.
There are many options for both - especially for encoding images, which I'll look at first. While doing my PhD, I've come across many different image encoders, which I'd roughly sort into 2 main categories: transformer-based and CNN-based.
Since the transformer model was invented, transformers have been widely considered to be the best option. Swin Transformers adapt this groundbreaking design for images (transformers originally handled text) and, from what I can tell, do so better than the earlier Vision Transformer architecture.
On the other side, a number of encoders were invented before transformers were a thing - the most famous of which was ResNet (I think I have the right paper), which was basically just a bunch of CNN layers stacked on top of one another with a few extra bits like normalisation and skip connections.
Recently though, a new CNN-based architecture has appeared that draws inspiration from the strong points of transformers: it's called ConvNeXt. Based on the numbers in the paper, it even beats the Swin Transformer model mentioned earlier. Best of all, it's much simpler in design, which makes it relatively easy to implement. It is this model architecture I will be using.
For the text, the model architecture is straightforward: I'll be using a transformer (of course - I even implemented it myself from scratch!). The trouble is representation, particularly the representation of the image caption we want the model to predict.
There are many approaches to this problem, but the one I'm going to try first is a word-based solution using one-hot encoding. There are about 27K unique words in the dataset, so I've assigned each one a unique number in a dictionary file. Then, I can turn this:
[ "a", "cat", "sat", "on", "a", "mat" ]
...into this:
[ 0, 1, 2, 3, 0, 4 ]
...then, the model would predict something like this:
[ [ 1, 0, 0, 0, 0, 0, ... ], [ 0, 1, 0, 0, 0, 0, ... ], [ 0, 0, 1, 0, 0, 0, ... ], [ 0, 0, 0, 1, 0, 0, ... ], [ 1, 0, 0, 0, 0, 0, ... ], [ 0, 0, 0, 0, 1, 0, ... ] ]
...where each sub-array is a word.
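As a minimal sketch of the two steps above, here is how the word-to-number mapping and the one-hot expansion could look in plain Python. The function names (`build_vocab`, `encode`, `one_hot`) are made up for illustration and are not taken from the real implementation, and the toy dictionary here only has 5 words rather than 27K.

```python
def build_vocab(captions):
    """Assign each unique word an integer id, in order of first appearance."""
    vocab = {}
    for caption in captions:
        for word in caption:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(caption, vocab):
    """Turn a list of words into a list of integer ids."""
    return [vocab[word] for word in caption]


def one_hot(ids, vocab_size):
    """Turn each id into a vector of zeros with a single 1 at that id."""
    return [[1 if i == word_id else 0 for i in range(vocab_size)] for word_id in ids]


caption = ["a", "cat", "sat", "on", "a", "mat"]
vocab = build_vocab([caption])          # toy dictionary built from one caption
ids = encode(caption, vocab)
vectors = one_hot(ids, len(vocab))
print(ids)
print(vectors)
```

Each inner list in `vectors` is as long as the dictionary, which is where the memory cost of this representation comes from.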
This will, as you might suspect, use a lot of memory - especially with 27K words in the dictionary. By my calculations, with a batch size of 64 and a maximum caption length of 25, each output prediction tensor will use a whopping 172.8 MB of memory as float32 (64 × 25 × 27,000 values at 4 bytes each), or 86.4 MB as float16 (more on memory usage later).
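The arithmetic behind that estimate can be checked with a few lines of Python. This sketch assumes "about 27K" means exactly 27,000 words, and the function name `prediction_tensor_bytes` is just an illustrative label. It prints the size in both MB (millions of bytes) and MiB (multiples of 2^20 bytes), since the two units differ by about 5%.

```python
# Bytes needed to store one value of each data type.
BYTES_PER_VALUE = {"float32": 4, "float16": 2}


def prediction_tensor_bytes(batch_size, caption_length, vocab_size, dtype):
    """Size in bytes of a [batch, caption_length, vocab_size] tensor."""
    return batch_size * caption_length * vocab_size * BYTES_PER_VALUE[dtype]


for dtype in ("float32", "float16"):
    # 27_000 is an assumption standing in for "about 27K" unique words.
    size = prediction_tensor_bytes(64, 25, 27_000, dtype)
    print(f"{dtype}: {size / 1e6:.1f} MB ({size / 2**20:.1f} MiB)")
```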
I'm considering a variety of techniques to combat this if it becomes an issue. For example, reducing the dictionary size by discarding infrequently used words.
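One way the dictionary-shrinking idea might look is sketched below: words seen fewer than `min_count` times are dropped and replaced by a shared placeholder when encoding. The `<unk>` token, the threshold, and the function names are all assumptions for the sake of the example, not decisions I've settled on.

```python
from collections import Counter

UNKNOWN = "<unk>"  # hypothetical placeholder for words dropped from the dictionary


def build_pruned_vocab(captions, min_count):
    """Keep only words that appear at least min_count times across all captions."""
    counts = Counter(word for caption in captions for word in caption)
    vocab = {UNKNOWN: 0}
    for word, count in counts.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(caption, vocab):
    """Map words to ids, falling back to the placeholder id for dropped words."""
    return [vocab.get(word, vocab[UNKNOWN]) for word in caption]


captions = [
    ["a", "cat", "sat", "on", "a", "mat"],
    ["a", "dog", "sat", "on", "a", "rug"],
]
vocab = build_pruned_vocab(captions, min_count=2)
print(vocab)
print(encode(captions[0], vocab))
```

The trade-off is that the model can never output a dropped word, so the threshold would need tuning against caption quality.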
Another option would be to have the model predict GloVe vectors as an output and then compare the output to the GloVe dictionary to tell which one to pick. This would come with its own set of problems however, like lots of calculations to compare each word to every word in the dictionary.
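To make the cost of that comparison concrete, here is a rough sketch of the lookup step using cosine similarity. The 3-dimensional vectors are invented for illustration (real GloVe vectors are much longer and would be loaded from the published files), and cosine similarity is just one reasonable choice of distance measure.

```python
import math

# Toy vectors, made up for this example. They are NOT real GloVe values.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "mat": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_word(predicted, embeddings):
    """Return the dictionary word whose vector is most similar to the prediction."""
    return max(embeddings, key=lambda word: cosine_similarity(predicted, embeddings[word]))


print(nearest_word([0.9, 0.12, 0.02], EMBEDDINGS))
```

Note that `nearest_word` scans every entry in the dictionary for every predicted word, which is exactly the "lots of calculations" problem at 27K words.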
My final thought was that I could maybe predict individual characters instead of full words. There would be more items in the sequence predicted, but each character would only have up to 255 choices (probably more like 36-ish), potentially saving memory.
I have already implemented this AI - I just need to debug and train it now. To summarise, here's a diagram:
The last problem with the AI though is memory usage. I plan on eventually running the AI on a Raspberry Pi, so much tuning will be required to reduce memory usage and latency as much as I can. In particular, I'll be trying out quantising my model and writing the persistent daemon to use TensorFlow Lite to reduce memory usage. Models train using the float32 data type, which uses 32 bits per value, but quantising the model after training to use float16 (16 bits / value) or even uint8 (8 bits / value) would significantly reduce memory usage.
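The general idea behind uint8 quantisation can be shown in plain Python: squeeze a range of floats onto the integers 0 to 255 using a scale and an offset, then reverse the mapping when the values are needed. This is only a sketch of the concept, not a description of what TensorFlow Lite does internally, and the function names are my own.

```python
def quantise_uint8(values):
    """Map floats onto the integers 0..255 using a scale and an offset."""
    low, high = min(values), max(values)
    scale = (high - low) / 255 or 1.0  # fall back to 1.0 if all values are equal
    quantised = [round((value - low) / scale) for value in values]
    return quantised, scale, low


def dequantise(quantised, scale, low):
    """Approximately recover the original floats."""
    return [q * scale + low for q in quantised]


weights = [-1.0, -0.5, 0.1, 0.25, 1.0]  # stand-in for a layer's float32 weights
quantised, scale, low = quantise_uint8(weights)
restored = dequantise(quantised, scale, low)
print(quantised)
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

Each value now needs 8 bits instead of 32, at the cost of a rounding error of at most half of `scale` per value.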
Search engine and indexing
The second part of this is the search engine. The idea here is to index all my photos ahead of time, and then have a web interface I can use to search and filter them. The architecture I plan on using to achieve this is rather complicated, and best explained with a diagram:
The search itself will be handled by meilisearch, running as a separate daemon, for 2 reasons:
- While I'd love to implement my own, meilisearch is an open source project whose developers have put more hours into making it cool than I would ever be able to
- Being a separate daemon means I can schedule it on my cluster as a separate task, which potentially might end up on a different machine
With this in mind, the search engine has 2 key parts to it: the crawler / indexer, and the HTTP server that serves the web interface. The web interface will talk to meilisearch to perform searches (not directly; requests will be proxied and transformed).
The crawler will periodically scan the disk for new, updated, and deleted files, and pass them on to the indexer queue. The indexer will do 4 things:
- Caption the image, by talking to a persistent Python child process via Inter Process Communication (IPC) - captions will be written as EXIF data to images
- Thumbnail images and store them in a cache (perhaps some kinda key-value store, as lots of little files on disk would be a disaster for disk space)
- Extract EXIF (and other) metadata
- Finally, push the metadata to meilisearch for indexing
Tasks 2 and 3 can be done in parallel, but the others will need to be done serially - though multiple images can of course be processed concurrently. I anticipate much asynchronous code here, which I'm rather looking forward to finishing writing :D
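The ordering described above could be sketched with Python's asyncio like this. Every step function here is a hypothetical stand-in that just returns dummy data: the real versions would talk to the captioning process, an image library, and meilisearch. The concurrency limit of 4 is likewise an arbitrary assumption.

```python
import asyncio


async def caption_image(path):
    await asyncio.sleep(0)  # placeholder for IPC with the captioning process
    return f"caption for {path}"


async def make_thumbnail(path):
    await asyncio.sleep(0)  # placeholder for resizing and caching
    return f"thumb:{path}"


async def extract_metadata(path):
    await asyncio.sleep(0)  # placeholder for reading EXIF data
    return {"path": path}


async def push_to_search(document):
    await asyncio.sleep(0)  # placeholder for a request to meilisearch
    return document


async def index_image(path):
    caption = await caption_image(path)           # task 1 runs first
    thumbnail, metadata = await asyncio.gather(   # tasks 2 and 3 run in parallel
        make_thumbnail(path), extract_metadata(path)
    )
    document = {**metadata, "caption": caption, "thumbnail": thumbnail}
    return await push_to_search(document)         # task 4 runs last


async def index_all(paths, max_concurrent=4):
    limit = asyncio.Semaphore(max_concurrent)     # cap images in flight at once

    async def worker(path):
        async with limit:
            return await index_image(path)

    return await asyncio.gather(*(worker(path) for path in paths))


results = asyncio.run(index_all(["a.jpg", "b.jpg"]))
print(results)
```

The semaphore matters on a small machine: without it, a large backlog of photos would all be opened for processing at the same time.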
I already have a good start on the foundation of the search engine here. Once I've implemented enough that it's functional, I'll open source everything.
To finish this post, I have a mockup screenshot of what the main search page might look like:
Obviously the images are all placeholders for now (append ?help to this URL to see the help page) and I don't yet have a name for it (suggestions in the comments are most welcome!), but the rough idea is there.