Demystifying Inverted Indexes
(The magnifying glass in the above banner came from openclipart)
After writing the post that will be released after this one, I realised that I made a critical assumption that everyone knew what an inverted index was. Upon looking for an appropriate tutorial online, I couldn't find one that was close enough to what I did in Pepperminty Wiki, so I decided to write my own.
First, some context. What's Pepperminty Wiki? Well, it's a complete wiki engine in a single file of PHP. The source files are obviously not a single file, but it builds into a single file - making it easy to drop into any PHP-enabled web server.
One of its features is a full-text search engine. A personal wiki of mine has ~75k words spread across ~550 pages, and it manages to search them all in just ~450ms! It does this with the aid of an inverted index - which I'll be explaining in this post.
First though, we need some data to index. How about the descriptions of some video games?
Kerbal Space Program
In KSP, you must build a space-worthy craft, capable of flying its crew out into space, without killing them. At your disposal is a collection of parts, which must be assembled to create a functional ship. Each part has its own function and will affect the way a ship flies (or doesn't). So strap yourself in, and get ready to try some Rocket Science!
Cross Code
Meet Lea as she logs into an MMO of the distant future. Follow her steps as she discovers a vast world, meets other players and overcomes all the challenges of the game.
Fort Meow
Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow.
Factory Balls
Factory balls is the brand new bonte game I already announced yesterday. Factory balls takes part in the game design competition over at jayisgames. The goal of the design competition was to create a 'ball physics'-themed game. I hope you enjoy it!
Very cool, this should provide us with plenty of data to experiment with. Firstly, let's consider indexing. Take the Factory Balls description. We can split it up into tokens like this:
T | o | k | e | n | s | V | V |
---|---|---|---|---|---|---|---|
factory | balls | is | the | brand | new | bonte | game |
i | already | announced | yesterday | factory | balls | takes | |
part | in | the | game | design | competition | over | |
at | jayisgames | the | goal | of | the | design | |
competition | was | to | create | a | ball | physics | |
themed | game | i | hope | you | enjoy | it |
Notice how we've removed punctuation here, and made everything lowercase. This is important for the next step, as we want to make sure we consider Factory
and factory
to be the same word - otherwise when querying the index we'd have to remember to get the casing correct.
With our tokens sorted, we can now count them to create our index. It's like a sort of tally chart I guess, except we'll be including the offset in the text of every token in the list. We'll also be removing some of the most common words in the list that aren't very interesting - these are known as stop words. Here's an index generated from that Factory Balls text above:
Token | Frequency | Offsets |
---|---|---|
factory | 2 | 0, 12 |
balls | 2 | 1, 13 |
brand | 1 | 4 |
new | 1 | 5 |
bonte | 1 | 6 |
game | 3 | 7, 18, 37 |
i | 2 | 8, 38 |
announced | 1 | 10 |
yesterday | 1 | 11 |
takes | 1 | 14 |
design | 2 | 19, 28 |
competition | 2 | 20, 29 |
jayisgames | 1 | 23 |
goal | 1 | 25 |
create | 1 | 32 |
ball | 1 | 34 |
physics | 1 | 35 |
themed | 1 | 36 |
hope | 1 | 39 |
enjoy | 1 | 41 |
Very cool. Now we can generate an index for each page's content. The next step is to turn this into an inverted index. Basically, the difference between the normal index and a inverted index is that an entry in an inverted index contains not just the offsets for a single page, but all the pages that contain that token. For example, the Cross-Code example above also contains the token game
, so the inverted index entry for game
would contain a list of offsets for both the Factory Balls and Cross-Code pages.
Including the names of every page under every different token in the inverted index would be both inefficient computationally and cause the index to grow rather large, so we should assign each page a unique numerical id. Let's do that now:
Id | Page Name |
---|---|
1 | Kerbal Space Program |
2 | Cross Code |
3 | Fort Meow |
4 | Factory Balls |
There - much better. In Pepperminty Wiki, this is handled by the ids
class, which has a pair of public methods: getid($pagename)
and getpagename($id)
. If an id can't be found for a page name, then a new id is created and added to the list (Pepperminty Wiki calls this the id index) transparently. Similarly, if a page name can't be found for an id, then null
should be returned.
Now that we've got ids for our pages, let's look at generating that inverted index entry for game
we talked about above. Here it is:
- Term:
game
Id | Frequency | Offsets |
---|---|---|
2 | 1 | 31 |
3 | 1 | 5 |
4 | 3 | 5, 12, 23 |
Note how there isn't an entry for page id 1, as the Kerbal Space Program page doesn't contain the token game
.
This, in essence, is the basics of inverted indexes. A full inverted index will contain an entry for every token that's found in at least 1 source document - though the approach used here is far from the only way of doing it (I'm sure there are much more advanced ways of doing it for larger datasets, but this came to mind from reading a few web articles and is fairly straight-forward and easy to understand).
Can you write a program that generates a full inverted index like I did in the example above? Try testing it on the test game descriptions at the start of this post.
You may also have noticed that the offsets used here are of the tokens in the list. If you wanted to generate contexts (like Duck Duck Go or Google do just below the title of a result), you'd need to use the character offsets from the source document instead. Can you extend your program to support querying the inverted index, generating contexts based on the inverted index too?
Liked this post? Got your own thoughts on the subject? Having trouble with the challenges at the end? Comment below!