Starbeamrainbowlabs

Stardust
Blog

Next Gen Search, Part 1: Backend Storage

I've got a bit of a thing about full-text search engines. I've talked about one in particular before for Pepperminty Wiki, and I was thinking about it again the other day - and how I could optimise it further.

If you haven't already, I do recommend reading my previous post on the curious question - as a number of things in this post might not make sense otherwise.

Between the time I wrote that last post and now, I've done quite a bit more digging into the root causes of that ~450ms search time, and I eventually determined that most of it was actually being spent deserialising the inverted index from the JSON file it's stored in back into memory.

This is horribly inefficient and is actually taking up 90% of that query time, so I got to thinking about what I could do about it.

My solution was multi-faceted, as I also (separately) developed a new search-term analysis system (I abbreviated to STAS, because it sounds cool :D) to add advanced query syntax such as -dog to exclude pages that contain the word dog and the like - which I also folded in at the same time as fixing the existing bottleneck.

As of the time of typing, the work on STAS is still ongoing. This doesn't mean that I can't blog about the rest of it though! I've recently-ish come into contact with key-value data stores, such as Redis and LevelDB (now RocksDB). They work rather like a C♯ dictionary or a browser's localStorage, in that they store values that are associated with unique keys (Redis is a bit of a special case though, for reasons that I won't get into here).

Such a data store would suit an inverted index surprisingly well. I devised a key system to represent an inverted index:

The devised key system in diagram form, explained below.

The more I thought about it, the more I liked it. If I use a key-value data store in this manner, I can store the values as JSON objects - and then I only have to deserialise the parts of the index that I actually use. Furthermore, adding this extra layer of abstraction allows for some clever trickery to speed things up even more.

The problem here is that Pepperminty Wiki is supposed to be portable, so I try not to use any heavy external libraries or depend on odd PHP modules that not everyone will have installed. While a LevelDB extension for PHP does exist, it's not installed by default and it's a PECL module, which are awkward to install.

All isn't lost though, because it turns out that SQLite functions surprisingly well as a key-value data store:

CREATE TABLE store (
    key TEXT UNIQUE NOT NULL,
    value TEXT
);

Yes - it really is that simple! Now all we need is some code to create and interface with such a database. Some simple getters and setters should suffice!

(Can't see the above? Try a direct link.)

While this works, I quickly ran into several major problems:

The (partial) solution here was to refactor it such that the json encoding is handled by the storage provider, and to implement a cache.

Such a cache could just be an associative array:

private $cache = [];

Then, to fetch a value, we can do this:

// If it's not in the cache, insert it
if(!isset($this->cache[$key])) {
    $this->cache[$key] = [ "modified" => false, "value" => json_decode($this->query(
        "SELECT value FROM store WHERE key = :key;",
        [ "key" => $key ]
    )->fetchColumn()) ];
}
return $this->cache[$key]["value"];

Notice how each item in the cache is also an associative array. This is so that we can flag items that have been modified in memory, such that when we next sync to disk we can write multiple changes all at once in a single batch. That turns the setter into something like this:

if(!isset($this->cache[$key])) $this->cache[$key] = [];
$this->cache[$key]["value"] = $value;
$this->cache[$key]["modified"] = true;

Very cool! Now all we need is a function to batch-write all the changes to disk. This isn't hard either:

foreach($this->cache as $key => $value_data) {
    // If it wasn't modified, there's no point in saving it, is there?
    if(!$value_data["modified"])
        continue;

    $this->query(
        "INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value)",
        [
            "key" => $key,
            "value" => json_encode($value_data["value"])
        ]
    );
}

I'll get onto the cogs and wheels behind that query() function a bit later in this post. It's one of those utility functions that are great to have around that I keep copying from one project to the next.

Much better, but still not great. Why does it still take ages to write all the changes to disk?

Well, it turns out that by default SQLite wraps every INSERT in it's own transaction. If we wrap our code in an explicit transaction, we can seriously boost the speed even further:

$this->db->beginTransaction();

// Do batch writing here

$this->db->commit();

Excellent (boomed the wizard).

But wait, there's more! The PHP PDO database driver supports prepared statements, which is a fancy way of caching SQL statements and substituting values into them. We use these already, but since we only use a very limited number of SQL queries, we can squeak some additional performance out by caching them in their prepared forms (preparing them is relatively computationally expensive, after all).

This is really easy to do. Let's create another associative array to store them in:

private $query_cache = [];

Then, we can alter the query() function to look like this:

/**
 * Makes a query against the database.
 * @param   string  $sql        The (potentially parametised) query to make.
 * @param   array   $variables  Optional. The variables to substitute into the SQL query.
 * @return  \PDOStatement       The result of the query, as a PDOStatement.
 */
private function query(string $sql, array $variables = []) {
    // Add to the query cache if it doesn't exist
    if(!isset($this->query_cache[$sql]))
        $this->query_cache[$sql] = $this->db->prepare($sql);
    $this->query_cache[$sql]->execute($variables);
    return $this->query_cache[$sql]; // fetchColumn(), fetchAll(), etc. are defined on the statement, not the return value of execute()
}

If a prepared statement for the given SQL exists in the query cache already, we re-use that again instead of preparing a brand-new one. This works especially well, since we perform a large number of queries with the same small set of SQL queries. Get operations all use the same SQL query, so why not cache it?

This completes our work on the backend storage for the next-generation search system I've built for Pepperminty Wiki. Not only will it boost the speed of Pepperminty Wiki's search engine, but it's also a cool reusable component, which I can apply to other applications to give them a boost, too!

Next time, we'll take a look at generating the test data required to stress-test the system.

I also want to give an overview of the new search-term analysis system and associated changes to the page ranking system I implemented (and aspire to implement) too, but those may have to wait until another post (teaser: PageRank isn't really what I'm after).

(Can't see the above? Try a direct link.)

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

Archive

Art by Mythdael