Archive

## Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases rendering resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

## Avoiding accidental array mutation when iterating arrays in PHP

Pepperminty Wiki is written in PHP, and I've posted before about the search engine I've implemented for it that's powered by an inverted index. In this post, I want to talk about an anti-feature of PHP that doesn't behave the way you'd expect, and how to avoid running into the same problem I did.

To do this, let's introduce a simple example of the problem at work:

<?php
$arr = []; for($i = 0; $i < 3;$i++) {
$key = random_int(0, 2000);$arr[$key] =$i;
echo("[init] key: $key, i:$i\n");
}

foreach($arr as$key => &$value) { // noop } echo("structure before: "); var_dump($arr);

foreach($arr as$key => $value) { echo("key:$key, i was $value\n"); } echo("structure after: "); var_dump($arr);
?>

The above code initialises an associative array with 3 elements. The contents might look like this:

Key Value
469 0
1777 1
1685 2

Pretty simple so far. It then iterates over it twice: once referring to the values by reference (that's what the & there is for), and the second time referring to the items by value.

You'd expect the array to be identical before and after the second foreach loop, but you'd be wrong:

Key Value
469 0
1777 1
1685 1

Wait, what? That's very odd. What's going on here? How can a foreach loop that's iterating an array by value mutate an array? To understand why, let's take a step back for a moment. Here's another snippet:

<?php

$arr = [ 1, 2, 3 ]; foreach($arr as $key =>$value) {
echo("$key:$value\n");
}

echo("The last value was $key:$value\n");
?>

What do you expect to happen here? While in Javascript with a for..of loop with a let declaration both $key and $value would have fallen out of scope by now, in PHP foreach statements don't create a new scope for variables. Instead, they inherit the scope from their parent - e.g. the global scope in the above or their containing function if defined inside a function.

To this end, we can still access the values of both $key and $value in the above example even after the foreach loop has exited! Unexpected.

It gets better. Try prefixing $value with an ampersand & in the above example and re-running it - note that both $key and $value are both still defined. This leads us to why the unexpected behaviour occurs. For some reason because of the way that PHP's foreach loop is implemented, if we re-use the same variable name for $value here in a subsequent loop it replaces the value of the last item in the array.

Shockingly enough this is actually documented behaviour (see also this bug report), though I'm somewhat confused as to how it happens on the last element in the array instead of the first.

With this in mind, to avoid this problem in future if you iterate an array by reference with a foreach loop, always remember to unset() the $value, like so: <?php$arr = [];
for($i = 0;$i < 3; $i++) {$key = random_int(0, 2000);
$arr[$key] = $i; echo("[init] key:$key, i: $i\n"); } foreach($arr as $key => &$value) {
// noop
}
unset($key); unset($value);

echo("structure before: "); var_dump($arr); foreach($arr as $key =>$value) {
echo("key: $key, i was$value\n");
}

echo("structure after: "); var_dump($arr); ?> By doing this, you can ensure that you don't accidentally mutate your arrays and spend weeks searching for the bug like I did. It's language features like these that catch developers out: and being aware of the hows and whys of their occurrence can help you to avoid them next time (if anyone can explain why it's the last element in the array that's affected instead of the first, I'd love to know!). Regardless, although I'm aware of how challenging implementing a programming language is, programming language designers should take care to avoid unexpected behaviour like this that developers don't expect. Found this interesting? Comment below! ### Sources and further reading ## Pure CSS spoilers with the CSS :target selector For 1 reason or another, I've been working on some parser improvements for Pepperminty Wiki recently. Pepperminty Wiki uses Markdown for the page content syntax - specifically Parsedown. Markdown has a number of variations and extensions, some of which are more widely accepted than others. For Pepperminty Wiki, I try to stick as closely to existing Markdown conventions as possible (such as the CommonMark spec). Where that's not possible, I try to make sure there's an existing precedent (e.g. internal links use the same syntax as MediaWiki). Anyway, as part of this I thought it would be cool to implement a spoiler tag. The problem here is that nobody can agree on the canonical syntax. Discord has recently implemented a vertical bar syntax like a spoiler wall: Some text ||spoiler text|| more text Reddit, on the other hand, uses a different syntax: Some text >!spoiler text!< more text Anyway, I've ended up supporting both of the above 2 syntaxes. My Parsedown extension generates something like the following HTML: <p>Some text <a class="spoiler" id="spoiler-RSSZTkNA30-OGJQf_7VivKtJAaoNhbx" href="#spoiler-RSSZTkNA30-OGJQf_7VivKtJAaoNhbx" title="Click / tap to reveal spoiler">spoiler text</a> more text</p> The next question here is how to make it function as a spoiler. If you're not already aware, to reveal to text in a spoiler, one first has to click on it or perform some other action. Personally, I'd prefer to avoid Javascript if possible for this, as not all users have it enabled and it complicates matters in Pepperminty Wiki. To this end, if you search for "Pure CSS spoiler" with your favourite search engine, you'll find loads of different solutions out there. Some require Javascript, and others only show the text in a tooltip on hover (which doesn't work on mobile). All this isn't very cool, so I decided to implement my own solution and share it here :-) It's actually pretty concise: .spoiler { background: #333333; border-radius: 0.2em; color: transparent; cursor: pointer; } .spoiler:target { background: transparent; color: inherit; } By setting the text colour to transparent and the background to an obvious colour, we can give the user an obvious hint that there's a spoiler that can be clicked on. Setting the cursor to a hand on platforms with a mouse further helps to support this suggestion. When the link is clicked, it sets the anchor to spoiler-RSSZTkNA30-OGJQf_7VivKtJAaoNhbx, which is also the id of the spoiler. This triggers the :target selector, which makes the spoiler text visible. Here's a demo: See the Pen Pure CSS Spoiler by Starbeamrainbowlabs (@sbrl) on CodePen. The only issue here is that it doesn't support accessibility tools such as screen readers very well. Using a trick I've found on the Mozilla Developer Net, we can do this to improve that: .spoiler::before, .spoiler::after { clip-path: inset(100%); clip: rect(1px, 1px, 1px, 1px); height: 1px; overflow: hidden; position: absolute; white-space: nowrap; width: 1px; } .spoiler::before { content: " [spoiler start] "; } .spoiler::after { content: " [spoiler end] "; } ...but this still doesn't "fix" the issue, because we're only giving the user warning. Not being a screen-reader user myself, I'm not sure whether this is adequate (is there a 'skip' command that allows skipping to the end of the element or something?) and what isn't. If you've got a better idea for screen-reader users, please do comment below - I'd love to know. Found this useful? Got a suggestion to make it even better? Comment below! ## Variable-length fuzzy hashes with Nilsimsa for did you mean correction Or, why fuzzy hashing isn't helpful for improving a search engine. Welcome to another blog post about one of my special interests: search engines - specifically the implementation thereof :D I've blogged about search engines before, in which I looked at taking my existing search engine implementation to the next level by switching to a SQLite-based key-value datastore backing and stress-testing it with ~5M words. Still not satisfied, I'm now turning my attention to query correction. Have you ever seen something like this when you make a typo when you do a search? Surprisingly, this is actually quite challenging to achieve. The problem is that given a word with a typo in it, while it's easy to determine if a word contains a typo, it's hard to determine what the correct version of the word is. Consider a wordlist like this: apple orange pear grape pineapple If the user entered something like pinneapple, then it's obvious to us that the correct spelling would be pineapple - but in order to determine this algorithmically you need an algorithm capable of determining how close 2 different words are to 1 another. The most popular algorithm for this is called leveshtein. Given 2 words a and b, it calculates the number of edits to turn a into b. For example, the edit distance between pinneapple and pineapple is 1. This is useful, but it still doesn't help us very much. With this, we'd have to calculate the leveshtein distance between the typo and all the words in the list. This could easily run into millions of words for large wikis, so this is obviously completely impractical. To this end, we need a better idea. In this post, I'm going to talk about my first attempt at solving this problem. I feel it's important to document failures as well as successes, so this is part 1 of a 2 part series. The first order of business is to track down a Nilsimsa implementation in PHP - since it doesn't come built-in, and it's pretty complicated to implement. Thankfully, this isn't too hard - I found this one on GitHub. Nilsimsa is a fuzzy hashing algorithm. This means that if you hash 2 similar words, then you'll get 2 similar hashes: Word Hash pinneapple 020c2312000800920004880000200002618200017c1021108200421018000404 pineapple 0204239242000042000428018000213364820000d02421100200400018080200256 If you look closely, you'll notice that the hashes are quite similar. My thinking is that if we vary the hash size, then words that are similar will have identical hashes, allowing the search space to be cut down significantly. The existing Nilsimsa implementation I've found doesn't support that though, so we'll need to alter it. This didn't turn out to be too much of a problem. By removing some magic numbers and adding a class member variable, it seems to work like a charm: (Can't view the above? Try this direct link.) I removed the comparison functions since I'm not using them (yet?), and also added a static convenience method for generating hashes. If I end up using this for large quantities of hashes, I may come back to it make it resettable, to avoid having to create a new object for every hash. With this, we can get the variable-length hashes we wanted: 256 0a200240020004a180810950040a00d033828480cd16043246180e54444060a5 128 3ba286c0cf1604b3c6990f54444a60f5 64 02880ed0c40204b1 32 060a04f0 16 06d2 8 06 The number there is the number of bits in the hash, and the hex value is the hash itself. The algorithm defaults to 256-bit hashes. Next, we need to determine which sized hash is best. The easiest way to do this is to take a list of typos, hash the typo and the correction, and count the number of hashes that are identical. Thankfully, there's a great dataset just for this purpose. Since it's formatted in CSV, we can download it and extract the typos and corrections in 1 go like this: curl https://raw.githubusercontent.com/src-d/datasets/master/Typos/typos.csv | cut -d',' -f2-3 >typos.csv There's also a much larger dataset too, but that one is formatted as JSON objects and would require a bunch of processing to get it into a format that would be useful here - and since this is just a relatively quick test to get a feel for how our idea works, I don't think it's too crucial that we use the larger dataset just yet. With the dataset downloaded, we can run our test. First, we need to read the file in line-by line for every hash length we want to test: <?php$handle = fopen("typos.csv", "r");

$sizes = [ 256, 128, 64, 32, 16, 8 ]; foreach($sizes as $size) { fseek($handle, 0); // Jump back to the beginning
fgets($handle); // Skip the first line since it's the header while(($line = fgets($handle)) !== false) { // Do something with the next line here } } PHP has an inbuilt function fgets() which gets the next line of input from a file handle, which is convenient. Next, we need to actually do the hashes and compare them: <?php // .....$parts = explode(",", trim($line), 2); if(strlen($parts[1]) < 3) {
$skipped++; continue; }$hash_a = Nilsimsa::hash($parts[0],$size);
$hash_b = Nilsimsa::hash($parts[1], $size);$count++;
if($hash_a ==$hash_b) {
$count_same++;$same[] = $parts; } else {$not_same[] = $parts; } echo("$count_same / $count ($skipped skipped)\r");

// .....

Finally, a bit of extra logic around the edges and we're ready for our test:

<?php
$handle = fopen("typos.csv", "r");$line_count = lines_count($handle); echo("$line_count lines total\n");

$sizes = [ 256, 128, 64, 32, 16, 8 ]; foreach($sizes as $size) { fseek($handle, 0);fgets($handle); // Skipt he first line since it's the header$count = 0; $count_same = 0;$skipped = 0;
$same = [];$not_same = [];
while(($line = fgets($handle)) !== false) {
$parts = explode(",", trim($line), 2);
if(strlen($parts[1]) < 3) {$skipped++;
continue;
}
$hash_a = Nilsimsa::hash($parts[0], $size);$hash_b = Nilsimsa::hash($parts[1],$size);

$count++; if($hash_a == $hash_b) {$count_same++;
$same[] =$parts;
}
else $not_same[] =$parts;
echo("$count_same /$count ($skipped skipped)\r"); } file_put_contents("$size-same.csv", implode("\n", array_map(function ($el) { return implode(",",$el);
}, $same))); file_put_contents("$size-not-same.csv", implode("\n", array_map(function ($el) { return implode(",",$el);
}, $not_same))); echo(str_pad($size, 10)."→ $count_same /$count (".round(($count_same/$count)*100, 2)."%), $skipped skipped\n"); } I'm writing the pairs that are the same and different to different files here for a visual inspection. I'm also skipping words that are less than 3 characters long, and that lines_count() function there is just a quick helper function for counting the number of lines in a file for the progress indicator (if you write a \r without a \n to the terminal, it'll reset to the beginning of the current line): <?php function lines_count($handle) : int {
fseek($handle, 0);$count = 0;
while(fgets($handle) !== false)$count++;
return $count; } Unfortunately, the results of running the test aren't too promising. Even with the shortest hash the algorithm will generate without getting upset, only ~23% of typos generate the same hash as their correction: 7375 lines total 256 → 7 / 7322 (0.1%), 52 skipped 128 → 9 / 7322 (0.12%), 52 skipped 64 → 13 / 7322 (0.18%), 52 skipped 32 → 64 / 7322 (0.87%), 52 skipped 16 → 347 / 7322 (4.74%), 52 skipped 8 → 1689 / 7322 (23.07%), 52 skipped Furthermore, digging deeper with an 8-bit you start to get large numbers of words that have the same hash, which isn't ideal at all. A potential solution here would be to use hamming distance (basically counting the number of bits that are different in a string of binary) to determine which hashes are similar to each other like leveshtein distance does, but that still doesn't help us as we then still have a problem that's almost identical to where we started. In the second part of this mini-series, I'm going to talk about how I ultimately solved this problem. While the algorithm I ultimately used (a BK-Tree, more on them next time) is certainly not the most efficient out there (it's O(log n) if I understand it correctly), it's very simple to implement and is much less complicated than Symspell, which seems to be the most efficient algorithm that exists at the moment. Additionally, I have been able to optimise said algorithm to return results for a 172K wordlist in ~110ms, which is fine for my purposes. Found this interesting? Got another algorithm I should check out? Got confused somewhere along the way? Comment below! ## Pepperminty Wiki is 5 today! ....let's celebrate with the release of v0.20. I got a notification from my calendar system yesterday that Pepperminty Wiki's birthday is today, and since I did a beta release a few days ago and there haven't been any major issues, I thought I'd time the full release to coincide with its birthday. I'm timing it from the first commit I ever made in Pepperminty Wiki's git repository. 5 years is a long time - and as a program Pepperminty Wiki has come such a long way since then. Today, it's actually a really useful piece of open-source software, which is evidenced by the fact that people recommend it to other people on their own. Seeing such things and hearing about where it's used are really amazing to see - and give me lots of motivation to improve Pepperminty Wiki even more. While the number of commits a project has isn't always an indicator of quality or how complete a project is, you can usually get a pretty good idea as to how much work has been done on a project by the number of commits it has (but of course, not always). At the time of writing Pepperminty Wiki has 1,415 commits, which is more than any other project I have ever worked on - past or present. The air quality web interface (which is now more of a general sensor web interface) is my 2nd place project unless I've missed one - and at 425 commits it doesn't even come close! To summarise the features in the latest release: • 🌜 New automatic dark mode in the default theme! Uses prefers-color-scheme under-the-hood • 🌈 Added theme gallery! Read more here • Vastly improved search engine performance, with new advanced query syntax (with even more syntax along the way) • 🚁 Accessibility improvements - if you're a screen-reader or accessibility tool user, I want to hear from you if you think anything (big or small!) could be improved! Personally, I'm most proud of the optimisations to the search engine. I've actually blogged about how I did it in a 3 part series and tested it on a test wiki with ~5.9M words - while search times vary depending on your input (the new -exclude syntax will actually speed up queries) and your server hardware, a single word query for ~5.0M word wikis takes ~50ms O.o Unfortunately, this does mean that the search index will need to be rebuilt under the new format - and will be slightly larger than before. To get a progress bar for this operation, go to the master settings and click the rebuild button. Another notable change is the new 'mega-menu' style more menu: That menu has been bothering me for a while, and thanks to the kind people on Reddit, I've now got a solution. Note that you'll need to delete nav_links_extra from your peppermint.json in order for it to take effect. Please also test the theme gallery in particular. It's brand-new in this release and quite complicated under-the-hood, so I'd appreciate some extra eyes on that. As for when I'll release v1.0, I'm not sure. As a program, Pepperminty Wiki is certainly stable enough to be used in production scenarios today - so perhaps incrementing the version number to v1.0 would be a good idea to reflect that. At the same time though, there are a number of missing features - most notably watchlists and further improvements to the page history system - so I'm not sure when I'll be confident enough to bump it to v1.0. Either way, I'm pretty sure that I'll keep working on Pepperminty Wiki for years to come - I have no plans to cease development at this time. While Pepperminty Wiki releases don't move at the most rapid of paces, I aim to get about 2 releases out per year about 6 months apart from each other. Special thanks to @SeanFromIT for reporting a number of bugs which have been squashed. If you use Pepperminty Wiki, tweet me @SBRLabs! I'd love to hear about how you're using it. Lastly, don't forget to take a backup of your wiki before updating. While I've made every effort to squash bugs, you can never be too careful :P Check out v0.20 here: Pepperminty Wiki v0.20 ## Next Gen Search, Part 2: Pushing the limits In the last part, we looked at how I built a new backend for storing inverted indexes for Pepperminty Wiki, which allows for partial index deserialisation and other nice features that boost performance considerably. Since the last post, I've completed work on the new search system - though there are a few bits around the edges that I still want to touch up and do some more work on. In this post though, I want to talk about how I generated test data to give my full-text search engine something to chew on. I've done this before for my markov chain program I wrote a while back, and that was so much fun I did it again for my search engine here. After scratching my head for a bit to think of a data source I could use, I came up with the perfect plan. Ages ago I downloaded a Wikipedia dump - just the content pages in Wikitext markup. Why not use that? As it turns out, it was a rather good idea. Some processing of said dump was required though to transform it into a format that Pepperminty Wiki can understand, though. Pepperminty Wiki stores pages on disk as flat text files in markdown, and indexes them in pageindex.json. If pageindex.json doesn't exist, then Pepperminty Wiki rebuilds it automagically by looking for content pages on disk. This makes it easy to import batches of new pages into Pepperminty Wiki, so all we need to do is extract the wiki text, convert to markdown, and import! This ended up requiring a number of different separate steps though, so let's take it 1 at a time First, we need a Wikipedia database dump in the XML format. These are available from dumps.wikimedia.org. There are many different ones available, but I suggest grabbing one that has a filename similar to enwiki-20180201-pages-articles.xml - i.e. just content pages - no revision history, user pages, or additional extras. I think the most recent one as of the time of posting is downloadable here - though I'd warn you that it's 15.3GiB in size! You can see a list of available dump dates for the English Wikipedia here. Now that we've got our dump, let's extract the pages from it. This is nice and easy to do with wikiextractor on GitHub: nice -n20 wikiextractor enwiki-20180201-pages-articles.xml --no_templates --html --keep_tables --lists --links --sections --json --output wikipages --compress --bytes 25M >progress.log 2>&1  This will parse the dump and output a number of compressed files to the wikipages directory. These will have 1 JSON object per line, each containing information about a single page on Wikipedia - with page content pre-converted to HTML for us. The next step is to extract the page content and save it to a file with the correct name. This ended up being somewhat complicated, so I wrote a quick Node.js script to do the job: #!/usr/bin/env node const readline = require("readline"); const fs = require("fs"); if(!fs.existsSync("pages")) fs.mkdirSync("pages", { mode: 0o755 }); // From https://stackoverflow.com/a/44195856/1460422 function html_unentities(encodedString) { var translate_re = /&(nbsp|amp|quot|lt|gt);/g; var translate = { "nbsp":" ", "amp" : "&", "quot": "\"", "lt" : "<", "gt" : ">" }; return encodedString.replace(translate_re, function(match, entity) { return translate[entity]; }).replace(/&#(\d+);/gi, function(match, numStr) { var num = parseInt(numStr, 10); return String.fromCharCode(num); }); } const interface = readline.createInterface({ input: process.stdin, //output: process.stdout }); interface.on("line", (text) => { const obj = JSON.parse(text); fs.writeFileSync(pages/${obj.title.replace(/\//, "-")}.html, html_unentities(obj.text));
console.log(${obj.id}\t${obj.title});
});

This basically takes the stream of JSON object on the standard input, parses them, and saves the relevant content to disk. We can invoke it like so:

bzcat path/to/*.bz2 | ./parse.js

Don't forget to chmod +x parse.js if you get an error here. The other important thing about the above script ist hat we have to unescape the HTML entities (e.g. &gt;), because otherwise we'll have issues later with HTML conversation and page names will look odd. This is done by the html_unentities() function in the above script.

This should result in a directory containing a large number of files - 1 file per content page. This is much better, but we're still not quite there yet. Wikipedia uses wiki markup (which we converted to HTML with wikiextractor) and Pepperminty Wiki uses Markdown - the 2 of which are, despite all their similarities, inherently incompatible. Thankfully, pandoc is capable of converting from HTML to markdown.

Pandoc is great at this kind of thing - it uses an intermediate representation and allows you to convert almost any type of textual document format to any other format. Markdown to PDF, EPUB to plain text, ..... and HTML to markdown (just to name a few). It actually looks like it shares a number of features with traditional compilers like GCC.

Anyway, let's use it to convert our folder full of wikitext files to a folder full of markdown:

mkdir -p pages_md;
find pages/ -type f -name "*.html" -print0 | nice -n20 xargs -P4 -0 -n1 -I{} sh -c 'filename="{}"; title="${filename##*/}"; title="${title%.*}"; pandoc --from "html"  --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md"; echo "${title}";'; _(See this on explainshell.com - doesn't include the nice -n20 due to a bug on their end)_ This looks complicated, but it really isn't. Let's break it down a bit: find pages/ -type f -name "*.html" -print0 This finds all the HTML files that we want to convert to Markdown, and delimits the output with a NUL byte - i.e. 0x0. This makes the next step easier: ... | nice -n20 xargs -P4 -0 -n1 -I{} sh -c '....' This pushes a new xargs instance into the background, which will execute 4 commands at a time. xargs executes a command for each line of input it receives. In our case, we're delimiting with NUL 0x0 instead though. We explicitly specify that we can 1 command per line of input though, as xargs tries to optimise and do command file1 file2 file3 instead. The sh -c bit is starting a subshell, in which we execute a small wrapper script that then calls pandoc. This is of course inefficient, but I couldn't find any way around spawning a subshell in this instance. filename="{}"; title="${filename##*/}";
title="${title%.*}"; pandoc --from "html" --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md"; echo "${title}";

I've broken the sh -c subshell script down into multiple liens for readability. Simply put, it extracts the page title from the filename, converts the HTML to Markdown, and saves it to a new file in a different directory with the .md replacing the original .html extension.

When you put all these components together, you get a script that converts a folder full of HTML files to Markdown. Just like with the markov chains extraction I mentioned at the beginning of this post, Bash and shell scripting really is all about lego bricks. This is due in part to the Unix philosophy:

Make each program do one thing well.

There is more to it, but this is the most important point to remember. Many of the core utilities you'll find on the terminal follow this way of thinking.

There's 1 last thing we need to take care of before we have them in the right format though - we need to convert the [display text](page name) markdown-format links back into the Wikipedia [[internal link]] format that Pepperminty Wiki also uses.

Thankfully, another command-line tool I know of called repren is well-suited to this:

repren --from '$([^$]+)\]$([^):]+)$' --to '[[\1]]' pages_md/*.md

It took some fiddling, but I got all the escaping figured out and the above converts back into the [[internal link]] format well enough.

Now that we've got our folder full of markdown files, we need to extract a random portion of them to act as a test for Pepperminty Wiki - as the whole lot might be a bit much for it to handle (though if Pepperminty Wiki was capable of handling it all eventually that'd be awesome :D). Let's try 500 pages to start:

find path/to/wikipages/ -type f -name "*.md" -print0 | shuf --zero-terminated | head -n500 --zero-terminated | xargs -0 -n1 -I{} cp "{}" .

(See this on explainshell.com)

This is another lego-brick style command. Let's break it down too:

find path/to/wikipages/ -type f -name "*.md" -print0

This lists all the .md files in a directory, delimiting them with a NUL character, as before. It's better to do this than use ls, as find is explicitly designed to be machine-readable.

.... | shuf --zero-terminated

The shuf command randomly shuffles the input lines. In this case, we're telling it that the input is delimited by the NUL byte.

.... | head -n500 --zero-terminated

Similar deal here. head takes the top N lines of input, and discards the rest.

.... | xargs -0 -n1 -I{} cp "{}" .

Finally, xargs calls cp to copy the selected files to the current directory - which is, in this case, the root directory of my test Pepperminty Wiki instance.

Since I'm curious, let's now find out roughly how many words we're dealing with here:

cat data_test/*.md | wc --words
1593190


1.5 million words! That's a lot. I wonder how quickly we can search that?

24.8ms? Awesome! That's so much better than before. If you're wondering about new coat of paint in the screenshot - Pepperminty Wiki is getting dark theme, thanks to prefers-color-scheme :D

I wonder what happens if we push it to 2K pages?

This time we get ~120ms for 5.9M total words - wow! I wasn't expecting it to perform so well. At this scale, rebuilding the entire index is particularly costly - so if I was to push it even further it would make sense to implement an incremental approach that spreads the work over multiple requests, assuming I can't squeeze any more performance out the system as-is.

The last thing I want to do here is make a rough estimate about time time complexity the search system as-is, given the data we have so far. This isn't particularly difficult to do.

Given the results above, we can calculate that at 1.5M total words, an increase of ~60K total words results in an increase of 1ms of execution time. At 5.9M words, it's only ~49K words / ms of execution time - a drop of ~11K words / ms of execution time.

From this, we can speculate that for every million words total added to a wiki, we can expect a ~2.5K words / ms of execution time drop - not bad! We'd need more data points to make any reasonable guess as to the Big-O complexity function that it conforms to. My guess would be something like $O(xN^2)$, where x is a constant between ~0.2 and 2.

Maybe at some point I'll go to the trouble of running enough tests to calculate it, but with all the variables that affect the execution time (number of pages, distribution of words across pages, etc.), I'm not in any hurry to calculate it. If you'd like to do so, go ahead and comment below!

Next time, I'll unveil the inner working of the STAS: my new search-term analysis system.

Found this interesting? Got your own story about some cool code you've written to tell? Comment below!

## Next Gen Search, Part 1: Backend Storage

I've got a bit of a thing about full-text search engines. I've talked about one in particular before for Pepperminty Wiki, and I was thinking about it again the other day - and how I could optimise it further.

If you haven't already, I do recommend reading my previous post on the curious question - as a number of things in this post might not make sense otherwise.

Between the time I wrote that last post and now, I've done quite a bit more digging into the root causes of that ~450ms search time, and I eventually determined that most of it was actually being spent deserialising the inverted index from the JSON file it's stored in back into memory.

This is horribly inefficient and is actually taking up 90% of that query time, so I got to thinking about what I could do about it.

My solution was multi-faceted, as I also (separately) developed a new search-term analysis system (I abbreviated to STAS, because it sounds cool :D) to add advanced query syntax such as -dog to exclude pages that contain the word dog and the like - which I also folded in at the same time as fixing the existing bottleneck.

As of the time of typing, the work on STAS is still ongoing. This doesn't mean that I can't blog about the rest of it though! I've recently-ish come into contact with key-value data stores, such as Redis and LevelDB (now RocksDB). They work rather like a C♯ dictionary or a browser's localStorage, in that they store values that are associated with unique keys (Redis is a bit of a special case though, for reasons that I won't get into here).

Such a data store would suit an inverted index surprisingly well. I devised a key system to represent an inverted index:

• The first key, |termindex|, is used to store a master list of words that have entries in the page index.
• The second key, term is simply the word itself (e.g. cat, chicken, rocket, etc.), and stores a list of ids of the pages that contain that word.
• The third and final key, term|pageid, is a word followed by a separator and finally the id of a page (e.g. bill|1, cat|45, boosters|69, etc.). These keys store the locations that a particular word appears at in a given document in the index. A separate id index is needed to convert between the page id and it's associated name - Pepperminty Wiki provides this functionality out-of-the-box.

The more I thought about it, the more I liked it. If I use a key-value data store in this manner, I can store the values as JSON objects - and then I only have to deserialise the parts of the index that I actually use. Furthermore, adding this extra layer of abstraction allows for some clever trickery to speed things up even more.

The problem here is that Pepperminty Wiki is supposed to be portable, so I try not to use any heavy external libraries or depend on odd PHP modules that not everyone will have installed. While a LevelDB extension for PHP does exist, it's not installed by default and it's a PECL module, which are awkward to install.

All isn't lost though, because it turns out that SQLite functions surprisingly well as a key-value data store:

CREATE TABLE store (
key TEXT UNIQUE NOT NULL,
value TEXT
);

Yes - it really is that simple! Now all we need is some code to create and interface with such a database. Some simple getters and setters should suffice!

(Can't see the above? Try a direct link.)

While this works, I quickly ran into several major problems:

• Every time I read from the data store I'm decoding JSON, which is expensive
• Every time I'm saving to the data store, I'm encoding to JSON, which is also expensive
• I'm reading and writing the same thing multiple times, which is very expensive
• Writing anything to the data store takes a long time

The (partial) solution here was to refactor it such that the json encoding is handled by the storage provider, and to implement a cache.

Such a cache could just be an associative array:

private $cache = []; Then, to fetch a value, we can do this: // If it's not in the cache, insert it if(!isset($this->cache[$key])) {$this->cache[$key] = [ "modified" => false, "value" => json_decode($this->query(
"SELECT value FROM store WHERE key = :key;",
[ "key" => $key ] )->fetchColumn()) ]; } return$this->cache[$key]["value"]; Notice how each item in the cache is also an associative array. This is so that we can flag items that have been modified in memory, such that when we next sync to disk we can write multiple changes all at once in a single batch. That turns the setter into something like this: if(!isset($this->cache[$key]))$this->cache[$key] = [];$this->cache[$key]["value"] =$value;
$this->cache[$key]["modified"] = true;

Very cool! Now all we need is a function to batch-write all the changes to disk. This isn't hard either:

foreach($this->cache as$key => $value_data) { // If it wasn't modified, there's no point in saving it, is there? if(!$value_data["modified"])
continue;

$this->query( "INSERT OR REPLACE INTO store(key, value) VALUES(:key, :value)", [ "key" =>$key,
"value" => json_encode($value_data["value"]) ] ); } I'll get onto the cogs and wheels behind that query() function a bit later in this post. It's one of those utility functions that are great to have around that I keep copying from one project to the next. Much better, but still not great. Why does it still take ages to write all the changes to disk? Well, it turns out that by default SQLite wraps every INSERT in it's own transaction. If we wrap our code in an explicit transaction, we can seriously boost the speed even further: $this->db->beginTransaction();

// Do batch writing here

$this->db->commit(); Excellent (boomed the wizard). But wait, there's more! The PHP PDO database driver supports prepared statements, which is a fancy way of caching SQL statements and substituting values into them. We use these already, but since we only use a very limited number of SQL queries, we can squeak some additional performance out by caching them in their prepared forms (preparing them is relatively computationally expensive, after all). This is really easy to do. Let's create another associative array to store them in: private$query_cache = [];

Then, we can alter the query() function to look like this:

/**
* Makes a query against the database.
* @param   string  $sql The (potentially parametised) query to make. * @param array$variables  Optional. The variables to substitute into the SQL query.
* @return  \PDOStatement       The result of the query, as a PDOStatement.
*/
private function query(string $sql, array$variables = []) {
// Add to the query cache if it doesn't exist
if(!isset($this->query_cache[$sql]))
$this->query_cache[$sql] = $this->db->prepare($sql);
$this->query_cache[$sql]->execute($variables); return$this->query_cache[\$sql]; // fetchColumn(), fetchAll(), etc. are defined on the statement, not the return value of execute()
}`

If a prepared statement for the given SQL exists in the query cache already, we re-use that again instead of preparing a brand-new one. This works especially well, since we perform a large number of queries with the same small set of SQL queries. Get operations all use the same SQL query, so why not cache it?

This completes our work on the backend storage for the next-generation search system I've built for Pepperminty Wiki. Not only will it boost the speed of Pepperminty Wiki's search engine, but it's also a cool reusable component, which I can apply to other applications to give them a boost, too!

Next time, we'll take a look at generating the test data required to stress-test the system.

I also want to give an overview of the new search-term analysis system and associated changes to the page ranking system I implemented (and aspire to implement) too, but those may have to wait until another post (teaser: PageRank isn't really what I'm after).

(Can't see the above? Try a direct link.)

Next

Art by Mythdael