
Pure CSS spoilers with the CSS :target selector

For one reason or another, I've been working on some parser improvements for Pepperminty Wiki recently. Pepperminty Wiki uses Markdown for its page content syntax - specifically Parsedown. Markdown has a number of variations and extensions, some of which are more widely accepted than others. For Pepperminty Wiki, I try to stick as closely to existing Markdown conventions as possible (such as the CommonMark spec). Where that's not possible, I try to make sure there's an existing precedent (e.g. internal links use the same syntax as MediaWiki).

Anyway, as part of this I thought it would be cool to implement a spoiler tag. The problem here is that nobody can agree on a canonical syntax. Discord has recently implemented a double vertical bar syntax that acts as a spoiler wall:

Some text ||spoiler text|| more text

Reddit, on the other hand, uses a different syntax:

Some text >!spoiler text!< more text

Anyway, I've ended up supporting both of the above syntaxes. My Parsedown extension generates something like the following HTML:

<p>Some text <a class="spoiler" id="spoiler-RSSZTkNA30-OGJQf_7VivKtJAaoNhbx" href="#spoiler-RSSZTkNA30-OGJQf_7VivKtJAaoNhbx" title="Click / tap to reveal spoiler">spoiler text</a> more text</p>
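For the curious, a Parsedown inline extension for the Discord-style syntax might look something like the rough sketch below. This isn't Pepperminty Wiki's actual implementation - the class name, regex, and id generation are all illustrative - but it follows Parsedown's documented extension pattern:

<?php
require_once("./Parsedown.php"); // path assumed

class ParsedownSpoiler extends Parsedown {
    function __construct() {
        // Fire our handler whenever a | is encountered in inline text
        $this->InlineTypes['|'][] = 'Spoiler';
        $this->inlineMarkerList .= '|';
    }

    protected function inlineSpoiler($excerpt) {
        if(preg_match('/^\|\|([^|]+)\|\|/', $excerpt['text'], $matches)) {
            // A unique id is needed for the CSS :target trick below to work
            $id = "spoiler-" . bin2hex(random_bytes(20));
            return [
                "extent" => strlen($matches[0]), // number of characters consumed
                "element" => [
                    "name" => "a",
                    "attributes" => [
                        "class" => "spoiler",
                        "id" => $id,
                        "href" => "#$id",
                        "title" => "Click / tap to reveal spoiler"
                    ],
                    "text" => $matches[1]
                ]
            ];
        }
    }
}

$parser = new ParsedownSpoiler();
echo($parser->text("Some text ||spoiler text|| more text"));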

The next question is how to make it function as a spoiler. If you're not already aware, to reveal the text in a spoiler, one first has to click on it or perform some other action. Personally, I'd prefer to avoid Javascript here if possible, as not all users have it enabled - and it would complicate matters in Pepperminty Wiki.

To this end, if you search for "Pure CSS spoiler" with your favourite search engine, you'll find loads of different solutions out there. Some require Javascript, and others only show the text in a tooltip on hover (which doesn't work on mobile). All this isn't very cool, so I decided to implement my own solution and share it here :-)

It's actually pretty concise:

.spoiler {
    background: #333333;
    border-radius: 0.2em;
    color: transparent;
    cursor: pointer;
}
.spoiler:target {
    background: transparent;
    color: inherit;
}

By setting the text colour to transparent and the background to an obvious colour, we can give the user an obvious hint that there's a spoiler that can be clicked on. Setting the cursor to a hand on platforms with a mouse further helps to support this suggestion.

When the link is clicked, it sets the anchor to spoiler-RSSZTkNA30-OGJQf_7VivKtJAaoNhbx, which is also the id of the spoiler. This triggers the :target selector, which makes the spoiler text visible.
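In other words, the id and the href are paired up like this (with a shortened, made-up id for readability):

<p>
    Some text
    <a class="spoiler" id="spoiler-example" href="#spoiler-example"
        title="Click / tap to reveal spoiler">spoiler text</a>
    more text
</p>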

Here's a demo:

See the Pen Pure CSS Spoiler by Starbeamrainbowlabs (@sbrl) on CodePen.

The only issue here is that it doesn't support accessibility tools such as screen readers very well. Using a trick I found on the Mozilla Developer Network, we can improve on that:

/* Visually hide the marker text, while keeping it available to screen readers */
.spoiler::before, .spoiler::after {
    clip-path: inset(100%);
    clip: rect(1px, 1px, 1px, 1px);
    height: 1px;
    overflow: hidden;
    position: absolute;
    white-space: nowrap;
    width: 1px;
}
.spoiler::before {
    content: " [spoiler start] ";
}
.spoiler::after {
    content: " [spoiler end] ";
}

...but this still doesn't "fix" the issue, because we're only giving the user a warning. Not being a screen-reader user myself, I'm not sure whether this is adequate or not (is there a 'skip' command that allows jumping to the end of the current element, for example?).

If you've got a better idea for screen-reader users, please do comment below - I'd love to know.

Found this useful? Got a suggestion to make it even better? Comment below!

Variable-length fuzzy hashes with Nilsimsa for did you mean correction

Or, why fuzzy hashing isn't helpful for improving a search engine. Welcome to another blog post about one of my special interests: search engines - specifically the implementation thereof :D

I've blogged about search engines before - previously, I looked at taking my existing search engine implementation to the next level by switching to an SQLite-based key-value datastore backing and stress-testing it with ~5M words. Still not satisfied, I'm now turning my attention to query correction. Have you ever noticed the "did you mean?" suggestion that pops up when you make a typo in a search?

Surprisingly, this is actually quite challenging to achieve. While it's easy to determine whether a word contains a typo, it's much harder to determine what the correct version of that word is. Consider a wordlist like this:

apple
orange
pear
grape
pineapple

If the user entered something like pinneapple, then it's obvious to us that the correct spelling would be pineapple - but in order to determine this algorithmically, you need an algorithm capable of determining how close 2 different words are to one another.

The most popular algorithm for this is the Levenshtein edit distance. Given 2 words a and b, it calculates the minimum number of single-character edits (insertions, deletions, and substitutions) needed to turn a into b. For example, the edit distance between pinneapple and pineapple is 1.
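Conveniently, PHP has a levenshtein() function built right in, so we can check that claim:

<?php
// The extra 'n' in pinneapple is 1 deletion away from the correct spelling
echo(levenshtein("pinneapple", "pineapple") . "\n"); // 1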

This is useful, but it still doesn't help us very much. With this alone, we'd have to calculate the Levenshtein distance between the typo and every word in the list - which could easily run into millions of words for large wikis, so it's completely impractical.
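To illustrate the problem, the brute-force approach would look something like this sketch (the function name is mine):

<?php
// Scan the entire wordlist for the closest match: one levenshtein()
// call per word for every single query. Fine for 5 words, hopeless
// for 5 million.
function did_you_mean_naive(string $typo, array $wordlist) : ?string {
    $best = null;
    $best_distance = PHP_INT_MAX;
    foreach($wordlist as $word) {
        $distance = levenshtein($typo, $word);
        if($distance < $best_distance) {
            $best = $word;
            $best_distance = $distance;
        }
    }
    return $best;
}

echo(did_you_mean_naive(
    "pinneapple",
    [ "apple", "orange", "pear", "grape", "pineapple" ]
) . "\n"); // pineapple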

To this end, we need a better idea. In this post, I'm going to talk about my first attempt at solving this problem. I feel it's important to document failures as well as successes, so this is part 1 of a 2 part series.

The first order of business is to track down a Nilsimsa implementation in PHP - since it doesn't come built-in, and it's pretty complicated to implement. Thankfully, this isn't too hard - I found this one on GitHub.

Nilsimsa is a fuzzy hashing algorithm. This means that if you hash 2 similar words, then you'll get 2 similar hashes:

Word        Hash
pinneapple  020c2312000800920004880000200002618200017c1021108200421018000404
pineapple   0204239242000042000428018000213364820000d02421100200400018080200

If you look closely, you'll notice that the hashes are quite similar. My thinking is that if we vary the hash size, then words that are similar will have identical hashes, allowing the search space to be cut down significantly. The existing Nilsimsa implementation I've found doesn't support that though, so we'll need to alter it.

This didn't turn out to be too much of a problem. By removing some magic numbers and adding a class member variable, it seems to work like a charm:

(Can't view the above? Try this direct link.)

I removed the comparison functions since I'm not using them (yet?), and also added a static convenience method for generating hashes. If I end up using this for large quantities of hashes, I may come back to it and make it resettable, to avoid having to create a new object for every hash.
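Usage of the modified class looks like this (assuming it's saved as Nilsimsa.php; the second argument is the desired hash length in bits):

<?php
require_once("./Nilsimsa.php"); // filename assumed

echo(Nilsimsa::hash("pineapple", 256) . "\n");
echo(Nilsimsa::hash("pineapple", 32) . "\n");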

With this, we can get the variable-length hashes we wanted:

256       0a200240020004a180810950040a00d033828480cd16043246180e54444060a5
128       3ba286c0cf1604b3c6990f54444a60f5
64        02880ed0c40204b1
32        060a04f0
16        06d2
8         06

The number on the left is the length of the hash in bits, and the hex value is the hash itself. The algorithm defaults to 256-bit hashes. Next, we need to determine which hash size works best. The easiest way to do this is to take a list of typos, hash each typo and its correction, and count the number of pairs whose hashes are identical.

Thankfully, there's a great dataset just for this purpose. Since it's formatted as CSV, we can download it and extract the typos and corrections in one go like this:

curl https://raw.githubusercontent.com/src-d/datasets/master/Typos/typos.csv | cut -d',' -f2-3 >typos.csv

There's also a much larger dataset, but that one is formatted as JSON objects and would require a bunch of processing to get it into a useful format. Since this is just a relatively quick test to get a feel for how the idea works, I don't think the larger dataset is crucial just yet.

With the dataset downloaded, we can run our test. First, we need to read the file in, line by line, once for every hash length we want to test:

<?php
$handle = fopen("typos.csv", "r");

$sizes = [ 256, 128, 64, 32, 16, 8 ];
foreach($sizes as $size) {
    fseek($handle, 0); // Jump back to the beginning
    fgets($handle); // Skip the first line since it's the header

    while(($line = fgets($handle)) !== false) {
        // Do something with the next line here
    }
}

Conveniently, PHP has an inbuilt function fgets() that reads the next line of input from a file handle. Next, we need to actually do the hashes and compare them:

<?php

// .....

$parts = explode(",", trim($line), 2);
if(strlen($parts[1]) < 3) {
    $skipped++;
    continue;
}
$hash_a = Nilsimsa::hash($parts[0], $size);
$hash_b = Nilsimsa::hash($parts[1], $size);

$count++;
if($hash_a == $hash_b) {
    $count_same++;
    $same[] = $parts;
}
else {
    $not_same[] = $parts;
}
echo("$count_same / $count ($skipped skipped)\r");

// .....

Finally, a bit of extra logic around the edges and we're ready for our test:

<?php
require_once("./Nilsimsa.php"); // Load the Nilsimsa class (filename assumed)

$handle = fopen("typos.csv", "r");
$line_count = lines_count($handle);
echo("$line_count lines total\n");

$sizes = [ 256, 128, 64, 32, 16, 8 ];
foreach($sizes as $size) {
    fseek($handle, 0); fgets($handle); // Skip the first line since it's the header

    $count = 0; $count_same = 0; $skipped = 0;
    $same = []; $not_same = [];
    while(($line = fgets($handle)) !== false) {
        $parts = explode(",", trim($line), 2);
        if(strlen($parts[1]) < 3) {
            $skipped++;
            continue;
        }
        $hash_a = Nilsimsa::hash($parts[0], $size);
        $hash_b = Nilsimsa::hash($parts[1], $size);

        $count++;
        if($hash_a == $hash_b) {
            $count_same++;
            $same[] = $parts;
        }
        else $not_same[] = $parts;
        echo("$count_same / $count ($skipped skipped)\r");
    }

    file_put_contents("$size-same.csv", implode("\n", array_map(function ($el) {
        return implode(",", $el);
    }, $same)));
    file_put_contents("$size-not-same.csv", implode("\n", array_map(function ($el) {
        return implode(",", $el);
    }, $not_same)));

    echo(str_pad($size, 10)."→ $count_same / $count (".round(($count_same/$count)*100, 2)."%), $skipped skipped\n");
}

I'm writing the pairs that hash the same and differently to separate files here for visual inspection later. I'm also skipping words that are less than 3 characters long. The lines_count() function is a quick helper that counts the number of lines in a file for the progress indicator (if you write a \r to the terminal without a \n, the cursor resets to the beginning of the current line):

<?php
function lines_count($handle) : int {
    fseek($handle, 0);
    $count = 0;
    while(fgets($handle) !== false) $count++;
    return $count;
}

Unfortunately, the results of running the test aren't too promising. Even with the shortest hash the algorithm will generate without getting upset, only ~23% of typos generate the same hash as their correction:

7375 lines total
256       → 7 / 7322 (0.1%), 52 skipped
128       → 9 / 7322 (0.12%), 52 skipped
64        → 13 / 7322 (0.18%), 52 skipped
32        → 64 / 7322 (0.87%), 52 skipped
16        → 347 / 7322 (4.74%), 52 skipped
8         → 1689 / 7322 (23.07%), 52 skipped

Furthermore, digging deeper into the 8-bit results, you start to get large numbers of completely different words that share the same hash, which isn't ideal at all.

A potential solution here would be to use the Hamming distance (basically, counting the number of bits that differ between 2 binary strings) to determine which hashes are similar to one another, in the same way that the Levenshtein distance does for words - but that still doesn't help us, because we'd then be left with a problem almost identical to the one we started with.
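For reference, calculating the Hamming distance between 2 equal-length hex-encoded hashes is straightforward - here's a quick sketch (the function name is mine):

<?php
// Count the number of bits that differ between 2 hex-encoded hashes
function hash_hamming_distance(string $hex_a, string $hex_b) : int {
    $bytes_a = hex2bin($hex_a);
    $bytes_b = hex2bin($hex_b);
    $distance = 0;
    for($i = 0; $i < strlen($bytes_a); $i++) {
        $xor = ord($bytes_a[$i]) ^ ord($bytes_b[$i]); // differing bits become 1s
        while($xor > 0) {
            $distance += $xor & 1; // count the set bits
            $xor >>= 1;
        }
    }
    return $distance;
}

echo(hash_hamming_distance("06d2", "06da") . "\n"); // 1

(If you have the GMP extension installed, gmp_hamdist() will do the bit counting for you.)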

In the second part of this mini-series, I'm going to talk about how I ultimately solved this problem. While the algorithm I ended up using (a BK-Tree - more on them next time) is certainly not the most efficient out there (lookups are O(log n), if I understand it correctly), it's very simple to implement and much less complicated than SymSpell, which seems to be the most efficient algorithm in existence at the moment.

Additionally, I have been able to optimise said algorithm to return results for a 172K wordlist in ~110ms, which is fine for my purposes.

Found this interesting? Got another algorithm I should check out? Got confused somewhere along the way? Comment below!

Pepperminty Wiki is 5 today!

...let's celebrate with the release of v0.20. I got a notification from my calendar system yesterday that Pepperminty Wiki's birthday is today, and since I did a beta release a few days ago and there haven't been any major issues, I thought I'd time the full release to coincide with its birthday.

I'm timing it from the first commit I ever made in Pepperminty Wiki's git repository. 5 years is a long time - and as a program Pepperminty Wiki has come such a long way since then.

Today, it's actually a really useful piece of open-source software, as evidenced by the fact that people recommend it to others of their own accord. Seeing that happen and hearing about where it's used is really amazing - and gives me lots of motivation to improve Pepperminty Wiki even more.

While the number of commits a project has isn't always an indicator of its quality or completeness, it usually gives you a pretty good idea of how much work has gone into it. At the time of writing, Pepperminty Wiki has 1,415 commits - more than any other project I have ever worked on, past or present. The air quality web interface (which is now more of a general sensor web interface) is in 2nd place unless I've missed something - and at 425 commits it doesn't even come close!

To summarise the features in the latest release:

  • 🌜 New automatic dark mode in the default theme! Uses prefers-color-scheme under-the-hood - see the sketch after this list
  • 🌈 Added theme gallery! Read more here
  • Vastly improved search engine performance, with new advanced query syntax (with even more syntax along the way)
  • 🚁 Accessibility improvements - if you're a screen-reader or accessibility tool user, I want to hear from you if you think anything (big or small!) could be improved!
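The dark mode technique boils down to something like this minimal sketch (illustrative colours and variable names - not the default theme's actual CSS):

:root {
    --background: #ffffff;
    --text-colour: #232323;
}
/* Applied automatically when the user's OS / browser asks for dark mode */
@media (prefers-color-scheme: dark) {
    :root {
        --background: #232323;
        --text-colour: #f5f5f5;
    }
}
body {
    background: var(--background);
    color: var(--text-colour);
}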

Personally, I'm most proud of the optimisations to the search engine. I've actually blogged about how I did it in a 3 part series and tested it on a test wiki with ~5.9M words - while search times vary depending on your input (the new -exclude syntax will actually speed up queries) and your server hardware, a single word query for ~5.0M word wikis takes ~50ms O.o

Unfortunately, this does mean that the search index will need to be rebuilt under the new format - and will be slightly larger than before. To get a progress bar for this operation, go to the master settings and click the rebuild button.

Another notable change is the new 'mega-menu' style 'more' menu:

(Screenshot: the new mega-menu style 'more' menu.)

That menu has been bothering me for a while, and thanks to the kind people on Reddit, I've now got a solution.

Note that you'll need to delete nav_links_extra from your peppermint.json in order for it to take effect.

Please also test the theme gallery in particular. It's brand-new in this release and quite complicated under-the-hood, so I'd appreciate some extra eyes on that.

As for when I'll release v1.0, I'm not sure. As a program, Pepperminty Wiki is certainly stable enough to be used in production today - so perhaps bumping the version number to v1.0 would be a good way to reflect that. At the same time though, a number of features are still missing - most notably watchlists and further improvements to the page history system - so it may be a while before I'm confident enough to make the jump.

Either way, I'm pretty sure that I'll keep working on Pepperminty Wiki for years to come - I have no plans to cease development at this time. While Pepperminty Wiki releases don't move at the most rapid of paces, I aim to get about 2 releases out per year, roughly 6 months apart.

Special thanks to @SeanFromIT for reporting a number of bugs which have been squashed.

If you use Pepperminty Wiki, tweet me @SBRLabs! I'd love to hear about how you're using it.

Lastly, don't forget to take a backup of your wiki before updating. While I've made every effort to squash bugs, you can never be too careful :P

Check out v0.20 here:

Pepperminty Wiki v0.20
