Starbeamrainbowlabs

Stardust
Blog


Archive


Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing docker documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro minetest network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases rendering resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

Tensorflow / Tensorflow.js in Review

For my PhD, I've been using both Tensorflow.js (Tensorflow for Javascript) and more recently Tensorflow for Python (including the bundled Keras) extensively for implementing multiple different models. Given the experiences I've had so far, I thought it was high time I put my thoughts to paper so to speak and write a blog post reviewing the 2 frameworks.

Tensorflow logo

Tensorflow for Python

Let's start with Tensorflow for Python. I haven't been using it as long as Tensorflow.js, but as far as I can tell they've done a great job of ensuring it comes with batteries included. It has layers that come in an enormous number of different flavours for doing everything you can possibly imagine - including building Transformers (though I ended up implementing the time signal encoding in my own custom layer).

Building custom layers is not particularly difficult either - though you do have to hunt around a bit for the correct documentation, and I haven't yet worked out the all the bugs with loading model checkpoints that use custom layers back in again.

Handling data as a generic "tensor" that contains an n-dimension slab of data is - once you get used to it - a great way of working. It's not something I would recommend to the beginner however - rather I would recommend checking out Brain.js. It's easier to setup, and also more transparent / easier to understand what's going on.

Data preprocessing however is where things start to get complicated. Despite a good set of API reference docs to refer to, it's not clear how one is supposed to implement a performant data preprocessing pipeline. There are multiple methods for doing this (tf.data.Dataset, tf.utils.Sequence, and others), and I have as of yet been unable to find a definitive guide on the subject.

Other small inconsistencies are also present, such as both the Keras website and the Tensorflow API docs both documenting the Keras API, which in and of itself appears to be an abstraction of the Tensorflow API.... it gets confusing. Some love for the docs more generally is also needed, as I found some the wording in places ambiguous as to what it meant - so I ended up guessing and having to work it out by experimentation.

By far the biggest issue I encountered though (aside from the data preprocessing pipeline, which is really confusing and frustrating) is that a highly specific version of CUDA is required for each version of Tensorflow. Thankfully, there's a table of CUDA / CuDNN versions to help you out, but it's still pretty annoying that you have to have a specific version. Blender manages to be CUDA enabled while supporting enough different versions of CUDA that I haven't had an issue on stock Ubuntu with the propriety Nvidia drivers and the system CUDA version, so whhy can't Tensorflow do it too?

Tensorflow.js

This brings me on to Tensorflow.js, the Javascript bindings for libtensorflow (the underlying C++ library). This also has the specific version of CUDA issue, but in addition the version requirement documented in the README is often wrong, leaving you to make random wild guesses as to which version is required!

Despite this flaw, Tensorflow.js fixes a number of inconsistencies in Tensorflow for Python - I suspect that it was written after Tensorflow for Python was first implemented. The developers have effectively learnt valuable lessons from the Python version of Tensorflow, which has resulted in a coherent and cohesive API that makes a much more sense than the API in Python. A great example of this is tf.Dataset: The data preprocessing pipeline in Tensorflow.js is well designed and easy to use. The Python version could learn a lot from this.

While Tensorflow.js doesn't have quite the same set of features (a number of prebuilt layers that exist in Python don't yet existing in Tensorflow.js), it still provides a reasonable set of features that satisfy most use-cases. I have noticed a few annoying inconsistencies in the loss functions though and how they behave - for example I implemented an autoencoder in Tensorflow.js, but it only returned black and white pixels - whereas Tensorflow for Python returned greyscale as intended.

Aside from improving CUDA support and adding more prebuilt layers, the other thing that Tensorflow.js struggles with is documentation. It has comparatively few guides with respect to Tensorflow for Python, and it also has an ambiguity problem in the API docs. This could be mostly resolved by being slightly more generous with explanations as to what things are and do. Adding some examples to the docs in question would also help - as would also fixing the bug where it does not highlight the item you're currently viewing in the navigation page (DevDocs support would be awesome too).

Conclusion

Tensorflow for Python and Tensorflow.js are feature-filled and performant frameworks for machine learning and processing large datasets with GPU acceleration. Many tutorials are provided to help newcomers to the frameworks, but once you've followed a tutorial or 2 you're left very much to be on your own. A number of caveats and difficulties such as CUDA versions and confusing APIs / docs make mastering the framework difficult.

applause-cli: A Node.js CLI handling library

Continuing in the theme of things I've forgotten to talk about, I'd like to post about another package I've released a little while ago. I've been building a number of command line interfaces for my PhD, so I thought it would be best to use a library for this function.

I found [clap](), but it didn't quite do what I wanted - so I wrote my own inspired by it. Soon enough I needed to use the code in several different projects, so I abstracted the logic for it out and called it applause-cli, which you can now find on npm.

It has no dependencies, and it allows you do define a set of arguments and have it parsed out the values from a given input array of items automatically. Here's an example of how it works:

import Program from 'applause-cli';

let program = new Program("path/to/package.json");
program.argument("food", "Specifies the food to find.", "apple")
    .argument("count", "The number of items to find", 1, "number");

program.parse(process.argv.slice(2)); // Might return { food: "banana", count: 6 }

I even have automated documentation generated with documentation and uploaded to my website via Continuous Integration: https://starbeamrainbowlabs.com/code/applause-cli/. I've worked pretty hard on the documentation for this library actually - it even has integrated examples to show you how to use each function!

The library can also automatically generate help output from the provided information when the --help argument is detected too - though I have yet to improve the output if a subcommand is called (e.g. mycommand dostuff --help) - this is on my todo list :-)

Here's an example of the help text it automatically generates:

If this looks like something you'd be interested in using, I recommend checking out the npm package here: https://www.npmjs.com/package/applause-cli

For the curious, applause-cli is open-source under the MPL-2.0 licence. Find the code here: https://github.com/sbrl/applause-cli.

Rendering Time plan / Gantt charts: hourgraph

I have a number of tools and other programs I've implemented, but forgotten to blog about here - hourgraph is one such tool I stumbled across today again. Originally I implemented it for my PhD panel 1 topic project analysis report, as I realised that not only have I manually created a number of these, but I'm going to have to create a bunch more in the future, but I open-sourced it as I usually do with most of the things I write in the hopes that someone else will find it useful.

I've published it on NPM, so you can install it like this:

npm install --global hourgraph

You'll need Node.js installed, and Linux users will need to prefix the above with sudo.

The program takes in a TOML definition file. Here's an example:

width = 1500
height = 480
title = "Apples"

[[task]]
name = "Pick apples"
start = 0
duration = 3

[[task]]
name = "Make apple juice"
start = 2
duration = 2

[[task]]
name = "Enjoy!"
start = 4
duration = 4
colour = "hsl(46, 90%, 60%)"
ghost_colour = "hsla(46, 90%, 60%, 0.1)"

The full set of options are available in the default config file, which is loaded in to fill in any gaps of things you haven't specified in your custom file.

Comprehensive usage instructions are found in the README, but you can render a new time plan chart thingy like this:

hourgraph --input path/to/input.toml --output path/to/output.toml

The above renders to this:

Hourgraph output

Personally, I find it's much easier to create charts like this by defining them in a simple text file that is then rendered into the actual thing. That way, I don't have to fiddle with the layout myself - it all comes out in the wash automatically.

For those interested in the code, it can be found here: https://github.com/sbrl/hourgraph

Skyliner: Automated text document outlining

When editing large documents, it's often helpful to have a hierarchical "navigation view" of sorts. For a text document like the Markdown that I'm typing now for this blog post, it would consist of a list of headings in the document. For a Javascript or C♯ file, it might consist of classes, functions, and methods.

Either way, it's a helpful thing to have - but for some ridiculous reason as far as I know a generic text document outlining tool doesn't exist.

While GitHub's Atom (my code editor of choice) has an atom-ide has an outline view that works great, it only supports a limited number of languages (you have to have a plugin installed for every language), and sometimes it hangs and takes like 30 seconds plus to generate the outline.

To this end, I intend to remedy the situation with a new library I'm writing call Skyliner.

It's based on finite-state automata and regular expressions (also, Wikipedia and regexper), and it streams the input and generates an outline line-by-line. It provides both a command-line interface and a Javascript API (though the command-line interface currently consumes all input before generating any output, but the library is capable of streaming objects as it consumes the input). As of the time of typing, it supports the following languages:

  • clike (e.g. c, c++, header files)
  • csharp
  • go
  • ini
  • javascript
  • json
  • lua
  • markdown
  • php
  • rust
  • sh (including bash)
  • toml
  • xml

While using regular expressions means that the output won't be perfect (especially in the case of XML/HTML), it does mean that adding support for a new language is as simple as defining a new finite-state automaton in a Javascript object. Adding support for Lua was a 10-15 minute job - including automated tests! Support for more languages is definitely on the way.

The combination of line-by-line parsing, regular expressions, and finite-state automata also means that it's much faster than Atom IDE, because it doesn't have to parse the entire document into an abstract syntax tree before generating the outline.

My goal here is to have good support as many languages as possible, rather than amazing support for just a small handful of languages. This isn't to say that I won't fix issues with existing languages as they come up, but the focus is really on ensuring that whenever I open a text document in Atom.

Once I've added support for a bunch of languages, written some documentation, and published it on npm, I'm going to try my hand it implementing my first plugin for GitHub's Atom. Apparently plugins for Atom aren't too difficult to port to Visual Studio Code, so maybe I'll take a look at doing that too (but I don't use Visual Studio Code much, so not a priority).

The code is open-source already (link below) and completely usable, but since it's an ongoing process you can expect another post on here in the future about my progress.

Skyliner on GitHub: https://github.com/sbrl/skyliner

If you'd be interested in some more detail about how it works, I'll be writing some documentation soon, which will appear in the above Git repository.

PhD Aside: Reading a file descriptor line-by-line from multiple Node.js processes

Phew, that's a bit of a mouthful. We're taking a short break from the cluster series of posts (though those will be back next week I hope), because I've just run into a fascinating problem, the solution to which I thought I'd share here - since I didn't find a solution elsewhere on the web.

For my PhD, I've got a big old lump of data, and it all needs preprocessing before I train an AI model (or a variant thereof, since I'm effectively doing video-to-image translation). Unfortunately, one of the preprocessing steps is really slow. And because I'll naturally be training my AI for multiple epochs, the problem is multiplied.....

The solution, of course, is to do all the preprocessing up front such that I can just read the data in and push it directly into a Tensor in the right format. However, doing this on such a large dataset would take forever if I did the items 1 by 1. The thing is that Javascript isn't inherently multithreaded. I like this quote, as it describes the situation rather well:

In Javascript everything runs in parallel... except your code

--Felix Geisendörfer

In other words, when Node.js is reading or writing to and from the network, disk, or other places it can do lots of things at the same time because it does them asynchronously. The Javascript that gets executed though is only done on a single thread though.

This is great for io-bound tasks (such as a web server), as Node.js (a Javascript runtime) can handle many requests at the same time. On a side note, this is also the reason why Nginx is more efficient than Apache (because Nginx is event based too like Javascript, unlike Apache which is thread based).

It's not so great though for CPU bound tasks, such as the one I've got on my hands. All is not lost though, because Node.js has a number of useful functions inbuilt that we can use to tackle the issue.

Firstly, Node.js has a clever forking system. By using child_process.fork(), a single Node.js process can create multiple copies of itself to act as workers:

// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i < os.cpus().length; i++) {
    workers.push(
        child_process.fork("worker.mjs")
    );
}
// worker.js
console.log(`Hello, world from a child process!`);

Very useful! The next much more sticky problem though is how to actually preprocess the data in a performant manner. In my specific case, I'm piping the data in from a shell script that decompresses a number of gzip archives in a specific order (as of the time of typing I have yet to implement this).

Because this is a single pipe we're talking about here, the question now arises of how to allow all the child processes to access the data that's coming in from the standard input of the master process.

I've actually encountered an issue like this one before. I initially tried reading it in on the master process, and then using worker.send(message) to send it to the worker processes for processing. This didn't end up working very well, because the master process became a bottleneck as it couldn't read from the standard input and send stuff to the workers fast enough.

With this in mind, I came up with a new plan. In Node.js, when you're forking to create a worker process, you can supply it with some custom file descriptors upon initialisation. So long as it has at least IPC (inter-process communication) channel for passing messages back and forth with the .send() and .on("message", (message) => ....) method and listeners, it doesn't actually care what you do with the others.

Cue file descriptor cloning:


// main.js
import child_process from 'child_process';
import os from 'os';

let workers = [];

for(let i = 0; i 

I've highlighted the key line here (line 10 for those who can't see it). Here we tell it to clone file descriptors 0, 1, and 2 - which refer to stdin, stdout, and stderr respectively. This allows the worker processes direct access to the master process' stdin, stdout, and stderr.

With this, we can read from the same pipe with as many worker processes as we like - so long as they do so 1 at a time.

With this sorted, it gives rise to the next issue: reading line-by-line. Packages exist on npm (such as nexline, my personal favourite) to read from a stream line-by-line, but they have the unfortunate side-effect of maintaining a read buffer. While this is great for performance, it's not so great in my situation because it ends up scrambling the input! This is because said read buffer would be local to each worker process, so when the next worker along reads, it will skip a random number of bytes and start reading from the next bit along.

This means that I need to implement a custom method that reads a single line from a given file descriptor without maintaining a read buffer. I came up with this:

import fs from 'fs';

//  .....

// Global buffer to avoid unnecessary memory churn
let buffer = Buffer.alloc(4096);
function read_line_unbuffered(fd) {
    let i = 0;
    while(true) {
        let bytes_read = fs.readSync(fd, buffer, i, 1);
        if(bytes_read !== 1 || buffer[i] == 0x0A) {
            if(i == 0 && bytes_read == null) return null;
            return buffer.toString("utf-8", 0, i); // This is not inclusive, so we can abuse it to trim the \n off the end
        }

        i++;
        if(i == buffer.length) {
            let new_buffer = new Buffer(Math.ceil(buffer.length * 1.5));
            buffer.copy(new_buffer);
            buffer = new_buffer;
        }
    }
}

I read from the given file descriptor character by character directly into a buffer. As soon as it detects a new line character (\n, or character code 0x0A), it returns the new line. If we run out of space in the buffer, then we create a new larger one, copy the old buffer's contents into it, and keep going.

I maintain a global buffer here, because this helps to avoid unnecessary memory churn. In my case, the lines I'm reading in a rather long (hence the need to clone the file descriptor in the first place), and if I didn't keep a shared buffer I'd be allocating and deallocating a new pretty large buffer every time.

This also has the nice side-effect that we keep the largest buffer we've had to use so far around for next time, avoiding the need for subsequent copies to larger and larger buffers.

Finally, we can also guarantee that it won't be a problem if we call this multiple times, because as I explained above Javascript is single-threaded, so if we call the function multiple times in quick succession each read will happen 1 after another.

With this chain of Node.js features, we can read a large amount of data from and efficiently process the content of a pipe. The trick from here is to implement a proper messaging and locking system to avoid reading from the stream at the same time, and avoid write to the standard output at the same time.

Taking this further, I ended up with this:

(Licence: Mozilla Public Licence 2.0)

This correctly ensures that only 1 worker process reads from the stream at the same time. It doesn't do anything with the result though except log a message to the console, but when I implement that I'll implement a similar messaging system to ensure that only 1 process writes to the output at once.

On that note, my data is also ordered, so I'll have to implement a complicated cache system // ordering system to ensure that I write them to the standard output in the same order I read them in. When I do implement that, I'll probably blog about that too....

The main problem I still have with this solution is that I'm reading from the input stream. I haven't done any proper testing, but I'm pretty sure that doing so will be really slow. I not sure I can avoid this though and read a few KiBs at a time, because I don't currently know of any way to put the extra characters back into the input stream.

If anyone has a solution to that that increases performance, I'd love to know. Leave a comment below!

The legend of the disappearing data in Node.js

Happy leap day! :D

A green tree frog :D

_(Above: A nice green tree frog - source)_

Recently, I've been doing a bunch of work in Node.js streaming large amounts of data. For the most part the experience has been highly pleasurable, as Node.js makes it so easy! I have encountered a few pain points though, the most significant of which I'd like to talk about here.

In Node.js, streams come in 3 main forms:

  • Readable Streams
  • Writable Streams
  • Transform Streams

In addition, you can either plug streams together with the .pipe() method, write to them directly with the .write() method, or any combination thereof - allowing you to build up a chain of streams that enables data to flow through your program.

The problems start when you try and write large amounts of data to a stream directly:

import fs from 'fs';

import do_work from 'somewhere';
import get_some_stream from 'somewhere_else';

let stream_in = get_some_stream();
let out = fs.createWriteStream("/tmp/test.txt");
for(let i = 0; i < 1000000; i++) {
    out.write(do_work(stream_in, i))
}

(Above: Just an example of writing lots of data to a stream)

When this happens, you start to lose random chunks of data. The reason for this is not obvious, but it is buried in the Node.js docs:

The writable.write() method writes some data to the stream, and calls the supplied callback once the data has been fully handled. If an error occurs, the callback may or may not be called with the error as its first argument. To reliably detect write errors, add a listener for the 'error' event.

This is a huge pain. It means that you have to wrap all write calls like this:

"use strict";

/**
 * Writes data to a stream, automatically waiting for the drain event if asked.
 * @param   {stream.Writable}           stream_out  The writable stream to write to.
 * @param   {string|Buffer|Uint8Array}  data        The data to write.
 * @return  {Promise}   A promise that resolves when writing is complete.
 */
function write_safe(stream_out, data) {
    return new Promise((resolve, reject) => {
        // Handle errors
        let handler_error = (error) => {
            stream_out.off("error", handler_error);
            reject(error);
        };
        stream_out.on("error", handler_error);

        if(stream_out.write(data)) {
            // We're good to go
            stream_out.off("error", handler_error);
            resolve();
        }
        else {
            // We need to wait for the drain event before continuing
            stream_out.once("drain", () => {
                stream_out.off("error", handler_error);
                resolve();
            });
        }
    });
}

export { write_safe };

Such a huge boilerplate for such a simple task! Basically, if the .write() method returns false, you have to wait until the drain event is fired on the writeable stream before continuing to write to the stream. The reason for this I think is that it signals that the write buffer is full, and it needs to be drained before writing can continue.

This is ok, but it would be nice if this was abstracted away behind a single method, such as the wrapper I've shown above. Something like a async stream.Writable.writeAsync() would be great, but it doesn't currently exist.

I think I'm going to open an issue about it - since it seems very doable and just silly that it doesn't exist already.

Summer Project Series List

At this point, it is basically the end of my summer project series - at least for a while while I start my PhD (more on that in a future post). To this end, I'm releasing the series list for it.

Next Gen Search, Part 2: Pushing the limits

In the last part, we looked at how I built a new backend for storing inverted indexes for Pepperminty Wiki, which allows for partial index deserialisation and other nice features that boost performance considerably.

Since the last post, I've completed work on the new search system - though there are a few bits around the edges that I still want to touch up and do some more work on.

In this post though, I want to talk about how I generated test data to give my full-text search engine something to chew on. I've done this before for my markov chain program I wrote a while back, and that was so much fun I did it again for my search engine here.

After scratching my head for a bit to think of a data source I could use, I came up with the perfect plan. Ages ago I downloaded a Wikipedia dump - just the content pages in Wikitext markup. Why not use that?

As it turns out, it was a rather good idea. Some processing of said dump was required though to transform it into a format that Pepperminty Wiki can understand, though. Pepperminty Wiki stores pages on disk as flat text files in markdown, and indexes them in pageindex.json. If pageindex.json doesn't exist, then Pepperminty Wiki rebuilds it automagically by looking for content pages on disk.

This makes it easy to import batches of new pages into Pepperminty Wiki, so all we need to do is extract the wiki text, convert to markdown, and import! This ended up requiring a number of different separate steps though, so let's take it 1 at a time

First, we need a Wikipedia database dump in the XML format. These are available from dumps.wikimedia.org. There are many different ones available, but I suggest grabbing one that has a filename similar to enwiki-20180201-pages-articles.xml - i.e. just content pages - no revision history, user pages, or additional extras. I think the most recent one as of the time of posting is downloadable here - though I'd warn you that it's 15.3GiB in size! You can see a list of available dump dates for the English Wikipedia here.

Now that we've got our dump, let's extract the pages from it. This is nice and easy to do with wikiextractor on GitHub:

nice -n20 wikiextractor enwiki-20180201-pages-articles.xml --no_templates --html --keep_tables --lists --links --sections --json --output wikipages --compress --bytes 25M >progress.log 2>&1 

This will parse the dump and output a number of compressed files to the wikipages directory. These will have 1 JSON object per line, each containing information about a single page on Wikipedia - with page content pre-converted to HTML for us. The next step is to extract the page content and save it to a file with the correct name. This ended up being somewhat complicated, so I wrote a quick Node.js script to do the job:

#!/usr/bin/env node

const readline = require("readline");
const fs = require("fs");


if(!fs.existsSync("pages"))
        fs.mkdirSync("pages", { mode: 0o755 });

// From https://stackoverflow.com/a/44195856/1460422
function html_unentities(encodedString) {
        var translate_re = /&(nbsp|amp|quot|lt|gt);/g;
        var translate = {
                "nbsp":" ",
                "amp" : "&",
                "quot": "\"",
                "lt"  : "<",
                "gt"  : ">"
        };
        return encodedString.replace(translate_re, function(match, entity) {
                return translate[entity];
        }).replace(/&#(\d+);/gi, function(match, numStr) {
                var num = parseInt(numStr, 10);
                return String.fromCharCode(num);
        });
}

const interface = readline.createInterface({
        input: process.stdin,
        //output: process.stdout
});

interface.on("line", (text) => {
        const obj = JSON.parse(text);

        fs.writeFileSync(`pages/${obj.title.replace(/\//, "-")}.html`, html_unentities(obj.text));
        console.log(`${obj.id}\t${obj.title}`);
});

This basically takes the stream of JSON object on the standard input, parses them, and saves the relevant content to disk. We can invoke it like so:

bzcat path/to/*.bz2 | ./parse.js

Don't forget to chmod +x parse.js if you get an error here. The other important thing about the above script ist hat we have to unescape the HTML entities (e.g. &gt;), because otherwise we'll have issues later with HTML conversation and page names will look odd. This is done by the html_unentities() function in the above script.

This should result in a directory containing a large number of files - 1 file per content page. This is much better, but we're still not quite there yet. Wikipedia uses wiki markup (which we converted to HTML with wikiextractor) and Pepperminty Wiki uses Markdown - the 2 of which are, despite all their similarities, inherently incompatible. Thankfully, pandoc is capable of converting from HTML to markdown.

Pandoc is great at this kind of thing - it uses an intermediate representation and allows you to convert almost any type of textual document format to any other format. Markdown to PDF, EPUB to plain text, ..... and HTML to markdown (just to name a few). It actually looks like it shares a number of features with traditional compilers like GCC.

Anyway, let's use it to convert our folder full of wikitext files to a folder full of markdown:

mkdir -p pages_md;
find pages/ -type f -name "*.html" -print0 | nice -n20 xargs -P4 -0 -n1 -I{} sh -c 'filename="{}"; title="${filename##*/}"; title="${title%.*}"; pandoc --from "html"  --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md"; echo "${title}";';

_(See this on explainshell.com - doesn't include the nice -n20 due to a bug on their end)_

This looks complicated, but it really isn't. Let's break it down a bit:

find pages/ -type f -name "*.html" -print0

This finds all the HTML files that we want to convert to Markdown, and delimits the output with a NUL byte - i.e. 0x0`. This makes the next step easier:

... | nice -n20 xargs -P4 -0 -n1 -I{} sh -c '....'

This pushes a new xargs instance into the background, which will execute 4 commands at a time. xargs executes a command for each line of input it receives. In our case, we're delimiting with NUL 0x0 instead though. We explicitly specify that we can 1 command per line of input though, as xargs tries to optimise and do command file1 file2 file3 instead.

The sh -c bit is starting a subshell, in which we execute a small wrapper script that then calls pandoc. This is of course inefficient, but I couldn't find any way around spawning a subshell in this instance.

filename="{}";
title="${filename##*/}";
title="${title%.*}";
pandoc --from "html" --to "markdown+backtick_code_blocks+pipe_tables+strikeout" "${filename}" -o "pages_md/${title}.md";
echo "${title}";

I've broken the sh -c subshell script down into multiple liens for readability. Simply put, it extracts the page title from the filename, converts the HTML to Markdown, and saves it to a new file in a different directory with the .md replacing the original .html extension.

When you put all these components together, you get a script that converts a folder full of HTML files to Markdown. Just like with the markov chains extraction I mentioned at the beginning of this post, Bash and shell scripting really is all about lego bricks. This is due in part to the Unix philosophy:

Make each program do one thing well.

There is more to it, but this is the most important point to remember. Many of the core utilities you'll find on the terminal follow this way of thinking.

There's 1 last thing we need to take care of before we have them in the right format though - we need to convert the [display text](page name) markdown-format links back into the Wikipedia [[internal link]] format that Pepperminty Wiki also uses.

Thankfully, another command-line tool I know of called repren is well-suited to this:

repren --from '\[([^\]]+)\]\(([^):]+)\)' --to '[[\1]]' pages_md/*.md

It took some fiddling, but I got all the escaping figured out and the above converts back into the [[internal link]] format well enough.

Now that we've got our folder full of markdown files, we need to extract a random portion of them to act as a test for Pepperminty Wiki - as the whole lot might be a bit much for it to handle (though if Pepperminty Wiki was capable of handling it all eventually that'd be awesome :D). Let's try 500 pages to start:

find path/to/wikipages/ -type f -name "*.md" -print0 | shuf --zero-terminated | head -n500 --zero-terminated | xargs -0 -n1 -I{} cp "{}" .

(See this on explainshell.com)

This is another lego-brick style command. Let's break it down too:

find path/to/wikipages/ -type f -name "*.md" -print0

This lists all the .md files in a directory, delimiting them with a NUL character, as before. It's better to do this than use ls, as find is explicitly designed to be machine-readable.

.... | shuf --zero-terminated

The shuf command randomly shuffles the input lines. In this case, we're telling it that the input is delimited by the NUL byte.

.... | head -n500 --zero-terminated

Similar deal here. head takes the top N lines of input, and discards the rest.

.... | xargs -0 -n1 -I{} cp "{}" .

Finally, xargs calls cp to copy the selected files to the current directory - which is, in this case, the root directory of my test Pepperminty Wiki instance.

Since I'm curious, let's now find out roughly how many words we're dealing with here:

cat data_test/*.md | wc --words
1593190

1.5 million words! That's a lot. I wonder how quickly we can search that?

A screenshot of the Pepperminty Wiki search results on the test wiki for the word food, showing the new dark theme coming soon!

24.8ms? Awesome! That's so much better than before. If you're wondering about new coat of paint in the screenshot - Pepperminty Wiki is getting dark theme, thanks to prefers-color-scheme :D

I wonder what happens if we push it to 2K pages?

Another screenshot, the same as before

This time we get ~120ms for 5.9M total words - wow! I wasn't expecting it to perform so well. At this scale, rebuilding the entire index is particularly costly - so if I was to push it even further it would make sense to implement an incremental approach that spreads the work over multiple requests, assuming I can't squeeze any more performance out the system as-is.

The last thing I want to do here is make a rough estimate about time time complexity the search system as-is, given the data we have so far. This isn't particularly difficult to do.

Given the results above, we can calculate that at 1.5M total words, an increase of ~60K total words results in an increase of 1ms of execution time. At 5.9M words, it's only ~49K words / ms of execution time - a drop of ~11K words / ms of execution time.

From this, we can speculate that for every million words total added to a wiki, we can expect a ~2.5K words / ms of execution time drop - not bad! We'd need more data points to make any reasonable guess as to the Big-O complexity function that it conforms to. My guess would be something like $O(xN^2)$, where x is a constant between ~0.2 and 2.

Maybe at some point I'll go to the trouble of running enough tests to calculate it, but with all the variables that affect the execution time (number of pages, distribution of words across pages, etc.), I'm not in any hurry to calculate it. If you'd like to do so, go ahead and comment below!

Next time, I'll unveil the inner working of the STAS: my new search-term analysis system.

Found this interesting? Got your own story about some cool code you've written to tell? Comment below!

Powahroot: Client and Server-side routing in Javascript

The powahroot logo, which is a 16x16 pixel-art image and looks like a purple-red carrot with bright orange stripes and yellow light lines coming out of the sides

If I want to really understand something, I usually end up implementing it myself. This is the case with my latest library - powahroot, but also because I didn't really like the way any of the alternatives functioned because I'm picky.

Originally I wrote it for this project (although it's actually for a little satellite project that isn't open-source unfortunately - maybe at some point in the future!) - but I liked it so much that I decided that I had to turn it into a full library that I could share here.

In short, a routing framework helps you get requests handled in the right places in your application. I've actually blogged about this before, so I'd recommend you go and read that post first before continuing with this one.

For all the similarities between the server side (as mentioned in my earlier post) and the client side, the 2 environments are different enough that they warrant having 2 distinctly separate routers. In powahroot, I provide both a ServerRouter and a ClientRouter.

The ServerRouter is designed to handle Node.js HTTP request and response objects. It provides shortcut methods .get(), .post(), and others to quickly create routes for different request types - and also supports middleware to enable logical separation of authentication, request processing, and response generation.

The ClientRouter, on the other hand, is essentially a stripped-down version of the ServerRouter that's tailored to functioning in a browser environment. It doesn't support middleware (yet?), but it does support the pushstate that's part of the History API.

I've also published it on npm, so you can install it like this:

npm install --save powahroot

Then you can use it like this:

# On the server
import ServerRouter from 'powahroot/Server.mjs';

// ....

const router = new ServerRouter();
router.on_all(async (context, next) => { console.debug(context.url); await next()})
router.get("/files/::filepath", (context, _next) => context.send.plain(200, `You requested ${context.params.filepath}`));
// .....
# On the client
import ClientRouter from 'powahroot/Client.mjs';

// ....

const router = new ClientRouter({
    // Options object. Default settings:
    verbose: false, // Whether to be verbose in console.log() messages
    listen_pushstate: true, // Whether to react to browser pushstate events (excluding those generated by powahroot itself, because that would cause an infinite loop :P)
});

As you can see, powahroot uses ES6 Modules, which makes it easy to split up your code into separate independently-operating sections.

In addition, I've also generated some documentation with the documentation tool on npm. It details the API available to you, and should serve as a good reference when using the library.

You can find that here: https://starbeamrainbowlabs.com/code/powahroot/docs/

It's automatically updated via continuous integration and continuous deployment, which I really do need to get around to blogging about (I've spent a significant amount of time setting up the base system upon which powahroot's CI and CD works. In short I use Laminar CI and a GitHub Webhook, but there's a lot of complicated details).

Found this interesting? Used it in your own project? Got an idea to improve powahroot? Comment below!

Easy Ansi Escape Code in C♯ and Javascript

I've been really rather ill over the weekend, so I wasn't able to write a post when I wanted to. I'm starting to recover now though, so I thought I'd write a quick post about a class I've written in C♯ (and later ported to Javascript via an ES6 Module) that makes using VT100 Ansi escape codes easy.

Such codes are really useful for changing the text colour & background in a terminal or console, and for moving the cursor around to draw fancy UI elements with just text.

I initially wrote it a few months ago as part of an ACW (Assessed CourseWork), but out of an abundance of caution I've held back on open-sourcing this particular code fragment that drove the code I submitted for that assessment.

Now that the assessment is done & marked though (and I've used this class in a few projects since), I feel comfortable with publishing the code I've written publically. Here it is in C♯:

(Can't see the above? Try this direct link instead)

In short, it's a static class that contains a bunch of Ansi escape codes that you can include in strings that you send out via Console.WriteLine() - no special fiddling required! Though, if you're on Windows, don't forget you need to enable escape sequences first. Here's an example that uses the C♯ class:

// ....
catch(Exception error) {
    Console.WriteLine($"{Ansi.FRed}Error: Oops! Something went wrong.");
    Console.WriteLine($"    {error.Message}{Ansi.Reset}");
    return;
}
// ....

Of course, you can get as elaborate as you like. In addition, if you need to disable all escape sequence output for some reason (e.g. you know you're writing to a log file), simply set Ansi.Enabled to false.

Some time later, I found myself needing this class in Javascript (for a Node.js application I'm pretty sure) quite badly (reason - I might blog about it soon :P). To that end, I wound up porting it from C♯ to Javascript. The process wasn't actually that painful - probably because it's a fairly small and simple class.

With porting complete, I intend to keep both versions in sync with each other as I add more features - but no promises :P

Here's the Javascript version:

(Can't see the above? Try this direct link instead)

The Javascript version isn't actually a static class as like the C♯ version, due to the way ES6 modules work. In the future, I'd like to do some science and properly figure out the ins and outs of ES6 modules and refactor this into the JS equivalent of a globally static class (or a Singleton, though in JS we don't have to worry about thread safety 'cause the only way to communicate between threads (also) in JS is through messaging).

Still, it works well enough for my purposes for now.

Found this interesting? Got an improvement? Just want to say hi? Comment below!

Art by Mythdael