Starbeamrainbowlabs

Stardust
Blog


Archive

Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio bash batch blog bookmarklet booting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet io.js jabber javascript js bin labs learning library linux low level lua maintenance manjaro network networking nibriboard node.js operating systems performance photos php pixelbot portable privacy problem solving programming problems projects prolog protocol protocols pseudo 3d python reddit redis reference release releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg technical terminal textures three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

Search Engine Optimisation: The curious question of efficiency

For one reason or another I found myself a few days ago inspecting the code behind Pepperminty Wiki's full-text search engine. What I found was interesting enough that I thought I'd blog about it.

Forget about that kind of Search Engine Optimisation (the horrible click-baity kind - if there's enough interest I'll blog about my thoughts there too) and cue the appropriate music - we're going on a field trip fraught with the perils of Unicode, page ids, transliteration, and more!

Firstly, I should probably mention a little about the starting point. The (personal) wiki that is exhibiting the slowness has 546 ~75K words spread across 546 pages. Pepperminty Wiki manages to search all of this in about 2.8 seconds by way of an inverted index. If you haven't read my last post, you should do so now - it sets the stage for this one - and you'll be rather confused otherwise.

2.8 seconds is far too slow though. Let's do something about it! In order to do something about it, there are several other things that need explaining before I can show you what I did to optimise it. Let's look at Pepperminty Wiki's search system first. It's best explained with the aid of a diagram:

An SVG diagram explaining how Pepperminty Wiki's search system works. A textual explanation is given below.

In short, every page has a numerical id, which is tracked by the ids core class. The search system interacts with this during the indexing phase (that's a topic for another blog post) and during the query phase. The query phase works something like this:

  1. The inverted index is loaded from disk in my personal wiki the inverted index is ~968k, and loads in ~128ms)
  2. The inverted index is queried for pages in that match the tokenised query terms.
  3. The results returned from the query are ranked and sorted before being returned.
  4. A context is extracted from the source of each page in the results returned - just like Duck Duck Go or Google have a bit of text below the title of each result
  5. Said context has the search terms hightlighted

It sounds complicated, but it really isn't. The complicated bit comes when I tried to optimise it. To start with, I analysed how long each of the above steps were taking. The results were quite surprising:

  • Step #1 took ~128ms on average
  • Steps #2 & #3 took ~1200ms on average
  • Step #4 took ~1500ms on average(!)
  • Step #5 took a negligible amount of time

I did this by setting headers on the returned page. Timing things in PHP is relatively easy:

$start_time = microtime(true);

// Do work here

$end_time = microtime(true);

$time_taken_ms = round(($end_time - $start_time )*1000, 3);

This gave me a general idea as to what needed attention. I was surprised to learn that the context extractor was taking most of the time. At first, I thought that my weird and probably inefficient algorithm was to blame. There's no way it should be taking 1500ms!

So I set to work rewriting it to make it more optimal. Firstly, I tried something like this. Instead of multiple sub-loops, I figured out a way to do it with just 1 for loop and a few calls to mb_stripos_all().

Unfortunately, this did not have the desired effect. While it did shave about 50ms off the total time, it was far from what I'd hoped for. I tried refactoring it slightly again to use preg_match_all(), but it still didn't give me the speed boost I was after - only another 50ms or so.

To get some answers, I brought out the big guns and profiled it with XDebug.

Upon analysing the generated profile it immediately became clear what the issue was: transliteration. Transliteration is the process of removing the diacritics and other accents from a string to make it easier to compare with other strings. For example, café becomes Café. In PHP this process is a bit funky. Here's what I do in Pepperminty Wiki:

$literator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);

$literator->transliterate($string);

(Originally from this StackOverflow answer)

Note that this requires the intl PHP extension (which should be installed & enabled by default). This does several things:

  • Converts the text to lowercase
  • Normalises it to UTF-8 (See this article for more information)
  • Casts Cyrillic characters to their Latin alphabet (i.e. a-z) phonetic equivalent
  • Removes all diacritics

In short, it preprocesses a chunk of text so that it can be easily used by the search system. In my case, I transliterate search queries before tokenising them, source texts before indexing them, and crucially: source texts before extracting contextual information.

The thing about this wonderful transliteration process is that, at least in PHP, it's really slow. Thinking about it, the answer was obvious. Why bother extract offset information when the inverted index already contains that information?

The answer is: you don't upon refactoring the context extractor to utilise the inverted index, I managed to get it down to just ~59ms. Success!

Next up was the query system itself. 1200ms seems a bit high, so while I was at it, I analysed a profile of that as well. It turned out that a similar problem was occurring here too. Surprisingly, the page id system's getid($pagename) function was being really slow. I found 2 issues here.

Firstly, I was doing too much Unicode normalisation. In the page id system, I don't want to transliterate to remove diacritics, but I do want to make sure that all the diacritics and accents are represented in the same way.

If you didn't know, Unicode has a both a character for letters like é (e-acute), and a code-point for the acute accent itself, which gets merged into the previous letter during rendering. This can cause a page to acquire 2 (or even more!) seemingly identical ids in the system, which caused me a few headaches in the past! If you'd like to learn more, the article on Unicode normalisation I linked to above explains it in some detail. Thankfully, the solution is quite simple. Here's what Pepperminty Wiki does:

Normalizer::normalize($string, Normalizer::FORM_C)

This ensures that all accents and other weird characters are represented in the same way. As you might guess though, it's slow. I found that in the getid() function I was normalising both the page names I was iterating over in the index, as well as the target page name to find in every loop. The solution here was simple:

  • Don't normalise the page names from the index - that's the job of the assign() protected method to ensure that they are suitably normalised when adding them in the first place
  • Normalise the target page name only once, and then use that when spinning through the index.

Implementing these simple changes brought the overall search time down to 700ms. The other thing to note here is the structure of the index. I show it in the diagram, but here it is again:

  • 1: Booster
  • 2: Rocket
  • 3: Satellite

The index is basically a hash-table mapping numerical ids to their page names. This is great for when you have an id and want to know what the name of the page associated with it is, but terrible for when you want to go in the other direction, as we need to do when performing a query!

I haven't quite decided what to do about this. Obviously, the implications on efficiency are significant whenever we need to convert a page name into its respective numerical id. The problem lies in the fact that the search query system travels in both directions: It needs to convert page ids into page names when unravelling the results from the inverted index, but it also needs to convert page names into their respective ids when searching the titles and tags in the page index (the index that contains information about all the pages on a wiki - not pictured in the diagram above).

I have several options that I can see immediately:

  • Maintain 2 indexes: One going in each direction. This would also bring a minor improvement to indexing new and updating existing content in the inverted index.
  • Use some fancy footwork to refactor the search query system to unwind the page ids into their respective page names before we search the pages' titles and tags.

While deciding what to do, I did manage to reduce the number of times I convert a page name into its respective id by only performing the conversion if I find a match in the page metadata. This brought the average search time down to ~455ms, which is perfectly fine for my needs at the moment.

In the future, I may come back to this and optimise it further - but as it stands I'm getting to the point of diminishing returns: Where every additional optimisation requires twice the amount of time to implement as the last, and only provides a marginal gain in speed.

To this end, it doesn't seem worth it to spend ages tackling this issue now. Pepperminty Wiki is written in such a way that I can come back later and turn the inner workings of any part of the system upside-down, and it doesn't have any effect on the rest of the system (most of the time, anyway.... :P).

If you do find the search system too slow after these optimisations are released in v0.17, I'd like to hear about it! Please open an issue and I'll investigate further - I certainly haven't reached the end of this particular lollipop.

Found this interesting? Learnt something? Got a better way of doing it? Comment below!

How to prevent php-fpm from overriding your PHP-based caching logic

A while ago I implemented ETag support to the dynamic preview generator in Pepperminty Wiki. While I thought it worked during testing, for some reason on a private instance of Pepperminty Wiki I discovered recently that my browser didn't seen to be taking advantage of it. This got me curious, so I decided to do a little bit of digging to find out why.

It didn't take long to find the problem. For some reason, all the responses from the server had a Cache-Control: no-cache, no-store, must-revalidate header attached to them. How strange! Even more so that it was in capital letters - my convention in Pepperminty Wiki is to always make the headers lowercase.

I checked the codebase with via the Project Find feature of Atom just to make sure that I hadn't left in a debugging statement or anything (I hadn't), and then I turned my attention to Nginx (engine-x) - the web server that I use on my server. Maybe it had added a caching header?

A quick grep later revealed that it wasn't responsible either - which leaves just one part of the system unchecked - php-fpm, the PHP FastCGI server that sits just behind Nginx that's responsible for running the various PHP scripts I have that power this website and other corners of my server. Another quick grep returned a whole bunch of garbage, so after doing some research I discovered that php-fpm, by default, is configured to send this header - and that it has to be disabled by editing your php.ini (for me it's in /etc/php/7.1/fpm/php.ini), and changing

;session.cache_limiter = nocache

to be explicitly set to an empty string, like so:

session.cache_limiter = ''

This appears to have solved by problem for now - allowing me to regain control over how the resources I send back via PHP are cached. Hopefully this saves someone else the hassle of pulling their entire web server stack apart and putting it back together again the future :P

Found this helpful? Still having issues? Got a better way of solving the problem? Post a comment below!

Deterring spammers with a comment key system

I recently found myself reimplementing the comment key system I use on this blog (I posted about it here) for a another purpose. Being more experienced now, my new implemention (which I should really put into use on this blog actually) is stand-alone in a separate file - so I'm blogging about it here both to help out anyone who reads this other than myself - and for myself as I know I'll forget otherwise :P

The basic algorithm hasn't changed much since I first invented it: take the current timestamp, apply a bunch or arbitrary transformations to it, put it as a hidden field in the comment form, and then reverse the transformations on the comment key the user submits as part of the form to discover how long they had the page loaded for. Bots will have it loaded for either less than 10-ish seconds, or more than 24 hours. Humans will be somewhere in the middle - at least according to my observations!

Of course, any determined spammer can easily bypass this system if they spend even a little bit of time analysing the system - but I'm banking on the fact that my blog is too low-value for a spammer to bother reverse-engineering my system to figure out how it works.

This time, I chose to use simple XOR encryption, followed by reversing the string, followed by base64 encoding. It should be noted that XOR encryption is certainly not secure - but in this case it doesn't really matter. If my website becomes a high-enough value target for that to matter, I'll investigate proper AES encryption - which will probably be a separate post in and of itself, as a quick look revealed that it's rather involved - and will probably require quite a bit of research and experimentation working correctly.

Let's take a look at the key generation function first:

function key_generate($pass) {
    $new_key = strval(time());
    // Repeat the key so that it's long enough to XOR the key with
    $pass_enc = str_repeat($pass, (strlen($new_key) / strlen($pass)) + 1);
    $new_key = $new_key ^ $pass_enc;
    return base64_encode(strrev($new_key));
}

As I explained above, this first XORs the timestamp against a provided 'passcode' of sorts, and then it reverses it, base64 encodes it, and then returns it. I discovered that I needed to repeat the passcode to make sure it's at least as long as the timestamp - because otherwise it cuts the timestamp short! Longer passwords are always desirable for certain, but I wanted to make sure I addressed it here - just in case I need to lift this algorithm from here for a future project.

Next up is the decoding algorithm, that reverses the transformations we apply above:

    function key_decode($key, $pass) {
    $key_dec = strrev(base64_decode($key));
    // Repeat the key so that it's long enough to XOR the key with
    $pass_dec = str_repeat($pass, (strlen($key_dec) / strlen($pass)) + 1);
    return intval($key_dec ^ $pass_dec);
}

Very similar. Again, the XOR passphrase has to be repeated to make it long enough to apply to the whole encoded key without inadvertently chopping some off the end. Additionally, we also convert the timestamp back into an integer - since it is the number of seconds since the last UNIX epoch (1st January 1970 as of the time of typing).

With the ability to create and decode keys, let's write a helper method to make the verification process a bit easier:

    function key_verify($key, $pass, $min_age, $max_age) {
    $age = time() - key_decode($key, $pass);
    return $age >= $min_age && $age <= $max_age;
}

It's fairly self-explanatory, really. It takes an encoded key, decodes it, and verifies that it's age lies between the specified bounds. Here's the code in full (it updates every time I update the code in the GitHub Gist):

(Above: The full comment key code. Can't see it? Check it out on GitHub Gist here.)

OC ReMix Albums Atom Feed

I've recently discovered the wonderful OverClocked ReMix thanks to the help of a friend (thank you for the suggestion ☺), and they have a bunch of cool albums you should totally listen to (my favourite so far is Esther's Dreams :D)!

To help keep up-to-date with the albums they release, I went looking for an RSS/Atom feed that I could subscribe to. Unfortunately, I didn't end up finding one - so I wrote a simple scraper to do the job for me, and I thought I'd share it on here. You can find it live over here.

The script itself uses my earlier atom.gen.php script, since I had it lying around. Here it is in full:

<?php

$settings = new stdClass();
$settings->version = "0.1-alpha";
$settings->ocremix_url = "https://ocremix.org";
$settings->album_url = "https://ocremix.org/albums/";
$settings->user_agent = "OCReMixAlbumsAtomFeedGenerator/$settings->version (https://starbeamrainbowlabs.com/labs/ocremix_atom/albums_atom.php; <webmaster@starbeamrainbowlabs.com> ) PHP/" . phpversion() . "+DOMDocument";
$settings->ignore_parse_errors = true;
$settings->atom_gen_location = "./atom.gen.php";

// --------------------------------------------

require($settings->atom_gen_location);

ini_set("user_agent", $settings->user_agent);
$ocremix_albums_html = file_get_contents($settings->album_url);

$ocremix_dom = new DOMDocument();
libxml_use_internal_errors($settings->ignore_parse_errors);
$ocremix_dom->loadHTML($ocremix_albums_html);

$ocremix_xpath = new DOMXPath($ocremix_dom);

$ocremix_albums = $ocremix_xpath->query(".//table[contains(concat(' ', normalize-space(./@class), ' '), ' data ')]//tr[contains(concat(' ', normalize-space(./@class), ' '), ' area-link ')]");

$ocremix_feed = new atomfeed();
$ocremix_feed->title = "Albums | OC ReMix";
$ocremix_feed->id_uri = $settings->album_url;
$ocremix_feed->sq_icon_uri = "http://ocremix.org/favicon.ico";
$ocremix_feed->addauthor("OverClocked ReMix", false, "https://ocremix.org/");
$ocremix_feed->addcategories([ "music", "remixes", "albums" ]);

foreach ($ocremix_albums as $album) {
    $album_date = $album->childNodes[6]->textContent;
    $album_name = $album->childNodes[2]->childNodes[1]->textContent;
    $album_url = $album->childNodes[2]->childNodes[1]->attributes["href"]->textContent;
    $album_img_url = $album->childNodes[0]->childNodes[0]->attributes["src"]->textContent;

    // Make the urls absolute
    $album_url = $settings->ocremix_url .  $album_url;
    $album_img_url = $settings->ocremix_url .  $album_img_url;

    $content = "<p><a href='$album_url'><img src='$album_img_url' alt='$album_name' /> $album_name</a> - released $album_date";

    $ocremix_feed->addentry(
        $album_url,
        $album_name,
        strtotime($album_date),
        "OverClocked ReMix",
        $content,
        false,
        [],
        strtotime($album_date)
    );
}

header("content-type: application/atom+xml");
echo($ocremix_feed->render());

Basically, I define a bunch of settings at the top for convenience (since I might end up changing them later). Then, I pull down the OcReMix albums page, extract the table of recent albums with a (not-so) tidy XPath query - since PHP's DOMDocument class doesn't seem to support anything else at a glance (yep, I used css-to-path to convert a CSS selector, since my XPath is a bit rusty and I was in a hurry :P).

Finally, I pull it apart to get a list of albums and their release dates - all of which goes straight into atom.gen.php and then straight out to the browser.

Sources and Further Reading

Profiling PHP with XDebug

(This post is a fork of a draft version of a tutorial / guide originally written as a whilst at my internship.)

The PHP and xdebug logos.

Since I've been looking into xdebug's profiling function recently, I've just been tasked with writing up a guide on how to set it up and use it, from start to finish - and I thought I'd share it here.

While I've written about xdebug before in my An easier way to debug PHP post, I didn't end up covering the profiling function - I had difficulty getting it to work properly. I've managed to get it working now - this post documents how I did it. While this is written for a standard Debian server, the instructions can easily be applied to other servers.

For the uninitiated, xdebug is an extension to PHP that aids in the debugging of PHP code. It consists of 2 parts: The php extension on the server, and a client built into your editor. With these 2 parts, you can create breakpoints, step through code and more - though these functions are not the focus of this post.

To start off, you need to install xdebug. SSH into your web server with a sudo-capable account (or just use root, though that's bad practice!), and run the following command:

sudo apt install php-debug

Windows users will need to download it from here and put it in their PHP extension direction. Users of other linux distributions and windows may need to enable xdebug in their php.ini file manually (windows users will need extension=xdebug.dll; linux systems use extension=xdebug.so instead).

Once done, xdebug should be loaded and working correctly. You can verify this by looking the php information page. To see this page, put the following in a php file and request it in your browser:

<?php
phpinfo();
?>

If it's been enabled correctly, you should see something like this somewhere on the resulting page:

Part of the php information page, showing the xdebug section.

With xdebug setup, we can now begin configuring it. Xdebug gets configured in php.ini, PHP's main configuration file. Under Virtualmin each user has their own php.ini because PHP is loaded via CGI, and it's usually located at ~/etc/php.ini. To find it on your system, check the php information page as described above - there should be a row with the name "Loaded Configuration File":

The 'loaded configuration file' directive on the php information page.

Once you've located your php.ini file, open it in your favourite editor (or type sensible-editor php.ini if you want to edit over SSH), and put something like this at the bottom:

[xdebug]
xdebug.remote_enable=1
xdebug.remote_connect_back=1
xdebug.remote_port=9000
xdebug.remote_handler=dbgp
xdebug.remote_mode=req
xdebug.remote_autostart=true

xdebug.profiler_enable=false
xdebug.profiler_enable_trigger=true
xdebug.profiler_enable_trigger_value=ZaoEtlWj50cWbBOCcbtlba04Fj
xdebug.profiler_output_dir=/tmp
xdebug.profiler_output_name=php.profile.%p-%u

Obviously, you'll want to customise the above. The xdebug.profiler_enable_trigger_value directive defines a secret key we'll use later to turn profiling on. If nothing else, make sure you change this! Profiling slows everything down a lot, and could easily bring your whole server down if this secret key falls into the wrong hands (that said, simply having xdebug loaded in the first place slows things down too, even if you're not using it - so you may want to set up a separate server for development work that has xdebug installed if you haven't already). If you're not sure on what to set it to, here's a bit of bash I used to generate my random password:

dd if=/dev/urandom bs=8 count=4 status=none | base64 |  tr -d '=' | tr '+/' '-_'

The xdebug.profiler_output_dir lets you change the folder that xdebug saves the profiling output files to - make sure that the folder you specify here is writable by the user that PHP is executing as. If you've got a lot of profiling to do, you may want to consider changing the output filename, since xdebug uses a rather unhelpful filename by default. The property you want to change here is xdebug.profiler_output_name - and it supports a number of special % substitutions, which are documented here. I can recommend something phpprofile.%t-%u.%p-%H.%R.%cachegrind - it includes a timestamp and the request uri for identification purposes, while still sorting chronologically. Remember that xdebug will overwrite the output file if you don't include something that differentiates it from request to request!

With the configuration done, we can now move on to actually profiling something :D This is actually quite simple. Simply add the XDEBUG_PROFILE GET (or POST!) parameter to the url that you want to test in your browser. Here are some examples:

https://localhost/carrots/moon-iter.php?XDEBUG_PROFILE=ZaoEtlWj50cWbBOCcbtlba04Fj
https://development.galacticaubergine.de/register?vegetable=yes&mode=plus&XDEBUG_PROFILE=ZaoEtlWj50cWbBOCcbtlba04Fj

Adding this parameter to a request will cause xdebug to profile that request, and spit out a cachegrind file according to the settings we configured above. This file can then be analysed in your favourite editor, or, if it doesn't have support, an external program like qcachegrind (Windows) or kcachegrind (Everyone else).

If you need to profile just a single AJAX request or similar, most browsers' developer tools let you copy a request as a curl or wget command (Chromium-based browsers, Firefox - has an 'edit and resend' option), allowing you to resend the request with the XDEBUG_PROFILE GET parameter.

If you need to profile everything - including all subrequests (only those that pass through PHP, of course) - then you can set the XDEBUG_PROFILE parameter as a cookie instead, and it will cause profiling to be enabled for everything on the domain you set it on. Here's a [bookmarklet]() that set the cookie:

javascript:(function(){document.cookie='XDEBUG_PROFILE='+'insert_secret_key_here'+';expires=Mon, 05 Jul 2100 00:00:00 GMT;path=/;';})();

(Source)

Replace insert_secret_key_here with the secret key you created for the xdebug.profiler_enable_trigger_value property in your php.ini file above, create a new bookmark in your browser, paste it in (making sure that your browser doesn't auto-remove the javascript: at the beginning), and then click on it when you want to enable profiling.

Was this helpful? Got any questions? Let me know in the comments below!

Sources and further reading

How to update your linux kernel version on a KimSufi server

(Or why PHP throws random errors in the latest update)

Hello again!

Since I had a bit of a time trying to find some clear information on the subject, I'm writing the blog post so that it might help others. Basically, yesterday night I updated the packages on my server (the one that runs this website!). There was a PHP update, but I didn't think much of it.

This morning, I tried to access my ownCloud instance, only to discover that it was throwing random errors and refusing to load. I'm running PHP version 7.0.16-1+deb.sury.org~xenial+2. It was spewing errors like this one:

PHP message: PHP Fatal error:  Uncaught Exception: Could not gather sufficient random data in /srv/owncloud/lib/private/Security/SecureRandom.php:80
Stack trace:
#0 /srv/owncloud/lib/private/Security/SecureRandom.php(80): random_int(0, 63)
#1 /srv/owncloud/lib/private/AppFramework/Http/Request.php(484): OC\Security\SecureRandom->generate(20)
#2 /srv/owncloud/lib/private/Log/Owncloud.php(90): OC\AppFramework\Http\Request->getId()
#3 [internal function]: OC\Log\Owncloud::write('PHP', 'Uncaught Except...', 3)
#4 /srv/owncloud/lib/private/Log.php(298): call_user_func(Array, 'PHP', 'Uncaught Except...', 3)
#5 /srv/owncloud/lib/private/Log.php(156): OC\Log->log(3, 'Uncaught Except...', Array)
#6 /srv/owncloud/lib/private/Log/ErrorHandler.php(67): OC\Log->critical('Uncaught Except...', Array)
#7 [internal function]: OC\Log\ErrorHandler::onShutdown()
#8 {main}
  thrown in /srv/owncloud/lib/private/Security/SecureRandom.php on line 80" while reading response header from upstream, client: x.y.z.w, server: ownc

That's odd. After googling around a bit, I found this page on the Arch Linux bug tracker. I'm not using arch (Ubuntu 16.04.2 LTS actually), but it turned out that this comment shed some much-needed light on the problem.

Basically, PHP have changed the way they ask the Linux Kernel for random bytes. They now use the getrandom() kernel function instead of /dev/urandom as they did before. The trouble is that getrandom() was introduced in linux 3.17, and I was running OVH's custom 3.14.32-xxxx-grs-ipv6-64 kernel.

Thankfully, after a bit more digging, I found this article. It suggests installing the kernel you want and moving one of the grub config file generators to another directory, but I found that simply changing the permissions did the trick.

Basically, I did the following:


apt update
apt full-upgrade
apt install linux-image-generic
chmod -x /etc/grub.d/06_OVHkernel
update-grub
reboot

Basically, the above first updates everything on the system. Then it installs the linux-image-generic package. linux-image-generic is the pseudo-package that always depends on the latest stable kernel image available.

Next, I remove execute privileges on the file /etc/grub.d/06_OVHkernel. This is the file that gives the default installed OVH kernel priority over any other instalaled kernels, so it's important to exclude it from the grub configuration process.

Lastly, I update my grub configuration with update-grub and then reboot. You need to make sure that you update your grub configuration file, since if you don't it'll still use the old OVH kernel!

With that all done, I'm now running 4.4.0-62-generic according to uname -a. If follow these steps yourself, make sure you have a backup! While I am happy to try and help you out in the comments below, I'm not responsible for any consequences that may arise as a result of following this guide :-)

An easier way to debug PHP

A nice view of some trees taken by Mythdael I think, the awesome person who designed this website :D

Recently at my internship I've been writing quite a bit of PHP. The language itself is OK (I mean it does the job), but it's beginning to feel like a relic of a bygone era - especially when it comes to debugging. Up until recently I've been stuck with using echo() and var_dump() calls all over the place in order to figure out what's going on in my code - that's the equivalent of debugging your C♯ ACW with Console.WriteLine() O.o

Thankfully, whilst looking for an alternative, I found xdebug. Xdebug is like visual studio's debugging tools for C♯ (albeit a more primitive form). They allow you to add breakpoints and step though your PHP code one line at a time - inspecting the contents of variables in both the local and global scope as you go. It improves the standard error messages generated by PHP, too - adding stack traces and colour to the output in order to make it much more readable.

Best of all, I found a plugin for my primary web development editor atom. It's got some simple (ish) instructions on how to set up xdebug too - it didn't take me long to figure out how to put it to use.

I'll assume you've got PHP and Nginx already installed and configured, but this tutorial looks good (just skip over the MySQL section) if you haven't yet got it installed. This should work for other web servers and configurations too, but make sure you know where your php.ini lives.

XDebug consists of 2 components: The PHP extension for the server, and the client that's built into your editor. Firstly, you need to install the server extension. I've recorded an asciicast (terminal recording) to demonstrate the process:

(Above: An asciinema recording demonstrating how to install xdebug. Can't see it? Try viewing it on asciinema.org.)

After that's done, you should be able to simply install the client for your editor (I use php-debug for atom personally), add a breakpoint, load a php page in your web browser, and start debugging!

If you're having trouble, make sure that your server can talk directly to your local development machine. If you're sitting behind any routers or firewalls, make sure they're configured to allow traffic though on port 9000 and configured to forward it on to your machine.

Capturing and sending error reports by email in C♯

A month or two ago I put together a simple automatic updater and showed you how to do the same. This time I'm going to walk you through how to write your own error reporting system. It will be able to catch any errors through a try...catch block, ask the user whether they want to send an error report when an exception is thrown, and send the report to your email inbox.

Just like last time, this post is a starting point, not an ending point. It has a significant flaw and can be easily extended to, say, ask the user to write a short paragraph detailing what they were doing at the time of the crash, or add a proper gui, for example.

Please note that this tutorial requires that you have a server of some description to use to send the error reports to. If you want to get the system to send you an email too, you'll need a working mail server. Thankfully DigitalOcean provide free credit if you have the GitHub Student pack. This tutorial assumes that your mail server (or at least a relay one) is running on the same machine as your web server. While setting one up correctly can be a challenge, Lee Hutchinson over at Ars Technica has a great tutorial that's easy to follow.

To start with, we will need a suitable test program to work with whilst building this thing. Here's a good one riddled with holes that should throw more than a few exceptions:

using System;
using System.IO;
using System.Net;
using System.Text;

public class Program
{
    public static readonly string Name = "Dividing program";
    public static readonly string Version = "0.1";

    public static string ProgramId
    {
        get { return string.Format("{0}/{1}", Name, Version); }
    }

    public static int Main(string[] args)
    {
        float a = 0, b = 0, c = 0;

        Console.WriteLine(ProgramId);
        Console.WriteLine("This program divides one number by another.");

        Console.Write("Enter number 1: ");
        a = float.Parse(Console.ReadLine());
        Console.Write("Enter number 2: ");
        b = float.Parse(Console.ReadLine());

        c = a / b;
        Console.WriteLine("Number 1 divided by number 2 is {0}.", c);

        return 0;
}

There are a few redundant using statements at the top there - we will get to utilizing them later on.

First things first - we need to capture all exceptions and build an error report:

try
{
    // Insert your program here
}
catch(Exception error)
{
    Console.Write("Collecting data - ");
    MemoryStream dataStream  = new MemoryStream();
    StreamWriter dataIn = new StreamWriter(dataStream);
    dataIn.WriteLine("***** Error Report *****");
    dataIn.WriteLine(error.ToString());
    dataIn.WriteLine();
    dataIn.WriteLine("*** Details ***");
    dataIn.WriteLine("a: {0}", a);
    dataIn.WriteLine("b: {0}", b);
    dataIn.WriteLine("c: {0}", c);
    dataIn.Flush();

    dataStream.Seek(0, SeekOrigin.Begin);
    string errorReport = new StreamReader(dataStream).ReadToEnd();
    Console.WriteLine("done");
}

If you were doing this for real, it might be a good idea to move all of your application logic it it's own class and have a call like application.Run() instead of placing your code directly inside the try{ } block. Anyway, the above will catch the exception, and build a simple error report. I'm including the values of a few variables I created too. You might want to set up your own mechanism for storing state data so that the error reporting system can access it, like a special static class or something.

Now that we have created an error report, we need to send it to the server to processing. Before we do this, though, we ought to ask the user if this is ok with them (it is their computer in all likeliness after all!). This is easy:

Console.WriteLine("An error has occurred!");
Console.Write("Would you like to report it? [Y/n] ");

bool sendReport = false;
while(true)
{
    ConsoleKey key = Console.ReadKey().Key;
    if (key == ConsoleKey.Y) {
        sendReport = true;
        break;
    }
    else if (key == ConsoleKey.N)
        break;
}
Console.WriteLine();

if(!sendReport)
{
    Console.WriteLine("No report has been sent.");
    Console.WriteLine("Press any key to exit.");
    Console.ReadKey(true);
    return 1;
}

Since this program uses the console, I'm continuing that trend here. You will need to create your own GUI if you aren't creating a console app.

Now that's taken care of, we can go ahead and send the report to the server. Here's how I've done it:

Console.Write("Sending report - ");
HttpWebRequest reportSender = WebRequest.CreateHttp("https://starbeamrainbowlabs.com/reportSender.php");
reportSender.Method = "POST";
byte[] payload = Encoding.UTF8.GetBytes(errorReport);
reportSender.ContentType = "text/plain";
reportSender.ContentLength = payload.Length;
reportSender.UserAgent = ProgramId;
Stream requestStream = reportSender.GetRequestStream();
requestStream.Write(payload, 0, payload.Length);
requestStream.Close();

WebResponse reportResponse = reportSender.GetResponse();
Console.WriteLine("done");
Console.WriteLine("Server response: {0}", ((HttpWebResponse)reportResponse).StatusDescription);
Console.WriteLine("Press any key to exit.");
Console.ReadKey(true);
return 1;

That may look unfamiliar and complicated, so let's walk through it one step at a time.

To start with, I create a new HTTP web request and point it at an address on my server. You will use a slightly different address, but the basic principle is the same. As for what resides at that address - we will take a look at that later on.

Next I set request method to be POST so that I can send some data to the server, and set a few headers to help the server out in understanding our request. Then I prepare the error report for transport and push it down the web request's request stream.

After that I get the response from the server and tell the user that we have finished sending the error report to the server.

That pretty much completes the client side code. Here's the whole thing from start to finish:

using System;
using System.IO;
using System.Net;
using System.Text;

public class Program
{
    public static readonly string Name = "Dividing program";
    public static readonly string Version = "0.1";

    public static string ProgramId
    {
        get { return string.Format("{0}/{1}", Name, Version); }
    }

    public static int Main(string[] args)
    {
        float a = 0, b = 0, c = 0;

        try
        {
            Console.WriteLine(ProgramId);
            Console.WriteLine("This program divides one number by another.");

            Console.Write("Enter number 1: ");
            a = float.Parse(Console.ReadLine());
            Console.Write("Enter number 2: ");
            b = float.Parse(Console.ReadLine());

            c = a / b;
            Console.WriteLine("Number 1 divided by number 2 is {0}.", c);
        }
        catch(Exception error)
        {
            Console.WriteLine("An error has occurred!");
            Console.Write("Would you like to report it? [Y/n] ");

            bool sendReport = false;
            while(true)
            {
                ConsoleKey key = Console.ReadKey().Key;
                if (key == ConsoleKey.Y) {
                    sendReport = true;
                    break;
                }
                else if (key == ConsoleKey.N)
                    break;
            }
            Console.WriteLine();

            if(!sendReport)
            {
                Console.WriteLine("No report has been sent.");
                Console.WriteLine("Press any key to exit.");
                Console.ReadKey(true);
                return 1;
            }

            Console.Write("Collecting data - ");
            MemoryStream dataStream  = new MemoryStream();
            StreamWriter dataIn = new StreamWriter(dataStream);
            dataIn.WriteLine("***** Error Report *****");
            dataIn.WriteLine(error.ToString());
            dataIn.WriteLine();
            dataIn.WriteLine("*** Details ***");
            dataIn.WriteLine("a: {0}", a);
            dataIn.WriteLine("b: {0}", b);
            dataIn.WriteLine("c: {0}", c);
            dataIn.Flush();

            dataStream.Seek(0, SeekOrigin.Begin);
            string errorReport = new StreamReader(dataStream).ReadToEnd();
            Console.WriteLine("done");

            Console.Write("Sending report - ");
            HttpWebRequest reportSender = WebRequest.CreateHttp("https://starbeamrainbowlabs.com/reportSender.php");
            reportSender.Method = "POST";
            byte[] payload = Encoding.UTF8.GetBytes(errorReport);
            reportSender.ContentType = "text/plain";
            reportSender.ContentLength = payload.Length;
            reportSender.UserAgent = ProgramId;
            Stream requestStream = reportSender.GetRequestStream();
            requestStream.Write(payload, 0, payload.Length);
            requestStream.Close();

            WebResponse reportResponse = reportSender.GetResponse();
            Console.WriteLine("done");
            Console.WriteLine("Server response: {0}", ((HttpWebResponse)reportResponse).StatusDescription);
            Console.WriteLine("Press any key to exit.");
            Console.ReadKey(true);
            return 1;
        }

        return 0;
    }
}

(Pastebin, Raw)

Next up is the server side code. Since I'm familiar with it and it can be found on all popular web servers, I'm going to be using PHP here. You could write this in ASP.NET, too, but I'm not familiar with it, nor do I have the appropriate environment set up at the time of posting (though I certainly plan on looking into it).

The server code can be split up into 3 sections: the settings, receiving and extending the error report, and sending the error report on in an email. Part one is quite straightforward:

<?php
/// Settings ///
$settings = new stdClass();
$settings->fromAddress = "postasaurus@starbeamrainbowlabs.com";
$settings->toAddress = "bugs@starbeamrainbowlabs.com";

The above simply creates a new object and stores a few settings in it. I like to put settings at the top of small scripts like this because it both makes it easy to reconfigure them and allows for expansion later.

Next we need to receive the error report from the client:

// Get the error report from the client
$errorReport = file_get_contents("php://input");

PHP on a web server it smarter than you'd think and collects some useful information about the connected client, so we can collect a few interesting statistics and tag them onto the end of the error report like this:

// Add some extra information to it
$errorReport .= "\n*** Server Information ***\n";
$errorReport .= "Date / time reported: " . date("r") . "\n";
$errorReport .= "Reporting ip: " . $_SERVER['REMOTE_ADDR'] . "\n";
if(isset($_SERVER["HTTP_X_FORWARDED_FOR"]))
{
    $errorReport .= "The error report was forwarded through a proxy.\n";
    $errorReport .= "The proxy says that it forwarded the request from this address: " . $_SERVER['HTTP_X_FORWARDED_FOR'] . "\n\n";
}
if(isset($_SERVER["HTTP_USER_AGENT"]))
{
    $errorReport .= "The reporting client identifies themselves as: " . $_SERVER["HTTP_USER_AGENT"] . ".\n";
}

I'm adding the date and time here too just because the client could potentially fake it (they could fake everything, but that's a story for another time). I'm also collecting the client's user agent string too. This is being set in the client code above to the name and version of the program running. This information could be useful if you attach multiple programs to the same error reporting script. You could modify the client code to include the current .NET version, too by utilising Environment.Version.

Lastly, since the report has gotten this far, we really should do something with it. I decided I wanted to send it to myself in an email, but you could just as easily store it in a file using something like file_put_contents("bug_reports.txt", $errorReport, FILE_APPEND);. Here's the code I came up with:

$emailHeaders = [
    "From: $settings->fromAddress",
    "Content-Type: text/plain",
    "X-Mailer: PHP/" . phpversion()
];

$subject = "Error Report";
if(isset($_SERVER["HTTP_USER_AGENT"]))
    $subject .= " from " . $_SERVER["HTTP_USER_AGENT"];

mail($settings->toAddress, $subject, $errorReport, implode("\r\n", $emailHeaders), "-t");

?>

That completes the server side code. Here's the completed script:

<?php
/// Settings ///
$settings = new stdClass();
$settings->fromAddress = "postasaurus@starbeamrainbowlabs.com";
$settings->toAddress = "bugs@starbeamrainbowlabs.com";

// Get the error report from the client
$errorReport = file_get_contents("php://input");

// Add some extra information to it
$errorReport .= "\n*** Server Information ***\n";
$errorReport .= "Date / time reported: " . date("r") . "\n";
$errorReport .= "Reporting ip: " . $_SERVER['REMOTE_ADDR'] . "\n";
if(isset($_SERVER["HTTP_X_FORWARDED_FOR"]))
{
    $errorReport .= "The error report was forwarded through a proxy.\n";
    $errorReport .= "The proxy says that it forwarded the request from this address: " . $_SERVER['HTTP_X_FORWARDED_FOR'] . "\n\n";
}
if(isset($_SERVER["HTTP_USER_AGENT"]))
{
    $errorReport .= "The reporting client identifies themselves as: " . $_SERVER["HTTP_USER_AGENT"] . ".\n";
}

$emailHeaders = [
    "From: $settings->fromAddress",
    "Content-Type: text/plain",
    "X-Mailer: PHP/" . phpversion()
];

$subject = "Error Report";
if(isset($_SERVER["HTTP_USER_AGENT"]))
    $subject .= " from " . $_SERVER["HTTP_USER_AGENT"];

mail($settings->toAddress, $subject, $errorReport, implode("\r\n", $emailHeaders), "-t");

?>

(Pastebin, Raw)

The last job we need to do is to upload the PHP script to a PHP-enabled web server, and go back to the client and point it at the web address at which the PHP script is living.

If you have read this far, then you've done it! You should have by this point a simple working error reporting system. Here's an example error report email that I got whilst testing it:

***** Error Report *****
System.FormatException: Input string was not in a correct format.
  at System.Number.ParseSingle (System.String value, NumberStyles options, System.Globalization.NumberFormatInfo numfmt) <0x7fe1c97de6c0 + 0x00158> in <filename unknown>:0 
  at System.Single.Parse (System.String s, NumberStyles style, System.Globalization.NumberFormatInfo info) <0x7fe1c9858690 + 0x00016> in <filename unknown>:0 
  at System.Single.Parse (System.String s) <0x7fe1c9858590 + 0x0001d> in <filename unknown>:0 
  at Program.Main (System.String[] args) <0x407d7d60 + 0x00180> in <filename unknown>:0 

*** Details ***
a: 4
b: 0
c: 0

*** Server Information ***
Date / time reported: Mon, 11 Apr 2016 10:31:20 +0100
Reporting ip: 83.100.151.189
The reporting client identifies themselves as: Dividing program/0.1.

I mentioned at the beginning of this post that that this approach has a flaw. The main problem lies in the fact that the PHP script can be abused by a knowledgeable attacker to send you lots of spam. I can't think of any real way to properly solve this, but I'd suggest storing the PHP script at a long and complicated URL that can't be easily guessed. There are probably other flaws as well, but I can't think of any at the moment.

Found a mistake? Got an improvement? Please leave a comment below!

Converting Hashtags into Titles with PHP

Recently I have been working on a website for someone I know. Mythdael (the awesome artist who created a design for this website!) did the design for this one too, and I felt that I had to bring it to life. While I was writing it, I found that I needed to convert any given hashtag into a presentable title. Since hashtags are all one word (and usually lower case), it is more or less impossible to work out where one word starts and another ends. The solution: A wordlist (My choice was a modified enable1.txt).

Here is the solution I came up with:

/*
 * From https://terenceyim.wordpress.com/2011/02/01/all-purpose-binary-search-in-php/
 * Parameters: 
 *   $a - The sort array.
 *   $first - First index of the array to be searched (inclusive).
 *   $last - Last index of the array to be searched (exclusive).
 *   $key - The key to be searched for.
 *   $compare - A user defined function for comparison. Same definition as the one in usort
 *
 * Return:
 *   index of the search key if found, otherwise return (-insert_index - 1). 
 *   insert_index is the index of smallest element that is greater than $key or sizeof($a) if $key
 *   is larger than all elements in the array.
 */
function binary_search(array $a, $first, $last, $key, $compare) {
    $lo = $first; 
    $hi = $last - 1;

    while ($lo <= $hi) {
        $mid = (int)(($hi - $lo) / 2) + $lo;
        $cmp = call_user_func($compare, $a[$mid], $key);

        if ($cmp < 0) {
            $lo = $mid + 1;
        } elseif ($cmp > 0) {
            $hi = $mid - 1;
        } else {
            return $mid;
        }
    }
    return -($lo + 1);
}

class hashtag_parser
{
    public $wordlist_length = 0;
    public $wordlist = [];

    function __construct($wordlist_path) {
        global $settings;
        $this->wordlist = file($wordlist_path, FILE_IGNORE_NEW_LINES);
        $this->wordlist_length = count($this->wordlist);
    }

    public function in_wordlist($word)
    {
        $word = strtolower($word);

        $result = binary_search($this->wordlist, 0, $this->wordlist_length, $word, "strcmp");

        if($result > -1)
            return true;
        else
            return false;
        /*
        if(in_array($word, $this->wordlist))
            return true;
        else
            return false;
        */
    }

    public function extract_words($hashtag)
    {
        global $settings;

        // Remove the hash from the beginning if it is present
        if(substr($hashtag, 0, 1) == "#") $hashtag = substr($hashtag, 1);

        // Create an array to hold the words we find
        $words = [];

        $length = strlen($hashtag); // Cache the length of the hashtag
        $pos = 0;
        while($pos < $length)
        {
//          echo("pos: $pos\n");
            // aim: find the length of the longest substring that is a valid
            // word according to the wordlist
            $longest_word_length = 0;
            for($scan_pos = $pos + 1; $scan_pos < $length + 1; $scan_pos++)
            {
//              echo("scan_pos: $scan_pos substring: " . substr($hashtag, $pos, $scan_pos - $pos) . "\n");
                if($this->in_wordlist(substr($hashtag, $pos, $scan_pos - $pos)))
                {
                    $longest_word_length = $scan_pos - $pos;
//                  echo("found word\n");
                }
            }

            // Set the length of the longest word to the remainder of the
            // string if we don't find any valid words
            if($longest_word_length == 0) $longest_word_length = $length - $pos;

            $words[] = substr($hashtag, $pos, $longest_word_length);

            $pos += $longest_word_length;
        }

        return ucwords(implode(" ", $words));
    }
}

The code is a bit messy (perhaps I should tidy it up a bit lot), but it does the job I intended it to do. I used a class because I was concerned that reading in the wordlist every time would cause the code to take too long to complete - it is slow enough as it is. Thankfully a binary search algorithm written in PHP by terenceyim helped speed things up enormously. Anyway, it can be used like this:

$parser = new hashtag_parser("/path/to/wordlist.txt");
echo($parser->extract_words("sometext")); // Prints "Some Text"

The algorithm I used is quite simple:

  1. Loop over each character.
  2. Scan ahead of the current character and figure out the length of the longest word in the input via the wordlist.
  3. If no valid word can be found, assume that the rest of the input is all one word.
  4. Extract the longest word we can find and add it to an array.
  5. Add the length of the word we found to the character pointer.
  6. If we haven't reached the end of the given input, go to step 2.
  7. If we have reach the end of the output, return the words we found.

I am posting this here in the hopes that someone else will find this code useful :)

Probably the world's most advanced string splitting function

When I was setting up this website, I foolishly picked a custom log file format that is rather hard for computers to parse. Because of this I haven't found a server log analysis tool that is intelligent enough to parse my logs (if you know of one please let me know in the comments!).

Here is an example of a typical log file entry:

[19/May/2015:00:47:52 +0100] "starbeamrainbowlabs.com" HTTP/1.1 GET 200 162.243.87.220 0s :443 /blog/article.php article=posts/013-Terminal-Reference.html "https://starbeamrainbowlabs.com/blog/?offset=60" "Mozilla/5.0 (compatible; spbot/4.4.2; +http://OpenLinkProfiler.org/bot )"

It looks strange, doesn't it? Since I want to have some idea of how many people are visiting my site, I have finally gotten around to writing my own custom log parser. In order to do this, I needed a way to convert each line into an array of terms. None of the answers on stackoverflow seemed to cut it, so I wrote my own:

<?php
function explode_adv($openers, $closers, $togglers, $delimiters, $str)
{
    $chars = str_split($str);
    $parts = [];
    $nextpart = "";
    $toggle_states = array_fill_keys($togglers, false); // true = now inside, false = now outside
    $depth = 0;
    foreach($chars as $char)
    {
        if(in_array($char, $openers))
            $depth++;
        elseif(in_array($char, $closers))
            $depth--;
        elseif(in_array($char, $togglers))
        {
            if($toggle_states[$char])
                $depth--; // we are inside a toggle block, leave it and decrease the depth
            else
                // we are outside a toggle block, enter it and increase the depth
                $depth++;

            // invert the toggle block state
            $toggle_states[$char] = !$toggle_states[$char];
        }
        else
            $nextpart .= $char;

        if($depth < 0) $depth = 0;

        if(in_array($char, $delimiters) &&
           $depth == 0 &&
           !in_array($char, $closers))
        {
            $parts[] = substr($nextpart, 0, -1);
            $nextpart = "";
        }
    }
    if(strlen($nextpart) > 0)
        $parts[] = $nextpart;

    return $parts;
}
?>

I have also posted this on stackoverflow. This function of mine takes 5 parameters:

  1. An array of characters that open a block - e.g. [, (, etc.
  2. An array of characters that close a block - e.g. ], ), etc.
  3. An array of characters that toggle a block - e.g. ", ', etc.
  4. An array of characters that should cause a split into the next part.
  5. The string to work on.

This function probably will have flaws, but it works well enough for me.

You can also find this function on GitHub's Gist - as always suggestions and contributions are always welcome :)

Art by Mythdael