Archive

## Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blog bookmarklet booting bug hunting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer data analysis debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro network networking nibriboard node.js operating systems performance photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg talks technical terminal textures thoughts three thing game three.js tool tutorial twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

## The infrastructure behind Air Quality Web

For a while now, I've been working on Air-Quality-Web, a web interface that displays air quality information. While I haven't blogged about it directly before, a number of posts (a, b, c, d) I've made here have been indirectly related.

Since the air quality data has to come from somewhere, I thought I'd blog a little about the wider infrastructure behind the air quality web interface. My web interface is actually just 1 small part of a much wider stack of software that's being developed as a group by Connected Humber.

Said stack is actually quite distributed, so let's start with a diagram:

From left to right:

• As a group we've designed a PCB (mainly thanks to @BNNorman) that acts as the base for sensor nodes themselves - though a number of people have built their own hardware.
• Multiple different pieces of software run on top of the various pieces of hardware we've developed - some people use ESP Easy, and others use custom firmware they've implemented themselves.
• Embedded devices send the data over WiFi to our MQTT broker (LoRaWAN via The Things Network is currently under development), which currently runs in a Debian Virtual Machine rented from a cloud infrastructure provider.
• Another Debian VM hosts a database loading script, which listens for MQTT messages sent to the broker. It adds the data contained within into a database, which runs on the same box.
• A final box hosts the web server, which simultaneously hosts the PHP-based HTTP API and the client-side web interface. Both of these are currently located in this repository, but later down the line I'd like to figure out how to decouple them into their own separate repositories.

We can represent the flow of data here in a flowchart, to get a better idea as to how it all fits together:

As you can see, there are many areas of the project that can be worked on independently of each other - depending on what people feel most comfortable working on. Personally, I stick mainly to the HTTP API and the main web interface with a hand in advising on database design, but there are lots of other ways to get involved if you so choose!

Sensors always need building, designing, and programming, and the data generated is available via the public HTTP API (the docs for which can be found here) - so anyone can write their own application on top of the data collected by our sensors. Want a light on your desk (or even your hat) that changes colour depending on your local air quality? Go ahead!

Found this interesting? Comment below!

## Summer Project Part 5: When is a function not a function?

Another post! Looks like I'm on a roll in this series :P

In the last post, I looked at the box I designed that was ready for 3D printing. That process has now been completed, and I'm now in possession of an (almost) luminous orange and pink box that could almost glow in the dark.......

I also looked at the libraries that I'll be using and how to manage the (rather limited) amount of memory available in the AVR microprocessor.

Since last time, I've somehow managed to shave a further 6% program space off (though I'm not sure how I've done it), so most recently I've been implementing 2 additional features:

• An additional layer of AES encryption, to prevent The Things Network for having access to the decrypted data
• GPS delta checking (as I'm calling it), to avoid sending multiple messages when the device hasn't moved

After all was said and done, I'm now at 97% program space and 47% global variable RAM usage.

To implement the additional AES encryption layer, I abused LMiC's IDEETRON AES-128 (ECB mode) implementation, which is stored in src/aes/ideetron/AES-128_V10.cpp.

It's worth noting here that if you're doing crypto yourself, it's seriously not recommended that you use ECB mode. Please don't. The only reason that I used it here is because I already had an implementation to hand that was being compiled into my program, I didn't have the program space to add another one, and my messages all start with a random 32-bit unsigned integer that will provide a measure of protection against collision attacks and other nastiness.

Specifically, it's the method with this signature:

void lmic_aes_encrypt(unsigned char *Data, unsigned char *Key);

Since this is an internal LMiC function declared in a .cpp source file with no obvious header file twin, I needed to declare the prototype in my source code as above - as the method will only be discovered by the compiler when linking the object files together (see this page for more information about the C++ compilation process. While it's for regular Linux executable binaries, it still applies here since the Arduino toolchain spits out a very similar binary that's uploaded to the microprocessor via a programmer).

However, once I'd sorted out all the typing issues, I slammed into this error:

/tmp/ccOLIbBm.ltrans0.ltrans.o: In function transmit_send':
sketch/transmission.cpp:89: undefined reference to lmic_aes_encrypt(unsigned char*, unsigned char*)'
collect2: error: ld returned 1 exit status

Very strange. What's going on here? I declared that method via a prototype, didn't I?

Of course, it's not quite that simple. The thing is, the file I mentioned above isn't the first place that a prototype for that method is defined in LMiC. It's actually in other.c, line 35 as a C function. Since C and C++ (for all their similarities) are decidedly different, apparently to call a C function in C++ code you need to declare the function prototype as extern "C", like this:

extern "C" void lmic_aes_encrypt(unsigned char *Data, unsigned char *Key);

This cleaned the error right up. Turns out that even if a function body is defined in C++, what matters is where the original prototype is declared.

I'm hoping to release the source code, but I need to have a discussion with my supervisor about that at the end of the project.

Found this interesting? Come across some equally nasty bugs? Comment below!

## Summer Project Part 4: Threading the needle and compacting it down

In the last part, I put the circuit for the IoT device together and designed a box for said circuit to be housed inside of.

In this post, I'm going to talk a little bit about 3D printing, but I'm mostly going to discuss the software aspect of the firmware I'm writing for the Arduino Uno that's going to be control the whole operation out in the field.

Since last time, I've completed the design for the housing and sent it off to my University for 3D printing. They had some great suggestions for improving the design like making the walls slightly thicker (moving from 2mm to 4mm), and including an extra lip on the lid to keep it from shifting around. Here are some pictures:

(Left: The housing itself. Right: The lid. On the opposite side (not shown), the screw holes are indented.)

At the same time as handling sending the housing off to be 3D printed, I've also been busily iterating on the software that the Arduino will be running - and this is what I'd like to spend the majority of this post talking about.

I've been taking an iterative approach to writing it - adding a library, interfacing with it and getting it to do what I want on it's own, then integrating it into the main program.... and then compacting the whole thing down so that it'll fit inside the Arduino Uno. The thing is, the Uno is powered by an ATmega328P (datasheet). Which has 32K of program space and just 2K of RAM. Not much. At all.

The codebase I've built for the Uno is based on the following libraries.

• LMiC (the matthijskooijman fork) - the (rather heavy and needlessly complicated) LoRaWAN implementation
• Entropy for generating random numbers as explained in part 2
• TinyGPS, for decoding NMEA messages from the NEO-6M
• SdFat, for interfacing with microSD cards over SPI

### Memory Management

Packing the whole program into a 32K + 2K box is not exactly an easy challenge, I discovered. I chose to first deal with the RAM issue. This was greatly aided by the FreeMemory library, which tells you how much RAM you've got left at a given point in the execution of your program. While it's a bit outdated, it's still a useful tool. It works a bit like this:

#include <MemoryFree.h>;

void setup() {
Serial.begin(115200);
Serial.println(freeMemory, DEC);
char test[] = "Bobs Rockets";
Serial.println(freeMemory, DEC); // Should be lower than the above call
}

void loop() {
// Nothing here
}

It's worth taking a moment to revise the way stacks and heaps work - and the differences between how they work in the Arduino environment and on your desktop. This is going to get rather complicated quite quickly - so I'd advise reading this stack overflow answer first before continuing.

First, let's look at the locations in RAM for different types of allocation:

• Things on the stack
• Things on the heap
• Global variables

Unlike on the device you're reading this on, the Arduino does not support multiple processes - and therefore the entirety of the RAM available is allocated to your program.

Since I wasn't sure about preecisly how the Arduino does it (it's processor architecture-specific), I wrote a simple test program to tell me:

#include <Arduino.h>

struct Test {
uint32_t a;
char b;
};

Test global_var;

void setup() {
Serial.begin(115200);

Test stack;
Test* heap = new Test();

Serial.print(F("Stack location: "));
Serial.println((uint32_t)(&stack), DEC);

Serial.print(F("Heap location: "));
Serial.println((uint32_t)heap, DEC);

Serial.print(F("Global location: "));
Serial.println((uint32_t)&global_var, DEC);
}

void loop() {
// Nothing here
}

This prints the following for me:

Stack location: 2295
Heap location: 461
Global location: 284

From this we can deduce that global variables are located at the beginning of the RAM space, heap allocations go on top of globals, and the stack grows down starting from the end of RAM space. It's best explained with a diagram:

Now for the differences. On a normal machine running an operating system, there's an extra layer of abstraction between where things are actually located in RAM and where the operating system tells you they are located. This is known as virtual memory address translation (see also virtual memory, virtual address space).

It's a system whereby the operating system maintains a series of tables that map physical RAM to a virtual address space that the running processes actually use. Usually each process running on a system will have it's own table (but this doesn't mean that it will have it's own physical memory - see also shared memory, but this is a topic for another time). When a process accesses an area of memory with a virtual address, the operating system will transparently translate the address using the table to the actual location in RAM (or elsewhere) that the process wants to access.

This is important (and not only for security), because under normal operation a process will probably allocate and deallocate a bunch of different lumps of memory at different times. With a virtual address space, the operating system can defragment the physical RAM space in the background and move stuff around without disturbing currently running processes. Keeping the free memory contiguous speeds up future allocations, and ensures that if a process asks for a large block of contiguous memory the operating system will be able to allocate it without issue.

As I mentioned before though, the Arduino doesn't have a virtual memory system - partly because it doesn't support multiple processes (it would need an operating system for that). The side-effect here is that it doesn't defragment the physical RAM. Since C/C++ isn't a managed language, we don't get _heap compaction_ either like in .NET environments such as Mono.

All this leads us to an environment in which heap allocation needs to be done very carefully, in order to avoid fragmenting the heap and causing a stack crash. If an object somewhere in the middle of the heap is deallocated, the heap will not shrink until everything to the right of it is also deallocated. This post has a good explanation of the problem too.

Other things we need to consider are keeping global variables to a minimum, and trying to keep most things on the stack if we can help it (though this may slow the program down if it's copying things between stack frames all the time).

To this end, we need to choose the libraries we use with care - because they can easily break these guidelines that we've set for ourselves. For example, the inbuilt SD library is out, because it uses a global variable that eats over 50% of our available RAM - and it there's no way (that I can see at least) to reclaim that RAM once we're finished with it.

This is why I chose SdFat instead, because it's at least a little better at allowing us to reclaim most of the RAM it used once we're finished with it by letting the instance fall out of scope (though in my testing I never managed to reclaim all of the RAM it used afterwards).

Alternatives like µ-Fat do exist and are even lighter, but they have restrictions such as no appending to files for example - which would make the whole thing much more complicated since we'd have to pre-allocate the space for the file (which would get rather messy).

The other major tactic you can do to save RAM is to use the F() trick. Consider the following sketch:

#include

void setup() {
Serial.begin(115200);
Serial.println("Bills boosters controller, version 1");
}

void loop() {
// Nothing here
}

On the highlighted line we've got an innocent Serial.println() call. What's not obvious here is that the string literal here is actually copied to RAM before being passed to Serial.println() - using up a huge amount of our precious working memory. Wrapping it in the F() macro forces it to stay in your program's storage space:

Serial.println(F("Bills boosters controller, version 1"));

### Saving storage

With the RAM issue mostly dealt with, I then had to deal with the thorny issue of program space. Unfortunately, this is not as easy as saving RAM because we can't just 'unload' something when it's not needed.

My approach to reducing program storage space was twofold:

• Picking lightweight alternatives to libraries I needed
• Messing with flags of said libraries to avoid compiling parts of libraries I don't need

It is for these reasons that I ultimately went with TinyGPS instead of TinyGPS++, as it saved 1% or so of the program storage space.

It's also for this reason that I've disabled as much of LMiC as possible:

#define DISABLE_JOIN
#define DISABLE_PING
#define DISABLE_BEACONS
#define DISABLE_MCMD_DCAP_REQ
#define DISABLE_MCMD_DN2P_SET

This disables OTAA, Class B functionality (which I don't need anyway), receiving messaages, the duty cycle cap system (which I'm not sure works between reboots), and a bunch of other stuff that I'd probably find rather useful.

In the future, I'll probably dedicate an entire microcontroller to handling LoRaWAN functionality - so that I can still use the features I've had to disable here.

Even doing all this, I still had to trim down my Serial.println() calls and remove any non-essential code to bring it under the 32K limit. As of the time of typing, I've got jut 26 bytes to spare!

Next time, after tuning the TPL5110 down to the right value, we're probably going to switch gears and look at the server-side of things - and how I'm going to be storing the data I receive from the Arudino-based device I've built.

Found this interesting? Got a suggestion? Comment below!

## Summer Project Part 3: Putting it together

In the first post in this series, I outlined my plans for my Msc summer project and what I'm going to be doing. In the second post, I talked about random number generation for the data collection.

In this post, I'm going to give a general progress update - which will mostly centre around the Internet of Things device I'm building to collect the signal strength data.

Since the last post, I've got nearly all the parts I need for the project, except the TPL5111 power manager and 4 rechargeable AA batteries (which should be easy to come by - I'm sure I've got some lying around somewhere).

I've also wired the thing up, with a cable standing in for the TPL5111.

The power management board there technically doesn't need a breadboard, but it makes mounting it in the box easier.

I still need to splice the connector onto the battery box I had lying around with some soldering and electrical tape - I'll do that later this week.

The wiring there is kind of messy, but I've tested each device individually and they all appear to work as intended. Here's a clearer diagram of what's going on that drew up in Fritzing (sudo apt install fritzing for Linux users):

Speaking of mounting things in the box, I've discovered OpenSCAD thanks to help from a friend and have been busily working away at designing a box to put everything in that can be 3D printed:

I've just got the lid to do next (which I'm going to do after writing this blog post), and then I'm going to get it printed.

With this all done, it's time to start working on the transport for the messages - namely using LMIC to connection to the network and send the GPS location to the application server, which is also unfinished.

The lovely people at the hardware meetup have lent me a full 8-channel LoRaWAN gateway that's connected to The Things Network for my project, which will make this process a lot easier.

Next time, I'll likely talk about 3D printing and how I've been 'threading the needle', so to speak.

## Summer Project Part 2: Random Number Analysis with Gnuplot

In my last post about my Masters Summer Project, I talked about my plans and what I'm doing. In this post, I want to talk about random number generator evaluation.

As part of the Arduino-based Internet of Things device that will be collecting the data, I need to generate high-quality random numbers in order to ensure that the unique ids I use in my project are both unpredictable and unique.

In order to generate such numbers, I've found a library that exploits the jitter in the inbuilt watchdog timer that's present in the Arduino Uno. It's got a manual which is worth a read, as it explains the concepts behind it quite well.

After some experimenting, I ended up with a program that would generate random numbers as fast as it could:

// Generate_Random_Numbers - This sketch makes use of the Entropy library
// to produce a sequence of random integers and floating point values.
// to demonstrate the use of the entropy library
//
// Copyright 2012 by Walter Anderson
//
// This file is part of Entropy, an Arduino library.
// Entropy is free software: you can redistribute it and/or modify
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// Entropy is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with Entropy.  If not, see <http://www.gnu.org/licenses/>.
//
// Edited by Starbeamrainbowlabs 2019

#include "Entropy.h"

void setup() {
Serial.begin(9600);

// This routine sets up the watch dog timer with interrupt handler to maintain a
// pool of real entropy for use in sketches. This mechanism is relatively slow
// since it will only produce a little less than two 32-bit random values per
// second.
Entropy.initialize();
}

void loop() {
uint32_t random_long;
random_long = Entropy.random();
Serial.println(random_long);
}

As you can tell, it's based on one of the examples. You may need to fiddle around with the imports in order to get it to work, because the Arduino IDE is terrible.

With this in place and uploaded to an Arduino, all I needed to do was log the serial console output to a file. Thankfully, this is actually really quite easy on Linux:

screen -S random-number-generator dd if=/dev/ttyACM0 of=random.txt bs=1

Since I connected the Arduino in question to a Raspberry Pi I have acting as a file server, I've included a screen call here that ensures that I can close the SSH session without it killing the command I'm executing - retaining the ability to 'reattach' to it later to check on it.

With it set off, I left it for a day or 2 until I had at least 1 MiB of random numbers. Once it was done, I ended up with a file looking a little like this:

216767155
986748290
455286059
1956258942
4245729381
3339111661
1821899502
3892736709
3658303796
2524261768
732282824
999812729
1312753534
2810553575
246363223
4106522438
260211625
1375011617
795481000
319056836

In total, it generated 134318 numbers for me to play with, which should be plenty to graph their distribution.

Graphing such a large amount of numbers requires a special kind of program. Since I've used it before, I reached for Gnuplot.

A histogram is probably the best kind of graph for this purpose, so I looked up how to get gnuplot to draw one and tweaked it for my own purposes.

I didn't realise that you could do arbitrary calculations inside a Gnuplot graph definition file, but apparently you can. The important bit below is the bin_width variable:

set key off
set border 3

# Add a vertical dotted line at x=0 to show centre (mean) of distribution.
#set yzeroaxis

# Each bar is half the (visual) width of its x-range.
#set boxwidth 0.05 absolute
#set style fill solid 1.0 noborder

bin_width = 171797751.16;

bin_number(x) = floor(x / bin_width)

rounded(x) = bin_width * ( bin_number(x) + 0.5 )

set terminal png linewidth 3 size 1920,1080

plot 'random.txt' using (rounded($1)):(1) smooth frequency with boxes It specifies the width of each bar on the graph. To work this out, we need to know the maximum number in the dataset. Then we can divide it by the target number of bins to get the width thereof. Time for some awk! awk 'BEGIN { a = 0; b=999999999999999 } { if ($1>0+a) a=$1; if ($1 < 0+b) b=$1; } END { print(a, b); }' <random.txt This looks like a bit of a mess, so let's unwind that awk script so that we can take a better look at it. BEGIN { max = 0; min = 999999999999999 } { if ($1 > max)
max = $1; if ($1 < min)
min = $1; } END { print("Max:", max, "Min:", min); } Much better. In short, it keeps a record of the maximum and minimum numbers it's seen so far, and updates them if it sees a better one. Let's run that over our random numbers: awk -f minmax.awk  Excellent. 4294943779 ÷ 25 = 171797751.16 - which is how I arrived at that value for the bin_width earlier. Now we can render our graph: gnuplot random.plt >histogram.png && optipng -o7 histogram.png I always optimise images with either optipng or jpegoptim to save on storage space on the server, and bandwidth for readers - and in this case the difference was particularly noticeable. Here's the final graph: As you can see, the number of numbers in each bin is pretty even, so we can reasonably conclude that the algorithm isn't too terrible. What about uniqueness? Well, that's much easier to test than the distribution. If we count the numbers before and after removing duplicates, it should tell us how many duplicates there were. There's even a special command for it: wc -l <random.txt 134318 sort <random.txt | uniq | wc -l 134317 sort <random.txt | uniq --repeated 1349455381 Very interesting. Out of ~134K numbers, there's only a single duplicate! I'm not sure whether that's a good thing or not, as I haven't profiled any other random number generated in this manner before. Since I'm planning on taking 1 reading a minute for at least a week (that's 10080 readings), I'm not sure that I'm going to get close to running into this issue - but it's always good to be prepared I guess...... Found this interesting? Got a better way of doing it? Comment below! ### Sources and Further Reading ## Summer Project Part 1: LoRaWAN Signal Mapping! What? A new series (hopefully)! My final project for my Masters in Science course at University is taking place this summer, and on the suggestion of Rob Miles I'll be blogging about it along the way. In this first post, I'd like to talk a little bit about the project I've picked and my initial thoughts. As you have probably guessed from the title of this post, the project I've picked is on mapping LoRaWAN signal coverage. I'm specifically going to look at that of The Things Network, a public LoRaWAN network. I've actually posted about LoRa before, so I'd recommend you go back and read that post first before continuing with this one. The plan I've drawn up so far is to build an Internet of Things device with an Arduino and an RFM95 (a LoRa modem chip) to collect a bunch of data, which I'll then push through some sort of AI to fill in the gaps. The University have been kind enough to fund some of the parts I'll need, so I've managed to obtain some of them already. This mainly includes: • Some of the power management circuitry • An Arduino Uno • A bunch of wires • A breadboard • A 9V battery holder (though I suspect I'll need a different kind of battery that can be recharged) • Some switches (Above: The parts that I've collected already. I've still got a bunch of parts to go though.) I've ordered the more specialised parts through my University, and they should be arriving soon: I'll also need a project box to keep it all in if I can't gain access to the University's 3D printers, but I'll tackle that later. I'll store on a local microSD card for each message a random id and the location a message was sent. I'll transmit the location and the unique id via LoRaWAN, and the server will store it - along with the received signal strength from the different gateways that received the message. Once a run is complete, I'll process the data and pair the local readings from the microSD card up with the ones the database has stored, so that we have readings from the 'block spots' where there isn't currently any coverage. By using a unique random id instead of a timestamp, I can help preserve the privacy oft he person carrying the device. Of course, I can't actually ask anyone to carry the device around until I've received ethical approval from the University to do so. I've already submitted the form, and I'm waiting to hear back on that. While I'm waiting, I'm starting to set up the backend application server. I've decided to write it in Node.js using SQLite to store the data, so that if I want to do multiple separate runs to compare coverage before and after a gateway is installed, I can do so easily by just moving on to a new SQLite database file. In the next post, I might talk a little bit about how I'm planning on generating the random ids. I'd like to do some research into the built-in random() function and how ti compares to other unpredictable sources of randomness, such as comparing clocks. ## Easy Node.JS Dependencies Updates Once you've had a project around for a while, it's inevitable that dependency updates will become available. Unfortunately, npm (the Node Package Manager), while excellent at everything else, is completely terrible at notifying you about updates. The solution to this is, of course, to use an external tool. Personally, I use npm-check, which is also installable via npm. It shows you a list of updates to your project's dependencies, like so: (Can't see the above? View it directly on asciinema.org) It even supports the packages that you've install globally too, which no other tool appears to do as far as I can tell (although it does appear to miss some packages, such as npm and itself). To install it, simply do this: sudo npm install --global npm-check Once done, you can then use it like this: # List updates, but don't install them npm-check # Interactively choose the updates to install npm-check -u # Interactively check the globally-installed packages sudo npm-check -gu The tool also checks to see which of the dependencies are actually used, and prompts you to check the dependencies it think you're not using (it doesn't always get it right, so check carefully yourself before removing!). There's an argument to disable this behaviour: npm-check --skip-unused Speaking of npm package dependencies, the other big issue is security vulnerabilities. GitHub have recently started giving maintainers of projects notifications about security vulnerabilities in their dependencies, which I think is a brilliant idea. Actually fixing said vulnerabilities is a whole other issue though. If you don't want to update all the dependencies of a project to the latest version (perhaps you're just doing a one-off contribution to a project or aren't very familiar with the codebase yet), there's another tool - this time built-in to npm - to help out - the npm audit subcommand. # Check for reported security issues in your dependencies npm audit # Attempt to fix said issues by updating packages *just* enough npm audit fix (Can't see the above? View it directly on asciinema.org) This helps out a ton with contributing to other projects. Issues arise when the vulnerabilities are not in packages you directly depend on, but instead in packages that you indirectly depend on via dependencies of the packages you've installed. Thankfully the vulnerabilities in the above can all be traced back to development dependencies (and aren't essential for Peppermint Wiki itself), but it's rather troubling that I can't install updated packages because the packages I depend on haven't updated their dependencies. I guess I'll be sending some pull requests to update some dependencies! To help me in this, the output of npm audit even displays the dependency graph of why a package is installed. If this isn't enough though, there's always the npm-why which, given a package name, will figure out why it's installed. Found this interesting? Got a better solution? Comment below! ## Generating Atom 1.0 Feeds with PHP (the proper way) I've generated Atom feeds in PHP before, but recently I went on the hunt to discover if PHP has something like C♯'s XMLWriter class - and it turns out it does! Although poorly documented (probably why I didn't find it in the first place :P), it's actually quite logical and easy to pick up. To this end, I thought I'd blog about how I used to write the Atom 1.0 feed generator for recent changes on Pepperminty Wiki that I've recently implemented (coming soon in the next release!), as it's so much cleaner than atom.gen.php that I blogged about before! It's safer too - as all the escaping is handled automatically by PHP - so there's no risk of an injection attack because I forgot to escape a character in my library code. It ends up being a bit verbose, but a few helper methods (or a wrapper class?) should alleviate this nicely - I might refactor it at a later date. To begin, we need to create an instance of the aptly-named XMLLWriter class. It's probable that you'll need the php-xml package installed in order to use this. $xml = new XMLWriter();
$xml->openMemory();$xml->setIndent(true); $xml->setIndentString("\t");$xml->startDocument("1.0", "utf-8");

In short, the above creates a new XMLWriter instance, instructs it to write the XML to memory, enables pretty-printing of the output, and writes the standard XML document header.

With the document created, we can begin to generate the Atom feed. To figure out the format (I couldn't remember from when I wrote atom.gen.php - that was ages ago :P), I ended following this helpful guide on atomenabled.org. It seems familiar - I think I might have used it when I wrote atom.gen.php. To start, we need something like this:

<feed xmlns="http://www.w3.org/2005/Atom">
......
</feed>

In PHP, that translates to this:

$xml->startElement("feed");$xml->writeAttribute("xmlns", "http://www.w3.org/2005/Atom");

$xml->endElement(); // </feed> Next, we probably want to advertise how the Atom feed was generated. Useful for letting the curious know what a website is powered by, and for spotting usage of your code out in the wild! Since I'm writing this for Pepperminty Wiki, I'm settling on something not unlike this: <generator uri="https://github.com/sbrl/Pepperminty-Wiki/" version="v0.18-dev">Pepperminty Wiki</generator> In PHP, this translates to this: $xml->startElement("generator");
$xml->writeAttribute("uri", "https://github.com/sbrl/Pepperminty-Wiki/");$xml->writeAttribute("version", $version); // A variable defined elsewhere in Pepperminty Wiki$xml->text("Pepperminty Wiki");
$xml->endElement(); Next, we need to add a <link rel="self" /> tag. This informs clients as to where the feed was fetched from, and the canonical URL of the feed. I've done this: xml->startElement("link");$xml->writeAttribute("rel", "self");
$xml->writeAttribute("type", "application/atom+xml");$xml->writeAttribute("href", full_url());
$xml->endElement(); That full_url() function is from StackOverflow, and calculates the full URI that was used to make a request. As Pepperminty Wiki can be run in any directory on ayn server, I can't pre-determine this url - hence the complexity. Note also that I output type="application/atom+xml" here. This specifies the type of content that can be found at the supplied URL. The idea here is that if you represent the same data in different ways, you can advertise them all in a standard way, with other formats having rel="alternate". Pepperminty Wiki does this - generating the recent changes list in HTML, CSV, and JSON in addition to the new Atom feed I'm blogging about here (the idea is to make the wiki data as accessible and easy-to-parse as possible). Let's advertise those too: $xml->startElement("link");
$xml->writeAttribute("rel", "alternate");$xml->writeAttribute("type", "text/html");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=html");
$xml->endElement();$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");$xml->writeAttribute("type", "application/json");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=json");
$xml->endElement();$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");$xml->writeAttribute("type", "text/csv");
$xml->writeAttribute("href", "$full_url_stem?action=recent-changes&format=csv");
$xml->endElement(); Before we can output the articles themselves, there are a few more pieces of metadata left on our laundry list - namely <updated />, <id />, <icon />, <title />, and <subtitle />. There are others in the documentation too, but aren't essential (as far as I can tell) - and not appropriate in this specific case. Here's what they might look like: <updated>2019-02-02T21:23:43+00:00</updated> <id>https://wiki.bobsrockets.com/?action=recent-changes&amp;format=atom</id> <icon>https://wiki.bobsrockets.com/rocket_logo.png</icon> <title>Bob's Wiki - Recent Changes</title> <subtitle>Recent Changes on Bob's Wiki</subtitle> The <updated /> tag specifies when the feed was last updated. It's unclear as to whether it's the date/time the last change was made to the feed or the date/time the feed was generated, so I've gone with the latter. If this isn't correct, please let me know and I'll change it. The <id /> element can contain anything, but it must be a globally-unique string that identifies this feed. I've seen other feeds use the canonical url - and I've gone to the trouble of calculating it for the <link rel="self" /> - so it seems a shame to not use it here too. The remaining elements (<icon />, <title />, and <subtitle />) are pretty self explanatory - although it's worth mentioning that the icon must be square apparently. Let's whip those up with some more PHP: $xml->writeElement("updated", date(DateTime::ATOM));
$xml->writeElement("id", full_url());$xml->writeElement("icon", $settings->favicon);$xml->writeElement("title", "$settings->sitename - Recent Changes");$xml->writeElement("subtitle", "Recent Changes on $settings->sitename"); PHP even has a present for generating a date string in the correct format required by the spec :D $settings is an object containing the wiki settings that's a parsed form of peppermint.json, and contains useful things like the wiki's name, icon, etc.

Finally, with all the preamble done, we can turn to the articles themselves. In the case of Pepperminty Wiki, the final result will look something like this:

<entry>
<title type="text">Edit to Compute Core by Sean</title>
<id>https://seanssatellites.co.uk/wiki/?page=Compute%20Core</id>
<updated>2019-01-29T10:21:43+00:00</updated>
<content type="html">&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;User:&lt;/strong&gt;  Sean&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Compute Core&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Tue, 29 Jan 2019 10:21:43 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 1.36kb&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +1&lt;/li&gt;
&lt;/ul&gt;</content>
<author>
<name>Sean</name>
<uri>https://seanssatellites.co.uk/wiki/?page=Users%2FSean</uri>
</author>
</entry>

There are a bunch of elements here that deserve attention:

• <title /> - The title of the article. Easy peasy!
• <id /> - Just like the id of the feed itself, each article entry needs an id too. Here I've followed the same system I used for the feed, and given the url of the page content.
• <updated /> - The last time the article was updated. Since this is part of a feed of recent changes, I've got this information readily at hand.
• <content /> - The content to display. If the content is HTML, it must be escaped and type="html" present to indicate this.
• <link rel="alternate" /> Same deal as above, but on an article-by-article level. In this case, it should link to the page the article content is from. In this case, I link to the page & revision of the change in question. In other cases, you might link to the blog post in question for example.
• <author /> - Can contain <name />, <uri />, and <email />, and should indicate the author of the content. In this case, I use the name of the user that made the change, along with a link to their user page.

Here's all that in PHP:

foreach($recent_changes as$recent_change) {
if(empty($recent_change->type))$recent_change->type = "edit";

$xml->startElement("entry"); // Change types: revert, edit, deletion, move, upload, comment$type = $recent_change->type;$url = "$full_url_stem?page=".rawurlencode($recent_change->page);

$content = ".......";$xml->startElement("title");
$xml->writeAttribute("type", "text");$xml->text("$type$recent_change->page by $recent_change->user");$xml->endElement();

$xml->writeElement("id",$url);
$xml->writeElement("updated", date(DateTime::ATOM,$recent_change->timestamp));

$xml->startElement("content");$xml->writeAttribute("type", "html");
$xml->text($content);
$xml->endElement();$xml->startElement("link");
$xml->writeAttribute("rel", "alternate");$xml->writeAttribute("type", "text/html");
$xml->writeAttribute("href",$url);
$xml->endElement();$xml->startElement("author");
$xml->writeElement("name",$recent_change->user);
$xml->writeElement("uri", "$full_url_stem?page=".rawurlencode("$settings->user_page_prefix/$recent_change->user"));
$xml->endElement();$xml->endElement();
}

I've omitted the logic that generates the value of the <content /> tag, as it's not really relevant here (you can check it out here if you're curious :D).

This about finishes the XML we need to generate for our feed. To extract the XML from the XMLWriter, we can do this:

$atom_feed =$xml->flush();

Then we can do whatever we want to with the generated XML!

When the latest version of Pepperminty Wiki comes out, you'll be able to see a live demo here! Until then, you'll need to download a copy of the latest master version and experiment with it yourself. I'll also include a complete demo feed below:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<generator uri="https://github.com/sbrl/Pepperminty-Wiki/" version="v0.18-dev">Pepperminty Wiki</generator>
<updated>2019-02-03T17:25:10+00:00</updated>
<id>http://[::]:35623/?action=recent-changes&amp;format=atom&amp;count=3</id>
<title>Pepperminty Wiki - Recent Changes</title>
<subtitle>Recent Changes on Pepperminty Wiki</subtitle>
<entry>
<updated>2019-01-29T19:55:08+00:00</updated>
<content type="html">&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Tue, 29 Jan 2019 19:55:08 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 2.11kb&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +2007&lt;/li&gt;
&lt;/ul&gt;</content>
<author>
</author>
</entry>
<entry>
<title type="text">Edit to Main Page by admin</title>
<id>http://[::]:35623/?page=Main%20Page</id>
<updated>2019-01-05T20:14:07+00:00</updated>
<content type="html">&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Main Page&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Sat, 05 Jan 2019 20:14:07 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 317b&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +68&lt;/li&gt;
&lt;/ul&gt;</content>
<author>
<uri>http://[::]:35623/?page=Users%2FMain%20Page</uri>
</author>
</entry>
<entry>
<title type="text">Edit to Main Page by admin</title>
<id>http://[::]:35623/?page=Main%20Page</id>
<updated>2019-01-05T17:53:08+00:00</updated>
<content type="html">&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Change type:&lt;/strong&gt; edit&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page name:&lt;/strong&gt; Main Page&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Timestamp:&lt;/strong&gt; Sat, 05 Jan 2019 17:53:08 +0000&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;New page size:&lt;/strong&gt; 249b&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Page size difference:&lt;/strong&gt; +31&lt;/li&gt;
&lt;/ul&gt;</content>
<author>
<uri>http://[::]:35623/?page=Users%2FMain%20Page</uri>
</author>
</entry>
</feed>

...this is from my local development instance.

Found this interesting? Confused about something? Want to say hi? Comment below!

## Generating word searches for fun and profit

(Above: A Word search generated with the tool below)

A little while ago I was asked about generating a wordsearch in a custom shape. I thought to myself "someone has to have built this before...", and while I was right to an extent, I couldn't find one that let you use any shape you liked.

This, of course, was unacceptable! You've probably guessed it by now, but I went ahead and wrote my own :P

While I wrote it a little while ago, I apparently never got around to posting about it on here.

In short, it works by using an image you drop into the designated area on the page as the shape the word search should take. Each pixel is a single cell of the word search - with the alpha channel representing whether or not a character is allowed to be placed there (transparent means that it can't contain a character, and opaque means that it can).

Creating such an image is simple. Personally, I recommend Piskel or GIMP for this purpose.

Once done, you can start building a wordlist in the wordlist box at the right-hand-side. It should rebuild the word search as soon as you click out of the box. If it doesn't, then you've found a bug! Please report it here.

With the word search generated, you can use the Question Sheet and Answer Sheet links to open printable versions for export.

You can find my word search generator here:

I've generated a word search of the current tags in the tag cloud on this blog too: Question Sheet [50.3KiB], Answer Sheet [285.6KiB]

The most complicated part of this was probably the logistics behind rude word removal. Thankfully, I did't have to find and maintain such a list of words, as the futility npm package does this for me, but algorithmically guaranteeing that by censoring 1 rude word another is not accidentally created in another direction is a nasty problem.

If you're interested in a more technical breakdown of one (or several!) particular aspects of this - let me know! While writing about all of it would probably make for an awfully long post, a specific aspect or two should be more manageable.

In the future, I'll probably revisit this and add additional features to it, such as the ability to restrict which directions words are placed in, for example. If you've got a suggestion of your own, open an issue (or even better, open a pull request :D)!

## Demystifying Inverted Indexes

(The magnifying glass in the above banner came from openclipart)

After writing the post that will be released after this one, I realised that I made a critical assumption that everyone knew what an inverted index was. Upon looking for an appropriate tutorial online, I couldn't find one that was close enough to what I did in Pepperminty Wiki, so I decided to write my own.

First, some context. What's Pepperminty Wiki? Well, it's a complete wiki engine in a single file of PHP. The source files are obviously not a single file, but it builds into a single file - making it easy to drop into any PHP-enabled web server.

One of its features is a full-text search engine. A personal wiki of mine has ~75k words spread across ~550 pages, and it manages to search them all in just ~450ms! It does this with the aid of an inverted index - which I'll be explaining in this post.

First though, we need some data to index. How about the descriptions of some video games?

Kerbal Space Program

In KSP, you must build a space-worthy craft, capable of flying its crew out into space, without killing them. At your disposal is a collection of parts, which must be assembled to create a functional ship. Each part has its own function and will affect the way a ship flies (or doesn't). So strap yourself in, and get ready to try some Rocket Science!

Cross Code

Meet Lea as she logs into an MMO of the distant future. Follow her steps as she discovers a vast world, meets other players and overcomes all the challenges of the game.

Fort Meow

Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow.

Factory Balls

Factory balls is the brand new bonte game I already announced yesterday. Factory balls takes part in the game design competition over at jayisgames. The goal of the design competition was to create a 'ball physics'-themed game. I hope you enjoy it!

Very cool, this should provide us with plenty of data to experiment with. Firstly, let's consider indexing. Take the Factory Balls description. We can split it up into tokens like this:

T o k e n s V V
factory balls is the brand new bonte game
i already announced yesterday factory balls takes
part in the game design competition over
at jayisgames the goal of the design
competition was to create a ball physics
themed game i hope you enjoy it

Notice how we've removed punctuation here, and made everything lowercase. This is important for the next step, as we want to make sure we consider Factory and factory to be the same word - otherwise when querying the index we'd have to remember to get the casing correct.

With our tokens sorted, we can now count them to create our index. It's like a sort of tally chart I guess, except we'll be including the offset in the text of every token in the list. We'll also be removing some of the most common words in the list that aren't very interesting - these are known as stop words. Here's an index generated from that Factory Balls text above:

Token Frequency Offsets
factory 2 0, 12
balls 2 1, 13
brand 1 4
new 1 5
bonte 1 6
game 3 7, 18, 37
i 2 8, 38
announced 1 10
yesterday 1 11
takes 1 14
design 2 19, 28
competition 2 20, 29
jayisgames 1 23
goal 1 25
create 1 32
ball 1 34
physics 1 35
themed 1 36
hope 1 39
enjoy 1 41

Very cool. Now we can generate an index for each page's content. The next step is to turn this into an inverted index. Basically, the difference between the normal index and a inverted index is that an entry in an inverted index contains not just the offsets for a single page, but all the pages that contain that token. For example, the Cross-Code example above also contains the token game, so the inverted index entry for game would contain a list of offsets for both the Factory Balls and Cross-Code pages.

Including the names of every page under every different token in the inverted index would be both inefficient computationally and cause the index to grow rather large, so we should assign each page a unique numerical id. Let's do that now:

Id Page Name
1 Kerbal Space Program
2 Cross Code
3 Fort Meow
4 Factory Balls

There - much better. In Pepperminty Wiki, this is handled by the ids class, which has a pair of public methods: getid($pagename) and getpagename($id). If an id can't be found for a page name, then a new id is created and added to the list (Pepperminty Wiki calls this the id index) transparently. Similarly, if a page name can't be found for an id, then null should be returned.

Now that we've got ids for our pages, let's look at generating that inverted index entry for game we talked about above. Here it is:

• Term: game
Id Frequency Offsets
2 1 31
3 1 5
4 3 5, 12, 23

Note how there isn't an entry for page id 1, as the Kerbal Space Program page doesn't contain the token game.

This, in essence, is the basics of inverted indexes. A full inverted index will contain an entry for every token that's found in at least 1 source document - though the approach used here is far from the only way of doing it (I'm sure there are much more advanced ways of doing it for larger datasets, but this came to mind from reading a few web articles and is fairly straight-forward and easy to understand).

Can you write a program that generates a full inverted index like I did in the example above? Try testing it on the test game descriptions at the start of this post.

You may also have noticed that the offsets used here are of the tokens in the list. If you wanted to generate contexts (like Duck Duck Go or Google do just below the title of a result), you'd need to use the character offsets from the source document instead. Can you extend your program to support querying the inverted index, generating contexts based on the inverted index too?

Liked this post? Got your own thoughts on the subject? Having trouble with the challenges at the end? Comment below!

Art by Mythdael