Stardust Blog

Archive

Mailing List Articles Atom Feed Comments Atom Feed Twitter Reddit Facebook

Tag Cloud

3d account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio bash batch blog bookmarklet booting c sharp c++ challenge chrome os code codepen coding conundrums coding conundrums evolved command line compilers compiling compression css dailyprogrammer debugging demystification distributed computing documentation downtime electronics email embedded systems encryption es6 features event experiment external first impressions future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet io.js jabber javascript js bin labs learning library linux low level lua maintenance manjaro network networking nibriboard node.js operating systems performance photos php pixelbot portable privacy problem solving programming problems projects prolog protocol protocols pseudo 3d python reddit redis reference release releases resource review rust searching secrets security series list server software sorting source code control statistics storage svg technical terminal textures three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 xmpp xslt

Demystifying Inverted Indexes

(The magnifying glass in the above banner came from openclipart)

After writing the post that will be released after this one, I realised that I made a critical assumption that everyone knew what an inverted index was. Upon looking for an appropriate tutorial online, I couldn't find one that was close enough to what I did in Pepperminty Wiki, so I decided to write my own.

First, some context. What's Pepperminty Wiki? Well, it's a complete wiki engine in a single file of PHP. The source files are obviously not a single file, but it builds into a single file - making it easy to drop into any PHP-enabled web server.

One of its features is a full-text search engine. A personal wiki of mine has ~75k words spread across ~550 pages, and it manages to search them all in just ~450ms! It does this with the aid of an inverted index - which I'll be explaining in this post.

First though, we need some data to index. How about the descriptions of some video games?

Kerbal Space Program

In KSP, you must build a space-worthy craft, capable of flying its crew out into space, without killing them. At your disposal is a collection of parts, which must be assembled to create a functional ship. Each part has its own function and will affect the way a ship flies (or doesn't). So strap yourself in, and get ready to try some Rocket Science!

Cross Code

Meet Lea as she logs into an MMO of the distant future. Follow her steps as she discovers a vast world, meets other players and overcomes all the challenges of the game.

Fort Meow

Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow.

Factory Balls

Factory balls is the brand new bonte game I already announced yesterday. Factory balls takes part in the game design competition over at jayisgames. The goal of the design competition was to create a 'ball physics'-themed game. I hope you enjoy it!

Very cool, this should provide us with plenty of data to experiment with. Firstly, let's consider indexing. Take the Factory Balls description. We can split it up into tokens like this:

T o k e n s V V
factory balls is the brand new bonte game
i already announced yesterday factory balls takes
part in the game design competition over
at jayisgames the goal of the design
competition was to create a ball physics
themed game i hope you enjoy it

Notice how we've removed punctuation here, and made everything lowercase. This is important for the next step, as we want to make sure we consider Factory and factory to be the same word - otherwise when querying the index we'd have to remember to get the casing correct.

With our tokens sorted, we can now count them to create our index. It's like a sort of tally chart I guess, except we'll be including the offset in the text of every token in the list. We'll also be removing some of the most common words in the list that aren't very interesting - these are known as stop words. Here's an index generated from that Factory Balls text above:

Token Frequency Offsets
factory 2 0, 12
balls 2 1, 13
brand 1 4
new 1 5
bonte 1 6
game 3 7, 18, 37
i 2 8, 38
announced 1 10
yesterday 1 11
takes 1 14
design 2 19, 28
competition 2 20, 29
jayisgames 1 23
goal 1 25
create 1 32
ball 1 34
physics 1 35
themed 1 36
hope 1 39
enjoy 1 41

Very cool. Now we can generate an index for each page's content. The next step is to turn this into an inverted index. Basically, the difference between the normal index and a inverted index is that an entry in an inverted index contains not just the offsets for a single page, but all the pages that contain that token. For example, the Cross-Code example above also contains the token game, so the inverted index entry for game would contain a list of offsets for both the Factory Balls and Cross-Code pages.

Including the names of every page under every different token in the inverted index would be both inefficient computationally and cause the index to grow rather large, so we should assign each page a unique numerical id. Let's do that now:

Id Page Name
1 Kerbal Space Program
2 Cross Code
3 Fort Meow
4 Factory Balls

There - much better. In Pepperminty Wiki, this is handled by the ids class, which has a pair of public methods: getid($pagename) and getpagename($id). If an id can't be found for a page name, then a new id is created and added to the list (Pepperminty Wiki calls this the id index) transparently. Similarly, if a page name can't be found for an id, then null should be returned.

Now that we've got ids for our pages, let's look at generating that inverted index entry for game we talked about above. Here it is:

• Term: game
Id Frequency Offsets
2 1 31
3 1 5
4 3 5, 12, 23

Note how there isn't an entry for page id 1, as the Kerbal Space Program page doesn't contain the token game.

This, in essence, is the basics of inverted indexes. A full inverted index will contain an entry for every token that's found in at least 1 source document - though the approach used here is far from the only way of doing it (I'm sure there are much more advanced ways of doing it for larger datasets, but this came to mind from reading a few web articles and is fairly straight-forward and easy to understand).

Can you write a program that generates a full inverted index like I did in the example above? Try testing it on the test game descriptions at the start of this post.

You may also have noticed that the offsets used here are of the tokens in the list. If you wanted to generate contexts (like Duck Duck Go or Google do just below the title of a result), you'd need to use the character offsets from the source document instead. Can you extend your program to support querying the inverted index, generating contexts based on the inverted index too?

Liked this post? Got your own thoughts on the subject? Having trouble with the challenges at the end? Comment below!

Demystifying microphones: The difference between dynamics and condensers

Welcome back to another demystification post! This time, it's about microphones. I had a question recently about microphones and phantom power, and after doing some rather extensive research on the subject (unintentionally of course :P), I thought it a waste not to summarise it here.

Basically, phantom power is a +48V direct current that's transmitted through a microphone cable (not the kind you plug into your laptop I don't think - the big chunky ones). It's required by condenser microphones (though some use a battery instead), which have a pair of films (called diaphragms) which vibrate. When a current is passed through from one plate to the other, the physical sound gets converted into an electrical signal we can use.

Condenser microphones are much more sensitive than their dynamic microphone counterparts. They are able to better represent a wider range of frequencies - but as a result of this heightened sensitivity, you normally need a pop filter if you're recording vocals. In addition, they don't tend to perform too well in loud environments, such as concerts. Finally, they tend to be more expensive than dynamic microphones, too.

Dynamic microphones, on the other hand, don't require phantom power. They are basically a loudspeaker in reverse and generate the current themselves - they have a single diaphragm that's attached to a metal core - which in turn has a coil of wire around it. When the diaphragm vibrates, so does the metal core - and as you can probably guess, a current is induced in the coil, as metal cores tend to do when inside coils of conveniently placed wires.

While they are better in loud environments (like concerts and drum kits), dynamic microphones aren't so good at representing a wide ranges of frequencies - and as such they are usually tailored to be pick up a specific frequency range better than others. Furthermore, they aren't as sensitive in general as your average condenser microphone, so they don't get on particularly well with very quiet sounds either.

Which you use generally depends on what you want to do. If you've got an overly enthusiastic drummer in a rock concert, you probably want a dynamic microphone. On the other hand, if you're trying to record the song of a cricket on a still summer's evening, you probably want to keep a condenser microphone handy.

I'm not an audio expert, so I might have made a few mistakes here and there! If you spot one, please do let me know in the comments below :-)

Demystifying traceroute

(Image from labnol. View the full map at submarinecablemap.com.)

A little while ago someone I know seemed a little bit confused as to how a traceroute interacts with a firewall, so I decided to properly look into and write this post.

Traceroute is the practice of sending particular packets across a network in order to discover all the different hops between the source computer and the destination. For example, here's a traceroute between starbeamrainbowlabs.com and bbc.co.uk:

traceroute to bbc.co.uk (212.58.244.23), 30 hops max, 60 byte packets
1  125.ip-37-187-66.eu (37.187.66.125)  0.100 ms  0.015 ms  0.011 ms
2  be10-147.rbx-g1-a9.fr.eu (37.187.231.169)  0.922 ms  0.912 ms  0.957 ms
3  be100-1187.ldn-1-a9.uk.eu (91.121.128.87)  7.536 ms  7.538 ms  7.535 ms
4  * * *
5  ae-1-3104.ear2.London2.Level3.net (4.69.143.190)  18.481 ms  18.676 ms  18.903 ms
6  unknown.Level3.net (212.187.139.230)  10.725 ms  10.434 ms  10.415 ms
7  * * *
8  ae0.er01.telhc.bbc.co.uk (132.185.254.109)  10.565 ms  10.666 ms  10.603 ms
9  132.185.255.148 (132.185.255.148)  12.123 ms  11.781 ms  11.529 ms
10  212.58.244.23 (212.58.244.23)  10.596 ms  10.587 ms  65.243 ms

As you can see, there are quite a number of hops between us and the BBC, not all of which responded to attempts to probe them. Before we can speculate as to why, it's important to understand how a traceroute is performed.

There are actually a number of different methods to perform a traceroute, but they all have a few things in common. The basic idea exploits something called time to live (TTL). This is a special value that all IP packets have (located 16 bytes into an ipv4 header, and 7 bytes into an ipv6 header for those who are curious) that determines the maximum number of hops that a packet is allowed to go through before it is dropped. Every hop along a packet's route decreases this value by 1. When it reaches 0, an ICMP TTL Exceeded message is returned to the source of the packet. This message can be used to discover the hops between a given source and destination.

With that out of the way, we can move on to the different methods of generating this response from every hop along a given route. Linux comes with a traceroute utility built-in, and this is the tool that I'm going to be investigating. If you're on Windows, you can use tracert, but it doesn't have as many options as the Linux version.

Linux's traceroute utility defaults to using UDP packets on an uncommon port. It defaults to this because it's the best method that unprivileged users can use if they have a kernel older than 3.0 (check your kernel version with uname -r). It isn't ideal though, because many hosts don't expect incoming UDP packets and silently drop them.

Adding the -I flag causes traceroute to use ICMP ping requests instead. Thankfully most hosts will respond to ICMP pings, making it a much better probing tool. Some networks, however, don't allow ping requests to pass through their gateways (usually large institutions and schools), rendering this method useless in certain situations.

To combat the above, a new method was developed that uses TCP SYN packets instead of UDP or ICMP ping. If you send a TCP SYN packet (manipulating the TTL as above), practically all hosts will return some kind of message. This is commonly referred to as the TCP half-open technique, and defaults to port 80 - this allows the traceroute to bypass nearly all firewalls. If you're behind a proxy though I suspect it'll snag on it - theoretically speaking using port 443 instead should rectify this problem in most cases (i.e. traceroute -T -p 443 hostname.tld).

Traceroute has a bunch of other less reliable methods, which I'll explain quickly below.

• -U causes traceroute to use UDP on port 53. This method usually only elicits responses from DNS servers along the route.
• -UL makes traceroute use udplite in a similar fashion to UDP in the bullet point above. This is only available to administrators.
• DCCP can also be used with the -D. It works similar to the TCP method described earlier.
• A raw IP packet can also be used, but I can't think of any reasons you'd use this.

That just about covers all the different traceroute methods. If you have any questions, please leave a comment below.

Demystifying UDP

Yesterday I was taking a look at [UDP Multicast], and attempting to try it out in C#. Unfortunately, I got a little bit confused as to how it worked, and ended up sending a couple of hours wondering what I did wrong. I'm writing this post to hopefully save you the trouble of fiddling around trying to get it to work yourself.

UDP stands for User Datagram Protocol (or Unreliable Datagram Protocol). It offers no guarantee that message sent will be received at the other end, but is usually faster than its counterpart, TCP. Each UDP message has a source and a destination address, a source port, and a destination port.

When you send a message to a multicast address (like the 239.0.0.0/8 range or the FF00::/8 range for ipv6, but that's a little bit more complicated), your router will send a copy of the message to all the other interested hosts on your network, leaving out hosts that have not registered their interest. Note here that an exact copy of the original message is sent to all interested parties. The original source and destination addresses are NOT changed by your router.

With that in mind, we can start to write some code.

IPAddress multicastGroup = IPAddress.Parse("239.1.2.3");
int port = 43;
IPEndPoint channel = new IPEndPoint(multicastGroup, port);
UdpClient client = new UdpClient(43);
client.JoinMulticastGroup(multicastGroup);

In the above, I set up a few variables or things like the multicast address that we are going to join, the port number, and so on. I pass the port number to the new UdpClient I create, letting it know that we are interested in messages sent to that port. I also create a variable called channel, which we will be using later.

Next up, we need to figure out a way to send a message. Unfortunately, the UdpClient class only supports sends arrays of bytes, so we will be have to convert anything we want to send to and from a byte array. Thankfully though this isn't too tough:

string data = "1 2, 1 2, Testing!";
byte[] payload = Encoding.UTF8.GetBytes(data);
string message = Encoding.UTF8.GetString(payload);

The above converts a simple string to and from a byte[] array. If you're interested, you can also serialise and deserialise C♯ objects to and from a byte[] array by using Binary Serialisation. Anyway, we can now write a method to send a message across the network. Here's what I came up with:

private static void Send(string data)
{
Console.WriteLine("Sending '{0}' to {1}.", data, destination);
byte[] payload = Encoding.UTF8.GetBytes(data);
}
private static void Send(byte[] payload)
{
}

Here I've defined a method to send stuff across the network for me. I've added an overload, too, which automatically converts string into byte[] arrays for me.

Putting the above together will result in a multicast message being sent across the network. This won't do us much good though unless we can also receive messages from the network too. Let's fix that:

public static async Task Listen()
{
while(true)
{
string message = Encoding.UTF8.GetString(result.Buffer);
Console.WriteLine("{0}: {1}", result.RemoteEndPoint, message);
}
}

You might not have seen (or heard of) asynchronous C# before, but basically it's a ways of doing another thing whilst you are waiting for one thing to complete. Dot net perls have a good tutorial on the subject if you want to read up on it.

For now though, here's how you call an asynchronous method from a synchronous one (like the Main() method since that once can't be async apparently):

Task.Run(() => Listen).Wait();

If you run the above in one program while sending a message in another, you should see something appear in the console of the listener. If not, your computer may not be configured to receive multicast messages that were sent from itself. In this case try running the listener on a different machine to the sender. In theory you should be able to run the listener on as many hosts on your local network as you want and they should all receive the same message.

Art by Mythdael