
RhinoReminds: An XMPP reminder bot for my convenience

(Above: A Picture of a Black Rhino. Source: WikiMedia Commons)

Many times when I write a program it's to solve a problem. With Pepperminty Wiki, it was that I needed a lightweight wiki engine - and MediaWiki was just too complex. With TeleConsole, it was that I wanted to debug a C♯ Program in an environment that had neither a debugger nor a console (I'm talking about you, Unity 3D).

Today, I'm releasing RhinoReminds, an XMPP bot that reminds you about things. As you might have guessed, this is the end product of a few different posts I've made on here recently - namely the ones on extracting key information from natural language, writing an XMPP bot in half an hour, and creating a system service with systemd.

You can talk to it like so:

Remind me to water the greenhouse tomorrow at 4:03pm
Show all reminders
Delete reminders 2, 3, 4, and 7
Remind me in 1 hour to check the oven

...and it'll respond accordingly. It figures out which action to take based on the first word of the sentence you send it, but after that it uses AI (specifically Microsoft.Recognizers.Text, which I posted about here) to work out what you want it to do and how you want it done.
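I won't reproduce the bot's source here, but the dispatch logic boils down to something like this rough sketch (the helper method names are made up for illustration - see the repository below for the real thing):

// Hypothetical sketch of the dispatch logic - not the actual RhinoReminds source
string firstWord = message.Split(' ')[0].ToLowerInvariant();
switch (firstWord)
{
case "remind": CreateReminder(message); break; // date/time extracted with Microsoft.Recognizers.Text
case "show":
case "list": ListReminders(); break;
case "delete": DeleteReminders(message); break;
default: SendHelpText(); break;
}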

I'm still working out a few bugs (namely reconnecting automagically after the connection to the server is lost, and ensuring all the messages it sends in reply actually make sense), but it's at the point now where it's stable enough that I can release it to everyone who'd either like to use it for themselves, or is simply curious :-)

If you'd like to run an instance of the bot for yourself, I recommend heading over to my personal git server here:

https://git.starbeamrainbowlabs.com/sbrl/RhinoReminds

The readme file should contain everything you need to know to get started. If not, let me know by contacting me, or commenting here!

Unfortunately, I'm not able to offer a public instance of this bot at the moment, due to concerns about spam. However, patches to improve the bot's resistance against spammers (perhaps a cooldown period if too many messages are sent, or a limit of 50 active reminders per account?) are welcome.
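To illustrate the sort of thing I have in mind, a per-sender cooldown could look something like this rough sketch (nothing like this exists in the codebase yet - the names here are hypothetical):

// Hypothetical sketch of a per-sender cooldown - not currently part of RhinoReminds
private static readonly Dictionary<string, DateTime> lastSeen = new Dictionary<string, DateTime>();
private static readonly TimeSpan cooldown = TimeSpan.FromSeconds(5);

private static bool IsRateLimited(string bareJid)
{
// Reject messages that arrive too soon after the previous one from the same JID
if (lastSeen.TryGetValue(bareJid, out DateTime previous) && DateTime.UtcNow - previous < cooldown)
return true;
lastSeen[bareJid] = DateTime.UtcNow;
return false;
}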

Found this interesting? Got a cool use for it? Want help setting it up yourself? Comment below!

Write an XMPP bot in half an hour

Recently I've looked at using AI to extract key information from natural language, and creating a system service with systemd. The final piece of the puzzle is to write the bot itself - and that's what I'm posting about today.

Since I not only use XMPP for instant messaging already, but it's also an open, federated standard, I'll be building my bot on top of it for maximum flexibility.

To talk over XMPP programmatically, we're going to need a library. Thankfully, I've located just such a library which appears to work well enough, called S22.XMPP. Especially nice is the comprehensive documentation that makes development go much more smoothly.

With our library in hand, let's begin! Our first order of business is to get some scaffolding in place to parse out the environment variables we'll need to login to an XMPP account.

using System;
using System.Linq;
using System.Threading;

using S22.Xmpp;
using S22.Xmpp.Client;
using S22.Xmpp.Im;

namespace XmppBotDemo
{
public static class MainClass
{
// Needed later
private static XmppClient client;

// Settings
private static Jid ourJid = null;
private static string password = null;

public static int Main(string[] args)
{
// Read in the environment variables
ourJid = new Jid(Environment.GetEnvironmentVariable("XMPP_JID"));
password = Environment.GetEnvironmentVariable("XMPP_PASSWORD");

// Ensure they are present
if (ourJid == null || password == null) {
Console.Error.WriteLine("XMPP Bot Demo");
Console.Error.WriteLine("=============");
Console.Error.WriteLine("");
Console.Error.WriteLine("Usage:");
Console.Error.WriteLine("    ./XmppBotDemo.exe");
Console.Error.WriteLine("");
Console.Error.WriteLine("Environment Variables:");
Console.Error.WriteLine("    XMPP_JID         Required. Specifies the JID to login with.");
Console.Error.WriteLine("    XMPP_PASSWORD    Required. Specifies the password to login with.");
return 1;
}

// TODO: Connect here

return 0;
}
}
}

Excellent! We're reading in & parsing 2 environment variables: XMPP_JID (the username), and XMPP_PASSWORD. It's worth noting that you can call these environment variables anything you like! I chose those names as they describe their contents well. It's also worth mentioning that it's important to use environment variables for secrets: passing them as command-line arguments causes them to be much more visible to other users of the system!

Let's connect to the XMPP server with our newly read-in credentials:

// Create the client instance
client = new XmppClient(ourJid.Domain, ourJid.Node, password);

client.Error += errorHandler;
client.SubscriptionRequest += subscriptionRequestHandler;
client.Message += messageHandler;

client.Connect();

// Wait for a connection
while (!client.Connected)
Thread.Sleep(100);

Console.WriteLine($"[Main] Connected as {ourJid}.");

// Wait forever.
Thread.Sleep(Timeout.Infinite);

// TODO: Automatically reconnect to the server when we get disconnected.

Cool! Here, we create a new instance of the XmppClient class, and attach 3 event handlers, which we'll look at later. We then connect to the server, wait until the connection completes - and then write a message to the console.

It looks like S22.Xmpp spins up a new thread, so unfortunately we can't catch any errors it throws with a traditional try-catch statement. Instead, we'll have to be really careful to catch any exceptions we throw accidentally - otherwise we'll get disconnected! It does appear that XmppClient catches some errors though, which trigger the Error event - so we should attach an event handler to that.

/// <summary>
/// Handles any errors thrown by the XMPP client engine.
/// </summary>
private static void errorHandler(object sender, ErrorEventArgs eventArgs) {
Console.Error.WriteLine($"Error: {eventArgs.Reason}");
Console.Error.WriteLine(eventArgs.Exception);
}

Before a remote contact is able to talk to our bot, they will send us a subscription request - which we'll need to either accept or reject. This is also done via an event handler. It's the SubscriptionRequest one this time:

/// <summary>
/// Handles requests to talk to us.
/// </summary>
/// <remarks>
/// Only allow people to talk to us if they are on the same domain we are.
/// You probably don't want this for production, but for developmental purposes
/// it offers some measure of protection.
/// </remarks>
/// <param name="from">The JID of the remote user who wants to talk to us.</param>
/// <returns>Whether we're going to allow the requester to talk to us or not.</returns>
public static bool subscriptionRequestHandler(Jid from) {
Console.WriteLine($"[Handler/SubscriptionRequest] {from} is requesting access, I'm saying {(from.Domain == ourJid.Domain?"yes":"no")}");
return from.Domain == ourJid.Domain;
}

This simply allows anyone on our own domain to talk to us. For development purposes this will offer us some measure of protection, but for production you should probably implement a whitelisting or logging system here.

The other interesting thing we can do here is send a user a chat message, to either welcome them to the server or explain why we rejected their request. To do this, we need to write a pair of utility methods, as sending chat messages with S22.Xmpp is somewhat over-complicated:

#region Message Senders

/// <summary>
/// Sends a chat message to the specified JID.
/// </summary>
/// <param name="to">The JID to send the message to.</param>
/// <param name="message">The message to send.</param>
private static void sendChatMessage(Jid to, string message)
{
//Console.WriteLine($"[Bot/Send/Chat] Sending {message} -> {to}");
client.SendMessage(
to, message,
null, null, MessageType.Chat
);
}

/// <summary>
/// Sends a chat message in direct reply to a given incoming message.
/// </summary>
/// <param name="originalMessage">Original message.</param>
/// <param name="reply">Reply.</param>
private static void sendChatReply(Message originalMessage, string reply)
{
//Console.WriteLine($"[Bot/Send/Reply] Sending {reply} -> {originalMessage.From}");
client.SendMessage(
originalMessage.From, reply,
null, originalMessage.Thread, MessageType.Chat
);
}

#endregion

The difference between these 2 methods is that one sends a reply directly to a message that we've received (like a threaded reply), and the other simply sends a message directly to another contact.

Now that we've got all of our ducks in a row, we can write the bot itself! This is done via the Message event handler. For this demo, we'll write a bot that echoes any messages sent to it in reverse:

/// <summary>
/// Handles incoming messages.
/// </summary>
private static void messageHandler(object sender, MessageEventArgs eventArgs) {
Console.WriteLine($"[Bot/Handler/Message] {eventArgs.Message.Body.Length} chars from {eventArgs.Jid}");

char[] messageCharArray = eventArgs.Message.Body.ToCharArray();
Array.Reverse(messageCharArray);
sendChatReply(
eventArgs.Message,
new string(messageCharArray)
);
}

Excellent! That's our bot complete. The full program is at the bottom of this post.

Of course, this is a starting point - not an ending point! A number of issues with this demo stand out. There isn't a whitelist, and putting the whole program in a single file doesn't sound like a good idea. The XMPP logic should probably be refactored out into a separate file, in order to keep the input settings parsing separate from the bot itself.

Other issues that probably need addressing include better error handling and more - but fixing them all here would rather complicate the example.
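For instance, a basic whitelist in the subscription request handler might look something like this (a sketch only - the JIDs are made up, and in practice you'd probably load them from a config file):

// Sketch: only accept subscription requests from a fixed set of JIDs
private static readonly HashSet<string> whitelist = new HashSet<string>() {
"alice@bobsrockets.com",
"bob@bobsrockets.com"
};

public static bool subscriptionRequestHandler(Jid from) {
return whitelist.Contains($"{from.Node}@{from.Domain}");
}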

Edit: The code is also available in a git repository if you'd like to clone it down and play around with it :-)

Found this interesting? Got a cool use for it? Still confused? Comment below!

Complete Program

using System;
using System.Linq;
using System.Threading;
using S22.Xmpp;
using S22.Xmpp.Client;
using S22.Xmpp.Im;

namespace XmppBotDemo
{
public static class MainClass
{
private static XmppClient client;
private static Jid ourJid = null;
private static string password = null;

public static int Main(string[] args)
{
// Read in the environment variables
ourJid = new Jid(Environment.GetEnvironmentVariable("XMPP_JID"));
password = Environment.GetEnvironmentVariable("XMPP_PASSWORD");

// Ensure they are present
if (ourJid == null || password == null) {
Console.Error.WriteLine("XMPP Bot Demo");
Console.Error.WriteLine("=============");
Console.Error.WriteLine("");
Console.Error.WriteLine("Usage:");
Console.Error.WriteLine("    ./XmppBotDemo.exe");
Console.Error.WriteLine("");
Console.Error.WriteLine("Environment Variables:");
Console.Error.WriteLine("    XMPP_JID         Required. Specifies the JID to login with.");
Console.Error.WriteLine("    XMPP_PASSWORD    Required. Specifies the password to login with.");
return 1;
}

// Create the client instance
client = new XmppClient(ourJid.Domain, ourJid.Node, password);

client.Error += errorHandler;
client.SubscriptionRequest += subscriptionRequestHandler;
client.Message += messageHandler;

client.Connect();

// Wait for a connection
while (!client.Connected)
Thread.Sleep(100);

Console.WriteLine($"[Main] Connected as {ourJid}.");

// Wait forever.
Thread.Sleep(Timeout.Infinite);

// TODO: Automatically reconnect to the server when we get disconnected.

return 0;
}

#region Event Handlers

/// <summary>
/// Handles requests to talk to us.
/// </summary>
/// <remarks>
/// Only allow people to talk to us if they are on the same domain we are.
/// You probably don't want this for production, but for developmental purposes
/// it offers some measure of protection.
/// </remarks>
/// <param name="from">The JID of the remote user who wants to talk to us.</param>
/// <returns>Whether we're going to allow the requester to talk to us or not.</returns>
public static bool subscriptionRequestHandler(Jid from) {
Console.WriteLine($"[Handler/SubscriptionRequest] {from} is requesting access, I'm saying {(from.Domain == ourJid.Domain?"yes":"no")}");
return from.Domain == ourJid.Domain;
}

/// <summary>
/// Handles incoming messages.
/// </summary>
private static void messageHandler(object sender, MessageEventArgs eventArgs) {
Console.WriteLine($"[Handler/Message] {eventArgs.Message.Body.Length} chars from {eventArgs.Jid}");

char[] messageCharArray = eventArgs.Message.Body.ToCharArray();
Array.Reverse(messageCharArray);
sendChatReply(
eventArgs.Message,
new string(messageCharArray)
);
}

/// <summary>
/// Handles any errors thrown by the XMPP client engine.
/// </summary>
private static void errorHandler(object sender, ErrorEventArgs eventArgs) {
Console.Error.WriteLine($"Error: {eventArgs.Reason}");
Console.Error.WriteLine(eventArgs.Exception);
}

#endregion

#region Message Senders

/// <summary>
/// Sends a chat message to the specified JID.
/// </summary>
/// <param name="to">The JID to send the message to.</param>
/// <param name="message">The messaage to send.</param>
private static void sendChatMessage(Jid to, string message)
{
//Console.WriteLine($"[Rhino/Send/Chat] Sending {message} -> {to}");
client.SendMessage(
to, message,
null, null, MessageType.Chat
);
}

/// <summary>
/// Sends a chat message in direct reply to a given incoming message.
/// </summary>
/// <param name="originalMessage">Original message.</param>
/// <param name="reply">Reply.</param>
private static void sendChatReply(Message originalMessage, string reply)
{
//Console.WriteLine($"[Rhino/Send/Reply] Sending {reply} -> {originalMessage.From}");
client.SendMessage(
originalMessage.From, reply,
null, originalMessage.Thread, MessageType.Chat
);
}

#endregion
}
}

Easy AI with Microsoft.Text.Recognizers

I recently discovered that there's an XMPP client library (NuGet) for .NET that I overlooked a few months ago, and so I promptly investigated the building of a bot!

The actual bot itself needs some polishing before I post about it here, but in writing said bot I stumbled across a perfectly brilliant library - released by Microsoft of all companies - that can be used to automatically extract common data-types from a natural-language sentence.

While said library is the underpinnings of the Azure Bot Framework, it's actually free and open-source. To that end, I decided to experiment with it - and ended up writing this blog post.

Data types include (but are not limited to) dates and times (and ranges thereof), numbers, percentages, monetary amounts, email addresses, phone numbers, simple binary choices, and more!

While it also lands you with a terrific number of DLL dependencies in your build output folder, the result is totally worth it! How about pulling a DateTime from this:

in 5 minutes

or this:

the first Monday of January

or even this:

next Monday at half past six

Pretty cool, right? You can even pull multiple things out of the same sentence. For example, from the following:

The host 1.2.3.4 has been down 3 times over the last month - the last of which was from 5pm and lasted 30 minutes

It can extract an IP address (1.2.3.4), a number (3), and a few dates and times (last month, 5pm, 30 minutes).

I've written a test program that shows it in action. Here's a demo of it working:

(Can't see the asciicast above? View it on asciinema.org)

The source code is, of course, available on my personal Git server: Demos/TextRecogniserDemo

If you can't check out the repo, here's the basic gist. First, install the Microsoft.Recognizers.Text package(s) for the types of data that you'd like to recognise. Then, to recognise a date or time, do this:

List<ModelResult> result = DateTimeRecognizer.RecognizeDateTime(nextLine, Culture.English);

The awkward bit is unwinding the ModelResult to get at the actual data. The matched text is stored in the ModelResult.Text property, while the parsed data lives in ModelResult.Resolution - but that's a SortedDictionary<string, object>. The interesting key inside it is values - and depending on the data type you're recognising, what's stored there can be an array too! The best way I've found to decipher the structure is to print the value of ModelResult.Resolution as a string to the console:

Console.WriteLine(result[0].Resolution.ToString());

The .NET runtime will helpfully convert this into something like this:

System.Collections.Generic.SortedDictionary`2[System.String,System.Object]

Very helpful. Then we can continue to drill down:

Console.WriteLine(result[0].Resolution["values"]);

This produces this:

System.Collections.Generic.List`1[System.Collections.Generic.Dictionary`2[System.String,System.String]]

Quite a mouthful, right? By cross-referencing this against the JSON (thanks, Newtonsoft.JSON!), we can figure out how to drill the rest of the way. I ended up writing myself a pair of little utility methods for dates and times:

public static DateTime RecogniseDateTime(string source, out string rawString)
{
List<ModelResult> aiResults = DateTimeRecognizer.RecognizeDateTime(source, Culture.English);
if (aiResults.Count == 0)
throw new Exception("Error: Couldn't recognise any dates or times in that source string.");

/* Example contents of the below dictionary:
[0]: {[timex, 2018-11-11T06:15]}
[1]: {[type, datetime]}
[2]: {[value, 2018-11-11 06:15:00]}
*/

rawString = aiResults[0].Text;
Dictionary<string, string> aiResult = unwindResult(aiResults[0]);
string type = aiResult["type"];
if (!(new string[] { "datetime", "date", "time", "datetimerange", "daterange", "timerange" }).Contains(type))
throw new Exception($"Error: An invalid type of {type} was encountered ('datetime' expected).");

string result = Regex.IsMatch(type, @"range$") ? aiResult["start"] : aiResult["value"];
return DateTime.Parse(result);
}

private static Dictionary<string, string> unwindResult(ModelResult modelResult)
{
return (modelResult.Resolution["values"] as List<Dictionary<string, string>>)[0];
}

Of course, it depends on your use-case as to precisely how you unwind it, but the above should be a good starting point.
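As a quick usage sketch, calling the helper above might look something like this (the input sentence is just an example):

string rawString;
DateTime when = RecogniseDateTime("Remind me in 5 minutes to check the oven", out rawString);
Console.WriteLine($"Matched '{rawString}', parsed as {when}.");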

Once I've polished the bot I've written a bit, I might post about it on here.

Found this interesting? Run into an issue? Got a neat use for it? Comment below!

Markov Chains Part 4: Test Data

With a shiny-new markov chain engine (see parts 1, 2, and 3), I found that I had a distinct lack of test data to put through it. Obviously this was no good at all, so I decided to do something about it.

Initially, I started with a list of HTML colours (direct link; 8.6KiB), but that didn't produce very good output:

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --length 16
errobiartrawbear
frelecteringupsy
vendellorazanigh
arvanginklectrit
dighoonbottlaven
ndiu
llighoolequorain
indeesteadesomiu

I see a few problems here. Firstly, it's treating each word as its own entity, where in fact I'd like it to generate n-grams on a line-by-line basis. Thankfully, this is easy enough with my new --no-split option:

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --no-split --length 16
med carrylight b
jungin pe red dr
ureelloufts blue
uamoky bluellemo
trinaterry aupph
utatellon reep g
bian reep mardar
ght burnse greep
atimson-phloungu

Hrm, that's still rather unreadable. What if we make the n-grams longer by bumping the order?

MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --wordlist wordlists/Colours.txt --length 16 --order 4
on fuchsia blue
rsity of carmili
e blossom per sp
ngel
ulean red au lav
as green yellowe
indigri
ly gray aspe
disco blus
berry pine blach

Better, but it looks like it's starting the generation process from inside the middle of words. We can fix that with my new --start-uppercase option, which ensures that each output always starts with an n-gram that begins with a capital letter. Unfortunately, the wordlist is all lowercase:

air force blue
alice blue
alizarin crimson
almond
amaranth
amber
american rose
amethyst
android green
anti-flash white

This is an issue. The other problem is that with an order of 4 the choice-point ratio is dropping quite low - I got a low of just ~0.97 in my testing.

The choice-point ratio is a measure I came up with of the average number of different directions the engine could potentially go in at each step of the generation process. I'd like to keep this number consistently higher than 2, at least - to ensure a good variety of output.
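For reference, here's a rough sketch of how such a ratio could be calculated - this isn't the engine's actual code, just an illustration of the idea, where prefixesSeen would be the list of prefixes the generator matched against at each step:

// Sketch: average the number of candidate n-grams available at each generation step
public static double ChoicePointRatio(IEnumerable<string> prefixesSeen, IDictionary<string, double> ngrams)
{
List<int> counts = new List<int>();
foreach (string prefix in prefixesSeen)
counts.Add(ngrams.Keys.Count(gram => gram.StartsWith(prefix)));
return counts.Count > 0 ? counts.Average() : 0;
}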

Greener Pastures

Rather than try to fix that wordlist, let's go in search of something better. It looks like the CrossCode Wiki has a page that lists all the items in the entire game. That should do the trick! The only problem is extracting it, though. Let's use a bit of bash! We can use curl to download the HTML of the page, and then xidel to parse out the text from the <a> tags inside tables. Here's what I came up with:

curl https://crosscode.gamepedia.com/Items | xidel --data - --css "table a"

This is a great start, but we've got blank lines in there, and the list isn't sorted alphabetically (not required, but makes it look nice :P). Let's fix that:

curl https://crosscode.gamepedia.com/Items | xidel --data - --css "table a" | awk "NF > 0" | sort

Very cool. Tacking wc -l on the end of the pipe chain, I can see we've got ourselves a list of 527(!) items! Here's a selection of input lines:

Rough Branch
Raw Meat
Blue Grass
Crystal Plate
Humming Razor
Everlasting Amber
Tracker Chip
Lawkeeper's Fist

Let's run it through the engine. After a bit of tweaking, I came up with this:

cat wordlists/Cross-Code-Items.txt | MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --start-uppercase --no-split --length 16 --order 3
Capt Keboossauci
Fajiz Keblathfin
King Steaf Sharp
Stintze Geakralt
Fruisty 'olipe F
Apper's TN
Prow Peptumn's C
Rus Recreetan Co
Veggiel Spiragma
Laver's Bolden M

That's quite interesting! With a choice-point ratio of ~5.6 at an order of 3, we've got a nice variable output. If we increase the order to 4, we get ~1.5 - ~2.3:

Edgy Hoo
Junk Petal Goggl
Red Metal Need C
Samurai Shel
Echor
Krystal Wated Li
Sweet Residu
Raw Stomper Thor
Purple Fruit Dev
Smokawa

It appears to be cutting off at the end though. Not sure what we can do about that (ideas welcome!). This looks interesting, but I'm not done yet. I'd like it to work on word-level too!

Going up a level

After making some pretty extensive changes, I managed to add support for this. Firstly, I needed to add support for word-level n-gram generation. Currently, I've done this with a new GenerationMode enum.

public enum GenerationMode
{
CharacterLevel,
WordLevel
}

Under the hood I've just used a few if statements. Fortunately, in the case of the weighted generator, only the bottom method needed adjusting:

/// <summary>
/// Generates a dictionary of weighted n-grams from the specified string.
/// </summary>
/// <param name="str">The string to n-gram-ise.</param>
/// <param name="order">The order of n-grams to generate.</param>
/// <returns>The weighted dictionary of ngrams.</returns>
private static void GenerateWeighted(string str, int order, GenerationMode mode, ref Dictionary<string, int> results)
{
if (mode == GenerationMode.CharacterLevel) {
for (int i = 0; i < str.Length - order; i++) {
string ngram = str.Substring(i, order);
if (!results.ContainsKey(ngram))
results[ngram] = 0;
results[ngram]++;
}
}
else {
string[] parts = str.Split(" ".ToCharArray());
for (int i = 0; i < parts.Length - order; i++) {
string ngram = string.Join(" ", parts.Skip(i).Take(order)).Trim();
if (ngram.Trim().Length == 0) continue;
if (!results.ContainsKey(ngram))
results[ngram] = 0;
results[ngram]++;
}
}
}

Full code available here. After that, the core generation algorithm was next. The biggest change - apart from adding a setting for the GenerationMode enum - was the main while loop. This was a case of updating the condition to count the number of words instead of the number of characters in word mode:

(Mode == GenerationMode.CharacterLevel ? result.Length : result.CountCharInstances(" ".ToCharArray()) + 1) < length

A simple ternary if statement did the trick. I ended up tweaking it a bit to optimise it - the above is the end result (full code available here). Instead of counting the words, I count the number of spaces and add 1. That CountCharInstances() method there is an extension method I wrote to simplify things. Here it is:

public static int CountCharInstances(this string str, char[] targets)
{
int result = 0;
for (int i = 0; i < str.Length; i++) {
for (int t = 0; t < targets.Length; t++)
if (str[i] == targets[t]) result++;
}
return result;
}

Recursive issues

After making these changes, I needed some (more!) test data. Inspiration struck: I could run it on recipe names! They've quite often got more than 1 word, but not too many. Searching for such a list proved to be a challenge though. My first thought was BBC Food, but their terms of service disallow scraping :-(

A couple of different websites later, and I found the Recipes Wikia. Thousands of recipes, just ready and waiting! Time to get to work scraping them. My first stop was, naturally, the sitemap (though how I found it in the first place I really can't remember :P).

What I was greeted with, however, was a bit of a shock:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p1.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p2.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p3.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p4.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p5.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p6.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p7.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p8.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_0-p9.xml</loc></sitemap>
<sitemap><loc>http://recipes.wikia.com/sitemap-newsitemapxml-NS_14-p1.xml</loc></sitemap>
<sitemap><loc>https://services.wikia.com/discussions-sitemap/sitemap/3355</loc></sitemap>
</sitemapindex>
<!-- Generation time: 26ms -->
<!-- Generation date: 2018-10-25T10:14:26Z -->

Like who has a sitemap of sitemaps, anyways?! We better do something about this: Time for some more bash! Let's start by pulling out those sitemaps.

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc"

Easy peasy! Next up, we don't want that bottom one - as it appears to have a bunch of discussion pages and other junk in it. Let's strip it out before we even download it!

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0

With a list of sitemaps extracted from the sitemap index (completely coconuts, I tell you), we need to download them all in turn and extract the page urls therein. This is, unfortunately, where it starts to get nasty. While a simple xargs call downloads them all easily enough (| xargs -n1 -I{} curl "{}" should do the trick), this outputs them all to stdout, and makes it very difficult for us to parse them.

I'd like to avoid shuffling things around on the file system if possible, as this introduces further complexity. We're not out of options yet though, as we can pull a subshell out of our proverbial hat:

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 | xargs -n1 -I{} sh -c 'curl {} | xidel --data - --css "loc"'

Yay! Now we're getting a list of urls to all the pages on the entire wiki:

http://recipes.wikia.com/wiki/Mexican_Black_Bean_Soup
http://recipes.wikia.com/wiki/Eggplant_and_Roasted_Garlic_Babakanoosh
http://recipes.wikia.com/wiki/Bathingan_bel_Khal_Wel_Thome
http://recipes.wikia.com/wiki/Lebanese_Tabbouleh
http://recipes.wikia.com/wiki/Lebanese_Hummus_Bi-tahini
http://recipes.wikia.com/wiki/Baba_Ghannooj
http://recipes.wikia.com/wiki/Lebanese_Falafel
http://recipes.wikia.com/wiki/Kebab_Koutbane
http://recipes.wikia.com/wiki/Moroccan_Yogurt_Dip

One problem though: We want recipe names, not urls! Let's do something about that. Our next special guest that inhabits our bottomless hat is the illustrious sed. Armed with the mystical power of find-and-replace, we can make short work of these urls:

... | sed -e 's/^.*\///g' -e 's/_/ /g'

The rest of the command is omitted for clarity. Here I've used 2 sed scripts: One to strip everything up to the last forward slash /, and another to replace the underscores _ with spaces. We're almost done, but there are a few annoying hoops left to jump through. Firstly, there are a bunch of unfortunate escape sequences lying around (I actually only discovered this when the engine started spitting out random ones :P). Also, there are far too many page names that contain the word Nutrient, oddly enough.

The latter is easy to deal with. A quick grep sorts it out:

... | grep -iv "Nutrient"

The former is awkward and annoying. As far as I can tell, there's no command I can call that will decode escape sequences. To this end, I wound up embedding some Python:

... | python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])"

This makes the whole thing much more intimidating than it would otherwise be. Lastly, I'd really like to sort the list and save it to a file. Compared to the above, this is chicken feed!

| sort >Dishes.txt

And there we have it. Bash is very much like lego bricks when you break it down. The trick is to build it up step-by-step until you've got something that does what you want it to :)

Here's the complete command:

curl http://recipes.wikia.com/sitemap-newsitemapxml-index.xml | xidel --data - --css "loc" | grep -i NS_0 | xargs -n1 -I{} sh -c 'curl {} | xidel --data - --css "loc"' | sed -e 's/^.*\///g' -e 's/_/ /g' | python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])" | grep -iv "Nutrient" | sort >Dishes.txt
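As an aside, if you'd rather not embed Python just for the escape-sequence decoding, that step could also be done with a tiny C# filter (a sketch - Uri.UnescapeDataString is the relevant standard library call):

using System;

class UrlDecodeFilter
{
// Reads lines on stdin, writes the %xx-decoded version to stdout
static void Main()
{
string line;
while ((line = Console.ReadLine()) != null)
Console.WriteLine(Uri.UnescapeDataString(line));
}
}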

After all that effort, I think we deserve something for our troubles! With ~42K(!) lines in the resulting file (42,039 to be exact as of the last time I ran the monster above :P), the output (after some tweaking, of course) is pretty sweet:

cat wordlists/Dishes.txt | mono --debug MarkovGrams/bin/Debug/MarkovGrams.exe markov-w --words --start-uppercase --length 8
Lemon Lime Ginger
Seared Tomatoes and Summer Squash
Cabbage and Potato au
Endive stuffed with Lamb and Winter Vegetable
Stuffed Chicken Breasts in Yogurt Turmeric Sauce with
Blossoms on Tomato
Morning Shortcake with Whipped Cream Graham
Orange and Pineapple Honey
Tempura with a Southwestern
Rice Florentine with
Cabbage Slaw with Pecans and Parmesan
Pork Sandwiches with Tangy Sweet Barbecue
Tea with Lemongrass and Chile Rice
Butterscotch Mousse Cake with Fudge
Fish and Shrimp -
Beans in the Slow
California Avocado Chinese Chicken Noodle Soup with Arugula

...I really need to do something about that cutting off issue. Other than that, I'm pretty happy with the result! The choice-point ratio is really variable, but most of the time it's hovering around ~2.5-7.5, which is great! The output if I lower the order from 3 to 2 isn't too bad either:

Salata me Htapodi kai
Hot 'n' Cheese Sandwich with Green Fish and
Poisson au Feuilles de Milagros
Valentines Day Cookies with Tofu with Walnut Rice
Up Party Perfect Turkey Tetrazzini
Olives and Strawberry Pie with Iceberg Salad with
Mashed Sweet sauce for Your Mood with Dried
Zespri Gold Corn rice tofu, and Corn Roasted
California Avocado and Rice Casserole with Dilled Shrimp
Egyptian Tomato and Red Bell Peppers, Mango Fandango

This gives us a staggering average choice-point ratio of ~125! Success :D

One more level

After this, I wanted to push the limits of the engine, to see what it's capable of. The obvious choice here is Shakespeare's Complete Works (~5.85MiB). Pushing this through the engine required some work - namely, optimising the pipeline as much as possible, as the initial ~30 seconds it took was far too slow.

The Mono Profiler helped a lot here. With it, I discovered that string.StartsWith() is really slow. Like, ridiculously slow (though this is relative, since I'm calling it hundreds of thousands of times), as it's culture-aware. In our case, we can't be bothering with any of that, as it's not relevant anyway. The easiest solution is to write another extension method:

public static bool StartsWithFast(this string str, string target) {
if (str.Length < target.Length) return false;
return str.Substring(0, target.Length) == target;
}

string.Substring() is faster, so utilising it instead of the regular string.StartsWith() yields a perfectly gigantic boost! Next up, I noticed that I could probably parallelise the Linq query that builds the list of possible n-grams we can choose from next, so that it runs on all the CPU cores:

Parallel.ForEach(ngrams, (KeyValuePair<string, double> ngramData) => {
if (!ngramData.Key.StartsWithFast(nextStartsWith)) return;
// Presumably convNextNgrams is a ConcurrentDictionary here; TryAdd returns false if the add fails
if (!convNextNgrams.TryAdd(ngramData.Key, ngramData.Value))
throw new Exception("Error: Failed to add to staging ngram concurrent dictionary");
});

Again, this netted another huge gain. With this and a few other architectural changes, I was able to chop the time down to a mere ~4 seconds (for a standard 10 words)! In the future, I might experiment with selective unmanaged code via the unsafe keyword to see if I can do any better.
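As a taster of what that might look like, here's a hedged sketch of a pointer-based prefix check (completely untested - it needs the /unsafe compiler flag, and may or may not actually beat the Substring() approach):

public static unsafe bool StartsWithUnsafe(this string str, string target)
{
if (str.Length < target.Length) return false;
fixed (char* pStr = str)
fixed (char* pTarget = target)
{
// Compare the first target.Length characters directly, without allocating a substring
for (int i = 0; i < target.Length; i++)
if (pStr[i] != pTarget[i]) return false;
}
return true;
}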

For now, it's fast enough to enjoy some random Shakespeare on-demand:

What should they since that to that tells me Nero
He doth it turn and and too, gentle
Ha! you shall not come hither; since that to provoke
ANTONY. No further, sir; a so, farewell, Sir
Bona, joins with
From fairies and the are like an ass which is
The very life-blood of our blood, no, not with the
Alas, why is here-in which
Transform'd and weak'ned? Hath Bolingbroke

Very interesting. The choice-point ratios sit at ~20 and ~3 for orders 3 and 4 respectively, though I got as high as 188 for an order of 3 during my testing. Definitely plenty of test data here :P

Conclusion

My experiments took me to several other places - which, if I included them all here, would result in a post much, much longer than this! I scripted the download of several other wordlists in download.sh (direct link, 4.2KiB), if you're interested, with ready-downloaded copies in the wordlists folder of the repository.

I would advise reading the table in the README that gives credit to where I sourced each list, because of course I didn't put any of them together myself - I just wrote the script :P

Particularly of note is the Starbound list, which contains a list of all the blocks and items from the game Starbound. I might post about that one separately, as it ended up being a most interesting adventure.

In the future, I'd like to look at implementing a linguistic drift algorithm, to try and improve the output of the engine. The guy over at Here Dragons Abound has a great post on the subject, which you should definitely read if you're interested.

Found this interesting? Got an idea for another wordlist I can push though my engine? Confused by something? Comment below!

Disassembling .NET Assemblies with Mono

As part of the Component-Based Architectures module on my University course, I've been looking at what makes the .NET ecosystem tick, and how .NET assemblies (i.e. .NET .exe / .dll files) are put together. In the process, we looked at disassembling .NET assemblies into the text form of the Common Intermediate Language (CIL) that they contain. The instructions on how to do this were Windows-specific though - so I thought I'd post about the process on Linux and other platforms here.

Our tool of choice will be Mono - but before we get to that we'll need something to disassemble. Here's a good candidate for the role:

using System;

namespace SBRL.Demo.Disassembly {
static class Program {
public static void Main(string[] args) {
int a = int.Parse(Console.ReadLine()), b = 10;
Console.WriteLine(
"{0} + {1} = {2}",
a, b,
a + b
);
}
}
}

Excellent. Let's compile it:

csc Program.cs

This should create a new Program.exe file in the current directory. Before we get to disassembling it, it's worth mentioning how the compilation and execution process works in .NET. It's best explained with the aid of a diagram:

As is depicted in the diagram above, source code in multiple languages gets compiled (maybe not with the same compiler, of course) into Common Intermediate Language, or CIL. This CIL is then executed in an Execution Environment - which is usually a virtual machine (Nope! Not as in VirtualBox and KVM. It's not a separate operating system as such, but rather a layer of abstraction), which may (or may not) decide to compile the CIL down into native code through a process called JIT (Just-In-Time compilation).

It's also worth mentioning here that the CIL code generated by the compiler is in binary form, as this takes up less space and is (much) faster for the computer to operate on. After all, CIL is designed to be efficient for a computer to understand - not people!

We can make it more readable by disassembling it into its textual equivalent. Doing so with Mono is actually quite simple:

monodis Program.exe >Program.il

Here I redirect the output to a file called Program.il for convenience, as my editor has a plugin for syntax-highlighting CIL. For those reading without access to Mono, here's what I got when disassembling the above program:

.assembly extern mscorlib
{
.ver 4:0:0:0
.publickeytoken = (B7 7A 5C 56 19 34 E0 89 ) // .z\V.4..
}
.assembly 'Program'
{
.custom instance void class [mscorlib]System.Runtime.CompilerServices.CompilationRelaxationsAttribute::'.ctor'(int32) =  (01 00 08 00 00 00 00 00 ) // ........

.custom instance void class [mscorlib]System.Runtime.CompilerServices.RuntimeCompatibilityAttribute::'.ctor'() =  (
01 00 01 00 54 02 16 57 72 61 70 4E 6F 6E 45 78   // ....T..WrapNonEx
63 65 70 74 69 6F 6E 54 68 72 6F 77 73 01       ) // ceptionThrows.

.custom instance void class [mscorlib]System.Diagnostics.DebuggableAttribute::'.ctor'(valuetype [mscorlib]System.Diagnostics.DebuggableAttribute/DebuggingModes) =  (01 00 07 01 00 00 00 00 ) // ........

.hash algorithm 0x00008004
.ver  0:0:0:0
}

.namespace SBRL.Demo.Disassembly
{
.class private auto ansi beforefieldinit Program
extends [mscorlib]System.Object
{

// method line 1
.method public static hidebysig
default void Main (string[] args)  cil managed
{
// Method begins at RVA 0x2050
.entrypoint
// Code size 47 (0x2f)
.maxstack 5
.locals init (
int32   V_0,
int32   V_1)
IL_0000:  nop
IL_0001:  call string class [mscorlib]System.Console::ReadLine()
IL_0006:  call int32 int32::Parse(string)
IL_000b:  stloc.0
IL_000c:  ldc.i4.s 0x0a
IL_000e:  stloc.1
IL_000f:  ldstr "{0} + {1} = {2}"
IL_0014:  ldloc.0
IL_0015:  box [mscorlib]System.Int32
IL_001a:  ldloc.1
IL_001b:  box [mscorlib]System.Int32
IL_0020:  ldloc.0
IL_0021:  ldloc.1
IL_0022:  add
IL_0023:  box [mscorlib]System.Int32
IL_0028:  call void class [mscorlib]System.Console::WriteLine(string, object, object, object)
IL_002d:  nop
IL_002e:  ret
} // end of method Program::Main

// method line 2
.method public hidebysig specialname rtspecialname
instance default void '.ctor' ()  cil managed
{
// Method begins at RVA 0x208b
// Code size 8 (0x8)
.maxstack 8
IL_0000:  ldarg.0
IL_0001:  call instance void object::'.ctor'()
IL_0006:  nop
IL_0007:  ret
} // end of method Program::.ctor

} // end of class SBRL.Demo.Disassembly.Program
}


Very interesting. There are a few things of note here:

• The metadata at the top of the CIL tells the execution environment a bunch of useful things about the assembly, such as the version number, the classes contained within (and their signatures), and a bunch of other random attributes.
• An extra .ctor method has been generated for us automatically. It's the class' constructor, and it automagically calls the base constructor of the object class, since all classes are descended from object.
• The ints a and b are boxed before being passed to Console.WriteLine. Exactly what this does and why is quite complicated, and best explained by this Stackoverflow answer.
• We can deduce that CIL is a stack-based language from the add instruction, as it has no arguments.

I'd recommend that you explore this on your own with your own test programs. Try changing things and see what happens!

• Try making the Program class static
• Try refactoring the int.Parse(Console.ReadLine()) into its own method. How is the variable returned?

This isn't all, though. We can also recompile the CIL back into an assembly with the ilasm tool:

ilasm Program.il

This makes for some additional fun experiments:

• See if you can find where b's value is defined, and change it
• What happens if you alter the Console.WriteLine() format string so that it becomes invalid?
• Can you get ilasm to reassemble an executable into a .dll library file?

Found this interesting? Discovered something cool? Comment below!

C# & .NET Terminology Demystified: A Glossary

After my last glossary post on LoRa, I thought I'd write another one on C♯ and .NET, as (in typical Microsoft fashion, it would seem) there seems to be a lot of jargon floating around whose meaning is not always obvious.

If you're new to C♯ and the .NET ecosystems, I wouldn't recommend tackling all of this at once - especially the bottom ~3 definitions - with those in particular there's a lot to get your head around.

C♯

C♯ is an object-oriented programming language that was invented by Microsoft. It's cross-platform, and is usually written in an IDE (Integrated Development Environment), which has a deeper understanding of the code you write than a regular text editor. IDEs include Visual Studio (for Windows) and MonoDevelop (for everyone else).

Solution

A Solution (sometimes referred to as a Visual Studio Solution) is the top-level definition of a project, contained in a file ending in .sln. Each solution may contain one or more Project Files (not to be confused with the project you're working on itself), each of which gets compiled into a single binary. Each project may have its own dependencies too: whether they be a core standard library, another project, or a NuGet package.

Project

A project contains your code, and sits 1 level down from a solution file. Normally, a solution file will sit in the root directory of your repository, and the projects will each have their own sub-folders.

While each project has a single output file (be that a .dll class library or a standalone .exe executable), a project may have multiple dependencies - leading to many files in the build output folder.

The build process and dependency definitions for a project are defined in the .csproj file. This file is written in XML, and can be edited to perform advanced build steps, should you need to do something that the GUI of your IDE doesn't support. I've blogged about the structuring of this file before (see here, and also a bit more here), should you find yourself curious.

CIL

Known as the Common Intermediate Language, CIL is the binary format that C♯ (and also Visual Basic and F♯) code gets compiled into. From here, the .NET runtime (on Windows) or Mono (on macOS, Linux, etc.) can execute it to run the compiled project.

MSBuild

The build system for Solutions and Projects. It reads a .sln or .csproj (there are others for different languages, but I won't list them here) file and executes the defined build instructions.

.NET Framework

The .NET Framework is the standard library of C♯. It provides practically everything you'll need to perform most common tasks. It does not, however, provide a framework for constructing GUIs (graphical user interfaces). You can browse the API reference over at the official .NET API Browser.

WPF

The Windows Presentation Foundation is a Windows-only GUI framework. Powered by XAML (eXtensible Application Markup Language) definitions of what the GUI should look like, it provides everything you need to create a native-looking GUI on Windows.

It does not work on macOS and Linux. To create a cross-platform program that works on all 3 operating systems, you'll need to use an alternative GUI framework, such as XWT or Gtk# (also: Glade). A more complete list of cross-platform frameworks can be found here. It's worth noting that Windows Forms, although a tempting option, aren't as flexible as the other options listed here.

C♯ 7

The 7th version of the C♯ language specification. This includes the syntax of the language, but not the .NET Framework itself.

.NET Standard

A specification of the .NET Framework, but not the C♯ Language. As of the time of typing, the latest version is 2.0, although version 1.6 is commonly used too. The intention here is the improve cross-platform portability of .NET programs by defining a specification for a subset of the full .NET Framework standard library that all platforms will always be able to use. This includes Android and iOS through the use of Xamarin.

Note that all .NET Standard projects are class libraries. In order to create an executable, you'll have to add an additional Project to your Solution that references your .NET Standard class library.

ASP.NET

A web framework for .NET-based programming languages (in our case C♯). It allows you to write C♯ code to handle HTTP (and now WebSocket) requests in a similar manner to PHP, but differs in that your code still needs compiling. Compiled code is then managed by a web server - the IIS web server, on Windows.

With the release of .NET Core, ASP.NET is now obsolete.

.NET Core

Coming in 2 versions so far (1.0 and 2.0), .NET Core is the replacement for ASP.NET (though this is not its exclusive purpose). As far as I understand it, .NET Core is a modular runtime that allows programs targeting it to run multiple platforms. Such programs can either be ASP.NET Core, or a Universal Windows Platform application for the Windows Store.

This question and answer appears to have the best explanation I've found so far. In particular, the displayed diagram is very helpful:

...along with the pair of official "Introducing" blog posts that I've included in the Sources and Further Reading section below.

Conclusion

We've looked at some of the confusing terminology in the .NET ecosystems, and examined each of them in turn. We started by defining and untangling the process by which your C♯ code is compiled and run, and then moved on to the different variants and specifications related to the .NET Framework and C♯.

As always, this is a starting point - not an ending point! I'd recommend doing some additional reading and experimentation to figure out all the details.

Found this helpful? Still confused? Spotted a mistake? Comment below!

Markov Chains Part 3: Weighted Chains

Recently I remembered that I had all the pieces I need to make a weighted version of the unweighted markov chain I built in part 2 of this series - specifically the weighted random number generator I built shortly after the unweighted markov chain. With this in mind, I decided on a whim to put all the pieces of the puzzle together - and this post is the result!

Where to start... hrm. I know. To start with, I needed to perform a minor upgrade to the WeightedRandom class I had. If you haven't read the original post about it, I'd recommend doing so now.

Finished reading that? Great! Let's talk about the changes I've made (they may show up there at the bottom, since I embedded it via GitHub Gist). Firstly, I needed a way to work out if the weighted random number generator was currently empty or not, leading me to add a Count property:

public int Count {
get {
return weights.Count;
}
}

With a count property in place, I also found I was going to need a way to dynamically swap out the weightings of the random number generator without creating a completely new instance - which would end up resetting the Random class instance it was working with, leading to a reduction in the quality of the random numbers it generates under high load (see this article for more information on that).

To that end, I ended up refactoring the constructor into a pair of methods: SetContents, and a companion method ClearContents. Since the weight calculations happen when the items are first added to the generator, and I'd need to completely recalculate them if another item is added, I wasn't able to emulate the API for another existing class in .NET, such as the List class, as I like to do.

Finally, I found later on I needed a way to initialise an empty weighted random generator, so I added a new empty constructor to facilitate that, along with an additional check in the Next() method that throws an InvalidOperationException if the generator is empty and you try to ask it to pick a random item.

Here's the updated WeightedRandomNumberGenerator:

(Can't see the above? Click here to view it on GitHub directly, or here for the raw code as plain text)
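In case the embedded gist doesn't load for you, here's a rough sketch of the shape of those two new methods (simplified - the real implementation in the gist also recalculates the weightings):

// Sketch only - the real WeightedRandom<ItemType> does more than this
public void SetContents(Dictionary<ItemType, double> newWeights)
{
ClearContents();
foreach (KeyValuePair<ItemType, double> pair in newWeights)
weights.Add(pair.Key, pair.Value); // weight normalisation omitted here
}

public void ClearContents()
{
weights.Clear();
}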

With the weighted random number generator updated to properly support the future weighted markov chain, let's get down to the markov chain itself. Firstly, let's create a skeleton that's based on the UnweightedMarkovChain class I wrote in the last post:

using System;
using System.Collections.Generic;
using System.Linq;
using MarkovGrams.Utilities;
using SBRL.Algorithms;

namespace MarkovGrams
{
/// <summary>
/// A weighted character-based markov chain.
/// </summary>
public class WeightedMarkovChain
{
private WeightedRandom<string> wrandom = new WeightedRandom<string>();

/// <summary>
/// The ngrams that this markov chain currently contains.
/// </summary>
Dictionary<string, double> ngrams;

/// <summary>
/// Creates a new character-based markov chain.
/// </summary>
/// <param name="inNgrams">The ngrams to populate the new markov chain with.</param>
public WeightedMarkovChain(IEnumerable<string> inNgrams);

/// <summary>
/// Returns a random ngram that's currently loaded into this WeightedMarkovChain.
/// </summary>
/// <returns>A random ngram from this WeightedMarkovChain's cache of ngrams.</returns>
public string RandomNgram();

/// <summary>
/// Generates a new random string from the currently stored ngrams.
/// </summary>
/// <param name="length">
/// The length of ngram to generate.
/// Note that this is a target, not a fixed value - e.g. passing 2 when the n-gram order is 3 will
/// result in a string of length 3. Also, depending on the current ngrams this markov chain contains,
/// it may end up being cut short.
/// </param>
/// <returns>A new random string.</returns>
public string Generate(int length);
}
}

As you can see, it is rather similar to the unweighted version. Fear not however, for the differences will become more apparent shortly. The only real difference so far is the extra private WeightedRandom<string> wrandom declaration at the top of the class. Let's change that though, by filling out the constructor:

ngrams = new Dictionary<string, double>();
foreach (string ngram in inNgrams)
{
if (ngrams.ContainsKey(ngram))
ngrams[ngram]++;
else
ngrams.Add(ngram, 1);
}

Here, we read the raw n-grams into a dictionary that represents the number of times each n-gram has been discovered. It's got to be a double there as the value type of the dictionary, as apparently the C♯ compiler isn't clever enough to convert a Dictionary<string, int> to a Dictionary<string, double>. Hrm. Maybe they'll fix that in the future (or if not, does anyone know why not?).

Anyway, let's move on to RandomNgram(). Here it is:

if (wrandom.Count == 0)
wrandom.SetContents(ngrams);
return wrandom.Next();

Quite simple, right? Basically, we populate the weighted random generator if it's currently empty, and then we simply ask it for a random item. We're on a roll, here! Let's keep going with Generate(). Here's the first part:

string result = RandomNgram();
string lastNgram = result;
while(result.Length < length)
{
// ......
}

Here, we declare an accumulator-like variable result to hold the word we're generating as we construct it, and another one to hold the last n-gram we picked. We also create a while loop to make sure we keep adding to the word until we reach the desired length (we'll be adding a stop condition later, just in case we run into a brick wall). Next, let's put some code inside that while loop. First up is the (re)population of the weighted random number generator:

wrandom.ClearContents();
// The substring that the next ngram in the chain needs to start with
string nextStartsWith = lastNgram.Substring(1);
// Get a list of possible n-grams we could choose from next
Dictionary<string, double> convNextNgrams = new Dictionary<string, double>();
ngrams.Where(gram_data => gram_data.Key.StartsWith(nextStartsWith))
.ForEach((KeyValuePair<string, double> ngramData) => convNextNgrams.Add(ngramData.Key, ngramData.Value));

Ah, good ol' Linq to the rescue again! But wait, what's that ForEach() call there? I don't remember that being in core .NET! You'd be right of course, but through the power of extension methods one can extend a class with an additional method that can then be used as if it were an integral part of that class, when in reality that isn't the case! Here's my definition for that ForEach() extension method I used above:

public static class LinqExtensions
{
public static void ForEach<T>(this IEnumerable<T> enumerable, Action<T> action)
{
foreach (T item in enumerable)
{
action(item);
}
}
}

Next, we need to add that stop condition we talked about earlier before I forget! Here it is:

// If there aren't any choices left, we can't exactly keep adding to the new string any more :-(
if(convNextNgrams.Count() == 0)
break;

Observant readers will notice that we haven't actually finished the (re)population process yet, so we should do that next. Once done, we can also obtain a random n-gram from the generator and process it:

wrandom.SetContents(convNextNgrams);
// Pick a random n-gram from the list
string nextNgram = wrandom.Next();
// Add the last character from the n-gram to the string we're building
result += nextNgram[nextNgram.Length - 1];
lastNgram = nextNgram;

That completes my initial weighted markov chain implementation. Here's the class in full:

using System;
using System.Collections.Generic;
using System.Linq;
using MarkovGrams.Utilities;
using SBRL.Algorithms;

namespace MarkovGrams
{
/// <summary>
/// A weighted character-based markov chain.
/// </summary>
public class WeightedMarkovChain
{
private WeightedRandom<string> wrandom = new WeightedRandom<string>();

/// <summary>
/// The ngrams that this markov chain currently contains.
/// </summary>
Dictionary<string, double> ngrams;

/// <summary>
/// Creates a new character-based markov chain.
/// </summary>
/// <param name="inNgrams">The ngrams to populate the new markov chain with.</param>
public WeightedMarkovChain(IEnumerable<string> inNgrams)
{
ngrams = new Dictionary<string, double>();
foreach (string ngram in inNgrams)
{
if (ngrams.ContainsKey(ngram))
ngrams[ngram]++;
else
ngrams.Add(ngram, 1);
}
}

/// <summary>
/// Returns a random ngram that's currently loaded into this WeightedMarkovChain.
/// </summary>
/// <returns>A random ngram from this WeightedMarkovChain's cache of ngrams.</returns>
public string RandomNgram()
{
if (wrandom.Count == 0)
wrandom.SetContents(ngrams);
return wrandom.Next();
}

/// <summary>
/// Generates a new random string from the currently stored ngrams.
/// </summary>
/// <param name="length">
/// The length of ngram to generate.
/// Note that this is a target, not a fixed value - e.g. passing 2 when the n-gram order is 3 will
/// result in a string of length 3. Also, depending on the current ngrams this markov chain contains,
/// it may end up being cut short.
/// </param>
/// <returns>A new random string.</returns>
public string Generate(int length)
{
string result = RandomNgram();
string lastNgram = result;
while(result.Length < length)
{
wrandom.ClearContents();
// The substring that the next ngram in the chain needs to start with
string nextStartsWith = lastNgram.Substring(1);
// Get a list of possible n-grams we could choose from next
Dictionary<string, double> convNextNgrams = new Dictionary<string, double>();
ngrams.Where(gram_data => gram_data.Key.StartsWith(nextStartsWith))
.ForEach((KeyValuePair<string, double> ngramData) => convNextNgrams.Add(ngramData.Key, ngramData.Value));
// If there aren't any choices left, we can't exactly keep adding to the new string any more :-(
if(convNextNgrams.Count() == 0)
break;
wrandom.SetContents(convNextNgrams);
// Pick a random n-gram from the list
string nextNgram = wrandom.Next();
// Add the last character from the n-gram to the string we're building
result += nextNgram[nextNgram.Length - 1];
lastNgram = nextNgram;
}
wrandom.ClearContents();
return result;
}
}
}
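
To give an idea of how it slots together, here's a minimal usage sketch. Note that the word list and the SplitIntoNgrams helper below are just placeholders I've made up for illustration - substitute whatever n-gram generation code you're actually using:

using System;
using System.Collections.Generic;
using System.Linq;
using MarkovGrams;

public static class WeightedChainDemo
{
// Hypothetical helper: splits a word into overlapping character n-grams of the given order.
static IEnumerable<string> SplitIntoNgrams(string word, int order)
{
for (int i = 0; i + order <= word.Length; i++)
yield return word.Substring(i, order);
}

public static void Main()
{
// Made-up sample data - duplicate n-grams across words are what give the chain its weighting
string[] words = { "mackerel", "salmon", "haddock", "halibut", "herring" };
IEnumerable<string> ngrams = words.SelectMany(word => SplitIntoNgrams(word.ToLower(), 3));

WeightedMarkovChain chain = new WeightedMarkovChain(ngrams);
for (int i = 0; i < 5; i++)
Console.WriteLine(chain.Generate(8));
}
}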

You can find it on my private git server here, if you're interested in any future improvements I might have made to it since writing this post. Speaking of which, I've got a few in mind - mainly refactoring both this class and its unweighted cousin to utilise lists of objects instead of strings. This way, I'll be able to apply it to anything I like - such as sentence generation, music improvisation, and more!

I'd also like to extend it such that I can specify the weights manually, giving me even more flexibility as to how I can put the engine to use.

(Found a cool use for a Markov Chain? Comment about it below!)

Building a line-by-line lexer in C#

So there I was. It was a lazy afternoon before my final exam of the semester, and I was idly looking through some old code. One thing led to another, and I ended up writing a line-based scanning lexer in C# - and I thought I'd share it here, pulling it apart and putting it back together again.

The aim was to build something regular expression based, that would be flexible enough that it could be used in a wide-range of applications - but not too difficult or confusing to pick up, use and add to another project. The final code isn't currently available in a public repository (it's actually for a personal project - maybe I'll get around to refactoring it into a library if there's the demand), but I'll still post the finished code at the end of this post.

To start with, let's define some data classes to hold some input and output information. Hrm. Let's see - we'll need a class to represent a rule that the lexer will utilise, and one to represent the tokens that our lexer will be emitting. Let's deal with the one for the rules first:

public class LexerRule<TokenType>
{
public readonly TokenType Type;
public readonly Regex RegEx;

public bool Enabled { get; set; } = true;

public LexerRule(TokenType inName, string inRegEx)
{
if (!typeof(TokenType).IsEnum)
throw new ArgumentException($"Error: inName must be an enum - {typeof(TokenType)} passed");

Type = inName;
RegEx = new Regex(inRegEx);
}

public bool Toggle()
{
Enabled = !Enabled;
return Enabled;
}
}

Here I define a template (or generic) class that holds a regular expression, and associates it with a value from an enum. There's probably a better / cleaner way to make sure that TokenType is an enum, but for now this should serve its purpose just fine. I also add a simple Enabled boolean property - as we'll be adding support for dynamically enabling and disabling rules later on.

Next up, let's tackle the class for the tokens that we're going to be emitting:

public class LexerToken<TokenType>
{
public readonly bool IsNullMatch = false;
public readonly LexerRule<TokenType> Rule = null;
public readonly Match RegexMatch;

public TokenType Type {
get {
try {
return Rule.Type;
}
catch (NullReferenceException) {
return default(TokenType);
}
}
}
private string nullValueData;
public string Value {
get {
return IsNullMatch ? nullValueData : RegexMatch.Value;
}
}

public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
{
Rule = inRule;
RegexMatch = inMatch;
}
public LexerToken(string unknownData)
{
IsNullMatch = true;
nullValueData = unknownData;
}

public override string ToString()
{
return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
}
}

A little more complex, but still manageable. Like its LexerRule cousin, it's also a template (or generic) class. It holds the type of token it is, and the regular expression Match object generated during the scanning process. It also has something strange going on with Value and nullValueData - this is so that we can emit tokens with an 'unknown' type for the text in between that doesn't match any known rule. We'll be covering this later too.

With our data classes in place, it's time to turn our attention to the lexer itself. Let's put together some scaffolding to get an idea as to how it's going to work:

public class Lexer<TokenType>
{
public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();

public int CurrentLineNumber { get; private set; } = 0;
public int CurrentLinePos { get; private set; } = 0;
public int TotalCharsScanned { get; private set; } = 0;

private StreamReader textStream;

public Lexer()
{
}

public void AddRule(LexerRule<TokenType> newRule);
public void AddRules(IEnumerable<LexerRule<TokenType>> newRules);

public void Initialise(StreamReader reader);

public IEnumerable<LexerToken<TokenType>> TokenStream();

public void EnableRule(TokenType type);
public void DisableRule(TokenType type);
public void SetRule(TokenType type, bool state);
}

There - that should do the trick! CurrentLineNumber, CurrentLinePos, and TotalCharsScanned are properties to keep track of where we've got to, and textStream is the StreamReader we'll be reading data from. Then, we've got some methods that will add new lexer rules to Rules, enable and disable rules by token type, a method to initialise the lexer with the correct textStream, and finally a generator method that will emit the tokens.

With our skeleton complete, let's fill out a few of those methods:

public void AddRule(LexerRule<TokenType> newRule)
{
Rules.Add(newRule);
}
public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
{
Rules.AddRange(newRules);
}

public void Initialise(StreamReader reader)
{
textStream = reader;
}

public void EnableRule(TokenType type)
{
SetRule(type, true);
}
public void DisableRule(TokenType type)
{
SetRule(type, false);
}
public void SetRule(TokenType type, bool state)
{
foreach (LexerRule<TokenType> rule in Rules)
{
// We have to do a string comparison here because of the generic type we're using in multiple nested
// classes
if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
rule.Enabled = state;
return;
}
}
}

Very cool. None of this is particularly exciting - apart from SetRule. In SetRule we have to convert the type argument to a string in order to compare it to the rules in the Rules list, as C♯ doesn't seem to understand that the TokenType on the LexerRule class is the same as the TokenType on the Lexer class - even though they have the same name! This did give me an idea for a trio of additional methods to make manipulating rules easier though:

public void EnableRulesByPrefix(string tokenTypePrefix)
{
SetRulesByPrefix(tokenTypePrefix, true);
}
public void DisableRulesByPrefix(string tokenTypePrefix)
{
SetRulesByPrefix(tokenTypePrefix, false);
}
public void SetRulesByPrefix(string tokenTypePrefix, bool state)
{
foreach (LexerRule<TokenType> rule in Rules)
{
// We have to do a string comparison here because of the generic type we're using in multiple nested
// classes
if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
{
rule.Enabled = state;
}
}
}

This set of methods lets us enable or disable rules based on what their token type names start with. For example, if I have the 3 rules CommentStart, CommentEnd, and FunctionStart, then calling EnableRulesByPrefix("Comment") will enable CommentStart and CommentEnd, but not FunctionStart.

With all the interfacing code out of the way, let's turn to the real meat of the subject: the TokenStream method. This is the method behind the magic. It's a bit complicated, so let's take it step-by-step. Firstly, we need to iterate over the lines in the StreamReader:

string nextLine;
List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
while ((nextLine = textStream.ReadLine()) != null)
{
CurrentLinePos = 0;

// .....

CurrentLineNumber++;
TotalCharsScanned += CurrentLinePos;
}

Fairly simple, right? I've used this construct a few times in the past. Before you ask, we'll get to matches in just a moment :-)

Next, we need another while loop that iterates until we reach the end of the line:

while (CurrentLinePos < nextLine.Length)
{
matches.Clear();
foreach (LexerRule<TokenType> rule in Rules) {
if (!rule.Enabled) continue;

Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
if (!nextMatch.Success) continue;

matches.Add(new LexerToken<TokenType>(rule, nextMatch));
}

// .....
}

Also fairly easy to follow. Basically, we clear the matches list, and then attempt to find the next match from the current position on the line that we've reached (CurrentLinePos) for every rule - and we store all the successful matches for further inspection and processing. We also make sure we skip any disabled rules here, too.

If we don't find any matching rules, then that must mean that we can't match the rest of this line to any known token. In this case, we want to emit an unknown token:

if (matches.Count == 0) {
yield return new LexerToken<TokenType>(nextLine.Substring(CurrentLinePos));
break;
}

This is what that extra LexerToken constructor we created above is for. Note that we yield return here, instead of simply returning - this is very similar in construct to the yield statement in Javascript that I blogged about before (and again here), in that it allows you to maintain state inside a method and return multiple values in sequence.

By an 'unknown token', I am referring to the default value of the TokenType enum. Here's an example enum you might use with this lexer class:

public enum SimpleToken
{
Unknown = 0,

Whitespace,
Comment,
BlockOpen,
BlockClose,
}

Since the value Unknown is explicitly assigned the index 0, we can be absolutely certain that it's the default value of this enum.

With our list of potential matches in hand, we next need to sort it in order to work out which one we should prioritise. After deliberating and experimenting a bit, I came up with this:

matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
// Compare by match offset position
int result = nextLine.IndexOf(a.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture) -
nextLine.IndexOf(b.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture);
// If they both start at the same position, then go with the longest one
if (result == 0)
result = b.RegexMatch.Length - a.RegexMatch.Length;

return result;
});

Basically, this sorts them so that the matches closest to the current scanning position on the line (CurrentLinePos) are prioritised. If 2 or more matches tie based on this criterion, we prioritise the longest match. This seems to work for now - I can always change it later if it becomes a problem :P

With our matches sorted, we can now pick out the one we're going to emit next. Before we do so though, we need to take care of any characters between our current scanning position and the start of the next token:

LexerToken<TokenType> selectedToken = matches[0];
int selectedTokenOffset = nextLine.IndexOf(selectedToken.RegexMatch.Value, CurrentLinePos) - CurrentLinePos;

if (selectedTokenOffset > 0) {
yield return new LexerToken<TokenType>(nextLine.Substring(CurrentLinePos, selectedTokenOffset));
CurrentLinePos += selectedTokenOffset;
}

This emits these additional characters as an unknown token, as we did before. Finally, we can emit the token and continue onto the next iteration of the loop:

CurrentLinePos += selectedToken.RegexMatch.Length;
yield return selectedToken;

That concludes our TokenStream method - and with it this lexer! Here's the code in full:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace SBRL.Tools
{
public class LexerRule<TokenType>
{
public readonly TokenType Type;
public readonly Regex RegEx;
public bool Enabled { get; set; } = true;

public LexerRule(TokenType inName, string inRegEx)
{
if (!typeof(TokenType).IsEnum)
throw new ArgumentException($"Error: inName must be an enum - {typeof(TokenType)} passed");

Type = inName;
RegEx = new Regex(inRegEx);
}

public bool Toggle()
{
Enabled = !Enabled;
return Enabled;
}
}

public class LexerToken<TokenType>
{
public readonly bool IsNullMatch = false;
public readonly LexerRule<TokenType> Rule = null;
public readonly Match RegexMatch;

public TokenType Type {
get {
try {
return Rule.Type;
}
catch (NullReferenceException) {
return default(TokenType);
}
}
}
private string nullValueData;
public string Value {
get {
return IsNullMatch ? nullValueData : RegexMatch.Value;
}
}

public LexerToken(LexerRule<TokenType> inRule, Match inMatch)
{
Rule = inRule;
RegexMatch = inMatch;
}
public LexerToken(string unknownData)
{
IsNullMatch = true;
nullValueData = unknownData;
}

public override string ToString()
{
return string.Format("[LexerToken: Type={0}, Value={1}]", Type, Value);
}
}

public class Lexer<TokenType>
{
public List<LexerRule<TokenType>> Rules { get; private set; } = new List<LexerRule<TokenType>>();

public bool Verbose { get; set; } = false;

/// <summary>
/// The number of the line that is currently being scanned.
/// </summary>
public int CurrentLineNumber { get; private set; } = 0;
/// <summary>
/// The number of characters on the current line that have been scanned.
/// </summary>
/// <value>The current line position.</value>
public int CurrentLinePos { get; private set; } = 0;
/// <summary>
/// The total number of characters currently scanned by this lexer instance.
/// Only updated every newline!
/// </summary>
public int TotalCharsScanned { get; private set; } = 0;

private StreamReader textStream;

public Lexer()
{

}

public void AddRule(LexerRule<TokenType> newRule)
{
Rules.Add(newRule);
}
public void AddRules(IEnumerable<LexerRule<TokenType>> newRules)
{
Rules.AddRange(newRules);
}

public void Initialise(StreamReader reader)
{
textStream = reader;
}

public IEnumerable<LexerToken<TokenType>> TokenStream()
{
string nextLine;
List<LexerToken<TokenType>> matches = new List<LexerToken<TokenType>>();
while ((nextLine = textStream.ReadLine()) != null)
{
CurrentLinePos = 0;

while (CurrentLinePos < nextLine.Length)
{
matches.Clear();
foreach (LexerRule<TokenType> rule in Rules) {
if (!rule.Enabled) continue;

Match nextMatch = rule.RegEx.Match(nextLine, CurrentLinePos);
if (!nextMatch.Success) continue;

matches.Add(new LexerToken<TokenType>(rule, nextMatch));
}

if (matches.Count == 0) {
string unknownTokenContent = nextLine.Substring(CurrentLinePos);
if(Verbose) Console.WriteLine("[Unknown Token: No matches found for this line] {0}", unknownTokenContent);
yield return new LexerToken<TokenType>(unknownTokenContent);
break;
}

matches.Sort((LexerToken<TokenType> a, LexerToken<TokenType> b) => {
// Compare by match offset position
int result = nextLine.IndexOf(a.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture) -
nextLine.IndexOf(b.RegexMatch.Value, CurrentLinePos, StringComparison.CurrentCulture);
// If they both start at the same position, then go with the longest one
if (result == 0)
result = b.RegexMatch.Length - a.RegexMatch.Length;

return result;
});
LexerToken<TokenType> selectedToken = matches[0];
int selectedTokenOffset = nextLine.IndexOf(selectedToken.RegexMatch.Value, CurrentLinePos) - CurrentLinePos;

if (selectedTokenOffset > 0) {
string extraTokenContent = nextLine.Substring(CurrentLinePos, selectedTokenOffset);
CurrentLinePos += selectedTokenOffset;
if(Verbose) Console.WriteLine("[Unmatched content] '{0}'", extraTokenContent);
yield return new LexerToken<TokenType>(extraTokenContent);
}

CurrentLinePos += selectedToken.RegexMatch.Length;
if(Verbose) Console.WriteLine(selectedToken);
yield return selectedToken;
}

if(Verbose) Console.WriteLine("[Lexer] Next line");
CurrentLineNumber++;
TotalCharsScanned += CurrentLinePos;
}
}

public void EnableRule(TokenType type)
{
SetRule(type, true);
}
public void DisableRule(TokenType type)
{
SetRule(type, false);
}
public void SetRule(TokenType type, bool state)
{
foreach (LexerRule<TokenType> rule in Rules)
{
// We have to do a string comparison here because of the generic type we're using in multiple nested
// classes
if (Enum.GetName(rule.Type.GetType(), rule.Type) == Enum.GetName(type.GetType(), type)) {
rule.Enabled = state;
return;
}
}
}

public void EnableRulesByPrefix(string tokenTypePrefix)
{
SetRulesByPrefix(tokenTypePrefix, true);
}
public void DisableRulesByPrefix(string tokenTypePrefix)
{
SetRulesByPrefix(tokenTypePrefix, false);
}
public void SetRulesByPrefix(string tokenTypePrefix, bool state)
{
foreach (LexerRule<TokenType> rule in Rules)
{
// We have to do a string comparison here because of the generic type we're using in multiple nested
// classes
if (Enum.GetName(rule.Type.GetType(), rule.Type).StartsWith(tokenTypePrefix, StringComparison.CurrentCulture))
{
rule.Enabled = state;
}
}
}
}
}

It's a bit much to take in all at once, but hopefully by breaking it down into steps I've made it easier to understand how I built it, and how all the different pieces fit together. The only real difference between the above code and the code I walked through in this post is the Verbose parameter I added for testing purposes, and the associated Console.WriteLine calls. For fun, here's a very basic LOGO (also here) lexer. I've based it on what I remember from using MSWLogo / FMSLogo a long time ago (there seem to be many dialects around these days):

public enum LogoToken
{
Unknown = 0,

Whitespace,

FunctionForwards,
FunctionBackwards,
FunctionLeft,
FunctionRight,
FunctionPenUp,
FunctionPenDown,

Number
}

public class LogoLexer : Lexer<LogoToken>
{
public LogoLexer()
{
AddRules(new List<LexerRule<LogoToken>>() {
new LexerRule<LogoToken>(LogoToken.Whitespace,    @"\s+"),

new LexerRule<LogoToken>(LogoToken.FunctionForwards,    @"FD"),
new LexerRule<LogoToken>(LogoToken.FunctionBackwards,    @"BK"),
new LexerRule<LogoToken>(LogoToken.FunctionLeft,    @"LT"),
new LexerRule<LogoToken>(LogoToken.FunctionRight,    @"RT"),

new LexerRule<LogoToken>(LogoToken.FunctionPenUp,    @"PU"),
new LexerRule<LogoToken>(LogoToken.FunctionPenDown,    @"PD"),

new LexerRule<LogoToken>(LogoToken.Number,    @"\d+"),
});
}
}
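
To actually drive it, you can initialise it with a StreamReader and iterate over TokenStream(). Here's a rough usage sketch - the example.logo filename is just something I've made up for illustration:

// With Verbose enabled, the lexer prints each token as it's emitted
using (StreamReader reader = new StreamReader("example.logo"))
{
LogoLexer lexer = new LogoLexer() { Verbose = true };
lexer.Initialise(reader);

foreach (LexerToken<LogoToken> token in lexer.TokenStream())
{
// Do whatever processing you need with each token here
}
}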

Here's an example LOGO program that it parses:

FD 100 RT 90
FD 50 PU RT 180
BK 40 LT 45 FD 250

...and here's the output from lexing that example program:

[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=100]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=90]
[Lexer] Next line
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=50]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionPenUp, Value=PU]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionRight, Value=RT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=180]
[Lexer] Next line
[LexerToken: Type=FunctionBackwards, Value=BK]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=40]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionLeft, Value=LT]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=45]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=FunctionForwards, Value=FD]
[LexerToken: Type=Whitespace, Value= ]
[LexerToken: Type=Number, Value=250]
[Lexer] Next line
[Lexer] Next line

Very cool! This could easily be extended to support more of the LOGO syntax. As an exercise, can you extend it to support the REPEAT statement? At some point in the future, I might go even further and build a bottom-up left-to-right shift-reduce parser, and combine it with this lexer and some BNF to create a parse tree.
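
If you fancy having a go at that exercise, here's a rough sketch of where I'd start - the token names and regular expressions below are just my guesses for illustration, not part of the lexer above:

// Possible extra token types for REPEAT and its bracketed block
public enum LogoToken
{
Unknown = 0,
Whitespace,
FunctionForwards, FunctionBackwards, FunctionLeft, FunctionRight,
FunctionPenUp, FunctionPenDown,
StatementRepeat, BracketOpen, BracketClose,
Number
}

// ....and the matching rules to add to the LogoLexer constructor's rule list:
new LexerRule<LogoToken>(LogoToken.StatementRepeat,    @"REPEAT"),
new LexerRule<LogoToken>(LogoToken.BracketOpen,    @"\["),
new LexerRule<LogoToken>(LogoToken.BracketClose,    @"\]"),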

GlidingSquirrel is now on NuGet with automatic API documentation!

(This post is a bit late as I'm rather under the weather.)

A while ago, I posted about a HTTP/1.0 server library I'd written in C♯ for a challenge. One thing led to another, and I've now ended up adding support for HTTP/1.1, WebSockets, and more....

While these enhancements have been driven mainly by a certain project of mine that keeps stealing all of my attention, they've been fun to add - and now that I've (finally!) worked out how to package it as a NuGet package, it's available on NuGet for everyone to use, under the Mozilla Public License 2.0!

Caution should be advised though, as I haven't thoroughly tested it yet to weed out the slew of bugs and vulnerabilities I'm sure are hiding in there somewhere. It's designed mainly to sit behind a reverse-proxy such as Nginx (not that that's any excuse, I know!) - and I also don't have a good set of test cases to check it against (I thought for sure there would be at least one test suite out there for HTTP servers, considering the age of the protocol, but apparently not!).

With that out of the way, specifying dependencies didn't turn out to be that hard, actually. Here's the extra bit I added to the .nuspec file to specify the GlidingSquirrel's dependencies:

<dependencies>
<dependency id="MimeSharp" version="1.0.0" />
<dependency id="System.ValueTuple" version="4.4.0" />
</dependencies>

The above goes inside the <metadata> element. If you're curious about what this strange .nuspec file is and its format, I've blogged about it a while back when I figured out how to package my TeleConsole Client - in which I explain my findings and how you can do it yourself too!
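
In context, the whole thing ends up looking something like this - note that the version, author, and description values here are just placeholders for illustration:

<?xml version="1.0"?>
<package>
<metadata>
<id>GlidingSquirrel</id>
<version>0.1.0</version>
<authors>Starbeamrainbowlabs</authors>
<description>....</description>

<dependencies>
<dependency id="MimeSharp" version="1.0.0" />
<dependency id="System.ValueTuple" version="4.4.0" />
</dependencies>
</metadata>
</package>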

I've still not figured out the strange $version$ and $description$ things that are supposed to reference the appropriate values in the .csproj file (if you know how those work, please leave a comment below, as I'd love to know!) - as it keeps telling me that <description> cannot be empty.....

In addition to packaging it and putting it up on NuGet, I've also taken a look at automatic documentation generation. For a while I've been painstakingly documenting the entire project with intellisense comments especially for this purpose - which I'm glad I started early, as it's been a real pain and taken a rather decent chunk of time to do.

Anyway, with the last of the intellisense comments finished today, I set about generating automatic documentation - which turned out to be a surprisingly difficult thing to achieve.

• Sandcastle, Microsoft's offering, has been helpfully discontinued.
• Sandcastle Help File Builder, the community open-source fork of Sandcastle, appears to be Windows-only (I use Linux myself).
• DocFX, Microsoft's new offering, appears to be more of a static site generator that uses your code and a bunch of other things as input, rather than the simple HTML API documentation generator I was after.
• While I found Doxygen to produce an acceptable result out-of-the-box, I discovered that configuring it was not a trivial task - the initial configuration file the init action created was over 100 KiB!

I found a few other options too, but they were either Windows-only, or commercial offerings that I can't justify using for an open-source project.

When I was about to give up the search for the day, I stumbled across this page on generating documentation by the mono project, who just happen to be the ones behind mono, the cross-platform .NET runtime that runs on Linux, macOS, and, of course, Windows. They're also the ones who built mcs, the C♯ compiler that complements the mono runtime.

Apparently, they've also built a documentation generator that has the ability to export to HTML. While it's certainly nothing fancy, it does look like you've got all the power you need at your fingertips to customise the look and feel by tweaking the XSLT stylesheet it renders with, should you need to.

After a bit of research, I found this pair of commands to build the documentation for me quite nicely:

mdoc update -o docs_xml/ -i GlidingSquirrel.xml GlidingSquirrel.dll
mdoc export-html --out docs/ docs_xml

The first command builds the multi-page XML tree from the XML documentation generated during the build process (make sure the "Generate xml documentation" option is ticked in the project build options!), and the second one transforms that into HTML that you can view in your browser.

With the commands figured out, I set about including it in my build process. Here's what I came up with:

<Target Name="AfterBuild">
<Message Importance="high" Text="----------[ Building documentation ]----------" />

<Exec Command="mdoc update -o docs_xml/ -i GlidingSquirrel.xml GlidingSquirrel.dll" WorkingDirectory="$(TargetDir)" IgnoreExitCode="true" /> <Exec Command="mdoc export-html --out$(SolutionDir)/docs docs_xml" WorkingDirectory="$(TargetDir)" IgnoreExitCode="true" /> </Target> This outputs the documentation to the docs/ folder in the root of the solution. If you're a bit confused as to how to utilise this in your own project, then perhaps the Untangling MSBuild post I wrote to figure out how MSBuild works can help! You can view the GlidingSquirrel's automatically-generated documentation here. It's nothing fancy (yet), but it certainly does the job. I'll have to look into prettifying it later! Untangling MSBuild: MSBuild for normal people I don't know about you, but I find the documentation on MSBuild is be rather confusing. Even their definition of what MSBuild is is a bit misleading: MSBuild is the build system for Visual Studio. Whilst having to pull apart the .csproj file of a project of mine and put it back together again to get it to do what I wanted, I spent a considerable amount of time reading Microsoft's (bad) documentation and various web tutorials on what MSBuild is and what it does. I'm bound to forget what I've learnt, so I'm detailing it here both to save myself the bother of looking everything up again and to make sense of everything I've read and experimented with myself. Before continuing, you might find my previous post, Understanding your compiler: C# an interesting read if you aren't already aware of some of the basics of Visual Studio solutions, project files, msbuild, and the C♯ compiler. Let's start with a real definition. MSBuild is Microsoft's build framework that ties into Visual Studio, Mono, and basically anything in the .NET world. It has an XML-based syntax that allows one to describe what MSBuild has to do to build a project - optionally depending on other projects elsewhere in the file system. It's most commonly used to build C♯ projects - but it can be used to build other things, too. The structure of your typical Visual Studio solution might look a bit like this: As you can see, the .sln file references one or more .csproj files, each of which may reference other .csproj files - not all of which have to be tied to the same .sln file, although they usually are (this can be handy if you're using Git Submodules). The .sln file will also specify a default project to build, too. The file extension .csproj isn't the only one recognised by MSBuild, either - others such as .pssproj (PowerShell project), .vcxproj (Visual C++ Project), .targets (Shared tasks + targets - we'll get to these later), and others - though the generic extension is simply .proj. So far, I've observed that MSBuild is pretty intelligent about automatically detecting project / solution files in it's working directory - you can call it with msbuild in a directory and most of the time it'll find and build the right project - even if it finds a .sln file that it has to parse first. Let's get to the real meat of the subject: targets. Consider the following: <?xml version="1.0" encoding="utf-8"?> <Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003" DefaultTargets="Build" ToolsVersion="4.0"> <Import Project="$(MSBuildBinPath)\Microsoft.CSharp.targets" />

<Target Name="BeforeBuild">
<Message Importance="high" Text="Before build executing" />
</Target>
</Project>

I've simplified it a bit to make it clearer, but in essence the above imports the predefined C♯ set of targets, which includes (amongst others) BeforeBuild, Build itself, and AfterBuild - the first of which is overridden by the local MSBuild project file. Sound complicated? It is a bit!

MSBuild uses a system of targets. When you ask it to do something, you're actually asking it to reach a particular target. By default, this is the Build target. Targets can have dependencies, too - in the case of the above, the Build target depends on the (blank) BeforeBuild target in the default C♯ target set, which is then overridden by the MSBuild project file above.
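
To make that concrete, here's a small made-up example of a custom target declaring a dependency explicitly - asking MSBuild for the Package target will run Build first:

<!-- A sketch of a custom target - the Package name is just an example -->
<Target Name="Package" DependsOnTargets="Build">
<Message Importance="high" Text="Packaging the build output...." />
</Target>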

The second key component of MSBuild project files are tasks. I've used one in the example above - the Message task which, obviously enough, outputs a message to the build output. There are lots of other types of task, too:

• Reference - reference another assembly or core namespace
• Compile - specify a C♯ file to compile
• EmbeddedResource - specify a file to include as an embedded resource
• MakeDir - create a directory
• MSBuild - recursively build another MSBuild project
• Exec - Execute a shell command (on Windows this is with cmd.exe)
• Copy - Copy file(s) and/or director(y/ies) from one place to another

This is just a small sample of what's available - check out the MSBuild task reference for a full list and more information about how each is used. It's worth noting that MSBuild targets can include multiple tasks one after another - and they'll all be executed in sequence.
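
For example, a target along these lines (the paths and filenames here are made up) runs all three tasks in order, top to bottom, after every build:

<Target Name="AfterBuild">
<Message Importance="high" Text="Copying extra files...." />
<MakeDir Directories="$(TargetDir)/extras" />
<Copy SourceFiles="README.md" DestinationFolder="$(TargetDir)/extras" />
</Target>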

Unfortunately, if you try to specify a wildcard in an EmbeddedResource directive, both Visual Studio and Monodevelop 'helpfully' try to auto-expand these wildcards to reference the actual files themselves. To avoid this, a clever hack can be instituted that uses the CreateItem task:

<CreateItem Include="$(ProjectDir)/obj/client_dist/**">
<Output ItemName="EmbeddedResource" TaskParameter="Include" />
</CreateItem>

The above loops over all the files that are in $(ProjectDir)/obj/client_dist, and dynamically creates an EmbeddedResource directive for each - thereby preventing annoying auto-expansion.

The $(ProjectDir) bit is a variable - the final key component of MSBuild project files. There are a number of built-in variables for different things, but you can also define your own (there's a quick sketch of that after this list).

• $(ProjectDir) - The current project's root directory
• $(SolutionDir) - The root directory of the current solution. Undefined if a project is being built directly.
• $(TargetDir) - The output directory that the result of the build will be written to.
• $(Configuration) - The selected configuration to build. Most commonly Debug or Release.
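
Defining your own is done with a PropertyGroup - here's a quick made-up sketch (the DocsOutputDir name is my own invention), which you can then reference anywhere else as $(DocsOutputDir):

<PropertyGroup>
<DocsOutputDir>$(SolutionDir)/docs</DocsOutputDir>
</PropertyGroup>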

As with the tasks, there's a full list available that you can check out for more information.

That's the basics of how MSBuild works, as far as I'm aware at the moment. If I discover anything else of note, I'll post again in the future about what I've learnt. If this has helped, please comment below! If you're still confused, let me know in the comments too, and I'll try my best to help you out :-)

Want a tutorial on Git Submodules? Let me know by posting a comment below!
