Archive

## Tag Cloud

3d 3d printing account algorithms android announcement architecture archives arduino artificial intelligence artix assembly async audio automation backups bash batch blender blog bookmarklet booting bug hunting c sharp c++ challenge chrome os cluster code codepen coding conundrums coding conundrums evolved command line compilers compiling compression containerisation css dailyprogrammer data analysis debugging demystification distributed computing docker documentation downtime electronics email embedded systems encryption es6 features ethics event experiment external first impressions freeside future game github github gist gitlab graphics hardware hardware meetup holiday holidays html html5 html5 canvas infrastructure interfaces internet interoperability io.js jabber jam javascript js bin labs learning library linux lora low level lua maintenance manjaro minetest network networking nibriboard node.js operating systems own your code pepperminty wiki performance phd photos php pixelbot portable privacy problem solving programming problems project projects prolog protocol protocols pseudo 3d python reddit redis reference release releases rendering resource review rust searching secrets security series list server software sorting source code control statistics storage svg systemquery talks technical terminal textures thoughts three thing game three.js tool tutorial tutorials twitter ubuntu university update updates upgrade version control virtual reality virtualisation visual web website windows windows 10 worldeditadditions xmpp xslt

## Tips for training (large numbers of) AI models

As part of my PhD, I'm training AI models. The specifics as to what for don't particularly matter for this post (though if you're curious I recommend my PhD update blog post series). Over the last year or so, I've found myself training a lot of AI models, and dealing with a lot of data. In this post, I'm going to talk about some of the things I've found helpful and some of the things things I've found that are best avoided. Note that this is just a snapshot of my current practices now - this will probably gradually change over time.

I've been working with Tensorflow.js and Tensorflow for Python on various Linux systems. If you're on another OS or not working with AI then what I say here should still be somewhat relevant.

### Datasets

First up: a quick word on datasets. While this post is mainly about AI models, datasets are important too. Keeping them organised is vitally important. Keeping all the metadata that associated with them is also vitally important. Keeping a good directory hierarchy is the best way to achieve this.

I also recommend sticking with a standard format that's easy to parse using your preferred language - and preferably lots of other languages too. Json Lines is my personal favourite format for data - potentially compressed with Gzip if the filesize of is very large.

### AI Models

There are multiple facets to the problem of wrangling AI models:

1. Code that implements the model itself and supporting code
2. Checkpoints from the training process
3. Analysis results from analysing such models

All of these are important for different reasons - and are also affected by where it is that you're going to be training your model.

By far the most important thing I recommend doing is using Git with a remote such as GitHub and committing regularly. I can't stress enough how critical this is - it's the best way to both keep a detailed history of the code you've written and keep a backup at the same time. It also makes working on multiple computers easy. Getting into the habit of using Git for any project (doesn't matter what it is) will make your life a lot easier. At the beginning of a programming session, pull down your changes. Then, as you work, commit your changes and describe them properly. Finally, push your changes to the remote after committing to keep them backed up.

Coming in at a close second is implementing is a command line interface with the ability to change the behaviour of your model. This includes:

• Setting input datasets
• Specifying output directories
• Model hyperparameters (e.g. input size, number of layers, number of units per layer, etc)

This is invaluable for running many different variants of your model quickly to compare results. It is also very useful when training your model in headless environments, such as on High Performance Computers (HPCs) such as Viper that my University has.

For HPCs that use Slurm, a great tip here is that when you call sbatch on your job file (e.g. sbatch path/to/jobfile.job), it will preserve your environment. This lets you pass in job-specific parameters by writing a script like this:

#!/usr/bin/env bash
#SBATCH -J TwImgCCT
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --gres=gpu:1
#SBATCH -o %j.%N.%a.out
#SBATCH -e %j.%N.%a.err
#SBATCH -p gpu05,gpu
#SBATCH --time=5-00:00:00
#SBATCH --mem=25600
# 25600 = 25GiB memory required

# Viper use Trinity ClusterVision: https://clustervision.com/trinityx-cluster-management/ and https://github.com/clustervision/trinityX

echo ">>> Installing requirements";
conda run -n py38 pip install -r requirements.txt;
echo ">>> Training model";
/usr/bin/env time --verbose conda run -n py38 src/my_model.py ${PARAMS} echo ">>> exited with code$?";

....which you can call like so:

PARAMS="--size 4 --example 'something else' --input path/to/file --output outputs/20211002-resnet" sbatch path/to/jobfile.job

You may end up finding you have rather a lot of code behind your model - especially for data preprocessing depending on your dataset. To handle this, I go by 2 rules of thumb:

1. If a source file of any language is more than 300 lines long, it should be split into multiple files
2. If a collection of files do a thing together rather nicely, they belong in a separate Git repository.

To elaborate on these, having source code files become very long makes them difficult to maintain, understand, and re-use in future projects. Splitting them up makes your life much easier.

Going further, modularising your code is also an amazing paradigm to work with. I've broken many parts of my various codebases I've implemented for my PhD out as open-source projects on npm (the Node Package Manager) - most notably applause-cli, terrain50, terrain50-cli, nimrod-data-downloader, and twitter-academic-downloader.

By making them open-source, I'm not only making my research and methods more transparent and easier for others to independently verify, but I'm also allowing others to benefit from them (and potentially improve them) too! As they say, there's no need to re-invent the wheel.

Eventually, I will be making the AI models I'm implementing for my PhD open-source too - but this will take some time as I want to ensure that the models actually work before doing so (I've got 1 model I implemented fully and documented too, but in the end it has a critical bug that means the whole thing is useless.....).

Saving checkpoints from the training process of your model is also essential. I recommend doing so at the end of each epoch. As part of this, it's also useful to have a standard format for your output artefacts from the training process. Ideally, these artefacts can be used to identify precisely what dataset and hyperparameters that model and checkpoints were trained with.

At the moment, my models output something like this:

+ output_dir/
+ summary.txt       Summary of the layers of the model and their output shapes
+ metrics.tsv       TSV file containing training/validation loss/accuracy and epoch numbers
+ settings.toml     The TOML settings that the model was trained with
+ checkpoints/      Directory containing the checkpoints - 1 per epoch
+ checkpoint_e1_val_acc0.699.hdf5   Example checkpoint filename [Tensorflow for Python]
+ 0/            OR, if using Tensorflow.js instead of Tensorflow for Python, 1 directory per checkpoint
+ this_run.log      Logfile for this run [depends on where the program is being executed]

settings.toml leads me on to settings files. Personally I use TOML for mine, and I use 2 files:

• settings.default.toml - Contains all the default values of the settings, and is located alongside the code for my model
• example.toml - Custom settings that override values in the default settings file can be specified using my standard --config CLI argument.

Having a config file is handy when you have multiple dataset input files that rarely change. Generally speaking you want to ensure that you minimise the number of CLI arguments that you have to specify when running your model, as then it reduces cognitive load when you're training many variants of a model at once (I've found that wrangling dozens of different dataset files and model variants is hard enough to focus on and keep organised :P).

Analysis results are the final aspect here that it's important to keep organised - and the area in which I have the least experience. I've found it's important to keep track of which model checkpoint it was that the analysis was done with and which dataset said model was trained on. Keeping the entire chain of dataflow clear and easy to follow is difficult because the analysis one does is usually ad-hoc, and often has to be repeated many times on different model variants.

For this, so far I generate statistics and some graphs on the command line. If you're not already familiar with the terminal / command line of your machine, I can recommend checking out my earlier post Learn Your Terminal, which has a bunch of links to tutorials for this. In addition, jq is an amazing tool for manipulating JSON data. It's not installed by default on most systems, but it's available in most default repositories and well worth the install.

For some graphs, I use Gnuplot. Usually though this is only for more complex plots, as it takes a moment to write a .plt file to generate the graph I want in it.

I'm still looking for a good tool that makes it easy to generate basic graphs from the command line, so please get in touch if you've found one.

I'm also considering integrating some of the basic analysis into my model training program itself, such that it generates e.g. confusion matrices automatically as part of the training process. matplotlib seems to do the job here for plotting graphs in Python, but I have yet to find an equivalent library for Javascript. Again, if you've found one please get in touch by leaving a comment below.

### Conclusion

In this post, I've talked about some of the things I've found helpful so far while I've been training models. From using Git to output artefacts to implementing command line interfaces and wrangling datasets, implementing the core AI model itself is actually only a very small part of an AI project.

Hopefully this post has given you some insight into the process of developing an AI model / AI-powered system. While I've been doing some of these things since before I started my PhD (like Git), others have taken me a while to figure out - so I've noted them down here so that you don't have to spend ages figuring out the same things!

If you've got some good tips you'd like to share on developing AI models (or if you've found the tips here in this blog post helpful!), please do share them below.

## Running multiple local versions of CUDA on Ubuntu without sudo privileges

I've been playing around with Tensorflow.js for my PhD (see my PhD Update blog post series), and I had some ideas that I wanted to test out on my own that aren't really related to my PhD. In particular, I've found this blog post to be rather inspiring - where the author sets up a character-based recurrent neural network to generate text.

The idea of transcoding all those characters to numerical values and back seems like too much work and too complicated just for a quick personal project though, so my plan is to try and develop a byte-based network instead, in the hopes that I can not only teach it to generate text as in the blog post, but valid Unicode as well.

Obviously, I can't really use the University's resources ethically for this (as it's got nothing to do with my University work) - so since I got a new laptop recently with an Nvidia GeForce RTX 2060, I thought I'd try and use it for some machine learning instead.

The problem here is that Tensorflow.js requires only CUDA 10.0, but since I'm running Ubuntu 20.10 with all the latest patches installed, I have CUDA 11.1. A quick search of the apt repositories on my system reveals nothing that suggests I can install older versions of CUDA alongside the newer one, so I had to devise another plan.

I discovered some months ago (while working with Viper - my University's HPC - for my PhD) that you can actually extract - without sudo privileges - the contents of the CUDA .run installers. By then fiddling with your PATH and LD_LIBRARY_PATH environment variables, you can get any program you run to look for the CUDA libraries elsewhere instead of loading the default system libraries.

Since this is the second time I've done this, I thought I'd document the process for future reference.

First, you need to download the appropriate .run installer for the CUDA libraries. In my case I need CUDA 10.0, so I've downloaded mine from here:

Next, we need to create a new subdirectory and extract the .run file into it. Do that like so:

cd path/to/runfile_directory;
mkdir cuda-10.0
./cuda_10.0.130_410.48_linux.run --extract=${PWD}/cuda-10.0/ Make sure that the current working directory contains no spaces, no preferably no other special characters either. Also, adjust the file and directory names to suit your situation. Once done, this will have extract 3 subfiles - which also have the suffix .run. We're only interested in CUDA itself, so we only need to extract the the one that starts with cuda-linux. Do that like so (adjusting file/directory names as before): cd cuda-10.0; ./cuda-linux.10.0.130-24817639.run -noprompt -prefix=$PWD/cuda;
rm *.run;
mv cuda/* .;
rmdir cuda;

If you run ./cuda-linux.10.0.130-24817639.run --help, it's actually somewhat deceptive - since there's a typo in the help text! I corrected it for this above though. Once done, this should leave the current working directory containing the CUDA libraries - that is a subdirectory next to the original .run file:

+ /path/to/some_directory/
+ ./cuda_10.0.130_410.48_linux.run
+ cuda-10.0/
+ version.txt
+ bin/
+ doc/
+ extras/
+ ......

Now, it's just a case of fiddling with some environment variables and launching your program of choice. You can set the appropriate environment variables like this:

export PATH="/absolute/path/to/cuda-10.0/bin:${PATH}"; if [[ ! -z "${LD_LIBRARY_PATH}" ]]; then
export LD_LIBRARY_PATH="/absolute/path/to/cuda-10.0/lib64:${LD_LIBRARY_PATH}"; else export LD_LIBRARY_PATH="/absolute/path/to/cuda-10.0/lib64"; fi You could save this to a shell script (putting #!/usr/bin/env bash before it as the first line, and then running chmod +x path/to/script.sh), and then execute it in the context of the current shell for example like so: source path/to/activate-cuda-10.0.sh Many deep learning applications that use CUDA also use CuDNN, a deep learning library provided by Nvidia for accelerating deep learning applications. The archived versions of CuDNN can be found here: https://developer.nvidia.com/rdp/cudnn-archive When downloading (you need an Nvidia developer account, but thankfully this is free), pay attention to the version being requested in the error messages generated by your application. Also take care to download the version of CUDA you're using, and match the CuDNN version appropriately. When you download, select the "cuDNN Library for Linux" option. This will give you a tarball, which contains a single directory cuda. Extract the contents of this directory over the top of your CUDA directory from following my instructions above, and it should work as intended. I used my graphical archive manager for this purpose. ## PyTorch and the GPU: A tale of graphics cards Recently, I've been learning PyTorch - which is an artificial intelligence / deep learning framework in Python. While I'm not personally a huge fan of Python, it seems to be the only library of it's kind out there at the moment (and Tensorflow.js has terrible documentation) - so it would seem that I'm stuck with it. Anyway, as I've been trying to learn it I inevitably came to the bit where I need to learn how to take advantage of a GPU to accelerate the neural network training process. I've been implementing a few test networks to see how it performs (my latest one is a simple LSTM, loosely following this tutorial). In PyTorch, this isn't actually done for you automatically. The basic building blocks of PyTorch are tensors (potentially multi-dimensional arrays that hold data). Each tensor is bound to a specific compute device - by default the CPU (in which the data is stored in regular RAM). TO do the calculations on a graphics card, you need to bind the data to the GPU in order to load the data into the GPU's own memory - so that the GPU can access it and do the calculation. The same goes for any models you create - they have to be explicitly loaded onto the GPU in order to run the calculations in the right place. Thankfully, this is fairly trivial: tensor = torch.rand(3, 4) tensor = tensor.to(COMPUTE_DEVICE) ....where COMPUTE_DEVICE is the PyTorch device object you want to load the tensor onto. I found that this works to determine the device that the data should be loaded onto quite well: COMPUTE_DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu') Unfortunately, PyTorch (and all other AI frameworks out there) only support a technology called CUDA for GPU acceleration. This is a propriety Nvidia technology - which means that you can only use Nvidia GPUs for accelerated deep learning. Since I don't actually own an Nvidia GPU (far too expensive, and in my current laptop I have an AMD Radeon R7 M445 - and I don't plan on spending large sums of money to replace a perfectly good laptop), I've been investigating hardware at my University that I can use for development purposes - since this is directly related to my PhD after all. Initially, I've found a machine with an Nvidia GeForce GTX 650 in it. If you run torch.cuda.is_available(), it will tell you if CUDA is available or not: print(torch.cuda.is_available()) # Prints True if CUDA is available .....but, as always, there's got to be a catch. Just because CUDA is available, doesn't mean to say that PyTorch can actually use it. After a bunch of testing, it transpired that PyTorch only supports CUDA devices with a capability index greater than or equal to 3.5 - and the GTX 650 has a capability index of just 3.0. You can see where this is going. I foound this webpage was helpful - it lists all of Nvidia's GPUs and their CUDA capability indices. You can also get PyTorch to tell you more about the CUDA device it has found: def display_compute_device(): """Displays information about the compute device that PyTorch is using.""" log(f"Using device: {COMPUTE_DEVICE}", newline=False) if COMPUTE_DEVICE.type == 'cuda': print(" {0} [Memory: {1}GB allocated, {2}GB cached]".format( torch.cuda.get_device_name(0), round(torch.cuda.memory_allocated(0)/1024**3, 1), round(torch.cuda.memory_cached(0)/1024**3, 1) )) print() If you execute the above method, it will tell you more about the compute device it has found. Note that you can actually make use of multiple compute devices at the same time - I just haven't done any research into that yet. Crucially, it will also generate a warning message if your CUDA device is too old. To this end, I'll be doing some more investigating as to the resources that the Department of Computer Science has available for PhD students to use.... If anyone knows of an artificial intelligence framework that can take advantage of any GPU (e.g. via OpenCL, oneAPI, or other similar technologies), do get in touch. I'm very interested to explore other options. ## Easy AI with Microsoft.Text.Recognizers I recently discovered that there's an XMPP client library (NuGet) for .NET that I overlooked a few months ago, and so I promptly investigated the building of a bot! The actual bot itself needs some polishing before I post about it here, but in writing said bot I stumbled across a perfectly brilliant library - released by Microsoft of all companies - that can be used to automatically extract common data-types from a natural-language sentence. While said library is the underpinnings of the Azure Bot Framework, it's actually free and open-source. To that end, I decided to experiment with it - and ended up writing this blog post. Data types include (but are not limited to) dates and times (and ranges thereof), numbers, percentages, monetary amounts, email addresses, phone numbers, simple binary choices, and more! While it also lands you with a terrific number of DLL dependencies in your build output folder, the result is totally worth it! How about pulling a DateTime from this: in 5 minutes or this: the first Monday of January or even this: next Monday at half past six Pretty cool, right? You can even pull multiple things out of the same sentence. For example, from the following: The host 1.2.3.4 has been down 3 times over the last month - the last of which was from 5pm and lasted 30 minutes It can extract an IP address (1.2.3.4), a number (3), and a few dates and times (last month, 5pm, 30 minutes). I've written a test program that shows it in action. Here's a demo of it working: (Can't see the asciicast above? View it on asciinema.org) The source code is, of course, available on my personal Git server: Demos/TextRecogniserDemo If you can't check out the repo, here's the basic gist. First, install the Microsoft.Recognizers.Text package(s) for the types of data that you'd like to recognise. Then, to recognise a date or time, do this: List<ModelResult> result = DateTimeRecognizer.RecognizeDateTime(nextLine, Culture.English); The awkward bit is unwinding the ModelResult to get at the actual data. The matched text is stored in the ModelResult.Resolution property, but that's a SortedDictionary<string, object>. The interesting property inside which is value, but depending on the data type you're recognising - that can be an array too! The best way I've found to decipher the data types is to print the value of ModelResult.Resolution as a string to the console: Console.WriteLine(result[0].Resolution.ToString()); The .NET runtime will helpfully convert this into something like this: System.Collections.Generic.SortedDictionary2[System.String,System.Object] Very helpful. Then we can continue to drill down: Console.WriteLine(result[0].Resolution["values"]); This produces this: System.Collections.Generic.List1[System.Collections.Generic.Dictionary`2[System.String,System.String]] Quite a mouthful, right? By cross-referencing this against the JSON (thanks, Newtonsoft.JSON!), we can figure out how to drill the rest of the way. I ended up writing myself a pair of little utility methods for dates and times: public static DateTime RecogniseDateTime(string source, out string rawString) { List<ModelResult> aiResults = DateTimeRecognizer.RecognizeDateTime(source, Culture.English); if (aiResults.Count == 0) throw new Exception("Error: Couldn't recognise any dates or times in that source string."); /* Example contents of the below dictionary: [0]: {[timex, 2018-11-11T06:15]} [1]: {[type, datetime]} [2]: {[value, 2018-11-11 06:15:00]} */ rawString = aiResults[0].Text; Dictionary<string, string> aiResult = unwindResult(aiResults[0]); string type = aiResult["type"]; if (!(new string[] { "datetime", "date", "time", "datetimerange", "daterange", "timerange" }).Contains(type)) throw new Exception($"Error: An invalid type of {type} was encountered ('datetime' expected).");

string result = Regex.IsMatch(type, @"range\$") ? aiResult["start"] : aiResult["value"];
return DateTime.Parse(result);
}

private static Dictionary<string, string> unwindResult(ModelResult modelResult)
{
return (modelResult.Resolution["values"] as List<Dictionary<string, string>>)[0];
}

Of course, it depends on your use-case as to precisely how you unwind it, but the above should be a good starting point.

Once I've polished the bot I've written a bit, I might post about it on here.

Found this interesting? Run into an issue? Got a neat use for it? Comment below!

## Semantic Nets in Prolog

Yesterday a few friends were puzzling over a few Prolog exam questions, and I thought I'd write up a post about what we learnt before I forget :-)

The first part of the question asked us to convert a paragraph of knowledge into a semantic net (isa / hasa) diagram. Here's the paragraph in question:

Charles and Wilbert are rats which are brown coloured European animals. Charles has a brown collar. Animals are defined as having DNA and being about to move. They include African animals, European animals and Australian animals. Skippy is a kangaroo; kangaroos are brown coloured Australian animals. Wallabies are dark brown Australian animals, Willy being one of them. They have a diet of eucalyptus leaves. Gnu are antelopes and come from Africa, and they have stripes, as are Nyala. Stella is a Gnu and Madge a Nyala.

This first part wasn't too tough. It doesn't quite fit in some places, but here's what I came up with:

(Generated using mermaid by Knut Sveidqvist)

The blue nodes are the isa node, while the green nodes are the hasa nodes. The next part asked us to convert the above into prolog. Again, this wasn't particularly hard - it's just a bunch of isa/2's and hasa/2's:

isa(charles, rat).
isa(wilbert, rat).
isa(rat, european_animal).
isa(european_animal, animal).
isa(african_animal, animal).
isa(australian_animal, animal).
isa(skippy, kangaroo).
isa(kangaroo, australian_animal).
isa(wallaby, australian_animal).
isa(willy, wallaby).
isa(gnu, antelope).
isa(antelope, african_animal).
isa(stella, gnu).
hasa(animal, dna).
hasa(animal, able_to_move).
hasa(rat, colour(brown)).
hasa(wallaby, colour(dark_brown)).
hasa(wallaby, diet(eucaliptus_leaves)).
hasa(gnu, stripes).
hasa(nyala, stripes).

After converting the diagram into Prolog, we were then asked to write some Prolog that interacts with the above knowledge base. Here's the first challenge:

Define a predicate called appearance which behaves as follows:

appearance(wilbert,Colour).
Colour=dark_brown
true.
appearance(skippy,Colour).
Colour=brown
true.

Upon first sight, this looks rather complicated, but it's not actually as bad as it looks. Basically, it is asking for a predicate, that, given the name of a thing, returns the colour of that thing. For example, wilbert was produce the answer brown, and wallaby would return dark_brown. The trick here is to get Prolog to recurse up the isa hasa tree if it doesn't find the answer at the current node.

When thinking about recursion, a good idea is to consider the stopping condition first. In our case, we want it to stop when it finds a thing that has a colour. Here's that in Prolog:

appearance(Name, Colour) :-
hasa(Name, colour(Colour)).

Now we've got a stopping condition in place, we can think about the recursion itself. If it doesn't find a colour at the current node, we want Prolog to follow the appropriate isa fact and travel to the next level up. We can do that like so:

appearance(Name, Colour) :-
isa(Name, Thing),
appearance(Thing, Colour).

That completes the first challenge. If you put the above together this is what you'll get:

appearance(Name, Colour) :-
hasa(Name, colour(Colour)).
appearance(Name, Colour) :-
isa(Name, Thing),
appearance(Thing, Colour).

The second challenge, however, was much more challenging:

Write a predicate that takes two argument and is true if both animals live on the same continent. Thus

?- same_continent(skippy,willy).

is true, whilst

?- same_continent(stella,skippy).

is not.

The problem with this challenge is that unlike the first challenge, there isn't any way (that I could think of anyway) to determine he continent that an animal comes from. I managed to hack around this by always going up 2 levels before comparing the things to see if they are the same:

same_continent(NameA, NameB) :-
isa(NameA, AnimalTypeA),
isa(AnimalTypeA, ContA),

isa(NameB, AnimalTypeB),
isa(AnimalTypeB, ContB),

ContA = ContB.

For example, if wilfred and charles were plugged in, both ContA and ContB would be set to european_animal, and so Prolog would return true. Prolog would tell us that skippy and wilbert are not of the same continent because ContA and ContB would be set to different values (european_animal and australian_animal).

Art by Mythdael