
Website Update: Tools section

A while ago I noticed that the tools section of my website was horribly outdated, so recently I decided to do something about it. It was still largely displaying tools from when I still used Windows as my primary operating system, which was a long time ago now!

The new revision changes it to display icons only instead of icons and screenshots, as screenshots aren't always helpful for some of the tools I now use - and it also makes it easier to keep the section updated in the future.

A screenshot of part of the new tools section of my website.

I also switched to use a tab separated values file (TSV file) instead of a JSON file for the backend data file that the tools list is generated from, as a TSV file is much more convenient to hand edit than a JSON file (JSON is awesome for lots of things and I use it all the time - it's just not as useful here). If you're interested, you can view the source TSV file here: tools.tsv
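
To give a rough idea of why that's easier to hand edit, a row of such a TSV file might look something like the following - note that these columns are just a hypothetical sketch, and the real tools.tsv may well be structured differently:

name	url	icon	description
Mozilla Firefox	https://www.mozilla.org/firefox/	firefox.svg	My web browser of choice

Each field is separated by a single tab character and each line is a single tool, so adding a new entry is just a case of adding a new line.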

I'm still filling out the list (each item in the list also necessitates an update to my personal logo collection), but it's already a huge improvement from the old tools list. Things like GitHub's Atom IDE, Ubuntu, Mozilla Firefox, and KeePass2 are currently my daily drivers - so it's great to have them listed there.

Check out the new tools section here: https://starbeamrainbowlabs.com/#tools

Add your blog to hullblogs.com

For many years, csblogs.com has aggregated posts from the blogs of current and past students of the University of Hull. Unfortunately, recently the site has started to generate random error messages. To fix this, Freeside (of which I'm the head, as it turns out) have decided to implement a replacement.

It looks rather cool if I do say so myself, so I thought I'd share it here. I'll talk first about how you can add your blog to it and why it's a good idea to start one yourself (and how you can do so for free!). Then, I'll talk a little bit at the end about how I built hullblogs.com and how it's put together.

@closebracket had the idea to call it hullblogs.com, and then I implemented the code for it:

(Above: A screenshot of hullblogs.com)

You can visit it here: hullblogs.com.

If you're a current or past student at the University of Hull, then we want to hear from you! If you've got a blog (and even if you haven't yet!), then you can add your blog by going to hullblogs.com, and then clicking "add your blog".

If you don't yet have a blog, then we have you covered. Our guide has a number of really easy ways to get started and host a blog for free! There are multiple ways to host a blog without paying a penny (no hidden charges after some time either).

But I don't have a blog!

If you're not convinced about setting up a blog and putting it on the Internet, there are a number of key reasons you should:

  • It's not about other people reading it. Your blog doesn't need to have 100s of views to be worthwhile - what matters is that you write for you, not for anyone else.
  • If only 1 person reads your blog, then it's worth it. Your personal blog and/or website makes a great addition to your CV. It's a wonderful way of documenting the things you're doing and learning while doing your degree and beyond - and a great way for potential employers to find more detail on all this information in 1 place.
  • Write for yourself. Personally, I find my blog a great place to post about difficult problems that I've encountered and solved. Then when I encounter a similar problem again, I can re-read my own blog post about it. Because I'm the one who wrote the post, I can anticipate what I may find confusing and explain that in more detail to help out future me.
  • Improve your technical writing skills. Technical writing - the process of writing about complex technical subjects (whether in Computer Science or beyond) - is an invaluable skill. Writing documentation, communicating complex ideas, and helping others are all situations that can call for technical writing - and it's a great thing to put on your CV too. Just doing a thing isn't enough - being able to write about it and document it is really important when working on commercial (or even academic) projects, for example.

If you're not computer science oriented, then WordPress, Blogger, or maybe Squarespace (though I'm not sure whether Squarespace has a free tier) would be a great place to start your blogging journey. You can host a blog for free too!

If you are computer science oriented (I'd guess most of the people reading this probably are given the kinds of posts I write for my blog here), then I strongly recommend Eleventy. It's a static site generator, and there are a whole bunch of ready-made templates for you to use to get started quickly.

In fact, hullblogs.com itself is built with Eleventy, using feedme to parse RSS/Atom feeds.

Every night, the site will be rebuilt. During this process, it will download the RSS/Atom feed from all the blogs registered, and order the posts chronologically. If an image is associated with a post, this will also be downloaded - the site also looks for the first image in a post's content if an image isn't explicitly specified to be attached to a blog. Once ordered, the posts are then paginated (split into multiple pages) and the rest of the site is rendered.

Being a static site, once the (re)build is complete hosting the site is simple and secure. Currently, Freeside hosts it using Nginx to statically serve it.
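
To give a rough idea of what that looks like in practice, the nightly rebuild can be driven by something as simple as a cron entry like the one below. The path and the exact build command here are illustrative guesses rather than the actual Freeside setup:

# Rebuild hullblogs.com at 2am every night; Nginx then serves the freshly built static output
0 2 * * *    cd /srv/hullblogs.com && npx @11ty/eleventy

Because the output is just static files, there's nothing to restart or redeploy once the build finishes.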

If your feed is malformed/contains errors or your blog is offline don't worry! hullblogs.com will transparently ignore your feed when it rebuilds. Then, once your blog is back up and running, it will add it back into the main hullblogs.com site the next night when it rebuilds again.

If you're interested in digging into the source code of the site, I've open sourced it on GitHub under the Apache 2.0 Licence: https://github.com/FreesideHull/hullblogs.com

If you've read this far, thanks so much for reading! We're after a better favicon logo for hullblogs.com. If you've got an idea, please do let us know by opening an issue.

Securing your port-forwarded reverse proxy

Recently, I answered a question on Reddit about reverse proxies, and said answer was long enough and interesting enough to be tidied up and posted here.

The question itself is concerning port forwarded reverse proxies and internal services:

Hey everyone, I've been scratching my head over this for a while.

If I have internal services which I've mapped a subdomain like dashboard.domain.com through NGINX but haven't enabled the CNAME on my DNS which would map my dashboard.domain.com to my DDNS.

To me this seems like an external person can't access my service because dashboard.domain.com wouldn't resolve to an IP address but I'm just trying to make sure that this is the case.

For my internal access I have a local DNS that maps my dashboard.domain.com to my NGINX.

Is this right?

--u/Jhonquil

So to answer this question, let's first consider an example network architecture:

So we have a router sitting between the Internet and a server running Nginx.

Let's say you've port forwarded to your Nginx instance on 80 & 443, and Nginx serves 2 domains: wiki.bobsrockets.com and dashboard.bobsrockets.com. wiki.bobsrockets.com might resolve both internally and externally for example, while dashboard.bobsrockets.com may only resolve internally.

In this scenario, you might think that dashboard.bobsrockets.com is safe from people accessing it outside, because you can't enter dashboard.bobsrockets.com into a web browser from outside to access it.

Unfortunately, that's not true. Suppose an attacker catches wind that you have an internal service called dashboard.bobsrockets.com running (e.g. through crt.sh, which makes certificate transparency logs searchable). With this information, they could for example modify the Host header of a HTTP request like this with curl:

curl --header "Host: dashboard.bobsrockets.com" http://wiki.bobsrockets.com/

....which would cause Nginx to serve the content of dashboard.bobsrockets.com to the external attacker! The same can also be done over HTTPS with a bit more work.
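
To illustrate the HTTPS case, an attacker can simply tell curl to resolve your internal hostname to your public IP address themselves - something along these lines (with 203.0.113.1 standing in for your public IP):

curl --resolve dashboard.bobsrockets.com:443:203.0.113.1 --insecure https://dashboard.bobsrockets.com/

The --insecure flag is only needed if the certificate served doesn't match - either way, Nginx happily picks the server block based on the hostname it's given.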

That's no good. To rectify this, we have 2 options. The first is to run 2 separate reverse proxies, with all the internal-only content on the first and the externally-viewable stuff on the second. Most routers that offer the ability to port forward also offer the ability to do transparent port translation too, so you could run your external reverse proxy on ports 81 and 444 for example.

This can get difficult to manage though, so I recommend the following:

  1. Force redirect to HTTPS
  2. Then, use HTTP Basic Authentication like so:
server {
    # ....
    satisfy any;
    allow   192.168.0.0/24; # Your internal network IP address block
    allow   10.31.0.0/16; # Multiple blocks are allowed
    deny    all;
    auth_basic              "Example";
    auth_basic_user_file    /etc/nginx/.passwds;

    # ....
}

This allows connections from your local network through no problem, but requires a username / password for access from outside.
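
The password file referenced by auth_basic_user_file can be created with the htpasswd tool from the apache2-utils package (the username bob here is just an example):

sudo apt install apache2-utils
sudo htpasswd -c /etc/nginx/.passwds bob

The -c flag creates the file - omit it when adding additional users to an existing file.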

For your internal services, note that you can get a TLS certificate for HTTPS for services that run inside by using Let's Encrypt's DNS-01 challenge. No outside access is required for your internal services, as the DNS challenge is completed by automatically setting (and then removing again afterwards) a DNS record, which proves that you have ownership of the domain in question.
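
For example, with certbot a one-off manual DNS-01 challenge looks roughly like this (in practice you'd want to use one of certbot's DNS plugins for your DNS provider instead, so that renewals can happen automatically):

sudo certbot certonly --manual --preferred-challenges dns -d dashboard.bobsrockets.com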

Just because a service is running on your internal network doesn't mean to say that running HTTPS isn't a good idea - defence in depth is absolutely a good idea.

PhD Update 10: Sharing with the world

Hey there - is it that time already? Another PhD update blog post! And in double digits too! In this one, I'll be talking mainly about the social media project I've been working on, as I haven't had time to work on the temporal CNN much since last time. I've also taken some time off in August, so technically this is only ~just over 1 month's worth of progress here.

Before we begin, here's a list of posts in this series so far. They give useful context - you probably won't understand this one without reading them first.

As with the last post, none of the graphs here are finalised, and are work in progress. There may be - in fact there probably are - multiple nasty bugs that invalidate the results that I haven't found yet.

AAAI Doctoral Consortium 2022

The main thing I've done since last time is apply for the AAAI Doctoral Consortium 2022. As I understand it, this is a specific part of the main AAAI conference (Association for the Advancement of Artificial Intelligence) that is designed for PhD student researchers. To apply, you have to submit a cover page, your CV, and a 2 page thesis summary.

The cover page wasn't too bad, and updating my CV and getting it checked by the careers service at my university was simply a case of doing it. The thesis summary on the other hand was more of a challenge than I expected - it's quite difficult to summarise your entire 3 years of (planned) work in just 2 pages! It helped me to picture it as a high-level overview / conversation starter rather than a free-standing paper in its own right - even if it looks like one.

Although this took longer than I thought to prepare, I did in fact get my submission together in time. I was glad that I left some extra time at the end though, as the rules were both very strict about what you can and can't do with your paper, and unclear, since they were written for the main AAAI conference. In addition, I found at least 2 mistakes in the instructions and rules which appeared to be left over from previous years and hadn't been updated.

AI + HADR 2021

The other conference which I attempted to apply for at the suggestion of my supervisor is AI + HADR 2021. This stands for Artificial Intelligence for Humanitarian Assistance and Disaster Response, and it's a virtual workshop happening in December 2021. The submission calls for a 6 page paper (5 for content, and 1 for references) on things related to disaster response.

My plan here was to write a paper on the social media work I've been doing and submit that, with the idea of talking about the existing sentiment analysis I've done (see update 9) and focusing on answering a research question using the sentiment analysis model I've implemented.

Unfortunately, things didn't go to plan as I ran out of time - even with my supervisor helping to write some parts of the paper. After finding a number of bugs in my data processing pipeline and failing to see any obvious trends in the sentiment analysis graphs I plotted, I realised that it was unlikely that I was going to be able to submit for it this time around.

The plan from here is to take some more time over it and answer my other research question instead, and tidy up and re-use our existing unfinished submission here for IJCNN 2022 (I think that's the right link) - the submission deadline for which I think is going to be in January 2022.

Social media

While I've spent time writing submissions for various conferences and workshops, I have also been doing a bit of social media data analysis too.

To start with, I implemented a new endpoint for labelling tweets using a given saved model checkpoint. After using this to label the various datasets I've acquired (with the best transformer-based model checkpoint I have, which I think is ~78% accurate - I've got to double check everything to make sure I haven't mixed anything up), I then got to work plotting some graphs. To start with, I plotted a simple bar graph of the overall sentiment of the different datasets I've downloaded.

(Above: A bar graph showing the overall sentiment of some of the floods in my dataset.)

I'm not really sure what to make of this - I suspect context-specific information is required to fully interpret this. I couldn't find any reliable context-specific information on short notice for the AI + HADR paper, so I'm going to attempt to keep looking if I can find the time to do so. Asking someone from the energy and environment institute may be a good idea.

After this, I binned the tweets over time and used this to plot a combined graph showing both the tweet frequency and sentiment over time using Gnuplot.

(Above: A graph showing the frequency (blue) and sentiment (green and red) of tweets over time.)

Again here, I'm not sure what to make of this. Even cropping out the long tail of people talking about the flooding event afterwards - to make it easier to see the sentiment over time as the actual event occurred - doesn't seem to help uncover any clear trends. It could be said that for Hurricane Iota the sentiment got more positive over time at the beginning, but this doesn't really hold true for Storm Christoph and the others - and without context-specific information it's difficult to tell if there are any meaningful conclusions that can be drawn here.

To this end, after talking with my supervisor we've got some idea of things I'm going to try - so more on this in an upcoming PhD update blog post.

Finally, I've also had a discussion with my supervisor and when I'm ready to publish something on my social media work, I'm going to make the code behind it open source (probably either GPLv3 or MPL-2.0). If you're interested, the code for downloading tweets using Twitter's Academic API is already open source: https://www.npmjs.com/package/twitter-academic-downloader.

Temporal CNN

I haven't really done anything on the Temporal CNN since last time, but I wanted to make sure it wasn't left out of this post! It's definitely still on the cards - the plan is that once I've got this data analysis done and some meaningful social media results, I'm going to return to the Temporal CNN and put the plan I described in the last post into action.

Conclusion

Since last time I've mainly been writing and analysing social media data. While I didn't manage to apply to AI + HADR 2021, I did manage to submit for AAAI Doctoral Consortium 2022 - I'll find out if I've been accepted on the 15th October 2021.

Next up, I'm going to be working on answering my other research question first - more on this in a later blog post. If I have time, I'll put some effort into the Temporal CNN - though I doubt I'll have anything to show on that front next time. Finally, I'm going to be arranging my PhD Panel 4 (where did all the time go?) - hopefully for before the end of November, availability of those involved permitting.

If you are finding this series of blog posts on my PhD interesting, please do comment below. It's great to see that the stuff I'm working on for my PhD is actually interesting to someone.

Sources and further reading

I bought a 3d printer! | Ender 3 v2 in review

Hey there! Recently, I bought an Ender 3 v2 3d printer, and now that I've used it enough I can talk about how I've found it. In this post, I'll be covering my thoughts on the 2 parts of 3d printing: the printer itself, and the slicing software that you run models through to turn them into G-code that the 3d printer understands.

(Above: A photo of my ender 3 v2 in my loft.)

Let's start with the printer itself. Compared to my earlier efforts, it is immediately apparent that the design of the printer is considerably more robust. The frame has 2 pillars that are pretty much impossible to screw together incorrectly (it will be obvious at later steps if you do it incorrectly). The drive belt does not come pre-installed though, which I found to be the most frustrating part of the build.

The instruction booklet was noticeably more sparse and unclear than the Axis 3d instructions though - at some points I found myself having to look up some assembly instructions in order to understand them.

Once assembled, the printer was fairly easy to use. It has a colour display that as I understand it has a significantly higher resolution than previous models by Creality, which allows room for things like icons which greatly enhance the usability of the printer. I suspect that this upgrade may be due to the presence of a 32 bit microprocessor (likely an ARM) rather than an 8 bit one found in previous models, which is bound to come with more RAM etc as standard.

While the printer does come with a small amount of filament, I recommend buying a reel or 2 to use with it. Because the filament that comes with it is not on a reel, it easily tangles into nasty knots. I'm going to empty a reel of other filament first before winding the white filament that came with the printer back onto the empty reel.

Loading the filament takes practice. The advice in the booklet to cut a 45° angle on the end of the filament is really important, as otherwise it's impossible to load the filament past the NEMA motor filament loading mechanism. You have to get the end of the filament at just the right angle too to catch the PTFE tubing that leads to the hot end nozzle. This does get easier though with time - personally I think I need to make or purchase an extra clip-on light, because my printer is in my loft and it can be difficult to see what I'm doing when changing the filament.

Before you print, you have to manually level the bed of the printer. This is done by adjusting the wheels under the 4 corners of the build plate until a piece of plain paper on the build plate just gently scratches the tip of the nozzle. If the wheel in 1 corner doesn't appear to go far enough, try coming back to it and doing the other corners first. By adjusting the wheel in the other corners, the corner in question will be adjusted as well. It is for this reason it is also recommended to go around 2-3 times to make sure it's all level before beginning.

Printing itself is fairly simple. You insert the microSD card containing the G-code, preheat the nozzle to the right temperature, use the auto-home feature, and then select the file you want to print using the menu. I found that it's absolutely essential that you make sure that the build plate itself is as far back as possible - just touching the end-stop - as otherwise you get nasty loud belt grinding noises when it runs through the preamble at the beginning of the G-code that causes the hot end to move all the way to the front of the build plate and back again.

Once a print is complete, I've found the supplied scraping tool to be sufficient to extract prints from the print bed. It's much easier to wait between 5 and 10 minutes for the heated bed to cool down before attempting to scrape it off - many prints can just be lifted off the bed with no scraping required (tested with some PLA filament).

Speaking of the build plate, it has a glass surface on top. My research suggested that this leads to a much more even surface on the bottom of prints, and I've certainly found this to be true. While you do have to be careful not to scratch it, the glass build plate the Ender 3 v2 comes with as standard is a nice addition to the printer.

To summarise, the Ender 3 v2 is a really nice solid printer. It's well built and relatively easy to setup and use, though filament organisation and anti-tangling will be the first project you work on when you start printing.

(Above: 4 calibration cats! 3 in blue and 1 in white - the 3 in blue have some stringing issues. I'll definitely be printing more of these :D)

Ultimaker Cura

In order to print things, you need to use a slicer, which takes a 3D model (e.g. in a .obj or .stl file) and converts it into the G-code that the printer understands. My choice here is Ultimaker Cura. While it's in the default Ubuntu repositories, I found the AppImage on GitHub to be more up-to-date, so I packaged it into my apt repository.

Since Cura doesn't appear to have explicit support for the Ender 3 v2 just yet, I've been using the Ender 3 Pro profile instead, which seems to work just fine.

Cura has a large number of features which are reasonably well organised for preparing prints. You can import the aforementioned .obj or .stl files and apply various transformations to imported models, such as translate, scale, and rotate. Cura also helpfully auto-snaps the bottom of a model to the virtual build plate for you, so you don't have to worry about getting the alignment right.

Saving and loading project files is annoying. It asks you to specify the place you want to save a project to every time you hit the save button and doesn't remember the location of the project file you saved to last (unlike e.g. a text editor, GIMP, LibreOffice, etc), which is really frustrating. I guess it's better than crashing on save like the version of Cura in the default Ubuntu repositories though, so I'll count that as a win?

It would also be helpful if there was a system for remembering or bookmarking the commonly adjusted settings. I've found that I usually need to adjust the same small set of settings over and over again, and it's a pain having to find them in the "expert" settings list or using the search bar.

The preview mode is also useful, as it shows you precisely what your printer will actually end up printing. Great for checking that text is large / thick enough, that parts are big enough to avoid losing detail, and for double checking that supports will print the way you expect (I recommend trying the tree mode for supports if you have to use them).

Given that I've used Blender before (exhibit a), it would be very nice to have the ability to customise keyboard shortcuts - or even better, a Blender keyboard shortcut scheme I could enable. Hitting R, then X, then 90 is just 1 example of a whole range of keyboard shortcuts I keep trying to use in Cura, but Cura doesn't support them.

On the whole, Cura works well as a slicer. It provides many tweakable settings for adjusting things based on the filament you're using (I need to look into making a profile for the filament I use at the moment, as I'm sure the next roll of filament will require different settings). The 3d preview window is intuitive and easy to use. While the program does have some rough edges as mentioned above, these are minor issues that could easily be corrected.

Unethically disclosed vulnerabilities in Pepperminty Wiki: My perspective

Recently, I've made a new release of my PHP-based wiki engine Pepperminty Wiki - v0.23. This would not normally be notable, but as it turns out there were a number of security issues (the severity of which varies) that needed fixing. I fixed them of course, but the manner in which they were disclosed to me was less than ethical.

In this post, I want to explain what happened from my perspective, and why I'm rather frustrated with the way the reporter handled things.

It all started with issue #222 that was opened by @hmaverickadams. In that issue, they say that they have discovered a number of vulnerabilities:

Hi,

I am a penetration tester and discovered a couple of vulnerabilities within your application. I will be applying for CVE status on the findings, but would like to work with you on the issues if possible. I could not locate an email, so please feel free to shoot me your contact info if possible.

Thank you!

So far, so good! Seems responsible, right? It did to me too. For reference, CVE there refers to the Common Vulnerabilities and Exposures, a website that tracks vulnerabilities in software from across the globe.

I would have left it at that, but I decided to check out the GitHub projects that @hmaverickadams (henceforth "the reporter") had. To my surprise, I found these public GitHub repositories:

These appeared to be created just 1 day after issue #222 was opened against Pepperminty Wiki. I was on holiday at the time (3 weeks; and I hadn't been checking my GitHub notifications as often as I perhaps should), so it took me 22 days to get to it. Generally speaking, I would consider a minimum of 90 days with no response appropriate before publishing a vulnerability publicly like that - this is the core of the matter, but more on this later. Here are links to these vulnerabilities on the CVE website:

You may also ask yourself "what were the vulnerabilities in question in the first place?" - glad you asked! Let's take a look.

CVE-2021-38600

Described officially as "a stored Cross Site Scripting (XSS) vulnerability", this essentially means that you can convince Pepperminty Wiki to store some arbitrary HTML (which may contain a malicious script for example) and later serve it to some poor unsuspecting visitors.

In this particular vulnerability, the reporter found that when filling out the initial setup web form that appears the first time you load Pepperminty Wiki up with a wiki name that contains arbitrary HTML, Pepperminty Wiki will blindly serve this to users.

It sounds like a big issue, but to fill out the first run web form you need the site secret - which is generated randomly and stored in peppermint.json, which itself has a check to ensure it can't be loaded through the web server - so this isn't actually as big a deal as it first appears. In fact, Pepperminty Wiki has a number of settings that by design allow one to serve arbitrary HTML:

  • editing_message - a message that appears below the page editing form and before the submit button
  • admindisplaychar - inserts text (or arbitrary HTML) before the name of an administrator
  • footer_message - a message (that may contain arbitrary HTML) that is displayed at the bottom of every page

All of these can be modified either by a moderator in the site settings page, or through peppermint.json directly.

...so personally I don't class this as a vulnerability. Regardless, I've fixed it by running the wiki name through htmlentities() - but in doing so I speculate that some special characters (e.g. quotes) will no longer display properly because of how I fixed CVE-2021-38601 (see below) - I'll continue working on this.

CVE-2021-38601

This vulnerability is described as "a reflected Cross Site Scripting (XSS) vulnerability". This is similar to CVE-2021-38600, but instead of storing a value the attack makes use of various GET parameters. There are (were, since I've fixed it) examples of GET parameters that caused this issue, including action (sets the subcommand/action that should be taken - e.g. view, edit, etc) and page (the current page on the wiki you're looking at).

Unlike CVE-2021-38600, this is a more serious vulnerability. Someone could generate a malicious link to a Pepperminty Wiki instance that instead redirects you to an attacker-controlled website (i.e. by using location.href = "blah").

Fixing it though required me to do a comprehensive review of every single line of Pepperminty Wiki's codebase, which took me multiple hours of intense programming and was really rather unpleasant. The description by the reporter in the repo was quite unhelpful:

A reflected Cross Site Scripting (XSS) vulnerability exists on multiple parameters in version 0.23-dev of the Pepperminty-Wiki application that allows for arbitrary execution of JavaScript commands.

Affected page: http://localhost/index.php

Sample payload: https://localhost/index.php?action=<script>alert(1)</script>

CVE: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-38601

I discovered in my comprehensive review that action and page were not the only parameters affected - I fixed every instance I could find.

Reaching out

To try and understand their side of the story and gain additional insight into the vulnerabilities they discovered, I attempted to reach out to them. First, I tried opening an issue on the above GitHub repositories they created.

Instead of replying to the issues though, the reporter deleted the issues I opened, and set it so that nobody could open issues on the repositories anymore! As of the time of typing, I still do not have a response from the reporter about this.

Not to be deterred, I found a pair of twitter accounts they controlled and tweeted at them:

As you can probably guess, I haven't had a response yet (though I will of course update this blog post if I do). To make absolutely sure that I got through, I also filled out the contact form on their website - also to no avail so far.

With all this in mind, I get the impression the reporter does not want to talk to me - especially since they deleted the issues I opened against their repositories instead of replying to them. This is frustrating, because I was put in a really awkward position of having to deal with a zero day vulnerability as fast as I could after they publicly disclosed the vulnerabilities (worse still, I could tell that those repositories had some significant traffic since they have been starred by 7 + 4 people as of the time of typing).

Can't find an email address?

After this (and in between comprehensively reviewing Pepperminty Wiki's codebase), I also re-read the initial issue. When re-reading it, a particular sentence also struck me as odd:

I could not locate an email, so please feel free to shoot me your contact info if possible.

This is very strange, since I have the following paragraph in Pepperminty Wiki's README:

If you've found a security issue, please don't open an issue. Instead, get in touch privately - e.g. via Keybase or by email (security [at sign] starbeamrainbowlabs [replace me with a dot] com), and I'll try to respond ASAP.

I also have my website and email address on my GitHub profile, and my website lists:

  • My email address
  • My Keybase details
  • My Twitter account
  • My Stack Exchange account
  • My reddit account
  • My GPG/PGP key id

I don't have my Discord account on there, but I can chat over that too after first using one of the above.

With this in mind, I found it to be very strange that the reporter was unable to find a means of contact to use to responsibly disclose the vulnerabilities.

CVE confusion

Now that I've fixed the vulnerabilities, I'm somewhat confused about how I update the pair of CVEs. This website gives the following instructions:

  1. Identify the CNA that published the CVE Record by searching for the CVE Record on the CVE List.
  2. Locate the responsible CNA in the "Assigning CNA" field of the CVE Record.
  3. Contact the CNA using their preferred contact method to request the update.

In my case, the assigning CNA is stated as "N/A" - I assume it's the unresponsive reporter above. I'm confused here then about how I'm supposed to update the CVEs, since I can't contact the original reporter.

If anyone can suggest a way in which I can update these CVEs to reflect what has happened and the fact that I've fixed them, I'd love to know - I would hate to leave those CVEs outdated as they may misinform someone. Contact details on my website homepage. You can also leave a comment on this blog post.

Conclusion

I'm upset not because the reporter found a vulnerability - it's great that they even took the time to find it in the first place in my little small-time project! I'm upset because they failed to properly disclose the vulnerabilities by contacting me privately first. Had they done so, they would have discovered that CVE-2021-38600 is not really a vulnerability at all.

I'm also upset because despite the effort I've gone to in order to reach out, instead of opening a civil and polite discussion about the whole issue I've instead been met with doors slammed in my face (i.e. issues deleted instead of being replied to).

I wanted to document my experiences here also to educate others about the ethical disclosure of vulnerabilities and security issues. Ethics, justice, and honesty are really important to me - and I'd like to try and avoid any accidental disclosures if at all possible.

If you find a vulnerability in someone's code (be it open or closed source), I would advise you to:

  1. Go in search of a security vulnerability disclosure policy (or, for open-source projects, search the README / find the contact details for the maintainers)
  2. Contact the authors of the code (or company, if commercial) to organise responsible disclosure
  3. When a patch has been written and tested, co-ordinate the release of the patch and the disclosure of the vulnerability

Please do not release the vulnerability publicly without first contacting the author (I suggest waiting 60 days and trying multiple methods of communication). This causes maintainers of projects (who in the case of open source are mostly volunteers who pour their time into projects without asking for anything in return) a lot of stress and anxiety, as I've discovered during this incident.

Given that I haven't experienced anything like this before and that I'm only human I'm sure that my response to this incident could use some work - but the manner in which these vulnerabilities were disclosed could use a lot of work too.

Sources and further reading

stl2png Nautilus Thumbnailer

Recently I've found myself working with STL files a lot more since I bought a 3d printer (more on that in a separate post!) (.obj is technically the better format, but STL is still widely used). I'm a visual sort of person, and with this in mind I like to have previews of things in my file manager. When I found no good STL thumbnailers for Nautilus (the default file manager on Ubuntu), I set out to write my own.

In my research, I discovered that OpenSCAD can be used to generate a PNG image from an STL file if one writes a small .scad file wrapper (ref), and wanted to blog about it here.

First, a screenshot of it in action:

(Above: stl2png in action as a nautilus thumbnailer. STL credit: Entitled Goose from the Untitled Goose Game)

You can find installation instructions here:

https://github.com/sbrl/nautilus-thumbnailer-stl/#nautilus-thumbnailer-stl

The original inspiration for this was twofold:

From there, wrapping it in a shell script and turning it into a nautilus thumbnailer was not too challenging. To do that, I followed this guide, which was very helpful (though the update-mime-database bit was wrong - the filepath there needs to have the /packages suffix removed).

I did encounter a few issues though. Firstly finding the name for a suitable fallback icon was not easy - I resorted to browsing the contents of /usr/share/icons/gnome/256x256/mimetypes/ in my file manager, as this spec was not helpful because STL model files don't fit neatly into any of the categories there.

The other major issue was that the script worked fine when I called it manually, but failed when I tried to use it via the nautilus thumbnailing engine. It turned out that OpenSCAD couldn't open an OpenGL context, so as a quick hack I wrapped the openscad call in xvfb-run (from X Virtual FrameBuffer; sudo apt install xvfb).
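
To give you an idea of the core of the approach, a minimal sketch of the thumbnail generation step might look like this (a simplified illustration, not the exact script in the repository linked above):

#!/usr/bin/env bash
# Usage: stl2png.sh path/to/model.stl path/to/output.png [size]
input="$(realpath "${1}")"; output="${2}"; size="${3:-256}";

# OpenSCAD can't render an STL file directly, so write a tiny wrapper .scad that imports it
scad="$(mktemp --suffix=.scad)";
echo "import(\"${input}\");" >"${scad}";

# xvfb-run provides the virtual display OpenSCAD needs to get an OpenGL context
xvfb-run openscad -o "${output}" --imgsize="${size},${size}" "${scad}";

rm "${scad}";

The real thumbnailer also handles the PNG optimisation and the plumbing that nautilus expects, but the above is the essence of it.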

With those issues sorted, it worked flawlessly. I also added optional oxipng (or optipng as an optional fallback) support for optimising the generated PNG image - in casual testing I found this saved between 70% and 90% on file sizes.

Found this interesting or helpful? Comment below! It really helps motivate me.

NAS Backups, Part 2: Btrfs send / receive

Hey there! In the first post of this series, I talked about my plan for a backup NAS to complement my main NAS. In this part, I'm going to show the pair of scripts I've developed to take care of backing up btrfs snapshots.

The first script is called snapshot-send.sh, and it:

  1. Calculates which snapshot it is that requires sending
  2. Uses SSH to remote into the backup NAS
  3. Pipes the output of btrfs send to snapshot-receive.sh on the backup NAS that is called with sudo

Note that while sudo is used to call snapshot-receive.sh, the account used to SSH into the backup NAS doesn't have completely unrestricted sudo access. Instead, a sudo rule is used to restrict it so that only specific commands can be called (without a password, as this is intended to be a completely automated and unattended system).

The second script is called snapshot-receive.sh, and it receives the output of btrfs send and pipes it to btrfs receive. It also has some extra logic to delete old snapshots and stuff like that.

Both of these are designed to be command line programs in their own right with a simple CLI, and useful error / help messages to assist in understanding it when I come back to it to fix an issue or extend it after many months.

snapshot-send.sh

As described above, snapshot-send.sh sends btrfs snapshots to a remote host via SSH and the snapshot-receive.sh script.

Before we continue and look at it in detail, it is important to note that snapshot-send.sh depends on btrfs-snapshot-rotation. If you haven't already done so, you should set that up first before setting up my scripts here.

If you have btrfs-snapshot-rotation setup correctly, you should have something like this in your crontab:

# Btrfs automatic snapshots
0 * * * *       cronic /root/btrfs-snapshot-rotation/btrfs-snapshot /mnt/some_btrfs_filesystem/main /mnt/some_btrfs_filesystem/main/.snapshots hourly 8
0 2 * * *       cronic /root/btrfs-snapshot-rotation/btrfs-snapshot /mnt/some_btrfs_filesystem/main /mnt/some_btrfs_filesystem/main/.snapshots daily 4
0 2 * * 7       cronic /root/btrfs-snapshot-rotation/btrfs-snapshot /mnt/some_btrfs_filesystem/main /mnt/some_btrfs_filesystem/main/.snapshots weekly 4

I use cronic there to reduce unnecessary emails. I also have a subvolume there for the snapshots:

sudo btrfs subvolume create /mnt/some_btrfs_filesystem/main/.snapshots

Because Btrfs does not take a snapshot of any child subvolumes when it takes a snapshot, I can use this to keep all my snapshots organised and associated with the subvolume they are snapshots of.

If done right, ls /mnt/some_btrfs_filesystem/main/.snapshots should result in something like this:

2021-07-25T02:00:01+00:00-@weekly  2021-08-17T07:00:01+00:00-@hourly
2021-08-01T02:00:01+00:00-@weekly  2021-08-17T08:00:01+00:00-@hourly
2021-08-08T02:00:01+00:00-@weekly  2021-08-17T09:00:01+00:00-@hourly
2021-08-14T02:00:01+00:00-@daily   2021-08-17T10:00:01+00:00-@hourly
2021-08-15T02:00:01+00:00-@daily   2021-08-17T11:00:01+00:00-@hourly
2021-08-15T02:00:01+00:00-@weekly  2021-08-17T12:00:01+00:00-@hourly
2021-08-16T02:00:01+00:00-@daily   2021-08-17T13:00:01+00:00-@hourly
2021-08-17T02:00:01+00:00-@daily   last_sent_@daily.txt
2021-08-17T06:00:01+00:00-@hourly

Ignore the last_sent_@daily.txt there for now - it's created by snapshot-send.sh so that it can remember the name of the snapshot it last sent. We'll talk about it later.

With that out of the way, let's start going through snapshot-send.sh! First up is the CLI and associated error handling:

#!/usr/bin/env bash
set -e;

dir_source="${1}";
tag_source="${2}";
tag_dest="${3}";
loc_ssh_key="${4}";
remote_host="${5}";

if [[ -z "${remote_host}" ]]; then
    echo "This script sends btrfs snapshots to a remote host via SSH.
The script snapshot-receive must be present on the remote host in the PATH for this to work.
It pairs well with btrfs-snapshot-rotation: https://github.com/mmehnert/btrfs-snapshot-rotation
Usage:
    snapshot-send.sh <snapshot_dir> <source_tag_name> <dest_tag_name> <ssh_key> <user@example.com>

Where:
    <snapshot_dir> is the path to the directory containing the snapshots
    <source_tag_name> is the tag name to look for (see btrfs-snapshot-rotation).
    <dest_tag_name> is the tag name to use when sending to the remote. This must be unique across all snapshot rotations sent.
    <ssh_key> is the path to the ssh private key
    <user@example.com> is the user@host to connect to via SSH" >&2;
    exit 0;
fi

# $EUID = effective uid
if [[ "${EUID}" -ne 0 ]]; then
    echo "Error: This script must be run as root (currently running as effective uid ${EUID})" >&2;
    exit 5;
fi

if [[ ! -e "${loc_ssh_key}" ]]; then
    echo "Error: When looking for the ssh key, no file was found at '${loc_ssh_key}' (have you checked the spelling and file permissions?)." >&2;
    exit 1;
fi
if [[ ! -d "${dir_source}" ]]; then
    echo "Error: No source directory located at '${dir_source}' (have you checked the spelling and permissions?)" >&2;
    exit 2;
fi

###############################################################################

Pretty simple stuff. snapshot-send.sh is called like so:

snapshot-send.sh /absolute/path/to/snapshot_dir SOURCE_TAG DEST_TAG_NAME path/to/ssh_key user@example.com

A few things to unpack here.

  • /absolute/path/to/snapshot_dir is the path to the directory (i.e. btrfs subvolume) containing the snapshots we want to read, as described above.
  • SOURCE_TAG: Given the directory (subvolume) name of a snapshot (e.g. 2021-08-17T02:00:01+00:00-@daily), then the source tag is the bit at the end after the at sign @ - e.g. daily.
  • DEST_TAG_NAME: The tag name to give the snapshot on the backup NAS. Useful, because you might have multiple subvolumes you snapshot with btrfs-snapshot-rotation and they all might have snapshots with the daily tag.
  • path/to/ssh_key: The path to the (unencrypted!) SSH key to use to SSH into the remote backup NAS.
  • user@example.com: The user and hostname of the backup NAS to SSH into.
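
Putting that together with the example filesystem from earlier, a concrete invocation might look like this (the script location, destination tag, and hostname are made up for illustration):

sudo /root/backups/snapshot-send.sh /mnt/some_btrfs_filesystem/main/.snapshots daily main-daily /root/backups/ssh_key_backup_nas_ed25519 backups@backup-nas.local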

This is a good time to sort out the remote user we're going to SSH into (we'll sort out snapshot-receive.sh and the sudo rules in the next section below).

Assuming that you already have a Btrfs filesystem setup and automounting on boot on the remote NAS, do this:


sudo useradd --system --home /absolute/path/to/btrfs-filesystem/backups backups
sudo groupadd backup-senders
sudo usermod -a -G backup-senders backups
cd /absolute/path/to/btrfs-filesystem/backups
sudo mkdir .ssh
sudo touch .ssh/authorized_keys
sudo chown -R backups:backups .ssh
sudo chmod -R u=rwX,g=rX,o-rwx .ssh

Then, on the main NAS, generate the SSH key:


mkdir -p /root/backups && cd /root/backups
ssh-keygen -t ed25519 -C backups@main-nas -f /root/backups/ssh_key_backup_nas_ed25519

Then, copy the generated SSH public key to the authorized_keys file on the backup NAS (located at /absolute/path/to/btrfs-filesystem/backups/.ssh/authorized_keys).
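
Assuming you've got another (sudo-capable) account on the backup NAS to log in with, one way to do that is something like this (the admin username and hostname here are placeholders):

cat /root/backups/ssh_key_backup_nas_ed25519.pub | ssh youradminuser@backup-nas.local "sudo tee -a /absolute/path/to/btrfs-filesystem/backups/.ssh/authorized_keys >/dev/null"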

Now that's sorted, let's continue with snapshot-send.sh. Next up are a few miscellaneous functions:


# The filepath to the last sent text file that contains the name of the snapshot that was last sent to the remote.
# If this file doesn't exist, then we send a full snapshot to start with.
# We need to keep track of this because we need this information to know which
# snapshot we need to parent the latest snapshot from to send snapshots incrementally.
filepath_last_sent="${dir_source}/last_sent_@${tag_source}.txt";

## Logs a message to stdout.
# $*    The message to log.
log_msg() {
    echo "[ $(date +%Y-%m-%dT%H:%M:%S) ] remote/${HOSTNAME}: >>> ${*}";
}

## Lists all the currently available snapshots for the current source tag.
list_snapshots() {
    find "${dir_source}" -maxdepth 1 ! -path "${dir_source}" -name "*@${tag_source}" -type d;
}

## Returns an exit code of 0 if we've sent a snapshot, or 1 if we haven't.
have_sent() {
    if [[ ! -f "${filepath_last_sent}" ]]; then
        return 1;
    else
        return 0;
    fi
}

## Fetches the directory name of the last snapshot sent to the remote with the given tag name.
last_sent() {
    if [[ -f "${filepath_last_sent}" ]]; then
        cat "${filepath_last_sent}";
    fi
}

# Runs snapshot-receive on the remote host.
do_ssh() {
    ssh -o "ServerAliveInterval=900" -i "${loc_ssh_key}" "${remote_host}" sudo snapshot-receive "${tag_dest}";
}

Particularly of note is the filepath_last_sent variable - this is set to the path to that text file I mentioned earlier.

Other than that it's all pretty well commented, so let's continue on. Next, we need to determine the name of the latest snapshot:

latest_snapshot="$(list_snapshots | sort | tail -n1)";
latest_snapshot_name="$(basename "${latest_snapshot}")";

With this information in hand we can compare it to the last snapshot name we sent. We store this in the text file mentioned above - the path to which is stored in the filepath_last_sent variable.

if [[ "$(dirname "${latest_snapshot_dirname}")" == "$(cat "${filepath_last_sent}")" ]]; then
    if [[ -z "${FORCE_SEND}" ]]; then
        echo "We've sent the latest snapshot '${latest_snapshot_dirname}' already and the FORCE_SEND environment variable is empty or not specified, skipping";
        exit 0;
    else
        echo "We've sent it already, but sending it again since the FORCE_SEND environment variable is specified";
    fi
fi

If the latest snapshot has the same name as the one we last sent, we exit out - unless the FORCE_SEND environment variable is specified (to allow for an easy way to fix stuff if it goes wrong on the other end).

Now, we can actually send the snapshot to the remote:


if ! have_sent; then
    log_msg "Sending initial snapshot $(dirname "${latest_snapshot}")";
    btrfs send "${latest_snapshot}" | do_ssh;
else
    parent_snapshot="${dir_source}/$(last_sent)";
    if [[ ! -d "${parent_snapshot}" ]]; then
        echo "Error: Failed to locate parent snapshot at '${parent_snapshot}'" >&2;
        exit 3;
    fi

    log_msg "Sending incremental snapshot $(dirname "${latest_snapshot}") parent $(last_sent)";
    btrfs send -p "${parent_snapshot}" "${latest_snapshot}" | do_ssh;
fi

have_sent simply determines if we have previously sent a snapshot before. We know this by checking the filepath_last_sent text file.

If we haven't, then we send a full snapshot rather than an incremental one. If we're sending an incremental one, then we find the parent snapshot (i.e. the one we last sent). If we can't find it, we generate an error (it's because of this that you need to store at least 2 snapshots at a time with btrfs-snapshot-rotation).

After sending a snapshot, we need to update the filepath_last_sent text file:

log_msg "Updating state information";
basename "${latest_snapshot}" >"${filepath_last_sent}";
log_msg "Snapshot sent successfully";

....and that concludes snapshot-send.sh! Once you've finished reading this blog post and testing your setup, put your snapshot-send.sh calls in a script in /etc/cron.daily or something.
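
For example, a wrapper script in /etc/cron.daily might look like the following. Note that on Debian-based systems run-parts ignores filenames containing dots, so don't give the script a .sh extension - and the paths / tags here are the same illustrative ones as before:

#!/usr/bin/env bash
# /etc/cron.daily/backup-nas-snapshots - don't forget to chmod +x it

/root/backups/snapshot-send.sh /mnt/some_btrfs_filesystem/main/.snapshots daily main-daily \
    /root/backups/ssh_key_backup_nas_ed25519 backups@backup-nas.local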

snapshot-receive.sh

Next up is the receiving end of the system. The CLI for this script is much simpler, on account of sudo rules only allowing exact and specific commands (no wildcards or regex of any kind). I put snapshot-receive.sh in /usr/local/sbin and called it snapshot-receive.

Let's get started:

#!/usr/bin/env bash

# This script wraps btrfs receive so that it can be called by non-root users.
# It should be saved to '/usr/local/sbin/snapshot-receive' (without quotes, of course).
# The following entry needs to be put in the sudoers file:
# 
# %backup-senders   ALL=(ALL) NOPASSWD: /usr/local/sbin/snapshot-receive TAG_NAME
# 
# ....replacing TAG_NAME with the name of tag you want to allow. You'll need 1 line in your sudoers file per tag you want to allow.
# Edit your sudoers file like this:
# sudo visudo

# The ABSOLUTE path to the target directory to receive to.
target_dir="CHANGE_ME";

# The maximum number of backups to keep.
max_backups="7";

# Allow only alphanumeric characters, hyphens, and underscores in the tag
tag="$(echo "${1}" | tr -cd '[:alnum:]-_')";

snapshot-receive.sh only takes a single argument, and that's the tag it should use for the snapshot being received:


sudo snapshot-receive DEST_TAG_NAME

The target directory it should save snapshots to is stored as a variable at the top of the file (the target_dir there). You should change this based on your specific setup. It goes without saying, but the target directory needs to be a directory on a btrfs filesystem (preferably raid1, though as I've said before btrfs raid1 is a misnomer). We also ensure that the tag contains only safe characters for security.

max_backups is the maximum number of snapshots to keep. Any older snapshots will be deleted.

Next, some error handling:

###############################################################################

# $EUID = effective uid
if [[ "${EUID}" -ne 0 ]]; then
    echo "Error: This script must be run as root (currently running as effective uid ${EUID})" >&2;
    exit 5;
fi

if [[ -z "${tag}" ]]; then
    echo "Error: No tag specified. It should be specified as the 1st and only argument, and may only contain alphanumeric characters." >&2;
    echo "Example:" >&2;
    echo "    snapshot-receive TAG_NAME_HERE" >&2;
    exit 4;
fi

Nothing too exciting. Continuing on, a pair of useful helper functions:


###############################################################################

## Logs a message to stdout.
# $*    The message to log.
log_msg() {
    echo "[ $(date +%Y-%m-%dT%H:%M:%S) ] remote/${HOSTNAME}: >>> ${*}";
}

list_backups() {
    find "${target_dir}/${tag}" -maxdepth 1 ! -path "${target_dir}/${tag}" -type d;
}

list_backups lists the snapshots with the given tag, and log_msg logs messages to stdout (not stderr unless there's an error, because otherwise cronic will dutifully send you an email every time the scripts execute). Next up, more error handling:

###############################################################################

if [[ "${target_dir}" == "CHANGE_ME" ]]; then
    echo "Error: target_dir was not changed from the default value." >&2;
    exit 1;
fi

if [[ ! -d "${target_dir}" ]]; then
    echo "Error: No directory was found at '${target_dir}'." >&2;
    exit 2;
fi

if [[ ! -d "${target_dir}/${tag}" ]]; then
    log_msg "Creating new directory at ${target_dir}/${tag}";
    mkdir "${target_dir}/${tag}";
fi

We check:

  • That the target directory was changed from the default CHANGE_ME value
  • That the target directory exists

We also create a subdirectory for the given tag if it doesn't exist already.

With the preamble completed, we can actually receive the snapshot:

log_msg "Launching btrfs in chroot mode";

time nice ionice -c Idle btrfs receive --chroot "${target_dir}/${tag}";

We use nice and ionice to reduce the priority of the receive to the lowest possible level. If you're using a Raspberry Pi (I have a Raspberry Pi 4 with 4GB RAM) like I am, this is important for stability (Pis tend to fall over otherwise). Don't worry if you experience some system crashes on your Pi when transferring the first snapshot - I've found that incremental snapshots don't cause the same issue.

We also use the chroot option there for increased security.

Now that the snapshot is transferred, we can delete old snapshots if we have too many:

backups_count="$(echo -e "$(list_backups)" | wc -l)";

log_msg "Btrfs finished, we now have ${backups_count} backups:";
list_backups;

while [[ "${backups_count}" -gt "${max_backups}" ]]; do
    oldest_backup="$(list_backups | sort | head -n1)";
    log_msg "Maximum number backups is ${max_backups}, requesting removal of backup for $(dirname "${oldest_backup}")";

    btrfs subvolume delete "${oldest_backup}";

    backups_count="$(echo -e "$(list_backups)" | wc -l)";
done

log_msg "Done, any removed backups will be deleted in the background";

Sorted! The only thing left to do here is to setup those sudo rules. Let's do that now. Execute sudoedit /etc/sudoers, and enter the following:

%backup-senders ALL=(ALL) NOPASSWD: /usr/local/sbin/snapshot-receive TAG_NAME

Replace TAG_NAME with the DEST_TAG_NAME you're using. You'll need 1 entry in /etc/sudoers for each DEST_TAG_NAME you're using.
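
For example, if you were receiving 2 snapshot rotations with the (hypothetical) destination tags main-daily and photos-daily, you'd end up with:

%backup-senders ALL=(ALL) NOPASSWD: /usr/local/sbin/snapshot-receive main-daily
%backup-senders ALL=(ALL) NOPASSWD: /usr/local/sbin/snapshot-receive photos-daily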

We assign the rights to the backup-senders group we created earlier, of which the user we are going to SSH in with is a member. This makes the system more flexible should we want to extend it later.

Warning: A mistake in /etc/sudoers can leave you unable to use sudo! Make sure you have a root shell open in the background and that you test sudo again after making changes to ensure you haven't made a mistake.

That completes the setup of snapshot-receive.sh.

Conclusion

With snapshot-send.sh and snapshot-receive.sh, we now have a system for transferring snapshots from 1 host to another via SSH. If combined with full disk encryption (e.g. with LUKS), this provides a secure backup system with a number of desirable qualities:

  • The main NAS can't access the backups on the backup NAS (in case of ransomware)
  • Backups are encrypted during transfer (via SSH)
  • Backups are encrypted at rest (LUKS)

To further secure the backup NAS, one could:

  • Disable SSH password login
  • Automatically start / shutdown the backup NAS (though with full disk encryption when it boots up it would require manual intervention)
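
For the first of those, a couple of directives in /etc/ssh/sshd_config on the backup NAS (followed by a restart of the ssh service) is all it takes - a minimal sketch:

PasswordAuthentication no
PermitRootLogin prohibit-password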

At the bottom of this post I've included the full scripts for you to copy and paste.

As it turns out, there will be 1 more post in this series, which will cover generating multiple streams of backups (e.g. weekly, monthly) from a single stream of e.g. daily backups on my backup NAS.

Sources and further reading

Full scripts

snapshot-send.sh

#!/usr/bin/env bash
set -e;

dir_source="${1}";
tag_source="${2}";
tag_dest="${3}";
loc_ssh_key="${4}";
remote_host="${5}";

if [[ -z "${remote_host}" ]]; then
    echo "This script sends btrfs snapshots to a remote host via SSH.
The script snapshot-receive must be present on the remote host in the PATH for this to work.
It pairs well with btrfs-snapshot-rotation: https://github.com/mmehnert/btrfs-snapshot-rotation
Usage:
    snapshot-send.sh <snapshot_dir> <source_tag_name> <dest_tag_name> <ssh_key> <user@example.com>

Where:
    <snapshot_dir> is the path to the directory containing the snapshots
    <source_tag_name> is the tag name to look for (see btrfs-snapshot-rotation).
    <dest_tag_name> is the tag name to use when sending to the remote. This must be unique across all snapshot rotations sent.
    <ssh_key> is the path to the ssh private key
    <user@example.com> is the user@host to connect to via SSH" >&2;
    exit 0;
fi

# $EUID = effective uid
if [[ "${EUID}" -ne 0 ]]; then
    echo "Error: This script must be run as root (currently running as effective uid ${EUID})" >&2;
    exit 5;
fi

if [[ ! -e "${loc_ssh_key}" ]]; then
    echo "Error: When looking for the ssh key, no file was found at '${loc_ssh_key}' (have you checked the spelling and file permissions?)." >&2;
    exit 1;
fi
if [[ ! -d "${dir_source}" ]]; then
    echo "Error: No source directory located at '${dir_source}' (have you checked the spelling and permissions?)" >&2;
    exit 2;
fi

###############################################################################

# The filepath to the last sent text file that contains the name of the snapshot that was last sent to the remote.
# If this file doesn't exist, then we send a full snapshot to start with.
# We need to keep track of this because we need this information to know which
# snapshot we need to parent the latest snapshot from to send snapshots incrementally.
filepath_last_sent="${dir_source}/last_sent_@${tag_source}.txt";

## Logs a message to standard output.
# $*    The message to log.
log_msg() {
    echo "[ $(date +%Y-%m-%dT%H:%M:%S) ] remote/${HOSTNAME}: >>> ${*}";
}

## Lists all the currently available snapshots for the current source tag.
list_snapshots() {
    find "${dir_source}" -maxdepth 1 ! -path "${dir_source}" -name "*@${tag_source}" -type d;
}

## Returns an exit code of 0 if we've sent a snapshot, or 1 if we haven't.
have_sent() {
    if [[ ! -f "${filepath_last_sent}" ]]; then
        return 1;
    else
        return 0;
    fi
}

## Fetches the directory name of the last snapshot sent to the remote with the given tag name.
last_sent() {
    if [[ -f "${filepath_last_sent}" ]]; then
        cat "${filepath_last_sent}";
    fi
}

do_ssh() {
    ssh -o "ServerAliveInterval=900" -i "${loc_ssh_key}" "${remote_host}" sudo snapshot-receive "${tag_dest}";
}

latest_snapshot="$(list_snapshots | sort | tail -n1)";
latest_snapshot_name="$(basename "${latest_snapshot}")";

if [[ "${latest_snapshot_name}" == "$(last_sent)" ]]; then
    if [[ -z "${FORCE_SEND}" ]]; then
        echo "We've sent the latest snapshot '${latest_snapshot_name}' already and the FORCE_SEND environment variable is empty or not specified, skipping";
        exit 0;
    else
        echo "We've sent it already, but sending it again since the FORCE_SEND environment variable is specified";
    fi
fi

if ! have_sent; then
    log_msg "Sending initial snapshot $(dirname "${latest_snapshot}")";
    btrfs send "${latest_snapshot}" | do_ssh;
else
    parent_snapshot="${dir_source}/$(last_sent)";
    if [[ ! -d "${parent_snapshot}" ]]; then
        echo "Error: Failed to locate parent snapshot at '${parent_snapshot}'" >&2;
        exit 3;
    fi

    log_msg "Sending incremental snapshot $(dirname "${latest_snapshot}") parent $(last_sent)";
    btrfs send -p "${parent_snapshot}" "${latest_snapshot}" | do_ssh;
fi


log_msg "Updating state information";
basename "${latest_snapshot}" >"${filepath_last_sent}";
log_msg "Snapshot sent successfully";

snapshot-receive.sh

#!/usr/bin/env bash

# This script wraps btrfs receive so that it can be called by non-root users.
# It should be saved to '/usr/local/sbin/snapshot-receive' (without quotes, of course).
# The following entry needs to be put in the sudoers file:
# 
# %backup-senders   ALL=(ALL) NOPASSWD: /usr/local/sbin/snapshot-receive TAG_NAME
# 
# ....replacing TAG_NAME with the name of the tag you want to allow. You'll need 1 line in your sudoers file per tag you want to allow.
# Edit your sudoers file like this:
# sudo visudo

# The ABSOLUTE path to the target directory to receive to.
target_dir="CHANGE_ME";

# The maximum number of backups to keep.
max_backups="7";

# Allow only alphanumeric characters, dashes, and underscores in the tag
tag="$(echo "${1}" | tr -cd '[:alnum:]-_')";

###############################################################################

# $EUID = effective uid
if [[ "${EUID}" -ne 0 ]]; then
    echo "Error: This script must be run as root (currently running as effective uid ${EUID})" >&2;
    exit 5;
fi

if [[ -z "${tag}" ]]; then
    echo "Error: No tag specified. It should be specified as the 1st and only argument, and may only contain alphanumeric characters." >&2;
    echo "Example:" >&2;
    echo "    snapshot-receive TAG_NAME_HERE" >&2;
    exit 4;
fi

###############################################################################

## Logs a message to standard output.
# $*    The message to log.
log_msg() {
    echo "[ $(date +%Y-%m-%dT%H:%M:%S) ] remote/${HOSTNAME}: >>> ${*}";
}

list_backups() {
    find "${target_dir}/${tag}" -maxdepth 1 ! -path "${target_dir}/${tag}" -type d;
}

###############################################################################

if [[ "${target_dir}" == "CHANGE_ME" ]]; then
    echo "Error: target_dir was not changed from the default value." >&2;
    exit 1;
fi

if [[ ! -d "${target_dir}" ]]; then
    echo "Error: No directory was found at '${target_dir}'." >&2;
    exit 2;
fi

if [[ ! -d "${target_dir}/${tag}" ]]; then
    log_msg "Creating new directory at ${target_dir}/${tag}";
    mkdir "${target_dir}/${tag}";
fi

log_msg "Launching btrfs in chroot mode";

time nice ionice -c Idle btrfs receive --chroot "${target_dir}/${tag}";

backups_count="$(echo -e "$(list_backups)" | wc -l)";

log_msg "Btrfs finished, we now have ${backups_count} backups:";
list_backups;

while [[ "${backups_count}" -gt "${max_backups}" ]]; do
    oldest_backup="$(list_backups | sort | head -n1)";
    log_msg "Maximum number backups is ${max_backups}, requesting removal of backup for $(dirname "${oldest_backup}")";

    btrfs subvolume delete "${oldest_backup}";

    backups_count="$(echo -e "$(list_backups)" | wc -l)";
done

log_msg "Done, any removed backups will be deleted in the background";

PhD, Update 9: Results?

Hey, it's time for another PhD update blog post! Since last time, I've been working on tweet classification mostly, but I have a small bit to update you on with the Temporal CNN.

Here's a list of posts in this series so far before we continue. If you haven't already, you should read them, as they give some context for what I'll be talking about in this post.

Note that the results and graphs in this blog post are not final, and may change as I further evaluate the performance of the models I've trained (for example I recently found and fixed a bug where I misclassified all tweets without emojis as positive tweets by mistake).

Tweet classification

After trying a bunch of different tweaks and variants of models, I now have an AI model that classifies tweets. It's not perfect, but I am reaching ~80% validation accuracy.

The model itself is trained to predict the sentiment of a tweet, with the label derived from a preset list of emojis I compiled manually (by looking through a list of the emojis present in the dataset itself, which I extracted with a quick Bash one-liner). Each tweet is labelled by counting how many emojis from each category it contains; the category with the most emojis is then chosen as the label to predict. Here's the list of emojis I've compiled:

Category Emojis
positive ๐Ÿ˜‚๐Ÿ‘๐Ÿคฃโค๐Ÿ‘๐Ÿ˜Š๐Ÿ˜‰๐Ÿ˜๐Ÿ˜˜๐Ÿ˜๐Ÿค—๐Ÿ’•๐Ÿ˜€๐Ÿ’™๐Ÿ’ชโœ…๐Ÿ™Œ๐Ÿ’š๐Ÿ‘Œ๐Ÿ™‚๐Ÿ˜Ž๐Ÿ˜†๐Ÿ˜…โ˜บ๐Ÿ˜ƒ๐Ÿ˜ป๐Ÿ’–๐Ÿ’‹๐Ÿ’œ๐Ÿ˜น๐Ÿ˜œโ™ฅ๐Ÿ˜„๐Ÿ’›๐Ÿ˜ฝโœ”๐Ÿ‘‹๐Ÿ’—โœจ๐ŸŒน๐ŸŽ‰๐Ÿ‘Š๐Ÿ˜‹๐Ÿ˜๐ŸŒž๐Ÿ˜‡๐ŸŽถโญ๐Ÿ’ž๐Ÿ˜บ๐Ÿ˜ธ๐Ÿ–ค๐ŸŒธ๐Ÿ’๐Ÿ€๐ŸŒผ๐ŸŒŸ๐Ÿค๐ŸŒท๐Ÿฑ๐Ÿค“๐Ÿ˜Œ๐Ÿ˜›๐Ÿ˜™
negative ๐Ÿฅถโš ๐Ÿ’”๐Ÿ˜ฐ๐Ÿ™„๐Ÿ˜ฑ๐Ÿ˜ญ๐Ÿ˜ณ๐Ÿ˜ข๐Ÿ˜ฌ๐Ÿคฆ๐Ÿ˜ก๐Ÿ˜ฉ๐Ÿ™ˆ๐Ÿ˜”โ˜น๐Ÿ˜ฎโŒ๐Ÿ˜ฃ๐Ÿ˜•๐Ÿ˜ฒ๐Ÿ˜ฅ๐Ÿ˜ž๐Ÿ˜ซ๐Ÿ˜Ÿ๐Ÿ˜ด๐Ÿ™€๐Ÿ˜’๐Ÿ™๐Ÿ˜ ๐Ÿ˜ช๐Ÿ˜ฏ๐Ÿ˜จ๐Ÿ‘Ž๐Ÿคข๐Ÿ’€๐Ÿ˜ค๐Ÿ˜๐Ÿ˜–๐Ÿ˜๐Ÿ˜ˆ๐Ÿ˜‘๐Ÿ˜“๐Ÿ˜ฟ๐Ÿ˜ต๐Ÿ˜ง๐Ÿ˜ถ๐Ÿ˜ฆ๐Ÿคฅ๐Ÿค๐Ÿคง๐ŸŒงโ˜”๐Ÿ’จ๐ŸŒŠ๐Ÿ’ฆ๐Ÿšจ๐ŸŒฌ๐ŸŒชโ›ˆ๐ŸŒ€โ„๐Ÿ’ง๐ŸŒจโšก๐Ÿฆ†๐ŸŒฉ๐ŸŒฆ๐ŸŒ‚

These categories are stored in a tab separated values file (TSV), allowing the program I've written that trains new models to be completely customisable. Given that there were a significant number of tweets with flood-related emojis (๐ŸŒงโ˜”๐Ÿ’จ๐ŸŒŠ๐Ÿ’ฆ๐Ÿšจ๐ŸŒฌ๐ŸŒชโ›ˆ๐ŸŒ€โ„๐Ÿ’ง๐ŸŒจโšก๐Ÿฆ†๐ŸŒฉ๐ŸŒฆ๐ŸŒ‚) in my dataset, I tried separating them out into their own class, but it didn't work very well - more on this later.
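
To make that concrete, here's a simplified sketch of the labelling step in Python. The TSV filename and exact file layout here are assumptions for illustration - the real implementation is driven by the categories file described above:

# Sketch of the emoji-based labelling described above. Each line of the
# (assumed) categories file is "category_name<TAB>emojis".
def load_categories(filepath="emoji_categories.tsv"):
    categories = {}
    with open(filepath, encoding="utf-8") as handle:
        for line in handle:
            name, emojis = line.rstrip("\n").split("\t")
            categories[name] = set(emojis)
    return categories

def label_tweet(text, categories):
    # Count how many emojis from each category the tweet contains
    counts = {name: sum(text.count(emoji) for emoji in emojis)
              for name, emojis in categories.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None  # None: no relevant emoji, tweet excluded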

The dataset I've downloaded comes from a number of different queries and hashtags, and is comprised of just under 1.4 million tweets, of which ~90K tweets have at least 1 emoji in the above categories.

A pair of pie charts showing statistics about the twitter dataset.

Tweets were downloaded with a tool I implemented using a number of flood related hashtags and query strings - from generic ones like #flood to specific ones such as #StormChristoph.

Of the entire dataset, ~6.44% of tweets contain at least 1 emoji. Of those, ~49.57% are positive, and ~50.43% are negative:

Category        Number of tweets
No emoji        1306061
Negative        45367
Positive        44585
Total tweets    1396013

The models I've trained so far are either a transformer or a 2-layer LSTM with 128 units per layer, plus a dense layer with a softmax activation function at the end for classification. If batch normalisation is enabled, every layer except the dense layer is followed by a batch normalisation layer (a short Keras sketch reproducing this structure follows the summary below):

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
bidirectional (Bidirectional (None, 100, 256)          336896    
_________________________________________________________________
batch_normalization (BatchNo (None, 100, 256)          1024      
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               394240    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
dense (Dense)                (None, 2)                 514       
=================================================================
Total params: 733,698
Trainable params: 732,674
Non-trainable params: 1,024
_________________________________________________________________
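
For reference, the LSTM variant summarised above can be reproduced with a short Keras sketch like the one below. The input shape is an assumption inferred from the output shapes in the summary (100 timesteps of 200-dimensional embeddings) - the real input pipeline isn't shown in this post:

import tensorflow as tf

def build_lstm_classifier(seq_length=100, embed_dim=200, units=128, categories=2):
    # Bidirectional LSTM stack with batch normalisation, matching the summary above
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(seq_length, embed_dim)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units, return_sequences=True)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(categories, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

build_lstm_classifier().summary()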

I tried using more units and varying the number of layers, but it didn't help much (in hindsight this was before I found the bug where tweets without emojis were misclassified as positive).

Now for the exciting part! I split the tweets that had at least 1 emoji into 2 datasets: 80% for training, and 20% for validation. I trained a few different models tweaking various different parameters, and got the following results:

2 categories: training accuracy

2 categories: validation accuracy

Each model has a code that is comprised of multiple parts:

  • c2: 2 categories
  • bidi: LSTM, bidirectional wrapper enabled
  • nobidi: LSTM, bidirectional wrapper disabled
  • batchnorm: LSTM, batch normalisation enabled
  • nobatchnorm: LSTM, batch normalisation disabled
  • transformer: Transformer (just the encoder part)

It seems that, for an LSTM, being bidirectional does result in a boost to both stability and performance (performance here meaning how well the model works, rather than how fast it runs - that would be efficiency instead). Batch normalisation doesn't appear to help all that much - I speculate this is because the models I'm training aren't all that deep, and batch normalisation apparently helps most with deep models (e.g. ResNet).

For completeness, here are the respective graphs for 3 categories (positive, negative, and flood):

3 Categories: Training accuracy

3 Categories: Validation accuracy

Unfortunately, 3 categories didn't work quite as well as I'd hoped. To understand why, I plotted some confusion matrices at epochs 11 (left) and 44 (right). For both the 2 and 3 category models I've picked out the charts for bidi-nobatchnorm, as it was the best performing (see above).

2 Categories: confusion matrices

3 Categories: confusion matrices

The 2 category model looks to be reasonably accurate, with no enormous issues, as we'd expect from the ~80% validation accuracy above. For some reason, twice as many negative tweets are misclassified as positive as the other way around - it's possible that negative tweets are more difficult to interpret. Given the noisy dataset (it's social media and natural language, after all), I'm pretty happy with this result.

The 3 category model has fared less brilliantly. Although we can clearly see it has improved over the training process (and would likely continue to do so if I gave it longer than 50 epochs to train), its accuracy is really limited by its tendency to overfit - preventing it from performing as well as the 2 category model (which, incidentally, also overfits).

Although it classifies flood tweets fairly accurately, it seems to struggle more with positive and negative tweets. This leads me to suspect that training a model on flood / not flood might be worth trying - though as always I'm limited by a lack of data (I have only ~16K tweets that contain a flood emoji). Perhaps searching Twitter for flood emojis directly would help gather additional data here.

Now that I have a model that works, I'm going to move on to answering some research questions with it. To do so, I've upgraded my twitter-academic-downloader to download information about attached media - downloading the attached media itself deserves its own blog post. After that, I'll be looking to publish my first ever scientific papery thing (I don't yet know the difference between journal articles and other types of publication), so that is sure to be an adventure.

Temporal CNN(?)

In Temporal CNN land, things have been much slower. In my latest attempt to fix the accuracy issues I've described in detail in previous posts, I've implemented a vanilla convolutional autoencoder by following this tutorial (the links to the full code behind the tutorial are broken, which is really annoying). It does image-to-image translation, so it should be ideal to modify to support my use-case. After implementing it with Tensorflow.js, I hit a snag:

Autoencoder: expected vs actual

After trying all the loss functions I could think of and all sorts of other variations of the model, for sanity's sake I tried implementing the model in Python (the original language used in the blog post). I originally didn't want to do this, since all the data preprocessing code I've written so far is in Javascript / Node.js, but within half an hour of copying and pasting I had a very quick model implemented. After training a teeny model for just 7 epochs on the Fashion MNIST dataset, I got this:
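
For reference, a vanilla convolutional autoencoder in Keras for 28x28 greyscale images looks roughly like the sketch below. The layer sizes are my own illustrative assumptions, not the exact model from the tutorial or the one I trained:

import tensorflow as tf

# Load Fashion MNIST and scale pixel values to [0, 1]
(x_train, _), (x_test, _) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.astype("float32")[..., None] / 255.0
x_test = x_test.astype("float32")[..., None] / 255.0

autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    # Encoder: downsample to a small latent representation
    tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),
    # Decoder: upsample back to the original resolution
    tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, 3, padding="same", activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# The model learns to reconstruct its own input (input == target)
autoencoder.fit(x_train, x_train, epochs=7, validation_data=(x_test, x_test))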

Autoencoder: epoch 7 in Tensorflow for Python

...so clearly there's something different in Tensorflow.js compared to Tensorflow for Python, but after comparing the two implementations dozens of times, I'm not sure I'll be able to spot the difference. It's a shame to give up on the Tensorflow.js version (it also has a very nice CLI I implemented), but given how long I've spent on this problem in the background, I'm going to move forwards with the Python version. To that end, I'm going to work on converting my existing data preprocessing code into a standalone CLI that converts the data into a format I can consume in Python with minimal effort.

Once that's done, I'm going to expand the quick autoencoder Python script I implemented into a proper CLI program (roughly sketched below the list). This involves adding:

  • A CLI argument parser (argparse is my weapon of choice here for now in Python)
  • A settings system that has a default and a custom (TOML) config file
  • Saving multiple pieces of data to the output directory:
    • A summary of the model structure
    • The settings that are currently in use
    • A checkpoint for each epoch
    • A TSV-formatted metrics file
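
For illustration, here's a rough sketch of the shape such a CLI might take - all of the option names and defaults here are hypothetical:

import argparse
import tomllib  # Python 3.11+; the third-party toml package works on older versions

def parse_args():
    parser = argparse.ArgumentParser(description="Train the convolutional autoencoder.")
    parser.add_argument("--config", default=None,
        help="Path to a custom TOML config file that overrides the defaults.")
    parser.add_argument("--output", required=True,
        help="Directory to write the model summary, settings, checkpoints, and metrics TSV to.")
    return parser.parse_args()

def load_settings(filepath_custom=None):
    # Default settings, overridden by any keys present in the custom TOML file
    settings = {"epochs": 50, "batch_size": 32}
    if filepath_custom is not None:
        with open(filepath_custom, "rb") as handle:
            settings.update(tomllib.load(handle))
    return settings

if __name__ == "__main__":
    args = parse_args()
    settings = load_settings(args.config)
    print(settings)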

The existing script also spits out a sample image (such as the one above - just without the text) for each epoch too, which I've found to be very useful indeed. I'll definitely be looking into generating more visualisations on-the-fly as the model is training.

All of this will take a while (especially since this isn't the focus right now), so I'm a bit unsure if I'll get all of this done by the time I write the next post. If this plan works, then I'll probably have to come up with a new name for this model, since it's not really a Temporal CNN. Suggestions are welcome - my current thought is maybe "Temporal Convolutional Autoencoder".

Conclusion

Since the last post I've made on here, I've made much more progress than I was expecting to. I've now got a useful(?) model for classifying tweets, which I'm going to use to answer some research questions (more on that in the next post - this one is long enough already haha). I've also got an autoencoder training properly for image-to-image translation, as a step towards getting 2D flood predictions working.

Looking forwards, I'm going to be answering research questions with my tweet classification model and preparing to write something to publish, and expanding on the Python autoencoder for more direct flood prediction.

Curious about the code behind any of these models and would like to check them out? Please get in touch. While I'm not sure I can share the tweet classifier itself yet (though I do plan on releasing it as open-source after a discussion with my supervisor when I publish), I can probably share my various autoencoder implementations, and I can absolutely give you a bunch of links to things that I've used and found helpful.

If you've found this interesting or have some suggestions, please comment below. It's really motivating to hear that what I'm doing is actually interesting / useful to people.

Sources and further reading

Tensorflow / Tensorflow.js in Review

For my PhD, I've been using both Tensorflow.js (Tensorflow for Javascript) and more recently Tensorflow for Python (including the bundled Keras) extensively for implementing multiple different models. Given the experiences I've had so far, I thought it was high time I put my thoughts to paper so to speak and write a blog post reviewing the 2 frameworks.

Tensorflow logo

Tensorflow for Python

Let's start with Tensorflow for Python. I haven't been using it as long as Tensorflow.js, but as far as I can tell they've done a great job of ensuring it comes with batteries included. It has layers that come in an enormous number of different flavours for doing everything you can possibly imagine - including building Transformers (though I ended up implementing the time signal encoding in my own custom layer).

Building custom layers is not particularly difficult either - though you do have to hunt around a bit for the correct documentation, and I haven't yet worked out all the bugs with loading model checkpoints that use custom layers back in again.
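
For context, a minimal custom layer in Keras looks something like the sketch below - this is an illustrative layer, not my actual time signal encoding layer. Checkpoints containing custom layers then need the custom_objects argument when loaded back in with tf.keras.models.load_model():

import tensorflow as tf

class ScaleShift(tf.keras.layers.Layer):
    """Illustrative custom layer: learns a per-feature scale and shift."""

    def build(self, input_shape):
        # Weights are created lazily here, once the input shape is known
        self.scale = self.add_weight(name="scale", shape=(input_shape[-1],),
            initializer="ones", trainable=True)
        self.shift = self.add_weight(name="shift", shape=(input_shape[-1],),
            initializer="zeros", trainable=True)

    def call(self, inputs):
        return inputs * self.scale + self.shift

# Loading a saved model that uses the custom layer:
# model = tf.keras.models.load_model("checkpoint.hdf5", custom_objects={"ScaleShift": ScaleShift})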

Handling data as a generic "tensor" that contains an n-dimensional slab of data is - once you get used to it - a great way of working. It's not something I would recommend to the beginner, however - rather, I would recommend checking out Brain.js. It's easier to set up, and also more transparent / easier to understand what's going on.

Data preprocessing, however, is where things start to get complicated. Despite a good set of API reference docs to refer to, it's not clear how one is supposed to implement a performant data preprocessing pipeline. There are multiple methods for doing this (tf.data.Dataset, tf.keras.utils.Sequence, and others), and I have as yet been unable to find a definitive guide on the subject.
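
For what it's worth, the tf.data.Dataset route ends up looking something like the sketch below (my own rough example rather than an officially recommended pattern) - it's knowing which of the available approaches to prefer, and how they interact with things like parallelism and batching, that the docs leave unclear:

import tensorflow as tf

def preprocess(features, label):
    # Placeholder transformation - normalise the features
    return tf.cast(features, tf.float32) / 255.0, label

def make_dataset(features, labels, batch_size=32):
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.shuffle(buffer_size=10_000)
    dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(tf.data.AUTOTUNE)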

Other small inconsistencies are also present, such as the Keras website and the Tensorflow API docs both documenting the Keras API, which in and of itself appears to be an abstraction of the Tensorflow API.... it gets confusing. Some love for the docs more generally is also needed, as I found some of the wording in places ambiguous as to what it meant - so I ended up guessing and working it out by experimentation.

By far the biggest issue I encountered though (aside from the data preprocessing pipeline, which is really confusing and frustrating) is that each version of Tensorflow requires a highly specific version of CUDA. Thankfully, there's a table of CUDA / cuDNN versions to help you out, but it's still pretty annoying that you have to have a specific version. Blender manages to be CUDA-enabled while supporting enough different versions of CUDA that I haven't had an issue on stock Ubuntu with the proprietary Nvidia drivers and the system CUDA version, so why can't Tensorflow do the same?

Tensorflow.js

This brings me on to Tensorflow.js, the Javascript bindings for libtensorflow (the underlying C++ library). This also has the specific version of CUDA issue, but in addition the version requirement documented in the README is often wrong, leaving you to make random wild guesses as to which version is required!

Despite this flaw, Tensorflow.js fixes a number of inconsistencies in Tensorflow for Python - I suspect this is because it was written after Tensorflow for Python was first implemented. The developers have clearly learnt valuable lessons from the Python version of Tensorflow, which has resulted in a coherent and cohesive API that makes much more sense than the API in Python. A great example of this is tf.data.Dataset: the data preprocessing pipeline in Tensorflow.js is well designed and easy to use, and the Python version could learn a lot from it.

While Tensorflow.js doesn't have quite the same set of features (a number of prebuilt layers that exist in Python don't yet exist in Tensorflow.js), it still provides a reasonable set that satisfies most use-cases. I have noticed a few annoying inconsistencies in the loss functions and how they behave, though - for example, I implemented an autoencoder in Tensorflow.js, but it only returned black and white pixels, whereas Tensorflow for Python returned greyscale as intended.

Aside from improving CUDA support and adding more prebuilt layers, the other thing Tensorflow.js struggles with is documentation. It has comparatively few guides compared to Tensorflow for Python, and it also has an ambiguity problem in the API docs. This could mostly be resolved by being slightly more generous with explanations of what things are and do. Adding some examples to the docs in question would also help - as would fixing the bug where the navigation pane doesn't highlight the item you're currently viewing (DevDocs support would be awesome too).

Conclusion

Tensorflow for Python and Tensorflow.js are feature-filled and performant frameworks for machine learning and for processing large datasets with GPU acceleration. Many tutorials are provided to help newcomers, but once you've followed a tutorial or 2 you're very much left on your own. A number of caveats and difficulties, such as CUDA version requirements and confusing APIs / docs, make mastering the frameworks difficult.
