Starbeamrainbowlabs


4287 Reasons why your comments weren't posted

I don't get a lot of real comments on here from what I can tell, as you've probably noticed. I don't particularly mind (though it's always awesome when I do get one!) - but what I do mind is the spam. Since February 2015, I've received 4287 spam comments. 4287! It's actually quite silly, when you think about it.

The other day I was fiddling with the code behind this blog (posts about that coming soon!), and I discovered that ages ago I implemented a log that records each and every spammer, along with the mistake they made. I thought I'd share some statistics here, and some tips for dealing with spam yourself (I've posted about tactics before here).

Here's an extract from my logs (full logs available on request):

[ Sun, 02 Apr 2017 23:17:25 +0100] invalid comment | ip: 94.181.153.194 | name: ghkkll | articlepath: posts/120-Cpu-Registers.html | mistake: shortcomment
[ Mon, 03 Apr 2017 02:16:58 +0100] invalid comment | ip: 191.96.242.17 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey
[ Mon, 03 Apr 2017 02:16:58 +0100] invalid comment | ip: 191.96.242.17 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey
[ Mon, 03 Apr 2017 02:16:59 +0100] invalid comment | ip: 191.96.242.17 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey
[ Mon, 03 Apr 2017 02:16:59 +0100] invalid comment | ip: 191.96.242.17 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey

Since the output format I chose is nice and regular, I could use a quick bit of bash magic to whip up some statistics:

cat failedcomments.log | sed -e 's/^.*mistake\: //' | grep -iv '\[' | sort | uniq -c | sort -nr

(explanation, courtesy of explainshell.com)

That gave me this output:

Count Reason
3922 invalidkey
148 website
67 nokey
67 noarticleid
56 noname
17 shortcomment
4 badarticlepath
3 longcomment
2 invalidemail
1 shortname
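
If you'd like percentages alongside the counts, the same kind of pipeline can be wrapped in a small function. This is a rough sketch rather than the exact code I ran: tally_mistakes is a made-up helper name, and it assumes every line ends with a "mistake: reason" field, as in the log extract above.

```shell
# Tally the "mistake:" field of each log line and print "count share reason",
# most frequent first. tally_mistakes is a hypothetical helper name, and the
# log format is assumed to match the extract shown in this post.
tally_mistakes() {
    # -n plus the p flag prints only the lines where the substitution matched
    sed -n 's/^.*mistake: //p' "$1" |
        sort | uniq -c | sort -k1,1nr -k2 |
        awk '{ count[NR] = $1; reason[NR] = $2; total += $1 }
             END {
                 for (i = 1; i <= NR; i++)
                     printf "%d %.2f%% %s\n", count[i], 100 * count[i] / total, reason[i]
             }'
}

# Usage: tally_mistakes failedcomments.log
```

Run against the five-line extract above, this should report invalidkey on 4 lines and shortcomment on 1.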

My first thought was along the lines of "wow, that's a lot of spam", my second "that comment key is working really well!", and my third "I didn't realise how helpful that fake website field is". Still, though I had a table, I thought a visualisation might help to put things into perspective.

A pair of pie charts showing a breakdown of which mistakes spammers made most often. Explained below.

There - much better :D As you might have suspected if you've been following my blog here for a while, having an invalid comment key is the most common mistake spammers make.

The comment key is a hidden field in the comment form that is actually a transformed timestamp of the time you loaded the page. Working it backwards, I can work out how long it took you to submit a comment from first loading the page.
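
To make the idea concrete, here's a minimal sketch of such a key. The transform here (reversing the digits of the Unix timestamp) is purely an assumption for illustration and not the one this blog actually uses, and the function names are made up too.

```shell
# Hypothetical transform: the Unix timestamp with its digits reversed.
# The real transform behind this blog's comment key is not shown here.

# Reverse the characters of a string (uses awk to avoid depending on `rev`).
reverse_string() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

# Print a comment key for a page loaded at the given Unix time (default: now).
make_comment_key() {
    reverse_string "${1:-$(date +%s)}"
}

# Print how many seconds passed between page load and submission.
# $1 is the submitted key, $2 the submission time (default: now).
# Fails with status 1 when the key is empty or contains non-digits.
comment_key_age() {
    case $1 in
        '' | *[!0-9]*) return 1 ;;
    esac
    # awk does the subtraction so a leading zero is not read as octal
    awk -v loaded="$(reverse_string "$1")" -v now="${2:-$(date +%s)}" \
        'BEGIN { printf "%d\n", now - loaded }'
}
```

The server would then reject any submission whose key fails to decode, or whose age is implausibly small. What counts as "implausibly small" is a choice for whoever implements it.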

Using a companion log file to the one that I generated the above pie chart from, I've calculated that 3558 potential comments were submitted within 10 seconds of loading the page! No ordinary human is that fast (especially considering you probably want to read the article before commenting!) - they have to be bots. Here's a graph to illustrate the dropoff (the time is in seconds):

A graph illustrating how many bots commented on my blog unnaturally fast.

Out of the other reasons that people failed, "website" was the second most common mistake, with ~3.45% of spammers getting caught out by it. This mistake refers to another of my little spam traps - defense in depth is always good! This particular one is a regular website address field, which is hidden via some fancy CSS. Curious, I decided to investigate further - and what I found was fascinating.

About 497 spammers entered an invalid website address (i.e. one that doesn't start with http) into the website box, 90 of whom decided that "seo plugin" was a brilliant thing to fill it with! I really can't understand this, since the field has a (hidden) label and an appropriate name and type to match. It's important to note that spammers who got caught by the invalid comment key filter above are included in these statistics. Here's the bash command I used:

grep -i '"website"' rawsubmits.jsonlog  | sed -e 's/^.*"website": "//' -e 's/",//' -e 's/\\\///' | uniq | egrep -iv '^http' | wc -l

Other examples include "watch live sports free", "Samantha", "just click the following web site", long strings of HTML-encoded Unicode characters (Japanese I think, after decoding one), and more. Perfectly baffling, if you ask me (if you can shed some light on this one, please comment below!).
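
For anyone wanting to dig through their own logs in the same way, something like the following would rank the most common bogus values. It's a sketch under a couple of assumptions: top_bad_websites is a made-up name, and each submission's field is assumed to sit on its own line as "website": "value", which is what the sed expressions in the command above imply. Unlike that command, it sorts before calling uniq, so duplicates are counted even when they aren't adjacent.

```shell
# Print the most common non-http values of the "website" field as
# "count value", most frequent first. top_bad_websites is a hypothetical
# helper; the one-field-per-line log layout is an assumption.
top_bad_websites() {
    grep -i '"website"' "$1" |
        sed -e 's/^.*"website": "//' -e 's/",//' |
        grep -iv '^http' |
        sort | uniq -c | sort -nr |
        sed 's/^ *//'    # trim the padding uniq -c adds before each count
}

# Usage: top_bad_websites rawsubmits.jsonlog | head
```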

56 spambots forgot their own name. This could be because the box you put your name in below has a name of 'name', but an id of 'namebox' - which may have caused some confusion for some of the more stupid bots.

After all that, there were 3 long comments (probably a bunch of word salad), 2 invalid email addresses that weren't caught by any filters above, and 1 short name (under 3 characters).

That's about it for this impromptu analysis of my comments log! This took far longer than I thought it would to type up. Did you find it interesting? Thinking of putting some of these techniques into practice yourself? Comment below!
