4287 Reasons why your comments weren't posted
I don't get a lot of real comments on here from what I can tell, as you've probably noticed. I don't particularly mind (though it's always awesome when I do get one!) - but what I do mind is the spam. Since February 2015, I've gotten 4287 spam comments. 4287! It's actually quite silly, when you think about it.
The other day I was fiddling with the code behind this blog (posts about that coming soon!), and I discovered that ages ago I implemented a log that records each and every spammer and the mistake they made. I thought I'd share some statistics here, along with some tips for dealing with spam yourself (I've posted about tactics before here).
Here's an extract from my logs (full logs available on request):
[ Sun, 02 Apr 2017 23:17:25 +0100] invalid comment | ip: 126.96.36.199 | name: ghkkll | articlepath: posts/120-Cpu-Registers.html | mistake: shortcomment
[ Mon, 03 Apr 2017 02:16:58 +0100] invalid comment | ip: 188.8.131.52 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey
[ Mon, 03 Apr 2017 02:16:58 +0100] invalid comment | ip: 184.108.40.206 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey
[ Mon, 03 Apr 2017 02:16:59 +0100] invalid comment | ip: 220.127.116.11 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey
[ Mon, 03 Apr 2017 02:16:59 +0100] invalid comment | ip: 18.104.22.168 | name: exercise pants | articlepath: posts/010-Gif-Renderer.html | mistake: invalidkey
Since the output format I chose is nice and regular, I could use a quick bit of bash magic to whip up some statistics:
sed -e 's/^.*mistake: //' failedcomments.log | grep -v '\[' | sort | uniq -c | sort -nr
That gave me this output:
My first thoughts were "wow, that's a lot of spam", then "that comment key is working really well!", and finally "I didn't realise how helpful that fake website field is". Still, though I had a table, I thought a visualisation might help to put things into perspective.
There - much better :D As you might have suspected if you've been following my blog here for a while, having an invalid comment key is the most common mistake spammers make.
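The shares behind a chart like this can be worked out from the same log. What follows is only a rough sketch: it assumes failedcomments.log holds one entry per line and that every mistake name is a single word, and the function name mistake_shares is made up for illustration.

```bash
# Print each mistake with its count and its percentage of all logged mistakes.
# Assumes one log entry per line, each ending in "mistake: <name>".
mistake_shares() {
    # sed -n ... p keeps only the lines that actually contain a mistake field.
    sed -n 's/^.*mistake: //p' "$1" |
        sort | uniq -c | sort -nr |
        awk '{ count[NR] = $1; name[NR] = $2; total += $1 }
             END {
                 for (i = 1; i <= NR; i++)
                     printf "%-14s %6d %6.2f%%\n", name[i], count[i], 100 * count[i] / total
             }'
}
```

Calling mistake_shares failedcomments.log prints one row per mistake, most common first.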
The comment key is a hidden field in the comment form that is actually a transformed timestamp of the time you loaded the page. Working it backwards, I can work out how long it took you to submit a comment from first loading the page.
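To make the idea concrete, here is a minimal sketch of a timestamp-based key. It is not the code this blog runs: the XOR-and-hex transform, the mask value, the 10 second minimum, the "tooquick" label and the function names make_key and check_key are all assumptions chosen for illustration.

```bash
# Hypothetical comment key: the page-load time XORed with a fixed mask, in hex.
# The transform, the mask and the threshold are illustrative assumptions.
KEY_MASK=$(( 0x5F3759DF ))
MIN_SECONDS=10

# Called when the page is rendered; $1 is the load time as a Unix timestamp.
make_key() {
    printf '%x\n' $(( $1 ^ KEY_MASK ))
}

# Called when the form is submitted; $1 is the key, $2 the current Unix time.
check_key() {
    local key="$1" now="$2" loaded_at elapsed
    # Anything that is not a short lowercase hex string cannot be a real key.
    case "$key" in
        ''|*[!0-9a-f]*) echo "invalidkey"; return 1 ;;
    esac
    [ "${#key}" -le 12 ] || { echo "invalidkey"; return 1; }
    # Undo the transform to recover the time the page was loaded.
    loaded_at=$(( 0x$key ^ KEY_MASK ))
    elapsed=$(( now - loaded_at ))
    if [ "$elapsed" -lt "$MIN_SECONDS" ]; then
        echo "tooquick"
        return 1
    fi
    echo "ok"
}
```

A form would embed the output of make_key "$(date +%s)" in a hidden field and run check_key on it at submission. A fixed XOR mask is obfuscation rather than security; it only needs to be more effort than a generic bot will spend.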
Using a companion log file to the one that I generated the above pie chart from, I've calculated that 3558 potential comments were submitted within 10 seconds of loading the page! No ordinary human is that fast (especially considering you probably want to read the article before commenting!), so they have to be bots. Here's a graph to illustrate the dropoff (the time is in seconds):
Out of the other reasons that people failed, "website" was the second most common mistake, with ~3.45% of spammers getting caught out on it. This mistake refers to another of my little spam traps - defense in depth is always good! This particular one is a regular-looking website address field that is hidden via some fancy CSS. Curious, I decided to investigate further - and what I found was fascinating.
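A trap like this needs very little code on the server side. The sketch below assumes that any non-empty value in the hidden field counts as a mistake, which the post does not spell out; the function name check_honeypot is hypothetical.

```bash
# Hypothetical honeypot check: a human never sees the CSS-hidden field,
# so any non-empty value is treated as a bot (an assumed rule).
check_honeypot() {
    local value
    # Strip whitespace so a value made only of spaces still counts as empty.
    value=$(printf '%s' "$1" | tr -d '[:space:]')
    if [ -n "$value" ]; then
        echo "website"
        return 1
    fi
    echo "ok"
}
```

The field works best when it looks ordinary to a bot: a plausible name and type, with only the CSS giving it away.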
About 497 spammers entered an invalid website address (i.e. one that doesn't start with http) into the website box, 90 of whom decided that "seo plugin" was a brilliant thing to fill it with! I really can't understand this, since the box has a (hidden) label and an appropriate name and type to match. It's important to note that spammers who got caught by the invalid comment key filter above are included in these statistics. Here's the bash command I used:
grep -i '"website"' rawsubmits.jsonlog | sed -e 's/^.*"website": "//' -e 's/",//' -e 's/\\\///' | uniq | grep -iv '^http' | wc -l
Other examples include "watch live sports free", "Samantha", "just click the following web site", long strings of HTML-encoded Unicode characters (Japanese I think, after decoding one), and more. Perfectly baffling, if you ask me (if you can shed some light on this one, please comment below!).
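To see which bogus values turn up most often, the same pipeline can be pointed at a tally instead of a total. This is a sketch under an assumption: that rawsubmits.jsonlog stores each field on its own line in the form "website": "value", as the sed expressions above imply. The function name top_bogus_websites is made up.

```bash
# List the most frequent non-URL values of the "website" field.
# $1 is the log file, $2 the number of rows to show (default 10).
top_bogus_websites() {
    grep -i '"website"' "$1" |
        sed -e 's/^.*"website": "//' -e 's/",.*$//' |
        grep -iv '^http' |
        sort | uniq -c | sort -nr | head -n "${2:-10}"
}
```

Sorting before uniq -c matters here, since uniq only merges lines that are next to each other.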
57 spambots forgot their own name. This could be because the box you put your name in below has a name of 'name', but an id of 'namebox' - which may have caused some confusion for some of the more stupid bots.
After all that, there were 3 long comments (probably a bunch of word salad), 2 invalid email addresses that weren't caught by any filters above, and 1 short name (under 3 characters).
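The remaining checks are simple enough to chain together. The sketch below is a guess at how such a chain might look rather than the blog's real code: only the 3 character name minimum and the "shortcomment" label come from this post, while the other labels, the comment length limits, the email pattern and the order of the checks are assumptions.

```bash
MIN_NAME_LENGTH=3          # from the post: names under 3 characters are rejected
MIN_COMMENT_LENGTH=10      # assumed threshold
MAX_COMMENT_LENGTH=2000    # assumed threshold

# Print the first mistake found in a submission, or "ok" if there is none.
# Arguments: name, email, comment text.
first_mistake() {
    local name="$1" email="$2" comment="$3"
    if [ -z "$name" ]; then
        echo "noname"
    elif [ "${#name}" -lt "$MIN_NAME_LENGTH" ]; then
        echo "shortname"
    # Loose shape check only: something@something.something, no spaces.
    elif ! printf '%s' "$email" |
            grep -Eq '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$'; then
        echo "invalidemail"
    elif [ "${#comment}" -lt "$MIN_COMMENT_LENGTH" ]; then
        echo "shortcomment"
    elif [ "${#comment}" -gt "$MAX_COMMENT_LENGTH" ]; then
        echo "longcomment"
    else
        echo "ok"
    fi
}
```

Returning a single label per submission is what keeps the log regular enough to summarise with a one-line pipeline.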
That's about it for this impromptu analysis of my comments log! This took far longer than I thought it would to type up. Did you find it interesting? Thinking of putting some of these techniques into practice yourself? Comment below!