Where in the world does spam come from?

Answer: The US, apparently. I was having a discussion with someone recently, and since I have a rather extensive log of comment failures for debugging & analysis purposes (dating back to February 2015!), they suggested that I render a map of where the spam is coming from.

It was such a good idea that I ended up doing just that - and somehow also writing this blog post :P

First, let's start off by looking at the format of said log file:

[ Sun, 22 Feb 2015 07:37:03 +0000] invalid comment | ip: a.b.c.d | name: Nancyyeq | articlepath: posts/015-Rust-First-Impressions.html | mistake: longcomment
[ Sun, 22 Feb 2015 14:55:50 +0000] invalid comment | ip: e.f.g.h | name: Simonxlsw | articlepath: posts/015-Rust-First-Impressions.html | mistake: invalidkey
[ Sun, 22 Feb 2015 14:59:59 +0000] invalid comment | ip: x.y.z.w | name: Simontuxc | articlepath: posts/015-Rust-First-Impressions.html | mistake: invalidkey

Unfortunately, I didn't think about parsing it programmatically when I designed the log file format... oops! It's too late to change it now, I suppose :P

Anyway, as an output, we want a list of countries in one column, and a count of the number of IP addresses from each in another. First things first - we need to extract those IP addresses. awk is ideal for this, so I quickly cooked this up:

BEGIN {
    # Split each line on the vertical bar character
    FS="|"
}

{
    # Field 2 looks like " ip: a.b.c.d " - strip out the label
    gsub(" ip: ", "", $2);
    print $2;
}

This tells awk to split each line on the vertical bar character (|), grab the second field (the ip: p.q.r.s bit), and strip out the ip: label before printing what's left.
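If you save that program to a file - say extract-ips.awk, a name I've just made up for illustration - you can run it over the log like so:

awk -f extract-ips.awk failedcomments.log | head -n3
a.b.c.d
e.f.g.h
x.y.z.w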

With this done, we're ready to look up all these IP addresses to find out which country they're from. Unfortunately, IP addresses can change hands semi-regularly - even across country borders - so my approach here isn't going to be entirely accurate. I don't anticipate the error being all that big though, so I think it's OK to just do a simple lookup.

If I were worried about it, I could investigate cross-referencing each IP address against a GeoIP database from the date & time I recorded it. The effort involved would be quite considerable - and this is a 'just curious' sort of thing, so I'm not going to do that here. If you have done this though, I'd love to hear about it - post a comment below.

Actually doing a GeoIP lookup is fairly easy. While for the odd IP address here and there I usually use ipinfo.io, when there are lots of lookups to be done (10,479 to be exact! Wow.), it's probably best to utilise a local database. A quick bit of research reveals that Ubuntu Server has a package called geoip-bin that should do the job:


sudo apt install geoip-bin
(....)
geoiplookup 1.1.1.1 # CloudFlare's 1.1.1.1 DNS service
GeoIP Country Edition: AU, Australia

Excellent! We can now look up IP addresses automagically via the command line. Let's plug that into the little command chain we've got going on here:

cat failedcomments.log | awk 'BEGIN { FS="|" } { gsub(" ip: ", "", $2); print $2 }' | xargs -n1 geoiplookup

It doesn't look like geoiplookup supports multiple IP addresses at once, which is a shame. In that case, the above will take a while to execute for ~10K IP addresses... :P
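If you've got GNU xargs, one thing that should help is parallelising the lookups with the -P flag - and since we sort the results later anyway, the order the lookups come back in doesn't matter. A sketch:

(...) | xargs -n1 -P4 geoiplookup

The 4 here is just an example - tune it to however many cores you have to spare.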

Next up, we need to remove the annoying label there. That's easy with sed:

(...) | sed -E 's/^[A-Za-z: ]+, //g'
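For example, feeding one geoiplookup result line through that expression:

echo 'GeoIP Country Edition: AU, Australia' | sed -E 's/^[A-Za-z: ]+, //g'
Australia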

I had some trouble here getting sed to accept a regular expression. At some point I'll have to read the manual pages more closely and write myself a quick reference guide. Come to think of it, I could use such a thing for awk too - its existing reference guide appears to have been written by a bunch of mathematicians who like using single-letter variable names everywhere.

Anyway, now that we've got our IP address list, we need to strip out any errors and then count them all up. The first part is somewhat awkward, since geoiplookup doesn't write errors to standard error for some reason, but we can cheese it with grep -v:

(...) | grep -iv 'resolve hostname'

The -v here tells grep to remove any lines that match the specified string, rather than showing us only the matching lines. This appeared to work at first glance - I simply copied part of the error message I saw and worked with that. If I have issues later, I can always write a more sophisticated regular expression with the -P option.
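As a contrived demonstration with made-up input (the real error text geoiplookup produces may differ a little):

printf 'AU, Australia\ncan not resolve hostname\nFR, France\n' | grep -iv 'resolve hostname'
AU, Australia
FR, France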

The counting bit can be achieved in bash with a combination of the sort and uniq commands. sort will, umm, sort the input lines, and uniq will de-duplicate multiple consecutive identical input lines, whilst optionally counting them. With this in mind, I wound up with the following:

(...) | sort | uniq -c | sort -n
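As a quick sanity check, here's that counting idiom applied to a toy input:

printf 'France\nUkraine\nFrance\n' | sort | uniq -c | sort -n
      1 Ukraine
      2 France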

The first sort call sorts the input to ensure that all identical lines are next to each other, ready for uniq.

uniq -c does the de-duplication, but also inserts a count of the number of duplicates for us.

Lastly, the final sort call with the -n argument sorts the completed list numerically, which means (in our case) that it treats the leading counts as numbers rather than strings - so 10 sorts after 9 instead of after 1. This should give us an output like this:

      1 Antigua and Barbuda
      1 Bahrain
      1 Bouvet Island
      1 Egypt
      1 Europe
      1 Guatemala
      1 Ireland
      1 Macedonia
      1 Mongolia
      1 Saudi Arabia
      1 Tuvalu
      2 Bolivia
      2 Croatia
      2 Luxembourg
      2 Paraguay
      3 Kenya
      3 Macau
      4 El Salvador
      4 Hungary
      4 Lebanon
      4 Maldives
      4 Nepal
      4 Nigeria
      4 Palestinian Territory
      4 Philippines
      4 Portugal
      4 Puerto Rico
      4 Saint Martin
      4 Virgin Islands, British
      4 Zambia
      5 Dominican Republic
      5 Georgia
      5 Malaysia
      5 Switzerland
      6 Austria
      6 Belgium
      6 Peru
      6 Slovenia
      7 Australia
      7 Japan
      8 Afghanistan
      8 Argentina
      8 Chile
      9 Finland
      9 Norway
     10 Bulgaria
     11 Singapore
     11 South Africa
     12 Serbia
     13 Denmark
     13 Moldova, Republic of
     14 Ecuador
     14 Romania
     15 Cambodia
     15 Kazakhstan
     15 Lithuania
     15 Morocco
     17 Latvia
     21 Pakistan
     21 Venezuela
     23 Mexico
     23 Turkey
     24 Honduras
     24 Israel
     29 Czech Republic
     30 Korea, Republic of
     32 Colombia
     33 Hong Kong
     36 Italy
     38 Vietnam
     39 Bangladesh
     40 Belarus
     41 Estonia
     44 Thailand
     50 Iran, Islamic Republic of
     53 Spain
     54 GeoIP Country Edition: IP Address not found
     60 Poland
     88 India
    113 Netherlands
    113 Taiwan
    124 Indonesia
    147 Sweden
    157 Canada
    176 United Kingdom
    240 Germany
    297 China
    298 Brazil
    502 France
   1631 Russian Federation
   2280 Ukraine
   3224 United States

Very cool. Here's the full command for reference - explainshell can break it down piece by piece if you want more detail:

cat failedcomments.log | awk 'BEGIN { FS="|" } { gsub(" ip: ", "", $2); print $2 }' | xargs -n1 geoiplookup | sed -e 's/GeoIP Country Edition: //g' | sed -E 's/^[A-Z]+, //g' | grep -iv 'resolve hostname' | sort | uniq -c | sort -n
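If you wanted to re-run this periodically, the whole pipeline could be wrapped up in a little script - something along these lines (spam-countries.sh is a name I've invented for illustration; it also folds the cat into awk):

#!/usr/bin/env bash
# spam-countries.sh - count failed comment attempts by country
# Usage: ./spam-countries.sh failedcomments.log
awk 'BEGIN { FS="|" } { gsub(" ip: ", "", $2); print $2 }' "$1" \
    | xargs -n1 geoiplookup \
    | sed -e 's/GeoIP Country Edition: //g' \
    | sed -E 's/^[A-Z]+, //g' \
    | grep -iv 'resolve hostname' \
    | sort | uniq -c | sort -n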

With our list in hand, I imported it into LibreOffice Calc to parse it into a table with the fixed-width setting (Google Sheets doesn't appear to support this), and then pulled that into a Google Sheet in order to draw a heat map:

(Image: A world map showing the above output as a heat map - countries with more failed comment attempts appear redder.)

At first, the resulting graph showed just a few countries in red and the rest in white. To rectify this, I pushed the counts through the natural log (log()) function, which yielded a much better map - countries that have only been spamming a little bit still show up in a shade of red.
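Incidentally, if you wanted to do the log scaling in the pipeline itself rather than in the spreadsheet, awk can handle that too - its log() function is the natural log. A sketch:

(...) | awk '{ count = $1; $1 = ""; printf "%.2f%s\n", log(count), $0 }'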

From this graph, we can quite easily conclude that the spammiest countries are:

  1. The US
  2. Russia
  3. Ukraine (I get lots of spam emails from here too)
  4. China (I get lots of SSH intrusion attempts from here)
  5. Brazil (Wat?)

Personally, I was rather surprised to see the US in the top spot. I figured that with tough laws on that sort of thing, spammers wouldn't risk buying a server and sending spam from there.

On further thought though, it occurred to me that it may simply be because there are lots of infected machines in the US being abused to send spam without their owners' knowledge.

At any rate, I don't appear to have a spam problem on my blog at the moment - it's just fascinating to investigate where the spam I do block comes from.

Found this interesting? Got an observation of your own? Plotted a graph from your own data? Comment below!
