Block Referrer Spam on Your Website
The Internet is a wonderful place to learn new things and find information. Because it has to provide information and access to anybody, anywhere in the world, there was no option but to keep it open to the public. In its infancy it was used mostly by well-meaning people, but it has gradually become a mixture of good and evil.
Even if you are not a tech junkie, I am sure you have heard the term spam. What counts as spam and what does not is open to debate, because there are different schools of thought about it. Almost all of us get spam email every day. In layman's terms, it is unwanted email: mostly advertising, job-related enquiries, sales pitches, and so on. Some of it can be really dangerous, because it may point you to a phishing link designed to steal your credentials. But spam is not limited to email. It is a plague that every website is fighting today.
The main reason I am writing this post is to help you protect your site from something called “referrer spam”. Before we get into methods of blocking it, let's first understand what it is all about.
What is Referrer Spam?
It is basically a technique used to get higher rankings in search engines. Everything on the internet is interlinked. A person writing an article can link to any other URL that contains related information the reader might find helpful. It is an awesome feature, and it is what makes the internet a wonderland to explore. You can click a link to reach another page with useful information, from there another page, and so on.
But everything changed when search engines came into the picture. A search engine has to do a lot of calculation to decide which page to show first in the results. In the beginning it was a simple text search (every site or URL that contained the search string was shown as a result). But then people started to fool it by stuffing their pages with target keywords to get to the top. Search engines fought back by ranking pages with more inbound links higher.
If a particular URL has many inbound links from other sites, that signals it is something worth reading and linking to. Hence search engines started to show pages with more inbound links first.
But even this did not last long, because people started to get paid links, fake links, and spam links from other websites. I was amazed to learn that there were companies (so-called SEO companies) making millions of dollars just by building inbound links for others. And some of these companies went ahead and took a completely wrong approach: spamming websites with comments containing links to other websites (probably their SEO clients).
You won't believe it, but my website used to get around 100 spam comments a day. Most of them contained a few business or marketing lines with a link to their website (they were even smart enough to bypass my captcha). This was really evil!
Search engines retaliated against these link-building tactics with something called PageRank, which basically cares about the website the link comes from. This means a link from a spam website counts for little and can even hurt your position in the search results (which really worked), while a link from microsoft.com, hp.com, or any other credible site with a strong reputation gives you a higher rank. This was an awesome idea.
Referrer spam is a method used by so-called SEO companies and spammers who try to build traffic by the wrong means. What they do is send fake referral traffic to a good website (one that probably ranks well in search engines). Whenever your site is visited, the web server writes a log entry. This entry records where the visitor came from (the source URL, if the visitor followed a link on another website). An example nginx log entry containing a referral link is shown below.
193.34.96.12 - - [20/Nov/2013:09:12:42 +0000] "GET /how-is-nginx-different-from-apache HTTP/1.1" 200 18612 "http://www.linux.org.ru/forum/talks/9146309" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36"
The above log entry shows that my site was visited by a user who followed a link from linux.org.ru. That referral is legitimate, because somebody posted my link in a discussion thread there. When you click a link that takes you to my website, the request carries the referring URL in its headers. HTTP requests contain many headers, such as User-Agent (basically the browser and its version), Accept (the accepted content types, text/html etc.), and Referer (the URL of the page from which the visitor reached your website).
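To make the log format concrete, here is a minimal Python sketch that pulls the referrer out of a line like the one above. It assumes the log uses the common "combined" format; the `COMBINED` pattern and `parse_line` helper are names made up for this illustration, and a custom `log_format` would need a different pattern.

```python
import re

# Pattern for the "combined" access log format (assumed; adjust for a custom log_format).
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one combined-format log line, or None if it does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

# The sample log line from this post.
sample = ('193.34.96.12 - - [20/Nov/2013:09:12:42 +0000] '
          '"GET /how-is-nginx-different-from-apache HTTP/1.1" 200 18612 '
          '"http://www.linux.org.ru/forum/talks/9146309" '
          '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36"')

fields = parse_line(sample)
print(fields["referer"])
```

For the sample line, the printed referrer is the linux.org.ru thread URL.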
There are many tools with which you can modify the Referer header of your HTTP request. One such tool is curl. You can also use any programming language to set the headers in your request. Using curl, you can set the referrer as shown below.
ubuntu@puppetmaster:~$ curl -e "http://www.example.com" http://www.slashroot.in/linux-cpu-performance-monitoring-tutorial
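The same thing can be done from a programming language. Below is a small sketch using Python's standard `urllib.request`; the `build_request` helper is just an illustrative name. It only builds the request and prints the header, so nothing is sent until you call `urlopen` yourself.

```python
import urllib.request

def build_request(url, referer):
    """Build a GET request whose Referer header is set to an arbitrary value."""
    return urllib.request.Request(url, headers={"Referer": referer})

req = build_request(
    "http://www.slashroot.in/linux-cpu-performance-monitoring-tutorial",
    "http://www.example.com",
)
print(req.get_header("Referer"))
# To actually send the request: urllib.request.urlopen(req)
```

The point is that the Referer value is entirely under the client's control, which is exactly what referrer spammers rely on.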
For more curl commands, refer to my posts below.
Read: Curl Command in Linux Usage
Read: HTTP Request and Response
You might be wondering what the use of sending such spam referral traffic to a website is. The idea is purely to target search engine algorithms. Some websites and blogs publish a list of their top referrers. So if a spammer sends enough fake referral requests to such a site, the spammer's URL ends up published on that page.
When search engines index that page again, they count the spam website's link (published because of the high referral traffic) as another outbound link. Imagine a famous and reputable site or blog publishing it: the spam website now has a link from them. This in turn pushes its ranking in search engines higher.
Although many websites have stopped publishing their top referrers, some still do. That is why those evil guys keep doing the referral spam trick (by sending junk referral traffic).
Why was I worried about referral spam, and why should you be as well?
I usually get a few thousand requests per day, and most of the sites that link to me (that send me referral traffic) are related to my niche (Linux, system administration, etc.). I have a habit of checking my referrals in Google Analytics, and I saw suspicious requests from a source called SEMALT. I immediately started to Google them, and found that many other websites were being targeted by them and facing the same problem.
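If you would rather check referrers from your own access logs than from an analytics dashboard, a rough sketch like the following can tally them by hostname. The `top_referrer_hosts` helper and the sample list are hypothetical; in practice the referrer values would come from your log file.

```python
from collections import Counter
from urllib.parse import urlparse

def top_referrer_hosts(referers, limit=10):
    """Count referrer URLs by hostname, skipping empty values such as "-"."""
    hosts = Counter()
    for ref in referers:
        host = urlparse(ref).hostname
        if host:
            hosts[host] += 1
    return hosts.most_common(limit)

# Hypothetical referrer values, as they might be pulled from an access log.
referers = [
    "http://www.linux.org.ru/forum/talks/9146309",
    "http://semalt.semalt.com/crawler.php",
    "http://semalt.semalt.com/crawler.php",
    "-",
]
for host, hits in top_referrer_hosts(referers):
    print(hits, host)
```

A hostname you do not recognise near the top of this list is a good candidate for a closer look.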
I was getting around 100 hits from them per day, all spam. I don't publish my top referrers, but I also don't like to waste server resources on serving spam requests. Processing a request takes memory and CPU time, and I don't want to waste even a bit of it on such requests.
This is much riskier for people who do publish their top referrers, or who publish their access logs or access statistics online, because it can damage their own search engine rankings. In the worst case, they will lose the search engine credibility they had.
Don't forget to watch the below YouTube video from Google's Matt Cutts.
The first thing I do when I see such unknown referral traffic is visit the URL. When I visited SEMALT, I found it to be complete rubbish; it appeared to be some so-called SEO tool. And if you google SEMALT DOT COM, you will learn that they are apparently sending these spam requests to many websites. So I set out to block them on my nginx web server.
Blocking this kind of spam request by source IP address is not a good idea, because the spammers will change the source within minutes. When I was working at a media advertising company called Media.net, we used to block a lot of spam requests targeted at our serving platform on a daily basis, and at times the block lists (based on source addresses) became too large to manage. Blocking an IP or subnet can also kill legitimate traffic to your website. I remember blocking a couple of subnets from a particular country and ending up blocking all traffic from that country, because those subnets belonged to its largest ISPs. Hence blocking HTTP requests by source address is not advisable (although it is reasonable when the request is not spam but part of a denial-of-service attack).
A web server can work like an application-layer firewall for HTTP traffic (not a fully featured firewall, but at least you can return a 403 Forbidden based on different variables). This is because the web server sees the complete headers of the request, and we can block based on different header values: for example, block if the user agent is curl, or block if the request URL contains a particular word. Let's see how to block referral spam on the nginx and Apache web servers.
How to Block Referral Spam in Nginx?
You can use two different methods to block such requests on nginx. The simple method shown below searches the referrer URL and blocks the request if it contains one of the strings you mention. For instance, to block referral traffic from webperformancenews.com and semalt.com, add the snippet below to the server block of your nginx site config.
if ($http_referer ~* "webperformancenews\.com|semalt\.com") { return 403; }
The above snippet returns a 403 Forbidden whenever a request arrives with a referrer matching one of the listed sources. Doing this costs a little processing power on the server side, because nginx checks the Referer header of every request against the pattern. So please do not make the list too long. The smaller the list, the better the performance.
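To get a feel for which referrers a pattern like this catches, here is a rough Python imitation of the match. It is only a sketch: nginx uses its own regex engine, and `re.IGNORECASE` stands in for the case-insensitive `~*` operator, but for a simple pattern like this one the behaviour should be the same. The `is_blocked` helper and the test URLs other than those from this post are made up for illustration.

```python
import re

# Same pattern as the nginx rule; re.IGNORECASE mirrors the case-insensitive ~* operator.
BLOCKED = re.compile(r"webperformancenews\.com|semalt\.com", re.IGNORECASE)

def is_blocked(referer):
    """True if the pattern occurs anywhere in the Referer value."""
    return BLOCKED.search(referer) is not None

for ref in ("http://semalt.semalt.com/crawler.php",
            "http://SEMALT.com/",
            "http://www.linux.org.ru/forum/talks/9146309",
            "http://example.org/?q=semalt.com"):
    print(is_blocked(ref), ref)
```

Note that the last URL is blocked as well: the pattern is not anchored, so it matches the string anywhere in the referrer, including the query string. For a spam block list that is usually acceptable, but it is worth knowing.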
Read: Nginx Performance Tuning Tutorial
If you have an application-layer load balancer in front of your web server, it is always better to block there instead. Once you have blocked a referrer, you can confirm the block by sending a request with that referrer to your website, using the curl command shown earlier (suppose, for example, that you have blocked all referrals from example.com with the above method).
ubuntu@puppetmaster:~$ curl -e "http://www.example.com" http://www.slashroot.in/linux-cpu-performance-monitoring-tutorial
<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>
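If you want to script this check rather than eyeball the HTML, a sketch along these lines returns just the status code. It uses Python's standard `urllib`; the `status_with_referer` helper is an illustrative name, and nothing is requested until you call it.

```python
import urllib.error
import urllib.request

def status_with_referer(url, referer, timeout=10):
    """Request url with the given Referer header and return the HTTP status code."""
    req = urllib.request.Request(url, headers={"Referer": referer})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the status code is on the exception.
        return err.code

# Example call (performs a real request when run):
# status_with_referer(
#     "http://www.slashroot.in/linux-cpu-performance-monitoring-tutorial",
#     "http://www.example.com",
# )
```

With the block in place you would expect 403 for a blocked referrer and 200 for a legitimate one.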
Alternatively, you can use the valid_referers directive of the nginx referer module to block referral spam requests.
http://wiki.nginx.org/Referrer_Spam_Blocking
How to block referral spam in Apache web server
The first step is to enable the mod_rewrite module in Apache. Most of you probably already have that module enabled and a .htaccess file in your document root; if not, create one. The .htaccess file should contain the following.
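RewriteEngine on
RewriteCond %{HTTP_REFERER} (example\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (semalt\.com) [NC]
RewriteRule .* - [F]
Go on adding domains that are sending referral spam as further RewriteCond lines between the first and last lines shown above. Every condition except the final one carries the OR flag; a trailing OR on the last condition leaves the chain open and can make the rule match every request.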
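If your block list changes often, you may prefer to generate these lines instead of editing them by hand. The following Python sketch does that; `referer_block_rules` is a hypothetical helper, and it assumes the input is a list of plain domain names (only dots are escaped). It deliberately leaves the OR flag off the final condition.

```python
def referer_block_rules(domains):
    """Build mod_rewrite lines that answer 403 when the Referer contains any domain."""
    if not domains:
        return ""  # a RewriteRule with no conditions would block every request
    lines = ["RewriteEngine on"]
    for i, domain in enumerate(domains):
        pattern = domain.replace(".", r"\.")   # treat dots literally in the regex
        is_last = i == len(domains) - 1
        flags = "[NC]" if is_last else "[NC,OR]"  # no OR on the final condition
        lines.append(f"RewriteCond %{{HTTP_REFERER}} ({pattern}) {flags}")
    lines.append("RewriteRule .* - [F]")
    return "\n".join(lines)

print(referer_block_rules(["example.com", "semalt.com"]))
```

You can paste the printed lines into your .htaccess, then verify the result with curl before relying on it.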
As I said earlier, as the list grows, the server spends a little more time comparing each request against the blocked entries.
As before, always double-check that the rule works by sending a request with a blocked referrer using curl. I hope this article was helpful in understanding what referral spam is and how to block it on your website. I invite readers to share any better methods they use to deal with referrer spam.
Comments
Is this right?
Great information here. It's just what I've been dealing with. I am wondering if you can confirm that I'm following your instructions correctly. The following is the .htaccess:
<IfModule mod_headers.c>
Header unset ETag
FileETag None
<FilesMatch "\.(ico|flv|jpg|jpeg|png|gif|js|css)$">
Header unset Last-Modified
Header set Expires "Fri, 21 Dec 2020 00:00:00 GMT"
Header set Cache-Control "public, no-transform"
</FilesMatch>
</IfModule>
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_REFERER} (http: //semalt.semalt.com/crawler.php) [NC,OR]
RewriteCond %{HTTP_REFERER} (semalt\.com) [NC,OR]
RewriteCond %{HTTP_REFERER} (http: // www. reklama-i-rabota.ru) [NC,OR]
RewriteRule .* - [F]
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^((.)?)$ index.php?p=home [L]
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^(.*)$ $1 [QSA,L]
RewriteCond $1 "/home/########/######_html"
RewriteRule ^(.+)$ / [L]
RewriteCond $1 !^(\#(.)*|\?(.)*|cgi-bin\/(.)*|content\/(.)*|forum\/(.)*|robots\.txt(.)*|images\/(.)*|SAVE\/(.)*|login\.php(.)*|\.htaccess\.back(.)*|error_log(.)*|ioncube\/(.)*|\.ftpquota(.)*|checkbox\.png(.)*|admin\.php(.)*|download\.php(.)*|index\.php(.)*|\.htaccess(.)*|readme\.txt(.)*)
RewriteRule ^(.+)$ index.php?url=$1&%{QUERY_STRING} [L]
</IfModule>
<IfModule mod_deflate.c>
<FilesMatch "\.(js|css|ico|flv|jpg|jpeg|png|gif)$">
SetOutputFilter DEFLATE
</FilesMatch>
</IfModule>
<Files 403.shtml>
order allow,deny
allow from all
</Files>