In a nutshell, referral spam is bogus referral traffic sent to a website or product. There are two main categories of referral spam: web crawlers (bots) and ghost referral traffic.
Web crawlers are robots that visit your website with the sole aim of indexing the content. Your Google Analytics account will catch most of these before they are counted as sessions because, being bots, they assist Google by identifying themselves as such. But some web crawlers don’t identify themselves as bots and end up being admitted because they look authentic, like Semalt (best-seo-solution.com) or blog.ranksonic.com.
Most web crawlers can be caught by setting up proper filters, explained further below.
Ghost referral traffic is the greater evil, because it never actually does anything as upfront as visiting your website. Instead, it sneaks directly into your Google Analytics account via HTTP requests sent straight to the Google Analytics servers, which works like this:
- A spammer changes a genuine HTTP referrer (a header passed along when a browser goes from one page to another, usually showing where the user came from) and inserts the name of a site they want to promote.
- He uses this spammy HTTP referrer header to visit loads of websites and GA accounts all over the web, leaving a trail of referrers scattered around the Internet like crumbs. He creates a program that sends out hundreds of these HTTP requests at a time, and configures it to get the requests sent straight to your Google Analytics account.
- A business owner, unaware, publishes his Google Analytics reports, and everyone can now view the referrer links of all the traffic that supposedly visited his site.
- A user or customer sees the link, is curious, and enters the URL into his own browser to check it out – and BAM! The spammer gets paid his referral money.
The worst offenders in the ghost referral traffic scam are darodar.com, social-buttons.org, buy-cheap-online.info, pornhub-forum.ga, and hulfingtonpost.com.
Spam Referral Traffic Causes Real Damage
Spammy referral traffic might simply seem annoying at first glance, but the truth is that it can do real damage in a Google Analytics account. Why?
- It messes up Google Analytics data: Sessions that enter your Google Analytics account via bogus referral traffic can skew reporting data, inflate statistics, etc. If you rely on your GA reports to make business decisions and don’t notice the spam traffic, then you could be making the wrong decision altogether. In other words, your Google Analytics data becomes unreliable.
- Referral spam gives SEO a bad name: Think about it this way. If your business publishes its access logs or Google Analytics reports online, a curious customer either clicks on the link (if there is a real URL) or enters the site name directly into his browser, and this pushes the spammy site up the search engine results pages (SERPs).
- It Exposes Customers to Malware: If one of your customers clicks on a ghost referral header they see in your Google Analytics data table, and the website turns out to be riddled with viruses and malware, then think of how badly it would reflect on you. After all, they found the link through you.
- It can be used to blacken a site’s reputation: In very malicious cases (the worst case scenario, really), the spammer can mask the true origin URL of the link by simply inserting the URL of another company. It is impossible to trace and verify the identity behind the referrer header, so users might end up thinking that a perfectly innocent company is in fact a spammer.
First of all, know thy enemy! If you’re in charge of cleansing your Google Analytics account, then we suggest you get familiar with the type of vermin currently infesting your backyard. You don’t want to go charging into the yard throwing rat poison everywhere, because you might end up killing your prize flowers as well as the vermin.
One way to tell the difference between valid (good) and invalid (bad) traffic is to examine the hostname. Ghost referral spammers don’t know who they are targeting, so the hostname recorded for their traffic will always be incorrect.
In your GA account, click on the Reporting tab and select the widest time frame possible. In the sidebar, click on Acquisition, expand All Traffic, and then select Channels. Click Secondary Dimension, type Hostname, and select that option. You will see something similar to this table:
The genuine, valid traffic associated with your business all has your correct hostname in place (because it really reached you!), whereas the ghost referrer spam traffic all has an incorrect hostname. If you’re still not sure, check the metrics of the sessions attributed to these spambots: you will see extraordinarily high bounce rates (100%) and extraordinarily low average session durations (1 second or less).
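The hostname check described above can also be run on an exported copy of the report. The following is a minimal Python sketch, not a Google Analytics feature: the `ReportRow` fields, the `VALID_HOSTNAMES` set, and the sample thresholds are all assumptions made for illustration, and the metrics are used only as supporting evidence next to the hostname.

```python
from dataclasses import dataclass

# Hypothetical set of hostnames that legitimately carry your tracking ID.
VALID_HOSTNAMES = {"www.organicfoodpots4u.com", "blog.organicfoodpots4u.com"}


@dataclass
class ReportRow:
    """One row of an exported report (field names are assumptions)."""
    source: str
    hostname: str
    sessions: int
    bounce_rate: float          # percent, 0-100
    avg_session_seconds: float


def classify(row: ReportRow) -> str:
    """Label a row using the hostname first, then the session metrics."""
    if row.hostname in VALID_HOSTNAMES:
        return "valid"
    # Wrong hostname plus a 100% bounce rate and ~0s sessions matches
    # the ghost referral pattern described in the article.
    if row.bounce_rate >= 100.0 and row.avg_session_seconds <= 1.0:
        return "likely ghost spam"
    return "unknown hostname, review manually"


if __name__ == "__main__":
    rows = [
        ReportRow("google", "www.organicfoodpots4u.com", 120, 42.5, 95.0),
        ReportRow("darodar.com", "(not set)", 18, 100.0, 0.0),
    ]
    for row in rows:
        print(row.source, "->", classify(row))
```

Rows labelled for manual review are often legitimate hostnames you forgot about, which is why the sketch does not discard them automatically.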
Use a Bot Filter
The Google Analytics bot filter screens out most of the creepers and crawlers out there on the web, and while it’s not perfect, it’s worth a try. Go into the Admin section of your Google Analytics account, open the View Settings for your view, and tick the Bot Filtering checkbox. This filter will remove hits from the known bots and spiders on the IAB’s list at no extra cost to you.
Individual View Filters
You can create individual filters on your account views as and when you identify a web crawler, bot, or ghost referrer name. This takes a lot of work because it’s up to you to keep watch and keep adding names to the filter, and these spam traffic sites multiply at a rate that is hard to keep up with. For an up-to-date list of referrer spam, check out the Piwik referrer spam blacklist on GitHub.
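To give a feel for what maintaining such filters involves, here is a rough Python sketch that turns a list of spam domains into one or more filter patterns. The domains are the ones named earlier in this article; the helper names and the 255-character limit per pattern are assumptions, so check the limit that applies in your own account.

```python
import re

# Domains named earlier in this article; a real list would be much longer.
SPAM_DOMAINS = [
    "darodar.com",
    "social-buttons.org",
    "buy-cheap-online.info",
    "pornhub-forum.ga",
    "hulfingtonpost.com",
]

MAX_PATTERN_LENGTH = 255  # assumed per-filter limit, not verified here


def escape_domain(domain: str) -> str:
    """Put a backslash in front of every period so it matches literally."""
    return domain.replace(".", r"\.")


def build_filter_patterns(domains, max_length=MAX_PATTERN_LENGTH):
    """Join escaped domains with '|', starting a new pattern when one fills up."""
    patterns = []
    current = ""
    for domain in domains:
        escaped = escape_domain(domain)
        candidate = escaped if not current else current + "|" + escaped
        if current and len(candidate) > max_length:
            patterns.append(current)
            current = escaped
        else:
            current = candidate
    if current:
        patterns.append(current)
    return patterns


if __name__ == "__main__":
    for pattern in build_filter_patterns(SPAM_DOMAINS):
        print(pattern)
        # Quick sanity check against a sample referrer.
        print(bool(re.search(pattern, "forum.darodar.com")))
```

Each returned pattern would go into its own exclude filter, which is exactly the upkeep that makes this approach tiring as the list grows.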
Protection on the Basis of Valid Hostnames
Instead of applying site-level and individual filters to catch crawlers and ghosts, you could make it easier on yourself by identifying all the good, valid hostnames of website domains associated with your business, and simply exclude traffic from everywhere else.
The logic behind this approach is as follows: you don’t know where the spam referrals are coming from, but you sure as hell know where they’re landing, i.e., on your page. The spammers, on the other hand, don’t know the hostname of the site they’re landing on. The one weapon you have is that you can exclude them automatically based on their incorrect hostname. Therefore, you can design a one-time filter that only lets in the traffic with the right hostnames and blocks everything else.
In order to put this solution in place, you must correctly identify all the valid hostnames that might use your website tracking ID – these will be all the sites and pages that are in some way connected to your business and website. You can build this list by accessing the multi-year report in Google Analytics showing just the hostnames, as below:
Let’s say that you run a company called Organic Food Pots, selling healthy prepackaged meal pots for kids. So, for example, the valid hostnames that would legitimately be using your website tracking ID might be:
www.organicfoodpots4u.com – your main site
help.organicfoodpots4u.com – your help site
forums.organicfoodpots4u.com – the forums on your site
blog.organicfoodpots4u.com – if you run a blog on site
translate.googleusercontent.com – if your international site visitors use Google Translate to translate your content
www.youtube.com – if you host webinars or do instructional videos (YouTube users have to use their web ID in setting up their channel access)
www.paypal.com – if you use PayPal for payments on your site
webcache.googleusercontent.com – if your users want to look up your old articles and blog posts
If you don’t have any pages on google.com, mozilla.org, firefox.com, or on other sites that you didn’t configure with your web tracking ID code, then traffic reporting any of those hostnames is SPAM!
Select each of the sites and/or pages that you want to allow through the filter, and leave unchecked all those sites or pages that you don’t. In the example below, Megalytic has selected all its valid hostnames from the list (the red arrows):
RegEx – Regular Expression
Now for the last step.
When you have your list of valid hostnames, enter them all in a single line of text, separated by the “|” (OR) character and with a backslash in front of every “.” (period, or full stop). This creates a regular expression (RegEx) that will match your good hostnames and exclude all the invalid hostnames that don’t match. For Megalytic, whose example we have been following, it looks like this:
The great thing about this method is that you select once for a small number of known positive values (valid hostnames) and leave the huge number of unknown, always proliferating, negative value sites to worry about themselves!
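For the Organic Food Pots example, the expression can be assembled by hand or with a few lines of Python. This is only a sketch: the hostname list comes from the example above, while the function name and the optional `anchored` flag are assumptions added for illustration.

```python
import re

# Valid hostnames from the Organic Food Pots example.
VALID_HOSTNAMES = [
    "www.organicfoodpots4u.com",
    "help.organicfoodpots4u.com",
    "forums.organicfoodpots4u.com",
    "blog.organicfoodpots4u.com",
    "translate.googleusercontent.com",
    "www.youtube.com",
    "www.paypal.com",
    "webcache.googleusercontent.com",
]


def build_hostname_regex(hostnames, anchored=False):
    """Escape every period and join the hostnames with '|'.

    With anchored=True the result is wrapped in ^(...)$ so that only
    exact hostnames match, not longer strings that merely contain one.
    """
    body = "|".join(host.replace(".", r"\.") for host in hostnames)
    return "^(" + body + ")$" if anchored else body


if __name__ == "__main__":
    pattern = build_hostname_regex(VALID_HOSTNAMES)
    print(pattern)
    # A valid hostname matches; a spam hostname does not.
    print(bool(re.search(pattern, "blog.organicfoodpots4u.com")))
    print(bool(re.search(pattern, "darodar.com")))
```

The unanchored form mirrors the single-line expression described above. The anchored variant is a stricter option worth considering if you see hostnames that contain one of your valid names as a substring.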
Test and Apply Filter!
Make sure that your filter works by running your filter expression in a segment under the Advanced tab, then sub-tab Conditions. You can choose two segments – one with the filter not applied, called, let’s say, “All Hostnames” and the other called “Valid Hostnames”. Run the test and watch how the data is sorted – the “Valid Hostnames” segment will have fewer hostnames listed in it now because the ghost referral sites have been weeded out!
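The same comparison can be rehearsed offline before touching your account. The sketch below imitates the two segments on invented sample data; the shortened pattern, the `split_segments` helper, and the session counts are all made up for illustration.

```python
import re
from collections import Counter

# Shortened, hypothetical version of a valid-hostname expression.
VALID_HOSTNAME_PATTERN = re.compile(
    r"www\.organicfoodpots4u\.com|blog\.organicfoodpots4u\.com"
)


def split_segments(hits):
    """Return (all_hostnames, valid_hostnames) session counts.

    hits is an iterable of (hostname, sessions) pairs.
    """
    all_hosts = Counter()
    valid_hosts = Counter()
    for hostname, sessions in hits:
        all_hosts[hostname] += sessions
        if VALID_HOSTNAME_PATTERN.search(hostname):
            valid_hosts[hostname] += sessions
    return all_hosts, valid_hosts


# Invented sample data, not taken from any real report.
SAMPLE_HITS = [
    ("www.organicfoodpots4u.com", 120),
    ("blog.organicfoodpots4u.com", 45),
    ("(not set)", 30),
    ("darodar.com", 18),
]

if __name__ == "__main__":
    all_hosts, valid_hosts = split_segments(SAMPLE_HITS)
    removed = sorted(set(all_hosts) - set(valid_hosts))
    print("All Hostnames:", len(all_hosts), "hostnames")
    print("Valid Hostnames:", len(valid_hosts), "hostnames")
    print("Weeded out:", removed)
```

If a hostname you recognise shows up in the weeded-out list, add it to the expression before applying the filter for real.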
Finally, apply the “Valid Hostnames” segment to your Audience > Technology > Network report and select Hostname as the primary dimension. Check that only the hostnames you put on your valid hostnames list are showing up. Now add the filter to a new view and name it something like “Spam-Free Data”. Data showing up in this view should be free of ghost referral spam, so you can make business decisions based on this Google Analytics data with far more confidence.