Duplicate Content: NoIndex vs robots.txt vs rel canonical vs 301 redirects
Having a clear content structure is critical for Search Engine Optimisation. One of the many barriers to achieving this nirvana is thin or duplicate content.
What is Duplicate Content?
Duplicate content (bad SEO juju) can take a number of forms, and it can be created either accidentally or by the Content Management System (CMS) that your site is built on. It creates a problem for the search engine and therefore for your rankings: when your site has several copies of the same content, the search engine has a hard time determining which page to rank for which search query.
Common Causes of Duplicate Content
The most basic form of duplicate content occurs when the majority of the content on two or more pages is the same. The pages share the same header, footer, and navigation, which together make up most of the page content, and the only difference is a short paragraph of text or an image.
In this instance, the search engine sees pages that are mostly copies of each other.
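To put a rough number on "the majority of the page is duplicated", here is a minimal sketch that compares the visible text of two pages. It assumes you have already extracted the text from each page, and it uses Jaccard similarity over three-word shingles, which is one common way to measure overlap and not necessarily what any search engine does. The function names and the sample pages are made up for illustration.

```python
import re


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word sequences ("shingles") in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a: str, text_b: str, size: int = 3) -> float:
    """Jaccard similarity of two texts: 1.0 means identical shingle sets."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a or not b:
        return 0.0  # too little text to compare
    return len(a & b) / len(a | b)


# Hypothetical pages: shared header and footer, two unique words each.
HEADER = "home about products services blog contact search basket account help"
FOOTER = "privacy terms cookies sitemap careers press delivery returns copyright company"
page_a = f"{HEADER} red widget {FOOTER}"
page_b = f"{HEADER} blue gadget {FOOTER}"

score = similarity(page_a, page_b)  # well above 0.5 for these two pages
```

A score close to 1.0 suggests the pages differ by very little; where you draw the line is a judgement call.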
Categories, Tags and Date Archive Pages
Sites built with CMS platforms such as WordPress often suffer from issues caused by Category or Tag archives. The same problem often occurs with author archives and with date-based archives, usually on blog posts.
The duplication often occurs because, for example, a blog has only one author, so the author archive has almost identical content to the homepage of the blog. It’s quite possible that a month archive page could feature almost identical content too.
Duplicate content also occurs when blog post tags and blog post categories are used together. Unless you have a LOT of blog content, it’s quite common to end up with a tag archive and a category archive that look very similar. Generally, using one or the other is recommended, rather than both.
Filter and Sort Pages
Many eCommerce platforms such as Magento include features to allow the user to sort results by price, or filter for one or more brands, for example.
That makes for a great user experience but can cause duplicate content issues. eCommerce platforms often use parameters in the URL – domain.com/my-category?sort=price&brand=7.
When the page content is almost identical, just in a different order or a subset of the content of the main category page, this also creates confusion for the search engine.
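If you have a list of URLs from a crawl or your server logs, a quick way to spot these variants is to group them by host and path, ignoring the query string. The sketch below does only that; the function name and the sample URLs are illustrative assumptions, and a group with more than one URL is a candidate for review, not proof of duplication.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def duplicate_candidates(urls: list[str]) -> dict[tuple[str, str], list[str]]:
    """Group URLs by (host, path) and keep groups with more than one variant."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[(parts.netloc, parts.path)].append(url)
    return {page: variants for page, variants in groups.items() if len(variants) > 1}


# Hypothetical crawl output for illustration.
crawled = [
    "https://domain.com/my-category",
    "https://domain.com/my-category?sort=price",
    "https://domain.com/my-category?sort=price&brand=7",
    "https://domain.com/contact",
]
candidates = duplicate_candidates(crawled)
```

Here the three my-category URLs end up in one group, while the contact page is left out because it has only one version.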
WWW and Non-WWW
This one is surprisingly common. Most website owners do not realise that, technically, www.domain.com and domain.com are two different hostnames, which a search engine can treat as two completely different websites.
It’s actually possible to host completely different sets of content on each, although it’s rarely necessary or practical.
If your CMS or server is incorrectly configured, it may be serving both versions of your site, which also creates a duplicate content issue.
HTTPS and HTTP
In the same way as www and non-www, HTTPS and HTTP versions of a domain can host two different sets of content. Yes, you got it: combining the two protocols with the two hostnames means that, technically, you could have four versions of every page on your website, creating huge problems with duplicate content.
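To see where the four versions come from, this small sketch lists every protocol and hostname combination for a page. The helper name and the /about path are made up for illustration.

```python
from itertools import product


def url_variants(bare_domain: str, path: str = "/") -> list[str]:
    """List every scheme and hostname combination for one page."""
    schemes = ("http", "https")
    hosts = (bare_domain, f"www.{bare_domain}")
    return [f"{scheme}://{host}{path}" for scheme, host in product(schemes, hosts)]


variants = url_variants("domain.com", "/about")
# ['http://domain.com/about', 'http://www.domain.com/about',
#  'https://domain.com/about', 'https://www.domain.com/about']
```

Requesting each of these and checking whether it returns the page directly or redirects to your preferred version is a quick way to audit your own site.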
How to Fix the Duplicate Content Issue
There are a number of tools in your arsenal, depending on the situation.
When to Use noIndex
If a page is valuable to the flow through your site (let’s say a login page) but isn’t really valuable to a search engine, you could choose to place a noindex directive on that page, usually as a robots meta tag in the page’s head.
The search engine will still crawl the page and, unless you specify otherwise, will follow links placed on that page. However, in most cases, that page won’t be included in the search engine’s index.
noIndex is often the simplest solution for the category and tag pages which create the issues described earlier in this post.
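In practice, the tag usually looks like `<meta name="robots" content="noindex, follow">`. As a rough sketch, the check below reads a page’s HTML and reports whether it carries a noindex directive in a robots meta tag. The class and function names and the sample login page are assumptions for illustration, and the check ignores other places a directive can live, such as HTTP response headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def is_noindexed(html: str) -> bool:
    """True when the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical login page that should stay out of the index.
LOGIN_PAGE = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body>Log in</body></html>"
)
```

Running `is_noindexed(LOGIN_PAGE)` returns True, while a page without the tag returns False.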
When To Use Robots.txt
The robots.txt file on your server contains instructions for robots (or crawlers, bots, spiders) about how they should and shouldn’t access your site.
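Placing a disallow instruction in robots.txt for specific pages, whole folders or an entire site can be a solution in some circumstances.
If you’re building a new website on a temporary domain, it’s often advisable to disallow all bots from crawling the site, so that none of the content is crawled before you’re ready to go live. Do remember to update robots.txt at launch! Bear in mind that a disallow rule blocks crawling rather than indexing: a blocked URL can still appear in search results if other sites link to it, and a crawler that cannot fetch a page will never see a noindex tag placed on it.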
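Before you rely on a robots.txt rule, it’s worth checking that it blocks what you think it blocks. The sketch below uses Python’s standard-library robots.txt parser to test a disallow-all file of the kind you might use on a temporary domain. The staging.example.com hostname, the constant names and the second rule set are assumptions for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Blocks every well-behaved bot from the whole site.
STAGING_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Blocks only the tag archives and leaves everything else crawlable.
TAG_ARCHIVE_ROBOTS_TXT = """\
User-agent: *
Disallow: /tag/
"""


def can_crawl(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True when the rules allow the user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = can_crawl(STAGING_ROBOTS_TXT, "https://staging.example.com/about")
allowed = can_crawl(TAG_ARCHIVE_ROBOTS_TXT, "https://staging.example.com/blog/my-post")
```

With these rules, `blocked` is False (the staging site is off limits) and `allowed` is True (only URLs under /tag/ are disallowed).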
When To Use Canonical Links
When your user experience or user journey demands “duplicate content” such as sort and filter pages, rel=canonical is usually the right solution.
A canonical link essentially tells the search engine which version of a page you consider the preferred one. The bot still crawls the page and follows its links but, in most cases, should favour the preferred page you specified in your canonical link.
The sort version of a page (?sort=price) is a perfect example of this. There’s no value to a search engine in that version of the page – the one you’d prefer to rank is the category page itself. Adding the canonical link should make that clearer to the search engine and help your rankings.
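As a sketch of how a template might work out the canonical link for a sort or filter page, the code below drops the presentation-only parameters from the URL and builds the tag for the page’s head. Treating sort and brand as the parameters to drop is an assumption based on the example URL above; your platform’s parameter names will differ, and the function names are made up for illustration.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters only reorder or filter the category listing.
PRESENTATION_PARAMS = {"sort", "brand"}


def canonical_url(url: str) -> str:
    """Remove presentation-only parameters, keeping any others in place."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in PRESENTATION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> tag for the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'


tag = canonical_link_tag("https://domain.com/my-category?sort=price&brand=7")
# '<link rel="canonical" href="https://domain.com/my-category">'
```

The same tag goes on the category page itself and on every sorted or filtered version of it, so they all point at one preferred URL.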
Fix the Content
If you do want your page to be indexed and the content is very similar to other pages, it’s time to roll up your sleeves and create some deeper, unique content.
In the example of a business with multiple locations and a page on its site for each, this often creates duplicate content issues. The only unique content is the address!
The quickest, simplest fix for this is to include unique content on each location page:
– directions from the local train station, bus routes or nearby towns
– key members of staff, perhaps the manager
– customer testimonials specific to that branch
When to Use Redirect
When you have duplicate content which is not necessary to the user experience and not valuable to a search engine, use a 301 redirect.
This is usually the case for non-www and www versions of the site and HTTP or HTTPS versions.
Decide which version you prefer and (have your developer) create a blanket redirect for all page requests to that version.
For example, if you prefer HTTPS and www, all HTTP page requests will redirect to the HTTPS version and all non-www requests will redirect to the www version. On an Apache server, the redirect is implemented in your .htaccess file.
Note: .htaccess is the method on an Apache server. If you’re running a Windows server, you’ll need a different solution, so talk to your developer or web host.
A 301 redirect is also the solution for removed pages or old content which has been replaced. This guide to 301 redirects explains in much more detail.
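If the redirect is temporary, use a 302 redirect until there’s a permanent home, but avoid using 302 redirects where possible. A 301 tells the search engine the move is permanent, so link equity is consolidated on the new URL (good for rankings), whereas a 302 tells it the move is temporary, so it may keep the original URL in its index (not good for rankings if the move is in fact permanent).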
What About Hreflang?
Finally, but rarely needed for most websites, there’s hreflang.
Use hreflang to specify duplicate content which is necessary for different languages or countries.
This works similarly to a canonical link, signposting to the search engine that this is one version of a set of pages, each aimed at a different language or country.
The search engine should index all pages and serve the most relevant one to the audience you intend – the French version is served to a French language user, the UK English version to an English speaker in the UK and the US English version to an English speaker in the US.
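As a rough sketch, the snippet below builds the hreflang link tags for the three versions in that example. The URLs and the function name are assumptions for illustration; the language codes follow the usual language or language-region pattern (fr, en-gb, en-us). Each page in the set would normally carry the full list of tags, including one pointing at itself.

```python
from html import escape

# Hypothetical URLs for the three versions described above.
ALTERNATES = {
    "fr": "https://www.domain.com/fr/",
    "en-gb": "https://www.domain.com/uk/",
    "en-us": "https://www.domain.com/us/",
}


def hreflang_tags(alternates: dict[str, str]) -> list[str]:
    """Build one <link rel="alternate"> tag per language or country version."""
    return [
        f'<link rel="alternate" hreflang="{escape(code, quote=True)}" '
        f'href="{escape(url, quote=True)}">'
        for code, url in alternates.items()
    ]


tags = hreflang_tags(ALTERNATES)
# First entry:
# '<link rel="alternate" hreflang="fr" href="https://www.domain.com/fr/">'
```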
Still feeling confused about duplicate content or how to resolve it? Still unsure which fix to implement? Feel free to ask for our advice in the comments of this post.