Duplicate content
Duplicate content is loosely defined as being several copies of the same content in different parts of the web. It causes a problem when search engines spider all the copies, but only want to include a single copy in the search engine index to maintain relevancy. Generally, all other copies of this ‘duplicate’ content are ignored in the search results, bunged into the supplemental index, or perhaps not even indexed at all.
Doing the right thing
Fundamentally, search engines want to do the right thing and let the original author rank in the search engines for the content they wrote. We have just said that only one copy of the content can rank in the search engines, and search engines probably want this to be the original.
When you search for something, and some Wikipedia content appears in the search results, you want to see the copy from wikipedia.org, not one of the thousands of copies of the content on various rubbish scraper sites.
I believe that search engines want to give the search result / traffic to the original author of the content.
Determining the original author of the content
Consider the following 4 copies of the same article on different domains. How is Google to know which is the original copy?
[Image: Duplicate content example]
The above example shows 4 copies of the same content. The date indicates when the content was first indexed, and the PageRank bar indicates, um, PageRank. Let's assume for this simplified example that PageRank is an accurate measure of link strength / domain authority / trust, etc. The smaller pages pointing to each larger page represent incoming links from other websites.
* Document 1 was first indexed a couple of weeks after the other copies, so as a search engine you might decide that this is not the original because it wasn’t published first.
* Documents 2 and 3 have the same good PR, were first indexed about the same time, and have the same number of incoming links.
* Document 4 was indexed slightly after documents 2 and 3, and it also has less PR and fewer links, so as a search engine you might conclude this is not the best copy to list either.
As a search engine, we are stuck deciding between document 2 and document 3 as to which is the original / best copy to list. At this point, Google is likely to take its best guess and leave it at that, which will see the original author "penalised" on many occasions.
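To make that reasoning concrete, here is a rough sketch of the tie-break in code. The documents, dates, PageRank values and link counts are all invented for illustration, and the scoring is a deliberate simplification of the signals described above, not a claim about how Google actually does it.

```python
from datetime import date

# Hypothetical duplicate copies of one article, with the signals discussed
# above: first-indexed date, PageRank, and number of incoming links.
copies = [
    {"doc": 1, "first_indexed": date(2007, 3, 20), "pagerank": 4, "inbound_links": 3},
    {"doc": 2, "first_indexed": date(2007, 3, 5),  "pagerank": 6, "inbound_links": 5},
    {"doc": 3, "first_indexed": date(2007, 3, 6),  "pagerank": 6, "inbound_links": 5},
    {"doc": 4, "first_indexed": date(2007, 3, 8),  "pagerank": 2, "inbound_links": 1},
]

def likely_originals(copies):
    """Drop copies indexed well after the earliest one, then keep whichever
    of the rest score highest on PageRank and incoming links. Anything
    still tied after that is returned as-is."""
    earliest = min(c["first_indexed"] for c in copies)
    recent_enough = [c for c in copies
                     if (c["first_indexed"] - earliest).days <= 7]
    best = max((c["pagerank"], c["inbound_links"]) for c in recent_enough)
    return [c["doc"] for c in recent_enough
            if (c["pagerank"], c["inbound_links"]) == best]

print(likely_originals(copies))  # [2, 3]: documents 2 and 3 are still tied
```

Run against the example, the first-indexed date rules out document 1, PR and link count rule out document 4, and documents 2 and 3 remain deadlocked.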
Enter the cheesy scraper sites
Let's recycle that same example, but this time we are going to add an "author credit" link to the bottom of document 4, pointing back to document 2, the page the content was scraped from. Document 4 could be considered a cheesy, low-PR, low-value scraper site, but one that was kind enough to provide a link back to the original document.
[Image: Duplicate content example 2]
All of a sudden, there is a crystal-clear signal to the search engines that document 2 is the original.
When there is a collection of identical pages out there on the web and it's hard to decide who the author is, it's likely that search engines look at how those copies link to each other and use that data to determine the original.
All other things being equal, this seems like a logical assumption to make.
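Continuing the same made-up example, here is a sketch of that tie-break. The only extra input is which copies link to which other copies, and the single author-credit link from document 4 is enough to settle it. Again, this is an illustration of the idea, not a description of what any search engine actually runs.

```python
# Hypothetical links *between* the duplicate copies themselves.
# Document 4 (the scraper) carries an author-credit link to document 2;
# none of the other copies link to one another.
cross_links = {4: [2]}

def break_tie(tied_docs, cross_links):
    """Among the tied copies, prefer the one that other copies of the same
    content point at; those links act as votes for the likely source."""
    votes = {doc: 0 for doc in tied_docs}
    for source, targets in cross_links.items():
        for target in targets:
            if target in votes and source != target:
                votes[target] += 1
    return max(votes, key=votes.get)

print(break_tie([2, 3], cross_links))  # 2, the copy the scraper credits
```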
Duplicating your own content
So, if you know your content is being duplicated on scraper sites, I'm saying you can avoid being penalised by making sure some of the scrapers provide a link back to your original document.
If none of the scrapers are polite enough to do this, then I’m suggesting you should create your own scraper site, scrape your own content, and provide a link back to yourself.
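If you did go down that road, the syndicated copy really only needs one extra element: a link back to your canonical URL. A minimal sketch of what generating such a copy might look like, with the URL, post and helper name invented purely for illustration:

```python
# Hypothetical helper that wraps a syndicated copy of a post with an
# author-credit link pointing back at the canonical URL.
def syndicated_copy(title, body_html, original_url):
    credit = ('<p>Originally published at '
              f'<a href="{original_url}">{original_url}</a></p>')
    return f"<h1>{title}</h1>\n{body_html}\n{credit}"

# Made-up post and URL, purely for illustration.
print(syndicated_copy(
    "Duplicate content",
    "<p>Duplicate content is loosely defined as ...</p>",
    "https://example.com/duplicate-content",
))
```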
As RSS feeds become more popular, and content is recycled all over the web, this problem is only going to get worse.
Disclaimer: I have yet to back this up with any real testing, so don't blame me if you duplicate your own website and find yourself having duplicate content problems. I wouldn't even consider this tactic unless you were having problems with high-PR sites scraping your RSS feed.