Algo Crack -- Analyzing how Google ranks sites - And taking advantage of it - A FoundFirst site.

September 24, 2007

How Does Copyscape find Plagiarism?

Copyscape is the leader in plagiarism detection on the Web. It keeps a Web index almost as extensive as Google's, and users can compare any text against its index for free. Plagiarism is instantly flagged.

Note that Copyscape and Google detect content duplication, not necessarily plagiarism. A judge in an intellectual property court would not necessarily agree with the algorithms used by Copyscape or Google. A court would apply more subjective and often blurry criteria.

We set out to determine how the Copyscape algorithm works. This was necessary for our web content generation operations, which tend to be more or less automated. Unlike this article, which took 7 days to research and 2 days to write, our standard web content is generated in minutes and by the kilogram.

Beating Copyscape is not exactly the same as avoiding plagiarism, but it takes care of the main enemy of Web content mass-producers. If Copyscape does not detect your content duplication, chances are neither Google nor the original owner will.

How the experiments were run

We took indexed texts from the Web and modified them, one at a time. We took strings of different lengths and substituted synonyms for common words, either manually or with synonymizer software.

There were two substitution modes: global, where 15 to 35% of the words in a long text were replaced, and short-range, where one word in every 3 to 8 was replaced.
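To make the "15 to 35% of its words" figure concrete, here is a minimal sketch of how the replacement rate of an edited text could be measured. It assumes one-for-one word substitution, so both texts have the same word count. The function name `replacement_rate` is ours and is not part of any tool used in the experiments.

```python
def replacement_rate(original: str, edited: str) -> float:
    """Fraction of word positions that differ between two texts.

    Assumes each replaced word was swapped for exactly one other word,
    so both texts split into the same number of words.
    """
    a = original.split()
    b = edited.split()
    if len(a) != len(b):
        raise ValueError("texts must have the same number of words")
    if not a:
        return 0.0
    changed = sum(1 for x, y in zip(a, b) if x != y)
    return changed / len(a)


# One word out of four differs, so the rate is 0.25.
rate = replacement_rate("the quick brown fox", "the fast brown fox")
```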

The edited texts were uploaded to a server for Copyscape testing. We overcame the free service's limit of 10 tests per month by switching domains.

Copyscape watches short text strings

We used Phrase Mixer, one of the synonymizer software's features, to see whether Copyscape is fooled by phrase scrambling. It is not.

Synonym replacement is a good alternative because it does not destroy the meaning of the phrase. However, any word will do. Copyscape does not discriminate between a meaningful replacement and a senseless one. Anything that prevents a 4-6 word text string from being an exact duplicate of another prevents detection. You can use any word or letter, but punctuation marks will not do the trick.
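Copyscape's actual algorithm is not public, but the behavior we observed is consistent with matching on short runs of consecutive words (often called shingles) after punctuation has been stripped. The following is a minimal sketch of such a matcher, not Copyscape's implementation. The window size of 5 and the names `words`, `shingles` and `shared_shingles` are our own assumptions.

```python
import re


def words(text: str) -> list[str]:
    # Lowercase and keep only runs of letters, digits and apostrophes,
    # so punctuation never affects the comparison.
    return re.findall(r"[a-z0-9']+", text.lower())


def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    # Every run of n consecutive words in the text.
    w = words(text)
    return {tuple(w[i:i + n]) for i in range(len(w) - n + 1)}


def shared_shingles(source: str, candidate: str, n: int = 5) -> set[tuple[str, ...]]:
    # Word runs that appear verbatim in both texts.
    return shingles(source, n) & shingles(candidate, n)


source = "the quick brown fox jumps over the lazy dog"

# Punctuation changes only: every 5-word run still matches.
punctuated = shared_shingles(source, "The quick, brown fox; jumps over the lazy dog!")

# Phrases reordered: the run inside the moved phrase still matches.
scrambled = shared_shingles(source, "jumps over the lazy dog, the quick brown fox")

# One word in every three replaced: no 5-word run survives.
replaced = shared_shingles(source, "the quick red fox jumps past the lazy cat")
```

Under this model, scrambling whole phrases leaves the word runs inside each phrase intact, which would explain why Phrase Mixer did not help, while a single changed word breaks every window that contains it, whether or not the replacement makes sense.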

It is even possible to insert a small black 'i' on an almost black background every 3 words in a long plagiarized text, and Copyscape will not notice it. This could be a problem with Google, because that technique has been used in the past by SEOs for keyword stuffing and is now against the Google TOS. However, if the colors are not identical there should not be a problem.

Numbers can be used to mask the identity of a text, replacing the letter 'i' with the number '1', 'o' with '0', 'G' with '9', 'g' with '6' or even 's' with '5'. But keep in mind that you still need to replace one word in every 3-4.

Short vs. long text

Copyscape starts looking for plagiarism in texts longer than 14 words.

If the duplicated text is less than 70% of the total document, you will need 1 in every 6 words replaced. If your duplicated (borrowed, stolen, pirated, plagiarized) text is over 70% of the total web page, you will need 1 in every 4 words replaced.
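The link between replacement spacing and the 4-6 word window can be checked with simple arithmetic: if one word in every k is replaced at regular intervals, the longest untouched run is k - 1 words, so an n-word window can only survive when k - 1 >= n. The sketch below illustrates this on placeholder text. The helper names and the "X" marker are hypothetical, and regular spacing is an assumption.

```python
def replace_every_kth(tokens: list[str], k: int, marker: str = "X") -> list[str]:
    # Replace every k-th word (positions k, 2k, 3k, ... counting from 1).
    return [marker if (i + 1) % k == 0 else t for i, t in enumerate(tokens)]


def longest_unchanged_run(original: list[str], edited: list[str]) -> int:
    # Length of the longest stretch of consecutive words left untouched.
    longest = current = 0
    for a, b in zip(original, edited):
        current = current + 1 if a == b else 0
        longest = max(longest, current)
    return longest


tokens = [f"w{i}" for i in range(24)]  # placeholder text, 24 distinct words

run_k4 = longest_unchanged_run(tokens, replace_every_kth(tokens, 4))
run_k6 = longest_unchanged_run(tokens, replace_every_kth(tokens, 6))
```

With 1 in 4 replaced the longest intact run is 3 words, and with 1 in 6 it is 5 words, which lines up with the 4-6 word range reported above. It suggests, though it does not prove, that the effective window grows as the duplicated share of the page shrinks.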

More experiments to follow...