September 24, 2007
How Does Copyscape find Plagiarism?
Copyscape is the leader in plagiarism detection on the Web. It keeps a Web index almost as extensive as Google's, and users can compare any text against its index for free. Plagiarism is instantly flagged.
Notice that Copyscape and Google detect content duplication, not necessarily plagiarism. A judge in an intellectual property court would not necessarily agree with the algorithms used by Copyscape or Google. A court would use more subjective and often blurry criteria.
We set out to determine how the Copyscape algorithm works. It was necessary for our web content generation operations, which tend to be more or less automated. Unlike this article, which took 7 days to research and 2 days to write, our standard web content is generated in minutes and by the kilogram.
Beating Copyscape is not exactly equal to avoiding plagiarism, but it takes care of the main enemy of Web content mass-producers. If Copyscape does not detect your content duplication, chances are neither Google nor the original owner will.
How the experiments were done
We took indexed texts from the Web and started to modify them, one at a time. We took strings of different lengths and substituted synonyms for common words, manually or with synonymizer software.
There were two substitution modes: global, where a long text had 15 to 35% of its words replaced, and short-ranged, where one in every 3-8 words was replaced.
The edited texts were uploaded to a server for Copyscape testing. We overcame the free service limitation of 10 tests per month by switching domains.
Copyscape watches short text strings
We used Phrase Mixer, one of the features of the synonymizer software, to see if Copyscape is fooled by phrase scrambling. It is not.
Synonym replacement is a good alternative because it does not destroy the meaning of the phrase. However, any word will do. Copyscape does not discriminate between a meaningful replacement and a senseless one. Anything that prevents a 4-6 word text string from being an exact duplicate of another prevents it from being flagged. You can use any word or letter, but punctuation marks will not do the trick.
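The behavior described above is consistent with exact matching on short word windows, often called shingles. Copyscape's real algorithm is not public, so the following is only a minimal sketch of that general technique. The function names, the window size of 5 and the sample sentences are our own assumptions.

```python
import re

def shingles(text, size=5):
    """Return the set of lowercase word windows of the given size."""
    # Punctuation is dropped before windowing, which would explain why
    # punctuation changes alone do not alter a match.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def shared_shingles(original, candidate, size=5):
    """Return the word windows that appear in both texts."""
    return shingles(original, size) & shingles(candidate, size)

# Hypothetical sample texts, for illustration only.
original = "the quick brown fox jumps over the lazy dog near the river bank"
copy_with_commas = "the quick, brown fox jumps over the lazy dog; near the river bank"
edited = "the quick brown cat jumps over the lazy cow near the river side"

print(len(shared_shingles(original, copy_with_commas)))  # every window still matches
print(len(shared_shingles(original, edited)))            # no window survives intact
```

In this toy model a window is either identical or it is not, so a meaningful synonym and a senseless word have the same effect on the match count.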
It is even possible to insert a small black 'i' on an almost black background every 3 words in a long plagiarized text, and Copyscape will not notice it. There could be a problem with Google, because that technique has been used in the past by SEOs for keyword-stuffing purposes, and it is now against the Google TOS. However, if the colors are not identical there should not be a problem.
Numbers can be used to mask the identity of a text, replacing the letter 'i' with the number '1', 'o' with '0', 'G' with '9', 'g' with '6' or even 's' with '5'. But keep in mind that you still need to replace one in every 3-4 words.
Short vs. long text
Copyscape starts looking for plagiarism in texts longer than 14 words.
If the duplicated text is less than 70% of the total document, you will need 1 in every 6 words replaced. If your duplicated (borrowed, stolen, pirated, plagiarized) text is over 70% of the total web page, you will need 1 in every 4 words replaced.
More experiments to follow...
September 9, 2007
Web Promotion Experiments
Having been in the Web promotion field since 2000 or earlier, I receive lots of promotions for new products. Every one of them promises "Instant google ranking", "Fast money making", "Be First or your money back", "Overnight Targeted Traffic" and the like. However, most of them fail to live up to the big expectations.
Many times I feel tempted to try them, and I do some exploring in the SEO and SE marketing forums. Most experts vote against the magic products, but few of them have actually tried them.
I did not find any dedicated "Promotion Tool Testing Lab" that will buy the product, try it, and inform the subscribed audience about the results.
The main condition for impartiality should be that the Lab is not affiliated with the product. The raw results should also be made public.
I proposed an Experimenter Meeting Point in an earlier posting here, and I am running several SEO experiments with our dedicated Algo Cracker tool. We have some interesting raw data and preliminary findings, partly described on Algo Cracker.
I now have a concrete experiment to perform. I intend to buy a product named Atomic Blogging, http://www.atomicblogging.com/. This product, with excellent graphic design and strong marketing talk, offers 'a solution to post articles on any blogs and instantly update them with all major Web 2.0 sites' (exclamation marks removed). Apparently it is a complete compilation of all the sites where you can ping your news and publish a short notice about your blog or latest blog posting.
It costs about 50 dollars, which is not such a large amount. However, I would say 95% of these products are scams, and chances are you will waste your money and 5 hours of your time. For that reason, I want to do the experiment with your help, and report the result to those who helped me cover the costs.
I will write a full description of how this product works, show my own promotion experiments and draw the conclusions.
This report will cost USD 5 if ordered before I buy the product, with a 20-day delay until the results are obtained, or ten times that price afterwards.
Check Webprom Testing Labs for the prices, description and status.
Write us at Domain Grower contact if you are interested.
September 4, 2007
A statistical approach to SEO
When trying to work out the site ranking logic used by search engines, we tried several tools. For keyword density, market leader IBP analyzes the top ten ranked sites for a given keyword. By comparing the top ranked sites with the site to be ranked, the webmaster obtains useful data.
IBP analyzes keyword density in all parts of a web page: metatags, alt text, body text, header text from H1 to H6, anchor text and other parts.
However, we felt that this was cumbersome, restrictive and not very effective. While some factors pointed out by IBP are true, there are several that are not. Or maybe they were true at some time, but Google and Yahoo changed their algorithms to keep copycat sites out of their top results (SERPs).
Our approach was to create a similar product that would consider not only the top ten ranked sites, but all of them. This means 1000 sites for Google, because beyond that it is tricky (but not impossible) to get SERPs.
The huge amount of data obtained in that way needed a statistical tool in order to produce useful information. Thus, we prepared our tool to take page ranges and calculate averages. Pages ranked in SERPs 1-10 are compared to 20-30, 40-50 and so on. In that way, we can find those keyword densities that correlate with rank.
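The range-averaging step can be sketched as follows. This is not the actual Algo Cracker code. It is a minimal illustration that assumes each page is available as a (rank, text) pair and that density is simply keyword occurrences divided by total words. All names and sample pages are hypothetical.

```python
from statistics import mean

def keyword_density(text, keyword):
    """Fraction of the words in text that equal the keyword."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

def average_by_rank_range(pages, keyword, ranges):
    """Average the keyword density of the pages in each rank range.

    pages  -- list of (rank, text) pairs
    ranges -- list of (first, last) rank bounds, inclusive
    """
    result = {}
    for first, last in ranges:
        densities = [keyword_density(text, keyword)
                     for rank, text in pages if first <= rank <= last]
        # None marks a range for which no pages were collected.
        result[(first, last)] = mean(densities) if densities else None
    return result

# Hypothetical sample data, for illustration only.
pages = [
    (1, "seo tips seo tools"),
    (5, "seo guide for beginners"),
    (25, "cooking tips and seo"),
    (28, "travel blog about food"),
]
averages = average_by_rank_range(pages, "seo", [(1, 10), (20, 30), (40, 50)])
for rank_range, value in averages.items():
    print(rank_range, value)
```

Comparing the averages across ranges is what suggests whether a given density goes together with better rank. With real data, the same function would be run once per page part (title, body, metatags and so on).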
Our tool answers questions such as:
- are keywords in the URL important for ranking?
- in the domain?
- in the subdomain?
- in the page name?
- in the different metatags?
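To make the URL questions above concrete, here is a small sketch that reports where a keyword appears in a URL. It is an illustration only. The function name is ours, and the domain split naively assumes that the last two host labels form the registered domain, which is wrong for suffixes such as .com.ar or .co.uk.

```python
from urllib.parse import urlparse

def keyword_locations(url, keyword):
    """Report in which parts of a URL the keyword appears."""
    parsed = urlparse(url)
    host_labels = (parsed.hostname or "").split(".")
    # Naive assumption: the last two labels are the registered domain.
    domain = ".".join(host_labels[-2:])
    subdomain = ".".join(host_labels[:-2])
    path_parts = [part for part in parsed.path.split("/") if part]
    # Treat the last path segment as the page name only if it has an extension.
    page_name = path_parts[-1] if path_parts and "." in path_parts[-1] else ""
    directory = "/".join(path_parts[:-1] if page_name else path_parts)
    kw = keyword.lower()
    return {
        "domain": kw in domain.lower(),
        "subdomain": kw in subdomain.lower(),
        "directory": kw in directory.lower(),
        "page_name": kw in page_name.lower(),
    }

# Hypothetical URL, for illustration only.
print(keyword_locations("http://blog.example.com/seo/seo-tips.html", "seo"))
```

Each boolean can then be averaged per rank range, in the same way as the keyword densities.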
Other interesting questions that can be answered with Algo Cracker:
- Which are the optimal keyword densities for the different parts of the web page?
- Does Google penalize/favor sites that use non-standard TLDs?
- Does Google penalize/favor sites that use any particular string of text or code, like .php, .css, Javascript?
The results are beginning to flow. We now have hard data to prove or discard SEO myths or facts, and we are ready to discuss them with other SEO professionals.
We can offer custom raw data to those who have the math ability to extract useful conclusions from them and share them with us.
We also intend to publish some Excel data in this blog, for those who see money at the other end of this thread...
Check Algo Cracker.
August 24, 2007
Our Algo Cracker already brings useful data
Our Beta version of Algo Cracker is already spidering the Web and bringing us useful data.
We are able to select differently ranked sets of pages from Google, analyze them, average keyword density values, and show them in Excel format.
After checking keyword densities in different parts of the pages (domain, subdomain, directory, file name, metatags, page body) we know the optimal values associated with top rankings, medium or lousy ones.
We also drew some conclusions about where keywords are important and where not.
Stay tuned.
August 17, 2007
Experimenter Meeting Point
Experiments are usually hard to do without a significant structure. Even though simple experiments can be performed by an individual, significant results are only obtained by qualified individuals or teams, with good planning, coordination and result analysis.
A successful experiment needs a good team, and assembling one is complicated. Experts are not always available for hiring, and a new approach, as is often needed, might require a whole new set of specialized technicians or workers.
Measuring the quality and appeal of the main idea is paramount for success. If the idea does not appeal to enough powerful people, the experiment dies before maturity.
I propose a Web platform for Experiments, where the promoter will define a project, the requirements, the cost and the expected results. If enough people sign up for participation, the experiment is born, and after the predefined time, the results are published.
The sponsors can provide funding, expertise, or simply approval votes. Votes help the experiment climb to the top of the list. Successful experiments provide good rankings to those who voted for the experiment. They also provide their full results to those who actually participated in the experiment, and a share of the profits, if that was part of the agreement.
Just as some people play in virtual Wall Street sites, buying and selling virtual shares, others will vote for their favorite experiments and help them be born.
Experiments can be of all sorts: technical, commercial, scientific, Web, medical, financial.
This data will be included for every new experiment:
Name - Category - Description - Requirements - Timeframe - Short Result (public) - Long result (private).
As the e-Xperiment is born, new fields are added:
Voters - Contributors - Contributions - Feedback
When the experiment is finished, the Short Result is published on the site, and those Voters who guessed the outcome are mentioned and see their ranking improve. At the same time, the contributors receive the Full Report, and maybe a partnership is born in order to further develop the e-Xperiment.
I recently read an article about the revolutionary news site, digg.com, being developed for only $200.
The person who conceived the idea found a developer in a programming marketplace. This is an example of a successful experiment, like many others on the Web. However, the list of failed Web experiments is so long that I could fill a book myself. (Actually, I already did. See my Cibernegocios (CyberBusiness) book, at http://cibernegocios.netocios.com).
I always wanted to set up a virtual laboratory to test new website promotion tools. Those tools come up very often, and they promise huge results with little cost and effort. A few of them are probably useful and worth their cost. But someone should test them first.
At this point I am not sure how expensive this e-Xperiment.com website will be. Probably under $5000, and it will be ready in 45 days. Those who participate in this First Experiment of the e-Xperimenter.com site, namely the creation of the site, will own 50% of it. Write me if you want to join in.
------
The closest approach to this idea is a News Aggregator, like Digg.com or Meneame.net, only for business ideas. See our own implementation, already live: Business-ideas.com.ar
Many ideas there are the ones originally published here.
May 31, 2007
New Study on Google Ranking Factors
I read this nice article posted by L. Odden on May 22 in: http://www.toprankblog.com/2007/05/new-study-on-google-ranking-factors/
Which page elements offered the most influence on rankings
- Keywords in the title tag
- Targeted keywords in the body tag
- Keywords in H2-H6 headline tags seem to have an influence on the rankings while keywords in H1 headline tags don't seem to have an effect.
- Using keywords in bold or strong tags - slight effect
- Keywords in image file names
- Keywords in image alt attributes
- Keyword in the domain name - although, using domain names as link text may explain this
- Web pages that use very few parameters in the URL (?id=123, etc.)
- PageRank
- Inbound links - The top result on Google has usually about four times as many links as result number 11.
Additional notes:
Keywords in the file name don't seem to have a positive effect.
The file size doesn't seem to influence the ranking of a web page on Google although smaller sites tend to have slightly higher rankings.
We will keep this in mind. It mostly agrees with our previous insight and experiments.
May 23, 2007
Should I have won the 21st. Century Journalism Contest?: Validated News Language
Simple News Content Description
News in the near future will be obtained from a Web MultiDimensional Map where news can be located. For that, we need a Standard Hyper News Description Language, in XML format.
The MultiDimensional aspect of the news map can currently be expressed with dynamic symbols.
The proposed language refers to the content of the News, not to its publishing format. A News Publishing Format Standard already exists, and covers a few aspects of news publication.
You, as a Human News Reader, will open the browser and you will see your Personal News Map, with color symbols representing News Variables.
You will navigate to the News you want by applying a Filter Set to the News Variables. The Filter Set will be easily saved in the News Browser, and can be shared with other users.
Validated News Descriptors
Some Variables need Validation. For instance, when Source is proclaimed, an ID code can be necessary. When Support Material is available, a web address should keep the materials, or specify how to obtain them.
If a news item claims Public Domain Status, someone will need to validate that Status, at least by doing searches in the proper databases.
Each Validation-required Variable will need a special method for proper Online Validation. Agencies, Journalists or other News Producers will need access codes to the Validation mechanisms.
These are some proposed variables or fields:
Category | Description | Example | Filter example | Validation
Source | Who is the author of the News unit. | Newspaper, news agency, company, individual, website | Check boxes for every accepted source | The source needs to be validated by a company official, website owner or trade association. Phone, email or address are required for some validations of a news source.
Geographical location | Use the mouse to assign levels of interest to several cities, countries or regions. | Street address, city, country or area | Only news from a certain area | Author
Time of occurrence | Timeframe for the fact: start, climax, end. Can be pinpointed or diffuse. Day, week, month, year. | Events occurring on a certain date or period, past, present or future | Look for weekend events, historical facts or next-year projections. | Author
Timeframe | Refers to the period during which the news will be current. Short-lived news are event announcements, weather reports, sports forecasts. Long-lived news are deep reports, opinion, editorials. | A meeting call will be current until its planned occurrence time. A forecast will be current until the fact actually occurs. | Urgent or Last Moment News are current for short spans. Analysis or deep reports are alive for a longer period. | Author
Credibility | Depends on the source. The credibility is established by history and voting from qualified referents. | The Wall Street Journal has 10 points, while False-News.com has 0 points. | Only news with high credibility. | Credibility needs to be certified by independent, registered entities.
Support material available | For news that have associated hard data: more photos, tapes, stats, signed declarations, serious sources or other | Unsubstantiated, documented, rumored, believed | Exclude or include rumors or unsupported news | Author
Interactivity | News about events where the reader can interact. | Movies, shows, rallies, voting, conferences, online forums or blogs | Show events for the weekend in my town. | Author
Reader feedback | Some news can be associated with surveys. Companies providing online survey mechanisms will need to adhere to standards for collecting and displaying information. | Weather data can be validated by local residents; artists can be rated by the public; crimes can be witnessed; politicians can be supported or discredited. | Reader feedback can be turned on or off. Light surveys can be accepted, while time-consuming ones can be filtered out. | Author
Likeliness | Weird, unusual, unpredictable news, as opposed to predictable news | A weather report or sport result is predictable or probable, while a crime is not | Look for unusual news, like a person biting a dog. | Author – Independent validator
Personal | When you look for news where the protagonist is an important component | Name, age, sex, national origin. | News from an artist, politician or neighbor | Author
Reader age | News is sometimes oriented to young or adult audiences | Children, young adults, adults, senior citizens. | Fantastic news, music events or interactive meetings are for the young, while credible, long-validity news are usually for the adults. | Author
Reader gender | News directed to specific audiences, by gender or sexual preference: male, female, gays. | Magazines oriented to women are a good example of women-oriented news. | Filtered for emphasis on male, female or gay news | Author
Matchmaking | When you look for persons or companies with compatible needs | Love, friendship, relationships, in dances, parties or bars | Filtered by area, time or personal criteria | Author
Commercial | When you look for commerce | Sales, garage sales, auctions, business appeals. | Look to buy, look to sell. | Some commercial news will need a tax ID
Language | The language in which the original news is written, or its accepted translations | English, Spanish, French. | Check boxes for every accepted language | Author
Subject | A thematic tree, like in the web directories | Technology – Computers – Internet – blogs | Assign a level of acceptance for most subjects: 10 is very interesting, 0 is I do not care. | Thematic trees need to be provided by a standard News Subject Directory
News type | About the news unit itself | Editorial, announcement, report, press release, infomercial, other | Check boxes for every accepted news type | Author
News format | About the news unit itself | Text, image, sound, video, music, software, website | Check boxes for every accepted news type | Formats will be standardized as much as possible
News rights | Intellectual property | Creative commons, copyrighted, public domain, other | Check boxes for every accepted news type | Some rights types have specific requirements, like a piece of HTML code.
Ideology | For ideologically biased or oriented news | Left, right, liberal, conservative, religious, ecological, others. | Check boxes for every accepted news type | Author – Independent critics
Originality | Some news has exclusive ownership, while others are widely known. | A news protagonist will offer an exclusive interview to a certain media, providing 100% originality. Press conferences with wide attendance have low originality. | Look for original stories, or look for any good story | Author
Price | Some news will require payment for complete access or reproduction rights. | Financial news, stock data and detailed meteorology data. | Allow news with less than a fixed price, or within a certain budget. | Author
Advertisement | News carrying obvious ads will be flagged. | News carrying an offer for paid extra info, ads for a book. | It will be necessary to disclose affiliation with the seller | Author – Independent critics
This is the first step towards a News Metatag Language.
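As a rough idea of what such a language could look like in practice, the sketch below builds one news descriptor in XML and applies a simple Filter Set to it. No standard is defined yet, so the element names (news, source, language, news_type, credibility) and the helper functions are purely hypothetical.

```python
import xml.etree.ElementTree as ET

def build_news_descriptor(fields):
    """Build a <news> element with one child element per variable."""
    news = ET.Element("news")
    for name, value in fields.items():
        child = ET.SubElement(news, name)
        child.text = str(value)
    return news

def matches(news, filter_set):
    """Return True if every filtered variable has an accepted value.

    filter_set maps a variable name to the set of values the reader accepts.
    """
    for name, accepted in filter_set.items():
        if news.findtext(name) not in accepted:
            return False
    return True

# Hypothetical news unit, for illustration only.
item = build_news_descriptor({
    "source": "news agency",
    "language": "English",
    "news_type": "report",
    "credibility": 8,
})
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)

# A saved Filter Set: accept English or Spanish reports only.
my_filter = {"language": {"English", "Spanish"}, "news_type": {"report"}}
print(matches(item, my_filter))
```

A real descriptor would also carry the validation data discussed above, for instance an identifier for whoever certified the credibility score.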
May 11, 2007
Assassinations for $2000 publicly announced
Have gun, will travel everywhere...
Reading free classified ads, I found a disturbing announcement for Inexpensive Hit Men. The ad links to a free hosting service in Spain, where the hosted page describes the offer in great detail. Free email accounts are provided for contact, and the publishers promise "we end with your enemies" and "we provide you the peace that you long for".

The website is in Spanish, and it offers "testimonials from satisfied clients in South America". I am within reach, so I am cautious about my writings.
The website alone does not mention killings, but the ads do.
I wrote to the webmaster of the hosting service, but I did not get an answer. Thus, I am writing this posting to see how the online community reacts to it.
I erased the contact data from the reproduced pages, but they are available for law enforcement officers.
I wonder where the responsibility of webmasters and search engines lies for a posting like this. The combined effect of a website, free ads and indexation creates a potentially deadly business.
As a webmaster, I often need time to clean spam from comments or forum postings. But I do not police all of them, and I risk hosting dangerous ads like this.
Is a Web Police the solution? A WebPolice website? A complaint to search engines? Please comment.

We also need a Web police to stop DDoS attacks, the spread of viruses, piracy, bad porn... But that is another story...

May 9, 2007
12 Tricks Directories play on naive Webmasters
Submitting sites to directories is the pain of every webmaster. It needs to be done, but it requires the patience of a Buddha and the wariness of a city fox.
Since it is the best way to improve search engine rankings and free traffic, we devote a lot of effort to it. But even the ultimate software bought for this purpose is easily deceived by most directories.
We always wondered who runs directories. At some point, only large companies did. Currently, any webmaster who takes the time to download a free script and install it on a low-cost shared server can run one. Thus, you need to distinguish the valuable directories from the 648 million results that appear when you google the single word 'directory'.
To become valuable, directories need links on the Web pointing to them, which is the same thing that you request from them: a free listing. So, most of them offer something they do not have: links.
The way to improve your search engine ranking is to get more incoming than outgoing links. To do this, you either buy links with money or creative effort, or you cheat naive customers.
We collected a few tricks commonly used by directories:
1- Unknown, unlisted, minimal or amateur directories, presented as very popular and influential ones.
2- Free but paid. Announced as free, but after a long form-filling process, they demand payment to continue.
3- Not delivering. They request a link and do not comply with publishing. They do this out of incompetence or calculated deceit, nobody knows. In our experience, 70% of those who request a link do not reciprocate after 2 weeks.

4- Partially delivering. They publish your link for a while, and when they expect you to be off-guard, they remove it.
5- Zero Page Rank link pages. It can be spontaneous, due to an orphan (unlinked) links page, or to other tricks.

6- No-follow attribute added to your links. You can tell if this is the case by checking the source HTML of the whole page, not only the code right before your link. The spider can be blocked by a no-follow code anywhere.
7- Your links are blocked by the robots.txt file.

8- Your links are not listed in the Google or Yahoo sitemap (usually sitemap.xml). Since Google Sitemap allows sitemaps with other names, only the webmaster knows which is valid.

9- Directories that request you to insert malicious Javascript in your site: popups, redirects, Trojans.
10- Directories that request you to insert a large banner in exchange for a small link, and eternal spam.

11- Directories that request you to pass a long, low-contrast captcha. Or those that do not provide a hint to tell lower case from upper case. Sometimes this wastes your time and prevents you from completing the submission.

12- Directories that delay 4-5 years publication for no reason, like Dmoz.org.
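The no-follow check in trick 6 can be automated. The following is a minimal sketch using only the Python standard library. It looks for a page-wide robots metatag and for a rel="nofollow" attribute on the link to your site. The class and function names are ours, and the sketch assumes the directory links to your exact URL.

```python
from html.parser import HTMLParser

class NofollowChecker(HTMLParser):
    """Record whether a link to target_url exists and whether it is nofollowed."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.page_nofollow = False   # <meta name="robots" content="...nofollow...">
        self.link_found = False
        self.link_nofollow = False   # rel="nofollow" on the link itself

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            if "nofollow" in (attrs.get("content") or "").lower():
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href") == self.target_url:
            self.link_found = True
            if "nofollow" in (attrs.get("rel") or "").lower().split():
                self.link_nofollow = True

def link_passes_value(html, target_url):
    """True only if the link exists and nothing on the page nofollows it."""
    checker = NofollowChecker(target_url)
    checker.feed(html)
    return checker.link_found and not (checker.page_nofollow or checker.link_nofollow)
```

To use it, download the directory page where your listing should appear and pass its HTML and your site URL to link_passes_value. A False result means the link is missing or carries a no-follow at the link or page level.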

Previous article
April 15, 2007
The science of submitting sites to directories
Once very simple (only Yahoo was worth submitting to), the science of submitting sites to directories has become increasingly complex. There are numerous software packages, sites and services that take care of (or promise) submitting your site to the millions of directories existing today.
Unfortunately, the promise of "Be listed in 10,000 directories with one click" is blatantly false. If you want to be listed in PageRank Zero directories, an invitation to being banned, go ahead. And do not forget to give them a real email, for them to spam you forever.
Real, valuable directories do not accept automatic submissions for 2 good reasons:
- they need quality data that no software provides or guarantees.
- they need a human with a credit card, to whom they can sell something
Thus, they all have Humanity Tests, like the annoying captchas, or an email that must be read and acted upon.
The directories use different methods to admit new sites: some require payment; some request a link exchange, which can mean better search engine (SE) ranking for them; some require an easy-to-spam email address. A few accept your data for free without compensation: those that need quality links to improve themselves.
This is a classification of directories, according to their admission requirements:
-Really Free
-Paid
-Legitimate, but Request link exchange
-Cheaters
Cheating Directories
So far, I list 8 types of Cheaters:
1-Says free and is not: After a long form-filling process, they deceitfully demand ransom for your time.
2-Requests link exchange and does not comply: they demand your publishing first, and they promise a reciprocal link that never comes.
In our experience, 70% of those who request a link do not reciprocate after 2 weeks.
3-Home page with high PageRank, indexing pages with zero PageRank: this is not necessarily cheating, unless the PageRank is manipulated to flow somewhere else. Most directories start from zero, get PageRank 3 in the home after the first spidering, and do not pass PageRank to their inner directories until they climb to 5-6 in the home page. However, if links to other pages are regularly incorporated in the home page before and more prominently (H1, H2, Bold) than the client listings, the PageRank flows to the privileged pages.
4-Adds no-follow attribute to your links: this is a dirty trick. Check for the code in the source HTML.
5-Adds exclusions to the listings in the robots.txt file.
6-Gives no or low priority in the Google sitemap (usually, but not always, sitemap.xml). Please notice that Google Sitemap allows sitemaps with other names, and only the webmaster knows which is the good one. So, it is easy to give low weight to the non-paying links.
7-Insert Javascript in the code that you need to publish: popups, redirects, Trojans
8-Delay 4-5 years publication for no reason: I am talking about Dmoz.org, of course.
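The robots.txt exclusion listed among the cheater types above is easy to test with the Python standard library. This is a minimal sketch. The function name, the default user agent and the sample robots.txt content are our own assumptions.

```python
from urllib.robotparser import RobotFileParser

def is_blocked(robots_txt, page_url, user_agent="Googlebot"):
    """Return True if robots_txt forbids user_agent from fetching page_url."""
    parser = RobotFileParser()
    # parse() takes the lines of an already downloaded robots.txt file.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, page_url)

# Hypothetical robots.txt of a directory that hides its listing pages.
robots_txt = "User-agent: *\nDisallow: /links/\n"

print(is_blocked(robots_txt, "http://directory.example/links/page3.html"))
print(is_blocked(robots_txt, "http://directory.example/index.html"))
```

If the page that carries your listing comes back as blocked, the spider never reads it and your link gives you nothing.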
The Ideal directories have:
-Fast navigation: finding the right category is the most time-consuming part of submission to directories.
-No hindrances: some captchas are 10 characters long, have non-letter characters, and are almost illegible because of strange fonts or low background contrast. What is wrong with only one character?
-Accept triangulation: Reciprocal links are no longer automatically accepted by Google as a sign of quality. Link triangulation or quadrangulation are better.
-Instant publication: ideally, a script checks your link, tests the reciprocal and publishes it.
-SEO-friendly: the directory shall publish your link in a spiderable location, with no restrictions.
-Thematic: there is no more room in the Web for generic directories. Who needs a better Yahoo or Google Directory? Instead, the web will always welcome a new thematic directory, complete and with specific tools for better quality results.
-Good PageRank
-Large number of indexed incoming links
-Listed sites receive some advantage: traffic or better ranking
Good Submission Tools:
The good submission tools, either desktop or web based, should have:
- good directory listing, attending the criteria defined above
- directory selection feature, allowing the site owner to submit only to those directories that fit his/her preferences and budget.
- semi-automated form-filling
- listing detection spider or checker
- tracking and reporting feature
Instructed Operator
An operator able to submit successfully to directories should be aware of the above items, plus have some knowledge of Web editing and FTP. He needs to be reliable, because you will give him access to your site user/password for the link exchange. In some cases, he will need a disposable email account on the same domain as the promoted site.
Even with a good submission tool, expect to spend 2 full-time weeks submitting to 1000 selected directories (one every 4 minutes), with link exchanges. If you have cheap offshore labor, that could amount to $300, plus $100 for one decent submitting tool.
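The estimate above can be checked with a few lines of arithmetic. The 40-hour working week is our assumption. The other figures come from the paragraph above.

```python
directories = 1000
minutes_each = 4
hours_per_week = 40      # assumption: one full-time week is 40 hours

total_hours = directories * minutes_each / 60
weeks = total_hours / hours_per_week

labor_cost = 300         # cheap offshore labor, in dollars
tool_cost = 100          # one decent submitting tool, in dollars
total_cost = labor_cost + tool_cost
cost_per_directory = total_cost / directories

print(round(total_hours, 1), "hours, or", round(weeks, 2), "full-time weeks")
print(total_cost, "dollars in total,", cost_per_directory, "dollars per directory")
```

The pure form-filling time comes out somewhat under two weeks, which leaves a margin for captchas, link exchanges and checking the listings.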
Provided all these aspects are taken into account, submitting to directories is still the best way to obtain incoming links to any site. Combined with appropriate SEO, (almost) every time it results in better SE ranking.
Our Directory Submissions Page
This blog had a very old platform and is being moved to
Web Promotion Service Blog