Google indexation assigns 3-4 keywords per page
How Google indexes pages: what the site: and [word] operators tell us about indexation depth.
This is the 3rd article in the Googleometry Project.
Search Engine Saturation
One of the problems we found while searching for Partial Penalization Indicators was that SERPs show at most 1000 results, so saturation above 1000 pages cannot be measured directly.
To work around this limitation, we tried narrowing the scope of the searched pages inside a domain.
Search operators can be combined: site:mydomain.com can be followed by any word that must appear in the results.
So, if we search for:
site:domaingrower.com
we obtain 387 indexed pages within that domain.
But if we add:
site:domaingrower.com the
we only obtain 2 pages.
Similar low results are obtained when we search for other very common words.
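For readers who want to repeat the comparison on their own domains, here is a minimal sketch of how the two queries can be built. The helper names site_query and search_url are hypothetical, and the URL shape is an assumption for illustration; the code only builds strings and does not contact Google.

```python
from urllib.parse import quote_plus


def site_query(domain: str, word: str = "") -> str:
    """Build a query such as 'site:example.com the' (hypothetical helper)."""
    query = f"site:{domain}"
    if word:
        # The extra word must be found in every returned page.
        query += f" {word}"
    return query


def search_url(query: str) -> str:
    """URL-encode the query; the base URL is an assumption for illustration."""
    return "https://www.google.com/search?q=" + quote_plus(query)


if __name__ == "__main__":
    for word in ("", "the"):
        q = site_query("domaingrower.com", word)
        print(q, "->", search_url(q))
```

The page counts themselves still have to be read from the results page by hand, as was done for the figures quoted in this article.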
If indexation were complete, that would make no sense: every page within that domain contains the word 'the' many times, yet those pages are not counted. What is the explanation for this finding?
Actually, I do not have a definitive answer. So far, experts agree that Google distinguishes 2 types of pages: normal and supplementary. The supplementary pages appear after the message:
'In order to show you the most relevant results, we have omitted some entries very similar to the XX already displayed. If you like, you can repeat the search with the omitted results included'.
Supplementary pages are of lower quality and are rarely shown in normal searches.
Now, it seems that there is an intermediate class of pages: pages that are not supplemental, but never appear when searching with the combined operators:
site: [word]
Let's see an example.
Searching:
site:ebay.com
produces 121 million pages.
But searching:
site:ebay.com the
produces only 4 million pages containing the word 'the'.
It is a significant reduction, and it occurs with almost any site.
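To put numbers on the reduction, the short sketch below computes the share of indexed pages that the site: [word] query still returns, using only the counts quoted in this article. The function name retrievable_share is hypothetical.

```python
def retrievable_share(total_pages: int, pages_with_word: int) -> float:
    """Fraction of the site: count that survives when a word is added."""
    return pages_with_word / total_pages


# (pages for site:domain, pages for site:domain the), as quoted in the article
counts = {
    "domaingrower.com": (387, 2),
    "ebay.com": (121_000_000, 4_000_000),
}

for domain, (total, with_word) in counts.items():
    share = retrievable_share(total, with_word)
    excluded = total - with_word
    print(f"{domain}: {share:.1%} retrievable, {excluded:,} excluded")

# domaingrower.com: 0.5% retrievable, 385 excluded
# ebay.com: 3.3% retrievable, 117,000,000 excluded
```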
After analyzing 20 domains, I believe that not all of the 117 million excluded pages are supplemental.
I propose a third kind of page, named 'non-keyword, non-supplemental' site pages. They cannot be retrieved by the site: [word] command.
To explain the results shown in the Excel file linked below, I propose:
Google indexation has several sets of pages in a site, with different indexation power.
For each domain, Google assigns only a few keywords to the majority of pages, and keywords are always uncommon, long words.
Short words, like adverbs, articles, prepositions and pronouns, are considered non-indexable 'stop words'.
It seems that after the main long keywords are chosen, if there is room for another one, Google chooses a common, frequent word, like 'a' or 'the'. This would account for the low number of pages indexed under those keywords.
The Google experimental data that support these conclusions are published here.
The extensive indexation data collected by this method helps us understand how far the spider went into our site, and how much indexation strength was assigned to each keyword.
I obtained a few more useful conclusions from this data, which are applied in my current SEO Service. Ask about it. Check our free online SEO cost calculator.