Inflated Statistics

The population of the London Metropolitan Area is just shy of 14 million.  Having lived in London I can believe this, simply because I have witnessed the vast numbers of people that populate the city.  The population of the town where I was born is 27,000 - I find this harder to believe, for a number of reasons.  First, I don't believe the infrastructure to support that many people exists.  Second, I know quite a few people there, and so do my family and close friends.  I know the theory of six degrees of separation, and I don't claim I could account for 27,000 people within a single degree, but I could have known the name of almost any passing person, or at least a friend would know it if I didn't.  The other reasons are the lack of large crowds and the relatively sparse level of housing.  While there are many housing estates, when you consider their number and occupancy it would be hard to push any population figure beyond, say, 5 to 9 thousand.

The population of the city I live in at the moment, when you take in the urban area, is approximately 24,000, and again I find this hard to believe, mostly for the same reasons as above.  These are just two examples of statistics whose accuracy I question.  The population of the UK overall is 62 million.  The population of Earth is supposedly 7 billion.  The wider you make the scope, the more seriously questioning the statistics is taken.  Question your local population figures and you'll probably be laughed at, but question the 7 billion figure for the world and people won't be so quick to laugh, as there are legitimate arguments against it: the figure is an approximation, and there has never been an all-encompassing "world census".  The question is whether your local population figures, and by extension the amalgamated figures that make up your country's overall population, are accurate.

Let's try something else.  If you Google 'anything' you will see "displaying results 1 to 20 of 2,420,000,000 results" - 2.4 billion results - really?  Now if you actually go beyond the first page of results - I know, welcome to a world rarely seen - you'll find there's a limit to how far Google will go.  For 'anything' Google let me go as far as page 77.

Now, forgetting the fact that page 77 has fewer than 20 results on it, if we multiply the 77 pages by the 20 results per page, you'll find that the most search results Google will actually present for the term is 1,540.  That figure is a far cry from 2.4 billion.  Now I don't doubt that there are a lot of web sites online, and I don't doubt that Google has indexed a lot of them.  I do doubt, however, whether Google really has cached 2.4 billion pages for this term.  That's a trivial example you can try yourself, and you can find lots of searches that will return even higher "results".
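To put that gap in proportion, here is a quick back-of-the-envelope sketch in Python.  The numbers are just the ones quoted above for 'anything', and the variable names are my own; swap in whatever figures your own search gives you.

```python
# Figures quoted above for the search term 'anything'.
claimed_results = 2_420_000_000  # the count shown at the top of the results
pages_available = 77             # how many pages of results could be reached
results_per_page = 20            # results shown on each page

# Upper bound on what can be viewed; the last page is not full,
# so the real number is slightly lower than this.
max_viewable = pages_available * results_per_page

# Fraction of the claimed results that can actually be paged through.
share = max_viewable / claimed_results

print(f"At most {max_viewable:,} results can be viewed")
print(f"That is {share:.7%} of the claimed {claimed_results:,}")
```

In other words, less than one in a million of the claimed results can ever be looked at.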

What is at question here is whether or not we can truly trust these "statistics", and whether they are accurate or vastly inflated.  Google uses a page rank system.  Its search results are initially like those of any other search engine - quite useless really.  Google re-orders its results based on the popularity of each result.  This is hard to demonstrate without going to great lengths.  The people who will have experienced this first hand are those who have built their own websites.  When you search for something that no-one else has searched for before, Google is rather useless.  If you start a new business with a unique company name and create a website, Googling your company name will return nothing for some time - even if you have a name that is a Googlewhack, a search term without quotes that will return only 1 result.  Your new customers have to find you through your content, not your name.  So if you were a company selling bicycles then people would have to search for your company name and a keyword related to cycling.  The more popular your site becomes, the quicker it will rise up Google's ranking and appear at the top.

The relevance this has to our discussion is that Google, at the end of the day, is not a search engine that presents billions of results for a given term; it presents results that other people have clicked for a given term.  To keep that momentum going, the search term and the result you clicked need to be fed back, and if one result eventually becomes more popular then it moves up.  Now for the question - if you only ever see the top 1,500 or so results for a given term, and they struggle amongst themselves for the top spot, how can you be sure that Google really has listed all the others beyond those 1,500?  The moment you add another single keyword to your search, you effectively have a new set of results, one that mixes the old set with the set for the new word.

Now before I leave you I want to give you two new terms to add to your computer vocabulary.  They are "Surface web" and "Deep web".  The Surface web is comprised of every website that search engines can reach.  The Deep web is comprised of every website that search engines cannot reach due to technical barriers - mainly log-in only sites which have not given search engines access, dynamically generated sites that present content on request, and so on.  There are methods to make these web sites available to search engines, but the majority do not use them.  The surface web is quite small in comparison to the deep web.  The terms were coined by Mike Bergman, as this New York Times article will explain: Exploring the Deep Web that Google can't grasp

According to WorldWideWebSize - a site which estimates the size of the web based on amalgamated data from Yahoo, Google and Bing - there are currently around 8 billion web pages.  WWWSize uses two metrics:

GYB - Google, Yahoo and Bing
YGB - Yahoo, Google and Bing

The first engine is used to retrieve a data set, so in GYB Google is used first, then those results are checked against Yahoo and Bing, and the final set contains all the entries Google listed that both Yahoo and Bing also listed.  YGB does the same, but starts with Yahoo, retrieving its entire set first and then reducing it to the entries that both Google and Bing contain.  GYB and YGB result in different set sizes due to the fact that Google 'indexes more pages' than Yahoo.
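The cross-checking step described above can be sketched in a few lines of Python.  This is only my own toy illustration of the idea, not WorldWideWebSize's actual code: the function name `cross_check` and the three little sets of made-up addresses are assumptions for the sake of the example.

```python
def cross_check(first, second, third):
    """Keep only the entries from the first engine's set that
    both of the other engines also list."""
    return {page for page in first if page in second and page in third}

# Made-up toy data standing in for each engine's index.
google = {"a.example", "b.example", "c.example", "d.example", "e.example"}
yahoo = {"a.example", "b.example", "c.example", "f.example"}
bing = {"a.example", "b.example", "d.example", "f.example"}

gyb = cross_check(google, yahoo, bing)  # start from Google's set
ygb = cross_check(yahoo, google, bing)  # start from Yahoo's set

print(sorted(gyb))
print(sorted(ygb))
```

Note that in this toy version both orders end up with the same set, because a plain "listed by all three" check doesn't care which engine you start with.  Any difference between GYB and YGB would have to come from how much each engine hands over in the first step, which this sketch does not attempt to model.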


In both scenarios above you effectively remove the inflation and see only the pages that each search engine can verify exist.  So, if the three working together can at most find 8 billion pages that exist, should you really believe Google when it tells you it found 2.4 billion results for your query?

Just for fun, if you like, see what the highest number of results is that you can get Google to say it found.  I'll set the bar for you with "a", which apparently returns 25.27 trillion results.
