Relevance calculation and popularity
preamble
In this post I try to understand why some search engines are more popular than others, how we can test relevance, and how the popularity of search engines relates to their relevance.
1st approach: we should test them on relevance
Here we just want to enter a request and see which engine returns better results.
You might expect me to generate some requests according to some rules and then tell you who is best. There are many studies like that.
But I say: relevance is a relative concept with a statistical component.
A simple way to see this:
Think about requests like “python” or “ruby”.
We get a response from the search engine with really good results about “python” or “ruby”, but it treats both as programming languages.
If we are software engineers, we think the search engine works well; if we are biologists, we think differently :)
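To make this concrete, here is a minimal sketch of how the same result list gets a different score depending on who is asking. The result list and its "sense" tags are invented for illustration, and the score is plain precision (share of results that match what the user meant), not a measure any particular search engine uses.

```python
# Hypothetical results for the request "python". Each entry is the sense of
# the word that the result is about. These tags are invented for this sketch.
RESULTS = ["language", "language", "language", "snake", "language"]


def precision(results, wanted_sense):
    """Share of returned results that match the sense the user had in mind."""
    if not results:
        return 0.0
    relevant = sum(1 for sense in results if sense == wanted_sense)
    return relevant / len(results)


# Same engine, same request, same results: only the user changes.
engineer_score = precision(RESULTS, "language")  # wants the programming language
biologist_score = precision(RESULTS, "snake")    # wants the animal

print(f"software engineer: {engineer_score:.2f}")
print(f"biologist: {biologist_score:.2f}")
```

The engine did nothing different between the two runs, yet the software engineer sees four useful results out of five and the biologist only one.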
We can give the search engine additional information, using what we already know about our request, to get more relevant results. On the other hand, sometimes we look for information we do not yet understand deeply, so we do not know how to define it or how to check whether we got the best result.
Therefore we will not judge search engines by a concept as relative as relevance.
In my opinion, information should be classified into at least some basic sections.
There are several problems:
If we provide too many branches, it becomes hard to describe our request with the most appropriate category. We see this situation in web directories[http://en.wikipedia.org/wiki/Web_directory]. It is important to find a balance between relevance and the difficulty of describing the request.
Who should divide the already crawled information into categories? Algorithms help us find pages and assign correct weights to them (see my article “Relevance and cheat techniques” for details about the techniques search engines currently use to search and rank). We could develop other algorithms, but at the moment it seems better to use people. These could be volunteers, like those Jimmy Wales, co-founder of Wikipedia[http://en.wikipedia.org/wiki/Wikipedia], wants to use for the open-source[http://en.wikipedia.org/wiki/Open-source] search engine he is working on.
Or we can let everyone mark information with an appropriate category. Then we have another problem: how can we trust everyone? People like to cheat.
Possible answers are: statistical trust (10 say cool, 2 say bad; result: quite cool); a community of trusted people, as used to solve computer security problems (simplistically: if someone trustworthy considers me trustworthy, and I trust my friend, then we can trust my friend); or a combination of the two.
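The three answers can be sketched in a few lines. This is only an illustration under simple assumptions: votes are plain yes/no, trust is a one-way "X trusts Y" link followed without any depth limit, and all names (`someone`, `me`, `my_friend`, `spammer`) are hypothetical.

```python
from collections import deque


def statistical_score(votes):
    """Statistical trust: fraction of positive votes (True = cool, False = bad)."""
    if not votes:
        return None  # nobody voted, so there is nothing to say
    return sum(votes) / len(votes)


def trusted_by(start, trusts):
    """Community of trust: everyone reachable from `start` via 'X trusts Y' links."""
    seen = {start}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in trusts.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    seen.discard(start)
    return seen


def combined_score(votes_by_user, start, trusts):
    """Combination: count only the votes cast inside the trust circle of `start`."""
    circle = trusted_by(start, trusts) | {start}
    kept = [vote for user, vote in votes_by_user.items() if user in circle]
    return statistical_score(kept)


# 10 say cool, 2 say bad.
score = statistical_score([True] * 10 + [False] * 2)
print(f"statistical trust: {score:.2f}")

# Someone trusts me, I trust my friend, so my friend is in someone's circle.
trusts = {"someone": ["me"], "me": ["my_friend"], "stranger": ["spammer"]}
circle = trusted_by("someone", trusts)
print(sorted(circle))

# The spammer's vote is ignored because nobody in the circle trusts the spammer.
votes_by_user = {"me": True, "my_friend": True, "spammer": False}
print(combined_score(votes_by_user, "someone", trusts))
```

A real system would probably weaken trust with every extra link instead of treating a friend of a friend of a friend as fully trusted; the sketch leaves that out to stay short.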
Check some of the following to see how this already works today:
StumbleUpon [http://en.wikipedia.org/wiki/Stumbleupon]
Technorati [http://en.wikipedia.org/wiki/Technorati.com]
Digg [http://en.wikipedia.org/wiki/Digg]
2nd approach: the search engine database does matter
Here we just want to compare search engine databases and draw conclusions.
We know who struck first. It was Google: in 2000 they said “Searching 8,xxx,xxx,xxx web pages”. It is simply advertising, we know that. But the others felt the pressure.
Estimating index size is like a game. Which rules should we use? In other words, how should we count?
We will not worry about it.
On 8/05/2005 Tim Mayer claimed on the Yahoo! Search blog that the “[Yahoo!] index now provides access to over 20 billion items”, which include “19.2 billion web documents, 1.6 billion images, and over 50 million audio and video files”. [http://www.ysearchblog.com/archives/000172.html]
Google stopped publishing the number of pages indexed, last listed as 8 billion, “because people don’t necessarily agree on how to count it,” said Dr. Eric Schmidt, Chairman of the Board and CEO. [http://news.zdnet.co.uk/internet/0,1000000097,39222233,00.htm?r=1]
Two days after the Yahoo post, John Battelle, a visiting professor at the University of California at Berkeley, reported on his blog that Google refuted this: “.. scientists are not seeing the increase claimed in the Yahoo! index”. [http://battellemedia.com/archives/001790.php]
Then came independent research on the question [http://vburton.ncsa.uiuc.edu/indexsize.html], followed by criticism such as [http://aixtal.blogspot.com/2005/08/yahoo-missing-pages-3.html]
Some corrections followed, but the Yahoo and Google search engines still truncate the results returned to the user after 1,000 results, so it is hard to be sure.
So we do not have a good way to compare database sizes. Search engine companies do not seem interested in creating common rules of the game (common rules for counting index size).
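Why the rules matter can be shown with a toy example. The index below is entirely made up (five entries with invented URLs and content fingerprints), and the three counting rules are just plausible candidates, not the rules Yahoo or Google actually apply.

```python
# A tiny made-up "index": (url, kind, content_fingerprint). All values are invented.
INDEX = [
    ("http://a.example/", "page", "f1"),
    ("http://a.example/index.html", "page", "f1"),  # same content, different URL
    ("http://b.example/", "page", "f2"),
    ("http://b.example/logo.png", "image", "f3"),
    ("http://c.example/talk.mp3", "audio", "f4"),
]


def count_all_items(index):
    """Rule 1: every indexed item counts, whatever it is."""
    return len(index)


def count_web_documents(index):
    """Rule 2: only web pages count; images and audio do not."""
    return sum(1 for _, kind, _ in index if kind == "page")


def count_unique_documents(index):
    """Rule 3: only web pages count, and duplicate content counts once."""
    return len({fingerprint for _, kind, fingerprint in index if kind == "page"})


print("all items:       ", count_all_items(index=INDEX))
print("web documents:   ", count_web_documents(index=INDEX))
print("unique documents:", count_unique_documents(index=INDEX))
```

The same database is "5", "3" or "2" depending on the rule, which is the same kind of gap as between "20 billion items" and "19.2 billion web documents" in the Yahoo announcement.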
instead of a conclusion:
It is hard to compare because we first have to define our criteria and notions precisely.
If so, how can we know who is better?
We can’t. There are also additional factors, like religion (“I like Google”) or a passion for design (“MS Search looks better”), etc.

