Archive for June, 2007

Relevance calculation and popularity

preamble

In this post I try to understand why some search engines are more popular than others, how we can test relevance, and how the popularity of search engines relates to relevance.

1 approach: we should test them on relevance

Here we simply enter a request and see which engine returns better results.

You might expect me to generate some requests by some rules and then tell you who is best. There are many studies like that.

But I say: relevance is a relative concept with a statistical component.
A simple way to understand it:
Think about requests like “python” or “ruby”.
Suppose the search engine returns really good results about “python” or “ruby”, but treats both as programming languages.
If we are software engineers we think the search engine works well; if we are biologists we think differently :)

We can give the search engine additional information, using what we already know about our request, to get more relevant results. On the other hand, sometimes we are looking for information we do not yet understand deeply, so we don’t know how to define it or how to check whether the result we got is optimal.
Therefore we will not judge search engines by a concept as relative as relevance.

In my opinion we should classify information into at least a few basic sections.

There are several problems:

If we provide many branches, it becomes hard to describe a request with the most appropriate category. Web directories [http://en.wikipedia.org/wiki/Web_directory] have exactly this problem. It is important to find a balance between relevance and the difficulty of describing a request.

Who should divide the already crawled information into categories? Algorithms help us find pages and assign correct weights to them (see my article “Relevance and cheat techniques” for details about the techniques search engines currently use to search and estimate). We could develop other algorithms, but at the moment it seems better to use people. They could be volunteers, like the ones Jimmy Wales, creator of Wikipedia [http://en.wikipedia.org/wiki/Wikipedia], wants to use for the open-source [http://en.wikipedia.org/wiki/Open-source] search engine he is working on.

Or we can let everyone mark information with an appropriate category. Then we have another problem: how far can we trust everyone? People like to cheat.

Possible answers are: statistical trust (10 say cool, 2 say bad; result: quite cool); a community of trusted people, an approach also used to solve computer security problems (simplistically: someone trustworthy thinks I am trustworthy, and I trust my friend; result: we can trust my friend); or a combination of both.
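Here is a minimal sketch of both answers and of their combination. All names (`statistical_score`, `trusted_by`, `combined_score`) and the sample data are my own illustration, not how any real service works: the statistical part is just the share of positive votes, and the trust part follows "A trusts B" links outward from one person.

```python
from collections import deque


def statistical_score(votes):
    """Share of positive votes; 10 'cool' and 2 'bad' give 10/12."""
    if not votes:
        return 0.0
    return sum(1 for vote in votes if vote) / len(votes)


def trusted_by(me, trusts):
    """Everyone reachable from `me` by following 'A trusts B' links."""
    seen = {me}
    queue = deque([me])
    while queue:
        person = queue.popleft()
        for friend in trusts.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    seen.discard(me)
    return seen


def combined_score(me, votes_by_user, trusts):
    """Statistical score that counts only votes from my trust circle."""
    circle = trusted_by(me, trusts)
    counted = [vote for user, vote in votes_by_user.items() if user in circle]
    return statistical_score(counted)
```

With `trusts = {"me": ["ann"], "ann": ["bob"]}`, a vote from an unknown account is simply ignored by `combined_score`, which is the point of combining the two ideas.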

Check some of the following to see how this already works today:
StumbleUpon [http://en.wikipedia.org/wiki/Stumbleupon]
Technorati [http://en.wikipedia.org/wiki/Technorati.com]
Digg [http://en.wikipedia.org/wiki/Digg]

2 approach: search engine database does matter

Here we just want to compare the search engines’ databases and draw conclusions.

We know who struck first. It was Google: in 2000 they said “Searching 8,xxx,xxx,xxx web pages”. It is simply advertising, and we know it. But the others got nervous..

Estimating index size is like a game. Which rules should we use? In other words, how do we count?
We will not worry about that here..

On 8/05/2005 Tim Mayer claimed on the Yahoo! Search blog that the “[Yahoo!] index now provides access to over 20 billion items”, which include “19.2 billion web documents, 1.6 billion images, and over 50 million audio and video files”. [http://www.ysearchblog.com/archives/000172.html]

Google stopped publishing the number of pages indexed, which had been listed as 8 billion, “because people don’t necessarily agree on how to count it,” said Dr. Eric Schmidt, Chairman of the Board and CEO. [http://news.zdnet.co.uk/internet/0,1000000097,39222233,00.htm?r=1]

Two days after the Yahoo post, John Battelle, a visiting professor at the University of California at Berkeley, wrote on his blog that Google disputed the claim: “.. scientists are not seeing the increase claimed in the Yahoo! index”. [http://battellemedia.com/archives/001790.php]

Then came independent research on this [http://vburton.ncsa.uiuc.edu/indexsize.html], followed by criticism such as [http://aixtal.blogspot.com/2005/08/yahoo-missing-pages-3.html].

Some corrections followed, but both Yahoo and Google truncate the results returned to the user after 1,000 results, so it is hard to be sure.

So we have no good way to compare database sizes. Search engine companies do not seem interested in creating common rules of the game (common rules for calculating index size).

instead of a conclusion:
It is hard to compare, because we would first have to define our criteria and notions precisely.

If so, how can we know who is better?
We can’t. There are additional factors, like religion (“I like Google”) or a passion for design (“MS Search looks better”), etc..

Search engines relevance and cheat techniques

preamble

All of us feel that the relevance [http://en.wikipedia.org/w/index.php?title=Relevance_%28Computer_Science)] of the documents we find through search engines is getting worse.
This is happening because of the growing size and dynamic nature of the internet, which is hard to analyze and monitor even for web crawlers [http://en.wikipedia.org/wiki/Web_crawler], and because of cheaters, who invent ever more advanced methods of cheating.

relevance – sum of factors

Common factors I consider important for site ranking:

patented PageRank algorithm [http://en.wikipedia.org/wiki/PageRank]
Google’s creators Larry Page and Sergey Brin developed it around 1995 as part of a research project on a new kind of search engine. Its numerous modifications belong here too.
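To make the idea concrete, here is a small sketch of the textbook power-iteration form of PageRank. It is not Google’s production algorithm; the function name `pagerank`, the toy link graph in the usage note and the iteration count are my assumptions, and 0.85 is the damping factor usually quoted for the original formulation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank for a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small base share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank to everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

For the graph `{"a": ["c"], "b": ["c"], "c": ["a"]}` page "c" ends up with the highest rank, because two pages link to it, and "b" with the lowest, because nobody links to it.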

Personalized search [http://www.google.com/history]

“It helps deliver more personalized search results based on what you’ve searched for on Google and which sites you’ve visited.”
Google Web history

Google tries to find dependencies between a user’s search requests. It sounds like a good enhancement: if I am looking for Java or C++, I am probably a programmer, so I will probably prefer related resources. Occasionally I may want to find a game or a film, and then personalization can make the results worse for that particular search, but in the general case it should work fine. Better still, the engine could first check whether this particular request is related to things I searched for before, and if not, fall back to general, non-personalized search.

direct weight analysis
[keyword density, anchor text, analysis of formatting elements: font size, formatting tags: <title>,<b>,<h1>,<em>]
A search engine can analyze the frequency of a keyword on the page, the emphasis given to it, its placement relative to the beginning of the page, the anchor text of links, and how precisely the requested phrase is matched. It can also check whether the information was copied from other sources and when the information became available.
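Two of these signals are easy to sketch: keyword density, and extra weight for keywords inside formatting tags. The weights in `TAG_WEIGHTS` are numbers I made up for illustration; real engines do not publish theirs, and a real implementation would use a proper HTML parser rather than regular expressions.

```python
import re

# Hypothetical weights: how much one keyword hit inside each tag is worth.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "b": 1.5, "em": 1.5}


def tokenize(text):
    """Lowercase words and numbers, punctuation dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_density(text, keyword):
    """Fraction of the words in `text` that equal `keyword`."""
    words = tokenize(text)
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


def weighted_hits(html, keyword):
    """Sum of tag weights over keyword hits inside the weighted tags."""
    keyword = keyword.lower()
    score = 0.0
    for tag, weight in TAG_WEIGHTS.items():
        # Match <tag> or <tag attr="..."> but not longer names like <body>.
        pattern = r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(tag)
        for inner in re.findall(pattern, html, flags=re.IGNORECASE | re.DOTALL):
            score += weight * tokenize(inner).count(keyword)
    return score
```

A ranking function could then combine `keyword_density` for the plain text with `weighted_hits` for the markup; how to mix them is exactly the kind of tuning each engine keeps to itself.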

domain name analysis (educational, nonprofit, commercial, government, etc.)
A search engine can analyze which domain names are more appropriate for a request.
If you are looking for “government rules”, it can give more weight to sites in the government (.gov) top-level domain.
If you are looking for something in Russian, it can be a good idea to give more weight to sites in the Russian top-level domain (.ru).
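As a rough sketch, such a boost could be a simple multiplier applied to a page’s score. The function names and the factor 1.5 are my assumptions; how the engine decides which top-level domains fit a request is left out.

```python
from urllib.parse import urlparse


def top_level_domain(url):
    """Last label of the host name, e.g. 'gov' for http://www.usa.gov/."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else ""


def domain_boost(url, preferred_tlds, boost=1.5):
    """Multiplier for a page score: `boost` if the TLD is preferred, else 1.0."""
    return boost if top_level_domain(url) in preferred_tlds else 1.0
```

For a request in Russian the engine would call `domain_boost(url, {"ru"})`; for a request about government rules, `domain_boost(url, {"gov"})`.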

site size analysis
A search engine can give a higher mark to a bigger site by comparing the number of pages under the same domain name.

intelligent site state analysis
Site updates: date of the last update, percentage of content updated;
Site maintenance quality: broken links, number of typos;

morphology analysis (intelligent ignoring of stop words, slang, obscene language)
Sure, we should ignore words like: very, of, occasionally, and…; but compare
beauty of car – we want to find what beauty means when we talk about cars
beauty and car – we want to find how the beauty of something or someone can be related to cars
or
paper box – a box made of paper
paper and box – maybe a shop where we can buy paper, boxes…

Morphology analysis can give us additional forms of the requested term for additional searches.
We can find more technical resources that contain special slang, and decrease the weight of a site with obscene language if the user’s request did not contain such language.
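The “beauty of car” versus “beauty and car” difference can be sketched like this: plain stop words are dropped, but small connecting words are kept aside as hints about how the terms relate. The word lists and the name `parse_query` are only illustrative assumptions.

```python
# Words that carry no meaning for the search and can be dropped.
STOP_WORDS = {"very", "occasionally", "the", "a"}
# Small words that change how the remaining terms relate to each other.
CONNECTIVES = {"of", "and"}


def parse_query(query):
    """Split a request into search terms and relation hints."""
    terms, relations = [], []
    for word in query.lower().split():
        if word in CONNECTIVES:
            relations.append(word)
        elif word not in STOP_WORDS:
            terms.append(word)
    return terms, relations
```

Both `parse_query("beauty of car")` and `parse_query("beauty and car")` give the same terms, but the second element of the result differs (`["of"]` versus `["and"]`), so the engine still knows which reading the user meant.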

use of social networks [http://en.wikipedia.org/wiki/Social_network]
A modern approach. After applying sets of different algorithms, we use people to evaluate resources. A real, working example is StumbleUpon, with a current community of about 3,095,859 users [http://en.wikipedia.org/wiki/StumbleUpon].

StumbleUpon chooses which new webpage to display based on the user’s ratings of previous pages, ratings by his/her friends, and by the ratings of users with similar interests
StumbleUpon site

“Next time you want to wander the Web, forget about Googling it. Stumble it.”
Wall Street Journal

If it is so cool, why doesn’t Google provide it as part of its search? I think it is because of StumbleUpon’s patent-pending [http://en.wikipedia.org/wiki/Patent_pending] toolbar system. Search engine companies must either invent something at least slightly different to get around the patent issues, or buy it. By the way eBay

Cloaking [http://en.wikipedia.org/wiki/Cloaking]
According to the Google webmaster FAQ, cloaking is:

The term “cloaking” is used to describe a website that returns altered webpages to search engines crawling the site. In other words, the webserver is programmed to return different content to Google than it returns to regular users, usually in an attempt to distort search engine rankings. This can mislead users about what they’ll find when they click on a search result. To preserve the accuracy and quality of our search results, Google may permanently ban from our index any sites or site authors that engage in cloaking to distort their search rankings.

basic idea: give the web crawler specially prepared page content with good keywords.
popularity: not a very popular tactic. It is not easy to implement well.

tactic: the cheater makes special pages with keywords spread across them; it is very important not to go too far. The cheater has to find the right balance of common words, keywords, unrelated words, etc. Then he serves those pages to the web crawler when it comes to crawl the site. Cheaters recognize a crawler by its user-agent string (HTTP_USER_AGENT) [http://en.wikipedia.org/wiki/User_agent] such as “Googlebot/2.1”, by a special database of IP ranges that belong to the search company, by the HTTP referrer [http://en.wikipedia.org/wiki/Referrer], or by some combination of these.

When a web crawler crawls pages it caches them, so cheaters have to find a way to avoid that, using search-engine-specific meta tags (for example a “noarchive” robots directive) or general meta tags such as “pragma” no-cache.
But “pragma” doesn’t work in IE 5, so to guarantee that the page is not cached you need to add another meta tag as well,
and so on ..

Additionally, cheaters can rotate different content to give the appearance of a frequently updated site (a news channel..),
or provide different content for different web crawlers using a robots.txt file [http://en.wikipedia.org/wiki/Robots.txt]:
User-Agent: Googlebot/2.1
Disallow: /doorway/lycos/
User-Agent: YahooBot
Disallow: /doorway/lycos/
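To see how a crawler would read such rules, here is a small sketch using Python’s standard `urllib.robotparser`. The robots.txt text in it is my own variation, not the file above: I dropped the version suffix from the agent name (as far as I can tell this parser compares only the name before the “/”) and gave the second crawler its own directory, which I assume was the intent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: each crawler is kept out of the doorway pages
# that were prepared for a different search engine.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /doorway/lycos/

User-agent: YahooBot
Disallow: /doorway/google/
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything from the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# Googlebot is kept out of the Lycos doorway but may read the other one.
googlebot_sees_lycos = parser.can_fetch("Googlebot/2.1", "/doorway/lycos/index.html")
googlebot_sees_google = parser.can_fetch("Googlebot/2.1", "/doorway/google/index.html")
```

Note that robots.txt is only a request to the crawler: it hides pages from well-behaved robots, but it does not stop anyone from fetching them.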

As you see, it is not so easy to be a cheater. And that is before we even mention the special schemes search companies use to catch cheaters, like
crawling under incorrect user-agent names or from special IPs to check content.
Today there are off-the-shelf cloaking programs. Sure, it is a bad idea to use them: they can be a trojan, a virus, etc., and you will never get an optimal solution, as I wrote above.
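The checking side is simple to sketch: request the same page once as a crawler and once as a browser, and compare what comes back. This is only my guess at the principle, not a description of what any search company actually runs; `fetch` is a hypothetical function you supply (url and user-agent in, page text out), and the 0.5 threshold is arbitrary.

```python
import re


def word_set(text):
    """Set of lowercase words and numbers in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(text_a, text_b):
    """Jaccard similarity of the two word sets, from 0.0 to 1.0."""
    a, b = word_set(text_a), word_set(text_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_cloaked(fetch, url, threshold=0.5):
    """True if the page served to a crawler differs a lot from the browser one."""
    as_crawler = fetch(url, "Googlebot/2.1")
    as_browser = fetch(url, "Mozilla/5.0")
    return similarity(as_crawler, as_browser) < threshold
```

A site that rotates content or personalizes pages would also trip this check, which is one reason the line between cloaking and legitimate content variation is hard to draw.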

There are many disputes about page cloaking: today it is popular to provide special content for mobile devices, to vary content by geographical location, or to personalize content for users. Is that page cloaking? I think personalized content delivery is not page cloaking. But search engine engineers know better. Ask Google ..:)

Multiple entry points (doorway)
Cheaters buy different domain names containing different keywords, all pointing to the same site, or they make several sites with slightly altered content that lead to some more important site.

basic idea: have multiple entry points to one site
popularity: quite a popular tactic.

Old techniques
There are many old techniques, like:
invisible text, for example black keywords on a black background or a very small font;
hidden frames, duplicate pages, keywords in the title, keywords in comments, keywords in style tags, keywords in hidden values ..
I have not written about them simply because I think they belong to a stage of search engine evolution that has already passed.

PS: Surely I have missed something new and fantastic.
If you know about such techniques, mail me and I will add them with great pleasure.

related links:
Early Google architecture, according to the work of Sergey Brin and Lawrence Page at Stanford University [http://infolab.stanford.edu/~backrub/google.html]
Google guide for webmaster [http://www.google.com/intl/az/webmasters/guidelines.html]
Google reporting page about cheaters [http://www.google.com/contact/spamreport.html]
Google sitemaps (Statistics, diagnostics and management of Google’s crawling) [http://www.google.com/webmasters/sitemaps/]
Keyword density [http://en.wikipedia.org/wiki/Keyword_density]
Google bomb technique: [http://en.wikipedia.org/wiki/Google_bomb]