Search engine relevance and cheating techniques

preamble

All of us feel that the relevance [http://en.wikipedia.org/w/index.php?title=Relevance_%28Computer_Science)] of the documents we find through search engines is getting worse.
This has happened because of the growing size and dynamic nature of the internet, which is hard to analyze and monitor even for web crawlers [http://en.wikipedia.org/wiki/Web_crawler], and because of cheaters who keep inventing more advanced methods of cheating.

relevance – sum of factors

Common factors I consider important for site ranking:

the patented PageRank algorithm [http://en.wikipedia.org/wiki/PageRank], or one of its numerous modifications
Google’s creators Larry Page and Sergey Brin developed it in about 1995 as part of a research project about a new kind of search engine.
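
To make the idea more concrete, here is a minimal sketch of the basic PageRank iteration in Python. It is not Google's production algorithm: the damping factor of 0.85, the number of iterations, and the tiny three-page link graph are assumptions chosen only for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults, not official values.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small base share, independent of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a links to b and c, b links to c, c links to a.
example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
example_ranks = pagerank(example_links)
```

In this toy graph page c ends up with the highest rank, because both other pages link to it.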

Personalized search [http://www.google.com/history]

“It helps deliver more personalized search results based on what you’ve searched for on Google and which sites you’ve visited.”
Google Web history

Google tries to find dependencies between a user’s search requests. It sounds like a good enhancement: if I am looking for Java or C++, I am probably a programmer, so I may prefer related resources. But occasionally I may want to find a game or a film, and personalization can make the results worse for that particular search, although in the general case it should work fine. Even better, the engine could first check whether a particular request is related to things I searched for before and, if it is not, fall back to the general, non-personalized search.

direct weight analysis
[keyword density, anchor text, analysis of formatting elements: font size, formatting tags: <title>,<b>,<h1>,<em>]
A search engine can analyze the frequency of a keyword on the page, the emphasis put on words, their placement relative to the beginning of the page, the anchor text, and how precisely the requested phrase is matched. It can also check whether the information is reproduced from other sources and when it first became available.
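
As a rough illustration of the first of these signals, keyword density can be computed as the share of words on a page that match the keyword. This is only a sketch: the tokenization rule and the function name keyword_density are my assumptions, and a real engine would combine this number with many other signals.

```python
import re


def keyword_density(text, keyword):
    """Return the fraction of words in text that are equal to keyword."""
    # Assumed tokenization: lowercase runs of letters, digits and apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


sample_density = keyword_density("Cheap cars. Buy cheap cars today!", "cheap")
```

A density that is far above what normal text shows is also a hint that a page was stuffed with keywords.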

domain name analysis (educational, nonprofit, commercial, government, etc.)
A search engine can analyze which domain names are more appropriate for a request.
If you are looking for “government rules”, it can give more weight to sites in the government top-level domain (.gov).
If you are looking for something in Russian, it can be a good idea to give more weight to sites in the Russian top-level domain (.ru).
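
A minimal sketch of such a weighting, assuming a made-up boost factor of 1.5 and two tiny lookup tables. The names TOPIC_TLDS, LANGUAGE_TLDS and domain_weight are hypothetical and only illustrate the idea.

```python
from urllib.parse import urlsplit

# Hypothetical lookup tables: query word -> preferred top-level domain,
# and query language -> preferred top-level domain.
TOPIC_TLDS = {"government": "gov"}
LANGUAGE_TLDS = {"ru": "ru"}
BOOST = 1.5  # assumed boost factor, chosen only for illustration


def top_level_domain(url):
    """Return the last label of the host name, for example 'gov'."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def domain_weight(url, query, language=None):
    """Return a multiplier for a result, based on its top-level domain."""
    tld = top_level_domain(url)
    weight = 1.0
    for word in query.lower().split():
        if TOPIC_TLDS.get(word) == tld:
            weight *= BOOST
    if language is not None and LANGUAGE_TLDS.get(language) == tld:
        weight *= BOOST
    return weight


gov_weight = domain_weight("http://www.example.gov/rules", "government rules")
com_weight = domain_weight("http://www.example.com/rules", "government rules")
```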

site size analysis
A search engine can give a higher score to a bigger site by comparing the number of pages under the same domain name.

intelligent site state analysis
Site updates: date of the last update, percentage of content updated;
Site maintenance quality: broken links, number of typos;

morphology analysis (intelligent ignoring of stop words, slang, obscene language)
Of course we should ignore words like “very”, “of”, “occasionally”, “and”, and so on. But compare:
beauty of car – we want to find what beauty means when we talk about cars
beauty and car – we want to find how the beauty of something or someone can be related to cars
or
paper box – a box made of paper
paper and box – maybe a shop where we can buy paper, boxes…

Morphology analysis can give us additional forms of the requested word for additional searches.
We can find more technical resources that contain special slang, and decrease the weight of sites with obscene language if the user’s request did not contain such language.
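
The small sketch below shows why ignoring stop words has to be "intelligent". With a naive filter (the stop-word list here is just the handful of words from the example above, not a real list), the two different queries collapse into the same set of terms and their difference in meaning is lost.

```python
# Assumed, deliberately tiny stop-word list taken from the example above.
STOP_WORDS = {"very", "of", "occasionally", "and"}


def strip_stop_words(query):
    """Naive filter: drop every stop word and keep the remaining terms."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


terms_of = strip_stop_words("beauty of car")
terms_and = strip_stop_words("beauty and car")
# Both queries are reduced to the same two terms.
same_terms = terms_of == terms_and
```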

use of social networks [http://en.wikipedia.org/wiki/Social_network]
A modern approach. After applying sets of different algorithms, we use people to evaluate resources. A real, working example is StumbleUpon, with a current community size of about 3,095,859 [http://en.wikipedia.org/wiki/StumbleUpon].

StumbleUpon chooses which new webpage to display based on the user’s ratings of previous pages, ratings by his/her friends, and by the ratings of users with similar interests
StumbleUpon site

“Next time you want to wander the Web, forget about Googling it. Stumble it.”
Wall Street Journal

If it is so cool, why doesn’t Google provide it as part of their search? I think it is because of StumbleUpon’s patent-pending [http://en.wikipedia.org/wiki/Patent_pending] toolbar system. Search engine companies would have to invent something at least slightly different to get around the patent issues, or buy it. By the way eBay

Page cloaking [http://en.wikipedia.org/wiki/Cloaking]
According to the Google webmaster FAQ, cloaking is:

The term “cloaking” is used to describe a website that returns altered webpages to search engines crawling the site. In other words, the webserver is programmed to return different content to Google than it returns to regular users, usually in an attempt to distort search engine rankings. This can mislead users about what they’ll find when they click on a search result. To preserve the accuracy and quality of our search results, Google may permanently ban from our index any sites or site authors that engage in cloaking to distort their search rankings.

basic idea: to give the web crawler specially prepared page content with good keywords.
popularity: not a very popular tactic. It is not easy to implement well.

tactic: the cheater makes special pages with keywords spread across them; it is very important not to go too far. The cheater has to find a balance of common words, keywords, unrelated words, etc. Then he serves those pages to the web crawler when it comes to crawl the site. Cheaters recognize a crawler by its user-agent string (HTTP_USER_AGENT) [http://en.wikipedia.org/wiki/User_agent], such as “Googlebot/2.1”, by a special database of the IP ranges that the search company owns, by the HTTP referrer [http://en.wikipedia.org/wiki/Referrer], or by some combination of these.
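
The recognition step can be sketched as follows. The same check is used legitimately, for example to keep crawlers out of visitor statistics. The token list and the network below are assumptions: 192.0.2.0/24 is a range reserved for documentation, not a real crawler range, and the function names are hypothetical.

```python
import ipaddress

# Assumed values for illustration only.
CRAWLER_TOKENS = ("googlebot",)
CRAWLER_NETWORKS = (ipaddress.ip_network("192.0.2.0/24"),)  # placeholder range


def user_agent_says_crawler(user_agent):
    """True if the user-agent string contains a known crawler token."""
    lowered = user_agent.lower()
    return any(token in lowered for token in CRAWLER_TOKENS)


def ip_says_crawler(remote_ip):
    """True if the client address lies inside a known crawler network."""
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in CRAWLER_NETWORKS)


def is_crawler(user_agent, remote_ip):
    """Combine both signals; a user-agent string alone is easy to fake."""
    return user_agent_says_crawler(user_agent) and ip_says_crawler(remote_ip)
```

Requiring both signals is a design choice of this sketch: anyone can send “Googlebot/2.1” as a user-agent, so the user-agent check alone proves little.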

When a web crawler crawls pages it caches them, so cheaters have to find a way to avoid that, using search-engine-specific meta tags (for example, a robots “noarchive” directive) or general cache-control meta tags such as the “pragma” no-cache tag.
But “pragma” doesn’t work in IE 5, so in order to assure a non-cache you’ll need to add another meta tag, such as an “expires” tag,
and so on ..

Additionally, cheaters can rotate different content to create the appearance of frequent site updates (like a news channel),
or show different content to different web crawlers, using a robots.txt file [http://en.wikipedia.org/wiki/Robots.txt] to keep each crawler away from the pages prepared for the others:
User-Agent: Googlebot/2.1
Disallow: /doorway/lycos/
User-Agent: YahooBot
Disallow: /doorway/lycos/
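
To see how a crawler would interpret rules like these, here is a small sketch using Python's standard urllib.robotparser module. The robots.txt text is a hypothetical variant of the one above: the user-agent lines are written without a version number, and ExampleBot and OtherSpider are made-up crawler names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two crawlers are kept away from one doorway folder.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /doorway/lycos/

User-agent: ExampleBot
Disallow: /doorway/lycos/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A crawler named in the file must skip the folder; any other crawler may enter.
googlebot_allowed = parser.can_fetch("Googlebot/2.1", "/doorway/lycos/index.html")
other_allowed = parser.can_fetch("OtherSpider", "/doorway/lycos/index.html")
```

Keep in mind that robots.txt is only a request: it works because well-behaved crawlers choose to follow it.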

As you can see, it is not so easy to be a cheater. And that is without even mentioning the special schemes that search companies use to catch cheaters, such as crawling with incorrect user-agent names or from special IPs for content checking.
Today there are off-the-shelf cloaking programs. Of course it is a bad idea to use them: they can contain a trojan, a virus, etc., and you will never get an optimal solution, as I wrote above.
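
The content checking mentioned above can be imagined like this: fetch the same URL once as a crawler and once as an ordinary browser, then compare the two results. The sketch below covers only the comparison step and assumes the two page bodies are already available as strings. The 0.8 threshold and the name looks_cloaked are my assumptions, not anything a search company has published.

```python
from difflib import SequenceMatcher


def looks_cloaked(page_for_crawler, page_for_browser, threshold=0.8):
    """Flag a URL when the two versions of the page differ too much.

    threshold is an assumed cut-off: a similarity below it counts as suspicious.
    """
    similarity = SequenceMatcher(None, page_for_crawler, page_for_browser).ratio()
    return similarity < threshold


crawler_copy = "cheap cars cheap cars cheap cars buy cheap cars"
browser_copy = "Welcome to our online casino"
suspicious = looks_cloaked(crawler_copy, browser_copy)
```

A real check would have to tolerate honest differences such as rotating ads or personalized blocks, which is exactly why the disputes described below exist.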

There are many disputes about page cloaking: today it is popular to provide special content for mobile devices, to vary content by geographic parameters, or to personalize it for users. Is that page cloaking? I think personalized content delivery is not page cloaking. But search engine engineers know better. Ask Google ..:)

Multiple entry points (doorway)
Cheaters buy different domain names that contain different keywords and point them all to the same site, or they make several sites with slightly altered content that lead to some more important site.

basic idea: to have multiple entry points to one site
popularity: quite a popular tactic.

Old techniques
There are many old techniques, such as:
invisible text, for example black keywords on a black background or a very small font;
hidden frames, duplicate pages, keywords in the title, keywords in comments, keywords in style tags, keywords in hidden values ..
I have not written about them simply because I think they belong to a past stage of search engine evolution.

PS: Surely I have missed something new and fantastic.
If you know about such techniques, mail me and I will add them with great pleasure.

related links:
Early Google architecture, according to Sergey Brin and Lawrence Page’s work at Stanford University [http://infolab.stanford.edu/~backrub/google.html]
Google guide for webmasters [http://www.google.com/intl/az/webmasters/guidelines.html]
Google reporting page about cheaters [http://www.google.com/contact/spamreport.html]
Google sitemaps (Statistics, diagnostics and management of Google’s crawling) [http://www.google.com/webmasters/sitemaps/]
Keyword density [http://en.wikipedia.org/wiki/Keyword_density]
Google bomb technique [http://en.wikipedia.org/wiki/Google_bomb]
