Wednesday, January 1, 2014

How Web Search Engines Work

Given a list of words, a Web Search Engine has to find the Web Pages that are most relevant to the query words.

So the first step is pretty obvious. A Web Search Engine has to find the Web Pages that contain the words given in the query.

But there could be millions of such pages. How do you present the result to the user?

Obviously, you need someway to rank the Web Pages so that you can show the Web Pages that are ranked higher first.

So the next question is - how do you rank Web Pages?

There have been many approaches. But the one that revolutionized the landscape of Web Search Engines is the “PageRank” method jointly invented by Sergey Brin and Larry Page who founded Google.


How does PageRank rank Web pages?

Pagerank calculates the importance / quality / authority of a web-page.

Just as we can assume that, in a Social Network the person who is known to a lot of important persons must be important, in Web, a page that is linked at by a lot of high quality pages must be of high quality and authoritative. This is the idea behind PageRank.


It seems as though Sir Tim Berners-Lee made the job of creating a high quality Search Engine easier for Brin and Page by introducing hyperlinks in HTTP! So credit must be given to Tim! To find out the right way to spell Sir Tim Berners-Lee’s name I utilized page-rank and a lot of other methods hidden behind the Google’s simple search interface! 




Brin had the idea of ranking web pages using incoming links. While the rest of the idea seems to have originated from ideas for ranking of research papers. Page was working on that problem in Stanford at that time. A paper that is cited by a lot of highly cited papers must be authoritative and should be ranked higher. Calculation of impact factor of a particular research paper or a journal is done utilizing this idea.



But one requires a mathematical framework to calculate the PageRank of a Web page. The calculation for PageRank utilizes probability theory.

The “Random walk model” that PageRank utilizes, finds out the importance of a Web page by calculating how probable it is for a user to arrive at a page on the Web if he starts randomly from a Web page and follows the hyper-links randomly. So the Web page which is linked at by pages (that is, has incoming links of pages) that are more discover-able has more probability of being discovered.

This is how the equation for finding out the PageRank of a web page, say “W”, is formed -

Summation of
                Discoverability of each of the Web Pages that has link to “W”
                divided by, the number of links that each of these page has
                (assuming that each link has same probability of being clicked on by the user).

Mathematically,
we assume page A has pages TI...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also, C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (I -d) + d [ PR(TI) / C(Tl) …. + PR(Tn) / C(Tn) ]

Note that the PageRanks form a probability distribution over Web pages, so the sum of all Web pages’ PageRanks will be one. [1]

Of course Web Search Engines use other measures (i.e., TF-IDF, Anchor Texts etc.) for ranking Web pages. But PageRank is central to all the calculations.

Reference:

[1] The anatomy of a large-scale hypertextual Web search engine

No comments:

Post a Comment