Table of Contents
PageRank
PageRank is a function that assigns a number weighting each page in the Web, the intent is that the higher the PageRank of a page, the more important the page is. We can think of the Web as a directed graph, where the pages are the nodes and if there exists a link that connects page1 to page2 then there would be an edge connecting the two nodes.
Imagine an toy example where there are only 4 pages/nodes, :
has links connecting itself to each of ther other three pages.
has links to and .
has links to and .
has links only to .
Given this graph, we can build a transition matrix to depict what is the probability of landing on a given page after 1 step. If we look at an example below, the matrix has:
rows and columns if there are pages
Each element in the matrix, takes on the value of if page has edges and one of them is . Otherwise is 0.
Now suppose we start at any of the pages of the Web with equal probability. Then the initial vector will have for each page. If is the transition matrix of the Web, then after one step, the distribution of us landing on each of the page can be computed by a matrix vector multiplication.
As we can see after 1 step, the probability of landing on the first page, page , is higher than the probability of landing on other pages. We can repeat this matrix vector multiplication for multiple times and our results will eventually converge. Giving us an estimated probability of landing on each page, which in term is PageRank's estimate of how important a given page is when compared to all the other page in the Web.
This sort of convergence behavior is an example of the Markov Chain processes. It is known that the distribution of converges, provided two conditions are met:
The graph is strongly connected; that is, it is possible to get from any node to any other node.
There are no dead ends: nodes that have no edges out.
If we stare at the formula long enough, we can observe that our final result vector is an eigenvector of the matrix (recall an eigenvector of a matrix is a vector that satisfies for some constant eigenvalue ).
Taxation
The vanilla PageRank that we've introduced above needs some tweaks to handle data that can appear in real world scenarios. The two problems that we need to avoid is what's called spider traps and dead end.
spider trap is a set of nodes with edges, but these edges all links within the page itself. This causes the PageRank calculation to place all the PageRank score within the spider traps.
As predicted, all the PageRank is at node , since once we land there, there's no way for us to leave.
The other problem dead end describes pages that have no out-links, as a result pages that reaches these dead ends will not have any PageRank.
As we see, the result tells us the probability of us being anywhere goes to 0, as the number of steps increase.
To avoid the two problems mentioned above, we will modify the calculation of PageRank. At each step, we will give it a small probability of "teleporting" to a random page, rather than following an out-link from their current page. The notation form for the description above would be:
where:
is a chosen constant, usually in the range of 0.8 to 0.9.
is a vector of all 1s with the appropriate number of elements so that the matrix addition adds up.
is the number of pages/nodes in the Web graph.
The term, denotes that at this step, there is a probability that we will follow an out-link from their present page.
Notice the term does not depend on , thus if there are some dead ends in the graph, there will always be some fraction of opportunity to jump out of that rabbit hole.
This idea of adding is referred to as taxation (
networkx
package calls this damping factor).
The result looks much more reasonable after introducing the taxation. We can also compare the it with the pagerank
function from networkx
.
Having seen how to calculate the PageRank vector, we should examine how this information is being used. Of course each search engine has its own secret formula that decides the ordering of pages in response to users search query. But we can guess that first, in order to be considered for the ranking at all, a page has to have at least one of the search terms in the query. Then amongst the qualified pages, a score is computed for each, and an important component of this score is the PageRank of the page (we can think of component as a feature to a machine learning ranker). Other components might include the presence or absence of search terms in prominent places, such as headers or the links to the page itself.