Google PageRank: how search engines 'bring order to the Web'

A crucial innovation of Google was a mathematically quite simple but powerful algorithm called PageRank.

In 1998, two graduate students from Stanford University, Sergey Brin and Larry Page, published a 10-page paper:

Sergey Brin, and Lawrence Page. "The anatomy of a large-scale hypertextual web search engine." Computer networks and ISDN systems 30.1-7 (1998): 107-117.

The paper introduced a Web search engine based on entirely new principles.

This search engine is Google, and Brin and Page made history as its founders.

A couple of years ago, I told this story to bachelor students, and suddenly I realized that 1998 was the year most of them were born. They have never known life without Google. But older people, like me, remember other search engines such as Yahoo! or AltaVista. What made Google special in the search market?

PageRank solved the following profound problem. Each query returns thousands of hits. In which order should we arrange them for the user? For example, if I type 'Dutch Railways', I clearly want the official website of the NS, and not a traveler's blog. But how can a computer know what is important to me?

The revolutionary idea of Google was to use not only the text of the page but also the links to it.

This is actually very logical. The World Wide Web is nothing but a network. The nodes are web pages, and the connections are the hyperlinks, which point from one page to another. For example, here is a link from a web page of this blog to the web page of my colleague Clara Stegehuis.

I chose Clara because she often wrote for Networks Pages. If you click on this link, you will of course reach Clara's page.

When I link to Clara's page, it means I know it and I like it. This is a vote for her page. Such a vote is valuable information, and this is exactly what the Google PageRank algorithm used.

PageRank is a score that depends not only on the quantity but also on the quality of the links to a web page. This can be seen in a small example from Wikipedia, shown in the figure below:

The size of the nodes represents their PageRank score.

Node B has a large PageRank because it has many incoming links. Node C has only one incoming link, yet the PageRank of C is high because this link comes from the important node B.
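To make the interplay of quantity and quality concrete, here is a minimal Python sketch of how such scores can be computed by repeatedly passing score along the links. It is an illustration under stated assumptions, not the code Google ran: the damping factor of 0.85, the number of iterations, the function name `pagerank` and the toy network `toy_web` are all my own choices. The toy network only imitates the situation described above (many pages link to B, and B links to C); it is not the exact Wikipedia figure.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank scores by repeatedly redistributing score along links.

    links maps each page to the list of pages it links to.
    damping=0.85 is an assumed, commonly quoted value.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with the same score for every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: many pages link to B, B links only to C,
# and G has a single incoming link from the unimportant page A.
toy_web = {
    "A": ["B", "G"],
    "D": ["B"],
    "E": ["B"],
    "F": ["B"],
    "G": ["B"],
    "B": ["C"],
    "C": ["B"],
}

scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy network, C and G each have exactly one incoming link, but C ends up far above G because its single link comes from the high-scoring page B.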

PageRank is a measure of the importance of a web page, and we can rank all web pages accordingly. This way, the official website of the NS will receive a very high score and will be at the top of the list for any query related to trains in the Netherlands.

PageRank was designed for web search but has been used for many different applications: detecting communities in social networks, combating web spam, or finding the most endangered species in food webs. Interestingly, food webs have nothing to do with the digital world. They are simply networks of biological species, such as the one in the picture below, which I took at the Lammi biological station in Finland. The links represent ancient laws of nature: a directed edge from a fox to a mouse, for example, means that a fox eats a mouse.

 

One may wonder why an algorithm for web search is suddenly useful in biology.

This is because PageRank in fact solves a much broader problem than just ranking web pages. This problem is known as network centrality, and can be formulated as follows:

Given a network, can we compute which nodes are the most important or central in this network?
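As a small illustration of this question, the simplest conceivable answer is to count incoming links and call the node with the most links the most central. The Python sketch below does just that. The function name `in_degree` and the tiny food web are hypothetical, made up for this example, and counting links is only one of many possible measures. Unlike PageRank, it ignores where the links come from.

```python
from collections import Counter


def in_degree(links):
    """Count the incoming links of every node.

    links maps each node to the list of nodes it points to.
    """
    # Start every listed node at zero so nodes without incoming links appear too.
    counts = Counter({node: 0 for node in links})
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return counts


# Hypothetical toy food web: an edge from X to Y means "X eats Y".
food_web = {
    "fox": ["mouse", "rabbit"],
    "owl": ["mouse"],
    "mouse": ["grain"],
    "rabbit": ["grass"],
}

counts = in_degree(food_web)
for species, count in counts.most_common():
    print(species, count)
```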

Centrality is very important because central nodes can be, for example, crucial hubs in transportation networks or the most influential people in social networks. There are many ways to measure centrality, but we will not elaborate on this here. It is a whole new topic for another post.

Back to web search: by now it has evolved a lot. My colleagues at Google claim that PageRank in its initial form is not used anymore. You probably know from experience that the Google ranking on your screen is highly personalized. For example, it depends on your location and search history.

How can one climb high up in Google's ranking? This question is vital for businesses. So much so that it spurred an entirely new branch of marketing: Search Engine Optimization (SEO). SEO consulting companies help other businesses to improve their Google ranking.

My student Iris is now doing an internship at an SEO company in Hengelo. More than 20 years after Google conquered the web search market thanks to networks and mathematics, Iris uses machine learning algorithms to discover what factors determine the Google ranking today.

One of her results is no surprise to marketing experts: incoming links still play a very important role. The World Wide Web is literally a gigantic web of pages and hyperlinks. And that is why the network approach will remain indispensable in web search.

The featured image is by Lauren Edvalson on Unsplash.
