In this post, I will discuss how you can build a search engine like Google. We will start with a definition of a search engine, and I will go on to present the core concepts (mathematics and implementation).
A search engine is an information retrieval system that returns documents relevant to a user-submitted query, ranked by their relevance to that query. There are many subtle issues in this statement; for example, what does relevance mean in this context? We will make all of these notions clear as we go. In this post, by documents I mean textual documents or webpages. There are, of course, search engines for images, videos, and other types of documents, but those are beyond the scope of this post.
Consider a corpus of textual documents for which we want to build a system that takes a query and returns the documents relevant to it. The easiest way to accomplish this is to return all of the documents that contain the query terms. Such a model is called the Boolean model of information retrieval (BIR). Mathematically, we can represent a document (and a query) as the set of words it contains. The BIR model retrieves the documents whose intersection with the query is non-empty. Notice that the retrieved documents are all indistinguishable; there is no inherent ranking.
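As a minimal sketch, Boolean retrieval can be implemented with Python sets (the three-document corpus below is invented for illustration):

```python
# A toy corpus: document id -> text.
corpus = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog",
    "doc3": "a quick brown dog",
}

def boolean_retrieve(query, corpus):
    """Return ids of documents sharing at least one word with the query."""
    query_words = set(query.lower().split())
    return [doc_id for doc_id, text in corpus.items()
            if query_words & set(text.lower().split())]

print(boolean_retrieve("brown dog", corpus))  # ['doc1', 'doc2', 'doc3']
```

Note that all three documents are returned on equal footing: the model has no way to say that doc3 (which contains both query words) is a better match than doc2.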
Another class of models is the vector space model, and all modern search engines use such a model as their base system. In these models, a document is represented as a vector in a high-dimensional space. The dimension of the space equals the total number of distinct words in the corpus, with one entry per word. Documents are retrieved based on the similarity of the corresponding vectors. For example, if doc = (v1, v2, ..., vk) is the vector corresponding to some document and q = (q1, q2, ..., qk) is the vector corresponding to the query, then the similarity of the document and the query can be computed as a vector similarity (or distance). There are many choices for such a similarity, for example Euclidean distance, Jaccard distance, or Manhattan distance. Most modern search engines use cosine similarity:
cos(doc, q) = (doc . q) / (||doc|| ||q||)
where . denotes the dot product of the vectors and || || the (Euclidean) norm of a vector.
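A direct Python translation of this formula (a sketch, not a production implementation) looks like this:

```python
import math

def cosine_similarity(doc, q):
    """cos(doc, q) = (doc . q) / (||doc|| * ||q||)."""
    dot = sum(d * x for d, x in zip(doc, q))
    norm_doc = math.sqrt(sum(d * d for d in doc))
    norm_q = math.sqrt(sum(x * x for x in q))
    return dot / (norm_doc * norm_q)

print(cosine_similarity([1, 2, 0], [1, 0, 0]))  # 1 / sqrt(5), about 0.447
```

Because the cosine depends only on the angle between the vectors, a long document and a short one with the same word proportions score identically, which is one reason this measure is preferred over raw Euclidean distance.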
There are different schemes for choosing the vector entries as well. In the binary scheme, the entry corresponding to a word is 1 if the word is present in the document and 0 otherwise. The most popular scheme is the so-called term frequency-inverse document frequency (tf-idf), which assigns higher scores to words that are characteristic of the current document. The tf-idf of word w in document d is computed as:
tf-idf(w, d) = tf(w,d) * log(N/n)
where tf(w,d) is the frequency of word w in document d, N is the total number of documents, and n is the number of documents that contain w. Thus tf-idf penalizes words that are present in a large number of documents; such words do not discriminate between documents and make poor features in a model. The retrieved documents are ranked according to the similarity value.
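The formula can be sketched in a few lines of Python (the toy corpus below is made up for illustration):

```python
import math

def tf_idf(word, doc_words, all_docs):
    """tf-idf(w, d) = tf(w, d) * log(N / n), where N is the corpus size
    and n is the number of documents containing w."""
    tf = doc_words.count(word)
    N = len(all_docs)
    n = sum(1 for d in all_docs if word in d)
    return tf * math.log(N / n) if n else 0.0

docs = [["cat", "sat"], ["cat", "cat", "ran"], ["dog", "ran"]]
print(tf_idf("cat", docs[1], docs))  # 2 * log(3/2), about 0.81
```

A word occurring in every document gets idf = log(N/N) = 0, so it contributes nothing to any document vector, exactly the penalizing effect described above.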
There are also probabilistic models for retrieving documents, where we estimate the probability that document d is relevant to query q, and the documents are then ranked by this probability estimate.
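The best-known ranking function from this family is Okapi BM25. Here is a rough single-file sketch using the common default parameters k1 = 1.5 and b = 0.75 (the toy corpus is invented for illustration):

```python
import math

def bm25(query_words, doc_words, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for a query.
    corpus is a list of tokenized documents."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    dl = len(doc_words)
    score = 0.0
    for w in set(query_words):
        n = sum(1 for d in corpus if w in d)  # docs containing w
        if n == 0:
            continue
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
        tf = doc_words.count(w)
        # Term frequency saturates via k1; b controls length normalization.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

docs = [["cat", "sat"], ["cat", "cat", "ran"], ["dog", "ran"]]
# The document mentioning "cat" twice outranks the one mentioning it once.
print(bm25(["cat"], docs[1], docs) > bm25(["cat"], docs[0], docs))  # True
```

Unlike raw tf-idf, BM25 saturates the term-frequency contribution, so the tenth occurrence of a word adds far less than the first.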
For implementation of these models and more, check out InfoR.
Naive retrieval based on the vector space model is incredibly expensive. For each search query, we would need to compute its similarity with all the web documents, even though most of the documents don't contain the query words at all. Going by this standard method, it's impossible to achieve results at the speed we experience while using Google. The trick is to store documents in a data structure that facilitates fast computation. The most commonly used such structure is the inverted index. While a forward index stores the list of words for each document, an inverted index stores, for each word, the documents containing it, in a hashmap-style structure:
inverted_index = {word1: [doc1, doc24, doc77], word2: [doc34, doc97], ..., wordk: [doc11, doc22, doc1020, doc23]}
Given such a structure, one can jump to each query word in constant time and compute similarities against only the small set of documents that contain at least one word of the query.
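A toy version in Python might look like this (document names and contents are invented for illustration):

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each word to the list of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in corpus.items():
        for word in set(text.lower().split()):
            index[word].append(doc_id)
    return index

corpus = {"doc1": "quick brown fox", "doc2": "lazy dog", "doc3": "brown dog"}
index = build_inverted_index(corpus)

def candidates(query, index):
    """Only documents containing at least one query word need scoring."""
    return {d for w in query.lower().split() for d in index.get(w, [])}

print(candidates("brown dog", index))  # doc1, doc2 and doc3, in some order
```

In practice each posting list would also carry the term frequencies needed for tf-idf scoring, but the lookup pattern is the same.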
Building such an index at web scale also presents a serious computational problem. This problem motivated the development of the MapReduce paradigm, a distributed, massively parallel computation framework developed at Google. MapReduce has since been applied to all major big data problems. If you are curious, refer to the original paper.
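To illustrate the idea (this is a single-machine simulation of the paradigm, not Google's actual implementation), here are map and reduce steps for building an inverted index:

```python
from itertools import groupby
from operator import itemgetter

def mapper(doc_id, text):
    """Map step: emit a (word, doc_id) pair for each word in the document."""
    for word in set(text.lower().split()):
        yield (word, doc_id)

def reducer(word, doc_ids):
    """Reduce step: collect the posting list for one word."""
    return (word, sorted(doc_ids))

corpus = {"doc1": "quick brown fox", "doc2": "lazy dog", "doc3": "brown dog"}

# Sorting the emitted pairs by key simulates the shuffle phase that
# a real MapReduce cluster performs across machines.
pairs = sorted(p for doc_id, text in corpus.items() for p in mapper(doc_id, text))
index = dict(reducer(word, [d for _, d in group])
             for word, group in groupby(pairs, key=itemgetter(0)))

print(index["brown"])  # ['doc1', 'doc3']
```

In a real deployment the mappers and reducers run on thousands of machines, with the framework handling the shuffle, fault tolerance, and re-execution of failed tasks.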
The working of a search engine can be described in the following steps:
1. Crawl the web to collect documents.
2. Build an inverted index over the collected documents (a MapReduce job at web scale).
3. At query time, use the index to retrieve the candidate documents containing the query words.
4. Rank the candidates by their similarity to the query (e.g., tf-idf weighted cosine similarity) and return them in that order.
Modern web search engines like Google use more than 200 heuristics on top of the standard retrieval model!
In this post, we discussed the central ideas of information retrieval and challenges in building a web search engine. It’s not an easy problem. Even today, 18 years after the origin of Google, significant research, both at universities and in industry, is being pursued in this direction.
Janu Verma is a researcher at the IBM T.J. Watson Research Center, New York. His research interests are in mathematics, machine learning, information visualization, and computational health and biology. He has held research positions at Cornell University, Kansas State University, Tata Institute of Fundamental Research, Indian Institute of Science, and Indian Statistical Institute. He has also written papers for IEEE Vis, KDD, the International Conference on Healthcare Informatics, Computer Graphics and Applications, Nature Genetics, IEEE Sensors Journal, and others. His current focus is on the development of visual analytics systems for prediction and understanding. Check out his personal website.