In this post, I will discuss how you can build a search engine like Google. We will start with a definition of a search engine, and I will go on to present the core concepts (mathematics and implementation).
A search engine is an information retrieval system that returns documents relevant to a user-submitted query, ranked by their relevance to that query. There are many subtle issues in this statement; for example, what does relevance mean in this context? We will make all of these notions clear as we go. In this post, by documents I mean textual documents or webpages. There are, of course, search engines for images, videos, and other types of documents, but those are beyond the scope of this post.
Consider a corpus of textual documents for which we want to build a system that takes a query and returns the documents relevant to it. The easiest way to accomplish this is to return all of the documents that contain the query terms. Such a model is called the Boolean model of information retrieval (BIR). Mathematically, we can represent a document (and a query) as the set of words it contains. The BIR model retrieves the documents whose intersection with the query is non-empty. Notice that the retrieved documents are all indistinguishable; there is no inherent ranking.
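As a minimal sketch, Boolean retrieval can be implemented with Python sets (the three-document corpus below is invented for illustration):

```python
# A toy corpus: document id -> text.
corpus = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog",
    "doc3": "a quick brown dog",
}

def boolean_retrieve(query, corpus):
    """Return ids of documents sharing at least one word with the query."""
    query_words = set(query.lower().split())
    return [doc_id for doc_id, text in corpus.items()
            if query_words & set(text.lower().split())]

print(boolean_retrieve("brown dog", corpus))  # ['doc1', 'doc2', 'doc3']
```

Note that all three documents are returned on equal footing: the model has no way to say that doc3 (which contains both query words) is a better match than doc2.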
Another class of models is the vector space model, and all modern search engines use such a model as their base system. In these models, a document is represented as a vector in a high-dimensional space. The dimension of the space equals the total number of distinct words in the corpus, with one entry per word. Documents are retrieved based on the similarity of the corresponding vectors. For example, if doc = (v1, v2, ..., vk) is the vector corresponding to some document and q = (q1, q2, ..., qk) is the vector corresponding to the query, then the similarity of the document and the query can be computed as a vector similarity (or distance). There are many choices for such a similarity, for example Euclidean distance, Jaccard distance, or Manhattan distance. Most modern search engines use cosine similarity:
cos(doc, q) = (doc . q) / (||doc|| ||q||)
where . denotes the dot product of the vectors and || || the (Euclidean) norm of a vector.
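A direct Python translation of this formula (a sketch, not a production implementation) looks like this:

```python
import math

def cosine_similarity(doc, q):
    """cos(doc, q) = (doc . q) / (||doc|| * ||q||)."""
    dot = sum(d * x for d, x in zip(doc, q))
    norm_doc = math.sqrt(sum(d * d for d in doc))
    norm_q = math.sqrt(sum(x * x for x in q))
    return dot / (norm_doc * norm_q)

print(cosine_similarity([1, 2, 0], [1, 0, 0]))  # 1 / sqrt(5), about 0.447
```

Because the cosine depends only on the angle between the vectors, a long document and a short one with the same word proportions score identically, which is one reason this measure is preferred over raw Euclidean distance.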
There are different schemes for choosing the vector entries as well. In the binary scheme, the entry corresponding to a word is 1 if the word is present in the document and 0 otherwise. The most popular scheme is the so-called term frequency-inverse document frequency (tf-idf), which assigns higher scores to words that are characteristic of the current document. The tf-idf of word w in document d is computed as:
tf-idf(w, d) = tf(w,d) * log(N/n)
where tf(w,d) is the frequency of word w in document d, N is the total number of documents, and n is the number of documents that contain w. Thus tf-idf penalizes words that are present in a large number of documents; such words do not discriminate between documents and make poor features in a model. The retrieved documents are ranked according to the similarity value.
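The formula can be sketched in a few lines of Python (the toy corpus below is made up for illustration):

```python
import math

def tf_idf(word, doc_words, all_docs):
    """tf-idf(w, d) = tf(w, d) * log(N / n), where N is the corpus size
    and n is the number of documents containing w."""
    tf = doc_words.count(word)
    N = len(all_docs)
    n = sum(1 for d in all_docs if word in d)
    return tf * math.log(N / n) if n else 0.0

docs = [["cat", "sat"], ["cat", "cat", "ran"], ["dog", "ran"]]
print(tf_idf("cat", docs[1], docs))  # 2 * log(3/2), about 0.81
```

A word occurring in every document gets idf = log(N/N) = 0, so it contributes nothing to any document vector, exactly the penalizing effect described above.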
There are also probabilistic models for retrieving documents, where we estimate the probability that document d is relevant to query q, and the documents are then ranked by this probability estimate.
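The best-known ranking function from this family is Okapi BM25. Here is a rough single-file sketch using the common default parameters k1 = 1.5 and b = 0.75 (the toy corpus is invented for illustration):

```python
import math

def bm25(query_words, doc_words, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for a query.
    corpus is a list of tokenized documents."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    dl = len(doc_words)
    score = 0.0
    for w in set(query_words):
        n = sum(1 for d in corpus if w in d)  # docs containing w
        if n == 0:
            continue
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)
        tf = doc_words.count(w)
        # Term frequency saturates via k1; b controls length normalization.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

docs = [["cat", "sat"], ["cat", "cat", "ran"], ["dog", "ran"]]
# The document mentioning "cat" twice outranks the one mentioning it once.
print(bm25(["cat"], docs[1], docs) > bm25(["cat"], docs[0], docs))  # True
```

Unlike raw tf-idf, BM25 saturates the term-frequency contribution, so the tenth occurrence of a word adds far less than the first.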
For implementation of these models and more, check out InfoR.
Naive retrieval based on the vector space model is incredibly expensive. For each search query, we would need to compute its similarity with all the web documents, even though most of the documents don't contain the query words at all. Going by this standard method, it's impossible to achieve results at the speed we experience while using Google. The trick is to store documents in a data structure that facilitates fast computation. The most commonly used such structure is the inverted index. While a forward index stores the list of words for each document, an inverted index stores, for each word, the documents containing it, in a hashmap-style structure:
inverted_index = {word1: [doc1, doc24, doc77], word2: [doc34, doc97], ..., wordk: [doc11, doc22, doc1020, doc23]}
Given such a structure, one can jump to each query word in constant time and compute similarities against only the small set of documents that contain at least one word of the query.
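A toy version in Python might look like this (document names and contents are invented for illustration):

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each word to the list of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in corpus.items():
        for word in set(text.lower().split()):
            index[word].append(doc_id)
    return index

corpus = {"doc1": "quick brown fox", "doc2": "lazy dog", "doc3": "brown dog"}
index = build_inverted_index(corpus)

def candidates(query, index):
    """Only documents containing at least one query word need scoring."""
    return {d for w in query.lower().split() for d in index.get(w, [])}

print(candidates("brown dog", index))  # doc1, doc2 and doc3, in some order
```

In practice each posting list would also carry the term frequencies needed for tf-idf scoring, but the lookup pattern is the same.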
Building such an index at web scale also presents a serious computational problem. This problem motivated the development of the MapReduce paradigm, a distributed, massively parallel computation framework developed at Google. MapReduce has since been applied to all major big data problems. If you are curious, refer to the original paper.
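To illustrate the idea (this is a single-machine simulation of the paradigm, not Google's actual implementation), here are map and reduce steps for building an inverted index:

```python
from itertools import groupby
from operator import itemgetter

def mapper(doc_id, text):
    """Map step: emit a (word, doc_id) pair for each word in the document."""
    for word in set(text.lower().split()):
        yield (word, doc_id)

def reducer(word, doc_ids):
    """Reduce step: collect the posting list for one word."""
    return (word, sorted(doc_ids))

corpus = {"doc1": "quick brown fox", "doc2": "lazy dog", "doc3": "brown dog"}

# Sorting the emitted pairs by key simulates the shuffle phase that
# a real MapReduce cluster performs across machines.
pairs = sorted(p for doc_id, text in corpus.items() for p in mapper(doc_id, text))
index = dict(reducer(word, [d for _, d in group])
             for word, group in groupby(pairs, key=itemgetter(0)))

print(index["brown"])  # ['doc1', 'doc3']
```

In a real deployment the mappers and reducers run on thousands of machines, with the framework handling the shuffle, fault tolerance, and re-execution of failed tasks.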
The working of a search engine can be described in the following steps:
1. Crawl the web to collect documents.
2. Build an inverted index over the collected documents (a MapReduce job at web scale).
3. At query time, use the index to retrieve the candidate documents containing the query words.
4. Rank the candidates by their similarity to the query (e.g., tf-idf weighted cosine similarity) and return them in that order.
Modern web search engines like Google use more than 200 heuristics on top of the standard retrieval model!
In this post, we discussed the central ideas of information retrieval and challenges in building a web search engine. It’s not an easy problem. Even today, 18 years after the origin of Google, significant research, both at universities and in industry, is being pursued in this direction.
Janu Verma is a researcher at the IBM T.J. Watson Research Center, New York. His research interests are in mathematics, machine learning, information visualization, and computational health and biology. He has held research positions at Cornell University, Kansas State University, Tata Institute of Fundamental Research, Indian Institute of Science, and Indian Statistical Institute. He has also written papers for IEEE Vis, KDD, the International Conference on Healthcare Informatics, Computer Graphics and Applications, Nature Genetics, IEEE Sensors Journal, and others. His current focus is on the development of visual analytics systems for prediction and understanding. Check out his personal website.