full-text-search, information-retrieval, system-design, inverted-index

How can we design a system for document search?


I was recently asked a system design question where I needed to "design a system for document search", and the first thing that came to mind was how Elasticsearch works. So I came up with the inverted-index approach that is used to support text search. The inverted index has a record for each term. Each record has the list of documents that the term appears in. Documents are identified by an integer document ID, and each list of document IDs is sorted in ascending order.
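
For illustration, here is a minimal sketch of the structure I have in mind (Python; the names are only for this example):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to the ascending list of document IDs it appears in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "foo bar", 2: "bar bird", 3: "foo foo bird"}
print(build_inverted_index(docs))
# {'foo': [1, 3], 'bar': [1, 2], 'bird': [2, 3]}
```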

So I described something along those lines, but I am not sure whether this is how it should work in a distributed fashion: we may have a lot of documents to index, so we need a load balancer, and somehow we need to partition the inverted index or the data. That is, one process will upload the documents, but what process will tokenize them (just one machine, or a group of machines)? Basically, I want to understand the right way to design a system like this with the proper components. How should we talk about this in a system design interview? What are the things we should touch on in an interview for this problem?


What's the right way to design a system for document search in a distributed fashion, with the right components?


Solution

  • OK... it is a vast subject. Elasticsearch was built exactly for that, and so was Google, though from Elasticsearch to Google Search there is a technological gap.

    If you go for your own implementation it is still possible... but there is a ton of work to do to be as efficient as Elasticsearch. The quick answer is: use Elasticsearch.

    You may be curious, or for some reason you may need to write it yourself. So here is how it works:

    TF-IDF and cosine distance

    As you specified, first you will tokenize, then you will represent the tokenized text as a vector, then measure the angular distance between the text and the search word.

    Imagine your language has only 3 words, "foo, bar, bird", so a text with "foo bar bird" can be represented by the 3-dimensional vector [1,1,1]. A text with

    A) "foo foo foo foo bird" will be [4,0,1], and another with

    B) "foo bar" will be [1,1,0].

    If you search for "bar", which is represented by [0,1,0], you look for the text with the minimal angular distance. The angle between your search and B is 45°, which is lower than the 90° angle to A, so B is the better match.

    Of course the language has many more than 3 words, so you will compute the distance between vectors of many more dimensions, because 1 word = 1 dimension :)
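
    To make the toy example concrete, here is a quick sketch (plain Python, just the math) that computes the angles above:

    ```python
    import math

    def angle_deg(u, v):
        """Angle between two vectors, in degrees."""
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return math.degrees(math.acos(dot / norms))

    query = [0, 1, 0]           # "bar"
    a = [4, 0, 1]               # A: "foo foo foo foo bird"
    b = [1, 1, 0]               # B: "foo bar"
    print(angle_deg(query, a))  # -> 90.0
    print(angle_deg(query, b))  # -> 45.0, so B is the closer text
    ```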

    TF-IDF stands for term frequency-inverse document frequency. It weights the frequency of a word in a document by the inverse of the frequency of that word across all documents. The effect is to highlight the important words in a document.

    This article explains it in detail:

    https://en.wikipedia.org/wiki/Tf%E2%80%93idf
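
    As a rough sketch of one common TF-IDF variant (raw term frequency times the log of the inverse document frequency; Lucene and Elasticsearch use tuned formulas, see the link below):

    ```python
    import math
    from collections import Counter

    def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
        """Raw term frequency in `doc`, weighted by the log of
        (corpus size / number of documents containing the term)."""
        tf = Counter(doc)[term]
        df = sum(1 for d in corpus if term in d)
        return tf * math.log(len(corpus) / df) if df else 0.0

    corpus = [["foo", "foo", "foo", "foo", "bird"], ["foo", "bar"]]
    print(tf_idf("bar", corpus[1], corpus))  # rare word -> positive weight
    print(tf_idf("foo", corpus[1], corpus))  # appears everywhere -> 0.0
    ```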

    I have a little surprise for you: this is exactly what is behind the scoring in Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

    In addition, Elasticsearch gives you replication, scalability, distribution, and anything else you could dream of.

    There are still reasons not to use Elasticsearch:


    The scaling and distribution part

    Read about Lucene first: https://en.wikipedia.org/wiki/Apache_Lucene

    It depends greatly on the volume of text to be indexed and on the vocabulary size.

    If it represents 1 MB of text you don't need to distribute it. You will need a distributed inverted index if you index something big like Wikipedia (Wikipedia uses Elasticsearch for its search box).

    "foo" is in texts A, B, C, R, so I will partition my index by term.
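
    As a concrete illustration of that partitioning (the shard count and hashing scheme here are my assumptions): hashing each term to a shard keeps a term's whole posting list on one node, so a single-term lookup touches only one shard.

    ```python
    import hashlib

    NUM_SHARDS = 4  # hypothetical shard count

    def shard_for_term(term: str) -> int:
        """Route a term to a shard by hashing it, so each term's
        posting list lives on exactly one node."""
        digest = hashlib.md5(term.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    # "foo" appears in texts A, B, C, R: that whole posting list
    # is stored on the shard returned by shard_for_term("foo").
    print(shard_for_term("foo"))
    ```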

    I would use a distributed cache with words as keys and lists of pointers to the vectors as values. I would store the values in a memory-mapped file. A search engine must be fast, so if you do the thing yourself you will minimize external libraries. You would use C++.
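
    A minimal sketch of the memory-mapped idea (the file name and fixed-width record layout are assumptions for illustration): the OS pages the file in on demand, so lookups avoid loading the whole index into heap memory.

    ```python
    import mmap
    import struct

    RECORD = struct.Struct("<q")  # 8-byte little-endian offsets

    # Write a tiny index file of hypothetical vector offsets.
    with open("postings.bin", "wb") as f:
        for offset in (0, 4096, 8192):
            f.write(RECORD.pack(offset))

    # Map it and read the 3rd pointer without loading the whole file.
    with open("postings.bin", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        (offset,) = RECORD.unpack_from(mm, 2 * RECORD.size)
        print(offset)  # -> 8192
        mm.close()
    ```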

    At Google they ended up in a situation where the vectors took so much space that they needed to store them across multiple machines, so they invented GFS, a distributed file system.

    The cosine distance computation between multidimensional vectors is time-consuming, so I would go for computation on the GPU, because GPUs are efficient at floating-point operations on matrices and vectors.
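
    A sketch of the batched scoring (NumPy here; the same matrix expression maps directly onto GPU libraries such as CuPy or JAX): scoring every document against the query becomes one matrix-vector product instead of a loop.

    ```python
    import numpy as np

    def cosine_scores(docs: np.ndarray, query: np.ndarray) -> np.ndarray:
        """Cosine similarity of every row of `docs` against `query`,
        computed as a single matrix-vector product."""
        doc_norms = np.linalg.norm(docs, axis=1)
        q_norm = np.linalg.norm(query)
        return (docs @ query) / (doc_norms * q_norm)

    docs = np.array([[4.0, 0.0, 1.0],   # A: "foo foo foo foo bird"
                     [1.0, 1.0, 0.0]])  # B: "foo bar"
    query = np.array([0.0, 1.0, 0.0])   # search: "bar"
    print(cosine_scores(docs, query))   # -> [0.0, 0.707...], B wins
    ```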

    Actually it is a little bit crazy to reimplement all of that unless you have a good reason to do so, for example a very good business model :)

    I would probably use Kubernetes, Docker, and Mesos to virtualize all my components. I would look for something similar to GFS if high volume is needed: https://en.wikipedia.org/wiki/Comparison_of_distributed_file_systems

    You will need to get the text back, so I would use any NIO web server that scales, in any language: nginx to serve the static pages and something like Netty or Vert.x to receive the search and build the answer of links to the texts (it depends on how many users you want to serve per second).
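
    As one possible sketch of the search endpoint (aiohttp here as a stand-in for Netty or Vert.x; the route and response shape are illustrative):

    ```python
    from aiohttp import web

    async def search(request: web.Request) -> web.Response:
        """Non-blocking search endpoint: parse the query, look it up
        in the (placeholder) index, return links to matching texts."""
        q = request.query.get("q", "")
        hits = []  # here: query the distributed inverted index
        return web.json_response({"query": q, "results": hits})

    app = web.Application()
    app.add_routes([web.get("/search", search)])

    if __name__ == "__main__":
        web.run_app(app, port=8080)  # nginx would sit in front for static pages
    ```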

    All of that only if I planned to index something bigger than Wikipedia, and to invent something better than Elasticsearch (a hard task, good luck).

    For example, Wikipedia is less than 1 TB of text.

    Finally

    If you do it with Elasticsearch, you are done in one week, maybe two, and in production. If you do it yourself, you will need at least one senior developer, a data scientist, and an architect, plus something like a year or more depending on the volume of text you want to index. I can't stop asking myself "what is it for?".

    Actually, if you read the source code of Lucene you will know exactly what you need. They have done it; Lucene is the engine of Elasticsearch.

    Twitter uses Lucene for its real-time search.