Tags: performance, search, information-retrieval, alphabetical, inverted-index

Is there an algorithm that takes advantage of an alphabetized inverted index?


I am working on an information retrieval project in Python. Multiple sources I have read, including this book, emphasize storing an inverted index in alphabetical order, but I have not found any advantage in doing so.

Many documents I have read suggest storing items in the following manner:

aardvark -> doc6, doc5, doc10
apple -> doc1, doc8
...
zebra -> doc7

How does storing records alphabetically improve speed? Is there any way I could take advantage of this alphabetical order when retrieving data?


Solution

  • Imagine the index is so large that it cannot fit in the memory of a single machine.
    We would then have to partition it into multiple smaller indices and store them on multiple machines.

    Let's say one machine can store 1,000 entries and we have a total of 100,000 entries to index, meaning that we'll need 100 machines to store all the entries.

    If the keys are stored in alphabetical order, each machine holds one contiguous range of keys, so it becomes easy to look up a word by doing a binary search.

    Example:

    Let's say words with prefixes between aa and ad are stored on machine 1.
    Words with prefixes between ae and ba are stored on machine 2.
    ...
    ...
    ...
    Words with prefixes between yh and zz are stored on machine 100.

    Whenever we get a lookup request, we just binary search on the word's prefix to find the machine where its entry is stored, in O(log n) time, where n is the number of machines.
    If the entries were stored in random order, we would have to search for the word on all the machines one by one, resulting in a time complexity of O(n).
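
Since the question is about Python, here is a minimal sketch of the routing step described above, using the standard-library `bisect` module. The names `SHARD_STARTS` and `shard_for` are hypothetical, and the table has only four machines instead of 100 to keep it short. The boundaries aa, ae and yh come from the example above; bb is an assumed value for illustration.

```python
from bisect import bisect_right

# Hypothetical partition table: the first two-letter prefix stored on each
# machine, in ascending order. Machine i (1-based) holds every prefix p with
# SHARD_STARTS[i - 1] <= p < SHARD_STARTS[i].
SHARD_STARTS = ["aa", "ae", "bb", "yh"]


def shard_for(term: str) -> int:
    """Return the 1-based number of the machine whose range contains term."""
    prefix = term[:2].lower()
    # Binary search: count how many range starts are <= prefix.
    # This takes O(log n) comparisons for n machines.
    idx = bisect_right(SHARD_STARTS, prefix)
    # A prefix that sorts before the first start is sent to machine 1.
    return max(idx, 1)


if __name__ == "__main__":
    for word in ("aardvark", "apple", "zebra"):
        print(word, "-> machine", shard_for(word))
```

The same idea applies inside a single machine: if the terms are kept in a sorted list, `bisect` can locate a term (or the start of a prefix range) without scanning every entry.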