javaalgorithmintellij-ideastring-algorithm

Fast substring search algorithm to be used by a sort of IDE with tens of thousands of very big files


I'm developing something quite similar to an IDE that will handle tens of thousands of very large (text) files and I'm surveying what the state of the art in the subject is.

As an example, Intellij's searching algorithm for standard (non-regex) expressions is pretty much immediate. How do they accomplish this? Are they just keeping some sort of suffix-tree of all the searchable files in memory? Are they just keeping a good portion of the file's contents in memory so they just do a standard KMP almost fully in-memory to avoid any disk IO?

Thanks


Solution

  • Currently, IntelliJ IDEA indexes files in the project, and remembers which 3-grams (sequences of 3 letters or digits) occur in which files. When searching, it splits the query into 3-grams as well, gets the files from the index that contain all those trigrams, intersects those sets and uses a relatively straightforward text search in each of those files to check if they really contain the whole search string.