
Document search on partial words


I am looking for a document search engine (such as Xapian, Whoosh, Lucene, Solr, Sphinx or others) that can search on partial terms.

For example, when searching for the term "brit", the search engine should return documents containing "britney" or "britain" or, in general, any document containing a word that matches the wildcard pattern *brit*.

Tangentially, I noticed that most engines use TF-IDF (term frequency-inverse document frequency) or its derivatives, which operate on full terms rather than partial terms. Besides TF-IDF, are there any other techniques that have been successfully implemented for document retrieval?


Solution

  • With Lucene you can implement this in several ways:

    1.) You can use a wildcard query such as *brit*. (You would have to configure your query parser to allow leading wildcards.)

    2.) You can create an additional field containing the N-grams of all the terms. This results in a larger index, but in many cases searching it is faster than running a leading-wildcard query.

    3.) You can use fuzzy search to handle typing mistakes in the query, e.g. when someone typed "britnei" but wanted to find "britney".

    For wildcard queries and fuzzy search, have a look at the Lucene query syntax docs.
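
To make option 2 more concrete, here is a minimal sketch of the N-gram idea in plain Python. It is not Lucene's API and not necessarily how Lucene implements it internally; the class name NGramIndex, the trigram size n=3, and the whitespace tokenization are all assumptions made for illustration. Each term is split into character trigrams at index time, and a query fragment is answered by intersecting the term sets of its trigrams, followed by a substring check to discard false positives.

```python
from collections import defaultdict


def ngrams(term, n=3):
    """Return the set of character n-grams of a term (empty if len(term) < n)."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}


class NGramIndex:
    """Hypothetical toy index for illustration; not a real library class."""

    def __init__(self, n=3):
        self.n = n
        self.gram_to_terms = defaultdict(set)  # n-gram -> terms containing it
        self.term_to_docs = defaultdict(set)   # term -> ids of documents containing it

    def add(self, doc_id, text):
        # Assumption: lowercase + whitespace split stands in for a real analyzer.
        for term in text.lower().split():
            self.term_to_docs[term].add(doc_id)
            for gram in ngrams(term, self.n):
                self.gram_to_terms[gram].add(term)

    def search(self, fragment):
        """Return ids of documents with at least one term containing the fragment."""
        fragment = fragment.lower()
        grams = ngrams(fragment, self.n)
        if grams:
            # Only terms that contain every n-gram of the fragment can match.
            candidates = set.intersection(
                *(self.gram_to_terms.get(g, set()) for g in grams)
            )
        else:
            # Fragment shorter than n: fall back to scanning the whole vocabulary.
            candidates = set(self.term_to_docs)
        docs = set()
        for term in candidates:
            # Sharing n-grams is necessary but not sufficient, so verify.
            if fragment in term:
                docs |= self.term_to_docs[term]
        return docs


index = NGramIndex()
index.add(1, "britney spears")
index.add(2, "great britain")
index.add(3, "bright lights")
index.add(4, "celebrity news")

print(sorted(index.search("brit")))  # [1, 2, 4]
```

Document 4 matches because "celebrity" contains "brit" in the middle of the word, which is the *brit* behaviour asked for in the question. The extra gram_to_terms mapping is where the larger index size comes from.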