lucene solr

Search for short words with SOLR


I am using SOLR with NGramTokenizerFactory to create search tokens for substrings of words.

The NGramTokenizer is configured with a minimum n-gram size of 3 (minGramSize="3").

This means that I can search for e.g. "unb" and then match the word "unbelievable".

However I have a problem with short words like "I" and "in". These are not indexed by SOLR (I suspect it is because of NGramTokenizer) and therefore I cannot search for them.

I don't want to reduce the minimum word length to 1 or 2, since this creates a huge search index. But I would like SOLR to include whole words whose length is already below this minimum.

How can I do that?

/Carsten


Solution

  • First of all, use the "Analysis Tool" to find out why your words are not being indexed by Solr:

    http://localhost:8080/solr/admin/analysis.jsp
    

    Enter the field name and the text you are searching for, and see which analyzer stage drops your short term. I suggest doing this because you said you only suspect the NGramTokenizer, and you need to be certain which part of the analysis chain filters out your data.

    Then why not simply copy the term into a second field that does not use that analyzer?

    This way your terms are indexed twice and appear both as exact words and as n-grams. You then have to decide how to combine the scores of the two fields at query time.

    I hope this has helped you in some way.

    Some links on indexing into multiple fields and on the copyField directive:

    Indexing data in multiple fields

    Using copy field tag
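
The copy-field suggestion in the answer above could be sketched roughly as the following schema.xml fragment. This is only an illustration under stated assumptions: the field and type names (text_ngram, text_exact, body_ngram, body_exact) and maxGramSize="15" are hypothetical and do not come from the question, and the exact-word analyzer (standard tokenizer plus lowercasing) is just one plausible choice. Merge the pieces into the matching sections of your own schema.

```xml
<!-- Fragment of schema.xml; all names below are hypothetical. -->

<!-- N-gram type as described in the question: substrings of length >= 3. -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer>
    <!-- maxGramSize is an assumed value, not taken from the question. -->
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
</fieldType>

<!-- Whole-word type without n-gramming, so "I" and "in" survive as tokens. -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The same text is indexed once per analysis chain. -->
<field name="body_ngram" type="text_ngram" indexed="true" stored="true"/>
<field name="body_exact" type="text_exact" indexed="true" stored="false"/>

<!-- copyField passes the raw input to the destination field,
     where it is analyzed with the destination's own field type. -->
<copyField source="body_ngram" dest="body_exact"/>
```

At query time you would then search both fields, for example with the dismax parser and something like qf=body_ngram body_exact^2 (the boost value is an arbitrary assumption), and adjust the weights until exact-word matches rank where you want them. After changing the schema, the documents need to be reindexed, and the Analysis Tool can confirm that short words now produce tokens in the new field.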