
Django+Haystack+Whoosh: how to deal with language inflection


Many European languages are inflectional: a single word can appear in text in multiple forms. For example, the Polish word for 'computer', "komputer", has forms such as "komputera", "komputerowi", "komputerem", "komputery", etc.

How should I use django+haystack+whoosh properly to deal with language inflection?

Whenever I search for any form, such as "komputer", "komputera", or "komputerowi", I mean the same thing: "komputer".

In NLP there are two basic approaches: stemming (cutting off suffixes) and lemmatization (converting a form to its base form, e.g. "komputerowi" => "komputer"). There are libraries that can help with both.

My first thought was to write a special template filter that converts every recognized word in a given variable to its base form. I could then use it in the search index templates in django+haystack. If the search query is also converted before it is evaluated by the Whoosh engine, this should work well. See example:

haystack search index template:
    {{some_indexed_text|convert_to_base_form_filter}}

text to index: "Nie ma komputera"  => "Nie ma komputer"  <-- this is what actually gets indexed
 search query: "komputery"         => "komputer"         <-- this will match
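
To make the idea concrete, here is a minimal sketch of the conversion step in plain Python. The `BASE_FORMS` table and the name `convert_to_base_form` are hypothetical, made up for illustration; a real version would look words up in a Polish morphological dictionary or lemmatizer rather than a hand-written dict.

```python
import re

# Hypothetical lookup table (an assumption for this sketch); a real one would
# come from a Polish morphological dictionary or a lemmatizer library.
BASE_FORMS = {
    "komputera": "komputer",
    "komputerowi": "komputer",
    "komputerem": "komputer",
    "komputery": "komputer",
}

WORD_RE = re.compile(r"\w+")


def convert_to_base_form(text):
    """Replace every recognized word with its base form; leave other words as-is."""
    def replace(match):
        word = match.group(0)
        return BASE_FORMS.get(word.lower(), word)

    return WORD_RE.sub(replace, text)


# The same function is applied on both sides: to the text before indexing
# and to the query before searching.
indexed = convert_to_base_form("Nie ma komputera")
query = convert_to_base_form("komputery")
```

The same function could presumably be registered as a Django template filter and also called on the query string before it reaches the search backend, so that both sides are normalized identically.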

But I don't think this is an "elegant" solution to the problem; also, some other features won't work, such as spelling suggestions.

So, how should I solve this? Maybe I should use a search engine other than Whoosh?


Solution

  • Whoosh, by default, only has stemming for English.
    To implement stemming for another language, first look inside:

    /your_path_to_whoosh/whoosh/lang/analysis.py
    

    This is where StemmingAnalyzer (the default analyzer) is defined, and it is an excellent starting point. The stem function, imported from porter.py, is the other important place to look.
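
For orientation, a stem function is just a callable that takes a word and returns its stem. The sketch below is a toy suffix stripper for the Polish forms from the question, not a real Polish stemmer; the name `stem_pl`, the suffix list, and the minimum stem length are all assumptions made up for illustration.

```python
# Hypothetical and deliberately incomplete suffix list, longest first so that
# "owi" is tried before shorter endings.
POLISH_SUFFIXES = ("owi", "em", "a", "y")


def stem_pl(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    word = word.lower()
    for suffix in POLISH_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


# All inflected forms from the question collapse to the same stem.
stems = [stem_pl(w) for w in ("komputer", "komputera", "komputerowi", "komputerem", "komputery")]

# Assumption: a function with this shape can be handed to Whoosh's stemming
# filter (e.g. whoosh.analysis.StemFilter(stemfn=stem_pl)); check this against
# the Whoosh version you have installed.
```

A naive suffix list like this will over-stem or under-stem many real words, so it only shows the shape of the function, not a usable rule set.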

    So, the three steps are: