Many languages in Europe are inflectional, which means that one word can appear in text in many different forms. For example, the word 'computer' in Polish, "komputer", has multiple forms: "komputera", "komputerowi", "komputerem", "komputery", etc.
How should I use django+haystack+whoosh properly to deal with language inflection?
Whenever I search for any of the forms "komputer", "komputera", "komputerowi", I mean the same thing: "komputer".
In NLP there is a basic approach based either on stemming words (cutting off suffixes) or on converting a form to its base form ("komputerowi" => "komputer"). There are some libraries that can help with that.
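To make the idea concrete, a deliberately naive suffix-stripping sketch (a toy only, not a real Polish stemmer - any real solution would use one of those libraries) could look like this:

    # Toy sketch: strip a few common Polish endings, longest first.
    SUFFIXES = ("owi", "ami", "ach", "em", "ie", "a", "y", "u")

    def naive_stem(word):
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    naive_stem("komputerowi")  # -> "komputer"
    naive_stem("komputery")    # -> "komputer"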
My first thought was to prepare a special template filter that converts every recognized word in a given variable to its base form. Then I could use it in the search index templates in django+haystack. If the search query is also converted to base forms before it is evaluated by the whoosh engine, this should work great. See the example:
haystack search index template:
{{some_indexed_text|convert_to_base_form_filter}}
text to index: "Nie ma komputera" => "Nie ma komputer" <- this is what actually gets indexed
search query: "komputery" => "komputer" <-- this will match
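For completeness, the filter itself could be an ordinary Django template filter. This is only a sketch; to_base_form is a hypothetical helper that wraps whatever stemming/lemmatization library ends up being used:

    # myapp/templatetags/inflection.py - rough sketch
    from django import template

    from myapp.nlp import to_base_form  # hypothetical wrapper around a stemmer

    register = template.Library()

    @register.filter
    def convert_to_base_form_filter(value):
        # Replace every word with its base form before it gets indexed.
        return " ".join(to_base_form(word) for word in value.split())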
But I don't think this is an elegant solution to the problem, and some other features won't work either - like spelling suggestions for misspelled queries.
So - how should I solve this issue? Maybe I should use a different search engine than whoosh?
Whoosh has, by default, stemming only for the English language.
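You can check this quickly from a Python shell; calling the analyzer on a string yields tokens whose text has already been run through the (English, Porter-based) stemmer, for example:

    from whoosh.analysis import StemmingAnalyzer

    ana = StemmingAnalyzer()
    print([token.text for token in ana("computers are computing")])
    # roughly ['comput', 'comput'] - "are" is dropped as a stopword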
To implement stemming for another language, first look inside:
/your_path_to_whoosh/whoosh/analysis.py
This is where StemmingAnalyzer (the default analyzer) is defined, and it is an excellent starting point. The stem function, imported from porter.py, is the other important place to look.
So, the three steps are:
Implement your own stemming function, taking as a reference the stem function in porter.py and any grammar and language references you will need to get the rules right.
Implement your own Analyzer, taking as a reference StemmingAnalyzer inside analysis.py. The file is heavily documented, so you should have no problem navigating through it. You'll see that StemmingAnalyzer is basically a chaining of a Tokenizer with a regex to match words, a lowercase filter and the stemming filter, which basically calls the above stemming function. You'll also see that StemFilter takes the stemming function as a parameter, so you don't have to reimplement the filter.
Pass your brand new Analyzer at schema creation time, see here: http://files.whoosh.ca/whoosh/docs/latest/schema.html#creating-a-schema
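Putting the three steps together, a rough sketch (polish_stem below is only a placeholder for the stemming function you write in step 1) could look like this:

    from whoosh.analysis import RegexTokenizer, LowercaseFilter, StemFilter
    from whoosh.fields import Schema, ID, TEXT

    def polish_stem(word):
        # Step 1: your own stemming rules go here; this placeholder does nothing.
        return word

    # Step 2: chain a word tokenizer, a lowercase filter and StemFilter,
    # mirroring what StemmingAnalyzer does for English.
    polish_analyzer = RegexTokenizer() | LowercaseFilter() | StemFilter(polish_stem)

    # Step 3: pass the analyzer at schema creation time.
    schema = Schema(
        id=ID(stored=True, unique=True),
        content=TEXT(analyzer=polish_analyzer),
    )

To get django-haystack to use such an analyzer you would, as far as I know, have to customize how its Whoosh backend builds the schema, but that is a separate question.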