java lucene query-parser

Lucene analyzer to handle yo and ye (Russian characters)


I'm using Lucene with StandardAnalyzer to create indexes in my code. However, there is a problem with 'Yo' and 'Ye' (Ё and Е).

I want a search for 'yo' to also yield results containing 'ye', and vice versa. I tried to create a new Analyzer class, similar to StandardAnalyzer, with a custom filter, but had no luck. I'm also aware of RussianAnalyzer, but it doesn't work for me, as it treats 'yo' and 'ye' as separate characters.

Here is the code where I use this analyzer:

QueryParser queryParser = new QueryParser("myText", new MyAnalyzer());
queryParser.setDefaultOperator(QueryParser.Operator.AND);

After this I call queryParser.parse() and build the rest of the query for searching.

The question is: what is the right way to do this? Should I use a custom TokenFilter? Or maybe my own CharFilter?

Wikipedia links for the characters in question: https://en.wikipedia.org/wiki/Yo_(Cyrillic) https://en.wikipedia.org/wiki/Ye_(Cyrillic)


Solution

  • At first glance, I think you need to create a CharFilter that maps 'yo' to 'ye'. This substitution already happens occasionally through human error (see the 'Yo' page above), so mapping 'yo' -> 'ye' is the direction most likely to find what you want. Remember that the mapping must be applied at search time as well as at index time; otherwise a query containing 'yo' will not match the folded terms in the index.
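
To make the 'yo' -> 'ye' mapping concrete, here is a minimal sketch of the character-level folding using only the JDK. The class name YoFoldingReader and its helper methods are hypothetical and are not part of Lucene. The sketch wraps a Reader and replaces Ё/ё (U+0401/U+0451) with Е/е (U+0415/U+0435) as text is read, which is the same kind of one-to-one substitution a CharFilter would apply before tokenization.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper (not a Lucene class): folds Cyrillic Yo into Ye while reading. */
class YoFoldingReader extends FilterReader {

    YoFoldingReader(Reader in) {
        super(in);
    }

    /** Maps U+0401 to U+0415 and U+0451 to U+0435; every other character is unchanged. */
    static char fold(char c) {
        switch (c) {
            case '\u0401': return '\u0415'; // capital Yo -> capital Ye
            case '\u0451': return '\u0435'; // small yo -> small ye
            default: return c;
        }
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c < 0 ? c : fold((char) c); // pass end-of-stream (-1) through untouched
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = fold(buf[i]);
        }
        return n;
    }

    /** Convenience method for trying the mapping on a whole string. */
    static String foldAll(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new YoFoldingReader(new StringReader(text))) {
            int c;
            while ((c = r.read()) >= 0) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Because the mapping is one character to one character, token offsets are not shifted. In Lucene itself you would probably not hand-write a reader like this: the usual route is to override Analyzer.initReader(String, Reader) in your analyzer and wrap the incoming reader in a MappingCharFilter built from a NormalizeCharMap.Builder that holds the two mappings. Passing the same analyzer to both IndexWriterConfig and QueryParser then applies the folding on both the indexing and the searching side.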