lucene, lucene.net

Is there a Lucene Analyzer that ignores the difference between a Greek symbol and its phonetic English name?


Ideally, I'd like something that otherwise acts like a StandardAnalyzer but treats every Greek symbol as equivalent to its English phonetic spelling ("beta" == "β", "omega" == "ω"). I looked at the ICU analyzer, but it doesn't go quite that far. If no such analyzer exists, do you have a suggestion for the most efficient way to design one?


Solution

  • After researching @Val's suggestion, I put this together. I'm not sure it's quite right, but I'm saving it here in case anyone finds it a useful starting point.

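        using Lucene.Net.Analysis;
        using Lucene.Net.Analysis.CharFilters;
        using Lucene.Net.Analysis.Core;
        using Lucene.Net.Analysis.Standard;
        using Lucene.Net.Util;

        private static Analyzer GetGreekSymbolAgnosticAnalyzer()
        {
            // Map each Greek symbol to its English phonetic name before tokenization.
            // Only three letters are mapped here; add the rest (and uppercase forms) as needed.
            NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
            builder.Add("α", "alpha");
            builder.Add("β", "beta");
            builder.Add("ω", "omega");
    
            NormalizeCharMap norm = builder.Build();
            Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
            {
                Tokenizer tokenizer = new StandardTokenizer(LuceneVersion.LUCENE_48, reader);
                TokenStream stream = new StandardFilter(LuceneVersion.LUCENE_48, tokenizer);
                // Lowercase tokens, as StandardAnalyzer does, so "Beta" also matches "β".
                stream = new LowerCaseFilter(LuceneVersion.LUCENE_48, stream);
                return new TokenStreamComponents(tokenizer, stream);
            }, initReader: (fieldName, reader) => new MappingCharFilter(norm, reader));
    
            return analyzer;
        }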
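
To check what the analyzer actually emits, it can help to print the tokens it produces for a sample string. The following is a minimal sketch, assuming Lucene.NET 4.8; the class name `AnalyzerDebug`, the method name `PrintTokens`, and the field name "body" are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

public static class AnalyzerDebug
{
    // Hypothetical helper: writes each token the analyzer produces for `text` to the console.
    public static void PrintTokens(Analyzer analyzer, string text)
    {
        // "body" is an arbitrary field name; the analyzer above ignores it.
        using (TokenStream stream = analyzer.GetTokenStream("body", new StringReader(text)))
        {
            ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();                      // must be called before the first IncrementToken()
            while (stream.IncrementToken())
            {
                Console.WriteLine(term.ToString());
            }
            stream.End();                        // signals end of stream before disposal
        }
    }
}
```

Calling `AnalyzerDebug.PrintTokens(GetGreekSymbolAgnosticAnalyzer(), "β-carotene")` and then again with "beta-carotene" should print the same tokens if the mapping works as intended. The same analyzer has to be used at both index time and query time for the equivalence to hold.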