lucene, pylucene

Problem chaining a tokenizer with filters in a PythonAnalyzer in PyLucene


I'm new to PyLucene. I managed to install it on Ubuntu and looked at [sample code][1] showing how a custom analyzer is implemented. I tried modifying it by adding an NGramTokenFilter, but I keep getting an error when printing out the result of the custom analyzer. If I remove the ngram filter, it works just fine.

Basically, what I'm trying to do is split all incoming text on whitespace, lowercase it, fold it to plain ASCII, and generate ngrams.

The code is as follows:

class myAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName):
        source = WhitespaceTokenizer()
        filter = LowerCaseFilter(source)
        filter = ASCIIFoldingFilter(filter)
        filter = NGramTokenFilter(filter,1,2)

        return self.TokenStreamComponents(source, filter)

    def initReader(self, fieldName, reader):
        return reader


analyzer = myAnalyzer()
stream = analyzer.tokenStream("", StringReader("MARGIN wondêrfule"))
stream.reset()
tokens=[]
while stream.incrementToken():
    tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
print(tokens)

The error I keep getting is:

InvalidArgsError: (<class 'org.apache.lucene.analysis.ngram.NGramTokenFilter'>, '__init__', (<ASCIIFoldingFilter: ASCIIFoldingFilter@192d74fb term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1>, 2, 3))

What am I doing wrong?


Solution

  • Looking at the JavaDoc for NGramTokenFilter, you have to use this:

    NGramTokenFilter(filter, size)
    

    for a fixed ngram size; or this:

    NGramTokenFilter(filter, min, max, boolean) 
    

    for a range of ngram sizes with a boolean for preserveOriginal, which determines:

    Whether or not to keep the original term when it is shorter than minGram or longer than maxGram

    What you have is neither of those: a three-argument call matches no constructor, which is why PyLucene raises InvalidArgsError, meaning it could not map your arguments to any overload of the underlying Java class.

    (Side note: I'm not sure an ngram of size 1 makes a lot of sense - but maybe it's what you need.)
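
    For instance, assuming the same imports and filter chain as in the question, either of these calls would be accepted (a sketch based on the two signatures above):

    # fixed gram size: emit only bigrams
    filter = NGramTokenFilter(filter, 2)

    # gram sizes 1 to 2; True keeps the original term when its length
    # falls outside the min/max range
    filter = NGramTokenFilter(filter, 1, 2, True)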


    Update

    Just for completeness, here is my (somewhat modified) standalone version of the code in the question. The most relevant part is this line:

    result = NGramTokenFilter(filter, 1, 2, True)
    

    The program (using PyLucene 9.4.1 and Java 11):

    import lucene
    from org.apache.pylucene.analysis import PythonAnalyzer
    from org.apache.lucene.analysis import Analyzer
    from java.io import StringReader
    from org.apache.lucene.analysis.core import WhitespaceTokenizer, LowerCaseFilter
    from org.apache.lucene.analysis.miscellaneous import ASCIIFoldingFilter
    from org.apache.lucene.analysis.ngram import NGramTokenFilter
    from org.apache.lucene.analysis.tokenattributes import CharTermAttribute

    class myAnalyzer(PythonAnalyzer):

        def __init__(self):
            PythonAnalyzer.__init__(self)

        def createComponents(self, fieldName):
            # Split on whitespace, lowercase, fold accented characters
            # to ASCII, then emit ngrams of size 1 to 2, preserving the
            # original term.
            source = WhitespaceTokenizer()
            filter = LowerCaseFilter(source)
            filter = ASCIIFoldingFilter(filter)
            result = NGramTokenFilter(filter, 1, 2, True)

            return Analyzer.TokenStreamComponents(source, result)

        def initReader(self, fieldName, reader):
            return reader


    lucene.initVM(vmargs=['-Djava.awt.headless=true'])
    analyzer = myAnalyzer()
    stream = analyzer.tokenStream("", StringReader("MARGIN wondêrfule"))
    stream.reset()
    tokens = []
    while stream.incrementToken():
        tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
    stream.end()    # finish and release the stream (idiomatic TokenStream use)
    stream.close()
    print(tokens)
    

    The output:

    ['m', 'ma', 'a', 'ar', 'r', 'rg', 'g', 'gi', 'i', 'in', 'n', 'margin', 'w', 'wo', 'o', 'on', 'n', 'nd', 'd', 'de', 'e', 'er', 'r', 'rf', 'f', 'fu', 'u', 'ul', 'l', 'le', 'e', 'wonderfule']
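
    Note that the whole terms margin and wonderfule survive in the output only because preserveOriginal is True: both are longer than maxGram (2), so with False they would be dropped entirely. (Also note that ASCIIFoldingFilter has already turned wondêrfule into wonderfule before the ngrams are generated.)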