I'm new to PyLucene. I managed to install it on Ubuntu and looked at the [sample code][1] showing how a custom analyzer is implemented. I tried modifying it by adding an NGramTokenFilter, but I keep getting an error when printing out the result of the custom analyzer. If I remove the ngram filter, it works just fine.

Basically, what I'm trying to do is split all incoming text on whitespace, lowercase it, fold accented characters to ASCII, and generate ngrams.
The code is as follows:
```python
class myAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName):
        source = WhitespaceTokenizer()
        filter = LowerCaseFilter(source)
        filter = ASCIIFoldingFilter(filter)
        filter = NGramTokenFilter(filter, 1, 2)
        return self.TokenStreamComponents(source, filter)

    def initReader(self, fieldName, reader):
        return reader

analyzer = myAnalyzer()
stream = analyzer.tokenStream("", StringReader("MARGIN wondêrfule"))
stream.reset()
tokens = []
while stream.incrementToken():
    tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
print(tokens)
```
The error I keep getting is:
```
InvalidArgsError: (<class 'org.apache.lucene.analysis.ngram.NGramTokenFilter'>, '__init__', (<ASCIIFoldingFilter: ASCIIFoldingFilter@192d74fb term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1>, 2, 3))
```
What am I doing wrong?
Looking at the JavaDoc for NGramTokenFilter, you have to use either this:
```python
NGramTokenFilter(filter, size)
```
for a fixed ngram size; or this:
```python
NGramTokenFilter(filter, min, max, boolean)
```

for a range of ngram sizes, where the boolean is `preserveOriginal`, which determines:

> Whether or not to keep the original term when it is shorter than minGram or longer than maxGram
What you have is neither of those.
(Side note: I'm not sure an ngram of size 1 makes a lot of sense, but maybe it's what you need.)
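In PyLucene, those two overloads map onto the constructor calls below (a minimal sketch reusing the variable names from the question):

```python
# Either: a single fixed gram size (here, bigrams only)
result = NGramTokenFilter(filter, 2)

# Or: a min/max gram range plus the preserveOriginal flag
result = NGramTokenFilter(filter, 1, 2, True)
```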
**Update**
Just for completeness, here is my (somewhat modified) standalone version of the code in the question. The most relevant part is this line:
```python
result = NGramTokenFilter(filter, 1, 2, True)
```
The program (using PyLucene 9.4.1 and Java 11):
```python
import sys, lucene, unittest
from java.io import StringReader

from org.apache.pylucene.analysis import PythonAnalyzer
from org.apache.lucene.analysis import Analyzer
from org.apache.lucene.analysis.core import WhitespaceTokenizer, LowerCaseFilter
from org.apache.lucene.analysis.miscellaneous import ASCIIFoldingFilter
from org.apache.lucene.analysis.ngram import NGramTokenFilter
from org.apache.lucene.analysis.tokenattributes import CharTermAttribute


class myAnalyzer(PythonAnalyzer):

    def __init__(self):
        PythonAnalyzer.__init__(self)

    def createComponents(self, fieldName):
        # Tokenize on whitespace, lowercase, fold accents to ASCII,
        # then emit 1- and 2-grams (keeping the original term, too).
        source = WhitespaceTokenizer()
        filter = LowerCaseFilter(source)
        filter = ASCIIFoldingFilter(filter)
        result = NGramTokenFilter(filter, 1, 2, True)
        return Analyzer.TokenStreamComponents(source, result)

    def initReader(self, fieldName, reader):
        return reader


lucene.initVM(vmargs=['-Djava.awt.headless=true'])

analyzer = myAnalyzer()
stream = analyzer.tokenStream("", StringReader("MARGIN wondêrfule"))
stream.reset()
tokens = []
while stream.incrementToken():
    tokens.append(stream.getAttribute(CharTermAttribute.class_).toString())
print(tokens)
```
The output (note that, because `preserveOriginal` is `True`, the complete terms `margin` and `wonderfule` appear alongside the 1- and 2-grams):
```
['m', 'ma', 'a', 'ar', 'r', 'rg', 'g', 'gi', 'i', 'in', 'n', 'margin', 'w', 'wo', 'o', 'on', 'n', 'nd', 'd', 'de', 'e', 'er', 'r', 'rf', 'f', 'fu', 'u', 'ul', 'l', 'le', 'e', 'wonderfule']
```
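One last detail: Lucene's TokenStream contract also expects `end()` and `close()` to be called once the stream has been consumed; the demo above omits them for brevity. A fuller consume loop, sketched with the same variables as above, might look like this:

```python
stream = analyzer.tokenStream("", StringReader("MARGIN wondêrfule"))
termAtt = stream.addAttribute(CharTermAttribute.class_)

tokens = []
stream.reset()                     # mandatory before the first incrementToken()
while stream.incrementToken():
    tokens.append(termAtt.toString())
stream.end()                       # signal that the stream is exhausted
stream.close()                     # release resources held by the filter chain
print(tokens)
```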