I want to write a custom Similarity class in PyLucene to implement my own retrieval model.
In the Java version of Lucene, you would extend the Similarity class and override its methods. For example:
    public class IDFSimilarity extends TFIDFSimilarity {

        /** Sole constructor: parameter-free */
        public IDFSimilarity() {
        }

        /** Implemented as <code>overlap / maxOverlap</code>. */
        @Override
        public float coord(int overlap, int maxOverlap) {
            return overlap / (float) maxOverlap;
        }

        /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
        @Override
        public float queryNorm(float sumOfSquaredWeights) {
            return (float) (1 / Math.sqrt(sumOfSquaredWeights));
        }
        .
        .
        etc
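For reference, the arithmetic in those two overrides can be checked in plain Python without Lucene installed. This is only a sketch of the two formulas quoted in the Javadoc comments; the function names `coord` and `query_norm` are local helpers that mirror the Java methods, not part of any PyLucene API.

```python
import math


def coord(overlap: int, max_overlap: int) -> float:
    """Fraction of the query terms that the document matched."""
    return overlap / float(max_overlap)


def query_norm(sum_of_squared_weights: float) -> float:
    """Normalization factor 1 / sqrt(sumOfSquaredWeights)."""
    return 1.0 / math.sqrt(sum_of_squared_weights)


if __name__ == "__main__":
    print(coord(2, 4))       # 2 of 4 query terms matched
    print(query_norm(4.0))   # 1 / sqrt(4)
```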
However, PyLucene uses JCC, and it is not clear to me how you can extend the class in a Python script. It would be something like:
    import lucene
    from org.apache.lucene.search.similarities import TFIDFSimilarity

    class IDFSimilarity(TFIDFSimilarity):
        def __init__(self):
            TFIDFSimilarity.__init__(self)
        ?
        ?
but I do not know how to proceed. I cannot find any examples or documentation online.
Any ideas?
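From @JanŠpaček's comment on the original question, thanks!
There is an example of defining a Similarity in Python in the PyLucene sources. PyLucene provides Python-extensible wrapper classes under org.apache.pylucene, so you subclass PythonClassicSimilarity rather than the Java class directly: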
    from org.apache.lucene.search import Explanation
    from org.apache.pylucene.search.similarities import PythonClassicSimilarity

    class SimpleSimilarity(PythonClassicSimilarity):

        def lengthNorm(self, numTerms):
            return 1.0

        def tf(self, freq):
            return freq

        def sloppyFreq(self, distance):
            return 2.0

        def idf(self, docFreq, numDocs):
            return 1.0

        def idfExplain(self, collectionStats, termStats):
            return Explanation.match(1.0, "inexplicable", [])
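To get a feel for what the overrides above do to ranking, here is a plain-Python sketch that runs without PyLucene. It assumes the per-term formula tf × idf² × lengthNorm from the classic TF-IDF scoring description and ignores boosts and norm encoding, so treat it as an illustration rather than what Lucene computes internally. `SimpleSimilaritySketch`, `term_score`, and `document_score` are hypothetical names made up for this example.

```python
class SimpleSimilaritySketch:
    """Plain-Python stand-in that mirrors the overrides in the answer."""

    def lengthNorm(self, numTerms):
        return 1.0   # document length is ignored

    def tf(self, freq):
        return freq  # raw term frequency, no dampening

    def idf(self, docFreq, numDocs):
        return 1.0   # term rarity is ignored


def term_score(sim, freq, doc_freq, num_docs, num_terms):
    """Assumed per-term score: tf * idf^2 * lengthNorm (boosts ignored)."""
    idf = sim.idf(doc_freq, num_docs)
    return sim.tf(freq) * idf * idf * sim.lengthNorm(num_terms)


def document_score(sim, term_stats, num_docs, num_terms):
    """Sum per-term scores; term_stats is a list of (freq, doc_freq) pairs."""
    return sum(
        term_score(sim, freq, doc_freq, num_docs, num_terms)
        for freq, doc_freq in term_stats
    )


if __name__ == "__main__":
    sim = SimpleSimilaritySketch()
    # Two matching terms with frequencies 3 and 2 in a 50-term document.
    print(document_score(sim, [(3, 10), (2, 500)], num_docs=1000, num_terms=50))
```

Under these assumptions the score is just the sum of the raw term frequencies, regardless of how rare each term is or how long the document is. To implement a different retrieval model, you would change what `tf`, `idf`, and `lengthNorm` return.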