In PyLucene, there is a filter called StopFilter, which filters tokens based on a given set of stop words. An example call is as follows:
result = StopFilter(True, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
It seems like it should be easy to replace the argument for the set of stop words, but this is actually a bit challenging:
>>> StopAnalyzer.ENGLISH_STOP_WORDS_SET
<Set: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of]>
This is a Set (a wrapped Java interface), which cannot be instantiated directly:
>>> Set()
NotImplementedError: ('instantiating java class', <type 'Set'>)
It was suggested elsewhere to use a PythonSet, which comes with PyLucene, but it turns out that this is not an instance of a Set and cannot be used with a StopFilter.
How can one give a StopFilter a new set of stop words?
I discovered the answer to this halfway through writing this question via this thread on the pylucene dev list:
http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201202.mbox/thread
You can define a StopFilter using a custom list as follows:
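# Assumes the flat "lucene" namespace of older PyLucene releases
from lucene import Arrays, HashSet, StopFilter
mystops = HashSet(Arrays.asList(['a','b','c']))
result = StopFilter(True, result, mystops)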
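For intuition about what the custom set changes, here is a minimal plain-Python sketch of stop-word filtering. It is not PyLucene code and does not reproduce StopFilter's internals; filter_stop_words is a hypothetical helper made up for illustration, and the token list is invented sample data.

```python
def filter_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving order.

    Hypothetical helper: it only mimics the idea of dropping tokens
    that appear in a stop set, not the real StopFilter implementation.
    """
    stops = set(stop_words)  # set membership checks are cheap
    return [tok for tok in tokens if tok not in stops]


# Invented sample tokens, filtered with the same custom list as above.
tokens = ["a", "quick", "b", "brown", "c", "fox"]
kept = filter_stop_words(tokens, ["a", "b", "c"])
print(kept)
```

With the real StopFilter the same idea applies to a token stream rather than a Python list, and the stop set has to be a Java Set such as the HashSet built above.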