python filter set stop-words pylucene

Custom stopwords for PyLucene


PyLucene has a filter called StopFilter, which removes tokens that appear in a given set of stop words. An example call:

result = StopFilter(True, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET)

It seems like it should be easy to replace the argument for the set of stop words, but this is actually a bit challenging:

>>> StopAnalyzer.ENGLISH_STOP_WORDS_SET

<Set: [but, be, with, such, then, for, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, these, by, to, of]>

This is a Java Set, an interface that cannot be instantiated directly from Python:

>>> Set()

NotImplementedError: ('instantiating java class', <type 'Set'>)

It was suggested elsewhere to use a PythonSet, which comes with PyLucene, but it turns out that this is not an instance of a Set, and cannot be used with a StopFilter.

How can one give a StopFilter a new set of stop words?


Solution

  • I discovered the answer to this halfway through writing this question via this thread on the pylucene dev list:

    http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201202.mbox/thread

    You can define a StopFilter using a custom list as follows:

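    # Assumes PyLucene 3.x, where these classes are importable from the lucene module:
    # from lucene import HashSet, Arrays, StopFilter
    mystops = HashSet(Arrays.asList(['a','b','c']))
    result = StopFilter(True, result, mystops)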
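
If the custom stop words come from a file or from user input, it may help to clean them up before wrapping them in a HashSet. As far as I know, a StopFilter given a plain set matches tokens case-sensitively. The following is a minimal sketch in plain Python, with no PyLucene calls; `prepare_stopwords` and the sample input are hypothetical names made up for illustration.

```python
def prepare_stopwords(words):
    """Return a sorted list of unique, lowercased, non-empty stop words."""
    cleaned = set()
    for word in words:
        # Strip surrounding whitespace/newlines and normalize case.
        word = word.strip().lower()
        if word:
            cleaned.add(word)
    return sorted(cleaned)


# Hypothetical input, e.g. lines read from a stop-word file.
raw = ["The", " a ", "", "b", "the", "C\n"]
stop_list = prepare_stopwords(raw)
print(stop_list)
```

The resulting list could then be passed along in the same way as the literal list above, i.e. HashSet(Arrays.asList(stop_list)).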