python binary-search stop-words bisect frozenset

An alternative for binary search on a frozenset in Python



I need to perform a binary search on a frozenset, but since a frozenset doesn't support indexing, I can't use the bisect module. I thought of converting the frozenset to a list to make things easy, but the conversion (list(frozenset)) does not preserve the order, so I can't binary-search the result. What solution do you suggest?
To be clearer, here is exactly what I'm doing: in an NLP task, I need to remove stop words from my text, so I imported the stop words from scikit-learn (in my opinion it has a better collection of stop words than NLTK):
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
This gives me a frozenset in which the stop words appear to be in alphabetical order. To remove stop words from my text, I want to check whether a token is in the stop words using binary search (since the stop words are in alphabetical order, binary search should be efficient). So I wrote the following:

import bisect

bisect.bisect(ENGLISH_STOP_WORDS, word)

And this is where I'm stuck! I expected the code above to give me the position of the word in the stop-word list, so that I could compare my word with the entries just before and after it. Instead I get this error: TypeError: 'frozenset' object does not support indexing.

FYI, I have not tried the stop-word lists from other libraries (spaCy, gensim, etc.), so I don't know if they work better in this case. But the main point here is to learn how to handle binary search on a frozenset. Thanks in advance.


Solution

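  • If you only want to know whether the word is a stop word, no binary search is needed. Membership testing on a frozenset is a hash lookup, which takes constant time on average. Simply do: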

    if word in ENGLISH_STOP_WORDS:
        pass
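
To make the membership approach concrete, here is a minimal sketch. It assumes a small hypothetical STOP_WORDS frozenset standing in for ENGLISH_STOP_WORDS so that it runs without scikit-learn; the helper names remove_stop_words and is_stop_word_bisect are illustrative and not part of any library. The second helper shows one way to get a bisect-compatible sequence if a sorted position is really needed: sort the frozenset once with sorted(), then search the resulting list.

```python
import bisect

# Hypothetical stand-in for sklearn's ENGLISH_STOP_WORDS (assumption, not the real list).
STOP_WORDS = frozenset({"the", "a", "is", "of", "and"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are not stop words, using hash-based membership tests."""
    return [t for t in tokens if t.lower() not in stop_words]


# Only if a sorted position is needed: sort once, then bisect the list.
SORTED_STOP_WORDS = sorted(STOP_WORDS)


def is_stop_word_bisect(word, sorted_words=SORTED_STOP_WORDS):
    """Binary-search membership test on a sorted list (O(log n) per lookup)."""
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


filtered = remove_stop_words(["The", "cat", "is", "on", "a", "mat"])
```

For plain stop-word removal the first helper is likely all that is needed; the bisect variant is mainly useful when the neighbouring entries in sorted order matter.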