I need to perform a binary search on a frozenset, but because a frozenset doesn't support indexing, I can't use the bisect library. I thought of converting the frozenset to a list to make things easy, but the conversion (list(frozenset)) scrambles the order, so I can't perform a binary search on the result. What solution do you suggest?
To be clearer, here is exactly what I'm doing: in an NLP task, I need to remove stopwords from my text, so I imported the stopwords from scikit-learn (which, in my opinion, has a better collection of stopwords than NLTK):
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
This gives me a frozenset in which the stopwords appear to be in alphabetical order. To remove stopwords from my text, I'd like to check whether a token is in the stopwords using binary search (since the stopwords are in alphabetical order, a binary search should be efficient). So I tried the following:
import bisect
bisect.bisect(ENGLISH_STOP_WORDS, word)
And this is where I'm stuck! I was expecting the above code to give me the index in the stopwords list, so that I could compare my word with the entries just before and after that position. But I get this error:
TypeError: 'frozenset' object does not support indexing
FYI, I have not tried other libraries' stopword lists (spaCy, gensim, etc.), so I don't know if they work better in this case. But the main point here is to learn how to do a binary search on a frozenset. Thanks in advance.
If you want to know if the word is a stop word, simply do:
if word in ENGLISH_STOP_WORDS:
    pass  # word is a stop word; handle it here
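To spell out why this is enough: a frozenset is hash-based, so the in operator is an average O(1) lookup, which beats an O(log n) binary search. A frozenset also has no defined order, so any alphabetical order you see when printing it is incidental. Here is a minimal sketch of both approaches. Note the assumptions: STOP_WORDS is a tiny made-up stand-in for ENGLISH_STOP_WORDS so the example runs without scikit-learn, and remove_stop_words and contains_sorted are hypothetical helper names, not part of any library.

```python
import bisect

# Hypothetical stand-in for sklearn's ENGLISH_STOP_WORDS, kept tiny so the
# example runs without scikit-learn installed.
STOP_WORDS = frozenset({"a", "and", "is", "of", "the"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words (one hash lookup each)."""
    return [tok for tok in tokens if tok not in stop_words]


# Only if a binary search is really wanted: sort once into a list, which
# supports indexing, and then bisect on that list.
SORTED_STOP_WORDS = sorted(STOP_WORDS)


def contains_sorted(sorted_words, word):
    """Binary-search membership test on an already sorted list."""
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word


tokens = "the cat is on the roof of a house".split()
filtered = remove_stop_words(tokens)
```

With the real list you would pass ENGLISH_STOP_WORDS as stop_words. The sorted-list version is mainly useful if you need the neighbouring entries or the insertion position; for a plain yes/no membership check, the set lookup is simpler and typically faster.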