I use this spaCy code to preprocess my text, but I need negation words such as "not" to stay in the output.
import spacy
from tqdm import tqdm
nlp = spacy.load("en_core_web_sm")
def my_tokenizer(sentence):
    return [token.lemma_ for token in tqdm(nlp(sentence.lower()), leave=False) if not token.is_stop and token.is_alpha and token.lemma_]
When I apply it, I get this result:
[hello, earphone, work]
However, the original sentence was:
hello,my earphones are still not working.
So I would like to get the following result instead: [earphone, still, not, work]
Thank you.
"not" is actually a stop word, and your code removes every token that is a stop word. You can check this either by looking it up in spaCy's stop word list:
"not" in spacy.lang.en.stop_words.STOP_WORDS
or by looping over the tokens of your doc object:
text = "hello,my earphones are still not working."
for tok in nlp(text.lower()):
    print(tok.text, tok.is_stop, tok.lemma_)
#hello False hello
#, False ,
#my True my
#earphones False earphone
#are True be
#still True still
#not True not
#working False work
#. False .
To solve this, remove the words you want to keep, such as "not", from the stop word list. You can do it this way:
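# spacy.lang.en.stop_words.STOP_WORDS.remove("not")
# or, for multiple words, use this
to_del_elements = {"not", "no"}
# update the set in place so the loaded pipeline sees the change
nlp.Defaults.stop_words -= to_del_elements
# words nlp has already processed keep a cached flag, so clear it explicitly
for word in to_del_elements:
    nlp.vocab[word].is_stop = False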
Then you can rerun your code, and "not" and "no" will be kept. Note that "still" is also a stop word, so add it to the set if you want to keep it as well:
import spacy
from tqdm import tqdm

# load the pipeline first so that nlp is defined before it is used
nlp = spacy.load("en_core_web_sm")

to_del_elements = {"not", "no"}
nlp.Defaults.stop_words -= to_del_elements
for word in to_del_elements:
    nlp.vocab[word].is_stop = False  # clear the flag on the vocabulary entry too

def my_tokenizer(sentence):
    return [token.lemma_ for token in tqdm(nlp(sentence.lower()), leave=False) if not token.is_stop and token.is_alpha and token.lemma_]

sentence = "hello,my earphones are still not working. no way they will work"
results = my_tokenizer(sentence)
#['hello', 'earphone', 'not', 'work', 'no', 'way', 'work']
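If you would rather not modify spaCy's global stop word list, another option is to whitelist the negation words in the filter itself. This is only a minimal sketch: KEEP_WORDS and filter_tokens are names I made up for illustration. The stand-in tokens, built with SimpleNamespace from the values printed in the loop above, are only there so the filter can be tried without loading a model.

```python
from types import SimpleNamespace

# Hypothetical whitelist: stop words that should survive the filtering.
KEEP_WORDS = {"not", "no", "still"}

def filter_tokens(tokens, keep=KEEP_WORDS):
    """Return lemmas of alphabetic tokens, dropping stop words unless whitelisted."""
    return [
        tok.lemma_
        for tok in tokens
        if tok.is_alpha
        and tok.lemma_
        and (not tok.is_stop or tok.text.lower() in keep)
    ]

def my_tokenizer(sentence, nlp):
    """Run an already loaded spaCy pipeline and filter its tokens."""
    return filter_tokens(nlp(sentence.lower()))

# Stand-in tokens: (text, lemma_, is_stop, is_alpha), copied from the printed output.
rows = [
    ("hello", "hello", False, True),
    (",", ",", False, False),
    ("my", "my", True, True),
    ("earphones", "earphone", False, True),
    ("are", "be", True, True),
    ("still", "still", True, True),
    ("not", "not", True, True),
    ("working", "work", False, True),
    (".", ".", False, False),
]
demo = [SimpleNamespace(text=t, lemma_=l, is_stop=s, is_alpha=a) for t, l, s, a in rows]

print(filter_tokens(demo))
# ['hello', 'earphone', 'still', 'not', 'work']
```

With a real pipeline you would call my_tokenizer(sentence, nlp), where nlp = spacy.load("en_core_web_sm"). The advantage of this variant is that other code using the same pipeline still sees the default stop words.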