python spacy stop-words

spaCy: how to avoid removing "not" when cleaning text with spaCy


I use this spaCy code to clean my text, but I need negation words such as "not" to stay in the text.

import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")

def my_tokenizer(sentence):
    # Keep the lemmas of alphabetic tokens that are not stop words
    return [token.lemma_ for token in tqdm(nlp(sentence.lower()), leave=False) if not token.is_stop and token.is_alpha and token.lemma_]

When I apply it, I get this result:

[hello, earphone, work]

However the original sentence was

hello,my earphones are still not working.

So I would like to get the following instead: [earphone, still, not, work]. Thank you.


Solution

  • "not" is actually a stop word, and your code removes every token that is a stop word. You can see this either by checking spaCy's list of stop words

    "not" in spacy.lang.en.stop_words.STOP_WORDS
    

    or by looping over the tokens of your doc object

    text = "hello,my earphones are still not working."
    for tok in nlp(text.lower()):
        print(tok.text, tok.is_stop, tok.lemma_)
    
    #hello False hello
    #, False ,
    #my True my
    #earphones False earphone
    #are True be
    #still True still
    #not True not
    #working False work
    #. False .
    

    Solution

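    To solve this, remove the target words such as "not" from the stop words. Depending on the spaCy version, changing the default set alone may not affect a pipeline that is already loaded, so it is safer to also clear the is_stop flag on the vocabulary entries, which is what token.is_stop reads. You can do it this way:

    # For a single word you can also use:
    # spacy.lang.en.stop_words.STOP_WORDS.remove("not")
    to_del_elements = {"not", "no"}
    nlp.Defaults.stop_words = nlp.Defaults.stop_words - to_del_elements
    # Clear the flag on the loaded vocab as well
    for word in to_del_elements:
        nlp.vocab[word].is_stop = False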
    

    Then you can rerun your code and you'll get your expected results:

    import spacy
    from tqdm import tqdm

    nlp = spacy.load("en_core_web_sm")

    # Stop words that should stay in the text
    to_del_elements = {"not", "no"}
    nlp.Defaults.stop_words = nlp.Defaults.stop_words - to_del_elements
    for word in to_del_elements:
        nlp.vocab[word].is_stop = False  # the flag that token.is_stop reads

    def my_tokenizer(sentence):
        return [token.lemma_ for token in tqdm(nlp(sentence.lower()), leave=False) if not token.is_stop and token.is_alpha and token.lemma_]
    
    sentence = "hello,my earphones are still not working. no way they will work"
    results = my_tokenizer(sentence)
    print(results)
    
    #['hello', 'earphone', 'not', 'work', 'no', 'way', 'work']
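
The question also asked for "still" to be kept, which the example above does not cover. Below is a minimal sketch of the same idea with "still" added to the set. It assumes a blank English pipeline (spacy.blank("en")), so it runs without downloading a model, but it has no lemmatizer and therefore prints token texts rather than lemmas. The names KEEP and keep_words are hypothetical helpers introduced only for this illustration.

```python
import spacy

# Blank English pipeline: tokenizer and stop-word flags only (assumption: no
# trained model is needed for this illustration, so no lemmas are available).
nlp = spacy.blank("en")

# Hypothetical set of words that should survive stop-word filtering.
KEEP = {"not", "no", "still"}


def keep_words(nlp, words):
    """Clear the is_stop flag on the vocab entries for the given words."""
    for word in words:
        nlp.vocab[word].is_stop = False


keep_words(nlp, KEEP)

doc = nlp("hello,my earphones are still not working.")
# Same filter as my_tokenizer, but on token text instead of lemma.
kept = [tok.text for tok in doc if not tok.is_stop and tok.is_alpha]
print(kept)
```

With a trained pipeline such as en_core_web_sm, the same keep_words call can be made right after spacy.load, and my_tokenizer can then be used unchanged.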