Tags: r, nlp, stop-words, tidytext

How to correctly remove stop words using the tidytext package in R?


I am using the stop_words dataset from the tidytext package in R to remove stop words, with the following code:

library(tidyverse)
library(tidytext)
library(dplyr)

data(stop_words)
example_words <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog","i'm","don’t","it’s","i’ve")
filtered_words <- example_words[!example_words %in% stop_words$word]
filtered_words 

The final output is as follows:

> filtered_words
[1] "quick" "brown" "fox"   "jumps" "lazy"  "dog"   "don’t" "it’s"  "i’ve" 

The stop words "don’t", "it’s", and "i’ve" are still present in the filtered output, even though they appear to be in the stop word dataset. Why are these words not being removed?


Solution

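  • The entries in stop_words are written with the straight ASCII apostrophe ('), while your input uses the typographic apostrophe (’). %in% compares strings exactly, so "don’t" does not match "don't". Try replacing your typographic apostrophes with straight ones: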

    library(tidyverse)
    library(tidytext)
    library(dplyr)
    
    data(stop_words)
    example_words <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog","i'm","don't","it's","i've")
    filtered_words <- example_words[!example_words %in% stop_words$word]
    filtered_words 
    #> [1] "quick" "brown" "fox"   "jumps" "lazy"  "dog"
    

    Created on 2023-04-07 with reprex v2.0.2
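
If the text comes from a source you cannot edit by hand, the same fix can be applied in code before filtering. The following is a minimal sketch, not part of the original answer: normalize_apostrophes is a hypothetical helper name, and it assumes the only mismatch is the typographic quote characters U+2018 and U+2019.

```r
library(tidytext)  # provides the stop_words data frame
library(dplyr)     # provides tibble() and anti_join()

# Hypothetical helper: replace typographic single quotes (U+2018, U+2019)
# with the straight ASCII apostrophe used in stop_words$word.
normalize_apostrophes <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)
  gsub("\u2018", "'", x, fixed = TRUE)
}

example_words <- c("the", "quick", "brown", "fox", "jumps", "over", "the",
                   "lazy", "dog", "i'm", "don\u2019t", "it\u2019s", "i\u2019ve")

# Normalize first, then filter exactly as in the question.
clean_words <- normalize_apostrophes(example_words)
filtered_words <- clean_words[!clean_words %in% stop_words$word]
filtered_words
#> [1] "quick" "brown" "fox"   "jumps" "lazy"  "dog"

# The same filter expressed as a join on the word column.
filtered_tbl <- anti_join(tibble(word = clean_words), stop_words, by = "word")
```

Normalizing the input rather than editing stop_words keeps the lexicon untouched, so the same cleaned text can be matched against other word lists that also use straight apostrophes.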