r tm stop-words

tm_map: Can I use the removeWords function with my own stopwords stored in a txt file?


I'm using the R tm package for text analysis on a Facebook group, and the removeWords function isn't working for me. I tried to combine the French stopwords with my own, but they still appear in the output. I created a file named "french.txt" containing my own list and read it in as follows:

nom_fichier <- "Analyse textuelle/french.txt"
my_stop_words <- readLines(nom_fichier, encoding="UTF-8")

Here is the data for text mining:

text <- readLines(groupe_fb_ief, encoding="UTF-8")
docs <- Corpus(VectorSource(text))
inspect(docs) 

Here are the tm_map commands:

docs <- tm_map(docs, tolower)

docs <- tm_map(docs, stripWhitespace)

docs <- tm_map(docs, removePunctuation)

docs <- tm_map(docs, removeNumbers)

docs <- tm_map(docs, removeWords, my_stop_words)

After applying these, the stopwords are still not removed and I don't understand why. I even tried changing the order of the commands, with no result.

Do you have any idea? Is it possible to change the French stopwords within R? Where is this list located?

Thanks!!


Solution

  • Rather than using removeWords, I typically use an anti_join() to remove all stop words.

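    library(dplyr)      # provides %>%, tibble() and anti_join()
    library(tidytext)
    # readLines() returns a character vector, so wrap it in a data frame first
    my_stop_words <- tibble(text = my_stop_words) %>%
      unnest_tokens(output = word, input = text, token = "words")
    
    # anti_join: docs must be a data frame with one token per row, not a tm Corpus
    anti_join(docs, my_stop_words, by = "word")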
    

    This assumes the column that contains your tokenized corpus is called "word". Hope this helps.
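
If you would rather stay within tm, here is a minimal sketch of combining the built-in French list with a custom one. It is not a diagnosis of the problem above, only a guess at the usual causes: stop words that differ from the corpus in case or carry stray whitespace, and removePunctuation running before removeWords (which turns "l'ecole" into "lecole", so the "l" is never matched on its own). The toy `text` and `my_stop_words` values are made-up stand-ins for the two files, and `all_stop_words` and `cleaned` are names introduced for this example.

```r
library(tm)

# Hypothetical stand-ins for the contents of the two files read with readLines()
text <- c("Bonjour a tous, je cherche des ressources pour le groupe",
          "Merci beaucoup pour vos messages")
my_stop_words <- c("Bonjour ", "merci", "", "beaucoup")

# Normalise the custom list the same way the corpus will be normalised:
# trim whitespace, lower-case, and drop empty lines
my_stop_words <- tolower(trimws(my_stop_words))
my_stop_words <- my_stop_words[nzchar(my_stop_words)]

# stopwords("french") returns the built-in list as a character vector,
# so it can be extended like any other vector
all_stop_words <- unique(c(stopwords("french"), my_stop_words))

docs <- VCorpus(VectorSource(text))
# Non-tm functions such as tolower are wrapped in content_transformer()
docs <- tm_map(docs, content_transformer(tolower))
# Remove words while apostrophes are still in place, then clean up
docs <- tm_map(docs, removeWords, all_stop_words)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)

# Pull the cleaned text back out as a plain character vector
cleaned <- unname(sapply(docs, as.character))
print(cleaned)
```

Printing `all_stop_words` is a quick way to check that your own entries were read in as expected before running the pipeline.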