I have the following data frame:
report <- data.frame(Text = c("unit 1 crosses the street",
"driver 2 was speeding and saw driver# 1",
"year 2019 was the year before the pandemic",
"hey saw hei hei in the wood",
"hello: my kityy! you are the best"), id = 1:5)
report
Text id
1 unit 1 crosses the street 1
2 driver 2 was speeding and saw driver# 1 2
3 year 2019 was the year before the pandemic 3
4 hey saw hei hei in the wood 4
5 hello: my kityy! you are the best 5
Based on help from a previous question, stop words can be removed with the following code:
report$Text <- gsub(paste0('\\b',tm::stopwords("english"), '\\b',
collapse = '|'), '', report$Text)
report
Text id
1 unit 1 crosses street 1
2 driver 2 speeding saw driver# 1 2
3 year 2019 year pandemic 3
4 hey saw hei hei wood 4
5 hello: kityy! best 5
The data above still contains noise (numbers, punctuation, and extra whitespace). I need to remove this noise before tokenization so that the data ends up in the format below. Additionally, I want to remove selected custom stop words (for example, saw and kityy).
Text id
1 unit crosses street 1
2 driver speeding driver 2
3 year year pandemic 3
4 hey hei hei wood 4
5 hello best 5
We may take the union of tm::stopwords("english") and the new entries, paste them together with collapse = "|", and replace the matches with "" in gsub. The same expression also removes punctuation and digits, then collapses extra whitespace (\\s+ matches one or more whitespace characters) and trims the ends with trimws:
trimws(gsub("\\s+", " ",
gsub(paste0("\\b(", paste(union(c("saw", "kityy"),
tm::stopwords("english")), collapse="|"), ")\\b"), "",
gsub("[[:punct:]0-9]+", "", report$Text))
))
-output
[1] "unit crosses street"
[2] "driver speeding driver"
[3] "year year pandemic"
[4] "hey hei hei wood"
[5] "hello best"
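The same three steps can also be wrapped in a small helper so the order of operations is easier to follow. This is only a sketch: the function name clean_text and the short stop_words vector are hypothetical stand-ins, used so the example runs without the tm package. In practice the list would be union(c("saw", "kityy"), tm::stopwords("english")) as above.

```r
# Hypothetical helper (not part of tm or base R): applies the three
# cleaning steps in the same order as the nested gsub() calls above.
clean_text <- function(x, stop_words) {
  # 1. Drop punctuation and digits.
  x <- gsub("[[:punct:]0-9]+", "", x)
  # 2. Drop whole-word matches of any stop word.
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x)
  # 3. Collapse runs of whitespace and trim both ends.
  trimws(gsub("\\s+", " ", x))
}

# Assumption: a tiny hand-written list standing in for
# union(c("saw", "kityy"), tm::stopwords("english")).
stop_words <- c("saw", "kityy", "the", "was", "and",
                "before", "in", "my", "you", "are")

text <- c("unit 1 crosses the street",
          "driver 2 was speeding and saw driver# 1",
          "year 2019 was the year before the pandemic",
          "hey saw hei hei in the wood",
          "hello: my kityy! you are the best")

cleaned <- clean_text(text, stop_words)
print(cleaned)
```

To update the data frame in place, assign the result back, e.g. report$Text <- clean_text(report$Text, stop_words). Note that punctuation is stripped before the stop words are matched, so contractions such as "you're" become "youre" and would no longer match the corresponding stop-word entry.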