I have a question relating to this old post: R Text mining - how to change texts in R data frame column into several columns with word frequencies?
I am trying to do the same thing as in the post linked above, in R, but with strings that contain numeric characters.
Suppose res is my data frame defined by:
library(qdap)
x1 <- as.factor(c( "7317 test1 fool 4258 6287" , "thi1s is 6287 test funny text1 test1", "this is test1 6287 text1 funny fool"))
y1 <- as.factor(c("test2 6287", "this is test text2", "test2 6287"))
z1 <- as.factor(c( "test2 6287" , "this is test 4258 text2 fool", "test2 6287"))
res <- data.frame(x1, y1, z1)
When I calculate frequencies of words defined using these commands,
freqs <- t(wfm(as.factor(res$x1), 1:nrow(res), char.keep=TRUE))
abcd <- data.frame(res, freqs, check.names = FALSE)
abcd drops 7317, 4258 and 6287 entirely, and even strips the 1 from test1, before counting the frequencies.
In the first row of column x1, the 1 is stripped from test1 and what remains is counted as a word. Similarly, the digit is stripped from thi1s and the result is counted as a word. What I want instead is for test1 to be counted as test1. Likewise, 7317, 4258 and so on are stored as strings, so they should be counted as words and appear in the data table with their frequencies. What needs to be added to the code to allow for this?
You need to add the following to the freqs statement: removeNumbers = FALSE. The wfm function calls several other functions, one of which is tm::TermDocumentMatrix. The default that wfm supplies to this function is removeNumbers = TRUE, so it needs to be set to FALSE explicitly.
Code:
freqs <- t(wfm(as.factor(res$x1), 1:nrow(res), char.keep=TRUE, removeNumbers = FALSE))
abcd <- data.frame(res, freqs, check.names = FALSE)
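As a cross-check on the output, the same counts can be sketched in base R by splitting on whitespace only, so digits are never touched. This is not what wfm does internally: count_tokens is a hypothetical helper written for this illustration, and it assumes words are separated by spaces and that case and punctuation should be left as they are.

```r
# Hypothetical helper (not part of qdap): whitespace-only tokenisation,
# so tokens such as "test1" and "6287" are kept intact.
count_tokens <- function(x) {
  # Split each string on runs of whitespace
  tokens <- strsplit(trimws(as.character(x)), "\\s+")
  # Vocabulary: every distinct token across all rows
  vocab <- sort(unique(unlist(tokens)))
  # One row of counts per input string, one column per vocabulary token
  counts <- do.call(rbind, lapply(tokens, function(tok) {
    tabulate(match(tok, vocab), nbins = length(vocab))
  }))
  colnames(counts) <- vocab
  counts
}

x1 <- as.factor(c("7317 test1 fool 4258 6287",
                  "thi1s is 6287 test funny text1 test1",
                  "this is test1 6287 text1 funny fool"))

check <- count_tokens(x1)
print(check)
```

Comparing check against the digit-containing columns of freqs should show whether removeNumbers = FALSE kept every token you expect.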