I have a data frame with some text data, e.g.:
words <- data.frame(terms = c("qhick brown fox",
"tom dick harry",
"cats dgs",
"qhick black fox"))
I'm already able to subset based on any row that contains a spelling error:
library(qdap)
words[check_spelling(words$terms)$row,,drop=F]
But since I have a lot of text data, I want to filter only on the spelling errors that occur most frequently:
> sort(which(table(which_misspelled(toString(unique(words$terms)))) > 1), decreasing = T)
qhick
2
So I now know that "qhick" is a common misspelling.
How could I then subset words based on this table, so that only rows containing "qhick" are returned?
The words themselves are the names of the vector returned by your sort() call. If you have only one name you can do:
top_misspelled <- sort(which(table(which_misspelled(toString(unique(words$terms)))) > 1), decreasing = T)
words[grepl(names(top_misspelled), words$terms), , drop = F]
# terms
#1 qhick brown fox
#4 qhick black fox
But if you have several, you can collapse them into a single alternation pattern for grepl:
words[grepl(paste0(names(top_misspelled), collapse = "|"), words$terms), ,drop = F]
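One caveat with the alternation pattern is that grepl matches substrings, so a short misspelling can also hit inside a longer word. A minimal sketch of a whole-word variant follows, assuming base R only; the vector `misspelled` is a hypothetical stand-in for names(top_misspelled), and the fifth row is made up purely to show the difference:

```r
# Example data from the question plus one hypothetical row ("lapdgs run")
words <- data.frame(terms = c("qhick brown fox", "tom dick harry",
                              "cats dgs", "qhick black fox", "lapdgs run"),
                    stringsAsFactors = FALSE)

# Hypothetical stand-in for names(top_misspelled)
misspelled <- c("qhick", "dgs")

# Plain alternation: matches "dgs" anywhere, including inside "lapdgs"
plain_pattern <- paste0(misspelled, collapse = "|")
plain_hits <- words[grepl(plain_pattern, words$terms), , drop = FALSE]

# Word-bounded alternation: only matches whole words
bounded_pattern <- paste0("\\b(", paste0(misspelled, collapse = "|"), ")\\b")
bounded_hits <- words[grepl(bounded_pattern, words$terms, perl = TRUE), , drop = FALSE]
```

Here plain_hits keeps the "lapdgs run" row while bounded_hits drops it. If the misspellings could contain regex metacharacters, they would need escaping first.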
A non-regex option is to split each row into words and return the row if any of its words matches one of your strings of interest:
words[sapply(strsplit(as.character(words[, "terms"]), split = " "),
             function(x) any(x %in% names(top_misspelled))), , drop = F]
# terms
#1 qhick brown fox
#4 qhick black fox
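As a side note, which() returns positions in the table rather than counts, so the 2 printed next to "qhick" in the question is its index, and sort() is ordering by position, not by frequency. The names are still correct, which is why the subsetting above works. If you want the words ordered by how often they occur, a possible sketch is to subset the table itself; the vector `misspelled` here is a hypothetical stand-in for the output of which_misspelled(...):

```r
# Hypothetical stand-in for which_misspelled(toString(unique(words$terms)))
misspelled <- c("qhick", "dgs", "qhick")

# Count each misspelling
counts <- table(misspelled)

# Keep only those seen more than once, most frequent first
top_counts <- sort(counts[counts > 1], decreasing = TRUE)

# The words to search for, in frequency order
top_words <- names(top_counts)
```

top_words can then be used in place of names(top_misspelled) in any of the approaches above.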