r qdap

subset/filter based on a frequency table


I have a data frame with some text data, e.g.:

words <- data.frame(terms = c("qhick brown fox",
                              "tom dick harry", 
                              "cats dgs",
                              "qhick black fox"))

I can already subset to the rows that contain a spelling error:

library(qdap)
words[check_spelling(words$terms)$row,,drop=F]

But since I have a lot of text data, I want to filter only on the spelling errors that occur more than once:

> sort(which(table(which_misspelled(toString(unique(words$terms)))) > 1), decreasing = T)
qhick 
    2 

So I now know that "qhick" is a common misspelling.

How could I then subset words based on this table, so that only rows containing "qhick" are returned?


Solution

  • The misspelled words are the names of the vector returned by your sort() call. If there is only one name, you can do:

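    # which() keeps the names, so names(top_misspelled) holds the frequent misspellings
    top_misspelled <- sort(which(table(which_misspelled(toString(unique(words$terms)))) > 1), decreasing = TRUE)
    
    # keep the rows whose text contains that single misspelling
    words[grepl(names(top_misspelled), words$terms), , drop = FALSE]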
    #            terms
    #1 qhick brown fox
    #4 qhick black fox
    

    If you have several, collapse them into a single regular expression, with "|" separating the alternatives:

    words[grepl(paste0(names(top_misspelled), collapse = "|"), words$terms), ,drop = F]
    

    A non-regex option is to split each row into words and return the row if any of its words matches one of your strings of interest:

    words[sapply(strsplit(as.character(words[,"terms"]), split=" "), function(x) any(x %in% names(top_misspelled))),
          ,drop = F]
    
    #            terms
    #1 qhick brown fox
    #4 qhick black fox
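
One caveat with the plain "|" pattern is that it also matches substrings, so a short misspelling could hit inside a longer word. A minimal sketch of a whole-word variant follows. It runs without qdap by assuming a hard-coded vector `misspelled` in place of the `which_misspelled()` output, and `filter_by_terms` is a hypothetical helper name, not part of qdap. It also assumes the misspellings contain no regex metacharacters.

```r
# Same example data as in the question.
words <- data.frame(terms = c("qhick brown fox",
                              "tom dick harry",
                              "cats dgs",
                              "qhick black fox"),
                    stringsAsFactors = FALSE)

# Assumption: stand-in for the misspelled tokens that qdap would report
# (hard-coded here so the sketch does not depend on qdap being installed).
misspelled <- c("qhick", "dgs", "qhick")

# Count each misspelling and keep the ones seen more than once.
freq <- table(misspelled)
frequent <- names(freq)[freq > 1]

# Hypothetical helper: keep rows of df$terms that contain any target as a whole word.
filter_by_terms <- function(df, targets) {
  # With no targets there is nothing to match, so return zero rows.
  if (length(targets) == 0) return(df[0, , drop = FALSE])
  # \\b anchors each alternative at word boundaries.
  pattern <- paste0("\\b(", paste(targets, collapse = "|"), ")\\b")
  df[grepl(pattern, df$terms, perl = TRUE), , drop = FALSE]
}

result <- filter_by_terms(words, frequent)
print(result)
```

Selecting `names(freq)[freq > 1]` works directly on the counts, so the threshold can be raised (for example `freq > 5`) on larger data without changing anything else.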