rtext-miningstring-matchingsurvey

Search survey comments (strings) for person names within large vector of names


I have data from a survey with many open-text comments (circa 40K comments). I need to create a new column which indicates whether the comment on each row contains a person name so I can then find and remove the name before releasing data to others. I have a full list of person names about (80K names). I tried using grepl and apply but my solution took a very long time and did not give me accurate results.

I have below some code with example data and the result I'm hoping for.

SurveyID <- 1:4
Comments <- c("I'm very dissatisfied with Ian Smith as he is not very inclusive", 
              "May Horner is a great person and I recommend her",
              "Robbing is at an all-time high", 
              "The person in charge may be clear but I'm not")
CommentsData <- data.frame(SurveyID = SurveyID, Comments=Comments)

names <- c("Ian", "May", "John", "Rob", "Emily", "Todd")

Result <- c("Name", "Name", "Clear", "Clear")

I had tried using grepl and lapply but it was very slow and doesn't fully work as I expected. For example, it finds "Rob" in Robbing.

CommentsData$flag <- ifelse(colSums(do.call(rbind,lapply(names,grepl,CommentsData$Comments,ignore.case=F)))>0,"Name","Clear")

Would appreciate any suggestions to solve this and speed up the performance.


Solution

  • paste the names and use | (or) to collapse them, put brackets around and place \\b (word boundary) around the regex.

    c("Clear", "Name")[1L +
                       grepl(paste0("\\b(", paste(names, collapse="|"), ")\\b"),
                             CommentsData$Comments)]
    #[1] "Name"  "Name"  "Clear" "Clear"
    

    Some other variants using Reduce

    c("Clear", "Name")[1L +
        Reduce(function(a,b) a | grepl(b, CommentsData$Comments),
           paste0("\\b", names, "\\b"), FALSE)]
    #[1] "Name"  "Name"  "Clear" "Clear"
    
    c("Clear", "Name")[1L +
         Reduce(function(a,b) `[<-`(a, !a, grepl(b, CommentsData$Comments[!a])),
           paste0("\\b", names, "\\b"), logical(nrow(CommentsData)))]
    #[1] "Name"  "Name"  "Clear" "Clear"
    
    `[<-`(rep("Name", nrow(CommentsData)),
              Reduce(function(a,b) a[!grepl(b, CommentsData$Comments[a])],
           paste0("\\b", names, "\\b"), seq_len(nrow(CommentsData))), "Clear")
    #[1] "Name"  "Name"  "Clear" "Clear"
    

    Benchmark

    bench::mark(
    "A" = ifelse(colSums(do.call(rbind,lapply(paste0("\\b", names, "\\b"),grepl,CommentsData$Comments,ignore.case=F)))>0,"Name","Clear"),
    "B" = c("Clear", "Name")[1L +
                       grepl(paste0("\\b(", paste(names, collapse="|"), ")\\b"),
                             CommentsData$Comments)],
    "C" = c("Clear", "Name")[1L +
                       grepl(paste0("\\b(", paste(names, collapse="|"), ")\\b"),
                             CommentsData$Comments, perl=TRUE)],
    "D" = c("Clear", "Name")[1L +
        Reduce(function(a,b) a | grepl(b, CommentsData$Comments),
           paste0("\\b", names, "\\b"), FALSE)],
    "E" = c("Clear", "Name")[1L +
         Reduce(function(a,b) `[<-`(a, !a, grepl(b, CommentsData$Comments[!a])),
           paste0("\\b", names, "\\b"), logical(nrow(CommentsData)))],
    "F" = `[<-`(rep("Name", nrow(CommentsData)),
              Reduce(function(a,b) a[!grepl(b, CommentsData$Comments[a])],
           paste0("\\b", names, "\\b"), seq_len(nrow(CommentsData))), "Clear") )
    

    Result

      expression      min  median itr/s…¹ mem_a…² gc/se…³ n_itr  n_gc total…⁴ result
      <bch:expr> <bch:tm> <bch:t>   <dbl> <bch:b>   <dbl> <int> <dbl> <bch:t> <list>
    1 A            97.7µs 105.1µs   8819.      0B    6.07  4358     3   494ms <chr> 
    2 B            32.5µs  35.2µs  27940.      0B    2.79  9999     1   358ms <chr> 
    3 C            21.7µs  23.5µs  41622.      0B    8.33  9998     2   240ms <chr> 
    4 D            88.3µs  93.8µs  10455.  8.18KB    6.07  5166     3   494ms <chr> 
    5 E            93.1µs 100.5µs   9732.  4.13KB    8.12  4794     4   493ms <chr> 
    6 F              92µs  98.3µs   9944. 12.01KB    8.12  4900     4   493ms <chr>