I have data from a survey with many open-text comments (circa 40K comments). I need to create a new column which indicates whether the comment on each row contains a person name so I can then find and remove the name before releasing data to others. I have a full list of person names about (80K names). I tried using grepl and apply but my solution took a very long time and did not give me accurate results.
I have below some code with example data and the result I'm hoping for.
SurveyID <- 1:4
Comments <- c("I'm very dissatisfied with Ian Smith as he is not very inclusive",
"May Horner is a great person and I recommend her",
"Robbing is at an all-time high",
"The person in charge may be clear but I'm not")
CommentsData <- data.frame(SurveyID = SurveyID, Comments=Comments)
names <- c("Ian", "May", "John", "Rob", "Emily", "Todd")
Result <- c("Name", "Name", "Clear", "Clear")
I had tried using grepl and lapply but it was very slow and doesn't fully work as I expected. For example, it finds "Rob" in Robbing.
CommentsData$flag <- ifelse(colSums(do.call(rbind,lapply(names,grepl,CommentsData$Comments,ignore.case=F)))>0,"Name","Clear")
Would appreciate any suggestions to solve this and speed up the performance.
paste
the names
and use |
(or) to collapse them, put brackets around and place \\b
(word boundary) around the regex.
c("Clear", "Name")[1L +
grepl(paste0("\\b(", paste(names, collapse="|"), ")\\b"),
CommentsData$Comments)]
#[1] "Name" "Name" "Clear" "Clear"
Some other variants using Reduce
c("Clear", "Name")[1L +
Reduce(function(a,b) a | grepl(b, CommentsData$Comments),
paste0("\\b", names, "\\b"), FALSE)]
#[1] "Name" "Name" "Clear" "Clear"
c("Clear", "Name")[1L +
Reduce(function(a,b) `[<-`(a, !a, grepl(b, CommentsData$Comments[!a])),
paste0("\\b", names, "\\b"), logical(nrow(CommentsData)))]
#[1] "Name" "Name" "Clear" "Clear"
`[<-`(rep("Name", nrow(CommentsData)),
Reduce(function(a,b) a[!grepl(b, CommentsData$Comments[a])],
paste0("\\b", names, "\\b"), seq_len(nrow(CommentsData))), "Clear")
#[1] "Name" "Name" "Clear" "Clear"
Benchmark
bench::mark(
"A" = ifelse(colSums(do.call(rbind,lapply(paste0("\\b", names, "\\b"),grepl,CommentsData$Comments,ignore.case=F)))>0,"Name","Clear"),
"B" = c("Clear", "Name")[1L +
grepl(paste0("\\b(", paste(names, collapse="|"), ")\\b"),
CommentsData$Comments)],
"C" = c("Clear", "Name")[1L +
grepl(paste0("\\b(", paste(names, collapse="|"), ")\\b"),
CommentsData$Comments, perl=TRUE)],
"D" = c("Clear", "Name")[1L +
Reduce(function(a,b) a | grepl(b, CommentsData$Comments),
paste0("\\b", names, "\\b"), FALSE)],
"E" = c("Clear", "Name")[1L +
Reduce(function(a,b) `[<-`(a, !a, grepl(b, CommentsData$Comments[!a])),
paste0("\\b", names, "\\b"), logical(nrow(CommentsData)))],
"F" = `[<-`(rep("Name", nrow(CommentsData)),
Reduce(function(a,b) a[!grepl(b, CommentsData$Comments[a])],
paste0("\\b", names, "\\b"), seq_len(nrow(CommentsData))), "Clear") )
Result
expression min median itr/s…¹ mem_a…² gc/se…³ n_itr n_gc total…⁴ result
<bch:expr> <bch:tm> <bch:t> <dbl> <bch:b> <dbl> <int> <dbl> <bch:t> <list>
1 A 97.7µs 105.1µs 8819. 0B 6.07 4358 3 494ms <chr>
2 B 32.5µs 35.2µs 27940. 0B 2.79 9999 1 358ms <chr>
3 C 21.7µs 23.5µs 41622. 0B 8.33 9998 2 240ms <chr>
4 D 88.3µs 93.8µs 10455. 8.18KB 6.07 5166 3 494ms <chr>
5 E 93.1µs 100.5µs 9732. 4.13KB 8.12 4794 4 493ms <chr>
6 F 92µs 98.3µs 9944. 12.01KB 8.12 4900 4 493ms <chr>