I am trying to filter rows in R based on whether a string in one column, or any part (word) of that string, occurs in another column. Here is an example:
colA                        colb           flag
New York Metropolitan Area  New York       Yes
New York Metropolitan Area  York           Yes
New York Metropolitan Area  New York Area  Yes
New York Metropolitan Area  Los Angeles    No
What I have tried so far:
# requires the fuzzyjoin, stringr and dplyr packages
df1 <- df1 %>% fuzzy_inner_join(df2, by = c("colA" = "colB"), match_fun = str_detect)
This option fails because of parentheses and other special characters in the data; cleaning them all up did not help either.
df[, "lookup"] <- gsub(" ", "|", df[,"colB"])
df[,"flag"] <- mapply(grepl, df[,"lookup"], df[,"colA"])
The results are not satisfactory, as only a limited number of rows are filtered.
Thank you in advance.
Here is a base R solution.
The anonymous (lambda) function syntax \(x, y) was introduced in R 4.1.0; for older versions of R, use function(x, y).
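# Turn each colb value into an alternation pattern, e.g. "New York" -> "New|York"
pattern <- gsub(" ", "|", df1$colb)
# For each row, test whether any word of colb occurs in colA
i <- mapply(\(x, y) grepl(x, y), pattern, df1$colA)
# Map FALSE/TRUE to "No"/"Yes"
df1$flag <- c("No", "Yes")[i + 1L]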
df1
#                        colA          colb flag
#1 New York Metropolitan Area      New York  Yes
#2 New York Metropolitan Area          York  Yes
#3 New York Metropolitan Area New York Area  Yes
#4 New York Metropolitan Area   Los Angeles   No
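The flag column above is built by indexing a two-element vector with a logical vector plus one. In case that idiom is unfamiliar, here is a small sketch of how it works; the vector `hits` is made up for illustration and is not part of the data in the question.

```r
# Hypothetical logical vector standing in for the result of the matching step
hits <- c(TRUE, FALSE, TRUE)

# Adding 1L coerces FALSE/TRUE to the integer positions 1/2
idx <- hits + 1L

# Position 1 picks "No", position 2 picks "Yes"
flag_by_index <- c("No", "Yes")[idx]

# ifelse() gives the same result and may read more clearly
flag_by_ifelse <- ifelse(hits, "Yes", "No")

print(flag_by_index)
print(flag_by_ifelse)
```

Either form should work here; the indexing version is just a little more compact.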
To remove the rows not matching the patterns:
df1[i, ]
#                        colA          colb flag
#1 New York Metropolitan Area      New York  Yes
#2 New York Metropolitan Area          York  Yes
#3 New York Metropolitan Area New York Area  Yes
df1 <-
structure(list(colA = c("New York Metropolitan Area",
"New York Metropolitan Area", "New York Metropolitan Area",
"New York Metropolitan Area"), colb = c("New York", "York",
"New York Area", "Los Angeles"), flag = c("Yes", "Yes", "Yes",
"No")), row.names = c(NA, -4L), class = "data.frame")
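The question mentions that parentheses and other special characters caused problems. The pattern built with gsub() is interpreted as a regular expression, so characters such as ( or + in colb can still break it or change its meaning. One possible workaround, sketched below under my own assumptions, is to split colb into words and match each word literally with fixed = TRUE. The data frame `df2` and the helper `any_word_in()` are hypothetical names made up for this example; they are not from the question.

```r
# Hypothetical data containing regex metacharacters
df2 <- data.frame(
  colA = c("New York (Metropolitan) Area",
           "New York (Metropolitan) Area",
           "New York (Metropolitan) Area"),
  colb = c("(Metropolitan)", "Los Angeles", "York+"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if any space-separated word of `words`
# occurs literally (no regex interpretation) somewhere in `text`
any_word_in <- function(words, text) {
  parts <- strsplit(words, " ", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]  # drop empty pieces from repeated spaces
  any(vapply(parts, grepl, logical(1), x = text, fixed = TRUE))
}

# Apply the helper row by row; unname() drops the names mapply() adds
i2 <- unname(mapply(any_word_in, df2$colb, df2$colA))
df2$flag <- c("No", "Yes")[i2 + 1L]

# Keep only the matching rows
df2[i2, ]
```

Like the regex version, this is a substring match, so a word such as "York" would also match inside a longer word. If whole-word matching is needed, the words would have to be escaped and wrapped in word boundaries instead.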