r, string, stringr, fuzzyjoin

Filter rows based on presence of a string element from another column


I am trying to filter out the relevant rows in R based on whether a string, or any part (word) of that string, from one column is present in another column. Here is an example:

colA                                      colb                           flag
New York Metropolitan Area                New York                       Yes 
New York Metropolitan Area                York                           Yes
New York Metropolitan Area                New York Area                  Yes
New York Metropolitan Area                Los Angeles                    No 

Things I have tried till now:

  1. With two separate data frames:
df1<- df1 %>% fuzzy_inner_join(df2, by = c("colA" = "colB"), match_fun = str_detect)

This option fails because of parentheses and other special characters in the strings; cleaning them all up did not help either.

  2. I joined the two data frames on an upper-level hierarchy to limit the rows, which gave a data frame df:
df[, "lookup"] <- gsub(" ", "|", df[,"colB"])

df[,"flag"] <- mapply(grepl, df[,"lookup"], df[,"colA"])

The results are not satisfactory, as only a limited number of rows are filtered.

Thank you in advance.


Solution

  • Here is a base R solution.
    The shorthand anonymous function syntax \(x, y) was introduced in R 4.1.0; for older versions of R, use function(x, y).

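    # Turn each colb value into an alternation of its words, e.g. "New|York"
    pattern <- gsub(" ", "|", df1$colb)
    # Row-wise test: does any word of colb occur in colA?
    i <- mapply(\(x, y) grepl(x, y), pattern, df1$colA)
    # FALSE -> "No", TRUE -> "Yes"
    df1$flag <- c("No", "Yes")[i + 1L]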
    
    df1
    #                        colA          colb flag
    #1 New York Metropolitan Area      New York  Yes
    #2 New York Metropolitan Area          York  Yes
    #3 New York Metropolitan Area New York Area  Yes
    #4 New York Metropolitan Area   Los Angeles   No
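One thing to keep in mind: the alternation pattern built above matches a word anywhere in colA, including inside a longer word. If that is not wanted, a possible refinement is to wrap the alternation in word boundaries. The sketch below uses hypothetical data (the "Yorkshire Dales" row is not from the question) and a hypothetical helper match_rowwise to compare the two behaviours.

```r
# Hypothetical data: "York" is only part of a longer word in the second row
colA <- c("New York Metropolitan Area", "Yorkshire Dales")
colb <- c("New York", "York")

# Plain alternation, as in the solution: also matches "York" inside "Yorkshire"
plain <- gsub(" ", "|", colb)
# Same alternation wrapped in word boundaries: whole words only
bounded <- paste0("\\b(", plain, ")\\b")

# Hypothetical helper: grepl() takes a single pattern, so pair each pattern
# with its own string via mapply()
match_rowwise <- function(pattern, text) {
  mapply(function(p, s) grepl(p, s, perl = TRUE), pattern, text,
         USE.NAMES = FALSE)
}

plain_hit   <- match_rowwise(plain, colA)
bounded_hit <- match_rowwise(bounded, colA)
```

Both variants still treat the words as regular expressions, so special characters such as parentheses in colb would need escaping or literal matching.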
    

    To remove the rows not matching the patterns:

    df1[i, ]
    #                        colA          colb flag
    #1 New York Metropolitan Area      New York  Yes
    #2 New York Metropolitan Area          York  Yes
    #3 New York Metropolitan Area New York Area  Yes
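For the two-data-frame attempt in the question, where str_detect fails on parentheses, one option is to match the strings literally instead of as regular expressions. The following is a minimal base R sketch under assumed data: df_a and df_b are hypothetical data frames, not the asker's. It builds every row pair with merge(by = NULL) and keeps the pairs where colB occurs verbatim in colA.

```r
# Hypothetical data frames containing regex metacharacters
df_a <- data.frame(colA = c("New York (Metropolitan) Area", "Los Angeles Area"),
                   stringsAsFactors = FALSE)
df_b <- data.frame(colB = c("York (Metropolitan)", "Angeles", "Chicago"),
                   stringsAsFactors = FALSE)

# by = NULL gives the Cartesian product: every colA paired with every colB
pairs <- merge(df_a, df_b, by = NULL)

# fixed = TRUE treats colB as a literal substring, so "(" and ")" are harmless
keep <- mapply(function(s, p) grepl(p, s, fixed = TRUE),
               pairs$colA, pairs$colB, USE.NAMES = FALSE)

joined <- pairs[keep, ]
```

The Cartesian product has nrow(df_a) * nrow(df_b) rows, so this only suits small inputs. With fuzzyjoin, the equivalent would presumably be match_fun = function(x, y) str_detect(x, fixed(y)), since stringr's fixed() marks a pattern as literal; that variant is not tested here.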
    

    Data

    df1 <-
    structure(list(colA = c("New York Metropolitan Area", 
    "New York Metropolitan Area", "New York Metropolitan Area", 
    "New York Metropolitan Area"), colb = c("New York", "York", 
    "New York Area", "Los Angeles"), flag = c("Yes", "Yes", "Yes", 
    "No")), row.names = c(NA, -4L), class = "data.frame")