Tags: r, dplyr, string-matching

Asymmetric partial matching of text strings between two dataframes


I have a dataframe (df1):

Name Location
Alice London/Liverpool
Bob Manchester
Jim No
David I am based in Cardiff

I have another dataframe (df2) that is a single column list of UK cities and towns, like so:

city
Manchester
Cardiff
London
Brighton
Nottingham

What I want to do is write code that will produce a standardised Location_precise column, like so:

Name Location Location_precise
Alice London/Liverpool London
Bob Manchester Manchester
Jim No -
David I am based in Cardiff Cardiff

I can then use the Location_precise column for further analysis with standardised names. What's tripping me up with regular left_join/lapply approaches is that I want to allow partial matching in one direction (a city in df2 contained within an entry in df1$Location) but NOT in the other direction (an answer in df1$Location that is part of a city in df2, e.g. 'No' and 'Nottingham' in the example above).

I don't know how to adapt regular dplyr functions to this because there is no common identifier between the two datasets, so every join attempt ends in errors. I am otherwise completely at my wits' end; I have tried using grep:

df1$location_precise <- lapply(df1$Location, grep, df2$city, value = TRUE)

But this doesn't pick up partial matches without whitespace, like 'London/Glasgow'.

I've also tried agrep, but this falsely identifies partial matches (like 'No' = 'Nottingham') and still misses many other genuine partial matches.
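For reference, both failure modes can be reproduced with a minimal snippet (the city vector below just mirrors df2):

```r
cities <- c("Manchester", "Cardiff", "London", "Brighton", "Nottingham")

# Direction problem: the free-text answer is used as the *pattern*,
# so it can never be found inside a bare city name.
grep("I am based in Cardiff", cities, value = TRUE)   # character(0)

# Substring problem: a short answer matches inside a longer city name.
grep("No", cities, value = TRUE)                      # "Nottingham"
```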


Solution

  • Always remember to include a reproducible example.

    I have taken your tables (df1 and df2) and added an extra case to both:

    df1 <- data.frame(Name = LETTERS[1:6], 
                      Location = c("London/Liverpool", 
                                   "London/Glasgow",
                                   "I born in Manchester",
                                   "I am from Chester",
                                   "No",
                                   "I am based in Cardiff"))
    
    df2 <- data.frame(city = c("Manchester",
                               "Cardiff",
                               "London",
                               "Chester",
                               "Glasgow",
                               "Brighton",
                               "Nottingham"))
    
    
    # 1. Split each Location into words (split on punctuation or whitespace)
    df1$Location_precise <- strsplit(x = df1$Location, split = "[[:punct:]]|[[:space:]]") |> 
      
      # 2. For each person's word list, keep the cities that match a whole word
      lapply(FUN = \(locations, patterns){
        
        sapply(X = locations, FUN = \(loc){
          patterns[sapply(X = sprintf("^%s$", patterns),   # anchors: whole-word match only
                          FUN = grepl, x = loc, ignore.case = TRUE)]  
        })
      }, patterns = df2$city) |> 
      
      # 3. Flatten each person's matches into a single character vector
      lapply(FUN = unlist) |> 
      
      # 4. Collapse: "-" if no match, otherwise comma-separated matches
      sapply(FUN = \(x) if(length(x) == 0) "-" else paste(x, collapse = ", "))
    

    The output:

    > df1
      Name              Location Location_precise
    1    A      London/Liverpool           London
    2    B        London/Glasgow  London, Glasgow
    3    C  I born in Manchester       Manchester
    4    D     I am from Chester          Chester
    5    E                    No                -
    6    F I am based in Cardiff          Cardiff
    

    It may not be the most elegant solution, but try it and we can refine it. For cases where there is more than one match, it shows all matches separated by commas.

    UPDATE: I have extended the steps so that the script now identifies complete words: it splits each element of df1$Location into words and searches word by word. Keep in mind that:

    1) a (complete) word is defined as a set of characters contained either between two blanks or between two punctuation marks (check which characters are included in [:space:] and [:punct:] in this link);

    2) because of the above, a value like Man-chester, Man,chester or Man/chester will be assigned Chester; and

    3) this is computationally somewhat more expensive (because the search is done word by word), so if you plan to apply it over millions of values it may be slow.
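    An alternative that avoids the manual word-splitting is to let the regex engine enforce whole-word matching with `\b` word boundaries. A minimal sketch (the sample data mirrors the df1/df2 defined above; it assumes the city names contain no regex metacharacters, otherwise they would need escaping):

    ```r
    df1 <- data.frame(Name = LETTERS[1:6],
                      Location = c("London/Liverpool", "London/Glasgow",
                                   "I born in Manchester", "I am from Chester",
                                   "No", "I am based in Cardiff"))
    df2 <- data.frame(city = c("Manchester", "Cardiff", "London", "Chester",
                               "Glasgow", "Brighton", "Nottingham"))

    # One word-boundary pattern per city: "\\bLondon\\b", "\\bCardiff\\b", ...
    patterns <- sprintf("\\b%s\\b", df2$city)

    # Test every pattern against the whole Location column at once:
    # rows = respondents, columns = cities, entries = logical hits
    hits <- sapply(patterns, grepl, x = df1$Location, ignore.case = TRUE)

    # Collapse each row to "-" (no match) or a comma-separated list of cities
    df1$Location_precise <- apply(hits, 1, function(row) {
      matched <- df2$city[row]
      if (length(matched) == 0) "-" else paste(matched, collapse = ", ")
    })
    ```

    Because `\b` treats punctuation and whitespace as word boundaries, this behaves like the word-by-word version above ('No' no longer matches 'Nottingham', while 'London/Glasgow' still yields both cities), and vectorised grepl over the whole column tends to scale better than looping word by word.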