I have two dataframes:
Name | Location |
---|---|
Alice | London/Liverpool |
Bob | Manchester |
Jim | No |
David | I am based in Cardiff |
I have another dataframe (df2) that is a single column list of UK cities and towns, like so:
city |
---|
Manchester |
Cardiff |
London |
Brighton |
Nottingham |
What I want to do is write code that will:
Loop through df1
For each entry in the location column, scan df2 and if ANY of the cities in df2 are exactly replicated in df1, add a new column(s) to df 1 pulling out the first match like so:
Name | Location | Location_precise |
---|---|---|
Alice | London, or Brighton | London |
Bob | Manchester | Manchester |
Jim | No | - |
David | I am based in Cardiff | Cardiff |
I can then use the location_precise column for further analysis with standardised names. What's tripping me up with regular left_join/lapply functions is that I want to allow partial matching in one direction (if a city in df2 is partially contained in df1) but NOT in the other direction (if an answer in df1$location is part of a city in df2, e.g. 'no' and 'nottingham' in the above example.
I don't know how to adapt regular dplyr functions to this because there is no common identifier in the two datasets - and therefore every join attempt ends in various errors. I am otherwise completely at my wits' end; I have tried using grep:
df1$location_precise <- lapply(df1$Location, grep, df2$city,value=T)
But this doesn't pick up partial matches without whitespace, like 'London/Glasgow'
I've also tried agrep, but this falsely identifies partial matches, like 'No' = 'Nottingham', and still doesn't pick up many other partial matches
Always remember to place a reproducible example.
I have taken your tables (df1 and df2) and added an extra case to both:
df1 <- data.frame(Name = LETTERS[1:6],
Location = c("London/Liverpool",
"London/Glasgow",
"I born in Manchester",
"I am from Chester",
"No",
"I am based in Cardiff"))
df2 <- data.frame(city = c("Manchester",
"Cardiff",
"London",
"Chester",
"Glasgow",
"Brighton",
"Nottingham"))
df1$Location_precise <- strsplit(x = df1$Location, split = "[[:punct:]]|[[:space:]]") |>
lapply(FUN = \(locations, patterns){
sapply(X = locations, FUN = \(loc){
patterns[sapply(X = sprintf("^%s$", patterns),
FUN = grepl, x = loc, ignore.case = TRUE)]
})
}, patterns = df2$city) |>
lapply(FUN = unlist) |>
sapply(FUN = \(x) if(length(x) == 0) "-" else if(length(x) == 1) x else if(length(x) > 1) paste(x, collapse = ", "))
The output:
> df1
Name Location Location_precise
1 A London/Liverpool London
2 B London/Glasgow London, Glasgow
3 C I born in Manchester Manchester
4 D I am from Chester Chester
5 E No -
6 F I am based in Cardiff Cardiff
It may not be the most elegant example, but you could try it and we'll refine it. For cases where there is more than one match, I have it show all matches and separate them with commas.
UPDATE:
I have extended the steps and this time it is now able to identify complete words. However, when I talk about identifying complete words, I mean that the script now cuts and searches in each word of each element of df1$Location
. For this you must keep in mind that 1) a (complete) word is defined as a set of characters contained either between two blanks or between two punctuation marks (check what characters are included in [:space:]
and [:punct:]
in this link), 2) because of the above, if there is a Man-chester, Man,chester, Man/chester, etc., the script will assign Chester to it and 3) this is computationally somewhat more expensive (because the search is done word by word), so if you plan to apply it over millions of values it may be slow.