I would like to do exact joins for the columns state and name, but a fuzzy join for the "name" and "versus" columns:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("@ george v. SALLY", "@laura v. dAvid")
df2 <- data.frame(year, state, versus)
My preferred output would be the following:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("George", "Sally", "David", "Laura", "John", "Kate")
versus <- c("@ george v. SALLY", "@ george v. SALLY", "@laura v. dAvid", "@laura v. dAvid", NA, NA)
df3 <- data.frame(year, state, name, versus)
I've tried variations of the following:
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = c("year", "state", "name" = "versus"), method = "hamming")
stringdist_left_join(df1, df2, by = c("year", "state"), method = "hamming")
And they don't seem to get close to what I want.
I'm wondering if I'll need to spit up the "versus" column (remove all special characters and delimit the names) or if there's a way for me to accomplish this with something within fuzzyjoin. Any guidance would be appreciated.
A simple approach, which depends somewhat on the structure of df2$versus
, would be this:
library(dplyr)
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))
Output:
year state name versus
<chr> <chr> <chr> <chr>
1 2002 TN George @ george v. SALLY
2 2002 TN Sally @ george v. SALLY
3 1999 AL David @laura v. dAvid
4 1999 AL Laura @laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
If name
has more complicated pattern, rather than a single word (say Molly Home, Jane Doe
), we need a way to retrieve the series of whole words, and check if any of them appear (case-insensitive) within the versus
column. Here is one simple way to do this:
f(n,v)
), which takes strings n
and v
, extracts the whole words (wrds
) from n
, and then counts how many of them are found in v
. Returns TRUE if this count exceeds 0f <- function(n,v) {
wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}
Left join the original frames, and apply f()
by row
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(f(name, versus), versus,NA_character_))
Output:
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David @laura v. dAvid
4 1999 AL Laura @laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "@laura v. dAvid")), class = "data.frame", row.names = c(NA,
-2L))