I have a data frame as df1 which contains a column of the name of the university as University_name and has 500000 number of rows. Now I have another data frame as df2 which contains 2 columns as university_name and university_aliases and has 150 rows. Now I want to match each university alias present in the university_aliases column with university names present in university_name_new.
a sample of df1$university_name
university of auckland
the university of auckland
university of warwick - warwick business school
unv of warwick
seneca college of applied arts and technology
seneca college
univ of auckland
sample of df2
University_Alias Univeristy_Name_new
univ of auckland university of auckland
universiry of auckland university of auckland
auckland university university of auckland
university of auckland university of auckland
warwick university university of warwick
warwick univercity university of warwick
university of warwick university of warwick
seneca college seneca college
unv of warwick university of warwick
I am expecting output like this
university of auckland
university of auckland
university of warwick
seneca college
seneca college
and I am using the following code but it is not working
df$university_name[ grepl(df$university_name,df2$university_alias)] <- df2$university_name_new
You can use sapply
and str_extract
to get the desired result.
# create sample data
df1 <- data.frame(university_name = c('university of auckland',
'the university of auckland',
'university of warwick - warwick business school',
'seneca college of applied arts and technology',
'seneca college'), stringsAsFactors = F)
# these are values to match (from df2)
vals <- c('university of auckland','university of warwick','seneca college')
# get the output
df1$output <- sapply(df1$university_name, function(z)({
f <- vals[complete.cases(str_extract(string = z, pattern = vals))]
return(f)
}), USE.NAMES = F)
print(df1)
university_name output
1 university of auckland university of auckland
2 the university of auckland university of auckland
3 university of warwick - warwick business school university of warwick
4 seneca college of applied arts and technology seneca college
5 seneca college seneca college
Update:
Based on my understanding, df2
already has one to one mapping of university_alias
with university_name_new
, so that problem comes down to checking if a university_alias is not present in df1, we remove it.
# check values for university_alias in university_name
maps2 <- as.character(df2$university_alias[which(df2$university_alias %in% df1$university_name)])
# remove unmatched rows from df2
df3 <- df2[df2$university_alias %in% maps2,]
print(df3)
university_alias university_name_new
1 univ of auckland university of auckland
4 university of auckland university of auckland
8 seneca college seneca college
9 unv of warwick university of warwick