rmultiple-columnstextmatching

approx text matching and updation at same time


I have a data frame as df1 which contains a column of the name of the university as University_name and has 500000 number of rows. Now I have another data frame as df2 which contains 2 columns as university_name and university_aliases and has 150 rows. Now I want to match each university alias present in the university_aliases column with university names present in university_name_new.

a sample of df1$university_name

university of auckland
the university of auckland
university of warwick - warwick business school
unv of warwick
seneca college of applied arts and technology
seneca college
univ of auckland

sample of df2

University_Alias                  Univeristy_Name_new

univ of auckland                  university of auckland
universiry of auckland            university of auckland
auckland university               university of auckland
university of auckland            university of auckland
warwick university                university of warwick
warwick univercity                university of warwick
university of warwick             university of warwick
seneca college                    seneca college
unv of warwick                    university of warwick

I am expecting output like this

university of auckland
university of auckland
university of warwick
seneca college
seneca college

and I am using the following code but it is not working

 df$university_name[ grepl(df$university_name,df2$university_alias)] <- df2$university_name_new

Solution

  • You can use sapply and str_extract to get the desired result.

     # create sample data
    df1 <- data.frame(university_name = c('university of auckland',
                                          'the university of auckland',
                                          'university of warwick - warwick business school',
                                          'seneca college of applied arts and technology',
                                          'seneca college'), stringsAsFactors = F)
    
    # these are values to match (from df2)
    vals <- c('university of auckland','university of warwick','seneca college')
    
    # get the output
    df1$output <- sapply(df1$university_name, function(z)({
    
        f <- vals[complete.cases(str_extract(string = z, pattern = vals))]
        return(f)
    
    }), USE.NAMES = F)
    
    print(df1)
    
                                      university_name                 output
    1                          university of auckland university of auckland
    2                      the university of auckland university of auckland
    3 university of warwick - warwick business school  university of warwick
    4   seneca college of applied arts and technology         seneca college
    5                                  seneca college         seneca college
    

    Update:

    Based on my understanding, df2 already has one to one mapping of university_alias with university_name_new, so that problem comes down to checking if a university_alias is not present in df1, we remove it.

    # check values for university_alias in university_name
    maps2 <- as.character(df2$university_alias[which(df2$university_alias %in% df1$university_name)])
    
    # remove unmatched rows from df2
    df3 <- df2[df2$university_alias %in% maps2,]
    
    print(df3)
                university_alias    university_name_new
    1           univ of auckland university of auckland
    4     university of auckland university of auckland
    8             seneca college         seneca college
    9             unv of warwick  university of warwick