rdplyr

grouped recoding with lookup table in R


I have a data table where one variable is messy and can contain different variants of the same value (e.g., team name Newcastle United or Newcastle). These variants occur alongside another grouping-like variable (e.g., both the Premier League and A-League have Newcastle clubs with different team name variants):

team = c('Newcastle United','Newcastle','Newcastle Utd','Newcastle United Jets','Newcastle','Newcastle Jets')
competition=c('Premier League','Premier League','Premier League','A-League','A-League','A-League')
df = tibble(team,competition)
# A tibble: 6 × 2
  team                  competition   
  <chr>                 <chr>         
1 Newcastle United      Premier League
2 Newcastle             Premier League
3 Newcastle Utd         Premier League
4 Newcastle United Jets A-League      
5 Newcastle             A-League      
6 Newcastle Jets        A-League     

I also have a lookup table that specifies the desired team name per competition as follows:

old_name=c('Newcastle','Newcastle Utd','Newcastle','Newcastle United Jets')
new_name=c('Newcastle United','Newcastle United','Newcastle Jets','Newcastle Jets')
competition=c('Premier League','Premier League','A-League','A-League')
lookup=tibble(old_name,new_name,competition)
# A tibble: 4 × 3
  old_name              new_name         competition   
  <chr>                 <chr>            <chr>         
1 Newcastle             Newcastle United Premier League
2 Newcastle Utd         Newcastle United Premier League
3 Newcastle             Newcastle Jets   A-League      
4 Newcastle United Jets Newcastle Jets   A-League    

How can I recode/relabel team such that only the relevant competition from the lookup table is used? I tried combining dplyr's group_by and recode in different ways but no luck so far.

(My real data and lookup tables are much bigger and the data table includes cases that don't have a match in the lookup table.)

Desired output:

# A tibble: 6 × 2
  team             competition   
  <chr>            <chr>         
1 Newcastle United Premier League
2 Newcastle United Premier League
3 Newcastle United Premier League
4 Newcastle Jets   A-League      
5 Newcastle Jets   A-League      
6 Newcastle Jets   A-League    

Solution

  • An approach using full_join

    library(dplyr)
    
    full_join(df, lookup, by = join_by(team == old_name, competition)) %>% 
      mutate(team = coalesce(new_name, team), new_name = NULL)
    # A tibble: 6 × 2
      team             competition
      <chr>            <chr>
    1 Newcastle United Premier League
    2 Newcastle United Premier League
    3 Newcastle United Premier League
    4 Newcastle Jets   A-League
    5 Newcastle Jets   A-League
    6 Newcastle Jets   A-League