I have a data table where one variable is messy and can contain different variants of the same value (e.g., team name Newcastle United or Newcastle). These variants occur alongside another grouping-like variable (e.g., both the Premier League and A-League have Newcastle clubs with different team name variants):
team = c('Newcastle United','Newcastle','Newcastle Utd','Newcastle United Jets','Newcastle','Newcastle Jets')
competition=c('Premier League','Premier League','Premier League','A-League','A-League','A-League')
df = tibble(team,competition)
# A tibble: 6 × 2
team competition
<chr> <chr>
1 Newcastle United Premier League
2 Newcastle Premier League
3 Newcastle Utd Premier League
4 Newcastle United Jets A-League
5 Newcastle A-League
6 Newcastle Jets A-League
I also have a lookup table that specifies the desired team name per competition as follows:
old_name=c('Newcastle','Newcastle Utd','Newcastle','Newcastle United Jets')
new_name=c('Newcastle United','Newcastle United','Newcastle Jets','Newcastle Jets')
competition=c('Premier League','Premier League','A-League','A-League')
lookup=tibble(old_name,new_name,competition)
# A tibble: 4 × 3
old_name new_name competition
<chr> <chr> <chr>
1 Newcastle Newcastle United Premier League
2 Newcastle Utd Newcastle United Premier League
3 Newcastle Newcastle Jets A-League
4 Newcastle United Jets Newcastle Jets A-League
How can I recode/relabel team
such that only the relevant competition
from the lookup table is used? I tried combining dplyr's group_by
and recode
in different ways but no luck so far.
(My real data and lookup tables are much bigger and the data table includes cases that don't have a match in the lookup table.)
Desired output:
# A tibble: 6 × 2
team competition
<chr> <chr>
1 Newcastle United Premier League
2 Newcastle United Premier League
3 Newcastle United Premier League
4 Newcastle Jets A-League
5 Newcastle Jets A-League
6 Newcastle Jets A-League
An approach using full_join
library(dplyr)
full_join(df, lookup, by = join_by(team == old_name, competition)) %>%
mutate(team = coalesce(new_name, team), new_name = NULL)
# A tibble: 6 × 2
team competition
<chr> <chr>
1 Newcastle United Premier League
2 Newcastle United Premier League
3 Newcastle United Premier League
4 Newcastle Jets A-League
5 Newcastle Jets A-League
6 Newcastle Jets A-League