rdataframerecode

How to merge multiple dataframes with a different number of rows by matching the ID in R


I'm sure that this answer exists somewhere but I've been trying to get this code to work and I just can't seem to be able to for my purpose. I have 7 different dataframes which each contains an ID, age, and race column. Each of the dataframes are from a different time interval and so if a respondent provided a response at say time 2 and 5 only, they would only have a row present in dataset 2 and 5. An example of how the variables look in say, dataframe 1, is this:

id <- c(1,2,3,4,5)
race <- c("Black", "White", "Asian", "White", "Black")
age <- c(26,24,33,45,65)
one_T1 <- c(1,0,1,1,1)
two_T1 <- c(1,0,1,1,0)
three_T1 <- c(0,0,0,1,1)
df1 <- data.frame(id,race,age,one_T1,two_T1,three_T1)

  id  race age one_T1 two_T1 three_T1
1  1 Black  26      1      1        0
2  2 White  24      0      0        0
3  3 Asian  33      1      1        0
4  4 White  45      1      1        1
5  5 Black  65      1      0        1

So all of the responses of interest are coded in binary and for each dataframe, there already is a suffix for each variable listed which timeframe they came from (but this can obviously be changed). My goal is to try to get a dataframe in which all IDs appear, even if they don't have data across all time periods. So if they don't have data during a certain time period, they would just have "NAs" for those specific variables that they were a non-response for. Also, age and race should remain constant so I don't want to duplicate those in the merged dataset either. So, if I combined df1 above with this dataframe:

id <- c(1,2,4,5,6)
race <- c("Black", "White", "White", "Black", "Indigenous")
age <- c(26,24,45,65,21)
one_T2 <- c(1,0,1,1,1)
two_T2 <- c(1,0,1,1,0)
three_T2 <- c(0,0,0,1,1)
df2 <- data.frame(id,race,age,one_T2,two_T2,three_T2)

  id       race age one_T2 two_T2 three_T2
1  1      Black  26      1      1        0
2  2      White  24      0      0        0
3  4      White  45      1      1        0
4  5      Black  65      1      1        1
5  6 Indigenous  21      1      0        1

I would want the output to look something like this:

  id       race age one_T1 two_T1 three_T1 one_T2 two_T2 three_T2
1  1      Black  26      1      1        0      1      1        0
2  2      White  24      0      0        0      0      0        0
3  3      Asian  33      1      1        0     NA     NA       NA
4  4      White  45      1      1        1      1      1        0
5  5      Black  65      1      0        1      1      1        1
6  6 Indigenous  21     NA     NA       NA      1      0        1

I hope that makes sense, thanks very much in advance!


Solution

  • dplyr::full_join(df1, df2, c('id', 'race', 'age'))
    
      id       race age one_T1 two_T1 three_T1 one_T2 two_T2 three_T2
    1  1      Black  26      1      1        0      1      1        0
    2  2      White  24      0      0        0      0      0        0
    3  3      Asian  33      1      1        0     NA     NA       NA
    4  4      White  45      1      1        1      1      1        0
    5  5      Black  65      1      0        1      1      1        1
    6  6 Indigenous  21     NA     NA       NA      1      0        1
    

    In base R:

    merge(df1, df2, c('id', 'race', 'age'), all=TRUE)
      id       race age one_T1 two_T1 three_T1 one_T2 two_T2 three_T2
    1  1      Black  26      1      1        0      1      1        0
    2  2      White  24      0      0        0      0      0        0
    3  3      Asian  33      1      1        0     NA     NA       NA
    4  4      White  45      1      1        1      1      1        0
    5  5      Black  65      1      0        1      1      1        1
    6  6 Indigenous  21     NA     NA       NA      1      0        1