I'm sure that this answer exists somewhere but I've been trying to get this code to work and I just can't seem to be able to for my purpose. I have 7 different dataframes which each contains an ID, age, and race column. Each of the dataframes are from a different time interval and so if a respondent provided a response at say time 2 and 5 only, they would only have a row present in dataset 2 and 5. An example of how the variables look in say, dataframe 1, is this:
id <- c(1,2,3,4,5)
race <- c("Black", "White", "Asian", "White", "Black")
age <- c(26,24,33,45,65)
one_T1 <- c(1,0,1,1,1)
two_T1 <- c(1,0,1,1,0)
three_T1 <- c(0,0,0,1,1)
df1 <- data.frame(id,race,age,one_T1,two_T1,three_T1)
id race age one_T1 two_T1 three_T1
1 1 Black 26 1 1 0
2 2 White 24 0 0 0
3 3 Asian 33 1 1 0
4 4 White 45 1 1 1
5 5 Black 65 1 0 1
So all of the responses of interest are coded in binary and for each dataframe, there already is a suffix for each variable listed which timeframe they came from (but this can obviously be changed). My goal is to try to get a dataframe in which all IDs appear, even if they don't have data across all time periods. So if they don't have data during a certain time period, they would just have "NAs" for those specific variables that they were a non-response for. Also, age and race should remain constant so I don't want to duplicate those in the merged dataset either. So, if I combined df1 above with this dataframe:
id <- c(1,2,4,5,6)
race <- c("Black", "White", "White", "Black", "Indigenous")
age <- c(26,24,45,65,21)
one_T2 <- c(1,0,1,1,1)
two_T2 <- c(1,0,1,1,0)
three_T2 <- c(0,0,0,1,1)
df2 <- data.frame(id,race,age,one_T2,two_T2,three_T2)
id race age one_T2 two_T2 three_T2
1 1 Black 26 1 1 0
2 2 White 24 0 0 0
3 4 White 45 1 1 0
4 5 Black 65 1 1 1
5 6 Indigenous 21 1 0 1
I would want the output to look something like this:
id race age one_T1 two_T1 three_T1 one_T2 two_T2 three_T2
1 1 Black 26 1 1 0 1 1 0
2 2 White 24 0 0 0 0 0 0
3 3 Asian 33 1 1 0 NA NA NA
4 4 White 45 1 1 1 1 1 0
5 5 Black 65 1 0 1 1 1 1
6 6 Indigenous 21 NA NA NA 1 0 1
I hope that makes sense, thanks very much in advance!
dplyr::full_join(df1, df2, c('id', 'race', 'age'))
id race age one_T1 two_T1 three_T1 one_T2 two_T2 three_T2
1 1 Black 26 1 1 0 1 1 0
2 2 White 24 0 0 0 0 0 0
3 3 Asian 33 1 1 0 NA NA NA
4 4 White 45 1 1 1 1 1 0
5 5 Black 65 1 0 1 1 1 1
6 6 Indigenous 21 NA NA NA 1 0 1
In base R:
merge(df1, df2, c('id', 'race', 'age'), all=TRUE)
id race age one_T1 two_T1 three_T1 one_T2 two_T2 three_T2
1 1 Black 26 1 1 0 1 1 0
2 2 White 24 0 0 0 0 0 0
3 3 Asian 33 1 1 0 NA NA NA
4 4 White 45 1 1 1 1 1 0
5 5 Black 65 1 0 1 1 1 1
6 6 Indigenous 21 NA NA NA 1 0 1