My dataset A is a panel of individuals, as below. Many individuals can belong to a single family (variable fam_id). I have person variables and family variables in this dataset. As you can see, I have values for family variables for 2012 but not for 2013 and 2014.
person_id | fam_id | year | personvar1 | personvar2 | familyvar1 | familyvar 2 |
---|---|---|---|---|---|---|
1 | 50 | 2012 | 500 | stringA | 1000 | stringW |
2 | 50 | 2012 | 550 | stringB | 1000 | stringW |
3 | 60 | 2012 | 710 | stringC | 150 | stringX |
4 | 70 | 2012 | 800 | stringC | 200 | stringY |
5 | 70 | 2012 | 0 | stringJ | 200 | stringY |
1 | 50 | 2013 | 120 | stringJ | NA | NA |
2 | 50 | 2013 | 370 | stringK | NA | NA |
3 | 60 | 2013 | 80 | stringL | NA | NA |
1 | 50 | 2014 | 100 | stringM | NA | NA |
2 | 50 | 2014 | 500 | stringM | NA | NA |
3 | 60 | 2014 | 300 | stringO | NA | NA |
Dataset B is a panel of the families in 2013. It has only family variables, which are the same as the ones in dataset A.
fam_id | year | familyvar1 | familyvar2 |
---|---|---|---|
50 | 2013 | 400 | stringW |
60 | 2013 | 700 | stringY |
70 | 2013 | 1000 | stringZ |
I'd like to 'fill' the NA family variables from 2013 in dataset A with the corresponding values of dataset B.
I've tried
merged_DT <- merge(A, B, by = c("fam_id", "year", "familyvar", "familyvar2") , all.x = TRUE)
But this somehow generates NAs instead of merging approprietely.
I also tried Merging two Dataframes in R by ID, One is the subset of the other approach but I get the error "object '..cols' not found".
How can I do this?
An approach that uses coalesce
after a left_join
to bring together the familyvars. If you include familyvars in the join-by they're not added if they don't match.
library(dplyr)
left_join(A, B, c("fam_id", "year")) %>%
mutate(familyvar1 = coalesce(familyvar1.x, familyvar1.y),
familyvar2 = coalesce(familyvar2.x, familyvar2.y)) %>%
select(-ends_with(c(".x", ".y")))
person_id fam_id year personvar1 personvar2 familyvar1 familyvar2
1 1 50 2012 500 stringA 1000 stringW
2 2 50 2012 550 stringB 1000 stringW
3 3 60 2012 710 stringC 150 stringX
4 4 70 2012 800 stringC 200 stringY
5 5 70 2012 0 stringJ 200 stringY
6 1 50 2013 120 stringJ 400 stringW
7 2 50 2013 370 stringK 400 stringW
8 3 60 2013 80 stringL 700 stringY
9 1 50 2014 100 stringM NA <NA>
10 2 50 2014 500 stringM NA <NA>
11 3 60 2014 300 stringO NA <NA>
A <- structure(list(person_id = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
1L, 2L, 3L), fam_id = c(50L, 50L, 60L, 70L, 70L, 50L, 50L, 60L,
50L, 50L, 60L), year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2013L,
2013L, 2013L, 2014L, 2014L, 2014L), personvar1 = c(500L, 550L,
710L, 800L, 0L, 120L, 370L, 80L, 100L, 500L, 300L), personvar2 = c("stringA",
"stringB", "stringC", "stringC", "stringJ", "stringJ", "stringK",
"stringL", "stringM", "stringM", "stringO"), familyvar1 = c(1000L,
1000L, 150L, 200L, 200L, NA, NA, NA, NA, NA, NA), familyvar2 = c("stringW",
"stringW", "stringX", "stringY", "stringY", NA, NA, NA, NA, NA,
NA)), class = "data.frame", row.names = c(NA, -11L))
B <- structure(list(fam_id = c(50L, 60L, 70L), year = c(2013L, 2013L,
2013L), familyvar1 = c(400L, 700L, 1000L), familyvar2 = c("stringW",
"stringY", "stringZ")), class = "data.frame", row.names = c(NA,
-3L))
EDIT, modified to address coalesce
ing multiple columns, using lapply
to cycle through all familyvars and finally remove the initial x/y contributions from A and B, respectively.
res <- left_join(A, B, c("fam_id", "year"))
num_rng <- 1:2 # set to number of familyvars
data.frame(res,
setNames(data.frame(lapply(num_rng, \(x)
Reduce(coalesce,
res[,grep(paste0("familyvar", x), colnames(res), value=T)]))),
paste0("familyvar", num_rng))) %>%
select(-matches("familyvar\\d+\\.[xy]"))
person_id fam_id year personvar1 personvar2 familyvar1 familyvar2
1 1 50 2012 500 stringA 1000 stringW
2 2 50 2012 550 stringB 1000 stringW
3 3 60 2012 710 stringC 150 stringX
4 4 70 2012 800 stringC 200 stringY
5 5 70 2012 0 stringJ 200 stringY
6 1 50 2013 120 stringJ 400 stringW
7 2 50 2013 370 stringK 400 stringW
8 3 60 2013 80 stringL 700 stringY
9 1 50 2014 100 stringM NA <NA>
10 2 50 2014 500 stringM NA <NA>
11 3 60 2014 300 stringO NA <NA>