rmergesubsetnamissing-data

Merge two datasets in a many to one framework, where dataset B's columns are a subset of dataset A's


My dataset A is a panel of individuals, as below. Many individuals can belong to a single family (variable fam_id). I have person variables and family variables in this dataset. As you can see, I have values for family variables for 2012 but not for 2013 and 2014.

person_id fam_id year personvar1 personvar2 familyvar1 familyvar 2
1 50 2012 500 stringA 1000 stringW
2 50 2012 550 stringB 1000 stringW
3 60 2012 710 stringC 150 stringX
4 70 2012 800 stringC 200 stringY
5 70 2012 0 stringJ 200 stringY
1 50 2013 120 stringJ NA NA
2 50 2013 370 stringK NA NA
3 60 2013 80 stringL NA NA
1 50 2014 100 stringM NA NA
2 50 2014 500 stringM NA NA
3 60 2014 300 stringO NA NA

Dataset B is a panel of the families in 2013. It has only family variables, which are the same as the ones in dataset A.

fam_id year familyvar1 familyvar2
50 2013 400 stringW
60 2013 700 stringY
70 2013 1000 stringZ

I'd like to 'fill' the NA family variables from 2013 in dataset A with the corresponding values of dataset B.

I've tried

merged_DT <- merge(A, B, by = c("fam_id", "year", "familyvar", "familyvar2") , all.x = TRUE)

But this somehow generates NAs instead of merging approprietely.

I also tried Merging two Dataframes in R by ID, One is the subset of the other approach but I get the error "object '..cols' not found".

How can I do this?


Solution

  • An approach that uses coalesce after a left_join to bring together the familyvars. If you include familyvars in the join-by they're not added if they don't match.

    library(dplyr)
    
    left_join(A, B, c("fam_id", "year")) %>% 
      mutate(familyvar1 = coalesce(familyvar1.x, familyvar1.y), 
             familyvar2 = coalesce(familyvar2.x, familyvar2.y)) %>% 
      select(-ends_with(c(".x", ".y")))
       person_id fam_id year personvar1 personvar2 familyvar1 familyvar2
    1          1     50 2012        500    stringA       1000    stringW
    2          2     50 2012        550    stringB       1000    stringW
    3          3     60 2012        710    stringC        150    stringX
    4          4     70 2012        800    stringC        200    stringY
    5          5     70 2012          0    stringJ        200    stringY
    6          1     50 2013        120    stringJ        400    stringW
    7          2     50 2013        370    stringK        400    stringW
    8          3     60 2013         80    stringL        700    stringY
    9          1     50 2014        100    stringM         NA       <NA>
    10         2     50 2014        500    stringM         NA       <NA>
    11         3     60 2014        300    stringO         NA       <NA>
    

    Data

    A <- structure(list(person_id = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
    1L, 2L, 3L), fam_id = c(50L, 50L, 60L, 70L, 70L, 50L, 50L, 60L,
    50L, 50L, 60L), year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2013L,
    2013L, 2013L, 2014L, 2014L, 2014L), personvar1 = c(500L, 550L,
    710L, 800L, 0L, 120L, 370L, 80L, 100L, 500L, 300L), personvar2 = c("stringA",
    "stringB", "stringC", "stringC", "stringJ", "stringJ", "stringK",
    "stringL", "stringM", "stringM", "stringO"), familyvar1 = c(1000L,
    1000L, 150L, 200L, 200L, NA, NA, NA, NA, NA, NA), familyvar2 = c("stringW",
    "stringW", "stringX", "stringY", "stringY", NA, NA, NA, NA, NA,
    NA)), class = "data.frame", row.names = c(NA, -11L))
    
    B <- structure(list(fam_id = c(50L, 60L, 70L), year = c(2013L, 2013L,
    2013L), familyvar1 = c(400L, 700L, 1000L), familyvar2 = c("stringW",
    "stringY", "stringZ")), class = "data.frame", row.names = c(NA,
    -3L))
    

    EDIT, modified to address coalesceing multiple columns, using lapply to cycle through all familyvars and finally remove the initial x/y contributions from A and B, respectively.

    res <- left_join(A, B, c("fam_id", "year"))
    
    num_rng <- 1:2 # set to number of familyvars
    
    data.frame(res, 
      setNames(data.frame(lapply(num_rng, \(x) 
        Reduce(coalesce, 
          res[,grep(paste0("familyvar", x), colnames(res), value=T)]))), 
        paste0("familyvar", num_rng))) %>% 
      select(-matches("familyvar\\d+\\.[xy]"))
       person_id fam_id year personvar1 personvar2 familyvar1 familyvar2
    1          1     50 2012        500    stringA       1000    stringW
    2          2     50 2012        550    stringB       1000    stringW
    3          3     60 2012        710    stringC        150    stringX
    4          4     70 2012        800    stringC        200    stringY
    5          5     70 2012          0    stringJ        200    stringY
    6          1     50 2013        120    stringJ        400    stringW
    7          2     50 2013        370    stringK        400    stringW
    8          3     60 2013         80    stringL        700    stringY
    9          1     50 2014        100    stringM         NA       <NA>
    10         2     50 2014        500    stringM         NA       <NA>
    11         3     60 2014        300    stringO         NA       <NA>