rdataframemergemissing-data

Merge two dataframes and keep non-missing entries


I have two dataframes like this:

set.seed(1)
df1 <- data.frame(id= 1:4, sex= c("m", "m", NA, NA), somevar= letters[1:4], whocares_var= rnorm(4))
df2 <- data.frame(id= 1:6, sex= c("m", NA, "m", NA, "m", NA), somevar= NA, morevars= LETTERS[1:6])

And I want to merge them. What I do is:

df_both <- merge(df1, df2, by= "id", all= TRUE)
df_both

  id sex.x somevar.x whocares_var sex.y somevar.y morevars
1  1     m         a    2.5721564     m        NA        A
2  2     m         b   -1.1182118  <NA>        NA        B
3  3  <NA>         c    0.6560304     m        NA        C
4  4  <NA>         d   -0.7959650  <NA>        NA        D
5  5  <NA>      <NA>           NA     m        NA        E
6  6  <NA>      <NA>           NA  <NA>        NA        F

I don't want the merged dataframe to have two columns sex.x and sex.y. Instead I want to have one sex column that contains the non-missing entry. So what I expect to get is:

set.seed(1)
df_wanted <- data.frame(id= 1:6, sex= c("m", "m", "m", NA, "m", NA),
                        somevar= c(letters[1:4], NA, NA),
                        whocares_var= c(rnorm(4), NA, NA),
                        morevars= LETTERS[1:6])
df_wanted
  id  sex somevar whocares_var morevars
1  1    m       a   -0.6264538        A
2  2    m       b    0.1836433        B
3  3    m       c   -0.8356286        C
4  4 <NA>       d    1.5952808        D
5  5    m    <NA>           NA        E
6  6 <NA>    <NA>           NA        F

So the function I am looking for only keeps the non-missing entries whenever both dataframe have the same column name. If a column is only present in one of the dataframes, it should also appear in the final data. How to achieve that?

Remark: I don't have the case of conflicting entries (i.e. different non-missing entries for same id)


Solution

  • Probably you can try

    d <- merge(df1, df2, by = "id", all = TRUE)
    nms <- sub("\\.[xy]", "", names(d))
    list2DF(
        lapply(
            split.default(d, nms)[unique(nms)],
            \(x) do.call(coalesce, x)
        )
    )
    

    which gives

      id  sex somevar whocares_var morevars
    1  1    m       a   -0.6264538        A
    2  2    m       b    0.1836433        B
    3  3    m       c   -0.8356286        C
    4  4 <NA>       d    1.5952808        D
    5  5    m    <NA>           NA        E
    6  6 <NA>    <NA>           NA        F
    

    Note: coalesce is from dplyr package