I have two dataframes like this:
set.seed(1)
df1 <- data.frame(id= 1:4, sex= c("m", "m", NA, NA), somevar= letters[1:4], whocares_var= rnorm(4))
df2 <- data.frame(id= 1:6, sex= c("m", NA, "m", NA, "m", NA), somevar= NA, morevars= LETTERS[1:6])
And I want to merge them. What I do is:
df_both <- merge(df1, df2, by= "id", all= TRUE)
df_both
id sex.x somevar.x whocares_var sex.y somevar.y morevars
1 1 m a 2.5721564 m NA A
2 2 m b -1.1182118 <NA> NA B
3 3 <NA> c 0.6560304 m NA C
4 4 <NA> d -0.7959650 <NA> NA D
5 5 <NA> <NA> NA m NA E
6 6 <NA> <NA> NA <NA> NA F
I don't want the merged dataframe to have two columns sex.x and sex.y. Instead I want to have one sex column that contains the non-missing entry. So what I expect to get is:
set.seed(1)
df_wanted <- data.frame(id= 1:6, sex= c("m", "m", "m", NA, "m", NA),
somevar= c(letters[1:4], NA, NA),
whocares_var= c(rnorm(4), NA, NA),
morevars= LETTERS[1:6])
df_wanted
id sex somevar whocares_var morevars
1 1 m a -0.6264538 A
2 2 m b 0.1836433 B
3 3 m c -0.8356286 C
4 4 <NA> d 1.5952808 D
5 5 m <NA> NA E
6 6 <NA> <NA> NA F
So the function I am looking for only keeps the non-missing entries whenever both dataframe have the same column name. If a column is only present in one of the dataframes, it should also appear in the final data. How to achieve that?
Remark: I don't have the case of conflicting entries (i.e. different non-missing entries for same id)
Probably you can try
d <- merge(df1, df2, by = "id", all = TRUE)
nms <- sub("\\.[xy]", "", names(d))
list2DF(
lapply(
split.default(d, nms)[unique(nms)],
\(x) do.call(coalesce, x)
)
)
which gives
id sex somevar whocares_var morevars
1 1 m a -0.6264538 A
2 2 m b 0.1836433 B
3 3 m c -0.8356286 C
4 4 <NA> d 1.5952808 D
5 5 m <NA> NA E
6 6 <NA> <NA> NA F
Note: coalesce
is from dplyr
package