I have two dataframes, lets say df1 and df2 as pictured below. I want to use R to combine them by adding df2 columns to df1 by pathway column. However, I want the pathway to be repeated as many times as present in the df that has more frequency of the pathway. So for example, HALLMARK_ADIPOGENESIS occurs 6 times in df1 while it occurs 4 times in df2. I want in the combined_df to have that pathway repeated 6 times with the 6 rows having values for all the _B1_7486 columns but only 4 rows having values for the _B1_7534 and the remaining 2 rows to be NAs or blank for the _B1_7534. I have tried using merge and cbind and different iterations but have come up short. Any help would be appreciated.
You could do this with a join, if you join by pathway and the # of observation within each pathway.
df1 <- data.frame(pathway = rep(c("a", "b"), c(4, 6)),
df1_data = 1:10)
df2 <- data.frame(pathway = rep(c("a", "b"), c(5, 5)),
df2_data = 1:10)
full_join(
df1 |> mutate(row = row_number(), .by = pathway),
df2 |> mutate(row = row_number(), .by = pathway),
by = join_by(pathway, row)
) |>
arrange(pathway, row)
This results in 5 rows for a
and 6 rows for b
, corresponding in each case to the table where the pathway appears more times.
pathway df1_data row df2_data
1 a 1 1 1
2 a 2 2 2
3 a 3 3 3
4 a 4 4 4
5 a NA 5 5
6 b 5 1 6
7 b 6 2 7
8 b 7 3 8
9 b 8 4 9
10 b 9 5 10
11 b 10 6 NA