[SOLVED] combine two dataframes in R by the pathway column when the pathways are differing in number

combine two dataframes in R by the pathway column when the pathways are differing in number

I have two dataframes, lets say df1 and df2 as pictured below. I want to use R to combine them by adding df2 columns to df1 by pathway column. However, I want the pathway to be repeated as many times as present in the df that has more frequency of the pathway. So for example, HALLMARK_ADIPOGENESIS occurs 6 times in df1 while it occurs 4 times in df2. I want in the combined_df to have that pathway repeated 6 times with the 6 rows having values for all the _B1_7486 columns but only 4 rows having values for the _B1_7534 and the remaining 2 rows to be NAs or blank for the _B1_7534. I have tried using merge and cbind and different iterations but have come up short. Any help would be appreciated.

Solution

You could do this with a join, if you join by pathway and the # of observation within each pathway.

df1 <- data.frame(pathway = rep(c("a", "b"), c(4, 6)),
                  df1_data = 1:10)

df2 <- data.frame(pathway = rep(c("a", "b"), c(5, 5)),
                  df2_data = 1:10)

full_join(
  df1 |> mutate(row = row_number(), .by = pathway),
  df2 |> mutate(row = row_number(), .by = pathway), 
  by = join_by(pathway, row)
) |>
  arrange(pathway, row)

This results in 5 rows for a and 6 rows for b, corresponding in each case to the table where the pathway appears more times.

   pathway df1_data row df2_data
1        a        1   1        1
2        a        2   2        2
3        a        3   3        3
4        a        4   4        4
5        a       NA   5        5
6        b        5   1        6
7        b        6   2        7
8        b        7   3        8
9        b        8   4        9
10       b        9   5       10
11       b       10   6       NA