rdataframedplyr

combine two dataframes in R by the pathway column when the pathways are differing in number


I have two dataframes, lets say df1 and df2 as pictured below. I want to use R to combine them by adding df2 columns to df1 by pathway column. However, I want the pathway to be repeated as many times as present in the df that has more frequency of the pathway. So for example, HALLMARK_ADIPOGENESIS occurs 6 times in df1 while it occurs 4 times in df2. I want in the combined_df to have that pathway repeated 6 times with the 6 rows having values for all the _B1_7486 columns but only 4 rows having values for the _B1_7534 and the remaining 2 rows to be NAs or blank for the _B1_7534. I have tried using merge and cbind and different iterations but have come up short. Any help would be appreciated.

df1

df2


Solution

  • You could do this with a join, if you join by pathway and the # of observation within each pathway.

    df1 <- data.frame(pathway = rep(c("a", "b"), c(4, 6)),
                      df1_data = 1:10)
    
    df2 <- data.frame(pathway = rep(c("a", "b"), c(5, 5)),
                      df2_data = 1:10)
    
    full_join(
      df1 |> mutate(row = row_number(), .by = pathway),
      df2 |> mutate(row = row_number(), .by = pathway), 
      by = join_by(pathway, row)
    ) |>
      arrange(pathway, row)
    

    This results in 5 rows for a and 6 rows for b, corresponding in each case to the table where the pathway appears more times.

       pathway df1_data row df2_data
    1        a        1   1        1
    2        a        2   2        2
    3        a        3   3        3
    4        a        4   4        4
    5        a       NA   5        5
    6        b        5   1        6
    7        b        6   2        7
    8        b        7   3        8
    9        b        8   4        9
    10       b        9   5       10
    11       b       10   6       NA