Tags: r, dataframe, for-loop, apply, rbind

rbind 2 data frames by the elements in 2 columns, avoiding nested loops


I have two data frames like the following:

df1 <- data.frame(Marker1=c('+','+','+','-','-'), Marker2=c('+','+','+','+','-'), Marker3=c('+','-','+','-','+'),
                  Sample=c(1,1,2,3,3), Population_ID=c(1,2,1,5,6), Cells_in_Sample=c(443,23,567,98,3))
df2 <- data.frame(Population_ID=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,5,5,5,5,5,5,5,6,6,6,6,6,6,6),
                  Marker1=c('+','+','+','+',NA,NA,NA,'+','+','+','+',NA,NA,NA,'-','-','-','-',NA,NA,NA,'-','-','-','-',NA,NA,NA),
                  Marker2=c('+','+',NA,NA,'+','+',NA,'+','+',NA,NA,'+','+',NA,'+','+',NA,NA,'+','+',NA,'-','-',NA,NA,'-','-',NA),
                  Marker3=c('+',NA,'+',NA,'+',NA,'+','-',NA,'-',NA,'-',NA,'-','-',NA,'-',NA,'-',NA,'-','+',NA,'+',NA,'+',NA,'+'))

They look like this:

> df1
  Marker1 Marker2 Marker3 Sample Population_ID Cells_in_Sample
1       +       +       +      1             1             443
2       +       +       -      1             2              23
3       +       +       +      2             1             567
4       -       +       -      3             5              98
5       -       -       +      3             6               3
> head(df2)
  Population_ID Marker1 Marker2 Marker3
1             1       +       +       +
2             1       +       +    <NA>
3             1       +    <NA>       +
4             1       +    <NA>    <NA>
5             1    <NA>       +       +
6             1    <NA>       +    <NA>

df1 contains my "base" populations with combinations of 3 markers (all 3 present), plus the counts (Cells_in_Sample) of each population per Sample.

df2 takes each unique 3-marker combination and lists all of its possible sub-combinations of 1 and 2 markers, with the omitted markers set to NA. Note that df2 already includes the "base" populations from df1.

What I want to do here is just produce a final_df combining both, in an efficient and elegant way, avoiding nested loops if possible.

The final_df should preserve the Sample and counts values for each "base" 3-marker combination in df1, extended to all the "sub-combinations" in df2. Thus, I should rbind them by Sample and Population_ID.

I managed to do this using nested for loops, but I am wondering whether there is a better solution.

This is what I have done:

final_df <- NULL
for (s in unique(df1$Sample)){
  df1_sub <- subset(df1, Sample==s)
  for (p in df1_sub$Population_ID){
    df1_sub_sub <- subset(df1_sub, Population_ID==p)
    df2_sub <- subset(df2, Population_ID==p)
    df2_sub$Sample <- s
    df2_sub$Cells_in_Sample <- df1_sub_sub$Cells_in_Sample
    df2_sub <- df2_sub[,c(2,3,4,5,1,6)]
    # note: there is no need to rbind df1_sub_sub and df2_sub,
    # because df2 already contains the populations from df1
    final_df <- rbind(final_df, df2_sub)
  }
}

final_df looks exactly like I want. I paste it in full below for reference:

> final_df
    Marker1 Marker2 Marker3 Sample Population_ID Cells_in_Sample
1         +       +       +      1             1             443
2         +       +    <NA>      1             1             443
3         +    <NA>       +      1             1             443
4         +    <NA>    <NA>      1             1             443
5      <NA>       +       +      1             1             443
6      <NA>       +    <NA>      1             1             443
7      <NA>    <NA>       +      1             1             443
8         +       +       -      1             2              23
9         +       +    <NA>      1             2              23
10        +    <NA>       -      1             2              23
11        +    <NA>    <NA>      1             2              23
12     <NA>       +       -      1             2              23
13     <NA>       +    <NA>      1             2              23
14     <NA>    <NA>       -      1             2              23
15        +       +       +      2             1             567
16        +       +    <NA>      2             1             567
17        +    <NA>       +      2             1             567
18        +    <NA>    <NA>      2             1             567
19     <NA>       +       +      2             1             567
20     <NA>       +    <NA>      2             1             567
21     <NA>    <NA>       +      2             1             567
151       -       +       -      3             5              98
161       -       +    <NA>      3             5              98
171       -    <NA>       -      3             5              98
181       -    <NA>    <NA>      3             5              98
191    <NA>       +       -      3             5              98
201    <NA>       +    <NA>      3             5              98
211    <NA>    <NA>       -      3             5              98
22        -       -       +      3             6               3
23        -       -    <NA>      3             6               3
24        -    <NA>       +      3             6               3
25        -    <NA>    <NA>      3             6               3
26     <NA>       -       +      3             6               3
27     <NA>       -    <NA>      3             6               3
28     <NA>    <NA>       +      3             6               3

Is there a straightforward and efficient way to do this without nested loops? My actual data is many times bigger than this.

Thanks!


Solution

  • You don't need loops at all; this is a case for merge.
    In the code below I first copy final_df, purely for testing, so that the expected result is kept unchanged.

    df3 <- final_df
    row.names(df3) <- NULL
    
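    # join on Population_ID, dropping the marker columns from df1
    # (df1[4:6] keeps Sample, Population_ID and Cells_in_Sample),
    # then restore the column order of final_df
    mrg <- merge(df2, df1[4:6], by = "Population_ID")[c(2:5, 1, 6)]
    # merge sorts by Population_ID, so re-sort by Sample to match the loop output
    mrg <- mrg[order(mrg$Sample), ]
    row.names(mrg) <- NULL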
    
    identical(df3, mrg)
    #> [1] TRUE
    

    Created on 2023-09-29 with reprex v2.0.2
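
The same join can be written with column names instead of positional indices, which may be easier to maintain if the real data has more columns or a different column order. The following is a minimal sketch rather than a definitive implementation. It rebuilds the question's df1 and df2 so that it runs on its own, and it assumes (as in the example data) that a given Population_ID always refers to the same marker combination in every Sample. The names key_cols and out_cols are introduced here for illustration only.

```r
# Rebuild the inputs from the question so the sketch is self-contained.
df1 <- data.frame(
  Marker1 = c('+','+','+','-','-'),
  Marker2 = c('+','+','+','+','-'),
  Marker3 = c('+','-','+','-','+'),
  Sample = c(1,1,2,3,3),
  Population_ID = c(1,2,1,5,6),
  Cells_in_Sample = c(443,23,567,98,3)
)
df2 <- data.frame(
  Population_ID = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,5,5,5,5,5,5,5,6,6,6,6,6,6,6),
  Marker1 = c('+','+','+','+',NA,NA,NA,'+','+','+','+',NA,NA,NA,
              '-','-','-','-',NA,NA,NA,'-','-','-','-',NA,NA,NA),
  Marker2 = c('+','+',NA,NA,'+','+',NA,'+','+',NA,NA,'+','+',NA,
              '+','+',NA,NA,'+','+',NA,'-','-',NA,NA,'-','-',NA),
  Marker3 = c('+',NA,'+',NA,'+',NA,'+','-',NA,'-',NA,'-',NA,'-',
              '-',NA,'-',NA,'-',NA,'-','+',NA,'+',NA,'+',NA,'+')
)

# Columns taken from df1: the join key plus the per-sample information.
# (hypothetical helper names, not part of the original answer)
key_cols <- c("Sample", "Population_ID", "Cells_in_Sample")
out_cols <- c("Marker1", "Marker2", "Marker3",
              "Sample", "Population_ID", "Cells_in_Sample")

# Every df2 row is paired with every df1 row sharing its Population_ID,
# so a population seen in two samples contributes its sub-combinations twice.
mrg <- merge(df2, df1[key_cols], by = "Population_ID")

# Select the output columns by name and sort by Sample, then Population_ID.
mrg <- mrg[out_cols]
mrg <- mrg[order(mrg$Sample, mrg$Population_ID), ]
row.names(mrg) <- NULL
```

Sorting on both Sample and Population_ID makes the grouping explicit instead of relying on the order in which merge returns rows. Note that the nested loops keep populations in the order they appear in df1 within each Sample, so on data where that order is not ascending the row order here can differ from the loop result even though the set of rows is the same.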