I have two data frames like the following:
df1 <- data.frame(Marker1=c('+','+','+','-','-'), Marker2=c('+','+','+','+','-'), Marker3=c('+','-','+','-','+'),
Sample=c(1,1,2,3,3), Population_ID=c(1,2,1,5,6), Cells_in_Sample=c(443,23,567,98,3))
df2 <- data.frame(Population_ID=c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,5,5,5,5,5,5,5,6,6,6,6,6,6,6),
Marker1=c('+','+','+','+',NA,NA,NA,'+','+','+','+',NA,NA,NA,'-','-','-','-',NA,NA,NA,'-','-','-','-',NA,NA,NA),
Marker2=c('+','+',NA,NA,'+','+',NA,'+','+',NA,NA,'+','+',NA,'+','+',NA,NA,'+','+',NA,'-','-',NA,NA,'-','-',NA),
Marker3=c('+',NA,'+',NA,'+',NA,'+','-',NA,'-',NA,'-',NA,'-','-',NA,'-',NA,'-',NA,'-','+',NA,'+',NA,'+',NA,'+'))
They look like this:
> df1
Marker1 Marker2 Marker3 Sample Population_ID Cells_in_Sample
1 + + + 1 1 443
2 + + - 1 2 23
3 + + + 2 1 567
4 - + - 3 5 98
5 - - + 3 6 3
> head(df2)
Population_ID Marker1 Marker2 Marker3
1 1 + + +
2 1 + + <NA>
3 1 + <NA> +
4 1 + <NA> <NA>
5 1 <NA> + +
6 1 <NA> + <NA>
df1 contains my "base" populations with combinations of 3 markers (all 3 present), plus the counts (Cells_in_Sample) of each population per Sample.
df2 takes the unique combinations of 3 markers and makes all the possible combinations of 1 and 2 out of them. Note that df2 already includes the "base" populations from df1.
What I want to do here is just produce a final_df combining both, in an efficient and elegant way, avoiding nested loops if possible.
The final_df should preserve the Sample and counts values for each "base" 3-marker combination in df1, extended to all the "sub-combinations" in df2. Thus, I should rbind them by Sample and Population_ID.
Now I managed to do this using nested for loops, but I am wondering whether there is a better solution.
This is what I have done:
final_df <- NULL
for (s in unique(df1$Sample)) {
  df1_sub <- subset(df1, Sample == s)
  for (p in df1_sub$Population_ID) {
    df1_sub_sub <- subset(df1_sub, Population_ID == p)
    df2_sub <- subset(df2, Population_ID == p)
    df2_sub$Sample <- s
    df2_sub$Cells_in_Sample <- df1_sub_sub$Cells_in_Sample
    df2_sub <- df2_sub[, c(2, 3, 4, 5, 1, 6)]
    # note there is no need to rbind df1_sub_sub and df2_sub
    # because df2 already contains the populations from df1
    final_df <- rbind(final_df, df2_sub)
  }
}
final_df looks exactly like I want. I paste it in full below for reference:
> final_df
Marker1 Marker2 Marker3 Sample Population_ID Cells_in_Sample
1 + + + 1 1 443
2 + + <NA> 1 1 443
3 + <NA> + 1 1 443
4 + <NA> <NA> 1 1 443
5 <NA> + + 1 1 443
6 <NA> + <NA> 1 1 443
7 <NA> <NA> + 1 1 443
8 + + - 1 2 23
9 + + <NA> 1 2 23
10 + <NA> - 1 2 23
11 + <NA> <NA> 1 2 23
12 <NA> + - 1 2 23
13 <NA> + <NA> 1 2 23
14 <NA> <NA> - 1 2 23
15 + + + 2 1 567
16 + + <NA> 2 1 567
17 + <NA> + 2 1 567
18 + <NA> <NA> 2 1 567
19 <NA> + + 2 1 567
20 <NA> + <NA> 2 1 567
21 <NA> <NA> + 2 1 567
151 - + - 3 5 98
161 - + <NA> 3 5 98
171 - <NA> - 3 5 98
181 - <NA> <NA> 3 5 98
191 <NA> + - 3 5 98
201 <NA> + <NA> 3 5 98
211 <NA> <NA> - 3 5 98
22 - - + 3 6 3
23 - - <NA> 3 6 3
24 - <NA> + 3 6 3
25 - <NA> <NA> 3 6 3
26 <NA> - + 3 6 3
27 <NA> - <NA> 3 6 3
28 <NA> <NA> + 3 6 3
Is there a straightforward and efficient way to do this without nested loops? My actual data is many times bigger than this.
Thanks!
You don't need loops at all; this is a case for merge.
In the code below I make a copy of final_df just for test purposes, to keep the expected result as it was.
df3 <- final_df
row.names(df3) <- NULL
# join removing the marker columns from df1
mrg <- merge(df2, df1[4:6], by = "Population_ID")[c(2:5, 1, 6)]
mrg <- mrg[order(mrg$Sample), ]
row.names(mrg) <- NULL
identical(df3, mrg)
#> [1] TRUE
Created on 2023-09-29 with reprex v2.0.2
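For readers who want to run the idea without first building final_df with the loops, here is a minimal self-contained sketch of the same merge. It rebuilds df1 and df2 from the question and selects and reorders columns by name instead of by position. The helper names counts and col_order are mine, not from the question, and sorting by both Sample and Population_ID is an assumption about the desired row order.

```r
# Data copied from the question (df2's Population_ID written with rep()).
df1 <- data.frame(
  Marker1 = c('+','+','+','-','-'),
  Marker2 = c('+','+','+','+','-'),
  Marker3 = c('+','-','+','-','+'),
  Sample = c(1,1,2,3,3),
  Population_ID = c(1,2,1,5,6),
  Cells_in_Sample = c(443,23,567,98,3)
)
df2 <- data.frame(
  Population_ID = rep(c(1,2,5,6), each = 7),
  Marker1 = c('+','+','+','+',NA,NA,NA,'+','+','+','+',NA,NA,NA,
              '-','-','-','-',NA,NA,NA,'-','-','-','-',NA,NA,NA),
  Marker2 = c('+','+',NA,NA,'+','+',NA,'+','+',NA,NA,'+','+',NA,
              '+','+',NA,NA,'+','+',NA,'-','-',NA,NA,'-','-',NA),
  Marker3 = c('+',NA,'+',NA,'+',NA,'+','-',NA,'-',NA,'-',NA,'-',
              '-',NA,'-',NA,'-',NA,'-','+',NA,'+',NA,'+',NA,'+')
)

# Keep only the join key and the per-sample columns from df1, so the
# marker columns in the result come from df2 alone.
counts <- df1[, c("Sample", "Population_ID", "Cells_in_Sample")]

# Each row of counts matches the 7 rows of df2 that share its Population_ID.
mrg <- merge(df2, counts, by = "Population_ID")

# Reorder columns by name and sort rows by Sample, then Population_ID.
col_order <- c("Marker1", "Marker2", "Marker3",
               "Sample", "Population_ID", "Cells_in_Sample")
mrg <- mrg[order(mrg$Sample, mrg$Population_ID), col_order]
row.names(mrg) <- NULL

# Sanity check: 7 marker combinations per row of df1.
stopifnot(nrow(mrg) == nrow(df1) * 7)
```

Selecting columns by name makes the join less fragile if the column order of df1 or df2 changes. Dropping the marker columns from df1 before merging matters: with them present, merge would either suffix the duplicated names (when by is given) or join on all shared columns (when it is not).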