Say I have a data frame like the following:
mydf <- data.frame(id=LETTERS, locus=c(rep("alpha",14),rep("beta",12)),
group=c(rep(1,2),rep(2,6),rep(3,6),rep(4,4),rep(5,4),rep(6,4)),
pair=c(1:12,14,15,3,4,6,8,9:16))
which looks like this:
> mydf
id locus group pair
1 A alpha 1 1
2 B alpha 1 2
3 C alpha 2 3
4 D alpha 2 4
5 E alpha 2 5
6 F alpha 2 6
7 G alpha 2 7
8 H alpha 2 8
9 I alpha 3 9
10 J alpha 3 10
11 K alpha 3 11
12 L alpha 3 12
13 M alpha 3 14
14 N alpha 3 15
15 O beta 4 3
16 P beta 4 4
17 Q beta 4 6
18 R beta 4 8
19 S beta 5 9
20 T beta 5 10
21 U beta 5 11
22 V beta 5 12
23 W beta 6 13
24 X beta 6 14
25 Y beta 6 15
26 Z beta 6 16
It has an id
column as a primary key, and ids
are divided into locus
alpha and beta. Each id
in alpha should have a pair in beta (column pair
), though there might be cases of orphan alpha or beta ids
(due to prior filtering).
ids
are also grouped in different groups (column group
), which were defined on a per-locus basis.
Now I want to define another grouping variable new_group
that reflects the alpha-beta pairing information.
This seems very straightforward visually, but I am struggling to define the rules in an automatic way, so that every possible situation is taken into account.
The desired output for the example data frame above would be the following (colors just for guidance):
Note that I only define a new_group
when pair
information is found, so ids
A and B do not belong to any new_group
.
ids
C, D, F and H have pair
information in beta, so they all constitute new_group
1; since ids
E and G belong to the same group, they are pooled together (even if they do not have pair
information).
ids
from I to N have pair information in beta grouped in different groups, so new_group
2 has to be split in beta into 2a and 2b.
Essentially, rules would be:
Here is an igraph
solution
mbs <- mydf %>%
mutate(sg = str_c("g", group), sp = str_c("p", pair)) %>%
select(sg, sp) %>%
graph_from_data_frame(directed = FALSE) %>%
components() %>%
membership()
mydf %>%
mutate(new_group = mbs[startsWith(names(mbs), "g")][str_c("g", group)]) %>%
mutate(
new_group = ifelse(n_distinct(locus) == 2, new_group - 1, NA),
.by = new_group
) %>%
mutate(
new_group = if (n_distinct(group) == 1) as.character(new_group) else str_c(new_group, "_",match(group, unique(group))),
.by = c(locus, new_group)
)
which gives
id locus group pair new_group
1 A alpha 1 1 <NA>
2 B alpha 1 2 <NA>
3 C alpha 2 3 1
4 D alpha 2 4 1
5 E alpha 2 5 1
6 F alpha 2 6 1
7 G alpha 2 7 1
8 H alpha 2 8 1
9 I alpha 3 9 2
10 J alpha 3 10 2
11 K alpha 3 11 2
12 L alpha 3 12 2
13 M alpha 3 14 2
14 N alpha 3 15 2
15 O beta 4 3 1
16 P beta 4 4 1
17 Q beta 4 6 1
18 R beta 4 8 1
19 S beta 5 9 2_1
20 T beta 5 10 2_1
21 U beta 5 11 2_1
22 V beta 5 12 2_1
23 W beta 6 13 2_2
24 X beta 6 14 2_2
25 Y beta 6 15 2_2
26 Z beta 6 16 2_2