I have used Florian's answer to this question to get the unique pairs separately for the one shared (grouping) variable. Can this be extended to the case where there are two grouping variables, and the pairs should be unique based on either grouping variable?
For example, if the input data is
df = data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D","A"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162","64"),
Grouping2 = c("x1", "x2", "x3", "x1", "x2", "x3", "x1", "x2", "x3"),
stringsAsFactors = F)
and my groups are geneID
and Grouping2
, how do I produce all sets of unique pairs for scaffold
? In this toy example, A
and B
pair both on 276
and x1
, but only one A-B pair should be produced?
Edit to add real example:
In my dataset, I have 25,530 people. While some people live alone, well over half are in households (Household
) with at least two people. People also work/attend school (Work
). Each person is one row. I need to find all unique pairs of people, defined as either having the same Household
number or the same Work
name.
For example, if people A, B, and C live at 123 Any Street
, then the household-based pairs are A-B, A-C, and B-C. If B and C both work at XYZ Tailoring
, which also employs D, E, and F, then the company-based pairs are: B-D, B-E, B-F, C-D, C-E, C-F, D-E, D-F, and E-F. There is no B-C pair based on XYZ Tailoring
because that is already accounted for in the 123 Any Street
combination.
Toy data:
temp <- data.frame(Person = c("A", "B", "C", "D", "E", "F"),
Household = c("123 Any Street", "123 Any Street", "123 Any Street", "4 Second Street", "5 Third Street",
"6 Last Street"),
Work = c("One Company", "XYZ Tailoring", "XYZ Tailoring", "XYZ Tailoring", "XYZ Tailoring",
"XYZ Tailoring"))
And this is my problem. To get an entire set of unique pairs based on sharing Household
or Work
.
Edit 2: desired output. A data frame of two variables (or something I can turn into a data frame of two variables, e.g. matrix). Calling the variables from
and to
the output would be 12 unique pairs. The order of these does not matter.
to from
A B
A C
B C
B D
B E
B F
C D
C E
C F
D E
D F
E F
B
and C
are linked both by the variable Household
and by the variable Work
. I can only have one pair B
and C
, it doesn't matter whether they are paired from Household
or paired from Work
. The pair C
and B
cannot occur, as that is the mirror image of B
and C
, and is thus a duplicate.
A tidyverse
option:
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D","A"),
geneID = c("162", "276", "64", "276", "281", "64", "162", "162","64"),
Grouping2 = c("x1", "x2", "x3", "x1", "x2", "x3", "x1", "x2", "x3"),
stringsAsFactors = FALSE)
# summarise() scaffolds as lists, filter() non-matched scaffold values,
# map() unique pairs using combn(), unnest() pairs to individual rows,
# unnest_wider() to individual columns
result <- df |>
summarise(scaffolds = list(unique(scaffold)), .by = c(geneID, Grouping2)) |>
filter(lengths(scaffolds) > 1) |>
mutate(pairs = map(scaffolds, ~ combn(.x, 2, simplify = FALSE))) |>
unnest(pairs) |>
unnest_wider(pairs, names_sep = "_")
data.frame(result)
# geneID Grouping2 scaffolds pairs_1 pairs_2
# 1 162 x1 A, C A C
# 2 64 x3 B, C, A B C
# 3 64 x3 B, C, A B A
# 4 64 x3 B, C, A C A
Based on your updated dataset and desired output, you can use pivot_longer()
to get permutations of your grouping variables, summarise as before, use separate()
to split pairs and assign column names, and then return only distinct()
rows:
temp <- data.frame(Person = c("A", "B", "C", "D", "E", "F"),
Household = c("123 Any Street", "123 Any Street", "123 Any Street", "4 Second Street", "5 Third Street",
"6 Last Street"),
Work = c("One Company", "XYZ Tailoring", "XYZ Tailoring", "XYZ Tailoring", "XYZ Tailoring",
"XYZ Tailoring"))
result <- temp |>
pivot_longer(-Person) |>
summarise(tmp = list(unique(Person)), .by = c(name, value)) |>
filter(lengths(tmp) > 1) |>
mutate(pairs = map(tmp, ~ combn(.x, 2, simplify = FALSE))) |>
unnest(pairs) |>
mutate(pairs = map_chr(pairs, ~ paste(sort(.x), collapse = "-"))) |>
separate(pairs, into = c("to", "from"), sep = "-") |>
select("to", "from") |>
distinct()
result
# # A tibble: 12 × 2
# to from
# <chr> <chr>
# 1 A B
# 2 A C
# 3 B C
# 4 B D
# 5 B E
# 6 B F
# 7 C D
# 8 C E
# 9 C F
# 10 D E
# 11 D F
# 12 E F