rsplitcombinationsunique

How do I get unique pairs based on shared values in two grouping variables?


I have used Florian's answer to this question to get the unique pairs separately for the one shared (grouping) variable. Can this be extended to the case where there are two grouping variables, and the pairs should be unique based on either grouping variable?

For example, if the input data is

df  = data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D","A"), 
       geneID = c("162", "276", "64", "276", "281", "64", "162", "162","64"),
       Grouping2 = c("x1", "x2", "x3", "x1", "x2", "x3", "x1", "x2", "x3"),
       stringsAsFactors = F)

and my groups are geneID and Grouping2, how do I produce all sets of unique pairs for scaffold? In this toy example, A and B pair both on 276 and x1, but only one A-B pair should be produced?

Edit to add real example: In my dataset, I have 25,530 people. While some people live alone, well over half are in households (Household) with at least two people. People also work/attend school (Work). Each person is one row. I need to find all unique pairs of people, defined as either having the same Household number or the same Work name.

For example, if people A, B, and C live at 123 Any Street, then the household-based pairs are A-B, A-C, and B-C. If B and C both work at XYZ Tailoring, which also employs D, E, and F, then the company-based pairs are: B-D, B-E, B-F, C-D, C-E, C-F, D-E, D-F, and E-F. There is no B-C pair based on XYZ Tailoring because that is already accounted for in the 123 Any Street combination.

Toy data:

temp <- data.frame(Person = c("A", "B", "C", "D", "E", "F"),
               Household = c("123 Any Street", "123 Any Street", "123 Any Street", "4 Second Street", "5 Third Street",
                         "6 Last Street"),
               Work = c("One Company", "XYZ Tailoring", "XYZ Tailoring", "XYZ Tailoring", "XYZ Tailoring",
                        "XYZ Tailoring"))

And this is my problem. To get an entire set of unique pairs based on sharing Household or Work.

Edit 2: desired output. A data frame of two variables (or something I can turn into a data frame of two variables, e.g. matrix). Calling the variables from and to the output would be 12 unique pairs. The order of these does not matter.

to    from
A     B
A     C
B     C
B     D
B     E
B     F
C     D
C     E
C     F
D     E
D     F
E     F     

B and C are linked both by the variable Household and by the variable Work. I can only have one pair B and C, it doesn't matter whether they are paired from Household or paired from Work. The pair C and B cannot occur, as that is the mirror image of B and C, and is thus a duplicate.


Solution

  • A tidyverse option:

    library(dplyr)
    library(tidyr)
    library(purrr)
    
    df <- data.frame(scaffold = c("A", "A", "B", "B", "B", "C", "C", "D","A"), 
                     geneID = c("162", "276", "64", "276", "281", "64", "162", "162","64"),
                     Grouping2 = c("x1", "x2", "x3", "x1", "x2", "x3", "x1", "x2", "x3"),
                     stringsAsFactors = FALSE)
    
    # summarise() scaffolds as lists, filter() non-matched scaffold values,
    # map() unique pairs using combn(), unnest() pairs to individual rows,
    # unnest_wider() to individual columns
    result <- df |>
      summarise(scaffolds = list(unique(scaffold)), .by = c(geneID, Grouping2)) |>
      filter(lengths(scaffolds) > 1) |>
      mutate(pairs = map(scaffolds, ~ combn(.x, 2, simplify = FALSE))) |>
      unnest(pairs) |>
      unnest_wider(pairs, names_sep = "_")
      
    data.frame(result)
    #   geneID Grouping2 scaffolds pairs_1 pairs_2
    # 1    162        x1      A, C       A       C
    # 2     64        x3   B, C, A       B       C
    # 3     64        x3   B, C, A       B       A
    # 4     64        x3   B, C, A       C       A
    

    Based on your updated dataset and desired output, you can use pivot_longer() to get permutations of your grouping variables, summarise as before, use separate() to split pairs and assign column names, and then return only distinct() rows:

    temp <- data.frame(Person = c("A", "B", "C", "D", "E", "F"),
                       Household = c("123 Any Street", "123 Any Street", "123 Any Street", "4 Second Street", "5 Third Street",
                                     "6 Last Street"),
                       Work = c("One Company", "XYZ Tailoring", "XYZ Tailoring", "XYZ Tailoring", "XYZ Tailoring",
                                "XYZ Tailoring"))
    
    result <- temp |>
      pivot_longer(-Person) |>
      summarise(tmp = list(unique(Person)), .by = c(name, value)) |>
      filter(lengths(tmp) > 1) |>
      mutate(pairs = map(tmp, ~ combn(.x, 2, simplify = FALSE))) |>
      unnest(pairs) |>
      mutate(pairs = map_chr(pairs, ~ paste(sort(.x), collapse = "-"))) |>
      separate(pairs, into = c("to", "from"), sep = "-") |>
      select("to", "from") |>
      distinct()
    
    result
    # # A tibble: 12 × 2
    #    to    from 
    #    <chr> <chr>
    #  1 A     B    
    #  2 A     C    
    #  3 B     C    
    #  4 B     D    
    #  5 B     E    
    #  6 B     F    
    #  7 C     D    
    #  8 C     E    
    #  9 C     F    
    # 10 D     E    
    # 11 D     F    
    # 12 E     F