Tags: r, dplyr, data.table, intervals, genomicranges

Find which column ranges overlap after grouping in R


I have a huge data frame that looks like this.

I want to group_by(chr) and then, for each chr, find whether any range2 (start2 to end2) overlaps with any range1 (start1 to end1).

library(dplyr)

df1 <- tibble(chr = c(1, 1, 2, 2),
              start1 = c(100, 200, 100, 200),
              end1 = c(150, 400, 150, 400),
              species = c("Penguin"),
              start2 = c(200, 200, 500, 1000),
              end2 = c(250, 240, 1000, 2000))

df1
#> # A tibble: 4 × 6
#>     chr start1  end1 species start2  end2
#>   <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl>
#> 1     1    100   150 Penguin    200   250
#> 2     1    200   400 Penguin    200   240
#> 3     2    100   150 Penguin    500  1000
#> 4     2    200   400 Penguin   1000  2000

Created on 2023-01-05 with reprex v2.0.2

I want my data to look like this. Essentially, I want to check whether range2 overlaps with any range1 within the same chr.

# A tibble: 4 × 7
    chr start1  end1 species start2  end2 OVERLAP
      1    100   150 Penguin    200   250 TRUE
      1    200   400 Penguin    200   240 TRUE
      2    100   150 Penguin    500  1000 FALSE
      2    200   400 Penguin   1000  2000 FALSE

I have fought a lot with the ivs package and iv_overlaps with no success in getting what I want.

Major EDIT:


When I apply any of the suggested code to my real data, I do not get the results I expect, and I am confused as to why. The new dataset does not change the question, but it serves as a check of the code.

data <- tibble::tribble(
  ~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
  "Chr2",   2739,   2840, "+", "A",    740,   1739,
  "Chr2",  12577,  12678, "+", "B",  10578,  11577,
  "Chr2",  22431,  22532, "+", "C",  20432,  21431,
  "Chr2",  32202,  32303, "+", "D",  30203,  31202,
  "Chr2",  42024,  42125, "+", "E",  40025,  41024,
  "Chr2",  51830,  51931, "+", "F",  49831,  50830,
  "Chr2",  82061,  84742, "+", "G",  80062,  81061,
  "Chr2",  84811,  86692, "+", "H",  82812,  83811,
  "Chr2",  86782,  88106, "-", "I",  88107,  89106,
  "Chr2", 139454, 139555, "+", "J", 137455, 138454,
  )

data %>% 
  group_by(chr) %>% 
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))

This gives the following output:

   chr   start1   end1 strand gene  start2   end2 overlap
   <chr>  <dbl>  <dbl> <chr>  <chr>  <dbl>  <dbl> <lgl>  
 1 Chr2    2739   2840 +      A        740   1739 TRUE   
 2 Chr2   12577  12678 +      B      10578  11577 TRUE   
 3 Chr2   22431  22532 +      C      20432  21431 TRUE   
 4 Chr2   32202  32303 +      D      30203  31202 TRUE   
 5 Chr2   42024  42125 +      E      40025  41024 TRUE   
 6 Chr2   51830  51931 +      F      49831  50830 TRUE   
 7 Chr2   82061  84742 +      G      80062  81061 TRUE   
 8 Chr2   84811  86692 +      H      82812  83811 TRUE   
 9 Chr2   86782  88106 -      I      88107  89106 TRUE   
10 Chr2  139454 139555 +      J     137455 138454 TRUE

This is wrong. There might be indirect matches, but there is no direct overlap.


Solution

  • There are several interpretations of your question, so here are three possible cases:

    1. Within a group, detect for each [start1, end1] whether it overlaps with any of [start2, end2].
    2. Within a group, detect whether any of [start1, end1] overlaps with any of [start2, end2].
    3. Within a group, detect whether each [start1, end1] overlaps with its corresponding [start2, end2] (the one on the same row).

    In all three cases, you can use ivs::iv_overlaps.


    Case 1

    Within each group, iv_overlaps detects whether each interval defined by [start1, end1] overlaps in any way with any of the intervals [start2, end2]. It returns a logical vector with one element per [start1, end1] interval.

    library(ivs)
    library(dplyr)
    df1 %>% 
      group_by(chr) %>% 
      mutate(overlap = iv_overlaps(iv(start1, end1), iv(start2, end2)))
    
    # A tibble: 4 × 7
    # Groups:   chr [2]
        chr start1  end1 species start2  end2 overlap
      <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
    1     1    100   150 Penguin    200   250 FALSE  
    2     1    200   400 Penguin    160   170 TRUE   
    3     2    100   150 Penguin    500  1000 FALSE  
    4     2    200   400 Penguin   1000  2000 FALSE  
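
    To see what Case 1 computes without the dplyr wrapper, here is a minimal sketch that calls iv_overlaps directly on the chr 1 intervals from the Data section. The names x, y, case1 and touching are illustrative only. It also shows that ivs treats intervals as right-open, [start, end), so two intervals that merely touch are not reported as overlapping.

    ```r
    library(ivs)

    # chr 1 intervals from the Data section (x = range1, y = range2)
    x <- iv(c(100, 200), c(150, 400))
    y <- iv(c(200, 160), c(250, 170))

    # One logical per element of x: does it overlap ANY element of y?
    case1 <- iv_overlaps(x, y)
    case1     # FALSE TRUE

    # Intervals are right-open, so [100, 150) and [150, 200) only touch
    touching <- iv_overlaps(iv(100, 150), iv(150, 200))
    touching  # FALSE
    ```

    If your coordinates are meant to be inclusive at both ends, you may need to shift the end values by one before building the intervals; whether that applies depends on your data.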
    

    Case 2

    If you want to know whether any (not each) of the intervals 1 overlaps with any of the intervals 2 (so one unique value per group), wrap the call in any():

    df1 %>% 
      group_by(chr) %>% 
      mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))
    
    # A tibble: 4 × 7
    # Groups:   chr [2]
        chr start1  end1 species start2  end2 overlap
      <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
    1     1    100   150 Penguin    200   250 TRUE   
    2     1    200   400 Penguin    160   170 TRUE   
    3     2    100   150 Penguin    500  1000 FALSE  
    4     2    200   400 Penguin   1000  2000 FALSE  
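
    Because any() collapses each group to a single value, one overlapping pair is enough to mark every row of that group as TRUE. This is likely what happens with the larger dataset in the question. A small sketch using rows F to I of that dataset (the names real_subset, each and any_in_group are illustrative):

    ```r
    library(dplyr)
    library(ivs)

    # Rows F to I of the `data` example from the question
    real_subset <- tibble::tribble(
      ~chr,   ~start1, ~end1, ~gene, ~start2, ~end2,
      "Chr2",   51830, 51931, "F",     49831, 50830,
      "Chr2",   82061, 84742, "G",     80062, 81061,
      "Chr2",   84811, 86692, "H",     82812, 83811,
      "Chr2",   86782, 88106, "I",     88107, 89106
    )

    res <- real_subset %>%
      group_by(chr) %>%
      mutate(
        each = iv_overlaps(iv(start1, end1), iv(start2, end2)),  # Case 1: one value per row
        any_in_group = any(each)                                 # Case 2: one value per group
      ) %>%
      ungroup()

    res$each          # FALSE TRUE FALSE FALSE (G's range1 overlaps H's range2)
    res$any_in_group  # TRUE TRUE TRUE TRUE
    ```

    Here only gene G has a range1 that overlaps some range2 (the one on row H), yet the any() column is TRUE for all four rows.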
    

    Case 3

    If you want row-wise overlap detection, you can use purrr's map2_lgl with iv_overlaps:

    library(purrr)  # provides map2_lgl
    df1 %>% 
      group_by(chr) %>% 
      mutate(overlap = map2_lgl(iv(start1, end1), iv(start2, end2), iv_overlaps))
    
    # A tibble: 4 × 7
    # Groups:   chr [2]
        chr start1  end1 species start2  end2 overlap
      <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
    1     1    100   150 Penguin    200   250 FALSE  
    2     1    200   400 Penguin    160   170 FALSE  
    3     2    100   150 Penguin    500  1000 FALSE  
    4     2    200   400 Penguin   1000  2000 FALSE  
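
    As an alternative to the map2_lgl call above, ivs also provides iv_pairwise_overlaps, which compares the i-th interval of its first argument with the i-th interval of its second. The sketch below assumes the df1 from the Data section (species column omitted for brevity); the name pairwise is illustrative. Since each row is only compared with itself, no grouping is needed.

    ```r
    library(dplyr)
    library(ivs)

    df1 <- tibble(
      chr    = c(1, 1, 2, 2),
      start1 = c(100, 200, 100, 200),
      end1   = c(150, 400, 150, 400),
      start2 = c(200, 160, 500, 1000),
      end2   = c(250, 170, 1000, 2000)
    )

    # Row i of range1 is compared only with row i of range2
    pairwise <- df1 %>%
      mutate(overlap = iv_pairwise_overlaps(iv(start1, end1), iv(start2, end2)))

    pairwise$overlap  # FALSE FALSE FALSE FALSE
    ```

    This avoids iterating over the interval vectors with purrr, whose behaviour on ivs objects may depend on the installed versions.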
    

    Order of the comparison

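    The order of the arguments matters: iv_overlaps(x, y) reports, for each interval in x, whether it overlaps any interval in y. To test the second intervals against the first, swap the arguments and use iv_overlaps(iv(start2, end2), iv(start1, end1)):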

    # A tibble: 4 × 7
    # Groups:   chr [2]
        chr start1  end1 species start2  end2 overlap
      <dbl>  <dbl> <dbl> <chr>    <dbl> <dbl> <lgl>  
    1     1    100   150 Penguin    200   250 TRUE   
    2     1    200   400 Penguin    160   170 FALSE  
    3     2    100   150 Penguin    500  1000 FALSE  
    4     2    200   400 Penguin   1000  2000 FALSE  
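
    The call that produces the table above is not shown. A sketch that should reproduce it, assuming the df1 from the Data section (the name swapped is illustrative):

    ```r
    library(dplyr)
    library(ivs)

    df1 <- tibble(
      chr     = c(1, 1, 2, 2),
      start1  = c(100, 200, 100, 200),
      end1    = c(150, 400, 150, 400),
      species = c("Penguin"),
      start2  = c(200, 160, 500, 1000),
      end2    = c(250, 170, 1000, 2000)
    )

    # For each [start2, end2), check whether it overlaps ANY [start1, end1) in the same chr
    swapped <- df1 %>%
      group_by(chr) %>%
      mutate(overlap = iv_overlaps(iv(start2, end2), iv(start1, end1))) %>%
      ungroup()

    swapped$overlap  # TRUE FALSE FALSE FALSE
    ```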
    

    Data

    df1 <- tibble(chr = c(1, 1, 2, 2),
                  start1 = c(100, 200, 100, 200),
                  end1 = c(150, 400, 150, 400),
                  species = c("Penguin"),
                  start2 = c(200, 160, 500, 1000),
                  end2 = c(250, 170, 1000, 2000))