I have a huge data frame that looks like the one below.
I want to group_by(chr) and then, for each chr, find whether range2 (start2, end2) overlaps with any range1 (start1, end1).
library(dplyr)
df1 <- tibble(chr=c(1,1,2,2),
start1=c(100,200,100,200),
end1=c(150,400,150,400),
species=c("Penguin"),
start2=c(200,200,500,1000),
end2=c(250,240,1000,2000)
)
df1
#> # A tibble: 4 × 6
#> chr start1 end1 species start2 end2
#> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 100 150 Penguin 200 250
#> 2 1 200 400 Penguin 200 240
#> 3 2 100 150 Penguin 500 1000
#> 4 2 200 400 Penguin 1000 2000
Created on 2023-01-05 with reprex v2.0.2
I want my data to look like this. Essentially, I want to check whether range2 overlaps with any range1.
# A tibble: 4 × 7
chr start1 end1 species start2 end2 OVERLAP
1 100 150 Penguin 200 250 TRUE
1 200 400 Penguin 200 240 TRUE
2 100 150 Penguin 500 1000 FALSE
2 200 400 Penguin 1000 2000 FALSE
I have fought a lot with the ivs package and iv_overlaps, with no success in getting what I want.
Major EDIT:
When I apply any of the suggested code to my real data, I do not get the results I want, and I am confused as to why. The new dataset does not change the question, but it serves to check the code.
data <- tibble::tribble(
~chr, ~start1, ~end1, ~strand, ~gene, ~start2, ~end2,
"Chr2", 2739, 2840, "+", "A", 740, 1739,
"Chr2", 12577, 12678, "+", "B", 10578, 11577,
"Chr2", 22431, 22532, "+", "C", 20432, 21431,
"Chr2", 32202, 32303, "+", "D", 30203, 31202,
"Chr2", 42024, 42125, "+", "E", 40025, 41024,
"Chr2", 51830, 51931, "+", "F", 49831, 50830,
"Chr2", 82061, 84742, "+", "G", 80062, 81061,
"Chr2", 84811, 86692, "+", "H", 82812, 83811,
"Chr2", 86782, 88106, "-", "I", 88107, 89106,
"Chr2", 139454, 139555, "+", "J", 137455, 138454,
)
library(ivs)
data %>%
  group_by(chr) %>%
  mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))
It then gives the following output:
chr start1 end1 strand gene start2 end2 overlap
<chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <lgl>
1 Chr2 2739 2840 + A 740 1739 TRUE
2 Chr2 12577 12678 + B 10578 11577 TRUE
3 Chr2 22431 22532 + C 20432 21431 TRUE
4 Chr2 32202 32303 + D 30203 31202 TRUE
5 Chr2 42024 42125 + E 40025 41024 TRUE
6 Chr2 51830 51931 + F 49831 50830 TRUE
7 Chr2 82061 84742 + G 80062 81061 TRUE
8 Chr2 84811 86692 + H 82812 83811 TRUE
9 Chr2 86782 88106 - I 88107 89106 TRUE
10 Chr2 139454 139555 + J 137455 138454 TRUE
This is wrong. There might be indirect matches, but there is no direct overlap.
There are several interpretations of your question, so here are three possible cases:
1. You want to check, for each [start1, end1], whether it overlaps with any of [start2, end2].
2. You want to check whether any of [start1, end1] overlap with any of [start2, end2].
3. You want to check whether each [start1, end1] overlaps with its corresponding [start2, end2] (the one on the same row).
In all three cases, you can use ivs::iv_overlaps.
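As background for the three cases: ivs treats an interval built with iv(start, end) as half-open, [start, end), so two intervals that merely touch do not overlap. A minimal base-R sketch of that test follows; overlaps_half_open is a hypothetical helper written for illustration, not an ivs function.

```r
# Hypothetical helper (not part of ivs): TRUE when the half-open intervals
# [s1, e1) and [s2, e2) share at least one point. Vectorised over its inputs.
overlaps_half_open <- function(s1, e1, s2, e2) {
  s1 < e2 & s2 < e1
}

overlaps_half_open(100, 150, 200, 250)  # FALSE: disjoint
overlaps_half_open(200, 400, 200, 240)  # TRUE: second lies inside first
overlaps_half_open(100, 200, 200, 300)  # FALSE: they only touch at 200
```

The same comparison is what decides every TRUE or FALSE in the cases that follow; the cases differ only in which intervals get compared with which.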
Case 1
iv_overlaps will detect, within each group, whether each interval defined by [start1, end1] overlaps in any way with any of the intervals [start2, end2]. It returns a logical vector of the same length as [start1, end1].
library(ivs)
library(dplyr)
df1 %>%
group_by(chr) %>%
mutate(overlap = iv_overlaps(iv(start1, end1), iv(start2, end2)))
# A tibble: 4 × 7
# Groups: chr [2]
chr start1 end1 species start2 end2 overlap
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>
1 1 100 150 Penguin 200 250 FALSE
2 1 200 400 Penguin 160 170 TRUE
3 2 100 150 Penguin 500 1000 FALSE
4 2 200 400 Penguin 1000 2000 FALSE
Case 2
If you want to know whether any (not each) of the intervals 1 overlaps with any of the intervals 2 (so one unique value per group), you should add any:
df1 %>%
group_by(chr) %>%
mutate(overlap = any(iv_overlaps(iv(start1, end1), iv(start2, end2))))
# A tibble: 4 × 7
# Groups: chr [2]
chr start1 end1 species start2 end2 overlap
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>
1 1 100 150 Penguin 200 250 TRUE
2 1 200 400 Penguin 160 170 TRUE
3 2 100 150 Penguin 500 1000 FALSE
4 2 200 400 Penguin 1000 2000 FALSE
Case 3
If you want row-wise overlap detection, then you should use purrr's map2_lgl with iv_overlaps:
library(purrr)  # provides map2_lgl
df1 %>%
  group_by(chr) %>%
  mutate(overlap = map2_lgl(iv(start1, end1), iv(start2, end2), iv_overlaps))
# A tibble: 4 × 7
# Groups: chr [2]
chr start1 end1 species start2 end2 overlap
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>
1 1 100 150 Penguin 200 250 FALSE
2 1 200 400 Penguin 160 170 FALSE
3 2 100 150 Penguin 500 1000 FALSE
4 2 200 400 Penguin 1000 2000 FALSE
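For the row-wise case, ivs also provides iv_pairwise_overlaps(), which compares the i-th interval of one vector with the i-th interval of the other and so avoids the map2 call. The sketch below applies it to three rows taken from the question's real data (genes G, H and I); the object name genes and the column names any_match and same_row are made up for illustration.

```r
library(dplyr)
library(ivs)

# Three rows copied from the question's real data (genes G, H, I)
genes <- tibble::tribble(
  ~chr,   ~gene, ~start1, ~end1, ~start2, ~end2,
  "Chr2", "G",     82061, 84742,   80062, 81061,
  "Chr2", "H",     84811, 86692,   82812, 83811,
  "Chr2", "I",     86782, 88106,   88107, 89106
)

res <- genes %>%
  group_by(chr) %>%
  mutate(
    # Case 1: does this row's range1 overlap ANY range2 in the group?
    any_match = iv_overlaps(iv(start1, end1), iv(start2, end2)),
    # Case 3: does this row's range1 overlap the range2 on the SAME row?
    same_row = iv_pairwise_overlaps(iv(start1, end1), iv(start2, end2))
  ) %>%
  ungroup()

res$any_match  # TRUE FALSE FALSE
res$same_row   # FALSE FALSE FALSE
```

Here any_match is TRUE for gene G only because its range1 overlaps the range2 of gene H, which is likely the kind of indirect match the question mentions, while same_row is FALSE on every row.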
Order of the comparison
The order of the arguments matters: to compare the second intervals with the first, use iv_overlaps(interval2, interval1):
# A tibble: 4 × 7
# Groups: chr [2]
chr start1 end1 species start2 end2 overlap
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <lgl>
1 1 100 150 Penguin 200 250 TRUE
2 1 200 400 Penguin 160 170 FALSE
3 2 100 150 Penguin 500 1000 FALSE
4 2 200 400 Penguin 1000 2000 FALSE
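The call that produced the table above is not shown. Assuming it is Case 1 with the two iv() arguments swapped, it would look like the following sketch, which uses the df1 from the Data section:

```r
library(dplyr)
library(ivs)

# Same df1 as in the Data section
df1 <- tibble(
  chr     = c(1, 1, 2, 2),
  start1  = c(100, 200, 100, 200),
  end1    = c(150, 400, 150, 400),
  species = "Penguin",
  start2  = c(200, 160, 500, 1000),
  end2    = c(250, 170, 1000, 2000)
)

res <- df1 %>%
  group_by(chr) %>%
  # Arguments swapped: each range2 is tested against all range1 in the group
  mutate(overlap = iv_overlaps(iv(start2, end2), iv(start1, end1))) %>%
  ungroup()

res$overlap  # TRUE FALSE FALSE FALSE
```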
Data
df1 <- tibble(
  chr     = c(1, 1, 2, 2),
  start1  = c(100, 200, 100, 200),
  end1    = c(150, 400, 150, 400),
  species = c("Penguin"),
  start2  = c(200, 160, 500, 1000),
  end2    = c(250, 170, 1000, 2000)
)