I have a large data frame that looks like the example below.
I want to group by seqnames and, within each group, check for overlapping ranges between start and end.
If ranges overlap, only the row with the highest score should be kept.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tibble(seqnames = rep(c("Chr1", "Chr2"), each = 3),
             start = c(100, 200, 300, 100, 200, 300),
             end = c(150, 400, 500, 120, 220, 320),
             score = c(1000, 500, 1000, 1000, 1000, 1000))
df
#> # A tibble: 6 × 4
#> seqnames start end score
#> <chr> <dbl> <dbl> <dbl>
#> 1 Chr1 100 150 1000
#> 2 Chr1 200 400 500
#> 3 Chr1 300 500 1000
#> 4 Chr2 100 120 1000
#> 5 Chr2 200 220 1000
#> 6 Chr2 300 320 1000
Created on 2022-12-27 with reprex v2.0.2
The desired output is:
seqnames start end score
<chr> <dbl> <dbl> <dbl>
Chr1 100 150 1000
Chr1 300 500 1000
Chr2 100 120 1000
Chr2 200 220 1000
Chr2 300 320 1000
You could use the ivs package:
library(dplyr)
library(ivs)
df <- df %>% mutate(interval = iv(start, end))
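df %>%
  group_by(seqnames) %>%
  # tag each row with the merged interval covering all ranges it overlaps
  mutate(interval_group = iv_identify_group(interval)) %>%
  group_by(seqnames, interval_group) %>%
  # keep the highest-scoring row(s) per overlap group; ties are kept
  slice_max(score, n = 1) %>%
  ungroup() %>%
  select(seqnames, start, end, score)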
# A tibble: 5 × 4
# seqnames start end score
# <chr> <dbl> <dbl> <dbl>
#1 Chr1 100 150 1000
#2 Chr1 300 500 1000
#3 Chr2 100 120 1000
#4 Chr2 200 220 1000
#5 Chr2 300 320 1000
Or with data.table:
library(data.table)
library(ivs)
setDT(df)
df[, interval_group := iv_identify_group(iv(start, end)), seqnames][
  , .SD[score == max(score)], .(seqnames, interval_group)][
  , .(seqnames, start, end, score)]
# seqnames start end score
# <char> <num> <num> <num>
#1: Chr1 100 150 1000
#2: Chr1 300 500 1000
#3: Chr2 100 120 1000
#4: Chr2 200 220 1000
#5: Chr2 300 320 1000
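If installing ivs is not an option, the same idea can be sketched in base R. This is only a rough illustration, not what ivs does internally: the helper overlap_group() is a hypothetical name introduced here, and it assumes that ranges which merely touch (one starts exactly where another ends) count as overlapping. Within each chromosome it sorts by start and opens a new group whenever a range starts after every earlier range has ended.

```r
# Sample data from the question
df <- data.frame(
  seqnames = rep(c("Chr1", "Chr2"), each = 3),
  start = c(100, 200, 300, 100, 200, 300),
  end = c(150, 400, 500, 120, 220, 320),
  score = c(1000, 500, 1000, 1000, 1000, 1000)
)

# Hypothetical helper: number the runs of overlapping ranges
overlap_group <- function(start, end) {
  ord <- order(start, end)
  s <- start[ord]
  e <- end[ord]
  # furthest end seen among the ranges sorted before each one
  prev_max_end <- c(-Inf, head(cummax(e), -1))
  # a new group opens when a range starts after all earlier ones ended
  grp_sorted <- cumsum(s > prev_max_end)
  grp <- integer(length(start))
  grp[ord] <- grp_sorted  # map group ids back to the input order
  grp
}

# Compute group ids separately for each chromosome
df$grp <- ave(seq_len(nrow(df)), df$seqnames,
              FUN = function(i) overlap_group(df$start[i], df$end[i]))

# Keep the row(s) with the highest score in each overlap group
key <- paste(df$seqnames, df$grp)
is_best <- df$score == ave(df$score, key, FUN = max)
result <- df[is_best, c("seqnames", "start", "end", "score")]
result
```

As with the answers above, tied top scores within one overlap group are all kept; add a tie-breaking rule if exactly one row per group is required.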