[SOLVED] Finding overlaps between millions of ranges/intervals

Finding overlaps between millions of ranges/intervals

I am trying to find pairs of intervals that overlap by at least some minimum overlap length that is set by the user. The intervals are from this pandas dataframe:

import pandas as pds

print(df1.head())
print(df1.tail())

   query_id  start_pos  end_pos  read_length orientation
0   1687655          1     4158         4158           F
1   2485364          1     7233         7233           R
2   1412202          1     3215         3215           R
3   1765889          1     3010         3010           R
4   2944965          1     4199         4199           R
         query_id  start_pos   end_pos  read_length orientation
3082467    112838   27863832  27865583         1752           F
3082468    138670   28431208  28431804          597           R
3082469    171683   28489928  28490409          482           F
3082470   2930053   28569533  28569860          328           F
3082471   1896622   28589281  28589554          274           R

where start_pos is where the interval starts and end_pos is where the interval ends. read_length is the length of the interval.

The data is sorted by start_pos.

The program should have the following output format:

query_id1 -- query_id2 -- read_length1 -- read_length2 -- overlap_length

I am running the program on a compute node with up to 512gb RAM and 4x Intel Xeon E7-4830 CPU (32 cores).

I've tried running my own code to find the overlaps, but it is taking too long to run.

Here is the code that I tried,

import pandas as pds

overlap_df = pds.DataFrame()

def create_overlap_table(df1, ovl_len):

...
(sort and clean the data here)
...

    def iterate_queries(row):
        global overlap_df
        index1 = df1.index[df1['query_id'] == row['query_id']]
        next_int_index = df1.index.get_loc(index1[0]) + 1

        if row['read_length'] >= ovl_len:
            if df1.index.size-1 >= next_int_index:
                end_pos_minus_ovlp = (row['end_pos'] - ovl_len) + 2

                subset_df = df1.loc[(df1['start_pos'] < end_pos_minus_ovlp)]
                subset_df = subset_df.loc[subset_df.index == subset_df.index.max()]
                subset_df = df1.iloc[next_int_index:df1.index.get_loc(subset_df.index[0])]
                subset_df = subset_df.loc[subset_df['read_length'] >= ovl_len]

                rows1 = pds.DataFrame({'read_id1': np.repeat(row['query_id'], repeats=subset_df.index.size), 'read_id2': subset_df['query_id']})
                overlap_df = overlap_df.append(rows1)

    df1.apply(iterate_queries, axis=1)

    print(overlap_df)

Again, I ran this code on the compute node, but it was running for hours before I finally cancelled the job.

I've also tried using two packages for this problem--PyRanges, as well as an R package called IRanges, but they take too long to run as well. I've seen posts on interval trees and a python library called pybedtools, and I was planning on looking into them as a next step.

Any feedback would be really appreciated

EDIT: For minimum overlap length of, say 800, the first 5 rows should look like this,

query_id1 query_id2 read_length1 read_length2 overlap_length
1687655    2485364       4158         7233          4158
1687655    1412202       4158         3215          3215
1687655    1765889       4158         3010          3010             
1687655    2944965       4158         4199          4158
2485364    1412202       7233         3215          3215

So, query_id1 and query_id2 cannot be identical. Also, no duplications (i.e., an overlap between A and B should not appear twice in the output).

Solution

Here's an algorithm.

Prepare a set of intervals sorted by starting point. Initially the set is empty.
Sort all starting and ending points.
Traverse the points. If a starting point is encountered, add the corresponding interval to the set. If an ending point is encountered, remove the corresponding interval from the set.
When removing an interval, look at other intervals in the set. They all overlap the interval being removed, and they are sorted by the length of the overlap, longest first. Traverse the set until the length is too short, and report each overlap.