I am trying to find pairs of intervals that overlap by at least some minimum overlap length that is set by the user. The intervals are from this pandas dataframe:
import pandas as pds
print(df1.head())
print(df1.tail())
query_id start_pos end_pos read_length orientation
0 1687655 1 4158 4158 F
1 2485364 1 7233 7233 R
2 1412202 1 3215 3215 R
3 1765889 1 3010 3010 R
4 2944965 1 4199 4199 R
query_id start_pos end_pos read_length orientation
3082467 112838 27863832 27865583 1752 F
3082468 138670 28431208 28431804 597 R
3082469 171683 28489928 28490409 482 F
3082470 2930053 28569533 28569860 328 F
3082471 1896622 28589281 28589554 274 R
where start_pos is where the interval starts and end_pos is where the interval ends. read_length is the length of the interval.
The data is sorted by start_pos.
The program should have the following output format:
query_id1 -- query_id2 -- read_length1 -- read_length2 -- overlap_length
I am running the program on a compute node with up to 512gb RAM and 4x Intel Xeon E7-4830 CPU (32 cores).
I've tried running my own code to find the overlaps, but it is taking too long to run.
Here is the code that I tried,
import pandas as pds
overlap_df = pds.DataFrame()
def create_overlap_table(df1, ovl_len):
...
(sort and clean the data here)
...
def iterate_queries(row):
global overlap_df
index1 = df1.index[df1['query_id'] == row['query_id']]
next_int_index = df1.index.get_loc(index1[0]) + 1
if row['read_length'] >= ovl_len:
if df1.index.size-1 >= next_int_index:
end_pos_minus_ovlp = (row['end_pos'] - ovl_len) + 2
subset_df = df1.loc[(df1['start_pos'] < end_pos_minus_ovlp)]
subset_df = subset_df.loc[subset_df.index == subset_df.index.max()]
subset_df = df1.iloc[next_int_index:df1.index.get_loc(subset_df.index[0])]
subset_df = subset_df.loc[subset_df['read_length'] >= ovl_len]
rows1 = pds.DataFrame({'read_id1': np.repeat(row['query_id'], repeats=subset_df.index.size), 'read_id2': subset_df['query_id']})
overlap_df = overlap_df.append(rows1)
df1.apply(iterate_queries, axis=1)
print(overlap_df)
Again, I ran this code on the compute node, but it was running for hours before I finally cancelled the job.
I've also tried using two packages for this problem--PyRanges, as well as an R package called IRanges, but they take too long to run as well. I've seen posts on interval trees and a python library called pybedtools, and I was planning on looking into them as a next step.
Any feedback would be really appreciated
EDIT: For minimum overlap length of, say 800, the first 5 rows should look like this,
query_id1 query_id2 read_length1 read_length2 overlap_length
1687655 2485364 4158 7233 4158
1687655 1412202 4158 3215 3215
1687655 1765889 4158 3010 3010
1687655 2944965 4158 4199 4158
2485364 1412202 7233 3215 3215
So, query_id1 and query_id2 cannot be identical. Also, no duplications (i.e., an overlap between A and B should not appear twice in the output).
Here's an algorithm.