pythonpandasdataframebioinformaticspyranges

Accurately Compute Overlapping and Non-Overlapping Genomic Intervals Across Two DataFrames in Python


I want the exact intervals of DF1 that overlap with DF2. Also, I want the intervals that do not overlap with DF2. This is tricky because you must include A) the DF1 rows that are not overlapping at all B) the DF1 rows that overlap partially.

You can see in the first row of DF1 and DF2, there is overlap-- the overlapping interval (in terms of DF1) is 10-20. The non-overlapping intervals (in terms of DF1) would be 1-9, 21-100. The other tricky part comes in when you consider the next row. The non-overlapping interval would be 1-59, which is correct if we just look at this row. But you must make sure ALL intervals (whether overlapping or not) do not overlap within themselves ON THE CHROMOSOME level. If you remember the row above has an overlapping interval from 10-20, which overlaps with 1-59 (so 1-59 is not a true non-overlapping interval).

The second row on DF1 has no overlap with DF2, so you can count that entire interval as non-overlapping.

Therefore, for chromosome 1: the overlapping intervals are: 10-20, 60-100, 150-200.

The non-overlapping intervals are 1-9, 21-59 (for chromosome 1 lookin g at DF1 only).

As you can see, a true test to see if you got the right output is if you add up all of the intervals (overlapping & non-overlapping) it should be equal to lengths of the regions from DF1.

Example (ignoring chromosome 2 for simplicity):

DF1: 1-100 + 150-200= 151 bases long (start/end inclusive)

1-9 = 9

10-20 = 11

21-59 = 39

60-100 = 41

150-200 = 51

9+11+39+41+51= 151 bases (matches DF1)

DF1

chr   start   end 
1     1       100
1     150     200
2     5       10

DF2

chr   start   end 
1     10      20
1     60      260
1     500     550
2     1       20

UPDATE:

Comment on proposed solution (by @Andrej Kesely):

The proposed solution seems promising, but when you test the following example, it doesn't work as expected.

DF1 *total value count= 137

Chromosome  Start   End
0   chr1    200     227
1   chr1    613     721

DF2

Chromosome  Start   End
0   chr1    1000    1227

expected output: DF1 *total value count for left_only= 137

Chromosome  Start   End   merge
0   chr1    200     227   left_only
1   chr1    613     721   left_only
2   chr1    1000    1227  right_only  [optional to include right_only]

Incorrect DF output *total value count for left_only= 522 (does NOT match 137)

    Chromosome_ range_first range_last  _merge_first
1   chr1    1      200        721         left_only

You must keep within the ranges of DF1. In this example DF1, position 228-612 does not exist. Since these values are not in DF1 (or DF2) I do not want them reported in the final output.


Solution

  • IIUC you can do something like this:

    df1["range"] = df1.apply(lambda row: range(row["start"], row["end"] + 1), axis=1)
    df1 = df1[["chr", "range"]].explode("range")
    
    
    df2["range"] = df2.apply(lambda row: range(row["start"], row["end"] + 1), axis=1)
    df2 = df2[["chr", "range"]].explode("range")
    
    out = df1.merge(df2, on=["chr", "range"], how="outer", indicator=True)
    
    out = out.groupby("chr").apply(
        lambda g: g.groupby(
            (g["_merge"].ne(g["_merge"].shift()) + g["range"].diff().ne(1)).cumsum()
        ).agg({"range": ["first", "last"], "_merge": "first"}),
        include_groups=False,
    )
    
    print(out)
    

    Prints:

          range           _merge
          first last       first
    chr                         
    1   1     1    9   left_only
        2    10   20        both
        3    21   59   left_only
        4    60  100        both
        5   101  149  right_only
        6   150  200        both
        7   201  260  right_only
        8   500  550  right_only
    2   1     1    4  right_only
        2     5   10        both
        3    11   20  right_only
    

    The result contains all intervals, the "type" of intervals is in the _merge column. e.g. 1-9 is found only in df1, 10-20 is found in df1 and in df2, 101-149 only in df2 etc.


    With input:

    df1
       chr  start  end
    0    1    200  227
    1    1    613  721
    
    df2
       chr  start   end
    0    1   1000  1227
    

    Prints:

          range            _merge
          first  last       first
    chr                          
    1   1   200   227   left_only
        2   613   721   left_only
        3  1000  1227  right_only