I want the exact intervals of DF1 that overlap with DF2. Also, I want the intervals that do not overlap with DF2. This is tricky because you must include A) the DF1 rows that are not overlapping at all B) the DF1 rows that overlap partially.
You can see in the first row of DF1 and DF2, there is overlap-- the overlapping interval (in terms of DF1) is 10-20. The non-overlapping intervals (in terms of DF1) would be 1-9, 21-100. The other tricky part comes in when you consider the next row. The non-overlapping interval would be 1-59, which is correct if we just look at this row. But you must make sure ALL intervals (whether overlapping or not) do not overlap within themselves ON THE CHROMOSOME level. If you remember the row above has an overlapping interval from 10-20, which overlaps with 1-59 (so 1-59 is not a true non-overlapping interval).
The second row on DF1 has no overlap with DF2, so you can count that entire interval as non-overlapping.
Therefore, for chromosome 1: the overlapping intervals are: 10-20, 60-100, 150-200.
The non-overlapping intervals are 1-9, 21-59 (for chromosome 1 lookin g at DF1 only).
As you can see, a true test to see if you got the right output is if you add up all of the intervals (overlapping & non-overlapping) it should be equal to lengths of the regions from DF1.
Example (ignoring chromosome 2 for simplicity):
DF1: 1-100 + 150-200= 151 bases long (start/end inclusive)
1-9 = 9
10-20 = 11
21-59 = 39
60-100 = 41
150-200 = 51
9+11+39+41+51= 151 bases (matches DF1)
DF1
chr start end
1 1 100
1 150 200
2 5 10
DF2
chr start end
1 10 20
1 60 260
1 500 550
2 1 20
UPDATE:
Comment on proposed solution (by @Andrej Kesely):
The proposed solution seems promising, but when you test the following example, it doesn't work as expected.
DF1 *total value count= 137
Chromosome Start End
0 chr1 200 227
1 chr1 613 721
DF2
Chromosome Start End
0 chr1 1000 1227
expected output: DF1 *total value count for left_only= 137
Chromosome Start End merge
0 chr1 200 227 left_only
1 chr1 613 721 left_only
2 chr1 1000 1227 right_only [optional to include right_only]
Incorrect DF output *total value count for left_only= 522 (does NOT match 137)
Chromosome_ range_first range_last _merge_first
1 chr1 1 200 721 left_only
You must keep within the ranges of DF1. In this example DF1, position 228-612 does not exist. Since these values are not in DF1 (or DF2) I do not want them reported in the final output.
IIUC you can do something like this:
df1["range"] = df1.apply(lambda row: range(row["start"], row["end"] + 1), axis=1)
df1 = df1[["chr", "range"]].explode("range")
df2["range"] = df2.apply(lambda row: range(row["start"], row["end"] + 1), axis=1)
df2 = df2[["chr", "range"]].explode("range")
out = df1.merge(df2, on=["chr", "range"], how="outer", indicator=True)
out = out.groupby("chr").apply(
lambda g: g.groupby(
(g["_merge"].ne(g["_merge"].shift()) + g["range"].diff().ne(1)).cumsum()
).agg({"range": ["first", "last"], "_merge": "first"}),
include_groups=False,
)
print(out)
Prints:
range _merge
first last first
chr
1 1 1 9 left_only
2 10 20 both
3 21 59 left_only
4 60 100 both
5 101 149 right_only
6 150 200 both
7 201 260 right_only
8 500 550 right_only
2 1 1 4 right_only
2 5 10 both
3 11 20 right_only
The result contains all intervals, the "type" of intervals is in the _merge
column. e.g. 1-9
is found only in df1
, 10-20
is found in df1
and in df2
, 101-149
only in df2
etc.
With input:
df1
chr start end
0 1 200 227
1 1 613 721
df2
chr start end
0 1 1000 1227
Prints:
range _merge
first last first
chr
1 1 200 227 left_only
2 613 721 left_only
3 1000 1227 right_only