I am trying to find the non-overlapping intervals for the start/end DNA coordinates (on the same chromosome). I am having a hard time developing a function that takes into account two rows on the same exon. The non-overlapping interval must be unique (not overlap with any other intervals).
For ex, on the first row below, the non-overlapping interval would be 1-49, 61-100. But, if you look at the second row, the non-overlapping interval would be 1-69, 81-100. I want non-overlapping, non-overlapping intervals, so the true interval output I want is 1-49, 61-69, 81-100. Ideally, I would like these intervals to bee separated into their own columns (non-overlap_start, non-overlap_end).
*Please note: I have unique intervals per chromosome.
starting DF
chrom exon_start exon_end start1 end1
1 1 100 50 60
1 1 100 70 80
1 150 155 155 160
2 5 50 25 100
final DF
chrom exon_start exon_end start1 end1 non_overlap
1 1 100 50 60 [1-49, 61-69, 81-100]
1 1 100 70 80 [1-49, 61-69, 81-100]
1 150 155 155 160 [150-154]
2 5 50 25 100 [5-24]
Is this something you want?
def find_non_overlap(part):
minval = part.ex_start.min()
maxval = part.ex_end.max()
t = np.array([np.concatenate((
# add additional True from each side
np.ones(1 + row.ex_start - minval),
np.zeros(row.overlap_start - row.ex_start),
np.ones(row.overlap_end - row.overlap_start + 1),
np.zeros(row.ex_end - row.overlap_end),
np.ones(1 + maxval - row.ex_end),
)) for _, row in part.iterrows()])
changes = minval + np.flatnonzero(np.diff(np.logical_or.reduce(t)))
return list(zip(changes[::2], changes[1::2] - 1))
# return pd.Series(
# {'non_overlap_starts': changes[::2],
# 'non_overlap_ends': changes[1::2] - 1}
# )
def calc_intervals(df):
df = df.copy()
df['overlap_start'] = df[['ex_start', 'start']].max(axis=1)
df['overlap_start'] = df[['overlap_start']].assign(
t=df.ex_end + 1).min(axis=1)
df['overlap_end'] = df[['ex_end', 'end']].min(axis=1)
df['overlap_end'] = df[['overlap_end']].assign(
t=df.ex_start - 1).max(axis=1)
return df.groupby(
['chrom', 'ex_start', 'ex_end']
)[df.columns].apply(
find_non_overlap, include_groups=False)
df = pd.DataFrame([
[0, 0, 100, 50, 60],
[0, 0, 100, 20, 30],
[0, 0, 100, 25, 40],
[0, 0, 100, 90, 100],
[0, 0, 100, 0, 5],
[1, 10, 50, 5, 15],
[2, 10, 50, 1, 5],
[3, 10, 50, 30, 70],
[4, 10, 50, 5, 55],
[5, 10, 50, 20, 30],
[6, 1, 100, 50, 60],
[6, 1, 100, 70, 80],
[6, 150, 155, 155, 160],
[7, 5, 50, 25, 100],
[8, 0, 5, 0, 1],
[8, 0, 5, 3, 5],
[9, 0, 5, 0, 4]
],
columns=['chrom', 'ex_start', 'ex_end', 'start', 'end'])
calc_intervals(df)
Result is:
chrom ex_start ex_end
0 0 100 [(6, 19), (41, 49), (61, 89)]
1 10 50 [(16, 50)]
2 10 50 [(10, 50)]
3 10 50 [(10, 29)]
4 10 50 []
5 10 50 [(10, 19), (31, 50)]
6 1 100 [(1, 49), (61, 69), (81, 100)]
150 155 [(150, 154)]
7 5 50 [(5, 24)]
8 0 5 [(2, 2)]
9 0 5 [(5, 5)]
You can join it with your original dataframe if you need this values repeated in every row
UPD: I have changed the algorithm and excluded assumptions that one chromosome has one ex_start-ex_end diapason and that the (start: end) interval is entirely included in the (ex_start: ex_end) interval. Also, I add more test examples. Check if I use the correct logic of overlapping calculations. Your initial examples are 6 and 7 chrom.
Also if size of exons are very big, you can code values monotonically to make this algorithm much faster:
df = pd.DataFrame([
[0, 3, 1000000000, 50, 60],
[0, 3, 1000000000, 10, 12],
[1, 3, 10, 5, 30],
], columns=['chrom', 'ex_start', 'ex_end', 'start', 'end'])
values = pd.concat([df.ex_start, df.ex_end, df.start, df.end])
values = pd.concat(
[values, values + 1, values - 1]
).sort_values().unique()
code = dict(zip(values, np.arange(0, len(values))))
decode = dict(zip(np.arange(0, len(values)), values))
df_ = df.copy()
df_ = df_.replace(code)
pd.Series([
[(decode[i],decode[j]) for (i, j) in t]
for t in calc_intervals(df_)])
# It's much easier to decode inside find_non_overlap function
Results:
0 [(3, 9), (13, 49), (61, 1000000000)]
1 [(3, 4)]