Find non-overlapping intervals within DNA coordinates

I am trying to find the non-overlapping intervals for the start/end DNA coordinates (on the same chromosome). I am having a hard time developing a function that takes into account two rows on the same exon. The non-overlapping interval must be unique (not overlap with any other intervals).

For ex, on the first row below, the non-overlapping interval would be 1-49, 61-100. But, if you look at the second row, the non-overlapping interval would be 1-69, 81-100. I want non-overlapping, non-overlapping intervals, so the true interval output I want is 1-49, 61-69, 81-100. Ideally, I would like these intervals to bee separated into their own columns (non-overlap_start, non-overlap_end).

*Please note: I have unique intervals per chromosome.

starting DF

chrom        exon_start    exon_end   start1      end1   
1                  1       100        50           60      
1                  1       100        70           80      
1                  150     155        155          160
2                  5       50         25           100

final DF

chrom   exon_start    exon_end   start1   end1   non_overlap 
1           1             100    50      60      [1-49, 61-69, 81-100]        
1           1             100    70      80      [1-49, 61-69, 81-100] 
1          150             155  155      160      [150-154]
2           5             50     25      100     [5-24]

Solution

Is this something you want?

def find_non_overlap(part):
    minval = part.ex_start.min()
    maxval = part.ex_end.max()
    t = np.array([np.concatenate((
            # add additional True from each side
            np.ones(1 + row.ex_start - minval),
            np.zeros(row.overlap_start - row.ex_start), 
            np.ones(row.overlap_end - row.overlap_start + 1), 
            np.zeros(row.ex_end - row.overlap_end),
            np.ones(1 + maxval - row.ex_end), 
        )) for _, row in part.iterrows()])
    changes = minval + np.flatnonzero(np.diff(np.logical_or.reduce(t)))
    return list(zip(changes[::2], changes[1::2] - 1))
    # return pd.Series(
    #     {'non_overlap_starts': changes[::2], 
    #     'non_overlap_ends': changes[1::2] - 1}
    # )

def calc_intervals(df):
    df = df.copy()
    df['overlap_start'] = df[['ex_start', 'start']].max(axis=1) 
    df['overlap_start'] = df[['overlap_start']].assign(
        t=df.ex_end + 1).min(axis=1)
    df['overlap_end'] = df[['ex_end', 'end']].min(axis=1) 
    df['overlap_end'] = df[['overlap_end']].assign(
        t=df.ex_start - 1).max(axis=1)
    return df.groupby(
        ['chrom', 'ex_start', 'ex_end']
    )[df.columns].apply(
        find_non_overlap, include_groups=False)

df = pd.DataFrame([
    [0, 0, 100, 50, 60],
    [0, 0, 100, 20, 30],
    [0, 0, 100, 25, 40],
    [0, 0, 100, 90, 100],
    [0, 0, 100, 0, 5],
    [1, 10, 50, 5, 15],
    [2, 10, 50, 1, 5],
    [3, 10, 50, 30, 70],
    [4, 10, 50, 5, 55],
    [5, 10, 50, 20, 30],
    [6, 1, 100, 50, 60],
    [6, 1, 100, 70, 80],
    [6, 150, 155, 155, 160],
    [7, 5, 50, 25, 100],
    [8, 0, 5, 0, 1],
    [8, 0, 5, 3, 5],
    [9, 0, 5, 0, 4]
], 
columns=['chrom', 'ex_start', 'ex_end', 'start', 'end'])

calc_intervals(df)

Result is:

chrom  ex_start  ex_end
0      0         100        [(6, 19), (41, 49), (61, 89)]
1      10        50                            [(16, 50)]
2      10        50                            [(10, 50)]
3      10        50                            [(10, 29)]
4      10        50                                    []
5      10        50                  [(10, 19), (31, 50)]
6      1         100       [(1, 49), (61, 69), (81, 100)]
       150       155                         [(150, 154)]
7      5         50                             [(5, 24)]
8      0         5                               [(2, 2)]
9      0         5                               [(5, 5)]

You can join it with your original dataframe if you need this values repeated in every row

UPD: I have changed the algorithm and excluded assumptions that one chromosome has one ex_start-ex_end diapason and that the (start: end) interval is entirely included in the (ex_start: ex_end) interval. Also, I add more test examples. Check if I use the correct logic of overlapping calculations. Your initial examples are 6 and 7 chrom.

Also if size of exons are very big, you can code values monotonically to make this algorithm much faster:

df = pd.DataFrame([
    [0, 3, 1000000000, 50, 60],
    [0, 3, 1000000000, 10, 12],
    [1, 3, 10, 5, 30],
], columns=['chrom', 'ex_start', 'ex_end', 'start', 'end'])


values = pd.concat([df.ex_start, df.ex_end, df.start, df.end])
values = pd.concat(
    [values, values + 1, values - 1]
).sort_values().unique()
code = dict(zip(values, np.arange(0, len(values))))
decode = dict(zip(np.arange(0, len(values)), values))

df_ = df.copy()
df_ = df_.replace(code)

pd.Series([
    [(decode[i],decode[j]) for (i, j) in t] 
    for t in calc_intervals(df_)])
# It's much easier to decode inside find_non_overlap function

Results:

0    [(3, 9), (13, 49), (61, 1000000000)]
1                                [(3, 4)]