As part of a larger data set, I have a DataFrame
organized as such:
Chromosome arm Start End ratio_median
5 5.5 96100001 96150000 -0.582
5 5.5 96150001 96200000 -0.582
5 5.5 96200001 96250000 -0.582
5 5.5 96250001 96300000 -0.582
5 5.5 96300001 96350000 -0.582
The goal here is to group rows with the same Chromosome
, arm
and ratio_median
, and form larger intervals using the minimum of Start
and maximum of End
.
It looks like a bog-standard problem solvable with groupby
:
grouped = df.groupby(by=["Chromosome", "arm", "ratio_median"]).agg(
{"Chromosome": "first", "Start": "min", "End": "max", "ratio_median": "first"})
However, since these are coordinates, grouping should only consider consecutive groups with the common grouping keys, not the entire data set. In other words, the group boundaries should be set when there is a change in ratio_median
for the same chromosome and arm.
The groupby
approach works perfectly until you have intervals with the same values which are separated by one or more intervals with different keys (ratio_median
being the discriminant element). For example, starting from these data:
Chromosome arm Start End ratio_median
5 5.5 96150001 96200000 -0.582
5 5.5 96200001 96250000 -0.582
5 5.5 96250001 96300000 -0.582
5 5.5 96300001 96350000 -0.582
5 5.5 97000001 97050001 -0.582
5 5.5 102600001 102650000 -0.014
5 5.5 102650001 102700000 -0.014
5 5.5 102700001 102750000 -0.014
5 5.5 102750001 102800000 -0.014
5 5.5 102800001 102850000 -0.014
5 5.5 103700001 103750000 -0.582
5 5.5 103750001 103800000 -0.582
5 5.5 103800001 103850000 -0.582
5 5.5 103850001 103900000 -0.582
5 5.5 103900001 103950000 -0.582
There are three separate intervals here: but grouping with groupby
will lump (correctly: it's working as intended) the third interval with the first:
Chromosome arm Start End ratio_median
5 5.5 96100001 103950000 -0.582
5 5.5 102600001 102850000 -0.014
From the perspective of the coordinates, this is incorrect, because they should not overlap like that: only consecutive rows with the same grouping keys should be aggregated. The correct, expected result should be:
Chromosome arm Start End ratio_median
5 5.5 96100001 97050001 -0.582
5 5.5 102600001 102850000 -0.014
5 5.5 103700001 103950000 -0.582
I'm not, however, aware of how to do this properly in pandas, nor with additional, domain-specific libraries like PyRanges
or bioframe
. I have tried PyRanges.cluster()
but that, on the other hand, assigns IDs in a different way and the resulting intervals are smaller.
My guess is that some form of iteration is required here, but what would be the best approach? I have tried groupby
alone but again that suffers from the problem as above.
You can compare the End to the shifted Start per group to form a new grouper:
g = (df.sort_values(by=['Start', 'End'])
.groupby(['Chromosome', 'arm', 'ratio_median'])['End']
.transform(lambda s: s.shift().rsub(df['Start']).gt(1).cumsum())
)
out = (df
.groupby(['Chromosome', 'arm', 'ratio_median', g],
as_index=False, sort=False)
.agg({'Start': 'min', 'End': 'max'})
)
Output:
Chromosome arm ratio_median Start End
0 5 5.5 -0.582 96150001 96350000
1 5 5.5 -0.014 102600001 102850000
2 5 5.5 -0.582 103700001 103950000
Intermediates:
Chromosome arm Start End ratio_median sub g
0 5 5.5 96150001 96200000 -0.582 NaN 0
1 5 5.5 96200001 96250000 -0.582 1.0 0
2 5 5.5 96250001 96300000 -0.582 1.0 0
3 5 5.5 96300001 96350000 -0.582 1.0 0
4 5 5.5 102600001 102650000 -0.014 NaN 0
5 5 5.5 102650001 102700000 -0.014 1.0 0
6 5 5.5 102700001 102750000 -0.014 1.0 0
7 5 5.5 102750001 102800000 -0.014 1.0 0
8 5 5.5 102800001 102850000 -0.014 1.0 0
9 5 5.5 103700001 103750000 -0.582 7350001.0 1
10 5 5.5 103750001 103800000 -0.582 1.0 1
11 5 5.5 103800001 103850000 -0.582 1.0 1
12 5 5.5 103850001 103900000 -0.582 1.0 1
13 5 5.5 103900001 103950000 -0.582 1.0 1
If you only want to group by consecutive rows among selected columns:
cols = ['Chromosome', 'arm', 'ratio_median']
d = {c: 'first' for c in cols} | {'Start': 'min', 'End': 'max'}
out = df.groupby(df[cols].ne(df[cols].shift()).any(axis=1).cumsum()).agg(d)
Output:
Chromosome arm ratio_median Start End
1 5 5.5 -0.582 96150001 97050001
2 5 5.5 -0.014 102600001 102850000
3 5 5.5 -0.582 103700001 103950000