pythonpandasbioinformaticspyranges

Pandas groupby: use data in other columns to create groups (genomic intervals)


As part of a larger data set, I have a DataFrame organized as such:

Chromosome    arm    Start    End    ratio_median
5    5.5    96100001    96150000    -0.582
5    5.5    96150001    96200000    -0.582
5    5.5    96200001    96250000    -0.582
5    5.5    96250001    96300000    -0.582
5    5.5    96300001    96350000    -0.582

The goal here is to group rows with the same Chromosome, arm and ratio_median, and form larger intervals using the minimum of Start and maximum of End.

It looks like a bog-standard problem solvable with groupby:

grouped = df.groupby(by=["Chromosome", "arm", "ratio_median"]).agg(
    {"Chromosome": "first", "Start": "min", "End": "max", "ratio_median": "first"})

However, since these are coordinates, grouping should only consider consecutive groups with the common grouping keys, not the entire data set. In other words, the group boundaries should be set when there is a change in ratio_median for the same chromosome and arm.

The groupby approach works perfectly until you have intervals with the same values which are separated by one or more intervals with different keys (ratio_median being the discriminant element). For example, starting from these data:

Chromosome    arm    Start    End    ratio_median
5    5.5    96150001    96200000    -0.582
5    5.5    96200001    96250000    -0.582
5    5.5    96250001    96300000    -0.582
5    5.5    96300001    96350000    -0.582
5    5.5    97000001    97050001    -0.582
5    5.5    102600001    102650000    -0.014
5    5.5    102650001    102700000    -0.014
5    5.5    102700001    102750000    -0.014
5    5.5    102750001    102800000    -0.014
5    5.5    102800001    102850000    -0.014
5    5.5    103700001    103750000    -0.582
5    5.5    103750001    103800000    -0.582
5    5.5    103800001    103850000    -0.582
5    5.5    103850001    103900000    -0.582
5    5.5    103900001    103950000    -0.582

There are three separate intervals here: but grouping with groupby will lump (correctly: it's working as intended) the third interval with the first:

Chromosome    arm    Start    End    ratio_median
5    5.5    96100001    103950000    -0.582
5    5.5    102600001    102850000    -0.014

From the perspective of the coordinates, this is incorrect, because they should not overlap like that: only consecutive rows with the same grouping keys should be aggregated. The correct, expected result should be:

Chromosome    arm    Start    End    ratio_median
5    5.5    96100001    97050001    -0.582
5    5.5    102600001    102850000    -0.014
5    5.5    103700001    103950000    -0.582

I'm not, however, aware of how to do this properly in pandas, nor with additional, domain-specific libraries like PyRanges or bioframe. I have tried PyRanges.cluster() but that, on the other hand, assigns IDs in a different way and the resulting intervals are smaller.

My guess is that some form of iteration is required here, but what would be the best approach? I have tried groupby alone but again that suffers from the problem as above.


Solution

  • You can compare the End to the shifted Start per group to form a new grouper:

    g = (df.sort_values(by=['Start', 'End'])
           .groupby(['Chromosome', 'arm', 'ratio_median'])['End']
           .transform(lambda s: s.shift().rsub(df['Start']).gt(1).cumsum())
        )
    
    out = (df
       .groupby(['Chromosome', 'arm', 'ratio_median', g],
                as_index=False, sort=False)
       .agg({'Start': 'min', 'End': 'max'})
    )
    

    Output:

       Chromosome  arm  ratio_median      Start        End
    0           5  5.5        -0.582   96150001   96350000
    1           5  5.5        -0.014  102600001  102850000
    2           5  5.5        -0.582  103700001  103950000
    

    Intermediates:

        Chromosome  arm      Start        End  ratio_median        sub  g
    0            5  5.5   96150001   96200000        -0.582        NaN  0
    1            5  5.5   96200001   96250000        -0.582        1.0  0
    2            5  5.5   96250001   96300000        -0.582        1.0  0
    3            5  5.5   96300001   96350000        -0.582        1.0  0
    4            5  5.5  102600001  102650000        -0.014        NaN  0
    5            5  5.5  102650001  102700000        -0.014        1.0  0
    6            5  5.5  102700001  102750000        -0.014        1.0  0
    7            5  5.5  102750001  102800000        -0.014        1.0  0
    8            5  5.5  102800001  102850000        -0.014        1.0  0
    9            5  5.5  103700001  103750000        -0.582  7350001.0  1
    10           5  5.5  103750001  103800000        -0.582        1.0  1
    11           5  5.5  103800001  103850000        -0.582        1.0  1
    12           5  5.5  103850001  103900000        -0.582        1.0  1
    13           5  5.5  103900001  103950000        -0.582        1.0  1
    

    If you only want to group by consecutive rows among selected columns:

    cols = ['Chromosome', 'arm', 'ratio_median']
    d = {c: 'first' for c in cols} | {'Start': 'min', 'End': 'max'}
    out = df.groupby(df[cols].ne(df[cols].shift()).any(axis=1).cumsum()).agg(d)
    

    Output:

       Chromosome  arm  ratio_median      Start        End
    1           5  5.5        -0.582   96150001   97050001
    2           5  5.5        -0.014  102600001  102850000
    3           5  5.5        -0.582  103700001  103950000