pythonpython-3.xpandaspandas-groupby

Pandas: within groupby groups, return max value if it is at least 3x greater than any other value


I need to group a dataframe by the FeatureID column, and then test each group to determine if the max value in the group's gene_count column is at least 3 times greater than any of the other gene_count values in the group.

If no value is at least 3x times greater, I need to return the row with the max value in the feature_div column.

At this point, if there is not a row with max gene_count (aka they are all tied) and no row with max feature_div, then that FeatureID should not be returned.

So the pseudocode logic is:

For each group in df.groupby("FeatureID"):
    If max(gene_count) 3x > all other gene_counts in group:
        return this row
    Elif:
        there is a max(feature_div) return this row
    Else:
        max(gene_count) is not >3 x higher than all other gene_counts in group
        and the feature_div scores are all tied
        pass   

Here is my dataframe:

FeatureID     gene      feature_div  gene_count
1_1001_1028   NTRK1.1        2         2.0
1_1001_1028   TP53.1         1         2.0
1_1001_193    MTOR.1         1         9.0
1_1001_193    TP53.2         1         3.0
1_1003_1034   EGFR.1         3         4.0
1_1003_1034   EGFR.5         3         4.0
1_1008_823    ABL1.1         1         2.0
1_1008_823    BRCA2.2        2         2.0

My desired output would look like this:

 FeatureID     gene      feature_div    gene_count
 1_1001_1028   NTRK1.1       2           2.0
 1_1001_193    MTOR.1        1           9.0
 1_1008_823    BRCA2.2       2           2.0

In the first row, NTRK1.1 gets selected because the counts are tied, but it has higher feature_div than TP53.1.

In the second row, MTOR.1 gets selected because the gene_count is 3x higher than TP53.3.

In the third row BRCA2.2 is selected because the gene_count are tied but it has higher feature_div.

1_1003_1034 is not returned because the gene_count is not 3 times higher than the other options, and there is no feature_div higher than the other options.


Solution

  • The following solution uses apply and a custom filtering function to solve this:

    def filter_function(x):
        max_gene_counts = x.sort_values('gene_count', ascending=False).gene_count.head(2).values
        max_feature_divs = x.sort_values('feature_div', ascending=False).feature_div.head(2).values
    
        if max_gene_counts[0] >= 3 * max_gene_counts[1]:
            return x[x.gene_count == max_gene_counts[0]]
        elif max_feature_divs[0] > max_feature_divs[1]:
            return x[x.feature_div == max_feature_divs[0]]
        else:
            pass
    
    df.groupby('FeatureID').apply(filter_function)
    

    Result:

      FeatureID      gene  feature_div  gene_count
    1_1001_1028   NTRK1.1            2         2.0
     1_1001_193    MTOR.1            1         9.0
     1_1008_823   BRCA2.2            2         2.0