I need to group a dataframe by the FeatureID
column, and then test each group to determine if the max value in the group's gene_count
column is at least 3 times greater than any of the other gene_count
values in the group.
If no value is at least 3x times greater, I need to return the row with the max value in the feature_div
column.
At this point, if there is not a row with max gene_count
(aka they are all tied) and no row with max feature_div
, then that FeatureID
should not be returned.
So the pseudocode logic is:
For each group in df.groupby("FeatureID"):
If max(gene_count) 3x > all other gene_counts in group:
return this row
Elif:
there is a max(feature_div) return this row
Else:
max(gene_count) is not >3 x higher than all other gene_counts in group
and the feature_div scores are all tied
pass
Here is my dataframe:
FeatureID gene feature_div gene_count
1_1001_1028 NTRK1.1 2 2.0
1_1001_1028 TP53.1 1 2.0
1_1001_193 MTOR.1 1 9.0
1_1001_193 TP53.2 1 3.0
1_1003_1034 EGFR.1 3 4.0
1_1003_1034 EGFR.5 3 4.0
1_1008_823 ABL1.1 1 2.0
1_1008_823 BRCA2.2 2 2.0
My desired output would look like this:
FeatureID gene feature_div gene_count
1_1001_1028 NTRK1.1 2 2.0
1_1001_193 MTOR.1 1 9.0
1_1008_823 BRCA2.2 2 2.0
In the first row, NTRK1.1 gets selected because the counts are tied, but it has higher feature_div
than TP53.1.
In the second row, MTOR.1 gets selected because the gene_count
is 3x higher than TP53.3.
In the third row BRCA2.2 is selected because the gene_count
are tied but it has higher feature_div
.
1_1003_1034 is not returned because the gene_count
is not 3 times higher than the other options, and there is no feature_div
higher than the other options.
The following solution uses apply
and a custom filtering function to solve this:
def filter_function(x):
max_gene_counts = x.sort_values('gene_count', ascending=False).gene_count.head(2).values
max_feature_divs = x.sort_values('feature_div', ascending=False).feature_div.head(2).values
if max_gene_counts[0] >= 3 * max_gene_counts[1]:
return x[x.gene_count == max_gene_counts[0]]
elif max_feature_divs[0] > max_feature_divs[1]:
return x[x.feature_div == max_feature_divs[0]]
else:
pass
df.groupby('FeatureID').apply(filter_function)
Result:
FeatureID gene feature_div gene_count
1_1001_1028 NTRK1.1 2 2.0
1_1001_193 MTOR.1 1 9.0
1_1008_823 BRCA2.2 2 2.0