pythonpandaspandas-groupbysplit-apply-combine

Pandas groupby, aggregate and filter strange behavior


I am trying to filter a dataset based on some aggregate measures: i need to find the UserIDs that have performed between 5 and 15 transactions OR their average payment is between 0 and 1500. This is my code:

grouped_count = dataset.groupby('UserID').size()
user_count = grouped_count[(grouped_count >= 5) & (grouped_count <= 15)]
    
grouped_mean = dataset.groupby('UserID').mean()
user_mean = grouped_mean[(grouped_mean['Amount'] >= 0.0) & (grouped_mean['Amount'] <= 1500.0)]

The count part seems to be fine, but i have some concerns about the mean part: it seems the groupby().mean() runs correctly, but then the filtering part produces some rows showing a NaN value where they should be dropped instead.

> grouped_mean
            Amount      Authorized
UserID 
1        64.640000             1.0
2       750.000000             1.0
3       696.762857             1.0
4       424.666667             1.0
5       446.847500             1.0
...            ...             ...
58504   662.950000             1.0
58505  1578.008750             1.0
58506  2990.800848             1.0
58507    71.190000             1.0
58508    20.000000             1.0

[58508 rows x 2 columns]

> user_mean
           Amount      Authorized
UserID                                                      
1       64.640000             1.0
2      750.000000             1.0
3      696.762857             1.0
4      424.666667             1.0
5      446.847500             1.0
...           ...             ...
58504  662.950000             1.0
58505         NaN             1.0
58506         NaN             1.0
58507   71.190000             1.0
58508   20.000000             1.0

[58508 rows x 2 columns]

How can i get the result i need? Can i just add a user_mean = user_mean.dropna(subset='Amount') or is there a better way to do filtering after grouping and aggregating?


Solution

  • Indeed the solution by Scott Boston in the comments solves the problem. The relevant columns must be selected before computing the mean.