[SOLVED] Pandas groupby, aggregate and filter strange behavior

Pandas groupby, aggregate and filter strange behavior

I am trying to filter a dataset based on some aggregate measures: i need to find the UserIDs that have performed between 5 and 15 transactions OR their average payment is between 0 and 1500. This is my code:

grouped_count = dataset.groupby('UserID').size()
user_count = grouped_count[(grouped_count >= 5) & (grouped_count <= 15)]
    
grouped_mean = dataset.groupby('UserID').mean()
user_mean = grouped_mean[(grouped_mean['Amount'] >= 0.0) & (grouped_mean['Amount'] <= 1500.0)]

The count part seems to be fine, but i have some concerns about the mean part: it seems the groupby().mean() runs correctly, but then the filtering part produces some rows showing a NaN value where they should be dropped instead.

> grouped_mean
            Amount      Authorized
UserID 
1        64.640000             1.0
2       750.000000             1.0
3       696.762857             1.0
4       424.666667             1.0
5       446.847500             1.0
...            ...             ...
58504   662.950000             1.0
58505  1578.008750             1.0
58506  2990.800848             1.0
58507    71.190000             1.0
58508    20.000000             1.0

[58508 rows x 2 columns]

> user_mean
           Amount      Authorized
UserID                                                      
1       64.640000             1.0
2      750.000000             1.0
3      696.762857             1.0
4      424.666667             1.0
5      446.847500             1.0
...           ...             ...
58504  662.950000             1.0
58505         NaN             1.0
58506         NaN             1.0
58507   71.190000             1.0
58508   20.000000             1.0

[58508 rows x 2 columns]

How can i get the result i need? Can i just add a user_mean = user_mean.dropna(subset='Amount') or is there a better way to do filtering after grouping and aggregating?

Solution

Indeed the solution by Scott Boston in the comments solves the problem. The relevant columns must be selected before computing the mean.