I am trying to filter a dataset based on some aggregate measures: I need to find the UserIDs that have performed between 5 and 15 transactions OR whose average payment is between 0 and 1500.
This is my code:
# transactions per user, kept if the count is between 5 and 15 (inclusive)
grouped_count = dataset.groupby('UserID').size()
user_count = grouped_count[(grouped_count >= 5) & (grouped_count <= 15)]

# average payment per user, kept if the mean is between 0 and 1500 (inclusive)
grouped_mean = dataset.groupby('UserID').mean()
user_mean = grouped_mean[(grouped_mean['Amount'] >= 0.0) & (grouped_mean['Amount'] <= 1500.0)]
The count part seems to be fine, but I have some concerns about the mean part: the groupby().mean() itself appears to run correctly, but the filtering step then produces rows showing a NaN value where they should be dropped instead.
> grouped_mean
Amount Authorized
UserID
1 64.640000 1.0
2 750.000000 1.0
3 696.762857 1.0
4 424.666667 1.0
5 446.847500 1.0
... ... ...
58504 662.950000 1.0
58505 1578.008750 1.0
58506 2990.800848 1.0
58507 71.190000 1.0
58508 20.000000 1.0
[58508 rows x 2 columns]
> user_mean
Amount Authorized
UserID
1 64.640000 1.0
2 750.000000 1.0
3 696.762857 1.0
4 424.666667 1.0
5 446.847500 1.0
... ... ...
58504 662.950000 1.0
58505 NaN 1.0
58506 NaN 1.0
58507 71.190000 1.0
58508 20.000000 1.0
[58508 rows x 2 columns]
How can I get the result I need? Can I just add user_mean = user_mean.dropna(subset='Amount'), or is there a better way to do filtering after grouping and aggregating?
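For clarity, the quick fix I am asking about would be this one-liner (writing subset as a list of column labels, which dropna also accepts):

user_mean = user_mean.dropna(subset=['Amount'])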
Indeed, the solution suggested by Scott Boston in the comments solves the problem: the relevant column must be selected before computing the mean.
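Here is a minimal sketch of that approach, assuming dataset is the DataFrame from above with 'UserID' and 'Amount' columns. Selecting the column first makes the aggregation return a Series, so the boolean filter returns only the matching rows (which, in my case, is presumably why the NaN masking went away). Series.between(a, b) is inclusive of both ends, so it is equivalent to the pair of >= / <= comparisons above.

# select the relevant column before aggregating, so mean() returns a Series
grouped_mean = dataset.groupby('UserID')['Amount'].mean()
user_mean = grouped_mean[grouped_mean.between(0.0, 1500.0)]

# the count filter, unchanged from above
grouped_count = dataset.groupby('UserID').size()
user_count = grouped_count[grouped_count.between(5, 15)]

# the OR from the original requirement: union of the two sets of UserIDs
user_ids = user_count.index.union(user_mean.index)

The index union at the end is just my own way of expressing the OR between the two conditions; it was not part of the comment.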