Among other descriptive statistics, I want to get some quantiles out of my pandas DataFrame
. I can get the quantiles I want a couple of different ways, but I can't find the right way to do it with aggregate
. I'd like to use aggregate because it'd be tidy and maybe computationally efficient to get all my stats in one go.
rng = np.random.default_rng(seed=18860504)
df = pd.DataFrame({
"dummy": 1,
"bell": rng.normal(loc=0, scale=1, size=100),
"fish": rng.poisson(lam=10, size=100),
"cabin": rng.lognormal(mean=0, sigma=1.0, size=100),
})
quants = [x/5 for x in range(6)]
quantiles = pd.DataFrame({
"quantile" : [f"q{100*q:02n}" for q in quants],
"bell" : df.groupby("dummy")["bell"].quantile(quants),
"fish" : df.groupby("dummy")["fish"].quantile(quants),
})
print(quantiles)
Output:
quantile bell fish
dummy
1 0.0 q000 -2.313461 4.0
0.2 q020 -0.933831 7.0
0.4 q040 -0.246860 9.0
0.6 q060 0.211076 10.0
0.8 q080 0.685958 13.0
1.0 q100 3.017258 20.0
I'd like to get these quantiles using groupby().agg()
, ideally with programmatically named columns like "bell_q90". Here's an example of the aggregate
syntax that feels natural to me:
df.groupby("dummy").agg(
bell_med=("bell", "median"),
bell_mean=("bell", "mean"),
fish_med=("fish", "median"),
fish_mean=("fish", "mean"),
# fish_q10=("fish", "quantile(0.1)"), # nothing like it
# fish_q10=("fish", "quantile", 0.1), # nothing like it
# fish_q10=("fish", "quantile", kwargs({"q":0.1}), # nothing like it
)
I can imagine generating the columns by iterating over quants
and a list of named columns, using Series.agg
and than stitching them together, but this seems like a hack. (For example, it would require me to do my "normal" aggregation first and then add quantiles on afterwards.)
my_aggs = dict()
for q in quants:
for col in ["bell", "fish"]:
my_aggs[f"{col}_q{100*q:03n}"] = df.groupby("dummy")[col].quantile(q)
print(pd.DataFrame(my_aggs)) # numbers equivalent to those above
Is there a better way?
You could use a function factory to simplify the syntax:
def quantile(q=0.5, **kwargs):
def f(series):
return series.quantile(q, **kwargs)
return f
df.groupby('dummy').agg(
bell_med=('bell', 'median'),
bell_mean=('bell', 'mean'),
fish_med=('fish', 'median'),
fish_mean=('fish', 'mean'),
bell_q10=('bell', quantile(0.1)),
fish_q10=('fish', quantile(0.1)),
)
If you have many combinations, you could also combine this with a dictionary comprehension and parameter expansion:
df.groupby('dummy').agg(**{'bell_med': ('bell', 'median'),
'bell_mean': ('bell', 'mean'),
'fish_med': ('fish', 'median'),
'fish_mean': ('fish', 'mean'),
},
**{f'{c}_q{100*q:02n}': (c, quantile(q))
for q in [0.1] # add more if needed
for c in ['bell', 'fish']
}
)
Output:
bell_med bell_mean fish_med fish_mean bell_q10 fish_q10
dummy
1 -0.063454 -0.058557 10.0 9.92 -1.553682 6.0