pythonpandasgroup-byaggregate

compute named quantiles in pandas using groupby aggregate


Among other descriptive statistics, I want to get some quantiles out of my pandas DataFrame. I can get the quantiles I want a couple of different ways, but I can't find the right way to do it with aggregate. I'd like to use aggregate because it'd be tidy and maybe computationally efficient to get all my stats in one go.

rng = np.random.default_rng(seed=18860504)
df = pd.DataFrame({
    "dummy": 1,
    "bell": rng.normal(loc=0, scale=1, size=100),
    "fish": rng.poisson(lam=10, size=100),
    "cabin": rng.lognormal(mean=0, sigma=1.0, size=100),
})
quants = [x/5 for x in range(6)]
quantiles = pd.DataFrame({
    "quantile" : [f"q{100*q:02n}" for q in quants],
    "bell" : df.groupby("dummy")["bell"].quantile(quants),
    "fish" : df.groupby("dummy")["fish"].quantile(quants),
}) 
print(quantiles)

Output:

          quantile      bell  fish
dummy                             
1     0.0     q000 -2.313461   4.0
      0.2     q020 -0.933831   7.0
      0.4     q040 -0.246860   9.0
      0.6     q060  0.211076  10.0
      0.8     q080  0.685958  13.0
      1.0     q100  3.017258  20.0

I'd like to get these quantiles using groupby().agg(), ideally with programmatically named columns like "bell_q90". Here's an example of the aggregate syntax that feels natural to me:

df.groupby("dummy").agg(
    bell_med=("bell", "median"),
    bell_mean=("bell", "mean"),
    fish_med=("fish", "median"),
    fish_mean=("fish", "mean"),
    # fish_q10=("fish", "quantile(0.1)"), # nothing like it 
    # fish_q10=("fish", "quantile", 0.1), # nothing like it 
    # fish_q10=("fish", "quantile", kwargs({"q":0.1}), # nothing like it 
) 

I can imagine generating the columns by iterating over quants and a list of named columns, using Series.agg and than stitching them together, but this seems like a hack. (For example, it would require me to do my "normal" aggregation first and then add quantiles on afterwards.)

my_aggs = dict()
for q in quants:
    for col in ["bell", "fish"]:
        my_aggs[f"{col}_q{100*q:03n}"] = df.groupby("dummy")[col].quantile(q) 

print(pd.DataFrame(my_aggs)) # numbers equivalent to those above

Is there a better way?


Solution

  • You could use a function factory to simplify the syntax:

    def quantile(q=0.5, **kwargs):
        def f(series):
            return series.quantile(q, **kwargs)
        return f
        
    df.groupby('dummy').agg(
        bell_med=('bell', 'median'),
        bell_mean=('bell', 'mean'),
        fish_med=('fish', 'median'),
        fish_mean=('fish', 'mean'),
        bell_q10=('bell', quantile(0.1)),
        fish_q10=('fish', quantile(0.1)),
    )
    

    If you have many combinations, you could also combine this with a dictionary comprehension and parameter expansion:

    df.groupby('dummy').agg(**{'bell_med': ('bell', 'median'),
                               'bell_mean': ('bell', 'mean'),
                               'fish_med': ('fish', 'median'),
                               'fish_mean': ('fish', 'mean'),
                               },
                            **{f'{c}_q{100*q:02n}': (c, quantile(q))
                               for q in [0.1] # add more if needed
                               for c in ['bell', 'fish']
                              }
                           )
    

    Output:

    
           bell_med  bell_mean  fish_med  fish_mean  bell_q10  fish_q10
    dummy                                                              
    1     -0.063454  -0.058557      10.0       9.92 -1.553682       6.0