pythonrpandasdplyr

python/pandas equivalent to dplyr 1.0.0 summarize(across())


In R, I find the following very useful when dealing with many variables:

library(dplyr)
dat <- group_by(mtcars, cyl) 
summarize(dat, across(c('mpg','disp'), sum), across(c('drat','wt','qsec'), mean))
# A tibble: 3 x 5
    cyl  disp    hp  drat    wt
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     4 1156.   909  4.07  2.29
2     6 1283.   856  3.59  3.12
3     8 4943.  2929  3.23  4.00

Or even better, selecting with pseudo-regex

summarize(dat, across(ends_with('p'), sum), across(ends_with('t'), mean))

In pandas, the equivalent seems to pass variables one-by-one into a dictionary, eg from this gist:

group_agg = df.groupby("group1").agg({
  "var1" : ["mean"], 
  "var2" : ["sum"], 
  "var3" : ["mean"]
  })

Is there a less verbose way to do this operation in pandas, or with some other package?


Solution

  • For the first scenario, pandas concat suffices :

    dat = df.groupby("cyl")
    
    pd.concat([dat[["mpg", "disp"]].sum(), 
               dat[["drat", "wt", "qsec"]].mean()], 
              axis=1)
    

    For the regex/string processing part, verbose is unavoidable :

    cols_p = [col for col in df.columns if col.endswith("p")]
    cols_t = [col for col in df.columns if col.endswith("t")]
    pd.concat((dat[cols_p].sum(), dat[cols_t].mean()), axis=1)
           disp    hp      drat        wt
    cyl
    4    1156.5   909  4.070909  2.285727
    6    1283.2   856  3.585714  3.117143
    8    4943.4  2929  3.229286  3.999214
    

    Another option, which IMO is simpler, is to build the dictionary before aggregating -inspired by @RichieV:

    aggs = {key:"mean" 
            if key.endswith("t") 
            else "sum" 
            for key in dat 
            if key.endswith(('p', 't'))}
    
    dat.groupby('cyl').agg(aggs)
    
           disp    hp      drat        wt
    cyl
    4    1156.5   909  4.070909  2.285727
    6    1283.2   856  3.585714  3.117143
    8    4943.4  2929  3.229286  3.999214
    

    update: 07/03/2025 another option is with jn.summarise from pyjanitor - I am a contributor to the library:

    # pip install pyjanitor
    import janitor as jn
    import pandas as pd
    mtcars.summarise(("*t",'mean'),("*p","sum"),by='cyl')
             drat        wt    disp    hp
    cyl
    6    3.585714  3.117143  1283.2   856
    4    4.070909  2.285727  1156.5   909
    8    3.229286  3.999214  4943.4  2929