Tags: python, pandas, group-by, aggregate

Grouping data by specific years in Python


I want to create a DataFrame grouped by region and year that shows the average age in each region for specific years. My columns would look something like:

region, year, average age

So far I have:

# specify aggregation functions to column 'age'    
ageAverage = {'age':{'average age':'mean'}} 

# groupby and apply functions    
ageDataFrame = data.groupby(['Region', data.Date.dt.year]).agg(ageAverage)

This works great, but how can I group only the data from specific years, for example between 2010 and 2015?


Solution

  • You need to filter the rows first with between (both bounds are inclusive):

    ageDataFrame = (data[data.Date.dt.year.between(2010, 2015)]
                      .groupby(['Region', data.Date.dt.year])
                      .agg(ageAverage))
    

    Also, in the latest version of pandas (0.22.0), the nested dictionary in ageAverage raises:

    SpecificationError: cannot perform renaming for age with a nested dictionary

    The correct solution is to select the column in brackets after groupby and aggregate with a tuple, where the first value is the new column name and the second is the aggregation function:

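    import numpy as np
    import pandas as pd

    np.random.seed(123)
    
    # month-end dates 13 months apart (on pandas >= 2.2 the alias is '13ME')
    rng = pd.date_range('2009-04-03', periods=10, freq='13M')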
    data = pd.DataFrame({'Date': rng,
                         'Region':['reg1'] * 3 + ['reg2'] * 7,
                         'average age': np.random.randint(20, size=10)})  
    print (data)
            Date Region  average age
    0 2009-04-30   reg1           13
    1 2010-05-31   reg1            2
    2 2011-06-30   reg1            2
    3 2012-07-31   reg2            6
    4 2013-08-31   reg2           17
    5 2014-09-30   reg2           19
    6 2015-10-31   reg2           10
    7 2016-11-30   reg2            1
    8 2017-12-31   reg2            0
    9 2019-01-31   reg2           17
    
    # list of (new column name, aggregation function) tuples
    ageAverage = [('age', 'mean')]
    
    # filter the years, group by region and year, then aggregate
    ageDataFrame = (data[data.Date.dt.year.between(2010, 2015)]
                     .groupby(['Region', data.Date.dt.year])['average age']
                     .agg(ageAverage))
    print (ageDataFrame)
                 age
    Region Date     
    reg1   2010    2
           2011    2
    reg2   2012    6
           2013   17
           2014   19
           2015   10
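
On newer pandas versions (0.25 and later), the same filter-then-group idea can also be written with named aggregation, which sets the output column name directly. The following is only a minimal sketch under some assumptions: the sample values are made up for illustration, the raw column is assumed to be called age as in the question, and the names mask, filtered and year are introduced here and are not part of the original code.

```python
import pandas as pd

# Hypothetical sample data; 'age' is assumed to be the raw column, as in the question.
data = pd.DataFrame({
    'Date': pd.to_datetime(['2009-04-30', '2010-05-31', '2011-06-30', '2012-07-31',
                            '2013-08-31', '2014-09-30', '2015-10-31', '2016-11-30']),
    'Region': ['reg1'] * 3 + ['reg2'] * 5,
    'age': [13, 2, 2, 6, 17, 19, 10, 1],
})

# Keep only rows whose year is in 2010..2015 (between is inclusive on both ends).
mask = data['Date'].dt.year.between(2010, 2015)
filtered = data.loc[mask]

# Group by region and year; the dict is unpacked because the new
# column name 'average age' contains a space and cannot be a keyword.
ageDataFrame = (filtered
                .groupby(['Region', filtered['Date'].dt.year.rename('year')])
                .agg(**{'average age': ('age', 'mean')})
                .reset_index())
print(ageDataFrame)
```

The reset_index call turns the Region and year group keys back into ordinary columns, which gives the region, year, average age layout asked for in the question.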