pythonpandasgroup-byaggregation

Pandas groupby multiple columns, aggregate some columns, add a count column of each group


The data I am working with:

data (140631115432592), ndim: 2, size: 3947910, shape: (232230, 17)
VIN (1-10)                                            object
County                                                object
City                                                  object
State                                                 object
Postal Code                                          float64
Model Year                                             int64
Make                                                  object
Model                                                 object
Electric Vehicle Type                                 object
Clean Alternative Fuel Vehicle (CAFV) Eligibility     object
Electric Range                                       float64
Base MSRP                                            float64
Legislative District                                 float64
DOL Vehicle ID                                         int64
Vehicle Location                                      object
Electric Utility                                      object
2020 Census Tract                                    float64
dtype: object
   VIN (1-10)    County      City State  Postal Code  ...  Legislative District DOL Vehicle ID             Vehicle Location                               Electric Utility 2020 Census Tract
0  2T3YL4DV0E      King  Bellevue    WA      98005.0  ...                  41.0      186450183   POINT (-122.1621 47.64441)  PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)      5.303302e+10
1  5YJ3E1EB6K      King   Bothell    WA      98011.0  ...                   1.0      478093654  POINT (-122.20563 47.76144)  PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA)      5.303302e+10
2  5UX43EU02S  Thurston   Olympia    WA      98502.0  ...                  35.0      274800718  POINT (-122.92333 47.03779)                         PUGET SOUND ENERGY INC      5.306701e+10
3  JTMAB3FV5R  Thurston   Olympia    WA      98513.0  ...                   2.0      260758165  POINT (-122.81754 46.98876)                         PUGET SOUND ENERGY INC      5.306701e+10
4  5YJYGDEE8M    Yakima     Selah    WA      98942.0  ...                  15.0      236581355  POINT (-120.53145 46.65405)                                     PACIFICORP      5.307700e+10

Data in csv format:

VIN (1-10),County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Vehicle Location,Electric Utility,2020 Census Tract
2T3YL4DV0E,King,Bellevue,WA,98005,2014,TOYOTA,RAV4,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,103,0,41,186450183,POINT (-122.1621 47.64441),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),53033023604
5YJ3E1EB6K,King,Bothell,WA,98011,2019,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,220,0,1,478093654,POINT (-122.20563 47.76144),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),53033022102
5UX43EU02S,Thurston,Olympia,WA,98502,2025,BMW,X5,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,40,0,35,274800718,POINT (-122.92333 47.03779),PUGET SOUND ENERGY INC,53067011902
JTMAB3FV5R,Thurston,Olympia,WA,98513,2024,TOYOTA,RAV4 PRIME,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,42,0,2,260758165,POINT (-122.81754 46.98876),PUGET SOUND ENERGY INC,53067012332
5YJYGDEE8M,Yakima,Selah,WA,98942,2021,TESLA,MODEL Y,Battery Electric Vehicle (BEV),Eligibility unknown as battery range has not been researched,0,0,15,236581355,POINT (-120.53145 46.65405),PACIFICORP,53077003200
3C3CFFGE1G,Thurston,Olympia,WA,98501,2016,FIAT,500,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,84,0,22,294762219,POINT (-122.89166 47.03956),PUGET SOUND ENERGY INC,53067010802
5YJ3E1EA4J,Snohomish,Marysville,WA,98271,2018,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,215,0,39,270125096,POINT (-122.1677 48.11026),PUGET SOUND ENERGY INC,53061052808
5YJ3E1EA3K,King,Seattle,WA,98102,2019,TESLA,MODEL 3,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,220,0,43,238776492,POINT (-122.32427 47.63433),CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA),53033006600
1N4AZ0CP5E,Thurston,Yelm,WA,98597,2014,NISSAN,LEAF,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,84,0,2,257246118,POINT (-122.60735 46.94239),PUGET SOUND ENERGY INC,53067012421

Filtering and grouping:

filt = (data["Model Year"] >= 2018) & (data["Electric Vehicle Type"] == "Battery Electric Vehicle (BEV)")
data = data[filt].groupby(["State", "Make"], sort=False, observed=True, as_index=False).agg( avg_electric_range=pd.NamedAgg(column="Electric Range", aggfunc="mean"), oldest_model_year=pd.NamedAgg(column="Model Year", aggfunc="min"))

Currently it yields the following table:

  State       Make  avg_electric_range  oldest_model_year
0    WA      TESLA           52.143448               2018
1    WA     NISSAN           60.051874               2018
<snip>

How do I add a Count column which shows the count of each group which is used for further filtering? Note: rule out apply as everything should stay in Pandas'land.


Solution

  • Your question would benefit from a minimal reproducible example.

    That said, the count doesn't really depend on a particular column, as long as you don't have missing values, thus pick any one that matches this criterion and add another aggregation (you can use one of the grouping columns or Model Year since you know it must be a valid number):

    out = (data[filt].groupby(["State", "Make"], sort=False, observed=True, as_index=False)
            .agg(avg_electric_range=pd.NamedAgg(column="Electric Range", aggfunc="mean"),
                 oldest_model_year=pd.NamedAgg(column="Model Year", aggfunc="min"),
                 count=pd.NamedAgg(column="Model Year", aggfunc="count"),
                )
           )
    

    Example output:

      State Make  avg_electric_range  oldest_model_year  count
    0    WA    X                 0.5               2018      2
    1    WA    Y                 3.0               2018      3