pandaspandas-groupbystdev

Standard deviation with groupby(multiple columns) Pandas


I am working with data from the California Air Resources Board.

site,monitor,date,start_hour,value,variable,units,quality,prelim,name 
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
...

df = pd.concat([pd.read_csv(file, header = 0) for file in f]) #merges all files into one dataframe
df.dropna(axis = 0, how = "all", subset = ['start_hour', 'variable'],
          inplace = True) #drops bottom columns without data in them, NaN

df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df =  df.rename(columns={'value':'conc'})

I have multiple years of hourly PM2.5 concentration data and am trying to prepare graphs that show the average monthly concentration over many years (different graphs for each month). Here's an image of the graph I've created thus far. [![Bombay Beach][1]][1] However, I want to add error bars to the average concentration line but I am having issues when attempting to calculate the standard deviation. I've created a new dataframe d_avg that includes the year, month, day, and average concentration of PM2.5; here's some of the data.

d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
   year  month  day      conc
0  2014      1    1  9.644583
1  2014      1    2  4.945652
2  2014      1    3  4.345238
3  2014      1    4  5.047917
4  2014      1    5  5.212857
5  2014      1    6  2.095714

After this, I found the monthly average m_avg and created a datetime index to plot datetime vs monthly avg conc (refer above, black line).

m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
   year  month      conc   datetime
0  2014      1  4.330985 2014-01-31
1  2014      2  2.280096 2014-02-28
2  2014      3  4.464622 2014-03-31
3  2014      4  6.583759 2014-04-30
4  2014      5  9.069353 2014-05-31
5  2014      6  9.982330 2014-06-30

Now I want to calculate the standard deviation of the d_avg concentration, and I've tried multiple things:

sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()

sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)

sd = d_avg['conc'].apply(lambda x: x.std())

However, each attempt has left me with the same error in the dataframe. I am unable to plot the standard deviation because I believe it is taking the standard deviation of the year and month too, which I am trying to group the data by. Here's what my resulting dataframe sd looks like:

        year     month        sd
0  44.877611  1.000000  1.795868
1  44.877611  1.414214  2.355055
2  44.877611  1.732051  2.597531
3  44.877611  2.000000  2.538749
4  44.877611  2.236068  5.456785
5  44.877611  2.449490  3.315546

Please help me! [1]: https://i.sstatic.net/ueVrG.png


Solution

  • I decided to dance around my issue since I couldn't figure out what was causing the problem. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing me issues. See code below, lots of renaming.

    d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
    m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
    sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0) 
    sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
    m_avg_sd = pd.concat([m_avg, sd], axis = 1)
    m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
    m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)
    

    and here's the new dataframe:

    m_avg_sd.head(5)
    Out[2]: 
       year  month       conc         sd   datetime
    0  2009      1  48.350105  18.394192 2009-01-31
    1  2009      2  21.929383  16.293645 2009-02-28
    2  2009      3  15.094729   6.821124 2009-03-31
    3  2009      4  12.021009   4.391219 2009-04-30
    4  2009      5  13.449100   4.081734 2009-05-31