pythonpandastime-seriesearthpy

use time series data in python to calculate mean, variance std deviation


I have data collected from sensors that looks like:

sec   nanosec value 

1001   1       0.2 

1001   2       0.2

1001   3       0.2 

1002   1       0.1  

1002   2       0.2   

1002   3       0.1 

1003   1       0.2 

1003   2       0.2

1003   3       0.1  

1004   1       0.2   

1004   2       0.2 

1004   3       0.2 

1004   4      0.1 

I want to calculate average,std deviation and some other stats like maximum, minimum for a column every 2 seconds. so average for (1001, 1002)= 0.167, average of (1003,1004)=0.17

From the tutorials http://earthpy.org/pandas-basics.html, I think I should convert it to time series and the use rolling _means from pandas, but I am new to time series data so I am not sure if that is the correct way. Also how do I specify frequency here for conversion as observations for the first second have less observations. So for actual data I have less than 100 readings for 1001 second and then 100 observations for 1002 second onwards.

I could also do a simple groupby on seconds but it would group readings per second and not every 2 seconds, then how could i combine observations for 2 consecutive groups from groupby and then do analysis.


Solution

  • I think you can first convert column sec to_timedelta, set_index and resample by 2 seconds (2S):

    df['sec'] = pd.to_timedelta(df.sec, unit='s')
    df.set_index('sec', inplace=True)
    print (df)
              nanosec  value
    sec                     
    00:16:41        1    0.2
    00:16:41        2    0.2
    00:16:41        3    0.2
    00:16:42        1    0.1
    00:16:42        2    0.2
    00:16:42        3    0.1
    00:16:43        1    0.2
    00:16:43        2    0.2
    00:16:43        3    0.1
    00:16:44        1    0.2
    00:16:44        2    0.2
    00:16:44        3    0.2
    00:16:44        4    0.1
    
    print (df.value.resample('2S').mean())
    sec
    00:16:41    0.166667
    00:16:43    0.171429
    00:16:45         NaN
    Freq: 2S, Name: value, dtype: float64
    
    print (df.value.resample('2S').std())
    sec
    00:16:41    0.051640
    00:16:43    0.048795
    00:16:45         NaN
    Freq: 2S, Name: value, dtype: float64
    
    print (df.value.resample('2S').max())
    sec
    00:16:41    0.2
    00:16:43    0.2
    00:16:45    NaN
    Freq: 2S, Name: value, dtype: float64
    

    Maybe you need change base in resample:

    print (df.value.resample('2S', base=1).mean())
    sec
    00:16:42    0.166667
    00:16:44    0.171429
    00:16:46         NaN
    Freq: 2S, Name: value, dtype: float64
    
    print (df.value.resample('2S', base=1).std())
    sec
    00:16:42    0.051640
    00:16:44    0.048795
    00:16:46         NaN
    Freq: 2S, Name: value, dtype: float64
    
    print (df.value.resample('2S', base=1).max())
    sec
    00:16:42    0.2
    00:16:44    0.2
    00:16:46    NaN
    Freq: 2S, Name: value, dtype: float64
    
    print (df.value.resample('2S', base=2).mean())
    sec
    00:16:43    0.166667
    00:16:45    0.171429
    00:16:47         NaN
    Freq: 2S, Name: value, dtype: float64
    
    print (df.value.resample('2S', base=2).std())
    sec
    00:16:43    0.051640
    00:16:45    0.048795
    00:16:47         NaN
    Freq: 2S, Name: value, dtype: float64
    
    print (df.value.resample('2S', base=2).max())
    sec
    00:16:43    0.2
    00:16:45    0.2
    00:16:47    NaN
    Freq: 2S, Name: value, dtype: float64