pythonpandasdataframegroup-bysplit-apply-combine

Simple Split Apply Combine, custom function


I am using the split-apply-combine pattern in pandas to create a new column, which measures the difference between two time stamps.

The following is a simplified example of my problem.

Say, I have this df

df = pd.DataFrame({'ssn_start_utc':pd.date_range('1/1/2011', periods=6, freq='D'),  'fld_id':[100,100,100,101,101,101], 'task_name': ['sowing','fungicide','insecticide','combine',''combine','sowing']})
df

I want to group by fld_id and apply a function which creates a column measuring difference between two timestamps for each row. Such as this

def pasttime(group):
    val = group['ssn_start_utc'] - group['ssn_start_utc'][0]
    

    # why group['ssn_start_utc'][0] ? 
    # Because it measures time difference for each row respective to first row of each group/ particular to *sowing* entry respective to each group. I have moved all *sowing* entries to first row of df for each group 
    
    return val

df["PastTime"] =df.groupby('fld_id',group_keys=False).apply(pasttime)

the resultant column df should look like this

df_new = pd.DataFrame({'ssn_start_utc':pd.date_range('1/1/2011', periods=6, freq='D'),  'fld_id':[100,100,100,101,101,101], 'task_name': ['sowing','fungicide','insecticide','combine',''combine','sowing'], 'pasttime' :[ 0 days, 1 days, 2 days, 3 days, -1 days, 0 days] })
df_new

I get a error KeyError: 0

I have also tried using groupby:

df['pasttime'] = df.groupby(['fld_id'])['ssn_start_utc'].transform( df['ssn_start_utc'] - df.loc[df['name']=='sowing','ssn_start_utc'].values[0]) 

How to apply a custom group function and have the desired df?


Solution

  • In your function is possible match first value by position with Series.iat:

    def pasttime(group):
        val = group['ssn_start_utc'] - group['ssn_start_utc'].iat[0]
        return val
    
    df["PastTime"] =df.groupby('fld_id',group_keys=False).apply(pasttime)
        
    

    Fatser alternative is use GroupBy.first with GroupBy.transform:

    s = df.groupby('fld_id')['ssn_start_utc'].transform('first')
    df['pasttime'] = df['ssn_start_utc'].sub(s)
    

    If need subtrat sowing rows per groups use same solution like above, only first replace not matched datetimes to NaNs by Series.where:

    m = df['task_name']=='sowing'
    s = df['ssn_start_utc'].where(m).groupby(df['fld_id']).transform('first')
    df['pasttime1'] = df['ssn_start_utc'].sub(s)
    print (df)
      ssn_start_utc  fld_id    task_name PastTime pasttime pasttime1
    0    2011-01-01     100       sowing   0 days   0 days    0 days
    1    2011-01-02     100    fungicide   1 days   1 days    1 days
    2    2011-01-03     100  insecticide   2 days   2 days    2 days
    3    2011-01-04     101      combine   0 days   0 days   -2 days
    4    2011-01-05     101      combine   1 days   1 days   -1 days
    5    2011-01-06     101       sowing   2 days   2 days    0 days