rpandaspo

Selecting the data frame row with the earliest time value for a set period


I have a df in r with numerous records with the below format, with 'arrival_time' values for a 12 hour period'.

id arrival_time wait_time_value
1 2020-02-20 12:02:00 10
2 2020-02-20 12:04:00 5
99900 2020-02-20 23:47:00 8
10000 2020-02-20 23:59:00 21

I would like to create a new df that has a row for each 15 minute slot of the arrival time period and the wait_time_value of the record with the earliest arrival time in that slot. So, in the above example, the first and last row of the new df would look like:

id period_start wait_time_value
1 2020-02-20 12:00:00 10
48 2020-02-20 23:45:00 8

I have used the below code to achieve this for the mean average wait time for all records in each 15 minute range, but i'm not sure how to select the value for the earliest record?


df$period_start <- align.time(df$arrival_time- 899, n = 60*15)

avgwait_df <- aggregate(wait_time_value ~ period_start, df, mean)

Solution

  • Use DataFrame.resample with GroupBy.first, remove only NaNs and convert to DataFrame:

    df['arrival_time'] = pd.to_datetime(df['arrival_time'])
    
    df = (df.resample('15Min', on='arrival_time')['wait_time_value']
            .first()
            .dropna()
            .reset_index(name='wait_time_value'))
    print (df)
             arrival_time  wait_time_value
    0 2020-02-20 12:00:00             10.0
    1 2020-02-20 23:45:00              8.0