pythonpandasdataframebinning

Binning state time of a device by day using Pandas DataFrames


I have a Pandas dataframe of power enable/disable commands vs time. The index is not in-use at the moment. You can create it on your own with this:

pd.DataFrame(
    {'command_timestamp': {
        0: pd.Timestamp('2023-08-01 15:39:42'),
        1: pd.Timestamp('2023-08-02 03:30:39'),
        2: pd.Timestamp('2023-08-02 16:09:35'),
        4: pd.Timestamp('2023-08-02 17:30:16'),
        5: pd.Timestamp('2023-08-02 17:32:05'),
        6: pd.Timestamp('2023-08-02 17:45:43'),
        7: pd.Timestamp('2023-08-03 17:48:01'),
        8: pd.Timestamp('2023-08-03 18:20:11'),
        9: pd.Timestamp('2023-08-04 18:49:37'),
        10: pd.Timestamp('2023-08-07 21:13:05')},
     'command': {
        0: 'enable',
        1: 'disable',
        2: 'enable',
        4: 'enable',
        5: 'enable',
        6: 'disable',
        7: 'enable',
        8: 'disable',
        9: 'enable',
        10: 'disable'}})

My dataframe.

What I am trying to do

I need to calculate the "ON time" of the device by day. Imagine that the actual data set is much larger than this example set. I cannot seem to come up with good solution, much less one that doesn't involve iterating through the dataframe and checking a load of if statements.

Some assumptions and things to consider:

For a little context, this is for representing "daily utilization" of the device rather than counting "consecutive on-time" etc..

Sample data expected results

Here is a manual calculation of the example dataset:

results = {
    '2023-08-01': (
        pd.Timestamp('2023-08-02 00:00:00') - 
        pd.Timestamp('2023-08-01 15:39:42')),
    
    '2023-08-02': (
        pd.Timestamp('2023-08-02 03:30:39') - 
        pd.Timestamp('2023-08-02 00:00:00')
        ) + (
        pd.Timestamp('2023-08-02 17:45:43') - 
        pd.Timestamp('2023-08-02 16:09:35')
        ), 
    '2023-08-03': (
        pd.Timestamp('2023-08-03 18:20:11') - 
        pd.Timestamp('2023-08-03 17:48:01')
        ), 
    '2023-08-04': (
        pd.Timestamp('2023-08-05 00:00:00') - 
        pd.Timestamp('2023-08-04 18:49:37')
        ), 
    '2023-08-05': (
        pd.Timestamp('2023-08-06 00:00:00') - 
        pd.Timestamp('2023-08-05 00:00:00')
        ), 
    '2023-08-06': (
        pd.Timestamp('2023-08-07 00:00:00') - 
        pd.Timestamp('2023-08-06 00:00:00')
        ), 
    '2023-08-07': (
        pd.Timestamp('2023-08-07 21:13:05') - 
        pd.Timestamp('2023-08-07 00:00:00')
        )
}

Sample dataset results.


Solution

  • Okay, I now think I understand your question. First, create fictitious entries to mark your day boundaries.

    df = pd.concat([
        df.set_index('command_timestamp'),
        df.reindex(pd.date_range(
            start=df['command_timestamp'].min().date(),
            end=df['command_timestamp'].max().date(),
            freq='D'))
    ]).sort_index()
    

    (I assume start is disable.) Then assume that the last state carries forward to the fictitious day boundary. Then do the differencing. You must pass to_series to access the right function.

    df['command'] = df['command'].ffill().fillna('disable')
    df['diff'] = df.index.to_series().diff().shift(-1)
    

    A group by the date along with a filtered summation then yields your desired outcome:

    >>> df.groupby(df.index.to_series().dt.date) \
    ...     .apply(lambda d: d.loc[d['command'] == 'enable', 
    ...                            'diff'].sum())
    2023-08-01   0 days 08:20:18
    2023-08-02   0 days 05:06:47
    2023-08-03   0 days 00:32:10
    2023-08-04   0 days 05:10:23
    2023-08-05   1 days 00:00:00
    2023-08-06   1 days 00:00:00
    2023-08-07   0 days 21:13:05
    dtype: timedelta64[ns]
    

    Explanatory note. Prior to the group by summation, the relevant columns of the data frame look like this:

                         command            diff
    2023-08-01 00:00:00  disable 0 days 15:39:42
    2023-08-01 15:39:42   enable 0 days 08:20:18
    2023-08-02 00:00:00   enable 0 days 03:30:39
    2023-08-02 03:30:39  disable 0 days 12:38:56
    2023-08-02 16:09:35   enable 0 days 01:20:41
    2023-08-02 17:30:16   enable 0 days 00:01:49
    2023-08-02 17:32:05   enable 0 days 00:13:38
    2023-08-02 17:45:43  disable 0 days 06:14:17
    2023-08-03 00:00:00  disable 0 days 17:48:01
    2023-08-03 17:48:01   enable 0 days 00:32:10
    2023-08-03 18:20:11  disable 0 days 05:39:49
    2023-08-04 00:00:00  disable 0 days 18:49:37
    2023-08-04 18:49:37   enable 0 days 05:10:23
    2023-08-05 00:00:00   enable 1 days 00:00:00
    2023-08-06 00:00:00   enable 1 days 00:00:00
    2023-08-07 00:00:00   enable 0 days 21:13:05
    2023-08-07 21:13:05  disable             NaT
    

    It may be more didactic to read diff instead as time_elapsed.