pythondataframedurationpython-polars

How to sum durations in Polars dataframe?


I have the following dataframe:

import datetime

import polars as pl


df = pl.DataFrame(
    {
        "idx": [259, 123],
        "timestamp": [
            [
                datetime.datetime(2023, 4, 20, 1, 45),
                datetime.datetime(2023, 4, 20, 1, 51, 7),
                datetime.datetime(2023, 4, 20, 2, 29, 50),
            ],
            [
                datetime.datetime(2023, 4, 19, 6, 0, 1),
                datetime.datetime(2023, 4, 19, 6, 0, 17),
                datetime.datetime(2023, 4, 19, 6, 0, 26),
                datetime.datetime(2023, 4, 19, 19, 53, 29),
                datetime.datetime(2023, 4, 19, 19, 54, 4),
                datetime.datetime(2023, 4, 19, 19, 57, 52),
            ],
        ],
    }
)
print(df)
# Output
shape: (2, 2)
┌─────┬───────────────────────────────────────────────────────────────────┐
│ idx ┆ timestamp                                                         │
│ --- ┆ ---                                                               │
│ i64 ┆ list[datetime[μs]]                                                │
╞═════╪═══════════════════════════════════════════════════════════════════╡
│ 259 ┆ [2023-04-20 01:45:00, 2023-04-20 01:51:07, 2023-04-20 02:29:50]   │
│ 123 ┆ [2023-04-19 06:00:01, 2023-04-19 06:00:17, … 2023-04-19 19:57:52] │
└─────┴───────────────────────────────────────────────────────────────────┘

I want to know the total duration of each id, so I do:

df = df.with_columns(
    pl.col("timestamp")
    .map_elements(lambda x: [x[i + 1] - x[i] for i in range(len(x)) if i + 1 < len(x)])
    .alias("duration")
)

Which gives me:

shape: (2, 2)
┌─────┬─────────────────────┐
│ idx ┆ duration            │
│ --- ┆ ---                 │
│ i64 ┆ list[duration[μs]]  │
╞═════╪═════════════════════╡
│ 259 ┆ [6m 7s, 38m 43s]    │
│ 123 ┆ [16s, 9s, … 3m 48s] │
└─────┴─────────────────────┘

Now, in Pandas, I would have used total_seconds when calling apply and sum the list, like this:

df["duration"] = (
    df["timestamp"]
    .apply(
        lambda x: sum(
            [(x[i + 1] - x[i]).total_seconds() for i in range(len(x)) if i + 1 < len(x)]
        )
    )
    .astype(int)
)

Which would give me the expected result:

print(df[["idx", "duration"]])
# Output

   idx  duration
0  259      2690
1  123     50271

What would be the equivalent, idiomatic way, to do this in Polars?


Solution

  • There is an list.diff method for list types, which then can be summed, and the total seconds can be calculated with dt.total_seconds:

    df.select(
        "idx",
        duration=pl.col("timestamp")
            .list.diff(null_behavior="drop")
            .list.sum()
            .dt.total_seconds(),
    )
    
    
    ┌─────┬──────────┐
    │ idx ┆ duration │
    │ --- ┆ ---      │
    │ i64 ┆ i64      │
    ╞═════╪══════════╡
    │ 259 ┆ 2690     │
    │ 123 ┆ 50271    │
    └─────┴──────────┘
    

    An equivalent expression if you really don't need the intermediate durations at all, that should perform better, would be subtracting the first element of the list from the last:

    duration=(
      pl.col("timestamp").list.last() - pl.col("timestamp").list.first()
    ).dt.total_seconds()