I have the following dataframe:
import datetime
import polars as pl
df = pl.DataFrame(
{
"idx": [259, 123],
"timestamp": [
[
datetime.datetime(2023, 4, 20, 1, 45),
datetime.datetime(2023, 4, 20, 1, 51, 7),
datetime.datetime(2023, 4, 20, 2, 29, 50),
],
[
datetime.datetime(2023, 4, 19, 6, 0, 1),
datetime.datetime(2023, 4, 19, 6, 0, 17),
datetime.datetime(2023, 4, 19, 6, 0, 26),
datetime.datetime(2023, 4, 19, 19, 53, 29),
datetime.datetime(2023, 4, 19, 19, 54, 4),
datetime.datetime(2023, 4, 19, 19, 57, 52),
],
],
}
)
print(df)
# Output
shape: (2, 2)
┌─────┬───────────────────────────────────────────────────────────────────┐
│ idx ┆ timestamp │
│ --- ┆ --- │
│ i64 ┆ list[datetime[μs]] │
╞═════╪═══════════════════════════════════════════════════════════════════╡
│ 259 ┆ [2023-04-20 01:45:00, 2023-04-20 01:51:07, 2023-04-20 02:29:50] │
│ 123 ┆ [2023-04-19 06:00:01, 2023-04-19 06:00:17, … 2023-04-19 19:57:52] │
└─────┴───────────────────────────────────────────────────────────────────┘
I want to know the total duration of each id, so I do:
df = df.with_columns(
pl.col("timestamp")
.map_elements(lambda x: [x[i + 1] - x[i] for i in range(len(x)) if i + 1 < len(x)])
.alias("duration")
)
Which gives me:
shape: (2, 2)
┌─────┬─────────────────────┐
│ idx ┆ duration │
│ --- ┆ --- │
│ i64 ┆ list[duration[μs]] │
╞═════╪═════════════════════╡
│ 259 ┆ [6m 7s, 38m 43s] │
│ 123 ┆ [16s, 9s, … 3m 48s] │
└─────┴─────────────────────┘
Now, in Pandas, I would have used total_seconds when calling apply
and sum
the list, like this:
df["duration"] = (
df["timestamp"]
.apply(
lambda x: sum(
[(x[i + 1] - x[i]).total_seconds() for i in range(len(x)) if i + 1 < len(x)]
)
)
.astype(int)
)
Which would give me the expected result:
print(df[["idx", "duration"]])
# Output
idx duration
0 259 2690
1 123 50271
What would be the equivalent, idiomatic way, to do this in Polars?
There is an list.diff
method for list types, which then can be summed, and the total seconds can be calculated with dt.total_seconds
:
df.select(
"idx",
duration=pl.col("timestamp")
.list.diff(null_behavior="drop")
.list.sum()
.dt.total_seconds(),
)
┌─────┬──────────┐
│ idx ┆ duration │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════════╡
│ 259 ┆ 2690 │
│ 123 ┆ 50271 │
└─────┴──────────┘
An equivalent expression if you really don't need the intermediate durations at all, that should perform better, would be subtracting the first element of the list from the last:
duration=(
pl.col("timestamp").list.last() - pl.col("timestamp").list.first()
).dt.total_seconds()