I have a polars.DataFrame
like:
df = pl.DataFrame({
"timestamp": ['2009-04-18 11:30:00', '2009-04-18 11:40:00', '2009-04-18 11:50:00', '2009-04-18 12:00:00', '2009-04-18 12:10:00', '2009-04-18 12:20:00', '2009-04-18 12:30:00'],
"group": ["group_1", "group_1", "group_1", "group_2", "group_2", "group_1", "group_1"]})
df = df.with_columns(
pl.col("timestamp").str.to_datetime().dt.replace_time_zone("UTC"),
)
┌─────────────────────────┬─────────┐
│ timestamp ┆ group │
│ --- ┆ --- │
│ datetime[μs, UTC] ┆ str │
╞═════════════════════════╪═════════╡
│ 2009-04-18 11:30:00 UTC ┆ group_1 │
│ 2009-04-18 11:40:00 UTC ┆ group_1 │
│ 2009-04-18 11:50:00 UTC ┆ group_1 │
│ 2009-04-18 12:00:00 UTC ┆ group_2 │
│ 2009-04-18 12:10:00 UTC ┆ group_2 │
│ 2009-04-18 12:20:00 UTC ┆ group_1 │ <- reappearance of group_1
│ 2009-04-18 12:30:00 UTC ┆ group_1 │ <- reappearance of group_1
└─────────────────────────┴─────────┘
I want to calculate the time difference between the timestamp of the first element in each group to the timestamp of the elements in a group. Important is, that 'group' is defined as a (chronologically) consecutive appearance of the same group label. Like in the example shown group labels can occur later in time with the same group label but should by then be treated as a new group.
With that, the result should look something like this:
┌─────────────────────────┬─────────┬─────────┐
│ timestamp ┆ group │ timediff│
│ --- ┆ --- │ --- │
│ datetime[μs, UTC] ┆ str │ int(?) │
╞═════════════════════════╪═════════╪═════════╡
│ 2009-04-18 11:30:00 UTC ┆ group_1 │ 0 │
│ 2009-04-18 11:40:00 UTC ┆ group_1 │ 10 │
│ 2009-04-18 11:50:00 UTC ┆ group_1 │ 20 │
│ 2009-04-18 12:00:00 UTC ┆ group_2 │ 0 │
│ 2009-04-18 12:10:00 UTC ┆ group_2 │ 10 │
│ 2009-04-18 12:20:00 UTC ┆ group_1 │ 0 │ <- reappearance of group_1
│ 2009-04-18 12:30:00 UTC ┆ group_1 │ 10 │ <- reappearance of group_1
└─────────────────────────┴─────────┴─────────┘
.rle_id()
("Run-length encoding") can be used to identify the groups.
This is especially useful when you want to define groups by runs of identical values
df.with_columns(group_id = pl.col("group").rle_id())
shape: (7, 3)
┌─────────────────────────┬─────────┬──────────┐
│ timestamp ┆ group ┆ group_id │
│ --- ┆ --- ┆ --- │
│ datetime[μs, UTC] ┆ str ┆ u32 │
╞═════════════════════════╪═════════╪══════════╡
│ 2009-04-18 11:30:00 UTC ┆ group_1 ┆ 0 │
│ 2009-04-18 11:40:00 UTC ┆ group_1 ┆ 0 │
│ 2009-04-18 11:50:00 UTC ┆ group_1 ┆ 0 │
│ 2009-04-18 12:00:00 UTC ┆ group_2 ┆ 1 │
│ 2009-04-18 12:10:00 UTC ┆ group_2 ┆ 1 │
│ 2009-04-18 12:20:00 UTC ┆ group_1 ┆ 2 │
│ 2009-04-18 12:30:00 UTC ┆ group_1 ┆ 2 │
└─────────────────────────┴─────────┴──────────┘
You can then run the calculation .over()
each group.
df.with_columns(
(pl.col("timestamp") - pl.col("timestamp").first())
.over(pl.col("group").rle_id())
.alias("time_diff")
#.dt.total_minutes()
)
shape: (7, 3)
┌─────────────────────────┬─────────┬──────────────┐
│ timestamp ┆ group ┆ time_diff │
│ --- ┆ --- ┆ --- │
│ datetime[μs, UTC] ┆ str ┆ duration[μs] │
╞═════════════════════════╪═════════╪══════════════╡
│ 2009-04-18 11:30:00 UTC ┆ group_1 ┆ 0µs │
│ 2009-04-18 11:40:00 UTC ┆ group_1 ┆ 10m │
│ 2009-04-18 11:50:00 UTC ┆ group_1 ┆ 20m │
│ 2009-04-18 12:00:00 UTC ┆ group_2 ┆ 0µs │
│ 2009-04-18 12:10:00 UTC ┆ group_2 ┆ 10m │
│ 2009-04-18 12:20:00 UTC ┆ group_1 ┆ 0µs │
│ 2009-04-18 12:30:00 UTC ┆ group_1 ┆ 10m │
└─────────────────────────┴─────────┴──────────────┘