[SOLVED] Polars equivalent to groupby.last from pandas

The over function will keep all rows, which is probably not what you want. An easy way to get just the last value for index is to use unique.

(
    df
    .select("index", "value")
    .unique(subset="index", keep="last")
)

shape: (3, 2)
┌───────┬───────┐
│ index ┆ value │
│ ---   ┆ ---   │
│ i64   ┆ i64   │
╞═══════╪═══════╡
│ 1     ┆ 23    │
│ 2     ┆ 5     │
│ 3     ┆ 89    │
└───────┴───────┘

From this point, you can use the to_dicts method to convert the DataFrame to a list of dictionaries.

last_values = (
    df
    .select("index", "value")
    .unique(subset="index", keep="last")
    .to_dicts()
)
last_values

[{'index': 1, 'value': 23}, {'index': 2, 'value': 5}, {'index': 3, 'value': 89}]

If you are looking to later import this into a DataFrame, you'll want to stop at this point. For example:

pl.DataFrame(last_values)

>>> pl.DataFrame(last_values)
shape: (3, 2)
┌───────┬───────┐
│ index ┆ value │
│ ---   ┆ ---   │
│ i64   ┆ i64   │
╞═══════╪═══════╡
│ 1     ┆ 23    │
│ 2     ┆ 5     │
│ 3     ┆ 89    │
└───────┴───────┘

However, if you want to collapse this into a single dictionary of index:value pairs, you can use a dictionary comprehension.


{
    next_dict["index"]: next_dict["value"]
    for next_dict in last_values
}

{1: 23, 2: 5, 3: 89}

Edit: Updating based on date

Let's assume that we have this data:

import polars as pl
import datetime

df = pl.DataFrame({
    "index": [1, 2, 3],
    "value": [10, 20, 30],
}).join(
    pl.DataFrame({
        'date': pl.date_range(datetime.date(2021, 1, 1), datetime.date(2023, 1, 1), "1y", eager=True)
    }),
    how="cross"
)
df

shape: (9, 3)
┌───────┬───────┬────────────┐
│ index ┆ value ┆ date       │
│ ---   ┆ ---   ┆ ---        │
│ i64   ┆ i64   ┆ date       │
╞═══════╪═══════╪════════════╡
│ 1     ┆ 10    ┆ 2021-01-01 │
│ 1     ┆ 10    ┆ 2022-01-01 │
│ 1     ┆ 10    ┆ 2023-01-01 │
│ 2     ┆ 20    ┆ 2021-01-01 │
│ 2     ┆ 20    ┆ 2022-01-01 │
│ 2     ┆ 20    ┆ 2023-01-01 │
│ 3     ┆ 30    ┆ 2021-01-01 │
│ 3     ┆ 30    ┆ 2022-01-01 │
│ 3     ┆ 30    ┆ 2023-01-01 │
└───────┴───────┴────────────┘

And we have these values that we want to update.

update_df = pl.DataFrame({
    "index": [2, 3],
    "value": [200, 300],
})
update_df

shape: (2, 2)
┌───────┬───────┐
│ index ┆ value │
│ ---   ┆ ---   │
│ i64   ┆ i64   │
╞═══════╪═══════╡
│ 2     ┆ 200   │
│ 3     ┆ 300   │
└───────┴───────┘

Note: I've purposely left out index "1" (to show what will happen).

If we want to update the value associated with each index, but only beyond a certain date, we can use a join_asof.

Since this is an advanced method, we'll take it in steps.

We'll add the current_date to the update_df as a literal. (The same value for all rows.)

We also need to make sure both our DataFrames are sorted by the "as_of" column (date, not index). (update_df will already be sorted because its the same date on each row.)

I'll also sort after the join_asof so that we can see what is happening more clearly. (You don't need to do this step.)

current_date = datetime.date(2022, 1, 1)
(
    df
    .sort("date")
    .rename({"value": "prev_value"})
    .join_asof(
        update_df.with_columns(pl.lit(current_date).alias("date")),
        on="date",
        by=["index"],
        strategy="backward"
    )
    .sort("index", "date")
)

shape: (9, 4)
┌───────┬────────────┬────────────┬───────┐
│ index ┆ prev_value ┆ date       ┆ value │
│ ---   ┆ ---        ┆ ---        ┆ ---   │
│ i64   ┆ i64        ┆ date       ┆ i64   │
╞═══════╪════════════╪════════════╪═══════╡
│ 1     ┆ 10         ┆ 2021-01-01 ┆ null  │
│ 1     ┆ 10         ┆ 2022-01-01 ┆ null  │
│ 1     ┆ 10         ┆ 2023-01-01 ┆ null  │
│ 2     ┆ 20         ┆ 2021-01-01 ┆ null  │
│ 2     ┆ 20         ┆ 2022-01-01 ┆ 200   │
│ 2     ┆ 20         ┆ 2023-01-01 ┆ 200   │
│ 3     ┆ 30         ┆ 2021-01-01 ┆ null  │
│ 3     ┆ 30         ┆ 2022-01-01 ┆ 300   │
│ 3     ┆ 30         ┆ 2023-01-01 ┆ 300   │
└───────┴────────────┴────────────┴───────┘

Notice that only those rows with a date >= 2022-01-01 have a non-null value for value. (I'll show how to do a > 2022-01-01 at the end.)

Next we'll use fill_null to fill the null values in value with the prev_value column.

current_date = datetime.date(2022, 1, 1)
(
    df
    .sort("date")
    .rename({"value": "prev_value"})
    .join_asof(
        update_df.with_columns(pl.lit(current_date).alias("date")),
        on="date",
        by=["index"],
        strategy="backward"
    )
    .sort("index", "date")
    .with_columns(pl.col("value").fill_null(pl.col("prev_value")))
)

shape: (9, 4)
┌───────┬────────────┬────────────┬───────┐
│ index ┆ prev_value ┆ date       ┆ value │
│ ---   ┆ ---        ┆ ---        ┆ ---   │
│ i64   ┆ i64        ┆ date       ┆ i64   │
╞═══════╪════════════╪════════════╪═══════╡
│ 1     ┆ 10         ┆ 2021-01-01 ┆ 10    │
│ 1     ┆ 10         ┆ 2022-01-01 ┆ 10    │
│ 1     ┆ 10         ┆ 2023-01-01 ┆ 10    │
│ 2     ┆ 20         ┆ 2021-01-01 ┆ 20    │
│ 2     ┆ 20         ┆ 2022-01-01 ┆ 200   │
│ 2     ┆ 20         ┆ 2023-01-01 ┆ 200   │
│ 3     ┆ 30         ┆ 2021-01-01 ┆ 30    │
│ 3     ┆ 30         ┆ 2022-01-01 ┆ 300   │
│ 3     ┆ 30         ┆ 2023-01-01 ┆ 300   │
└───────┴────────────┴────────────┴───────┘

Now, to clean up, we can drop the prev_value column, and re-arrange the columns.

current_date = datetime.date(2022, 1, 1)
(
    df
    .sort("date")
    .rename({"value": "prev_value"})
    .join_asof(
        update_df.with_columns(pl.lit(current_date).alias("date")),
        on="date",
        by=["index"],
        strategy="backward"
    )
    .sort("index", "date")
    .with_columns(pl.col("value").fill_null(pl.col("prev_value")))
    .drop("prev_value")
    .select(
        pl.exclude("date"),
        pl.col("date")
    )
)

shape: (9, 3)
┌───────┬───────┬────────────┐
│ index ┆ value ┆ date       │
│ ---   ┆ ---   ┆ ---        │
│ i64   ┆ i64   ┆ date       │
╞═══════╪═══════╪════════════╡
│ 1     ┆ 10    ┆ 2021-01-01 │
│ 1     ┆ 10    ┆ 2022-01-01 │
│ 1     ┆ 10    ┆ 2023-01-01 │
│ 2     ┆ 20    ┆ 2021-01-01 │
│ 2     ┆ 200   ┆ 2022-01-01 │
│ 2     ┆ 200   ┆ 2023-01-01 │
│ 3     ┆ 30    ┆ 2021-01-01 │
│ 3     ┆ 300   ┆ 2022-01-01 │
│ 3     ┆ 300   ┆ 2023-01-01 │
└───────┴───────┴────────────┘

If you need to update only those rows that are strictly greater than current_date, you can simply add one day to your current_date. Polars makes this easy with the offset_by expression.

(
    df
    .sort("date")
    .rename({"value": "prev_value"})
    .join_asof(
        update_df.with_columns(pl.lit(current_date).dt.offset_by("1d").alias("date")),
        on="date",
        by=["index"],
        strategy="backward"
    )
    .sort("index", "date")
    .with_columns(pl.col("value").fill_null(pl.col("prev_value")))
    .drop("prev_value")
    .select(
        pl.exclude("date"),
        pl.col("date")
    )
)

shape: (9, 3)
┌───────┬───────┬────────────┐
│ index ┆ value ┆ date       │
│ ---   ┆ ---   ┆ ---        │
│ i64   ┆ i64   ┆ date       │
╞═══════╪═══════╪════════════╡
│ 1     ┆ 10    ┆ 2021-01-01 │
│ 1     ┆ 10    ┆ 2022-01-01 │
│ 1     ┆ 10    ┆ 2023-01-01 │
│ 2     ┆ 20    ┆ 2021-01-01 │
│ 2     ┆ 20    ┆ 2022-01-01 │
│ 2     ┆ 200   ┆ 2023-01-01 │
│ 3     ┆ 30    ┆ 2021-01-01 │
│ 3     ┆ 30    ┆ 2022-01-01 │
│ 3     ┆ 300   ┆ 2023-01-01 │
└───────┴───────┴────────────┘