Tags: python, python-polars, sample, resampling, polars

How to randomly sample n IDs for each combination of group_id and date in a Polars DataFrame


I am trying to randomly sample n IDs for each combination of group_id and date in a Polars DataFrame. However, I noticed that the sample function is producing the same set of IDs for each date no matter the group.

Since I need to set a seed for replication purposes, I believe the issue is that the same seed value is being applied across all combinations. I tried to resolve this by creating a unique seed for each combination: I generated a "group_date_int" column by combining group_id and date cast as Int64, but I encountered the following error:

.sample(n=n_samples, shuffle=True, seed=pl.col("group_date_int"))
TypeError: argument 'seed': 'Expr' object cannot be interpreted as an integer
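
For context, the failing attempt looked roughly like this (an illustrative reconstruction, using the df and n_samples defined below; the exact way group_date_int is built doesn't matter, since seed only accepts a plain Python int):

# Illustrative reconstruction of the failed attempt: derive one integer per
# (group_id, date) combination and try to use it as the seed. This raises the
# TypeError above, because `seed` must be an int, not an expression.
attempt = (
    df
    .with_columns(
        pl.struct("group_id", "date").hash().alias("group_date_int")
    )
    .group_by(["group_id", "date"])
    .agg(
        pl.col("ids").sample(n=n_samples, shuffle=True, seed=pl.col("group_date_int"))
    )
)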

For each date, I am getting the same set of IDs, rather than having a different random sample for each combination of group_id and date.

import polars as pl

df = pl.DataFrame(
    {
        "date": pl.date_range(
            pl.date(2010, 1, 1), pl.date(2025, 12, 1), "1mo", eager=True
        ).implode(),
        "group_id": [["bd01", "bd02", "bd03"]],
        "ids": [list(range(10))],
    }
).explode("date").explode("group_id").explode("ids")

# Parameters
n_samples = 3  # Number of random samples to pick for each group
SEED = 42  # The seed used for sampling

# Create `selected_samples` by sampling `n_samples` IDs per (group_id, date) combination
selected_samples = (
    df
    .group_by(['group_id', 'date'])
    .agg(
        pl.col("id")
        .sample(n=n_samples, shuffle=True, seed=SEED)  
        .alias("random_ids")
    )
    .explode("random_ids")
    .select(["group_id", "date", "random_ids"])
    .rename({"random_ids": "id"})
)

Additionally, I tried using the shuffle function, but the results are the same: 1,6,5...1,6,5
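
Roughly, that shuffle attempt looked like this (only the agg differs from the code above):

# Shuffle variant: shuffle each group's ids with the fixed seed and keep the
# first n_samples. It still produces one repeated set of ids for every
# (group_id, date) combination.
selected_samples_shuffle = (
    df
    .group_by(["group_id", "date"])
    .agg(
        pl.col("ids")
        .shuffle(seed=SEED)
        .head(n_samples)
        .alias("random_ids")
    )
    .explode("random_ids")
    .rename({"random_ids": "id"})
)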

┌──────────┬────────────┬─────┐
│ group_id ┆ date       ┆ id  │
│ ---      ┆ ---        ┆ --- │
│ str      ┆ str        ┆ i64 │
╞══════════╪════════════╪═════╡
│ bd01     ┆ 2025-07-01 ┆ 1   │
│ bd01     ┆ 2025-07-01 ┆ 6   │
│ bd01     ┆ 2025-07-01 ┆ 5   │
│ bd01     ┆ 2012-03-01 ┆ 1   │
│ bd01     ┆ 2012-03-01 ┆ 6   │
│ …        ┆ …          ┆ …   │
│ bd03     ┆ 2024-10-01 ┆ 6   │
│ bd03     ┆ 2024-10-01 ┆ 5   │
│ bd01     ┆ 2010-08-01 ┆ 1   │
│ bd01     ┆ 2010-08-01 ┆ 6   │
│ bd01     ┆ 2010-08-01 ┆ 5   │
└──────────┴────────────┴─────┘

I was referred to the following question in the comments: Sample from each group in polars dataframe?, where a similar issue was raised. However, the solution does not include a seed, which is needed for replication.


Solution

  • If you need each group to be random but you also need to set a seed to get predictable results, then use numpy to generate random numbers and choose your sample based on those, as in the approaches below. (Technically you could use base Python to generate the random numbers, but it's slower.)
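
    For reference, the base-Python equivalent of generating the random sort keys would be something like this (shown only to illustrate the aside above; the approaches below all use numpy):

    import random

    # base-Python alternative for the random sort keys (slower than numpy);
    # rng is a seeded generator, so the keys and the resulting sample repeat
    rng = random.Random(SEED)
    sort_keys = pl.Series([rng.random() for _ in range(df.height)])

    You would pass sort_keys to sort_by exactly as the numpy series is passed in the first approach below.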

    First approach

    import numpy as np

    n_samples = 3
    SEED = 46
    np.random.seed(SEED)
    (
        df
        .with_columns(
            # shuffle the ids column by sorting it against a column of random keys
            pl.col("ids")
            .sort_by(pl.Series(np.random.normal(0, 1, df.shape[0]))))
        .group_by("group_id", "date", maintain_order=True)
        # keep the first n_samples of the shuffled ids per (group_id, date)
        .agg(pl.col("ids").gather(range(n_samples)))
        .explode("ids")
    )
    

    Note that I also set maintain_order=True in the group_by, as the group order would otherwise be random.

    Second approach

    Having to do a sort over the whole series might be needlessly expensive. If we instead use numpy to create a 2d array that is sorted row-wise and then use that to pick our indices, it should, in theory, be more efficient.

    However, this only works if you have a fixed number of members per group and you know how many in advance.
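
    One way to confirm that on your own data, and to get the number to pass in, is a quick check like this (not part of the approach itself):

    # every (group_id, date) combination should contain the same number of ids
    counts = df.group_by("group_id", "date").agg(pl.len())["len"].unique()
    assert counts.len() == 1
    members_per_group = counts[0]  # 10 for the example df above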

    First, make this function

    def keep_args(members_per_group: int, n_samples: int, rows: int):
        # one row of random normals per group; argsort each row and keep the
        # first n_samples positions as the indices to gather from that group
        return pl.Series(
            np.argsort(
                np.random.normal(0, 1, (rows, members_per_group)), 
                axis=1)[:, :n_samples],
            dtype=pl.List(pl.Int32),
        )
    

    It generates a 2d array where each row is a random list of indices to choose. We use it with our df like this

    np.random.seed(SEED)
    (
        df
        .group_by("group_id", "date", maintain_order=True)
        .agg(pl.col("ids"))
        .with_columns(
            # gather n_samples random positions from each group's list of ids;
            # s.len() is the number of groups after the agg
            pl.col("ids").map_batches(lambda s: (
                s.list.gather(keep_args(10, n_samples, s.len()))
            ))
        )
        .explode("ids")
    )
    

    In this version, we do the group_by first, which means we need map_batches to get the new length of ids (the number of groups). If you prefer, you could do a pipe and use the new frame's height instead, but I don't think it would make a big difference either way.
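
    For reference, that pipe variant might look roughly like this (same logic as above, just reading the group count from the frame's height instead of s.len()):

    np.random.seed(SEED)
    (
        df
        .group_by("group_id", "date", maintain_order=True)
        .agg(pl.col("ids"))
        # pipe hands us the grouped frame, so its height is the number of groups
        .pipe(lambda grouped: grouped.with_columns(
            grouped["ids"].list.gather(keep_args(10, n_samples, grouped.height))
        ))
        .explode("ids")
    )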

    Performance diff

    In testing those two, the first was 10.4 ms and the second was 9.97 ms, so they're basically the same.

    Third approach

    Here's a polars-only approach that is about 60x slower than the above. Basically, it just chops up your df into the individual groups and then samples them.

    pl.concat([
        g.sample(n_samples, seed=SEED) 
        for (_, g) in df.group_by("group_id","date",maintain_order=True)
        ])
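
    If you also want each (group_id, date) combination to draw different row positions while staying reproducible, one possible variant (an assumption about what's wanted, not benchmarked here) is to offset the seed by the group index:

    pl.concat([
        # hypothetical variant: a different, but still deterministic, seed per
        # group, derived from the group's position in the iteration order
        g.sample(n_samples, seed=SEED + i)
        for i, (_, g) in enumerate(df.group_by("group_id", "date", maintain_order=True))
        ])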
    

    Fourth approach

    You can convert each of the groups to lazy to get parallelism, which reduces the time by 33%, making it just 40x slower than the numpy approaches.

    (
        pl.concat([
        g.lazy()
        .select(
            pl.col("group_id","date").first(), 
            pl.col("ids")
            .sample(n_samples, seed=SEED)
            .implode()
            )
        for (_, g) in df.group_by("group_id","date",maintain_order=True)
        ])
    .explode("ids")
    .collect()
    )
    

    Note about seed

    Maybe this goes without saying but, just in case: the results will differ between approaches even with the same seed; they are only consistent within a particular approach. Also, just to reiterate, you must use maintain_order=True in the first two approaches to get consistent results.
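
    As a quick sanity check on that, you can wrap an approach in a function that resets the seed and confirm two runs produce identical frames, e.g. for the first approach (the wrapper is just for illustration):

    def run_first_approach() -> pl.DataFrame:
        # reset the numpy seed so every call starts from the same random state
        np.random.seed(SEED)
        return (
            df
            .with_columns(
                pl.col("ids")
                .sort_by(pl.Series(np.random.normal(0, 1, df.shape[0]))))
            .group_by("group_id", "date", maintain_order=True)
            .agg(pl.col("ids").gather(range(n_samples)))
            .explode("ids")
        )

    # identical frames across runs confirm the sampling is reproducible
    assert run_first_approach().equals(run_first_approach())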