
Is it expected behaviour for `pl.int_ranges(scalar1, scalar2).list.sample(n)` to fill a column with the same sample, and why?


Given a DataFrame with a column of multiple rows, I want to generate a column that holds a different random sample for each row, all drawn from the same range. I tried this:

>>> import polars as pl
>>> df = pl.select(pl.int_range(1,100).sample(5).alias('a'))
>>> df.with_columns(pl.int_ranges(1, 5).list.sample(2))

But it generates a column with the same sample in every row:

>>> df = pl.select(pl.int_range(1,100).sample(5).alias('a'))
>>> df.with_columns(pl.int_ranges(1, 5).list.sample(2))
shape: (5, 2)
┌─────┬───────────┐
│ a   ┆ literal   │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 46  ┆ [3, 2]    │
│ 5   ┆ [3, 2]    │
│ 41  ┆ [3, 2]    │
│ 95  ┆ [3, 2]    │
│ 84  ┆ [3, 2]    │
└─────┴───────────┘

whereas I expected the sample to differ from row to row:

>>> df.with_columns(pl.int_ranges(1, 5).list.sample(2))
shape: (5, 2)
┌─────┬───────────┐
│ a   ┆ literal   │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 46  ┆ [3, 4]    │
│ 5   ┆ [2, 1]    │
│ 41  ┆ [4, 3]    │
│ 95  ┆ [1, 4]    │
│ 84  ┆ [1, 3]    │
└─────┴───────────┘

I thought my mistake was not turning the range bounds into columns, so I tried wrapping the scalars in pl.lit, but the outcome is the same:

>>> df.with_columns(pl.int_ranges(pl.lit(1), pl.lit(5)).list.sample(2))
shape: (5, 2)
┌─────┬───────────┐
│ a   ┆ literal   │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 46  ┆ [1, 4]    │
│ 5   ┆ [1, 4]    │
│ 41  ┆ [1, 4]    │
│ 95  ┆ [1, 4]    │
│ 84  ┆ [1, 4]    │
└─────┴───────────┘

The int_ranges API behaves as expected if I separate the chained call into two steps:

>>> df.with_columns(pl.int_ranges(1, 5).alias('range')).with_columns(pl.col('range').list.sample(2))
shape: (5, 2)
┌─────┬───────────┐
│ a   ┆ range     │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 46  ┆ [4, 3]    │
│ 5   ┆ [2, 1]    │
│ 41  ┆ [2, 1]    │
│ 95  ┆ [2, 3]    │
│ 84  ┆ [2, 3]    │
└─────┴───────────┘

My question is: why does pl.int_ranges behave differently when the sampling is chained onto it than when it is done in a separate call? What makes the difference?


Solution

  • It can help to use .select() when debugging queries because with_columns will always produce a result with the same number of rows (height) as the existing frame.

    When either start or end is an existing column, you get the expected result.

    import polars as pl
    
    pl.Config(fmt_table_cell_list_len=10, fmt_str_lengths=100)
    
    df = pl.DataFrame({"x": [1, 2, 3, 4, 5], "y": [5, 6, 7, 8, 9]})
    
    df.select(pl.int_ranges("x", 6))
    # shape: (5, 1)
    # ┌─────────────────┐
    # │ x               │
    # │ ---             │
    # │ list[i64]       │
    # ╞═════════════════╡
    # │ [1, 2, 3, 4, 5] │
    # │ [2, 3, 4, 5]    │
    # │ [3, 4, 5]       │
    # │ [4, 5]          │
    # │ [5]             │
    # └─────────────────┘
    

    The difference is that when both start and end are literals, the result only has a height of 1.

    df.select(pl.int_ranges(1, 5))
    # shape: (1, 1)
    # ┌──────────────┐
    # │ literal      │
    # │ ---          │
    # │ list[i64]    │
    # ╞══════════════╡
    # │ [1, 2, 3, 4] │
    # └──────────────┘
    
    df.select(pl.int_ranges(1, 5).list.sample(2))
    # shape: (1, 1)
    # ┌───────────┐
    # │ literal   │
    # │ ---       │
    # │ list[i64] │
    # ╞═══════════╡
    # │ [4, 2]    │
    # └───────────┘
    

    As mentioned, .with_columns() cannot change the height (number of rows) of the frame, so the single sampled list is broadcast to all rows, as if you did:

    df.with_columns(literal = [4, 2])
    # shape: (5, 3)
    # ┌─────┬─────┬───────────┐
    # │ x   ┆ y   ┆ literal   │
    # │ --- ┆ --- ┆ ---       │
    # │ i64 ┆ i64 ┆ list[i64] │
    # ╞═════╪═════╪═══════════╡
    # │ 1   ┆ 5   ┆ [4, 2]    │
    # │ 2   ┆ 6   ┆ [4, 2]    │
    # │ 3   ┆ 7   ┆ [4, 2]    │
    # │ 4   ┆ 8   ┆ [4, 2]    │
    # │ 5   ┆ 9   ┆ [4, 2]    │
    # └─────┴─────┴───────────┘
    

    It is possible to "manually broadcast" the literals by using pl.repeat() to repeat them pl.len() times.

    df.select(
        pl.int_ranges(pl.repeat(1, pl.len()), pl.repeat(5, pl.len()))
          .list.sample(2)
    )
    # shape: (5, 1)
    # ┌───────────┐
    # │ repeat    │
    # │ ---       │
    # │ list[i64] │
    # ╞═══════════╡
    # │ [1, 2]    │
    # │ [4, 3]    │
    # │ [4, 2]    │
    # │ [2, 1]    │
    # │ [2, 4]    │
    # └───────────┘
    

    This is essentially what Polars does for you in the pl.int_ranges("x", 6) example: the 6 is broadcast to every row.

    Should it automatically broadcast when both inputs are literals?

    It seems like a valid topic for the Polars GitHub Issues in my opinion.