How to compute a column in a polars DataFrame using np.linspace


Consider the following pl.DataFrame:

import numpy as np
import polars as pl

df = pl.DataFrame(
    data={
        "np_linspace_start": [0, 0, 0],
        "np_linspace_stop": [8, 6, 7],
        "np_linspace_num": [5, 4, 4],
    }
)

shape: (3, 3)
┌───────────────────┬──────────────────┬─────────────────┐
│ np_linspace_start ┆ np_linspace_stop ┆ np_linspace_num │
│ ---               ┆ ---              ┆ ---             │
│ i64               ┆ i64              ┆ i64             │
╞═══════════════════╪══════════════════╪═════════════════╡
│ 0                 ┆ 8                ┆ 5               │
│ 0                 ┆ 6                ┆ 4               │
│ 0                 ┆ 7                ┆ 4               │
└───────────────────┴──────────────────┴─────────────────┘

How can I create a new column ls that holds the result of np.linspace for each row? Each entry of this column would be an np.array.

I was looking for something along these lines:

df.with_columns(
    ls=np.linspace(
        start=pl.col("np_linspace_start"),
        stop=pl.col("np_linspace_stop"),
        num=pl.col("np_linspace_num")
    )
)

Is there a polars equivalent to np.linspace?


Solution

  • As mentioned in the comments, adding an np.linspace-style function to polars is an open feature request. Until it is implemented, a simple version built on polars' native expression API could look as follows.


    Update. Recent versions of polars broadcast arithmetic between a scalar column and a list column row by row. This makes it possible to scale and shift an integer list column created with pl.int_ranges, which improves on the initial implementation outlined further below.

    def pl_linspace(start: str | pl.Expr, stop: str | pl.Expr, num: str | pl.Expr) -> pl.Expr:
        # Accept column names as well as expressions.
        start = pl.col(start) if isinstance(start, str) else start
        stop = pl.col(stop) if isinstance(stop, str) else stop
        num = pl.col(num) if isinstance(num, str) else num
    
        # One integer list [0, 1, ..., num - 1] per row.
        grid = pl.int_ranges(num)
        # Step between neighbouring points; start is the offset of the first point.
        _scale = (stop - start) / (num - 1)
        _offset = start
        return grid * _scale + _offset
    
    df.with_columns(
        pl_linspace(
            "np_linspace_start",
            "np_linspace_stop",
            "np_linspace_num",
        ).alias("pl_linspace")
    )
    
    shape: (3, 4)
    ┌───────────────────┬──────────────────┬─────────────────┬────────────────────────────────┐
    │ np_linspace_start ┆ np_linspace_stop ┆ np_linspace_num ┆ pl_linspace                    │
    │ ---               ┆ ---              ┆ ---             ┆ ---                            │
    │ i64               ┆ i64              ┆ i64             ┆ list[f64]                      │
    ╞═══════════════════╪══════════════════╪═════════════════╪════════════════════════════════╡
    │ 0                 ┆ 8                ┆ 5               ┆ [0.0, 2.0, 4.0, 6.0, 8.0]      │
    │ 0                 ┆ 6                ┆ 4               ┆ [0.0, 2.0, 4.0, 6.0]           │
    │ 0                 ┆ 7                ┆ 4               ┆ [0.0, 2.333333, 4.666667, 7.0] │
    └───────────────────┴──────────────────┴─────────────────┴────────────────────────────────┘
    

    Note. If num is 1, computing _scale divides by zero, which yields an infinite value (or NaN when start equals stop, a case the check below does not catch). The infinite case can be avoided by adding the following to pl_linspace.

    _scale = pl.when(_scale.is_infinite()).then(pl.lit(0)).otherwise(_scale)
    

    Outdated (but relevant for older versions of polars).

    First, we use pl.int_range (thanks to @Dean MacGregor) to create a range of integers from 0 to num (exclusive). Next, we rescale and shift the range according to start, stop, and num. Finally, we implode the result with pl.Expr.implode, evaluated over a window keyed by the row index (pl.int_range(pl.len())), so that each row gets its own range as a list.

    def pl_linspace(start: pl.Expr, stop: pl.Expr, num: pl.Expr) -> pl.Expr:
        grid = pl.int_range(num)
        _scale = (stop - start) / (num - 1)
        _offset = start
        return (grid * _scale + _offset).implode().over(pl.int_range(pl.len()))
    
    df.with_columns(
        pl_linspace(
            start=pl.col("np_linspace_start"),
            stop=pl.col("np_linspace_stop"),
            num=pl.col("np_linspace_num"),
        ).alias("pl_linspace")
    )
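
As a sanity check on the rescale-and-shift arithmetic used in both versions, the same formula can be written in plain Python. This is only an illustrative sketch: linspace_reference is a hypothetical helper, not part of polars or NumPy, and returning [start] for num equal to 1 is a choice made here to match np.linspace.

```python
def linspace_reference(start: float, stop: float, num: int) -> list[float]:
    """Plain-Python reference for the rescale-and-shift formula (hypothetical helper)."""
    if num <= 0:
        return []
    if num == 1:
        # A single point: no step can be computed, so return the start value only.
        return [float(start)]
    scale = (stop - start) / (num - 1)  # step between neighbouring points
    return [i * scale + start for i in range(num)]


print(linspace_reference(0, 8, 5))  # [0.0, 2.0, 4.0, 6.0, 8.0]
print(linspace_reference(0, 6, 4))  # [0.0, 2.0, 4.0, 6.0]
```

Comparing a few rows of the pl_linspace column against such a reference is a quick way to confirm that the expression behaves as intended on a given polars version.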