
How to transform a series of a Polars dataframe?


I am dealing with a large dataframe (198,619 rows x 19,110 columns), so I am using the Polars package to read in the TSV file. Pandas just takes too long.

However, I now face an issue: I want to transform each cell's value x by raising 2 to that power, i.e. 2^x.

I run the following line as an example:

df_copy = df
df_copy[:,1] = 2**df[:,1]

But I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/tmp/pbs.98503.hn-10-03/ipykernel_196334/3484346087.py in <module>
      1 df_copy = df
----> 2 df_copy[:,1] = 2**df[:,1]

~/.local/lib/python3.9/site-packages/polars/internals/frame.py in __setitem__(self, key, value)
   1845 
   1846             # dispatch to __setitem__ of Series to do modification
-> 1847             s[row_selection] = value
   1848 
   1849             # now find the location to place series

~/.local/lib/python3.9/site-packages/polars/internals/series.py in __setitem__(self, key, value)
    512             self.__setitem__([key], value)
    513         else:
--> 514             raise ValueError(f'cannot use "{key}" for indexing')
    515 
    516     def estimated_size(self) -> int:

ValueError: cannot use "slice(None, None, None)" for indexing

This should be simple but I can't figure it out as I'm new to Polars.


Solution

  • The secret to harnessing the speed and flexibility of Polars is to learn to use Expressions. As such, you'll want to avoid Pandas-style indexing methods.

    Let's start with this data:

    import polars as pl
    
    nbr_rows = 4
    nbr_cols = 5
    df = pl.DataFrame({
        "col_" + str(col_nbr): pl.int_range(col_nbr, nbr_rows + col_nbr, eager=True)
        for col_nbr in range(0, nbr_cols)
    })
    df
    
    shape: (4, 5)
    ┌───────┬───────┬───────┬───────┬───────┐
    │ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ col_4 │
    │ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
    │ i64   ┆ i64   ┆ i64   ┆ i64   ┆ i64   │
    ╞═══════╪═══════╪═══════╪═══════╪═══════╡
    │ 0     ┆ 1     ┆ 2     ┆ 3     ┆ 4     │
    │ 1     ┆ 2     ┆ 3     ┆ 4     ┆ 5     │
    │ 2     ┆ 3     ┆ 4     ┆ 5     ┆ 6     │
    │ 3     ┆ 4     ┆ 5     ┆ 6     ┆ 7     │
    └───────┴───────┴───────┴───────┴───────┘
    

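    In Polars, we would express your calculation over every column as follows. Here pl.all() selects all columns, and .name.keep() keeps each column's original name rather than the name of the literal 2: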

    df_copy = df.select(pl.lit(2).pow(pl.all()).name.keep())
    print(df_copy)
    
    shape: (4, 5)
    ┌───────┬───────┬───────┬───────┬───────┐
    │ col_0 ┆ col_1 ┆ col_2 ┆ col_3 ┆ col_4 │
    │ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
    │ f64   ┆ f64   ┆ f64   ┆ f64   ┆ f64   │
    ╞═══════╪═══════╪═══════╪═══════╪═══════╡
    │ 1.0   ┆ 2.0   ┆ 4.0   ┆ 8.0   ┆ 16.0  │
    │ 2.0   ┆ 4.0   ┆ 8.0   ┆ 16.0  ┆ 32.0  │
    │ 4.0   ┆ 8.0   ┆ 16.0  ┆ 32.0  ┆ 64.0  │
    │ 8.0   ┆ 16.0  ┆ 32.0  ┆ 64.0  ┆ 128.0 │
    └───────┴───────┴───────┴───────┴───────┘