What are the fastest ways to apply functions to Polars DataFrames - pl.DataFrame or pl.internals.lazy_frame.LazyFrame? This question piggy-backs off Apply Function to all columns of a Polars-DataFrame.
I am trying to concat all columns and hash the value using hashlib from the Python standard library. The function I am using is below:
import hashlib
import os

def hash_row(row):
    # Note: setting PYTHONHASHSEED at runtime has no effect (it must be set
    # before the interpreter starts), and sha256 is deterministic regardless.
    os.environ['PYTHONHASHSEED'] = "0"
    row = str(row).encode('utf-8')
    return hashlib.sha256(row).hexdigest()
However, given that this function requires a string as input, it needs to be applied to every cell within a pl.Series. With a small amount of data this should be okay, but when we have closer to 100m rows it becomes very problematic. The question for this thread is: how can we apply such a function in the most performant way across an entire Polars Series?
Pandas, for example, offers a few options to create new columns, and some are more performant than others.
df['new_col'] = df['some_col'] * 100 # vectorized calls
Another option is to create custom functions for row-wise operations.
def apply_func(row):
    return row['some_col'] + row['another_col']

df['new_col'] = df.apply(lambda row: apply_func(row), axis=1)  # using apply operations
In my experience, the fastest way is to create numpy vectorized solutions.
import numpy as np

def np_func(some_col, another_col):
    return some_col + another_col

# note: np.vectorize is a convenience wrapper (a Python loop under the hood),
# not true SIMD vectorization
vec_func = np.vectorize(np_func)
df['new_col'] = vec_func(df['some_col'].values, df['another_col'].values)
What is the best solution for Polars?
Let's start with this data of various types:
import polars as pl

df = pl.DataFrame(
    {
        "col_int": [1, 2, 3, 4],
        "col_float": [10.0, 20, 30, 40],
        "col_bool": [True, False, True, False],
        "col_str": ["2020-01-01"] * 4,
    }
).with_columns(pl.col("col_str").str.to_date().alias("col_date"))
df
shape: (4, 5)
┌─────────┬───────────┬──────────┬────────────┬────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date │
╞═════════╪═══════════╪══════════╪════════════╪════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┘
I should first point out that Polars itself has a hash_rows function that will hash the rows of a DataFrame, without first needing to cast each column to a string.
df.hash_rows()
shape: (4,)
Series: '' [u64]
[
16206777682454905786
7386261536140378310
3777361287274669406
675264002871508281
]
If you find this acceptable, then this would be the most performant solution. You can cast the resulting unsigned int to a string if you need to. Note: hash_rows is available only on a DataFrame, not a LazyFrame.
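For example, here is a minimal sketch of attaching the hash as a string column (the column name row_hash is just illustrative):

df = df.with_columns(
    # hash_rows returns a u64 Series; cast it to a string and name the column
    df.hash_rows().cast(pl.Utf8).alias("row_hash")
)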
polars.concat_str and map_elements
If you need to use your own hashing solution, then I recommend using the polars.concat_str function to concatenate the values in each row to a string. From the documentation:
polars.concat_str(exprs: Union[Sequence[Union[polars.internals.expr.Expr, str]], polars.internals.expr.Expr], sep: str = '') → polars.internals.expr.Expr
Horizontally Concat Utf8 Series in linear time. Non utf8 columns are cast to utf8.
So, for example, here is the resulting concatenation on our dataset.
df.with_columns(
    pl.concat_str(pl.all()).alias('concatenated_cols')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date ┆ concatenated_cols │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date ┆ str │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪════════════════════════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 110.0true2020-01-012020-01-01 │
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ 220.0false2020-01-012020-01-01 │
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 330.0true2020-01-012020-01-01 │
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ 440.0false2020-01-012020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┴────────────────────────────────┘
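One design caveat: with the default empty separator, distinct rows can concatenate to the same string (for example, "1" + "23" and "12" + "3" both produce "123"), and such rows would then hash identically. If that could occur in your data, a sketch using the separator argument (named sep in the older signature quoted above) keeps the field boundaries unambiguous:

df.with_columns(
    # choose a delimiter that cannot appear in the data itself
    pl.concat_str(pl.all(), separator="|").alias('concatenated_cols')
)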
Taking the next step and using the map_elements method and your function would yield:
df.with_columns(
    pl.concat_str(pl.all()).map_elements(hash_row).alias('hash')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬─────────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date ┆ hash │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date ┆ str │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪═════════════════════════════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 1826eb9c6aeb0abcdd2999a76eee576e... │
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ ea50f5b11957bfc92b5ab7545b3ac12c... │
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ eef039d8dedadcc282d6fa9473e071e8... │
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ dcc5c57e0b5fdf15320a84c6839b0e3d... │
└─────────┴───────────┴──────────┴────────────┴────────────┴─────────────────────────────────────┘
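Since hash_rows is not available on a LazyFrame, this concat_str plus map_elements expression is also one way to get a row hash in lazy mode. A sketch, assuming the same hash_row function as above (return_dtype is passed explicitly because we know the function returns a string):

(
    df.lazy()
    .with_columns(
        pl.concat_str(pl.all())
        .map_elements(hash_row, return_dtype=pl.Utf8)
        .alias('hash')
    )
    .collect()
)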
Please remember that any time Polars calls external libraries or runs Python bytecode, you are subject to the Python GIL, which means single-threaded performance - no matter how you code it. From the Polars User Guide section Do Not Kill The Parallelization!:
We have all heard that Python is slow, and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function to apply during a parallelized phase, Polars speed is capped running Python code preventing any multiple threads from executing the function.
This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keeping in mind bytecode and the GIL costs have to be paid.
To mitigate this, Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API.
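For instance, here is a sketch of that expression syntax applied to the row-wise sum from the pandas examples above, using columns from the sample frame; no Python function is called per row, so Polars can keep the work parallelized:

df.with_columns(
    # a native expression: evaluated in Rust, outside the GIL
    (pl.col("col_int") + pl.col("col_float")).alias("new_col")
)

The same expression runs unchanged on a LazyFrame via df.lazy().with_columns(...).collect().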