I know how to apply a function to all columns present in a Pandas-DataFrame. However, I have not figured out yet how to achieve this when using a Polars-DataFrame.
I checked the section of the Polars User Guide devoted to this topic, but I have not found the answer. Here I attach a code snippet with my unsuccessful attempts.
import numpy as np
import polars as pl
import seaborn as sns
# Loading toy dataset as Pandas DataFrame using Seaborn
df_pd = sns.load_dataset('iris')
# Converting Pandas DataFrame to Polars DataFrame
df_pl = pl.DataFrame(df_pd)
# Dropping the non-numeric column...
df_pd = df_pd.drop(columns='species') # ... using Pandas
df_pl = df_pl.drop('species') # ... using Polars
# Applying function to the whole DataFrame...
df_pd_new = df_pd.apply(np.log2) # ... using Pandas
# df_pl_new = df_pl.apply(np.log2) # ... using Polars?
# Applying lambda function to the whole DataFrame...
df_pd_new = df_pd.apply(lambda c: np.log2(c)) # ... using Pandas
# df_pl_new = df_pl.apply(lambda c: np.log2(c)) # ... using Polars?
Thanks in advance for your help and your time.
You can use the expression syntax to select all columns with pl.all() and then map_batches the NumPy function np.log2(..) over the columns.
df.select(
    pl.all().map_batches(np.log2)
)
Note that we choose map_batches here, as map_elements would call the function once for each value.
map_elements = pl.Series(np.log2(value) for value in pl.Series([1, 2, 3]))
But np.log2 can be called once with multiple values, which is faster.
map_batches = np.log2(pl.Series([1, 2, 3]))
See the User guide for more.
map_elements: Call a function separately on each value in the Series.
map_batches: Always passes the full Series to the function.
Polars expressions also support NumPy universal functions.
That means you can pass a Polars expression to a NumPy ufunc:
df.select(
    np.log2(pl.all())
)