I'm trying to use scikit-learn's LabelEncoder with a Polars DataFrame to encode a categorical column. I am using the following code.
import polars as pl
from sklearn.preprocessing import LabelEncoder
df = pl.DataFrame({
    "Color": ["red", "white", "blue"]
})
enc = LabelEncoder()
df.with_columns(
    enc.fit_transform(pl.col("Color"))
)
However, an error is raised.
ValueError: y should be a 1d array, got an array of shape () instead.
Next, I tried converting the column to a NumPy array.
df.with_columns(
    enc.fit_transform(pl.col("Color").to_numpy())
)
Now, a different error is raised.
AttributeError: 'Expr' object has no attribute 'to_numpy'
Note. I found that .cast(pl.Categorical).to_physical() could be used to obtain the desired result. Still, I'd prefer using something like transform() on my test dataset.
df.with_columns(
    pl.col("Color").cast(pl.Categorical).to_physical().alias("Color_encoded")
)
To call an external API that takes an entire sequence of values, such as enc.fit_transform, you can use pl.Expr.map_batches. It passes the whole column to the function as a single Series.
df.with_columns(
    pl.col("Color").map_batches(enc.fit_transform)
)
shape: (3, 1)
┌───────┐
│ Color │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ 2 │
│ 0 │
└───────┘
Note. It would be nice if enc.set_output(transform="polars") (as outlined in this answer) were available for the LabelEncoder. However, this is not implemented.
You already shared an approach to label encoding a column using Polars' native expression API. A cleaner way could rely on dense ranking, as follows.
df.with_columns(
    pl.col("Color").rank("dense") - 1
)
The subtraction is only there so that the lowest label is 0, since dense ranks start at 1.