Tags: scikit-learn, python-polars, label-encoding

How to Apply LabelEncoder to a Polars DataFrame Column?


I'm trying to use scikit-learn's LabelEncoder to encode a categorical column of a Polars DataFrame, using the following code.

import polars as pl

from sklearn.preprocessing import LabelEncoder

df = pl.DataFrame({
    "Color": ["red", "white", "blue"]
})

enc = LabelEncoder()

df.with_columns(
    enc.fit_transform(pl.col("Color"))
)

However, an error is raised.

ValueError: y should be a 1d array, got an array of shape () instead.

Next, I tried converting the column to a NumPy array.

df.with_columns(
    enc.fit_transform(pl.col("Color").to_numpy()) 
)

Now, a different error is raised.

AttributeError: 'Expr' object has no attribute 'to_numpy'

Note. I found that .cast(pl.Categorical).to_physical() can be used to obtain the desired result. Still, I'd prefer something like an encoder's transform() method that I could reuse on my test dataset.

df.with_columns(
    pl.col("Color").cast(pl.Categorical).to_physical().alias("Color_encoded")
)

Solution

  • For a call to an external API that consumes an entire sequence of values, such as enc.fit_transform, pl.Expr.map_batches can be used.

    df.with_columns(
        pl.col("Color").map_batches(enc.fit_transform)
    )
    
    shape: (3, 1)
    ┌───────┐
    │ Color │
    │ ---   │
    │ i64   │
    ╞═══════╡
    │ 1     │
    │ 2     │
    │ 0     │
    └───────┘
    

    Note. It would be nice if enc.set_output("polars") (as outlined in this answer) were available for the LabelEncoder. However, this is not implemented.
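
    Even without set_output, the fitted encoder can be reused on another frame: the fit_transform call above fits enc on the training data, and passing enc.transform to map_batches applies the same mapping elsewhere. A minimal sketch, where test_df is a hypothetical test frame:

    # hypothetical test frame with the same column
    test_df = pl.DataFrame({"Color": ["blue", "red", "red"]})

    test_df.with_columns(
        # reuse the mapping learned on df; unseen categories would raise a ValueError
        pl.col("Color").map_batches(enc.transform)
    )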


    You already shared an approach to label encoding a column using Polars' native expression API. A cleaner alternative relies on dense ranking, as follows.

    df.with_columns(
        pl.col("Color").rank("dense") - 1
    )
    

    The subtraction of 1 is only there so that the lowest label is 0, since rank("dense") starts counting at 1.
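
    The rank-based expression recomputes labels per frame, so on its own it does not guarantee the same mapping on a separate test set. A purely Polars-based way to get a reusable mapping is to build a small lookup frame once and join it onto any other frame. A minimal sketch, where train, test, and mapping are illustrative names:

    train = pl.DataFrame({"Color": ["red", "white", "blue"]})
    test = pl.DataFrame({"Color": ["blue", "red", "green"]})

    # lookup frame: sorted unique categories paired with 0-based codes
    # (with_row_index is named with_row_count in older Polars versions)
    mapping = (
        train.select(pl.col("Color").unique().sort())
        .with_row_index("Color_encoded")
    )

    # reuse the same mapping on any frame; unseen categories (e.g. "green") become null
    test.join(mapping, on="Color", how="left")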