Tags: pandas, data-science, dask, dask-distributed, dask-ml

How to apply LabelEncoder to a Dask DataFrame to Encode the Categorical Values


I have a Dask DataFrame made up of categorical and numerical (float and int) data. When I try to label-encode the categorical columns using the code below, I get an error.

from dask_ml.preprocessing import LabelEncoder, Categorizer

encoder = LabelEncoder()

encoded = encoder.fit_transform(train_X.values)

The error is as follows:

ValueError: bad input shape (36862367, 15)

I have also tried a different approach:

from sklearn.externals.joblib import parallel_backend


with parallel_backend('dask'):

    from sklearn.pipeline import make_pipeline
    pipe = make_pipeline(
    Categorizer(), LabelEncoder())

    pipe.fit(train_X)

    pipe.transform(train_X)

This gives me a new error:

TypeError: fit() takes 2 positional arguments but 3 were given

Can anyone please advise me on the right way to encode categorical data in a Dask DataFrame? Thanks in advance.


Solution

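  • In scikit-learn / dask-ml, LabelEncoder transforms a 1-D input, so you would use it on a single pandas / dask Series, not a whole DataFrame. Passing the 15-column train_X.values is what triggers "bad input shape (36862367, 15)". The pipeline fails for a related reason: LabelEncoder is meant for target labels, so its fit() accepts only y, while the pipeline calls fit(X, y) on its last step.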

    >>> import dask.dataframe as dd
    >>> import pandas as pd
    >>> from dask_ml.preprocessing import LabelEncoder
    >>> le = LabelEncoder()  # one encoder per 1-D column
    >>> data = dd.from_pandas(pd.Series(['a', 'a', 'b'], dtype='category'),
    ...                       npartitions=2)
    >>> le.fit_transform(data)
    dask.array<values, shape=(nan,), dtype=int8, chunksize=(nan,)>
    >>> le.fit_transform(data).compute()
    array([0, 0, 1], dtype=int8)
    

    https://ml.dask.org/modules/api.html#dask_ml.preprocessing.LabelEncoder