dask, dask-dataframe, dask-ml

Why does dask_ml.preprocessing.OrdinalEncoder.transform not produce an ordinally encoded result?


I'm confused by the result of dask_ml.preprocessing.OrdinalEncoder.transform:

import string

import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.preprocessing import OrdinalEncoder as DaskOrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder

N = 10
np.random.seed(1234)

# Two string columns drawn from the letters A, B, C
df = pd.DataFrame({
    "cat1": np.random.choice(list(string.ascii_uppercase)[0:3], size=N),
    "cat2": np.random.choice(list(string.ascii_uppercase)[0:3], size=N),
})
# Same data as a Dask DataFrame split into 3 partitions
df_dd = dd.from_pandas(df, npartitions=3)

The scikit-learn OrdinalEncoder returns a numpy.ndarray of numeric values:

>>> OrdinalEncoder().fit_transform(df)
array([[2., 2.],
       [1., 0.],
       [0., 0.],
       [0., 2.],
       [0., 2.],
       [1., 2.],
       [1., 0.],
       [1., 0.],
       [2., 0.],
       [2., 1.]])

The dask-ml counterpart not only breaks the interface by returning a DataFrame instead of an array, it also simply returns the input DataFrame unchanged:

>>> DaskOrdinalEncoder().fit_transform(df_dd).compute().equals(df)
True

What I would expect is either a (pandas or Dask) DataFrame or a (NumPy or Dask) array holding numeric values, analogous to what the sklearn OrdinalEncoder produces.


Solution

  • df_dd = df_dd.categorize(columns=["cat1", "cat2"])
    

    The columns must be converted to the categorical dtype before applying the OrdinalEncoder.

    Note: Dask-ML's OrdinalEncoder only encodes columns that have the categorical dtype; by default every other column is passed through untouched. Because cat1 and cat2 hold plain strings, nothing is encoded and the input comes back unchanged.

    The Dask ML documentation explains the requirement in terms of shape: the shape of the transformed Dask DataFrame needs to be known. All partitions of a Dask DataFrame (df_dd) must have the same number of columns, so Dask has to know how many columns the transformation will produce. If the data is left as str, that number depends on the values found in the data, so Dask cannot know how many columns to expect after the transformation. With the Categorical dtype, Dask knows exactly which categories (column encodings) will be produced.

    The Dask ML documentation also has an example pipeline using OneHotEncoder with a more detailed explanation. Similar reasoning applies to OrdinalEncoder.
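To make the behaviour concrete, here is a minimal pandas-only sketch of what encoding "only the categorical columns" amounts to. The helper encode_categoricals is a hypothetical name invented for illustration; it mimics the observable behaviour described above and is not dask-ml's actual implementation. The small raw frame is made-up sample data.

```python
import pandas as pd


def encode_categoricals(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace each categorical column with its integer codes.

    Columns of any other dtype are returned untouched.
    """
    out = frame.copy()
    for col in out.select_dtypes(include=["category"]).columns:
        out[col] = out[col].cat.codes
    return out


# Made-up sample data with plain string columns
raw = pd.DataFrame({"cat1": ["C", "B", "A", "A"], "cat2": ["C", "A", "A", "C"]})

# String columns are not categorical, so nothing is encoded.
unchanged = encode_categoricals(raw)

# After converting to the categorical dtype, each value is replaced
# by the position of its category in the (sorted) category list.
encoded = encode_categoricals(raw.astype("category"))
print(encoded)
```

With string columns the helper returns the frame as it came in, which mirrors the result in the question. Once the columns are categorical, the set of categories is known up front, so the integer codes are well defined; on a Dask DataFrame, df_dd.categorize(...) is the step that establishes those categories across all partitions.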