python pandas dataframe dask dask-ml

Create a category-code map based off a Dask.Series


I have a Dask Series whose categorical dtype is known. I want to create a small DataFrame that shows the category-to-code mapping without having to compute the entire series. How do I achieve this?

import pandas as pd
import dask.dataframe as dd
from dask_ml.preprocessing import Categorizer

df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
df = dd.from_pandas(df, npartitions=2)
df = Categorizer().fit_transform(df)

test = df['species']

The code above creates a categorical series in Dask. Using test.cat.codes, I can convert the categories into codes, as shown below:


> test.compute()
Out[5]: 
0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: category
Categories (3, object): [setosa, versicolor, virginica]

> test.cat.codes.compute()
Out[6]: 
0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Length: 150, dtype: int8

The desired outcome is a mapping from the categories to the codes, as shown below, without calling compute until the very end.

Desired output:

Category      Code
setosa        0
versicolor    1
virginica     2

I have tried several approaches, but they all require converting the series into a pandas Series or DataFrame first, which defeats the purpose of using Dask. I haven't found anything in Dask that does this without repartitioning, which I want to avoid. Also note that although the example uses the DataFrame for setup, I do not actually have access to the original DataFrame, so the solution needs to start from the series "test".


Solution

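  • How about the following? It pairs each category with its code lazily and only calls compute at the very end:

    # Put the categorical values and their integer codes side by side (lazy)
    category_mapping = dd.concat([test, test.cat.codes], axis=1)
    category_mapping.columns = ["Category", "Code"]
    # Keep one row per distinct (Category, Code) pair
    category_mapping = category_mapping.drop_duplicates()
    # The single compute call that materializes the small result
    print(category_mapping.compute())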
    

    which would give you:

           Category  Code
    0        setosa     0
    50   versicolor     1
    100   virginica     2