I have a Dask Series with a known categorical dtype. I want to build a small DataFrame that shows the mapping between categories and codes without computing the entire series. How can I achieve this?
import pandas as pd
import dask.dataframe as dd
from dask_ml.preprocessing import Categorizer
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
df = dd.from_pandas(df, npartitions=2)
df = Categorizer().fit_transform(df)
test = df['species']
The code above creates a categorical series in Dask. Using test.cat.codes, I can convert the categories into codes, as shown below:
> test.compute()
Out[5]:
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: category
Categories (3, object): [setosa, versicolor, virginica]
> test.cat.codes.compute()
Out[6]:
0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Length: 150, dtype: int8
The desired outcome is a mapping from the categories to the codes, as shown below, without calling compute() until the very end.
Desired output:
Category Code
setosa 0
versicolor 1
virginica 2
I have tried lots of things, but they all require converting the series into a pandas series or dataframe, which defeats the purpose of using dask. I haven't found anything in dask which would help me do this without re-partitioning, which I do not want to do. Also note that while the example has access to the DataFrame for setup purposes, I do not actually have access to an original dataframe so it would need to start with the series "test".
How about the following:
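# Pair each category label with its integer code, column-wise (still lazy).
category_mapping = dd.concat([test, test.cat.codes], axis=1)
category_mapping.columns = ["Category", "Code"]
# Keep one row per distinct (Category, Code) pair.
category_mapping = category_mapping.drop_duplicates()
# Only this final step triggers computation.
print(category_mapping.compute())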
which would give you:
Category Code
0 setosa 0
50 versicolor 1
100 virginica 2