python pandas types pyarrow dtype

How to use categorical data type with pyarrow dtypes?


I'm working with the Arrow dtypes in pandas, and my dataframe has a variable that should be categorical, but I can't figure out how to convert it to the pyarrow data type for categorical data (dictionary).

According to the Arrow documentation (https://arrow.apache.org/docs/python/pandas.html#pandas-arrow-conversion), the Arrow data type I should be using is dictionary.

Usually, if you want pandas to use a pyarrow dtype, you just append [pyarrow] to the name of the pyarrow type, for example dtype='string[pyarrow]'. I tried using dtype='dictionary[pyarrow]', but that yields the error:

data type 'dictionary[pyarrow]' not understood

I also tried 'categorical[pyarrow]', 'category[pyarrow]', pyarrow.dictionary, and pyarrow.dictionary(pyarrow.int16(), pyarrow.string()), and none of them worked either.

How can I use the dictionary dtype on a pandas Series? pd.Series(['Chocolate','Candy','Waffles'], dtype='what_to_put_here????')


Solution

  • I believe pd.ArrowDtype is required to wrap the pyarrow type (with pandas imported as pd and pyarrow as pa):

    dtype=pd.ArrowDtype(pa.dictionary(pa.int16(), pa.string()))