pandasdaskpyarrowapache-arrowdtype

Create a dask array with pyarrow dtype


In pandas, I can create a series with a pyarrow dtype the following way:

>>> import pandas as pd

>>> s = pd.Series([1,2,3]).astype("int64[pyarrow]")
>>> s.dtype
int64[pyarrow]

I did not find how to do it with Dask.

I tried:

>>> import dask.config
>>> import dask.array as da
>>> dask.config.set({"array.pyarrow_dtype": True})

>>> s = da.array([1,2,3])
>>> s

Which returns an array with a numpy int 64 dtype.

enter image description here

I also tried the following:

>>> import dask.array as da
>>> s = da.array([1,2,3], dtype="int64[pyarrow]")
TypeError: data type 'int64[pyarrow]' not understood

and

>>> import dask.array as da
>>> import pyarrow as pa
>>> s = da.array([1,2,3], pa.int64())

TypeError: Cannot interpret 'DataType(int64)' as a data type

Is it possible?


Solution

  • dask.array does not directly support pyarrow. In fact, since they will represent (regular) numpy arrays, arrow would not provide any benefit.

    There IS support for arbitrary array backend supporting NEP18 (__array_function__), allowing for numpy to be swapped out for cupy, for example. However, I don't believe this includes any arrow structure - or I don't know how to achieve it.

    The references you see to arrow support in dask are specific to dataframes, and usually (always?) for strings.