In pandas, I can create a series with a pyarrow dtype the following way:
>>> import pandas as pd
>>> s = pd.Series([1,2,3]).astype("int64[pyarrow]")
>>> s.dtype
int64[pyarrow]
I did not find how to do it with Dask.
I tried:
>>> import dask.config
>>> import dask.array as da
>>> dask.config.set({"array.pyarrow_dtype": True})
>>> s = da.array([1,2,3])
>>> s
Which returns an array with a numpy int 64 dtype.
I also tried the following:
>>> import dask.array as da
>>> s = da.array([1,2,3], dtype="int64[pyarrow]")
TypeError: data type 'int64[pyarrow]' not understood
and
>>> import dask.array as da
>>> import pyarrow as pa
>>> s = da.array([1,2,3], pa.int64())
TypeError: Cannot interpret 'DataType(int64)' as a data type
Is it possible?
dask.array does not directly support pyarrow. In fact, since they will represent (regular) numpy arrays, arrow would not provide any benefit.
There IS support for arbitrary array backend supporting NEP18 (__array_function__
), allowing for numpy to be swapped out for cupy, for example. However, I don't believe this includes any arrow structure - or I don't know how to achieve it.
The references you see to arrow support in dask are specific to dataframes, and usually (always?) for strings.