Pandas' documentation explains how to use PyArrow
as the backend for IO methods.
However, I couldn't understand from it the difference between these two options:
df = pd.read_csv(data, engine="pyarrow")
# and
df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")
What is the difference between them?
The engine argument specifies the CSV parser engine to use. The available options are c, python, and pyarrow. The dtype_backend argument, on the other hand, tells pandas that we want Arrow-backed types (not the NumPy types) by default.
I suggest reading this excellent article by Marc (a pandas core developer), which covers pandas 2.0 in detail.
In the pandas 2.0 release candidates there was a dtype_backend option to let pandas know we want Arrow-backed types by default. The option was confusing, since not all operations support generating Arrow-backed data yet, and it was removed. For I/O operators that support creating Arrow-backed data, there is a dtype_backend parameter:
import pandas
pandas.read_csv(fname, engine='pyarrow', dtype_backend='pyarrow')
Note that the engine is somewhat independent of the backend. We can use a PyArrow function (the engine) to read CSV files while storing the columns with a NumPy data type (the backend), and the other way round.