python pandas pyarrow

What is the difference between the PyArrow arguments for pandas readers?


Pandas' documentation explains how to use PyArrow as the backend for IO methods. However, I couldn't understand from it the difference between these two options:

df = pd.read_csv(data, engine="pyarrow")
# and
df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")

What is the difference between them?


Solution

  • The engine argument specifies which CSV parser to use. The available options are c, python, and pyarrow. The dtype_backend argument, by contrast, tells pandas which kind of dtypes the resulting DataFrame should use: with dtype_backend="pyarrow", the columns are Arrow-backed types rather than the default NumPy types.

    I suggest reading this excellent article by Marc (a pandas core developer), which covers pandas 2.0 in detail.

    In the pandas 2.0 release candidates there was a global dtype_backend option to tell pandas that we want Arrow-backed types by default. The option was confusing, since not all operations support generating Arrow-backed data yet, and it was removed. For I/O functions that support creating Arrow-backed data, there is a dtype_backend parameter:

    import pandas
    
    fname = "data.csv"  # placeholder path to your CSV file
    df = pandas.read_csv(fname, engine='pyarrow', dtype_backend='pyarrow')
    

    Note that the engine is largely independent of the backend. We can use the PyArrow parser (engine) to read a CSV file while storing the columns with NumPy dtypes (backend), and the other way round.