While using Kedro I want to load some data and work with it. To do that, one has to register the data in a conf/base/catalog.yml file. The Kedro documentation for the Data Catalog explains how to register data for Kedro to load. However, there is little to no information on how to load a .arrow file.
In the conf/base/catalog.yml I tried to register my data thus:
dataframe:
  type: arrow.ArrowDataSet
  filepath: "home/place/data.arrow"
  layer: primary
And of course I tried different combinations from the Data Catalog documentation mentioned above.
The error I get is the following:
DataSetError: An exception occurred when parsing config for DataSet 'dataframe': Class 'arrow.ArrowDataSet' not found or one of its dependencies has not been installed.
I have of course installed the arrow package in my environment.
Does the Kedro Data Catalog simply not accept .arrow files or is there a way to register such a format in the catalog.yml file?
Thanks in advance,
Jamal
As @0x26res said, you can use the Parquet dataset or another dataset type that Kedro supports. Kedro can handle Parquet through the pyarrow engine: under the hood it relies on pandas' read_parquet, which supports two engines (pyarrow and fastparquet) and tries pyarrow first by default.
It may be necessary to install extra dependencies to use other dataset types:
pip install "kedro[pandas.ParquetDataSet]"
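Once that dependency is installed, a catalog entry along these lines should be picked up by Kedro. Treat it as a sketch rather than a drop-in fix: it assumes the data has first been saved as a Parquet file, and the filepath below is a hypothetical placeholder, not the path from the question.

```yaml
# conf/base/catalog.yml
# Assumption: the data was converted to Parquet beforehand.
# The filepath is a made-up example; point it at your own file.
dataframe:
  type: pandas.ParquetDataSet
  filepath: data/03_primary/data.parquet
  layer: primary
```

The entry keeps the same dataset name and layer as in the question, so nodes that take "dataframe" as an input would not need to change.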