With Dask, I am trying to create a column whose values are lists of integers. For example:
import dask.dataframe as dd
import pandas as pd
# An example Dask DataFrame
ddf = dd.from_pandas(pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'age': [25, 30, 35, 40, 45]
}), npartitions=1)
# Now try to create a list-typed column
ddf["alist"] = ddf.apply(
    lambda k: [1, 0, 0], axis=1, meta=("alist", "list<item: int64>")
)
This particular case fails with:
TypeError: data type 'list<item: int64>' not understood
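One way to avoid this TypeError is to declare the meta with the generic object dtype, since pandas (and therefore Dask) has no dedicated list dtype and stores Python lists in object columns. A minimal sketch of that approach:

# pandas has no native list dtype, so list values live in a plain
# object column; declare the meta accordingly.
ddf["alist"] = ddf.apply(
    lambda k: [1, 0, 0], axis=1, meta=("alist", "object")
)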
Eventually I want to write to parquet:
ddf.to_parquet(
    "example",
    engine="pyarrow",
    compression="snappy",
    overwrite=True,
)
and if I specify the dtype incorrectly, it raises:
ValueError: Failed to convert partition to expected pyarrow schema:
`ArrowInvalid('Could not convert [1, 2, 3] with type list: tried to convert to int64', 'Conversion failed for column alist with type object')`
Expected partition schema:
id: int64
name: large_string
age: int64
alist: int64
__null_dask_index__: int64
Received partition schema:
id: int64
name: large_string
age: int64
alist: list<item: int64>
child 0, item: int64
__null_dask_index__: int64
This error *may* be resolved by passing in schema information for
the mismatched column(s) using the `schema` keyword in `to_parquet`.
As discussed here, you can also specify the PyArrow types explicitly when writing:
import pyarrow as pa

ddf.to_parquet(
    "example",
    engine="pyarrow",
    compression="snappy",
    overwrite=True,
    schema={
        "alist": pa.list_(pa.int64()),
    },
)
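To confirm the list type survived the round trip, one minimal check (assuming the dataset was written to the local "example" directory as above) is to inspect the written schema with PyArrow:

import pyarrow.dataset as ds

# The unified schema of the written dataset; "alist" should now
# appear as list<item: int64> instead of a plain int64 column.
dataset = ds.dataset("example", format="parquet")
print(dataset.schema)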