I have exported a parquet file using parquet.net which includes a duration column that contains values greater than 24 hours. I've opened the file using the floor tool that's included with parquet.net, and the column has type INT32, converted type TIME_MILLIS, and logical type TIME (unit: MILLIS, isAdjustedToUTC: True). In the .NET code the column was added as new DataField<DateTime>("duration").
I'm trying to parse the file using pandas and pyarrow using the following method:
pd.read_parquet('myfile.parquet', engine="pyarrow")
This results in the following error:
ValueError: hour must be in 0..23
Is there a way to give pyarrow directions to load columns as the primitive type instead of the logical type? Pandas has a pandas.Period type and Python has the datetime.timedelta type. Is parquet.net creating an invalid column type?
The only way I can think of is to provide a schema to read_parquet, but that means you need to know the types of all the other columns.
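import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a test file whose "duration" column is TIME (ms) and holds a value over 24 hours.
table = pa.table(
    {
        "id": pa.array([0, 1], pa.int32()),
        "duration": pa.array([23 * 3600 * 1000, 25 * 3600 * 1000], pa.time32("ms")),
    }
)
pq.write_table(table, "/tmp/test.parquet")

# Pass an explicit schema so "duration" is read as its physical type (int32)
# instead of being converted to datetime.time.
df = pd.read_parquet(
    "/tmp/test.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
    schema=pa.schema({"id": pa.int32(), "duration": pa.int32()}),
)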
| id | duration |
|---|---|
| 0 | 82800000 |
| 1 | 90000000 |