[SOLVED] Read parquet file using pandas and pyarrow fails for time values larger than 24 hours

I have exported a parquet file using parquet.net which includes a duration column containing values greater than 24 hours. I've opened the file using the floor tool that's included with parquet.net, and the column has type INT32, converted type TIME_MILLIS and logical type TIME (unit: MILLIS, isAdjustedToUTC: True). In .NET code the column was added as new DataField<DateTime>("duration").

I'm trying to read the file with pandas and pyarrow using the following call:

pd.read_parquet('myfile.parquet', engine="pyarrow")

This results in the following error:

ValueError: hour must be in 0..23

Is there a way to give pyarrow directions to load columns as the primitive type instead of the logical type? Pandas has a pandas.Period type and Python has the datetime.timedelta type. Is parquet.net creating an invalid column type?

Solution

The only way I can think of is to provide a schema to read_parquet that declares the column as int32 instead of time32. But it means you need to know the types of all the other columns.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

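# Build a sample file with one duration below and one above 24 hours.
table = pa.table(
    {
        "id": pa.array([0, 1], pa.int32()),
        "duration": pa.array([23 * 3600 * 1000, 25 * 3600 * 1000], pa.time32("ms")),
    }
)

pq.write_table(table, "/tmp/test.parquet")

# Override the stored schema so "duration" is read as its physical
# type (int32 milliseconds) rather than as a time-of-day value.
df = pd.read_parquet(
    "/tmp/test.parquet",
    engine="pyarrow",
    dtype_backend="pyarrow",
    schema=pa.schema({"id": pa.int32(), "duration": pa.int32()}),
)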

The duration column then comes back as raw millisecond integers:

id	duration
0	82800000
1	90000000