python pandas parquet pyarrow fastparquet

pyarrow timestamp datatype error on parquet file


I get this error when I read a Parquet file with pyarrow and count its records in pandas. I do not want pyarrow to convert the column to timestamp[ns]; it can stay as timestamp[us]. Is there an option to keep the timestamp as is? I am using pyarrow 11.0.0 and Python 3.10. Please advise.

code:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import pandas as pd

# Read the Parquet file and convert it to a pandas DataFrame (the conversion raises the error)
table = pq.read_table('/Users/abc/Downloads/LOAD.parquet').to_pandas()

print(len(table))

error:

pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 101999952000000000

Solution

  • I do not want pyarrow to convert to timestamp[ns]; it can stay as timestamp[us]. Is there an option to keep the timestamp as is?

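    At the moment, pandas only supports nanosecond timestamps (datetime64[ns]), which cover roughly 1677-09-21 to 2262-04-11, so a timestamp in the year 5202 cannot be converted.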

    If you insist on keeping microsecond (us) precision, you have a few options:

    1. Do not use pandas; stick to pyarrow, which supports microsecond timestamps:
    table = pq.read_table("data.parquet")
    len(table)
    
    2. Use datetime.datetime instead of pd.Timestamp in your dataframe (very slow):
    table = pq.read_table("data.parquet")
    df = table.to_pandas(timestamp_as_object=True)
    
    3. Disable the safety check and accept that out-of-range timestamps overflow:
    table = pq.read_table("data.parquet")
    df = table.to_pandas(safe=False)
    

    But then the original timestamp, 5202-04-02, silently wraps around and becomes 1694-12-04.
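The wrapped date is not random. Here is a plain-Python sketch of the arithmetic, using the value from the error message and emulating a 64-bit signed overflow by hand (this only mimics the overflow; it is not how pyarrow implements the cast):

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Value from the error message: microseconds since the epoch.
us = 101_999_952_000_000_000
original = EPOCH + timedelta(microseconds=us)

# Converting to nanoseconds multiplies by 1000, which no longer fits in int64.
ns = us * 1000
overflows = ns > 2**63 - 1

# Emulate two's-complement wraparound of a signed 64-bit integer.
wrapped_ns = (ns + 2**63) % 2**64 - 2**63
wrapped = EPOCH + timedelta(microseconds=wrapped_ns // 1000)

print(original.date())  # 5202-04-02
print(overflows)        # True
print(wrapped.date())   # 1694-12-04
```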

    4. If you're feeling intrepid, use pandas 2.0 with pyarrow as the backend for pandas:
    pip install pandas==2.0.0rc1
    
    pd.read_parquet("data.parquet", dtype_backend="pyarrow")
    
    5. Fix the data using pyarrow before converting to pandas

    Surely 5202-04-02 is a typo. See this question