I get this error when I read a Parquet file with pyarrow and count the records in pandas. I am using pyarrow 11.0.0 and Python 3.10. Please advise.
code:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import pandas as pd
# Read the Parquet file into a PyArrow Table
table = pq.read_table('/Users/abc/Downloads/LOAD.parquet').to_pandas()
print(len(table))
error
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 101999952000000000
I do not want pyarrow to convert to timestamp[ns]; it can stay as timestamp[us]. Is there an option to keep the timestamps as they are?
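At the moment, pandas only supports nanosecond-resolution timestamps, which cover roughly the years 1677 to 2262, so the value in your file is out of range.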
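To see why this particular value fails, here is a minimal sketch that uses only the standard library, not pyarrow or pandas. It decodes the number from the error message and checks whether it fits in a signed 64-bit nanosecond counter. The helper name `fits_in_int64_ns` is made up for illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)

# Value taken from the error message, in microseconds since the epoch.
micros = 101_999_952_000_000_000


def fits_in_int64_ns(us: int) -> bool:
    """Hypothetical helper: True if a microsecond timestamp fits in int64 nanoseconds."""
    return INT64_MIN <= us * 1000 <= INT64_MAX


# Decode the offending value with plain datetime arithmetic.
decoded = EPOCH + timedelta(microseconds=micros)
# Largest instant an int64 nanosecond counter can hold, truncated to microseconds.
ns_limit = EPOCH + timedelta(microseconds=INT64_MAX // 1000)

print(decoded.date())            # 5202-04-02
print(ns_limit.date())           # 2262-04-11
print(fits_in_int64_ns(micros))  # False
```

The same check could be applied to the minimum and maximum of a timestamp column before converting it, to find out whether any rows fall outside the nanosecond range.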
If you insist on keeping microsecond (us) precision, you have a few options:
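Don't convert to pandas at all and count the rows on the pyarrow table:
table = pq.read_table("data.parquet")
len(table)
Convert the timestamps to Python datetime objects (object dtype) instead of datetime64[ns]:
table = pq.read_table("data.parquet")
df = table.to_pandas(timestamp_as_object=True)
Ignore the overflow by disabling the safety check:
table = pq.read_table("data.parquet")
df = table.to_pandas(safe=False)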
But with safe=False the value silently overflows: the original timestamp, 5202-04-02, becomes 1694-12-04.
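Upgrade to pandas 2.0 (a release candidate at the time of writing) and use the pyarrow dtype backend:
pip install pandas==2.0.0rc1
pd.read_parquet("data.parquet", dtype_backend="pyarrow")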
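The 1694 date reported for safe=False is consistent with a plain signed 64-bit wraparound of the nanosecond value. The sketch below simulates that arithmetic with the standard library only. It is an illustration of the overflow, not pyarrow's actual code path, and `wrap_int64` is a hypothetical helper.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def wrap_int64(value: int) -> int:
    """Hypothetical helper: reduce an integer to the signed 64-bit range (two's complement)."""
    return (value + 2**63) % 2**64 - 2**63


micros = 101_999_952_000_000_000        # value from the error message
wrapped_ns = wrap_int64(micros * 1000)  # unchecked us -> ns multiplication

# Floor-divide back to microseconds, the finest resolution datetime supports.
wrapped = EPOCH + timedelta(microseconds=wrapped_ns // 1000)
print(wrapped.date())  # 1694-12-04
```

So the unchecked cast does not fail, but the result has nothing to do with the stored date, which is why it is only safe when the out-of-range rows are known to be junk.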
Surely 5202-04-02 is a typo. See this question