I have this piece of code that writes two parts of the same data to a Parquet file via PyArrow. The second write fails because the column gets assigned the null type. I understand why that happens. Is there a way to force it to use the type from the table's schema, instead of the type inferred from the data in the second write?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

data = {
    'col1': ['A', 'A', 'A', 'B', 'B'],
    'col2': [0, 1, 2, 1, 2]
}
df1 = pd.DataFrame(data)
df1['col3'] = 1           # col3 is int64 here

df2 = df1.copy()
df2['col3'] = pd.NA       # col3 becomes an all-NA object column

pat1 = pa.Table.from_pandas(df1)
pat2 = pa.Table.from_pandas(df2)

writer = pq.ParquetWriter('junk.parquet', pat1.schema)
writer.write_table(pat1)
writer.write_table(pat2)  # fails: col3 is inferred as null, not int64
My error on the second write:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1094, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file:
table:
col1: string
col2: int64
col3: null
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 578 vs.
file:
col1: string
col2: int64
col3: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 577
The problem is that the assignment of pd.NA leads to the incorrect dtype (object); since every value in col3 is then NA, PyArrow infers the null type for that column:
df2 = df1.copy()
df2['col3'] = pd.NA
print(df2.dtypes)
col1 object
col2 int64
col3 object
dtype: object
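
You can confirm what PyArrow infers from that all-NA object column (a quick check; pa.Array.from_pandas is the standard Series-to-Array constructor):

print(pa.Array.from_pandas(df2['col3']).type)  # null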
Simply change it to the nullable Int64 dtype first, using Series.astype:
df2['col3'] = pd.NA
df2['col3'] = df2['col3'].astype('Int64')
Or in one statement, using pd.Series:
df2['col3'] = pd.Series(pd.NA, dtype='Int64')
Both lead to the correct schema once the table is rebuilt with pa.Table.from_pandas(df2):
pat2.schema
col1: string
col2: int64
col3: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 577
pat1.schema == pat2.schema
# True
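
With the dtype fixed, both writes go through. A minimal end-to-end sketch of the corrected pipeline (same names and file as above; note the writer must be closed so the Parquet footer is written, hence the context manager):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

data = {
    'col1': ['A', 'A', 'A', 'B', 'B'],
    'col2': [0, 1, 2, 1, 2]
}
df1 = pd.DataFrame(data)
df1['col3'] = 1

df2 = df1.copy()
df2['col3'] = pd.Series(pd.NA, dtype='Int64')  # nullable integer, not object

pat1 = pa.Table.from_pandas(df1)
pat2 = pa.Table.from_pandas(df2)

# ParquetWriter works as a context manager; closing it finalizes the file
with pq.ParquetWriter('junk.parquet', pat1.schema) as writer:
    writer.write_table(pat1)
    writer.write_table(pat2)  # succeeds: both tables have col3: int64

print(pq.read_table('junk.parquet').to_pandas())

As for the literal question of forcing the writer's schema onto the table: casting the Arrow table itself should also work here, since Arrow's null type is castable to any other type. A sketch, assuming you keep the plain pd.NA assignment from the question:

pat2 = pa.Table.from_pandas(df2).cast(pat1.schema)  # col3: null -> int64, values stay null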