python pandas pyarrow

Force PyArrow table write to ignore NULL type and use original schema type for a column


I have this piece of code that writes two parts of the same data to a Parquet file through a single pq.ParquetWriter. The second write fails because col3 gets assigned the null type. I understand why that happens. Is there a way to force it to use the type from the first table's schema instead of the one inferred from the data in the second write?

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

data = {
    'col1': ['A', 'A', 'A', 'B', 'B'],
    'col2': [0, 1, 2, 1, 2]
}

df1 = pd.DataFrame(data)
df1['col3'] = 1

df2 = df1.copy()
df2['col3'] = pd.NA

pat1 = pa.Table.from_pandas(df1)
pat2 = pa.Table.from_pandas(df2)

writer = pq.ParquetWriter('junk.parquet', pat1.schema)
writer.write_table(pat1)
writer.write_table(pat2)

The error on the second write:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.10/site-packages/pyarrow/parquet/core.py", line 1094, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file: 
table:
col1: string
col2: int64
col3: null
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 578 vs. 
file:
col1: string
col2: int64
col3: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 577


Solution

  • The problem is that assigning pd.NA gives the column the dtype object, and PyArrow infers the type null for an object column that holds only missing values:

    df2 = df1.copy()
    df2['col3'] = pd.NA
    
    print(df2.dtypes)
    
    col1    object
    col2     int64
    col3    object
    dtype: object
    

    Cast the column to the nullable Int64 dtype with Series.astype before converting to a PyArrow table:

    df2['col3'] = pd.NA
    df2['col3'] = df2['col3'].astype('Int64')
    

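    Or in one statement, using pd.Series (passing the index gives the Series one value per row, rather than relying on index alignment to fill the rest with NA):

    df2['col3'] = pd.Series(pd.NA, index=df2.index, dtype='Int64')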
    

    Both lead to a matching schema once pat2 is rebuilt with pa.Table.from_pandas(df2):

    pat2.schema
    
    col1: string
    col2: int64
    col3: int64
    -- schema metadata --
    pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 577
    
    pat1.schema == pat2.schema
    # True