Tags: pandas, amazon-redshift, amazon-redshift-serverless

Redshift Serverless Error: incompatible Parquet schema for default integer during COPY


I'm using a Python script to read a CSV file with Pandas, add some metadata columns, write the result to a Parquet file, and finally COPY it into Redshift Serverless. However, Redshift Serverless complains that one of the metadata columns has an incompatible data type.

This is how the whole script goes:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv(...)
# I need an integer column initialized to zero
df['processing_attempts'] = pd.Series(0, index=df.index)
# Convert to a PyArrow table and write it out as Parquet
table = pa.Table.from_pandas(df)
pq.write_table(table, f'{file_name}.parquet')

When I call df.info(), the column shows a dtype of int64, which, based on my Stack Overflow reading so far, is what it should be.

In the destination table, the column type is INT, so I'm surprised I'm getting this incompatibility error. Any ideas as to what might be off here?
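
For reference, the schema that actually lands in the Parquet file (and that Redshift reads during COPY) can be inspected directly; here's a small sketch using pyarrow.parquet.read_schema, with file_name being the same placeholder as above:

import pyarrow.parquet as pq

# Read only the schema from the generated Parquet file
schema = pq.read_schema(f'{file_name}.parquet')
# Prints the Arrow/Parquet type of the metadata column, e.g. "int64"
print(schema.field('processing_attempts'))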

Thanks in advance!


Solution

  • Figured it out myself. Redshift's INT type is 4 bytes wide, so I needed to write the column as a 32-bit integer (Int32) rather than Int64. A minimal sketch of the fix is below. Hope this helps someone!
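
A minimal sketch of the fix, assuming the same pipeline as in the question (the DataFrame and file name here are stand-ins): initialize the column as a 32-bit integer so PyArrow writes a Parquet int32 column, which matches Redshift's 4-byte INT.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'id': [1, 2, 3]})  # stand-in for the CSV data in the question

# Initialize the counter as a 32-bit integer; pandas' nullable 'Int32'
# dtype also works and likewise maps to a Parquet int32 column
df['processing_attempts'] = pd.Series(0, index=df.index, dtype='int32')

table = pa.Table.from_pandas(df)
pq.write_table(table, 'example.parquet')  # hypothetical file name

With the column written as int32, COPY ... FORMAT AS PARQUET should load it into the INT column without the schema incompatibility error.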