python, pandas, google-bigquery, parquet, fastparquet

Unable to write parquet with DATE as logical type for a column from pandas


I am trying to write a parquet file that contains one date column whose parquet logical type is DATE and physical type is INT32. I am writing the parquet file with pandas, using fastparquet as the engine, because I need to stream the data from the database and append each chunk to the same parquet file. Here is my code:

import os
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import text

sql = "SELECT TO_VARCHAR(CURRENT_DATE, 'YYYYMMDD') AS "REPORT_DATE" FROM DUMMY;"

def create_stream_enabled_connection(username, password, host, port):
    conn_str = f"mydb://{username}:{password}@{host}:{port}"
    engine = create_engine(conn_str, connect_args={'encrypt': 'True', 'sslValidateCertificate': 'False'})
    connection = engine.connect().execution_options(stream_results=True)
    return connection

connection = create_stream_enabled_connection(username, password, host, port)

ROWS_IN_CHUNK = 500000
# Stream chunks from database
for dataframe_chunk in pd.read_sql(text(sql), connection, chunksize=ROWS_IN_CHUNK):
    if not os.path.exists(local_path) or os.stat(local_path).st_size == 0:
        # create the parquet file with the first chunk
        dataframe_chunk.to_parquet(local_path, index=False, engine='fastparquet')
    else:
        # append subsequent chunks to the same file
        dataframe_chunk.to_parquet(local_path, index=False, engine='fastparquet', append=True)

Problem:

I am unable to get logical type DATE and physical type INT32 in the output parquet file from pandas' to_parquet function with the fastparquet engine.

A few things that I have tried:

dataframe_chunk['report_date'] = pd.to_datetime(dataframe_chunk['report_date'], format='%Y%m%d')
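This conversion yields a datetime64[ns] column, which fastparquet writes out as a 64-bit timestamp rather than an INT32 DATE. One way to confirm what actually got written (assuming pyarrow is available for inspection) is to read back the parquet schema:

import pyarrow.parquet as pq

# Shows physical and logical types per column; the converted column
# comes out as a 64-bit timestamp, not DATE/INT32
print(pq.ParquetFile(local_path).schema)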

I need the parquet logical type to be DATE and the physical type to be INT32 because this parquet file will be loaded directly into BigQuery, and the REPORT_DATE column will go into a DATE column there. See the BigQuery documentation here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#type_conversions

If I store the column as STRING or DATETIME in parquet and load it into BigQuery, the load fails with an error saying the column was expected to have a different type.
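For reference, the load itself looks roughly like this with the BigQuery Python client; the GCS path, dataset and table names below are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# placeholder bucket, dataset and table names
load_job = client.load_table_from_uri(
    "gs://my-bucket/report.parquet",
    "my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # raises if the parquet column types do not match the table schema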


Solution

  • Answered in https://github.com/dask/fastparquet/issues/880#issuecomment-1697436417

    The short story is: fastparquet cannot do this right now. I don't know how you can persuade bigquery to accept the current style of output, but I would think there must be a way, since the 64-bit format is valid parquet.
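A possible workaround, then, is to write the file with pyarrow instead: pyarrow supports date32, which is written as logical type DATE with physical type INT32, and its ParquetWriter can write one row group per chunk into a single file, so the streaming/append pattern still works. Below is a minimal sketch, assuming pyarrow is installed, that the query returns only the REPORT_DATE column, and reusing sql, connection, local_path and ROWS_IN_CHUNK from the question:

import pyarrow as pa
import pyarrow.parquet as pq

writer = None
for dataframe_chunk in pd.read_sql(text(sql), connection, chunksize=ROWS_IN_CHUNK):
    # Parse the YYYYMMDD strings and keep only the date part (python date objects)
    dataframe_chunk["REPORT_DATE"] = pd.to_datetime(
        dataframe_chunk["REPORT_DATE"], format="%Y%m%d"
    ).dt.date

    # date32 maps to logical type DATE with physical type INT32 in parquet
    schema = pa.schema([pa.field("REPORT_DATE", pa.date32())])
    table = pa.Table.from_pandas(dataframe_chunk, schema=schema, preserve_index=False)

    if writer is None:
        # The writer keeps the file open and appends one row group per chunk
        writer = pq.ParquetWriter(local_path, table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()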