Tags: hive, parquet, presto, alluxio

Hive metastore with Alluxio storage: Parquet timestamp data type problem


I am using PrestoDB with the Hive metastore for schema storage and an Alluxio cache as external storage for the data. The storage format used in Alluxio and in the Hive schema is Parquet. While retrieving a timestamp field from Presto through the Hive catalog, I get the following error:

The column utcdate is declared as type timestamp, but the Parquet file declares the column as type INT64

The schema for the dataset is:

    CREATE TABLE test (
        utcdate timestamp
    ) WITH (
        format = 'PARQUET',
        external_location = 'alluxio://path-toserver/directory-path-to-parquet'
    )

This issue appeared after upgrading PrestoDB from v0.221 to v0.226.


Solution

  • I pass the following option while saving the Parquet file to solve the timestamp issue.

    pyarrow.parquet.write_table(self.arrow_table, file_name, use_deprecated_int96_timestamps=True)

    Presto and Spark store timestamps as INT96 values with nanosecond precision. By default, pyarrow writes timestamp columns as INT64, which Presto interprets as the bigint data type. Writing timestamps with the deprecated INT96 type in pyarrow produces values that Presto reads back as a compatible timestamp. A self-contained sketch is shown below.
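
    For reference, here is a minimal sketch of the workaround. The example DataFrame, the utcdate column values, and the output file name are assumptions for illustration; only the use_deprecated_int96_timestamps flag comes from the answer above.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical example data; the column name matches the Hive schema above.
    df = pd.DataFrame({
        "utcdate": pd.to_datetime(["2019-10-01 12:00:00", "2019-10-02 08:30:00"]),
    })

    # By default pyarrow would write this timestamp column as INT64, which is what
    # triggers the "declares the column as type INT64" error in Presto.
    arrow_table = pa.Table.from_pandas(df)

    # use_deprecated_int96_timestamps=True stores the column as INT96 instead,
    # the representation Presto and Spark expect for timestamps.
    pq.write_table(
        arrow_table,
        "part-00000.parquet",  # assumed file name inside the external_location directory
        use_deprecated_int96_timestamps=True,
    )

    After writing the file this way into the directory referenced by external_location, the utcdate column should be read by Presto as a timestamp rather than INT64.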