I'd like to convert an int96 value such as ACIE4NxJAAAKhSUA into a readable timestamp format like 2020-03-02 14:34:22, or however it would normally be interpreted. I mostly use Python, so I'm looking to build a function that does this conversion. If there's another function that can do the reverse -- even better.
I'm using parquet-tools to convert a raw parquet file (with snappy compression) to raw JSON via this command:
C:\Research> java -jar parquet-tools-1.8.2.jar cat --json original-file.snappy.parquet > parquet-output.json
Inside the JSON, I'm seeing these values as the timestamps:
{... "_id":"101836","timestamp":"ACIE4NxJAAAKhSUA"}
I've determined that the timestamp value of "ACIE4NxJAAAKhSUA" is really int96 (this is also confirmed by reading the schema of the parquet file):
message spark_schema {
  ...(stuff)...
  optional binary _id (UTF8);
  optional int96 timestamp;
}
I think this is also known as an Impala Timestamp (at least that's what I've gathered).
I've been searching everywhere for a function or info on how to "read" the int96 value in Python (I'd like to keep it in that language since I'm most familiar with it) and output the timestamp -- I've found nothing.
Here are a few articles I've already looked into (related to this subject):
Please don't ask me to stop using an old/deprecated timestamp format within a parquet file; I'm well aware of that from the research I've done so far. I'm a receiver of the file/data -- I can't change the format used on creation.
If there's another way to control the initial JSON output to deliver a "non int96" value -- I'd be interested in that as well.
Thanks so much for your help, SO community!
parquet-tools will not be able to change the format type from INT96 to INT64. What you are observing in the json output is a string representation of the timestamp stored as INT96 TimestampType. You will need Spark to re-write this parquet file with the timestamp as INT64 TimestampType, and then the json output will produce a timestamp (in the format you desire).
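As an aside, since the question asks for a plain-Python function: the 12-byte INT96 value can also be decoded without Spark. Below is a minimal sketch, assuming parquet-tools base64-encodes binary fields in its json output (the 16-character string decodes to exactly 12 bytes, which fits) and that the value uses the standard Impala/Parquet INT96 layout: an 8-byte little-endian nanoseconds-of-day count followed by a 4-byte little-endian Julian day number.

import base64
import datetime

JULIAN_EPOCH_DAY = 2440588  # Julian day number of 1970-01-01

def int96_to_datetime(b64: str) -> datetime.datetime:
    # Assumed layout: 8-byte little-endian nanoseconds-of-day,
    # then a 4-byte little-endian Julian day number.
    raw = base64.b64decode(b64)
    nanos_of_day = int.from_bytes(raw[:8], "little")
    julian_day = int.from_bytes(raw[8:12], "little")
    epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
    return epoch + datetime.timedelta(days=julian_day - JULIAN_EPOCH_DAY,
                                      microseconds=nanos_of_day // 1000)

def datetime_to_int96(dt: datetime.datetime) -> str:
    # The reverse: pack a UTC datetime back into a base64 INT96 string.
    delta = dt - datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    raw = (nanos_of_day.to_bytes(8, "little")
           + (JULIAN_EPOCH_DAY + delta.days).to_bytes(4, "little"))
    return base64.b64encode(raw).decode("ascii")

print(int96_to_datetime("ACIE4NxJAAAKhSUA"))  # 2020-02-10 22:33:33+00:00
print(datetime_to_int96(int96_to_datetime("ACIE4NxJAAAKhSUA")))  # round-trips

Note that the reverse direction truncates to microsecond precision, since Python's datetime doesn't carry nanoseconds.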
You will need to set a specific config in Spark -
spark-shell --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS
2020-03-16 11:37:50 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.0.20:4040
Spark context available as 'sc' (master = local[*], app id = local-1584383875924).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val sourceDf = spark.read.parquet("original-file.snappy.parquet")
2020-03-16 11:38:31 WARN Utils:66 - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
sourceDf: org.apache.spark.sql.DataFrame = [application: struct<name: string, upgrades: struct<value: double> ... 3 more fields>, timestamp: timestamp ... 16 more fields]
scala> sourceDf.repartition(1).write.parquet("Downloads/output")
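Since the question is Python-oriented, the same rewrite can be done from PySpark. This is a sketch of the equivalent session (assuming Spark 2.3+, where the outputTimestampType option was introduced):

from pyspark.sql import SparkSession

# Same rewrite as the spark-shell session above, driven from Python.
spark = (SparkSession.builder
         .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
         .getOrCreate())

df = spark.read.parquet("original-file.snappy.parquet")
df.repartition(1).write.parquet("Downloads/output")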
parquet-tools will show the correct TimestampType:
parquet-tools schema Downloads/output/part-00000-edba239b-e696-4b4e-8fd3-c7cca9eea6bf-c000.snappy.parquet
message spark_schema {
  ...
  optional binary _id (UTF8);
  optional int64 timestamp (TIMESTAMP_MICROS);
  ...
}
And the json dump gives -
parquet-tools cat --json Downloads/output/part-00000-edba239b-e696-4b4e-8fd3-c7cca9eea6bf-c000.snappy.parquet
{..."_id":"101836", "timestamp":1583973827000000}
The timestamp recorded is in microseconds (matching the TIMESTAMP_MICROS config above), not nanoseconds. Hope this helps!