Tags: amazon-web-services, aws-glue, parquet, amazon-kinesis-firehose

Best way to convert JSON to Apache Parquet format using AWS


I've been working on a project where I store IoT data in an S3 bucket and batch it using AWS Kinesis Firehose. I have a Lambda function running on the delivery stream that converts the epoch-millisecond time into a proper timestamp with date and time (a sketch of that transform follows the payload below). Here is my sample JSON payload:

{
     "device_name":"inHand-RTU",
     "Temperature":22.3,
     "Pyranometer":6,
     "Active-Power":0,
     "Voltage-1":233.93,
     "Active-Import":2.57,
     "time":"17-01-2023 10:49:09"
}
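
For completeness, here is a minimal sketch of that Lambda transform. The handler follows the standard Firehose transformation contract (base64-decode each record, return it with a recordId and result); the assumption is that "time" arrives from the device as epoch milliseconds, and the output format string just matches the sample above:

import base64
import json
from datetime import datetime, timezone

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Assumption: "time" arrives as epoch milliseconds from the device
        epoch_ms = int(payload["time"])
        # Convert to the "17-01-2023 10:49:09" style seen in the sample payload
        payload["time"] = datetime.fromtimestamp(
            epoch_ms / 1000, tz=timezone.utc
        ).strftime("%d-%m-%Y %H:%M:%S")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}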

I now want to convert these files in S3 to Parquet and then process them with Apache PySpark. What is the best way to do so? Should I use Kinesis Firehose itself, which offers built-in conversion of records to Parquet, or should I go with AWS Glue jobs? Both services can produce the same output, so what is the difference between them, and which approach should I follow?
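
For context, this is roughly the kind of PySpark processing I have in mind once the data is in Parquet (the bucket path is a placeholder, and s3a:// assumes the Hadoop S3 connector is configured):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iot-processing").getOrCreate()

# Read the Parquet files Firehose (or Glue) delivered to S3
df = spark.read.parquet("s3a://your-bucket/firehose-output/")
df.printSchema()

# Example aggregation over the fields in the sample payload
df.groupBy("device_name").avg("Temperature").show()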

Any help will be greatly appreciated.


Solution

  • The best way is to use the native Parquet conversion built into Firehose.

    Firehose has a record format conversion option (Convert record format - enable it) that converts incoming records to Parquet or ORC before delivering them to S3, so the data lands already in columnar form. A Glue job, by contrast, is a separate batch ETL step that runs after the JSON files have landed, which means a second service to schedule and pay for. Since your records already flow through Firehose, converting in-stream avoids that extra step. Note that the conversion requires a table schema registered in the AWS Glue Data Catalog, which Firehose uses to map the JSON fields to Parquet columns.

    https://docs.aws.amazon.com/firehose/latest/dev/create-transform.html
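
    For reference, a minimal boto3 sketch of that configuration (the stream name, ARNs, region, and Glue database/table names are placeholders; when format conversion is enabled, Firehose requires a buffer size of at least 64 MB):

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="iot-to-parquet",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::your-bucket",
        # Format conversion requires a buffer of at least 64 MB
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Deserialize the incoming JSON records
            "InputFormatConfiguration": {
                "Deserializer": {"OpenXJsonSerDe": {}}
            },
            # Serialize the output as Parquet
            "OutputFormatConfiguration": {
                "Serializer": {"ParquetSerDe": {}}
            },
            # Schema must already exist in the Glue Data Catalog
            "SchemaConfiguration": {
                "DatabaseName": "iot_db",
                "TableName": "iot_readings",
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "Region": "us-east-1",
            },
        },
    },
)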