
Firehose Stream Delivers to S3 in Uncompressed Format Despite Compression Enabled


I have a Lambda function that uses direct PUT to send JSON strings to a Firehose stream, which delivers batches of records to S3. I want these records delivered as compressed .gz files.

However, even though the stream's Destination settings > Compression for data records option is set to GZIP, the files appear to be delivered as plain text, although they are given a .gz extension. I can tell because a) I can download a file from S3 and it opens as text without any modification, and b) gzip -d ~/path/my_file.gz returns gzip: /path/my_file.gz: not in gzip format

Why would Firehose deliver the data uncompressed even though compression is enabled? Am I missing something?

Code:

Lambda:

import json
import boto3
firehose = boto3.client("firehose")

record = {'field_1': 'test'}               # dict that serializes to JSON
record_string = json.dumps(record) + '\n'  # newline-delimit records; Firehose does not add a separator by default

response = firehose.put_record(
    DeliveryStreamName=my_stream_name,
    Record={ 'Data': record_string }
)

Firehose (Terraform):

resource "aws_kinesis_firehose_delivery_stream" "my_firehose_stream" {
  name        = my_stream_name
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = my_role_arn
    bucket_arn = my_bucket_arn

    prefix              = "my_prefix/!{partitionKeyFromQuery:extracted}/"
    error_output_prefix = "my_error_prefix/"

    buffering_size      = 64     # MB
    buffering_interval  = 900    # seconds
    compression_format  = "GZIP" # Compress as GZIP

    # Enabled to extract the dynamic partitioning key
    processing_configuration {
      enabled = true
      processors {
        type = "MetadataExtraction"
        parameters {
          parameter_name  = "JsonParsingEngine"
          parameter_value = "JQ-1.6"
        }
        parameters {
          parameter_name  = "MetadataExtractionQuery"
          parameter_value = "{extracted:.extracted}"
        }
      }
    }

    dynamic_partitioning_configuration {
      enabled        = true
    }
  }
}


Solution

  • If you downloaded the file through a web browser, the browser may have decompressed it automatically, because browsers know how to handle gzip-encoded content.

    To test what is actually stored, download the file with the AWS CLI and then inspect its contents.

    You can also compare the object size shown in S3 with the size of the file on your local disk; a larger local file suggests it was decompressed during download.

    See: Is GZIP Automatically Decompressed by Browser?
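
The checks suggested above can be sketched in Python. This is a minimal illustration, not a definitive diagnostic: looks_like_gzip and compare_with_s3 are hypothetical helper names, the bucket, key, and local path are placeholders supplied by the caller, and compare_with_s3 assumes it is passed a boto3 S3 client. The first helper tests for the two-byte gzip magic number; the second compares the size S3 reports with the size of a local copy.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def compare_with_s3(s3_client, bucket, key, local_path):
    """Compare an S3 object's metadata with a local copy of it.

    Assumption: `s3_client` is a boto3 S3 client (or anything exposing a
    compatible head_object method). All other arguments are placeholders.
    """
    head = s3_client.head_object(Bucket=bucket, Key=key)
    return {
        "s3_size": head["ContentLength"],
        "local_size": os.path.getsize(local_path),
        # Only present if the object was stored with this header
        "content_encoding": head.get("ContentEncoding"),
        "local_is_gzip": looks_like_gzip(local_path),
    }


# Local demonstration that needs no AWS access: one genuinely compressed
# file and one plain-text file that merely carries a .gz extension.
with tempfile.TemporaryDirectory() as tmp:
    compressed_path = os.path.join(tmp, "real.gz")
    plain_path = os.path.join(tmp, "fake.gz")
    payload = b'{"field_1": "test"}\n'
    with open(compressed_path, "wb") as f:
        f.write(gzip.compress(payload))
    with open(plain_path, "wb") as f:
        f.write(payload)
    real_is_gzip = looks_like_gzip(compressed_path)
    fake_is_gzip = looks_like_gzip(plain_path)
```

With a real client (for example boto3.client("s3")), a result where s3_size is smaller than local_size and local_is_gzip is False would be consistent with the object being compressed in S3 and decompressed somewhere during download.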