apache-flink, flink-streaming

Flink Sink Parquet Compression in Datastream API


I am using the DataStream API to read Parquet data, enrich it, and write it to the S3 file system. For the Table API, the Flink documentation says the following about compressing the resulting file:

Parquet format also supports configuration from ParquetOutputFormat. For example, you can configure parquet.compression=GZIP to enable gzip compression.

Is there something similar in the DataStream API for compressing the output file?

I checked the corresponding sink documentation for the DataStream API but could not find anything related to compression for the file sink.


Solution

  • The compression codec is mentioned in the DataStream Connectors/FileSystem section, here.

    For Bulk-encoded formats you would need to create a ParquetWriterFactory and set the compression codec on the underlying writer, using something like .setCodec(CodecFactory.snappyCodec()), as mentioned in the documentation; a sketch of this approach for Parquet follows below.
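
    Note that CodecFactory.snappyCodec() is Avro's codec factory (org.apache.avro.file.CodecFactory); for a Parquet bulk writer the equivalent knob is the compression codec on the underlying Parquet writer builder. Below is a minimal sketch, assuming Avro GenericRecord elements and the flink-parquet and parquet-avro dependencies on the classpath; the names CompressedParquetSinkExample, compressedParquetFactory, and buildSnappySink are illustrative helpers, not part of the Flink API.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.connector.file.sink.FileSink;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.ParquetBuilder;
    import org.apache.flink.formats.parquet.ParquetWriterFactory;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class CompressedParquetSinkExample {

        // Illustrative helper: builds a ParquetWriterFactory whose writers
        // compress the Parquet output with the given codec (e.g. SNAPPY or GZIP).
        public static ParquetWriterFactory<GenericRecord> compressedParquetFactory(
                Schema schema, CompressionCodecName codec) {
            // Capture the schema as a String so the lambda stays serializable.
            final String schemaString = schema.toString();
            ParquetBuilder<GenericRecord> builder = out ->
                    AvroParquetWriter.<GenericRecord>builder(out)
                            .withSchema(new Schema.Parser().parse(schemaString))
                            .withDataModel(GenericData.get())
                            .withCompressionCodec(codec)
                            .build();
            return new ParquetWriterFactory<>(builder);
        }

        // Illustrative helper: wires the factory into a bulk-encoded FileSink,
        // e.g. for an s3://... output path.
        public static FileSink<GenericRecord> buildSnappySink(String outputPath, Schema schema) {
            return FileSink
                    .forBulkFormat(
                            new Path(outputPath),
                            compressedParquetFactory(schema, CompressionCodecName.SNAPPY))
                    .build();
        }
    }

    You would then attach it with stream.sinkTo(buildSnappySink("s3://...", schema)). As with any bulk-encoded format, part files are only finalized on checkpoints, so checkpointing must be enabled for the compressed files to be committed.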