Tags: google-cloud-platform, file-io, google-cloud-dataflow, apache-beam, apache-beam-io

Apache Beam fileio: write compressed files


I would like to know if it's possible to write compressed files using the fileio module from the Apache Beam Python SDK. At the moment I am using the module to write files to a GCS bucket:

_ = (logs
     | 'Window' >> beam.WindowInto(window.FixedWindows(60 * 60))
     | 'Convert to JSON' >> beam.ParDo(ConvertToJson())
     | 'Write logs to GCS file' >> fileio.WriteToFiles(
         path=gsc_output_path, shards=1, max_writers_per_bundle=0))

Compression would help minimize storage costs.

According to this doc and a comment inside the class _MoveTempFilesIntoFinalDestinationFn, developers still need to implement compression handling themselves.

Am I right about this, or does someone know how to do it?

Thank you!


Solution

  • developers still need to implement handling of compression.

    This is correct.

    There are open feature requests for this, though.

    In the meantime, you can write a DoFn that reads each finalized file, compresses it, writes the compressed copy, and deletes the uncompressed original; see the sketch below.
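    A minimal sketch of that workaround, assuming fileio.WriteToFiles emits a PCollection of FileResult objects whose file_name attribute holds the finalized path (worth verifying against your SDK version); the class name CompressAndSwap and the .gz suffix are illustrative choices, not part of the original answer:

    import apache_beam as beam
    from apache_beam.io.filesystem import CompressionTypes
    from apache_beam.io.filesystems import FileSystems

    class CompressAndSwap(beam.DoFn):
        """Gzip a finalized file: write <path>.gz, then delete <path>."""

        def process(self, file_result):
            src_path = file_result.file_name  # assumed attribute of FileResult
            dst_path = src_path + '.gz'
            # Copy the original bytes through a gzip-compressed writer.
            # (read() pulls the whole file into memory, fine for modest files.)
            with FileSystems.open(src_path) as src:
                with FileSystems.create(
                        dst_path, compression_type=CompressionTypes.GZIP) as dst:
                    dst.write(src.read())
            # Drop the uncompressed original once the compressed copy exists.
            FileSystems.delete([src_path])
            yield dst_path

    Chained onto the pipeline from the question, it could look like:

    # Hypothetical wiring: capture WriteToFiles' output and compress each file.
    file_results = logs | 'Write' >> fileio.WriteToFiles(
        path=gsc_output_path, shards=1, max_writers_per_bundle=0)
    _ = file_results | 'Compress' >> beam.ParDo(CompressAndSwap())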