Tags: java, google-cloud-dataflow, apache-beam, dataflow

How can we prevent an empty file from being written in a Dataflow pipeline when the PCollection is empty?


I have a Dataflow pipeline that parses a file. If it finds any incorrect records, it writes them to a GCS bucket. However, even when the input file contains no errors, TextIO still writes an empty file (containing only the header) to the bucket.

How can we skip this write step when the PCollection is empty?

errorRecords.apply("WritingErrorRecords",
        TextIO.write().to(options.getBucketPath())
                .withHeader("ID|ERROR_CODE|ERROR_MESSAGE")
                .withoutSharding()
                .withSuffix(".txt")
                .withShardNameTemplate("-SSS")
                .withNumShards(1));

Solution

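  • Beam's TextIO added support for skipIfEmpty() in version 2.40.0. When it is set on the write transform, no output files are written if the PCollection is empty. See: https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/TextIO.TypedWrite.html#skipIfEmpty--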
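
Applied to the write step from the question, this might look like the minimal sketch below. It assumes Beam 2.40.0 or later on the classpath. The class name ErrorFileWriter and the method errorRecordWriter are hypothetical names introduced only for illustration. The withoutSharding() call from the question is left out here on the assumption that the later withNumShards(1) and withShardNameTemplate("-SSS") calls override it anyway.

```java
import org.apache.beam.sdk.io.TextIO;

// Hypothetical helper class, not part of Beam or of the original pipeline.
public class ErrorFileWriter {

    // Builds the error-record write transform from the question, with
    // skipIfEmpty() added so that no file (and no header) is written
    // when the input PCollection is empty. Requires Beam 2.40.0+.
    public static TextIO.Write errorRecordWriter(String bucketPath) {
        return TextIO.write()
                .to(bucketPath)
                .withHeader("ID|ERROR_CODE|ERROR_MESSAGE")
                .withSuffix(".txt")
                .withNumShards(1)
                .withShardNameTemplate("-SSS")
                .skipIfEmpty();
    }
}
```

In the pipeline it would then be used as errorRecords.apply("WritingErrorRecords", ErrorFileWriter.errorRecordWriter(options.getBucketPath())). On Beam versions older than 2.40.0 this method is not available, so the SDK would need to be upgraded first.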