google-cloud-dataflowapache-beamapache-beam-iospotify-scio

Scio all saveAs txt file methods output a txt file with part prefix


If I want to output a SCollection of TableRow or String to google cloud storage (GCS) I'm using saveAsTableRowJsonFile or saveAsTextFile, respectively. Both of these methods ultimately use

private[scio] def pathWithShards(path: String) = path.replaceAll("\\/+$", "") + "/part" 

which enforces that file names start with "part". Is the only way to output a custom sharded file via to use saveAsCustomOutput?


Solution

  • I had to do it in beam code via saveAsCustomOutput

    import org.apache.beam.sdk.util.Transport
    val jsonFactory: JsonFactory = Transport.getJsonFactory
    val outputPath = "gs://foo/bar_" // file prefix will be bar_
    @BigQueryType.toTable()
    case class Clazz(foo: String, bar: String)
    val collection: SCollection[Clazz] = ....
    collection.map(Clazz.toTableRow).
              map(jsonFactory.toString).
              saveAsCustomOutput(name = "CustomWrite", io.TextIO.write()
                .to(outputPath)
                .withSuffix("")
                .withWritableByteChannelFactory(FileBasedSink.CompressionType.GZIP))