scalagoogle-cloud-platformgoogle-cloud-dataflowapache-beamspotify-scio

How to match multiple files with names using TextIO.Read in Cloud Dataflow


I have a gcs folder as below:

gs://<bucket-name>/<folder-name>/dt=2017-12-01/part-0000.tsv
                                /dt=2017-12-02/part-0000.tsv
                                /dt=2017-12-03/part-0000.tsv
                                /dt=2017-12-04/part-0000.tsv
                                ...

I want to match only the files under dt=2017-12-02 and dt=2017-12-03 using sc.textFile() in Scio, which uses TextIO.Read.from() underneath as far as I know.

I've tried

gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv

and

gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv

Both match zero files:

INFO org.apache.beam.sdk.io.FileBasedSource - Filepattern gs://<bucket-name>/<folder-name>/dt={2017-12-02,2017-12-03}/*.tsv matched 0 files with total size 0

INFO org.apache.beam.sdk.io.FileBasedSource - Filepattern gs://<bucket-name>/<folder-name>/dt=2017-12-(02|03)/*.tsv matched 0 files with total size 0

What should be the valid filepattern on doing this?


Solution

  • You need to use the TextIO.readAll() transform that reads a PCollection<String> of filepatterns. Create the collection of filepatterns either explicitly via Create.of() or you can compute it using a ParDo.

    case class ReadPaths(paths: java.lang.Iterable[String]) extends PTransform[PBegin, PCollection[String]] {
      override def expand(input: PBegin) = {
        Create.of(paths).expand(input).apply(TextIO.readAll())
      }
    }
    
    val paths = Seq(
      "gs://<bucket-name>/<folder-name>/dt=2017-07-01/part-0000.tsv",
      "gs://<bucket-name>/<folder-name>/dt=2017-12-20/part-0000.tsv",
      "gs://<bucket-name>/<folder-name>/dt=2018-03-29/part-0000.tsv",
      "gs://<bucket-name>/<folder-name>/dt=2018-05-04/part-0000.tsv"
    )
    
    import scala.collection.JavaConverters._
    
    sc.customInput("Read Paths", ReadPaths(paths.asJava))