scalaspotify-scio

Put SCollection from textFile to BigQuery with Scio


I have read some documents with textFile, and did a flatMap of the single words, adding some extra information for each word:

val col = sc.textFile(args.getOrElse("input","documents/*"))
    .flatMap(_.split("\\s+").filter(_.nonEmpty))
val mapped = col.map(t => t + ": " + extraInformation())

I am currently saving this to text easily

mapped.saveAsTextFile(args.getOrElse("output", "results"))

But I cannot figure out how to save the map to a BigQuery schema. All examples I have seen create the initial Scollection from BigQuery and then save it to another table, so the initial collection is [TableRow] instead of [String].

What is the correct approach here? Should I investigate how to convert my data to a kind of collection Big Query will accept? Or should I try to investigate further how to push this plain text straight into a table?


Solution

  • I would suggest using the @BigQueryType.toTable annotation on a case class, like so:

    import com.spotify.scio.bigquery._
    
    object MyScioJob {
    
      @BigQueryType.toTable
      case class WordAnnotated(word: String, extraInformation: String)
    
    
      def main(args: Array[String]): Unit = {
        // ...job setup logic
    
        sc.textFile(args.getOrElse("input","documents/*"))
          .flatMap(_.split("\\s+").filter(_.nonEmpty))
          .map(t => WordAnnotated(t, extraInformation())
          .saveAsTypedBigQuery("myProject:myDataset.myTable")
      }
    }
    

    There's more information about this on the Scio wiki.