google-cloud-dataflow · apache-beam · spotify-scio

How do I deploy an Apache Beam/Spotify Scio Pipeline?


I've created a pipeline using the Scio wrapper for Apache Beam, and I want to deploy it on Google Cloud Dataflow.

I want a specific button, endpoint, or function that will execute this job regularly.

All of the instructions I can find involve running sbt runMain/pack, which builds the artifacts and uploads them every single time.

How can I upload the artifacts once, and then create a job based on the pipeline as easily as possible?


Solution

  • At Spotify, the way we dealt with this was to build a Docker image for the Scio pipeline and execute that image via Styx, which is basically a Kubernetes-based cron, but you could execute it via your good old cron too (or Airflow/Luigi/GCP Composer), whatever fits your use case best. Beam has a built-in caching mechanism for dependencies, so consecutive runs reuse previously uploaded files. Scio also supports the Dataflow templates mentioned in the other answer. A minimal sketch of such an entry point follows below.
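
For reference, here is a minimal sketch of a Scio entry point that could be packaged into such an image or packed artifact. The object name, argument keys (input/output), and paths are illustrative assumptions, not part of the asker's pipeline; standard Dataflow options (--project, --region, --runner=DataflowRunner, --tempLocation) are passed on the same command line and picked up by ContextAndArgs.

    import com.spotify.scio._

    // Hypothetical pipeline: object name, argument keys and paths are illustrative only.
    object WordCountPipeline {
      def main(cmdlineArgs: Array[String]): Unit = {
        // ContextAndArgs separates Beam/Dataflow options (--runner, --project, ...)
        // from custom arguments (--input, --output).
        val (sc, args) = ContextAndArgs(cmdlineArgs)

        sc.textFile(args("input"))
          .flatMap(_.split("\\s+"))       // split lines into words
          .countByValue                    // count occurrences of each word
          .map { case (word, count) => s"$word\t$count" }
          .saveAsTextFile(args("output"))

        sc.run()                           // sc.close() on older Scio versions
      }
    }

Once the image or packed artifact exists, scheduling is just a matter of invoking that entry point from cron, Styx, or an Airflow/Composer task, so nothing is rebuilt per run. Alternatively, running the pipeline once on the Dataflow runner with --templateLocation=gs://... stages a classic Dataflow template (runtime parameters then need Beam ValueProviders), and later runs can be launched from that template, for example with gcloud dataflow jobs run, without touching sbt at all.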