Tags: maven, gradle, google-cloud-dataflow, archetypes

Gradle Support for GCP Dataflow Templates?


According to Google's Dataflow documentation, Dataflow job template creation is "currently limited to Java and Maven." However, the Java documentation across GCP's Dataflow site is... messy, to say the least. The 1.x and 2.x versions of Dataflow are pretty far apart in terms of details, and I have some specific code requirements that lock me into the 2.0.0r3 codebase, so I'm pretty much required to use Apache Beam. Apache is -- understandably -- quite dedicated to Maven, but institutionally my company has thrown the bulk of its weight behind Gradle, so much so that it migrated all its Java projects over last year and has pushed back against re-introducing Maven.

However, now we seem to be at an impasse: we have a specific goal of centralizing a lot of our back-end data gathering in GCP's Dataflow, but Dataflow doesn't appear to have formal support for Gradle. If it does, it's not in the official documentation.

Is there a sufficient technical basis to actually build Dataflow templates with Gradle, and the issue is simply that Google's docs haven't been updated to cover it? Is there a technical reason why Gradle can't do what's being done with Maven? Is there a better guide for working with GCP Dataflow than the docs on Google's and Apache's websites? I haven't worked with Maven archetypes before, and all the searches I've done for "gradle archetypes" turn up results that are, at best, over a year old. Most of the information points to forum discussions from 2014 and Gradle 1.7rc3, but we're on Gradle 3.5. This feels like it ought to be a solved problem, but for the life of me I can't find any current information on this online.


Solution

  • There's absolutely nothing stopping you writing your Dataflow application/pipeline in Java, and using Gradle to build it.

    Gradle will simply produce an application distribution (e.g. ./gradlew clean distTar), which you then extract and run with the --runner=TemplatingDataflowPipelineRunner --dataflowJobFile=gs://... parameters. (Those flags are from the Dataflow 1.x SDK; on the Beam-based 2.x SDK you're locked to, the equivalents are --runner=DataflowRunner --templateLocation=gs://....) A build file along the lines of the sketch below is all it takes.
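    For illustration, a minimal build.gradle sketch for the Gradle 3.x era; the main class and the SDK version are placeholders, so swap in whatever your project actually uses:

        // build.gradle -- minimal sketch, not a definitive setup
        apply plugin: 'java'
        apply plugin: 'application'   // provides the distTar/distZip and run tasks

        // Hypothetical entry point -- replace with your pipeline's main class
        mainClassName = 'com.example.dataflow.MyTemplatePipeline'

        repositories {
            mavenCentral()
        }

        dependencies {
            // Dataflow 2.x SDK (Apache Beam based); pin to the release you're locked to
            compile 'com.google.cloud.dataflow:google-cloud-dataflow-java-sdk-all:2.0.0'
        }

    Running ./gradlew clean distTar drops a tarball under build/distributions/; extract it and invoke the bin/ launcher script it contains with the template flags above.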

    It's just a runnable Java application.

    The template and all the binaries will then be uploaded to GCS, and you can execute the pipeline through the console, CLI or even Cloud Functions.
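    For example, via the CLI -- the job name, bucket, template path and parameters below are all placeholders, and on older gcloud releases this command may still live under gcloud beta:

        gcloud dataflow jobs run my-template-job \
            --gcs-location=gs://my-bucket/templates/my-template \
            --parameters inputFile=gs://my-bucket/input.txt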

    You don't even need to use Gradle. You could just run it locally and the template/binaries will be uploaded. But I'd imagine you're using a build server like Jenkins.

    Maybe the Dataflow docs should read "Note: Template creation is currently limited to Java", because this feature is not available in the Python SDK yet.