tensorflow, google-cloud-dataflow, google-cloud-ml, apache-beam, google-cloud-ml-engine

How to use Google DataFlow Runner and Templates in tf.Transform?


We are in the process of establishing a Machine Learning pipeline on Google Cloud, leveraging GC ML-Engine for distributed TensorFlow training and model serving, and DataFlow for distributed pre-processing jobs.

We would like to run our Apache Beam apps as DataFlow jobs on Google Cloud. Looking at the ML-Engine samples, it appears possible to tell tensorflow_transform.beam.impl (AnalyzeAndTransformDataset) which PipelineRunner to use, as follows:

import apache_beam as beam
from tensorflow_transform.beam import impl as tft

pipeline_name = "DirectRunner"
p = beam.Pipeline(pipeline_name)  # the first argument selects the runner
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)

TemplatingDataflowPipelineRunner provides the ability to separate our pre-processing development from the parameterized execution of those jobs - see here: https://cloud.google.com/dataflow/docs/templates/overview

The question is: can you show me how we can use tf.Transform to leverage TemplatingDataflowPipelineRunner?


Solution

  • Python templates are available as of April 2017 (see the documentation). The way to use them is as follows:

    from apache_beam.options.pipeline_options import PipelineOptions

    class UserOptions(PipelineOptions):
        @classmethod
        def _add_argparse_args(cls, parser):
            # Runtime (template) parameter, resolved when the template is executed.
            parser.add_value_provider_argument('--value_provider_arg', default='some_value')
            # Regular parameter, fixed when the pipeline/template is constructed.
            parser.add_argument('--non_value_provider_arg', default='some_other_value')
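
    A value_provider argument is exposed as a ValueProvider object, so its value can only be read at execution time via .get(). As a minimal sketch of how this might be wired up (the project id, bucket paths and the AddSuffix DoFn are assumptions for illustration, not part of the original answer), the same options object can also stage the pipeline as a Dataflow template via template_location:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    class AddSuffix(beam.DoFn):
        """Appends a runtime-provided suffix; calls .get() only at execution time."""
        def __init__(self, suffix):
            self.suffix = suffix  # a ValueProvider, not a plain string

        def process(self, element):
            yield element + self.suffix.get()

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-gcp-project',                                  # assumed project id
        temp_location='gs://my-bucket/tmp',                        # assumed bucket
        template_location='gs://my-bucket/templates/preprocess',   # stages a template instead of running the job
    )
    user_options = options.view_as(UserOptions)

    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create(['a', 'b'])
         | 'AddSuffix' >> beam.ParDo(AddSuffix(user_options.value_provider_arg)))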

    Note that Python doesn't have a TemplatingDataflowPipelineRunner, and neither does Java 2.X (unlike Java 1.X).
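
    Templates aside, a minimal sketch of what running AnalyzeAndTransformDataset on the Dataflow runner could look like is shown below. The feature name 'x', project id and bucket are assumptions, and the metadata helpers used here are the ones found in recent tf.Transform releases (older versions expose slightly different module paths):

    import apache_beam as beam
    import tensorflow as tf
    import tensorflow_transform as tft
    from tensorflow_transform.beam import impl as tft_beam_impl
    from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils
    from apache_beam.options.pipeline_options import PipelineOptions

    # Assumed schema with a single numeric feature 'x'.
    RAW_METADATA = dataset_metadata.DatasetMetadata(
        schema_utils.schema_from_feature_spec(
            {'x': tf.io.FixedLenFeature([], tf.float32)}))

    def preprocessing_fn(inputs):
        # Stand-in transform: scale 'x' to [0, 1] using a full-pass analyzer.
        return {'x_scaled': tft.scale_to_0_1(inputs['x'])}

    options = PipelineOptions(
        runner='DataflowRunner',              # or 'DirectRunner' for local testing
        project='my-gcp-project',             # assumed project id
        temp_location='gs://my-bucket/tmp',   # assumed bucket
    )

    with beam.Pipeline(options=options) as p:
        with tft_beam_impl.Context(temp_dir='gs://my-bucket/tft_tmp'):
            raw_data = p | 'CreateExamples' >> beam.Create([{'x': 1.0}, {'x': 2.0}])
            (transformed_data, transformed_metadata), transform_fn = (
                (raw_data, RAW_METADATA)
                | 'AnalyzeAndTransform' >> tft_beam_impl.AnalyzeAndTransformDataset(preprocessing_fn))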