We are in the process of establishing a machine learning pipeline on Google Cloud, leveraging Cloud ML Engine for distributed TensorFlow training and model serving, and Dataflow for distributed pre-processing jobs.
We would like to run our Apache Beam apps as Dataflow jobs on Google Cloud. Looking at the ML Engine samples, it appears possible to tell tensorflow_transform.beam.impl AnalyzeAndTransformDataset which PipelineRunner to use, as follows:
import apache_beam as beam
from tensorflow_transform.beam import impl as tft

pipeline_name = "DirectRunner"
p = beam.Pipeline(pipeline_name)  # the runner can be selected by name
# xxx / yyy stand in for our actual read and pre-processing transforms
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)
TemplatingDataflowPipelineRunner provides the ability to separate our preprocessing development from parameterized execution - see the templates overview here: https://cloud.google.com/dataflow/docs/templates/overview
The question is: can you show me how we can use tf.Transform to leverage TemplatingDataflowPipelineRunner?
Python templates are available as of April 2017 (see the documentation). The way to operate them is to declare runtime parameters as ValueProviders in a PipelineOptions subclass:
from apache_beam.options.pipeline_options import PipelineOptions

class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Resolved at runtime, so it can differ on every template execution
        parser.add_value_provider_argument('--value_provider_arg', default='some_value')
        # Fixed when the template is constructed
        parser.add_argument('--non_value_provider_arg', default='some_other_value')
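To show how such a parameter is consumed (a minimal sketch; the AppendSuffix DoFn and the sample data are illustrative, not from the official samples): the ValueProvider is handed to a transform at construction time and resolved with .get() only once the job actually runs:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class AppendSuffix(beam.DoFn):
    # Hypothetical DoFn: receives a ValueProvider, not a plain string
    def __init__(self, suffix):
        self.suffix = suffix

    def process(self, element):
        # .get() may only be called at runtime, after the value is resolved
        yield element + self.suffix.get()

options = PipelineOptions()
user_options = options.view_as(UserOptions)
p = beam.Pipeline(options=options)
(p
 | beam.Create(['a', 'b'])
 | beam.ParDo(AppendSuffix(user_options.value_provider_arg)))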
Note that Python doesn't have a TemplatingDataflowPipelineRunner, and neither does Java 2.X (unlike Java 1.X, which did).
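Instead of a dedicated templating runner, a template is created by running the pipeline with --template_location set, which stages the job to GCS rather than executing it; a minimal sketch, with placeholder project and bucket names:

from apache_beam.options.pipeline_options import PipelineOptions

# With --template_location set, this run stages a template in GCS
# instead of launching a job; the names below are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--temp_location=gs://my-bucket/tmp',
    '--staging_location=gs://my-bucket/staging',
    '--template_location=gs://my-bucket/templates/preprocess',
])

The staged template can then be launched per run (for example from the Dataflow console or the gcloud CLI), supplying the value-provider arguments at execution time.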