I am looking to do the following with Apache Beam.
Specifically, this is pre-processing for a TensorFlow neural network.
For each file, I need the output to be a 2D list of floats.
I think I can accomplish this by creating nested pipelines.
I could create and run a pipeline inside of a ParDo of another pipeline.
This seems inefficient, but my problem seems like a pretty standard use case.
Thanks
Apache Beam is a great tool for pre-processing data for machine learning with TensorFlow. More information about this general use case and tf.Transform is available in this post.
Nothing described seems to indicate the need for "nested pipelines". Processing each line of each file in a directory is a simple TextIO.Read transformation. It is unclear what your requirements are beyond that point, but, in general, splitting each line into floats and joining it with other lines are straightforward ParDo and grouping operations.
As general guidance, I'd avoid nested pipelines and try to break the problem down so that it fits into a single pipeline.
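As a rough illustration of the single-pipeline approach, here is a minimal sketch using the Beam Python SDK. It assumes each line holds comma-separated numbers and that a file's rows can be collected into one in-memory list; the helper names (`parse_line`, `to_matrix`, `build_pipeline`) and the file pattern are made up for the example, so adapt them to your data.

```python
def parse_line(line, sep=","):
    """Split one text line into a list of floats, skipping empty fields.

    The comma separator is an assumption about the file format.
    """
    return [float(field) for field in line.split(sep) if field.strip()]


def to_matrix(keyed_rows):
    """Turn (filename, iterable of rows) into (filename, 2D list of floats)."""
    filename, rows = keyed_rows
    return filename, [list(row) for row in rows]


def build_pipeline(pipeline, file_pattern):
    """Attach the read/parse/group steps to an existing Beam pipeline."""
    # Imported here so the two helpers above stay usable without Beam.
    import apache_beam as beam

    return (
        pipeline
        # Emits (filename, line) pairs for every line of every matched file.
        | "ReadLines" >> beam.io.ReadFromTextWithFilename(file_pattern)
        | "ParseFloats" >> beam.Map(lambda kv: (kv[0], parse_line(kv[1])))
        # Collects all rows that came from the same file.
        | "GroupByFile" >> beam.GroupByKey()
        | "ToMatrix" >> beam.Map(to_matrix)
    )


if __name__ == "__main__":
    import apache_beam as beam

    # "data/*.csv" is a hypothetical pattern; replace it with your own.
    with beam.Pipeline() as p:
        build_pipeline(p, "data/*.csv") | "Print" >> beam.Map(print)
```

Note that grouping does not guarantee that rows come back in their original file order. If row order matters for your network, you would need to carry a line index along with each row and sort within `to_matrix`.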