[SOLVED] Nesting pipelines in apache beam

I am looking to do the following with Apache Beam,
specifically as pre-processing for a TensorFlow neural network.

For each file in a folder:
- for each line in the file
  - process the line into a 1D list of floats

I need the result for each file to be a 2D list of floats.

I think I can accomplish this by creating nested pipelines.
I could create and run a pipeline inside of a ParDo of another pipeline.

This seems inefficient, but my problem seems like a pretty standard use case.

Is there a tool to do this better in Apache Beam?
Is there a way to restructure my problem so that it works better in Apache Beam?
Are nested pipelines not as bad as I think they are?

Thanks

Solution

Apache Beam is a great tool for pre-processing data for machine learning with TensorFlow. More information about this general use case and tf.Transform is available in this post.

Nothing described here seems to require "nested pipelines". Reading each line of each file in a directory is a simple TextIO.Read transformation. It is unclear what your requirements are beyond that, but in general, splitting each line into floats and joining it with other lines are straightforward ParDo and grouping operations.

As general guidance, I'd avoid nested pipelines and try to break the problem down so that it fits into a single pipeline.
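
To make the single-pipeline idea concrete, here is a minimal sketch in the Beam Python SDK. It is only one possible layout, not the definitive one: instead of a line-by-line text read followed by a grouping step, it uses the fileio transforms (MatchFiles and ReadMatches) so that each file is handled as one element and its lines stay in order. The helper names parse_line, parse_file and run are made up for illustration, and the sketch assumes each line holds whitespace-separated numbers and that a single file fits in a worker's memory.

```python
from typing import List, Tuple


def parse_line(line: str) -> List[float]:
    """Turn one line into a 1D list of floats.

    Assumption: values are separated by whitespace.
    """
    return [float(token) for token in line.split()]


def parse_file(path: str, contents: str) -> Tuple[str, List[List[float]]]:
    """Turn a whole file into (path, 2D list of floats), keeping line order.

    Blank lines are skipped.
    """
    rows = [parse_line(line) for line in contents.splitlines() if line.strip()]
    return path, rows


def run(file_pattern: str) -> None:
    """Build and run one pipeline that emits a 2D list of floats per file."""
    # Imported here so the parsing helpers above work without Beam installed.
    import apache_beam as beam
    from apache_beam.io import fileio

    with beam.Pipeline() as pipeline:
        (
            pipeline
            # One element per file matching the glob pattern.
            | "MatchFiles" >> fileio.MatchFiles(file_pattern)
            # Turn each match into a readable file handle.
            | "ReadMatches" >> fileio.ReadMatches()
            # One output element per file: (path, 2D list of floats).
            | "ParseFile" >> beam.Map(
                lambda f: parse_file(f.metadata.path, f.read_utf8()))
            # Placeholder sink; replace with whatever feeds the model.
            | "Print" >> beam.Map(print)
        )
```

Calling run with a glob such as "/path/to/folder/*.txt" (a placeholder path) would produce one element per file. For files too large to read whole, an alternative is to read (filename, line) pairs with beam.io.ReadFromTextWithFilename and then apply beam.GroupByKey, but keep in mind that grouping does not guarantee the original line order.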