
How to visualize a Beam pipeline run with DirectRunner


In GCP we can see the pipeline execution graph. Is the same possible when running locally via DirectRunner?


Solution

  • You can use pipeline_graph and the InteractiveRunner to get a graphviz representation of your pipeline locally.

    An example for the word count pipeline used in the Beam documentation:

    import apache_beam as beam
    from apache_beam.runners.interactive.display import pipeline_graph
    from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
    import re
    
    pipeline = beam.Pipeline(InteractiveRunner())
    lines = pipeline | beam.Create([f"some_file_{i}.txt" for i in range(10)])
    
    # Count the occurrences of each word.
    counts = (
        lines
        | 'Split' >> (
            beam.FlatMap(
                lambda x: re.findall(r'[A-Za-z\']+', x)).with_output_types(str))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))
    
    # Format the counts into a PCollection of strings.
    def format_result(word_count):
        (word, count) = word_count
        return f'{word}: {count}'
    
    output = counts | 'Format' >> beam.Map(format_result)
    
    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | beam.io.WriteToText("some_file.txt")
    
    # Print the pipeline graph as a Graphviz DOT string.
    print(pipeline_graph.PipelineGraph(pipeline).get_dot())
    

    This prints

    digraph G {
    node [color=blue, fontcolor=blue, shape=box];
    "Create";
    lines [shape=circle];
    "Split";
    pcoll4978 [label="", shape=circle];
    "PairWithOne";
    pcoll8859 [label="", shape=circle];
    "GroupAndSum";
    counts [shape=circle];
    "Format";
    output [shape=circle];
    "WriteToText";
    pcoll6409 [label="", shape=circle];
    "Create" -> lines;
    lines -> "Split";
    "Split" -> pcoll4978;
    pcoll4978 -> "PairWithOne";
    "PairWithOne" -> pcoll8859;
    pcoll8859 -> "GroupAndSum";
    "GroupAndSum" -> counts;
    counts -> "Format";
    "Format" -> output;
    output -> "WriteToText";
    "WriteToText" -> pcoll6409;
    }
    

    Pasting this output into https://edotor.net renders the graph:

    [image: beam pipeline]

    If you need prettier output, you can also render the DOT string from Python, for example with the graphviz package.
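
As a minimal sketch of that last step, the DOT text can be saved to a file and rendered to a PNG from Python. This assumes the Graphviz `dot` command-line tool is installed and on your PATH, which the answer above does not state. The helper names `save_dot` and `render_png` and the file name `pipeline.dot` are hypothetical, and the short DOT string stands in for the real output of `get_dot()`.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def save_dot(dot_source: str, dot_path: str = "pipeline.dot") -> Path:
    """Write a DOT string to disk and return the path (hypothetical helper)."""
    path = Path(dot_path)
    path.write_text(dot_source, encoding="utf-8")
    return path


def render_png(dot_path: Path) -> Optional[Path]:
    """Render a DOT file to PNG with the Graphviz `dot` CLI, if it is installed."""
    if shutil.which("dot") is None:
        # Assumption not met: Graphviz is not on PATH, so only the .dot file exists.
        return None
    png_path = dot_path.with_suffix(".png")
    subprocess.run(
        ["dot", "-Tpng", str(dot_path), "-o", str(png_path)],
        check=True,
    )
    return png_path


# Stand-in for: pipeline_graph.PipelineGraph(pipeline).get_dot()
dot_source = 'digraph G {\n"Create" -> lines;\nlines -> "Split";\n}'

dot_file = save_dot(dot_source)
png_file = render_png(dot_file)
```

With the real pipeline, you would pass the string returned by `get_dot()` instead of the stand-in. If `dot` is missing, `render_png` returns `None` and you can still paste the saved file into an online viewer.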