pythongoogle-cloud-dataflowapache-beamdataflow

DataflowRunner "Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase" using SlidingWindows yet DirectRunner works?


Why does Dataflow generate the following error when joining two streams where one has been windowed into sliding windows?

TypeError: Cannot convert GlobalWindow to apache_beam.utils.windowed_value._IntervalWindowBase [while running 'B/Map(_from_proto_str)-ptransform-24']

I have created a reproducible example below that works on DirectRunner, but produces the error on DataflowRunner.

pipeline_options = PipelineOptions(pipeline_args, streaming=True, save_main_session=True)

with Pipeline(options=pipeline_options) as pipeline:
    x = pipeline | "A" >> ReadFromPubSub(topicA) | WindowInto(SlidingWindows(2, 1))
    y = pipeline | "B" >> ReadFromPubSub(topicB)
    (x, y) | Flatten()

Job ID: 2022-08-25_14_39_18-6045313873222423696

SDK Version: Apache Beam Python 3.9 SDK 2.40.0


Solution

  • PCollections must have consistent windowing to be flattened. I'm not sure why the direct runner doesn't enforce this (though I bet it would give errors later on, e.g. if a GroupByKey was attempted on the flatten's output), but we should catch this at construction time. Filed https://github.com/apache/beam/issues/22903 to make this more explicit.