pythongoogle-cloud-platformapache-beamdirect-runner

Apache Beam DirectRunner with Cloud Pub/Sub


I am trying to pass data from Cloud Pub/Sub to Google Cloud Storage. When I use runner DataflowRunner , the pipeline gets published to Google Cloud Dataflow and works as expected. However, for some testing I'd like the pipeline to run locally (but still read from Cloud Pub/Sub and write Cloud Storage). When I use the runner DirectRunner, the process writes out INFO:apache_beam.runners.direct.direct_runner:Running pipeline with DirectRunner., but does nothing when a new message is published to Pub/Sub.

I am executing the pipeline with this command:

python dev_radim_dataflow_gcs_direct.py ^
  --project=<GCP_PROJECT> ^
  --region="europe-west3" ^
  --input_subscription="projects/data-uat-280814/subscriptions/dev-radim-dataflow" ^
  --output_path=gs://dev_radim/dataflow_dest_local/ ^
  --runner=DirectRunner ^
  --window_size=1 ^
  --temp_location=gs://dev_radim/dataflow_temp_local/

The full dev_radim_dataflow_gcs_direct.py file is here: https://pastebin.com/W7VphH5A

Any ideas why the message doesn't make it from Pub/Sub to GCS?


Solution

  • Posting comment by @RadRussian as an answer, since this could happen for other people as well:

    There was another consumer reading from the same subscription, so no messages ever got to the pipeline running in the DirectRunner. In this case the consumer was a Dataflow job, but it could be anything.