I am writing a simple Apache Beam streaming pipeline, taking input from a Pub/Sub topic and storing it in BigQuery. For hours I thought I wasn't even able to read a message, as I was simply trying to log the input to the console:
events = p | 'Read PubSub' >> ReadFromPubSub(subscription=SUBSCRIPTION)
logging.info(events)
When I write this to text it works fine! However, my call to the logger never happens.
How do people develop / debug these streaming pipelines?
I have tried adding the following line:
events | 'Log' >> logging.info(events)
Using print() also yields no results in the console.
This is because events is a PCollection, so you need to apply a PTransform to it.
The simplest way would be to apply a ParDo to events:
events | 'Log results' >> beam.ParDo(LogResults())
which is defined as:
class LogResults(beam.DoFn):
    """Just log the results"""
    def process(self, element):
        logging.info("Pub/Sub event: %s", element)
        yield element
Notice that I also yield the element in case you want to apply further steps downstream, such as writing to a sink after logging the elements. See the issue here, for example.