Tags: google-bigquery, google-cloud-pubsub, data-warehouse, dead-letter

Pub/Sub: How to retrieve messages sent to a dead-letter topic and store them in a BigQuery table?


For context

We currently have a data-mining script that collects a large volume of event data and publishes it to a Pub/Sub topic with schema enforcement (all fields being optional). A Pub/Sub BigQuery subscription then writes the data to a destination table.
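
For reference, the shape of that setup looks roughly like the following sketch, using the google-cloud-pubsub Python client (project, schema, and topic names are placeholders, and the Avro definition is a simplified stand-in for our real one):

    from google.cloud.pubsub import PublisherClient, SchemaServiceClient
    from google.pubsub_v1.types import Encoding, Schema

    # Placeholder names, not our real resources.
    project_id = "my-project"
    project_path = f"projects/{project_id}"

    # Simplified Avro schema in which every field is optional (nullable).
    avsc = """
    {
      "type": "record",
      "name": "Event",
      "fields": [
        {"name": "event_type", "type": ["null", "string"], "default": null},
        {"name": "payload", "type": ["null", "string"], "default": null}
      ]
    }
    """

    schema_client = SchemaServiceClient()
    schema_path = schema_client.schema_path(project_id, "event-schema")
    schema_client.create_schema(
        request={
            "parent": project_path,
            "schema": Schema(name=schema_path, type_=Schema.Type.AVRO, definition=avsc),
            "schema_id": "event-schema",
        }
    )

    # Topic with the schema attached; publishes that fail validation are rejected.
    publisher = PublisherClient()
    topic_path = publisher.topic_path(project_id, "events")
    publisher.create_topic(
        request={
            "name": topic_path,
            "schema_settings": {"schema": schema_path, "encoding": Encoding.JSON},
        }
    )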

The problem

At present, the script fetching the data sometimes also picks up events that do not conform to the schema. To avoid downtime, the script has been configured to ingest these erroneous messages anyway and send them to the destination table as a single field holding the message contents as a JSON string, with the rest of the fields for that row left null. This obviously poses a data-quality issue, since end users shouldn't have to filter through so many null values. However, since the schema failure could stem from any of a variety of fields, I need to somehow retain the raw JSON string and monitor which events are causing trouble.

Potential Solution

To move toward a cleaner approach, I have been exploring the use of a dead-letter topic to which these erroneous messages are sent. The idea is to have two separate BigQuery tables:

  • the existing destination table, which would receive only messages that conform to the schema; and
  • a separate errors table, which would store the dead-lettered messages as raw JSON strings for monitoring.

This would avoid the data-quality issue, while also retaining information about the errors in a more cost-efficient way.
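
Something along these lines is what I have in mind for routing the failures, sketched with the google-cloud-pubsub Python client (all resource names are placeholders):

    from google.cloud import pubsub_v1

    # Placeholder names, not our real resources.
    project_id = "my-project"
    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()

    topic_path = publisher.topic_path(project_id, "events")
    subscription_path = subscriber.subscription_path(project_id, "events-to-bigquery")
    dead_letter_topic_path = publisher.topic_path(project_id, "events-dead-letter")

    # Forward messages that repeatedly fail delivery to the dead-letter topic.
    dead_letter_policy = pubsub_v1.types.DeadLetterPolicy(
        dead_letter_topic=dead_letter_topic_path,
        max_delivery_attempts=5,  # allowed range is 5 to 100
    )

    subscription = pubsub_v1.types.Subscription(
        name=subscription_path,
        topic=topic_path,
        dead_letter_policy=dead_letter_policy,
    )

    with subscriber:
        subscriber.update_subscription(
            request={
                "subscription": subscription,
                "update_mask": {"paths": ["dead_letter_policy"]},
            }
        )

(As I understand the docs, the Pub/Sub service account also needs permission to publish to the dead-letter topic and to subscribe to the source subscription for forwarding to work.)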

Question

Is what I am trying to achieve possible? I have explored Google's documentation on dead-letter topics, but there are no details on how to retrieve the unacknowledged messages that end up there.


Solution

  • A dead-letter topic is just like any other topic. Once messages are published to it, you can use a subscription that is attached to the topic to fetch the messages and do whatever you want with them. If all you want to do is write the messages to a BigQuery table, consider attaching a BigQuery subscription to your dead-letter topic.
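
For example, here is a minimal sketch with the google-cloud-pubsub Python client (resource names are placeholders, and the target table must already exist):

    from google.cloud import pubsub_v1

    # Placeholder names for illustration.
    project_id = "my-project"
    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()

    dead_letter_topic_path = publisher.topic_path(project_id, "events-dead-letter")
    subscription_path = subscriber.subscription_path(
        project_id, "dead-letter-to-bigquery"
    )

    # Each message's payload lands in the table's "data" column;
    # write_metadata also stores the message ID, publish time, and attributes.
    bigquery_config = pubsub_v1.types.BigQueryConfig(
        table=f"{project_id}.monitoring.dead_letter_events",
        write_metadata=True,
    )

    with subscriber:
        subscriber.create_subscription(
            request={
                "name": subscription_path,
                "topic": dead_letter_topic_path,
                "bigquery_config": bigquery_config,
            }
        )

Since the dead-letter topic itself has no schema attached, the table only needs a "data" column (plus the metadata columns) to hold each raw message, which preserves the JSON string you want to monitor.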