google-bigquery, google-cloud-dataflow, google-bigquery-storage-api

Google Dataflow store to specific Partition using BigQuery Storage Write API


I want to store data in BigQuery using specific partitions. The partitions are ingestion-time based, and I want to write to a range of partitions spanning two years. I use the partition decorator in the destination, project-id:data-set.table-id$partition-date. The writes fail because the destination is not recognised as a partition decorator but is treated as an actual table name. Is this supported?


Solution

  • Direct writes to ingestion-time partitions via a partition decorator are not supported by the BigQuery Storage Write API. The legacy streaming API is also not an option here, because it only accepts partitions within a 31-day window in the past.

    From the documentation:

    When streaming using a partition decorator, you can stream to partitions within the last 31 days in the past and 16 days in the future relative to the current date, based on current UTC time.

    The approach that works is to use BigQuery load jobs to insert the data. Load jobs accept partition decorators for any ingestion-time partition, without the 31-day restriction, so they can handle this scenario (see the first sketch at the end of this answer).

    Because this operation involves a lot of I/O (files are staged on GCS before being loaded), it can be lengthy, costly and resource-intensive depending on the data volume. An alternative approach is to shard the large table into smaller per-date tables, so that the Storage Read and Write APIs can still be used for the heavy lifting, and then run jobs from the sharded tables into the corresponding partitions of the partitioned table (second sketch below). Each of those jobs requires fewer resources, and the problem is already divided.
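
    A minimal sketch of the load-job route with the Apache Beam Python SDK, assuming a hypothetical ingestion-time partitioned table project-id:data_set.events and elements that carry an event_date field (both names are illustrative). WriteToBigQuery with Method.FILE_LOADS accepts a callable for the table, so the partition decorator can be computed per element:

    ```python
    import apache_beam as beam

    def to_partition(element):
        # Turn the element's date into a partition decorator,
        # e.g. 2022-01-15 -> project-id:data_set.events$20220115.
        return "project-id:data_set.events$" + element["event_date"].replace("-", "")

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "CreateSample" >> beam.Create([{"event_date": "2022-01-15", "value": 42}])
            | "LoadToPartition" >> beam.io.WriteToBigQuery(
                table=to_partition,                                # callable: per-element destination
                method=beam.io.WriteToBigQuery.Method.FILE_LOADS,  # batch load jobs, not the Write API
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
    ```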
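
    For the sharded-table variant, the answer mentions load jobs from the sharded tables into the partitioned table; a table copy job is one concrete way to do that table-to-table move, and it also accepts a partition decorator as the destination. A sketch with the google-cloud-bigquery client, again with placeholder table names:

    ```python
    from google.cloud import bigquery

    client = bigquery.Client()

    # Placeholder names: per-date shard tables (events_YYYYMMDD) feeding an
    # ingestion-time partitioned table (events) in the same dataset.
    for shard_date in ("20220115", "20220116"):
        job = client.copy_table(
            f"project-id.data_set.events_{shard_date}",   # source shard
            f"project-id.data_set.events${shard_date}",   # destination partition (decorator)
            job_config=bigquery.CopyJobConfig(
                write_disposition=bigquery.WriteDisposition.WRITE_APPEND
            ),
        )
        job.result()  # wait for the copy to finish before moving on
    ```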