google-bigquerypartitioningdenormalization

Is it possible to partition nested tables in BigQuery?


I am currently in the process of migrating my data warehouse into BigQuery. I have been attempting to denormalize the database, as I have read that it can result in more efficient and cheaper queries. However, this has resulted in a number of nested tables. If each nested table has a column "created_at" and "last_modified_at", is there any way I can use either of these values to partition my tables?


Solution

  • No, you cannot partition a table by nested tables. According to the docs:

    You can partition BigQuery tables by:

    Time-unit column: Tables are partitioned based on a TIMESTAMP, DATE, or DATETIME column in the table.

    Ingestion time: Tables are partitioned based on the timestamp when BigQuery ingests the data.

    Integer range: Tables are partitioned based on an integer column.

    Furthermore, the partitions must be top level fields and cannot be leaf fields from a RECORD (STRUCT):

    Limitations

    You cannot use legacy SQL to query partitioned tables or to write query results to partitioned tables.

    Time-unit column-partitioned tables are subject to the following limitations:

    The partitioning column must be either a scalar DATE, TIMESTAMP, or DATETIME column. While the mode of the column can be REQUIRED or NULLABLE, it cannot be REPEATED (array-based). The partitioning column must be a top-level field. You cannot use a leaf field from a RECORD (STRUCT) as the partitioning column.

    Integer-range partitioned tables are subject to the following limitations:

    The partitioning column must be an INTEGER column. While the mode of the column may be REQUIRED or NULLABLE, it cannot beREPEATED (array-based). The partitioning column must be a top-level field. You cannot use a leaf field from a RECORD (STRUCT) as the partitioning column.

    And while you can use more data types with clustered tables in BigQuery, you cannot cluster tables using RECORD (STRUCT) columns:

    Clustering columns must be top-level, non-repeated columns of one of the following types:

    DATE BOOL GEOGRAPHY INT64 NUMERIC BIGNUMERIC STRING TIMESTAMP DATETIME

    If the reason you are partitioning is to increase the efficiency of date/time queries and if each nested table covers similar ranges of time I would suggest unnesting the tables into the parent table. If you don't want to unnest the tables it might be helpful to add another column to your main table with the earliest or latest date in the nested table and partition by the new column.