sqlgoogle-bigqueryazure-synapseunnestcopy-data

Number of rows increase and decrease when UNNEST in BigQuery dataset


I understand number of rows can increase when UNNEST an array of structs, but why would it decrease?

Scenario: Export Google Cloud Platform Billing Data from BigQuery to Azure Synapse Analytics via the Synapse's Serverless SQL Pool using its Copy Data Activity (Synapse pipeline). The source dataset is set to BigQuery, and use query to unnest the array of struct columns.

Here is the ddl from table's schema:

billing_account_id STRING,
project STRUCT<id STRING, number STRING, name STRING, labels ARRAY<STRUCT<key STRING, value STRING>>, ancestry_numbers STRING, ancestors ARRAY<STRUCT<resource_name STRING, display_name STRING>>>,
labels ARRAY<STRUCT<key STRING, value STRING>>,
system_labels ARRAY<STRUCT<key STRING, value STRING>>,
resource STRUCT<name STRING, global_name STRING>,
usage STRUCT<amount FLOAT64, unit STRING, amount_in_pricing_units FLOAT64, pricing_unit STRING>,
credits ARRAY<STRUCT<name STRING, amount FLOAT64, full_name STRING, id STRING, type STRING>>,
invoice STRUCT<month STRING>,
cost_type STRING,
adjustment_info STRUCT<id STRING, description STRING, mode STRING, type STRING>

Here is my count row query, and result:

SELECT count(1)
FROM `export.gcp_billing_export` tbl,     =>488,861 rows
UNNEST (project.labels) AS ar_proj_labels,                               =>236,567 rows, why a decrease from above?
UNNEST (project.ancestors) AS ar_proj_ancestors,                         =>1,241,985, an increase, as expected
UNNEST (labels) AS ar_labels,                                            =>2,077,164, an increase, as expected
UNNEST (system_labels) AS ar_system_labels,                              =>3,639,408, an increase, as expected
UNNEST (credits) AS tbl_credits                                          =>4,752, a big drop in number of rows, why?

Looking forward to hearing your thoughts. Thank you in advance.


Solution

  • Unnest will ignore the rows with null while flattening. There could be more null values in the project.labels field. That is the reason, you get less count even when unnesting. If you want to have all rows including rows with null value, then as @jaytiger suggested replace comma , and use left join.

    enter image description here img:1 sample table

    WITH c AS  (
    SELECT  'Asia'  AS continent, ['India','Singapore','Japan']  AS countries
    UNION  ALL  SELECT  'Europe', ['Spain','England','France']
    UNION  ALL  SELECT  'Australia', ['Australia','NZ']
    UNION  ALL  SELECT  'Antarctica', NULL
    )
    SELECT continent,countries FROM c,
    UNNEST(countries)  AS countries;
    

    enter image description here img:2 unnested data without null row.

     WITH c AS  (
    SELECT  'Asia'  AS continent, ['India','Singapore','Japan']  AS countries
    UNION  ALL  SELECT  'Europe', ['Spain','England','France']
    UNION  ALL  SELECT  'Australia', ['Australia','NZ']
    UNION  ALL  SELECT  'Antarctica', NULL
    )
    SELECT continent,countries FROM c
    left join
    UNNEST(countries)  AS countries;
    

    enter image description here