sqlgoogle-bigquery

Create data grid for missing values


I have two source tables:

Table A Table B
Dim 1 X X
Dim 2 X X
Dim 3 X
Dim 4 X
Metric 1 X X

I want to:

  1. Sum metric 1 on table B, for every possible combination on the four dimensions. I also need a "All" row with the sum of the metric where dim 3 and dim 4 are not considered. I can easily achieve this with grouping sets:
SELECT
    IF(GROUPING(table_b.dim_3 = 0, "Single product", "All products") AS aggregation_level,
    dim_1,
    dim_2,
    COALESCE(table_b.dim_3, "All") AS dim_3,
    COALESCE(table_b.dim_4, "All") AS dim_4,
    SUM(metric_1) AS b_metric_1
FROM table_b
GROUP BY GROUPING SETS (
    (dim_1, dim_2),
    (dim_1, dim_2, table_b.dim_3, table_b.dim_4)
)
  1. Sum metric 1 on table A for every possible combination on dim 1 & dim 2. This will be the "totals"
SELECT
    dim_1,
    dim_2,
    SUM(metric_1) AS a_metric_1
FROM table_a
GROUP BY dim_1, dim_2
  1. Append the a_metric_1 "totals" calculated in the previous step to every row for the step 1 output. However, remember that step 1 will have missing combinations on dim 1 and dim 2 (but step 2 output won't). So we need to:

I've been trying to use FULL OUTER JOINs to create the data grid I need, but I cannot hit the right solution


Solution

  • I got the expected output by changing the order where I applied the grouping sets:

    1. Group table B by (dim1, dim2, dim3, dim4), creating a grouped_by_product CTE
    2. Group table A by (dim1, dim2), creating a totals CTE
    3. CROSS JOIN table A with the dim table containing all dim3 & dim4 combinations, to create a data grid
    4. LEFT JOIN, or, if you're not sure table B is really a subset of table A, FULL OUTER JOIN both tables
    5. Finally, calculate the GROUPING SETS to get "per product" and "all products" views

    The SQL logic would be something like:

    WITH grouped_by_product AS (
        SELECT
            dim_1,
            dim_2,
            dim_3,
            dim_4,
            SUM(metric_1) AS b_metric_1
        FROM table_b
        GROUP BY ALL
    ), totals AS (
        SELECT
            dim_1,
            dim_2,
            SUM(metric_1) AS a_metric_1
        FROM table_a
        GROUP BY ALL
    ), totals_grid AS (
        -- Creates a grid with all possible combinations dim_3 & dim_4 can take
        SELECT
            *
        FROM totals
        CROSS JOIN dim_table -- dim_3 & dim_4 combinations
    ), joined AS (
       SELECT
            dim_1,
            dim_2,
            dim_3,
            dim_4,
            COALESCE(b_metric_1, 0) AS b_metric_1,
            COALESCE(a_metric_1, 0) AS a_metric_1
        FROM totals_grid
        FULL OUTER JOIN grouped_by_product
            USING (dim_1, dim_2, dim_3, dim_4)
    )
    
    SELECT
        IF(GROUPING(table_b.dim_3 = 0, "Single product", "All products") AS aggregation_level,
        dim_1,
        dim_2,
        COALESCE(table_b.dim_3, "All") AS dim_3,
        COALESCE(table_b.dim_4, "All") AS dim_4,
        SUM(b_metric_1) AS b_metric_1,
        MAX(a_metric_1) AS a_metric_1 -- we take the max because it's the same for all rows
    FROM joined
    GROUP BY GROUPING SETS (
        (dim_1, dim_2),
        (dim_1, dim_2, joined.dim_3, joined.dim_4)
    )