google-bigquerybigquery-udf

How do I insert multiple rows from one table into a struct column of a single row of another table?


I have 2 source tables at the moment.

Table #1: sourceTableMain

|EmployeeNumber| DepartmentNumber | CostCenterNumber |
| -------------| ---------------- |------------------|
| 1            | 100              |  1001            |
| 2            | 200              |  1001            |
| 3            | 100              |  1002            |

Table #2: sourceTableEmployee

|EmployeeNumber| EmployeeFirstName | EmployeeLastName | EmployeeAddress |
| -------------| ----------------  |------------------|---------------- |
| 1            | Michael           | Scott            | 110 ABC Ln      |
| 1            | Michael           | Scott            | 450 XYZ Ln      |
| 2            | Dwight            | Schrute          | 321 PQR St      |
| 3            | Jim               | Halpert          | 678 LMN Blvd    |

I am trying to insert the combine the rows into a 3rd table named targetTableCombined which has the following schema:

FieldName Type Mode
employeeNumber INTEGER NULLABLE
employeeDetails (struct) RECORD REPEATED
employeeFirstName STRING NULLABLE
employeeLastName STRING NULLABLE
employeeAddress STRING NULLABLE

Within the target table (targetTableCombined), I am trying to make sure that for each employeeNumber, all of the First Names, Last Names and Addresses are repeated under a single struct array. For example, EmployeeNumber 1 should have only 1 row in the target table, with the first name, last name and different addresses as part of the second column (struct), each in a separate row.

I wrote an insert script to do this, but I am going wrong:

        insert into `dev.try_sbx.targetTableCombined`
    
        select 
            main.employeeNumber,
                array(
                    select as struct
                        emp.employeeFirstName,
                        emp.employeeLastName,
                        emp.employeeAddress
                )
        from
          `dev.try_sbx.sourceTableMain` as main
              inner join `dev.try_sbx.sourceTableEmployee` as emp
                  on main.EmployeeNumber = emp.EmployeeNumber;

This is the result I am getting when running the query above:

| EmployeeNumber | EmployeeDetails                |
| -------------  | ------------------------------ |
| 1              | [Michael, Scott, 110 ABC Ln]   |
| 1              | [Michael, Scott, 450 XYZ Ln]   |
| 2              | [Dwight,  Schrute, 321 PQR St] |
| 3              | [Jim, Halpert, 678 LMN Blvd]   |

(Sorry about not being able to share screenshots - I don't have enough rep. But to elaborate, I am expecting only 3 rows on the insert (employee 1 should have had a single array containing both addresses). I am instead, getting 4 rows after the insert.)

Where am I going wrong with my script?


Solution

  • It's because ARRAY() is not an aggregation function. You should ARRAY_AGG() along with GROUP BY to group details for each employee into an array.

    SELECT EmployeeNumber,
           ARRAY_AGG((SELECT AS STRUCT EmployeeFirstName, EmployeeLastName, EmployeeAddress)) AS employeeDetails
       FROM `dev.try_sbx.sourceTableEmployee`
      GROUP BY 1;
    

    More preferred way is :

    SELECT EmployeeNumber,
           ARRAY_AGG(STRUCT(EmployeeFirstName, EmployeeLastName, EmployeeAddress)) AS employeeDetails
       FROM `dev.try_sbx.sourceTableEmployee`
      GROUP BY 1;
    

    output:

    enter image description here