google-cloud-datapreptrifacta

Combine multiple rows into single row in Google Data Prep


I have a table which has multiple payload values in separate rows. I want to combine those rows into a single row to have all the data together. Table looks something like this.

+------------+--------------+------+----+----+----+----+
|    Date    |     Time     | User | D1 | D2 | D3 | D4 |
+------------+--------------+------+----+----+----+----+
| 2020-04-15 | 05:39:45 UTC | A    |  2 |    |    |    |
| 2020-04-15 | 05:39:45 UTC | A    |    |  5 |    |    |
| 2020-04-15 | 05:39:45 UTC | A    |    |    |  8 |    |
| 2020-04-15 | 05:39:45 UTC | A    |    |    |    |  7 |
+------------+--------------+------+----+----+----+----+

And I want to convert it to something like this.

+------------+--------------+------+----+----+----+----+
|    Date    |     Time     | User | D1 | D2 | D3 | D4 |
+------------+--------------+------+----+----+----+----+
| 2020-04-15 | 05:39:45 UTC | A    |  2 |  5 |  8 |  7 |
+------------+--------------+------+----+----+----+----+

I tried "set" and "aggregate" but they didn't work as I wanted them to and I am not sure how to go forward.

Any help would be appreciated. Thanks.


Solution

  • tl;dr:

    use fill() function to fill all empty values within each d1-d4 columns in the wanted group (AKA - the columns date+time+user) then dedup\aggregate to your heart's content.

    long version

    So the quickest way to do this is by using a window-function called "fill()". What this function does for each given field in a column, it tells it: "Look down. look up. find the closest non-empty value, and copy it!" you can ofcourse limit it's sight (look only 3 rows above, for example) but for this example, don't need the limitation. so your fill function will look like this: FILL($col, -1, -1) So the "$col" will reference all the chosen columns. the "-1" says "unlimited sight". finally, the "~" says "from column D1 to column D4".

    So, function will look like this:

    fill_creation .

    Which in turn will make your columns look like this: output columns .

    Now you can use the "dedup" transformation to remove any duplications, and only 1 copy of each "group" will remain. Alternatively, if you still want to use "group by", you can do that aswell.

    Hope this helps =]

    p.s There are more ways to do this - which entails using the "pivot" transformation, and array unnesting. But in the process you'll lose your columns' names, and will need to rename them.