snowflake-cloud-data-platform

SELECT doing different operations on multiple columns with structured data


I’m working with a Snowflake table where each row contains structured data in two columns. Here’s an example of the table with 3 rows:

| id |  data_1             |   data_2          |
|----------------------------------------------|
| 1  | 2024-01-01, 1       | 2023-02-01, 2     |
|    | 2024-01-02, 2       | 2023-02-02, 2     |
|----------------------------------------------|
| 2  | 2024-02-03, 3       | 2023-02-01, 20    |
|    | 2024-02-02, 4       | 2023-02-02, 20    |
|----------------------------------------------|
| 3  | 2024-03-01, 5       | 2023-02-01, 30    |
|    | 2024-03-02, 6       | 2023-02-02, 30    |
|----------------------------------------------|

Conceptually, in each row, columns data_1 and data_2 contain an array of structured data where each element has a date and an integer value associated with it.

I want to query this table and obtain the following results:

  1. For data_1: the value associated with the latest date.
  2. For data_2: the sum of all values.

Note: I need the date in data_2 for other operations.

Desired result of the query:

| id | data_1 | data_2 |
|----------------------|
| 1  | 2      | 4      |
|----------------------|
| 2  | 3      | 40     |
|----------------------|
| 3  | 6      | 60     |
|----------------------|

Questions:

  1. Which Snowflake data types are best suited for data_1 and data_2? Ideally, I prefer a data type where I can address each sub-element using a name (something akin to data_1.date and data_1.value, so perhaps a structured object or a structured map).
  2. How do you write the query to achieve the desired result?

Solution

  • We can do a lot with filter and transform, but we still need a reduce - so we will implement that with a JS UDF in the meantime.

    with data as (
        select 1 id, [{'date':'2024-01-01', 'v':1}, {'date':'2024-01-02', 'v':2}] data_1, [{'date':'2023-02-01', 'v':2}, {'date':'2023-02-02', 'v':2}] data_2
        union all select 2 id, [{'date':'2024-02-03', 'v':3}, {'date':'2024-02-02', 'v':4}], [{'date':'2023-02-01', 'v':20}, {'date':'2023-02-02', 'v':20}] 
        union all select 3 id, [{'date':'2024-03-01', 'v':5}, {'date':'2024-03-02', 'v':6}], [{'date':'2023-02-01', 'v':30}, {'date':'2023-02-02', 'v':30}] 
    )
    
    select arrays_to_object(transform(data_1, x -> x['date']), transform(data_1, x -> x['v']))[array_max(transform(data_1, y -> y['date']))]
        , array_sum_js(transform(data_2, x -> x['v']))
    from data
    ;
    

    enter image description here

    The JS UDF:

    
    create function array_sum_js(X array)
    returns float
    language javascript as
    $$
        return X.reduce((a, b) => a + b, 0);
    $$;