snowflake-cloud-data-platformarray-agg

Snowflake - ARRAY_AGG with LIMIT


We can use ARRAY_AGG aggregate function to generate an array from data.

The goal is to find a way to limit the input to specific number of entries like ARRAY_AGG(...) WITHIN GROUP(... LIMIT 3) without restructuring the main query.

Sidenote: UDAF (User-Defined Aggregate Functions) are not available at the moment of writing.

For sample data:

CREATE OR REPLACE TABLE tab(grp TEXT, col TEXT) AS
SELECT * FROM VALUES 
  ('Grp1', 'A'),('Grp1', 'B'),('Grp1', 'C'),('Grp1', 'D'), ('Grp1', 'E'),
  ('Grp2', 'X'),('Grp2', 'Y'),('Grp2', 'Z'),('Grp2', 'V'),
  ('Grp3', 'M'),('Grp3', 'N'),('Grp3', 'M');

Output:

GRP ARR_LIMIT_3
Grp3    [   "M",   "M",   "N" ]
Grp2    [   "V",   "X",   "Y" ]
Grp1    [   "A",   "B",   "C" ]

Usage of ARRAY_SLICE is not an option if underlying ARRAY_AGG exceeds 16MB

SELECT grp, 
       ARRAY_SLICE(ARRAY_AGG(col), 1,3)) 
FROM big_table
JOIN ...
GROUP BY grp;
-- Result array of ARRAYAGG is too large

Solution

  • It is possible to achieve similar effect by using MIN_BY/MIN_MAX function:

    MIN_BY( <col_to_return>, <col_containing_mininum> [ , <maximum_number_of_values_to_return> ] )

    SELECT grp,
           ARRAY_AGG(col) AS arr,
           MIN_BY(col, col, 3) AS arr_limit_3
    FROM tab
    GROUP BY grp;
    

    Output:

    GRP ARR ARR_LIMIT_3
    Grp3 [ "M", "N", "M" ] [ "M", "M", "N" ]
    Grp2 [ "X", "Y", "Z", "V" ] [ "V", "X", "Y" ]
    Grp1 [ "A", "B", "C", "D", "E" ] [ "A", "B", "C" ]

    If the sorting is irrelevant then MIN_BY(col, 'some_constant', 3).


    ARRAY_UNIQUE_AGG or ARRAY_AGG(DISTINCT ...) is:

    SELECT grp,
           ARRAY_UNIQUE_AGG(col) AS arr,
           ARRAY_AGG(DISTINCT col) AS arr2,
           ARRAY_DISTINCT(MIN_BY(col, col, 3)) AS arr_distinct_limit_3
    FROM tab
    GROUP BY grp;
    

    Output:

    GRP ARR ARR2 ARR_DISTINCT_LIMIT_3
    Grp3 [ "M", "N" ] [ "M", "N" ] [ "M", "N" ]
    Grp2 [ "X", "Y", "Z", "V" ] [ "X", "Y", "Z", "V" ] [ "V", "X", "Y" ]
    Grp1 [ "A", "B", "C", "D", "E" ] [ "A", "B", "C", "D", "E" ] [ "A", "B", "C" ]

    It is possible to handle WITHIN GROUP(ORDER BY <some_col> ASC/DESC) too:

    SELECT grp,
           -- ASC
           ARRAY_AGG(col) WITHIN GROUP(ORDER BY some_col),
           MIN_BY(col, some_col, 3),
           -- DESC
           ARRAY_AGG(col) WITHIN GROUP(ORDER BY some_col DESC),
           MAX_BY(col, some_col, 3)
    FROM ...;