We can use ARRAY_AGG aggregate function to generate an array from data.
The goal is to find a way to limit the input to specific number of entries like ARRAY_AGG(...) WITHIN GROUP(... LIMIT 3)
without restructuring the main query.
Sidenote: UDAF (User-Defined Aggregate Functions) are not available at the moment of writing.
For sample data:
CREATE OR REPLACE TABLE tab(grp TEXT, col TEXT) AS
SELECT * FROM VALUES
('Grp1', 'A'),('Grp1', 'B'),('Grp1', 'C'),('Grp1', 'D'), ('Grp1', 'E'),
('Grp2', 'X'),('Grp2', 'Y'),('Grp2', 'Z'),('Grp2', 'V'),
('Grp3', 'M'),('Grp3', 'N'),('Grp3', 'M');
Output:
GRP ARR_LIMIT_3
Grp3 [ "M", "M", "N" ]
Grp2 [ "V", "X", "Y" ]
Grp1 [ "A", "B", "C" ]
Usage of ARRAY_SLICE
is not an option if underlying ARRAY_AGG
exceeds 16MB
SELECT grp,
ARRAY_SLICE(ARRAY_AGG(col), 1,3))
FROM big_table
JOIN ...
GROUP BY grp;
-- Result array of ARRAYAGG is too large
It is possible to achieve similar effect by using MIN_BY/MIN_MAX function:
MIN_BY( <col_to_return>, <col_containing_mininum> [ , <maximum_number_of_values_to_return> ] )
SELECT grp,
ARRAY_AGG(col) AS arr,
MIN_BY(col, col, 3) AS arr_limit_3
FROM tab
GROUP BY grp;
Output:
GRP | ARR | ARR_LIMIT_3 |
---|---|---|
Grp3 | [ "M", "N", "M" ] | [ "M", "M", "N" ] |
Grp2 | [ "X", "Y", "Z", "V" ] | [ "V", "X", "Y" ] |
Grp1 | [ "A", "B", "C", "D", "E" ] | [ "A", "B", "C" ] |
If the sorting is irrelevant then MIN_BY(col, 'some_constant', 3)
.
ARRAY_UNIQUE_AGG
or ARRAY_AGG(DISTINCT ...)
is:
SELECT grp,
ARRAY_UNIQUE_AGG(col) AS arr,
ARRAY_AGG(DISTINCT col) AS arr2,
ARRAY_DISTINCT(MIN_BY(col, col, 3)) AS arr_distinct_limit_3
FROM tab
GROUP BY grp;
Output:
GRP | ARR | ARR2 | ARR_DISTINCT_LIMIT_3 |
---|---|---|---|
Grp3 | [ "M", "N" ] | [ "M", "N" ] | [ "M", "N" ] |
Grp2 | [ "X", "Y", "Z", "V" ] | [ "X", "Y", "Z", "V" ] | [ "V", "X", "Y" ] |
Grp1 | [ "A", "B", "C", "D", "E" ] | [ "A", "B", "C", "D", "E" ] | [ "A", "B", "C" ] |
It is possible to handle WITHIN GROUP(ORDER BY <some_col> ASC/DESC)
too:
SELECT grp,
-- ASC
ARRAY_AGG(col) WITHIN GROUP(ORDER BY some_col),
MIN_BY(col, some_col, 3),
-- DESC
ARRAY_AGG(col) WITHIN GROUP(ORDER BY some_col DESC),
MAX_BY(col, some_col, 3)
FROM ...;