snowflake-cloud-data-platformarray-agg

Snowflake: ARRAY_AGG preserving "NULL"s


ARRAY_AGG function omits NULL by definition.

The requirement is to generate array with the following format:

SELECT [NULL, 'a1', 'a2', 'a3', NULL]
-- [   undefined,   "a1",   "a2",   "a3",   undefined ]

from source data using ARRAY_AGG function and nullifying all entries not starting with letter a:

CREATE OR REPLACE TABLE tab(id INT, val TEXT) AS
SELECT * FROM VALUES  (1,'b0'), (2,'a1'), (3,'a2'), (4,'a3'), (5,'b1');

Attempt 1:

SELECT ARRAY_AGG(CASE WHEN val LIKE 'a%' THEN val END) 
       WITHIN GROUP (ORDER BY id) AS result,
       result = [NULL, 'a1', 'a2', 'a3', NULL] AS is_output_as_requested
FROM  tab;
-- [   "a1",   "a2",   "a3" ]   FALSE

Result: ARRAY_AGG works as intended and skips NULL; The output does not match the requirement.

Attempt 2:

SELECT ARRAY_AGG(CASE WHEN val LIKE 'a%' THEN val ELSE PARSE_JSON('null') END) 
   WITHIN GROUP (ORDER BY id) AS result,
   result = [NULL, 'a1', 'a2', 'a3', NULL] AS is_output_as_requested
FROM tab;
--  [   "a1",   "a2",   "a3" ]   FALSE

Rationale: CASE expression output must have the same data type, therefore implicit conversion: TO_CHAR(PARSE_JSON('null')) <=> NULL

Attempt 3:

SELECT ARRAY_AGG(CASE WHEN val LIKE 'a%' THEN val::VARIANT ELSE PARSE_JSON('null') END) 
      WITHIN GROUP (ORDER BY id)AS result,
      result = [NULL, 'a1', 'a2', 'a3', NULL] AS is_output_as_requested
FROM tab;
--  [   null,   "a1",   "a2",   "a3",   null ]  FALSE

Almost what is requested, though not quite. Type of the first and last element is NULL_VALUE.


I am aware of VARIANT NULL. I am seeking for a way to "preserve NULLs" in exactly the format as requested for ARRAY_AGG.

The solution must be a standalone SQL expression (no Snowflake Scripting/UDFs).


Solution

  • It is possible to achieve it by aggregating arrays instead of scalar values:

    SELECT ARRAY_FLATTEN(ARRAY_AGG(CASE WHEN val LIKE 'a%' THEN [val] ELSE [NULL] END)
           WITHIN GROUP (ORDER BY id)) AS result,
           result = [NULL, 'a1', 'a2', 'a3', NULL] AS is_output_as_requested 
    FROM tab;
    -- [   undefined,   "a1",   "a2",   "a3",   undefined ] TRUE
    

    Rationale: ARRAY_AGG does not skip [NULL].

    Step 1:

    SELECT ARRAY_AGG(CASE WHEN val LIKE 'a%' THEN [val] ELSE [NULL] END)
           WITHIN GROUP (ORDER BY id) AS result
    FROM tab;
    -- [ [undefined], [ "a1"], ["a2"], ["a3"], [undefined] ]
    

    Step 2:

    SELECT ARRAY_FLATTEN(...)
    -- [ undefined, "a1", "a2", "a3", undefined ]
    

    In general:

    SELECT ARRAY_FLATTEN(ARRAY_AGG([some_col]))
    FROM tab;