google-bigquery

How to write a user-defined function to get the latest non-null value of a column?


Currently, I have code like this:

WITH
  foo AS (
    SELECT 1 AS a, "hello" AS b, NULL AS c, 10 AS some_timestamp
    UNION ALL
    SELECT 1, NULL, "world", 11
    UNION ALL
    SELECT 2, "hi", "!", 99
  )
SELECT
  a,
  ANY_VALUE(b HAVING MAX if(b is null, null, some_timestamp)) AS latest_b,
  ANY_VALUE(c HAVING MAX if(c is null, null, some_timestamp)) AS latest_c
FROM
  foo
GROUP BY
  a

The lines with ANY_VALUE are quite repetitive. Is there a way to abstract this logic away, maybe with a UDF?

In the end, it would be awesome if I could replace

  ANY_VALUE(b HAVING MAX if(b is null, null, some_timestamp)) AS latest_b,
  ANY_VALUE(c HAVING MAX if(c is null, null, some_timestamp)) AS latest_c

with something like that:

  latest_non_null(b, some_timestamp) AS latest_b,
  latest_non_null(c, some_timestamp) AS latest_c

Solution

  • You can create a custom user-defined aggregation function (UDAFs), currently in preview:

    CREATE OR REPLACE AGGREGATE FUNCTION `project.dataset.latest_non_null`(
      value STRING,
      some_timestamp INTEGER)
    RETURNS STRING
    AS (
      ANY_VALUE(value HAVING MAX if(value is null, null, some_timestamp))
    );
    
    SELECT
      a,
      project.dataset.latest_non_null(b, some_timestamp) AS latest_b,
      project.dataset.latest_non_null(c, some_timestamp) AS latest_c,
    FROM
      foo
    GROUP BY
      a
    

    Output:

    output