google-cloud-platformgoogle-bigquery

How to filter rows by values of a sorted column being close to the previous row?


Given a table like the following

WITH xs AS (
    SELECT 'A' name, 21 value
    UNION ALL
    SELECT 'B', 24
    UNION ALL
    SELECT 'C', 28
    UNION ALL
    SELECT 'D', 47
    UNION ALL
    SELECT 'E', 48
    UNION ALL
    SELECT 'F', 49
    UNION ALL
    SELECT 'G', 50
    UNION ALL
    SELECT 'H', 58
)

I'd like to keep only the rows that are required to not open up new gaps larger than 10 in the sorted value column.

For my understanding, I've implemented this algorithm in Python

from dataclasses import dataclass

max_gap = 10


@dataclass
class Row:
    name: str
    value: int


xs = [Row('A', 21), Row('B', 24), Row('C', 28), Row('D', 47), Row('E', 48), Row('F', 49), Row('G', 50), Row('H', 58)]

ys = [xs[0]]
for idx in range(1, len(xs) - 1):
    if xs[idx + 1].value > ys[-1].value + max_gap:
        ys.append(xs[idx])
ys.append(xs[-1])

assert ys == [Row('A', 21), Row('C', 28), Row('D', 47), Row('G', 50), Row('H', 58)]

but I'm struggling to convert it to BigQuery SQL.

My problem is, that the decision for each row not only depends on the neighboring rows (typical application of WINDOW with LEAD and LAG), but also also on the last value in the output table (ys).

I'm not sure if recursive CTEs would be a good fit, or if I need to use the procedural language (LOOP/FOR), or if there is an elegant solution utilizing "usual" SQL logic to solve this.

It would be awesome if somebody could show me how to do it.


Solution

  • Use below script (procedural language)

    DECLARE current_value INT64;
    DECLARE previous_value INT64;
    DECLARE result ARRAY<STRUCT<name STRING, value INT64>>;
    
    CREATE TEMP TABLE xs AS
    WITH data AS (
        SELECT 'A' AS name, 21 AS value UNION ALL
        SELECT 'B', 24 UNION ALL
        SELECT 'C', 28 UNION ALL
        SELECT 'D', 47 UNION ALL
        SELECT 'E', 48 UNION ALL
        SELECT 'F', 49 UNION ALL
        SELECT 'G', 50 UNION ALL
        SELECT 'H', 58
    )
    SELECT * FROM data;
    
    BEGIN
      SET result = ARRAY<STRUCT<name STRING, value INT64>>[(SELECT AS STRUCT name, value FROM xs ORDER BY value LIMIT 1)];
      
      SET current_value = (SELECT value FROM xs ORDER BY value LIMIT 1);
      SET previous_value = current_value;
    
      FOR next_row IN (SELECT name, value FROM xs ORDER BY value) DO
        IF next_row.value - current_value > 10 THEN
          SET result = ARRAY_CONCAT(result, [(STRUCT((SELECT name FROM xs WHERE value = previous_value), previous_value))]);
          SET current_value = previous_value; 
        END IF;    
        SET previous_value = next_row.value;
      END FOR;
    
      SET result = ARRAY_CONCAT(result, [(STRUCT((SELECT name FROM xs WHERE value = previous_value), previous_value))]);
    END;
    
    SELECT * FROM UNNEST(result) ORDER BY value;
    

    Output is

    enter image description here