Given a table like the following
WITH xs AS (
SELECT 'A' name, 21 value
UNION ALL
SELECT 'B', 24
UNION ALL
SELECT 'C', 28
UNION ALL
SELECT 'D', 47
UNION ALL
SELECT 'E', 48
UNION ALL
SELECT 'F', 49
UNION ALL
SELECT 'G', 50
UNION ALL
SELECT 'H', 58
)
I'd like to keep only the rows that are required to not open up new gaps larger than 10 in the sorted value
column.
For my understanding, I've implemented this algorithm in Python
from dataclasses import dataclass
max_gap = 10
@dataclass
class Row:
name: str
value: int
xs = [Row('A', 21), Row('B', 24), Row('C', 28), Row('D', 47), Row('E', 48), Row('F', 49), Row('G', 50), Row('H', 58)]
ys = [xs[0]]
for idx in range(1, len(xs) - 1):
if xs[idx + 1].value > ys[-1].value + max_gap:
ys.append(xs[idx])
ys.append(xs[-1])
assert ys == [Row('A', 21), Row('C', 28), Row('D', 47), Row('G', 50), Row('H', 58)]
but I'm struggling to convert it to BigQuery SQL.
My problem is, that the decision for each row not only depends on the neighboring rows (typical application of WINDOW
with LEAD
and LAG
), but also also on the last value in the output table (ys
).
I'm not sure if recursive CTEs would be a good fit, or if I need to use the procedural language (LOOP
/FOR
), or if there is an elegant solution utilizing "usual" SQL logic to solve this.
It would be awesome if somebody could show me how to do it.
Use below script (procedural language)
DECLARE current_value INT64;
DECLARE previous_value INT64;
DECLARE result ARRAY<STRUCT<name STRING, value INT64>>;
CREATE TEMP TABLE xs AS
WITH data AS (
SELECT 'A' AS name, 21 AS value UNION ALL
SELECT 'B', 24 UNION ALL
SELECT 'C', 28 UNION ALL
SELECT 'D', 47 UNION ALL
SELECT 'E', 48 UNION ALL
SELECT 'F', 49 UNION ALL
SELECT 'G', 50 UNION ALL
SELECT 'H', 58
)
SELECT * FROM data;
BEGIN
SET result = ARRAY<STRUCT<name STRING, value INT64>>[(SELECT AS STRUCT name, value FROM xs ORDER BY value LIMIT 1)];
SET current_value = (SELECT value FROM xs ORDER BY value LIMIT 1);
SET previous_value = current_value;
FOR next_row IN (SELECT name, value FROM xs ORDER BY value) DO
IF next_row.value - current_value > 10 THEN
SET result = ARRAY_CONCAT(result, [(STRUCT((SELECT name FROM xs WHERE value = previous_value), previous_value))]);
SET current_value = previous_value;
END IF;
SET previous_value = next_row.value;
END FOR;
SET result = ARRAY_CONCAT(result, [(STRUCT((SELECT name FROM xs WHERE value = previous_value), previous_value))]);
END;
SELECT * FROM UNNEST(result) ORDER BY value;
Output is