I have a snowflake table that roughly resembles the below. I want to filter out all rows that have the same name, provider and an overlapping voltage range.
Before:
NAME | Provider | Min_Volt | Max_Volt |
---|---|---|---|
OSMO | Bell | 10 | 100 |
OSMO | Bell | 80 | 150 |
OSMO | Bell | 120 | 155 |
OSMO | Bell | 160 | 200 |
MASD | Tele | 80 | 150 |
After:
NAME | Provider | Min_Volt | Max_Volt |
---|---|---|---|
OSMO | Bell | 160 | 200 |
MASD | Tele | 80 | 150 |
Doing only the name and voltage is straightforward with a group by and count() > 1. But I don't know how to apply compare the rows to each other individually like that.
I was thinking of doing it with min/max window functions like
with min_max AS (
SELECT NAME
, PROVIDER
, MIN_VOLT
, MAX_VOLT
, MAX(max_volt) OVER (PARTITION BY PROVIDER, NAME) AS set_max
, MIN(min_volt) OVER (PARTITION BY PROVIDER, NAME) AS set_min
FROM Table
)
SELECT NAME, PROVIDER, MIN_VOLT, MAX_VOLT
FROM min_max
WHERE MIN_VOLT NOT BETWEEN set_min AND set_min AND MAX_VOLT NOT BETWEEN set_min AND set_max;
However this returns no rows since its including the row its comparing in the max/min range. I'm not sure how to have it compare each row independently to each other row.
EDIT: For clarity - the rows we want to remove are those that have an overlapping voltage range with another row that has the same NAME and PROVIDER value. For example rows 1, 2 and 3 in the before table above all have the same name and provider, and there is overlap in the voltage range (10-100, 80-150, 120-155) so they should all be removed. Whereas Row 4 has the same name and provider but its voltage range doesn't overlap with the other rows so it should be kept.
Looks like LEAD
and LAG
should do the trick
with min_max AS (
SELECT
NAME
, PROVIDER
, MIN_VOLT
, MAX_VOLT
, LAG(MAX_VOLT) OVER (PARTITION BY PROVIDER, NAME ORDER BY MIN_VOLT, MAX_VOLT) AS prev_max
, LEAD(MIN_VOLT) OVER (PARTITION BY PROVIDER, NAME ORDER BY MIN_VOLT, MAX_VOLT) AS next_min
FROM ORENNIA_ANALYTICS_QC.ORENNIA.PRE_PROCESSED_SUBSTATIONS
)
SELECT
NAME,
PROVIDER,
MIN_VOLT,
MAX_VOLT
FROM min_max
WHERE (prev_max IS NULL OR MIN_VOLT > prev_max)
AND (next_min IS NULL OR MAX_VOLT < next_min);