[SOLVED] SQL Grouping based on one field and a flag

SQL Grouping based on one field and a flag

I struggle to get an sql that groups my records based on the following input data

AB	BIS	MyHash	flag
2024-01-01	2024-01-15	111	0
2024-01-16	2024-03-01	111	0
2024-03-02	2024-03-31	222	0
2024-04-01	2024-07-30	222	0
2024-07-31	2024-07-31	222	1
2024-08-01	2024-08-02	222	0
2024-01-01	2024-07-31	543	0
2024-11-03	2024-11-15	653	0
2024-01-01	2024-06-30	22435	0
2024-07-01	2024-10-31	22435	1
2024-11-01	2024-11-02	22435	0
2024-11-03	2024-11-15	22435	0

i basically need a group in the same hash until the flag changes to true/1. Then the next group starts. The records a ordered in the correct way, no need to care about that.

i need the following output:

AB	BIS	MyHash	flag	GROUP
2024-01-01	2024-01-15	111	0	1
2024-01-16	2024-03-01	111	0	1
2024-03-02	2024-03-31	222	0	2
2024-04-01	2024-07-30	222	0	2
2024-07-31	2024-07-31	222	1	2
2024-08-01	2024-08-02	222	0	3
2024-01-01	2024-07-31	543	0	4
2024-11-03	2024-11-15	653	0	5
2024-01-01	2024-06-30	22435	0	6
2024-07-01	2024-10-31	22435	1	6
2024-11-01	2024-11-02	22435	0	7
2024-11-03	2024-11-15	22435	0	7

What i tried so far:

--data setup
drop table if exists demo_data 
CREATE TABLE demo_data (
    AB DATE,
    BIS DATE,
    MyHash INT,
    flag bit
);

INSERT INTO demo_data (AB, BIS, myHash, flag) VALUES
    ('2024-01-01', '2024-01-15', 111, 0),
    ('2024-01-16', '2024-03-01', 111, 0),
    ('2024-03-02', '2024-03-31', 222, 0),
    ('2024-04-01', '2024-07-30', 222, 0),
    ('2024-07-31', '2024-07-31', 222, 1),
    ('2024-08-01', '2024-08-02', 222, 0),
    ('2024-01-01', '2024-07-31', 543, 0),
    ('2024-11-03', '2024-11-15', 653, 0),
    ('2024-01-01', '2024-06-30', 22435, 0),
    ('2024-07-01', '2024-10-31', 22435, 1),
    ('2024-11-01', '2024-11-02', 22435, 0),
    ('2024-11-03', '2024-11-15', 22435, 0)

query

SELECT
    AB,
    BIS,
    myHash,
    flag,
    DENSE_RANK() OVER (ORDER BY myHash, group_id) AS [GROUP]
FROM (
    SELECT
        AB,
        BIS,
        myHash,
        flag,
        ROW_NUMBER() OVER (PARTITION BY myHash ORDER BY AB, BIS) -
        ROW_NUMBER() OVER (PARTITION BY myHash, group_increment ORDER BY AB, BIS) + 1 AS group_id
    FROM (
        SELECT
            AB,
            BIS,
            myHash,
            flag,
            SUM(CASE WHEN flag = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY myHash ORDER BY AB, BIS) AS group_increment
        FROM
            demo_data
    ) AS GroupedData
) AS FinalGroups
ORDER BY
myHash, AB, BIS;

result from the query (which is wrong!)

AB  BIS myHash  flag    GROUP
2024-01-01  2024-01-15  111     0   1
2024-01-16  2024-03-01  111     0   1
2024-03-02  2024-03-31  222     0   2
2024-04-01  2024-07-30  222     0   2
2024-07-31  2024-07-31  222     1   3
2024-08-01  2024-08-02  222     0   3
2024-01-01  2024-07-31  543     0   4
2024-11-03  2024-11-15  653     0   5
2024-01-01  2024-06-30  22435   0   6
2024-07-01  2024-10-31  22435   1   7
2024-11-01  2024-11-02  22435   0   7
2024-11-03  2024-11-15  22435   0   7

How can i achieve, that the group changes after the records with the flag, inclucing all records before (in the same hash). See excepted result.

I am using databricks and have well >1b records. The example above was done in mssql, i guess it will be the same(?)

Solution

This is a gap and islands problem, It can be solved by summing two running totals ( the running total of 1s whenever rn equals 1 and the running total of the previous flag) :

WITH CTE AS (
  SELECT *, 
         ROW_NUMBER() OVER (PARTITION BY myHash ORDER BY myhash) AS rn,
         LAG(flag, 1 , 0) OVER (ORDER BY myhash) AS lag_flag
  FROM demo_data
)
SELECT ab, bis, myhash, flag,
          SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (ORDER BY myhash)  
          + SUM(lag_flag) OVER (ORDER BY myhash, ab, bis) AS grp
FROM CTE
ORDER BY myhash

Demo on postgresql