sqldatabricksgaps-and-islands

SQL Grouping based on one field and a flag


I struggle to get an sql that groups my records based on the following input data

AB BIS MyHash flag
2024-01-01 2024-01-15 111 0
2024-01-16 2024-03-01 111 0
2024-03-02 2024-03-31 222 0
2024-04-01 2024-07-30 222 0
2024-07-31 2024-07-31 222 1
2024-08-01 2024-08-02 222 0
2024-01-01 2024-07-31 543 0
2024-11-03 2024-11-15 653 0
2024-01-01 2024-06-30 22435 0
2024-07-01 2024-10-31 22435 1
2024-11-01 2024-11-02 22435 0
2024-11-03 2024-11-15 22435 0

i basically need a group in the same hash until the flag changes to true/1. Then the next group starts. The records a ordered in the correct way, no need to care about that.

i need the following output:

AB BIS MyHash flag GROUP
2024-01-01 2024-01-15 111 0 1
2024-01-16 2024-03-01 111 0 1
2024-03-02 2024-03-31 222 0 2
2024-04-01 2024-07-30 222 0 2
2024-07-31 2024-07-31 222 1 2
2024-08-01 2024-08-02 222 0 3
2024-01-01 2024-07-31 543 0 4
2024-11-03 2024-11-15 653 0 5
2024-01-01 2024-06-30 22435 0 6
2024-07-01 2024-10-31 22435 1 6
2024-11-01 2024-11-02 22435 0 7
2024-11-03 2024-11-15 22435 0 7

What i tried so far:

--data setup
drop table if exists demo_data 
CREATE TABLE demo_data (
    AB DATE,
    BIS DATE,
    MyHash INT,
    flag bit
);

INSERT INTO demo_data (AB, BIS, myHash, flag) VALUES
    ('2024-01-01', '2024-01-15', 111, 0),
    ('2024-01-16', '2024-03-01', 111, 0),
    ('2024-03-02', '2024-03-31', 222, 0),
    ('2024-04-01', '2024-07-30', 222, 0),
    ('2024-07-31', '2024-07-31', 222, 1),
    ('2024-08-01', '2024-08-02', 222, 0),
    ('2024-01-01', '2024-07-31', 543, 0),
    ('2024-11-03', '2024-11-15', 653, 0),
    ('2024-01-01', '2024-06-30', 22435, 0),
    ('2024-07-01', '2024-10-31', 22435, 1),
    ('2024-11-01', '2024-11-02', 22435, 0),
    ('2024-11-03', '2024-11-15', 22435, 0)

query

SELECT
    AB,
    BIS,
    myHash,
    flag,
    DENSE_RANK() OVER (ORDER BY myHash, group_id) AS [GROUP]
FROM (
    SELECT
        AB,
        BIS,
        myHash,
        flag,
        ROW_NUMBER() OVER (PARTITION BY myHash ORDER BY AB, BIS) -
        ROW_NUMBER() OVER (PARTITION BY myHash, group_increment ORDER BY AB, BIS) + 1 AS group_id
    FROM (
        SELECT
            AB,
            BIS,
            myHash,
            flag,
            SUM(CASE WHEN flag = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY myHash ORDER BY AB, BIS) AS group_increment
        FROM
            demo_data
    ) AS GroupedData
) AS FinalGroups
ORDER BY
myHash, AB, BIS;

result from the query (which is wrong!)

AB  BIS myHash  flag    GROUP
2024-01-01  2024-01-15  111     0   1
2024-01-16  2024-03-01  111     0   1
2024-03-02  2024-03-31  222     0   2
2024-04-01  2024-07-30  222     0   2
2024-07-31  2024-07-31  222     1   3
2024-08-01  2024-08-02  222     0   3
2024-01-01  2024-07-31  543     0   4
2024-11-03  2024-11-15  653     0   5
2024-01-01  2024-06-30  22435   0   6
2024-07-01  2024-10-31  22435   1   7
2024-11-01  2024-11-02  22435   0   7
2024-11-03  2024-11-15  22435   0   7

How can i achieve, that the group changes after the records with the flag, inclucing all records before (in the same hash). See excepted result.

I am using databricks and have well >1b records. The example above was done in mssql, i guess it will be the same(?)


Solution

  • This is a gap and islands problem, It can be solved by summing two running totals ( the running total of 1s whenever rn equals 1 and the running total of the previous flag) :

    WITH CTE AS (
      SELECT *, 
             ROW_NUMBER() OVER (PARTITION BY myHash ORDER BY myhash) AS rn,
             LAG(flag, 1 , 0) OVER (ORDER BY myhash) AS lag_flag
      FROM demo_data
    )
    SELECT ab, bis, myhash, flag,
              SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (ORDER BY myhash)  
              + SUM(lag_flag) OVER (ORDER BY myhash, ab, bis) AS grp
    FROM CTE
    ORDER BY myhash
    

    Demo on postgresql