I struggle to get an sql that groups my records based on the following input data
AB | BIS | MyHash | flag |
---|---|---|---|
2024-01-01 | 2024-01-15 | 111 | 0 |
2024-01-16 | 2024-03-01 | 111 | 0 |
2024-03-02 | 2024-03-31 | 222 | 0 |
2024-04-01 | 2024-07-30 | 222 | 0 |
2024-07-31 | 2024-07-31 | 222 | 1 |
2024-08-01 | 2024-08-02 | 222 | 0 |
2024-01-01 | 2024-07-31 | 543 | 0 |
2024-11-03 | 2024-11-15 | 653 | 0 |
2024-01-01 | 2024-06-30 | 22435 | 0 |
2024-07-01 | 2024-10-31 | 22435 | 1 |
2024-11-01 | 2024-11-02 | 22435 | 0 |
2024-11-03 | 2024-11-15 | 22435 | 0 |
i basically need a group in the same hash until the flag changes to true/1. Then the next group starts. The records a ordered in the correct way, no need to care about that.
i need the following output:
AB | BIS | MyHash | flag | GROUP |
---|---|---|---|---|
2024-01-01 | 2024-01-15 | 111 | 0 | 1 |
2024-01-16 | 2024-03-01 | 111 | 0 | 1 |
2024-03-02 | 2024-03-31 | 222 | 0 | 2 |
2024-04-01 | 2024-07-30 | 222 | 0 | 2 |
2024-07-31 | 2024-07-31 | 222 | 1 | 2 |
2024-08-01 | 2024-08-02 | 222 | 0 | 3 |
2024-01-01 | 2024-07-31 | 543 | 0 | 4 |
2024-11-03 | 2024-11-15 | 653 | 0 | 5 |
2024-01-01 | 2024-06-30 | 22435 | 0 | 6 |
2024-07-01 | 2024-10-31 | 22435 | 1 | 6 |
2024-11-01 | 2024-11-02 | 22435 | 0 | 7 |
2024-11-03 | 2024-11-15 | 22435 | 0 | 7 |
What i tried so far:
--data setup
drop table if exists demo_data
CREATE TABLE demo_data (
AB DATE,
BIS DATE,
MyHash INT,
flag bit
);
INSERT INTO demo_data (AB, BIS, myHash, flag) VALUES
('2024-01-01', '2024-01-15', 111, 0),
('2024-01-16', '2024-03-01', 111, 0),
('2024-03-02', '2024-03-31', 222, 0),
('2024-04-01', '2024-07-30', 222, 0),
('2024-07-31', '2024-07-31', 222, 1),
('2024-08-01', '2024-08-02', 222, 0),
('2024-01-01', '2024-07-31', 543, 0),
('2024-11-03', '2024-11-15', 653, 0),
('2024-01-01', '2024-06-30', 22435, 0),
('2024-07-01', '2024-10-31', 22435, 1),
('2024-11-01', '2024-11-02', 22435, 0),
('2024-11-03', '2024-11-15', 22435, 0)
query
SELECT
AB,
BIS,
myHash,
flag,
DENSE_RANK() OVER (ORDER BY myHash, group_id) AS [GROUP]
FROM (
SELECT
AB,
BIS,
myHash,
flag,
ROW_NUMBER() OVER (PARTITION BY myHash ORDER BY AB, BIS) -
ROW_NUMBER() OVER (PARTITION BY myHash, group_increment ORDER BY AB, BIS) + 1 AS group_id
FROM (
SELECT
AB,
BIS,
myHash,
flag,
SUM(CASE WHEN flag = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY myHash ORDER BY AB, BIS) AS group_increment
FROM
demo_data
) AS GroupedData
) AS FinalGroups
ORDER BY
myHash, AB, BIS;
result from the query (which is wrong!)
AB BIS myHash flag GROUP
2024-01-01 2024-01-15 111 0 1
2024-01-16 2024-03-01 111 0 1
2024-03-02 2024-03-31 222 0 2
2024-04-01 2024-07-30 222 0 2
2024-07-31 2024-07-31 222 1 3
2024-08-01 2024-08-02 222 0 3
2024-01-01 2024-07-31 543 0 4
2024-11-03 2024-11-15 653 0 5
2024-01-01 2024-06-30 22435 0 6
2024-07-01 2024-10-31 22435 1 7
2024-11-01 2024-11-02 22435 0 7
2024-11-03 2024-11-15 22435 0 7
How can i achieve, that the group changes after the records with the flag, inclucing all records before (in the same hash). See excepted result.
I am using databricks and have well >1b records. The example above was done in mssql, i guess it will be the same(?)
This is a gap and islands problem, It can be solved by summing two running totals ( the running total of 1s whenever rn
equals 1 and the running total of the previous flag) :
WITH CTE AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY myHash ORDER BY myhash) AS rn,
LAG(flag, 1 , 0) OVER (ORDER BY myhash) AS lag_flag
FROM demo_data
)
SELECT ab, bis, myhash, flag,
SUM(CASE WHEN rn = 1 THEN 1 ELSE 0 END) OVER (ORDER BY myhash)
+ SUM(lag_flag) OVER (ORDER BY myhash, ab, bis) AS grp
FROM CTE
ORDER BY myhash