sqlimpalasubsequence

How to find longest subsequence based on conditions in Impala SQL


I have a SQL table on Impala that contains ID, dt (monthly basis with no skipped month), and status of each person ID. I want to check how long that each ID is in each status (my expected answer is shown on expected column)

I tried to solve this problem on the value column by using

count(status) over (partition by ID, status order by dt)

but it doesn't reset the value when the status is changed.

+------+------------+--------+-------+----------+
|  ID  |     dt     | status | value | expected |
+------+------------+--------+-------+----------+
| 0001 | 01/01/2020 |      0 |     1 |        1 |
| 0001 | 01/02/2020 |      0 |     2 |        2 |
| 0001 | 01/03/2020 |      1 |     1 |        1 |
| 0001 | 01/04/2020 |      1 |     2 |        2 |
| 0001 | 01/05/2020 |      1 |     3 |        3 |
| 0001 | 01/06/2020 |      0 |     3 |        1 |
| 0001 | 01/07/2020 |      1 |     4 |        1 |
| 0001 | 01/08/2020 |      1 |     5 |        2 |
+------+------------+--------+-------+----------+

Is there anyway to reset the counter when the status is changed?


Solution

  • When you partition by ID and status, two groups are formed for the values 0 and 1 in status field. So, the months 1, 2, 6 go into first group with 0 status and the months 3, 4, 5, 7, 8 go into the second group with 1 status. Then, the count function counts the number of statuses individually in those groups. Thus the first group has counts from 1 to 3 and the second group has counts from 1 to 5. This query so far doesn't account for the change in statuses rather just simply divide the record set as per different status values.

    One approach would be to divide the records into different blocks where each status change starts a new block. The below query follows this approach and gives the expected result:

    SELECT ID,dt,status,
        COUNT(status) OVER(PARTITION BY ID,block_number ORDER BY dt) as value 
    FROM (
        SELECT ID,dt,status,
            SUM(change_in_status) OVER(PARTITION BY ID ORDER BY dt) as block_number
        FROM(
            SELECT ID,dt,status,
                CASE WHEN 
                        status<>LAG(status) OVER(PARTITION BY ID ORDER BY dt)
                        OR LAG(status) OVER(PARTITION BY ID ORDER BY dt) IS NULL 
                    THEN 1 
                    ELSE 0 
                END as change_in_status 
            FROM statuses
        ) derive_status_changes
    ) derive_blocks;
    

    Here is a working example in DB Fiddle.