sqlloopsjoinsubquerysql-timestamp

SQL to Extract Continuous Timestamp Ranges


Table Device_Status

Id Status Timestamp
1 Active 2023-01-13T18.00.01.0187528
2 Active 2023-01-13T18.00.01.0187529
1 Failed 2023-01-13T18.00.01.0187530
3 Active 2023-01-13T18.00.01.0187531
1 Failed 2023-01-13T18.00.01.0187532
1 Active 2023-01-13T18.00.01.0187533
3 Active 2023-01-13T18.00.01.0187534
1 Failed 2023-01-13T18.00.01.0187535
4 Failed 2023-01-13T18.00.01.0187536
1 Active 2023-01-13T18.00.01.0187537

Expected Output (Need SQL query to generate):

ID Fail_Begin Fail_End
1 2023-01-13T18.00.01.0187530 2023-01-13T18.00.01.0187532
1 2023-01-13T18.00.01.0187535 2023-01-13T18.00.01.0187535
4 2023-01-13T18.00.01.0187536 2023-01-13T18.00.01.0187536

Basically, for each ID get min (timestamp) and max (timestamp) but over continuous records for that ID. If there is only one record, then min=max as is the case with the second and third record in the sample result set.

I have tried this (and various subquery variants thereof)

SELECT Id, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status
GROUP BY Id

but need to group by only continuously occurring records,

So perhaps by adding a notion of status change first would help? Starting with zero as initial status and then incrementing the change code by 1 every time the next record is different in order to thereby generate an intermediate result like so...

Table Device_Status_With_Change_Column

Id Status Change Timestamp
1 Active 0 2023-01-13T18.00.01.0187528
2 Active 0 2023-01-13T18.00.01.0187529
1 Failed 1 2023-01-13T18.00.01.0187530
3 Active 0 2023-01-13T18.00.01.0187531
1 Failed 1 2023-01-13T18.00.01.0187532
1 Active 2 2023-01-13T18.00.01.0187533
3 Active 0 2023-01-13T18.00.01.0187534
1 Failed 3 2023-01-13T18.00.01.0187535
4 Failed 0 2023-01-13T18.00.01.0187536
1 Active 4 2023-01-13T18.00.01.0187537

And then doing

SELECT Id, Change, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status_With_Change_Column
GROUP BY Id, Change

Other than looping over the result set in a programming language, I do not yet see a direct SQL statement that would do this in one fell swoop without the intermediate table and I do not see how to compute the column Change (in SQL).


Solution

  • The solution by user @Ajax1234 generates the result in the original post. However, when stress-tested with several tandem 'Failed' records, it generates pairs instead of producing exactly one additional record as required for those additional table entries.

    The DDL/Schema and initial sample data, tested for PostGreSQL version 16, with changed/trimmed timestamp for the parse to work and several tandem 'Failed' records added:

    CREATE TYPE status AS ENUM ('Active', 'Failed');
    DROP TABLE IF EXISTS Device_Status;
    CREATE TABLE Device_Status
    (
      ID integer,
      Status status,
      temptime text
    );
    INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.528');
    INSERT INTO Device_Status VALUES (2,'Active','2023-01-13 10.00.01.529');
    INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.530');
    INSERT INTO Device_Status VALUES (3,'Active','2023-01-13 10.00.01.531');
    INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.532');
    INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.533');
    INSERT INTO Device_Status VALUES (3,'Active','2023-01-13 10.00.01.534');
    INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.535');
    INSERT INTO Device_Status VALUES (4,'Failed','2023-01-13 10.00.01.536');
    INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.537');
    insert into device_status values (1,'Failed','2023-01-13 10:00:01.538'), 
                                     (1,'Failed','2023-01-13 10:00:01.539'), 
                                     (1,'Failed','2023-01-13 10:00:01.540'), 
                                     (1,'Failed','2023-01-13 10:00:01.541'), 
                                     (1,'Failed','2023-01-13 10:00:01.542'), 
                                     (1,'Failed','2023-01-13 10:00:01.543'), 
                                     (1,'Failed','2023-01-13 10:00:01.544'),
                                     (1,'Failed','2023-01-13 10:00:01.545'), 
                                     (1,'Failed','2023-01-13 10:00:01.546'),
                                     (1,'Failed','2023-01-13 10:00:01.547');
    ALTER TABLE Device_Status ADD timestamp TIMESTAMP;
    UPDATE Device_Status set timestamp = 
          to_timestamp(temptime,'YYYY-MM-DD HH:MI:SS.MS');
    ALTER TABLE Device_Status DROP COLUMN temptime;
    
    WITH Device_Status_With_Change_Column AS
    (
      WITH flag AS
      (
        WITH lag AS
        (
          SELECT *, 
                 LAG(status) OVER (PARTITION BY Id ORDER BY TimeStamp) Last_Status,
                 LAG(Id,1,0) OVER (PARTITION BY Id ORDER BY TimeStamp) Last_Id
          FROM Device_Status
        )
        SELECT *, 
               CASE
               WHEN lag.last_status IS NULL THEN 0
               WHEN lag.last_status IS DISTINCT FROM lag.status 
                AND lag.last_id = lag.id  
                THEN 1
               ELSE 0
             END Change_Flag
         FROM lag
      )
      SELECT *, 
             SUM(change_flag) OVER (PARTITION BY Id ORDER BY TimeStamp) Change
      FROM flag
    )
    SELECT Id, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
    FROM Device_Status_With_Change_Column
    WHERE status = 'Failed'
    GROUP BY Id, Change
    ORDER BY Id
    

    Result (fiddle):

    id fail_start fail_end
    1 2023-01-13 10:00:01.53 2023-01-13 10:00:01.532
    1 2023-01-13 10:00:01.535 2023-01-13 10:00:01.535
    1 2023-01-13 10:00:01.538 2023-01-13 10:00:01.547
    4 2023-01-13 10:00:01.536 2023-01-13 10:00:01.536

    The innermost CTElag uses the SQL LAG function to generate a column for previous entries (i.e. exactly one previous row of corresponding column, for id and status, and in the case of id, default zero). This is projected into flag to enter a 1 when previous status changes from 'Active' to 'Failed' or vice-versa for the same id. The first instance of any id will have unconditional zero and so will cases where there is no change to status relative to the previous recorded log entry for any device. To create the CTE Device_Status_With_Change_Column, the Change_Flags are summed using the cumulative SUM function to record the sequence number of status change for each device Id when sorted by timestamp. The zero in the Change column ensures that SUM will put the same integer in all consecutive failed or active records having the same change sequence. In other words, to get the 'Active' image of the result set we just need to change the WHERE clause. Therefore, any number of statuses, say, 'PAUSED', 'OFFLINE' can all be addressed just by adding the corresponding word in the DDL and query.

    The full CTE Device_Status_With_Change_Column (fiddle):

    SELECT * 
    FROM Device_Status_With_Change_Column
    

    I tested the join-based query by @Ajax1234 on a database with 13 million records, to generate a result set of approximately 20,000, and it is faster that my CTE-based solution (24 seconds versus 42 seconds).

    SQL is far more powerful than doing loops! Look into CTEs using LAG, SUM, and the CASE Statement. Join-like operations on products of relations are more performant, if a bit harder to read and understand, and therefore need testing.