Table Device_Status
Id | Status | Timestamp |
---|---|---|
1 | Active | 2023-01-13T18.00.01.0187528 |
2 | Active | 2023-01-13T18.00.01.0187529 |
1 | Failed | 2023-01-13T18.00.01.0187530 |
3 | Active | 2023-01-13T18.00.01.0187531 |
1 | Failed | 2023-01-13T18.00.01.0187532 |
1 | Active | 2023-01-13T18.00.01.0187533 |
3 | Active | 2023-01-13T18.00.01.0187534 |
1 | Failed | 2023-01-13T18.00.01.0187535 |
4 | Failed | 2023-01-13T18.00.01.0187536 |
1 | Active | 2023-01-13T18.00.01.0187537 |
Expected Output (Need SQL query to generate):
ID | Fail_Begin | Fail_End |
---|---|---|
1 | 2023-01-13T18.00.01.0187530 | 2023-01-13T18.00.01.0187532 |
1 | 2023-01-13T18.00.01.0187535 | 2023-01-13T18.00.01.0187535 |
4 | 2023-01-13T18.00.01.0187536 | 2023-01-13T18.00.01.0187536 |
Basically, for each ID get min (timestamp) and max (timestamp) but over continuous records for that ID. If there is only one record, then min=max as is the case with the second and third record in the sample result set.
I have tried this (and various subquery variants thereof)
SELECT Id, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status
GROUP BY Id
but need to group by only continuously occurring records,
So perhaps by adding a notion of status change first would help? Starting with zero as initial status and then incrementing the change code by 1 every time the next record is different in order to thereby generate an intermediate result like so...
Table Device_Status_With_Change_Column
Id | Status | Change | Timestamp |
---|---|---|---|
1 | Active | 0 | 2023-01-13T18.00.01.0187528 |
2 | Active | 0 | 2023-01-13T18.00.01.0187529 |
1 | Failed | 1 | 2023-01-13T18.00.01.0187530 |
3 | Active | 0 | 2023-01-13T18.00.01.0187531 |
1 | Failed | 1 | 2023-01-13T18.00.01.0187532 |
1 | Active | 2 | 2023-01-13T18.00.01.0187533 |
3 | Active | 0 | 2023-01-13T18.00.01.0187534 |
1 | Failed | 3 | 2023-01-13T18.00.01.0187535 |
4 | Failed | 0 | 2023-01-13T18.00.01.0187536 |
1 | Active | 4 | 2023-01-13T18.00.01.0187537 |
And then doing
SELECT Id, Change, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status_With_Change_Column
GROUP BY Id, Change
Other than looping over the result set in a programming language, I do not yet see a direct SQL statement that would do this in one fell swoop without the intermediate table and I do not see how to compute the column Change (in SQL).
The solution by user @Ajax1234 generates the result in the original post. However, when stress-tested with several tandem 'Failed' records, it generates pairs instead of producing exactly one additional record as required for those additional table entries.
The DDL/Schema and initial sample data, tested for PostGreSQL version 16, with changed/trimmed timestamp for the parse to work and several tandem 'Failed' records added:
CREATE TYPE status AS ENUM ('Active', 'Failed');
DROP TABLE IF EXISTS Device_Status;
CREATE TABLE Device_Status
(
ID integer,
Status status,
temptime text
);
INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.528');
INSERT INTO Device_Status VALUES (2,'Active','2023-01-13 10.00.01.529');
INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.530');
INSERT INTO Device_Status VALUES (3,'Active','2023-01-13 10.00.01.531');
INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.532');
INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.533');
INSERT INTO Device_Status VALUES (3,'Active','2023-01-13 10.00.01.534');
INSERT INTO Device_Status VALUES (1,'Failed','2023-01-13 10.00.01.535');
INSERT INTO Device_Status VALUES (4,'Failed','2023-01-13 10.00.01.536');
INSERT INTO Device_Status VALUES (1,'Active','2023-01-13 10.00.01.537');
insert into device_status values (1,'Failed','2023-01-13 10:00:01.538'),
(1,'Failed','2023-01-13 10:00:01.539'),
(1,'Failed','2023-01-13 10:00:01.540'),
(1,'Failed','2023-01-13 10:00:01.541'),
(1,'Failed','2023-01-13 10:00:01.542'),
(1,'Failed','2023-01-13 10:00:01.543'),
(1,'Failed','2023-01-13 10:00:01.544'),
(1,'Failed','2023-01-13 10:00:01.545'),
(1,'Failed','2023-01-13 10:00:01.546'),
(1,'Failed','2023-01-13 10:00:01.547');
ALTER TABLE Device_Status ADD timestamp TIMESTAMP;
UPDATE Device_Status set timestamp =
to_timestamp(temptime,'YYYY-MM-DD HH:MI:SS.MS');
ALTER TABLE Device_Status DROP COLUMN temptime;
WITH Device_Status_With_Change_Column AS
(
WITH flag AS
(
WITH lag AS
(
SELECT *,
LAG(status) OVER (PARTITION BY Id ORDER BY TimeStamp) Last_Status,
LAG(Id,1,0) OVER (PARTITION BY Id ORDER BY TimeStamp) Last_Id
FROM Device_Status
)
SELECT *,
CASE
WHEN lag.last_status IS NULL THEN 0
WHEN lag.last_status IS DISTINCT FROM lag.status
AND lag.last_id = lag.id
THEN 1
ELSE 0
END Change_Flag
FROM lag
)
SELECT *,
SUM(change_flag) OVER (PARTITION BY Id ORDER BY TimeStamp) Change
FROM flag
)
SELECT Id, min(Timestamp) AS Fail_Begin, max(Timestamp) AS Fail_End
FROM Device_Status_With_Change_Column
WHERE status = 'Failed'
GROUP BY Id, Change
ORDER BY Id
Result (fiddle):
id | fail_start | fail_end |
---|---|---|
1 | 2023-01-13 10:00:01.53 | 2023-01-13 10:00:01.532 |
1 | 2023-01-13 10:00:01.535 | 2023-01-13 10:00:01.535 |
1 | 2023-01-13 10:00:01.538 | 2023-01-13 10:00:01.547 |
4 | 2023-01-13 10:00:01.536 | 2023-01-13 10:00:01.536 |
The innermost CTElag
uses the SQL LAG function to generate a column for previous entries (i.e. exactly one previous row of corresponding column, for id and status, and in the case of id, default zero). This is projected into flag
to enter a 1 when previous status changes from 'Active' to 'Failed' or vice-versa for the same id. The first instance of any id will have unconditional zero and so will cases where there is no change to status relative to the previous recorded log entry for any device. To create the CTE Device_Status_With_Change_Column
, the Change_Flags are summed using the cumulative SUM function to record the sequence number of status change for each device Id when sorted by timestamp. The zero in the Change column ensures that SUM will put the same integer in all consecutive failed or active records having the same change sequence. In other words, to get the 'Active' image of the result set we just need to change the WHERE clause. Therefore, any number of statuses, say, 'PAUSED', 'OFFLINE' can all be addressed just by adding the corresponding word in the DDL and query.
The full CTE Device_Status_With_Change_Column
(fiddle):
SELECT *
FROM Device_Status_With_Change_Column
I tested the join-based query by @Ajax1234 on a database with 13 million records, to generate a result set of approximately 20,000, and it is faster that my CTE-based solution (24 seconds versus 42 seconds).
SQL is far more powerful than doing loops! Look into CTEs using LAG, SUM, and the CASE Statement. Join-like operations on products of relations are more performant, if a bit harder to read and understand, and therefore need testing.