Here is my sample data set:
ID | Timestamp | Duration | State |
---|---|---|---|
17 | 05.11.2024 00:00 | 0 | Auto |
16 | 04.11.2024 23:59 | 0 | Auto |
15 | 04.11.2024 23:58 | 1,47 | Auto |
14 | 04.11.2024 23:52 | 5,72 | Manual |
13 | 04.11.2024 23:52 | 0,13 | Productive |
12 | 04.11.2024 23:52 | 0,5 | Error |
11 | 04.11.2024 23:50 | 1,83 | Auto |
10 | 04.11.2024 23:50 | 0,08 | Productive |
9 | 04.11.2024 23:48 | 1,67 | Auto |
8 | 04.11.2024 23:48 | 0,12 | Productive |
7 | 04.11.2024 23:46 | 1,57 | Auto |
6 | 04.11.2024 23:46 | 0,17 | Productive |
5 | 04.11.2024 23:44 | 2,08 | Auto |
4 | 04.11.2024 23:44 | 0,1 | Productive |
3 | 04.11.2024 23:42 | 1,67 | Auto |
2 | 04.11.2024 23:42 | 0,13 | Productive |
1 | 04.11.2024 23:40 | 1,83 | Auto |
My target table should be this one:
ID | Timestamp | Duration | State |
---|---|---|---|
9 | 05.11.2024 00:00 | 0 | Auto |
8 | 04.11.2024 23:59 | 0 | Auto |
7 | 04.11.2024 23:58 | 1,47 | Auto |
6 | 04.11.2024 23:52 | 5,72 | Manual |
5 | 04.11.2024 23:52 | 0,13 | Productive |
4 | 04.11.2024 23:52 | 0,5 | Error |
3 | 04.11.2024 23:50 | 1,83 | Auto |
2 | 04.11.2024 23:42 | 7,59 | Productive |
1 | 04.11.2024 23:40 | 1,83 | Auto |
I would like all lines to be merged as soon as the status Productive and then alternately Auto - Productive appears. The duration is totalled in the process. How can I realise this with an SQL query? I've been thinking about this for days and unfortunately I can't find the right solution. I have already tried using CTE to create and group specific sequences. Unfortunately without success so far.
REVISED: The following is a completely revised answer using a gaps-and-islands approach. The original (but apparently incorrect) answer has been left in place further down.
After seeing p3consulting's answer, I now see that it appears that you wish to combine sequences of "Productive-Auto-Productive-...-Auto-Productive" consisting of 3 or more rows of alternating Productive/Auto state values and which start and end with "Productive".
This can be achieved using a gaps-and-islands approach, where the "islands" represent the rows to be combined and the "gaps" are carefully defined so as to be the transitions between islands.
As I understand the problem, a row may be combined with prior rows (and becomes a continuation of the same island) if:
Any other row, including rows having states other than "Productive" or "Auto", would be considered to be the start of a new "island" following a "gap".
Once the gaps are identified the islands are defined and numbered by calculating a running sum of the gap flags. It is then just a matter of combining rows in the same islands using GROUP BY
and calculating the final results using appropriate aggregation functions. Note that an island might only consist of one row, which is not be combined with any other rows.
This approach can recognize and combine multiple matching sequenced within the same source data set.
WITH LagLead AS (
SELECT
*,
LAG(State, 2) OVER(ORDER BY ID) AS Lag2,
LAG(State, 1) OVER(ORDER BY ID) AS Lag1,
LEAD(State, 1) OVER(ORDER BY ID) AS Lead1
FROM Data
),
Gaps AS (
SELECT
*,
CASE
WHEN State = 'Productive' AND Lag1 = 'Auto' AND Lag2 = 'Productive'
THEN 0 -- productive-auto-PRODUCTIVE
WHEN State = 'Auto' AND Lag1 = 'Productive' AND Lead1 = 'Productive'
THEN 0 -- productive-AUTO-productive
ELSE 1 -- Otherwise, this is a "gap" (the start of a new "island")
END AS IsGap
FROM LagLead
),
Islands AS (
SELECT
*,
SUM(IsGap) OVER(ORDER BY ID) AS Island
FROM Gaps
)
SELECT
Island AS NewID,
MIN(Timestamp) AS Timestamp,
SUM(Duration) AS Duration,
CASE WHEN COUNT(*) >= 2 THEN 'Productive' ELSE MIN(State) END AS State,
--MAX(CASE WHEN IsGap = 1 THEN State ELSE '' END) AS State2,
COUNT(*) AS Rows
FROM Islands
GROUP BY Island
ORDER BY Island DESC
Results:
NewID | Timestamp | Duration | State | Rows |
---|---|---|---|---|
9 | 2024-11-05 00:00 | 0.00 | Auto | 1 |
8 | 2024-11-04 23:59 | 0.00 | Auto | 1 |
7 | 2024-11-04 23:58 | 1.47 | Auto | 1 |
6 | 2024-11-04 23:52 | 5.72 | Manual | 1 |
5 | 2024-11-04 23:52 | 0.13 | Productive | 1 |
4 | 2024-11-04 23:52 | 0.50 | Error | 1 |
3 | 2024-11-04 23:50 | 1.83 | Auto | 1 |
2 | 2024-11-04 23:42 | 7.59 | Productive | 9 |
1 | 2024-11-04 23:40 | 1.83 | Auto | 1 |
Intermediate results showing gaps and islands logic:
ID | Timestamp | Duration | Lag2 | Lag1 | State | Lead1 | IsGap | Island |
---|---|---|---|---|---|---|---|---|
17 | 2024-11-05 00:00 | 0.00 | Auto | Auto | Auto | null | 1 | 9 |
16 | 2024-11-04 23:59 | 0.00 | Manual | Auto | Auto | Auto | 1 | 8 |
15 | 2024-11-04 23:58 | 1.47 | Productive | Manual | Auto | Auto | 1 | 7 |
14 | 2024-11-04 23:52 | 5.72 | Error | Productive | Manual | Auto | 1 | 6 |
13 | 2024-11-04 23:52 | 0.13 | Auto | Error | Productive | Manual | 1 | 5 |
12 | 2024-11-04 23:52 | 0.50 | Productive | Auto | Error | Productive | 1 | 4 |
11 | 2024-11-04 23:50 | 1.83 | Auto | Productive | Auto | Error | 1 | 3 |
10 | 2024-11-04 23:50 | 0.08 | Productive | Auto | Productive | Auto | 0 | 2 |
9 | 2024-11-04 23:48 | 1.67 | Auto | Productive | Auto | Productive | 0 | 2 |
8 | 2024-11-04 23:48 | 0.12 | Productive | Auto | Productive | Auto | 0 | 2 |
7 | 2024-11-04 23:46 | 1.57 | Auto | Productive | Auto | Productive | 0 | 2 |
6 | 2024-11-04 23:46 | 0.17 | Productive | Auto | Productive | Auto | 0 | 2 |
5 | 2024-11-04 23:44 | 2.08 | Auto | Productive | Auto | Productive | 0 | 2 |
4 | 2024-11-04 23:44 | 0.10 | Productive | Auto | Productive | Auto | 0 | 2 |
3 | 2024-11-04 23:42 | 1.67 | Auto | Productive | Auto | Productive | 0 | 2 |
2 | 2024-11-04 23:42 | 0.13 | null | Auto | Productive | Auto | 1 | 2 |
1 | 2024-11-04 23:40 | 1.83 | null | null | Auto | Productive | 1 | 1 |
See this db<>fiddle for a demo with some extra test data showing multiple combined row sequences.
Original Answer (Apparently incorrect, but left as a reference.)
If you look at the sequence from the reverse direction (ascending IDs), it appears that the main trick is to identify where the repeating alternating sequence Auto
- Productive
ends.
This can be done by counting the odd rows containing 'Auto'
and the even rows containing 'Productive'
. The sequence ends at the MAX(ID)
where State = 'Productive'
and both counts equal half the running total row count.
Once you have that, you can partition the data into two parts at that point, performing a grouping and aggregation on the first part and then UNION ALL
that with the unmodified second part.
The final step is define new ID
values using ROW_NUMBER()
.
WITH Counts AS (
SELECT
*,
COUNT(*) OVER(ORDER BY ID) AS TCount, -- Total running count
COUNT(CASE WHEN ID % 2 = 1 AND State = 'Auto' THEN 1 END)
OVER(ORDER BY ID) AS ACount,
COUNT(CASE WHEN ID % 2 = 0 AND State = 'Productive' THEN 1 END)
OVER(ORDER BY ID) AS PCount
FROM Data
),
Cutoff AS (
SELECT MAX(ID) AS ID
FROM Counts
WHERE State = 'Productive'
AND ACount = TCount / 2
AND PCount = TCount / 2
),
Combined AS (
SELECT
MIN(D.ID) AS OldID,
MIN(Timestamp) AS Timestamp,
SUM(Duration) AS Duration,
State
FROM Cutoff C
JOIN Data D
ON D.ID <= C.ID
GROUP BY State
UNION ALL
SELECT D.ID AS OldId, Timestamp, Duration, State
FROM Cutoff C
JOIN Data D
ON D.ID > C.ID
)
SELECT
ROW_NUMBER() OVER(ORDER BY OldID) AS ID,
Timestamp, Duration, State
FROM Combined
ORDER BY ID DESC
If the source ID values are not guaranteed to be sequential starting at 1, you can add an extra CTE at the start to number the source data.
If you also allow your source data to start with 'Productive'
, you may need to also track odd rows containing 'Productive'
and the even rows containing 'Auto'
and check those alternate counts when selecting the cutoff.
Results:
ID | Timestamp | Duration | State |
---|---|---|---|
9 | 2024-11-05 00:00 | 0.00 | Auto |
8 | 2024-11-04 23:59 | 0.00 | Auto |
7 | 2024-11-04 23:58 | 1.47 | Auto |
6 | 2024-11-04 23:52 | 5.72 | Manual |
5 | 2024-11-04 23:52 | 0.13 | Productive |
4 | 2024-11-04 23:52 | 0.50 | Error |
3 | 2024-11-04 23:50 | 1.83 | Auto |
2 | 2024-11-04 23:42 | 0.60 | Productive |
1 | 2024-11-04 23:40 | 8.82 | Auto |
Note: The calculated combined duration values above (0.60
and 8.82
) are different from your posted desired results (7,59
and 1,83
). It is not clear to me how your values were calculated.
See this db<>fiddle for a demo.