Merge rows in a specific sequence in SQL (Azure SQL)

Here is my sample data set:

ID	Timestamp	Duration	State
17	05.11.2024 00:00	0	Auto
16	04.11.2024 23:59	0	Auto
15	04.11.2024 23:58	1,47	Auto
14	04.11.2024 23:52	5,72	Manual
13	04.11.2024 23:52	0,13	Productive
12	04.11.2024 23:52	0,5	Error
11	04.11.2024 23:50	1,83	Auto
10	04.11.2024 23:50	0,08	Productive
9	04.11.2024 23:48	1,67	Auto
8	04.11.2024 23:48	0,12	Productive
7	04.11.2024 23:46	1,57	Auto
6	04.11.2024 23:46	0,17	Productive
5	04.11.2024 23:44	2,08	Auto
4	04.11.2024 23:44	0,1	Productive
3	04.11.2024 23:42	1,67	Auto
2	04.11.2024 23:42	0,13	Productive
1	04.11.2024 23:40	1,83	Auto

My target table should be this one:

ID	Timestamp	Duration	State
9	05.11.2024 00:00	0	Auto
8	04.11.2024 23:59	0	Auto
7	04.11.2024 23:58	1,47	Auto
6	04.11.2024 23:52	5,72	Manual
5	04.11.2024 23:52	0,13	Productive
4	04.11.2024 23:52	0,5	Error
3	04.11.2024 23:50	1,83	Auto
2	04.11.2024 23:42	7,59	Productive
1	04.11.2024 23:40	1,83	Auto

I would like all lines to be merged as soon as the status Productive and then alternately Auto - Productive appears. The duration is totalled in the process. How can I realise this with an SQL query? I've been thinking about this for days and unfortunately I can't find the right solution. I have already tried using CTE to create and group specific sequences. Unfortunately without success so far.

Solution

REVISED: The following is a completely revised answer using a gaps-and-islands approach. The original (but apparently incorrect) answer has been left in place further down.

After seeing p3consulting's answer, I now see that it appears that you wish to combine sequences of "Productive-Auto-Productive-...-Auto-Productive" consisting of 3 or more rows of alternating Productive/Auto state values and which start and end with "Productive".

This can be achieved using a gaps-and-islands approach, where the "islands" represent the rows to be combined and the "gaps" are carefully defined so as to be the transitions between islands.

As I understand the problem, a row may be combined with prior rows (and becomes a continuation of the same island) if:

It has a "Productive" state, and is preceded by rows having a "Productive-Auto" state sequence.
It has an "Auto" state, and is both preceded and followed by rows having "Productive" states.

Any other row, including rows having states other than "Productive" or "Auto", would be considered to be the start of a new "island" following a "gap".

Once the gaps are identified the islands are defined and numbered by calculating a running sum of the gap flags. It is then just a matter of combining rows in the same islands using GROUP BY and calculating the final results using appropriate aggregation functions. Note that an island might only consist of one row, which is not be combined with any other rows.

This approach can recognize and combine multiple matching sequenced within the same source data set.

WITH LagLead AS (
    SELECT
        *,
        LAG(State, 2) OVER(ORDER BY ID) AS Lag2,
        LAG(State, 1) OVER(ORDER BY ID) AS Lag1,
        LEAD(State, 1) OVER(ORDER BY ID) AS Lead1
    FROM Data
),
Gaps AS (
    SELECT
        *,
        CASE
            WHEN State = 'Productive' AND Lag1 = 'Auto' AND Lag2 = 'Productive'
            THEN 0  -- productive-auto-PRODUCTIVE
            WHEN State = 'Auto' AND Lag1 = 'Productive' AND Lead1 = 'Productive'
            THEN 0  -- productive-AUTO-productive
            ELSE 1  -- Otherwise, this is a "gap" (the start of a new "island")
            END AS IsGap
    FROM LagLead
),
Islands AS (
    SELECT
        *,
        SUM(IsGap) OVER(ORDER BY ID) AS Island
    FROM Gaps
)
SELECT
    Island AS NewID,
    MIN(Timestamp) AS Timestamp,
    SUM(Duration) AS Duration,
    CASE WHEN COUNT(*) >= 2 THEN 'Productive' ELSE MIN(State) END AS State,
    --MAX(CASE WHEN IsGap = 1 THEN State ELSE '' END) AS State2,
    COUNT(*) AS Rows
FROM Islands
GROUP BY Island
ORDER BY Island DESC

Results:

NewID	Timestamp	Duration	State	Rows
9	2024-11-05 00:00	0.00	Auto	1
8	2024-11-04 23:59	0.00	Auto	1
7	2024-11-04 23:58	1.47	Auto	1
6	2024-11-04 23:52	5.72	Manual	1
5	2024-11-04 23:52	0.13	Productive	1
4	2024-11-04 23:52	0.50	Error	1
3	2024-11-04 23:50	1.83	Auto	1
2	2024-11-04 23:42	7.59	Productive	9
1	2024-11-04 23:40	1.83	Auto	1

Intermediate results showing gaps and islands logic:

ID	Timestamp	Duration	Lag2	Lag1	State	Lead1	IsGap	Island
17	2024-11-05 00:00	0.00	Auto	Auto	Auto	null	1	9
16	2024-11-04 23:59	0.00	Manual	Auto	Auto	Auto	1	8
15	2024-11-04 23:58	1.47	Productive	Manual	Auto	Auto	1	7
14	2024-11-04 23:52	5.72	Error	Productive	Manual	Auto	1	6
13	2024-11-04 23:52	0.13	Auto	Error	Productive	Manual	1	5
12	2024-11-04 23:52	0.50	Productive	Auto	Error	Productive	1	4
11	2024-11-04 23:50	1.83	Auto	Productive	Auto	Error	1	3
10	2024-11-04 23:50	0.08	Productive	Auto	Productive	Auto	0	2
9	2024-11-04 23:48	1.67	Auto	Productive	Auto	Productive	0	2
8	2024-11-04 23:48	0.12	Productive	Auto	Productive	Auto	0	2
7	2024-11-04 23:46	1.57	Auto	Productive	Auto	Productive	0	2
6	2024-11-04 23:46	0.17	Productive	Auto	Productive	Auto	0	2
5	2024-11-04 23:44	2.08	Auto	Productive	Auto	Productive	0	2
4	2024-11-04 23:44	0.10	Productive	Auto	Productive	Auto	0	2
3	2024-11-04 23:42	1.67	Auto	Productive	Auto	Productive	0	2
2	2024-11-04 23:42	0.13	null	Auto	Productive	Auto	1	2
1	2024-11-04 23:40	1.83	null	null	Auto	Productive	1	1

See this db<>fiddle for a demo with some extra test data showing multiple combined row sequences.

Original Answer (Apparently incorrect, but left as a reference.)

If you look at the sequence from the reverse direction (ascending IDs), it appears that the main trick is to identify where the repeating alternating sequence Auto - Productive ends.

This can be done by counting the odd rows containing 'Auto' and the even rows containing 'Productive'. The sequence ends at the MAX(ID) where State = 'Productive' and both counts equal half the running total row count.

Once you have that, you can partition the data into two parts at that point, performing a grouping and aggregation on the first part and then UNION ALL that with the unmodified second part.

The final step is define new ID values using ROW_NUMBER().

WITH Counts AS (
    SELECT
        *,
        COUNT(*) OVER(ORDER BY ID) AS TCount, -- Total running count
        COUNT(CASE WHEN ID % 2 = 1 AND State = 'Auto' THEN 1 END)
            OVER(ORDER BY ID) AS ACount,
        COUNT(CASE WHEN ID % 2 = 0 AND State = 'Productive' THEN 1 END)
            OVER(ORDER BY ID) AS PCount
    FROM Data
),
Cutoff AS (
    SELECT MAX(ID) AS ID
    FROM Counts
    WHERE State = 'Productive'
    AND ACount = TCount / 2
    AND PCount = TCount / 2
),
Combined AS (
    SELECT
        MIN(D.ID) AS OldID,
        MIN(Timestamp) AS Timestamp,
        SUM(Duration) AS Duration,
        State
    FROM Cutoff C
    JOIN Data D
        ON D.ID <= C.ID
    GROUP BY State

    UNION ALL

    SELECT D.ID AS OldId, Timestamp, Duration, State
    FROM Cutoff C
    JOIN Data D
        ON D.ID > C.ID
)
SELECT
    ROW_NUMBER() OVER(ORDER BY OldID) AS ID,
    Timestamp, Duration, State
FROM Combined
ORDER BY ID DESC

If the source ID values are not guaranteed to be sequential starting at 1, you can add an extra CTE at the start to number the source data.

If you also allow your source data to start with 'Productive', you may need to also track odd rows containing 'Productive' and the even rows containing 'Auto' and check those alternate counts when selecting the cutoff.

Results:

ID	Timestamp	Duration	State
9	2024-11-05 00:00	0.00	Auto
8	2024-11-04 23:59	0.00	Auto
7	2024-11-04 23:58	1.47	Auto
6	2024-11-04 23:52	5.72	Manual
5	2024-11-04 23:52	0.13	Productive
4	2024-11-04 23:52	0.50	Error
3	2024-11-04 23:50	1.83	Auto
2	2024-11-04 23:42	0.60	Productive
1	2024-11-04 23:40	8.82	Auto

Note: The calculated combined duration values above (0.60 and 8.82) are different from your posted desired results (7,59 and 1,83). It is not clear to me how your values were calculated.

See this db<>fiddle for a demo.