sqlsql-servert-sqlgaps-and-islands

SQL Gaps and Islands with multiple overlapping criteria


My (anonymized here) data has several completely separate sidewalks with various lines of chalk in different colors along each sidewalk. I need to find the start and end of each section of drawing with unique combinations of chalk colors. So far, that seems like a fairly straightforward gaps-and-islands problem (excellent Q&A here and explanation here). My issue is that I'd like each "island" to consolidate the longest possible contiguous section of equivalently overlapped values, but instead, it's either giving me every little overlapping piece separately or smudging all the islands for each color together, depending on how I try to partition or sort the row_number()s.

Initial data:

create table #chalk (
    sidewalkId int,
    "start" int,
    "end" int,
    color varchar(50)
);
insert into #chalk values
(1, 0, 5, 'blue'),
(1, 5, 10, 'blue'),
(1, 10, 15, 'blue'),
--(1, 15, 20, null),--nulls may be explicit or implicit
(1, 20, 25, 'blue'),
--(1, 25, 30, null),--nulls may be explicit or implicit
(1, 30, 35, 'blue'),
(1, 35, 40, 'blue'),
(1, 0, 5, 'red'),
(1, 5, 10, 'red'),
(1, 10, 15, 'red'),
(1, 30, 35, 'red')

Desired results:

sidewalkId color start end
1 'blue' 0 15
1 'red' 0 15
1 null 15 20
1 'blue' 20 25
1 null 25 30
1 'blue' 30 35
1 'red' 30 35
1 'blue' 35 40

Bad query 1:

    with cte as (
        select sidewalkId, color
            , row_number() over (partition by sidewalkId order by sidewalkId, "start") rna
            , row_number() over (partition by sidewalkId, color order by sidewalkId, "start") rnc
            , start, "end"
        from #chalk
    )
    select sidewalkId, color
        , min("start") "start", max("end") "end"
        , min(rna) rna_start, max(rna) rna_end
    from cte
    group by sidewalkId, color, rna-rnc
    order by sidewalkId, min(rna)

Bad results 1:

sidewalkId color start end rna_start rna_end
1 blue 0 5 1 1
1 red 0 10 2 3
1 blue 5 15 4 5
1 red 10 15 6 6
1 NULL 15 20 7 7
1 blue 20 25 8 8
1 NULL 25 30 9 9
1 red 25 35 10 11
1 blue 30 40 12 13

Bad query 2:

    with cte as (
        select sidewalkId, color
            , row_number() over (partition by sidewalkId order by sidewalkId, color, "start") rna
            , row_number() over (partition by sidewalkId, color order by sidewalkId, "start") rnc
            , "start", "end"
        from #chalk
    )
    select sidewalkId, color
        , min("start"), max("end")
        , min(rna), max(rna)
    from cte
    group by sidewalkId, color, rna-rnc
    order by sidewalkId, min(rna)

Bad results 2:

sidewalkId color start end rna_start rna_end
1 NULL 15 30 1 2
1 blue 0 40 3 8
1 red 0 35 9 13

I've tried using rank() or dense_rank() instead of row_number() (good comparison of those functions here) to get the desired results, but those don't seem to work for my requirements, either.


Solution

  • The islands just have to be computed on the groups of sidewalking colors, not on the isolated colors themselves:
    [0;5] is walked through by the blue,red band, then [5;10] too, and so on, [20;25] by the 1-member band blue, etc.

    By first determining which color segments walk together,
    and computing cte on it instead of #chalk,
    you'll be able to take your first query nearly as is (see (1)):

    with
      band as
      (
        select
          sidewalkId, start, "end",
          string_agg(color, ',') within group (order by color) name
        from #chalk
        group by sidewalkId, start, "end"
      ),
      cte as (
            select sidewalkId, name
                , row_number() over (partition by sidewalkId order by sidewalkId, start) rna
                , row_number() over (partition by sidewalkId, name order by sidewalkId, start) rnc
                , start, "end"
            from band
        )
      select cte.sidewalkId, color
          , min(cte.start) start, max(cte."end") "end"
          , min(rna) rna_start, max(rna) rna_end, cte.name bandname
      from cte join #chalk o on o.start = cte.start and o."end" = cte."end"
      group by cte.sidewalkId, name, color, rna-rnc
      order by cte.sidewalkId, min(rna)
    

    (the fiddle includes additional columns with the "band names" such as 'blue,red'; another fiddle adds the splitting that @jed-schaaf you already did elsewhere to produce #chalk)

    (1) In fact, after fully qualifying fields in your query (startcte.start), the diff with my query is only:

     with
    +  band as (…)
       cte as
       (
          […]
    -     from #chalk
    +     from band
       )
    -  select cte.sidewalkId, cte.color
    +  select cte.sidewalkId, o.color
       […]
       from cte
    +  join #chalk o on o.start = cte.start and o."end" = cte."end"
    -  group by cte.sidewalkId, cte.color, rna-rnc
    +  group by cte.sidewalkId, o.color, rna-rnc