My (anonymized here) data has several completely separate sidewalks with various lines of chalk in different colors along each sidewalk. I need to find the start and end of each section of drawing with unique combinations of chalk colors. So far, that seems like a fairly straightforward gaps-and-islands problem (excellent Q&A here and explanation here). My issue is that I'd like each "island" to consolidate the longest possible contiguous section of equivalently overlapped values, but instead, it's either giving me every little overlapping piece separately or smudging all the islands for each color together, depending on how I try to partition or sort the row_number()
s.
Initial data:
create table #chalk (
sidewalkId int,
"start" int,
"end" int,
color varchar(50)
);
insert into #chalk values
(1, 0, 5, 'blue'),
(1, 5, 10, 'blue'),
(1, 10, 15, 'blue'),
--(1, 15, 20, null),--nulls may be explicit or implicit
(1, 20, 25, 'blue'),
--(1, 25, 30, null),--nulls may be explicit or implicit
(1, 30, 35, 'blue'),
(1, 35, 40, 'blue'),
(1, 0, 5, 'red'),
(1, 5, 10, 'red'),
(1, 10, 15, 'red'),
(1, 30, 35, 'red')
Desired results:
sidewalkId | color | start | end |
---|---|---|---|
1 | 'blue' | 0 | 15 |
1 | 'red' | 0 | 15 |
1 | null | 15 | 20 |
1 | 'blue' | 20 | 25 |
1 | null | 25 | 30 |
1 | 'blue' | 30 | 35 |
1 | 'red' | 30 | 35 |
1 | 'blue' | 35 | 40 |
Bad query 1:
with cte as (
select sidewalkId, color
, row_number() over (partition by sidewalkId order by sidewalkId, "start") rna
, row_number() over (partition by sidewalkId, color order by sidewalkId, "start") rnc
, start, "end"
from #chalk
)
select sidewalkId, color
, min("start") "start", max("end") "end"
, min(rna) rna_start, max(rna) rna_end
from cte
group by sidewalkId, color, rna-rnc
order by sidewalkId, min(rna)
Bad results 1:
sidewalkId | color | start | end | rna_start | rna_end |
---|---|---|---|---|---|
1 | blue | 0 | 5 | 1 | 1 |
1 | red | 0 | 10 | 2 | 3 |
1 | blue | 5 | 15 | 4 | 5 |
1 | red | 10 | 15 | 6 | 6 |
1 | NULL | 15 | 20 | 7 | 7 |
1 | blue | 20 | 25 | 8 | 8 |
1 | NULL | 25 | 30 | 9 | 9 |
1 | red | 25 | 35 | 10 | 11 |
1 | blue | 30 | 40 | 12 | 13 |
Bad query 2:
with cte as (
select sidewalkId, color
, row_number() over (partition by sidewalkId order by sidewalkId, color, "start") rna
, row_number() over (partition by sidewalkId, color order by sidewalkId, "start") rnc
, "start", "end"
from #chalk
)
select sidewalkId, color
, min("start"), max("end")
, min(rna), max(rna)
from cte
group by sidewalkId, color, rna-rnc
order by sidewalkId, min(rna)
Bad results 2:
sidewalkId | color | start | end | rna_start | rna_end |
---|---|---|---|---|---|
1 | NULL | 15 | 30 | 1 | 2 |
1 | blue | 0 | 40 | 3 | 8 |
1 | red | 0 | 35 | 9 | 13 |
I've tried using rank()
or dense_rank()
instead of row_number()
(good comparison of those functions here) to get the desired results, but those don't seem to work for my requirements, either.
The islands just have to be computed on the groups of sidewalking colors, not on the isolated colors themselves:
[0;5]
is walked through by the blue,red
band, then [5;10]
too, and so on, [20;25]
by the 1-member band blue
, etc.
By first determining which color segments walk together,
and computing cte
on it instead of #chalk
,
you'll be able to take your first query nearly as is (see (1)):
with
band as
(
select
sidewalkId, start, "end",
string_agg(color, ',') within group (order by color) name
from #chalk
group by sidewalkId, start, "end"
),
cte as (
select sidewalkId, name
, row_number() over (partition by sidewalkId order by sidewalkId, start) rna
, row_number() over (partition by sidewalkId, name order by sidewalkId, start) rnc
, start, "end"
from band
)
select cte.sidewalkId, color
, min(cte.start) start, max(cte."end") "end"
, min(rna) rna_start, max(rna) rna_end, cte.name bandname
from cte join #chalk o on o.start = cte.start and o."end" = cte."end"
group by cte.sidewalkId, name, color, rna-rnc
order by cte.sidewalkId, min(rna)
(the fiddle includes additional columns with the "band names" such as 'blue,red';
another fiddle adds the splitting that @jed-schaaf you already did elsewhere to produce #chalk
)
(1) In fact, after fully qualifying fields in your query (start
→ cte.start
), the diff with my query is only:
with
+ band as (…)
cte as
(
[…]
- from #chalk
+ from band
)
- select cte.sidewalkId, cte.color
+ select cte.sidewalkId, o.color
[…]
from cte
+ join #chalk o on o.start = cte.start and o."end" = cte."end"
- group by cte.sidewalkId, cte.color, rna-rnc
+ group by cte.sidewalkId, o.color, rna-rnc