Markers have a structure like this:
```
{
  id: 123,
  category: 2,
  object: 'A',
  start: 10,
  end: 25
}
```
...resulting in ranges along the different categories like so:
```
category-1: ----------[===========]---------------------------- ref-object: A
category-1: --------------------------[====================]--- ref-object: A
category-1: ------------------[============================]--- ref-object: A
category-1: -----------------[==========================]------ ref-object: C
category-1: ------------------------------[================]--- ref-object: C
category-2: ----[===================]-------------------------- ref-object: A
category-2: -----[==========================]------------------ ref-object: B
category-2: -----[====================]------------------------ ref-object: C
category-2: -------------------------[=================]------- ref-object: C
category-2: ------------[===================================]-- ref-object: C
category-3: ----------------[=============]-------------------- ref-object: A
category-3: ----[===================================]---------- ref-object: A
category-3: -------[=======================================]--- ref-object: B
category-3: ----------------------------[====]----------------- ref-object: C
category-3: -------------[=================]------------------- ref-object: C
```
Now I want to answer something like this:

Find all objects having overlapping(!) ranges along
- category 1 and
- category 2,

ideally ordered by the sum of overlapping time.
I can think of those steps
- Is that something you would want to do in SQL?
- Is there a name for this type of query? (These are not exactly range queries, are they?)
- Is there a database with built-in capabilities for this, given the amount of data?

Currently all data is stored in an Apache Solr index, but that could also become Elasticsearch, InfluxDB, etc.
> Is that something you would want to do in SQL?

Of course! And as it is quite simple, why not try?
```sql
with
-- Where do segments cut each other?
cuts as
(
  select category, object, start t from m
  union -- without "all", to deduplicate
  select category, object, "end" from m
),
-- Give those cuts an index, and compute the stop of the subsegment each one starts.
-- (The last cut, having no next cut, gets a null stop:
-- thus it is the only one that will not be considered a segment in the following joins.)
icuts as
(
  select
    row_number() over () id,
    category, object,
    t ts, lead(t) over w te, lead(t) over w - t + 1 len
  from cuts
  window w as (partition by category, object order by t)
),
-- Atomic marker segments: parts of markers that cannot be split any further.
mseg as
(
  select m.id mid, m.category, m.object, s.id sid
  from m
  join icuts s
    on (s.category, s.object) = (m.category, m.object)
   and m.start <= s.ts
   and m."end" >= s.te -- "end" is a reserved word and must be quoted
),
-- Subsegments covered by more than one marker.
overl as
(
  select sid, count(1) noverlaps
  from mseg
  group by sid
  having count(1) > 1
)
select category, object, sum(len) totaloverlap
from overl o
join icuts s on s.id = o.sid
group by 1, 2
order by 3 desc, 1, 2;
```
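The cut-and-count logic of the query can also be sketched outside the database. Here is a minimal pure-Python equivalent (function and variable names are my own, not part of the query), replicating the query's inclusive length convention (`te - ts + 1`):

```python
from collections import defaultdict

def overlap_totals(markers):
    """For each (category, object), sum the lengths of atomic subsegments
    covered by more than one marker (inclusive ends, as in the SQL)."""
    by_key = defaultdict(list)
    for m in markers:
        by_key[(m["category"], m["object"])].append((m["start"], m["end"]))
    totals = {}
    for key, ivals in by_key.items():
        # Cut points: every start and end, deduplicated and sorted.
        cuts = sorted({t for s, e in ivals for t in (s, e)})
        total = 0
        # Consecutive cuts delimit the atomic subsegments.
        for ts, te in zip(cuts, cuts[1:]):
            # Count markers fully covering this subsegment.
            n = sum(1 for s, e in ivals if s <= ts and e >= te)
            if n > 1:
                total += te - ts + 1
        if total:
            totals[key] = total
    # Order by total overlap descending, then by (category, object).
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, two markers `[10, 25]` and `[20, 30]` in the same category/object overlap on `[20, 25]`, giving a total of 6 with inclusive counting.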
From your example, this would return the following (counting the `[` and `]` as part of the markers, so the second-to-last marker of your example, `[====]`, is considered to have duration 6):
| category | object | totaloverlap |
|---|---|---|
| 2 | C | 34 |
| 1 | A | 27 |
| 1 | C | 15 |
| 3 | A | 15 |
| 3 | C | 4 |
(here is the fiddle if you want to play with it)
10 million rows would not be that big for a PostgreSQL server, but there's no way to know without testing (… and I'll spare DBFiddle such a load). However, I'm convinced it will run faster than with round trips between the database and the application.
There's still room for improvement:
- each (category, object) pair can be reduced to a single integer;
- the CTEs can be dumped to tables and then indexed.
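The first improvement can be sketched as a simple dictionary-encoding pass (a hypothetical preprocessing step; names are my own), so that joins and group-bys operate on a single integer column instead of a (category, object) pair:

```python
def encode_pairs(markers):
    """Map each distinct (category, object) pair to a dense integer id
    and attach it to every marker under a new 'key' field."""
    pairs = sorted({(m["category"], m["object"]) for m in markers})
    pair_id = {p: i for i, p in enumerate(pairs)}
    encoded = [dict(m, key=pair_id[(m["category"], m["object"])]) for m in markers]
    return encoded, pair_id
```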