Markers have a structure like this:
```
{
  id: 123,
  category: 2,
  object: 'A',
  start: 10,
  end: 25
}
```
...resulting in ranges along the different categories like so:
```
category-1: ----------[===========]---------------------------- ref-object: A
category-1: --------------------------[====================]--- ref-object: A
category-1: ------------------[============================]--- ref-object: A
category-1: -----------------[==========================]------ ref-object: C
category-1: ------------------------------[================]--- ref-object: C
category-2: ----[===================]-------------------------- ref-object: A
category-2: -----[==========================]------------------ ref-object: B
category-2: -----[====================]------------------------ ref-object: C
category-2: -------------------------[=================]------- ref-object: C
category-2: ------------[===================================]-- ref-object: C
category-3: ----------------[=============]-------------------- ref-object: A
category-3: ----[===================================]---------- ref-object: A
category-3: -------[=======================================]--- ref-object: B
category-3: ----------------------------[====]----------------- ref-object: C
category-3: -------------[=================]------------------- ref-object: C
```
Now I want to answer something like this:

Find all objects having overlapping(!) ranges along
- category 1 and
- category 2,

ideally ordered by the sum of overlapping time.
I can think of those steps
- Is that something you would want to do in SQL?
- Is there a name for this type of query? (These are not exactly range queries, are they?)
- Is there a database with built-in capabilities for this, given the amount of data?

Currently all data is stored in an Apache Solr index, but that could also become Elasticsearch, InfluxDB, etc.
> Is that something you would want to do in SQL?

Of course! And as it is quite simple, why not try?
```sql
with
-- Where do segments cut each other?
cuts as
(
  select category, object, start t from m
  union -- without "all", to deduplicate
  select category, object, "end" from m
),
-- Give those cuts an index, and compute the stop of the subsegment each one starts.
-- (The last cut, having no next cut, gets a null stop:
-- thus it is the only one that will not be considered a segment in the following joins.)
icuts as
(
  select
    row_number() over () id,
    category, object,
    t ts, lead(t) over w te, lead(t) over w - t + 1 len
  from cuts
  window w as (partition by category, object order by t)
),
-- Atomic marker segments: parts of markers that cannot be split any further.
mseg as
(
  select m.id mid, m.category, m.object, s.id sid
  from m
  join icuts s
    on (s.category, s.object) = (m.category, m.object)
   and m.start <= s.ts
   and m."end" >= s.te -- "end" is a reserved word and must be quoted
),
-- Subsegments covered by more than one marker.
overl as
(
  select sid, count(1) noverlaps
  from mseg
  group by sid
  having count(1) > 1
)
select category, object, sum(len) totaloverlap
from overl o
join icuts s on s.id = o.sid
group by 1, 2
order by 3 desc, 1, 2;
```
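The cut-and-count logic of the query can also be sketched outside the database. Here is a minimal pure-Python equivalent (function and variable names are my own, not part of the query), replicating the query's inclusive length convention (`te - ts + 1`):

```python
from collections import defaultdict

def overlap_totals(markers):
    """For each (category, object), sum the lengths of atomic subsegments
    covered by more than one marker (inclusive ends, as in the SQL)."""
    by_key = defaultdict(list)
    for m in markers:
        by_key[(m["category"], m["object"])].append((m["start"], m["end"]))
    totals = {}
    for key, ivals in by_key.items():
        # Cut points: every start and end, deduplicated and sorted.
        cuts = sorted({t for s, e in ivals for t in (s, e)})
        total = 0
        # Consecutive cuts delimit the atomic subsegments.
        for ts, te in zip(cuts, cuts[1:]):
            # Count markers fully covering this subsegment.
            n = sum(1 for s, e in ivals if s <= ts and e >= te)
            if n > 1:
                total += te - ts + 1
        if total:
            totals[key] = total
    # Order by total overlap descending, then by (category, object).
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, two markers `[10, 25]` and `[20, 30]` in the same category/object overlap on `[20, 25]`, giving a total of 6 with inclusive counting.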
From your example, this would return the following (counting the `[` and `]` as part of the markers, so the second-to-last marker of your example, `[====]`, is considered to have duration 6):
| category | object | totaloverlap |
|---|---|---|
| 2 | C | 34 |
| 1 | A | 27 |
| 1 | C | 15 |
| 3 | A | 15 |
| 3 | C | 4 |
(here is the fiddle if you want to play with it)
10 million rows would not be that big for a PostgreSQL server, but there's no way to know without testing (… and I'll spare DBFiddle such a load). However, I'm convinced it will run faster than with round trips between the database and the application.
There's still room for improvement:
- each (category, object) pair can be reduced to a single integer;
- the CTEs can be dumped to tables and then indexed.
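The first improvement can be sketched as a simple dictionary-encoding pass (a hypothetical preprocessing step; names are my own), so that joins and group-bys operate on a single integer column instead of a (category, object) pair:

```python
def encode_pairs(markers):
    """Map each distinct (category, object) pair to a dense integer id
    and attach it to every marker under a new 'key' field."""
    pairs = sorted({(m["category"], m["object"]) for m in markers})
    pair_id = {p: i for i, p in enumerate(pairs)}
    encoded = [dict(m, key=pair_id[(m["category"], m["object"])]) for m in markers]
    return encoded, pair_id
```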