sqldatabaseperformancetime-seriesnosql

Query objects by overlapping time ranges accross multiple categories


Markers having a structure like this:

{
  id: 123,
  category: 2,
  object: 'A',
  start: 10,
  end: 25
}

...resulting in ranges along the different categories like so:

category-1: ----------[===========]---------------------------- ref-object: A
category-1: --------------------------[====================]--- ref-object: A
category-1: ------------------[============================]--- ref-object: A
category-1: -----------------[==========================]------ ref-object: C
category-1: ------------------------------[================]--- ref-object: C

category-2: ----[===================]-------------------------- ref-object: A
category-2: -----[==========================]------------------ ref-object: B
category-2: -----[====================]------------------------ ref-object: C
category-2: -------------------------[=================]------- ref-object: C
category-2: ------------[===================================]-- ref-object: C

category-3: ----------------[=============]-------------------- ref-object: A
category-3: ----[===================================]---------- ref-object: A
category-3: -------[=======================================]--- ref-object: B
category-3: ----------------------------[====]----------------- ref-object: C
category-3: -------------[=================]------------------- ref-object: C

Now to answer something like this:

Find all Objects, having overlapping(!) ranges along category 1 and category 2
ideally ordered by sum of overlapping time

I can think of those steps

Is that something you would want to do in SQL?

Is there a name for that type of queries? (those are not exactly range queries, are they?)

Is there a database with built-in capabilities in that case, given the amount of data?

Currently all data is stored inside an Apache SOLR index, but that could also become Elasticsearch, InfluxDB etc


Solution

  • Is that something you would want to do in SQL?

    Of course! And as it is quite simple, why not trying?

    with
        -- Where do segments intersect each other? 
        cuts as
        (
            select category, object, start t from m
            union -- Without "all" to deduplicate.
            select category, object, "end" from m
        ),
        -- Give those cuts an index, and compute the stop of the subsegment they start
        -- (the last cut having no next cut, will get a null stop:
        --  thus it's the only one that will not be considered a segment in the following joins)
        icuts as
        (
            select
                row_number() over () id,
                category, object,
                t ts, lead(t) over w te, lead(t) over w - t + 1 len
            from cuts
            window w as (partition by category, object order by t)
        ),
        -- Atomic marker segments: parts of markers that cannot be split anymore.
        mseg as
        (
            select m.id mid, m.category, m.object, s.id sid
            from m join icuts s on (s.category, s.object) = (m.category, m.object) and m.start <= s.ts and m.end >= s.te
        ),
        overl as
        (
            select sid, count(1) noverlaps
            from mseg
            group by sid
            having count(1) > 1
        )
    select category, object, sum(len) totaloverlap
    from overl o join icuts s on s.id = o.sid
    group by 1, 2
    order by 3 desc, 1, 2;
    

    Would return from your example (counting the [ and ] in the markers, thus the forelast marker of your example, [====], is considered of duration 6):

    category object totaloverlap
    2 C 34
    1 A 27
    1 C 15
    3 A 15
    3 C 4

    (here is the fiddle if you want to play with it)

    10 million would not be that big for a PostgreSQL server, but there's no way to know without testing (… and I'll preserve DBFiddle from such a load). However I'm convinced it will run faster than with roundtrips from database to application.

    There's still room for improvement: all (category, object) pairs can be reduced to a single integer;
    and the CTEs can be dumped to tables then indexed.