sqlmysqloverlaptime-and-attendance

Create data-analysis about overlapped login grouped by time slot


I'm developing an internal dashboard and I have to show how many login sessions overlapped each other (the maximum of them), grouped by time slot (assuming the day is fixed, passed as a parameter).

In the database I already have information about when the user has logged in and logged out.

Sample source data:

userId login logout
7 2025-04-03 08:30:00 2025-04-03 09:10:00
2 2025-04-03 09:20:00 2025-04-03 09:30:00
3 2025-04-03 09:37:00 2025-04-03 09:45:00
6 2025-04-03 09:55:00 2025-04-03 10:30:00
1 2025-04-03 10:15:00 2025-04-03 10:50:00
5 2025-04-03 12:01:00 2025-04-03 12:04:00
6 2025-04-03 12:05:00 2025-04-03 17:26:00
7 2025-04-03 12:23:00 2025-04-03 12:53:00
4 2025-04-03 12:25:00 2025-04-03 12:37:00

Desired output:

time userLogged
08-09 1 (only user 7 logged in)
09-10 1 (users 2 and 3 never overlap their session)
10-11 2 (users 6 and 1 overlapped their session)
11-12 0
12-13 3 (users 4,6,7 overlapped their session)

Is it possible to reach this result using mySQL? (Possibly with a single query?)

And if not, any suggestion on how else to do this task?


Solution

  • As your problem contains two problems (detect overlaps then aggregate by hour, possibly creating the ones missing), I suggest splitting it in half;
    even each part taken individually will teach you a lot of things.

    Detecting overlaps

    Conception

    Rest reinsured that MySQL has all the tools you'll want to map your thinking, so try first to imagine "naturally" what you would do, then only we will implement it.

    When to detect overlaps? When any user logs in. No need to see at each second how many users are still logged in, you just have to recompute the sum after any event changes it: you're sure this sum will last until next event.
    (alternatively we could sum every +1 when a user logs in, and -1 when one logs out, to know the total for any timespan between two events).

    So 1. we'll list times when an event (log in or log out) occurs.

    Then 2. we'll determine which ones of those "time slots" each user session covers.
    Counter-intuitively, two time spans (a session and a slice) overlap when the end of one is after the start of the other one, and the end of the other after the start of the first.
    But we won't use this for now, because we can rely on our slices being perfectly sequential, and aligned on our sessions: thus if a timeslot's starts during a session, we're sure that its end too is still in the session.

    And 3. we'll sum the session slices on each slot start.

    Implementation

    To organize our thoughts, we will use Common Table Expressions, which allow us to declare pseudo-tables from the result of a first, simple query, then use its contents in a second query, and so on, until we reach the final query where we can refer to any of those pseudo-tables.
    The CTE are every pseudo-table declared after the with keyword: bounds and slices in the example below.

    Second trick: to make sure we do not get duplicates for the events timestamps (in case one user logs in at the precise second another logs out),
    we will use the union keyword to aggregate our start and stop events timestamps (list the starts, list the stops, then union the two lists).
    Contrary to union all which keeps duplicates, not only does union remove duplicate timestamps of a start and a stop, but it will deduplicate two identical starts too.

    with
        bounds as (select login t from sessions union select logout from sessions),
        slices as
        (
            select userId, t.*
            from sessions s join bounds t on s.login <= t.t and s.logout > t.t
        )
    select t, count(1) n from slices group by t order by t;
    

    This will tell us that the slot starting at 8:30 has 1 session, same at 9:20, and so on until 10:15 where we have 2, etc.:

    t n
    2025-04-03 08:30:00 1
    2025-04-03 09:20:00 1
    2025-04-03 09:37:00 1
    2025-04-03 09:55:00 1
    2025-04-03 10:15:00 2
    2025-04-03 10:30:00 1

    You can see it demoed in a fiddle.

    Implementation for summing +1s and -1s since the beginning

    Now let's reread my "(alternatively, …)" in the Conception part. This is a simple concept (draw the log ins and log outs on the timeline, then read the timeline left to right starting from 0 and adding 1 on each log in and subtracting 1 on each logout) for which SQL has the tools too.

    But we need to familiarize with window functions.
    A window function gives a row access to a partial aggregate of some other rows of the resultset (after the where has filtered).
    So let's start with a simple, order by window. In this kind of window, each row "sees" the rows before and including itself.
    Thus let's imagine a table containing 4 rows with a column n set respectively to 1, 2, 3, and 4.
    The (non windowed, absolute) sum(n) for this table is thus 1+2+3+4 = 10.
    Now if we select n, sum(n) over (order by n) from t order by n, we will get each row associated with the sum of all its predecessors (note that the window's order by is independent of the one of the select):

    n sum
    1 1
    2 3
    3 6
    4 10

    … Hey, that's exactly what we need to sum all our +1s and -1s from the beginning!

    with
        events as (select 1 inc, login t from sessions union all select -1 inc, logout t from sessions),
        -- Aggregate simultaneous events, to get 0 instead of a (parasit) -1+1 when a user logs in at the same moment another logs out:
        transitions as (select t, sum(inc) inc from events group by t),
    -- Now compute the sum of increments since the beginning.
    select t, sum(inc) over (order by t) as n from transitions order by t;
    

    (you'll find it at line 40 of this fiddle)

    Reporting the max for each hour

    Conception

    Now that we have our "slices of a stable count of sessions", we will have to dispatch them to their corresponding hour, or hours, as some sessions could last over an hour bound, or even last several hours like userId 6.
    For this user, we cannot rely on only rounding the session slice bounds (to the previous hour for the start, to the next hour for the end), because this would only give us 12:00 and 18:00 for this userId, and nothing inbetween.
    so we'll have to handle our sessions slices as timespans, and cross them with the 1-hour long, monotonic timespans of 08:00 - 09:00, 09:00 - 10:00, and so on:
    we'll do overlapping of session slices to hour spans, just like we did overlapping of whole sessions to session slices.

    So in order to compute their overlaps, we'll have to compute:

    A simple heuristic to get every hour of the day would be to select distinct <event timestamp truncated to hour>;
    however this would prevent hours with 0 session (11:00 - 12:00) to appear in the results,
    and even hours having 1 session that bridges over many hours (thus 15:00 - 16:00 wouldn't appear for userId 6, as it has one leg in 12:00 - 13:00 and one in 17:00 - 18:00, but none in the intermediate hours).

    So we have no other way than computing the first login hour, then add 1 hour by 1 hour until exceeding the last logout timestamp.

    Implementation

    To get the slice's end, as our slices are sequential, we know the slice's end is the next slice's start:
    this is a perfect use case for the lead() window function that will access the value of the row following the current row
    (N.B.: it will return null as the successor's value for the last row, which has no successor).

    Note that this is the reason why we deduplicated event timestamps above:
    if we hadn't, in case of two events at the same time (let's say one session 09:55 - 10:00 and another 10:00 - 10:05, thus 1 logout at 10:00 and 1 login at 10:00 too), the successor of the 10:00 logout would be the 10:00 login, which would have generated a 0-second timespan 10:00 - 10:00…

    And to generate every hour, we will emulate other RDBMS' generate_series() by using a recursive CTE (which are more iterative than recursive).\

    A recursive CTE works by the union all of two selects:

    Once the recursive part returns nothing more, the database consolidates the rows returned by each iteration and returns them as the CTE contents.

    If you look at the hours below:

    Now we can left join this hourly CTE to our session splices and compute the max(n /* simultaneous sessions */) over those hours.

    with recursive
        events as (select 1 inc, login t from sessions union all select -1 inc, logout t from sessions),
        transitions as (select t, sum(inc) inc from events group by t),
        connected as
        (
            select
                t,
                lead(t) over (order by t) as until,
                sum(inc) over (order by t) as n
            from transitions
        ),
        hours as
        (
            select date_add(date(min(t)), interval hour(min(t)) hour) as h from connected
            union all
            select h + interval 1 hour as h from hours
            where h < (select max(t) from connected) -- /!\ Do not forget the stop condition!
        )
    select h, max(n) "sessions"
    from hours left join connected s
    on s.t < h + interval 1 hour and s.until > h
    group by h
    order by h;
    
    h sessions
    2025-04-03 08:00:00 1
    2025-04-03 09:00:00 1
    2025-04-03 10:00:00 2
    2025-04-03 11:00:00 0
    2025-04-03 12:00:00 3
    2025-04-03 13:00:00 1
    2025-04-03 14:00:00 1
    2025-04-03 15:00:00 1
    2025-04-03 16:00:00 1
    2025-04-03 17:00:00 1
    2025-04-03 18:00:00 NULL

    It is shown running at line 58 of the fiddle.

    Exercise is left to the reader to: