sql · ruby-on-rails · postgresql · aggregate-functions · generate-series

Best way to count rows by arbitrary time intervals


My app has an Events table with time-stamped events.

I need to report the count of events during each of the most recent N time intervals. For different reports, the interval could be "each week" or "each day" or "each hour" or "each 15-minute interval".

For example, a user can display how many orders they received each week, day, hour, or quarter-hour.

1) My preference is to run a single dynamic SQL query (I'm using Postgres) that groups by an arbitrary time interval. Is there a way to do that?

2) An easy but ugly brute-force way is to run a single query for all records within the start/end timeframe, sorted by timestamp, then have a method build the tally manually for whatever interval.

3) Another approach would be to add separate fields to the events table for each interval and statically store the_week, the_day, the_hour, and the_quarter_hour fields, so I take the 'hit' once at the time the record is created, instead of every time I report on that field.

What's best practice here, given I could modify the model and pre-store interval data if required (although at the modest expense of doubling the table width)?


Solution

  • Assuming this table:

    CREATE TABLE event(
      event_id serial PRIMARY KEY
    , ts timestamp NOT NULL
    );
    

    Note the data type timestamp. All following examples match this data type. When dealing with timestamptz, adjust accordingly. See:
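
    One option for timestamptz (a sketch; 'Europe/Vienna' is a placeholder zone, substitute your own) is to convert to a local timestamp before binning with date_bin(), covered below:

    ```sql
    -- Assumption: ts is timestamptz; 'Europe/Vienna' stands in for your time zone.
    -- AT TIME ZONE converts timestamptz → timestamp in that zone,
    -- so bins align to local wall-clock time.
    SELECT date_bin('15 min', e.ts AT TIME ZONE 'Europe/Vienna', '2018-05-01') AS start_time
         , count(*) AS events
    FROM   event e
    GROUP  BY 1
    ORDER  BY 1;
    ```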

    Postgres 14 or newer

    Since Postgres 14 we have the convenient function date_bin(). See:

    To get 1 row for every time slot with data in the table:

    SELECT date_bin('15 min', e.ts, '2018-05-01') AS start_time
         , count(e.ts) AS events
    FROM   event e
    GROUP  BY 1
    ORDER  BY 1;
    

    Time slots without entries in the table go missing in the result.
    Note that the "origin" ('2018-05-01' in the example) is only for alignment, not curtailment.
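
    To illustrate that behavior, a quick sketch (input values assumed for demonstration):

    ```sql
    SELECT date_bin('15 min', timestamp '2018-05-03 10:37', timestamp '2018-05-01');
    -- 2018-05-03 10:30  (bins aligned to :00 / :15 / :30 / :45)

    SELECT date_bin('15 min', timestamp '2018-05-03 10:37', timestamp '2018-05-01 00:05');
    -- 2018-05-03 10:35  (bins aligned to :05 / :20 / :35 / :50)

    SELECT date_bin('15 min', timestamp '2018-04-28 23:50', timestamp '2018-05-01');
    -- 2018-04-28 23:45  (timestamps before the origin are binned, too - no curtailment)
    ```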

    To get 1 row for every time slot since 2018-05-01:

    SELECT start_time, COALESCE(events, 0) AS events
    FROM  (
       SELECT generate_series(timestamp '2018-05-01', max(ts), interval '15 min')
       FROM   event
       ) g(start_time)
    LEFT   JOIN (
       SELECT date_bin('15 min', e.ts, '2018-05-01'), count(e.ts)
       FROM   event e
       WHERE  e.ts >= '2018-05-01'  -- filter early (optional)
       GROUP  BY 1
       ) e(start_time, events) USING (start_time)
    ORDER  BY 1;
    

    The set-returning function generate_series() can be used to generate a full set of rows.
    Time slots without entries in the table are filled in with a count of 0.

    fiddle

    Postgres 13 or older

    Original answer, obsolescent, but with more explanation.

    Basic solution

    This query counts events for any arbitrary time interval. 17 minutes in the example:

    WITH grid AS (
       SELECT start_time
            , lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_time
       FROM  (
          SELECT generate_series(min(ts), max(ts), interval '17 min') AS start_time
          FROM   event
          ) sub
       )
    SELECT start_time, count(e.ts) AS events
    FROM   grid       g
    LEFT   JOIN event e ON e.ts >= g.start_time
                       AND e.ts <  g.end_time
    GROUP  BY start_time
    ORDER  BY start_time;
    

    The query retrieves minimum and maximum ts from the base table to cover the complete time range. You can use an arbitrary time range instead.

    Provide any time interval as needed.

    Produces one row for every time slot. If no event happened during that interval, the count is 0.

Be sure to handle upper and lower bounds correctly. See:

The window function lead() has an often overlooked feature: it can provide a default for when no leading row exists. We provide 'infinity' in the example; otherwise the last interval would be cut off with an upper bound of NULL.
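
    A minimal standalone sketch of that behavior (sample values assumed):

    ```sql
    SELECT ts
         , lead(ts, 1, timestamp 'infinity') OVER (ORDER BY ts) AS end_time
    FROM  (VALUES (timestamp '2024-01-01 00:00')
                , (timestamp '2024-01-01 00:17')) t(ts);
    -- The last row gets end_time = 'infinity' instead of NULL,
    -- so the final interval is not cut off.
    ```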

    Minimal equivalent

The above query uses a CTE, lead(), and verbose syntax. Elegant and maybe easier to understand, but a bit more expensive. Here is a shorter, faster, minimal version:

    SELECT start_time, count(e.ts) AS events
    FROM  (SELECT generate_series(min(ts), max(ts), interval '17 min') FROM event) g(start_time)
    LEFT   JOIN event e ON e.ts >= g.start_time
                       AND e.ts <  g.start_time + interval '17 min'
    GROUP  BY 1
    ORDER  BY 1;
    

Example for "every 15 minutes in the past week"

    Formatted with to_char().

    SELECT to_char(start_time, 'YYYY-MM-DD HH24:MI'), count(e.ts) AS events
    FROM   generate_series(date_trunc('day', localtimestamp - interval '7 days')
                         , localtimestamp
                         , interval '15 min') g(start_time)
    LEFT   JOIN event e ON e.ts >= g.start_time
                       AND e.ts <  g.start_time + interval '15 min'
    GROUP  BY start_time
    ORDER  BY start_time;

    Still ORDER BY and GROUP BY on the underlying timestamp value, not on the formatted string. That's faster and more reliable.

    db<>fiddle here

    Related answer producing a running count over the time frame: