postgresqlranking-functions

Postgresql: Using Rank function with a date interval


I have an existing query that uses the RANK() function to sum (or count) a column and rank results over a specified date range. Up until now, I have only needed the total aggregate sum or count of the given column over the date range. Now, however, I would like to create a window within this larger date range that allows me to rank results based on their count/sum over this smaller period within the overall date range.

My existing query looks something like this. This query correctly ranks the result by total count of tracks for each unique user_resort_registration over the date range specified in the WHERE clause.

SELECT
        u.*,
        user_resort_registrations.*,
        count(tracks) as count,
        RANK () OVER (ORDER BY count(tracks) DESC) as position
      FROM users as u
      INNER JOIN user_resort_registrations ON user_resort_registrations.user_id = u.id
      INNER JOIN resort_days ON user_resort_registrations.id = resort_days.user_resort_registration_id
      INNER JOIN resorts ON user_resort_registrations.resort_id = resorts.id
      INNER JOIN tracks ON resort_days.id = tracks.resort_day_id 
WHERE resort_days.date >= '2023-03-01T07:00:00' 
AND resort_days.date <= '2023-04-01T05:59:59' 
AND resorts.identifier = 'SOME_IDENTIFIER' 
GROUP BY u.id, user_resort_registrations.id 
LIMIT(25) 

Based on other answers on SO, I have tried using the RANGE and WINDOW functionality, unfortunately to no avail. As much as possible, I need to keep the existing result structure of this query (unless what I am proposing is not possible within this structure), particularly the ORDER BY count(tracks) DESC part. What I am trying to accomplish looks something like this:

SELECT
        u.*,
        user_resort_registrations.*,
        count(tracks) as count,
        RANK () OVER (
          ORDER BY count(tracks) DESC
          RANGE BETWEEN CURRENT ROW AND '2 days' FOLLOWING
        ) as position
      FROM users as u
      INNER JOIN user_resort_registrations ON user_resort_registrations.user_id = u.id
      INNER JOIN resort_days ON user_resort_registrations.id = resort_days.user_resort_registration_id
      INNER JOIN resorts ON user_resort_registrations.resort_id = resorts.id
      INNER JOIN tracks ON resort_days.id = tracks.resort_day_id 
WHERE resort_days.date >= '2023-03-01T07:00:00' 
AND resort_days.date <= '2023-04-01T05:59:59' 
AND resorts.identifier = 'SOME_IDENTIFIER' 
GROUP BY u.id, user_resort_registrations.id 
LIMIT(25) 

This does not work because RANGE is interpreting CURRENT ROW as the tracks count. It is throwing an error: ERROR: invalid input syntax for type bigint: "2 days". I believe that this is because the RANGE is being run on count(tracks) and not the date field that I want to run it on (resort_days.date).

My goal is to be able to use RANK() to retrieve the highest occurrences of count(tracks) within a 3 day period within a larger date range (as specified by the WHERE clause). In plain english, this would be something like "find the most track records within a consecutive 3 day period over a 30 day period for each unique user_resort_registration". I am unsure of the best way to structure the query in order to make this happen. I am using postgres version 14.12.

Here is a fiddle of my first query, which runs successfully and returns a count of the total tracks over the given time range.

Here is a fiddle of the query that I am working on that throws an error

Query Error: error: invalid input syntax for type bigint: "3 days"

MY GOAL: is to have the second query return a ranked set (the same as the first query) with a count of 22 tracks. This would be for the track ids 13-34, representing the 3 day period from '2023-03-03T10:00:00' to '2023-03-05T10:00:00' for user_resort_registration_id=1. This period represents the highest count of tracks within a 3 day period, within the larger 30 day period specified by the WHERE clause.


Solution

  • See example. Step by step

    1. CTE data - your query without grouping - source data.
    2. CTE daydata - group by user_id,regid,resort_date - data grouped by day, not for full period.
      Also we take data from previous 1 and 2 days.
    3. CTE day3data - if current, prev1 and prev2 data are consecutive then calculate sum(trackcnt) = cum3cntL
    4. CTE daydata_rnk - rank days by cum3cntL desc.
    5. CTE grouped_data - group by user_id and regId, take max value cum3cntL - rn=1
    with data as( -- source data
    SELECT u.id user_id,u.email_address email
      ,urr.id regId,urr.resort_id
      ,resorts.identifier res_ident
      ,resort_days.id resort_day_id,resort_days.date resort_date
      ,tracks.id track_id,distance_vertical distV
    FROM users as u
    INNER JOIN user_resort_registrations urr ON urr.user_id = u.id
    INNER JOIN resort_days ON urr.id = resort_days.user_resort_registration_id
    INNER JOIN resorts ON urr.resort_id = resorts.id
    INNER JOIN tracks ON resort_days.id = tracks.resort_day_id 
    WHERE resort_days.date >= '2023-03-01T07:00:00' 
       AND resort_days.date <= '2023-04-01T05:59:59' 
       AND resorts.identifier = 'fuzzy' 
    )
    
    ,daydata as(  -- day data. Also previous 2 days data
    select user_id,regId ,resort_date
      ,count(*) trackcnt
      ,lag(resort_date,1)over(partition by regId order by resort_date) prev1date
      ,lag(resort_date,2)over(partition by regId order by resort_date) prev2date
      ,lag(count(*),1)over(partition by regId order by resort_date) prev1cnt
      ,lag(count(*),2)over(partition by regId order by resort_date) prev2cnt
    from data
    GROUP BY user_id, regId ,resort_date
    )
    ,day3data as(  -- check dates prev1date+2=prev1date+1=current_date
    select * 
      ,case when (date_trunc('day',resort_date)-date_trunc('day',prev1date))=interval '1 day'
            and (date_trunc('day',resort_date)-date_trunc('day',prev2date))=interval '2 day'
          then prev2cnt+prev1cnt+trackcnt else 0 end cum3cntL
    from daydata
    )
    ,daydata_rnk as( -- rank for take max value row
    select *
      ,row_number()over(partition by regId order by cum3cntL desc)rn
    from day3data
    )
    ,grouped_data as( -- grouping by full period and take sums and max row
    select user_id,regid, sum(trackcnt) tracks_count
      ,max(case when rn=1 then cum3cntL end) max_cum3cnt
      ,max(case when rn=1 then resort_date end) max_cum3day
    from daydata_rnk
    group by user_id,regid
    )
    select * from grouped_data
    order by user_id,regId;
    

    Result

    user_id regid tracks_count max_cum3cnt max_cum3day
    1 1 35 22 2023-03-05 10:00:00
    2 2 4 3 2023-03-03 10:00:00

    For clarity, before last grouping

    user_id regid resort_date trackcnt prev1date prev2date prev1cnt prev2cnt cum3cntl rn
    1 1 2023-03-01 10:00:00 6 null null null null 0 5
    1 1 2023-03-02 10:00:00 6 2023-03-01 10:00:00 null 6 null 0 6
    1 1 2023-03-03 10:00:00 2 2023-03-02 10:00:00 2023-03-01 10:00:00 6 6 14 3
    1 1 2023-03-04 10:00:00 4 2023-03-03 10:00:00 2023-03-02 10:00:00 2 6 12 4
    1 1 2023-03-05 10:00:00 16 2023-03-04 10:00:00 2023-03-03 10:00:00 4 2 22 1
    1 1 2023-03-06 10:00:00 1 2023-03-05 10:00:00 2023-03-04 10:00:00 16 4 21 2
    2 2 2023-03-01 10:00:00 1 null null null null 0 3
    2 2 2023-03-02 10:00:00 1 2023-03-01 10:00:00 null 1 null 0 4
    2 2 2023-03-03 10:00:00 1 2023-03-02 10:00:00 2023-03-01 10:00:00 1 1 3 1
    2 2 2023-03-04 10:00:00 1 2023-03-03 10:00:00 2023-03-02 10:00:00 1 1 3 2

    See demo fiddle

    In demo also version for days without gap in user_resort_registrations.id dates.

    For (partition by regId order by resort_date)
    or (partition by user_id,regId order by resort_date) - difference may be in performance.