I have an existing query that uses the RANK() function to sum (or count) a column and rank results over a specified date range. Up until now, I have only needed the total aggregate sum or count of the given column over the date range. Now, however, I would like to create a window within this larger date range that allows me to rank results based on their count/sum over this smaller period within the overall date range.
My existing query looks something like this. This query correctly ranks the result by total count of tracks for each unique user_resort_registration over the date range specified in the WHERE clause.
SELECT
u.*,
user_resort_registrations.*,
count(tracks) as count,
RANK () OVER (ORDER BY count(tracks) DESC) as position
FROM users as u
INNER JOIN user_resort_registrations ON user_resort_registrations.user_id = u.id
INNER JOIN resort_days ON user_resort_registrations.id = resort_days.user_resort_registration_id
INNER JOIN resorts ON user_resort_registrations.resort_id = resorts.id
INNER JOIN tracks ON resort_days.id = tracks.resort_day_id
WHERE resort_days.date >= '2023-03-01T07:00:00'
AND resort_days.date <= '2023-04-01T05:59:59'
AND resorts.identifier = 'SOME_IDENTIFIER'
GROUP BY u.id, user_resort_registrations.id
LIMIT(25)
Based on other answers on SO, I have tried using the RANGE and WINDOW functionality, unfortunately to no avail. As much as possible, I need to keep the existing result structure of this query (unless what I am proposing is not possible within this structure), particularly the ORDER BY count(tracks) DESC
part. What I am trying to accomplish looks something like this:
SELECT
u.*,
user_resort_registrations.*,
count(tracks) as count,
RANK () OVER (
ORDER BY count(tracks) DESC
RANGE BETWEEN CURRENT ROW AND '2 days' FOLLOWING
) as position
FROM users as u
INNER JOIN user_resort_registrations ON user_resort_registrations.user_id = u.id
INNER JOIN resort_days ON user_resort_registrations.id = resort_days.user_resort_registration_id
INNER JOIN resorts ON user_resort_registrations.resort_id = resorts.id
INNER JOIN tracks ON resort_days.id = tracks.resort_day_id
WHERE resort_days.date >= '2023-03-01T07:00:00'
AND resort_days.date <= '2023-04-01T05:59:59'
AND resorts.identifier = 'SOME_IDENTIFIER'
GROUP BY u.id, user_resort_registrations.id
LIMIT(25)
This does not work because RANGE
is interpreting CURRENT ROW
as the tracks count. It is throwing an error: ERROR: invalid input syntax for type bigint: "2 days"
. I believe that this is because the RANGE is being run on count(tracks)
and not the date field that I want to run it on (resort_days.date
).
My goal is to be able to use RANK()
to retrieve the highest occurrences of count(tracks)
within a 3 day period within a larger date range (as specified by the WHERE clause). In plain english, this would be something like "find the most track
records within a consecutive 3 day period over a 30 day period for each unique user_resort_registration". I am unsure of the best way to structure the query in order to make this happen. I am using postgres version 14.12.
Here is a fiddle of my first query, which runs successfully and returns a count of the total tracks over the given time range.
Here is a fiddle of the query that I am working on that throws an error
Query Error: error: invalid input syntax for type bigint: "3 days"
MY GOAL: is to have the second query return a ranked set (the same as the first query) with a count of 22 tracks. This would be for the track ids 13-34, representing the 3 day period from '2023-03-03T10:00:00' to '2023-03-05T10:00:00' for user_resort_registration_id=1. This period represents the highest count of tracks within a 3 day period, within the larger 30 day period specified by the WHERE clause.
See example. Step by step
data
- your query without grouping - source data.daydata
- group by user_id,regid,resort_date - data grouped by day, not for full period.day3data
- if current, prev1 and prev2 data are consecutive
then calculate sum(trackcnt) = cum3cntL
daydata_rnk
- rank days by cum3cntL
desc.grouped_data
- group by user_id and regId, take max value cum3cntL - rn=1with data as( -- source data
SELECT u.id user_id,u.email_address email
,urr.id regId,urr.resort_id
,resorts.identifier res_ident
,resort_days.id resort_day_id,resort_days.date resort_date
,tracks.id track_id,distance_vertical distV
FROM users as u
INNER JOIN user_resort_registrations urr ON urr.user_id = u.id
INNER JOIN resort_days ON urr.id = resort_days.user_resort_registration_id
INNER JOIN resorts ON urr.resort_id = resorts.id
INNER JOIN tracks ON resort_days.id = tracks.resort_day_id
WHERE resort_days.date >= '2023-03-01T07:00:00'
AND resort_days.date <= '2023-04-01T05:59:59'
AND resorts.identifier = 'fuzzy'
)
,daydata as( -- day data. Also previous 2 days data
select user_id,regId ,resort_date
,count(*) trackcnt
,lag(resort_date,1)over(partition by regId order by resort_date) prev1date
,lag(resort_date,2)over(partition by regId order by resort_date) prev2date
,lag(count(*),1)over(partition by regId order by resort_date) prev1cnt
,lag(count(*),2)over(partition by regId order by resort_date) prev2cnt
from data
GROUP BY user_id, regId ,resort_date
)
,day3data as( -- check dates prev1date+2=prev1date+1=current_date
select *
,case when (date_trunc('day',resort_date)-date_trunc('day',prev1date))=interval '1 day'
and (date_trunc('day',resort_date)-date_trunc('day',prev2date))=interval '2 day'
then prev2cnt+prev1cnt+trackcnt else 0 end cum3cntL
from daydata
)
,daydata_rnk as( -- rank for take max value row
select *
,row_number()over(partition by regId order by cum3cntL desc)rn
from day3data
)
,grouped_data as( -- grouping by full period and take sums and max row
select user_id,regid, sum(trackcnt) tracks_count
,max(case when rn=1 then cum3cntL end) max_cum3cnt
,max(case when rn=1 then resort_date end) max_cum3day
from daydata_rnk
group by user_id,regid
)
select * from grouped_data
order by user_id,regId;
Result
user_id | regid | tracks_count | max_cum3cnt | max_cum3day |
---|---|---|---|---|
1 | 1 | 35 | 22 | 2023-03-05 10:00:00 |
2 | 2 | 4 | 3 | 2023-03-03 10:00:00 |
For clarity, before last grouping
user_id | regid | resort_date | trackcnt | prev1date | prev2date | prev1cnt | prev2cnt | cum3cntl | rn |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 2023-03-01 10:00:00 | 6 | null | null | null | null | 0 | 5 |
1 | 1 | 2023-03-02 10:00:00 | 6 | 2023-03-01 10:00:00 | null | 6 | null | 0 | 6 |
1 | 1 | 2023-03-03 10:00:00 | 2 | 2023-03-02 10:00:00 | 2023-03-01 10:00:00 | 6 | 6 | 14 | 3 |
1 | 1 | 2023-03-04 10:00:00 | 4 | 2023-03-03 10:00:00 | 2023-03-02 10:00:00 | 2 | 6 | 12 | 4 |
1 | 1 | 2023-03-05 10:00:00 | 16 | 2023-03-04 10:00:00 | 2023-03-03 10:00:00 | 4 | 2 | 22 | 1 |
1 | 1 | 2023-03-06 10:00:00 | 1 | 2023-03-05 10:00:00 | 2023-03-04 10:00:00 | 16 | 4 | 21 | 2 |
2 | 2 | 2023-03-01 10:00:00 | 1 | null | null | null | null | 0 | 3 |
2 | 2 | 2023-03-02 10:00:00 | 1 | 2023-03-01 10:00:00 | null | 1 | null | 0 | 4 |
2 | 2 | 2023-03-03 10:00:00 | 1 | 2023-03-02 10:00:00 | 2023-03-01 10:00:00 | 1 | 1 | 3 | 1 |
2 | 2 | 2023-03-04 10:00:00 | 1 | 2023-03-03 10:00:00 | 2023-03-02 10:00:00 | 1 | 1 | 3 | 2 |
In demo also version for days without gap in user_resort_registrations.id dates.
For (partition by regId order by resort_date)
or (partition by user_id,regId order by resort_date)
- difference may be in performance.