I have a large dataset and want to retrieve previous and subsequent observations for small windows. Below is a snippet around a window of interest (2024-09-23 to 2024-09-27).
item date price
A 2024-09-20 $106.59
A 2024-09-23 $91.84
A 2024-09-24 $92.66
A 2024-09-25 $107.87
A 2024-09-26 $98.47
A 2024-09-27 $91.65
A 2024-09-30 $108.10
B 2024-07-29 $120.71
B 2024-08-13 $87.80
B 2024-09-04 $115.80
B 2024-09-26 $94.06
B 2024-09-27 $120.88
In the example above, "A" had a price every weekday, but "B" had infrequent prices. As expected (but not desired), the code below produces NULL values at the beginning and end of the desired window (specified in the WHERE statement).
SELECT item, date, price,
lag(price) OVER (PARTITION BY item ORDER BY date) as prevprice,
lead(price) OVER (PARTITION BY item ORDER BY date) as nextprice
FROM dailytable
WHERE date BETWEEN date'2024-09-23' AND date'2024-09-27'
;
Results - Not Wanted
item date price prevprice nextprice
A 2024-09-23 $91.84 NULL $92.66
A 2024-09-24 $92.66 $91.84 $107.87
A 2024-09-25 $107.87 $92.66 $98.47
A 2024-09-26 $98.47 $107.87 $91.65
A 2024-09-27 $91.65 $98.47 NULL
B 2024-09-26 $94.06 NULL $120.88
B 2024-09-27 $120.88 $94.06 NULL
Obviously, I could run the LEAD() and LAG() over the entire dataset in a subquery, then select from that using the WHERE statement, as below.
SELECT *
FROM
( SELECT item, date, price,
lag(price) OVER (PARTITION BY item ORDER BY date) as prevprice,
lead(price) OVER (PARTITION BY item ORDER BY date) as nextprice
FROM dailytable
)
WHERE date BETWEEN date'2024-09-23' AND date'2024-09-27'
;
Desired Results
item date price prevprice nextprice
A 2024-09-23 $91.84 $106.59 $92.66
A 2024-09-24 $92.66 $91.84 $107.87
A 2024-09-25 $107.87 $92.66 $98.47
A 2024-09-26 $98.47 $107.87 $91.65
A 2024-09-27 $91.65 $98.47 $108.10
B 2024-09-26 $94.06 $115.80 $120.88
B 2024-09-27 $120.88 $94.06 NULL
However, this is very costly and takes a long time. The entire dataset has over 50 years of data, but the required lead and lag periods are never nearly that long.
Is there a way to get the previous and subsequent prices for the window without running a subquery that runs over the entire history of the dataset? Of course, if an item doesn't have a next price, like B in the above example, I don't expect magic; NULL is appropriate.
I suggest using a case expression that runs a correlated subquery when the prevprice is null, and this subquery looks for the immediately prior price despite the overall date range being applied. e.g:
SELECT
item
, DATE
, price
, CASE
WHEN prevprice IS NULL
THEN (
select t.price FROM dailytable as t
where t.item = dailytable.item
and t.DATE < dailytable.DATE
order by t.DATE DESC
limit 1
)
ELSE prevprice
END AS prevprice
, nextprice
FROM (
SELECT
item
, DATE
, price
, lag(price) OVER (PARTITION BY item ORDER BY DATE) AS prevprice
, lead(price) OVER (PARTITION BY item ORDER BY DATE) AS nextprice
FROM dailytable
WHERE DATE BETWEEN DATE '2024-09-23'
AND DATE '2024-09-27'
) subquery
nb I'm not sure about the trino syntax to limit an ordered subquery to the first row and cannot test the query above. Also note if Trino supported lateral joins (or an apply operator) I would rather use those, but as I understand it neither of these are available as of answering.
It is noteworthy that correlated subqueries aren't wonderful for performance but I suspect this will be better than scanning a table with 50 years of data.
options
WITH subquery AS (
SELECT
item,
DATE,
price,
lag(price) OVER (PARTITION BY item ORDER BY DATE) AS prevprice,
lead(price) OVER (PARTITION BY item ORDER BY DATE) AS nextprice
FROM dailytable
WHERE DATE BETWEEN DATE '2024-09-23' AND DATE '2024-09-27'
),
prev_price_table AS (
SELECT
item,
DATE,
price,
prevprice
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY item ORDER BY DATE DESC) AS row_num
FROM subquery
) t
WHERE row_num = 1
)
SELECT sq.*, ppt.prevprice
FROM subquery sq
LEFT JOIN prev_price_table ppt ON sq.item = ppt.item AND sq.DATE = ppt.DATE - INTERVAL '1 day'
nb (again) I don't know trino syntax/limitations