sqltime-serieslagtrinolead

Using LEAD and LAG on ragged time series


I have a large dataset and want to retrieve previous and subsequent observations for small windows. Below is a snippet around a window of interest (2024-09-23 to 2024-09-27).

item    date    price
A   2024-09-20   $106.59 
A   2024-09-23   $91.84 
A   2024-09-24   $92.66 
A   2024-09-25   $107.87 
A   2024-09-26   $98.47 
A   2024-09-27   $91.65 
A   2024-09-30   $108.10 
B   2024-07-29   $120.71 
B   2024-08-13   $87.80 
B   2024-09-04   $115.80 
B   2024-09-26   $94.06 
B   2024-09-27   $120.88 

In the example above, "A" had a price every weekday, but "B" had infrequent prices. As expected (but not desired), the code below produces NULL values at the beginning and end of the desired window (specified in the WHERE statement).

SELECT item, date, price,
    lag(price) OVER (PARTITION BY item ORDER BY date) as prevprice,
    lead(price) OVER (PARTITION BY item ORDER BY date) as nextprice
FROM dailytable
WHERE date BETWEEN date'2024-09-23' AND date'2024-09-27'
;

Results - Not Wanted

item    date    price   prevprice   nextprice
A   2024-09-23   $91.84      NULL    $92.66 
A   2024-09-24   $92.66      $91.84      $107.87 
A   2024-09-25   $107.87     $92.66      $98.47 
A   2024-09-26   $98.47      $107.87     $91.65 
A   2024-09-27   $91.65      $98.47      NULL 
B   2024-09-26   $94.06      NULL    $120.88 
B   2024-09-27   $120.88     $94.06      NULL 

Obviously, I could run the LEAD() and LAG() over the entire dataset in a subquery, then select from that using the WHERE statement, as below.

SELECT * 
FROM 
(   SELECT item, date, price,
        lag(price) OVER (PARTITION BY item ORDER BY date) as prevprice,
        lead(price) OVER (PARTITION BY item ORDER BY date) as nextprice
    FROM dailytable
)
WHERE date BETWEEN date'2024-09-23' AND date'2024-09-27'
;

Desired Results

item    date    price   prevprice   nextprice
A   2024-09-23   $91.84      $106.59     $92.66 
A   2024-09-24   $92.66      $91.84      $107.87 
A   2024-09-25   $107.87     $92.66      $98.47 
A   2024-09-26   $98.47      $107.87     $91.65 
A   2024-09-27   $91.65      $98.47      $108.10 
B   2024-09-26   $94.06      $115.80     $120.88 
B   2024-09-27   $120.88     $94.06      NULL 

However, this is very costly and takes a long time. The entire dataset has over 50 years of data, but the required lead and lag periods are never nearly that long.

Is there a way to get the previous and subsequent prices for the window without running a subquery that runs over the entire history of the dataset? Of course, if an item doesn't have a next price, like B in the above example, I don't expect magic; NULL is appropriate.


Solution

  • I suggest using a case expression that runs a correlated subquery when the prevprice is null, and this subquery looks for the immediately prior price despite the overall date range being applied. e.g:

    SELECT
          item
        , DATE
        , price
        , CASE 
            WHEN prevprice IS NULL
                THEN (
                        select t.price FROM dailytable as t
                        where t.item = dailytable.item
                        and t.DATE < dailytable.DATE
                        order by t.DATE DESC
                        limit 1
                        )
            ELSE prevprice
            END AS prevprice
        , nextprice
    FROM (
        SELECT
              item
            , DATE
            , price
            , lag(price) OVER (PARTITION BY item ORDER BY DATE) AS prevprice
            , lead(price) OVER (PARTITION BY item ORDER BY DATE) AS nextprice
        FROM dailytable
        WHERE DATE BETWEEN DATE '2024-09-23'
                       AND DATE '2024-09-27'
        ) subquery
    

    nb I'm not sure about the trino syntax to limit an ordered subquery to the first row and cannot test the query above. Also note if Trino supported lateral joins (or an apply operator) I would rather use those, but as I understand it neither of these are available as of answering.

    It is noteworthy that correlated subqueries aren't wonderful for performance but I suspect this will be better than scanning a table with 50 years of data.


    options

    WITH subquery AS (
      SELECT
        item,
        DATE,
        price,
        lag(price) OVER (PARTITION BY item ORDER BY DATE) AS prevprice,
        lead(price) OVER (PARTITION BY item ORDER BY DATE) AS nextprice
      FROM dailytable
      WHERE DATE BETWEEN DATE '2024-09-23' AND DATE '2024-09-27'
    ),
    prev_price_table AS (
      SELECT
        item,
        DATE,
        price,
        prevprice
      FROM (
        SELECT *,
          ROW_NUMBER() OVER (PARTITION BY item ORDER BY DATE DESC) AS row_num
        FROM subquery
      ) t
      WHERE row_num = 1
    )
    SELECT sq.*, ppt.prevprice
    FROM subquery sq
    LEFT JOIN prev_price_table ppt ON sq.item = ppt.item AND sq.DATE = ppt.DATE - INTERVAL '1 day'
    

    nb (again) I don't know trino syntax/limitations