sqlamazon-athenaprestotrino

`lag()` with `over` and `range between` returns a value if even the previous record is out of range


I want to get the previous value using lag(), over a partition defined with RANGE BETWEEN. I followed an example from the documentation:

WITH orders (custkey, orderdate, totalprice) 
    AS 
    (
    VALUES
    ('cust_1', DATE '2020-10-10', 100),
    ('cust_2', DATE '2020-10-10', 15),
    ('cust_1', DATE '2020-10-15', 200),
    ('cust_1', DATE '2020-10-16', 240),
    ('cust_2', DATE '2020-12-20', 25),
    ('cust_1', DATE '2020-12-25', 140),
    ('cust_2', DATE '2021-01-01', 5)    
)

SELECT *, 
       avg(totalprice) OVER (
                             PARTITION BY custkey
                             ORDER BY orderdate
                             RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW) AS past_month_avg,
    
       lag(totalprice) OVER ( PARTITION BY custkey
                              ORDER BY orderdate
                              RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW) AS previous_total_price_within_last_month
FROM orders

It returns:

custkey orderdate totalprice past_month_avg previous_total_price_within_last_month
cust_2 2020-10-10 15 15.0 NULL
cust_2 2020-12-20 25 25.0 15
cust_2 2021-01-01 5 15.0 25
cust_1 2020-10-10 100 100.0 NULL
cust_1 2020-10-15 200 150.0 100
cust_1 2020-10-16 240 180.0 200
cust_1 2020-12-25 140 140.0 240

The problem: while the calculation for past_month_avg results as expected, the previous_total_price_within_last_month result is not as expected.


Expected Output

Given that I defined a window that ranges at the last month, I expect that lag() will return null if the "previous" totalprice value is associated with an orderdate value that is out of the 1-month-back window.

custkey orderdate totalprice past_month_avg previous_total_price_within_last_month
cust_2 2020-10-10 15 15.0 NULL
cust_2 2020-12-20 25 25.0 NULL*
cust_2 2021-01-01 5 15.0 25
cust_1 2020-10-10 100 100.0 NULL
cust_1 2020-10-15 200 150.0 100
cust_1 2020-10-16 240 180.0 200
cust_1 2020-12-25 140 140.0 NULL*

*expected value


Solution

  • As written in the comment, lag isn't expected to be supported when using a window frame (e.g., range between). The following functions are not supported either: lead, ntile, rank, dense_rank, percent_rank, cume_dist, and row_number.

    Please refer to this issue in the Trino project, and to the subsequent pull request.