sqlpostgresqlwindow-functions

PostgreSQL window function for sum per interval


The table is represented by following scripts:

CREATE TABLE sales (
  id SERIAL PRIMARY KEY,
  product_id INTEGER,
  sales_date DATE,
  quantity INTEGER,
  price NUMERIC
);

INSERT INTO sales (product_id, sales_date, quantity, price) VALUES
   (1, '2023-01-01', 10, 10.00),
   (1, '2023-01-02', 12, 12.00),
   (1, '2023-01-03', 15, 15.00),
   (2, '2023-01-01', 8, 8.00),
   (2, '2023-01-02', 10, 10.00),
   (2, '2023-01-03', 12, 12.00);

The task is to sum sales quantity for last 3 days per each product_id. The period must be counted backwards from maximum (last) date of each product_id. So for 1 maximum is 2023-01-03, same for 2. But for product_id 2 last day may be differ from 1 - let's say 2023-01-05.

By applying this query with window function in subquery:

select product_id, max(increasing_sum) as quantity_last_3_days
   from 
        (SELECT product_id,
         SUM(quantity) OVER (PARTITION BY product_id ORDER BY sales_date RANGE BETWEEN INTERVAL '2 days'
                PRECEDING AND CURRENT ROW) AS increasing_sum
         FROM sales) as s
   group by product_id;

I receive the expected output:

  | product_id | quantity_last_3_days |
  |____________|______________________|            
  |_____1______|___________37_________|
  |_____2______|___________30_________|     
 

But is it solution optimal? Is there any way to solve the problem, by using window function without subquery ?


Solution

  • But is it solution optimal?

    No, not if you just need "to sum sales quantity for last 3 days per each product_id".

    Is there any way to solve the problem, by using window function without subquery?

    You can usually trade window functions for correlated subqueries, lateral or scalar, but there's an easier way to speed up your query, and correct it. It is doing extra work trying to get you a rolling/stepping sum; the window isn't trying to get the 3 most recent days for each product_id.

    Instead, for each row, it's looking back at rows with the same product_id up to 2 days prior. You're later picking the 3 day period with the highest sum of quantity, which won't necessarily be the 3 most recent dates.

    On 400k samples, your query takes a whole 1.0s without an index, 0.7s with a covering index, and you can go down from that to 0.4s without or 0.1s with the covering index. You just need to only ask about the sum for the 3 most recent dates for each product_id: demo at db<>fiddle

    select product_id,sum(quantity) as quantity_last_3_days
    from(select *,row_number()over(PARTITION BY product_id ORDER BY sales_date DESC)
         from sales)_
    where 3>=row_number
    group by product_id;
    

    The trick here is that the window function will execute with a Run Condition: (3 >= row_number() OVER (?)), which means it'll just grab the 3 most recent and quit. It can even get them straight off the top of the covering index, without ever having to visit the table.
    Your original query has to scan the whole thing (either the entire table or the whole index, if it's available), then sort that to get the max().


    As you pointed out, that's counting any last 3 dates for each product_id, which don't have to be consecutive and the most recent one doesn't have to be today or yesterday.
    Not the last 3 days in general, counting back from current date.
    Not the last 3 consecutive dates for each product_id.

    In case you need that, last 3 from today are trivial to implement:

    select product_id,sum(quantity) as quantity_last_3_days
    from sales
    where sales_date>='2023-01-03'::date--this would be current_date or now()
                      -'2d'::interval
    group by product_id;
    

    and the 3 most recent consecutive per product_id are similar

    with cte as (
      select distinct on (product_id) *
      from sales s1
      order by product_id,sales_date desc)
    select *,(select sum(quantity)
              from sales s2
              where s1.product_id=s2.product_id
              and s1.sales_date-'2 days'::interval >= s2.sales_date) 
              as quantity_last_3_days
    from cte s1;