postgresqlamazon-redshiftwindow-functionsrolling-computation

Moving SUM analytic function from PostgreSQL to Redshift with RANGE window frame clause


I have the following code in PostgreSQL, which I need to reproduce on Amazon Redshift:

select *
from
(   select distinct
      a.id
      , sum(case when a.is_batch_empty then 1 else 0 end) 
           over (partition by a.client_id order by a.id) as empty_count
    from my_temp_table a
) a
where a.id = 111

In Amazon Redshift, this code execution fails with the following error:

Aggregate window functions with an ORDER BY clause require a frame clause

So, on Redshift, I need to add ROWS BETWEEN section.

The problem is that I cannot find the formula which gives the same result as on PostgreSQL.

What I tried so far:

  1. Without ORDER BY

  2. With a frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

  3. With a frame ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

I have different result with every of these options, and no one of them matches the PostgreSQL solution.

So, what I have to do with this formula to get identical results on PostgreSQL and Redshift?


Solution

  • In PostgreSQL order by in window spec implies between unbounded preceding and current row frame clause. If you remove order by, the frame spans the whole partition. Quoting the doc:

    With ORDER BY, this sets the frame to be all rows from the partition start up through the current row's last ORDER BY peer. Without ORDER BY, this means all rows of the partition are included in the window frame, since all rows become peers of the current row.

    And that's mentioned right after what you were looking for:

    The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

    Your second attempt was the closest, but note that the default is RANGE, not ROWS and I don't see RANGE in Redshift window function doc. If a.id isn't unique, the results will differ as demonstrated here.

    You could try to emulate the RANGE frame clause behaviour with a correlated scalar subquery:

    select distinct a.id,
           (select sum(case when t2.is_batch_empty then 1 else 0 end)
            from my_temp_table t2
            where a.client_id=t2.client_id
              and a.id>=t2.id)
    from my_temp_table a