sql postgresql aggregate-functions window-functions greenplum

How to use a SQL window function to calculate a percentage of an aggregate


I need to calculate percentages of various dimensions in a table. I'd like to simplify things by using window functions to calculate the denominator, but I am running into a problem because the numerator has to be an aggregate as well.

As a simple example, take the following table:

create temp table test (d1 text, d2 text, v numeric);
insert into test values ('a','x',5), ('a','y',5), ('a','y',10), ('b','x',20);

If I just want to calculate the share of each individual row within its d1 group, then window functions work fine:

select d1, d2, v/sum(v) over (partition by d1)
from test;

"b";"x";1.00
"a";"x";0.25
"a";"y";0.25
"a";"y";0.50

However, what I need to do is calculate the share of each d2 total within its d1 group. The output I am looking for is this:

"b";"x";1.00
"a";"x";0.25
"a";"y";0.75

So I try this:

select d1, d2, sum(v)/sum(v) over (partition by d1)
from test
group by d1, d2;

However, now I get an error:

ERROR:  column "test.v" must appear in the GROUP BY clause or be used in an aggregate function

I assume it is complaining that the window function is not accounted for in the GROUP BY clause, but window functions cannot be put in the GROUP BY clause anyway.

This is using Greenplum 4.1, which is a fork of PostgreSQL 8.4 and shares the same window functions. Note that Greenplum cannot do correlated subqueries.


Solution

  • I think you are looking for this:

    SELECT d1, d2, sum(v)/sum(sum(v)) OVER (PARTITION BY d1) AS share
    FROM   test
    GROUP  BY d1, d2;
    

    Produces the requested result.

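    Window functions are applied after aggregate functions. The outer sum() in sum(sum(v)) OVER ... is a window function (attached OVER ... clause) while the inner sum() is an aggregate function. This also explains the error in the question: by the time the window function runs, the rows have already been grouped, so a bare v inside sum(v) OVER ... is neither a GROUP BY column nor wrapped in an aggregate.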

    Effectively the same as:

    WITH x AS (
       SELECT d1, d2, sum(v) AS sv
       FROM   test
       GROUP  BY d1, d2
       )
    SELECT d1, d2, sv/sum(sv) OVER (PARTITION BY d1) AS share
    FROM   x;
    

    Or (without CTE):

    SELECT d1, d2, sv/sum(sv) OVER (PARTITION BY d1) AS share
    FROM  (
       SELECT d1, d2, sum(v) AS sv
       FROM   test
       GROUP  BY d1, d2
       ) x;
    

    Or mu's variant.

    Aside: Greenplum introduced correlated subqueries with version 4.2. See release notes.
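
    As a rough illustration of the evaluation order described above, the sketch below exposes the two stages as separate columns. It assumes the test table from the question; the aliases group_sum and d1_total are made up for this example and are not part of the original answer.

    ```sql
    -- Stage 1 (aggregate): sum(v) collapses rows to one per (d1, d2).
    -- Stage 2 (window):    sum(...) OVER adds up those per-group sums
    --                      across all groups sharing the same d1.
    SELECT d1, d2,
           sum(v)                                      AS group_sum,  -- aggregate
           sum(sum(v)) OVER (PARTITION BY d1)          AS d1_total,   -- window over aggregate
           sum(v) / sum(sum(v)) OVER (PARTITION BY d1) AS share
    FROM   test
    GROUP  BY d1, d2
    ORDER  BY d1, d2;

    -- With the sample data from the question (values shown rounded):
    --  d1 | d2 | group_sum | d1_total | share
    --  a  | x  |         5 |       20 |  0.25
    --  a  | y  |        15 |       20 |  0.75
    --  b  | x  |        20 |       20 |  1.00
    ```

    Seeing group_sum and d1_total side by side makes it easier to check that the denominator is built from the already aggregated rows, not from the raw rows of test.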