postgresql

Divide a dataset by percentage count using sum


I have a table:

create table orders (
  order_id integer,
  order_sum numeric
);

My task is:

  1. Divide all its rows into two parts: first part contains 20% of rows and second part contains 80% of rows.

  2. But at the same time, the first part must contain about 20% of the total order_sum (not exactly 20 but as close to it as possible).

Without p.2, the task solves easily, something like that:

SELECT
  t.order_id,
  t.order_sum,
  case when t.percentage <= 20 then '1' else '2' end as part
FROM
  (
    SELECT
      t.order_id,
      t.order_sum,
      100 * cume_dist() over (order by random()) as percentage
    FROM orders t
  ) t;

But how can I improve/rewrite this to consider the p.2?

Of course, the speed is important.


Solution

  • With two sliding half-windows

    After my exhaustive but memory-wise catastrophic first solution, here is a new, non-exhaustive solution.

    How it works

    The first step consist (as in my other answer) to identify the window of the contiguous rc20 orders that best match the ideal sum (rc20 = 20 % of row count), over the orders ordered by order_sum (thus left, with small positions, are orders of small amount, while right are big orders).
    This window should usually be around the median of our set (the 20 % of rows having an average order_sum should account for 20 % of the sum),
    however with some unbalanced orders (one big order of a car and many treats orders), and other discrepancies in the distribution may well make another window appear as the "first best match".

    This first pass can use a conventional sum() over (order by order_sum), with a twist to have our position rows between rc20 preceding (does not work, as rc20 depends on the total row count which is not considered constant by PostgreSQL) replaced by a position / rc20 range between 1.0 preceding (… and with a twist in the twist so that we don't divide by exactly rc20, which would return exactly 1.0 for row rc20 rows away from our current row, thus making a window of rc20 + 1 instead of rc20 rows if we count the current row).
    This is quick (less than 1 s for 100 k Zegarek-style random orders on my laptop).

    Then we split this window in two, and make each semi-window walk through its side of the cut in search for a (still contiguous) semi-window that, combined with the semi-window of the other side, results in a total closer to the ideal.
    Thus instead of 6 orders of medium amount, by combining 3 small orders with 3 medium-to-big ones, we get a better result.

    We start with two semi-window of half rc20, but will explore disymetric windows too. For example with a full window of 10, we'll try:

    To run our two semi-windows, we'll use a PL/pgSQL function which gets the whole dataset as an array, which will iterate as many times as there are orders, but, contrary to an aggregate function, will start from the middle and potentially change direction on each iteration (decrement twice on the left, then increment on right, decrement left, increment 3 times right, etc.), choosing the side to shift as the one staying closer to the ideal sum.
    The only recursive part is the choice of the next window size.

    Illustrated

    So let's say we have identified a window of 3:

    -- Ideal set: 3 orders totaling 18.8
    -- Quick pass: 3 contiguous orders totaling 20
    1 2 3 3 4 4 5[6 6 8]9 9 10 11 13 -- = 20
    -- First pass: try window sizes 2 and 1:
    1 2 3 3 4 4 5[6 6|8]9 9 10 11 13 -- = 20; next candidates on the left: 20+5-6 = 19, on the right: 20+9-8 = 21, choose left
    1 2 3 3 4 4[5 6]6[8]9 9 10 11 13 -- = 19; next candidates on the left: 19+4-6 = 17, on the right: 19+9-8 = 20, choose right
    1 2 3 3 4 4[5 6]6 8[9]9 10 11 13 -- = 20; next candidates on the left: 20+4-6 = 18, on the right: 20+9-9 = 20, choose left
    1 2 3 3 4[4 5]6 6 8[9]9 10 11 13 -- = 18; next candidates on the left: 18+4-5 = 17, on the right: 18+9-9 = 18, choose right, and so on
    -- Then window sizes 1 and 2:
    1 2 3 3 4 4 5[6|6 8]9 9 10 11 13 -- = 20; next candidates on the left: 20+5-6 = 19, on the right: 20+9-6 = 23, choose left
    1 2 3 3 4 4[5]6[6 8]9 9 10 11 13 -- = 19; next candidates on the left: 19+4-5 = 18, on the right: 19+9-6 = 22, choose left
    1 2 3 3 4[4]5 6[6 8]9 9 10 11 13 -- = 18; and so on
    -- Finally the 19s will be the best split candidates, which are better than the non-split one. So keep one of those 19s as the final best.
    
    Limitations
    Code!
    -- Two-headed average maintainer:
    -- given an array and a starting index that splits the array in a left and a right part,
    -- walk through the array through two windows of respectively nl and nr elements,
    -- whose sum should always stay as close as possible to a given reference s.
    -- We always shift of 1 element per iteration, either on the left window or the right one.
    create or replace function tham(v numeric[], i bigint, nl int, nr int, inout s numeric, out is int[])
    language plpgsql
    as
    $$
        declare
            il int := i - nl;
            ir int := i + nr - 1;
            irma int := array_length(v, 1); -- Index at Right MAx
            rol boolean; -- Right Or Left.
            pd numeric; -- Previous Difference from s
            dl numeric;
            dr numeric;
            -- Best results:
            bd numeric;
            bil int := il;
            bir int := ir;
        begin
            -- @todo If the left window has already reached the start of the array, AND the array is ordered, AND bs >= s, we can stop: we will never get a closer result while adding bigger and bigger numbers.
            --       Same on the left if having reached the end.
            select sum(val) - s into bd from unnest(v[il:ir]) a(val);
            pd := bd;
            while (il > 1 or ir < irma) and bd <> 0 loop
                rol := null;
                if il > 1 then
                    dl := pd + v[il - 1] - v[il + nl - 1];
                else
                    rol := true;
                end if;
                if ir < irma then
                    dr := pd + v[ir + 1] - v[ir - nr + 1];
                else
                    rol := false;
                end if;
                if rol is null then
                    rol := @dr < @dl;
                end if;
                if rol then
                    ir := ir + 1;
                    pd = dr;
                else
                    il := il - 1;
                    pd = dl;
                end if;
                if @pd < @bd then
                    bd := pd;
                    bil := il;
                    bir := ir;
                end if;
            end loop;
            s := s + bd;
            select array_agg(val) into is from (select * from generate_series(bil, bil + nl - 1) union all select * from generate_series(bir - nr + 1, bir)) t(val);
        end;
    $$;
    
    with recursive
        -- Row Counts.
        rc as
        (
            select
                count(*) rc100, -- Total Row Count for this dataset.
                round(count(*) / 5.0) rc20, -- Row Count to reach (20 %).
                sum(order_sum) sum, -- Total sum.
                sum(order_sum) / 5 sum20 -- Sum to reach (20 %).
            from orders
        ),
        -- Ordered orders.
        oo as
        (
            select
                order_id, order_sum,
                row_number() over w pos,
                -- As we cannot do a rows between 20 % preceding and current row, we transpose it to an axis where a unit represents 20 % of the rows count: thus we can do a range between 1.0 preceding and current row.
                -- https://stackoverflow.com/a/79670199/1346819
                row_number() over w / (rc20 - 1) pos20 -- Divide by count - 1 instead of count, so that the row at exactly 20 % is not included (it will be 1.1 or so beyond us).
            from orders o join rc on true
            window w as (order by order_sum rows between unbounded preceding and unbounded following)
        ),
        -- Look for the contiguous window of n having its sum closest to the ideal 20 %.
        -- Choosing a contiguous window is a simple heuristic that should return median rows.
        contiguous as
        (
            select
                pos,
                -- case when: the first windows are incomplete, ignore them.
                -- The last rc20 rows will NEVER be part of anything near the first 20 %, so don't try them either.
                case when pos between rc20 and rc100 - 1 then
                    abs(sum(order_sum) over w20 - sum20)
                end err -- Initial error.
            from oo join rc on true
            window w20 as (order by pos20 range between 1.0 preceding and current row)
        ),
        bestcontiguous as
        (
            select
     pos, err
            from contiguous
            order by  err
            limit 1
        ),
        asarray as
        (
            select
                array_agg(order_sum order by pos) sums,
                array_agg(order_id order by pos) ids
            from oo
        ),
        -- Now try to see if we find better sets at the extremities, by having two semi-windows going in opposite directions.
        -- For example, if 20 % of rows represent 6 rows, have 3 rows explore the left (= small order_sums) and 3 rows explore the right (= big orders).
        -- Perhaps the triplet of small orders + the triplet of big orders will sum up in 6 rows closer to 20 %.
        -- We're still focusing on contiguous sets of rows (we're looking for orders numbered 3,4,5 and 24,25,26 over a set of 30 orders for example).
        -- We'll explore more and more disymetric sets, for example from a window of 6 we'll try 3/3, then 2/4, 4/2, 1/5, and 5/1.
        -- Note that the size of the wings (nl and nr) are shifted: iteration 0 emits nl and nr for iteration 1, which uses them to call tham (Two-headed average maintainer) with them as parameter but emits tham()'s result for iteration 1 alongside nl and nr computed for iteration 2, and so on.
        wings as
        (
            select
                rc20,
                rc20::int / 2 + rc20::int % 2 nl,
                rc20::int / 2 nr,
                sum20,
                bc.err,
                -- Output positions for diagnosis (to see semi-windows appear); but in the end it's not necessary, only order_id will be of interest.
                (select array_agg(pos) from oo where oo.pos between bc.pos - rc20 + 1 and bc.pos) positions,
                (select array_agg(order_id) from oo where oo.pos between bc.pos - rc20 + 1 and bc.pos) ids,
                -- bestcontiguous has a window of rc20 _before and including_ pos, while tham() (used in the iterative part) awaits its start position being the middle of the window. Make it explicit in the column name (yes its acronym is lip):
                pos lastitempos
            from rc join bestcontiguous bc on true
            union all
            select
                rc20,
                -- On even iterations, we swap left and right (if we explored windows of 4 small and 2 big orders, we'll now look for combinations of 2 small and 4 big orders).
                -- On odd iterations, we increase the disymetry (go from 2 / 4 to 1 / 5).
                nr + case when nl > nr then 0 else 1 end,
                nl - case when nl > nr then 0 else 1 end,
                sum20,
                @(tham.s - sum20),
                (select array_agg(pos) from oo where oo.pos = any(tham.is)) positions,
                (select array_agg(order_id) from oo where oo.pos = any(tham.is)) ids,
                lastitempos
            from
                wings join asarray a on true
                cross join lateral tham(sums, lastitempos - nr + 1, nl, nr, sum20) tham
            where nr > 0 -- Only while not reaching a 0-length wing (a 0-length semi-window means that the other semi-window is in fact the full window, that we already tested in contiguous).
            and nl - nr <= 6 -- Don't try to test too disymetric semi-windows.
            and err <> 0 -- And stop if we've got a perfect match (we won't do better).
        ),
        -- Choose our best match over the set.
        res as
        (
            select *, row_number() over (order by @err) pos
            from wings r1
        )
    select
        ids,
        rc.rc100 "total row count",
        rc.rc20 "20 % threshold",
        (select sum(order_sum) from oo where oo.pos between res0.pos - rc.rc20 + 1 and res0.pos) "first draft sum", -- Should be sum20 + res0.err
        (select sum(order_sum) from orders o where o.order_id = any(res.ids)) "final sum", -- Should be sum20 + res.err
        rc.sum20::float "ideal sum"
    from res join bestcontiguous res0 on true join rc on true
    where res.pos = 1
    /*
    -- Uncomment the following select (and comment the preceding one, "select ids, rc.rc100 etc.") to get the details instead of the summary:
    select o.*, case when o.order_id = any(res.ids) then 1 else 2 end group_number
    from orders o
    join res on res.pos = 1
    order by order_sum -- Allows to see how the one or two blocks of orders with group_number = 1 are contiguous over the order_sum range; but you'll probably want to order by group_number to get your 20 % first, then the 80 %.
    */
    ;
    
    ids total row count 20 % threshold first draft sum final sum ideal sum
    {19,38,7,31,11,45,47,28,13,49} 50 10 423826.02 485781.89 485387.086

    Here is this version running in a fiddle, corresponding to your initial need (20 % of the full dataset),
    as well as a test version running multiple tests distinguished by an added test_id column.