Divide a dataset by percentage count using sum

I have a table:

create table orders (
  order_id integer,
  order_sum numeric
);

My task is:

Divide all its rows into two parts: first part contains 20% of rows and second part contains 80% of rows.
But at the same time, the first part must contain about 20% of the total order_sum (not exactly 20 but as close to it as possible).

Without p.2, the task solves easily, something like that:

SELECT
  t.order_id,
  t.order_sum,
  case when t.percentage <= 20 then '1' else '2' end as part
FROM
  (
    SELECT
      t.order_id,
      t.order_sum,
      100 * cume_dist() over (order by random()) as percentage
    FROM orders t
  ) t;

But how can I improve/rewrite this to consider the p.2?

Of course, the speed is important.

Solution

With two sliding half-windows

After my exhaustive but memory-wise catastrophic first solution, here is a new, non-exhaustive solution.

How it works

The first step consist (as in my other answer) to identify the window of the contiguous rc20 orders that best match the ideal sum (rc20 = 20 % of row count), over the orders ordered by order_sum (thus left, with small positions, are orders of small amount, while right are big orders).
This window should usually be around the median of our set (the 20 % of rows having an average order_sum should account for 20 % of the sum),
however with some unbalanced orders (one big order of a car and many treats orders), and other discrepancies in the distribution may well make another window appear as the "first best match".

This first pass can use a conventional sum() over (order by order_sum), with a twist to have our position rows between rc20 preceding (does not work, as rc20 depends on the total row count which is not considered constant by PostgreSQL) replaced by a position / rc20 range between 1.0 preceding (… and with a twist in the twist so that we don't divide by exactly rc20, which would return exactly 1.0 for row rc20 rows away from our current row, thus making a window of rc20 + 1 instead of rc20 rows if we count the current row).
This is quick (less than 1 s for 100 k Zegarek-style random orders on my laptop).

Then we split this window in two, and make each semi-window walk through its side of the cut in search for a (still contiguous) semi-window that, combined with the semi-window of the other side, results in a total closer to the ideal.
Thus instead of 6 orders of medium amount, by combining 3 small orders with 3 medium-to-big ones, we get a better result.

We start with two semi-window of half rc20, but will explore disymetric windows too. For example with a full window of 10, we'll try:

5 / 5
6 / 4 (6 small orders combined to 4 bigger) then 4 / 6 (probably 4 smaller and 6 medium-to-big)
7 / 3 and 3 / 7
8 / 2 and 2 / 8
and then we stop, on an arbitrary disymetry limit of 6

To run our two semi-windows, we'll use a PL/pgSQL function which gets the whole dataset as an array, which will iterate as many times as there are orders, but, contrary to an aggregate function, will start from the middle and potentially change direction on each iteration (decrement twice on the left, then increment on right, decrement left, increment 3 times right, etc.), choosing the side to shift as the one staying closer to the ideal sum.
The only recursive part is the choice of the next window size.

Illustrated

So let's say we have identified a window of 3:

-- Ideal set: 3 orders totaling 18.8
-- Quick pass: 3 contiguous orders totaling 20
1 2 3 3 4 4 5[6 6 8]9 9 10 11 13 -- = 20
-- First pass: try window sizes 2 and 1:
1 2 3 3 4 4 5[6 6|8]9 9 10 11 13 -- = 20; next candidates on the left: 20+5-6 = 19, on the right: 20+9-8 = 21, choose left
1 2 3 3 4 4[5 6]6[8]9 9 10 11 13 -- = 19; next candidates on the left: 19+4-6 = 17, on the right: 19+9-8 = 20, choose right
1 2 3 3 4 4[5 6]6 8[9]9 10 11 13 -- = 20; next candidates on the left: 20+4-6 = 18, on the right: 20+9-9 = 20, choose left
1 2 3 3 4[4 5]6 6 8[9]9 10 11 13 -- = 18; next candidates on the left: 18+4-5 = 17, on the right: 18+9-9 = 18, choose right, and so on
-- Then window sizes 1 and 2:
1 2 3 3 4 4 5[6|6 8]9 9 10 11 13 -- = 20; next candidates on the left: 20+5-6 = 19, on the right: 20+9-6 = 23, choose left
1 2 3 3 4 4[5]6[6 8]9 9 10 11 13 -- = 19; next candidates on the left: 19+4-5 = 18, on the right: 19+9-6 = 22, choose left
1 2 3 3 4[4]5 6[6 8]9 9 10 11 13 -- = 18; and so on
-- Finally the 19s will be the best split candidates, which are better than the non-split one. So keep one of those 19s as the final best.

Limitations

Performance: we're of course way beyond the windowed sum() of the first pass, with 100 seconds for 100 k orders (in the default maximum disymetry of 6)
Due to only working with 2 sets of contiguous orders, it will never find the ideal solution if it consists of sparse orders to cherry pick (one small + one medium + one big)
And its results will appear artificial, as they will be composed of two sets of potentially distant orders;
Thus a list of 20 orders may include every order from $1 to $10, then a gap, and then every order from $95 to $117.

Code!

-- Two-headed average maintainer:
-- given an array and a starting index that splits the array in a left and a right part,
-- walk through the array through two windows of respectively nl and nr elements,
-- whose sum should always stay as close as possible to a given reference s.
-- We always shift of 1 element per iteration, either on the left window or the right one.
create or replace function tham(v numeric[], i bigint, nl int, nr int, inout s numeric, out is int[])
language plpgsql
as
$$
    declare
        il int := i - nl;
        ir int := i + nr - 1;
        irma int := array_length(v, 1); -- Index at Right MAx
        rol boolean; -- Right Or Left.
        pd numeric; -- Previous Difference from s
        dl numeric;
        dr numeric;
        -- Best results:
        bd numeric;
        bil int := il;
        bir int := ir;
    begin
        -- @todo If the left window has already reached the start of the array, AND the array is ordered, AND bs >= s, we can stop: we will never get a closer result while adding bigger and bigger numbers.
        --       Same on the left if having reached the end.
        select sum(val) - s into bd from unnest(v[il:ir]) a(val);
        pd := bd;
        while (il > 1 or ir < irma) and bd <> 0 loop
            rol := null;
            if il > 1 then
                dl := pd + v[il - 1] - v[il + nl - 1];
            else
                rol := true;
            end if;
            if ir < irma then
                dr := pd + v[ir + 1] - v[ir - nr + 1];
            else
                rol := false;
            end if;
            if rol is null then
                rol := @dr < @dl;
            end if;
            if rol then
                ir := ir + 1;
                pd = dr;
            else
                il := il - 1;
                pd = dl;
            end if;
            if @pd < @bd then
                bd := pd;
                bil := il;
                bir := ir;
            end if;
        end loop;
        s := s + bd;
        select array_agg(val) into is from (select * from generate_series(bil, bil + nl - 1) union all select * from generate_series(bir - nr + 1, bir)) t(val);
    end;
$$;

with recursive
    -- Row Counts.
    rc as
    (
        select
            count(*) rc100, -- Total Row Count for this dataset.
            round(count(*) / 5.0) rc20, -- Row Count to reach (20 %).
            sum(order_sum) sum, -- Total sum.
            sum(order_sum) / 5 sum20 -- Sum to reach (20 %).
        from orders
    ),
    -- Ordered orders.
    oo as
    (
        select
            order_id, order_sum,
            row_number() over w pos,
            -- As we cannot do a rows between 20 % preceding and current row, we transpose it to an axis where a unit represents 20 % of the rows count: thus we can do a range between 1.0 preceding and current row.
            -- https://stackoverflow.com/a/79670199/1346819
            row_number() over w / (rc20 - 1) pos20 -- Divide by count - 1 instead of count, so that the row at exactly 20 % is not included (it will be 1.1 or so beyond us).
        from orders o join rc on true
        window w as (order by order_sum rows between unbounded preceding and unbounded following)
    ),
    -- Look for the contiguous window of n having its sum closest to the ideal 20 %.
    -- Choosing a contiguous window is a simple heuristic that should return median rows.
    contiguous as
    (
        select
            pos,
            -- case when: the first windows are incomplete, ignore them.
            -- The last rc20 rows will NEVER be part of anything near the first 20 %, so don't try them either.
            case when pos between rc20 and rc100 - 1 then
                abs(sum(order_sum) over w20 - sum20)
            end err -- Initial error.
        from oo join rc on true
        window w20 as (order by pos20 range between 1.0 preceding and current row)
    ),
    bestcontiguous as
    (
        select
 pos, err
        from contiguous
        order by  err
        limit 1
    ),
    asarray as
    (
        select
            array_agg(order_sum order by pos) sums,
            array_agg(order_id order by pos) ids
        from oo
    ),
    -- Now try to see if we find better sets at the extremities, by having two semi-windows going in opposite directions.
    -- For example, if 20 % of rows represent 6 rows, have 3 rows explore the left (= small order_sums) and 3 rows explore the right (= big orders).
    -- Perhaps the triplet of small orders + the triplet of big orders will sum up in 6 rows closer to 20 %.
    -- We're still focusing on contiguous sets of rows (we're looking for orders numbered 3,4,5 and 24,25,26 over a set of 30 orders for example).
    -- We'll explore more and more disymetric sets, for example from a window of 6 we'll try 3/3, then 2/4, 4/2, 1/5, and 5/1.
    -- Note that the size of the wings (nl and nr) are shifted: iteration 0 emits nl and nr for iteration 1, which uses them to call tham (Two-headed average maintainer) with them as parameter but emits tham()'s result for iteration 1 alongside nl and nr computed for iteration 2, and so on.
    wings as
    (
        select
            rc20,
            rc20::int / 2 + rc20::int % 2 nl,
            rc20::int / 2 nr,
            sum20,
            bc.err,
            -- Output positions for diagnosis (to see semi-windows appear); but in the end it's not necessary, only order_id will be of interest.
            (select array_agg(pos) from oo where oo.pos between bc.pos - rc20 + 1 and bc.pos) positions,
            (select array_agg(order_id) from oo where oo.pos between bc.pos - rc20 + 1 and bc.pos) ids,
            -- bestcontiguous has a window of rc20 _before and including_ pos, while tham() (used in the iterative part) awaits its start position being the middle of the window. Make it explicit in the column name (yes its acronym is lip):
            pos lastitempos
        from rc join bestcontiguous bc on true
        union all
        select
            rc20,
            -- On even iterations, we swap left and right (if we explored windows of 4 small and 2 big orders, we'll now look for combinations of 2 small and 4 big orders).
            -- On odd iterations, we increase the disymetry (go from 2 / 4 to 1 / 5).
            nr + case when nl > nr then 0 else 1 end,
            nl - case when nl > nr then 0 else 1 end,
            sum20,
            @(tham.s - sum20),
            (select array_agg(pos) from oo where oo.pos = any(tham.is)) positions,
            (select array_agg(order_id) from oo where oo.pos = any(tham.is)) ids,
            lastitempos
        from
            wings join asarray a on true
            cross join lateral tham(sums, lastitempos - nr + 1, nl, nr, sum20) tham
        where nr > 0 -- Only while not reaching a 0-length wing (a 0-length semi-window means that the other semi-window is in fact the full window, that we already tested in contiguous).
        and nl - nr <= 6 -- Don't try to test too disymetric semi-windows.
        and err <> 0 -- And stop if we've got a perfect match (we won't do better).
    ),
    -- Choose our best match over the set.
    res as
    (
        select *, row_number() over (order by @err) pos
        from wings r1
    )
select
    ids,
    rc.rc100 "total row count",
    rc.rc20 "20 % threshold",
    (select sum(order_sum) from oo where oo.pos between res0.pos - rc.rc20 + 1 and res0.pos) "first draft sum", -- Should be sum20 + res0.err
    (select sum(order_sum) from orders o where o.order_id = any(res.ids)) "final sum", -- Should be sum20 + res.err
    rc.sum20::float "ideal sum"
from res join bestcontiguous res0 on true join rc on true
where res.pos = 1
/*
-- Uncomment the following select (and comment the preceding one, "select ids, rc.rc100 etc.") to get the details instead of the summary:
select o.*, case when o.order_id = any(res.ids) then 1 else 2 end group_number
from orders o
join res on res.pos = 1
order by order_sum -- Allows to see how the one or two blocks of orders with group_number = 1 are contiguous over the order_sum range; but you'll probably want to order by group_number to get your 20 % first, then the 80 %.
*/
;

ids	total row count	20 % threshold	first draft sum	final sum	ideal sum
{19,38,7,31,11,45,47,28,13,49}	50	10	423826.02	485781.89	485387.086

Here is this version running in a fiddle, corresponding to your initial need (20 % of the full dataset),
as well as a test version running multiple tests distinguished by an added test_id column.