sqlpostgresqldistinctranking-functions

Select unique second column per each fist column with minimal value of third column


Using postgres, is it possible or how to select from table below distinct value of id and id2? :

id id2 a b a1 b1
1 2 150 200 200 200
2 2 150 200 200 200
3 2 150 200 200 200
1 4 150 200 200 200
2 4 150 200 200 200
3 4 150 200 200 200

I need to select distinct values of id, such as id2 is also distinct, and the differences between a and b, and then a1 and b2 are minimal.

For the example above, this should be only two rows, since there are only two distinct values in id2

Correct

id id2 a b a1 b1
1 2 150 200 200 200
2 4 150 200 200 200

I was trying to make it work via distinct on(rank() over (order by b - a, a1 - b1, id) Even if I try to select twice, like select over select, it doesn't help much, so I'm a bit stuck.

For now I only got this, which is not correct:

Incorrect

id id2 a b a1 b1
1 2 150 200 200 200
1 4 150 200 200 200
with data as (
select 1 id, 1 as id2, 150 a, 200 b, 200 a1, 150 b1
union
select 1 id, 2 as id2, 150 a, 150 b, 200 a1, 200 b1
union
select 1 id, 3 as id2, 150 a, 100 b, 200 a1, 100 b1
union
select 1 id, 4 as id2, 150 a, 150 b, 200 a1, 200 b1
union
select 2 id, 1 as id2, 150 a, 200 b, 200 a1, 150 b1
union
select 2 id, 2 as id2, 150 a, 150 b, 200 a1, 200 b1
union
select 2 id, 3 as id2, 150 a, 100 b, 200 a1, 100 b1
union
select 2 id, 4 as id2, 150 a, 150 b, 200 a1, 200 b1
union
select 3 id, 1 as id2, 150 a, 200 b, 200 a1, 150 b1
union
select 3 id, 4 as id2, 150 a, 150 b, 200 a1, 200 b1
union
select 3 id, 3 as id2, 150 a, 100 b, 200 a1, 100 b1
union
select 3 id, 2 as id2, 150 a, 150 b, 200 a1, 200 b1
)
select *
from data
where b >= a  and b1 >= a1
order by  id2, b - a, a1 - b1, rank() over (order by id2, id);

Solution

  • You've made a mistake that is quite common unfortunately: You've tried to write a query before getting the algorithm straight. First think through what the query is to do exactly. Only then write the query.

    Below I'll try to get the algorithm first, then the query.

    In the request comments you have clarified that the table shown is actally two tables. To avoid ambiguities I'll call the column id of the first table id1. This gives us two tables:

    Candidates

    The final result is a combination of rows. In your case one row for (id1, id2) = (1, 2), and another with (id1, id2) = (2, 4). ID1 3 is dismissed from the result set. This means that a result set does not necessarily include all id1.

    If table t1 contained only id1 = 1 and table t2 only id2 = 1 and 2, we'd have the joined result:

    id1 id2
    1 1
    1 2

    Your requirement is "I need to select distinct values of id1", then it is obvious that we would only include one of the two id2 and dismiss the other. This means that a result set does not necessarily include all id2 either.

    As we join rows from t1 and t2, we can get any a and b combined and any a1 and b1 comnbined. Your query shows that you want to apply two rules here: b must be greater than or equal to a and b1 must be greater than or equal to a1.

    From the following join result

    id1 a id2 b diff
    1 1 4 0 -1
    1 1 5 1 0
    2 2 4 0 -2
    2 2 5 1 -1
    3 3 4 0 -3
    3 3 5 1 -2

    we would dismiss the following IDs from the candidate result rows, because all their joined rows end up with a negative (b-a):

    As explained before and shown now, both IDs can have values that get completely dismissed from our candidate list.

    Desired result

    After selecting our candidates we get some combinations left. For some id1 this can be a single id2. For some id2 this can be a single id1. And for other IDs there may be many, many rows. For instance:

    id1 a id2 b dif
    1 1 4 1 0
    1 1 5 2 1
    1 1 6 3 2
    2 2 5 2 0
    2 2 6 3 1
    3 3 6 3 0

    The primary rule that you have stated in the request is that you only want distinct id1 and distinct id2 in the final result.

    If we chose the third row (1, 6) for id1 = 1, that would dismiss the last row (3, 6) from the results, and we'd lose id1 = 3 completely. This can be avoided by choosing the first row for id1 = 1: (1,4). Considering this you stated a secondary rule in the request comments: You want as many result rows as possible. So for the joined result table just shown, you'd want a final result of three rows and dismiss any result of only two rows or less.

    At last you stated this rule set in your request: "the differences between a and b, and then a1 and b2 are minimal". I suppose this tertiary rule is supposed to mean: you want to get as low a sum of (b-a) as possible. And the quaternary rule is then: you want to get as low a sum of (b1-a1) as possible.

    Only now have we stated all rules and their priorities, and only now can we start writing the query.

    The ranking

    We have learnt that we must combine rows to a final result set and depending on which rows we pick, we may end up with more or less result rows and lower or higher value differences. This means that we must first get all combinations in order to choose from them. For the last mentioned joined table, we'd get these options:

    id1 a id2 b dif
    1 1 4 1 0
    2 2 5 2 0
    3 3 6 3 0
    id1 a id2 b dif
    1 1 4 1 0
    2 2 6 3 1
    id1 a id2 b dif
    1 1 5 2 1
    2 2 6 3 1
    id1 a id2 b dif
    1 1 5 2 1
    3 3 6 3 0
    id1 a id2 b dif
    1 1 6 3 2
    2 2 5 2 0

    Now we can rank them, getting the first option on top, because it has more rows than the others. If there were more than one row on top with the same number of rows, we'd have to check further: We'd look up the (b - a) sums and at last the (b1 - a1) sums. Thus we'd either get a single best row or multiple best rows. In the latter case we'd pick one of those top row sets arbitrarily.

    The query

    In order to produce all the options first, we use a recursive query where we join one row with a second, then with a third, etc. as long as we don't get duplicates. The result is rows with an array of ID pairs and a diff sum each, e.g. result #1 = [(1,4), (2,5), (3,6)], diff_sum = 9. We'll rank the rows as shown, pick the best row and then use it to select the original rows for the final result.

    with recursive
      candidates as
      (
        select
          t1.id1, t1.a, t1.a1,
          t2.id2, t2.b, t2.b1
        from t1
        join t2 on t2.b - t1.a >= 0
                and t2.b1 - t1.a1 >= 0
        
      ),
      combinations (arr_pairs, last_id1, arr_id2, diff1, diff2) as
      (
        select
          array[(id1, id2)],
          id1,
          array[id2],
          b - a,
          b1 - a1
        from candidates
        union all
        select
          com.arr_pairs || (can.id1, can.id2),
          can.id1,
          com.arr_id2 || can.id2,
          com.diff1 + can.b - can.a,
          com.diff2 + can.b1 - can.a1
        from combinations com 
        join candidates can on can.id1 > com.last_id1
                              and can.id2 != all (com.arr_id2)
      ),
      best_combination as
      (
        select *
        from combinations
        order by cardinality(arr_pairs) desc, diff1, diff2
        fetch first row only
      ),
      best_combination_rows as
      (
        select bcr.id1, bcr.id2
        from unnest((select arr_pairs from best_combination)) as bcr (id1 int, id2 int)
      )
    select *
    from candidates
    where (id1, id2) in (select id1, id2 from best_combination_rows)
    order by id1;
    

    Demo: https://dbfiddle.uk/LqggdaM_

    Disclaimer: I am neither a math guy fluent with combinations, nor a PostgreSQL developer, so it is possible to get this slightly more effective here and there.