sqlpostgresqldistinctranking-functions# Select unique second column per each fist column with minimal value of third column

## Candidates

## Desired result

## The ranking

## The query

Using postgres, is it possible or how to select from table below distinct value of id and id2? :

id | id2 | a | b | a1 | b1 |
---|---|---|---|---|---|

1 | 2 | 150 | 200 | 200 | 200 |

2 | 2 | 150 | 200 | 200 | 200 |

3 | 2 | 150 | 200 | 200 | 200 |

1 | 4 | 150 | 200 | 200 | 200 |

2 | 4 | 150 | 200 | 200 | 200 |

3 | 4 | 150 | 200 | 200 | 200 |

I need to select distinct values of id, such as id2 is also distinct, and the differences between a and b, and then a1 and b2 are minimal.

For the example above, this should be only two rows, since there are only two distinct values in id2

**Correct**

id | id2 | a | b | a1 | b1 |
---|---|---|---|---|---|

1 | 2 | 150 | 200 | 200 | 200 |

2 | 4 | 150 | 200 | 200 | 200 |

I was trying to make it work via
`distinct on(rank() over (order by b - a, a1 - b1, id)`

Even if I try to select twice, like select over select, it doesn't help much, so I'm a bit stuck.

For now I only got this, which is not correct:

**Incorrect**

id | id2 | a | b | a1 | b1 |
---|---|---|---|---|---|

1 | 2 | 150 | 200 | 200 | 200 |

1 | 4 | 150 | 200 | 200 | 200 |

```
with data as (
select 1 id, 1 as id2, 150 a, 200 b, 200 a1, 150 b1
union
select 1 id, 2 as id2, 150 a, 150 b, 200 a1, 200 b1
union
select 1 id, 3 as id2, 150 a, 100 b, 200 a1, 100 b1
union
select 1 id, 4 as id2, 150 a, 150 b, 200 a1, 200 b1
union
select 2 id, 1 as id2, 150 a, 200 b, 200 a1, 150 b1
union
select 2 id, 2 as id2, 150 a, 150 b, 200 a1, 200 b1
union
select 2 id, 3 as id2, 150 a, 100 b, 200 a1, 100 b1
union
select 2 id, 4 as id2, 150 a, 150 b, 200 a1, 200 b1
union
select 3 id, 1 as id2, 150 a, 200 b, 200 a1, 150 b1
union
select 3 id, 4 as id2, 150 a, 150 b, 200 a1, 200 b1
union
select 3 id, 3 as id2, 150 a, 100 b, 200 a1, 100 b1
union
select 3 id, 2 as id2, 150 a, 150 b, 200 a1, 200 b1
)
select *
from data
where b >= a and b1 >= a1
order by id2, b - a, a1 - b1, rank() over (order by id2, id);
```

Solution

You've made a mistake that is quite common unfortunately: You've tried to write a query before getting the algorithm straight. First think through what the query is to do exactly. Only then write the query.

Below I'll try to get the algorithm first, then the query.

In the request comments you have clarified that the table shown is actally two tables. To avoid ambiguities I'll call the column id of the first table id1. This gives us two tables:

- t1 (id1, a, a1)
- t2 (id2, b, b1).

The final result is a combination of rows. In your case one row for (id1, id2) = (1, 2), and another with (id1, id2) = (2, 4). ID1 3 is dismissed from the result set. This means that **a result set does not necessarily include all id1**.

If table t1 contained only id1 = 1 and table t2 only id2 = 1 and 2, we'd have the joined result:

id1 | id2 |
---|---|

1 | 1 |

1 | 2 |

Your requirement is "I need to select distinct values of id1", then it is obvious that we would only include one of the two id2 and dismiss the other. This means that **a result set does not necessarily include all id2** either.

As we join rows from t1 and t2, we can get any a and b combined and any a1 and b1 comnbined. Your query shows that you want to apply two rules here: **b must be greater than or equal to a** and **b1 must be greater than or equal to a1**.

From the following join result

id1 | a | id2 | b | diff |
---|---|---|---|---|

1 | 1 | 4 | 0 | -1 |

1 | 1 | 5 | 1 | 0 |

2 | 2 | 4 | 0 | -2 |

2 | 2 | 5 | 1 | -1 |

3 | 3 | 4 | 0 | -3 |

3 | 3 | 5 | 1 | -2 |

we would dismiss the following IDs from the candidate result rows, because all their joined rows end up with a negative (b-a):

- id1 = 2
- id1 = 3
- id2 = 4

As explained before and shown now, both IDs can have values that get completely dismissed from our candidate list.

After selecting our candidates we get some combinations left. For some id1 this can be a single id2. For some id2 this can be a single id1. And for other IDs there may be many, many rows. For instance:

id1 | a | id2 | b | dif |
---|---|---|---|---|

1 | 1 | 4 | 1 | 0 |

1 | 1 | 5 | 2 | 1 |

1 | 1 | 6 | 3 | 2 |

2 | 2 | 5 | 2 | 0 |

2 | 2 | 6 | 3 | 1 |

3 | 3 | 6 | 3 | 0 |

The **primary rule** that you have stated in the request is that **you only want distinct id1 and distinct id2 in the final result**.

If we chose the third row (1, 6) for id1 = 1, that would dismiss the last row (3, 6) from the results, and we'd lose id1 = 3 completely. This can be avoided by choosing the first row for id1 = 1: (1,4). Considering this you stated a **secondary rule** in the request comments: **You want as many result rows as possible**. So for the joined result table just shown, you'd want a final result of three rows and dismiss any result of only two rows or less.

At last you stated this rule set in your request: "the differences between a and b, and then a1 and b2 are minimal". I suppose this **tertiary rule** is supposed to mean: **you want to get as low a sum of (b-a) as possible**. And the **quaternary rule** is then: **you want to get as low a sum of (b1-a1) as possible**.

Only now have we stated all rules and their priorities, and only now can we start writing the query.

We have learnt that we must combine rows to a final result set and depending on which rows we pick, we may end up with more or less result rows and lower or higher value differences. This means that we must first get all combinations in order to choose from them. For the last mentioned joined table, we'd get these options:

id1 | a | id2 | b | dif |
---|---|---|---|---|

1 | 1 | 4 | 1 | 0 |

2 | 2 | 5 | 2 | 0 |

3 | 3 | 6 | 3 | 0 |

id1 | a | id2 | b | dif |
---|---|---|---|---|

1 | 1 | 4 | 1 | 0 |

2 | 2 | 6 | 3 | 1 |

id1 | a | id2 | b | dif |
---|---|---|---|---|

1 | 1 | 5 | 2 | 1 |

2 | 2 | 6 | 3 | 1 |

id1 | a | id2 | b | dif |
---|---|---|---|---|

1 | 1 | 5 | 2 | 1 |

3 | 3 | 6 | 3 | 0 |

id1 | a | id2 | b | dif |
---|---|---|---|---|

1 | 1 | 6 | 3 | 2 |

2 | 2 | 5 | 2 | 0 |

Now we can rank them, getting the first option on top, because it has more rows than the others. If there were more than one row on top with the same number of rows, we'd have to check further: We'd look up the (b - a) sums and at last the (b1 - a1) sums. Thus we'd either get a single best row or multiple best rows. In the latter case we'd pick one of those top row sets arbitrarily.

In order to produce all the options first, we use a recursive query where we join one row with a second, then with a third, etc. as long as we don't get duplicates. The result is rows with an array of ID pairs and a diff sum each, e.g. result #1 = [(1,4), (2,5), (3,6)], diff_sum = 9. We'll rank the rows as shown, pick the best row and then use it to select the original rows for the final result.

```
with recursive
candidates as
(
select
t1.id1, t1.a, t1.a1,
t2.id2, t2.b, t2.b1
from t1
join t2 on t2.b - t1.a >= 0
and t2.b1 - t1.a1 >= 0
),
combinations (arr_pairs, last_id1, arr_id2, diff1, diff2) as
(
select
array[(id1, id2)],
id1,
array[id2],
b - a,
b1 - a1
from candidates
union all
select
com.arr_pairs || (can.id1, can.id2),
can.id1,
com.arr_id2 || can.id2,
com.diff1 + can.b - can.a,
com.diff2 + can.b1 - can.a1
from combinations com
join candidates can on can.id1 > com.last_id1
and can.id2 != all (com.arr_id2)
),
best_combination as
(
select *
from combinations
order by cardinality(arr_pairs) desc, diff1, diff2
fetch first row only
),
best_combination_rows as
(
select bcr.id1, bcr.id2
from unnest((select arr_pairs from best_combination)) as bcr (id1 int, id2 int)
)
select *
from candidates
where (id1, id2) in (select id1, id2 from best_combination_rows)
order by id1;
```

Demo: https://dbfiddle.uk/LqggdaM_

Disclaimer: I am neither a math guy fluent with combinations, nor a PostgreSQL developer, so it is possible to get this slightly more effective here and there.

- Postgres selecting from table row
- To find the next work day
- SQL Server Left Join With 'Or' Operator
- How to exclusivley select a record from different tables
- Add user to table and use users id while adding data to second table while user registration
- Selecting the earliest row where a specified date value falls between a date range (two columns) when multiple rows may match
- SQL AVG returning an int
- Using two tables. What am I doing wrong here?
- ORDER BY column, with specific notnull value LAST
- How to get the first available number in a range
- In GENERATE_SPREADSHEET SQL on IBM i there are problems if a $ character is present in qualified names
- How to return default value from SQL query
- how to sort a column which contains Semantic Version in Postgres
- Proper Case in Athena SQL
- SQL VBA Query not picking up letter characters, only fully numeric fields
- Joining two tables with a joint result in Postgresql
- Can I Query or Do I Need a Table Function?
- How to link 3 SQL tables and pass 0 value if the value is NULL
- Searching all tables in database containing specific column
- Convert VARCHAR to SMALLDATETIME as part of UPDATE statement
- WordPress - Optimize the Meta Query for 3 meta keys at a time
- How can I find the highest ancestor in a tree?
- Firebird lock and select unlocked fields
- How to pass parameters in a native query JPA
- SQL Server : query columns to JSON object with group by
- SQL Server While loop for last day of month
- Can anyone please explain this solution of hacker rank Binary Tree Nodes?
- generate DDL script from SQL Server database
- How to return name, given an ID Number from same dataset?
- Get the names of all the tables and the views in a single SQL query for SQL Server?