I have a table with some "functionally duplicate" records - different IDs, but the 4 columns of "user data" (of even more columns) are identical. I've got a query working that will select all records that have such duplicates.
Now I want to select, from each group of duplicates, first any of them that have column A
not null - and I've verified from the data that there are at most 1 such rows per group - and if there are none in this particular group, then the minimum of column ID
.
How do I select that? I can't exactly use a non-aggregate in the THEN of a CASE and an aggregate in the ELSE. E.g. this doesn't work:
SELECT CASE
WHEN d.A IS NULL THEN d.ID
ELSE MIN(d.ID) END,
d.B,
d.C,
d.E,
d.F
FROM TABLE T
JOIN (my duplicate query here) D ON T.B=D.B
AND T.C=D.C
AND T.E=D.E
AND T.F=D.F
GROUP BY T.B,
T.C,
T.E,
T.F
Error being:
column A must appear in the GROUP BY clause or be used in an aggregate function.
This can be radically simpler:
SELECT DISTINCT ON (b, c, e, f)
b, c, e, f, id -- add more columns freely
FROM (<duplicate query here>) sub
ORDER BY b, c, e, f, (a IS NOT NULL), id
Your duplicate query has all columns. No need to JOIN
to the base table again.
Use the Postgres extension of the standard SQL DISTINCT
: DISTINCT ON
:
Postgres has a proper boolean type. You can ORDER BY
boolean expression directly. The sequence is FALSE
(0), TRUE
(1), NULL
(NULL). If a is NULL, this expression is FALSE
and sorts first: (a IS NOT NULL)
. The rest is ordered by id
. Voilá.
Selection of ID
happens automatically. According to your description you want the ID of the row selected in this query. Nothing more to do.
You can probably integrate this into your duplicate query directly.