Got this question for a while and wondering if there is any faster query.
I have a table with multiple entries per ID, and would like to list all columns with different values for the same ID.
ID | Brand | Type |
---|---|---|
1 | Honda | Coupe |
1 | Jeep | SUV |
2 | Ford | Sedan |
2 | Ford | Crossover |
Example for above table:
Rows with ID = 1
have different Brand
and Type
values, so I want one result row for each column.
For ID = 2
there is only one brand, but multiple types, so only one result row for type.
The desired result would be like this.
ID | Difference |
---|---|
1 | Brand |
1 | Type |
2 | Type |
I solved it with below query checking each column with one SELECT
statement and then UNION
it all:
SELECT ID, 'Brand' AS Discrepancy
FROM table
GROUP BY ID
HAVING COUNT(DISTINCT Brand) > 1
UNION
SELECT ID,'Type' AS Discrepancy
FROM table
GROUP BY ID
HAVING COUNT(DISTINCT Type) > 1;
Is there any faster query or optimization?
Your query is good for few rows per ID (except for where it should be UNION
UNION ALL
).
This one is better (improved with a hint from Charlieface):
SELECT t.id, c.difference
FROM (
SELECT id
, min(brand) <> max(brand) AS b_diff
, min(type) <> max(type) AS t_diff
FROM tbl
GROUP BY id
) t
JOIN LATERAL (
VALUES
('Brand', t.b_diff)
, ('Type' , t.t_diff)
) c(difference, diff) ON c.diff
ORDER BY 1, 2; -- optional
A single sequential scan should bring the cost down by almost half.
Plus, avoiding the expensive count(DISTINCT ...)
should help some more.
Test with EXPLAIN ANALYZE
. See:
Note that null values are ignored by either query.
If there are many rows per ID (and an index on each of the tested columns), there are (much) faster options.
If so, and it matters, start a new question providing info for postgresql-performance questions as instructed in the tag description. Postgres version, exact table definition, test case. Most importantly, rough stats about data distribution. And post a comment here to link to the follow-up.