I'm fairly new to PostgreSQL.
I'm planning on running a data set of products through Mechanical Turk to enrich the data with pricing information. The problem is that I have 80,000 records uploaded by users, many of which are actually duplicates, even though some of their other fields may differ.
If I enrich data based on the output of a SELECT DISTINCT query, the problem is that I won't have any way to add that data back to the "duplicate" entries that were filtered out.
How can I see all the rows eliminated from a SELECT DISTINCT query, such that I can go back and enrich those rows with my new data later?
Instead of using `DISTINCT`, you should `GROUP BY` the fields you want to treat as identifying a duplicate.
Then you have a few options:
`array_agg` the non-grouped columns, so each group keeps a list of the values (or row IDs) that would otherwise be discarded;
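As a minimal sketch of the `array_agg` approach, assuming a hypothetical `products` table with `id`, `name`, and `brand` columns (substitute your own duplicate-defining columns):

```sql
-- One output row per distinct (name, brand) pair, carrying the ids of
-- every underlying row in that group, so enriched data can later be
-- applied back to all of them.
SELECT name,
       brand,
       array_agg(id) AS member_ids
FROM products
GROUP BY name, brand;
```

After enrichment you could update all members of a group at once, e.g. `UPDATE products SET price = ... WHERE id = ANY (...)` with that group's id array.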
Put a `GROUP BY ... HAVING count(...) > 1` query in a subquery in the `FROM` clause, then self-join against the original table on the same columns you grouped by. That'll let you find all rows that have duplicates.
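A sketch of that self-join, again assuming a hypothetical `products(id, name, brand)` table:

```sql
-- The subquery finds (name, brand) pairs that occur more than once;
-- the join then returns every full row belonging to one of those pairs.
SELECT p.*
FROM products AS p
JOIN (
    SELECT name, brand
    FROM products
    GROUP BY name, brand
    HAVING count(*) > 1
) AS dups
  ON p.name = dups.name
 AND p.brand = dups.brand;
```

Note that this returns *all* rows in each duplicated group, including the one you might want to keep as the "original".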
Use a window function, `row_number() OVER (PARTITION BY col1, col2, col3) AS dup_num`, where `col1, col2, col3` are the columns you currently have in the `DISTINCT` query. Then wrap that up as a subquery in `FROM` and filter for `WHERE dup_num > 1`. This approach finds only the duplicate rows, i.e. it excludes one row per group that it treats as the original. To control which row gets treated as the original and which as the duplicates, use an `ORDER BY` clause inside the `OVER (...)` window.
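Putting the window-function approach together, still assuming a hypothetical `products(id, name, brand)` table and keeping the lowest `id` as the original:

```sql
-- Number the rows within each (name, brand) group, ordered by id,
-- so dup_num = 1 marks the row kept as the "original" and every
-- dup_num > 1 row is a duplicate to enrich later.
SELECT *
FROM (
    SELECT p.*,
           row_number() OVER (PARTITION BY name, brand
                              ORDER BY id) AS dup_num
    FROM products AS p
) AS numbered
WHERE dup_num > 1;
```

Changing the `ORDER BY` inside the window (e.g. to a timestamp column) changes which row in each group survives as row number 1.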
There are plenty of existing examples of how to find and return duplicates; I suggest searching Stack Overflow under the postgresql tag for duplicate-finding queries.