sql postgresql deduplication

SQL query to map duplicated entries for data enrichment


I'm fairly new to PostgreSQL.

I'm planning to run a data set of products through Mechanical Turk to enrich it with pricing information. The problem is that I have 80,000 records uploaded by users, many of which are actually duplicates, although their other fields may differ.

If I enrich the data returned by a SELECT DISTINCT query, I won't have a way to add that data back to the actual "duplicate" entries.

How can I see all the rows eliminated by a SELECT DISTINCT query, so that I can go back and enrich those rows with my new data later?


Solution

  • Instead of using DISTINCT, GROUP BY the fields that you want to treat as defining a duplicate.

    Then you have a few options:

    I'm pretty sure there are lots of examples already on how to find and return duplicates. I suggest searching Stack Overflow under the tag for queries to find duplicates.
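
As a rough illustration of the GROUP BY approach, here is a minimal sketch. The table `products` and its columns `id`, `name`, `brand`, and `price` are hypothetical (the question does not show a schema), and it assumes that two rows are duplicates when `name` and `brand` match exactly. The lowest `id` in each group is treated as the canonical row, which is one possible convention, not the only one.

```sql
-- Hypothetical schema: table and column names are assumptions for illustration.
CREATE TEMP TABLE products (
    id    serial PRIMARY KEY,
    name  text NOT NULL,
    brand text NOT NULL,
    price numeric            -- filled in later by the enrichment step
);

INSERT INTO products (name, brand) VALUES
    ('Widget', 'Acme'),
    ('Widget', 'Acme'),
    ('Gadget', 'Acme'),
    ('Widget', 'Acme'),
    ('Gadget', 'Globex');

-- 1. One row per duplicate group: the set to send out for enrichment.
SELECT min(id) AS canonical_id, name, brand, count(*) AS copies
FROM products
GROUP BY name, brand
ORDER BY canonical_id;

-- 2. Map every row (including the ones DISTINCT would have hidden)
--    to the canonical row of its group.
SELECT id,
       min(id) OVER (PARTITION BY name, brand) AS canonical_id
FROM products
ORDER BY id;

-- 3. Once the canonical rows have a price, copy it to their duplicates.
UPDATE products AS p
SET price = c.price
FROM (
    SELECT id,
           min(id) OVER (PARTITION BY name, brand) AS canonical_id
    FROM products
) AS m
JOIN products AS c ON c.id = m.canonical_id
WHERE p.id = m.id
  AND m.canonical_id <> m.id;
```

The first query gives the de-duplicated set to enrich, the second keeps the link from each original row to its group, and the third pushes the enriched values back. If duplicates differ only in case or whitespace, the grouping columns would probably need normalising first, for example with `lower(trim(name))`.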