
Optimizing a MySQL query for removing duplicates and adjusting foreign key references


I have the following tables. The address table:

CREATE TABLE `address` (
  `id` varchar(255) NOT NULL,
  `city` varchar(255) DEFAULT NULL,
  `street` varchar(255) DEFAULT NULL,
  `house_number` varchar(255) DEFAULT NULL,
  `zip_code` varchar(255) DEFAULT NULL,
  `country` varchar(2) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

and the customer_address table:

CREATE TABLE `customer_address` (
  `customer_id` int DEFAULT NULL,
  `address_id` varchar(255) DEFAULT NULL,
  `id` int NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`),
  KEY `address_id` (`address_id`),
  KEY `customer_id` (`customer_id`),
  CONSTRAINT `customer_address_ibfk_1` FOREIGN KEY (`address_id`) REFERENCES `address` (`id`),
  CONSTRAINT `customer_address_ibfk_2` FOREIGN KEY (`customer_id`) REFERENCES `customer` (`id`)
) ENGINE=InnoDB;

The address table stores addresses, and the customer_address table stores the relation between customer and address; a customer can have multiple addresses, hence the second table. The address table contains duplicate rows (different id but same location), and every row in the address table is referenced by an address_id in the customer_address table.
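For illustration, the amount of duplication can be quantified with a query along these lines (a quick sketch over the same columns as the GROUP BY check mentioned below):

SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT zip_code, city, street, house_number) AS unique_locations
FROM address;
-- here: ca. 900,000 total_rows vs. ca. 390,000 unique_locations
-- (note: COUNT(DISTINCT ...) skips rows where any listed column is NULL)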

I want to remove the duplicates in the address table and adjust the references in the customer_address table to the one remaining (lowest) id per location. I have written the following query, which works; the problem is that it takes forever to execute (estimated 73 hours). There are about 900,000 rows in the address table, and about 390,000 of them are unique (determined with a GROUP BY).

UPDATE customer_address AS ca
SET ca.address_id = (
    SELECT dfa.id
    FROM (SELECT MIN(id) AS id, zip_code, city, street, house_number
          FROM varys_dev.address
          GROUP BY zip_code, city, street, house_number) AS dfa
    JOIN address AS a
      ON dfa.city = a.city AND dfa.zip_code = a.zip_code
         AND dfa.street = a.street AND dfa.house_number = a.house_number
    WHERE a.id = ca.address_id
    LIMIT 1
);

Is there any way to improve that query's performance? I tried indexing the attributes used in the join's ON clause, but that didn't help.


Solution

  • Don't do it in a single pass. Instead, write a loop (in client code) to

    1. find the 'next' few dups,
    2. fix them,
    3. step to the next batch.

    This may still take a day or more, but it won't impact the rest of the system. (A consolidated sketch of the whole loop follows Step 3 below.)

    Step 1

    SELECT b.id AS dup_id, MIN(a.id) AS keep_id
        FROM address AS a
        JOIN address AS b  USING(city, street, house_number, zip_code, country)
        WHERE a.id < b.id
          AND b.id BETWEEN $left_off AND $batch_end
        GROUP BY b.id

    Grouping on b.id and taking MIN(a.id) points every duplicate at the lowest id of its group, and chunking on b.id means a duplicate is never skipped just because its survivor sits in an earlier batch. Since id is a varchar, $left_off + 100 is not meaningful; compute $batch_end as, say, the 100th id after $left_off.

    Step 2

    With that short (possibly empty) list of pairs, fix the links, then delete the now-unreferenced duplicate addresses.
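    A minimal sketch, assuming the (dup_id, keep_id) pairs from Step 1 were stored in a helper table dup_pairs (a made-up name):

    -- Repoint each reference from the duplicate to its survivor.
    UPDATE customer_address AS ca
    JOIN dup_pairs AS dp ON dp.dup_id = ca.address_id
    SET ca.address_id = dp.keep_id;

    -- The duplicates are now unreferenced, so the FK allows deleting them.
    DELETE a
    FROM address AS a
    JOIN dup_pairs AS dp ON dp.dup_id = a.id;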

    Step 3

    $left_off = $batch_end   (then compute the next $batch_end)

    Exit if no more.
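
    Putting the three steps together: the loop is meant as client code, but purely for illustration here is the same loop sketched as a MySQL stored procedure (dedup_addresses and dup_pairs are made-up names; with autocommit on, each statement commits on its own, keeping transactions short, which is the point of chunking):

    DELIMITER $$
    CREATE PROCEDURE dedup_addresses()
    BEGIN
      DECLARE left_off  VARCHAR(255) DEFAULT '';
      DECLARE batch_end VARCHAR(255);
      batches: LOOP
        -- Batch boundary: the 100th id after left_off (NULL near the end).
        SET batch_end = (SELECT id FROM address WHERE id > left_off
                         ORDER BY id LIMIT 1 OFFSET 99);
        IF batch_end IS NULL THEN
          SET batch_end = (SELECT MAX(id) FROM address WHERE id > left_off);
        END IF;
        IF batch_end IS NULL THEN
          LEAVE batches;                -- nothing left to scan
        END IF;

        -- Step 1: find the 'next' few dups.
        CREATE TEMPORARY TABLE dup_pairs AS
          SELECT b.id AS dup_id, MIN(a.id) AS keep_id
            FROM address AS a
            JOIN address AS b USING(city, street, house_number, zip_code, country)
           WHERE a.id < b.id
             AND b.id > left_off AND b.id <= batch_end
           GROUP BY b.id;

        -- Step 2: fix the links, then drop the dups.
        UPDATE customer_address AS ca
          JOIN dup_pairs AS dp ON dp.dup_id = ca.address_id
           SET ca.address_id = dp.keep_id;
        DELETE a FROM address AS a JOIN dup_pairs AS dp ON dp.dup_id = a.id;
        DROP TEMPORARY TABLE dup_pairs;

        -- Step 3: advance to the next batch.
        SET left_off = batch_end;
      END LOOP;
    END$$
    DELIMITER ;

    Run it with CALL dedup_addresses(); and drop it afterwards with DROP PROCEDURE dedup_addresses;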

    When finished, you had better add

    UNIQUE(city, street, house_number, zip_code, country)
    

    to prevent further dups. If adding the index fails, then there are more dups to clean up; continue where you left off.
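
    In concrete MySQL syntax that could look like the following (uq_location is a made-up name; with utf8mb4 varchar(255) columns the full key would exceed InnoDB's 3072-byte index limit, so prefix lengths are assumed here, which makes the constraint slightly stricter than full-value uniqueness):

    ALTER TABLE address
      ADD UNIQUE KEY uq_location
          (city(60), street(60), house_number(20), zip_code(20), country);

    Note that MySQL allows multiple NULLs in a UNIQUE index, so rows with NULL address fields are not constrained by it.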

    More on chunking for DELETE, etc.: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks