mysqlreplication

Mysql slave out of sync after crash


We have a "1 master, 1 slave" MySQL setup. We had a sudden power outage that took down the slave. After getting the machine back up, I found that the slave was out of sync with the master:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.0.1
                  Master_User: slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-log.001576
          Read_Master_Log_Pos: 412565824
               Relay_Log_File: mysqld-relay-bin.002671
                Relay_Log_Pos: 6930
        Relay_Master_Log_File: mysql-log.001573
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table: blah.table2
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows event on table blah.info; Can't find record in 'info', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-log.001573, end_log_pos 689031225
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 689030864
              Relay_Log_Space: 2944772417
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows event on table blah.info; Can't find record in 'info', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-log.001573, end_log_pos 689031225
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
1 row in set (0.00 sec)

We're using a binlog format of "ROW", so when I try to use mysqlbinlog to look at the offending row, I don't see anything of use. I don't want to simply set the skip counter, because I think that would throw my table even further out of sync.

Is there anything I can do on the slave that would essentially "roll back" to a given point in time, where I could then reset the master log number, poition, etc? If not, is there anything at all that I can do to get back in sync?


Solution

  • One can usually recover from small discrepancies using pt-table-checksum and pt-table-sync.

    It looks to me like your replica lost its place in the binary log sequence when it crashes. The replica continually writes its last processed binlog event into datadir/relay-log.info, but this file uses buffered writes, so it is susceptible to losing data in a crash.

    That's why Percona Server created a crash-resistant replication feature to store the same replica info in an InnoDB table, to recover from this scenario.

    MySQL 5.6 has implemented a similar feature: you can set relay_log_info_repository=TABLE so the replica saves its state in a crash-resistant way.


    Re your comment:

    Yes, in theory pt-table-sync can fix any amount of replication drift, but it's not necessarily the most efficient way to correct large discrepancies. At some point, it's quicker and more efficient to trash the outdated replica and reinitialize it using a new backup from the master.

    Check out How to setup a replica for replication in 6 simple steps with Percona Xtrabackup.