mysql slave reset and fixing relay log read failure

Suddenly your slave server reset without a clean shutdown and when it came up again you saw the error of this kind:

2016-02-26 10:41:50 876 [ERROR] Slave SQL: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 1594
2016-02-26 10:41:50 876 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.005014' position 152146793

So what we know, our master server is OK, our slave server was reset by unknown issue, so the problem is only in out slave logs. Mysql sever shows the status with:

mysql> SHOW SLAVE STATUS\G;

There are multiple lines of information, but the most important in our situation is these two lines:

Relay_Master_Log_File: mysql-bin.005014
  Exec_Master_Log_Pos: 152146793

This is the place where the slave server stopped at (as you can see from the logs above, newer versions of MySQL print these two values in the log, but older versions do not print them in the log, so check them with the above command!).
The slave server stopped at file mysql-bin.005014 and position 152146793 and could not continue, because its files are corrupted. We can reset the position issuing a CHANGE MASTER command, which will clean up the relay logs and the slave will start the replication from this position – no data will be lost. Before issuing the following commands save the relay log files, they can be useful if you have later errors. Here is the command:

STOP SLAVE;              
CHANGE MASTER TO
         MASTER_HOST='1.1.1.1',
         MASTER_USER='replusr',
         MASTER_LOG_FILE='mysql-bin.005014',
         MASTER_LOG_POS=152146793;
START SLAVE;

There three commands above

  • Stop the replication in the slave, because the replication is still running and the slave is logging the binary log received from the master
  • Change master command to reset the logs with the right position
  • Start the replication in the slave

The replication must continue without errors!

In some cases after we issue the above commands and the replication starts it immediately stops with error of

Duplicate entry

.

Last_SQL_Errno: 1062
Last_SQL_Error: Error 'Duplicate entry '3918722' for key 'PRIMARY'' on query. Default database: 'testdb'. Query: 'INSERT INTO `testtable` (`tabid`, `tabip`, `stat`, `ins`) VALUES ('83908', '2591777309', '1', NOW())'

So we did everything right, but our replication is again broken? The problem is that when there is such reset, it could happen the autoincrement of the table is reserved,but not used, because the server was reset just in the middle of the insert operation or it could be inserted properly, but the server was reset in the middle of updating the replication metadata! So you have two options:

  • Change the autoincrement value of the table if there is no record with ID of the duplicate entry, just select it:
    SELECT * FROM testdb WHERE id=[ID_FROM_THE_ERROR]
    

    If there is no ID with such value, change the autoicrement of the table with

    ALTER TABLE tbl AUTO_INCREMENT = [ID_FROM_THE_ERROR];
    
  • Skip the duplicate entry query with
    STOP SLAVE;
    SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
    START SLAVE;
    

    or parallel replication use

    STOP SLAVE;
    START SLAVE UNTIL sql_after_mts_gaps;
    SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
    START SLAVE;
    

You could trace the problem reading the relay logs at the position it stopped.
Often there is an issue with the last recorded position, so you should examine why you have duplicate entry. Check if the entry is inserted and if it is, just skip it! But then if you hit again a duplicate entry or another error, you should reinitialize the slave dumping the replicated databases from the master!

Here is the full log of status command, when there is a problem with the corrupted mysql relay logs:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 1.1.1.1
                  Master_User: repluser
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.005014
          Read_Master_Log_Pos: 246696051
               Relay_Log_File: mysqld-relay-bin.009911
                Relay_Log_Pos: 152146956
        Relay_Master_Log_File: mysql-bin.005014
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1594
                   Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 152146793
              Relay_Log_Space: 246698113
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1594
               Last_SQL_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2
                  Master_UUID: ce8a6c29-cf8e-11e5-9d39-000000000001
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 180226 11:54:51
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
1 row in set (0.00 sec)