Replication hangs with Relay_Master_Log_File mysql-bin.999999 -wrong variables in mariadb (MySQL) show slave status

Author:

Several days after another .999999 hang out – mariadb (MySQL) stopped flushing relay-bin log after mysqld-relay-bin.999999, the monitoring of one of the slaves got critical with replication delayed. Then several hours it kept delaying without any apparent reason.
The slave status was weird, the

  • “Relay_Master_Log_File: mysql-bin.999999” and
  • Exec_Master_Log_Pos: 104858214” were not changing,

which is essential for the slave server! In fact without these two values we do not know the real position, which is executed in the slave! And we cannot recover the slave if anything happened.

MariaDB [(none)]> show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: srv-master1.local
                  Master_User: rep-user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.1007039
          Read_Master_Log_Pos: 52820624
               Relay_Log_File: mysqld-relay-bin.712675
                Relay_Log_Pos: 52807349
        Relay_Master_Log_File: mysql-bin.999999
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: test
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 104858214
              Relay_Log_Space: 7602615098
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 101
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: No
                  Gtid_IO_Pos: 
      Replicate_Do_Domain_Ids: 
  Replicate_Ignore_Domain_Ids: 
                Parallel_Mode: aggressive
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Reading event from the relay log
1 row in set (0.00 sec)

The SHOW SLAVE STATUS showed:

  1. Replication was definitely running
                Slave_IO_Running: Yes
                Slave_SQL_Running: Yes
    
  2. The “show processlist;” command showed INSERT transactions executing from the slave workers:
    | 920146 | system user |  | database1 | Slave_worker | 0 | update | INSERT INTO `mytable1` (`id`, `username`, `key`, `data`, `bytes | 0.000 |
    | 920149 | system user |  | database2 | Slave_worker | 0 | update | INSERT INTO `mytable1` (`id`, `username`, `key`, `data`, `bytes | 0.000 |
    
  3. As you know even you use the old style replication based on master bin file name and position in it the mariadb(mysql) server internally uses Global Transaction IDs (GTID) to tracking down the replication and for crash recovery issues! Even it is not enabled it (Using_Gtid: No) still it is engaged in the process!
    And you can check it and see how the transaction ID is changing, so really the replication is running even the old style variables are stuck (Relay_Master_Log_File: mysql-bin.999999″ and “Exec_Master_Log_Pos: 104858214):

    MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
    +-------------------------+
    | @@GLOBAL.gtid_slave_pos |
    +-------------------------+
    | 0-101-78054462996       |
    +-------------------------+
    1 row in set (0.00 sec)
    
    MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
    +-------------------------+
    | @@GLOBAL.gtid_slave_pos |
    +-------------------------+
    | 0-102-44185395354       |
    +-------------------------+
    1 row in set (0.00 sec)
    
    MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
    +-------------------------+
    | @@GLOBAL.gtid_slave_pos |
    +-------------------------+
    | 0-101-78054463301       |
    +-------------------------+
    1 row in set (0.00 sec)
    
  4. No rollback transactions
  5. The relay bin logs were filing with data and were rotated as usual:
    srv-slave2 mysql # ls -altr|grep relay
    -rw-rw----    1 mysql mysql         1640 Jan 20 10:06 mysqld-relay-bin.713309
    -rw-rw----    1 mysql mysql          358 Jan 20 10:06 mysqld-relay-bin.713310
    -rw-rw----    1 mysql mysql           78 Jan 20 10:06 mysqld-relay-bin.index
    -rw-rw----    1 mysql mysql         1690 Jan 20 10:06 relay-log.info
    -rw-rw----    1 mysql mysql     64727863 Jan 20 10:06 mysqld-relay-bin.713311
    srv-slave2 mysql # ls -altr|grep relay
    -rw-rw----    1 mysql mysql          593 Jan 20 10:06 mysqld-relay-bin.713312
    -rw-rw----    1 mysql mysql          358 Jan 20 10:06 mysqld-relay-bin.713313
    -rw-rw----    1 mysql mysql           78 Jan 20 10:06 mysqld-relay-bin.index
    -rw-rw----    1 mysql mysql         1690 Jan 20 10:06 relay-log.info
    -rw-rw----    1 mysql mysql     67442357 Jan 20 10:06 mysqld-relay-bin.713314
    

So to summarize:

The replication was running and the replication was consistent, but we cannot find the real slave’s position, because the variables were wrong! They had never changed after: Relay_Master_Log_File: “mysql-bin.999999” and “Exec_Master_Log_Pos: 104858214”

We tried “STOP/START SLAVE;” with no success, the Relay_Master_Log_File: “mysql-bin.999999” remained and Exec_Master_Log_Pos changed a little bit incremented “104859320”, which was again the last position in the master’s bin-log file:

srv-master1 mysql-binlog # mysqlbinlog mysql-bin.999999|tail -n 10 
# at 104858183
#180413  9:51:01 server id 102  end_log_pos 104858214 CRC32 0x634cb3d4  Xid = 238913678385
COMMIT/*!*/;
# at 104858214
#180413  9:51:01 server id 101  end_log_pos 104858262 CRC32 0x1af29770  Rotate to mysql-bin.1000000  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

And the second option to “RESET MASTER”, which indeed would help, as with the problem with relay-bin-log case (here mentioned above) CANNOT BE USED, because we need exactly the two values (Relay_Master_Log_File and Exec_Master_Log_Pos) to recover after the RESET command, which two values are WRONG! If we make “RESET SLAVE” and then “CHANGE MASTER” command with these two values you are going to have replication inconsistency and probably your slave will stop with duplicate entry error.

The solution

Switch the replication to use Global Transaction IDMariaDB and MySQL.
Here is how:

STOP SLAVE;
CHANGE MASTER TO master_use_gtid=slave_pos;
START SLAVE;

Executed in the mysql console.

And here is what happened (real output with changed names). You can see how the “SHOW SLAVE STATUS” changed. First there are:

                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-101-78054700522

and
Relay_Master_Log_File immediately change from the wrong value mysql-bin.999999 to mysql-bin.1007691 and Exec_Master_Log_Pos began changing really fast (not showing only one value as before).

srv-slave2 # mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 920614
Server version: 10.2.17-MariaDB-10.2.17+maria~xenial mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
+-------------------------+
| @@GLOBAL.gtid_slave_pos |
+-------------------------+
| 0-101-78054463301       |
+-------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> STOP SLAVE;
Query OK, 0 rows affected (1.68 sec)

MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
+-------------------------+
| @@GLOBAL.gtid_slave_pos |
+-------------------------+
| 0-101-78054463614       |
+-------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
+-------------------------+
| @@GLOBAL.gtid_slave_pos |
+-------------------------+
| 0-101-78054463614       |
+-------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
+-------------------------+
| @@GLOBAL.gtid_slave_pos |
+-------------------------+
| 0-101-78054463614       |
+-------------------------+
1 row in set (0.01 sec)

MariaDB [(none)]> CHANGE MASTER TO master_use_gtid=slave_pos;
Query OK, 0 rows affected (0.04 sec)

MariaDB [(none)]> START SLAVE;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
+-------------------------+
| @@GLOBAL.gtid_slave_pos |
+-------------------------+
| 0-104-78053919845       |
+-------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
+-------------------------+
| @@GLOBAL.gtid_slave_pos |
+-------------------------+
| 0-101-78054464262       |
+-------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> SHOW SLAVE STATUS\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: srv-master1.local
                  Master_User: rep-user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.1007698
          Read_Master_Log_Pos: 57861886
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 9566981
        Relay_Master_Log_File: mysql-bin.1007691
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: test
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 60692482
              Relay_Log_Space: 740756853
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 111
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 101
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-101-78054700522
      Replicate_Do_Domain_Ids: 
  Replicate_Ignore_Domain_Ids: 
                Parallel_Mode: aggressive
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for room in worker thread event queue
1 row in set (0.00 sec)

ERROR: No query specified

MariaDB [(none)]> SHOW SLAVE STATUS\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: srv-master1.local
                  Master_User: rep-user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.1007704
          Read_Master_Log_Pos: 85833415
               Relay_Log_File: mysqld-relay-bin.000019
                Relay_Log_Pos: 42510046
        Relay_Master_Log_File: mysql-bin.1007697
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: test
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 42509746
              Relay_Log_Space: 924712984
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 112
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 101
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-102-44185395356
      Replicate_Do_Domain_Ids: 
  Replicate_Ignore_Domain_Ids: 
                Parallel_Mode: aggressive
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for room in worker thread event queue
1 row in set (0.00 sec)

ERROR: No query specified

MariaDB [(none)]> status
--------------
mysql  Ver 15.1 Distrib 10.2.17-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2

Connection id:          920614
Current database:
Current user:           root@localhost
SSL:                    Not in use
Current pager:          stdout
Using outfile:          ''
Using delimiter:        ;
Server:                 MariaDB
Server version:         10.2.17-MariaDB-10.2.17+maria~xenial mariadb.org binary distribution
Protocol version:       10
Connection:             Localhost via UNIX socket
Server characterset:    latin1
Db     characterset:    latin1
Client characterset:    latin1
Conn.  characterset:    latin1
UNIX socket:            /var/run/mysqld/mysqld.sock
Uptime:                 40 days 2 hours 23 min 17 sec

Threads: 38  Questions: 69062735474  Slow queries: 0  Opens: 106005039  Flush tables: 1  Open tables: 2000  Queries per second avg: 19933.843
--------------

MariaDB [(none)]

Leave a Reply

Your email address will not be published. Required fields are marked *