Recover from split-brain in MySQL 8 InnoDB Cluster

After network connectivity problems with one member of a three-server MySQL InnoDB Cluster, that member dropped out of the replication group and is now reported with status "(MISSING)". It appears that, because of the connectivity issues, two servers were promoted to Primary status and received updates.
The cluster status is OK_NO_TOLERANCE_PARTIAL. The cluster is operational, but one member is missing and the two remaining members cannot tolerate another failure (a group needs at least three active members to survive the loss of one):

[root@db-cluster-1 ~]# mysqlsh
MySQL Shell 8.0.34

Copyright (c) 2016, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
Creating a Classic session to 'root@localhost'
Fetching schema names for auto-completion... Press ^C to stop.
Your MySQL connection id is 26513453
Server version: 8.0.34 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  localhost  JS > \connect clusteradmin@db-cluster-1
Creating a session to 'clusteradmin@db-cluster-1'
Fetching schema names for auto-completion... Press ^C to stop.
Closing old connection...
Your MySQL connection id is 26513465 (X protocol)
Server version: 8.0.34 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  db-cluster-1:33060+ ssl  JS > var cluster = dba.getCluster()
 MySQL  db-cluster-1:33060+ ssl  JS > cluster.status()
{
    "clusterName": "mycluster1", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "db-cluster-1:3306", 
        "ssl": "REQUIRED", 
        "status": "OK_NO_TOLERANCE_PARTIAL", 
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", 
        "topology": {
            "db-cluster-1:3306": {
                "address": "db-cluster-1:3306", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.34"
            }, 
            "db-cluster-2:3306": {
                "address": "db-cluster-2:3306", 
                "instanceErrors": [
                    "ERROR: split-brain! Instance is not part of the majority group, but has state ONLINE"
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "ONLINE", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.34"
            }, 
            "db-cluster-3:3306": {
                "address": "db-cluster-3:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.34"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "db-cluster-1:3306"
}
 MySQL  db-cluster-1:33060+ ssl  JS > 
Bye!
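
The affected member can also be picked out of the status output programmatically. The following is a minimal sketch, not part of the AdminAPI: findProblemMembers and sampleStatus are hypothetical names, and the sample is a trimmed copy of the output above. It is written as plain JavaScript for Node.js; inside MySQL Shell the result of cluster.status() is assumed to behave like a plain object, which was not verified here.

```javascript
// Hypothetical helper: list members whose status entry carries instanceErrors.
// `status` is assumed to have the same shape as the cluster.status() output above.
function findProblemMembers(status) {
  const topology = status.defaultReplicaSet.topology;
  return Object.entries(topology)
    .filter(([, member]) =>
      Array.isArray(member.instanceErrors) && member.instanceErrors.length > 0)
    .map(([address, member]) => ({
      address,
      status: member.status,
      // True when any error text mentions the split-brain condition.
      splitBrain: member.instanceErrors.some((msg) => msg.includes("split-brain")),
      errors: member.instanceErrors,
    }));
}

// Trimmed copy of the status output shown above.
const sampleStatus = {
  defaultReplicaSet: {
    status: "OK_NO_TOLERANCE_PARTIAL",
    topology: {
      "db-cluster-1:3306": { memberRole: "PRIMARY", status: "ONLINE" },
      "db-cluster-2:3306": {
        instanceErrors: [
          "ERROR: split-brain! Instance is not part of the majority group, but has state ONLINE",
        ],
        memberRole: "SECONDARY",
        status: "(MISSING)",
      },
      "db-cluster-3:3306": { memberRole: "SECONDARY", status: "ONLINE" },
    },
  },
};

// Prints ["db-cluster-2:3306"]
console.log(JSON.stringify(findProblemMembers(sampleStatus).map((m) => m.address)));
```

In MySQL Shell, print() would take the place of console.log.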

db-cluster-2 is in a split-brain state: although it is online and reachable from the other members, it does not take part in the replication group, so it has to be recovered.
Here are the steps to recover in this case:

STEP 1) Restart the MySQL service on the member that is in the split-brain situation

In this case, the member instance db-cluster-2:3306 will try to auto-recover with distributed recovery from an ONLINE donor.

After the MySQL service restarts and begins accepting connections, the recovery process is initiated. These are the log lines after the MySQL service is up and running:

2024-06-04T12:07:14.941135Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.34'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server - GPL.
2024-06-04T12:07:14.942656Z 4 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2024-06-04T12:07:14.963083Z 12 [System] [MY-010597] [Repl] 'CHANGE REPLICATION SOURCE TO FOR CHANNEL 'group_replication_applier' executed'. Previous state source_host='<NULL>', source_port= 0, source_log_file='', source_log_pos= 4, source_bind=''. New state source_host='<NULL>', source_port= 0, source_log_file='', source_log_pos= 4, source_bind=''.
2024-06-04T12:07:17.867392Z 0 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2024-06-04T12:07:17.867611Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2024-06-04T12:07:17.867677Z 4 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address db-cluster-1:3306.'
2024-06-04T12:07:17.868191Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to db-cluster-1:3306, db-cluster-2:3306, db-cluster-3:3306 on view 16962717234340524:30.'
2024-06-04T12:07:17.875690Z 24 [System] [MY-010597] [Repl] 'CHANGE REPLICATION SOURCE TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state source_host='<NULL>', source_port= 0, source_log_file='', source_log_pos= 4, source_bind=''. New state source_host='db-cluster-3', source_port= 3306, source_log_file='', source_log_pos= 4, source_bind=''.
2024-06-04T12:07:17.895343Z 25 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the connection metadata repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START REPLICA; see the 'START REPLICA Syntax' in the MySQL Manual for more information.
2024-06-04T12:07:17.990899Z 25 [System] [MY-014002] [Repl] Replica receiver thread for channel 'group_replication_recovery': connected to source 'mysql_innodb_cluster_1934970143@db-cluster-3:3306' with server_uuid=ee9e58c8-6157-11ee-a2fb-a2bd098291c6, server_id=-1773396559. Starting GTID-based replication.
2024-06-04T12:07:19.326152Z 24 [System] [MY-010597] [Repl] 'CHANGE REPLICATION SOURCE TO FOR CHANNEL 'group_replication_recovery' executed'. Previous state source_host='db-cluster-3', source_port= 3306, source_log_file='', source_log_pos= 4, source_bind=''. New state source_host='<NULL>', source_port= 0, source_log_file='', source_log_pos= 4, source_bind=''.
2024-06-04T12:07:19.652416Z 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'

The member instance was recovered successfully and is now part of the replication group.
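
When the error log is long, the two lines that matter can be extracted with a small script. This is only a sketch for Node.js: recoverySummary and sampleLog are hypothetical names, the two sample lines are copied from the log excerpt above, and reading the actual log file is left out because its location depends on the installation.

```javascript
// Hypothetical helper (not part of MySQL): summarize a Group Replication
// recovery from error-log text, using two message codes seen in the log above.
function recoverySummary(logText) {
  const lines = logText.split("\n");
  // MY-013471 announces how distributed recovery will transfer the data.
  const methodLine = lines.find((line) => line.includes("[MY-013471]"));
  // MY-011490 reports that the server was declared online in the group.
  const onlineLine = lines.find((line) => line.includes("[MY-011490]"));
  const methodMatch = methodLine
    ? /transfer data using: ([^']+)'/.exec(methodLine)
    : null;
  return {
    recoveryMethod: methodMatch ? methodMatch[1] : null,
    declaredOnline: onlineLine !== undefined,
    // The timestamp is the first space-separated field of the log line.
    onlineAt: onlineLine ? onlineLine.split(" ")[0] : null,
  };
}

// Two lines copied from the log excerpt above.
const sampleLog = [
  "2024-06-04T12:07:17.867611Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'",
  "2024-06-04T12:07:19.652416Z 0 [System] [MY-011490] [Repl] Plugin group_replication reported: 'This server was declared online within the replication group.'",
].join("\n");

console.log(JSON.stringify(recoverySummary(sampleLog), null, 2));
```

If declaredOnline stays false after the restart, the member has not rejoined the group and the log should be read in full.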

STEP 2) Rescan the cluster if the replication recovery account is not stored in the cluster metadata.

Now the cluster status is OK: with 3 active servers the cluster is ONLINE and can tolerate one failure. Open a MySQL Shell console on one of the members that stayed ONLINE (in this case db-cluster-1, which is also the primary):

[root@db-cluster-1 ~]# mysqlsh
MySQL Shell 8.0.34

Copyright (c) 2016, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
Creating a Classic session to 'root@localhost'
Fetching schema names for auto-completion... Press ^C to stop.
Your MySQL connection id is 26640618
Server version: 8.0.34 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  localhost  JS > \connect clusteradmin@db-cluster-1
Creating a session to 'clusteradmin@db-cluster-1'
Fetching schema names for auto-completion... Press ^C to stop.
Closing old connection...
Your MySQL connection id is 26640637 (X protocol)
Server version: 8.0.34 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  db-cluster-1:33060+ ssl  JS > var cluster = dba.getCluster()
 MySQL  db-cluster-1:33060+ ssl  JS > cluster.status()
{
    "clusterName": "mycluster1", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "db-cluster-1:3306", 
        "ssl": "REQUIRED", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "db-cluster-1:3306": {
                "address": "db-cluster-1:3306", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.34"
            }, 
            "db-cluster-2:3306": {
                "address": "db-cluster-2:3306", 
                "instanceErrors": [
                    "WARNING: The replication recovery account in use by the instance is not stored in the metadata. Use Cluster.rescan() to update the metadata."
                ], 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.34"
            }, 
            "db-cluster-3:3306": {
                "address": "db-cluster-3:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.34"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "db-cluster-1:3306"
}
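
Whether STEP 2 is needed at all can be decided from the instanceErrors text. A minimal sketch for Node.js, assuming the warning keeps mentioning "Cluster.rescan()" as it does in the output above; membersNeedingRescan and statusAfterRestart are hypothetical names, and the sample is a trimmed copy of that output.

```javascript
// Hypothetical helper: addresses whose instanceErrors recommend Cluster.rescan().
function membersNeedingRescan(status) {
  const topology = status.defaultReplicaSet.topology;
  return Object.keys(topology).filter((address) =>
    (topology[address].instanceErrors || []).some((msg) =>
      msg.includes("Cluster.rescan()")));
}

// Trimmed copy of the status output shown above.
const statusAfterRestart = {
  defaultReplicaSet: {
    status: "OK",
    topology: {
      "db-cluster-1:3306": { memberRole: "PRIMARY", status: "ONLINE" },
      "db-cluster-2:3306": {
        instanceErrors: [
          "WARNING: The replication recovery account in use by the instance is not stored in the metadata. Use Cluster.rescan() to update the metadata.",
        ],
        memberRole: "SECONDARY",
        status: "ONLINE",
      },
      "db-cluster-3:3306": { memberRole: "SECONDARY", status: "ONLINE" },
    },
  },
};

const pending = membersNeedingRescan(statusAfterRestart);
console.log(pending.length > 0
  ? "rescan recommended for: " + pending.join(", ")
  : "no rescan needed");
```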

The rescan will fix the recovery account on the second member (db-cluster-2). Continue in the same MySQL Shell session:

 MySQL  db-cluster-1:33060+ ssl  JS > cluster.rescan()
Rescanning the cluster...

Result of the rescanning operation for the 'mycluster1' cluster:
{
    "name": "mycluster1", 
    "newTopologyMode": null, 
    "newlyDiscoveredInstances": [], 
    "unavailableInstances": [], 
    "updatedInstances": []
}

Fixing incorrect recovery account 'mysql_innodb_cluster_1934970143' in instance 'db-cluster-2:3306'
 MySQL  db-cluster-1:33060+ ssl  JS >

The cluster status is now free of errors and warnings, as it should be:

 MySQL  db-cluster-1:33060+ ssl  JS > cluster.status()
{
    "clusterName": "mycluster1", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "db-cluster-1:3306", 
        "ssl": "REQUIRED", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "db-cluster-1:3306": {
                "address": "db-cluster-1:3306", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.34"
            }, 
            "db-cluster-2:3306": {
                "address": "db-cluster-2:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.34"
            }, 
            "db-cluster-3:3306": {
                "address": "db-cluster-3:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.34"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "db-cluster-1:3306"
}
 MySQL  db-cluster-1:33060+ ssl  JS >
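
The two statusText values seen in this article ("NOT tolerant to any failures" with two active members, "can tolerate up to ONE failure" with three) follow from the majority rule of Group Replication: the group keeps working only while more than half of its members can reach each other. The arithmetic can be sketched as follows; quorumSize and toleratedFailures are hypothetical helper names, not AdminAPI functions.

```javascript
// Smallest number of members that still forms a majority of an n-member group.
function quorumSize(n) {
  return Math.floor(n / 2) + 1;
}

// Members that may fail while the rest still form a majority.
function toleratedFailures(n) {
  return n - quorumSize(n);
}

for (const n of [2, 3, 5]) {
  console.log(n + " members: quorum " + quorumSize(n) +
    ", tolerated failures " + toleratedFailures(n));
}
// 2 members: quorum 2, tolerated failures 0
// 3 members: quorum 2, tolerated failures 1
// 5 members: quorum 3, tolerated failures 2
```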

More MySQL InnoDB Cluster articles – https://ahelpme.com/tag/mysql-8-cluster-innodb/.
