MySQL 8 | Any IT here? Help Me!

Recover from split-brain in MySQL and the instance is not part of the majority group

After network connectivity problems with one member of a three-server MySQL InnoDB Cluster, that member dropped out of the replication group with status "MISSING". It appears that, because of the connectivity issues, two servers were promoted to primary status and received updates.
The status of the cluster is OK_NO_TOLERANCE_PARTIAL. The cluster is operational, but one member is missing, so it has no tolerance for further failures: with only two of the three members in the group, losing one more would leave the group without a majority:

[root@db-cluster-1 ~]# mysqlsh
MySQL Shell 8.0.34

Copyright (c) 2016, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
Creating a Classic session to 'root@localhost'
Fetching schema names for auto-completion... Press ^C to stop.
Your MySQL connection id is 26513453
Server version: 8.0.34 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  localhost  JS > \connect clusteradmin@db-cluster-1
Creating a session to 'clusteradmin@db-cluster-1'
Fetching schema names for auto-completion... Press ^C to stop.
Closing old connection...
Your MySQL connection id is 26513465 (X protocol)
Server version: 8.0.34 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  db-cluster-1:33060+ ssl  JS > var cluster = dba.getCluster()
 MySQL  db-cluster-1:33060+ ssl  JS > cluster.status()
{
    "clusterName": "mycluster1", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "db-cluster-1:3306", 
        "ssl": "REQUIRED", 
        "status": "OK_NO_TOLERANCE_PARTIAL", 
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", 
        "topology": {
            "db-cluster-1:3306": {
                "address": "db-cluster-1:3306", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.34"
            }, 
            "db-cluster-2:3306": {
                "address": "db-cluster-2:3306", 
                "instanceErrors": [
                    "ERROR: split-brain! Instance is not part of the majority group, but has state ONLINE"
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "ONLINE", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.34"
            }, 
            "db-cluster-3:3306": {
                "address": "db-cluster-3:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": "applier_queue_applied", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.34"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "db-cluster-1:3306"
}
 MySQL  db-cluster-1:33060+ ssl  JS > 
Bye!
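
The status output above can also be checked programmatically. The following is a minimal sketch, assuming only the JSON shape printed by cluster.status(); the function name findProblemMembers and the trimmed sampleStatus object are made up for illustration and are not part of the AdminAPI.

```javascript
// Sketch: list the members of a cluster.status() result that need attention.
// findProblemMembers is a hypothetical helper, not an AdminAPI function.
function findProblemMembers(status) {
  const topology = status.defaultReplicaSet.topology;
  const problems = [];
  for (const [name, member] of Object.entries(topology)) {
    const errors = member.instanceErrors || [];
    // A member is reported if it is not ONLINE or carries any instance error.
    if (member.status !== "ONLINE" || errors.length > 0) {
      problems.push({
        name: name,
        status: member.status,
        splitBrain: errors.some((e) => e.includes("split-brain")),
      });
    }
  }
  return problems;
}

// Trimmed copy of the status output shown above.
const sampleStatus = {
  defaultReplicaSet: {
    status: "OK_NO_TOLERANCE_PARTIAL",
    topology: {
      "db-cluster-1:3306": { memberRole: "PRIMARY", status: "ONLINE" },
      "db-cluster-2:3306": {
        instanceErrors: [
          "ERROR: split-brain! Instance is not part of the majority group, but has state ONLINE",
        ],
        memberRole: "SECONDARY",
        status: "(MISSING)",
      },
      "db-cluster-3:3306": { memberRole: "SECONDARY", status: "ONLINE" },
    },
  },
};

console.log(findProblemMembers(sampleStatus));
```

In MySQL Shell (JS mode), the same function could presumably be fed the live result, as in findProblemMembers(cluster.status()).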


Recover MySQL InnoDB Cluster and Dba.rebootClusterFromCompleteOutage: Argument #2: Invalid options: primary (ArgumentError)

Recent versions of MySQL 8 added more options to the rebootClusterFromCompleteOutage function. Definitely check the function's manual: most of the handy options of the second argument were implemented in MySQL 8.0.30. Here, the user's MySQL InnoDB Cluster crashed and rebootClusterFromCompleteOutage should be used, but it outputs an error like this:

 MySQL  db-cluster-1:33060+ ssl  JS > var cluster = dba.rebootClusterFromCompleteOutage()
Restoring the cluster 'mycluster1' from complete outage...

Dba.rebootClusterFromCompleteOutage: Target member is in state ERROR (RuntimeError)

Calling the function with the node that was healthy before the crash as the primary fails with an error, too:

 MySQL  db-cluster-1:33060+ ssl  JS > var cluster = dba.rebootClusterFromCompleteOutage("mycluster1", {primary: "db-cluster-1:3306"});
Dba.rebootClusterFromCompleteOutage: Argument #2: Invalid options: primary (ArgumentError)

So no cluster is available, and the database and its data are inaccessible.
Indeed, the initial state of the cluster was really bad: before the restart, two of the three servers were missing or in a bad state.

[root@db-cluster-1 ~]# mysqlsh
MySQL Shell 8.0.28

Copyright (c) 2016, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
 MySQL  JS > \connect clusteradmin@db-cluster-1
Creating a session to 'clusteradmin@db-cluster-1'
Fetching schema names for autocompletion... Press ^C to stop.
Your MySQL connection id is 241708346 (X protocol)
Server version: 8.0.28 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  db-cluster-1:33060+ ssl  JS > var cluster = dba.getCluster()
 MySQL  db-cluster-1:33060+ ssl  JS > cluster.status()
{
    "clusterName": "mycluster1", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "db-cluster-1:3306", 
        "ssl": "REQUIRED", 
        "status": "OK_NO_TOLERANCE", 
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.", 
        "topology": {
            "db-cluster-1:3306": {
                "address": "db-cluster-1:3306", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.28"
            }, 
            "db-cluster-2:3306": {
                "address": "db-cluster-2:3306", 
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "OFFLINE", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.28"
            }, 
            "db-cluster-3:3306": {
                "address": "db-cluster-3:3306", 
                "instanceErrors": [
                    "ERROR: GR Recovery channel receiver stopped with an error: error connecting to master 'mysql_innodb_cluster_2324239842@db-cluster-1:3306' - retry-time: 60 retries: 1 message: Access denied for user 'mysql_innodb_cluster_2324239842'@'10.10.10.11' (using password: YES) (1045) at 2023-09-19 04:37:00.076960", 
                    "ERROR: group_replication has stopped with an error."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "ERROR", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.28"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "db-cluster-1:3306"
}

The problem here is that the MySQL version is 8.0.28. Starting with MySQL 8.0.30, the second argument of rebootClusterFromCompleteOutage accepts many more options, including which server should be considered the primary and therefore healthy. In fact, the updated rebootClusterFromCompleteOutage of MySQL 8.0.34 even auto-detected the correct, healthy node and booted the MySQL InnoDB Cluster.
The update from MySQL 8.0.28 to MySQL 8.0.34 went without problems, and once MySQL 8.0.34 started, rebootClusterFromCompleteOutage reconfigured and started the cluster with the healthy server auto-detected. Still, it is safer to pass the second argument and name the healthy server explicitly: {primary: "db-cluster-1:3306"}.
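
The version dependency described above can be sketched as a small guard. This is only an illustration under the article's statement that the primary option arrived in 8.0.30; the helper names isAtLeast and rebootOptions are hypothetical and not part of MySQL Shell.

```javascript
// Sketch: only pass the "primary" option when the version supports it.
// isAtLeast and rebootOptions are hypothetical helpers for illustration.
function isAtLeast(version, minimum) {
  const a = version.split(".").map(Number);
  const b = minimum.split(".").map(Number);
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const x = a[i] || 0;
    const y = b[i] || 0;
    if (x !== y) return x > y;
  }
  return true; // versions are equal
}

function rebootOptions(version, primary) {
  // Assumption taken from the text: "primary" is accepted from 8.0.30 on.
  return isAtLeast(version, "8.0.30") ? { primary: primary } : {};
}

console.log(rebootOptions("8.0.28", "db-cluster-1:3306"));
console.log(rebootOptions("8.0.34", "db-cluster-1:3306"));

// In MySQL Shell (JS mode) the result would be used along these lines:
//   dba.rebootClusterFromCompleteOutage("mycluster1", rebootOptions(v, "db-cluster-1:3306"))
```

On 8.0.28 the guard returns an empty options object, which avoids the ArgumentError but also leaves the choice of primary to the function, so upgrading first remains the better route.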

Switch to a new master (primary) in MySQL InnoDB Cluster 8

Switching to a new master (or new primary, to use the new naming) in a MySQL 8 InnoDB Cluster is simple with the MySQL Shell console and the setPrimaryInstance function of the cluster object.

Why would someone need to do it manually? One reason may be that one of the nodes is on the same physical server and is thus expected to have lower latency.

First, get a cluster object by connecting to the cluster with MySQL Shell:

[root@db-cluster-1 ~]# mysqlsh
MySQL Shell 8.0.28

Copyright (c) 2016, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
 MySQL  JS > \connect clusteradmin@db-cluster-1
Creating a session to 'clusteradmin@db-cluster-1'
Fetching schema names for autocompletion... Press ^C to stop.
Your MySQL connection id is 166928419 (X protocol)
Server version: 8.0.28 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  db-cluster-1:33060+ ssl  JS > var cluster = dba.getCluster()

Second, show the status of the cluster to get the cluster topology and the exact node names, one of which will be used as the argument of setPrimaryInstance. Still in the MySQL Shell console:

 MySQL  db-cluster-1:33060+ ssl  JS > cluster.status()
{
    "clusterName": "mycluster1", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "db-cluster-2:3306", 
        "ssl": "REQUIRED", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "db-cluster-1:3306": {
                "address": "db-cluster-1:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.28"
            }, 
            "db-cluster-2:3306": {
                "address": "db-cluster-2:3306", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.28"
            }, 
            "db-cluster-3:3306": {
                "address": "db-cluster-3:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.28"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "db-cluster-2:3306"
}
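
With the node names from the status output, the switch itself is a single setPrimaryInstance call. The sketch below wraps that call in a few sanity checks; switchPrimary and makeStubCluster are hypothetical names, and the stub only imitates the two cluster methods used here so the example runs outside MySQL Shell.

```javascript
// Sketch: switch the primary only if the target is a known, ONLINE member.
// switchPrimary is a hypothetical wrapper around cluster.setPrimaryInstance().
function switchPrimary(cluster, target) {
  const rs = cluster.status().defaultReplicaSet;
  const member = rs.topology[target];
  if (!member) throw new Error(target + " is not in the cluster topology");
  if (member.status !== "ONLINE") throw new Error(target + " is " + member.status + ", not ONLINE");
  if (rs.primary === target) return false; // already the primary, nothing to do
  cluster.setPrimaryInstance(target);
  return true;
}

// Stand-in for the object returned by dba.getCluster(), for illustration only.
function makeStubCluster(primary) {
  const members = ["db-cluster-1:3306", "db-cluster-2:3306", "db-cluster-3:3306"];
  let current = primary;
  return {
    status() {
      const topology = {};
      for (const m of members) {
        topology[m] = { address: m, status: "ONLINE", memberRole: m === current ? "PRIMARY" : "SECONDARY" };
      }
      return { defaultReplicaSet: { primary: current, topology: topology } };
    },
    setPrimaryInstance(target) {
      current = target;
    },
  };
}

const stubCluster = makeStubCluster("db-cluster-2:3306");
console.log(switchPrimary(stubCluster, "db-cluster-1:3306"));
console.log(stubCluster.status().defaultReplicaSet.primary);
```

In MySQL Shell the real cluster object from dba.getCluster() would take the place of the stub.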


Viewing the progress of MySQL 8 Cluster InnoDB recovery

This article shows several handy MySQL commands for viewing the progress of a MySQL 8 InnoDB Cluster recovery, so administrators can keep track of how much time a MySQL InnoDB Cluster node will need to complete the recovery procedure.

If the reader needs to recover from a node failure, there is another article: Recovery of MySQL 8 Cluster instance after server crash and corrupted data in log event. In this article, the MySQL commands are executed on CentOS Stream 8 with a MySQL 8 InnoDB Cluster (here is how it is installed: Install and deploy MySQL 8 InnoDB Cluster with 3 nodes under CentOS 8 and MySQL Router for HA), in which one node had been offline for several weeks. The node was powered off normally, so the MySQL instance on the server was shut down gracefully.
Initially, after the power-down, the cluster had two online nodes and one missing node.
Use the MySQL Shell console to view the MySQL 8 InnoDB Cluster status:

[root@db-cluster-1 ~]# mysqlsh
MySQL Shell 8.0.28

Copyright (c) 2016, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
 MySQL  JS > \connect clusteradmin@db-cluster-1
Creating a session to 'clusteradmin@db-cluster-1'
Fetching schema names for autocompletion... Press ^C to stop.
Your MySQL connection id is 158633505 (X protocol)
Server version: 8.0.28 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  db-cluster-1:33060+ ssl  JS > var cluster = dba.getCluster()
 MySQL  db-cluster-1:33060+ ssl  JS > cluster.status()
{
    "clusterName": "mycluster1", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "db-cluster-2:3306", 
        "ssl": "REQUIRED", 
        "status": "OK_NO_TOLERANCE", 
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", 
        "topology": {
            "db-cluster-1:3306": {
                "address": "db-cluster-1:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.28"
            }, 
            "db-cluster-2:3306": {
                "address": "db-cluster-2:3306", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.28"
            }, 
            "db-cluster-3:3306": {
                "address": "db-cluster-3:3306", 
                "memberRole": "SECONDARY", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "shellConnectError": "MySQL Error 2003: Could not open connection to 'db-cluster-3:3306': Can't connect to MySQL server on 'db-cluster-3:3306' (111)", 
                "status": "(MISSING)"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "db-cluster-2:3306"
}
 MySQL  db-cluster-1:33060+ ssl  JS >

When the third server in the cluster is started (after a clean shutdown), the third node enters a recovery state. The same status command, again in the MySQL Shell console, then shows the third node in recovery.
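
One simple way to follow the recovery is to poll the status until the node reports ONLINE. The following is a rough sketch, not the commands of the original article: waitUntilOnline is a hypothetical helper, and the fake status source only imitates a node that goes from (MISSING) through RECOVERING to ONLINE.

```javascript
// Sketch: poll a status source until a member is ONLINE.
// Returns the number of polls needed, or -1 if maxPolls was exhausted.
function waitUntilOnline(getStatus, address, sleep, maxPolls) {
  for (let poll = 1; poll <= maxPolls; poll++) {
    const member = getStatus().defaultReplicaSet.topology[address];
    const state = member ? member.status : "(unknown)";
    if (state === "ONLINE") return poll;
    sleep(); // the caller decides how long to wait between polls
  }
  return -1;
}

// Fake status source for illustration: the node recovers over several polls.
const fakeStates = ["(MISSING)", "RECOVERING", "RECOVERING", "ONLINE"];
let fakeIndex = 0;
function fakeGetStatus() {
  const state = fakeStates[Math.min(fakeIndex++, fakeStates.length - 1)];
  return { defaultReplicaSet: { topology: { "db-cluster-3:3306": { status: state } } } };
}

const pollsNeeded = waitUntilOnline(fakeGetStatus, "db-cluster-3:3306", function () {}, 10);
console.log(pollsNeeded);
```

In MySQL Shell the call could look like waitUntilOnline(() => cluster.status(), "db-cluster-3:3306", () => os.sleep(5), 1000), assuming os.sleep is available in the Shell's JS mode.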

Recovery of MySQL 8 Cluster instance after server crash and corrupted data in log event

There is a MySQL 8 InnoDB Cluster of three servers, and one of the servers crashed because of bad RAM. The same setup is described here: Install and deploy MySQL 8 InnoDB Cluster with 3 nodes under CentOS 8 and MySQL Router for HA. The failed server restarted without a clean shutdown, and after booting up, the MySQL Cluster node tried to recover automatically, but the recovery process failed and the node left the group of three servers:

2022-05-31T04:00:00.322469Z 24 [ERROR] [MY-011620] [Repl] Plugin group_replication reported: 'Fatal error during the incremental recovery process of Group Replication. The server will leave the group.'
2022-05-31T04:00:00.322489Z 24 [Warning] [MY-011645] [Repl] Plugin group_replication reported: 'Skipping leave operation: concurrent attempt to leave the group is on-going.'
2022-05-31T04:00:00.322500Z 24 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2022-05-31T04:00:03.448475Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'

The recovery process proposed here follows these steps:

Connect with mysqlsh (MySQL Shell) to a MySQL instance that is currently part of the cluster group. The member that left the group is no longer part of it, though the MySQL Cluster status still shows it in the cluster topology, with an error.
Remove the bad instance from the MySQL Cluster with removeInstance.
Add the instance with addInstance, and the recovery process will kick in. The type of recovery will be chosen by the setup if not specified. In this case, the setup chose incremental state recovery over (full) clone mode.
Initiate the cluster rescan operation to recover the group replication and the MySQL Cluster.
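
The steps above can be sketched as one sequence of AdminAPI calls. This is a minimal illustration, not the exact commands of the original session: rejoinFailedInstance and makeRecordingCluster are hypothetical names, the force and recoveryMethod options are passed only as assumptions about what a given situation needs, and the recording stub stands in for the real cluster object so the example runs outside MySQL Shell.

```javascript
// Sketch: remove the failed instance, add it back, then rescan the cluster.
// rejoinFailedInstance is a hypothetical wrapper around three AdminAPI calls.
function rejoinFailedInstance(cluster, instance, options) {
  const opts = options || {};
  // "force" may be needed when the instance cannot be reached (assumption).
  cluster.removeInstance(instance, { force: opts.force === true });
  // Leave recoveryMethod unset to let the setup choose, as in the text.
  const addOptions = opts.recoveryMethod ? { recoveryMethod: opts.recoveryMethod } : {};
  cluster.addInstance(instance, addOptions);
  cluster.rescan();
}

// Stand-in for the object returned by dba.getCluster(); it only records calls.
function makeRecordingCluster() {
  const calls = [];
  return {
    calls: calls,
    removeInstance(instance, o) { calls.push(["removeInstance", instance, o]); },
    addInstance(instance, o) { calls.push(["addInstance", instance, o]); },
    rescan() { calls.push(["rescan"]); },
  };
}

const recordingCluster = makeRecordingCluster();
rejoinFailedInstance(recordingCluster, "clusteradmin@db-cluster-3:3306", { recoveryMethod: "incremental" });
console.log(recordingCluster.calls.map((c) => c[0]).join(" -> "));
```

In a real MySQL Shell session the calls may prompt for confirmation or a password, so running them one by one, as the article does, keeps each step visible.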


Summary of the recovery process

The recovery process was successful.
The distributed recovery with incremental state recovery took 24 hours for a 200 MB database, which is really strange; the speed was really bad. The instance uses ordinary disks, not SSDs, and a 1 Gbps network.
There was no need to change or manage the MySQL Router in any of the steps or recovery stages. It handled the situation from the very beginning by removing the bad instance and adding it back only after the recovery process had finished successfully.
MySQL Shell should be connected to a healthy instance that is currently part of the cluster.

In the console output logs all commands and important lines are highlighted.

STEP 1) Remove the bad instance from the cluster.

The status of the cluster with the bad instance.

[root@db-cluster-3 ~]# mysqlsh
MySQL Shell 8.0.28

Copyright (c) 2016, 2022, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
 MySQL  JS > \connect clusteradmin@db-cluster-1
Creating a session to 'clusteradmin@db-cluster-1'
Fetching schema names for autocompletion... Press ^C to stop.
Closing old connection...
Your MySQL connection id is 39806649 (X protocol)
Server version: 8.0.28 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.
 MySQL  db-cluster-1:33060+ ssl  JS > var cluster = dba.getCluster()
 MySQL  db-cluster-1:33060+ ssl  JS > cluster.status()
{
    "clusterName": "mycluster1", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "db-cluster-1:3306", 
        "ssl": "REQUIRED", 
        "status": "OK_NO_TOLERANCE", 
        "statusText": "Cluster is NOT tolerant to any failures. 1 member is not active.", 
        "topology": {
            "db-cluster-1:3306": {
                "address": "db-cluster-1:3306", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.28"
            }, 
            "db-cluster-2:3306": {
                "address": "db-cluster-2:3306", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLag": null, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.28"
            }, 
            "db-cluster-3:3306": {
                "address": "db-cluster-3:3306", 
                "instanceErrors": [
                    "ERROR: group_replication has stopped with an error."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "ERROR", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.28"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "groupInformationSourceMember": "db-cluster-1:3306"
}
