CentOS 7 server hangs on boot after deleting a software RAID (mdadm device)

We have a CentOS 7 server with a simple setup: two hard drives in RAID1, with four md devices in total for boot, root, swap and storage. The storage device (/dev/md5) was removed and recreated as RAID0 for better performance, because the server had been repurposed as a cache-only server. Then the server was restarted and it never came back up.
On the IPMI KVM it started loading the kernel and hung after several seconds without any additional information:

The kernel assembles the mdadm devices and then does not continue, because the device md5 is missing.

CentOS 7 kernel loading the mdadm RAID devices

To boot successfully you must remove the missing device

In the GRUB 2 menu, press “e” to open the edit screen, where you can change any line of the boot entry if you need to. Remove the rd.md.uuid parameter of the array you deleted (in our case, the last one), then press Ctrl+x to boot the kernel.

Grub 2 edit

To make the fix permanent, you have two options:

  • OPTION 1) Remove the rd.md.uuid option of your old mdadm device.
  • OPTION 2) Replace the ID in rd.md.uuid= with the ID of the new mdadm device.

Either option solves the boot problem. Edit /etc/default/grub, replace or remove the rd.md.uuid, and regenerate grub.cfg.
You can find the old mdadm ID in /etc/mdadm.conf (if you have not replaced it there yet).

[root@srv ~]# cat /etc/mdadm.conf 
ARRAY /dev/md2 level=raid1 num-devices=2 metadata=0.90 UUID=9c08f218:cd5c0f8f:d96bc0d1:57b77e99
ARRAY /dev/md3 level=raid1 num-devices=2 metadata=1.2 name=2035110:swap UUID=1f74a2e0:757bfb9f:9c860e50:325f37cb
ARRAY /dev/md4 level=raid1 num-devices=2 metadata=1.2 name=2035110:root UUID=29bf4aa8:b7dae21a:45f4c188:baea4c13
ARRAY /dev/md5 level=raid1 num-devices=2 metadata=1.2 name=2035110:storage1 UUID=e6eb2590:b767be36:c76bb869:45ff0c3c
[root@srv ~]# mdadm --detail --scan
ARRAY /dev/md2 metadata=0.90 UUID=9c08f218:cd5c0f8f:d96bc0d1:57b77e99
ARRAY /dev/md3 metadata=1.2 name=2035110:swap UUID=1f74a2e0:757bfb9f:9c860e50:325f37cb
ARRAY /dev/md4 metadata=1.2 name=2035110:root UUID=29bf4aa8:b7dae21a:45f4c188:baea4c13
ARRAY /dev/md/5 metadata=1.2 name=s2035110:5 UUID=901074eb:16ba7c5b:0af69934:e9444102
[root@srv ~]# mdadm --detail --scan > /etc/mdadm.conf 
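
The comparison above (UUIDs on the kernel command line versus UUIDs of the arrays that actually exist) can also be done mechanically. The following is only a minimal sketch: the function name stale_md_uuids is made up for illustration, and it assumes the UUIDs are written in the same colon-separated form in both places, as they are in the listings above.

```shell
#!/bin/bash
# stale_md_uuids CMDLINE SCAN_OUTPUT   (hypothetical helper, not part of mdadm)
# Prints every rd.md.uuid value found in CMDLINE that has no matching
# UUID= entry in SCAN_OUTPUT (the text printed by "mdadm --detail --scan").
stale_md_uuids() {
    local cmdline=$1 scan=$2 uuid
    # Pull out each rd.md.uuid=<hex:hex:...> token and keep only the value.
    for uuid in $(printf '%s\n' "$cmdline" \
                  | grep -o 'rd\.md\.uuid=[0-9a-fA-F:]*' | cut -d= -f2); do
        # -F: fixed string, -q: quiet. Report the UUID only when it is absent.
        printf '%s\n' "$scan" | grep -qF "UUID=$uuid" || printf '%s\n' "$uuid"
    done
}

# Typical use on the real system (not run here):
#   stale_md_uuids "$(grep GRUB_CMDLINE_LINUX /etc/default/grub)" \
#                  "$(mdadm --detail --scan)"
```

Any UUID it prints is a candidate for removal or replacement in /etc/default/grub. Running it before the reboot would have flagged the old md5 UUID.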

Here is our old /etc/default/grub:

[root@srv ~]# cat /etc/default/grub 
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial --speed=115200"
GRUB_CMDLINE_LINUX="rd.md.uuid=9c08f218:cd5c0f8f:d96bc0d1:57b77e99 rd.md.uuid=1f74a2e0:757bfb9f:9c860e50:325f37cb rd.md.uuid=29bf4aa8:b7dae21a:45f4c188:baea4c13 rd.md.uuid=e6eb2590:b767be36:c76bb869:45ff0c3c console=tty0 crashkernel=auto console=ttyS0,115200 net.ifnames=1"
GRUB_DISABLE_RECOVERY="true"

Here we edit /etc/default/grub, replace the old UUID with the new one, and regenerate /boot/grub2/grub.cfg (legacy BIOS):

[root@srv ~]# cat /etc/default/grub 
GRUB_TIMEOUT=1
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL="serial console"
GRUB_SERIAL_COMMAND="serial --speed=115200"
GRUB_CMDLINE_LINUX="rd.md.uuid=9c08f218:cd5c0f8f:d96bc0d1:57b77e99 rd.md.uuid=1f74a2e0:757bfb9f:9c860e50:325f37cb rd.md.uuid=29bf4aa8:b7dae21a:45f4c188:baea4c13 rd.md.uuid=901074eb:16ba7c5b:0af69934:e9444102 console=tty0 crashkernel=auto console=ttyS0,115200 net.ifnames=1"
[root@srv ~]# grub2-mkconfig -o /boot/grub2/grub.cfg 
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-957.5.1.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-957.5.1.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-05cb8c7b39fe0f70e3ce97e5beab809d
Found initrd image: /boot/initramfs-0-rescue-05cb8c7b39fe0f70e3ce97e5beab809d.img
done
[root@srv ~]# reboot
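
If you prefer not to edit the file by hand, both options can be sketched as small sed filters. This is an illustration only: the function names replace_md_uuid and remove_md_uuid are hypothetical, and the sketch assumes you pass the full UUID, which contains only hex digits and colons and is therefore safe inside a sed pattern.

```shell
#!/bin/bash
# replace_md_uuid OLD NEW   (OPTION 2, hypothetical helper)
# Filter: rewrites rd.md.uuid=OLD to rd.md.uuid=NEW on stdin.
replace_md_uuid() {
    sed "s/rd\.md\.uuid=$1/rd.md.uuid=$2/"
}

# remove_md_uuid OLD        (OPTION 1, hypothetical helper)
# Filter: drops the rd.md.uuid=OLD parameter and one following space, if any.
remove_md_uuid() {
    sed "s/rd\.md\.uuid=$1 \{0,1\}//"
}

# Typical use (not run here); keep a backup and regenerate grub.cfg afterwards:
#   cp /etc/default/grub /root/grub.default.bak
#   remove_md_uuid e6eb2590:b767be36:c76bb869:45ff0c3c \
#       < /root/grub.default.bak > /etc/default/grub
```

The filters only touch the text that passes through them, so you can first pipe the file to the terminal and check the result before overwriting anything.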

Use this for UEFI boot:
First check whether /boot and /boot/efi are mounted; if they are not, mount them with:

mount /boot
mount /boot/efi

Generate the grub.cfg

grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
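
If you are not sure which of the two cases applies, the presence of the /sys/firmware/efi directory is a common way to tell that the running system was booted through UEFI. A minimal sketch follows; the function name grub_cfg_path and its optional argument are assumptions made for illustration, and the two paths are simply the ones used in this article.

```shell
#!/bin/bash
# grub_cfg_path [SYSFS_ROOT]   (hypothetical helper)
# Prints the grub.cfg location used in this article for the firmware type
# the machine booted with. SYSFS_ROOT defaults to /sys; it exists only so
# the function can be tried against a fake directory tree.
grub_cfg_path() {
    local sysfs=${1:-/sys}
    if [ -d "$sysfs/firmware/efi" ]; then
        echo /boot/efi/EFI/centos/grub.cfg   # UEFI boot
    else
        echo /boot/grub2/grub.cfg            # legacy BIOS boot
    fi
}

# Typical use (not run here):
#   grub2-mkconfig -o "$(grub_cfg_path)"
```

Note that this reflects how the current session was booted, so run it on the server itself and not in a chroot from different rescue media.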

Bonus

In fact, after the original device was removed and the new one was created, we formatted it as usual. But it could not be mounted. You execute

mount /dev/md5 /mnt/stor1

and get no error, but the device is not mounted, and when you execute

umount /mnt/stor1

the OS reports that “/mnt/stor1” is not mounted. We made several more unsuccessful attempts to mount “/dev/md5”, then restarted the server, and it never came up.
We suppose systemd simply did not allow the device to be mounted because of the rd.md.uuid boot parameters!
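
Because mount returned no error here, it is worth verifying the result explicitly and not trusting the exit status alone. A minimal sketch, assuming the usual /proc/mounts layout where the second field is the mount point; the function name is_mounted is hypothetical.

```shell
#!/bin/bash
# is_mounted MOUNTPOINT [MOUNTS_FILE]   (hypothetical helper)
# Succeeds when MOUNTPOINT appears as the second field of MOUNTS_FILE
# (default /proc/mounts), i.e. something is really mounted there.
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' \
        "${2:-/proc/mounts}"
}

# Typical use (not run here):
#   mount /dev/md5 /mnt/stor1
#   is_mounted /mnt/stor1 || echo "mount returned, but /mnt/stor1 is not mounted"
```

A check like this right after the first mount attempt would have shown that something was wrong before the reboot.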

LSI MegaRAID 2108 freezes with abort command and all processes hang up in disk sleep

One of our old LSI MegaRAID 2108 controllers (AOC-USAS2LP-H8iR (smc2108) with 36 disks, 32x2T and 4x8T) froze, and most of the processes hung in disk sleep. The server was up and the network was working, but no login could succeed. We performed a hard reset through the IPMI KVM. The server started up, and the MegaRAID controller booted with a warning that it had been shut down unexpectedly, so data loss was possible. It asked us to accept this by pressing any key, or to press “C” to enter the WebBIOS of the controller.

To sum up: the LSI controller hangs when it is in one of the following modes:

  1. Background Initialization
  2. Check Consistency

Aborting and disabling the modes above let our controller keep working until it was replaced. If you experience any kind of strange disk hangs or freezes, you can try our solution! Check below to see how to do it yourself.


systemd service freezes in activating (start-post) status – mysqld or other services

We experienced this with the MySQL server on CentOS 7, but other services can end up in this state too!
After updating MySQL we tried to start it, but “systemctl start” returned an error and left the service in a strange state:

[root@mysql2 ~]# systemctl start mysqld
Job for mysqld.service failed because a timeout was exceeded. See "systemctl status mysqld.service" and "journalctl -xe" for details.

The timeout is long, something like 5 to 10 minutes, so it is tempting (do not do it!) to press “Ctrl+C”. Then you never see this message, and MySQL is left in a strange state:

[root@mysql2 ~]# systemctl status mysqld
● mysqld.service - MySQL Community Server
   Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
   Active: activating (start-post) since Fri 2018-11-09 09:00:55 UTC; 6min ago
  Process: 8333 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
  Process: 8321 ExecStartPre=/usr/bin/mysql-systemd-start pre (code=exited, status=0/SUCCESS)
 Main PID: 8333 (code=exited, status=0/SUCCESS);         : 8334 (mysql-systemd-s)
   CGroup: /user.slice/user-0.slice/session-2395.scope/system.slice/mysqld.service
           └─control
             ├─ 8334 /bin/bash /usr/bin/mysql-systemd-start post
             └─10152 sleep 1

Nov 09 09:00:55 mysql2.mytv.bg systemd[1]: Starting MySQL Community Server...
Nov 09 09:00:56 mysql2.mytv.bg mysqld_safe[8333]: 181109 09:00:56 mysqld_safe Logging to '/var/log/mysqld.log'.
Nov 09 09:00:56 mysql2.mytv.bg mysqld_safe[8333]: 181109 09:00:56 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

Meanwhile with “pstree”:

[root@mysql2 ~]# pstree
systemd─┬─agetty
        ├─crond
        ├─dbus-daemon
        ├─mysql-systemd-s───sleep
        ├─rsyslogd───2*[{rsyslogd}]
        ├─sshd─┬─sshd───bash───systemctl─┬─systemctl
        │      │                         └─systemd-tty-ask
        │      └─sshd───bash───pstree
        ├─systemd-journal
        └─systemd-logind

As you can see, there is no mysqld process! Apparently systemd tried to start the MySQL server process and it failed.
So the first thing to do was to check the MySQL logs. In our case the cause was an obsolete option in my.cnf:

2018-11-09 09:10:57 11384 [ERROR] /usr/sbin/mysqld: unknown variable 'default-character-set=utf8'
2018-11-09 09:10:57 11384 [ERROR] Aborting
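
To get to lines like these quickly, you can filter the log for the [ERROR] tag. A minimal sketch, assuming the log path /var/log/mysqld.log that mysqld_safe reports above; the function name mysqld_errors and the default of 20 lines are arbitrary choices for illustration.

```shell
#!/bin/bash
# mysqld_errors [LOGFILE] [COUNT]   (hypothetical helper)
# Prints the last COUNT (default 20) lines tagged [ERROR] from LOGFILE
# (default /var/log/mysqld.log).
mysqld_errors() {
    # -F treats the brackets literally instead of as a character class.
    grep -F '[ERROR]' "${1:-/var/log/mysqld.log}" | tail -n "${2:-20}"
}

# Typical use (not run here):
#   mysqld_errors                      # last 20 errors from the default log
#   mysqld_errors /var/log/mysqld.log 5
```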

The interesting part is that the service stays in “Active: activating (start-post)”, and even after you fix the problem, “systemctl start mysqld” does not start it right away; it just waits for the current timeout to expire.

In fact, this state means “I’m trying to start the service…”: the start is retried in an endless loop, and if the service has a long start timeout like 5-10 minutes, you must wait for the next iteration of the loop before the service starts successfully (provided you fixed the problem!). If you do not want to wait, first stop the service and then start it. You will not wait for any timeout, and you can check immediately whether the service started successfully. First, here is what happens when you only issue “start” again and wait:

[root@mysql2 ~]# systemctl status mysqld
● mysqld.service - MySQL Community Server
   Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
   Active: activating (start-post) since Fri 2018-11-09 09:20:56 UTC; 2min 50s ago
  Process: 13208 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
  Process: 13196 ExecStartPre=/usr/bin/mysql-systemd-start pre (code=exited, status=0/SUCCESS)
 Main PID: 13208 (code=exited, status=0/SUCCESS);         : 13209 (mysql-systemd-s)
   CGroup: /user.slice/user-0.slice/session-2395.scope/system.slice/mysqld.service
           └─control
             ├─13209 /bin/bash /usr/bin/mysql-systemd-start post
             └─14357 sleep 1

Nov 09 09:20:56 mysql2.mytv.bg systemd[1]: Starting MySQL Community Server...
Nov 09 09:20:56 mysql2.mytv.bg mysqld_safe[13208]: 181109 09:20:56 mysqld_safe Logging to '/var/log/mysqld.log'.
Nov 09 09:20:56 mysql2.mytv.bg mysqld_safe[13208]: 181109 09:20:56 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
[root@mysql2 ~]# systemctl start mysqld
Job for mysqld.service failed because a timeout was exceeded. See "systemctl status mysqld.service" and "journalctl -xe" for details.
[root@mysql2 ~]# systemctl status mysqld
● mysqld.service - MySQL Community Server
   Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2018-11-09 09:30:59 UTC; 2s ago
  Process: 15656 ExecStartPost=/usr/bin/mysql-systemd-start post (code=exited, status=0/SUCCESS)
  Process: 15643 ExecStartPre=/usr/bin/mysql-systemd-start pre (code=exited, status=0/SUCCESS)
 Main PID: 15655 (mysqld_safe)
   CGroup: /user.slice/user-0.slice/session-2395.scope/system.slice/mysqld.service
           ├─15655 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
           └─16243 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mysqld.log --open-files-limit=10000...

Nov 09 09:30:56 mysql2.mytv.bg systemd[1]: Starting MySQL Community Server...
Nov 09 09:30:57 mysql2.mytv.bg mysqld_safe[15655]: 181109 09:30:57 mysqld_safe Logging to '/var/log/mysqld.log'.
Nov 09 09:30:57 mysql2.mytv.bg mysqld_safe[15655]: 181109 09:30:57 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Nov 09 09:30:59 mysql2.mytv.bg systemd[1]: Started MySQL Community Server.

As you can see, we even received the error again that the service could not be started, and immediately after that the service status was the normal “active (running)” state! And we waited around 10 minutes! You can see the times in the logs above.
So, to sum up:

If a service is stuck in “activating (start-post)”, it cannot start because of an error. Check and fix the problem, and then issue “stop” followed by “start”:

[root@mysql2 ~]# systemctl start mysqld
Job for mysqld.service failed because a timeout was exceeded. See "systemctl status mysqld.service" and "journalctl -xe" for details.
[root@mysql2 ~]# systemctl status mysqld
● mysqld.service - MySQL Community Server
   Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
   Active: activating (start-post) since Fri 2018-11-09 10:05:20 UTC; 2min 17s ago
  Process: 23601 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
  Process: 23589 ExecStartPre=/usr/bin/mysql-systemd-start pre (code=exited, status=0/SUCCESS)
 Main PID: 23601 (code=exited, status=0/SUCCESS);         : 23602 (mysql-systemd-s)
   CGroup: /user.slice/user-0.slice/session-2395.scope/system.slice/mysqld.service
           └─control
             ├─23602 /bin/bash /usr/bin/mysql-systemd-start post
             └─24646 sleep 1

Nov 09 10:05:20 mysql2.mytv.bg systemd[1]: Starting MySQL Community Server...
Nov 09 10:05:21 mysql2.mytv.bg mysqld_safe[23601]: 181109 10:05:21 mysqld_safe Logging to '/var/log/mysqld.log'.
Nov 09 10:05:21 mysql2.mytv.bg mysqld_safe[23601]: 181109 10:05:21 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
[root@mysql2 ~]# systemctl stop mysqld
[root@mysql2 ~]# systemctl status mysqld
● mysqld.service - MySQL Community Server
   Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Fri 2018-11-09 10:07:52 UTC; 4s ago
  Process: 23602 ExecStartPost=/usr/bin/mysql-systemd-start post (code=killed, signal=TERM)
  Process: 23601 ExecStart=/usr/bin/mysqld_safe --basedir=/usr (code=exited, status=0/SUCCESS)
  Process: 23589 ExecStartPre=/usr/bin/mysql-systemd-start pre (code=exited, status=0/SUCCESS)
 Main PID: 23601 (code=exited, status=0/SUCCESS)

Nov 09 10:05:20 mysql2.mytv.bg systemd[1]: Starting MySQL Community Server...
Nov 09 10:05:21 mysql2.mytv.bg mysqld_safe[23601]: 181109 10:05:21 mysqld_safe Logging to '/var/log/mysqld.log'.
Nov 09 10:05:21 mysql2.mytv.bg mysqld_safe[23601]: 181109 10:05:21 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Nov 09 10:07:52 mysql2.mytv.bg systemd[1]: Stopped MySQL Community Server.
[root@mysql2 ~]# systemctl start mysqld
[root@mysql2 ~]# systemctl status mysqld
● mysqld.service - MySQL Community Server
   Loaded: loaded (/usr/lib/systemd/system/mysqld.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2018-11-09 10:08:06 UTC; 3s ago
  Process: 24711 ExecStartPost=/usr/bin/mysql-systemd-start post (code=exited, status=0/SUCCESS)
  Process: 24698 ExecStartPre=/usr/bin/mysql-systemd-start pre (code=exited, status=0/SUCCESS)
 Main PID: 24710 (mysqld_safe)
   CGroup: /user.slice/user-0.slice/session-2395.scope/system.slice/mysqld.service
           ├─24710 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
           └─25298 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mysqld.log --open-files-limit=10000...

Nov 09 10:08:04 mysql2.mytv.bg systemd[1]: Starting MySQL Community Server...
Nov 09 10:08:04 mysql2.mytv.bg mysqld_safe[24710]: 181109 10:08:04 mysqld_safe Logging to '/var/log/mysqld.log'.
Nov 09 10:08:04 mysql2.mytv.bg mysqld_safe[24710]: 181109 10:08:04 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Nov 09 10:08:06 mysql2.mytv.bg systemd[1]: Started MySQL Community Server.
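
The stop-then-start sequence above can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name start_fresh and the SYSTEMCTL variable are made up for illustration, and it relies on “systemctl is-active” printing the unit state (for example active, activating or failed).

```shell
#!/bin/bash
# SYSTEMCTL can be pointed at a stub for a dry run; it defaults to the real tool.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# start_fresh UNIT   (hypothetical helper)
# If UNIT is still "activating", stop it first so the new start does not
# wait behind the stuck one, then start it and print the resulting state.
start_fresh() {
    local unit=$1 state
    state=$("$SYSTEMCTL" is-active "$unit")   # e.g. active, activating, failed
    if [ "$state" = "activating" ]; then
        "$SYSTEMCTL" stop "$unit"
    fi
    "$SYSTEMCTL" start "$unit" && "$SYSTEMCTL" is-active "$unit"
}

# Typical use (not run here), after fixing the cause of the failure:
#   start_fresh mysqld
```

It does not fix anything by itself. If the underlying error is still there, the start will hang in the same state again.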