LSI MegaRAID 2108 freezes with ABORT commands and all processes hang in disk sleep

One of our old LSI MegaRAID 2108 controllers (AOC-USAS2LP-H8iR (smc2108) with 36 disks: 32x2T and 4x8T) froze, and most of the processes hung in disk sleep. The server was up and the network was working, but no login could succeed. We performed a hard reset through the IPMI KVM. The server started up, and the MegaRAID controller booted with a warning that it had been shut down unexpectedly, so data loss was possible; it asked us to accept this by pressing any key, or to press “C” to enter the controller’s WebBIOS.

To summarize: the LSI controller hangs when it is in one of the following modes:

  1. Background Initialization
  2. Check Consistency

Aborting and disabling the modes above let our controller keep working until its replacement. If you experience any kind of strange disk hangs or freezes, you can try our solution. See below for how to do it yourself.
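If you can still get a shell while the problem is building up, one symptom to look for is processes stuck in disk sleep (state "D" in ps). Here is a minimal sketch of a filter for that; the function name list_dstate is our own invention for illustration, and it assumes input with three columns (state, PID, command name), such as procps ps prints with the options shown in the comment.

```shell
# List processes in uninterruptible (disk) sleep.
# Expects "state pid command" lines on stdin, for example from:
#   ps -eo state=,pid=,comm= | list_dstate
# Only the first word of the command name is printed.
list_dstate() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}
```

A few short-lived entries are normal on a busy server; a list that keeps growing and never drains is the pattern we saw before the freeze.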


Here is the kernel output after which the server hangs and the disk array becomes unreadable for the operating system:

Nov 09 01:04:33 srv0 kernel: [ 2089.129528] sd 0:2:0:0: [sda] ABORT
Nov 09 01:04:34 srv0 kernel: [ 2090.927714] sd 0:2:1:0: [sdb] ABORT
Nov 09 01:07:34 srv0 kernel: [ 2269.128495] sd 0:2:0:0: [sda] ABORT
Nov 09 01:07:34 srv0 kernel: [ 2270.108198] sd 0:2:1:0: [sdb] ABORT
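To see quickly which devices are affected, the ABORT lines can be counted per device. This is only a sketch: the function name count_aborts is hypothetical, and it assumes the log lines look exactly like the ones above, with the device in brackets followed by ABORT as the last word.

```shell
# Count kernel ABORT messages per block device.
# Reads kernel log lines on stdin, for example (assuming a systemd journal):
#   journalctl -k | count_aborts
count_aborts() {
    awk '$NF == "ABORT" {
             dev = $(NF - 1)          # e.g. "[sda]"
             gsub(/\[|\]/, "", dev)   # strip the brackets
             n[dev]++
         }
         END { for (d in n) print d, n[d] }' | sort
}
```

Fed with the four lines above, it reports two aborts each for sda and sdb, which matches our two virtual drives.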

After several resets the situation got worse. At first the server had been up and running for at least 30-40 minutes before it hung, and then it began to hang within the first 2-5 minutes! We quickly updated the controller to the latest firmware, but with no success: the server’s controller still hung with no error, apparently not responding to the ABORT commands of the kernel, which would wait for a response indefinitely, only issuing new commands after 2-3 minutes!

After several resets we realized that one of the virtual drives was in “Background Initialization” mode, probably because of the resets themselves (each one an unexpected shutdown for the controller).

[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -LALL -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 29.088 TB
Parity Size         : 3.635 TB
State               : Optimal
Strip Size          : 1.0 MB
Number Of Drives    : 18
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: Automatic
Can spin up in 1 minute: Yes
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only


Virtual Drive: 1 (Target Id: 1)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 29.088 TB
Parity Size         : 3.635 TB
State               : Optimal
Strip Size          : 1.0 MB
Number Of Drives    : 18
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Ongoing Progresses:
  Background Initialization: Completed 5%, Taken 53 min.
Encryption Type     : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: Automatic
Can spin up in 1 minute: Yes
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: Yes
Cache Cade Type : Read Only
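The -ldinfo output is long, and the interesting part (the “Ongoing Progresses” block) is easy to miss. Here is a small sketch that prints only the running operations per virtual drive. The function name vd_progress is our own, and it assumes the layout shown above: a “Virtual Drive: N” line starts each drive, and the progress lines are indented under “Ongoing Progresses:”.

```shell
# Print ongoing operations (BGI, Check Consistency, ...) per virtual drive.
# Reads MegaCli -ldinfo output on stdin, for example:
#   /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -LALL -aALL | vd_progress
vd_progress() {
    awk '
        /^Virtual Drive:/      { vd = $3; in_prog = 0; next }
        /^Ongoing Progresses:/ { in_prog = 1; next }
        in_prog && /^[ \t]+/   { sub(/^[ \t]+/, ""); print "VD " vd ": " $0; next }
                               { in_prog = 0 }   # any other line ends the block
    '
}
```

No output means no virtual drive reported an ongoing operation in the text it was given.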

So we decided to abort it, and not only abort it but also disable it, because we did not want it to start again in the future:

[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LDBI -Abort -LALL -aALL
                                     
Background Initialization on VD #0 is not in Progress.
Abort Background Initialization on Virtual Drive 1 (target id: 1) Success.

Exit Code: 0x00
[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LDBI -Dsbl -LALL -aALL
                                     
Background Initialization is Disabled on VD 0 (target id: 0) on Adapter 0
Background Initialization is Disabled on VD 1 (target id: 1) on Adapter 0

Exit Code: 0x00

And to be absolutely sure, we checked with the three commands below. There is no background initialization in progress at the moment, and the setting is disabled in the controller’s configuration:

[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -LALL -aALL
                                     

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 29.088 TB
Is VD emulated      : Yes
Parity Size         : 3.635 TB
State               : Optimal
Strip Size          : 1.0 MB
Number Of Drives    : 18
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: No


Virtual Drive: 1 (Target Id: 1)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 29.088 TB
Is VD emulated      : Yes
Parity Size         : 3.635 TB
State               : Optimal
Strip Size          : 1.0 MB
Number Of Drives    : 18
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: No

[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LDBI -ProgDsply -LALL -aALL
                                     
Background Initialization on VD #0 is not in Progress.
Background Initialization on VD #0 is not in Progress.
Background Initialization on VD #1 is not in Progress.
Background Initialization on VD #1 is not in Progress.

Exit Code: 0x00
[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LDBI -getSetting -LALL -aALL
                                     
Background Initialization is Disabled on VD 0 (target id: 0) on Adapter 0
Background Initialization is Disabled on VD 1 (target id: 1) on Adapter 0

Exit Code: 0x00
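If you want to repeat this check from a script (for example after every reboot), the -getSetting output can be turned into an exit status. This is a sketch under assumptions: the function name bgi_all_disabled is hypothetical, and it relies only on the line format shown above, treating any word other than “Disabled” in that position as “not disabled”.

```shell
# Succeed (exit 0) only if every virtual drive reports BGI as Disabled.
# Reads the output of the -getSetting command on stdin, for example:
#   /opt/MegaRAID/MegaCli/MegaCli64 -LDBI -getSetting -LALL -aALL | bgi_all_disabled
bgi_all_disabled() {
    awk '
        /^Background Initialization is / {
            seen++
            if ($4 != "Disabled") bad++
        }
        # Fail when no matching line was seen at all, to avoid a false "OK".
        END { exit (seen > 0 && bad == 0) ? 0 : 1 }
    '
}
```

It also fails when no matching line is found, so a missing utility or a changed output format does not pass silently.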

And WOW, the server reached 3 hours of uptime and then hung again in exactly the same way! But this time it was clear that the Background Initialization was not the cause! The server was reset again (we powered it off and on several times and even unplugged the power cords), and the MegaCli information command showed that this time the controller was in “Check Consistency”:

[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -LALL -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 29.088 TB
Is VD emulated      : Yes
Parity Size         : 3.635 TB
State               : Optimal
Strip Size          : 1.0 MB
Number Of Drives    : 18
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Ongoing Progresses:
  Check Consistency        : Completed 0%, Taken 9 min.
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: No


Virtual Drive: 1 (Target Id: 1)
Name                :
RAID Level          : Primary-6, Secondary-0, RAID Level Qualifier-3
Size                : 29.088 TB
Is VD emulated      : Yes
Parity Size         : 3.635 TB
State               : Optimal
Strip Size          : 1.0 MB
Number Of Drives    : 18
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Cached, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Enabled
Ongoing Progresses:
  Check Consistency        : Completed 0%, Taken 9 min.
Encryption Type     : None
Bad Blocks Exist: No
Is VD Cached: No

Exit Code: 0x00

So, the second guess of the night: what if we abort and disable “Check Consistency”? We did it fast enough, before the controller froze again:

[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LDCC -Abort -lall -aall
                                     
Abort Check Consistency on Virtual Drive 0 (target id: 0) Success.
Abort Check Consistency on Virtual Drive 1 (target id: 1) Success.

Exit Code: 0x00
[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -AdpCcSched -Dsbl -aALL
                                     
Adapter 0: Scheduled CC mode is set to Disabled.

Exit Code: 0x00

And then we discovered that the controller had a “Check Consistency” scheduler, which can be queried with:

[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -AdpCcSched -Info -aALL
                                     
Adapter #0

Operation Mode: Concurrent
Execution Delay: 168
Next start time: 11/17/2018, 03:00:00
Current State: Stopped
Number of iterations: 1
Number of VD completed: 2
Excluded VDs          : None
Exit Code: 0x00

The output above was captured before the disable command for “Check Consistency”. After you disable “Check Consistency”, the output will be:

[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -AdpCcSched -Info -aALL
                                     
Adapter #0

Operation Mode: Disabled
Execution Delay: 168
Next start time: 07/28/2135, 02:00:00
Current State: Stopped
Number of iterations: 1
Number of VD completed: 2
Excluded VDs          : None
Exit Code: 0x00
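The scheduler output can be condensed into one line, which is handy when checking several servers. This is a rough sketch: the function name cc_sched_summary is made up, and we assume that “Execution Delay” is given in hours (168 hours would be 7 days, which fits a weekly schedule), so treat the delay_days value as our interpretation.

```shell
# Summarize the Check Consistency scheduler state in one line.
# Reads the output of the -AdpCcSched -Info command on stdin, for example:
#   /opt/MegaRAID/MegaCli/MegaCli64 -AdpCcSched -Info -aALL | cc_sched_summary
cc_sched_summary() {
    # Split on "colon + space" so the colons inside the time stay intact.
    awk -F': ' '
        $1 == "Operation Mode"  { mode = $2 }
        $1 == "Execution Delay" { delay = $2 }
        $1 == "Next start time" { next_run = $2 }
        END {
            # delay / 24 assumes the delay is expressed in hours.
            printf "mode=%s delay_hours=%d delay_days=%d next=%s\n",
                   mode, delay, delay / 24, next_run
        }
    '
}
```

For the first output above it prints mode=Concurrent with a 7-day delay; after disabling, the mode becomes Disabled and the next start time moves to 2135.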

“Back to the future”!

What a coincidence: 3 hours after we hit the problem with the “Background Initialization”, a weekly scheduled “Check Consistency” task kicked in! And the controller froze again! So be sure to disable the “Check Consistency”, too!

There is only one more mode, “Initialization”, which is started only after a new logical drive is created:

[root@srv0 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LDInit -Abort -LALL -aALL
                                     
Initialization on VD #0 is not in progress.
Initialization on VD #1 is not in progress.

Exit Code: 0x00

And this mode cannot start without a manual administrative command.
