One of our old LSI MegaRAID 2108 controllers (AOC-USAS2LP-H8iR (smc2108) with 36 disks: 32x2T and 4x8T) froze, and most of the processes hung in the "Disk sleep" state. The server was up and the network was working, but no login succeeded. We did a hard reset through the IPMI KVM. The server started up, and the MegaRAID controller booted with a warning that it had been shut down unexpectedly, so data might have been lost. It asked us to acknowledge this by pressing any key, or to press “C” to enter the controller's WebBIOS.
To sum up: the LSI controller hangs when it is in one of the following modes:
- Background Initialization
- Check Consistency
Aborting and disabling the modes above let our controller keep working until it was replaced. If you experience any kind of strange disk hangs or freezes, you can try our solution here! Check below to see how to do it yourself.
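The exact commands are not part of this excerpt. As a rough sketch, and not necessarily what was run on this server, MegaCli has the -LDBI and -LDCC options for background initialization and consistency checks, and -AdpCcSched for the scheduled check. The wrapper below only prints the commands by default and executes them when RUN=1 is set. The `megacli` binary name, the `run` helper and the choice of -LAll/-aAll are assumptions.

```shell
#!/bin/sh
# Sketch: abort and disable Background Initialization and Check Consistency.
# Assumption: the binary is called "megacli" (it may be MegaCli or MegaCli64).
MEGACLI="${MEGACLI:-megacli}"

# Hypothetical helper: print the command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

stop_background_tasks() {
    # Abort a running background initialization on all logical drives
    # (the abort may fail when nothing is in progress).
    run "$MEGACLI" -LDBI -Abort -LAll -aAll || true
    # Disable background initialization for all logical drives.
    run "$MEGACLI" -LDBI -Dsbl -LAll -aAll
    # Abort a running consistency check (may also fail when idle).
    run "$MEGACLI" -LDCC -Abort -LAll -aAll || true
    # Disable the scheduled consistency check on all adapters.
    run "$MEGACLI" -AdpCcSched -Dsbl -aAll
}

stop_background_tasks
```

Running it without RUN=1 is a safe way to review the four commands before touching the controller.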
is really old (probably 7-9 years), but it still works, so it is worth checking whether you are running the latest and greatest firmware. Let's hope the latest version fixes more things than it breaks. To flash the firmware you need the MegaCli utility and the firmware file. Both can be found in the sub-directories of https://www.supermicro.com/wftp/driver/SAS/LSI/2108/Firmware/ They are still there even though this product is discontinued. The versions at this URL are the latest ones tested and verified by Supermicro, so it is advisable to download them from this link, or at least to use the same versions if the link stops working in the future (for now the files are still available).
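A minimal sketch of the check-and-flash step with MegaCli follows. It uses -AdpAllInfo to print the adapter details (which include the firmware version) and -AdpFwFlash to write the new image. The firmware file name, the adapter index 0, the `megacli` binary name and the `run` dry-run helper are all assumptions. Commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: show the adapter details and flash a firmware image with MegaCli.
# Assumption: the binary is called "megacli" (it may be MegaCli or MegaCli64).
MEGACLI="${MEGACLI:-megacli}"
# Hypothetical file name: use the image extracted from the Supermicro download.
FW_FILE="${FW_FILE:-firmware.rom}"

# Hypothetical helper: print the command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Print all adapter details for adapter 0, firmware version included.
show_adapter_info() {
    run "$MEGACLI" -AdpAllInfo -a0
}

# Flash the given firmware image to adapter 0.
flash_firmware() {
    run "$MEGACLI" -AdpFwFlash -f "$1" -a0
}

show_adapter_info
flash_firmware "$FW_FILE"
```

Comparing the -AdpAllInfo output before and after the flash is a simple way to confirm that the new version is active, usually after a reboot.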
As you may know, LSI (which bought 3ware RAID in 2009) was bought by Avago (announced in 2013, completed in 2014), then Avago bought Broadcom in 2016 and took over the Broadcom name (Broadcom Limited, and Broadcom Inc. since 2018), so it is not so easy to find material for such old hardware (which still works). This old MegaRAID controller is better managed with MegaCli, although you can also use “storcli”, which is a modification of the tw_cli utility of 3ware RAID.
Sometimes we need to start a rebuild on a disk that is in the failed state on an LSI hardware controller. If we simply set the failed disk back to a good state, it immediately rejoins the array without a rebuild, and our filesystem will almost certainly be corrupted! It also happens that when we replace a disk, the new disk shows up in the failed state, too.
So here are simple and tested steps for properly resetting a disk from the failed state to a good state and starting a rebuild. In the example below the disk in the failed state is [32:1]; replace it with the proper [enclosure_id:slot_id] for your case.
- Change the state from “Failed” to “Unconfigured(Bad)”
megacli -pdmarkmissing -physdrv[32:1] -aAll
- Prepare for removal (this command may fail, which is not critical)
megacli -pdprprmv -physdrv[32:1] -a0
- Change the state of the disk to “Unconfigured(Good), Spun Up”
megacli -PDMakeGood -PhysDrv[32:1] -a0
- Start the rebuild (this command could fail). If the command fails, continue with the next step; if not, the rebuild has started successfully.
megacli -PDRbld -Start -PhysDrv[32:1] -a0
megacli -pdlocate -start -physdrv[32:1] -a0
One of the two commands will probably start the rebuild (note that -pdlocate by itself only makes the slot LED blink so you can identify the drive). If both fail, continue to the next step.
- Start the rebuild by first clearing the foreign configuration and then making the device a hot spare (only if the rebuild commands in the previous step failed)
megacli -CfgForeign -Clear -aALL
# set global hot spare
megacli -PDHSP -Set -PhysDrv [32:1] -a0
* If you need to unset/remove a global hot spare (replace N with the adapter number):
megacli -PDHSP -Rmv -PhysDrv [32:1] -aN
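The steps above can be sketched as one small script. It follows the main path and falls back to the hot spare step only when the rebuild command fails. The -pdlocate command from the list is left out. The `reset_failed_disk` function, the `run` dry-run helper and the `megacli` binary name are hypothetical. By default the commands are only printed, and because printing always succeeds, the fallback branch is reached only with RUN=1.

```shell
#!/bin/sh
# Sketch: reset a failed disk to a good state and start a rebuild.
# Assumption: the binary is called "megacli" (it may be MegaCli or MegaCli64).
MEGACLI="${MEGACLI:-megacli}"

# Hypothetical helper: print the command unless RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Usage: reset_failed_disk ENCLOSURE:SLOT [ADAPTER]
reset_failed_disk() {
    drv="[$1]"
    adp="-a${2:-0}"

    # Failed -> Unconfigured(Bad)
    run "$MEGACLI" -pdmarkmissing "-physdrv$drv" -aAll || return 1
    # Prepare for removal; a failure here is not critical.
    run "$MEGACLI" -pdprprmv "-physdrv$drv" "$adp" ||
        echo "prepare for removal failed, continuing" >&2
    # Unconfigured(Bad) -> Unconfigured(Good), Spun Up
    run "$MEGACLI" -PDMakeGood "-PhysDrv$drv" "$adp" || return 1
    # Start the rebuild; stop here when it works.
    if run "$MEGACLI" -PDRbld -Start "-PhysDrv$drv" "$adp"; then
        return 0
    fi
    # Fallback: clear the foreign configuration, then set a global hot spare.
    run "$MEGACLI" -CfgForeign -Clear -aALL
    run "$MEGACLI" -PDHSP -Set -PhysDrv "$drv" "$adp"
}

reset_failed_disk 32:1 0
```

Once a rebuild is running, `megacli -PDRbld -ShowProg -PhysDrv[32:1] -a0` reports its progress.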