Surviving 3 disks failure of RAID 6 with AVAGO 3108 MegaRAID and foreign config

Whatever is the reason to end up with 3 broken hard disks in a RAID 6 setup it does not matter! What matters is to recover the data if possible and the most important thing in this situation is to find the LAST hard disk, which was marked as failed and removed from the array. Then the array gets a offline state immediately! So if the last broken hard disk might have a little light of life, probably it is easy to recover the data. The hardware controller is an additional Supermicro board – AOC-S3108L-H8iR.
What happened – third disk got failed status and a virtual device using RAID 6 setup is in offline state. In offline state the virtual device would not execute any READ or WRITE operations, because part of the data is missing and the virtual drive has no meaningfully user data.
To survive to backup the data:

  1. Power off the server. And it is better to remove the power cord, afterwards, and wait for at least a minute before plugging back the power cord back.
  2. Power on the server.
  3. When prompted for actions during initialization the AVAGO 3108 MegaRAID just continue the server loading without accepting any changes.
  4. Boot a recovery disk and using the AVAGO command-line (cli) tool dump the “events” in a file. A sample command might be:
    /opt/MegaRAID/storcli/storcli64 /c0 show events >show.event.log
    

    Assuming the Offline RAID 6 virtual drive is “/c0”. Other possible options are “/c1”, “/c2” and so on.

  5. Read from the end till start the AVAGO 3108 MegaRAID events dump and find which hard drive was marked as failed LAST, i.e. with the most latest date and time. And then there are events marking the the virtual device as Offline.
    seqNum: 0x00009a46
    Time: Mon Jun 27 01:49:54 2022
    
    Code: 0x00000072
    Class: 0
    Locale: 0x02
    Event Description: State change on PD 10(e0x08/s5) from ONLINE(18) to FAILED(11)
    Event Data:
    ===========
    Device ID: 16
    Enclosure Index: 8
    Slot Number: 5
    Previous state: 24
    New state: 17
    
    
    seqNum: 0x00009a47
    Time: Mon Jun 27 01:49:54 2022
    
    Code: 0x00000051
    Class: 0
    Locale: 0x01
    Event Description: State change on VD 00/0 from DEGRADED(2) to OFFLINE(0)
    Event Data:
    ===========
    Target Id: 0
    Previous state: 2
    New state: 0
    

    The first event of the list above logs the hard drive PD 10(e0x08/s5) gets FAILED status. And immediately after that the virtual drive VD 00/0 goes Offline, which means the last disk before the RAID 6 virtual drive stops working is the PD 10(e0x08/s5)The “/s5” from PD 10(e0x08/s5) points to the “Slot 5” hard drive.

  6. Reboot the server and when prompted the AVAGO 3108 MegaRAID BIOS Configuration Utility this time enter the utility.
  7. Make the found hard drive from the previous steps with ONLINE state. The hard drive might be in a foreign configuration or just in a bad state, so import the foreign configuration, make the drive a GOOD state and its state will immediately be ONLINE, which mean it is a part of an existing virtual drive. The virtual drive state will immediately be changed to DEGRADED (still two broken disks are out of the virtual drive). Follow the screenshots below to get the last broken disk back ONLINE and the virtual drive in an operable state – DEGRADED. If the drive is only in BAD/FAILED state, just skip the Foreign part and make the disk ONLINE (it may require first to make the disk “unconfigured-good”)
  8. Recover the data by simply copy it to another server or a healthy virtual drive. DO NOT TRY TO REMOVE data, i.e. do not use “rm”, the real state of this third broken disk is unknown and writing would probably kill it off. A good idea is to mount the filesystems on this virtual drive read-only and just rsync the data to a backup.

Here is the process of getting the third disk on “Slot 5” from a “Missing” and the “Virtual Drive 0Offline to the ONLINE state of the hard drive and a DEGRADED state of the “Virtual Drive 0“, i.e. operating.

SCREENSHOT 1) The Drive in Slot 5 is missing and the Virtual Drive 0 is in OFFLINE state

Slot 5 is the hard drive we need to recover, but it reports the hard drive is missing. Missing points out there is another configuration, so press “Ctrl+N” to change to the next page (i.e. menu), which is “PD Mgmt” – physical disk management.

main menu
slot 5 disk missing and VD offline

Keep on reading!

Add cachecade SSD to an existing virtual drive in AVAGO MegaRAID with storcli64

Increase the IO performance of big storage consisting of hard drives with the CacheCade feature with MegaRAID hardware controllers. Here are the commands to ensure adding an SSD to cache storage of hard drives and the whole output:

storcli64 /c0 add VD cachecade r0 drives=252:0 WB
storcli64 /c0/v0 set ssdcaching=on

Summary and explanation

  • The controller used is AVAGO MegaRAID SAS9361-4i
  • Three 12T hard drives and 1 SSD 1T.
  • RAID5 of 3 hard drive. The name of the virtual drive is v0.
  • The CacheCade is in RAID0. The name of the virtual drive is v1.
  • The CacheCade is in WriteBack mode. The WriteBack mode is dangerous when the CacheCade device is RAID0, because there is no redundancy, but it’s OK if the purpose of the machine is just a proxy cache. WriteBack mode leads to much more performance to gain from the cache device over slow device sotrage, but should be carefully planned when using RAID0 cache (or single) device.
  • Using storcli – the cli command line tool to manage a (LSI) MegaRAID.
  • Add CacheCade RAID0 device by the first command. It will create a virtual drive RAID0 with the name v1.
  • Ensure the caching is enabled for the virtual device it should be by explicitly setting it with the second command. Adding only a CacheCade device may not add the cache virtual device to the existing virtual devices!

Output of a real world example

Keep on reading!

Install the new storcli to manage (LSI/AVAGO/Broadcom) MegaRAID controller under CentOS 7

After the acquisition of LSI there was a major change with the management console utility for the MegaRAID controllers. The utility was renamed from MegaCli (MegaCli64, megacli) to

storcli (storcli64)

We have new controllers like AVAGO MegaRAID SAS-9361-4i and really old ones like LSI 2108 MegaRAID (in fact Supermicro AOC-USAS2LP-H8iR) and the two controllers could be manage with the new cli. even the old controller, which is on more than 8 years could be manage by the new cli.
Interesting fact is that the storcli output and argument syntax and is almost identical to the one really old cli – tw_cli – the 3Ware management utility. As you know LSI bought 3ware RAID adapter business in 2009.
Keep on reading!