It does not matter how a RAID 6 setup ended up with 3 broken hard disks. What matters is recovering the data, if possible, and the most important step is to find the LAST hard disk that was marked as failed and removed from the array, because that is the moment the array goes offline. If that last broken disk still has a little life in it, recovering the data is probably easy. The hardware controller here is an add-on Supermicro board, the AOC-S3108L-H8iR.
What happened: a third disk got FAILED status and the virtual device using RAID 6 went into an offline state. In the offline state the virtual device does not execute any READ or WRITE operations, because part of the data is missing and the virtual drive holds no meaningful user data.
To bring the array back long enough to back up the data:
- Power off the server. It is better to also unplug the power cord afterwards and wait at least a minute before plugging it back in.
- Power on the server.
- When the AVAGO 3108 MegaRAID prompts for action during initialization, just let the server continue booting without accepting any changes.
- Boot a recovery disk and use the AVAGO command-line (CLI) tool to dump the “events” to a file. A sample command might be:
/opt/MegaRAID/storcli/storcli64 /c0 show events >show.event.log
This assumes the controller holding the Offline RAID 6 virtual drive is “/c0”. Other possible controllers are “/c1”, “/c2” and so on.
- Read the AVAGO 3108 MegaRAID events dump from the end towards the start and find which hard drive was marked as failed LAST, i.e. with the latest date and time. Right after it there are events marking the virtual device as Offline.
seqNum: 0x00009a46
Time: Mon Jun 27 01:49:54 2022
Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 10(e0x08/s5) from ONLINE(18) to FAILED(11)
Event Data:
===========
Device ID: 16
Enclosure Index: 8
Slot Number: 5
Previous state: 24
New state: 17

seqNum: 0x00009a47
Time: Mon Jun 27 01:49:54 2022
Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from DEGRADED(2) to OFFLINE(0)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 0
The first event in the list above logs that the hard drive PD 10(e0x08/s5) gets FAILED status. Immediately after that the virtual drive VD 00/0 goes Offline, which means PD 10(e0x08/s5) is the last disk to fail before the RAID 6 virtual drive stops working. The “/s5” in PD 10(e0x08/s5) points to the hard drive in “Slot 5”.
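Reading a long dump by hand is error-prone, so the search can be narrowed with a small filter. The following is only a sketch: it assumes the dump keeps one field per line, with a "Time:" line before each "Event Description:" line, and the function names `list_failures` and `last_failed_pd` are made up for this example.

```shell
# Sketch: list the events that matter in a "storcli ... show events" dump.
# Assumption: each event has a "Time:" line followed by an
# "Event Description:" line, one field per line.

# Print "time | description" for every physical drive (PD) that went to
# FAILED and every virtual drive (VD) that went to OFFLINE, in log order.
list_failures() {
    awk '
        /^Time:/ { sub(/^Time:[ \t]*/, ""); when = $0; next }
        /^Event Description:/ && (/on PD .* to FAILED/ || /on VD .* to OFFLINE/) {
            sub(/^Event Description:[ \t]*/, "")
            print when " | " $0
        }
    ' "$1"
}

# Print only the last PD-to-FAILED event in the log, i.e. the drive to revive.
last_failed_pd() {
    list_failures "$1" | grep 'on PD ' | tail -n 1
}
```

Calling `list_failures show.event.log` gives the short history of failures, and `last_failed_pd show.event.log` prints only the final drive failure. It is still worth checking by eye that this event sits directly before the VD change to OFFLINE.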
- Reboot the server and, when the prompt for the AVAGO 3108 MegaRAID BIOS Configuration Utility appears, this time enter the utility.
- Bring the hard drive found in the previous steps to the ONLINE state. The drive might be in a foreign configuration or just in a bad state, so make the drive “unconfigured good” and import the foreign configuration; its state will immediately become ONLINE, which means it is part of an existing virtual drive again. The virtual drive state will immediately change to DEGRADED (two broken disks are still out of the virtual drive). Follow the screenshots below to get the last broken disk back ONLINE and the virtual drive into an operable state, DEGRADED. If the drive is only in a BAD/FAILED state, skip the foreign part and make the disk ONLINE (this may first require making the disk “unconfigured good”).
- Recover the data by simply copying it to another server or a healthy virtual drive. DO NOT TRY TO REMOVE data, i.e. do not use “rm”: the real state of this third broken disk is unknown and writing would probably kill it off. A good idea is to mount the filesystems on this virtual drive read-only and just rsync the data to a backup.
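The read-only mount and the rsync copy from the last step might look like the sketch below. The device name, mount point and backup target are hypothetical placeholders, not values from this server, and the helper functions `run` and `rescue_copy` are invented for the example. By default the script only prints the commands, so they can be reviewed before anything touches the disks.

```shell
# Sketch: copy everything off the revived virtual drive without writing to it.
# Hypothetical values: DEV, MNT and DEST must be adapted to the real system.
DEV="${DEV:-/dev/sdX1}"                  # filesystem on the DEGRADED virtual drive
MNT="${MNT:-/mnt/recovery}"              # scratch mount point
DEST="${DEST:-backup-host:/srv/backup/}" # rsync target (local path or host:path)

# With DRY_RUN=1 (the default) commands are only printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rescue_copy() {
    run mkdir -p "$MNT"
    # "ro" mounts read-only; journaling filesystems may still replay their
    # journal, which ext4 "noload" or XFS "norecovery" would avoid.
    run mount -o ro "$DEV" "$MNT"
    # -a archive mode, -H hard links, -A ACLs, -X extended attributes
    run rsync -aHAX --numeric-ids "$MNT/" "$DEST"
}
```

Running `rescue_copy` prints the three commands; setting `DRY_RUN=0` before the call executes them. If the copy is interrupted, rsync can be started again with the same arguments and will skip files that already match on the target.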
Here is the process of getting the third disk, in “Slot 5”, from “Missing” to ONLINE, and the “Virtual Drive 0” from Offline to DEGRADED, i.e. operating.
SCREENSHOT 1) The Drive in Slot 5 is missing and the Virtual Drive 0 is in OFFLINE state
Slot 5 is the hard drive we need to recover, but it is reported as missing. Missing suggests the drive belongs to another configuration, so press “Ctrl+N” to switch to the next page (i.e. menu), which is “PD Mgmt”, physical disk management.
SCREENSHOT 2) The “PD Mgmt” menu shows all available disks (all disks whose electronics still work), and the disk in Slot 5 is in Foreign Bad state.
Foreign Bad means the disk is in a FAILED state (or Unconfigured Bad) and is part of another configuration. Sometimes disks that were removed from an array end up, after a server reset, separated into another (foreign) configuration.
SCREENSHOT 3) Select the hard disk in Slot 5 and press F2 to list the operations available in this state, then choose “Make unconfigured good”.
SCREENSHOT 4) The state of the disk in Slot 5 changed to Foreign, which indicates the disk is ONLINE and there is a new menu – “Foreign View”.
“Foreign View” shows only the configuration containing one hard disk, the one in Slot 5. Of course, the configuration is the same RAID 6 with 6 disks as the main one.
SCREENSHOT 5) The idea is to import this configuration: the controller will then recognize the hard disk in Slot 5 as part of the main configuration, only 2 disks will remain broken, and the controller can “assemble” the virtual drive so that it works again.
SCREENSHOT 6) Select the top line, “AVAGO 3108 MegaRAID (Bus 1, DEV 0)”, and press F2 to show the Operations menu.
There is only one sub-menu, “Foreign Config”.
SCREENSHOT 7) There are only two options: import the “new” configuration with the disk in Slot 5, or delete it.
Press “Import” to import the hard disk back to the main configuration.
SCREENSHOT 8) Additional confirmation is needed from the user.
Confirm the operation by pressing “Yes”.
SCREENSHOT 9) The Offline Virtual Drive 0 immediately goes DEGRADED, which means the virtual drive now works and all the file systems can be used.
SCREENSHOT 10) The disk is ONLINE, so it is part of the current RAID 6 configuration.
There are two more broken disks: the disk in Slot 1, which is Offline, and the disk in Slot 3, which is in FAILED state.
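After booting back into the recovery system, the same states can be checked from the command line instead of the BIOS utility. This is a minimal sketch that only reads information; it assumes the storcli path used earlier and controller “/c0”, and the function name `show_raid_state` is made up for the example.

```shell
# Sketch: print virtual-drive and physical-drive states for one controller.
# Assumption: storcli64 lives at the path used earlier in this article.
STORCLI="${STORCLI:-/opt/MegaRAID/storcli/storcli64}"

show_raid_state() {
    ctrl="${1:-/c0}"                     # controller, /c0 unless given
    "$STORCLI" "$ctrl/vall" show         # all virtual drives on the controller
    "$STORCLI" "$ctrl/eall/sall" show    # all drives in all enclosures and slots
}
```

Calling `show_raid_state /c0` should list the virtual drive and the state of every slot, which makes it easy to confirm the drive states described above before starting the copy.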