Storage arrays are busy beasts: they continuously grow in number of drives, drive density, controller complexity, total storage capacity, total cache memory, and so on. These days every vendor is pushing out new features along with extended storage capacity. This added complexity in hardware and software occasionally leads to exceptional cases during spare replacement.
A drive failed in a storage system at a customer site. Normal call procedures were followed: a CE was dispatched onsite, logs were verified, and it was determined that a drive in a RAID 5 set had failed and the hot spare had kicked in. A new drive was ordered and arrived onsite within 45 minutes. The CE removed the defective drive and replaced it with the new one (while the hot spare was still syncing), and life appeared to be back to normal.
Suddenly the customer lost access to the entire RAID set that contained the failed drive. The call was quickly escalated to level 2 support. The engineer determined that another drive in the same RAID group had failed while the hot spare was still synchronizing, taking down the entire RAID set and causing data loss. It was recommended to replace this drive as well; let's call it drive 2. A new drive was ordered, the CE replaced it, and the customer started preparing to restore data from snapshots.
Someone at the customer site didn't agree with this chain of events, though, and requested a level 3 support escalation. Once the engineering team looked at the logs, they quickly determined that the CE had pulled the wrong drive during the first replacement, causing a double fault in the RAID set and the resulting data unavailability.
Level 3 support engineers asked the CE to insert drive 2 back into the same slot and wait for it to sync. Once the logs were verified, the new drive was inserted into the slot of the originally failed drive 1. With this procedure the customer was able to get the RAID set functional again and mount the volumes without a data restore.
Though this was obviously a genuine mistake, someone somewhere realized the explanation wasn't right and asked to escalate the issue. Even the smartest and most experienced CEs tend to make mistakes under pressure.
Always ask your CE to wait for the hot spare to finish synchronizing before replacing any drives; at least that gives you an added buffer in case something goes wrong.
Do not degauss failed drives right away; in some cases you may need to insert one back into the storage system.
Schedule all maintenance work, including drive replacements, for off-business hours or weekends.
Double faults are rare, but they happen; always try every available option to recover your data.
CEs at times tend to go the extra mile to recover failed disk drives: banging them against the floor, dropping them from a two-foot height, or even putting them in a deep freezer for 6 to 8 hours.
If your storage system supports RAID 6 (double parity) and there is no added performance penalty for your applications, you may want to try that option.
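The reason pulling the wrong drive was fatal comes down to how RAID 5 parity works: a stripe has exactly one XOR parity block, so it can reconstruct exactly one missing drive and no more. A minimal toy sketch (hypothetical, one block per drive, not any vendor's actual implementation) makes the single-failure limit concrete:

```python
from functools import reduce

def parity(blocks):
    """XOR all blocks together; this is the RAID 5 parity block for the stripe."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(surviving_blocks, parity_block):
    """Reconstruct a single missing block: XOR the parity with all survivors."""
    return parity(surviving_blocks + [parity_block])

# A toy 3+1 RAID 5 stripe: three data drives plus one parity drive.
d0, d1, d2 = b"\x11\x22", b"\x33\x44", b"\x55\x66"
p = parity([d0, d1, d2])

# One drive lost (d1): the stripe is fully recoverable from the rest.
assert rebuild([d0, d2], p) == d1

# Two drives lost (d1 and d2): there is only one parity equation, so
# d0 and p alone cannot determine either missing block -- a double
# fault, and the stripe is gone. RAID 6 adds a second, independent
# parity block precisely to survive this case.
```

This is also why reinserting drive 2 worked in the story above: its data was still intact, so putting it back reduced the stripe to a single-failure state that the one parity block could handle.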