True IT – Storage Stories: 1 (Dataloss)
About 4 years ago, I got a call early one morning from a customer for whom we were doing some data migration work. The customer had decided to put the project on hold until they sorted out some issues in their storage environment.
Later in the day, we had another call with the same customer, and they passed on some terrifying news. One of the storage arrays we were planning to migrate data from was managed by an independent service provider. This independent service provider had not bothered to set up email home on the system, which means that in the case of a catastrophic failure the system would not notify anyone. The customer started reporting to the independent service provider that they were losing access to certain volumes in the host environment, then started to see data corruption, and within a few hours the entire disk array was completely gone.
This was beyond the capabilities of the independent service provider to fix, so they escalated the call to the OEM. The OEM engineering folks and onsite teams worked around the clock for 4 days trying to recover data from the machine. Because nothing called home or emailed home when a component in the storage array failed, the failure went unnoticed, the data started getting corrupted, and the entire system was brought to its knees. THE DATA WAS GONE!!!
The customer lost 60 TB of raw storage in a few hours.
Now, the question shouldn't be whether the data can be recovered from backup tapes and other media, which the customer was able to do over the next 3 weeks.
The primary question is: why did it happen? And what can be done to prevent a similar catastrophic failure?
If you are in charge of managing any data in your organization today that lives on storage arrays, open a call with the OEM or your independent service provider on a monthly basis. Have them check every storage array in the environment and verify that it is calling home or emailing home on a regular basis. If a modem is attached to the system, verify that all components are working. If you have a TCP/IP/SSL-based connection, verify that it is working. If you have email home features, verify that the emails are not getting queued on your Exchange server, or that your Exchange server's IP address has not changed.
These call home, email home, and TCP/IP/SSL features allow the storage arrays to regularly communicate back to the OEM or your independent service provider with errors, warnings, events, and heartbeats.
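One way to keep an eye on those heartbeats between monthly calls is to track when each array last reported in and flag any that have gone quiet. The following is only a minimal sketch in Python under stated assumptions: the function name `find_silent_arrays`, the array names, and the 24-hour threshold are all hypothetical, and how you actually collect the last heartbeat time (mail logs, an OEM portal, an SRM export) depends on your environment.

```python
from datetime import datetime, timedelta


def find_silent_arrays(last_heartbeat, now, max_silence=timedelta(hours=24)):
    """Return the names of arrays that have not reported in recently.

    last_heartbeat maps an array name to the datetime of its last
    call home / email home, or None if it has never reported.
    The 24-hour default is an assumption, not a vendor recommendation.
    """
    silent = []
    for name, seen in last_heartbeat.items():
        # An array that has never reported is treated as silent too.
        if seen is None or now - seen > max_silence:
            silent.append(name)
    return sorted(silent)


if __name__ == "__main__":
    # Hypothetical inventory; replace with data from your own environment.
    now = datetime(2024, 1, 10, 12, 0)
    heartbeats = {
        "array-a": datetime(2024, 1, 10, 6, 0),
        "array-b": datetime(2024, 1, 8, 6, 0),
        "array-c": None,
    }
    for name in find_silent_arrays(heartbeats, now):
        print(f"No recent heartbeat from {name}; open a call now")
```

Treating "never reported" the same as "stopped reporting" is deliberate: the array in this story fell into exactly that category.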
If you are using any SRM tools, please regularly check for alerts. If you are receiving any failed communication alerts, please escalate the situation immediately rather than waiting. Also verify you are not consistently seeing the same failed components in the array through the SRM tools.
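Spotting the same failed component showing up again and again can also be automated if your SRM tool lets you export its alerts. Here is a rough Python sketch of the idea. The alert layout (dictionaries with `array`, `component`, and `status` keys), the function name `recurring_failures`, and the threshold of 3 are all assumptions for illustration, not the export format of any particular SRM product.

```python
from collections import Counter


def recurring_failures(alerts, min_count=3):
    """Return (array, component) pairs that failed at least min_count times.

    Each alert is assumed to be a dict with 'array', 'component' and
    'status' keys; real SRM exports will need to be mapped to this shape.
    """
    counts = Counter(
        (alert["array"], alert["component"])
        for alert in alerts
        if alert["status"] == "failed"
    )
    return sorted(pair for pair, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    # Hypothetical alert history; replace with your SRM tool's export.
    alerts = [
        {"array": "array-a", "component": "disk-12", "status": "failed"},
        {"array": "array-a", "component": "disk-12", "status": "failed"},
        {"array": "array-a", "component": "disk-12", "status": "failed"},
        {"array": "array-a", "component": "psu-1", "status": "warning"},
        {"array": "array-b", "component": "disk-03", "status": "failed"},
    ]
    for array, component in recurring_failures(alerts):
        print(f"{component} on {array} keeps failing; escalate")
```

Anything this reports is a component that has been failing for a while without being fixed, which is the situation you want to escalate rather than wait on.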