True IT – Storage Stories: 4 (Battery Replacements)

Yes, environmental issues on storage arrays are critical, and they need to be handled carefully, correctly, and only during change control windows. Environmental issues include power, cooling, sensors, power supplies, fans, batteries, etc. A failed power component sharply increases the risk of an outage on a disk array.

A CE was on site at a customer data center replacing a battery in a disk array. The change was scheduled during normal business hours, since a battery replacement is typically considered a non-intrusive procedure. The CE disconnected the cables from the old battery, removed it, and installed a new battery shipped from the manufacturer. He then reconnected all the power cables to the new battery, and within a few seconds the entire storage subsystem shut down, causing a complete outage in the customer's environment. Several TB of data were on this system. The storage subsystem was running around 50% of the customer's supply chain management applications, and the service disruption caused several million dollars in losses.

Lesson Learnt:

Don’t assume that parts replacement procedures will always go according to plan. Small, seemingly trivial hiccups can cause havoc. These issues do not happen every day, but they do tend to recur from time to time. There are times when even a battery replacement can cause an entire subsystem to fail.

Schedule any power-related maintenance and replacements on weekend nights and only during change control windows, when the business can tolerate the risk of an outage. You never know when a bad day might turn really ugly.

Power-related maintenance includes replacing power supplies, fans, and batteries, changing sensors, working on a single power leg, etc.

By the way, having a UPS will not help in these situations.