True IT – Storage Stories: 5 (8,000 Users on MS Exchange)
This customer had a typical setup of BladeCenter servers and a SAN, providing MS Exchange email services to about 8,000 internal users. Clustered BladeCenter servers and multiple switches connected to a single storage subsystem in the backend, which held all user email.
Though the BladeCenter servers were fairly new, the manufacturer's warranty on the backend SAN had just expired. The customer was planning to migrate to a new storage subsystem, but in the meantime they let the support contract on the existing one lapse and relied on T&M (time and materials) support. In short, no one was monitoring failures, errors, or events on this storage subsystem. One morning, for no apparent reason, the entire storage subsystem powered off by itself. With UPS protection and generators in the environment, this behavior was very unusual. It took the MS Exchange databases, logs, and mailboxes down with it. 8,000 users lost email service. Yes, all the executives of the company were on the same system.
The call was escalated within a few minutes. Since this was a company-wide outage, everyone was trying to figure out what had just happened. A T&M call was placed with the manufacturer to fix the system (see, I didn’t say diagnose the problem). SEV 1 calls are pretty nasty, and given what had happened, the manufacturer showed up immediately. The spares arrived within an hour. After 3 hours in total the system was ready to be powered back up, and it took another hour or so to run the final health check and bring up all the mount points, servers, clusters, services, etc.
4 hours of outage, 8,000 users affected.
The problem was narrowed down to multiple failed power supplies in the controller enclosure. Because of the lack of monitoring and support, earlier power supply failures had gone undetected, and when another power supply failed that morning it brought the entire storage subsystem to its knees.
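The failure pattern here, redundant parts dying quietly until the last one goes, is exactly what a simple redundancy check is meant to catch. Below is a minimal sketch in Python, not anything the customer or vendor actually ran. The `PowerSupply` class, the `check_redundancy` function, the slot names, and the `required` count are all hypothetical; the story does not say how many power supplies the enclosure had, and reading real status from the array's management interface is left out.

```python
from dataclasses import dataclass


@dataclass
class PowerSupply:
    slot: str       # hypothetical slot label, e.g. "PSU-A"
    healthy: bool   # True if the supply reports as working


def check_redundancy(supplies, required=1):
    """Classify an enclosure by how much power redundancy is left.

    `required` is the assumed minimum number of working supplies
    needed to keep the enclosure powered on.
    """
    failed = [p.slot for p in supplies if not p.healthy]
    working = len(supplies) - len(failed)

    if working < required:
        return "DOWN", failed        # not enough supplies to stay up
    if failed and working == required:
        return "CRITICAL", failed    # one more failure means power off
    if failed:
        return "WARNING", failed     # redundancy reduced, still some spare
    return "OK", failed


if __name__ == "__main__":
    enclosure = [PowerSupply("PSU-A", True), PowerSupply("PSU-B", False)]
    status, failed = check_redundancy(enclosure)
    print(status, failed)
```

The point of the sketch is that the system looks perfectly fine to its users in the WARNING and CRITICAL states, so the alert has to fire on the first failed part, not on the outage.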
So it's very important to decide which systems can tolerate a lapse in contract or coverage and which are business-critical systems that need 24 x 7 coverage. Have the vendor check for failures regularly. Though this customer had a pretty good investment in IT infrastructure, they hadn’t thought about replication or CDP (continuous data protection) solutions for their MS Exchange.
As unreal as it sounds, I have heard of several customers who still “WALK THE DATACENTER” every week: their technicians go from rack to rack checking for amber lights. Errors like these tend to get picked up with practices like that in place.
Having world-class power systems, UPSs, and generators will not help with issues like these.
After all, the question is: what did the customer save, or lose, by leaving this storage subsystem unsupported?