
True IT – Storage Stories: 5 (8,000 Users on MS Exchange)


That unfortunate Thursday morning, around 8:30 AM PST, was when all hell broke loose.

This customer had a typical setup with BladeCenter servers and a SAN, providing MS Exchange email services to about 8,000 internal users within the organization: clustered BladeCenter servers and multiple switches, all connected to a single storage subsystem on the backend that served all user email.

Though the BladeCenter servers were pretty new, the SAN on the backend had just come off its manufacturer's warranty. The customer was planning to migrate to a new storage subsystem, but in the meantime they let the support contract expire and kept the system on T&M (time and materials) support; in short, no one was monitoring failures, errors, or events on this storage subsystem. That morning, for some unknown reason, the entire storage subsystem powered off by itself. With UPS protection and generators in the environment, this behavior was very unusual. The power loss took down the MS Exchange databases, logs, and mailboxes, and 8,000 users lost email service. Yes, all the executives of the company were on the same system.

The call was escalated within a few minutes; since this caused a company-wide outage, everyone was trying to figure out what had just happened. A T&M call was placed with the manufacturer to fix the system (notice I didn't say diagnose the problem). SEV 1 calls are pretty nasty, and they showed up immediately because of what had happened. The spares arrived within an hour. Three hours in total and the system was ready to be powered back up, then another hour or so for the final health check and to bring up all the mount points, servers, clusters, services, etc.

Four hours of outage, 8,000 users affected.

The problem was narrowed down to multiple failed power supplies in the controller enclosure. Due to the lack of monitoring and support, the earlier power supply failures had gone undetected, and one more failed power supply that morning brought the entire storage subsystem to its knees.

Lesson Learnt:

So it's very important to decide which systems can tolerate a lapse of contract or coverage and which are business-critical systems that need 24 x 7 coverage, and to have the vendor check for failures regularly. Though this customer had a pretty good investment in IT infrastructure, they never considered replication or CDP (continuous data protection) solutions for their MS Exchange deployment.

As unreal as it sounds, I have heard of several customers who to this day perform a "WALK THE DATACENTER" every week, where their technicians go from rack to rack checking for amber lights. Failures like these tend to get picked up when practices like that are in place. A simple script can do the same walk around the clock, as sketched below.
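Here is a minimal sketch of that idea in code, assuming pysnmp is installed and that the enclosure exposes a power-supply status sensor over SNMP. The OID and hostnames are hypothetical placeholders, since the real values are vendor-specific; check your array's MIB before relying on anything like this.

```python
# Minimal sketch of an automated "walk the datacenter": poll each storage
# enclosure's power-supply status over SNMP and flag anything unhealthy.
# Requires pysnmp (pip install pysnmp). The OID and hosts below are
# hypothetical placeholders; real OIDs are vendor-specific.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

# Placeholder OID for a power-supply status sensor; consult your vendor's MIB.
PSU_STATUS_OID = "1.3.6.1.4.1.99999.1.2.1.0"

# Hypothetical list of storage arrays to check.
ARRAYS = ["array01.example.com", "array02.example.com"]

def psu_status(host: str, community: str = "public") -> str:
    """Return the raw PSU status string, or 'unreachable' on any SNMP error."""
    error_indication, error_status, _, var_binds = next(
        getCmd(
            SnmpEngine(),
            CommunityData(community, mpModel=1),            # SNMP v2c
            UdpTransportTarget((host, 161), timeout=2, retries=1),
            ContextData(),
            ObjectType(ObjectIdentity(PSU_STATUS_OID)),
        )
    )
    if error_indication or error_status:
        return "unreachable"
    return str(var_binds[0][1])

if __name__ == "__main__":
    for host in ARRAYS:
        status = psu_status(host)
        # Anything other than a healthy reading is the software equivalent
        # of an amber light: raise an alert instead of waiting for Thursday.
        if status.lower() not in ("ok", "normal"):
            print(f"ALERT: {host} power supply status = {status}")
```

Run something like this from cron every hour and the earlier power supply failures in this story would have raised an alert weeks before the last redundant supply gave out.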

Having world-class power systems, UPSes, and generators will not help with issues like these.

After all, the question is: what did the customer really save, or lose, by leaving this storage subsystem unsupported?

  • Name

    It's worth noting that this is part of the design goals for the Exchange 2010 platform: to make native mailbox database replication so easy and simple that you won't hesitate to use DAS and other cheap storage to create 2, 3, or even more copies of databases in your organization. Besides reducing costs, it also provides true redundancy at the application level, and makes multi-site replication and DR availability easy without having to resort to pricey SAN-level replication technology.
