True IT – Storage Stories: 6 (Storage Subsystem Move)

October 13th, 2009 2 comments

This true customer story is about the physical move of a storage subsystem. The need for a data center move can arise for a wide variety of reasons; in this case the customer was moving all of its IT assets from one building to another as part of a cost-savings effort (from a large facility to a smaller one).

The customer's annual revenue was around $250 million, and its IT organization was split into several groups.

The customer was moving all IT data center assets from Building 1 to Building 2. Typically during these moves, the vendors of the IT assets are brought in to verify power shutdown procedures, label all the cables, recertify assets, reconnect cables, power systems back up, run data consistency checks, and so on. This customer decided to make the move without involving all the necessary vendors. The move was scheduled in phases, with all the primary servers and storage assets moving in phase 1. Project plans were put in place, resources were scheduled, and so forth.

Phase 1 was moving along fine until it came to one of the storage subsystems: it was too heavy to push across the raised floor, since some storage assets require a reinforced raised floor and that is typically not the case throughout an entire data center. Someone associated with the move decided to remove every drive installed in the storage subsystem, pack the drives in boxes, and ship them to Building 2. However, the drives were not labeled to record which slots they came from. The storage subsystem itself reached Building 2 without any issues.

The customer realized the magnitude of this mistake once it was time to bring the system back online: no one knew which slot each disk drive belonged in.

At this crucial point the vendor was contacted and asked for help. This storage subsystem held all the data used by the customer's CRM and application development teams. Because it had been moved without the vendor's prior knowledge, the vendor first had to come in and certify the system before they could start working on the issue, and they knew a daunting task lay ahead. The onsite CE asked for all the logs taken prior to the system shutdown, which the customer was able to provide. Based on the slot numbers and disk drive serial numbers in those logs, the drives were inserted one at a time. Most of the serial numbers matched up fine, but some drives had been replaced recently and their slot IDs could not be matched to serial numbers because they did not appear in the logs. The vendor took the extra step of going back to its own service ticketing system to find every drive that had ever been replaced at this customer site and which slot it had been installed in.
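To illustrate the kind of matching the CE had to do, here is a minimal sketch in Python. The log format, file names, field positions, and slot count are all hypothetical assumptions for illustration; real subsystem logs and vendor service records vary by manufacturer.

```python
# Hypothetical sketch: rebuild a slot-to-serial map from a saved subsystem log,
# then fill the gaps from a vendor service-ticket export. Formats are assumed.

import csv

def parse_subsystem_log(path):
    """Parse lines like 'SLOT 03 DRIVE SN JK1101B9XYZ ONLINE' into {slot: serial}."""
    slot_to_serial = {}
    with open(path) as log:
        for line in log:
            parts = line.split()
            if len(parts) >= 5 and parts[0] == "SLOT" and parts[2] == "DRIVE":
                slot_to_serial[int(parts[1])] = parts[4]
    return slot_to_serial

def fill_from_service_tickets(slot_to_serial, tickets_csv, all_slots):
    """For slots missing from the log (recently replaced drives), fall back to
    the vendor's ticketing export, assumed to have 'slot' and 'new_serial' columns."""
    with open(tickets_csv, newline="") as f:
        for row in csv.DictReader(f):
            slot = int(row["slot"])
            if slot in all_slots and slot not in slot_to_serial:
                slot_to_serial[slot] = row["new_serial"]
    return slot_to_serial

if __name__ == "__main__":
    mapping = parse_subsystem_log("pre_shutdown.log")          # hypothetical file name
    mapping = fill_from_service_tickets(mapping, "tickets.csv", set(range(1, 241)))
    for slot in sorted(mapping):
        print(f"slot {slot:03d} -> insert drive {mapping[slot]}")
```

The point is not the code itself but the dependency: without the pre-shutdown logs and the vendor's own replacement records, no such map can be rebuilt at all.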

After 12 hours of tedious work matching serial numbers to slot IDs, the system was finally back up and running, with a few failed drives. The escalations, vendor meetings, customer meetings, and 24 hours of downtime could all have been averted.

Lesson Learnt:

Data center moves should be taken very seriously; 99% of the time, plug and play is not an option.

Label every cable that was pulled out of the storage subsystem before the move.

Every IT Asset vendor should be involved in the process.

Systems, especially storage subsystems, should be powered off correctly according to manufacturer specifications, and by the manufacturer itself.

Every system should be certified prior to the move and re-certified after it; these services are typically provided free of charge by all the major vendors.

Vendors recommend using movers who handle storage subsystems on a daily basis, and it is a good idea to involve them in the process, as they can take extra precautions during the move.

Back up the data on the storage subsystem before the move.

Run data consistency checks after the move, both on the storage subsystem and from the associated host systems, to verify data integrity (a simple host-side sketch follows this list).
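As a rough illustration of a host-side consistency check, the sketch below hashes every file under a mount point before the move and compares against a fresh scan afterwards. The mount point, manifest file name, and choice of SHA-256 are assumptions; in practice you would also run the checks recommended by the subsystem and application vendors.

```python
# Hedged sketch: compare per-file SHA-256 hashes captured before and after a move.
# Mount point and manifest file names are hypothetical placeholders.

import hashlib
import json
from pathlib import Path

def hash_tree(root):
    """Return {relative_path: sha256_hex} for every file under root."""
    digests = {}
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[str(path.relative_to(root))] = h.hexdigest()
    return digests

# Before the move: json.dump(hash_tree("/mnt/storage"), open("before.json", "w"))
# After the move:
before = json.load(open("before.json"))
after = hash_tree("/mnt/storage")
mismatches = [p for p, digest in before.items() if after.get(p) != digest]
print("OK" if not mismatches else f"{len(mismatches)} files differ: {mismatches[:10]}")
```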

True IT – Storage Stories: 5 (8,000 Users on MS Exchange)

October 7th, 2009 2 comments

That unfortunate Thursday morning, around 8:30 AM PST, was when all hell broke loose.

This customer had a typical setup of BladeCenter servers and a SAN, providing MS Exchange email service to about 8,000 internal users within the organization. Clustered BladeCenter servers and multiple switches were connected to a single backend storage subsystem serving all user email.

Though the BladeCenter servers were fairly new, the SAN on the backend had just come off its manufacturer's warranty. The customer was planning to migrate to a new storage subsystem, but in the meantime they let support on the old one lapse and kept it on T&M support; in short, no one was monitoring failures, errors, or events on this storage subsystem. That morning, for no apparent reason, the entire storage subsystem powered itself off. With UPS protection and generators in the environment, this behavior was very unusual. The MS Exchange databases, logs, and mailboxes failed, and 8,000 users lost email service. Yes, all the executives of the company were on the same system.

The call was escalated within a few minutes; since this was a company-wide outage, everyone was trying to figure out what had just happened. A T&M call was placed with the manufacturer to fix the system (notice I didn't say diagnose the problem); SEV 1 calls are pretty nasty. The manufacturer showed up immediately given what had happened, and the spares arrived within an hour. Three hours in total and the system was ready to be powered back up, with another hour or so for the final health check and to bring up all the mount points, servers, clusters, services, etc.

Four hours of outage, 8,000 users affected.

The problem was narrowed down to multiple failed power supplies in the controller enclosure. Due to the lack of monitoring and support, the earlier power supply failures had gone undetected, and when another power supply failed that morning it brought the entire storage subsystem to its knees.

Lesson Learnt:

It is very important to decide which systems can tolerate a lapse in contract or coverage and which are business-critical systems that need 24x7 coverage. Have the vendor check for failures regularly. Though this customer had a sizable investment in IT infrastructure, they had not considered replication or CDP solutions for their MS Exchange environment.

As unreal as it sounds, I have heard of several customers who still perform a "walk the datacenter" round every week, where their technicians go from rack to rack checking for amber lights. Errors like these tend to get picked up when such practices are in place.
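The same idea can be automated when the hardware exposes its component status through a management interface. The sketch below is only illustrative: the hostnames, the "show enclosure status" command, and its output format are assumptions, not any particular vendor's CLI; adapt it to whatever your arrays actually provide.

```python
# Hypothetical sketch: poll each array's management CLI over SSH and flag any
# component not reporting OK -- the software equivalent of walking the racks
# looking for amber lights. Command and output format are assumptions.

import subprocess

ARRAYS = ["array-01.example.com", "array-02.example.com"]   # assumed hostnames

def enclosure_status(host):
    """Run an assumed 'show enclosure status' command and return its output lines."""
    result = subprocess.run(
        ["ssh", host, "show", "enclosure", "status"],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout.splitlines()

def failing_components(lines):
    """Assume lines look like 'PSU-2 Degraded'; report anything not ending in OK."""
    return [line for line in lines if line and not line.endswith("OK")]

for host in ARRAYS:
    for problem in failing_components(enclosure_status(host)):
        print(f"ALERT {host}: {problem}")     # in practice: email or page someone
```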

Having world-class power systems, UPSs, and generators will not help with issues like these.

After all, the real question is what the customer actually saved, or lost, by leaving this storage subsystem unsupported.

True IT – Storage Stories: 4 (Battery Replacements)

October 6th, 2009 No comments

Yes, environment-related issues on storage arrays are critical, and they need to be handled carefully, correctly, and only during change control windows. Environmental issues include power, cooling, sensors, power supplies, fans, batteries, etc. When a power component fails, the risk of an outage on a disk array goes up tremendously.

A CE was on site at a customer data center replacing a battery in a disk array. The change was scheduled during normal business hours, since a battery replacement is typically considered a non-intrusive procedure. The CE disconnected the cables from the old battery, removed it, and installed the new battery shipped from the manufacturer. Within a few seconds of reconnecting the power cables to the new battery, the entire storage subsystem shut down, causing a complete outage in the customer's environment. Several TBs of data lived on this system, which was running around 50% of the customer's supply chain management applications; the service disruption caused several million dollars in losses.

Lesson Learnt:

Don't assume that parts replacement procedures will always go according to plan. Small, trivial hiccups can wreak havoc. These issues do not happen every day, but they do recur from time to time; even a battery replacement can cause an entire subsystem to fail.

Schedule any power-related maintenance and replacements only on weekend nights and during change control windows, when the business can tolerate the risk of an outage. You never know when a bad day might turn really ugly.

Power-related maintenance includes replacing power supplies, fans, batteries, and sensors, working on a single power leg, and so on.

By the way, having a UPS will not help in these situations.

True IT – Storage Stories: 3 (SRM Tools)

October 4th, 2009 5 comments

The necessity for SRM (Storage Resource Management) tools is growing by the day; a complete picture of the entire storage environment can be obtained with them. Customers typically use SRM tools for key storage management functions, including analyzing the storage environment, making configuration changes, reporting on storage, collecting performance data, alerting on exceptions, etc. SRM tools also provide a granular view of the storage subsystems and their relationships to hosts, fabrics, disks, file systems, consumption, and utilization.

A very large US customer decided to deploy an SRM tool for their storage infrastructure, which consisted of 15 sites globally, several PBs of storage, storage arrays of various makes and models, and segregated storage management teams. Overcoming several technological and organizational challenges, they managed to deploy an SRM tool that gave them a complete picture of the storage environment.

The deployment cost was 30 million dollars in CAPEX, covering the SRM tool, licenses, OS licenses, hardware, agent deployment, testing, training, virtualization, and so on, over a 24-month deployment cycle. By the time they were up and running, the implementation was 50% over budget and 12 months behind schedule.

Challenges remain today around patch management of the SRM tool, managing 15 sites globally, OS upgrades, SRM tool upgrades, array firmware upgrades and compatibility, ongoing SRM management, periodic SRM cleanups, support for non-SMI-S arrays, support for other vendors' arrays, and the accuracy of reports in the presence of virtualization, clustering, and thin provisioning.

Lesson Learnt:

Before any SRM tool deployment, define goals, targets, expectations, requirements, and organizational needs. With today's SRM tools it may be unrealistic to achieve 100% of your requirements.

See whether trial versions are available from vendors, and deploy them for 3 to 6 months to see whether they meet your expectations and organizational needs.

Check the compatibility of the SRM tool across the wide variety of storage platforms deployed in the organization. Review security features around access control, login rights, Active Directory integration, etc.

Set budget caps for the implementation, set target completion dates, and be aggressive about achieving them.

Obtain TCO models for the SRM tool deployment, which may include the CAPEX SRM purchase, deployment, testing, day-to-day management, software support costs, upfront and ongoing training, hardware for deployment, infrastructure changes, etc. (a simple illustrative model is sketched below).
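As an illustration only, a TCO model can be as simple as summing one-time costs and recurring costs over the planned life of the tool. Every figure below is a made-up placeholder, not a number from the story above; substitute your own quotes and headcount estimates.

```python
# Illustrative TCO sketch for an SRM deployment over a planning horizon.
# All figures are hypothetical placeholders.

YEARS = 5

capex = {                      # one-time costs
    "srm_licenses": 1_500_000,
    "os_licenses": 200_000,
    "hardware": 800_000,
    "deployment_and_testing": 600_000,
    "initial_training": 150_000,
}

opex_per_year = {              # recurring costs
    "software_support": 300_000,
    "day_to_day_management": 400_000,
    "ongoing_training": 50_000,
    "infrastructure_changes": 100_000,
}

tco = sum(capex.values()) + YEARS * sum(opex_per_year.values())
print(f"{YEARS}-year TCO: ${tco:,.0f}")
```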