

Archive for the ‘True Stories’ Category

True IT – Storage Stories: 8 (Double faults with drives)

October 27th, 2009

Storage arrays are busy beasts: they continuously grow in number of drives, drive density, controller complexity, total storage capacity, total cache memory, and so on. These days every vendor is pushing out new features along with extended storage capacity. The added complexity in hardware and software occasionally leads to exceptional cases around spares replacement.

A drive failed in a storage system at a customer site. Normal call procedures were followed: a CE was dispatched onsite, logs were verified, and it was determined that a drive had failed in a RAID 5 set and the hot spare had kicked in. A new drive was ordered and arrived onsite within 45 minutes. The CE removed the defective drive and replaced it with the newly ordered one (while the hot spare was still syncing), and life appeared to be back to normal.

Suddenly the customer lost access to the entire RAID set that contained the failed drive. The call was quickly escalated to level 2 support. The engineer determined that another drive had failed in the same RAID group while the hot spare was still synchronizing, causing the entire RAID set to fail (and taking the data on it down as well). It was recommended to replace this second drive, let's call it drive 2. A new drive was ordered, the CE replaced it, and the customer started to prepare for a data restore from snapshots.

Someone at the customer site, however, didn't agree with that chain of events and requested a level 3 support escalation. Once the engineering team looked at the logs, they quickly determined that the CE had pulled the wrong drive during the first replacement, creating a double fault in the RAID set and causing the data unavailability.

Level 3 support engineers asked for drive 2 to be inserted back into the same slot and allowed to sync. Once the logs were verified, the new drive was inserted into the slot of the originally failed drive (drive 1). With this procedure the customer was able to get the RAID set functional again and mount the volumes without a data restore.
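
As a rough illustration of the kind of pre-check that could have prevented the wrong-drive pull, here is a minimal sketch in Python. It assumes a hypothetical get_array_status() helper and invented slot numbers and field names; in practice the same information would come from the vendor's management CLI or the support logs.

```python
# Minimal sketch: refuse to pull a drive unless the array itself reports
# that this exact slot is failed AND no hot-spare rebuild is still running.
# get_array_status() is a hypothetical helper with invented example data.

def get_array_status():
    return {
        "rebuild_in_progress": True,          # hot spare still syncing
        "drives": {
            7:  {"state": "failed", "serial": "ABC123"},
            8:  {"state": "online", "serial": "DEF456"},
            15: {"state": "spare",  "serial": "GHI789"},
        },
    }

def safe_to_replace(slot):
    status = get_array_status()
    drive = status["drives"].get(slot)
    if drive is None or drive["state"] != "failed":
        print(f"Slot {slot} is not reported as failed -- do NOT pull it.")
        return False
    if status["rebuild_in_progress"]:
        print("Hot spare is still rebuilding -- wait before replacing anything.")
        return False
    print(f"Slot {slot} (serial {drive['serial']}) is safe to replace.")
    return True

if __name__ == "__main__":
    safe_to_replace(7)
```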

Lesson Learnt

Though this was obviously a genuine mistake, someone somewhere realized things weren't right and asked to escalate the issue. Even the smartest and most experienced CEs make mistakes under pressure.

Always ask your CE to wait for the hot spare to finish synchronizing before any drive replacement; at the very least that gives an added buffer in case something goes wrong.

Do not degauss failed drives right away; in some cases you may need to insert them back into the storage system.

Schedule all maintenance work, including drive replacements, for off-business hours or weekends.

Double faults are rare, but they happen; always try every available option to recover your data.

CEs at times go the extra mile to recover failed disk drives: banging them against the floor, dropping them from a height of 2 feet, or even putting them in a deep freezer for 6 to 8 hours.

If your storage system supports RAID 6 (double parity) and there is no unacceptable penalty for your applications, you may want to try that option.
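
To make the RAID 6 trade-off concrete, here is a small back-of-the-envelope sketch comparing usable capacity and drive-failure tolerance for the two layouts. The drive count and size are made-up numbers, not figures from this story.

```python
# Back-of-the-envelope comparison of RAID 5 vs RAID 6 for the same drives.
# Drive count and size below are made-up numbers for illustration.

def usable_capacity(num_drives, drive_tb, parity_drives):
    """Capacity left after reserving the equivalent of N drives for parity."""
    return (num_drives - parity_drives) * drive_tb

num_drives, drive_tb = 12, 0.3   # e.g. twelve 300 GB drives

for name, parity in (("RAID 5", 1), ("RAID 6", 2)):
    cap = usable_capacity(num_drives, drive_tb, parity)
    print(f"{name}: {cap:.1f} TB usable, survives {parity} concurrent drive failure(s)")
```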

True IT – Storage Stories: 7 (Data Wipe on the wrong machine)

October 26th, 2009


Yes, you read that right: exactly as the title of this post says, a data wipe was performed on the wrong machine.

The CE got permission from the customer to perform a data wipe on a storage system. The hosts were retired and the storage was ready to be turned off, but as part of the decommissioning procedures, customers typically require that all data on the drives be wiped.

The CE took this opportunity to connect remotely into the customer's storage system, figuring the wipe would take several hours and would be done by the time he went onsite to physically power off the storage system. Though he thought he had logged into the machine he intended to, he had actually connected to another one. He logically took the ports down through soft commands and within 15 minutes kicked off the data wipe.

An hour later, a SEV 1 ticket was opened at the customer site for major issues in the storage environment. The CE figured that while he was onsite taking care of that issue, he would also check on the data wipe and physically power off the storage subsystem.

On his way to the customer site, he got a call from the level 3 folks on the vendor support team about what they had just found on the troubled storage system: it was busy doing a data wipe, and there was no way to stop it.

The realization set in for the CE… he had started the data wipe on the wrong storage system, without following the correct procedures.

Lesson Learnt

Set a corporate-wide policy on how storage and server teams may perform certain tasks onsite and remotely. Set similar procedures with vendor teams as well.

The CE ended up losing his job, and there was no way to recover the data on the storage subsystem except to bring out the backup tapes.

One could have the best storage execution plan in place to manage the storage environment, but is there a way to avert exceptional cases like these?
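
One hedged answer is a pre-wipe verification step. The sketch below checks the serial number of the system you are actually connected to against the serial recorded on the change ticket before allowing any destructive command; query_system_serial(), the serial values, and the ticket field are assumptions for illustration, not any specific vendor's API.

```python
# Minimal sketch: confirm the machine you are logged into is the machine
# named on the change ticket before any destructive action is started.
# query_system_serial() is a hypothetical helper; in practice the serial
# would come from the management CLI of the connected system.

APPROVED_SERIAL = "STG-000123"   # serial recorded on the change ticket (example value)

def query_system_serial():
    # Placeholder: return the serial of whatever system we are connected to.
    return "STG-000987"

def confirm_wipe_target():
    connected = query_system_serial()
    if connected != APPROVED_SERIAL:
        print(f"Connected to {connected}, but the ticket approves {APPROVED_SERIAL}. Aborting.")
        return False
    print(f"Target {connected} matches the ticket -- OK to proceed with the wipe.")
    return True

if __name__ == "__main__":
    if confirm_wipe_target():
        pass  # kick off the wipe here (intentionally omitted in this sketch)
```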

True IT – Storage Stories: 6 (Storage Subsystem Move)

October 13th, 2009


This true customer story relates to the physical move of a storage subsystem. The need for a data center move can arise for a wide variety of reasons; in this case the customer was moving all IT assets from one building to another as a cost-saving measure (from a large facility to a smaller one).

The customer's annual revenues were around 250 million, with several groups within their IT business organization.

The customer was moving all IT data center assets from Building 1 to Building 2. Typically during these moves the vendors of the IT assets are brought in to verify power-down procedures, label all the cables, handle the move, recertify the assets, reconnect cables, power up, run data consistency checks, and so on. This customer decided to make the move without involving all the necessary vendors. The move was scheduled in phases, with all the primary servers and storage assets being moved in phase 1. Project plans were put in place, resources scheduled, and so forth.

Things were moving along fine with the phase 1 move until it came to one of the storage subsystems: it was too heavy to push across the raised floor, since some storage assets need a reinforced raised floor and that is typically not the case throughout the entire data center. Someone associated with the move decided to remove every drive installed in the storage subsystem, pack them in boxes, and move them to Building 2, but the drives were not labeled as to where they came from. The storage subsystem itself made it to Building 2 without any issues.

The customer quickly realized the magnitude of this mistake once it was time to power the system back up: no one had any idea which slot each disk drive should go back into.

At this crucial point the vendor was contacted and asked for help. The storage subsystem held all the data used by the customer's CRM and application development teams. Because the subsystem had been moved without the vendor's prior knowledge, the vendor first had to come in and certify the system before they could start working on the issue, and they knew a daunting task lay ahead. The onsite CE asked for all the logs taken prior to the system shutdown, which the customer was able to provide, and based on the slot numbers and disk drive serial numbers in those logs the drives were inserted one at a time. Most of the drive serial numbers matched up fine, but some drives had been recently replaced and there was no way to match their slot IDs to serial numbers since they did not appear in the logs. The vendor took the extra step of going back to their own service ticketing system to find every drive that had been replaced at this customer site and which slot ID it had been replaced in.

After 12 hours of tedious work matching serial numbers to slot IDs, the system was finally back up and running, with a few failed drives. Escalations, vendor meetings, customer meetings, and 24 hours of downtime could have been averted.
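
A simple pre-shutdown inventory would have made the reassembly trivial. The sketch below shows one way to capture a slot-to-serial map before powering down and verify it after the move; the drive listing is invented example data, since the real mapping would come from the array's logs or management CLI.

```python
# Minimal sketch: record a slot -> drive-serial map before powering down,
# and verify it after the move. Example data is invented; the real mapping
# would come from the array's logs or management CLI.
import json

def snapshot_slot_map(drives, path):
    """Save the slot -> serial mapping to a file kept with the move paperwork."""
    with open(path, "w") as f:
        json.dump({str(slot): serial for slot, serial in drives.items()}, f, indent=2)

def verify_slot_map(drives, path):
    """Return the slots whose reinstalled drive does not match the snapshot."""
    with open(path) as f:
        expected = json.load(f)
    return [slot for slot, serial in drives.items() if expected.get(str(slot)) != serial]

if __name__ == "__main__":
    before = {0: "SER-A1", 1: "SER-B2", 2: "SER-C3"}   # example data
    snapshot_slot_map(before, "slot_map.json")

    after = {0: "SER-A1", 1: "SER-C3", 2: "SER-B2"}    # two drives swapped by mistake
    print("Mismatched slots:", verify_slot_map(after, "slot_map.json"))
```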

Lesson Learnt

Data center moves should be taken very seriously; 99% of the time plug-and-play is not an option.

Label every cable that was pulled out of the storage subsystem before the move.

Every IT Asset vendor should be involved in the process.

Systems, especially storage subsystems, should be powered off correctly based on manufacturer specifications, and ideally by the manufacturer itself.

Every system should be certified prior to the move and re-certified after it; these services are typically provided free of charge by all the major vendors.

Vendors recommend using movers who handle storage subsystems on a daily basis, and it may be a good idea to involve them in this process, as they can take extra precautions during the move.

Back up the data on the storage subsystem before the move.

Run data consistency checks on the storage subsystem after the move, and verify data integrity from the associated host systems.
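
As a hedged illustration of the host-side check mentioned above, the sketch below records checksums for the files on a mounted volume before the move and compares them afterwards. The mount point /mnt/crm_volume is a placeholder path, not one from this story.

```python
# Minimal sketch: checksum files on a mounted volume before the move and
# compare them afterwards. The mount point is a placeholder path.
import hashlib
import json
import os
import sys

def checksum_tree(root):
    """Map each file path under root to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[path] = h.hexdigest()
    return digests

if __name__ == "__main__":
    # Usage: python check.py before   (run on the host before the move)
    #        python check.py after    (run again once everything is back up)
    mode, root, record = sys.argv[1], "/mnt/crm_volume", "checksums.json"
    if mode == "before":
        with open(record, "w") as f:
            json.dump(checksum_tree(root), f)
    else:
        with open(record) as f:
            before = json.load(f)
        after = checksum_tree(root)
        changed = [p for p in before if after.get(p) != before[p]]
        print(f"{len(changed)} file(s) differ after the move")
```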

True IT – Storage Stories: 5 (8,000 Users on MS Exchange)

October 7th, 2009

That unfortunate Thursday morning, around 8:30 AM PST, all hell broke loose.

This customer had a typical setup with BladeCenter servers and a SAN, providing MS Exchange email services to about 8,000 internal users within the organization: clustered BladeCenter servers and multiple switches, connected to one storage subsystem on the backend serving all user email.

Though the BladeCenter servers were fairly new, the SAN on the backend had just gone out of its manufacturer's warranty. The customer was planning to migrate to a new storage subsystem, but in the meantime they let support on this subsystem lapse and relied on T&M (time and materials) support; in short, no one was monitoring failures, errors, or events on it. That morning, for some unknown reason, the entire storage subsystem powered itself off. With UPS protection and generators in the environment, this behavior was very unusual. The MS Exchange databases, logs, and mailboxes failed, and 8,000 users lost email service. Yes, all the executives of the company were on the same system.

The call was escalated within a few minutes; since this was a company-wide outage, everyone was trying to figure out what had just happened. A T&M call was placed with the manufacturer to fix the system (note, I didn't say diagnose the problem); SEV 1 calls are pretty nasty. The manufacturer showed up immediately given what had happened, and the spares arrived within an hour. After 3 hours in total the system was ready to be powered back up, and it took another hour or so for the final health check and to bring up all the mount points, servers, clusters, services, and so on.

4 hours of outage, 8,000 users affected.

The problem was narrowed down to multiple failed power supplies in the controller enclosure. Due to the lack of monitoring and support, the earlier power supply failures had gone undetected, and another power supply failure that morning brought the entire storage subsystem to its knees.

Lesson Learnt

It is very important to decide which systems can tolerate a lapse in contract or coverage and which are business-critical systems that need 24x7 coverage. Have the vendor check for failures regularly. Though this customer had invested well in IT infrastructure, they hadn't considered replication or CDP solutions for their MS Exchange environment.

As unreal as it sounds, I have heard of several customers today who "WALK THE DATACENTER" every week, having their technicians go from rack to rack to check for amber lights. Errors like these tend to get picked up when practices like that are in place.
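
The weekly walk can also be complemented by a tiny health-check script. The sketch below polls a hypothetical get_enclosure_status() helper and flags any component that is not reporting healthy; the component names and states are invented, since the real data would come from the vendor's CLI, SNMP traps, or a storage management API.

```python
# Minimal sketch: flag any enclosure component that is not reporting healthy.
# get_enclosure_status() is a hypothetical helper with invented example data;
# real status would come from the vendor's CLI, SNMP, or a management API.

def get_enclosure_status():
    return {
        "controller_psu_1": "ok",
        "controller_psu_2": "failed",   # the kind of fault that went unnoticed here
        "fan_1": "ok",
        "battery_1": "ok",
    }

def failed_components():
    return [name for name, state in get_enclosure_status().items() if state != "ok"]

if __name__ == "__main__":
    faults = failed_components()
    if faults:
        print("ATTENTION: unhealthy components:", ", ".join(faults))
    else:
        print("All enclosure components report healthy.")
```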

Having world-class power systems, UPSs, and generators will not help with issues like these.

After all, the question is: what did the customer save, or lose, by leaving this storage subsystem unsupported?