Archive for the ‘True Stories’ Category

True IT – Storage Stories: 4 (Battery Replacements)

October 6th, 2009

Yes, environment-related issues on storage arrays are critical, and they need to be handled carefully, correctly, and only during change control windows. Environmental issues include power, cooling, sensors, power supplies, fans, batteries, etc. When a power component fails, the risk of an outage on a disk array goes up tremendously.

A CE (customer engineer) was on site at a customer data center replacing a battery in a disk array. The change was scheduled during normal business hours, since a battery replacement is typically considered a non-intrusive procedure. The CE removed the cables from the old battery, removed the old battery, and replaced it with a new one shipped from the manufacturer. The CE connected all the power cables to the new battery, and within a few seconds the entire storage subsystem shut down, causing a complete outage in the customer's environment. Several TB of data were on this system. The storage subsystem was running around 50% of the customer's supply chain management applications, and the service disruption caused several million dollars in losses.

Lesson Learnt:

Don't assume that parts replacement procedures will always go according to plan. Small, trivial hiccups can cause havoc. These issues do not happen every day, but they do tend to repeat at times. There are times when even a battery replacement can cause an entire subsystem to fail.

Schedule any power-related maintenance and replacements on weekend nights and during change control windows only, when the business can take the risk of an outage. You never know when a bad day might turn really ugly.

Power-related maintenance includes replacing power supplies, replacing fans, replacing batteries, changing sensors, messing with one power leg, etc.

By the way, having a UPS will not help in these situations.

True IT – Storage Stories: 3 (SRM Tools)

October 4th, 2009

The necessity for SRM (Storage Resource Management) tools is growing by the day; these tools can provide a complete picture of the entire storage environment. Customers typically use SRM tools for key storage management functions, which include analyzing storage environments, making configuration changes, reporting on storage, collecting performance data, alerting on exceptions, etc. SRM tools can also give a granular view of the storage subsystems and their relationships to hosts, fabric, disks, and file systems, along with consumption and utilization.

A very large customer in the US decided to deploy Storage Resource Management tools for their storage infrastructure, which consisted of 15 sites globally, several PB of storage, various makes and models of storage arrays, and segregated storage management teams. After overcoming several technological and organizational challenges, they managed to deploy an SRM tool that would give them a complete picture of the storage environment.

The deployment cost 30 million dollars in CAPEX, which included the SRM tool, licenses, OS licenses, hardware, agent deployment, testing, training, virtualization, etc., and took 24 months. By the time they were up and running, the implementation was 50% over budget and 12 months behind schedule.

Challenges remain today, though, around patch management on the SRM tool, managing 15 sites globally, OS upgrades, SRM tool upgrades, array firmware upgrades and compatibility, SRM management, periodic SRM cleanups, support for non-SMI-S arrays, support for other vendors' arrays, and the accuracy of reports with virtualization, clustering, and thin provisioning.

Lesson Learnt:

With any SRM tool deployment, set goals, targets, expectations, and requirements, and define your organizational needs. With today's SRM tools, it may be unrealistic to achieve 100% of your requirements.

See if trial versions are available from vendors, and deploy them for 3 to 6 months to see if they meet your expectations and organizational needs.

Check the compatibility of the SRM tool across the wide variety of storage platforms deployed in the organization. Review security features around access control, login rights, Active Directory integration, etc.

Set budget caps for the implementation, set target completion dates, and be aggressive about meeting them.

Obtain TCO models for the SRM tool deployment, which may include the CAPEX for the SRM purchase, deployment, testing, day-to-day management, software support costs, upfront and ongoing training, hardware for deployment, infrastructure changes, etc.
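As a rough illustration of what a TCO model and a budget cap check could look like, here is a minimal sketch. The cost items mirror the categories listed above, but every figure is a made-up placeholder in arbitrary units, and the names `CostItem`, `total_cost_of_ownership` and `overrun_pct` are hypothetical helpers, not part of any SRM product.

```python
from dataclasses import dataclass

@dataclass
class CostItem:
    name: str
    upfront: float   # one-time cost (CAPEX)
    per_year: float  # recurring cost (OPEX)

def total_cost_of_ownership(items, years):
    """Sum one-time costs plus recurring costs over the given number of years."""
    return sum(item.upfront + item.per_year * years for item in items)

def overrun_pct(actual, budget_cap):
    """Percentage by which actual spend exceeds the budget cap (negative if under)."""
    return (actual - budget_cap) / budget_cap * 100

# Placeholder figures only; replace with real vendor quotes and internal estimates.
items = [
    CostItem("SRM purchase and licenses", upfront=100.0, per_year=20.0),
    CostItem("Hardware for deployment", upfront=50.0, per_year=5.0),
    CostItem("Deployment and testing", upfront=40.0, per_year=0.0),
    CostItem("Training", upfront=10.0, per_year=2.0),
    CostItem("Day-to-day management", upfront=0.0, per_year=30.0),
]

tco_3_years = total_cost_of_ownership(items, years=3)
print(f"3-year TCO: {tco_3_years}")
print(f"Overrun against a cap of 350: {overrun_pct(tco_3_years, 350.0):.1f}%")
```

Tracking the overrun figure during the rollout, rather than only at the end, is what makes a budget cap enforceable.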

True IT – Storage Stories: 2 (Fairy tale)

September 24th, 2009

Here comes the fairy tale!!!

Once upon a time, there was a great company and they had great products and they were very successful. They had 1 PB of total storage and they were consistently growing.

The business this company was in kept changing; they didn't. Bad days were ahead.

They kept on purchasing new storage every quarter, until the CIO thought, let's go back and see what we have purchased and what we are using. He brought in external consultants to perform an analysis of their entire storage environment. To the surprise of the CIO and the storage teams, they found that the average utilization across all their storage arrays was about 28%.

They had 70% Tier 1 storage, 20% Tier 2 storage and 10% Tier 3 storage. Average utilization on Tier 1 and Tier 2 was 20%, while on Tier 3 it was 100%. The consultants showed them how they could re-tier storage, increase efficiency and utilization, and further reduce the storage footprint in their IT environment through consolidation. They demonstrated total CAPEX/OPEX savings of 10 million USD over 3 years, based on where the customer's storage stood at the time. The CIO budgeted for this work and planned the kickoff in 3 months.
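The 28% figure follows directly from the tier mix: it is the capacity-weighted average of the per-tier utilization. A small sketch of that arithmetic, using the percentages from the story and assuming the 1 PB total quoted earlier still applied at the time of the analysis (the variable and function names are illustrative only):

```python
# Capacity share and average utilization per tier, as quoted in the story.
tiers = {
    "Tier 1": {"share": 0.70, "utilization": 0.20},
    "Tier 2": {"share": 0.20, "utilization": 0.20},
    "Tier 3": {"share": 0.10, "utilization": 1.00},
}

def overall_utilization(tiers):
    """Capacity-weighted average utilization across all tiers."""
    return sum(t["share"] * t["utilization"] for t in tiers.values())

TOTAL_PB = 1.0  # assumption: the 1 PB total mentioned at the start of the story

overall = overall_utilization(tiers)
used_pb = TOTAL_PB * overall
idle_pb = TOTAL_PB - used_pb

print(f"Overall utilization: {overall:.0%}")
print(f"Used: {used_pb:.2f} PB, idle: {idle_pb:.2f} PB")
```

Under that assumption, roughly 0.72 PB of purchased capacity was sitting idle, almost all of it in the two most expensive tiers.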

Before the project kicks off, the CIO resigns and leaves. Due to troubled times, the company files for Chapter 11 bankruptcy. A new CIO takes over. The project is now stalled; the new CIO decides that storage consolidation and optimization (10M dollars) is not one of their priorities and moves on to other projects while the company is still in bankruptcy.

Lesson Learnt:

It's never too late, but never wait too long to realize your savings. Put a plan together today on how the organization can optimize and increase the utilization and efficiency of its existing storage environments before it's too late. Every penny saved goes towards the bottom line.

This company just got sold for pennies on the dollar. By the way, this is an absolutely true story.

Who is the winner here, the CIO, the company, the shareholders, the storage admins, the storage architects, the consultants or someone’s ego?

True IT – Storage Stories: 1 (Dataloss)

September 22nd, 2009

About 4 years ago, I got a call early in the morning from a customer where we were doing some data migration work. The customer had decided to put the project on hold until they sorted out some issues in their storage environment.

Later in the day, we had another call with the same customer, and they passed on some very terrifying news. It seems one of the storage arrays we were looking to migrate data from was managed by an independent service provider. This independent service provider had not bothered to set up email home on the system, which means that in the case of a catastrophic failure the system would not notify anyone. The customer started reporting problems to the independent service provider: they were losing access to certain volumes in the host environment and started to see data corruption, and within a few hours the entire disk array was completely gonzo.

This was beyond the capabilities of the independent service provider to fix, and they escalated the call to the OEM to get it resolved. The OEM engineering folks and onsite teams worked round the clock for 4 days trying to recover data from the machine. Because the array never called home or emailed home about a failed component, the data started getting corrupted, which brought the entire system to its knees. THE DATA WAS GONE!!!

The customer lost 60 TB of raw storage in a few hours.

Now the question shouldn't be whether the data can be recovered from backup tapes and other media, which the customer was able to do over the next 3 weeks.

The primary question is why did it happen? And what can be done to prevent a catastrophic failure similar to this?

Lesson Learnt:

If you are in charge of managing any data in your organization today that is associated with storage arrays, open a call with the OEM or your independent service provider on a monthly basis. Have them check every storage array in the environment and verify that it is calling home or emailing home on a regular basis. If a modem is attached to the system, verify all components are working; if you have a TCP/IP/SSL-based connection, verify it is working; if you have email home features, verify the emails are not getting queued on your Exchange server and that your Exchange IP address has not changed.

These call home, email home, and TCP/IP/SSL features allow the storage arrays to regularly communicate back to the OEM or your independent service provider with errors, warnings, events, and heartbeats.
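The monthly verification can be backed by a simple staleness check on your side: record when each array last called home and flag any array that has been silent for too long or has never reported at all. The sketch below assumes you can collect a last-contact timestamp per array from the OEM, the service provider, or your mail logs; the array names, timestamps, and the 24-hour threshold are all hypothetical.

```python
from datetime import datetime, timedelta

def find_silent_arrays(last_contact, now, max_silence=timedelta(hours=24)):
    """Return the arrays whose last call home is older than max_silence,
    or that have never called home (timestamp is None)."""
    silent = []
    for name, seen in last_contact.items():
        if seen is None or now - seen > max_silence:
            silent.append(name)
    return sorted(silent)

# Hypothetical inventory: array name -> time of last heartbeat / call home.
now = datetime(2009, 9, 22, 8, 0)
last_contact = {
    "array-a": datetime(2009, 9, 22, 6, 30),  # reported 90 minutes ago
    "array-b": datetime(2009, 9, 18, 23, 0),  # silent for several days
    "array-c": None,                          # call home never configured
}

silent = find_silent_arrays(last_contact, now)
print("Arrays needing escalation:", silent)
```

An array like `array-c`, which was never set up to call home, is exactly the case in this story, so treating "never seen" as a failure matters as much as catching stale heartbeats.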

If you are using any SRM tools, please regularly check for alerts. If you are receiving any failed communication alerts, please escalate the situation immediately rather than waiting. Also verify you are not consistently seeing the same failed components in the array through the SRM tools.