True IT – Storage Stories: 8 (Double faults with drives)

October 27th, 2009

Storage arrays are busy beasts. They seem to grow continuously in number of drives, drive density, controller complexity, total storage capacity, total cache memory and so on. These days every vendor is pushing out new features along with extended storage capacity. More complexity in hardware and software at times leads to exceptional cases related to spares replacement.

A drive failed in a storage system at a customer site. Normal call procedures were followed: a CE was dispatched onsite, logs were verified, and a determination was made that the drive had failed in a RAID 5 set and the hot spare had kicked in. A new drive was ordered and arrived onsite within 45 minutes. The CE removed the defective drive and replaced it with the newly ordered drive (while the hot spare was still syncing), and life appeared to be back to normal.

Suddenly the customer lost access to the entire RAID set of the failed drive. The call was quickly escalated to level 2 support. The engineer determined that another drive had failed in the same RAID group while the hot spare was still synchronizing, causing the entire RAID set to fail (with data loss on the RAID set). It was recommended to replace this drive, let's call it drive 2. A new drive was ordered, the CE replaced it, and the customer started to prepare for a data restore from snapshots.

Someone at the customer site, though, didn't agree with the chain of events and requested a level 3 support escalation. Once the engineering guys looked at the logs, they quickly determined that the CE had pulled the wrong drive during the first replacement, which caused a double fault in the RAID set and the resulting data unavailability.

Level 3 support engineers asked to have drive (2) inserted back in the same slot and to wait for it to sync. Once the logs were verified, the new drive was inserted in the slot of the original failed drive (1). With these procedures the customer was able to get the RAID set functional again and mount the volumes without a data restore.

Lesson Learnt

Though this was obviously a genuine mistake, someone somewhere realized this wasn't right and asked to escalate the issue. Even the smartest and most experienced CEs tend to make mistakes under pressure.

Always recommend that your CE wait for the hot spare to finish synchronizing before any drive replacement; at least that gives an added buffer in case something goes wrong.

Do not degauss drives right away; in some cases you may need to insert them back into the storage system.

Schedule all maintenance work, including drive replacements, for either off-business hours or weekends.

Double faults are rare, but they happen. Always try every available option to recover your data.

CEs at times tend to go the extra mile to recover failed disk drives: banging them against the floor, dropping them onto the floor from a height of 2 feet, or even putting them in a deep freezer for 6 to 8 hours.

If your storage system supports RAID 6 (double parity) and there is no added penalty for your applications, you may want to try that option.
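
To make the double-fault reasoning in this story concrete, here is a minimal Python sketch. It is a toy model only, not any vendor's logic: the `RaidSet` class, its fields and the "safe to pull" rule are all hypothetical names I made up for illustration. It replays the incident (one failed drive, hot spare still rebuilding, a second drive pulled by mistake) against RAID 5 and RAID 6.

```python
from dataclasses import dataclass

# Number of member drives each RAID level can lose before the set goes offline.
FAULT_TOLERANCE = {"RAID5": 1, "RAID6": 2}


@dataclass
class RaidSet:
    """Toy model of a RAID set; every name here is hypothetical."""
    level: str
    missing: int = 0                 # members currently failed or pulled
    spare_rebuilding: bool = False   # hot spare still synchronizing

    def is_online(self) -> bool:
        # The set stays up while missing members do not exceed its tolerance.
        return self.missing <= FAULT_TOLERANCE[self.level]

    def safe_to_pull_drive(self) -> bool:
        # Assumed rule, taken from the lessons above: no drive swaps while
        # a hot spare is rebuilding or the set is already degraded.
        return not self.spare_rebuilding and self.missing == 0

    def lose_drive(self) -> None:
        self.missing += 1


def replay_incident(level):
    """Return (was the pull allowed by the rule?, is the set still online?)."""
    raid = RaidSet(level)
    raid.lose_drive()              # drive 1 fails
    raid.spare_rebuilding = True   # hot spare kicks in and starts syncing
    allowed = raid.safe_to_pull_drive()
    raid.lose_drive()              # the wrong (healthy) drive gets pulled
    return allowed, raid.is_online()


print(replay_incident("RAID5"))  # (False, False)
print(replay_incident("RAID6"))  # (False, True)
```

In both cases the rule says the pull should have waited. The difference is that with double parity the set survives the mistake, while with RAID 5 it goes offline.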

True IT – Storage Stories: 7 (Data Wipe on the wrong machine)

October 26th, 2009


Yes, you read that right: exactly as the title of this post says, a data wipe was performed on the wrong machine.

The CE got permission from the customer to perform a data wipe on a storage system. The hosts were retired and the storage was ready to be turned off, but as part of their procedures, customers typically require that all data on the drives be wiped.

The CE took this opportunity to remotely connect into the customer's storage system. He thought the process would take several hours to finish, by which time he would go onsite to physically power off the storage system. Though he thought he had logged into the machine he intended to, he had gotten through to another one. He started taking the ports down logically through soft commands and then, within 15 minutes, kicked off the data wipe.

An hour later, a SEV 1 ticket was opened at the customer site for major issues in the storage environment. He figured that, as luck would have it, while he was out taking care of the issue he would also check on the data wipe and physically power off the storage subsystem.

On his way to the customer site, he got a call from the Level 3 folks on the vendor support team about what they had just found on this storage system: it was busy doing a data wipe and there was no way to stop it.

The realization set in for the CE... that he had started a data wipe on the wrong storage system without performing the correct procedures.

Lesson Learnt

Set a corporate-wide policy on how storage and server teams can perform certain tasks onsite and remotely. Set similar procedures with vendor teams as well.

The CE ended up losing his job, and there was no way to recover the data on the storage subsystem except by bringing out the backup tapes.

One could have the best storage execution plan in place to manage the storage environment, but is there a way to avert these exceptional cases?
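
One small procedural guard that could be written into such a policy is to refuse any destructive operation unless the system you are actually connected to matches the one approved for the wipe. The Python sketch below is only an illustration under my own assumptions: the function name, the exception and the idea of comparing serial numbers are hypothetical and do not come from any vendor tool.

```python
class WipeRefused(Exception):
    """Raised when the pre-wipe identity check fails."""


def confirm_wipe_target(approved_serial, connected_serial, typed_confirmation):
    """Return True only if all three serial numbers agree.

    approved_serial    -- serial on the customer's signed wipe approval
    connected_serial   -- serial reported by the array in the current session
    typed_confirmation -- serial retyped by the engineer as a last check
    """
    def norm(serial):
        # Ignore case and surrounding whitespace when comparing.
        return serial.strip().upper()

    if norm(connected_serial) != norm(approved_serial):
        raise WipeRefused(
            "connected to %s, but the wipe is approved for %s"
            % (connected_serial.strip(), approved_serial.strip())
        )
    if norm(typed_confirmation) != norm(approved_serial):
        raise WipeRefused("typed confirmation does not match approved serial")
    return True


# Example with made-up serial numbers: the session is on the wrong array.
try:
    confirm_wipe_target("SN-1001", "SN-2002", "SN-1001")
except WipeRefused as err:
    print("Wipe blocked:", err)
```

A check like this does not replace a second pair of eyes, but it forces the question "which machine am I really on?" before the point of no return.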

Policy! Policy!! Policy!!!

October 20th, 2009

It has been an exciting month. Some new details are emerging related to automated storage tiering, workload distribution, workflow automation, SLAs, QoS and how policy-based storage management can help solve these challenges. “Policy”, as it is known in the business world, or “advanced algorithms”, as it is known in the scientific community, is used to solve complex storage challenges. This has been one of the favorite topics of discussion in the storage blogosphere these days.

There are two distinct groups of people, though: one favoring automation and the other possibly thinking this technology brings no added value to how storage is utilized and managed today. This game was started by Compellent (Compellent Data Progression technology) about 4 years ago, then joined by Pillar Data Systems, and now other OEMs (including EMC, HDS and IBM) are starting to catch up on policy-based automated storage tiering.

With private clouds in the near future and then hybrid clouds (a mesh of private and public clouds) on the horizon, automation, workload distribution, SLAs and QoS will need to be monitored and managed to run IT infrastructures optimally. Policy-based management will create a new wave of storage management and automation, and will act as a principal ingredient of hybrid clouds.

Generation 1 of policy-based storage tiering works within a single storage subsystem.
Generation 2 in the near future should work across heterogeneous storage subsystems (by the same manufacturer).
Generation 3 over the next year or two will work across storage platforms irrespective of the manufacturer.
Generation 3 of policy-based management will include the entire stack of management. These products will be capable of not only managing the storage, but also interacting through policies at the virtualization, networking, application, OS, middleware and other layers in the stack of infrastructure management.

We should see a rise of new emerging technologies that will create these external policy-based engines for data movement automation. All infrastructure components, including storage, virtualization, networking, application, OS and middleware, will provide the necessary APIs for these external engines to interact and enable data automation and workflow automation in the hybrid clouds (irrespective of the manufacturer).

www links

Here are a few articles from the past month related to the topics of policy, automated storage tiering, workloads, SLAs and QoS.

Pillar (OEM)


Compellent (Partner Blog)



Your thoughts always welcome!!!


Enhancements to EMC Symmetrix V-Max Systems coming!!

October 14th, 2009

Enhancements to the EMC Symmetrix V-Max system are possibly around the corner (FY09 Q4).

FAST (Fully Automated Storage Tiering) is due this quarter and will be one of the most awaited software releases in the enterprise storage space by EMC.

Bundled together with FAST, there will possibly be a new microcode version that enables FAST (and its associated features) and other expected enhancements.

Though this will be a major software release and functionality upgrade, I don’t think this would qualify as a 2nd generation EMC Symmetrix V-Max system.

But I fully expect EMC to release FAST v2 and V-Max G2 somewhere around mid-2010.

Here are a few new features to possibly expect on the EMC Symmetrix V-Max Systems this quarter.

1. Introduction of FAST v1, which should allow automated data movement within a single Symmetrix V-Max system. Here are some features of FAST as discussed on GestaltIT and by Barry Burke (TSA) on his blog.

2. FAST v1 data movement should possibly be policy-driven around factors like time (how old the data is), SLA (promised SLAs), tier (from Tier 0 to Tier 1 to Tier 2) and possibly I/O or IOPS.

3. FAST v1 should allow automated policy based data movement or prompt a user for manual intervention for data movement.

4. Do not expect FAST v1 to come for free; it will possibly be licensed based on the total number of TBs in the storage subsystem.

5. Expect some integration between the IONIX platform and FAST v1 and possibly some very tight integration with future releases of FAST and IONIX.

6. Expect FAST and IONIX to integrate very tightly with Atmos through APIs and policies. We should expect to see this with FAST v2 and not with FAST v1.

7. So when does EMC retire Symmetrix Optimizer? With FAST v1 probably not, with FAST v2 probably yes.

8. 2TB SATA II drives will be introduced (according to a keynote from Joe Tucci in NYC). Though Joe Tucci didn’t mention which platforms the 2TB SATA II drives will be available on, the V-Max upgrade seems the most logical platform.

9. The 2TB SATA II drive upgrade should take the V-Max to about 4.8 PB of total raw storage (2400 drives x 2TB), possibly the single largest storage subsystem at an enterprise level.

10. RapidIO speed upgrade from 2.5 Gbps to 4 Gbps (interconnects between the engines), delivered either through MBIE (new processors) and / or through microcode upgrades. Edit 10/15/2009 – 12:50 PM: I am not sure which RapidIO technology EMC currently uses, since Parallel RapidIO supports 250 MHz to 1 GHz clock speeds while Serial RapidIO supports 1.25 GHz to 3 GHz.

11. Drive connect speed upgrade from 4 Gbps to 8 Gbps

12. FC and FICON (Host Connects) port speeds upgrade from 4 Gbps to 8 Gbps

13. Interconnect between two separate Symmetrix V-Max systems (8 engines per system), expanding into possibly 16 or 32 (max) engines. The more I think about this concept, the more I feel that there are no added benefits to this architecture; rather, it will add more complexity to data management and higher latency. We may not see anything related to interconnects in this upgrade, but remember how the V-Max was initially marketed as having hundreds of engines and millions of IOPS; the only way to achieve that vision is through interconnects. The longer the distance, the more latency with cache and I/O. If interconnects end up making it into this release, the limit on the distance between two Symmetrix V-Max system bays would be around 100 feet.

14. To the point above, another way of possibly connecting these systems could merely be federation through external policy-based engines. Ed Saipetch and I have speculated on that concept on GestaltIT.

15. With the use of larger drives, possibly expect a cache upgrade. Currently the Symmetrix V-Max supports 1TB total cache (512GB usable), which may get upgraded to 2TB total cache (1024GB usable).

16. New possible microcode version 5875 that will help bring features like FAST, SATA II drives and additional cache into the Symmetrix V-Max.

17. Processors: the 4 x Quad Core Intel processors on V-Max engines may not get an upgrade in this release; that should possibly come with FAST v2 as a midlife enhancement next year.

18. Further enhancements related to FCoE support.

19. Upgrade of the iSCSI interface on Symmetrix V-Max engines from 1 Gbps to 10 Gbps (now available on the Clariion CX4 platforms).

20. I really do not expect this to happen, but imagine RapidIO interconnects changing to FCoE. I am really not sure what made EMC go with RapidIO instead of Infiniband 40 Gbps (which most of the storage industry folks think is dead) or FCoE for engine interconnects, but if the engineers at EMC chose RapidIO as the means to connect the V-Max engines, there has to be a reason behind it. Edit 10/15/2009 12:50 PM: Enginuity more or less doesn’t care about the underlying switching technology, so a switch from RapidIO to FCoE or Infiniband can be accomplished without a lot of pain. Though for customers already invested in RapidIO technology (with existing V-Max systems), changing the underlying fabric might require offline time, which in most cases is unacceptable.

21. Virtual Provisioning on Virtual LUNs, which is currently not supported with the existing generation of microcode on V-Max systems.

22. Atmos is currently running as a beta release and we should expect a market release this quarter. Should we expect to see an integration between V-Max and Atmos? I am not aware of any integration today.

23. A very interesting feature to have in the EMC Symmetrix V-Max would be system partitioning, where you can run half the V-Max engines at a certain microcode level with a certain set of features while the other half is treated as a completely separate system with its own identity (almost like a mainframe environment). Shouldn’t this be a feature of a modular storage array?

24. Symmetrix Management Console (SMC) and VMware integration (like VMware-aware Navisphere and Navisphere-aware VMware). There is already quite a bit of support related to VMware in SMC for provisioning and allocation.

25. Also, a much tighter integration between IONIX, FAST, SMC, Navisphere and Atmos may after all be the secret sauce that would enable workflow, dataflow and, importantly, automation. Do not expect this integration now, though; it is something to look forward to next year.
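
To picture what the policy-driven movement speculated in points 2 and 3 might look like, here is a small Python sketch. It is purely my own illustration and not how FAST is implemented: the `Volume` and `Policy` classes, the thresholds and the "move" / "prompt" / "stay" outcomes are all hypothetical. It decides a target tier from data age and IOPS, caps it by the SLA, and then either moves automatically or asks the user.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    tier: int               # 0 = fastest tier, 2 = slowest tier
    days_since_access: int  # how old (cold) the data is
    iops: int               # recent I/O load
    sla_max_tier: int       # slowest tier the promised SLA allows


@dataclass
class Policy:
    # All thresholds are made-up illustration values.
    demote_after_days: int = 30
    promote_above_iops: int = 5000
    lowest_tier: int = 2
    auto_move: bool = True  # False means prompt the user instead


def recommend_tier(vol, policy):
    target = vol.tier
    if vol.iops >= policy.promote_above_iops:
        target = max(0, vol.tier - 1)                    # busy: one tier up
    elif vol.days_since_access >= policy.demote_after_days:
        target = min(policy.lowest_tier, vol.tier + 1)   # cold: one tier down
    # The SLA caps how slow a tier the volume may land on.
    return min(target, vol.sla_max_tier)


def plan_move(vol, policy):
    target = recommend_tier(vol, policy)
    if target == vol.tier:
        return ("stay", target)
    return ("move" if policy.auto_move else "prompt", target)


cold = Volume("archive01", tier=0, days_since_access=90, iops=100, sla_max_tier=2)
print(plan_move(cold, Policy()))                 # ('move', 1)
print(plan_move(cold, Policy(auto_move=False)))  # ('prompt', 1)
```

The same decision function could in principle live inside the array or in an external engine; only where the metrics come from and who carries out the move would change.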


Though I am still a bit confused on where FAST will physically sit.

FAST v1 can merely be a feature integrated within the Microcode, configurable & driven through policy within the Symmetrix Management Console.

FAST v2 (sometime mid-2010) will support in-box and out-of-box (e.g. Symmetrix to Clariion to Celerra to Centera) data movement through a policy engine.

Ed Saipetch and I have speculated on GestaltIT on how that may work. After some thought, though, I do believe a policy engine can merely be a VM or a vApp sitting outside the physical storage system in the storage environment.

To promote sales of the EMC Symmetrix V-Max systems, Barry Burke mentions in his blog post that Open Replicator, Open Migrator and SRDF / DM (Data Mobility) are now available at no cost to customers purchasing a new EMC Symmetrix V-Max system. These are some of the incentives EMC is offering to further promote sales of its latest-generation Symmetrix technology.

It remains to be seen what path of success FAST will carve for Symmetrix V-Max systems.