EMC Clariion Systems: Global Hot Spares & Proactive Hot Spares
The concept of Global Hot Spares has been supported in Clariion environments since the first generation of FC & CX platforms. Now the technology has been extended into the CX3 and then the CX4 platforms. The primary purpose of global hot sparing is to protect the system against disk drive failures.
Consider a CX4-960, which can scale up to 960TB of raw storage and hold as many as 960 disk drives. For any given per-drive failure rate, a larger number of drives means a higher probability that some drive in the system fails. Every storage manufacturer these days includes some sort of hot sparing technology in its storage subsystems. EMC started offering this technology to its customers as Global Hot Spares. Later, value-add offerings were introduced to handle failures proactively and minimize the chance of data loss. This brought in a technology termed Proactive Hot Spares, where a drive that is likely to fail is identified ahead of time and a global hot spare is invoked for it.
I believe Flare release 24 started offering Proactive Hot Spares. With this Flare release, customers can proactively initiate a hot spare against a suspect drive through Navisphere or Naviseccli.
Depending on the RAID type implemented, RAID Groups can withstand drive failures and run in a degraded state without data loss or data unavailability. With RAID 6, a RAID Group can sustain as many as 2 drive failures; with RAID 5, 1 drive failure; and with RAID 1/0 or RAID 1, 1 drive failure, all without data loss.
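As a rough illustration of the tolerances above, the following Python sketch classifies a RAID Group by how many of its drives have failed. The function name and the state labels are made up for this example and are not part of Flare or Navisphere; the numbers are simply the ones quoted in this article, with the RAID 1/0 figure read as the worst case.

```python
# Maximum drive failures a single RAID Group can absorb without data loss,
# using the figures quoted in the article (RAID 1/0 taken as worst case).
TOLERATED_FAILURES = {
    "RAID 6": 2,
    "RAID 5": 1,
    "RAID 1/0": 1,
    "RAID 1": 1,
}


def raid_group_state(raid_type: str, failed_drives: int) -> str:
    """Return 'healthy', 'degraded' or 'data loss' (illustrative labels)."""
    if failed_drives == 0:
        return "healthy"
    if failed_drives <= TOLERATED_FAILURES[raid_type]:
        return "degraded"
    return "data loss"


if __name__ == "__main__":
    for raid_type in TOLERATED_FAILURES:
        for failures in range(3):
            print(raid_type, failures, raid_group_state(raid_type, failures))
```

A hot spare matters most in the "degraded" state: the sooner the spare is rebuilt, the shorter the window in which one more failure would push the group into data loss.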
Drives supported on Clariion CX, CX3, CX4, AX and AX4 systems are typically FC (Fibre Channel), SATA II and ATA drives.
A Global Hot Spare has to be configured in an EMC Clariion system as a single RAID Group (with one drive). Once the RAID Group is created, a LUN should be bound on it as a Global Hot Spare before it can be activated.
The following is the sequence of steps that take place on a Clariion Subsystem related to Global Hot Spares (supported on CX, CX3, CX4 systems):
- Disk Drive failure: A disk drive fails in the system and the Flare code marks it bad.
- Hot spare invoked: A preconfigured Global Hot Spare is invoked based on the Global Hot Spare selection criteria.
- Rebuild: The Global Hot Spare is rebuilt from the surviving RAID Group members.
- Failed drive replaced: The failed disk drive is replaced with a good drive by a Customer Engineer.
- Copy Back: The rebuild of the Global Hot Spare has to finish before the copy back (equalize) to the new drive starts. The rebuild or equalize proceeds in sequential order of LBA (Logical Block Address), not LUN by LUN as bound on the drive.
- Return Hot Spare: Once the sync of the new drive is finished, the hot spare is invalidated (zeroed) and put back in the Global Hot Spare pool.
The following is the sequence of steps that take place on a Clariion Subsystem related to Proactive Hot Spares (Supported on CX300, CX500, CX700, CX3, CX4). Proactive Hot Spares essentially use the same drives that are configured as Global Hot Spares.
- Threshold of errors on Disk Drive: A drive accumulates errors, and once their number and type exceed a threshold, the Flare code marks the drive as a potential candidate for failure.
- Proactive Hot Spare invoked: Based on the potential candidate drive's type, size and bus location, a Global Hot Spare is identified and the data rebuild onto it is kicked off.
- Potential candidate fails: Once the Proactive Hot Spare is synced, the Flare code fails the identified potential candidate.
- Failed drive replacement: The failed drive is replaced by a Customer Engineer.
- Copy Back: The data is copied back from the proactive hot spare to the newly inserted drive. The rebuild or equalize proceeds in sequential order of LBA (Logical Block Address).
- Return Proactive Hot Spare: Once the sync of the new drive is finished, the hot spare is invalidated (zeroed) and put back in the Global Hot Spare pool.
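The two sequences above differ mainly in when the drive is failed relative to the hot spare being ready. The Python sketch below writes both flows down as ordered lists and checks that ordering. The `Step` names and the helper function are illustrative only; they paraphrase the steps in this article and do not correspond to any Flare event names.

```python
from enum import Enum
from typing import List


class Step(Enum):
    ERROR_THRESHOLD = "error threshold exceeded on drive"
    DRIVE_FAILED = "drive marked failed by Flare"
    SPARE_INVOKED = "hot spare invoked"
    SPARE_SYNCED = "hot spare rebuilt / synced"
    DRIVE_REPLACED = "drive replaced by Customer Engineer"
    COPY_BACK = "data copied back to new drive"
    SPARE_RETURNED = "hot spare zeroed and returned to pool"


# Global Hot Spare: the drive fails first, then the spare is rebuilt.
GLOBAL_FLOW: List[Step] = [
    Step.DRIVE_FAILED,
    Step.SPARE_INVOKED,
    Step.SPARE_SYNCED,
    Step.DRIVE_REPLACED,
    Step.COPY_BACK,
    Step.SPARE_RETURNED,
]

# Proactive Hot Spare: the spare is synced first, then the drive is failed.
PROACTIVE_FLOW: List[Step] = [
    Step.ERROR_THRESHOLD,
    Step.SPARE_INVOKED,
    Step.SPARE_SYNCED,
    Step.DRIVE_FAILED,
    Step.DRIVE_REPLACED,
    Step.COPY_BACK,
    Step.SPARE_RETURNED,
]


def spare_ready_before_failure(flow: List[Step]) -> bool:
    """True if the hot spare is synced before the drive is failed."""
    return flow.index(Step.SPARE_SYNCED) < flow.index(Step.DRIVE_FAILED)


if __name__ == "__main__":
    print("global:", spare_ready_before_failure(GLOBAL_FLOW))
    print("proactive:", spare_ready_before_failure(PROACTIVE_FLOW))
```

From the replacement of the drive onward, the two flows are identical; only the lead-up to the failure changes.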
The Global Hot Spares Selection Criteria:
The following criteria are applied when a Global Hot Spare is selected (invoked) because a potential proactive candidate has been identified or a disk drive has failed. They are applied in the sequence listed below: drive type first, drive size second and location of the Global Hot Spare third. Speed of the drive (RPM) is not a selection criterion.
- Type of Global Hot Spare Drive: As discussed above, Clariion Systems use three primary drive types. For FC and SATA II drives, either type can be invoked against a failure of the other. ATA drives can be invoked against an ATA drive failure.
- Size of Global Hot Spare: Upon a disk failure, the Flare code examines the size of the Global Hot Spare. The deciding factor is not the size of the failed drive but the total space of all LUNs bound on it.
- Location of Global Hot Spare: The location of the Global Hot Spare is the third criterion. If a Global Hot Spare that meets the two criteria above is on the same bus as the failed drive, it is the primary selection. If no such drive is on the same bus, the Global Hot Spare is selected from the other buses.
- RAID Types: With RAID 3 and RAID 5, data on the hot spare is rebuilt from the surviving data and parity. With RAID 6, data on the hot spare is rebuilt using the RP (Row Parity) and / or DP (Diagonal Parity), depending on the number of failures in the RAID Group. With RAID 1/0 and RAID 1, data on the hot spare is rebuilt from the surviving mirrors.
- Copy Times: The time required to copy or rebuild a hot spare depends on how large the drive is, the speed of the drive, the cache available on the drive, the cache available on the array, the type of the array, the RAID type and the current workload on the array. Typical rebuild times vary from 30 minutes to 90 minutes, again depending on how busy the storage subsystem is.
- Global Hot Spare types: For every 30 drives (2 DAEs of drives), consider having 1 drive as a Global Hot Spare. Also verify that for every drive type (size, speed) in the machine you have at least one configured Global Hot Spare. It is a good idea to have Global Hot Spares on different buses and spread across the Storage Processors.
- Vault Drives: Vault Drives cannot be used for Global Hot Spares. The Vault drives are considered as the first 5 drives [ 0_0_0, 0_0_1, 0_0_2, 0_0_3, 0_0_4 ] on the Clariion System. If a vault drive fails, a Global Hot Spare takes over its position.
- Rotational Speed: The rotational speed of the Global Hot Spare is not considered before invoking it. It might be a good idea to have Global Hot Spares that run at 15K RPM and, where possible, are large drives.
- Mixed Loop Speed: Certain Clariion Systems, like the CX3s, offer 4 Gb/s and / or 2 Gb/s loops, and you can have mixed loop speeds in your machine. The loop speed is not considered in hot spare selection, so in those cases it might be wise to have similar hot spares on both the 2 Gb/s and 4 Gb/s loops.
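The selection order described above (type, then size, then bus) and the 1-in-30 rule of thumb can be sketched in Python as follows. This is only a reading of the criteria in this article, not the actual Flare logic: the `HotSpare` class, the function names and the disk IDs are hypothetical, ATA is assumed to be covered only by ATA spares, and picking the first match in list order when several spares qualify is an arbitrary choice made for the example.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Optional

# Which spare types may cover which failed drive type, as read from the text.
COMPATIBLE = {
    "FC": {"FC", "SATA II"},
    "SATA II": {"FC", "SATA II"},
    "ATA": {"ATA"},  # assumption: ATA is only covered by ATA
}


@dataclass
class HotSpare:
    disk_id: str       # hypothetical bus_enclosure_slot label
    drive_type: str    # "FC", "SATA II" or "ATA"
    capacity_gb: int
    bus: int


def select_hot_spare(failed_type: str, bound_lun_gb: int, failed_bus: int,
                     spares: List[HotSpare]) -> Optional[HotSpare]:
    """Pick a spare by type, then size, then bus. RPM is ignored."""
    # Criteria 1 and 2: compatible type, and room for all LUNs bound
    # on the failed drive (not the raw size of the failed drive).
    candidates = [s for s in spares
                  if s.drive_type in COMPATIBLE[failed_type]
                  and s.capacity_gb >= bound_lun_gb]
    if not candidates:
        return None
    # Criterion 3: prefer a spare on the same bus, else use any other bus.
    same_bus = [s for s in candidates if s.bus == failed_bus]
    pool = same_bus or candidates
    return pool[0]  # tie-break by list order (arbitrary, for illustration)


def recommended_hot_spares(total_drives: int) -> int:
    """Rule of thumb from the article: 1 hot spare per 30 drives."""
    return ceil(total_drives / 30)


if __name__ == "__main__":
    spares = [
        HotSpare("1_0_14", "FC", 300, bus=1),
        HotSpare("0_1_14", "SATA II", 500, bus=0),
        HotSpare("0_2_14", "ATA", 500, bus=0),
    ]
    print(select_hot_spare("FC", 250, 0, spares))
    print(recommended_hot_spares(960))
```

In the usage example, an FC drive on bus 0 with 250GB of bound LUNs is covered by the SATA II spare on the same bus rather than the FC spare on bus 1, because location is checked once type and size are satisfied.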