
IBM Systems Storage

Implementation of a stretched cluster with SAN Volume Controller

Matthew Robinson – IBM Australia (matrobin@au1.ibm.com)

© 2012 IBM Corporation

 Comprehensive education, training and service offerings
 Expert instructors and consultants, world-class content and skills
 Multiple delivery options for training and services
 Conferences explore emerging trends and product strategies
 Lab-based services assisting in high-tech solutions

Special Programs:
 IBM Systems ‘Guaranteed to Run’ Classes – make your education plans for classes with confidence!
 Instructor-led online (ILO) training – the classroom comes to you
 Customized, private training

www.ibm.com/training
Agenda

 SVC Overview
 SVC Split Cluster Overview
– Volume Mirroring
– Voting Set
– Quorum Disks
 Split Cluster Scenarios
 Performance Optimization
– Data Paths
– Fast Write Technologies

SVC 2145-CG8 Storage Engine
 Based on IBM System x3550 M3 server (1U)
– Intel® Xeon® 5600 (Westmere) 2.53 GHz quad-core processor
 24GB of cache
– Up to 192GB of cache per SVC cluster
 Four 8Gbps FC ports (support Short-Wave & Long-Wave SFPs)
– Up to 32 FC ports per SVC cluster
● For external storage
● And/or for server attachment
● And/or Remote Copy/Mirroring
 Two 1 Gbps iSCSI ports
– Up to 16 GbE ports per SVC cluster
 Optional 1 to 4 Solid State Drives, or dual 10 Gbps iSCSI ports
– Up to 32 SSDs per SVC cluster (supported only with SVC code 5.1 & 6.2, not with 6.1)
– Up to 16 10GbE ports per SVC cluster
 New engines may be intermixed in pairs with other engines in SVC clusters
– Mixing engine types in a cluster results in Volume throughput characteristics of the engine type in that I/O group
 Cluster non-disruptive upgrade capability may be used to replace older engines with new CG8 engines
 Replaces the SVC 2145-CF8 engine
– Available with Entry Edition License, i.e. based on number of physical drives
SAN Volume Controller Cluster Architecture
Node
 A controller (1 rack unit)
– 2 quad-core Intel Xeon processors
– 24GB of cache
– 4 Fibre Channel ports (8Gbps)
– Up to 4 SSDs
 Cache protected by dedicated Uninterruptible Power Supply (1 rack unit)

I/O Group
 Node-pair for each I/O group
 48GB of cache per I/O Group
 Active/Active fail-over & fail-back

System
 1 to 4 Node-pairs or I/O Groups
 Up to 192 GB of cache
 Up to 32 FC ports (8Gbps)
 Up to 16 GbE ports
 Up to 16 10GbE ports
 Up to 32 SSDs
 Managed using the embedded Graphical User Interface (optional external Master Console)
 All nodes interconnected in the cluster via the SAN

[Diagram: Nodes 1–8 paired into I/O Groups 0–3] Note: rack and management console (optional)
Virtualisation Concepts & Limits

Hosts
• Up to 1,024 Hosts per cluster
• Up to 2,048 FC ports per cluster & Host
• Up to 256 iSCSI ports per cluster & Host
• Single multipath driver (SDD, MPIO)
• Attachment via iSCSI (LAN) or FCP (SAN)

Volumes & Host Mapping
• Up to 8,192 Volumes per cluster
• Max. Volume size of 256TiB
• Image Mode = Native Mode
• Managed Mode = Virtualized Mode

Storage Pools
• Up to 128 storage pools per cluster
• Up to 128 MDisks per Pool

External Managed Storage (SSD, SAS/FC, NL-SAS/SATA)
• 1 LUN → 1 Managed Disk (MDisk)
• Up to 4,096 MDisks per cluster
• Max. external MDisk size of 1PiB*
• Up to 32PiB of managed storage

[Diagram: hosts attached to the SVC over LAN (iSCSI) and SAN (FCP); Volumes are served from Storage Pools built on MDisks from Managed Storage 1 and Managed Storage 2]

(*) Check the SVC interoperability website to obtain the list of Disk Subsystems supported with greater than 2TB MDisks: http://www-03.ibm.com/systems/storage/disk/storwize_v7000/interop.html
SAN Volume Controller Features
 FCP, iSCSI & FCoE Block Access Protocols
 Cache partitioning
 Easy to use GUI
– Real Time Performance Statistics
 Embedded SMI-S agent
 E-mail, SNMP trap & Syslog error event logging
 Authentication service for Single Sign-On & LDAP
 Virtualise data without data-loss
– External storage virtualization (optional)
 Thin-provisioned & Compressed Volumes
– Reclaim zero-write space
– Thick to thin, thin to thick & thin to thin migration
 Expand or shrink Volumes on-line
 On-line Volume Migration between Storage Pools & IO groups
 Volume Mirroring (Volume copy 0 / Volume copy 1)
 EasyTier: Automatic relocation of hot and cold extents
– Automatic relocation of hot-spots between SSDs and HDDs for optimized performance and throughput
 Snapshot-FlashCopy (Point-In-Time copy)
– Up to 256 targets per source
– Full (with background copy = clone)
– Partial (no background copy)
– Space Efficient
– Incremental
– Cascaded (e.g. Vol0 → Vol1 → Vol2 via Map 1 and Map 2)
– Consistency Groups
– Reverse FlashCopy
 Microsoft Virtual Disk Service & Volume Shadow Copy Services hardware provider
 Remote Copy (optional)
– Synchronous & asynchronous remote replication with Volume Consistency groups (MM or GM relationships, e.g. from Storwize V7000 sources to a consolidated DR site)
● Optional Cycling Mode using snapshots
 VMware
– Storage Replication Adaptor for Site Recovery Manager
– VAAI support & vCenter Server management plug-in

SVC SPLIT CLUSTER OVERVIEW

SVC Split Cluster

 SVC Split Cluster
– Also called Split I/O Group or Stretch Cluster
– Nodes spread across two different sites/failure domains
 Not possible with V7000

[Diagram: I/O Group 0 with one node in Site 1 and one node in Site 2 = Split Cluster; I/O Group 0 entirely in Site 1 and I/O Group 1 entirely in Site 2 does not = Split Cluster]


SVC Split Cluster
[Diagram: SVC split cluster spanning Site A and Site B, with the quorum disk at Site C]
Volume Mirroring

 SVC stores two copies of a Volume
– It maintains both copies in sync, reads the primary copy and writes to both copies
 If the disk supporting one copy fails, SVC provides continuous data access by using the other copy
– Copies are automatically resynchronized after repair
 Intended to protect critical data against failure of a disk system or disk array
– A local high availability function, not a disaster recovery function
 Copies can be split
– Either copy can continue as the production copy
 Either or both copies may be thin-provisioned or compressed
– Can be used to convert from one volume type to another
● Thin to thick or compressed
● Thick to thin or compressed
● Compressed to thin or thick
 Mirrored Volumes use twice the physical capacity of un-mirrored Volumes
– Base virtualisation licensed capacity must include the required physical capacity
 The user can configure the timeout for each mirrored volume

[Diagram: a read (R) is served from copy 0 (primary); a write (W) goes to both copy 0 and copy 1]
Volume Mirroring Time Out Settings

 Two settings for Volume Mirroring timeouts for write operations:


– Redundancy
● Wait for the write operation to complete to both copies before releasing data from cache
– Latency
● If only one copy has returned a successful acknowledgement after a period of time, release the data from cache and allow the volume copies to be out of sync for a time (see the sketch after the table below)
 Ensure the correct policy is set for each volume
– Set on a per volume basis
– Default is latency for backwards compatibility

Failure occurs before data is freed from cache:
– Redundancy: write data is still in the cache of the other node; Volume remains online.
– Latency: write data is still in the cache of the other node; Volume remains online.
Failure occurs after data is released from cache:
– Redundancy: both volume copies are in sync; Volume stays online using the online storage device.
– Latency: volume copies may not be in sync; Volume may go offline.
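A minimal Python sketch of the two policies described above (illustrative only, not SVC code; the helper names and the 5-second timeout are assumed placeholders, not the product's internal values):

from dataclasses import dataclass

@dataclass
class MirroredWrite:
    acks: set          # which copies ("copy0"/"copy1") have acknowledged the destage
    age_ms: float      # time since the write was destaged

def can_free_from_cache(write: MirroredWrite, policy: str,
                        timeout_ms: float = 5000.0) -> bool:
    if len(write.acks) == 2:                 # both copies written: always safe to free
        return True
    if policy == "redundancy":
        return False                         # wait for the second copy
    # "latency" policy: release after the timeout with only one copy written,
    # leaving the copies temporarily out of sync
    return len(write.acks) == 1 and write.age_ms >= timeout_ms

w = MirroredWrite(acks={"copy0"}, age_ms=6000.0)
print(can_free_from_cache(w, "redundancy"))  # False - copies kept in sync
print(can_free_from_cache(w, "latency"))     # True  - copies allowed out of sync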
Voting Set / Quorum

 Common idea in many clustered systems and solutions


 A group of nodes/resources that can decide which part of a cluster remains
online after a failure
– Used to avoid a split brain scenario
 Each node in the SVC cluster has one vote in the voting set / quorum decision
 SVC quorum decisions:
– The group of nodes that can communicate with more than half of the cluster voting set remains online (has quorum)
– If a node can see exactly half of the voting set, it will go into service state (not have quorum) unless its group of nodes contains the configuration node
– If a node can see less than half of the voting set, it will go into service state (lose quorum)

Quorum Disks

 The synchronization status for mirrored volumes is recorded on the quorum disk.
Volumes can be taken offline if no quorum disk is available
 SVC creates three Quorum disk candidates on the first three managed MDisks
 One Quorum disk is active; the other two disks remain as candidates
 SVC 6.2.0 and later:
– SVC verifies that the Quorum disk candidates are placed on different storage systems.
– SVC handles Quorum disk management in a very flexible way, but in a Split I/O Group configuration a well-defined setup is required.
– Disable the dynamic quorum feature using the “override” flag
● svctask chquorum -MDisk <mdisk_id or name> -override yes
● This flag is currently not configurable in the GUI
 The Quorum Disk has a ½ vote in the voting set (see the sketch below):
– This means that if a set of nodes contains half the SVC nodes and they can see the quorum disk, they will remain up and running
– Conversely, if they can see half the nodes but not the quorum disk, they will go into service state
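A minimal Python sketch of the voting arithmetic on this and the previous slide (illustrative only, not SVC code). It counts votes for one partition of the cluster; it does not model the race to reserve the quorum disk that breaks the tie when both halves can still reach it (per the failure-scenarios slide, the half holding the configuration node wins that case):

def partition_keeps_quorum(visible_nodes: int, total_nodes: int,
                           sees_quorum_disk: bool,
                           has_config_node: bool) -> bool:
    total_votes = total_nodes + 0.5                       # 1 vote per node + 1/2 for the quorum disk
    votes = visible_nodes + (0.5 if sees_quorum_disk else 0.0)

    if votes > total_votes / 2:
        return True                                       # strict majority stays online
    if visible_nodes * 2 == total_nodes and not sees_quorum_disk:
        return has_config_node                            # half the nodes, no quorum disk: config node decides
    return False                                          # otherwise go into service state

# 2+2 node split cluster: the half that can still see the quorum disk stays up
print(partition_keeps_quorum(2, 4, sees_quorum_disk=True,  has_config_node=False))  # True
print(partition_keeps_quorum(2, 4, sees_quorum_disk=False, has_config_node=False))  # False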

Failure Scenarios

Columns: Site A, SVC node 1 (Config node) | Site B, SVC node 2 | Site C, Quorum | Cluster status

– Operational | Operational | Operational | Operational, optimal
– Failed | Operational | Operational | Operational, write cache disabled
– Operational | Failed | Operational | Operational, write cache disabled
– Operational | Operational | Failed | Operational, write cache enabled, different active Quorum disk
– Operational, link to Site B failed | Operational, link to Site A failed | Operational | Operational; Site A with the Config node continues operation and takes over the load from Site B; Site B stopped
– Operational | Failed at the same time as Site C | Failed at the same time as Site B | Stopped
– Failed at the same time as Site C | Operational | Failed at the same time as Site A | Stopped

Provides high availability against single site or inter-site link failures


SVC Failure Domain

 Commonly referred to as sites


– Not always separate buildings/campuses
 Any physical location where a localised outage can affect local infrastructure but not infrastructure on other sites (e.g. power sources, buildings, campuses)
– Ensure failure domains are adequately separated to protect against all disasters that the client wants protection from
 All candidate quorum disks need to be located in different failure domains


SPLIT CLUSTER SCENARIOS

Split Cluster without ISLs

 The original Split Cluster solution:


– Supported from SVC 5.1 and up
– FCIP to Quorum disk is supported but only if fabrics are not merged
– Can support up to 80ms of round trip latency
● Not many applications will support these latencies
 With SVC 6.2 and earlier:
– Two ports on each SVC node needed to be directly connected to the “remote” switch
– No ISLs between SVC nodes
– ISLs with max. 1 hop are supported for Server traffic and Quorum disk attachment
 SVC 6.2 (late) update:
– Distance extension up to 40 km with passive WDM devices
● Up to 20km at 4Gb/s or up to 40km at 2Gb/s (see table on next slide)
● LongWave SFPs are required for long distances
● LongWave SFPs must be supported by the switch vendor

Split Clusters without ISLs

 SVC 6.3:
– Similar to the support statement in SVC 6.2
– Additional: support for active WDM devices
– Quorum disk requirements similar to Remote Copy (MM/GM) requirements:
● Max. 80 ms round-trip delay time, 40 ms each direction
● FCIP connectivity supported
● No support for iSCSI storage systems
 Quorum disk must be listed as “Extended Quorum” in the SVC Supported
Hardware List
 Two ports on each SVC node needed to be connected to the “remote” switch
 Link speed over long distance must be reduced according to the distance

Min Distance   Max Distance   Max Link Speed
0 km           10 km          Up to 8 Gbps
> 10 km        20 km          Up to 4 Gbps
> 20 km        40 km          Up to 2 Gbps
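A small Python helper that simply reproduces the table above (an assumed, illustrative name, not an IBM tool):

def max_link_speed_gbps(distance_km: float) -> int:
    """Maximum supported link speed for the no-ISL configuration, per the table."""
    if distance_km <= 10:
        return 8
    if distance_km <= 20:
        return 4
    if distance_km <= 40:
        return 2
    raise ValueError("beyond 40 km a configuration with ISLs (SVC 6.3) is required")

print(max_link_speed_gbps(15))   # 4 Gbps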
SAN Buffer-to-Buffer credits

 Buffer to Buffer credits


– Used as a flow-control method by Fibre Channel technology; they represent the number of frames a port can store
● Enough credits for the link distance → provides the best performance
 Light must cover the distance twice
– Submit data from Node 1 to Node 2
– Submit acknowledge from Node 2 to Node 1
 The B2B calculation depends on link speed and distance
– The number of frames that can be in flight increases as the link speed or the distance increases
– Assumption:
● The acknowledgement is transmitted at the same instant that the last bit of the incoming frame is received
> (This is clearly not going to be the case)

Long Distance Configuration
 SVC Buffer to Buffer credits
– 2145–CF8 / CG8 have 41 B2B credits
● Enough for 10km at 8Gb/sec with 2 KB payload
– All earlier models:
● Use 1/2/4 Gbps fibre channel adapters
● Have 8 B2B credits, which is enough for 4km at 4Gb/sec
 Recommendation 1:
– Use CF8 / CG8 nodes for more than 4km distance for best performance
 SAN switches don’t auto-negotiate Buffer to Buffer credits; 8 B2B credits is the default for most SAN switches
 Recommendation 2:
– Change the Buffer to Buffer credits in your switch to 41 as well

Link Speed   FC Frame Length   B2B for 10 km   Distance with 8 B2B credits
1 Gbps       1 km              5               16 km
2 Gbps       0.5 km            10              8 km
4 Gbps       0.25 km           20              4 km
8 Gbps       0.125 km          40              2 km
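A rough Python sketch of the rule of thumb behind this table (an approximation for illustration, not an official sizing tool): a full-size FC frame is taken to occupy about 1/link-speed km of fibre, matching the "FC Frame Length" column, and each outstanding frame needs one credit.

import math

def b2b_credits_needed(distance_km: float, link_speed_gbps: float) -> int:
    """Credits needed to keep a link of the given length busy at full speed (per the table)."""
    frame_length_km = 1.0 / link_speed_gbps        # e.g. 1 km at 1 Gbps, 0.125 km at 8 Gbps
    return math.ceil(distance_km / (2 * frame_length_km))

def max_distance_km(credits: int, link_speed_gbps: float) -> float:
    """Longest link the given number of credits can keep at full throughput."""
    return credits * 2 / link_speed_gbps

print(b2b_credits_needed(10, 8))    # 40 credits, matching the table
print(max_distance_km(41, 8))       # ~10.25 km, hence "41 credits are enough for 10 km at 8 Gbps"
print(max_distance_km(8, 4))        # 4 km with the default 8 credits at 4 Gbps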
Without ISLs
[Diagram: without ISLs – fabrics SAN A and SAN B at Site A and Site B, with SVC node ports cabled directly between the sites and the quorum disk at Site C]
Using WDM Devices
[Diagram: as above, but the inter-site links between Site A and Site B pass through CWDM devices; the quorum disk remains at Site C]

With ISLs between sites

 Supported from SVC 6.3


 Extends supported distances to 300km (same as MetroMirror)
 Same quorum disks requirements as the without ISLs options
 Private SANs are needed for SVC node to node communication
– It is important to have dedicated bandwidth for node to node communication
● Not shared with any other traffic (including mirroring traffic)
● Any ISL technology that aggregates ISLs is not supported
 Public SANs are used for all other traffic:
– Host attachment
– Storage controller attachment
– SVC Metro and Global mirroring
– Quorum connectivity
 Link requirements are the same as MetroMirror
 WDM Devices supported for MetroMirror are supported for Split Clusters

With ISLs with separate switches
[Diagram: Site A and Site B each have separate private switches (PRIV A, PRIV B) and public switches (PUB A, PUB B); each fabric is extended between the sites with its own ISLs (highlighted in red in the original slide); the quorum disk is at Site C]


With ISLs with logical switches
[Diagram: as above, but each physical switch at Site A and Site B is partitioned into public (PUB) and private (PRIV) logical switches; ISLs (highlighted in red in the original slide) connect the corresponding logical switches between the sites; the quorum disk is at Site C]



PERFORMANCE OPTIMIZATION

Latency Considerations

 The same code is used for all inter-node communication


– Clustering
– Write Cache Mirroring
– Global Mirror & Metro Mirror
 SVC’s proprietary SCSI protocol only needs 1 round trip
– Only for cache mirroring
– Host and storage controller writes use 2 round trips by default
 SVC tolerates a round-trip delay of up to 80 ms between nodes
– 1 ms per 100 km of dark-fibre RTT → max. distance of 8,000 km (see the sketch below)
 In practice, applications are not designed to tolerate a write I/O latency of 80 ms
 Stretch Cluster splits the nodes in an I/O group across two sites
– Cache Mirroring traffic rather than Metro Mirror traffic is sent across the inter-site link
– Data is mirrored to back-end storage using Volume Mirroring
– Data is written by the 'preferred' node to both the local and remote storage
● The SCSI Write protocol results in 2 round trips
● This latency is hidden from the Application by the write cache
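A back-of-the-envelope Python sketch (assumed helper, not from the presentation's tooling) of the figures above, using 1 ms of RTT per 100 km of dark fibre and the per-scenario inter-site round-trip counts shown in the write and read walk-throughs on the following slides:

RTT_MS_PER_100KM = 1.0

def added_latency_ms(distance_km: float, inter_site_round_trips: int) -> float:
    """Application-visible latency added by the inter-site distance."""
    rtt_ms = distance_km / 100.0 * RTT_MS_PER_100KM
    return inter_site_round_trips * rtt_ms

# Max distance implied by the 80 ms node-to-node limit: 80 ms / (1 ms per 100 km)
print(80 / RTT_MS_PER_100KM * 100)       # 8,000 km

# 100 km split cluster: local write (1 round trip) vs. remote write (3 round trips)
print(added_latency_ms(100, 1))          # 1.0 ms added
print(added_latency_ms(100, 3))          # 3.0 ms added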

Metro Mirror
Application Latency = 1 long distance round trip

1. Write request from host
2. Xfer ready to host
3. Data transfer from host
4. Metro Mirror data transfer to remote site
5. Acknowledgement (steps 4 and 5 = 1 long-distance round trip)
6. Write completed to host
7. Write request from SVC (destage to back-end storage at both data centers)
8. Xfer ready to SVC
9. Data transfer from SVC
10. Write completed to SVC

Steps 1 to 6 affect application latency
Steps 7 to 10 do not affect the application

[Diagram: Server Cluster 1 and SVC cluster 1 in Data center 1; Server Cluster 2 and SVC cluster 2 in Data center 2; steps 7–10 occur at both sites (7a–10a and 7b–10b)]
Split Cluster writes
Local I/O: Application Latency = 1 round trip

1. Write request from host
2. Xfer ready to host
3. Data transfer from host
4. Cache Mirror data transfer to remote site
5. Acknowledgement (steps 4 and 5 = 1 round trip)
6. Write completed to host
7. Write request from SVC (destage to the storage at both data centers)
8. Xfer ready to SVC
9. Data transfer from SVC
10. Write completed to SVC (the destage to the remote copy costs 2 round trips, but the SVC write cache hides this latency from the host)

Steps 1 to 6 affect application latency
Steps 7 to 10 do not affect the application

[Diagram: the host in Server Cluster 1 writes to the local node of the SVC stretch cluster; Data center 1 and Data center 2]
Split Cluster for mobility

 Stretch Cluster is also often used to move workload between servers at different
sites
– VMotion or equivalent can be used to move Applications between servers
– Applications no longer necessarily issue I/O requests to the local SVC nodes
 SCSI Write commands from hosts to remote SVC nodes result in an additional 2 round trips’ worth of latency that is visible to the application

 Some switches and distance extenders use extra buffers and proprietary
protocols to eliminate one of the round trips worth of latency for SCSI Write
commands
– These devices are already supported for use with SVC
– No benefit or impact to the inter-node communication
– Does benefit Host to remote SVC I/Os
– Does benefit SVC to remote Storage Controller I/Os

Split Cluster writes
Remote I/O: Application Latency = 3 round trips

1. Write request from host (to an SVC node at the other site)
2. Xfer ready to host
3. Data transfer from host (steps 1–3 and 6 = 2 round trips across the inter-site link)
4. Cache Mirror data transfer to remote site
5. Acknowledgement (steps 4 and 5 = 1 round trip)
6. Write completed to host
7. Write request from SVC (destage to the storage at both data centers)
8. Xfer ready to SVC
9. Data transfer from SVC
10. Write completed to SVC (2 round trips for the remote destage, but the SVC write cache hides this latency from the host)

Steps 1 to 6 affect application latency
Steps 7 to 10 do not affect the application

[Diagram: the host issues its write across the inter-site link to an SVC node at the other site of the stretch cluster]
Split Cluster writes
Remote I/O: Application Latency = 2 round trips

1. Write request from host
2. Xfer ready to host
3. Data transfer from host
4. Write data + transfer to remote site (steps 4 and 11 = 1 round trip across the inter-site link)
5. Write request to SVC
6. Xfer ready from SVC
7. Data transfer to SVC
8. Cache Mirror data transfer to remote site
9. Acknowledgement (steps 8 and 9 = 1 round trip)
10. Write completed from SVC
11. Write completion to remote site
12. Write completed to host

13. Write request from SVC
14. Xfer ready to SVC
15. Data transfer from SVC
16. Write data + transfer to remote site (1 round trip, but the SVC write cache hides this latency from the host)
17. Write request to storage
18. Xfer ready from storage
19. Data transfer to storage
20. Write completed from storage
21. Write completion to remote site
22. Write completed to SVC

Steps 1 to 12 affect application latency
Steps 13 to 22 do not affect the application
Volume Mirroring Primary Copy

 When writes are performed to a volume with 2 copies, the writes are destaged to both copies
– Back-end latencies are added regardless of which SVC node performs the write to disk
 When reads are performed to a volume one of two things can happen
1. Read data is in cache and returned from cache
2. SVC node needs to collect the data
● This is done from the primary volume copy as marked in the SVC

 Try to ensure that the primary copy of the data is on the same site as the application (see the sketch below)
– When moving an application, switch the primary copy on the SVC
– Not always possible
● The same volume may be used by applications on both sites
● Features such as Storage vMotion may be helpful
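An illustrative Python sketch (assumed helper names, not SVC code) of how the read round-trip counts on the following slides come about: one inter-site round trip if the host talks to an SVC node at the other site, plus one more if a read miss must be fetched from a primary copy held at the other site.

def read_round_trips(host_site: str, node_site: str,
                     primary_copy_site: str, cache_hit: bool = False) -> int:
    trips = 0
    if host_site != node_site:
        trips += 1                       # host <-> remote SVC node
    if not cache_hit and primary_copy_site != node_site:
        trips += 1                       # SVC node <-> remote primary copy
    return trips

print(read_round_trips("A", "A", "A"))                  # 0 - everything local
print(read_round_trips("A", "A", "B"))                  # 1 - read miss from a remote primary copy
print(read_round_trips("B", "A", "B"))                  # 2 - remote host and remote primary copy
print(read_round_trips("B", "A", "B", cache_hit=True))  # 1 - a cache hit skips the copy read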

Split Cluster reads
Local I/O: Application Latency = 0 round trips

1. Read request from host
2. Read request from SVC (Primary copy)
3. Send data to SVC
4. Send data to host

Steps 2 and 3 are skipped if data is in SVC node cache

[Diagram: the host, the SVC node, and the primary volume copy are all in Data center 1]
Split Cluster reads
Remote I/O: Application Latency = 1 round trip

1. Read request from host
2. Read request from SVC (Primary copy)
3. Send data to SVC (steps 2 and 3 = 1 round trip across the inter-site link)
4. Send data to host

Steps 2 and 3 are skipped if data is in SVC node cache
Split Cluster read
Remote I/O: Application Latency = 2 round trips

1. Read request from host (steps 1 and 4 = 1 round trip across the inter-site link)
2. Read request from SVC (Primary copy)
3. Send data to SVC (steps 2 and 3 = 1 round trip)
4. Send data to host

Steps 2 and 3 are skipped if data is in SVC node cache
Split I/O Group – Disaster Recovery

 SVC Split Cluster is a distributed HA functionality


 Usage of Metro Mirror / Global Mirror is recommended for disaster protection
 Differences between MM/GM and SVC Split Cluster:

                     Split Cluster   Metro/Global Mirror
Failover             Automatic       Manual
Resynchronisation    Automatic       Manual
Consistency Group    N/A             Supported

 Implementing both Split Cluster and MM or GM:


– Both Split Cluster sites must be connected to the MM or GM infrastructure
– Without ISLs
● All ports can be used for mirroring
– With ISLs
● Only half the ports can be used for mirroring (Public SAN only)

Summary

 SVC Split I/O group:


– Is a very powerful solution for automatic and fast handling of storage failures
– Transparent for servers
– Perfect fit in virtualized environments
– Transparent for all OS-based clusters
– Distances up to 300 km (SVC 6.3) are supported

 Two possible scenarios:


– Without ISLs between SVC nodes (classic SVC Split I/O group),
● Up to 40 km distance with support for active (SVC 6.3) and passive (SVC 6.2) WDM
– With ISLs between SVC nodes (SVC 6.3):
● Up to 100 km distance for live data mobility (150 km with distance extenders)
● Up to 300 km for fail-over / recovery scenarios

 Long distance performance impact can be optimized by:


– Load distribution across both sites
– Correct selection of host paths and SVC primary volume copies
– Appropriate SAN Buffer to Buffer credits
