Justin Langford Principal Consultant
The business continuity problem Cold/ Warm/ Hot Standby Solutions Failover Clustering Database Mirroring Combining Availability Technologies
Fundamentals No failover, potential data loss
Backup and Restore Third party solutions: storage snapshots, VDI Detach/ Attach
Distributed Data Manual failover, possible data loss
Log Shipping Peer-to-peer replication Database Mirroring (High Performance mode) Data Mirroring (third party hardware/ software mirroring
High Availability Automatic failover, no data loss
Database Mirroring (High Availability mode) Failover clustering Multi-site (Geo) clusters
Operator
Majority of downtime attributed to System Admins Application Software can improve fault tolerance System Software can help at all levels
Application Software System Software Hardware
Commodity hardware now very reliable
Evaluating business continuity solutions
Automatic or Manual Detection Automatic or Manual Failover Failover time Number of failures survived Data Currency / Loss Data Consistency Granularity: Instance, Database, Table, Page, Row Cost - hardware, network, additional management Complexity Recovery Point Interval/ Recovery Time Interval
we need zero downtime, zero data loss
Transparency to clients
The business continuity problem Cold/ Warm/ Hot Standby Solutions Failover Clustering Database Mirroring Combining Availability Technologies
Backup / Restore
Backup/ Restore & Detach/ Attach
Manual detection and failover Potential for data loss Clients must be redirected Slowest failover most downtime Backup / Restore Log backups allow point in time restore
Detach / Copy / Attach
Detach / Copy / Attach Copies entire files No rolling forward logs
Replication
Replication and Log Shipping
Manual detection and failover Replication since SQL Server 6.0
Primarily used where availability is required in conjunction with scale out of read activity Failover requires custom solution Could define subset of source database Latency between source and copy can be seconds
Log Shipping Log Shipping
Basic idea: Backup, Copy, Restore Log will always be supported Database scope Database accessible but read-only Users must exit for next log to be applied
Peer-to-Peer Transactional Replication
Peer-to-Peer Transactional Replication
SQL Server 2005 introduced bi-directional transactional replication
Enables scale-out and improved availability Warm / hot standby
Possibility of data loss on failure
SQL Server 2008 improvements
Conflict detection
Each row has a hidden column listing originating peer node ID for change Distribution Agent on each node detects conflicts by comparing hidden column
New Topology Wizard makes setup/ changes easier
Failover Cluster
Failover clustering and Database Mirroring
Both solutions provide
Automatic detection Automatic, fast failover Manual failover Transparent client redirection Zero work loss Site redundancy
Database Mirroring
The business continuity problem Cold/ Warm/ Hot Standby Solutions Failover Clustering Database Mirroring Combining Availability Technologies
Failover Cluster
* Inst1
SRV2 SRV1 Hot Standby Automatic detection and failover Multiple nodes provide availability Failover transparent to client Windows Server 2008 EE supports 16 node clusters Windows 2008 clusters must pass validation tests (no HCL) Supports many scenarios: Multiple Active Instances, N+1, N+I
Multiple Active Instances
N+1: N Active, 1 Inactive Instances
N+I: N Active, I Inactive Instances
* Inst1 Inst3 * Inst2 *
* Inst1 Inst2 *
Failover Cluster
* Inst1
Node Majority Each node has a vote Cluster functions with majority of votes
Node and Disk Majority Each node plus the disk witness can vote Cluster functions with majority of votes Node and File Share Majority Each node plus the file share witness can vote Cluster functions with majority of votes
Majority means more than half
No Majority: Disk Only Cluster functions if one node is available with quorum disk presented Nodes in communication with quorum disk are cluster members
Failover Cluster
* Inst1
Choosing a Quorum Model
Description of Cluster Odd number of nodes Even number of nodes (single site) Even nodes, multi-site cluster Quorum Recommendation Node Majority Node and Disk Majority Node and File Share Majority
Failover Cluster
* Inst1
Zero work loss, minimal impact on throughput Instance Failover entire instance works as a unit Single copy of databases Available since SQL Server 7.0 No standby reporting, testing etc. Single cluster can support multiple instances Can provide site fault-tolerance
Failover Cluster
New setup for SQL Server 2008
No remote execution on cluster nodes Run SQL Server setup to Install/ uninstall / upgrade nodes Option 1: Integrated installation with Add Node Configure single-node cluster instance Run SQL Server Setup with Add Node functionality to add each node Option2: Advanced/Enterprise installation Run Setup with Prepare Failover Cluster on each node Run Setup on node owning shared disks to Complete Failover Cluster
Site redundancy
SQL Server requires single-subnet Stretch VLAN between sites
Requires synchronous storage replication Performance dependent on network RTT
Configure Heartbeat timeout Increase to extend clusters over further distances Decrease to increase failure sensitivity for faster failover
Two Quorum configurations suitable for multi-site failover clusters
Node and File Share Majority
Site C
\\SERV1\CLUS1
File Share Witness
Site A
Site B
SAN
SAN
STORAGE REPLICATION
Node Majority
Site A
Site B
SAN
SAN
STORAGE REPLICATION
The business continuity problem Cold/ Warm/ Hot Standby Solutions Failover Clustering Database Mirroring Combining Availability Technologies
Database Mirroring
Hot Standby Provides a fault-tolerant database Database Failover
Very fast failover
<5 secs (most cases)
Zero data loss
No special server, storage or network requirements Automatic or manual failover
Automatic re-sync after failover
Automatic, transparent client redirect
Application
Witness Mirror
Commit
Mirror always redoing log to remain current
Principal
2
SQL Server
SQL Server
2
Log
>2
Data
3
Log
>3
Data
Trade off: Performance vs. Safety High Availability - synchronous mirroring with a witness
Automatic detection/failover No data loss
High Protection - synchronous mirroring without a witness
Manual failover No data loss, downtime possible
High Performance - asynchronous mirroring
Manual failover Data loss possible
Available in SQL Server 2008 Enterprise Edition Automatic Page Repair
Auto-resolve some errors that prevent reading a page Retrieve fresh copy from partner Unreadable page is replaced by the copy
Repairs are asynchronous
Query on principal accessing corrupt page will fail If mirror accesses corrupt page - mirroring session is suspended
Not all errors can be repaired
823, 824 errors - Windows returns a CRC error when reading page 829 errors restore pending
Pages that cannot be repaired include:
file header page database boot page allocation bitmap pages (GAM, SGAM, PFS)
New DMV sys.dm_db_mirroring_auto_page_repair Repair logged to SQL Server Error log
... Error: 824, Severity: 24, State: 2. ... SQL Server detected a logical consistency-based I/O error: incorrect checksum ... ... Database mirroring is attempting to repair physical page (4:4256) in database MyDB" by requesting a copy from the partner. ... Database mirroring successfully repaired physical page (4:4256) in database MyDB" by obtaining a copy from the partner.
Log compression
DBM performance affected by network performance Useful low-bandwidth scenarios
Outgoing log stream - principal to mirror compressed
Requires CPU to compress/ decompress Can be disabled
High-bandwidth network performance
Heavy duty, fully-logged maintenance operations when mirroring is involved (e.g. index rebuild)
Log Compression Transactions/ second
Greatest improvement for low bandwidth networks
CPU Usage with Log Compression
CPU usage sometimes x2 Compression overhead Increased transaction throughput
The business continuity problem Cold/ Warm/ Hot Standby Solutions Failover Clustering Database Mirroring Combining Availability Technologies
Replication
Maximize availability for
Scale out
Offload primary data platform
Heavy reporting Mobile/disconnected users Autonomous business units that share data
Failover Solutions
Maximize availability of critical systems
Designed for failover
Fast, automatic
Zero data loss Transactionally current Masks planned and unplanned downtime
Failover Solution + Replication Fault-Tolerant Publisher and Distributor
Distributor Subscribers
Publisher
Principal Server can be a Failover Cluster
Failover to mirror will occur before failover within the cluster
Principal will assume role of Mirror
Mirror can be a Failover Cluster as well
Failover Cluster Failover Cluster
Principal
Mirror
Attribute Automatic failover Perceived downtime Potential data loss
Failover Cluster Yes 30 secs + recovery Single copy of data
Log Shipping No n/a Yes, most recent transaction log
Database Mirroring Yes, in High Safety < 5 secs Yes, in some configurations
Masking storage failure
Special hardware Scope
No
Must pass validation System and user databases
Yes
No User databases
Yes
No User databases