M10 Cluster Component Architecture
This module provides an overview of cluster architecture and looks in more detail at the Cluster
service component architecture.
Prerequisites
Before starting this session, you should:
● Understand how to create and configure cluster resource types.
● Understand where the copies of the cluster registry are located and how they are kept in sync.
Introduction
Server clusters are designed as a separate, isolated set of components that work together with the
operating system. This design avoids introducing complex processing and scheduling dependencies
between the server cluster and the operating system. However, some changes in the base operating
system are required to enable cluster features. These changes include:
● Support for dynamic creation and deletion of network names and addresses.
● Modification of the file system to enable closing open files during disk drive dismounts.
● Modification of the Input/Output (I/O) subsystem to enable sharing disks and volume sets
among multiple nodes.
Apart from the above changes and other minor modifications, cluster capabilities are built on top
of the existing foundation of the Microsoft Windows Server 2003 operating system.
The core of server clusters is the Cluster service itself, which is composed of several functional
units. This section provides a detailed discussion of the following topics:
● Event processor
● Node manager
● Membership manager
Cluster Abstractions
The top tier provides the abstractions for nodes, resources, dependencies, and groups. The
important operations are resource management, which controls the local state of resources, and
failure management, which orchestrates responses to failure conditions.
Cluster Operation
The middle tier provides important distributed operations, such as membership and regroup
operations, and maintaining the cluster configuration. The shared registry allows the Cluster
service to see a globally consistent view of the cluster's current resource state. The cluster's registry
is updated with a global atomic update protocol and made persistent using transactional logging
techniques. The current cluster membership is recorded in the registry.
Component Functionality
● Event Processor: Provides intra-component event delivery service.
● Object Manager: A simple object management system for the object collections in the Cluster service.
● Node Manager: Controls the quorum Form and Join process, generates node failure notifications, and manages network and node objects.
● Membership Manager: Handles dynamic cluster membership changes.
● Global Update Manager: A distributed atomic update service for the volatile global cluster state variables.
● Database Manager: Implements the Cluster Configuration Database.
● Checkpoint Manager: Provides persistent storage of the current state of a cluster (its registry entries) on the quorum resource.
● Transaction Manager: Ensures that multiple related changes to the cluster database on a node are performed as an atomic local transaction. Provides primitives for beginning and ending a transaction.
● Log Manager: Provides structured logging to persistent storage and a lightweight transaction mechanism.
● Resource Manager: Controls the configuration and state of resources and resource dependency trees. It is responsible for monitoring active resources to see if they are still online.
● Failover Manager: Controls the placement of resource groups at cluster nodes. Responds to configuration changes and failure notifications by migrating resource groups.
● Communication Manager: Provides inter-node communication among cluster members.
● Event Log Replication Manager: Replicates event log entries from one node to all other nodes in the cluster.
● Backup/Restore Manager: Backs up or restores the quorum log file and all checkpoint files, with help from the Failover Manager and the Database Manager.
Event Processor
The Event Processor is the communications center of the Cluster service. It is responsible for
connecting events to applications and other components of the Cluster service. This includes:
● Application requests to open, close, or enumerate cluster objects.
● Delivery of signal events to cluster-aware applications and other components of the Cluster
service.
The Event Processor is also responsible for starting the Cluster service and bringing the node to the
“online” state. The Event Processor then calls the Node Manager to begin the process of joining or
forming a cluster.
Node Manager
The Node Manager keeps track of the status of other nodes in the cluster. There are two types of
nodes in the cluster at any time:
● Defined nodes are all possible nodes that can be cluster members.
● Active nodes are the current cluster members.
The Node Manager is notified of any node failures suspected by the cluster network driver
(ClusNet). This failure suspicion is generated by a succession of missed “heartbeat” messages
between the nodes. Five missed heartbeats generate a failure suspicion, causing the Node Manager
to initiate a regroup to confirm the cluster membership, as will be described shortly.
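The suspicion threshold can be pictured with a short sketch. The Python fragment below is purely illustrative: the real detection is done by the ClusNet driver, and the class name, method names, and heartbeat interval used here are assumptions made for the example; only the five-missed-heartbeats rule comes from the text above.

import time

# Threshold from the text above: five consecutive missed heartbeats.
MISSED_HEARTBEATS_FOR_SUSPICION = 5

class HeartbeatMonitor:
    """Tracks heartbeats from one remote node and raises a failure suspicion."""

    def __init__(self, node_name, interval_seconds=1.2):   # interval is an assumption
        self.node_name = node_name
        self.interval = interval_seconds
        self.last_seen = time.monotonic()
        self.suspected = False

    def record_heartbeat(self):
        # A heartbeat arrived: reset the timer and clear any standing suspicion.
        self.last_seen = time.monotonic()
        self.suspected = False

    def check(self):
        # How many whole heartbeat periods have passed without a message?
        missed = int((time.monotonic() - self.last_seen) / self.interval)
        if missed >= MISSED_HEARTBEATS_FOR_SUSPICION and not self.suspected:
            self.suspected = True
            # In the real service this is the point at which the Node Manager
            # would be told to start a regroup; here we only report it.
            print(f"Failure suspected for {self.node_name}: regroup needed")
        return self.suspected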
Remember that a node is in one of three states: Offline, Online, or Paused. If a node does not
respond during regroup, the remaining nodes consider it to be in an offline state. This is indicated
graphically in Cluster Administrator. The node is still defined but is no longer active. If a node is
considered to be offline, its active resources must be failed over (“pulled”) to the node that is still
running. The Resource/Failover Manager does the failover portion of this process.
Once the Node Manager has determined that the other node is offline, it stops the heartbeat
messages being sent out by ClusNet and will not listen for heartbeats from the other node. When
the other node is restarted, it will go through the join process and the cluster heartbeats begin
again.
Startup Operations
The Node Manager controls the join and form processes when the Cluster service starts on a node.
It reads the node, network, and network interface information from the local cluster database. It brings
the defined networks online and then registers the network interfaces with the cluster network
transport, so that the discovery process can search for other nodes. The Node Manager initializes
the regroup process to determine cluster membership and checks Cluster service version
compatibility when a node attempts to join.
Membership Manager
If there are only two nodes in a cluster and one node fails, the remaining node constitutes the
whole cluster. The cluster “view” is therefore by definition consistent across the cluster. If there
are more than two nodes, it is essential that all systems in the cluster always have exactly the same view
of cluster membership.
In the event that one system detects a communication failure with another cluster node, it
broadcasts a message to the entire cluster causing all members to verify their view of the current
cluster membership. This is called a regroup event and is handled by the Membership Manager.
The Node Manager freezes writes to potentially shared devices until the membership has
stabilized.
The Membership Manager maintains consensus among the nodes about which are active and which
are merely defined. There are two important components to membership management:
● The join mechanism, which admits new members into the cluster.
● The regroup mechanism, which determines current membership on startup or suspected
failure.
Member Join
When the Cluster service on a node starts, it will try to connect to an existing node in the cluster.
The joining node will petition this “sponsor” node to join the cluster. The sponsor will be the
remaining node in a two-node cluster or any active node if there are more than two defined nodes.
The join algorithm is controlled by the sponsor and has five distinct phases for each active node.
The sponsor starts the join algorithm by broadcasting the identity of the joining node to all active
nodes. It then informs the new node about the current membership and cluster database. This starts
the new member's heartbeats. The sponsor waits for the first heartbeat from the new member, and
then signals the other nodes to consider the new node a full member. The algorithm finishes with
an acknowledgement to the new member.
All the broadcasts are repeated RPCs to each active node. If there is a failure during the join
operation (detected by an RPC failure), the join is aborted and the new member is removed from
the membership.
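To make the sequencing concrete, here is a minimal sketch of the sponsor side of the join. All of the names (Node, rpc, the message strings, the wait helper) are invented for the example; only the five phases and the abort-on-RPC-failure behavior are taken from the description above.

class JoinAborted(Exception):
    """Raised when an RPC to any active node fails during the join."""

class Node:
    """Minimal stand-in for an active cluster node reachable over RPC."""
    def __init__(self, name):
        self.name = name
    def rpc(self, message, payload=None):
        print(f"{self.name} <- {message} {payload or ''}")
        return True            # a real RPC could fail and return False

def sponsor_join(joiner, active_nodes, wait_for_first_heartbeat):
    """Run the sponsor side of the five-phase join described above."""
    def broadcast(message, payload=None):
        # Each "broadcast" is really a repeated RPC to every active node.
        for node in active_nodes:
            if not node.rpc(message, payload):
                raise JoinAborted(f"{message} failed at {node.name}")
    try:
        # Phase 1: announce the identity of the joining node to all members.
        broadcast("announce-join", joiner.name)
        # Phase 2: give the joiner the current membership and cluster database.
        joiner.rpc("install-state", [n.name for n in active_nodes])
        # Phase 3: the new member starts heartbeating; wait for the first one.
        joiner.rpc("start-heartbeats")
        wait_for_first_heartbeat(joiner)
        # Phase 4: tell every node to treat the joiner as a full member.
        broadcast("commit-join", joiner.name)
        # Phase 5: acknowledge the new member.
        joiner.rpc("join-complete")
    except JoinAborted:
        # Best-effort cleanup: remove the partially joined member.
        for node in active_nodes:
            node.rpc("remove-member", joiner.name)
        raise

sponsor_join(Node("NODE2"), [Node("NODE1")], wait_for_first_heartbeat=lambda n: None)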
Member Regroup
If there is suspicion that an active node has failed, the membership manager runs the regroup
protocol to detect membership changes. This suspicion can be caused by problems at the
communication level, resulting in missing heartbeat messages.
The regroup algorithm moves each node through six stages. Each node periodically sends a status
message in the form of a bit mask to all other nodes indicating which stage it has finished. None of
the nodes can move to the next stage until all nodes have finished the current stage.
● Activate. Each node waits for a local clock tick so that it knows that its timeout system is
working. After that, the node starts sending and collecting status messages. It advances to the
next stage if all active nodes have responded, or when the maximum waiting time has
elapsed.
● Closing. This stage determines whether partitions exist and whether the current node is in a
partition that should survive. Partitions are isolated groups of nodes within the same cluster.
They can arise because of a loss of network connections between some nodes but not others.
Any nodes that can see each other over the network are in the same partition.
● Pruning. All nodes that have been pruned for lack of connectivity halt in this phase. All
others move forward to the first cleanup phase.
● Cleanup Phase One. All surviving nodes install the new membership, mark the nodes that did
not survive the membership change as inactive, and inform the cluster network manager to
filter out messages from these nodes. Each node's Event Processor then invokes local callback
handlers to announce the node failures.
● Cleanup Phase Two. Once all members have indicated that the Cleanup Phase One has been
successfully executed, a second cleanup callback is invoked to allow a coordinated two-phase
cleanup. Once all members have signaled the completion of this last cleanup phase they move
to the final state.
● Stabilized. The regroup has finished.
There are several points during the operation where timeouts can occur. These timeouts cause the
regroup operation to restart at phase one.
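The barrier behavior of the stage protocol can be sketched as follows. The bit-mask encoding and helper names are assumptions made for the example; the point being illustrated is that no node advances until every node has finished the current stage.

# Stages from the list above; each node advertises the stages it has finished
# as a bit mask, and nobody advances until every node has caught up.
STAGES = ["activate", "closing", "pruning", "cleanup-one", "cleanup-two", "stabilized"]

def stage_bit(stage_index):
    return 1 << stage_index

def all_nodes_finished(stage_index, status_by_node):
    """status_by_node maps node name to a bit mask of completed stages."""
    mask = stage_bit(stage_index)
    return all(status & mask for status in status_by_node.values())

# Example: three nodes; everyone has finished Activate, but only two have
# finished Closing, so the cluster as a whole is still held at Closing.
status = {
    "node1": stage_bit(0) | stage_bit(1),
    "node2": stage_bit(0) | stage_bit(1),
    "node3": stage_bit(0),
}
print(all_nodes_finished(0, status))  # True  - everyone finished Activate
print(all_nodes_finished(1, status))  # False - node3 has not finished Closing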
Global Update Manager
Locker Node
One cluster node, dubbed the locker node, is assigned a central role in the Global Update Protocol.
The locker node is the one that owns the quorum resource. Typically, the oldest node in the cluster
will be the locker node. Any node that wants to start a global update first contacts the locker. The
locker node promises that if the sender (sequencer) fails during the update, the locker (or its
successor if the locker node fails) will take over the role of sender and update the remaining nodes.
In addition, the locker node guarantees that updates initiated from any node will have a unique
sequence number in the cluster database.
Node Updates
To do a global update, the updating node first sends update information to the locker node. The
locker node updates itself and returns a sequence number for the update to the sender. Once an
updating node knows that the locker has accepted the update, it sends an RPC with the update
request and the sequence number to each active node (including itself) in seniority order, based on
NodeID. The nodes are updated one at a time in NodeID order starting with the node immediately
following the locker node, and wrapping around the IDs up to the node with the ID preceding the
locker's. Once the update has been installed at all nodes, the updating node sends the locker node
an unlock request to indicate the protocol terminated successfully.
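As an illustration of the ordering just described, here is a sketch of the sender's side of the protocol. The message names and the send_rpc helper are invented; the locker-first step, the NodeID-ordered wrap-around, and the final unlock come from the paragraph above.

def glup_update(locker_id, active_node_ids, update, send_rpc):
    """Sender side of a global update, following the ordering described above.
    send_rpc(node_id, message, payload) stands in for a reliable RPC."""
    # Step 1: the locker accepts the update first and assigns a sequence number.
    sequence = send_rpc(locker_id, "accept-update", update)

    # Step 2: apply the update at every other active node, in NodeID order,
    # starting just after the locker and wrapping around to the node whose ID
    # precedes the locker's.
    ordered = sorted(active_node_ids)
    start = (ordered.index(locker_id) + 1) % len(ordered)
    for node_id in ordered[start:] + ordered[:start]:
        if node_id == locker_id:
            continue                       # the locker already has the update
        send_rpc(node_id, "apply-update", {"seq": sequence, "update": update})

    # Step 3: tell the locker the update is installed everywhere (unlock).
    send_rpc(locker_id, "unlock", {"seq": sequence})
    return sequence

# Toy RPC used only to demonstrate the call order.
def demo_rpc(node_id, message, payload):
    print(f"node {node_id}: {message} {payload}")
    return 42 if message == "accept-update" else None

glup_update(locker_id=1, active_node_ids=[1, 2, 3, 4],
            update={"Groups\\SQL Group": "online"}, send_rpc=demo_rpc)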
Failures
The protocol assumes that if all nodes that received the update fail, it is as if the update never
occurred. Should all updated nodes fail, they will roll back the cluster database and log during
recovery, and the update will not have been applied to the cluster configuration. Examples of such
failures are:
● Sender (sequencer) fails before locker accepts update.
● Sender (sequencer) installs the update at the locker, but both sender and locker nodes fail after
that.
If the sender fails during the update process after the locker has accepted the update, the locker
reconstructs the update and sends it to each active node. Nodes that already received the update
detect this through a duplicate sequence number and ignore the duplicate update.
If the sender and locker nodes both fail after the sender managed to install the update at any node
beyond the locker node, the node with the lowest NodeID assumes the role of locker. This new
locker node would have been the first to receive the update since it has the lowest NodeID. Having
received the update and the sequence number, the new locker node can complete any update that
was in progress using the saved update information. To make this work, the locker allows at most
one update at a time. This gives a total ordering property to the protocol – updates are applied in a
serial order.
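The duplicate-detection rule is easy to show in isolation. The sketch below (class and field names are assumptions) shows how a node that already holds an update can safely ignore the copy re-sent by the locker or its successor.

class NodeDatabase:
    """Per-node view of the cluster database with duplicate-update filtering."""

    def __init__(self):
        self.last_applied_seq = 0
        self.entries = {}

    def apply_update(self, seq, update):
        # A replayed update (for example, re-sent by the locker after the
        # original sender failed) carries a sequence number that has already
        # been applied, so it is simply ignored.
        if seq <= self.last_applied_seq:
            return False
        self.entries.update(update)
        self.last_applied_seq = seq
        return True

db = NodeDatabase()
print(db.apply_update(7, {"Quorum": "Disk Q:"}))  # True  - first delivery
print(db.apply_update(7, {"Quorum": "Disk Q:"}))  # False - duplicate ignored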
Database Manager
The Database Manager implements the functions needed to maintain the cluster configuration
database on each node. The Database Managers on each node of the cluster cooperate to maintain
configuration information consistently across the cluster. One-phase commits are used to ensure
the consistency of the cluster database in all nodes. The Database Manager also provides an
interface to the configuration database for use by the other Cluster service components. This
interface is similar to the registry interface exposed by the Microsoft Win32® API set with the key
difference being that changes made in one node of the cluster are atomically distributed to all
nodes in the cluster that are affected.
Log Manager
The Log Manager writes changes to the recovery log stored on the quorum resource when any of
the cluster nodes are down. This allows the cluster to recover from a partition in time, a situation
that occurs when cluster nodes are not online at the same time. The Log Manager also works with
the Checkpoint Manager to take checkpoints at appropriate moments, thereby helping to ensure
that the local cluster databases are kept consistent across the cluster.
Maintaining a checkpoint and log for use during restart is an instance of the more general
transaction processing techniques of logging and commit/abort to perform atomic state
transformations on all the clusdb replicas.
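The general logging-plus-checkpoint pattern the paragraph refers to looks roughly like the following sketch. This is not the actual quorum log format; the class and method names are invented to show how a checkpoint plus a replayable change log reconstructs a consistent database on restart.

class ChangeLog:
    """Toy write-ahead log with checkpoints, illustrating the general pattern."""

    def __init__(self):
        self.records = []        # stand-in for the quorum log file
        self.checkpoint = {}     # stand-in for a checkpoint file
        self.checkpoint_seq = 0

    def log_change(self, seq, key, value):
        # The change is made durable in the log before it is applied locally.
        self.records.append((seq, key, value))

    def take_checkpoint(self, current_database, seq):
        # Snapshot the database; log records up to 'seq' are no longer needed.
        self.checkpoint = dict(current_database)
        self.checkpoint_seq = seq
        self.records = [r for r in self.records if r[0] > seq]

    def recover(self):
        # Restart = checkpoint plus replay of every logged change after it.
        database = dict(self.checkpoint)
        for seq, key, value in sorted(self.records):
            database[key] = value
        return database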
Resource Manager
The Resource Manager is responsible for:
● Managing resource dependencies.
● Starting and stopping resources, by directing the Resource Monitors to bring resources online
and offline.
● Initiating failover and failback.
To perform the preceding tasks, the Resource Manager receives resource and cluster state
information from the Resource Monitors and the Node Manager. If a resource becomes
unavailable, the Resource Manager either attempts to restart the resource on the same node or
initiates a failover of the resource, based on the failover parameters for the resource. The Resource
Manager initiates the transfer of cluster groups to alternate nodes by sending a failover request to
the Failover Manager.
The Resource Manager also brings resources online or takes them offline in response to an operator
request from Cluster Administrator, for example.
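The restart-or-failover decision can be sketched as follows. The field names used here are illustrative stand-ins for the per-resource failure policy, not the actual cluster property names.

def handle_resource_failure(resource, restart_resource, request_group_failover):
    """Decide between a local restart and a group failover for a failed resource."""
    if resource["restart_action"] == "do-not-restart":
        return "ignored"

    if resource["restart_count"] < resource["restart_threshold"]:
        resource["restart_count"] += 1
        restart_resource(resource["name"])          # ask the Resource Monitor
        return "restarted"

    if resource["affect_the_group"]:
        request_group_failover(resource["group"])   # hand off to the Failover Manager
        return "failover-requested"
    return "left-offline"

# Example policy: restart locally up to three times, then fail the group over.
policy = {"name": "Disk R:", "group": "SQL Group", "restart_action": "restart",
          "restart_count": 0, "restart_threshold": 3, "affect_the_group": True}
print(handle_resource_failure(policy, restart_resource=print, request_group_failover=print))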
Failover Manager
The Failover Manager handles the transferring of groups of resources from one node to another in
response to a request from the Resource Manager. The Failover Manager is responsible for
deciding which systems in the cluster should “own” which groups. When this group arbitration
finishes, those systems that own individual groups turn control of the resources within the group
over to their respective Resource Managers. When failures of resources within a group cannot be
handled by the owning system, the Failover Managers re-arbitrate for ownership of the group. The
Failover Manager assigns groups to nodes based on failover parameters such as available resources
or services on the node and the possible owners defined for the resource.
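A simplified placement decision might look like the sketch below. The possible-owner constraint comes from the paragraph above, and the preferred-owner ordering anticipates the Failback discussion later in this module; the data layout and function name are assumptions.

def choose_group_owner(group, active_nodes):
    """Pick a node for a group from its possible owners, preferring the
    preferred-owner list when one of those nodes is active."""
    candidates = [n for n in group["possible_owners"] if n in active_nodes]
    if not candidates:
        return None                      # the group cannot be brought online
    for preferred in group["preferred_owners"]:
        if preferred in candidates:
            return preferred
    return candidates[0]

group = {"name": "Exchange Group",
         "possible_owners": ["NODE1", "NODE2"],
         "preferred_owners": ["NODE2"]}
print(choose_group_owner(group, active_nodes={"NODE1"}))           # NODE1 (fallback)
print(choose_group_owner(group, active_nodes={"NODE1", "NODE2"}))  # NODE2 (preferred)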
Pushing a Group
Note The threshold counter has a limit of ten. This means that groups cannot be brought online
3. Failover is initiated.
The Resource/Failover Manager on the node that previously owned the resources notifies the
Resource/Failover Manager on the destination node that a failover is occurring.
4. The destination Resource/Failover Manager begins to bring the resources online, in the
opposite order from which they were taken offline.
Pulling a Group
Regroup
In the event that one system detects a communication failure with another cluster node, it
broadcasts a message to the entire cluster causing all members to verify their view of the current
cluster membership. This is called a regroup event. Writes to shared devices must be frozen until
the membership has stabilized.
Regroup works by re-computing the members of the cluster. The cluster agrees to regroup after
checking communications among the nodes. If a Node Manager on a system does not respond, it is
removed from the cluster and its active groups must be failed over to an active system. Finally, the
cluster Membership Manager informs the cluster’s upper levels (such as the Global Update
Manager) of the regroup event.
Note Regroup is also used for the forced eviction of active nodes from a cluster.
Regroup States
After regroup, one of two states occurs:
1. There is a minority group or no quorum device, in which case the group does not survive.
2. There is a non-minority group and a quorum device, in which case the group does survive.
There is a non-minority rule such that the number of new members must be equal to or greater than
half of the old active cluster. This provision prevents a minority from seizing the quorum device at
the expense of a larger potentially surviving cluster. In addition, the quorum guarantees
completeness, by preventing a so-called split-brain cluster; that is, two nodes (or groups of nodes)
operating independently as the same cluster. Whichever group loses quorum arbitration will shut
down its Cluster service.
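The survival rule can be written as a single predicate. This sketch (the function and parameter names are invented) combines the non-minority test with quorum ownership as described above.

def partition_survives(partition_size, previous_active_count, owns_quorum):
    """Apply the survival rule described above: the partition must contain at
    least half of the previously active nodes and must own the quorum device."""
    non_minority = partition_size >= previous_active_count / 2
    return non_minority and owns_quorum

# A 4-node cluster splits 2/2: only the half that wins quorum arbitration survives.
print(partition_survives(2, 4, owns_quorum=True))   # True
print(partition_survives(2, 4, owns_quorum=False))  # False
print(partition_survives(1, 4, owns_quorum=True))   # False - minority partition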
Failback
A group can be configured with a preferred owner node, so that if the preferred node is running,
the group will run on that node rather than any other. For example, Microsoft SQL Server and
Microsoft Exchange Server could be configured with different preferred owners in a cluster. In this
case, if all nodes are active the services should run on different nodes. If one node fails, another
will pull the groups it was hosting.
Every group has a failback property. If this is enabled, when the preferred owner comes online, the
Failover Manager can decide to move such groups back to the original node. To do this, the
Failover Manager on the preferred owner contacts the Resource Manager on the node that
currently has the resources online. The groups are then pushed from the current owner to the
preferred owner, as described above.
There is a failback option that can be configured to control the time of day during which failback
can occur. If a Failback window has been configured, the Resource/Failover Manager will wait for
the designated time before initiating failback.
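The window check itself is simple. The following sketch shows the idea using plain start and end hours; the parameter names and the wrap-past-midnight handling are assumptions for the example, not the actual group property names.

from datetime import datetime

def failback_allowed_now(window_start_hour, window_end_hour, now=None):
    """Return True if the current hour falls inside the configured failback
    window. Hours are 0-23; a window may wrap past midnight (e.g. 22 to 6)."""
    hour = (now or datetime.now()).hour
    if window_start_hour <= window_end_hour:
        return window_start_hour <= hour < window_end_hour
    return hour >= window_start_hour or hour < window_end_hour

# Only fail back between 01:00 and 05:00, when users are less likely to notice.
print(failback_allowed_now(1, 5, now=datetime(2006, 5, 19, 3, 0)))   # True
print(failback_allowed_now(1, 5, now=datetime(2006, 5, 19, 14, 0)))  # False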
Communication Manager
The components of the Cluster service communicate with other nodes through the Communication
Manager. Several components in each node of a cluster are in constant communication with other
nodes. Communication is fully connected in small clusters. That is, all nodes are in direct
communication with all other nodes. Intra-cluster communication uses RPC mechanisms to
guarantee reliable, exactly-once delivery of messages.
The Communication Manager also filters out messages from former cluster nodes that have been evicted.
Cluster Registry
The cluster configuration, including the properties of all resources and groups in the cluster, is
stored as a registry hive on each node in the cluster. This cluster database is loaded into the registry
on each node when the Cluster service starts up.
Note The cluster registry keys are separate from the rest of the Windows Server 2003 registry. The
cluster hive is stored in %windir%\Cluster in the file Clusdb and its associated Clusdb.log file.
The cluster registry maintains updates on members, resources, restart parameters, and other
configuration information for the whole cluster. It is important that the cluster registry is
maintained on stable storage and is replicated at each member through a global update protocol.
Cluster service loads its registry settings into HKEY_LOCAL_MACHINE\Cluster.
Under this key are the following subkeys:
● Groups
● NetworkInterfaces
● Networks
● Nodes
● Quorum
● Resources
● ResourceTypes
Each of these subkeys contains the configuration information for the cluster. For example, when a
new group is created, a new entry is added under
HKEY_LOCAL_MACHINE\Cluster\Groups.
One method of verifying a successful installation of Windows Clustering is to verify that the
preceding registry keys have been created.
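If you want to script that check, the short Python sketch below enumerates the subkeys of the cluster hive with the standard winreg module. It must be run on a cluster node, and it simply assumes the subkey names listed above; it is an illustration, not a supported verification tool.

import winreg

EXPECTED_SUBKEYS = {"Groups", "NetworkInterfaces", "Networks", "Nodes",
                    "Quorum", "Resources", "ResourceTypes"}

def cluster_hive_subkeys():
    """Enumerate the subkeys of HKEY_LOCAL_MACHINE\\Cluster on this node."""
    subkeys = set()
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, "Cluster") as key:
        count = winreg.QueryInfoKey(key)[0]      # number of subkeys
        for i in range(count):
            subkeys.add(winreg.EnumKey(key, i))
    return subkeys

missing = EXPECTED_SUBKEYS - cluster_hive_subkeys()
print("Cluster hive looks complete" if not missing else f"Missing keys: {missing}")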
Cluster service stores the parameters for its resources and groups in keys in the cluster database in
the registry. The key name for a given group or resource is a globally unique identifier (GUID).
GUIDs are 128-bit numbers generated by a special algorithm that uses the network card media
access control (MAC) address of the machine on which it is running, and the exact time at which
the GUID is created, to ensure that every GUID is unique. GUIDs are used to identify many
components of the cluster, such as a resource, group, network or network interface. They are used
internally by the Cluster service to reference a cluster component. The Cluster service updates the
cluster configuration based on event notification of property changes for a GUID.
To assist in understanding the cluster log references to GUIDs, a Cluster object file
(%windir%\Cluster\Cluster.oml) is automatically created and maintained that contains a mapping of
GUIDs to resource names.
Example:
00000488.00000660::2003/05/19-23:58:22.977 OBCREATE "Resource" "7dc7fb50-be58-4b34-912e-830c93043e74" "Disk R:"
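If you need to translate many GUIDs at once, a small script can build the map from Cluster.oml. The sketch below is based only on the single OBCREATE sample line shown above, so treat the pattern it matches as an assumption about the file format.

import re

# Matches lines like the OBCREATE example above:
#   ...OBCREATE "Resource" "7dc7fb50-be58-4b34-912e-830c93043e74" "Disk R:"
OBCREATE_PATTERN = re.compile(r'OBCREATE\s+"(?P<type>[^"]+)"\s+"(?P<guid>[^"]+)"\s+"(?P<name>[^"]+)"')

def guid_name_map(oml_path):
    """Build a {GUID: friendly name} dictionary from a Cluster.oml file."""
    mapping = {}
    with open(oml_path, errors="replace") as oml:
        for line in oml:
            match = OBCREATE_PATTERN.search(line)
            if match:
                mapping[match.group("guid")] = match.group("name")
    return mapping

# Usage: translate the GUIDs that appear in the cluster log into friendly names.
# names = guid_name_map(r"C:\Windows\Cluster\Cluster.oml")
# print(names.get("7dc7fb50-be58-4b34-912e-830c93043e74"))  # "Disk R:"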
Much of the information that is stored in the cluster database is property information. For example,
below is an excerpt from the information stored in the cluster database for the physical disk
resource type:
Physical Disk
==========
AdminExtensions : REG_MULTI_SZ : {4EC90FB0-D0BB-11CF-B5EF-00A0C90AB505}
Different values and parameters are discussed in depth in later modules. However, the important
items that have been discussed in this section are:
● The cluster registry is a database of the cluster configuration.
● All the nodes in a cluster should have identical cluster registries.
● In the cluster registry, each object, be it a group, network, network interface, or resource, is
identified by a GUID.
● While troubleshooting a cluster by making use of the cluster log, you will need to reference
the cluster registry or the Cluster.oml file to find the friendly name of a resource because the
cluster log will refer to the resource by GUID.
During this lab session, you will examine the cluster registry on each node to identify how cluster
configuration is stored locally on each node. You will also use the Load Hive command in
Regedt32 to examine the contents of the checkpoint file.
Refer to the accompanying Lab Manual to complete the practice exercises.
Review
Topics discussed in this session include:
● Introduction
● Cluster Architecture Overview
● Cluster Service Component Architecture