GPFS
GA22-7895-01
Note:
Before using this information and the product it supports, be sure to read the general information under Notices on page 69.
Permission to copy without fee all or part of MPI: A Message Passing Interface Standard, Version 1.2 and Version 2.0, Message Passing Interface Forum, is granted, provided the University of Tennessee copyright notice and the title of the document appear, and notice is given that copying is by permission of the University of Tennessee. © 1995, 1996, and 1997 University of Tennessee, Knoxville, Tennessee.
© Copyright International Business Machines Corporation 2002. All rights reserved.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents

Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

About This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Who Should Use This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
How this book is organized . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Typography and Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . x

What's new . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
What's new for GPFS 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Migration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

Part 3. Appendixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

Appendix A. GPFS architecture . . . . . . . . . . . . . . . . . . . . . . . . . 45
Special management functions . . . . . . . . . . . . . . . . . . . . . . . . . . 45
The GPFS configuration manager . . . . . . . . . . . . . . . . . . . . . . . . . 45
The file system manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
The metanode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Use of disk storage and file structure within a GPFS file system . . . . . . . . 47
Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Quota files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Log files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
User data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
GPFS and memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Component interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Program interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Socket communications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Application and user interaction with GPFS . . . . . . . . . . . . . . . . . . . 52
Operating system commands . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Operating system calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
GPFS command processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
GPFS cluster data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
GPFS publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
AIX publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Reliable Scalable Cluster Technology publications . . . . . . . . . . . . . . . 78
HACMP/ES publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Storage related information . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Whitepapers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Non-IBM publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

Figures
1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
For definitions of the terms used in this book, see the Glossary on page 73. For a list of related publications, see the Bibliography on page 77.

This book uses the following typographic conventions:
Bold
    Bold words or characters represent system elements that you must use literally, such as
    commands, subcommands, flags, path names, directories, file names, values, and selected
    menu options.
Bold Underlined
    Bold Underlined keywords are defaults. These take effect if you fail to specify a different
    keyword.
Italic
    v Italic words or characters represent variable values that you must supply
    v Italics are used for book titles
    v Italics are used for general emphasis
Monospace
    Examples and information that the system displays appear in monospace type.
[]
    Brackets enclose optional items in format and syntax descriptions.
{}
    Braces enclose a list from which you must choose an item in format and syntax descriptions.
<>
    Angle brackets (less-than and greater-than) enclose the name of a key on the keyboard. For
    example, <Enter> refers to the key on your terminal or workstation that is labeled with the
    word Enter.
...
    An ellipsis indicates that you can repeat the preceding item one or more times.
<Ctrl-x>
    The notation <Ctrl-x> indicates a control character sequence. For example, <Ctrl-c> means
    that you hold down the control key while pressing <c>.
What's new
This section summarizes all the changes made to IBM General Parallel File System for AIX 5L:
v For atime and mtime values as reported by the stat, fstat, gpfs_stat, and gpfs_fstat calls, you may:
Suppress updating the value of atime.
When suppressing the periodic update, these calls will report the time the file was last accessed when
the file system was mounted with the -S no option or, for a new file, the time the file system was
created.
Display the exact value for mtime.
The default is to periodically update the mtime value for a file system. If it is more desirable to
display exact modification times for a file system, specify the -E yes option.
Commands which have been updated:
1. mmcrfs
2. mmchfs
3. mmlsfs
v The capability to read from or write to a file with direct I/O. The mmchattr command has been updated
with the -D option for this support.
v The default use designation for nodes in your GPFS nodeset has been changed from manager to
client.
Commands which have been updated:
1. mmconfig
2. mmchconfig
v The terms to install/uninstall GPFS quotas have been replaced by the terms enable/disable GPFS
quota management.
v The GPFS documentation is no longer shipped on the product CD-ROM. You may download, view,
search, and print the supporting documentation for the GPFS program product in the following ways:
1. In PDF format:
On the World Wide Web at www.ibm.com/servers/eserver/pseries/library/gpfs.html
From the IBM Publications Center at www.ibm.com/shop/publications/order
Migration
For information on migrating your system to the latest level of GPFS, see Migration, coexistence, and
compatibility.
cause them all to become simultaneously unavailable. In order to assure file availability, GPFS maintains
each instance of replicated data on disks in different failure groups.
The replication feature of GPFS allows you to determine how many copies of a file to maintain. File
system replication assures that the latest updates to critical data are preserved in the event of disk failure.
During configuration, you assign a replication factor to indicate the total number of copies you wish to
store. Replication allows you to set different levels of protection for each file or one level for an entire file
system. Since replication uses additional disk space and requires extra write time, you might want to
consider replicating only file systems that are frequently read from but seldom written to (see File system
recoverability parameters on page 22). Even if you do not specify replication when creating a file system,
GPFS automatically replicates recovery logs in separate failure groups. For further information on failure
groups see Logical volume creation considerations on page 11.
Once your file system is created, you can have it automatically mounted whenever the GPFS daemon is
started. The automount feature assures that whenever the system and disks are up, the file system will be
available.
Simplified administration
GPFS commands save configuration and file system information in one or more files, collectively known as
GPFS cluster data. The GPFS administration commands are designed to keep these files synchronized
between each other and with the GPFS system files on each node in the nodeset, thereby ensuring
accurate configuration data (see GPFS cluster data on page 57).
GPFS administration commands are similar in name and function to UNIX file system commands, with one
important difference: the GPFS commands operate on multiple nodes. A single GPFS command performs
a file system function across the entire nodeset. Most GPFS administration tasks can be performed from
any node running GPFS (see the individual commands as documented in the General Parallel File System
for AIX 5L: AIX Clusters Administration and Programming Reference).
Environment
sp
The PSSP cluster environment is based on the IBM Parallel System Support Programs
(PSSP) program product and the shared disk concept of the IBM Virtual Shared Disk program
product.
In the PSSP cluster environment, the boundaries of the GPFS cluster depend on the switch
type being used. In a system with an SP Switch, the GPFS cluster is equal to the
corresponding SP partition. In a system with an SP Switch2, the GPFS cluster is equal to all of
the nodes in the system.
For information regarding the GPFS for AIX 5L licensed program for PSSP clusters go to
www.ibm.com/servers/eserver/pseries/software/sp/gpfs.html
rpd or hacmp
lc
Within a GPFS cluster, the nodes are divided into one or more GPFS nodesets. The nodes in each
nodeset share a set of file systems which are not accessible by the nodes in any other nodeset.
On each node in the cluster, GPFS consists of:
1. Administration commands
2. A kernel extension
3. A multi-threaded daemon
For a detailed discussion of GPFS, see Appendix A, GPFS architecture, on page 45.
For complete hardware and programming specifications, see Hardware specifications on page 9 and
Programming specifications on page 9.
pSeries machines (the size of the nodeset is constrained by the limitations of the SSA adapter). If the
disks in the file system are purely Fibre Channel, your nodeset may consist of up to 32 RS/6000 or
eServer pSeries machines (the size of the nodeset is constrained by the limitations of the HACMP/ES
software). After a GPFS nodeset has been configured, or nodes have been added to or deleted from the
nodeset, GPFS obtains the necessary additional configuration data from the HACMP/ES Global Object
Data Manager (ODM):
1. node number
2. adapter type
3. IP address
The complete configuration data maintained by GPFS is then stored on the primary, and if specified, the
secondary GPFS cluster data server as designated on the mmcrcluster command (see GPFS cluster
creation considerations on page 14).
For complete hardware and programming specifications, see Hardware specifications on page 9 and
Programming specifications on page 9.
Hardware specifications
1. An existing IBM eServer configuration:
v An RSCT peer domain established with the RSCT component of AIX 5L
For information on creating an RSCT peer domain, see the Reliable Scalable Cluster Technology for
AIX 5L: RSCT Guide and Reference
v An HACMP cluster established with the HACMP/ES program product
For information on creating an HACMP cluster, see the High Availability Cluster Multi-Processing for
AIX: Enhanced Scalability Installation and Administration Guide.
2. Enough disks to contain the file system (see Disk considerations on page 11).
3. An IP network of sufficient network bandwidth (minimum of 100Mb per second).
Programming specifications
1. AIX 5L Version 5 Release 1 (5765-E61) with IY30258, or later modifications
2. For a GPFS cluster type hacmp, HACMP/ES version 4.4.1 (5765-E54), or later modifications
Recoverability considerations
Sound file system planning includes considering replication as well as structuring your data so information
is not vulnerable to a single point of failure. GPFS provides you with parameters that enable you to create
a highly available file system with fast recoverability from failures. At the file system level, the metadata
and data replication parameters are set (see File system recoverability parameters on page 22). At the
disk level when preparing disks for use with your file system, you can specify disk usage and failure group
positional parameters to be associated with each disk (see Logical volume creation considerations on
page 11).
Additionally, GPFS provides several layers of protection against failures of various types:
1. Node failure on page 10
2. Disk failure on page 10
3. Making your decision on page 10
Node failure
This basic layer of protection covers the failure of file system nodes and is provided by Group Services.
When an inoperative node is detected by Group Services, GPFS fences it out using environment-specific
subsystems (see Disk fencing). This prevents any write operations that might interfere with recovery.
File system recovery from node failure should not be noticeable to applications running on other nodes,
except for delays in accessing objects being modified on the failing node. Recovery involves rebuilding
metadata structures, which may have been under modification at the time of the failure. If the failing node
is the file system manager for the file system, the delay will be longer and proportional to the activity on
the file system at the time of failure, but no administrative intervention will be needed.
During node failure situations, if multi-node quorum is in effect, quorum needs to be maintained in order to
recover the failing nodes. If multi-node quorum is not maintained due to node failure, GPFS restarts on all
nodes, handles recovery, and attempts to achieve quorum again.
Disk failure
The most common reason why data becomes unavailable is disk failure with no redundancy. In the event
of disk failure, GPFS discontinues use of the disk and awaits its return to an available state. You can
guard against loss of data availability from such failures by setting the GPFS recoverability parameters
(replication, disk usage, and failure group designations) either alone or in conjunction with one of these
environment specific methods to maintain additional copies of files.
One means of data protection is the use of a RAID/Enterprise Storage Subsystem (ESS) controller, which
masks disk failures with parity disks. An ideal configuration is shown in Figure 3, where a RAID/ESS
controller is multi-tailed to each node in the nodeset.
Disk considerations
You may have up to 1024 external shared disks or disk arrays with the adapters configured to allow each
disk connectivity to each node in the nodeset. No disk can be larger than 1 TB.
Proper planning for your disk subsystem includes determining:
v Sufficient disks to meet the expected I/O load
v Sufficient connectivity (adapters and buses) between disks
Disks can be attached using:
v SSA
v Fibre Channel
v Enterprise Storage Server (ESS) in either Subsystem Device Driver (SDD) or non-SDD mode
The actual number of disks in your system may be constrained by products other than GPFS which you
have installed. Refer to individual product documentation for support information.
Disk considerations include:
v Disk fencing
v Logical volume creation considerations
Disk fencing
In order to preserve data integrity in the event of certain system failures, GPFS will fence a node that is
down from the file system until it returns to the available state. Depending upon the types of disk you are
using, there are three possible ways for the fencing to occur:
SSA fencing
SSA disks
SCSI-3 persistent reserve
For a list of GPFS supported persistent reserve devices, see the Frequently Asked Questions at
www.ibm.com/servers/eserver/clusters/library/
disk leasing
A GPFS specific fencing mechanism for disks which do not support either SSA fencing or SCSI-3
persistent reserve.
Single-node quorum is only supported when disk leasing is not in effect. Disk leasing is activated if any
disk in any file system in the nodeset is not using SSA fencing or SCSI-3 persistent reserve.
hdisk3          0020570a72bbb1a0          None
hdisk4          none                      None
b. If a PVID does not exist, prior to assigning a PVID you must ensure that the disk is not a member
of a mounted and active GPFS file system. If the disk is a member of an active and mounted
GPFS file system and you issue the chdev command to assign a PVID, there is the possibility you
will experience I/O problems which may result in the file system being unmounted on one or more
nodes.
c. To assign a PVID, issue the chdev command:
chdev -l hdisk4 -a pv=yes
To determine the PVID assigned, issue the lspv command. The system displays information similar
to:
lspv
hdisk3          0020570a72bbb1a0          None
hdisk4          0022b60ade92fb24          None
2. Single-node quorum is only supported when disk leasing is not in effect. Disk leasing is activated if any
disk in any file system in the nodeset is not using SSA fencing or SCSI-3 persistent reserve.
3. Logical volumes created by the mmcrlv command will:
v Use SCSI-3 persistent reserve on disks which support it or SSA fencing if that is supported by the
disk. Otherwise disk leasing will be used.
v Have bad-block relocation automatically turned off. Accessing disks concurrently from multiple
systems using LVM bad-block relocation could potentially cause conflicting assignments. As a result,
software bad-block relocation is turned off, allowing the hardware bad-block relocation supplied by
your disk vendor to provide protection against disk media errors.
4. You cannot protect your file system against disk failure by mirroring data at the LVM level. You must
use GPFS replication or RAID devices to protect your data (see Recoverability considerations on
page 9).
5. In an HACMP environment, any disk resources (volume groups and logical volumes) that will be used
by GPFS must not belong to any HACMP/ES resource group. HACMP/ES will not be in control of
these disk resources and is not responsible for varying them on or off at any time. The responsibility to
keep the disks in the proper state belongs to GPFS in the HACMP environment. For further information
on logical volume concepts, see the AIX 5L System Management Guide: Operating System and
Devices.
The mmcrlv command expects as input a file, DescFile, containing a disk descriptor, one per line, for
each of the disks to be processed. Disk descriptors have the format (second and third fields reserved):
DiskName:::DiskUsage:FailureGroup
DiskName
The physical device name of the disk you want to define as a logical volume. This is the /dev
name for the disk on the node on which the mmcrlv command is issued and can be either an
hdisk name or a vpath name for an SDD device. Each disk will be used to create a single volume
group and a single logical volume.
Disk Usage
What is to be stored on the disk. metadataOnly specifies that this disk may only be used for
metadata, not for data. dataOnly specifies that only data, and not metadata, is allowed on this
disk. You can limit vulnerability to disk failure by confining metadata to a small number of
conventional mirrored or replicated disks. The default, dataAndMetadata, allows both on the disk.
Note: RAID devices are not well-suited for performing small block writes. Since GPFS metadata
writes are often smaller than a full block, you may find using non-RAID devices for GPFS
metadata better for performance.
FailureGroup
A number identifying the failure group to which this disk belongs. All disks that are attached to the
same adapter have a common point of failure and should therefore be placed in the same
failure group.
GPFS uses this information during data and metadata placement to assure that no two replicas of
the same block will become unavailable due to a single failure. You can specify any value from -1
(where -1 indicates that the disk has no point of failure in common with any other disk) to 4000. If
you specify no failure group, the value defaults to -1.
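As an illustration, a disk descriptor file passed to the mmcrlv command might contain entries such as the following; the hdisk and vpath names and the failure group numbers shown here are hypothetical, not taken from any particular configuration:

hdisk3:::dataAndMetadata:1
hdisk4:::dataAndMetadata:1
vpath1:::metadataOnly:2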
Upon successful completion of the mmcrlv command, these tasks are completed on all nodes in the
GPFS cluster:
v For each valid descriptor in the descriptor file, local logical volumes and the local volume groups are
created.
The logical volume names are assigned according to the convention:
gpfsNNlv
where NN is a unique non-negative integer not used in any prior logical volume named with this
convention.
The local volume group component of the logical volume is named according to the same convention:
gpfsNNvg
where NN is a unique non-negative integer not used in any prior local volume group named with
this convention.
v The physical device or vpath name is replaced with the created logical volume names.
v The local volume groups are imported to all available nodes in the GPFS cluster.
v The DescFile is rewritten to contain the created logical volume names in place of the physical disk or
vpath name and all other fields, if specified, are copied without modification. The rewritten disk
descriptor file can then be used as input to the mmcrfs, mmadddisk, or the mmrpldisk commands. If
you do not use this file, you must accept the default values or specify these values when creating disk
descriptors for subsequent mmcrfs, mmadddisk, or mmrpldisk commands.
If necessary, the DiskUsage and FailureGroup values for a disk can be changed with the mmchdisk
command.
The GPFS cluster creation options are specified on the mmcrcluster command and, where permitted, changed later with the mmchcluster command. The remote shell and remote file copy commands default to /usr/bin/rsh and /usr/bin/rcp; the remaining options default to none. To add or delete nodes from the cluster, use the mmaddcluster or mmdelcluster command, respectively. The cluster type cannot be changed.

You may specify a node using any of these forms:

Format             Example
Short hostname     k145n01
Long hostname      k145n01.kgn.ibm.com
IP address         9.119.19.102
You must follow these rules when creating your GPFS cluster:
v A node may only belong to one GPFS cluster at a time.
v The node must be a properly configured member of either your RSCT peer domain or your HACMP
cluster.
v The node must be available for the command to be successful. If any of the nodes listed are not
available when the command is issued, a message listing those nodes is displayed. You must correct
the problem on each node, create a new input file containing the failed nodes only, and reissue the
mmaddcluster command to add those nodes.
backup server, the GPFS cluster data is inaccessible and any GPFS administrative command that is
issued will fail. Similarly, when the GPFS daemon starts up, at least one of the two GPFS cluster data
server nodes must be accessible (see GPFS cluster data on page 57).
The nodeset configuration options are specified on the mmconfig command and, where permitted, changed later with the mmchconfig command. The defaults are: the nodeset identifier is an integer value beginning with one and increasing sequentially, and it cannot be changed once it is set; single-node quorum is no (see Quorum on page 18); the directory for problem determination data is /tmp/mmfs; pagepool is 20M, maxFilesToCache is 1000, and maxStatCache is initially 4 x maxFilesToCache (see page 18); dmapiEventTimeout is 86400000 and dmapiMountTimeout is 60 (dmapiSessionFailureTimeout is also described on page 19).

In the node file passed to the mmconfig command, each node is identified by the hostname or IP address of the adapter over
which the GPFS daemons communicate. Alias interfaces are not allowed. Use the original address
or a name that is resolved by the host command to that original address.
You may specify a node using any of these forms:
Format             Example
Short hostname     k145n01
Long hostname      k145n01.kgn.ibm.com
IP address         9.119.19.102
manager | client
An optional use designation.
The designation specifies whether or not the node should be included in the pool of nodes from
which the file system manager is chosen (the special functions of the file system manager
consume extra processing time; see The file system manager on page 45). The default is not to
have the node included in the pool.
In general, small systems (less than 128 nodes) do not need multiple nodes dedicated for the file
system manager. However, if you are running large parallel jobs, threads scheduled to a node
performing these functions may run slower. As a guide, in a large system there should be one file
system manager node for each GPFS file system.
Nodeset identifier
You can provide a name for the nodeset by using the -C option on the mmconfig command or allow
GPFS to assign one. If you choose the identifier, it can be at most eight alphanumeric characters long and
may not be a reserved word or the number zero. If GPFS assigns one, it will be an integer identifier
beginning with the value one and increasing sequentially as nodesets are added. This designation may not
be changed once it is assigned.
Quorum
For all nodesets consisting of three or more nodes, GPFS quorum is defined as one plus half of the
number of nodes in the GPFS nodeset (referred to as multi-node quorum). For a two-node nodeset, you
have the choice of allowing multi-node quorum or specifying the -U option on the mmconfig command to
indicate the use of a single-node quorum. The specification of single-node quorum allows the remaining
node in a two-node nodeset to continue functioning in the event of the failure of the peer node.
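For example, in a nodeset of eight nodes, multi-node quorum is 1 + 8/2 = 5 nodes, so GPFS in that nodeset continues to operate as long as at least five of the eight nodes remain available.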
Note: Single-node quorum is only supported when disk leasing is not in effect. Disk leasing is activated if
any disk in any file system in the nodeset is not using SSA fencing or SCSI-3 persistent reserve.
If multi-node quorum is used, quorum needs to be maintained in order to recover the failing nodes. If
multi-node quorum is not maintained due to node failure, all GPFS nodes restart, handle recovery, and
attempt to achieve quorum again. Therefore, in a three-node system, failure of one node will allow
recovery and continued operation on the two remaining nodes. This is the minimum configuration where
continued operation is possible due to the failure of a node. That is, in a two-node system where
single-node quorum has not been specified, the failure of one node means both nodes will restart, handle
recovery, and attempt to achieve quorum again.
If single-node quorum is specified, the failure of one node results in GPFS fencing the failing node from
the disks containing GPFS file system data. The remaining node will continue processing if the fencing
operation was successful. If not, those file systems which could not be completely fenced will be
unmounted and attempts to fence the node will continue (in the unlikely event that both nodes end up
fenced, see the General Parallel File System for AIX 5L: AIX Clusters Problem Determination Guide and
search on single-node quorum).
Cache usage
GPFS creates a number of cache segments on each node in the nodeset. The amount of cache is
controlled by three parameters:
pagepool
The amount of pinned memory reserved for caching data read from disk. This consists mainly of
file data, but also includes directory blocks and other file system metadata such as indirect blocks
and allocation maps (see Appendix A, GPFS architecture, on page 45). pagepool is used for
read-ahead and write-behind operations to increase performance, as well as for reuse of cached
data.
The size of the cache on each node can range from a minimum of 4 MB to a maximum of 512
MB. For systems where applications access large files, reuse data, or have a random I/O pattern,
increasing the value for pagepool may prove beneficial. This value must be specified with the
character M, for example 80M. The default is 20M.
maxFilesToCache
The total number of different files that can be cached at one time. Every entry in the file cache
requires some pageable memory to hold the content of the file's inode plus control data structures.
This is in addition to any of the file's data and indirect blocks that might be cached in the page
pool.
The total amount of memory required for inodes and control data structures can be calculated as:
maxFilesToCache x 2.5 KB
where 2.5 KB = 2 KB + 512 bytes for an inode
Valid values of maxFilesToCache range from 0 to 1,000,000. For systems where applications use
a large number of files, of any size, increasing the value for maxFilesToCache may prove
beneficial (this is particularly true for systems where a large number of small files are accessed).
The value should be large enough to handle the number of concurrently open files plus allow
caching of recently used files. The default value is 1000.
maxStatCache
This parameter sets aside additional pageable memory to cache attributes of files that are not
currently in the regular file cache. This is useful to improve the performance of both the system
and GPFS stat( ) calls for applications with a working set that does not fit in the regular file cache.
The memory occupied by the stat cache can be calculated as:
maxStatCache x 176 bytes
Valid values of maxStatCache range from 0 to 1,000,000. For systems where applications test the
existence of files, or the properties of files, without actually opening them (as backup applications
do), increasing the value for maxStatCache may prove beneficial. The default value is:
4 x maxFilesToCache
The total amount of memory GPFS uses to cache file data and metadata is arrived at by adding pagepool
to the amount of memory required to hold inodes and control data structures (maxFilesToCache x 2.5
KB), and the memory for the stat cache (maxStatCache x 176 bytes) together. The combined amount of
memory to hold inodes, control data structures, and the stat cache is limited to 50% of the physical
memory. With an inode size of 512 bytes, the default 4-to-1 ratio of maxStatCache to maxFilesToCache
would result in a maximum 250,000 stat cache entries and 65,000 file cache entries.
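For example, with the default values of 20M for pagepool, 1000 for maxFilesToCache, and 4 x 1000 = 4000 for maxStatCache, the GPFS cache memory on each node works out to approximately:

pagepool                                        20 MB
maxFilesToCache x 2.5 KB    1000 x 2.5 KB    =  about 2.5 MB
maxStatCache x 176 bytes    4000 x 176 bytes =  about 0.7 MB
                                                -------------
Total                                           about 23 MB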
During configuration, you can specify the maxFilesToCache, maxStatCache, and pagepool parameters
that control how much cache is dedicated to GPFS. These values can be changed later, so experiment
with larger values to find the optimum cache size that improves GPFS performance without affecting other
applications.
The mmchconfig command can be used to change the values of maxFilesToCache, maxStatCache,
and pagepool. The pagepool parameter is the only one of these parameters that may be changed while
the GPFS daemon is running. A pagepool change occurs immediately when using the -i option on the
mmchconfig command. Changes to the other values are effective only after the daemon is restarted.
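As a sketch, a pagepool change that takes effect immediately might look like the following; the attribute=value form shown here is an assumption, so check the mmchconfig entry in the General Parallel File System for AIX 5L: AIX Clusters Administration and Programming Reference for the exact syntax:

mmchconfig pagepool=100M -i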
For example, to create a nodeset with the nodes listed in the file /u/gpfsadmin/nodesGPFS1, using the -A option, a nodeset identifier of set1, and a pagepool of 100 MB (-p 100M), issue the command:

mmconfig -n /u/gpfsadmin/nodesGPFS1 -A -C set1 -p 100M
The file system creation options are specified on the mmcrfs command and, where permitted, changed later with the mmchfs command. The defaults are: the file system is mounted automatically when the GPFS daemon starts (yes), the estimated number of nodes that will mount the file system is 32, the block size is 256K, disk verification is performed (yes), quota enforcement and DMAPI are not enabled (no), and no additional mount options are set (none). These options are described in the sections that follow.
Automatic mount
Whether or not to automatically mount a file system when the GPFS daemon starts may be specified at
file system creation by using the -A option on the mmcrfs command or changed at a later time by using
the -A option on the mmchfs command. The default is to have the file system automatically mounted,
assuring file system availability whenever the system and disks are up.
Block size
The size of data blocks in a file system may be specified at file system creation by using the -B option on
the mmcrfs command or allowed to default to 256 KB. This value cannot be changed without recreating
the file system.
GPFS offers five block sizes for file systems: 16 KB, 64 KB, 256 KB, 512 KB, and 1024 KB. This value
should be specified with the character K, for example 512K. You should choose the block size based on
the application set that you plan to support and if you are using RAID hardware:
v The 256 KB block size is the default block size and normally is the best block size for file systems that
contain large files accessed in large reads and writes.
v The 16 KB block size optimizes use of disk storage at the expense of large data transfers.
v The 64 KB block size offers a compromise. It makes more efficient use of disk space than 256 KB while
allowing faster I/O operations than 16 KB.
v The 512 KB and 1024 KB block size may be more efficient if data accesses are larger than 256 KB.
If you plan to use SSA RAID devices in your file system, a larger block size may be more effective and
help you to avoid the penalties involved in small block write operations to RAID devices. For example,
in a RAID configuration utilizing 4 data disks and 1 parity disk (a 4+P configuration), which utilizes a 64
KB stripe size, the optimal file system block size would be 256 KB (4 data disks x 64 KB stripe size =
256 KB). A 256 KB block size would result in a single data write that encompassed the 4 data disks
and a parity write to the parity disk. If a block size smaller than 256 KB, such as 64 KB, was used,
write performance would be degraded. A 64 KB block size would result in a single disk writing 64 KB
and a subsequent read from the three remaining disks in order to compute the parity that is then written
to the parity disk. The extra read degrades performance.
The maximum GPFS file system size that can be mounted is limited by the control structures in memory
required to maintain the file system. These control structures, and consequently the maximum mounted file
system size, are a function of the block size of the file system.
v If your file systems have a 16 KB block size, you may have one or more file systems with a total size of
1 TB mounted.
v If your file systems have a 64 KB block size, you may have one or more file systems with a total size of
10 TB mounted.
v If your file systems have a 256 KB or greater block size, you may have file systems mounted with a
total size of not greater than 200 TB where no single file system exceeds 100 TB.
Fragments and subblocks: GPFS divides each block into 32 subblocks. Files smaller than one block
size are stored in fragments, which are made up of one or more subblocks. Large files are stored in a
number of full blocks plus zero or more subblocks to hold the data at the end of the file.
The block size is the largest contiguous amount of disk space allocated to a file and therefore the largest
amount of data that can be accessed in a single I/O operation. The subblock is the smallest unit of disk
space that can be allocated. For a block size of 256 KB, GPFS reads as much as 256 KB of data in a
single I/O operation and small files can occupy as little as 8 KB of disk space. With a block size of 16 KB,
small files occupy as little as 512 bytes of disk space (not counting the inode), but GPFS is unable to read
more than 16 KB in a single I/O operation.
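For example, with a 256 KB block size the subblock size is 256 KB / 32 = 8 KB, so a 40 KB file occupies five 8 KB subblocks (40 KB of disk space) rather than a full 256 KB block.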
command. They can be changed for an existing file system using the mmchfs command, but
modifications only apply to files subsequently created. To apply the new replication values to existing files
in a file system, issue the mmrestripefs command.
Metadata and data replication are specified independently. Each has a default replication factor of 1 (no
replication) and a maximum replication factor. Although replication of metadata is less costly in terms of
disk space than replication of file data, excessive replication of metadata also affects GPFS efficiency
because all metadata replicas must be written. In general, more replication uses more space.
The soft limits define levels of disk space and files below which the user or group can safely operate. The
hard limits define the maximum disk space and files the user or group can accumulate. Specify hard and
soft limits for disk space in units of kilobytes (k or K) or megabytes (m or M). If no suffix is provided, the
number is assumed to be in bytes.
The grace period allows the user or group to exceed the soft limit for a specified period of time (the default
period is one week). If usage is not reduced to a level below the soft limit during that time, the quota
system interprets the soft limit as the hard limit and no further allocation is allowed. The user or group can
reset this condition by reducing usage enough to fall below the soft limit.
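For example, if a user has a disk space soft limit of 100M and a hard limit of 120M, allocation beyond 100M starts the grace period. If usage is still above 100M when the grace period expires, the 100M soft limit is treated as the hard limit and no further allocation is allowed, even though usage is still below 120M.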
Default quotas
Applying default quotas ensures all new users or groups of users of the file system will have minimum
quota limits established. If default quota values for a file system are not enabled, a new user or group has
a quota value of zero which establishes no limit to the amount of space they can use. Default quotas may
be set for a file system only if the file system was created with the -Q yes option on the mmcrfs or
updated with the -Q option on the mmchfs command. Default quotas may then be enabled for the file
system by issuing the mmdefquotaon command and default values established by issuing the
mmdefedquota command.
Disk verification
When you create your file system, you may check to ensure the disks you are specifying do not already
belong to an existing file system by using the -v option on the mmcrfs command. The default is to verify
disk usage. You should only specify no when you want to reuse disks that are no longer needed for an
existing GPFS file system. To determine which disks are no longer in use by any file system, issue the
mmlsgpfsdisk -F command.
Enable DMAPI
Whether or not the file system can be monitored and managed by the GPFS Data Management API
(DMAPI) may be specified at file system creation by using the -z option on the mmcrfs command or
changed at a later time by using the -z option on the mmchfs command. The default is not to enable
DMAPI for the file system. For further information on DMAPI for GPFS, see General Parallel File System
for AIX 5L: AIX Clusters Data Management API Guide.
Mountpoint directory
There is no default mountpoint directory supplied for the file system. You must specify the directory.
Prior to issuing the mmcrfs command you must decide if you will:
1. Create new disks via the mmcrlv command.
2. Select disks previously created by the mmcrlv command, but no longer in use in any file system.
Issue the mmlsgpfsdisk -F command to display the available disks.
3. Use the rewritten disk descriptor file produced by the mmcrlv command or create a new list of disk
descriptors. When using the rewritten file, the Disk Usage and Failure Group specifications will remain
the same as specified on the mmcrlv command.
When issuing the mmcrfs command you may either pass the disk descriptors in a file or provide a list of
disk descriptors to be included. The file eliminates the need for command line entry of these descriptors
using the list of DiskDescs. You may use the rewritten file created by the mmcrlv command, or create
your own file. When using the file rewritten by the mmcrlv command, the Disk Usage and Failure Group
values are preserved. Otherwise, you must specify a new value or accept the default. You can use any
editor to create such a file to save your specifications. When providing a list on the command line, each
descriptor is separated by a semicolon (;) and the entire list must be enclosed in quotation marks (" or ').
The current maximum number of disk descriptors that can be defined for any single file system is 1024.
Each disk descriptor must be specified in the form (second and third fields reserved):
DiskName:::DiskUsage:FailureGroup
DiskName
You must specify the logical volume name. For details on creating a logical volume, see Logical
volume creation considerations on page 11. To use an existing logical volume in the file system,
only the logical volume name need be specified in the disk descriptor. The disk name must be set
up the same on all nodes in the nodeset.
Disk Usage
Specifies what is to be stored on the disk. Specify one or accept the default:
v dataAndMetadata (default)
v dataOnly
v metadataOnly
Failure Group
A number identifying the failure group to which this disk belongs. You can specify any value from
-1 (where -1 indicates that the disk has no point of failure in common with any other disk) to 4000.
If you do not specify a failure group, the value defaults to -1. GPFS uses this information
during data and metadata placement to assure that no two replicas of the same block are written
in such a way as to become unavailable due to a single failure. All disks that are attached to the
same disk adapter should be placed in the same failure group.
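As an illustration of the command-line form described above, a list of two descriptors might look like the following; the logical volume names are hypothetical:

"gpfs1lv:::dataAndMetadata:1;gpfs2lv:::dataAndMetadata:2"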
Other options on the mmcrfs command include whether to automatically mount the file system when the GPFS daemon starts, and the default maximum number of copies of inodes, directories, and indirect blocks for a file.
After a file system is created, the mmlsfs command displays its attributes, for example:

value            description
---------------  ------------------------------------------------------
roundRobin       Stripe method
8192             Minimum fragment size in bytes
512              Inode size in bytes
16384            Indirect block size in bytes
1                Default number of metadata replicas
2                Maximum number of metadata replicas
1                Default number of data replicas
2                Maximum number of data replicas
1048576          Estimated average file size
32               Estimated number of nodes that will mount file system
262144           Block size
none             Quotas enforced
none             Default quotas enabled
33792            Maximum number of inodes
6.00             File system version. Highest supported version: 6.00
no               Is DMAPI enabled?
gpfs33lv         Disks in file system
yes              Automatic mount option
set1             GPFS nodeset identifier
no               Exact mtime default mount option
no               Suppress atime default mount option
none             Additional mount options
#!/usr/bin/ksh
for node in $(cat /tmp/gpfs.allnodes)
do
rsh $node lslpp -l "mmfs.*"
done
If any mmfs filesets exist, you have either one or both of the IBM Multi-Media Server and IBM Video
Charger products installed and you should remove them.
Note: When installing AIX from system images created with the mksysb command, duplicate node ids
may be generated on those nodes. The lsnodeid command (available in /usr/sbin/rsct/bin) has been
provided for you to verify whether or not node ids are duplicated within the cluster. If a duplicate
node id is found, go to the RSCT Resource Monitoring and Control README located at
www.ibm.com/servers/eserver/clusters/library and follow the procedure to generate unique node ids.
1. AIX 5L Version 5.1 with APAR IY33002, or later modifications:
lslpp -l bos.mp
2. For a GPFS cluster type rpd, the following RSCT file sets must be installed on each node in the GPFS
cluster:
lslpp -l rsct*
3. For a GPFS cluster type hacmp, HACMP/ES Version 4 Release 4.1, or later modifications must be
installed on each node in the GPFS cluster:
lslpp -l cluster*
  Fileset                      Level    State      Description
  ----------------------------------------------------------------------------
Path: /usr/lib/objrepos
  cluster.es.client.lib        4.4.1.2  COMMITTED  ES Client Libraries
  cluster.es.client.rte        4.4.1.4  COMMITTED  ES Client Runtime
  cluster.es.client.utils      4.4.1.2  COMMITTED  ES Client Utilities
  cluster.es.clvm.rte          4.4.1.0  COMMITTED  ES for AIX Concurrent Access
  cluster.es.cspoc.cmds        4.4.1.4  COMMITTED  ES CSPOC Commands
  cluster.es.cspoc.dsh         4.4.1.0  COMMITTED  ES CSPOC dsh
  cluster.es.cspoc.rte         4.4.1.2  COMMITTED  ES CSPOC Runtime Commands
  cluster.es.hc.rte            4.4.1.1  COMMITTED  ES HC Daemon
  cluster.es.server.diag       4.4.1.4  COMMITTED  ES Server Diags
  cluster.es.server.events     4.4.1.5  COMMITTED  ES Server Events
  cluster.es.server.rte        4.4.1.5  COMMITTED  ES Base Server Runtime
  cluster.es.server.utils      4.4.1.5  COMMITTED  ES Server Utilities
  cluster.msg.En_US.es.client  4.4.1.0  COMMITTED  ES Client Messages - U.S. English IBM-850
  cluster.msg.En_US.es.server  4.4.1.0  COMMITTED  ES Server Messages - U.S. English IBM-850
  cluster.msg.en_US.es.client  4.4.1.0  COMMITTED  ES Client Messages - U.S. English
  cluster.msg.en_US.es.server  4.4.1.0  COMMITTED  ES Recovery Driver Messages - U.S. English

Path: /etc/objrepos
  cluster.es.client.rte        4.4.1.0  COMMITTED  ES Client Runtime
  cluster.es.clvm.rte          4.4.1.0  COMMITTED  ES for AIX Concurrent Access
  cluster.es.hc.rte            4.4.1.0  COMMITTED  ES HC Daemon
  cluster.es.server.events     4.4.1.0  COMMITTED  ES Server Events
  cluster.es.server.rte        4.4.1.5  COMMITTED  ES Base Server Runtime
  cluster.es.server.utils      4.4.1.0  COMMITTED  ES Server Utilities

Path: /usr/share/lib/objrepos
  cluster.man.en_US.es.data    4.4.1.0  COMMITTED
Installation procedures
Follow these steps to install the GPFS software using the installp command. This procedure installs
GPFS on one node at a time.
Note: The installation procedures are generalized for all levels of GPFS. Ensure you substitute the correct
numeric value for the modification (m) and fix (f) levels, where applicable. The modification and fix
level are dependent upon the level of PTF support.
Then copy the installation images from the CD-ROM to the new directory, using the bffcreate command:
bffcreate -qvX -t /tmp/gpfslpp -d /dev/cd0 all
This will place the following GPFS images in the image directory:
1. mmfs.base.usr.3.5.m.f
2. mmfs.gpfs.usr.2.1.m.f
3. mmfs.msg.en_US.usr.3.5.m.f
4. mmfs.gpfsdocs.data.3.5.m.f
2. Use the inutoc command to create a .toc file. The .toc file is used by the installp command.
inutoc .
If you have previously installed GPFS on your system, during the install process you may see messages
similar to:
Some configuration files could not be automatically merged into the
system during the installation. The previous versions of these files
have been saved in a configuration directory as listed below. Compare
the saved files and the newly installed files to determine if you need
to recover configuration data. Consult product documentation to
determine how to merge the data.
Configuration files which were saved in /lpp/save.config:
/var/mmfs/etc/gpfsready
/var/mmfs/etc/mmfs.cfg
/var/mmfs/etc/mmfsdown.scr
/var/mmfs/etc/mmfsup.scr
If you have made changes to any of these files, you will have to reconcile the differences with the new
versions of the files in directory /var/mmfs/etc. This does not apply to file /var/mmfs/etc/mmfs.cfg which is
automatically maintained by GPFS.
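For example, to compare a saved file with the newly installed version, you might issue a command similar to the following; the assumption here is that the saved copies keep their original path names under /lpp/save.config:

diff /lpp/save.config/var/mmfs/etc/mmfsup.scr /var/mmfs/etc/mmfsup.scr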
1. On the node where the installation images reside (k145n01 in this example), NFS export the image directory /tmp/gpfslpp with read access for all nodes (everyone).
2. On each node, issue a mount command to NFS mount the image directory:
mount k145n01:/tmp/gpfslpp /mnt
3. On the first node in the GPFS nodeset, issue an installp command to install GPFS:
installp -agXYd /tmp/gpfslpp all
4. To install GPFS on the rest of the nodes individually, issue an installp command on each of the
nodes:
installp -agXYd /mnt all
Then, install on each node from its local GPFS installation directory:
installp -agXdY /tmp/gpfslpp all
The lslpp command can be used to verify the installation; output similar to the following indicates the GPFS filesets are committed:

  3.5.0.0  COMMITTED
  2.1.0.0  COMMITTED

Path: /usr/share/lib/objrepos
  mmfs.gpfsdocs.data           3.5.0.0  COMMITTED
Security
When using rcp and rsh for remote communication, a properly configured /.rhosts file must exist in the
root user's home directory on each node in the GPFS cluster. If you have designated the use of a different
remote communication program on either the mmcrcluster or the mmchcluster command, you must ensure:
v Proper authorization is granted to all nodes in the GPFS cluster.
v The nodes in the GPFS cluster can communicate without the use of a password.
If this has not been properly configured, you will get GPFS errors.
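For example, the /.rhosts file on each node might contain one line per node in the GPFS cluster, similar to the following; the node names are illustrative only:

k145n01.kgn.ibm.com root
k145n02.kgn.ibm.com root
k145n03.kgn.ibm.com root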
Topology Services
GPFS requires invariant network connections. An adapter with an invariant address is one that cannot be
used for IP address takeover operations. The adapter must be part of a network with no service addresses
and should not have a standby adapter on the same network. That is, the port on a particular IP address
must be a fixed piece of hardware that is translated to a fixed network adapter and is monitored for failure.
Topology Services should be configured to heartbeat over this invariant address. For information on
configuring Topology Services:
1. For a cluster type of rpd, see the Reliable Scalable Cluster Technology for AIX 5L: RSCT Guide and
Reference
2. For a cluster type of hacmp, see the High Availability Cluster Multi-Processing for AIX: Enhanced
Scalability Installation and Administration Guide.
Communications I/O
The ipqmaxlen network option should be considered when configuring for GPFS. The ipqmaxlen
parameter controls the number of incoming packets that can exist on the IP interrupt queue. The default of
128 is often insufficient. The recommended setting is 512.
no -o ipqmaxlen=512
Since this option must be modified at every reboot, it is suggested it be placed at the end of one of the
system start-up files, such as the /etc/rc.net shell script. For detailed information on the ipqmaxlen
parameter, see the AIX 5L Performance Management Guide.
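For example, you might append lines similar to the following to the end of /etc/rc.net; this is only a sketch, and the test for the no command is simply a safeguard:

# Increase the IP interrupt queue length for GPFS
if [ -x /usr/sbin/no ] ; then
        /usr/sbin/no -o ipqmaxlen=512
fi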
Disk I/O
The disk I/O option to consider when configuring GPFS and using SSA RAID:
max_coalesce
The max_coalesce parameter of the SSA RAID device driver allows the device driver to coalesce
requests which have been broken up to satisfy LVM requirements. This parameter can be critical
when using RAID and is required for effective performance of RAID writes. The recommended
setting is 0x40000 for 4+P RAID.
v To view:
lsattr -E -l hdiskX -a max_coalesce
v To set:
chdev -l hdiskX -a max_coalesce=0x40000
For further information on the max_coalesce parameter see the AIX 5L Technical Reference: Kernel and
Subsystems, Volume 2.
nofiles
Ensure that nofiles, the file descriptor limit in /etc/security/limits, is set to -1 (unlimited) on the Control
Workstation.
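In /etc/security/limits this typically appears as a stanza entry similar to the following sketch, which applies the limit to all users through the default stanza:

default:
        nofiles = -1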
Format             Example
Short hostname     k145n01
Long hostname      k145n01.kgn.ibm.com
IP address         9.119.19.102
3. Using the mmdelnode command, delete the nodes in the test nodeset from the main GPFS nodeset.
See the General Parallel File System for AIX 5L: AIX Clusters Administration and Programming
Reference and search on deleting nodes from a GPFS nodeset.
4. Copy the install images as described in Creating the GPFS directory on page 31. Install the new
code on the nodes in the test nodeset. The install process will not affect your main GPFS nodeset.
See Chapter 3, Installing GPFS, on page 29.
5. Reboot all the nodes in the test nodeset. This is required so the kernel extensions can be replaced.
6. Using the mmconfig command, create the test nodeset. See Nodeset configuration considerations
on page 15.
7. Using the mmcrfs command, create a file system for testing the new level of GPFS (see File system
creation considerations on page 20).
Notes:
a. If you want to use an existing file system, move the file system by issuing the mmchfs command
with the -C option.
b. If the file system was created under your original level of GPFS, you must explicitly migrate the
file system (mmchfs -V) before you can use the new functions in the latest level of GPFS.
Remember you cannot go back once you do this step! Any attempt to mount a migrated file
system on a back-level GPFS system will be rejected with an error.
See the General Parallel File System for AIX 5L: AIX Clusters Administration and Programming
Reference for complete information on the GPFS administration commands.
8. Operate with the new level of code for a while to make sure you want to migrate the rest of the nodes.
If you decide to go back to your original GPFS level, see Reverting to the previous level of GPFS on
page 39.
9. Attention: You cannot go back once you do this step! Any attempt to mount a migrated file system
on a back-level GPFS system will be rejected with an error.
Once you have decided to permanently accept the latest level of GPFS, for each of the file systems
that are in the new nodeset, issue:
mmchfs filesystem -V
You may now exploit the new functions in the GPFS code.
10. When you are ready to migrate the rest of the nodes in the main GPFS nodeset:
a. Follow steps 2, 4, 5, and 9.
b. Either delete the file systems from the test nodeset by issuing the mmdelfs command, or move
them to the main GPFS nodeset by issuing the mmchfs command with the -C option.
c. Delete the nodes from the test nodeset by issuing the mmdelnode command and add them to
the main nodeset by issuing the mmaddnode command.
11. Issue the mmlsfs command to verify that the file system has been upgraded to latest level of GPFS.
For GPFS 2.1 the -V option should indicate a version level of 6.
12. You may now operate with the new level of GPFS code.
3. Install the new code on all nodes. See Chapter 3, Installing GPFS, on page 29.
4. Reboot all nodes. This is required so the kernel extensions can be replaced.
5. Operate with the new level of code for a while to make sure you want to permanently migrate.
If you decide to go back to the previous level of GPFS, see Reverting to the previous level of GPFS.
6. Attention: Remember you cannot go back once you do this step! Any attempt to mount a migrated
file system on a back-level GPFS system will be rejected with an error.
Once you have decided to permanently accept the latest level of GPFS, for each of the file systems,
issue:
mmchfs filesystem -V
7. You may now operate with the new level of GPFS code.
3. If you used a test nodeset for testing the latest level of GPFS, return all the nodes in the test nodeset
to the main nodeset:
a. Delete all file systems in the test nodeset that have the latest version number. Use the mmlsfs -V
command to display the version number of the file system.
b. Either delete or move to the main GPFS nodeset all file systems that are still at the back level of
GPFS.
c. Use the mmdelnode command to delete all nodes from the test nodeset.
d. Use the mmaddnode command to add all of the nodes back into the main GPFS nodeset.
4. Run the deinstall program on each node to remove the GPFS 2.1 level of code.
This program will not remove any customized files:
installp -u mmfs
Coexistence
GPFS file systems and nodesets must follow these coexistence guidelines:
v A GPFS file system may only be accessed from a single nodeset.
v All nodes in a GPFS nodeset must have been defined to the GPFS cluster.
v 32-bit and 64-bit applications may coexist within a GPFS nodeset.
v It is not possible for different levels of GPFS to coexist in the same nodeset. However, it is possible to
run multiple nodesets at different levels of GPFS.
Due to common components shared by GPFS, IBM Multi-Media Server, and IBM Video Charger, the
kernel extensions for GPFS cannot coexist with these products on the same system (see Verify there is
no conflicting software installed on page 29).
| The coexistence of an RSCT Peer Domain and PSSP or HACMP on the same node is not supported. See
| the RSCT Resource Monitoring and Control README located at
| www.ibm.com/servers/eserver/clusters/library.
Compatibility
When operating in a 64-bit environment:
v In order to use 64-bit versions of the GPFS programming interfaces created in an AIX 4.3 environment,
you must recompile your code for use in an AIX 5L environment. All other applications which executed
on the previous release of GPFS will execute on the new level of GPFS.
v GPFS supports interoperability between 32-bit and 64-bit GPFS kernel extensions within a nodeset.
File systems created under the previous release of GPFS may continue to be used under the new level of
GPFS. However, once a GPFS file system has been explicitly changed by issuing the mmchfs command
with the -V option, the disk images can no longer be read by a back-level file system. If you choose to go
back after this command has been issued, you will be required to recreate the file system and restore its
contents from backup media.
File systems created for a PSSP or loose cluster (Linux) environment may not be used in an AIX cluster
environment.
Part 3. Appendixes
2. Controls which regions of disks are allocated to each node, allowing effective parallel allocation of
space.
3. Token management
The token management function resides within the GPFS daemon on each node in the nodeset. For
each mount point, there is a token management server, which is located at the file system manager.
The token management server coordinates access to files on shared disks by granting tokens that
convey the right to read or write the data or metadata of a file. This service ensures the consistency of
the file system data and metadata when different nodes access the same file. The status of each token
is held in two places:
a. On the token management server
b. On the token management client holding the token
The first time a node accesses a file it must send a request to the file system manager to obtain a
corresponding read or write token. After having been granted the token, a node may continue to read
or write to the same file without requiring additional interaction with the file system manager, until an
application on another node attempts to read or write to the same region in the file.
The normal flow for a token is:
v A message to the token management server.
The token management server then either returns a granted token or a list of the nodes which are
holding conflicting tokens.
v The token management function at the requesting node then has the responsibility to communicate
with all nodes holding a conflicting token and get them to relinquish the token.
This relieves the token server of having to deal with all nodes holding conflicting tokens. In order for
a node to relinquish a token, the daemon must give it up. First, the daemon must release any locks
that are held using this token. This may involve waiting for I/O to complete.
4. Quota management
In a quota-enabled file system, the file system manager automatically assumes quota management
responsibilities whenever the GPFS file system is mounted. Quota management involves the allocation
of disk blocks to the other nodes writing to the file system and comparison of the allocated space to
quota limits at regular intervals. In order to reduce the need for frequent space requests from nodes
writing to the file system, more disk blocks are allocated than requested (see Automatic quota
activation on page 23).
5. Security services
GPFS will use the security enabled for the environment in which it is running; see Security on
page 35.
The file system manager is selected by the configuration manager. If a file system manager should fail for
any reason, a new file system manager is selected by the configuration manager and all functions
continue without disruption, except for the time required to accomplish the takeover.
Depending on the application workload, the memory and CPU requirements for the services provided by
the file system manager may make it undesirable to run a resource intensive application on the same
node as the file system manager. GPFS allows you to control the pool of nodes from which the file system
manager is chosen. When configuring your nodeset or adding nodes to your nodeset, you can specify
which nodes are to be made available to this pool of nodes. A node's designation may be changed at
any time by issuing the mmchconfig command. These preferences are honored except in certain failure
situations where multiple failures occur (see the General Parallel File System for AIX 5L: AIX Clusters
Problem Determination Guide and search on multiple file system manager failures). You may list which
node is currently assigned as the file system manager by issuing the mmlsmgr command or change
which node has been assigned to this task via the mmchmgr command.
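For example, assuming a hypothetical file system device named fs1 and a node named k145n03, you might display and then change the file system manager as follows (see the command descriptions for the exact syntax):
mmlsmgr fs1
mmchmgr fs1 k145n03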
The metanode
There is one metanode per open file. The metanode is responsible for maintaining file metadata integrity
(see Metadata). In almost all cases, the node that has had the file open for the longest continuous period
of time is the metanode. All nodes accessing a file can read and write data directly, but updates to
metadata are written only by the metanode. The metanode for each file is independent of that for any
other file and can move to any node to meet application requirements.
Use of disk storage and file structure within a GPFS file system
A file system consists of a set of disks (a stripe group) which are used to store:
v Metadata
v Quota files on page 49
v Log files on page 49
v User data on page 49
This set of disks is listed in a file system descriptor which is at a fixed position on each of the disks in the
stripe group. In addition, the file system descriptor contains information about the state of the file system.
Metadata
Within each file system, files are written to disk as in traditional UNIX file systems, using inodes, indirect
blocks, and data blocks. Inodes and indirect blocks are considered metadata, as distinguished from data,
or actual file content. You can control which disks GPFS uses for storing metadata when you create disk
descriptors at file system creation time.
Each file has an inode containing information such as file size and time of last modification. The inodes of
small files also contain the addresses of all disk blocks that comprise the file data. A large file can use too
many data blocks for an inode to directly address. In such a case, the inode points instead to one or more
levels of indirect blocks that are deep enough to hold all of the data block addresses. This is the
indirection level of the file.
A file starts out with direct pointers to data blocks in the inodes (a zero level of indirection). As the file
increases in size to the point where the inode cannot hold enough direct pointers, the indirection level is
increased by adding an indirect block and moving the direct pointers there. Subsequent levels of indirect
blocks are added as the file grows. This allows file sizes to grow up to the largest supported file system
size.
Notes:
1. The maximum number of file systems that may exist within a GPFS nodeset is 32.
2. The maximum file system size supported by IBM Service is 100TB.
3. The maximum number of files within a file system cannot exceed the architectural limit of 256 million.
4. The maximum indirection level supported by IBM Service is 3.
Using the file system descriptor to find all of the disks which make up the file system's stripe group, and
their size and order, it is possible to address any block in the file system. In particular, it is possible to find
the first inode, which describes the inode file, and a small number of inodes which are the core of the rest
of the file system. The inode file is a collection of fixed length records that represent a single file, directory,
or link. The unit of locking is the single inode because the inode size must be a multiple of the sector size
(the inode size is internally controlled by GPFS). Specifically, there are fixed inodes within the inode file for
the:
v Root directory of the file system
v Block allocation map
v Inode allocation map
The data contents of each of these files are taken from the data space on the disks. These files are
considered metadata and are allocated only on disks where metadata is allowed.
Quota files
For file systems with quotas installed, quota files are created at file system creation. There are two quota
files for a file system:
1. user.quota for users
2. group.quota for groups
For every user who works within the file system, the user.quota file contains a record of limits and current
usage within the file system for the individual user. If default quota limits for new users of a file system
have been established, this file also contains a record for that value.
For every group whose users work within the file system, the group.quota file contains a record of
common limits and the current usage within the file system of all the users in the group. If default quota
limits for new groups of a file system have been established, this file also contains a record for that value.
Quota files are found through a pointer in the file system descriptor. Only the file system manager has
access to the quota files. For backup purposes, quota files are also accessible as regular files in the root
directory of the file system.
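For example, on a node where a hypothetical file system is mounted at /gpfs/fs1, the quota files can be listed like any other files in the root directory of the file system:
ls -l /gpfs/fs1/user.quota /gpfs/fs1/group.quota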
Log files
Log files are created at file system creation. Additional log files may be created if needed. Log files are
always replicated and are found through a pointer in the file system descriptor. The file system manager
assigns a log file to each node accessing the file system.
Logging
GPFS maintains the atomicity of the on-disk structures of a file through a combination of rigid sequencing
of operations and logging. The data structures maintained are the inode, the indirect block, the allocation
map, and the data blocks. Data blocks are written to disk before any control structure that references the
data is written to disk. This ensures that the previous contents of a data block can never be seen in a new
file. Allocation blocks, inodes, and indirect blocks are written and logged in such a way that there will never
be a pointer to a block marked unallocated that is not recoverable from a log.
There are certain failure cases where blocks are marked allocated but not part of a file, and this can be
recovered by running mmfsck on-line or off-line. GPFS always replicates its log. There are two copies of
the log for each executing node. Log recovery is run:
1. As part of the recovery of a node failure affecting the objects that the failed node might have locked.
2. As part of a mount after the file system has been unmounted everywhere.
User data
The remaining space is allocated from the block allocation map as needed and is used for user data and
directories.
The file system manager node requires more daemon memory since token state for the entire file system
is stored there. The daemon memory is used for structures that persist for the execution of a command or
I/O operation, and also for states related to other nodes. File system manager functions use daemon
storage.
Shared segments consist of both pinned and unpinned storage, which is allocated at daemon start-up. The
pinned storage, called the pagepool, is controlled by configuration parameters. In a non-pinned area
of the shared segment, GPFS keeps information about open and recently opened files. This information is
held in two forms:
1. A full inode cache
2. A stat cache
The GPFS administrator controls the size of these caches through the mmconfig and mmchconfig
commands.
The inode cache contains copies of inodes for open files and for some recently used files which are no
longer open. The number of inodes cached is controlled by the maxFilesToCache parameter. The number
of inodes for recently used files is constrained by how much the maxFilesToCache parameter exceeds
the current number of open files in the system. However, you may have open files in excess of the
maxFilesToCache parameter.
The stat cache contains enough information to respond to inquiries about the file and open it, but not
enough information to read from it or write to it. There is sufficient data from the inode to respond to a
stat( ) call (the system call under commands such as ls -l). A stat cache entry consumes about 128 bytes
which is significantly less memory than a full inode. The default value is four times the value of maxFilesToCache. This value
may be changed via the maxStatCache parameter on the mmchconfig command. The stat cache entries
are kept for:
1. Recently accessed files
2. Directories recently accessed by a number of stat( ) calls
GPFS will prefetch data for stat cache entries if a pattern of use indicates this will be productive. Such a
pattern might be a number of ls -l commands issued for a large directory.
Note: Each entry in the inode cache and the stat cache requires appropriate tokens to ensure the cached
information remains correct and the storage of these tokens on the file system manager node.
Depending on the usage pattern, a degradation in performance can occur when the next update of
information on another node requires that the token be revoked.
pagepool is used for the storage of data and metadata in support of I/O operations. With some access
patterns, increasing the amount of pagepool storage may increase I/O performance for file systems with
the following operating characteristics:
v Heavy use of writes that can be overlapped with application execution
v Heavy reuse of files and sequential reads of a size such that prefetch will benefit the application
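As an illustration only, these parameters might be adjusted with a single mmchconfig command similar to the following, where the values and the nodeset name set1 are hypothetical placeholders (see the mmconfig and mmchconfig command descriptions for the supported values, defaults, and whether a daemon restart is required):
mmchconfig maxFilesToCache=1000,maxStatCache=4000,pagepool=80M -C set1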
Component interfaces
The correct operation of GPFS is directly dependent upon:
v Program interfaces
v Socket communications on page 51
Program interfaces
The correct operation of the GPFS file system in a cluster environment depends on a number of other
programs. Specifically, GPFS depends on the correct operation of:
v RSCT
Socket communications
There are several component interfaces that affect GPFS behavior. These are socket communications
between:
v User commands and the daemon
v Instances of daemon code
Socket communications are used to process GPFS administration commands. Commands may be
processed either on the node issuing the command or on the file system manager, depending on the
nature of the command. The actual command processor merely assembles the input parameters and
sends them along to the daemon on the local node using a socket.
If the command changes the state of a file system or its configuration, the command is processed at the
file system manager. The results of the change are sent to all nodes and the status of the command
processing is returned to the node, and eventually, to the process issuing the command. For example, a
command to add a disk to a file system originates on a user process and:
1. Is sent to the daemon and validated.
2. If acceptable, it is forwarded to the file system manager, which updates the file system descriptors.
3. All nodes that have this file system are notified of the need to refresh their cached copies of the file
system descriptor.
4. The return code is forwarded to the originating daemon and then to the originating user process.
Be aware that this chain of communication may allow faults related to the processing of a command to
occur on nodes other than the node on which the command was issued.
The daemon also uses sockets to communicate with other instances of the file system on other nodes.
Specifically, the daemon on each node communicates with the file system manager for allocation of logs,
allocation segments, and quotas, as well as for various recovery and configuration flows. GPFS requires
an active internode communications path between all nodes in a nodeset for locking, metadata
coordination, administration commands, and other internal functions. The existence of this path is
necessary for the correct operation of GPFS. The instance of the GPFS daemon on a node will go down if
it senses that this communication is not available to it. If communication is not available to another node,
one of the two nodes will exit GPFS.
Initialization
GPFS initialization can be done automatically as part of the node start-up sequence, or manually using the
mmstartup command to start the daemon. The daemon start-up process loads the necessary kernel
extensions, if they have not been previously loaded by an earlier instance of the daemon subsequent to
the current IPL of this node. The initialization sequence then waits for the configuration manager to declare
that a quorum exists. If Group Services reports that this node is the first to join the GPFS group, this node
becomes the configuration manager. When quorum is achieved, the configuration manager changes the
state of the group from initializing to active using Group Services interfaces. This transition is evident in a
message to the GPFS console file (/var/adm/ras/mmfs.log.latest).
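For example, to start the daemon manually on a node and watch the initialization messages, you might issue (illustration only):
mmstartup
tail -f /var/adm/ras/mmfs.log.latest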
The initialization sequence also awaits membership in the Group Services adapter membership group, if
not already established. Note that Group Services will queue the request to join these groups if a previous
failure is still being recovered, which will delay initialization. This is crucial if the failure being recovered is
a failure of this node. Completion of the group join means that all necessary failure recovery is complete.
Initializing GPFS in an AIX cluster environment: The initialization sequence also awaits membership in
the GPFS adapters Group Services group, if not already established. Note that Group Services will queue
the request to join this group if a previous failure is still being recovered, which will delay initialization. This
is crucial if the failure being recovered is a failure of this node. Completion of the group join means that all
necessary failure recovery is complete.
When this state change from initializing to active has occurred, the daemon is ready to accept mount
requests.
mount
GPFS file systems are mounted using the mount command, which builds the structures that serve as the
path to the data. GPFS mount processing is performed on both the node requesting the mount and the
file system manager node. If there is no file system manager, a call is made to the configuration manager,
which appoints one. The file system manager will ensure that the file system is ready to be mounted. This
includes checking that no conflicting utilities, such as mmfsck or mmcheckquota, are being run, and
running any necessary log processing to ensure that metadata on the file system is
consistent.
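For example, assuming a stanza in /etc/filesystems with a mount point of /gpfs/fs1 (a hypothetical name), the file system is mounted with the standard AIX command:
mount /gpfs/fs1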
On the local node the control structures required for a mounted file system are initialized and the token
management function domains are created. In addition, paths to each of the disks which make up the file
system are opened. Part of mount processing involves unfencing the disks, which may be necessary if this
node had previously failed. This is done automatically without user intervention except in the rare case of
a two-node nodeset using single-node quorum (see the General Parallel File System for AIX 5L: AIX
Clusters Problem Determination Guide and search on single-node quorum). If insufficient disks are up, the
mount will fail. That is, in a replicated system if two disks are down in different failure groups, the mount
will fail. In a non-replicated system, one disk down will cause the mount to fail.
Note: There is a maximum of 32 file systems that may exist within a GPFS nodeset.
open
The open of a GPFS file involves the application making a call to the operating system specifying the
name of the file. Processing of an open involves two stages:
1. The directory processing required to identify the file specified by the application.
2. The building of the required data structures based on the inode.
The kernel extension code will process the directory search for those directories which reside in GPFS
(part of the path to the file may be directories in other physical file systems). If the required information is
not in memory, the daemon will be called to acquire the necessary tokens for the directory or part of the
directory needed to resolve the lookup. It will also read the directory entry into memory.
The lookup process occurs one directory at a time in response to calls from the operating system. In the
final stage of open, the inode for the file is read from disk and connected to the operating system vnode
structure. This requires acquiring locks on the inode, as well as a lock that indicates the presence to the
metanode:
v If no other node has this file open, this node becomes the metanode
v If another node has a previous open, then that node is the metanode and this node will interface with
the metanode for certain parallel write situations
v If the open involves creation of a new file, the appropriate locks are obtained on the parent directory
and the inode allocation file block. The directory entry is created, an inode is selected and initialized and
then open processing is completed.
read
The GPFS read function is invoked in response to a read system call and a call through the operating
system vnode interface to GPFS. read processing falls into three levels of complexity based on system
activity and status:
1. Buffer available in memory
2. Tokens available locally but data must be read
3. Data and tokens must be acquired
Buffer and locks available in memory: The simplest read operation occurs when the data is already
available in memory, either because it has been prefetched or because it has been read recently by
another read call. In either case, the buffer is locally locked and the data is copied to the application data
area. The lock is released when the copy is complete. Note that no token communication is required
because possession of the buffer implies that the node at least holds a read token that covers the buffer. After
the copying, prefetch is initiated if appropriate.
Tokens available locally but data must be read: The second, more complex, type of read operation is
necessary when the data is not in memory. This occurs under three conditions:
1. The token has been acquired on a previous read that found no contention.
2. The buffer has been stolen for other uses.
3. On some random read operations.
In the first of a series of random reads the token will not be available locally, but in the second read it
might be available.
In such situations, the buffer is not found and must be read. No token activity has occurred because the
node has a sufficiently strong token to lock the required region of the file locally. A message is sent to the
daemon, which is handled on one of the waiting daemon threads. The daemon allocates a buffer, locks the
file range that is required so the token cannot be stolen for the duration of the I/O, and initiates the I/O to
the device holding the data. The originating thread waits for this to complete and is posted by the daemon
upon completion.
Data and tokens must be acquired: The third, and most complex read operation requires that tokens
as well as data be acquired on the application node. The kernel code determines that the data is not
available locally and sends the message to the daemon waiting after posting the message. The daemon
thread determines that it does not have the required tokens to perform the operation. In that case, a token
acquire request is sent to the token management server. The requested token specifies a required length
of that range of the file, which is needed for this buffer. If the file is being accessed sequentially, a desired
range of data, starting at this point of this read and extending to the end of the file, is specified. In the
event that no conflicts exist, the desired range will be granted, eliminating the need for token calls on
subsequent reads. After the minimum token needed is acquired, the flow proceeds as described in Token
management on page 46.
At the completion of a read, a determination of the need for prefetch is made. GPFS computes a desired
read-ahead for each open file based on the performance of the disks and the rate at which the application
is reading data. If additional prefetch is needed, a message is sent to the daemon that will process it
asynchronously with the completion of the current read.
write
write processing is initiated by a system call to the operating system, which calls GPFS when the write
involves data in a GPFS file system.
Like many open systems file systems, GPFS moves data from a user buffer into a file system buffer
synchronously with the application write call, but defers the actual write to disk. This technique allows
better scheduling of the disk and improved performance. The file system buffers come from the memory
allocated by the pagepool parameter in the mmconfig or mmchconfig command. Increasing this value
may allow more writes to be deferred, which improves performance in certain workloads.
A block of data is scheduled to be written to a disk when one of several conditions is met, for example,
when the application synchronizes the file to disk or when the buffer is needed to hold other data.
stat
The stat( ) system call returns data on the size and parameters associated with a file. The call is issued
by the ls -l command and other similar functions. The data required to satisfy the stat( ) system call is
contained in the inode. GPFS processing of the stat( ) system call differs from other file systems in that it
supports handling of stat( ) calls on all nodes without funneling the calls through a server.
This requires that GPFS obtain tokens which protect the accuracy of the metadata. In order to maximize
parallelism, GPFS locks inodes individually and fetches individual inodes. In cases where a pattern can be
detected, such as an attempt to stat( ) all of the files in a large directory, inodes will be fetched in parallel
in anticipation of their use.
Inodes are cached within GPFS in two forms:
1. Full inode
2. Limited stat cache form
The full inode is required to perform data I/O against the file.
The stat cache form is smaller than the full inode, but is sufficient to open the file and satisfy a stat( ) call.
It is intended to aid functions such as ls -l, du, and certain backup programs which scan entire directories
looking for modification times and file sizes.
These caches and the requirement for individual tokens on inodes are the reason why a second invocation
of directory scanning applications may execute faster than the first.
mmfsck -o -n scans the file system to determine if correction might be useful. The on-line version of
mmfsck runs on the file system manager and scans all inodes and indirect blocks looking for disk blocks
which are allocated but not used. If authorized to repair the file system, it releases the blocks. If not
authorized to repair the file system, it reports the condition to standard output on the invoking node.
The off-line version of mmfsck is the last line of defense for a file system that cannot be used. It will most
often be needed in the case where log files are not available because of disk media failures. mmfsck runs
on the file system manager and reports status to the invoking node. It is mutually incompatible with any
other use of the file system and checks for any running commands or any nodes with the file system
mounted. It exits if any are found. It also exits if any disks are down and require the use of mmchdisk to
change them to up or recovering. mmfsck performs a full file system scan looking for metadata
inconsistencies. This process can be lengthy on large file systems. It seeks permission from the user to
repair any problems that are found which may result in the removal of files or directories that are corrupt.
The processing of this command is similar to those for other file systems.
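For example, with a hypothetical device name of fs1, the on-line check could be run without repairing anything, and the off-line check could be run after the file system has been unmounted everywhere (see the command description for the full set of options):
mmfsck fs1 -o -n
mmfsck fs1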
Recovery
In order to understand the GPFS recovery process, you need to be familiar with Group Services. In
particular, it should be noted that only one state change, such as the loss or initialization of a node, can be
processed at a time and subsequent changes will be queued. This means that the entire failure processing
must complete before the failed node can join the group again. Group Services also processes all failures
first, which means that GPFS will handle all failures prior to completing any recovery.
GPFS uses two groups to process failures of nodes or of the GPFS daemon on other nodes. The primary group is
used to process failures. The secondary group is used to restart a failure protocol if a second failure is
reported. This may be an actual second failure, or one that occurred at the same time as the first but was
detected by Group Services later. The only function of the second group is to abort the current protocol
and restart a new one processing all known failures.
GPFS recovers from node failure using notifications provided by Group Services. When notified that a
node has failed or that the GPFS daemon has failed on a node, GPFS invokes recovery for each of the
file systems that were mounted on the failed node. If necessary, a new configuration manager is selected
prior to the start of actual recovery, or new file system managers are selected for any file systems that no
longer have one, or both. This processing occurs as the first phase of recovery and occurs on the
configuration manager. This processing must complete before other processing can be attempted and is
enforced using Group Services barriers.
The file system manager for each file system fences the failed node from the disks comprising the file
system. If the file system manager is newly appointed as a result of this failure, it rebuilds token state by
querying the other nodes of the group. This file system manager recovery phase is also protected by a
Group Services barrier. After this is complete, the actual recovery of the log of the failed node proceeds.
This recovery will rebuild the metadata that was being modified at the time of the failure to a consistent
state with the possible exception that blocks may be allocated that are not part of any file and are
effectively lost until mmfsck is run, on-line or off-line. After log recovery is complete, the locks held by the
failed nodes are released for this file system. Completion of this activity for all file systems completes the
failure processing. The completion of the protocol allows a failed node to rejoin the cluster. GPFS will
unfence the failed node after it has rejoined the group.
The GPFS cluster data information is stored in the file /var/mmfs/gen/mmsdrfs. This file is stored on the
nodes designated as the primary GPFS cluster data server and, if specified, the secondary GPFS cluster
data server (see GPFS cluster data servers on page 14).
Based on the information in the GPFS cluster data, the GPFS commands generate and maintain a number
of system files on each of the nodes in the GPFS cluster. These files are:
/etc/cluster.nodes
Contains a list of all nodes that belong to the local nodeset.
/etc/filesystems
Contains lists for all GPFS file systems that exist in the nodeset.
/var/mmfs/gen/mmsdrfs
Contains a local copy of the mmsdrfs file found on the primary and secondary GPFS cluster data
server nodes.
/var/mmfs/etc/mmfs.cfg
Contains GPFS daemon startup parameters.
/var/mmfs/etc/cluster.preferences
Contains a list of the nodes designated as file system manager nodes.
The master copy of all GPFS configuration information is kept in the file mmsdrfs on the primary GPFS
cluster data server node. The layout of this file is defined in /usr/lpp/mmfs/bin/mmsdrsdef. The first
record in the mmsdrfs file contains a generation number. Whenever a GPFS command causes something
to change in any of the nodesets or any of the file systems, this change is reflected in the mmsdrfs file
and the generation number is incremented. The latest generation number is always recorded in the
mmsdrfs file on the primary and secondary GPFS cluster data server nodes.
When running GPFS administration commands in a GPFS cluster, it is necessary for the GPFS cluster
data to be accessible to the node running the command. Commands that update the mmsdrfs file require
that both the primary and secondary GPFS cluster data server nodes are accessible. Similarly, when the
GPFS daemon starts up, at least one of the two server nodes must be accessible.
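For example, you can verify which nodes are currently designated as the primary and secondary GPFS cluster data servers, and which therefore must be accessible, by issuing:
mmlscluster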
Application support
Applications access GPFS data through the use of standard AIX 5L system calls and libraries. Support for
larger files is provided through the use of AIX 5L 64-bit forms of these libraries. See the AIX 5L product
documentation at www.ibm.com/servers/aix/library/techpubs.html for details.
Format              Example
Short hostname      k145n01
Long hostname       k145n01.kgn.ibm.com
IP address          9.119.19.102
4. For a cluster type of hacmp, any node to be included in your GPFS cluster must be a properly
configured node in an existing HACMP cluster.
For further information, see the High Availability Cluster Multi-Processing for AIX: Enhanced Scalability
Installation and Administration Guide.
5. For a cluster type of rpd, any node to be included in your GPFS cluster must be a properly configured
node in an existing RSCT peer domain.
For further information, see the Reliable Scalable Cluster Technology for AIX 5L: RSCT Guide and
Reference.
6. Nodes specified in the NodeFile which are not available when the mmcrcluster command is issued
must be added to the cluster by issuing the mmaddcluster command.
7. You must have root authority to run the mmcrcluster command.
8. The mmcrcluster command will only be successful if the primary server and, if specified, the
secondary server are available.
9. The authentication method between nodes in the GPFS cluster must be established when the
mmcrcluster command is issued:
a. When using rcp and rsh for remote communication, a properly configured /.rhosts file must exist
in the root user's home directory on each node in the GPFS cluster.
b. If you have designated the use of a different remote communication program on either the
mmcrcluster or the mmchcluster command, you must ensure:
1) Proper authorization is granted to all nodes in the GPFS cluster.
2) The nodes in the GPFS cluster can communicate without the use of a password.
The remote copy and remote shell command must adhere to the same syntax form as rcp and rsh but
may implement an alternate authentication mechanism.
Format              Example
Short hostname      k145n01
Long hostname       k145n01.kgn.ibm.com
IP address          9.119.19.102
Starting GPFS
These restrictions apply to starting GPFS:
1. DO NOT start GPFS until it is configured.
2. Quorum must be met in order to successfully start GPFS.
3. You must have root authority to issue the mmstartup command.
4. When using rcp and rsh for remote communication, a properly configured /.rhosts file must exist in
the root user's home directory on each node in the GPFS cluster. If you have designated the use of a
different remote communication program on either the mmcrcluster or the mmchcluster command,
you must ensure:
a. Proper authorization is granted to all nodes in the GPFS cluster.
b. The nodes in the GPFS cluster can communicate without the use of a password.
5. You may issue the mmstartup command from any node in the GPFS cluster.
4. The PrimaryServer and, if specified, the SecondaryServer must be available for the mmaddcluster,
mmdelcluster, and mmlscluster commands to be successful.
5. The mmchcluster command, when issued with either the -p or -s option, is designed to operate in an
environment where the current PrimaryServer and, if specified, the SecondaryServer are not available.
When specified with any other options, the servers must be available for the command to be
successful.
6. A node being deleted cannot be the primary or secondary GPFS cluster data server unless you intend
to delete the entire cluster. Verify this by issuing the mmlscluster command. If a node to be deleted is
one of the servers and you intend to keep the cluster, issue the mmchcluster command to assign
another node as the server before deleting the node.
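For example, before deleting a node that is currently the primary GPFS cluster data server, you might first move that role to another node (k145n03 is a hypothetical name; see the mmchcluster command description for the exact syntax):
mmchcluster -p k145n03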
v Have bad-block relocation automatically turned off. Accessing disks concurrently from multiple
systems with LVM bad-block relocation enabled could potentially cause conflicting assignments. As a
result, software bad-block relocation is turned off, allowing the hardware bad-block relocation
supplied by your disk vendor to provide protection against disk media errors.
When creating a logical volume, you must have write access to where the disk descriptor file is
located.
5. When using rcp and rsh for remote communication, a properly configured /.rhosts file must exist in
the root user's home directory on each node in the GPFS cluster. If you have designated the use of a
different remote communication program on either the mmcrcluster or the mmchcluster command,
you must ensure:
a. Proper authorization is granted to all nodes in the GPFS cluster.
b. The nodes in the GPFS cluster can communicate without the use of a password.
The remote copy and remote shell command must adhere to the same syntax form as rcp and rsh
but may implement an alternate authentication mechanism.
6. In order to run mmfsck off-line to repair a file system, you must unmount your file system.
7. When replacing quota files with either the -u or the -g option on the mmcheckquota command:
v The quota files must be in the root directory of the file system.
v The file system must be unmounted.
8. Multi-node quorum must be maintained when adding or deleting nodes from your GPFS nodeset.
9. You must unmount the file system on all nodes before deleting it.
10. You must unmount a file system on all nodes before moving it to a different nodeset.
11. When issuing mmchfs to enable DMAPI, the file system cannot be in use.
Commands may be run from various locations within your system configuration. Use this information to
ensure the command is being issued from an appropriate location and is using the correct syntax (see the
individual commands for specific rules regarding the use of that command):
1. Commands which may be issued from any node in the GPFS cluster running GPFS:
Note: If the command is intended to run on a nodeset other than the one you are on, you must
specify the nodeset using the -C option.
v mmaddnode
v mmchconfig
v mmcrfs
v mmstartup
v mmshutdown
2. Commands which require that Device be the first operand and may be issued from any node in the
GPFS cluster running GPFS:
v mmadddisk
v mmchdisk
v mmchfs
v mmchmgr
v mmdefragfs
v mmdeldisk
v mmdelfs
v mmdf
v mmfsck
v mmlsdisk
v mmlsfs
v mmlsmgr
Either Device or NodesetId must be specified.
v mmrestripefs
v mmrpldisk
3. Commands which require GPFS to be running on the node from which the command is issued:
v mmcheckquota
v mmdefedquota
v mmdefquotaoff
v mmdefquotaon
v mmedquota
v mmlsquota
v mmquotaoff
v mmquotaon
v mmrepquota
4. Commands which require the file system be mounted on the GPFS nodeset from which the command
is issued:
v mmchattr
v mmdelacl
v mmeditacl
v mmgetacl
v mmlsattr
v mmputacl
5. Commands which may be issued from any node in the GPFS cluster where GPFS is installed:
v mmaddcluster
v mmchcluster
v mmconfig
v mmcrcluster
v mmcrlv
v mmdelcluster
v mmdellv
v mmdelnode
v mmlscluster
v mmlsconfig
v mmlsgpfsdisk
v mmlsnode
v mmstartup
3. You cannot run mmfsck on a file system that has disks in a down state.
4. A disk remains suspended until it is explicitly resumed. Restarting GPFS or rebooting the nodes does
not restore normal access to a suspended disk.
5. A disk remains down until it is explicitly started. Restarting GPFS or rebooting the nodes does not
restore normal access to a down disk.
6. Only logical volumes created by the mmcrlv command may be used. This ensures:
a. GPFS will exploit SCSI-3 persistent reserve if the disk supports it.
b. Bad-block relocation is automatically turned off. Accessing disks concurrently from multiple systems
with LVM bad-block relocation enabled could potentially cause conflicting assignments. Turning
off software bad-block relocation allows the hardware bad-block relocation supplied by your disk
vendor to provide protection against disk media errors.
7. When creating a logical volume by issuing the mmcrlv command, you must have write access to the
disk descriptor file.
8. When referencing a disk, you must use the logical volume name.
9. All disks or disk arrays must be directly attached to all nodes in the nodeset.
10. You cannot protect your file system against disk failure by mirroring data at the LVM level. You must
use replication or RAID devices to protect your data (see Recoverability considerations).
11. Single-node quorum is only supported when disk leasing is not in effect. Disk leasing is activated if
any disk in any filesystem in the nodeset is not using SSA fencing or SCSI-3 persistent reserve.
12. Before deleting a disk, use the mmdf command to determine whether there is enough free space on
the remaining disks to store the file system.
13. Disk accounting is not provided at the present time.
14. After migrating to a new level of GPFS, before you can use an existing logical volume, which was not
part of any GPFS file system at the time of migration, you must:
a. Export the logical volume
b. Recreate the logical volume
c. Add the logical volume to a file system
7. A file in data shipping mode cannot be written through any file handle that was not associated with
the data shipping collective through a gpfsDataShipStart_t directive.
8. Calls that are not allowed on a file that has data shipping enabled:
v chacl
v fchacl
v chmod
v fchmod
v chown
v fchown
v chownx
v fchownx
v link
9. The gpfsDataShipStart_t directive can only be cancelled by a gpfsDataShipStop_t directive.
10. For the gpfsDataShipMap_t directive, the value of partitionSize must be a multiple of the number of
bytes in a single file system block.
System configuration
GPFS requires invariant network connections. The port on a particular IP address must be a fixed piece of
hardware that is translated to a fixed network adapter and is monitored for failure. Topology Services
should be configured to heartbeat over this invariant address. In an HACMP environment, see the High
Availability Cluster Multi-Processing for AIX: Enhanced Scalability Installation and Administration Guide
and search on The Topology Services Subsystem. In an RSCT peer domain environment, see the Reliable
Scalable Cluster Technology for AIX 5L: RSCT Guide and Reference and search on The Topology
Services Subsystem.
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries.
Consult your local IBM representative for information on the products and services currently available in
your area. Any reference to an IBM product, program, or service is not intended to state or imply that only
IBM's product, program, or service may be used. Any functionally equivalent product, program, or service
that does not infringe any of IBM's intellectual property rights may be used instead. However, it is the
user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not give you any license to these patents. You can send license
inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10594-1785
USA
For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property
Department in your country or send inquiries, in writing, to:
IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106, Japan
The following paragraph does not apply to the United Kingdom or any other country where such provisions
are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION AS IS
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically
made to the information herein; these changes will be incorporated in new editions of the publication. IBM
may make improvements and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in
any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of
the materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without
incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the
exchange of information between independently created programs and other programs (including this one)
and (ii) the mutual use of the information which has been exchanged, should contact:
IBM Corporation
Intellectual Property Law
2455 South Road, P386
Poughkeepsie, NY 12601-5400
USA
Such information may be available, subject to appropriate terms and conditions, including in some cases,
payment or a fee.
The licensed program described in this document and all licensed material available for it are provided by
IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any
equivalent agreement between us.
This information contains examples of data and reports used in daily business operations. To illustrate
them as completely as possible, the examples include the names of individuals, companies, brands, and
products. All of these names are fictitious and any similarity to the names and addresses used by an
actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the operating platform for
which the sample programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs. You may copy, modify, and distribute these sample programs in any form without payment to
IBM for the purposes of developing, using, marketing, or distributing application programs conforming to
IBM's application programming interfaces.
If you are viewing this information softcopy, the photographs and color illustrations may not appear.
Trademarks
The following terms are trademarks of the International Business Machines Corporation in the United
States or other countries or both:
v AFS
v AIX
v AIX 5L
v Eserver
v Enterprise Storage Server
v IBM
v IBMLink
v Netfinity
v pSeries
v SP
v TotalStorage
v xSeries
The Open Group is a trademark of The Open Group.
Linux is a registered trademark of Linus Torvalds.
Network File System is a trademark of Sun MicroSystems, Inc.
NFS is a registered trademark of Sun Microsystems, Inc.
ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel Corporation in the United
States, other countries, or both.
Sun is a trademark of Sun MicroSystems, Inc.
UNIX is a registered trademark of the Open Group in the United States and other countries.
Other company, product, and service names may be the trademarks or service marks of others.
Glossary
A
B
block utilization. The measurement of the percentage
of used subblocks per allocated blocks.
C
cluster. A loosely-coupled collection of independent
systems (nodes) organized into a network for the
purpose of sharing resources and communicating with
each other (see GPFS cluster on page 74).
configuration manager. The GPFS node that selects
file system managers and determines whether quorum
exists. The oldest continuously operating node in the file
system group, as monitored by Group Services, is
automatically assigned as the configuration manager.
control data structures. Data structures needed to
manage file data and metadata cached in memory. This
includes hash tables and link pointers for finding cached
data, lock states and tokens to implement distributed
locking, as well as various flags and sequence numbers
to keep track of updates to the cached data.
D
Data Management API. The interface defined by the
Open Group's XDSM standard as described in the
publication System Management: Data Storage
Management (XDSM) API Common Application
Environment (CAE) Specification C429, The Open
Group ISBN 1-85912-190-X.
disk descriptor. A disk descriptor defines how a disk
is to be used within a GPFS file system. Each
descriptor must be in the form (second and third fields
reserved):
DiskName:::DiskUsage:FailureGroup
Where DiskName is the name of the disk. This must be
the logical volume name. DiskUsage tells
GPFS whether data, metadata, or both are to be stored
on the disk. The FailureGroup designation indicates to
GPFS where not to place replicas of data and
metadata. All disks with a common point of failure
should belong to the same failure group. Since GPFS
does not place replicated information on disks in the
E
event. A message from a file operation to a data
management application about the action being
performed on the file or file system. There are several
types of events, each used for a different type of action.
The event is delivered to a session according to the
event disposition.
F
failover. The assuming of server responsibilities by the
node designated as backup server, when the primary
server fails.
failure group. A collection of disks that share common
access paths or adaptor connection, and could all
become unavailable through a single hardware failure.
v Adding disks
v Changing disk availability
v Repairing the file system
2. Controls which regions of disks are allocated to
each node, allowing effective parallel allocation of
space.
3. Controls token management.
4. Controls quota management.
fragment. The space allocated for an amount of data
(usually at the end of a file) too small to require a full
block, consisting of one or more subblocks (one
thirty-second of block size).
G
GPFS cluster. A subset of existing cluster nodes
defined as being available for use by GPFS file
systems. The GPFS cluster is created via the
mmcrcluster command. GPFS nodesets and file
systems are subsequently created after the
mmcrcluster command has been issued.
GPFS cluster data. The GPFS configuration data,
which is stored on the primary and secondary GPFS
cluster data servers as defined on the mmcrcluster
command.
GPFS portability layer. The interface to the GPFS
for Linux proprietary code is an open source module
which each installation must build for its specific
hardware platform and Linux distribution. See
www.ibm.com/servers/eserver/clusters/software/.
H
HACMP environment. The operation of GPFS based
on the High Availability Cluster Multi-Processing for
AIX/Enhanced Scalability (HACMP/ES) program
product. This environment is defined on the
mmcrcluster command by specifying a cluster type of
hacmp.
I
IBM Virtual Shared Disk. The component of PSSP
that allows application programs executing on different
nodes to access a raw logical volume as if it were local at
each node. In actuality, the logical volume is local at
only one of the nodes (the server node).
inode. The internal structure that describes an
individual file. An inode contains file size and update information.
K
Kernel Low-Level Application Programming
Interface (KLAPI). KLAPI provides reliable transport
services to kernel subsystems that have communication
over the SP Switch.
L
logical volume. A collection of physical partitions
organized into logical partitions all contained in a single
volume group. Logical volumes are expandable and can
span several physical volumes in a volume group.
Logical Volume Manager (LVM). Manages disk space
at a logical level. It controls fixed-disk resources by
mapping data between logical and physical storage,
allowing data to be discontiguous, to span multiple disks,
and to be replicated and dynamically expanded.
loose cluster environment. The operation of GPFS
based on the Linux operating system. This environment
is defined on the mmcrcluster command by specifying
a cluster type of lc.
M
management domain. A set of nodes configured for
manageability by the Cluster Systems Management
(CSM) product. Such a domain has a management
server that is used to administer a number of managed
nodes. Only management servers have knowledge of
the whole domain. Managed nodes only know about the
servers managing them; they know nothing of each
other. Contrast with peer domain on page 75.
metadata. Data structures that contain access
information about file data. These might include inodes,
indirect blocks, and directories. These data structures
are used by GPFS but are not accessible to user
applications.
metanode. There is one metanode per open file. The
metanode is responsible for maintaining file metadata
integrity. In almost all cases, the node that has had the
file open for the longest period of continuous time is the
metanode.
mirroring. The creation of a mirror image of data to be
preserved in the event of disk failure.
N
Network File System (NFS). A distributed file system
that allows users to access files and directories located
on remote computers and treat those files and
directories as if they were local. NFS allows different
systems (UNIX or non-UNIX), different architectures, or
vendors connected to the same network, to access
remote files in a LAN environment as though they were
local files.
node descriptor. A node descriptor defines how a
node is to be used within GPFS.
In a Linux environment, each descriptor for a GPFS
cluster must be in the form:
primaryNetworkNodeName::secondaryNetworkNodeName
primaryNetworkNodeName
The host name of the node on the primary
network for GPFS daemon to daemon
communication.
designation
Currently unused and specified by the double
colon ::
secondaryNetworkNodeName
The host name of the node on the secondary
network, if one exists.
You may configure a secondary network node
name in order to prevent the node from
appearing to have gone down when the
network is merely saturated. During times of
excessive network traffic if a second network is
not specified, there is the potential for the
RSCT component to be unable to
communicate with the node over the primary
network. RSCT would perceive the node as
having failed and inform GPFS to perform node
recovery.
In all environments, each descriptor for a GPFS nodeset
must be in the form:
NodeName[:manager|client]
Where NodeName is the hostname or IP address of the
adapter to be used for GPFS daemon communications.
The optional designation specifies whether or not the
node should be included in the pool of nodes from
which the file system manager is chosen. The default is
not to have the node included in the pool.
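For illustration only, the hypothetical descriptors below follow the two
forms given above; the host names are placeholders, not real systems.
A GPFS cluster descriptor (Linux environment) naming a secondary network node:
   node01.example.com::node01b.example.com
GPFS nodeset descriptors, with and without the optional designation:
   node01.example.com:manager
   node02.example.com:client
   node03.example.com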
node number. GPFS references node numbers in an
environment-specific manner. In an RSCT peer domain
P
peer domain. A set of nodes configured for high
availability by the RSCT configuration manager. Such a
domain has no distinguished or master node. All nodes
are aware of all other nodes, and administrative
commands can be issued from any node in the domain.
All nodes also have a consistent view of the domain
membership. Contrast with management domain on
page 74.
persistent reserve. Persistent reserve is a capability
of the ANSI SCSI-3 architecture for interfacing with
storage devices. Specifically, persistent reserve provides
control of access from multiple host systems which is
useful in recovery situations. To access a storage
device which is configured to use persistent reserve, a
host must register using a unique key. In the event of a
perceived failure, another host system may preempt that
access using that unique key, which results in the
storage device not honoring read or write attempts from
the preempted system until it has re-registered.
Software conventions in GPFS allow a preempted
system to re-register only after the recovery situation
has been addressed. Contrast with
disk leasing on page 73.
primary GPFS cluster data server. In a GPFS
cluster, this refers to the primary GPFS cluster data
server node for the GPFS configuration data.
PSSP cluster environment. The operation of GPFS
based on the PSSP and IBM Virtual Shared Disk
program products.
Q
quorum. The minimum number of nodes that must be
running in order for the GPFS daemon to start.
For all nodesets consisting of three or more nodes, the
multi-node quorum algorithm applies, defining quorum as
one plus half of the number of nodes in the GPFS
nodeset.
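As an illustration only (the node count is arbitrary), in a nodeset of eight
nodes quorum is 1 + 8/2 = 5, so at least five of the eight nodes must be
running before the GPFS daemon starts.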
S
SSA. Serial Storage Architecture. An expanded
storage adapter for multi-processor data sharing in
UNIX-based computing, allowing disk connection in a
high-speed loop.
SCSI. Small Computer Systems Interface. An adapter
supporting attachment of various direct-access storage
devices.
secondary GPFS cluster data server. In a GPFS
cluster, this refers to the backup server node for the
GPFS configuration data (see GPFS cluster data on
page 74).
session failure. The loss of all resources of a data
management session due to the failure of the GPFS
daemon on the session node.
session node. The node on which a data
management session was created.
single-node quorum. In a two node nodeset, use of
the single-node quorum algorithm allows the GPFS
daemon to continue operating in the event only one
node is available. Use of this quorum algorithm is not
valid if more than two nodes have been defined in the
nodeset.
V
virtual file system (VFS). A remote file system that
has been mounted so that it is accessible to the local
user. The virtual file system is an abstraction of a
physical file system implementation. It provides a
consistent interface to multiple file systems, both local
and remote. This consistent interface allows the user to
view the directory tree on the running system as a
single entity even when the tree is made up of a
number of diverse file system types.
virtual shared disk. See IBM Virtual Shared Disk on
page 74.
virtual node (vnode). The structure that contains
information about a file system object in a virtual file
system.
Bibliography
This bibliography contains references for:
v GPFS publications
v AIX publications
v RSCT publications
v HACMP/ES publications
v IBM Subsystem Device Driver, IBM 2105 Enterprise Storage Server, and Fibre Channel
v IBM Redbooks
v Non-IBM publications that discuss parallel computing and other topics related to GPFS
All IBM publications are also available from the IBM Publications Center at
www.ibm.com/shop/publications/order
GPFS publications
You may download, view, search, and print the supporting documentation for the GPFS program product in
the following ways:
1. In PDF format:
v On the World Wide Web at www.ibm.com/servers/eserver/pseries/library/gpfs.html
v From the IBM Publications Center at www.ibm.com/shop/publications/order
2. In HTML format at publib.boulder.ibm.com/clresctr/docs/gpfs/html
To view the GPFS PDF publications, you need access to the Adobe Acrobat Reader. The Acrobat Reader
is shipped with the AIX 5L Bonus Pack and is also freely available for downloading from the Adobe web
site at www.adobe.com. Since the GPFS documentation contains cross-book links, if you choose to
download the PDF files they should all be placed in the same directory and the files should not be
renamed.
To view the GPFS HTML publications, you need access to an HTML document browser such as Netscape.
An index file into the HTML files (aix_index.html) is provided when downloading the tar file of the GPFS
HTML publications. Since the GPFS documentation contains cross-book links, all files contained in the tar
file should remain in the same directory.
In order to use the GPFS man pages, the gpfsdocs file set must first be installed (see Installing the GPFS
man pages).
The GPFS library includes:
v General Parallel File System for AIX 5L: AIX Clusters Concepts, Planning, and Installation Guide,
GA22-7895 (PDF file name an2ins10.pdf)
v General Parallel File System for AIX 5L: AIX Clusters Administration and Programming Reference,
SA22-7896 (PDF file name an2adm10.pdf)
v General Parallel File System for AIX 5L: AIX Clusters Problem Determination Guide, GA22-7897 (PDF
file name an2pdg10.pdf)
v General Parallel File System for AIX 5L: AIX Clusters Data Management API Guide, GA22-7898 (PDF
file name an2dmp10.pdf)
AIX publications
For the latest information on AIX 5L Version 5.1 and related products, see
http://www.ibm.com/servers/aix/library/
HACMP/ES publications
You can download the HACMP/ES manuals from the Web at
www.ibm.com/servers/eserver/pseries/library/hacmp_docs.html
v HACMP for AIX 4.4 Enhanced Scalability Installation and Administration Guide, SC23-4306
Redbooks
IBM's International Technical Support Organization (ITSO) has published a number of Redbooks. For a
current list, see the ITSO Web site at www.ibm.com/redbooks
v IBM Eserver Cluster 1600 and PSSP 3.4 Cluster Enhancements, SG24-6604 provides information on
GPFS 1.5.
v GPFS on AIX Clusters: High Performance File System Administration Simplified, SG24-6035 provides
information on GPFS 1.4.
v Implementing Fibre Channel Attachment on the ESS, SG24-6113
v Configuring and Implementing the IBM Fibre Channel RAID Storage Server, SG24-5414
Whitepapers
A GPFS primer at www.ibm.com/servers/eserver/pseries/software/whitepapers/gpfs_primer.html
Heger, D., Shah, G., General Parallel File System (GPFS) 1.4 for AIX Architecture and Performance, 2001,
at www.ibm.com/servers/eserver/clusters/whitepapers/gpfs_aix.html
IBM Eserver pSeries white papers at www.ibm.com/servers/eserver/pseries/library/wp_systems.html
Clustering technology white papers at www.ibm.com/servers/eserver/pseries/library/wp_clustering.html
AIX white papers at www.ibm.com/servers/aix/library/wp_aix.html
White paper and technical reports homepage at
www.ibm.com/servers/eserver/pseries/library/wp_systems.html
Non-IBM publications
Here are some non-IBM publications that you may find helpful:
v Almasi, G., Gottlieb, A., Highly Parallel Computing, Benjamin-Cummings Publishing Company, Inc., 2nd
edition, 1994.
v Foster, I., Designing and Building Parallel Programs, Addison-Wesley, 1995.
v Gropp, W., Lusk, E., Skjellum, A., Using MPI, The MIT Press, 1994.
v Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 1.1, University
of Tennessee, Knoxville, Tennessee, June 6, 1995.
v Message Passing Interface Forum, MPI-2: Extensions to the Message-Passing Interface, Version 2.0,
University of Tennessee, Knoxville, Tennessee, July 18, 1997.
v Ousterhout, John K., Tcl and the Tk Toolkit, Addison-Wesley, Reading, MA, 1994, ISBN 0-201-63337-X.
v Pfister, Gregory, F., In Search of Clusters, Prentice Hall, 1998.
v System Management: Data Storage Management (XDSM) API Common Applications Environment
(CAE) Specification C429, The Open Group, ISBN 1-85912-190-X. Available on-line in HTML from The
Open Group's Web site at www.opengroup.org/.
Index
Special characters
/etc/security/limits 36
nofiles file descriptor limit 36
Numerics
64-bit support 37
A
access to the same file
simultaneous 3
adapter membership 51, 52
administration commands
GPFS 51
AIX 6
communication with GPFS 51
AIX 5L 9
AIX cluster environment
description 6
allocation map
block 48
inode 48
logging of 49
application programs
communicating with GPFS 52
application support 59
autoload option 17
automatic mount
file systems 21
automount feature 4
B
bandwidth
increasing aggregate 3
block
allocation map 48
block size 19, 21
effect on maximum mounted file system size 22
C
cache 18, 50
cluster
restrictions 63
cluster type 15
clusters see GPFS cluster environment
coexistence
conflicting software 29
coexistence guidelines 39
commands
error communication 51
failure of 14
GPFS administration 51
mmadddisk 13
commands (continued)
mmchconfig 15, 37, 50
mmchdisk 56
mmcheckquota 24, 52
mmchfs 20, 37, 40, 48
mmconfig 15, 16, 33, 50
mmcrcluster 9, 14, 16, 29
mmcrfs 13, 20, 33
mmcrlv 11
mmdefedquota 24
mmdefquotaon 24
mmedquota 23, 24
mmfsck 49, 52, 56
mmlsdisk 56
mmlsquota 24
mmrepquota 24
mmrpldisk 13
mmstartup 17
operating system 52
processing 56
remote file copy
rcp 15
remote shell
rsh 15
restrictions 64
where they run 4
communicating file accessing patterns
restrictions 67
communication
between GPFS and RSCT 14
GPFS daemon to daemon 14
communication protocol 50
communications I/O 35
compatibility 40
configuration
file system manager nodes 45
files 57
of a GPFS cluster 14
options
all environments 15
system 68
system flexibility 4
configuration see also nodeset 83
configuration files 4
configuration manager 45, 52
configuration settings 35
configuring GPFS 15
conflicting software 29
considerations for GPFS applications 59
creating GPFS directory
/tmp/gpfslpp 31
cssMembership 52
D
daemon memory 49
data
availability 3
data (continued)
consistency of 3
data blocks
logging of 49
recovery of 49
Data Management API (DMAPI)
configuration options 15, 19
enabling 24
data recoverability 9
default quotas 24
files 49
definition
of failure group 3
descriptor
file systems 47
descriptors
disk 13
directives
restrictions 67
disk descriptors 13
disk leasing 11
disk properties
DiskUsage 11
Failure Group 11
disk subsystems 9
disks
descriptors 25
failure 10, 11
fencing 11
media failure 57
recovery 56
releasing blocks 57
restrictions 66
state of 56
tuning parameters 36
usage 12, 25
usage verification 24
DiskUsage
disk properties 11
documentation 31
obtaining 77
dumps
path for the storage of 17
E
electronic license agreement 29
estimated node count 21
F
failing nodes
in multi-node quorum 18
in single-node quorum 18
failover support 3
failure
disk 10
node 10, 18
failure group
definition of 3
Failure Group
disk properties 11
failure groups
choosing 13, 25
file system manager 17
administration command processing 51
command processing 56
communication with 51
description 45
mount of a file system 52
pool of nodes to choose from 16
selection of 46
file systems
administrative state of 4, 57
automatic mount of 21
block size 19, 21
creating 20
descriptor 47
device name 25
disk descriptor 25
interacting with a GPFS file system 52
maximum number of 48, 52
maximum number of files 22, 48
maximum size supported 48
mounted file system sizes 22
mounting 24, 52
opening a file 53
reading a file 53
recovery 57
repairing 56
restrictions 63
sizing 21
writing to a file 54
files
/.rhosts 35
/etc/cluster.nodes 58
/etc/filesystems 57
/etc/fstab 58
/var/adm/ras/mmfs.log.latest 51
/var/mmfs/etc/cluster.preferences 58
/var/mmfs/etc/mmfs.cfg 58
/var/mmfs/gen/mmsdrfs 58
/etc/security/limits 36
consistency of data 3
group.quota 49
inode 48
log files 49
maximum number of 22, 48
maximum size 48
mmfs.cfg 57
structure within GPFS 47
user.quota 49
fragments, storage of files 22
G
GPFS
administration commands 51
communication within 51
daemon description 6
description of 3
GPFS (continued)
nodeset in an HACMP environment 7
nodeset in an RSCT peer domain environment
planning for 9
strengths of 3
structure of 5, 45
GPFS cluster data
server nodes 14
GPFS cluster
configuration restrictions 61
creating 14
defining nodes in the cluster 14
planning nodes 14
GPFS cluster data 58
content 4, 57
designation of server nodes 14
GPFS daemon
quorum requirement 45
went down 51
grace period, quotas 24
Group Services 10, 50
initialization of GPFS 52
recovering a file system 57
29
K
kernel extensions 5
kernel memory 49
L
license inquiries 69
load
balancing across disks 3
log files
creation of 49
unavailable 57
logical volume
creation considerations 11
loose cluster
cluster type 15
H
ha.vsd group
initialization of GPFS 52
HACMP environment 7
HACMP/ES
HACMP environment 6
HACMP/ES program product
hard limit, quotas 24
hardware specifications 9
hints
restrictions 67
installing
what to do before you install GPFS
invariant IP address 35
ipqmaxlen parameter 35
I
IBM Multi-Media Server
conflicting software 29
IBM Video Charger
conflicting software 29
indirect blocks 47, 49
indirection level 47
initialization of GPFS 52
inode
allocation file 48
allocation map 48
cache 50
logging of 49
usage 47, 55
installation
files used during 29
images 32
installing on a network 32
on a non-shared file system network 33
on a shared file system network 32
verifying 33
what to do after the installation of GPFS 33
installation procedure 31
man pages
obtaining 77
max_coalesce parameter 36
maxFilesToCache
memory usage for 19
maxFilesToCache parameter 18, 50
maximum number of files 22
maxStatCache
memory usage for 19
maxStatCache parameter 18, 50
memory
controlling 18
usage 49
memory formula
for maxFilesToCache 19
for maxStatCache 19
metadata 47
disk usage to store 12, 25
metanode 47
migration
full 38
nodesets 37
requirements 37
reverting to the previous level of GPFS 39
staged 37
mmadddisk command
and rewritten disk descriptor file 13
mmcrfs command
and rewritten disk descriptor file 13
mmcrlv command 11
mmrpldisk command
and rewritten disk descriptor file 13
mount command 52
mounting a file system 24
multi-node quorum 18
N
Network Shared Disks (NSDs)
definition 75
nodes
acting as special managers 45
estimating the number of 21
failure 10, 18, 57
in a GPFS cluster 14
planning 16
restrictions 64
nodeset
configuration restrictions 62
nodesets
creating 16
definition of 3
designation of 25
file for installation 29
identifier 17
in an HACMP environment 7
in an RSCT peer domain environment
migrating 37
moving a file system 25
operation of 17
planning 16
non-shared file system network
installing GPFS 33
notices 69
quorum
definition of 18
during node failure 10
enforcement 45
initialization of GPFS 52
quotas
default quotas 24
description 23
files 49
in a replicated system 23
mounting a file system with quotas enabled 24
role of file system manager node 46
system files 24
values reported in a replicated file system 23
O
operating system
commands 52
operating system calls 53
P
pagepool
in support of I/O 50
pagepool parameter
effect on performance 54
usage 18, 50
parameter
maxStatCache 18
parameters
maxFilesToCache 18
patent information 69
PATH environment variable 29
performance
pagepool parameter 54
use of GPFS to improve 3
use of pagepool 50
performance improvements
balancing load across disks 3
increasing aggregate bandwidth 3
parallel processing 3
simultaneous access the same file 3
supporting large amounts of data 3
persistent reserve 11
pool of nodes
in selection of file system manager 46
37
rcp 15
read operation
buffer available 53
buffer not available 54
requirements 53
token management 54
README file, viewing 32
recoverability 11
disk failure 10
disks 56
features of GPFS 3, 57
file systems 56
node failure 10
recoverability parameters 9
Redundant Array of Independent Disks (RAID)
Reliable Scalable Cluster Technology (RSCT)
subsystem of AIX 6
remote file copy command
rcp 15
remote shell command
rsh 15
removing GPFS 41
repairing a file system 56
replication 11
effect on quotas 23
description of 4
restrictions
cluster management 63
commands 64
disk management 66
file system configuration 63
12
restrictions (continued)
GPFS cluster configuration 61
node management 64
nodeset configuration 62
starting GPFS 62
restripe see rebalance 83
rewritten disk descriptor file
uses of 13
RSCT peer domain environment 7
rsh 15
S
SCSI-3 persistent reserve 11
secondary network for RSCT communications 14
security 35
GPFS use of 46
restrictions 64
shared external disks
considerations 9
shared file system network
installing GPFS 32
shared segments 50
single-node quorum 18
sizing file systems 21
socket communications, use of 51
soft limit, quotas 24
softcopy documentation 31
SSA fencing 11
SSA Redundant Array of Independent Disks (RAID) 22
standards, exceptions to 59
starting GPFS 17
restrictions 62
stat cache 50
stat( ) system call 50, 55
storage see memory 83
Stripe Group Manager see File System Manager 83
structure of GPFS 5
subblocks, use of 22
support
failover 3
syntax
rcp 62
rsh 62
system calls
open 53
read 53
stat( ) 55
write 54
system configuration 68
System Data Repository (SDR)
configuring all of the nodes listed in 16
trademarks 70
Transmission Control Protocol/Internet Protocol
(TCP/IP) 50
tuning parameters
ipqmaxlen 35
max_coalesce 36
tuning your system 35
two-node nodeset 18
U
uninstalling GPFS 41
user data 49
V
verification
disk usage 24
verifying prerequisite software 30
W
write operation
buffer available 55
buffer not available 55
token management 55
T
token management
description 46
system calls 53
token management system 3
topology services
configuration settings 35