2014 IBM Corporation
GPFS performance session
Sven Oehme oehmes@us.ibm.com
Latency numbers you NEED to know *
L1 cache reference 1 ns
L2 cache reference 5 ns
Acquire/release mutex 100 ns
Main Memory reference 100 ns
Send 2k byte over verbs 10000 ns
Send 2k bytes over 1 Gbps network 60000 ns
Read 1 MB sequentially from Memory 250000 ns
Read 1 MB sequentially from network 5000000 ns
Disk seek 10000000 ns
Read 1 MB sequentially from disk 20000000 ns
Send network packet from SJC,CA -> FRA,DE -> SJC,CA 150000000 ns
* These numbers are rounded and don't claim to be 100% accurate
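To put the table in perspective, the 1 MB sequential read latencies can be turned into an approximate single-stream throughput. This is a minimal sketch, not a measurement: the function name mb_per_sec is made up for illustration, and the three constants are simply copied from the table above.

```shell
# Latency figures (ns per 1 MB sequential read), copied from the table above.
READ_1MB_MEMORY_NS=250000
READ_1MB_NETWORK_NS=5000000
READ_1MB_DISK_NS=20000000

# mb_per_sec NS
# MB/sec a single stream sustains if every 1 MB read takes NS nanoseconds.
# (hypothetical helper, integer arithmetic only)
mb_per_sec() (
    echo $(( 1000000000 / $1 ))
)

mb_per_sec "$READ_1MB_MEMORY_NS"     # memory
mb_per_sec "$READ_1MB_NETWORK_NS"    # network
mb_per_sec "$READ_1MB_DISK_NS"       # disk
```

With these rounded figures a single stream gets about 4000 MB/sec from memory, 200 MB/sec from the network and 50 MB/sec from disk.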
Bandwidth numbers you NEED to know *
Gigabit 100 MB/sec
10 Gbit 1000 MB/sec
40 Gbit QDR IB 4000 MB/sec
56 Gbit FDR IB 5600 MB/sec
GEN-2 PCI Slot 3000 MB/sec
GEN-3 PCI Slot 6000 MB/sec
NL SAS drive 100% Seq Read/Write 100 MB/sec
NL SAS drive 100% 2MB Random Read/Write 65 MB/sec
SSD 100% Seq Read 300 MB/sec
SSD 100% Seq Write 200 MB/sec
* These numbers are rounded and don't claim to be 100% accurate
IOPS numbers you NEED to know *
NL SAS drive 100% 4k Random iops 100
10k SAS drive 100% 4k Random iops 200
SSD 100% 4k random reads 20000
SSD 100% 4k random writes 4000
* These numbers are rounded and don't claim to be 100% accurate
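Small random I/O translates into very little bandwidth, which the following sketch makes explicit by multiplying IOPS with the I/O size. The helper name iops_to_mb_per_sec is hypothetical, and 1 MB is taken as 1024 KB here.

```shell
# iops_to_mb_per_sec IOPS IO_SIZE_KB
# Bandwidth in MB/sec (1 MB = 1024 KB) for IOPS operations of IO_SIZE_KB each.
# (hypothetical helper, not part of any GPFS tool)
iops_to_mb_per_sec() (
    awk -v iops="$1" -v kb="$2" 'BEGIN { printf "%.1f\n", iops * kb / 1024 }'
)

# IOPS figures from the table above, all at 4k I/O size
iops_to_mb_per_sec 100 4      # NL SAS drive, random
iops_to_mb_per_sec 200 4      # 10k SAS drive, random
iops_to_mb_per_sec 20000 4    # SSD, random reads
iops_to_mb_per_sec 4000 4     # SSD, random writes
```

An NL SAS drive doing 100 random 4k operations per second moves well under 1 MB/sec, compared with the 100 MB/sec listed for sequential access.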
How to collect a GPFS trace for performance analysis
GPFS 3.5 TL3 provides a new low-overhead tracing facility: in-memory tracing.
We had in-memory tracing before, but it still had a large overhead and has now been replaced by a new technique.
To set cluster-wide in-memory tracing, run:
mmtracectl --set --trace=def --tracedev-write-mode=overwrite --tracedev-overwrite-buffer-size=1g
You can now turn on tracing:
mmtracectl --start
Stop tracing while leaving the settings in place:
mmtracectl --stop
Or turn tracing off and reset to non-in-memory tracing:
mmtracectl --off
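The steps above can be wrapped into a small helper that captures a trace for a fixed time window. This is only a sketch: the names run and collect_trace and the DRY_RUN variable are invented for this example, and by default the script only prints the commands instead of executing them. The mmtracectl options are the ones shown above.

```shell
# run CMD...: print the command when DRY_RUN=1 (the default here), else run it.
# (hypothetical helper)
run() (
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
)

# collect_trace SECONDS: configure in-memory tracing, trace for SECONDS, stop.
# The settings stay in place afterwards, as with a plain "mmtracectl --stop".
collect_trace() (
    seconds="$1"
    run mmtracectl --set --trace=def --tracedev-write-mode=overwrite --tracedev-overwrite-buffer-size=1g
    run mmtracectl --start
    run sleep "$seconds"
    run mmtracectl --stop
)

collect_trace 60
```

Setting DRY_RUN=0 would execute the commands for real; keeping the dry run as the default makes it harder to start cluster-wide tracing by accident.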
Analyze traces - create reports
As an example, my traces are stored in /var/log/gpfstraces/
And my GPFS clone directory is /xcat/oehmes/gpfs-clone
Now run the following commands:
# /xcat/oehmes/gpfs-clone/tools/trcio -W -T -F -L -d /var/log/gpfstraces/ -o trcio.output
Reading directory /var/log/gpfstraces
Found 1 files matching ['trcrpt.*']
Processing 1 of 1: trcrpt.130714.16.56.30.sonas04n1.gz
bad trace entry: Using invariant time cycle counter at 2499.996000 Mhz to calculate timestamps
bad trace entry: Detected CPU running at 2435.000000 MHz at 1373833975.470050 (10381609864903222).
bad trace entry: Detected CPU running at 2435.000000 MHz at 1373833975.470059 (10381609864925144).
bad trace entry: Detected CPU running at 2499.000000 MHz at 1373835982.360727 (10386627083572363).
bad trace entry: Note that this is a change in CPU speed. (if prior TOD times are within a second of each
other this is a spurious error)
bad trace entry: Detected CPU running at 2498.000000 MHz at 1373835982.362344 (10386627087612460).
bad trace entry: Detected CPU running at 2499.000000 MHz at 1373841806.311670 (10401186937643757).
bad trace entry: Detected CPU running at 2396.000000 MHz at 1373841806.311677 (10401186937660535).
Writing trcio.output
Analyze traces - look at the reports
Processed 1 trace files:
['0.000000', '13893.154279', 'trcrpt.130714.16.56.30.sonas04n1.gz']
Total elapsedTime: 13893.154 sec
Total Time  Count  Ops/sec  --Time-per-operation-(milli-sec)--  -Cumul.-Med.
 (seconds)                       min      avg      90%      max     (ms) pcnt
---------- ------ -------- -------- -------- -------- -------- -------- ----
  2415.654 146023     10.5    0.037   16.543   23.771  150.453   18.346  36%  read '*'
   606.029  33875      2.4    0.011   17.890   19.114 1407.950   22.622   4%  send '*'
    30.767    983      0.1    0.039   31.299   99.342  227.816   99.340  10%  write '*'
    16.758   1032      0.1    0.073   16.238   37.210  255.746   35.009  15%  nsd '*'
     0.206    758      0.1    0.001    0.272    0.132   13.952    9.006   1%  dmutx '*'
     0.156   5585      0.4    0.003    0.028    0.033    0.541    0.027  30%  recv '*'
     0.017  36746      2.6    0.000    0.000    0.001    0.013    0.001  22%  tsc '*'
     0.005    476      0.0    0.003    0.011    0.023    0.038    0.013  23%  vnop '*'
     0.001    578      0.0    0.000    0.001    0.002    0.006    0.002  35%  lock '*'
Total Time  Count  Ops/sec  --Time-per-operation-(milli-sec)--  -Cumul.-Med.
 (seconds)                       min      avg      90%      max     (ms) pcnt
---------- ------ -------- -------- -------- -------- -------- -------- ----
  1645.859  88177      6.3    0.712   18.665   24.245  115.608   19.345  40%  read 'vdiskBuf'
   686.816  54824      3.9    0.037   12.528   19.139  150.453   15.294  32%  read 'unknown'
   343.764  27898      2.0    0.022   12.322   19.428   93.996   15.429  31%  send 'nspdMsgReadWrite'
   262.045    308      0.0  684.291  850.795 1384.841 1407.950  705.956  39%  send 'nspdMsgDiscover'
    82.962   2976      0.2   19.182   27.877   33.172  100.497   27.288  43%  read 'data'
    23.358    200      0.0   59.854  116.790  194.596  227.816  118.109  35%  write 'data'
    16.758   1032      0.1    0.073   16.238   37.210  255.746   35.009  15%  nsd 'processRequest'
     5.451    308      0.0   11.072   17.698   20.362   83.842   17.108  41%  write 'vdiskFGLog2'
     0.746    369      0.0    0.040    2.023    0.113   57.654   26.016   2%  write 'vdiskFWLog'
     0.706     35      0.0   17.121   20.176   20.518   25.195   20.281  48%  write 'vdiskRGDesc'
     0.387     12      0.0   26.997   32.260   37.030   37.036   36.234  50%  write 'logData'
     0.162     24      0.0    0.004    6.752    9.955   13.952    9.577  29%  dmutx 'VBufFreeMutex HVdiskScrubWorkerThreadI'
.....
dstat Plugin for GPFS
With GPFS 3.5 PTF12 we will add a new sample code plugin for dstat to the GPFS rpms on Linux.
In the meantime, anyone with access to the git repository can use it. The files are:
./ts/util/dstat_gpfsops.py.dstat.0.6
./ts/util/dstat_gpfsops.py.dstat.0.7
The extensions of the files (0.6 and 0.7) correspond to the 2 incompatible plugin versions of dstat.
0.6 will work on all older Linux versions prior to RHEL 6.1, and 0.7 will work on all newer versions.
0.6 and 0.7 are the versions of dstat, as reported by dstat --version.
The matching version of the plugin needs to be copied into /usr/share/dstat/ (on RHEL 6.X) and renamed to
dstat_gpfsops.py, like:
cp ./ts/util/dstat_gpfsops.py.dstat.0.7 /usr/share/dstat/dstat_gpfsops.py
After that you can add the plugin to the dstat output by running:
dstat -c -n -d -M gpfsops --nocolor
This will show cpu, network, disk and GPFS default stats on a single line at 1 second granularity.
In order to enable vfs statistics you need to run:
mmfsadm vfsstats enable
on each node in the cluster (or add it to the mmfsup file in /var/mmfs/etc/).
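Picking the right plugin file can be scripted from the dstat version string. The following is a minimal sketch: plugin_suffix is a hypothetical helper, the version string passed to it is an assumed example value, and the script only prints the copy command rather than running it.

```shell
# plugin_suffix DSTAT_VERSION
# Map a dstat version string to the plugin file suffix (0.6 or 0.7).
# Fails with a non-zero status for any other version.
plugin_suffix() (
    case "$1" in
        0.6*) echo "0.6" ;;
        0.7*) echo "0.7" ;;
        *)    echo "unsupported dstat version: $1" >&2; exit 1 ;;
    esac
)

# Example with an assumed version string; on a real node take it from dstat itself.
suffix=$(plugin_suffix "0.7.2")
echo "cp ./ts/util/dstat_gpfsops.py.dstat.$suffix /usr/share/dstat/dstat_gpfsops.py"
```

Failing loudly on an unknown version is deliberate here, since copying the wrong variant would only show up later as a plugin load error.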
dstat Plugin for GPFS
Dstat class to display selected gpfs performance counters returned by the
mmpmon "vfs_s", "ioc_s", "vio_s", "vflush_s", and "lroc_s" commands.
The set of counters displayed can be customized via environment variables:
DSTAT_GPFS_WHAT
Selects which of the five mmpmon commands to display.
It is a comma separated list of any of the following:
"vfs": show mmpmon "vfs_s" counters
"ioc": show mmpmon "ioc_s" counters related to NSD client I/O
"nsd": show mmpmon "ioc_s" counters related to NSD server I/O
"vio": show mmpmon "vio_s" counters
"vflush": show mmpmon "vflush_s" counters
"lroc": show mmpmon "lroc_s" counters
"all": equivalent to specifying all of the above
Example:
DSTAT_GPFS_WHAT=vfs,lroc dstat -M gpfsops
will display counters for the mmpmon "vfs_s" and "lroc_s" commands.
The default setting is "vfs,ioc", i.e., by default only "vfs_s" and NSD
client related "ioc_s" counters are displayed.
For more details on further customization see the dstat_gpfsops.py file
dstat Plugin for GPFS
Just show VFS level counters:
# DSTAT_GPFS_WHAT=vfs dstat -c -n -d -M gpfsops --nocolor
WARNING: Option -M is deprecated, please use --gpfsops instead
/usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module.
  pipes[cmd] = os.popen3(cmd, 't', 0)
----total-cpu-usage---- -net/total- -dsk/total- ---------------------------gpfs-vfs-ops--------------------------
usr sys idl wai hiq siq| recv  send| read  writ|   cr   del op/cl    rd    wr trunc fsync looku gattr sattr other
  0   0  98   1   0   0|    0     0|   13    45|    0     0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0| 990B 5434B|    0     0|    0     0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0| 740B 4996B|    0     0|    0     0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0| 684B 4590B|    0   16k|    0     0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0| 628B 5012B|    0     0|    0     0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0| 916B 5346B|    0   60k|    0     0     0     0     0     0     0     0     0     0     0
Just show the vDISK counters:
# DSTAT_GPFS_WHAT=vio dstat -c -n -d -M gpfsops --nocolor
WARNING: Option -M is deprecated, please use --gpfsops instead
/usr/bin/dstat:1672: DeprecationWarning: os.popen3 is deprecated. Use the subprocess module.
  pipes[cmd] = os.popen3(cmd, 't', 0)
----total-cpu-usage---- -net/total- -dsk/total- --------------------------gpfs-vio-------------------------
usr sys idl wai hiq siq| recv  send| read  writ|ClRea ClShW ClMdW ClPFT ClFTW FlUpW FlPFT Migrt Scrub  LgWr
  0   0  98   1   0   0|    0     0|   13    45|    0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0|1454B 5986B|    0     0|    0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0| 564B 5170B|    0     0|    0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0| 684B 4900B|    0     0|    0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0| 864B 4654B|    0     0|    0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0| 740B 4980B|    0   64k|    0     0     0     0     0     0     0     0     0     0
  0   0 100   0   0   0| 680B 4574B|    0     0|    0     0     0     0     0     0     0     0     0     0
Perf top - Tool to find the high CPU contender
If you start perf top without parameters, it shows the top CPU consumers of the system in real
time, each with a relative % compared to the others.
23.06% mmfsd [.] rsD10T2_8_9_low_vector_cksum(void**, CP64State*, int)
16.42% mmfsd [.] rsD10T2_8_9_high_vector_cksum(void**, CP64State*, int)
9.22% libmlx4-rdmav2.so [.] 0x0000000000003592
3.27% [mmfslinux] [k] cxiGetPagePtrs
2.82% [kernel] [k] _spin_lock_irqsave
1.36% [kernel] [k] schedule
1.21% [kernel] [k] _spin_unlock_irqrestore
1.09% mmfsd [.] VTrackDesc::cleanBuffersAndBitmaps(VIORequest*)
1.02% [kernel] [k] _spin_lock
0.99% [kernel] [k] fget_light
0.93% [mmfslinux] [k] cxiStartIO
0.80% mmfsd [.] VDataBuf::vPrebuildBufferTrailer(int)
0.73% mmfsd [.] NotGlobalMutexClass::acquire()
0.72% mmfsd [.] Checksum16::calc16(void const*, int)
0.63% [kernel] [k] fput
...
Perf top - Tool to find the high CPU contender
You can further zoom into a process (in this example mmfsd) and see a breakdown of the cpu chewing functions:
Samples: 269K of event 'cycles', Event count (approx.): 126730391715, Thread: mmfsd(344422), DSO: mmfsd
39.51% [.] rsD10T2_8_9_low_vector_cksum(void**, CP64State*, int)    <-- GNR checksum code
28.37% [.] rsD10T2_8_9_high_vector_cksum(void**, CP64State*, int)
1.92% [.] VTrackDesc::cleanBuffersAndBitmaps(VIORequest*)
1.34% [.] VDataBuf::vPrebuildBufferTrailer(int)
1.28% [.] NotGlobalMutexClass::acquire()
1.25% [.] Checksum16::calc16(void const*, int)
1.00% [.] VTrackDesc::buildBufferTrailers(BufBitmap const&, VDataBuf**)
0.91% [.] IOBundle::queueIOBuffer(VDataBuf*, int, int, int)
0.86% [.] VIORequest::performPromotedFTWrite()
0.85% [.] verbs::verbsServer_i(int, RpcContext*, NodeAddr, int, unsigned int, int, int, nsdRdmaRmr_s*, int, iovec*, long long, long long)
0.84% [.] VDataBuf::vGetDataAddrAtOffset(int, int) const
0.82% [.] ChunkTab::findChunk(char*, unsigned long)
0.81% [.] VDataBuf::vBuildBufferTrailer(int)
0.77% [.] verbs::verbsDtoThread_i(int)
0.74% [.] VDataBuf::vHold()
0.69% [.] IncrementalChecksumState::ickAccumulate(void const*, int)
0.66% [.] VTrackDesc::prepareToBuildTrailers(VIORequest*)
0.66% [.] VTrackDesc::vtUpdateTrailerVersions(VIORequest*)
0.66% [.] ThCond::wait(int, char const*)
0.63% [.] VIORequest::vioReshapeFreeBuffers(int)
0.63% [.] VMemHandle::mGetVAddr() const
0.60% [.] VTrackDesc::vtProcessWriteIOBundleStatus(IOBundle*, VIORequest*, BufBitmap*)
...
Performance data
A word of caution: The achieved numbers depend on the right client configuration and a
good interconnect, and can vary between environments. They should not be used in RFIs
as committed numbers, but rather to demonstrate the technical capabilities of the product
in good conditions.
None of the following performance numbers should be reused for
sales or contract purposes.
Some of the numbers produced are the result of very advanced
tuning and, while achievable, are not very easy to recreate on customer
systems without the same level of effort.
Test Setup
10 5*0006M* Server each with
11 GB of Memory (1 GB Pagepool)
1 FDR Port
1 x 1 core CPU
1 GSS24/26 depending on the test
2 FDR Ports connected per Server
GPFS *!0!0!2 GA code level
Mellanox *2 Port FDR switch
4x 10x
Apply these numbers to practice
Creating a single 10 GByte file from one client using a GEN-2 FDR IB card
# /usr/local/bin/gpfsperf create seq -n 10G -r 8m /ibm/fs2-8m/test-10g-write
/usr/local/bin/gpfsperf create seq /ibm/fs2-8m/test-10g-write
recSize 8M nBytes 10G fileSize 10G
nProcesses 1 nThreadsPerProcess 1
file cache flushed before test
not using data shipping
not using direct I/O
offsets accessed will cycle through the same file segment
not using shared memory buffer
not releasing byte-range token after open
no fsync at end of test
Data rate was 3268199.54 Kbytes/sec, iops was 398.95, thread utilization 0.984
Record size: 8388608 bytes, 10737418240 bytes to transfer, 10737418240 bytes transferred
CPU utilization: user 3.68%, sys 3.87%, idle 92.45%, wait 0.00%
Why didn't it run at 5.6 GB/sec? GEN 2
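The reported data rate can be cross-checked from the other figures in the output: iops multiplied by the record size. The sketch below does that arithmetic; kbytes_per_sec is a hypothetical helper name. The result is also worth comparing with the bandwidth table earlier in this deck, where a GEN-2 PCI slot is listed at about 3000 MB/sec, well below the 5600 MB/sec of a 56 Gbit FDR IB link.

```shell
# kbytes_per_sec IOPS RECORD_SIZE_BYTES
# Data rate in Kbytes/sec implied by an I/O rate and a record size.
# (hypothetical helper; gpfsperf reports the rate itself)
kbytes_per_sec() (
    awk -v iops="$1" -v bytes="$2" 'BEGIN { printf "%.2f\n", iops * bytes / 1024 }'
)

# iops and record size as reported by the gpfsperf run above
kbytes_per_sec 398.95 8388608
```

This comes out at about 3.27 million Kbytes/sec, roughly 3.1 GB/sec, which matches the reported rate up to the rounding of the iops value and sits close to the GEN-2 PCI slot figure rather than the FDR link figure.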
Apply these numbers to practice
Reading a single random block from this 10 GByte file while it is not cached anymore
# /usr/local/bin/gpfsperf read rand -n 8m -r 8m /ibm/fs2-8m/test-10g-write
/usr/local/bin/gpfsperf read rand /ibm/fs2-8m/test-10g-write
recSize 8M nBytes 8M fileSize 10G
nProcesses 1 nThreadsPerProcess 1
file cache flushed before test
not using data shipping
not using direct I/O
offsets accessed will cycle through the same file segment
not using shared memory buffer
not releasing byte-range token after open
Data rate was 220322.52 Kbytes/sec, iops was 26.89, thread utilization 0.989
Record size: 8388608 bytes, 8388608 bytes to transfer, 8388608 bytes transferred
CPU utilization: user 0.00%, sys 0.00%, idle 100.00%, wait 0.00%
[root@clients.sonascl16 mpi]# mmdiag --iohist
=== mmdiag: iohist ===
I/O history:
I/O start time  RW    Buf type    disk:sectorNum  nSec time ms Type      Device/NSD ID NSD server
--------------- -- ----------- ----------------- ----- ------- ---- ------------------ ---------------
10:04:45.979047  R        data    10:40527921152 16384  33.646  cli  C0A70402:51E1B12C 192.167.4.2
Apply these numbers to practice
Reading 8 blocks sequentially from this 10 GByte file while it is not cached anymore
# /usr/local/bin/gpfsperf read seq -n 64m -r 8m /ibm/fs2-8m/test-10g-write
/usr/local/bin/gpfsperf read seq /ibm/fs2-8m/test-10g-write
recSize 8M nBytes 64M fileSize 10G
nProcesses 1 nThreadsPerProcess 1
file cache flushed before test
not using data shipping
not using direct I/O
offsets accessed will cycle through the same file segment
not using shared memory buffer
not releasing byte-range token after open
Data rate was 533933.24 Kbytes/sec, iops was 65.18, thread utilization 0.901
Record size: 8388608 bytes, 67108864 bytes to transfer, 67108864 bytes transferred
CPU utilization: user 0.68%, sys 0.68%, idle 98.64%, wait 0.00%
[root@clients.sonascl16 mpi]# mmdiag --iohist
=== mmdiag: iohist ===
I/O history:
I/O start time  RW    Buf type    disk:sectorNum  nSec time ms Type      Device/NSD ID NSD server
--------------- -- ----------- ----------------- ----- ------- ---- ------------------ ---------------
10:06:24.664878  R        data    11:69057658880 16384  27.807  cli  C0A70402:51E1B12D 192.167.4.2
10:06:24.664878  R        data    10:81416355840 16384  34.372  cli  C0A70402:51E1B12C 192.167.4.2
10:06:24.704199  R        data    7:153683312640 16384  26.884  cli  C0A70401:51E1B11C 192.167.4.1
10:06:24.701967  R        data   12:155233386496 16384  32.577  cli  C0A70402:51E1B12E 192.167.4.2
10:06:24.704210  R        data     8:86150365184 16384  32.199  cli  C0A70401:51E1B11D 192.167.4.1
10:06:24.737230  R        data    9:170315186176 16384  30.527  cli  C0A70401:51E1B11E 192.167.4.1
10:06:24.741580  R        data     8:71319928832 16384  30.542  cli  C0A70401:51E1B11D 192.167.4.1
10:06:24.737237  R        data   10:144131506176 16384  37.358  cli  C0A70402:51E1B12C 192.167.4.2
10:06:24.739233  R        data     12:9777807360 16384  36.375  cli  C0A70402:51E1B12E 192.167.4.2
10:06:24.739203  R        data    11:91428995072 16384  37.607  cli  C0A70402:51E1B12D 192.167.4.2
10:06:24.741588  R        data    7:171111170048 16384  40.440  cli  C0A70401:51E1B11C 192.167.4.1
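To summarize a listing like the one above, the service times can be averaged with a few lines of awk. This is a sketch under assumptions: avg_io_ms is a made-up name, and it assumes the column layout shown above, where the second whitespace-separated field is R or W and the sixth is the time in ms.

```shell
# avg_io_ms: read mmdiag --iohist output on stdin, print I/O count and mean time.
# Header and separator lines are skipped because their second field is not R or W.
# (hypothetical helper, assumes the column order shown in the listing above)
avg_io_ms() (
    awk '$2 == "R" || $2 == "W" { sum += $6; n++ }
         END { if (n > 0) printf "%d I/Os, avg %.3f ms\n", n, sum / n }'
)

# Intended use on a node (not executed here):
#   mmdiag --iohist | avg_io_ms

# Two rows from the listing above as sample input
printf '%s\n' \
  '10:06:24.664878  R  data  11:69057658880  16384  27.807  cli  C0A70402:51E1B12D  192.167.4.2' \
  '10:06:24.664878  R  data  10:81416355840  16384  34.372  cli  C0A70402:51E1B12C  192.167.4.2' \
  | avg_io_ms
```

Comparing this average between the uncached and the prefetched runs is a quick way to see whether individual requests got faster or whether more of them were simply in flight at once.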
Apply these numbers to practice
Reading 8 blocks sequentially from this 10 GByte file while it is STILL cached
# /usr/local/bin/gpfsperf read seq -n 64m -r 8m /ibm/fs2-8m/test-10g-write
/usr/local/bin/gpfsperf read seq /ibm/fs2-8m/test-10g-write
recSize 8M nBytes 64M fileSize 10G
nProcesses 1 nThreadsPerProcess 1
file cache flushed before test
not using data shipping
not using direct I/O
offsets accessed will cycle through the same file segment
not using shared memory buffer
not releasing byte-range token after open
Data rate was 3286165.76 Kbytes/sec, iops was 401.14, thread utilization 0.980
Record size: 8388608 bytes, 67108864 bytes to transfer, 67108864 bytes transferred
CPU utilization: user 0.00%, sys 4.08%, idle 95.92%, wait 0.00%
[root@clients.sonascl16 mpi]# mmdiag --iohist
=== mmdiag: iohist ===
I/O history:
I/O start time  RW    Buf type    disk:sectorNum  nSec time ms Type      Device/NSD ID NSD server
--------------- -- ----------- ----------------- ----- ------- ---- ------------------ ---------------
#
Apply these numbers to practice
Reading the whole file sequentially
# /usr/local/bin/gpfsperf read seq -n 10g -r 8m /ibm/fs2-8m/test-10g-write
/usr/local/bin/gpfsperf read seq /ibm/fs2-8m/test-10g-write
recSize 8M nBytes 10G fileSize 10G
nProcesses 1 nThreadsPerProcess 1
file cache flushed before test
not using data shipping
not using direct I/O
offsets accessed will cycle through the same file segment
not using shared memory buffer
not releasing byte-range token after open
Data rate was 2330376.56 Kbytes/sec, iops was 284.47, thread utilization 1.000
Record size: 8388608 bytes, 10737418240 bytes to transfer, 10737418240 bytes transferred
CPU utilization: user 3.01%, sys 3.24%, idle 93.75%, wait 0.00%
=== mmdiag: iohist ===
I/O history:
I/O start time  RW    Buf type    disk:sectorNum  nSec time ms Type      Device/NSD ID NSD server
--------------- -- ----------- ----------------- ----- ------- ---- ------------------ ---------------
10:34:02.345925  R        data    8:167131250688 16384  34.668  cli  C0A70401:51E1B11D 192.167.4.1
10:34:02.345925  R        data    9:107683823616 16384  37.434  cli  C0A70401:51E1B11E 192.167.4.1
10:34:02.387112  R        data    11:69057658880 16384  30.356  cli  C0A70402:51E1B12D 192.167.4.2
10:34:02.384108  R        data    10:81416355840 16384  35.265  cli  C0A70402:51E1B12C 192.167.4.2
10:34:02.387031  R        data   12:155233386496 16384  34.376  cli  C0A70402:51E1B12E 192.167.4.2
10:34:02.422283  R        data     8:86150365184 16384  31.123  cli  C0A70401:51E1B11D 192.167.4.1
10:34:02.424731  R        data   10:144131506176 16384  29.495  cli  C0A70402:51E1B12C 192.167.4.2
10:34:02.422251  R        data    7:153683312640 16384  32.435  cli  C0A70401:51E1B11C 192.167.4.1
10:34:02.424731  R        data    9:170315186176 16384  34.049  cli  C0A70401:51E1B11E 192.167.4.1
.....
Apply these numbers to practice
See that the data was prefetched, which is why the response time per request is lower:
mmfsadm dump iohist
I/O history:
I/O start time  RW    Buf type    disk:sectorNum  nSec time ms      tag1      tag2           Disk UID typ NSD server      context   thread
--------------- -- ----------- ----------------- ----- ------- --------- --------- ------------------ --- --------------- --------- ----------
10:34:47.148582  R        data    9:107683823616 16384  30.611   2295808         1  C0A70401:51E1B130 cli 192.167.4.1     Prefetch  PrefetchWorkerThread
10:34:47.148590  R        data    8:167131250688 16384  51.180   2295808         0  C0A70401:51E1B12A cli 192.167.4.1     MBHandler FileBlockReadFetchHandlerThread
10:34:47.204880  R        data    11:69057658880 16384  27.887   2295808         3  C0A70401:51E1B12D cli 192.167.4.2     Prefetch  PrefetchWorkerThread
10:34:47.202549  R        data    10:81416355840 16384  36.348   2295808         2  C0A70401:51E1B12E cli 192.167.4.2     Prefetch  PrefetchWorkerThread
10:34:47.204888  R        data   12:155233386496 16384  34.017   2295808         4  C0A70401:51E1B12C cli 192.167.4.2     Prefetch  PrefetchWorkerThread
10:34:47.244035  R        data   10:144131506176 16384  32.866   2295808         8  C0A70401:51E1B12E cli 192.167.4.2     Prefetch  PrefetchWorkerThread
10:34:47.241839  R        data    7:153683312640 16384  35.732   2295808         5  C0A70401:51E1B128 cli 192.167.4.1     Prefetch  PrefetchWorkerThread
10:34:47.241839  R        data     8:86150365184 16384  37.547   2295808         6  C0A70401:51E1B12A cli 192.167.4.1     Prefetch  PrefetchWorkerThread
10:34:47.246567  R        data     12:9777807360 16384  33.634   2295808        10  C0A70401:51E1B12C cli 192.167.4.2     Prefetch  PrefetchWorkerThread
10:34:47.246567  R        data    11:91428995072 16384  36.871   2295808         9  C0A70401:51E1B12D cli 192.167.4.2     Prefetch  PrefetchWorkerThread
10:34:47.243946  R        data    9:170315186176 16384  43.591   2295808         7  C0A70401:51E1B130 cli 192.167.4.1     Prefetch  PrefetchWorkerThread
Benchmark execution and results
Operation             1m       4m        16m
GSS26-write (MB/sec)  3956:30  11302:19  1.960:.0
GSS26-read (MB/sec)   $9;6:5;  13915:3$  15193:61
GSS24-write (MB/sec)  3023:23  6699:2$   111.;:36
GSS24-read (MB/sec)   .9;6:02  9515:$$   13;65:60
ior -i 2 -p -d 10 -w -r -e -t 16m -b 32G -o /ibm/fs2-16m/shared/ior//iorfile
-i N repetitions -- number of repetitions of test
-d N interTestDelay -- delay between reps in seconds
-w writeFile -- write file
-r readFile -- read existing file
-e fsync -- perform fsync upon POSIX write close
-t N transferSize -- size of transfer in bytes (e.g.: 8, 4k, 2m, 1g)
-b N blockSize -- contiguous bytes to write per task (e.g.: 8, 4k, 2m, 1g)
-o S testFile -- full name for test
A word of caution: The achieved numbers depend on the right client
configuration and a good interconnect, and can vary between environments. They
should not be used in RFIs as committed numbers, but rather to demonstrate the
technical capabilities of the product in good conditions.
Read Benchmark
A word of caution: The achieved numbers depend on the right client
configuration and a good interconnect, and can vary between environments. They
should not be used in RFIs as committed numbers, but rather to demonstrate the
technical capabilities of the product in good conditions.
Write Benchmark
A word of caution: The achieved numbers depend on the right client
configuration and a good interconnect, and can vary between environments. They
should not be used in RFIs as committed numbers, but rather to demonstrate the
technical capabilities of the product in good conditions.
GPFS Parameters explained
General Parameter
Modern servers have multiple memory regions that are attached to a given socket. By default Linux
allocates data for a given process from only 1 NUMA region. This parameter tells GPFS to round-robin
across all regions, so that it does not run into an out-of-memory condition when it reaches the limit
of one region while the remaining regions still have plenty of memory left.
mmchconfig numaMemoryInterleave=yes
pagepool defines the amount of physical memory that should be pinned by GPFS at startup. It is
used in various places of the code, but from a performance perspective it is required to cache data
and metadata objects (indirect blocks, directory blocks).
mmchconfig pagepool=38g
Defines the maximum number of buffer descriptors. For each data block (full block or fragment) or
directory block you want to hold in the cache you need to have exactly 1.
mmchconfig maxBufferDescs=2m
Percentage of page pool used for file prefetching. It needs to be less than the default of 20% since
most of the page pool was given to GNR.
mmchconfig prefetchPct=5
Allow the largest possible GPFS block size and GNR vdisk track size
mmchconfig maxblocksize=16m
Number of recent IOs whose target address and response times are recorded. Default 512.
mmchconfig ioHistorySize=64k
Defines the number of Multiclass / Non-critical worker threads to be started
mmchconfig maxGeneralThreads=1280
maxMBpS affects the depth of prefetching for sequential file access. It should be set at least as
large as the maximum expected hardware bandwidth.
mmchconfig maxMBpS=16000
GPFS Parameters explained
Housekeeping / cache related settings
syncIntervalStrict defines whether we should only follow the syncInterval (default 30) value rather than
the main interval of the OS-triggered sync, which happens on Linux every 5 seconds. This has a
very big positive impact on workloads with buffered writes.
mmchconfig syncIntervalStrict=yes
These are all about cleaning "files" so OpenFile objects can be stolen and re-used. To steal an
OpenFile object the whole file (data & metadata) must be flushed.
flushedDataTarget: number of OpenFile objects where data have been flushed already
flushedInodeTarget: number of OpenFile objects where data & metadata have been flushed
maxFileCleaners: number of threads flushing data and/or metadata
mmchconfig flushedDataTarget=1024
mmchconfig flushedInodeTarget=1024
mmchconfig maxFileCleaners=1024
These are cleaning data buffers, so sync doesn't have to flush data blocks
mmchconfig maxBufferCleaners=1024
Number of GPFS log buffers. Having lots of these allows the log to absorb bursts of log appends.
For systems with large page pools (1 G or more), log buffers are the size of the metadata block
size, and there is a separate set of such buffers for each file system. Default 3.
mmchconfig logBufferCount=20
GPFS log flush controls. When the log becomes logWrapThresholdPct full, the log flush code is activated
to flush dirty objects so the log records that describe their updates can be discarded. This
percentage defaults to 50%, and although there is some code to allow changing it, modifying this
value is not supported by mmchconfig. Log wrap will start logWrapThreads flush threads (default
8), which will flush enough dirty objects so the recovery start position can be moved forward by
logWrapAmountPct percent (default 10%).
mmchconfig logWrapAmountPct=2
mmchconfig logWrapThreads=128
GPFS Parameters explained
Number of active allocation regions for disk allocation. Larger numbers can improve allocation
performance, but high numbers should not be used for large clusters. Default is 4.
mmchconfig maxAllocRegionsPerNode=32
Size of the pool of threads that completes file deletions in the background. Default is 4.
mmchconfig maxBackgroundDeletionThreads=16
Maximum number of threads that prefetch inode tokens of deleted files to speed up file creates.
Default is 8.
mmchconfig maxInodeDeallocPrefetch=128
Maximum number of simultaneous local GPFS requests. Default 48.
mmchconfig worker1Threads=1024
maxFilesToCache should be set fairly large to assist with local workload. It can be set very
large in small client clusters, but should remain small on clients in large clusters to avoid
excessive memory use on the token servers. The stat cache is not effective on Linux, so it
should always be small.
mmchconfig maxFilesToCache=128k
mmchconfig maxStatCache=512
Pre-steal some page pool space to reduce the latency of acquiring a free buffer. preStealCount is
the option to specify a hard number vs. Pct. The way it works: if set to 10000, 5000 go to
32k, 2500 to 16k, 1250 to 8k, ....
mmchconfig preStealCount=1000
mmchconfig preStealPct=1
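The halving rule described for preStealCount can be written out as a short loop. This is only an illustration of the arithmetic in the text: the function name presteal_split is invented, and since the text trails off after the 8k entry, the sketch stops there too rather than guessing how far the series continues.

```shell
# presteal_split COUNT
# Print how a preStealCount of COUNT is spread over buffer sizes, following the
# halving rule from the text: half to 32k, half of that to 16k, and so on.
# (hypothetical helper; stops at 8k, the last size the text names)
presteal_split() (
    count="$1"
    size_k=32
    while [ "$size_k" -ge 8 ]; do
        count=$(( count / 2 ))
        echo "${size_k}k $count"
        size_k=$(( size_k / 2 ))
    done
)

presteal_split 10000
```

For the value 10000 used in the text this prints the same 5000, 2500 and 1250 split for 32k, 16k and 8k.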
GPFS Parameters explained
syncBackgroundThreads defines how many threads are allowed to run in parallel to flush data
during regular sync intervals. Default 16.
syncWorkerThreads: number of threads in parallel to flush data during explicit sync (sync command,
or crsnapshot, or unmount, ...)
mmchconfig syncBackgroundThreads=64
mmchconfig syncWorkerThreads=256
These settings influence the inode prefetch behaviour for "ls -l"
InodePrefectFirstDirblock: set to "yes" to have inode prefetch read the first block of each
subdir as well. Defaults to no.
InodePrefetchThreshold defines how many stat's we wait for before starting to prefetch inodes.
Default is 5; make it smaller to start inode prefetch sooner.
InodePrefetchWindow defines how close together in time the stat's have to be to trigger inode
prefetch. Default is 0.5 seconds, which means the 5 stat's all have to be within half a second of
each other, otherwise we'll ignore them. You need to make it larger to trigger inode prefetch
even if stat's are coming in more slowly. Units are in milliseconds,
e.g., setting it to 2500 will make the window 2.5 seconds.
mmchconfig InodePrefectFirstDirblock=yes
mmchconfig InodePrefetchThreshold=5
mmchconfig InodePrefetchWindow=500
General number of inode prefetch threads to use. Default 8.
mmchconfig worker3Threads=32
pitWorkerThreadsPerNode specifies how many threads do restripe, data movement, etc ...
Default is threadsPerNode = MIN(16, (numberOfDisks * 4) / numberOfNodes + 1), so 16, or less if
there are fewer than about four LUNs
mmchconfig pitWorkerThreadsPerNode=16
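The default formula for pitWorkerThreadsPerNode is easy to evaluate for a given cluster. The following sketch assumes integer division, which the text does not state, and uses a made-up function name, pit_default_threads.

```shell
# pit_default_threads NUMBER_OF_DISKS NUMBER_OF_NODES
# Evaluate MIN(16, (numberOfDisks * 4) / numberOfNodes + 1) from the text.
# Integer division is an assumption of this sketch.
pit_default_threads() (
    disks="$1"
    nodes="$2"
    n=$(( disks * 4 / nodes + 1 ))
    if [ "$n" -gt 16 ]; then
        n=16
    fi
    echo "$n"
)

pit_default_threads 100 4    # many disks per node: capped at 16
pit_default_threads 3 1      # fewer than four LUNs: stays below 16
```

Under this reading, any node with at least four disks to itself reaches the cap of 16, which is consistent with the remark about fewer than about four LUNs.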
GPFS Parameters explained
prefetchAggressiveness defines how aggressively to prefetch data
0 means never prefetch
1 means prefetch on 2nd access if sequential
2 means prefetch on 1st access at offset 0 or 2nd sequential access anywhere else
3 means prefetch on 1st access anywhere
In 3.3, the default was 3 (prefetchOnFirstAccess), which means it would always prefetch
immediately, even if the first access is in the middle of the file.
In GPFS 3.4, the default is 2 (prefetchNormal), which means if you start reading at the
beginning of the file, it will start prefetching immediately, but if you start reading
somewhere in the middle of the file, it waits until the second read to confirm that the access
is sequential before it starts prefetching. With the setting of 1 (prefetchOnSecondAccess), it
will wait for a second read, even if the first read was at the beginning of the file.
Since 3.5 you can specify read and write aggressiveness independently.
mmchconfig prefetchAggressiveness=2
mmchconfig prefetchAggressivenessRead=-1
mmchconfig prefetchAggressivenessWrite=-1
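The four prefetchAggressiveness levels can be restated as a small decision function. This is a reading of the rules listed above, not GPFS code: should_prefetch and its arguments are invented for illustration.

```shell
# should_prefetch LEVEL ACCESS_NO OFFSET SEQUENTIAL
#   LEVEL       prefetchAggressiveness value (0..3)
#   ACCESS_NO   1 for the first access to the file, 2 for the second, ...
#   OFFSET      byte offset of this access
#   SEQUENTIAL  yes if this access follows the previous one sequentially, else no
# Prints yes or no. (hypothetical helper restating the rules from the text)
should_prefetch() (
    level="$1"; access_no="$2"; offset="$3"; sequential="$4"
    case "$level" in
        0) echo no ;;
        1) if [ "$access_no" -ge 2 ] && [ "$sequential" = yes ]; then echo yes; else echo no; fi ;;
        2) if [ "$access_no" -eq 1 ] && [ "$offset" -eq 0 ]; then echo yes
           elif [ "$access_no" -ge 2 ] && [ "$sequential" = yes ]; then echo yes
           else echo no; fi ;;
        3) echo yes ;;
        *) echo "unknown level: $level" >&2; exit 1 ;;
    esac
)

should_prefetch 2 1 0 no       # first read at the start of the file
should_prefetch 2 1 4096 no    # first read in the middle of the file
should_prefetch 1 1 0 no       # level 1 always waits for a second access
```

The difference between levels 1 and 2 only shows on the very first read at offset 0, which is exactly the case the text singles out.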
ignorePrefetchLUNCount tells the NSD client not to limit the number of requests based on the
number of visible LUNs (as they can have a large number of physical disks behind them), and
rather to limit by the max number of buffers and prefetch threads. Defaults to no.
mmchconfig ignorePrefetchLUNCount=yes
GPFS Parameters explained
Communication Related Parameter
tscWorkerPool defines the number of threads per class of receive workers
mmchconfig tscWorkerPool=64
nsdInlineWriteMax defines the maximum allowed single IO size to use inline writes. Defaults to 1k.
mmchconfig nsdInlineWriteMax=32k
This needs to be set larger than the default for server nodes that may have connections to many
clients, since it indirectly controls the number of TCP connections managed by each receiver
thread.
mmchconfig maxReceiverThreads=32
RDMA port configuration
mmchconfig verbsPorts="mlx4_0/1 mlx4_0/2 mlx4_1/1 mlx4_1/2"
Enable RDMA in general. If this is set to disable, all RDMA communication is shut off.
mmchconfig verbsRdma=enable
Defines the minimum size of a packet to use RDMA, also see nsdInlineWriteMax
mmchconfig verbsRdmaMinBytes=16k
Turns verbsSend on, a low-level IB inline transfer method
mmchconfig verbsRdmaSend=yes
Max number of outstanding transfers at a time per connection
mmchconfig verbsRdmasPerConnection=256
Max number of outstanding transfers at a time for the entire node
mmchconfig verbsRdmasPerNode=1024
How much dedicated pagepool for verbs communication
mmchconfig verbsSendBufferMemoryMB=1024