PostgreSQL + ZFS
Best Practices and Standard Procedures
"If you are not using ZFS,
you are losing data*."
3 Clarke's Three Laws
1. When a distinguished but elderly scientist states that something is possible,
he is almost certainly right. When he states that something is impossible, he
is very probably wrong.
2. The only way of discovering the limits of the possible is to venture a little
way past them into the impossible.
3. Any sufficiently advanced technology is indistinguishable from magic.
ZFS is not magic, but it is an incredibly impressive piece of software.
4 PostgreSQL and ZFS
•Many bits
•Lots of bits
•Huge bits
•It's gunna be great
•Very excited
•We have the best filesystems
•People tell me this is true
•Except the fake media, they didn't tell me this
5 PostgreSQL and ZFS: It's about the bits and storage, stupid.
•Many bits
•Lots of bits
•Huge bits
•It's gunna be great
•Very excited
•We have the best filesystems
•People tell me this is true
•Except the fake media, they didn't tell me this
Too soon?
6 PostgreSQL and ZFS
1. Review PostgreSQL from a storage administrator's perspective
2. Learn what it takes to become a PostgreSQL "backup expert"
3. Dive through a naive block-based filesystem
4. Walk through a high-level abstraction of ZFS
5. See some examples of how to use ZFS with PostgreSQL
•Tips
•Tunables
•Anecdotes
Some FS minutiae may have been harmed in the making of this talk.
Nit-pick as necessary (preferably after).
7 PostgreSQL - A Storage Administrator's View
•User-land page cache maintained by PostgreSQL in shared memory
•8K page size
• Each PostgreSQL table is backed by one or more files in $PGDATA/
•Tables larger than 1GB are automatically split into individual 1GB segment files (example listing at the end of this slide)
•pwrite(2)'s to tables are:
•append-only if no free pages in the table are available
•in-place updated if free pages are available in the free-space map
• pwrite(2)'s are page-aligned
•Makes heavy use of a Write Ahead Log (WAL), aka an Intent Log
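For a concrete feel (hypothetical OIDs and sizes, not from this deck): a ~2.5GB table with OID 16385 in database 16384 shows up as 1GB segment files:
$ ls -lh $PGDATA/base/16384/16385*
-rw------- 1 postgres postgres 1.0G Mar 2 18:06 16385
-rw------- 1 postgres postgres 1.0G Mar 2 18:06 16385.1
-rw------- 1 postgres postgres 512M Mar 2 18:06 16385.2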
8 Storage Administration: WAL on Disk
•WAL files are written to sequentially
•append-only IO
• Still 8K page-aligned writes via pwrite(2)
•WAL logs are 16MB each, pre-allocated
•WAL logs are never unlink(2)'ed, only recycled via rename(2)
• Low-latency pwrite(2)'s and fsync(2)'s for WAL files are required for good write performance
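Illustrative listing (made-up segment names) of a pg_xlog/ directory full of pre-allocated 16MB WAL segments:
$ ls -lh $PGDATA/pg_xlog | head -4
total 49M
-rw------- 1 postgres postgres 16M Mar 3 21:40 000000010000000000000001
-rw------- 1 postgres postgres 16M Mar 3 21:40 000000010000000000000002
-rw------- 1 postgres postgres 16M Mar 3 21:40 000000010000000000000003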
9 PostgreSQL - Backups
Traditionally, only two SQL commands that you must know:
1.pg_start_backup('my_backup')
2.${some_random_backup_utility} $PGDATA/
3.pg_stop_backup()
Wait for pg_start_backup() to return
before backing up $PGDATA/ directory.
10 PostgreSQL - Backups
Only two^Wthree SQL commands that you must know:
1.CHECKPOINT
2.pg_start_backup('my_backup')
3.${some_random_backup_utility} $PGDATA/
4.pg_stop_backup()
Manual CHECKPOINT if you can't twiddle the
args to pg_start_backup().
11 PostgreSQL - Backups
Only two^Wthree^Wtwo commands that you must know:
1.CHECKPOINT
2.pg_start_backup('my_backup', true)
3.${some_random_backup_utility} $PGDATA/
4.pg_stop_backup()
pg_start_backup('my_backup', true)
a.k.a. aggressive checkpointing (vs default perf hit of:
0.5 * checkpoint_completion_target)
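Stitched together as a minimal sketch, with tar standing in for ${some_random_backup_utility} (backup path and filename are illustrative):
$ psql -U postgres -c "SELECT pg_start_backup('my_backup', true)"
$ tar -C "$PGDATA" -czf /backups/pgdata-$(date +%Y%m%d).tar.gz .
$ psql -U postgres -c "SELECT pg_stop_backup()"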
14 Quick ZFS Primer
15 Quick ZFS Primer
TIP: Look for parallels.
16 Quick ZFS Primer: Features (read: why you must use ZFS)
• Never inconsistent (no fsck(8)'s required, ever)
•Filesystem atomically moves from one consistent state to another consistent state
•All blocks are checksummed
•Compression builtin
•Snapshots are free and unlimited
•Clones are easy
•Changes accumulate in memory, flushed to disk in a transaction
•Redundant metadata (and optionally data)
•Filesystem management independent of physical storage management
•Log-Structured Filesystem
•Copy on Write (COW)
17 Feature Consequences (read: how your butt gets saved)
•bitrot detected and automatically corrected if possible
•phantom writes
•misdirected reads or writes by the drive heads
•DMA parity errors
•firmware or driver bugs
•RAM capacitors aren't refreshed fast enough or with enough power
•Phenomenal sequential and random IO write performance
•Performance increase for sequential reads
•Cost of ownership goes down
•New tricks and tools to solve "data gravity" problems
18 ELI5: Block Filesystems vs Log-Structured Filesystems
19 Block Filesystems: Top-Down
Userland Application
write(fd, buffer, cnt) buffer
Userland
20 Block Filesystems: Top-Down
Userland Application
write(fd, buffer, cnt) buffer
Userland
Kernel
VFS Layer
Logical File: PGDATA/global/1
21 Block Filesystems: Top-Down
Userland Application
write(fd, buffer, cnt) buffer
Userland
Kernel
VFS Layer
Logical File: PGDATA/global/1
System Buffers
22 Block Filesystems: Top-Down
Userland Application
write(fd, buffer, cnt) buffer
Userland
Kernel
VFS Layer
Logical File: PGDATA/global/1
System Buffers
Logical File Blocks
0 1 2 3 4
23 Block Filesystems: Top-Down
Kernel
VFS Layer
Logical File: PGDATA/global/1
System Buffers
Logical File Blocks
0 1 2 3 4
Physical Storage Layer
(Pretend this is a spinning disk)
0: #8884 1: #7014 2: #9971 3: #0016 4: #0317
24 Block Filesystems: PostgreSQL Edition
Userland Application cnt = 2
write(fd, buffer, cnt) 8k buffer
Userland
25 Block Filesystems: PostgreSQL Edition
Userland Application cnt = 2
write(fd, buffer, cnt) 8k buffer
Userland
Kernel
VFS Layer
Logical File: PGDATA/global/1
System Buffers
Logical File Blocks
0 1 2 3
26 Block Filesystems: PostgreSQL Edition
Kernel
VFS Layer
Logical File: PGDATA/global/1
System Buffers
Logical File Blocks
0 1 2 3
Physical Storage Layer
0: #8884 1: #7014 2: #9971 3: #0016
27 Quiz Time
What happens when you twiddle a bool in a row?
UPDATE foo_table SET enabled = FALSE WHERE id = 123;
28 Quiz Answer: Write Amplification
UPDATE foo_table SET enabled = FALSE WHERE id = 123;
One 8K page holds up to ~182 foo_table tuples. Flipping a single bool dirties that tuple's page, and the userland application still issues write(fd, buffer, cnt) for the full 8k buffer.
29 ZFS Tip: postgresql.conf: full_page_writes=off
ALTER SYSTEM SET full_page_writes=off;
CHECKPOINT;
-- Restart PostgreSQL
IMPORTANT NOTE: full_page_writes=off interferes with cascading replication
30 Block Filesystems: PostgreSQL Edition
•buffers can be 4K
•disk sectors are 512B - 4K
•ordering of writes is important
•consistency requires complete cooperation and coordination
(Diagram: Userland Application issues write(fd, buffer, cnt) with an 8k buffer (cnt = 2) → VFS Layer → Logical File: PGDATA/global/1 → System Buffers → Logical File Blocks 0 1 2 3)
31 ZFS Filesystem Storage Abstraction
Physical Storage is
decoupled
from
Filesystems.
If you remember one thing from this section,
this is it.
32 VDEVs On the Bottom
VDEV: raidz VDEV: mirror
IO Scheduler IO Scheduler
disk1 disk2 disk3 disk4 disk5 disk6
zpool: rpool or tank
33 Filesystems On Top
VFS
Dataset Name Mountpoint
tank/ROOT /
tank/db /db
canmount=off
tank/ROOT/usr /usr
tank/local none
tank/local/etc /usr/local/etc
34 Offensively Over Simplified Architecture Diagram
ZPL - ZFS POSIX Layer
Filesystem zvol
Datasets
DSL - Dataset and Snapshot Layer
VDEV: raidz VDEV: mirror
IO Scheduler IO Scheduler
disk1 disk2 disk3 disk4 disk5 disk6
zpool: rpool or tank
35 ZFS is magic until you know how it fits together
VFS
Dataset Name Mountpoint
tank/ROOT /
tank/db /db
tank/ROOT/usr /usr
tank/local none
tank/local/etc /usr/local/etc
ZPL - ZFS POSIX Layer
Filesystem zvol
Datasets
DSL - Dataset and Snapshot Layer
VDEV: raidz VDEV: mirror
IO Scheduler IO Scheduler
disk1 disk2 disk3 disk4 disk5 disk6
zpool: rpool or tank
36 Log-Structured Filesystems: Top-Down
37 Log-Structured Filesystems: Top-Down
Disk Block with
foo_table Tuple
38 ZFS: User Data Block Lookup via ZFS Posix Layer
uberblock
Disk Block with
foo_table Tuple
39 ZFS: User Data + File dnode
t1
40 ZFS: Object Set
t2
t1
41 ZFS: Meta-Object Set Layer
t3
t2
t1
42 ZFS: Uberblock
t4
t3
t2
t1
43 At what point did the filesystem become inconsistent?
t4
t3
t2
t1
44 At what point could the filesystem become inconsistent?
t4
At t1
t3
t2
t1
45 How? I lied while explaining the situation. Alternate Truth.
Neglected to highlight ZFS is Copy-On-Write (read: knowingly committed
perjury in front of a live audience)
46 How? I lied while explaining the situation. Alternate Truth.
ZFS is Copy-On-Write
What's not been deleted and is on disk is immutable.
(read: I nearly committed perjury in front of a live audience by knowingly
withholding vital information)
47 ZFS is Copy-On-Write
Disk Block with
foo_table Tuple
t1
48 At what point did the filesystem become inconsistent?
t2
t1
49 At what point did the filesystem become inconsistent?
t3
t2
t1
50 At what point did the filesystem become inconsistent?
t4
t3
t2
t1
51 At what point could the filesystem become inconsistent?
t4
NEVER
t1
t2
t3
52 TIL about ZFS: Transactions and Disk Pages
• Transaction groups are flushed to disk every N seconds (defaults to 5s)
•A transaction group (txg) in ZFS is called a "checkpoint"
•User Data can be modified as it's written to disk
•All data is checksummed
•Compression should be enabled by default
53 ZFS Tip: ALWAYS enable compression
$ zfs get compression
NAME PROPERTY VALUE SOURCE
rpool compression off default
rpool/root compression off default
$ sudo zfs set compression=lz4 rpool
$ zfs get compression
NAME PROPERTY VALUE SOURCE
rpool compression lz4 local
rpool/root compression lz4 inherited from rpool
•Across ~7PB of PostgreSQL and mixed workloads and applications:
compression ratio of ~2.8:1 was the average.
•Have seen >100:1 compression on some databases
(cough this data probably didn't belong in a database cough)
•Have seen as low as 1.01:1
54 ZFS Tip: ALWAYS enable compression
$ zfs get compression
NAME PROPERTY VALUE SOURCE
rpool compression off default
rpool/root compression off default
$ sudo zfs set compression=lz4 rpool
$ zfs get compression
NAME PROPERTY VALUE SOURCE
rpool compression lz4 local
rpool/root compression lz4 inherited from rpool
I have yet to see compression slow down benchmarking results or real world
workloads. My experience is with:
•spinning rust (7.2K RPM, 10K RPM, and 15K RPM)
•fibre channel connected SANs
•SSDs
•NVME
55 ZFS Tip: ALWAYS enable compression
$ zfs get compressratio
NAME PROPERTY VALUE SOURCE
rpool compressratio 1.64x -
rpool/db compressratio 2.58x -
rpool/db/pgdb1-10 compressratio 2.61x -
rpool/root compressratio 1.62x -
•Use lz4 by default everywhere.
•Use gzip-9 only for archive servers
•Never mix-and-match compression where you can't suffer the
consequences of lowest-common-denominator performance
•Anxious to see ZStandard support (I'm looking at you Allan Jude)
56 ZFS Perk: Data Locality
•Data written at the same time is stored near each other because it's frequently
part of the same record
•Data can now pre-fault into kernel cache (ZFS ARC) by virtue of the temporal
adjacency of the related pwrite(2) calls
•Write locality + compression=lz4 + pg_repack == PostgreSQL Dream Team
57 ZFS Perk: Data Locality
•Data written at the same time is stored near each other because it's frequently
part of the same record
•Data can now pre-fault into kernel cache (ZFS ARC) by virtue of the temporal
adjacency of the related pwrite(2) calls
•Write locality + compression=lz4 + pg_repack == PostgreSQL Dream Team
If you don't know what pg_repack is, figure out how to move into a database
environment that supports pg_repack and use it regularly.
https://reorg.github.io/pg_repack/ && https://github.com/reorg/pg_repack/
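A minimal usage sketch (database and table names are made up; the pg_repack extension must already be installed in that database):
$ psql -U postgres -d proddb -c 'CREATE EXTENSION pg_repack'
$ pg_repack -U postgres -d proddb --table=foo_table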
58 Extreme ZFS Warning: Purge all memory of dedup
•This is not just my recommendation, it's also from the community and author
of the feature.
•These are not the droids you are looking for
•Do not pass Go
•Do not collect $200
•Go straight to system unavailability jail
•The feature works, but you run the risk of bricking your ZFS server.
Ask after if you are curious, but here's a teaser:
What do you do if the dedup hash tables don't fit in RAM?
Bitrot is a Studied Phenomenon
63 TIL: Bitrot is here
•TL;DR: 4.2% -> 34% of SSDs have one UBER per year
64 TIL: Bitrot Roulette
(1-(1-uberRate)^(numDisks)) = Probability of UBER/server/year
(1-(1-0.042)^(20)) = 58% (highest quality SSD drives on the market)
(1-(1-0.34)^(20)) = 99.975% (lowest quality commercially viable SSD drives on the market)
65 Causes of bitrot are Internal and External
External Factors for UBER on SSDs:
• Temperature
• Bus Power Consumption
• Data Written by the System Software
• Workload changes due to SSD failure
In a Datacenter no-one can hear your bits scream...
...except maybe they can.
68 Take Care of your bits
Answer their cry for help.
69 Take Care of your bits
Similar studies and research exist for:
•Fibre Channel
•SAS
•SATA
•Tape
•SANs
•Cloud Object Stores
70 So what about PostgreSQL?
"...I told you all of that, so I can tell you this..."
71 ZFS Terminology: VDEV
VDEV | vē-dēv
noun
a virtual device
•Physical drive redundancy is handled at the VDEV level
•Zero or more physical disks arranged like a RAID set:
•mirror
•stripe
•raidz
•raidz2
•raidz3
72 ZFS Terminology: zpool
zpool | zē-pool ͞
noun
an abstraction of physical storage made up of a set of VDEVs
Lose a VDEV, lose the zpool.
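Illustrative only (made-up device names): a zpool striped across two mirror VDEVs.
# zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
Losing one disk in a mirror is survivable; losing an entire VDEV loses the pool.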
73 ZFS Terminology: ZPL
ZPL | zē-pē-el
noun
ZFS POSIX Layer
•Layer that handles the impedance mismatch between POSIX filesystem
semantics and the ZFS "object database."
74 ZFS Terminology: ZIL
ZIL | zil
noun
ZFS Intent Log
•The ZFS analog of PostgreSQL's WAL
•If you use a ZIL:
•Use disks that have low-latency writes
•Mirror your ZIL
•If you lose your ZIL, whatever data had not made it to the main data disks
will be lost. The PostgreSQL equivalent of: rm -rf pg_xlog/
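If you do add a dedicated log device, a sketch of mirroring it (made-up NVMe device names):
# zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1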
75 ZFS Terminology: ARC
ARC | ärk
noun
Adaptive Replacement Cache
•ZFS's page cache
•ARC will grow or shrink to match usage, up to all of the available memory
TIP: Limit ARC's max size to a percentage of physical memory
minus the shared_buffer cache for PostgreSQL minus the
kernel's memory overhead.
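One way to apply that tip on ZFS-on-Linux via the zfs_arc_max module parameter (the 16GiB value is purely illustrative; size it for your host):
# echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# echo 'options zfs zfs_arc_max=17179869184' >> /etc/modprobe.d/zfs.conf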
76 ZFS Terminology: Datasets
dataset | dædə ˌsɛt
noun
A filesystem or volume ("zvol")
•A ZFS filesystem dataset uses the underlying zpool
• A dataset belongs to one and only one zpool
•Misc tunables, including compression and quotas, are set at the dataset level
77 ZFS Terminology: The Missing Bits
ZAP ZFS Attribute Processor
DMU Data Management Unit
DSL Dataset and Snapshot Layer
SPA Storage Pool Allocator
ZVOL ZFS Volume
ZIO ZFS I/O
RAIDZ RAID with variable-size stripes
L2ARC Level 2 Adaptive Replacement Cache
record unit of user data, think RAID stripe size
78 Storage Management
$ sudo zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 818M 56.8G 96K none
rpool/root 817M 56.8G 817M /
$ ls -lA -d /db
ls: cannot access '/db': No such file or directory
$ sudo zfs create rpool/db -o mountpoint=/db
$ sudo zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 818M 56.8G 96K none
rpool/db 96K 56.8G 96K /db
rpool/root 817M 56.8G 817M /
$ ls -lA /db
total 9
drwxr-xr-x 2 root root 2 Mar 2 18:06 ./
drwxr-xr-x 22 root root 24 Mar 2 18:06 ../
79 Storage Management
•Running out of disk space is bad, m'kay?
•Block file systems reserve ~8% of the disk space above 100%
•At ~92% capacity the performance of block allocators changes from "performance optimized" to "space optimized" (read: performance "drops").
80 Storage Management
•Running out of disk space is bad, m'kay?
•Block file systems reserve ~8% of the disk space above 100%
•At ~92% capacity the performance of block allocators changes from "performance optimized" to "space optimized" (read: performance "drops").
ZFS doesn't have an artificial pool of free
space: you have to manage that yourself.
81 Storage Management
$ sudo zpool list -H -o size
59.6G
$ sudo zpool list
The pool should never consume more than 80% of the available space
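One way to manage that headroom yourself (dataset name and size are illustrative): park a reservation on an empty, never-mounted dataset so the pool always keeps free space.
# zfs create -o canmount=off -o refreservation=10G rpool/headroom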
82 Storage Management
$ sudo zfs set quota=48G rpool/db
$ sudo zfs get quota rpool/db
NAME PROPERTY VALUE SOURCE
rpool/db quota 48G local
$ sudo zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 818M 56.8G 96K none
rpool/db 96K 48.0G 96K /db
rpool/root 817M 56.8G 817M /
83 Dataset Tuning Tips
• Disable atime
•Enable compression
• Tune the recordsize
•Consider tweaking the primarycache
84 ZFS Dataset Tuning
# zfs get atime,compression,primarycache,recordsize rpool/db
NAME PROPERTY VALUE SOURCE
rpool/db atime on inherited from rpool
rpool/db compression lz4 inherited from rpool
rpool/db primarycache all default
rpool/db recordsize 128K default
# zfs set atime=off rpool/db
# zfs set compression=lz4 rpool/db
# zfs set recordsize=16K rpool/db
# zfs set primarycache=metadata rpool/db
# zfs get atime,compression,primarycache,recordsize rpool/db
NAME PROPERTY VALUE SOURCE
rpool/db atime off local
rpool/db compression lz4 local
rpool/db primarycache metadata local
rpool/db recordsize 16K local
85 Discuss: recordsize=16K
•Pre-fault next page: useful for sequential scans
• With compression=lz4, reasonable to expect ~3-4x pages worth of data
in a single ZFS record
Anecdotes and Recommendations:
•Performed better in most workloads vs ZFS's prefetch
•Disabling prefetch isn't necessary, tends to still be a net win
•Monitor ARC cache usage
86 Discuss: primarycache=metadata
• metadata instructs ZFS's ARC to only cache metadata (e.g. dnode entries),
not page data itself
•Default: cache all data
Two different recommendations based on benchmark workloads:
•Enable primarycache=all where working set exceeds RAM
• Enable primarycache=metadata where working set fits in RAM
87 Discuss: primarycache=metadata
• metadata instructs ZFS's ARC to only cache metadata (e.g. dnode entries),
not page data itself
•Default: cache all data
•Double-caching happens
Two different recommendations based on benchmark workloads:
•Enable primarycache=all where working set exceeds RAM
• Enable primarycache=metadata where working set fits in RAM
Reasonable default anecdote: cap max ARC size at ~15%-25% of physical RAM, with ~50% of RAM going to shared_buffers
88 Performance Wins
2-4µs/pwrite(2)!!
89 Performance Wins
90 Performance Wins
91 Performance Wins
P.S. This was observed on 10K RPM spinning rust.
92 ZFS Always has your back
•ZFS will checksum every read from disk
•A failed checksum will result in a fault and automatic data reconstruction
•Scrubs do a background check of every record
•Schedule periodic scrubs
•Frequently for new and old devices
•Infrequently for devices in service between 6mo and 2.5yr
PSA: The "Compressed ARC" feature was added to catch checksum errors in RAM
Checksum errors are an early indicator of failing disks
93 Schedule Periodic Scrubs
# zpool status
pool: rpool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
sda1 ONLINE 0 0 0
errors: No known data errors
# zpool scrub rpool
# zpool status
pool: rpool
state: ONLINE
scan: scrub in progress since Fri Mar 3 20:41:44 2017
753M scanned out of 819M at 151M/s, 0h0m to go
0 repaired, 91.97% done
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
sda1 ONLINE 0 0 0
errors: No known data errors
(A non-zero value in any of the READ, WRITE, or CKSUM columns is bad™)
# zpool status
pool: rpool
state: ONLINE
scan: scrub repaired 0 in 0h0m with 0 errors on Fri Mar 3 20:41:49 2017
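A sketch of scheduling those periodic scrubs from cron (weekly cadence and zpool path are illustrative; tune the frequency to device age per the previous slide):
# crontab -l | tail -1
0 3 * * 0 /sbin/zpool scrub rpool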
94 One dataset per database
•Create one ZFS dataset per database instance
•General rules of thumb:
• Use the same dataset for $PGDATA/ and pg_xlog/
•Set a reasonable quota
•Optional: reserve space to guarantee minimal available space
95 One dataset per database
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 819M 56.8G 96K none
rpool/db 160K 48.0G 96K /db
rpool/root 818M 56.8G 818M /
# zfs create rpool/db/pgdb1
# chown postgres:postgres /db/pgdb1
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 819M 56.8G 96K none
rpool/db 256K 48.0G 96K /db
rpool/db/pgdb1 96K 48.0G 96K /db/pgdb1
rpool/root 818M 56.8G 818M /
# zfs set reservation=1G rpool/db/pgdb1
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 1.80G 55.8G 96K none
rpool/db 1.00G 47.0G 96K /db
rpool/db/pgdb1 96K 48.0G 12.0M /db/pgdb1
rpool/root 818M 55.8G 818M /
96 initdb like a boss
# su postgres -c 'initdb --no-locale -E=UTF8 -n -N -D /db/pgdb1'
Running in noclean mode. Mistakes will not be cleaned up.
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "C".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /db/pgdb1 ... ok
creating subdirectories ... ok
•Encode using UTF8, sort using C
•Only enable locale when you know you need it
• ~2x perf bump by avoiding calls to iconv(3) to figure out sort order
•DO NOT use PostgreSQL checksums or compression
97 Backups
# zfs list -t snapshot
no datasets available
# pwd
/db/pgdb1
# find . | wc -l
895
# head -1 postmaster.pid
25114
# zfs snapshot rpool/db/pgdb1@pre-rm
# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
rpool/db/pgdb1@pre-rm 0 - 12.0M -
# psql -U postgres
psql (9.6.2)
Type "help" for help.
postgres=# \q
# rm -rf *   (guilty pleasure during demos)
# ls -1 | wc -l
0
# psql -U postgres
psql: FATAL: could not open relation mapping file "global/pg_filenode.map":
No such file or directory
98 Backups: Has Them
$ psql
psql: FATAL: could not open relation mapping file "global/pg_filenode.map": No such file or directory
# cat postgres.log
LOG: database system was shut down at 2017-03-03 21:08:05 UTC
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
FATAL: could not open relation mapping file "global/pg_filenode.map": No such file or directory
LOG: could not open temporary statistics file "pg_stat_tmp/global.tmp": No such file or directory
LOG: could not open temporary statistics file "pg_stat_tmp/global.tmp": No such file or directory
...
LOG: could not open temporary statistics file "pg_stat_tmp/global.tmp": No such file or directory
LOG: could not open file "postmaster.pid": No such file or directory
LOG: performing immediate shutdown because data directory lock file is invalid
LOG: received immediate shutdown request
LOG: could not open temporary statistics file "pg_stat/global.tmp": No such file or directory
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: database system is shut down
# ll
total 1
drwx------ 2 postgres postgres 2 Mar 3 21:40 ./
drwxr-xr-x 3 root root 3 Mar 3 21:03 ../
99 Restores: As Important as Backups
# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
rpool/db/pgdb1@pre-rm 12.0M - 12.0M -
# zfs rollback rpool/db/pgdb1@pre-rm
# su postgres -c '/usr/lib/postgresql/9.6/bin/postgres -D /db/pgdb1'
LOG: database system was interrupted; last known up at 2017-03-03 21:50:57 UTC
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/14EE7B8
LOG: invalid record length at 0/1504150: wanted 24, got 0
LOG: redo done at 0/1504128
LOG: last completed transaction was at log time 2017-03-03 21:51:15.340442+00
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
Works all the time, every time, even with kill -9
(possible data loss from ungraceful shutdown and IPC cleanup notwithstanding)
100 Clone: Test and Upgrade with Impunity
# zfs clone rpool/db/pgdb1@pre-rm rpool/db/pgdb1-upgrade-test
# zfs list -r rpool/db
NAME USED AVAIL REFER MOUNTPOINT
rpool/db 1.00G 47.0G 96K /db
rpool/db/pgdb1 15.6M 48.0G 15.1M /db/pgdb1
rpool/db/pgdb1-upgrade-test 8K 47.0G 15.2M /db/pgdb1-upgrade-test
# echo "Test pg_upgrade"
# zfs destroy rpool/db/pgdb1-upgrade-test
# zfs clone rpool/db/pgdb1@pre-rm rpool/db/pgdb1-10
# echo "Run pg_upgrade for real"
# zfs promote rpool/db/pgdb1-10
# zfs destroy rpool/db/pgdb1
Works all the time, every time, even with kill -9
(possible data loss from ungraceful shutdown and IPC cleanup notwithstanding)
101 Tip: Naming Conventions
• Use a short prefix not on the root filesystem (e.g. /db)
•Encode the PostgreSQL major version into the dataset name
•Give each PostgreSQL cluster its own dataset (e.g. pgdb01)
•Optional but recommended:
•one database per cluster
•one app per database
•encode environment into DB name
•encode environment into DB username
Suboptimal                  Good
rpool/db/pgdb1              rpool/db/prod-db01-pg94
rpool/db/myapp-shard1       rpool/db/prod-myapp-shard001-pg95
rpool/db/dbN                rpool/db/prod-dbN-pg10
Be explicit: codify the tight coupling between
PostgreSQL versions and $PGDATA/.
102 Defy Gravity
•Take and send snapshots to remote servers
•zfs send emits a snapshot to stdout: treat as a file or stream
•zfs receive reads a snapshot from stdin
•TIP: If available:
• Use the -s argument to zfs receive
•Use zfs get receive_resume_token on the receiving end to get the
required token to resume an interrupted send: zfs send -t <token>
Unlimited flexibility. Compress, encrypt,
checksum, and offsite to your heart's content.
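A resume-flow sketch (dataset name and token value are made up; requires a ZFS with resumable send/receive):
# zfs send -v rpool/db/pgdb1-10@pre-rm | zfs receive -s -v rpool/db/pgdb1-10-standby
... stream is interrupted mid-transfer ...
# zfs get -H -o value receive_resume_token rpool/db/pgdb1-10-standby
1-abc1234de-f0-...
# zfs send -t 1-abc1234de-f0-... | zfs receive -s -v rpool/db/pgdb1-10-standby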
103 Defy Gravity
# zfs send -v -L -p -e rpool/db/pgdb1-10@pre-rm > /dev/null
send from @ to rpool/db/pgdb1-10@pre-rm estimated size is 36.8M
total estimated size is 36.8M
TIME SENT SNAPSHOT
# zfs send -v -L -p -e \
rpool/db/pgdb1-10@pre-rm | \
zfs receive -v \
rpool/db/pgdb1-10-receive
send from @ to rpool/db/pgdb1-10@pre-rm estimated size is 36.8M
total estimated size is 36.8M
TIME SENT SNAPSHOT
received 33.8MB stream in 1 seconds (33.8MB/sec)
# zfs list -t snapshot
NAME USED AVAIL REFER MOUNTPOINT
rpool/db/pgdb1-10@pre-rm 8K - 15.2M -
rpool/db/pgdb1-10-receive@pre-rm 0 - 15.2M -
104 Defy Gravity: Incrementally
•Use a predictable snapshot naming scheme
•Send snapshots incrementally
•Clean up old snapshots
•Use a monotonic snapshot number (a.k.a. "vector clock")
Remember to remove old snapshots.
Distributed systems bingo!
105 Defy Gravity: Incremental
# echo "Change PostgreSQL's data"
# zfs snapshot rpool/db/pgdb1-10@example-incremental-001
# zfs send -v -L -p -e \
-i rpool/db/pgdb1-10@pre-rm \
rpool/db/pgdb1-10@example-incremental-001 \
> /dev/null
send from @pre-rm to rpool/db/pgdb1-10@example-incremental-001
estimated size is 2K
total estimated size is 2K
# zfs send -v -L -p -e \
-i rpool/db/pgdb1-10@pre-rm \
rpool/db/pgdb1-10@example-incremental-001 | \
zfs receive -v \
rpool/db/pgdb1-10-receive
send from @pre-rm to rpool/db/pgdb1-10@example-incremental-001
estimated size is 2K
total estimated size is 2K
receiving incremental stream of rpool/db/pgdb1-10@example-
incremental-001 into rpool/db/pgdb1-10-receive@example-incremental-001
received 312B stream in 1 seconds (312B/sec)
106 Defy Gravity: Vector Clock
# echo "Change more PostgreSQL's data: VACUUM FULL FREEZE"
# zfs snapshot rpool/db/pgdb1-10@example-incremental-002
# zfs send -v -L -p -e \
-i rpool/db/pgdb1-10@example-incremental-001 \
rpool/db/pgdb1-10@example-incremental-002 \
> /dev/null
send from @example-incremental-001 to rpool/db/pgdb1-10@example-
incremental-002 estimated size is 7.60M
total estimated size is 7.60M
TIME SENT SNAPSHOT
# zfs send -v -L -p -e \
-i rpool/db/pgdb1-10@example-incremental-001 \
rpool/db/pgdb1-10@example-incremental-002 | \
zfs receive -v \
rpool/db/pgdb1-10-receive
send from @example-incremental-001 to rpool/db/pgdb1-10@example-
incremental-002 estimated size is 7.60M
total estimated size is 7.60M
receiving incremental stream of rpool/db/pgdb1-10@example-incremental-002
into rpool/db/pgdb1-10-receive@example-incremental-002
TIME SENT SNAPSHOT
received 7.52MB stream in 1 seconds (7.52MB/sec)
107 Defy Gravity: Cleanup
# zfs list -t snapshot -o name,used,refer
NAME USED REFER
rpool/db/pgdb1-10@example-incremental-001 8K 15.2M
rpool/db/pgdb1-10@example-incremental-002 848K 15.1M
rpool/db/pgdb1-10-receive@pre-rm 8K 15.2M
rpool/db/pgdb1-10-receive@example-incremental-001 8K 15.2M
rpool/db/pgdb1-10-receive@example-incremental-002 0 15.1M
# zfs destroy rpool/db/pgdb1-10-receive@pre-rm
# zfs destroy rpool/db/pgdb1-10@example-incremental-001
# zfs destroy rpool/db/pgdb1-10-receive@example-incremental-001
# zfs list -t snapshot -o name,used,refer
NAME USED REFER
rpool/db/pgdb1-10@example-incremental-002 848K 15.1M
rpool/db/pgdb1-10-receive@example-incremental-002 0 15.1M
108 Controversial: logbias=throughput
•Measure tps/qps
•Time duration of an outage (OS restart plus WAL replay, e.g. 10-20min)
•Measure cost of back pressure from the DB to the rest of the application
•Use a txg timeout of 1 second
Position: since ZFS will never be inconsistent and therefore PostgreSQL will
never lose integrity, 1s of actual data loss is a worthwhile tradeoff for a ~10x
performance boost in write-heavy applications.
Rationale: loss aversion costs organizations more than potentially losing 1s
of data. Back pressure is a constant cost the rest of the application needs to
absorb due to continual fsync(2)'ing of WAL data. Architectural cost and
premature engineering costs need to be factored in. Penny-wise, pound-foolish.
109 Controversial: logbias=throughput
# cat /sys/module/zfs/parameters/zfs_txg_timeout
5
# echo 1 > /sys/module/zfs/parameters/zfs_txg_timeout
# echo 'options zfs zfs_txg_timeout=1' >> /etc/modprobe.d/zfs.conf
# psql -c 'ALTER SYSTEM SET synchronous_commit=off'
ALTER SYSTEM
# zfs set logbias=throughput rpool/db
QUESTIONS?
Email sean@chittenden.org Twitter: @SeanChittenden
sean@hashicorp.com