PDB 2 Session Persistence

The document discusses data persistence in hard disk drives, highlighting their characteristics, advantages, and limitations. It covers topics such as hard disk organization, I/O timings, and the evolution of hard disk capacity over time. Additionally, it addresses the importance of disk cache and the requirements for different types of data storage applications.


Practical DBMS
Prof. Joseph G. Vella ©
Dept. of Computer Information Systems
JOSEPH.G.VELLA@UM.EDU.MT

Data Persistence
Devices, Aggregate Devices, and Techniques


The Scope of Things – A Simple Start


Hard disks


The hard facts

• Hard disk drives offer data persistence.

• Hard disks are:
  – Inexpensive (per unit capacity);
  – But fragile.

• Hard disks offer direct and sequential access, but:
  – Have very (very) slow access speeds;
  – Have a limited bandwidth (a maximum transfer rate exists);
  – Organisation is block based (e.g., 4 KB).

• Hard disks and [OS] file systems have developed concurrently.


Hard Disks and File Systems


[Diagram: rendered files ↔ logical reading & writing ↔ organisation ↔ padding & unpadding ↔ physical files]


Hard Disk Capacity Over Time


Unit Data Space Cost on Hard Disks

YEAR  MANUFACTURER                          COST/GB
1956  IBM                                   $1,000,000
1980  North Star                            $193,000
1981  Morrow Designs                        $138,000
1983  Davong                                $119,000
1984  Pegasus (Great Lakes)                 $80,000
1985  First Class Peripherals               $71,000
1987  Iomega                                $45,000
1988  IBM                                   $16,000
1990  First Class Peripherals               $12,000
1991  Western Digital                       $9,000
1992  Iomega                                $7,000
1994  Iomega                                $2,000
1995  Seagate                               $850
1996  Maxtor                                $259
1997  Maxtor                                $93
1998  Quantum                               $43
1999  Fujitsu IDE                           $16
2000  Maxtor 7200rpm UDMA/66                $9.58
2001  Maxtor 5400 rpm IDE                   $4.57
2002  Western Digital 7200 rpm              $2.68
2003  Maxtor 7200 rpm IDE                   $1.39
2004  Western Digital Caviar SE             $1.15
2011  WD Caviar Green (3 TB for $140)       $0.05


Forbes – Sales of HDD
(by quarter for Seagate, WD, Toshiba; Average Sales Price)


Forbes – 2020 Projection of Drive Delivery
(by Application)


The main parts


Same device – but more parts shown!


Head Stack Assembly for Rotary Actuator


Rotary Actuator with top magnet removed to view the voice coil.


High Level Components of a disk drive


Disk Space Organisation (i):

Not all sectors, tracks, and platters are available for storage!?

• Each sector contains some overhead – like management and ECC data.
• The rotations per minute (RPM) are impressive: from 4500 RPM to 18000 RPM.


Disk Space Organisation (i): continued

Not all space is available for storage!? I.e. whole tracks and platters are sometimes reserved.


Disk Space Organisation (ii):

The outer tracks have more sectors – the circumference is longer! This is Multi Zoning.


Disk Space Organisation (iii):


• A sector is the smallest unit of data that
can be read or written from a disk.
• A cluster is the smallest unit of data that
a file system can allocate for a file. Each
cluster has a fixed size that is always a
multiple of the sector size.
– A file is stored optimally on disk as a
series of contiguous clusters (clusters
that are in order on disk).
– When a file is split into multiple clusters on different areas of the disk, this is called external fragmentation.
• A track is a concentric ring of sectors on a
platter. A read/write head can read all the
data from a certain track by moving to a
position and then rotating the platter.
• A cylinder is a group of tracks in all the
platters that are on top of each other.


Hard Disk Characteristic Measures:


seek, rotational delay (latency), and transfer time

1. Move head (seek time);
2. Wait for sector (rotational delay);
3. Move the sector’s data to RAM (transfer time).

Important: Rotational latency is on average half the time of a complete turn.

Hard Disk I/O Timings

• Time of I/O (abbreviated as TIO) is expressed in terms of three characteristic measures:

  T_IO = T_seek + T_rotation + T_transfer

• The rate of I/O (abbreviated as RIO) is expressed as the size of the data blocks transferred over the time it takes:

  R_IO = Size_DataBlocks / T_IO
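A minimal sketch of these two measures in Python (the drive parameters below are illustrative assumptions, not taken from any particular product):

```python
# Sketch of T_IO and R_IO; all parameters are illustrative assumptions.
def t_io(seek_s, rpm, transfer_bps, size_bytes):
    t_rotation = 0.5 * 60.0 / rpm          # on average half a revolution
    t_transfer = size_bytes / transfer_bps
    return seek_s + t_rotation + t_transfer   # seconds

def r_io(size_bytes, t_io_s):
    return size_bytes / t_io_s              # bytes per second

# Example: a 4 KB read on a hypothetical 7,200 RPM drive with a 9 ms
# average seek and a 105 MB/s maximum transfer rate.
t = t_io(seek_s=0.009, rpm=7200, transfer_bps=105e6, size_bytes=4096)
print(f"T_IO = {t * 1000:.1f} ms, R_IO = {r_io(4096, t) / 1e6:.2f} MB/s")
```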


Disk Space Organisation (iv):


• A sector is the smallest unit of read/write storage. A sector is typically
512 bytes in size.
– A cluster is usually 4096 bytes.
– Most drives offer only sector level atomicity.
• Torn writes possible in case of cluster writes.

• A track is an important unit of storage as it holds all the sectors, on a disk platter’s surface, that can be read without moving the actuator.
– There can be thousands of tracks on a surface.

• A cylinder is an important unit of storage too:
– It is the total storage accessible for reading and writing without moving the actuators!
• Therefore only one seek time is required (i.e. to get to the required
cylinder)
• There are as many cylinders as tracks!


Hard Disks – expected self-evident facts

• It is expected that:
  the time for accessing two blocks in succession is shorter for two neighbouring blocks than for distant blocks.

• One can also assume (within certain limits!?) that:
  accessing blocks in a contiguous stream in a sequential read is much faster than reading the blocks one by one in direct mode.


Helping Sequential Reads (i)


[Diagram: sectors numbered 0–11 around a track]

• Track skew – the angular offset between tracks should be just greater than the track-to-track seek time required.
  – Sequential scans that cross over cylinders then avoid rotational delay.


Helping Sequential Reads (ii)


[Diagram: sectors 0–11 laid out consecutively on one track vs. interleaved around another]

• Interleaving – the jump should be just greater than the per-sector transfer time required.
  – This has become rare in the presence of track buffering!


Helping Hard Disks – Disk Cache


• All modern drives come with an ‘on-board’ memory cache (RAM).
  – It is sometimes called a buffer, or even a track buffer.
  – It is diminutive (from 16 MB) compared to a drive’s persistent capacity.

• It holds disk blocks read from the drive and blocks to be written to it.

• Other than holding a queue of blocks, the disk drive might read a cluster of sectors into it, rather than a single sector, to anticipate forthcoming requests.
  – What about writing back a cluster (i.e. a sector at a time) – can the cache help?
    • It depends on what is needed:
      If the writer requires confirmation of the write (write through) then ‘Not much’.
      If the writer does not require write notice (write back / immediate reporting) then ‘Yes’.
    – (If un-aided, this cluster write is in peril on drive failure.)


Average disk seek time approaches 1/3 of full seek time.

        to:  0  1  2  3  4  5  6  7  8
from 0       0  1  2  3  4  5  6  7  8
     1       1  0  1  2  3  4  5  6  7
     2       2  1  0  1  2  3  4  5  6
     3       3  2  1  0  1  2  3  4  5
     4       4  3  2  1  0  1  2  3  4
     5       5  4  3  2  1  0  1  2  3
     6       6  5  4  3  2  1  0  1  2
     7       7  6  5  4  3  2  1  0  1
     8       8  7  6  5  4  3  2  1  0

Average: 2.963

This is a model! In reality seek time is not linearly proportional to the inter-track distance!?
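The 2.963 figure, and the 1/3 limit, can be checked in a few lines of Python; a sketch assuming the linear cost model above (seek cost equals the track distance):

```python
# Average seek distance over all (from, to) track pairs, linear cost model.
def avg_seek_distance(n_tracks):
    total = sum(abs(i - j) for i in range(n_tracks) for j in range(n_tracks))
    return total / (n_tracks ** 2)

print(avg_seek_distance(9))                 # 2.963 for the 9-track table above
print(avg_seek_distance(10_000) / 9_999)    # tends towards 1/3 of a full seek
```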

Average disk seek time is 1/3 of full seek time.


Hard Disk Characteristics:


Sequential vs Direct / Random access

What drive operations are required to support and execute these?


Generic Disk Requirements

• Data Server:
– High RPM
– Lowest seek times
– High transfer bandwidth
• Personal Computers
– Capacity & lowest cost per unit storage
• Laptops
– Sturdy
– Lowest power consumption (low RPM, few platters, etc)
• Home entertainment
– Low mechanical noise!


How to read Hard Disk numbers

• Disk parameters, for example one could read:


– Transfer size is 8K bytes
– Advertised average seek time is 12 ms
– Disk spins at 7200 RPM
– Transfer rate is 4 MB/sec
– Disk cache available
• Controller overhead is 2 ms
• Assume that disk is idle so no queuing delay
• What is average disk access time for a sector?
– Avg seek + avg rotational delay + transfer time + controller overhead
– 12 ms + 0.5/(7200 RPM/60) + 8 KB/4 MB/s + 2 ms
– 12 + 4.15 + 2 + 2 = 20 ms
• Advertised seek time assumes no locality: with locality the actual seek is typically 1/4 to 1/3 of the advertised seek time: 12 ms → 3–4 ms
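As a cross-check, the same calculation in Python (the numbers are the ones quoted in the bullets above):

```python
# Average access time for one 8 KB transfer, using the figures above.
seek_ms       = 12.0                                   # advertised average seek
rotational_ms = 0.5 * 60.0 / 7200 * 1000               # half a turn at 7200 RPM
transfer_ms   = (8 * 1024) / (4 * 1024 * 1024) * 1000  # 8 KB at 4 MB/s
controller_ms = 2.0

print(seek_ms + rotational_ms + transfer_ms + controller_ms)   # ~20 ms
```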


A sample of disk drives!


Seagate Drives               Cheetah 15k.5 (data server)   Barracuda (PC)
Capacity (GB)                300                           1,000
RPM                          15,000                        7,200
Average Seek (ms)            4                             9
Max. Transfer Rate (MB/s)    125                           105
Platters                     4                             4
Cache (MB)                   16                            16/32
Interface                    SCSI                          SATA


Direct Access Workload:


Request small reads from anywhere on disk (e.g. 4K)

T_IO = T_seek + T_rotation + T_transfer          R_IO = Size_DataBlocks / T_IO

• Cheetah Drive
  ✓ Tseek = 4 ms {use average seek}
  ✓ Trot = (1/2)*(1/15000)*60 s = (1/2)*(1/250)*1000 ms = 2 ms
  ✓ Ttran = size / max transfer = 4096 / (125 * 10^6) s = 32 micro s
  ❑ TIO = 6 ms
  ✓ RIO = 4096 / .006
  ❑ RIO = 0.68 MB/s

• Barracuda Drive
  ✓ Tseek = 9 ms {use average seek}
  ✓ Trot = (1/2)*(1/7200)*60 s = (1/2)*(1/120)*1000 ms = 4.1 ms
  ✓ Ttran = size / max transfer = 4096 / (105 * 10^6) s = 39 micro s
  ❑ TIO = 13.1 ms
  ✓ RIO = 4096 / .0131
  ❑ RIO = 0.31 MB/s

Sequential Access Workload:


Request a long sequence of contiguous blocks (32 MB)

T_IO = T_seek + T_rotation + T_transfer          R_IO = Size_DataBlocks / T_IO

We do a seek, wait for the right sector and then have a long transfer.

• Cheetah Drive
  ✓ Tseek = 4 ms {use average seek}
  ✓ Trot = (1/2)*(1/15000)*60 s = (1/2)*(1/250)*1000 ms = 2 ms
  ✓ Ttran = size / max transfer = 33554432 / (125 * 10^6) s = .27 s
  ❑ TIO = .27 s
  ✓ RIO = 33554432 / .27
  ❑ RIO = 124 MB/s

• Barracuda Drive
  ✓ Tseek = 9 ms {use average seek}
  ✓ Trot = (1/2)*(1/7200)*60 s = (1/2)*(1/120)*1000 ms = 4.1 ms
  ✓ Ttran = size / max transfer = 33554432 / (105 * 10^6) s = .32 s
  ❑ TIO = .32 s
  ✓ RIO = 33554432 / .32
  ❑ RIO = 104 MB/s

Comparison of Direct & Sequential Examples

Transfer Rate per job         Cheetah    Barracuda
Direct (1 block of 4k)        0.68       0.31       MB/s
Sequential (32MB)             124.00     104.00     MB/s

Total Time per job            Cheetah    Barracuda
Direct (1 block of 4k)        6.0        13.1       ms
Sequential (32MB)             270.0      320.0      ms

Cheetah was designed as a high performance SCSI disk drive.
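Both tables can be regenerated from the drive figures quoted earlier (average seek, RPM, maximum transfer rate); a sketch in Python, assuming those figures:

```python
# Recompute T_IO and R_IO for the two workloads on both drives.
# Drive tuples: (average seek in s, RPM, max transfer rate in bytes/s).
DRIVES = {"Cheetah": (0.004, 15_000, 125e6), "Barracuda": (0.009, 7_200, 105e6)}
JOBS = {"direct (4 KB)": 4096, "sequential (32 MB)": 32 * 2**20}

for name, (seek, rpm, rate) in DRIVES.items():
    for job, size in JOBS.items():
        t = seek + 0.5 * 60.0 / rpm + size / rate          # T_IO in seconds
        print(f"{name:9s} {job:18s} T_IO = {t * 1000:7.1f} ms, "
              f"R_IO = {size / t / 1e6:6.2f} MB/s")
```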



Progress with Hard Disks:


but an apparent paradox appears
• Compare:
– The rate of growth in capacity (over time);
With the
– The rate of progress in seek (average) (over time).
• Continued advance in capacity (60%/yr) and bandwidth
(40%/yr.)
• Slow improvement in seek, rotation (8%/yr)

• Time to read whole disk


Year Sequentially Directly
1990 4 minutes 6 hours
2000 12 minutes 1 week

• It is apparent that the rate of growth in capacity is much faster than the improvement in (average) seek time:
  – Effectively we have more capacity with relatively slower access!?


Have we reached a performance plateau?


• Physics is universal and has limits, and if these are reached then the
limits are real.
• We have hinted that cache, at different levels (e.g., h/d, controller,
operating system, DB engine) can offer some interesting advantages
• But there are some other exciting possibilities with clever engineering
of hard disk drives. For example:
– Double and independent actuators
  (two read/write heads on each platter);
– Two actuators is a very difficult engineering problem because the heads need to be aligned together!

“Seagate lists the sustained, sequential transfer rate of the Mach.2 as up to 524MBps—easily double that of a fast "normal" rust disk and edging into SATA SSD territory. The performance gains extend into random I/O territory as well, with 304 IOPS read / 384 IOPS write and only 4.16 ms average latency. (Normal hard drives tend to be 100/150 IOPS and about the same average latency.)”


Empirical analysis on multiple servers (heads)


effect on rotational delay (rd)

Given two (or three) servers, i.e. heads:
• Try multiple accesses
– For each head generate a random
number (0< rd <=1)
– Calculate the minimum of:
• Of the first two
• Of the three
– Repeat it x times (e.g., 128)
– Aggregate the stats
• First, second, third server
– Average the rd
» should be close to .5
– Minimum & maximum should
cover the range 0 to 1
• First two, three
– Average the min rd
» Should be close to .35 for 2
» Should be close to .27 for 3
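A minimal Monte Carlo sketch of this experiment in Python (the 128-sample count and the uniform model for rd come from the slide; the exact expectations are 1/2, 1/3 and 1/4):

```python
import random

# Rotational delay (rd) seen when 1, 2 or 3 heads can serve the request;
# each head sees an independent uniform rd and the earliest one wins.
def experiment(samples=128, seed=0):
    rng = random.Random(seed)
    draws = [[rng.random() for _ in range(3)] for _ in range(samples)]
    avg = lambda xs: sum(xs) / len(xs)
    print("one head        :", avg([d[0] for d in draws]))        # ~0.5
    print("min of two heads:", avg([min(d[:2]) for d in draws]))  # ~1/3
    print("min of three    :", avg([min(d) for d in draws]))      # ~1/4

experiment()
```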


Hard Disk Interfaces


Model of a disk drive attached to a host system


ATA & Serial ATA Configuration Controllers


SCSI & Serial Attached SCSI (SAS) Configuration Controllers


Journey of a Byte
write(textfile, ch, 1); -- ch is assigned ‘P’
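The write() above only hands the byte to the runtime; a sketch of the journey in Python on a POSIX-like system (the file name is hypothetical), showing where the byte sits at each stage:

```python
import os

# One byte's journey: application buffer -> OS page cache -> disk drive.
with open("textfile", "wb", buffering=8192) as f:
    f.write(b"P")          # 1. 'P' lands in the application/runtime buffer
    f.flush()              # 2. buffer handed to the OS (kernel page cache)
    os.fsync(f.fileno())   # 3. OS asked to push the block down to the drive
    # 4. the drive's own cache may still hold it before the platter does
```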


HD, OS and FS Interaction: And Raw Device Access

[Diagram: layered stack – User Application Interface → DBMS → OS & FS → H/W (e.g. HD); the DBMS has direct access to the HD, by-passing the OS & FS.]

Its origin is from Unix, i.e. the Raw (Block) Device. Raw device access is available in Linux too!


(External) Fragmentation
• External Fragmentation happens!?
• The file’s spanning might not be
contiguous.
• How does it affect:
– Sequential reads
• It interrupts flow by
introducing a seek access …
– Direct Access
• Superficially none.
• BUT!
– It can break the locality
advantage …
– Allocation
• Although space is available
it’s not contiguous –
consequently increasing the
problem.


I/O Systems transfer mode


(Internal) Fragmentation

• Any data file needs to be spanned into a list of disk blocks (i.e.
sectors).
– Data files can't share any sector!
• Consider a data file of 6K (i.e. 6,000) bytes.
– The following table shows sector space utilisation.
Sectors in Cluster       1      2      4      8      16     32     64
Bytes in Cluster         512    1024   2048   4096   8192   16384  32768
Clusters needed          12     6      3      2      1      1      1
File size on disk        6144   6144   6144   8192   8192   16384  32768
Disk space utilisation   98%    98%    98%    73%    73%    37%    18%
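The table can be regenerated directly; a sketch in Python assuming the same 6,000-byte file and 512-byte sectors:

```python
import math

# Internal fragmentation of a 6,000-byte file for various cluster sizes.
FILE_BYTES, SECTOR = 6000, 512
for sectors in (1, 2, 4, 8, 16, 32, 64):
    cluster = sectors * SECTOR
    clusters = math.ceil(FILE_BYTES / cluster)
    on_disk = clusters * cluster
    print(f"{sectors:3d} sectors/cluster: {clusters:2d} clusters, "
          f"{on_disk:6d} bytes on disk, {FILE_BYTES / on_disk:4.0%} utilised")
```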


Hard Disk Internal & External Fragmentation

• There is an interesting study from Leffler et al 1989


(BSD Unix fame) about File System block size,
internal fragmentation, and access speed.
– These statistics are dated but most of the insights are
still valid.

Also, of interest is a preceding paper by:


Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry
A Fast File System for UNIX
ACM Trans. on Computer Systems 2(3), August 1984, pp. 181-197.


Issue of Which (i.e., to read and write)


• The where issue has to do with data placement into blocks … and issues of external fragmentation.
• The how issue has to do with data packing into blocks … and issues of internal fragmentation.
• We are still to discuss the ordering of disk bound operations:
  – E.g., should we prefer a read from an outer sector rather than a read from an inner sector?
  – E.g., should we write before we read?
• The fact that an OS takes care of this is not the point:
  1. The DBMS still needs a model for read & write sequencing;
  2. If a DBMS has raw (and direct) access to a hard disk then surely the OS is out of the picture (it does not interfere with the DBMS planning).
• A common (and naïve) approach is Shortest Seek Time First (see the sketch below).
  – It considers the actuator’s current location, orders the requests, and the shortest is executed.
  • Risk of Starvation …
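A minimal sketch of Shortest Seek Time First in Python (the cylinder numbers in the example are made up):

```python
# Shortest Seek Time First: always serve the pending request closest to
# the actuator's current cylinder.
def sstf(current, pending):
    order, pending = [], list(pending)
    while pending:
        nxt = min(pending, key=lambda cyl: abs(cyl - current))
        pending.remove(nxt)
        order.append(nxt)
        current = nxt
    return order

# Requests near the head keep jumping the queue, so distant requests
# (e.g. cylinder 180) can starve if nearby ones keep arriving.
print(sstf(50, [10, 95, 52, 47, 180, 55]))   # [52, 55, 47, 10, 95, 180]
```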

Aggregate Systems



Requirements

• HOW TO MAKE A LARGER, FASTER, & MORE RELIABLE DISK?


– New techniques, and
– Some trade-offs required.

• One technique is to use multiple disks in an aggregate to build a


faster, bigger, and more reliable disk system.
– These are called RAID units.

• To the F/S the RAID unit is a big, fast, and reliable (single) disk.
– When a F/S issues a logical I/O request to the RAID, it internally
must calculate which disks to access in order to complete the
request, and then issue one or more physical I/Os to do so.


Reliability:
Remember Disks Fail Badly!?

• A typical number for SCSI drives is 1.2 million hours!

• Seagate quote a Barracuda 7200.7 model with a 600,000-hour MTBF.
  – i.e., half the population / farm should fail in the first 600K hours.
  – Assuming:
    • The unit is in its first year of use;
    • Test results are in terms of arbitrary usage.

• What affects a unit’s reliability? Possibly:


– Number of Platters;
– Seek usage pattern (called duty cycle by engineers);
– Temperature; and
– Power consumption.


Some Definitions
• Disk Shadowing
– Making two or more copies of data written to a disk drive.

• Disk Duplexing
– A method of storing data whereby the data from one hard disk
is duplicated onto another, with each using its own hard disk
controller.
• In contrast to Disk Mirroring

• Disk Mirroring
– A method of storing data whereby the data from one hard disk
is duplicated on another, with both hard disks sharing a single
hard disk controller
• In contrast to Disk Duplexing


Conventional Computer System:
Note: Every component represents a point of failure.


Multi Controllers and Disk Mirroring reduce the likelihood of total disk failure.
In this case the system can survive the failure of any single disk or single disk controller.


Dual-ported disks with multiple controllers further enhance fault tolerance.
In this configuration the failure of a disk and a controller is tolerated.


A fault-tolerant system in which all components are replicated


A Reliability Indicator

• Mean Time Between Failures (MTBF) is measured by averaging the timespans a unit is continuously functional (the time between successive unplanned, non self-manageable down times).

  MTBF = Σ (Time of failure − Time restarted) / number of failures

• Manufacturers claim that drive population failures over time follow this graph.


Google’s Study of HD fleets (2005-2006):


100k ATA consumer drives with 5400-7200 RPM and 80 to 400 GB capacity

• Annualised failure rates by age groups over nine months:
  That failure rate correlates with drive model and age is confirmed in this study too.

• Annualised failure rates by age group / seek usage pattern over nine months:
  The data extracted is a surprise as it shows that intensive usage does not play a role for 1- to 4-year-old drives.

Google’s Study of HD fleets (continued)

• Although temperature is often stated as a primary cause of faults, Google’s study suggests there is more to it:
  – Failures do not correlate simply with an increase in temperature.
  – Rather, failures correlate with low temperatures.
    • Failure rates only pick up again at the higher temperatures.


Individual HDs fail; right!?


But what’s in a farm then?

• Say we have a device that is stated to have a MTBF of 12 years.


– Therefore, a single such device has a 50% chance to be alive by
12 years.
• Say we have two such devices.
– Therefore, two devices being alive by 12 years is 25% chance.
• Say we have three such devices.
– Therefore, the three devices being alive by 12 years is 12.5%
chance.

• Therefore, the more devices present in a single system, the less likely it is that all devices are still working by their MTBF.
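Under the slide's simplified model (each drive independently has a 50% chance of still being alive at its MTBF), the chance that a whole farm is still intact shrinks as 0.5^N; a sketch:

```python
# Probability that every drive in a farm of n drives is still alive at the
# MTBF, under the simplified model above (each survives with p = 0.5).
def all_alive(n_drives, p_single=0.5):
    return p_single ** n_drives

for n in (1, 2, 3, 10):
    print(f"{n:2d} drives: {all_alive(n):.4f}")   # 0.5000, 0.2500, 0.1250, 0.0010
```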


RAID ("Redundant Array of Inexpensive Disks")

• An aggregated unit is built from a number of simpler (e.g. cheaper) disks, CPUs, & RAM.
  – It offers larger, faster, and more reliable set-ups.
• There are a number of structural layouts:
  – RAID 1 to 5 being the original set-ups proposed by Patterson, Gibson & Katz in 1987.
  – These structures differ in terms of data placement and the type of redundancy built in.
  – Other structures are built by mixing and matching from the original designs – e.g., RAID 10 combines RAID 1 mirroring with RAID 0 striping, and RAID 6 extends RAID 5 with a second parity block.
  • Also, industry has changed the I to “independent”!
• Both ‘software’ and ‘hardware’ solutions are available.
• Many operating systems directly (and transparently) support these units too.

A RAID Evaluation is based on the following:

• Capacity
  – Given an aggregate of N drives, what portion of these is usable for storage?
    • For full redundancy we have N/2.
• Reliability
– What type of faults can a system withstand?
– How many faults can a system tolerate?
• Performance
– Different workloads are expected to have different measures.


RAID0 (Not a RAID really!?)


• Data is divided into blocks (or chunks) and
then these are written across an array.
– This is called "striping".
• Requires at minimum two drives.
• No recovery feature is provided in case of a
disk failure – i.e. no redundancy present.
– As more disks are added, the higher the
possibility that a disk failure occurs.
• This process enables high level performance
as parallel access to the data on different
disks improves speed of retrieval.
• Typical application: e.g. in graphic image
processing.


Chunk Sizes

• Chunk size and performance are somewhat related.

• The smaller the chunk, the more pieces a file requires and the more it is spread over the disks;
  – Positive: parallelism for reads and writes.
  – Negative: a seek time per drive is involved.
• The larger the chunk, the fewer pieces are required and the less the spread over the disks;
  – Positive: fewer seeks.
  – Negative: less parallelism.

• Observation: it’s hard to work out an ideal chunk size; one way to address the problem is by building a workload profile.
  – A workload is a mix of direct and sequential writes.
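The spread of a file over the disks as a function of chunk size can be made concrete with a RAID 0 style mapping; a sketch (chunk size, disk count and offsets are illustrative):

```python
# RAID 0 striping: map a logical byte offset to (disk index, offset on disk).
def stripe_map(logical_offset, chunk_size, n_disks):
    chunk_no = logical_offset // chunk_size
    disk = chunk_no % n_disks
    disk_offset = (chunk_no // n_disks) * chunk_size + logical_offset % chunk_size
    return disk, disk_offset

# With 64 KB chunks over 4 disks, the first four chunks land on four disks.
for chunk in range(4):
    offset = chunk * 64 * 1024
    print(offset, "->", stripe_map(offset, 64 * 1024, 4))
```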


RAID1
• This level uses "mirroring" to copy data onto two disk drives simultaneously.
  – Possibly twice the Read transaction rate of single disks, same Write transaction rate as single disks.
• RAID 1 provides failure tolerance.
  – If one disk fails, the other maintains the data.
• Re-establishing RAID 1 requires a straight copy from the live drive to the new drive.
• Storage cost doubles, as duplicating all data means only half the total disk capacity is available for storage.

• Application: high availability requirement.


RAID0+1
• RAID 0+1 is implemented as a
mirrored array whose segments
are RAID 0 arrays
• Use two set-ups in one array.
– Both data duplication and
improved access speed are
possible.
• High I/O rates are achieved
thanks to multiple stripe
segments.
• Four drives are the minimum.

• A single drive failure will cause the


whole set-up to become a RAID 0:
– Expensive;
– High storage overhead.

• Application: File server


RAID5
• Uses a technique that avoids the
concentration of I/O on a dedicated parity
disk by writing it separately across multiple
disks.
– Three drives are required as a minimum.
• “Write penalty” still occurs as existing data
must be pre-read before update and parity
data has to be updated after the data is
written.
• RAID 5 enables multiple write orders to be
implemented concurrently because updated
parity data is dispersed across the multiple
disks.
• Highest Read data transaction rate.
• Medium Write data transaction rate.
• Difficult to rebuild once a unit faults, when compared to RAID 1.
• Widely applicable: database and web servers.
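The parity idea behind RAID 5 is plain XOR: the parity chunk is the XOR of the data chunks in a stripe, so any single lost chunk can be rebuilt from the survivors. A minimal sketch (the chunk contents are made up):

```python
# RAID 5 parity: parity = XOR of all data chunks in a stripe; any one missing
# chunk is reconstructed by XOR-ing the remaining chunks with the parity.
def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

stripe = [b"data-A__", b"data-B__", b"data-C__"]
parity = xor_chunks(stripe)
rebuilt = xor_chunks([stripe[0], stripe[2], parity])   # chunk B is lost ...
assert rebuilt == stripe[1]                            # ... and recovered
```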


RAID10
• Not exactly like RAID 0+1, as in RAID 10 first we mirror and then stripe.
• RAID 10 is implemented as a striped array
whose segments are RAID 1 arrays.
– Requires a minimum of four drives.
• RAID 10 has the same fault tolerance as
RAID level 1
• RAID 10 has the same overhead for fault-
tolerance as mirroring alone

• High I/O rates are achieved by striping RAID


1 segments
• Under certain circumstances, RAID 10 array
can sustain multiple simultaneous drive
failures
• Excellent solution for sites that would have
otherwise gone with RAID 1 but need some
additional performance boost

RAID
Capacity, Reliability & Performance Comparison


Persistent Memory
The wedge between the CPU stores
and persistent storage


Non-volatile random-access memory (NVRAM)

https://www.snia.org/education/what-is-persistent-memory


What is NVRAM?

• NVRAM is a type of Random Access Memory (RAM) that retains its data
even when the main power is not available.
– Read access latency is quoted at: 100ns to 1000ns.
• Types of NVRAM:
– Uses SRAM that is made non-volatile by connecting it to an additional
power source, e.g., battery.
– Uses EEPROM (Electrically Erasable Programmable Read-Only Memory)
to save its data when power is not available. NVRAM has a combination
of SRAM and EEPROM semiconductors incorporated into one chip.
• Advantages
  – NVRAMs support high-speed data read/write operations for parallel processing and DBMS caching.
  – NVRAM can act as an in-unit cache for HDDs and SSDs.
  – NVRAM semiconductors are light on power consumption, so backup power exhaustion is unlikely to happen for a long time.
• Disadvantages
  – The write-to-read speed ratio is an issue (for performance).
  – Still very iffy, production wise.
  – Chips fail!


Read at your leisure!
An Example Package
• IP-NVRAM-1M Greenspring
Non-Volatile Memory Industry Pack Module
• Features:
• Single-Wide IndustryPack
• IndustryPack Wait State
• Lithium Battery
• The Greenspring IP-NVRAM Non-Volatile Memory IndustryPack
Module provides a convenient and reliable way to implement
non-volatile memory up to one megabyte in a single-high
IndustryPack.
Eight TSOP 128K x 8-bit low power SRAM chips and a lithium
battery provide 10 years (at room temperature) of maintenance-
free operation.
The IP-NVRAM powers-up ready to go. No software initialization
is required.
Four memory configurations were manufactured: 256 KB, 512 KB,
768 KB, and 1 MB. The one megabyte configuration and the 256
KB configuration are standard.
Access to the IP-NVRAM occurs in the IndustryPack memory
space.
A unique ID PROM identifies the IP-NVRAM. Users or systems
integrators may add information to the ID PROMs to indicate
user-specific information.
• https://www.artisantg.com/TestMeasurement/59561-1/Abaco-Systems-SBS-Greenspring-IP-NVRAM-1M-Non-Volatile-Memory-IndustryPack-Module


Basic Idea of Performance Measures with NVRAM and Main Memory Databases
• The paper by (Hoya, 2019) used
an industry Main Memory
Database and applied a well-
known benchmarking suite on
various set-ups. E.g.;
– HDD (the baseline);
– NVMe-SSD;
– NVRAM Log Buffer;
– NVRAM Data Access and Log
buffer.
• Note the transaction throughput
results:
– NVMe-SSD is 11 times better
than the HDD
– NVRAM Log buffer is >100
times better than the HDD
• Observation:
– Look at the NVRAM Log
Buffer setup throughput, and
note it decreases with the
higher number of threads.
What could be the reason?
K. Hoya, K. Hatsuda, K. Tsuchida, Y. Watanabe, Y. Shirota and T. Kanai, "A
perspective on NVRAM technology for future computing system," 2019
International Symposium on VLSI Design, Automation and Test (VLSI-DAT),
Hsinchu, Taiwan, 2019, pp. 1-2

SAN & NAS



Storage area network - SAN

• Computers and remote storage cabinets.


– Connection: SCSI over fiber-optic.
– Mounting of device allows storage to appear local.

• A SAN is a dedicated network that provides access to consolidated, block-level data storage. SANs are primarily used to make storage devices, such as disk arrays, tape libraries, and optical jukeboxes, accessible to servers so that the devices appear like locally attached devices to the operating system.




Network-attached storage - NAS

• Computer and remote storage.
  – Connection through NFS/CIFS [Common Internet FS] with TCP/IP.
  – Software: logon and direct access facilities.

• NAS is file-level computer data storage connected to a computer network, providing data access to a heterogeneous group of clients.
• NAS systems are networked appliances which contain one or more
hard drives, often arranged into logical, redundant storage
containers or RAID.


