High end workstations or other systems in need of larger number of disks typically use
SCSI disks:
o The SCSI standard supports up to 16 targets on each SCSI bus, one of which is
generally the host adapter and the other 15 of which can be disk or tape drives.
o A SCSI target is usually a single drive, but the standard also supports up to 8 units
within each target. These would generally be used for accessing individual disks
within a RAID array. (See below.)
o The SCSI standard also supports multiple host adapters in a single computer, i.e.
multiple SCSI busses.
o Modern advancements in SCSI include "fast" and "wide" versions, as well as
SCSI-2.
o SCSI cables may be either 50 or 68 conductors. SCSI devices may be external as
well as internal.
FC is a high-speed serial architecture that can operate over optical fiber or four-conductor
copper wires, and has two variants:
o A large switched fabric having a 24-bit address space. This variant allows for
multiple devices and multiple hosts to interconnect, forming the basis for the
storage-area networks, SANs, to be discussed in a future section.
o The arbitrated loop, FC-AL that can address up to 126 devices (drives and
controllers.)
Network-Attached Storage
Network attached storage connects storage devices to computers using a remote
procedure call, RPC, interface, typically with something like NFS file system mounts.
This is convenient for allowing several computers in a group common access and naming
conventions for shared storage.
NAS can be implemented using SCSI cabling, or ISCSI uses Internet protocols and
standard network connections, allowing long-distance remote access to shared files.
NAS allows computers to easily share data storage, but tends to be less efficient than
standard host-attached storage.
Fig: Network-attached storage.
Storage-Area Network
A Storage-Area Network, SAN, connects computers and storage devices in a network,
using storage protocols instead of network protocols.
106
One advantage of this is that storage access does not tie up regular networking
bandwidth.
SAN is very flexible and dynamic, allowing hosts and devices to attach and detach on the
fly.
SAN is also controllable, allowing restricted access to certain hosts and devices.
Fig: Storage-area network.
Disk Scheduling
As mentioned earlier, disk transfer speeds are limited primarily by seek times and
rotational latency. When multiple requests are to be processed there is also some inherent
delay in waiting for other requests to be processed.
Bandwidth is measured by the amount of data transferred divided by the total amount of
time from the first request being made to the last transfer being completed, (for a series of
disk requests.)
Both bandwidth and access time can be improved by processing requests in a good order.
Disk requests include the disk address, memory address, number of sectors to transfer,
and whether the request is for reading or writing.
FCFS Scheduling
First-Come First-Serve is simple and intrinsically fair, but not very efficient. Consider in the
following sequence the wild swing from cylinder 122 to 14 and then back to 124:
107
Fig: FCFS disk scheduling.
SSTF Scheduling
Shortest Seek Time First scheduling is more efficient, but may lead to starvation if a
constant stream of requests arrives for the same general area of the disk.
SSTF reduces the total head movement to 236 cylinders, down from 640 required for the
same set of requests under FCFS. Note, however that the distance could be reduced still
further to 208 by starting with 37 and then 14 first before processing the rest of the
requests.
Fig: SSTF disk scheduling.
108
SCAN Scheduling
The SCAN algorithm, a.k.a. the elevator algorithm moves back and forth from one end of the
disk to the other, similarly to an elevator processing requests in a tall building.
Fig: SCAN disk scheduling.
Under the SCAN algorithm, if a request arrives just ahead of the moving head then it will
be processed right away, but if it arrives just after the head has passed, then it will have to
wait for the head to pass going the other way on the return trip. This leads to a fairly wide
variation in access times which can be improved upon.
Consider, for example, when the head reaches the high end of the disk: Requests with
high cylinder numbers just missed the passing head, which means they are all fairly
recent requests, whereas requests with low numbers may have been waiting for a much
longer time. Making the return scan from high to low then ends up accessing recent
requests first and making older requests wait that much longer.
C-SCAN Scheduling
The Circular-SCAN algorithm improves upon SCAN by treating all requests in a circular queue
fashion - Once the head reaches the end of the disk, it returns to the other end without processing
any requests, and then starts again from the beginning of the disk:
109
Fig: C-SCAN disk scheduling.
LOOK Scheduling
LOOK scheduling improves upon SCAN by looking ahead at the queue of pending requests, and
not moving the heads any farther towards the end of the disk than is necessary. The following
diagram illustrates the circular form of LOOK:
Fig: C-LOOK disk scheduling.
110
Selection of a Disk-Scheduling Algorithm
With very low loads all algorithms are equal, since there will normally only be one
request to process at a time.
For slightly larger loads, SSTF offers better performance than FCFS, but may lead to
starvation when loads become heavy enough.
For busier systems, SCAN and LOOK algorithms eliminate starvation problems.
The actual optimal algorithm may be something even more complex than those discussed
here, but the incremental improvements are generally not worth the additional overhead.
Some improvement to overall file system access times can be made by intelligent
placement of directory and/or inode information. If those structures are placed in the
middle of the disk instead of at the beginning of the disk, then the maximum distance
from those structures to data blocks is reduced to only one-half of the disk size. If those
structures can be further distributed and furthermore have their data blocks stored as
close as possible to the corresponding directory structures, then that reduces still further
the overall time to find the disk block numbers and then access the corresponding data
blocks.
On modern disks the rotational latency can be almost as significant as they seek time,
however it is not within the OSes control to account for that, because modern disks do
not reveal their internal sector mapping schemes, ( particularly when bad blocks have
been remapped to spare sectors. )
o Some disk manufacturers provide for disk scheduling algorithms directly on their
disk controllers, ( which do know the actual geometry of the disk as well as any
remapping ), so that if a series of requests are sent from the computer to the
controller then those requests can be processed in an optimal order.
o Unfortunately there are some considerations that the OS must take into account
that are beyond the abilities of the on-board disk-scheduling algorithms, such as
priorities of some requests over others, or the need to process certain requests in a
particular order. For this reason OSes may elect to spoon-feed requests to the disk
controller one at a time in certain situations.
Disk Management
Disk Formatting
Before a disk can be used, it has to be low-level formatted, which means laying down all
of the headers and trailers marking the beginning and ends of each sector. Included in the
header and trailer are the linear sector numbers, and error-correcting codes, ECC, which
allow damaged sectors to not only be detected, but in many cases for the damaged data to
be recovered (depending on the extent of the damage.) Sector sizes are traditionally 512
bytes, but may be larger, particularly in larger drives.
111
ECC calculation is performed with every disk read or write, and if damage is detected but
the data is recoverable, then a soft error has occurred. Soft errors are generally handled
by the on-board disk controller, and never seen by the OS. (See below.)
Once the disk is low-level formatted, the next step is to partition the drive into one or
more separate partitions. This step must be completed even if the disk is to be used as a
single large partition, so that the partition table can be written to the beginning of the
disk.
After partitioning, then the file systems must be logically formatted, which involves
laying down the master directory information (FAT table or inode structure), initializing
free lists, and creating at least the root directory of the file system. (Disk partitions which
are to be used as raw devices are not logically formatted. This saves the overhead and
disk space of the file system structure, but requires that the application program manage
its own disk storage requirements. )
Boot Block
Computer ROM contains a bootstrap program (OS independent) with just enough code to
find the first sector on the first hard drive on the first controller, load that sector into
memory, and transfer control over to it. ( The ROM bootstrap program may look in
floppy and/or CD drives before accessing the hard drive, and is smart enough to
recognize whether it has found valid boot code or not. )
The first sector on the hard drive is known as the Master Boot Record, MBR, and
contains a very small amount of code in addition to the partition table. The partition table
documents how the disk is partitioned into logical disks, and indicates specifically which
partition is the active or boot partition.
The boot program then looks to the active partition to find an operating system, possibly
loading up a slightly larger / more advanced boot program along the way.
In a dual-boot ( or larger multi-boot ) system, the user may be given a choice of which
operating system to boot, with a default action to be taken in the event of no response
within some time frame.
Once the kernel is found by the boot program, it is loaded into memory and then control
is transferred over to the OS. The kernel will normally continue the boot process by
initializing all important kernel data structures, launching important system services (e.g.
network daemons, sched, init, etc.), and finally providing one or more login prompts.
Boot options at this stage may include single-user a.k.a. maintenance or safe modes, in
which very few system services are started - These modes are designed for system
administrators to repair problems or otherwise maintain the system.
112
Fig: Booting from disk in Windows 2000.
Bad Blocks
No disk can be manufactured to 100% perfection, and all physical objects wear out over
time. For these reasons all disks are shipped with a few bad blocks, and additional blocks
can be expected to go bad slowly over time. If a large number of blocks go bad then the
entire disk will need to be replaced, but a few here and there can be handled through
other means.
In the old days, bad blocks had to be checked for manually. Formatting of the disk or
running certain disk-analysis tools would identify bad blocks, and attempt to read the data
off of them one last time through repeated tries. Then the bad blocks would be mapped
out and taken out of future service. Sometimes the data could be recovered, and
sometimes it was lost forever. (Disk analysis tools could be either destructive or non-
destructive.)
Modern disk controllers make much better use of the error-correcting codes, so that bad
blocks can be detected earlier and the data usually recovered. (Recall that blocks are
tested with every write as well as with every read, so often errors can be detected before
the write operation is complete, and the data simply written to a different sector instead.)
Note that re-mapping of sectors from their normal linear progression can throw off the
disk scheduling optimization of the OS, especially if the replacement sector is physically
far away from the sector it is replacing. For this reason most disks normally keep a few
spare sectors on each cylinder, as well as at least one spare cylinder. Whenever possible a
bad sector will be mapped to another sector on the same cylinder, or at least a cylinder as
close as possible. Sector slipping may also be performed, in which all sectors between the
113
bad sector and the replacement sector are moved down by one, so that the linear
progression of sector numbers can be maintained.
If the data on a bad block cannot be recovered, then a hard error has occurred. which
requires replacing the file(s) from backups, or rebuilding them from scratch.
Swap-Space Management
Modern systems typically swap out pages as needed, rather than swapping out entire
processes. Hence the swapping system is part of the virtual memory management system.
Managing swap space is obviously an important task for modern OSes.
Swap-Space Use
The amount of swap space needed by an OS varies greatly according to how it is used.
Some systems require an amount equal to physical RAM; some want a multiple of that;
some want an amount equal to the amount by which virtual memory exceeds physical
RAM, and some systems use little or none at all!
Some systems support multiple swap spaces on separate disks in order to speed up the
virtual memory system.
Swap-Space Location
Swap space can be physically located in one of two locations:
As a large file which is part of the regular file system. This is easy to implement, but
inefficient. Not only must the swap space be accessed through the directory system, the
file is also subject to fragmentation issues. Caching the block location helps in finding the
physical blocks, but that is not a complete fix.
As a raw partition, possibly on a separate or little-used disk. This allows the OS more
control over swap space management, which is usually faster and more efficient.
Fragmentation of swap space is generally not a big issue, as the space is re-initialized
every time the system is rebooted. The downside of keeping swap space on a raw
partition is that it can only be grown by repartitioning the hard drive.
Swap-Space Management: An Example
Historically OSes swapped out entire processes as needed. Modern systems swap out
only individual pages, and only as needed. (For example process code blocks and other
blocks that have not been changed since they were originally loaded are normally just
freed from the virtual memory system rather than copying them to swap space, because it
is faster to go find them again in the file system and read them back in from there than to
write them out to swap space and then read them back.)
In the mapping system shown below for Linux systems, a map of swap space is kept in
memory, where each entry corresponds to a 4K block in the swap space. Zeros indicate
114
free slots and non-zeros refer to how many processes have a mapping to that particular
block (>1 for shared pages only.)
Fig: The data structures for swapping on Linux systems.
RAID Structure
The general idea behind RAID is to employ a group of hard drives together with some
form of duplication, either to increase reliability or to speed up operations, (or sometimes
both.)
RAID originally stood for Redundant Array of Inexpensive Disks, and was designed to
use a bunch of cheap small disks in place of one or two larger more expensive ones.
Today RAID systems employ large possibly expensive disks as their components,
switching the definition to Independent disks.
Improvement of Reliability via Redundancy
The more disks a system has, the greater the likelihood that one of them will go bad at
any given time. Hence increasing disks on a system actually decreases the Mean Time to
Failure, MTTF of the system.
If, however, the same data was copied onto multiple disks, then the data would not be lost
unless both (or all) copies of the data were damaged simultaneously, which a MUCH
lower probability than for a single disk is going bad. More specifically, the second disk
would have to go bad before the first disk was repaired, which brings the Mean Time to
Repair into play. For example if two disks were involved, each with a MTTF of 100,000
hours and a MTTR of 10 hours, then the Mean Time to Data Loss would be 500 * 10^6
hours, or 57,000 years!
This is the basic idea behind disk mirroring, in which a system contains identical data on
two or more disks.
115
o Note that a power failure during a write operation could cause both disks to
contain corrupt data, if both disks were writing simultaneously at the time of the
power failure. One solution is to write to the two disks in series, so that they will
not both become corrupted (at least not in the same way) by a power failure. And
alternate solution involves non-volatile RAM as a write cache, which is not lost in
the event of a power failure and which is protected by error-correcting codes.
Improvement in Performance via Parallelism
There is also a performance benefit to mirroring, particularly with respect to reads. Since
every block of data is duplicated on multiple disks, read operations can be satisfied from
any available copy, and multiple disks can be reading different data blocks
simultaneously in parallel. (Writes could possibly be sped up as well through careful
scheduling algorithms, but it would be complicated in practice.)
Another way of improving disk access time is with striping, which basically means
spreading data out across multiple disks that can be accessed simultaneously.
o With bit-level striping the bits of each byte are striped across multiple disks. For
example if 8 disks were involved, then each 8-bit byte would be read in parallel
by 8 heads on separate disks. A single disk read would access 8 * 512 bytes = 4K
worth of data in the time normally required to read 512 bytes. Similarly if 4 disks
were involved, then two bits of each byte could be stored on each disk, for 2K
worth of disk access per read or write operation.
o Block-level striping spreads a file system across multiple disks on a block-by-
block basis, so if block N were located on disk 0, then block N + 1 would be on
disk 1, and so on. This is particularly useful when file systems are accessed in
clusters of physical blocks. Other striping possibilities exist, with block-level
striping being the most common.
RAID Levels
Mirroring provides reliability but is expensive; Striping improves performance, but does
not improve reliability. Accordingly there are a number of different schemes that
combine the principals of mirroring and striping in different ways, in order to balance
reliability versus performance versus cost. These are described by different RAID levels,
as follows: (In the diagram that follows, "C" indicates a copy, and "P" indicates parity,
i.e. checksum bits.)
o Raid Level 0 - This level includes striping only, with no mirroring.
o Raid Level 1 - This level includes mirroring only, no striping.
o Raid Level 2 - This level stores error-correcting codes on additional disks,
allowing for any damaged data to be reconstructed by subtraction from the
remaining undamaged data. Note that this scheme requires only three extra disks
to protect 4 disks worth of data, as opposed to full mirroring. (The number of
116
disks required is a function of the error-correcting algorithms, and the means by
which the particular bad bit(s) is (are) identified.)
o Raid Level 3 - This level is similar to level 2, except that it takes advantage of the
fact that each disk is still doing its own error-detection, so that when an error
occurs, there is no question about which disk in the array has the bad data. As a
result a single parity bit is all that is needed to recover the lost data from an array
of disks. Level 3 also includes striping, which improves performance. The
downside with the parity approach is that every disk must take part in every disk
access, and the parity bits must be constantly calculated and checked, reducing
performance. Hardware-level parity calculations and NVRAM cache can help
with both of those issues. In practice level 3 is greatly preferred over level 2.
o Raid Level 4 - This level is similar to level 3, employing block-level striping
instead of bit-level striping. The benefits are that multiple blocks can be read
independently, and changes to a block only require writing two blocks (data and
parity) rather than involving all disks. Note that new disks can be added
seamlessly to the system provided they are initialized to all zeros, as this does not
affect the parity results.
o Raid Level 5 - This level is similar to level 4, except the parity blocks are
distributed over all disks, thereby more evenly balancing the load on the system.
For any given block on the disk(s), one of the disks will hold the parity
information for that block and the other N-1 disks will hold the data. Note that the
same disk cannot hold both data and parity for the same block, as both would be
lost in the event of a disk crash.
o Raid Level 6 - This level extends raid level 5 by storing multiple bits of error-
recovery codes, (such as the Reed-Solomon codes), for each bit position of data,
rather than a single parity bit. In the example shown below 2 bits of ECC are
stored for every 4 bits of data, allowing data recovery in the face of up to two
simultaneous disk failures. Note that this still involves only 50% increase in
storage needs, as opposed to 100% for simple mirroring which could only tolerate
a single disk failure.
117
Fig: RAID levels.
There are also two RAID levels which combine RAID levels 0 and 1 (striping and
mirroring) in different combinations, designed to provide both performance and
reliability at the expense of increased cost.
o RAID level 0 + 1 disks are first striped, and then the striped disks mirrored to
another set. This level generally provides better performance than RAID level 5.
o RAID level 1 + 0 mirrors disks in pairs, and then stripes the mirrored pairs. The
storage capacity, performance, etc. are all the same, but there is an advantage to
this approach in the event of multiple disk failures, as illustrated below:.
In diagram (a) below, the 8 disks have been divided into two sets of four,
each of which is striped, and then one stripe set is used to mirror the other
set.
If a single disk fails, it wipes out the entire stripe set, but the
system can keep on functioning using the remaining set.
However if a second disk from the other stripe set now fails, then
the entire system is lost, as a result of two disk failures.
In diagram (b), the same 8 disks are divided into four sets of two, each of
which is mirrored, and then the file system is striped across the four sets of
mirrored disks.
118
If a single disk fails, then that mirror set is reduced to a single disk,
but the system rolls on, and the other three mirror sets continue
mirroring.
Now if a second disk fails, (that is not the mirror of the already
failed disk), then another one of the mirror sets is reduced to a
single disk, but the system can continue without data loss.
In fact the second arrangement could handle as many as four
simultaneously failed disks, as long as no two of them were from
the same mirror pair.
Fig: RAID 0 + 1 and 1 + 0
Selecting a RAID Level
Trade-offs in selecting the optimal RAID level for a particular application include cost,
volume of data, need for reliability, need for performance, and rebuild time, the latter of
which can affect the likelihood that a second disk will fail while the first failed disk is
being rebuilt.
Other decisions include how many disks are involved in a RAID set and how many disks
to protect with a single parity bit. More disks in the set increases performance but
increases cost. Protecting more disks per parity bit saves cost, but increases the likelihood
that a second disk will fail before the first bad disk is repaired.
119
Extensions
RAID concepts have been extended to tape drives (e.g. striping tapes for faster backups or parity
checking tapes for reliability), and for broadcasting of data.
Problems with RAID
RAID protects against physical errors, but not against any number of bugs or other errors
that could write erroneous data.
ZFS adds an extra level of protection by including data block checksums in all inodes
along with the pointers to the data blocks. If data are mirrored and one copy has the
correct checksum and the other does not, then the data with the bad checksum will be
replaced with a copy of the data with the good checksum. This increases reliability
greatly over RAID alone, at a cost of a performance hit that is acceptable because ZFS is
so fast to begin with.
Fig: ZFS checksums all metadata and data.
Another problem with traditional file systems is that the sizes are fixed, and relatively
difficult to change. Where RAID sets are involved it becomes even harder to adjust file
system sizes, because a file system cannot span across multiple file systems.
ZFS solves these problems by pooling RAID sets, and by dynamically allocating space to
file systems as needed. File system sizes can be limited by quotas, and space can also be
reserved to guarantee that a file system will be able to grow later, but these parameters
can be changed at any time by the file system’s owner. Otherwise file systems grow and
shrink dynamically as needed.
120