Storage and File Structure
Chapter 10, Database System Concepts, by Silberschatz et al., 1997
Portions reproduced with permission
So far we have studied the DBMS at the level of the logical model. The logical model is the
correct level for database users to focus on, because the goal of a database system is to
simplify and facilitate access to data. As members of the development staff and as potential
database administrators, however, we need to understand the physical level better than a
typical user does.
Overview of Physical Storage Media
Storage media are classified by speed of access, by cost per unit of data, and by
reliability. Unfortunately, as speed and cost go up, reliability goes down.
1. Cache is the fastest and the most costly form of storage. The type of cache
referred to here is the kind typically built into the CPU chip, at 256KB, 512KB,
or 1MB. Cache is managed by the hardware and the operating system, and has
no application to the database, per se.
2. Main memory is the volatile memory in the computer system that is used to
hold programs and data. While prices have been dropping at a staggering rate,
the demand for memory has been increasing even faster. Today's 32-bit
computers are limited to 4GB of memory. This may not be sufficient to hold the
entire database and all the associated programs, but the more memory that is
available, the better the response time of the DBMS. There are attempts
underway to build systems with as much cost-effective memory as possible,
running a reduced operating system that supports only the DBMS, so that
system response can be improved. However, the contents of main memory are
lost if a power failure or system crash occurs.
3. Flash memory is also referred to as electrically erasable programmable read-
only memory (EEPROM). Since it is small (5 to 10MB) and expensive, it has
little or no application to the DBMS.
4. Magnetic-disk storage is the primary medium for long-term on-line storage
today. Prices have been dropping significantly with a corresponding increase in
capacity. New disks today exceed 20GB. Unfortunately, demand and the
volume of data have been increasing even faster, so organizations using a
DBMS are always trying to keep up with the demand for storage. This medium
is the most cost-effective for on-line storage of large databases.
5. Optical storage is very popular, especially CD-ROM systems. It is limited to
read-only data, but it can be reproduced at very low cost and is expected to
grow in popularity, especially as a replacement for printed manuals.
6. Tape storage is used for backup and archival data. It is cheaper and slower
than all of the other forms, but it has the feature that there is no limit on the
amount of data that can be stored, since more tapes can simply be purchased.
As tape capacities increase, however, restoration of data takes longer and
longer, especially when only a small amount of data is to be restored. This is
because retrieval is sequential, the slowest possible access method.
Magnetic Disks
A typical large commercial database may require hundreds of disks!
Physical Characteristics of Disks
Disks are actually relatively simple. There is normally a collection of platters on a
spindle. Each platter is coated with a magnetic material on both sides, and the data is
stored on the surfaces. There is a read-write head for each surface, mounted on an arm
assembly that moves back and forth. A motor spins the platters at a high constant
speed (60, 90, or 120 revolutions per second).
Each surface is divided into a set of tracks (concentric circles). Each track is divided
into a set of sectors; a sector is the smallest unit of data that can be written or read at
one time. Sectors range in size from 32 bytes to 4096 bytes, with 512 bytes being the
most common. The set of tracks at the same position on all surfaces of all platters is
called a cylinder.
Platters can range in size from 1.8 inches to 14 inches. Today, 5 1/4 inch and 3 1/2
inch platters are the most common, because they offer the best combination of low
seek times and low cost.
A disk controller interfaces between the computer system and the actual hardware of
the disk drive. The controller accepts high-level commands to read or write sectors and
converts them into the specific low-level commands required. The controller also
attempts to protect the integrity of the data by computing and storing a checksum for
each sector. When reading the data back, the controller recalculates the checksum and
makes several attempts to correctly read the data and get matching checksums. If the
controller is unsuccessful, it notifies the operating system of the failure.
The controller can also handle the elimination of bad sectors. Should a sector go bad,
the controller logically remaps it to one of the extra unused sectors that disk vendors
provide, so that the reliability of the disk system is higher. It is cheaper to produce
disks with a greater number of sectors than advertised and then map out the bad
sectors than it is to produce disks with no bad sectors, or with an extremely limited
possibility of sectors going bad.
There are many different types of disk controllers, but the most common ones today
are SCSI, IDE, and EIDE.
One other characteristic of disks that has an interesting effect on performance is the
distance from the read-write head to the surface of the platter, often measured in
microns. The smaller this gap, the smaller the area in which each bit is written, so the
tracks can be packed closer together and the disk has a greater capacity. However, a
smaller gap also increases the possibility of the head touching the surface. When the
head touches the surface while it is spinning at high speed, the result is called a "head
crash": the surface is scratched and the head is damaged. The bottom line is that
someone must then replace the disk.
Performance Measures of Disks
1. Seek time is the time to reposition the head, and it increases with the distance
that the head must move. Seek times can range from 2 to 30 milliseconds.
Average seek time is the average over all possible seeks and is normally
one-third of the worst-case seek time.
2. Rotational latency time is the time from when the head is over the correct track
until the desired data rotates around and is under the head and can be read. At
120 rotations per second, one rotation takes 8.33 milliseconds. Normally, the
average rotational latency is one-half of the rotation time. (A worked example
follows this list.)
3. Access time is the time from when a read or write request is issued to when the
data transfer begins. It is the sum of the seek time and latency time.
4. Data-transfer rate is the rate at which data can be retrieved from the disk and
sent to the controller, measured in megabytes per second.
5. Mean time to failure (MTTF) is the number of hours (on average) until a disk
fails. Typical values today range from 30,000 to 800,000 hours (about 3.4 to 91
years).
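To make these measures concrete, here is a small worked example in Python that
combines them for a single block read. All of the drive parameters are assumed
values, chosen only to fall within the ranges given above; this is a sketch, not the
specification of any real drive.

# Estimate the average time to read one block, using the measures above.
# Every parameter here is an assumed value for illustration.
avg_seek_ms = 10.0                    # average seek time (ms)
rps = 120                             # rotations per second
rotation_ms = 1000 / rps              # one full rotation = 8.33 ms
avg_latency_ms = rotation_ms / 2      # average rotational latency = 4.17 ms
transfer_mb_per_s = 5.0               # data-transfer rate
block_kb = 4                          # block size

transfer_ms = block_kb / 1024 / transfer_mb_per_s * 1000
access_ms = avg_seek_ms + avg_latency_ms      # access time = seek + latency
total_ms = access_ms + transfer_ms            # plus the transfer itself

print(f"access time: {access_ms:.2f} ms")         # 14.17 ms
print(f"total for one block: {total_ms:.2f} ms")  # 14.95 ms

Notice that the mechanical delays (seek plus latency) dominate: the transfer itself
takes well under a millisecond.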
Optimization of Disk-Block Access
Requests for disk I/O are generated both by the file system and by the virtual memory
manager found in most systems. Each request specifies the address on the disk to be
referenced; that address is in the form of a block number. Each block is a contiguous
sequence of sectors from a single track of one platter and ranges from 512 bytes to
several kilobytes of data. The lower-level file manager must convert block addresses
into the hardware-level cylinder, surface, and sector numbers.
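As a rough illustration of that conversion, the sketch below maps a linear block
number to a (cylinder, surface, sector) triple. It assumes an idealized geometry,
uniform sectors per track and one sector per block; real drives use zoned recording
and remapped sectors, so actual controllers are more complicated.

# Convert a linear block number to (cylinder, surface, sector).
# Assumes an idealized disk: uniform sectors per track, 1 block = 1 sector.
SURFACES = 8            # assumed number of read-write surfaces
SECTORS_PER_TRACK = 63  # assumed sectors per track

def block_to_chs(block: int) -> tuple[int, int, int]:
    sector = block % SECTORS_PER_TRACK
    track = block // SECTORS_PER_TRACK   # linear track index
    surface = track % SURFACES
    cylinder = track // SURFACES
    return cylinder, surface, sector

print(block_to_chs(100000))  # (198, 3, 19) under these assumptions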
Since access to data on disk is several orders of magnitude slower than access to data
in main memory, much attention has been paid to improving the speed of access to
blocks on the disk. This is also where more main memory can speed up the response
time, by making sure that the data needed is in memory when it is needed.
These are the same problems addressed in designing operating systems, to ensure
the best response time from the file system manager and the virtual memory manager.
Scheduling. Disk-arm scheduling algorithms order pending accesses so as to
increase the number of accesses that can be processed in a given amount of
time. Examples include first-come/first-served, shortest-seek-first, and the
elevator algorithm; a sketch of the elevator algorithm appears after this list.
File organization. To reduce block-access time, data can be arranged on the
disk in the same order in which it is expected to be retrieved (for example,
stored on the disk in order of the primary key). This benefit degrades as more
records are inserted and deleted, and we have little control over where on the
disk things actually get stored. The more fragmented the data becomes on the
disk, the more time it takes to locate it.
Nonvolatile write buffers. Non-volatile memory (such as flash memory) can be
used to protect the data in memory from crashes, but it does increase the cost.
It is possible that using a UPS would be more effective and cheaper.
Log disk. A dedicated disk can be used for writing a sequential log; because the
writes are sequential, no seeks are required.
Buffering. The more information you keep in buffers in main memory, the
more likely you are to avoid fetching it from the disk. However, it also
becomes more likely that some of that memory is wasted on information that is
never needed.
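Here is a minimal sketch of the elevator algorithm mentioned above: the arm keeps
moving in one direction, serving each request it passes, and reverses only when no
requests remain ahead of it. The cylinder numbers in the example are hypothetical.

# Elevator (SCAN) disk-arm scheduling: serve every request the head
# passes while sweeping in one direction, then reverse and sweep back.
def elevator_order(requests: list[int], head: int, moving_up: bool = True) -> list[int]:
    above = sorted(c for c in requests if c >= head)                 # ahead of the head
    below = sorted((c for c in requests if c < head), reverse=True)  # behind the head
    return above + below if moving_up else below + above

# Head at cylinder 53, moving toward higher-numbered cylinders:
print(elevator_order([98, 183, 37, 122, 14, 124, 65, 67], head=53))
# -> [65, 67, 98, 122, 124, 183, 37, 14]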
RAID
RAIDs are Redundant Arrays of Inexpensive Disks. There are seven standard levels
(0 through 6) for organizing these disks; a sketch of the parity idea behind levels 3
through 5 follows the list:
0 -- Non-redundant Striping
1 -- Mirrored Disks
2 -- Memory Style Error Correcting Codes
3 -- Bit Interleaved Parity
4 -- Block Interleaved Parity
5 -- Block Interleaved Distributed Parity
6 -- P + Q Redundancy
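To give a feel for the parity-based levels, the sketch below computes a parity block as
the bitwise XOR of the data blocks in a stripe and uses it to reconstruct a lost block.
In level 5, these parity blocks are distributed (rotated) across all the disks rather than
kept on one dedicated parity disk. The block contents here are made up for
illustration.

# Block-interleaved parity: the parity block is the XOR of the data
# blocks in a stripe, so any single lost block can be reconstructed.
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

stripe = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]  # data blocks on three disks
parity = xor_blocks(stripe)                        # stored on a fourth disk

# Suppose the disk holding stripe[1] fails: rebuild it from the survivors.
rebuilt = xor_blocks([stripe[0], stripe[2], parity])
assert rebuilt == stripe[1]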
Tertiary Storage
Tertiary storage commonly consists of optical disks and magnetic tapes.
Storage Access
A database is mapped into a number of different files, which are maintained by the
underlying operating system. Files are organized into blocks, and a block may contain
one or more data items.
A major goal of the DBMS is to minimize the number of block transfers between the
disk and memory. Since it is not possible to keep all blocks in main memory, we need
to manage the allocation of the space available for the storage of blocks. This is
similar to the problems encountered by the operating system, and can even conflict
with the operating system, since the OS is concerned with all processes while the
DBMS is concerned with only one family of processes.
Buffer Manager
Programs in a DBMS make requests (that is, calls) on the buffer manager when they
need a block from disk. If the block is already in the buffer, the requester is passed
the address of the block in main memory. If the block is not in the buffer, the buffer
manager first allocates space in the buffer for the block, throwing out some other
block, if required, to make space for the new block. If the block that is to be thrown
out has been modified, it must first be written back to the disk. The internal actions of
the buffer manager are transparent to the programs that issue disk-block requests.
Replacement strategy. When there is no room left in the buffer, a block must be
removed from the buffer before a new one can be read in. Operating systems
typically use a least recently used (LRU) scheme. A most recently used (MRU)
scheme can be better for some DBMS access patterns. (A sketch of an LRU
buffer pool appears after this list.)
Pinned blocks. A block that is not allowed to be written back to disk is said to
be pinned. This could be used to store data that has not been committed yet.
Forced output of blocks. There are situations in which it is necessary to write a
block back to the disk even though the buffer space it occupies is not currently
needed. This might be done during system lulls, so that when activity picks up,
writes of modified blocks can be avoided during peak periods.
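The following is a minimal sketch of such a buffer manager, assuming an LRU
replacement policy and per-block pin counts. The dictionary standing in for the disk,
and all the names here, are illustrative assumptions; a real buffer manager would also
implement forced output and careful concurrency control.

# A toy buffer manager: fixed capacity, LRU replacement; pinned blocks
# are never chosen as victims, and dirty victims are written back first.
from collections import OrderedDict

class BufferManager:
    def __init__(self, capacity: int, disk: dict):
        self.capacity = capacity
        self.disk = disk               # block number -> contents (stand-in for real I/O)
        self.pool = OrderedDict()      # block number -> [data, pin_count, dirty]

    def fetch(self, blk: int):
        if blk in self.pool:
            self.pool.move_to_end(blk)             # mark as most recently used
        else:
            if len(self.pool) >= self.capacity:
                self._evict()
            self.pool[blk] = [self.disk[blk], 0, False]
        return self.pool[blk][0]

    def _evict(self):
        for victim, (data, pins, dirty) in self.pool.items():  # LRU order first
            if pins == 0:                          # pinned blocks are skipped
                if dirty:
                    self.disk[victim] = data       # write the modified block back
                del self.pool[victim]
                return
        raise RuntimeError("all buffer blocks are pinned")

    def pin(self, blk: int):   self.pool[blk][1] += 1
    def unpin(self, blk: int): self.pool[blk][1] -= 1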
File Organization
Fixed-Length Records
Suppose we have a table that has the following organization:
type deposit = record
    branch-name : char(22);
    account-number : char(10);
    balance : real;
end
If each character occupies 1 byte and a real occupies 8 bytes, then this record
occupies 40 bytes (22 + 10 + 8). If the first record occupies the first 40 bytes of
the file, the second record occupies the next 40 bytes, and so on, we have some
problems:
It is difficult to delete a record, because there is no way to indicate that the
record has been deleted. (At least one system automatically adds one byte to
each record as a flag showing whether the record is deleted.)
Unless the block size happens to be a multiple of 40 (which is extremely
unlikely), some records will cross block boundaries, and it then requires two
block accesses to read or write such a record.
One solution might be to compress the file after each deletion. This incurs a major
amount of overhead processing, especially on larger files. Additionally, there is the
same problem with inserts!
Another solution would be to keep two sets of pointers: one linking each record to the
next logical record (a linked list), plus a free list (a list of free slots). This increases
the size of the file.
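Here is a minimal sketch of the free-list idea, with an in-memory list standing in for
the file: each deleted slot stores the index of the next free slot, so an insertion can
reuse a hole without scanning or compressing the file. All names here are illustrative.

# Free-list management for fixed-length record slots. A deleted slot
# holds the index of the next free slot (or None); free_head plays the
# role of the pointer kept in the file header.
class FixedLengthFile:
    def __init__(self):
        self.slots = []        # each entry: a record, or a free-list link
        self.free_head = None  # index of the first free slot

    def insert(self, record) -> int:
        if self.free_head is not None:          # reuse a deleted slot
            idx = self.free_head
            self.free_head = self.slots[idx]    # unlink it from the free list
            self.slots[idx] = record
        else:                                   # no holes: append at the end
            idx = len(self.slots)
            self.slots.append(record)
        return idx

    def delete(self, idx: int):
        self.slots[idx] = self.free_head        # chain the hole onto the list
        self.free_head = idx

f = FixedLengthFile()
for r in ["A-102", "A-305", "A-215"]:
    f.insert(r)
f.delete(1)
assert f.insert("A-101") == 1                   # the hole is reused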
Variable-Length Records
Variable-length records arise in several ways:
Storage of multiple record types in one file.
Record types that allow variable lengths for one or more fields.
Record types that allow repeating fields.
A simple method for implementing variable-length records is to attach a special end-
of-record symbol at the end of each record. But this has problems:
It is not easy to reuse the space formerly occupied by a deleted record.
There is, in general, no space for records to grow. If a variable-length record is
updated and needs more space, it must be moved. This can be very costly.
These problems could be addressed:
By reserving the maximum possible length, turning each variable-length record
into a fixed-length one.
By using pointers to fixed-length records, chained together by pointers.
As you can see, there is no easy answer.
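For illustration, the sketch below shows one common representation in which each
field is prefixed by its length, so no end-of-record symbol is needed and fields may
contain any characters. The 2-byte length prefix and the sample field values are
assumptions of the sketch.

# Length-prefixed encoding of a variable-length record: each field is
# written as a 2-byte length followed by its bytes.
import struct

def encode(fields: list[str]) -> bytes:
    out = b""
    for f in fields:
        data = f.encode("utf-8")
        out += struct.pack(">H", len(data)) + data   # 2-byte big-endian length
    return out

def decode(buf: bytes) -> list[str]:
    fields, i = [], 0
    while i < len(buf):
        (n,) = struct.unpack_from(">H", buf, i)
        fields.append(buf[i + 2 : i + 2 + n].decode("utf-8"))
        i += 2 + n
    return fields

rec = encode(["Perryridge", "A-102", "400.00"])
assert decode(rec) == ["Perryridge", "A-102", "400.00"]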
Organization of Records in Files
Heap File Organization
Any record can be placed anywhere in the file. There is no ordering of records and
there is a single file for each relation.
Sequential File Organization
Records are stored in sequential order based on the primary key.
Hashing File Organization
Any record can be placed anywhere in the file. A hashing function is computed on
some attribute of each record. The function specifies in which block the record should
be placed.
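A minimal sketch of the idea, assuming the hash is computed on the account-number
attribute and the file has a fixed number of blocks (both assumptions for illustration):

# Hashing file organization: a hash of one attribute chooses the block.
NUM_BLOCKS = 16
blocks = [[] for _ in range(NUM_BLOCKS)]     # each block holds a list of records

def block_for(key: str) -> int:
    return hash(key) % NUM_BLOCKS            # the hash function picks the block

def insert(record: dict) -> None:
    blocks[block_for(record["account-number"])].append(record)

def lookup(account_number: str) -> list[dict]:
    # Only one block needs to be examined to find matching records.
    return [r for r in blocks[block_for(account_number)]
            if r["account-number"] == account_number]

insert({"account-number": "A-102", "branch-name": "Perryridge", "balance": 400.0})
assert lookup("A-102")[0]["balance"] == 400.0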
Clustering File Organization
Several different relations can be stored in the same file. Related records of the
different relations can be stored in the same block.
Data Dictionary Storage
An RDBMS needs to maintain data about the relations, such as the schema. This
metadata is stored in a data dictionary (sometimes called a system catalog):
Names of the relations
Names of the attributes of each relation
Domains and lengths of attributes
Names of views, defined on the database, and definitions of those views
Integrity constraints
Names of authorized users
Accounting information about users
Number of tuples in each relation
Method of storage for each relation (clustered/non-clustered)
For each index: the name of the index, the name of the relation being indexed,
the attributes on which the index is defined, and the type of index formed
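The dictionary is usually stored as ordinary relations that the DBMS can query like
any other table. Here is a minimal sketch of how two such catalog relations might
look; the catalog names, column names, and row values are all hypothetical.

# A data dictionary modeled as two catalog "relations" (lists of rows).
# All names and values here are made up for illustration.
sys_relations = [
    {"rel_name": "deposit", "num_tuples": 10000, "storage": "clustered"},
]
sys_attributes = [
    {"rel_name": "deposit", "attr_name": "branch-name",    "domain": "char", "length": 22},
    {"rel_name": "deposit", "attr_name": "account-number", "domain": "char", "length": 10},
    {"rel_name": "deposit", "attr_name": "balance",        "domain": "real", "length": 8},
]

# The query processor consults the catalog like any other relation:
attrs = [a["attr_name"] for a in sys_attributes if a["rel_name"] == "deposit"]
print(attrs)  # ['branch-name', 'account-number', 'balance']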