Michael J. Folk
Bill Zoellick
File Structures
Second Edition
File Structures
SECOND EDITION

MICHAEL J. FOLK
University of Illinois

BILL ZOELLICK
Avalanche Development Company

Addison-Wesley Publishing Company, Inc.
Reading, Massachusetts   Menlo Park, California   New York
Don Mills, Ontario   Wokingham, England   Amsterdam   Bonn
Sydney   Singapore   Tokyo   Madrid   San Juan   Milan   Paris
Sponsoring Editor  Peter Shepard
Production Administrator  Juliet Silveri
Copyeditor  Patricia Daly
Text Designer  Melinda Grosser for silk
Cover Designer  Peter Blaiwas
Technical Art Consultant  Dick Morton
Illustrator  Scot Graphics
Manufacturing Supervisor  Roy Logan

Photographs on pages 126 and 187 courtesy of S. Sukumar. Figure 10.7 on page 470 courtesy of International Business Machines Corporation.
Library of Congress Cataloging-in-Publication Data

Folk, Michael J.
File structures / Michael J. Folk, Bill Zoellick. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-55713-4
1. File organization (Computer science)  I. Zoellick, Bill.  II. Title.
QA76.9.F5F65 1992
005.74 dc20
91-16314
CIP
The programs and applications presented in this book have been included for their instructional value. They have been tested with care but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.
Copyright © 1992 by Addison-Wesley Publishing Company, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America.

1 2 3 4 5 6 7 8 9 10-DO-9594939291
To Pauline and Rachel
and
To Karen, Joshua, and Peter
Preface

We wrote the first edition to promote file structure literacy. Literacy implies familiarity with the tools used to organize files. It also means knowing the story of how the different tools have evolved. Knowing the story is the basis for using the tools appropriately.

The first edition told the story of file structures up to about 1980. This second edition continues the story, examining developments such as extendible hashing and optical disc storage that have moved from being a research topic at the start of the last decade to mature technology by its end.

While the history of file structures provides the key organizing principle for much of this text, we also find ourselves compelled, particularly in this second edition, to attend to developments in computing hardware and system software. In the last twenty years computers have evolved from being expensive monoliths, maintained by a priesthood of specialists, to being appliances as ubiquitous as toasters. No longer do we need to confront a corps of analysts to get information in and out of a computer. We do it ourselves. Today, more often than yesterday, programmers design and build their own file structures.

This text shows you how to design and build efficient file structures. All you need is a good programming language, a good operating system, and the conceptual tools that enable you to think through alternative file structure designs that apply to the task at hand. The first six chapters of this book give you the basic tools to design simple file structures from the ground up. We provide examples of program code and, if you are a UNIX user, we show you, whenever possible, how to use this operating system to help with much of the work. Building on the first six chapters of foundation work, the last five chapters introduce you to the most important high-level file structure designs, including sequential access, B-trees and B+ trees, and hashing and extendible hashing.
The last ten years of development in software design are reason enough for this second edition, but we have also used this edition to discuss the decreased cost and increased availability of computer storage hardware. For instance, one of the most dramatic changes in computer configurations over the past decade is the increase in the amount of available RAM on computers of all sizes. In 1986, when we completed the first edition of this book, it was rare that a personal computer had more than 640 Kbytes of RAM. Now, even for many mundane applications, four Mbytes is common, and sometimes even mandatory. A decade ago, a sophisticated mainframe system that was used extensively for sorting large files typically had two to four Mbytes of primary memory; now 32 to 64 Mbytes is common on workstations, and there are some computers with several gigabytes of RAM.

When more RAM is available, we can approach file structures problems differently. For example, most earlier file structure texts deal with the sorting of large files assuming that available RAM is scarce; sorting is always done on tape. Now that RAM is much cheaper and more readily available, sorting on disk is much more viable than sorting on tape. It is not only viable, it is usually preferable. This second edition reflects this change and others that arise from changes in computer hardware.
Using the Book as a College Text

The first edition has been used extensively as a text for many different kinds of students in many different kinds of universities. Because the book is quite readable, students typically are expected to read the entire book over the course of a semester. The text covers the basics; class lectures can expand and supplement the material presented in the text. The lecturer is free to explore more complex topics and applications, relying on the text to supply the fundamentals.
A word of caution: It is easy to spend too much time on the low-level issues presented in the first six chapters. Move quickly through this material. The relatively large number of pages devoted to these matters is not a reflection of the percentage of the course that should be spent on them. The intent, instead, is to provide thorough coverage in the text so that the instructor can simply assign these chapters as background reading, saving precious lecture time for more important topics.
It is important to get students involved in writing file processing programs early in the semester. Consider starting with a file reading and writing assignment that is due after the first week of class. The inclusion in the text of sample programs in both C and Pascal makes it easier to work in this hands-on style. We recommend that, by the time the students encounter the B-tree chapter, they should have already written programs that access a data set through a simple index structure. Since the students then already have first-hand experience with the fundamental organizational issues, it is possible for lectures to focus on the conceptual issues involved in B-tree design.
Finally, we suggest that instructors adhere to a close approximation of the sequence of topics used in the book, especially through the first six chapters. We have already stressed that we wrote the book so that it can be read from cover to cover. It is not a reference work. Instead, we develop ideas as we proceed from chapter to chapter. Skipping around in the book makes it difficult for students to follow this development.
A Book for Computing Professionals

Both authors used to teach, but we now design and write programs for a living. We wrote and revised this book with our colleagues in mind. The style is conversational; the intent is to provide a book that you can read over a number of evenings, coming away with a good sense of how to approach file structure design problems. If you are already familiar with basic file structure design concepts, skim through the first six chapters and begin reading about cosequential access methods in Chapter 7. Subsequent chapters introduce you to B-trees, B+ trees, hashing, and extendible hashing. These are key design tools for any practicing programmer who is building file structures. We have tried to present them in a way that is both thorough and readable.
If you are not already a serious UNIX user, the first seven chapters will give you a feel for why UNIX is a powerful environment in which to work with files. Similarly, the material in several of the chapters provides an introduction to the use of UNIX. Also, if you need to build and access file structures similar to the ones in the text, you may be able to use these programs as a source code toolkit that you can adapt to your needs.

Finally, we know that an increasing number of computing professionals are confronted with the need to understand and use CD-ROM. Appendix A not only provides an example of how the design principles introduced in this text are applied to this important medium, but it also gives you a good introduction to the medium itself.
Acknowledgements
There are a number of people we would like to thank for help in preparing
this second edition. Peter Shepard, our editor at Addison-Wesley, initiated
the idea of a new edition, kept after us to get it done, and saw the
production through to completion. We thank our reviewers: James
Canning, Jan Carroll, Suzanne Dietrich, Terry Johnson, Theodore Norman, Gregory Riccardi, and Cliff Shaffer. We also thank Deebak Khanna
for comments and suggestions for improving the code.
Since the publication of the first edition, we have received a great deal
of feedback from readers. Their suggestions and contributions have had a
major effect on this second edition, and in fact are largely responsible for
our completely rewriting several of the chapters.
Colleagues with whom we work have also contributed to the second
edition, many without knowing they were doing so. We are grateful to
them for information, explanations, and ideas that have improved our own
understanding of many of the topics covered in the book. These colleagues
include Chin Chau Low, Tim Krauskopf, Joseph Hardin, Quincey Koziol,
Carlos Donohue, S. Sukumar, Mike Page, and Lee Fife.
Thanks are still outstanding to people who contributed to the initial edition: Marilyn Aiken, Art Crotzer, Mark Dalton, Don Fisher, Huey Liu, Gail Meinert, and Jim Van Doren.

We thank J. S. Bach, whose magnificent contribution of music to work by makes this work possible.

Most important of all, we thank Pauline, Rachel, Karen, Joshua and Peter for putting up with fathers and husbands who get up too early to write, are tired all day, and stay up too late at night to write some more. It's the price of fame.
Boulder, Colorado    B.Z.
Urbana, Illinois     M.F.
Contents

1  Introduction to File Structures
   1.1  The Heart of File Structure Design
   1.2  A Short History of File Structure Design
   1.3  A Conceptual Toolkit: File Structure Literacy
   Summary    Key Terms

2  Fundamental File Processing Operations
   2.1  Physical Files and Logical Files
   2.2  Opening Files
   2.3  Closing Files
   2.4  Reading and Writing
        2.4.1  Read and Write Functions
        2.4.2  A Program to Display the Contents of a File
        2.4.3  Detecting End-of-File
   2.5  Seeking
        2.5.1  Seeking in C
        2.5.2  Seeking in Pascal
   2.6  Special Characters in Files
   2.7  The UNIX Directory Structure
   2.8  Physical and Logical Files in UNIX
        2.8.1  Physical Devices as UNIX Files
        2.8.2  The Console, the Keyboard, and Standard Error
        2.8.3  I/O Redirection and Pipes
   2.9  File-related Header Files
   2.10 UNIX Filesystem Commands
   Summary    Key Terms    Further Readings    Exercises

3  Secondary Storage and System Software
   3.1  Disks
        3.1.1  The Organization of Disks
        3.1.2  Estimating Capacities and Space Needs
        3.1.3  Organizing Tracks by Sector
        3.1.4  Organizing Tracks by Block
        3.1.5  Nondata Overhead
        3.1.6  The Cost of a Disk Access
        3.1.7  Effect of Block Size on Performance: A UNIX Example
        3.1.8  Disk as Bottleneck
   3.2  Magnetic Tape
        3.2.1  Organization of Data on Tapes
        3.2.2  Estimating Tape Length Requirements
        3.2.3  Estimating Data Transmission Times
        3.2.4  Tape Applications
   3.3  Disk versus Tape
   3.4  Storage as a Hierarchy
   3.5  A Journey of a Byte
        3.5.1  The File Manager
        3.5.2  The I/O Buffer
        3.5.3  The Byte Leaves RAM: The I/O Processor and Disk Controller
   3.6  Buffer Management
        3.6.1  Buffer Bottlenecks
        3.6.2  Buffering Strategies
   3.7  I/O in UNIX
        3.7.1  The Kernel
        3.7.2  Linking File Names to Files
        3.7.3  Normal Files, Special Files, and Sockets
        3.7.4  Block I/O
        3.7.5  Device Drivers
        3.7.6  The Kernel and Filesystems
        3.7.7  Magnetic Tape and UNIX
   Summary    Key Terms    Further Readings    Exercises

4  Fundamental File Structure Concepts
   4.1  Field and Record Organization
        4.1.1  A Stream File
        4.1.2  Field Structures
        4.1.3  Reading a Stream of Fields
        4.1.4  Record Structures
        4.1.5  A Record Structure That Uses a Length Indicator
        4.1.6  Mixing Numbers and Characters: Use of a File Dump
   4.2  Record Access
        4.2.1  Record Keys
        4.2.2  A Sequential Search
        4.2.3  UNIX Tools for Sequential Processing
        4.2.4  Direct Access
   4.3  More about Record Structures
        4.3.1  Choosing a Record Structure and Record Length
        4.3.2  Header Records
   4.4  File Access and File Organization
   4.5  Beyond Record Structures
        4.5.1  Abstract Data Models
        4.5.2  Headers and Self-Describing Files
        4.5.3  Metadata
        4.5.4  Color Raster Images
        4.5.5  Mixing Object Types in One File
        4.5.6  Object-oriented File Access
        4.5.7  Extensibility
   4.6  Portability and Standardization
        4.6.1  Factors Affecting Portability
        4.6.2  Achieving Portability
   Summary    Key Terms    Further Readings    Exercises    C Programs    Pascal Programs

5  Organizing Files for Performance
   5.1  Data Compression
        5.1.1  Using a Different Notation
        5.1.2  Suppressing Repeating Sequences
        5.1.3  Assigning Variable-length Codes
        5.1.4  Irreversible Compression Techniques
        5.1.5  Compression in UNIX
   5.2  Reclaiming Space in Files
        5.2.1  Record Deletion and Storage Compaction
        5.2.2  Deleting Fixed-length Records for Reclaiming Space Dynamically
        5.2.3  Deleting Variable-length Records
        5.2.4  Storage Fragmentation
        5.2.5  Placement Strategies
   5.3  Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching
        5.3.1  Finding Things in Simple Field and Record Files
        5.3.2  Search by Guessing: Binary Search
        5.3.3  Binary Search versus Sequential Search
        5.3.4  Sorting a Disk File in RAM
        5.3.5  The Limitations of Binary Searching and Internal Sorting
   5.4  Keysorting
        5.4.1  Description of the Method
        5.4.2  Limitations of the Keysort Method
        5.4.3  Another Solution: Why Bother to Write the File Back?
        5.4.4  Pinned Records
   Summary    Key Terms    Further Readings    Exercises

6  Indexing
   6.1  What Is an Index?
   6.2  A Simple Index with an Entry-Sequenced File
   6.3  Basic Operations on an Indexed, Entry-Sequenced File
   6.4  Indexes That Are Too Large to Hold in Memory
   6.5  Indexing to Provide Access by Multiple Keys
   6.6  Retrieval Using Combinations of Secondary Keys
   6.7  Improving the Secondary Index Structure: Inverted Lists
        6.7.1  A First Attempt at a Solution
        6.7.2  A Better Solution: Linking the List of References
   6.8  Selective Indexes
   6.9  Binding
   Summary    Key Terms    Further Readings    Exercises

7  Cosequential Processing and the Sorting of Large Files
   7.1  A Model for Implementing Cosequential Processes
        7.1.1  Matching Names in Two Lists
        7.1.2  Merging Two Lists
        7.1.3  Summary of the Model
   7.2  Application of the Model to a General Ledger Program
        7.2.1  The Problem
        7.2.2  Application of the Model to the Ledger Program
   7.3  Extension of the Model to Include Multiway Merging
        7.3.1  A K-way Merge Algorithm
        7.3.2  A Selection Tree for Merging Large Numbers of Lists
   7.4  A Second Look at Sorting in RAM
        7.4.1  Overlapping Processing and I/O: Heapsort
        7.4.2  Building the Heap while Reading in the File
        7.4.3  Sorting while Writing out to the File
   7.5  Merging as a Way of Sorting Large Files on Disk
        7.5.1  How Much Time Does a Merge Sort Take?
        7.5.2  Sorting a File That Is Ten Times Larger
        7.5.3  The Cost of Increasing the File Size
        7.5.4  Hardware-based Improvements
        7.5.5  Decreasing the Number of Seeks Using Multiple-step Merges
        7.5.6  Increasing Run Lengths Using Replacement Selection
        7.5.7  Replacement Selection Plus Multistep Merging
        7.5.8  Using Two Disk Drives with Replacement Selection
        7.5.9  More Drives? More Processors?
        7.5.10 Effects of Multiprogramming
        7.5.11 A Conceptual Toolkit for External Sorting
   7.6  Sorting Files on Tape
        7.6.1  The Balanced Merge
        7.6.2  The K-way Balanced Merge
        7.6.3  Multiphase Merges
        7.6.4  Tapes versus Disks for External Sorting
   7.7  Sort-Merge Packages
   7.8  Sorting and Cosequential Processing in UNIX
        7.8.1  Sorting and Merging in UNIX
        7.8.2  Cosequential Processing Utilities in UNIX
   Summary    Key Terms    Further Readings    Exercises

8  B-Trees and Other Tree-structured File Organizations
   8.1  Introduction: The Invention of the B-Tree
   8.2  Statement of the Problem
   8.3  Binary Search Trees as a Solution
   8.4  AVL Trees
   8.5  Paged Binary Trees
   8.6  The Problem with the Top-down Construction of Paged Trees
   8.7  B-Trees: Working up from the Bottom
   8.8  Splitting and Promoting
   8.9  Algorithms for B-Tree Searching and Insertion
   8.10 B-Tree Nomenclature
   8.11 Formal Definition of B-Tree Properties
   8.12 Worst-case Search Depth
   8.13 Deletion, Redistribution, and Concatenation
        8.13.1 Redistribution
   8.14 Redistribution during Insertion: A Way to Improve Storage Utilization
   8.15 B* Trees
   8.16 Buffering of Pages: Virtual B-Trees
        8.16.1 LRU Replacement
        8.16.2 Replacement Based on Page Height
        8.16.3 Importance of Virtual B-Trees
   8.17 Placement of Information Associated with the Key
   8.18 Variable-length Records and Keys
   Summary    Key Terms    Further Readings    Exercises
   C Programs to Insert Keys into a B-Tree    Pascal Programs to Insert Keys into a B-Tree

9  The B+ Tree Family and Indexed Sequential File Access
   9.1  Indexed Sequential Access
   9.2  Maintaining a Sequence Set
        9.2.1  The Use of Blocks
        9.2.2  Choice of Block Size
   9.3  Adding a Simple Index to the Sequence Set
   9.4  The Content of the Index: Separators Instead of Keys
   9.5  The Simple Prefix B+ Tree
   9.6  Simple Prefix B+ Tree Maintenance
        9.6.1  Changes Localized to Single Blocks in the Sequence Set
        9.6.2  Changes Involving Multiple Blocks in the Sequence Set
   9.7  Index Set Block Size
   9.8  Internal Structure of Index Set Blocks: A Variable-order B-Tree
   9.9  Loading a Simple Prefix B+ Tree
   9.10 B+ Trees
   9.11 B-Trees, B+ Trees, and Simple Prefix B+ Trees in Perspective
   Summary    Key Terms    Further Readings    Exercises

10 Hashing
   10.1 Introduction
        10.1.1 What Is Hashing?
        10.1.2 Collisions
   10.2 A Simple Hashing Algorithm
   10.3 Hashing Functions and Record Distributions
        10.3.1 Distributing Records among Addresses
        10.3.2 Some Other Hashing Methods
        10.3.3 Predicting the Distribution of Records
        10.3.4 Predicting Collisions for a Full File
   10.4 How Much Extra Memory Should Be Used?
        10.4.1 Packing Density
        10.4.2 Predicting Collisions for Different Packing Densities
   10.5 Collision Resolution by Progressive Overflow
        10.5.1 How Progressive Overflow Works
        10.5.2 Search Length
   10.6 Storing More Than One Record per Address: Buckets
        10.6.1 Effects of Buckets on Performance
        10.6.2 Implementation Issues
   10.7 Making Deletions
        10.7.1 Tombstones for Handling Deletions
        10.7.2 Implications of Tombstones for Insertions
        10.7.3 Effects of Deletions and Additions on Performance
   10.8 Other Collision Resolution Techniques
        10.8.1 Double Hashing
        10.8.2 Chained Progressive Overflow
        10.8.3 Chaining with a Separate Overflow Area
        10.8.4 Scatter Tables: Indexing Revisited
   10.9 Patterns of Record Access
   Summary    Key Terms    Further Readings    Exercises

11 Extendible Hashing
   11.1 Introduction
   11.2 How Extendible Hashing Works
        11.2.1 Tries
        11.2.2 Turning the Trie into a Directory
        11.2.3 Splitting to Handle Overflow
   11.3 Implementation
        11.3.1 Creating the Addresses
        11.3.2 Implementing the Top-level Operations
        11.3.3 Bucket and Directory Operations
        11.3.4 Implementation Summary
   11.4 Deletion
        11.4.1 Overview of the Deletion Process
        11.4.2 A Procedure for Finding Buddy Buckets
        11.4.3 Collapsing the Directory
        11.4.4 Implementing the Deletion Operations
        11.4.5 Summary of the Deletion Operation
   11.5 Extendible Hashing Performance
        11.5.1 Space Utilization for Buckets
        11.5.2 Space Utilization for the Directory
   11.6 Alternative Approaches
        11.6.1 Dynamic Hashing
        11.6.2 Linear Hashing
        11.6.3 Approaches to Controlling Splitting
   Summary    Key Terms    Further Readings    Exercises

Appendix A: File Structures on CD-ROM
   A.1  Using this Appendix
   A.2  Introduction to CD-ROM
        A.2.1  A Short History of CD-ROM
        A.2.2  CD-ROM as a File Structure Problem
   A.3  Physical Organization of CD-ROM
        A.3.1  Reading Pits and Lands
        A.3.2  CLV Instead of CAV
        A.3.3  Addressing
        A.3.4  Structure of a Sector
   A.4  CD-ROM Strengths and Weaknesses
        A.4.1  Seek Performance
        A.4.2  Data Transfer Rate
        A.4.3  Storage Capacity
        A.4.4  Read-Only Access
        A.4.5  Asymmetric Writing and Reading
   A.5  Tree Structures on CD-ROM
        A.5.1  Design Exercises
        A.5.2  Block Size
        A.5.3  Special Loading Procedures and Other Considerations
        A.5.4  Virtual Trees and Buffering Blocks
        A.5.5  Trees as Secondary Indexes on CD-ROM
   A.6  Hashed Files on CD-ROM
        A.6.1  Design Exercises
        A.6.2  Bucket Size
        A.6.3  How the Size of CD-ROM Helps
        A.6.4  Advantages of CD-ROM's Read-Only Status
   A.7  The CD-ROM File System
        A.7.1  The Problem
        A.7.2  Design Exercise
        A.7.3  A Hybrid Design
   Summary

Appendix B: ASCII Table
Appendix C: String Functions in Pascal (tools.pre): Functions and Procedures Used to Operate on strng
Appendix D: Comparing Disk Drives

Bibliography
Index
Introduction to File Structures

CHAPTER OBJECTIVES
- Introduce the primary design issues that characterize file structure design.
- Survey the history of file structure design, since tracing the developments in file structures teaches us much about how to design our own file structures.
- Introduce the notions of file structure literacy and of a conceptual toolkit for file structure design.

CHAPTER OUTLINE
1.1  The Heart of File Structure Design
1.2  A Short History of File Structure Design
1.3  A Conceptual Toolkit: File Structure Literacy
1.1 The Heart of File Structure Design
Disks are slow. They are also technological marvels, packing hundreds of
megabytes on disks that can fit into a notebook computer. Only a few years
ago, disks with that kind of capacity looked like small washing machines.
However, relative to the other parts of a computer, disks are slow.
How slow? The time it takes to get information back from even
relatively slow electronic random access memory (RAM) is about 120
nanoseconds, or 120 billionths of a second. Getting the same information
from a typical disk might take 30 milliseconds, or 30 thousandths of a
second. To understand the size of this difference, we need an analogy.
Assume that RAM access is like finding something in the index of this book. Let's say that this local, book-in-hand access takes 20 seconds. Assume that disk access is like sending to a library for the information you cannot find here in this book. Given that our "RAM access" takes 20 seconds, how long does the "disk access" to the library take, keeping the ratio the same as that of a real RAM access and disk access? The disk access is a quarter of a million times longer than the RAM access. This means that getting information back from the library takes 5,000,000 seconds, or almost 58 days. Disks are very slow compared to RAM.
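To make the comparison concrete, the arithmetic behind the analogy can be checked with a few lines of C. The short program below is our own sketch, not part of the book's program listings; the figures it uses (120 nanoseconds, 30 milliseconds, a 20-second book lookup) are the ones given above.

    /* Sketch: verify the disk-versus-RAM arithmetic used in the analogy. */
    #include <stdio.h>

    main( )
    {
        double ram_time  = 120e-9;                 /* 120 nanoseconds      */
        double disk_time = 30e-3;                  /* 30 milliseconds      */
        double ratio     = disk_time / ram_time;   /* about 250,000        */
        double book_secs = 20.0;                   /* book-in-hand lookup  */
        double library   = book_secs * ratio;      /* the "library" trip   */

        printf("disk access is %.0f times slower than RAM\n", ratio);
        printf("library trip: %.0f seconds, or about %.0f days\n",
               library, library / 86400.0);
    }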
On the other hand, disks provide enormous capacity at much less cost than RAM. They also keep the information stored on them when they are turned off. The tension between a disk's relatively slow access time and its enormous, nonvolatile capacity is the driving force behind file structure design. Good file structure design will give us access to all the capacity without making our applications spend a lot of time waiting for the disk. This book shows you how to develop such file designs.
1.2 A Short History of File Structure Design
Put another way, our goal is to show you how to think creatively about file
structure design problems. Part of our approach to doing this is based on
history: After introducing basic principles of design in the first part of this
book, we devote the last part to studying some of the key developments in
file design over the last 30 years. The problems that researchers struggle
with reflect the same issues that you confront in addressing any substantial file design problem. Working through the approaches used to address major file design issues shows you a lot about how to approach new design problems.
The general goals of research and development in file structures can be drawn directly from our library analogy:

- Ideally, we would like to get the information we need with one access to the disk. In terms of our analogy, we do not want to issue a series of 58-day requests before we get what we want.
- If it is impossible to get what we need in one access, we want structures that allow us to find the target information with as few accesses as possible. For example, you may remember from your studies of data structures that a binary search allows us to find a particular record among 50,000 other records with no more than 16 comparisons. But having to look 16 places on a disk before finding what we want takes too much time. We need file structures that allow us to find what we need with only two or three trips to the disk.
- We want our file structures to group information so we are likely to get everything we need with only one trip to the disk. If we need a client's name, address, phone number, and account balance, we would prefer to get all that information at once, rather than having to look in several places for it.

It is relatively easy to come up with file structure designs that meet these goals when we have files that never change. Designing file structures that maintain these qualities as files change, growing and shrinking as information is added and deleted, is much more difficult.

Early work with files presumed that files were on tape, since most files were. Access was sequential, and the cost of access grew in direct proportion to the size of the file. As files grew intolerably large for unaided sequential access and as storage devices like disk drives became available, indexes were added to files. The indexes made it possible to keep a list of keys and pointers in a smaller file that could be searched more quickly; given the key and pointer, the user had direct access to the large, primary file.
Unfortunately, simple indexes had some of the same, sequential flavor as the data files themselves, and as the indexes grew they too became difficult to manage, especially for dynamic files in which the set of keys changes. Then, in the early 1960s, the idea of applying tree structures emerged as a potential solution. Unfortunately, trees can grow very unevenly as records are added and deleted, resulting in long searches requiring many disk accesses to find a record.

In 1963 researchers developed the AVL tree, an elegant, self-adjusting binary tree structure for data in RAM. Other researchers began to look for ways to apply AVL trees, or something like them, to files. The problem was that even with a balanced binary tree, dozens of accesses are required to find a record in even moderate-sized files. A way was needed to keep a tree balanced when each node of the tree was not a single record, as in a binary tree, but a file block containing dozens, perhaps even hundreds, of records.
It took nearly 10 more years of design work before a solution emerged
in the form of the B-tree. Part of the reason that finding a solution took so
long was that the approach required for file structures was very different
from the approach that worked in RAM. Whereas AVL trees grow from
the top down as records are added, B-trees grow from the bottom up.

B-trees provided excellent access performance, but there was a cost: No longer could a file be accessed sequentially with efficiency. Fortunately, this problem was solved almost immediately by adding a linked list structure at the bottom level of the B-tree. The combination of a B-tree and a sequential linked list is called a B+ tree.
Over the following 10 years B-trees and B+ trees became the basis for many commercial file systems, since they provide access times that grow in proportion to log_k N, where N is the number of entries in the file and k is the number of entries indexed in a single block of the B-tree structure. In practical terms, this means that B-trees can guarantee that you can find one file entry among millions of others with only three or four trips to the disk. Further, B-trees guarantee that as you add and delete entries, performance stays about the same.
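The logarithms are easy to check. The fragment below is our own illustration rather than anything from the book: it computes the ceiling of log_k N for a binary search (k = 2, N = 50,000, giving the 16 comparisons mentioned earlier) and for a B-tree-like blocking factor of a few hundred entries, where even a million-entry file needs only about three levels.

    /* Sketch: the number of probes is roughly the ceiling of log base k of N. */
    #include <stdio.h>
    #include <math.h>

    main( )
    {
        double n_binary = 50000.0;     /* records searched by binary search   */
        double n_btree  = 1000000.0;   /* entries indexed by a B-tree         */
        double k_block  = 512.0;       /* entries per B-tree block (example)  */

        printf("binary search: %.0f comparisons\n",
               ceil(log(n_binary) / log(2.0)));
        printf("B-tree with k = %.0f: %.0f levels\n",
               k_block, ceil(log(n_btree) / log(k_block)));
    }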
Being able to get information back with just three or four accesses is pretty good. But how about our goal of being able to get what we want with a single request? An approach called hashing is a good way to do that with files that do not change size greatly over time. From early on, hashed indexes were used to provide fast access to files. However, until recently, hashing did not work well with volatile, dynamic files that changed greatly in size. After the development of B-trees, researchers turned to work on systems for extendible, dynamic hashing that could retrieve information with one or, at most, two disk accesses no matter how big the file becomes. We close this book with a careful look at this work, which took place from the late 1970s through the first part of the 1980s.

1.3 A Conceptual Toolkit: File Structure Literacy

As we move through the developments in file structures over the last three decades, watching file structure design evolve as it addresses dynamic files first sequentially, then through tree structures, and finally through direct access, we see that the same design problems and design tools keep emerging. We decrease the number of disk accesses by collecting data into buffers, blocks, or buckets; we manage the growth of these collections by splitting them, which requires that we find a way to increase our address or index space, and so on. Progress takes the form of finding new ways to combine these basic tools of file design.

We think of these tools as conceptual tools. They are ways of framing and addressing a design problem. Our own work in file structures has shown us that by understanding the tools thoroughly, and by studying how the tools have been combined to produce such diverse approaches as B-trees and extendible hashing, we develop mastery and flexibility in our own use of the tools. In other words, we acquire literacy with regard to file structures.

This text is designed to help readers acquire file structure literacy. Chapters 2 through 6 introduce the basic tools; Chapters 7 through 11 introduce readers to the highlights of the past several decades of file structure design, showing how the basic tools are used to handle efficient sequential access, B-trees, B+ trees, hashed indexes, and extendible, dynamic hashed files.
SUMMARY

The key design problem that shapes file structure design is the relatively large amount of time that is required to get information from disk. All file structure designs focus on minimizing disk accesses and maximizing the likelihood that the information the user will want is already in RAM.
This text begins by introducing the basic concepts and issues associated
with file structures. The last half of the book tracks the development of file
structure design as it has evolved over the last 30 years. The key problem addressed throughout this evolution is finding ways to minimize disk accesses for files that keep changing in content and size. Tracking these developments takes us first through work on sequential file access, then through developments in tree-structured access, and finally to relatively recent work on direct access to information in files.

Our experience has been that the study of the principal research and design contributions to file structures, focusing on how the design work uses the same tools in new ways, provides a solid foundation for thinking creatively about new problems in file structure design.
KEY TERMS

AVL tree. A self-adjusting binary tree structure that can guarantee good access times for data in RAM.

B-tree. A tree structure that provides fast access to data stored in files. Unlike binary trees, in which the branching factor from a node of the tree is two, the descendents from a node of a B-tree can be a much larger number. We introduce B-trees in Chapter 8.

B+ tree. A variation on the B-tree structure that provides sequential access to the data as well as fast-indexed access. We discuss B+ trees at length in Chapter 9.

Extendible hashing. An approach to hashing that works well with files that undergo substantial changes in size over time.

File structures. The organization of data on secondary storage devices such as disks.

Hashing. An access mechanism that transforms the search key into a storage address, thereby providing very fast access to stored data.

Sequential access. Access that takes records in order, looking at the first, then the next, and so on.
Fundamental File Processing Operations

CHAPTER OBJECTIVES
- Describe the process of linking a logical file within a program to an actual physical file or device.
- Describe the procedures used to create, open, and close files.
- Describe the procedures used for reading from and writing to files.
- Introduce the concept of position within a file and describe procedures for seeking different positions.
- Provide an introduction to the organization of the UNIX file system.
- Present the UNIX view of a file, and describe UNIX file operations and commands based on this view.
CHAPTER OUTLINE
2.1  Physical Files and Logical Files
2.2  Opening Files
2.3  Closing Files
2.4  Reading and Writing
     2.4.1  Read and Write Functions
     2.4.2  A Program to Display the Contents of a File
     2.4.3  Detecting End-of-File
2.5  Seeking
     2.5.1  Seeking in C
     2.5.2  Seeking in Pascal
2.6  Special Characters in Files
2.7  The UNIX Directory Structure
2.8  Physical and Logical Files in UNIX
     2.8.1  Physical Devices as UNIX Files
     2.8.2  The Console, the Keyboard, and Standard Error
     2.8.3  I/O Redirection and Pipes
2.9  File-related Header Files
2.10 UNIX File System Commands

2.1 Physical Files and Logical Files
When we talk about a file on a disk or tape, we refer to a particular collection of bytes stored there. A file, when the word is used in this sense, physically exists. A disk drive might contain hundreds, even thousands, of these physical files.

From the standpoint of an application program, the notion of a file is different. To the program, a file is somewhat like a telephone line connected to a telephone network. The program can receive bytes through this phone line, or send bytes down it, but knows nothing about where these bytes actually come from or where they go. The program knows only about its own end of the phone line. Moreover, even though there may be thousands of physical files on a disk, a single program is usually limited to the use of only about 20 files.

The application program relies on the operating system to take care of the details of the telephone switching system, as illustrated in Fig. 2.1. It could be that bytes coming down the line into the program originate from an actual physical file, or they might come from the keyboard or some other input device. Similarly, the bytes that the program sends down the line might end up in a file, or they could appear on the terminal screen. Although the program often doesn't know where bytes are coming from or where they are going, it does know which line it is using. This line is usually referred to as the logical file to distinguish this view of the file from the physical files on the disk or tape.
Before the program can open a file for use, the operating system must receive instructions about making a hookup between a logical file (e.g., a phone line) and some physical file or device. When using operating systems such as IBM's OS/MVS, these instructions are provided through job control language (JCL). On minicomputers and microcomputers, more modern operating systems such as UNIX, MS-DOS, and VMS provide the instructions within the program. For example, in Turbo Pascal† the association between a logical file called inp_file and a physical file called myfile.dat is made with the following statement:

    assign(inp_file, 'myfile.dat');

This statement asks the operating system to find the physical file named myfile.dat and then make the hookup by assigning a logical file (phone line) to it. The number identifying the particular phone line that is assigned is returned through the FILE variable inp_file, which is the file's logical name. This logical name is what we use to refer to the file inside the program. Again, the telephone analogy applies: My office phone is connected to six telephone lines. When I receive a call I get an intercom message such as, "You have a call on line three." The receptionist does not say, "You have a call from 918-123-4567." I need to have the call identified logically, not physically.

† Different Pascal compilers vary widely with regard to I/O procedures, since standard Pascal contains little in the way of I/O definition. Throughout this book we use the term Pascal when discussing features common to most Pascal implementations. When we refer to the features of a specific implementation, such as Turbo Pascal, we say so.
2.2 Opening Files

Once we have hooked up to a physical file or device, we need a logical file identifier to declare what we intend to do with the file. In general, we have two options: (1) open an existing file, or (2) create a new file, deleting any existing contents in the physical file. Opening a file makes it ready for use by the program. We are positioned at the beginning of the file and are ready to start reading or writing. The file contents are not disturbed by the open statement. Creating a file also opens the file in the sense that it is ready for use after creation. Since a newly created file has no contents, writing is initially the only use that makes sense.
FIGURE 2.1 The program relies on the operating system to make connections between logical files and physical files and devices. (The figure shows a program limited to approximately 20 logical "phone lines," connected through an operating system switchboard that can make connections to thousands of physical files or I/O devices, such as a printer.)

In Pascal the reset( ) statement is used to open existing files and the rewrite( ) statement is used to create new ones. For example, to open a file in Turbo Pascal we might use a sequence of statements such as:

    assign(inp_file, 'myfile.dat');
    reset(inp_file);

Note that we use the logical file name, not the physical one, in the reset( ) statement. To create a file in Turbo Pascal, the statements might read:

    assign(out_file, 'myfile.dat');
    rewrite(out_file);
We can open an existing file or create a new one in C through the UNIX system function open( ). This function takes two required arguments and a third argument that is optional:

    fd = open(filename, flags [, pmode]);

The return value fd and the arguments filename, flags, and pmode have the following meanings:

fd        The file descriptor. Using our earlier analogy, this is the phone line (logical file identifier) used to refer to the file within the program. It is an integer. If there is an error in the attempt to open the file, this value is negative.

filename  A character string containing the physical file name. (Later we discuss pathnames that include directory information about the file's location. This argument can be a pathname.)

flags     The flags argument is an integer that controls the operation of the open function, determining whether it opens an existing file for reading or writing. It can also be used to indicate that you want to create a new file, or open an existing file but delete its contents. The value of flags is set by performing a bitwise OR of the following values, among others.†

          O_APPEND   Append every write operation to the end of the file.
          O_CREAT    Create and open a file for writing. This has no effect if the file already exists.
          O_EXCL     Return an error if O_CREAT is specified and the file exists.
          O_RDONLY   Open a file for reading only.
          O_RDWR     Open a file for reading and writing.
          O_TRUNC    If the file exists, truncate it to a length of zero, destroying its contents.
          O_WRONLY   Open a file for writing only.

          Some of these flags cannot be used in combination with one another. Consult your documentation for details, as well as for other options.

pmode     If O_CREAT is specified, pmode is required. This integer argument specifies the protection mode for the file. In UNIX, the pmode is a three-digit octal number that indicates how the file can be used by the owner (first digit), by members of the owner's group (second digit), and by everyone else (third digit). The first bit of each octal digit indicates read permission, the second write permission, and the third execute permission. So, if pmode is the octal number 0751, the file's owner has read, write, and execute permission for the file; the owner's group would have read and execute permission; and everyone else has only execute permission:

                              r w e   r w e   r w e
              pmode = 0751 =  1 1 1   1 0 1   0 0 1
                              owner   group   world

† These values are defined in an "include" file packaged with your UNIX system or C compiler. The name of the include file is often fcntl.h or file.h, but it can vary from system to system.
Given this description of the open( ) function, we can develop some examples to show how it can be used to open and create files in C. The following function call opens an existing file for reading and writing, or creates a new one if necessary. If the file exists it is opened without change; reading or writing would start at the file's first byte.

    fd = open(filename, O_RDWR | O_CREAT, 0751);

The following call creates a new file for reading and writing with the name specified in filename. If a file with this name already exists, its contents are truncated.

    fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0751);

Finally, here is a call that will create a new file only if there is not already a file with the name specified in filename. If a file with this name exists, it is not opened and the function returns a negative value to indicate an error.

    fd = open(filename, O_RDWR | O_CREAT | O_EXCL, 0751);

File protection is tied more to the host operating system than to a specific language. For example, implementations of Pascal running on systems that support file protection, such as VAX/VMS, often include extensions to standard Pascal that let you associate a protection status with a file when you create it.
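Putting these calls together, a program usually checks the returned descriptor before going on. The fragment below is our own sketch rather than one of the book's examples; it tries to open an existing file and, failing that, creates it with the pmode used above.

    /* Sketch: open filename for reading and writing, creating it if needed. */
    #include <stdio.h>
    #include <fcntl.h>

    int open_or_create(char *filename)
    {
        int fd;

        fd = open(filename, O_RDWR);           /* try an existing file       */
        if (fd < 0)                             /* negative value: no file    */
            fd = open(filename, O_RDWR | O_CREAT, 0751);
        if (fd < 0)
            printf("Could not open or create %s\n", filename);
        return fd;                              /* descriptor, or error value */
    }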
2.3 Closing Files

In terms of our telephone line analogy, closing a file is like hanging up the phone. When you hang up the phone, the phone line is available for taking or placing another call; when you close a file, the logical file name or file descriptor is available for use with another file. Closing a file that has been used for output also ensures that everything has been written to the file. As you will learn in a later chapter, it is more efficient to move data to and from secondary storage in blocks than it is to move data one byte at a time. Consequently, the operating system does not immediately send off the bytes we write, but saves them up in a buffer for transfer as a block of data. Closing a file makes sure that the buffer for that file has been flushed of data and that everything we have written has actually been sent to the file.

Files are usually closed automatically by the operating system when a program terminates normally. Consequently, the explicit use of a CLOSE statement within a program is needed only as protection against data loss in the event of program interruption and to free up logical filenames for reuse. Some languages, including Standard Pascal, do not even provide a CLOSE statement. However, explicit file closing is possible in the C language, VAX Pascal, PL/I, and most other languages used for serious file processing work.
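To see why the explicit close matters for output, consider this small sketch of ours (not a program from the book): the bytes handed to write( ) may sit in a system buffer, and it is the close( ) that guarantees the final, partially filled buffer actually reaches the file.

    /* Sketch: write a short message; close( ) flushes the output buffer. */
    #include <fcntl.h>

    void save_message(char *filename, char *msg, int length)
    {
        int fd;

        fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC, 0751);
        if (fd < 0)
            return;                   /* could not create the file           */
        write(fd, msg, length);       /* may stop in a system buffer         */
        close(fd);                    /* flush it and free the descriptor    */
    }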
Now that you know how to connect and disconnect programs to and from physical files and how to open the files, you are ready to start sending and receiving data.
2.4 Reading and Writing

Reading and writing are fundamental to file processing; they are the actions that make file processing an input/output (I/O) operation. The actual form of the read and write statements used in different languages varies. Some languages provide very high-level access to reading and writing and automatically take care of details for the programmer. Other languages provide access at a much lower level. Our use of Pascal and C allows us to explore some of these differences.†

† To accentuate these differences and provide a look at I/O operations at something closer to a systems level, we use the read( ) and write( ) system calls in C rather than higher-level functions such as fgetc( ), fgets( ), and so on.
2.4.1 Read and Write Functions

We begin here with reading and writing at a relatively low level. It is useful to have a kind of systems-level understanding of what happens when we send and receive information to and from a file.

A low-level read call requires three pieces of information, expressed here as arguments to a generic READ( ) function:

    READ(Source_file, Destination_addr, Size)

Source_file       The READ( ) call must know from where it is to read. We specify the source by logical file name (phone line) through which data is received. (Remember, before we do any reading we must have already opened the file, so the connection between a logical file and a specific physical file or device already exists.)

Destination_addr  READ( ) must know where to place the information it reads from the input file. In this generic function we specify the destination by giving the address of the first memory block where we want to store the data.

Size              Finally, READ( ) must know how much information to bring in from the file. Here the argument is supplied as a byte count.

A WRITE statement is similar; the only difference is that the data moves in the other direction:

    WRITE(Destination_file, Source_addr, Size)

Destination_file  The logical file name we use for sending the data.

Source_addr       WRITE( ) must know where to find the information that it will send. We provide this specification as the first address of the memory block where the data is stored.

Size              The number of bytes to be written must be supplied.
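The generic READ( ) and WRITE( ) correspond directly to the UNIX read( ) and write( ) calls used in this book. As a sketch of our own (not one of the book's figures), the loop below copies one already-opened file to another, moving data in 512-byte chunks; read( ) reports how many bytes it actually delivered, so the last, partial chunk is handled as well.

    /* Sketch: copy the file on descriptor in_fd to the file on out_fd. */
    #define BUFSIZE 512

    void copy_file(int in_fd, int out_fd)
    {
        char buffer[BUFSIZE];
        int  nbytes;                          /* bytes returned by read( )  */

        while ((nbytes = read(in_fd, buffer, BUFSIZE)) > 0)
            write(out_fd, buffer, nbytes);    /* write only what was read   */
    }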
2.4.2 A Program to Display the Contents of a File

Let's do some reading and writing to see how these functions are used. This first simple file processing program, which we call LIST, opens a file for input and reads it, character by character, sending each character to the screen after it is read from the file. LIST includes the following steps:

1. Display a prompt for the name of the input file.
2. Read the user's response from the keyboard into a variable called filename.
3. Open the file for input.
4. While there are still characters to be read from the input file,
   a. read a character from the file;
   b. write the character to the terminal screen.
5. Close the input file.

Figures 2.2 and 2.3 are, respectively, C and Pascal language implementations of this program. It is instructive to look at the differences between these implementations.

FIGURE 2.2 The LIST program in C.

    /* list.c -- program to read characters from a file and write them
    **           to the terminal screen
    */
    #include <stdio.h>
    #include <fcntl.h>

    main( )
    {
        char c;
        int  fd;                      /* file descriptor */
        char filename[20];

        printf("Enter the name of the file: ");   /* Step 1  */
        gets(filename);                           /* Step 2  */
        fd = open(filename, O_RDONLY);            /* Step 3  */
        while (read(fd, &c, 1) != 0)              /* Step 4a */
            write(STDOUT, &c, 1);                 /* Step 4b */
        close(fd);                                /* Step 5  */
    }
Steps 1 and 2 of the program involve writing and reading, but in each of the implementations this is accomplished through the usual functions for handling the screen and keyboard. Step 4a, where we read from the input file, is the first instance of actual file I/O. Note that the read( ) call in the C language parallels the low-level, generic READ( ) statement we described earlier; in truth, we used the read( ) system call in C as the model for our low-level READ( ). The function's first argument gives the file descriptor (C's version of a logical file name) as the source for the input, the second argument gives the address of a character variable used as the destination for the data, and the third argument specifies that only one byte will be read.

The arguments for the Pascal read( ) call communicate the same information at a higher level. Once again, the first argument is the logical file name for the input source. The second argument gives the name of a character variable used as a destination; given the name, Pascal can find the address. Because of Pascal's strong emphasis on variable types, the third argument of the generic READ( ) function is not required. Pascal assumes that since we are reading data into a variable of type char, we must want to read only one byte.

After a character is read, we write it out to the screen in Step 4b. Once again the differences between C and Pascal indicate the range of approaches to I/O used in different languages. Everything must be stated explicitly in the C write( ) call.
Using the special, assigned file descriptor of STDOUT to identify the terminal screen as the destination for our writing,

    write(STDOUT, &c, 1)

means: "Write to the screen the contents from memory starting at the address &c. Write only one byte." Beginning C programmers should pay special attention to the use of the & symbol in the write( ) call here; this particular call, as a very low-level call, requires that the programmer provide the starting address in RAM of the bytes to be transferred.

STDOUT, which stands for "standard output," is an integer value defined in the file stdio.h, which has been included at the top of the program. The actual value of STDOUT that is set in stdio.h is, by convention, always 1. The concept of standard output and its counterpart "standard input" are covered later in the section "Physical and Logical Files in UNIX."
FIGURE 2.3 The LIST program in Pascal.

    PROGRAM list (INPUT, OUTPUT);
    { reads input from a file and writes it to the terminal screen }

    VAR
       c        : char;
       infile   : file of char;                  { logical file name  }
       filename : packed array [1..20] of char;  { physical file name }

    BEGIN {main}
       write('Enter the name of the file: ');    { Step 1  }
       readln(filename);                         { Step 2  }
       reset(infile, filename);                  { Step 3  }
       while not (eof(infile)) DO                { Step 4  }
       BEGIN
          read(infile, c);                       { Step 4a }
          write(c)                               { Step 4b }
       END;
       close(infile)                             { Step 5  }
    END.
Again the Pascal code operates at a higher level.† When no logical file name is specified in a write( ) statement, Pascal assumes that we are writing to the terminal screen. Since the variable c is of type char, Pascal assumes we are writing a single byte. The statement becomes simply

    write(c)

As in the read( ) statement, Pascal takes care of finding the address of the bytes; the programmer need specify only the name of the variable that is associated with that address.

† This is not to say that C does not have similar high-level functions. In fact, the standard C library provides a panoply of higher-level I/O functions, including putc( ), which functions for characters exactly like the Pascal write( ) shown here. We have chosen to emphasize the use of the lower-level C functions mainly for pedagogical reasons. They provide opportunities for us to understand more fully the way file I/O works.
2.4.3 Detecting End-of-File

The programs in Figs. 2.2 and 2.3 have to know when to end the while loop and stop reading characters. Pascal and C signal the end-of-file condition differently, illustrating two of the most commonly used approaches to end-of-file detection.

Pascal supplies a Boolean function, eof( ), which can be used to test for end-of-file. As we read from a file, the operating system keeps track of our location in the file with a read/write pointer. This is necessary so when the next byte is read, the system knows where to get it. The eof( ) function queries the system to see whether the read/write pointer has moved past the last element in the file. If it has, eof( ) returns true; otherwise it returns false. As Fig. 2.3 illustrates, we use the eof( ) call before trying to read the next byte. For an empty file, eof( ) immediately returns true and no bytes are read.

In the C language, the read( ) call returns the number of bytes read. If read( ) returns a value of zero, then the program has reached the end of the file. So, rather than using an eof( ) function, we construct the while loop to run as long as the read( ) call finds something to read.
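The same convention makes it easy to process a whole file without any separate end-of-file test. The fragment below is our own sketch, not one of the book's programs; it counts the bytes in a file by reading until read( ) reports that nothing is left.

    /* Sketch: count the bytes in the file on descriptor fd by reading
    ** one character at a time until read( ) reports nothing left.
    */
    long count_bytes(int fd)
    {
        char c;
        long count = 0;

        while (read(fd, &c, 1) > 0)   /* zero (or an error) ends the loop */
            count++;
        return count;
    }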
2.5 Seeking
In the preceding sample programs we read through the file sequentially, reading one byte after another until we reach the end of the file. Every time a byte is read, the operating system moves the read/write pointer ahead, and we are ready to read the next byte.

Sometimes we want to read or write without taking the time to go through every byte sequentially. Perhaps we know that the next piece of information we need is 10,000 bytes away, and so we want to jump there to begin reading. Or perhaps we need to jump to the end of the file so we can add new information there. To satisfy these needs we must be able to control the movement of the read/write pointer.
The action of moving directly to a certain position in a file is often called seeking. A seek requires at least two pieces of information, expressed here as arguments to the generic pseudocode function SEEK( ):

    SEEK(Source_file, Offset)

Source_file  The logical file name in which the seek will occur.

Offset       The number of positions in the file the pointer is to be moved from the start of the file.

Now, if we want to move directly from the origin to the 373rd position in a file called data, we don't have to move sequentially through the first 372 positions first. Instead, we can say

    SEEK(data, 373)

2.5.1 Seeking in C
been incorporated into many
implementations of the C language is the ability to view a file as a
potentially very large array of bytes that just happens to be kept on secondary
storage. In an array of bytes in RAM, we can move to any particular byte
One of
the features of
that has
through the use of a subscript. The
provides
any byte in
The
C language seek function,
a similar capability for files.
where the
called lseek(
function has the following form:
lseekCfd, byte_offset, origin)
j^^citz
K~
variables have the following meanings:
pos
A long integer value returned by lseek( ) equal to the
position (in bytes) of the read/write pointer after it has
been moved.
fd
The
file
descriptor of the
file
to
which the
lseek(
) is
to
be applied.
byte_offset
),
us set the read/ write pointer to
a file.
lseek(
pos =
It lets
The number of bytes to move from some origin in the
file. The byte offset must be specified as a long integer,
hence the name Iseek for long seek. When appropriate,
the byte_offset can be negative.
20
FUNDAMENTAL
FILE
PROCESSING OPERATIONS
origin
value that specifies the starting position from which the
is to be taken. The origin can have the value 0, 1,
byte_offset
or
2-f
heek(
from
lseek(
from the current
2lseek(
from the end of the
the beginning of the
position;
file.
^CzAL^
file;
The following program fragment shows how you could use lseek( ) to move to a position that is 373 bytes into a file.

   long lseek(int fd, long offset, int origin);
   long pos;
   int  fd;
   . . .
   pos = lseek(fd, 373L, 0);
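As a further illustration of the three origin values, here is a small hedged sketch of our own (not from the text) that uses origin 2 to find out how long a file is and origin 0 to return to its beginning. The descriptor fd is assumed to refer to an already opened file.

   long file_length, pos;
   int  fd;
   . . .
   /* origin 2: offset 0 from the end of the file returns the file length */
   file_length = lseek(fd, 0L, 2);

   /* origin 0: offset 0 from the beginning moves back to the first byte  */
   pos = lseek(fd, 0L, 0);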
2.5.2 Seeking in Pascal

The view of a file as presented in Pascal differs from the view in C in at least two important respects:

1. C views a file as a sequence of bytes, so addressing within a file in C is on a byte-by-byte basis. When we use lseek( ) to seek to a position, we express the address in terms of bytes. In Pascal a file is a sequence of "records" of some particular type. A record can be a simple scalar such as a character or integer, or it may be some more complex structure. Addressing within a file is in terms of these records. For example, if a file is made up of 100-byte records, and we want to refer to the fourth record, we would do so in Pascal simply by referencing record number 4. In C, where the view is solely and always in terms of bytes, we would have to address the fourth record as byte address 400 (see the short sketch following this list).
2. Standard Pascal actually does not provide for seeking. The model for I/O for standard Pascal is magnetic tape, which must be read sequentially. In standard Pascal, adding data to the end of a file requires reading the entire file from beginning to end, writing out the data from the input file to a second, output file, and then adding the new data to the end of the output file. However, many implementations of Pascal, such as VAX Pascal and Turbo Pascal, have extended the standard and do support seeking.
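To make the byte-address arithmetic in the first point concrete, here is a small hedged C sketch (our own illustration, with record numbers counted from zero) that computes where a fixed-length record begins and seeks to it with lseek( ).

   long rec_size = 100L;            /* each record is 100 bytes             */
   long rec_num  = 4L;              /* record number, counting from zero    */
   long byte_offset, pos;
   int  fd;
   . . .
   byte_offset = rec_num * rec_size;     /* 4 * 100 = 400                   */
   pos = lseek(fd, byte_offset, 0);      /* seek from the start of the file */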
There is an extension to Pascal proposed by the Joint ANSI/IEEE Pascal Standards Committee (1984) that may be included in the Pascal standard in the future. It includes the following procedures and functions that permit seeking:

   SeekWrite(f,n)    A procedure that positions the file f on the element
                     with index n and places the file in write mode, so the
                     selected and following elements may be modified.
   SeekRead(f,n)     A procedure that positions the file f on the element
                     with index n and places the file in read mode, so the
                     selected and following elements may be examined. If
                     SeekRead( ) attempts to position beyond the end of
                     the file, then the file is positioned at the end of the
                     file.
   Position(f)       A function that returns the index value representing
                     the position of the current file element.
   EndPosition(f)    A function that returns the index value representing
                     the position of the last file element.

Many Pascal implementations, recognizing the need to provide seeking capabilities, had already implemented seeking functions before these proposals were set forth. Consequently, the mechanisms for handling seeking vary widely among implementations.
2.6 Special Characters in Files

As you create the file structures described in this text, you may encounter some difficulty with extra, unexpected characters that turn up in your files, with characters that disappear, and with numeric counts that are inserted into your files. Here are some examples of the kinds of things you might encounter:

- On many small computers you may find that a Control-Z (ASCII value of 26) is appended at the end of your files. Some applications use this to indicate end-of-file even if you have not placed it there. This is most likely to happen on MS-DOS systems.

- Some systems adopt a convention of indicating end-of-line in a text file† as a pair of characters consisting of a carriage return (CR: ASCII value of 13) and a line feed (LF: ASCII value of 10). Sometimes I/O procedures written for such systems automatically expand single CR characters or LF characters into CR-LF pairs. This unrequested addition of characters can cause a great deal of difficulty. Again, you are most likely to encounter this phenomenon on MS-DOS systems.

- Users of larger systems, such as VMS, may find that they have just the opposite problem. Certain file formats under VMS remove carriage return characters from your file without asking you, replacing them with a count of the characters in what the system has perceived as a line of text.

†When we use the term text file in this text, we are referring to a file consisting entirely of characters from a specific standard character set, such as ASCII or EBCDIC. Unless otherwise specified, the ASCII character set will be assumed. An appendix contains a table that describes the ASCII character set.

These are just a few examples of the kinds of uninvited modifications that record management systems or I/O support packages might make to
your files. You will find that they are usually associated with the concepts
of a line of text or end of a file. In general, these modifications to your files
are an attempt to make your life easier by doing things for you
automatically. This might, in fact, work out for users who want to do
nothing more than store some text in a file. Unfortunately, however,
programmers building sophisticated file structures must sometimes spend a
lot of time finding ways to disable this automatic assistance so they can have
complete control over what they are building. Forewarned is forearmed;
readers who encounter these kinds of difficulties as they build the file
structures described in this text can take some comfort from the knowledge
that the experience they gain in disabling automatic assistance will serve
them well, over and over, in the future.

2.7 The UNIX Directory Structure
No matter what computer system you have, even if it is a small PC, chances are there are hundreds or even thousands of files you have access to. To provide convenient access to such large numbers of files, your computer has some method for organizing its files. In UNIX this is called the filesystem.

The UNIX filesystem is a tree-structured organization of directories, with the root of the tree signified by the character '/'. All directories, including the root, can contain two kinds of files: regular files with programs and data, and directories (Fig. 2.4). Since devices such as tape drives are also treated like files in UNIX, directories can also contain references to devices, as shown in the dev directory in Fig. 2.4. The file name stored in a UNIX directory corresponds to what we call its physical name.

FIGURE 2.4 Sample UNIX directory structure.

Since every file in a UNIX system is part of the filesystem that begins with the root, any file can be uniquely identified by giving its absolute pathname. For instance, the true, unambiguous name of the file "addr" in Fig. 2.4 is /usr6/mydir/addr. (Note that the '/' is used both to indicate the root directory and to separate directory names from the file name.)

When you issue commands to a UNIX system, you do so within some directory, which is called your current directory. A pathname for a file that does not begin with a '/' describes the location of a file relative to the current directory. Hence, if your current directory in Fig. 2.4 is mydir, addr uniquely identifies the file /usr6/mydir/addr.

The special filename "." stands for the current directory, and ".." stands for the parent of the current directory. Hence, if your current directory is /usr6/mydir/DF, "../addr" refers to the file /usr6/mydir/addr.

2.8 Physical and Logical Files in UNIX
2.8.1 Physical Devices as UNIX Files
One of the most powerful ideas in UNIX is reflected in its notion of what
a file is. In UNIX, a file is a sequence of bytes, without any implication of
how or where the bytes are stored or where they originate. This simple
conceptual view of a file makes it possible in UNIX to do with very few operations what might require many times as many operations on a different operating system. For example, it is easy to think of a magnetic disk as the source of a file, because we are used to the idea of storing such things on disks. But in UNIX, devices like the keyboard and the console are also files; in Fig. 2.4 they are /dev/kbd and /dev/console, respectively. The keyboard produces a sequence of bytes that are sent to the computer when keys are pressed; the console accepts a sequence of bytes and displays their corresponding symbols on a screen.

How can we say that the UNIX concept of a file is simple when it allows so many different physical things to be called files? Doesn't this make the situation more complicated, not simpler? The trick in UNIX is that no matter what physical representation a file may take, the logical view of a UNIX file is the same. In its simplest form, a UNIX file is represented logically by an integer: the file descriptor. This integer is an index to an array of more complete information about the file. A keyboard, a disk file, and a magnetic tape are all represented by integers. Once the integer that describes a file is identified, a program can access that file. If it knows the logical name of a file, a program can access that file without knowing whether the file comes from a disk, a tape, or a telephone.
2.8.2 The Console, the Keyboard, and Standard Error
We see an example of the duality between devices and files in the LIST program in Fig. 2.2:

   /* Step 1  */   printf("Enter the name of the file: ");
   /* Step 2  */   gets(filename);
   /* Step 3  */   fd = open(filename, O_RDONLY);
   /* Step 4a */   while (read(fd, &c, 1) > 0)
   /* Step 4b */       write(STDOUT, &c, 1);

The logical file is some small integer value returned by the open( ) call. We assign this integer to the variable fd in Step 3. In Step 4b, we use the integer STDOUT, defined earlier in the program, to identify the console as the file to be written to.

There are two other file descriptors that are special in UNIX: the keyboard is called STDIN (standard input), and the error file is called STDERR (standard error). Hence, STDIN is the keyboard on your terminal. The statement

   read(STDIN, &c, 1);

reads a single character from your terminal. STDERR is an error file which, like STDOUT, is usually just your console. When your compiler detects an error, it generally writes the error message to this file, which means normally that the error message turns up on your screen. As with STDIN, the values STDOUT and STDERR are usually defined in stdio.h.

Steps 1 and 2 of the LIST program also involve reading and writing from STDIN or STDOUT. Since an enormous amount of I/O involves these devices, most programming languages have special functions to perform console input and output; in LIST, the C functions printf and gets are used. Ultimately, however, printf and gets send their output and take their input through STDOUT and STDIN, respectively. But these statements hide important elements of the I/O process. For our purposes, the second set of read and write statements is more interesting and instructive.
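For completeness, here is a small hedged sketch of our own (not part of the LIST program) showing how a program might report a problem on STDERR, so that error text still reaches the screen even when STDOUT has been redirected. The message text is invented, and STDERR is assumed to be defined, as described above.

   char *msg = "LIST: cannot open input file\n";    /* illustrative message        */
   . . .
   write(STDERR, msg, strlen(msg));                 /* error text goes to STDERR   */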
2.8.3 I/O Redirection and Pipes

Suppose you would like to change the LIST program so it writes its output to a regular file, rather than to STDOUT. Or suppose you wanted to use the output of LIST as input to another program. Because it is common to want to do both of these, UNIX provides convenient shortcuts for switching between standard I/O (STDIN and STDOUT) and regular file I/O. These shortcuts are called I/O redirection and pipes.†

I/O redirection lets you specify at execution time alternate files for input or output. The notations for input and output redirection are

   < file     (redirect STDIN to "file")
   > file     (redirect STDOUT to "file")

For example, if the executable LIST program is called "list," we redirect the output from STDOUT to a file called "myfile" by entering the line

   list > myfile

What if, instead of storing the output from the list program in a file, you wanted to use it immediately in another program to sort the results? Pipes let you do this. The notation for a UNIX pipe is '|'. Hence,

   program1 | program2

means take any STDOUT output from program1 and use it in place of any STDIN input to program2. Since UNIX has a special program called sort, which takes its input from STDIN, you can sort the output from the list program, without using an intermediate file, by entering

   list | sort

Since sort writes its output to STDOUT, the sorted listing appears on your terminal screen unless you use additional pipes or redirection to send it elsewhere.

†Strictly speaking, I/O redirection and pipes are part of a UNIX shell, which is the command interpreter that sits on top of the core UNIX operating system, the kernel. For the purpose of this discussion, this distinction is not important.
2.9 File-related Header Files

UNIX, like all operating systems, has special names and values that you must use when performing file operations. For example, some C functions return a special value indicating end-of-file (EOF) when you try to read beyond the end of a file. Recall the flags that you use in an open( ) call to indicate whether you want read-only, write-only, or read/write access. Unless we know just where to look, it is often not easy to find where these values are defined. UNIX handles the problem by putting such definitions in special header files, which can be found in special directories such as /usr/include.

Three header files relevant to the material in this chapter are stdio.h, fcntl.h, and file.h. EOF, for instance, is defined on many UNIX systems in /usr/include/stdio.h, as are the file pointers STDIN, STDOUT, and STDERR. And the flags O_RDONLY, O_WRONLY, and O_RDWR can usually be found in /usr/include/sys/file.h or possibly one of the files that it includes. It would be instructive for you to browse through these files, as well as others that pique your curiosity.
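As a hedged illustration of our own, a program that needs these definitions might begin as follows. Exactly which header supplies O_RDWR varies from system to system, as noted above, and the file name "myfile" is a made-up example.

   #include <stdio.h>      /* EOF, among other definitions            */
   #include <fcntl.h>      /* O_RDONLY, O_WRONLY, O_RDWR on many systems */

   int fd;
   . . .
   fd = open("myfile", O_RDWR);    /* open an existing file for reading and writing */
   if (fd < 0)
       . . .                       /* the open( ) failed                            */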
2.10 UNIX Filesystem Commands

UNIX provides many commands for manipulating files. We list a few that are relevant to the material in this chapter. Most of them have many options, but the simplest uses of most should be obvious. Consult a UNIX manual for more information on how to use them.

   cat filenames           Print the contents of the named text files.
   tail filename           Print the last 10 lines of the text file.
   cp file1 file2          Copy file1 to file2.
   mv file1 file2          Move (rename) file1 to file2.
   rm filenames            Remove (delete) the named files.
   chmod mode filename     Change the protection mode on the named files.
   ls                      List the contents of the directory.
   mkdir name              Create a directory with the given name.
   rmdir name              Remove the named directory.
SUMMARY

This chapter introduces the fundamental operations of file systems: CREATE( ), OPEN( ), CLOSE( ), READ( ), WRITE( ), and SEEK( ). Each of these operations involves the creation or use of a link between a physical file stored on a secondary device and a logical file that represents a program's more abstract view of the same file. When the program describes an operation using the logical file name, the equivalent physical operation gets performed on the corresponding physical file.

The six operations appear in programming languages in many different forms. Sometimes they are built-in commands, sometimes they are functions, and sometimes they are direct calls to an operating system. Not all languages provide the user with all six operations. The operation SEEK( ), for instance, is not available in standard Pascal.

Before we can use a physical file, we must link it to a logical file. In some programming environments we do this with a statement (e.g., assign in Turbo Pascal) or with instructions outside of the program (e.g., job control language [JCL] instructions). In other languages the link between the physical file and a logical file is made with OPEN( ) or CREATE( ).

The operations CREATE( ) and OPEN( ) make files ready for reading or writing. CREATE( ) causes a new physical file to be created. OPEN( ) operates on an already existing physical file, usually setting the read/write pointer to the beginning of the file. The CLOSE( ) operation breaks the link between a logical file and its corresponding physical file. It also makes sure that the file buffer is flushed so everything that was written is actually sent to the file.

The I/O operations READ( ) and WRITE( ), when viewed at a low, systems level, require three items of information:

- The logical name of the file to be read from or written to;
- An address of a memory area to be used for the "inside of the computer" part of the exchange; and
- An indication of how much data is to be read or written.

These three fundamental elements of the exchange are illustrated in Fig. 2.5.

FIGURE 2.5 The exchange between memory and external device.

READ( ) and WRITE( ) are sufficient for moving sequentially through a file to any desired position, but this form of access is often very inefficient. Some languages provide seek operations that let a program move directly to a certain position in a file. Direct access is provided in C by means of the lseek( ) operation. The lseek( ) operation lets us view a file as a kind of large array, giving us a great deal of freedom in deciding how to organize a file. Standard Pascal does not support direct access, but many dialects of Pascal do.

One other useful file operation involves knowing when the end of a file has been reached. End-of-file detection is handled in different ways by different languages.

Much effort goes into shielding programmers from having to deal with the physical characteristics of files, but inevitably there are little details about the physical organization of files that programmers must know. When we try to have our program operate on files at a very low level (as we do a great deal in this text), we must be on the lookout for little surprises inserted in our file by the operating system or applications.

The UNIX file system, called the filesystem, organizes files in a tree structure, with all files and subdirectories expressible by their pathnames. It is possible to navigate around the filesystem as you work with UNIX files. UNIX views both physical devices and traditional disk files as files, so, for example, a keyboard (STDIN), a console (STDOUT), and a tape drive all are considered files. This simple conceptual view of files makes it possible in UNIX to do with a very few operations what might require many times the operations on a different operating system.

I/O redirection and pipes are convenient shortcuts provided in UNIX for transferring file data between files and standard I/O. Header files in UNIX, such as stdio.h, contain special names and values that you must use when performing file operations. It is important to be aware of the most common of these in use on your system.

Section 2.10 lists a sampling of UNIX commands for manipulating files.
KEY TERMS
Access mode. Type of file access allowed. The variety of access modes permitted varies from operating system to operating system.

Buffering. When input or output is saved up rather than sent off to its destination immediately, we say that it is buffered. In later chapters, we find that we can dramatically improve the performance of programs that read and write data if we buffer the I/O.

Byte offset. The distance, measured in bytes, from the beginning of the file. The very first byte in the file has an offset of 0, the second byte has an offset of 1, and so on.

CLOSE( ). A function or system call that breaks the link between a logical file name and the corresponding physical file name.

CREATE( ). A function or system call that causes a file to be created on secondary storage and may also bind a logical name to the file's physical name (see OPEN( )). A call to CREATE( ) also results in the generation of information used by the system to manage the file, such as time of creation, physical location, and access privileges for anticipated users of the file.

End-of-file (EOF). An indicator within a file that the end of the file has occurred, a function that tells if the end of a file has been encountered (e.g., eof( ) in Pascal), or a system-specific value that is returned by file-processing functions indicating that the end of a file has been encountered in the process of carrying out the function (e.g., EOF in UNIX).

File descriptor. A small, non-negative integer value returned by a UNIX open( ) or creat( ) call that is used as a logical name for the file in later UNIX system calls.

Filesystem. The name used in UNIX to describe a collection of files and directories organized into a tree-structured hierarchy.

Header file. A file in a UNIX environment that contains definitions and declarations commonly shared among many other files and applications. In C, header files are included in other files by means of the "#include" statement (see Fig. 2.2). The header files stdio.h, file.h, and fcntl.h described in this chapter contain important declarations and definitions used in file processing.

I/O redirection. The redirection of a stream of input or output from its normal place. For instance, the operator '>' can be used to redirect to a file output that would normally be sent to the console.

Logical file. The file as seen by the program. The use of logical files allows a program to describe operations to be performed on a file without knowing what actual physical file will be used. The program may then be used to process any one of a number of different files that share the same structure.

OPEN( ). A function or system call that makes a file ready for use. It may also bind a logical file name to a physical file. Its arguments include the logical file name and the physical file name and may also include information on how the file is expected to be accessed.

Pathname. A character string that describes the location of a file or directory. If the pathname starts with a '/', then it gives the absolute pathname, the complete path from the root directory to the file. Otherwise it gives the relative pathname, the path relative to the current working directory.

Physical file. A file that actually exists on secondary storage. It is the file as known by the computer operating system and that appears in its file directory.

Pipe. A UNIX operator specified by the symbol '|' that carries data from one process to another. The originating process specifies that the data is to go to STDOUT, and the receiving process expects the data from STDIN. For example, to send the standard output from a program makedata to the standard input of a program called usedata, use the command "makedata | usedata".

Protection mode. An indication of how a file can be accessed by various classes of users. In UNIX, the protection mode is a three-digit octal number that indicates how the file can be read, written to, and executed by the owner, by members of the owner's group, and by everyone else.

READ( ). A function or system call used to obtain input from a file or device. When viewed at the lowest level, it requires three arguments: (1) a Source_file logical name corresponding to an open file; (2) the Destination_address for the bytes that are to be read; and (3) the Size or amount of data to be read.

SEEK( ). A function or system call that sets the read/write pointer to a specified position in the file. Languages that provide seeking functions allow programs to access specific elements of a file directly, rather than having to read through a file from the beginning (sequentially) each time a specific item is desired. In C, the lseek( ) system call provides this capability. Standard Pascal does not have a seeking capability, but many nonstandard dialects of Pascal do.

Standard I/O. The source and destination conventionally used for input and output. In UNIX, there are three types of standard I/O: standard input (STDIN), standard output (STDOUT), and STDERR (standard error). By default STDIN is the keyboard, and STDOUT and STDERR are the console screen. I/O redirection and pipes provide ways to override these defaults.

WRITE( ). A function or system call used to provide output capabilities. When viewed at the lowest level, it requires three arguments: (1) a Destination_file name corresponding to an open file; (2) the Source_address of the bytes that are to be written; and (3) the Size or amount of the data to be written.
EXERCISES
1. Look up operations equivalent to OPEN( ), CLOSE( ), CREATE( ), READ( ), WRITE( ), and SEEK( ) in other high-level languages, such as PL/I, COBOL, and Fortran. Compare them with the C or Pascal versions.

2. If you use C:
   a) Make a list of the different ways to perform the file operations CREATE( ), OPEN( ), CLOSE( ), READ( ), and WRITE( ). Why is there more than one way to do each operation?
   b) How would you use lseek( ) to find the current position in a file?
   c) Show how to change the permissions on a file myfile so the owner has read and write permissions, group members have execute permission, and others have no permission.
   d) What is the difference between pmode and O_RDWR? What pmodes and O_RDWR values are available on your system?
   e) In some typical C environments, such as UNIX and MS-DOS, all of the following represent ways to move data from one place to another:

         scanf( )     fgetc( )     read( )
         fscanf( )    gets( )      cat (or type)
         getc( )      fgets( )     <
         main(argc, argv)

      Describe as many of these as you can, and indicate how they might be useful. Which belong to the C language, and which belong to the operating system?

3. If you use Pascal:
   a) What ways are provided in your version of Pascal to perform the file operations OPEN( ), CREATE( ), CLOSE( ), READ( ), and WRITE( )? If there is more than one way to perform a certain operation, tell why. If an operation is missing, how are its functions carried out?
   b) Implement a SEEK( ) function in your Pascal, if it does not already have one.

4. A couple of years ago a company we know of bought a new COBOL compiler. One difference between the new compiler and the old one was that the new compiler did not automatically close files when execution of a program terminated, whereas the old compiler did. What sorts of problems did this cause when some of the old software was executed after having been recompiled with the new compiler?

5. Look at the two LIST programs in the text. Each has a while loop. In Pascal, the sequence of steps in the loop is test, read, write. In C, it is read, test, write. Why the difference? What would happen in Pascal if we used the loop construction used for C? What would happen in C if we used the Pascal loop construction?

6. In Fig. 2.4:
   a. Give the full pathname for a file in directory DF.
   b. Suppose your current directory is bin. Show how to copy the file libdf.a to the directory DF without changing your current directory.

7. What is the difference between STDOUT and STDERR? Find how to direct error messages from a compilation on your system to STDERR.

8. Look up the UNIX command wc. Execute the following in a UNIX environment, and explain why it gives the number of files in the directory:

      ls | wc -l

9. Find stdio.h on your system, and find what value is used to indicate end-of-file. Also examine file.h or fcntl.h and describe in general what its contents are for.

Programming Exercises

10. Make the LIST program we provide in this chapter work with your compiler on your operating system.

11. Write a program to create a file and store a string in it. Write another program to open the file and read the string.

12. Try setting the protection mode on a file to read-only, then opening the file with an access mode of read/write. What happens?

13. Implement the UNIX command tail -n, where n is the number of lines from the end of the file to be copied to STDOUT.

14. Change the program LIST so it reads from STDIN, rather than a file, and writes to a file, rather than STDOUT. Show how to execute the new version of the program in a UNIX environment, given that the input is actually in a file called instuff. (You can also do this in most MS-DOS environments.)

15. Write a program to read a series of names, one per line, from standard input, and write out those names spelled in reverse order to standard output. Use I/O redirection and pipes to do the following:
   a. Input a series of names that are typed in from the keyboard and write them out, reversed, to a file called file1.
   b. Read the names in from file1; then write them out, re-reversed, to a file called file2.
   c. Read the names in from file2, reverse them again, and then sort the resulting list of reversed words using sort.
FURTHER READINGS
Introductory textbooks on C and Pascal tend to treat the fundamental file operations only briefly, if at all. This is particularly true with regard to C, since there are higher-level standard I/O functions in C, such as the read operations fgets( ) and fgetc( ). Some books on C and/or UNIX that do provide treatment of the fundamental file operations are Bourne (1984), Kernighan and Pike (1984), and Kernighan and Ritchie (1978, 1988). These books also provide discussions of higher-level I/O functions that we omitted from our text.

As for UNIX specifically, as of this writing there are two dominant flavors of UNIX: UNIX System V from AT&T, the originators of UNIX, and 4.3BSD (Berkeley Software Distribution) UNIX from the University of California at Berkeley. The two versions are close enough that learning about either will give you a good understanding of UNIX generally. However, as you begin to use UNIX, you will need reference material on the specific version that you are using. There are many accessible texts on both versions, including Morgan and McGilton (1987) on System V, and Wang (1988) on 4.3BSD. Less readable but absolutely essential to a serious UNIX user is the 4.3BSD UNIX Programmer's Reference Manual (U.C. Berkeley, 1986) or the System V Interface Definition (AT&T, 1986).

For Pascal, these operations vary so greatly from one implementation to another that it is probably best to consult user's manuals and literature relating to your specific implementation. Cooper (1983) covers the ISO standard Pascal, as well as some extensions. Jensen and Wirth (1974) is the definition of Pascal on which all others are based. Wirth (1975) discusses some difficulties with standard Pascal and its file operations in the section, "An Important Concept and a Persistent Source of Problems: Files."
Secondary Storage and
System Software
CHAPTER OBJECTIVES
- Describe the organization of typical disk drives, including basic units of organization and their relationships.
- Identify and describe the factors affecting disk access time, and describe methods for estimating access times and space requirements.
- Describe magnetic tapes, identify some tape applications, and investigate the implications of block size on space requirements and transmission speeds.
- Identify fundamental differences between media and criteria that can be used to match the right medium to an application.
- Describe in general terms the events that occur when data is transmitted between a program and a secondary storage device.
- Introduce concepts and techniques of buffer management.
- Illustrate many of the concepts introduced in the chapter, especially system software concepts, in the context of UNIX.
CHAPTER OUTLINE
3.1 Disks
   3.1.1 The Organization of Disks
   3.1.2 Estimating Capacities and Space Needs
   3.1.3 Organizing Tracks by Sector
   3.1.4 Organizing Tracks by Block
   3.1.5 Nondata Overhead
   3.1.6 The Cost of a Disk Access
   3.1.7 Effect of Block Size on Performance: A UNIX Example
   3.1.8 Disk as Bottleneck
3.2 Magnetic Tape
   3.2.1 Organization of Data on Tapes
   3.2.2 Estimating Tape Length Requirements
   3.2.3 Estimating Data Transmission Times
   3.2.4 Tape Applications
3.3 Disk versus Tape
3.4 Storage as a Hierarchy
3.5 A Journey of a Byte
   3.5.1 The File Manager
   3.5.2 The I/O Buffer
   3.5.3 The Byte Leaves RAM: The I/O Processor and Disk Controller
3.6 Buffer Management
   3.6.1 Buffer Bottlenecks
   3.6.2 Buffering Strategies
3.7 I/O in UNIX
   3.7.1 The Kernel
   3.7.2 Linking File Names to Files
   3.7.3 Normal Files, Special Files, and Sockets
   3.7.4 Block I/O
   3.7.5 Device Drivers
   3.7.6 The Kernel and File Systems
   3.7.7 Magnetic Tape and UNIX

Good design is always responsive to the constraints of the medium and the environment. This is as true for file structure design as it is for designs in wood and stone. Given the ability to create, open, and close files, and to seek, read, and write, we can perform the fundamental operations of file construction. Now we need to look at the nature and limitations of the devices and systems used to store and retrieve files, preparing ourselves for file design.

If files were stored just in RAM, there would be no separate discipline called file structures. The general study of data structures would give us all the tools we would need to build file applications. But secondary storage devices are very different from RAM. One difference, as already noted, is that accesses to secondary storage take much more time than do accesses to RAM. An even more important difference, measured in terms of design impact, is that not all accesses are equal. Good file structure design uses knowledge of disk and tape performance to arrange data in ways that minimize access costs.

In this chapter we examine the characteristics of secondary storage devices, focusing on the constraints that shape our design work in the chapters that follow. We begin with a look at the major media used in the storage and processing of files: magnetic disks and tapes. We follow this with an overview of the range of other devices and media used for secondary storage. Next, by following the journey of a byte, we take a brief look at the many pieces of hardware and software that become involved when a byte is sent by a program to a file on a disk. Finally, we take a closer look at one of the most important aspects of file management: buffering.

3.1 Disks
Compared to the time it takes to access an item in RAM, disk accesses are
always expensive. However, not all disk accesses are equally expensive. The
reason for this has to do with the way a disk drive works. Disk drives† belong to a class of devices known as direct access storage devices (DASDs) because they make it possible to access data directly. DASDs are contrasted with serial devices, the other major class of secondary storage devices. Serial devices use media such as magnetic tape that permit only serial access; a particular data item cannot be read or written until all of the data preceding it on the tape have been read or written in order.

†When we use the terms disks or disk drives, we are referring to magnetic disk media.

Magnetic disks come in many forms. So-called hard disks offer high capacity and low cost per bit. Hard disks are the most common disk used in everyday file processing. Floppy disks are inexpensive, but they are slow and hold relatively little data. Floppies are good for backing up individual files or other floppies and for transporting small amounts of data. Removable disk packs are hard disks that can be mounted on the same drive at different times, providing a convenient form of backup storage that also makes it possible to access data directly. Nonmagnetic disk media, especially optical discs, are becoming increasingly important for secondary storage. (See Appendix A for a full treatment of optical disc storage and its applications.)

3.1.1 The Organization of Disks

The information stored on a disk is stored on the surface of one or more platters (Fig. 3.1). The arrangement is such that the information is stored in successive tracks on the surface of the disk (Fig. 3.2).
FIGURE 3.1 Schematic illustration of disk drive (platters, spindle, boom, and read/write heads).

Each track is often divided into a number of sectors. A sector is the smallest addressable portion of a disk. When a READ( ) statement calls for a particular byte from a disk file, the computer operating system finds the correct surface, track, and sector, reads the entire sector into a special area in RAM called a buffer, and then finds the requested byte within that buffer.

If a disk drive uses a number of platters, it may be called a disk pack. The tracks that are directly above and below one another form a cylinder (Fig. 3.3). The significance of the cylinder is that all of the information on a single cylinder can be accessed without moving the arm that holds the read/write heads. Moving this arm is called seeking. This arm movement is usually the slowest part of reading information from a disk.

FIGURE 3.2 Surface of disk showing tracks, sectors, and gaps.

FIGURE 3.3 Schematic illustration of disk drive viewed as a set of seven cylinders.

3.1.2 Estimating Capacities and Space Needs

Disks range in width from 2 to about 14 inches. They range in storage capacity from less than 400,000 bytes to billions of bytes. In a typical disk pack, the top and bottom platter each contribute one surface, and all other platters contribute two surfaces to the pack, so the number of tracks per cylinder is a function of the number of platters.

The amount of data that can be held on a track depends on how densely bits can be stored on the disk surface. (This in turn depends on the quality of the recording medium and the size of the read/write heads.) An inexpensive, low-density disk can hold about 4 kilobytes on a track and 35 tracks on a surface. A top-of-the-line disk can hold about 50 kilobytes on a track and more than 1,000 tracks on a surface. Table D.1 in Appendix D shows how a variety of disk drives compare in terms of capacity, performance, and cost.

Since a cylinder consists of a group of tracks, a track consists of a group of sectors, and a sector consists of a group of bytes, it is easy to compute track, cylinder, and drive capacities:

   Track capacity    = number of sectors per track x bytes per sector
   Cylinder capacity = number of tracks per cylinder x track capacity
   Drive capacity    = number of cylinders x cylinder capacity
If we know the number of bytes in a file, we can use these relationships to compute the amount of disk space the file is likely to require. Suppose, for instance, that we want to store a file with 20,000 fixed-length data records on a "typical" 300-megabyte small computer disk with the following characteristics:

   Number of bytes per sector    = 512
   Number of sectors per track   = 40
   Number of tracks per cylinder = 11
   Number of cylinders           = 1,331

How many cylinders does the file require if each data record requires 256 bytes? Since each sector can hold two records, the file requires

   20,000 / 2 = 10,000 sectors

One cylinder can hold

   40 x 11 = 440 sectors

so the number of cylinders required is approximately

   10,000 / 440 = 22.7 cylinders

Of course, it may be that a disk drive with 22.7 cylinders of available space does not have 22.7 physically contiguous cylinders available. In this likely case, the file might in fact have to be spread out over dozens, perhaps even hundreds, of cylinders.
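The arithmetic above is easy to capture in a few lines of code. The following hedged C sketch (our own, using the drive characteristics just listed) reproduces the calculation; the variable names are illustrative.

   #include <stdio.h>

   int main(void)
   {
       long records          = 20000L;
       long record_size      = 256L;      /* bytes per record  */
       long sector_size      = 512L;      /* bytes per sector  */
       long sectors_per_trk  = 40L;
       long tracks_per_cyl   = 11L;

       long recs_per_sector  = sector_size / record_size;          /* 2      */
       long sectors_needed   = records / recs_per_sector;          /* 10,000 */
       long sectors_per_cyl  = sectors_per_trk * tracks_per_cyl;   /* 440    */

       printf("Cylinders required: %.1f\n",
              (double) sectors_needed / (double) sectors_per_cyl); /* about 22.7 */
       return 0;
   }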
3.1.3 Organizing Tracks by Sector
There are two basic ways to organize data on a disk: by sector and by
user-defined block. So far, we have only mentioned sector organizations. In
this section we examine sector organizations more closely. In the following
section we look at block organizations.

The Physical Placement of Sectors   There are several views that one can have of the organization of sectors on a track. The simplest view, one that suffices for most users most of the time, is that sectors are adjacent, fixed-sized segments of a track that happen to hold a file (Fig. 3.4a). This is often a perfectly adequate way to view a file logically, but it may not be a good way to store sectors physically.

FIGURE 3.4 Two views of the organization of sectors on a 32-sector track.

When you want to read a series of sectors that are all in the same track, one right after the other, you often cannot read adjacent sectors. That is
because, after reading the data, it takes the disk controller a certain amount of time to process the received information before it is ready to accept more. So, if logically adjacent sectors were placed on the disk so they were also physically adjacent, we would miss the start of the following sector while we were processing the one we had just read in. Consequently, we would be able to read only one sector per revolution of the disk.

I/O system designers usually approach this problem by interleaving the sectors, leaving an interval of several physical sectors between logically adjacent sectors. Suppose our disk had an interleaving factor of 5. The assignment of logical sector content to the 32 physical sectors in a track is illustrated in Fig. 3.4(b). If you study this figure, you can see that it takes five revolutions to read the entire 32 sectors of a track. That is a big improvement over 32 revolutions.
Over the last year or two, controller speeds have improved so that high-performance disks can now offer 1:1 interleaving. This means that successive sectors actually are physically adjacent, making it possible to read an entire track in a single revolution of the disk.

Clusters   A third view of sector organization, also designed to improve performance, is the view maintained by that part of a computer's operating system that we call the file manager. When a program accesses a file, it is the file manager's job to map the logical parts of the file to their corresponding physical locations. It does this by viewing the file as a series of clusters of sectors. A cluster is a fixed number of contiguous sectors.† Once a given cluster has been found on a disk, all sectors in that cluster can be accessed without requiring an additional seek.

†It is not always physically contiguous; the degree of physical contiguity is determined by the interleaving factor.

To view a file as a series of clusters and still maintain the sectored view, the file manager ties logical sectors to the physical clusters that they belong to by using a file allocation table (FAT). The FAT contains a list of all the clusters in a file, ordered according to the logical order of the sectors they contain. With each cluster entry in the FAT is an entry giving the physical location of the cluster (Fig. 3.5).
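The FAT idea is easy to picture in code. The following C sketch is entirely our own illustration, not an actual operating system structure; the field and function names are invented for the example.

   /* One entry per logical cluster of the file, kept in logical order. */
   struct fat_entry {
       long logical_cluster;    /* position of the cluster within the file */
       long physical_location;  /* address of the cluster on the disk      */
   };

   /* Given a byte offset into the file, find the physical cluster holding it. */
   long physical_cluster(struct fat_entry fat[], long byte_offset,
                         long bytes_per_cluster)
   {
       long logical = byte_offset / bytes_per_cluster;
       return fat[logical].physical_location;
   }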
On many systems, the system administrator can decide how many sectors there should be in a cluster. For instance, in the standard physical disk structure used by VAX systems, the system administrator sets the cluster size to be used on a disk when the disk is initialized. The default value is three 512-byte sectors per cluster, but the cluster size may be set to any value between 1 and 65,535 sectors. Since clusters represent physically contiguous groups of sectors, larger clusters guarantee the ability to read
more sectors without seeking, so the use of large clusters can lead to substantial performance gains when a file is being processed sequentially.

FIGURE 3.5 The file manager determines which cluster in the file has the sector that is to be accessed.
Extents Our final view of sector organization represents a further attempt
to emphasize physical contiguity of sectors in a file, hence minimizing
seeking even more. (If you are getting the idea that the avoidance of seeking
is an important part of file design, you are right.) If there is a lot of free
room on a disk, it may be possible to make a file consist entirely of
contiguous clusters. When this is the case, we say that the file consists of one extent: All of its sectors, tracks, and (if it is large enough) cylinders form one contiguous whole (Fig. 3.6a). This is a good situation, especially if the file is to be processed sequentially, because it means that the whole file can be accessed with a minimum amount of seeking.

If there is not enough contiguous space available to contain an entire file, the file is divided into two or more noncontiguous parts. Each part is an extent. When new clusters are added to a file, the file manager tries to make them physically contiguous to the previous end of the file, but if space is unavailable for this, it must add one or more extents (Fig. 3.6b).

FIGURE 3.6 File extents (shaded area represents space on disk used by a single file).

The most important thing to understand about extents is that as the number of extents in a file increases, the file becomes more spread out on the disk, and the amount of seeking required to process the file increases.

Fragmentation   Generally, all sectors on a given drive must contain the same number of bytes. If, for example, the size of a sector is 512 bytes and the size of all records in a file is 300 bytes, there is no convenient fit between records and sectors. There are two ways to deal with this situation: Store only one record per sector, or allow records to span sectors, so the beginning of a record might be found in one sector and the end of it in another (Fig. 3.7).

The first option has the advantage that any record can be retrieved by retrieving just one sector, but it has the disadvantage that it might leave an enormous amount of unused space within each sector. This loss of space within a sector is called internal fragmentation. The second option has the advantage that it loses no space from internal fragmentation, but it has the disadvantage that some records may be retrieved only by accessing two sectors.

Another potential source of internal fragmentation results from the use of clusters. Recall that a cluster is the smallest unit of space that can be allocated for a file. When the number of bytes in a file is not an exact multiple of the cluster size, there will be internal fragmentation in the last extent of the file. For instance, if a cluster consists of three 512-byte sectors, a file containing one byte would use up 1,536 bytes on the disk; 1,535 bytes would be wasted due to internal fragmentation.

Clearly, there are important trade-offs in the use of large cluster sizes. A disk that is expected to have mainly large files that will often be processed sequentially would usually be given a large cluster size, since internal fragmentation would not be a big problem and the performance gains might be great. A disk holding smaller files or files that are usually accessed only randomly would normally be set up with small clusters.

FIGURE 3.7 Alternate record organization within sectors (shaded areas represent data records, and unshaded areas represent unused space).
3.1.4 Organizing Tracks by Block
FIGURE 3.8 Sector organization versus block organization.

Sometimes disk tracks are not divided into sectors, but into integral numbers of user-defined blocks whose size can vary. (Note: The word block has a different meaning in the context of the UNIX I/O system. See section
3.7 for details.) When the data on a track is organized by block, this usually
means that the amount of data transferred in a single I/O operation can vary
depending on the needs of the software designer, not the hardware. Blocks
can normally be either fixed or variable in length, depending on the
requirements of the file designer. As with sectors, blocks are often referred
to as physical records. (Sometimes the word block is used as a synonym for
a sector or group of sectors. To avoid confusion, we do not use it in that
way here.) Figure 3.8 illustrates the difference between one view of data on
a sectored track and that of a blocked track.
A block organization does not present the sector-spanning and fragmentation problems of sectors because blocks can vary in size to fit the logical
organization of the data. A block is usually organized to hold an integral
number of logical records. The term blocking factor is used to indicate the
number of records that are to be stored in each block in a file. Hence, if we
had a file with 300-byte records, a block-addressing scheme would let us
define a block to be some convenient multiple of 300 bytes, depending on
the needs of the program. No space would be lost to internal fragmentation,
and there would be no need to load two blocks to retrieve one record.
Generally speaking, blocks are superior to sectors when it is desirable to have the physical allocation of space for records correspond to their logical organization. (There are disk drives that allow both sector-addressing and block-addressing, but we do not describe them here. See Bohl, 1981.)

In block-addressing schemes, each block of data is usually accompanied by one or more subblocks containing extra information about the data block. Typically there is a count subblock that contains (among other things) the number of bytes in the accompanying data block (Fig. 3.9a). There may also be a key subblock containing the key
for the last record in the data block
(Fig. 3.9b). When key subblocks are used, a track can be searched by the disk controller for a block or record identified by a given key. This means that a program can ask its disk drive to search among all the blocks on a track for a block with a desired key. This approach can result in much more efficient searches than are normally possible with sector-addressable schemes, in which keys cannot generally be interpreted without first loading them into primary memory.
3.1.5 Nondata Overhead
Both blocks and sectors require that a certain amount of space be taken up
on the disk in the form of nondata overhead. Some of the overhead consists
of information that is stored on the disk during preformatting, which is done
before the disk can be used.
On sector-addressable disks, preformatting involves storing, at the
beginning of each sector, such information as sector address, track address,
and condition (whether the sector is usable or defective). Preformatting also
involves placing gaps and synchronization marks between fields of
information to help the read/write mechanism distinguish between them.
This nondata overhead usually is of no concern to the programmer. When
the sector size is given for a certain drive, the programmer can assume that
this is the amount of actual data that can be stored in a sector.
On a block-organized disk, some of the nondata overhead is invisible to
the programmer, but some of it must be accounted for by the programmer.
Since subblocks and interblock gaps have to be provided with every block,
FIGURE 3.9 Block addressing requires that each physical data block be accompanied by one or more subblocks containing information about its contents.
there is generally more nondata information provided with blocks than with sectors. Also, since the number and sizes of blocks can vary from one application to another, the relative amount of space taken up by overhead can vary when block addressing is used. This is illustrated in the following example.
Suppose we have a block-addressable disk drive with 20,000 bytes per track, and the amount of space taken up by subblocks and interblock gaps is equivalent to 300 bytes per block. We want to store a file containing 100-byte records on the disk. How many records can be stored per track if the blocking factor is 10, or if it is 60?

1. If there are 10 100-byte records per block, each block holds 1,000 bytes of data and uses 300 + 1,000, or 1,300, bytes of track space when overhead is taken into account. The number of blocks that can fit on a 20,000-byte track can be expressed as

      floor(20,000 / 1,300) = floor(15.38) = 15

   So 15 blocks, or 150 records, can be stored per track. (Note that we have to take the floor of the result because a block cannot span two tracks.)

2. If there are 60 100-byte records per block, each block holds 6,000 bytes of data and uses 6,300 bytes of track space. The number of blocks per track can be expressed as

      floor(20,000 / 6,300) = 3

   So 3 blocks, or 180 records, can be stored per track.

Clearly, the larger blocking factor can lead to more efficient use of storage. When blocks are larger, fewer blocks are required to hold a file, so there is less space consumed by the 300 bytes of overhead that accompany each block.

Can we conclude from this example that larger blocking factors always lead to more efficient storage utilization? Not necessarily. Since we can put only an integral number of blocks on a track, and since tracks are fixed in length, we almost always lose some space at the end of a track. Here we have the internal fragmentation problem again, but this time it applies to fragmentation within a track. The greater the block size, the greater the potential amount of internal track fragmentation. What would have happened if we had chosen a blocking factor of 98 in the preceding example? What about 97?
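Questions like these are easy to explore with a few lines of code. The following hedged C sketch (our own) computes, for any blocking factor, how many records fit on a track of the drive just described; the reader can use it to experiment with other blocking factors.

   #include <stdio.h>

   /* Records per track for a given blocking factor, using the figures in the
      example: 20,000-byte tracks, 100-byte records, and 300 bytes of subblock
      and interblock-gap overhead per block.                                  */
   long records_per_track(long blocking_factor)
   {
       long track_size  = 20000L;
       long record_size = 100L;
       long overhead    = 300L;

       long block_size       = blocking_factor * record_size + overhead;
       long blocks_per_track = track_size / block_size;   /* integer division takes the floor */

       return blocks_per_track * blocking_factor;
   }

   int main(void)
   {
       printf("%ld\n", records_per_track(10L));   /* prints 150 */
       printf("%ld\n", records_per_track(60L));   /* prints 180 */
       return 0;
   }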
The flexibility introduced by the use of blocks, rather than sectors, can result in savings in time and efficiency, since it lets the programmer determine to a large extent how data are to be organized physically on a disk. On the negative side, blocking schemes require the programmer and/or operating system to do the extra work of determining the data organization. Also, the very flexibility introduced by the use of blocking schemes precludes the synchronization of I/O operations with the physical movement of the disk, which sectoring permits. This means that strategies such as sector interleaving cannot be used to improve performance.
3.1.6 The Cost of a Disk Access

To give you a feel for the factors contributing to the total amount of time needed to access a file on a fixed disk, we calculate some access times. A disk access can be divided into three distinct physical operations, each with its own cost: seek time, rotational delay, and transfer time.
Seek Time   Seek time is the time required to move the access arm to the correct cylinder. The amount of time spent seeking during a disk access depends, of course, on how far the arm has to move. If we are accessing a file sequentially and the file is packed into several consecutive cylinders, seeking needs to be done only after all of the tracks on a cylinder have been processed, and even then the read/write head needs to move the width of only one track. At the other extreme, if we are alternately accessing sectors from two files that are stored at opposite extremes on a disk (one at the innermost cylinder, one at the outermost cylinder), seeking is very expensive.

Seeking is likely to be more costly in a multiuser environment, where several processes are contending for use of the disk at one time, than in a single-user environment, where disk usage is dedicated to one process.

Since seeking can be very costly, system designers often go to great extremes to minimize seeking. In an application that merges three files, for example, it is not unusual to see the three input files stored on three different drives and the output file stored on a fourth drive, so no seeking need be done as I/O operations jump from file to file.

Since it is usually impossible to know exactly how many tracks will be traversed in every seek, we usually try to determine the average seek time required for a particular file operation. If the starting and ending positions for each access are random, it turns out that the average seek traverses one third of the total number of cylinders that the read/write head ranges over.† Manufacturers' specifications for disk drives often list this figure as the average seek time for the drives. Most hard disks available today (1991) have average seek times of less than 40 milliseconds (msec), and high-performance disks have average seek times as low as 10 msec.

†Derivations of this result, as well as more detailed and refined models, can be found in Wiederhold (1983), Knuth (1973b), Teory and Fry (1982), and Salzberg (1988).

FIGURE 3.10 When a single file can span several tracks on a cylinder, we can stagger the beginnings of the tracks to avoid rotational delay when moving from track to track during sequential access.
Rotational Delay  Rotational delay refers to the time it takes for the disk to rotate so the sector we want is under the read/write head. Hard disks usually rotate at about 3,600 rpm, which is one revolution per 16.7 msec. On average, the rotational delay is half a revolution, or about 8.3 msec. On floppy disks, which often rotate at only 360 rpm, average rotational delay is a sluggish 83.3 msec.

As in the case of seeking, these averages apply only when the read/write head moves from some random place on the disk surface to the target track. In many circumstances, rotational delay can be much less than the average. For example, suppose that you have a file that requires two or more tracks, that there are plenty of available tracks on one cylinder, and that you write the file to disk sequentially, with one write call. When the first track is filled, the disk can immediately begin writing to the second track, without any rotational delay. The "beginning" of the second track is effectively staggered by just the amount of time it takes to switch from the read/write head on the first track to the read/write head on the second. Rotational delay, as it were, is virtually nonexistent. Furthermore, when you read the file back, the position of data on the second track ensures that there is no rotational delay in switching from one track to another. Figure 3.10 illustrates this staggered arrangement.
Transfer Time  Once the data we want is under the read/write head, it can be transferred. The transfer time is given by the formula

Transfer time = (number of bytes transferred / number of bytes on a track) × rotation time.

If a drive is sectored, the transfer time for one sector depends on the number of sectors on a track. For example, if there are 32 sectors per track, the time required to transfer one sector would be 1/32nd of a revolution, or 0.5 msec.
Some Timing Computations  Let's look at two different file processing situations that show how different types of file access can affect access times. We will compare the time it takes to access a file in sequence with the time it takes to access all of the records in the file randomly. In the former case, we use as much of the file as we can whenever we access it. In the random-access case, we are able to use only one record on each access.

The basis for our calculations is the "typical" 300-megabyte fixed disk described in Table 3.1. This particular disk is typical of one that might be used with a workstation in 1991. Although it is typical only of a certain class of fixed disk, the observations we draw as we perform these calculations are quite general. The disks used with larger, more expensive computers are bigger and faster than this disk, but the nature and relative costs of the factors contributing to total access times are essentially the same.

TABLE 3.1 Specifications of disk drive used in examples in text

Minimum (track-to-track) seek time     6 msec
Average seek time                     18 msec
Rotational delay                       8.3 msec
Maximum transfer rate                 16.7 msec/track, or 1,229 bytes/msec
Bytes per sector                     512
Sectors per track                     40
Tracks per cylinder                   11
Tracks per surface                 1,331
Interleave factor                      1
Cluster size                           8 sectors
Smallest extent size                   5 clusters
Since our drive uses a cluster size of 8 sectors (4,096 bytes) and the smallest extent is 5 clusters, space is allocated for storing files in one-track units. Sectors are interleaved with an interleave factor of 1, so data on a given track can be transferred at the stated transfer rate.
Let's suppose that we wish to know how long it will take, using this
drive, to read a 2,048-K-byte file that is divided into 8,000 256-byte records.
First we need to know how the file is distributed on the disk. Since the
4,096-byte cluster holds 16 records, the file will be stored as a sequence of
500 4,096-byte clusters. Since the smallest extent size is 5 clusters, the 500
clusters are stored as 100 extents, occupying 100 tracks.
This means that the disk needs 100 tracks to hold the entire 2,048 K
bytes that we want to read. We assume a situation in which the 100 tracks
are randomly dispersed over the surface of the disk. (This is an extreme
situation chosen to dramatize the point we want to make. Still, it is not so
extreme that it could not easily occur on a typical overloaded disk that has
a large number of small files.)
Now we are ready to calculate the time it would take to read the
2,048-K-byte file from the disk. We first estimate the time it takes to read
the file sector by sector in sequence. This process involves the following operations for each track:
Average seek          18   msec
Rotational delay       8.3 msec
Read one track        16.7 msec
Total                 43   msec

We want to find and read 100 tracks, so the total time is

Total time = 100 × 43 msec = 4,300 msec = 4.3 seconds.
Now let's calculate the time it would take to read in the same 8,000 records using random access rather than sequential access. In other words, rather than being able to read one sector right after another, we assume that we have to access the records in some order that requires jumping from track to track every time we read a new sector. This process involves the following operations for each record:

Average seek                         18   msec
Rotational delay                      8.3 msec
Read one cluster (1/5 × 16.7 msec)    3.3 msec
Total                                29.6 msec

Total time = 8,000 × 29.6 msec = 236,800 msec = 236.8 seconds.
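The following short C program is a sketch of ours (not from the text) that reproduces these estimates directly from the Table 3.1 specifications. It reports 4.3 seconds for the sequential case and about 237 seconds for the random case; the small difference from 236.8 seconds comes only from rounding the per-record time to 29.6 msec in the hand calculation.

/* Sketch: sequential vs. random access times for the 2,048-K-byte file,
 * using the drive specifications in Table 3.1. Our own illustration. */
#include <stdio.h>

int main(void)
{
    const double avg_seek_ms      = 18.0;   /* average seek time           */
    const double rot_delay_ms     = 8.3;    /* average rotational delay    */
    const double track_read_ms    = 16.7;   /* time to read one full track */
    const int sectors_per_track   = 40;
    const int bytes_per_sector    = 512;
    const int sectors_per_cluster = 8;

    const int record_count = 8000;          /* the file: 8,000 records...  */
    const int record_size  = 256;           /* ...of 256 bytes each        */

    int bytes_per_cluster   = sectors_per_cluster * bytes_per_sector;   /* 4,096 */
    int records_per_cluster = bytes_per_cluster / record_size;          /* 16    */
    int cluster_count       = record_count / records_per_cluster;       /* 500   */
    int clusters_per_track  = sectors_per_track / sectors_per_cluster;  /* 5     */
    int track_count         = cluster_count / clusters_per_track;       /* 100   */

    /* Sequential: one seek, one rotational delay, one track read per track. */
    double seq_ms = track_count * (avg_seek_ms + rot_delay_ms + track_read_ms);

    /* Random: one seek, one rotational delay, one cluster read per record. */
    double cluster_read_ms = track_read_ms / clusters_per_track;
    double rnd_ms = record_count * (avg_seek_ms + rot_delay_ms + cluster_read_ms);

    printf("tracks needed:   %d\n", track_count);
    printf("sequential read: %.1f seconds\n", seq_ms / 1000.0);
    printf("random read:     %.1f seconds\n", rnd_ms / 1000.0);
    return 0;
}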
This difference in performance between sequential access and random access is very important. If we can get to the right location on the disk and read a lot of information sequentially, we are clearly much better off than we are if we have to jump around, seeking every time we need a new record. Remember that seek time is very expensive; when we are performing disk operations we should try to minimize seeking.
3.1.7 Effect
In deciding
of
how
Block Size on Performance: A UNIX Example
best to organize disk storage allocation for several versions
of BSD UNIX, the Computer Systems Research Group (CSRG) in
Berkeley investigated the trade-offs between block size and performance in
a UNIX environment (Leffler et al., 1989). The results of their research
provide an interesting case study involving trade-offs between block size,
fragmentation, and access time.
The CSRG research indicated that the minimum block size of 512 bytes, standard at the time on UNIX systems, was not very efficient in a typical UNIX
environment. Files that were several blocks long often were
scattered over many cylinders, resulting in frequent seeks and thereby
significantly decreasing throughput. The researchers found that doubling
the block size to 1,024 bytes improved performance by more than a factor
of 2. But even with 1,024-byte blocks, they found that throughput was only
about 4% of the theoretical maximum. Eventually, they found that
4,096-byte blocks provided the fastest throughput, but this led to large
amounts of wasted space due to internal fragmentation. These results are
summarized in Table 3.2.
TABLE 3.2 The amount of wasted space as a function of block size

Space Used (Mbyte)   Percent Waste   Organization
775.2                  0.0           Data only, no separation between files
807.8                  4.2           Data only, each file starts on 512-byte boundary
828.7                  6.9           Data + inodes, 512-byte block UNIX file system
866.5                 11.8           Data + inodes, 1,024-byte block UNIX file system
948.5                 22.4           Data + inodes, 2,048-byte block UNIX file system
1,128.3               45.6           Data + inodes, 4,096-byte block UNIX file system

From The Design and Implementation of the 4.3BSD UNIX Operating System, Leffler et al., p. 198.
To gain the advantages of both the 4,096-byte and the 512-byte systems, the Berkeley group implemented a variation of the cluster concept (see section 3.1.3). In the new implementation, they allocate 4,096-byte blocks for files that are big enough to need them; but for smaller files, they allow the large blocks to be divided into one or more fragments. With a fragment size of 512 bytes, as many as eight small files can be stored in one block, greatly reducing internal fragmentation. With the 4,096/512 system, wasted space was found to decline to about 12%.
3.1.8 Disk as Bottleneck
Disk performance is increasing steadily, even dramatically, but disk speeds still lag far behind local network speeds. A high-performance disk drive with 50 K bytes per track can transmit at a peak rate of about 3 megabytes per second, and only a fraction of that under normal conditions. High-performance networks, in contrast, can transmit at rates of as much as 100 megabytes per second. The result can often mean that a process is disk bound: the network and the CPU have to wait inordinate lengths of time for the disk to transmit data.
A number of techniques are used to solve this problem. One is multiprogramming, in which the CPU works on other jobs while waiting for the data to arrive. But if multiprogramming is not available, or if the process simply cannot afford to lose so much time waiting for the disk, ways must be found to speed up disk I/O.

One technique that is now offered on many high-performance systems is called striping. Disk striping involves splitting the parts of a file on several different drives, then letting the separate drives deliver parts of the file to the network simultaneously.
For example, suppose we have a 10-megabyte file spread across 20
high-performance (3 megabytes per second) drives that hold 50 K per track.
The first drive has the first 50 K of the file, the second drive has the second
50 K, etc., through the twentieth drive. The first drive also holds the
twenty-first 50 K, and so forth until 10 megabytes are stored. Collectively,
the 20 drives can deliver to the network 1,000 K per revolution, a combined rate of 60 megabytes per second.
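The arithmetic behind these figures can be written out in a few lines of C; this is our own sketch, using the numbers from the example above (20 drives, 50 K bytes per track, 3,600 rpm).

/* Sketch: aggregate throughput of the striping example. */
#include <stdio.h>

int main(void)
{
    const int drives = 20;
    const double kbytes_per_track = 50.0;   /* each drive reads one track per revolution */
    const double rpm = 3600.0;

    double revs_per_sec     = rpm / 60.0;                           /* 60          */
    double per_drive_kb_sec = kbytes_per_track * revs_per_sec;      /* 3,000 K/sec */
    double combined_mb_sec  = drives * per_drive_kb_sec / 1000.0;   /* 60 MB/sec   */

    printf("per drive: %.1f megabytes/sec\n", per_drive_kb_sec / 1000.0);
    printf("combined:  %.1f megabytes/sec\n", combined_mb_sec);
    return 0;
}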
Disk striping exemplifies an important concept that we see more and more in system configurations: parallelism. Whenever there is a bottleneck at some point in the system, consider duplicating the thing that is the source of the bottleneck, and configure the system so several of them operate in parallel.
Another approach to solving the disk bottleneck is to avoid accessing the disk at all. As the cost of RAM steadily decreases, more and more users are using RAM to hold data that a few years ago had to be kept on a disk. Two effective ways in which RAM can be used to replace secondary storage are RAM disks and disk caches.

A RAM disk is a large part of RAM configured to simulate the behavior of a mechanical disk in every respect except speed and volatility. Since data can be located in RAM without a seek or rotational delay, RAM disks can provide much faster access than mechanical disks. Since RAM is normally volatile, the contents of a RAM disk are lost when the computer is turned off. RAM disks are often used in place of floppy disks because they are much faster than floppies and because relatively little RAM is needed to simulate a typical floppy disk.
A disk cache† is a large block of RAM configured to contain pages of data from a disk. A typical disk-caching scheme might use a 256-K cache with a disk. When data is requested from secondary memory, the file manager first looks into the disk cache to see if it contains the page with the requested data. If it does, the data can be processed immediately. Otherwise, the file manager reads the page containing the data from disk, replacing some page already in the disk cache.
Cache memory can provide substantial improvements in performance, especially when a program's data access patterns exhibit a high degree of locality. Locality exists in a file when blocks that are accessed in close temporal sequence are stored close to one another on the disk. When a disk cache is used, blocks that are close to one another on the disk are much more likely to belong to the page or pages that are read in with a single read, diminishing the likelihood that extra reads are needed for extra accesses.
RAM disks and cache memory are examples of buffering, a very important and frequently used family of I/O techniques. We take a closer look at buffering in section 3.6.
we
examples of the need to
and disk caches, there
make trade-offs in file
is tension between the cost/capacity advantages of disk over RAM, on the
on the other. Striping provides
one hand, and the speed of
opportunities to increase throughput enormously, but at the cost of a more
In these three techniques
see once again
processing. With
RAM disks
RAM
complex and sophisticated disk management system. Good
file
design
balances these tensions and costs creatively.
†The term cache (as opposed to disk cache) generally refers to a very high-speed block of RAM that performs the same types of performance-enhancing operations with respect to primary memory that a disk cache does with respect to secondary memory.
3.2 Magnetic Tape

Magnetic tape units belong to a class of devices that provide no direct accessing facility but that can provide very rapid sequential access to data. Tapes are compact, stand up well under different environmental conditions, are easy to store and transport, and are less expensive than disks.
3.2.1 Organization of Data on Tapes

Since tapes are accessed sequentially, there is no need for addresses to identify the locations of data on a tape. On a tape, the logical position of a byte within a file corresponds directly to its physical position relative to the start of the file. We may envision the surface of a typical tape as a set of parallel tracks, each of which is a sequence of bits. If there are nine tracks (see Fig. 3.11), the nine bits that are at corresponding positions in the nine respective tracks are taken to constitute one byte, plus a parity bit. So a byte can be thought of as a one-bit-wide slice of tape. Such a slice is called a frame.

The parity bit is not part of the data but is used to check the validity of the data. If odd parity is in effect, this bit is set to make the number of 1 bits in the frame odd. Even parity works similarly but is rarely used with tapes.

Frames (bytes) are grouped into data blocks whose size can vary from a few bytes to many kilobytes, depending on the needs of the user. Since tapes are often read one block at a time, and since tapes cannot stop or start instantaneously, blocks are separated by interblock gaps, which contain no information and are long enough to permit stopping and starting. When tapes use odd parity, no valid frame can contain all 0 bits, so a large number of consecutive 0 frames is used to fill the interrecord gap.

FIGURE 3.11 Nine-track tape.
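The parity computation itself is easy to express in code. The following C function is our own illustration (not from the text) of how the odd-parity bit for one frame might be produced.

/* Sketch: compute the odd-parity bit for one tape frame (one data byte). */
#include <stdio.h>

/* Return the parity bit that makes the total number of 1 bits in the
 * frame (eight data bits plus the parity bit) odd. */
int odd_parity_bit(unsigned char byte)
{
    int ones = 0;
    for (int bit = 0; bit < 8; bit++)
        if (byte & (1u << bit))
            ones++;
    return (ones % 2 == 0) ? 1 : 0;   /* add a 1 bit only if the count is even */
}

int main(void)
{
    unsigned char c = 'P';   /* 0101 0000: two 1 bits, so the parity bit is 1 */
    printf("parity bit for 'P' = %d\n", odd_parity_bit(c));
    return 0;
}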
Tape drives come in many shapes, sizes, and speeds. Performance differences among drives can usually be measured in terms of three quantities:

Tape density: commonly 800, 1,600, or 6,250 bits per inch (bpi) per track, but recently as much as 30,000 bpi;
Tape speed: commonly 30 to 200 inches per second (ips); and
Size of interblock gap: commonly between 0.3 inch and 0.75 inch.

Note that a 6,250-bpi nine-track tape contains 6,250 bits per inch per track, and 6,250 bytes per inch when the full nine tracks are taken together. Thus, in the computations that follow, 6,250 bpi is usually taken to mean 6,250 bytes of data per inch.
3.2.2 Estimating Tape Length Requirements

Suppose we want to store a backup copy of a large mailing list file with one million 100-byte records. If we want to store the file on a 6,250-bpi tape that has an interblock gap of 0.3 inches, how much tape is needed?

To answer this question we first need to determine what takes up space on the tape. There are two primary contributors: interblock gaps and data blocks. For every data block there is an interblock gap. If we let

b = the physical length of a data block,
g = the length of an interblock gap, and
n = the number of data blocks,

then the space requirement s for storing the file is

s = n × (b + g).

We know that g is 0.3 inch, but we do not know what b and n are. In fact, b is whatever we want it to be, and n depends on our choice of b. Suppose we choose each data block to contain one 100-byte record. Then b, the length of each block, is given by

b = block size (bytes per block) / tape density (bytes per inch) = 100 / 6,250 = 0.016 inch,

and n, the number of blocks, is one million (one per record).

The number of records stored in a physical block is called the blocking factor. It has the same meaning that it had when it was applied to the use of blocks for disk storage. The blocking factor we have chosen here is 1 because each block has only one record. Hence, the space requirement for the file is

s = 1,000,000 × (0.016 + 0.3) inch
  = 1,000,000 × 0.316 inch
  = 316,000 inches
  = 26,333 feet.
Magnetic tapes range in length from 300 feet to 3,600 feet, with 2,400 feet being the most common length. Clearly, we need quite a few 2,400-foot tapes to store the file. Or do we? You may have noticed that our choice of block size was not a very smart one from the standpoint of space utilization.
The interblock gaps in the physical representation of the file take up about 19 times as much space as the data blocks do. If we were to take a snapshot of our tape, it would look something like this:

Data  Gap  Data  Gap  Data  Gap  Data

Most of the space on the tape is not used! Clearly, we should consider increasing the relative amount of space used for actual data if we want to try to squeeze the file onto one 2,400-foot tape. If we increase the blocking factor, we can decrease the number of blocks, which decreases the number of interblock gaps, which in turn decreases the amount of space consumed by interblock gaps. For example, if we increase the blocking factor from 1 to 50, the number of blocks becomes

1,000,000 / 50 = 20,000,

and the space requirement for interblock gaps decreases from 300,000 inches to 6,000 inches. The space requirement for the data is of course the same as it was previously. What has changed is the relative amount of space occupied by the gaps, as compared to the data. Now a snapshot of the tape would look much different:

Data   Gap   Data   Gap   Data   Gap   Data   Gap   Data
We leave it to you to show that the file can easily fit on one 2,400-foot tape when a blocking factor of 50 is used.

When we compute the space requirements for our file, we produce numbers that are quite specific to our file. A more general measure of the effect of choosing different block sizes is effective recording density. The effective recording density is supposed to reflect the amount of actual data that can be stored per inch of tape. Since this depends exclusively on the relative sizes of the interblock gap and the data block, it can be defined as

effective recording density = number of bytes per block / number of inches required to store a block.

When a blocking factor of 1 is used in our example, the number of bytes per block is 100, and the number of inches required to store a block is 0.316. Hence, the effective recording density is

100 bytes / 0.316 inches = 316.4 bpi,

which is a far cry from the nominal recording density of 6,250 bpi. Either way you look at it, space utilization is sensitive to the relative sizes of data blocks and interblock gaps. Let us now see how they affect the amount of time it takes to transmit tape data.
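The calculations of the last two paragraphs are easy to parameterize by blocking factor. The following C sketch is our own (the drive figures and the one-million-record file come from the example above); running it for blocking factors 1 and 50 reproduces the numbers in the text and confirms that the file fits comfortably on a single 2,400-foot tape when a blocking factor of 50 is used.

/* Sketch: tape space and effective recording density as a function of
 * the blocking factor, for a 6,250-bpi tape with 0.3-inch gaps. */
#include <stdio.h>

static void tape_figures(long records, double record_bytes,
                         double bpi, double gap_inches, long blocking_factor)
{
    double block_bytes   = record_bytes * blocking_factor;
    double block_inches  = block_bytes / bpi;                       /* b          */
    long   blocks        = records / blocking_factor;               /* n          */
    double space_inches  = blocks * (block_inches + gap_inches);    /* s = n(b+g) */
    double effective_bpi = block_bytes / (block_inches + gap_inches);

    printf("blocking factor %4ld: %8.0f inches (%6.0f feet), %7.1f bpi effective\n",
           blocking_factor, space_inches, space_inches / 12.0, effective_bpi);
}

int main(void)
{
    tape_figures(1000000L, 100.0, 6250.0, 0.3, 1);    /* about 316,000 inches */
    tape_figures(1000000L, 100.0, 6250.0, 0.3, 50);   /* about  22,000 inches */
    return 0;
}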
3.2.3 Estimating Data Transmission Times

If you understand the role of interblock gaps and data block sizes in determining effective recording density, you can probably see immediately that these two factors also affect the rate of data transmission. Two other factors that affect the rate of data transmission to or from tape are the nominal recording density and the speed with which the tape passes the read/write head. If we know these two values, we can compute the nominal data transmission rate:

Nominal rate = tape density (bpi) × tape speed (ips).

Hence, our 6,250-bpi, 200-ips tape has a nominal transmission rate of

6,250 × 200 = 1,250,000 bytes/sec = 1,250 kilobytes/sec.
This rate is competitive with most disk drives.

But what about those interblock gaps? Once our data gets dispersed by interblock gaps, the effective transmission rate certainly suffers. Suppose, for example, that we use our blocking factor of 1 with the same file and tape discussed in the preceding section (1,000,000 100-byte records, 0.3-inch gap).
We saw that the effective recording density for this tape organization is 316.4 bpi. If the tape drive is moving at a rate of 200 ips, then its effective transmission rate is

316.4 × 200 = 63,280 bytes/sec = 63.3 kilobytes/sec,

a rate that is about one twentieth of the nominal rate! It should be clear that a blocking factor larger than 1 improves on this result, and that a substantially larger blocking factor improves on it substantially.
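The same comparison can be made by a short C program; this sketch is ours, using the figures above (6,250 bpi, 200 ips, blocking factor 1).

/* Sketch: nominal vs. effective transmission rate for the tape example. */
#include <stdio.h>

int main(void)
{
    double bpi = 6250.0, ips = 200.0;
    double nominal = bpi * ips;                      /* 1,250,000 bytes/sec    */

    double block_bytes = 100.0;                      /* blocking factor 1      */
    double gap_inches  = 0.3;
    double effective_density = block_bytes / (block_bytes / bpi + gap_inches);
    double effective = effective_density * ips;      /* about 63,280 bytes/sec */

    printf("nominal:   %.0f bytes/sec\n", nominal);
    printf("effective: %.0f bytes/sec (%.1f%% of nominal)\n",
           effective, 100.0 * effective / nominal);
    return 0;
}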
Although there are other factors that can influence performance, block size is generally considered to be the one variable with the greatest influence on space utilization and data transmission rate. The other factors we have included (gap size, tape speed, and recording density) are often beyond the control of the user. Another factor that can sometimes be important is the time it takes to start and stop the tape. We consider start/stop time in the exercises at the end of this chapter.
3.2.4 Tape Applications

Magnetic tape is an appropriate medium for sequential processing applications if the files being processed are not likely also to be used in applications that require direct access. For example, consider the problem of updating a mailing list for a monthly periodical. Is it essential that the list be kept absolutely current, or is a monthly update of the list sufficient?

If information must be up-to-the-minute, then the medium must permit direct access so individual updates can be made immediately. But if the mailing list needs to be current only when mailing labels are printed, all of the changes that occur during the course of a month can be collected in one batch and put into a transaction file that is sorted in the same way that the mailing list is sorted. Then a program that reads through the two files simultaneously can be executed, making all the required changes in one pass through the data.
Since tape is relatively inexpensive, it is an excellent medium for storing data offline. At current prices, a removable disk pack that holds 150 megabytes costs about 30 times as much as a reel of tape that, properly blocked, can hold the same amount. Tape is a good medium for archival storage and for transporting data, as long as the data does not have to be available on short notice for direct processing.
A special kind of tape drive, a streaming tape drive, is used widely for nonstop, high-speed dumping of data to and from disks. Generally less expensive than general-purpose tape drives, it is also less suited for processing that involves much starting and stopping.

3.3 Disk versus Tape
In the past, magnetic tape and magnetic disk accounted for the lion's share of all secondary storage applications. Disk was excellent for random access and storage of files for which immediate access was desired; tape was ideal for processing data sequentially and for long-term storage of files. Over time, these roles have changed somewhat in favor of disk. The major reason that tape was preferable to disk for sequential processing is that tapes are dedicated to one process, while disk generally serves several processes. This means that between accesses a disk read/write head tends to move away from the location where the next sequential access will occur, resulting in an expensive seek; while the tape drive, being dedicated to one process, pays no such price in seek time.
This problem of excessive seeking has gradually diminished, and disk has taken over much of the secondary storage niche previously occupied by tape. This change is largely due to the continued dramatic decreases in the cost of disk and RAM storage. To fully understand this change, we need to understand the role of RAM buffer space in performing I/O.† Briefly, it is that performance depends largely on how big a chunk of a file we can transmit at any time; as more RAM space becomes available for I/O buffers, the number of accesses decreases correspondingly, which means that the number of seeks required goes down as well. Most systems now available, even small systems, have enough RAM available to decrease the number of accesses required to process most files to a level that makes disk quite competitive with tape for sequential processing. This change, added to the superior versatility and decreasing costs of disks, has resulted in use of disk for most sequential processing, which in the past was primarily the domain of tape.

This is not to say that tapes should not be used for sequential processing. If a file is kept on tape, and there are enough drives available to use them for sequential processing, it may be more efficient to process the file directly from tape than to stream it to disk and then process it sequentially.

Although it has lost ground to disk in sequential processing applications, tape remains important as a medium for long-term archival storage. Tape is still far less expensive than magnetic disk, and it is very easy and fast to stream large files or sets of files between tape and disk. In this context, tape has emerged as one of our most important media (along with CD-ROM) for tertiary storage.

†Techniques for RAM buffering are covered in section 3.6.
3.4 Storage as a Hierarchy
Although the best mixture of devices for a computing system depends on the needs of the system's users, we can imagine any computing system as a hierarchy of storage devices of different speed, capacity, and cost. Figure 3.12 summarizes the different types of storage found at different levels in such hierarchies and shows approximately how they compare in terms of access time, capacity, and cost.
FIGURE 3.12 Approximate comparisons of types of storage, circa 1991. The figure compares primary memory (registers, core and RAM, RAM disk and disk cache), secondary storage (direct-access magnetic disks; serial devices such as tape and mass storage), and offline archival and backup media (removable magnetic disks, optical discs, and tapes) in terms of access time (seconds), capacity (bytes), and cost (cents per bit).
FIGURE 3.13 The WRITE( ) statement tells the operating system to send one character to disk and gives the operating system the location of the character. The operating system takes over the job of doing the actual writing and then returns control to the calling program.
3.5 A Journey of a Byte
What happens when a program writes a byte to a file on a disk? We know what the program does (it says WRITE(...)), and we now know something about how the byte is stored on a disk, but we haven't looked at what happens between the program and the disk. The whole story of what happens to data between program and disk is not one we can tell here, but we can give you an idea of the many different pieces of hardware and software involved and the many jobs that have to be done by looking at one example of a journey of one byte.

Suppose we want to append a byte representing the character 'P' stored in a character variable c to a file named TEXT stored somewhere on a disk. From the program's point of view, the entire journey that the byte will take might be represented by the statement

WRITE(TEXT, c, 1)

but the journey is much longer than this simple statement suggests.

The WRITE( ) statement results in a call to the computer's operating system, which has the task of seeing that the rest of the journey is completed successfully (Fig. 3.13). Often our program can provide the operating
system with information that helps it carry out this task more effectively,
but once the operating system has taken over, the job of overseeing the rest
of the journey is largely beyond our program's control.
3.5.1 The File Manager

An operating system is not a single program, but a collection of programs, each one designed to manage a different part of the computer's resources. Among these programs are ones that deal with file-related matters and I/O devices. We call this subset of programs the operating system's file manager.

The file manager may be thought of as several layers of procedures (Fig. 3.14), with the upper layers dealing mostly with symbolic, or logical, aspects of file management, and the lower layers dealing more with the physical aspects. Each layer calls the one below it, until, at the lowest level, the byte is actually written to the disk.

The file manager begins by finding out whether the logical characteristics of the file are consistent with what we are asking it to do with the file. It may look up the requested file in a table, where it finds out such things as whether the file has been opened, what type of file the byte is being sent to (a binary file, a text file, some other organization), who the file's owner is, and whether WRITE( ) access is allowed for this particular user of the file.
The file manager must also determine where in the file TEXT the 'P' is to be deposited. Since the 'P' is to be appended to the file, the file manager needs to know where the end of the file is: the physical location of the last sector in the file. This information is obtained from the file allocation table (FAT) described earlier. From the FAT, the file manager locates the drive, cylinder, track, and sector where the byte is to be stored.
3.5.2 The I/O Buffer

Next, the file manager determines whether the sector that is to contain the 'P' is already in RAM or needs to be loaded into RAM. If the sector needs to be loaded, the file manager must find an available system I/O buffer space for it, then read the sector from the disk. Once it has the sector in a buffer in RAM, the file manager can deposit the 'P' into its proper position in the buffer (Fig. 3.15). The system I/O buffer allows the file manager to read and write data in sector-sized or block-sized units. In other words, it enables the file manager to ensure that the organization of data in RAM conforms to the organization it will have on the disk.

Instead of sending the sector immediately to the disk, the file manager usually waits to see if it can accumulate more bytes going to the same sector
Logical

1. The program asks the operating system to write the contents of the variable c to the next available position in TEXT.

2. The operating system passes the job on to the file manager.

3. The file manager looks up TEXT in a table containing information about it, such as whether the file is open and available for use, what types of access are allowed, if any, and what physical file the logical name TEXT corresponds to.

4. The file manager searches a file allocation table for the physical location of the sector that is to contain the byte.

5. The file manager makes sure that the last sector in the file has been stored in a system I/O buffer in RAM, then deposits the byte into its proper position in the buffer.

6. The file manager gives instructions to the I/O processor about where the byte is stored in RAM and where it needs to be sent on the disk.

7. The I/O processor finds a time when the drive is available to receive the data and puts the data in proper format for the disk. It may also buffer the data to send it out in chunks of the proper size for the disk.

8. The I/O processor sends the data to the disk controller.

9. The controller instructs the drive to move the read/write head to the proper track, waits for the desired sector to come under the read/write head, then sends the byte to the drive to be deposited, bit by bit, on the surface of the disk.

Physical

FIGURE 3.14 Layers of procedures involved in transmitting a byte from a program's data area to a file called TEXT on disk.
before actually transmitting anything. Even though the statement WRITE(TEXT,c,1) seems to imply that our character is being sent immediately to the disk, it may in fact be kept in RAM for some time before it is sent. (There are many situations in which the file manager cannot wait until a buffer is filled before transmitting it. For instance, if TEXT were closed, it would have to flush all output buffers holding data waiting to be written to TEXT so the data would not be lost.)
FIGURE 3.15 The file manager moves 'P' from the program's data area to a system output buffer, where it may join other bytes headed for the same place on the disk. If necessary, the file manager may have to load the corresponding sector from the disk into the system output buffer.
3.5.3 The Byte Leaves RAM: The I/O Processor and Disk Controller
So far, all of our byte's activities have occurred within the computer's
primary memory and have probably been carried out by the computer's
central processing unit (CPU). The byte has travelled along data paths that
are designed to be very fast and that are relatively expensive. Now it is time
for the byte to travel along a data path that is likely to be slower and
narrower than the one in primary memory. (A typical computer might have
an internal data-path width of four bytes, whereas the width of the path
leading to the disk might be only two bytes.)
Because of bottlenecks created by these differences in speed and
data-path widths, our byte and its companions might have to wait for an
external data path to become available. This also means that the CPU has extra time on its hands as it deals out information in small enough chunks and at slow enough speeds that the world outside can handle them. In fact, the differences between the internal and external speeds for transmitting
data are often so great that the
CPU
can transmit to several external devices
simultaneously.
The processes of disassembling and assembling groups of bytes for transmission to and from external devices are so specialized that it is unreasonable to ask an expensive, general-purpose CPU to spend its valuable time doing I/O when a simpler device could do the job as well, freeing the CPU to do the work that it is most suited for. Such a special-purpose device is called an I/O processor.

An I/O processor may be anything from a simple chip capable of taking a byte and, on cue, just passing it on; to a powerful, small computer capable of executing very sophisticated programs and communicating with many devices simultaneously.
The I/O processor takes its instructions from the operating system, but once it begins processing I/O, it runs independently, relieving the operating system (and the CPU) of the task of communicating with secondary storage devices. This allows I/O processes and internal computing to overlap.†

In a typical computer, the file manager might now tell the I/O processor that there is data in the buffer that is to be transmitted to the disk, how much
data there is, and where it is to go on the disk. This information
might come in the form of a little program that the operating system
constructs and the I/O processor executes (Fig. 3.16).
The job of actually controlling the operation of the disk is done by a
device called a disk controller. The I/O processor asks the disk controller if
the disk drive is available for writing. If there is much I/O processing, there
is a good chance that the drive will not be available and that our byte will
have to wait in its buffer until the drive becomes available.
What happens next often makes the time spent so far seem insignificant in comparison: The disk drive is instructed to move its read/write head to the track and sector on the drive where our byte and its companions are to be stored. For the first time, a device is being asked to do something mechanical! The read/write head must seek to the proper track (unless it is already there), and then wait until the disk has spun around so the desired sector is under the head. Once the track and sector are located, the I/O processor (or perhaps the controller) can send out bytes, one at a time, to the drive. Our byte waits until its turn comes, then travels, alone, to the drive, where it is probably stored in a little one-byte buffer while it waits to be deposited on the disk.
†On many systems the I/O processor can take data directly from RAM, without further involvement from the CPU. This process is called direct memory access (DMA). On other systems, the CPU must place the data in special I/O registers before the I/O processor can have access to it.
FIGURE 3.16 The file manager sends the I/O processor instructions in the form of an I/O processor program. The I/O processor gets the data from the system buffer, prepares it for storing on the disk, and then sends it to the disk controller, which deposits it on the surface of the disk.
Finally, as the disk spins under the read/write head, the eight bits of our byte are deposited, one at a time, on the surface of the disk (Fig. 3.16). There the 'P' remains, at the end of its journey, spinning about at a leisurely 50 to 100 miles per hour.
3.6 Buffer Management

Any user of files can benefit from some knowledge of what happens to data travelling between a program's data area and secondary storage. One aspect of this process that is particularly important is the use of buffers. Buffering involves working with large chunks of data in RAM so the number of accesses to secondary storage can be reduced. We concentrate on the operation of system I/O buffers, but be aware that the use of buffers within programs can also substantially affect performance.
3.6.1 Buffer Bottlenecks

We know that a file manager allocates I/O buffers that are big enough to hold incoming data, but we have said nothing so far about how many buffers are used. In fact, it is common for file managers to allocate several buffers for performing I/O.
To understand the need for several system buffers, consider what
happens if a program is performing both input and output on one character
at a time, and only one I/O buffer is available. When the program asks for
its first character, the I/O buffer is loaded with the sector containing the
character, and the character is transmitted to the program. If the program
then decides to output a character, the I/O buffer is filled with the sector
into which the output character needs to go, destroying its original
for
contents.
Then when
the next input character
is
needed, the buffer contents
have to be written to disk to make room for the (original) sector containing
the second input character, and so on.
Fortunately, there is a simple and generally effective solution to this
ridiculous state of affairs, and that is to use more than one system buffer.
For this reason, I/O systems almost always use at least two buffers
one
for input and one for output.
Even if a program transmits data in only one direction, the use of a
single system I/O buffer can slow it down considerably. We know, for
instance, that the operation of reading a sector from a disk is extremely slow
compared to the amount of time it takes to move data in RAM, so we can
guess that a program that reads many sectors from a file might have to
spend much of its time waiting for the I/O system to fill its buffer every
time a read operation is performed before it can begin processing. When this
happens, the program that is running is said to be I/O bound the CPU
spends much of its time just waiting for I/O to be performed. The solution
to this problem is to use more than one buffer and to have the I/O system
filling the
next sector or block of data while the
CPU
is
processing the
current one.
3.6.2 Buffering Strategies
Multiple Buffering Suppose that a program is only writing to a disk and
that it is I/O bound. The CPU wants to be filling a buffer at the same time that I/O is being performed. If two buffers are used and I/O-CPU
overlapping is permitted, the CPU can be filling one buffer while the
contents of the other are being transmitted to disk.
When both tasks are finished, the roles of the buffers can be exchanged. This technique of swapping the roles of two buffers after each output (or input) operation is called double buffering. Double buffering allows the operating system to be operating on one buffer while the other buffer is being loaded or emptied (Fig. 3.17).
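The role swap at the heart of double buffering is easy to see in code. The sketch below is entirely ours: the "asynchronous" output routines are synchronous stubs so that the program runs as written, but on a real system the write would proceed while the CPU fills the other buffer.

/* Sketch: double buffering. One buffer is filled while the other is written. */
#include <stdio.h>
#include <string.h>

#define BUFSIZE 4096

/* Stand-ins for an asynchronous output facility (hypothetical). */
static void start_async_write(const char *buf, size_t n) { fwrite(buf, 1, n, stdout); }
static void wait_for_write(void) { /* nothing to wait for in this stub */ }

/* Hypothetical data producer: returns bytes placed in buf, or 0 when done. */
static size_t produce_data(char *buf, size_t size)
{
    static int calls = 0;
    if (calls++ == 8) return 0;
    memset(buf, 'x', size);
    return size;
}

int main(void)
{
    static char buffer[2][BUFSIZE];
    int filling = 0;                               /* buffer currently being filled */
    size_t n = produce_data(buffer[filling], BUFSIZE);

    while (n > 0) {
        start_async_write(buffer[filling], n);     /* old buffer heads for the disk */
        filling = 1 - filling;                     /* ...while we fill the other    */
        n = produce_data(buffer[filling], BUFSIZE);
        wait_for_write();                          /* old buffer is free to reuse   */
    }
    wait_for_write();
    return 0;
}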
The idea of swapping system buffers to allow processing and I/O to overlap need not be restricted to two buffers. In theory, any number of buffers can be used, and they can be organized in a variety of ways. The actual management of system buffers is usually done by the operating system and can rarely be controlled by programmers who do not work at the systems level. It is common, however, for users to be able to control the number of system buffers assigned to jobs.

Some file systems use a buffering scheme called buffer pooling: When a system buffer is needed, it is taken from a pool of available buffers and used. When the system receives a request to read a certain sector or block, it looks to see if one of its buffers already contains that sector or block. If no buffer contains it, then the system finds from its pool of buffers one that is not currently in use and loads the sector or block into it.
FIGURE 3.17 Double buffering: (a) The contents of system I/O buffer 1 are sent to disk while I/O buffer 2 is being filled; and (b) the contents of buffer 2 are sent to disk while I/O buffer 1 is being filled.
Several different schemes are used to decide which buffer to take from a buffer pool. One generally effective strategy is to take from the pool the buffer that is least recently used (LRU). When a buffer is accessed, it is put on a least-recently-used queue, so it is allowed to retain its data until all other less-recently-used buffers have been accessed. The least-recently-used strategy for replacing old data with new data has many applications in computing. It is based on the assumption that a block of data that has been used recently is more likely to be needed in the near future than one that has been used less recently. (We encounter LRU again in later chapters.)
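A minimal sketch of a buffer pool with least-recently-used replacement follows. Everything in it (the types, the names, the use of a counter as a timestamp, the stub disk read) is our own illustration and not a description of any particular file manager.

/* Sketch: a buffer pool that replaces the least-recently-used buffer. */
#include <stdio.h>

#define POOL_SIZE  8
#define BLOCK_SIZE 512

typedef struct {
    long block_number;     /* which disk block the buffer holds     */
    int  valid;
    long last_used;        /* "timestamp" of the most recent access */
    char data[BLOCK_SIZE];
} buffer;

static buffer pool[POOL_SIZE];
static long clock_ticks = 0;

/* Stub standing in for a real disk read. */
static void read_block_from_disk(long block_number, char *dest)
{
    (void)dest;   /* a real implementation would fill dest from the disk */
    printf("reading block %ld from disk\n", block_number);
}

/* Return the buffer holding block_number, reading it from disk only if
 * no buffer in the pool already contains it. On a miss, the victim is
 * the least recently used buffer. */
static buffer *get_block(long block_number)
{
    int i, victim = 0;

    for (i = 0; i < POOL_SIZE; i++)
        if (pool[i].valid && pool[i].block_number == block_number) {
            pool[i].last_used = ++clock_ticks;                 /* hit  */
            return &pool[i];
        }

    for (i = 1; i < POOL_SIZE; i++)                            /* miss */
        if (!pool[i].valid ||
            (pool[victim].valid && pool[i].last_used < pool[victim].last_used))
            victim = i;

    read_block_from_disk(block_number, pool[victim].data);
    pool[victim].block_number = block_number;
    pool[victim].valid = 1;
    pool[victim].last_used = ++clock_ticks;
    return &pool[victim];
}

int main(void)
{
    long requests[] = { 1, 2, 3, 1, 2, 9, 1, 7 };
    for (int i = 0; i < 8; i++)
        get_block(requests[i]);
    return 0;
}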
It is difficult to predict the point at which the addition of extra buffers ceases to contribute to improved performance. As the cost of RAM continues to decrease, so does the cost of using more and bigger buffers. On the other hand, the more buffers there are, the more time it takes for the file system to manage them. When in doubt, consider experimenting with different numbers of buffers.

Move Mode and Locate Mode
Sometimes it is not necessary to distinguish between a program's data area and system buffers. When data must always be copied from a system buffer to a program buffer (or vice versa), the amount of time taken to perform the move can be substantial. This way of handling buffered data is called move mode, since it involves moving chunks of data from one place in RAM to another before they can be accessed.
There are two ways that move mode can be avoided. If the file manager can perform I/O directly between secondary storage and the program's data area, no extra move is necessary. Alternatively, the file manager could use system buffers to handle all I/O, but provide the program with the locations, through the use of pointer variables, of the system buffers. Both techniques are examples of a general approach to buffering called locate mode. When locate mode is used, a program is able to operate directly on data in the I/O buffer, eliminating the need to transfer data between an I/O buffer and a program buffer.
Scatter/Gather I/O  Suppose you are reading in a file with many blocks, where each block consists of a header followed by data. You would like to put the headers in one buffer and the data in a different buffer so the data can be processed as a single entity. The obvious way to do this is to read the whole block into a single big buffer, and then move the different parts to their own buffers. Sometimes we can avoid this two-step process using a technique called scatter input. With scatter input, a single READ call identifies not one, but a collection of buffers into which data from a single block is to be scattered.

The converse of scatter input is gather output. With gather output, several buffers can be gathered and written with a single WRITE call, avoiding the need to copy them to a single output buffer. When the cost of copying several buffers into a single output buffer is high, scatter/gather can have a significant effect on the running time of a program.

It is not always obvious when features like scatter/gather, locate mode, and buffer pooling are available in an operating system. You often have to go looking for them. Sometimes you can invoke them by communicating with your operating system, and sometimes you can cause them to be invoked by organizing your program in ways that are compatible with the way the operating system does I/O. Throughout this text we return many times to the issue of how to enhance performance by thinking about how buffers work and adapting programs and file structures accordingly.
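On UNIX systems that provide them, the readv( ) and writev( ) system calls are one concrete form of scatter/gather I/O. The sketch below is our own (the file name "blockfile" and the 16-byte-header block layout are invented): a single readv( ) call scatters one block into a header buffer and a data buffer, and writev( ) gathers several buffers into one write in the same way.

/* Sketch: scatter input with readv( ). */
#include <sys/uio.h>
#include <fcntl.h>
#include <unistd.h>

#define HEADER_SIZE 16
#define DATA_SIZE   496

int main(void)
{
    char header[HEADER_SIZE];
    char data[DATA_SIZE];
    struct iovec parts[2];
    int fd = open("blockfile", O_RDONLY);    /* hypothetical input file */
    if (fd < 0)
        return 1;

    parts[0].iov_base = header;  parts[0].iov_len = HEADER_SIZE;
    parts[1].iov_base = data;    parts[1].iov_len = DATA_SIZE;

    /* One call fills both buffers: the header lands in one, the data in the other. */
    ssize_t n = readv(fd, parts, 2);

    close(fd);
    return (n < 0);
}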
3.7 I/O in UNIX
We see in the journey of a byte that we can view I/O as proceeding through
several layers. UNIX provides a good example of how these layers occur in a real operating system, so we conclude this chapter with a look at UNIX. It is of course beyond the scope of this text to describe the UNIX I/O layers in detail. Rather, our objective here is just to pick a few features of UNIX that illustrate points made in the text. A secondary objective is to familiarize you with some of the important terminology used in describing UNIX systems. For a comprehensive, detailed look at how UNIX works, plus a thorough discussion of the design decisions involved in creating and improving UNIX, see Leffler et al. (1989).
3.7.1 The Kernel

In Fig. 3.14 we see how the process of transmitting data from a program to an external device can be described as proceeding through a series of layers. The topmost layer deals with data in logical, structural terms. We store in a file a name, a body of text, an image, an array of numbers, or some other logical entity. This reflects the view that an application has of what goes into a file. The layers that follow collectively carry out the task of turning the logical object into a collection of bits on a physical device.

Likewise, the topmost I/O layer in UNIX deals with data primarily in logical terms. This layer in UNIX consists of processes that impose certain logical views on files.
FIGURE 3.18 Kernel I/O structure. User processes (user programs, shell commands, libraries) sit above the system call interface; below it the kernel contains the block I/O system (normal files), the character I/O system (terminals, printers, etc.), and the network I/O system (sockets), which reach the hardware (disks, consoles, printers, networks) through block, character, and network device drivers.

Processes are associated with solving some problem, such as counting the words in a file or searching for somebody's address. Processes include shell routines like cat and tail, user programs that operate on files, and library routines like scanf( ) and fread( ) that are called from programs to read strings, numbers, etc. Below this layer is the UNIX kernel, which incorporates all the rest of the layers.†

The components of the kernel that do I/O are illustrated in Fig. 3.18. The UNIX kernel views all I/O as operating on a sequence of bytes, so once we pass control to the kernel all assumptions about the logical view of a file are gone. The decision to design UNIX in this way, to make all operations below the top layer independent of an application's logical view of a file, is unusual. It is also one of the main attractions in choosing UNIX as a focus for this text, for UNIX lets us make all of the decisions about the logical structure of a file, imposing no restrictions on how we think about the file beyond the fact that it must be built from a sequence of bytes.

†It is beyond the scope of this text to describe the UNIX kernel in detail. For a full description of the kernel, including the I/O system, see Leffler et al. (1989).
Let's illustrate the journey of a byte through the kernel, as we did earlier in this chapter, by tracing the results of an I/O statement. We assume in this example that we are writing a character to disk. This corresponds to the left branch of the I/O system in Fig. 3.18.
When your program executes a system call such as

write(fd, &c, 1);

the kernel is invoked immediately.* The routines that let processes communicate directly with the kernel make up the system call interface. In this case, the system call instructs the kernel to write a character to a file.

The kernel I/O system begins by connecting the file descriptor (fd) in your program to some file or device in the filesystem. It does this by proceeding through a series of four tables that enable the kernel to find its way from your process to the places on the disk that will hold the file that they refer to. The four tables are

a file descriptor table;
an open file table, with information about open files;
a file allocation table, which is part of a structure called an index node; and
a table of index nodes, with one entry for each file in use.

Although these tables are managed by the kernel's I/O system, they are, in a sense, "owned" by different parts of the system:
The file descriptor table is owned by the process (your program).
The open file table and index node tables are owned by the kernel.
The index node itself is part of the filesystem.
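A complete, minimal program that makes this journey might look like the sketch below (ours; the file name is invented). The open( ) call is what sets up the descriptor-table and open-file-table entries; the write( ) call then works its way through the tables described here.

/* Sketch: append one character to a file with the write( ) system call. */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char c = 'P';
    int fd = open("textfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;
    write(fd, &c, 1);    /* the system call traced in this section */
    close(fd);
    return 0;
}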
The four tables are invoked in turn by the kernel to get the information it needs to write to your file on disk. Let's see how this works by looking at the functions of the tables.
The file descriptor table (Fig. 3.19a) is a simple table that associates each of the file descriptors used by a process with an entry in another table, the open file table. Every process has its own descriptor table, which includes entries for all files it has opened, including the "files" STDIN, STDOUT, and STDERR.

*This should not be confused with a library call, such as fprintf( ), which invokes the standard library to perform some additional operations on the data, such as converting it to an ASCII format, and then makes a corresponding system call.
FIGURE 3.19 Descriptor table and open file table. (a) Each entry in a process's file descriptor table points to an entry in the open file table. (b) An open file table entry records the read/write mode, the number of processes using it, the offset of the next access, a pointer to the inode table entry, and pointers to the routines (such as the write( ) routine) appropriate for this type of file.
The open file table (Fig. 3.19b) contains entries for every open file. Every time a file is opened or created, a new entry is added to the open file table. These entries are called file structures, and they contain important information about how the corresponding file is to be used, such as the read/write mode used when it was opened, the number of processes currently using it, and the offset within the file to be used for the next read or write. The open file table also contains an array of pointers to generic functions that can be used to operate on the file. These functions will differ depending on the type of file.
It is possible for several different processes to refer to the same open file table entry, so one process could read part of a file, another process could read the next part, and so forth, with each process taking over where the previous one stopped. On the other hand, if the same file is opened by two separate open( ) statements, two separate entries are made in the table, and the two processes operate on the file quite independently.†

The information in the open file table is transitory. It tells the kernel what it can do with a file that has been opened in a certain way and provides information on how it can operate on the file. The kernel still needs more information about the file itself, such as where the file is stored on disk, how big the file is, and who owns it. This information is found in an index node, more commonly referred to as an inode (Fig. 3.20).

An inode is a more permanent structure than an open file table's file structure. A file structure exists only while a file is open for access, but an inode exists as long as its corresponding file exists. For this reason, a file's inode is kept on disk with the file (though not physically adjacent to the file). When a file is opened, a copy of its inode is usually loaded into RAM, where it is added to the aforementioned inode table for rapid access.

†Of course, if you are writing to a file with one process at the same time that you are independently reading from the file with another, the meaning of these references may be difficult to determine; there are risks in letting this happen.
For the purposes of our discussion, the most important component of the inode is a list (index) of the disk blocks that make up the file. This list is the UNIX counterpart to the file allocation table that we described earlier in this chapter.*

Once the kernel's I/O system has the inode information, it knows all that it needs to know about the file. It then invokes an I/O processor program that is appropriate for the type of data, the type of operation, and the type of device that it is to be written to. In UNIX, this program is called a device driver. The device driver sees that your data is moved from its buffer to its proper place on disk. Before we look at the role of device drivers in UNIX, it is instructive to look at how the kernel distinguishes among the different kinds of file data that it must deal with.

*This might not be a simple linear array. To accommodate both large and small files, this table often has a dynamic, tree-like structure.
permissions
owner's userid
file size
block count
w^
tile
allocation
table
FIGURE 3.20 An inode. The inode is the data structure used by UNIX to describe
file. It includes the device containing the file, permissions, owner and group
the
IDs,
and
allocation table,
file
directory, for
is
it is
just a small
pointer to the
inode of a
name
link
is
to
other things.
in directories that file
file's
It
are kept. In fact, a directory
a file
name
together with
other information about the
file.
RAM
When
and to
a directory to the
provides a direct reference from the
used to bring the inode into
It is
file,
inode on disk.^ This pointer from
called a hard link.
entry in the open
names
that contains, for each
file
file is
all
among
a file is
opened,
file
hard
this
up the corresponding
set
file table.
possible for several
file
names
can have several different names.
to point to the
field in the
same inode, so one
inode
tells
how many
file
hard
means that if a file name is deleted and
same file, the file itself is not deleted; its
decremented by one.
links there are to the inode. This
there are other
file
names
inode's hard-link count
is
for the
just
There is another kind of link, called a soft link, or syrnbolkjliik A
symbolic link links a file name to another file name, rather than to an actual
file. Instead of being a pointer to an inode, a soft link is a pathname of some
.
^The
actual structure of a directory
sential parts. See Leffler, et
al.
is
a little
more complex than
(1989) for details.
this,
but these arc the
es-
78
SECONDARY STORAGE AND SYSTEM SOFTWARE
symbolic link does not point to an actual file, it can refer to a
file in a different file system. Symbolic links are not
supported on all UNIX systems. UNIX System 4.3BSD supports symbolic
links, but System V does not.
Since
file.
directory or even to a
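As a concrete, hedged illustration of the difference, the short C sketch below uses the standard link( ) and symlink( ) system calls to give an existing file a second hard link and a symbolic link, then reports the hard-link count kept in the inode. The file names are hypothetical, and the sketch assumes a system (such as 4.3BSD) that supports symbolic links.

    /* linkdemo.c -- a minimal sketch, assuming a file named "data.txt"
       already exists in the current directory.                          */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        struct stat info;

        /* A second directory entry (hard link) pointing to the same inode. */
        if (link("data.txt", "data.hard") != 0)
            perror("link");

        /* A symbolic link: a new file whose contents are the pathname.     */
        if (symlink("data.txt", "data.soft") != 0)
            perror("symlink");

        /* The inode's hard-link count now counts both names; the symbolic
           link does not affect it.                                         */
        if (stat("data.txt", &info) == 0)
            printf("hard links to this inode: %ld\n", (long) info.st_nlink);

        return 0;
    }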
3.7.3 Normal Files, Special Files, and Sockets

The "everything is a file" concept in UNIX works only when we recognize that some files are quite a bit different from others. We see in Fig. 3.18 that the kernel distinguishes among three different types of files. Normal files are the files that this text is about. Special files almost always represent a stream of characters and control signals that drive some device, such as a line printer or a graphics device. The first three file descriptors in the descriptor table (Fig. 3.19a) are special files. Sockets are abstractions that serve as endpoints for interprocess communication.

At a certain conceptual level, these three different types of UNIX files are very similar, and many of the same routines can be used to access any of them. For instance, you can establish access to all three types by opening them, and you can write to them with the write( ) system call.
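The sketch below, which is not from the book's program listings, suggests what this uniformity looks like in practice: the same open( ) and write( ) calls work on a normal file and on a special (device) file. The pathnames are merely illustrative.

    /* samecall.c -- writing to a normal file and to a special file
       with exactly the same system calls.                           */
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char msg[] = "hello, file\n";

        int normal  = open("notes.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int special = open("/dev/tty", O_WRONLY);   /* the terminal device */

        if (normal  >= 0) { write(normal,  msg, sizeof(msg) - 1); close(normal);  }
        if (special >= 0) { write(special, msg, sizeof(msg) - 1); close(special); }

        return 0;
    }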
3.7.4 Block I/O

In Fig. 3.18, we see that the three different types of files access their respective devices via three different I/O systems: the block I/O system, the character I/O system, and the network I/O system. Henceforth we ignore the second and third categories, since it is normal file I/O that we are most concerned with in this text.*

The block I/O system is the UNIX counterpart of the file manager in the journey of a byte. It concerns itself with how to transmit normal file data, viewed by the user as a sequence of bytes, onto a block-oriented device like a disk or tape. Given a byte to store on a disk, for example, it arranges to read in the sector containing the byte to be replaced, to replace the byte, and to write the sector back to the disk.

The UNIX view of a block device most closely resembles that of a disk. It is a randomly addressable array of fixed blocks. Originally all blocks were 512 bytes, which was the common sector size on most disks. No other organization (such as clusters) was imposed on the placement of files on disk. (In section 3.1.7 we saw how the design of later UNIX systems dealt with this convention.)

*This is not entirely true. Sockets, for example, can be used to move normal files from place to place. In fact, high-performance network systems bypass the normal file system in favor of sockets to squeeze every bit of performance out of the network.
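The read-modify-write cycle just described can be sketched with ordinary file-level calls, as in the hedged example below. This is not kernel code; the 512-byte sector size and the function name are assumptions used only for illustration.

    /* putbyte.c -- a sketch of the sector-level read-modify-write cycle. */
    #include <unistd.h>

    #define SECTOR 512

    /* Replace the byte at absolute offset pos in the open file fd. */
    int put_byte(int fd, long pos, char value)
    {
        char sector[SECTOR];
        long start = (pos / SECTOR) * SECTOR;      /* sector containing pos */

        if (lseek(fd, start, SEEK_SET) < 0) return -1;
        if (read(fd, sector, SECTOR) < 0)   return -1;  /* read the sector in */

        sector[pos - start] = value;                    /* replace the byte   */

        if (lseek(fd, start, SEEK_SET) < 0) return -1;
        if (write(fd, sector, SECTOR) < 0)  return -1;  /* write it back out  */
        return 0;
    }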
3.7.5 Device Drivers

For each peripheral device there is a separate set of routines, called a device driver, that performs the actual I/O between the I/O buffer and the device. A device driver is roughly equivalent to the I/O processor program described in the journey of a byte.

Since the block I/O system views a peripheral device as an array of physical blocks, addressed as block 0, block 1, etc., a block I/O device driver's job is to take a block from a buffer, destined for one of these physical blocks, and see that it gets deposited in the proper physical place on the device. This saves the block I/O part of the kernel from having to know anything about the specific device it is writing to, other than its identity and that it is a block device. A thorough discussion of device drivers for block, character, and network I/O can be found in Leffler et al. (1989).
3.7.6 The Kernel and Filesystems

In Chapter 2 we described the UNIX concept of a filesystem. A UNIX filesystem is a collection of files, together with secondary information about the files in the system. The actual filesystem includes the directory structure, the directories, ordinary files, and the inodes that describe the files.

In our discussions we talk about the filesystem as if it is part of the kernel's I/O system, which it is, but it is also in a sense separate from it. All parts of a filesystem reside on disk, rather than in RAM where the kernel does its work. These parts are brought into RAM by the kernel as needed. This separation of the filesystem from the kernel has many advantages. One important advantage is that we can tune a filesystem to a particular device or usage pattern independently of how the kernel views files. The discussions in section 3.1.7 of 4.3BSD block organization are filesystem concerns, for example, and need not have any effect on how the kernel works.

Another advantage of keeping the filesystem and I/O system distinct is that we can have separate filesystems that are organized differently, perhaps on different devices, but are accessible by the same kernel. In Appendix A, for instance, we describe the design of a filesystem on CD-ROM that is organized quite differently from a typical disk-based filesystem yet looks just like any other filesystem to the user and to the I/O system.
3.7.7 Magnetic Tape and UNIX

Important as it is to computing, magnetic tape is somewhat of an orphan in the UNIX view of I/O. A magnetic tape unit has characteristics similar to both block I/O devices (being block oriented) and character devices (being primarily used for sequential access), but does not fit nicely into either category. Character devices read and write streams of data, not blocks, and block devices in general access blocks randomly, not sequentially.

Since block I/O is generally the least inappropriate of the two inappropriate paradigms for tape, a tape device is normally considered in UNIX to be a block I/O device and hence is accessed through the block I/O interface. But because the block I/O interface is most often used to write to random-access devices, disks, it does not require blocks to be written in sequence, as they must be written to a tape. This problem is solved by allowing only one write request at a time per tape drive. When high-performance I/O is required, the character device interface can be used in a raw mode to stream data to tapes, bypassing the stage that requires the data to be collected into relatively small blocks before or after transmission.
SUMMARY

In this chapter we look at the software environment in which file processing programs must operate and at some of the hardware devices on which files are commonly stored, hoping to understand how they influence the ways we design and process files. We begin by looking at the two most common storage media: magnetic disks and tapes.

A disk drive consists of a set of read/write heads that are interspersed among one or more platters. Each platter contributes one or two surfaces, each surface contains a set of concentric tracks, and each track is divided into sectors or blocks. The set of tracks that can be read without moving the read/write heads is called a cylinder.

There are two basic ways to address data on disks: by sector and by block. Used in this context, the term block refers to a group of records that are stored together on a disk and treated as a unit for I/O purposes. When blocks are used, the user is better able to make the physical organization of data correspond to its logical organization, and hence can sometimes improve performance. Block-organized drives also sometimes make it possible for the disk drive to search among blocks on a track for a record with a certain key without first having to transmit the unwanted blocks into RAM.

Three possible disadvantages of block-organized devices are the danger of internal track fragmentation, the burden of dealing with the extra complexity that the user has to bear, and the loss of some of the opportunities to do the kinds of synchronization (such as sector interleaving) that sector-addressing devices provide.
The cost of a disk access can be measured in terms of the time it takes for seeking, rotational delay, and transfer time. If sector interleaving is used, it is possible to access logically adjacent sectors by physically separating them by one or more sectors. Although it takes much less time to access a single record directly than sequentially, the extra seek time required for doing direct accesses makes it much slower than sequential access when a series of records is to be accessed.

Despite increasing disk performance, network speeds have improved to the point that disk access is often a significant bottleneck in an overall I/O system. A number of techniques are available for addressing this problem, including striping, the use of RAM disks, and disk caching.
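As a back-of-the-envelope illustration of the access-time breakdown above, a single random access can be estimated as the sum of the three components. The figures in this small sketch are invented for illustration and are not drawn from the text.

    /* access_time.c -- a hedged sketch: one random access estimated as
       seek + rotational delay + transfer.  All numbers are made up.     */
    #include <stdio.h>

    int main(void)
    {
        double seek_ms     = 18.0;   /* assumed average seek time            */
        double rot_ms      = 8.3;    /* half a revolution at 3,600 rpm       */
        double transfer_ms = 0.5;    /* assumed transfer time for one sector */

        printf("estimated random access: %.1f msec\n",
               seek_ms + rot_ms + transfer_ms);
        return 0;
    }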
Research done in connection with BSD UNIX shows that block size can have a major effect on performance. By increasing the default block size from 512 bytes to 4,096 bytes, throughput was improved enormously, especially for large files, because eight times as much data could be transferred in a single access. A negative consequence of this reorganization was that wasted storage increased from 6.9% for 512-byte blocks to 45.6% for 4,096-byte blocks. It turned out that this problem of wasted space could be dealt with by treating the 4,096-byte blocks as clusters of 512-byte blocks, which could be allocated to different files.

Though not as important as disks, magnetic tape has an important niche
in file processing. Tapes are inexpensive, reasonably fast for sequential processing, compact, robust, and easy to store and transport. Data are usually organized on tapes in one-bit-wide parallel tracks, with a bit-wide cross-section of tracks interpreted as one or more bytes. When estimating processing speed and space utilization, it is important to recognize the role played by the interblock gap. Effective recording density and effective transmission rate are useful measurements of the performance one can expect to achieve for a given physical file organization.

In comparing disk and tape as secondary storage media, we see that disks are replacing tape in more and more cases. This is largely because RAM is becoming less expensive, relative to secondary storage, which means that one of the earlier advantages of tape over disk, the ability to do sequential access without seeking, has diminished significantly.
This chapter follows the journey of a byte as it is sent from RAM to disk. The journey involves the participation of many different programs and devices, including

a user's program, which makes the initial call to the operating system;

the operating system's file manager, which maintains tables of information that it uses to translate between the program's logical view of the file and the physical file where the byte is to be stored;

an I/O processor and its software, which transmit the byte, synchronizing the transmission of the byte between an I/O buffer in RAM and the disk;

the disk controller and its software, which instruct the drive about how to find the proper track and sector, then send the byte; and

the disk drive, which accepts the byte and deposits it on the disk surface.
Next, we take a closer look at buffering, focusing mainly on techniques for managing buffers to improve performance. Some techniques include double buffering, buffer pooling, locate-mode buffering, and scatter/gather buffering.

We conclude with a second look at I/O layers, this time concentrating on UNIX. We see that every I/O system call begins with a call to the UNIX kernel, which knows nothing about the logical structure of a file, treating all data essentially the same: as a sequence of bytes to be transmitted to some external device. In doing its work the I/O system in the kernel invokes four tables: a file descriptor table, an open file table, an inode table, and a file access table in the file's inode. Once the kernel has determined which device to use and how to access it, it calls on a device driver to carry out the actual accessing.

Although it treats every file as a sequence of bytes, the kernel I/O system deals differently with three different types of I/O: block I/O, character I/O, and network I/O. In this text we concentrate on block I/O. We look briefly at the special role of the filesystem within the kernel, describing how it uses links to connect file names in directories to their corresponding inodes. Finally, we remark on the reasons that magnetic tape does not fit well into the UNIX paradigm for I/O.
KEY TERMS
bpi. Bits per inch per track. On a disk, data is recorded serially on tracks. On a tape, data are recorded in parallel on several tracks, so a 6,250-bpi nine-track tape contains 6,250 bytes per inch, when all nine tracks are taken into account (one track being used for parity).
Block. Unit of data organization corresponding to the amount of data transferred in a single access. Block often refers to a collection of records, but it may be a collection of sectors (see cluster) whose size has no correspondence to the organization of the data. A block is sometimes called a physical record; a sector is sometimes called a block.
Block device. In UNIX, a device such as a disk drive that is organized in blocks and accessed accordingly.
Block I/O. I/O between a computer and a block device.
Block organization. Disk drive organization that allows
the user to
define the size and organization of blocks, and then access a block by
giving
its
block address or the key of one of
its
records. (See sector
organization.)
Blocking factor. The number of records stored in one block.
Character device. In UNIX, a device such as a keyboard or printer (or tape drive when stream I/O is used) that sends or receives data in the form of a stream of characters.
Character I/O. I/O between a computer and a character device.
Cluster. Minimum unit of space allocation on a sectored disk, consisting
of one or more contiguous sectors. The use of large clusters can improve sequential access times by guaranteeing the ability to read
longer spans of data without seeking. Small clusters tend to decrease
internal fragmentation.
Controller. Device that directly controls the operation of one or more
secondary storage devices, such as disk drives and magnetic tape
units.
Count subblock. On block-organized
drives, a small block that pre-
cedes each data block and contains information about the data block,
such as its byte count and its address.
Cylinder. The set of tracks on a disk that are directly above and below
each other. All of the tracks in a given cylinder can be accessed without having to move the access arm; that is, they can be accessed
without the expense of seek time.
Descriptor table. In UNIX, a table associated with a single process that
links all of the file descriptors generated by that process to corresponding entries in an open file table.
Device driver. In UNIX, an I/O processor program invoked by the
kernel that performs I/O for a particular device.
Direct access storage device (DASD). Disk or other secondary storage device that permits access to a specific sector or block of data
without first requiring the reading of the blocks that precede it.
Direct memory access (DMA). Transfer of data directly between RAM and peripheral devices, without significant involvement by the CPU.
Disk cache. A segment of RAM configured to contain pages of data
from a disk. Disk caches can lead to substantial improvements in access time
when
access requests exhibit a high degree of locality.
Disk pack. An assemblage of magnetic disks mounted on the same vertical shaft. A pack of disks is treated as a single unit consisting of a number of cylinders equivalent to the number of tracks per surface. If disk packs are removable, different packs can be mounted on the same drive at different times, providing a convenient form of offline storage for data that can be accessed directly.
Effective recording density. Recording density after taking into account the space used by interblock gaps, nondata subblocks, and other space-consuming items that accompany data.
Effective transmission rate. Transmission rate after taking into account the time used to locate and transmit the block of data in which a desired record occurs.
Extent. One or more adjacent clusters allocated as part (or all) of a file. The number of extents in a file reflects how dispersed the file is over the disk. The more dispersed a file, the more seeking must be done in moving from one part of the file to another.
File allocation table (FAT). A table that contains mappings to the physical locations of all the clusters in all files on disk storage.
File manager. The part of an operating system that is responsible for managing files, including a collection of programs whose responsibilities range from keeping track of files to invoking I/O processes that transmit information between primary and secondary storage.
File structure. In connection with the open file table in a UNIX kernel, the term file structure refers to a structure that holds information the kernel needs about an open file. File structure information includes such things as the file's read/write mode, the number of processes currently using it, and the offset within the file to be used for the next read or write.
Filesystem. In UNIX, a hierarchical collection of files, usually kept on a single secondary device, such as a hard disk or CD-ROM.
Fixed disk. A disk drive with platters that may not be removed.
Formatting. The process of preparing a disk for data storage, involving
such things as laying out sectors, setting up the disk's file allocation
table, and checking for damage to the recording medium.
Fragmentation. Space that goes unused within a cluster, block, track, or other unit of physical storage. For instance, track fragmentation occurs when space on a track goes unused because there is not enough space left to accommodate a complete block.
Frame. A one-bit-wide slice of tape, usually representing a single byte.
Hard link. In UNIX, an entry in a directory that connects a file name to the inode of the corresponding file. There can be several hard links to a single file; hence a file can have several names. A file is not deleted until all hard links to the file are deleted.
Index node. In UNIX, a data structure associated with a file that describes the file. An index node includes such information as a file's type, owner and group IDs, and a list of the disk blocks that comprise the file. A more common name for index node is inode.
Inode. See index node.
Interblock gap. An interval of blank space that separates sectors, blocks, or subblocks on tape or disk. In the case of tape, the gap provides sufficient space for the tape to accelerate or decelerate when starting or stopping. On both tapes and disks the gaps enable the read/write heads to tell accurately when one sector (or block or subblock) ends and another begins.
Interleaving factor. Since it is often not possible to read physically adjacent sectors of a disk, logically adjacent sectors are sometimes arranged so they are not physically adjacent. This is called interleaving.
The interleaving factor refers to the number of physical sectors the
next logically adjacent sector
is
located
from the current
sector being
read or written.
I/O processor. A device that carries out I/O tasks, allowing the CPU to work on non-I/O tasks.
Kernel. The central part of the UNIX operating system.
Key subblock. On block-addressable drives, a block that contains the key of the last record in the data block that follows it, allowing the drive to search among the blocks on a track for a block containing a certain key, without having to load the blocks into primary memory.
Mass storage system. General term
applied to storage units with large
capacity. Also applied to very high-capacity secondary storage systems that are capable of transmitting data between a disk and any of
several thousand tape cartridges within a
few seconds.
Nominal recording density. Recording density on a disk track or magnetic tape without taking into account the effects of gaps or nondata subblocks.
Nominal transmission
rate. Transmission rate of a disk or tape unit
without taking into account the effects of such extra operations as
seek time for disks and interblock gap traversal time for tapes.
Open file table. In UNIX, a table owned by the kernel with an entry,
called a file structure, for each
open
file.
See
file structure.
Parity. An error-checking technique in which an extra parity bit accompanies each byte and is set in such a way that the total number of bits is even (even parity) or odd (odd parity).
Platter. One disk in the stack of disks on a disk drive.
Process. An executing program. In UNIX, several instances of the same program can be executing at the same time, as separate processes. The kernel keeps a separate file descriptor table for each process.
RAM disk. Block of RAM configured to simulate a disk.
Rotational delay. The time it takes for the disk to rotate so the desired sector is under the read/write head.
Scatter/gather I/O. Buffering techniques that involve, on input, scattering incoming data into more than one buffer, and, on output, gathering data from several buffers to be output as a single chunk of data.
Sector. The fixed-sized data blocks that together make up the tracks on certain disk drives. Sectors are the smallest addressable unit on a disk whose tracks are made up of sectors.
Sector organization. Disk drive organization that uses sectors.
Seek time. The time required to move the access arm to the correct cylinder on a disk drive.
Sequential access device. A device, such as a magnetic tape unit or card reader, in which the medium (e.g., tape) must be accessed from the beginning. Sometimes called a serial device.
Socket. In UNIX, a socket is an abstraction that serves as an endpoint
of communication within some domain. For example, a socket can
be used to provide direct communication between two computers.
Although in some ways the kernel treats sockets like files, we do not deal with sockets in this text.
Soft link. See symbolic
link.
Special file. In UNIX, the term special file refers to a stream of characters and control signals that drive some device, such as a line printer or a graphics device.
Streaming tape drive. A tape drive whose primary purpose is dumping large amounts of data from disk to tape or from tape to disk.
Subblock. When blocking is used, there are often separate groupings of information concerned with each individual block. For example, a count subblock, a key subblock, and a data subblock might all be present.
Symbolic link. In UNIX, an entry in a directory that gives the pathname of a file. Since a symbolic link is an indirect pointer to a file, it is not as closely associated with the file as a hard link. Symbolic links can point to directories, or even to files in other filesystems.
Track. The set of bytes on a single surface of a disk that can be accessed
without seeking (without moving the access arm). The surface of a
disk can be thought of as a series of concentric circles, with each circle corresponding to a particular position of the access arm and read/
write heads. Each of these circles is a track.
Transfer time. Once the data we want is under the read/write head, we
have to wait for it to pass under the head as we read it. The amount
of time required for this motion and reading is the transfer time.
EXERCISES
1. Determine as well as you can what the journey of a byte would be like on your system. You may have to consult technical reference manuals that describe your computer's file management system, operating system, and peripheral devices. You may also want to talk to local gurus who have experience using your system.
2. Suppose you are writing a list of names to a text file, one name per write statement. Why is it not a good idea to close the file after every write, and then reopen it before the next write?

3. Find out what utility routines are available on your computer system for monitoring I/O performance and disk utilization. If you have a large computing system, there are different routines available for different kinds of users, depending on what privileges and responsibilities they have.

4. When you create or open a file in C or Pascal, you must provide certain information to your computer's file manager so it can handle your file properly. Compared to certain languages, such as PL/I or COBOL, the amount of information you must provide in C or Pascal is very small. Find a text or manual on PL/I or COBOL and look up the ENVIRONMENT file description attribute, which can be used to tell the file manager a great deal about how you expect a file to be organized and used. Compare PL/I or COBOL with C or Pascal in terms of the types of file specifications available to the programmer.

5. Much is said in section 3.1 about how disk space is organized physically to store files. Assume that no such complex organization is used and that every file must occupy a single contiguous piece of a disk, somewhat the way a file is stored on tape. How does this simplify disk storage? What problems does it create?
6. A disk drive uses 512-byte sectors. If a program requests that a 128-byte record be written to disk, the file manager may have to read a sector from the disk before it can write the record. Why? What could you do to decrease the number of times such an extra read is likely to occur?

7. We have seen that some disk operating systems allocate storage space on disks in clusters and/or extents, rather than sectors, so the size of any file must be a multiple of a cluster or extent.
a. What are some advantages and potential disadvantages of this method of allocating disk space?
b. How appropriate would the use of large extents be for an application that mostly involves sequential access of very large files?
c. How appropriate would large extents be for a computing system that serves a large number of C programmers? (C programs tend to be small, so there are likely to be many small files that contain C programs.)
d. The VAX record management system uses a default cluster size of three 512-byte sectors but lets a user reformat a drive with any cluster size from 1 to 65,535 sectors. When might a cluster size larger than three sectors be desirable? When might a smaller cluster size be desirable?
8. In early UNIX systems, inodes were kept together on one part of a disk, while the corresponding data was scattered elsewhere on the disk. Later editions divided disk drives into groups of adjacent cylinders called cylinder groups, in which each cylinder group contains inodes and their corresponding data. How does this new organization improve performance?
9. In early UNIX systems, the minimum block size was 512 bytes, with a cluster size of one. The block size was increased to 1,024 bytes in 4.0BSD, more than doubling its throughput. Explain how this could occur.
10. Draw pictures that illustrate the role of fragmentation in determining the numbers in Table 3.2, section 3.1.7.

11. The IBM 3350 disk drive uses block addressing. The two subblock organizations described in the text are available:
Count-data, where the extra space used by count subblock and interblock gaps is equivalent to 185 bytes; and
Count-key-data, where the extra space used by the count and key subblocks and accompanying gaps is equivalent to 267 bytes, plus the key size.
An IBM 3350 has 19,069 usable bytes available per track, 30 tracks per cylinder, and 555 cylinders per drive. Suppose you have a file with 350,000 80-byte records that you want to store on a 3350 drive. Answer the following questions. Unless otherwise directed, assume that the blocking factor is 10 and that the count-data subblock organization is used.
a. How many blocks can be stored on one track? How many records?
b. How many blocks can be stored on one track if the count-key-data subblock organization is used and key size is 13 bytes?
c. Make a graph that shows the effect of block size on storage utilization, assuming count-data subblocks. Use the graph to help predict the best and worst possible blocking factor in terms of storage utilization.
d. Assuming that access to the file is always sequential, use the graph from the preceding question to predict the best and worst blocking factor. Justify your answer in terms of efficiency of storage utilization and processing time.
e. How many cylinders are required to hold the file (blocking factor 10 and count-data format)? How much space will go unused due to internal track fragmentation?
f. If the file were stored on contiguous cylinders and if there were no interference from other processes using the disk drive, the average seek time for a random access of the file would be about 12 msec. Use this rate to compute the average time needed to access one record randomly.
g. Explain how retrieval time for random accesses of records is affected by increasing block size. Discuss trade-offs between storage efficiency and retrieval when different block sizes are used. Make a table with different block sizes to illustrate your explanations.
h. Suppose the file is to be sorted and a shell sort is to be used to sort the file. Since the file is much too large to read into memory, it will be sorted in place, on the disk. It is estimated (Knuth, 1973b, p. 380) that this requires about 15N^1.25 moves of records, where N represents the total number of records in the file. Each move requires a random access. If all of the preceding is true, how long does it take to sort the file? (As you will see, this is not a very good solution. We provide better ones in Chapter 7, which deals with cosequential processing.)
12. A sectored disk drive differs from one with a block organization in that there is less of a correspondence between the logical and physical organization of data records or blocks.
For example, consider the Digital RM05 disk drive, which uses sector addressing. It has 32 512-byte sectors per track, 19 tracks per cylinder, and 823 cylinders per drive. From the drive's (and drive controller's) point of view, a file is just a vector of bytes divided into 512-byte sectors. Since the drive knows nothing about where one record ends and another begins, a record can span two or more sectors, tracks, or cylinders.
One common way that records are formatted on the RM05 is to place a two-byte field at the beginning of each block, giving the number of bytes of data, followed by the data itself. There is no extra gap and no other overhead. Assuming that this organization is used, and that you want to store a file with 350,000 80-byte records, answer the following questions:
a. How many records can be stored on one track if one record is stored per block?
b. How many cylinders are required to hold the file?
c. How might you block records so each physical record access results in 10 actual records being accessed? What are the benefits of doing this?
13. Suppose you have a collection of 500 large images stored in files, one
image per file, and you wish to "animate" these images by displaying them
in sequence on a workstation at a rate of at least 15 images per second over
a high-speed network. Your secondary storage consists of a disk farm with
30 disk drives, and your disk manager permits striping over as many as 30
drives, if you request it. Your drives are guaranteed to perform I/O at a
steady rate of 2 megabytes per second. Each image is 3 megabytes in size.
Network transmission speeds are not a problem.
a. Describe in broad terms the steps involved in doing such an animation in real time from disk.
b. Describe the performance issues that you have to consider in implementing the animation. Use numbers.
c. How might you configure your I/O system to achieve the desired
performance?
14. Consider the 1,000,000-record mailing list file discussed in the text. The file is to be backed up on 2,400-foot reels of 6,250-bpi tape with 0.3-inch interblock gaps. Tape speed is 200 inches per second.
a. Show that only one tape would be required to back up the file if a blocking factor of 50 is used.
b. If a blocking factor of 50 is used, how many extra records could be accommodated on a 2,400-foot tape?
c. What is the effective recording density when a blocking factor of 50 is used?
d. How large does the blocking factor have to be to achieve the maximum effective recording density? What negative results can result from increasing the blocking factor? (Note: An I/O buffer large enough to hold a block must be allocated.)
e. What would be the minimum blocking factor required to fit the file onto the tape?
f. If a blocking factor of 50 is used, how long would it take to read one block, including the gap? What would the effective transmission rate be? How long would it take to read the entire file?
g. How long would it take to perform a binary search for one record in the file, assuming that it is not possible to read backwards on the tape? (Assume that it takes 60 seconds to rewind the tape.) Compare this with the expected average time it would take for a sequential search for one record.
h. We implicitly assume in our discussions of tape performance that the tape drive is always reading or writing at full speed, so no time is lost by starting and stopping. This is not necessarily the case. For example, some drives automatically stop after writing each block.
Suppose that the extra time it takes to start before reading a block and to stop after reading the block totals 1 msec, and that the drive must start before and stop after reading each block. How much will the effective transmission rate be decreased due to starting and stopping if the blocking factor is 1? What if it is 50?
15. Why are there interblock gaps on tapes? In other words, why do we not just jam all records into one block?

16. The use of large blocks can lead to severe internal fragmentation of tracks on disks. Does this occur when tapes are used? Explain.
FURTHER READINGS
Many textbooks contain more detailed information on the material covered in this chapter. In the area of operating systems and file management systems, we have found the operating system texts by Deitel (1984), Peterson and Silberschatz (1985), and Madnick and Donovan (1974) useful. Hanson (1982) has a great deal of material on blocking and buffering, secondary storage devices, and performance. Flores's book (1973) on peripheral devices may be a bit dated, but it contains a comprehensive treatment of the subject.
Bohl (1981) provides a thorough treatment of mainframe-oriented IBM DASDs. Chaney and Johnson (1984) wrote a good article on maximizing hard disk performance on small computers. Ritchie and Thompson (1974), Kernighan and Ritchie (1978), Deitel (1984), and McKusick et al. (1984) provide information on how file I/O is handled in the UNIX operating system. The latter provides a good case study of ways in which a filesystem can be altered to provide substantially faster throughput for certain applications. A comprehensive coverage of UNIX I/O from the design perspective can be found in Leffler et al. (1989).
Information on specific systems and devices can often be found in manuals and documentation published by manufacturers. (Unfortunately, information about how software actually works is often proprietary and therefore not available.) If you use a VAX, we recommend the manuals Introduction to the VAX Record Management Services (Digital, 1978), VAX Software Handbook (Digital, 1982), and Peripherals Handbook (Digital, 1981). UNIX users will find it useful to look at the Bell Laboratories' monograph The UNIX I/O System by Dennis Ritchie (1979). Users of IBM PCs will find the Disk Operating System manual (Microsoft, 1983 or later) useful.
Fundamental File
Structure Concepts
CHAPTER OBJECTIVES
Introduce file structure concepts dealing with
Stream files;
Field and record boundaries;
Fixed-length and variable-length fields and records;
Search keys and canonical forms;
Sequential search;
Direct access; and
File access and file organization.
Examine other kinds of file structures in terms of
Abstract data models;
Metadata;
Object-oriented file access; and
Extensibility.
Examine issues of portability and standardization.
CHAPTER OUTLINE

4.1 Field and Record Organization
    4.1.1 A Stream File
    4.1.2 Field Structures
    4.1.3 Reading a Stream of Fields
    4.1.4 Record Structures
    4.1.5 A Record Structure That Uses a Length Indicator
    4.1.6 Mixing Numbers and Characters: Use of a File Dump
4.2 Record Access
    4.2.1 Record Keys
    4.2.2 A Sequential Search
    4.2.3 UNIX Tools for Sequential Processing
    4.2.4 Direct Access
4.3 More about Record Structures
    4.3.1 Choosing a Record Structure and Record Length
    4.3.2 Header Records
4.4 File Access and File Organization
4.5 Beyond Record Structures
    4.5.1 Abstract Data Models
    4.5.2 More Complex Headers
    4.5.3 Metadata
    4.5.4 Color Raster Images
    4.5.5 Mixing Object Types in One File
    4.5.6 Object-oriented File Access
    4.5.7 Extensibility
4.6 Portability and Standardization
    4.6.1 Factors Affecting Portability
    4.6.2 Achieving Portability
4.1 Field and Record Organization

When we build file structures we are imposing order on data. In this chapter we investigate the many forms that this ordering can take. We begin by looking at the base case: a file organized as a stream of bytes.

4.1.1 A Stream File

Suppose the file we are building contains name and address information. The program to accept names and addresses from the keyboard, writing them out as a stream of consecutive bytes to a file with the logical name OUTPUT, is described in the pseudocode shown in Fig. 4.1.

Implementations of this program in both C and Pascal, called writstrm.c and writstrm.pas, are provided in the C and Pascal Programs sections at the end of this chapter. You should type in this program, working in either C or Pascal, compile it, and run it. We use it as the basis for a number of experiments, and you can get a better feel for the differences between the file structures we are discussing if you perform the experiments yourself.

PROGRAM: writstrm

    get output file name and open it with the logical name OUTPUT
    get LAST name as input
    while (LAST name has a length > 0)
        get FIRST name, ADDRESS, CITY, STATE and ZIP as input
        write LAST    to the file OUTPUT
        write FIRST   to the file OUTPUT
        write ADDRESS to the file OUTPUT
        write CITY    to the file OUTPUT
        write STATE   to the file OUTPUT
        write ZIP     to the file OUTPUT
        get LAST name as input
    endwhile
    close OUTPUT
end PROGRAM

FIGURE 4.1 Program to write out a name and address file as a stream of bytes.
The following names and addresses are used as input to the program:

    John Ames                 Alan Mason
    123 Maple                 90 Eastgate
    Stillwater, OK 74075      Ada, OK 74820

When we list the output file on our terminal screen, here is what we see:

    AmesJohn123 MapleStillwaterOK74075MasonAlan90 EastgateAdaOK74820

The program writes the information out to the file precisely as specified: as a stream of bytes containing no added information. But in meeting our specifications, the program creates a kind of "reverse Humpty-Dumpty" problem. Once we put all that information together as a single byte stream, there is no way to get it apart again.

We have lost the integrity of the fundamental organizational units of our input data; these fundamental units are not the individual characters, but meaningful aggregates of characters, such as "John Ames" or "123 Maple."
When we are working with files, we call these fundamental aggregates fields. A field is the smallest logically meaningful unit of information in a file.† A field is a logical notion; it is a conceptual tool. A field does not necessarily exist in any physical sense, yet it is important to the file's structure.

When we write out our name and address information as a stream of undifferentiated bytes, we lose track of the fields that make the information meaningful. We need to organize the file in some way that lets us keep the information divided into fields.
4.1.2 Field Structures

There are many ways of adding structure to files to maintain the identity of fields. Four of the most common methods are
Force the fields into a predictable length.
Begin each field with a length indicator.
Place a delimiter at the end of each field to separate it from the next field.
Use a "keyword = value" expression to identify each field and its contents.
Method 1: Fix the Length of Fields   The fields in our sample file vary in their length. If we force the fields into predictable lengths, then we can pull them back out of the file simply by counting our way to the end of the field. We can define a structure in C or a record in Pascal to hold these fixed-length fields, as shown in Fig. 4.2. Using this kind of fixed-field length structure changes our output so it looks like that shown in Fig. 4.3(a). Simple arithmetic is sufficient to let us recover the data in terms of the original fields.

The one obvious disadvantage of this approach is that adding all the padding required to bring the fields up to a fixed length makes the file much larger. Rather than using 4 bytes to store the last name Ames, we use 10. We can also encounter problems with data that is too long to fit into the allocated amount of space. We could solve this second problem by fixing all the fields at lengths that are large enough to cover all cases, but this would just make the first problem of wasted space in the file even worse.

†Readers should not confuse the terms field and record with the meanings given to them by some programming languages, including Pascal. In Pascal, a record is an aggregate data structure that can contain members of different types, where each member is referred to as a field. As we shall see, there is often a direct correspondence between these definitions of the terms and the fields and records that are used in files. However, the terms field and record as we use them have much more general meanings than they do in Pascal.
In C:

    struct {
        char last[10];
        char first[10];
        char address[15];
        char city[15];
        char state[2];
        char zip[9];
    } set_of_fields;

In Pascal:

    TYPE
        set_of_fields = RECORD
            last    : packed array [1..10] of char;
            first   : packed array [1..10] of char;
            address : packed array [1..15] of char;
            city    : packed array [1..15] of char;
            state   : packed array [1..2]  of char;
            zip     : packed array [1..9]  of char
        END;

FIGURE 4.2 Fixed-length fields.
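To suggest how such a structure is typically used (this sketch is our own illustration, not one of the chapter's listed programs), a whole set_of_fields can be written and read back with C's fwrite( ) and fread( ), and each field recovered purely by its position and length:

    /* fixdemo.c -- writing and re-reading one fixed-length record.
       A sketch only; error handling is minimal.                     */
    #include <stdio.h>

    struct set_of_fields {
        char last[10];
        char first[10];
        char address[15];
        char city[15];
        char state[2];
        char zip[9];
    };

    int main(void)
    {
        struct set_of_fields out = {"Ames", "John", "123 Maple",
                                    "Stillwater", "OK", "74075"};
        struct set_of_fields in;
        FILE *fp = fopen("fixed.dat", "w+b");

        if (fp == NULL) return 1;
        fwrite(&out, sizeof(out), 1, fp);   /* every record occupies sizeof(out) bytes  */

        rewind(fp);
        fread(&in, sizeof(in), 1, fp);      /* counting bytes recovers the fields       */
        printf("last name: %.10s\n", in.last);

        fclose(fp);
        return 0;
    }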
Because of these difficulties, the fixed-field approach to structuring data is often inappropriate for data that inherently contain a large amount of variability in the length of fields, such as names and addresses. But there are kinds of data for which fixed-length fields are highly appropriate. If every field is already fixed in length, or if there is very little variation in field lengths, using a file structure consisting of a continuous stream of bytes organized into fixed-length fields is often a very good solution.

Method 2: Begin Each Field with a Length Indicator   Another way to make it possible to count to the end of a field involves storing the field length just ahead of the field, as illustrated in Fig. 4.3(b). If the fields are not too long (length less than 256 bytes), it is possible to store the length in a single byte at the start of each field.
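A minimal C sketch of this idea (our own illustration, not one of the chapter's programs) writes each field as a one-byte count followed by the characters of the field; reading reverses the process:

    /* lenfield.c -- a one-byte length indicator in front of each field.
       A sketch; it assumes fields are shorter than 256 bytes.            */
    #include <stdio.h>
    #include <string.h>

    void write_field(FILE *fp, const char *field)
    {
        unsigned char len = (unsigned char) strlen(field);
        fwrite(&len, 1, 1, fp);            /* the length comes first */
        fwrite(field, 1, len, fp);         /* then the characters    */
    }

    int read_field(FILE *fp, char *field) /* returns field length, 0 at end of file */
    {
        unsigned char len;
        if (fread(&len, 1, 1, fp) != 1) return 0;
        fread(field, 1, len, fp);
        field[len] = '\0';
        return len;
    }

    int main(void)
    {
        char buf[256];
        FILE *fp = fopen("fields.dat", "w+b");

        if (fp == NULL) return 1;
        write_field(fp, "Ames");
        write_field(fp, "123 Maple");

        rewind(fp);
        while (read_field(fp, buf) > 0)
            printf("field: %s\n", buf);

        fclose(fp);
        return 0;
    }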
Method 3: Separate the Fields with Delimiters   We can also preserve the identity of fields by separating them with delimiters. All we need to do is choose some special character or sequence of characters that will not appear within a field and then insert that delimiter into the file after writing each field.

The choice of a delimiter character can be very important since it must be a character that does not get in the way of processing. In many instances white-space characters (blank, new line, tab) make excellent delimiters because they provide a clean separation between fields when we list them on the console. Also, most programming languages include I/O statements that, by default, assume that fields are separated by white space.

Unfortunately, white space would be a poor choice for our file since blanks often occur as legitimate characters within an address field.
    Ames      John      123 Maple      Stillwater     OK74075377-1808
    Mason     Alan      90 Eastgate    Ada            OK74820
(a) Field lengths fixed. Place blanks in the spaces where the phone number would go.

    Ames|John|123 Maple|Stillwater|OK|74075|377-1808|
    Mason|Alan|90 Eastgate|Ada|OK|74820||
(b) Delimiters are used to indicate the end of a field. Place the delimiter for the "empty" field immediately after the delimiter for the previous field.

    Ames|John|123 Maple|Stillwater|OK|74075|377-1808#Mason|Alan|90 Eastgate|Ada|OK|74820#
(c) Place the field for business phone at the end of the record. If the end-of-record mark is encountered, assume that the field is missing.

    SURNAME=Ames|FIRSTNAME=John|STREET=123 Maple| ... |ZIP=74075|PHONE=377-1808| ...
(d) Use a keyword to identify each field. If the keyword is missing, the corresponding field is assumed to be missing.

FIGURE 4.3 Four methods for organizing fields within records to account for possible missing fields. In the examples, the second record is missing the phone number.
Therefore, instead of white space we use the vertical bar character as our delimiter, so our file appears as in Fig. 4.3(c). Readers should modify the original stream-of-bytes programs, writstrm.c and writstrm.pas (found in the C and Pascal Programs sections at the end of this chapter), changing them so they place a delimiter after each field. We use this delimited field format in the next few sample programs.
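The modification the text asks for amounts to very little code. The hedged sketch below (not the book's writstrm.c itself) writes each field followed by the vertical bar delimiter:

    /* delimfield.c -- a sketch of writing delimited fields. */
    #include <stdio.h>
    #include <string.h>

    #define DELIMITER '|'

    void write_field(FILE *fp, const char *field)
    {
        fwrite(field, 1, strlen(field), fp);  /* the field itself   */
        fputc(DELIMITER, fp);                 /* then the delimiter */
    }

    int main(void)
    {
        FILE *fp = fopen("person.dat", "w");
        if (fp == NULL) return 1;

        write_field(fp, "Ames");
        write_field(fp, "John");
        write_field(fp, "123 Maple");
        write_field(fp, "Stillwater");
        write_field(fp, "OK");
        write_field(fp, "74075");

        fclose(fp);
        return 0;
    }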
Method 4: Use a "Keyword = Value" Expression to Identify Fields   This option, illustrated in Fig. 4.3(d), has an advantage that the others do not: It is the first structure in which a field provides information about itself. Such self-describing structures can be very useful tools for organizing files in many applications. It is easy to tell what fields are contained in a file, even if we don't know ahead of time what fields the file is supposed to contain. It is also a good format for dealing with missing fields. If a field is missing, this format makes it obvious, because the keyword is simply not there.

You may have noticed in Fig. 4.3(d) that this format is used in combination with another format, a delimiter to separate fields. While this may not always be necessary, in this case it is helpful because it shows the division between each value and the keyword for the following field.

Unfortunately, for the address file this format also wastes a lot of space. Fifty percent or more of the file's space could be taken up by the keywords. But there are applications in which this format does not demand so much overhead. We discuss some of these applications in section 4.5.
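To suggest how little machinery the keyword form needs, here is a hedged sketch (again our own, not a chapter listing) that writes keyword = value pairs separated by the same delimiter; the keyword names follow Fig. 4.3(d):

    /* kvfield.c -- a sketch of the "keyword = value" field format. */
    #include <stdio.h>

    #define DELIMITER '|'

    void write_kv(FILE *fp, const char *keyword, const char *value)
    {
        fprintf(fp, "%s=%s%c", keyword, value, DELIMITER);
    }

    int main(void)
    {
        FILE *fp = fopen("person.kv", "w");
        if (fp == NULL) return 1;

        write_kv(fp, "SURNAME",   "Ames");
        write_kv(fp, "FIRSTNAME", "John");
        write_kv(fp, "STREET",    "123 Maple");
        write_kv(fp, "ZIP",       "74075");  /* a missing field is simply omitted */

        fclose(fp);
        return 0;
    }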
4.1.3 Reading a Stream of Fields

Given modified versions of writstrm.c and writstrm.pas that use delimiters to separate fields, we can write a program called readstrm that reads the stream of bytes back in, breaking the stream into fields. It is convenient to conceive of the program on two levels, as shown in the pseudocode description provided in Fig. 4.4. The outer level of the program opens the file and then calls the function readfield( ) until readfield( ) returns a field length of zero, indicating that there are no more fields to read. The readfield( ) function, in turn, works through the file, character by character, collecting characters into a field until the function encounters a delimiter or the end of the file. The function returns a count of the characters that are found in the field.

Implementations of readstrm in both C and Pascal are included with the programs at the end of this chapter.

When this program is run using our delimited-field version of the file containing data for John Ames and Alan Mason, the output looks like this:

    Field  1: Ames
    Field  2: John
    Field  3: 123 Maple
    Field  4: Stillwater
    Field  5: OK
    Field  6: 74075
    Field  7: Mason
    Field  8: Alan
    Field  9: 90 Eastgate
    Field 10: Ada
    Field 11: OK
    Field 12: 74820
Define Constant: DELIMITER = '|'

PROGRAM: readstrm

    get input file name and open as INPUT
    initialize FIELD_COUNT
    FIELD_LENGTH := readfield (INPUT, FIELD_CONTENT)
    while (FIELD_LENGTH > 0)
        increment the FIELD_COUNT
        write FIELD_COUNT and FIELD_CONTENT to the screen
        FIELD_LENGTH := readfield (INPUT, FIELD_CONTENT)
    endwhile
    close INPUT
end PROGRAM

FUNCTION: readfield (INPUT, FIELD_CONTENT)

    initialize I
    initialize CH
    while (not EOF (INPUT) and CH does not equal DELIMITER)
        read a character from INPUT into CH
        increment I
        FIELD_CONTENT [I] := CH
    endwhile
    return (length of field that was read)
end FUNCTION
FIGURE 4.4 Program to read fields from a file and display them on the screen.

Clearly, we now preserve the notion of a field as we store and retrieve these data. But something is still missing. We do not really think of this file as a stream of fields. In fact, the fields need to be grouped into sets. The first six fields are a set associated with someone named John Ames. The next six are a set of fields associated with Alan Mason. We call these sets of fields records.
4.1.4 Record Structures

A record can be defined as a set of fields that belong together when the file is viewed in terms of a higher level of organization. Like the notion of a field, a record is another conceptual tool. It is another level of organization that we impose on the data in order to preserve meaning. Records do not necessarily exist in the file in any physical sense, yet they are an important logical notion included in the file's structure.

Here are some of the most often used methods for organizing a file into records:
Require that the records be a predictable number of bytes in length.
Require that the records be a predictable number of fields in length.
Begin each record with a length indicator consisting of a count of the number of bytes that the record contains.
Use a second file to keep track of the beginning byte address for each record.
Place a delimiter at the end of each record to separate it from the next record.
Method 1: Make Records a Predictable Number of Bytes (Fixed-length Records)   A fixed-length record file is one in which each record contains the same number of bytes. This method of recognizing records is analogous to the first method we discussed for making fields recognizable. As we will see in the chapters that follow, fixed-length record structures are among the most commonly used methods for organizing files.

The C structure set_of_fields (or the Pascal RECORD of the same name) that we define in our discussion of fixed-length fields is actually an example of a fixed-length record as well as an example of fixed-length fields. We have a fixed number of fields, each with a predetermined length, which combine to make a fixed-length record. This kind of field and record structure is illustrated in Fig. 4.5(a).

It is important to realize, however, that fixing the number of bytes in a record does not imply that the sizes or number of fields in the record must be fixed. Fixed-length records are frequently used as containers to hold variable numbers of variable-length fields. It is also possible to mix fixed- and variable-length fields within a record. Figure 4.5(b) illustrates how variable-length fields might be placed in a fixed-length record.
Method 2: Make Records a Predictable Number of Fields   Rather than specifying that each record in a file contain some fixed number of bytes, we can specify that it will contain a fixed number of fields. This is a good way to organize the records in the name and address file we have been looking at. The writstrm program asks for six pieces of information for every person, so there are six contiguous fields in the file for each record (Fig. 4.5c). We could modify readstrm to recognize fields simply by counting the fields modulo six, outputting record boundary information to the screen every time the count starts over.

    Ames      John      123 Maple      Stillwater     OK74075
    Mason     Alan      90 Eastgate    Ada            OK74820
(a)

    Ames|John|123 Maple|Stillwater|OK|74075|    [unused space]
(b)

    Ames|John|123 Maple|Stillwater|OK|74075|Mason|Alan|90 Eastgate|Ada|OK| ...
(c)

FIGURE 4.5 Three ways of making the lengths of records constant and predictable. (a) Counting bytes: fixed-length records with fixed-length fields. (b) Counting bytes: fixed-length records with variable-length fields. (c) Counting fields: six fields per record.
Method 3: Begin Each Record with a Length Indicator   We can communicate the length of records by beginning each record with a field containing an integer that indicates how many bytes there are in the rest of the record (Fig. 4.6a). This is a commonly used method for handling variable-length records. We look at it more closely in the next section.

Method 4: Use an Index to Keep Track of Addresses   We can use an index to keep a byte offset for each record in the original file. The byte offsets allow us to find the beginning of each successive record and also let us compute the length of each record. We look up the position of a record in the index and then seek to the record in the data file. Figure 4.6(b) illustrates this two-file mechanism.

Method 5: Place a Delimiter at the End of Each Record   This option, at a record level, is exactly analogous to the solution we used to keep the fields distinct in the sample program we developed. As with fields, the delimiter character must not get in the way of processing. Because we often want to read files directly at our console, a common choice of a record delimiter for files that contain readable text is the end-of-line character (carriage return/new-line pair or, on UNIX systems, just a new-line character: '\n'). In Fig. 4.6(c) we use a '#' character as the record delimiter.
    40Ames|John|123 Maple|Stillwater|OK|74075|36Mason|Alan|90 Eastgate ...
(a)

    Data file:   Ames|John|123 Maple|Stillwater|OK|74075|Mason|Alan| ...
    Index file:  00   40
(b)

    Ames|John|123 Maple|Stillwater|OK|74075|#Mason|Alan|90 Eastgate|Ada|OK ...
(c)

FIGURE 4.6 Record structures for variable-length records. (a) Beginning each record with a length indicator. (b) Using an index file to keep track of record addresses. (c) Placing the delimiter '#' at the end of each record.

4.1.5 A Record Structure That Uses a Length Indicator

Not one of these approaches to preserving the idea of a record in a file is appropriate for all situations. Selection of a method for record organization depends on the nature of the data and on what you need to do with it. We begin by looking at a record structure that uses a record-length field at the beginning of the record. This approach lets us preserve the variability in the length of records that is inherent in our initial stream file.

Writing the Variable-length Records to the File   We call the program that builds this new, variable-length record structure writrec. The set of programs at the end of this chapter contains versions of this program in C and Pascal. Implementing this program is partially a matter of building on the writstrm program that we created earlier in this chapter, but also involves addressing some new problems:

If we want to put a length indicator at the beginning of every record (before any other fields), we must know the sum of the lengths of the fields in each record before we can begin writing the record to the file. We need to accumulate the entire contents of a record in a buffer before writing it out.

In what form should we write the record-length field to the file? As a binary integer? As a series of ASCII characters?

The concept of buffering is one we run into again and again as we work with files. In the case of writrec, the buffer can simply be a character array into which we place the fields and field delimiters as we collect them. Resetting the buffer length to zero and adding information to the buffer can be handled using the loop logic provided in Fig. 4.7.
Representing the Record Length
record length
is
a little
length in the form of
a natural
more
4.7.
The question of how to represent
One option would be to write
difficult.
much
two-byte binary integer before each record. This is
it does not require us to go to the trouble of
bigger numbers with an integer than
number of ASCII
by.tes
(e.g.,
32,767 versus 99).
FIGURE 4.7 Main program logic for writrec.
get LAST name as input
while (LAST name has a length > 0)
    set length of string in BUFFER to zero
    concatenate:  BUFFER + LAST name + DELIMITER
    while (input fields exist for record)
        get the FIELD
        concatenate:  BUFFER + FIELD + DELIMITER
    endwhile
    write length of string in BUFFER to the file
    write the string in BUFFER to the file
    get LAST name as input
endwhile
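As a concrete illustration of the logic in Fig. 4.7, here is a minimal C sketch that prefixes each record with a two-byte binary length. The field list, the '|' delimiter, and the helper name are illustrative assumptions, not the chapter's actual writrec code.

/* Sketch only: writes one record as [2-byte length][delimited fields].
   The '|' delimiter is an assumption for illustration. */
#include <stdio.h>
#include <string.h>

#define DELIM '|'

void write_rec(FILE *fp, const char *fields[], int nfields)
{
    char  buffer[512];
    short len = 0;                      /* record length, excluding the prefix */
    int   i;

    buffer[0] = '\0';
    for (i = 0; i < nfields; i++) {     /* collect fields and delimiters */
        strcat(buffer, fields[i]);
        len = (short) strlen(buffer);
        buffer[len++] = DELIM;
        buffer[len]   = '\0';
    }
    fwrite(&len, sizeof(short), 1, fp); /* two-byte binary length field */
    fwrite(buffer, 1, len, fp);         /* then the record contents     */
}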
Although we could use this same solution for a Pascal implementation, we might choose, instead, to account for some important differences between C and Pascal:
Unlike C, Pascal automatically converts binary integers into character representations of those integers if we are writing to a text file. Consequently, it is no trouble at all to convert the record length into a character form: it happens automatically.

In Pascal, a file is defined as a sequence of elements of a single type. Since we have a file of variable-length strings of characters, the natural type for the file is that of a character.
In short, the easiest thing to do in C is to store the integers in the file as fixed-length, two-byte fields containing integers. In Pascal it is easier to make use of the automatic conversion of integers into characters for text files. File structure design is always an exercise in flexibility. Neither of these approaches is the single correct one; good design consists of choosing the approach that is most appropriate for a given language and computing environment. In the programs included at the end of this chapter, we have implemented our record structure both ways, using integer length fields in C and character representations in Pascal.

The output from the Pascal implementation is shown in Fig. 4.8. Each record now has a record-length field preceding the data fields. This field is delimited by a blank. For example, the first record (for John Ames) contains 40 characters, counting from the first 'A' in "Ames" to the final delimiter after "74075," so the characters '4' and '0' are placed before the record, followed by a blank.

Since the C version of writrec uses binary integers for the record length, we cannot simply print it to a console screen. We need a way to interpret the noncharacter portion of the file. For this, we introduce in the next section the file dump, a valuable tool for viewing the contents of files. But first, let's look at a program that will read in any file that is written by writrec.
FIGURE 4.8 Records preceded by record-length fields in character form.

40 Ames|John|123 Maple|Stillwater|OK|74075|36 Mason|Alan|90 Eastgate|Ada|OK|74820|
PROGRAM:  readrec
    open input file as INP_FILE
    initialize SCAN_POS to 0
    RECORD_LENGTH := get_rec(INP_FILE, BUFFER)
    while (RECORD_LENGTH > 0)
        SCAN_POS := get_fld(FIELD, BUFFER, SCAN_POS, RECORD_LENGTH)
        while (SCAN_POS > 0)
            print FIELD on the SCREEN
            SCAN_POS := get_fld(FIELD, BUFFER, SCAN_POS, RECORD_LENGTH)
        endwhile
        RECORD_LENGTH := get_rec(INP_FILE, BUFFER)
    endwhile
end PROGRAM

FUNCTION:  get_rec(INP_FILE, BUFFER)
    if EOF(INP_FILE) then return 0
    read the RECORD_LENGTH
    read the record contents into the BUFFER
    return the RECORD_LENGTH
end FUNCTION

FUNCTION:  get_fld(FIELD, BUFFER, SCAN_POS, RECORD_LENGTH)
    if SCAN_POS == RECORD_LENGTH then return 0
    get a character CH at the SCAN_POS in the BUFFER
    while (SCAN_POS < RECORD_LENGTH and CH is not a DELIMITER)
        place CH into the FIELD
        increment the SCAN_POS
        get a character CH at the SCAN_POS in the BUFFER
    endwhile
    return the SCAN_POS
end FUNCTION

FIGURE 4.9 Main program logic for readrec, along with functions get_rec( ) and get_fld( ).
Reading the Variable-length Records from the File    Given our file structure of variable-length records preceded by record-length fields, it is easy to write a program that reads through the file, record by record, displaying the fields from each of the records on the screen. The program logic is shown in Fig. 4.9. The main program calls the function get_rec( ) that reads records into a buffer; this call continues until get_rec( ) returns a value of 0. Once get_rec( ) places a record's contents into a buffer, the buffer is passed to the function get_fld( ). The call to get_fld( ) includes the scanning position (SCAN_POS) in the argument list. Starting at the SCAN_POS, get_fld( ) reads characters from the buffer into a field until either a delimiter or the end of the record is reached. Function get_fld( ) returns the SCAN_POS for use on the next call. Implementations of writrec and readrec in both C and Pascal are included along with the other programs at the end of this chapter.
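For readers who want to see the Fig. 4.9 logic in code, here is a minimal C sketch of get_rec( ) and get_fld( ), assuming the two-byte binary length prefix and the '|' field delimiter used in the writrec sketch above; the actual programs at the end of the chapter differ in detail.

/* Sketch only: reads records written as [2-byte length][delimited fields]. */
#include <stdio.h>

#define DELIM '|'

int get_rec(FILE *fp, char *buffer)          /* returns record length, 0 at EOF */
{
    short len;
    if (fread(&len, sizeof(short), 1, fp) != 1)
        return 0;
    fread(buffer, 1, len, fp);
    return len;
}

int get_fld(char *field, const char *buffer, int scan_pos, int rec_len)
{
    int i = 0;
    if (scan_pos >= rec_len)
        return 0;                            /* no more fields in this record */
    while (scan_pos < rec_len && buffer[scan_pos] != DELIM)
        field[i++] = buffer[scan_pos++];
    field[i] = '\0';
    return scan_pos + 1;                     /* position just past the delimiter */
}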
4.1.6 Mixing Numbers and Characters: Use of a File Dump

File dumps give us the ability to look inside a file at the actual bytes that are stored there. Consider, for instance, the record-length information in the Pascal program output that we were examining a moment ago. The length of the Ames record, which is the first one in the file, is 40 characters, including delimiters. In the Pascal version of writrec, where we store the ASCII character representation of this decimal number, the actual bytes stored in the file look like the representation in Fig. 4.10(a). In the C implementation, where we choose to represent the length field as a two-byte integer, the bytes look like the representation in Fig. 4.10(b).
As you can see, the number 40 is not the same as the set of characters '4' and '0'. The hex value of the binary integer 40 is 0x28; the hex values of the characters '4' and '0' are 0x34 and 0x30. (We are using the C language convention of identifying hexadecimal numbers through the use of the prefix 0x.) So, when we are storing a number in ASCII form, it is the hex values of the ASCII characters that go into the file, not the hex value of the number itself.
FIGURE 4.10 The number 40, stored as ASCII characters and as a short integer.
                                      Decimal value   Hex value stored   ASCII
                                      of number       in bytes           character form
(a) 40 stored as ASCII chars:              40            34  30            '4'  '0'
(b) 40 stored as a 2-byte integer:         40            00  28            '\0' '('
Figure 4.10(b) shows the byte representation of the number 40 stored as an integer (this is called storing the number in binary form, even though we usually view the output as a hexadecimal number). Now the hexadecimal value stored in the file is that of the number itself. The ASCII characters that happen to be associated with the number's actual hexadecimal value have no obvious relationship to the number. Here is what the version of the file that uses binary integers for record lengths looks like if we simply print it on a terminal screen:
 (Ames|John|123 Maple|Stillwater|OK|74075| $Mason|Alan|90 Eastgate|Ada|OK|74820|

Each record-length field prints as a blank followed by a character: the blank is the '\0' byte, which is unprintable, and the '(' and '$' are the bytes 0x28 (the length 40) and 0x24 (the length 36).
The ASCII representations of characters and numbers in the actual record come out nicely enough, but the binary representations of the length fields are displayed cryptically. Let's take a different look at the file, this time using the UNIX dump utility od. Entering the UNIX command

    od -xc <filename>

produces the following:
Offset      Values (ASCII above, hex below)
0000000     \0   (   A   m   e   s   |   J   o   h   n   |   1   2   3
            0028 416d 6573 7c4a 6f68 6e7c 3132 3320
0000020      M   a   p   l   e   |   S   t   i   l   l   w   a   t   e   r
            4d61 706c 657c 5374 696c 6c77 6174 6572
0000040      |   O   K   |   7   4   0   7   5   |  \0   $   M   a   s   o
            7c4f 4b7c 3734 3037 357c 0024 4d61 736f
0000060      n   |   A   l   a   n   |   9   0       E   a   s   t   g   a
            6e7c 416c 616e 7c39 3020 4561 7374 6761
0000100      t   e   |   A   d   a   |   O   K   |   7   4   8   2   0   |
            7465 7c41 6461 7c4f 4b7c 3734 3832 307c
As you can see, the display is divided into three different kinds of data. The
column on the left labeled Offset gives the offset of the first byte of the row
that is being displayed. The byte offsets are given in octal form; since each
line contains 16 (decimal) bytes, moving from one line to the next adds 020
to the range. Every pair of lines in the printout contains interpretations of
the bytes in the file in hexadecimal and ASCII. These representations were
requested on the command line with the -xc flag (x = "hex;" c =
"character").
Let's look at the first row of ASCII values. As you would expect, the data placed in the file in ASCII form appears in this row in a readable way. But there are hexadecimal values for which there is no printable ASCII representation. The only such value appearing in this file is 0x00. But there could be many others. For example, the hexadecimal value of the number 500,000,000 is 0x1DCD6500. If you write this value out to a file, an od of the file with the option -xc looks like this:

    0000000   \035 \315   e  \0
              1dcd 6500

The only printable byte in this file is the one with the value 0x65 ('e'). Od handles all of the others by listing their equivalent octal values in the ASCII representation.
The hex dump of this output from this version of writrec shows how the file structure represents an interesting mix of the organizational tools we have encountered. In a single record we have both binary and ASCII data. Each record consists of a fixed-length field (the byte count) and several delimited, variable-length fields. This kind of mixing of different data types and organizational methods is common in real-world file structures.
A Note about Byte Order    If the computer you are using is an IBM PC or a computer from DEC, such as a VAX, your octal dump for this file will probably be different from the one we see here. These machines store the values of numbers in reverse order from the way we think of them. For example, if this dump were executed on an IBM PC, the hex representation of the first two-byte value in the file would be 0x2800, rather than 0x0028. This reverse order also applies to long, four-byte integers on these machines. This is an aspect of files that you need to be aware of if you expect to make sense out of dumps like this one. A more serious consequence of the byte-order differences among machines occurs when we move files from a machine with one type of byte ordering to one with a different byte ordering. We discuss this problem and ways to deal with it in section 4.6, "Portability and Standardization."
4.2 Record Access
4.2.1 Record Keys
Since our new file structure so clearly focuses on the notion of a record as the quantity of information that is being read or written, it makes sense to think in terms of retrieving just one specific record rather than having to read all the way through the file, displaying everything. When looking for an individual record, it is convenient to identify the record with a key based on the record's contents. For example, in our name and address file we might want to access the "Ames record" or the "Mason record" rather than thinking in terms of the "first record" or "second record." (Can you remember which record comes first?) This notion of a key is another fundamental conceptual tool. We need to develop a more exact idea of what a key is.
When we are looking for a record containing the last name Ames, we want to recognize it even if the user enters the key in the form "AMES", "ames", or "Ames". To do this, we must define a standard form for keys, along with associated rules and procedures for converting keys into this standard form. A standard form of this kind is often called a canonical form for the key. One meaning of the word canon is rule, and the word canonical means conforming to the rule. A canonical form for a search key is the single representation for that key that conforms to the rule.

As a simple example, we could state that the canonical form for a key requires that the key consist solely of uppercase letters and have no extra blanks at the end. So, if a user enters "Ames", we would convert the key to the canonical form "AMES" before searching for it.

It is often desirable to have distinct keys, or keys that uniquely identify a single record. If there is not a one-to-one relationship between the key and a single record, then the program has to provide additional mechanisms to allow the user to resolve the confusion that can result when more than one record fits a particular key. Suppose, for example, that we are looking for John Ames's address. If there are several records in the file for several different people named John Ames, how should the program respond? Certainly it should not just give the address of the first John Ames that it finds. Should it give all the addresses at once? Should it provide a way of scrolling through the records?

The simplest solution is to prevent such confusion. The prevention takes place as new records are added to the file. When the user enters a new record, we form a unique canonical key for that record and then search the file for that key. This concern about uniqueness applies only to primary keys. A primary key is, by definition, the key that is used to identify a record uniquely.

It is also possible, as we see later, to search on secondary keys. An example of a secondary key might be the city field in our name and address file. If we wanted to find all the records in the file for people who live in towns named Stillwater, we would use some canonical form of "Stillwater" as a secondary key. Typically, secondary keys do not uniquely identify a record.
Although a person's name might at first seem to be a good choice for a primary key, a person's name runs a high risk of failing the test for uniqueness. A name is a perfectly fine secondary key, and in fact it is often an important secondary key in a retrieval system, but there is too great a likelihood that two names in the same file will be identical.

The reason a name is a risky choice for a primary key is that it contains a real data value. In general, primary keys should be dataless. Even when we think we are choosing a unique key, if it contains data there is a danger that unforeseen identical values could occur. Sweet (1985) cites an example of a file system that used a person's Social Security number as a primary key for personnel records. It turned out that, in the particular population that was represented in the file, a large number of people who were not U.S. citizens were included, and in a different part of the organization all of these people had been assigned the Social Security number 999-99-9999!

Another reason, other than uniqueness, that a primary key should be dataless is that a primary key should be unchanging. If information that corresponds to a certain record changes, and that information is contained in a primary key, what do you do about the primary key? You probably cannot change the primary key itself, in most cases, because there are likely to be reports, memos, indexes, or other sources of information that refer to the record by its primary key. As soon as you change the key, those references become useless.

A good rule of thumb is to avoid trying to put data into primary keys. If we want to access records according to data content, we should assign this content to secondary keys. We give a more detailed look at record access by primary and secondary keys in Chapter 6. For the rest of this chapter, we suspend our concern about whether a key is primary or secondary and concentrate simply on finding things by key.
4.2.2 A Sequential Search
Now that you know about keys, you should be able to write a program that reads through the file, record by record, looking for a record with a particular key. Such sequential searching is just a simple extension of our readrec program, adding a comparison operation to the main loop to see if the key for the record matches the key we are seeking. We leave the actual program as an exercise.
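The comparison added to the main loop might look like the following sketch, which reuses the hypothetical get_rec( ) and get_fld( ) sketches shown earlier (declared here as prototypes) and assumes the canonical key is the uppercased last name.

/* Sketch: sequential search by canonical key. make_canonical() is a
   hypothetical helper that uppercases a key. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

int get_rec(FILE *fp, char *buffer);
int get_fld(char *field, const char *buffer, int scan_pos, int rec_len);

void make_canonical(char *s)
{
    for ( ; *s != '\0'; s++)
        *s = toupper((unsigned char) *s);
}

int seq_search(FILE *fp, const char *search_key, char *buffer)
{
    char key[128];
    int  rec_len;

    while ((rec_len = get_rec(fp, buffer)) > 0) {
        get_fld(key, buffer, 0, rec_len);    /* first field is the last name */
        make_canonical(key);
        if (strcmp(key, search_key) == 0)
            return 1;                        /* found: record is in buffer   */
    }
    return 0;                                /* not found                    */
}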
Evaluating Performance of Sequential Search In the chapters that
follow, we find ways to search for records that are faster than the sequential
search mechanism. We can use sequential searching as a kind of baseline
against which to measure the improvements that we make. It is important,
therefore, to find some way of expressing the amount of time and work
expended in a sequential search.
Developing a performance measure requires that we decide on a unit of work that usefully represents the constraints on the performance of the whole process. When we describe the performance of searches that take place in electronic RAM, where comparison operations are more expensive than fetch operations to bring data in from memory, we usually use the number of comparisons required for the search as the measure of work. But, given that the cost of a comparison in RAM is so small compared to the cost of a disk access, comparisons do not fairly represent the performance constraints for a search through a file on secondary storage. Instead, we count low-level READ( ) calls. We assume that each READ( ) call requires a seek and that any one READ( ) call is as costly as any other. We know from the discussions of matters such as system buffering in Chapter 3 that these assumptions are not strictly accurate. But, in a multiuser environment where many processes are using the disk at once, they are close enough to correct to be useful.
Suppose we have a file with 1,000 records and we want to use a sequential search to find Al Smith's record. How many READ( ) calls are required? If Al Smith's record is the first one in the file, the program has to read in only a single record. If it is the last record in the file, the program makes 1,000 READ( ) calls before concluding the search. For an average search, 500 calls are needed.

If we double the number of records in a file, we also double both the average and the maximum number of READ( ) calls required. Using a sequential search to find Al Smith's record in a file of 2,000 records requires, on the average, 1,000 calls. In other words, the amount of work required for a sequential search is directly proportional to the number of records in the file.

In general, the work required to search sequentially for a record in a file with n records is proportional to n: it takes at most n comparisons; on average it takes approximately n/2 comparisons. A sequential search is said to be of the order O(n) because the time it takes is proportional to n.*

*If you are not familiar with this "big-oh" notation, you should look it up. Knuth (1973a) is a good source.
Improving Sequential Search Performance with Record Blocking    It is interesting and useful to apply some of the information from Chapter 3 about disk performance to the problem of improving sequential search performance. We learned in Chapter 3 that the major cost associated with a disk access is the time required to perform a seek to the right location on the disk. Once data transfer begins, it is relatively fast, although still much slower than a data transfer within RAM. Consequently, the cost of seeking
and reading a record and then seeking and reading another record is greater than the cost of seeking just once and then reading two successive records. (Once again, we are assuming a multiuser environment in which a seek is required for each separate READ( ) call.) It follows that we should be able to improve the performance of sequential searching by reading in a block of several records at once and then processing that block of records in RAM.
We began this chapter with a stream of bytes. We grouped the bytes into fields, and then grouped the fields into records. Now we are considering a yet higher level of organization, grouping records into blocks. This new level of grouping, however, differs from the others. Whereas fields and records are ways of maintaining the logical organization within the file, blocking is done strictly as a performance measure. As such, the block size is usually related more to the physical properties of the disk drive than to the content of the data. For instance, on sector-oriented disks, the block size is almost always some multiple of the sector size.

Suppose we have a file of 4,000 records and that the average length of a record is 512 bytes. If our operating system uses sector-sized buffers of 512 bytes, then an unblocked sequential search requires, on the average, 2,000 READ( ) calls before it can retrieve a particular record. By blocking the records in groups of 16 per block, so each READ( ) call brings in 8 kilobytes worth of records, the number of reads required for an average search comes down to 125. Each READ( ) requires slightly more time, since more data is transferred from the disk, but this is a cost that is usually well worth paying for such a large reduction in the number of reads.

There are several things to note from this analysis and discussion of record blocking:
Although blocking can result in substantial performance improvements, it does not change the order of the sequential search operation. The cost of searching is still O(n), increasing in direct proportion to increases in the size of the file.

Blocking clearly reflects the differences between RAM access speed and the cost of accessing secondary storage.

Blocking does not change the number of comparisons that must be done in RAM, and it probably increases the amount of data transferred between disk and RAM. (We always read a whole block, even if the record we are seeking is the first one in the block.)

Blocking saves time because it decreases the amount of seeking. We find, again and again, that this differential between the cost of seeking and the cost of other operations, such as data transfer or RAM access, is the force that drives file structure design.
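The arithmetic behind these observations is easy to capture in code. The sketch below simply evaluates the average-search counts used above for the chapter's example values (4,000 records, 16 records per block); it is an illustration, not part of the chapter's program set.

/* Sketch: average READ() calls for a sequential search,
   unblocked versus blocked. */
#include <stdio.h>

int main(void)
{
    long n_records = 4000;
    long per_block = 16;
    long n_blocks  = n_records / per_block;                  /* 250 blocks */

    printf("unblocked average reads: %ld\n", n_records / 2); /* 2000 */
    printf("blocked average reads:   %ld\n", n_blocks  / 2); /* 125  */
    return 0;
}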
When Sequential Searching Is Good    Much of the remainder of this text is devoted to identifying better ways to access individual records; sequential searching is just too expensive for most serious retrieval situations. This is unfortunate, because sequential access has two major practical advantages over other types of access: It is extremely easy to program, and it requires the simplest of file structures.

Whether sequential search is advisable depends largely on how the file is to be used, how fast the computer system is that is performing the search, and structural aspects of the file. There are many situations in which a sequential search is often reasonable. Here are some examples:

ASCII files in which you are searching for some pattern (see grep in the next section);
Files with few records (e.g., 10 records);
Files that hardly ever need to be searched (e.g., tape files usually used for other kinds of processing); and
Files in which you want all records with a certain secondary key value, where a large number of matches is expected.
Fortunately, these sorts of applications do occur often in day-to-day computing; so often, in fact, that operating systems provide many utilities for performing sequential processing. UNIX is one of the best examples of this, as we see in the next section.
4.2.3 UNIX Tools for Sequential Processing
Recognizing the importance of having a standard file structure that is simple and easy to program, the most common file structure that occurs in UNIX is an ASCII file with the new-line character as the record delimiter and, when possible, white space as the field delimiter. Practically all files that we create using UNIX editors use this structure. And since most of the built-in C and Pascal functions that perform I/O write to this kind of file, it is common to see data files that consist of fields of numbers or words separated by blanks or tabs, and records separated by new-line characters. Such files are simple and easy to process. We can, for instance, generate an ASCII file with a simple program, and then use an editor to browse through it or alter it. UNIX provides a rich array of tools for working with files in this form. Since this kind of file structure is inherently sequential (records are variable in length, so we have to pass from record to record to find any particular field or record), many of these tools process files sequentially.

Suppose, for instance, that we choose the white-space/new-line structure for our address file, ending every field with a tab and ending every record with a new line. While this causes some problems in distinguishing fields (a blank is white space, but it doesn't separate a field), and in that sense is not an ideal structure, it buys us something very valuable: the full use of those UNIX tools that are built around the white-space/new-line structure. For example, we can print the file on our console using any of a number of utilities, such as cat:
>cat myfile
Ames    John    123 Maple       Stillwater      OK      74075
Mason   Alan    90 Eastgate     Ada             OK      74820
Or we can use tools like wc and grep for processing the files.

wc    The command wc ("word count") reads through an ASCII file sequentially and counts the number of lines (delimited by new lines), words (delimited by white space), and characters in a file:

>wc myfile
      2      14      76
grep    It is common to want to find out if a text file has a certain word or character string in it. For ASCII files that can reasonably be searched sequentially, UNIX provides an excellent filter for doing this called grep (and its variants egrep and fgrep). The word grep stands for "generalized regular expression," which describes the type of pattern that grep is able to recognize. In its simplest form, grep searches sequentially through a file for a pattern, and then returns to standard output (the console) all the lines in the file that contain the pattern.
>grep Ada myfile
Mason   Alan    90 Eastgate     Ada             OK      74820

We can also combine tools to create, on the fly, some very powerful file processing software. For example, to find the number of words in all records containing the word Ada:

>grep Ada myfile | wc
      1       7      36

As we move through the text we will encounter a number of other powerful UNIX commands that sequentially process files with the basic white-space/new-line structure.
4.2.4 Direct Access
The most radical alternative to searching sequentially through a file for a record is a retrieval mechanism known as direct access. We have direct access to a record when we can seek directly to the beginning of the record and read it in. Whereas sequential searching is an O(n) operation, direct access is O(1); no matter how large the file is, we can still get to the record we want with a single seek.
Direct access is predicated on knowing where the beginning of the required record is. Sometimes this information about record location is carried in a separate index file. But, for the moment, we assume that we do not have an index. We assume, instead, that we know the relative record number (RRN) of the record that we want. The idea of an RRN is an important concept that emerges from viewing a file as a collection of records rather than a collection of bytes. If a file is a sequence of records, then the RRN of a record gives its position relative to the beginning of the file. The first record in a file has RRN 0, the next has RRN 1, and so forth.†

†In keeping with the conventions of C and Turbo Pascal, we assume a zero-based count. In some file systems, the count starts at 1 rather than 0.

In our name and address file, we might tie a record to its RRN by assigning membership numbers that are related to the order in which we enter the records in the file. The person with the first record might have a membership number of 1001, the second a number of 1002, and so on. Given a membership number, we can subtract 1001 to get the RRN of the record.
What can we do with this RRN? Not much, given the file structures we have been using so far, which consist of variable-length records. The RRN tells us the relative position of the record we want in the sequence of records, but we still have to read sequentially through the file, counting records as we go, to get to the record we want. An exercise at the end of this chapter explores a method of moving through the file called skip sequential processing, which can improve performance somewhat, but looking for a particular RRN is still an O(n) process.

To support direct access by RRN, we need to work with records of fixed, known length. If the records are all the same length, then we can use a record's RRN to calculate the byte offset of the start of the record relative to the start of the file. For instance, if we are interested in the record with an RRN of 546 and our file has a fixed-length record size of 128 bytes per record, we can calculate the byte offset as follows:

    Byte offset = 546 x 128 = 69,888.

In general, given a fixed-length record file where the record size is r, the byte offset of a record with an RRN of n is

    Byte offset = n x r.

Programming languages and operating systems differ with regard to where this byte offset calculation is done and even with regard to whether byte offsets are used for addressing within files. In UNIX (and the MS-DOS operating systems), where a file is treated as just a sequence of bytes, the application program does the calculation and uses the lseek( ) command to jump to the byte that begins the record. All movement within a file is in terms of bytes. This is a very low-level view of files; the responsibility for translating an RRN into a byte offset belongs wholly to the application program.
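In C on a UNIX-like system, the RRN-to-offset translation and the jump can be written directly with lseek( ); the sketch below assumes a fixed record length of 128 bytes and is illustrative rather than taken from the chapter's program set.

/* Sketch: direct access by relative record number (RRN),
   assuming fixed-length records of REC_SIZE bytes. */
#include <unistd.h>

#define REC_SIZE 128

long read_by_rrn(int fd, long rrn, char *rec)
{
    long byte_offset = rrn * REC_SIZE;           /* offset = RRN x record size */

    if (lseek(fd, byte_offset, SEEK_SET) == -1)  /* jump to start of record */
        return -1;
    return read(fd, rec, REC_SIZE);              /* read one whole record   */
}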
The PL/I language and the operating environments in which PL/I is often used (OS/MVS, VMS) are examples of a much different, higher-level view of files. The notion of a sequence of bytes is simply not present when you are working with record-oriented files in this environment. Instead, files are viewed as collections of records that are accessed by keys. The operating system takes care of the translation between a key and a record's location. In the simplest case, the key is, in fact, just the record's RRN, but the determination of actual location within the file is still not the programmer's concern.
If we limit ourselves to the use of standard Pascal, the question of seeking by bytes or seeking by records is not an issue: There is no seeking at all in standard Pascal. But, as we said earlier, many implementations of Pascal extend the standard definition of the language to allow direct access to different locations in a file. The nature of these extensions varies according to the differences in the host operating systems around which the extensions were developed. All the same, one feature that is consistent across implementations is that a file in Pascal always consists of elements of a single type. A file is a sequence of integers, characters, arrays, or records, and so on. Addressing is always in terms of this fundamental element size. For example, we might have a file of datarec, where datarec is defined as
TYPE datarec = packed array [0..64] of char;
Seeking within this file is in terms of multiples of the elementary unit datarec, which is to say in multiples of a 65-byte entity. If I ask to jump to datarec number 3 (zero-based count), I am jumping 195 bytes (3 x 65 = 195) into the file.

4.3 More about Record Structures
4.3.1 Choosing a Record Structure and Record Length
Once we decide to fix the length of our records so we can use the RRN to give us direct access to a record, we have to decide on a record length. Clearly, this decision is related to the size of the fields we want to store in the record.
Sometimes the decision is easy. Suppose we are building a file of sales transactions that contain the following information about each transaction:

A six-digit account number of the purchaser;
Six digits for the date field;
A five-character stock number for the item purchased;
A three-digit field for quantity; and
A 10-position field for total cost.
These are all fixed-length fields; the sum of the field lengths is 30 bytes.
Normally, we would simply stick with this record size, but if performance
is so important that we need to squeeze every bit of speed out of our
retrieval system, we might try to fit the record size to the block
organization of our disk. For instance, if we intend to store the records on
a typical sectored disk (see Chapter 3) with a sector size of 512 bytes or some
other power of 2, we might decide to pad the record out to 32 bytes so we
can place an integral number of records in a sector. That way, records will
never span sectors.
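As a sketch of this padding idea, the transaction record could be declared as a C structure whose size is padded to 32 bytes; the field names below are illustrative assumptions.

/* Sketch: a fixed-length sales transaction record padded to 32 bytes
   so that an integral number of records (16) fits in a 512-byte sector. */
struct transaction {
    char account[6];    /* six-digit account number of purchaser */
    char date[6];       /* six digits for the date               */
    char stock_no[5];   /* five-character stock number           */
    char quantity[3];   /* three-digit quantity                  */
    char cost[10];      /* ten-position total cost               */
    char pad[2];        /* padding: 30 data bytes + 2 = 32 bytes */
};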
The choice of a record length is more complicated when the lengths of
the fields can vary, as in our name and address file. If we choose a record
length that is the sum of our estimates of the largest possible values for all
the fields, we can be reasonably sure that we have enough space for
everything, but we also waste a lot of space. If, on the other hand, we are
conservative in our use of space and fix the lengths of fields at smaller
values, we may have to leave information out of a field. Fortunately, we can
avoid this problem to some degree through appropriate design of the field
structure within a record.
In our earlier discussion of record structures, we saw that there are two general approaches we can take toward organizing fields within a fixed-length record. The first, illustrated in Fig. 4.11(a), uses fixed-length fields inside the fixed-length record. This is the approach we took for the sales transaction file previously described. The second approach, illustrated in Fig. 4.11(b), uses the fixed-length record as a kind of standard-sized container for holding something that looks like a variable-length record.

The first approach has the virtue of simplicity: It is very easy to "break out" the fixed-length fields from within a fixed-length record. The second approach lets us take advantage of an averaging-out effect that usually occurs: The longest names are not likely to appear in the same record as the longest address field. By letting the field boundaries vary, we can make more efficient use of a fixed amount of space. Also, note that the two approaches are not mutually exclusive. Given a record that contains a number of truly fixed-length fields and some fields that have variable-length information, we might design a record structure that combines these two approaches.
(a)  Ames     John    123 Maple     Stillwater    OK   74075
     Mason    Alan    90 Eastgate   Ada           OK   74820

(b)  Ames|John|123 Maple|Stillwater|OK|74075|        <-- unused space
     Mason|Alan|90 Eastgate|Ada|OK|74820|            <-- unused space

FIGURE 4.11 Two fundamental approaches to field structure within a fixed-length record. (a) Fixed-length records with fixed-length fields. (b) Fixed-length records with variable-length fields.

The programs update.c and update.pas, which are included in the set of programs at the end of this chapter, use direct access to allow a user to retrieve a record, change it, and then write it back. These programs create a file structure that uses variable-length fields within fixed-length records. Given the variability in the length of the fields in our name and address file, this kind of structure is an appropriate choice.

One of the interesting questions that must be resolved in the design of this kind of structure is that of distinguishing the real-data portion of the record from the unused-space portion. The range of possible solutions parallels that of the solutions for recognizing variable-length records in any other context: We can place a record-length count at the beginning of the record, we can use a special delimiter at the end of the record, we can count fields, and so on. Because both update.c and update.pas use a character string buffer to collect the fields, and because we are handling character strings differently in C than in Pascal (strings are null-terminated in C; we keep a byte count of the string length at the beginning of the Pascal strings), it is convenient to use a slightly different file structure for the two implementations. In the C version we fill out the unused portion of the record with null characters. In the Pascal version we actually place a fixed-length field (an integer) at the start of the record to tell how many of the bytes in the record are valid. As usual, there is no single right way to implement this file structure; instead we seek the solution that is most appropriate for our needs and situation.
Figure 4.12 shows the hex dump output from each of these programs. The output introduces a number of other ideas, such as the use of header records, which we discuss in the next section. For now, however, just look at the structure of the data records. We have italicized the length fields at the start of the records in the output from the Pascal program. Although we filled out the records created by the Pascal program with blanks to make the output more readable, this blank fill is unnecessary. The length field at the start of the record guarantees that we do not read past the end of the data in the record.
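To make the C variant concrete, here is a minimal sketch of packing delimited, variable-length fields into a fixed-length record whose unused portion is filled with null characters, in the spirit of the update.c structure described above; the 64-byte size and the helper name are assumptions.

/* Sketch: build a 64-byte fixed-length record from variable-length
   fields, padding the unused portion with '\0' bytes. */
#include <string.h>

#define REC_SIZE 64
#define DELIM    '|'

void pack_record(char rec[REC_SIZE], const char *fields[], int nfields)
{
    int pos = 0, i, j;

    memset(rec, '\0', REC_SIZE);              /* null-fill the whole record */
    for (i = 0; i < nfields; i++) {
        for (j = 0; fields[i][j] != '\0' && pos < REC_SIZE - 1; j++)
            rec[pos++] = fields[i][j];        /* copy the field             */
        if (pos < REC_SIZE)
            rec[pos++] = DELIM;               /* then its delimiter         */
    }
}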
4.3.2 Header Records
It is often necessary or useful to keep track of some general information about a file to assist in future use of the file. A header record is often placed at the beginning of the file to hold this kind of information. For example, in some versions of Pascal there is no easy way to jump to the end of a file, even though the implementation supports direct access. One simple solution to this problem is to keep a count of the number of records in the file and to store that count somewhere. We might also find it useful to include information such as the length of the data records, the date and time of the file's most recent update, and so on. Header records can help make a file a self-describing object, freeing the software that accesses the file from having to know a priori everything about its structure, and hence making the file-access software able to deal with more variation in file structures.

The header record usually has a different structure than the data records in the file. The output from update.c, for instance, uses a 32-byte header record, whereas the data records each contain 64 bytes. Furthermore, the data records created by update.c contain only character data, whereas the header record contains an integer that tells how many data records are in the file.
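A sketch of that kind of header in C might look like the following; the exact layout of the chapter's update.c header is not shown here, so treat the struct below as an illustrative assumption.

/* Sketch: a 32-byte header record holding a count of data records,
   written at the front of the file before any 64-byte data records. */
#include <stdio.h>
#include <string.h>

#define HEADER_SIZE 32

struct header {
    int  rec_count;                   /* number of data records in the file */
    char pad[HEADER_SIZE - sizeof(int)];
};

void write_header(FILE *fp, int rec_count)
{
    struct header h;

    memset(&h, 0, sizeof(h));
    h.rec_count = rec_count;
    fseek(fp, 0L, SEEK_SET);          /* header lives at the start of the file */
    fwrite(&h, sizeof(h), 1, fp);
}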
Implementing a header record presents more of a challenge for the Pascal programmer. Recall that the Standard Pascal view of a file is one of a repeated collection of components, all of which are the same component type. Since a header record is fundamentally a different kind of record than the other records in a file, Pascal does not naturally support header records. In some cases, Pascal lets us get around this problem by using variant records. A variant record in Pascal is one that can have different meanings, depending on context. Unfortunately, a variant record cannot vary in size, so its use as a header record is constrained by the fact that it must be the same size as all other records in the file. When faced with a language like Standard Pascal that strictly proscribes the types of records we can use in a file, we often find ourselves resorting
FIGURE 4.12 Hex dump output from the C (update.c) and Pascal (update.pas) versions of the update program.
to tricks. We use such a trick in update.pas: We just use the initial integer field in the record for a different purpose in the header record. In the data records this field holds a count of the bytes of valid data within the record; in the header record it holds a count of the data records in the file.

Header records are a widely used, important file design tool. For example, when we reach the point where we are discussing the construction of tree-structured indexes for files, we see that header records are often placed at the beginning of the index to keep track of matters such as the RRN of the record that is the root of the index. We investigate some more elaborate uses of header records later in this chapter and also in subsequent chapters.
4.4 File Access and File Organization

In the course of our discussions in this chapter, we have looked at
Variable-length records;
Fixed-length records;
Sequential access; and
Direct access.
The first two of these relate to aspects of file organization. The second pair has to do with file access. The interaction between file organization and file access is a useful one; we need to look at it more closely before continuing this chapter.

Most of what we have considered so far falls into the category of file organization:
Can the file be divided into fields?
Is there a higher level of organization to the file that combines the fields into records?
Do all the records have the same number of bytes or fields?
How do we distinguish one record from another?
How do we organize the internal structure of a fixed-length record so we can distinguish between data and extra space?
We have seen that there are many possible answers to these questions and that the choice of a particular file organization depends on many things, including the file-handling facilities of the language you are using and the use you want to make of the file.

Using a file implies access. We looked first at sequential access, ultimately developing a sequential search. So long as we did not know where individual records began, sequential access was the only option open to us.
When we wanted direct access, we fixed the length of our records, and this allowed us to calculate precisely where each record began and to seek directly to it.
In other words, our desire for direct access caused us to choose a fixed-length record file organization. Does this mean that we can equate fixed-length records with direct access? Definitely not. There is nothing about our having fixed the length of the records in a file that precludes sequential access; we certainly could write a program that reads sequentially through a fixed-length record file.

Not only can we elect to read through the fixed-length records sequentially, but we can also provide direct access to variable-length records simply by keeping a list of the byte offsets from the start of the file for the placement of each record. We chose a fixed-length record structure in update.c and update.pas because it is simple and adequate for the data that we want to store. Although the lengths of our names and addresses vary, the variation is not so great that we cannot accommodate it in a fixed-length record.
Consider, however, the effects of using a fixed-length record organization to provide direct access to records that are documents ranging in length from a few hundred bytes to over a hundred kilobytes. Fixed-length records would be disastrously wasteful of space, so some form of variable-length record structure would have to be found. Developing file structures to handle such situations requires that you clearly distinguish between the matter of access and your options regarding organization.

The restrictions imposed by the language and file system used to develop your applications do impose limits on your ability to take advantage of this distinction between access method and organization. For example, the C language provides the programmer with the ability to implement direct access to variable-length records, since it allows access to any byte in the file. On the other hand, Pascal, even when seeking is supported, imposes limitations related to Pascal's definition of a file as a collection of elements that are all of the same type and, consequently, size. Since the elements must all be of the same size, direct access to variable-length records is difficult, at best, in Pascal.

4.5 Beyond Record Structures
Now that we have a grip on the concepts of organization and access, we look at some interesting new file organizations and more complex ways of accessing files. We want to extend the notion of a file beyond the simple idea of records and fields.
We begin
with the idea of abstract data models. Our purpose here is to
put some distance between the physical and the logical organization of files,
to allow us to focus more on the information content of files and less on
physical format.
4.5.1 Abstract Data Models
The history of file structures and file processing parallels the history of computer hardware and software. When file processing first became common on computers, magnetic tape and punched cards were the primary means used to store files, RAM space was dear, and programming languages were primitive. Programmers as well as users were compelled to view file data exactly as it might appear on a tape or cards: as a sequence of fields and records. Even after data was loaded into RAM, the tools for manipulating and viewing the data were unsophisticated and reflected the magnetic tape metaphor. Data processing meant processing fields and records in the traditional sense.
Gradually, computer users began to recognize that computers could process more than just fields and records. Computers could, for instance, process and transmit sound, and they could process and display images and documents (Fig. 4.13). These kinds of applications deal with information that does not nicely fit the metaphor of data stored as sequences of records that are divided into fields, even if, ultimately, the data might be stored physically in the form of fields and records. It is easier, in the mind's eye, to envision data objects such as documents, images, and sound as objects that we manipulate in ways that are specific to the objects themselves, rather than simply as fields and records on a disk.
FIGURE 4.13 Data such as sound, images, and documents do not fit the traditional metaphor of data stored as sequences of records that are divided into fields.
The notion that we need not view data only as it appears on a particular medium is captured in the phrase abstract data model, a term that encourages an application-oriented view of data, rather than a medium-oriented view. The organization and access methods of abstract data models are described in terms of how an application views the data, rather than how the data might physically be stored.

One way that we save a user from having to know about objects in a file is to keep information in the file that file-access software can use to "understand" those objects. A good example of how this might be done is to put file structure information in a header.
4.5.2 Headers and Self-Describing Files

We have seen how a header record can be used to keep track of how many records there are in a file. If our programming language permits it, we can put much more elaborate information about a file's structure in the header. When a file's header contains this sort of information, we say the file is self-describing. Suppose, for instance, that we store in a file the following information:

A name for each field;
The width of each field; and
The number of fields per record.

We can now write a program that can read and print a meaningful display of files with any number of fields per record and any variety of fixed-length field widths. In general, the more file structure information we put into a file's header, the less our software needs to know about the specific structure of an individual file. As usual, there is a trade-off: If we do not hard-code the field and record structures of files in the programs that read and write them, the programs themselves must be more sophisticated. They must be flexible enough to interpret the self-descriptions that they find in the file headers.
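As a sketch of this idea, a self-describing header might record the number of fields and, for each field, a name and a width. The layout below is an illustrative assumption, not a format defined in the text.

/* Sketch: a self-describing header giving the field layout of a file
   of fixed-length records. */
#define MAX_FIELDS 20
#define NAME_LEN   16

struct field_desc {
    char name[NAME_LEN];   /* a name for the field            */
    int  width;            /* the width of the field in bytes */
};

struct file_header {
    int               field_count;        /* number of fields per record */
    struct field_desc fields[MAX_FIELDS]; /* one descriptor per field    */
};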
4.5.3 Metadata
Suppose you are an astronomer interested in studying images generated by telescopes that scan the sky, and you want to design a file structure for the digital representations of these images (Fig. 4.14). You expect to have many images, perhaps thousands, that you want to study, and you want to store one image per file. While you are primarily interested in studying the images themselves, you will certainly need information about each image: where in the sky the image is from, when it was made, what telescope was used, references to related images, and so forth.

This kind of information is called metadata: data that describes the primary data in a file. Metadata can be incorporated into any file whose
FIGURE 4.14 To make sense of this two-Mbyte image, an astronomer needs such metadata as the kind of image it is, the part of the sky it is from, and the telescope that was used to view it. Astronomical metadata is often stored in the same file as the data itself. (This image shows polarized radio emission from the southern spiral galaxy NGC 5236 [M83] as observed with the Very Large Array radio telescope in New Mexico.)
primary data requires supporting information. If a file is going to be shared by many users, some of whom might not otherwise have easy access to its metadata, it may be most convenient to store the metadata in the file itself. A common place to store metadata in a file is the header record.

Typically, a community of users of a particular kind of data agrees on a standard format for holding metadata. For example, a standard format called FITS (Flexible Image Transport System) has been developed by the International Astronomers' Union for storing the kind of astronomical data just described in a file's header.† A FITS header is a collection of 2,880-byte blocks of 80-byte ASCII records, in which each record contains a single piece of metadata. Figure 4.15 shows part of a FITS header. In a FITS file, the header is followed by the actual numbers that describe the image, one number per observed point of the image.
Note that the designers of the FITS format chose to use ASCII in the header, but binary values for the image. ASCII headers are easy to read and process and, since they occur only once, take up relatively little extra space. Since the numbers that make up a FITS image are rarely read by humans, but rather are first processed into a picture and then displayed, binary format is the preferred choice for them.

†For more details on FITS, see the references listed at the end of this chapter in "Further Readings."
SIMPLE  =                    T / CONFORMS TO BASIC FORMAT
BITPIX  =                   16 / BITS PER PIXEL
NAXIS   =                    2 / NUMBER OF AXES
NAXIS1  =                  256 / RA AXIS DIMENSION
NAXIS2  =                  256 / DEC AXIS DIMENSION
EXTEND  =                    F / T MEANS STANDARD EXTENSIONS EXIST
BSCALE  =          0.000100000 / TRUE = TAPE*BSCALE +BZERO
BZERO   =          0.000000000 / OFFSET TO TRUE PIXEL VALUES
MAP_TYPE= 'REL_EXPOSURE'       / INTENSITY OR RELATIVE EXPOSURE MAP
BUNIT   = '        '           / DIMENSIONLESS PEAK EXPOSURE FRACTION
CRVAL1  =                0.625 / RA  REF POINT VALUE (DEGREES)
CRPIX1  =              128.500 / RA  REF POINT PIXEL LOCATION
CDELT1  =         -0.006666700 / RA  INCREMENT ALONG AXIS (DEGREES)
CTYPE1  = 'RA---TAN'           / RA  TYPE
CROTA1  =                0.000 / RA  ROTATION
CRVAL2  =               71.967 / DEC REF POINT VALUE (DEGREES)
CRPIX2  =              128.500 / DEC REF POINT PIXEL LOCATION
CDELT2  =          0.006666700 / DEC INCREMENT ALONG AXIS (DEGREES)
CTYPE2  = 'DEC--TAN'           / DEC TYPE
CROTA2  =                0.000 / DEC ROTATION
EPOCH   =               1950.0 / EPOCH OF COORDINATE SYSTEM
ARR_TYPE=                    4 / 3 = FP, 4 = I
DATAMAX =                1.000 / PEAK INTENSITY (TRUE)
DATAMIN =                0.000 / MINIMUM INTENSITY (TRUE)
ROLL_ANG=              -22.450 / ROLL ANGLE (DEGREES)
BAD_ASP =                    0 / 0=good, 1=bad(Do not use roll angle)
TIME_LIV=               5649.6 / LIVE TIME (SECONDS)
OBJECT  = 'REM6791 '           / SEQUENCE NUMBER
AVGOFFY =                1.899 / AVG Y OFFSET IN PIXELS, 8 ARCSEC/PIXEL
AVGOFFZ =                2.578 / AVG Z OFFSET IN PIXELS, 8 ARCSEC/PIXEL
RMSOFFY =                0.083 / ASPECT SOLN RMS Y PIXELS, 8 ARCSC/PIX
RMSOFFZ =                0.204 / ASPECT SOLN RMS Z PIXELS, 8 ARCSC/PIX
TELESCOP= 'EINSTEIN'           / TELESCOPE
INSTRUME= 'IPC '               / FOCAL PLANE DETECTOR
OBSERVER= 2                    / OBSERVER #: 0=CFA; 1=CAL; 2=MIT; 3=GSFC
GALL    =              119.370 / GALACTIC LONGITUDE OF FIELD CENTER
GALB    =                9.690 / GALACTIC LATITUDE OF FIELD CENTER
DATE_OBS= '80/238 '            / YEAR & DAY NUMBER FOR OBSERVATION START
DATE_STP= '80/238 '            / YEAR & DAY NUMBER FOR OBSERVATION STOP
TITLE   = 'SNR SURVEY: CTA1'
ORIGIN  = 'HARVARD-SMITHSONIAN CENTER FOR ASTROPHYSICS'
DATE    = '22/09/1989'         / DATE FILE WRITTEN
TIME    = '05:26:53'           / TIME FILE WRITTEN
END
FIGURE 4.15 Sample FITS header. On each line, the data to the left of the '/' is the actual metadata (data about the raw data that follows in the file). For example, the second line ("BITPIX = 16") indicates that the raw data in the file will be stored in 16-bit integer format. Everything to the right of a '/' is a comment, describing for the reader the meaning of the metadata that precedes it. Even a person uninformed about the FITS format can learn a great deal about this file just by reading through the header.
A FITS image is a good example of an abstract data model. The data itself is meaningless without the interpretive information contained in the header, and FITS-specific methods must be employed to convert FITS data into an understandable image. Another example is the raster image, which we look at next.
4.5.4 Color Raster Images
From a user's point of view, a modern computer is as much a graphical device as it is a data processor. Whether we are working with documents, spreadsheets, or numbers, we are likely to be viewing and storing pictures in addition to whatever other information we work with. Let's examine one type of image, the color raster image, as a means to filling in our conceptual understanding of data objects.

A color raster image is a rectangular array of colored dots, or pixels,† that are displayed on a screen. A FITS image is a raster image in the sense that the numbers that make up a FITS image can be converted to colors and then displayed on a screen. There are many different kinds of metadata that can go with a raster image, including

The dimensions of the image: the number of pixels per row and the number of rows.
The number of bits used to describe each pixel. This determines how many colors can be associated with each pixel. A 1-bit image can display only two colors, usually black and white. A 2-bit image can display four colors (2^2), an 8-bit image can display 256 colors (2^8), and so forth.
A color lookup table, or palette, indicating which color is to be assigned to each pixel value in the image. A 2-bit image uses a color lookup table with 4 colors, an 8-bit image uses a table with 256 colors, and so forth.
we think of an image as an abstract data
we might associate with images? There
If
that
with getting things in and out of
store_\mage routine.
Then
type,
what
are
some methods
are the usual ones associated
computer:
read_image routine and
there are those that deal with images as special
objects; for example,
Display an image in
window on
console screen;
color lookup table;
Overlay one image onto another to produce a composite image; and
Display several images in succession, producing an animation.
Associate an image with
'Pixel stands for "picture
element."
a particular
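To make these ideas concrete, here is a minimal sketch, in C, of how the metadata and methods for an 8-bit raster image might be declared. This is our own illustration, not code from the programs described in this text, and all of the type, field, and function names are hypothetical.

/* raster.h -- a sketch of an 8-bit color raster image as an
   abstract data type.  The struct holds the kinds of metadata
   described above; the functions are the kinds of methods an
   application might associate with images.                    */

typedef struct {
    unsigned char red, green, blue;     /* one color table entry      */
} color_entry;

typedef struct {
    int           width;                /* pixels per row             */
    int           height;               /* number of rows             */
    int           bits_per_pixel;       /* 8 bits -> 256 colors       */
    color_entry   palette[256];         /* color lookup table         */
    unsigned char *pixels;              /* width * height pixel values */
} raster_image;

/* the usual input/output methods */
int  read_image  (char *filename, raster_image *image);
int  store_image (char *filename, raster_image *image);

/* methods that treat images as special objects */
void display_image (raster_image *image);                    /* in a window */
void set_palette   (raster_image *image, color_entry *table);
void overlay_image (raster_image *under, raster_image *over);
void animate       (raster_image images[], int count);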
The color raster image is an example of a type of data object that requires more than the traditional field/record file structure. This is particularly true when more than one image might be stored in a single file, or when we want to store a document or other complex object together with images in a file. Let's look at some ways to mix object types in one file.
4.5.5 Mixing Object Types
Keywords   The FITS header described earlier (Fig. 4.15) illustrates an important technique for identifying fields and records: the use of keywords. In the case of FITS headers, we do not know what fields are going to be contained in any given header, so we identify each field using a "keyword = value" format.

Why does this format work for FITS files, whereas it was inappropriate for our address file? For the address file, we saw that the use of keywords demanded a high price in terms of space, possibly even doubling the size of the file. In FITS files the amount of overhead introduced by keywords is quite small. The FITS file in the example contains approximately 2 megabytes. The keywords in the header occupy a total of about 400 bytes, or about 0.02% of the total file space.
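To show how little machinery the "keyword = value" organization demands of a program, here is a small sketch (ours, not part of the FITS definition or of the programs listed at the end of this chapter) that pulls the keyword and the value out of one 80-byte header line.

/* parse one 80-byte FITS-style header line of the form
       KEYWORD =  value / comment
   into a keyword string and a value string.  Returns 0 at "END". */

#include <string.h>

int parse_header_line(char line[80], char keyword[9], char value[72])
{
    int i, j;

    /* the keyword occupies the first 8 bytes, padded with blanks */
    strncpy(keyword, line, 8);
    keyword[8] = '\0';
    for (i = 7; i >= 0 && keyword[i] == ' '; i--)
        keyword[i] = '\0';                  /* trim trailing blanks */

    if (strcmp(keyword, "END") == 0)
        return (0);

    /* the value follows the "= " and runs up to the '/' that
       introduces the comment, if there is one                 */
    j = 0;
    for (i = 10; i < 80 && line[i] != '/'; i++)
        value[j++] = line[i];
    value[j] = '\0';
    return (1);
}

A program reading a FITS header would simply apply a routine like this to each 80-byte record until it encounters the END keyword.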
Tags   With the addition of file structure information and metadata to the image header via keywords, we see that a file can be more than just a collection of repeated fields and records. Can we extend this notion beyond the header to other, more elaborate objects? For example, suppose an astronomer would like to store several FITS images of different sizes in a file, together with the usual metadata, plus perhaps lab notes describing what the scientist learned from the image (Fig. 4.16). Now we can think of our file as a mixture of objects that may be very different in content, a view that our previous file structures do not handle well. Maybe we need a new kind of file structure.
FIGURE 4.16 Information that an astronomer wants to include in a file.

There are many ways to address this new file design problem. One would be simply to put each type of object into a variable-length record and write our file processing programs so they know what each record looks like: The first record is a header for the first image; the second record is the first image; the third record is a document; the fourth record is a header for the second image; and so forth. This solution is workable and simple, but it has some familiar drawbacks:

Objects must be accessed sequentially, making access to individual images in large files time consuming.

The file must contain exactly the objects that are described, in exactly the order indicated. We could not, for instance, leave out the notebook for some of the images (or in some cases leave out the notebook altogether) without rewriting all programs that access the file to reflect the changes in the file's structure.

A solution to these problems is hinted at in the FITS header: Each line begins with a keyword that identifies the metadata field that follows in the line. Why not use keywords to identify all objects in the file, not just the fields in the headers, but the headers themselves, as well as the images and any other objects we might need to store? Unfortunately, the "keyword = data" format makes sense in a FITS header, where it is short and fits easily in an 80-byte line, but it doesn't work at all for objects that vary enormously in size and content. Fortunately, we can generalize the keyword idea to address these problems by making two changes:

Lift the restriction that each record be 80 bytes, and let it be big enough to hold the object that is referenced by the keyword.

Place the keywords in an index table, together with the byte offset of the actual metadata (or data) and a length indicator that indicates how many bytes the metadata (or data) occupies in the file.

The term tag is commonly used in place of keyword in connection with this type of file structure.
The resulting structure is illustrated in Fig. 4.17. In it, we encounter two important conceptual tools for file design: (1) the use of an index table to hold descriptive information about the primary data, and (2) the use of tags to distinguish different types of objects. These tools allow us to store in one file a mixture of objects that can vary from one another in structure and content.

FIGURE 4.17 Same as Fig. 4.16, except with tags identifying the objects.

Tag structures are common among standard file formats in use today. For example, a structure called TIFF (Tagged Image File Format) is a very popular tagged file format used for storing images. HDF (Hierarchical Data Format) is a standard tagged structure used for storing many different kinds of scientific data, including images. In the world of document storage and retrieval, SGML (Standard Generalized Markup Language) is a language for describing document structures and for defining tags used to mark up that structure. Like FITS, each of these provides an interesting study in file design and standardization. References to further information on each are provided at the end of this chapter, in "Further Readings."
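The index table just described can be represented very directly in C. The following sketch is our own illustration (it does not reproduce TIFF, HDF, or any other particular tagged format), and the type and field names are hypothetical.

#include <sys/types.h>   /* off_t, used by lseek()   */
#include <unistd.h>      /* lseek(), read()          */
#include <stdlib.h>      /* malloc()                 */

/* one entry in the index table: a tag identifying the kind of
   object, the byte offset of the object in the file, and the
   number of bytes the object occupies                          */
typedef struct {
    char tag[12];        /* e.g. "header", "image", "notes" */
    long byte_offset;    /* where the object starts         */
    long length;         /* how many bytes it occupies      */
} index_entry;

/* retrieve the bytes of the object described by one index entry */
char *get_object(int fd, index_entry *entry)
{
    char *buffer = malloc(entry->length);

    if (buffer == NULL)
        return (NULL);
    lseek(fd, (off_t) entry->byte_offset, SEEK_SET);   /* jump to the object */
    read(fd, buffer, entry->length);                   /* read it in         */
    return (buffer);
}

Because the entry carries both the offset and the length, a program can skip directly to any object in the file without reading the objects that precede it.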
Accessing Files with Mixtures of Data Objects   The idea of allowing files to contain widely varying objects is compelling, especially for applications that require large amounts of metadata or unpredictable mixes of different kinds of data, for it frees us of the requirement that all records be fundamentally the same. As usual, we must ask what this freedom costs us.

To gain some insight into the costs, imagine that you want to write a program to access objects in such a file. You now have to read and write tags as well as data, and the structure and format for different data types are likely to be different. Here are some questions you will have to answer almost immediately:

When we want to read an object of a particular type, how do we search for the object?

When we want to store an object in the file, how and where do we store its tag, and where exactly do we put the object?

Given that different objects will have very different appearances within a file, how do we determine the correct method for storing or retrieving the object?

The first two questions have to do with accessing the table that contains the tags and pointers to the objects. Solutions to this problem are dealt with in detail in Chapter 6, so we defer their discussion until then. The third question, how to determine the correct methods for accessing objects, has implications that we briefly touch on here.

4.5.6 Object-oriented File Access

We have used the term abstract data model to describe the view that an application has of a data object. This is essentially an in-RAM, application-oriented view of an object, one that ignores the physical format of objects as they are stored in files.
Taking this view of objects buys our software two things:

It delegates to separate modules the responsibility of translating to and from the physical format of the object, letting the application modules concentrate on the task at hand. (For example, an image processing program that can operate in RAM on 8-bit images should not have to worry about the fact that a particular image comes from a file that uses the 32-bit FITS format.)

It opens up the possibility of working with objects that at some level fit the same abstract data model, even though they are stored in different formats. (The in-RAM representations of the images could be identical, even though they come from files with quite different formats.)

File access that exploits these possibilities could be called object-oriented access, emphasizing the parallels between it and the well-known object-oriented programming paradigm.

As an example that illustrates both points, suppose you have an image processing application program (we'll call it find_star) that operates in RAM on 8-bit images, and you need to process a collection of images. Some are stored in FITS files in a FITS format and some in TIFF files in a different format. An object-oriented approach (Fig. 4.18) would provide the application program with a routine (let's call it read_image( )) for reading images into RAM in the expected 8-bit form, letting the application concentrate on the image processing task. For its part, the routine read_image( ), given a file to get an image from, determines the format of the image within the file, invokes the proper procedure to read the image in that format, and converts it from that format into the 8-bit RAM format that the application needs.

program find_star
    read_image ("star1", image)
    process image
end find_star

FIGURE 4.18 Example of object-oriented access. The program find_star knows nothing about the file format of the image that it wants to read. The routine read_image has methods to convert the image from whatever format it is stored in on disk into the 8-bit in-RAM format required by find_star.
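The text does not give code for read_image( ), but a skeleton of the dispatching that Fig. 4.18 describes might look like the following sketch. The format-checking and conversion routines named here (looks_like_fits( ), read_fits_8bit( ), and so on) are hypothetical placeholders, not functions defined anywhere in this book.

/* read_image() -- decide which physical format the image in an open
   file uses, invoke the proper routine to read it, and hand the
   application a standard 8-bit in-RAM image.                        */

typedef struct {
    int           width, height;
    unsigned char *pixels;          /* 8 bits per pixel in RAM */
} image_8bit;

/* hypothetical format-specific methods */
int looks_like_fits (int fd);
int looks_like_tiff (int fd);
int read_fits_8bit  (int fd, image_8bit *image);  /* reads 32-bit FITS data
                                                     and scales it to 8 bits */
int read_tiff_8bit  (int fd, image_8bit *image);

int read_image(int fd, image_8bit *image)
{
    if (looks_like_fits(fd))                 /* let the file's own          */
        return read_fits_8bit(fd, image);    /* self-describing header      */
    else if (looks_like_tiff(fd))            /* tell us what it contains    */
        return read_tiff_8bit(fd, image);
    else
        return (-1);                         /* format not supported (yet)  */
}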
Tagged file formats are one way to implement this conceptual view of file organization and file access. The tag specification can be accompanied by a specification of methods for reading, writing, and otherwise manipulating the corresponding data object according to the needs of an application. Indeed, any specification that separates the definition of the abstract data model from that of the corresponding file format lends itself to the object-oriented approach.
4.5.7 Extensibility

One of the advantages of using tags to identify objects within files is that we do not have to know a priori what all of the objects our software may eventually have to deal with will look like. We have just seen that if our program is to be able to access a mixture of objects in a file, it must have methods for reading and writing each object. Once we build into our software a mechanism for choosing the appropriate methods for a given type of object, it is easy to imagine extending, at some future time, the types of objects that our software can support. Every time we encounter a new type of object that we would like to accommodate in our files, we can implement methods for reading and writing that object and add those methods to the repertoire of methods available to our file processing software.
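One common way to build such a mechanism in C (this is our own sketch, not code from the text) is to keep a table that pairs each tag with pointers to the functions that read and write that kind of object. Supporting a new type of object then amounts to writing its two methods and adding one row to the table. All of the per-type function names below are hypothetical.

#include <string.h>

/* a repertoire of methods, selected by tag */

typedef int (*read_method)  (int fd, long offset, long length, void *object);
typedef int (*write_method) (int fd, void *object);

typedef struct {
    char         tag[12];
    read_method  read;
    write_method write;
} method_entry;

/* hypothetical methods for the object types we support today */
int read_header (int fd, long offset, long length, void *object);
int write_header(int fd, void *object);
int read_raster (int fd, long offset, long length, void *object);
int write_raster(int fd, void *object);
int read_notes  (int fd, long offset, long length, void *object);
int write_notes (int fd, void *object);

method_entry methods[] = {
    { "header", read_header, write_header },
    { "image",  read_raster, write_raster },
    { "notes",  read_notes,  write_notes  },
};

/* find the methods that go with a tag, or NULL if we have none yet */
method_entry *lookup_methods(char *tag)
{
    int i;
    int n = sizeof(methods) / sizeof(methods[0]);

    for (i = 0; i < n; i++)
        if (strcmp(methods[i].tag, tag) == 0)
            return (&methods[i]);
    return (NULL);
}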
4.6 Portability and Standardization

A recurring theme in several of the examples that we have just seen is the idea that people often want to share files. Sharing files means making sure that they are accessible on all of the different computers that they might turn up on, and that they are somehow compatible with all of the different programs that will access them. In this final section, we look at two complementary topics that affect the sharability of files: portability and standardization.
4.6.1 Factors Affecting Portability
Imagine that you work for a company that wishes to share simple data files such as our address file with some other business. You get together with the other business to agree on a common field and record format, and you discover that your business does all of its programming and computing in C on a Sun computer and the other business uses Turbo Pascal on an IBM PC. What sorts of issues would you expect to arise?
Differences among Operating Systems   In Chapter 2, in the section "Unexpected Characters in Files," we saw that MS-DOS adds an extra linefeed character every time it encounters a carriage return character, whereas on most other file systems this is not the case. This means that every time our address file has a byte with hex value 0x0d, whether or not that byte is meant to be a carriage return, the file is extended by an extra 0x0a byte.

This example illustrates the fact that the ultimate physical format of the same logical file can vary depending on differences among operating systems.
Differences among Languages   Earlier in this chapter, when discussing header records, we chose to make our C header 32 bytes, but we were forced to make our Pascal header 64 bytes. C allows us to mix and match fixed record lengths according to our needs, but Pascal requires that all records in a nontext file be the same size.

This illustrates a second factor impeding portability among files: The physical layout of files produced with different languages may be constrained by the way the languages let you define structures within a file.
Differences in Machine Architectures   Consider again the header record that we produce in the C version of our address file. The hex dump of the file (Fig. 4.13), which was generated using C on a Sun 3 computer, shows this header record in the first line:

0000000   0020 0000 0000 0000 0000 0000 0000 0000

The first two bytes contain the number of records in the file, in this case 20 (hexadecimal), or 32 (decimal). If the same C program is compiled and executed on an IBM PC or a VAX, the hex dump of the header record will look like this:

0000000   2000 0000 0000 0000 0000 0000 0000 0000

Why are the bytes reversed in this version of the program? The answer is that in both cases the numbers were written to the file exactly as they appeared in RAM, and the two different machines represent two-byte integers differently: the Sun stores the high-order byte, followed by the low-order byte; the IBM PC and VAX store the low-order byte, followed by the high-order byte.

This reverse order also applies to four-byte integers on these machines. For example, in our discussion of file dumps we saw that the hexadecimal value of 500,000,000 (decimal) is 1dcd6500. If you write this value out to a file on an IBM PC, or some other reverse-order machine, a hex dump of the file created looks like this:

0000000   0065 cd1d
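The two dumps differ only in byte order, so a program that knows a file was written on a machine with the opposite ordering can repair the damage by swapping bytes as it reads. Here is a minimal sketch of that idea; it is our own illustration, not a routine used elsewhere in this text.

/* swap the two bytes of a 16-bit value and the four bytes of a
   32-bit value, converting between high-order-byte-first ordering
   (the Sun) and low-order-byte-first ordering (the IBM PC and VAX) */

unsigned short swap2(unsigned short n)
{
    return (unsigned short)((n >> 8) | (n << 8));
}

unsigned long swap4(unsigned long n)
{
    return ((n >> 24) & 0x000000ffUL) |
           ((n >>  8) & 0x0000ff00UL) |
           ((n <<  8) & 0x00ff0000UL) |
           ((n << 24) & 0xff000000UL);
}

For example, swap2(0x2000) recovers the 0x0020 that the Sun wrote, and swap4(0x0065cd1dUL) yields 0x1dcd6500, the hexadecimal form of 500,000,000.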
The problem of data representation is not restricted only to binary numbers. The way structures, such as C structs or Pascal records, are laid out in RAM can vary from machine to machine and compiler to compiler. For example, suppose you have a C program containing the following lines of code:
struct {
    int   cost;
    char  ident[4];     /* array size chosen to match the byte counts
                           discussed below                            */
} item;
    . . .
write (fd, &item, sizeof(item));
and you want to write files using this code on two different machines, a Cray 2 and a Sun 3. Because it likes to operate on 64-bit words, Cray's C compiler allocates a minimum of eight bytes for any element in a struct, so it allocates 16 bytes for the struct item. When it executes the write( ) statement, then, the Cray writes 16 bytes to the file. The same program compiled on a Sun 3 writes only eight bytes, as you probably would expect, and on most IBM PCs it writes six bytes: same exact program; same language; three different results.
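A common way around this particular problem, sketched below under the assumption that the struct has the shape shown above (including the assumed 4-byte ident field), is to write each field individually with an explicit size rather than writing the struct as a single block. This is our illustration, not code from the programs in this chapter.

#include <unistd.h>     /* write() */

struct item_rec {
    int  cost;
    char ident[4];      /* assumed size, as in the example above */
};

/* write the fields one at a time, with explicit sizes, instead of
   writing the whole struct with sizeof(); the bytes that reach the
   file no longer depend on how a compiler pads the struct in RAM  */
int write_item(int fd, struct item_rec *it)
{
    short cost2 = (short) it->cost;      /* force a known, 2-byte size */

    if (write(fd, &cost2, 2) != 2)
        return (-1);
    if (write(fd, it->ident, 4) != 4)
        return (-1);
    return (0);
}

Note that this controls only the sizes of what is written; the byte-order differences described earlier still have to be handled separately, for example with the XDR routines discussed in the next section.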
Text is also encoded differently on different platforms. In this case the differences are primarily restricted to two different types of systems: those that use EBCDIC* and those that use ASCII. EBCDIC is a standard created by IBM, so machines that need to maintain compatibility with IBM must support EBCDIC. Most others support ASCII. A few support both. Hence, text written to a file from an EBCDIC-based machine may well not be readable by an ASCII-based machine.

Equally serious, when we go beyond simple English text, is the problem of representing different character sets from different national languages. This is an enormous problem for developers of text databases.

*EBCDIC stands for Extended Binary Coded Decimal Interchange Code.
4.6.2 Achieving Portability

Differences among languages, operating systems, and machine architectures represent three major problems when we need to generate portable files. Achieving portability means determining how to deal with these differences. And the differences are often not just differences between two platforms, for many different platforms could be involved.

The most important requirement for achieving portability is to recognize that it is not a trivial matter and to take steps ahead of time to insure it. Here are some guidelines.

Agree on a Standard Physical Record Format and Stay with It   A physical standard is one that is represented the same physically, no matter what language, machine, or operating system is used. FITS is a good example of a physical standard, for it specifies exactly the physical format of each header record, the keywords that are allowed, the order in which keywords may appear, and the bit pattern that must be used to represent the binary numbers that describe the image.
Unfortunately, once a standard is established, it is very tempting to "improve" on it by changing it in some way, thereby rendering it no longer a standard. If the standard is sufficiently extensible, this temptation can sometimes be avoided. FITS, for example, has been extended a few times over its lifetime to support data objects that were not anticipated in its original design, yet all additions have remained compatible with the original format.

One way to make sure that a standard has staying power is to make it simple enough that files can be written in the standard format from a wide range of machines, languages, and operating systems. FITS again exemplifies such a standard. FITS headers are ASCII 80-byte records in blocks of 36 records each, and FITS images are stored as one contiguous block of numbers, both very simple structures that are easy to read and write in most modern operating systems and languages.
Agree on a Standard Binary Encoding for Data Elements   The two most common types of basic data elements are text and numbers. In the case of text, ASCII and EBCDIC represent the most common encoding schemes, with ASCII standard on virtually all machines except IBM mainframes. Depending on the anticipated environment, one of these should be used to represent all text.*

The situation for binary numbers is a little cloudier. Although the number of different encoding schemes is not large, the likelihood of having to share data among machines that use different binary encodings can be quite high, especially when the same data is processed both on large mainframes and on smaller computers. Two standards efforts have helped diminish the problem, however: IEEE Standard formats, and External Data Representation (XDR).

IEEE has established standard format specifications for 32-bit, 64-bit, and 128-bit floating point numbers, and for 8-bit, 16-bit, and 32-bit integers. With a few notable exceptions (e.g., IBM mainframes, Cray, and Digital), most computer manufacturers have followed these guidelines in designing their machines. This effort goes a long way toward providing portable number encoding schemes.

XDR is an effort to go the rest of the way. XDR specifies not only a set of standard encodings for all files (the IEEE encodings), but provides for a set of routines for each machine for converting from its binary encoding when writing to a file, and vice versa (Fig. 4.19). Hence, when we want to store numbers in XDR, we can read or write them by replacing the read and write routines in our program with XDR routines. The XDR routines take care of the conversions.**

*Actually, there are different versions of both ASCII and EBCDIC. However, for most applications, and for the purposes of this text, it is sufficient to consider each as a single character set.

**XDR is used for more than just number conversions. It allows a C programmer to describe arbitrary data structures in a machine-independent fashion. XDR originated as a Sun protocol for transmitting data that is accessed by more than one type of machine. For further information, see Sun (1986 or later).
FIGURE 4.19 XDR specifies a standard external data representation for numbers stored in a file. XDR routines are provided for converting to and from the XDR representation to the encoding scheme used on the host machine. Here a routine called XDR_float( ) translates a 32-bit floating point number from its XDR representation on disk to that of the host machine.
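As a concrete illustration, the following sketch uses the Sun XDR library routines xdrstdio_create( ) and xdr_float( ) to write a floating point number in XDR form and read it back. The same filter routine, xdr_float( ), converts in either direction, depending on how the XDR stream was created. This is our own example; consult the Sun documentation cited in "Further Readings" for the full interface.

#include <stdio.h>
#include <rpc/rpc.h>     /* XDR streams and filter routines */

int main()
{
    XDR   xdrs;
    FILE  *fp;
    float x = 234.5;

    /* encode: write x to the file in the standard XDR representation */
    fp = fopen("xdrdata", "w");
    xdrstdio_create(&xdrs, fp, XDR_ENCODE);
    xdr_float(&xdrs, &x);
    xdr_destroy(&xdrs);
    fclose(fp);

    /* decode: read it back into the host machine's own representation */
    fp = fopen("xdrdata", "r");
    xdrstdio_create(&xdrs, fp, XDR_DECODE);
    xdr_float(&xdrs, &x);
    xdr_destroy(&xdrs);
    fclose(fp);

    printf("x = %f\n", x);
    return 0;
}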
Once again, FITS provides us with an excellent example: The binary
numbers that constitute a FITS image must conform to the IEEE Standard.
Any program written on a machine with XDR support can thus read and
write portable FITS files.
Number and Text Conversion   Sometimes the use of standard data encodings is not feasible. For example, suppose you are working primarily on IBM mainframes with software that deals with floating point numbers and text. If you choose to store your data in IEEE Standard formats, every time your program reads or writes a number or character it must translate the number from the IBM format to the corresponding IEEE format. This is not only time-consuming but can result in loss of accuracy. It is probably better in this case to store your data in native IBM format in your files.

What happens, then, when you want to move your files back and forth between your IBM and a VAX, which uses a different native format for numbers and generally uses ASCII for text? You need a way to convert from the IBM format to the VAX format and back. One solution is to write (or borrow) a program that translates IBM numbers and text to their VAX equivalents, and vice versa. This simple solution is illustrated in Fig. 4.20(a).
But what if, in addition to IBM and VAX computers, you find that your data is likely to be shared among many different platforms that use different numeric encodings? One way to solve this problem is to write a program to convert from each of the representations to every other representation. This solution, illustrated in Fig. 4.20(b), can get rather complicated. In general, if you have n different encoding schemes, you will need n(n - 1) different translators. (Why?) If n is large, this can be very messy. Not only do you need many translators, but you need to keep track, for each file, of where the file came from and/or where it is going in order to know which translator to use.

In this case, a better solution would probably be to agree on a standard intermediate format, such as XDR, and translate files into XDR whenever they are to be exported to a different platform. This solution is illustrated in Fig. 4.20(c). Not only does it cut the number of translators down from n(n - 1) to 2n, but it should be easy to find translators to convert from most platforms to and from XDR. One negative aspect of this solution is that it requires two conversions to go from any one platform to another, a cost that has to be weighed against the complexity of providing n(n - 1) translators.
File Structure Conversion   Suppose you are a doctor and you have X-ray raster images of a particular organ taken periodically over several minutes. You want to look at a certain image in the collection using a program that lets you zoom in and out and detect special features in the image. You have another program that lets you animate the collection of images, showing how it changes over several minutes. Finally, you want to annotate the images and store them in a special X-ray archive, and you have another program for doing that. What do you do if each of these three programs requires that your image be in a different format?

The conversion problems that apply to atomic data encodings also apply to file structures for more complex objects, like images, but at a different level. Whereas character and number encodings are tied closely to specific platforms, more complex objects and their representations just as often are tied to specific applications.

For example, there are many software packages that deal with images, and very little agreement about a file format for storing them. When we look at this software, we find different solutions to this problem:
Require that the user supply images in a format that is compatible with the one used by the package. This places the responsibility on the user to convert from one format to another. For such situations, it may be preferable to provide utility programs that translate from one format to another and that are invoked whenever translation is needed.

Process only images that adhere to some predefined standard format. This places the responsibility on a community of users and software developers for agreeing on and enforcing a standard. FITS is a good example of this approach.

Include different sets of I/O methods capable of converting an image from several different formats into a standard RAM structure that the package can work with. This places the burden on the software developer to develop I/O methods for file object types that may be stored differently but for the purposes of an application are conceptually the same. You may recognize this approach as a variation on the concept of object-oriented access that we discussed earlier.

FIGURE 4.20 Direct conversion between n native machine formats requires n(n - 1) conversion routines, as illustrated in (a) and (b): (a) converting between IBM and VAX native formats requires two conversion routines; (b) converting directly between five different native formats requires 20 conversion routines. Conversion via an intermediate standard format such as XDR requires 2n conversion routines, as illustrated in (c): converting between five different native formats via an intermediate standard format requires 10 conversion routines.
File System Differences   Finally, if you move files from one file system to another, chances are you will find differences in the way files are organized physically. For example, UNIX systems write files to tapes in 512-byte blocks, but non-UNIX systems often use different block sizes, such as 2,880 bytes, or thirty-six 80-byte records. (Guess where the FITS blocking format comes from?) When transferring files between systems, you may need to deal with this problem.

UNIX and Portability   Recognizing problems such as the block-size problem just described, UNIX provides a utility called dd. Although dd is intended primarily for copying tape data to and from UNIX systems, it can be used to convert data from any physical source. The dd utility provides the following options, among others:

Convert from one block size to another;

Convert fixed-length records to variable length, or vice versa;

Convert ASCII to EBCDIC, or vice versa;

Convert all characters to lowercase (or to uppercase); and

Swap every pair of bytes.
Of course, the greatest contribution UNIX makes to the problems discussed here is UNIX itself. By its simplicity and ubiquity, UNIX encourages the use of the same operating system, the same file system, the same views of devices, and the same general views of file organization, no matter what particular hardware platform you happen to be using.

For example, one of the authors works in an organization with a nationwide constituency that operates many different computers, including two Crays, a Connection Machine, and many Sun, Apple, IBM, Silicon Graphics, and Digital workstations. Because each runs some flavor of UNIX, they all incorporate precisely the same view of all external storage devices, they all use ASCII, and they all provide the same basic programming environment and file management utilities. Files are not perfectly portable within this environment, for reasons that we have covered in this chapter, but the availability of UNIX goes a long way toward facilitating the rapid and easy transfer of files among the applications, programming environments, and hardware systems that the organization supports.
SUMMARY
The lowest level of organization that we normally impose on a file is a stream of bytes. Unfortunately, by storing data in a file merely as a stream of bytes, we lose the ability to distinguish among the fundamental informational units of our data. We call these fundamental pieces of information fields. Fields are grouped together to form records. Recognizing fields and recognizing records requires that we impose structure on the data in the file.

There are many ways to separate one field from the next and one record from the next:

Fix the length of each field or record.

Begin each field or record with a count of the number of bytes that it contains.

Use delimiters to mark the divisions between entities.

In the case of fields, another useful technique is to use a "keyword = value" form to identify fields. In the case of records, another useful technique is to use a second, index file that tells where each record begins.

One higher level of organization, in which records are grouped into blocks, is also often imposed on files. This level is imposed to improve I/O performance rather than our logical view of the file.

In this chapter we use the record structure that uses a length indicator at the beginning of each record to develop programs for writing and reading a simple file of variable-length records containing names and addresses of individuals. We use buffering to accumulate the data in an individual record before we know its length to write it to the file. Buffers are also useful in allowing us to read in a complete record at one time. We represent the length field of each record as a binary number or as a sequence of ASCII digits. In the former case, it is useful to use a file dump to examine the contents of our file.
Sometimes we identify individual records by their relative record numbers
(RRNs) in a file. It is also common, however, to identify a record by a key
whose value is based on some of the record's content. Key values must
occur in, or be converted to, some predetermined canonical form if they are
to be recognized accurately and unambiguously by programs. If every
record's key value is distinct from all others, the key can be used to identify
and locate the unique record in the file. Keys that are used in this way are
called primary keys.
In this chapter we look at the technique of searching sequentially through a file looking for a record with a particular key. Sequential search can perform poorly for long files, but there are times when searching sequentially is reasonable. Record blocking can be used to improve the I/O time for a sequential search substantially. Two useful UNIX utilities that process files sequentially are wc and grep.

In our discussion of ways to separate records, it is clear that some of the methods provide a mechanism for looking up or calculating the byte offset of the beginning of a record. This, in turn, opens up the possibility of accessing the record directly, by RRN, rather than sequentially.

The simplest record formats for permitting direct access by RRN involve the use of fixed-length records. When the data itself actually comes in fixed-size quantities (e.g., zip codes), fixed-length records can provide good performance and good space utilization. If there is a lot of variation in the amount and size of data in records, however, the use of fixed-length records can result in expensive waste of space. In such cases the designer should look carefully at the possibility of using variable-length records.

Sometimes it is helpful to keep track of general information about files, such as the number of records they contain. A header record, stored at the beginning of the file it pertains to, is a useful tool for storing this kind of information.

It is important to be aware of the difference between file access and file organization. We try to organize files in such a way that they give us the types of access we need for a particular application. For example, one of the advantages of a fixed-length record organization is that it allows access that is either sequential or direct.

In addition to the traditional view of a file as a more or less regular collection of fields and records, we present a more purely logical view of the contents of files in terms of abstract data models, a view that lets applications ignore the physical structure of files altogether.

This view is often more appropriate to data objects such as sound, images, and documents. We call files self-describing when they do not require an application to reveal their structure, but provide that information
themselves. Another concept that deviates from the traditional view is metadata, in which the file contains data that describe the primary data in the file. FITS files, used for storing astronomical images, contain extensive headers with metadata.

The use of abstract data models, self-describing files, and metadata makes it possible to mix a variety of different types of data objects in one file. When this is the case, file access is more object oriented. Abstract data models also facilitate extensible files, files whose structures can be extended to accommodate new kinds of objects.

Portability becomes increasingly important as files are used in more heterogeneous computing environments. Differences among operating systems, languages, and machine architectures all lead to the need for portability. One important way to foster portability is standardization, which means agreeing on physical formats, encodings for data elements, and file structures.

If a standard does not exist and it becomes necessary to convert from one format to another, it is still often much simpler to have one standard format that all converters convert into and out of. UNIX provides a utility called dd that facilitates data conversion. The UNIX environment itself supports portability simply by being commonly available on a large number of platforms.
KEY TERMS
Block. A collection of records stored as a physically contiguous unit on secondary storage. In this chapter, we use record blocking to improve I/O performance during sequential searching.

Byte count field. A field at the beginning of a variable-length record that gives the number of bytes used to store the record. The use of a byte count field allows a program to transmit (or skip over) a variable-length record without having to deal with the record's internal structure.

Canonical form. A standard form for a key that can be derived, by the application of well-defined rules, from the particular, nonstandard form of the data found in a record's key field(s) or provided in a search request supplied by a user.

Delimiter. One or more characters used to separate fields and records in a file.

Direct access. A file accessing mode that involves jumping to the exact location of a record. Direct access to a fixed-length record is usually accomplished by using its relative record number (RRN), computing its byte offset, and then seeking to the first byte of the record.

Extensibility. A characteristic of some file organizations that makes it possible to extend the types of objects that the format can accommodate without having to redesign the format. For example, tagged file formats lend themselves to extensibility, for they allow the addition of new tags for new data objects and associated new methods for accessing the objects.

Field. The smallest logically meaningful unit of information in a file. A record in a file is usually made up of several fields.

File-access method. The approach used to locate information in a file. In general, the two alternatives are sequential access and direct access.

File organization method. The combination of conceptual and physical structures used to distinguish one record from another and one field from another. An example of a kind of file organization is fixed-length records containing variable numbers of variable-length delimited fields.

Fixed-length record. A file organization in which all records have the same length. Records are padded with blanks, nulls, or other characters so they extend to the fixed length. Since all the records have the same length, it is possible to calculate the beginning position of any record, making direct access possible.

Header record. A record placed at the beginning of a file that is used to store information about the file contents and the file organization.

Key. An expression derived from one or more of the fields within a record that can be used to locate that record. The fields used to build the key are sometimes called the key fields. Keyed access provides a way of performing content-based retrieval of records, rather than retrieval based merely on a record's position.

Metadata. Data in a file that is not the primary data, but describes the primary data in a file. Metadata can be incorporated into any file whose primary data requires supporting information. If a file is going to be shared by many users, some of whom might not otherwise have easy access to its metadata, it may be most convenient to store the metadata in the file itself. A common place to store metadata in a file is the header record.

Object-oriented file access. A form of file access in which applications access data objects in terms of the applications' in-RAM view of the objects. Separate methods associated with the objects are responsible for translating to and from the physical format of the object, letting the application concentrate on the task at hand.

Portability. That characteristic of files that describes how amenable they are to access on a variety of different machines, via a variety of different operating systems, languages, and applications.

Primary key. A key that uniquely identifies each record and that is used as the primary method of accessing the records.

Record. A collection of related fields. For example, the name, address, etc. of an individual in a mailing list file would probably make up one record.

Relative record number (RRN). An index giving the position of a record relative to the beginning of its file. If a file has fixed-length records, the RRN can be used to calculate the byte offset of a record so the record can be accessed directly.

Self-describing files. Files that contain information such as the number of records in the file and formal descriptions of the file's record structure, which can be used by software in determining how to access the file. A file's header is a good place for this information.

Sequential access. Sequential access to a file means reading the file from the beginning and continuing until you have read in everything that you need. The alternative is direct access.

Sequential search. A method of searching a file by reading the file from the beginning and continuing until the desired record has been found.

Stream of bytes. Term describing the lowest-level view of a file. If we begin with the basic stream-of-bytes view of a file, we can then impose our own higher levels of order on the file, including field, record, and block structures.

Variable-length record. A file organization in which the records have no predetermined length. They are just as long as they need to be, hence making better use of space than fixed-length records do. Unfortunately, we cannot calculate the byte offset of a variable-length record by knowing only its relative record number.
EXERCISES
1. Find situations for which each of the four field structures described in the text might be appropriate. Do the same for each of the record structures described.

2. Discuss the appropriateness of using the following characters to delimit fields or records: carriage return, linefeed, space, comma, period, colon, escape. Can you think of situations in which you might want to use different delimiters for different fields?

3. Suppose you want to change the programs in section 4.1 to include a phone number field in each record. What changes need to be made?

4. Suppose you need to keep a file in which every record has both fixed- and variable-length fields. For example, suppose you want to create a file of employee records, using fixed-length fields for each employee's ID (primary key), sex, birthdate, and department, and using variable-length fields for each name and address. What advantages might there be to using such a structure? Should we put the variable-length portion first or last? Either approach is possible; how can each be implemented?

5. One record structure not described in this chapter is called labeled. In a labeled record structure each field that is represented is preceded by a label describing its contents. For example, if the labels LN, FN, AD, CT, ST, and ZP are used to describe the six fixed-length fields for a name and address record, it might appear as follows:

LNAmesbbbbbbFNJohnbbbbbbAD123 MaplebbbbbbCTStillwaterSTOKZP74075bbbb

Under what conditions might this be a reasonable, even desirable, record structure?

6. Define the terms stream of bytes, stream of fields, and stream of records.

7. Find out what basic file structures are available to you in the programming language that you are currently using. For example, does your language recognize a sequence-of-bytes structure? Does it recognize lines of text? Record blocking? For those types of structures that your language does not recognize, describe how you might implement them using structures that your language does recognize.

8. Report on the basic field and record structures available in PL/I or COBOL.

9. Compare the use of ASCII characters to represent everything in a file with the use of binary and ASCII data mixed together.

10. If you list the contents of a file containing both binary and ASCII characters on your terminal screen, what results can you expect? What happens when you list a completely binary file on your screen? (Warning: If you actually try this, do so with a very small file. You could lock up or reconfigure your terminal, or even log yourself off!)

11. If a key in a record is already in canonical form and the key is the first field of the record, it is possible to search for a record by key without ever separating out the key field from the rest of the fields. Explain.
12. It has been suggested (Sweet, 1985) that primary keys should be "dataless, unchanging, unambiguous, and unique." These concepts are interrelated since, for example, a key that contains data runs a greater risk of changing than a dataless key. Discuss the importance of each of these concepts, and show by example how their absence can cause problems. The primary key used in our example file violates at least one of the criteria. How might you redesign the file (and possibly its corresponding information content) so primary keys satisfy these criteria?

13. How many comparisons would be required on average to find a record using sequential search in a 10,000-record disk file? If the record is not in the file, how many comparisons are required? If the file is blocked so 20 records are stored per block, how many disk accesses are required on average? What if only one record is stored per block?

14. In our evaluation of performance for sequential search, we assume that every read results in a seek. How do the assumptions change on a single-user machine with access to a magnetic disk? How do these changed assumptions affect the analysis of sequential searching?

15. Look up the differences between the UNIX commands grep, egrep, and fgrep. Why are they different? What motivates the differences?

16. Give a formula for finding the byte offset of a fixed-length record in which the RRN of the first record is 1 rather than 0.

17. Why is a variable-length record structure unworkable for the update program? Does it help if we have an index that points to the beginning of each variable-length record?
18. The update program lets the user change records, but not delete records. How must the file structure and access procedures be modified to allow for deletion if we do not care about reusing the space from deleted records? How do the file structures and procedures change if we do want to reuse the space?

19. In our discussion of the uses of relative record numbers (RRNs), we suggest that you can create a file in which there is a direct correspondence between a primary key, such as membership number, and RRN, so we can find a person's record by knowing just the name or membership number. What kinds of difficulties can you envision with this simple correspondence between membership number and RRN? What happens if we want to delete a name? What happens if we change the information in a record in a variable-length record file and the new record is longer?

20. The following file dump describes the first few bytes from a file of the type produced by the first version of writrec, but the right-hand column is not filled in. How long is the first record? What are its contents?

0000000  00264475 6D707C46 7265647C 38323120
0000020  4B6C7567 657C4861 636B6572 7C50417C
0000040  36353533 357C2E2E 48657861 64656369

21. Assume that we have a variable-length record file with long records (greater than 1,000 bytes each, on the average). Assume that we are looking for a record with a particular RRN. Describe the benefits of using the contents of a byte count field to skip sequentially from record to record to find the one we want. This is called skip sequential processing. Use your knowledge of system buffering to describe why this is useful only for long records. If the records are sorted in order by key and blocked, what information do you have to place at the start of each block to permit even faster skip sequential processing?

22. Suppose you have a fixed-length record with fixed-length fields, and the sum of the field lengths is 30 bytes. A record with a length of 30 bytes would hold them all. If we intend to store the records on a sectored disk with 512-byte sectors (see Chapter 3), we might decide to pad the record out to 32 bytes so we can place an integral number of records in a sector. Why would we want to do this?

23. Why is it important to distinguish between file access and file organization?

24. What is an abstract data model? Why did the early file processing programs not deal with abstract data models? What are the advantages of using abstract data models in applications? In what way does the UNIX concept of standard input and standard output conform to the notion of an abstract data model? (See "Physical and Logical Files in UNIX" in Chapter 2.)

25. What is metadata?

26. In the FITS header in Fig. 4.15, some metadata provides information about the file's structure, and some provides information about the scientific context in which the corresponding image was recorded. Give three examples of each.
27. In the FITS header in Fig. 4.15, there must be enough information for a program to determine how large the file is. Assuming that the size of the block containing the header is a multiple of 2,880 bytes, how large is the file? What proportion of the file do we have to read in order to read the entire file?

28. In the discussion of field organization, we list the "keyword = value" construct as one possible type of field organization. How is this notion applied in tagged file structures, in which the file itself contains header information? How does tagged file structure support object-oriented file access? How do tagged file formats support extensibility?

29. List three factors that affect portability in files.

30. List three ways that portability can be achieved in files.

31. What is XDR? XDR is actually much more extensive than what we described in this chapter. If you have access to XDR documentation (see "Further Readings" at the end of this chapter), look up XDR and list the ways that it supports portability.

32. In Fig. 4.2, we see two possible record structures for our address file, one based on C and one based on Pascal. Discuss portability problems that might arise from using these record structures in a heterogeneous computing environment. (Hint: Some compilers allocate space for character fields starting on word boundaries, and others do not.)
Programming Exercises
33. Rewrite writstrm so it uses delimiters as field separators. The output of the new version of writstrm should be readable by readstrm.c or readstrm.pas.

34. Create versions of writrec and readrec that use the following fixed-field lengths rather than delimiters.

Last name:    15 characters
First name:   15 characters
Address:      30 characters
City:         20 characters
State:         2 characters
Zip:           5 characters

35. Write the program described in the preceding problem so it uses blocks. Make it store five records per block.

36. Implement the program find.

37. Rewrite the program find so it can find a record on the basis of its position in the file. For example, if requested to find the 547th record in a file, it would read through the first 546 records, then print the contents of the 547th record. Use skip sequential search (see exercise 21) to avoid reading the contents of unwanted records.

38. Write a program similar to find, but with the following differences. Instead of getting record keys from the keyboard, the program reads them from a separate transaction file that contains only the keys of the records to be extracted. Instead of printing the records on the screen, it writes them out to a separate output file. First, assume that the records are in no particular order. Then assume that both the main file and the transaction file are sorted by key. In the latter case, how can you make your program more efficient than find?

39. Make any or all of the following alterations to update.pas or update.c.
a. Let the user identify the record to be changed by name, rather than RRN.
b. Let the user change individual fields without having to change an entire record.
c. Let the user choose to view the entire file.

40. Modify update.c or update.pas to signal the user when the record exceeds the fixed-record length. The modification should allow the user to bring the record down to an acceptable size and input it again. What are some other modifications that would make the program more robust?

41. Change update.c or update.pas to a batch program that reads a transaction file in which each transaction record contains an RRN of a record that is to be updated, followed by the new contents of the record, and then makes the changes in a batch run. Although not necessary, it might be desirable to sort the transaction file by RRN. Why?

42. Write a program that reads a file and outputs the file contents as a file dump. The file dump should have a format similar to the one used in the examples in this chapter. The program should accept the name of the input file on the command line. Output should be to standard output (terminal screen).

43. Develop a set of rules for translating the dates August 7, 1949, Aug. 7, 1949, 8-7-49, 08-07-49, 8/7/49, and other, similar variations into a common canonical form. Write a function that accepts a string containing a date in one of these forms and returns the canonical form, according to your rules. Be sure to document the limitations of your rules and function.

44. Write a program to read in a FITS file and print the following:
a. The size of the image (e.g., 256 by 256)
b. The title of the image
c. The telescope used to make the image
d. The date the image file was created
e. The average pixel value in the image (use BSCALE and BZERO)
FURTHER READINGS
Many textbooks cover basic material on field and record structure design, but only a few go into the options and design considerations in much detail. Teorey and Fry (1982) and Wiederhold (1983) are two possible sources. Hanson's (1982) chapter, "Choice of File Organization," is excellent but is more meaningful after you read the material in the later chapters of this text. You can learn a lot about alternative types of file organization and access by studying descriptions of options available in certain languages and file management systems. PL/I offers a particularly rich set of alternatives, and Pollack and Sterling (1980) describe them thoroughly.

Sweet (1985) is a short but stimulating article on key field design. A number of interesting algorithms for improving performance in sequential searches are described in Gonnet (1984) and, of course, Knuth (1973b). Lapin (1987) provides a detailed coverage of portability in UNIX and C programming. For our coverage of XDR, we used the documentation in Sun (1986).

Our primary source of information on FITS is not formally printed text, but online materials. A good paper defining the original FITS format is Wells (1981). The FITS image and FITS header shown in this chapter, as well as the documentation of how FITS works, can (at the time of writing, at least) be found on an anonymous ftp server at the INTERNET address 128.183.10.4.
C Programs
The programs listed in the following pages correspond to the programs discussed in the text. The programs are contained in the following files:

writstrm.c    Writes out name and address information as a stream of consecutive bytes.

readstrm.c    Reads a stream file as input and prints it to the screen.

writrec.c     Writes a variable-length record file that uses a byte count at the beginning of each record to give its length.

readrec.c     Reads through a file, record by record, displaying the fields from each of the records on the screen.

getrf.c       Contains support functions for reading individual records or fields. These functions are needed by programs in readrec.c and find.c.

find.c        Searches sequentially through a file for a record with a particular key.

makekey.c     Combines first and last names and converts them to a key in canonical form. Calls strtrim( ) and ucase( ), found in strfuncs.c.

strfuncs.c    Contains two string support functions: strtrim( ) trims the blanks from the ends of strings; ucase( ) converts alphabetic characters to uppercase.

update.c      Allows new records to be added to a file or old records to be changed.

Fileio.h

All of the programs include a header file called fileio.h, which contains some useful definitions. Some of these are system dependent. If the programs were to be run on a UNIX system, fileio.h might look like this:

/*  fileio.h ...  header file containing file I/O definitions  */
#include <stdio.h>
#include <fcntl.h>

#define PMODE           0755
#define DELIM_STR       "|"
#define DELIM_CHR       '|'

#define out_str(fd,s)           write((fd),(s),strlen(s)); \
                                write((fd),DELIM_STR,1)

#define fld_to_recbuff(rb,s)    strcat((rb),(s)); \
                                strcat((rb),DELIM_STR)

#define MAX_REC_SIZE    512
Writstrm.c

/*  writstrm.c ...
    creates name and address file that is strictly a stream of
    bytes (no delimiters, counts, or other information to
    distinguish fields and records).

    A simple modification to the out_str macro:

        #define out_str(fd,s)   write((fd),(s),strlen(s)); \
                                write((fd),DELIM_STR,1);

    changes the program so that it creates delimited fields.
*/

#include "fileio.h"

#define out_str(fd,s)   write((fd),(s),strlen(s))

main()
{
    char first[30], last[30], address[30], city[20];
    char state[15], zip[9];
    char filename[15];
    int  fd;

    printf("Enter the name of the file you wish to create: ");
    gets(filename);
    if ((fd = creat(filename, PMODE)) < 0) {
        printf("file opening error --- program stopped\n");
        exit(1);
    }

    printf("\n\nType in a last name (surname), or <CR> to exit\n>>>");
    gets(last);
    while (strlen(last) > 0) {
        printf("\nFirst Name:");
        gets(first);
        printf("   Address:");
        gets(address);
        printf("      City:");
        gets(city);
        printf("     State:");
        gets(state);
        printf("       Zip:");
        gets(zip);

        /* output the strings to the buffer and then to the file */
        out_str(fd, last);
        out_str(fd, first);
        out_str(fd, address);
        out_str(fd, city);
        out_str(fd, state);
        out_str(fd, zip);

        /* prepare for next entry */
        printf("\n\nType in a last name (surname), or <CR> to exit\n>>>");
        gets(last);
    }

    /* close the file before leaving */
    close(fd);
}
Readstrm.c

/* readstrm.c ...
   reads a stream of delimited fields
*/

#include "fileio.h"

int readfield(int fd, char s[]);

main( )
{
    int  fd, n;
    char s[30];
    char filename[15];
    int  fld_count;

    printf("Enter name of file to read: ");
    gets(filename);
    if ((fd = open(filename, O_RDONLY)) < 0) {
        printf("file opening error ... program stopped\n");
        exit(1);
    }

    /* main program loop -- calls readfield( ) for as long
       as the function succeeds                             */
    fld_count = 0;
    while ((n = readfield(fd,s)) > 0)
        printf("\tfield # %3d:  %s \n", ++fld_count, s);

    close(fd);
}

int readfield(int fd, char s[])
{
    int  i;
    char c;

    i = 0;
    while (read(fd,&c,1) > 0 && c != DELIM_CHR)
        s[i++] = c;
    s[i] = '\0';        /* append null to end string */
    return (i);
}
Writrec.c

/* writrec.c ...
   creates name and address file using fixed length (2-byte)
   record length field ahead of each record
*/

#include "fileio.h"

char recbuff[MAX_REC_SIZE + 1];
char *prompt[] = {
    "Enter Last Name -- or <CR> to exit: ",
    "      First name: ",
    "         Address: ",
    "            City: ",
    "           State: ",
    "             Zip: ",
    ""          /* null string to terminate the prompt loop */
};

main( )
{
    int   fd, i;
    short rec_lgth;
    char  response[50];
    char  filename[15];

    printf("Enter the name of the file you wish to create: ");
    gets(filename);
    if ((fd = creat(filename, PMODE)) < 0) {
        printf("file opening error ... program stopped\n");
        exit(1);
    }

    printf("\n\n%s", prompt[0]);
    gets(response);
    while (strlen(response) > 0) {
        recbuff[0] = '\0';
        fld_to_recbuff(recbuff, response);
        for (i = 1; *prompt[i] != '\0'; i++) {
            printf("%s", prompt[i]);
            gets(response);
            fld_to_recbuff(recbuff, response);
        }

        /* write out the record length and buffer contents */
        rec_lgth = strlen(recbuff);
        write(fd, &rec_lgth, sizeof(rec_lgth));
        write(fd, recbuff, rec_lgth);

        /* prepare for next entry */
        printf("\n\n%s", prompt[0]);
        gets(response);
    }

    /* close the file before leaving */
    close(fd);
}

/* question:
   How does the termination condition work in the for loop:

        for (i = 1; *prompt[i] != '\0'; i++)

   What does the "" refer to?  Why do we need the "*"?
*/
Readrec.c

/* readrec.c ...
   reads through a file, record by record, displaying the
   fields from each of the records on the screen.
*/

#include "fileio.h"

main( )
{
    int   fd, rec_count, fld_count;
    int   scan_pos;
    short rec_lgth;
    char  filename[15];
    char  recbuff[MAX_REC_SIZE + 1];
    char  field[MAX_REC_SIZE + 1];

    printf("Enter name of file to read: ");
    gets(filename);
    if ((fd = open(filename, O_RDONLY)) < 0) {
        printf("file opening error ... program stopped\n");
        exit(1);
    }

    rec_count = 0;
    scan_pos  = 0;
    while ((rec_lgth = get_rec(fd, recbuff)) > 0) {
        printf("Record %d\n", ++rec_count);
        fld_count = 0;
        while ((scan_pos = get_fld(field, recbuff, scan_pos, rec_lgth)) > 0)
            printf("\tField %d:  %s \n", ++fld_count, field);
    }
    close(fd);
}

/* question -- why can we assign to scan_pos just once, outside
   of the while loop for records?
*/
Getrf.c

/* getrf.c ...
   Two functions used by programs in readrec.c and find.c:

   get_rec( )   reads a variable length record from file fd
                into the character array recbuff.

   get_fld( )   moves a field from recbuff into the character
                array field, inserting a '\0' to make it a
                string.
*/

#include "fileio.h"

get_rec(int fd, char recbuff[])
{
    short rec_lgth;

    if (read(fd, &rec_lgth, 2) == 0)     /* get record length   */
        return(0);                       /* return 0 if EOF     */

    /* read record */
    rec_lgth = read(fd, recbuff, rec_lgth);
    return(rec_lgth);
}

get_fld(char field[], char recbuff[], short scan_pos, short rec_lgth)
{
    short fpos = 0;                 /* position in "field" array */

    if (scan_pos == rec_lgth)       /* if no more fields to read, */
        return(0);                  /* return scan_pos of 0.      */

    /* scanning loop */
    while (scan_pos < rec_lgth &&
          (field[fpos++] = recbuff[scan_pos++]) != DELIM_CHR)
        ;

    if (field[fpos-1] == DELIM_CHR) /* if last character is a field   */
        field[--fpos] = '\0';       /* delimiter, replace with null   */
    else
        field[fpos] = '\0';         /* otherwise, just ensure that
                                       the field is null-terminated   */

    return(scan_pos);   /* return position of start of next field */
}
Find.c

/* find.c ...
   searches sequentially through a file for a record with a
   particular key.
*/

#include "fileio.h"

#define TRUE    1
#define FALSE   0

main( )
{
    int   fd, scan_pos;
    short rec_lgth;
    int   matched;
    char  search_key[30], key_found[30];
    char  last[30], first[30];
    char  filename[15];
    char  recbuff[MAX_REC_SIZE + 1];
    char  field[MAX_REC_SIZE + 1];

    printf("Enter name of file to search: ");
    gets(filename);
    if ((fd = open(filename, O_RDONLY)) < 0) {
        printf("file opening error ... program stopped\n");
        exit(1);
    }

    /* get search key */
    printf("\n\nEnter last name: ");
    gets(last);
    printf("\nEnter first name: ");
    gets(first);
    makekey(last, first, search_key);

    matched = FALSE;
    while (!matched && (rec_lgth = get_rec(fd, recbuff)) > 0) {
        scan_pos = 0;
        scan_pos = get_fld(last, recbuff, scan_pos, rec_lgth);
        scan_pos = get_fld(first, recbuff, scan_pos, rec_lgth);
        makekey(last, first, key_found);
        if (strcmp(key_found, search_key) == 0)
            matched = TRUE;
    }

    /* if record found, print the fields */
    if (matched) {
        printf("\n\nRecord found:\n\n");
        scan_pos = 0;
        /* break out the fields */
        while ((scan_pos = get_fld(field, recbuff, scan_pos, rec_lgth)) > 0)
            printf("\t%s\n", field);
    }
    else
        printf("\n\nRecord not found.\n");
}

/* questions:
   - why does scan_pos get set to zero inside the while loop here?
   - what would happen if we wrote the loop that reads records
     like this:  while ((rec_lgth = get_rec(fd,recbuff)) > 0 && !matched)
*/
Makekey.c

/* makekey(last, first, s) ...
   function to make a key from the first and last names passed
   through the function's arguments.  Returns the key in
   canonical form through the address passed through the
   argument s.  Calling routine is responsible for ensuring
   that s is large enough to hold the return string.
   Value returned through the function name is the length of
   the string returned through s.
*/

makekey(char last[], char first[], char s[])
{
    int lenl, lenf;

    /* trim the last name */
    lenl = strtrim(last);

    /* place it in the return string */
    strcpy(s, last);

    /* append a blank at the end */
    s[lenl++] = ' ';
    s[lenl]   = '\0';

    /* trim the first name */
    lenf = strtrim(first);

    /* append it to the string */
    strcat(s, first);

    /* convert everything to uppercase */
    ucase(s,s);

    return(lenl + lenf);
}
Strfuncs.c

/* strfuncs.c ...
   module containing the following functions:

   strtrim(s)   trims blanks from the end of the (null-terminated)
                string referenced by the string address s.  When
                done, the parameter s points to the trimmed string.
                The function returns the length of the trimmed
                string.

   ucase(si,so) converts all lowercase alphabetic characters in
                the string at address si into uppercase characters,
                returning the converted string through the address
                so.
*/

strtrim(char s[])
{
    int i;

    for (i = strlen(s)-1;  i >= 0 && s[i] == ' ';  i--)
        ;

    /* now that the blanks are trimmed, reaffix null on the end
       to form a string */
    s[++i] = '\0';
    return(i);
}

ucase(char si[], char so[])
{
    while (*so++ = (*si >= 'a' && *si <= 'z') ? *si & 0x5f : *si)
        si++;
}
Update.c

/* update.c ...
   program to open or create a fixed length record file for
   updating.  Records may be added or changed.  Records to be
   changed must be accessed by relative record number.
*/

#include "fileio.h"

#define REC_LGTH 64

static char *prompt[] = {"       Last Name: ",
                         "      First name: ",
                         "         Address: ",
                         "            City: ",
                         "           State: ",
                         "             Zip: ",
                         ""};

static int fd;
static struct {
    short rec_count;
    char  fill[30];
} head;

static menu( );
static ask_info(char recbuff[]);
static ask_rrn( );
static read_and_show( );
static change( );

main( )
{
    int  menu_choice, rrn, byte_pos;
    char filename[15];
    long lseek( );
    char recbuff[MAX_REC_SIZE + 1];      /* buffer to hold a record */

    printf("Enter the name of the file: ");
    gets(filename);
    if ((fd = open(filename, O_RDWR)) < 0)    /* if OPEN fails        */
    {
        fd = creat(filename, PMODE);          /* then CREAT           */
        head.rec_count = 0;                   /* initialize header    */
        write(fd, &head, sizeof(head));       /* write header rec     */
    }
    else
        /* existing file opened -- read in header */
        read(fd, &head, sizeof(head));

    /* main program loop -- call menu and then jump to options */
    while ((menu_choice = menu( )) < 3) {
        switch (menu_choice) {
          case 1:                        /* add a new record */
            printf("Input the information for the new record --\n\n");
            ask_info(recbuff);
            byte_pos = head.rec_count * REC_LGTH + sizeof(head);
            lseek(fd, (long) byte_pos, 0);
            write(fd, recbuff, REC_LGTH);
            head.rec_count++;
            break;
          case 2:                        /* update existing record */
            rrn = ask_rrn( );
            /* if rrn is too big, print error message ... */
            if (rrn >= head.rec_count) {
                printf("Record Number is too large");
                printf(" ... returning to menu ...");
                break;
            }
            /* otherwise, seek to the record ... */
            byte_pos = rrn * REC_LGTH + sizeof(head);
            lseek(fd, (long) byte_pos, 0);
            /* ... display it and ask for changes */
            read_and_show( );
            if (change( )) {
                printf("\n\nInput the revised values: \n\n");
                ask_info(recbuff);
                lseek(fd, (long) byte_pos, 0);
                write(fd, recbuff, REC_LGTH);
            }
            break;
        }       /* end switch */
    }           /* end while  */

    /* rewrite correct record count to header before leaving */
    lseek(fd, 0L, 0);
    write(fd, &head, sizeof(head));
    close(fd);
}

/* menu( ) ...
   local function to ask user for next operation.
   Returns numeric value of user response
*/
static menu( )
{
    int  choice;
    char response[10];

    printf("\n\n\n\n          FILE UPDATING PROGRAM\n");
    printf("\n\nYou May Choose to:\n\n");
    printf("\t1.  Add a record to the end of the file\n");
    printf("\t2.  Retrieve a record for Updating\n");
    printf("\t3.  Leave the Program\n\n");
    printf("Enter the number of your choice: ");
    gets(response);
    choice = atoi(response);
    return(choice);
}

/* ask_info( ) ...
   local function to accept input of name and address fields,
   writing them to the buffer passed as a parameter
*/
static ask_info(char recbuff[])
{
    int  i;
    char response[50];

    /* clear the record buffer */
    for (i = 0; i < REC_LGTH; recbuff[i++] = '\0')
        ;

    /* get the fields */
    for (i = 0; *prompt[i] != '\0'; i++) {
        printf("%s", prompt[i]);
        gets(response);
        fld_to_recbuff(recbuff, response);
    }
}

/* ask_rrn( ) ...
   local function to ask for the relative record number of the
   record that is to be updated.
*/
static ask_rrn( )
{
    int  rrn;
    char response[10];

    printf("\n\nInput the Relative Record Number of the Record that\n");
    printf("\tyou want to update: ");
    gets(response);
    rrn = atoi(response);
    return(rrn);
}

/* read_and_show( ) ...
   local function to read and display a record.  Note that this
   function does not include a seek -- reading starts at the
   current position in the file
*/
static read_and_show( )
{
    char recbuff[MAX_REC_SIZE + 1];
    char field[MAX_REC_SIZE + 1];
    int  scan_pos, data_lgth;

    scan_pos = 0;
    read(fd, recbuff, REC_LGTH);
    printf("\n\n\n\nExisting Record Contents\n");

    /* ensure that record ends with null */
    recbuff[REC_LGTH] = '\0';

    data_lgth = strlen(recbuff);
    while ((scan_pos = get_fld(field, recbuff, scan_pos, data_lgth)) > 0)
        printf("\t%s\n", field);
}

/* change( ) ...
   local function to ask user whether or not he wants to change
   the record.  Returns 1 if the answer is yes, 0 otherwise
*/
static change( )
{
    char response[10];

    printf("\n\nDo you want to change this record?\n");
    printf("    Answer Y or N, followed by <CR> ==> ");
    gets(response);
    ucase(response, response);
    return((response[0] == 'Y') ? 1 : 0);
}
Pascal Programs

The Pascal programs listed in the following pages correspond to the programs discussed in the text. Each program is organized into one or more files, as follows.

writstrm.pas   Writes out name and address information as a stream of consecutive bytes.

readstrm.pas   Reads a stream file as input and prints it to the screen.

writrec.pas    Writes a variable length record file that uses a byte count at the beginning of each record to give its length.

readrec.pas    Reads through a file, record by record, displaying the fields from each of the records on the screen.

get.prc        Support functions for reading individual records or fields. These functions are needed by the program in readrec.pas.

find.pas       Searches sequentially through a file for a record with a particular key.

update.pas     Allows new records to be added to a file, or old records to be changed.

stod.prc       Support function for update.pas, which converts a variable of type strng to a variable of type datarec.

In addition to these files, there is a file called tools.prc, which contains the tools for operating on variables of type strng. A listing of tools.prc is contained in an appendix at the end of the textbook.

We have added line numbers to some of these Pascal listings to assist the reader in finding specific program statements. The files that contain Pascal functions or procedures but do not contain main programs are given the extension .prc, as in get.prc and stod.prc.
Writstrm.pas

Some things to note about writstrm.pas:

■ The comment {$B-} on line 6 is a directive to the Turbo Pascal compiler, instructing it to handle keyboard input as a standard Pascal file. Without this directive we would not be able to handle the len_str( ) function properly in the WHILE loop on line 36.

■ The comment {$I tools.prc} on line 24 is also a directive to the Turbo Pascal compiler, instructing it to include the file tools.prc in the compilation. The procedures read_str, len_str, and fwrite_str are in the file tools.prc.

■ Although Turbo Pascal supports a special string type, we choose not to use that type here to come closer to conforming to standard Pascal. Instead, we create our own strng type, which is a packed array [0..MAX_REC_SIZE] of char. The length of the strng is stored in the zeroth byte of the array as a character value. If X is the character value in the zeroth byte of the array, then ORD(X) is the length of the string.

■ The assign statement on line 31 is one that is nonstandard. It is a Turbo Pascal procedure, which, in this case, assigns filename to outfile, so all further operation on outfile will operate on the disk file.

PROGRAM writstrm (INPUT,OUTPUT);

{ writes out name and address information as a stream of
  consecutive bytes }

{$B-}    { Directive to the Turbo Pascal compiler, instructing it to
           handle keyboard input as a standard Pascal file }
8
9
1
CONST
DELIM_CHR
MAX_REC_SIZE
11
'I';
=
255;
12
13
14
TYPE
strng
inp_list
filetype
15
16
17
18
19
20
21
22
23
=
=
MAX_REC_S ZE of char;
packed array
first address ,c ity, state zip)
( last
packed arrayC1..40l of char;
[
VAR
response
resp_type
filename
outfile
array [inp
inp_list;
filetype;
text;
list]
of
strng
it
to
PASCAL PROGRAMS: READSTRM.PAS
{$1
24
25
26
27
28
tools. pre)
Another directive,
too Is. pre
30
31
instructing the compiler to include the file
);
32
33
writeC'Type in a last name, or press <CR>
read 5 trCresponset last )
while ( 1 en_5 t r response las t ) >
)
DO
BEGIN
{
get all the input for one person }
write(
First Name: ');
r ead_5 tr(response[first
)
write(
Addr es s
)
read s trCresponsetaddress] )
34
35
36
37
38
39
40
to
exit:
);
'
41
'
42
43
44
45
46
47
48
49
writeC
City: );
ead s trCresponsetcity] )
writeC'
State: ');
r ead_5 trCresponset state]
writeC'
Zip: ');
r
)-,
read_5 trCresponselzip]);
write the responses to the file }
resp_type := last TO zip DO
f wr i t e_5 trCoutfile, response! resp_type
50
51
for
52
53
54
start the next round of
input
writeC'Type in a last name, or press <CR>
read_5 t rC response [last])
55
56
57
58
59
69
BEGIN {main}
writeC Enter the name of the file:
readlnCfi lename)
assignCoutfi le f i lename)
rewriteCoutfi le)
29
to
exit:
');
END;
closeCoutfile)
END.
Readstrm.pas
PROGRAM readstrm
{
NPUT OUTPUT)
,
program that reads a stream file Cfields separated by
delimiters) as input and prints it to the screen >
A
CONST
DELIM_CHR
MAX_REC_SIZE
!
=
255;
(continued)
170
PASCAL PROGRAMS
TYPE
5 1
rng
letype
packed array
MAX_REC_S ZE
packed array M..40] of char;
lename
nf
of
char;
VAR
f
f
letype
str
integer
i nt eger
5 1 rng
ld_count
ld_len
{$1
text
le
tools. prc>
FUNCTION readfield (VAR infile
{
text; VAR str
strng):
int eger
Function readfield reads characters from file infile until it
reaches end of file of a "I".
Readfield puts the characters in
str and returns the length of str >
VAR
i
ch
integer;
char
BEGIN
i
:=
0;
ch
'
while (not EOF(infile)) and (ch <> DELIM_CHR) DO
BEGIN
read (infile, ch);
i
s t r
=
[
ch
END;
i
:=
strCOJ := CHR(i);
readf ield
=
i
END;
:
BEGIN {MAIN}
write ('Enter the name of the file that you wish to open:
readln (filename);
assign(infile,filename);
reset (infile);
f
ld_coun t
');
fld_len := readf ie 1 d( i nf i le 5 t r )
while (fld_len > 0) DO
BEGIN
fld_count := fld_count +
2)
writeC field #
f 1 d_count
{
write_str()
wr i te_st r(st r )
,
'
'
is
in
tools. pre
PASCAL PROGRAMS: WRITREC.PAS
fld_len
:=
END;
ose
( i
nf
readfield(infile str)
7
le)
END.
Writrec.pas

Note about writrec.pas: After writing the rec_lgth to outfile on line 69, we write a space to the file. This is because in Pascal, values to be read into integer variables must be separated by spaces, tabs, or end-of-line markers.

PROGRAM writrec (INPUT,OUTPUT);
{$B->
CONST
DELIM_CHR =
MAX_REC_SIZE
8
9
TYPE
5;
255;
5trng = packed array
filetype = packed array
[
1
1
'I'
12
13
14
15
16
17
18
19
20
MAX_REC_S ZE
C 1
.40
of
of
char;
char;
VAR
filename
outfile
response
buffer
rec_lgth
{$1
filetype;
text;
5 t r
ng
strng;
integer;
too Is. prc>
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
PROCEDURE
C
ld_to_buf er(VAR buff:
This procedure concatenates
buff >
strng;
and
s:
strng);
delimiter to end of
VAR
strng;
d_5 t r
BEGIN
cat_str(buff,s);
d_str[0] := CHRC1 );
:= DELIM_CHR;
d_str[1
cat_str(buff d_s t r
:
END;
(continued)
172
37
38
39
40
41
PASCAL PROGRAMS
BEGIN {main}
write( 'Enter the name of the file you wish to create
readln(filename)
assignCoutfile,filename)
rewriteCoutfile);
');
42
43
44
45
46
47
48
49
50
writeC'Enter Last Name -- or <CR> to exit: );
read_5 t r( response)
while ( 1 en_st r response) > 0) DO
BEGIN
=
bufferCO]
CHR(O);
{Set length of string
in buffer to
>
f 1 d_t o_buf f er (buffer response)
;
51
writeC
52
53
54
55
56
57
58
59
60
ead_5 t r( response)
f 1 d__t o_b ufferCbuffer
response)
First name
writeC
Address:');
rea d_5 t r( response)
f 1 d_t o_buf f er (buffer response)
;
writeC
read_5 tr(response)
f 1 d_t o_buf f er (buffer response)
writeC'
r ead_5 trCresponse)
f 1 d_t o_buf f er Cbuffer response)
writeC'
read_5 t rC response)
f 1 d_t o_buf f er Cbuffer response)
City:
'
61
State:
);
62
63
64
65
66
67
68
69
Zip:');
write out the record length and buffer contents
{
rec_lgth := 1 en_s t r C buf f er )
writeCoutfile,rec_lgth)
wr i t eCout f i le
)
f wr i t e_5 1 rCoutfile, buffer);
>
70
71
'
'
72
73
74
75
76
77
78
{
prepare for next entry >
writeC'Enter Last Name -- or <CR> to exit:
read_5 t rC response)
END;
closeCoutfile)
END.
Readrec.pas
PROGRAM readrec
{
');
NPUT OUTPUT)
,
This program reads through a file, record by record, displaying
the fields from each of the records on the screen. >
PASCAL PROGRAMS: READREC.PAS
173
{$B->
CONST
=
i nput_5 i ze
255
DELIM_CHR = h
MAX_REC_SIZE = 255;
;
TYPE
npu t_5 i ze
of char;
strng = packed array
filetype = packed array M..40] of char;
[
VAR
f
i 1
ename
out f
le
ec_coun
5can_po5
{$1
{$1
ex t
nt eger
ec_l gt h
1
tools. pre
get. prc>
eger
i nt eger
i nt eger
strng;
strng;
i
d_coun
buffer
field
f
letype
BEGIN {main}
write( 'Enter name of file to read:
readln (filename);
as5ign(outfile,filename)
reset (outfile);
');
rec_count
=
5can_po5
=
rec_lgth := ge t_rec out f i le buff er )
while rec_lgth >
DO
BEGIN
,rec_count);
wr telnC Record
rec_count := rec_count +
f ld_count
=
5can_po5 := ge t_f 1 d( ield buf f er scan_pos r ec_l gt h )
while 5can_po5 >
DO
BEGIN
);
writeC
f 1 d_count
Field
writ e_5 tr(field);
fld_count := fld_count + 1;
5can_po5 := get_f 1 d( f i e 1 d buf f er scan_pos r ec_l g t h
:
'
END;
rec_lgth
:=
END;
close(outfile)
END.
get_r ec out f i le buf f er
74
PASCAL PROGRAMS
Get.prc
FUNCTION get_rec(VAR fd:
VAR buffer:
text;
strng);
integer;
{
A function that reads a record and its length from file fd.
The function returns the length of the record. If EOF is
encountered get_rec() returns
>
VAR
rec_lgth
integer;
space
char
BEGIN
if EOF(fd) then
get_r ec
=
else
BEGIN
readCfd ,rec_lgth)
r ead( f d
space)
f r ead_5 tr(fd, buffer, r ec_l gt h )
get_rec := rec_lgth
END
:
END;
FUNCTION get_fld(VAR
rng buff er
;
rec_lgth:
{
integer):
rng VAR scanpos: integer;
integer;
;
function that starts reading at scanpos and reads characters
from the buffer until it reaches a delimiter or the end of the
record. It returns scanpos for use on the next call. }
VAR
f
pos
nt eger
BEGIN
scanpos = rec_lgth then
get_fld :=
else
BEGIN
if
pos
scanpos := scanpos +
fieldCfposl := buf fer scanpos
while (f ieldtfpos] <> DELIM_CHR) and (scanpos
BEGIN
fpos := fpos
1;
scanpos := scanpos +1;
fieldCfposl := buf fer scanpos
1
<
rec_lgth)DO
END;
if
fieldCfpos]
f ieldCO]
:
DELIM_CHR then
CHRCfpos -
:=
=
,v~l-
PASCAL PROGRAMS: FIND.PAS
175
else
fieldCO] := CHR(fpos);
get_fld := scanpos
END
END;
Find. pas
PROGRAM find
i
NPUT OUTPUT)
,
This program reads through a file, record by record, looking
for a record with a particular key.
If a match occurs, when
all the fields in the record are displayed.
Otherwise a message
is displayed indicating that the record was not found. >
{$B->
CONST
MAX_REC_SIZE
DELIM_CHR =
255;
=
'I'
TYPE
strng = packed array
MAX_REC_S IZE of char;
filetype = packed array C1..403 of char;
[
VAR
f
i 1
ename
out f
last
le
first
search_key
length
matched
rec_l gt
buffer
5can_po5
key_f ound
field
$
{ $
tools, pre
get pre >
letype
ext
s t rng
s t rng
s t rng
t
nt eger
boolean
nt eger
strng
i nt eger
strng;
strng;
;
BEGIN {main}
writeC'Enter name of file to search:
readlnCf i lename)
);
as5ign(outfile,filename)
reset(outfile);
(continued)
76
PASCAL PROGRAMS
writeCEnter last name:
read_5 t
r (
las
');
t )
writeCEnter first name:
');
read_5 1 r(first);
makekeydast first sear ch_k ey )
,
matched := FALSE;
rec_lgth := ge t_rec out f i 1 e buff er )
while ((not matched) and (rec_lgth
Beg
>
0)) DO
=
5can_po5
5can_po5 := ge t_f 1 d( las t buff er scan_pos rec_l gth )
5can_po5 := get_f 1 d( f i r 5 t buf f er scan_pos rec_l gt h )
makekey(last first k ey_f ound )
if cmp_s t r ( key_f ound search_key ) =
then
matched := TRUE
else
rec_lgth := get_rec(out f i 1 e buf f er )
:
END;
lose( out
if
le)
record found, print the fields
mat ched t hen
BEGIN
wr i t e 1 n( Record found:');
if
wr
i t
>
5can_po5
{
break out the fields >
5can_po5 := ge t_f 1 d( f i e 1 d buf f er scan_pos r ec_l gt h )
while 5can_po5 >
DO
BEGIN
writ e_5 tr(field)
5can_po5 := get_f 1 d( f i e Id buf f er scan_pos rec_l gt h
,
END;
END
else
writeln(
'
Record not found.
);
END.
Update.pas

Some things to note about update.pas:

■ In the procedure ask_info( ), the name and address fields are read in as strngs, and procedure fld_to_buffer( ) writes the fields to strbuff (also of type strng). Writing strbuff to outfile would result in a type mismatch, since outfile is a file of type datarec. However, the procedure stod( ), located in stod.prc, converts a variable of type strng to a variable of type datarec to write the buffer to the file. The calls to stod( ) are located on lines 210 and 237.

■ The seek( ) statements on lines 212, 229, 239, and 250 are not standard; they are features of Turbo Pascal.

PROGRAM update (INPUT,OUTPUT);
{$B->
{
program to open or create a fixed length record file for
updating.
Records may be added or changed.
Records to be
changed must be accessed by relative record number }
CONST
MAX_REC_SIZE
REC_LGTH
DELIMCHR
255;
64;
'I';
TYPE
s t r
packed array
MAX_REC_S ZE of char;
packed array M..40] of char;
RECORD
len
integer;
data
packed array
REC_LGTH of cha
ng
filet ype
datarec
END;
VAR
f
i 1
ou
ename
f
i 1
response
menu cho ice
strbuff
by t
head
r r n
drecbuff
i
ec_count
{$1
{$1
tools. pr c
>
stod.prc
>
{ $
get
pre
PROCEDURE
{
d__t
s t r ng
integer
datarec
integer
datarec
integer
integer
pos
filetype;
file of datarec
char
i nt eger
;
>
ld_to_buf er(VAR buff: strng;
o_buf f er concatenates strng
end of buff
and
s:
strng);
delimiter to the
>
(continued)
178
PASCAL PROGRAMS
VAR
d_str
strng;
BEGIN
ca t_5 t r ( but f 5 )
d_str[0] := CHRC1);
d_strt
=
DELIM_CHR;
cat_str(buf f ,d_str)
:
END;
FUNCTION menu integer
{local function to ask user for next operation.
value of user response }
Returns numeric
VAR
choice
int eger
BEGIN
writeln;
writelnC
FILE UPDATING PROGRAM' )
writeln;
writelnC'You May Choose to: );
writeln;
writelnC'
Add a record to the end of the file');
writelnC'
2.
Retrieve a record for updating');
writelnC'
3.
Leave the program');
writeln;
writeC'Enter the number of your choice: ');
r eadl nCchoice)
writeln;
=
menu
choice
:
'
END;
PROCEDURE ask_infoCVAR strbuff: strng);
{local procedure to accept input of name and address fields
writing them to the buffer passed as a parameter >
VAR
response
strng;
BEGIN
{
clear the record buffer
clear_strCbuff)
:
get the fields }
{
writeC'
Last Name: ');
r ead_5 trCresponse)
f 1 d_t o_buf f er C strbuff response)
writeC'
First Name: ');
r ead_5 trCresponse)
buffer Cstrbuff response)
f 1 d
t o
;
179
PASCAL PROGRAMS: UPDATE.PAS
writeC
Address:
)
ead_5 tr(response);
f 1 d_t o_b ufferCstrbuff response)
94
95
96
97
98
99
100
1
01
02
City:
');
read_5 tr(response);
f 1 d_ to bufferCstrbuff, response);
writeC'
State: ');
r ead_5 tr( response);
f 1 d_t o_buf f er (strbuff
response);
writeC'
Zip:');
r ead_5 tr(response);
fid' to buff er C strbuff response);
,
03
04
07
writeC
105
106
1
'
'
wr
i t
END;
108
09
110
1
12
13
FUNCTION ask_rrn:
integer;
function to ask for the relative record number of the record
that is to be updated.
>
114
1
15
116
1
17
18
19
120
VAR
rrn
integer;
BEGIN
i t e 1 n(
npu t the relative record number of the record that');
writeC'
you want to update: ');
read ln(rrn)
wr
'
wr i t e 1 n
122
as k r un
=
rrn
123
END;
124
PROCEDURE read_and_show;
125
126 {procedure to read and display a record. This procedure does not
127
include a seek -- reading starts at the current file position
128
129 VAR
130
5can_po5
i n t eger
131
dr ecbuf f
da t ar ec
132
integer
133
data__l g t h
integer
134
field
5 t r ng
135
strbuff
5 t r ng
136
BEGIN
137
scan pos
138
readCoutfi le drecbuff)
139
140
<
convert drecbuff to type strng }
141
strbuffCO] := CHR( drecbuff 1 en )
142
for i :=
to drecbuff .len DO
121
>
(continued)
80
PASCAL PROGRAMS
143:
strbufffi] := dr ecbuf f da t a [ i
144:
145:
wr i t e 1 n( Ex i s t i ng Record Contents');
146:
writeln;
147:
148:
data_lgth := 1 en_s t r ( 5 rbuf f )
149:
5can_po5 := get_f 1 d( f i e 1 d 5 rbuf f scan_pos da ta_l g t h )
while scan_pos >
D)
150:
BEGIN
151:
write_5tr(f ield)
152:
153:
5can_po5 := ge t_f 1 d( f i e 1 d 5 rbuf f scan = pos,data:= lgth)
154:
END
155: END;
156:
157:
158: FUNCTION change: integer;
159:
160: { function to ask the user whether or not to change the
161:
record.
Returns
if the answer is yes,
otherwise.
162:
163: VAR
164:
char;
response
165: BEGIN
166:
writeln('Do you want to change this record?');
167:
wnteC
Answer Y or N, followed by <CR> = = >);
168:
readln(response);
1 69:
writeln;
y ) then
170:
if (response =
Y ) or (response =
=
171:
change
172:
else
73
change
=
174: END;
175: BEGIN {main}
176:
write( 'Enter the name of the file: ');
177:
read 1 n( f i 1 ename )
178:
ass i gn( out f i 1 e f i 1 ename )
179:
180:
write('Does this file already exist? (respond Y or N): ');
181:
read ln( response )
writeln
82:
183:
if (response = Y') OR (response = y') then
184:
BEGIN
185:
open outfile
>
r ese t ( ou t f i 1 e )
{
186:
get header
>
read(out f i le ,head)
{
187:
{
read in record count >
rec_count := head.len
188:
END
189:
else
190:
BEGIN
create outfile
}
191:
rewr te(outf i le)
(
initialize record count }
{
192:
rec_count := 0;
]
'
'
'
'
'
PASCAL PROGRAMS: UPDATE.PAS
193:
194:
195:
196:
197:
198:
199:
200:
201:
202:
203:
204:
205:
206:
20 7:
208:
209:
210:
211:
212:
213:
214:
215:
216:
217:
218:
219:
220:
221:
222:
223:
224:
225:
226:
227:
228:
229:
230:
231:
232:
233:
234:
235:
236:
237:
238:
239:
240:
241
rec_count;
to REC_LGTH DO
head.dataU] := CHR(O);
wr i te( out f i 1 e head )
head.len
for
:=
:=
place in header record
>
set header data to nulls)
write header rec
}
{
i
END;
main program loop -- call menu and then jump to options
{
menu_choice := menu;
while menu_choice < 3 DO
BEGIN
CASE menu_choice OF
1
>
add a new record >
i
BEGIN
writeln( Input the information for the new record --');
writeln;
writeln;
ask_info(strbuff );
{convert strbuff to type datarec}
stodCdrechbuf f st rbuf f )
rrn := rec_count + 1;
seek(outfile,rrn);
wr i t e( ou t f i 1 e dr ecbuf f )
rec count := rec count +
:
'
END;
2
update existing record
>
BEGIN
rrn
{
if
:=
ask_rrn;
rrn is too big, print error message ...
(rrn > rec_count) or (rrn < 1) then
BEGIN
wr i te( Record Number is out of range');
wr i teln( "... returning to menu...')
END
if
else
BEGIN
seek(outf
otherwise, seek to the record
le ,rrn)
...
>
display it and ask for changes
read_and_show
{
if
...
>
then
change =
BEGIN
writeln(' Input the revised Values: ');
ask_info(strbuff );
convert strbuff to type
{
stod(drecbuf f st rbuf f )
datarec }
seek(outf i le,rrn)
wr iteCoutf i le ,drecbuf f )
END
1
(continued)
82
PASCAL PROGRAMS
242
243
244
245
246
247
248
249
250
END;
menu_choice
END; { while
:=
menu
rewrite correct record count to header before leaving
{
head.len := rec
count;
seekCoutf i le 0)
write(outfile,head);
close(outfile)
,
251
252
253
END
END
CASE >
<
END
Stod.prc
PROCEDURE stod (VAR drecbuff: datarec; strbuff; strng);
{
A procedure that converts
variable of type datarec
variable of type strng to
>
VAR
i
i nt eger
BEGIN
drecbuff. len
:
for
:=
:=
mi n( REC_LGTH
to drecbuff. len DO
drecbuff. datati]
:=
:=
END;
:=
en_s
strbuffCi];
{
Clear the rest of the buffer
while i < REC_LGTH DO
BEGIN
drecbuff .datati]
END
'
'
r ( s t
rbuf f
) )
>
Organizing Files for Performance

CHAPTER OBJECTIVES

■ Look at several approaches to data compression.
■ Look at storage compaction as a simple way of reusing space in a file.
■ Develop a procedure for deleting fixed-length records that allows vacated file space to be reused dynamically.
■ Illustrate the use of linked lists and stacks to manage an avail list.
■ Consider several approaches to the problem of deleting variable-length records.
■ Introduce the concepts associated with the terms internal fragmentation and external fragmentation.
■ Outline some placement strategies associated with the reuse of space in a variable-length record file.
■ Provide an introduction to the idea underlying a binary search.
■ Undertake an examination of the limitations of binary searching.
■ Develop a keysort procedure for sorting larger files; investigate the costs associated with keysort.
■ Introduce the concept of a pinned record.
CHAPTER OUTLINE

5.1 Data Compression
    5.1.1 Using a Different Notation
    5.1.2 Suppressing Repeating Sequences
    5.1.3 Assigning Variable-length Codes
    5.1.4 Irreversible Compression Techniques
    5.1.5 Compression in UNIX

5.2 Reclaiming Space in Files
    5.2.1 Record Deletion and Storage Compaction
    5.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically
    5.2.3 Deleting Variable-length Records
    5.2.4 Storage Fragmentation
    5.2.5 Placement Strategies

5.3 Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching
    5.3.1 Finding Things in Simple Field and Record Files
    5.3.2 Search by Guessing: Binary Search
    5.3.3 Binary Search versus Sequential Search
    5.3.4 Sorting a Disk File in RAM
    5.3.5 The Limitations of Binary Searching and Internal Sorting

5.4 Keysorting
    5.4.1 Description of the Method
    5.4.2 Limitations of the Keysort Method
    5.4.3 Another Solution: Why Bother to Write the File Back?
    5.4.4 Pinned Records
We have already seen how important it is for the file system designer to consider how a file is to be accessed when deciding on how to create fields, records, and other file structures. In this chapter we continue to focus on file organization, but the motivation is a little different. We look at ways to organize or, in some cases, reorganize files in direct response to the need to improve performance.

In the first section we look at how we organize files to make them smaller. Compression techniques let us make files smaller by encoding the basic information in the file.

Next we look at ways to reclaim unused space in files to improve performance. Compaction is a batch process that we can use to purge holes of unused space from a file that has undergone many deletions and updates. Then we investigate dynamic ways to maintain performance by reclaiming space made available by deletions and updates of records during the life of a file.

In the third section we examine the problem of reorganizing files by sorting them to support simple binary searching. Then, in an effort to find a better sorting method, we begin a conceptual line of thought that will continue throughout the rest of this text: We find a way to improve file performance by creating an external structure through which we can access the file.

5.1 Data Compression
In this section we look at some ways to make files smaller. Smaller files

■ Use less storage, resulting in cost savings;
■ Can be transmitted faster, decreasing access time or, alternatively, allowing the same access time with lower and cheaper bandwidth; and
■ Can be processed faster sequentially.

Data compression involves encoding the information in a file in such a way as to take up less space. Many different techniques are available for compressing data. Some are very general and some are designed only for specific kinds of data, such as speech, pictures, text, or instrument data. The variety of data compression techniques is so large that we can only touch on the topic here, with a few examples.
5.1.1 Using a Different Notation

Remember our address file from Chapter 4? It had several fixed-length fields, including "state," "zip code," and "phone number." Fixed-length fields such as these are good candidates for compression. For instance, the "state" field in the address file required two ASCII bytes, 16 bits. How many bits are really needed for this field? Since there are only 50 states, we could represent all possible states with only six bits. (Why?) Thus, we could encode all state names in a single one-byte field, resulting in a space savings of one byte, or 50%, per occurrence of the state field.

This type of compression technique, in which we decrease the number of bits by finding a more compact notation,* is one of many compression techniques classified as redundancy reduction. The 10 bits that we were able to throw away were redundant in the sense that having 16 bits instead of 6 provided no extra information.

*Note that the original two-letter notation we used for "state" is itself a more compact notation for the full state name.
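As a concrete illustration, the following C fragment sketches one way such a compact encoding might be carried out. It is not taken from the programs listed in this book; the state_table array, its ordering, and the function names are assumptions made purely for illustration. The idea is simply to replace the two-character abbreviation with its index in a table, a value that fits comfortably in a single byte (and, in principle, in six bits).

#include <string.h>

/* Illustrative table of two-letter state codes; the ordering here is
   arbitrary and the table is truncated for brevity.                  */
static char *state_table[] = { "AL", "AK", "AZ", "AR", "CA", "WY" };
#define N_STATES  (sizeof(state_table) / sizeof(state_table[0]))

/* Encode a two-letter abbreviation as a one-byte table index.
   Returns -1 if the abbreviation is not found.                 */
int encode_state(char *abbrev)
{
    int i;
    for (i = 0; i < N_STATES; i++)
        if (strncmp(abbrev, state_table[i], 2) == 0)
            return i;                   /* fits in a single byte */
    return -1;
}

/* Decode the one-byte index back into the readable abbreviation. */
char *decode_state(int code)
{
    return (code >= 0 && code < N_STATES) ? state_table[code] : "??";
}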
What are the costs of this compression scheme? In this case, there are many:

■ By using a pure binary encoding, we have made the file unreadable by humans.
■ We incur some cost in encoding time whenever we add a new state name field to our file, and a similar cost for decoding whenever we need to get a readable version of the state name from the file.
■ We must also now incorporate the encoding and/or decoding modules in all software that will process our address file, increasing the complexity of the software.

With so many costs, is this kind of compression worth it? We can answer this only in the context of a particular application. If the file is already fairly small, if the file is often accessed by many different pieces of software, and if some of the software that will access the file cannot deal with binary data (e.g., an editor), then this form of compression is a bad idea. On the other hand, if the file contains several million records and is generally processed by one program, compression is probably a very good idea. Since the encoding and decoding algorithms for this kind of compression are extremely simple, the savings in access time is likely to exceed any processing time required for encoding or decoding.
5.1.2 Suppressing Repeating Sequences

Imagine an 8-bit image of the sky that has been processed so only objects above a certain brightness are identified and all other regions of the image are set to some background color represented by the pixel value 0. (See Fig. 5.1.)

Sparse arrays of this sort are very good candidates for compression of a sort called run-length encoding, which in this example works as follows. First, we choose one special, unused byte value to indicate that a run-length code follows. Then, the run-length encoding algorithm goes like this:

■ Read through the pixels that make up the image, copying the pixel values to the file in sequence, except where the same pixel value occurs more than once in succession.
■ Where the same value occurs more than once in succession, substitute the following three bytes, in order:
    - The special run-length code indicator;
    - The pixel value that is repeated; and
    - The number of times that the value is repeated (up to 256 times).

FIGURE 5.1 The empty space in this astronomical image is represented by repeated sequences of the same value and is thus a good candidate for compression. (This FITS image shows a radio continuum structure around the spiral galaxy NGC 891 as observed with the Westerbork Synthesis radio telescope in The Netherlands.)
For example, suppose we wish to compress an image using run-length encoding, and we find that we can omit the byte 0xff from the representation of the image. We choose the byte 0xff as our run-length code indicator. How would we encode the following sequence of hexadecimal byte values?

    22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24

The first three pixels are to be copied in sequence. The runs of 24 and 26 are both run-length encoded. The remaining pixels are copied in sequence. The resulting sequence is

    22 23 ff 24 07 25 ff 26 06 25 24

Run-length encoding is another example of redundancy reduction. (Why?) It can be applied to many kinds of data, including text, instrument data, and sparse matrices. Like the compact notation approach, the run-length encoding algorithm is a simple one whose associated costs rarely affect performance appreciably.

Unlike compact notation, run-length encoding does not guarantee any particular amount of space savings. A "busy" image with a lot of variation will not benefit appreciably from run-length encoding. Indeed, under some circumstances, the aforementioned algorithm could result in a "compressed" image that is larger than the original image. (Why? Can you prevent this?)
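The algorithm just described translates directly into a few lines of C. The sketch below is not one of this book's listed programs; the function name and buffer conventions are assumptions made for illustration. It encodes an in-memory buffer of pixel values, using 0xff as the run-length indicator, and reproduces the result shown above for the example sequence.

/* Run-length encode 'len' bytes from 'in' into 'out', using 0xff as
   the run-length indicator.  Assumes the value 0xff never occurs in
   'in' and that 'out' is large enough.  Runs are capped at 255 here
   for simplicity.  Returns the encoded length.                      */
int rle_encode(unsigned char *in, int len, unsigned char *out)
{
    int i = 0, outlen = 0;

    while (i < len) {
        int run = 1;
        while (i + run < len && in[i + run] == in[i] && run < 255)
            run++;
        if (run > 1) {                   /* emit indicator, value, count */
            out[outlen++] = 0xff;
            out[outlen++] = in[i];
            out[outlen++] = (unsigned char) run;
        } else {
            out[outlen++] = in[i];       /* single pixel: copy as is */
        }
        i += run;
    }
    return outlen;
}

Note that, as written, even a run of only two identical pixels is replaced by three bytes, which is one source of the possible expansion mentioned above.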
5.1.3 Assigning Variable-length Codes

Suppose you have two different symbols to use in an encoding scheme: a dot (".") and a dash ("-"). You have to assign combinations of dots and dashes to letters of the alphabet. If you are very clever, you might determine the most frequently occurring letters of the alphabet (e and t) and use a single dot for one and a single dash for the other. Other letters of the alphabet will be assigned two or more symbols, with the more frequently occurring letters getting fewer symbols.

Sound familiar? You may recognize this scheme as the oldest and most common of the variable-length codes, the Morse code. Variable-length codes, in general, are based on the principle that some values occur more frequently than others, so the codes for those values should take the least amount of space. Variable-length codes are another form of redundancy reduction.

A variation on the compact notation technique, the Morse code can be implemented using a table lookup, where the table never changes. In contrast, since many sets of data values do not exhibit a predictable frequency distribution, more modern variable-length coding techniques dynamically build the tables that describe the encoding scheme. One of the most successful of these is the Huffman code, which determines the probabilities of each value occurring in the data set, and then builds a binary tree in which the search path for each value represents the code for that value. More frequently occurring values are given shorter search paths in the tree. This tree is then turned into a table, much like a Morse code table, that can be used to encode and decode the data.

For example, suppose we have a data set containing only the seven letters shown in Fig. 5.2, and each letter occurs with the probability indicated. The third row in the figure shows the Huffman codes that would be assigned to the letters. Based on Fig. 5.2, the string "abde" would be encoded as "101000000001."

FIGURE 5.2 Example showing the Huffman encoding for a set of seven letters, assuming certain probabilities. (From Lynch, 1985.)

    Letter:        a      b      c      d       e       f       g
    Probability:   0.4    0.1    0.1    0.1     0.1     0.1     0.1
    Code:          1      010    011    0000    0001    0010    0011

In the example, the letter a occurs much more often than any of the others, so it is assigned the one-bit code 1. Notice that the minimum number of bits needed to represent these seven letters is three, yet in this case as many as four bits are required. This is a necessary trade-off to insure that the distinct codes can be stored together, without delimiters between them, and still be recognized.
5.1.4 Irreversible Compression Techniques

The techniques we have discussed so far preserve all information in the original data. In effect, they take advantage of the fact that the data, in its original form, contains redundant information that can be removed and then reinserted at a later time. Another type of compression, irreversible compression, is based on the assumption that some information can be sacrificed.*

An example of irreversible compression would be shrinking a raster image from, say, 400-by-400 pixels to 100-by-100 pixels. The new image contains one pixel for every 16 pixels in the original image, and there is no way, in general, to determine what the original pixels were from the one new pixel.

Irreversible compression is less common in data files than reversible compression, but there are times when the information that is lost is of little or no value. For example, speech compression is often done by voice coding, a technique that transmits a parameterized description of speech, which can be synthesized at the receiving end with varying amounts of distortion.

*Irreversible compression is sometimes called "entropy reduction" to emphasize that the average information (entropy) is reduced.

5.1.5 Compression in UNIX

Both Berkeley and System V UNIX provide compression routines that are heavily used and quite effective. System V has routines called pack and unpack, which use Huffman codes on a byte-by-byte basis. Typically, pack achieves 25 to 40% reduction on text files, but appreciably less on binary files that have a more uniform distribution of byte values. When pack compresses a file, it automatically appends a ".z" to the end of the packed file, signalling to any future user that the file has been compressed using the standard compression algorithm.

Berkeley UNIX has routines called compress and uncompress, which use an effective dynamic method called Lempel-Ziv (Welch, 1984). Except for using different compression schemes, compress and uncompress behave almost the same as pack and unpack.* Compress appends a ".Z" to the end of files it has compressed.

Since these routines are readily available on UNIX systems and are very effective general-purpose routines, it is wise to use them whenever there are not compelling reasons to use other techniques.

*Many implementations of System V UNIX also support compress and uncompress as Berkeley extensions.
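As a small, hedged illustration of this advice, a program can simply hand its data file to the existing utility rather than implement its own compression. The file name below is invented for the example; compress replaces the named file with one carrying a ".Z" suffix.

#include <stdio.h>
#include <stdlib.h>

/* Compress a data file with the Berkeley 'compress' utility before
   archiving it, rather than writing our own compression routine.   */
int main( )
{
    int status = system("compress mailing_list.dat");
    if (status != 0)
        fprintf(stderr, "compress failed or is not available\n");
    return status;
}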
5.2 Reclaiming Space in Files

Suppose a record in a variable-length record file is modified in such a way that the new record is longer than the original record. What do you do with the extra data? You could append it to the end of the file and put a pointer from the original record space to the extension of the record. Or you could rewrite the whole record at the end of the file (unless the file needs to be sorted), leaving a hole at the original location of the record. Each solution has a drawback: In the former case, the job of processing the record is more awkward and slower than it was originally; in the latter case, the file contains wasted space.

In this section we take a close look at the way file organization deteriorates as a file is modified. In general, modifications can take any one of three forms:

■ Record addition;
■ Record updating; and
■ Record deletion.

If the only kind of change to a file is record addition, there is no deterioration of the kind we cover in this chapter. It is only when variable-length records are updated, or when either fixed- or variable-length records are deleted, that maintenance issues become complicated and interesting. Since record updating can always be treated as a record deletion followed by a record addition, our focus is on the effects of record deletion. When a record has been deleted, we want to reuse the space.
5.2.1 Record Deletion and Storage Compaction

Storage compaction makes files smaller by looking for places in a file where there is no data at all, and then recovering this space. Since empty spaces occur in files when we delete records, we begin our discussion of compaction with a look at record deletion.

Any record-deletion strategy must provide some way for us to recognize records as deleted. A simple and usually workable approach is to place a special mark in each deleted record. For example, in the name and address file in Chapter 4, we might place an asterisk as the first field in a deleted record. Figures 5.3(a) and 5.3(b) show a name and address file similar to the one in Chapter 4 before and after the second record is marked as deleted. (The dots at the ends of records 0 and 2 represent padding between the last field and the end of each record.)

Once we are able to recognize a record as deleted, the next question is how to reuse the space from the record. Approaches to this problem that rely on storage compaction do nothing at all to reuse the space for a while. The records are simply marked as deleted and left in the file for a period of time. Programs using the file must include logic that causes them to ignore records that are marked as deleted. One nice side effect of this approach is that it is usually possible to allow the user to "undelete" a record with very little effort. This is particularly easy if you keep the deleted mark in a special field, rather than destroy some of the original data, as in our example.

The reclamation of space from the deleted records happens all at once. After deleted records have accumulated for some time, a special program is used to reconstruct the file with all the deleted records squeezed out (Fig. 5.3c). If there is enough space, the simplest way to do this is through a file copy program that skips over the deleted records; a sketch of such a program appears at the end of this section. It is also possible, though more complicated and time-consuming, to do the compaction in place. Either of these approaches can be used with both fixed- and variable-length records.

FIGURE 5.3 Storage requirements of sample file using 64-byte fixed-length records. (a) Before deleting the second record. (b) After deleting the second record. (c) After compaction, the second record is gone.

(a)  Ames|John|123 Maple|Stillwater|OK|74075|..................
     Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
     Brown|Martha|625 Kimbark|Des Moines|IA|50311|.............

(b)  Ames|John|123 Maple|Stillwater|OK|74075|..................
     *|rrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
     Brown|Martha|625 Kimbark|Des Moines|IA|50311|.............

(c)  Ames|John|123 Maple|Stillwater|OK|74075|..................
     Brown|Martha|625 Kimbark|Des Moines|IA|50311|.............

The decision about how often to run the storage compaction program can be based on either the number of deleted records or on the calendar. In accounting programs, for example, it often makes sense to run a compaction procedure on certain files at the end of the fiscal year or at some other point associated with closing the books.
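A minimal sketch of such a compaction-by-copying program, for a fixed-length record file like the one in Fig. 5.3, might look like the following. It is not one of this book's listed programs; the 64-byte record size and the convention of a leading '*' as the deletion mark are taken from the figure, while the function name and file descriptors are assumptions.

#include <unistd.h>

#define REC_SIZE 64

/* Copy every record that is not marked as deleted (leading '*')
   from the file 'infd' to the file 'outfd'.                      */
void compact(int infd, int outfd)
{
    char recbuff[REC_SIZE];

    while (read(infd, recbuff, REC_SIZE) == REC_SIZE)
        if (recbuff[0] != '*')              /* skip deleted records */
            write(outfd, recbuff, REC_SIZE);
}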
5.2.2 Deleting Fixed-length Records
for
Reclaiming
Space Dynamically
most widely used of the storage
There are some applications, however,
that are too volatile and interactive for storage compaction to be useful. In
these situations we want to reuse the space from deleted records as soon as
possible. We begin our discussion of such dynamic storage reclamation
with a second look at fixed-length record deletion, since fixed-length
records make the reclamation problem much simpler.
In general, to provide a mechanism for record deletion with subsequent
reutilization of the freed space, we need to be able to guarantee two things:
Storage compaction
the simplest and
is
reclamation methods
we
discuss.
That deleted records are marked in some special way; and
That we can find the space that deleted records once occupied so we
can reuse that space when we add records.
We
a method of meeting the first requirement: We
by putting a field containing an asterisk at the
have already identified
mark records
as deleted
beginning of deleted records.
If you are working with fixed-length records and are willing to search
sequentially through a file before adding a record, you can always provide
you have provided the first. Space reutilization can
form of looking through the file, record by record, until a deleted
record is found. If the program reaches the end of the file without finding
a deleted record, then the new record can be appended at the end.
Unfortunately, this approach makes adding records an intolerably slow
process if the program is an interactive one and the user has to sit at the
the second guarantee if
take the
terminal and wait as the record addition takes place.
To make
record reuse
happen more quickly, we need
A way
A way
Linked
to
to
Lists
know immediately if there are empty slots in the
jump directly to one of those slots if they exist.
The use of
available records can
structure in
its
a linked
list
for stringing together
meet both of these needs.
linked
list
file;
and
all
of the
is
a data
which each element or node contains some kind of reference
successor in the
list.
(See Fig. 5.4.)
to
RECLAIMING SPACE
FIGURE 5.4 A linked
193
IN FILES
list.
If you have a head reference to the first node in the list, you can move
through the list by looking at each node, and then at the node's pointer
field, so you know where the next node is located. When you finally
encounter a pointer field with some special, predetermined end-of-list
value, you stop the traversal of the list. In Fig. 5.4 we use a 1 in the pointer
field to mark the end of the list.
When a list is made up of deleted records that have become available
space within the
file,
new
the
list is
usually called an avail
list.
When
inserting a
any one available record is just as
good as any other. There is no reason to prefer one open slot over another
since all the slots are the same size. It follows that there is no reason for
ordering the avail list in any particular way. (As we see later, this situation
changes for variable-length records.)
record into
The
Stacks
which
So, if
all
simplest
insertions
we have
way
file,
to handle a
list
managed
as a stack.
RRN
Head
pointer
Head
\
S
RRN 3,
stack
is
a list in
RRN
5
it
list.
record
looks like this before and
(3)
one end of the
RRN
RRN
pointer
at
as a stack that contains relative
and 2, and then add
the addition of the new node:
When
list is
and removals of nodes take place
an avail
numbers (RRN)
after
fixed-length record
RRN
added to the top or front of a stack, we say that it
next thing that happens is a request for some
available space, the request is filled by taking RRN 3 from the avail list.
is
new node
is
pushed onto the stack.
If the
94
ORGANIZING FILES FOR PERFORMANCE
This
called popping the stack.
is
only records 5 and
The
list
Linking and Stacking Deleted Records
Now we
it
We
know immediately if there are empty slots in the
jump directly to one of those slots if they exist.
to
to
contains
can meet the two
space from deleted records.
criteria for rapid access to reusable
A way
A way
which
returns to a state in
2.
need
file;
and
Placing the deleted records on
a stack meets both criteria. If the pointer
of the stack contains the end-of-list value, then we know that
there are not any empty slots and that we have to add new records by
appending them to the end of the file. If the pointer to the stack top contains
to the top
a valid
node
reference,
then
available, but also exactly
Where do we keep
or
a separate file,
we need
to
structures.
when
is it
we know
where
to find
the stack?
Is it a
not only that
reusable slot
is
it.
separate
somehow embedded
list,
perhaps maintained in
within the data
file?
Once
again,
be careful to distinguish between physical and conceptual
The
deleted, available records are not actually
moved anywhere
we need them,
they are pushed onto the stack. They stay right where
located in the
file.
The
stacking and linking
rearranging the links used to
make one
is
done by arranging and
available record slot point to the
we are working with fixed-length records in a disk file, rather
memory addresses, the pointing is not done with pointer variables
next. Since
than with
in the
formal sense, but through relative record numbers (RRNs).
Suppose we
are
working with
contained seven records
(RRNs
fixed-length record
file
that
once
0-6). Furthermore, suppose that records 3
have been deleted, in that order, and that deleted records are marked by
first field with an asterisk. We can then use the second field of
a deleted record to hold the link to the next record on the avail list. Leaving
out the details of the valid, in-use records, Fig. 5.5(a) shows how the file
might look.
Record 5 is the first record on the avail list (top of the stack) since it is
the record that is most recently deleted. Following the linked list, we see
and
replacing the
that record 5 points to record 3. Since the link field for record 3 contains -1,
which
is
our end-of-list marker,
we know
that record 3
is
the last slot
available for reuse.
Figure 5.5(b) shows the same
file after
record
1 is
also deleted.
Note
that
the contents of all the other records on the avail list remain unchanged.
Treating the list as a stack results in a minimal amount of list reorganization
when we push and pop records to and from the list.
If we now add a new name to the file, it is placed in record 1, since
RRN 1 is the first available record. The avail list would return to the
RECLAIMING SPACE
List
head
(first
available record) -* 5
2
Edwards
Bates
195
IN FILES
Wills
*-l
Masters
Masters
*3
Chavez
*3
Chavez
(a)
List
head
(first
available record)
*5
Edwards
Wills
*-l
(b)
List
head
(first
available record)
1st
new
Wills
rec
Edwards
3rd new rec
Masters
2nd new
Chavez
rec
(c)
FIGURE 5.5 Sample
file
showing linked
5, in that order, (b) After deletion of
new
lists of
deleted records,
records 3, 5, and
1,
in
(a)
After deletion of records 3
records.
shown
configuration
the avail
list,
we
the size of the
5.5c). If yet
avail list
at
is
in Fig. 5.5(a). Since there are
file.
After that, however, the avail
name
empty and
the end of the
still
could add two more names to the
another
is
added to the
that the
name
file,
file
two record
on
slots
without increasing
would be empty (Fig.
program knows that the
list
the
requires the addition of a
new
record
file.
Implementing Fixed-length Record Deletion
nisms that place deleted records on
list
Implementing mechaand that treat the avail
need a suitable place to keep
linked avail
as a stack is relatively straightforward.
We
list
RRN
of the first available record on the avail list. Since this is
information that is specific to the data file, it can be carried in a header
record at the start of the file.
When we delete a record we must be able to mark the record as deleted,
and then place it on the avail list. A simple way to do this is to place an *
the
and
that order, (c) After insertion of three
'
96
ORGANIZING FILES FOR PERFORMANCE
(or some other special mark) at the beginning of the record as a deletion
mark, followed by the RRN of the next record on the avail list.
Once we have a list of available records within a file, we can reuse the
space previously occupied by deleted records. For this
single function that returns either (1) the
(2)
the
RRN
RRN
appended
ot the next record to be
we would
of a reusable record
if
no reusable
write
slot,
or
slots are
available.
5.2.3 Deleting Variable-length Records
Now that we have a mechanism for handling an avail list of available space once records are deleted, let's apply this mechanism to the more complex problem of reusing space from deleted variable-length records. We have seen that to support record reuse through an avail list, we need

- A way to link the deleted records together into a list (i.e., a place to put a link field);
- An algorithm for adding newly deleted records to the avail list; and
- An algorithm for finding and removing records from the avail list when we are ready to use them.
An Avail List of Variable-length Records
What kind of file structure do we need to support an avail list of variable-length records? Since we will want to delete whole records and then place records on an avail list, we need a structure in which the record is a clearly defined entity. The file structure in which we define the length of each record by placing a byte count of the record contents at the beginning of each record will serve us well in this regard.
We can handle the contents of a deleted variable-length record just as we did with fixed-length records. That is, we can place a single asterisk in the first field, followed by a binary link field pointing to the next deleted record on the avail list. The avail list itself can be organized just as it was with fixed-length records, but with one difference: We cannot use relative record numbers (RRNs) for links. Since we cannot compute the byte offset of variable-length records from their RRNs, the links must contain the byte offsets themselves.
To illustrate, suppose we begin with the variable-length record file containing the three records for Ames, Morrison, and Brown introduced earlier. Figure 5.6(a) shows what the file looks like (minus the header) before any deletions, and Fig. 5.6(b) shows what it looks like after the deletion of the second record. The periods in the deleted record signify discarded characters.

FIGURE 5.6 A sample file for illustrating variable-length record deletion. (a) Original sample file stored in variable-length format with byte count (header record not included). (b) Sample file after deletion of the second record (periods show discarded characters).

(a)  HEAD.FIRST_AVAIL: -1

     40 Ames|John|123 Maple|Stillwater|OK|74075| 64 Morrison|Sebastian|9035 South
     Hillcrest|Forest Village|OK|74820| 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|

(b)  HEAD.FIRST_AVAIL: 43

     40 Ames|John|123 Maple|Stillwater|OK|74075| 64 * -1 ......................
     .................................. 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
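As a sketch of how a deletion like the one in Fig. 5.6(b) might be carried out, the following C fragment overwrites the data area of a deleted record with the asterisk mark and a byte-offset link, and hands back the offset that should become the new value of the header's FIRST_AVAIL field. The two-byte size field, the exact layout, and the function name are illustrative assumptions, not the text's implementation.

    #include <stdio.h>

    #define SIZE_FIELD_BYTES 2   /* assumed width of the byte-count field */

    /* offset: byte offset of the deleted record's size field in the file.
       first_avail: current HEAD.FIRST_AVAIL value (-1 if the list is empty).
       Returns the offset the caller should store as the new FIRST_AVAIL. */
    long push_deleted_record(FILE *fp, long offset, long first_avail)
    {
        char link[32];
        int n = snprintf(link, sizeof link, "*%ld", first_avail);

        fseek(fp, offset + SIZE_FIELD_BYTES, SEEK_SET);  /* keep the size field */
        fwrite(link, 1, n, fp);   /* deletion mark plus link to the old head   */
        return offset;
    }

Whether FIRST_AVAIL stores the offset of the slot's size field or of the data that follows it is a representation choice; the sketch assumes the former.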
Adding and Removing Records
Let's address the questions of adding and removing records to and from the list together, since they are clearly related. With fixed-length records we could access the avail list as a stack because one member of the avail list is just as usable as any other. That is not true when the record slots on the avail list differ in size, as they do in a variable-length record file. We now have an extra condition that must be met before we can reuse a record: The record must be the right size. For the moment we define right size as "big enough." Later we find that it is sometimes useful to be more particular about the meaning of right size.

It is possible, even likely, that we need to search through the avail list for a record slot that is the right size. We can't just pop the stack and expect the first available record to be big enough. Finding a proper slot on the avail list now means traversing the list until a record slot is found that is big enough to hold the new record that is to be inserted.

For example, suppose the avail list contains the deleted record slots shown in Fig. 5.7(a), and a record that requires 55 bytes is to be added. Since the avail list is not empty, we traverse the records whose sizes are 47 (too small), 38 (too small), and 72 (big enough). Having found a slot big enough to hold our record, we remove it from the avail list by creating a new link that jumps over the record (Fig. 5.7b). If we had reached the end of the avail list before finding a record that was large enough, we would have appended the new record at the end of the file.
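The traversal and relinking just described (and illustrated in Fig. 5.7) can be modeled compactly in C. The sketch below keeps the avail list as an in-memory list of (size, offset) nodes, which is a simplifying assumption for clarity: in the file structure described above the links actually live inside the deleted slots themselves. The point is the first-fit logic of walking the list and making the previous link jump over the slot that is removed.

    #include <stddef.h>

    struct avail_node {
        long size;                 /* size of the deleted record slot          */
        long offset;               /* byte offset of the slot in the data file */
        struct avail_node *next;   /* next slot on the avail list              */
    };

    /* Return the first slot big enough to hold need_size bytes, unlinking it
       from the list by making the previous link "jump over" it.  Returns NULL
       if no slot is large enough, in which case the caller appends the new
       record at the end of the file, as in the example above. */
    struct avail_node *first_fit(struct avail_node **head, long need_size)
    {
        struct avail_node **link = head;      /* address of the link to patch */

        for (struct avail_node *p = *head; p != NULL; link = &p->next, p = p->next) {
            if (p->size >= need_size) {       /* big enough: remove and return */
                *link = p->next;
                return p;
            }
        }
        return NULL;                          /* reached end of the avail list */
    }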
FIGURE 5.7 Removal of a record from an avail list with variable-length records. (a) Before removal. (b) After removal.

Since this procedure for finding a reusable record looks through the entire avail list if necessary, we do not need a sophisticated method for putting newly deleted records onto the list. If a record slot of the right size is somewhere on the list, our get-available-record procedure eventually finds it. It follows that we can continue to push new members onto the front of the list, just as we do with fixed-length records.

Development of algorithms for adding and removing avail list records is left to you as part of the exercises found at the end of this chapter.
5.2.4 Storage Fragmentation
Let's look again at the fixed-length record version of our three-record file (Fig. 5.8). The dots at the ends of the records represent characters we use as padding between the last field and the end of the records. The padding is wasted space; it is part of the cost of using fixed-length records. Wasted space within a record is called internal fragmentation.

FIGURE 5.8 Storage requirements of sample file using 64-byte fixed-length records.

Ames|John|123 Maple|Stillwater|OK|74075|........................
Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
Brown|Martha|625 Kimbark|Des Moines|IA|50311|...................

Clearly, we want to minimize internal fragmentation. If we are working with fixed-length records, we attempt such minimization by choosing a record length that is as close as possible to what we need for each record. But unless the actual data is fixed in length, we have to put up with a certain amount of internal fragmentation in a fixed-length record file.

One of the attractions of variable-length records is that they minimize wasted space by doing away with internal fragmentation. The space set aside for each record is exactly as long as it needs to be. Compare the fixed-length example with the one in Fig. 5.9, which uses the variable-length record structure: a byte count followed by delimited data fields. The only space (other than the delimiters) that is not used for holding data in each record is the count field. If we assume that this field uses two bytes, this amounts to only six bytes for the three-record file. The fixed-length record file wastes 24 bytes in the very first record.

FIGURE 5.9 Storage requirements of sample file using variable-length records with a byte count field.

40 Ames|John|123 Maple|Stillwater|OK|74075| 64 Morrison|Sebastian|9035 South
Hillcrest|Forest Village|OK|74820| 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
But before we start congratulating ourselves for solving the problem of wasted space due to internal fragmentation, we should consider what happens in a variable-length record file after a record is deleted and replaced with a shorter record. If the shorter record takes less space than the original record, internal fragmentation results. Figure 5.10 shows how the problem could occur with our sample file when the second record in the file is deleted and the following record is added:

    Ham|Al|28 Elm|Ada|OK|70332|

FIGURE 5.10 Illustration of fragmentation with variable-length records. (a) After deletion of the second record (unused characters in the deleted record are replaced by periods). (b) After the subsequent addition of the record for Al Ham.

(a)  HEAD.FIRST_AVAIL: 43

     40 Ames|John|123 Maple|Stillwater|OK|74075| 64 * -1 ......................
     .................................. 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|

(b)  HEAD.FIRST_AVAIL: -1

     40 Ames|John|123 Maple|Stillwater|OK|74075| 64 Ham|Al|28 Elm|Ada|OK|70332|.
     .................................. 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|

It appears that escaping internal fragmentation is not so easy. The slot vacated by the deleted record is 37 bytes larger than is needed for the new record. Since we treat the extra 37 bytes as part of the new record, they are not on the avail list and are therefore unusable. But instead of keeping the 64-byte record slot intact, suppose we break it into two parts: one part to hold the new Ham record, and the other to be placed back on the avail list. Since we would take only as much space as necessary for the Ham record, there would be no internal fragmentation.

Figure 5.11 shows what our file looks like if we use this approach to insert the record for Al Ham. We steal the space for the Ham record from the end of the 64-byte slot and leave the first 35 bytes of the slot on the avail list. (The available space is 35 rather than 37 bytes because we need two bytes to form the size field for the new Ham record.) The 35 bytes still on the avail list can be used to hold yet another record.

FIGURE 5.11 Combatting internal fragmentation by putting the unused part of the deleted slot back on the avail list.

HEAD.FIRST_AVAIL: 43

40 Ames|John|123 Maple|Stillwater|OK|74075| 35 * -1 .............. 26 Ham|Al|28
Elm|Ada|OK|70332| 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
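Here is a small C sketch of this splitting step, under assumed names and a two-byte size field: given a slot from the avail list that is big enough, it carves the new record out of the back of the slot and shrinks the slot so that only the leftover front portion stays on the avail list. The test for when a leftover is too small to bother keeping is also shown; the threshold is an assumption, not a rule from the text.

    #define SIZE_FIELD_BYTES 2    /* assumed width of a slot's byte count */

    struct slot { long offset; long size; };   /* a slot on the avail list */

    /* Split only if the leftover could hold its own size field and at least
       one byte of data; otherwise reuse the whole slot and accept a little
       internal fragmentation. */
    int worth_splitting(long slot_size, long need)
    {
        return slot_size - need - SIZE_FIELD_BYTES > SIZE_FIELD_BYTES;
    }

    /* Carve a record of `need` data bytes (plus its own size field) out of
       the back of the slot.  Returns the offset where the new record's size
       field should be written; *s now describes the smaller slot that goes
       back on the avail list. */
    long split_slot(struct slot *s, long need)
    {
        long leftover = s->size - need - SIZE_FIELD_BYTES;
        s->size = leftover;               /* front part stays available */
        return s->offset + SIZE_FIELD_BYTES + leftover;
    }

This is the step that, in the example above, turns the 64-byte slot into the Ham record plus a 35-byte slot, and later into the Lee record plus an 8-byte fragment.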
Figure 5.12 shows the effect of inserting the following 25-byte record:

    Lee|Ed|Rt 2|Ada|OK|74820|

As we would expect, the new record is carved out of the 35-byte record slot that is on the avail list. The data portion of the new record requires 25 bytes, and then we need two more bytes for another size field. This leaves eight bytes in the record slot still on the avail list.

FIGURE 5.12 Addition of the second record into the slot originally occupied by a single deleted record.

HEAD.FIRST_AVAIL: 43

40 Ames|John|123 Maple|Stillwater|OK|74075| 8 * -1 ... 25 Lee|Ed|Rt 2|Ada|OK|74820|
26 Ham|Al|28 Elm|Ada|OK|70332| 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|

What are the chances of finding a record that can make use of these eight bytes? Our guess would be that the probability is close to zero. These eight bytes are not usable, even though they are not trapped inside any other record. This is an example of external fragmentation. The space is actually on the avail list rather than being locked inside some other record, but it is too fragmented to be reused.
There are some interesting ways to combat external fragmentation. One way, which we discussed at the beginning of this chapter, is storage compaction. We could simply regenerate the file when external fragmentation becomes intolerable. Two other approaches are as follows:

- If two record slots on the avail list are physically adjacent, combine them to make a single, larger record slot. This is called coalescing the holes in the storage space.
- Try to minimize fragmentation before it happens by adopting a placement strategy that the program can use as it selects a record slot from the avail list.

Coalescing holes presents some interesting problems. The avail list is not kept in physical record order; if there are two deleted records that are physically adjacent, there is no reason to presume that they are linked adjacent to each other on the avail list. Exercise 15 at the end of this chapter provides a discussion of this problem along with a framework for developing a solution.

The development of better placement strategies, however, is a different matter. It is a topic that warrants separate discussion, since the choice among alternative strategies is not as obvious as it might seem at first glance.
5.2.5 Placement Strategies
Earlier we discussed ways to add and remove variable-length records from an avail list. We add records by treating the avail list as a stack, putting deleted records at the front. When we need to remove a record slot from the avail list (to add a record to the file), we look through the list, starting at the beginning, until we either find a record slot that is big enough or reach the end of the list.

This is called a first-fit placement strategy. The least possible amount of work is expended when we place newly available space on the list, and we are not very particular about the closeness of fit as we look for a record slot to hold a new record. We accept the first available record slot that will do the job, regardless of whether the slot is 10 times bigger than what is needed or whether it is a perfect fit.

We could, of course, develop a more orderly approach for placing records on the avail list, keeping them in either ascending or descending sequence by size. Rather than always putting the newly deleted records at the front of the list, these approaches involve moving through the list, looking for the place to insert the record to maintain the desired sequence.

If we order the avail list in ascending order by size, what is the effect on the closeness of fit of the records that are retrieved from the list? Since the retrieval procedure searches sequentially through the avail list until it encounters a record that is big enough to hold the new record, the first record encountered is the smallest record that will do the job. The fit between the available slot and the new record's needs would be as close as we can make it. This is called a best-fit placement strategy.

A best-fit strategy is intuitively appealing. There is, of course, a price to be paid for obtaining this fit. We end up having to search through at least a part of the list not only when we get records from the list, but also when we put newly deleted records on the list. In a real-time environment the extra processing time could be significant.

A less obvious disadvantage of the best-fit strategy is related to the idea of finding the best possible fit: The free area left over after inserting a new record into a slot is as small as possible. Often this remaining space is too small to be useful, resulting in external fragmentation. Furthermore, the slots that are least likely to be useful are the ones that will be placed toward the beginning of the list, making first-fit searches increasingly long as time goes on.

These problems suggest an alternative strategy: What if we arrange the avail list so it is in descending order by size? Then the largest record slot on the avail list would always be at the head of the list. Since the procedure that retrieves records starts its search at the beginning of the avail list, it always returns the largest available record slot if it returns any slot at all. This is known as a worst-fit placement strategy. The amount of space in the record slot beyond what is actually needed is as large as possible.

A worst-fit strategy does not, at least initially, sound very appealing. But consider the following:
- The procedure for removing records can be simplified so it looks only at the first element of the avail list. If the first record slot is not large enough to do the job, none of the others will be.
- By extracting the space we need from the largest available slot, we are assured that the unused portion of the slot is as large as possible, decreasing the likelihood of external fragmentation.
What can you conclude from all of this? It should be clear that no one placement strategy is superior for all circumstances. The best you can do is formulate a series of general observations and then, given a particular design situation, try to select the strategy that seems most appropriate. Here are some suggestions. The judgment will have to be yours.

- Placement strategies make sense only with regard to volatile, variable-length record files. With fixed-length records, placement is simply not an issue.
- If space is lost due to internal fragmentation, then the choice is between first fit and best fit. A worst-fit strategy truly makes internal fragmentation worse.
- If the space is lost due to external fragmentation, then one should give careful consideration to a worst-fit strategy.
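To see how the three strategies differ in what they pick, here is a small, self-contained C comparison over an in-memory array of avail-list slot sizes. It is purely illustrative: the sizes, the array representation, and the function names are assumptions, and in practice best fit and worst fit would normally be obtained by keeping the avail list itself in ascending or descending order rather than by scanning.

    #include <stdio.h>

    static int first_fit(const long sizes[], int n, long need)
    {
        for (int i = 0; i < n; i++)
            if (sizes[i] >= need)
                return i;                 /* first slot that is big enough */
        return -1;
    }

    static int best_fit(const long sizes[], int n, long need)
    {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (sizes[i] >= need && (best == -1 || sizes[i] < sizes[best]))
                best = i;                 /* smallest slot that still fits */
        return best;
    }

    static int worst_fit(const long sizes[], int n, long need)
    {
        int worst = 0;
        for (int i = 1; i < n; i++)
            if (sizes[i] > sizes[worst])
                worst = i;                /* largest slot on the list */
        return sizes[worst] >= need ? worst : -1;
    }

    int main(void)
    {
        long sizes[] = { 47, 38, 72, 68 };   /* hypothetical avail-list slots */
        long need = 55;
        printf("first fit: %d  best fit: %d  worst fit: %d\n",
               first_fit(sizes, 4, need),
               best_fit(sizes, 4, need),
               worst_fit(sizes, 4, need));   /* prints: 2, 3, 2 */
        return 0;
    }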
5.3 Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching

This text begins with a discussion of the cost of accessing secondary storage. You may remember that the magnitude of the difference between accessing RAM and seeking information on a fixed disk is such that, if we magnify the time for a RAM access to 20 seconds, a similarly magnified disk access would take 58 days.

So far we have not had to pay much attention to this cost. This section, then, marks a kind of turning point. Once we move from fundamental organizational issues to the matter of searching a file for a particular piece of information, the cost of a seek becomes a major factor in determining our approach. And what is true for searching is all the more true for sorting. If you have studied sorting algorithms, you know that even a good sort involves making many comparisons. If each of these comparisons involves a seek, the sort is agonizingly slow.

Our discussion of sorting and searching, then, goes beyond simply getting the job done. We develop approaches that minimize the number of disk accesses and that therefore minimize the amount of time expended.

This concern with minimizing the number of seeks continues to be a major focus throughout the rest of this text. This is just the beginning of a quest for ways to order and find things quickly.
5.3.1 Finding Things in Simple Field and Record Files

All of the programs we have written up to this point, despite any other strengths they offer, share a major failing: The only way to retrieve or find a record with any degree of rapidity is to look for it by relative record number (RRN). If the file has fixed-length records, knowing the RRN lets us compute the record's byte offset and jump to it using direct access.

But what if we do not know the byte offset or RRN of the record we want? How likely is it that a question about this file would take the form, "What is the record stored in RRN 23?" Not very likely, of course. We are much more likely to know the identity of a record by its key, and the question is more likely to take the form, "What is the record for Bill Kelly?"

Given the methods of organization developed so far, access by key implies a sequential search. What if there is no record containing the requested key? Then we would have to look through the entire file. What if we suspect that there might be more than one record that contains the key, and we want to find them all? Once again, we would be doomed to looking at every record in the file. Clearly, we need to find a better way to handle keyed access. Fortunately, there are many better ways.
5.3.2 Search by Guessing: Binary Search
Suppose we are looking for a record for Bill Kelly in a file of 1,000 fixed-length records, and suppose the file is sorted so the records appear in ascending order by key. We start by comparing KELLY BILL (the canonical form of the search key) with the middle key in the file, which is the key whose RRN is 500. The result of the comparison tells us which half of the file contains Bill Kelly's record. Next, we compare KELLY BILL with the middle key among records in the selected half of the file to find out which quarter of the file Bill Kelly's record is in. This process is repeated until either Bill Kelly's record is found or we have narrowed the number of potential records to zero.

This kind of searching is called binary searching. An algorithm for binary searching is shown in Fig. 5.13. Binary searching takes at most 10 comparisons to find Bill Kelly's record, if it is in the file, or to determine that it is not in the file. Compare this with a sequential search for the record. If there are 1,000 records, then it takes at most 1,000 comparisons to find a given record (or establish that it is not present); on the average, 500 comparisons are needed.
5.3.3 Binary Search versus Sequential Search
In general, a binary search of a file with n records takes at most ⌊log n⌋ + 1 comparisons and on average approximately ⌊log n⌋ + 1/2 comparisons.* A binary search is therefore said to be O(log n). In contrast, you may recall that a sequential search of the same file requires at most n comparisons, and on average 1/2 n, which is to say that a sequential search is O(n).

*In this text, log x refers to the logarithm function to the base 2. When any other base is intended, it is so indicated.

The difference between a binary search and a sequential search becomes even more dramatic as we increase the size of the file to be searched. If we double the number of records in the file, we double the number of comparisons required for sequential search; when binary search is used, doubling the file size adds only one more guess to our worst case. This makes sense, since we know that each guess eliminates half of the possible choices. So, if we tried to find Bill Kelly's record in a file of 2,000 records, it would take at most

    1 + ⌊log 2,000⌋ = 11 comparisons,

whereas a sequential search would average

    1/2 n = 1,000 comparisons,

and could take up to 2,000 comparisons.

    /* function to perform a binary search in the file associated with the
       logical name INPUT.  Assumes that INPUT contains RECORD_COUNT records.
       Searches for the key KEY_SOUGHT.  Returns the RRN of the record
       containing the key if the key is found; otherwise returns -1.        */

    FUNCTION: bin_search(INPUT, KEY_SOUGHT, RECORD_COUNT)

        LOW  := 0                      /* initialize lower bound for searching */
        HIGH := RECORD_COUNT - 1       /* initialize upper bound -- we subtract 1
                                          from the count since RRNs start from 0 */
        while (LOW <= HIGH)
            GUESS := (LOW + HIGH) / 2  /* find midpoint */
            read record with RRN of GUESS
            place canonical form of key from record GUESS into KEY_FOUND
            if (KEY_SOUGHT < KEY_FOUND)
                HIGH := GUESS - 1      /* GUESS is too high, so reduce upper bound */
            else if (KEY_SOUGHT > KEY_FOUND)
                LOW := GUESS + 1       /* GUESS is too low, so increase lower bound */
            else
                return (GUESS)         /* match -- return the RRN */
        endwhile

        return (-1)                    /* if loop completes, then key was not found */

    end FUNCTION

FIGURE 5.13 The bin_search() function in pseudocode.
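For readers who want to see the algorithm in a compiled language (the exercises at the end of this chapter suggest C or Pascal), here is a hedged C sketch of the same logic. It assumes 64-byte fixed-length records with no header record, and it takes the canonical key to be the first 12 bytes of each record; those details are illustrative assumptions, not part of the pseudocode above.

    #include <stdio.h>
    #include <string.h>

    #define REC_LEN 64
    #define KEY_LEN 12

    /* key_sought is assumed to be padded to KEY_LEN characters in exactly
       the same way that keys are stored in the records. */
    long bin_search(FILE *input, const char *key_sought, long record_count)
    {
        long low = 0, high = record_count - 1;   /* RRNs start from 0 */
        char record[REC_LEN];

        while (low <= high) {
            long guess = (low + high) / 2;       /* midpoint RRN */
            fseek(input, guess * REC_LEN, SEEK_SET);
            fread(record, 1, REC_LEN, input);

            int cmp = strncmp(key_sought, record, KEY_LEN);
            if (cmp < 0)
                high = guess - 1;                /* guess is too high */
            else if (cmp > 0)
                low = guess + 1;                 /* guess is too low  */
            else
                return guess;                    /* match: return the RRN */
        }
        return -1;                               /* key not found */
    }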
Binary searching is clearly a more attractive way to find things than is
sequential searching. But, as you might expect, there is a price to be paid
before we can use binary searching: Binary searching works only when the
list of records is ordered in terms of the key we are using in the search. So,
to make use of binary searching, we have to be able to sort a list on the basis
of a key.
Sorting is a very important part of file processing. Next, we look at
some simple approaches to sorting files in RAM, at the same time
introducing some important new concepts in file structure design. In
Chapter 7 we take a second look at sorting, when we deal with some tough
problems that occur when files are too large to sort in RAM.
5.3.4 Sorting a Disk File in RAM
Consider the operation of any internal sorting algorithm with which you are familiar. The algorithm requires multiple passes over the list that is to be sorted, comparing and reorganizing the elements. Some of the items in the list are moved a long distance from their original positions in the list. If such an algorithm were applied directly to data stored on a disk, it is clear that there would be a lot of jumping around, seeking, and rereading of data. This would be a very slow operation: unthinkably slow.
If the entire contents of the file can be held in RAM, a very attractive alternative is to read the entire file from the disk into memory, and then do the sorting there, using an internal sort. We still have to access the data on the disk, but this way we can access it sequentially, sector after sector, without having to incur the cost of a lot of seeking and the cost of multiple passes over the disk.
This is one instance of a general class of solutions to the problem of
minimizing disk usage: Force your disk access into a sequential mode,
performing the more complex, direct accesses in RAM.
Unfortunately, it is often not possible to use this simple kind of solution, but when you can, you should take advantage of it. In the case of sorting, internal sorts are increasingly viable as the amount of RAM space increases. A good illustration of an internal sort is the UNIX sort utility, which sorts files in RAM if it can find enough space. This utility is described in Chapter 7.
5.3.5 The Limitations of Binary Searching and Internal Sorting
Let's look at three problems associated with our "sort, then binary search"
approach to finding things.
Problem 1: Binary Searching Requires More than One or Two Accesses
In the average case, a binary search requires approximately ⌊log n⌋ + 1/2 comparisons. If each comparison requires a disk access, a series of binary searches on a list of 1,000 items requires, on the average, 9.5 accesses per request. If the list is expanded to 100,000 items, the average search length extends to 16.5 accesses. Although this is a tremendous improvement over the cost of a sequential search for the key, it is also true that 16 accesses, or even 9 or 10 accesses, is not a negligible cost. The cost of this seeking is particularly noticeable, and objectionable, if we are doing a large enough number of repeated accesses by key.
When we access records by relative record number (RRN) rather than by key, we are able to retrieve a record with a single access. That is an order of magnitude of improvement over the 10 or more accesses that binary searching requires with even a moderately large file. Ideally, we would like to approach RRN retrieval performance while still maintaining the advantages of access by key. In the following chapter, on the use of index structures, we begin to look at ways to move toward this ideal.

Problem 2: Keeping a File Sorted Is Very Expensive
Our ability to use a binary search has a price attached to it: We must keep the file in sorted order by key. Suppose we are working with a file to which we add records as often as we search for existing records. If we leave the file in unsorted order, doing sequential searches for records, then on the average each search requires reading through half the file. Each record addition, however, is very fast, since it involves nothing more than jumping to the end of the file and writing a record.
order, doing sequential searches for records, then
If,
as
an alternative,
substantially
on the
But we encounter
all
we
keep the
file
in sorted order,
cost of searching, reducing
difficulty
when we add
the records in sorted order. Inserting a
it
a record, since
new
we
can cut
to a handful
we want
record into the
down
of accesses.
file
to
keep
requires,
we
not only read through half the records, but that we
open up the space required for the insertion. We are
doing more work than if we simply do sequential searches on an
on the average,
that
also shift the records to
actually
unsorted
The
file.
costs
of maintaining
a file that
can be accessed through binary
searching are not always as large as in this example involving frequent
record addition. For example,
it is
often the case that searching
is
required
208
ORGANIZING FILES FOR PERFORMANCE
much more
frequently than
record addition. In such
is
more than
benefits of faster retrieval can
sorted.
As another example,
the
This can be an
circumstance, the
of keeping the
offset the costs
many
file
which record
additions can be accumulated in a transaction file and made in a batch mode.
By sorting the list of new records before adding them to the main file, it is
possible to merge them with the existing records. As we see in Chapter 7,
such merging is a sequential process, passing only once over each record in
file.
So, despite
appears to be
searching also
its
there are
efficient, attractive
approach to maintaining the
us see
what
However, knowing
the costs of binary
the requirements will be for better solutions
problem of finding things by key. Better solutions
one of the following conditions:
to the
file.
problems, there are situations in which binary searching
useful strategy.
lets
applications in
have to meet
will
at least
They
flr
will not involve reordering of the records in the
y^iew record
is
file
They
will be associated with data structures that allow for substan-
tially
more
rapid, efficient reordering
In the chapters that follow
we
these categories. Solutions of the
indexes.
They can
of the
file.
develop approaches that
first
fall
Problem
also involve hashing. Solutions
3:
An
Internal Sort
to use binary searching
works only
sort
if
is
of the second type can
we
Works Only on Small
is
we
An
file
ability
internal
into the
cannot do
that,
we
need a different kind of sort.
In the following section we develop
then
so large that
in order.
Our
file.
can read the entire contents of
If the file
file
Files
limited by our ability to sort the
computer's electronic memory.
into each of
type can involve the use of simple
involve the use of tree structures, such as a B-tree, to keep the
a variation
called a keysort. Like internal sorting, keysort
large a
file it
can sort, but
its
limit
keysort begins to illuminate
is
larger.
is
on
internal sorting
limited in terms of
how
More importantly, our work on
new approach
to the
problem of finding
things that will allow us to avoid the sorting of records in a
5.4
when
added; and
file.
Keysorting
Keysort, sometimes referred to as tag sort, is based on the idea that when we sort a file in RAM the only things that we really need to sort are the record keys; therefore, we do not need to read the whole file into RAM during the sorting process. Instead, we read the keys from the file into RAM, sort them, and then rearrange the records in the file according to the new ordering of the keys.

Since keysort never reads the complete set of records into memory, it can sort larger files than a regular internal sort, given the same amount of RAM.
5.4.1 Description of the Method
To keep things simple, we assume that we are dealing with a fixed-length record file of the kind developed in Chapter 4, with a count of the number of records stored in a header record. We begin by reading the keys into an array of identically sized character fields, with each row of the array containing a key. We call this array KEYNODES[], and we call the key field KEYNODES[].KEY. Figure 5.14 illustrates the relationship between the array KEYNODES[] and the actual file at the time that the keysort procedure begins.

FIGURE 5.14 Conceptual view of KEYNODES array to be used in RAM by internal sort routine, and record array on secondary store.

There must, of course, be some way of relating the keys back to the records from which they have been extracted. Consequently, each node of the array KEYNODES[] has a second field KEYNODES[].RRN that contains the RRN of the record associated with the corresponding key.

The actual sorting process simply sorts the KEYNODES[] array according to the KEY field. This produces an arrangement like that shown in Fig. 5.15. The elements of KEYNODES[] are now sequenced in such a way that the first element has the RRN of the record that should be moved to the first position in the file, the second element identifies the record that should be second, and so forth.

FIGURE 5.15 Conceptual view of KEYNODES array and file after sorting keys in RAM.

Once KEYNODES[] is sorted, we are ready to reorganize the file according to this new ordering. This process can be described as follows:

    for i := 1 to number of records
        Seek in the input file to the record whose RRN is KEYNODES[i].RRN.
        Read this record into a buffer in RAM.
        Write the contents of the buffer out to the output file.

Figure 5.16 outlines the keysort procedure in pseudocode. This procedure works much the same way that a normal internal sort would work, but with two important differences:

- Rather than read an entire record into a RAM array, we simply read each record into a temporary buffer, extract the key, and then discard it; and
- When we are writing the records out in sorted order, we have to read them in a second time, since they are not all stored in RAM.
    PROGRAM: keysort

        open input file as IN_FILE
        create output file as OUT_FILE

        read header record from IN_FILE and write a copy to OUT_FILE
        REC_COUNT := record count from header record

        /* read in records; set up KEYNODES array */
        for i := 1 to REC_COUNT
            read record from IN_FILE into BUFFER
            extract canonical key and place it in KEYNODES[i].KEY
            KEYNODES[i].RRN := i

        /* sort KEYNODES[].KEY, thereby ordering RRNs correspondingly */
        sort(KEYNODES, REC_COUNT)

        /* read in records according to sorted order, and write them
           out in this order */
        for i := 1 to REC_COUNT
            seek in IN_FILE to record with RRN of KEYNODES[i].RRN
            read the record into BUFFER from IN_FILE
            write BUFFER contents to OUT_FILE

        close IN_FILE and OUT_FILE

    end PROGRAM

FIGURE 5.16 Pseudocode for keysort.
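As a concrete companion to Fig. 5.16, the following C sketch implements the same two passes, using qsort() for the in-RAM key sort. The record and key lengths, the assumption that the key is the leading field of each record, and the omission of the header record are simplifications for illustration; this is not the text's program. Both files are assumed to be open in binary mode.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define REC_LEN 64
    #define KEY_LEN 12

    struct keynode { char key[KEY_LEN + 1]; long rrn; };

    static int by_key(const void *a, const void *b)
    {
        return strcmp(((const struct keynode *)a)->key,
                      ((const struct keynode *)b)->key);
    }

    void keysort(FILE *in, FILE *out, long rec_count)
    {
        struct keynode *nodes = malloc(rec_count * sizeof *nodes);
        char buffer[REC_LEN];

        if (nodes == NULL)
            return;

        /* Pass 1: read each record once, keep only its key and RRN. */
        for (long i = 0; i < rec_count; i++) {
            fread(buffer, 1, REC_LEN, in);
            memcpy(nodes[i].key, buffer, KEY_LEN);
            nodes[i].key[KEY_LEN] = '\0';
            nodes[i].rrn = i;
        }

        qsort(nodes, rec_count, sizeof *nodes, by_key);   /* sort keys in RAM */

        /* Pass 2: seek to each record in key order and copy it out.  These
           random seeks are exactly the cost that Section 5.4.2 examines.   */
        for (long i = 0; i < rec_count; i++) {
            fseek(in, nodes[i].rrn * REC_LEN, SEEK_SET);
            fread(buffer, 1, REC_LEN, in);
            fwrite(buffer, 1, REC_LEN, out);
        }
        free(nodes);
    }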
5.4.2 Limitations of the Keysort Method
At first glance, keysorting appears to be an obvious improvement over sorts
performed entirely in RAM; it might even appear to be a case of getting
something for nothing. We know that sorting is an expensive operation and
that we want to do it in RAM. Keysorting allows us to achieve this
objective without having to hold the entire file in RAM at once.

But, while reading about the operation of writing the records out in sorted order, even a casual reader probably senses a cloud on this apparently bright horizon. In keysort we need to read in the records a second time before we can write out the new sorted file. Doing something twice is never desirable.

But the problem is worse than that. Look carefully at the for loop that reads in the records before writing them out to the new file. You can see that we are not reading through the input file sequentially. Instead, we are working in sorted order, moving from the sorted KEYNODES[] to the RRNs of the records. Since we have to seek to each record and read it in before writing it back out, creating the sorted file requires as many random seeks into the input file as there are records. As we have noted a number of times, there is an enormous difference between the time required to read all the records in a file sequentially and the time required to read those same records if we must seek to each record separately.

What is worse, we are performing all of these accesses in alternation with write statements to the output file. So, even the writing of the output file, which would otherwise appear to be sequential, in most cases involves seeking. The disk drive must move the head back and forth between the two files as it reads and writes.

The getting-something-for-nothing aspect of keysort has suddenly evaporated. Even though keysort does the hard work of sorting in RAM, it turns out that creating a sorted version of the file from the map supplied by the KEYNODES[] array is not at all a trivial matter when the only copies of the records are kept on secondary store.
5.4.3 Another Solution: Why Bother to Write the File Back?

The fundamental idea behind keysort is an attractive one: Why work with an entire record when the only parts of interest, as far as sorting and searching are concerned, are the fields used to form the key? There is a compelling parsimony behind this idea, and it makes keysorting look promising. The promise fades only when we run into the problem of rearranging all the records in the file so they reflect the new, sorted order.

It is interesting to ask whether we can avoid this problem by simply not bothering with the task that is giving us trouble: What if we just skip the time-consuming business of writing out a sorted version of the file? What if, instead, we simply write out a copy of the array of canonical key nodes? If we do without writing the records back in sorted order, writing out the contents of our KEYNODES[] array instead, we will have written a program that outputs an index to the original file. The relationship between the two files is illustrated in Fig. 5.17.

FIGURE 5.17 Relationship between the index file and the data file.

This is an instance of one of our favorite categories of solutions to computer science problems: If some part of a process begins to look like a bottleneck, consider skipping it altogether. Can you do without it? Instead of creating a new, sorted copy of the file to use for searching, we have created a second kind of file, an index file, that is to be used in conjunction with the original file. If we are looking for a particular record, we do our binary search on the index file, then use the RRN stored in the index file record to find the corresponding record in the original file.

There is much to say about the use of index files, enough to fill several chapters. The next chapter is about the various ways we can use simple indexes, which is the kind of index we illustrate here. In later chapters we talk about different ways of organizing the index to provide more flexible access and easier maintenance.
5.4.4 Pinned Records
In section 5.2 we discussed the problem of updating and maintaining files. Much of that discussion revolved around the problems of deleting records and keeping track of the space vacated by deleted records so it can be reused. An avail list of deleted record slots is created by linking all of the available slots together. This linking is done by writing a link field into each deleted record that points to the next deleted record. This link field gives very specific information about the exact physical location of the next available record.

When a file contains such references to the physical locations of records, we say that these records are pinned. You can gain an appreciation for this particular choice of terminology if you consider the effects of sorting one of these files containing an avail list of deleted records.

A pinned record is one that cannot be moved. Other records in the same file or in some other file (such as an index file) contain references to the physical location of the record. If the record is moved, these references no longer lead to the record; they become what are called dangling pointers, pointers leading to incorrect, meaningless locations in the file.

Clearly, the use of pinned records in a file can make sorting more difficult and sometimes impossible. But what if we want to support rapid access by key, while still reusing the space made available by record deletion? One solution is to use an index file to keep the sorted order of the records while keeping the data file in its original order. Once again, the problem of finding things leads to the suggestion that we need to take a close look at the use of indexes, which, in turn, leads us to the next chapter.
SUMMARY
In this chapter we look at ways to organize or reorganize files to improve performance in some way.

Data compression methods are used to make files smaller by re-encoding data that goes into a file. Smaller files use less storage, take less time to transmit, and can often be processed faster sequentially.

The notation used for representing information can often be made more compact. For instance, if a two-byte field in a record can take on only 50 values, the field can be encoded using only 6 bits instead of 16. Another form of compression called run-length encoding encodes sequences of repeating values, rather than writing all of the values in the file.

A third form of compression assigns variable-length codes to values depending on how frequently the values occur. Values that occur often are given shorter codes, so they take up less space. Huffman codes are an example of variable-length codes.

Some compression techniques are irreversible in that they lose information in the encoding process. The UNIX utilities compress, uncompress, pack, and unpack provide good compression in UNIX.

A second way to save space in a file is to recover space in the file after it has undergone changes. A volatile file, one that undergoes many changes, can deteriorate very rapidly unless measures are taken to adjust the file organization to the changes. One result of making changes to files is storage fragmentation.

Internal fragmentation occurs when there is wasted space within a record. In a fixed-length record file, internal fragmentation can result when variable-length records are stored in fixed slots. It can also occur in a variable-length record file when one record is replaced by another record of a smaller size. External fragmentation occurs when holes of unused space between records are created, normally because of record deletions.

There are a number of ways to combat fragmentation. The simplest is storage compaction, which squeezes out the unused space caused by external fragmentation by sliding all of the undeleted records together. Compaction is generally done in a batch mode.

Fragmentation can be dealt with dynamically by reclaiming deleted space when records are added. The need to keep track of the space to be reused makes this approach more complex than compaction.

We begin with the problem of deleting fixed-length records. Since finding the first field of a fixed-length record is very easy, deleting a record can be accomplished by placing a special mark in the first field. Since all records in a fixed-length record file are the same size, the reuse of deleted records need not be complicated. The solution we adopt consists of collecting all the available record slots into an avail list. The avail list is created by stringing together all the deleted records to form a linked list of deleted record spaces.

In a fixed-length record file, any one record slot is just as usable as any other slot; they are interchangeable. Consequently, the simplest way to maintain the linked avail list is to treat it as a stack. Newly available records are added to the avail list by pushing them onto the front of the list; record slots are removed from the avail list by popping them from the front of the list.

Next, we consider the matter of deleting variable-length records. We still form a linked list of available record slots, but with variable-length records we need to be sure that a record slot is the right size to hold the new record. Our initial definition of right size is simply in terms of being big enough. Consequently, we need a procedure that can search through the avail list until it finds a record slot that is big enough to hold the new record. Given such a function, and a complementary function that places newly deleted records on the avail list, we can implement a system that deletes and reuses variable-length records.

We then consider the amount and nature of fragmentation that develops inside a file due to record deletion and reuse. Fragmentation can happen internally if the space that is lost is locked up inside a record. We develop a procedure that breaks a single, large, variable-length record slot into two or more smaller ones, using exactly as much space as is needed for a new record and leaving the remainder on the avail list. We see that, although this could decrease the amount of wasted space, eventually the remaining fragments are too small to be useful. When this happens, the space is lost to external fragmentation.

There are a number of things that one can do to minimize external fragmentation. They include (1) compacting the file in a batch mode when the level of fragmentation becomes excessive; (2) coalescing adjacent record slots on the avail list to make larger, more generally useful slots; and (3) adopting a placement strategy to select slots for reuse in a way that minimizes fragmentation. Development of algorithms for coalescing holes is left as part of the exercises at the end of this chapter. Placement strategies need more careful discussion.

The placement strategy used up to this point by the variable-length record deletion and reuse procedures is a first-fit strategy. This strategy is simply, "If the record slot is big enough, use it." By keeping the avail list in sorted order, it is easy to implement either of two other placement strategies:

- Best fit, in which a new record is placed in the smallest slot that is still big enough to hold it. This is an attractive strategy for variable-length record files in which the fragmentation is internal. It involves more overhead than other placement strategies.
- Worst fit, in which a new record is placed in the largest record slot available. The idea is to have the left-over portion of the slot be as large as possible.

There is no firm rule for selecting a placement strategy; the best one can do is use informed judgment based on a number of guidelines.

In the third major section of this chapter, we look at ways to find things quickly in a file through the use of a key. In preceding chapters it was not possible to access a record rapidly without knowing its physical location or relative record number. Now we explore some of the problems and opportunities associated with keyed direct access.

This chapter develops only one method of finding records by key: binary searching. Binary searching requires O(log n) comparisons to find a record in a file with n records and hence is far superior to sequential searching. Since binary searching works only on a sorted file, a sorting procedure is an absolute necessity. The problem of sorting is complicated by the fact that we are sorting files on secondary storage rather than vectors in RAM. We need to develop a sorting procedure that does not require seeking back and forth over the file.

Three disadvantages are associated with sorting and binary searching as developed up to this point:

- Binary searching is an enormous improvement over sequential searching, but it still usually requires more than one or two accesses per record. The need for fewer disk accesses becomes especially acute in applications where a large number of records are to be accessed by key.
- The requirement that the file be kept in sorted order can be expensive. For active files to which records are added frequently, the cost of keeping the file in sorted order can outweigh the benefits of binary searching.
- A RAM sort can be used only on relatively small files. This limits the size of the files that we could organize for binary searching, given our sorting tools.

The third problem can be solved partially by developing more powerful sorting procedures, such as a keysort. This approach to sorting resembles a RAM sort in most respects, but does not use RAM to hold the entire file. Instead, it reads in only the keys from the records, sorts the keys, and then uses the sorted list of keys to rearrange the records on secondary storage so they are in sorted order.

The disadvantage to a keysort is that rearranging a file of n records requires n random seeks out to the original file, which can take much more time than does a sequential reading of the same number of records. The inquiry into keysorting is not wasted, however. Keysorting naturally leads to the suggestion that we merely write the sorted list of keys off to secondary storage, setting aside the expensive matter of rearranging the file. This list of keys, coupled with RRN tags pointing back to the original records, is an example of an index. We look at indexing more closely in Chapter 6.

This chapter closes with a discussion of another, potentially hidden, cost of sorting and searching. Pinned records are records that are referenced elsewhere (in the same file or in some other file) according to their physical position in the file. Sorting and binary searching cannot be applied to a file containing pinned records, since the sorting, by definition, is likely to change the physical position of the record. Such a change causes other references to this record to become inaccurate, creating the problem of dangling pointers.
KEY TERMS
Avail list. A list of the space, freed through record deletion, that is available for holding new records. In the examples considered in this chapter, this list of space took the form of a linked list of deleted records.

Best fit. A placement strategy for selecting the space on the avail list used to hold a new record. Best-fit placement finds the available record slot that is closest in size to what is needed to hold the new record.

Binary search. A binary search algorithm locates a key in a sorted list by repeatedly selecting the middle element of the list, dividing the list in half, and forming a new, smaller list from the half that contains the key. This process is continued until the selected element is the key that is sought.

Coalescence. If two deleted, available records are physically adjacent, they can be combined to form a single, larger available record space. This process of combining smaller available spaces into a larger one is known as coalescing holes. Coalescence is a way to counteract the problem of external fragmentation.

Compaction. A way of getting rid of all external fragmentation by sliding all the records together so there is no space lost between them.

Data compression. Encoding the information in a file in such a way as to take up less space.

External fragmentation. A form of fragmentation that occurs in a file when there is unused space outside or between individual records.

First fit. A placement strategy for selecting a space from the avail list. First-fit placement selects the first available record slot large enough to hold the new record.

Fragmentation. The unused space within a file. The space can be locked within individual records (internal fragmentation) or outside or between individual records (external fragmentation).

Huffman code. A variable-length code in which the lengths of the codes are based on their probability of occurrence.

Internal fragmentation. A form of fragmentation that occurs when space is wasted in a file because it is locked up, unused, inside of records. Fixed-length record structures often result in internal fragmentation.

Irreversible compression. Compression in which information is lost.

Keysort. A method of sorting a file that does not require holding the entire file in memory. Only the keys are held in memory, along with pointers that tie these keys to the records in the file from which they are extracted. The keys are sorted, and the sorted list of keys is used to construct a new version of the file that has the records in sorted order. The primary advantage of a keysort is that it requires less RAM than does a RAM sort. The disadvantage is that the process of constructing a new file requires a lot of seeking for records.

Linked list. A collection of nodes that have been organized into a specific sequence by means of references placed in each node that point to a single successor node. The logical order of a linked list is often different than the actual physical order of the nodes in the computer's memory.

Pinned record. A record is pinned when there are other records or file structures that refer to it by its physical location. It is pinned in the sense that we are not free to alter the physical location of the record: doing so destroys the validity of the physical references to the record. These references become useless dangling pointers.

Placement strategy. As used in this chapter, a placement strategy is a mechanism for selecting the space on the avail list that is to be used to hold a new record added to the file.

Redundancy reduction. Any form of compression that does not lose information.

Run-length encoding. A compression method in which runs of repeated codes are replaced by a count of the number of repetitions of the code, followed by the code that is repeated.

Stack. A kind of list in which all additions and deletions take place at the same end.

Variable-length encoding. Any encoding scheme in which the codes are of different lengths. More frequently occurring codes are given shorter lengths than are less frequently occurring codes. Huffman encoding is an example of variable-length encoding.

Worst fit. A placement strategy for selecting a space from the avail list. Worst-fit placement selects the largest record slot, regardless of how small the new record is. Insofar as this leaves the largest possible record slot for reuse, worst fit can sometimes help minimize external fragmentation.
EXERCISES
our discussion of compression, we show how we can compress the
name" field from 16 bits to 6 bits, yet we say that this gives us a space
savings of 50%, rather than 62.5%, as we would expect. Why is this so?
What other measures might we take to achieve the full 62.5% savings?
1.
In
"state
2. What is redundancy reduction?
of redundancy reduction?
Why is run-length encoding an example
3. What is the maximum run length that can be handled in the run-length
encoding described in the text? If much longer runs were common, how
might you handle them?
4.
Encode each of
results,
(a)
(b)
and indicate
01
01
01
01
how you might improve
the algorithm.
01 01 01 01 01 01 02 03 03 03 03 03 03 03 04 05 06 06 07
02 02 03 03 04 05 06 06 05 05 04 04
01
5.
From
Fig. 5.2,
6.
What
is
How
the following using run-length encoding. Discuss the
determine the Huffman code for the sequence "daeab".
the difference between internal and external fragmentation?
can compaction affect the amount of internal fragmentation in
What about
external fragmentation?
a file?
220
ORGANIZING FILES FOR PERFORMANCE
7.
In-placc compaction purges deleted records
separate
new
file.
What
compaction compared
from a file without creating
and disadvantages of in-place
which a separate compacted file is
are the advantages
compaction
to
in
created?
8.
loss
9.
Why
is a worst-fit placement strategy a bad choice
of space due to internal fragmentation?
if
there
is
significant
Conceive of an inexpensive way to keep
amount of fragmentation
in a
a continuous record of the
This fragmentation measure could be
file.
used to trigger the batch processes used to reduce fragmentation.
10. Suppose a file must remain sorted.
placement strategies available?
11.
How
does
this affect the
range of
Develop a pseudocode description of a procedure for performing
compaction in a variable-length record file that contains size fields
in-place
at
the start of each record.
12.
Consider the process of updating rather than deleting
variable-length
record. Outline a procedure for handling such updating, accounting for the
update possibly resulting in either
longer or shorter record.
we raised the question of where to keep the stack
of available records. Should it be a separate list, perhaps
maintained in a separate file, or should it be embedded within the data file?
We choose the latter organization for our implementation. What advantages and disadvantages are there to the second approach? What other kinds
of file structures can you think of to facilitate various kinds of record
13. In section 5.4,
containing the
list
deletion?
14. In
some
the record
is
files,
each record has
inactive rather than deleted.
record?
a delete bit that is set to
to indicate that
deleted. This bit can also be used to indicate that a record
Could
reactivation be
What
is
is
required to reactivate an inactive
done with the deletion procedures we have
used?
we outlined three general approaches to the problem of
minimizing storage fragmentation: (a) implementation of a placement
strategy; (b) coalescing of holes; and (c) compaction. Assuming an
interactive programming environment, which of these strategies would be
used "on the fly," as records are added and deleted? Which strategies would
be used as batch processes that could be run periodically?
15. In this chapter
16.
Why
record
do placement
files?
strategies
make
sense only with variable-length
EXERCISES
17. Compare the average case performance of binary search with sequential search for records, assuming
- that the records being sought are guaranteed to be in the file;
- that half of the time the records being sought are not in the file; and
- that half of the time the records being sought are not in the file and that missing records must be inserted.
Make a table showing your performance comparisons for files of 1,000, 2,000, 4,000, 8,000, and 16,000 records.

18. If the records in exercise 17 are blocked with 20 records per block, how does this affect the performance of the binary and sequential searches?
19. An internal sort works only with files small enough to fit in RAM. Some computing systems provide users with an almost unlimited amount of RAM with a memory management technique called virtual storage. Discuss the use of internal sorting to sort large files on systems that use virtual storage.

20. Our discussion of keysorting covers the considerable expense associated with the process of actually creating the sorted output file, given the sorted vector of pointers to the canonical key nodes. The expense revolves around two primary areas of difficulty:
- Having to jump around in the input file, performing many seeks to retrieve the records in their new, sorted order; and
- Writing the output file at the same time we are reading the input file; jumping back and forth between the files can involve seeking.
Design an approach to this problem that uses buffers to hold a number of records, therefore mitigating these difficulties. If your solution is to be viable, obviously the buffers must use less RAM than would a sort taking place entirely within electronic memory.
Programming Exercises
21. Rewrite the program update.c or update.pas so it can delete and add records to a fixed-length record file using one of the replacement procedures discussed in this chapter.

22. Write a program similar to the one described in the preceding exercise, but that works with variable-length record files.
23. Develop a pseudocode description of a variable-length record deletion procedure that checks to see if the newly deleted record is contiguous with any other deleted records. If there is contiguity, coalesce the records to make a single, larger available record slot. Some things to consider as you address this problem are as follows:
a. The avail list does not keep records arranged in physical order; the next record on the avail list is not necessarily the next deleted record in the physical file. Is it possible to merge these two views of the avail list, the physical order and the logical order, into a single list? If you do this, what placement strategy will you use?
b. Physical adjacency can include records that precede as well as follow the newly deleted record. How will you look for a deleted record that precedes the newly deleted record?
c. Maintaining two views of the list of deleted records implies that as you discover physically adjacent records you have to rearrange links to update the nonphysical avail list. What additional complications would we encounter if we were combining the coalescing of holes with a best-fit or worst-fit strategy?

24. Implement the bin_search() function in either C or Pascal. Write a driver program named search to test the function bin_search(). Assume that the files are created with the update program developed in Chapter 4, and then sorted. Include enough debug information in the search driver and bin_search() function to watch the binary searching logic as it makes successive guesses about where to place the new record.
25. Modify the bin_search() function so if the key is not in the file, it returns the relative record number that the key would occupy were it in the file. The function should also continue to indicate whether the key was found or not.
26. Rewrite the search driver from exercise 24 so it uses the new bin_search() function developed in exercise 25. If the sought-after key is in the file, the program should display the record contents. If the key is not found, the program should display a list of the keys that surround the position that the key would have occupied. You should be able to move backward or forward through this list at will. Given this modification, you do not have to remember an entire key to retrieve it. If, for example, you know that you are looking for someone named Smith, but cannot remember the person's first name, this new program lets you jump to the area where all the Smith records are stored. You can then scroll back and forth through the keys until you recognize the right first name.
27. Write an internal sort that can sort a variable-length record file of the kind produced by the writrec programs in Chapter 4.
FURTHER READINGS
A thorough treatment of data compression techniques can be found in Lynch (1985). The Lempel-Ziv method is described in Welch (1984). Huffman encoding is covered in many data structures texts, and also in Knuth (1973a).

Somewhat surprisingly, the literature concerning storage fragmentation and reuse often does not consider these issues from the standpoint of secondary storage. Typically, storage fragmentation, placement strategies, coalescing of holes, and garbage collection are considered in the context of reusing space within electronic random access memory (RAM). As you read this literature with the idea of applying the concepts to secondary storage, it is necessary to evaluate each strategy in light of the cost of accessing secondary storage. Some strategies that are attractive when used in electronic RAM are too expensive on secondary storage.

Discussions about space management in RAM are usually found under the heading "Dynamic Storage Allocation." Knuth (1973a) provides a good, though technical, overview of the fundamental concerns associated with dynamic storage allocation, including placement strategies. Much of Knuth's discussion is reworked and made more approachable by Tremblay and Sorenson (1984). Standish (1980) provides a more complete overview of the entire subject, reviewing much of the important literature on the subject.

This chapter only touches the surface of issues relating to searching and sorting files. A large part of the remainder of this text is devoted to exploring the issues in more detail, so one source for further reading is the present text. But there is much more that has been written about even the relatively simple issues raised in this chapter. The classic reference on sorting and searching is Knuth (1973b). Knuth provides an excellent discussion of the limitations of keysort methods. He also develops a very complete discussion of binary searching, clearly bringing out the analogy between binary searching and the use of binary trees. Baase (1978) provides a clear, understandable analysis of binary search performance.
Indexing
CHAPTER OBJECTIVES

- Introduce concepts of indexing that have broad applications in the design of file systems.
- Introduce the use of a simple linear index to provide rapid access to records in an entry-sequenced, variable-length record file.
- Investigate the implications of the use of indexes for file maintenance.
- Describe the use of indexes to provide access to records by more than one key.
- Introduce the idea of an inverted list, illustrating Boolean operations on lists.
- Discuss the issue of when to bind an index key to an address in the data file.
- Introduce and investigate the implications of self-indexing files.
CHAPTER OUTLINE

6.1 What Is an Index?
6.2 A Simple Index with an Entry-Sequenced File
6.3 Basic Operations on an Indexed, Entry-Sequenced File
6.4 Indexes That Are Too Large to Hold in Memory
6.5 Indexing to Provide Access by Multiple Keys
6.6 Retrieval Using Combinations of Secondary Keys
6.7 Improving the Secondary Index Structure: Inverted Lists
6.7.1 A First Attempt at a Solution
6.7.2 A Better Solution: Linking the List of References
6.8 Selective Indexes
6.9 Binding

6.1 What Is an Index?

The last few pages of many books contain an index. Such an index is a table containing a list of topics (keys) and numbers of pages where the topics can be found (reference fields).

All indexes are based on the same basic concept: keys and reference fields. The types of indexes we examine in this chapter are called simple indexes because they are represented using simple arrays of structures that contain the keys and reference fields. In later chapters we look at indexing schemes that use more complex data structures, especially trees. In this chapter, however, we want to emphasize that indexes can be very simple and still provide powerful tools for file processing.
The index to a book provides a way to find a topic quickly. If you have
ever had to use a book without a good index, you already know that an
index is a desirable alternative to scanning through the book sequentially to
find a topic. In general, indexing is another way to handle the problem that
we explored in Chapter 5: An index is a way to find things.
Consider what would happen if we tried to apply the previous chapter's
methods, sorting and binary searching, to the problem of finding things in
a book. Rearranging all the words in the book so they were in alphabetical
order certainly would make finding any particular term easier but would
obviously have disastrous effects on the meaning of the book. In a sense, the
terms in the book are pinned records. This is an absurd example, but it
clearly underscores the power and importance of the index as a conceptual
tool. Since it works by indirection, an index lets you impose order on a file without actually rearranging the file. This not only keeps us from disturbing pinned records, but also makes matters such as record addition much less expensive than they are with a sorted file.

Take, as another example, the problem of finding books in a library. We want to be able to locate books by a specific author, by their titles, or by subject areas. One way of achieving this is to have three copies of each book and three separate library buildings. All of the books in one building would be sorted by author's name, another building would contain books arranged by title, and the third would have them ordered by subject. Again, this is an absurd example, but one that underscores another important advantage of indexing. Instead of using multiple arrangements, a library uses a card catalog. The card catalog is actually a set of three indexes, each using a different key field, and all of them using the same catalog number as a reference field. Another use of indexing, then, is to provide multiple access paths to a file.

We also find that indexing gives us keyed access to variable-length record files. Let's begin our discussion of indexing by exploring this problem of access to variable-length records and the simple solution that indexing provides.
6.2 A Simple Index with an Entry-Sequenced File

Suppose we own an extensive collection of musical recordings and we want to keep track of the collection through the use of computer files. For each recording, we keep the information shown in Fig. 6.1. The data file records are variable length. Figure 6.2 illustrates such a collection of data records. We refer to this data record file as Datafile.

There are a number of approaches that could be used to create a variable-length record file to hold these records; the record addresses used in Fig. 6.2 suggest that each record be preceded by a size field that permits skip sequential access and easier file maintenance. This is the structure we use.
FIGURE 6.1 Contents of a data record:
Identification number
Title
Composer or composers
Artist or artists
Label (publisher)

FIGURE 6.2 Sample contents of Datafile. (Assume there is a header record that uses the first 32 bytes.)

Rec. addr.   Label   ID number   Title                      Composer(s)               Artist(s)
32           LON     2312        Romeo and Juliet           Prokofiev                 Maazel
77           RCA     2626        Quartet in C Sharp Minor   Beethoven                 Julliard
132          WAR     23699       Touchstone                 Corea                     Corea
167          ANG     3795        Symphony No. 9             Beethoven                 Giulini
211          COL     38358       Nebraska                   Springsteen               Springsteen
256          DG      18807       Symphony No. 9             Beethoven                 Karajan
300          MER     75016       Coq d'or Suite             Rimsky-Korsakov           Leinsdorf
353          COL     31809       Symphony No. 9             Dvorak                    Bernstein
396          DG      139201      Violin Concerto            Beethoven                 Ferras
442          FF      245         Good News                  Sweet Honey in the Rock   Sweet Honey in the Rock

Suppose we formed a primary key for these records consisting of the initials for the record company label combined with the record company's ID number. This will make a good primary key since it should provide a unique key for each entry in the file. We call this key the Label ID. The canonical form for the Label ID consists of the uppercase form of the Label field followed immediately by the ASCII representation of the ID number. For example,

LON2312
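To make the canonical form concrete, here is a minimal C sketch of a function that builds such a key. The function name, the key length, and the field sizes are assumptions made for this illustration; they are not taken from the text's programs.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define KEYSIZE 12   /* assumed size of the canonical key field */

/* Build the canonical Label ID: the uppercase label followed
   immediately by the ASCII form of the ID number.             */
void make_label_id(char key[KEYSIZE + 1], const char *label, long id)
{
    int i;
    char idstr[16];

    for (i = 0; label[i] != '\0' && i < KEYSIZE; i++)
        key[i] = toupper((unsigned char) label[i]);
    key[i] = '\0';

    sprintf(idstr, "%ld", id);
    strncat(key, idstr, KEYSIZE - strlen(key));   /* e.g., "LON" + "2312" gives "LON2312" */
}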
How could we organize the file to provide rapid keyed access to individual records? Could we sort the file and then use binary searching? Unfortunately, binary searching depends on being able to jump to the middle record in the file. This is not possible in a variable-length record file because direct access by relative record number is not possible; there is no way to know where the middle record is in any group of records.

An alternative to sorting is to construct an index for the file. Figure 6.3 illustrates such an index. On the right is the data file containing information about our collection of recordings, with one variable-length data record per recording. Only four fields are shown (Label, ID number, Title, and Composer), but it is easy to imagine the other information filling out each record.

On the left is the index file, each record of which contains a 12-character key (left justified, blank filled) corresponding to a certain Label ID in the data file. Each key is associated with a reference field giving the address of the first byte of the corresponding data record. ANG3795, for example, corresponds to the reference field containing the number 167, meaning that the record containing the full information on the recording with Label ID ANG3795 can be found starting at byte number 167 in the record file.

FIGURE 6.3 Sample index with corresponding data file.

Indexfile                        Datafile
Key         Reference field      Address of record   Actual data record
ANG3795     167                  32                  LON 2312 Romeo and Juliet Prokofiev
COL31809    353                  77                  RCA 2626 Quartet in C Sharp Minor Beethoven
COL38358    211                  132                 WAR 23699 Touchstone Corea
DG139201    396                  167                 ANG 3795 Symphony No. 9 Beethoven
DG18807     256                  211                 COL 38358 Nebraska Springsteen
FF245       442                  256                 DG 18807 Symphony No. 9 Beethoven
LON2312     32                   300                 MER 75016 Coq d'or Suite Rimsky-Korsakov
MER75016    300                  353                 COL 31809 Symphony No. 9 Dvorak
RCA2626     77                   396                 DG 139201 Violin Concerto Beethoven
WAR23699    132                  442                 FF 245 Good News Sweet Honey In The Rock

The structure of the index file is very simple. It is a fixed-length record file in which each record has two fixed-length fields: a key field and a byte-offset field. There is one record in the index file for every record in the data file.
Note also that the index is sorted, whereas the data file is not. Consequently, although ANG3795 is the first entry in the index, it is not necessarily the first entry in the data file. In fact, the data file is entry sequenced, which means that the records occur in the order that they are entered into the file. As we see soon, the use of an entry-sequenced file can make record addition and file maintenance much simpler than is the case with a data file that is kept sorted by some key.
PROCEDURE retrieve_record(KEY)
    find position of KEY in Indexfile    /* Probably using binary search */
    look up the BYTE_OFFSET of the corresponding record in Datafile
    use SEEK() and the BYTE_OFFSET to move to the data record
    read the record from Datafile
end PROCEDURE

FIGURE 6.4 Retrieve_record(): a procedure to retrieve a single record from Datafile through Indexfile.
Using the index to provide access to the data file by Label ID is a simple matter. The steps needed to retrieve a single record with key KEY from Datafile are shown in the procedure retrieve_record() in Fig. 6.4. Although this retrieval strategy is relatively straightforward, it contains some features that deserve comment:

- We are now dealing with two files: the index file and the data file.
- The index file is considerably easier to work with than the data file because it uses fixed-length records (which is why we can search it with a binary search) and because it is likely to be much smaller than the data file.
- By requiring that the index file have fixed-length records, we impose a limit on the sizes of our keys. In this example we assume that the primary key field is long enough to retain every key's unique identity. The use of a small, fixed key field in the index could cause problems if a key's uniqueness is truncated away as it is placed in the fixed index field.
- In the example, the index carries no information other than the keys and the reference fields, but this need not be the case. We could, for example, keep the length of each Datafile record in Indexfile.
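As a concrete illustration of this retrieval logic, the following C sketch shows one way the index record, the binary search, and the seek-and-read steps of Fig. 6.4 might fit together. The structure and function names, the key size, and the assumption that each data record begins with a two-byte size field are illustrative choices, not the text's actual implementation.

#include <stdio.h>
#include <string.h>

#define KEYSIZE 12

typedef struct {                 /* one entry of the simple index      */
    char key[KEYSIZE + 1];       /* canonical Label ID, e.g. "ANG3795" */
    long byte_offset;            /* address of the data record         */
} INDEX_REC;

/* Binary search of the in-memory index; returns the position of
   the key, or -1 if the key is not present.                      */
int find_key(INDEX_REC idx[], int count, const char *key)
{
    int low = 0, high = count - 1;
    while (low <= high) {
        int mid = (low + high) / 2;
        int cmp = strcmp(key, idx[mid].key);
        if (cmp == 0) return mid;
        if (cmp < 0)  high = mid - 1;
        else          low = mid + 1;
    }
    return -1;
}

/* Seek to the record's byte offset and read its contents. The data
   file is assumed here to begin each record with a size field.      */
int retrieve_record(FILE *datafile, INDEX_REC idx[], int count,
                    const char *key, char *buffer, int bufsize)
{
    int pos = find_key(idx, count, key);
    short reclen;

    if (pos < 0) return -1;                         /* key not found */
    fseek(datafile, idx[pos].byte_offset, SEEK_SET);
    fread(&reclen, sizeof reclen, 1, datafile);     /* size field    */
    if (reclen > bufsize) reclen = bufsize;
    fread(buffer, 1, reclen, datafile);             /* record body   */
    return reclen;
}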
6.3 Basic Operations on an Indexed, Entry-Sequenced File

We have noted that the process of keeping files sorted to permit binary searching for records can be very expensive. One of the great advantages of using a simple index with an entry-sequenced data file is that record addition can take place much more quickly than with a sorted data file as long as the index is small enough to be held entirely in memory. If the index record length is short, this is not a difficult condition to meet for small files consisting of no more than a few thousand records. For the moment our discussions assume that the condition is met and that the index is read from secondary storage into an array of structures called INDEX[ ]. Later we consider what should be done when the index is too large to fit into memory.
Keeping the index in memory as the program runs also lets us find records by key more quickly with an indexed file than with a sorted one since the binary searching can be performed entirely in memory. Once the byte offset for the data record is found, then a single seek is required to retrieve the record. The use of a sorted data file, on the other hand, requires a seek for each step of the binary search.

The support and maintenance of an entry-sequenced file that is coupled with a simple index requires the development of procedures to handle a number of different tasks. Besides the retrieve_record() algorithm described previously, other procedures used to find things by means of the index include the following:

- Create the original empty index and data files;
- Load the index file into memory before using it;
- Rewrite the index file from memory after using it;
- Add records to the data file and index;
- Delete records from the data file; and
- Update records in the data file.
Creating the Files. Both the index file and the data file are created as empty files, with header records and nothing else. This can be accomplished quite easily by creating the files and writing headers to both files.

Loading the Index into Memory. We assume that the index file is small enough to fit into primary memory, so we define an array INDEX[ ] to hold the index records. Each array element has the structure of an index record. Loading the index file into memory, then, is simply a matter of reading in and saving the index header record and then reading the records from the index file into the INDEX[ ] array. Since this will be a sequential read, and since the records are short, the procedure should be written so it reads a large number of index records at once, rather than one record at a time.
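A minimal C sketch of such a bulk load is shown below. The header layout (a record count stored at the front of the index file) and the names used are assumptions made for this illustration, not the text's actual programs.

#include <stdio.h>
#include <stdlib.h>

#define KEYSIZE 12
typedef struct { char key[KEYSIZE + 1]; long byte_offset; } INDEX_REC;

/* Read the index header, then read all index records in one large
   sequential call rather than one record at a time.                */
INDEX_REC *load_index(FILE *indexfile, int *count)
{
    INDEX_REC *idx;
    long reccount;

    rewind(indexfile);
    fread(&reccount, sizeof reccount, 1, indexfile);    /* assumed header: record count */

    idx = malloc(reccount * sizeof(INDEX_REC));
    if (idx == NULL) return NULL;

    fread(idx, sizeof(INDEX_REC), reccount, indexfile); /* one big read of all entries  */
    *count = (int) reccount;
    return idx;
}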
Rewriting the Index File from Memory. When processing of an indexed file is completed, it is necessary to rewrite INDEX[ ] back into the index file if the array has been changed in any way. In Fig. 6.5, the procedure rewrite_index() describes the steps for doing this.

PROCEDURE rewrite_index()
    check a status flag that tells whether the INDEX[] array
        has been changed in any way
    if there were changes, then
        open the index file as a new empty file
        update the header record and rewrite the header
        write the index out to the newly created file
        close the index file
end PROCEDURE

FIGURE 6.5 The rewrite_index() procedure.
It is important to consider what happens if this rewriting of the index does not take place, or takes place incompletely. Programs do not always run to completion. The program designer needs to guard against power failures, against the operator turning the machine off at the wrong time, and other such disasters.

One of the serious dangers associated with reading an index into memory and then writing it out when the program is over is that the copy of the index on disk will be out of date and incorrect if the program is interrupted. It is imperative that a program contain at least the following two safeguards to protect against this kind of error:

- There should be a mechanism that permits the program to know when the index is out of date. One possibility involves setting a status flag as soon as the copy of the index in memory is changed. This status flag could be written into the header record of the index file on disk as soon as the index is read into memory, and then subsequently cleared when the index is rewritten. All programs could check the status flag before using an index. If the flag is found to be set, then the program would know that the index is out of date.

- If a program detects that an index is out of date, the program must have access to a procedure that reconstructs the index from the data file. This should happen automatically, taking place before any attempt is made to use the index.
Record Addition. Adding a new record to the data file requires that we also add a record to the index file. Adding to the data file itself is easy. The exact procedure depends, of course, on the kind of variable-length file organization being used. In any case, when we add a data record we should know the starting byte_offset of the location at which we wrote the record. This information, along with the canonical form of the record's key, must be placed in the INDEX[ ] array.

Since the INDEX[ ] array is kept in sorted order by key, insertion of the new index record probably requires some rearrangement of the index. In a way, the situation is similar to the one we face as we add records to a sorted data file. We have to shift or slide all the records that have keys that come in order after the key of the record we are inserting. The shifting opens up a space for the new record. The big difference between the work we have to do on the index records and the work required for a sorted data file is that the INDEX[ ] array is contained wholly in memory. All of the index rearrangement can be done without any file access.
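A C sketch of this in-memory insertion follows; the array capacity and the names are assumptions made for the illustration.

#include <string.h>

#define KEYSIZE 12
#define MAXENTRIES 1000              /* assumed capacity of the in-memory index */

typedef struct { char key[KEYSIZE + 1]; long byte_offset; } INDEX_REC;

/* Insert a key/offset pair, sliding larger keys up one slot to open a
   space. Returns the new count, or -1 if the array is full.           */
int add_index_entry(INDEX_REC idx[], int count,
                    const char *key, long byte_offset)
{
    int i;

    if (count >= MAXENTRIES) return -1;

    i = count;
    while (i > 0 && strcmp(key, idx[i - 1].key) < 0) {
        idx[i] = idx[i - 1];         /* shifting happens in memory only */
        i--;
    }
    strncpy(idx[i].key, key, KEYSIZE);
    idx[i].key[KEYSIZE] = '\0';
    idx[i].byte_offset = byte_offset;
    return count + 1;
}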
Record Deletion. In Chapter 5 we describe a number of approaches to deleting records in variable-length record files that allow for the reuse of the space occupied by these records. These approaches are completely viable for our data file since, unlike a sorted data file, the records in this file need not be moved around to maintain an ordering on the file. This is one of the great advantages of an indexed file organization: We have rapid access to individual records by key without disturbing pinned records. In fact, the indexing itself pins all the records.

Of course, when we delete a record from the data file we must also delete the corresponding entry from our index file. Since the index is contained in an array during program execution, deleting the index record and shifting the other records to close up the space may not be an overly expensive operation. Alternatively, we could simply mark the index record as deleted, just as we might mark the corresponding data record.
Record Updating. Record updating falls into two categories:

- The update changes the value of the key field. This kind of update can bring about a reordering of the index file as well as the data file. Conceptually, the easiest way to think of this kind of change is as a deletion followed by an addition. This delete/add approach can be implemented while still providing the program user with the view that he or she is merely changing a record.

- The update does not affect the key field. This second kind of update does not require rearrangement of the index file, but may well involve reordering of the data file. If the record size is unchanged or decreased by the update, the record can be written directly into its old space, but if the record size is increased by the update, a new slot for the record will have to be found. In the latter case the starting address of the rewritten record must replace the old address in the byte_offset field of the corresponding index record.
6.4 Indexes That Are Too Large to Hold in Memory

The methods we have been discussing, and, unfortunately, many of the advantages associated with them, are tied to the assumption that the index file is small enough to be loaded into memory in its entirety. If the index is too large for this approach to be practical, then index access and maintenance must be done on secondary storage. With simple indexes of the kind we have been discussing, accessing the index on a disk has the following disadvantages:

- Binary searching of the index requires several seeks rather than taking place at electronic memory speeds. Binary searching of an index on secondary storage is not substantially faster than the binary searching of a sorted file.

- Index rearrangement due to record addition or deletion requires shifting or sorting records on secondary storage. This is literally millions of times more expensive than the cost of these same operations when performed in electronic memory.

Although these problems are no worse than those associated with the use of any file that is sorted by key, they are severe enough to warrant the consideration of alternatives. Any time a simple index is too large to hold in memory, you should consider using

- A hashed organization if access speed is a top priority; or

- A tree-structured index, such as a B-tree, if you need the flexibility of both keyed access and ordered, sequential access.

These alternative file organizations are discussed at length in the chapters that follow. But, before writing off the use of simple indexes on secondary storage altogether, we should note that they provide some important advantages over the use of a data file sorted by key even if the index cannot be held in memory:

- A simple index makes it possible to use a binary search to obtain keyed access to a record in a variable-length record file. The index provides the service of associating a fixed-length and therefore binary-searchable record with each variable-length data record.
- If the index records are substantially smaller than the data file records, sorting and maintaining the index can be less expensive than sorting and maintaining the data file would be. This is simply because there is less information to move around in the index file.

- If there are pinned records in the data file, the use of an index lets us rearrange the keys without moving the data records.

There is another advantage associated with the use of simple indexes, one that we have not yet discussed. It, in itself, can be reason enough to use simple indexes even if they do not fit into memory. Remember the analogy between an index and a library card catalog? The card catalog provides multiple views or arrangements of the library's collection, even though there is only one set of books arranged in a single order. Similarly, we can use multiple indexes to provide multiple views of a data file.
6.5 Indexing to Provide Access by Multiple Keys

One question that might reasonably arise at this point is, "All this indexing business is pretty interesting, but who would ever want to find a record using a key such as DG18807? What I want is the Symphony No. 9 record by Beethoven."

Let's return to our analogy between our index and a library card catalog. Suppose we think of our primary key, the Label ID, as a kind of catalog number. Like the catalog number assigned to a book, we have taken care to make our Label ID unique. Now, in a library it is very unusual to begin by looking for a book with a particular catalog number (e.g., "I am looking for a book with a catalog number QA331T5 1959."). Instead, one generally begins by looking for a book on a particular subject, with a particular title, or by a particular author (e.g., "I am looking for a book on functions," or "I am looking for The Theory of Functions by Titchmarsh."). Given the subject, author, or title, one looks in the card catalog to find the primary key, the catalog number.

Similarly, we could build a catalog for our record collection consisting of entries for album title, composer, and artist. These fields are secondary key fields. Just as the library catalog relates an author entry (secondary key) to a card catalog number (primary key), so can we build an index file that relates Composer to Label ID, as illustrated in Fig. 6.6.

Along with the similarities, there is an important difference between this kind of secondary key index and the card catalog in a library. In a library, once you have the catalog number you can usually go directly to the
stacks to find the book since the books are arranged in order by catalog number. In other words, the books are sorted by primary key. The actual data records in our file, on the other hand, are entry sequenced. Consequently, after consulting the composer index to find the Label ID, you must consult one additional index, our primary key index, to find the actual byte offset of the record that has this particular Label ID. The procedure is summarized in Fig. 6.7.

FIGURE 6.6 Secondary key index organized by composer.

Secondary key            Primary key
BEETHOVEN                ANG3795
BEETHOVEN                DG139201
BEETHOVEN                DG18807
BEETHOVEN                RCA2626
COREA                    WAR23699
DVORAK                   COL31809
PROKOFIEV                LON2312
RIMSKY-KORSAKOV          MER75016
SPRINGSTEEN              COL38358
SWEET HONEY IN THE R     FF245

Clearly it is possible to relate secondary key references (e.g., Beethoven) directly to a byte offset (211) rather than to a primary key (DG18807). However, there are excellent reasons for postponing this binding of a secondary key to a specific address for as long as possible. These reasons become clear as we discuss the way that fundamental file operations such as record deletion and updating are affected by the use of secondary indexes.
PROCEDURE search_on_secondary(KEY)
    search for KEY in the secondary index
    once the correct secondary index record is found, set LABEL_ID
        to the primary key value in the record's reference field
    call retrieve_record(LABEL_ID) to get the data record
end PROCEDURE

FIGURE 6.7 Search_on_secondary: an algorithm to retrieve a single record from Datafile through a secondary key index.

Record Addition. When a secondary index is present, adding a record to the file means adding a record to the secondary index. The cost of doing this is very similar to the cost of adding a record to the primary index: Either records must be shifted or a vector of pointers to structures needs to be rearranged. As with primary indexes, the cost of doing this decreases greatly if the secondary index can be read into electronic memory and changed there.
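The logic of Fig. 6.7 might be sketched in C as follows. The record structure and names are assumptions made for this illustration, and a production version would also have to deal with duplicate secondary keys by locating the first entry of the matching group.

#include <string.h>

#define KEYSIZE 12

typedef struct {
    char key[KEYSIZE + 1];        /* canonical secondary key, e.g. "BEETHOVEN" */
    char label_id[KEYSIZE + 1];   /* primary key reference, e.g. "DG18807"     */
} SECONDARY_REC;

/* Binary search of the secondary index; on success, copy out the
   primary key (Label ID) stored in the reference field.           */
int search_on_secondary(SECONDARY_REC sec[], int count,
                        const char *key, char *label_id)
{
    int low = 0, high = count - 1;
    while (low <= high) {
        int mid = (low + high) / 2;
        int cmp = strcmp(key, sec[mid].key);
        if (cmp == 0) {
            strcpy(label_id, sec[mid].label_id);
            return mid;           /* caller then retrieves by primary key */
        }
        if (cmp < 0) high = mid - 1;
        else         low = mid + 1;
    }
    return -1;                    /* secondary key not found */
}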
Note that the key field in the secondary index file is stored in canonical
form (all of the composers' names are capitalized), since this is the form that
we want to use when we are consulting the secondary index. If we want to
print out the name in normal, mixed upper- and lowercase form, we can
pick up that form from the original data file. Also note that the secondary
keys are held to a fixed length, which means that sometimes they are
truncated. The definition of the canonical form should take this length
restriction into account if searching the index is to work properly.
One important difference between a secondary index and a primary index is that a secondary index can contain duplicate keys. In the sample index illustrated in Fig. 6.6, there are four records with the key BEETHOVEN. Duplicate keys are, of course, grouped together. Within this group, they should be ordered according to the values of the reference fields. In this example, that means placing them in order by Label ID. The reasons for this second level of ordering become clear a little later, as we discuss retrieval based on combinations of two or more secondary keys.
Record Deletion. Deleting a record usually implies removing all references to that record in the file system. So, removing a record from the data file would mean removing not only the corresponding record in the primary index, but also all of the records in the secondary indexes that refer to this primary index record. The problem with this is that secondary indexes, like the primary index, are maintained in sorted order by key.
Consequently, deleting a record would involve rearranging the remaining records to close up the space left open by deletion.
This delete-all-references approach would indeed be advisable if the secondary index referenced the data file directly. If we did not delete the secondary key references, and if the secondary keys were associated with actual byte offsets in the data file, it could be difficult to tell when these references were no longer valid. This is another instance of the pinned-record problem. The reference fields associated with the secondary keys would be pointing to byte offsets that could, after deletion and subsequent space reuse in the data file, be associated with different data records.

But we have carefully avoided referencing actual addresses in the secondary key index. After a search to find the secondary key, we do another search, this time on primary key. Since the primary index does reflect changes due to record deletion, a search for the primary key of a record that has been deleted will fail, returning a record-not-found condition. In a sense, the updated primary key index acts as a kind of final check, protecting us from trying to retrieve records that no longer exist.

Consequently, one option that is open to us when we delete a record from the data file is to modify and rearrange only the primary key index. We could safely leave intact the references to the deleted record that exist in the secondary key indexes. Searches starting from a secondary key index that lead to a deleted record are caught when we consult the primary key index.
If there are a number of secondary key indexes, the savings that result from not having to rearrange all of these indexes when a record is deleted can be substantial. This is especially important when the secondary key indexes are kept on secondary storage. It is also important in an interactive system, where the user is waiting at a terminal for the deletion operation to complete.

There is, of course, a cost associated with this short cut: Deleted records take up space in the secondary index files. With a file system that undergoes few deletions, this is not usually a problem. With a somewhat more volatile file structure, it is possible to address the problem by periodically removing from the secondary index files all records that contain references that are no longer in the primary index. If a file system is so volatile that even periodic purging is not adequate, it is probably time to consider another index structure, such as a B-tree, which allows for deletion without having to rearrange a lot of records.
Record Updating. In our discussion of record deletion, we find that the primary key index serves as a kind of protective buffer, insulating the secondary indexes from changes in the data file. This insulation extends to record updating as well. If our secondary indexes contain references directly to byte offsets in the data file, then updates to the data file that result in changing a record's physical location in the file also require updating the secondary indexes. But, since we are confining such detailed information to the primary index, data file updates affect the secondary index only when they change either the primary or the secondary key. There are three possible situations:

- Update changes the secondary key: If the secondary key is changed, then we may have to rearrange the secondary key index so it stays in sorted order. This can be a relatively expensive operation.

- Update changes the primary key: This kind of change has a large impact on the primary key index, but often requires only that we update the affected reference field (Label_ID in our example) in all the secondary indexes. This involves searching the secondary indexes (on the unchanged secondary keys) and rewriting the affected fixed-length field. It does not require reordering of the secondary indexes unless the corresponding secondary key occurs more than once in the index. If a secondary key does occur more than once, there may be some local reordering, since records having the same secondary key are ordered by the reference field (primary key).

- Update confined to other fields: All updates that do not affect either the primary or secondary key fields do not affect the secondary key index, even if the update is substantial. Note that if there are several secondary key indexes associated with a file, updates to records often affect only a subset of the secondary indexes.
6.6 Retrieval Using Combinations of Secondary Keys

One of the most important applications of secondary keys involves using two or more of them in combination to retrieve special subsets of records from the data file. To provide an example of how this can be done, we will extract another secondary key index from our file of recordings. This one uses the recording's title as the key, as illustrated in Fig. 6.8. Now we can respond to requests such as

- Find the record with Label ID COL38358 (primary key access);
- Find all the recordings of Beethoven's work (secondary key: composer); and
- Find all the recordings titled "Violin Concerto" (secondary key: title).

FIGURE 6.8 Secondary key index organized by recording title.

Secondary key            Primary key
COQ DOR SUITE            MER75016
GOOD NEWS                FF245
NEBRASKA                 COL38358
QUARTET IN C SHARP M     RCA2626
ROMEO AND JULIET         LON2312
SYMPHONY NO. 9           ANG3795
SYMPHONY NO. 9           COL31809
SYMPHONY NO. 9           DG18807
TOUCHSTONE               WAR23699
VIOLIN CONCERTO          DG139201
What is more interesting, however, is that we can also respond to a request that combines retrieval on the composer index with retrieval on the title index, such as: Find all recordings of Beethoven's Symphony No. 9. Without the use of secondary indexes, this kind of request requires a sequential search through the entire file. Given a file containing thousands, or even just hundreds, of records, this is a very expensive process. But, with the aid of secondary indexes, responding to this request is simple and quick.

We begin by recognizing that this request can be rephrased as a Boolean AND operation, specifying the intersection of two subsets of the data file:

Find all data records with:
    composer = "BEETHOVEN" AND title = "SYMPHONY NO. 9"

We begin our response to this request by searching the composer index for the list of Label IDs that identify records with Beethoven as the composer. (An exercise at the end of this chapter describes a binary search procedure that can be used for this kind of retrieval.) This yields the following list of Label IDs:

ANG3795
DG139201
DG18807
RCA2626

Next we search the title index for the Label IDs associated with records that have SYMPHONY NO. 9 as the title key:

ANG3795
COL31809
DG18807
Now we perform the Boolean AND, which is a match operation, combining the lists so only the members that appear in both lists are placed in the output list:

Composers      Titles         Matched list
ANG3795        ANG3795        ANG3795
DG139201       COL31809       DG18807
DG18807        DG18807
RCA2626

We give careful attention to algorithms for performing this kind of match operation in Chapter 7. Note that this kind of matching is much easier if the lists that are being combined are in sorted order. That is the reason why, when we have more than one entry for a given secondary key, the records are ordered by the primary key reference fields.
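Because both lists are in sorted order, the match can be made in a single sequential pass through each list. The following C sketch illustrates the idea; the names and array dimensions are assumptions made here, and Chapter 7 treats this kind of match operation in detail.

#include <string.h>

#define KEYSIZE 12

/* Intersect two sorted lists of Label IDs; matching IDs are copied
   into out[]. Returns the number of matches.                        */
int match_lists(char list1[][KEYSIZE + 1], int n1,
                char list2[][KEYSIZE + 1], int n2,
                char out[][KEYSIZE + 1])
{
    int i = 0, j = 0, m = 0;

    while (i < n1 && j < n2) {
        int cmp = strcmp(list1[i], list2[j]);
        if (cmp == 0) {                  /* present in both lists: keep it */
            strcpy(out[m++], list1[i]);
            i++; j++;
        } else if (cmp < 0) {
            i++;                         /* advance the list with the smaller key */
        } else {
            j++;
        }
    }
    return m;
}

Applied to the composer and title lists shown above, this pass produces the matched list ANG3795, DG18807.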
Finally, once we have the list of primary keys occurring in both lists, we can proceed to the primary key index to look up the addresses of the data records. Then we can retrieve the records:

ANG 3795   Symphony No. 9   Beethoven   Giulini
DG 18807   Symphony No. 9   Beethoven   Karajan

This is the kind of operation that makes computer-indexed file systems useful in a way that far exceeds the capabilities of manual systems. We have only one copy of each data record, and yet, working through the secondary indexes, we have multiple views of these records: We can look at them in order by title, by composer, or by any other field that interests us. Using the computer's ability to combine sorted lists rapidly, we can even combine different views, retrieving intersections (Beethoven AND Symphony No. 9) or unions (Beethoven OR Prokofiev OR Symphony No. 9) of these views. And since our data file is entry sequenced, we can do all of this without having to sort data file records, confining our sorting to the smaller index records, which can often be held in electronic memory.

Now that we have a general idea of the design and uses of secondary indexes, we can look at ways to improve these indexes so they take less space and require less sorting.
6.7 Improving the Secondary Index Structure: Inverted Lists

The secondary index structures that we have developed so far result in two distinct difficulties:

- We have to rearrange the index file every time a new record is added to the file, even if the new record is for an existing secondary key. For example, if we add another recording of Beethoven's Symphony No. 9 to our collection, both the composer and title indexes would have to be rearranged, even though both indexes already contain entries for secondary keys (but not the Label IDs) that are being added.

- If there are duplicate secondary keys, the secondary key field is repeated for each entry. This wastes space, making the files larger than necessary. Larger index files are less likely to be able to fit in electronic memory.
6.7.1 A First Attempt at a Solution

One simple response to these difficulties is to change the secondary index structure so it associates an array of references with each secondary key. For example, we might use a record structure that allows us to associate up to four Label ID reference fields with a single secondary key, as in

BEETHOVEN    ANG3795    DG139201    DG18807    RCA2626

Figure 6.9 provides a schematic example of how such an index would look if used with our sample data file.
FIGURE 6.9 Revised composer index: secondary key index containing space for multiple references for each secondary key.

Secondary key            Set of primary key references
BEETHOVEN                ANG3795   DG139201   DG18807   RCA2626
COREA                    WAR23699
DVORAK                   COL31809
PROKOFIEV                LON2312
RIMSKY-KORSAKOV          MER75016
SPRINGSTEEN              COL38358
SWEET HONEY IN THE R     FF245
The major contribution of this revised index structure is toward the solution of our first difficulty: the need to rearrange the secondary index file every time a new record is added to the data file. Looking at Fig. 6.9, we can see that the addition of another recording of a work by Prokofiev does not require the addition of another record to the index. For example, if we add the recording

ANG    36193    Piano Concertos    Prokofiev    Francois

we need to modify only the corresponding secondary index record by inserting a second Label ID:

PROKOFIEV    ANG36193    LON2312

Since we are not adding another record to the secondary index, there is no need to rearrange any records. All that is required is a rearrangement of the fields in the existing record for Prokofiev.
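In C, such a record might be sketched as follows; the key length and the limit of four reference fields come from the discussion above, while the names are assumptions made for the illustration.

#define KEYSIZE 12
#define MAXREFS 4                         /* space for up to four Label IDs */

typedef struct {
    char key[KEYSIZE + 1];                /* e.g. "BEETHOVEN"               */
    char label_ids[MAXREFS][KEYSIZE + 1]; /* unused slots are left empty    */
} SECONDARY_REC4;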
Although this new structure helps avoid the need to rearrange the secondary index file so often, it does have some problems. For one thing, it provides space for only four Label IDs to be associated with a given key. In the very likely case that more than four Label IDs will go with some key, we need a mechanism for keeping track of the extra Label IDs.

A second problem has to do with space usage. Although the structure does help avoid the waste of space due to the repetition of identical keys, this space savings comes at a potentially high cost. By extending the fixed length of each of the secondary index records to hold more reference fields, we might easily lose more space to internal fragmentation than we gained by not repeating identical keys.

Since we don't want to waste any more space than we have to, we need to ask whether we can improve on this record structure. Ideally, what we would like to do is develop a new design, a revision of our revision, that

- Retains the attractive feature of not requiring reorganization of the secondary indexes for every new entry to the data file;

- Allows more than four Label IDs to be associated with each secondary key; and

- Does away with the waste of space due to internal fragmentation.
6.7.2 A Better Solution: Linking the List of References

Files such as our secondary indexes, in which a secondary key leads to a set of one or more primary keys, are called inverted lists. The sense in which the list is inverted should be clear if you consider that we are working our way backward from a secondary key to the primary key to the record itself.

The second word in the term inverted list also tells us something important: that we are, in fact, dealing with a list of primary key references. Our revised secondary index, which collects together a number of Label IDs for each secondary key, reflects this list aspect of the data more directly than did our initial secondary index. Another way of conceiving of this list aspect of our inverted list is illustrated in Fig. 6.10.

As Fig. 6.10 shows, an ideal situation would be to have each secondary key point to a different list of primary key references. Each of these lists could grow to be just as long as it needs to be. If we add the new Prokofiev record, the PROKOFIEV list of references becomes

ANG36193    LON2312
IMPROVING THE SECONDARY INDEX STRUCTURE: INVERTED LISTS
of primary
key references
Lists
Secondary key index
BEETHOVEN
ANG3795
COREA
DG139201
DVORAK
DG18807
PROKOFIEV
RCA2626
WAR23699
COL31809
LON2312
FIGURE 6.10 Conceptual view of the primary key reference fields as a series of
Similarly, adding two new Beethoven recordings adds just two additional elements to the list of references associated with the Beethoven key. Unlike our record structure that allocates enough space for four Label IDs for each secondary key, the lists could contain hundreds of references, if needed, while still requiring only one instance of a secondary key. On the other hand, if a list requires only one element, then no space is lost to internal fragmentation. Most important of all, we need to rearrange only the file of secondary keys if a new composer is added to the file.

How can we set up an unbounded number of different lists of secondary keys, each of varying length, without creating a large number of small files? The simplest way is through the use of linked lists. We could redefine our secondary index so it consists of records with two fields: a secondary key field, and a field containing the relative record number of the first corresponding primary key reference (Label ID) in the inverted list. The actual primary key references associated with each secondary key would be stored in a separate, entry-sequenced file.
Given the sample data we have been working with, this new design would result in a secondary key file for composers and an associated Label ID file that are organized as illustrated in Fig. 6.11. Following the links for the list of references associated with Beethoven helps us see how the Label ID List file is organized. We begin, of course, by searching the secondary key index of composers for Beethoven. The record that we find points us to relative record number (RRN) 3 in the Label ID List file. Since this is a fixed-length record file, it is easy to jump to RRN 3 and read in its Label ID (ANG3795). Associated with this Label ID is a link to a record with RRN 8. We read in the Label ID for that record, adding it to our list (ANG3795 DG139201). We continue following links and collecting Label IDs until the list looks like this:

ANG3795    DG139201    DG18807    RCA2626

The link field in the last record read from the Label ID List file contains a value of -1. As in our earlier programs, this indicates end-of-list, so we know that we now have all the Label ID references for Beethoven records.
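A C sketch of the two record layouts and of this link-following loop is given below. The field sizes, the names, and the assumption that the Label ID List file holds fixed-length binary records with no header are illustrative choices, not the book's actual files.

#include <stdio.h>
#include <string.h>

#define KEYSIZE 12

typedef struct {                  /* record in the Secondary Index file   */
    char key[KEYSIZE + 1];        /* e.g. "BEETHOVEN"                     */
    long head_rrn;                /* RRN of first Label ID, or -1 if none */
} SECONDARY_REC;

typedef struct {                  /* record in the Label ID List file     */
    char label_id[KEYSIZE + 1];   /* e.g. "ANG3795"                       */
    long next_rrn;                /* link to next reference, -1 ends list */
} LABEL_LIST_REC;

/* Collect all Label IDs for one secondary key by following the links.
   Returns the number of references copied into out[].                  */
int collect_references(FILE *listfile, long head_rrn,
                       char out[][KEYSIZE + 1], int max)
{
    LABEL_LIST_REC rec;
    long rrn = head_rrn;
    int n = 0;

    while (rrn != -1 && n < max) {
        /* fixed-length records, so RRN * record size gives the byte offset
           (ignoring any header record for simplicity)                      */
        fseek(listfile, rrn * (long) sizeof(LABEL_LIST_REC), SEEK_SET);
        fread(&rec, sizeof rec, 1, listfile);
        strcpy(out[n++], rec.label_id);
        rrn = rec.next_rrn;       /* -1 marks end-of-list */
    }
    return n;
}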
To illustrate how record addition affects the Secondary Index and Label ID List files, we add the Prokofiev recording mentioned earlier:

ANG    36193    Piano Concertos    Prokofiev    Francois

You can see (Fig. 6.11) that the Label ID for this new recording is the last one in the Label ID List file, since this file is entry sequenced. Before this record is added, there is only one Prokofiev recording. It has a Label ID of LON2312. Since we want to keep the Label ID Lists in order by ASCII character values, the new recording is inserted in the list for Prokofiev so it logically precedes the LON2312 recording.

Associating the Secondary Index file with a new file containing linked lists of references provides some advantages over any of the structures considered up to this point:
- The only time we need to rearrange the Secondary Index file is when a new composer's name is added or an existing composer's name is changed (e.g., it was misspelled on input). Deleting or adding recordings for a composer who is already in the index involves changing only the Label ID List file. Deleting all the recordings for a composer could be handled by modifying the Label ID List file, while leaving the entry in the Secondary Index file in place, using a value of -1 in its reference field to indicate that the list of entries for this composer is empty.

- In the event that we do need to rearrange the Secondary Index file, the task is quicker now since there are fewer records and each record is smaller.

- Since there is less need for sorting, it follows that there is less of a penalty associated with keeping the Secondary Index files off on secondary storage, leaving more room in RAM for other data structures.

- The Label ID List file is entry sequenced. That means that it never needs to be sorted.

- Since the Label ID List file is a fixed-length record file, it would be very easy to implement a mechanism for reusing the space from deleted records, as described in Chapter 5.

FIGURE 6.11 Improved revision of the composer index: secondary key index referencing linked lists of primary key references.

Secondary Index file                 Label ID List file
Secondary key            First RRN   RRN   Label ID    Link
BEETHOVEN                3           0     LON2312     -1
COREA                    2           1     RCA2626     -1
DVORAK                   7           2     WAR23699    -1
PROKOFIEV                10          3     ANG3795      8
RIMSKY-KORSAKOV          6           4     COL38358    -1
SPRINGSTEEN              4           5     DG18807      1
SWEET HONEY IN THE R     9           6     MER75016    -1
                                     7     COL31809    -1
                                     8     DG139201     5
                                     9     FF245       -1
                                     10    ANG36193     0
There is also at least one potentially significant disadvantage to this kind of file organization: The Label IDs associated with a given composer are no longer guaranteed to be physically grouped together. The technical term for such "togetherness" is locality; with a linked, entry-sequenced structure such as this, it is less likely that there will be locality associated with the logical groupings of reference fields for a given secondary key. Note, for example, that our list of Label IDs for Prokofiev consists of the very last and the very first records in the file. This lack of locality means that picking up the references for a composer that has a long list of references could involve a large amount of seeking back and forth on the disk. Note that this kind of seeking would not be required for our original Secondary Index file structure.
One obvious antidote to this seeking problem is to keep the Label ID List file in memory. This could be expensive and impractical, given many secondary indexes, except for the interesting possibility of using the same Label ID List file to hold the lists for a number of Secondary Index files. Even if the file of reference lists were too large to hold in memory, it might be possible to obtain a performance improvement by holding only a part of the file in memory at a time, paging sections of the file in and out of memory as they are needed.

Several exercises at the end of the chapter explore these possibilities more thoroughly. These are very important problems, since the notion of dividing the index into pages is fundamental to the design of B-trees and other methods for handling large indexes on secondary storage.
6.8 Selective Indexes

Another interesting feature of secondary indexes is that they can be used to divide a file into parts, providing a selective view. For example, it is possible to build a selective index that contains only the titles of classical recordings in the record collection. If we have additional information about the recordings in the data file, such as the date the recording was released, we could build selective indexes such as "recordings released prior to 1970" and "recordings since 1970." Such selective index information could be combined into Boolean AND operations to respond to requests such as, "List all the recordings of Beethoven's Ninth Symphony released since 1970." Selective indexes are sometimes useful when the contents of a file fall naturally and logically into several broad categories.
6.9 Binding

A recurrent and very important question that emerges in the design of file systems that utilize indexes is: At what point in time is the key bound to the physical address of its associated record?

In the file system we are designing in the course of this chapter, the binding of our primary keys to an address takes place at the time the files are constructed. The secondary keys, on the other hand, are bound to an address at the time that they are actually used.

Binding at the time of the file construction results in faster access. Once you have found the right index record, you have in hand the byte offset of the data record you are seeking. If we elected to bind our secondary keys to their associated records at the time of file construction, so when we find the DVORAK record in the composer index we would know immediately that the data record begins at byte 353 in the data file, secondary key retrieval would be simpler and faster. The improvement in performance is particularly noticeable if both the primary and secondary index files are used on secondary storage rather than in memory. Given the arrangement we designed, we would have to perform a binary search of the composer index and then a binary search of the primary key index before being able to jump to the data record. Binding early, at file construction time, does away entirely with the need to search on the primary key.

The disadvantage of binding directly in the file, of binding tightly, is that reorganizations of the data file must result in modifications to all bound index
files. This reorganization cost can be very expensive, particularly
with simple index files in which modification would often mean shifting
records. By postponing binding until execution time, when the records are
actually being used, we are able to develop a secondary key system that
involves a minimal amount of reorganization when records are added or
deleted.
Another important advantage of postponing binding until a record is
actually retrieved is that this approach is safer. As we see in the system that
we set up, associating the secondary keys with reference fields consisting of
primary keys allows the primary key index to act as a kind of final check of
whether a record is really in the file. The secondary indexes can afford to be
wrong. This situation is very different if the secondary index keys are tightly bound, containing addresses. We would then be jumping directly from the secondary key into the data file; the address would need to be right.
This brings up a related safety aspect: It is always more desirable to have to make important changes in one place, rather than having to make them in many places. With a bind-at-retrieval-time scheme such as we developed, we need to remember to make a change in only one place, the primary key index, if we move a data record. With a more tightly bound system, we have to make many changes successfully to keep the system internally consistent, braving power failures, user interruptions, and so on.

When designing a new file system, it is better to deal with this question of binding intentionally and early in the design process, rather than letting the binding just happen. In general, tight, in-the-data binding is most attractive when

- The data file is static or nearly so, requiring little or no adding, deleting, and updating of records; and
- Rapid performance during actual retrieval is a high priority.
For example, tight binding is desirable for file organization on a mass-produced, read-only optical disk. The addresses will never change since no new records can ever be added; consequently, there is no reason not to obtain the extra performance associated with tight binding.

For file applications in which record addition, deletion, and updating do occur, however, binding at retrieval time is usually the more desirable option. Postponing binding for as long as possible usually makes these operations simpler and safer. If the file structures are carefully designed, and, in particular, if the indexes use more sophisticated organizations such as B-trees, retrieval performance is usually quite acceptable, even given the additional work required by a bind-at-retrieval system.
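To make the distinction concrete, here is a minimal C sketch, not taken from the text, of the two kinds of secondary index entries discussed above; the struct names, field widths, and sample values are assumptions made for illustration only.

    #include <stdio.h>

    /* Tightly bound entry: the secondary key is bound to a physical address
       (a byte offset in the data file) when the index is constructed.        */
    struct bound_entry {
        char composer[12];        /* secondary key (width assumed)            */
        long byte_offset;         /* address of the record in the data file   */
    };

    /* Bind-at-retrieval entry: the secondary key refers to a primary key, so
       the primary key index must still be searched when the record is used.  */
    struct unbound_entry {
        char composer[12];        /* secondary key                            */
        char label_id[12];        /* primary key (Label ID) of the record     */
    };

    int main(void)
    {
        struct bound_entry   tight = { "DVORAK", 353L };       /* illustrative */
        struct unbound_entry loose = { "DVORAK", "LON 1259" }; /* illustrative */

        printf("tight binding: seek directly to byte %ld\n", tight.byte_offset);
        printf("late binding:  look up %s in the primary key index\n",
               loose.label_id);
        return 0;
    }

Either layout lets us find the record; the difference is that the tightly bound entry must be rebuilt whenever the data file is reorganized, while the loosely bound entry survives any reorganization that leaves the primary key index correct.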
SUMMARY

We began this chapter with the assertion that indexing is an alternative to sorting as a way of structuring a file so records can be found by key. Unlike sorting, indexing permits us to perform binary searches for keys in variable-length record files. If the index can be held in memory, record addition, deletion, and retrieval can be done much more quickly with an indexed, entry-sequenced file than with a sorted file.

Indexes can do much more than merely improve on access time: They can provide us with new capabilities that are inconceivable with access methods based on sorted data records. The most exciting new capability
involves the use of multiple secondary indexes. Just as a library card catalog allows us to regard a collection of books in author order, title order, or subject order, so index files allow us to maintain different views of the records in a data file. We find that we can not only use secondary indexes to obtain different views of the file, but that we can also combine the associated lists of primary key references and thereby combine particular views.

In this chapter we address the problem of how to rid our secondary indexes of two liabilities:

- The need to repeat duplicate secondary keys; and
- The need to rearrange the secondary indexes every time a record is added to the data file.

A first solution to these problems involves associating a fixed-size vector of reference fields with each secondary key. This solution results in an overly large amount of internal fragmentation but serves to illustrate the attractiveness of handling the reference fields associated with a particular secondary key as a group, or list.

Our next iteration of solutions to our secondary index problems is more successful and much more interesting. We can treat the primary key references themselves as an entry-sequenced file, forming the necessary lists through the use of link fields associated with each primary record entry. This allows us to create a secondary index file that, in the case of the composer index, needs rearrangement only when we add new composers to the data file. The entry-sequenced file of linked reference lists never requires sorting. We call this kind of secondary index structure an inverted list.

There are also, of course, disadvantages associated with our new solution. The most serious disadvantage is that our file demonstrates less locality: Lists of associated records are less likely to be physically adjacent. A good antidote to this problem is to hold the file of linked lists in memory. We note that this is made more plausible because a single file of primary references can link the lists for a number of secondary indexes.

As indicated by the length and breadth of our consideration of secondary indexing, multiple keys, and inverted lists, these topics are among the most interesting aspects of indexed access to files. The concepts of secondary indexes and inverted lists become even more powerful later, as we develop index structures that are themselves more powerful than the simple indexes that we consider here. But, even so, we already see that for small files consisting of no more than a few thousand records, approaches to inverted lists that rely merely on simple indexes can provide a user with a great deal of capability and flexibility.
KEY TERMS

Binding. Binding takes place when a key is associated with a particular physical record in the data file. In general, binding can take place either during the preparation of the data file and indexes or during program execution. In the former case, which is called tight binding, the indexes contain explicit references to the associated physical data record. In the latter case, the connection between a key and a particular physical record is postponed until the record is actually retrieved in the course of program execution.

Entry-sequenced file. A file in which the records occur in the order that they are entered into the file.

Index. An index is a tool for finding records in a file. It consists of a key field on which the index is searched and a reference field that tells where to find the data file record associated with a particular key.

Inverted list. The term inverted list refers to indexes in which a key may be associated with a list of reference fields pointing to documents that contain the key. The secondary indexes developed toward the end of this chapter are examples of inverted lists.

Key field. The key field is the portion of an index record that contains the canonical form of the key that is being sought.

Locality. Locality exists in a file when records that will be accessed in a given temporal sequence are found in physical proximity to each other on the disk. Increased locality usually results in better performance, since records that are in the same physical area can often be brought into memory with a single read request to the disk.

Reference field. The reference field is the portion of an index record that contains information about where to find the data record containing the information listed in the associated key field of the index.

Selective index. A selective index contains keys for only a portion of the records in the data file. Such an index provides the user with a view of a specific subset of the file's records.

Simple index. All the index structures discussed in this chapter are simple indexes insofar as they are all built around the idea of an ordered, linear sequence of index records. All these simple indexes share a common weakness: Adding records to the index is expensive. As we see later, tree-structured indexes provide an alternate, more efficient solution to this problem.
EXERCISES

1. Until now, it was not possible to perform a binary search on a variable-length record file. Why does indexing make binary search possible? With a fixed-length record file it is possible to perform a binary search. Does this mean that indexing need not be used with fixed-length record files?

2. Why is title not used as a primary key in the data file described in this chapter? If it were used as a secondary key, what problems would have to be considered in deciding on a canonical form for titles?

3. What is the purpose of keeping an out-of-date-status flag in the header record of an index? In a multiprogramming environment, this flag might be found to be set by one program because another program is in the process of reorganizing the index. How should the first program respond to this situation?

4. Explain how the use of an index pins the data records in a file.

5. When a record in a data file is updated, corresponding primary and secondary key indexes may or may not have to be altered, depending on whether the file has fixed- or variable-length records, and depending on the type of change made to the data record. Make a list of the different updating situations that can occur, and explain how each affects the indexes.

6. Discuss the problem that occurs when you add the following recording to the recordings file, assuming that the composer index shown in Fig. 6.9 is used. How might you solve the problem without substantially changing the secondary key index structure?

   LON 1259   Fidelio   Beethoven   Maazel

7. What is an inverted list, and when is it useful?

8. How are the structures in Fig. 6.11 changed by the addition of the recording

   LON 1259   Fidelio   Beethoven   Maazel

9. Suppose you have the data file described in this chapter, greatly expanded, with a primary key index and secondary key indexes organized by composer, artist, and title. Suppose that an inverted list structure is used to organize the secondary key indexes. Give step-by-step descriptions of how a program might answer the following queries:
   a. List all recordings of Bach or Beethoven; and
   b. List all recordings by Perleman of pieces by Mozart or Joplin.
10. One possible antidote to the problem of diminished locality when using inverted lists is to use the same Label ID List file to hold the lists for several of the secondary index files. This increases the likelihood that the secondary indexes can be kept in primary memory. Draw a diagram of a single Label ID List file that can be used to hold references for both the secondary index of composers and the secondary index of titles. How would you handle the difficulties that this arrangement presents with regard to maintaining the Label ID List file?

11. Discuss the following structures as antidotes to the possible loss of locality in a secondary key index:
   - Leave space for multiple references for each secondary key (Fig. 6.9).
   - Allocate variable-length records for each secondary key value, where each record contains the secondary key value, followed by the Label IDs, followed by free space for later addition of new Label IDs. The amount of free space left could be fixed, or it could be a function of the size of the original list of Label IDs.

12. The method and timing of binding affect two important attributes of a file system: speed and flexibility. Discuss the relevance of these attributes, and the effect of binding time on them, for a hospital patient information system designed to provide information about current patients by patient name, patient ID, location, medication, doctor or doctors, and illness.
Programming and Design Exercises

13. Implement the retrieve_record() procedure outlined in Fig. 6.4.

14. In solving the preceding problem, you have to create a mechanism for deciding how many bytes to read from the Datafile for each record. At least four options are open to you:
   a. Jump to the byte_offset, read the size field, then use this information to read the record.
   b. Build an index file that contains a record size field that reflects the true size of the data record, including the size field carried in the Datafile. Use the size field carried in the index file to decide how many bytes to read.
   c. Follow much the same strategy as in option (b), except use a Datafile that does not contain internal size fields.
   d. Jump to the byte_offset and read a fixed, overly large number of bytes (e.g., 512 bytes). Once these bytes are read into a memory buffer, use the size field at the start of the buffer to decide how many bytes to break out of the buffer.

Evaluate each of these options, listing the advantages and disadvantages of each.
15. Implement procedures to read in the INDEX array from the index file and to write it back to the index file.

16. When searching secondary indexes that contain multiple records for some of the keys, we do not want to find just any record for a given secondary key; we want to find the first record containing that key. Finding the first record allows us to read ahead, sequentially, extracting all of the records for the given key. Write a variation of a binary search function that returns the relative record number of the first record containing the given key. The function should return a negative value if the key cannot be found.

17. If a Label ID List file such as the one shown in Fig. 6.11 is too large to be held in memory in its entirety, it might still be possible to improve its performance by retaining a number of blocks of the file in memory. These blocks are called pages. Since the records in the Label ID List file are each 16 bytes long, a page might consist of 32 records (512 bytes). Write a function that would hold the most recently used eight pages in memory. Calls for a specific record from the Label ID List file would be routed through this function. It would check to see if the record exists in one of the pages that is already in memory. If so, the function would return the values of the record fields immediately. If not, the function would read in the page containing the desired record, either writing out or dumping the page that was used least recently. Clearly, if a page has been changed, it needs to be written out rather than dumped. When the program is over, all pages still in memory must be checked to see if they should be written out.

18. Assuming the use of a paged index as described in the preceding problem, and given that the Label ID List file is entry sequenced, is there any particular order of data entry (initial loading) that tends to give better performance than other methods? How does the use of an organization method such as that described in problem 10, which combines the linked lists from several secondary indexes into a single file, affect your answer about performance?

19. The Label ID List file is entry sequenced. Development of paging schemes is simpler for entry-sequenced files than for files that are kept in sorted order. List the additional difficulties involved in the design of a paging system for a sorted index, such as the primary key index. Accepting the possibility that there will be a number of pages that are only partially full, how will you handle the insertion of a new key when the page in which it belongs is full? Design such a paging system.
FURTHER READINGS

We have much more to say about indexing in later chapters, where we take up the subjects of tree-structured indexes and of indexed sequential file organizations. The topics developed in the current chapter, particularly those relating to secondary indexes and inverted files, are also covered by many other file and data structure texts. The few texts that we list here are of interest because they either develop certain topics in more detail or present the material from a different viewpoint.

Wiederhold (1983) provides a survey of many of the index structures we discuss, along with a number of others. His treatment is more mathematical than that provided in our text. Users interested in looking at indexed files in the context of PL/I and of large IBM mainframes will want to see Bradley (1982). A brief, readable overview of a number of different file organizations is provided in J. D. Ullman (1980).

Tremblay and Sorenson (1984) provide a comparison of inverted list structures with an alternative organization called multilist files. M. E. S. Loomis (1983) provides a similar discussion, along with some examples oriented toward COBOL users. Salton and McGill (1983) discuss inverted lists in the context of their application in information retrieval systems.
Cosequential Processing and the Sorting of Large Files
CHAPTER OBJECTIVES

- Describe a class of frequently used processing activities known as cosequential processes.
- Provide a general model for implementing all varieties of cosequential processes.
- Illustrate the use of the model to solve a number of different kinds of cosequential processing problems, including problems other than simple merges and matches.
- Introduce heapsort as an approach to overlapping I/O with sorting in RAM.
- Show how merging provides the basis for sorting very large files.
- Examine the costs of K-way merges on disk and find ways to reduce those costs.
- Introduce the notion of replacement selection.
- Examine some of the fundamental concerns associated with sorting large files using tapes rather than disks.
- Introduce UNIX utilities for sorting, merging, and cosequential processing.
CHAPTER OUTLINE

7.1 A Model for Implementing Cosequential Processes
    7.1.1 Matching Names in Two Lists
    7.1.2 Merging Two Lists
    7.1.3 Summary of the Cosequential Processing Model
7.2 Application of the Model to a General Ledger Program
    7.2.1 The Problem
    7.2.2 Application of the Model to the Ledger Program
7.3 Extension of the Model to Include Multiway Merging
    7.3.1 A K-way Merge Algorithm
    7.3.2 A Selection Tree for Merging Large Numbers of Lists
7.4 A Second Look at Sorting in RAM
    7.4.1 Overlapping Processing and I/O: Heapsort
    7.4.2 Building the Heap while Reading in the File
    7.4.3 Sorting while Writing out to the File
7.5 Merging as a Way of Sorting Large Files on Disk
    7.5.1 How Much Time Does a Merge Sort Take?
    7.5.2 Sorting a File That Is Ten Times Larger
    7.5.3 The Cost of Increasing the File Size
    7.5.4 Hardware-based Improvements
    7.5.5 Decreasing the Number of Seeks Using Multiple-step Merges
    7.5.6 Increasing Run Lengths Using Replacement Selection
    7.5.7 Replacement Selection Plus Multistep Merging
    7.5.8 Using Two Disk Drives with Replacement Selection
    7.5.9 More Drives? More Processors?
    7.5.10 Effects of Multiprogramming
    7.5.11 A Conceptual Toolkit for External Sorting
7.6 Sorting Files on Tape
    7.6.1 The Balanced Merge
    7.6.2 The K-way Balanced Merge
    7.6.3 Multiphase Merges
    7.6.4 Tapes versus Disks for External Sorting
7.7 Sort-Merge Packages
7.8 Sorting and Cosequential Processing in UNIX
    7.8.1 Sorting and Merging in UNIX
    7.8.2 Cosequential Processing Utilities in UNIX
Cosequential operations involve the coordinated processing of two or more sequential lists to produce a single output list. Sometimes the processing results in a merging, or union, of the input lists; sometimes the goal is a matching, or intersection, of the lists; and other times the operation is a combination of matching and merging. These kinds of operations on sequential lists are the basis of a great deal of file processing.

In the first half of this chapter we develop a general model for doing cosequential operations, illustrate its use for simple matching and merging operations, and then apply it to the development of a more complex general ledger program. Next we apply the model to multiway merging, which is an essential component of external sort-merge operations. We conclude the chapter with a discussion of external sort-merge procedures, strategies, and trade-offs, paying special attention to performance considerations.

7.1 A Model for Implementing Cosequential Processes
Cosequential operations usually appear to be simple to construct; given the information that we provide in this chapter, this appearance of simplicity can be turned into reality. However, it is also true that approaches to cosequential processing are often confused, poorly organized, and incorrect. These examples of bad practice are by no means limited to student programs; the problems also arise in commercial programs and in textbooks. The difficulty with these incorrect programs is usually that they are not organized around a single, clear model for cosequential processing. Instead, they seem to deal with the various exception conditions and problems in a cosequential process in an ad hoc rather than systematic way.

This section addresses such lack of organization head on. We present a single, simple model that can be the basis for the construction of any kind of cosequential process. By understanding and adhering to the design principles inherent in the model, you will be able to write cosequential procedures that are simple, short, and robust.
7.1.1 Matching Names in Two Lists

Suppose we want to output the names common to the two lists shown in Fig. 7.1. This operation is usually called a match operation, or an intersection. We assume, for the moment, that we will not allow duplicate keys within a list, and that the lists are sorted in ascending order.

We begin by reading in the initial name from each list, and we find that they match. We output this first name as a member of the match set, or intersection set. We then read in the next name from each list. This time the name in List 2 is less than the name in List 1. When we are processing these lists visually, as we are now, we remember that we are trying to match the name CARTER from List 1, and scan down List 2 until we either find it or jump beyond it. In this case, we eventually find a match for CARTER, so we output it, read in the next name from each list, and continue the process. Eventually we come to the end of one of the lists. Since we are looking for names common to both lists, we know we can stop at this point.

Although the match procedure appears to be quite simple, there are a number of matters that have to be dealt with to make it work reasonably well.
FIGURE 7.1 Sample input lists for cosequential operations.

List 1: ADAMS, ANDERSON, ANDREWS, BECH, BURNS, CARTER, DAVIS, DEMPSEY, GRAY, JAMES, JOHNSON, KATZ, PETERS, ROSEWALD, SCHMIDT, THAYER, WALKER, WILLIS

List 2: ADAMS, CARTER, CHIN, DAVIS, FOSTER, GARWICK, JAMES, JOHNSON, KARNS, LAMBERT, MILLER, PETERS, RESTON, ROSEWALD, TURNER
- Initializing: We need to arrange things in such a way that the procedure gets going properly.
- Synchronizing: We have to make sure that the current name from one list is never so far ahead of the current name on the other list that a match will be missed. Sometimes this means reading the next name from List 1, sometimes from List 2, or sometimes from both lists.
- Handling end-of-file conditions: When we get to the end of either file 1 or file 2, we need to halt the program.
- Recognizing errors: When an error occurs in the data (e.g., duplicate names or names out of sequence) we want to detect it and take some action.

Finally, we would like our algorithm to be reasonably efficient, simple, and easy to alter to accommodate different kinds of data. The key to accomplishing these objectives in the model we are about to present lies in the way we deal with the second item in our list, synchronization.

At each step in the processing of the two lists, we can assume that we have two names to compare: a current name from List 1 and a current name from List 2. Let's call these two current names NAME_1 and NAME_2. We can compare the two names to determine whether NAME_1 is less than, equal to, or greater than NAME_2:
- If NAME_1 is less than NAME_2, we read the next name from List 1;
- If NAME_1 is greater than NAME_2, we read the next name from List 2; and
- If the two names are the same, we output the name and read the next names from the two lists.
It turns out that this can be handled very cleanly with a single loop containing one three-way conditional statement, as illustrated in the algorithm in Fig. 7.2. The key feature of this algorithm is that control always returns to the head of the main loop after every step of the operation. This means that no extra logic is required within the loop to handle the case when List 1 gets ahead of List 2, or List 2 gets ahead of List 1, or the end-of-file condition is reached on one list before it is on the other. Since each pass through the main loop looks at the next pair of names, the fact that one list may be longer than the other does not require any special logic. Nor does the end-of-file condition: the while statement simply checks the MORE_NAMES_EXIST flag on every cycle.

PROGRAM: match

    call initialize() procedure to:
        - open input files LIST_1 and LIST_2
        - create output file OUT_FILE
        - set MORE_NAMES_EXIST to TRUE
        - initialize sequence checking variables

    call input() to get NAME_1 from LIST_1
    call input() to get NAME_2 from LIST_2

    while (MORE_NAMES_EXIST)
        if (NAME_1 < NAME_2)
            call input() to get NAME_1 from LIST_1
        else if (NAME_1 > NAME_2)
            call input() to get NAME_2 from LIST_2
        else                              /* match -- names are the same */
            write NAME_1 to OUT_FILE
            call input() to get NAME_1 from LIST_1
            call input() to get NAME_2 from LIST_2
        endif
    endwhile

    finish_up()

end PROGRAM

FIGURE 7.2 Cosequential match procedure based on a single loop.

The logic inside the loop is equally simple. Only three possible conditions can exist after reading a name; the if-else logic handles all of them. Since we are implementing a match process here, output occurs only when the names are the same.

Note that the main program does not concern itself with such matters as sequence checking and end-of-file detection. Since their presence in the main loop would only obscure the main synchronization logic, they have been relegated to subprocedures.

Since the end-of-file condition is detected during input, the setting of the MORE_NAMES_EXIST flag is done in the input() procedure. The input() procedure can also be used to check the condition that the lists be in strictly ascending order (no duplicate entries within a list). The algorithm in Fig. 7.3 illustrates one method of handling these tasks.
FIGURE 7.3 Input routine for match procedure.

PROCEDURE: input()                  /* input routine for MATCH procedure */

    input arguments:
        INP_FILE            file descriptor for input file to be used
                            (could be LIST_1 OR LIST_2)
        PREVIOUS_NAME       last name read from this list

    arguments used to return values:
        NAME                name to be returned from input procedure
        MORE_NAMES_EXIST    flag used by main loop to halt processing

    read next NAME from INP_FILE

    /* check for end of file, duplicate names, names out of order */
    if (EOF)
        MORE_NAMES_EXIST := FALSE   /* set flag to end processing */
    else if (NAME <= PREVIOUS_NAME)
        issue sequence check error
        abort processing
    endif

    PREVIOUS_NAME := NAME

end PROCEDURE
PROCEDURE: initialize()

    arguments used to return values:
        PREV_1, PREV_2      previous name variables for the 2 lists
        LIST_1, LIST_2      file descriptors for input files to be used
        MORE_NAMES_EXIST    flag used by main loop to halt processing

    /* set both the previous_name variables (one for each list) to
       a value that is guaranteed to be less than any input value */
    PREV_1 := LOW_VALUE
    PREV_2 := LOW_VALUE

    open file for List 1 as LIST_1
    open file for List 2 as LIST_2

    if (both open statements succeed)
        MORE_NAMES_EXIST := TRUE

end PROCEDURE

FIGURE 7.4 Initialization procedure for cosequential processing.
This "filling out" of the input() procedure also indicates the arguments that the procedure would use. All we need now to complete the logic is a description of the procedure initialize() that begins the processing. The initialize() procedure, shown in Fig. 7.4, performs three special tasks:

1. It opens the input and output files.
2. It sets the MORE_NAMES_EXIST flag to TRUE.
3. It sets the previous_name variables (one for each list) to a value that is guaranteed to be less than any input value.

The effect of setting PREV_1 and PREV_2 to LOW_VALUE is that the input() procedure does not need to treat the reading of the first two records in any special way.

Given these program fragments, you should be able to work through the pseudocode of the main cosequential match procedure, following the two lists provided in Fig. 7.1, and demonstrate to yourself that these simple procedures can handle the various resynchronization problems that these sample lists present.
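Before moving on to merging, it may also help to see the match logic of Figs. 7.2 and 7.3 in a real language. The following C sketch is not the text's implementation; the fixed NAMELEN, the one-name-per-line input format, and the use of an empty string as LOW_VALUE are assumptions made to keep the example self-contained.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NAMELEN 64

    /* Read the next name from fp; return 0 at end of file, abort on a
       sequence error (names must arrive in strictly ascending order).    */
    static int input(FILE *fp, char *name, char *previous)
    {
        if (fscanf(fp, "%63s", name) != 1)
            return 0;                          /* end of file             */
        if (strcmp(name, previous) <= 0) {
            fprintf(stderr, "sequence check error\n");
            exit(1);
        }
        strcpy(previous, name);
        return 1;
    }

    int main(int argc, char *argv[])
    {
        FILE *list1, *list2;
        char name1[NAMELEN], name2[NAMELEN];
        char prev1[NAMELEN] = "", prev2[NAMELEN] = "";   /* LOW_VALUE     */
        int  more;

        if (argc < 3 || !(list1 = fopen(argv[1], "r"))
                     || !(list2 = fopen(argv[2], "r")))
            return 1;

        /* initialize: read the first name from each list                 */
        more = input(list1, name1, prev1) && input(list2, name2, prev2);

        while (more) {                         /* single synchronization loop */
            int cmp = strcmp(name1, name2);
            if (cmp < 0)
                more = input(list1, name1, prev1);
            else if (cmp > 0)
                more = input(list2, name2, prev2);
            else {                             /* match: names are the same   */
                printf("%s\n", name1);
                more = input(list1, name1, prev1) &&
                       input(list2, name2, prev2);
            }
        }
        fclose(list1);
        fclose(list2);
        return 0;
    }

Run against two sorted name files, this sketch prints only the names that appear in both, and it stops as soon as either file is exhausted, just as the pseudocode does.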
7.1.2 Merging Two Lists

The three-way test, single-loop model for cosequential processing can easily be modified to handle merging of lists as well as matching, as illustrated in Fig. 7.5.

PROGRAM: merge

    call initialize() procedure to:
        - open input files LIST_1 and LIST_2
        - create output file OUT_FILE
        - set MORE_NAMES_EXIST to TRUE
        - initialize sequence checking variables

    call input() to get NAME_1 from LIST_1
    call input() to get NAME_2 from LIST_2

    while (MORE_NAMES_EXIST)
        if (NAME_1 < NAME_2)
            write NAME_1 to OUT_FILE
            call input() to get NAME_1 from LIST_1
        else if (NAME_1 > NAME_2)
            write NAME_2 to OUT_FILE
            call input() to get NAME_2 from LIST_2
        else                              /* match -- names are the same */
            write NAME_1 to OUT_FILE
            call input() to get NAME_1 from LIST_1
            call input() to get NAME_2 from LIST_2
        endif
    endwhile

    finish_up()

end PROGRAM

FIGURE 7.5 Cosequential merge procedure based on a single loop.
Note that we now produce output for every case of the if-else construction since a merge is a union of the list contents.

An important difference between matching and merging is that with merging we must read completely through each of the lists. This necessitates a change in our input() procedure, since the version used for matching sets the MORE_NAMES_EXIST flag to FALSE as soon as we detect end-of-file for one of the lists. We need to keep this flag set to TRUE as long as there are records in either list. At the same time, we must recognize that one of the lists has been read completely, and we should avoid trying to read from it again. Both of these goals can be achieved if we simply set the NAME variable for the completed list to some value that

- Cannot possibly occur as a legal input value; and
- Has a higher collating sequence value than any possible legal input value. In other words, this special value would come after all legal input values in the file's ordered sequence.

We refer to this special value as HIGH_VALUE. The pseudocode in Fig. 7.6 shows how HIGH_VALUE can be used to ensure that both input files are read to completion. Note that we have to add the argument OTHER_LIST_NAME to the argument list so the function knows whether the other input list has reached its end.
FIGURE 7.6 Input routine for merge procedure.

PROCEDURE: input()                  /* input routine for MERGE procedure */

    input arguments:
        INP_FILE            file descriptor for input file to be used
                            (could be LIST_1 OR LIST_2)
        PREVIOUS_NAME       last name read from this list
        OTHER_LIST_NAME     most recent name read from the other list

    arguments used to return values:
        NAME                name to be returned from input procedure
        MORE_NAMES_EXIST    flag used by main loop to halt processing

    read next NAME from INP_FILE

    if (EOF and OTHER_LIST_NAME == HIGH_VALUE)   /* end of both lists    */
        MORE_NAMES_EXIST := FALSE
    else if (EOF)                                /* just this list ended */
        NAME := HIGH_VALUE
    else if (NAME <= PREVIOUS_NAME)              /* sequence check       */
        issue sequence check error
        abort processing
    endif

    PREVIOUS_NAME := NAME

end PROCEDURE
Once again, you should use this logic to work, step by step, through the lists provided in Fig. 7.1 to see how the resynchronization is handled and how the use of HIGH_VALUE forces the procedure to finish both lists before terminating. Note that the version of input() incorporating the HIGH_VALUE logic can also be used for matching procedures, producing correct results. The only disadvantage to doing so is that the matching procedure would no longer terminate as soon as one list is completely processed, but would go through the extra work of reading all the way through the unmatched entries at the end of the other list.
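A C version of the merge is nearly identical to the earlier match sketch; the variation below folds the end-of-processing test into the loop condition rather than into input(), and it uses the string "~~~~" as an assumed HIGH_VALUE (any value that collates after every legal name would serve).

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NAMELEN    64
    #define HIGH_VALUE "~~~~"   /* assumed sentinel: collates after any name */

    /* Read the next name; on end of file substitute HIGH_VALUE so that the
       exhausted list simply stops winning comparisons.                      */
    static void input(FILE *fp, char *name, char *previous)
    {
        if (fscanf(fp, "%63s", name) != 1) {
            strcpy(name, HIGH_VALUE);
            return;
        }
        if (strcmp(name, previous) <= 0) {
            fprintf(stderr, "sequence check error\n");
            exit(1);
        }
        strcpy(previous, name);
    }

    int main(int argc, char *argv[])
    {
        FILE *list1, *list2;
        char name1[NAMELEN], name2[NAMELEN];
        char prev1[NAMELEN] = "", prev2[NAMELEN] = "";   /* LOW_VALUE        */

        if (argc < 3 || !(list1 = fopen(argv[1], "r"))
                     || !(list2 = fopen(argv[2], "r")))
            return 1;

        input(list1, name1, prev1);
        input(list2, name2, prev2);

        /* continue until HIGH_VALUE has been reached on both lists          */
        while (strcmp(name1, HIGH_VALUE) != 0 || strcmp(name2, HIGH_VALUE) != 0) {
            int cmp = strcmp(name1, name2);
            if (cmp < 0) {
                printf("%s\n", name1);
                input(list1, name1, prev1);
            } else if (cmp > 0) {
                printf("%s\n", name2);
                input(list2, name2, prev2);
            } else {                                     /* names are the same */
                printf("%s\n", name1);
                input(list1, name1, prev1);
                input(list2, name2, prev2);
            }
        }
        fclose(list1);
        fclose(list2);
        return 0;
    }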
With these two examples, we have covered all of the pieces of our model. Now let us summarize the model before adapting it to a more complex problem.

7.1.3 Summary of the Cosequential Processing Model
Generally speaking, the model can be applied to problems that involve the performance of set operations (union, intersection, and more complex processes) on two or more sorted input files to produce one or more output files. In this summary of the cosequential processing model, we assume that there are only two input files and one output file. It is important to understand that the model makes certain general assumptions about the nature of the data and type of problem to be solved. Here is a list of the assumptions, together with clarifying comments.
Assumption: Two or more input files are to be processed in a parallel fashion to produce one or more output files.
Comment: In some cases an output file may be the same file as one of the input files.

Assumption: Each file is sorted on one or more key fields, and all files are ordered in the same ways on the same fields.
Comment: It is not necessary that all files have the same record structures.

Assumption: In some cases, there must exist a high key value that is greater than any legitimate record key, and a low key value that is less than any legitimate record key.
Comment: The use of a high key value and a low key value is not absolutely necessary, but can help avoid the need to deal with beginning-of-file and end-of-file conditions as special cases, hence decreasing complexity.

Assumption: Records are to be processed in logical sorted order.
Comment: The physical ordering of records is irrelevant to the model, but in practice it may be very important to the way the model is implemented. Physical ordering can have a large impact on processing efficiency.

Assumption: For each file there is only one current record. This is the record whose key is accessible within the main synchronization loop.
Comment: The model does not prohibit looking ahead or looking back at records, but such operations should be restricted to subprocedures and should not be allowed to affect the structure of the main synchronization loop.

Assumption: Records can be manipulated only in internal memory.
Comment: A program cannot alter a record in place on secondary storage.
Given these assumptions, here are the essential components of the model:

1. Initialization. Previous_key values for all files are set to the low value; then current records for all files are read from the first logical records in the respective files.

2. One main synchronization loop is used, and the loop continues as long as relevant records remain.

3. Within the body of the main synchronization loop is a selection based on comparison of the record keys from the respective input file records. If there are two input files, the selection takes a form such as

       if (current_file1_key < current_file2_key) then ...
       else if (current_file1_key > current_file2_key) then ...
       else ...                    /* current keys equal */
       endif

4. Input files and output files are sequence checked by comparing the previous_key value with the current_key value when a record is read in. After a successful sequence check, previous_key is set to current_key to prepare for the next input operation on the corresponding file.

5. High values are substituted for actual key values when end-of-file occurs. The main processing loop terminates when high values have occurred for all relevant input files. The use of high values eliminates the need to add special code to deal with each end-of-file condition. (This step is not needed in a pure match procedure, since a match procedure halts when the first end-of-file condition is encountered.)

6. All possible I/O and error detection activities are to be relegated to subprocesses, so the details of these activities do not obscure the principal processing logic.
This three-way test, single-loop model for creating cosequential processes is both simple and robust. You will find very few applications requiring the coordinated sequential processing of two files that cannot be handled neatly and efficiently with the model. We now look at a problem that is much more complex than a simple match or merge, but that nevertheless lends itself nicely to solution by means of the model.
7.2 Application of the Model to a General Ledger Program

7.2.1 The Problem

Suppose we are given the problem of designing a general ledger program as part of an accounting system. The system includes a journal file and a ledger file. The ledger contains the month-by-month summaries of the values associated with each of the bookkeeping accounts. A sample portion of the ledger, containing only checking and expense accounts, is illustrated in Fig. 7.7.
FIGURE 7.7 Sample ledger fragment containing checking and expense accounts.

Acct. no.  Account title                 Jan       Feb       Mar      Apr
101        Checking account #1          1032.57   2114.56   5219.23
102        Checking account #2           543.78   3094.17   1321.20
505        Advertising expense            25.00     25.00     25.00
510        Auto expenses                 195.40    307.92    501.12
515        Bank charges                    0.00      5.00      5.00
520        Books and publications         27.95     27.95     87.40
525        Interest expense              103.50    255.20    380.27
530        Legal expense                  25.00     25.00     25.00
535        Miscellaneous expense          12.45     17.87     23.87
540        Office expense                 57.50    105.25    138.37
545        Postage and shipping           21.00     27.63     57.45
550        Rent                          500.00   1000.00   1500.00
555        Supplies                      112.00    167.50    241.80
560        Travel and entertainment       62.76    198.12    307.74
565        Utilities                      84.89    190.60    278.48
FIGURE 7.8 Sample journal entries.

Acct. no.  Check no.  Date      Description                       Debit/credit
101        1271       04/02/86  Auto expense                        - 78.70
510        1271       04/02/86  Tune up and minor repair              78.70
101        1272       04/02/86  Rent                                - 500.00
550        1272       04/02/86  Rent for April                        500.00
101        1273       04/04/86  Advertising                         - 87.50
505        1273       04/04/86  Newspaper ad re: new product          87.50
102        670        04/07/86  Office expense                      - 32.78
540        670        04/07/86  Printer ribbons (6)                   32.78
101        1274       04/09/86  Auto expense                        - 12.50
510        1274       04/09/86  Oil change                            12.50
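In a C implementation, the two files might carry fixed-length records along the following lines; the struct names and field widths here are assumptions for illustration, not definitions taken from the text.

    #include <stdio.h>

    /* Assumed layout of one ledger record: an account and its monthly balances. */
    struct ledger_rec {
        int    acct_no;           /* key: bookkeeping account number             */
        char   title[30];         /* account title                               */
        double balance[12];       /* month-by-month balances, Jan through Dec    */
    };

    /* Assumed layout of one journal record: a single transaction to be posted.  */
    struct journal_rec {
        int    acct_no;           /* key used for posting to the ledger          */
        int    check_no;
        char   date[9];           /* "mm/dd/yy"                                  */
        char   description[30];
        double amount;            /* negative for checking debits, else positive */
    };

    int main(void)
    {
        printf("ledger record: %lu bytes, journal record: %lu bytes\n",
               (unsigned long) sizeof(struct ledger_rec),
               (unsigned long) sizeof(struct journal_rec));
        return 0;
    }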
The journal file contains the monthly transactions that are ultimately to be posted to the ledger file. Figure 7.8 shows what these journal transactions look like. Note that the entries in the journal file are paired. This is because every check involves both subtracting an amount from the checking account balance and adding an amount to at least one expense account. The accounting program package needs procedures for creating this journal file interactively, probably outputting records to the file as checks are keyed in and then printed.

Once the journal file is complete for a given month, which means that it contains all of the transactions for that month, the journal must be posted to the ledger. Posting involves associating each transaction with its account in the ledger. For example, the printed output produced for accounts 101, 102, 505, and 510 during the posting operation, given the journal entries in Fig. 7.8, might look like the output illustrated in Fig. 7.9.

How is the posting process implemented? Clearly, it uses the account number as a key to relate the journal transactions to the ledger records. One possible solution involves building an index for the ledger, so we can work through the journal transactions, using the account number in each journal entry to look up the correct ledger record. But this solution involves seeking back and forth across the ledger file as we work through the journal. Moreover, this solution does not really address the issue of creating the output list, in which all the journal entries relating to an account are collected together. Before we could print out the ledger balances and collect the journal entries for even the first account, 101, we would have to proceed all the way through the journal list.
FIGURE 7.9 Sample ledger printout showing the effect of posting from the journal.

101  Checking Account #1
         1271  04/02/86  Auto expense                    - 78.70
         1272  04/02/86  Rent                            - 500.00
         1273  04/04/86  Advertising                     - 87.50
         1274  04/09/86  Auto expense                    - 12.50
                         Prev. bal: 5219.23    New bal: 4540.53

102  Checking account #2
         670   04/07/86  Office expense                  - 32.78
                         Prev. bal: 1321.20    New bal: 1288.42

505  Advertising expense
         1273  04/04/86  Newspaper ad re: new product      87.50
                         Prev. bal: 25.00      New bal: 112.50

510  Auto expenses
         1271  04/02/86  Tune up and minor repair          78.70
         1274  04/09/86  Oil change                        12.50
                         Prev. bal: 501.12     New bal: 592.32
Where would we save the transactions for account 101 as we collect them during this complete pass through the journal?

A much better solution is to begin by collecting all the journal transactions that relate to a given account. This involves sorting the journal transactions by account number, producing a list ordered by account number, as in Fig. 7.10.

FIGURE 7.10 List of journal transactions sorted by account number.
Acct. no.  Check no.  Date      Description                       Debit/credit
101        1271       04/02/86  Auto expense                        - 78.70
101        1272       04/02/86  Rent                                - 500.00
101        1273       04/04/86  Advertising                         - 87.50
101        1274       04/09/86  Auto expense                        - 12.50
102        670        04/07/86  Office expense                      - 32.78
505        1273       04/04/86  Newspaper ad re: new product          87.50
510        1271       04/02/86  Tune up and minor repair              78.70
510        1274       04/09/86  Oil change                            12.50
540        670        04/07/86  Printer ribbons (6)                   32.78
550        1272       04/02/86  Rent for April                        500.00
FIGURE 7.11 Conceptual view of cosequential matching of the ledger and journal files.

Ledger list                          Journal list
101  Checking account #1             101  1271  Auto expense
                                     101  1272  Rent
                                     101  1273  Advertising
                                     101  1274  Auto expense
102  Checking account #2             102  670   Office expense
505  Advertising expense             505  1273  Newspaper ad re: new product
510  Auto expenses                   510  1271  Tune up and minor repair
                                     510  1274  Oil change
Now we can create our output list by working through both the ledger and the sorted journal cosequentially, meaning that we process the two lists sequentially and in parallel. This concept is illustrated in Fig. 7.11. As we start working through the two lists, we note that we have an initial match on account number. We know that multiple entries are possible in the journal file, but not in the ledger, so we move ahead to the next entry in the journal. The account numbers still match. We continue doing this until the account numbers no longer match. We then resynchronize the cosequential action by moving ahead in the ledger list.

This matching process seems simple, as it in fact is, as long as every account in one file also appears in another. But there will be ledger accounts for which there is no journal entry, and there can be typographical errors that create journal account numbers that do not actually exist in the ledger. Such situations can make resynchronization more complicated and can result in erroneous output or infinite loops if the programming is done in an ad hoc way. By using the cosequential processing model, we can guard against these problems. Let us now apply the model to our ledger problem.
7.2.2 Application of the Model to the Ledger Program

The ledger program must perform two tasks:

- It needs to update the ledger file with the correct balance for each account for the current month.
- It must produce a printed version of the ledger that not only shows the beginning and current balance for each account, but also lists all the journal transactions for the month.

We focus on the second task since it is the most difficult. Let's look again at the form of the printed output, this time extending the output to include a few more accounts, as shown in Fig. 7.12.
FIGURE 7.12 Sample ledger printout for the first six accounts.

101  Checking account #1
         1271  04/02/86  Auto expense                    - 78.70
         1272  04/02/86  Rent                            - 500.00
         1273  04/04/86  Advertising                     - 87.50
         1274  04/09/86  Auto expense                    - 12.50
                         Prev. bal: 5219.23    New bal: 4540.53

102  Checking account #2
         670   04/07/86  Office expense                  - 32.78
                         Prev. bal: 1321.20    New bal: 1288.42

505  Advertising expense
         1273  04/04/86  Newspaper ad re: new product      87.50
                         Prev. bal: 25.00      New bal: 112.50

510  Auto expenses
         1271  04/02/86  Tune up and minor repair          78.70
         1274  04/09/86  Oil change                        12.50
                         Prev. bal: 501.12     New bal: 592.32

515  Bank charges
                         Prev. bal: 5.00       New bal: 5.00

520  Books and publications
                         Prev. bal: 87.40      New bal: 87.40
As you can see, the printed output from the ledger program shows the balances of all accounts, whether or not there were transactions for the account. From the point of view of the ledger accounts, the process is a merge, since even unmatched ledger accounts appear in the output.

What about unmatched journal accounts? The ledger accounts and journal accounts are not equal in authority. The ledger file defines the set of legal accounts; the journal file contains entries that are to be posted to the accounts listed in the ledger. The existence of a journal account that does not match a ledger account indicates an error. From the point of view of the journal accounts, the posting process is strictly one of matching. Our ledger procedure needs to implement a kind of combined merging/matching algorithm while simultaneously handling the chores of printing account title lines, individual transactions, and summary balances.

Another difference between the ledger posting operation and the straightforward matching and merging algorithms is that the ledger procedure must accept duplicate entries for account numbers in the journal while still treating a duplicate entry in the ledger as an error. Recall that our earlier matching and merging routines accept keys only in strict ascending order, rejecting all duplicates.

The inherent simplicity of the three-way test, single-loop model works in our favor as we make these modifications. First, let's look at the input functions that we use for the ledger and journal files, identifying the variables that we need for use in the main loop. Figure 7.13 presents pseudocode for the procedure that accepts input from the ledger. We have treated individual variables within the ledger record as return values to draw attention to these variables; in practice the procedure would probably return the entire ledger record to the calling routine so that other procedures could have access to things such as the account title as they print the ledger. We are overlooking such matters here, focusing instead on the variables that are involved in the cosequential logic.
FIGURE 7.13 Input routine for ledger file.

PROCEDURE: ledger_input()

    input arguments:
        L_FILE              file descriptor for ledger file
        J_ACCT              current value of journal account number

    arguments used to return values:
        L_ACCT              account number of new ledger record
        L_BAL               balance for this ledger record
        MORE_RECORDS_EXIST  flag used by main loop to halt processing

    static, local variable that retains its value between calls:
        PREV_L_ACCT         last acct number read from ledger file

    read next record from L_FILE, assigning values to L_ACCT and L_BAL

    if (EOF and J_ACCT == HIGH_VALUE)      /* end of both files       */
        MORE_RECORDS_EXIST := FALSE
    else if (EOF)                          /* just ledger is done     */
        L_ACCT := HIGH_VALUE
    else if (L_ACCT <= PREV_L_ACCT)        /* sequence check          */
        issue sequence check error         /* (permit no duplicates)  */
        abort processing
    endif

    PREV_L_ACCT := L_ACCT

end PROCEDURE
Note that since this function is strictly for use with ledger entries, we can keep track of the previous ledger account number locally within the procedure rather than pass this value in as an argument.

Figure 7.14 outlines the logic for the procedure used to accept input from the journal file. It is similar to the ledger_input() procedure in most respects, including that it returns values for individual variables, even though a working implementation would probably return the entire journal record. Note, however, that the sequence-checking logic is different in journal_input(). In this procedure we need to accept records that have the same account number as previous records.

FIGURE 7.14 Input routine for journal file.
PROCEDURE: journal_input()

    input arguments:
        J_FILE              file descriptor for journal file
        L_ACCT              current value of ledger account number

    arguments used to return values:
        J_ACCT              account number of new journal record
        TRANS_AMT           amount of this journal transaction
        MORE_RECORDS_EXIST  flag used by main loop to halt processing

    static, local variable that retains its value between calls:
        PREV_J_ACCT         last acct number read from journal file

    read next record from J_FILE, assigning values to J_ACCT and TRANS_AMT

    if (EOF and L_ACCT == HIGH_VALUE)      /* end of both files       */
        MORE_RECORDS_EXIST := FALSE
    else if (EOF)                          /* just journal is done    */
        J_ACCT := HIGH_VALUE
    else if (J_ACCT < PREV_J_ACCT)         /* sequence check          */
        issue sequence check error         /* (permit duplicates)     */
        abort processing
    endif

    PREV_J_ACCT := J_ACCT

end PROCEDURE
PROGRAM: ledger

    call initialize() procedure to:
        - open input files L_FILE and J_FILE
        - set MORE_RECORDS_EXIST to TRUE

    call ledger_input()
    call journal_input()

    PREV_L_BAL := L_BAL                    /* set starting ledger balance
                                              for this first ledger account */

    while (MORE_RECORDS_EXIST)
        if (L_ACCT < J_ACCT)               /* we have read all the journal
                                              entries for this account      */
            print PREV_L_BAL and L_BAL
            call ledger_input()
            if (L_ACCT < HIGH_VALUE)
                print account number and title for new ledger account
                PREV_L_BAL := L_BAL
            endif
        else if (L_ACCT > J_ACCT)          /* bad journal account number    */
            print error message
            call journal_input()
        else                               /* match: add journal transaction
                                              amount to ledger balance for
                                              this account                  */
            L_BAL := L_BAL + TRANS_AMT
            output the transaction to the printed ledger
            call journal_input()
        endif
    endwhile

end PROGRAM

FIGURE 7.15 Cosequential procedure to process ledger and journal files to produce printed ledger output.
Given these input procedures, we can handle our cosequential processing and output as illustrated in Fig. 7.15. The reasoning behind the three-way test is as follows:

1. If the ledger account is less than the journal account, then there are no more transactions to add to this ledger account (perhaps there were none at all), so we print out the ledger account balances and read in the next ledger account. If the account exists (value < HIGH_VALUE), we print the title line for the new account and update the PREV_BAL variable.

2. If the journal account is less than the ledger account, then it is an unmatched journal account, perhaps due to an input error. We print an error message and continue.

3. If the account numbers match, then we have a journal transaction that is to be posted to the current ledger account. We add the transaction amount to the account balance, print the description of the transaction, and then read in the next journal entry. Note that unlike the match case in either the matching or merging algorithms, we do not read in a new entry from both accounts. This is a reflection of our acceptance of more than one journal entry for a single ledger account.
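The same reasoning can be expressed in a compact C sketch. Here the ledger and journal "files" are small in-memory arrays terminated by an assumed HIGH_VALUE account number, so the example runs on its own; report formatting and the balance-update task are reduced to a few printf calls, and the data values are illustrative only.

    #include <stdio.h>

    #define HIGH_VALUE 999999              /* assumed sentinel account number */

    /* Simplified in-memory stand-ins for the ledger and journal files.       */
    struct ledger  { int acct; const char *title; double bal; };
    struct journal { int acct; const char *desc;  double amt; };

    static struct ledger  ledger[]  = {
        {101, "Checking account #1", 5219.23}, {102, "Checking account #2", 1321.20},
        {505, "Advertising expense",   25.00}, {510, "Auto expenses",        501.12},
        {HIGH_VALUE, "", 0.0}                   /* end marker                 */
    };
    static struct journal journal[] = {
        {101, "Auto expense",  -78.70}, {101, "Rent",        -500.00},
        {505, "Newspaper ad",   87.50}, {510, "Oil change",    12.50},
        {777, "Typo account",    9.99},          /* unmatched journal account */
        {HIGH_VALUE, "", 0.0}
    };

    int main(void)
    {
        int li = 0, ji = 0;
        double prev_bal = ledger[0].bal, bal = ledger[0].bal;

        printf("%d %s\n", ledger[0].acct, ledger[0].title);
        while (ledger[li].acct != HIGH_VALUE || journal[ji].acct != HIGH_VALUE) {
            if (ledger[li].acct < journal[ji].acct) {   /* done with this account */
                printf("    Prev. bal: %.2f   New bal: %.2f\n", prev_bal, bal);
                li++;
                if (ledger[li].acct < HIGH_VALUE) {
                    printf("%d %s\n", ledger[li].acct, ledger[li].title);
                    prev_bal = bal = ledger[li].bal;
                }
            } else if (ledger[li].acct > journal[ji].acct) { /* bad journal account */
                printf("    error: no ledger account %d\n", journal[ji].acct);
                ji++;
            } else {                                     /* post the transaction   */
                bal += journal[ji].amt;
                printf("    %-20s %10.2f\n", journal[ji].desc, journal[ji].amt);
                ji++;
            }
        }
        return 0;
    }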
The development of this ledger posting procedure from our basic cosequential processing model illustrates how the simplicity of the model contributes to its adaptability. We can also generalize the model in an entirely different direction, extending it to enable cosequential processing of more than two input files at once. To illustrate this, we now extend the model to include multiway merging.

7.3 Extension of the Model to Include Multiway Merging
The most common application of cosequential processes requiring more than two input files is a K-way merge, in which we want to merge K input lists to create a single, sequentially ordered output list. K is often referred to as the order of the K-way merge.

7.3.1 A K-way Merge Algorithm

Recall the synchronizing loop we use to handle the two-way merge of two lists of names (Fig. 7.5). This merging operation can be viewed as a process of deciding which of two input names has the minimum value, outputting that name, and then moving ahead in the list from which that name is taken. In the event of duplicate input entries, we move ahead in each list.

Given a min() function that returns the name with the lowest collating sequence value, there is no reason to restrict the number of input names to two. The procedure could be extended to handle three (or more) input lists as shown in Fig. 7.16.
Clearly, the expensive part of this procedure is the series of tests to see in which lists the name occurs and which files therefore need to be read.

    while (MORE_NAMES_EXIST)
        OUT_NAME = min(NAME_1, NAME_2, NAME_3, ... NAME_K)
        write OUT_NAME to OUT_FILE
        if (NAME_1 == OUT_NAME)
            call input() to get NAME_1 from LIST_1
        if (NAME_2 == OUT_NAME)
            call input() to get NAME_2 from LIST_2
        if (NAME_3 == OUT_NAME)
            call input() to get NAME_3 from LIST_3
        ...
        if (NAME_K == OUT_NAME)
            call input() to get NAME_K from LIST_K
    endwhile

FIGURE 7.16 K-way merge loop, accounting for duplicate names.
Note that since the name can occur in several lists, every one of these if tests must be executed on every cycle through the loop. However, it is often possible to guarantee that a single name, or key, occurs in only one list. In this case, the procedure becomes much simpler and more efficient. Suppose we reference our lists through a vector of list names

    list[1], list[2], list[3], ... list[K]

and suppose we reference the names (or keys) that are being used from these lists at any given point in the cosequential process through another vector:

    name[1], name[2], name[3], ... name[K]

Then the procedure shown in Fig. 7.17 can be used, assuming once again that the input() procedure attends to the MORE_NAMES_EXIST flag.

This procedure clearly differs in many ways from our initial three-way test, single-loop procedure that merges two lists. But, even so, the single-loop parentage is still evident: There is no looping within a list. We determine which list has the key with the lowest value, output that key, move ahead one key in that list, and loop again. The procedure is as simple as it is powerful.
7.3.2 A Selection Tree for Merging Large Numbers of Lists

The K-way merge described in Fig. 7.17 works nicely if K is no larger than 8 or so. When we begin merging a larger number of lists, the set of sequential comparisons to find the key with the minimum value becomes noticeably expensive. We see later that for practical reasons it is rare to want to merge more than eight files at one time, so the use of sequential comparisons is normally a good strategy. If there is a need to merge considerably more than eight lists, we could replace the loop of comparisons with a selection tree.

Use of a selection tree is an example of the classic time versus space trade-off that we so often encounter in computer science. We reduce the time required to find the key with the lowest value by using a data structure to save information about the relative key values across cycles of the procedure's main loop. The concept underlying a selection tree can be readily communicated through a diagram such as that in Fig. 7.18. Here we have used lists in which the keys are numbers rather than names.
FIGURE 7.17 K-way merge loop, assuming no duplicate names.

    /* initialize the process by reading in a name from each list */
    for i := 1 to K
        call input() to get name[i] from list[i]
    next

    /* now start the K-way merge */
    while (MORE_NAMES_EXIST)

        /* find subscript of the name that has the lowest collating
           sequence value among the names available on the K lists */
        LOWEST := 1
        for i := 2 to K
            if (name[i] < name[LOWEST])
                LOWEST := i
        next

        write name[LOWEST] to OUT_FILE

        /* now replace the name that was written out */
        call input() to get name[LOWEST] from list[LOWEST]
    endwhile
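A C version of the loop in Fig. 7.17 is short enough to give in full. In this sketch the K lists are text files named on the command line, one key per line, and HIGH_VALUE is an assumed sentinel string; like the figure, it assumes no key appears in more than one list.

    #include <stdio.h>
    #include <string.h>

    #define MAXK       8
    #define NAMELEN    64
    #define HIGH_VALUE "~~~~"          /* assumed sentinel: sorts after any key */

    /* Fetch the next key from a list, substituting HIGH_VALUE at end of file. */
    static void input(FILE *fp, char *name)
    {
        if (fscanf(fp, "%63s", name) != 1)
            strcpy(name, HIGH_VALUE);
    }

    int main(int argc, char *argv[])
    {
        FILE *list[MAXK];
        char  name[MAXK][NAMELEN];
        int   K = argc - 1, i, lowest;

        if (K < 1 || K > MAXK)
            return 1;
        for (i = 0; i < K; i++) {                /* initialize: one key per list */
            if ((list[i] = fopen(argv[i + 1], "r")) == NULL)
                return 1;
            input(list[i], name[i]);
        }

        for (;;) {
            lowest = 0;                          /* linear scan for the minimum  */
            for (i = 1; i < K; i++)
                if (strcmp(name[i], name[lowest]) < 0)
                    lowest = i;
            if (strcmp(name[lowest], HIGH_VALUE) == 0)
                break;                           /* every list is exhausted      */
            printf("%s\n", name[lowest]);
            input(list[lowest], name[lowest]);   /* replace the key written out  */
        }
        for (i = 0; i < K; i++)
            fclose(list[i]);
        return 0;
    }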
FIGURE 7.18 Use of a selection tree to assist in the selection of a key with minimum value in a K-way merge. (Eight input lists feed the tournament; their current keys are 7, 10, 17; 9, 19, 23; 11, 13, 32; 18, 22, 24; 12, 14, 21; 5, 6, 25; 15, 20, 30; and 8, 16, 29, and the smallest key, 5, rises to the root.)
The selection tree is a kind of tournament tree in which each higher-level node represents the "winner" (in this case the minimum key value) of the comparison between the two descendent keys. The minimum value is at the root node of the tree. If each key has an associated reference to the list from which it came, it is a simple matter to take the key at the root, read the next element from the associated list, and then run the tournament again. Since the tournament tree is a binary tree, its depth is always ⌈log2 K⌉ for a merge of K lists. The number of comparisons required to establish a new tournament winner is, of course, related to this depth, being a linear function of ⌈log2 K⌉ rather than of K.
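The selection tree itself is usually drawn as a tournament; an equivalent way to obtain the ⌈log2 K⌉ cost in code is to keep the current keys in a heap-ordered array with the minimum at the root, as in the hedged C sketch below. The key values are those of Fig. 7.18, but the list numbering is simply the order of the array and is not meant to reproduce the figure's labels.

    #include <stdio.h>

    /* One entry per input list: the list's current key plus its list number. */
    struct entry { int key; int list; };

    /* Sift a newly added entry up from position i in a 1-based min-heap.     */
    static void sift_up(struct entry heap[], int i)
    {
        while (i > 1 && heap[i].key < heap[i / 2].key) {
            struct entry tmp = heap[i]; heap[i] = heap[i / 2]; heap[i / 2] = tmp;
            i /= 2;
        }
    }

    /* Restore heap order downward from the root after a replacement.         */
    static void sift_down(struct entry heap[], int n)
    {
        int i = 1;
        for (;;) {
            int child = 2 * i;
            if (child > n) break;
            if (child + 1 <= n && heap[child + 1].key < heap[child].key)
                child++;
            if (heap[i].key <= heap[child].key) break;
            { struct entry tmp = heap[i]; heap[i] = heap[child]; heap[child] = tmp; }
            i = child;
        }
    }

    int main(void)
    {
        /* Current (first) keys of the eight lists in Fig. 7.18.              */
        int first[8] = { 7, 9, 11, 18, 12, 5, 15, 8 };
        struct entry heap[9];
        int i, n = 8;

        for (i = 1; i <= n; i++) {            /* build the heap               */
            heap[i].key  = first[i - 1];
            heap[i].list = i - 1;
            sift_up(heap, i);
        }
        printf("minimum key %d comes from list %d\n", heap[1].key, heap[1].list);

        heap[1].key = 6;                      /* next key from the winning list */
        sift_down(heap, n);
        printf("after replacement the minimum is %d from list %d\n",
               heap[1].key, heap[1].list);
        return 0;
    }

Replacing the root and re-heapifying costs a number of comparisons proportional to the depth of the structure, about log2 K, instead of the K - 1 comparisons of the sequential scan in Fig. 7.17.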
7.4 A Second Look at Sorting in RAM

In Chapter 5 we considered the problem of sorting a disk file that is small enough to fit into RAM. The operation we described involves three separate steps:

1. Read the entire file from disk into RAM.
2. Sort the records using a standard sorting procedure, such as Shellsort.
3. Write the file back to disk.

The total time taken for sorting the file is the sum of the times for the three steps. We see that this procedure is much faster than sorting the file in place, on the disk, because both reading and writing are sequential.
280
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
Can we improve on the time that it takes for this RAM sort? If we
assume that we are reading and writing the file as efficiently as possible, and
we have chosen the best internal sorting routine available, it would seem
not. Fortunately, there is one way that we might speed up an algorithm that
has several parts, and that is to perform some of those parts in parallel.
Of the three operations involved in sorting a file that is small enough to
fit into RAM, is there any way to perform some of them in parallel? If we
have only one disk drive, clearly we cannot overlap the reading and writing
how about doing either
that we sort the file?
operations, but
the
same time
7.4.1 Overlapping Processing and
Most of the time when we use an
the
whole
file in
memory
the reading or writing (or both) at
I/O:
Heapsort
internal sort
before
we
can
we have to
wait until
start sorting. Is there
we have
an internal
and that can begin sorting numbers
immediately as they are read in, rather than waiting for the whole file to be
in memory? In fact, there is, and we have already seen part of it in this
chapter. It is called heapsort, and it is loosely based on the same principle as
sorting algorithm that
is
reasonably
fast
the selection tree.
Recall that the selection tree compares keys as
time
key
a
it
new key
arrives,
it is
compared
goes to the front of the
because
it
means
that
we
tree.
is,
This
is
encounters them. Each
file is
and
if
it is
the largest
very useful for our purposes
can begin sorting keys
rather than waiting until the entire
That
it
to the others,
as
they arrive in
loaded before
we
RAM,
start sorting.
sorting can occur in parallel with reading.
Unfortunately, in the case of the selection tree, each time a new largest
key is found it is output to the file. We cannot allow this to happen if we
want to sort the whole file because we cannot begin outputting records until
we know which one comes first, second, etc., and we won't know this until
we have seen all of the keys.
Heapsort solves this problem by keeping all of the keys in a structure
called a heap. A heap is a binary tree with these properties:
Each node has a single key, and that key is less than or equal to the
key at its parent node.
i\2j It is a complete binary tree, which means that all of its leaves are on
no more than two levels, and all of the keys on the lowest level are
'*~\.)
^_
in the leftmost position.
Because of properties 1 and 2, storage for the tree can be allocated
sequentially as an array in such a way that the indices of the left and
right children of node i are 2i and 2i + 1, respectively. Conversely,
the index of the parent of node j is |_j/2j.
A SECOND LOOK AT SORTING IN
23456789
IT
/\
/\
281
RAM
P
ew^>
Q-
/\
FIGURE 7.19 A heap
in
both
Figure
its
tree form
and as
it
would be stored
in
an array.
19 shows a heap in both its tree form and as it would be stored
Note that this is only one of many possible heaps for the given
7.
in an array.
of keys. In practice, each key has an associated record that is either stored
with the key or pointed to by a pointer stored with the key.
Property 3 is very useful for our purposes, because it means that a heap
is just an array of keys, where the positions of the keys in the array are
sufficient to impose an ordering on the entire set of keys. There is no need
for pointers or other dynamic data structuring overhead to create and
maintain the heap. (As we pointed out earlier, there may be pointers
associating each key with its corresponding record, but this has nothing to
do with maintaining the heap itself.)
set
in the array
7.4.2 Building the Heap while Reading
The algorithm
we
for heapsort has
two
parts. First
output the keys in sorted order. The
same time
that
we
The
shown
7.20.
Fig.
first
the File
we
build the heap, and then
stage can occur at virtually the
read in the data, so in terms of computer time
essentially free.
in
in
it
comes
basic steps in the algorithm for building the heap are
Figure 7.21 contains
sample application of
this
algorithm.
This describes
how we
build the heap, but
it
doesn't
the input overlap with the heap-building procedure.
To
tell
how
to
FIGURE 7.20 Procedure for building a heap.
For
:=
to
REC0RD_C0UNT
Read in the next record and append it to the end of the
array; call it5 key K
While K 15 less than the key of its parent:
Exchange the record with key K with its parent
next
make
solve that problem,
282
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
FDCGHIBEA
New key to
Heap, after insertion
be inserted
of the
Selected heaps
new key
in tree
form
12345678'
F
12
D F
12
C F D
^\
123456789
C F D G
123456789
C F D G H
123456789
C F D G H
123456789
B F C G H
123456789
BECFHIDG
123456789.
ABCEHIDGF
g'Nf
HI
FIGURE 7.21 Sample application of the heap-building algorithm. The keys
G, H, I, B, E, and A are added to the heap in the order shown.
we need
to look at
not going to do
a
block of records
F,
D, C,
how we perform the read operation. For starters, we are
we want a new record. Instead, we read
seek every time
at a
time into an input buffer, and then operate on
all
of
RAM
the input buffer for each new block of keys can be part of the RAM
the records in the block before going on to the next block. In terms of
storage,
is set up for the heap itself. Each time we read in a new block, we
append it to the end of the heap (i.e., the input buffer "moves" as the
heap gets larger). The first new record is then at the end of the heap array,
as required by the algorithm (Fig. 7.20). Once that record is absorbed into
the heap, the next new record is at the end of the heap array, ready to be
absorbed into the heap, and so forth.
Use of an input buffer avoids doing an excessive number of seeks, but
it still doesn't let input occur at the same time that we build the heap. We
area that
just
A SECOND LOOK AT SORTING
IN
283
RAM
saw in Chapter 3 that the way to make processing overlap with I/O is to use
more than one buffer. With multiple buffering, as we process the keys in
one block from the file, we can simultaneously be reading in later blocks
from the file. If we use multiple buffers, how many should we use, and
where should we put them? We already answered these questions when we
decided to put each
new
a new
new
block
at the
block, the array gets bigger
by
end of the array. Each time
we add
the size of that block, in effect creating
file. So the number of buffers is the
and they are located in sequence in the array
input buffer for each block in the
number of blocks
in the
file,
itself.
Figure 7.22 illustrates the technique that
we append
employing
on
we
we have just
described,
where
block of records to the end of the heap, thereby
RAM-sized
set
of input buffers.
Now we
read in
new
blocks
having to wait for processing before reading in a
block. On the other hand, processing (heap building) cannot occur
given block until the block to be processed is read in, so there may
as fast as
new
new
each
can, never
be some delay in processing
if
processing speeds are faster than reading
speeds.
7.4.3 Sorting while Writing out to the
The second and
Again,
it is
First, let's
final step
File
involves writing out the heap in sorted order.
possible to overlap I/O (in this case writing) with processing.
look
at the
Again, there
is
algorithm for outputting the sorted keys
nothing inherent in
this
algorithm that
(Fig. 7.23).
lets it
overlap
with I/O, but we can take advantage of certain features of the algorithm to
make overlapping happen. First, we see that we know immediately which
record will be written first in the sorted file; next, we know what will come
second; and so forth. So as soon as we have identified a block of records, we
can write out that block, and while we are writing out that block we can be
identifying the next block, and so forth.
Furthermore, each time we identify a block to write out, we make the
heap smaller by exactly the size of a block, freeing that space for a new
output buffer. So just as was the case when building the heap, we can have
as many output buffers as there are blocks in the file. Again, a little
coordination is required between processing and output, but the conditions
exist for the two to overlap almost completely.
A final point worth making about this algorithm is that all I/O that it
performs is essentially sequential. All records are read in in the order in
which they occur in the file to be sorted, and all records are written out in
sorted order. The technique could work equally well if the file were kept on
tape or disk.
More
can be done with
importantly, since
all
minimum amount
I/O
is
sequential,
of seeking.
we know
that
it
284
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
Total
RAM area allocated for heap
First input buffer. First part
SZ
added
to the heap, then the
of heap is built here. The first record
second record is added, and so forth.
is
Second input buffer. This buffer is being
filled while heap is being built in first buffer.
Second part of heap is built here. The first record
added to the heap, then the second record, etc.
is
Third input buffer. This buffer is filled
is being built in second buffer.
while heap
Third part of heap
is
built here.
1
Fourth input buffer
heap
is
is
filled while
being built in third buffer.
FIGURE 7.22 Illustration of the technique described in the text for overlapping input with
heap building in RAM. First read in a block into the first part of RAM. The first record is the
first record in the heap. Then extend the heap to include the second record, and incorporate
that record into the heap, and so forth. While the first block is being processed, read in the
second block. When the first block is a heap, extend it to include the first record in the second block, incorporate that record into the heap, and go on to the next record. Continue until
all blocks are read in and the heap is completed.
FIGURE 7.23 Procedure
For
:=
to
for outputting the
contents of a heap
in
sorted order.
REC0RD_C0UNT
Output the record in the first position in the array (this
record has the smallest key).
Move the key in the last position in the array (call it K)
to the first position, and define the heap as having one
fewer member than it previously had.
While K is larger than both keys of its children:
Exchange K with the smaller of its two children's keys
next
MERGING AS A WAY OF SORTING LARGE
7.5
Way
Merging as a
In
Chapter 5
we
FILES
of Sorting Large Files
ran into problems
too large to be wholly contained in
when we needed
RAM. The
285
ON DISK
on Disk
to sort files that
chapter offered
were
a partial,
but
ultimately unsatisfactory, solution to this problem in the form of a key sort,
in
which we needed
to hold only the keys in
RAM,
along with pointers to
each key's corresponding record. Keysort had two shortcomings:
Once the keys were sorted, we then had to bear the substantial cost
of seeking to each record in sorted order, reading each record in and
then writing it out into the new, sorted file.
With keysorting, the size of the file that can be sorted is limited by
the number of key/pointer pairs that can be contained in RAM.
Consequently, we still cannot sort really large files.
RAM
As an example of the kind of file we cannot
sort with either a
sort
have a file with 800,000 records, each of which is
100 bytes long and contains a key field that is 10 bytes long. The total length
of this file is about 80 megabytes. Let us further suppose that we have one
megabyte of
available as a work area, not counting
used to
hold the program, operating system, I/O buffers, and so forth. Clearly, we
cannot sort the whole file in RAM. We cannot even sort all the keys in
or a keysort, suppose
we
RAM
RAM
RAM.
The multiway merge algorithm
discussed in section 7.3 provides the
beginning of an attractive solution to the problem of sorting large files such
as this one. Since
sorting algorithms such as heapsort can work in
RAM
place, using only a small
some temporary
amount of overhead
variables,
reading records into
we
can create
for maintaining pointers
a sorted subset
RAM until the RAM work area
work
area,
disk as a sorted subfile.
We
is
of our
almost
and
full file
full,
by
sorting
and then writing the sorted records back to
such a sorted subfile a run. Given the
memory constraints and record size in our example, a run could contain
approximately
the records in this
call
1,000,000 bytes of
RAM
-7
-r
100 bytes per record
Once we
again filling
example,
we
10,000 records.
we then read in a new set of records, once
and create another run of 10,000 records. In our
repeat this process until we have created 80 runs, with each run
create the first run,
RAM,
containing 10,000 sorted records.
Once we have the 80 runs in 80 separate files on disk, we can perform
an 80-way merge of these runs, using the multiway merge logic outlined in
286
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
800,000 unsorted records
iT
oo
<X~wOO
80 internal sorts
^^
80 runs, each containing 10,000 sorted records
iic^-i
800,000 records in sorted order
FIGURE 7.24 Sorting through the creation
merging of runs.
of runs (sorted subfiles)
section 7.3, to create a completely sorted
records.
schematic view of
this
file
and subsequent
containing
all
provided in Fig. 7.24.
This solution to our sorting problem has the following
It
can, in fact, sort large
files,
the original
run creation and merging process
and can be extended to
is
features:
files
of any
size.
Reading of the input file during the run creation step is sequential,
and hence is much faster than input that requires seeking for every
record individually
(as in a keysort).
Reading through each run during merging and writing out the sorted
records
is
also sequential.
Random
accesses are required only as
switch from run to run during the merge operation.
we
MERGING AS A WAY OF SORTING LARGE
If a
heapsort
RAM
used for the
is
in section 7.4,
we
in-RAM
FILES
287
ON DISK
part of the merge, as described
can overlap these operations with I/O, so the in-
part does not add appreciably to the total time for the merge.
Since I/O
largely sequential, tapes can be used if necessary for both
is
input and output operations.
How Much Time Does
7.5.1
Merge Sort Take?
This general approach to the problem of sorting large
To compare
takes.
long
We do this
it
files
approach to others, we now look at
by taking our 800,000-record example
this
do
takes to
merge
sort
looks promising.
how much
file
time
and seeing
it
how
on the hypothetical disk drive whose
specifications are listed in Table 3.2. (Please note that our intention here
mean anything
is
any environment other
Nor do we want to
overwhelm you with numbers or provide you with magic formulas for
determining how long a particular sort on a real system will really take.
Rather, our goal in this section is to derive some benchmarks that we can
use to compare several variations on the basic merge sort approach to
not to derive time estimates that
than the hypothetical environment
sorting external
We
the
we have
in
posited.
files.)
can simplify matters by making the following assumptions about
computing environment:
Entire
files
are always stored in contiguous areas
and
a single
seek
is
We
(extents),
cylinder-to-cylinder seek takes no time. Hence, only one
required for any single sequential access.
Extents that span
such
on disk
way
more than one
track are physically staggered in
that only one rotational delay
is
required per access.
see in Fig. 7.24 that there are four times
During the sort phase:
Reading all records
into
when I/O
RAM for sorting
is
performed:
and forming runs; and
Writing sorted runs out to disk.
During the merge phase:
Reading sorted runs into
Writing sorted
Let's look at each
of these
1:
Since
we
sort the
time from the
file
file.
in
In
RAM
for merging;
and
out to disk.
in order.
RAM
for Sorting and Forming Runs
one-megabyte chunks, we read in one megabyte at
a sense, RAM is a one-megabyte input buffer that
Reading Records into
Step
a
file
288
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
we
fill
up 80 times
to
form the 80
runs. In
computing the
total
time to input
we need to include the amount of time it takes to access each block
(seek time + rotational delay), plus the amount of time it takes to transfer
each block. We keep these two times separate because, as we see later in our
each run,
calculations, the role that each plays can vary significantly depending on the
approach used.
From Table 3.2 we see that seek and rotational delay times are 18 msec"
and 8.3 msec, respectively, so total time per seek is 26.3 msec* The
transmission rate is approximately 1,229 bytes per msec. Total input time
for the sort phase consists of the time required for 80 seeks, plus the time
required to transfer 80 megabytes:
"
80 seeks X 26.3 msec
Access:
80 megabytes
Transfer:
1,229 bytes/msec
=
=
67 seconds.
Total:
Step
2:
2 seconds
65 seconds
Writing Sorted Runs out to Disk In this case, writing is just the
the same number of seeks and the same amount of data
reverse of reading
to transfer.
So
it
takes another 67 seconds to write out the 80 sorted runs.
RAM
for Merging Since we have
Step 3: Reading Sorted Runs into
for storing runs, we divide one megabyte into 80
one megabyte of
parts for buffering the 80 runs. In a sense, we are reallocating our one
megabyte of
as 80 input buffers. Each of the 80 buffers then holds
l/80th of a run (12,500 bytes), so we have to access each run 80 times to read
all of it. Since there are 80 runs, to complete the merge operation (Fig. 7.25)
we end up making
RAM
RAM
80 runs x 80 seeks
Total seek and rotation time
80 megabytes
is still
is
6,400 seeks.
then 6,400 X 26.3 msec
transferred, transfer time
is still
=168
seconds. Since
65 seconds.
computing environment has many active users pulling the read/write head to
other parts of the disk, seek time is actually likely to be less than the average, since many
of the blocks that make up the file are probably going to be physically adjacent to one another on the disk. Many will be on the same cylinder, requiring no seeks at all. However,
for simplicity we assume the average seek time.
""Unless the
*For simplicity, we use the term seek even though we really mean seek and rotational delay.
Hence, the time we give for a seek is the time that it takes to perform an average seek followed by an average rotational delay.
MERGING AS A WAY OF SORTING LARGE
1st
ii
FILES
289
ON DISK
run = 80 buffers' worth (80 accesses)
mi mi
ii
ii ii
ii
ii
2nd run = 80 buffers' worth (80 accesses)
H Ml Ml
1
II
III
II
II
II
800,000
sorted records
80th run = 80 buffers' worth (80 accesses)
i
ii
ii
ii
FIGURE 7.25 Effect of buffering on the number of seeks required, where each run
large as the available work area in RAM.
Step
4:
Writing Sorted File out to Disk
writing out the
Unlike steps
buffer,
before
we
it is
are
file,
and
now
actually
2,
we need
to
know how
To compute
is
as
the time for
big our output buffers are.
RAM sorting space doubled as our I/O
RAM space for storing the data from the runs
where our big
using that
merged.
To keep
matters simple,
let
us assume that
can allocate two 20,000-byte output buffers." With 20,000-byte buffers,
1"
we
we
need to make
80,000,000 bytes
4,000 seeks.
20,000 bytes per seek
Total seek and rotation time
Transfer time
is still
is
then 4,000 X 26.3 msec
=105
seconds.
65 seconds.
The time estimates for the four steps are summarized in the first row in
7.1. The total time for this merge sort is 537 seconds, or 8 minutes,
57 seconds. The sort phase takes 134 seconds, and the merge phase takes 403
Table
seconds.
To gain an appreciation of the improvement that this merge sort
approach provides us, we need only look at how long it would take us to
do one part of a nonmerging method like the keysort method described in
We
is
use
two
buffers to allow double buffering;
approximately the
size
of
a track
we
use 20,000 bytes per buffer because that
on our hypothetical disk
drive.
290
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
TABLE
7.1
Time estimates
for
merge
phase (steps 1 and 2)
phase is 403 seconds.
is
80-megabyte
file, assuming use of
Table 3.2. The total time for the sort
134 seconds, and the total time for the merge
sort of
hypothetical disk drive described
in
Number
Amount
Seek + Rotation
Transfer
of
Transferred
(Megabytes)
Time
Time
(Seconds)
(Seconds)
Seeks
Total Time
(Seconds)
Sort: reading
80
80
65
Sort: writing
80
80
65
67
6,400
80
168
65
233
Merge: reading
Merge: writing
Totals
67
4,000
80
105
65
170
10,560
320
277
260
537
Chapter
5.
The
last
part of the keysort algorithm (Fig. 5.17) consists of this
for loop:
/*
/*
for
read in records according to sorted order, and write them
in this order
out
i
:=
to
*/
*/
REC_CDUNT
seek in IN_FILE to record with RRN of KEYN0DES
N_F LE
read the record into BUFFER from
I
RRN
write BUFFER contents to DUT_FILE
This for loop requires us to do
a separate seek for
every record in the
file.
That is 800,000 seeks. At 26.3 msec per seek, the total time required to
perform that one operation works out to 21,040 seconds, or 5 hours, 50
minutes, 40 seconds!
Clearly, for large files the merge sort approach in general is the best
option of any that we have seen. Does this mean that we have found the best
technique for sorting large files? If sorting is a relatively rare event and files
are not too large, the particular approach to merge sorting that we have just
looked at produces acceptable results. Let's see how those results stand up
as we change some of the parameters of our sorting example.
7.5.2 Sorting a
The
first
File
That
Is
Ten Times Larger
question that comes to
applicability of a
mind when we ask about the general
is, What happens when we make the
computing technique
problem bigger? In this instance, we need
up as we scale up the size of the file.
to ask
how
this
approach stands
MERGING AS A WAY OF SORTING LARGE
Before
we
look
at
how
bigger
file affects
291
ON DISK
FILES
the performance of our
merge sort, it will help to examine the kinds of I/O that are being done in
the two different phases, the sort phase and the merge phase. We will see
that for the purposes of finding ways to improve on our original approach,
we need pay attention only to one of the two phases.
A major difference between the sort phase and the merge phase is in the
amount of sequential (vs. random) access that each performs. By using
we
all I/O is,
minimal seeking, we
cannot algorithmically speed up I/O during the sort phase. No matter what
we do with the records in the file, we have to read them and write them all
at least once. Since we cannot improve on this phase by changing the way
we do the sort or merge, we ignore the sort phase in the analysis that
heapsort to create runs during the sort phase,
guarantee that
in a sense, sequential.^ Since sequential access implies
follows.
the
The merge phase is a different matter. In particular, the reading step of
merge phase is different. Since there is a RAM buffer for each run, and
these buffers get loaded and reloaded at unpredictable times, the read step of
the merge phase is to a large extent one in which random accesses are the
norm. Furthermore, the number and size of the RAM buffers that we read
the run data into determine the number of times we have to do random
accesses. If we can somehow reconfigure these buffers in ways that reduce
the number of random accesses, we can speed up I/O correspondingly. So,
if we are going to look for ways to improve performance in a merge sort
algorithm, our best hope is to look for ways to cut down on the number of random
accesses that occur while reading runs during the
What about
the write step of the
not influenced by differences in the
phase, this step
is
Improvements
in the
On
merge phase.
merge phase? Like
the other hand,
when we measure
the steps of the sort
way we
organize runs.
way we organize the merge sort do not affect this step.
we will see later that it is helpful to include this phase
the results of changes in the organization of the
merge
sort.
To sum up, since the merge phase is the only one in which we can
improve performance by improving the method, we concentrate on it from
now on. Now let's
get back to the question that
What happens when we make
the
we started
problem bigger? How,
time for the merge phase affected if our
file is
this section with:
for instance,
is
800,000?
"'"It
is
not sequential in the sense that in a multiuser
environment there
will be other users
pulling the read/write head to other parts of the disk between reads and writes, possibly
forcing the disk to do
seek each time
it
the
8,000,000 records rather than
reads or writes a block.
r^
292
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
TABLE 7.2
Time estimates
merge
phase
800-megabyte
file, assuming use of
Table 3.2. The total time for the merge
19,186 seconds, or 5 hours, 19 minutes, 22 seconds.
for
sort of
hypothetical disk drive described
is
in
Number
Amount
Seek + Rotation
Transfer
of
Transferred
(Megabytes)
Time
Time
(Seconds)
(Seconds)
Seeks
Total Time
(Seconds)
Merge: Reading
640,000
800
16,832
651
Merge: Writing
40,000
800
1,050
651
1,703
680,000
1,600
17,882
1,302
19,186
Totals
we increase
space, we
17,483
file by a factor of 10 without increasing the
need to create more runs. Instead of 80 initial
10,000-record runs, we now have 800 runs. This means we have to do an
800-way merge in our one megabyte of
space. This, in turn, means
that during the merge phase we must divide
into 800 buffers. Each of
the 800 buffers holds 1 /800th of a run, so we would end up making 800
seeks per run, and
If
RAM
the size of our
clearly
RAM
RAM
800 runs x 800 seeks/run
The times
for the
merge phase
are
640,000 seeks altogether.
summarized
in
Table
7.2.
Note
that
is over 5 hours and 19 minutes, almost 50 times greater than
80-megabyte file. By increasing the size of our file, we have gotten
ourselves back into the situation we had with keysort, where we can't do
the job we need to do without doing a huge amount of seeking. In this
instance, by increasing the order of the merge from 80 to 800, we made it
necessary to divide our one-megabyte RAM area into 800 tiny buffers for
doing I/O, and because the buffers are tiny each requires many seeks to
the total time
for the
process
its
corresponding run.
to improve performance, clearly we need to look for ways
improve
on
to
the amount of time spent getting to the data during the
merge phase. We will do this shortly, but first let us generalize what we
If
we want
have just observed.
7.5.3 The Cost of Increasing the
File Size
Obviously, the big difference between the time it took to merge the
8-megabyte file and the 800-megabyte file was due to the difference in total
seek and rotational delay times. You probably noticed that the number of
MERGING AS A WAY OF SORTING LARGE
seeks for the larger
100
is
file is
number of seeks
100 times the
for the first
the square of the difference in size between the
formalize this relationship as follows: In general, for a
runs where each run
for each
of the runs
so
K seeks
is
two
files.
file,
and
We
can
K-way merge of K
RAM space available,
as large as the
293
ON DISK
FILES
the buffer size
is
of
size
RAM
space
are required to read in
K runs
all
= I x
size
of each run,
of the records
in each individual run.
merge operation requires K2 seeks.
2
Hence, measured in terms of seeks, our sort merge is an 0(K ) operation.
Since K is directly proportional to N (if we increase the number of records
from 800,000 to 8,000,000, K increases from 80 to 800) it also follows that
2
our sort merge is an 0(N ) operation, measured in terms of seeks.
Since there are
This
we
brief,
altogether, the
formal look establishes the principle that
as files
grow
large,
can expect the time required for our merge sort to increase rapidly.
would be very
nice if
we
could find some ways to reduce
this
It
time.
Fortunately, there are several:
Allocate
more hardware, such
as disk drives,
RAM,
and I/O chan-
nels;
Perform the merge in more than one step, reducing the order of each
merge and increasing the buffer size for each run;
Algorithmically increase the lengths of the initial sorted runs; and
Find ways to overlap I/O operations.
In the following sections
with the
first:
Invest in
we
look
at
each of these in
detail,
beginning
more hardware.
7.5.4 Hardware-based Improvements
We
have seen that changes in our sorting algorithm can improve
performance. Likewise, there are changes that we can make in our hardware
that will also improve performance. In this section we look at three possible
changes to a system configuration that could lead to substantial decreases in
sort time:
Increasing the
Increasing the
Increasing the
amount of RAM;
number of disk drives; and
number of I/O channels.
RAM
Increasing the Amount of
It should be clear now that when we
have to divide limited buffer space into many small buffers, we increase
294
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
overwhelm all other sorting
number of seeks is
file size, given a fixed amount
seek and rotation times to the point where they
operations.
Roughly speaking,
the increase in the
proportional to the square of the increase in
of
total buffer space.
It
RAM space ought to have a
A larger RAM size means longer and
stands to reason, then, that increasing
substantial effect
on
total sorting time.
fewer initial runs during the sort phase, and it means fewer seeks per run
during the merge phase. The product of fewer runs and fewer seeks per run
means
a substantial
reduction in total seeks.
with our 8,000,000-record file, which took
about 5 hours, 20 minutes using one megabyte of RAM. Suppose we are
able to obtain 4 megabytes of
buffer space for our sort. Each of the
Let's test this conclusion
RAM
would
from 10,000 records to 40,000 records, resulting
in 200 40,000-record runs. For the merge phase, the internal buffer space
would be divided into 200 buffers, each capable of holding 1 /200th of a run,
meaning that there would be 200 X 200 = 40,000 seeks. Using the same
time estimates that we used for the previous two cases, the total time for
this merge is 56 minutes, 45 seconds, nearly a sixfold improvement.
initial
runs
increase
Number
of Dedicated Disk Drives If we could have a
no other users contending for use
of the same read/write heads, there would be no delay due to seek time after
the original runs are generated. The primary source of delay would now be
rotational delays and transfers, which would occur every time a new block
Increasing the
separate read/write head for every run and
had to be read in.
For example, if each run is on a separate, dedicated drive, our 800-way
merge calls for only 800 seeks (one seek per run), down from 640,000, and
cutting the total seek and rotation times from 11,500 seconds to 14 seconds.
Of course we can't configure 800 separate disk drives every time we want
to do a sort, but perhaps something short of this is possible. For instance,
if we had two disk drives to dedicate for the merge, we could assign one to
input and the other to output, so reading and writing could overlap
whenever they occurred simultaneously. (This approach takes some clever
buffer management, however. We discuss this later in this chapter.)
Increasing the
Number
of I/O Channels
two transmissions can occur
If there
is
only one I/O
same time, and the total
transmission time is the one we have computed. But if there is a separate
I/O channel for each disk drive, I/O can overlap completely.
For example, if for our 800-way merge there are 800 channels from 800
channel, then no
at
the
disk drives, then transmissions can overlap completely. Practically speaking,
it is
unlikely that 800 channels and 800 disk drives are available, and
MERGING AS A WAY OF SORTING LARGE
even
all
if
FILES
295
ON DISK
it is unlikely that all transmissions would overlap because
would not need to be refilled at one time. Nevertheless,
the number of I/O channels could improve transmission time
they were,
buffers
increasing
substantially.
So
we
see that there are
control over
how
ways
which external sorting occupies
have
are likely to
at least
improve performance
is
configured. In those environments in
a large
some such
we are not able to expand
might have. When this is
system
the case,
improve performance, and
this
is
we have some
to
our hardware
if
percentage of computing time,
control.
we
On the other hand, many times
specifically to
meet sorting needs
we need to look for
what we do now.
that
we
algorithmic ways to
7.5.5 Decreasing the Number of Seeks Using
Multiple-step Merges
One of the
hallmarks of
a solution to a file structure
problem,
as
opposed
of a mere data structure problem, is the attention given to the
between accessing information on disk and
accessing information in RAM. If our merging problem involved only
operations, the relevant measure of work, or expense, would be the
number of comparisons required to complete the merge. The merge pattern
that would minimize the number of comparisons for our sample problem,
in which we want to merge 800 runs, would be the 800-way merge
considered. Looked at from a point of view that ignores the cost of seeking,
to the solution
enormous
difference in cost
RAM
this
K-way merge
has the following desirable characteristics:
Each record
read only once.
is
If a selection tree
is
used for the comparisons performed in the mergnumber of com-
ing operation, as described in section 7.3, then the
parisons required for
tion of
Since
is
K-way merge of N
records
(total) is a
func-
log K.
directly proportional to
N,
this is
an
0(N
numbers of comparisons), which
reasonably efficient even as N grows large.
tion (measured in
is
log N) operato say that
it is
be very good news were we working exclusively in
sort procedure is to be able to sort
files that are too large to fit into RAM. Given the task at hand, the costs
associated with disk seeks are orders of magnitude greater than the costs of
operations in RAM. Consequently, if we can sacrifice the advantages of an
800-way merge, trading them for savings in access time, we may be able to
obtain a net gain in performance.
This would
RAM,
all
but the very purpose of this merge
296
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
We
have seen that one of the keys to reducing seeks is to reduce the
that we have to merge, thereby giving each run a bigger
number of runs
share of available buffer space. In the previous section
we accomplished
this
by adding more memory. Multiple-step merging provides a way for us to
apply the same principle without having to go out and buy more memory.
In multiple-step merging, we do not try to merge all runs at one time.
Instead, we break the original set of runs into small groups and merge the
runs in these groups separately.
On
each of these smaller merges, more
and hence, fewer seeks are required
of the smaller merges are completed, a second pass
merges the new set of merged runs.
It should be clear that this approach will lead to fewer seeks on the first
pass, but now there is a second pass. Not only are a number of seeks
required for reading and writing on the second pass, but extra transmission
time is used in reading and writing all records in the file. Do the advantages
of the two-pass approach outweigh these extra costs? Let's revisit the merge
step of our 8-million record sort to find out.
Recall that we began with 800 runs of 10,000 records each. Rather than
merging all 800 runs at once, we could merge them as, say, 25 sets of 32
runs each, followed by a 25-way merge of the intermediate runs. This
buffer space
per run.
scheme
is
is
available for each run,
When
all
illustrated in Fig. 7.26.
When compared
to our original
disadvantage of requiring that
FIGURE 7.26 Two-step merge of
runs.
32 runs
32 runs
VV V
800-way merge,
this
approach has the
read every record twice: once to form the
25 sets of 32 runs each
32 runs
800
we
MERGING AS A WAY OF SORTING LARGE
FILES
intermediate runs and then again to form the final sorted
each step of the merge
is
reading from 25 input
and avoid
to use larger buffers
files at a
number of
a large
297
ON DISK
But, since
file.
we are able
seeks. When we
time,
disk
analyzed the seeking required for the 800-way merge, disregarding seeking
for the output
we
file,
800-way merge involved 640,000
perform similar calculations for our
calculated that the
seeks between the input
files.
Let's
multistep merge.
First
Merge Step
For each of the 32-way merges of the
input buffer can hold V32 run, so
initial
runs, each
we end up making 32 X 32 = 1,024 seeks.
we make 25 x 1,024 = 25,600 seeks. Each
For all 25 of the 32-way merges,
of the resulting runs is 320,000 records, or 32 megabytes.
Second Merge Step
space
For each of the 25
final runs, Vis
of the
total buffer
400 records, or Vsoo run.
step there are 800 seeks per run, so we end up making 25 X
allocated, so each input buffer can hold
is
Hence, in this
800 = 20,000 seeks, and
The
So,
total
number of seeks
by accepting the
number of
for the
two
steps
25,600
cost of processing each record twice,
seeks for reading in
spent a penny for extra
But what about the
from 640,000
20,000
to 45,600,
we
45,600.
reduce the
we
and
haven't
RAM.
total
time for the merge?
inputting data, but there are costs.
We now
We save on access times for
have to transmit
all
of the
records four times instead of two, so transmission time increases by 651
seconds. Also,
we
write the records out twice, rather than once, requiring
an extra 40,000 seeks.
for the
merge
is
When we add
in these extra operations, the total
5,907 seconds, or about
5 hours, 20 minutes for the single-step merge. These results are
in
Table
Once more, note
over the data for
is
If
summarized
7.3.
that the essence of what
we have done is
to increase the available buffer space for each run.
trade
time
hour, 38 minutes, compared to
We
to find a
way
trade extra passes
dramatic decrease in random accesses. In
this case the
certainly a profitable one.
we
can achieve such an improvement with
do even
better with three steps? Perhaps, but
7.3 that
we have
reduced
total seek
it is
file,
we
where
three-step merge would
and rotation times
transmission times are about as expensive. Since
require yet another pass over the
two-step merge, can
important to note in Table
we may
to the point
have reached
point of
diminishing returns.
We also could have chosen to distribute our initial runs differently.
How would the merge perform if we did 400 two-way merges, followed by
298
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
Time estimates
TABLE 7.3
for two-step
merge
hypothetical disk drive described
sort of
in
800-megabyte
Table 3.2. The
total
file,
time
assuming use
is 1 hour, 31
of
minutes.
Number
Amount
Seek + Rotation
Transfer
of
Transferred
(Megabytes)
Time
Time
(Seconds)
(Seconds)
Seeks
Total Time
(Seconds)
1st
Merge: Reading
25,600
800
673
651
1,324
1st
Merge: Writing
40,000
800
1,052
651
1,703
2nd Merge: Reading
20,000
800
526
651
1,177
2nd Merge: Writing
40,000
800
1,052
651
1,703
125,600
3,200
3,303
2,604
5,907
Totals
one 400-way merge, for instance? A rigorous analysis of the trade-offs
between seek and rotation time and transmission time, accounting for
different buffer sizes, is beyond the scope of our treatment of the subject.'*'
Our goal is simply to establish the importance of the interacting roles of the
major costs in performing merge sorts: seek and rotation time, transmission
time, buffer size, and number of runs. In the next section we focus on the
the number of runs.
pivotal role of the last of these
7.5.6 Increasing Run Lengths Using Replacement Selection
What would happen
if
we
could
somehow
increase the size of the initial
runs? Consider, for example, our earlier sort of 8,000,000 records in which
each record was 100 bytes.
Our
10,000 records because the
RAM work area was limited to
Suppose
we
are
somehow
initial
able
to
runs were limited to approximately
create
one megabyte.
runs of twice this length,
containing 20,000 records each. Then, rather than needing to perform an
800-way merge, we need
is
to
do only
400-way merge. The
divided into 400 buffers, each holding
the
number of seeks
required per run
is
available
RAM
/800th of a run. (Why?) Hence,
800, and the total
number of seeks
is
800 seeks/run x 400 runs
half the
""For
number
320,000 seeks,
required for the 800-way merge of 10,000-byte runs.
more rigorous and
end of
detailed analyses of these issues, consult the references cited at the
this chapter, especially
Knuth (1973b) and Salzberg
(1988, 1990).
MERGING AS A WAY OF SORTING LARGE
In general, if
we
can
somehow
FILES
299
ON DISK
increase the size of the initial runs,
we
amount of work required during the merge step of the sorting
process. A longer initial run means fewer total runs, which means a
lower-order merge, which means bigger buffers, which means fewer seeks.
But how, short of buying twice as much memory for the computer, can we
create initial runs that are twice as large as the number of records that we can
hold in RAM? The answer, once again, involves sacrificing some efficiency
in our in-RAM operations in return for decreasing the amount of work to
be done on disk. In particular, the answer involves the use of an algorithm
decrease the
known
as replacement selection.
Replacement selection
from memory
replacing
it
with
implemented
1.
is
based on the idea of always
key
and then
Replacement selection can be
selecting the
that has the lowest value, outputting that key,
a
new key from
the input
list.
as follows:
of records and sort them using heapsort. This
heap of sorted values. Call this heap the primary heap.
Instead of writing out the entire primary heap in sorted order (as we
do in a normal heapsort), write out only the record whose key has
Read
in a collection
creates a
2.
3.
4.
the lowest value.
Bring in a new record and compare the value of its key with that of
the key that has just been output.
a.
If the new key value is higher, insert the new record into its
proper place in the primary heap along with the other records
that are being selected for output. (This makes the new record
part of the run that is being created, which means that the run
being formed will actually be larger than the number of keys
that can be held in memory at one time.)
b.
If the new record's key value is lower, place the record in a secondary heap of records with key values lower than those already
written out. (It cannot be put into the primary heap, because it
cannot be included in the run that is being created.)
Repeat step 3 as long as there are records left in the primary heap
and there are records to be read in. When the primary heap is empty,
make the secondary heap into the primary heap and repeat steps 2
and 3.
To
see
how
this
works,
input
list
of only
keys.
As
Fig. 7.27 illustrates,
six
keys and
let's
begin with
simple example, using an
memory work area that can hold
we begin by reading into RAM the
a
there and use heapsort to sort them.
We
only three
three keys
key with the
minimum value, which happens to be 5 in this example, and output that
key. We now have room in the heap for another key, so we read one from
that
fit
select the
300
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
Input:
67,
21,
12,
47,
5,
16
t_ Front of input string
Memory
Remaining input
21,
67,
21,
67
12
(P
47
16
12
47
16
21
67
47
16
67
47
21
47
-
67
67
Output run
3)
67
12,
16,
12,
21,
16,
12,
47
21,
16,
12,
47
21,
16,
12,
FIGURE 7.27 Example of the principle underlying replacement selection.
key, which has a value of 12, now becomes a
of keys to be sorted into the output run. In fact, since it
is smaller than the other keys in RAM, 12 is the next key that is output. A
new key is read into its place, and the process continues. When the process
is complete, it produces a sorted list of six keys while using only three
the input
list.
member of the
memory
in
set
locations.
In this
happens
The new
example the entire file is created using only one heap, but what
fourth key in the input list is 2 rather than 12? This key arrives
if the
memory
The
too
late to
be output into
its
proper position relative to the other
been written to the output list. Step 3b in the
algorithm handles this case by placing such values in a second heap, to be
included in the next run. Figure 7.28 illustrates how this process works.
keys:
During the
5 has already
first
run,
when
keys are brought in that are too small to be
we mark them with parentheses, indicating
have to be held for the second run.
It
is
interesting to use this example to compare the action of
replacement selection to the procedure we have been using up to this point,
namely that of reading keys into RAM, sorting them, and outputting a run
that is the size of the
space. In this example our input list contains 13
included in the primary heap,
that they
RAM
keys.
A series of successive RAM sorts,
results in five runs.
runs.
The replacement
given only three
Since the disk accesses during a multiway
expense,
replacement selection's
ability
to
fewer, runs can be an important advantage.
Two
questions emerge
memory
locations,
selection procedure results in only
at this point:
merge can be
create longer,
two
major
and therefore
MERGING AS A WAY OF SORTING LARGE
FILES
301
ON DISK
Input:
33,
18,
24,
58,
14,
17,
67,
21,
7,
12,
47,
5,
16
Front of input string
Memory
Remaining input
33,
18,
24,
58,
14,
17,
7,
21,
67,
33,
18,
24,
58,
14,
17,
7,
21,
67
33,
18,
24,
58,
14,
17,
7,
21
33,
18,
24,
58,
14,
17,
33,
18,
24,
58,
14,
17
33,
18,
24,
58,
14
33,
18,
24,
58
12
(P
47
16
12
47
16
67
47
16
67
47
21
67
47
7)
67
(1?)
7)
(14)
(17)
7)
tart building the
33,
18,
24,
33,
18,
24
33,
18
58
33
Output run
3)
67,
12,
16,
12,
21,
16,
12,
47,
21,
16,
12,
47,
21,
16,
12,
second
14
17
14
17
58
24
17
58
24
18
58
24
33
58
33
58
58
14,
58,
17,
14,
18,
17,
14,
17,
14,
24,
18,
33,
24,
18,
17,
14,
33,
24,
18,
17,
14,
FIGURE 7.28 Step-by-step operation of replacement selection working to form two sorted runs.
1.
2.
in memory, how long a run can
placement selection to produce, on the average?
What are the costs of using replacement selection?
Given P locations
Average Run Length for Replacement Selection
first
question
is
that,
on the average, we can expect
P memory locations. Knuth
intuitive argument for why
A
clever
way
to
discovered by E.
a circular track
situation
shown
we
expect re-
The answer
to the
run length of 2P, given
(1973b)^ provides an excellent description of an
this is so:
show
F.
that 2P is indeed the expected run length was
Moore, who compared the situation to a snowplow on
[U.S. Patent 2983904 (1961), Cols. 3-4]. Consider the
[below]; flakes of snow are falling uniformly on
a circular
From Donald Knuth, The Art of Computer Programming, 1973, Addison-Wesley, Reading,
Mass. Pages 254-55 and Figs. 64 and 65. Reprinted with permission.
+
302
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
road, and a lone
snow
snowplow
is
continually clearing the snow.
has been plowed off the road,
it
Once
the
disappears from the system. Points
< x < 1; a flake of
on the road may be designated by real numbers x,
snow falling at position x represents an input record whose key is x, and
the snowplow represents the output of replacement selection. The ground
speed of the snowplow
that
it
is
inversely proportional to the height of the
encounters, and the situation
is
snow
perfectly balanced so that the total
A new run is
amount of snow on the road at all times is exactly
output whenever the plow passes point 0.
P.
After this system has been in operation for awhile,
it is
formed
in the
it
will
approach
speed (because of the circular
snow
is
at
constant height
linearly in front
intuitively clear that
which the snowplow runs at constant
symmetry of the track). This means that the
a stable situation in
when it meets the plow, and the height drops off
as shown [below]. It follows that the volume
of the plow
of snow removed in one revolution (namely the run length)
amount
is
twice the
present at any one time (namely P).
lllHHHIHil
Falling
snow
Future snow
Existing
Total length of the road
snow/=fc(o^==!
|
^(Op~
MERGING AS A WAY OF SORTING LARGE
So,
given
random ordering of
we
keys,
FILES
303
ON DISK
can expect replacement
hold in
form runs that contain about twice as many records as we can
memory at one time. It follows that replacement selection creates
half as
many
selection to
assuming
the
runs as does
same amount of memory. (As we
selection does, in fact, have to
RAM
of
sorts
and the
see in a
make do with
RAM
memory
contents,
have access to
moment, the replacement
less
sort
memory
than does the
sort.)
It is
actually often possible to create runs that are substantially longer
than 2P. In
many
random; the keys
the order of the records
applications,
produce runs
(Consider what would happen
Replacement selection becomes an
ordered input
is
not
wholly
are often already partially in ascending order. In these
cases replacement selection can
2P.
RAM
of
a series
that the replacement selection
if
that,
the input
on the average, exceed
list
is
already sorted.)
especially valuable tool for such partially
files.
The Costs of Using Replacement
Selection
Unfortunately, the no-
free-lunch rule applies to replacement selection, as
it
does to so
many
other
of file structure design. In the worked-by-hand examples we have
looked at up to this point, we have been inputting records into memory one
at a time. We know, in fact, that the cost of seeking out to disk for every
areas
single input record
which means,
is
prohibitive. Instead,
in turn, that
we
operation of replacement selection.
output buffering. This
sorting,
To
is
cost,
we want
Some of it
and the
affect
it
to buffer the input,
of the memory for the
has to be used for input and
are not able to use
all
has on available space for
illustrated in Fig. 7.29.
need for buffering during the replacement
selection step, let's return to our example in which we sort 8 million
records, given a memory area that can hold 10,000 records.
see the effects
of
this
FIGURE 7.29 In-RAM sort versus replacement selection, in
terms of their use of available
heapsort area
(a)
In-RAM
sort: all available
i/o buffer
(b)
Replacement
space used for the
RAM
sort.
heapsort area
selection:
some of
available space
is
used for
i/o.
for sorting operation.
304
CQSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
RAM
For the
records into
sorting
memory
10,000 records
methods such
until
full,
is
it
we
as heapsort,
which simply read
can perform sequential reads of
800 runs have been created. This means that
at a time, until
the sort step requires 1,600 seeks: 800 for reading and 800 for writing.
For replacement selection
we might
use an input/output buffer that can
enough space to hold 7,500
records for the actual replacement selection process. If the I/O buffer holds
hold, for example, 2,500 records, leaving
2,500 records,
so
we
can perform sequential reads of 2,500 records
at a
time,
takes 8,000,000/2,500
3,200 seeks to access all records in the file.
This means that the sort step for replacement selection requires 6,400 seeks:
it
3,200 for reading and 3,200 for writing.
If the records occur in a random key sequence, the average run length
using replacement selection will be 2 X 7,500 = 15,000 records, and there
about 8,000,000/15,000
will be
we
step
534 such runs produced. For the merge
divide the one megabyte of
average of 18.73 records, so
RAM into 534 buffers,
we end up making
which hold an
15,000/18.73
801 seeks
per run, and
801 seeks per run x 534 runs
427,734 seeks altogether.
Table 7.4 compares the access times required to sort the 8 million
records using both a
sort and replacement selection. The table
RAM
800-way merge and two replacement selection examples. The second replacement selection example, which produces runs of
40,000 records while using only 7,500 record storage locations in memory,
assumes that there is already a good deal of sequential ordering within the
includes our
initial
input records.
It
is
clear that,
given randomly distributed input data, replacement
selection can substantially reduce the
number of runs formed. Even though
as many seeks to form the runs,
amount of seeking effort required to merge the runs
more than offsets the extra amount of seeking that is required to form the
runs. And when the original data is assumed to possess enough order to
make the runs 40,000 records long, replacement selection produces less than
replacement selection requires four times
the reduction in the
one third
as
many
seeks as
RAM
sorting.
7.5.7 Replacement Selection Plus Multistep Merging
While these comparisons highlight the advantages of replacement selection
RAM
we would
choose the one-step
merge patterns shown in Table 7.4. We have seen that two-step merges can
result in much better performance than one-step merges. Table 7.5 shows
how these same three sorting schemes compare when two-step merges are
over
sorting,
probably not in
reality
>s
to
c Z
09
Mo
cd
o B
C/5
"O
ed
to
ed
to
b*XH
^
<
OT
XV
CO
_C
CD
o E
.a o
M
GO
C CO
CO
3 =3
CO V|_
o
"O
i_
o CD
o -O
CD
E
c 3
o C
33
E
00
HZ
,i
(ft
0P
5 c
**
CD
fi
O 3
o
^m
03
Z3
cr
CD
o
D
CO
cd
[/)
-C
o
CO
o
u M
_*
o
u
i
C o
3
^_
CD
Z3
cr
CD
i_
CO
CD
E
'-
o
CD
s?
CD
S/3
d
o
o
o o
CD
CO
CO
CD
o c s
N 3 5Cfl
Un
03
CD
o CO
c c
o
CD
CO
E
03 CD
Q. O
E
o
03
(m
5-
a
5/3
Q.
CD
^t
r-
LU
E O
CN
<U
~-
_o
Uh
rs
O
-
t/3
i_
00
ri
<
hh
2^
Q,^o
c2
^
as
gJS
u 12
o C
^-n
305
>s
ed
73 .3
c
CD
E
CD
O
LO
LD
00
"*
CM
00
en
<*
"*
vC
2 =
C/3
TO
p<
A 03 ^
o o .5
PN -w
Q.
CD
Im
mC
HH
TO
o
CO
.>
<
ce
<~
HZ
^
CM
K
CM
ScJS
o
^
'
'
'
.c
*->
<+-
.Q
GO
r
C
C
CO
3
cz^
s S 2
co
O
O
SO O
lO O
CM CM
o
o
o
oo
q
^C CM
r- ^c
00 rH
cn in
CM
oo"
so
L.
o
o
CD
> *
* %
00
Cl
cd
las
X C
x c
8-3
-5
S
CM
LO
00
l<sl
*-*
TO
8
CO
CM
"5
o
2 CM
^
00 J,
Lj
4-1
-a
"^
fi
*5
CD
8|
o c
CD
*~
CD
>
8J
1
c
.3 J o
y. a u-
o
o
O
8
LO
o
LO
o
o
LO
cm'
cm'
TO
n_ -C
C
O
._2
&
TO
CD
fe.2
OO
CD
CD
CO
s o S
tf
c*o
t2
If)
P^
LU
1
<
1
306
J_,
4->
T3
r^
s-
u
ed
2
<
1m
&
Cm
<
o
,
00
'
-S
Cm
o
vi
es:
-MJ
*m
(/>
c/5
S o c
CJ
<~>
^^
-r
'->
^ C
8 OT!
-- u O rt o
Cm <y U
o c3 J
IU -c- S > Cm O
MERGING AS A WAY OF SORTING LARGE
From Table
used.
less in
7.5
every case than
we
it
see that the total
was
FILES
307
ON DISK
number of seeks
dramatically
is
method
for the one-step merges. Clearly, the
used to form runs is not nearly
than one-step, merges.
as
important
as the use
of multistep, rather
Furthermore, since the number of seeks required for the merge steps is
smaller in all cases, while the number of seeks required to form runs
much
have a bigger effect proportionally on the final
total, and the differences between the RAM-sort based method and
replacement selection are diminished.
The differences between the one-step and two-step merges are
exaggerated by the results in Table 7.5, because they don't take into account
the amount of time spent transmitting the data. The two-step merges
and disk two more times
require that we transfer the data between
than do the one-step merges. Table 7.6 shows the results after adding
transmission time to our results. The two-step merges are still better, and
remains the same, the
latter
RAM
replacement selection
still
wins, but the results are
less
dramatic.
7.5.8 Using Two Disk Drives with Replacement Selection
and fortunately, replacement selection offers an opportunity
sort
to save on both transmission and seek times in ways that
methods do not. As usual, this is at a cost, but if sorting time is expensive,
it could well be worth the cost.
Suppose that we have two disk drives available that we can assign the
separate dedicated tasks of reading and writing during replacement
selection. One drive, which contains the original file, does only input, and
Interestingly,
RAM
the other does only output. This has
two very
nice results: (1)
It
input and output can overlap, reducing transmission time by
50%; and
If
(2)
seeking
we have two
to take advantage
two
means
as
that
much
as
virtually eliminated.
is
disks at our disposal,
of them.
We
we should
memory
configure
also configure
as follows:
memory
We
allocate
buffers each for input and output, permitting double buffering, and
allocate the rest
of memory for forming the selection
tree.
This arrangement
might proceed
to take advantage
illustrated in Fig. 7.30.
is
Let's see
of
how
the
merge
sort process
this configuration.
First,
the sort phase.
the heap-sized part of
We
begin by reading in enough records to
the heap. Next, as we
memory, and form
records from the heap into one of the output buffers,
we
fill
up
move
replace those
records with records from one of the input buffers, adjusting the tree in the
usual manner. While
filling the
we empty one
input buffer into the
tree,
we
can be
other one from the input disk. This permits processing and input
c 5
co .5
.2
a C
co
OS
Hhhi
c 5 S c
.22 *S
0)
5 I &
ho* oft
tt
c
2
*
tN
00
J*
03
0)
LO
<T)
C
St
0:
co
CO
>N
03
s?
<N
r
^
4-
^
S
.,
04
c/5
lo
CN
o
LO
o
O
OC
CN G\
i
O
V
O U
'J
o
c
E
o
On
^H
||3
*-H
v.
<N
X o
C
i
CN
LT5
o
in
CN
CN
w
O C
CO
<si
s g
s -8
S ? S
o
h
.,
>*
O O
PC _Q
>-
T o
So.
ji
oilC
too
X R O
j~.
rt
J-c
? o
p5c OC
<u
<u
co
&
a
308
C
O
u
O
u
CO
ft
X>-
-a
>.
>,
CN
in
<N
>^
CO
X ,
v x U
ScSD
*-*
'E
CO
CO
Ctf
-Q .S
4-'
CO
CO
MERGING AS A WAY OF SORTING LARGE
FILES
ON DISK
309
input
buffers
output
buffers
FIGURE 7.30 Memory organization for replacement selection.
to overlap. Similarly, at the
buffers
from the
tree,
we
same time
we
that
are filling
one of the output
can be transmitting the contents of the other to the
way, run selection and output can overlap.
During the merge phase, the output disk becomes the input disk, and
vice versa. Since the runs are all on the same disk, seeking will occur on the
input disk. But output is still sequential, since it goes to a dedicated drive.
Because of the overlapping of so many parts of this procedure, it is
difficult to estimate the amount of time the procedure is likely to take. But
it should be clear that by substantially reducing seeking and transmission
time, we are attacking those parts of the sort merge that are the most costly.
output disk. In
this
7.5.9 More Drives? More Processors?
If
two
Isn't
it
drives can
improve performance, why not
true that the
phase, the faster
course the
we
more
drives
we have
can perform I/O?
number and speed of I/O
to
Up
more?
hold runs during the merge
three, or four, or
to a point this
is
true,
but of
processors must be sufficient to keep
up with the data streaming in and out. And there will also be a point at
which I/O becomes so fast that processing can't keep up with it.
But who is to say that we can use only one processor? A decade ago, it
would have been far-fetched to imagine doing sorting with more than one
processor, but it is very common now to be able to dedicate more than one
processor to a single job. Possibilities include the following:
Mainframe computers, many of which spend a great deal of their
time sorting, commonly come with two or more processors that can
simultaneously work on different parts of the same problem.
Vector and array processors can be programmed to execute certain
kinds of algorithms orders of magnitude faster than scalar processors.
310
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
Massively
machines provide thousands, even millions, of
at the same time communicate in complex ways with one another.
Very fast local area networks and communication software make it
relatively easy to parcel out different parts of the same process to
parallel
processors that can operate independently and
several different machines.
It is
these
not appropriate, in
newer
cover in detail the implications of
this text, to
architectures for external sorting.
But just
the past decade in the availability and performance of
have altered the way
many more
we
look
at
as the
changes over
RAM and disk storage
we can expect it to change
new architectures becomes
external sorting,
times as the current generation of
commonplace.
7.5.10 Effects
of
Multiprogramming
of external sorting on disk we are, of course, making tacit
assumptions about the computing environment in which this merging is
In our discussions
We
are assuming, for example, that the merge job is running
environment (no multiprogramming). If, in fact, the
operating system is multiprogrammed, as it normally is, the total time for
the I/O might be longer, as our job waits for other jobs to perform their
taking place.
in
dedicated
I/O.
On
the other hand, one of the reasons for
multiprogramming
is
to
allow the operating system to find ways to increase the efficiency of the
overall system
by overlapping processing and I/O among
the system could be performing I/O for our job while
it
different jobs.
was doing
So
CPU
processing on others, and vice versa, diminishing any delays caused by
overlap of I/O and
CPU
processing within our job.
Effects such as these are very hard to predict, even
when you have
much
information about your system. Only experimentation can determine
what
real
performance will be
like
7.5.11 A Conceptual Toolkit
We can now list many tools
on
busy, multiuser system.
for External Sorting
that can
improve external sorting performance.
should be our goal to add these various tools to our conceptual toolkit for
designing external sorts and to pull them out and use them whenever they
It
are appropriate.
following:
full listing
of our
new
set
of tools would include the
311
SORTING FILES ON TAPE
For
in-RAM
lap input
Use
as
time
With
RAM
as possible.
more
much
It
makes the runs longer and promerge phase.
buffers during the
number of initial runs
is
it
and output with internal processing.
much
vides bigger and/or
If the
forming the original list of
and double buffering, we can over-
sorting, use heapsort for
sorted elements in a run.
is
so large that total seek and rotation
greater than total transmission time, use a multistep
It increases the amount of transmission time but
number of seeks enormously.
merge.
the
Consider using replacement selection for
cially if there is a possibility that the
Use more than one
initial
can decrease
run formation, espe-
runs will be partially ordered.
disk drive and I/O channel so reading and writ-
ing can overlap. This
is
especially true if there are not other users
on
the system.
Keep
mind
the fundamental elements of external sorting and their
and look for ways to take advantage of new architectures and systems, such as parallel processing and high-speed local
area networks.
in
relative costs,
7.6
Sorting Files on Tape
There was a time when it was usually
on tape than on disk, but this is much
is still used in external sorting, and
faster to
perform large external
we would
sorts
now. Nevertheless, tape
less the case
be remiss
if
we
did not
consider sort merge algorithms designed for tape.
There are a large number of approaches to sorting files on tape. After
approximately 100 pages of closely reasoned discussion of different
alternatives for tape sorting, Knuth (1973b) summarizes his analysis in the
following way:
Theorem A.
It is
difficult to decide
which merge pattern
is
best in a
given situation.
Because of the complexity and number of alternative approaches and
way that these alternatives depend so closely on the specific
characteristics of the hardware at a particular computer installation, our
because of the
merely to communicate some of the fundamental issues
For a more comprehensive
discussion of specific alternatives we recommend Knuth's (1973b) work as
objective here
is
associated with tape sorting and merging.
a starting point.
312
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
Viewed from
general perspective, the steps involved in sorting on
tape resemble those that
we
1.
Distribute the unsorted
2.
Merge
discussed with regard to sorting on disk:
file
into sorted runs;
the runs into a single sorted
Replacement selection
is
and
file.
almost always
good choice
as a
method
for
You will remember that the
when we are working on disk is that
creating the initial runs during a tape sort.
problem with replacement
the
amount of seeking
selection
required during run creation
more than
offsets the
advantage of creating longer runs. This seeking problem disappears when
the input is from tape. So, for a tape-to-tape sort, it is almost always
advisable to take advantage of the longer runs created by replacement
selection.
7.6.1 The Balanced Merge
Given
that the question
of
how
create the initial runs has
to
such
merging process that we
encounter all of the choices and complexities implied by Knuth's tonguein-cheek theorem. These choices begin with the question of how to distribute
the initial runs on tape and extend into questions about the process of
merging from this initial distribution. Let's look at some examples to show
what we mean.
Suppose we have a file that, after the sort phase, has been divided into
10 runs. We look at a number of different methods for merging these runs
on tape, assuming that our computer system has four tape drives. Since the
initial, unsorted file is read from one of the drives, we have the choice of
initially distributing the 10 runs on two or three of the other drives. We
begin with a method called two-way balanced merging, which requires that the
initial distribution be on two drives, and that at each step of the merge,
except the last, the output be distributed on two drives. Balanced merging
is the simplest tape merging algorithm that we look at; it is also, as you will
straightforward answer,
it is
clear that
it is
in the
see, the slowest.
The balanced merge proceeds according
to the pattern illustrated in Fig.
7.31.
This balanced merge process
is
The numbers
expressed in an alternate,
form in Fig. 7.32.
measured in terms of the number of
run. For example, in step
By
all
initial
runs included in each merged
the input runs consist of a single initial run.
At the start of
Tl contains one run consisting of four initial runs
run consisting of two initial runs. This method of illustration
step 2 the input runs each consist of a pair of initial runs.
step 3,
tape drive
followed by
more compact
inside the table are the run lengths
SORTING FILES ON TAPE
Tape
Step
313
Contains runs
Tl
Rl
R3
R5
R7
T2
T3
T4
R2
R4
R6
R8
R1-R2
R3-R4
R5-R6
R7-R8
R9-R10
Tl
T2
R1-R4
R5-R8
R9-R10
T3
T4
R9
RIO
Tl
Step 2
Step 3
T2
T3
T4
Tl
Step 4
Step 5
T2
T3
T4
R1-R8
R9-R10
R1-R10
Tl
T2
T3
T4
FIGURE 7.31
Balanced four-tape merge
more
grow
shows
of
10
runs.
way some of
combine and
one run that is copied
again and again stays at length 2 until the end. The form used in this
illustration is used throughout the following discussions on tape merging.
Since there is no seeking, the cost associated with balanced merging on
tape is measured in terms of how much time is spent transmitting the data.
In the example, we passed over all of the data four times during the merge
phase. In general, given some number of initial runs, how many passes over
the data will a two-way balanced merge take? That is, if we start with
runs, how many passes are required to reduce the number of runs to 1?
clearly
the
the intermediate runs
into runs of lengths 2, 4, and 8, whereas the
314
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
XI
T2
T3
T4
11111
11111
Step 2
Step 3
4 2
Step 4
Step 5
Step
Merge ten runs
22222
Merge ten runs
4
10
Merge ten runs
Merge ten runs
FIGURE 7.32 Balanced four-tape merge of 10 runs expressed
in
more compact
table notation.
Since each step combines
the
number
two
runs, the
for the previous step. If p
number of runs after each step is half
the number of passes, then we can
is
express this relationship as follows:
(Vif
from which
it
can be
shown
N=
N<
1,
that
P =
In our simple example,
Hog, N~l
10, so four passes
over the data were required.
Recall that for our partially sorted 800-megabyte
flog 2 200 "| = 8 passes are required for
file
there
were 200 runs, so
balanced merge. If reading and
writing overlap perfectly, each pass takes about 11 minutes," so the total
1"
time
is 1
hour, 28 minutes. This time
merges, even
when
a single
outweigh the savings
is
disk drive
not competitive with our disk-based
is
used.
The transmission times
far
in seek times.
7.6.2 The /(-way Balanced Merge
If
we want
to
tells
us that
improve on
this
approach,
it is
clear that
we must
number of passes over the data. A quick look at
we can reduce the number of passes by increasing
to reduce the
find
ways
the formula
the order of
assumes the 6,250 bpi tape used in the examples in Chapter 3. If the transports speed
200 inches per second, the transmission rate is 1,250 Kbytes per second, assuming no
blocking. At this rate an 800-megabyte file takes 640 seconds, or 10.67 minutes to read.
""This
is
SORTING FILES ON TAPE
each merge. Suppose, for instance, that
we have
315
20 tape drives, 10 for input
and 10 for output, at each step. Since each step combines 10 runs, the
number of runs
after
each step
is
one tenth the number for the previous
step.
we have
Hence,
(Vxof
N<
and
p = Rogio
In general,
at
~l
A k-way balanced merge is one in which the order of the merge
each step (except possibly the
last)
is
k.
required for a k-way balanced merge with
Hence, the number of passes
N initial
runs
is
v = r~iog N~i.
fe
10-way balanced merge of our 800-megabyte
file with 200 runs,
200 1 = 3, so three passes are required. The best estimated time now
is reduced to a more respectable 42 minutes. Of course, the cost is quite
high: We must keep 20 working tape drives on hand for the merge.
For
|~logio
7.6.3 Multiphase Merges
The balanced merging algorithm has the advantage of being very simple; it
easy to write a program to perform this algorithm. Unfortunately, one
is
reason
it
is
simple
is
that
it
is
"dumb" and cannot take advantage of
how we can improve on it.
when we merge the extra run with empty
opportunities to save work. Let's see
We
can begin by noting that
runs in steps 3 and
4,
we
don't really accomplish anything. Figure 7.33
shows how we can dramatically reduce the amount of work that has to be
done by simply not copying the extra run during step 3. Instead of merging
this run with a dummy run, we simply stop tape T3 where it is. Tapes Tl
and T2 now each contains a single run made up of four of the initial runs.
We rewind all the tapes but T3 and then perform a three-way merge of the
runs on tapes Tl, T2, and T3, writing the final result on T4. Adding this
intelligence to the merging procedure reduces the number of initial runs that
must be read and written from 40 down to 28.
The example in Fig. 7.33 clearly indicates that there are ways to
improve on the performance of balanced merging. It is important to be able
to state, in general terms, what it is about this second merging pattern that
saves work:
We
use a higher-order merge. In place of
use one three-way merge.
two two-way merges, we
316
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
T2
Tl
Step
11111
Step 2
Step 3
T3
T4
2 2 2
2 2
Merge
ten runs
Merge
eight runs
Merge ten runs
Step 4
10
FIGURE 7.33 Modification of balanced four-tape merge that does not rewind
between steps 2 and 3 to avoid copying runs.
We
extend the merging of runs from one tape over several
we merge some
Specifically,
in step 4.
We
could say that
of the runs from T3
we merge
the runs
steps.
and some
in step 3
from T3
in
two
phases.
These
ideas, the use
of higher-order merge patterns and the merging of
runs from a tape in phases, are the basis for two well-known approaches to
merging called polyphase merging and cascade
merges share the following characteristics:
The
a
initial
distribution of runs
J 1-way merge, where J
The
is
is
the
such that
merging.
at least
Figure 7.34 illustrates
the initial
number of available
distribution of the runs across the tapes
ten contain different
In general,
is
merge
initial
such that the tapes of-
numbers of runs.
how
polyphase merge can be used to merge 10
runs that must be read and written from 40 (for
two-way merge)
to 25.
is
tape drives.
runs distributed on four tape drives. This merge pattern reduces the
of
these
easy to see that this reduction
It is
is
number
balanced
consequence
of the use of several three-way merges in place of two-way merges. It
should also be clear that the ability to do these operations as three-way
merges is related to the uneven nature of the initial distribution. Consider,
for example,
what happens
than 5-3-2.
We
T3, but
this also clears all the
Tl. Obviously,
second
if
the initial distribution of runs
is
4-3-3
rather
can perform three three-way merges to open up space on
we
runs off of T2 and leaves only a single run on
are not able to
perform another three-way merge
as a
step.
Several questions arise at this point:
1.
How
does one choose an
efficient
merge
pattern?
initial
distribution that leads readily to an
SORTING FILES ON TAPE
T2
Tl
1
11111
Step 2
..111
Step 3
...
Step 4
....
Step
T4
33
10
Step 5
T3
Merge
six
Merge
five
317
runs
runs
Merge four runs
Merge ten runs
FIGURE 7.34 Polyphase four-tape merge of 10 runs.
2.
Are there algorithmic descriptions of the merge
patterns, given an
initial distribution?
3.
N runs
and J tape drives, is there some way to compute the
merging performance so we have a yardstick against which
compare the performance of any specific algorithm?
Given
optimal
to
beyond the scope of this text; in
answer to question 3 requires a more mathematical approach
to the problem than the one we have taken here. Readers wanting more
than an intuitive understanding of how to set up initial distributions should
Precise answers to these questions are
particular, the
consult
Knuth
(1973b).
7.6.4 Tapes versus Disks
for External Sorting
RAM
was considered a substantial amount of
of
any single job, and extra disk drives were very
costly. This meant that many of the disk sorting techniques to decrease seeking that we have seen were not available to us or were very
decade ago 100
memory
to allocate to
limited.
we want
our 800-megabyte file, and
available, instead of one megabyte. The
there is only 100 K of
approach that we used for allocating memory for replacement selection
would provide 25 K for buffering, and 75 K for our selection tree. From this
Suppose, for instance, that
to sort
RAM
we
can expect 5,334 runs of 1,500 records each, versus 534 when there is a
RAM. For a one-step merge, this 10-fold increase in the
megabyte of
number of runs
results in a 100-fold increase in the
took three hours with one megabyte of memory
for the seeks!
No wonder
no seeking, were
tapes,
preferred.
which
number of seeks. What
now
takes 300 hours, just
are basically sequential
and require
31
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
now
RAM
much more readily available. Runs can be longer and
much less of a problem. Transmission time is now
more important. The best way to decrease transmission time is to reduce
the number of passes over the data, and we can do this by increasing the
But
is
fewer, and seeks are
order of the merge. Since disks are random-access devices, very large order
merges can be performed, even if there is only one drive. Tapes, however,
are not random-access devices; we need an extra tape drive for every extra
run we want to merge. Unless a large number of drives is available, we can
only perform low-order merges, and that means large numbers of passes
over the data. Disks are better.
7.7
Sort-Merge Packages
Many
utility programs are available for users who need to sort
Often the programs have enough intelligence to choose from one
of several strategies, depending on the nature of the data to be sorted and the
available system configuration. They also often allow users to exert some
control (if they want it) over the organization of data and strategies used.
Consequently, even if you are using a commercial sort package rather than
designing your own sorting procedure, it is useful to be familiar with the
variety of different ways to design merge sorts. It is especially important to
have a good general understanding of the most important factors and
large
very good
files.
trade-offs influencing performance.
7.8
Sorting and Cosequential Processing
UNIX
has a
number of utilities
also has sorting routines, but
for
in
UNIX
performing cosequential processing. It
at the level of sophistication that you
nothing
find in production sort-merge packages. In the following discussion
introduce
some of
these
utilities.
For
full
details,
consult the
we
UNIX
documentation.
7.8.1 Sorting and Merging
Because
UNIX
sorting of large
is
in
UNIX
not an environment in which one expects to do frequent
files
of the type
we
discuss in this chapter, sophisticated
sort-merge packages are not generally available on UNIX systems. Still, the
sort routines you find in UNIX are quick and flexible and quite adequate for
the types of applications that are common in a UNIX environment. We can
SORTING AND COSEQUENTIAL PROCESSING
divide
UNIX
two
sorting into
IN
319
UNIX
command, and
categories: (1) the sort
(2)
callable sorting routines.
UNIX
The
Command
sort
options, but the simplest one
(A
lexical order.
character
'\n'.)
command
sorted
one
is
line
line
By
is
sort
command
to sort the lines in an
default the sort utility takes
named on
has
ASCII
many
file in
different
ascending
any sequence of characters ending with the new-line
and writes the sorted
too large to
file is
is
The
fit
in
RAM,
file
sort
its
input
file
name from
to standard output. If the
performs
merge
sort. If
the input line, sort sorts and merges the
file
the
to be
more than
files.
As a simple example, suppose we have an ASCII file called team with
names of members of a basketball team, together with their classes and their
scoring averages:
Jean Smith Senior 7.8
Chris Mason Junior 9.6
Pat Jones Junior 3.2
Leslie Brown Sophomore 18.2
Pat Jones Freshman 11.4
To
sort the
file,
enter
$
sort team
Chris Mason Junior 9.6
Jean Smith Senior 7.8
Leslie Brown Sophomore 18.2
Pat Jones Freshman 11.4
Pat Jones Junior 3.2
Notice that by default sort considers an entire line as the sort key.
Hence, of the two players named "Pat Jones," the freshman occurs first in
"Freshman" is lexically smaller than "Junior." The
assumption that the key is an entire line can be overridden by sorting on
specified key fields. For sort a key field is assumed to be any sequence of
characters delimited by spaces or tabs. You can indicate which key fields to
use for sorting by giving their positions:
the output because
+po5
where posl
tells which
of the
$
-pos2
tells
how many
sort
file
+1
If pos2
is
before starting the key, and pos2
omitted, the key extends to the end
-2 team
team to be sorted according to the
form of posl and pos2
to start a
fields to skip
end with.
Hence, entering
field to
line.
causes the
a
key with.)
that allows
you
last
names. (There
is
also
to specify the character within a field
320
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
The following options, among
ASCII ordering used by sort:
-d
Use "dictionary"
ordering:
others, allow
Only
you
to override the default
and blanks are
letters, digits,
signifi-
cant in comparisons.
-f
"Fold" lowercase letters into uppercase. (This
defined in Chapter 4.)
the canonical
is
form
that
we
-r
"Reverse" the sense of comparison: Sort
Notice that
and within
in
descending ASCII order.
compares groups of
Chapter 4, records
are lines, and fields are groups of characters delimited by white space. This
is consistent with the most common UNIX view of fields and records
sort sorts lines,
characters delimited
within
The
UNIX
text
by white
it
files.
Library Routine
qsort
lines
space. In the language of
UNIX
The
library routine qsort (
general sorting routine. Given a table of data, qsort(
the table in place.
table could be the contents
where the elements of the
table are
its
nel,
int
of a
is
sorts the elements in
file,
loaded into
records. In C, qsort ()
RAM,
defined as
is
follows:
qsortCchar *ba5e,
The argument
base
is
int
a pointer to the
elements in the table; and width
argument, compar(
),
is
the
is
name of
width,
int
base of the data, nel
*compar (
is
the
) )
number of
The last
the size of each element.
a
user-supplied comparison function
) must have two parameters,
which are pointers to elements that are to be compared. When qsort( ) needs
to compare two elements, it passes to comparf ) pointers to these elements,
and compar( ) compares them, returning an integer that is less than, equal to,
or greater than zero, depending on whether the first argument is considered
that qsort(
uses to
compare keys. Compar(
equal to, or greater than the second argument. A full explanation
of how to use qsort( ) is beyond the scope of this text. Consult the UNIX
documentation for details.
less than,
7.8.2 Cosequential Processing
UNIX
utility,
UNIX
number of utilities for cosequential
when used to merge files, is one example.
provides a
introduce three others:
cmp
Utilities in
difj]
The
sort
In this section
we
processing.
cmp, and cotnm.
Suppose you find in your computer that you have two team files,
one called team and the other called my team. You think that the two files are
the same, but you are not sure. You can use the command cmp to find out.
SORTING AND COSEQUENTIAL PROCESSING
IN
321
UNIX
cmp compares two files. If they differ, it prints the byte and line number
where they differ; otherwise it does nothing. If all of one file is identical to
the first part of another, it reports that end-of-file was reached on the
shorter file before any differences were found.
For example, suppose the
team and myteam have the following
file
contents:
team
myteam
Jean Smith Senior 7.8
Chris Mason Junior 9.6
Pat Jones Junior 3.2
Leslie Brown Sophomore 18.2
Pat Jones Freshman 11.4
Jean Smith Senior 7.8
Stacy Fox Senior 1.6
Chris Mason Junior 9.6
Pat Jones Junior 5.2
Leslie Brown Sophomore 18.2
Pat Jones Freshman 11.4
cmp
tells
you where they
differ:
cmp team myteam
team myteam differ:
char 23 line
files on a byte-by-byte basis until it finds a
makes no assumptions about fields or records. It works with
Since cmp simply compares
difference,
it
both text and nontext
files.
useful if you just want to know if two files are different, but it
you much about how they differ. The command diff gives fuller
information, diff telh what lines must be changed in two files to bring them
cmp
diff
doesn't
is
tell
into agreement. For example:
team myteam
diff
1a2
>
Stacy Fox Senior
.6
3c4
<
Pat Jones Junior 3.2
1
Jones Junior 5.2
Pat
>
The "la2"
indicates that after line
in the first
file,
we need
to add line 2
make them agree. This is followed by the line from
the second file that would need to be added. The "3c4" indicates that we
need to change line 3 in the first file to make it look like line 4 in the second
from
file.
the second
This
leading
is
"<"
indicates that
file
to
followed by
a listing
of the two differing
indicates that the line
it is
from the second
is
from the
first
lines,
file,
where the
and the
">"
file.
One other indicator that
a line in the first file
means
could appear in d(ffoutput is "d", meaning that
has been deleted in the second file. For example, "12dl5"
that line 12 in the first
file
appears to have been deleted from being
322
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
right after line 15 in the second
work with
lines
of text.
file.
Notice that
diff,
like sort,
is
designed to
would not work well with non-ASCII
It
text
files.
comm
Whereas diff tells what is different about two files, comm compares
which must be ordered in ASCII collating sequence, to see what
they have in common. The syntax for comm is the following:
two
files,
comm [-123] filel
file2
comm produces three columns of output. Column 1 lists
filel only; column 2 lists lines in file2 only, and column
in
both
files.
the lines that are in
lists lines
that are
For example,
sort team >ts
sort myteam >ms
comm ts ms
Chris Mason Junior 9.6
Jean Smith Senior 7.8
$
$
Leslie Brown Sophomore 18.2
Pat Jones Freshman 11.4
Pat
Jones Junior 3.2
Pat Jones Junior 5.2
Stacy Fox Senior
.6
1
Selecting any of the flags
you
1, 2,
or 3 allows you to print only those columns
are interested in.
The
sort, diff,
representative of
comm, and emp commands (and the
what
is
available in
UNIX
qsort() function) are
for sorting
and cosequential
As we have said, they have many useful options
cover and that you will be interested in reading about.
processing.
that
SUMMARY
In the first half
and apply
merge
it
of this chapter,
to
sorting.
we
develop
two common problems
cosequential processing
updating
In the second half of the chapter
we
model
general ledger and
identify the
most
important factors influencing performance in merge-sorting operations and
suggest some strategies for achieving good performance.
The cosequential processing model can be applied to problems that
involve operations such as matching and merging (and combinations of
these)
on two or more sorted input
files.
We
begin the chapter by
we
don't
SUMMARY
illustrating the use
common
to
to
two
of the model to perform
and
lists,
merge of two
simple match of the elements
lists.
perform these two operations embody
The procedures we develop
all
the basic elements of the
model.
most complete form, the model depends on
certain assumptions
enumerate these assumptions in our
formal description of the model. Given these assumptions, we can describe
the processing components of the model.
The real value of the cosequential model is that it can be adapted to
more substantial problems than simple matches or merges without too
In
its
about the data
much
in the input files.
alteration.
We
We
illustrate this
by using
the
model
to design a general
ledger accounting program.
model involve only two
multiway merge to show how the
All of our early sample applications of the
input
We next adapt the
files.
model
to a
model might be extended to deal with more than two input lists. The
problem of finding the minimum key value during each pass through the
main loop becomes more complex as the number of input files increases. Its
solution involves replacing the three-way selection statement with either a
multiway
selection or a procedure that keeps current keys in a
that can be processed
We
more
see that the application
well for small values of
it is
more
k,
list
structure
conveniently.
of the model to fe-way merging performs
but that for values of k greater than eight or
efficient to find the
minimum key
so,
value by means of a selection
tree.
After discussing multiway merging,
we shift our attention
to a
problem
encountered in a previous chapter
how to sort large files. We
begin with files that are small enough to fit into
and introduce an
efficient sorting algorithm, heapsort, which makes it possible to overlap I/O
with the sorting process.
we
that
RAM
The generally accepted
is some form of merge
sorts
1.
Break the
file
nal sorting
2.
Merge
into
solution
sort.
when
merge
two or more
a file
is
too large for
sort involves
two
in-RAM
steps:
sorted subfiles, or runs, using inter-
methods; and
the runs.
to keep every run in a separate file so we can perform
one pass through the runs. Unfortunately, practical
considerations sometimes make it difficult to do this effectively.
The critical elements when merging many files on disk are seek and
rotational delay times and transmission times. These times depend largely
on two interrelated factors: the number of different runs being merged and
Ideally,
the
we would like
merge
step with
323
324
the
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
amount of internal
buffer space available to hold parts of the runs.
We
can reduce seek and rotational delay times in two ways:
By performing the merge in more than one step;
By increasing the sizes of the initial sorted runs.
In
and/or
both cases, the order of each merge step can be reduced, increasing the
of the internal buffers and allowing more data to be processed per seek.
sizes
Looking
at the first alternative,
means
that
total data
we need
we
see
how
number of seeks
several steps can decrease the
performing the merge
dramatically, though
it
in
also
through the data more than once (increasing
to read
transmission time).
The second
through use of an algorithm called
Replacement selection, which can be implemented
using the selection tree mentioned earlier, involves selecting the key from
memory that has the lowest value, outputting that key, and replacing it
with a new key from the input list.
With randomly organized files, replacement selection can be expected
to produce runs twice as long as the number of internal storage locations
available for performing the algorithms. Although this represents a major
step toward decreasing the number of runs needing to be merged, it carries
with it an additional cost. The need for a large buffer for performing the
replacement selection operation leaves relatively little space for the I/O
buffer, which means that many more seeks are involved in forming the runs
than are needed when the sort step uses an in-RAM sort. If we compare the
alternative
realized
is
replacement selection.
total
number of seeks
required by the
two
different approaches,
replacement selection can actually require more seeks;
tially better
only
when
Next we turn our
there
is a
it
we find
great deal of order in the initial
on
that
performs substanfile.
I/O with
tapes does not involve seeking, the problems and solutions associated with
tape sorting can differ from those associated with disk sorting, although the
fundamental goal of working with fewer, longer runs remains. With tape
sorting, the primary measure of performance is the number of times each
record must be transmitted. (Other factors, such as tape rewind time, can
also be important, but
attention to
we do
file
sorting
tapes. Since file
not consider them here.)
Since tapes do not require seeking, replacement selection
always
good choice
for creating initial runs. Since the
available to hold run files
the
files
on the
tapes. In
is
limited, the next question
most
cases,
each of several tapes, reserving one or
it is
is
how
drives
to distribute
necessary to put several runs on
more other
tapes for the results. This
generally leads to merges of several steps, with the total
being decreased after each merge step.
almost
is
number of
Two
number of runs
approaches to doing
this are
KEY TERMS
balanced merges
and multiphase merges. In
tapes contain approximately the
number of output
a fe-way
balanced merge,
same number of
all
input
runs, there are the
same
tapes as there are input tapes, and the input tapes are read
through entirely during each
of k after each step.
step.
The number of runs
decreased by a
is
factor
multiphase merge (such
as
polyphase merge or
requires that the runs initially be distributed unevenly
the available tapes. This increases the order of the
cascade merge)
among
merge and
but one of
all
as a result
can
number of times each record has to be read. It turns out that the
distribution of runs among the first set of input tapes has a major
on the number of times each record has to be read.
decrease the
initial
effect
Next,
available
we
discuss briefly the existence of sort-merge
on most large systems and can be very
conclude the chapter with
a listing
of
UNIX
utilities,
flexible
utilities
and
which
effective.
are
We
used for sorting and
cosequential processing.
KEY TERMS
Balanced merge. A multistep merging technique that uses the same
number of input devices as output devices. A two-way balanced
merge uses two input tapes, each with approximately the same number of runs on it, and produces two output tapes, each with approximately half as many runs as the input tapes. A balanced merge is
suitable for merge sorting with tapes, though it is not generally the
best method (see multiphase merging).
cmp. A UNIX utility for determining whether two files are identical.
Given two files, it reports the first byte where the two files differ, if
they
comtn.
differ.
A UNIX
utility for
common. Given two
determining what
files, it
the lines that are in the
lines
two
files
have
reports the lines they have in
first file
and not
in the second,
in
common,
and the
lines
second file and not in the first.
Cosequential operations. Operations applied to problems that involve
the performance of union, intersection, and more complex set operations on two or more sorted- input files to produce one or more output files built from some combination of the elements of the input
files. Cosequential operations commonly occur in matching, merging, and file-updating problems.
that are in the
325
326
diff.
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
A UNIX
determining
utility for
two files. It
make it like
all
the lines that differ
between
reports the lines that need to be added to the
the second, the lines that need to be deleted
first file to
from the
file to make it like the first, and the lines that need to be
changed in the first file to make it like the second.
heapsort. A sorting algorithm especially well suited for sorting large
second
files
that
fit
in
RAM
because
variation of heapsort
is
its
execution can overlap with I/O.
used to obtain longer runs in the replacement
selection algorithm.
HIGH_VALUE. A
value used in the cosequential model that
than any possible key value.
By
assigning
HIGH_VALUE
is
greater
as the
current key value for files for which an end-of-file condition has
been encountered, extra logic for dealing with end-of-file conditions
can be simplified.
fc-way merge. A merge in which k input files are merged to produce
one output
file.
LOW_VALUE. A
value used in the cosequential model that is less than
any possible key value. By assigning LOW_VALUE as the previous
key value during initialization, the need for certain other special
start-up code is eliminated.
Match. The process of forming a sorted output file consisting of all the
elements common to two or more sorted input files.
Merge. The process of forming a sorted output file that consists
of the union of the elements from two or more sorted input
files.
Multiphase merge.
merge
which the initial distrimerge is a J 1 -way
merge (J is the number of available tape drives), and in which the
distribution of runs across the tapes is such that the merge performs
bution of runs
efficiently at
is
every
Multistep merge.
multistep tape
such that
at least
the
in
initial
step. (See polyphase merge.)
merge
in
which not
all
runs are merged in one
of runs are merged separately, each set producing one long run consisting of the records from all of its runs.
These new, longer sets are then merged, either all together or in several sets. After each step, the number of runs is decreased and the
step. Rather, several sets
length of the runs
is
increased.
run consisting of the entire
file.
The output of the
is
is
a single
(Be careful not to confuse our use of
the term multistep merge with multiphase merge.)
merge
final step
Although
multistep
more time-consuming than is a single-step
can involve much less seeking when performed on a disk,
theoretically
merge, it
and it may be the only reasonable way
the number of tape drives is limited.
to
perform
merge on tape
if
KEY TERMS
Order of a merge. The number of different
or runs, being
files,
merged. For example, the 100 is the order of a 100-way merge.
Polyphase merge. A multiphase merge in which, ideally, the merge
order
qsort.
is
maximized
ploys
every step.
at
general-purpose
UNIX
library routine for sorting files that
em-
user-defined comparison function.
Replacement
selection.
method of creating initial runs based on the
from memory whose key has the
idea of always selecting the record
lowest value, outputting that record, and then replacing
with
new
record from the input
list.
When new
it
memory
in
records are
brought in whose keys are greater than those of the most recently
output records, they eventually become part of the run being created. When new records have keys that are less than those of the
most recently output records, they are held over for the next run.
Replacement selection generally produces runs that are substantially
longer than runs that can be created by in-RAM sorts, and hence can
help improve performance in merge sorting. When using replacement selection with merge sorts on disk, however, one must be careful that the extra seeking required for replacement selection does not
outweigh the benefits of having longer runs to merge.
Run. A sorted subset of a file resulting from the sort step of a sort
merge or one of the steps of a multistep merge.
Selection tree. A binary tree in which each higher-level node represents
the winner of the comparison between the two descendent keys. The
minimum (or maximum) value in a selection tree is always at the
root node,
making
ing several
lists.
the selection tree a
It is
also a
good
key structure
in
data structure for
replacement selection
algorithms, which can be used for producing long runs for
sorts.
(Tournament
sort,
an internal
merg-
sort, is also
merge
based on the use of
selection tree.)
Sequence checking. Checking that records in
order. It is recommended that all files used
a file are in the
expected
in a cosequential opera-
tion be sequence checked.
A UNIX utility for sorting and merging files.
Synchronization loop. The main loop in the cosequential processing
model. A primary feature of the model is to do all synchronization
sort.
within
a single
loop, rather than in multiple nested loops.
keep the main synchronization loop
objective
is
to
This
is
done by
ble.
as
second
simple as possi-
restricting the operations that occur within the
loop to those that involve current keys, and by relegating
special logic as possible (such as error checking
ing) to subprocedures.
as
much
and end-of-file check-
327
328
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
Theorem
(Knuth).
It is
difficult to decide
which merge pattern
is
best in a given situation.
EXERCISES
1.
Write an output procedure to go with the procedures described in
7. 1 for doing cosequential matching. As a defensive measure, it is a
section
good idea to have the output procedure do sequence checking
manner as the input procedure does.
2.
Consider the cosequential
in the
initialization routine in Fig. 7.4. If
same
PREV_1
LOW_VALUE in this routine, how would
input() have to be changed? How would this affect the adaptability of input ()
PREV_2 were
and
not
set to
for use in other cosequential processing algorithms?
Consider the cosequential merge procedures described in section 7.1.
Comment on how they handle the following situations. If they do not
correctly handle a situation, indicate how they might be altered to do so.
a. List 1 empty and List 2 not empty
b. List 1 not empty and List 2 empty
c. List 1 empty and List 2 empty
3.
4.
it
In the ledger procedure
example
also updates the ledger file
5.
Use
the /e-way
in section 7.2,
with the
merge example
new
modify the procedure so
account balances for the month.
as the basis for a
procedure that
is
/e-way match.
6.
are
Figure 7. 17 shows a loop for doing a /e-way merge, assuming that there
no duplicate names. If duplicate names are allowed, one could add to the
procedure a facility for keeping
names. Alter the procedure to do
7.
In section 7.3,
keys
at
two methods
list
of subscripts of duplicate lowest
are presented for choosing the lowest of/?
each step in a /e-way merge:
Compare
this.
a linear
search and use of a selection tree.
in terms of numbers of
comparisons for k = 2, 4, 8, 16, 32, and 100. Why do you think the linear
approach is recommended for values of k less than 8?
8.
the performances of the
two approaches
Suppose you have 8 megabytes of
800,000-record
file
How long does it take to sort the
rithm described in section 7.5.1?
a.
RAM
available for sorting the
described in section 7.5.1.
file
using the merge sort algo-
EXERCISES
b.
How
long does
take to sort the
it
file
using the keysort algorithm
described in Chapter 5?
c.
Why
work
will keysort not
if
there
is
one megabyte of
RAM
available for the sorting phase?
9.
How much
seek time
is
the one described in section
the
amount of available
10.
Performance
7.5 if the
why
is
50 msec and
500 K? 100 K?
is
often measured in terms of the
is
the
number of comparisons
measuring performance in sorting large
11. In
one-step merge such as
time for an average seek
internal buffer space
in sorting
comparisons. Explain
required to perform
is
number of
not adequate for
files.
our computations involving the merge
sorts,
we made
the simpli-
fying assumption that only one seek and one rotational delay are required
for
any single sequential
access. If this
were not the
case, a great deal
more
time would be required to perform I/O. For example, for the 80-megabyte
file
used in the example in section 7.5.1, for the input step of the sort phase
all records into
for sorting and forming runs"), each
RAM
("reading
individual run could require
many
accesses.
extent size for our hypothetical drive
track),
and that
all files
separately (one seek
a.
b.
c.
Now
let's
assume
that the
20,000 bytes (approximately one
are stored in track-sized blocks that
must be accessed
and one rotational delay per block).
How many seeks does step now require?
How long do steps 1, 2, 3, and 4 now take?
How does increasing the file size by a factor
total
12.
is
of 10
now
affect the
time required for the merge sort?
Derive two formulas for the number of seeks required to perform the
step of a one-step /e-way sort merge of a file with r records divided
merge
into k runs,
If
where the amount of available
an internal sort
of each run is M,
the length of each run
Assume
RAM
is
equivalent to
M records.
used for the sort phase, you can assume that the length
but if replacement selection is used, you can assume that
is
is
about 2M.
Why?
system with four separately addressable disk drives,
hundred megabytes. Assume that the
80-megabyte file described in section 7.5 is already on one of the drives.
Design a sorting procedure for this sample file that uses the separate drives
to minimize the amount of seeking required. Assume that the final sorted
file is written off to tape and that buffering for this tape output is handled
invisibly by the operating system. Is there any advantage to be gained by
using replacement selection?
13.
a quiet
each of which
is
able to hold several
329
330
COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES
14. Use replacement selection to
assuming P = 4.
a. 23 29 5 17 9 55 41 3 51
b. 3 5 9 11 17 18 23 24 29
c. 55 51 47 41 33 29 24 23
Suppose you have
15.
so 10 cylinders
produce runs from the following
33 18 24 11 47
33 41 47 51 55
18 17 11 9 5 3
a disk drive that has 10
may be
files,
read/write heads per surface,
accessed at any one time without having to
you could control
on disk, how might you be able to
a sort merge?
actuator arm. If
move
the
the physical organization of runs stored
exploit this arrangement in performing
Assume we need
16.
patterns starting
c.
8-4-2
7-4-3
6-5-3
d.
5-5-4.
a.
b.
to merge 14 runs on four tape drives. Develop merge
from each of these initial distributions:
A four-tape polyphase merge is to be performed to
17.
25 16 45 29 38 23 50 22 19 43 30
runs are of length
following runs
(a
1.
After
1 1
27 47. The original
initial sorting,
1:
24/36/13/25
Tape
2:
16
45
29
38
23
50
Tape
3:
22
19
43
30
11
27
b.
1,
24 36 13
list is
2,
on tape
4. Initial
and 3 contain the
slash separates runs):
Tape
a.
tapes
list
sort the
47
Show the contents of tape 4 after one merge phase.
Show the contents of all four tapes after the second and
fourth
phases.
c.
Comment on
the appropriateness of the original
4-6-7
distribu-
tion for performing a polyphase merge.
18. Obtain a copy of the manual for one or more commercially available
sort-merge packages. Identify the different kinds of choices available to
users of the packages. Relate the options to the performance issues discussed
in this chapter.
Programming Exercises
19.
in
20.
in
Implement the cosequential match procedures described
Implement the cosequential merge procedures described
in section 7.1
or Pascal.
or Pascal.
in section 7.
FURTHER READINGS
21.
Implement
complete program corresponding to the solution to the
general ledger problem presented in section 7.2.
22.
Design and implement
a.
Examine
program to do the following:
two sorted files Ml and M2.
the contents of
COMMON
containing a copy of records
Produce a third file
from the original two files that are identical.
c. Produce a fourth file DIFF that contains all records from the two
b.
files
that are not identical.
FURTHER READINGS
The
two separate topics: the
model for cosequential processing, and discussion of external
merging procedures on tape and disk. Although most file processing texts discuss
cosequential processing, they usually do it in the context of specific applications,
rather than presenting a general model that can be adapted to a variety of
applications. We found this useful and flexible model through Dr. James VanDoren,
who developed this form of the model himself for presentation in the file structures
course that he teaches. We are not aware of any discussion of the cosequential model
subject matter treated in this chapter can be divided into
presentation of a
elsewhere in the literature.
of work has been done toward developing simple and effective
file updating, which is an important instance of
cosequential processing. The results deal with some of the same problems the
cosequential model deals with, and some of the solutions are similar. See Levy
(1982) and Dwyer (1981) for more.
Unlike cosequential processing, external sorting is a topic that is covered
widely in the literature. The most complete discussion of the subject, by far, is in
Quite
a bit
algorithms to do sequential
Knuth
some
Knuth
book in
(1973b). Students interested in the topic of external sorting must, at
point, familiarize themselves with Knuth's definitive
also describes replacement selection, as evidenced
summary of the
subject.
by our quoting from
his
this chapter.
Salzberg (1987) provides an excellent analytical treatment of external sorting,
and Salzberg (1990) describes an approach that takes advantage of replacement
selection, parallelism, distributed computing, and large amounts of memory. Lorin
(1975) spends several chapters on sort-merge techniques. Bradley (1982) provides a
good treatment of replacement selection and multiphase merging, including some
interesting comparisons of processing time on different devices. Tremblay and
Sorenson (1984) and Loomis (1983) also have chapters on external sorting.
Since the sorting of large files accounts for a large percentage of data processing
time, most systems have sorting utilities available. IBM's DFSORT (described in
IBM, 1985) is a flexible package for handling sorting and merging applications. A
VAX
sort utility
is
described in Digital (1984).
331
B-Trees and Other
Tree-structured File
Organizations
CHAPTER OBJECTIVES
Place the development of B-trees in the historical
context of the problems they were designed to
solve.
I
Look
might be
paged AVL
briefly at other tree structures that
used on secondary storage, such
as
trees.
I
Provide an understanding of the important properties possessed by B-trees, and show how these
properties are especially well suited to secondary
storage applications.
Describe fundamental operations on B-trees.
I
Introduce the notion of page buffering and virtual
B-trees.
Describe variations of the fundamental B-tree algorithms, such as those used to build B * trees and
B-trees with variable-length records.
333
CHAPTER OUTLINE
8.1
Introduction:
The Invention of the
8.13 Deletion, Redistribution,
B-Tree
Concatenation
8.2
Statement of the Problem
8.13.1
8.3
Binary Search Trees
8.4
AVL
8.5
Paged Binary Trees
8.6
The Problem with
8.7
as a
Solution
and
Redistribution
8.14 Redistribution during Insertion:
Way to Improve Storage
Trees
Utilization
8.15
the
Top-Down
B*
Trees
8.16 Buffering of Pages: Virtual
Construction of Paged Trees
B-Trees
B-Trees: Working up from the
8.16.1
Bottom
8.16.2 Replacement Based on Page
LRU
Replacement
Height
and Promoting
8.8
Splitting
8.9
Algorithms for B-Tree Searching
and Insertion
8.10
B-Tree Nomenclature
8.18 Variable-length
Records and Keys
8.11
Formal Definition of B-Tree
C Program
Keys into
Properties
B-Tree
8.12 Worst-case Search
8.16.3 Importance of Virtual B-Trees
Depth
8.17
Placement of Information
Associated with the Key
Pascal
to Insert
Program
to Insert
Keys into
B-Tree
8.1
Introduction:
Computer
science
that at the start
The Invention
is a
young
of 1970,
of the B-Tree
discipline.
As evidence of this youth, consider
had twice travelled to the moon,
after astronauts
B-trees did not yet exist. Today, only 15 years
a
major, general-purpose
file
system that
is
later, it is
not built around
hard to think of
a
B-tree design.
Douglas Comer, in his excellent survey article, "The Ubiquitous
B-Tree" [1979], recounts the competition among computer manufacturers
and independent research groups that developed in the late 1960s. The goal
was the discovery of a general method for storing and retrieving data in
large file systems that would provide rapid access to the data with minimal
overhead cost. Among the competitors were R. Bayer and E. McCreight,
who were working for Boeing Corporation at that time. In 1972 they
published an article, "Organization and Maintenance of Large Ordered
335
INTRODUCTION: THE INVENTION OF THE B-TREE
By
Indexes," which announced B-trees to the world.
published his survey
Comer was
able
B-trees had already
article,
to
that
state
organization for indexes in
We
become
so widely used that
de facto,
is,
when Comer
standard
the
database system."
have reprinted the
"the B-tree
1979,
first
few paragraphs of the 1972 Bayer and
McCreight article^ because it so concisely describes the facets of the
problem that B-trees were designed to solve: how to access and maintain
efficiently an index that is too large to hold in memory. You will remember
that this is the same problem that is left unresolved in Chapter 6, on simple
index structures. It will be clear as you read Bayer and McCreight's
introduction that their
back
in the
work goes
straight to the heart
of the issues
we
raise
indexing chapter.
In this paper
index for
we
consider the problem of organizing and maintaining an
dynamically changing random access
a collection
of index elements which are pairs
adjacent data items,
The key x
information
random
namely
identifies
file.
(x, a)
By
an index
of fixed
we mean
size physically
key x and some associated information
unique element
in
the
index,
the
a.
associated
typically a pointer to a record or a collection of records in a
is
access
file.
For
this
paper the associated information
is
of no
further interest.
We assume that the index itself is so voluminous that only rather small
parts
of it can be kept in main store
must be kept on some backup
store.
are pseudo random access devices
time
as
opposed
The
class
which have
random
to a true
one time. Thus the bulk of the index
at
of backup stores considered
rather long access or wait
access device like core store
and
rather high data rate once the transmission of physically sequential data has
been
Typical pseudo random access devices
initiated.
moving head
disks,
Since the data
the index and
keys
to
drums, and data
file itself
changes,
it
elements,
retrieve
are:
fixed and
cells.
must be possible not only
to search
but also to delete and to insert
more accurately index elements economically.
The index orga-
nization described in this paper allows retrieval, insertion, and deletion ot
keys in time proportional to log^
size
or better, where
/ is
the size of the
dependent natural number which describes the page
such that the performance of the maintenance and retrieval scheme
index, and k
is
a device
becomes near optimal.
Exercises 17, 18, and 19 at the end of Chapter 6 introduced
of a paged index. Bayer and McCreight's statement that
developed a scheme with retrieval time proportional to log^ /,
related to the page size, is very significant. As we will see, the use
"From
Ada-Informatica, 1:173-189,
permission.
1972, Springer Verlag,
New
the notion
they have
where k
is
of a B-tree
York. Reprinted with
336
B-TREES AND OTHER TREE-STRUCTURED
FILE
ORGANIZATIONS
of 64 to index a file with a million records results in being
key for any record in no more than four seeks to the disk.
A binary search on the same file can require as many as 20 seeks. Moreover,
we are talking about getting this kind of performance from a system that
requires only minimal overhead as keys are inserted and deleted.
Before looking in detail at Bayer and McCreight's solution, let's first
return to a more careful look at the problem, picking up where we left off
in Chapter 6. We will also look at some of the data and file structures that
were routinely used to attack the problem before the invention of B-trees.
Given this background, it will be easier to appreciate the contribution made
by Bayer and McCreight's work.
with
page
size
able to find the
One
provides
matter before
last
we
begin:
Why
the
name
B-tree?
Comer
(1979)
this footnote:
'
lie origin
of 'B-tree
As we
Creight].
'
'
shall see,
has never been explained by [Bayer and
Mc-
"balanced," "broad," or "bushy" might apply.
Others suggest that the "B" stands for Boeing. Because of his contribuhowever, it seems appropriate to think of B-trees as "Bayer"-trees.
tions,
8.2 Statement of the Problem

The fundamental problem with keeping an index on secondary storage is, of course, that accessing secondary storage is slow. This fundamental problem can be broken down into two more specific problems:

Binary searching requires too many seeks. Searching for a key on a disk often involves seeking to different disk tracks. Since seeks are expensive, a search that has to look in more than three or four locations before finding the key often requires more time than is desirable. If we are using a binary search, four seeks is only enough to differentiate between 15 items. An average of about 9.5 seeks is required to find a key in an index of 1,000 items using a binary search. We need to find a way to home in on a key using fewer seeks.

It can be very expensive to keep the index in sorted order so we can perform a binary search. As we saw in Chapter 6, if inserting a key involves moving a large number of the other keys in the index, index maintenance is very nearly impractical on secondary storage for indexes consisting of only a few hundred keys, much less thousands of keys. We need to find a way to make insertions and deletions that have only local effects in the index, rather than requiring massive reorganization.
FIGURE 8.1 Sorted list of keys: AX CL DE FB FT HN JD KF NR PA RF SD TK WS YJ.
These were the two critical problems that confronted Bayer and McCreight in 1970. They serve as guideposts for steering our discussion of the use of tree structures for secondary storage retrieval.

8.3 Binary Search Trees as a Solution
Let's begin by addressing the second of these two problems, looking at the cost of keeping a list in sorted order so we can perform binary searches. Given the sorted list in Fig. 8.1, we can express a binary search of this list as a binary search tree, as shown in Fig. 8.2.

Using elementary data structure techniques, it is a simple matter to create nodes that contain right and left link fields so the binary search tree can be constructed as a linked structure. Figure 8.3 illustrates a linked representation of the first two levels of the binary search tree shown in Fig. 8.2. In each node, the left and right links point to the left and right children of the node.
If each node is treated as a fixed-length record in which the link fields contain relative record numbers (RRNs) pointing to other nodes, then it is possible to place such a tree structure on secondary storage. Figure 8.4 illustrates the contents of the 15 records that would be required to form the binary tree depicted in Fig. 8.2.

FIGURE 8.2 Binary search tree representation of the list of keys.

FIGURE 8.3 Linked representation of part of a binary search tree.

Note that over half of the link fields in the file are empty because they are leaf nodes with no children. In practice, these fields need to contain some special character, such as -1, to indicate that the search through the tree has reached the leaf level and that there are no more nodes on the search path. We leave the fields blank in this figure to make them more noticeable, illustrating the potentially substantial cost in terms of space utilization incurred by this kind of linked representation of a tree.

But to focus on the costs and not the advantages is to miss the important new capability that this tree structure gives us: We no longer have to sort the file to be able to perform a binary search. Note that the records in the file illustrated in Fig. 8.4 appear in random rather than sorted order.
FIGURE 8.4 Record contents for a linked representation of the binary tree in Fig. 8.2. Each record holds a key and the relative record numbers of its left child and right child.
FIGURE 8.5 Binary search tree with LV added.
The sequence of the records in the file has no necessary relation to the structure of the tree; all the information about the logical structure is carried in the link fields. The very positive consequence that follows from this is that if we add a new key to the file, such as LV, we need only link it to the appropriate leaf node to create a tree that provides search performance that is as good as we would get with a binary search on a sorted list. The tree with LV added is illustrated in Fig. 8.5.
Search performance on this tree is still good because the tree is in a balanced state. By balanced we mean that the height of the shortest path to a leaf does not differ from the height of the longest path by more than one level. For the tree in Fig. 8.5, this difference of one is as close as we can get to complete balance, where all the paths from root to leaf are exactly the same length.
Consider what happens if we go on to enter the following eight keys to
the tree in the sequence in which they appear:
NP MB TM LA UF ND TS NK
Just searching down through the tree and adding each key at its correct position in the search tree results in the tree shown in Fig. 8.6. The tree is now out of balance. This is a typical result for trees built by placing keys into the tree as they occur without rearrangement. The resulting disparity between the length of various search paths is undesirable in any binary search tree, but is especially troublesome if the nodes of the tree are being kept on secondary storage. There are now keys that require seven, eight, or nine seeks for retrieval. A binary search on a sorted list of these 24 keys requires only five seeks in the worst case. Although the use of a tree lets us avoid sorting, we are paying for this convenience in terms of extra seeks at retrieval time. For trees with hundreds of keys, in which an out-of-balance search path might extend to 30, 40, or more seeks, this price is too high.
FIGURE 8.6 Binary search tree showing the effect of added keys.

8.4 AVL Trees
Earlier we said that there is no necessary relationship between the order in which keys are entered and the structure of the tree. We stress the word necessary because it is clear that order of entry is, in fact, important in determining the structure of the sample tree illustrated in Fig. 8.6. The reason for this sensitivity to the order of entry is that, so far, we have just been linking the newest nodes at the leaf levels of the tree. This approach can result in some very undesirable tree organizations. Suppose, for example, that our keys consist of the letters A-G, and that we receive these keys in alphabetical order. Linking the nodes together as we receive them produces a degenerate tree that is, in fact, nothing more than a linked list, as illustrated in Fig. 8.7.

FIGURE 8.7 A degenerate tree.

FIGURE 8.8 AVL trees.
The solution to this problem is somehow to reorganize the nodes of the tree as we receive new keys, maintaining a near optimal tree structure. One elegant method for handling such reorganization results in a class of trees known as AVL trees, in honor of the pair of Russian mathematicians, G. M. Adel'son-Vel'skii and E. M. Landis, who first defined them. An AVL tree is a height-balanced tree. This means that there is a limit placed on the amount of difference that is allowed between the heights of any two subtrees sharing a common root. In an AVL tree the maximum allowable difference is one. An AVL tree is therefore called a height-balanced 1-tree or HB(1) tree. It is a member of a more general class of height-balanced trees known as HB(k) trees, which are permitted to be k levels out of balance.

The trees illustrated in Fig. 8.8 have the AVL, or HB(1), property. Note that no two subtrees of any root differ by more than one level. The trees in Fig. 8.9 are not AVL trees. In each of these trees, the root of the subtree that is not in balance is marked with an X.

FIGURE 8.9 Trees that are not AVL trees.

The two features that make AVL trees important are as follows:

By setting a maximum allowable difference in the height of any two subtrees, AVL trees guarantee a certain minimum level of performance in searching; and

Maintaining a tree in AVL form as new nodes are inserted involves the use of one of a set of four possible rotations. Each of the rotations is confined to a single, local area of the tree. The most complex of the rotations requires only five pointer reassignments.

FIGURE 8.10 A completely balanced search tree.
AVL trees are an important class of data structure. The operations used to build and maintain AVL trees are described in Knuth (1973b), Standish (1980), and elsewhere. AVL trees are not themselves directly applicable to most file structure problems because, like all strictly binary trees, they have too many levels; they are too deep. However, in the context of our general discussion of the problem of accessing and maintaining indexes that are too large to fit in memory, AVL trees are interesting because they suggest that it is possible to define procedures that maintain height balance.
The fact that an AVL tree is height-balanced guarantees that search performance approximates that of a completely balanced tree. For example, the completely balanced form of a tree made up from the input keys

B C G E F D A

is illustrated in Fig. 8.10, and the AVL tree resulting from the same input keys, arriving in the same sequence, is illustrated in Fig. 8.11. For a completely balanced tree, the worst-case search to find a key, given N possible keys, looks at log_2 (N + 1) levels of the tree. For an AVL tree, the worst-case search could look at 1.44 log_2 (N + 2) levels. So, given 1,000,000 keys, a completely balanced tree requires seeking to 20 levels for some of the keys, but never to 21 levels. If the tree is an AVL tree, the maximum number of levels increases to only 28.

FIGURE 8.11 A search tree constructed using AVL procedures.
This is a very interesting result, given that the AVL procedures guarantee that a single reorganization requires no more than five pointer reassignments.
Empirical studies by VanDoren and Gray (1974), among others, have shown that such local reorganizations are required for approximately every other insertion into the tree and for approximately every fourth deletion. So height balancing using AVL methods guarantees that we will obtain a reasonable approximation to optimal binary tree performance at a cost that is acceptable in most applications using primary, random-access memory.

When we are using secondary storage, a procedure that requires more than five or six seeks to find a key is less than desirable; 20 or 28 seeks is unacceptable. Returning to the two problems that we identified earlier in this chapter:
Binary searching requires too many seeks; and

Keeping an index in sorted order is expensive,

we can see that height-balanced trees provide an acceptable solution to the second problem. Now we need to turn our attention to the first problem.

8.5 Paged Binary Trees
Once again we are confronting what is perhaps the most critical feature of secondary storage devices: It takes a relatively long time to seek to a specific location, but once the read head is positioned and ready, reading or writing a stream of contiguous bytes proceeds rapidly. This combination of slow seek and fast data transfer leads naturally to the notion of paging. In a paged system, you do not incur the cost of a disk seek just to get a few bytes. Instead, once you have taken the time to seek to an area of the disk, you read in an entire page from the file. This page might consist of a great many individual records. If the next bit of information you need from the disk is in the page that was just read in, you have saved the cost of a disk access.

Paging, then, is a potential solution to our searching problem. By dividing a binary tree into pages and then storing each page in a block of contiguous locations on disk, we should be able to reduce the number of seeks associated with any search. Figure 8.12 illustrates such a paged tree. In this tree we are able to locate any one of the 63 nodes in the tree with no more than two disk accesses. Note that every page holds seven nodes and can branch to eight new pages. If we extend the tree to one additional level of paging, we add 64 new pages; we can then find any one of 511 nodes in only three seeks. Adding yet another level of paging lets us find any one of 4,095 nodes in only four seeks. A binary search of a list of 4,095 items can take as many as 12 seeks.
FIGURE 8.12 Paged binary tree.
Clearly, breaking the tree into pages has the potential to result in faster searching on secondary storage, providing us with much faster retrieval than any other form of keyed access that we have considered up to this point.

Moreover, our use of a page size of seven in Fig. 8.12 is dictated more by the constraints of the printed page than by anything having to do with secondary storage devices. A more typical example of a page size might be 8 kilobytes capable of holding 511 key/reference field pairs. Given this page size, and assuming that each page contains a completely balanced, full tree, and that the pages themselves are organized as a completely balanced, full tree, it is then possible to find any one of 134,217,727 keys with only three seeks. That is the kind of performance we are looking for. Note that, while the number of seeks required for a worst-case search of a completely full, balanced binary tree is

log_2 (N + 1),

where N is the number of keys in the tree, the number of seeks required for the paged versions of a completely full, balanced tree is

log_(k+1) (N + 1),

where N is, once again, the number of keys. The new variable, k, is the number of keys held in a single page. The second formula is actually a generalization of the first, since the number of keys in a page of a purely binary tree is 1.

It is the logarithmic effect of the page size that makes the impact of paging so dramatic:

log_2 (134,217,727 + 1) = 27 seeks
log_(511+1) (134,217,727 + 1) = 3 seeks.
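Expressed in code, this comparison is easy to reproduce for other page sizes. The short C program below is our own sketch, not one of the programs given at the end of this chapter; it simply evaluates the two formulas above for the example numbers used in the text.

#include <math.h>
#include <stdio.h>

/* Worst-case seeks for a completely full, balanced binary tree of n keys. */
static double binary_seeks(double n_keys)
{
    return log2(n_keys + 1.0);
}

/* Worst-case seeks for the paged version, with k keys held in each page. */
static double paged_seeks(double n_keys, double keys_per_page)
{
    return log2(n_keys + 1.0) / log2(keys_per_page + 1.0);
}

int main(void)
{
    double n = 134217727.0;                       /* example from the text */

    printf("binary tree:     %.0f seeks\n", ceil(binary_seeks(n)));   /* 27 */
    printf("paged, k = 511:  %.0f seeks\n", ceil(paged_seeks(n, 511.0)));  /* 3 */
    return 0;
}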
The use of large pages does not come free. Every access to a page requires the transmission of a large amount of data, most of which is not used. This extra transmission time is well worth the cost, however, because it saves so many seeks, which are far more time-consuming than the extra transmissions. A much more serious problem, which we look at next, has to do with keeping the paged tree organized.
8.6 The Problem with the Top-down Construction of Paged Trees
Breaking a tree into pages is a strategy that is well suited to the physical characteristics of secondary storage devices such as disks. The problem, once we decide to implement a paged tree, is how to build it. If we have the entire set of keys in hand before the tree is built, the solution to the problem is relatively straightforward: We can sort the list of keys and build the tree from this sorted list. Most importantly, if we plan to start building the tree from the root, we know that the middle key in the sorted list of keys should be the root key within the root page of the tree. In short, we know where to begin and are assured that this beginning point will divide the set of keys in a balanced manner.
Unfortunately, the problem is much more complicated if we are receiving keys in random order and inserting them as soon as we receive them. Assume that we must build a paged tree as we receive the following sequence of single-letter keys:

CSDTAMPIBWNGURKEHOLJYQZFXV
We will build a paged binary tree that contains a maximum of three keys per page. As we insert the keys, we rotate them within a page as necessary to keep each page as balanced as possible. The resulting tree is illustrated in Fig. 8.13.

FIGURE 8.13 Paged tree constructed from keys arriving in random input sequence.

Evaluated in terms of the depth of the tree (measured in pages), this tree does not turn out too badly. (Consider, for example, what happens if the keys arrive in alphabetical order.) Even though this tree is not dramatically misshapen, it clearly illustrates the difficulties inherent in building a paged binary tree from the top down. When you start from the root, the initial keys must, of necessity, go into the root. In this example at least two of these keys, C and D, are not keys that we want there. They are adjacent in sequence and tend toward the beginning of the total set of keys. Consequently, they force the tree out of balance.
Once the wrong keys are placed in the root of the tree (or in the root
of any subtree further down the tree), what can you do about it?
Unfortunately, there is no easy answer to this. We cannot simply rotate
entire pages of the tree in the same way that we would rotate individual
keys in an unpaged tree. If we rotate the tree so the initial root page moves
down to the left, moving the C and D keys into a better position, then the
S key is out of place. So we must break up the pages. This opens up a whole
world of possibilities and difficulties. Breaking up the pages implies
rearranging them to create new pages that are both internally balanced and
well arranged relative to other pages. Try creating a page rearrangement
algorithm for the simple, three-keys-per-page tree from Fig. 8.13. You will
find it very difficult to create an algorithm that has only local effects,
rearranging just a few pages. The tendency is for rearrangements and
adjustments to spread out through a large part of the tree. This situation
grows even more complex with larger page sizes.
So, although we have determined that the idea of collecting keys into
pages is a very good one from the standpoint of reducing seeks to the disk,
we have not yet found a way to collect the right keys. We are still confronting at least two unresolved questions:

How do we ensure that the keys in the root page turn out to be good separator keys, dividing up the set of other keys more or less evenly?

How do we avoid grouping keys, such as C, D, and S in our example, that should not share a page?

There is, in addition, a third question that we have not yet had to confront because of the small page size of our sample tree:

How can we guarantee that each of the pages contains at least some minimum number of keys? If we are working with a larger page size, such as 8,191 keys per page, we want to avoid situations in which a large number of pages each contains only a few dozen keys.

Bayer and McCreight's 1972 B-tree article provides a solution directed precisely toward these questions.
8.7 B-Trees: Working up from the Bottom
A number of the elegant, powerful ideas used in computer science have grown out of looking at a problem from a different viewpoint. B-trees are an example of this viewpoint-shift phenomenon.

The key insight required to make the leap from the kinds of trees we have been considering to a new solution, B-trees, is that we can choose to build trees upward from the bottom instead of downward from the top. So far, we have assumed the necessity of starting construction from the root as a given. Then, as we found that we had the wrong keys in the root, we tried to find ways to repair the problem with rearrangement algorithms. Bayer and McCreight recognized that the decision to work down from the root was, of itself, the problem. Rather than finding ways to undo a bad situation, they decided to avoid the difficulty altogether. With B-trees, you allow the root to emerge, rather than set it up and then find ways to change it.

8.8 Splitting and Promoting
In a B-tree, a page, or node, consists of an ordered sequence of keys and a set of pointers. There is no explicit tree within a node, as with the paged trees shown previously; there is just an ordered list of keys and some pointers. The number of pointers always exceeds the number of keys by one. The maximum number of pointers that can be stored in a node is called the order of the B-tree. For example, suppose we have an order-eight B-tree. Each page can hold at most seven keys and eight pointers. Our initial leaf of the tree might have a structure like that illustrated in Fig. 8.14 after the insertion of the letters

B C G E D A

FIGURE 8.14 Initial leaf of a B-tree with a page size of seven.

The starred (*) fields are the pointer fields. In this leaf, as in any other leaf node, the value of all the pointers is set to indicate end-of-list. By definition, a leaf node has no children in the tree; consequently, the pointers do not lead to other pages in the tree. We assume that the pointers in the leaf pages usually contain an invalid pointer value, such as -1. Note, incidentally, that this leaf is also our root.
In a real-life application there is also usually some other information stored with the key, such as a reference to a record containing data that are associated with the key. Consequently, additional pointer fields in each page might actually lead to some associated data records that are stored elsewhere. But, paraphrasing Bayer and McCreight, for our present purposes, "the associated information is of no further interest."

Building the first page is easy enough. As we insert new keys, we use a single disk access to read the page into memory and, working in memory, insert the key into its place in the page. Since we are working in electronic memory, this insertion is relatively inexpensive compared to the cost of additional disk accesses.
But what happens when additional keys come in? Suppose we try to add the J key to the B-tree. We find that our leaf is full. We then split the leaf into two leaves, distributing the keys as evenly as we can between the old leaf node and the new one, as shown in Fig. 8.15. Since we now have two leaves, we need to create a higher level in the tree to enable us to choose between the leaves when searching. In short, we need to create a new root. We do this by promoting a key that separates the two leaves. In this case, we promote the E from the first position in the second leaf, as illustrated in Fig. 8.16.

FIGURE 8.15 Splitting the leaf to accommodate the new J key.

FIGURE 8.16 Promotion of the E key into a root node.
example we describe the
steps to
make
and promotion are handled
Let's see
how
paged binary
splitting
and the promotion operations
the procedure as clear as possible; in practice, splitting
in a single operation.
B-tree grows given the key sequence that produces the
tree illustrated in Fig. 8.13.
The sequence
is
CSDTAMPIBWNGURKEHOLJYQZFXV
We use an order-four B -tree
(four poi nter fields and three key fields pe r
page), since this corresponds to the page size of the paged binary tree.
such
more
Using
small page size has the additional advantage of causing pages to
frequently,
promotion.
We
split
providing us with more examples of splitting and
omit
explicit indication
of the pointer
fields so
we
can
fit
on the printed page.
8. 17 illustrates the growth of the tree up to the point at which the
root node is about to split. Figure 8.18 shows the tree after the splitting of
the root node. The figure also shows how the tree continues to grow as the
larger tree
Figure
remaining keys in the sequence are added. We number each of the tree's
pages (upper left corner of each node) so you can distinguish the newly
added pages from the ones already in the tree.
Note that the tree is always perfectly balanced with regard to height; the path from the root to any leaf is the same length as the path from the root to any other leaf. Also note that the keys that are promoted upward into the tree are necessarily the kind of keys we want in a root: keys that are good separators. By working up from the leaf level, splitting and promoting as pages fill up, we overcome the problems that plague our earlier paged binary tree efforts.
FIGURE 8.17 Growth of a B-tree, part I. The tree grows to a point at which splitting of the root is imminent. Insertion of C, S, and D into the initial page; insertion of T forces a split and the promotion of S; A is added without incident; insertion of M forces another split and the promotion of D; P, I, B, and W are inserted into existing pages; insertion of N causes another split, followed by the promotion of N; G, U, and R are added to existing pages.

FIGURE 8.18 Growth of a B-tree, part II. The root splits to add a new level; the remaining keys are inserted. Insertion of K causes a split at leaf level, followed by the promotion of K, which causes a split of the root; N is promoted to become the new root. E is added to a leaf. Insertion of H causes a leaf to split; H is promoted; O, L, and J are added. Insertion of Y and Q forces two more leaf splits and promotions; the remaining letters are added.
8.9 Algorithms for B-Tree Searching and Insertion
Now that we have had a brief look at how B-trees work on paper, let's outline the structures and algorithms required to make them work in a computer. Most of the code that follows is pseudocode. C and Pascal implementations of the algorithms can be found at the end of this chapter.

Page Structure

We begin by defining one possible form for the page used by a B-tree. As you see later in this chapter and in the following chapter, there are many different ways to construct the page of a B-tree. We start with a simple one in which each key is a single character. If the maximum number of keys and children allowed on a page is MAXKEYS and MAXCHILDREN, respectively, then the following structures expressed in C and Pascal describe a page called PAGE.
In C:

struct BTPAGE {
    short KEYCOUNT;            /* number of keys stored in PAGE */
    char  KEY[MAXKEYS];        /* the actual keys               */
    short CHILD[MAXKEYS+1];    /* RRNs of children              */
} PAGE;
In Pascal:

TYPE
    BTPAGE = RECORD
        KEYCOUNT : integer;
        KEY      : array [1..MAXKEYS] of char;
        CHILD    : array [1..MAXCHILDREN] of integer
    END;
VAR
    PAGE : BTPAGE;
Given this page structure, the file containing the B-tree consists of a set of fixed-length records. Each record contains one page of the tree. Since the keys in the tree are single letters, this structure uses an array of characters to hold the keys. More typically, the key array is a vector of strings rather than just a vector of characters. The variable PAGE.KEYCOUNT is useful when the algorithms must determine whether a page is full or not. The PAGE.CHILD[] array contains the RRNs of PAGE's children, if there are any. When there is no descendent, the corresponding element of PAGE.CHILD[] is set to a nonaddress value, which we call NIL. Figure 8.19 shows two pages in a B-tree of order four.
FIGURE 8.19 A B-tree of order four. (a) An internal node and some leaf nodes. (b) Nodes 2 and 3, as we might envision them in the structure PAGE: the contents of KEYCOUNT, the KEY array, and the CHILD array for pages 2 and 3.
Searching

The first of the B-tree algorithms we examine is a tree-searching procedure. Searching is a good place to begin because it is relatively simple yet still illustrates the characteristic aspects of most B-tree algorithms:

They are recursive; and

They work in two stages, operating alternatively on entire pages and then within pages.

The searching procedure calls itself recursively, seeking to a page and then searching through the page, looking for the key at successively lower levels of the tree until it either finds the key or finds that it cannot descend further, having reached beyond the leaf level. Figure 8.20 contains a description of the searching procedure in pseudocode.
FUNCTION: search (RRN, KEY, FOUND_RRN, FOUND_POS)
    if RRN == NIL then    /* stopping condition for the recursion */
        return NOT FOUND
    else
        read page RRN into PAGE
        look through PAGE for KEY, setting POS equal to the
            position where KEY occurs or should occur
        if KEY was found then
            FOUND_RRN := RRN    /* current RRN contains the key */
            FOUND_POS := POS
            return FOUND
        else    /* follow CHILD reference to next level down */
            return (search (PAGE.CHILD[POS], KEY, FOUND_RRN, FOUND_POS))
        endif
    endif
end FUNCTION
FIGURE 8.20 Function search (RRN, KEY, FOUND_RRN, FOUND_POS) searches recursively through the B-tree to find KEY. Each invocation searches the page referenced by RRN. The arguments FOUND_RRN and FOUND_POS identify the page and position of the key, if it is found. If search() finds the key, it returns FOUND. If it goes beyond the leaf level without finding the key, it returns NOT FOUND.
Let's work through the function by hand, searching for the key K in the tree illustrated in Fig. 8.21. We begin by calling the function with the RRN argument equal to the RRN of the root (2). This RRN is not NIL, so the function reads the root into PAGE, then searches for K among the elements of PAGE.KEY[]. The K is not found. Since K should go between D and N, POS identifies position 1 in the root as the position of the pointer to where the search should proceed. (We will use zero origin indexing in these examples, so the leftmost key in a page is PAGE.KEY[0], and the RRN of the leftmost child is PAGE.CHILD[0].) So search() calls itself, this time using the RRN stored in PAGE.CHILD[1]. This RRN is 3.

On the next call, search() reads the page containing the keys G, I, and M. Once again the function searches for K among the keys in PAGE.KEY[]. Again, K is not found. This time PAGE.CHILD[2] indicates where the search should proceed. Search() calls itself again, this time using the RRN stored in PAGE.CHILD[2]. Since this call is from a leaf node, PAGE.CHILD[2] is NIL, so the call to search() fails immediately. The value NOT FOUND is passed back through the various levels of return statements until the program that originally calls search() receives the information that the key is not found.
ALGORITHMS FOR B-TREE SEARCHING AND INSERTION
D N
/
G
A B C
/.XT
M
P R
T U
FIGURE 8.21
B-tree used for
the search example.
Now let's use search() to look for M, which is in the tree. It follows the same downward path that it did for K, but this time it finds the M in position 2 of page 3. It stores the values 3 and 2 in FOUND_RRN and FOUND_POS, respectively, indicating that M can be found in position 2 of page 3, and returns the value FOUND.
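For readers who want to see the pseudocode of Fig. 8.20 in a real language before turning to the end of the chapter, here is a minimal C sketch of the same logic. It is ours rather than the chapter's implementation, and it assumes the BTPAGE structure shown earlier, a helper btread() that reads the page stored at a given RRN into memory, and integer constants FOUND, NOT_FOUND, and NIL.

/* Search recursively for key, starting at the page stored at rrn.
 * On success, *found_rrn and *found_pos locate the key. */
int search(short rrn, char key, short *found_rrn, short *found_pos)
{
    struct BTPAGE page;
    int pos;

    if (rrn == NIL)                       /* stopping condition           */
        return NOT_FOUND;

    btread(rrn, &page);                   /* assumed: read page from file */

    /* find the position where key occurs or should occur */
    for (pos = 0; pos < page.KEYCOUNT && key > page.KEY[pos]; pos++)
        ;

    if (pos < page.KEYCOUNT && key == page.KEY[pos]) {
        *found_rrn = rrn;                 /* this page contains the key   */
        *found_pos = pos;
        return FOUND;
    }
    /* follow the CHILD reference to the next level down */
    return search(page.CHILD[pos], key, found_rrn, found_pos);
}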
Insertion, Splitting, and Promotion

There are two important observations we can make about the insertion, splitting, and promotion process:

It begins with a search that proceeds all the way down to the leaf level; and

After finding the insertion location at the leaf level, the work of insertion, splitting, and promotion proceeds upward from the bottom.

Consequently, we can conceive of our recursive procedure as having three phases:

1. A search-page step that, as in the search() function, takes place before the recursive call;
2. The recursive call itself, which moves the operation down through the tree as it searches for either the key or the place to insert it; and
3. Insertion, splitting, and promotion logic that are executed after the recursive call, the action taking place on the upward return path following the recursive descent.

We need an example of an insertion so we can watch the insertion procedure work through these phases. Let's insert the $ character into the tree shown in the top half of Fig. 8.22, which contains all of the letters of the alphabet. Since the ASCII character sequence places the $ character ahead of the character A, the insertion is into the page with an RRN of 0. This page and its parent are both already full, so the insertion causes splitting and promotion that result in the tree shown in the bottom half of Fig. 8.22.
FIGURE 8.22 The effect of adding $ to the tree constructed in Fig. 8.18, before and after inserting $.
Now let's see how the insert() function performs this splitting and promotion. Since the function operates recursively, it is important to understand how the function arguments are used on successive calls. The insert() function that we are about to describe uses four arguments:

CURRENT_RRN      The RRN of the B-tree page that is currently in use. As the function recursively descends and ascends the tree, all the RRNs on the search and insertion path are used.

KEY              The key that is to be inserted.

PROMO_KEY        Argument used only to carry back the return value. If the insertion results in a split and the promotion of a key, PROMO_KEY contains the promoted key on the ascent back up the tree.

PROMO_R_CHILD    This is another return value argument. If there is a split, higher levels of the calling sequence must insert not only the promoted key value, but also the RRN of the new page created in the split. When PROMO_KEY is inserted, PROMO_R_CHILD is the right child pointer that is inserted with it.
In addition to the values returned via the arguments PROMO_KEY and PROMO_R_CHILD, insert() returns the value PROMOTION if it makes a promotion, NO PROMOTION if nothing is promoted, and ERROR if the insertion cannot be made.

Figure 8.23 illustrates the way the values of these arguments change as the insert() function is called and calls itself to perform the insertion of the $ character. The figure makes a number of important points:

During the search step part of the insertion, only CURRENT_RRN changes as the function calls itself, descending the tree. This search path of successive calls includes every page of the tree that can be affected by an insertion if splitting and promotion occur on the return path.

The search step ends when CURRENT_RRN is NIL. There are no further levels to search.

As each recursive call returns, we execute the insertion and splitting logic at that level. If the lower-level function returns the value PROMOTION, then we have a key to insert at this level. Otherwise, we have no work to do and can just return. For example, we are able to insert at the highest (root) level of the tree without splitting, and therefore return NO PROMOTION from this level. That means that the PROMO_KEY and PROMO_R_CHILD from this level have no meaning.
Given this introduction to the insert() function's operation, we are ready to look at an algorithm for the function shown in Fig. 8.24. We have already described insert()'s arguments. There are several important local variables as well:

PAGE        The page that insert() is currently examining.

NEWPAGE     New page created if a split occurs.

POS         The position in PAGE where the key occurs (if it is present) or would occur (if inserted).

P_B_RRN     The relative record number promoted from below up to this level. If a split occurs at the next lower level, P_B_RRN contains the relative record number of the new page created during the split. P_B_RRN is the right child that is inserted with P_B_KEY into PAGE.

P_B_KEY     The key promoted from below up to this level. This key, along with P_B_RRN, is inserted into PAGE.
FIGURE 8.23 Pattern of recursive calls to insert $ into the B-tree as illustrated in Fig. 8.22. Each level shows a search step, a recursive call, and the insertion and splitting logic. The deepest call reaches CURRENT_RRN = NIL and returns PROMOTION with PROMO_KEY = $ and PROMO_R_CHILD = NIL; the leaf level returns PROMOTION with PROMO_KEY = B and PROMO_R_CHILD = 11; the next level returns PROMOTION with PROMO_KEY = H and PROMO_R_CHILD = 12; the root level returns NO PROMOTION, leaving PROMO_KEY and PROMO_R_CHILD undefined.
FUNCTION: insert (CURRENT_RRN, KEY, PROMO_R_CHILD, PROMO_KEY)
    if CURRENT_RRN == NIL then    /* past bottom of tree */
        PROMO_KEY := KEY
        PROMO_R_CHILD := NIL
        return PROMOTION          /* promote original key and NIL */
    else
        read page at CURRENT_RRN into PAGE
        search for KEY in PAGE.
        let POS := the position where KEY occurs or should occur.
        if KEY found then
            issue error message indicating duplicate key
            return ERROR
        RETURN_VALUE := insert (PAGE.CHILD[POS], KEY, P_B_RRN, P_B_KEY)
        if RETURN_VALUE == NO PROMOTION or ERROR then
            return RETURN_VALUE
        elseif there is space in PAGE for P_B_KEY then
            insert P_B_KEY and P_B_RRN (promoted from below) in PAGE
            return NO PROMOTION
        else
            split (P_B_KEY, P_B_RRN, PAGE, PROMO_KEY, PROMO_R_CHILD, NEWPAGE)
            write PAGE to file at CURRENT_RRN
            write NEWPAGE to file at rrn PROMO_R_CHILD
            return PROMOTION      /* promoting PROMO_KEY and PROMO_R_CHILD */
        endif
    endif
end FUNCTION

FIGURE 8.24 Function insert (CURRENT_RRN, KEY, PROMO_R_CHILD, PROMO_KEY) inserts a KEY in a B-tree. The insertion attempt starts at the page with relative record number CURRENT_RRN. If this page is not a leaf page, the function calls itself recursively until it finds KEY in a page or reaches a leaf. If it finds KEY, it issues an error message and quits, returning ERROR. If there is space for KEY in PAGE, KEY is inserted. Otherwise, PAGE is split. A split assigns the value of the middle key to PROMO_KEY and the relative record number of the newly created page to PROMO_R_CHILD so insertion can continue on the recursive ascent back up the tree. If a promotion does occur, insert() indicates this by returning PROMOTION. Otherwise, it returns NO PROMOTION.
PROCEDURE: split (I_KEY, I_RRN, PAGE, PROMO_KEY, PROMO_R_CHILD, NEWPAGE)
    copy all keys and pointers from PAGE into a working page that
        can hold one extra key and child.
    insert I_KEY and I_RRN into their proper places in the working page.
    allocate and initialize a new page in the B-tree file to hold NEWPAGE.
    set PROMO_KEY to value of middle key, which will be promoted after
        the split.
    set PROMO_R_CHILD to RRN of NEWPAGE.
    copy keys and child pointers preceding PROMO_KEY from the working
        page to PAGE.
    copy keys and child pointers following PROMO_KEY from the working
        page to NEWPAGE.
end PROCEDURE

FIGURE 8.25 Split (I_KEY, I_RRN, PAGE, PROMO_KEY, PROMO_R_CHILD, NEWPAGE), a procedure that inserts I_KEY and I_RRN, causing overflow; creates a new page called NEWPAGE; distributes the keys between the original PAGE and NEWPAGE; and determines which key and RRN to promote. The promoted key and RRN are returned via the arguments PROMO_KEY and PROMO_R_CHILD.
When coded in a real language, insert() uses a number of support functions. The most obvious one is split(), which creates a new page, distributes the keys between the original page and the new page, and determines which key and RRN to promote. Figure 8.25 contains a description of a simple split() procedure, which is also encoded in C and Pascal at the end of this chapter.

You should pay careful attention to how split() moves the data. Note that only the key is promoted from the working page; all of the CHILD RRNs are transferred back to PAGE and NEWPAGE. The RRN that is promoted is the RRN of NEWPAGE, since NEWPAGE is the right descendent from the promoted key. Figure 8.26 illustrates the working page activity among PAGE, NEWPAGE, the working page, and the function arguments.

The version of split() described here is less efficient than might sometimes be desirable, since it moves more data than it needs to. In Exercise 17 you are asked to implement a more efficient version of split().
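To make this movement of keys concrete, the following C fragment is a minimal sketch (ours, not the version given at the end of the chapter) of the key-distribution step alone. It assumes the BTPAGE structure and MAXKEYS shown earlier and a working array that already holds the MAXKEYS + 1 keys in sorted order; the child pointers, which a full split() must also distribute, are omitted here.

/* Distribute the MAXKEYS + 1 keys in workkeys[] (already in sorted order)
 * between page and newpage, returning the middle key, which is the one
 * to be promoted. */
char distribute(char workkeys[], struct BTPAGE *page, struct BTPAGE *newpage)
{
    int mid = (MAXKEYS + 1) / 2;          /* position of the promoted key */
    int i;

    page->KEYCOUNT = 0;
    for (i = 0; i < mid; i++)             /* keys before the middle stay  */
        page->KEY[page->KEYCOUNT++] = workkeys[i];

    newpage->KEYCOUNT = 0;
    for (i = mid + 1; i <= MAXKEYS; i++)  /* keys after the middle move   */
        newpage->KEY[newpage->KEYCOUNT++] = workkeys[i];

    return workkeys[mid];                 /* promoted key */
}

With MAXKEYS equal to 3, the working sequence B D H K from Fig. 8.26 leaves B and D in PAGE, places K in NEWPAGE, and returns H for promotion, which matches the behavior described above.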
The Top Level

We need a routine to tie together our insert() and split() procedures and to do some things that are not done by the lower-level routines. Our driver routine must be able to do the following:

Open or create the B-tree file, and identify or create the root page.

Read keys to be stored in the B-tree, and call insert() to put the keys in the tree.

Create a new root node when insert() splits the current root page.

The routine driver shown in Fig. 8.27 carries out these top-level tasks. It is assumed that the RRN of the root node is stored in the B-tree file itself.

FIGURE 8.26 The movement of data in split(). The contents of PAGE are copied to the working page; I_KEY (B) and I_RRN (11) are inserted into the working page; then the contents of the working page are divided between PAGE and NEWPAGE, except for the middle key (H), which is promoted along with the RRN (12) of NEWPAGE.
MAIN PROCEDURE: driver
    if the B-tree file exists, then
        open B-tree file
    else
        create a B-tree file and place the first key in the root
    get RRN of root page from file and store it in ROOT
    get a key and store it in KEY
    while keys exist
        if (insert (ROOT, KEY, PROMO_R_CHILD, PROMO_KEY) == PROMOTION) then
            create a new root page with key := PROMO_KEY, left
                child := ROOT, and right child := PROMO_R_CHILD
            set ROOT to RRN of new root page
        get next key and store it in KEY
    endwhile
    write RRN stored in ROOT back to B-tree file
    close B-tree file
end MAIN PROCEDURE
FIGURE 8.27 Driver for building a B-tree.
Driver begins by checking if the file exists. If the file does exist, driver opens it and gets the RRN of the root node. If it does not exist, driver must create the file and build an original root page. Since a root must contain at least one key, this involves getting the first key to be inserted in the tree and placing it in the root. Next, driver reads in the keys to be inserted, one at a time, and calls insert() to insert the keys into the B-tree file. If insert() splits the root node, it promotes a key and right child in PROMO_KEY and PROMO_R_CHILD, and driver uses these to create a new root.
8.10 B-Tree Nomenclature
Before moving on to discuss B-tree performance and variations on the basic B-tree algorithms, we need to formalize our B-tree terminology. Providing careful definitions of terms such as order and leaf enables us to state precisely the properties that must be present for a data structure to qualify as a B-tree. This definition of B-tree properties, in turn, informs our discussion of matters such as the procedure for deleting keys from a B-tree.

Unfortunately, the literature on B-trees is not uniform in its use of terms relating to B-trees. Reading that literature and keeping up with new developments therefore require some flexibility and some background: The reader needs to be aware of the different usages of some of the fundamental terms.
For example, Bayer and McCreight (1972), Comer (1979), and a few others refer to the order of a B-tree as the minimum number of keys that can be in a page of a tree. So, our initial sample B-tree (Fig. 8.16), which can hold a maximum of seven keys per page, has an order of three, using Bayer and McCreight's terminology. The problem with this definition of order is that it becomes clumsy when you try to account for pages that hold a maximum number of keys that is odd. For example, consider the following question: Within the Bayer and McCreight framework, is the page of an order three B-tree full when it contains six keys or when it contains seven keys?

Knuth (1973b) and others have addressed the odd/even confusion by defining the order of a B-tree to be the maximum number of descendents that a page can have. This is the definition of order that we use in this text. Note that this definition differs from Bayer and McCreight's in two ways: It references a maximum, not a minimum, and it counts descendents rather than keys.
Use of Knuth's definition must be coupled with the fact that the number of keys in a B-tree page is always one less than the number of descendents from the page. Consequently, a B-tree of order 8 has a maximum of seven keys per page. In general, given a B-tree of order m, the maximum number of keys per page is m - 1.

When you split the page of a B-tree, the descendents are divided as evenly as possible between the new page and the old page. Consequently, every page except the root and the leaves has at least m/2 descendents. Expressed in terms of a ceiling function, we can say that the minimum number of descendents is ⌈m/2⌉. It follows that the minimum number of keys per page is ⌈m/2⌉ - 1, so our initial sample B-tree has an order of eight, which means that it can hold no more than seven keys per page and that all of the pages except the root contain at least three keys.
The
other term that is used differently by different authors is leaf. Bayer
and McCreight refer to the lowest level of keys in a B-tree as the leaf level.
This is consistent with the nomenclature we have used in this text. Other
authors, including Knuth, consider the leaves of a B-tree to be one level
below the lowest level of keys. In other words, they consider the leaves to
be the actual data records that might be pointed to by the lowest level of
keys in the tree. We do not use this definition, sticking instead with the
notion of leaf
as the
lowest level of keys in the B-tree.
8.11 Formal Definition of B-Tree Properties
Given these definitions of order and leaf, we can formulate a precise statement of the properties of a B-tree of order m:

1. Every page has a maximum of m descendents.
2. Every page, except for the root and the leaves, has at least ⌈m/2⌉ descendents.
3. The root has at least two descendents (unless it is a leaf).
4. All the leaves appear on the same level.
5. A nonleaf page with k descendents contains k - 1 keys.
6. A leaf page contains at least ⌈m/2⌉ - 1 keys and no more than m - 1 keys.

8.12 Worst-case Search Depth
Worst-case Search Depth
important to have a quantitative understanding of the relationship
between the page size of a B-tree, the number of keys to be stored in the
tree, and the number of levels that the tree can extend. For example, you
It
is
might
know
that
you need
to store 1,000,000 keys
and
nature of your storage hardware and the size of your keys,
to consider using a B-tree
of order 512
(maximum of 511
that,
given the
it is
reasonable
keys per page).
Given these two facts, you need to be able to answer the question, "In the
worst case, what will be the maximum number of disk accesses required to
locate a key in the tree?" This is the same as asking how deep the tree
will be.
We can answer this question by beginning with the observation that the
number of descendents from any level of a B-tree is one greater than the
number of keys contained at that level and all the levels above it. Figure 8.28
illustrates this relation for the tree
we
constructed earlier in this chapter.
T his tree contains 27 kevs fall the letters of the alphabet and S). If you co unt
me number of potential descendents trailing from the leaf level, you see that
there are 28 of them.
Next we need
to observe that
properties to calculate the
from any
of
level
we
can use the formal definition of B-tree
minimum number of descendents
B-tree of
some given
order. This
is
that can extend
of interest because
we are interested in the worst-case depth of the tree. The worst case occurs
when every page of the tree has only the minimum number of descendents.
In such a case
iimal breadth.
it
he keys are spread over
maximal height for the
tree
and
WORST-CASE SEARCH DEPTH
365
H N
ddd
ddddddd ddd ddd
dd
FIGURE 8.28 A B-tree with
Fo r
root page
keys can have
(A/
1)
T U V
X Y
dddddddd
dd
descendents from the
leaf level.
minimum number of descendents from the
two, so the second level of the tree contains only two pages.
B-tree o f order m, the
is
Ea cRof these pages,
in turn has at least
,
mil 1 de scendents.
he third
level,
then, co ntains
2
X [ ml2~\
pages. Since each of these pages, once again, has a
minimum of [ m/2~\
descendents, the general pattern of the relation between depth and the
minimum number of descendents
Minimum number
Level
1
0?
of descendents
(root)
x \ mll\
x fm/21 x [~m/2~|or 2 x
3
2 X \ml2~}
tf
takes the following form:
\
'
mil'}
4?
d-\
2 x fro/21
So.
in
general,
for
any
descendents extending fro
level
d of a B-tree,
thaf level
2
\^
x fm/2l d-\
the
minimum numbe r of
366
B-TREES AND OTHER TREE-STRUCTURED
Wejcnow
Let
level.
that
s call
ORGANIZATIONS
N keys
free with.
descendents from
N+
we know
than the
at
HesrenHe ntS from
that the
number
We
d.
descendents and the
of height d
a tree
its
leaf
can express the
minimum number
of
as
N+1>2X
since
has
the depth of the tree at the leaf level
between the
relationship
FILE
[ ml2~]
d-
number of descendents from any
cannot be
tree
for a worst-case tree of that depth. Solving for
d,
we
less
arrive
the following expression:
d
+ logrw2l
((N
l)/2).
This expression gives us an upper bound for the depth of
a B-tree
A/keys. Let's find the upper bound for the hypothetical tree that
at the start
of
this section: a tree
Substituting these specific
of order 512 that contains 1,000,000 keys.
numbers
d
<
with
we describe
into the expression,
we
find that
log 256 500000.5,
or
d
<
3.37.
So we can say that given 1,000,000 keys,
of no more than three levels.
8.13
B-tree of order 512 has a depth
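The bound is just as easy to evaluate by machine for other combinations of order and key count. The following C sketch is ours, not part of the chapter's programs; it computes d ≤ 1 + log_⌈m/2⌉((N + 1)/2) directly.

#include <math.h>
#include <stdio.h>

/* Upper bound on the depth of a B-tree of order m holding n keys. */
double worst_case_depth(double m, double n)
{
    double half = ceil(m / 2.0);                 /* ceiling of m/2 */
    return 1.0 + log((n + 1.0) / 2.0) / log(half);
}

int main(void)
{
    /* Order 512, 1,000,000 keys: prints approximately 3.37. */
    printf("%.2f\n", worst_case_depth(512.0, 1000000.0));
    return 0;
}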
8.13 Deletion, Redistribution, and Concatenation
Indexing 1,000,000 keys in no more than three levels of a tree is precisely the kind of performance we are looking for. As we have just seen, this performance is predicated on the B-tree properties we describe earlier; in particular, the ability to guarantee that B-trees are broad and shallow rather than narrow and deep is coupled to the rules that state the following:

Every page except for the root and the leaves has at least ⌈m/2⌉ descendents;

A nonleaf page with k descendents contains k - 1 keys; and

A leaf page contains at least ⌈m/2⌉ - 1 keys and no more than m - 1 keys.

We have already seen that the process of page splitting guarantees that these properties are maintained when new keys are inserted into the tree. We need to develop some kind of equally reliable guarantee that these properties are maintained when keys are deleted from the tree.

Working through some simple deletion situations by hand helps us demonstrate that the deletion of a key can result in several different situations. Figure 8.29 illustrates each of these situations and the associated response in the course of several deletions from an order six B-tree.
The simplest situation is illustrated in case 1. Deleting the key J does not cause the contents of page 5 to drop below the minimum number of keys. Consequently, deletion involves nothing more than removing the key from the page and rearranging the keys within the page to close up the space.

Deleting the M (case 2) is more complicated. If we simply remove the M from the root, it becomes very difficult to reorganize the tree to maintain its B-tree structure. Since this problem can occur whenever we delete a key from a nonleaf page, we always delete keys only from leaf pages. If a key to be deleted is not in a leaf, there is an easy way to get it into a leaf: We swap it with its immediate successor, which is guaranteed to be in a leaf, then delete it immediately from the leaf. In our example, we can swap the M with the N in page 6, then delete the M from page 6. This simple operation does not put the keys out of order, since all keys in the subtree of which N is a part must be greater than N. (Can you see why this is the case?)

In case 3 we delete R from page 7. If we simply remove R and do nothing more, the page that it is in has only one key. The minimum number of keys for the leaf page of an order six tree is ⌈6/2⌉ - 1 = 2. Therefore, we have to take some kind of action to correct this underflow condition. Since the neighboring page 8 (called a sibling since it has the same parent) has more than the minimum number of keys, the corrective action consists of redistributing the keys between the pages. Redistribution must also result in a change in the key that is in the parent page so it continues to act as a separator between the lower-level pages. In the example, we move the U and V into page 7, and move W into the separator position in page 2.

The deletion of A in case 4 results in a situation that cannot be resolved by redistribution. Addressing the underflow in page 3 by moving keys from page 4 only transfers the underflow condition. There are not enough keys to share between two pages. The solution to this is concatenation, combining the two pages and the key from the parent page to make a single full page.

Concatenation is essentially the reverse of splitting. Like splitting, it can propagate upward through the B-tree. Just as splitting promotes a key, concatenation must involve demotion of keys, and this can in turn cause underflow in the parent page. This is just what happens in our example. Our concatenation of pages 3 and 4 pulls the key D from the parent page down to the leaf level, leading to case 5: The loss of the D from the parent page causes it, in turn, to underflow. Once again, redistribution does not solve the problem, so concatenation must be used.
FIGURE 8.29 Six situations that can occur during deletions. Case 1: No action. Delete J from page 5; since page 5 has more than the minimum number of keys, J can be removed without reorganization. Case 2: Swap with immediate successor. Swap M (page 0) with N (page 6), and then delete M from page 6. Case 3: Redistribution. Delete R; underflow occurs. Redistribute keys among pages 2, 7, and 8 to restore balance between leaves: promote W and move U and V into page 7. Case 4: Concatenation. Delete A; underflow occurs, but it cannot be addressed by redistribution. Concatenate the keys from pages 3 and 4, plus the D from page 1, into one page. Case 5: Underflow propagates upward. Now page 1 has underflow; again, we cannot redistribute, so we concatenate. Case 6: Height of tree decreased. Since the root contains only one key, it is absorbed into the new root.
Note that the propagation of the underflow condition does not necessarily imply the propagation of concatenation. If page 2 (Q and W) had contained another key, then redistribution, not concatenation, would be used to resolve the underflow condition at the second level of the tree. Case 6 shows what happens when concatenation propagates all the way to the root. The concatenation of pages 1 and 2 absorbs the only key in the root page, decreasing the height of the tree by one level.

The steps involved in deleting keys from a B-tree can be summarized as follows:
1. If the key to be deleted is not in a leaf, swap it with its immediate successor, which is in a leaf.
2. Delete the key.
3. If the leaf now contains at least the minimum number of keys, no further action is required.
4. If the leaf now contains one too few keys, look at the left and right siblings.
   a. If a sibling has more than the minimum number of keys, redistribute.
   b. If neither sibling has more than the minimum, concatenate the two leaves and the median key from the parent into one leaf.
5. If leaves are concatenated, apply steps 3-6 to the parent.
6. If the last key from the root is removed, then the height of the tree decreases.
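The choice among steps 3, 4a, and 4b can be expressed as a small decision routine. The C sketch below is our own simplification, not the deletion code of a complete implementation; it assumes the BTPAGE structure and a min_keys() helper like the one sketched earlier, and it only reports which corrective action to take.

enum action { NO_ACTION, REDISTRIBUTE, CONCATENATE };

/* After a key has been deleted from *leaf, decide what to do next.
 * left and right are the leaf's siblings (either may be NULL). */
enum action after_delete(struct BTPAGE *leaf, struct BTPAGE *left,
                         struct BTPAGE *right, int order)
{
    if (leaf->KEYCOUNT >= min_keys(order))
        return NO_ACTION;                      /* step 3: page is still legal */

    if ((left  != 0 && left->KEYCOUNT  > min_keys(order)) ||
        (right != 0 && right->KEYCOUNT > min_keys(order)))
        return REDISTRIBUTE;                   /* step 4a */

    return CONCATENATE;                        /* step 4b: may propagate upward */
}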
8.13.1 Redistribution
Unlike concatenation, which is a kind of reverse split, redistribution is a new idea. Our insertion algorithm does not involve operations analogous to redistribution.

Redistribution differs from both splitting and concatenation in that it does not propagate. It is guaranteed to have strictly local effects.
Note that the term sibling implies that the pages have the same parent page. If there are two nodes at the leaf level that are logically adjacent but do not have the same parent (for example, IJK and NOP in the tree at the top of Fig. 8.29), these nodes are not siblings. Redistribution algorithms are generally written so they do not consider moving keys between nodes that are not siblings, even when they are logically adjacent. Can you see the reasoning behind this restriction?
Another difference between redistribution on the one hand a nd
is that there is no necessary, iixe3
concatenation and splitting on the other
prescription for
how
the keys should be rearranged.
single deletion in a
REDISTRIBUTION DURING INSERTION: A
WAY
37
TO IMPROVE STORAGE UTILIZATION
pr operly formed B-tree cannot cause an underflow of
more than one key
moving only
.
Therefore, redistribution can restore the B-tree properties by
one key from a sibling into the page that has underflowed, even if the
distribution of the keys between the pages is very uneven. Suppose, for
example, that we are managing a B-tree of order 101. The minimum
number of keys that can be in a page is 50, the maximum is 100. Suppose
we have one page that contains the minimum and a sibling that contains the
maximum. If a key is deleted from the page containing 50 keys, an
underflow condition occurs. We can correct the condition through
redistribution by moving one key, 50 keys, or any number of keys that falls
between 1 and 50. The usual strategy is to divide the keys as evenly as
possible between the pages. In this instance that means moving 25 keys.
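As a small worked example of the "divide as evenly as possible" rule, the fragment below computes how many keys the sibling should give up; the function name and its use here are illustrative only and are not part of the chapter's programs.

#include <stdio.h>

/* Move half of the difference, so the two pages end up with key counts
   that are as close to even as possible.                                */
int keys_to_move(int underfull_count, int sibling_count)
{
    return (sibling_count - underfull_count) / 2;
}

int main(void)
{
    /* order 101: minimum 50 keys per page, maximum 100 */
    printf("%d\n", keys_to_move(49, 100));   /* prints 25, as in the text */
    return 0;
}

For the order-101 case just described, with 49 keys left in the underfull page and 100 in its sibling, it reports 25, leaving the two pages with 74 and 75 keys.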
8.14 Redistribution during Insertion: A Way to Improve Storage Utilization

As you may recall, B-tree insertion does not require an operation analogous to redistribution; splitting is able to account for all instances of overflow. This does not mean, however, that it is not desirable to use redistribution during insertion as an option, particularly since a set of B-tree maintenance algorithms must already include a redistribution procedure to support deletion. Given that a redistribution procedure is already present, what advantage might we gain by using it as an alternative to node splitting?
Redistribution during insertion is a way of avoiding, or at least postponing, the creation of new pages. Rather than splitting a full page and creating two approximately half-full pages, redistribution lets us place some of the overflowing keys into another page. The use of redistribution in place of splitting should therefore tend to make a B-tree more efficient in terms of its utilization of space.
It is possible to quantify this efficiency of space utilization by viewing the amount of space used to store information as a percentage of the total amount of space required to hold the B-tree. After a node splits, each of the two resulting pages is about half full. So, in the worst case, space utilization in a B-tree using two-way splitting is around 50%. Of course, the actual degree of space utilization is better than this worst-case figure. Yao (1978) has shown that, for large trees of relatively large order, space utilization approaches a theoretical average of about 69% if insertion is handled through two-way splitting.
The idea of using redistribution as an alternative to splitting when possible, splitting a page only when both of its siblings are full, is introduced in Bayer and McCreight's original paper (1972). The paper includes some experimental results that show that two-way splitting results in a space utilization of 67% for a tree of order 121 after 5,000 random insertions. When the experiment was repeated, using redistribution when possible, space utilization increased to over 86%. Subsequent empirical testing by Davis (1974) (B-tree of order 49) and Crotzer (1975) (B-tree of order 303) also resulted in space utilization exceeding 85% when redistribution was used. These findings and others suggest that any serious application of B-trees to even moderately large files should implement insertion procedures that handle overflow through redistribution when possible.
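The change to the insertion logic is small. The following is a hedged sketch of the decision, assuming the overflow handler can ask a sibling whether it has room; sibling_rrn(), sibling_has_room(), and shift_key_to_sibling() are hypothetical names, since the insert() program at the end of this chapter handles overflow by splitting alone.

#include "bt.h"

#define LEFT  0
#define RIGHT 1

short sibling_rrn(short rrn, int side);      /* NIL if no sibling on that side */
int   sibling_has_room(short sibling);       /* fewer than MAXKEYS keys?       */
void  shift_key_to_sibling(short rrn, BTPAGE *p_page, short sibling,
                           char key, short r_child);

/* Returns YES if the caller still has to split, NO if redistribution
   absorbed the overflow.                                               */
int handle_overflow(short rrn, BTPAGE *p_page, char key, short r_child)
{
    short left  = sibling_rrn(rrn, LEFT);
    short right = sibling_rrn(rrn, RIGHT);

    if (left != NIL && sibling_has_room(left)) {
        shift_key_to_sibling(rrn, p_page, left, key, r_child);
        return NO;
    }
    if (right != NIL && sibling_has_room(right)) {
        shift_key_to_sibling(rrn, p_page, right, key, r_child);
        return NO;
    }
    return YES;       /* both siblings full: split, as before */
}

Splitting is reserved for the case in which the page and both of its siblings are full, which is the policy behind the utilization figures reported above.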
8.15 B* Trees
In his review and amplification of work on B-trees in 1973, Knuth (1973b) extends the notion of redistribution during insertion to include new rules for splitting. He calls the resulting variation on the fundamental B-tree form a B* tree.

Consider a system in which we are postponing splitting through redistribution, as outlined in the preceding section. If we are considering any page other than the root, we know that when it finally is time to split, the page has at least one sibling that is also full. This opens up the possibility of a two-to-three split rather than the usual one-to-two or two-way split.
Figure 8.30 illustrates such a split. The important aspect of this two-to-three split is that it results in pages that are each about two-thirds full rather than just half full. This makes it possible to define a new kind of B-tree, called a B* tree, which has the following properties:

1. Every page has a maximum of m descendents.
2. Every page except for the root and the leaves has at least (2m - 1)/3 descendents.
3. The root has at least two descendents (unless it is a leaf).
4. All the leaves appear on the same level.
5. A nonleaf page with k descendents contains k - 1 keys.
6. A leaf page contains at least ⌊(2m - 1)/3⌋ keys and no more than m - 1 keys.

FIGURE 8.30 A two-to-three split: the original tree, shown before and after the insertion of the key B. After the split the parent page contains H K M.

The critical changes between this set of properties and the set we define for a conventional B-tree are in rules 2 and 6: a B* tree has pages that contain a minimum of ⌊(2m - 1)/3⌋ keys. This new property, of course, affects procedures for deletion and redistribution.
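To see what the new minimum-occupancy rule buys, the short program below, offered only as a check of the arithmetic, compares the guarantees of the two definitions for a page of order 101, the order used in the redistribution example earlier in this chapter. The function names are illustrative.

#include <stdio.h>

/* Minimum descendents implied by rule 2 of each definition:
   at least ceil(m/2) for a B-tree page, and at least (2m - 1)/3
   for a B* tree page.                                             */
int btree_min_descendents(int m)  { return (m + 1) / 2; }
int bstar_min_descendents(int m)  { return (2 * m - 1) / 3; }

int main(void)
{
    int m = 101;
    printf("order %d B-tree : at least %d descendents per page\n",
           m, btree_min_descendents(m));
    printf("order %d B* tree: at least %d descendents per page\n",
           m, bstar_min_descendents(m));
    return 0;
}

For order 101 this works out to 51 descendents (50 keys) for a conventional B-tree page and 67 descendents for a B* tree page, so a B* tree page is always at least about two-thirds full.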
To implement B* tree procedures, one must also deal with the question of splitting the root, which, by definition, never has a sibling. If there is no sibling, no two-to-three split is possible. Knuth suggests allowing the root to grow to a size larger than the other pages so, when it does split, it can produce two pages that are each about two-thirds full. This suggestion has the advantage of ensuring that all pages below the root level adhere to B* tree characteristics. However, it has the disadvantage of requiring that the procedures be able to handle a page that is larger than all the others. Another solution is to handle the splitting of the root as a conventional one-to-two split. This second solution avoids any special page-handling logic. On the other hand, it complicates deletion, redistribution, and other procedures that must be sensitive to the minimum number of keys allowed in a page. Such procedures would have to be able to recognize that pages descending from the root might legally be only half full.
8.16 Buffering of Pages: Virtual B-Trees
We have seen that, given some additional refinements, the B-tree can be a very efficient, flexible storage structure that maintains its balanced properties after repeated deletions and insertions and that provides access to any key with just a few disk accesses. However, focusing on just the structural aspects, as we have so far, can cause us inadvertently to overlook ways of using this structure to full advantage. For example, the fact that a B-tree has a depth of three levels does not at all mean that we need to do three disk accesses to retrieve keys from pages at the leaf level. We can do much better than that.
Obtaining better performance from B-trees involves looking in a precise way at our original problem. We needed to find a way to make efficient use of indexes that are too large to be held entirely in RAM. Up to this point we have approached this problem in an all-or-nothing way: An index has been either held entirely in RAM, organized as a list or binary tree, or has been accessed entirely on secondary store, using a B-tree structure. But, stating that we cannot hold ALL of an index in RAM does not imply that we cannot hold some of it there.

For example, assume we have an index that contains a megabyte of records and that we cannot reasonably use more than 256 K of RAM for index storage at any given time. Given a page size of 4 K, holding around 64 keys per page, our B-tree can be contained in three levels. We can reach any one of our keys in no more than three disk accesses. That is certainly acceptable, but why should we settle for this kind of performance? Why not try to find a way to bring the average number of disk accesses per search down to one disk access or less?

Thinking of the problem strictly in terms of physical storage structures, retrieval averaging one disk access or less sounds impossible. But, remember, our objective was to find a way to manage our megabyte of index within 256 K of RAM, not within the 4 K required to hold a single page of our tree.
We know that every search through the tree requires access to the root page. Rather than accessing the root page again and again at the start of every search, we could read the root page into RAM and just keep it there. This strategy increases our RAM requirement for index storage from 4 K to 8 K, since we need 4 K for the root and 4 K for whatever other page we read in, but this is still much less than the 256 K that are available. This very simple strategy reduces our worst-case search to two disk accesses, and the average search to under two accesses (keys in the root require no disk access; keys at the first level require one access).
This simple, keep-the-root strategy suggests an important, more general approach: Rather than just holding the root page in RAM, we can create a page buffer to hold a number of B-tree pages, perhaps 5, 10, or more. As we read pages in from the disk in response to user requests, we fill up the buffer. Then, when a page is requested, we access it from RAM if we can, thereby avoiding a disk access. If the page is not in RAM, then we read it into the buffer from secondary storage, replacing one of the pages that was previously there. A B-tree that uses a RAM buffer in this way is sometimes referred to as a virtual B-tree.
8.16.1 LRU Replacement
Clearly, such a buffering scheme works only if we are more likely to request a page that is in the buffer than one that is not. The process of accessing the disk to bring in a page that is not already in the buffer is called a page fault. There are two causes of page faults:

1. We have never used the page.
2. It was once in the buffer but has since been replaced with a new page.

The first cause of page faults is unavoidable: If we have not yet read in and used a page, there is no way it can already be in the buffer. But the second cause is one we can try to minimize through buffer management. The critical management decision arises when we need to read a new page into a buffer that is already full: Which page do we decide to replace?
One common approach is to replace the page that was least recently used; this is called LRU replacement. Note that this is different from replacing the page that was read into the buffer least recently. Since the root page is always read in first, simply replacing the oldest page results in replacing the root, an undesirable outcome. Instead, the LRU method keeps track of the actual requests for pages. Since the root is requested on every search, it seldom, if ever, is selected for replacement. The page to be replaced is the one that has gone the longest time without a request for use.
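A minimal sketch of such a buffer is given below. It leans on the BTPAGE type and the btread() routine from the programs at the end of this chapter; BUFCOUNT, the bufslot structure, buffer_init(), and get_page() are names introduced here for illustration only.

#include "bt.h"

#define BUFCOUNT 10

struct bufslot {
    short  rrn;        /* which B-tree page is held here; NIL if empty */
    long   last_used;  /* "time" of the most recent request            */
    BTPAGE page;
};

static struct bufslot buffer[BUFCOUNT];
static long clock_tick = 0;

void buffer_init(void)
{
    int i;
    for (i = 0; i < BUFCOUNT; i++) {
        buffer[i].rrn = NIL;
        buffer[i].last_used = 0;
    }
}

BTPAGE *get_page(short rrn)
{
    int i, victim = 0;

    for (i = 0; i < BUFCOUNT; i++)              /* already buffered?       */
        if (buffer[i].rrn == rrn) {
            buffer[i].last_used = ++clock_tick; /* record the request      */
            return &buffer[i].page;
        }

    for (i = 1; i < BUFCOUNT; i++)              /* page fault: choose the  */
        if (buffer[i].last_used < buffer[victim].last_used)
            victim = i;                         /* least recently used     */

    btread(rrn, &buffer[victim].page);          /* read the page from disk */
    buffer[victim].rrn = rrn;
    buffer[victim].last_used = ++clock_tick;
    return &buffer[victim].page;
}

Every successful lookup in the first loop is a page request that costs no disk access at all; the page-height refinement described below amounts to adding a weight to last_used for pages near the root.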
Some research by Webster (1980) shows the effect of increasing the number of pages that can be held in the buffer area under an LRU replacement strategy. Table 8.1 summarizes a small but representative portion of Webster's results. It lists the average number of disk accesses per search given different numbers of page buffers. These results are obtained using a simple LRU replacement strategy without accounting for page height.

TABLE 8.1 Effect of using more buffers with a simple LRU replacement strategy
(Number of keys = 2,400; Total pages = 140; Tree height = 3 levels)

Buffer Count                   1      5      10     20
Average Accesses per Search    3.00   1.71   1.42   0.97

Webster's study was conducted using B+ trees rather than simple B-trees. In the next chapter, where we look closely at B+ trees, you see that the nature of B+ trees accounts for the fact that, given one buffer, the average search length is 3.00. With B+ trees, all searches must go all the way to the leaf level every time. The fact that Webster used B+ trees, however, does not detract from the usefulness of his results as an illustration of the positive impact of page buffering. Keeping less than 15% of the tree in RAM (20 pages out of the total 140) reduces the average number of accesses per search to less than one. The results are even more dramatic with a simple B-tree, since not all searches have to proceed to the leaf level.

Note that the decision to use LRU replacement is based on the assumption that we are more likely to need a page that we have used recently than we are to need a page that we have never used or one that we used some time ago. If this assumption is not valid, then there is absolutely no reason to preferentially retain pages that were used recently. The term for this kind of assumption is temporal locality. We are assuming that there is a kind of clustering of the use of certain pages over time. The hierarchical nature of a B-tree makes this kind of assumption reasonable. For example, during redistribution after overflow or underflow, we access a page and then access its sibling. Because B-trees are hierarchical, accessing a set of sibling pages involves repeated access to the parent page in rapid succession. This is an instance of temporal locality; it is easy to see how it is related to the tree's hierarchy.
8.16.2 Replacement Based on Page Height
There is another, more direct way to use the hierarchical nature of the B-tree to guide decisions about page replacement in the buffers. Our simple, keep-the-root strategy exemplifies this alternative: Always retain the pages that occur at the highest levels of the tree. Given a larger amount of buffer space, it might be possible to retain not only the root, but also all of the pages at the second level of a tree.

Let's explore this notion by returning to a previous example in which we have access to 256 K of RAM and a 1-megabyte index. Since our page size is 4 K, we could build a buffer area that holds 64 pages within the RAM area. Assume that our 1 megabyte worth of index requires around 1.2 megabytes of storage on disk (storage utilization = 83%). Given the 4 K page size, this 1.2 megabytes requires slightly more than 300 pages. We assume that, on the average, each of our pages has around 30 descendents. It follows that our three-level tree has, of course, a single page at the root level, followed by 9 or 10 pages at the second level, with all the remaining pages at the leaf level. Using a page replacement strategy that always retains the higher-level pages, it is clear that our 64-page buffer eventually contains the root page and all the pages at the second level. The approximately 50 remaining buffer slots are used to hold leaf-level pages. Decisions about which of these pages to replace can be handled through an LRU strategy. For many searches, all of the pages required are already in the buffer; the search requires no disk accesses. It is easy to see how, given a sizable buffer, it is possible to bring the average number of disk accesses per search down to a number that is less than one.

Webster's research (1980) also investigates the effect of taking page height into account, giving preference to pages that are higher in the tree when it comes time to decide which pages to keep in the buffers. Augmenting the LRU strategy with a weighting factor that accounts for page height reduces the average number of accesses, given a 10-page buffer, from 1.42 accesses per search down to 1.12 accesses per search.

8.16.3 Importance of Virtual B-Trees

It is difficult to overemphasize the importance of including page buffering into any implementation of a B-tree index structure. Because the B-tree structure itself is so interesting and powerful, it is easy to fall into the trap of thinking that the B-tree organization is itself a sufficient solution to the problem of accessing large indexes that must be maintained on secondary storage. As we have emphasized, to fall into that trap is to lose sight of the original problem: to find a way to reduce the amount of memory required to handle large indexes. We did not, however, need to reduce the amount of memory to the amount required for a single index page. It is usually possible to find enough memory to hold a number of pages. Doing so can dramatically increase system performance.
8.17 Placement of Information Associated with the Key

Early in this chapter we focused on the B-tree index itself, setting aside any consideration of the actual information associated with the keys. We paraphrased Bayer and McCreight and stated that "the associated information is of no further interest."

But, of course, in any actual application the associated information is, in fact, the true object of interest. Rarely do we ever want to index keys just to be able to find the keys themselves. It is usually the information associated with the key that we really want to find. So, before closing our discussion of B-tree indexes, it is important to turn to the question of where and how to store the information indexed by the keys in the tree.
Fundamentally, we have two choices. We can

  Store the information in the B-tree along with the key; or
  Place the information in a separate file; within the index we couple the key with a relative record number or byte address pointer that references the location of the information in that separate file.

The distinct advantage that the first approach has over the second is that once the key is found, no more disk accesses are required. The information is right there with the key. However, if the amount of information associated with each key is relatively large, then storing the information with the key reduces the number of keys that can be placed in a page of the B-tree. As the number of keys per page is reduced, the order of the tree is reduced, and the tree tends to become taller since there are fewer descendents from each page. So, the advantage of the second method is that, given associated information that has a long length relative to the length of a key, placing the associated information elsewhere allows us to build a higher-order and therefore possibly shallower tree.
For example, assume we need to index 1,000 keys and associated information records. Suppose that the length required to store a key and its associated information is 128 bytes. Furthermore, suppose that if we store the associated information elsewhere, we can store just the key and a pointer to the associated information in only 16 bytes. Given a B-tree page that has 512 bytes available for keys and associated information, the two fundamental storage alternatives translate into the following orders of B-trees:

  Information stored with key: four keys per page (an order five tree);
  Pointer stored with key: 32 keys per page (an order 33 tree).

Using the formula developed earlier for finding the worst-case depth of B-trees:

  d(info w/key)     <= 1 + log₃ 500.5  = 6.66
  d(info elsewhere) <= 1 + log₁₇ 500.5 = 3.19

So, if we store the information with the keys, the tree has a worst-case depth of six levels. If we store the information elsewhere, we end up reducing the height of the worst-case tree to three. Even though the additional indirection associated with the second method costs us one disk access, the second method still reduces the total number of disk accesses required to find a record in the worst case.
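The arithmetic above can be checked directly. The following throwaway program, using illustrative names only, evaluates the worst-case depth bound d <= 1 + log, taken to the base ceil(m/2), of (N + 1)/2 for the two alternatives.

#include <math.h>
#include <stdio.h>

double worst_case_depth(int m, long n)
{
    int half = (m + 1) / 2;                      /* ceil(m/2)                    */
    return 1.0 + log((n + 1) / 2.0) / log((double) half);
}

int main(void)
{
    printf("info with key  (order 5) : %.2f\n", worst_case_depth(5, 1000));
    printf("info elsewhere (order 33): %.2f\n", worst_case_depth(33, 1000));
    return 0;
}

It prints 6.66 and 3.19, matching the worst-case depths of six and three levels used in the comparison.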
In general, then, the decision about where to store the associated information should be guided by some calculations that compare the depths of the trees that result. The critical factor that influences these calculations is the ratio of overall record length to the length of just a key and pointer. If you can put many key/pointer pairs in the area required for a single, full key/record pair, it is probably advisable to remove the associated information from the B-tree and put it in a separate file.

8.18 Variable-length Records and Keys
In many applications the information associated with a key varies in length. Secondary indexes referencing inverted lists are an excellent example of this. One way to handle this variability is to place the associated information in a separate, variable-length record file; the B-tree would contain a reference to the information in this other file. Another approach is to allow a variable number of keys and records in a B-tree page.

Up to this point we have regarded B-trees as being of some order m. Each page has a fixed maximum and minimum number of keys that it can legally hold. The notion of a variable-length record, and, therefore, a variable number of keys per page, is a significant departure from the point of view we have developed so far. A B-tree with a variable number of keys per page clearly has no single, fixed order.

The variability in length can also extend to the keys themselves as well as to entire records. For example, in a file in which people's names are the keys, we might choose to use only as much space as required for a name, rather than allocate a fixed-size field for each key. As we saw in earlier chapters, implementing a structure with variable-length fields can allow us to put many more names in a given amount of space since it does away with internal fragmentation. If we can put more keys in a page, then we have a larger number of descendents from a page and, very probably, a tree with fewer levels.
Accommodating this variability in length means using a different kind of page structure. We look at page structures appropriate for use with variable-length keys in detail in the next chapter, where we discuss B+ trees. We also need a different criterion for deciding when a page is full and when it is in an underflow condition. Rather than use a maximum and minimum number of keys per page, we need to use a maximum and minimum number of bytes.
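As a sketch of what that criterion might look like, consider the following fragment; MAXBYTES, MINBYTES, and the varpage structure are invented for illustration and are not part of the fixed-order BTPAGE used by this chapter's programs.

#define MAXBYTES 512              /* space available for keys in a page   */
#define MINBYTES (MAXBYTES / 2)

struct varpage {
    short bytes_used;             /* total length of the keys now in page */
    /* ... key storage and child pointers would follow ...                */
};

int page_overflows(struct varpage *p, int incoming_key_length)
{
    return p->bytes_used + incoming_key_length > MAXBYTES;
}

int page_underflows(struct varpage *p)
{
    return p->bytes_used < MINBYTES;
}

The splitting, concatenation, and redistribution decisions can then be driven by byte counts rather than key counts.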
Once the fundamental mechanisms for handling variable-length keys or records are in place, interesting new possibilities emerge. For example, we might consider the notion of biasing the key promotion mechanism so the shortest variable-length keys (or key/record pairs) are promoted upward in preference to longer keys. The idea is that we want to have pages with the largest numbers of descendents up high in the tree, rather than at the leaf level. Branching out as broadly as possible as high as possible in the tree tends to reduce the overall height of the tree. McCreight (1977) explores this notion in the article, "Pagination of B* Trees with Variable-Length Records."

The principal point we want to make with these examples of variations on B-tree structures is that this chapter introduces only the most basic forms of this very useful, flexible file structure. Actual implementations of B-trees do not slavishly follow the textbook form of B-trees. Instead, they use many of the other organizational techniques we study in this book, such as variable-length record structures, in combination with the fundamental B-tree organization to make new, special-purpose file structures uniquely suited to the problems at hand.
SUMMARY
We begin this chapter by picking up the problem we left unsolved at the end of Chapter 6: Simple, linear indexes work well if they are held in electronic RAM memory, but are expensive to maintain and search if they are so big that they must be held on secondary storage. The expense of using secondary storage is most evident in two areas:

  Sorting of the index; and
  Searching, since even binary searching requires more than just two or three disk accesses.

We first address the question of structuring an index so it can be kept in order without sorting. We use tree structures to do this, discovering that we need a balanced tree to ensure that the tree does not become overly deep after repeated random insertions. We see that AVL trees provide a way of balancing a binary tree with only a small amount of overhead.

Next we turn to the problem of reducing the number of disk accesses required to search a tree. The solution to this problem involves dividing the tree into pages, so a substantial portion of the tree can be retrieved with a single disk access. Paged indexes let us search through very large numbers of keys with only a few disk accesses.

Unfortunately, we find that it is difficult to combine the idea of paging of tree structures with the balancing of these trees by AVL methods. The most obvious evidence of this difficulty is associated with the problem of selecting the members of the root page of a tree or subtree when the tree is built in the conventional top-down manner. This sets the stage for introducing Bayer and McCreight's work on B-trees, which solves the paging and balancing dilemma by starting from the leaf level, promoting keys upward as the tree grows.
Our discussion of B-trees begins with examples of searching, insertion, splitting, and promotion to show how B-trees grow while maintaining balance in a paged structure. Next we formalize our description of B-trees. This formal definition permits us to develop a formula for estimating worst-case B-tree depth. The formal description also motivates our work on developing deletion procedures that maintain the B-tree properties when keys are removed from a tree.

Once the fundamental structure and procedures for B-trees are in place, we begin refining and improving on these ideas. The first set of improvements involves increasing the storage utilization within B-trees. Of course, increasing storage utilization can also result in a decrease in the height of the tree, and therefore in improvements in performance. We find that by sometimes redistributing keys during insertion, rather than splitting pages, we can improve storage utilization in B-trees so it averages around 85%. Carrying our search for increased storage efficiency even farther, we find that we can combine redistribution during insertion with a different kind of splitting to ensure that the pages are about two-thirds full rather than only one-half full after the split. Trees using this combination of redistribution and two-to-three splitting are called B* trees.

Next we turn to the matter of buffering pages, creating a virtual B-tree. We note that the use of memory is not an all-or-nothing choice: Indexes that are too large to fit entirely into memory do not have to be accessed entirely from secondary storage. If we hold pages that are likely to be reused in RAM, then we can save the expense of reading these pages in from the disk again. We develop two methods of guessing which pages are to be reused. One method uses the height of the page in the tree to decide which pages to keep. Keeping the root has the highest priority, the root's descendents have the next priority, and so on. The second method for selecting pages to keep in RAM is based on recentness of use: We always replace the least-recently-used (LRU) page, retaining the pages used most recently. We see that it is possible to combine these methods, and that doing so can result in the ability to find keys while using an average of less than one disk access per search.
We then turn to the question of where to place the information associated with a key in the B-tree index. Storing it with the key is attractive because, in that case, finding the key is the same as finding the information; no additional disk accesses are required. However, if the associated information takes up a lot of space, it can reduce the order of the tree, thereby increasing the tree's height. In such cases it is often advantageous to store the associated information in a separate file.

We close the chapter with a brief look at the use of variable-length records within the pages of a B-tree, noting that significant savings in space and consequent reduction in the height of the tree can result from the use of variable-length records. The modification of the basic textbook B-tree definition to include the use of variable-length records is just one example of the many variations on B-trees that are used in real-world implementations.
KEY TERMS
AVL tree. A height-balanced (HB(1)) binary tree in which insertions and deletions can be performed with minimal accesses to local nodes. AVL trees are interesting because they keep branches from getting overly long after many random insertions.

B-tree of order m. A multiway search tree with these properties:
1. Every node has a maximum of m descendents.
2. Every node except the root and the leaves has at least ⌈m/2⌉ descendents.
3. The root has at least two descendents (unless it is a leaf).
4. All of the leaves appear on the same level.
5. A nonleaf page with k descendents contains k - 1 keys.
6. A leaf page contains at least ⌈m/2⌉ - 1 keys and no more than m - 1 keys.
B-trees are built upward from the leaf level, so creation of new pages always starts at the leaf level. The power of B-trees lies in the facts that they are balanced (no overly long branches); they are shallow (requiring few seeks); they accommodate random deletions and insertions at a relatively low cost while remaining in balance; and they guarantee at least 50% storage utilization.

B* tree. A special B-tree in which each node is at least two-thirds full. B* trees generally provide better storage utilization than do B-trees.

Concatenation. When a B-tree node underflows (becomes less than 50% full), it sometimes becomes necessary to combine the node with an adjacent node, thus decreasing the total number of nodes in the tree. Since concatenation involves a change in the number of nodes in the tree, its effects can require reorganization at many levels of the tree.

Height-balanced tree. A tree structure with a special property: For each node there is a limit to the amount of difference that is allowed among the heights of any of the node's subtrees. An HB(k) tree allows subtrees to be k levels out of balance. (See AVL tree.)

Leaf of a B-tree. A page at the lowest level in a B-tree. All leaves in a B-tree occur at the same level.

Order of a B-tree. The maximum number of descendents that a node in the B-tree can have.

Paged index. An index that is divided into blocks, or pages, each of which can hold many keys. The use of paged indexes allows us to search through very large numbers of keys with only a few disk accesses.

Promotion of a key. The movement of a key from one node into a higher-level node (creating the higher-level node, if necessary) when the original node becomes overfull and must be split.

Redistribution. When a B-tree node underflows (becomes less than 50% full), it may be possible to move keys into the node from an adjacent node with the same parent. This helps ensure that the 50%-full property is maintained. When keys are redistributed, it becomes necessary to alter the contents of the parent as well. Redistribution, as opposed to concatenation, does not involve creation or deletion of nodes; its effects are entirely local. Redistribution can also often be used as an alternative to splitting.

Splitting. Creation of two nodes out of one because the original node becomes overfull. Splitting results in the need to promote a key to a higher-level node to provide an index separating the two new nodes.

Virtual B-tree. A B-tree index in which several pages are kept in RAM in anticipation of the possibility that one or more of them will be needed by a later access. Many different strategies can be applied to replacing pages in RAM when virtual B-trees are used, including the least-recently-used strategy and height-weighted strategies.
EXERCISES
1. Balanced binary trees can be effective index structures for RAM-based indexing, but they have several drawbacks when they become so large that part or all of them must be kept on secondary storage. The following questions should help bring these drawbacks into focus, and thus reinforce the need for an alternative structure such as the B-tree.
   a. There are two major problems with using binary search to search a simple sorted index on secondary storage: The number of disk accesses is larger than we would like; and the time it takes to keep the index sorted is substantial. Which of the problems does a binary search tree alleviate?
   b. Why is it important to keep search trees balanced?
   c. In what way is an AVL tree better than a simple binary search tree?
   d. Suppose you have a file with 1,000,000 keys stored on disk in a completely full, balanced binary search tree. If the tree is not paged, what is the maximum number of accesses required to find a key? If the tree is paged in the manner illustrated in Fig. 8.12, but with each page able to hold 15 keys and to branch to 16 new pages, what is the maximum number of accesses required to find a key? If the page size is increased to hold 511 keys with branches to 512 nodes, how does the maximum number of accesses change?
   e. Consider the problem of balancing the three-key-per-page tree in Fig. 8.13 by rearranging the pages. Why is it difficult to create a tree-balancing algorithm that has only local effects? When the page size increases to a more likely size (such as 512 keys), why does it become difficult to guarantee that each of the pages contains at least some minimum number of keys?
   f. Explain the following statement: B-trees are built upward from the bottom, whereas binary trees are built downward from the top.
   g. Although B-trees are generally considered superior to binary search trees for external searching, binary trees are still commonly used for internal searching. Why is this so?

2. Describe the necessary parts of a leaf node of a B-tree. How does a leaf node differ from an internal node?

3. Since leaf nodes never have children, it might be possible to use the pointer fields in a leaf node to point to data records. This could eliminate the need for pointer fields to data records in the internal nodes. Why? What are the implications of doing this in terms of storage utilization and retrieval time?

4. Show the B-trees of order four that result from loading the following sets of keys in order.
   a. C G J X
   b. C G J X N S U O A E B H I
   c. C G J X N S U O A E B H I F
   d. C G J X N S U O A E B H I F K L Q R T V U W Z

5. Figure 8.23 shows the pattern of recursive calls involved in inserting a $ into the B-tree in Fig. 8.22. Suppose that subsequent to this insertion, the character [ is inserted after the Z. (The ASCII code for [ is greater than the ASCII code for Z.) Draw a figure similar to Fig. 8.23 which shows the pattern of recursive calls required to perform this insertion.
6. Given a B-tree of order 256:
   a. What is the maximum number of descendents from a page?
   b. What is the minimum number of descendents from a page (excluding the root and leaves)?
   c. What is the minimum number of descendents from the root?
   d. What is the minimum number of descendents from a leaf?
   e. How many keys are there on a nonleaf page with 200 descendents?
   f. What is the maximum depth of the tree if it contains 100,000 keys?

7. Using a method similar to that used to derive the formula for worst-case depth, derive a formula for best case, or minimum depth, for an order m B-tree with N keys. What is the minimum depth of the tree described in the preceding question?

8. Suppose you have a B-tree index for an unsorted file containing N data records, where each key has stored with it the RRN of the corresponding record. The depth of the B-tree is d. What are the maximum and minimum numbers of disk accesses required to
   a. Retrieve a record;
   b. Add a record;
   c. Delete a record; and
   d. Retrieve all records from the file in sorted order.
   Assume that page buffering is not used. In each case, indicate how you arrived at your answer.

9. Show the trees that result after each of the keys A, B, Q, and ... is deleted from the following B-tree of order five.

   [Figure: a B-tree of order five with root D H and leaves including A B C, F, K L, and N O.]
10. A common belief about B-trees is that a B-tree cannot grow deeper unless it is 100% full. Discuss this.

11. Suppose you want to delete a key from a node in a B-tree. You look at the right sibling and find that redistribution does not work; concatenation would be necessary. You look to the left and see that redistribution is an option here. Do you choose to concatenate or redistribute?

12. What is the difference between a B* tree and a B-tree? What improvement does a B* tree offer over a B-tree, and what complications does it introduce? How does the minimum depth of an order m B* tree compare with the minimum depth of an order m B-tree?

13. What is a virtual B-tree? How can it be possible to average fewer than one access per key when retrieving keys from a three-level virtual B-tree? Write a pseudocode description for an LRU replacement scheme for a 10-page buffer used in implementing a virtual B-tree.

14. Discuss the trade-offs between storing the information indexed by the keys in the B-tree with the key and storing the information in a separate file.

15. We noted that, given variable-length keys, it is possible to optimize a tree by building in a bias toward promoting shorter keys. With fixed-order trees we promote the middle key. In a variable-order, variable-length key tree, what is the meaning of "middle key"? What are the trade-offs associated with building in a bias toward shorter keys in this selection of a key for promotion? Outline an implementation for this selection and promotion process.
Programming Exercises
16. Implement the programs at the end of this chapter and add a recursive procedure that performs a parenthesized symmetric traversal of the B-tree created by the program. As an example, here is the result of a parenthesized traversal of the tree shown in Fig. 8.18:

(((A,B,C)D(E,F,G)H(I,J)K(L,M))N((O,P)Q(R)S(T,U,V)W(X,Y,Z)))

17. The split() routine in the B-tree programs is not very efficient. Rewrite it to make it more efficient.

18. Write a program that searches for a key in a B-tree.

19. Write an interactive program that allows a user to find, insert, and delete keys from a B-tree.

20. Write a B-tree program that uses keys that are strings, rather than single characters.

21. Write a program that builds a B-tree index for a data file in which records contain more information than just a key.
FURTHER READINGS
Currently available textbooks on file and data structures contain surprisingly brief discussions on B-trees. These discussions do not, in general, add substantially to the information presented in this chapter and the following chapter. Consequently, readers interested in more information about B-trees must turn to the articles that have appeared in journals over the past 15 years.

The article that introduced B-trees to the world is Bayer and McCreight's "Organization and Maintenance of Large Ordered Indexes" (1972). It describes the theoretical properties of B-trees and includes empirical results concerning, among other things, the effect of using redistribution in addition to splitting during insertion. Readers should be aware that the notation and terminology used in this article differ from that used in this text in a number of important respects.

Comer's (1979) survey article, "The Ubiquitous B-tree," provides an excellent overview of some important variations on the basic B-tree form. Knuth's (1973b) discussion of B-trees, although brief, is an important resource, in part because many of the variant forms such as B* trees were first collected together in Knuth's discussion. McCreight (1977) looks specifically at operations on trees that use variable-length records and that are therefore of variable order. Although this article speaks specifically about B* trees, the consideration of variable-length records can be applied to many other B-tree forms. In "Time and Space Optimality on B-trees," Rosenberg and Snyder (1981) analyze the effects of initializing B-trees with the minimum number of nodes. In "Analysis of Design Alternatives for Virtual Memory Indexes," Murayama and Smith (1977) look at three factors that affect the cost of retrieval: choice of search strategy, whether or not pages in the index are structured, and whether or not keys are compressed. Zoellick (1986) discusses the use of B-tree-like structures on optical discs.

Since B-trees in various forms have become a standard file organization for databases, a good deal of interesting material on applications of B-trees can be found in the database literature. Ullman (1986), Held and Stonebraker (1978), and Snyder (1978) discuss the use of B-trees in database systems generally. Ullman (1986) covers the problem of dealing with applications in which several programs have access to the same database concurrently and identifies literature concerned with concurrent access to B-trees.

Uses of B-trees for secondary key access are covered in many of the previously cited references. There is also a growing literature on multidimensional dynamic indexes, including a B-tree-like structure called a k-d B-tree. K-d B-trees are described in papers by Ouskel and Scheuermann (1981) and Robinson (1981). Other approaches to secondary indexing include the use of tries and grid files. Tries are covered in many texts on files and data structures, including Knuth (1973b) and Loomis (1983). Grid files are covered thoroughly in Nievergelt et al. (1984).

An interesting early paper on the use of dynamic tree structures for processing files is "The Use of Tree Structures for Processing Files," by Sussenguth (1963). Wagner (1973) and Keehn and Lacy (1974) examine the index design considerations that led to the development of VSAM. VSAM uses an index structure very similar to a B-tree, but appears to have been developed independently of Bayer and McCreight's work. Readers interested in learning more about AVL trees will find a good, approachable discussion of the algorithms associated with these trees in Standish (1980). Knuth (1973b) takes a more rigorous, mathematical look at AVL tree operations and properties.
C Programs to Insert Keys into a B-Tree
The C program that follows implements the insert program described in the text. The only difference between this program and the one in the text is that this program builds a B-tree of order five, whereas the one in the text builds a B-tree of order four. Input characters are taken from standard I/O, with q indicating end of data.

The program requires the use of functions from several files:

driver.c    Contains the main program, which parallels the driver program described in the text very closely.

insert.c    Contains insert(), the recursive function that finds the proper place for a key, inserts it, and supervises splitting and promotions.

btio.c      Contains all support functions that directly perform I/O. The header files fileio.h and stdio.h must be available for inclusion in btio.c.

btutil.c    Contains the rest of the support functions, including the function split() described in the text.

All the programs include the header file called bt.h.

/* bt.h... header file for btree programs */

#define MAXKEYS   4
#define MINKEYS   (MAXKEYS/2)
#define NIL       (-1)
#define NOKEY     '@'
#define NO        0
#define YES       1

typedef struct {
    short keycount;            /* number of keys in page       */
    char  key[MAXKEYS];        /* the actual keys              */
    short child[MAXKEYS+1];    /* ptrs to rrns of descendants  */
} BTPAGE;

#define PAGESIZE  sizeof(BTPAGE)

extern short root;             /* rrn of root page              */
extern int   btfd;             /* file descriptor of btree file */
extern int   infd;             /* file descriptor of input file */

/* prototypes */
btclose();
btopen();
btread(short rrn, BTPAGE *page_ptr);
btwrite(short rrn, BTPAGE *page_ptr);
create_root(char key, short left, short right);
short create_tree();
short getpage();
short getroot();
insert(short rrn, char key, short *promo_r_child, char *promo_key);
ins_in_page(char key, short r_child, BTPAGE *p_page);
pageinit(BTPAGE *p_page);
putroot(short root);
search_node(char key, BTPAGE *p_page, short *pos);
split(char key, short r_child, BTPAGE *p_oldpage, char *promo_key,
      short *promo_r_child, BTPAGE *p_newpage);
Driver.c

/* driver.c...
   Driver for btree tests:
       Opens or creates b-tree file.
       Gets next key and calls insert to insert key in tree.
       If necessary, creates a new root.
*/
#include <stdio.h>
#include "bt.h"

main()
{
    int   promoted;       /* boolean: tells if a promotion from below */
    short root,           /* rrn of root page                         */
          promo_rrn;      /* rrn promoted from below                  */
    char  promo_key,      /* key promoted from below                  */
          key;            /* next key to insert in tree               */

    if (btopen())                 /* try to open btree.dat and get root */
        root = getroot();
    else                          /* if btree.dat not there, create it  */
        root = create_tree();

    while ((key = getchar()) != 'q') {
        promoted = insert(root, key, &promo_rrn, &promo_key);
        if (promoted)
            root = create_root(promo_key, root, promo_rrn);
    }
    btclose();
}
Insert.c

/* insert.c...
   Contains insert() function to insert a key into a btree.
   Calls itself recursively until bottom of tree is reached.
   Then inserts key in node.
   If node is out of room,
       calls split() to split node and
       promotes middle key and rrn of new node.
*/
#include "bt.h"

/* insert()...
   Arguments:
       rrn:             rrn of page to make insertion in
       key:             key to be inserted here or lower
       *promo_r_child:  child promoted up from here to next level
       *promo_key:      key promoted up from here to next level
*/
insert(short rrn, char key, short *promo_r_child, char *promo_key)
{
    BTPAGE page,               /* current page                     */
           newpage;            /* new page created if split occurs */
    int    found, promoted;    /* boolean values                   */
    short  pos,
           p_b_rrn;            /* rrn promoted from below          */
    char   p_b_key;            /* key promoted from below          */

    if (rrn == NIL) {              /* past bottom of tree... "promote" */
        *promo_key = key;          /* original key so that it will be  */
        *promo_r_child = NIL;      /* inserted at leaf level           */
        return (YES);
    }

    btread(rrn, &page);
    found = search_node(key, &page, &pos);
    if (found) {
        printf("Error: attempt to insert duplicate key: %c \n\007", key);
        return (0);
    }

    promoted = insert(page.child[pos], key, &p_b_rrn, &p_b_key);
    if (!promoted)
        return (NO);                            /* no promotion          */

    if (page.keycount < MAXKEYS) {
        ins_in_page(p_b_key, p_b_rrn, &page);   /* OK to insert key and  */
        btwrite(rrn, &page);                    /* pointer in this page. */
        return (NO);                            /* no promotion          */
    }
    else {
        split(p_b_key, p_b_rrn, &page, promo_key, promo_r_child, &newpage);
        btwrite(rrn, &page);
        btwrite(*promo_r_child, &newpage);
        return (YES);                           /* promotion             */
    }
}
Btio.c

/* btio.c...
   Contains btree functions that directly involve file i/o:
       btopen()       -- open file "btree.dat" to hold the btree
       btclose()      -- close "btree.dat"
       getroot()      -- get rrn of root node from first two bytes of btree.dat
       putroot()      -- put rrn of root node in first two bytes of btree.dat
       create_tree()  -- create "btree.dat" and root node
       getpage()      -- get next available block in "btree.dat" for a new page
       btread()       -- read page number rrn from "btree.dat"
       btwrite()      -- write page number rrn to "btree.dat"
*/
#include "stdio.h"
#include "bt.h"
#include "fileio.h"

int btfd;          /* global file descriptor for "btree.dat" */

btopen()
{
    btfd = open("btree.dat", O_RDWR);
    return (btfd > 0);
}

btclose()
{
    close(btfd);
}

short getroot()
{
    short root;
    long  lseek();

    lseek(btfd, 0L, 0);
    if (read(btfd, &root, 2) == 0) {
        printf("Error: Unable to get root. \007\n");
        exit(1);
    }
    return (root);
}

putroot(short root)
{
    lseek(btfd, 0L, 0);
    write(btfd, &root, 2);
}

short create_tree()
{
    char key;

    btfd = creat("btree.dat", PMODE);
    close(btfd);          /* Have to close and reopen to insure */
    btopen();             /* read/write access on many systems. */
    key = getchar();      /* Get first key.                     */
    return (create_root(key, NIL, NIL));
}

short getpage()
{
    long lseek(), addr;

    addr = lseek(btfd, 0L, 2) - 2L;
    return ((short) (addr / PAGESIZE));
}

btread(short rrn, BTPAGE *page_ptr)
{
    long lseek(), addr;

    addr = (long) rrn * (long) PAGESIZE + 2L;
    lseek(btfd, addr, 0);
    return (read(btfd, page_ptr, PAGESIZE));
}

btwrite(short rrn, BTPAGE *page_ptr)
{
    long lseek(), addr;

    addr = (long) rrn * (long) PAGESIZE + 2L;
    lseek(btfd, addr, 0);
    return (write(btfd, page_ptr, PAGESIZE));
}
Btutil.c

/* btutil.c...
   Contains utility functions for btree program:
       create_root()  -- get and initialize root node and insert one key
       pageinit()     -- put NOKEY in all "key" slots and NIL in "child" slots
       search_node()  -- return YES if key in node, else NO. In either case,
                         put key's correct position in pos.
       ins_in_page()  -- insert key and right child in page
       split()        -- split node by creating new node and moving half of
                         keys to new node. Promote middle key and rrn of
                         new node.
*/
#include "bt.h"

create_root(char key, short left, short right)
{
    BTPAGE page;
    short  rrn;

    rrn = getpage();
    pageinit(&page);
    page.key[0] = key;
    page.child[0] = left;
    page.child[1] = right;
    page.keycount = 1;
    btwrite(rrn, &page);
    putroot(rrn);
    return (rrn);
}

pageinit(BTPAGE *p_page)    /* p_page: pointer to a page */
{
    int j;

    for (j = 0; j < MAXKEYS; j++) {
        p_page->key[j] = NOKEY;
        p_page->child[j] = NIL;
    }
    p_page->child[MAXKEYS] = NIL;
}

search_node(char key, BTPAGE *p_page, short *pos)
/* pos: position where key is or should be inserted */
{
    int i;

    for (i = 0; i < p_page->keycount && key > p_page->key[i]; i++)
        ;
    *pos = i;
    if (*pos < p_page->keycount && key == p_page->key[*pos])
        return (YES);            /* key is in page     */
    else
        return (NO);             /* key is not in page */
}

ins_in_page(char key, short r_child, BTPAGE *p_page)
{
    int i;

    for (i = p_page->keycount; i > 0 && key < p_page->key[i-1]; i--) {
        p_page->key[i] = p_page->key[i-1];
        p_page->child[i+1] = p_page->child[i];
    }
    p_page->keycount++;
    p_page->key[i] = key;
    p_page->child[i+1] = r_child;
}

/* split()...
   Arguments:
       key:             key to be inserted
       r_child:         child rrn to be inserted
       p_oldpage:       pointer to old page structure
       p_newpage:       pointer to new page structure
       promo_key:       key to be promoted up from here
       promo_r_child:   rrn to be promoted up from here
*/
split(char key, short r_child, BTPAGE *p_oldpage, char *promo_key,
      short *promo_r_child, BTPAGE *p_newpage)
{
    int   i;
    short mid;                      /* tells where split is to occur            */
    char  workkeys[MAXKEYS+1];      /* temporarily holds keys, before split     */
    short workch[MAXKEYS+2];        /* temporarily holds children, before split */

    for (i = 0; i < MAXKEYS; i++) {          /* move keys and children from */
        workkeys[i] = p_oldpage->key[i];     /* old page into work arrays   */
        workch[i] = p_oldpage->child[i];
    }
    workch[i] = p_oldpage->child[i];

    for (i = MAXKEYS; i > 0 && key < workkeys[i-1]; i--) {   /* insert new key */
        workkeys[i] = workkeys[i-1];
        workch[i+1] = workch[i];
    }
    workkeys[i] = key;
    workch[i+1] = r_child;

    *promo_r_child = getpage();      /* create new page for split,  */
    pageinit(p_newpage);             /* and promote rrn of new page */

    for (i = 0; i < MINKEYS; i++) {                 /* move first half of keys and  */
        p_oldpage->key[i] = workkeys[i];            /* children to old page, second */
        p_oldpage->child[i] = workch[i];            /* half to new page             */
        p_newpage->key[i] = workkeys[i+1+MINKEYS];
        p_newpage->child[i] = workch[i+1+MINKEYS];
        p_oldpage->key[i+MINKEYS] = NOKEY;          /* mark second half of old      */
        p_oldpage->child[i+1+MINKEYS] = NIL;        /* page as empty                */
    }
    p_oldpage->child[MINKEYS] = workch[MINKEYS];
    p_newpage->child[MINKEYS] = workch[i+1+MINKEYS];
    p_newpage->keycount = MAXKEYS - MINKEYS;
    p_oldpage->keycount = MINKEYS;
    *promo_key = workkeys[MINKEYS];                 /* promote middle key           */
}
Pascal Programs to Insert Keys into a B-Tree
The Pascal program that follows implements the insert program described in the text. The only difference between this program and the one in the text is that this program builds a B-tree of order five, whereas the one in the text builds a B-tree of order four. Input characters are taken from standard I/O, with q indicating end of data.

The main program includes three nonstandard compiler directives:

{$B-}
{$I btutil.prc}
{$I insert.prc}

The $B- directive instructs the Turbo Pascal compiler to handle keyboard input as a standard Pascal file. The $I directives instruct the compiler to include the files btutil.prc and insert.prc in the main program. These two files contain functions needed by the main program. So the B-tree program requires the use of functions from three files:

driver.pas    Contains the main program, which closely parallels the driver program described in the text.

insert.prc    Contains insert(), the recursive function that finds the proper place for a key, inserts it, and supervises splitting and promotions.

btutil.prc    Contains all other support functions, including the function split() described in the text.
Driver.pas

PROGRAM btree (INPUT, OUTPUT);

{ Driver for B-tree tests:
    Opens or creates btree file.
    Gets next key and calls insert to insert key in tree.
    If necessary, creates a new root.                      }

{$B-}

CONST
    MAXKEYS  = 4;          {maximum number of keys in a page             }
    MAXCHLD  = 5;          {maximum number of children in page           }
    MAXWKEYS = 5;          {maximum number of keys in working space      }
    MAXWCHLD = 6;          {maximum number of children in working space  }
    NOKEY    = '@';        {symbol to indicate no key                    }
    NO       = FALSE;
    YES      = TRUE;
    NULL     = -1;

TYPE
    BTPAGE = RECORD
        keycount : integer;                        {number of keys in page      }
        key      : array [1..MAXKEYS] of char;     {the actual keys             }
        child    : array [1..MAXCHLD] of integer;  {ptrs to RRNs of descendents }
    END;

VAR
    promoted  : boolean;           {tells if a promotion from below }
    root,                          {RRN of root                     }
    promo_rrn : integer;           {RRN promoted from below         }
    promo_key,                     {key promoted from below         }
    key       : char;              {next key to insert in tree      }
    btfd      : file of BTPAGE;    {global file descriptor for      }
                                   {"btree.dat"                     }
    MINKEYS   : integer;           {min. number of keys in a page   }
    PAGESIZE  : integer;           {size of a page                  }

{$I btutil.prc}
{$I insert.prc}

BEGIN {main}
    MINKEYS  := MAXKEYS DIV 2;
    PAGESIZE := sizeof(BTPAGE);
    if btopen then                   {try to open btree.dat and get root }
        root := getroot
    else
        root := create_tree;         {if btree.dat not there, create it  }
    read(key);
    WHILE (key <> 'q') DO
    BEGIN
        promoted := insert(root, key, promo_rrn, promo_key);
        if promoted then
            root := create_root(promo_key, root, promo_rrn);
        read(key)
    END;
    btclose
END.
Insert.prc

FUNCTION insert (rrn: integer; key: char; VAR promo_r_child: integer;
                 VAR promo_key: char): boolean;

{ Function to insert a key into a B-tree:
    Calls itself recursively until the bottom of the tree is reached.
    Then inserts the key in the node.
    If node is out of room, then it calls split() to split the node and
    promotes the middle key and RRN of new node.                        }

VAR
    page,
    newpage   : BTPAGE;      {current page                      }
                             {new page created if split occurs  }
    found,
    promoted  : boolean;     {tells if key is already in B-tree }
                             {tells if key is promoted          }
    pos,
    p_b_rrn   : integer;     {position that key is to go in     }
                             {RRN promoted from below           }
    p_b_key   : char;        {key promoted from below           }

BEGIN
    if (rrn = NULL) then             {past bottom of tree... "promote" }
    BEGIN                            {original key so that it will be  }
        promo_key := key;            {inserted at leaf level           }
        promo_r_child := NULL;
        insert := YES
    END
    else
    BEGIN
        btread(rrn, page);
        found := search_node(key, page, pos);
        if (found) then
        BEGIN
            writeln('Error: attempt to insert duplicate key: ', key);
            insert := NO
        END
        else
        BEGIN
            promoted := insert(page.child[pos], key, p_b_rrn, p_b_key);
            if (NOT promoted) then
                insert := NO                          {no promotion}
            else
            BEGIN
                if (page.keycount < MAXKEYS) then
                BEGIN                                       {OK to insert key    }
                    ins_in_page(p_b_key, p_b_rrn, page);    {and pointer in this }
                    btwrite(rrn, page);                     {page.               }
                    insert := NO                            {no promotion}
                END
                else
                BEGIN
                    split(p_b_key, p_b_rrn, page, promo_key,
                          promo_r_child, newpage);
                    btwrite(rrn, page);
                    btwrite(promo_r_child, newpage);
                    insert := YES                           {promotion}
                END
            END
        END
    END
END;
Btutil.prc
FUNCTION btopen
BOOLEAN;
{Function to open "btree.dat"
it returns false}
:
if
it
already exists. Otherwise
VAR
response
char;
BEGIN
assign(btfd, 'btree.dat );
write('Does btree.dat already exist? (respond
readln(response)
writeln;
if (response = 'Y') OR (response = 'y') then
BEGIN
reset(btfd)
btopen := TRUE
END
else
btopen := FALSE
:
'
END;
or
N):
*);
}
}
401
PASCAL PROGRAMS TO INSERT KEYS INTO A B-TREE: BTUTILPRC
PROCEDURE btclose;
{Procedure to close "btree.dat"}
BEGIN
close (btfd);
END;
FUNCTION getroot : integer;
{Function to get the RRN of the root node from first record of btree.dat}
VAR
    root : BTPAGE;
BEGIN
    seek(btfd, 0);
    if (not EOF(btfd)) then
    BEGIN
        read(btfd, root);
        getroot := root.keycount       {root RRN is stored in the keycount}
    END                                {field of the first record         }
    else
        writeln('Error: Unable to get root.')
END;

FUNCTION getpage : integer;
{Function that gets the next available block in "btree.dat" for a new page}
BEGIN
    getpage := filesize(btfd)
END;

PROCEDURE pageinit (VAR p_page : BTPAGE);
{puts NOKEY in all "key" slots and NULL in "child" slots}
VAR
    j : integer;
BEGIN
    for j := 1 to MAXKEYS DO
    BEGIN
        p_page.key[j]   := NOKEY;
        p_page.child[j] := NULL
    END;
    p_page.child[MAXKEYS+1] := NULL
END;
PROCEDURE putroot (root: integer);
{Puts RRN of root node in the keycount of the first record of btree.dat}
VAR
    rootrrn : BTPAGE;
BEGIN
    seek(btfd, 0);
    rootrrn.keycount := root;
    pageinit(rootrrn);
    write(btfd, rootrrn)
END;

PROCEDURE btread (rrn : integer; VAR page_ptr : BTPAGE);
{reads page number RRN from btree.dat}
BEGIN
    seek(btfd, rrn);
    read(btfd, page_ptr)
END;

PROCEDURE btwrite (rrn : integer; page_ptr : BTPAGE);
{writes page number RRN to btree.dat}
BEGIN
    seek(btfd, rrn);
    write(btfd, page_ptr)
END;
FUNCTION create_root (key: char; left, right: integer): integer;
{get and initialize root node and insert one key}
VAR
    page : BTPAGE;
    rrn  : integer;
BEGIN
    rrn := getpage;
    pageinit(page);
    page.key[1]   := key;
    page.child[1] := left;
    page.child[2] := right;
    page.keycount := 1;
    btwrite(rrn, page);
    putroot(rrn);
    create_root := rrn
END;

FUNCTION create_tree : integer;
{creates "btree.dat" and the root node}
VAR
    rootrrn : integer;
BEGIN
    rewrite(btfd);
    read(key);                         {get the first key for the root    }
    rootrrn := getpage;
    putroot(rootrrn);                  {reserve record 0 to hold root RRN }
    create_tree := create_root(key, NULL, NULL)
END;
FUNCTION search_node (key : char; p_page : BTPAGE;
                      VAR pos : integer): boolean;
{returns YES if key in node, else NO.  In either case, put key's correct
 position in pos}
VAR
    i : integer;
BEGIN
    i := 1;
    while ((i <= p_page.keycount) AND (key > p_page.key[i])) DO
        i := i + 1;
    pos := i;
    if ((pos <= p_page.keycount) AND (key = p_page.key[pos])) then
        search_node := YES
    else
        search_node := NO
END;
PROCEDURE ins_in_page (key: char; r_child: integer; VAR p_page: BTPAGE);
{insert key and right child in page}
VAR
    i : integer;
BEGIN
    i := p_page.keycount;
    while ((i >= 1) AND (key < p_page.key[i])) DO
    BEGIN                                        {shift keys and children   }
        p_page.key[i+1]   := p_page.key[i];      {larger than the new key   }
        p_page.child[i+2] := p_page.child[i+1];  {one position to the right }
        i := i - 1
    END;
    p_page.keycount   := p_page.keycount + 1;
    p_page.key[i+1]   := key;                    {insert new key and its    }
    p_page.child[i+2] := r_child                 {right child               }
END;
PROCEDURE split (key: char; r_child: integer; VAR p_oldpage: BTPAGE;
                 VAR promo_key: char; VAR promo_r_child: integer;
                 VAR p_newpage: BTPAGE);
{split node by creating new node and moving half of keys to new node.
 Promote middle key and RRN of new node.}
VAR
    i        : integer;
    workkeys : array [1..MAXWKEYS] of char;     {temporarily holds keys,    }
                                                {  before split             }
    workch   : array [1..MAXWCHLD] of integer;  {temporarily holds children,}
                                                {  before split             }
BEGIN
    for i := 1 to MAXKEYS DO
    BEGIN                                       {move keys and children from}
        workkeys[i] := p_oldpage.key[i];        {old page into work arrays  }
        workch[i]   := p_oldpage.child[i]
    END;
    workch[MAXKEYS+1] := p_oldpage.child[MAXKEYS+1];

    i := MAXKEYS;
    while ((i >= 1) AND (key < workkeys[i])) DO {insert new key and child   }
    BEGIN                                       {into the work arrays       }
        workkeys[i+1] := workkeys[i];
        workch[i+2]   := workch[i+1];
        i := i - 1
    END;
    workkeys[i+1] := key;
    workch[i+2]   := r_child;

    promo_r_child := getpage;                   {create new page for split  }
    pageinit(p_newpage);                        {and promote RRN of new page}

    for i := 1 to MINKEYS DO
    BEGIN                                          {move first half of keys }
        p_oldpage.key[i]   := workkeys[i];         {and children to old page,}
        p_oldpage.child[i] := workch[i];           {second half to new page. }
        p_newpage.key[i]   := workkeys[i+1+MINKEYS];
        p_newpage.child[i] := workch[i+1+MINKEYS];
        p_oldpage.key[i+MINKEYS]     := NOKEY;     {mark second half of old  }
        p_oldpage.child[i+1+MINKEYS] := NULL       {page as empty            }
    END;
    p_oldpage.child[MINKEYS+1] := workch[MINKEYS+1];
    if odd(MAXKEYS) then
    begin
        p_newpage.key[MINKEYS+1]   := workkeys[MAXWKEYS];
        p_newpage.child[MINKEYS+2] := workch[MAXWCHLD];
        p_newpage.child[MINKEYS+1] := workch[MAXWCHLD-1]
    end
    else
        p_newpage.child[MINKEYS+1] := workch[MAXWCHLD];
    p_newpage.keycount := MAXKEYS - MINKEYS;
    p_oldpage.keycount := MINKEYS;
    promo_key := workkeys[MINKEYS+1]               {promote middle key      }
END;
9
The B+ Tree Family and Indexed Sequential File Access

CHAPTER OBJECTIVES

- Introduce indexed sequential files.
- Describe operations on a sequence set of blocks that maintains records in order by key.
- Show how an index set can be built on top of the sequence set to produce an indexed sequential file structure.
- Introduce the use of a B-tree to maintain the index set, thereby introducing B+ trees and simple prefix B+ trees.
- Illustrate how the B-tree index set in a simple prefix B+ tree can be of variable order, holding a variable number of separators.
- Compare the strengths and weaknesses of B+ trees, simple prefix B+ trees, and B-trees.
CHAPTER OUTLINE

9.1 Indexed Sequential Access
9.2 Maintaining a Sequence Set
    9.2.1 The Use of Blocks
    9.2.2 Choice of Block Size
9.3 Adding a Simple Index to the Sequence Set
9.4 The Content of the Index: Separators Instead of Keys
9.5 The Simple Prefix B+ Tree
9.6 Simple Prefix B+ Tree Maintenance
    9.6.1 Changes Localized to Single Blocks in the Sequence Set
    9.6.2 Changes Involving Multiple Blocks in the Sequence Set
9.7 Index Set Block Size
9.8 Internal Structure of Index Set Blocks: A Variable-order B-Tree
9.9 Loading a Simple Prefix B+ Tree
9.10 B+ Trees
9.11 B-Trees, B+ Trees, and Simple Prefix B+ Trees in Perspective
9.1 Indexed Sequential Access

Indexed sequential file structures provide a choice between two alternative views of a file:

- Indexed: the file can be seen as a set of records that is indexed by key; or
- Sequential: the file can be accessed sequentially (physically contiguous records, with no seeking), returning records in order by key.

The idea of having a single organizational method that provides both of these views is a new one. Up to this point we have had to choose between them. As a somewhat extreme, though instructive, example of the potential divergence of these two choices, suppose that we have developed a file structure consisting of a set of entry-sequenced records indexed by a separate B-tree. This structure can provide excellent indexed access to any individual record by key, even as records are added and deleted. Now let's suppose that we also want to use this file as part of a cosequential merge. In cosequential processing we want to retrieve all the records in order by key. Since the actual records in this file system are entry sequenced, rather than physically sorted by key, the only way to retrieve them in order by key is through the index. For a file of N records, following the pointers from the index into the entry-sequenced set requires N essentially random seeks into the record file. This is a much less efficient process than the sequential reading of physically adjacent records, so much so that it is unacceptable for any situation in which cosequential processing is a frequent occurrence.

On the other hand, our discussions of indexing show us that a file consisting of a set of records sorted by key, though ideal for cosequential processing, is an unacceptable structure when we want to access, insert, and delete records by key in random order.

What if an application involves both interactive random access and cosequential batch processing? There are many examples of such dual-mode applications. Student record systems at universities, for example, require keyed access to individual records while also requiring a large amount of batch processing, as when grades are posted or when fees are paid during registration. Similarly, credit card systems require both batch processing of charge slips and interactive checks of account status. Indexed sequential access methods were developed in response to these kinds of needs.
9.2 Maintaining a Sequence Set

We set aside, for the moment, the indexed part of indexed sequential access, focusing on the problem of keeping a set of records in physical order by key as records are added and deleted. We refer to this ordered set of records as a sequence set. We will assume that once we have a good way of maintaining a sequence set, we will find some way to index it as well.

9.2.1 The Use of Blocks

We can immediately rule out the idea of sorting and resorting the entire sequence set as records are added and deleted, since we know that sorting an entire file is an expensive process. We need instead to find a way to restrict the effects of an insertion or deletion to just a part of the sequence set. One of the best ways to do this involves a tool we first encountered in Chapters 3 and 4: We can collect the records into blocks.

When we block records, the block becomes the basic unit of input and output. We read and write entire blocks at once. Consequently, the size of the buffers we use in a program is such that they can hold an entire block. After reading in a block, all the records in a block are in RAM, where we can work on them or rearrange them much more rapidly.

An example helps illustrate how the use of blocks can help us keep a sequence set in order. Suppose we have records that are keyed on last name and collected together so there are four records in a block. We also include link fields in each block that point to the preceding block and the following block. We need these fields because, as you will see, consecutive blocks are not necessarily physically adjacent.
As with B-trees, the insertion of new records into a block can cause the
block to overflow. The overflow condition can be handled by a blocksplitting process that is analogous to, but not the same as, the blocksplitting process used in a B-tree. For example, Fig. 9.1(a) shows what our
blocked sequence set looks like before any insertions or deletions take place.
We show only the forward links. In Fig. 9.1(b) we have inserted a new
record with the key CARTER. This insertion causes block 2 to split. The
second half of what was originally block 2 is found in block 4 after the split. Note that this block-splitting process operates differently than the splitting encountered in B-trees. In a B-tree a split results in the promotion of a record. Here things are simpler: We just divide the records between two blocks and rearrange the links so we can still move through the file in order by key, block after block.

Deletion of records can cause a block to be less than half full and therefore to underflow. Once again, this problem and its solutions are analogous to what we encounter when working with B-trees. Underflow in a B-tree can lead to either of two solutions:

- If a neighboring node is also half full, we can concatenate the two nodes, freeing one up for reuse.
- If the neighboring nodes are more than half full, we can redistribute records between the nodes to make the distribution more nearly even.

Underflow within a block of our sequence set can be handled through the same kinds of processes. As with insertion, the process for the sequence
set is
simpler than the process for B-trees since the sequence
set is not a tree
and there are therefore no keys and records in a parent node. In Fig. 9.1(c)
we show the effects of deleting the record for DAVIS. Block 4 underflows
and is then concatenated with its successor in logical sequence, which is
block 3. The concatenation process frees up block 3 for reuse. We do not
show an example in which underflow leads to redistribution, rather than
concatenation, since it is easy to see how the redistribution process works.
Records are simply moved between logically adjacent blocks.
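Before turning to the costs of blocking, here is a minimal C sketch of the block structure and the splitting step just described. It is an illustration only, not code from the text: the type and field names (SEQBLOCK, BLKRECS, prev, next) are assumptions, and records are reduced to bare keys.

#include <string.h>

#define BLKRECS 4                      /* records per block, as in Fig. 9.1 */
#define KEYLEN  12

typedef struct {
    int  count;                        /* number of records in the block     */
    char keys[BLKRECS][KEYLEN];        /* records, shown here as keys only   */
    int  prev, next;                   /* links to logically adjacent blocks */
} SEQBLOCK;

/* Split a full block: move the second half of its records into an empty
 * block and relink the list so a key-ordered traversal still works.        */
void split_block(SEQBLOCK *old, SEQBLOCK *newb, int old_rbn, int new_rbn)
{
    int half = old->count / 2;

    newb->count = old->count - half;
    memcpy(newb->keys, old->keys + half, (size_t)newb->count * KEYLEN);
    old->count = half;

    newb->next = old->next;            /* new block follows the old one      */
    newb->prev = old_rbn;
    old->next  = new_rbn;
}

Concatenation is the mirror image: copy the successor's records onto the end of the underfull block and splice the successor out of the linked list, freeing its block for reuse.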
FIGURE 9.1 Block splitting and concatenation due to insertions and deletions in the sequence set. (a) Initial blocked sequence set. (b) Sequence set after insertion of CARTER record: block 2 splits, and the contents are divided between blocks 2 and 4. (c) Sequence set after deletion of DAVIS record: block 4 is less than half full, so it is concatenated with block 3.

Given the separation of records into blocks, along with these fundamental block-splitting, concatenation, and redistribution operations, we can keep a sequence set in order by key without ever having to sort the entire set of records. As always, nothing comes free; consequently, there are costs associated with this avoidance of sorting:

- Once insertions are made, our file takes up more space than an unblocked file of sorted records because of internal fragmentation within a block. However, we can apply the same kinds of strategies used to increase space utilization in a B-tree (e.g., the use of redistribution in place of splitting during insertion, two-to-three splitting, and so on). Once again, the implementation of any of these strategies must account for the fact that the sequence set is not a tree and that there is therefore no promotion of records.

- The order of the records is not necessarily physically sequential throughout the file. The maximum guaranteed extent of physical sequentiality is within a block.

This last point leads us to the important question of selecting a block size.
9.2.2 Choice of Block Size

As we work with our sequence set, a block is the basic unit for our I/O operations. When we read data from the disk, we never read less than a block; when we write data, we always write at least one block. A block is also, as we have said, the maximum guaranteed extent of physical sequentiality. It follows that we should think in terms of large blocks, with each block holding many records. So the question of block size becomes one of identifying the limits on block size: Why not make the block size so big we can fit the entire file in a single block?

One answer to this is the same as the reason why we cannot always sort a file in RAM: We usually do not have enough RAM available. So our first consideration regarding an upper bound for block size is as follows:

Consideration 1: The block size should be such that we can hold several blocks in RAM at once. For example, in performing a block split or concatenation, we want to be able to hold at least two blocks in RAM at a time. If we are implementing two-to-three splitting to conserve disk space, we need to hold at least three blocks in RAM at a time.

Although we are presently focusing on the ability to access our sequence set sequentially, we eventually want to consider the problem of randomly accessing a single record from our sequence set. We have to read in an entire block to get at any one record within that block. We can therefore state a second consideration:

Consideration 2: Reading in or writing out a block should not take very long. Even if we had an unlimited amount of RAM, we would want to place an upper limit on the block size so we would not end up reading in the entire file just to get at a single record.

This second consideration is more than a little imprecise: How long is very long? And where is that? We are not interested in a mandatory limitation, but it is still a sensible one. We can refine this consideration by factoring in some of our knowledge of the performance characteristics of disk drives:

Consideration 2 (redefined): The block size should be such that we can access a block without having to bear the cost of a disk seek within the block read or block write operation.

That is, we should not extend blocks beyond the point at which we can guarantee such adjacency. And where is that point? When we discussed sector-formatted disks back in Chapter 3, we introduced the term cluster. A cluster is the minimum number of sectors allocated at a time; since a cluster contains sectors that are physically adjacent, clustering guarantees a minimum amount of physical sequentiality. If a cluster consists of eight sectors, then a file containing only one byte still uses up eight sectors on the disk. As we move from cluster to cluster in reading a file, we may incur a disk seek, but within a cluster the data can be accessed without seeking.

One reasonable suggestion for deciding on block size, then, is to make each block equal to the size of a cluster. Often the cluster size on a disk system has already been determined by the system administrator. But what if you are configuring a disk system for a particular application and can therefore choose your own cluster size? Then you need to consider the issues relating to cluster size raised in Chapter 3, along with the constraints imposed by the amount of RAM available and the number of blocks you want to hold in RAM at once. As is so often the case, the final decision will probably be a compromise between a number of divergent considerations. The important thing is that the compromise be a truly informed decision, based on knowledge of how I/O devices and file structures work, rather than just a guess.

If you are working with a disk system that is not sector oriented, but that allows you to choose the block size for a particular file, a good starting point is to think of a block as an entire track of the disk. You may want to revise this downward, to half a track, for instance, depending on memory constraints, record size, and other factors.

9.3 Adding a Simple Index to the Sequence Set
We have created a mechanism for maintaining a set of records so we can access them sequentially in order by key. It is based on the idea of grouping the records into blocks and then maintaining the blocks, as records are added and deleted, through splitting, concatenation, and redistribution. Now let's see whether we can find an efficient way to locate some specific block containing a particular record, given the record's key.

FIGURE 9.2 Sequence of blocks showing the range of keys in each block (block 1: ADAMS-BERNE, block 2: BOLEN-CAGE, block 3: CAMP-DUTTON, block 4: EMBRY-EVANS, block 5: FABER-FOLK, block 6: FOLKS-GADDIS).

We can view each of our blocks as containing a range of records, as illustrated in Fig. 9.2. This is an outside view of the blocks (we have not actually read any blocks and so do not know exactly what they contain), but it is sufficiently informative to allow us to choose which block might have the record we are seeking. We can see, for example, that if we are looking for a record with the key BURNS, we want to retrieve and inspect the second block.
It is easy to see how we could construct a simple, single-level index for these blocks. We might choose, for example, to build an index of fixed-length records that contain the key for the last record in each block, as shown in Fig. 9.3.

FIGURE 9.3 Simple index for the sequence set illustrated in Fig. 9.2 (key / block number: BERNE 1, CAGE 2, DUTTON 3, EVANS 4, FOLK 5, GADDIS 6).

The combination of this kind of index with the sequence set of blocks provides complete indexed sequential access. If we need to retrieve a specific record, we consult the index and then retrieve the correct block; if we need sequential access we start at the first block and read through the linked list of blocks until we have read them all. As simple as this approach is, it is in fact a very workable one as long as the entire index can be held in electronic memory. The requirement that the index be held in RAM is important for two reasons:

- Since this is a simple index of the kind we discussed in Chapter 6, we find specific records by means of a binary search of the index. Binary searching works well if the searching takes place in RAM, but, as we saw in the previous chapter on B-trees, it requires too many seeks if the file is on a secondary storage device.

- As the blocks in the sequence set are changed through splitting, concatenation, and redistribution, the index has to be updated. Updating a simple, fixed-length record index of this kind works well if the index is relatively small and contained in RAM. If, however, the updating requires seeking to individual index records on disk, the process can become very expensive. Once again, this is a point we discussed more completely in earlier chapters.

What do we do, then, if the index contains so many blocks that the index file does not conveniently fit into RAM? In the preceding chapter we found that we could divide the index structure into pages, much like the blocks we are discussing here, handling several pages, or blocks, of the index file in RAM at a time. More specifically, we found that B-trees are an excellent structure for handling indexes that are too large to fit entirely in RAM. This suggests that we might organize the index to our sequence set as a B-tree.
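Before moving on, here is a sketch of the simple in-RAM index lookup described above. It is an assumption-laden illustration, not the text's code: INDEXENTRY and find_block are invented names, and the index is the one pictured in Fig. 9.3.

#include <string.h>

typedef struct {
    char key[12];      /* key of the last record in the block */
    int  block;        /* block number in the sequence set    */
} INDEXENTRY;

/* Return the block that could contain 'key': binary search for the first
 * index entry whose key is greater than or equal to the search key.      */
int find_block(INDEXENTRY idx[], int n, const char *key)
{
    int lo = 0, hi = n - 1, mid;
    while (lo < hi) {
        mid = (lo + hi) / 2;
        if (strcmp(key, idx[mid].key) > 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return idx[lo].block;
}

With the six entries of Fig. 9.3 loaded into idx, find_block(idx, 6, "BURNS") returns 2, matching the BURNS example in the text.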
The use of a B-tree index for our sequence set of blocks is a very powerful notion. The resulting hybrid structure is known as a B+ tree, which is an appropriate name since it is, in fact, a B-tree index plus a sequence set that holds the actual records. Before we can fully develop the notion of a B+ tree, we need to think more carefully about what it is we need to keep in the index.

9.4 The Content of the Index: Separators Instead of Keys

The purpose of the index we are building is to assist us when we are searching for a record with a specific key. The index must guide us to the block in the sequence set that contains the record, if it exists in the sequence set at all. The index serves as a kind of roadmap for the sequence set; we are interested in the content of the index only insofar as it can assist us in getting to the correct block in the sequence set. The index set does not itself contain answers; it contains only information about where to go to get answers.

Given this view of the index set as a roadmap, we can take the very important step of recognizing that we do not need to have actual keys in the index set. Our real need is for separators. Figure 9.4 shows one possible set of separators for the sequence set in Fig. 9.2.

FIGURE 9.4 Separators between blocks in the sequence set (BO, CAM, E, F, and FOLKS).

Note that there are many potential separators capable of distinguishing between two blocks. For example, all of the strings shown between blocks 3 and 4 in Fig. 9.5 are capable of guiding us in our choice between the blocks as we search for a particular key. If a string comparison between the key and
any of these separators shows that the key precedes the separator, we look
for the key in block 3. If the key follows the separator, we look in block 4.
If we are willing to treat the separators as variable-length entities within our index structure (we talk about how to do this later), we can save space by placing the shortest separator, E, in the index as the separator to guide our choice between blocks 3 and 4.

Note that there is not always a unique shortest separator. For example, BK, BN, and BO are separators that are all the same length and that are equally effective as separators between blocks 1 and 2 in Fig. 9.4. We choose BO and all of the other separators contained in Fig. 9.4 by using the logic embodied in the C function shown in Fig. 9.6 and in the Pascal procedure listed in Fig. 9.7. Note that these functions can produce a separator that is the same as the second key. This situation is illustrated in Fig. 9.4 by the separator between blocks 5 and 6, which is the same as the first key contained in block 6. It follows that, as we use the separators as a roadmap to the sequence set, we must decide whether to retrieve the block that is to the right of the separator or the one that is to the left of the separator according to the following rule:

Relation of Search Key and Separator      Decision
Key < separator                           Go left
Key = separator                           Go right
Key > separator                           Go right

FIGURE 9.5 A list of potential separators between blocks 3 (CAMP-DUTTON) and 4 (EMBRY-EVANS): DUTU, DVXGHESJF, DZ, E, EBQX, ELEEMOSYNARY.
/* find_sep(key1, key2, sep) ...
   finds shortest string that serves as a separator between key1 and
   key2.  Returns this separator through the address provided by
   the "sep" parameter.

   The function assumes that key2 follows key1 in collating sequence.
*/

find_sep(key1, key2, sep)
char key1[], key2[], sep[];
{
    while ((*sep++ = *key2++) == *key1++)
        ;
    *sep = '\0';   /* ensure that separator string is null terminated */
}

FIGURE 9.6 C function to find a shortest separator.
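The short driver below is not from the text; it is a usage sketch that can be appended after the find_sep function of Fig. 9.6 (and compiled with it) to confirm the behavior described above.

#include <stdio.h>

int main(void)
{
    char sep[16];

    find_sep("DUTTON", "EMBRY", sep);   /* keys on either side of the gap  */
    printf("%s\n", sep);                /* prints "E", as in Fig. 9.4      */

    find_sep("FOLK", "FOLKS", sep);
    printf("%s\n", sep);                /* prints "FOLKS": the separator   */
                                        /* can equal the second key        */
    return 0;
}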
FIGURE 9.7 Pascal procedure to find a shortest separator.

PROCEDURE find_sep (key1, key2 : strng; VAR sep : strng);
{
  finds the shortest string that serves as a separator between key1 and
  key2.  Returns the separator through the variable sep.  Strings are
  handled as character arrays in which the length of the string is stored
  in the 0th position of the array.  The type "strng" is used for strings.

  Assumes that key2 follows key1 in collating sequence.

  Uses two functions defined in the Appendix:
      len_str(s)  --  returns the length of the string s.
      min(i,j)    --  compares i and j and returns the smallest value
}
VAR
    i, minlgth : integer;
BEGIN
    minlgth := min(len_str(key1), len_str(key2));
    i := 1;
    while (key1[i] = key2[i]) and (i <= minlgth) DO
    BEGIN
        sep[i] := key2[i];
        i := i + 1
    END;
    sep[i] := key2[i];
    sep[0] := CHR(i)      { set length indicator in separator array }
END;
FIGURE 9.8 A B-tree index set for the sequence set, forming a simple prefix B+ tree.

9.5 The Simple Prefix B+ Tree

Figure 9.8 shows how we can form the separators identified in Fig. 9.4 into a B-tree index of the sequence set blocks. The B-tree index is called the index set. Taken together with the sequence set, it forms a file structure called a simple prefix B+ tree. The modifier simple prefix indicates that the index set contains shortest separators, or prefixes of the keys, rather than copies of the actual keys. Our separators are simple because they are, simply, prefixes: They are actually just the initial letters within the keys. More complicated (not simple) methods of creating separators from key prefixes remove unnecessary characters from the front of the separator as well as from the rear. (See Bayer and Unterauer, 1977, for a more complete discussion of prefix B+ trees.)

Note that since the index set is a B-tree, a node containing N separators branches to N + 1 children. If we are searching for the record with the key EMBRY, we start at the root of the index set, comparing EMBRY to the separator E. Since EMBRY comes after E, we branch to the right, retrieving the node containing the separators F and FOLKS. Since EMBRY comes before even the first of these separators, we follow the branch that is to the left of the F separator, which leads us to block 4, the correct block in the sequence set.

(Footnote: The literature on B+ trees and simple prefix B+ trees is remarkably inconsistent in the nomenclature used for these structures. B+ trees are sometimes called B* trees; simple prefix B+ trees are sometimes called simple prefix B-trees. Comer's important article in Computing Surveys in 1979 has reduced some of the confusion by providing a consistent, standard nomenclature which we use here.)
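The branch rule used in that walk-through is easy to sketch in C. The code below is an illustration under stated assumptions, not the text's implementation: INDEXNODE and choose_branch are invented names, and separators are held as ordinary null-terminated strings.

#include <string.h>

#define MAXSEP 64

typedef struct {
    int   nsep;                /* number of separators, N               */
    char *sep[MAXSEP];         /* the N separators                      */
    int   child[MAXSEP + 1];   /* N + 1 references to children          */
} INDEXNODE;

/* Choose which of the N + 1 children to follow for 'key':
 * key < separator means go left of it; key >= separator means go right. */
int choose_branch(const INDEXNODE *node, const char *key)
{
    int i = 0;
    while (i < node->nsep && strcmp(key, node->sep[i]) >= 0)
        i++;                   /* key >= separator: keep moving right    */
    return node->child[i];
}

In the root of Fig. 9.8, choose_branch for EMBRY moves past E (EMBRY >= E) and follows the right branch; in the node holding F and FOLKS it stops immediately (EMBRY < F) and follows the leftmost branch, arriving at block 4, exactly as in the walk-through above.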
9.6 Simple Prefix B+ Tree Maintenance

9.6.1 Changes Localized to Single Blocks in the Sequence Set

Let's suppose that we want to delete the records for EMBRY and FOLKS, and let's suppose that neither of these deletions results in any concatenation or redistribution within the sequence set. Since there is no concatenation or redistribution, the effect of these deletions on the sequence set is limited to changes within blocks 4 and 6. The record that was formerly the second record in block 4 (let's say that its key is ERVIN) is now the first record. Similarly, the former second record in block 6 (we assume it has a key of FROST) now starts that block. These changes can be seen in Fig. 9.9.

FIGURE 9.9 The deletion of the EMBRY and FOLKS records from the sequence set leaves the index set unchanged.

The more interesting question is what effect, if any, these deletions have on the index set. The answer is that since the number of sequence set blocks is unchanged, and since no records are moved between blocks, the index set can also remain unchanged. This is easy to see in the case of the EMBRY deletion: E is still a perfectly good separator for sequence set blocks 3 and 4, so there is no reason to change it in the index set. The case of the FOLKS deletion is a little more confusing since the string FOLKS appears both as the key in the deleted record and as a separator within the index set. To avoid confusion, remember to distinguish clearly between these two uses of the string FOLKS: FOLKS can continue to serve as a separator between blocks 5 and 6 even though the FOLKS record is deleted. (One could argue that although we do not need to replace the FOLKS separator, we should do so anyway because it is now possible to construct a shorter separator. However, the cost of making such a change in the index set usually outweighs the benefits associated with saving a few bytes of space.)

The effect of inserting into the sequence set new records that do not cause block splitting is much the same as the effect of these deletions that do not result in concatenation: The index set remains unchanged. Suppose, for example, that we insert a record for EATON. Following the path indicated by the separators in the index set, we find that we will insert the new record into block 4 of the sequence set. We assume, for the moment, that there is room in block 4. The new record becomes the first record in block 4, but no change in the index set is necessary. This is not surprising since we decided to insert the record into block 4 on the basis of the existing information in the index set. It follows that the existing information in the index set is sufficient to allow us to find the record again.
9.6.2 Changes Involving Multiple Blocks in the Sequence Set

What happens when the addition and deletion of records to and from the sequence set does change the number of blocks in the sequence set? Clearly, if we have more blocks, we need additional separators in the index set, and if we have fewer blocks, we need fewer separators. Changing the number of separators certainly has an effect on the index set, where the separators are stored.

Since the index set for a simple prefix B+ tree is actually just a normal B-tree, the changes to the index set are handled according to the familiar rules for B-tree insertion and deletion. (As you study the material here, you may find it helpful to refer back to Chapter 8, where we discuss B-tree operations in much more detail.) In the following examples, we assume that the index set is a B-tree of order three, which means that the maximum number of separators we can store in a node is two. We use this small node size for the index set to illustrate node splitting and concatenation while using only a few separators. As you will see later, actual implementations of simple prefix B+ trees place a much larger number of separators in a node of the index set.

Let's begin with an insertion into the sequence set shown in Fig. 9.9. Specifically, we assume that there is an insertion into the first block, and that this insertion causes the block to split. A new block (block 7) is brought in to hold the second half of what was originally the first block. This new block is linked into the correct position in the sequence set, following block 1 and preceding block 2 (these are the physical block numbers). These changes to the sequence set are illustrated in Fig. 9.10.

Note that the separator that formerly distinguished between blocks 1 and 2, the string BO, is now the separator for blocks 7 and 2. We need a new separator, with a value of AY, to distinguish between blocks 1 and 7. As we go to place this separator into the index set, we find that the node into which we want to insert it, containing BO and CAM, is already full. Consequently, insertion of the new separator causes a split and promotion, according to the usual rules for B-trees. The promoted separator, BO, is placed in the root of the index set.

FIGURE 9.10 An insertion into block 1 causes a split and the consequent addition of block 7. The addition of a block in the sequence set requires a new separator in the index set. Insertion of the AY separator into the node containing BO and CAM causes a node split in the index set B-tree and consequent promotion of BO to the root.

Now let's suppose we delete a record from block 2 of the sequence set that causes an underflow condition and consequent concatenation of blocks 2 and 3. Once the concatenation is complete, block 3 is no longer needed in the sequence set, and the separator that once distinguished between blocks 2 and 3 must be removed from the index set. Removing this separator, CAM, causes an underflow in an index set node. Consequently, there is another concatenation, this time in the index set, that results in the demotion of the BO separator from the root, bringing it back down into a node with the AY separator. Once these changes are complete, the simple prefix B+ tree has the structure illustrated in Fig. 9.11.

FIGURE 9.11 A deletion from block 2 causes underflow and the consequent concatenation of blocks 2 and 3. After the concatenation, block 3 is no longer needed and can be placed on an avail list. Consequently, the separator CAM is no longer needed. Removing CAM from its node in the index set forces a concatenation of index set nodes, bringing BO back down from the root.
Although in these examples a block split in the sequence set results in a node split in the index set, and a concatenation in the sequence set results in a concatenation in the index set, there is not always this correspondence of action. Insertions and deletions in the index set are handled as standard B-tree operations; whether there is splitting or a simple insertion, concatenation or a simple deletion, depends entirely on how full the index set node is.
Writing procedures to handle these kinds of operations is a straightfor-
ward
task if you remember that the changes take place from the bottom up.
Record insertion and deletion always take place in the sequence set, since
that is where the records are. If splitting, concatenation, or redistribution is
necessary, perform the operation just as you would if there were no index set
at all. Then, after the record operations in the sequence set are complete,
make changes as necessary in the index set:

- If blocks are split in the sequence set, a new separator must be inserted into the index set;
- If blocks are concatenated in the sequence set, a separator must be removed from the index set; and
- If records are redistributed between blocks in the sequence set, the value of a separator in the index set must be changed.

Index set operations are performed according to the rules for B-trees. This means that node splitting and concatenation propagate up through the higher levels of the index set. We see this in our examples as the BO separator moves in and out of the root. Note that the operations on the sequence set do not involve this kind of propagation. That is because the sequence set is a linear, linked list, whereas the index set is a tree. It is easy to lose sight of this distinction and think of an insertion or deletion in terms of a single operation on the entire simple prefix B+ tree. This is a good way to become confused. Remember: Insertions and deletions happen in the sequence set since that is where the records are. Changes to the index set are secondary; they are a byproduct of the fundamental operations on the sequence set.
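The bottom-up discipline just summarized can be made concrete with a small C sketch. Everything here is hypothetical scaffolding invented for the illustration (the enum, sset_insert, sset_delete, and the ixset_* routines are not the text's code); the point is only the order of operations: the sequence set first, then the index set, and only when the set of blocks actually changed.

/* Hypothetical lower-level operations, declared only so the sketch compiles. */
enum change { NOCHANGE, SPLIT, CONCATENATED, REDISTRIBUTED };

enum change sset_insert(const char *key, const char *rec,
                        char *sep_out, int *rbn_out);
enum change sset_delete(const char *key, char *sep_out, int *rbn_out);
void ixset_insert_separator(const char *sep, int rbn);
void ixset_remove_separator(const char *sep);
void ixset_change_separator(const char *sep, int rbn);

/* Record operations happen in the sequence set; the index set is patched
 * afterward as a byproduct.                                               */
void add_record(const char *key, const char *rec)
{
    char sep[16];
    int  rbn;
    enum change c = sset_insert(key, rec, sep, &rbn);
    if (c == SPLIT)                        /* new block: insert a separator  */
        ixset_insert_separator(sep, rbn);
    else if (c == REDISTRIBUTED)           /* block boundary moved           */
        ixset_change_separator(sep, rbn);
}

void delete_record(const char *key)
{
    char sep[16];
    int  rbn;
    enum change c = sset_delete(key, sep, &rbn);
    if (c == CONCATENATED)                 /* block gone: remove a separator */
        ixset_remove_separator(sep);
    else if (c == REDISTRIBUTED)
        ixset_change_separator(sep, rbn);
}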
9.7
Index Set Block Size
Up to this point we have ignored the important issues of the size and structure of the index set nodes. Our examples have used extremely small index set nodes and have treated them as fixed-order B-tree nodes, even though the separators are variable in length. We need to develop more realistic, useful ideas about the size and structure of index set nodes.

The physical size of a node for the index set is usually the same as the physical size of a block in the sequence set. When this is the case, we speak of index set blocks, rather than nodes, just as we speak of sequence set blocks. There are a number of reasons for using a common block size for the index and sequence sets:

- The block size for the sequence set is usually chosen because there is a good fit between this block size, the characteristics of the disk drive, and the amount of memory available. The choice of an index set block size is governed by consideration of the same factors; therefore, the block size that is best for the sequence set is usually best for the index set.

- A common block size makes it easier to implement a buffering scheme to create a virtual simple prefix B+ tree, similar to the virtual B-trees discussed in the preceding chapter.

- The index set blocks and sequence set blocks are often mingled within the same file to avoid seeking between two separate files while accessing the simple prefix B+ tree. Use of one file for both kinds of blocks is simpler if the block sizes are the same.
9.8
Internal Structure of Index Set Blocks:
A Variable-order B-Tree
Given a large, fixed-size block for the index set, how do we store the separators within it? In the examples considered so far, the block structure is such that it can contain only a fixed number of separators. The entire motivation behind the use of shortest separators is the possibility of packing more of them into a node. This motivation disappears completely if the index set uses a fixed-order B-tree in which there is a fixed number of separators per node. We want each index set block to hold a variable number of variable-length separators.

How should we go about searching through these separators? Since the blocks are probably large, any single block can hold a large number of separators. Once we read a block into RAM for use, we want to be able to do a binary rather than sequential search on its list of separators. We therefore need to structure the block so it can support a binary search, despite the fact that the separators are of variable length.

In Chapter 6, which covers indexing, we see that the use of a separate index can provide a means of performing binary searches on a list of variable-length entities. If the index itself consists of fixed-length references, we can use binary searching on the index, retrieving the actual variable-length records or fields through indirection. For example, suppose we are going to place the following set of separators into an index block:

As, Ba, Bro, C, Ch, Cra, Dele, Edi, Err, Fa, Fle.

(We are using lowercase letters, rather than all uppercase letters, so you can find the separators more easily when we concatenate them.) We could concatenate these separators and build an index for them, as shown in Fig. 9.12.

FIGURE 9.12 Variable-length separators and corresponding index.
    Concatenated separators:  AsBaBroCChCraDeleEdiErrFaFle
    Index to separators:      00 02 04 07 08 10 13 17 20 23 25

If we are using this block of the index set as a roadmap to help us find the record in the sequence set for "Beck", we perform a binary search on the index to the separators, retrieving first the middle separator, "Cra", which starts in position 10. Note that we can find the length of this separator by looking at the starting position of the separator that follows. Our binary search eventually tells us that "Beck" falls between the separators "Ba" and "Bro". Then what do we do?

The purpose of the index set roadmap is to guide us downward through the levels of the simple prefix B+ tree, leading us to the sequence set block
we want to retrieve. Consequently, the index set block needs some way to store references to its children, to the blocks descending from it in the next lower level of the tree. We assume that the references are made in terms of a relative block number (RBN), which is analogous to a relative record number except that it references a fixed-length block rather than a record. If there are N separators within a block, the block has N + 1 children, and therefore needs space to store N + 1 RBNs in addition to the separators and the index to the separators.

There are many ways to combine the list of separators, index to separators, and list of RBNs into a single index set block. One possible approach is illustrated in Fig. 9.13. In addition to the vector of separators, the index to these separators, and the list of associated block numbers, this block structure includes:

Separator count: We need this to help us find the middle element in the index to the separators so we can begin our binary search.

Total length of separators: The list of concatenated separators varies in length from block to block. Since the index to the separators begins at the end of this variable-length list, we need to know how long the list is so we can find the beginning of our index.

FIGURE 9.13 Structure of an index set block: the separator count, the total length of the separators, the concatenated separators (AsBaBroCChCraDeleEdiErrFaFle), the index to the separators (00 02 04 07 08 10 13 17 20 23 25), and the relative block numbers (B00 B01 B02 B03 B04 B05 B06 B07 B08 B09 B10 B11).

Let's suppose, once again, that we are looking for a record with the key "Beck" and that the search has brought us to the index set block pictured in Fig. 9.13. The total length of the separators and the separator count allows us to find the beginning, the end, and consequently the middle of the index to the separators. As in the preceding example, we perform a binary search of the separators through this index, finally concluding that the key "Beck" falls between the separators "Ba" and "Bro". Conceptually, the relation between the keys and the RBNs is as illustrated in Fig. 9.14. (Why isn't this a good physical arrangement?)

FIGURE 9.14 Conceptual relationship of separators and relative block numbers: B00 As B01 Ba B02 Bro B03 C B04 Ch B05 Cra B06 Dele B07 Edi B08 Err B09 Fa B10 Fle B11 (separator subscripts 0 through 10).

As Fig. 9.14 makes clear, discovering that the key falls between "Ba" and "Bro" allows us to decide that the next block we need to retrieve has the RBN stored in the B02 position of the RBN vector. This next block could be another index set block, and thus another block of the roadmap, or it could be the sequence set block that we are looking for. In either case, the quantity and arrangement of information in the current index set block is sufficient to let us conduct our binary search within the index block and then proceed to the next block in the simple prefix B+ tree.

There are many alternate ways to arrange the fundamental components of this index block. (For example, would it be easier to build the block if the vector of keys were placed at the end of the block? How would you handle the fact that the block consists of both character and integer entities with no constant, fixed dividing point between them?) For our purposes here, the specific implementation details for this particular index block structure are not nearly as important as the block's conceptual structure. This kind of index block structure illustrates two important points.

The first point is that a block is not just an arbitrary chunk cut out of a homogeneous file; it can be more than just a set of records. A block can have a sophisticated internal structure all its own, including its own internal index, a collection of variable-length records, separate sets of fixed-length records, and so forth. This idea of building more sophisticated data structures inside of each block becomes increasingly attractive as the block size increases. With very large blocks it becomes imperative that we have an efficient way of processing all of the data within a block once it has been read into RAM. This point applies not only to simple prefix B+ trees, but to any file structure using a large block size.

The second point is that a node within the B-tree index set of our simple prefix B+ tree is of variable order, since each index set block contains a variable number of separators. This variability has interesting implications:

- The number of separators in a block is directly limited by block size rather than by some predetermined order (as in an order M B-tree). The index set will have the maximum order, and therefore the minimum depth, that is possible given the degree of compression used to form the separators.

- Since the tree is of variable order, operations such as determining when a block is full, or half full, are no longer a simple matter of comparing a separator count against some fixed maximum or minimum. Decisions about when to split, concatenate, or redistribute become more complicated.

The exercises at the end of this chapter provide opportunities for exploring variable-order trees more thoroughly.
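To make the within-block search concrete, here is a C sketch of searching one index set block laid out along the lines of Fig. 9.13. The struct is an in-RAM stand-in invented for this illustration (IXBLOCK, next_rbn, and cmp_sep are not the text's names), not the book's implementation.

#include <string.h>

typedef struct {
    int   nsep;           /* separator count                          */
    int   seplen;         /* total length of concatenated separators  */
    char *seps;           /* e.g. "AsBaBroCChCraDeleEdiErrFaFle"      */
    int  *offset;         /* e.g. 00 02 04 07 08 10 13 17 20 23 25    */
    int  *rbn;            /* B00 .. B11: nsep + 1 child references    */
} IXBLOCK;

/* Compare key with the i-th separator; its length comes from the offsets. */
static int cmp_sep(const IXBLOCK *b, const char *key, int i)
{
    int start = b->offset[i];
    int end   = (i + 1 < b->nsep) ? b->offset[i + 1] : b->seplen;
    return strncmp(key, b->seps + start, (size_t)(end - start));
}

/* Binary search the separator index; return the RBN of the child to visit. */
int next_rbn(const IXBLOCK *b, const char *key)
{
    int lo = 0, hi = b->nsep - 1, mid;
    while (lo <= hi) {
        mid = (lo + hi) / 2;
        if (cmp_sep(b, key, mid) < 0)
            hi = mid - 1;          /* key < separator: go left   */
        else
            lo = mid + 1;          /* key >= separator: go right */
    }
    return b->rbn[lo];             /* lo separators are <= key   */
}

For the block of Fig. 9.13, next_rbn for "Beck" narrows the search to the gap between "Ba" and "Bro" and returns the reference in the B02 position, matching the walk-through in the text.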
9.9 Loading a Simple Prefix B+ Tree

In the previous description of the simple prefix B+ tree, we focus first on building a sequence set, and subsequently present the index set as something added or built on top of the sequence set. It is not only possible to conceive of simple prefix B+ trees this way, as a sequence set with an added index, but one can also build them this way.

One way of building a simple prefix B+ tree, of course, is through a series of successive insertions. We would use the procedures outlined in section 9.6, where we discuss the maintenance of simple prefix B+ trees, to split or redistribute blocks in the sequence set and in the index set as we added blocks to the sequence set. The difficulty with this approach is that splitting and redistribution are relatively expensive. They involve searching down through the tree for each insertion and then reorganizing the tree as necessary on the way back up. These operations are fine for tree maintenance as the tree is updated, but when we are loading the tree we do not have to contend with a random-order insertion and therefore do not need procedures that are so powerful, flexible, and expensive. Instead, we can begin by sorting the records that are to be loaded. Then we can guarantee that the next record we encounter is the next record we need to load.

Working from the sorted file, we can place the records into sequence set blocks, one by one, starting a new block when the one we are working with fills up. As we make the transition between two sequence set blocks, we can determine the shortest separator for the blocks. We can collect these separators into an index set block that we build and hold in RAM until it is full.

FIGURE 9.15 Formation of the first index set block as the sequence set is loaded (next separator: CAT; next sequence set block: CATCH-CHECK).

To develop an example of how this works, let's assume that we have sets of records associated with terms that are being compiled for a book index. The records might consist of a list of the occurrences of each term. In Fig. 9.15 we show four sequence set blocks that have been written out to the disk and one index set block that has been built in RAM from the shortest separators derived from the sequence set block keys. As you can see, the next sequence set block consists of a set of terms ranging from CATCH through CHECK, and therefore the next separator is CAT. Let's suppose that the index set block is now full. We write it out to disk. Now what do we do with the separator CAT?

Clearly, we need to start a new index block. But we cannot place CAT into another index block at the same level as the one containing the separators ALW, ASP, and BET since we cannot have two blocks at the same level without having a parent block. Instead, we promote the CAT separator to a higher-level block. However, the higher-level block cannot point directly to the sequence set; it must point to the lower-level index blocks. This means that we will now be building two levels of the index set in RAM as we build the sequence set. Figure 9.16 illustrates this working-on-two-levels phenomenon: The addition of the CAT separator requires us to start a new, root-level index block as well as a lower-level index block. (Actually, we are working on three levels at once since we are also constructing the sequence set blocks in RAM.) Figure 9.17 shows what the index looks like after even more sequence set blocks are added. As you can see, the lower-level index block that contained no separators when we added CAT to the root has now filled up. To establish that the tree works, do a search for the term CATCH. Then search for the two terms CASUAL and CATALOG. How can you tell that these terms are not in the sequence set?

FIGURE 9.16 Simultaneous building of two index set levels as the sequence set continues to grow (the new lower-level index block containing CAT initially holds no separators).

FIGURE 9.17 Continued growth of the index set built up from the sequence set.

It is instructive to ask what would happen if the last record were CHECK, so the construction of the sequence sets and index sets would stop with the configuration shown in Fig. 9.16. The resulting simple prefix B+ tree would contain an index set node that holds no separators. This is not an isolated, one-time possibility. If we use this sequential loading method to build the tree, there will be many points during the loading process at which there is an empty or nearly empty index set node. If the index set grows to more than two levels, this empty node problem can occur at even higher levels of the tree, creating a potentially severe out-of-balance problem. Clearly, these empty node and nearly empty node conditions violate the B-tree rules that apply to the index set. However, once a tree is loaded and goes into regular use, the very fact that a node is violating B-tree conditions can be used to guarantee that the node will be corrected through the action of normal B-tree maintenance operations. It is easy to write the procedures for insertion and deletion so a redistribution procedure is invoked when an underfull node is encountered.

The advantages of loading a simple prefix B+ tree in this way, as a sequential operation following a sort of the records, almost always outweigh the disadvantages associated with the possibility of creating blocks that contain too few records or too few separators. The principal advantage is that the loading process goes more quickly since:

- The output can be written sequentially;
- We make only one pass over the data, rather than the many passes associated with random order insertions; and
- No blocks need to be reorganized as we proceed.

There are two additional advantages to using a separate loading process such as the one we have described. These advantages are related to performance after the tree is loaded rather than performance during loading:

- Random insertion produces blocks that are, on the average, between 67% and 80% full. In the preceding chapter, as we discussed B-trees, we increased this storage utilization by mechanisms such as using redistribution during insertion rather than using just block splitting. But, still, we never had the option of filling the blocks completely so we had 100% utilization. The sequential loading process changes this. If we want, we can load the tree so it starts out with 100% utilization. This is an attractive option if we do not expect to add very many records to the tree. On the other hand, if we do anticipate many insertions, sequential loading allows us to select any other degree of utilization that we want. Sequential loading gives us much more control over the amount and placement of empty space in the newly loaded tree.

- In the loading example presented in Fig. 9.16, we write out the first four sequence set blocks, then write out the index set block containing the separators for these sequence set blocks. If we use the same file for both sequence set and index set blocks, this process guarantees that an index set block starts out in physical proximity to the sequence set blocks that are its descendents. In other words, our sequential loading process is creating a degree of spatial locality within our file. This locality can minimize seeking as we search down through the tree.
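The loading loop itself is short enough to sketch. The C fragment below is an illustration under assumptions, not the text's loader: next_record, write_sset_block, and add_separator_to_index are placeholders declared only so the sketch compiles, find_sep is the function of Fig. 9.6, and records are reduced to their keys.

#include <string.h>

#define BLKRECS 4

/* Placeholder declarations for this sketch only. */
int  next_record(char *key, char *rec);            /* sorted input stream */
void write_sset_block(int rbn, char keys[][12], int n);
void add_separator_to_index(const char *sep, int rbn);
void find_sep(char key1[], char key2[], char sep[]); /* Fig. 9.6 */

void load_sequence_set(void)
{
    char keys[BLKRECS][12], rec[128], prev_last[12] = "", sep[12];
    int  n = 0, rbn = 0;

    while (next_record(keys[n], rec)) {
        if (++n == BLKRECS) {                 /* block is full: write it out */
            if (rbn > 0) {                    /* separator between this block */
                find_sep(prev_last, keys[0], sep);  /* and the previous one   */
                add_separator_to_index(sep, rbn);
            }
            write_sset_block(rbn++, keys, n);
            strcpy(prev_last, keys[n - 1]);   /* remember last key written   */
            n = 0;
        }
    }
    if (n > 0) {                              /* flush a final partial block */
        if (rbn > 0) {
            find_sep(prev_last, keys[0], sep);
            add_separator_to_index(sep, rbn);
        }
        write_sset_block(rbn, keys, n);
    }
}

add_separator_to_index would in turn write out an index set block, and start a parent-level block, whenever the current one fills, which is exactly the working-on-two-levels behavior shown in Figs. 9.15 through 9.17.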
9.10 B+ Trees

Our discussions up to this point have focused primarily on simple prefix B+ trees. These structures are actually a variant of an approach to file organization known simply as a B+ tree. The difference between a simple prefix B+ tree and a plain B+ tree is that the latter structure does not involve the use of prefixes as separators. Instead, the separators in the index set are simply copies of the actual keys. Contrast the index set block shown in Fig. 9.18, which illustrates the initial loading steps for a B+ tree, with the index block illustrated in Fig. 9.15, where we are building a simple prefix B+ tree.

FIGURE 9.18 Formation of the first index set block in a B+ tree without the use of shortest separators (next separator: CATCH; next sequence set block: CATCH-CHECK).

The operations performed on B+ trees are essentially the same as those discussed for simple prefix B+ trees. Both B+ trees and simple prefix B+ trees consist of a set of records arranged in key order in a sequence set, coupled with an index set that provides rapid access to the block containing any particular key/record combination. The only difference is that in the simple prefix B+ tree we build an index set of shortest separators formed from key prefixes.

One of the reasons behind our decision to focus first on simple prefix B+ trees, rather than on the more general notion of a B+ tree, is that we want to distinguish between the role of the separators in the index set and keys in the sequence set. It is much more difficult to make this distinction when the separators are exact copies of the keys. By beginning with simple prefix B+ trees, we have the pedagogical advantage of working with separators that are clearly different than the keys in the sequence set.

But another reason for starting with simple prefix B+ trees revolves around the fact that they are quite often a more desirable alternative than the plain B+ tree. We want the index set to be as shallow as possible, which implies that we want to place as many separators into an index set block as we can. Why use anything longer than the simple prefix in the index set? In general, the answer to this question is that we do not, in fact, want to use anything longer than a simple prefix as a separator; consequently, simple prefix B+ trees are often a good solution. There are, however, at least two factors that might argue in favor of using a B+ tree that uses full copies of keys as separators:

- The reason for using shortest separators is to pack more of them into an index set block. As we have already said, this implies, ineluctably, the use of variable-length fields within the index set blocks. For some applications the cost of the extra overhead required to maintain and use this variable-length structure outweighs the benefits of shorter separators. In these cases one might choose to build a straightforward B+ tree using fixed-length copies of the keys from the sequence set as separators.

- Some key sets do not show much compression when the simple prefix method is used to produce separators. For example, suppose the keys consist of large, consecutive alphanumeric sequences such as 34C18K756, 34C18K757, 34C18K758, and so on. In this case, to enjoy appreciable compression, we need to use compression techniques that remove redundancy from the front of the key. Bayer and Unterauer (1977) describe such compression methods. Unfortunately, they are more expensive and complicated than simple prefix compression. If we calculate that tree height remains acceptable with the use of full copies of the keys as separators, we might elect to use the no-compression option.
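The second point is easy to check. The small, self-contained C program below (an illustration only, mirroring the prefix-copying logic of find_sep in Fig. 9.6) shows that for consecutive alphanumeric keys the shortest separator is as long as the key itself, so simple prefix compression buys nothing.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *key1 = "34C18K756", *key2 = "34C18K757";
    char   sep[32];
    size_t i = 0;

    /* copy characters of key2 through the first one that differs from key1 */
    while (key1[i] == key2[i]) {
        sep[i] = key2[i];
        i++;
    }
    sep[i]     = key2[i];
    sep[i + 1] = '\0';

    printf("separator between %s and %s is %s (length %u)\n",
           key1, key2, sep, (unsigned)strlen(sep));   /* length 9: the whole key */
    return 0;
}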
9.11 B-Trees, B+ Trees, and Simple Prefix B+ Trees in Perspective
In this chapter and the preceding chapter we have looked at a number of "tools" used in building file structures. These tools, B-trees, B+ trees, and simple prefix B+ trees, have similar-sounding names and a number of common features. We need a way to differentiate these tools so we can reliably choose the most appropriate one for a given file structure job.

Before addressing this problem of differentiation, however, we should point out that these are not the only tools in the toolbox. Because B-trees, B+ trees, and their relatives are such powerful, flexible file structures, it is easy to fall into the trap of regarding them as the answer to all problems. This is a serious mistake. Simple index structures of the kind discussed in Chapter 6, which are maintained wholly in RAM, are a much simpler, neater solution when they suffice for the job at hand. As we saw at the beginning of this chapter, simple indexes are not limited to direct access situations. This kind of index can be coupled with a sequence set of blocks to provide effective indexed sequential access as well. It is only when the index grows so large that we cannot economically hold it in RAM that we need to turn to paged index structures such as B-trees and B+ trees.

In the chapter that follows we encounter yet another tool, known as hashing. Like simple RAM-based indexes, hashing is an important alternative to B-trees, B+ trees, and so on. In many situations, hashing can provide faster access to a very large number of records than can the use of a member of the B-tree family.
So, B-trees, B+ trees, and simple prefix B+ trees are not a panacea. However, they do have broad applicability, particularly for situations that require the ability to access a large file both sequentially, in order by key, and through an index. All three of these different tools share the following characteristics:
- They are all paged index structures, which means that they bring entire blocks of information into RAM at once. As a consequence, it is possible to choose between a great many alternatives (e.g., the keys for hundreds of thousands of records) with just a few seeks out to disk storage. The shape of these trees tends to be broad and shallow.

- All three approaches maintain height-balanced trees. The trees do not grow in an uneven way, which would result in some potentially long searches for certain keys.

- In all cases the trees grow from the bottom up. Balance is maintained through block splitting, concatenation, and redistribution.

- With all three structures it is possible to obtain greater storage efficiency through the use of two-to-three splitting and of redistribution in place of block splitting when possible. These techniques are described in Chapter 8.

- All three approaches can be implemented as virtual tree structures in which the most recently used blocks are held in RAM. The advantages of virtual trees were described in Chapter 8.

- Any of these approaches can be adapted for use with variable-length records using structures inside a block similar to those outlined in this chapter.
For all of this similarity, there are some important differences. These differences are brought into focus through a review of the strengths and unique characteristics of each of these three file structures.

B-Trees

B-trees contain information that is grouped as a set of pairs. One member of each pair is the key; the other member is the associated information. These pairs are distributed over all the nodes of the B-tree. Consequently, we might find the information we are seeking at any level of the B-tree. This differs from B+ trees and simple prefix B+ trees, which require all searches to proceed all the way down to the lowest, sequence set level of the tree. Because the B-tree itself contains the actual keys and associated information, and there is therefore no need for additional storage to hold separators, a B-tree can take up less space than does a B+ tree.

Given a large enough block size and an implementation that treats the tree as a virtual B-tree, it is possible to use a B-tree for ordered sequential access as well as for indexed access. The ordered sequential access is obtained through an in-order traversal of the tree. The implementation as a virtual tree is necessary so this traversal does not involve seeking as it returns to the next highest level of the tree. This use of a B-tree for indexed sequential access works only when the record information is actually stored within the B-tree. If the B-tree merely contains pointers to records that are in entry sequence off in some other file, then indexed sequential access is not workable because of all the seeking required to retrieve the actual record information.

B-trees are most attractive when the key itself comprises a large part of each record stored in the tree. When the key is only a small part of the record, it is possible to build a broader, shallower tree using B+ tree methods.
B+ Trees

The primary difference between the B+ tree and the B-tree is that in the B+ tree all the key and record information is contained in a linked set of blocks known as the sequence set. The key and record information is not in the upper-level, tree-like portion of the B+ tree. Indexed access to this sequence set is provided through a conceptually (though not necessarily physically) separate structure called the index set. In a B+ tree the index set consists of copies of the keys that represent the boundaries between sequence set blocks. These copies of keys are called separators since they separate a sequence set block from its predecessor.

There are two significant advantages that the B+ tree structure provides over the B-tree:

- The sequence set can be processed in a truly linear, sequential way, providing efficient access to records in order by key; and

- The use of separators, rather than entire records, in the index set often means that the number of separators that can be placed in a single index set block in a B+ tree substantially exceeds the number of records that could be placed in an equal-sized block in a B-tree. Separators (copies of keys) are simply smaller than the key/record pairs stored in a B-tree. Since you can put more of them in a block of given size, it follows that the number of other blocks descending from that block can be greater. As a consequence, a B+ tree approach can often result in a shallower tree than would a B-tree approach.

In practice, the latter of these two advantages is often the more important one. The impact of the first advantage is lessened by the fact that it is often possible to obtain acceptable performance during an in-order traversal of a B-tree through the page buffering mechanism of a virtual B-tree.
Simple Prefix B+ Trees

We just indicated that the primary advantage of using a B+ tree instead of a B-tree is that a B+ tree sometimes allows us to build a shallower tree because we can obtain a higher branching factor out of the upper-level blocks of the tree. The simple prefix B+ tree builds on this advantage by making the separators in the index set smaller than the keys in the sequence set, rather than just using copies of these keys. If the separators are smaller, then we can fit more of them into a block to obtain an even higher branching factor out of the block. In a sense, the simple prefix B+ tree takes one of the strongest features of the B+ tree one step farther.

The price we have to pay to obtain this separator compression and consequent increase in branching factor is that we must use an index set block structure that supports variable-length fields. The question of whether this price is worth the gain is one that has to be considered on a case-by-case basis.
SUMMARY

We begin this chapter by presenting a new problem. In previous chapters we provided either indexed access or sequential access in order by key, but without finding an efficient way to provide both of these kinds of access. This chapter explores one class of solutions to this problem, a class based on the use of a blocked sequence set and an associated index set.
The sequence set holds all of the file's data records in order by key. Since all insertion or deletion operations on the file begin with modifications to the sequence set, we start our study of indexed sequential file structures with an examination of a method for managing sequence set changes. The fundamental tools used to insert and delete records while still keeping everything in order within the sequence set are ones that we encountered in Chapter 8: block splitting, block concatenation, and redistribution of records between blocks. The critical difference between the use made of these tools for B-trees and the use made here is that there is no promotion of records or keys during block splitting in a sequence set. A sequence set is just a linked list of blocks, not a tree; therefore there is no place to promote anything to. So, when a block splits, all the records are divided between blocks at the same level; when blocks are concatenated there is no need to bring anything down from a parent node.
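To make this linked-list picture concrete, here is a minimal C sketch of a sequence set block viewed as one link in a doubly linked list; the field names and the fixed record and block capacities are assumptions made for the sketch, not a layout prescribed by the text. When such a block splits, its records are simply divided between two blocks at the same level and the prev/next links are adjusted; nothing is promoted.

    /* Illustrative sketch of a sequence set block: one link in a doubly
     * linked list of blocks. Sizes are assumptions for the sketch. */
    #define RECLEN  15              /* assumed fixed record length        */
    #define MAXRECS  6              /* assumed record capacity of a block */

    typedef struct {
        long prev_rrn;              /* RRN of preceding block (-1 if none) */
        long next_rrn;              /* RRN of following block (-1 if none) */
        int  count;                 /* number of records currently stored  */
        char recs[MAXRECS][RECLEN]; /* records, kept in order by key       */
    } SEQBLOCK;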
In this chapter we also discuss the question of how large to make sequence set blocks. There is no precise answer we can give to this question since conditions vary between applications and environments. In general a block should be large, but not so large that we cannot hold several blocks in RAM or cannot read in a single block without incurring the cost of a seek. In practice, blocks are often the size of a cluster (on sector-formatted disks) or the size of a single disk track.

Once we are able to build and maintain a sequence set, we turn to the matter of building an index for the blocks in the sequence set. If the index is small enough to fit in RAM, one very satisfactory solution is to use a simple index that might contain, for example, the key for the last record in every block of the sequence set.

If the index set turns out to be too large to fit in RAM, we recommend the use of the same strategy we developed in the preceding chapter when a simple index outgrows the available RAM space: We turn the index into a B-tree. This combination of a sequence set with a B-tree index set is our first encounter with the structure known as a B+ tree.

Before looking at B+ trees as complete entities, we take a closer look at the makeup of the index set. The index set does not hold any information that we would ever seek for its own sake. Instead, an index set is used only as a roadmap to guide searches into the sequence set. The index set consists of separators that allow us to choose between sequence set blocks. There are many possible separators for any two sequence set blocks, so we might as well choose the shortest separator. The scheme we use to find this shortest separator consists of finding the common prefix of the two keys on either side of a block boundary in the sequence set, and then going one letter beyond this common prefix to define a true separator. A B+ tree with an index set made up of separators formed in this way is called a simple prefix B+ tree.
tree.
We study the mechanism used to maintain the index set as insertions
+
and deletions are made in the sequence set of a B tree. The principal
observation we make about all of these operations is that the primary action
is where the records are. Changes to the
of the fundamental operations
byproduct
set are secondary; they are a
on the sequence set. We add a new separator to the index set only if we form
a new block in the sequence set; we delete a separator from the index set
only if we remove a block from the sequence set through concatenation.
Block overflow and underflow in the index set differ from the operations on
the sequence set in that the index set is potentially a multilevel structure and
is
within the sequence
set,
since that
index
is
therefore handled as a B-tree.
The
size
of blocks in the index
for the sequence set.
variable-length
To
separators
set is usually the
same
as the size
create blocks containing variable
while
at
the
chosen
numbers of
same time supporting binary
435
436
THE B + TREE FAMILY AND INDEXED SEQUENTIAL
searching,
we
block header
FILE
ACCESS
develop an internal structure for the block that consists of
count and total separator length), the
fields (for the separator
variable-length separators themselves, an index to these separators, and a
vector of relative block numbers
(RBNs)
for the blocks descending
from
the index set block. This illustrates an important general principle about
large blocks within
homogeneous
set
structure of their
We
file
structures:
They
are
more than just
of records; blocks often have
own,
apart
from the
out of a
a slice
a sophisticated internal
larger structure of the file.
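One way to picture such a block is the C sketch below; it is a conceptual view in the spirit of Fig. 9.12 rather than a packed, fixed-size layout, and the field names and array bounds are assumptions made only for the sketch.

    #define IDXBLKSIZE 128          /* assumed block size                    */
    #define MAXSEPS     16          /* assumed limit on separators per block */

    /* Conceptual sketch of an index set block: header fields, the
     * variable-length separators stored back to back, an offset table to
     * the separators that supports binary searching, and the relative
     * block numbers (RBNs) of the descending blocks. */
    typedef struct {
        short sep_count;                 /* number of separators          */
        short sep_total_len;             /* total length of separators    */
        char  separators[IDXBLKSIZE];    /* separators, back to back      */
        short sep_offset[MAXSEPS];       /* offset of each separator      */
        long  rbn[MAXSEPS + 1];          /* RBNs of the descendant blocks */
    } INDEXBLOCK;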
We turn next to the problem of loading a B+ tree. We find that if we start with a set of records sorted by key, we can use a single-pass, sequential process to place these records into the sequence set. As we move from block to block in building the sequence set, we can extract separators and build the blocks of the index set. Compared to a series of successive insertions that work down from the top of the tree, this sequential loading process is much more efficient. Sequential loading also lets us choose the percentage of space utilized, right up to a goal of 100%.

The chapter closes with a comparison of B-trees, B+ trees, and simple prefix B+ trees. The primary advantages that B+ trees offer over B-trees are:
- They support true indexed sequential access; and

- The index set contains only separators, rather than full keys and records, so it is often possible to create a B+ tree that is shallower than a B-tree.

We suggest that the second of these advantages is often the more important one, since treating a B-tree as a virtual tree provides acceptable indexed sequential access in many circumstances. The simple prefix B+ tree takes this second advantage and carries it farther, compressing the separators and potentially producing an even shallower tree. The price for this extra compression in a simple prefix B+ tree is that we must deal with variable-length fields and a variable-order tree.
KEY TERMS

B+ tree. A B+ tree consists of a sequence set of records that are ordered sequentially by key, along with an index set that provides indexed access to the records. All of the records are stored in the sequence set. Insertions and deletions of records are handled by splitting, concatenating, and redistributing blocks in the sequence set. The index set, which is used only as a finding aid to the blocks in the sequence set, is managed as a B-tree.

Index set. The index set consists of separators that provide information about the boundaries between the blocks in the sequence set of a B+ tree. The index set can locate the block in the sequence set that contains the record corresponding to a certain key.

Indexed sequential access. Indexed sequential access is not actually a single access method, but rather a term used to describe situations in which a user wants both sequential access to records, ordered by key, and indexed access to those same records. B+ trees are just one method for providing indexed sequential access.

Separator. Separators are derived from the keys of the records on either side of a block boundary in the sequence set. If a given key is in one of the two blocks on either side of a separator, the separator reliably tells the user which of the two blocks holds the key.

Sequence set. The sequence set is the base level of an indexed sequential file structure, such as a B+ tree. It contains all of the records in the file. When read in logical order, block after block, the sequence set lists all of the records in order by key.

Shortest separator. Many possible separators can be used to distinguish between any two blocks in the sequence set. The class of shortest separators consists of those separators that take the least space, given a particular compression strategy. We looked carefully at a compression strategy that consists of removing as many letters as possible from the rear of the separators, forming the shortest simple prefix that can still serve as a separator.

Simple prefix B+ tree. A B+ tree in which the index set is made up of shortest separators that are simple prefixes, as described in the definition for shortest separator.

Variable order. A B-tree is of variable order when the number of direct descendents from any given node of the tree is variable. This occurs when the B-tree nodes contain a variable number of keys or separators. This form is most often used when there is variability in the lengths of the keys or separators. Simple prefix B+ trees always make use of a variable-order B-tree as an index set so it is possible to take advantage of the compression of separators and place more of them in a block.
EXERCISES

1. Describe file structures that permit each of the following types of access: (a) sequential access only; (b) direct access only; (c) indexed sequential access.
2. A B+ tree structure is generally superior to a B-tree for indexed sequential access. Since B+ trees incorporate B-trees, why not use a B+ tree whenever a hierarchical indexed structure is called for?

3. Consider the sequence set shown in Fig. 9.1(b). Show the sequence set after the keys DOVER and EARNEST are added; then show the sequence set after the key DAVIS is deleted. Did you use concatenation or redistribution for handling the underflow?

4. What considerations affect your choice of a block size for constructing a sequence set? If you know something about expected patterns of access (primarily sequential versus primarily random versus an even division between the two), how might this affect your choice of block size? On a sector-oriented drive, how might sector size and cluster size affect your choice of block size?

5. It is possible to construct an indexed sequential file without using a tree-structured index. A simple index like the one developed in Chapter 6 could be used. Under what conditions might one consider using such an index? Under what conditions might it be reasonable to use a binary tree (such as an AVL tree) rather than a B-tree for the index?

6. The index set of a B+ tree is just a B-tree, but unlike the B-trees discussed in Chapter 8, the separators do not have to be keys. Why the difference?

7. How does block splitting in the sequence set of a simple prefix B+ tree differ from block splitting in the index set?

8. If the key BOLEN in the simple prefix B+ tree in Fig. 9.8 is deleted from the sequence set node, how is the separator BO in the parent node affected?

9. Consider the simple prefix B+ tree shown in Fig. 9.8. Suppose a key added to block 5 results in a split of block 5 and the consequent addition of block 8, so blocks 5 and 8 appear as follows:

FABER-FINGER    FINNEY-FOLK

a. What does the tree look like after the insertion?
b. Suppose that, subsequent to the insertion, a deletion causes underflow and the consequent concatenation of blocks 4 and 5. What does the tree look like after the deletion?
c. Describe a case in which a deletion results in redistribution, rather than concatenation, and show the effect it has on the tree.

10. Why is it often a good idea to use the same block size for the index set and the sequence set in a simple prefix B+ tree? Why should the index nodes and the sequence set nodes usually be kept in the same file?

11. Show a conceptual view of an index set block, similar to the one illustrated in Fig. 9.12, that is loaded with the separators

Ab Arch Astron B Bea

Also show a more detailed view of the index block, as illustrated in Fig. 9.13.

12. If the initial set of records is sorted by key, the process of loading a B+ tree can be handled by using a single-pass sequential process, instead of randomly inserting new records into the tree. What are the advantages of this approach?

13. Show how the simple prefix B+ tree in Fig. 9.17 changes after the addition of the node

ITEMIZE-JAR

Assume that the index set node containing the separators EF, H, and IG does not have room for the new separator but that there is room in the root.

14. Use the data stored in the simple prefix B+ tree in Fig. 9.17 to construct a B+ tree. Assume that the index set of the B+ tree is of order four. Compare the resulting B+ tree with the simple prefix B+ tree.
15. The use of variable-length separators and/or key compression changes some of the rules about how we define and use a B-tree and how we measure B-tree performance.

a. How does it affect our definition of the order of a B-tree?
b. Suggest criteria for deciding when splitting, concatenation, and redistribution should be performed.
c. What difficulties arise in estimating simple prefix B+ tree height, maximum number of accesses, and space?

16. Make a table comparing B-trees, B+ trees, and simple prefix B+ trees in terms of the criteria listed below. Assume that the B-tree nodes do not contain data records, but only keys and corresponding RRNs of data records. In some cases you will be able to give specific answers based on a tree's height or the number of keys in the tree. In other cases, the answers will depend on unknown factors, such as patterns of access or average separator length.

a. The number of accesses required to retrieve a record from a tree of height h (average, best case, and worst case).
b. The number of accesses required to insert a record (best and worst cases).
c. The number of accesses required to delete a record (best and worst cases).
d. The number of accesses required to process a file of n keys sequentially, assuming that each node can hold a maximum of k keys and a minimum of k/2 keys (best and worst cases).
e. The number of accesses required to process a file of n keys sequentially, assuming that there are h + 1 node-sized buffers available.
17. Some commercially available indexed sequential file organizations are based on block interval splitting approaches very similar to those used with B+ trees. IBM's VSAM offers the user several file access modes, one of which is called key-sequenced access and which results in a file organized much like a B+ tree. Look up a description of VSAM and report on how its key-sequenced organization relates to a B+ tree, and also how it offers the user file handling capabilities well beyond those of a straightforward B+ tree implementation. (See the Further Readings section of this chapter for articles and books on VSAM.)

18. Although B+ trees provide the basis for most indexed sequential access methods now in use, this was not always the case. A method called ISAM (see Further Readings for this chapter) was once very common, especially on large computers. ISAM uses a rigid tree-structured index consisting of at least two and at most three levels. Indexes at these levels are tailored to the specific disk drive being used. Data records are organized by track, so the lowest level of an ISAM index is called the track index. Since the track index points to the track on which a data record can be found, there is one track index for each cylinder. When the addition of data records causes a track to overflow, the track is not split. Instead, the extra records are put into a separate overflow area and chained together in logical order. Hence, every entry in a track index may contain a pointer to the overflow area, in addition to its pointer to the home track.

The essential difference between the ISAM organization and B+ tree-like organizations is in the way overflow records are handled. In the case of ISAM, overflow records are simply added to a chain of overflow records; the index structure is not altered. In the B+ tree case, overflow records are not tolerated. When overflow occurs, a block is split and the index structure is altered to accommodate the extra data block.

Can you think of any advantages of using the more rigid index structure of ISAM, with separate overflow areas to handle overflow records? Why do you think B+ tree-like approaches are replacing those that use overflow chains to hold overflow records? Consider the two approaches in terms of both sequential and direct access, as well as addition and deletion of records.
Programming Exercises

We begin this chapter by discussing operations on a sequence set, which is just a linked list of blocks containing records. Only later do we add the concept of an index set to provide faster access to the blocks in the sequence set. The following programming problems echo this approach, requiring you first to write a program that builds a sequence set, then to write programs and functions that maintain the sequence set, and finally to write functions to add an index set to the sequence set, creating a B+ tree. These programs can be implemented in either C or Pascal.

19. Write a program that accepts a file of strings as input. The input file should be sorted so the strings are in ascending order. Your program should use this input file to build a sequence set with the following characteristics:

- The strings are stored in 15-byte records;
- Each sequence set block is 128 bytes long;
- Sequence set blocks are doubly linked;
- The first block in the output file is a header block containing, among other things, a reference to the RRN of the first block in the sequence set;
- Sequence set blocks are loaded so they are as full as possible; and
- Sequence set blocks contain other fields (other than the actual records containing the strings) as needed.
20. Write an update program that accepts strings input from the keyboard, along with an instruction either to search, add, or delete the string from the sequence set. The program should have the following characteristics:

- Strings in the sequence set must, of course, be kept in order;
- Response to the search instruction should be either found or not found;
- A string should not be added if it is already in the sequence set;
- Blocks in the sequence set should never be allowed to be less than half full; and
- Splitting, redistribution, and concatenation operations should be written as separate procedures so they can be used in subsequent program development.
21. Write a program that traverses the sequence set created in the preceding exercises and that builds an index set in the form of a B-tree. You may assume that the B-tree index will never be deeper than two levels. The resulting file should have the following characteristics:

- The index set and the sequence set, taken together, should constitute a B+ tree;
- Do not compress the keys as you form the separators for the index set;
- Index set blocks, like sequence set blocks, should be 128 bytes long; and
- Index set blocks should be kept in the same file as the sequence set blocks. The header block should contain a reference to the root of the index set as well as the already existing reference to the beginning of the sequence set.

22. Write a new version of the update program that acts on the entire B+ tree that you created in the preceding exercise. Search, add, and delete capabilities should be supported, as they are in the earlier update program. B-tree characteristics should be maintained in the index set; the sequence set should, as before, be maintained so blocks are always at least half full.
23. Consider the block structure illustrated in Fig. 9.13, in which an index to separators is used to permit binary searching for a key in an index page. Each index set block contains three variable-length sets of items: a set of separators, an index to the separators, and a set of relative block numbers. Develop code in Pascal or C for storing these items in an index block and for searching the block for a separator. You need to answer such questions as:

- Where should the three sets be placed relative to one another?
- Given the data types permitted by the language you are using, how can you handle the fact that the block consists of both character and integer data with no fixed dividing point between them?
- As items are added to a block, how do you decide when a block is too full to insert another separator?
FURTHER READINGS

The initial suggestion for the B+ tree structure appears to have come from Knuth (1973b), although he did not name or develop the approach. Most of the literature that discusses B+ trees in detail (as opposed to describing specific implementations such as VSAM) is in the form of articles rather than textbooks. Comer (1979) provides what is perhaps the best brief overview of B+ trees. Bayer and Unterauer (1977) offer a definitive article describing techniques for compressing separators. The article includes consideration of simple prefix B+ trees as well as a more general approach called a prefix B+ tree. McCreight (1977) describes an algorithm for taking advantage of the variation in the lengths of separators in the index set of a B+ tree. McCreight's algorithm attempts to ensure that short separators, rather than longer ones, are promoted up in the tree as blocks split. The intent is to shape the tree so blocks higher up in the tree have a greater number of immediate descendents, thereby creating a shallower tree.

Rosenberg and Snyder (1981) study the effects of initializing a compact B-tree on later insertions and deletions. The use of batch insertions and deletions to B-trees, rather than individual updates, is proposed and analyzed in Lang et al. (1985). B+ trees are compared with more rigid indexed sequential file organizations (such as ISAM) in Batory (1981) and in IBM's VSAM Planning Guide.

There are many commercial products that use methods related to the B+ tree operations described in this chapter, but detailed descriptions of their underlying file structures are scarce. An exception to this is IBM's Virtual Storage Access Method (VSAM), one of the most widely used commercial products providing indexed sequential access. Wagner (1973) and Keehn and Lacy (1974) provide interesting insights into the early thinking behind VSAM. They also include considerations of key maintenance, key compression, secondary indexes, and indexes to multiple data sets. Good descriptions of VSAM can be found in several sources, and from a variety of perspectives, in IBM's VSAM Planning Guide, Bohl (1981), Comer (1979) (VSAM as an example of a B+ tree), Bradley (1982) (emphasis on implementation in a PL/I environment), and Loomis (1983) (with examples from COBOL).

VAX-11 Record Management Services (RMS), Digital's file and record access subsystem of the VAX/VMS operating system, uses a B+ tree-like structure to support indexed sequential access (Digital, 1979). Many microcomputer implementations of B+ trees can be found, including dBase III and Borland's Turbo Toolbox (Borland, 1984).
10 Hashing

CHAPTER OBJECTIVES

- Introduce the concept of hashing.
- Examine the problem of choosing a good hashing algorithm, present a reasonable one in detail, and describe some others.
- Explore three approaches for reducing collisions: randomization of addresses, use of extra memory, and storage of several records per address.
- Develop and use mathematical tools for analyzing performance differences resulting from the use of different hashing techniques.
- Examine problems associated with file deterioration and discuss some solutions.
- Examine effects of patterns of record access on performance.
CHAPTER OUTLINE

10.1 Introduction
  10.1.1 What is Hashing?
  10.1.2 Collisions
10.2 A Simple Hashing Algorithm
10.3 Hashing Functions and Record Distributions
  10.3.1 Distributing Records among Addresses
  10.3.2 Some Other Hashing Methods
  10.3.3 Predicting the Distribution of Records
  10.3.4 Predicting Collisions for a Full File
10.4 How Much Extra Memory Should Be Used?
  10.4.1 Packing Density
  10.4.2 Predicting Collisions for Different Packing Densities
10.5 Collision Resolution by Progressive Overflow
  10.5.1 How Progressive Overflow Works
  10.5.2 Search Length
10.6 Storing More Than One Record per Address: Buckets
  10.6.1 Effects of Buckets on Performance
  10.6.2 Implementation Issues
10.7 Making Deletions
  10.7.1 Tombstones for Handling Deletions
  10.7.2 Implications of Tombstones for Insertions
  10.7.3 Effects of Deletions and Additions on Performance
10.8 Other Collision Resolution Techniques
  10.8.1 Double Hashing
  10.8.2 Chained Progressive Overflow
  10.8.3 Chaining with a Separate Overflow Area
  10.8.4 Scatter Tables: Indexing Revisited
10.9 Patterns of Record Access
10.1 Introduction

O(1) access to a file means that no matter how big the file grows, access to a record always takes the same, small number of seeks. By contrast, sequential searching gives us O(N) access, wherein the number of seeks grows in proportion to the size of the file. As we saw in the preceding chapters, B-trees improve on this greatly, providing O(log_k N) access; the number of seeks increases as the logarithm to the base k of the number of records, where k is a measure of the leaf size. O(log_k N) access can provide very good retrieval performance, even for very large files, but it is still not O(1) access.

In a sense, O(1) access has been the Holy Grail of file structure design. Everyone agrees that O(1) access is what we want to achieve, but until about 10 years ago it was not clear that one could develop a general class of O(1) access strategies that would work on dynamic files that change greatly in size.

In this chapter we begin with a description of static hashing techniques. They provide us with O(1) access but are not extensible as the file increases in size. Static hashing was the state of the art until about 1980. In the following chapter we show how research and design work during the 1980s has found ways to extend hashing, and O(1) access, to files that are dynamic and increase in size over time.

10.1.1 What is Hashing?

A hash function is like a black box that produces an address every time you drop in a key. More formally, it is a function h(K) that transforms a key K into an address. The resulting address is used as the basis for storing and retrieving records. In Fig. 10.1, the key LOWELL is transformed by the hash function to the address 4. That is, h(LOWELL) = 4. Address 4 is said to be the home address of LOWELL.

Hashing is like indexing in that it involves associating a key with a relative record address. Hashing differs from indexing in two important ways:

- With hashing, the addresses generated appear to be random: there is no immediately obvious connection between the key and the location of the corresponding record, even though the key is used to determine the location of the record. For this reason, hashing is sometimes referred to as randomizing.

- With hashing, two different keys may be transformed to the same address, so two records may be sent to the same place in the file. When this occurs, it is called a collision and some means must be found to deal with it.

Consider the following simple example. Suppose you want to store 75 records in a file, where the key to each record is a person's name. Suppose also that you set aside space for 1,000 records. The key can be hashed by taking two numbers from the ASCII representations of the first two characters of the name, multiplying these together, then using the rightmost three digits of the result for the address. Table 10.1 shows how three names would produce three addresses. Note that even though the names are listed in alphabetical order, there is no apparent order to the addresses. They appear to be in random order.
FIGURE 10.1 Hashing the key LOWELL to address 4, LOWELL's home address.
10.1.2 Collisions

Now suppose there is a key in the sample file with the name OLIVIER. Since the name OLIVIER starts with the same two letters as the name LOWELL, they produce the same address (004). There is a collision between the record for OLIVIER and the record for LOWELL. We refer to keys that hash to the same address as synonyms.

Collisions cause problems. We cannot put two records in the same space, so we must resolve collisions. We do this in two ways: by choosing hashing algorithms partly on the basis of how few collisions they are likely to produce, and by playing some tricks with the ways we store records.
TABLE 10.1 A simple hashing scheme

Name      ASCII Code for First Two Letters    Product            Home Address
BALL      66  65                              66 x 65 = 4,290    290
LOWELL    76  79                              76 x 79 = 6,004    004
TREE      84  82                              84 x 82 = 6,888    888
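A one-line C sketch of this scheme (the function name is ours) makes the arithmetic in Table 10.1 concrete:

    /* Sketch of the two-letter scheme in Table 10.1: multiply the ASCII
     * codes of the first two characters and keep the rightmost three
     * digits. For "BALL": 66 * 65 = 4290, so the home address is 290. */
    int two_letter_hash(const char *name)
    {
        return (name[0] * name[1]) % 1000;
    }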
The ideal solution to collisions is to find a transformation algorithm that avoids collisions altogether. Such an algorithm is called a perfect hashing algorithm. It turns out to be much more difficult to find a perfect hashing algorithm than one might expect, however. Suppose, for example, that you want to store 4,000 records among 5,000 available addresses. It can be shown (Hanson, 1982) that of the huge number of possible hashing algorithms for doing this, only one out of 10^120,000 avoids collisions altogether. Hence, it is usually not worth trying.†

†It is not unreasonable to try to generate perfect hashing functions for small (less than 500), stable sets of keys, such as might be used to look up reserved words in a programming language. But files generally contain more than a few hundred keys, or they contain sets of keys that change frequently, so they are not normally considered candidates for perfect hashing functions. See Knuth (1973b), Sager (1985), Chang (1984), and Chichelli (1980) for more on perfect hashing functions.

A more practical solution is to reduce the number of collisions to an acceptable number. For example, if only one out of 10 searches for a record results in a collision, then the average number of disk accesses required to retrieve a record remains quite low. There are several different ways to reduce the number of collisions, including the following three:

- Spread out the records. Collisions occur when two or more records compete for the same address. If we could find a hashing algorithm that distributes the records fairly randomly among the available addresses, then we would not have large numbers of records clustering around certain addresses. Our sample hash algorithm, which uses only two letters from the key, is not good on this account because certain combinations of two letters are quite common in starting names, while others are uncommon (e.g., compare the number of names that start with "JO" with the number that start with "XZ"). We need to find a hashing algorithm that distributes records more randomly.

- Use extra memory. It is easier to find a hash algorithm that avoids collisions if we have only a few records to distribute among many addresses than if we have about the same number of records as addresses. Our sample hashing algorithm is very good on this account since there are 1,000 possible addresses and only 75 addresses (corresponding to the 75 records) will be generated. The obvious disadvantage to spreading out the records is that storage space is wasted. (In the example, 7.5% of the available record space is used, and the remaining 92.5% is wasted.) There is no simple answer to the question of how much empty space should be tolerated to get the best hashing performance, but some techniques are provided later in this chapter for measuring the relative gains in performance for different amounts of free space.
- Put more than one record at a single address. Up to now we have assumed tacitly that each physical record location in a file could hold exactly one record, but there is usually no reason why we cannot create our file in such a way that every file address is big enough to hold several records. If, for example, each record is 80 bytes long and we create a file with 512-byte physical records, we can store up to six records at each file address. Each address is able to tolerate five synonyms. Addresses that can hold several records in this way are sometimes called buckets.

In the following sections we elaborate on these collision-reducing methods, and as we do so we present some programs for managing hashed files.
10.2 A Simple Hashing Algorithm

One goal in choosing any hashing algorithm should be to spread out records as uniformly as possible over the range of addresses available. The use of the term hash for this technique suggests what is done to achieve this. Our dictionary reminds us that the verb to hash means "to chop into small pieces . . . muddle or confuse." The algorithm used previously chops off the first two letters and then uses the resulting ASCII codes to produce a number that is in turn chopped to produce the address. It is not very good at avoiding clusters of synonyms because so many names begin with the same two letters.

One problem with the algorithm is that it does not really do very much hashing. It uses only two letters of the key and it does not do much with the two letters. Now let us look at a hash function that does much more randomizing, primarily because it uses more of the key. It is a reasonably good basic algorithm and is likely to give good results no matter what kinds of keys are used. It is also an algorithm that is not too difficult to alter in case a specific instance of the algorithm does not work well.

This algorithm has three steps:

1. Represent the key in numerical form.
2. Fold and add.
3. Divide by a prime number and use the remainder as the address.

Step 1. Represent the Key in Numerical Form  If the key is already a number, then this step is already accomplished. If it is a string of characters,
we take the ASCII code of each character and use it to form a number. For example,

    LOWELL  =  76 79 87 69 76 76 32 32 32 32 32 32
                L  O  W  E  L  L |---- blanks ----|

In this algorithm we use the entire key, rather than just the first two letters. By using more parts of a key, we increase the likelihood that differences among the keys cause differences in the addresses produced. The extra processing time required to do this is usually insignificant when compared to the potential improvement in performance.
Step 2. Fold and Add  Folding and adding means chopping off pieces of the number and adding them together. In our algorithm we chop off pieces with two ASCII numbers each:

    76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32

These number pairs can be thought of as integer variables (rather than character variables, which is how they started out) so we can do arithmetic on them. If we can treat them as integer variables, then we can add them. This is easy to do in C because C allows us to do arithmetic on characters. In Pascal, we can use the ord() function to obtain the integer position of a character within the computer's character set.

Before we add the numbers, we have to mention a problem caused by the fact that in most cases the sizes of the numbers we can add together are limited. On some microcomputers, for example, integer values that exceed 32,767 (15 bits) cause overflow errors or become negative. For example, adding the first five of the foregoing numbers gives

    7679 + 8769 + 7676 + 3232 + 3232 = 30,588.

Adding in the last 3,232 would, unfortunately, push the result over the maximum 32,767 (30,588 + 3,232 = 33,820), causing an overflow error. Consequently, we need to make sure that each successive sum is less than 32,767. We can do this by first identifying the largest single value we will ever add in our summation, and then making sure after each step that our intermediate result differs from 32,767 by at least that amount.

In our case, let us assume that keys consist only of blanks and uppercase alphabetic characters, so the largest addend is 9,090, corresponding to ZZ. Suppose we choose 19,937 as our largest allowable intermediate result. This differs from 32,767 by much more than 9,090, so we can be confident (in this example) that no new addition will cause overflow. We can ensure in our algorithm that no intermediate sum exceeds 19,937 by using the mod
operator, which returns the remainder when one integer is divided by another:

    7679 + 8769  -> 16448     16448 mod 19937 -> 16448
    16448 + 7676 -> 24124     24124 mod 19937 -> 4187
    4187 + 3232  -> 7419       7419 mod 19937 -> 7419
    7419 + 3232  -> 10651     10651 mod 19937 -> 10651
    10651 + 3232 -> 13883     13883 mod 19937 -> 13883

The number 13,883 is the result of the fold-and-add operation.

Why did we use 19,937 as our upper bound rather than, say, 20,000? Because the division and subtraction operations associated with the mod operator are more than just a way of keeping the number small; they are part of the transformation work of the hash function. As we see in the discussion for the next step, division by a prime number usually produces a more random distribution than does transformation by a nonprime. The number 19,937 is prime.

Step 3. Divide by the Size of the Address Space  The purpose of this step is to cut down to size the number produced in step 2 so it falls within the range of addresses of records in the file. This can be done by dividing that number by a number that is the address size of the file, and then taking the remainder. The remainder will be the home address of the record.
We can represent this operation symbolically as follows: If s represents the sum produced in step 2 (13,883 in the example), n represents the divisor (the number of addresses in the file), and a represents the address we are trying to produce, we apply the formula

    a = s mod n.

The remainder produced by the mod operator will be a number between 0 and n - 1.

Suppose, for example, that we decide to use the 100 addresses 0-99 for our file. In terms of the preceding formula,

    a = 13883 mod 100 = 83.

Since the number of addresses allocated for the file does not have to be any specific size (as long as it is big enough to hold all of the actual records to be stored in the file), we have a great deal of freedom in choosing the divisor n. It is a good thing that we do, because the choice of n can have a major effect on how well the records are spread out. A prime number is usually used for the divisor because primes tend to distribute remainders much more uniformly than do nonprimes. A nonprime can work well in many cases, however, especially if it has no
prime divisors less than 20 (Hanson, 1982). Since the remainder is going to be the address of a record, we choose a number as close as possible to the desired size of the address space. This number actually determines the size of the address space. For a file with 75 records, a good choice might be 101, which would leave the file 74.3% full (75/101 = 0.743).

If 101 is the size of the address space, the home address of the record in the example becomes

    a = 13883 mod 101 = 46.

Hence, the record whose key is LOWELL is assigned to record number 46 in the file.

    FUNCTION hash(KEY, MAXAD)
        set SUM to 0
        set J to 0
        while (J < 12)
            set SUM to (SUM + 100 * KEY[J] + KEY[J + 1]) mod 19937
            increment J by 2
        endwhile
        return (SUM mod MAXAD)
    end FUNCTION

FIGURE 10.2 Function hash(KEY, MAXAD) uses folding and prime number division to compute a hash address.
The procedure described previously can be carried out with a function that we call hash(), described mostly in pseudocode in Fig. 10.2. Function hash() takes at least two inputs: KEY, which must be an array of ASCII codes for at least 12 characters, and MAXAD, which has the address size. The value returned by hash() is the address.
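A possible C rendering of this pseudocode follows; it is a sketch under the same assumptions (a 12-character, blank-padded key of uppercase letters), not code supplied by the text.

    /* Sketch in C of the fold-and-add hash of Fig. 10.2. key is assumed to
     * be a 12-character, blank-padded array of ASCII codes; maxad is the
     * size of the address space. Taking each partial sum mod 19937 keeps
     * every intermediate result well below 32,767. */
    int hash(const char key[12], int maxad)
    {
        long sum = 0;
        int  j;
        for (j = 0; j < 12; j += 2)
            sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
        return (int)(sum % maxad);
    }

Called with the key LOWELL (padded with blanks to 12 characters) and maxad set to 101, this sketch reproduces the fold-and-add result 13,883 and the home address 46 computed above.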
10.3 Hashing Functions and Record Distributions
Of the two hash functions we have so far examined, one spreads out records pretty well, and one does not spread them out well at all. In this section we look at ways to describe distributions of records in files. Understanding distributions makes it easier to discuss other hashing methods.
10.3.1 Distributing Records among Addresses

Figure 10.3 illustrates three different distributions of seven records among 10 addresses. Ideally, a hash function should distribute records in a file so there are no collisions, as illustrated by distribution (a). Such a distribution is called uniform because the records are spread out uniformly among the addresses. We pointed out earlier that completely uniform distributions are so hard to find that it is generally not considered worth trying to find them.

Distribution (b) illustrates the worst possible kind of distribution. All records share the same home address, resulting in the maximum number of collisions. The more a distribution looks like this one, the more collisions will be a problem.

Distribution (c) illustrates a distribution in which the records are somewhat spread out, but with a few collisions. This is the most likely case if we have a function that distributes keys randomly. If a hash function is random, then for a given key every address has the same likelihood of being chosen as every other address. The fact that a certain address is chosen for one key neither diminishes nor increases the likelihood that the same address will be chosen for another key.

It should be clear that if a random hash function is used to generate a large number of addresses from a large number of keys, then simply by chance some addresses are going to be generated more often than others. If you have, for example, a random hash function that generates addresses between 0 and 99, and you give the function 100 keys, you would expect
FIGURE 10.3 Different distributions. (a) No synonyms (uniform; best). (b) All synonyms (worst case). (c) A few synonyms (acceptable).
some of the 100 addresses to be chosen more than once and some to be chosen not at all.

Although a random distribution of records among available addresses is not ideal, it is an acceptable alternative, given that it is practically impossible to find a function that gives a uniform distribution. Uniform distributions may be out of the question, but there are times when we can find distributions that are better than random in the sense that, while they do generate a fair number of synonyms, they spread out records among addresses more uniformly than does a random distribution.
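A short simulation in C (ours, not the text's) makes this concrete: it scatters 100 keys over 100 addresses with a random function and tallies how many addresses receive zero, one, two, or more keys.

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: assign 100 keys to 100 addresses at random and count how
     * many addresses receive 0, 1, 2, ... keys. rand() stands in here
     * for a random hash function. */
    int main(void)
    {
        int count[100] = {0};   /* keys landing on each address          */
        int hits[101]  = {0};   /* addresses receiving exactly i keys    */
        int i;

        srand(1);
        for (i = 0; i < 100; i++)
            count[rand() % 100]++;
        for (i = 0; i < 100; i++)
            hits[count[i]]++;
        for (i = 0; i <= 5; i++)
            printf("addresses receiving %d keys: %d\n", i, hits[i]);
        return 0;
    }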
10.3.2 Some Other Hashing Methods

It would be nice if there were a hash function that guaranteed a better-than-random distribution in all cases, but there is not. The distribution generated by a hashing function depends on the set of keys that are actually hashed. Therefore, the choice of a proper hashing function should involve some intelligent consideration of the keys to be hashed, and perhaps some experimentation. The approaches to choosing a reasonable hashing function covered in this section are ones that have been found to work well, given the right circumstances. Further details on these and other methods can be found in Knuth (1973b), Maurer (1975), Hanson (1982), and Sorenson et al. (1978).

Here are some methods that are potentially better than random:

- Examine keys for a pattern. Sometimes keys fall in patterns that naturally spread themselves out. This is more likely to be true of numeric keys than of alphabetic keys. For example, a set of employee identification numbers might be ordered according to when the employees entered an organization. This might even lead to no synonyms. If some part of a key shows a usable underlying pattern, a hash function that extracts that part of the key can also be used.

- Fold parts of the key. Folding is one stage in the method discussed earlier. It involves extracting digits from part of a key and adding the extracted parts together. This method destroys the original key patterns but in some circumstances may preserve the separation between certain subsets of keys that naturally spread themselves out.

- Divide the key by a number. Division by the address size and use of the remainder is usually involved somewhere in a hash function since the purpose of the function is to produce an address within a certain range. Division preserves consecutive key sequences, so you can take advantage of sequences that effectively spread out keys. However, if there are several consecutive key sequences, division by a number that has many small factors can result in many collisions. Research has shown that numbers with no divisors less than 19 generally avoid this problem. Division by a prime is even more likely than division by a nonprime to generate different results from different consecutive sequences.

The preceding methods are designed to take advantage of natural orderings among the keys. The next two methods should be tried when, for some reason, the better-than-random methods do not work. In these cases, randomization is the goal.

- Square the key and take the middle. This popular method (often called the mid-square method) involves treating the key as a single large number, squaring the number, and extracting whatever number of digits is needed from the middle of the result. For example, suppose you want to generate addresses between 0 and 99. If the key is the number 453, its square is 205,209. Extracting the middle two digits yields a number between 0 and 99, in this case 52. As long as the keys do not contain many leading or trailing zeros, this method usually produces fairly random results. One unattractive feature of this method is that it often requires multiple precision arithmetic.

- Radix transformation. This method involves converting the key to some number base other than the one you are working in, and then taking the result modulo the maximum address as the hash address. For example, suppose you want to generate addresses between 0 and 99. If the key is the decimal number 453, its base 11 equivalent is 382; 382 mod 99 = 85, so 85 is the hash address.

Radix transformation is generally more reliable than the mid-square method for approaching true randomization, though mid-square has been found to give good results when applied to some sets of keys.
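The short C sketch below works through both numeric examples for the key 453; the function names and the particular digit-extraction choices are assumptions made for the illustration.

    /* Mid-square sketch: square the key and take two digits from the
     * middle. For key 453, 453 * 453 = 205209, and the middle two
     * digits give the address 52. */
    int mid_square(long key)
    {
        long square = key * key;
        return (int)((square / 100) % 100);  /* drop two low digits, keep next two */
    }

    /* Radix transformation sketch: write the key in base 11, read those
     * digits back as a decimal number, and take the result modulo the
     * address size. For key 453 the base-11 digits are 3 8 2, read back
     * as 382, and 382 mod 99 = 85. */
    int radix_transform(long key, int maxad)
    {
        long digits_as_decimal = 0;
        long place = 1;
        while (key > 0) {
            digits_as_decimal += (key % 11) * place;  /* next base-11 digit */
            key /= 11;
            place *= 10;                              /* shift left in decimal */
        }
        return (int)(digits_as_decimal % maxad);
    }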
10.3.3 Predicting the Distribution of Records

Given that it is nearly impossible to achieve a uniform distribution of records among the available addresses in a file, it is important to be able to predict how records are likely to be distributed. If we know, for example, that a large number of addresses are likely to have far more records assigned to them than they can hold, then we know that there are going to be a lot of collisions.

Although there are no nice mathematical tools available for predicting collisions among distributions that are better than random, there are mathematical tools for understanding just this kind of behavior when records are distributed randomly. If we assume a random distribution (knowing that very likely it will be better than random), we can use these tools to obtain conservative estimates of how our hashing method is likely to behave.
The Poisson Distribution†

†This section develops a formula for predicting the ways in which records will be distributed among addresses in a file if a random hashing function is used. The discussion assumes knowledge of some elementary concepts of probability and combinatorics. You may want to skip the development and go straight to the formula, which is introduced in the next section.

We want to predict the number of collisions that are likely to occur in a file that can hold only one record at an address. We begin by concentrating on what happens to a single given address when a hash function is applied to a key. We would like to answer the following questions: When all of the keys in a file are hashed, what is the likelihood that

- None of the keys will hash to the given address?
- Exactly one key will hash to the address?
- Exactly two keys will hash to the address (two synonyms)?
- Exactly three, four (and so on) keys will hash to the address?
- All keys in the file will hash to the same given address?

Which of these outcomes would you expect to be fairly likely, and which quite unlikely? Suppose there are N addresses in a file. When a single key is hashed, there are two possible outcomes with respect to the given address:

    A  The address is not chosen; or
    B  The address is chosen.

How do we express the probabilities of the two outcomes? If we let both p(A) and a stand for the probability that the address is not chosen, and p(B) and b stand for the probability that the address is chosen, then

    p(B) = b = 1/N,

since the address has one chance in N of being chosen, and

    p(A) = a = (N - 1)/N,
458
HASHING
since the address has
(N =
chances in
N of not being
chosen. If there are
of our address being chosen
0.1, and the probability of the address not being chosen is a
10 addresses
10), the probability
1/10 =
0.1 = 0.9.
is
Now suppose two keys are hashed. What is the probability that both
keys hash to our given address? Since the two applications of the hashing
function are independent of one another, the probability that both will
produce the given address is a product:

    p(BB) = b × b = 1/N × 1/N

For N = 10: b × b = 0.1 × 0.1 = 0.01.
Of course, other outcomes are possible when two keys are hashed. For
example, the second key could hash to an address other than the given
address. The probability of this is the product

    p(BA) = b × a = 1/N × (1 - 1/N)

For N = 10: b × a = 0.1 × 0.9 = 0.09.

In general, when we want to know the probability of a certain sequence
of outcomes, such as BABBA, we can replace each A and B by a and b,
respectively, and compute the indicated product:

    p(BABBA) = b × a × b × b × a = a^2 b^3

For N = 10: a^2 b^3 = (0.9)^2 (0.1)^3.
This example shows how to find the probability of three Bs and two
As, where the Bs and As occur in the order shown. We want to know the
probability that there are a certain number of Bs and As, but without regard
to order. For example, suppose we are hashing four keys and we want to
know how likely it is that exactly two of the keys hash to our given address.
This can occur in six ways, all six ways having the same probability:

    Outcome   Probability       For N = 10
    BBAA      bbaa = b^2 a^2    (0.1)^2 (0.9)^2 = 0.0081
    BABA      baba = b^2 a^2    (0.1)^2 (0.9)^2 = 0.0081
    BAAB      baab = b^2 a^2    (0.1)^2 (0.9)^2 = 0.0081
    ABBA      abba = b^2 a^2    (0.1)^2 (0.9)^2 = 0.0081
    ABAB      abab = b^2 a^2    (0.1)^2 (0.9)^2 = 0.0081
    AABB      aabb = b^2 a^2    (0.1)^2 (0.9)^2 = 0.0081
Since these six sequences are independent of one another, the probability of two Bs and two As is the sum of the probabilities of the individual
outcomes:
    p(BBAA) + p(BABA) + ... + p(AABB) = 6 b^2 a^2 = 6 × 0.0081 = 0.0486.

The 6 in the expression 6 b^2 a^2 represents the number of ways two Bs and two
As can be distributed among four places.
In general, the event "r trials result in r - x As and x Bs" can happen
in as many ways as r - x letters A can be distributed among r places. The
probability of each such way is

    a^(r-x) b^x

and the number of such ways is given by the formula

    C = r! / ((r - x)! x!)

This is the well-known formula for the number of ways of selecting x
items out of a set of r items. It follows that when r keys are hashed, the
probability that an address will be chosen x times and not chosen r - x times
can be expressed as

    p(x) = C a^(r-x) b^x

Furthermore, if we know that there are N addresses available, we can be
precise about the individual probabilities of A and B, and the formula
becomes

    p(x) = C ((N - 1)/N)^(r-x) (1/N)^x

where C has the definition given previously.
What does this mean? It means that, for example, if x = 0, we can
compute the probability that a given address will have 0 records assigned to
it by the hashing function using the formula

    p(0) = C ((N - 1)/N)^r (1/N)^0

If x = 1, this formula gives the probability that one record will be assigned
to a given address:

    p(1) = C ((N - 1)/N)^(r-1) (1/N)^1

This expression has the disadvantage that for large values of r and N, it is
awkward to compute. (Try it for 1,000 addresses and 1,000 records:
N = r = 1,000.) Fortunately, there is a function that is a very good
approximation for p(x) and is much easier to compute. It is called the
Poisson function.
The Poisson Function Applied to Hashing

The Poisson function, which we also denote by p(x), is given by

    p(x) = ((r/N)^x e^(-r/N)) / x!

where N, r, x, and p(x) have exactly the same meanings they have in the
previous section. That is, if

    N = the number of available addresses;
    r = the number of records to be stored; and
    x = the number of records assigned to a given address,

then p(x) gives the probability that a given address will have had x records
assigned to it after the hashing function has been applied to all r records.
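As an illustration (not code from the text), the Poisson approximation is easy to evaluate directly; the small C program below assumes double arithmetic is adequate and reproduces the probabilities computed in the example that follows.

    #include <math.h>
    #include <stdio.h>

    /* Poisson approximation: the probability that a given address is assigned */
    /* exactly x records when r records are hashed randomly to N addresses.    */
    double poisson(double N, double r, int x)
    {
        double lambda = r / N;          /* average number of records per address */
        double result = exp(-lambda);   /* e^(-r/N)                              */
        int i;
        for (i = 1; i <= x; i++)        /* multiply in (r/N)^x / x!              */
            result *= lambda / i;
        return result;
    }

    int main(void)
    {
        /* The example that follows: N = 1,000 addresses, r = 1,000 records. */
        int x;
        for (x = 0; x <= 3; x++)
            printf("p(%d) = %.3f\n", x, poisson(1000.0, 1000.0, x));
        return 0;       /* prints approximately 0.368, 0.368, 0.184, 0.061 */
    }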
Suppose, for example, that there are 1,000 addresses (N = 1,000) and
1,000 records whose keys are to be hashed to the addresses (r = 1,000).
Since r/N = 1, the probability that a given address will have no keys hashed
to it (x = 0) becomes

    p(0) = (1^0 e^(-1)) / 0! = 0.368.

The probabilities that a given address will have exactly one, two, or
three keys, respectively, hashed to it are

    p(1) = (1^1 e^(-1)) / 1! = 0.368
    p(2) = (1^2 e^(-1)) / 2! = 0.184
    p(3) = (1^3 e^(-1)) / 3! = 0.061.

If we can use the Poisson function to estimate the probability that a
given address will have a certain number of records, we can also use it to
predict the number of addresses that will have a certain number of records
assigned.

For example, suppose there are 1,000 addresses (N = 1,000) and 1,000
records (r = 1,000). Multiplying 1,000 by the probability that a given
address will have x records assigned to it gives the expected total number of
addresses with x records assigned to them. That is, 1,000 p(x) gives the
number of addresses with x records assigned to them.
In general, if there are N addresses, then the expected number of
addresses with x records assigned to them is Np(x).

This suggests another way of thinking about p(x). Rather than thinking
about p(x) as a measure of probability, we can think of p(x) as giving the
proportion of addresses having x logical records assigned by hashing.

Now that we have a tool for predicting the expected proportion of
addresses that will have zero, one, two, etc. records assigned to them by a
random hashing function, we can apply this tool to predicting numbers of
collisions.
10.3.4 Predicting Collisions for a Full File

Suppose you have a hashing function that you believe will distribute records
randomly, and you want to store 10,000 records in 10,000 addresses. How
many addresses do you expect to have no records assigned to them?

Since r = 10,000 and N = 10,000, r/N = 1. Hence the proportion of
addresses with 0 records assigned should be

    p(0) = (1^0 e^(-1)) / 0! = 0.3679.

The number of addresses with no records assigned is

    10,000 × p(0) = 3,679.

How many addresses should have one, two, and three records assigned,
respectively?

    10,000 × p(1) = 0.3679 × 10,000 = 3,679
    10,000 × p(2) = 0.1839 × 10,000 = 1,839
    10,000 × p(3) = 0.0613 × 10,000 = 613.
Since the 3,679 addresses corresponding to x = 1 have exactly one
record assigned to them, their records have no synonyms. The 1,839
addresses with two records apiece, however, represent potential trouble. If
each such address has space only for one record, and two records are
assigned to them, there is a collision. This means that 1,839 records will fit
into the addresses, but another 1,839 will not fit. There will be 1,839
overflow records.

Each of the 613 addresses with three records apiece has an even bigger
problem. If each address has space for only one record, there will be two
overflow records per address. Corresponding to these addresses will be a
total of 2 × 613 = 1,226 overflow records. This is a bad situation. We have
thousands of records that do not fit into the addresses assigned by the
hashing function. We need to develop a method for handling these overflow
records.
But first, let's try to reduce the number of overflow records.

10.4 How Much Extra Memory Should Be Used?

We have seen the importance of choosing a good hashing algorithm to
reduce collisions. A second way to decrease the number of collisions (and
thereby decrease the average search length) is to use extra memory. The
tools developed in the previous section can be used to help us determine the
effect of the use of extra memory on performance.
10.4.1 Packing Density

The term packing density refers to the ratio of the number of records to be
stored (r) to the number of available spaces (N):*

    Packing density = Number of records / Number of spaces = r/N

For example, if there are 75 records (r = 75) and 100 addresses (N = 100),
the packing density is

    r/N = 75/100 = 0.75 = 75%.
The packing density gives a measure of the amount of space in a file that
is actually used, and it is the only such value needed to assess performance
in a hashing environment, assuming that the hash method used gives a
reasonably random distribution of records. The raw size of a file and its
address space do not matter; what is important is the relative sizes of the
two, which are given by the packing density.

Think of packing density in terms of tin cans lined up on a 10-foot
length of fence. If there are 10 tin cans and you throw a rock, there is a
certain likelihood that you will hit a can. If there are 20 cans on the same
length of fence, the fence has a higher packing density and your rock is
more likely to hit a can. So it is with records in a file. The more records
there are packed into a given file space, the more likely it is that a collision
will occur when a new record is added.

*We assume here that only one record can be stored at each address. In fact, that is
not necessarily the case, as we see later.
We need to decide how much space we are willing to waste to reduce
the number of collisions. The answer depends in large measure on
particular circumstances. We want to have as few collisions as possible, but
not, for example, at the expense of requiring the file to use two disks instead
of one.

10.4.2 Predicting Collisions for Different Packing Densities
We need a quantitative description of the effects of changing the packing
density. In particular, we need to be able to predict the number of collisions
that are likely to occur for a given packing density. Fortunately, the Poisson
function provides us with just the tool to do this.

You may have noted already that the formula for packing density (r/N)
occurs twice in the Poisson formula

    p(x) = ((r/N)^x e^(-r/N)) / x!

Indeed, the numbers of records (r) and addresses (N) always occur together
as the ratio r/N. They never occur independently. An obvious implication of
this is that the way records are distributed depends partly on the ratio of the
number of records to the number of available addresses, and not on the
absolute numbers of records or addresses. The same behavior is exhibited
by 500 records distributed among 1,000 addresses as by 500,000 records
distributed among 1,000,000 addresses.
Suppose that 1,000 addresses are allocated to hold 500 records in a
randomly hashed file, and that each address can hold one record. The
packing density for the file is

    r/N = 500/1,000 = 0.5 = 50%.

Let us answer the following questions about the distribution of records
among the available addresses in the file:

- How many addresses should have no records assigned to them?
- How many addresses should have exactly one record assigned (no synonyms)?
- How many addresses should have one record plus one or more synonyms?
- Assuming that only one record can be assigned to each home address, how many overflow records can be expected?
- What percentage of records should be overflow records?
1. How many addresses should have no records assigned to them? Since p(0)
gives the proportion of addresses with no records assigned, the number
of such addresses is

    Np(0) = 1,000 × ((0.5)^0 e^(-0.5)) / 0! = 1,000 × 0.607 = 607.

2. How many addresses should have exactly one record assigned (no synonyms)?

    Np(1) = 1,000 × ((0.5)^1 e^(-0.5)) / 1! = 1,000 × 0.303 = 303.
3. How many addresses should have one record plus one or more synonyms?
The values of p(2), p(3), p(4), and so on give the proportions of addresses
with one, two, three, and so on synonyms assigned to them. Hence the sum

    p(2) + p(3) + p(4) + ...

gives the proportion of all addresses with at least one synonym. This
may appear to require a great deal of computation, but it doesn't, since
the values of p(x) grow quite small for x larger than 3. This should make
intuitive sense. Since the file is only 50% loaded, one would not expect
very many keys to hash to any one address. Therefore, the number of
addresses with more than about three keys hashed to them should be
quite small. We need only compute the results up to p(5) before they
become insignificantly small:

    p(2) + p(3) + p(4) + p(5) = 0.0758 + 0.0126 + 0.0016 + 0.0002 = 0.0902.

The number of addresses with one or more synonyms is just the
product of N and this result:

    N[p(2) + p(3) + ...] = 1,000 × 0.0902 = 90.
4. Assuming that only one record can be assigned to each home address, how
many overflow records could be expected? For each of the addresses
represented by p(2), one record can be stored at the address and one
must be an overflow record. For each address represented by p(3),
one record can be stored at the address, two are overflow records,
and so on. Hence, the expected number of overflow records is given by

    1 × N × p(2) + 2 × N × p(3) + 3 × N × p(4) + 4 × N × p(5)
        = N × [1 × p(2) + 2 × p(3) + 3 × p(4) + 4 × p(5)]
        = 1,000 × [1 × 0.0758 + 2 × 0.0126 + 3 × 0.0016 + 4 × 0.0002]
        = 107.

5. What percentage of records should be overflow records? If there are 107
overflow records and 500 records in all, then the proportion of overflow
records is

    107/500 = 0.214 = 21.4%.

Conclusion: If the packing density is 50% and each address can hold
only one record, we can expect about 21% of all records to be stored
somewhere other than at their home addresses.
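The arithmetic in steps 4 and 5 is easy to automate. The following sketch (again using the Poisson approximation, with an arbitrary cutoff of 20 terms, which is more than enough at this packing density) estimates the number and percentage of overflow records for one-record addresses; all names here are illustrative, not taken from the book's programs.

    #include <math.h>
    #include <stdio.h>

    /* Poisson probability that an address is assigned exactly x of the r records. */
    double poisson(double N, double r, int x)
    {
        double lambda = r / N, result = exp(-lambda);
        int i;
        for (i = 1; i <= x; i++)
            result *= lambda / i;
        return result;
    }

    int main(void)
    {
        double N = 1000.0, r = 500.0;       /* the example above: 50% packed */
        double overflow = 0.0;
        int x;

        /* An address holding x > 1 records contributes x - 1 overflow records. */
        for (x = 2; x <= 20; x++)
            overflow += (x - 1) * N * poisson(N, r, x);

        printf("expected overflow records: %.0f (%.1f%% of %.0f records)\n",
               overflow, 100.0 * overflow / r, r);
        /* prints roughly 107 overflow records, about 21% of the 500 records */
        return 0;
    }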
Table 10.2 shows the proportion of records that are not stored in their
home addresses for several different packing densities. The table shows that
if the packing density is 10%, then about 5% of the time we try to access
a record, there is already another record there. If the density is 100%, then
about 37% of all records collide with other records at their home addresses.

TABLE 10.2  Effect of packing density on the proportion of records not stored at their home addresses

    Packing Density (%)    Synonyms as % of Records
    10                      4.8
    20                      9.4
    30                     13.6
    40                     17.6
    50                     21.4
    60                     24.8
    70                     28.1
    80                     31.2
    90                     34.1
    100                    36.8
The 4.8% collision rate that results when the packing density is 10% looks
very good until you realize that for every record in your file there will be
nine unused spaces!

The 36.8% that results from 100% usage looks good when viewed in
terms of 0% unused space. Unfortunately, 36.8% doesn't tell the whole
story. If 36.8% of the records are not at their home addresses, then they are
somewhere else, probably in many cases using addresses that are home
addresses for other records. The more homeless records there are, the more
contention there is for space with other homeless records. After a while,
clusters of overflow records can form, leading in some cases to extremely
long searches for some of the records. Clearly, the placement of records that
collide is an important matter. Let us now look at one simple approach to
placing overflow records.

10.5 Collision Resolution by Progressive Overflow
Even if a hashing algorithm is very good, it is likely that collisions will
occur. Therefore, any hashing program must incorporate some method for
dealing with records that cannot fit into their home addresses. There are a
number of techniques for handling overflow records, and the search for
ever-better techniques continues to be a lively area of research. We examine
several approaches, but we concentrate on a very simple one that often
works well. The technique has various names, including progressive overflow
and linear probing.

FIGURE 10.4 Collision resolution with progressive overflow. (York's home address is already occupied by Rosen; the second and third tries are also busy; the fourth try finds an open slot, which becomes York's actual address.)

FIGURE 10.5 Searching for an address beyond the end of a file. (The key Blue hashes to address 99, which is occupied by Jello, so the search wraps around to address 0.)
10.5.1 How Progressive Overflow Works

An example of a situation in which a collision occurs is shown in Fig. 10.4.
In the example, we want to store the record whose key is York in the file.
Unfortunately, the name York hashes to the same address as the name
Rosen, whose record is already stored there. Since York cannot fit in its
home address, it is an overflow record. If progressive overflow is used, the
next several addresses are searched in sequence until an empty one is found.
The first free address becomes the address of the record. In the example,
address 9 is the first record found empty, so the record pertaining to York
is stored in address 9.

Eventually we need to find York's record in the file. Since York still
hashes to 6, the search for the record begins at address 6. It does not find
York's record there, so it proceeds to look at successive records until it gets
to address 9, where it finds York.

An interesting problem occurs when there is a search for an open space
or for a record at the end of the file. This is illustrated in Fig. 10.5, in which it is
assumed that the file can hold 100 records in addresses 0-99. Blue is
hashed to record number 99, which is already occupied by Jello. Since the
file holds only 100 records, it is not possible to use 100 as the next address.
The way this is handled in progressive overflow is to wrap around the
address space of the file by choosing address 0 as the next address. Since, in
this case, address 0 is not occupied, Blue gets stored in address 0.
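A minimal sketch of progressive overflow in C might look like the following; the table size, key length, the ///// marker, and the fold-and-add stand-in for hash() are assumptions made for the sake of a self-contained example, not the book's program. The search stops either at an open address or after coming all the way around to where it began, two conditions discussed next.

    #include <stdio.h>
    #include <string.h>

    #define FILE_SIZE 10              /* tiny address space for the demo        */
    #define KEY_LEN   12
    #define EMPTY     "/////"         /* marks a slot that has never been used  */

    struct slot { char key[KEY_LEN]; };
    struct slot table[FILE_SIZE];

    /* Toy stand-in for a hash function: fold and add the character codes. */
    int hash(const char *key)
    {
        unsigned sum = 0;
        while (*key)
            sum += (unsigned char)*key++;
        return (int)(sum % FILE_SIZE);
    }

    /* Store a record using progressive overflow; return its actual address, */
    /* or -1 if the file is completely full.                                  */
    int store(const char *key)
    {
        int home = hash(key), addr = home;
        do {
            if (strcmp(table[addr].key, EMPTY) == 0) {
                strncpy(table[addr].key, key, KEY_LEN - 1);
                table[addr].key[KEY_LEN - 1] = '\0';
                return addr;
            }
            addr = (addr + 1) % FILE_SIZE;      /* wrap past the last address */
        } while (addr != home);
        return -1;
    }

    /* Search: stop at an open address (record absent) or after a full circle. */
    int find(const char *key)
    {
        int home = hash(key), addr = home;
        do {
            if (strcmp(table[addr].key, EMPTY) == 0)
                return -1;
            if (strcmp(table[addr].key, key) == 0)
                return addr;
            addr = (addr + 1) % FILE_SIZE;
        } while (addr != home);
        return -1;
    }

    int main(void)
    {
        int i, addr;
        for (i = 0; i < FILE_SIZE; i++)
            strcpy(table[i].key, EMPTY);        /* every slot starts empty     */
        store("Rosen");
        addr = store("York");                   /* may or may not collide here */
        printf("York: home %d, actual %d, find() -> %d\n",
               hash("York"), addr, find("York"));
        return 0;
    }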
What happens if there is a search for a record, but the record was never
placed in the file? The search begins, as before, at the record's home
address, and then proceeds to look for it in successive locations. Two things
can happen:

- If an open address is encountered, the searching routine might assume this means that the record is not in the file; or
- If the file is full, the search comes back to where it began. Only then is it clear that the record is not in the file.

When this occurs, or even when we approach filling our file, searching can
become intolerably slow, whether or not the record being sought is in the
file.

The greatest strength of progressive overflow is its simplicity. In many
cases, it is a perfectly adequate method. There are, however, collision-handling
techniques that perform better than progressive overflow, and we
examine some of them later in this chapter. Now let us look at the effect of
progressive overflow on performance.
10.5.2 Search Length

The reason to avoid overflow is, of course, that extra searches (hence, extra
disk accesses) have to occur when a record is not found in its home address.
If there are a lot of collisions, there are going to be a lot of overflow records
taking up spaces where they ought not to be. Clusters of records can form,
resulting in the placement of records a long way from home, so many disk
accesses are required to retrieve them.

Consider the following set of keys and the corresponding addresses
produced by some hash function.

    Key      Home Address
    Adams    20
    Bates    21
    Cole     21
    Dean     22
    Evans    20
FIGURE 10.6 Illustration of the effects of clustering of records. As
keys are clustered, the number of accesses required to access
later keys can become large.

    Actual address   Key     Home address   Number of accesses needed to retrieve
    20               Adams   20             1
    21               Bates   21             1
    22               Cole    21             2
    23               Dean    22             2
    24               Evans   20             5
If these records are loaded into an empty file, and progressive overflow
is used to resolve collisions, only two of the records will be at their home
addresses. All the others require extra accesses to retrieve. Figure 10.6
shows where each key is stored, together with information on how many
accesses are required to retrieve it.

The term search length refers to the number of accesses required to
retrieve a record from secondary memory. In the context of hashing, the
search length for a record increases every time there is a collision. If a record
is a long way from its home address, the search length may be unacceptable.
A good measure of the extent of the overflow problem is average search
length. The average search length is just the average number of times you
can expect to have to access the disk to retrieve a record. A rough estimate
of average search length may be computed by finding the total search length
(the sum of the search lengths of the individual records) and dividing this by
the number of records:

    Average search length = total search length / total number of records
In the example, the average search length for the five records is

    (1 + 1 + 2 + 2 + 5) / 5 = 2.2

With no collisions at all, the average search length is 1, since only one
access is needed to retrieve any record. (We indicated earlier that an
algorithm that distributes records so evenly that no collisions occur is
appropriately called a perfect hashing algorithm, and that, unfortunately,
such an algorithm is almost impossible to construct.) On the other hand, if
a large number of the records in a file results in collisions, the average search
length becomes quite long. There are ways to estimate the expected average
search length, given various file specifications, and we discuss them in a
later section.
It turns out that, using progressive overflow, the average search length
goes up very rapidly as the packing density increases. The curve in Fig.
10.7, adapted from Peterson (1957), illustrates the problem. If the packing
density is kept as low as 60%, the average record takes fewer than two tries
to access, but for a much more desirable packing density of 80% or more,
it increases very rapidly.

FIGURE 10.7 Average search length versus packing density in a hashed file in which
one record can be stored per address, progressive overflow is used to resolve collisions,
and the file has just been loaded. (The curve plots average search length against packing
density from 0 to 100%.)

Average search lengths of greater than 2.0 are generally considered
unacceptable, so it appears that it is usually necessary to use less than 40%
of your storage space to get tolerable performance. Fortunately, we can
improve on this situation substantially by making one small change to our
hashing program. The change involves putting more than one record at a
single address.
10.6 Storing More Than One Record per Address: Buckets

Recall that when a computer receives information from a disk, it is just
about as easy for the I/O system to transfer several records as it is to transfer
a single record. Recall too that sometimes it might be advantageous to think
of records as being grouped together in blocks rather than stored individually.
Therefore, why not extend the idea of a record address in a file to an
address of a group of records? The word bucket is sometimes used to describe
a block of records that is retrieved in one disk access, especially when those
records are seen as sharing the same address. On sector-addressing disks, a
bucket typically consists of one or more sectors; on block-addressing disks,
a bucket might be a block.
Consider the following set of keys, which is to be loaded into a hash file.

    Key      Home Address
    Green    30
    Hall     30
    Jenks    32
    King     33
    Land     33
    Marx     33
    Nutt     33

Figure 10.8 illustrates part of a file into which the records with these keys
are loaded. Each address in the file identifies a bucket capable of holding the
records corresponding to three synonyms. Only the record corresponding
to Nutt cannot be accommodated in a home address.
When a record is to be stored or retrieved, its home bucket address is
determined by hashing. The entire bucket is loaded into primary memory.
An in-RAM search through successive records in the bucket can then be
used to find the desired record. When a bucket is filled, we still have to
worry about the record overflow problem (as in the case of Nutt), but this
occurs much less often when buckets are used than when each address can
hold only one record.
FIGURE 10.8 An illustration of buckets. Each bucket can hold up to three
records. Only one synonym (Nutt) results in overflow.

    Bucket address   Bucket contents
    30               Green   Hall
    31
    32               Jenks
    33               King    Land    Marx      (Nutt is an overflow record)
10.6.1 Effects of Buckets on Performance

When buckets are used, the formula used to compute packing density is
changed slightly since each bucket address can hold more than one record.
To compute how densely packed a file is, we need to consider both the
number of addresses (buckets) and the number of records we can put at each
address (bucket size). If N is the number of addresses and b is the number
of records that fit in a bucket, then bN is the number of available locations
for records. If r is still the number of records in the file, then

    Packing density = r / bN
Suppose we have a file in which 750 records are to be stored. Consider
the following two ways we might organize the file.

- We can store the 750 data records among 1,000 locations, where each
  location can hold one record. The packing density in this case is

    r/N = 750/1,000 = 0.75 = 75%.

- We can store the 750 records among 500 locations, where each location
  has a bucket size of 2. There are still 1,000 places (2 × 500) to
  store the 750 records, so the packing density is still

    r/bN = 750/1,000 = 0.75 = 75%.
Since the packing density is not changed, we might at first not expect
the use of buckets in this way to improve performance, but in fact it does
improve performance dramatically. The key to the improvement is that,
although there are fewer addresses, each individual address has more room
for variation in the number of records assigned to it.

Let's calculate the difference in performance for these two ways of
storing the same number of records in the same amount of space. The
starting point for our calculations is the fundamental description of each file
structure.
                                   File without Buckets    File with Buckets
    Number of records              r = 750                 r = 750
    Number of addresses            N = 1,000               N = 500
    Bucket size                    b = 1                   b = 2
    Packing density                0.75                    0.75
    Ratio of records to addresses  r/N = 0.75              r/N = 1.5
To determine the number of overflow records that are expected in the
case of each file, recall that when a random hashing function is used, the
Poisson function

    p(x) = ((r/N)^x e^(-r/N)) / x!

gives the expected proportion of addresses assigned x records. Evaluating
the function for the two different file organizations, we find that records are
assigned to addresses according to the distributions shown in Table 10.3.

We see from the table that when buckets are not used, 47.2% of the
addresses have no records assigned, whereas when two-record buckets are
used, only 22.3% of the addresses have no records assigned. This should
make intuitive sense: since in the two-record case there are only half as
many addresses to choose from, it stands to reason that a greater proportion
of the addresses are chosen to contain at least one record.

Note that the bucket column in Table 10.3 is longer than the nonbucket
column. Does this mean that there are more synonyms in the bucket case
than in the nonbucket case? Indeed it does, but half of those synonyms do
not result in overflow records because each bucket can hold two records.
Let us examine this further by computing the exact number of overflow
records likely to occur in the two cases.
TABLE 10.3  Poisson distributions for two different file organizations

            File without Buckets    File with Buckets
            (r/N = 0.75)            (r/N = 1.5)
    p(0)    0.472                   0.223
    p(1)    0.354                   0.335
    p(2)    0.133                   0.251
    p(3)    0.033                   0.126
    p(4)    0.006                   0.047
    p(5)    0.001                   0.014
    p(6)                            0.004
    p(7)                            0.001
In the case of the file with bucket size one, any address that is assigned
exactly one record does not have any overflow. Any address with more
than one record does have overflow. Recall that the expected number of
overflow records is given by

    N × [1 × p(2) + 2 × p(3) + 3 × p(4) + 4 × p(5) + ...],
which, for r/N = 0.75 and N = 1,000, is approximately

    1,000 × [1 × 0.1328 + 2 × 0.0332 + 3 × 0.0062 + 4 × 0.0009 + 5 × 0.0001] = 222.

The 222 overflow records represent 29.6% overflow.

In the case of the bucket file, any address that is assigned either one or
two records does not have overflow. The value of p(1) (with r/N = 1.5)
gives the proportion of addresses assigned exactly one record, and p(2)
(with r/N = 1.5) gives the proportion of addresses assigned exactly two
records. It is not until we get to p(3) that we encounter addresses for which
there are overflow records. For each address represented by p(3), two
records can be stored at the address, and one must be an overflow record.
Similarly, for each address represented by p(4), there are two overflow
records, and so forth. Hence, the expected number of overflow records in
the bucket file is

    N × [1 × p(3) + 2 × p(4) + 3 × p(5) + 4 × p(6) + ...],

which, for r/N = 1.5 and N = 500, is approximately

    500 × [1 × 0.1255 + 2 × 0.0471 + 3 × 0.0141 + 4 × 0.0035 + 5 × 0.0008] = 140.
The 140 overflow records represent 18.7% overflow.
We have shown that with one record per address and a packing density
of 75%, the expected number of overflow records is 29.6%. When 500
buckets are used, each capable of holding two records, the packing density
remains 75%, but the expected number of overflow records drops to
18.7%. That is about a 37% decrease in the number of times the program
is going to have to look elsewhere for a record. As the bucket size gets
larger, performance continues to improve.
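The same calculation generalizes to any bucket size: an address assigned x records overflows by x - b of them whenever x exceeds the bucket capacity b. A small sketch (reusing the Poisson approximation; the 20-term cutoff is arbitrary but ample here) reproduces the 222 and 140 figures; the names are illustrative.

    #include <math.h>
    #include <stdio.h>

    /* Poisson probability that an address receives exactly x of r records */
    /* hashed randomly among N addresses.                                   */
    double poisson(double N, double r, int x)
    {
        double lambda = r / N, result = exp(-lambda);
        int i;
        for (i = 1; i <= x; i++)
            result *= lambda / i;
        return result;
    }

    /* Expected number of overflow records when r records are hashed into */
    /* N buckets that each hold up to b records.                           */
    double expected_overflow(double N, double r, int b)
    {
        double total = 0.0;
        int x;
        for (x = b + 1; x <= b + 20; x++)
            total += (x - b) * N * poisson(N, r, x);
        return total;
    }

    int main(void)
    {
        /* The comparison in the text: 750 records at 75% packing density. */
        printf("bucket size 1: about %.0f overflow records\n",
               expected_overflow(1000.0, 750.0, 1));    /* about 222 */
        printf("bucket size 2: about %.0f overflow records\n",
               expected_overflow(500.0, 750.0, 2));     /* about 140 */
        return 0;
    }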
Table 10.4 shows the proportions of collisions that occur for different
packing densities and for different bucket sizes. We see from the table, for
example, that if we keep the packing density at 75% and increase the bucket
size to 10, record accesses result in overflow only 4% of the time.
TABLE 10.4  Synonyms causing collisions as a percent of records for different packing
densities and different bucket sizes

    Packing                       Bucket Size
    Density (%)      1       2       5      10     100
    10              4.8     0.6     0.0     0.0     0.0
    20              9.4     2.2     0.1     0.0     0.0
    30             13.6     4.5     0.4     0.0     0.0
    40             17.6     7.3     1.1     0.1     0.0
    50             21.3    10.4     2.5     0.4     0.0
    60             24.8    13.7     4.5     1.3     0.0
    70             28.1    17.0     7.1     2.9     0.0
    75             29.6    18.7     8.6     4.0     0.0
    80             31.2    20.4    10.3     5.3     0.1
    90             34.1    23.8    13.8     8.6     0.8
    100            36.8    27.1    17.6    12.5     4.0

It should be clear that the use of buckets can improve hashing
performance substantially. One might ask, "How big should buckets be?"
Unfortunately, there is no simple answer to this question because it depends
very much on a number of different characteristics of the system, including
the sizes of buffers the operating system can manage, sector and track
capacities on disks, and access times of the hardware (seek, rotation, and
data transfer times).

As a rule, it is probably not a good idea to use buckets larger than a
track (unless records are very large). Even a track, however, can sometimes
be too large when one considers the amount of time it takes to transmit an
entire track, as compared to the amount of time it takes to transmit a few
sectors. Since hashing almost always involves retrieving only one record
per search, any extra transmission time resulting from the use of extra large
buckets is essentially wasted.
In many cases a single cluster is the best bucket size. For example,
suppose that a file with 200-byte records is to be stored on a disk system that
uses 1,024-byte clusters. One could consider each cluster as a bucket, store
five records per cluster, and let the remaining 24 bytes go unused. Since it
is no more expensive, in terms of seek time, to access a five-record cluster
than it is to access a single record, the only losses from the use of buckets
are the extra transmission time and the 24 unused bytes.

The obvious question now is, "How do improvements in the number
of collisions affect the average search time?" The answer depends in large
measure on characteristics of the drive on which the file is loaded. If there
are a large number of tracks in each cylinder, there will be very little seek
time because overflow records will be unlikely to spill over from one
cylinder to another. If, on the other hand, there is only one track per
cylinder, seek time could be a major consumer of search time.

A less exact measure of the amount of time required to retrieve a record
is average search length, which we introduced earlier. In the case of buckets,
average search length represents the average number of buckets that must
be accessed to retrieve a record. Table 10.5 shows the expected average
search lengths for files with different packing densities and bucket sizes,
given that progressive overflow is used to handle collisions. Clearly, the use
of buckets seems to help a great deal in decreasing the average search length.
The bigger the bucket, the shorter the search length.
10.6.2 Implementation Issues

In the early chapters of this text, we paid quite a bit of attention to issues
involved in producing, using, and maintaining random-access files with
fixed-length records that are accessed by relative record number (RRN).
Since a hashed file is a fixed-length record file whose records are accessed
by RRN, you should already know much about implementing hashed files.
TABLE 10.5  Average number of accesses required in a successful search by
progressive overflow

    Packing                       Bucket Size
    Density (%)      1       2       5      10      50
    10             1.06    1.01    1.00    1.00    1.00
    30             1.21    1.06    1.00    1.00    1.00
    40             1.33    1.10    1.01    1.00    1.00
    50             1.50    1.18    1.03    1.00    1.00
    60             1.75    1.29    1.07    1.01    1.00
    70             2.17    1.49    1.14    1.04    1.00
    80             3.00    1.90    1.29    1.11    1.01
    90             5.50    3.15    1.78    1.35    1.04
    95            10.50    5.6     2.7     1.8     1.1

Adapted from Donald Knuth, The Art of Computer Programming, Vol. 3, 1973,
Addison-Wesley, Reading, Mass. Page 536. Reprinted with permission.

Hashed files differ from the files we discussed earlier in two important
respects, however:
1. Since a hash function depends on there being a fixed number of
available addresses, the logical size of a hashed file must be fixed before
the file can be populated with records, and it must remain fixed as
long as the same hash function is used. (We use the phrase logical size to
leave open the possibility that physical space be allocated as needed.)

2. Since the RRN of a record in a hashed file is uniquely related to its
key, any procedures that add, delete, or change a record must do so
without breaking the bond between a record and its home address. If this
bond is broken, the record is no longer accessible by hashing.

We must keep these special needs in mind when we write programs to
work with hashed files.

Bucket Structure
files.
The only difference between a file with buckets and
which each address can hold only one key is that with a bucket file
each address has enough space to hold more than one logical record. All
records that are housed in the same bucket share the same address. Suppose,
one
in
478
HASHING
for example, that
Here
An empty
Two
we want
many
to store as
are three such buckets with different
bucket:
entries:
full bucket:
as five names in one bucket.
numbers of records.
JONES
ARNSWORTH
JONES
ARNSWORTH STOCKTON BRICE
Each bucket contains
has stored in
it.
a counter that
keeps track of how
Collisions can occur only
causes the counter to exceed the
The counter
tells
us
when
many records it
new record
the addition of a
number of records
how many
THROOP
bucket can hold.
data records are stored in a bucket, but
tell us which slots are used and which are not. We need a way
whether or not a record slot is empty. One simple way to do this is
to use a special marker to indicate an empty record, just as we did with
deleted records earlier.. We use the key value ///// to mark empty records in
does not
it
to
tell
the preceding illustration.
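A bucket of this kind maps naturally onto a fixed-length structure in C. The sketch below is one plausible layout, with BUCKET_SIZE, KEY_LEN, and the ///// marker chosen to match the discussion above; everything else is illustrative rather than the book's own code.

    #include <string.h>

    #define BUCKET_SIZE 5               /* records per bucket, as in the example  */
    #define KEY_LEN     20
    #define EMPTY_KEY   "/////"         /* marks a slot that has never been used  */

    struct record {
        char key[KEY_LEN];
        /* ... other fields of the fixed-length record ... */
    };

    struct bucket {
        int count;                           /* how many records are stored here */
        struct record slots[BUCKET_SIZE];
    };

    /* Prepare a bucket for an empty, newly initialized hash file. */
    void init_bucket(struct bucket *b)
    {
        int i;
        b->count = 0;
        for (i = 0; i < BUCKET_SIZE; i++)
            strcpy(b->slots[i].key, EMPTY_KEY);
    }

    /* Try to add a record to its home bucket; return 0 on success, */
    /* or -1 if the bucket is already full (a collision).            */
    int add_to_bucket(struct bucket *b, const char *key)
    {
        if (b->count >= BUCKET_SIZE)
            return -1;
        strncpy(b->slots[b->count].key, key, KEY_LEN - 1);
        b->slots[b->count].key[KEY_LEN - 1] = '\0';
        b->count++;
        return 0;
    }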
Initializing a File for Hashing

Since the logical size of a hashed file must remain fixed, it makes sense in
most cases to allocate physical space for the file before we begin storing data
records in it. This is generally done by creating a file of empty spaces for all
records, and then filling the slots as they are needed with the data records.
(It is not necessary to construct a file of empty records before putting data
in it, but doing so increases the likelihood that records will be stored close
to one another on the disk, avoids the error that occurs when an attempt is
made to read a missing record, and makes it easy to process the file
sequentially, without having to treat the empty records in any special way.)
Loading a Hash File

A program that loads a hash file is similar in many ways to the earlier
programs we used for populating fixed-length record files, with two
differences. First, the program uses the function hash() to produce a home
address for each key. Second, the program looks for a free space for the
record by starting with the bucket stored at its home address and then,
if the home bucket is full, continuing to look at successive buckets until one
is found that is not full. The new record is inserted in this bucket, which is
rewritten to the file at the location from which it was loaded.

If, as it searches for an empty bucket, a loading program passes the
maximum allowable address, it must wrap around to the beginning
address. A potential problem occurs in loading a hash file when so many
records have been loaded into the file that there are no empty spaces left.
A naive search for an open slot can easily result in an infinite loop. Obviously,
we want to prevent this from occurring by having the program make
sure that there is space available somewhere in the file for each new record.

Another problem that often arises when adding records to files occurs
when an attempt is made to add a record that is already stored in the file. If
there is a danger of duplicate keys occurring, and duplicate keys are not
allowed in the file, some mechanism must be found for dealing with this
problem.
10.7 Making Deletions
Deleting a record from a hashed file is more complicated than adding a
record for two reasons:

- The slot freed by the deletion must not be allowed to hinder later searches; and
- It should be possible to reuse the freed slot for later additions.

When progressive overflow is used, a search for a record terminates if
an open address is encountered. Because of this, we do not want to leave
open addresses that break overflow searches improperly. The following
example illustrates the problem.

Adams, Jones, Morris, and Smith are stored in a hash file in which each
address can hold one record. Adams and Smith both are hashed to address
5, and Jones and Morris are hashed to address 6. If they are loaded in
alphabetical order using progressive overflow for collisions, they are stored
in the locations shown in Fig. 10.9.

A search for Smith starts at address 5 (Smith's home address). The search
successively looks for Smith at addresses 6, 7, and 8, then finds Smith at 8.
Now suppose Morris is deleted, leaving an empty space, as illustrated in
Fig. 10.10. A search for Smith again starts at address 5, and then looks at
addresses 6 and 7. Since address 7 is now empty, it is reasonable for the
search program to conclude that Smith's record is not in the file.
FIGURE 10.9 File organization before deletions.

    Actual address   Record   Home address
    5                Adams    5
    6                Jones    6
    7                Morris   6
    8                Smith    5
10.7.1 Tombstones for Handling Deletions

In Chapter 5 we discussed techniques for dealing with the deletion problem.
One simple technique we use for identifying deleted records involves
replacing the deleted record (or just its key) with a marker indicating that a
record once lived there but no longer does. Such a marker is sometimes
referred to as a tombstone (Wiederhold, 1983).

FIGURE 10.10 The same organization as in Fig. 10.9, with Morris deleted.
(Addresses 5, 6, and 8 hold Adams, Jones, and Smith; address 7 is empty.)

FIGURE 10.11 The same file as in Fig. 10.9 after the insertion of a tombstone
for Morris. (Addresses 5 through 8 hold Adams, Jones, ######, and Smith.)

The nice thing about the use of tombstones is that it solves both of the
problems described previously:
- The freed space does not break a sequence of searches for a record; and
- The freed space is obviously available and may be reclaimed for later additions.

Figure 10.11 illustrates how the sample file might look after the
tombstone ###### is inserted for the deleted record. Now a search for
Smith does not halt at the empty record number 7. Instead, it uses the
###### as an indication that it should continue the search.

It is not necessary to insert tombstones every time a deletion occurs.
For example, suppose in the preceding example that the record for Smith is
to be deleted. Since the slot following the Smith record is empty, nothing is
lost by marking Smith's slot as empty rather than inserting a tombstone.
Indeed, it is actually unwise to insert a tombstone where it is not needed. (If,
after putting an unnecessary tombstone in Smith's slot, a new record is
added at address 9, how would a subsequent unsuccessful search for Smith
be affected?)
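A deletion routine along these lines is easy to sketch. The code below assumes the one-record-per-address table and the find() function from the earlier progressive overflow sketch, with ///// marking a never-used slot and ###### serving as the tombstone; the names are illustrative, not the book's code.

    #include <string.h>

    #define FILE_SIZE 10
    #define KEY_LEN   12
    #define TOMBSTONE "######"          /* marks a slot whose record was deleted */

    struct slot { char key[KEY_LEN]; };
    extern struct slot table[FILE_SIZE];   /* from the earlier sketch (assumed)  */
    extern int find(const char *key);      /* progressive overflow search        */

    /* Delete a record by overwriting its key with a tombstone, so that later */
    /* searches do not stop prematurely at this slot.                          */
    int delete_record(const char *key)
    {
        int addr = find(key);
        if (addr < 0)
            return -1;                      /* record was not in the file */
        strcpy(table[addr].key, TOMBSTONE);
        return 0;
    }

Note that the earlier find() sketch already continues past a tombstone, because it stops only at the ///// marker.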
10.7.2 Implications of Tombstones for Insertions

With the introduction of the use of tombstones, the insertion of records
becomes slightly more difficult than our earlier discussions imply. Whereas
programs that perform initial loading simply search for the first occurrence
of an empty record slot (signified by the presence of the key /////), it is now
permissible to insert a record where either ///// or ###### occurs as the key.

This new feature, which is desirable because it yields a shorter average
search length, brings with it a certain danger. Consider, for example, the
earlier example in which Morris is deleted, giving the file organization
shown in Fig. 10.11. Now suppose you want a program to insert Smith into
the file. If the program simply searches until it encounters a ######, it
never notices that Smith is already in the file. We almost certainly don't
want to put a second Smith record into the file, since doing so means that
later searches would never find the older Smith record. To prevent this
from occurring, the program must examine the entire cluster of contiguous
keys and tombstones to ensure that no duplicate key exists, and then go
back and insert the record in the first available tombstone, if there is one.
10.7.3 Effects of Deletions and Additions on Performance

The use of tombstones enables our search algorithms to work and helps in
storage recovery, but one can still expect some deterioration in performance
after a number of deletions and additions occur within a file.

Consider, for example, our little four-record file of Adams, Jones,
Smith, and Morris. After deleting Morris, Smith is one slot further from its
home address than it needs to be. If the tombstone is never to be used to
store another record, every retrieval of Smith requires one more access than
is absolutely necessary. More generally, after a large number of additions
and deletions, one can expect to find many tombstones occupying places
that could be occupied by records whose home records precede them but
that are stored after them. In effect, each tombstone represents an
unexploited opportunity to reduce by one the number of locations that
must be scanned while searching for these records.

Some experimental studies show that after a 50% to 150% turnover of
records, a hashed file reaches a point of equilibrium, so average search
length is as likely to get better as it is to get worse (Bradley, 1982; Peterson,
1957). By this time, however, search performance has deteriorated to the
point that the average record is three times as far (in terms of accesses) from
its home address as it would be after initial loading. This means, for
example, that if after original loading the average search length is 1.2, it will
be about 1.6 after the point of equilibrium is reached.

There are three types of solutions to the problem of deteriorating
average search lengths. One involves doing a bit of local reorganizing every
time a deletion occurs. For example, the deletion algorithm might examine
the records that follow a tombstone to see if the search length can be
shortened by moving the record backward toward its home address.
Another solution involves completely reorganizing the file after the average
search length reaches an unacceptable value. A third type of solution
involves using an altogether different collision resolution algorithm.
10.8 Other Collision Resolution Techniques
Despite its simplicity, randomized hashing using progressive overflow with
reasonably sized buckets generally performs well. If it does not perform
well enough, however, there are a number of variations that may perform
even better. In this section we discuss some refinements that can often
improve hashing performance when using external storage.
10.8.1 Double Hashing

One of the problems with progressive overflow is that if many records hash
to buckets in the same vicinity, clusters of records can form. As the packing
density approaches one, this clustering tends to lead to extremely long
searches for some records. One method for avoiding clustering is to store
overflow records a long way from their home addresses by double hashing.
With double hashing, when a collision occurs, a second hash function is
applied to the key to produce a number c that is relatively prime to the
number of addresses.* The value c is added to the home address to produce
the overflow address. If the overflow address is already occupied, c is added
to it to produce another overflow address. This procedure continues until a
free overflow address is found.
Double hashing does tend to spread out the records in a file, but it
suffers from a potential problem that is encountered in several improved
overflow methods: It violates locality by deliberately moving overflow
records some distance from their home addresses, increasing the likelihood
that the disk will need extra time to get to the new overflow address. If the
file covers more than one cylinder, this could require an expensive extra
head movement. Double hashing programs can solve this problem if they
are able to generate overflow addresses in such a way that overflow records
are kept on the same cylinder as home records.

*If N is the number of addresses, then c and N are relatively prime if they have no
common divisors.
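A probe sequence of this kind might be coded as follows. This is a generic sketch rather than the book's program; it assumes two hash functions, hash1() and hash2(), where hash2() returns a step c that is relatively prime to N (guaranteed here by making N prime, so any step from 1 to N - 1 qualifies), and a helper occupied() that reports whether an address is in use.

    #define N 101                      /* number of addresses; prime, so any   */
                                       /* step 1..N-1 is relatively prime to N */

    extern int hash1(const char *key); /* home address, 0..N-1    (assumed)    */
    extern int hash2(const char *key); /* step size c, 1..N-1     (assumed)    */
    extern int occupied(int addr);     /* is this address in use? (assumed)    */

    /* Double hashing: on a collision, repeatedly add the key-dependent step */
    /* c to the address (mod N) until a free overflow address is found.      */
    int double_hash_probe(const char *key)
    {
        int addr = hash1(key);
        int c = hash2(key);
        int probes = 0;
        while (occupied(addr) && probes < N) {
            addr = (addr + c) % N;
            probes++;
        }
        return probes < N ? addr : -1;      /* -1: the file is full */
    }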
10.8.2 Chained Progressive Overflow

Chained progressive overflow is another technique designed to avoid the
problems caused by clustering. It works in the same manner as progressive
overflow, except that synonyms are linked together with pointers. That is,
each home address contains a number indicating the location of the next
record with the same home address. The next record in turn contains a
pointer to the following record with the same home address, and so forth.
The net effect of this is that for each set of synonyms there is a linked list
connecting their records, and it is this list that is searched when a record is
sought.

The advantage of chained progressive overflow over simple progressive
overflow is that only records with keys that are synonyms need to be
accessed in any given search. Suppose, for example, that the set of keys
shown in Fig. 10.12 is to be loaded in the order shown into a hash file with
bucket size one, and progressive overflow is used. A search for Cole
involves an access to Adams (a synonym) and Bates (not a synonym). Flint,
the worst case, requires six accesses, only two of which involve synonyms.

Since Adams, Cole, and Flint are synonyms, a chaining algorithm
forms a linked list connecting these three names, with Adams at the head of
the list. Since Bates and Dean are also synonyms, they form a second list.
This arrangement is illustrated in Fig. 10.13. The average search length
decreases from 2.5 to

    (1 + 1 + 2 + 2 + 1 + 3) / 6 = 1.7.
FIGURE 10.12 Hashing with progressive overflow.

    Key      Home address   Actual address   Search length
    Adams    20             20               1
    Bates    21             21               1
    Cole     20             22               3
    Dean     21             23               3
    Evans    24             24               1
    Flint    20             25               6

    Average search length = (1 + 1 + 3 + 3 + 1 + 6)/6 = 2.5
FIGURE 10.13 Hashing with chained progressive overflow. Adams, Cole, and
Flint are synonyms; Bates and Dean are synonyms.

    Home address   Actual address   Data    Address of next synonym   Search length
    20             20               Adams   22                        1
    21             21               Bates   23                        1
    20             22               Cole    25                        2
    21             23               Dean    (none)                    2
    24             24               Evans   (none)                    1
    20             25               Flint   (none)                    3
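The layout in Fig. 10.13 corresponds to a record with one extra link field. The sketch below (illustrative names, not the book's code) shows how a search can follow the synonym chain instead of probing every intervening address; it assumes, as two-pass loading guarantees, that every home address holds a home record, and uses -1 as the end-of-chain marker.

    #include <string.h>

    #define FILE_SIZE 100
    #define KEY_LEN   12
    #define NIL       (-1)                  /* end-of-chain marker (assumed) */

    struct chained_slot {
        char key[KEY_LEN];
        int  next;                          /* address of the next synonym, or NIL */
    };

    extern struct chained_slot chain_table[FILE_SIZE];   /* assumed loaded        */
    extern int hash(const char *key);                    /* assumed hash function */

    /* Chained progressive overflow search: start at the home address and   */
    /* follow the synonym chain, so only records with the same home address */
    /* are ever examined.                                                    */
    int chained_find(const char *key)
    {
        int addr = hash(key);
        while (addr != NIL) {
            if (strcmp(chain_table[addr].key, key) == 0)
                return addr;
            addr = chain_table[addr].next;
        }
        return NIL;                         /* not in the file */
    }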
The use of chained progressive overflow requires that we attend to
some details that are not required for simple progressive overflow. First, a
link field must be added to each record, requiring the use of a little more
storage. Second, a chaining algorithm must guarantee that it is possible to
get to any synonym by starting at its home address. This second
requirement is not a trivial one, as the following example shows.

Suppose that in the example Dean's home address is 22 instead of 21.
Since, by the time Dean is loaded, address 22 is already occupied by Cole,
Dean still ends up at address 23. Does this mean that Cole's pointer should
point to 23 (Dean's actual address) or to 25 (the address of Cole's synonym
Flint)? If the pointer is 25, the linked list joining Adams, Cole, and Flint is
kept intact, but Dean is lost. If the pointer is 23, Flint is lost.

The problem here is that a certain address (22) that should be occupied
by a home record (Dean) is occupied by a different record. One solution to
the problem is to require that every address that qualifies as a home address
for some record in the file actually hold a home record. The problem can be
handled easily when the file is first loaded by using a technique called
two-pass loading.

Two-pass loading, as the name implies, involves loading a hash file in
two passes. On the first pass, only home records are loaded. All records that
are not home records are kept in a separate file. This guarantees that no
potential home addresses are occupied by overflow records. On the second
pass, each overflow record is loaded and stored in one of the free addresses
according to whatever collision resolution technique is being used.
Two-pass loading guarantees that every potential home address actually
is a home address, so it solves the problem in the example. It does not
guarantee that later deletions and additions will not re-create the same
problem, however. As long as the file is used to store both home records
and overflow records, there remains the problem of overflow records
displacing new records that hash to an address occupied by an overflow
record.

The methods used for handling these problems after initial loading are
somewhat complicated and can, in a very volatile file, require many extra
disk accesses. (For more information on techniques for maintaining
pointers, see Knuth, 1973b and Bradley, 1982.) It would be nice if we could
somehow altogether avoid this problem of overflow lists bumping into one
another, and that is what the next method does.
10.8.3 Chaining with a Separate Overflow Area
One way
to
keep overflow records from occupying
they should not be
to
is
move them
hashing schemes are variations of
addresses
is
all
this basic
and the
called the prime data area,
called the overflow area.
home
to a separate
addresses where
overflow
approach. The
set
area.
set
of
Many
home
of overflow addresses is
is that it keeps all
The advantage of this approach
unused but potential home addresses free for later additions.
In terms of the file we examined in the preceding section, the records
for Cole, Dean, and Flint could have been stored in a separate overflow area
rather than potential
home
addresses for later-arriving records (Fig. 10.14).
Now
no problem occurs when a new record is added. If its home address
has room, it is stored there. If not, it is moved to the overflow file, where
it is added to the linked list that starts at the home address.
If the bucket size for the primary file is large enough to prevent
excessive numbers of overflow records, the overflow file can be a simple
file with a bucket size of one. Space can be allocated for
overflow records only when it is needed.
The use of a separate overflow area simplifies processing somewhat and
would seem to improve performance, especially when many additions and
deletions occur. However, this is not always the case. If the separate
entry-sequenced
overflow area
is
on
a different
cylinder than
is
the
home
address, every
search for an overflow record will involve a very costly head
Studies
show
that actual access time
records are stored in
a separate
is
generally worse
movement.
when overflow
overflow area than when they are stored
in
the prime overflow area (Lum, 1971).
One situation in which a separate overflow area is required occurs when
the packing density is greater than one; there are more records than home
addresses.
FIGURE 10.14 Chaining to a separate overflow area. Adams, Cole, and Flint are
synonyms; Bates and Dean are synonyms. (Adams and Bates occupy home addresses
20 and 21 in the primary data area and point to chains of Cole and Flint, and of Dean,
in the overflow area; Evans is at address 24.)
If, for example, it is anticipated that a file will grow beyond the
capacity of the initial set of home addresses and that rehashing the file with
a larger address space is not reasonable, then a separate overflow area must
be used.
10.8.4 Scatter Tables: Indexing Revisited

Suppose you have a hash file that contains no records, only pointers to
records. The file is obviously just an index that is searched by hashing rather
than by some other method. The term scatter table (Severance, 1974) is often
applied to this approach to file organization. Figure 10.15 illustrates the
organization of a file using a scatter table.

The scatter table organization provides many of the same advantages
simple indexing generally provides, with the additional advantage that the
search of the index itself requires only one access. (Of course, that one
access is one more than other forms of hashing require, unless the scatter
table can be kept in primary memory.) The data file can be implemented in
many different ways. For example, it can be a set of linked lists of synonyms
(as shown in Fig. 10.15), a sorted file, or an entry-sequenced file.
Also, scatter table organizations conveniently support the use of variable-length
records. For more information on scatter tables, see Severance (1974)
and Teorey and Fry (1982).
FIGURE 10.15 Example of a scatter table structure. Because the hashed part of the
file is an index, the data file may be organized in any way that is appropriate.
10.9 Patterns of Record Access

    Twenty percent of the fishermen catch 80 percent of the fish.
    Twenty percent of the burglars steal 80 percent of the loot.
        L. M. Boyd

The use of different collision resolution techniques is not the only nor
necessarily the best way of improving performance in a hashed file. If we
know something about the patterns of record access, for example, then it is
often possible to use simple progressive overflow techniques and still
achieve very good performance.
Suppose you have a grocery store with 10,000 different categories of
grocery items, and you have on your computer a hashed inventory file with
a record for each of the 10,000 items that your company handles. Every
time an item is purchased, the record that corresponds to that item must be
accessed. Since the file is hashed, it is reasonable to assume that the 10,000
records are distributed randomly among the available addresses that make
up the file. Is it equally reasonable to assume that the distribution of accesses
to the records in the inventory is randomly distributed? Probably not.
Milk, for example, will be retrieved very frequently, brie seldom.
There is a principle used by economists called the Pareto Principle, or
The Concept of the Vital Few and the Trivial Many, which in file terms
says that a small percentage of the records in a file account for a large
percentage of the accesses. A popular version of the Pareto Principle is the
80/20 Rule of Thumb: 80% of the accesses are performed on 20% of the
records. In our groceries file, milk would be among the 20% high-activity
items, brie among the rest.
We cannot take advantage of the 80/20 principle in a file structure unless
we know something about the probable distribution of record accesses.
Once we have this information, we need to find a way to place the
high-activity items where they can be found with as few accesses as
possible. If, when the items are loaded into a file, they can be loaded in such
a way that the 20% (more or less) that are most likely to be accessed are
loaded at or near their home addresses, then most of the transactions will
access records that have short search lengths, so the effective average search
length will be shorter than the nominal average search length that we
defined earlier.
For example, suppose our grocery store's file handling program keeps track of the number of times each item is accessed during a one-month period. It might do this by storing with each record a counter that starts at zero and is incremented every time the item is accessed. At the end of the month the records for all the items in the inventory are dumped onto a file that is sorted in descending order according to the number of times they have been accessed. When the sorted file is rehashed and reloaded, the first records to be loaded are the ones that, according to the previous month's experience, are most likely to be accessed. Since they are the first ones loaded, they are also the ones most likely to be loaded into their home addresses. If reasonably sized buckets are used, there will be very few, if any, high-activity items that are not in their home addresses and therefore retrievable in one access.
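A small sketch may make the reload step concrete. The following C fragment is an illustration of ours, not code from the text: it sorts an in-memory array of inventory records by their access counters, most active first, so that a loader would place the busiest records before the rest. The record layout and the load_record stub are assumptions made for the example.

    #include <stdio.h>
    #include <stdlib.h>

    struct item_record {
        char key[12];        /* item key, e.g. a stock number       */
        long access_count;   /* counter incremented on every access */
    };

    /* qsort comparator: descending order of access_count */
    static int by_activity(const void *a, const void *b)
    {
        const struct item_record *x = a, *y = b;
        if (x->access_count < y->access_count) return 1;
        if (x->access_count > y->access_count) return -1;
        return 0;
    }

    /* stand-in for hashing the key and writing the record to the new
       file; a real loader would use the hash function and progressive
       overflow described in this chapter */
    static void load_record(const struct item_record *rec)
    {
        printf("loading %s (%ld accesses last month)\n",
               rec->key, rec->access_count);
    }

    /* reload the file with the most active records first, so they are
       the ones most likely to land in their home addresses */
    void reload_by_activity(struct item_record *items, size_t n)
    {
        qsort(items, n, sizeof items[0], by_activity);
        for (size_t i = 0; i < n; i++)
            load_record(&items[i]);
    }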
SUMMARY
There are three major modes for accessing files: sequentially, which provides O(N) performance, through tree structures, which can produce O(log_k N) performance, and directly. Direct access provides O(1) performance, which means that the number of accesses required to retrieve a record is constant and independent of the size of the file. Hashing is the primary form of organization used to provide direct access.
Hashing can provide faster access than most of the other organizations we study, usually with very little storage overhead, and it is adaptable to most types of primary keys. Ideally, hashing makes it possible to find any record with only one disk access, but this ideal is rarely achieved. The primary disadvantage of hashing is that hashed files may not be sorted by key.
Hashing involves the application of a hash function h(K) to a record key K to produce an address. The address is taken to be the home address of the record whose key is K, and it forms the basis for searching for the record. The addresses produced by hash functions generally appear to be random.
When two or more keys hash to the same address, they are called synonyms. If an address cannot accommodate all of its synonyms, collisions result. When collisions occur, some of the synonyms cannot be stored in the home address and must be stored elsewhere. Since searches for records begin with home addresses, searches for records that are not stored at their home addresses generally involve extra disk accesses. The term average search length is used to describe the average number of disk accesses that are required to retrieve a record. An average search length of 1 is ideal.
Much of the study of hashing deals with techniques for decreasing the number and effects of collisions. In this chapter we look at three general approaches to reducing the number of collisions:
Spreading out the records;
Using extra memory; and
Using buckets.
Spreading out the records involves choosing a hashing function that distributes the records at least randomly over the address space. A uniform distribution spreads out records evenly, resulting in no collisions. A random or nearly random distribution is much easier to achieve and is usually considered acceptable.
In this chapter a simple hashing algorithm is developed to demonstrate the kinds of operations that take place in a hashing algorithm. The three steps in the algorithm are:
1. Represent the key in numerical form;
2. Fold and add; and
3. Divide by the size of the address space, producing a valid address.
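These three steps can be sketched in C. The fragment below is a minimal illustration, not the text's own implementation; the function name hash_key is an assumption, while the two-character fold and the 19,937 limit follow the example used in this chapter.

    #include <string.h>

    int hash_key(const char *key, int max_address)
    {
        int sum = 0;
        size_t len = strlen(key);

        /* Steps 1 and 2: represent each pair of characters numerically
           and fold the pairs into a running sum, kept small with mod */
        for (size_t i = 0; i < len; i += 2) {
            int pair = 100 * key[i] + (i + 1 < len ? key[i + 1] : ' ');
            sum = (sum + pair) % 19937;
        }

        /* Step 3: divide by the size of the address space and use the
           remainder as the address */
        return sum % max_address;
    }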
When we examine several different types of hashing algorithms, we see that sometimes algorithms can be found that produce better-than-random distributions. Failing this, we suggest some algorithms that generally produce distributions which are approximately random.
The Poisson distribution provides a mathematical tool for examining in detail the effects of a random distribution. Poisson functions can be used to predict the numbers of addresses likely to be assigned 0, 1, 2, and so on, records, given the number of records to be hashed and the number of available addresses. This allows us to predict the number of collisions likely to occur when a file is hashed, the number of overflow records likely to occur, and sometimes the average search length.
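The Poisson function used for these estimates is p(x) = ((r/N)^x e^(-r/N)) / x!, where r is the number of records and N the number of available addresses. The short C sketch below is our own illustration of how such predictions can be computed; the function names are assumptions.

    #include <math.h>
    #include <stdio.h>

    /* proportion of addresses expected to have exactly x records
       assigned, when r records are hashed randomly into N addresses */
    double poisson(double r, double N, int x)
    {
        double lambda = r / N;          /* expected records per address */
        double p = exp(-lambda);
        for (int i = 1; i <= x; i++)
            p *= lambda / i;            /* builds lambda^x / x! incrementally */
        return p;
    }

    int main(void)
    {
        double r = 8000, N = 10000;     /* packing density r/N = 0.8 */
        for (int x = 0; x <= 3; x++)
            printf("p(%d) = %.4f (about %.0f addresses)\n",
                   x, poisson(r, N, x), N * poisson(r, N, x));
        return 0;
    }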
Using extra memory is another way to avoid collisions. When a fixed number of keys is hashed, the likelihood of synonyms occurring decreases as the number of possible addresses increases. Hence, a file organization that allocates many more addresses than are likely to be used has fewer synonyms than one that allocates few extra addresses. The term packing density describes the proportion of available address space that actually holds records. The Poisson function is used to determine how differences in packing density influence the percentage of records that are likely to be synonyms.
Using buckets is the third method for avoiding collisions. File addresses can hold one or more records, depending on how the file is organized by the file designer. The number of records that can be stored at a given address, called bucket size, determines the point at which records assigned to the address will overflow. The Poisson function can be used to explore the effects of variations in bucket sizes and packing densities. Large buckets, combined with a low packing density, can result in very small average search lengths.
Although we can reduce the number of collisions, we need some means to deal with collisions when they do occur. We examined one simple collision resolution technique in detail: progressive overflow. If an attempt to store a new record results in a collision, progressive overflow involves searching through the addresses that follow the record's home address in order until one is found to hold the new record. If a record is sought and is not found in its home address, successive addresses are searched until either the record is found or an empty address is encountered.
Progressive overflow is simple and sometimes works very well. Progressive overflow creates long search lengths, however, when the packing density is high and the bucket size is low. It also sometimes produces clusters of records, creating very long search lengths for new records whose home addresses are in the clusters.
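As an illustration of the probing logic, here is a simplified, in-memory sketch of progressive overflow with a bucket size of 1. It is not the text's code: the table size, the key format, and the home_address function are assumptions made for the example, and a file-based version would read and write buckets rather than array slots.

    #include <string.h>

    #define TABLE_SIZE 11

    static char table[TABLE_SIZE][16];   /* one key per address; "" = empty */

    /* hypothetical hash; any function returning 0..TABLE_SIZE-1 will do */
    static int home_address(const char *key)
    {
        unsigned h = 0;
        while (*key) h = h * 31 + (unsigned char)*key++;
        return (int)(h % TABLE_SIZE);
    }

    /* search forward from the home address until the key or an empty
       address is found; returns the slot, or -1 if the key is absent */
    int po_search(const char *key)
    {
        int home = home_address(key);
        for (int probes = 0; probes < TABLE_SIZE; probes++) {
            int slot = (home + probes) % TABLE_SIZE;
            if (strcmp(table[slot], key) == 0) return slot;
            if (table[slot][0] == '\0')        return -1;
        }
        return -1;                            /* whole table searched */
    }

    /* store the key in the first open address at or after its home */
    int po_insert(const char *key)
    {
        int home = home_address(key);
        for (int probes = 0; probes < TABLE_SIZE; probes++) {
            int slot = (home + probes) % TABLE_SIZE;
            if (table[slot][0] == '\0') {
                strncpy(table[slot], key, sizeof table[slot] - 1);
                return slot;
            }
        }
        return -1;                            /* table is full */
    }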
Three problems associated with record deletion in hashed files are:
1. The possibility that empty slots created by deletions will hinder later searches for overflow records;
2. The need to recover space made available when records are deleted; and
3. The deterioration of average search lengths caused by empty spaces keeping records further from home than they need be.
The first two problems can be solved by using tombstones to mark spaces that are empty (and can be reused for new records) but should not halt a search for a record. Solutions to the deterioration problem include local reorganization, complete file reorganization, and the choice of a collision-resolving algorithm that does not cause deterioration to occur.
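The tombstone idea can be added to the progressive-overflow sketch shown earlier in this summary. The fragment below reuses the table, TABLE_SIZE, and home_address declarations from that sketch; the "#" marker is an arbitrary choice for the illustration, not a convention from the text.

    #define TOMBSTONE "#"

    /* mark a slot as deleted without breaking the search path */
    void po_delete(int slot)
    {
        strcpy(table[slot], TOMBSTONE);
    }

    /* like po_search, but a tombstone does not stop the probe;
       only a truly empty slot does */
    int po_search_with_tombstones(const char *key)
    {
        int home = home_address(key);
        for (int probes = 0; probes < TABLE_SIZE; probes++) {
            int slot = (home + probes) % TABLE_SIZE;
            if (strcmp(table[slot], key) == 0) return slot;
            if (table[slot][0] == '\0')        return -1;
            /* a tombstone falls through: keep probing; an insertion
               could reuse the slot for a new record */
        }
        return -1;
    }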
Because overflow records have a major influence on performance, many different overflow handling techniques have been proposed. Four such techniques that are appropriate for file applications are discussed briefly:
1. Double hashing reduces local clustering but may place some overflow records so far from home that they require extra seeks. (A short sketch of the probe sequence appears at the end of this summary.)
2. Chained progressive overflow reduces search lengths by requiring that only synonyms be examined when a record is being sought. For chained overflow to work, every address that qualifies as a home address for some record in the file must hold a home record. Mechanisms for making sure that this occurs are discussed.
3. Chaining with a separate overflow area simplifies chaining substantially and has the advantage that the overflow area may be organized in ways more appropriate to handling overflow records. A danger of this approach is that it might lose locality.
4. Scatter tables combine indexing with hashing. This approach provides much more flexibility in organizing the data file. A disadvantage of using scatter tables is that, unless the index can be held in RAM, it requires one extra disk access for every search.
Since in many cases certain records are accessed more frequently than others (the 80/20 rule of thumb), it is often worthwhile to take access patterns into account. If we can identify those records that are most likely to be accessed, we can take measures to make sure that they are stored closer to home than less frequently accessed records, thus decreasing the effective average search length. One such measure is to load the most frequently accessed records before the others.
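The probe sequence for double hashing (technique 1 above) can be sketched as follows. This is our illustration of the general scheme, not code from the text; the names dh_probe and second_hash are assumptions, and n_addresses is assumed to be greater than 1.

    /* i = 0 gives the home address, i = 1 the first overflow probe, ...;
       c is the step size produced by the second hash function */
    unsigned dh_probe(unsigned home, unsigned c, unsigned i, unsigned n_addresses)
    {
        return (home + i * c) % n_addresses;
    }

    /* one common way to derive a step size that is never zero */
    unsigned second_hash(unsigned key_value, unsigned n_addresses)
    {
        return 1 + (key_value % (n_addresses - 1));
    }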
KEY TERMS
Average search length. We define average search length as the sum of the number of accesses required for each record in the file divided by the number of records in the file. This definition does not take into account the number of accesses required for unsuccessful searches, nor does it account for the fact that some records are likely to be accessed more often than others. See 80/20 rule of thumb.
Better-than-random. This term is applied to distributions in which the records are spread out more uniformly than they would be if the hash function distributed them randomly. Normally, the distribution produced by a hash function is a little bit better than random.
Bucket. An area of space on the file that is treated as a physical record for storage and retrieval purposes but that is capable of storing several logical records. By storing and retrieving logical records in buckets rather than individually, access times can, in many cases, be improved substantially.
Collision. Situation in which a record is hashed to an address that does not have sufficient room to store the record. When a collision occurs, some means has to be found to resolve the collision.
Double hashing. A collision resolution scheme in which collisions are handled by applying a second hash function to the key to produce a number c, which is added to the original address (modulo the number of addresses) as many times as necessary until either the desired record is located or an empty space is found. Double hashing helps avoid some of the clustering that occurs with progressive overflow.
The 80/20 rule of thumb. An assumption that a large percentage (e.g., 80%) of the accesses are performed on a small percentage (e.g., 20%) of the records in a file. When the 80/20 rule applies, the effective average search length is determined largely by the search lengths of the more active records, so attempts to make these search lengths short can result in substantially improved performance.
Fold and add. A method of hashing in which the encodings of fixed-sized parts of a key are extracted (e.g., every two bytes) and are added. The resulting sum can be used to produce an address.
Hashing. A technique for generating a unique home address for a given key. Hashing is used when rapid access to a key (or its corresponding record) is required. In this chapter applications of hashing involve direct access to records in a file, but hashing is also often used to access items in arrays in RAM. In indexing, for example, an index might be organized for hashing rather than for binary search if extremely fast searching of the index is desired.
Home address. The address generated by a hash function for a given key. If a record is stored at its home address, then the search length for the record is one because only one access is required to retrieve the record. A record not at its home address requires more than one access to retrieve or store.
Indexed hash. Instead of using the results of a hash to produce the address of a record, the hash can be used to identify a location in an index that in turn points to the address of the record. Although this approach requires one extra access for every search, it makes it possible to organize the actual data records in a way that facilitates other types of processing, such as sequential processing.
Mid-square method. A hashing method in which a representation of the key is squared and some digits from the middle of the result are used to produce the address.
Minimum hashing. Hashing scheme in which the number of addresses is exactly equal to the number of records. No storage space is wasted.
Open addressing. See progressive overflow.
Overflow. The situation that occurs when a record cannot be stored in its home address.
Packing density. The proportion of allocated file space that actually holds records. (Sometimes referred to as load factor.) If a file is half full, its packing density is 50%. The packing density and bucket size are the two most important measures in determining the likelihood of a collision occurring when searching for a record in a file.
Perfect hashing function. A hashing function that distributes records
uniformly, minimizing the number of collisions. Perfect hashing
functions are very desirable, but they are extremely difficult to find
for large sets of keys.
Poisson distribution. Distribution generated by the Poisson function, which can be used to approximate the distribution of records among addresses if the distribution is random. A particular Poisson distribution depends on the ratio of the number of records to the number of available addresses. A particular instance of the Poisson function, p(x), gives the proportion of addresses that will have x keys assigned to them. See better-than-random.
Prime division. Division of a number by a prime number and use of the remainder as an address. If the address size is taken to be a prime number p, a large number can be transformed into a valid address by dividing it by p. In hashing, division by primes is often preferred to division by nonprimes because primes tend to produce more random remainders.
Progressive overflow. An overflow handling technique in which collisions are resolved by storing a record in the next available address
after its home address. Progressive overflow is not the most efficient
overflow handling technique, but it is one of the simplest and is adequate for many applications.
Randomize. To produce a number (e.g., by hashing) that appears to be random.
Synonyms. Two or more different keys that hash to the same address. When each file address can hold only one record, synonyms always result in collisions. If buckets are used, several records whose keys are synonyms may be stored without collisions.
Tombstone. A special marker placed in the key field of a record to mark it as no longer valid. The use of tombstones solves two problems associated with the deletion of records: The freed space does not break a sequential search for a record, and the freed space is easily recognized as available and may be reclaimed for later additions.
Uniform. Term applied to a distribution in which records are spread out evenly among addresses. Algorithms that produce uniform distributions are better than randomizing algorithms in that they tend to avoid the numbers of collisions that would occur by chance from a randomizing algorithm.
EXERCISES
1. Use the function hash(KEY, MAXAD) described in the text to answer the following questions.
a. What is the value of hash("Browns", 101)?
b. Find two different words of more than four characters that are synonyms.
c. It is assumed in the text that the function hash() does not need to generate an integer greater than 19,937. This could present a problem if we have a file with addresses larger than 19,937. Suggest some ways to get around this problem.
2. In understanding hashing, it is important to understand the relationships between the size of the available memory, the number of keys to be hashed, the range of possible keys, and the nature of the keys. Let us give names to these quantities, as follows:
M = the number of memory spaces available (each capable of holding one record);
r = the number of records to be stored in the memory spaces;
n = the number of unique home addresses produced by hashing the record keys; and
K = a key, which may be any combination of exactly five uppercase characters.
Suppose h(K) is a hash function that generates addresses between 0 and M - 1.
a. How many unique keys are possible? (Hint: If K were one uppercase letter, rather than five, there would be 26 possible unique keys.)
b. How are n and r related?
c. How are n, r, and M related?
d. If the function h were a minimum perfect hashing function, how would n, r, and M be related?
3. The following table shows distributions of keys resulting from three different hash functions on a file with 6,000 records and 6,000 addresses.
          Function A    Function B    Function C
d(0)        0.71          0.25          0.40
d(1)        0.05          0.50          0.36
d(2)        0.05          0.25          0.15
d(3)        0.05          0.00          0.05
d(4)        0.05          0.00          0.02
d(5)        0.04          0.00          0.01
d(6)        0.05          0.00          0.01
d(7)        0.00          0.00          0.00
a. Which of the three functions (if any) generates a distribution of records that is approximately random?
b. Which generates a distribution that is nearest to uniform?
c. Which (if any) generates a distribution that is worse than random?
d. Which function should be chosen?
4. There is a surprising mathematical result called the birthday paradox that says that if there are more than 23 people in a room, then there is a better than 50-50 chance that two of them have the same birthday. How is the birthday paradox illustrative of a major problem associated with hashing?
5. Suppose that 10,000 addresses are allocated to hold 8,000 records in a randomly hashed file and that each address can hold one record. Compute the following values:
a. The packing density for the file;
b. The expected number of addresses with no records assigned to them by the hash function;
c. The expected number of addresses with one record assigned (no synonyms);
d. The expected number of addresses with one record plus one or more synonyms;
e. The expected number of overflow records; and
f. The expected percentage of overflow records.
6. Consider the file described in the preceding exercise. What is the expected number of overflow records if the 10,000 locations are reorganized as
a. 5,000 two-record buckets; and
b. 1,000 10-record buckets?
7. Make a table showing Poisson function values for r/N = 0.1, 0.5, 0.8, 1, 2, 5, and 10. Examine the table and discuss any features and patterns that provide useful information about hashing.
8. There is an overflow handling technique called count-key progressive overflow (Bradley, 1982) that works on block-addressable disks as follows. Instead of generating a relative record number from a key, the hash function generates an address consisting of three values: a cylinder, a track, and a block number. The corresponding three numbers constitute the home address of the record.
Since block-organized drives (see Chapter 3) can often scan a track to find a record with a given key, there is no need to load a block into memory to find out whether or not it contains a particular record. The I/O processor can direct the disk drive to search a track for the desired record. It can even direct the disk to search for an empty record slot if a record is not found in its home position, effectively implementing progressive overflow.
a. What is it about this technique that makes it superior to progressive overflow techniques that might be implemented on sector-organized drives?
b. The main disadvantage of this technique is that it can be used only with a bucket size of 1. Why is this the case, and why is it a disadvantage?
9. In discussing implementation issues, we suggest initializing the data file by creating real records that are marked empty before loading the file with actual data. There are some good reasons for doing this. However, there might be some reasons not to do it this way. For example, suppose you want a hash file with a very low packing density and cannot afford to have the unused space allocated. How might a file management system be designed to work with a very large logical file, but allocate space only for those blocks in the file that actually contain data?
10. This exercise (inspired by an example in Wiederhold, 1983, p. 136) concerns the problem of deterioration. A number of additions and deletions are to be made to a file. Tombstones are to be used where necessary to preserve search paths to overflow records.
a. Show what the file looks like after the following operations, and compute the average search length.
Operation      Home Address
Add Alan
Add Bates
Add Cole
Add Dean
Add Evans
Del Bates
Del Cole
Add Finch
Add Gates
Del Alan
Add Hart
How has the use of tombstones caused the file to deteriorate? What would be the effect of reloading the remaining items in the file in the order Dean, Evans, Finch, Gates, Hart?
b. What would be the effect of reloading the remaining items using two-pass loading?
11. Suppose you have a file in which 20% of the records account for 80% of the accesses, and that you want to store the file with a packing density of ____ and a bucket size of 5. When the file is loaded, you load the active 20% of the records first. After the active 20% of the records are loaded, and before the other records are loaded, what is the packing density of the partially filled file? Using this packing density, compute the percentage of the active 20% which would be overflow records. Comment on the results.
12. In our computations of average search lengths, we consider only the time it takes for successful searches. If our hashed file were to be used in such a way that searches were often made for items that are not in the file, it would be useful to have statistics on average search length for an unsuccessful search. If a large percentage of searches to a hashed file are unsuccessful, how do you expect this to affect overall performance if overflow is handled by
a. Progressive overflow; or
b. Chaining to a separate overflow area?
(See Knuth, 1973b, pp. 535-539 for a treatment of these differences.)
13. Although hashed files are not generally designed to support access to records in any sorted order, there may be times when batches of transactions need to be performed on a hashed data file. If the data file is sorted (rather than hashed), these transactions are normally carried out by some sort of cosequential process, which means that the transaction file also has to be sorted. If the data file is hashed, the transaction file might also be presorted, but on the basis of the home addresses of its records rather than some more "natural" criterion.
Suppose you have a file whose records are usually accessed directly, but that is periodically updated from a transaction file. List the factors you would have to consider in deciding between using an indexed sequential organization and hashing. (See Hanson, 1982, pp. 280-285, for a discussion of these issues.)
14. We assume throughout this chapter that a hashing program should be able to tell correctly whether a given key is located at a certain address. If this were not so, there would be times when we would assume that a record exists when in fact it does not, a seemingly disastrous result. But consider what Doug McIlroy did in 1978 when he was designing a spelling checker program. He found that by letting his program allow one out of every 4,000 misspelled words to sneak by as valid (and using a few other tricks), he could fit a 75,000-word spelling dictionary into 64 K of RAM, thereby improving performance enormously.
McIlroy was willing to tolerate one undetected misspelled word out of every 4,000 because he observed that drafts of papers rarely contained more than 20 errors, so one could expect at most one out of every 200 runs of the program to fail to detect a misspelled word. Can you think of some other cases where it might be reasonable to report that a key exists when in fact it does not?
Jon Bentley (1985) provides an excellent account of McIlroy's program, plus several insights on the process of solving problems of this nature. D. J. Dodds (1982) discusses this general approach to hashing, called check-hashing. Read Bentley's and Dodds's articles, and report on them to your class. Perhaps they will inspire you to write a spelling checker.
Programming Exercises
15. Implement and test a version of the function hash().
16. Create a hashed file with one record for every city in California. The key in each record is to be the name of the corresponding city. (For the purposes of this exercise, there need be no fields other than the key field.)
Begin by creating a sorted list of the names of all of the cities and towns in California. (If time or space is limited, just make a list of names starting with the letter 'S'.)
a. Examine the sorted list. What patterns do you notice that might affect your choice of a hash function?
b. Implement the function hash() in such a way that you can alter the number of characters that are folded. Assuming a packing density of 1, hash the entire file several times, each time folding a different number of characters, and producing the following statistics for each run:
The number of collisions; and
The number of addresses assigned 0, 1, 2, . . ., 10, and 10-or-more records.
Discuss the results of your experiment in terms of the effects of folding different numbers of characters, and how they compare to the results you might expect from a random distribution.
c. Implement and test one or more of the other hashing methods described in the text, or use a method of your own invention.
17. Using some set of keys, such as the names of California towns, do the following:
a. Write and test a program for loading the keys into three different hash files using bucket sizes of 1, 2, and 5, respectively, and a packing density of 0.8. Use progressive overflow for handling collisions.
b. Have your program maintain statistics on the average search length, the maximum search length, and the percentage of records that are overflow records.
c. Assuming a Poisson distribution, compare your results with the expected values for average search length and the percentage of records that are overflow records.
18. Repeat exercise 17, but use double hashing to handle overflow.
19. Repeat exercise 17, but handle overflow using chained overflow into a separate overflow area. Assume that the packing density is the ratio of number of keys to available home addresses.
20. Write a program that can perform insertions and deletions in the file created in the previous problem using a bucket size of 5. Have the program keep running statistics on average search length. (You might also implement a mechanism to indicate when search length has deteriorated to a point where the file should be reorganized.) Discuss in detail the issues you have to confront in deciding how to handle insertions and deletions.
FURTHER READINGS
There are a number of good surveys of hashing and issues related to hashing generally, including Knuth (1973b), Severance (1974), Maurer (1975), and Sorenson, Tremblay, and Deutscher (1978). Textbooks concerned with file design generally contain substantial amounts of material on hashing, and they often provide extensive references for further study. Each of the following can be useful:
Hanson (1982) is filled with analytical and experimental results exploring all of the issues we introduce, and many more, and also contains a good chapter on comparing different file organizations.
Bradley (1982) covers hashing generally but also includes much information on programming for hashed files using IBM PL/I.
Loomis (1983) also covers hashing generally, with additional emphasis on programming for hashed files in COBOL.
Teorey and Fry (1982) and Wiederhold (1983) will be useful to practitioners interested in analyses of trade-offs among the basic hashing methods.
One of the applications of hashing that has stimulated a great deal of interest recently is the development of spelling checkers. Because of special characteristics of spelling checkers, the types of hashing involved are quite different from the approaches we describe in this text. Papers by Bentley (1985) and Dodds (1982) provide entry into the literature on this topic. (See also exercise 14.)
11
Extendible Hashing
CHAPTER OBJECTIVES
Describe the problem solved by extendible hashing and related approaches.
Explain how extendible hashing works; show how it combines tries with conventional, static hashing.
Show how to implement extendible hashing, including deletion.
Review studies of extendible hashing performance.
Examine alternative approaches to the same problem, including dynamic hashing, linear hashing, and hashing schemes that control splitting by allowing for overflow buckets.
CHAPTER OUTLINE
11.1 Introduction
11.2 How Extendible Hashing Works
  11.2.1 Tries
  11.2.2 Turning the Trie into a Directory
  11.2.3 Splitting to Handle Overflow
11.3 Implementation
  11.3.1 Creating the Addresses
  11.3.2 Implementing the Top-level Operations
  11.3.3 Bucket and Directory Operations
  11.3.4 Implementation Summary
11.4 Deletion
  11.4.1 Overview of the Deletion Process
  11.4.2 A Procedure for Finding Buddy Buckets
  11.4.3 Collapsing the Directory
  11.4.4 Implementing the Deletion Operations
  11.4.5 Summary of the Deletion Operation
11.5 Extendible Hashing Performance
  11.5.1 Space Utilization for Buckets
  11.5.2 Space Utilization for the Directory
11.6 Alternative Approaches
  11.6.1 Dynamic Hashing
  11.6.2 Linear Hashing
  11.6.3 Approaches to Controlling Splitting
11.1 Introduction
In Chapter 8 we began with a historical review of the work that led up to B-trees. B-trees are such an effective solution to the problems that stimulated their development that it is easy to wonder if there is any more important thinking to be done about file structures. Work on extendible hashing during the late 1970s and early 1980s shows that the answer to that question is yes. This chapter tells the story of that work and describes some of the file structures that emerge from it.
B-trees do for secondary storage what AVL trees do for storage in memory: They provide a way of using tree structures that works well with dynamic data. By dynamic we mean that we add and delete records from the data set. The key feature of both AVL trees and B-trees is that they are self-adjusting structures that include mechanisms to maintain themselves. As we add and delete records, the tree structures use limited, local restructuring to ensure that the additions and deletions do not degrade performance beyond some predetermined level.
Robust, self-adjusting data and file structures are critically important to dynamic data storage and retrieval. Judging from the historical record, they are also hard to develop. It was not until 1963 that Adel'son-Vel'skii and Landis developed a self-adjusting structure for tree storage in memory, and it took another decade of work before computer scientists found, in B-trees, a tree structure that works well on secondary storage.
B-trees provide O(log_k N) access to the keys in a file. Hashing, when there is no overflow, provides access to a record with a single seek. But as a file grows larger, the need to look for records that overflow their buckets degrades performance. For dynamic files that undergo a lot of growth, the performance of a static hashing system such as we described in Chapter 10 is typically worse than the performance of a B-tree. So, by the late 1970s, after the initial burst of new research and design work revolving around B-trees was over, a number of researchers began to work on finding ways to modify hashing so that it, too, could be self-adjusting as files grow and shrink. As often happens when a number of groups are working on the same problem, several different, yet essentially similar, approaches emerged to extend hashing to dynamic files. We begin our discussion of the problem by looking closely at the approach called "extendible hashing" described by Fagin, Nievergelt, Pippenger, and Strong (1979). Later in this chapter we compare this approach with others that emerged over the last decade.
11.2 How Extendible Hashing Works
11.2.1 Tries
The key idea behind extendible hashing is to combine conventional hashing with another retrieval approach called the trie. (The word trie is pronounced so that it rhymes with sky.) Tries are also sometimes referred to as radix searching because the branching factor of the search tree is equal to the number of alternative symbols (the radix of the alphabet) that can occur in each position of the key. A few examples will illustrate how this works.
Suppose we want to build a trie that stores the keys able, abrahms, adams, anderson, andrews, and baird. A schematic form of the trie is shown in Fig. 11.1. As you can see, the searching proceeds letter by letter through the key. Since there are 26 symbols in the alphabet, the potential branching factor at every node of the search is 26. If we used the digits 0-9 as our search alphabet, rather than the letters a-z, the radix of the search would be reduced to 10. A search tree using digits might look like the one shown in Fig. 11.2.
FIGURE 11.1 Radix 26 trie that indexes names according to the letters of the alphabet.
FIGURE 11.2 Radix 10 trie that indexes numbers according to the digits they contain.
Notice that in searching a trie we sometimes use only a portion of the key. We use more of the key as we need more information to complete the search. This use-more-as-we-need-more capability is fundamental to the structure of extendible hashing.

11.2.2 Turning the Trie into a Directory

We use tries with a radix of two in our approach to extendible hashing: Search decisions are made on a bit-by-bit basis. Furthermore, since we are retrieving from secondary storage, we will not work in terms of individual keys, but in terms of buckets containing keys, just as in conventional hashing. Suppose we have bucket A containing keys that, when hashed, have hash addresses that begin with the bits 01. Bucket B contains keys with hash addresses beginning with 10, and bucket C contains keys with addresses that start with 11. Figure 11.3 shows a trie that allows us to retrieve these buckets.
How should we represent the trie? If we represent it as a tree structure, we are forced to do a number of comparisons as we descend the tree. Even worse, if the trie becomes so large that it, too, is stored on disk, we are faced once again with all of the problems associated with storing trees on disk. We might as well go back to B-trees and forget about extendible hashing.
So, rather than representing the trie as a tree, we flatten it into an array of contiguous records, forming a directory of hash addresses and pointers to the corresponding buckets. The first step in turning a tree into an array involves extending it so it is a complete binary tree with all of its leaves at the same level as shown in Fig. 11.4(a). Even though the initial 0 is enough to select bucket A, the new form of the tree also uses the second address bit so both alternatives lead to the same bucket. Once we have extended the tree this way, we can collapse it into the directory structure shown in Fig. 11.4(b). Now we have a structure that provides the kind of direct access associated with hashing: Given an address beginning with the bits 10, the 10₂th directory entry gives us a pointer to the associated bucket.
FIGURE 11.3 Radix 2 trie that provides an index to buckets.
FIGURE 11.4 The trie from Fig. 11.3 transformed first into a complete binary tree, and then flattened into a directory to the buckets.
11.2.3 Splitting to Handle Overflow
A key issue in any hashing system is what happens when a bucket overflows. The goal in an extendible hashing system is to find a way to increase the address space in response to overflow, rather than responding by creating long sequences of overflow records and buckets that have to be searched linearly.
Suppose we insert records that cause bucket A in Fig. 11.4(b) to overflow. In this case the solution is simple: Since addresses beginning with 00 and 01 are mixed together in bucket A, we can split bucket A by putting all the 01 addresses in a new bucket D, while keeping only the 00 addresses in A. Put another way, we already have two bits of address information, but are throwing one away as we access bucket A. So, now that bucket A is overflowing, we must use the full two bits to divide the addresses between two buckets. We do not need to extend the address space; we simply make full use of the address information that we already have. Figure 11.5 shows the directory and buckets after the split.
Let's consider a more complex case.
Starting once again with the directory and buckets in Fig. 11.4(b), suppose that bucket B overflows. How do we split bucket B and where do we attach the new bucket after the split? Unlike our previous example, we do not have additional, unused bits of address space that we can press into duty as we split the bucket. We now need to use three bits of the hash address in order to divide up the records that hash to bucket B. The trie illustrated in Fig. 11.6(a) makes the distinctions required to complete the split. Figure 11.6(b) shows what this trie looks like once it is extended into a completely full binary tree with all leaves at the same level, and Fig. 11.6(c) shows the collapsed, directory form of the trie.
FIGURE 11.5 The directory from Fig. 11.4(b) after bucket A overflows.
FIGURE 11.6 The results of an overflow of bucket B in Fig. 11.4(b), represented first as a trie, then as a complete binary tree, and finally as a directory.
By building on the trie's ability to extend the amount of information used in a search, we have doubled the size of our address space (and, therefore, of our directory), extending it from 2² to 2³ cells. This ability to grow (or shrink) the address space gracefully is what extendible hashing is all about.
We have been concentrating on the contribution that tries make to extendible hashing; one might well ask where the actual hashing comes into play. Why not just use the tries on the bits in the key itself, splitting buckets and extending the address space as necessary? The answer to this question grows out of hashing's most fundamental characteristic: A good hash function produces a nearly uniform distribution of keys across an address space. Notice that the trie shown in Fig. 11.6 is poorly balanced, resulting in a directory that is twice as big as it actually needs to be. If we had an uneven distribution of addresses that placed even more records in buckets B and D without using other parts of the address space, the situation would get even worse. By using a good hash function to create addresses with a nearly uniform distribution, we avoid this problem.
11.3 Implementation
11.3.1 Creating the Addresses
Now that we have a high-level overview of how extendible hashing works, let's look at the pseudocode in Fig. 11.7 that describes the algorithms in more detail. The place to start is with the functions that create the addresses, since the notion of an extendible address underlies all other extendible hashing operations.
The hash function itself is a simple variation on the fold-and-add hashing algorithm we used in Chapter 10. The only difference is that we do not conclude the operation by returning the remainder of the folded address divided by the address space. We don't need to do that, since in extendible hashing we don't have a fixed address space, instead using as much of the address as we need. The division that we do perform in this function, when we take the sum of the folded character values modulo 19,937, is to make sure that the character summation stays within the range of a signed 16-bit integer. For machines that use 32-bit integers, we could divide by a larger number and create an even larger initial address.
Since extendible hashing uses more bits of the hashed address as they are needed to distinguish between buckets, we need a make_address function that extracts just a portion of the full hashed address. We also use the make_address function to reverse the order of the bits in the hashed address, making the lowest-order bit of the hash address the highest-order bit of the value used in extendible hashing.
FUNCTION hash(KEY)
    set SUM to 0
    set J to 1
    set LEN to the length of the key
    if LEN is odd, concatenate a blank to the key to make the length even
    while (J < LEN)
        SUM := (SUM + 100*KEY[J] + KEY[J+1]) mod 19937
        increment J by 2
    endwhile
    return SUM
end FUNCTION

FIGURE 11.7 Function hash(KEY) returns an integer hash value for KEY for a 15-bit address space.
To see why this reversal of bit order is desirable, look at Fig. 11.8, which is a set of keys and binary hash addresses produced by our hash function. Even a quick scan of these addresses reveals that the distribution of the least significant bits of these integer values tends to have more variation than the high-order bits. This is because many of the addresses do not make use of the upper reaches of our address space; the high-order bits often turn out to be zero.
By reversing the bit order, working from right to left, we take advantage of the greater variability of low-order bit values. For example, given a four-bit address space, we want to avoid having the addresses of bill, lee, and pauline turn out to be 0000, 0000, and 0000. If we work from right to left, starting with the low-order bit in each address, we get 0011 for bill, 0001 for lee, and 1010 for pauline, which is a much more useful result.
FIGURE 11.8 Output from the hash function for a number of keys.
The make_address function, described in Fig. 11.9, accomplishes this bit extraction and reversal. The DEPTH argument tells the function the number of address bits to return.
FUNCTION make_address(KEY, DEPTH)
    set RETVAL to 0              /* accumulate reversed bit string       */
    set MASK to 1                /* 0...001 mask to extract low bit from N */
    HASH_VAL := hash(KEY)

    /* Summary of loop logic:
    ** Shift RETVAL one position left, to make room for a new low bit.
    ** Then move HASH_VAL's low bit to RETVAL's low bit.
    ** Then shift HASH_VAL in the opposite direction, to the
    ** right, so we can look at the next lowest bit.
    ** Keep doing this until we have moved as many bits
    ** as we need from HASH_VAL to RETVAL in reverse order.
    */
    for J := 1 to DEPTH
        RETVAL   := RETVAL left shifted one position
        LOWBIT   := HASH_VAL bitwise ANDed with MASK
        RETVAL   := RETVAL bitwise ORed with LOWBIT
        HASH_VAL := HASH_VAL right shifted one position
    next J

    return RETVAL
end FUNCTION

FIGURE 11.9 Function make_address(KEY, DEPTH) gets a hashed address, reverses the order of the bits, and returns an address of DEPTH bits.
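For readers who prefer C to pseudocode, one possible rendering of the bit extraction and reversal is shown below. It is an illustration of ours, not the text's code, and it assumes the fold-and-add hash value has already been computed (for example, by a C version of the function in Fig. 11.7).

    /* return an address made of the DEPTH low-order bits of hash_val,
       in reverse order, so the most variable bits become the most
       significant bits of the extendible hashing address */
    int make_address_bits(int hash_val, int depth)
    {
        int retval = 0;
        for (int j = 0; j < depth; j++) {
            int lowbit = hash_val & 1;          /* grab the current low bit     */
            retval = (retval << 1) | lowbit;    /* append it as the new low bit */
            hash_val >>= 1;                     /* move to the next lowest bit  */
        }
        return retval;
    }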
FIGURE 11.10 BUCKET and DIRECTORY_CELL record structures.

Record Type: BUCKET
    DEPTH    integer count of the number of bits used
             "in common" by the keys in this bucket
    COUNT    integer count of the number of keys in the bucket
    KEY[1 .. MAX_BUCKET_SIZE]    array of strings to hold keys

Record Type: DIRECTORY_CELL
    BUCKET_REF    relative record number or other reference
                  to a specific BUCKET record on disk
11.3.2 Implementing the Top-level Operations
Our extendible hashing scheme consists of a set of buckets and a directory that references them. Each bucket is a record that contains the information shown in Fig. 11.10. These bucket records are stored in a file; we retrieve them as necessary.
Each cell in the directory consists of a reference to a BUCKET record. Because we use direct access to find directory records, we implement the directory as an array of these cells in RAM. The address values returned by make_address are treated as subscripts for this array, ranging from 0 to one less than the directory size.
From the high-level view of the driver function, use of the system consists of an initialization step that reads the directory into RAM from disk, a set of calls from the user to find or add keys, and a final, closing step that writes the possibly modified directory back to disk. Pseudocode for the driver, initialization, and close functions are shown in Fig. 11.11.
FIGURE 11.11 The driver, ex_init, and ex_close functions provide a high-level view of the extendible hashing program operation.

FUNCTION driver()
    ex_init()
    call op_add() and op_find() as directed by the user
    ex_close()
end FUNCTION

FUNCTION ex_init()
    open (or create, as necessary) the directory and bucket files
    if the hash file already exists
        read directory records into the array DIRECTORY
        DIR_DEPTH := log2(size of DIRECTORY)
    else
        allocate an initial directory consisting of a single cell
        set DIR_DEPTH to 0
        allocate an initial bucket and assign its address
            to the directory cell
    endif
end FUNCTION

FUNCTION ex_close()
    write the directory back to disk
    close files
end FUNCTION
Note that the DIR_DEPTH is directly related to the size of the directory, since 2^DIR_DEPTH = the number of cells in the DIRECTORY. If we are starting a new hash file, the DIR_DEPTH is zero, which means that we are using no bits to distinguish between addresses; all the keys go into the same bucket, no matter what their address. We get the address of the initial, everything-goes-here bucket and assign it to the single directory cell.
Given a way to open and close the file, we are ready to add records. The op_add and op_find functions are outlined in Fig. 11.12.
The op_find function turns the key into a directory address. Given this address, we do a direct lookup of the bucket location, retrieve the bucket and assign it to FOUND_BUCKET, and then search for the key. If FOUND_BUCKET contains the key, we return success; otherwise, we return FAILURE.
The op_add function begins by calling op_find. If the key already exists in the hash file, op_add returns immediately; if the key is not found, op_add calls bk_add_key to insert it.
11.3.3 Bucket and Directory Operations
When op_add calls bk_add_key it passes a bucket and a key. If the bucket is not full, bk_add_key (Fig. 11.13) simply inserts it into the bucket. A full bucket, however, requires a split, which is where things start to get interesting.
What we do when we split a bucket depends on the relationship between the number of address bits used in the bucket and the number used in the directory as a whole. The two numbers are often not the same. To see this, look at Fig. 11.6(a). The directory uses three bits to define its address space (8 cells). The keys in bucket A are distinguished from keys in other buckets by having an initial 0 bit. All the other bits in the hashed key values in bucket A can be any value; it is only the first bit that matters. Bucket A is using only one bit.
The keys in bucket C all share a common first two bits; they all begin with 11. The keys in buckets B and D use three bits to establish their identities and, therefore, their bucket locations. If you look at Fig. 11.6(c), you can see how using more or fewer address bits changes the relationship between the directory and the bucket. Buckets that do not use as many address bits as the directory have more than one directory cell pointing to them.
If we split one of the buckets that is using fewer address bits than the directory, and which therefore is referenced from more than one directory cell, we can use half of the directory cells to point to the new bucket after the split.
FUNCTION: op_add (KEY)
    /* if we find the key, we do not add a second copy */
    if op_find(KEY, FOUND_BUCKET)
        return FAILURE

    /* otherwise, add the key to the bucket that op_find retrieved */
    bk_add_key(FOUND_BUCKET, KEY)
    return SUCCESS
end FUNCTION

FUNCTION: op_find (KEY, FOUND_BUCKET)
    /* uses DIR_DEPTH, the number of bits used to create
    ** the addresses in the directory
    */

    /* create an address based on directory depth */
    ADDRESS := make_address(KEY, DIR_DEPTH)

    /* get the bucket that will contain the key, if the
    ** key exists in the file
    */
    FOUND_BUCKET := bucket referenced by
                    DIRECTORY[ADDRESS].BUCKET_REF

    if FOUND_BUCKET contains the KEY
        return SUCCESS
    else
        return FAILURE
end FUNCTION

FIGURE 11.12 op_add and op_find functions.
FUNCTION bk_add_key(BUCKET, KEY)
    if (BUCKET.COUNT < MAX_BUCKET_SIZE)
        add the key
    else
        bk_split(BUCKET)
        op_add(KEY)
    endif
end FUNCTION

FIGURE 11.13 bk_add_key function adds the key to the existing bucket if there is room. If the bucket is full, it splits it and then adds the key.
FUNCTION bk_split(BUCKET)
    /* if the depth used for the BUCKET addresses is
    ** already the same as the address depth in the
    ** directory, we must first split the directory
    ** to double the directory address space
    */
    if (BUCKET.DEPTH == DIR_DEPTH)
        dir_double()

    allocate NEW_BUCKET

    /* find the range of directory entries for the new
    ** bucket, given the depth and keys in the old bucket
    */
    find_new_range(BUCKET, NEW_START, NEW_END)

    /* insert the new bucket over this range */
    dir_ins_bucket(NEW_BUCKET, NEW_START, NEW_END)

    /* change the address depths in the buckets to
    ** reflect the split
    */
    increment BUCKET.DEPTH
    NEW_BUCKET.DEPTH := BUCKET.DEPTH

    redistribute the keys between the two buckets
end FUNCTION

FIGURE 11.14 bk_split function divides keys between an existing bucket and a new bucket. If necessary, it doubles the size of the directory to accommodate the new bucket.
Suppose, for example, that we split bucket A in Fig. 11.6(c). Before the split only one bit, the initial zero, is used to identify keys that belong in bucket A. After the split, we use two bits. Keys starting with 00 (directory cells 000 and 001) go in bucket A; keys starting with 01 (directory cells 010 and 011) go in the new bucket. We do not have to expand the directory because the directory already has the capacity to keep track of the additional address information required for the split.
If, on the other hand, we split a bucket that has the same address depth as the directory, such as buckets B or D in Fig. 11.6(c), then there are no additional directory entries that we can use to reference the new bucket. Before we can split the bucket, we have to double the size of the directory, creating a new directory entry for every one that is currently there, so we can accommodate the new address information.
Figure 11.14 shows the bucket-splitting logic in pseudocode. First we compare the number of bits used for the directory with the number used for the bucket to determine whether we need to double the directory. If the depths are the same, we double the directory before proceeding.
Next we get the new bucket that we will use for the split. Then we find the range of directory addresses that we need for the new bucket. For instance, when we split bucket A in Fig. 11.6(c), the range of directory addresses for the new bucket is from 010 to 011. We attach the new bucket to the directory over this range, adjust the bucket address depth information in both buckets to reflect the use of an additional address bit, and then redistribute the keys from the original bucket across the two buckets.
The most complicated operation supporting the bk_split function is find_new_range, which finds the range of directory cells that should point to the new bucket instead of the old one after the split. It is described in pseudocode in Fig. 11.15.

FIGURE 11.15 find_new_range function finds the start and end directory addresses for the new bucket by using information from the old bucket.

FUNCTION find_new_range(OLD_BUCKET, NEW_START, NEW_END)
    /* find the shared address for the OLD bucket */
    SHARED_ADDRESS := make_address(any KEY from OLD_BUCKET,
                                   OLD_BUCKET.DEPTH)

    /* shift everything one bit to the left, then put a 1
    ** in the lowest bit. This is the shared address
    ** for the new bucket. Fill the new shared address on
    ** the right with zero bits until we have reached the
    ** directory depth. This is the start of the range.
    ** Fill it with 1 bits -- this is the range's end.
    */
    NEW_SHARED := SHARED_ADDRESS left shifted 1 place
    NEW_SHARED := NEW_SHARED bitwise ORed with 1

    BITS_TO_FILL := DIR_DEPTH - (OLD_BUCKET.DEPTH + 1)
    set NEW_START and NEW_END to the NEW_SHARED value

    for J := 1 to BITS_TO_FILL
        NEW_START := NEW_START left shifted 1 place
        NEW_END   := NEW_END left shifted 1 place
        NEW_END   := NEW_END bitwise ORed with 1
    next J
end FUNCTION
To see how it works, return, once again, to Fig. 11.6(c). Assume that we need to split bucket A, putting some of the keys into a new bucket E. Before the split, any address beginning with a 0 leads to A. In other words, the shared address of the keys in bucket A is 0. When we split bucket A we add another address bit to the path leading to the keys; addresses leading to bucket A now share an initial 00 while those leading to E share an 01. So, the range of addresses for the new bucket is all directory addresses beginning with 01.
FIGURE 11.16 Directory operations to support bk_split: the dir_double and dir_ins_bucket functions.

FUNCTION dir_double()
    /* calculate the current size and new size */
    CURRENT_SIZE := 2^DIR_DEPTH
    NEW_SIZE := 2 * CURRENT_SIZE

    allocate memory for the new, larger directory --
        temporarily call it NEW_DIR

    /* Transfer the bucket addresses from the old
    ** directory to the new one. Each cell in the
    ** original is copied into two cells of the
    ** expanded directory
    */
    for I := 0 to CURRENT_SIZE - 1
        NEW_DIR[2*I].BUCKET_REF   := DIRECTORY[I].BUCKET_REF
        NEW_DIR[2*I+1].BUCKET_REF := DIRECTORY[I].BUCKET_REF
    next I

    free memory for old DIRECTORY
    rename NEW_DIR to DIRECTORY
    increment DIR_DEPTH
end FUNCTION

FUNCTION dir_ins_bucket(BUCKET_ADDRESS, START, LAST)
    for J := START to LAST
        DIRECTORY[J].BUCKET_REF := BUCKET_ADDRESS
    next J
end FUNCTION
Since the directory addresses use three bits, the new bucket is attached to the directory cells starting with 010 and ending with 011.
Suppose that the directory used a five-bit address instead of a three-bit address. Then the range for the new bucket would start with 01000 and would end with 01111. This range covers all five-bit addresses that share 01 as the first two bits. The logic for finding the range of directory addresses for the new bucket, then, starts by finding the shared address bits for the new bucket. It then fills the address out with zeroes until we have the number of bits used in the directory. This is the start of the range. Filling the address out with ones produces the end of the range.
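A C sketch of the same range computation may make the bit manipulation easier to follow. It mirrors the logic of Fig. 11.15 but is our own illustration; the parameter names are assumptions. Here old_shared holds the old bucket's shared address bits (old_depth of them) and dir_depth is the number of bits used by the directory.

    void find_new_range_bits(int old_shared, int old_depth, int dir_depth,
                             int *new_start, int *new_end)
    {
        /* the new bucket's shared address: the old bits followed by a 1 */
        int new_shared = (old_shared << 1) | 1;

        /* pad with 0s to get the start of the range, with 1s for the end */
        int bits_to_fill = dir_depth - (old_depth + 1);
        int start = new_shared;
        int end   = new_shared;
        for (int j = 0; j < bits_to_fill; j++) {
            start = start << 1;            /* append a 0 */
            end   = (end << 1) | 1;        /* append a 1 */
        }
        *new_start = start;
        *new_end   = end;
    }

    /* Example from the text: splitting bucket A (shared address 0, one bit)
       in a three-bit directory yields the range 010 (2) through 011 (3). */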
The directory operations required to support bk_split are easy to implement. They are outlined in pseudocode in Fig. 11.16. The first, dir_double, simply calculates the new directory size, allocates the required memory, and writes the information from each old directory cell into two successive cells in the new directory. It finishes by freeing the old space associated with the name DIRECTORY, renaming the new space as the DIRECTORY, and increasing the DIR_DEPTH value to reflect the fact that the directory is now using an additional address bit.
The dir_ins_bucket function, used to attach a bucket address across a range of directory cells, is simply a loop that works through the cells to make the change.
11.3.4 Implementation Summary
Now that we have assembled all of the pieces necessary to add records to an extendible hashing system, let's see how the pieces work together.
The op_add function manages record addition. If the key already exists, op_add returns immediately. If the key does not exist, op_add calls bk_add_key, passing it the bucket into which the key is to be added. If bk_add_key finds that there is still room in the bucket, it adds the key and the operation is complete. If the bucket is full, bk_add_key calls bk_split to handle the task of splitting the bucket.
The bk_split function starts by determining whether the directory is large enough to accommodate the new bucket. If the directory needs to be larger, bk_split calls a function that doubles the directory size. The function then allocates a new bucket, attaches it to the appropriate directory cells, and divides the keys between the two buckets.
When bk_add_key regains control after bk_split has allocated a new bucket, it calls op_add to try to place the key into the new, revised directory structure. The op_add function, of course, calls bk_add_key again, recursively. This cycle continues until there is a bucket that can accommodate the new key.
11.4 Deletion
11.4.1 Overview of the Deletion Process
If extendible hashing is to be a truly dynamic system, like B-trees or AVL trees, it must be able to shrink files gracefully as well as grow them. When we delete a key, we need a way to see if we can decrease the size of the file system by combining buckets and, if possible, decreasing the size of the directory.
As with any dynamic system, the important question during deletion concerns the definition of the triggering condition: When do we combine buckets? This question, in turn, leads us to ask, Which buckets can be combined? For B-trees the answer involves determining whether buckets are siblings and whether they are at the leaf level of the tree. In extendible hashing we use a similar concept: buckets that are buddy buckets.
Look again at the trie in Fig. 11.6(b). Which buckets could be combined? Trying to combine anything with bucket A would mean collapsing everything else in the trie first. Similarly, there is no single bucket that could be combined with bucket C. But buckets B and D are in the same configuration as buckets that have just split. They are ready to be combined; they are buddy buckets. We will take a closer look at the question of finding buddy buckets as we consider implementation of the deletion procedure; for now let's assume that we combine buckets B and D.
After combining buckets, we examine the directory to see if we can make changes there. Looking at the directory form of the trie in Fig. 11.6(c), we see that once we combine buckets B and D, directory entries 100 and 101 both point to the same bucket. In fact, each of the buckets has at least a pair of directory entries pointing to it. In other words, none of the buckets requires the depth of address information that is currently available in the directory. That means that we can shrink the directory and reduce the address space to half its size.
Reducing the size of the address space restores the directory and bucket structure to the arrangement shown in Fig. 11.4, before the additions and splits that produced the structure in Fig. 11.6(c). Reduction consists of collapsing each adjacent pair of directory cells into a single cell. This is easy, since both cells in each pair point to the same bucket. Note that this is nothing more than a reversal of the directory splitting procedure that we use when we need to add new directory cells.

11.4.2 A Procedure for Finding Buddy Buckets

Given this overview of how deletion works, we begin by focusing on buddy buckets. Given a bucket, how do we find its buddy? Figure 11.17 describes the procedure in pseudocode.
FUNCTION bk_find_buddy(BUCKET)
    /* NOTE: this function uses DIR_DEPTH -- we
    ** assume this value is available globally or
    ** through a function call
    */

    /* there is no buddy if the DIR_DEPTH is 0
    ** (there is just a single bucket)
    */
    if (DIR_DEPTH == 0)
        return NO_BUDDY

    /* unless the bucket has the same depth as the
    ** directory, there is no single bucket to pair with
    */
    if (BUCKET.DEPTH < DIR_DEPTH)
        return NO_BUDDY

    /* find the shared address for this bucket */
    SHARED_ADDRESS := make_address(any KEY from BUCKET, BUCKET.DEPTH)

    /* flip the last bit -- that is the address of
    ** the buddy bucket
    */
    BUDDY_ADDRESS := SHARED_ADDRESS exclusive ORed with 1

    return BUDDY_BUCKET found at BUDDY_ADDRESS
end FUNCTION
FIGURE 11.17 The bk_find_buddy function returns a buddy bucket or the special signal NO_BUDDY if none is found.
The procedure works by checking to see whether it is possible for there to be a buddy bucket. Clearly, if the directory depth is zero, meaning that there is only a single bucket, there cannot be a buddy.

The next test compares the number of bits used by the bucket with the number of bits used in the directory address space. A pair of buddy buckets is a set of buckets that are immediate descendants of the same node in the trie. They are, in fact, pairwise siblings resulting from a split. Going back to Fig. 11.6(b), we see that asking whether the bucket uses all the address bits used in the directory is another way of asking whether the bucket is at the lowest level of the trie. It is only when a bucket is at the outer edge of the trie that it can have a single parent and a single buddy.
Once we determine that there is a buddy bucket, we need to find its address. First we find the address used to find the bucket we have at hand; this is the shared address of the keys in the bucket. Since we know that the buddy bucket is the other bucket that was formed from a split, we know that the buddy has the same address in all regards except for the last bit. Once again, this relationship is illustrated by buckets B and D in Fig. 11.6(b). So, to get the buddy address, we flip the last bit. We return the buddy bucket.
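In code, the buddy search reduces to the two guard tests and one exclusive OR. The small C sketch below is our own illustration of that logic (the directory addresses 100 and 101 for buckets B and D come from Fig. 11.6).

    #include <stdio.h>

    /* returns the buddy's directory address, or -1 when no single buddy exists */
    int buddy_address(int shared_address, int bucket_depth, int dir_depth)
    {
        if (dir_depth == 0 || bucket_depth < dir_depth)
            return -1;                     /* no buddy: same tests as bk_find_buddy */
        return shared_address ^ 1;         /* flip the last bit */
    }

    int main(void)
    {
        /* buckets B and D of Fig. 11.6 share all but their last address bit */
        printf("buddy of 100 is %d\n", buddy_address(4, 3, 3));   /* 100 -> 101, i.e., 5 */
        printf("buddy of 101 is %d\n", buddy_address(5, 3, 3));   /* 101 -> 100, i.e., 4 */
        return 0;
    }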
11.4.3 Collapsing the Directory
The other important support function used to implement deletion is the function that handles collapsing the directory. Downsizing the directory is one of the principal potential benefits of deleting records. In our implementation we use one function to check to see whether downsizing is possible and, if it is, to actually collapse the directory. Figure 11.18 shows pseudocode for this function, called dir_try_collapse().

The function begins by making sure that we are not at the lower limit of directory size. By treating the special case of a directory with a single cell here, at the start of the function, we simplify subsequent processing: With the exception of this case, all directory sizes are evenly divisible by two.

The actual test for the COLLAPSE_CONDITION consists of examining each pair of directory entries. We assume at the outset that we can collapse the directory and then look for a pair of directory cells that do not both point to the same bucket. As soon as we find such a pair, we know that we cannot collapse the directory. We set the value of the COLLAPSE_CONDITION to false and break out of the test loop. If we get all the way through the directory without encountering such a pair, then we can collapse the directory.
The actual collapsing operation consists of allocating space for a new directory that is half the size of the original and then copying the bucket references shared by each cell pair to a single cell in the new directory.
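The same test-then-shrink logic is compact when the directory is an in-memory array. The following C sketch is our own simplified illustration of dir_try_collapse: the directory is an array of integer bucket references indexed from 0, and it is halved in place rather than copied into newly allocated space.

    #include <stdio.h>

    /* Try to halve a directory of 2^(*dir_depth) bucket references.
     * Returns 1 and shrinks the array in place if every adjacent pair
     * of cells references the same bucket; returns 0 otherwise.      */
    int dir_try_collapse(int *directory, int *dir_depth)
    {
        if (*dir_depth == 0)
            return 0;                              /* already at minimum size */

        int size = 1 << *dir_depth;
        for (int j = 0; j < size; j += 2)
            if (directory[j] != directory[j + 1])
                return 0;                          /* found a pair that differs */

        for (int j = 0; j < size / 2; j++)         /* keep one reference per pair */
            directory[j] = directory[2 * j];
        (*dir_depth)--;
        return 1;
    }

    int main(void)
    {
        int depth = 2;
        int dir[4] = { 7, 7, 9, 9 };               /* two buckets, each referenced twice */
        while (dir_try_collapse(dir, &depth))
            printf("collapsed to depth %d\n", depth);
        return 0;
    }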
11.4.4 Implementing the Deletion Operations
Now that we have an approach to the two critical support operations for deletion, finding buddy buckets and collapsing the directory, we are ready to construct the higher levels of the deletion operation.
FUNCTION dir_try_collapse()
    /* the directory is already at minimum size when
    ** the depth is zero
    */
    if (DIR_DEPTH == 0)
        return FAILURE

    /* check each pair of directory cells to see whether
    ** each member references the same bucket -- if so,
    ** we can collapse the directory.
    */
    DIR_SIZE := 2^DIR_DEPTH
    COLLAPSE_CONDITION := TRUE      /* assume the best, then try to disprove it */
    for J := 1 to DIR_SIZE by 2
        if (DIRECTORY[J].BUCKET_REF != DIRECTORY[J+1].BUCKET_REF)
            COLLAPSE_CONDITION := FALSE
            break out of the loop
        endif
    next

    /* if we have a collapse condition, create a new
    ** directory that is half the size of the original,
    ** and transfer the bucket references
    */
    if (COLLAPSE_CONDITION)
        NEW_DIR_SIZE := DIR_SIZE / 2
        allocate memory for NEW_DIR
        for J := 1 to NEW_DIR_SIZE
            NEW_DIR[J].BUCKET_REF := DIRECTORY[2*J].BUCKET_REF
        next
        free memory for old DIRECTORY
        rename NEW_DIR to DIRECTORY
        decrement DIR_DEPTH
    endif

    return COLLAPSE_CONDITION
end FUNCTION
FIGURE 11.18 The dir_try_collapse function first tests to see whether the directory
can be collapsed. If the test succeeds, the directory is collapsed.
FUNCTION op_del(KEY)
    if (op_find(KEY, FOUND_BUCKET) == FAILURE)
        return FAILURE

    /* found it -- now delete it */
    return (bk_del_key(FOUND_BUCKET, KEY))
end FUNCTION

FUNCTION bk_del_key(BUCKET, KEY)
    set KEY_REMOVED to FALSE
    look for KEY in BUCKET -- if found
        remove the KEY
        set KEY_REMOVED to TRUE
        decrement BUCKET.COUNT

    /* if a key was removed, see whether we can combine
    ** this bucket with its buddy bucket
    */
    if (KEY_REMOVED)
        bk_try_combine(BUCKET)
        return SUCCESS
    else
        return FAILURE
    endif
end FUNCTION
FIGURE 11.19 The op_del and bk_del_key functions.
The highest-level deletion operation, op_del, is very simple. We first try to find the key to be deleted. If we cannot find it, we return failure; if we do find it, we call a service function to remove the key from the bucket. We return the value reported back from the service function. Figure 11.19 describes op_del and the service function, bk_del_key, in pseudocode.

The bk_del_key function does its work in two steps. The first step consists of finding the key and physically removing it from the bucket. The second step, which takes place only if a key is removed, consists of calling bk_try_combine to see if deleting the key has decreased the size of the bucket enough to allow us to combine it with its buddy.

Figure 11.20 shows the pseudocode for bk_try_combine and its service function, bk_combine. Note that when we combine buckets, we reduce the address depth associated with the bucket: Combining buckets means that we use one less address bit to differentiate keys.

After combining the buckets, we call dir_try_collapse to see if the decrease in the number of buckets enables us to decrease the size of the
directory. If we do, in fact, collapse the directory (dir_try_collapse succeeds), bk_try_combine calls itself recursively. Collapsing the directory may have created a new buddy for the BUCKET; it may be possible to do even more combining and collapsing. Typically, this recursive combining and collapsing happens only when the directory has a number of empty buckets that are awaiting changes in the directory structure that finally produce a buddy to combine with.
FIGURE 11.20 The bk_try_combine function tests to see whether a bucket can be combined with its buddy. If the test succeeds, bk_try_combine calls bk_combine to do the actual combination.
FUNCTION bk_try_combine(BUCKET)
    /* if there is no buddy, return right away */
    BUDDY := bk_find_buddy(BUCKET)
    if (BUDDY == NO_BUDDY)
        return

    /* see if we can combine buckets */
    if (BUDDY.COUNT + BUCKET.COUNT <= MAX_BUCKET_SIZE)
        bk_combine(BUCKET, BUDDY)
        free memory used by the BUDDY bucket
        reassign the DIRECTORY value for the BUDDY so
            that it now references the BUCKET

        /* see if we can collapse the directory -- if so,
        ** there may be a new buddy to combine with
        */
        if (dir_try_collapse())
            bk_try_combine(BUCKET)
    endif
end FUNCTION

FUNCTION bk_combine(BUCKET, BUDDY)
    for J := 1 to BUDDY.COUNT
        increment BUCKET.COUNT
        BUCKET[BUCKET.COUNT].KEY := BUDDY[J].KEY
    next
    decrement BUCKET.DEPTH
end FUNCTION
11.4.5 Summary of the Deletion Operation
Deletion begins with a call to op_del that passes the key that is to be deleted. If the key cannot be found, there is nothing to delete. If the key is found, the bucket containing the key is passed to bk_del_key.

The bk_del_key function deletes the key and then passes the bucket on to bk_try_combine to see if the smaller size of the bucket will now permit combination with a buddy bucket. The bk_try_combine function first checks to see if there is a buddy bucket. If not, we are done. If there is a buddy, and if the sum of the keys in the bucket and its buddy is less than or equal to the size of a single bucket, we combine the buckets.

The elimination of a bucket through combination might enable collapsing of the directory to half its size. We investigate this possibility by calling dir_try_collapse. If collapsing succeeds, we may have a new buddy bucket, and so bk_try_combine calls itself again, recursively.
11.5 Extendible Hashing Performance
Extendible hashing is an elegant solution to the problem of extending and contracting the address space for a hash file as the file itself grows and shrinks. How well does it work? As always, we must consider the answer to this question in terms of the trade-off between time and space.

The time dimension is easy to handle: If the directory for extendible hashing can be kept in RAM, a single access is all that is ever required to retrieve a record. If the directory is so large that it must be paged in and out of RAM, two accesses may be necessary. The important point is that extendible hashing provides O(1) performance: Since there is no overflow, these access time values are truly independent of the size of the file.

Questions about space utilization for extendible hashing are more complicated than questions about access time. We need to be concerned about two uses of space: the space for the buckets and the space for the directory.
11.5.1 Space Utilization for Buckets
In their original paper describing extendible hashing, Fagin, Nievergelt, Pippenger, and Strong include analysis and simulation of extendible hashing performance. Both the analysis and simulation show that the space utilization is strongly periodic, fluctuating between values of 0.53 and 0.94. The analysis portion of their paper suggests that for a given number of
records r and a block size of b, the average number of blocks N is approximated by the formula

    N ≈ r / (b ln 2)

Space utilization, or packing density, is defined as the ratio of the actual number of records to the total number of records that could be stored in the allocated space:

    Utilization = r / (bN)

Substituting the approximation for N gives us

    Utilization ≈ r / (b · r / (b ln 2)) = ln 2 ≈ 0.69
So, we expect an average utilization of 69%. In Chapter 8, where we looked at space utilization for B-trees, we found that simple B-trees tend to have a utilization of about 67%, but this can be increased to over 85% by redistributing keys during insertion, rather than just splitting when a page is full. So, B-trees tend to use less space than simple extendible hashing, typically at a cost of requiring a few extra seeks.
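The derivation is easy to check numerically. The short C program below, with sample values of r and b of our own choosing, plugs the approximation for N back into the utilization ratio and confirms that the result stays near ln 2 no matter how large the file becomes.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double b = 50.0;                       /* records per block (sample value) */
        double sizes[] = { 1e4, 1e6, 1e8 };    /* sample record counts r           */

        for (int i = 0; i < 3; i++) {
            double r = sizes[i];
            double N = r / (b * log(2.0));     /* approximate number of blocks     */
            printf("r = %.0e  utilization = %.3f\n", r, r / (b * N));
        }
        return 0;                              /* each line prints about 0.693     */
    }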
space utilization for extendible hashing
is
only part of the
story; the other part relates to the periodic nature of the variations in space
utilization.
It
turns out that if
we
have keys with randomly distributed
fill up at about
addresses, the buckets in the extendible hashing table tend to
the
same time and therefore tend
to split at the
same
time. This explains the
As the buckets
up, space utilization
can reach past 90%. This is followed by a concentrated series of splits that
reduce the utilization to below 50% As these now nearly half-full buckets
fill up again, the cycle repeats itself.
large fluctuations in space utilization.
fill
11.5.2 Space Utilization for the Directory
The directory used in extendible hashing grows by doubling its size. A prudent designer setting out to implement an extendible hashing system will want assurance that this doubling levels off for reasonable bucket sizes, even when the number of keys is quite large. Just how large a directory should we expect to have, given an expected number of keys?

Flajolet (1983) addressed this question in a lengthy, carefully developed paper that produces a number of different ways to estimate the directory size. Table 11.1, which is taken from Flajolet's paper, shows the expected value for the directory size for different numbers of keys and different bucket sizes.
TABLE 11.1 Expected directory size for a given bucket size b and total number of records r. From Flajolet, 1983.
Flajolet also provides the following formula for making rough estimates of the directory size for values that are not in this table. He notes that this formula tends to overestimate directory size by a factor of 2 to 4.

    Estimated directory size = (3.92 / b) r^(1 + 1/b)
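For rough planning estimates the approximation can be evaluated directly. The C fragment below is our own back-of-the-envelope sketch of the formula as given above, with sample values of r and b; keep Flajolet's caution in mind that the estimate may be high by a factor of 2 to 4.

    #include <stdio.h>
    #include <math.h>

    /* rough estimate of extendible hashing directory size: (3.92 / b) * r^(1 + 1/b) */
    double estimated_dir_size(double r, double b)
    {
        return (3.92 / b) * pow(r, 1.0 + 1.0 / b);
    }

    int main(void)
    {
        printf("r = 10^6, b =  10: about %.0f cells\n", estimated_dir_size(1e6, 10.0));
        printf("r = 10^6, b = 100: about %.0f cells\n", estimated_dir_size(1e6, 100.0));
        return 0;
    }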
11.6 Alternative Approaches
11.6.1 Dynamic Hashing
In 1978, before Fagin, Nievergelt, Pippenger, and Strong produced their paper on extendible hashing, Larson published a paper describing a scheme called dynamic hashing. Functionally, dynamic hashing and extendible hashing are very similar. Both use a directory to track the addresses of the buckets, and both extend the directory through the use of tries.

The key difference between the approaches is that dynamic hashing, like conventional, static hashing, starts with a hash function that covers an address space of a fixed size. As buckets within that fixed address space overflow, they split, forming the leaves of a trie that grows down from the original address node. Eventually, after enough additions and splitting, the buckets are addressed through a forest of tries that have been seeded out of the original static address space.

Let's look at an example. Figure 11.21(a) shows an initial address space of four, and four buckets descending from the four addresses in the directory.
FIGURE 11.21 The growth of the index in dynamic hashing.
In Fig. 11.21(b) we have split the bucket at address 4. We address the two buckets resulting from the split as 40 and 41. We change the shape of the directory node at address 4 from a square to a circle because it has changed from an external node, referencing a bucket, to an internal node that points to two child nodes.

In Fig. 11.21(c) we split the bucket addressed by node 2, creating the new external nodes 20 and 21. We also split the bucket addressed by 41, extending the trie downward to include 410 and 411. Because the directory node 41 is now an internal node rather than an external one, it changes from a square to a circle. As we continue to add keys and split buckets, these directory tries continue to grow.
Finding a key in a dynamic hashing scheme can involve the use of two hash functions, rather than just one. First, there is the hash function that covers the original address space. If you find that the directory node is an external node, and therefore points to a bucket, the search is complete. However, if the directory node is an internal node, then you need additional address information to guide you through the ones and zeroes that form the
trie.
Larson suggests using a second hash function on the key and using the result of this hashing as the seed for a random-number generator that produces a sequence of ones and zeroes for the key. This sequence describes the path through the trie.
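A tiny C sketch of that idea follows; the second hash function h2 and the use of the standard library's rand are our own stand-ins, chosen only to illustrate how a seeded generator can supply one branch bit per level of the trie.

    #include <stdio.h>
    #include <stdlib.h>

    static unsigned h2(int key)            /* a stand-in second hash function */
    {
        return (unsigned)key * 40503u + 1u;
    }

    int main(void)
    {
        int key = 1234;
        srand(h2(key));                    /* the second hash seeds the generator...   */
        printf("trie path for key %d: ", key);
        for (int level = 0; level < 8; level++)
            printf("%d", rand() & 1);      /* ...which yields one branch bit per level */
        printf("\n");
        return 0;
    }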
It is interesting to compare dynamic hashing and extendible hashing. A brief, but illuminating, characterization of similarities and differences is that while both schemes extend the hash function locally, as a binary search trie, in order to handle overflow, dynamic hashing expresses the extended directory as a linked structure while extendible hashing expresses it as a perfect tree, which is in turn expressible as an array.

Because of this fundamental similarity, it is not surprising that the space utilization within the buckets is the same (69%) for both approaches. Moreover, since the directories are essentially equivalent, and are just expressed differently, it follows that the estimates of directory depth developed by Flajolet (1983) apply equally well to dynamic hashing and extendible hashing. (In section 11.5.2 we talk about estimates for the directory size for extendible hashing, but we know that in extendible hashing directory depth = log2 directory size.)
The primary difference between the two approaches is that dynamic hashing allows for slower, more gradual growth of the directory, whereas extendible hashing extends the directory by doubling it. However, because the directory nodes in dynamic hashing must be capable of holding pointers to children, the actual size of a node in dynamic hashing is larger than a directory cell in extendible hashing, probably by at least a factor of two. So, the directory for dynamic hashing will usually require more space in memory. Moreover, if the directory becomes so large that it requires use of virtual memory, extendible hashing offers the advantage of being able to access the directory with no more than a single page fault. Since dynamic hashing uses a linked structure for the directory, it may be necessary to incur more than one page fault to move through the directory.
11.6.2 Linear Hashing
The key feature of both extendible hashing and dynamic hashing is that they use a directory to direct access to the actual buckets containing the key records. This directory makes it possible to expand and modify the hashed address space without expanding the number of buckets: After expanding the directory, more than one directory node can point to the same bucket. However, the directory adds an additional layer of indirection which, if the directory must be stored on disk, can result in an additional seek.

Linear hashing, introduced by Litwin in 1980, does away with the directory. An example, developed in Fig. 11.22, shows how linear hashing works.
FIGURE 11.22 The growth of the address space in linear hashing. Adapted from Enbody and Du (1988).
This example is adapted from a description of linear hashing developed by Enbody and Du (1988).

Linear hashing, like extendible hashing, uses more bits of the hashed value as the address space grows. The example begins (Fig. 11.22a) with an address space of four, which means that we are using an address function that produces addresses with two bits of depth. In terms of the pseudocode that we developed earlier in this chapter, we are calling make_address with a key and a second argument of 2. For this example we will refer to this as the h2(k) address function. Note that the address space consists of four buckets, rather than four directory nodes that can point to buckets.
As we add records, bucket b overflows. The overflow forces a split. However, as Fig. 11.22(b) shows, it is not bucket b that splits, but bucket a. The reason for this is that we are extending the address space linearly, and bucket a is the next bucket that must split to create the next linear extension, which we call bucket A. A three-bit hash function, h3(k), is applied to buckets a and A to divide the records between them. Since bucket b was not the bucket that we split, the overflowing record is placed into an overflow bucket w.

We add more records, and bucket d overflows. Bucket b is the next one to split and extend the address space, so we use the h3(k) address function to divide the records from bucket b and its overflow bucket w between b and the new bucket B. The record overflowing bucket d is placed in an overflow bucket x. The resulting arrangement is illustrated in Fig. 11.22(c).
Figure 11.22(d) shows what happens when, as we add more records, bucket d overflows beyond the capacity of its overflow bucket x. Bucket c is the next in the extension sequence, so we use the h3(k) address function to divide the records between c and C. The overflow record from bucket d is placed in a second overflow bucket y.

Finally, assume that another bucket overflows. The overflow record is placed in the overflow bucket z. The overflow also triggers the extension to bucket D, dividing the contents of d, x, and y between buckets d and D. At this point all of the buckets use the h3(k) address function, and we have finished the expansion cycle. The pointer for the next bucket to be split returns to bucket a to get ready for a new cycle that will use an h4(k) address function to reach new buckets.
Since linear hashing uses two hash functions to reach the buckets during an expansion cycle, an hd(k) function for the buckets at the current address depth and an hd+1(k) function for the expansion buckets, finding a record requires knowing which function to use. If p is the pointer to the address of the next bucket to be split and extended, then the procedure for finding the address of the bucket containing a key k is as follows:
    if (hd(k) >= p)
        address := hd(k)
    else
        address := hd+1(k)
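In C the choice between the two functions is a single comparison. The sketch below is our own illustration; it assumes the common convention that hd(k) is the low-order d bits of a hash value, and that p counts buckets already split in the current expansion cycle.

    #include <stdio.h>

    static unsigned hash(int key) { return (unsigned)key * 2654435761u; }

    static unsigned hd(int key, int d)         /* d low-order bits of the hash */
    {
        return hash(key) & ((1u << d) - 1);
    }

    /* address of the bucket holding key k, given depth d and split pointer p */
    static unsigned lh_address(int k, int d, unsigned p)
    {
        if (hd(k, d) >= p)
            return hd(k, d);                   /* this bucket has not been split yet    */
        return hd(k, d + 1);                   /* already split: use the deeper address */
    }

    int main(void)
    {
        unsigned p = 2;                        /* next bucket to be split (example)     */
        for (int k = 1; k <= 5; k++)
            printf("key %d -> bucket %u\n", k, lh_address(k, 2, p));
        return 0;
    }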
Litwin (1980) shows that the access time performance of linear hashing is quite good. There is no directory to access or maintain, and since we extend the address space through splitting every time there is overflow, the overflow chains do not become very large. Given a bucket size of 50, the average number of disk accesses per search approaches very close to one. Space utilization, on the other hand, is lower than it is for extendible hashing or dynamic hashing, averaging around only 60%.
11.6.3 Approaches to Controlling Splitting
We know from Chapter 8 that we can increase the storage capacity of B-trees by implementing measures that tend to postpone splitting, redistributing keys between pages rather than splitting pages. We can apply similar logic to the hashing schemes introduced in this chapter, placing records in chains of overflow buckets to postpone splitting.

Since linear hashing has the lowest storage utilization of the schemes introduced here, and since it already includes logic to handle overflow buckets, it is an attractive candidate for use of controlled splitting logic. In its uncontrolled-splitting form, linear hashing splits a bucket and extends the address space every time any bucket overflows. This choice of triggering event for splitting is arbitrary, particularly when we consider that the bucket that splits is typically not the bucket that overflows. Litwin (1980) suggests using the overall load factor of the file as an alternative triggering event. Suppose we let the buckets overflow until the space utilization reaches some desired figure, such as 75%. Every time the utilization exceeds that figure, we split a bucket and extend the address space. Litwin simulated this kind of system and found that for load factors of 75% and even 85%, the average number of accesses for successful and unsuccessful searches still stays below 2.
We can also use overflow buckets to defer splitting and increase space utilization for dynamic hashing and extendible hashing. For these methods, which use directories to the buckets, deferring splitting has the additional attraction of keeping the directory size down. For extendible hashing it is particularly advantageous to chain to an overflow bucket and therefore avoid a split when the split would cause the directory to double in size. Consider the example that we used early in this chapter, where we split bucket B in Fig. 11.4(b), producing the expanded directory and bucket structure shown in Fig. 11.6(c). If we had allowed bucket B to overflow instead, we could have retained the smaller directory. Depending on how much space we allocated for the overflow buckets, we might also have
improved space utilization among the buckets. The cost of these improvements, of course, is a potentially greater search length due to the overflow chains.
Studies of the effects of different overflow bucket sizes and chaining mechanisms have supported a small industry of academic research during the early and mid-1980s. Larson (1978) suggested the use of deferred splitting in his original paper on dynamic hashing but found the results of some preliminary simulations of the idea to be disappointing. Scholl (1981) developed a refinement of this idea in which overflow buckets are shared.
Master's thesis research by Chang (1985) tested Scholl's suggestions empirically and found that it was possible to achieve storage utilization of about 81% while maintaining search performance in the range of 1.1 seeks per search. Veklerov (1985) suggested using buddy buckets for overflow rather than allocating chains of new buckets. This is an attractive suggestion, since splitting buckets without buddies can never cause a doubling of the directory in extendible hashing. Veklerov obtained storage utilization of about 76% with a bucket size of 8.
SUMMARY
Conventional, static hashing does not adapt well to file structures that are dynamic, that grow and shrink over time. Extendible hashing is one of several hashing systems that allow the address space for hashing to grow and shrink along with the file. Because the size of the address space can grow as the file grows, it is possible for extendible hashing to provide hashed access without the need for overflow handling, even as files grow many times beyond their original expected size.

The key to extendible hashing is the idea of using more bits of the hashed value as we need to cover more address space. The model for extending the use of the hashed value is the trie: Every time we use another bit of the hashed value, we have added another level to the depth of a trie with a radix of two.
In extendible hashing we fill out all the leaves of the trie until we have a perfect tree, and then we collapse that tree into a one-dimensional array. The array forms a directory to the buckets, kept on disk, that actually hold the keys and records. The directory is managed in RAM, if possible.

If we add a record and there is no room for it in a bucket, we split the bucket. We use one additional bit from the hash values for the keys in the bucket to divide the keys between the old bucket and the new one. If the address space represented in the directory can cover the use of this new bit, no more changes are necessary. If, however, the address space is using fewer bits than are needed by our splitting buckets, then we double the address space to accommodate the use of the new bit.
Deletion reverses the addition process, recognizing that it is possible to combine the records for two buckets only if they are buddy buckets, which is to say that they are the pair of buckets that resulted from a split.

Access performance for extendible hashing is a single seek if the directory can be kept in RAM. If the directory must be paged off to disk, worst-case performance is two seeks. Space utilization for the buckets is approximately 69%. Tables and an approximation formula developed by Flajolet (1983) permit estimation of the probable directory size, given a bucket size and total number of records.

There are a number of other approaches to the problem solved by extendible hashing. Dynamic hashing uses a very similar approach but expresses the directory as a linked structure rather than as an array. The linked structure is more cumbersome but grows more smoothly. Space utilization and seek performance for dynamic hashing are the same as for extendible hashing.
Linear hashing does away with the directory entirely, extending the address space by adding new buckets in a linear sequence. Although the overflow of a bucket can be used to trigger extension of the address space in linear hashing, typically the bucket that overflows is not the one that is split and extended. Consequently, linear hashing implies maintaining overflow chains and a consequent degradation in seek performance. The degradation is slight, since the chains typically do not grow to be very long before they are pulled into a new bucket. Space utilization is about 60%.

Space utilization for extendible, dynamic, and linear hashing can be improved by postponing the splitting of buckets. This is easy to implement for linear hashing, since there are already overflow buckets. Using deferred splitting, it is possible to increase space utilization for any of the hashing schemes described here to 80% or better while still maintaining search performance averaging less than two seeks. Overflow handling for these approaches can use the sharing of overflow buckets.
KEY TERMS
Buddy bucket. Given a bucket with an address uvwxy, where u, v, w, x, and y have values of either 0 or 1, the buddy bucket, if it exists, has the value uvwxz, such that z = y XOR 1.
Buddy buckets are important in deletion operations for extendible hashing since, if enough keys are deleted, the contents of buddy buckets can be combined into a single bucket.
Deferred splitting. It is possible to improve space utilization for dynamic hashing, extendible hashing, and linear hashing by postponing, or deferring, the splitting of buckets, placing records into overflow buckets instead. This is a classic space/time trade-off in which we accept diminished performance in return for more compact storage.
Directory. Conventional, static hashing schemes transform a key into a bucket address. Both extendible hashing and dynamic hashing introduce an additional layer of indirection, in which the key is hashed to a directory address. The directory, in turn, contains information about the location of the bucket. This additional indirection makes it possible to extend the address space by extending the directory, rather than having to work with an address space made up of buckets.

Dynamic hashing. Used in a generic sense, dynamic hashing can refer to any hashing system that provides for expansion and contraction of the address space for dynamic files where the number of records changes over time. In this chapter we use the term in a more specific sense to refer to a system initially described by Larson (1978). The system uses a directory to provide access to the buckets that actually contain the records. Entries in the directory can be used as root nodes of trie structures that accommodate greater numbers of buckets as buckets split.
Extendible hashing. Like dynamic hashing, extendible hashing is sometimes used to refer to any hashing scheme that allows the address space to grow and shrink so it can be used in dynamic file systems. Used more precisely, as it is used in this chapter, extendible hashing refers to an approach to hashed retrieval for dynamic files that was first proposed by Fagin, Nievergelt, Pippenger, and Strong (1979). Their proposal is for a system that uses a directory to represent the address space. Access to buckets containing the records is through the directory. The directory is handled as an array; the size of the array can be doubled or halved as the number of buckets changes.

Linear hashing. An approach to hashing for dynamic files that was first proposed by Litwin (1980). Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory. Instead, the actual address space is extended one bucket at a time as buckets overflow. Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands.
Splitting. The hashing schemes described in this chapter make room for new records by splitting buckets to form new buckets, and then extending the address space to cover these buckets. Conventional, static hashing schemes rely strictly on overflow buckets without extending the address space.
Trie. A search tree structure in which each successive character of the key is used to determine the direction of the search at each successive level of the tree. The branching factor (the radix of the trie) at any level is potentially equal to the number of values that the character can take.
EXERCISES
1. Briefly describe the differences between extendible hashing, dynamic hashing, and linear hashing. What are the strengths and weaknesses of each approach?

2. The tries that are the basis for the extendible hashing procedure described in this chapter have a radix of two. How does performance change if we use a larger radix?

3. In the make_address function, what would happen if we did not reverse the order of the bits but just extracted the required number of low-order bits in the same left-to-right order that they occur in the address? Think about the way the directory location would change as we extend the implicit trie structure to use yet another bit.

4. If the language that you are using to implement the make_address function does not support bit shifting and masking operations, how could you achieve the same ends, even if less elegantly and clearly?

5. In the bk_split function, we redistribute keys between the original bucket and a new one. Outline a possible implementation for this redistribution. How do you decide whether a key belongs in the new bucket or the original bucket?

6. Suppose the redistribution of keys in bk_split does not result in moving any keys into the new bucket. Under what conditions could such an event happen? How will the program handle this?

7. The bk_try_combine function is potentially recursive. In section 11.4.4 we described a situation in which there are empty buckets that can be combined with other buckets through a series of recursive calls to bk_try_combine. Describe two situations that could produce empty buckets in the hash structure.

8. Deletion occasionally results in collapsing the directory. Describe the conditions that must be met before the directory can collapse.

9. Deletion depends on finding buddy buckets. Why does the address depth for a bucket have to be the same as the address depth for the directory in order for a bucket to have a buddy?

10. In the extendible hashing procedure described in this chapter, the directory can occasionally point to empty buckets. Describe two situations that can produce empty buckets. How could we modify the procedures to avoid empty buckets?

11. If buckets are large, a bucket containing only a few records is not much less wasteful than an empty bucket. How could we minimize nearly empty buckets?

12. Linear hashing makes use of overflow records. Assuming an uncontrolled splitting implementation where we split and extend the address space as soon as we have an overflow, what is the effect of using different bucket sizes for the overflow buckets? For example, consider overflow buckets that are as large as the original buckets. Now consider overflow buckets that can only hold one record. How does this choice affect performance in terms of space utilization and access time?

13. In section 11.6.3 we described an approach to linear hashing that controls splitting. For a load factor of 85%, the average number of accesses for a successful search is 1.20 (Litwin, 1980). Unsuccessful searches require an average of 1.78 accesses. Why is the average search length greater for unsuccessful searches?

14. Because linear hashing splits one bucket at a time, in order, until it has reached the end of the sequence, the overflow chains for the last buckets in the sequence can become much longer than those for the earlier buckets. Read about Larson's approach to solving this problem through the use of "partial expansions," originally described in Larson (1980) and subsequently summarized in Enbody and Du (1988). Write a pseudocode description of linear hashing with partial expansions, paying particular attention to how addressing is handled.

15. In section 11.6.3 we discussed different mechanisms for deferring the splitting of buckets in extendible hashing in order to increase storage utilization. What is the effect of using smaller overflow buckets rather than larger ones? How does the use of smaller overflow buckets compare with the idea of sharing overflow buckets?

Programming Exercises

16. Write a version of the make_address function that prints out the input key, the hash value, and the extracted, reversed address. Build a driver that allows you to enter keys interactively for this function and see the results. Study the operation of the function on different keys.

17. Write a simplified version of the extendible hashing program described in pseudocode in this chapter. This simplified version should
    Keep the directory and buckets in RAM rather than on disk;
    Hold three keys per bucket;
    Find and add keys, but not delete them;
    Accept keys entered interactively; and
    Display the resulting directory structure and buckets so you can see how the directory references the buckets and can see which buckets contain which keys.
Once you build this program, play with it to see how the directory grows as buckets split. Use the program developed in exercise 16 to develop sequences of keys that all hash to the same bucket. Enter such sequences and watch what happens.

18. Extend exercise 17 to include deletion. Once again, experiment with the program to see how deletion works. Try deleting all the keys. Try to create situations where the directory will recursively collapse over more than one level.

19. Write an extendible hashing program that stores and retrieves buckets from disk rather than from RAM.

20. Using the information in Enbody and Du (1988) and Litwin (1980), implement a simple, RAM-based linear hashing program.
FURTHER READINGS
For information about hashing for dynamic files that goes beyond what we present here, you must turn to journal articles. The best summary of the different approaches is Enbody and Du's Computing Surveys article titled "Dynamic Hashing Schemes," which appeared in 1988.

The original paper on extendible hashing is "Extendible Hashing: A Fast Access Method for Dynamic Files" by Fagin, Nievergelt, Pippenger, and Strong (1979). Larson (1978) introduces dynamic hashing in an article titled "Dynamic Hashing." Litwin's initial paper on linear hashing is titled "Linear Hashing: A New Tool for File and Table Addressing" (1980). All three of these introductory articles are quite readable; Larson's paper and the paper by Fagin, Nievergelt, Pippenger, and Strong are especially recommended.

Michel Scholl's 1981 paper titled "New File Organizations Based on Dynamic Hashing" provides another readable introduction to dynamic hashing. It also investigates implementations that defer splitting by allowing buckets to overflow.

Papers analyzing the performance of dynamic or extendible hashing often derive results that apply to either of the two methods. Flajolet (1983) presents a careful analysis of directory depth and size. Mendelson (1982) arrives at similar results and goes on to discuss the costs of retrieval and deletion as different design parameters are changed. Veklerov (1985) analyzes the performance of dynamic hashing when splitting is deferred by allowing records to overflow into a buddy bucket. His results can be applied to extendible hashing as well.

After introducing dynamic hashing, Larson wrote a number of papers building on the ideas associated with linear hashing. His 1980 paper titled "Linear Hashing with Partial Expansions" introduces an approach to linear hashing that can avoid the uneven distribution of the lengths of overflow chains across the cells in the address space. He followed up with a performance analysis in a 1982 paper titled "Performance Analysis of Linear Hashing with Partial Expansions." A subsequent, 1985 paper titled "Linear Hashing with Overflow Handling by Linear Probing" introduces a method of handling overflow that does not involve chaining.
Appendix A
File Structures on CD-ROM
OBJECTIVES
Introduce the commercially important characteristics of CD-ROM storage.

Examine CD-ROM as a storage medium with performance characteristics that are very different from those of magnetic disks; show how to apply good file structure design principles to develop solutions that are appropriate to this new medium.

Describe the directory structure of the CD-ROM file system and show how it grows from the characteristics of the medium.
OUTLINE
A.1 Using this Appendix
A.2 Introduction to CD-ROM
    A.2.1 A Short History of CD-ROM
    A.2.2 CD-ROM as a File Structure Problem
A.3 Physical Organization of CD-ROM
    A.3.1 Reading Pits and Lands
    A.3.2 CLV Instead of CAV
    A.3.3 Addressing
    A.3.4 Structure of a Sector
A.4 CD-ROM Strengths and Weaknesses
    A.4.1 Seek Performance
    A.4.2 Data Transfer Rate
    A.4.3 Storage Capacity
    A.4.4 Read-Only Access
    A.4.5 Asymmetric Writing and Reading
A.5 Tree Structures on CD-ROM
    A.5.1 Design Exercises
    A.5.2 Block Size
    A.5.3 Special Loading Procedures and Other Considerations
    A.5.4 Virtual Trees and Buffering Blocks
    A.5.5 Trees as Secondary Indexes on CD-ROM
A.6 Hashed Files on CD-ROM
    A.6.1 Design Exercises
    A.6.2 Bucket Size
    A.6.3 How the Size of CD-ROM Helps
    A.6.4 Advantages of CD-ROM's Read-Only Status
A.7 The CD-ROM File System
    A.7.1 The Problem
    A.7.2 Design Exercise
    A.7.3 A Hybrid Design
A.1 Using this Appendix
This appendix has two purposes. The first is to tell you about the commercially important performance characteristics of CD-ROM, an information distribution medium. The second is to use the problem of designing file structures for CD-ROM to review many of the design issues and techniques presented in the text.
We begin by introducing CD-ROM. We explain how CD-ROM
works and enumerate the features that make file structure design for
CD-ROM a different problem than file structure design for magnetic
media.
Once we have examined CD-ROM's performance, we provide a high-level look at how this performance affects the design of tree structures, hashed indexes, and directory structures for CD-ROM. These discussions of trees and hashing do not present new information; they review material that has already been developed in detail. Since you already have the tools required to think through these design problems, we introduce exercises and questions throughout this discussion, rather than holding them to the end. We encourage you to stop at these blocks of questions, think carefully about the answers, and then compare results with the discussion that follows.
A.2 Introduction to CD-ROM
CD-ROM is an acronym for Compact Disc Read-Only Memory. It is a CD audio disc that contains digital data rather than digital sound. CD-ROM is commercially interesting because it can hold a lot of data and can be reproduced cheaply. A single disc can hold over 600 megabytes of data. That is approximately 200,000 printed pages, enough storage to hold almost 400 books the size of this one. Replicates can be stamped from a master disc for about only a dollar a copy.

CD-ROM is read-only in the same sense as a CD audio disc: You cannot record on it. It is a publishing medium, used for distributing information to many users, rather than a data storage and retrieval medium like magnetic disks. Currently, CD-ROMs are often used to publish database information such as telephone directories, zip codes, and demographic information. There are also many CD-ROM products that deliver textual data, such as bibliographic indexes, abstracts, dictionaries, and encyclopedias, often in association with digitized images stored on the disc. They are also used to publish video information and, of course, digital audio.
A.2.1 A Short History of CD-ROM
CD-ROM is the offspring of videodisc technology developed in the late 1960s and early 1970s, before the advent of the home VCR. The goal was to store movies on disc. Different companies developed a number of methods for storing video signals, including one that used a needle to respond mechanically to grooves in a disc, much like a vinyl LP record does. The consumer products industry spent a great deal of money developing the different technologies, including several approaches to optical storage, and then spent years fighting over the question of which approach should become standard. The surviving format is one called LaserVision. By the time LaserVision emerged as the winner of these
""Usually
spell
it
we
with
spell disk
a
c.
with
a k,
but the convention
among
optical disc manufacturers
is
to
standards battles, the competing developers had not only spent enormous
sums of money, but had also lost important market opportunities. These
hard lessons were put to use in the subsequent development of CD audio
and
CD-ROM.
From the outset, there was interest in using LaserVision discs to do more than just record movies. The LaserVision format supports recording in both a CLV (Constant Linear Velocity) format that maximizes storage capacity, and a CAV (Constant Angular Velocity) format that enables fast seek performance. By using the CAV format to access individual video frames quickly, a number of organizations, including the MIT Media Lab, produced prototype interactive video discs that could be used to teach and entertain.
In the early 1980s, a number of firms began looking at the possibility of storing digital, textual information on LaserVision discs. LaserVision stores data in an analog form; it is, after all, storing an analog video signal. Different firms came up with different ways of encoding digital information in analog form so it could be stored on the disc. The capabilities demonstrated in the prototypes and early, narrowly distributed products were impressive. The videodisc has a number of performance characteristics that make it a technically more desirable medium than the CD-ROM; in particular, one can build drives that seek quickly and deliver information from the disc at a high rate of speed. But, reminiscent of the earlier disputes over the physical format of the videodisc, each of these pioneers in the use of LaserVision discs as computer peripherals had incompatible encoding schemes and error correction techniques. There was no standard format, and none of the firms was large enough to impose their format over the others through sheer marketing muscle. Potential buyers were frightened by the lack of a standard; consequently, the market never grew.

During this same period the Philips and Sony companies began work on a way to store music on optical discs. Rather than storing the music in the kind of analog form used on videodiscs, they developed a digital data format. Philips and Sony had learned hard lessons from the expensive standards battles over videodiscs. This time they worked with other players in the consumer products industry to develop a licensing system that resulted in the emergence of CD audio discs as a broadly accepted, standard format as soon as the first discs and players were introduced. CD audio appeared in the United States in early 1984. CD-ROM, which is a digital data format built on top of the CD audio standard, emerged shortly thereafter. The first commercially available CD-ROM drives appeared in 1985.
Not surprisingly, the first firms that saw CD-ROM as a threat to their existence were delivering digital data on LaserVision discs. They also recognized, however, that CD-ROM promised to provide what had always eluded them in the past: a standard physical format. Anyone with a CD-ROM drive was guaranteed that they could find and read a sector off of any disc manufactured by any firm. For a storage medium to be used in publishing, standardization at such a fundamental level is essential.

What happened next is remarkable in the history of standards and cooperation within an industry. The firms that had been working on products to deliver computer data from videodiscs recognized that a standard physical format, such as that provided by CD-ROM, was not enough. A standard physical format meant that everyone was guaranteed to be able to read sectors off of any disc. But computer applications do not work in terms of sectors; they store data in files. Having an agreement about finding sectors, without further agreement about how to organize the sectors into files, is like everyone's agreeing on an alphabet without having settled on how letters are to be organized into words on a page. In late 1985 the relatively small firms emerging from the videodisc/digital data industry, together with many of the much larger firms moving into the CD-ROM industry, were called together to begin work on a standard file system to be built on top of the CD-ROM format. In a rare display of cooperation, the different firms, large and small, worked out all of the main features of a file system standard by early summer of 1986; that work has now emerged as an official international standard for organizing files on CD-ROM.
The CD-ROM industry is still young, though in the past two years it has begun to show signs of maturity, moving away from concentration on matters such as disc formats to a concern with CD-ROM applications; rather than focusing on the new medium in isolation, vendors are seeing it as an enabling mechanism for new systems. As it finds more uses in a broader array of applications, CD-ROM looks like an optical publishing technology that will be with us over the long term.
A.2.2 CD-ROM as a File Structure Problem
CD-ROM presents interesting file structure problems because it is a medium with great strengths and weaknesses. The strengths of CD-ROM include the fact that it has a lot of storage capacity, it is inexpensive, and it is durable. The key weakness is that seek performance on a CD-ROM is very slow, often taking from a half second to a second per seek. In the introduction to this textbook we compared RAM access and magnetic disk access and showed that if RAM access is analogous to your taking 20 seconds to look up something in the index to this textbook, the equivalent disk access would take 58 days, or almost two months. With a CD-ROM the analogy stretches the disc access to over two and a half years! This kind of performance, or lack of it, makes intelligent file structure design a critical concern for CD-ROM applications. CD-ROM provides an excellent test of our ability to integrate and adapt the principles we have developed in the preceding chapters of this book.
A.3 Physical Organization of CD-ROM
CD-ROM is the child of CD audio. In this instance, the impact of heredity is strong, with both positive and negative aspects. Commercially, the CD audio parentage is probably wholly responsible for CD-ROM's viability in the market. It is because of the enormous size of the CD audio market that it is possible to make CD-ROM discs so inexpensively. Similarly, advances in the design and decreases in the costs of making CD audio players affect the performance and price of CD-ROM drives. Other optical disc media that have not enjoyed the benefits of this parentage have not experienced the commercial success of CD-ROM.

On the other hand, making use of the manufacturing capacity associated with CD audio means adhering to the fundamental physical organization of the CD audio disc. Audio discs are designed to play music, not to provide fast, random access to data. This difference in design objective biases the CD toward having high storage capacity and moderate data transfer rates, but against decent seek performance. If an application requires good random-access performance, that performance has to emerge from our file structure design efforts; it won't come from anything inherent in the medium itself.
A.3.1 Reading Pits and Lands
CD-ROM discs are stamped from a master disc. The master is formed by using the digital data that we want to encode to turn a powerful laser on and
off very quickly. The master disc, which is made of glass, has a coating that
is changed by the laser beam. When the coating is developed, the areas hit
by the laser beam turn into pits along the track followed by the beam. The
smooth, unchanged areas between the pits are called lands. The copies
formed from the master retain this pattern of pits and lands.
When we read the stamped copy of the disc, we focus a beam of laser
light on the track as it moves under the optical pickup. The pits scatter the
light, but the lands reflect most of it back to the pickup. This alternating
pattern of high- and low-intensity reflected light is the signal used to
reconstruct the original digital information. The encoding scheme used for
this signal is not simply a matter of calling a pit a 1 and a land a 0. Instead, the 1s are represented by the transitions from pit to land and back again. Every time the light intensity changes, we get a 1. The zeroes are represented by the amount of time between transitions; the longer between transitions, the more zeroes we have.
If you think about this encoding scheme, you realize that it is not possible to have two adjacent 1s: 1s are always separated by zeroes. In fact, due to the limits of the resolution of the optical pickup, there must be at least two 0s between any pair of 1s. This means that the raw pattern of 1s and 0s has to be translated in order to get the 8-bit patterns of 1s and 0s that form the bytes of the original data. This translation scheme, which is done through a lookup table, turns the original 8 bits of data into 14 expanded bits that can be represented in the pits and lands on the disc; the reading process reverses this translation. Figure A.1 shows a portion of the lookup table values. Readers who have looked closely at the specifications for CD players may have encountered the term EFM encoding. EFM stands for "eight to fourteen modulation" and refers to this translation scheme.
It is important to realize that since we represent the zeroes in the EFM-encoded data by the length of time between transitions, our ability to read the data is dependent on moving the pits and lands under the optical pickup at a precise and constant speed. As we will see, this affects the CD-ROM drive's ability to seek quickly.
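As a small illustration of how table-driven translation works, the C fragment below encodes a data byte using the nine rows shown in Fig. A.1. This is only a sketch: a real EFM encoder needs the full 256-entry table, plus the merging bits that preserve the run-length rules across symbol boundaries.

    #include <stdio.h>

    /* the first nine rows of the EFM lookup table from Fig. A.1, as 14-bit patterns */
    static const char *efm[9] = {
        "01001000100000", "10000100000000", "10010000100000",
        "10001000100000", "01000100000000", "00000100010000",
        "00010000100000", "00100100000000", "01001001000000"
    };

    int main(void)
    {
        unsigned char byte = 5;                            /* any data byte covered by the table */
        if (byte < 9)
            printf("%u -> %s\n", (unsigned)byte, efm[byte]);   /* prints 5 -> 00000100010000 */
        return 0;
    }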
A.3.2 CLV Instead of CAV

Data on CD-ROM is stored in a single, spiral track that winds for almost three miles from the center to the outer edge of the disc. This spiral pattern is part of the CD-ROM's heritage from CD audio.
FIGURE A.1 A portion of the EFM encoding table.

Decimal value   Original bits   Translated bits
0               00000000        01001000100000
1               00000001        10000100000000
2               00000010        10010000100000
3               00000011        10001000100000
4               00000100        01000100000000
5               00000101        00000100010000
6               00000110        00010000100000
7               00000111        00100100000000
8               00001000        01001001000000
FIGURE A.2 CLV and CAV recording (constant linear velocity contrasted with constant angular velocity).
Since we "play" audio data, often from start to finish without interruption, seeking is not important. As Fig. A.2 shows, a spiral
pattern serves these needs well. A sector toward the outer edge of the disc
takes the same amount of space as a sector toward the center of the disc.
This means that we can write all of the sectors at the maximum density
permitted by the storage medium. Since reading the data requires that it
pass under the optical pickup device at a constant rate, the constant data
density implies that the disc has to spin more slowly when we are reading
at the outer edges than when we are reading toward the center. This is why
the spiral is a Constant Linear Velocity (CLV) format: As we seek from the
center to the edge, we change the rate of rotation of the disc so the linear
speed of the spiral past the pickup device stays the same.
By contrast, the familiar Constant Angular Velocity (CAV) arrangement shown in Fig. A.2, with its concentric tracks and pie-shaped sectors,
writes data less densely in the outer tracks than in the tracks toward the
center. We are wasting storage capacity in the outer tracks but have the
advantage of being able to spin the disc at the same speed for all positions
of the read head. Given the sector arrangement shown in the figure, one
rotation reads eight sectors, no matter where we are on the disc. Furthermore, a timing mark placed on the disk makes it easy to find the start of a sector.
The CLV format is responsible, in large part, for the poor seeking
performance of CD-ROM drives. The CAV format provides definite track
boundaries and a timing mark to find the start of a sector. The CLV format,
on the other hand, provides no straightforward way to jump to a specific location. Part of the problem is associated with the need to change rotational speed as we seek across the disc. To read the address information that is stored on the disc along with the user's data, we need to be moving the data under the optical pickup at the correct speed. But to know how to adjust the speed, we need to be able to read the address information so we know where we are. How does the drive's control mechanism break out of this loop? In practice, the answer often involves making guesses, finding the correct speed through trial and error. This takes time and slows down seek performance.
On the positive side, the CLV sector arrangement contributes to the CD-ROM's large storage capacity. Given a CAV arrangement, the CD-ROM would have only a little better than half its present capacity.

A.3.3 Addressing
The use of CLV organization means that the familiar cylinder, track, sector way of identifying a sector address will not work on CD-ROM. Instead, we use a sector-addressing scheme that is related to the CD-ROM's roots as an audio playback device. Each second of playing time on a CD is divided into 75 sectors, each of which holds 2 Kbytes of data. According to the original Philips/Sony standard, a CD disc, whether used for audio or CD-ROM, contains at least one hour of playing time. That means that the disc is capable of holding at least 540,000 Kbytes of data:

60 minutes x 60 seconds/minute x 75 sectors/second = 270,000 sectors;
270,000 sectors x 2 Kbytes/sector = 540,000 Kbytes.

In fact, since it is possible to put over 70 minutes of playing time on a CD, the capacity of the disk is over 600 Mbytes.
We address a given sector by referring to the minute, second, and sector of play. So, the 34th sector in the 22nd second in the 16th minute of play would be addressed with the three numbers 16:22:34.
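Because minutes, seconds, and sectors form a simple mixed-radix number (60 seconds per minute, 75 sectors per second), converting between an address such as 16:22:34 and a single sequential sector number is a matter of two or three multiplications. The C sketch below shows that arithmetic; the names are our own, and it ignores any offset a real drive applies for lead-in areas.

#include <stdio.h>

#define SECTORS_PER_SECOND 75
#define SECONDS_PER_MINUTE 60

struct msf { int minute, second, sector; };    /* e.g. 16:22:34 */

/* Convert a minute:second:sector address into a sequential sector number. */
static long msf_to_sector(struct msf a)
{
    return ((long)a.minute * SECONDS_PER_MINUTE + a.second)
               * SECTORS_PER_SECOND + a.sector;
}

/* Convert a sequential sector number back into minute:second:sector. */
static struct msf sector_to_msf(long n)
{
    struct msf a;
    a.sector = (int)(n % SECTORS_PER_SECOND);
    n /= SECTORS_PER_SECOND;
    a.second = (int)(n % SECONDS_PER_MINUTE);
    a.minute = (int)(n / SECONDS_PER_MINUTE);
    return a;
}

int main(void)
{
    struct msf a = { 16, 22, 34 };
    long n = msf_to_sector(a);     /* (16 * 60 + 22) * 75 + 34 = 73,684 */
    struct msf b = sector_to_msf(n);
    printf("%ld -> %d:%d:%d\n", n, b.minute, b.second, b.sector);
    return 0;
}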
A.3.4 Structure of a Sector
It is interesting to look at the way that the fundamental design of the CD disc, initially designed for delivering digital audio information, has been adapted for computer data storage. This investigation will also help answer the question, "If the disc is capable of storing a quarter of a million printed pages, why does it hold only an hour's worth of Roy Orbison?"
When we want to store sound, we need to convert a wave pattern into digital form. Figure A.3 shows a wave. At any given point in time, the wave has a specific amplitude. We digitize the wave by measuring the amplitude at very frequent intervals and storing the measurements.

FIGURE A.3 Digital sampling of a wave (amplitude range -32,767 to 32,767; the actual wave and the wave reconstructed from sample data are shown against the sampling frequency).
So, the question of how much storage space we need to represent a wave digitally turns into two other questions: How much space does it take to store each amplitude sample, and how often do we take samples?
CD audio uses 16 bits to store each amplitude measurement; that means that our "ruler" that we use to measure the height of the wave has 65,536 different gradations. To accurately approximate a wave through digital sampling, we need to take the samples at a rate that is more than twice as frequent as the highest frequency that we want to capture. This makes sense if you look at the wave in Fig. A.4. You can see that if we sample at less than twice the frequency of the wave, we lose information about the variation in the wave pattern. The designers of CD audio selected a sampling frequency of 44.1 KHz, or 44,100 times per second, so they could record sounds with frequencies ranging up to 20 KHz (20,000 cycles per second), which is toward the upper bound of what people can hear.
So, if we are taking a 16-bit, or 2-byte, sample 44,100 times per second,
we need to store 88,200 bytes per second. Since we want to store stereo
sound, we need double this, storing 176,400 bytes per second. You can see
why storing an hour of Roy Orbison takes so much space.
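The arithmetic is easy to verify. The short C program below recomputes the data rate for 16-bit stereo samples taken 44,100 times per second and, looking ahead to the next section, divides that rate by the 75 sectors in each second of playing time; the constant names are our own.

#include <stdio.h>

int main(void)
{
    long samples_per_second = 44100;  /* CD audio sampling frequency      */
    long bytes_per_sample   = 2;      /* one 16-bit amplitude measurement */
    long channels           = 2;      /* stereo                           */
    long sectors_per_second = 75;

    long bytes_per_second = samples_per_second * bytes_per_sample * channels;
    long bytes_per_sector = bytes_per_second / sectors_per_second;

    printf("bytes per second of audio: %ld\n", bytes_per_second);  /* 176400 */
    printf("raw bytes per sector:      %ld\n", bytes_per_sector);  /* 2352   */
    return 0;
}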
FIGURE A.4 The effect of sampling at less than twice the frequency of the wave (the actual wave compared with the wave reconstructed from sample data).
If you divide the 176,400-byte-per-second storage capacity of the CD into 75 sectors per second, you have 2,352 bytes per sector. CD-ROM divides up this "raw" sector storage as shown in Fig. A.5 to provide 2 K of user data storage, along with addressing information, error detection, and error correction information.
The additional error correction information is necessary because, although CD audio contains redundancy for error correction, it is not adequate to meet computer data storage needs. The audio error correction would result in an average of one incorrect byte for every two discs. The additional error correction information stored within the 2,352-byte sector decreases this error rate to one uncorrectable byte in every 20,000 discs.
FIGURE A.5 Structure of a CD-ROM sector: 12 bytes synch, 4 bytes sector ID, 2,048 bytes user data, 4 bytes error detection, 8 bytes null, 276 bytes error correction.
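Laid end to end, the fields of Fig. A.5 account for all 2,352 raw bytes. The C declaration below merely restates that layout as a structure so the arithmetic can be checked; it is not taken from any standard header, and the comment on the sector ID reflects the addressing scheme just described rather than a documented byte-by-byte format.

#include <assert.h>

/* Field sizes, in bytes, from Fig. A.5. */
#define SYNCH_BYTES       12
#define SECTOR_ID_BYTES    4     /* holds the minute:second:sector address */
#define USER_DATA_BYTES 2048
#define EDC_BYTES          4     /* error detection                        */
#define NULL_BYTES         8
#define ECC_BYTES        276     /* error correction                       */

struct cdrom_sector {
    unsigned char synch[SYNCH_BYTES];
    unsigned char sector_id[SECTOR_ID_BYTES];
    unsigned char user_data[USER_DATA_BYTES];
    unsigned char error_detection[EDC_BYTES];
    unsigned char null_bytes[NULL_BYTES];
    unsigned char error_correction[ECC_BYTES];
};

int main(void)
{
    /* The six fields add up to the 2,352-byte raw sector. */
    assert(SYNCH_BYTES + SECTOR_ID_BYTES + USER_DATA_BYTES +
           EDC_BYTES + NULL_BYTES + ECC_BYTES == 2352);
    assert(sizeof(struct cdrom_sector) == 2352);  /* all char arrays, no padding */
    return 0;
}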
A.4 CD-ROM Strengths and Weaknesses
As we say throughout this book, good file design is responsive to the nature
of the medium, making use of strengths and minimizing weaknesses. We
begin, then, by cataloging the strengths and weaknesses of CD-ROM.
A.4.1 Seek Performance
The chief weakness of CD-ROM is random-access performance. Current magnetic disk technology is such that the average time for a random data access, combining seek time and rotational delay, is about 30 msec. On a CD-ROM, this average access takes 500 msec, and can take up to a second or more. Clearly, our file design strategies must avoid seeks to an even greater extent than on magnetic disks.
A.4.2 Data Transfer Rate
A CD-ROM drive reads 75 sectors, or 150 Kbytes of data per second. This data transfer rate is part of the fundamental definition of CD-ROM; it can't be changed without leaving behind the commercial advantages of adhering to the CD audio standard. It is a modest transfer rate, about five times faster than the transfer rate for floppy disks, and an order of magnitude slower than the rate for good Winchester disks. The inadequacy of the transfer rate makes itself felt when we are loading large files, such as those associated with digitized images. On the other hand, the transfer rate is fast enough relative to the CD-ROM's seek performance that we have a design incentive to organize data into blocks, reading more data with each seek with the hope that we can avoid as much seeking as possible.
A.4.3 Storage Capacity
A CD-ROM holds more than 600 Mbytes of data. Although it is possible to use up this storage area very quickly, particularly if you are storing raster images, 600 Mbytes is big when it comes to text applications. If you decide to download 600 Mbytes of text with a 2,400-baud modem, it will take about three days of constant data transmission, assuming errorless transmission conditions. Many typical text databases and document collections published on CD-ROM use only a fraction of the disc's capacity.
The design benefit arising from such large capacity is that it enables us to build indexes and other support structures that can help overcome some of the limitations associated with CD-ROM's poor seek performance.
A.4.4 Read-Only Access
From a design standpoint, the fact that CD-ROM is a publishing medium, a storage device that cannot be changed after manufacture, provides significant advantages. We never have to worry about updating. This not only simplifies some of the file structures but also means that it is worthwhile to optimize our index structures and other aspects of file organization. We know that our efforts to optimize access will not be lost through later additions or deletions.
A.4.5 Asymmetric Writing and Reading
For most media, files are written and read using the same computer system. Often, reading and writing are both interactive and are therefore constrained by the need to provide quick response to the user. CD-ROM is different. We create the files to be placed on the disc once; then we distribute the disc and it is accessed thousands, even millions, of times. We are in a position to bring substantial computing power to the task of file organization and creation, even when the disc will be used on systems with much less capability. In fact, we can use extensive, batch-mode processing on large computers to try to provide systems that will perform well on small machines. We make the investment in intelligent, carefully designed file structures only once; users can enjoy the benefits of this investment again and again.
A.5 Tree Structures on CD-ROM

A.5.1 Design Exercises
Tree structures are a good way to organize indexes and data on CD-ROM. Chapters 8 and 9 took a close look at B-trees and B+ trees. Before we discuss the effective use of trees on CD-ROM, think through these design questions:
1. How big should the block size be for B-trees and B+ trees?
2. How far should you go in the direction of using virtual tree structures? How much memory should you set aside for buffering blocks?
3. How could you use special loading procedures to advantage in a B+ tree implementation? Are there similar procedures that will assist in the loading of B-trees?
4. Suppose we have a primary index and several secondary indexes to a set of records. How should you organize these access mechanisms for CD-ROM? Address the issues of binding and pinned records in your reply.
A.5.2 Block Size
Avoiding seeks is the key strategy in CD-ROM file structure design. Consequently, B-tree and B+ tree structures are good choices for implementing index structures on CD-ROM. As we showed in Chapters 8 and 9, given a large enough block size, B-trees and B+ trees can provide access to a large number of records in only a few seeks.
How large should the block size be? The answer, of course, depends on the application, but it is possible to provide some general guidelines. First, since the sector size of the CD-ROM is 2 Kbytes, the block size should not be less than 2 Kbytes. The sector is the smallest addressable unit on the disc; consequently, it does not make sense to read in anything less than a sector. Since the CD-ROM's sequential reading performance is moderately fast, especially when viewed relative to its seeking performance, it is usually attractive to use a block composed of several sectors. Once you have spent the better part of a second seeking for the sector and reading it, reading an additional 6 Kbytes to make an 8-Kbyte block takes only an additional 40 msec. If this added fraction of a second can contribute to avoiding another seek, it is time well spent.
Table A.1 shows the maximum number of 32-byte records that can be contained in a B-tree as the tree changes in height and block size. The dramatic effect of block size on the record counts for two- and three-level trees suggests that large tree structures should usually use at least an 8-Kbyte block.
TABLE A.1 The maximum number of 32-byte records that can be stored in a B-tree of given height and block size

                          Tree Height
                    One Level   Two Levels   Three Levels
Block size = 2 K         64         4,224        274,624
Block size = 4 K        128        16,640      2,146,688
Block size = 8 K        256        66,048     16,974,592
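The record counts in Table A.1 follow from a simple recurrence. If a block holds k 32-byte records and can therefore have k + 1 descendants, a tree of h levels holds at most k + k(k + 1) + k(k + 1)^2 + ... records. The C sketch below regenerates the table under that assumption; it is a back-of-the-envelope check, not a B-tree implementation.

#include <stdio.h>

/* Maximum records in a tree of 'levels' levels when each block holds
 * 'k' records and can have k + 1 descendants (the assumption that
 * reproduces the figures in Table A.1). */
static double max_records(int k, int levels)
{
    double total = 0.0, blocks = 1.0;
    int h;
    for (h = 0; h < levels; h++) {
        total  += blocks * k;         /* records stored at this level    */
        blocks *= (double)(k + 1);    /* blocks available one level down */
    }
    return total;
}

int main(void)
{
    int block_bytes[] = { 2048, 4096, 8192 };
    int record_bytes  = 32;
    int i, levels;

    for (i = 0; i < 3; i++) {
        int k = block_bytes[i] / record_bytes;   /* 64, 128, or 256 records */
        printf("block size = %d K:", block_bytes[i] / 1024);
        for (levels = 1; levels <= 3; levels++)
            printf("  %.0f", max_records(k, levels));
        printf("\n");
    }
    return 0;
}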
A.5.3 Special Loading Procedures and Other Considerations
B+ trees are commonly used in CD-ROM applications because they provide both indexed and sequential access to records. If, for example, you are building a telephone directory system for CD-ROM, you will need an index that can provide fast access to any one of the millions of names that appear on the disc. You will also want to provide sequential access so once users have found a name, they can browse through records with the same name, checking addresses, to make sure they have the right phone number.
B+ trees are also attractive in CD-ROM applications because they can provide very shallow, broad indexes to a set of sequenced records. As we showed in Chapter 9, the content of the index part of a B+ tree can consist of nothing more than the shortest separators required to provide access to lower levels of the tree and, ultimately, to the target records. If these shortest separators are only a few bytes long, as is frequently the case, it is often possible to provide access to millions of records with an index that is only two levels deep. An application can keep the root of this index in RAM, reducing the cost of searching the index part of the tree to a single seek. With one additional seek we are at the record in the sequence set.
Another attractive feature of B+ trees is that it is easy to build a two-level index above the sequence set with a separate loading procedure that builds the tree from the bottom up. We described this operation in Chapter 9. The great advantage of this kind of loading procedure, as opposed to building the tree through a series of top-down insertions, is that we can pack the nodes and leaves of the tree as fully as we wish. With CD-ROM, where the cost of additional seeks is so high, and where there is absolutely no possibility that anyone will make additional insertions to the tree, we will want to pack the nodes and leaves of the tree so they are completely full. This is an example of a design decision that recognizes that the CD-ROM is a publishing medium that, once constructed, is used only for retrieval, and never for additional storage.
This kind of special, 100%-full loading procedure can also be designed for B-tree applications. The procedure for B-trees is usually somewhat more complex because the index will often consist of more than just a root node and one level of children. The loading procedure for B-trees has to manage more levels of the tree at a time.
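A compact sketch makes the bottom-up idea concrete. The C fragment below assumes the records are already sorted by key; it packs fixed-capacity leaves completely full, then builds each index level from the first key of every node below it, repeating until a single root remains. The structures, the tiny node capacity, and the use of first keys rather than the shortest separators of Chapter 9 are simplifications for the illustration; a production loader would also write the packed blocks to the disc image rather than hold them in memory.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CAP 4                        /* keys per node; tiny so the output is readable */

struct node {
    int   nkeys;
    char  keys[CAP][16];             /* separator (or record) keys */
    struct node *child[CAP];         /* NULL in leaves             */
};

/* Pack 'count' items, CAP at a time, into completely full nodes.
 * 'children' is NULL when the leaf level is being packed.         */
static int build_level(char keys[][16], struct node **children, int count,
                       struct node **out)
{
    int nnodes = 0, i, j;
    for (i = 0; i < count; i += CAP) {
        struct node *n = calloc(1, sizeof *n);
        for (j = i; j < count && j < i + CAP; j++) {
            strcpy(n->keys[n->nkeys], keys[j]);
            n->child[n->nkeys] = children ? children[j] : NULL;
            n->nkeys++;
        }
        out[nnodes++] = n;
    }
    return nnodes;
}

/* Bottom-up load: build packed leaves, then index levels, until one root. */
static struct node *bulk_load(char keys[][16], int count)
{
    struct node **level = malloc(count * sizeof *level);
    struct node *root;
    int n = build_level(keys, NULL, count, level);

    while (n > 1) {                            /* build the next index level   */
        char (*seps)[16] = malloc(n * sizeof *seps);
        struct node **next = malloc(n * sizeof *next);
        int i;
        for (i = 0; i < n; i++)                /* first key of each node below */
            strcpy(seps[i], level[i]->keys[0]);
        n = build_level(seps, level, n, next);
        free(seps);
        free(level);
        level = next;
    }
    root = level[0];
    free(level);
    return root;
}

int main(void)
{
    char keys[12][16] = { "ada", "bob", "cal", "dot", "eve", "fay",
                          "gus", "hal", "ida", "joe", "kim", "lee" };
    struct node *root = bulk_load(keys, 12);
    printf("root holds %d separators, first is %s\n",
           root->nkeys, root->keys[0]);
    return 0;
}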
This discussion of indexes, and the importance of packing them as tightly as possible, brings home one of the interesting paradoxes of CD-ROM design. The CD-ROM disc has a relatively large storage capacity that usually gives us a great deal of freedom with regard to how we store data on the disc; a few bytes here or there usually doesn't matter much when you have 600 Mbytes of capacity. But when we design the index structures for CD-ROM, we find ourselves counting bytes, sometimes even counting bits as we pack information into a single byte or integer. The reason for this is not, in most cases, that we are running out of space on the disc, but because packing the index tightly can often save us from making an additional seek. In CD-ROM file design the cost of seeks adds up very quickly; the designer needs to get as much information out of every seek as possible.
A.5.4 Virtual Trees and Buffering Blocks
Given the very high cost of seeking on CD-ROM, we will want to keep blocks in RAM for as long as they are likely to be useful. The tree's root node should always be buffered. As we indicated in our discussion of virtual trees in Chapter 8, buffering nodes below the root can sometimes contribute significantly to reducing seek time, particularly when the buffering is intelligent in selecting the node to replace in the buffer. Buffering is most useful when successive accesses to the tree tend to be clustered in one area.
Note that packing the tree as tightly as possible during loading, which we discussed earlier as a way to reduce tree height, also increases the likelihood that an index block in RAM will be useful on successive accesses to the data.
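One simple way to be "intelligent in selecting the node to replace" is a least-recently-used rule. The C sketch below keeps a small pool of buffered tree blocks and, on a miss, overwrites the block that has gone unused the longest; the pool size, the structure names, and the stand-in read routine are assumptions for the example. In a real application the root block would, in addition, be held in the pool permanently.

#include <string.h>

#define POOL_SIZE   8          /* how many tree blocks we keep in RAM */
#define BLOCK_BYTES 8192       /* 8-Kbyte tree blocks                 */

struct buffer {
    long block_addr;           /* which block is cached (-1 = empty)  */
    long last_used;            /* logical clock of most recent access */
    char data[BLOCK_BYTES];
};

static struct buffer pool[POOL_SIZE];
static long clock_ticks = 0;

/* Stand-in for the application's low-level sector read. */
static void block_read_from_disc(long block_addr, char *dest)
{
    (void)block_addr;
    memset(dest, 0, BLOCK_BYTES);   /* a real version would read the disc */
}

static void buffer_init(void)
{
    int i;
    for (i = 0; i < POOL_SIZE; i++)
        pool[i].block_addr = -1;
}

/* Return the requested block, reading from the disc only on a miss and
 * replacing the least recently used buffer when that happens.          */
static char *get_block(long block_addr)
{
    int i, victim = 0;
    clock_ticks++;
    for (i = 0; i < POOL_SIZE; i++) {
        if (pool[i].block_addr == block_addr) {        /* hit: no seek needed */
            pool[i].last_used = clock_ticks;
            return pool[i].data;
        }
        if (pool[i].last_used < pool[victim].last_used)
            victim = i;                                /* remember LRU slot   */
    }
    block_read_from_disc(block_addr, pool[victim].data);   /* miss: one seek */
    pool[victim].block_addr = block_addr;
    pool[victim].last_used  = clock_ticks;
    return pool[victim].data;
}

int main(void)
{
    buffer_init();
    get_block(42);      /* first access reads the disc     */
    get_block(42);      /* second access is a buffer hit   */
    return 0;
}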
A.5.5 Trees as Secondary Indexes on CD-ROM
Typically, CD-ROM applications provide more than one access route to the data on the disc. For example, document retrieval applications usually give direct access to the documents, so you can page through them in sequence or call them up by name, chapter, or section while also providing access through an index of keywords or included terms. Similarly, in a telephone directory application you would have access to the database by name, but also by location (state, city, zip code, street address). As we described in Chapter 6, secondary indexes provide these multiple views of the data.
Chapter 5 raised the design issue of whether the secondary indexes should be tightly bound to the records they point to, or whether the binding should take place at retrieval time, through the use of a common key accessed through yet another index. Viewed another way, the issue is whether the target records should be pinned to a specific location through references in secondary indexes, or whether they should be left unpinned so they can be reorganized.
Records will never be reorganized on a CD-ROM; since it is a read-only disc, there is no disadvantage to having pinned records. Further,
minimizing the number of seeks is the overriding design consideration on
CD-ROM. Consequently, secondary index designs for CD-ROM should
usually bind the indexes to the target records as tightly as possible, ensuring
that once you have found the correct place in the index, you are ready to
retrieve the target with, at most, one additional seek.
One objection to this bind-tightly approach to CD-ROM index design is that, although it is true that the indexes cannot be reorganized once written to the CD-ROM, they are, in fact, quite frequently reorganized between successive "editions" of the disc. Many CD-ROM publications are reissued to keep them up to date. The period between successive versions may be years, or may be as short as a week. So, although pinned records cause no problem on the finished disc, they may cause a great deal of difficulty in the files used to prepare the disc.
There are a number of approaches to resolving this tension between what is best on the published disc and what is best for the files used to produce it. One solution is to maintain loosely bound records in the source database, transforming them to tightly bound records for publication on CD-ROM. CD-ROM product designers often fail to realize that the file structures placed on the disc can, and often should, be different than the file structures used to maintain the source data and produce the discs. Another solution, of course, is to trade off performance on the published disc for decreased costs in producing it. Production costs, time constraints, user acceptance, and competitive factors interact to determine which course is best. The key issue from the file designer's standpoint is to recognize that the alternatives exist, and then to be able to quantify the costs and benefits of each.
A.6 Hashed Files on CD-ROM

A.6.1 Design Exercises
Hashing, with its promise of single access retrieval, is an excellent way to organize indexes on CD-ROM. We begin with some design questions that intersect your knowledge of hashing with what you now know about CD-ROM. As you think through your answers, remember that your goal should be to avoid any additional seeking due to hash bucket overflow. As in any hashing design problem, the design parameters that you can manipulate are
Bucket size;
Packing density for the hashed index; and
The hash function itself.
The following questions, which you should try to answer before you read on, encourage you to think about ways to use these parameters to build efficient CD-ROM applications.
1. What considerations go into choosing a bucket size?
2. How does the relatively large storage capacity of CD-ROM assist in developing efficient hashed retrieval?
3. Since a CD-ROM is read-only, you have a complete list of the keys to be hashed before you create the disc. How can this assist in reducing retrieval costs?
A.6.2 Bucket Size
In Chapter 10 we showed how to reduce overflow, and therefore retrieval time, by grouping records into buckets, so each hashed address references an entire bucket of records. Since any access to a CD-ROM always reads in a minimum of a 2-Kbyte sector, the bucket size should be a multiple of 2 Kbytes. Having the bucket be only a part of a sector would be counterproductive. As we described in Chapter 3, transferring anything less than a sector means first moving the data into a system buffer, and from there into the user's data area. With transfers of a complete sector, many operating systems can move the data directly into the user area.
How many sectors should go into a bucket? As with trees, it is a trade-off between seeking and sequential reading. In addition, larger buckets require more searching and comparing to find the record once the bucket is read into RAM. In Chapter 10 we provided tools to allow you to calculate the effect of bucket size on the probability of overflow. For CD-ROM applications, you will want to use these tools to reduce the probability of overflow to almost nothing.
A.6.3 How the Size of the CD-ROM Helps
Packing the hashed file loosely is another good way to avoid overflow and additional seeking. A good rule of thumb is that, even with only a moderate bucket size, keeping the packing density below 60% will tend to avoid overflow almost all the time. Consulting Tables 10.4 and 10.5 in Chapter 10, we see that for randomly distributed keys, a packing density of 60% and a bucket size of 10 will reduce the percentage of records that overflow to 1.3% and will reduce the average number of seeks required for a successful search to 1.01. When there is unused space available on the disc, there is no disadvantage to expanding the size of the hashed index so overflow is virtually eliminated.
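The sizing decision itself is one line of arithmetic: given the number of keys, a bucket size, and a target packing density, the number of buckets, and therefore the space set aside for the hashed index, follows directly. A small C sketch with names of our own choosing:

#include <math.h>
#include <stdio.h>

#define SECTOR_BYTES 2048

/* Number of buckets needed so that 'nkeys' records stored in buckets of
 * 'bucket_size' records reach, at most, the given packing density.      */
static long buckets_needed(long nkeys, int bucket_size, double packing_density)
{
    return (long)ceil((double)nkeys / (bucket_size * packing_density));
}

int main(void)
{
    long   nkeys   = 1000000;   /* e.g. one million names                    */
    int    bucket  = 10;        /* records per bucket (as in Tables 10.4/10.5) */
    double density = 0.60;      /* keep packing density at or below 60%      */

    long nbuckets = buckets_needed(nkeys, bucket, density);
    printf("buckets needed: %ld\n", nbuckets);
    printf("index space if each bucket fills one sector: %ld Kbytes\n",
           nbuckets * (SECTOR_BYTES / 1024L));
    return 0;
}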
A.6.4 Advantages of CD-ROM's Read-Only Status
What if space is at a premium on the disc, and you need to find a way to pack your index so it is more than 90% full? Despite the relatively large capacity of CD-ROM discs, this situation is fairly common. Large text file collections often use most of the disc just for text. If the product is storing digitized images along with the text, the available space disappears even more quickly. Applications requiring the use of two discs at once are much harder to sell and deliver than a single disc application; when a disc is already nearly full of data, the index files are always a target for size reduction.
The calculations that we do to estimate the effects of bucket size and packing density assume a random distribution of keys across the address space. If we could find a hash function that would distribute the keys uniformly, rather than randomly, we could achieve 100% packing density and no overflow.
Once again, the fact that CD-ROM is read-only opens up possibilities that would not be available in a dynamic, read-write environment. When we produce a CD-ROM, we have all the keys that are to be hashed at hand. This means that we do not have to choose a hash function and then settle for whatever distribution of keys it produces, hoping for the best, but expecting a distribution that is merely random. Instead, we can select a hash function that provides the performance we need, given the set of keys we have to hash. If our performance and space constraints require it, we can develop a hash function that produces no overflow even at very high packing densities. We identify the selected hash function on the disc, along with the data, so the retrieval software knows how to locate the keys. This relatively expensive and time-consuming function-fitting effort is worthwhile because of the asymmetric nature of writing and reading CD-ROMs; the one-time effort spent in making the disc is paid back many times as the
disc is distributed to many users.
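One straightforward way to "select" a hash function when every key is known in advance is to treat a family of functions, indexed by a seed, as candidates and to keep trying seeds until one of them places the keys with no bucket overflowing; the winning seed is then recorded on the disc along with the data. The C sketch below does this for a toy key set. The particular hash function, the seed search, and all of the names are our own assumptions; they are not the method used by any specific CD-ROM product.

#include <stdio.h>
#include <string.h>

#define NBUCKETS     7
#define BUCKET_SIZE  3      /* records per bucket */

/* A seeded string hash; any family of functions indexed by 'seed' would do. */
static unsigned long hash(const char *key, unsigned long seed)
{
    unsigned long h = seed;
    while (*key)
        h = h * 31 + (unsigned char)*key++;
    return h % NBUCKETS;
}

/* Return 1 if 'seed' places the keys with no bucket overflowing. */
static int seed_works(char *keys[], int nkeys, unsigned long seed)
{
    int load[NBUCKETS] = {0}, i;
    for (i = 0; i < nkeys; i++)
        if (++load[hash(keys[i], seed)] > BUCKET_SIZE)
            return 0;
    return 1;
}

int main(void)
{
    char *keys[] = { "ADAMS", "BAKER", "BROWN", "CLARK", "DAVIS", "EVANS",
                     "GREEN", "HARRIS", "JONES", "KING", "LEWIS", "MILLER",
                     "MOORE", "SMITH", "TAYLOR", "WHITE" };
    int nkeys = sizeof keys / sizeof keys[0];
    unsigned long seed;

    for (seed = 0; seed < 100000; seed++) {
        if (seed_works(keys, nkeys, seed)) {
            printf("store seed %lu on the disc with the data\n", seed);
            return 0;
        }
    }
    printf("no suitable seed found; relax density or bucket size\n");
    return 1;
}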
A.7 The CD-ROM File System

A.7.1 The Problem
When the firms involved in developing CD-ROM applications came together to begin work on a common file system in late 1985, they were confronted with an interesting file structures problem. The design goals and constraints included the following:
Support hierarchical directory structures;
Find and open any one of thousands of files with only one or two seeks; and
Support the use of generic file names, as in "file*.c", during directory access.
The usual way to support hierarchical directories is to treat directories as nothing more than a special kind of file. If, using UNIX notation, you are looking for a file with the full path

/usr/home/mydir/filebook/cdrom/part3.txt
you look in the root directory (/) to find the directory file usr, then you open usr to find the location of the directory file home, you seek to home and open it to find mydir, and so on until you finally open the directory file named cdrom, where you find the location of the target file part3.txt. This is a very simple, flexible system; it is the approach used in MS-DOS, UNIX, VMS, and many other operating systems. The problem, from the standpoint of a CD-ROM developer, is that before we can find the location of part3.txt, we must seek to, open, and use six other files. At a half-second per seek on CD-ROM, such a directory structure results in a very unresponsive file system.
A.7.2 Design Exercise
At the time of the initial meetings to begin looking at a standard CD-ROM directory structure and file system, a number of vendors were using this treat-directories-as-files approach, literally replicating magnetic disc directory systems on CD-ROM. There were at least two alternative approaches that were commercially available and more specifically tailored to CD-ROM. One placed the entire directory structure in a single file, building a left child, right sibling tree to express the directory structure. Given the directory hierarchy in Fig. A.6, this system produced a file containing the tree shown in Fig. A.7. The other system created an index to the file locations by hashing the full path names of each file. The entries in the hash table for the directory structure in Fig. A.6 are shown in Fig. A.8.
Considering what you know about CD-ROM (slow seeking, read-only, and so on), think about these alternative file systems and try to answer the following questions. Keep in mind the design goals and constraints that were facing the committee (hierarchical structure, fast access to thousands of files, use of generic file names).
1. List the advantages and disadvantages of each system.
2. Try to come up with an alternative approach that combines the best features of the other systems while minimizing the disadvantages.
FIGURE A.6 A sample directory hierarchy. (The ROOT directory contains the directories REPORTS and LETTERS and the file CALLS.LOG; REPORTS contains the SCHOOL and WORK directories; LETTERS contains the PERSONAL and WORK directories along with letter files; the data files include S1.RPT, S2.RPT, W1.RPT, P1.LTR, P2.LTR, and W1.LTR.)
FIGURE A.7 Left child, right sibling tree used to express the directory structure of Fig. A.6.
FIGURE A.8 Hashed index of file pathnames. The full path names
/REPORTS/SCHOOL/S1.RPT
/REPORTS/SCHOOL/S2.RPT
/REPORTS/WORK/W1.RPT
/LETTERS/PERSONAL/P1.LTR
/LETTERS/PERSONAL/P2.LTR
/LETTERS/ABC.LTR
/LETTERS/XYZ.LTR
/LETTERS/WORK/W1.LTR
/CALLS.LOG
are passed through a hash function into a hash table of path names.
A.7.3 A Hybrid Design
Placing the entire directory structure into a single file, as with the left-child, right sibling tree, works well as long as the directory structure is small. If the file containing the tree fits into a few kilobytes, the entire directory structure can be held in RAM and can be accessed without any seeking at all. But if the directory structure is large, containing thousands of files, accessing the various parts of the tree can require multiple seeks, just as it does when each directory is a separate file.
Hashing the path names, on the other hand, provides single-seek access to any file but does a very poor job of supporting generic file and directory names, such as prog*.c, or even a simple command such as ls or dir to list all the files in a given subdirectory. By definition, hashing randomizes the distribution of the keys, scattering them over the directory space. Finding all of the files in a given subdirectory, say the LETTERS subdirectory for the tree shown in Fig. A.6, requires a sequential reading of the entire directory.
What about a hybrid approach, in which we build a conventional directory structure that uses a file for each directory and then supplement this by building a hashed index to all the files in all directories? This approach allows us to get to any subdirectory, and therefore to the information required to open a file, with a single seek. At the same time, it provides us with the ability to work with all the files inside each directory, using generic file names and commands such as ls and dir. In short, we build a conventional directory structure to get the advantages of that approach, and then solve the access problem by building an index for the subdirectories.
FIGURE A.9 Path index table of directories.

RRN   Directory   Parent
0     ROOT        -1
1     REPORTS      0
2     LETTERS      0
3     SCHOOL       1
4     WORK         1
5     PERSONAL     2
6     WORK         2
This is very close to the approach that the committee settled on. But they went one step further. Since the directory structure is a highly organized, hierarchical collection of files, the committee decided to use a special index that takes advantage of that hierarchy, rather than simply hashing the path names of the subdirectories. Figure A.9 shows what this index structure looks like when it is applied to the directory structure in Fig. A.6. Only the directories are listed in the index; access to the data files is through the directory files. The directories are ordered in the index so parents always appear before their children. Each child is associated with an integer that is a backward reference to the relative record number (RRN) of the parent. This allows us to distinguish between the WORK directory under REPORTS and the WORK directory under LETTERS. It also allows us to traverse the directory structure, moving both up and down with a command such as the cd command in DOS or UNIX, without having to access the actual directory files on the CD-ROM. It is a good example of a specialized index structure that makes use of the organization inherent in the data to produce a very compact, highly functional access mechanism.
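A short C sketch shows how compact such an index is and how a path can be resolved against it entirely in RAM, without touching the directory files on the disc. The table contents mirror Fig. A.9; the record numbering, which places the root at RRN 0, and the function names are our own assumptions.

#include <stdio.h>
#include <string.h>

struct dir_entry {
    char name[12];
    int  parent;          /* RRN of the parent directory, -1 for the root */
};

/* In-RAM copy of the path index table of Fig. A.9. */
static struct dir_entry index_table[] = {
    { "ROOT",     -1 },   /* RRN 0 */
    { "REPORTS",   0 },   /* RRN 1 */
    { "LETTERS",   0 },   /* RRN 2 */
    { "SCHOOL",    1 },   /* RRN 3 */
    { "WORK",      1 },   /* RRN 4: the WORK under REPORTS */
    { "PERSONAL",  2 },   /* RRN 5 */
    { "WORK",      2 }    /* RRN 6: the WORK under LETTERS */
};
#define NDIRS (int)(sizeof index_table / sizeof index_table[0])

/* Find the RRN of a subdirectory 'name' whose parent has RRN 'parent'. */
static int find_child(int parent, const char *name)
{
    int i;
    for (i = 0; i < NDIRS; i++)
        if (index_table[i].parent == parent &&
            strcmp(index_table[i].name, name) == 0)
            return i;
    return -1;
}

/* Resolve a path such as "/LETTERS/WORK" to the RRN of its directory,
 * using only the in-RAM index; no seeks to directory files are needed. */
static int resolve(const char *path)
{
    char buf[64], *component;
    int rrn = 0;                        /* start at ROOT */
    strncpy(buf, path, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (component = strtok(buf, "/"); component != NULL && rrn >= 0;
         component = strtok(NULL, "/"))
        rrn = find_child(rrn, component);
    return rrn;
}

int main(void)
{
    printf("/LETTERS/WORK   -> RRN %d\n", resolve("/LETTERS/WORK"));   /* 6 */
    printf("/REPORTS/WORK   -> RRN %d\n", resolve("/REPORTS/WORK"));   /* 4 */
    printf("/REPORTS/SCHOOL -> RRN %d\n", resolve("/REPORTS/SCHOOL")); /* 3 */
    return 0;
}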
Summary
CD-ROM is an electronic publishing medium that allows us to replicate and distribute large amounts of information very inexpensively. The primary disadvantage of CD-ROM is that seek performance is relatively slow. This is not a problem that can be solved simply by building better drives; the limits in seek performance grow directly from the fact that CD-ROM is built on top of the CD audio standard. Adherence to this standard, even given its limitations, is the basis for CD-ROM's success as a publishing medium. Consequently, CD-ROM application developers must look to careful file structure design to build fast, responsive retrieval software.
B-tree and B+ tree structures work well on CD-ROM because of their ability to provide access to many keys with just a few seeks. Because the sector size on CD-ROM is 2 Kbytes, the block size used in a tree should be 2 Kbytes or an even multiple of this sector size. Because CD-ROM drives seek so slowly, it is usually advantageous to use larger blocks consisting of 8 Kbytes or more. Since no additions or deletions will be made to a tree once it is on CD-ROM, it is useful to build the trees from the bottom up so the blocks are completely filled. When using trees to create secondary indexes, the read-only nature of CD-ROM makes it possible to bind the indexes tightly to the target data, pinning the index records to reduce seeking and increase performance.
Hashed indexes are often a good choice for CD-ROM because they can provide single-seek access to target information. As with trees, the 2-Kbyte sector size affects the design of the hashed index: The bucket size should be one or more full sectors. Since CD-ROMs are large, there is often enough space on the disc to permit use of packing densities of 60% or less for the hashed index. Use of packing densities of less than 60%, combined with a bucket size of 10 or more records, results in single-seek access for almost all records. But it is not always possible to pack the index this loosely. Higher packing densities can be accommodated without loss of performance if we tailor the hash function to the records in the index, using a function that provides a more nearly uniform distribution. Since we know that there will be no deletions or additions to the index, and since time spent optimizing the index will result in benefits again and again as the discs are used, it is often worthwhile to invest the effort in finding the best of several hash functions. This is especially true when we need to support higher packing densities of 90% or more.
In 1985 companies trying to build the CD-ROM publishing market faced an interesting file structure problem. They realized that they needed a common directory structure and file system for CD-ROM. At the time, there were no directory structure designs in use on CD-ROM that provided nearly optimal performance across a wide variety of applications.
Directory structures are usually implemented as a series of files. Moving from a directory to a subdirectory beneath it means seeking to another file. This is not a good design for CD-ROM, since it could result in a wait of several seconds just to locate and open a single file. Simple alternatives, such as putting the entire directory structure in a single file, or hashing the path names of all the files on a disc, have other drawbacks. The committee charged with solving this problem emerged with a design that combined a conventional hierarchical directory of files with an index to the directory. The index makes use of the structure inherent in the directory hierarchy to provide a very compact, yet functional map of the directory
structure. Typical of other CD-ROM indexing problems, this directory index illustrates the importance of building indexes very tightly on CD-ROM, despite the vast, often unused capacity of the CD-ROM disc. Tight, dense indexes work better on CD-ROM because they require fewer seeks to access. Avoiding seeks is the key consideration for all CD-ROM file structure design.
Appendix B
ASCII Table
Dec. Oct. Hex.      Dec. Oct. Hex.     Dec. Oct. Hex.     Dec. Oct. Hex.
  0    0   00 nul    32   40   20 sp    64  100   40 @     96  140   60 `
  1    1   01 soh    33   41   21 !     65  101   41 A     97  141   61 a
  2    2   02 stx    34   42   22 "     66  102   42 B     98  142   62 b
  3    3   03 etx    35   43   23 #     67  103   43 C     99  143   63 c
  4    4   04 eot    36   44   24 $     68  104   44 D    100  144   64 d
  5    5   05 enq    37   45   25 %     69  105   45 E    101  145   65 e
  6    6   06 ack    38   46   26 &     70  106   46 F    102  146   66 f
  7    7   07 bel    39   47   27 '     71  107   47 G    103  147   67 g
  8   10   08 bs     40   50   28 (     72  110   48 H    104  150   68 h
  9   11   09 ht     41   51   29 )     73  111   49 I    105  151   69 i
 10   12   0A nl     42   52   2A *     74  112   4A J    106  152   6A j
 11   13   0B vt     43   53   2B +     75  113   4B K    107  153   6B k
 12   14   0C np     44   54   2C ,     76  114   4C L    108  154   6C l
 13   15   0D cr     45   55   2D -     77  115   4D M    109  155   6D m
 14   16   0E so     46   56   2E .     78  116   4E N    110  156   6E n
 15   17   0F si     47   57   2F /     79  117   4F O    111  157   6F o
 16   20   10 dle    48   60   30 0     80  120   50 P    112  160   70 p
 17   21   11 dc1    49   61   31 1     81  121   51 Q    113  161   71 q
 18   22   12 dc2    50   62   32 2     82  122   52 R    114  162   72 r
 19   23   13 dc3    51   63   33 3     83  123   53 S    115  163   73 s
 20   24   14 dc4    52   64   34 4     84  124   54 T    116  164   74 t
 21   25   15 nak    53   65   35 5     85  125   55 U    117  165   75 u
 22   26   16 syn    54   66   36 6     86  126   56 V    118  166   76 v
 23   27   17 etb    55   67   37 7     87  127   57 W    119  167   77 w
 24   30   18 can    56   70   38 8     88  130   58 X    120  170   78 x
 25   31   19 em     57   71   39 9     89  131   59 Y    121  171   79 y
 26   32   1A sub    58   72   3A :     90  132   5A Z    122  172   7A z
 27   33   1B esc    59   73   3B ;     91  133   5B [    123  173   7B {
 28   34   1C fs     60   74   3C <     92  134   5C \    124  174   7C |
 29   35   1D gs     61   75   3D =     93  135   5D ]    125  175   7D }
 30   36   1E rs     62   76   3E >     94  136   5E ^    126  176   7E ~
 31   37   1F us     63   77   3F ?     95  137   5F _    127  177   7F del
Appendix C
String Functions in Pascal: tools.prc
Functions and Procedures Used to Operate on strng

The following functions and procedures make up the tools for operating on variables that are declared as:

TYPE
   strng = packed array [0..MAX_REC_LGTH] of char;

The length of the strng is stored in the zeroth byte of the array as a character representative of the length. Note that the Pascal functions CHR() and ORD() are used to convert integers to characters and vice versa.

Functions include:

len_str(str)             Returns the length of str.
clear_str(str)           Clears str by setting its length to 0.
copy_str(str1,str2)      Copies contents of str2 to str1.
cat_str(str1,str2)       Concatenates str2 to the end of str1. Puts result in str1.
read_str(str)            Reads str as input from the keyboard.
write_str(str)           Writes contents of str to the screen.
fread_str(fd,str,lgth)   Reads a str with length lgth from file fd.
fwrite_str(fd,str)       Writes contents of str to file fd.
trim_str(str)            Trims trailing blanks from str. Returns length of str.
ucase(str1,str2)         Converts str1 to uppercase, storing result in str2.
makekey(last,first,key)  Combines last and first into key in canonical form, storing result in key.
min(int1,int2)           Returns the minimum of two integers.
cmp_str(str1,str2)       Compares str1 to str2:
                         If str1 = str2, cmp_str returns 0.
                         If str1 < str2, cmp_str returns a negative number.
                         If str1 > str2, cmp_str returns a positive number.
FUNCTION len_str (str: strng): integer;
{ len_str() returns the length of str }
BEGIN
   len_str := ORD(str[0])
END;

PROCEDURE clear_str (VAR str: strng);
{ clear_str() clears str by setting its length to 0 }
BEGIN
   str[0] := CHR(0)
END;

PROCEDURE copy_str (VAR str1: strng; str2: strng);
{ copy_str() copies str2 into str1 }
VAR
   i : integer;
BEGIN
   for i := 1 to len_str(str2) DO
      str1[i] := str2[i];
   str1[0] := str2[0]
END;

PROCEDURE cat_str (VAR str1: strng; str2: strng);
{ cat_str() concatenates str2 to the end of str1 and stores the result in str1 }
VAR
   i : integer;
BEGIN
   for i := 1 to len_str(str2) DO
      str1[len_str(str1) + i] := str2[i];
   str1[0] := CHR(len_str(str1) + len_str(str2))
END;

PROCEDURE read_str (VAR str: strng);
{ read_str() reads str as input from the keyboard }
VAR
   lgth : integer;
BEGIN
   lgth := 0;
   while (not EOLN) and (lgth <= MAX_REC_SIZE) DO
   BEGIN
      lgth := lgth + 1;
      read(str[lgth])
   END;
   readln;
   str[0] := CHR(lgth)
END;

PROCEDURE write_str (VAR str: strng);
{ write_str() writes str to the screen }
VAR
   i : integer;
BEGIN
   for i := 1 to len_str(str) DO
      write(str[i]);
   writeln
END;

PROCEDURE fread_str (VAR fd: text; VAR str: strng; lgth: integer);
{ fread_str() reads a str with length lgth from file fd }
VAR
   i : integer;
BEGIN
   for i := 1 to lgth DO
      read(fd, str[i]);
   str[0] := CHR(lgth)
END;

PROCEDURE fwrite_str (VAR fd: text; str: strng);
{ fwrite_str() writes str to file fd }
VAR
   i : integer;
BEGIN
   for i := 1 to len_str(str) DO
      write(fd, str[i])
END;

FUNCTION trim_str (VAR str: strng): integer;
{ trim_str() trims the blanks off the end of str and returns its new length }
VAR
   lgth : integer;
BEGIN
   lgth := len_str(str);
   while str[lgth] = ' ' DO
      lgth := lgth - 1;
   str[0] := CHR(lgth);
   trim_str := lgth
END;

PROCEDURE ucase (str1: strng; VAR str2: strng);
{ ucase() converts str1 to uppercase letters and stores the
  capitalized string in str2 }
VAR
   i : integer;
BEGIN
   for i := 1 to len_str(str1) DO
   BEGIN
      if (ORD(str1[i]) >= ORD('a')) AND (ORD(str1[i]) <= ORD('z')) then
         str2[i] := CHR(ORD(str1[i]) - 32)
      else
         str2[i] := str1[i]
   END;
   str2[0] := str1[0]
END;

PROCEDURE makekey (last: strng; first: strng; VAR key: strng);
{ makekey() trims the blanks off the ends of the strngs last and first,
  concatenates last and first together with a space separating them,
  and converts the letters to uppercase }
VAR
   lenl : integer;
   lenf : integer;
   blank_str : strng;
BEGIN
   lenl := trim_str(last);
   copy_str(key, last);
   blank_str[0] := CHR(1);
   blank_str[1] := ' ';
   cat_str(key, blank_str);
   lenf := trim_str(first);
   cat_str(key, first);
   ucase(key, key)
END;

FUNCTION min (int1, int2: integer): integer;
{ min() returns the minimum of two integers }
BEGIN
   if int1 <= int2 then
      min := int1
   else
      min := int2
END;

FUNCTION cmp_str (str1: strng; str2: strng): integer;
{ cmp_str() compares str1 to str2. If str1 = str2, cmp_str returns 0.
  If str1 < str2, cmp_str returns a negative number. If str1 > str2,
  cmp_str returns a positive number. }
VAR
   i      : integer;
   length : integer;
BEGIN
   if len_str(str1) = len_str(str2) then
   BEGIN
      i := 1;
      while (str1[i] = str2[i]) and (i < len_str(str1)) DO
         i := i + 1;
      if str1[i] = str2[i] then
         cmp_str := 0
      else
         cmp_str := ORD(str1[i]) - ORD(str2[i])
   END
   else
   BEGIN
      length := min(len_str(str1), len_str(str2));
      i := 1;
      while (str1[i] = str2[i]) and (i <= length) DO
         i := i + 1;
      if i > length then
         cmp_str := len_str(str1) - len_str(str2)
      else
         cmp_str := ORD(str1[i]) - ORD(str2[i])
   END
END;
Appendix D
Comparing Disk Drives
There are enormous differences among different types of drives in terms of
the amount of data they hold, the time it takes them to access data, overall
cost, cost per bit, and intelligence. Furthermore, disk devices and media are
evolving so rapidly that the figures on speed, capacity, and intelligence that
apply one month may very well be out of date the next month.
Access time, you will recall, is composed of seek time, rotational delay,
and transfer time.
Seek times are usually described in two ways: minimum seek time and average seek time. Usually, but not always, minimum seek time includes the time it takes for the head to accelerate from a standstill, move one track, and settle to a stop. Sometimes the track-to-track seek time is given, with a separate figure for head settling time. One has to be careful with figures such as these since their meanings are not always stated clearly. Average seek time is the average time it takes for a seek if the desired sector is as likely to be on any one cylinder as it is on any other. In a
completely random accessing environment, it can be shown that the
number of cylinders covered in an average seek is approximately one-third
of the total number of cylinders (Pechura and Schoeffler, 1983). Estimates
of average seek time are commonly based on this result.
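Putting the pieces together gives a rough estimate of the average time to read one sector: an average seek, half a rotation of rotational delay, and the transfer time for the sector itself. The short C calculation below illustrates the arithmetic; the numbers plugged in are arbitrary, not the figures for any particular drive.

#include <stdio.h>

int main(void)
{
    /* Illustrative drive parameters only. */
    double avg_seek_ms     = 16.0;       /* quoted average seek time      */
    double rpm             = 3600.0;
    double bytes_per_track = 40000.0;
    double sector_bytes    = 512.0;

    double ms_per_rotation  = 60000.0 / rpm;                /* about 16.7 ms */
    double rotational_delay = ms_per_rotation / 2.0;        /* half a turn   */
    double transfer_ms      = ms_per_rotation *
                              (sector_bytes / bytes_per_track);

    printf("estimated average access time for one sector: %.2f ms\n",
           avg_seek_ms + rotational_delay + transfer_ms);
    return 0;
}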
Certain disk drives, called fixed head disk drives, require no seek time. Fixed head drives provide one or more read/write heads per track, so there is no need to move the heads from track to track. Fixed head disk drives are very fast, but also considerably more expensive than movable head drives.
There are generally no significant differences in rotational delay among
similar drives. Most floppy disk drives rotate between 300 and 600 rpm.
Hard disk drives generally rotate at approximately 3600 rpm, though this
will increase as disks decrease in physical size. There is at least one drive that
rotates at 5400 rpm, and speeds of 7200 rpm are possible. Floppy disks
usually do not spin continuously, so intermittent accessing of floppy drives
might involve an extra delay due to startup of a second or more. Strategies
such as sector interleaving can mitigate the effects of rotational delay in some circumstances.
The volume of data to be transferred has increased enormously in recent years, thereby focusing much attention on data transfer rate. Data transfer rate from a single drive is constrained by rotation speed, recording density on the disk itself, and the speed at which the controller can pass data through to or from RAM. Since rotation speeds vary little, the main differences among drives are due to differences in recording density. In recent years there have been tremendous advances in improving recording densities on disks of all types. Differences in recording densities are usually expressed in terms of the number of tracks per surface, and the number of bytes per track. If data are organized by sector on a disk, and more than one sector is transferred at a time, the effective data transfer rate depends also on the method of sector interleaving used. The effect of interleaving can be substantial, of course, since logically adjacent sectors are often widely separated physically.
A different approach to increasing data transfer rate is to access data from different places simultaneously. A technology called PTD (parallel transfer disk) reads
and writes data simultaneously from multiple read/write
heads. The Seagate Sable PTD reaches a transfer rate of over 20 Mbytes per
second using eight read/write heads.
Another promising technology for achieving high transfer rates is
RAID (redundant arrays of inexpensive disks), in which a collection of
small inexpensive disks function as one. RAIDs allow the use of several
separate I/O controllers operating in parallel. These parallel accesses can be
coordinated to satisfy a single logical I/O request, or can service several
independent I/O requests simultaneously.
Although it is very possible that most of the figures in Table D.l will
be superseded during the time between the writing and the publication of
this text, they should give you a basic idea of the magnitude and range of
performance characteristics for disks. The fact that they are changing so
rapidly should also serve to emphasize the importance of being aware of
disk drive performance characteristics when you are in a position to choose
among different drives.
Of course, in addition to the quantitative differences among drives,
there are other important differences. The IBM 3380 drive, for example, has
many built-in features, including separate actuator arms that allow it to
perform two accesses simultaneously. It also has large local buffers and a
it to optimize many operations that,
have to be monitored by the central
great deal of local intelligence, enabling
with less sophisticated drives,
computer.
TABLE D.1 Performance characteristics of a range of disk drives.
Bibliography
AT&T. System V Interface Definition. Indianapolis, IN: AT&T, 1986.
Baase, S. Computer Algorithms: Introduction to Design and Analysis. Reading, Mass.: Addison-Wesley, 1978.
Batory, D.S. "B+ trees and indexed sequential files: A performance comparison." ACM SIGMOD (1981): 30-39.
Bayer, R., and E. McCreight. "Organization and maintenance of large ordered indexes." Acta Informatica 1, no. 3 (1972): 173-189.
Bayer, R., and K. Unterauer. "Prefix B-trees." ACM Transactions on Database Systems 2, no. 1 (March 1977): 11-26.
Bentley, J. "Programming pearls: A spelling checker." Communications of the ACM 28, no. 5 (May 1985): 456-462.
Bohl,
M.
Introduction to
IBM
Direct Access Storage Devices. Chicago: Science Re-
search Associates, Inc., 1981.
Borland. Turbo Toolbox Reference Manual. Scott's Valley, Calif: Borland International, Inc., 1984.
Bourne, S.R. The Unix System. Reading, Mass.: Addison-Wesley. 1984.
Bradley, J. File and Data Base Techniques. New York: Holt, Rinehart, and Winston, 1982.
Chaney, R., and B. Johnson. "Maximizing hard-disk performance." Byte 9, no.
5 (May 1984): 307-334.
Chang, C.C. "The study of an ordered minimal perfect hashing scheme." Communications of the ACM 27, no. 4 (April 1984): 384-387.
Chang, H. "A Study of Dynamic Hashing and Dynamic Hashing with Deferred Splitting." Unpublished Master's thesis, Oklahoma State University, December 1985.
Chichelli, R.J. "Minimal perfect hash functions made simple." Communications of the ACM 23, no. 1 (January 1980): 17-19.
Comer, D. "The ubiquitous B-tree."
ACM Computing
Surveys 11, no. 2 (June
1979): 121-137.
Cooper, D. Standard Pascal User Reference Manual.
New
York:
W.W. Norton &
Co., 1983.
Crotzer, A.D. "Efficacy of B-trees in an information storage and retrieval envi-
ronment." Unpublished Master's
thesis,
Oklahoma
State University, 1975.
Davis, W.S. "Empirical behavior of B-trees." Unpublished Master's thesis,
Oklahoma
State University, 1974.
H. An Introduction to Operating Systems. Revised 1st Ed. Reading, Mass.:
Addison-Wesley, 1984.
Digital. Introduction to VAX-11 Record Management Services. Order No. AADeitel,
D024A-TE.
Equipment Corporation, 1978.
Equipment Corporation, 1981.
RMS-II User's Guide. Digital Equipment Corporation, 1979.
VAX-U SORT/MERGE User's Guide. Digital Equipment Corporation,
Digital
Digital. Peripherals Handbook. Digital
Digital.
Digital.
1984.
VAX Software Handbook. Digital Equipment Corporation, 1982.
Dodds, DJ. "Pracnique: Reducing dictionary size by using a hashing technique."
Digital.
Communications of the
ACM 25,
Dwyer, B. "One more time
the
ACM 24,
no.
how
no. 6 (June 1982): 368-370.
to update a master file." Communications of
(January 1981): 3-8.
ACM
Computing
and H.C. Du. "Dynamic Hashing Schemes."
Qune 1988): 85-113.
Fagin, R., J. Nievergelt, N. Pippenger, and H.R. Strong. "Extendible hashTransactions on Database
ing
a fast access method for dynamic files."
Systems 4, no. 3 (September 1979): 315-344.
Computing Surveys 17, no. 1
Faloutsos, C. "Access methods for text."
(March 1985): 49-74.
Flajolet, P. "On the Performance Evaluation of Extendible Hashing and Trie
Searching." Acta Informatica 20 (1983): 345-369.
Enbody,
R.J.,
Surveys 20, no. 2
ACM
ACM
Flores,
I.
Peripheral Devices.
Englewood
Cliffs, N.J.: Prentice-Hall, 1973.
Gonnet, G.H. Handbook of Algorithms and Data Structures. Reading, Mass.: Addison-Wesley, 1984.
Hanson, O. Design of Computer Data Files. Rockville, Md.: Computer Science
Press, 1982.
Held, G., and
ACM 21,
M.
Stonebraker. "B-trees reexamined." Communications of the
no. 2 (February 1978): 139-143.
Hoare, C.A.R. "The emperor's old clothes." The C.A.R. Turing
dress. Communications of the
ACM 24,
Award
ad-
no. 2 (February 1981): 75-83.
IBM. DFSORT General Information. IBM Order No. GC33-4033-11.
IBM. OS/VS Virtual Storage Access Method (VSAM) Planning Guide. IBM Order
No. GC26-3799.
Jensen, K., and N. Wirth. Pascal User Manual and Report, 2d Ed. Springer Verlag,
1974.
Keehn, D.G., andJ.O. Lacy.
"VSAM
data set design parameters."
IBM
Systems
fournal 13, no. 3 (1974): 186-212.
Kernighan, B., and R. Pike. The
UNIX Programming
Environment.
Englewood
Cliffs, N.J.: Prentice-Hall, 1984.
Kernighan, B., and D. Ritchie. The
N.J.: Prentice-Hall, 1978.
Programming Language. Englewood
Cliffs,
Kernighan, B., and D. Ritchie. The
wood
Programming Language, 2nd Ed. Engle-
Cliffs, N.J.: Prentice-Hall, 1988.
Knuth, D. The Art of Computer Programming. Vol.
Ed. Reading, Mass.: Addison-Wesley, 1973a.
1,
Fundamental Algorithms. 2d
Knuth, D. The Art of Computer Programming. Vol. 3, Searching and Sorting. Reading, Mass.: Addison-Wesley, 1973b.
Lang, S.D., J.R. Driscoll, and J.H. Jou. "Batch insertion for tree structured file
improving differential database representation." C.^-TR-85,
Department of Computer Science, University of Central Florida, Orlando,
organizations
Flor.
Lapin, J.E. Portable
and
UNIX
System Programming. Englewood Cliffs, N.J.:
Prentice-Hall, 1987.
"Dynamic Hashing." BIT
Larson, P.
18 (1978): 184-201.
ACM
Larson, P. "Linear Hashing with Overflow-handling by Linear Probing."
Transactions on Database Systems 10, no.
(March
75-89.
1985):
Larson, P. "Linear Hashing with Partial Expansions." Proceedings of the 6th Conference on Very Large Databases. (Montreal, Canada Oct 1-3, 1980) New
ACM/IEEE: 224-233.
York:
Larson, P. "Performance Analysis of Linear Hashing with Partial Expansions."
ACM
Laub, L.
S.
Transactions on Database Systems 7, no. 4
"What
is
CD-ROM?"
Ropiequet, eds.
(December
1982):
566-587.
CD-ROM: The New Papyrus. S. Lambert
Redmond, WA: Microsoft Press, 1986: 47-71.
In
and
M.K. McKusick, M. Karels, andJ.S. Quarterman. The Design and
4.3BSD UNIX Operating System. Reading, Mass.: Addi-
Leffler, S.,
Implementation of the
son-Wesley, 1989.
Levy, M.R. "Modularity and the sequential
file
update problem." Communications
ACM
25, no. 6 (June 1982): 362-367.
of the
Litwin, W. "Linear Hashing: A New Tool for File and Table Addressing." Proceedings of the 6th Conference on Very Large Databases (Montreal,
1-3, 1980)
Litwin,
W.
New
York:
"Virtual Hashing:
the 4th Conference on
Canada, Oct
ACM/IEEE: 212-223.
Dynamically Changing Hashing.
Very Large Databases (Berlin 1978)
New
"'
Proceedings oj
York:
ACM/
IEEE: 517-523.
Loomis, M. Data Management and
File Processing.
Englewood
Cliffs, N.J.:
Pren-
tice-Hall, 1983.
Lorin, H. Sorting and Sort Systems. Reading, Mass.: Addison-Wesley, 1975.
Lum, V.Y.,
P.S. Yuen, and M. Dodd. "Key-to-Address Transform Techniques,
Fundamental Performance Study on Large Existing Formatted Files."
Communications of the
14, no. 4 (April 1971): 228-39.
ACM
Lynch, T. Data Compression Techniques and Applications.
trand Reinhold
Madnick, S.E., and
Company,
J.J.
Inc.,
Donovan.
New
York: Van Nos-
1985.
Operatifig Systems.
Englewood
Cliffs. N.J.:
Prentice-Hall, 1974.
Maurer, W.D., and T.G. Lewis. "Hash table methods."
7, no. 1 (March 1975): 5-19.
ACM Computing
Surveys
McCreight, E. "Pagination of B* trees with variable length records." Communications of the ACM 20, no. 9 (September 1977): 670-674.
McKusick, M.K., W.M. Joy, S.J. Leffler, and R.S. Fabry. "A fast file system for UNIX." ACM Transactions on Computer Systems 2, no. 3 (August 1984): 181-197.
Mendelson, H. "Analysis of Extendible Hashing." IEEE Transactions on Software
Engineering 8, no. 6 (November 1982): 611-619.
Microsoft, Inc. Disk Operating System. Version 2.00.
Language Series. IBM, 1983.
Morgan, R., and H. McGilton. Introducing
UNIX
IBM
Personal
System V.
New
Computer
York:
Mc-
Graw-Hill, 1987.
Murayama,
K., and S.E. Smith. "Analysis of design alternatives for virtual
memory
indexes." Communications of the
ACM 20,
no. 4 (April 1977):
245-254.
Nievergelt,
J.,
H. Hinterberger, and K. Sevcik. "The grid
metric, multikey
1
file
structure."
ACM
file:
an adaptive sym-
Transactions on Database Systems 9, no.
(March 1984): 38-71.
Ouskel, M., and P. Scheuermann. "Multidimensional B-trees: Analysis of dynamic behavior." BIT 21 (1981):401-418.
Pechura, M.A., and J.D. Schoeffler. "Estimating file access of floppy disks." Communications of the ACM 26, no. 10 (October 1983): 754-763.
Peterson, J.L., and A. Silberschatz. Operating System Concepts, 2nd Ed. Reading,
Mass.: Addison- Wesley, 1985.
Peterson,
W.W.
"Addressing for random access storage."
and Development
1,
IBM fournal
of Research
no. 2(1957):130-146.
and T. Sterling. A Guide to Structured Programming and PE/I. 3rd Ed.
York: Holt, Rinehart, and Winston, 1980.
Ritchie, B., and K. Thompson. "The UNIX time-sharing system." Communications of the
17, no. 7 (July 1974): 365-375.
Pollack, S.,
New
ACM
Ritchie,
D. The Unix I/O System. Murray
Hill, N.J.:
AT&T
Bell Laboratories,
1979.
Robinson, J.T. "The K-d B-tree:
dynamic indexes."
search structure for large multidimensional
ACM SIGMOD
1981 International Conference on Manage-
ment of Data. April 29-May 1, 1981.
Rosenberg, A.L., and L. Snyder. "Time and space optimality in B-trees."
ACM
(March 1981): 174-183.
Sager, T.J. "A polynomial time generator for minimal perfect hash functions."
Communications of the
28, no. 5 (May 1985): 523-532.
Salton, G., and M. McGill. Introduction to Modern Information Retrieval. McGrawTransactions on Database Systems 6, no.
ACM
Hill, 1983.
Salzberg, B. File Structures.
Salzberg, B., et
al.
Englewood
"FastSort:
Cliffs, N.J.: Prentice-Hall, 1988.
Distributed Single-Input, Single-Output Sort."
ACM SIGMOD International Conference on Management
SIGMOD RECORD, Vol. 19, Issue 2, (June 1990): 94-101.
Proceedings of the 1990
of Data,
Scholl,
M. "New
tions
file
organizations based on dynamic hashing."
on Database Systems
6,
no.
(March
1981): 194-211.
ACM
Transac-
579
BIBLIOGRAPHY
Severance, D.G. "Identifier search mechanisms:
ACM Computing Surveys 6,
model."
"On
Snyder, L.
survey and generalized
no. 3 (September 1974): 175-194.
B-trees reexamined." Communications of the
ACM 21,
no. 7 (July
1978): 594.
J. P. Tremblay and R.F. Deutscher. "Key-to-Address Transformation Techniques." INFOR (Canada) Vol. 16, no. 1 (1978): 397-409.
Spector, A., and D. Gifford. "Case study: The space shuttle primary computer
system." Communications of the
27, no. 9 (September 1984): 872-900.
Standish, T.A. Data Structure Techniques. Reading, Mass.: Addison-Wcsley, 1980.
Sun Microsystems. Networking on the Sun Workstation. Mountain View, CA: Sun
Microsystems, Inc., 1986.
Sussenguth, E.H. "The use of tree structures for processing files." Communications of the
6, no. 5 (May 1963): 272-279.
Sorenson, P.G.,
ACM
ACM
"Keyfield design." Datamation (October
Sweet,
F.
Teory,
T.J.,
and
1,
1985): 119-120.
Fry. Design of Database Structures.
J. P.
Englewood
Cliffs, N.J.:
Prentice-Hall, 1982.
The Joint ANSI/IEEE
Pascal Standards Committee. "Pascal:
SIGPLAN Notices
Forward
to the can-
28-44.
and P.G. Sorenson. An Introduction to Data Structures with Applications. New York: McGraw-Hill, 1984.
Ullman, J. Principles of Database Systems, 2d Ed. Rockville, Md.: Computer Scididate extension library."
Tremblay,
19, no. 7 (July 1984):
J. P.,
ence Press, 1980.
Ullman, J.D.
Principles of Database Systems,
3d Ed. Rockville, Md.: Computer
Science Press, 1986.
U.C. Berkeley.
UNIX Programmer's
Reference Manual. University
of California
Berkeley, 1986.
VanDoren,
the
"Some
J.
NSF-CBMS
empirical results on generalized
zation and Retrieval. University
VanDoren,
J.,
and
AVL
trees." Proceedings of
Regional Research Conference on Automatic Information Organi-
J.
Gray.
In Information Systems,
"An
of Missouri
at
Columbia
(July 1973):
algorithm for maintaining dynamic
COINS
IV,
New
46-62.
AVL
trees."
York: Plenum Press, 1974:
161-180.
Veklerov, E. "Analysis of Dynamic Hashing with Deferred Splitting."
Transactions on Database Systems 10, no. 1 (March 1985): 90-96.
ACM
Wagner, R.E. "Indexing design considerations." IBM Systems Journal 12, no. 4
(1973): 351-367.
Wang, P. An Introduction to Berkeley Unix. Belmont, CA: Wadsworth Publishing
Co., 1988.
Webster, R.E.
sity,
"B +
trees."
Unpublished Master's
thesis,
Oklahoma
State Univer-
1980.
Welch, T.
"A Technique
for
High Performance Data Compression." IEEE Com-
puter, Vol. 17, no. 6 (June 1984):
Wells, D.C.,
8-19.
E.W. Greisen and R.H. Harten. "FITS:
Flexible
Image Transport
System." Astronomy and Astrophysics Supplement Series, no. 44 (1981):
363-370.
Wiederhold, G. Database Design, 2d Ed. New York: McGraw-Hill, 1983.
580
BIBLIOGRAPHY
Wirth, N.
"An
assessment of the programming language Pascal."
IEEE
Transac-
Sofiware Engineering SE-1, no. 2 (June 1975).
tions on
Yao, A. Chi-Chih.
159-170.
Zocllick, B.
"On random 2-3
"CD-ROM
trees." Acta Informatica 9, no. 2 (1978):
software development." Byte
11, no. 5
(May
1986):
173-188.
System Support for CD-ROM." In
Lambert and S. Ropiequet, eds. Redmond,
Zoellick, B. "File
rus. S.
CD-ROM: The New PapyWA: Microsoft Press,
1986: 103-128.
Zoellick, B. "Selecting an
ume
Approach
2: Optical Publishing. S.
1987: 63-82.
to
Document Retrieval." In CD-ROM, VolRedmond, WA: Microsoft Press,
Ropiequet, ed.
Index
Abstract data models
explanation
of,
FITS image
as
128
Access. See
124-125, 132
example
Random
Record
of,
access;
access
variable order, 422-425, 437
trees
explanation
B*
trees,
B+
trees
447, 493
overflow
sector, 46
LRU replacement, 376
simple prefix, 429-430. See
Simple prefix
use of,
114, 115
352-362
Assign statement, 9
list
explanation of, 193, 217
of fixed-length records, 193
195
of variable-length records,
196-198
Average search length
of,
492
number of collisions
and, 476
progressive overflow and,
469-471
record turnover and, 482
433
trees vs.,
and,
of,
553-555
347
deletion, redistribution,
and
concatenation in, 366-372
depth of, 364-366
explanation
558
explanation
trees
B-trees
construction
versions of, 137
Avail
431-433
insertion,
and hex values, 107-109
Bayer, R., 334-335, 337, 347, 348, 363, 371-372, 431
Berkeley UNIX, compression
of,
tor indexes, 234
and information placement,
377-379
invention
of,
334-336
383
of order m, 364, 382
order of, 362-364, 364, 382.
383
page structure used by, 253.
352
splitting and promoting, 347leaf of,
189
fit
of, 217
placement strategies, 202
Better-than-random, 492-493
Binary encoding, 137-138
Binary search
explanation of, 204, 205, 217
of index on secondary storage, 234, 336
limitations of, 207-208, 228
sequential vs., 204-206
on variable-length entities, 422
Binary search trees
balanced, 4
explanation of, 337-340
heap and, 280-281
paged, 343-347, 352, 353
Binding
explanation of, 252
in
indexing,
Bkdddkey
249-250
function. 514. 515.
519
Bk del key function. 524. 526
Bkfind buddy function. 521
Bk<plit function,
516-519
524-
BktryColl<ip<c function.
351
underflow
in,
Best
and
CD-ROM
table,
436
429-431
ASCII
UNIX,
of, 4, 6, 413,
algorithms for searching and
126
431-433
virtual, 373-377, 383
Balanced merge
explanation of, 312-314, 325
improving performance of,
315, 316
general discussion regarding,
B+
in headers,
553-555
and,
explanation
also
4,
K-wav, 314-315
Adel'son-Vel'skii, G. M., 341
ASCII
340-343,
372-373, 382
CD-ROM
46-48
indexes to keep track of, 102,
103
open. See Progressive
in
of, 6,
382
and files, 4
B-trees vs., 433
buckets and, 471-479
extendible hashing and, 510
513
hashing and, 452-466
home,
use of,
AVL
access; Sequential
Access mode, 29
Addresses
block,
Average seek time, 572
in,
408
526
Bk try combine
function,
524-
526
Block addressing, 46-48, 471
Block device, 82
Block I/O
explanation of, 83
UNIX. 46
use of,
Block
and
Byte count
Byte offset
file
78-79
and order
strengths and weaknesses of,
of,
109
63-68
of,
number
a predictable
of, 101
on performance, 53-54
of, 3,
45-47
sequence sets, 407-413,
417-421
Boolean functions, 18
in
Bpi, 82
392-393
394-396
Btutil.prc, 400-404
Btutil.c,
Buckets
buddv. 520-522, 535-536
and effect on performance,
472-476
explanation
493
closing
LIST program
tries
and,
507-509
of,
535-536
procedure for finding, 520522
Buffer pooling, 70-71
Buttering
bottlenecks
69
in,
double, 70, 311
explanation
of, 29, 38,
68
282-283, 287
multiple, 69-72, 283
disks and cache
input,
RAM
memory as, 55
during replacement selection,
303-304
and virtual
Collision resolution
155-156
trees,
373-377
156-157
94-95,
154-155
CD-ROM,
556
488
of, 447, 493
and extra memory, 462-466
velocity),
544^547-549
Comm, 322. 325
Compact disc read-onlv
memory (CD-ROM).
CD-ROM
Compaction
62
explanation
448-449
methods of reducing, 449450, 462-466
predicting, 457, 461-466
Color lookup table, 128
Color raster images, 128-129
Comer, Douglas, 334-336, 363
in hashing,
98, 99,
Cache, 55h
Canonical form
explanation of, 144
for keys, 110
in secondary indexes, 237
Cascade merge, 316
CAV (constant angular
of, 543,
563-565
problem.
545-546
system and, 79, 559-563
hashed files and, 557-559
file
tables, 487,
Collisions
explanation
as file structure
Buffering blocks, and
162-
103-105, 107, 109,
writstrm.c,
CD-ROM,
by chained progressive
overflow, 484-486
by chaining with separate
overflow area, 486- 487
by double hashing, 483
by progressive overflow,
466-471
and scatter
162
166
writrec.c,
(constant linear velocity),
Coalescing holes, 201
update. c, 119, 120, 123.
Buddy buckets
explanation
strjuncs.c,
from
547-549
Cmp, 320-322, 325
Coalescence, 217-218
106, 107, 158
readstrm.c, 99,
411
544,
389-396
391-392
readrec.c,
block size and, 411
effect on records, 469
explanation of. 42-43, 83,
CLV
keys into B-tree,
makekey.c, 161
13-14
use of, 45
159
to insert
29
14,
internal fragmentation
19-20
C programs
btio.c, 392-393
btutil.c, 394-396
driver. c, 390-391
fileio.h, 153-154
find.c, 160-161
520
and implementation
479
527, 534
15-18
in,
352,
space utilization for, 526-
in,
record length and, 105
insert. c,
476-
in,
),
files,
Clusters
14
in,
extendible hashing and, 512of,
Closing
119
hashing fold and add step
451
getrf.c,
of, 450, 471, 472,
in,
direct access in, 117, 123
seeks
Btio.c,
484-486
Chang, H., 534
CLOSE(
character strings
file
553-557
Character I/O, 83
Character I/O system, 78
Character strings, in Pascal and
C, 119
portability and, 141
organization
tree structure and,
Chained progressive overflow,
per track, 573
stream of, 146
Blocking factor, 46, 57-59
Blocks
explanation of, 82-83, 144
grouping records into, 113
546-
552-553
dump
making records
554
543-545
551
to calculate, 116
journey
choice of, 410-411
effect
29
of,
Bytes
size
CD-ROM,
history of,
physical organization of,
explanation
RRN
144
field,
explanation
storage,
of,
218
190-192
Compar( ), 320
Compression. See Data
compression
See
Computer hardware, sort time
and. 293-295
Computer Systems Research
Group (CSRG), 53, 54
Concatenation
370
in B-trees, 367, 369,
due to insertions and
deletions. 408. 409
explanation of, 382
nominal. 59. 60, 85
Da tarec,
of block
performance
effect
117
Deletion. See Record deletion
Delimiters
organization
end of records. 102-103
ot.
144
and nondata overhead. 47-49
45-47
organizing tracks by sector.
speed
applied to general ledger
Descriptor table, 83
types of. 37
program, 268-276
and matching, 259-263
and merging, 263266
and multiway merging. 276-
Device driver. 76, 79. 83
Difi 321-322. 326
Dir_double function, 518. 519
285-286
of, 266-268
of.
of.
(DASDs),
83
Direct
42
Conversion
text,
138-140
m UNIX.
318-320
320-322
utilities for.
Count subblocks.
),
turning
Cylinders
a.
of.
tries into. 507,
508
Dir_ifi<-.bucket function. 518.
computing capacity
of,
of, 38, 40,
40
83
519
Dir_.try collapse function, 522.
523
Disk access
Dangling pointers, 213
Data
decreasing
application-oriented view of.
125
number
of. 5
rotational delay and, 50
49-50
seek access and, 37,
standardization of. 136-139
for,
188-189
218
irreversible, 189
and simple prefix method to
produce separators. 431
of,
suppressing repeated
sequences for, 186-188
UNIX. 189-190
using different data
186
for.
185
Data files, 212. 213. 239
Data transfer rate. 552. 573
Data transmission rate
estimating,
5960
explanation
of.
84
Record
distribution
Double buffering. 70, 311
Double hashing, 483, 493
Drive capacitv, 40
390-391
397-399
Driver. pa<,
C,
Dynamic
531, 532
hashing. 528-530.
Disk bound. 54
Disk cache. 55. 84
Disk controller. 67
Disk drives, 37
comparison of. 572-574
dedicated. 294
fixed head. 572
replacement selection and use
of two. 307, 309
use of multiple. 309-31
Disk packs
explanation
removable.
Disks
EBCDIC
(extended binary
coded decimal interchange
code), 136, 137
Effective recording density, 59,
84
Effective transmission rate.
of. 38.
;
84
as bottleneck.
59-
60, 84
EFM encoding, 547
80/20 rule of thumb, 488-489,
493
Enbody, R. J., 531, 532
End-of-file (EOF), 28, 29
EndPosition(f 21
Entropy reduction, 189n
Entry-sequenced files
.
basic operations on.
transfer time and, 51. 112
assigning variable-length
explanation
timing computations and,
51-53
Data compression
in
522
536
extendible hashing and. 513
519. 527-528. 530
collapsing
space utilization for. 527-528
46, 83
29
explanation
.See
317-318
536
explanation
Cosequential processing
codes
(DMA),
67m. 83
number and
Distribution.
Du, H.
access
of. 2
tape vs.. 61-62.
Driver. c,
37, 83
memory
41-45
Directory
structure, 139. 141
CREATE(
144-145
Direct access storage devices
explanation
file
of.
use of, 115-117, 123
Controller
speed
Direct access
explanation
37-39
of,
organizing tracks by block,
separating fields with, 97-99
Density, packed. See Packed
density
279,
53-54
estimating capacities and
space needs of, 38, 40-41
at
summary
on
of.
Davis, W. S., 372
Dedicated disk drives, 294
Deferred splitting, 536
explanation
Consequential operations, 258.
325
Consequential processing model
size
explanation
of,
230-234
252
simple indexes with. 227230
Extendible hashing
and controlling
splitting.
533-534
and deletion. 520-526
and dynamic hashing, 528530
explanation ot. 6. 505510,
536
implementation of. 510-519
and linear hashing, 530-533
use ot.
4-5
Extendible hashing performance
and space utilization for
54-55
buckets.
526-527
and space
merge
utilization for
527-528
directory,
Extensibility,
External fragmentation
218
methods to combat, 201
placement strategies and, 203
of,
tools for,
317-318
96-99
145
of, 96,
190-203
work
in,
organization and, 122
CD-ROM,
79,
with mixtures of data objects,
kernel and,
79-80
483-487
448-450
deletions and, 479-483
22-23,
559-563
131-132
object-oriented, 132-133,
141, 145
dynamic, 528-530, 536
79, 84, 141
explanation of, 84
UNIX counterpart
to,
76
File descriptor table,
218
placement strategy,
201-202
FITS (flexible image transport
system), 126-129, 136-137
102,
118-119
Fixed-length records
29
File descriptor,
74-75
dump, 107-109
manager
clusters and, 42-43
and
access, 123,
File
organization
access and,
527-528, 530
Floppy disks, 37, 572
Fold and add, 451-452, 493
Formatting
method
of,
explanation
pre-,
145
File protection,
of, 5, 6, 75,
84
closing,
data, 212, 213,
239
displaying contents of,
end
of,
84,
218
450-453
(hierarchical data format),
Header
files
explanation
of,
29-30
FITS, 126, 127, 130
self-describing, 125
UNIX,
in
26
Header records,
120, 122, 145
281-284
283, 284
Heapsort
of, 304, 326
use of, 280-281, 287, 291,
explanation
311
Height-balanced
trees, 341,
382-383
Hex dump
15-18
18
logical, 9, 30. See also Logical
files
44-45,
44-45, 198-200,
203, 218
storage, 198-201
Frames, 56, 84
13-14
4-5
writing out in sorted order,
of,
internal,
Files
use of,
Hashing algorithms
perfect, 449, 494
properties of, 280
external, 201, 203, 218
153-154
462
with simple indexes, 234
building,
74//
explanation
for, 139, 141
history of, 3-5, 124
Fileio.h,
),
and, 462-466
record access and, 488-489
record distribution and, 453-
Heap
Fragmentation
File structures
conversions
explanation
84
47
Fprintf(
13
of,
530-533, 536
memory
130
Flajolet, P.,
122-123
linear,
192-196
deleting,
explanation of, 145
use of, 101, 102, 118-119
names, 76-78
hashing
steps in simple,
File
File
446-
indexed, 493
indexing vs., 447
HDF
204
File
explanation of, 84
function of, 64, 66, 68
of, 6, 431,
448, 493
extendible. See Extendible
Fixed disk, 84
Fixed head disk drives, 572
Fixed-length fields, 96-98, 101,
method, 145
allocation table (FAT)
File-access
file
double, 483, 493
explanation
160-161
Find.new range function, 517
Find. pas, 175-176
fit,
557-559
466-
471,
using indexes, 249
First-fit
CD-ROM,
collisions and,
29
of,
First
123
85
buckets and, 471-479
and
21-22
in,
explanation
UNIX,
link, 77,
Hardware. See Computer
hardware
Hashing
Find.c,
File access
File
in,
disks, 37
collision resolution and,
contained
Filesystems
making records a predictable
number of, 101-102
reading stream of, 99-100
file
files
78
special,
on
early
Hard
Hard
physical, 8-9, 30. See also
Physical
Gray, J., 343
Grep, 115
129-
special characteristics
Fields
explanation
in,
self-describing, 125
310-311
Field structures,
285-
132
normal, 78
opening, 9-13
reclaiming space
External sorting. See also
Sorting
tapes vs. disks for,
size of,
mixing object types
133-134, 145
Extents, 43-44, 84
explanation
and
sorts
311
Gather output, 72
Get. pre,
Getrf.c,
174-175
159
explanation
of,
107-109
portabilitv and, 135
HIGH_VALUE,
265-266, 326
Home
address, 447, 493
Huffman
explanation
code, 188, 218
to
of, 226-227, 252
keep track of addresses,
102, 103
I/O
approaches
|,
paged, 383
primary, 237
in different
Keys
explanation
block, 46, 78-79, 83
secondary. See Secondary
indexes
character, 78, 83
selective, 248,
processing as, 14
overlapping processing and,
simple, 227-230, 234-235,
languages
16-17
to,
file
RAM
buffer space in
performing, 61
scatter/gather, 86
252
I/O buffers, 64-65, 69
I/O channels
transmission time and, 294295
Insert.c,
391-392
Insertions
+
of,
145
hashing methods and, 455456
and index content, 413-415
indexing to provide access by
'
multiple,
Inode table, 76
Input buffers, 282-283, 287
Insert( ) function, 357, 359
UNIX, 72-80
in
252
Indexing, hashing vs., 447
Inodc. See Index node
280-281, 283
Key held. 252
Key subblocks, 46-47, 85
KEYNOI)ES|
209-212
235-239
placement of information
associated with, 377-379
primary, 110, 111, 146
promotion
of, 383
sequence set, 430
secondary. See Secondary
keys
role in
418-421
355-360, 371-372
block splitting and, 408, 409
as separators,
in B-trees,
separators instead of,
variable-length records and,
description of, 67
236-237
random, 429
explanation
tombstones and, 481-482
use of, 311
of,
85
Insert. pre,
use of multiple, 309-310
I/O redirection
explanation of, 30
IBM,
examples
standardization issues and,
IEEE Standard
data
format, 137, 138
files
files
memory,
Inverted
contents of, 413
of deletions on, 417,
role
Index
of, 416, 433,
of separators
set
in,
422-425
244-248
Irreversible compression, 189
explanation
of,
218
to
files,
3-4
binding and, 249-250
Landis, E. M., 341
Larson, P., 528, 530, 534
LaserVision, 543-544
Leaf
of B-tree, 363, 383
Least-recently-used
(LRU)
strategy, 71
Ledger program, application of
consequential processing
Job control language (JCL), 9
model to, 268-276
Lempel-Ziv method, 189
Linear hashing, 530-533, 536
/C-way merge
Linear probing. See Progressive
421-422
Index tables, 130
Indexed hashing, 493
Indexed sequential access
explanation of, 406-407, 437
and sequence sets, 412
Indexes
added
lists
blocks
internal structure of,
size of,
437
430
373
134-135
Match
explanation of, 252
secondary index structure
and,
418
211-213, 285
pinned records and, 213-214
Keywords, 129, 130
Knuth, Donald, 301-302, 311,
limitations of,
Languages, portability and,
206
operation
Index node, 76, 77, 85
Index set
explanation
207-208
Intersection, 259. See also
335. See also B-trees
208-211, 218,
Lands, 546-547
limitations of,
of,
of,
312, 317, 342, 363, 372.
44-45, 218
minimization of, 198-200
placement strategies and, 203
use
234-235,
effect
of,
Internal sort
and, 212, 213
too large to hold in
85
Internal fragmentation
explanation
135-139
42
of,
Kcysort
289-290
56-57, 85
of,
of,
explanation
415
explanation
Interleaving factor
3380 drive, 573
portability and
Index
399-400
explanation
430-431
413-
379-380
Interblock gap, 47
UNIX, 25-26
IBM
trees,
index,
I/O processors
in
in
balanced,
314-315
explanation of, 326
use of, 276-279, 293, 295
Kernel
explanation of, 85
and filesystems, 79-80
Kernel I/O structure, 72-76
overflow
Linked lists
explanation of. 218
use of, 192-193, 245-247
LIST program
explanation
in Pascal
of.
15.
24-25
and C, 15-18
Lists
inverted,
244-248
Linked
linked. See
lists
Litwin, W., 530, 533
Loading
of CD-ROM, 555
+
of simple prefix B
maintaining
match
311,
30
Machine architecture, 135-136
Magnetic disks, 37. See Disks
Magnetic tape
applications for, 60-61
disks vs., 61-62
and estimated data
transmission times, 5960
organizing data on, 56-57
sorting files on, 311-318
and tape length requirements,
Makekey.c, 161
Mass storage system, 85
Match operation
of,
326
merge vs., 264-265
for names in two lists, 259
334-335, 337,
347, 348, 363, 371-372,
380
E.,
Memory
and
collisions,
index
in,
files
462-466
too large to hold
234-235
loading index files into, 231
rewriting index file from,
231-232
Merge
balanced, 312-314
cascade, 316
fc-way,
285-
290-292
explanation
of, 145
image, 128
125-126, 128
and
raster
use
of,
Mid-square method, 456, 494
Minimum hashing, 494
Minimum seek time, 572
Mod
operator, 451-452
Morse code, 188
Move mode, 71
multistep. See Multistep
merge
9-13
files,
Operating systems, portability
and, 134
Opfind function, 514, 515
Optical discs, 37
Order
of B-trees, 362-364, 383,
422-425, 437
file dump and byte, 109
of merge, 327
Overflow. See also Progressive
of,
494
508-510
Overflow records
buckets and, 473-475, 532
expected number of, 464-465
techniques for handling, 466splitting to handle,
54
467
310
Multistep merge
number of seeks
using, 295-298, 311
explanation of, 326
replacement selection using,
304, 306, 307
Multiway merge
consequential processing
model and, 276-279
files, 285-
for sorting large
286
Packing density. See
also
Space
utilization
average search length and,
470
buckets and, 472-473
explanation of, 462-463, 494
overflow area and, 486-487,
558
predicting collisions for
different,
463-466
Page fault, 375
Paged binary trees
Network I/O system, 78
Nodes
in B-trees, 347-349
index, 76, 77, 85, 421
Nominal recording density, 85
Nominal transmission rate
explanation
of,
343-345
structure of, 253, 352
top-down construction
345-347
128
Parallel transfer disk
explanation
Parallelism, 54
85
Nondata overhead, 47-49
Nonmagnetic disks, 37
of,
of,
Paged index, 383
Palette,
computing, 59
(PTD), 573
Pareto Principle, 488-489
Parity, 86
Parity bit, 56
276-279
multiphase, 326
75-76, 85
13, 76
overflow
use of, 315-317
Multiprogramming
decreasing
30
file table,
explanation
Multiphase merges
explanation of, 326
effects of,
12,
Open() function,
Opening
time involved in, 287-290,
308
of two lists, 263-266
to avoid disk bottleneck,
57-59
and UNIX, 80
Makeaddress function, 510513, 532
263
McCreight,
files,
),
addressing. See
Progressive overflow
Open
Metadata
LOW_VALUE, 326
LRU replacement, 375-377
explanation
Open
through, 208
files
for sorting large
Locality, 55, 252
9,
OPEN(
264-265
vs.,
56
function, 514, 515, 519
OpJel function, 526
OpJir function, 522, 524
numbers of lists,
278-279
trees,
parity,
Op^add
for large
425-429
two-pass, 485-486
Locate mode, 71
Logical files
explanation of,
in UNIX, 23
Odd
order of, 327
polyphase, 316, 317, 327
Merge operation
explanation of, 326
O(l) access, 446-447
Object-oriented file access, 132133, 141, 145
Pascal
character strings
in,
119
direct access in, 117, 123
hashing fold and add step in,
451
header records and, 120, 122
LIST program
opening
in,
files in,
15-18
10-11
record length and, 105
20-21
Pascal programs
btutil.prc, 400-404
driver.pas, 397-399
find. pas, 175-176
get. pre, 174-175
methods
182
Position(f), 21
Prefix, 416. See also
176-182
171-172
94-95,
writstrm.pas,
98, 99,
168-169
Pathnames, 30, 562, 563
Perfect hashing algorithm, 449,
494
files
of,
Simple
division,
452-453, 494
Process, 86
Progressive overflow
chained, 484-486
of, 466-467, 494
and open addresses, 480
and search length, 468-471,
476, 477
Promotion of key, 355-357,
383
Protection mode, 30
Pseudo random access devices,
335
PTD (parallel transfer disk), 573
of,
explanation
of,
30
25-26
546-547
use of,
Pixels, 128
Placement strategies
explanation of, 218
selection of, 203
types of,
201-202
chained progressive
overflow, 484-486
dangling, 213
Poisson distribution
in
applied to hashing, 460-461,
473
457-460, 494
packing density and, 463
Polyphase merge, 316, 317, 327
explanation
of,
Portability
explanation
Radix searching. See Tries
Radix transformation, 456
RAID (redundant arrays of
inexpensive disks), 573
RAM
RAM
146
86
disks, 55,
51-53
access,
access
196
from hashed file, 479-483
index, 233, 237-238
storage compaction and, 190192
of variable-length records,
also
vs. tape, 61,
storage
in,
304-306
545
36
54-55
amount of, 293-
increased use of,
increasing
294
in, 206-208, 211,
279-285, 287-290
Random hash functions, 454456
Randomization, 455, 456, 494.
See also Hashing
sorting
READ(
distribution
uniform, 454
117-121
and header records, 120, 122
methods of organizing, 101
103
that use length indicator,
103-107
Record updating, index, 233234,
238-239
Records
explanation of, 100, 101, 146
in Pascal,
96n
reading into
RAM,
287-288
record structure and length
of,
117-121
Redistribution. See also Record
C, 16
explanation
hashing functions and, 453462
Poisson. See Poisson
choosing record length and,
memory (RAM)
access time using, 2,
in
of,
526
of fixed-length records, 192
Record keys, 109-111
Record structures
buffer space, 61
Random
Random
file
and B-trees, 347-349
extendible hashing and, 520-
Redistribution
and disk
86
Pointers
RAM
and
sort, 304-306
Record additions. See Insertions
Record blocking, 112-113
Record deletion
+
in B
trees, 418-421
in B-trees, 366-368, 370
196-198
Record distribution. See
320, 327
of,
Pipes
Platter,
),
218
213-214, 235
explanation
use
Qsort(
172-173
155-156
Readstrm.pas, 169-172
Record access, 3-4
file access and, 51-53
hashing and, 488-489
patterns of, 488-489
Readrec.pas,
using replacement selection
for, 111
UNIX, 23-26
in
99
Readstrm.c, 99,
8-9, 30
Pinned records
Pits,
requirements
Prime
),
Readrec.c, 106, 107, 158
explanation
update. pas, 119, 120, 122, 123,
explanation
Readfieldf
explanation of, 110, 146
in index files, 244, 246, 247
397-404
insert. pre, 399-400
readrec.pas, 172-173
readstrm.pas, 169-172
16
sequential search and, 112
use of, 14-15, 113
and, 141-142
Primary indexes, 237
Primary keys
binding, 249
352,
Physical
136-
prefix B-trees
to insert keys into B-tree,
writrec.pas,
in Pascal,
for achieving,
141
UNIX
seeks in,
stod.pre,
134-136
factors of,
of,
30
distribution
370-372,
in B-trees, 367,
408, 410, 425
explanation
of,
383
Redundancy reduction,
187, 188,
185,
219
Redundant arrays of
inexpensive disks (RAID),
573
Reference field, 228-229, 252
Relative block number (RBN),
423
Relative record number (RRN)
access by, 116, 204, 207
explanation of, 146
hashed files and, 476-477
in stack, 193, 194
and variable-length records,
196
Replacement
based on page height, 376377
LRU, 375-377
heapsort and, 280
Sequential search
Search length, 469. See also
for
Average search length
Secondary indexes
on CD-ROM, 556-557
improving structure of, 242248
primary vs., 237
236-237
to, 237-238
record updating to, 238-239
retrieval and, 239-241
record addition
record deletion
306-308
to,
235-236
use of,
Secondary key fields, 235
Secondary keys
binding, 249
index applications
111,
of,
Retrieval, using combinations
of secondarv keys, 239242
Rewrite statement, 10
Rotational delav, 50, 86, 572-
573
Run-length encoding
explanation of, 219
use of. 186-188
Runs
10
235-238
retrieval using
239-242
of,
327
298-303
285-289
length of,
use of,
Scatter/gather I/O, 86
Scatter input,
Scatter tables,
71-72
487-488
M., 534
Seagate Sabre PTD, 573
Selective indexes, 248, 252
Self-describing
files,
125. 146
Separators
explanation
of, 433, 437
and index content, 413-415
index
425
set
blocks and. 422-
413-415
430-431
instead of keys,
keys
as,
shortest, 437
Sequence checking, 327
Sequence set
adding simple index to, 411413
and block size, 410-411
blocks and, 407-410, 417-
425-429
explanation
of,
407, 433,
437
Secondary storage
Sequences, suppressing
access to, 36,
336-337
paged binary
344
trees and, 343,
repeating,
186-188
Sequential access,
explanation
3-4
of, 6,
146
indexed. See Indexed
simple indexes on, 234
Sector addressing, 46, 471
sequential access
Sectors
explanation of, 86
organization of, 86
organizing tracks by, 41-45
41-42
SEEK(
time computations and, 5253
use of, 122, 291
Sequential access device, 86
explanation of, 30-31
use of, 18-19
UNIX
Sequential processing,
tools for,
114-115
Sequential search
best uses of, 114
Seek and rotational delav, 288,
292-294
binary vs., 204-206
evaluating performance
CD-ROM,
552
explanation
of,
explanation
49-50,
86,
of,
146
use of record blocking with,
112-113
112-113
types of, 572
Serial devices,
SGML
37
(standard general
markup
Seeks
in C, 19-20
language), 130
131
excessive, 61
Shortest separator, 437
explanation of, 38
multistep merges to decrease
number of, 295-298, 311
Sibling, 367
in Pascal,
of,
111-112
Seek time
SeekRead(f,n), 21
explanation
merging large numbers of
278-279
lists,
421,
combinations
phvsical placement of,
Reset statement, 10, 11
248
selective indexes from.
of,
Replacement selection
average run length for, 301303
cost of using, 303-305
explanation of, 327
increasing run lengths using,
298-301
for initial run formation, 311.
312
plus multistep merging, 304,
Scholl,
Search. See Binary search;
20-21
Simple indexes
with entrv-sequenced
252
SeekWrite(f,n), 21
explanation
Selection tree
too large to hold in
explanation
of,
327
files,
227-230
234-235
of,
memory,
Simple prefix
B+
B+
B+
also
STDIN, 24-25,
trees
STDOUT,
429-430. See
trees vs.,
trees
Stod.prc,
changes involving multiple
blocks in sequence set and,
418-421
changes localized to single
blocks in sequence set and,
417-418
explanation
of,
416-417, 437
425-429
loading,
use of, 431-432, 434
31,
Storage capacity, of CD-ROM,
552
Storage compaction, 190-192
Storage fragmentation, 198-201
Stream file, 94-96
Stream of bytes, 146
Streaming tape drive, 60-61, 86
Soft links, 77-78. See also
bottleneck, 54, 55
Stmg, 567-571
Subblocks
explanation of, 86
types of,
and cosequential processing
in
UNIX, 318-322
disk
files in
merging
RAM,
for large
206-208
285-
on
for directory,
Special
file,
527-528
86
Split ( ) function, 360,
361
Splitting
in B-trees, 355, 356, 360,
367, 425
Synchronization loop, 260-262,
267, 276, 327
Symbolic links, 77-78, 86
sort. See
Keysort
Tags
advantages of using, 133
explanation of, 129-131
132-133
specification of,
Tape. See Magnetic tape
Temporal locality, 376
Theorem A (Knuth), 328
deferred, 536
Tombstones
189-190
22-23
of block size on
performance, 53-54
and file dump, 108
effect
header
files,
26
commands, 26-27
72-80
filesystem
I/O in,
magnetic tape and, 80
physical and logical files
in,
23-26
explanation
and sequential processing,
114-115
sort utility for, 206
sorting and cosequential
processing
in,
318-322
standard I/O in, 31
Unterauer, K., 431
Update. c, 119, 120, 123, 162-
file
166
format), 130
Update.pas, 119, 120, 122. 123,
176-182
of,
480-481, 495
for handling deletions,
480-
481
Stack
explanation
use of,
in,
portability and, 141
Tag
TIFF (tagged image
explanation of, 383, 537
to handle overflow, 508-510
distribution, 454, 455
file-related
408-410
control of, 533-534
block,
balanced merging,
312
directory structure,
310-311
while writing out to file, 283,
284
Space utilization. See also
Packing density
for buckets, 526-527, 534
Two-way
compression
46-47
chained progressive
overflow, 484
explanation of, 448, 464, 494
System call interface, 74
System V UNIX, 189
311-318
UNIX
in
tape,
508
Turbo Pascal, 9
Two-pass loading, 485-486
Uniform
Synonyms
tools for external,
of, 505-507, 537
turned into directory, 507,
explanation
Uniform, 495
file,
311
Tries
162
Striping, to avoid disk
Sorting
234
for indexes,
62-63
Sockets, 78, 86
Symbolic link
Sort, 319-320, 322, 327
Sort-merge programs, 318
553-557
382-383
height-balanced,
74
182
Storage, as hierarchy,
Strfuncs.c,
CD-ROM,
on
74
17, 25, 31,
of,
and insertions, 481-482
performance and, 482-483
219
193-194
Standard I/O, 31
Standardization
of data elements, 137-138
of number and text
conversion, 138-139
of physical record format,
136-137
Standish, T. A., 342
Static hashing,
STDERR,
74
J.,
343
Variable-length codes, 188-189,
219
Variable-length records
379-380
196-198
B-trees and,
Tracks
deleting,
of, 37-40, 87
organizing by sector, 41-45
per surface, 573
Transfer time, 51, 87
explanation of, 146
internal fragmentation and,
explanation
Tree structure
447
24, 25, 31,
Tools, pre, 167
Total search length, 469
Track capacity, 40
VanDoren,
application of, 4
199
methods of handling, 102
Variable order B-tree, 422-425,
437
VAX, 135, 138-139
Veklerov, E., 534
of,
importance
of,
219
fit,
Writstrm.c,
Worst-fit placement strategies,
202
Virtual B-trees
explanation
Worst
373-377, 383
377
WRITE(
Writstrm.pas,
115
explanation
Webster, R. E., 375, 377
White-space characters, 97-98
Writrec.c,
98, 99, 154-
94-95,
98, 99,
168-169
of, 31
use of, 15-18,
IIV.
94-95,
155
63-65
103-105, 107, 109,
156-157
Writrec.pas, 171-172
XDR
(external data
representation),
137-139
Yao, A. Chi-Chih, 371
Computer Science/File Structures

File Structures, Second Edition
Michael J. Folk, National Center for Supercomputing Applications
Bill Zoellick, Avalanche Development Company

This second edition of the leading file structures book currently on the market has been thoroughly revised and updated to instruct readers on the design of fast and flexible file structures. The new edition now includes timely coverage of file structures in a UNIX environment in addition to a new and substantial appendix on CD-ROM. Other modern file structures such as extendible hashing methods are also explored.

This book develops a framework for approaching the design of systems to store and retrieve information on magnetic disks and other mass storage devices. It provides a fundamental collection of tools that any user needs in order to design intelligent, cost-effective, and appropriate solutions to file structure problems.

Highlights
- Discusses a "toolkit" of approaches to retrieve file records: simple indexes, paged indexes (e.g., B-trees), variations on paged indexes (e.g., B+ trees, B* trees), and hashing
- Includes a new chapter on extendible hashing
- Uses pseudocode extensively, particularly where the procedures are complex and where it is important to avoid the distractions inherent in actual compilable code
- Emphasizes the building of conceptual tools for the design and retrieval of information from files
- Provides complete examples in both ANSI C and Turbo Pascal 6.0
- Introduces UNIX concepts and utilities that apply directly to file structures and file management

File Structures, Second Edition is an invaluable resource for computer science professionals using file and data structures in a UNIX environment. It will also be of interest to professionals interested in learning about the design of file structures and the retrieval of records. Students majoring in computer science will benefit from this book's emphasis on fundamental concepts and its inclusion of C and UNIX.

About the Authors
Bill Zoellick is Vice President and Chief Scientist at the Avalanche Development Company in Boulder, Colorado, a leading producer of text conversion software. Previously, he was the Director of Technology for the Alexandria Institute, a nonprofit organization working to resolve the problems associated with electronic publishing. He is a frequent lecturer and writer on CD-ROM issues.

Michael J. Folk is currently a Senior Software Engineer at the National Center for Supercomputing Applications at the University of Illinois in Urbana. For the last three years he has been responsible for developing general purpose scientific data file formats. Prior to this, Dr. Folk was a Professor of Computer Science for fifteen years at Oklahoma State and Drake Universities.

Addison-Wesley Publishing Company
ISBN 0-201-55713-4