Fundamental File
Structure Concepts
By
Junaid Ali
Siddiqui
1
Files
A file can be seen as
2. A stream of bytes (no structure), or
3. A collection of records with fields
2
A Stream File
File is viewed as a sequence of bytes:
87359CarrollAlice in wonderland38180FolkFile Structures ...
Data semantics is lost: there is no
way to get it apart again.
3
Field and Record Organization
Definitions
Record: a collection of related fields.
Field: the smallest logically meaningful unit
of information in a file.
Key: a subset of the fields in a record used
to identify (uniquely) the record.
e.g. In the example file of books:
Each line corresponds to a record.
Fields in each record: ISBN, Author, Title
4
Record Keys
Primary key: a key that uniquely
identifies a record.
Secondary key: other keys that may be
used for search
Author name
Book title
Author name + book title
Note that in general not every field is a
key (keys correspond to fields, or a
combination of fields, that may be used in
a search). 5
Field Structures
Fixed-length fields
87359Carroll Alice in wonderland
38180Folk File Structures
Begin each field with a length
indicator
058735907Carroll19Alice in wonderland
053818004Folk15File Structures
Place a delimiter at the end of
each field
87359|Carroll|Alice in wonderland|
38180|Folk|File Structures|
Store field as keyword = value
ISBN=87359|AU=Carroll|TI=Alice in wonderland|
ISBN=38180|AU=Folk|TI=File Structures 6
Record Structures
1. Fixed-length records.
2. Fixed number of fields.
3. Begin each record with a length
indicator.
4. Use an index to keep track of
addresses.
5. Place a delimiter at the end of the
record.
7
Fixed-length records
Two ways of making fixed-length
records:
2. Fixed-length records with fixed-
length fields.
87359 Carroll Alice in wonderland
03818 Folk File Structures
3. Fixed-length records with
variable-length fields.
87359|Carroll|Alice in wonderland| unused
38180|Folk|File Structures| unused
8
Variable-length records
Fixed number of fields:
87359|Carroll|Alice in wonderland|38180|Folk|File Structures| ...
Record beginning with length indicator:
3387359|Carroll|Alice in wonderland|2638180|Folk|File Structures| ..
Use an index file to keep track of record
addresses:
The index file keeps the byte offset for each record;
this allows us to search the index (which have fixed
length records) in order to discover the beginning of
the record.
Placing a delimiter: e.g. end-of-line char
9
File Organization
Four basic types of organization:
• Sequential
today
• Indexed
• Indexed Sequential
• Hashed
In all cases we view a file as a sequence
of records.
A record is a list of fields. Each field has
a data type.
10
File Operations
Typical Operations:
Retrieve a record
Insert a record
Delete a record
Modify a field of a record
In direct files:
Get a record with a given field value
In sequential files:
Get the next record
11
Sequential files
Records are stored contiguously on the
storage device.
Sequential files are read from beginning
to end.
Some operations are very efficient on
sequential files (e.g. finding averages)
Organization of records:
– Unordered sequential files (pile files)
– Sorted sequential files (records are
ordered by some field)
12
Pile Files
A pile file is a succession of records,
simply placed one after another with
no additional structure.
Records may vary in length.
Typical Request:
Print all records with a given field value
e.g. print all books by Folk.
We must examine each record in the
file, in order, starting from the first
record. 13
Searching Sequential Files
To look-up a record, given the value of one
or more of its fields, we must search the
whole file.
In general, (b is the total number of blocks
in file):
At least 1 block is accessed
At most b blocks are accessed.
On average 1/b * b (b + 1) / 2 => b/2
Thus, time to find and read a record in a
pile file is approximately
Time to: fetch one record
TF = (b/2) * btt 14
Exhaustive Reading of the File
Read and process all records
(reading order is not important)
TX = b * btt
(approximately twice the time to fetch one
record)
e.g. Finding averages, min or max, or
sum.
Pile file is the best organization for this
kind of operations.
They can be calculated using double 15
Sorted Sequential Files
Sorted files are usually read sequentially
to produce lists, such as mailing lists,
invoices.etc.
A sorted file cannot stay in order after
additions (usually it is used as a temporary
file).
A sorted file will have an overflow area of
added records. Overflow area is not
sorted.
To find a record:
First look at sorted area
Then search overflow area
16
Searching for a record
We can do binary search (assuming
fixed-length records) in the sorted part.
Sorted part overflow
x blocks y blocks (x + y = b)
17