Data Structures
SEMESTER- III
AY-2021-2022
TIPS COLLEGE OF ARTS AND SCIENCE
B.Sc COMPUTER SCIENCE – III SEMESTER
CORE 4 : DATA STRUCTURES
UNIT I
Introduction: Introduction of Algorithms, Analysing Algorithms. Arrays: Sparse Matrices -
Representation of Arrays. Stacks and Queues. Fundamentals - Evaluation of Expression Infix
to Postfix Conversion - Multiple Stacks and Queues
UNIT II
Linked List: Singly Linked List - Linked Stacks and Queues - Polynomial Addition - More
on Linked Lists - Sparse Matrices - Doubly Linked List and Dynamic - Storage Management
- Garbage Collection and Compaction.
UNIT III
Trees: Basic Terminology - Binary Trees - Binary Tree Representations - Binary Trees -
Traversal - More on Binary Trees - Threaded Binary Trees - Binary Tree Representation of
Trees - Counting Binary Trees. Graphs: Terminology and Representations - Traversals,
Connected Components and Spanning Trees, Shortest Paths and Transitive Closure
UNIT IV
External Sorting: Storage Devices -Sorting with Disks: K-Way Merging - Sorting with Tapes
Symbol Tables: Static Tree Tables - Dynamic Tree Tables - Hash Tables: Hashing Functions
- Overflow Handling.
UNIT V
Internal Sorting: Insertion Sort - Quick Sort - 2 Way Merge Sort - Heap Sort - Shell Sort -
Sorting on Several Keys. Files: Files, Queries and Sequential organizations - Index
Techniques -File Organizations.
TEXT BOOKS
1. Ellis Horowitz, Sartaj Sahni, Data Structures, Galgotia Publications.
2. Ellis Horowitz, Sartaj Sahni, Sanguthevar Rajasekaran, Computer Algorithms, Galgotia
Publications.
INTRODUCTION
OVERVIEW
The study of computer science has four distinct areas.
Machines for executing algorithms:
This area includes everything from the smallest pocket calculator to the largest
general purpose digital computer. The goal is to study various forms of machine
organizations so that algorithms can be effectively carried out.
Languages for describing algorithms:
One often distinguishes between two phases of this area: language design and
translation. The first calls for methods for specifying the syntax and the semantics of a
language. The translation requires a means for translation into a more basic set of
commands.
Foundations of algorithms:
Is a particular task accomplishable by a computing device or what is a minimum
number of operations necessary for any algorithm which performs a particular function?
Abstract models of computers are devised so that these properties can be studied.
Analysis of algorithms:
An algorithm's behaviour is measured in terms of the computing time and space
consumed while the algorithm is processing. Questions such as the worst-case and
average time, and how often they occur, are typical.
ALGORITHM
An algorithm is a finite set of instructions, to achieve a particular task. Every algorithm must
satisfy the following criteria.
i) input: there are zero or more quantities which are externally supplied;
ii) output : at least one quantity is produced;
iii) definiteness : each instruction must be clear and unmistakable;
iv) finiteness : if we trace out the instructions of an algorithm, then for all cases the
algorithm will terminate after a finite number of steps;
v) effectiveness : every instruction must be feasible.
FLOWCHART
A flow chart is a pictorial representation of the algorithm.
RECURSION:
A function or procedure that calls itself is said to be recursive, and the technique is called
recursion. For example, factorial is defined in terms of itself:
n! = n*(n-1) !
4! = 4 * 3!
e.g. Program using a recursive function
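A minimal sketch of such a recursive function in Python (the notes' own examples are in Pascal); it computes n! directly from the definition n! = n * (n-1)!:

```python
def factorial(n):
    """Recursive factorial: n! = n * (n-1)!, with 0! = 1! = 1 as the base case."""
    if n <= 1:                        # base case stops the recursion (finiteness)
        return 1
    return n * factorial(n - 1)       # the function calls itself on a smaller input
```

Each call works on a strictly smaller n, so the recursion terminates after a finite number of steps, as required of any algorithm.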
Consider the following three program fragments:

(a) x := x + 1;

(b) for i := 1 to n do
      x := x + 1;

(c) for i := 1 to n do
      for j := 1 to n do
        x := x + 1;
In program (a) we assume that the statement x := x + 1 is not contained within any loop, either
explicit or implicit. Then its frequency count is 1. In program (b) the same statement will be
executed n times and in program (c) n^2 times (assuming n >= 1). Now 1, n and n^2 are said to
be different, increasing orders of magnitude, just as 1, 10, 100 would be if we let n = 10. In
our analysis of execution we will be concerned chiefly with determining the order of
magnitude of an algorithm.
To determine the order of magnitude, summation formulas such as
Σ 1, Σ i, Σ i^2 often occur.
e.g. program to find the frequency count of Fibonacci series.
program fibonacci (input, output);
{compute the Fibonacci number Fn}
type natural = 0 .. maxint;
var fnm1, fnm2, fn, n, i : natural;
begin
  readln(n);
  if n <= 1 then writeln(n) {F0 = 0 and F1 = 1}
  else
  begin {compute Fn}
    fnm2 := 0; fnm1 := 1;
    for i := 2 to n do
    begin
      fn := fnm1 + fnm2;
      fnm2 := fnm1;
      fnm1 := fn;
    end; {of for}
    writeln(fn);
  end; {of else}
end. {of fibonacci}
Clearly, the actual time taken by each statement will vary. The for statement is really a
combination of several statements, but we will count it as one. The total count then is 5n + 4.
We will often write this as O(n), ignoring the two constants 5 and 4. This notation means that
the order of magnitude is proportional to n.
The notation f(n) = O(g(n)) (read as "f of n equals big-oh of g of n") has a precise
mathematical definition.
DEFINITION
f(n) = O(g(n)) iff there exist two constants c and n0 such that
|f(n)| <= c |g(n)| for all n >= n0.
When we say that an algorithm is O(g(n)) we mean that its execution takes no more
than a constant times g(n). n is a parameter which characterizes the inputs and/or outputs.
For example n might be the number of inputs, the number of outputs, their sum, or the
magnitude of one of them. For the fibonacci program n represents the magnitude of the input,
and the time for this program is written as T(fibonacci) = O(n).
We write O(1) to mean a computing time which is a constant. O(n) is called linear, O(n^2)
is called quadratic, O(n^3) is called cubic, and O(2^n) is called exponential. If an algorithm
takes time O(log n) it is faster, for sufficiently large n, than one that takes O(n). Similarly
O(n log n) is better than O(n^2) but not as good as O(n).
Suppose algorithm one takes 10n operations and algorithm two takes n^2/2:

n     10n    n^2/2
1     10     0.5
5     50     12.5
10    100    50
15    150    112.5
20    200    200
25    250    312.5
30    300    450

For n <= 20, algorithm two has a smaller computing time, but once past that point algorithm
one becomes better. This shows why we choose the algorithm with the smaller order of
magnitude, but we emphasize that this is not the whole story. For small data sets, the
respective constants must be carefully determined. In practice these constants depend on
many factors, such as the language and the machine one is using.
ARRAYS:
An array is a set of consecutive memory locations.
Given below is the data structure of an array.
structure ARRAY(value, index)
  declare CREATE( ) → array
          RETRIEVE(array, index) → value
          STORE(array, index, value) → array
  for all A ∈ array, i, j ∈ index, x ∈ value let
    RETRIEVE(CREATE, i) ::= error
    RETRIEVE(STORE(A, i, x), j) ::=
      if EQUAL(i, j) then x else RETRIEVE(A, j)
  end
end ARRAY
The function CREATE produces a new, empty array. RETRIEVE takes as input an array
and an index, and either returns the appropriate value or an error. STORE is used to enter
new index-value pairs. The second axiom is read as "to retrieve the j-th item, where x has
already been stored at index i in A, check whether i and j are equal; if so the result is x,
otherwise search for the j-th value in the remaining array A."
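The axioms can be modelled directly; the following is a small Python sketch, where the dictionary-based model and the function names CREATE, STORE and RETRIEVE simply mirror the axioms above (they are not a standard library):

```python
def CREATE():
    """Produce a new, empty array."""
    return {}

def STORE(A, i, x):
    """Enter the index-value pair (i, x); returns the resulting array."""
    B = dict(A)          # copy so earlier arrays are unchanged, as in the axioms
    B[i] = x
    return B

def RETRIEVE(A, j):
    """Return the value stored at index j, or an error if none was stored."""
    if j not in A:
        raise KeyError("error")      # RETRIEVE(CREATE, i) ::= error
    return A[j]
```

Retrieving index j after storing x at index i returns x when i = j, and otherwise searches the remaining array, exactly as the second axiom states.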
ORDERED LISTS
One of the simplest and most commonly found data object is the
ordered or linear list. Examples are the days of the week.
Applications
One of the applications of arrays is the representation of polynomials. Polynomials can be
stored in a global array called terms, as defined below. For example, A(x) = 2x^1000 + 1 and
B(x) = x^4 + 10x^3 + 3x^2 + 1 are stored as:

index : 1     2   3   4    5   6
coef  : 2     1   1   10   3   1
exp   : 1000  0   4   3    2   0

In the above table af and bf give the location of the first term of A and B respectively,
whereas al and bl give the location of the last term of A and B. free gives the location of the
next free location in the array terms. In our example, af = 1, al = 2, bf = 3, bl = 6, and free = 7.
15   0   0  22   0  -15
 0  11   3   0   0    0
 0   0   0  -6   0    0
 0   0   0   0   0    0
91   0   0   0   0    0
 0   0  28   0   0    0
fig. 1 sparse matrix
It is very natural to store a matrix in a two dimensional array, say A[1..m, 1..n]. Since a sparse
matrix contains mostly zeros, another representation is used to store only the non-zero
elements. Each element of a matrix is uniquely characterized by its row and column position,
say i, j. We can store a matrix as a list of 3-tuples of the form (i, j, value). The 3-tuples of
any row can be stored so that the columns are increasing.
        1 (row)  2 (col)  3 (value)
b[0]       6        6         8
b[1]       1        1        15
b[2]       1        5        91
b[3]       2        2        11
b[4]       3        2         3
b[5]       3        6        28
b[6]       4        1        22
b[7]       4        3        -6
b[8]       6        1       -15
Procedure transpose computes the transpose of A. A is initially stored as a sparse matrix
in the array a and its transpose is obtained in the array b.
The data type sparsematrix is defined as below:
type sparsematrix = array[0..maxterms, 1..3] of integer;
where maxterms is a constant.
procedure transpose (a : sparsematrix; var b : sparsematrix);
{b is set to be the transpose of a}
var m, n, p, q, t, col : integer;
{m : number of rows in a
 n : number of columns in a
 t : number of terms in a
 q : position of next term in b
 p : current term in a}
begin
  m := a[0,1]; n := a[0,2]; t := a[0,3];
  b[0,1] := n; b[0,2] := m; b[0,3] := t;
  if t > 0 then {nonzero matrix}
  begin
    q := 1;
    for col := 1 to n do {transpose by columns}
      for p := 1 to t do
        if a[p,2] = col then
        begin
          b[q,1] := a[p,2]; b[q,2] := a[p,1];
          b[q,3] := a[p,3]; q := q + 1;
        end;
  end; {of if}
end; {of transpose}
Memory is regarded as one dimensional, with words numbered from 1 to m. So we are
concerned with representing n dimensional arrays in a one dimensional memory, in such a
way that the location of an arbitrary element can be determined efficiently. Consider
var a : array [1..5] of integer;
The above array is represented as

200  a[1]
201  a[2]
202  a[3]
203  a[4]
204  a[5]

The address of any element in an array is calculated as follows.
One of the common ways to represent an array is in row major order. If we have the
declaration
A[4..5,2..4,1..2,3..4]
Then we have a total of 2*3*2*2=24 elements.
Using row major order these elements will be stored as
A[4,2,1,3], A[4,2,1,4], A[4,2,2,3], A[4,2,2,4],
A[4,3,1,3], A[4,3,1,4], A[4,3,2,3], A[4,3,2,4],
A[4,4,1,3], A[4,4,1,4], A[4,4,2,3], A[4,4,2,4],
A[5,2,1,3], A[5,2,1,4], A[5,2,2,3], A[5,2,2,4],
A[5,3,1,3], A[5,3,1,4], A[5,3,2,3], A[5,3,2,4],
A[5,4,1,3], A[5,4,1,4], A[5,4,2,3], A[5,4,2,4],
Another synonym for row major order is lexicographic order.
The next step is to calculate the address of the elements in memory. We assume that the
lower bounds li on each dimension are 1.
If α is the address of A[1,1], then the address of A[i,1] is α + (i-1)u2, as there are i-1
rows, each of size u2, preceding the first element in the i-th row. The address of A[i,j] is then
α + (i-1)u2 + (j-1).
Three dimensional array
A is declared as A[1..u1, 1..u2, 1..u3]. This array is interpreted as u1 two dimensional arrays
of dimension u2 x u3, stored one after another in row major order, so the (i-1)u2u3 elements
of the first i-1 two dimensional arrays precede A[i,1,1]. The address of A[i,j,k] is therefore
α + (i-1)u2u3 + (j-1)u3 + (k-1).
Multidimensional array
In general, the address of A[i1, i2, ..., in] is
α + (i1-1)u2u3...un
  + (i2-1)u3u4...un
  + (i3-1)u4u5...un
  ...
  + (in-1 - 1)un
  + (in - 1)
= α + Σj=1..n (ij - 1)aj, where aj = u(j+1)u(j+2)...un for 1 <= j < n, and an = 1.
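The address formula can be checked with a short Python sketch; the base address α, the upper bounds and the indices are passed in as parameters (the names below are illustrative):

```python
def row_major_address(alpha, bounds, indices):
    """Address of A[i1,...,in] in row-major order, lower bounds all 1.

    bounds  : the upper bounds (u1, ..., un)
    indices : the subscripts  (i1, ..., in)
    """
    n = len(bounds)
    # build the multipliers a_j from the right: a_n = 1, a_j = a_(j+1) * u_(j+1)
    multipliers = [1] * n
    for j in range(n - 2, -1, -1):
        multipliers[j] = multipliers[j + 1] * bounds[j + 1]
    addr = alpha
    for i, a_j in zip(indices, multipliers):
        addr += (i - 1) * a_j          # each dimension contributes (ij - 1) * aj
    return addr
```

For a two dimensional array with u2 = 4 this reduces to α + (i-1)·4 + (j-1), matching the formula above.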
Two of the more common data objects found in computer algorithms are stacks and
queues. A stack is an ordered list in which all insertions and deletions area at one end, called
the top. A queue is an ordered list in which all insertions take place at one end called the rear,
while all deletions take place at another end called the front.
fig.: a stack containing 1, 2, 3, 4 with TOP pointing at 4, and a queue containing
1, 2, 3, 4 with front at 1 and rear at 4.
Given a stack S = (a1, a2, ..., an), we say that a1 is the bottom-most
element of the stack and ai is on top of element ai-1, 1 < i <= n.
Given a queue Q = (a1, a2, ..., an), we say that an is the rear element, a1 is the front
element, and ai+1 is behind ai, 1 <= i < n.
The restrictions on a stack imply that if the elements A, B, C, D and E are added to the
stack in that order, then the first element to be removed or deleted must be E. Equivalently,
the last element to be inserted into the stack will be the first to be removed. For this reason
stacks are sometimes called 'Last In First Out' (LIFO) lists.
The restrictions on a queue imply that if the elements A, B, C, D and E are added to the
queue in that order, then the first element which is inserted into the queue will be the first one
to be removed. Thus A is the first letter to be removed, and queues are known as 'First In
First Out' (FIFO) lists.
The simplest way to represent a stack is by using a one dimensional array, say
stack[1..n], where n is the maximum number of allowable entries. The first or bottom element
of the stack will be stored at stack[1], the second at stack[2], and the i-th at stack[i]. Associated
with the array is a variable, top, which points to the top element of the stack.
Implementation of a stack is as follows:
CREATE(stack) ::= var stack : array[1..n] of items; top : 0..n;
ISEMTS(stack) ::= if top = 0 then true
                  else false;
TOP(stack)    ::= if top = 0 then error
                  else stack[top];
The ADD and DELETE procedures are given below. The corresponding procedures
have been written assuming that stack, top and n are global.
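Since the ADD and DELETE procedures themselves are not reproduced here, the following Python sketch shows one plausible version, keeping stack, top and n global as the text assumes; the overflow and underflow signals become exceptions:

```python
n = 5
stack = [None] * (n + 1)   # stack[1..n]; index 0 is unused to match the text
top = 0

def add(item):
    """Push item on the stack; signal stackfull when top = n."""
    global top
    if top == n:
        raise OverflowError("stackfull")
    top += 1
    stack[top] = item

def delete():
    """Pop and return the top item; signal stackempty when top = 0."""
    global top
    if top == 0:
        raise IndexError("stackempty")
    item = stack[top]
    top -= 1
    return item
```

Both operations touch only stack[top], so each runs in constant time.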
In a queue, when a job is submitted for execution it joins the rear of the job queue.
The job at the front of the queue is the next one to be executed. The queue operations are as
follows:
To represent a queue we need a one-dimensional array q[1..n] and two variables, front
and rear. front always points to one position less than the actual front, and rear always points
to the last element in the queue. Thus front = rear if and only if there are no elements in the
queue. The initial condition then is
front = rear = 0.
Consider the following example, which shows the state of the queue when a job enters
and leaves it:
The queuefull signal does not imply that there are n elements in the queue: the queue
may still contain empty spaces when rear = n, yet rear = n is what signals that the queue is
full. To avoid this problem, elements must be moved left when deletions take place, which is
time consuming.
Consider the following example, which shows the shifting of the elements of a queue
when deletions take place. Each time a new element is added, the entire queue of up to n-1
elements may have to be moved left.
Circular queue
fig.: a circular queue q[0..n-1] drawn as a circle of positions [0], [1], [2], [3], ...,
[n-4], [n-3], [n-2], [n-1], holding the element J1, with front = n - 4 and rear = 0.
In order to add an element, it will be necessary to move rear one position clockwise, that is,
if rear = n - 1 then rear := 0 else rear := rear + 1;
or, more compactly, rear := (rear + 1) mod n.
In the above two algorithms, the test condition for queue full in addq and the test for
queue empty in deleteq are the same. In the case of addq, when front = rear is evaluated
and found to be true, there is actually one free space, namely q[rear], since the first element in
the queue is not at q[front] but one position clockwise from it. However, if we were to
insert an item here, we would not be able to distinguish between the cases full and empty,
since this insertion would leave front = rear. To avoid this, we signal queuefull, thus
permitting a maximum of n-1 rather than n elements to be in the queue at any time.
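These conventions can be sketched in Python: q has positions 0..n-1, front stays one position behind the actual front, and at most n-1 elements are stored so that full and empty remain distinguishable:

```python
n = 4
q = [None] * n
front = rear = 0

def addq(item):
    """Add item at the rear; signal queuefull when only one free slot remains."""
    global rear
    newrear = (rear + 1) % n           # move rear one position clockwise
    if front == newrear:
        raise OverflowError("queuefull")   # only n-1 slots are usable
    rear = newrear
    q[rear] = item

def deleteq():
    """Remove and return the front item; signal queueempty when front = rear."""
    global front
    if front == rear:
        raise IndexError("queueempty")
    front = (front + 1) % n            # front moves clockwise to the actual front
    return q[front]
```

With n = 4 the queue holds at most three elements; the fourth addq raises queuefull even though q[rear] is physically free, exactly as discussed above.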
EVALUATION OF EXPRESSION:
X := A / B – C + D * E – A * C
The expression above has five operands A, B,C, D and E. Though these are all one-
letter variables, operands can be any legal variable name or constant. In any expression the
values that variables take must be consistent with the operations performed on them. The
operations are described by the operators. The basic arithmetic operators are +, -, / and *. The
other arithmetic operators include unary minus, mod and div. Relational operators are <, <=,
=, <>,>= and >. The result of an expression, which contains relational operators, can be either
true or false. Such an expression is called Boolean. Logical operators are AND, OR and
NOT.
To fix the order of evaluation, we assign to each operator a priority. Then within any
pair of parenthesis the operators with the highest priority will be evaluated first. A set of
sample priorities from Turbo Pascal is given below:
========================================================
priority        operator
--------------------------------------------------------
1 (highest)     not, unary minus
2               *, /, div, mod, and
3               +, -, or
4 (lowest)      =, <>, <, <=, >, >=
========================================================
When we have an expression in which two adjacent operators have the same priority, the
operators are taken from left to right.
There are many ways of representing the same arithmetic expression. For example, the
assignment statements
Z := A * B / C + D
Z := (A * B ) / C + D
Z := ( ( A * B ) / C ) + D
result in the same value. The process of collapsing such different expressions into one
unique form is called parsing the expression, and stacks are used for parsing.
For example, A * B / C is equivalent to
Infix   : A * B / C
Postfix : A B * C /
Prefix  : / * A B C
Algorithm for translating from infix to postfix: (1) fully parenthesize the expression,
(2) move each operator to the place of its corresponding right parenthesis, and
(3) delete all parentheses. For example,
((((A/B) - C) + (D*E)) - (A*C))
The arcs join an operator and its corresponding right parenthesis. Performing steps 2 and 3
gives
AB/C-DE*+AC*-
procedure postfix (e : expression);
begin
  stack[1] := '#'; top := 1; {initialize stack}
  x := nexttoken(e);
  while x <> '#' do
  begin
    if x is an operand
    then write(x)
    else if x = ')'
    then begin {unstack until '('}
      while stack[top] <> '(' do
      begin
        delete(y);
        write(y);
      end;
      delete(y); {delete '('}
    end
    else begin
      while isp[stack[top]] >= icp[x] do
      begin delete(y); write(y); end;
      add(x);
    end;
    x := nexttoken(e);
  end; {of while}
  {end of expression; empty the stack}
  while top > 1 do
  begin delete(y); write(y); end;
  writeln('#')
end; {of postfix}
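The procedure can be sketched in Python. The isp (in-stack priority) and icp (incoming priority) tables below are illustrative values chosen so that * and / bind tighter than + and -, and '(' has a high incoming but low in-stack priority:

```python
isp = {'*': 2, '/': 2, '+': 1, '-': 1, '(': 0, '#': -1}   # in-stack priorities
icp = {'*': 2, '/': 2, '+': 1, '-': 1, '(': 3}            # incoming priorities

def postfix(expr):
    """Translate an infix expression of single-letter operands to postfix."""
    stack = ['#']                         # '#' marks the bottom of the stack
    out = []
    for x in expr:
        if x.isalnum():
            out.append(x)                 # operands go straight to the output
        elif x == ')':
            while stack[-1] != '(':       # unstack until '('
                out.append(stack.pop())
            stack.pop()                   # delete '('
        else:
            while isp[stack[-1]] >= icp[x]:
                out.append(stack.pop())
            stack.append(x)
    while stack[-1] != '#':               # end of expression; empty the stack
        out.append(stack.pop())
    return ''.join(out)
```

Running it on the worked example reproduces the output derived by hand above.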
At this point we want to unstack down to the corresponding left parenthesis, and then delete
the left and right parentheses; this gives us

next token   stack   output
)            *       ABC+
*            *       ABC+*
D            *       ABC+*D
done         empty   ABC+*D*
Let us suppose that the symbols A, B, C, D and E have associated with them the values:

symbol   value
==============
A        5
B        3
C        6
D        8
E        2

Evaluating A B * C D E / - + with a stack:

read A, B and apply * : stack = 15
read C                : stack = 15, 6
read D                : stack = 15, 6, 8
read E                : stack = 15, 6, 8, 2
apply /               : stack = 15, 6, 4
apply -               : stack = 15, 2
apply +               : stack = 17

So the result of A B * C D E / - + is 17 for the given values.
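The stack evaluation can be sketched in Python; each operand is pushed, and each operator pops its two operands and pushes the result. The symbol values follow the table above, with E taken as 2 to match the worked result:

```python
values = {'A': 5, 'B': 3, 'C': 6, 'D': 8, 'E': 2}
ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
       '*': lambda a, b: a * b, '/': lambda a, b: a / b}

def eval_postfix(expr):
    """Evaluate a postfix string of single-letter operands using a stack."""
    stack = []
    for x in expr:
        if x in ops:
            b = stack.pop()          # the right operand is on top
            a = stack.pop()
            stack.append(ops[x](a, b))
        else:
            stack.append(values[x])  # operands are simply pushed
    return stack.pop()
```

Each symbol is handled once, so evaluating a postfix expression of length m is O(m).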
Let us consider an array v[1..m]. If we have a single stack, then v[1] is the bottom-most
element of the stack and v[m] the top-most possible element. If we have two stacks to
represent, then we use v[1] for the bottom-most element of stack 1 and v[m] for the
corresponding element of stack 2. Stack 1 can grow towards v[m] and stack 2 towards v[1].
When more than 2 stacks, say n, are to be represented sequentially, we initially divide the
available memory v[1..m] into n segments and allocate one of these segments to each of the
n stacks. This initial division of 1..m into segments may be done in proportion to the expected
sizes of the various stacks if these are known. In the absence of such information, v[1..m] may
be divided into equal segments. For each stack i we shall use b[i] to represent a position one
less than the position in v of the bottom-most element of that stack. t[i], 1 <= i <= n, will point
to the top-most element of stack i. The boundary condition b[i] = t[i] holds if and only if
stack i is empty.
SIMULATION
Simulation is the process of forming an abstract model from a real situation in order to
understand the impact of modifications and the effect of introducing various strategies on the
situation. The main objective of the simulation program is to aid the user, most often an
operations research specialist or systems analyst, in projecting what will happen to a given
physical situation under certain simplifying assumptions. It allows the user to experiment
We can share the processor by allowing one user's program to execute for a short time,
then allowing another user's program to execute, and another, and so on, until there is a return
to the execution of the initial user's program. This cycle is repeated continually over all active
user programs. This method of sharing the CPU among many users is referred to as
timesharing. We can also share memory among the user programs by simply dividing
memory into regions and allowing each user program to execute in its own region when it
receives CPU control. In a system such as this, each user is unaware of the presence of the
other users. In fact, each terminal appears like a separate computer to the user.
For example there are three users, Tremblay, Sorenson and Bunt.

id   user       CPU time requests (secs)
========================================
0    Tremblay   4, 8, 3
1    Sorenson   2, 1, 2, 2
2    Bunt       4, 6, 1
========================================
We interpret these figures in the following manner. After logging on to the system at time
0, Tremblay is initially allotted 4 seconds of CPU time before he receives a typed response at
his terminal. A further 8 secs of CPU time is required before another response can be printed
at Tremblay's terminal. Finally, Tremblay's program uses three additional secs of CPU time,
and then Tremblay logs off, terminating his session. Because we have only one CPU,
Sorenson's program and Bunt's program will be made to wait until Tremblay's program has
finished its first requested CPU time period of 4 secs. We can indicate the fact that
Sorenson's and Bunt's programs are waiting for the processor by placing their program IDs in
the queue behind Tremblay's program ID. Hence at time 2 the queue would appear as in
figure 1.
When Tremblay's program has completed its first requested CPU time period, a
scheduling strategy would be to allocate the CPU to Sorenson's program. When Sorenson's
program has finished its CPU time period, Bunt's program should be allowed to execute its
CPU time period. This type of scheduling strategy is called the 'First Come First Served'
(FCFS) scheduling strategy.
Total user waiting time = total time – total CPU time – total user delay
Figure 1: timeline from 0 to 40 secs showing, for each of TREMBLAY, SORENSON and
BUNT, when the program is EXECUTING (*****) and when it is WAITING IN
QUEUE (-------).
The rules for the FCFS strategy in this application are as follows:
1. When a program requests CPU time , it is placed at the back of the processor queue.
2. The program at the head of the processor queue is the program that is currently being
executed. It remains at the head of the queue for its entire CPU time period.
The primary actions which occur in this system , and the manner in which the simulation
imitates the actions , are as follows:
User requests service: In the real system this request would be a teletype signal which would
alert the processor. In the model we use a variable set to the time
at which the user will require service. When the simulation time reaches or passes this time,
the user is added to the processor queue.
User program in execution: To imitate this action, we update the simulation time by the
length of the time requested by the user.
User completes current requested execution period: At this point the user is removed from
the queue and enters a "thinking and typing" period. We will assume a delay period of five
time units for all users. The variable used to signal service requests is updated to indicate
that the user is in the delay period.
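The actions above can be combined into a small FCFS timesharing simulation. The following Python sketch (the names simulate_fcfs and DELAY are illustrative) assumes all users log on at time 0 and take the fixed five-unit thinking-and-typing delay between CPU bursts:

```python
from collections import deque
import heapq

DELAY = 5   # "thinking and typing" period assumed for every user

def simulate_fcfs(users):
    """users: list of (name, cpu_bursts). Returns (finish_time, run_log)."""
    remaining = {name: deque(bursts) for name, bursts in users}
    pending = []                       # (request_time, order, name) min-heap
    for order, (name, _) in enumerate(users):
        heapq.heappush(pending, (0, order, name))   # everyone logs on at time 0
    seq = len(users)
    ready = deque()                    # the FCFS processor queue
    clock = 0
    log = []
    while pending or ready:
        # users whose request time has arrived join the rear of the queue
        while pending and pending[0][0] <= clock:
            ready.append(heapq.heappop(pending)[2])
        if not ready:
            clock = pending[0][0]      # CPU idle until the next request
            continue
        name = ready.popleft()         # head of queue gets the CPU
        burst = remaining[name].popleft()
        log.append((name, clock, clock + burst))
        clock += burst                 # it runs for its entire CPU time period
        if remaining[name]:            # user thinks, then requests service again
            heapq.heappush(pending, (clock + DELAY, seq, name))
            seq += 1
    return clock, log
```

Run on the three-user table above, the log begins with Tremblay executing from time 0 to 4, followed by Sorenson, matching the FCFS rules.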
INTRODUCTION:
LINKED LISTS:
The representation of simple data structures using an array and sequential mapping
has the property that successive nodes of the data object are stored at fixed distance apart.
When a sequential mapping is used for ordered lists, operations such as insertion and deletion
of arbitrary elements become expensive. Consider the following sorted list:
2, 4, 6, 10, 12, 14
To complete the list we naturally want to add 8 to it. The insertion will require us to move
elements already in the list either one location lower or higher: we must either move
10, 12, 14 or else move 2, 4, 6. If we have to do many insertions into the middle, then
neither alternative is attractive because of the amount of data movement. On the other hand,
if we decide to remove many elements, we again have to move many elements so as to
maintain the sequential representation of the list.
When we have several ordered lists of varying sizes, sequential representation again
proves to be inadequate. By storing each list in a different array of maximum size, storage
may be wasted. By maintaining the lists in a single array, a potentially large amount of data
movement is needed. The solution to this problem of data movement in sequential
representation is achieved by using linked representations.
The use of pointers or links to refer to elements of data structure implies that elements
which are logically adjacent need not be physically adjacent in memory. This type of
allocation is called linked allocation. A list has been defined to consist of an ordered set of
elements which may vary in number. A simple way to represent a linear list is to expand each
node to contain a link or pointer to the next node. This representation is called a one way
chain or singly linked list. In general a node is a collection of data items data1, data2, ..., datan
and links link1, link2, ..., linkn. Each item in the node is called a field. A field contains either
a data item or a link. The link value nil in the last node signals the end of the list. nil
is not the address of any possible node but a special value. If a list does not contain any
node then it is called an empty list.
It is easy to make arbitrary insertions and deletions using a linked list rather than a
sequential list. To insert 8 between 6 and 10, get a new node and set its data field to 8;
create a link from 6 to 8 and from 8 to 10. It is illustrated in the following figure.
fig.: 2 → 4 → 6 → 10 → ... → 12 → nil, with a new node 8 linked in
between 6 and 10.
To delete the element 6 from the list remove the existing link between 4 and 6 and
create a new link between 4 and 8. It is illustrated in the following figure.
A declaration such as
var Ram : student;
declares a data object named Ram of type student. Ram is the name of the object, which
directly refers to both of its components.
Consider
type p = ^student;
var s : p;
The type p is a pointer type whose values point to objects of type student. The variable s is of
type p. The value of the variable s will be an address in memory at which an object of type
student is created by us. Since no object exists initially, the value of s is said to be nil
(a reserved word). A new object of type student is created by executing the procedure call
new(s).
s^.rollno := 200;
s^.name := 'ram';

fig.: s points to a record whose rollno field holds 200 and whose name field
holds 'ram'.
s^ serves as the name of the object. After the object has served its purpose, we can remove it
from memory and release its location by using the procedure call dispose(s). If a linked list
consists of nodes which have a data field of type integer and a link field, then the following
type declaration is used:
type pointer = ^node;
     node = record
       data : integer;
       link : pointer;
     end;
var f : pointer;
Procedure to create a linked list with two nodes of type node:
The data field of the first node is set to 10 and that of the second to 20. first is a pointer to
the first node.

first → 10 → 20 → nil
Example: Let first be a pointer to a linked list; first = nil if the list is empty. Let x be a
pointer to some arbitrary node in the list. The following procedure inserts a new node with
data field 50 following the node pointed to by x. The resulting list structures for the two cases
first = nil and first <> nil are shown in the figure: in the first case the new node becomes the
only node of the list, and in the second it is linked in after x.
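The insertion and deletion operations described above can be sketched with Python node objects (the class and helper names below are illustrative):

```python
class Node:
    """A singly linked list node: a data field and a link to the next node."""
    def __init__(self, data, link=None):
        self.data = data
        self.link = link

def insert_after(x, data):
    """Insert a new node with the given data immediately after node x."""
    x.link = Node(data, x.link)    # the new node takes over x's old link

def delete_after(x):
    """Delete the node following x by relinking x past it."""
    x.link = x.link.link

def to_list(first):
    """Collect the data fields, following links until nil (None)."""
    out = []
    while first is not None:
        out.append(first.data)
        first = first.link
    return out
```

Building 2 → 4 → 6 → 10, inserting 8 after 6, then deleting 6, reproduces the insert and delete examples from the text: only links change, and no elements are moved.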
LINKED STACKS:
When several stacks and queues coexist, there is no efficient way to represent them
sequentially. The solution is obtained by a linked list representation of stacks and queues.

Linked stack: top → node → node → ... → nil
Linked queue: front → node → ... → node (rear) → nil
The direction of links for both the stack and the queue is chosen to facilitate easy insertion
and deletion of nodes. You can easily add or delete a node at the top of the stack, add a node
at the rear of the queue, and delete a node at the front of the queue. The required declarations
are:
type pointer = ^node;
     node = record
       data : integer;
       link : pointer;
     end;
procedure addqueue (i, y : integer);
{add y to the i'th queue, 1 <= i <= m}
var x : pointer;
begin
  new(x); {get a node}
  x^.data := y; {set its data field}
  x^.link := nil;
  if front[i] = nil then
    front[i] := x {empty queue}
  else
    rear[i]^.link := x;
  rear[i] := x;
end; {of addqueue}
We now consider the general case. Normally we link together all of the available nodes in
a single list called AV; link fields are used to link the available nodes. The variable AV must
initially be set to one, and we assume it is a global variable.
procedure INIT(n)
// initialize the storage pool, through the LINK field, to contain nodes
// with addresses 1, 2, 3, ..., n and set AV to point to the first node in
// this list
  for i ← 1 to n-1 do
    LINK(i) ← i + 1
  end
  LINK(n) ← 0
  AV ← 1
end INIT
After executing the INIT procedure, the program can use the nodes. Every time a new node is
needed, a call to the GETNODE procedure is made. GETNODE examines the list AV and
returns the first node on the list. The GETNODE procedure is as follows:
procedure GETNODE(X)
// X is set to point to a free node if there is one on AV //
  if AV = 0 then call NO_MORE_NODES
  X ← AV
  AV ← LINK(AV)
end GETNODE
The RET procedure is used to return a node to AV. AV is used as a stack, since the last node
inserted into AV is the first node removed (LIFO). Procedure RET is as follows:
procedure RET(X)
// X points to a node which is to be returned to the available space list //
  LINK(X) ← AV
  AV ← X
end RET
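INIT, GETNODE and RET can be sketched in Python with the links kept in an array, mirroring the algorithmic notation above (0 plays the role of nil):

```python
def init(n):
    """Link nodes 1..n into a free list; return (link array, AV)."""
    link = [0] * (n + 1)          # link[1..n]; index 0 is unused
    for i in range(1, n):
        link[i] = i + 1           # LINK(i) <- i + 1
    link[n] = 0                   # LINK(n) <- 0
    return link, 1                # AV <- 1

def getnode(link, av):
    """Return (node, new AV); raise if there are no more nodes."""
    if av == 0:
        raise MemoryError("NO_MORE_NODES")
    return av, link[av]           # X <- AV; AV <- LINK(AV)

def ret(link, av, x):
    """Return node x to the available space list; the new AV is x."""
    link[x] = av                  # LINK(X) <- AV
    return x                      # AV <- X
```

Because RET pushes the returned node onto the front of AV, a node freed and then reallocated comes back immediately, which is the LIFO behaviour noted above.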
If we look at the available space pool in the middle of processing, adjacent nodes may no
longer have consecutive addresses. Suppose we have a variable ptr which is a pointer to a
node which is part of a list. Then the legal operations that we can perform on the pointer
variable are:
1) set ptr = 0 (i.e., nil);
2) set ptr to point to a node.
POLYNOMIAL ADDITION:
Assuming that all coefficients are integers, the required type declarations are:
type polypointer = ^polynode;
     polynode = record
       coef : integer; {coefficient}
       exp : integer; {exponent}
       link : polypointer;
     end;
The polynomial a = 3x^14 + 2x^8 + 1 would be stored as
a → [3|14] → [2|8] → [1|0] → nil
and b = 8x^14 - 3x^10 + 10x^6 would be stored as
b → [8|14] → [-3|10] → [10|6] → nil
In order to add the polynomials together, we examine the terms starting at the nodes
pointed to by a and b. Two pointers p and q are used to move among the terms of a and b. If
the exponents of the two terms are equal, then the coefficients are added and a new term is
created for the result. If the exponent of the current term of a is less than that of b, then a
duplicate of the term of b is created and attached to c, and the pointer q is advanced to the
next term. Similar action is taken on a when p^.exp > q^.exp. The following figure illustrates
the polynomial addition.
STEP 1: p points at [3|14] and q at [8|14]; since p^.exp = q^.exp, the coefficients are
added and c receives the term
[11|14] → nil
procedure add (a, b : polypointer; var c : polypointer);
{polynomials a and b, represented as singly linked lists,
 are summed to form the new polynomial named c}
var p, q, d : polypointer; x : integer;
begin
  p := a; q := b; {p, q point to the next term of a and b}
  new(c); d := c; {initial node for c, returned later}
  while (p <> nil) and (q <> nil) do
    case compare(p^.exp, q^.exp) of
      '=' : begin
              x := p^.coef + q^.coef;
              if x <> 0 then attach(x, p^.exp, d);
              p := p^.link; q := q^.link; {advance to the next terms}
            end;
      '<' : begin
              attach(q^.coef, q^.exp, d);
              q := q^.link; {next term of b}
            end;
      '>' : begin
              attach(p^.coef, p^.exp, d);
              p := p^.link; {next term of a}
            end;
    end; {of case}
  {the remaining terms of a or b are then attached and the list is terminated}
procedure cerase (t : polypointer);
{erase a circular list}
var x : polypointer;
begin
  if t <> nil then
  begin
    x := t^.link;
    while x <> t do {return the nodes one at a time}
    begin
      t^.link := x^.link;
      dispose(x);
      x := t^.link;
    end;
    dispose(t);
  end;
end; {of cerase}
For a list of m >= 1 nodes, the while loop is executed m times and so the computing time
is linear, or O(m).
To insert x at the rear, we need to add the additional statement a := x to the else part of the
insertfront procedure.
Sparse matrices
In a matrix, if many of the entries are zero then we call it a sparse matrix. In the case of
sparse matrices, space and computing time will be saved if only the non-zero entries are
retained explicitly. We represent the matrix elements in a sequential scheme in which each
node has three fields: row, column, and value. As matrix operations such as addition,
subtraction and multiplication are performed, the number of nonzero terms in the matrices
will vary, so a linked scheme facilitates efficient representation of varying size structures. In
the data representation, each column of a sparse matrix is represented by a circularly linked
list with a head node. In addition, each row is a circularly linked list with a head node.
Each node has five fields: left, up, v, r and c. The left field points to the node with the
next smallest column subscript; up points to the node with the next smallest row subscript.
v represents the value, r the row and c the column. The node structure is also referred to as
MATRIX_ELEMENT.
Algorithm Construct_Matrix. (In this algorithm M and N represent the number of rows and
columns respectively. Arrays AROW and ACOL contain pointers to the head nodes of the
circular lists; P and Q are auxiliary pointers. The row index, column index and value are read
into the variables ROW, COLUMN and VALUE respectively.)
3. Read (ROW, COLUMN, VALUE)
5. Repeat while C(P) < C(LEFT(Q))
6. Repeat while R(P) < R(UP(Q))
7. Exit
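As a simpler stand-in for the five-field circular-list nodes, a sparse matrix can be sketched with a dictionary keyed by (row, column), storing only the nonzero entries; the class and method names here are illustrative, not from the text.

```python
class SparseMatrix:
    """Dictionary-of-keys sketch of a sparse matrix: only nonzero
    entries are stored, keyed by (row, column)."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.data = {}                      # (row, col) -> value

    def set(self, r, c, v):
        if v != 0:
            self.data[(r, c)] = v
        else:
            self.data.pop((r, c), None)     # keep only nonzero entries

    def get(self, r, c):
        return self.data.get((r, c), 0)

    def add(self, other):
        """Matrix addition touches only positions that are nonzero
        in at least one operand."""
        result = SparseMatrix(self.rows, self.cols)
        for key in self.data.keys() | other.data.keys():
            result.set(key[0], key[1], self.get(*key) + other.get(*key))
        return result

m = SparseMatrix(3, 3); m.set(0, 0, 5); m.set(2, 1, -3)
n = SparseMatrix(3, 3); n.set(0, 0, -5); n.set(1, 1, 7)
s = m.add(n)
print(s.get(1, 1), s.get(0, 0), len(s.data))   # 7 0 2
```

Entries that cancel to zero are dropped, so the storage stays proportional to the number of nonzero terms, which is the whole point of the linked representation described above.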
A doubly-linked list whose nodes contain three fields: an integer value, the link to the next
node, and the link to the previous node.
The two node links allow traversal of the list in either direction. While adding or removing a
node in a doubly-linked list requires changing more links than the same operations on a
singly linked list, the operations are simpler and potentially more efficient (for nodes other
than the first) because there is no need to keep track of the previous node during traversal,
nor to traverse the list to find the previous node so that its link can be modified.
Nomenclature and implementation
The first and last nodes of a doubly-linked list are immediately accessible (i.e., accessible
without traversal, and usually called head and tail) and therefore allow traversal of the list
from the beginning or end of the list, respectively: e.g., traversing the list from beginning to
end, or from end to beginning, in a search of the list for a node with specific data value. Any
node of a doubly-linked list, once obtained, can be used to begin a new traversal of the list, in
either direction (towards beginning or end), from the given node.
The link fields of a doubly-linked list node are often called next and previous or forward and
backward. The references stored in the link fields are usually implemented as pointers, but (as
in any linked data structure) they may also be address offsets or indices into an array where
the nodes live.
Basic algorithms
Open doubly-linked lists
Data type declarations
record DoublyLinkedNode {
    prev // A reference to the previous node
    next // A reference to the next node
    data // Data or a reference to data
}
record DoublyLinkedList {
    DoublyLinkedNode firstNode // points to first node of list
    DoublyLinkedNode lastNode  // points to last node of list
}
Advanced concepts
Asymmetric doubly-linked list
An asymmetric doubly-linked list is somewhere between the singly-linked list and the
regular doubly-linked list. It shares some features with the singly linked list (single-direction
traversal) and others with the doubly-linked list (ease of modification).
It is a list where each node's previous link points not to the previous node, but to the link to
itself. While this makes little difference between nodes (it just points to an offset within the
previous node), it changes the head of the list: it allows the first node to modify the firstNode
link easily.
As long as a node is in a list, its previous link is never null.
Inserting a node
To insert a node before another, we change the link that pointed to the old node, using the
prev link; then set the new node's next link to point to the old node, and change that node's
prev link accordingly.
function insertBefore(Node node, Node newNode)
    if node.prev == null
        error "The node is not in a list"
    newNode.prev := node.prev
    atAddress(newNode.prev) := newNode
    newNode.next := node
    node.prev := addressOf(newNode.next)
function insertAfter(Node node, Node newNode)
    newNode.next := node.next
    if newNode.next != null
        newNode.next.prev := addressOf(newNode.next)
    node.next := newNode
    newNode.prev := addressOf(node.next)
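For comparison, a minimal symmetric doubly-linked list (where prev points to the node itself, not to a link field as in the asymmetric variant) can be sketched in Python; the class and method names are illustrative.

```python
class Node:
    def __init__(self, data):
        self.data, self.prev, self.next = data, None, None

class DoublyLinkedList:
    def __init__(self):
        self.first = self.last = None

    def append(self, data):
        """Add a node at the tail, fixing both directions of links."""
        node = Node(data)
        if self.last is None:
            self.first = self.last = node
        else:
            node.prev = self.last
            self.last.next = node
            self.last = node
        return node

    def insert_before(self, node, data):
        """Insert a new node before an existing one; only the head
        case needs special handling."""
        new = Node(data)
        new.prev, new.next = node.prev, node
        if node.prev is None:
            self.first = new            # inserting before the head
        else:
            node.prev.next = new
        node.prev = new
        return new

    def to_list(self):
        out, p = [], self.first
        while p:
            out.append(p.data)
            p = p.next
        return out

lst = DoublyLinkedList()
b = lst.append(2)
lst.append(3)
lst.insert_before(b, 1)
print(lst.to_list())   # [1, 2, 3]
```

Note that insert_before never traverses the list: both neighbours are reachable directly through the prev and next links, which is exactly the advantage claimed above.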
Points to remember
1. Linked lists are of two types: singly linked lists and doubly linked lists.
2. The storage pool is managed with the GETNODE and RET operations.
3. Garbage collection reclaims unused (free) memory space.
4. Each polynomial term has a coefficient and an exponent.
********
INTRODUCTION
Tree Terminology
Ancestor: The predecessor of a node together with all the ancestors of the
predecessor of a node. The root node has no ancestors.
Descendant: The children of a node together with all the descendants of the children
of a node. A leaf node has no descendants.
The unique node with no predecessor is called the root of the tree. This is the start of
the data structure. A node with no successors is called a leaf. There will usually be
many leaves in a tree. The successors of a node are called its children; the unique
predecessor of a node is called its parent. If two nodes have the same parent, they are
called brothers or siblings. In a binary tree the two children are defined by the
position occupied, which will be either the left or right.
Tree Height
The level of a node is one greater than the level of its parent. The level of the root node is 1.
The height of a tree is the maximum level of any node in the tree. The following figure shows
how we calculate the height of a binary tree:
A path is any linear subset of a tree, e.g. 1, 6, 7 and 1, 2, 3, 5 are paths. The length of a path
can be counted either as the number of nodes in the path or as the number of lines in the
path; we will count the nodes, e.g. 1, 6, 7 has a length of 3. However, remember that there is
no agreed definition.
There is always a unique path from the root to any node. Simple as this property seems, it is
extremely important. Our algorithms for processing trees will be based upon this property.
The depth or level of a node is the length of this path. When drawing a tree, it is useful if all
the nodes in the same level are drawn as a neat horizontal row.
Structure of Trees
In general, each child of a node is the root of a tree 'within the big tree'. For example, 7 is the
root of a little tree (8, 6 and 9). These inner trees are called subtrees. The subtrees of a node
are the trees whose roots are the children of the node, e.g. the subtrees of 2 are the trees
whose roots are 3 and 4. In a binary tree we refer to the left subtree and the right subtree.
Structural Features
A node value, usually called an element, containing the data we are storing in the tree.
With linked lists we had an alternative to recursion. We could scan through a list easily with
normal loops (while, do, repeat etc) as well as with recursion. This is not so easy to do for
trees. It is possible to scan through a tree non-recursively, but it is not nearly as easy as a
recursive scan.
A binary tree is either empty or it is a root node together with two subtrees, the left subtree
and the right subtree, each of which is a binary tree. A binary tree is full if every level
contains as many nodes as possible. A binary tree is complete if each level, except possibly
the last, contains as many nodes as possible, and the nodes on the last level are as far to the
left as possible. Binary trees have the following distinctive characteristics:
1. Each node has at most two children.
2. Each subtree is identified as either the left-subtree or the right-subtree of the parent.
3. A binary tree may be empty.
Definition: A way to represent a multiway tree as a binary tree. The leftmost child, c, of
a node, n, in the multiway tree is the left child, c', of the corresponding node, n', in the
binary tree. The immediately right sibling of c is the right child of c'.
If nl is the leftmost child of nk, n'l is the left child of n'k. (If nk has no children, n'k
has no left child.)
If ns is the next (immediately right) sibling of nk, n's is the right child of n'k.
Also known as first child-next sibling binary tree, doubly-chained tree, filial-heir chain.
The binary tree representation of a multiway tree or k-ary tree is based on the first child-
next sibling representation of the tree. In this representation every node is linked with its
leftmost child and its next (right-nearest) sibling.
1
|
2---3---4
|       |
5---6   7
        |
        8---9
Now, if we look at the first child-next sibling representation of the tree closely, we will see
that it forms a binary tree. To see this better, we can rotate every next-sibling edge 45
degrees clockwise. After that we get the following binary tree:
        1
       /
      2
     / \
    5   3
     \   \
      6   4
         /
        7
       /
      8
       \
        9
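The first child-next sibling (left-child, right-sibling) encoding can be sketched in Python: each multiway-tree node stores only its leftmost child and its next sibling, which is exactly a binary tree. The class and function names are illustrative.

```python
class LCRSNode:
    """Left-child, right-sibling node: left = leftmost child in the
    multiway tree, right = next sibling in the multiway tree."""

    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def add_child(parent, value):
    """Append a new rightmost child under parent."""
    node = LCRSNode(value)
    if parent.left is None:
        parent.left = node              # first (leftmost) child
    else:
        sib = parent.left
        while sib.right is not None:    # walk to the last sibling
            sib = sib.right
        sib.right = node
    return node

# Build the tree from the figure: 1 -> (2, 3, 4); 2 -> (5, 6); 4 -> (7); 7 -> (8, 9)
root = LCRSNode(1)
n2 = add_child(root, 2)
add_child(root, 3)
n4 = add_child(root, 4)
add_child(n2, 5); add_child(n2, 6)
n7 = add_child(n4, 7)
add_child(n7, 8); add_child(n7, 9)
print(root.left.value, root.left.right.value)   # 2 3
```

In the resulting binary tree, following left links descends to children and following right links walks along siblings, matching the rotated figure above.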
Binary Trees
A binary tree is an important type of structure which occurs very often. It is characterized by
the fact that any node can have at most two branches, i.e., there is no node with degree greater
than two. For binary trees we distinguish between the subtree on the left and the subtree on
the right, whereas for trees the order of the subtrees was irrelevant. Also a binary tree may
have zero nodes. Thus a binary tree is really a different object from a tree.
Definition: A binary tree is a finite set of nodes which is either empty or consists of a root
and two disjoint binary trees called the left subtree and the right subtree.
structure BTREE
  declare CREATE( ) --> btree
          ISMTBT(btree) --> boolean
          MAKEBT(btree, item, btree) --> btree
          LCHILD(btree) --> btree
          DATA(btree) --> item
          RCHILD(btree) --> btree
  for all p, r in btree, d in item let
    ISMTBT(CREATE) ::= true
    ISMTBT(MAKEBT(p,d,r)) ::= false
    LCHILD(MAKEBT(p,d,r)) ::= p; LCHILD(CREATE) ::= error
    DATA(MAKEBT(p,d,r)) ::= d; DATA(CREATE) ::= error
    RCHILD(MAKEBT(p,d,r)) ::= r; RCHILD(CREATE) ::= error
  end
end BTREE
This set of axioms defines only a minimal set of operations on binary trees. Other operations
can usually be built in terms of these. The distinctions between a binary tree and a tree
should be clearly understood.
A full binary tree of depth k is a binary tree of depth k having pow(2,k)-1 nodes. This
is the maximum number of the nodes such a binary tree can have. A very elegant sequential
representation for such binary trees results from sequentially numbering the nodes, starting
with nodes on level 1, then those on level 2 and so on. Nodes on any level are numbered from
left to right. This numbering scheme gives us the definition of a complete binary tree. A
binary tree with n nodes and a depth k is complete iff its nodes correspond to the nodes which
are numbered one to n in the full binary tree of depth k. The nodes may now be stored in a
one dimensional array tree, with the node numbered i being stored in tree[i].
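The index arithmetic of this sequential representation is easy to demonstrate; the sketch below uses a 1-based array as in the text, with tree[i] holding node number i.

```python
# Parent/child index arithmetic for the sequential (array) representation
# of a complete binary tree, 1-based as in the text.
def parent(i):
    return i // 2 if i > 1 else None        # node 1 is the root

def lchild(i, n):
    return 2 * i if 2 * i <= n else None    # no left child if 2i > n

def rchild(i, n):
    return 2 * i + 1 if 2 * i + 1 <= n else None

# A complete binary tree with 6 nodes, stored in tree[1..6].
tree = [None, 'A', 'B', 'C', 'D', 'E', 'F']
n = 6
print(parent(5), lchild(2, n), rchild(2, n), lchild(3, n), rchild(3, n))
# 2 4 5 6 None
```

The None results are exactly the "2i > n" and "2i + 1 > n" cases of Lemma 5.3 below: node 3 has a left child (node 6) but no right child, since 7 > 6.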
Lemma 5.3: If a complete binary tree with n nodes (i.e., depth = [log2 n] + 1) is represented
sequentially as above, then for any node with index i, 1 <= i <= n:
(i) parent(i) is at [i/2] if i is not equal to 1. When i = 1, i is the root and has no parent.
(ii) lchild(i) is at 2i if 2i <= n. If 2i > n, then i has no left child.
(iii) rchild(i) is at 2i + 1 if 2i + 1 <= n. If 2i + 1 > n, then i has no right child.
Proof: We prove (ii). (iii) is an immediate consequence of (ii) and the numbering of
nodes on the same level from left to right, and (i) follows from (ii) and (iii). We prove (ii) by
induction on i. For i = 1, clearly the left child is at 2 unless 2 > n, in which case 1 has no left
child. Now assume that for all j, 1 <= j <= i, lchild(j) is at 2j. Then the two nodes
immediately preceding lchild(i+1) in the numbering are the right and left children of i, so
lchild(i+1) is at 2i + 2 = 2(i+1), unless 2(i+1) > n, in which case i+1 has no left child.
This representation can clearly be used for all binary trees, though in most cases there will
be a lot of unutilized space. For complete binary trees the representation is ideal, as no space
is wasted. In the worst case a skewed tree of depth k will require pow(2,k)-1 spaces, of
which only k will be occupied.
While the above representation appears to be good for complete binary trees, it is
wasteful for many other binary trees. In addition, the representation suffers from the general
inadequacies of sequential representations: inserting or deleting nodes requires moving
potentially many elements.
There are many operations that we often want to perform on trees. One notion that
arises frequently is the idea of traversing a tree, or visiting each node in the tree exactly once.
A full traversal produces a linear order for the information in a tree. This linear order may be
familiar and useful. When traversing a binary tree we want to treat each node and its subtrees
in the same fashion. If we let L, D, R stand for moving left, printing the data, and moving
right when at a node, then there are six possible combinations of traversal: LDR, LRD, DLR,
DRL, RDL and RLD. If we adopt the convention that we traverse left before right, then only
three traversals remain: LDR, LRD and DLR. To these we assign the names inorder,
postorder and preorder because there is a natural correspondence between these traversals
and producing the infix, postfix and prefix forms of an expression.
Inorder Traversal: informally this calls for moving down the tree towards the left until you
can go no farther. Then you "visit" the node, move one node to the right and continue again.
If you cannot move to the right, go back one more node. A precise way of describing this
traversal is to write it as a recursive procedure.
procedure inorder(currentnode: treepointer);
{currentnode is a pointer to a node in a binary tree. For full
tree traversal, pass inorder the pointer to the top of the tree}
begin {inorder}
  if currentnode <> nil
  then
    begin
      inorder(currentnode^.leftchild);
      write(currentnode^.data);
      inorder(currentnode^.rightchild);
    end {of if}
end; {of inorder}
Recursion is an elegant device for describing this traversal. A second form of traversal is
preorder:
procedure preorder(currentnode: treepointer);
{currentnode is a pointer to a node in a binary tree. For full
tree traversal, pass preorder the pointer to the top of the tree}
begin {preorder}
  if currentnode <> nil
  then
    begin
      write(currentnode^.data);
      preorder(currentnode^.leftchild);
      preorder(currentnode^.rightchild);
    end {of if}
end; {of preorder}
In words we would say "visit a node, traverse left and continue again. When you cannot
continue, move right and begin again, or move back until you can move right and resume."
At this point it should be easy to guess the next traversal method, which is called postorder:
procedure postorder(currentnode: treepointer);
{currentnode is a pointer to a node in a binary tree. For full
tree traversal, pass postorder the pointer to the top of the tree}
begin {postorder}
  if currentnode <> nil
  then
    begin
      postorder(currentnode^.leftchild);
      postorder(currentnode^.rightchild);
      write(currentnode^.data);
    end {of if}
end; {of postorder}
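The three traversals can be demonstrated together in Python on a small expression tree, where they produce the infix, prefix and postfix forms of the expression as noted above; the node shape mirrors (data, leftchild, rightchild).

```python
class TreeNode:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def inorder(node, out):     # LDR
    if node:
        inorder(node.left, out)
        out.append(node.data)
        inorder(node.right, out)

def preorder(node, out):    # DLR
    if node:
        out.append(node.data)
        preorder(node.left, out)
        preorder(node.right, out)

def postorder(node, out):   # LRD
    if node:
        postorder(node.left, out)
        postorder(node.right, out)
        out.append(node.data)

# Expression tree for A + B * C.
root = TreeNode('+', TreeNode('A'),
                TreeNode('*', TreeNode('B'), TreeNode('C')))
for walk in (inorder, preorder, postorder):
    out = []
    walk(root, out)
    print(''.join(out))   # A+B*C, +A*BC, ABC*+
```

The postfix output ABC*+ is exactly the form produced by the expression-conversion algorithms of Unit I.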
1. A Threaded Binary Tree is a binary tree in which every node that does not have
a right child has a THREAD (in actual sense, a link) to its INORDER successor.
By doing this threading we avoid the recursive method of traversing a Tree,
which makes use of stacks and consumes a lot of memory and time.
The node structure for a threaded binary tree varies a bit and looks like this (the pointer
fields and the thread flag are a typical completion; implementations differ) --
Code:
struct NODE
{
    int node_value;
    struct NODE *left;
    struct NODE *right;
    int rthread;   /* nonzero if right is a thread to the inorder successor */
};
Advantage
1. By threading we avoid the recursive method of traversing a tree, which makes
use of a stack and consumes a lot of memory and time.
2. A node can keep a record of its root.
Disadvantage
1. This makes the tree more complex.
2. It is more prone to errors when both children are not present and both pointers of a
node point to its ancestors.
Overview
1. V = a set of vertices
2. E = a set of edges
Edges:
Vertices:
Representation:
Example:
1. V = {A,B,C,D}
2. E = {(A,B),(A,C),(A,D),(B,D),(C,D)}
Motivation
Examples:
o Network
o Social networks
Graph Classifications
o Weighted or unweighted
o Directed or undirected
o Cyclic or acyclic
o Actually, an edge is a set of 2 nodes, but for simplicity we write it with parens
Directed Graphs
Subgraph
If graph G=(V, E)
Example ...
Degree of a Node
The degree of a node is the number of edges incident on it (the number of edges the node is used to define)
o Degree 2: B and C
o Degree 3: A and D
Examples
o Cycle: ?
o Simple Cycle: ?
o Lengths?
o Otherwise it is unconnected
A directed graph is strongly connected if every pair of vertices has a path between
them, in both directions
o Example ...
o Example ...
o Example ...
Spanning trees, and even minimum spanning trees, are not necessarily unique
o Adjacency lists
o Adjacency matrix
o A: B, C, D
o B: A, D
o D: A, B, C
o A: B, C, D
o B: D
o C: Nil
o D: C
Time:
o Entries contain ∞ (i.e. Integer'last) if there is no edge from the row vertex to the column vertex
   A B C D
A  0 1 1 1
B  1 0 ∞ 1
C  1 ∞ 0 1
D  1 1 1 0

   A B C D
A  ∞ 1 1 1
B  ∞ ∞ ∞ 1
C  ∞ ∞ ∞ ∞
D  ∞ ∞ 1 ∞
Space: Θ(|V|²)
Time:
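Both representations can be built for the example graph V = {A,B,C,D}, E = {(A,B),(A,C),(A,D),(B,D),(C,D)}; this sketch uses 0/1 entries in the matrix for the unweighted undirected case.

```python
# The example graph in both representations.
V = ['A', 'B', 'C', 'D']
E = [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'D'), ('C', 'D')]

# Adjacency lists: undirected, so each edge appears in both lists.
adj = {v: [] for v in V}
for u, w in E:
    adj[u].append(w)
    adj[w].append(u)

# Adjacency matrix: 1 if an edge exists, 0 otherwise.
index = {v: i for i, v in enumerate(V)}
matrix = [[0] * len(V) for _ in V]
for u, w in E:
    matrix[index[u]][index[w]] = matrix[index[w]][index[u]] = 1

print(sorted(adj['A']))      # ['B', 'C', 'D']
print(matrix[index['B']])    # [1, 0, 0, 1]
```

The adjacency lists use O(|V| + |E|) space while the matrix always uses Θ(|V|²), which is the trade-off summarized above.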
GRAPH TRAVERSALS
Big and ongoing topic! Emphases: understand basic definitions, matrix and adjacency
representation of graphs, breadth first and depth first search, and some simple algorithms
based on them and some that use the more complicated classifications of edges. In general
understand the idea of adding to DFS and BFS to obtain various objectives. Understand
logical arguments made with graph ideas and make simple logical arguments.
Below are edge list and adjacency matrix interpretations for an undirected and a directed
graph.
Like the book, I will follow the practice of labeling vertices with capital letters, avoiding any
confusion with various uses of numbers that will later be associated with graphs. In efficient
implementations the vertices are just associated with numeric indices.
Many things can be done by inserting code into the basic DFS, and to a lesser extent, BFS.
I will generally use V as the vertices of a graph and E as the edges, and |V| and |E| as their
sizes. All of the algorithms have order O(|V| + |E|): the main body of dfs or bfs is called once
for each vertex, and the inner loops are executed once for each edge. All the elaborations add
only an extra O(1) each time they are executed in the basic flow.
Assuming fixed size descriptors for each piece of the data, the total size of the data has the
same order as the time of execution, and the algorithm time order is said to be linear (in the
size of the data).
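The basic DFS and BFS can be sketched in Python over an adjacency-list graph; both run in O(|V| + |E|) as described above, and the recorded visiting order is what the elaborations hook into.

```python
from collections import deque

def bfs(adj, start):
    """Breadth-first search: a queue of discovered-but-unprocessed
    vertices gives level-by-level order."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order

def dfs(adj, v, seen=None, order=None):
    """Depth-first search: recursion dives along one neighbour
    before considering the next."""
    if seen is None:
        seen, order = set(), []
    seen.add(v)
    order.append(v)
    for w in adj[v]:
        if w not in seen:
            dfs(adj, w, seen, order)
    return order

adj = {'A': ['B', 'C', 'D'], 'B': ['A', 'D'],
       'C': ['A', 'D'], 'D': ['A', 'B', 'C']}
print(bfs(adj, 'A'))   # ['A', 'B', 'C', 'D']
print(dfs(adj, 'A'))   # ['A', 'B', 'D', 'C']
```

Note how DFS from A reaches D through B before ever touching C, while BFS visits all of A's neighbours first.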
Edge classifications:
Note the order of the start of visiting (None indicates still not visited)
Note the order of finishing visiting (not yet finished None)
If u is first visited after a vertex v is first visited but before v is finished, u is a descendant of
v. (Everything on the stack after v is a descendant.)
The remaining case is when there is no ancestor-descendant relationship. This is a cross edge.
This occurs when dfs is working on v, and looks to a neighbor u that is already finished.
You cannot finish with a vertex in an undirected graph when a vertex directly connecting to it
is still unvisited. In a directed graph there can be a connection to a finished vertex - as long
as the reverse edge is missing.
In an operating system, at the most abstract level, how do you get deadlock?
A convention needs to be maintained with a dependency graph. You can either have the
arrows go with time, from what must come first to what must follow later (as the book does
it) OR you can have a task point to the things that must be done first, reversing all the arrows
in the text. This latter approach is actually more useful for further algorithms, making it
easier to check on all prerequisites being completed, so I will use it:
With either convention, in deadlock there is a cycle in the dependencies. If you expect to
finish a collection of tasks with dependencies, there can be no cyclic dependency! A
workable dependency graph is a directed acyclic graph (DAG).
Topological Order: In graph with n vertices, assign topological number t(v) = 1..n to the
vertices in a way that if vw is any edge, t(v) < t(w).
Reverse topological order, r(v): backwards from above: if vw is an edge, r(v) > r(w).
Theorem: The sequence of ending times end[v] in a DFS satisfy end[v] > end[w] for edges
vw.
Proof: Need to show for any edge vw, end[v] > end[w]. Look at the four edge types:
Forward, tree edges: w is a descendant of v, and hence by the LIFO sequence of the
stack, the ending time for w is before v.
A topological order or its reverse is a permutation of the vertices. What is actually useful is
the functional inverse of this permutation: the mapping from time order to vertex. By the
theorem above, the list for a reverse topological order is easily constructed in the DFS by
just adding to a list of finished vertices each time a vertex is finished. Reversing the final
order of that list provides a vertex listing in forward topological order. A direct translation of
this is in topoordersure.py. A possible issue with this algorithm is that it can be run with any
graph, not just a DAG, but if the graph is not a DAG, the result makes no sense as a
topological order. A more robust version, avoiding a meaningless answer, is topoorder.py. In
this algorithm there is a check for back edges as the DFS progresses, and None is returned
instead of a vertex sequence if the graph is not a DAG, with the check made as in
DFSEdgeClassify.py.
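The robust version can be sketched in Python: finish times come from a DFS, a gray/white/black colouring detects back edges, and None is returned for a non-DAG (mirroring the topoorder.py behaviour described above; the function name here is illustrative).

```python
def topo_order(adj):
    """Forward topological order via DFS finishing times, or None
    if the graph contains a cycle (a back edge is found)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in adj}
    finished = []

    def visit(v):
        color[v] = GRAY                   # v is on the recursion stack
        for w in adj[v]:
            if color[w] == GRAY:          # back edge: a cycle exists
                return False
            if color[w] == WHITE and not visit(w):
                return False
        color[v] = BLACK
        finished.append(v)                # appended at finish time
        return True

    for v in adj:
        if color[v] == WHITE and not visit(v):
            return None
    return finished[::-1]                 # reverse => forward topological order

dag = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
cyc = {'A': ['B'], 'B': ['A']}
print(topo_order(dag))   # ['A', 'C', 'B', 'D']
print(topo_order(cyc))   # None
```

The finished list itself is the reverse topological order; one final reversal gives the forward order, exactly as the text describes.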
Critical paths are applications for DAG's and reverse topological order:
The last application of reverse topological order was for a sequence with a single processor or
actor. In parallel processes, (human or machine), tasks that do not depend on each other can
be done at the same time. The chain of dependencies that takes the longest time determines
the minimum time to completion. This longest time chain of dependencies is a critical path.
Original DAG:
Each task is associated with a time for completion. Also we want one final node (the DONE
node, with 0 duration) depending on each originally terminal node:
With dependencies marked in this order, the direction of time and the direction of the graph
are backwards. I will use start and end to refer to time, and use the digraph jargon source
(for a vertex with no arrows to it) and sink (for a vertex with no arrows from it).
The table below gives dependencies and times required for each action.
The tasks with no dependencies are then sinks. This example illustrates that there can be
several sinks (actions with no dependencies).
For simplicity the alphabetical order of tasks is already a reverse topological order. This
means the table below can be filled in sequentially from top to bottom.
Task  Depends on  Duration
A     -           4
B     -           3
C     -           1
D     A, B        5
E     B           7
F     B, C        6
G     D, E        2
H     -           5
I     F, G        3
J     F, H        9
Done  I, J        0
If each vertex is associated with its Earliest Starting Time (EST), then the path between a
vertex and a dependency adds the duration for the dependency. The edges to the DONE node
then allows the duration of the otherwise terminal nodes to be encoded.
If rTop is a list of the vertices in reverse topological order, then the maximal path weight and
the path itself are easy to calculate. We assume we start with the vertex duration data in an
array, and we need extra arrays EFT and also worstDep, if we want to recover the longest
path itself.
Then the EFT for the Done vertex is the overall earliest finishing time. Without the marked
Done vertex, there must also be a check for the vertex with the worse EFT.
Full code with testing in critPathTopo.py. By the same logic as for the DFS and BFS, this
algorithm is linear in the data size.
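The computation can be sketched directly in Python for the task table above: tasks are processed in reverse topological order (here alphabetical, with Done last, as the text notes), and the worst dependency at each step lets us recover the critical path. Variable names such as eft and worst_dep follow the text's EFT and worstDep arrays.

```python
# EFT computation for the task table above.
duration = {'A': 4, 'B': 3, 'C': 1, 'D': 5, 'E': 7, 'F': 6,
            'G': 2, 'H': 5, 'I': 3, 'J': 9, 'Done': 0}
deps = {'A': [], 'B': [], 'C': [], 'D': ['A', 'B'], 'E': ['B'],
        'F': ['B', 'C'], 'G': ['D', 'E'], 'H': [], 'I': ['F', 'G'],
        'J': ['F', 'H'], 'Done': ['I', 'J']}

eft, worst_dep = {}, {}
for task in duration:                 # dict order here is a reverse topological order
    best = 0
    for d in deps[task]:
        if eft[d] > best:             # latest-finishing dependency dominates
            best, worst_dep[task] = eft[d], d
    eft[task] = duration[task] + best

# Recover the critical path by following worst_dep back from Done.
path, t = [], worst_dep.get('Done')
while t is not None:
    path.append(t)
    t = worst_dep.get(t)
print(eft['Done'], path[::-1])   # 18 ['B', 'F', 'J']
```

This reproduces the table's result: minimum duration 18 with critical path B, F, J. Like the DFS-based algorithms, it is linear in the size of the data, since each task and each dependency edge is examined once.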
Task  Depends on  Duration  EFT  WorstDep
A     -           4         4    -
B     -           3         3    -
C     -           1         1    -
D     A, B        5         9    A
E     B           7         10   B
F     B, C        6         9    B
G     D, E        2         12   E
H     -           5         5    -
I     F, G        3         15   G
J     F, H        9         18   F
Done  I, J        0         18   J
So the minimum duration is 18, and the critical path is obtained by reversing the path from
Done given by the WorstDep column: J, F, B. Reversed, the order of the critical parts to
complete (forward in time) is B, F, J.
We will come back to this topic later, but for now note that DAGs and topological order are
central to dynamic programming.
You could make a weighted graph model with the vertex duration consistently on either the
edges to the vertex or the edges from it. The addition of the Done vertex makes the approach
I took, with the weight on edges to the dependency, more convenient. You would have to add
further edges from the sources to a Start vertex to put the weights on the opposite side.
The module graph.py is worth discussion before getting to the parts for meta-graphs. We will
be using this module later when discussing weighted graphs. It includes a simple Edge class
which includes the edge's numerical weight.
I save a coding step by identifying all vertices purely by name rather than going back and
forth between vertex labels and vertex indices as I do in DFS.py. Admittedly the version
with indices, which involves a basic list or array data structure, is somewhat more efficient in
time than my representation which uses basically the same notation, but is dictionary based.
The neighbor lists are modified to be lists of edges to neighbors, so the weights are
included. Rather than my kludgy list structure for a graph used before in DFS.py, vertices
and edges are instance variables, and g.edges[v] is the list of edges with starting vertex v in
the graph g.
Task  Depends on  Duration
A     C, E        1
B     A, F        3
C     -           2
D     B, F        4
E     -           3
F     E           2
Points to remember
1. A threaded binary tree turns otherwise-unused nil links into threads, giving a node a
direct link to its inorder successor.
External Sorting: Storage Devices -Sorting with Disks: K-Way Merging – Sorting with
Tapes Symbol Tables: Static Tree Tables - Dynamic Tree Tables - Hash Tables:
Hashing Functions - Overflow Handling.
INTRODUCTION
SORTING
Sorting is important 1) as an aid in searching, and 2) as a means for matching entries in files.
Sorting also finds application in the solution of many other more complex problems, e.g.
from operations research and job scheduling.
Sorting algorithms fall into two classes:
1. Simple algorithms that require O(n²) comparisons to sort n items.
2. Sophisticated algorithms that require O(n log₂ n) comparisons to sort n items.
Sorting can be carried out in two ways:
Internal sorting.
External sorting
Internal sorting methods are applied when the entire collection of data to be sorted is small
enough that the sorting can take place within the memory. The time required to read or write
is not considered to be significant in evaluating the performance of the internal sorting
methods.
External sorting methods are applied to larger collection of data which reside on
secondary devices, read and write access time are major concern in determining sort
performances.
Internal sorting:
Insertion sort
Quick sort
Merge sort
Heap sort
Radix sort
External sorting:
EXTERNAL SORTING
MAGNETIC DISK
Disks are examples of direct access storage devices. Data are recorded on the disk platter in
concentric tracks. A disk has two surfaces on which data can be recorded. Disk packs have
several such disks rigidly mounted on a common spindle. Data is read/written by a
read/write head; a disk pack would have one such head per surface.
Each surface has a number of concentric circles called tracks. In a disk pack, a set of parallel
tracks on each surface is called a cylinder. Tracks are further divided in to sectors. A sector is
the smallest addressable segment of a track.
Data is stored along the tracks in blocks. Therefore to access a disk, the track or cylinder
number and the sector number of the starting block must be specified. For disk packs the
surface must also be specified. The read/write head moves horizontally to position itself over
the correct track for accessing disk data.
Seek time: time taken to position the read/write head over the correct cylinder.
Latency time: time taken to position the correct sector under the head.
Transfer time (transmission time): time taken to actually transfer the block between
main memory and disk.
Allocate 3 blocks of memory each capable of holding 500 records. Two of these buffers B1,
B2 will be treated as input buffer and B3 will be treated as an output buffer.
Stage 1: internally sort the input to obtain six runs R1, ..., R6 of 1000 records each.
Stage 2: merge the runs in pairs; from 6 runs of size 1000 each we now have 3 runs of size
2000 each (records 1-2000, 2001-4000 and 4001-6000).
Stage 3: merge two of these runs into a single run of 4000 records.
Stage 4: merge the remaining two runs to obtain the final sorted run RF of all 6000
records (1-6000).
Analysis:
T1 : seek time
T2 : latency time
T3 : transmission time per block
T = T1 + T2 + T3
nTm : time to merge n records from the input buffers to the output buffer.
We read 12 blocks and write 12 blocks.
The total time = 24T + 6T4 + 24T + 6000Tm + 16T + 4000Tm + 24T + 6000Tm;
it is seen that the largest influencing factor is Tm, which depends on the number of passes
made over the data or the number of times runs must be combined.
In the above example we used 2-way merging, i.e. combining 2 runs at a time.
The number of passes over the data can be reduced by combining more runs at a time, hence
the k-way merge, where k >= 2.
This advantage is accompanied by certain side effects. The time to merge K runs would
obviously be more than the time to merge 2 runs, so the value of Tm itself could increase.
For large values of K, an efficient method of selecting the smallest record out of K in each
step of the merge is needed. One such method is the use of a selection tree.
A selection tree is a binary tree in which each node represents the smaller of its children.
It therefore follows that the root will be the smallest node. The way it works is simple.
Initially, the selection tree is built from the first item of each of the K runs; the one which
gets to the root is selected as the smallest. Then the next item from the run whose record
reached the root enters the tree, and the tree is restructured to get a new root, and so on
until the K runs are merged.
Example: the original figure shows four runs (beginning R1: 2 4 1 3 ..., and so on), the
selection tree built from the first record of each run, and the restructured tree after the
smallest record is output and the next record of that run enters.
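The selection-tree idea can be sketched in Python with a binary heap standing in for the tree: the heap always surfaces the smallest current record among the K runs, and after each selection the next record of the same run enters. The function name kway_merge is illustrative.

```python
import heapq

def kway_merge(runs):
    """Merge K sorted runs into one sorted output; the heap plays
    the role of the selection tree."""
    # Build the "tree" from the first record of each (non-empty) run.
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, run_idx, pos = heapq.heappop(heap)   # root = smallest record
        out.append(value)
        if pos + 1 < len(runs[run_idx]):
            # The next record of the same run enters and the heap restructures.
            heapq.heappush(heap, (runs[run_idx][pos + 1], run_idx, pos + 1))
    return out

runs = [[1, 5, 9], [2, 6], [3, 4, 8], [7]]
print(kway_merge(runs))   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Each selection costs O(log K) restructuring, so merging n records across K runs takes O(n log K), which is why larger K reduces passes at the cost of a more expensive Tm per record.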
In going to a higher-order merge, we save input/output time. But with the increase in
the order of the merge, the number of buffers required also increases; at a minimum,
merging K runs needs K + 1 buffers. The internal memory is a fixed quantity, therefore if
the number of buffers increases, their size must decrease.
MAGNETIC TAPES
In order to read or write data to a tape, the block length and the address in memory
to/from which the data is to be transferred must be specified. These areas in memory from/to
which data is transferred are called buffers. Usually block size will correspond to buffer size.
Block size is a crucial factor in tape access. A large block size is preferred for the
following reasons:
1. Consider a tape with a tape density of 600 bpi and an inter-block gap of ¾". This gap is
enough to write 450 characters. With a small block size, the number of blocks per tape
length will increase. This means a large number of inter-block gaps, i.e. stretches of tape
which cannot be utilized for data storage, and thus decreased tape utilization. Thus the
larger the block size, the fewer the blocks, the fewer the inter-block gaps and the better the
tape utilization.
2. A larger block size reduces the input/output time. The delay time in tape access is the time
needed to cross the inter-block gap. This delay is larger when the tape starts from rest than
when it is already moving. With a small block size the number of halts in a read is
considerable, causing the delay time to be incurred each time.
While large block size is desirable from the view of efficient tape usage as well as
reduced access time, the amount of main memory available for use as I/O buffers is a limiting
factor.
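The tape-utilization arithmetic in point 1 is easy to check: at 600 bpi a ¾-inch gap could have held 600 × 0.75 = 450 characters, so the fraction of tape holding data is blocksize / (blocksize + 450). This small sketch (function name illustrative) computes that fraction for a few block sizes.

```python
def utilization(block_size, gap_chars=450):
    """Fraction of tape holding data when each block of block_size
    characters is followed by a gap worth gap_chars characters."""
    return block_size / (block_size + gap_chars)

# 600 bpi * 0.75 inch gap = 450 characters lost per block.
for size in (100, 450, 2000):
    print(size, round(utilization(size), 2))
```

A 100-character block wastes over 80% of the tape, while a 2000-character block uses over 80% of it, which is exactly why large blocks are preferred.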
Sorting on tapes is carried out using the same basic steps as sorting on disks. Sections of the
input file are internally sorted in to runs which are written out on to tape. These runs are
repeatedly merged together until there is only one run.
Example :
Stage 1: internally sort sets of 3 blocks of 250 records each to obtain runs 750 records long.
Stage 2: write these runs alternately on to tapes T1 and T2:
T1: R1 R3 R5
T2: R2 R4 R6
Stage 3: rewind T1, T2 and the input tape. Dismount the input tape and replace it by a
work tape T4.
Stage 4: using a two-way merge, merge the runs from tape T1 with those on tape T2 and
distribute the resulting bigger runs alternately on to T3 and T4.
Stages 5-7: the merge and redistribution are repeated with successively larger runs until a
single sorted run remains.
nTm - time to merge n records from input file to output buffers using 2 – way
merge
BALANCED MERGE:
As with disk sorting, sorting with tapes depends on the number of passes made over the data.
Use of a higher-order merge results in a decrease in the number of passes being made without
changing the internal merge time, i.e. a higher-order merge takes less time.
In disk sorting, the order of merge is limited by the amount of main memory available for
input/output buffers. A k-way merge requires the use of 2k + 2 buffers. Another restriction in
the case of tape is the number of tapes available. In order to avoid excessive seek time, it is
necessary that runs being merged be on different tapes. Thus a k-way merge requires at least
k-tapes for use as input tapes during the merge. In addition, another tape is required for the
output generated during the merge. Hence at least k + 1 tapes must be available for a k- way
merge. Using k + 1 tapes for a k-way merge requires an additional pass over the output tape
to redistribute the runs on to k- tapes for the next level of merge. The redistribution can be
avoided by using 2k tapes. During the k-way merge with 2k tapes, k of these tapes are used as
input tapes and the remaining k as output tapes. We now look at three procedures:
procedures m1 and m2 follow the k+1 tape strategy, and procedure m3 follows the 2k tape
strategy. Procedure m1 always uses tk+1 as the output tape; procedure m2 chooses the
output tape in cyclic order, i.e. k+1, 1, 2, ..., k, in which case the redistribution from the
output tape makes less than a full pass over the file. Procedure m3 follows the 2k tape
strategy.
procedure m1;
{Sort a file of records from a given input tape using a k-way merge; tapes t1,...,tk+1
are available for the sort.}
begin
  create runs from the input tape, distributing them evenly over tapes t1,...,tk;
  rewind t1,...,tk and the input tape;
  if there is only 1 run then goto 99; {the sorted file is on t1}
  while true do
  begin {repeatedly merge onto tk+1 and redistribute back onto t1,...,tk}
    merge runs from t1,...,tk onto tk+1; rewind t1,...,tk+1;
    if the number of runs on tk+1 = 1 then goto 99; {the sorted file is on tk+1}
    evenly distribute the runs from tk+1 onto t1,...,tk; rewind t1,...,tk+1;
  end;
99: end; {of m1}
procedure m2;
{Sort a file of records from a given input tape using a k-way merge; tapes t1,...,tk+1
are available for the sort. The output tape is chosen in cyclic order.}
begin
  create runs from the input tape, distributing them evenly over tapes t1,...,tk;
  rewind t1,...,tk and the input tape; i := k+1; {ti is the output tape}
  if there is only 1 run then goto 99; {the sorted file is on t1}
  while true do
  begin
    merge runs from the k tapes other than ti onto ti;
    rewind t1,...,tk+1;
    if the number of runs on ti = 1 then goto 99; {the sorted file is on ti}
    evenly distribute (k-1)/k of the runs on ti onto the k-1 tapes other than ti and the
      next output tape; {the remaining runs stay on ti, which becomes an input tape}
    rewind t1,...,tk+1;
    i := i mod (k+1) + 1; {select the output tape in cyclic order}
  end;
99: end; {of m2}
procedure m3;
{Sort a file of records from a given input tape using a k-way merge on 2k tapes t1,...,t2k.}
begin
  create runs from the input tape, distributing them evenly over tapes t1,...,tk;
  rewind t1,...,tk and the input tape;
  i := 0; {the runs are currently on tapes t(ik+1),...,t(ik+k)}
  while the total number of runs > 1 do
  begin
    j := 1 - i;
    perform a k-way merge of the runs on t(ik+1),...,t(ik+k), distributing the output
      runs onto t(jk+1),...,t(jk+k);
    rewind t1,...,t2k;
    i := j;
  end;
end; {of m3; the single remaining run is the sorted file}
SYMBOL TABLES
structure SYMBOL-TABLE
  declare CREATE( ) → symtb
          INSERT(symtb, name, value) → symtb
          DELETE(symtb, name) → symtb
          FIND(symtb, name) → value
          HAS(symtb, name) → Boolean
          ISMTST(symtb) → Boolean;
  for all S ∈ symtb, a, b ∈ name, r ∈ value let
    ISMTST(CREATE) :: = true
    ISMTST(INSERT(S,a,r)) :: = false
    HAS(CREATE, b) :: = false
    HAS(INSERT(S,a,r), b) :: =
        if EQUAL(a,b) then true else HAS(S,b)
    DELETE(CREATE, a) :: = CREATE
    DELETE(INSERT(S,a,r), b) :: =
        if EQUAL(a,b) then S
        else INSERT(DELETE(S,b), a, r)
    FIND(CREATE, a) :: = error
    FIND(INSERT(S,a,r), b) :: =
        if EQUAL(a,b) then r
        else FIND(S,b)
  end
end SYMBOL-TABLE
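A minimal sketch of these operations, using a Python dict as the underlying table; the
operation names mirror the specification above, and copying on INSERT/DELETE
reflects the axioms' treatment of tables as values:

```python
# Sketch of the SYMBOL-TABLE operations over a Python dict.

def CREATE():
    return {}

def INSERT(symtb, name, value):
    s = dict(symtb)           # the axioms treat tables as values, so copy
    s[name] = value
    return s

def DELETE(symtb, name):
    s = dict(symtb)
    s.pop(name, None)         # DELETE(CREATE, a) ::= CREATE
    return s

def FIND(symtb, name):
    if name not in symtb:
        raise KeyError(name)  # FIND(CREATE, a) ::= error
    return symtb[name]

def HAS(symtb, name):
    return name in symtb

def ISMTST(symtb):
    return len(symtb) == 0
```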
If the identifiers in a symbol table are known in advance and no additions or deletions
are permitted, the symbol table is called static; otherwise it is dynamic.
Definition: A binary search tree t is a binary tree; either it is empty, or each node in the
tree contains an identifier and
(i) all identifiers in the left subtree of t are less than the identifier in the
root node t;
(ii) all identifiers in the right subtree of t are greater than the identifier in the
root node t;
(iii) the left and right subtrees of t are also binary search trees.
For a given set of identifiers several binary trees are possible. To determine whether an
identifier x is present in a binary search tree, x is compared with the root. If x is less than the
identifier in the root, then the search continues in the left subtree; if x equals the identifier in
the root, the search terminates successfully; otherwise the search continues in the right
subtree. The data types used by the algorithm are defined as:
treepointer = ↑ treerecord;
treerecord = record
leftchild : treepointer;
ident : identifier;
rightchild : treepointer;
end;
[Figure: a binary search tree with root 30; its children are 20 and 40; 20 has children
5 and 25, and 40 has children 35 and 45.]
procedure search(t : treepointer; x : identifier; var p : treepointer);
{Search the binary search tree t for x. Each node has fields leftchild, ident and
rightchild. On return, p points to the node containing x, or p = nil if x is not present.}
var found : boolean;
begin
  found := false; p := t;
  while (p <> nil) and not found do
    if x = p↑.ident then found := true
    else if x < p↑.ident then p := p↑.leftchild
    else p := p↑.rightchild;
end;
In evaluating binary search trees, we add a square node at every place there is a null
link. Every binary tree with n nodes has n+1 null links and hence n+1 square nodes.
These are called external nodes or failure nodes; the remaining nodes are called internal
nodes. The external path length is the sum, over all external nodes, of the lengths of the
paths from the root to those nodes. Similarly, the internal path length is the sum, over all
internal nodes, of the lengths of the paths from the root to those nodes.
[Figure: the tree above with internal nodes (30, 20, 40, ...) drawn as circles and external
(failure) nodes drawn as squares below the leaves 5, 25, 35 and 45.]
Dynamic tables are also maintained as binary search trees. Here we can add an
identifier to the tree if it is not already present. The following algorithm inserts an
identifier into a binary search tree.
procedure bst(x : identifier; var t : treepointer; var retpointer : treepointer);
{Search the binary search tree t for the node pointed to by retpointer such that
retpointer↑.ident = x. If x is not present in the tree, it is entered at the appropriate point.}
var trailpointer : treepointer;
    found : boolean;
begin
  retpointer := t; trailpointer := nil; found := false;
  while (retpointer <> nil) and not found do
    if x = retpointer↑.ident then found := true
    else begin
      trailpointer := retpointer;
      if x < retpointer↑.ident then retpointer := retpointer↑.leftchild
      else retpointer := retpointer↑.rightchild;
    end;
  if not found then
  begin
    new(retpointer);
    retpointer↑.ident := x;
    retpointer↑.leftchild := nil;
    retpointer↑.rightchild := nil;
    if t = nil then t := retpointer
    else if x < trailpointer↑.ident then trailpointer↑.leftchild := retpointer
    else trailpointer↑.rightchild := retpointer;
  end;
end;
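The same search-then-attach logic can be sketched in Python; the class and function
names below are illustrative:

```python
# A Python sketch of the insertion algorithm above: search for x and, if it
# is absent, attach a new node at the point where the search failed.

class TreeNode:
    def __init__(self, ident):
        self.ident = ident
        self.leftchild = None
        self.rightchild = None

def bst_insert(t, x):
    """Return (root, node holding x); inserts x if it is not present."""
    trail, p = None, t
    while p is not None:
        if x == p.ident:
            return t, p                  # already present
        trail = p
        p = p.leftchild if x < p.ident else p.rightchild
    node = TreeNode(x)
    if trail is None:
        return node, node                # the tree was empty
    if x < trail.ident:
        trail.leftchild = node
    else:
        trail.rightchild = node
    return t, node

def bst_search(t, x):
    while t is not None and t.ident != x:
        t = t.leftchild if x < t.ident else t.rightchild
    return t
```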
A binary tree of height h is called completely balanced or balanced if all leaves occur at
nodes of level h or h-1 and if all nodes at levels lower than h-1 have two children.
Adelson-Velskii and Landis introduced a binary tree structure, the AVL tree, that is
balanced with respect to the heights of its subtrees.
DEFINITION
The balance factor, BF(T), of a node T in a binary tree is defined to be hL - hR, where hL
and hR are the heights of the left and right subtrees of T. For any node in an AVL tree,
BF(T) = -1, 0 or 1.
HASH TABLES
In tree tables, the search for an identifier key is carried out through a sequence of
comparisons. Hashing differs from this in that the address or location of an identifier X
is obtained by computing some arithmetic function f(X). f(X) gives the address of X in
the table. This address is called the hash or home address of X. The memory is referred to as the
hash table, ht. The hash table is partitioned into b buckets, ht(0) , ht(1), … ht(b-1). Each
bucket is capable of holding s records. Thus, a bucket is said to consist of s slots, each slot
being large enough to hold one record. The hashing function f(X) is used to perform an
identifier transformation on X. f(X) maps the set of possible identifier onto the integers 0
through b-1. Two identifiers I1, I2 are said to be synonyms with respect to f if f(I1) = f(I2).
Distinct synonyms are entered into the same bucket so long as all the s slots in the bucket
have not been used. An overflow is said to occur when a new identifier I is mapped or
hashed by f into a full bucket. A collision occurs when two non-identical identifiers are
hashed into the same bucket.
Example
Consider a hash table ht with 26 buckets, each bucket having exactly two slots. Assume
that there are n (= 6) distinct identifiers in the program and that each identifier begins
with a letter. The function f defined by f(X) = the first character of X will hash all
identifiers X into the hash table. The identifiers are GA, D, A, A1, G and E.

        SLOT 1   SLOT 2
   1    A        A1
   2    0        0
   3    0        0
   4    D        0
   5    E        0
   6    0        0
   7    GA       G
  ...
  26    0        0
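The example can be sketched in Python; the function and variable names are illustrative,
and buckets are kept in a dict keyed 1 to 26 as in the table above:

```python
# A sketch of the example above: 26 buckets of two slots each, with
# f(X) = position of the first letter of X in the alphabet.

def f(x):
    return ord(x[0].upper()) - ord('A') + 1   # buckets numbered 1..26

def build_table(identifiers, slots=2):
    ht = {b: [] for b in range(1, 27)}
    for ident in identifiers:
        bucket = ht[f(ident)]
        if len(bucket) < slots:
            bucket.append(ident)              # distinct synonyms share a bucket
        else:
            raise OverflowError(ident)        # bucket full: an overflow occurs
    return ht

table = build_table(["GA", "D", "A", "A1", "G", "E"])
```

A and A1 are synonyms landing in bucket 1; GA and G are synonyms landing in bucket 7.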
HASHING FUNCTIONS
A hashing function f transforms an identifier X into a bucket address in the hash table.
A hash function f is called a uniform hash function if each identifier has an equal chance
of hashing into any bucket, i.e. for each identifier X in the identifier space, the
probability that f(X) = i is 1/b for every bucket i.
MID-SQUARE:
One hash function that has found much use in symbol table applications is the
'middle of square' function. This function, fm, is computed by squaring the identifier and
then using an appropriate number of bits from the middle of the square to obtain the
bucket address; the identifier is assumed to fit into one computer word.
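A sketch of the mid-square idea in Python; the choice of which middle bits to take is an
assumption of the sketch (the text leaves it as "an appropriate number of bits"):

```python
# Sketch of the mid-square function: square the (numeric form of the)
# identifier and take r bits from the middle of the square as the bucket
# address, giving a bucket in 0 .. 2**r - 1.

def mid_square(key, r=8):
    """Hash an integer key into one of 2**r buckets."""
    sq = key * key
    nbits = sq.bit_length()
    shift = max((nbits - r) // 2, 0)   # drop low-order bits up to the middle
    return (sq >> shift) & ((1 << r) - 1)
```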
DIVISION:
fD(X) = X mod M
This gives bucket addresses in the range 0 to (M - 1), so the hash table must be at least
of size b = M.
FOLDING:
In this method the identifier X is partitioned into several parts, all but the last being of
the same length. These parts are then added together to obtain the hash address for X.
There are two ways of carrying out this addition. In the first, all but the last part are
shifted so that the least significant bit of each part lines up with the corresponding bit of
the last part; the different parts are then added together to get f(X). This method is
known as shift folding. The other method of adding the parts is folding at the
boundaries: the identifier is folded at the part boundaries, and digits falling into the same
position are added together.
       Shift folding    Folding at boundaries
  P1   123              123
  P2   203              302
  P3   241              241
  P4   112              211
       ---              ---
       679              877
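Both additions can be sketched in Python for the four three-digit parts shown above; the
part length and the digit-string form of the identifier are assumptions of the sketch:

```python
# Sketch of shift folding and folding at the boundaries for an identifier
# given as a digit string, partitioned into parts of length part_len.

def shift_fold(digits, part_len=3):
    parts = [int(digits[i:i+part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts)

def boundary_fold(digits, part_len=3):
    parts = [digits[i:i+part_len] for i in range(0, len(digits), part_len)]
    total = 0
    for idx, p in enumerate(parts):
        if idx % 2 == 1:          # every other part is "folded over": reversed
            p = p[::-1]
        total += int(p)
    return total
```

For the parts 123, 203, 241, 112 this gives 679 by shift folding and 877 by folding at
the boundaries (203 becomes 302 and 112 becomes 211).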
DIGIT ANALYSIS:
This method is particularly useful for a static file where all the identifiers in the table
are known in advance. Each identifier X is interpreted as a number using some radix r;
the same radix is used for all identifiers in the table. Using this radix, the digits of each
identifier are examined, and digits having the most skewed distributions are deleted.
Enough digits are deleted so that the number of digits left is small enough to give an
address in the range of the hash table.
OVERFLOW HANDLING:
In order to be able to detect collisions and overflows, it is necessary to initialize the hash
table, ht, to represent the situation when all slots are empty. When a new identifier gets
hashed into a full bucket, it is necessary to find another bucket for this identifier. The
simplest solution is to find the closest unfilled bucket. Consider the identifiers GA, G, A,
A4, A5, A2, Z, ZA hashed by f(X) = first character of X into a table of 26 one-slot
buckets:

   1    A
   2    A4
   3    A5
   4    A2
   5    ZA
   7    GA
   8    G
  26    Z
The identifier GA is placed in bucket ht[7]. The next identifier G is placed in the
nearest free bucket, i.e. ht[8]. The next identifier A is placed in bucket ht[1], the
identifier A4 in the nearest free bucket ht[2], and so on. Finally the identifier Z is placed
in ht[26], and the next identifier ZA is placed in the next free bucket in the circular list,
ht[5]. This method of resolving overflows is called linear probing or linear open
addressing. A procedure to search for an identifier when linear probing is used is given
below:
procedure linsrch(x : identifier; var j : integer);
{Search the hash table ht[0..b-1] (each bucket has one slot) using linear probing. On
return, either ht[j] = x, or ht[j] is empty and x may be inserted there.}
var i : integer;
begin
  i := f(x); j := i;
  while (ht[j] <> x) and (ht[j] <> empty) do
  begin
    j := (j+1) mod b; {treat the table as circular}
    if j = i then tablefull; {no empty slot and x is not in the table}
  end;
end;
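The probing loop can be sketched in Python. Note one assumption: the list below is
0-based (bucket 0 holds A, bucket 6 holds GA), whereas the text numbers buckets 1 to 26,
so each position here is one less than in the example above:

```python
# A sketch of linear probing over b one-slot buckets, mirroring the search
# procedure above; f is the first-letter hash, 0-based here.

def f(x):
    return ord(x[0].upper()) - ord('A')   # bucket 0..25 for A..Z

def lin_insert(ht, x):
    b = len(ht)
    j = i = f(x) % b
    while ht[j] is not None:
        if ht[j] == x:
            return j                      # x is already in the table
        j = (j + 1) % b                   # probe the next bucket circularly
        if j == i:
            raise OverflowError("table full")
    ht[j] = x
    return j

ht = [None] * 26
for ident in ["GA", "G", "A", "A4", "A5", "A2", "Z", "ZA"]:
    lin_insert(ht, ident)
```

ZA hashes to the last bucket, finds it full, and wraps around past A, A4, A5 and A2 to
the fifth bucket, exactly as in the example.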
Internal Sorting: Insertion Sort - Quick Sort - 2 Way Merge Sort - Heap Sort – Shell
Sort - Sorting on Several Keys. Files: Files, Queries and Sequential organizations –
Index Techniques -File Organizations.
Internal sorting:
Insertion sort
Quick sort
Merge sort
Heap sort
INSERTION SORT:
This is a naturally occurring sorting method, exemplified by a card player arranging the
cards dealt to him. He picks up the cards as they are dealt and inserts them into their
required positions. Thus at every step we insert an item into its proper place in an
already ordered list.
procedure insert(r : records; var list : afile; i : integer);
{Insert record r with key r.key into the ordered sequence list[0],...,list[i] in such a way
that the resulting sequence of size i+1 is also ordered. list[0] is a sentinel.}
var j : integer;
begin
  j := i;
  while r.key < list[j].key do
  begin
    list[j+1] := list[j];
    j := j-1;
  end;
  list[j+1] := r;
end;

procedure insertionsort(var list : afile; n : integer);
{Sort list[1..n] into nondecreasing order of the field key.}
var j : integer;
begin
  list[0].key := -maxint; {sentinel smaller than every key}
  for j := 2 to n do
    insert(list[j], list, j-1);
end;
For example, if n = 5 and the input sequence is (2,3,4,5,1), then after each execution of
insert we have:
-, 2, 3, 4, 5, 1   [initial]
-, 2, 3, 4, 5, 1   i = 2
-, 2, 3, 4, 5, 1   i = 3
-, 2, 3, 4, 5, 1   i = 4
-, 1, 2, 3, 4, 5   i = 5
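The two procedures translate directly to Python; one assumption of this sketch is that
instead of a sentinel in list[0] the inner loop checks the index bound, so the list holds
only the keys:

```python
# A Python sketch of the insert/insertionsort pair above.

def insert(r, lst, i):
    """Insert r into the ordered prefix lst[0..i-1] (0-based indexing)."""
    j = i - 1
    while j >= 0 and r < lst[j]:
        lst[j + 1] = lst[j]       # shift larger items one place to the right
        j -= 1
    lst[j + 1] = r

def insertion_sort(lst):
    for i in range(1, len(lst)):
        insert(lst[i], lst, i)
    return lst
```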
QUICK SORT:
This is the most widely used internal sorting algorithm. Its popularity lies in its ease
of implementation, moderate use of resources and acceptable behaviour for a variety of
sorting cases. The basis of quick sort is the 'divide and conquer' strategy, i.e. divide the
problem [the list to be sorted] into sub-problems [sub-lists] until solved sub-problems
[sorted sub-lists] are found. The implementation is:
1. Select an item A[I] of the list A[ ].
2. Rearrange the list so that this item is in its proper position, i.e. all preceding items
have a lesser value and all succeeding items have a greater value than A[I].
3. Repeat steps 1 & 2 for sub-list 1 and sub-list 2 until A[ ] is a sorted list.
The input file has 10 records with keys (26,5,37,1,61,11,59,15,48,19). The table below
gives the status of the file at each call of qsort.
R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 m n
[26 5 37 1 61 11 59 15 48 19] 1 10
1 5 11 15 19 26 [59 61 48 37] 7 10
1 5 11 15 19 26 37 48 59 [61] 10 10
1 5 11 15 19 26 37 48 59 61
The average computing time of quick sort is O(n log2 n). It needs stack space to
implement the recursion.
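The three steps can be sketched in Python with the first record of each sublist as the
partitioning item, which reproduces the (m, n) trace in the table above; the partitioning
details are one common variant, not the only one:

```python
# A sketch of quick sort: partition about the first record, then sort the
# two sublists recursively.

def qsort(a, m, n):
    """Sort a[m..n] in place (inclusive bounds)."""
    if m >= n:
        return
    pivot = a[m]
    i, j = m, n + 1
    while True:
        i += 1
        while i <= n and a[i] < pivot:   # scan right for an item >= pivot
            i += 1
        j -= 1
        while a[j] > pivot:              # scan left for an item <= pivot
            j -= 1
        if i < j:
            a[i], a[j] = a[j], a[i]
        else:
            break
    a[m], a[j] = a[j], a[m]              # the pivot is now in its proper position
    qsort(a, m, j - 1)                   # sub-list 1
    qsort(a, j + 1, n)                   # sub-list 2

keys = [26, 5, 37, 1, 61, 11, 59, 15, 48, 19]
qsort(keys, 0, len(keys) - 1)
```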
MERGE SORT
Merge sort is one of the 'divide and conquer' class of algorithms. The basic idea is to
divide the list into a number of sub-lists, sort each of these sub-lists and merge them to
get a single sorted list.
Example: The input file (26,5,77,1,61,11,59,15,48,19) is to be sorted using the 2-way
merge sort. The two-way merge sort begins by interpreting the input as n sorted files of
length 1. These are merged pair-wise to obtain n/2 files of size 2; these n/2 files are then
merged pair-wise, and so on, until we are left with only one file.
Example:
[26] [5] [77] [1] [61] [11] [59] [15] [48] [19]
[5 26] [1 77] [11 61] [15 59] [19 48]
[1 5 26 77] [11 15 59 61] [19 48]
[1 5 11 15 26 59 61 77] [19 48]
[1 5 11 15 19 26 48 59 61 77]
procedure merge(var x, z : afile; l, m, n : integer);
{Merge the two sorted sublists x[l..m] and x[m+1..n] into the sorted list z[l..n].}
var i, j, k, t : integer;
begin
  i := l;
  k := l;
  j := m+1;
  while (i <= m) and (j <= n) do
  begin
    if x[i].key <= x[j].key
    then
    begin
      z[k] := x[i];
      i := i+1;
    end
    else
    begin
      z[k] := x[j];
      j := j+1;
    end;
    k := k+1;
  end; {of while}
  if (i > m)
  then for t := j to n do
    z[k+t-j] := x[t]
  else for t := i to m do
    z[k+t-i] := x[t];
end; {of merge}
procedure mpass(var x, y : afile; n, l : integer);
{This algorithm performs one pass of the merge sort. It merges adjacent pairs of
subfiles of length l from the list x to the list y. n is the number of records in x.}
var i, t : integer;
begin
  i := 1;
  while i <= n - 2*l + 1 do
  begin
    merge(x, y, i, i+l-1, i+2*l-1);
    i := i + 2*l;
  end;
  {merge remaining file of length < 2*l}
  if i+l-1 < n then merge(x, y, i, i+l-1, n)
  else for t := i to n do
    y[t] := x[t];
end; {of mpass}
procedure msort(var x : afile; n : integer);
{Sort the file x = (x[1],...,x[n]) into nondecreasing order on the keys
x[1].key,...,x[n].key.}
var l : integer;
    y : afile;
begin
  l := 1;
  while l < n do
  begin
    mpass(x, y, n, l);
    l := 2*l;
    mpass(y, x, n, l); {interchange the roles of x and y}
    l := 2*l;
  end;
end; {of msort}
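The msort/mpass scheme can be sketched in Python with 0-based indexing; the helper
names are illustrative:

```python
# A sketch of the 2-way merge sort above: repeatedly merge adjacent sorted
# subfiles of length l, doubling l on each pass.

def merge_run(x, z, l, m, n):
    """Merge sorted x[l..m] and x[m+1..n] into z[l..n] (inclusive, 0-based)."""
    i, j, k = l, m + 1, l
    while i <= m and j <= n:
        if x[i] <= x[j]:
            z[k] = x[i]; i += 1
        else:
            z[k] = x[j]; j += 1
        k += 1
    z[k:n+1] = x[i:m+1] if i <= m else x[j:n+1]   # copy the leftover run

def mpass(x, y, n, l):
    """One pass: merge adjacent pairs of subfiles of length l from x into y."""
    i = 0
    while i + 2 * l <= n:
        merge_run(x, y, i, i + l - 1, i + 2 * l - 1)
        i += 2 * l
    if i + l < n:                         # a subfile longer than l remains
        merge_run(x, y, i, i + l - 1, n - 1)
    else:                                 # fewer than l+1 records remain: copy
        y[i:n] = x[i:n]

def msort(x):
    n = len(x)
    y = x[:]
    l = 1
    while l < n:
        mpass(x, y, n, l); l *= 2
        mpass(y, x, n, l); l *= 2         # roles of x and y interchange
    return x
```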
HEAP SORT
In this method we interpret the file to be sorted, R = (R1, R2, ..., Rn), as a binary tree.
In such a tree the parent of the node at i is at i div 2, the left child at 2i and the right
child at 2i+1. If 2i or 2i+1 is greater than n (the number of nodes), then the
corresponding children do not exist.
Thus, initially the file is interpreted as shown below:
[Figure: a complete binary tree with root X1, children X2 and X3, then X4 X5 X6 X7,
and so on down to Xn-2, Xn-1, Xn.]
Heap sort is a two-stage method. First the tree representing the file is converted into
a heap. A heap is defined to be a complete binary tree with the property that the value of
each node is at least as large as the values of its children (i.e. K(j div 2) >= Kj for
1 <= j div 2 < j <= n). This implies that the root of the heap has the largest value in the
tree. In the second stage the output sequence is generated in decreasing order by
successively outputting the root and restructuring the remaining tree into a heap.
A complete binary tree is said to satisfy the 'heap condition' if the key of each node is
greater than or equal to the keys in its children. Thus the root node will have the
largest key value.
A HEAP is a complete binary tree in which each node satisfies the heap condition,
represented as an array.
1. Remove the topmost (largest) item of the heap and place it at the end of the array.
2. Step 1 may cause a violation of the heap condition, so the heap is traversed and
rebuilt; step 1 is then repeated until the array is sorted.
procedure adjust(var tree : afile; i, n : integer);
{Adjust the binary tree with root i to satisfy the heap property. The left and right
subtrees of i, i.e., those with roots 2i and 2i+1, already satisfy the heap property.
No node has an index greater than n.}
var j : integer;
    k : integer;
    r : records;
    done : boolean;
begin
  done := false;
  r := tree[i];
  k := tree[i].key;
  j := 2 * i;
  while (j <= n) and not done do
  begin
    if (j < n) and (tree[j].key < tree[j+1].key) then
      j := j + 1; {j points to the larger of i's children}
    if k >= tree[j].key then
      done := true
    else
    begin
      tree[j div 2] := tree[j]; {move the larger child up a level}
      j := 2 * j;
    end;
  end;
  tree[j div 2] := r;
end; {of adjust}
procedure hsort(var r : afile; n : integer);
{Sort the records r[1],...,r[n] into nondecreasing order on the field key.}
var i : integer;
    t : records;
begin
  for i := n div 2 downto 1 do {convert r into a heap}
    adjust(r, i, n);
  for i := n-1 downto 1 do {sort r}
  begin
    t := r[i+1]; {interchange r[1] and r[i+1]}
    r[i+1] := r[1];
    r[1] := t;
    adjust(r, 1, i); {recreate the heap}
  end;
end; {of hsort}
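A direct Python translation of adjust and hsort, keeping the 1-based indexing of the
text by placing a dummy element at position 0:

```python
# Sketch of heap sort: phase 1 builds the heap, phase 2 repeatedly moves
# the root (the maximum) to the end and re-adjusts.

def adjust(tree, i, n):
    """Sift tree[i] down so the subtree rooted at i is a heap (1-based)."""
    r = tree[i]
    j = 2 * i
    while j <= n:
        if j < n and tree[j] < tree[j + 1]:
            j += 1                        # pick the larger child
        if r >= tree[j]:
            break
        tree[j // 2] = tree[j]            # move the larger child up a level
        j *= 2
    tree[j // 2] = r

def hsort(r):
    n = len(r)
    a = [None] + r[:]                     # shift to 1-based indexing
    for i in range(n // 2, 0, -1):
        adjust(a, i, n)                   # phase 1: build the initial heap
    for i in range(n - 1, 0, -1):
        a[1], a[i + 1] = a[i + 1], a[1]   # move the root (max) to the end
        adjust(a, 1, i)                   # phase 2: restore the heap
    return a[1:]
```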
FILE ORGANIZATIONS
File :
The primary objective of file organization is to provide a means for record retrieval and
update. The update of a record could involve its deletion, changes in some of the fields or the
insertion of an entirely new record. Certain fields in a record may be designated as key fields.
Records may be retrieved by specifying values for some or all of these keys. A
combination of key values specified for retrieval is called a query.
Typical query types range from a simple query that specifies a value for a single key
(e.g. Sex = M) to Boolean combinations of key values (e.g. Sex = M and Occupation =
Programmer). The common file organizations are:
Sequential organization
Random organization
Linked organization
Inverted files
Cellular partitions
SEQUENTIAL ORGANIZATION
In this organization the records are placed sequentially on to the storage media. i.e.
they occupy consecutive memory locations and in the case of a tape this would mean placing
records adjacent to each other. In addition, the physical sequence of records is ordered on
some key called primary key.
Mode of retrieval
In real-time retrieval the response to any query is immediate. For example, in an airline
reservation system one must be able to determine the status of a flight in a matter of
seconds.
In a batch processing system the response time is not significant. Requests for retrieval
are batched together on a 'transaction' file until either enough requests have been
received or a suitable amount of time has passed; then the requests in the transaction
file are processed.
Mode of update
In a real-time system, the update is made immediately. For example, in a reservation
system, as soon as a seat on a flight is reserved, the file must be updated immediately to
reflect the change.
In a batch system, the updating is made when the transaction file is processed. For
example, in a banking system, all deposits and withdrawals made on a particular day are
collected on a transaction file and updates are made at the end of the day. The system
contains two types of files: the master file and the transaction file.
For a batch processing system, magnetic tape is an adequate storage medium. The
master file represents the file status; the transaction file contains all update requests that
have not yet been reflected in the master file. So the master file is always 'out of date' to
the extent that update requests have been batched on the transaction file. The master file
contains records sorted on the primary key of the file, while the requests for retrieval and
update are on the transaction file. When it is time to process the transaction file, the
transactions are sorted on the key and an update process is carried out to create a new
master file: all the records in the old master file are examined, updated if necessary, and
written out to the new master file.
Sequential organization is also possible on direct access storage devices (DASD). Even
though disk storage is really two-dimensional (cylinder × surface), it can be mapped into
a one-dimensional memory. If a disk contains c cylinders and s surfaces, one way is to
view the disk memory sequentially as given in the following figure (figure 1).
The other way of representing sequential file organization is to access tracks in order : all
tracks of surface 1, all tracks of surface 2, etc.
If the records are of the same size, the binary search technique can be used to search for
a record with the required key. For a file containing n records, about log2 n accesses are
to be made.
If the records are of variable size, binary search cannot be used. Sequential search has to be
applied. But the retrieval time can be reduced by maintaining an index. An index contains
(address, key) pairs. In case of record retrieval, first the index is referenced, then the record
is read directly from the address of the storage medium. For example, for the table given in
figure1, one can maintain an index for the key empno as given below.
address key
A1 510
A2 620
Disadvantages
All records must be structurally identical: if a new field is to be added, then every
record must be rewritten to provide space for the new field. Continuous storage areas
may not be possible, because both the primary data file and the transaction file must be
accessed during merging.
Area of use
Sequential files are most frequently used in commercial batch oriented data processing
applications where there is the concept of a master file to which details are added
periodically. Ex. Payroll applications
INDEX TECHNIQUES
An index is a collection of pairs of the form (key value, address). If the records of
Table 1 are stored at addresses a1, a2, a3, ..., an respectively, then an index for the key
empnumber would have the entries (800, a1), (510, a2), (950, a3), (750, a4) and (620, a5).
This index is dense, since it contains an entry for each record. In the case of the
occupation key, the number of records with occupation = programmer is three and with
occupation = analyst is two, so the entries of the index correspond to groups of records.
The difficulty can be overcome by keeping, in the address field of each distinct key
value, a pointer to another address where we maintain a list of the addresses of the
records having this value. If at address b1 we store the list of addresses of all
programmer records, i.e. a1, a4 and a5, and at b2 the addresses of all analysts, i.e. a2 and
a3, then we obtain the index of the occupation field as ('programmer', b1) and
('analyst', b2). Another method is to change the format of the entries in an index to
(key value, address 1, address 2, ..., address n); this second method suits records of
variable size.
An index differs from a table essentially in its size. While a table was small enough to fit into
available internal memory, an index is too large for this and has to be maintained on external
storage devices ( floppy, hard disk, etc.).
1. CYLINDER-SURFACE INDEXING
The simplest of all indexing techniques is cylinder-surface indexing. It is useful only for the
primary key index of a sequentially ordered file. The sequential interpretation of the disk
memory is shown in figure 1.
It is assumed that the records are stored sequentially in increasing order of the primary
key. The index consists of a cylinder index and several surface indexes. If the file
requires c cylinders (1 through c) for storage, then the cylinder index contains c entries.
Associated with each of the c cylinders is a surface index; if the disk has s usable
surfaces, then each surface index has s entries. The ith entry in the surface index for
cylinder j is the value of the largest key on the jth track of the ith surface. The total
number of surface index entries is s × c.
A search for a record with a particular key value X is carried out by first reading the
cylinder index into memory. Since the number of cylinders in a disk is only a few
hundred, the cylinder index typically occupies only one track. The cylinder index is
searched to determine which cylinder possibly contains the desired record. This search
can be carried out by binary search when each entry requires a fixed number of words;
if that is not feasible, the cylinder index can consist of an array of pointers to the starting
points of the individual key values. In either case the search can be carried out in
O(log c) time.
Once the cylinder index has been searched and the appropriate cylinder determined, the
surface index corresponding to that cylinder is retrieved from the disk. The number of
surfaces on a disk is usually very small, so the best way to search a surface index is
sequential search. Having determined which surface and cylinder are to be accessed, the
corresponding track is read in and searched for the record with the desired key. So the
total number of disk accesses is three (one for the cylinder index, one for the surface
index, and one to read the track).
When track sizes are very large, it may not be feasible to read in the whole track. In this
case the disk will usually be sector addressable, and so an extra level of indexing is
needed: the sector index. The number of accesses needed to retrieve a record then
becomes four. When the file extends over several disks, a disk index is also maintained.
This method of maintaining a file and its index is referred to as ISAM (Indexed
Sequential Access Method). It is probably the most popular and simplest file
organization in use for single-key files.
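The three-access search can be sketched in Python with small lists standing in for the
disk structures; the data layout, the bisect-based cylinder search and all names below
are assumptions of the sketch:

```python
# An illustrative sketch of a cylinder-surface (ISAM-style) search: the
# cylinder index holds the largest key per cylinder, and each cylinder has
# a surface index of largest keys per surface.
import bisect

def isam_search(cylinder_index, surface_indexes, tracks, x):
    """Return x if a record with key x exists, else None. Each simulated
    disk read is a list/dict access."""
    # 1st access: search the cylinder index (binary search, O(log c))
    c = bisect.bisect_left(cylinder_index, x)
    if c == len(cylinder_index):
        return None
    # 2nd access: sequential search of that cylinder's surface index
    s = 0
    while surface_indexes[c][s] < x:
        s += 1
    # 3rd access: read the track and search it for the record
    track = tracks[(c, s)]
    return x if x in track else None

# Two cylinders, two surfaces; keys stored in increasing order.
cyl = [300, 700]                       # largest key on each cylinder
surf = [[150, 300], [500, 700]]        # largest key on each surface
tracks = {(0, 0): [100, 150], (0, 1): [200, 300],
          (1, 0): [400, 500], (1, 1): [600, 700]}
```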
2. HASHED INDEXES
The principles involved in maintaining hashed indexes are essentially the same as those of
hash tables.
All the hash functions and overflow techniques of hash tables are applicable to hashed
indexes also.
1. Rehashing
2. Open addressing
   a. random
   b. quadratic
   c. linear
3. Chaining
(refer to Unit IV)
3. TREE INDEXING
AVL trees can be used to search, insert and delete entries from a table of size n in at
most O(log n) time. Suppose the AVL tree resides on a disk: if nodes are retrieved from
the disk one at a time, then a search of an index with n entries would require at most
1.4 log2 n disk accesses (the maximum depth of an AVL tree is 1.4 log2 n). This is far
worse than the cylinder-surface index. Therefore a balanced tree based upon an m-way
search tree is used, which performs better than a binary search tree.
Definition: An m-way search tree T is a tree in which all nodes are of degree ≤ m. If T
is not empty, it has the following properties:
(i) T is a node of the type n, A0, (K1, A1), (K2, A2), ..., (Kn, An), where the Ai,
0 ≤ i ≤ n, are pointers to the subtrees of T and the Ki, 1 ≤ i ≤ n, are key
values, with 1 ≤ n < m;
(ii) Ki < Ki+1, 1 ≤ i < n;
(iii) all key values in the subtree Ai are less than the key value Ki+1,
0 ≤ i < n;
(iv) all key values in the subtree An are greater than Kn;
(v) the subtrees Ai, 0 ≤ i ≤ n, are also m-way search trees.
A B-tree is a balanced m-way tree. A node of the tree may contain many records or
keys and pointers to children. A B-tree is also known as the balanced sort tree; it finds
its use in external sorting, and it is not a binary tree. In a B-tree:
there must be no empty subtrees above the leaves of the tree;
the leaves of the tree must all be on the same level; and
all nodes except the leaves must have at least some minimum number of children.
[Figure A: a 3-way search tree. The root (at address a) holds keys 20 and 40 and has
children at addresses b, c and d; node b holds 10 and 15, node c holds 25 and 30 with a
child node e holding 35, and node d holds 45 and 50.]
procedure msearch(t : mtree; x : key; var p : mtree; var i, j : integer);
{Search the m-way search tree t residing on disk for the key value x. Individual nodes
have the fields n, a[0], (k[1], a[1]), ..., (k[n], a[n]). On return, (p, i) is the location of x
if j = 1; else j = 0 and p is the location of the node into which x can be inserted.}
label 99;
var q : mtree;
begin
  p := t; q := nil; {q is the parent of p}
  while p <> nil do
  begin
    input the node at location p from the disk;
    let the node define n, a[0], (k[1], a[1]), ..., (k[n], a[n]);
    find i such that k[i] <= x < k[i+1]; {with k[0] = -infinity and k[n+1] = +infinity}
    if x = k[i] then begin j := 1; goto 99; end; {x is found at (p, i)}
    q := p; p := a[i];
  end;
  p := q; j := 0; {x is not in t; it may be inserted into node p}
99: end;
In figure A, if we want to search for 35, a search of the root node indicates that the
appropriate subtree to search is the one whose root is at address c. A search of that node
indicates that the next node to search is at address e. The key value 35 is found in this
node and the search terminates. So if this search tree resides on a disk, the search for
x = 35 would require accessing the nodes at addresses a, c and e, for a total of three disk
accesses. The maximum number of disk accesses made is equal to the height of the tree
(so, to minimize the number of disk accesses, we minimize the height of the search tree).
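The msearch idea can be sketched in Python; here nodes are nested dicts rather than
disk addresses (an assumption of the sketch), and each node visit is counted as one
"disk access":

```python
# Sketch of an m-way search tree search. Each node holds sorted keys and
# one more child pointer than keys: children[i] leads to keys between
# keys[i-1] and keys[i].

def msearch(node, x):
    """Return (node containing x, accesses) or (None, accesses)."""
    accesses = 0
    while node is not None:
        accesses += 1                    # one disk access per node visited
        keys, children = node["keys"], node["children"]
        i = 0
        while i < len(keys) and x > keys[i]:
            i += 1
        if i < len(keys) and x == keys[i]:
            return node, accesses
        node = children[i]
    return None, accesses

# The 3-way tree of figure A: root a with keys 20, 40 and children b, c, d.
e = {"keys": [35], "children": [None, None]}
b = {"keys": [10, 15], "children": [None, None, None]}
c = {"keys": [25, 30], "children": [None, None, e]}
d = {"keys": [45, 50], "children": [None, None, None]}
a = {"keys": [20, 40], "children": [b, c, d]}
```

Searching for 35 visits the nodes at a, c and e: three accesses, as in the text.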
RANDOM ORGANIZATION
In this organization the records are stored at random locations on the disk. The
following techniques are used for randomization:
Direct addressing
Directory lookup
Hashing
Direct addressing
In direct addressing with equal-size records, the available disk space is divided into
nodes large enough to hold a record. The numeric value of the primary key is used to
determine the node into which a particular record is stored; no index on this key is
required. With primary key = empno, the record for empno = 259 will be stored in node
259. In this organization, searching for and deleting a record given its primary key value
requires only one disk access, whereas updating requires two disk accesses (one to read
and another to write back the modified record).
In the case of variable-size records, an index is set up with pointers to the actual records
on the disk, as shown in the following figure. The number of disk accesses is one more
than for fixed-size records.
[Figure: an index of primary keys whose entries point to the variable-size records A
through E stored on the disk.]
The space efficiency of direct addressing depends on the identifier density n/T (n = the
number of distinct primary key values in the file, T = the total number of possible
primary key values).
Directory lookup
This is similar to the addressing technique for variable-size records above, but the index
is not of the direct access type; instead, a dense index is maintained using a structure
suitable for index operations. Retrieving a record involves searching the index for the
record address and then accessing the record itself. This technique makes more efficient
use of space than direct addressing, but requires more accesses for retrieval and update,
since the index search will generally require more than one access.
Hashing
The principle of hashed file organization is same as that of a hashed table. The available
space is divided into buckets and slots. Some space may have to be set aside for an overflow
area in case chaining is being used to handle overflows. When variable size records are
present, the number of slots per bucket will be only a rough indicator of the number of
records a bucket can hold. The actual number will vary dynamically with the size of records
in a particular bucket.
Random organization on the primary key using any of the above three techniques
overcomes the difficulties of sequential organization. One of the main disadvantages of
random organization is that batch processing of queries becomes inefficient, as the
records are not maintained in order of the primary key. For example, the query 'retrieve
the records of all employees with employee number > 800' needs to examine every node.
LINKED ORGANISATION
Linked organization differs from the sequential organizations essentially in that the logical
sequence of records is generally different from the physical sequence. In a sequential
organization if the ith record of the file is at Li then the (i+1) th record will be at Li+c, where c
is the size of the ith record.
In linked organization the next record is obtained by following the link value from the present
record. Linking records together by the primary key facilitates the deletion and insertion of
records once the place at which insertion and deletion to be made is known. An index with
ranges of empnumbers can be maintained to facilitate searching based on empnumbers.
For example, ranges for empnumbers 501-700, 701-900 and 901-1100 can be created for
the EMPLOYEE table given in Table 1. All records having an empno in the same range
can be linked together as shown in the following figure.
[Figure: a range index on empno. The entry for the range ending at 700 points to the list
Record E → Record B; the entry for the range ending at 900 points to Record D →
Record A; the last range points to Record C.]
Using an index in this way reduces the length of the lists and thus the search time. This
idea can be generalized to allow for easy secondary key retrieval: we set up indexes for
each key and allow records to be in more than one list, which leads to the multilist
representation. Retaining list lengths enables us to reduce search time by searching the
smaller list first (in the case of Boolean queries).
[Figure: a multilist representation of the file. For each index (empno ranges, occupation,
salary) every entry stores a pointer to the first record of its list together with the list
length; the records B, E, D, A and C each carry an empno link, an occupation link and a
salary link.]
Inserting a new record into a multilist structure is easy so long as the individual lists do
not have to be maintained in some order; a record may simply be inserted at the front of
the appropriate list. Deletion of a record is difficult, since there are no back pointers.
Deletion may be simplified if we maintain the structure as a doubly linked list.
CORAL RINGS
The coral ring structure is the same as the doubly linked multilist structure. Each list is
structured as a circular list with a head node. The head node for the list for the key value
Ki = x will have an information field with value x, and the field for the key Ki is
replaced by a link field. Thus for each record the coral ring contains two fields:
y↑.alink[i] and y↑.blink[i].
The alink field is used to link together all records with the same key Ki. The alinks form
a circular list with a head node whose information field retains the value of Ki for the
records in the ring.
The blink field is used for some records as a back pointer and for some records as a
pointer to the head node. To distinguish between these two uses, y↑.field[i] is used: if
y↑.field[i] = 1 then it is a back pointer, and if y↑.field[i] = 0 it is a pointer to the nearest
record z preceding it in its circular list for Ki having z↑.blink[i] also a back pointer.
In any given circular list, all records with back pointers form another circular list in the
reverse direction.
The presence of these back pointers makes it possible to carry out the deletion without having
to start from the beginning of the list.
INVERTED FILES
Inverted files are similar to multilists. The difference is that while in multilists records
with the same key value are linked together, with the link information kept in the
individual records, in inverted files this information is kept in the index itself. The
following are the indexes of the file given in figure 1.
Empnumber index      Occupation index
510   B              Analyst      B, C
620   E              Programmer   A, D, E
750   D
800   A

Sex index            Salary index
Male   A, E          10,000   A
                     12,000   C, D
The index for every key is dense and contains a value entry for each distinct value in the file.
Since the index entries are variable length (the number of records with the same key value is
variable ), index maintenance becomes more complex than for multi lists.
Boolean queries require only one access per record satisfying the query. Queries of the
type K1 = XX or K2 = XY can be processed by first accessing the indexes and obtaining
the address lists of all records with K1 = XX and of all records with K2 = XY; these two
lists are then merged to obtain a list of all records satisfying the query. K1 = XX and
K2 = XY can be handled by intersecting the two lists, and similarly K1 = .not. XX can be
obtained by taking the difference between the universal list (the list of all records) and
the list for K1 = XX.
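This query processing maps directly onto set operations; the sketch below uses Python
sets for the indexes above (the Female list, not shown in the table, is filled in here as the
complement of Male purely for illustration):

```python
# Sketch of Boolean query processing against inverted indexes: union for
# 'or', intersection for 'and', difference for 'not'.

universe = {"A", "B", "C", "D", "E"}                 # the universal list
occupation = {"Analyst": {"B", "C"}, "Programmer": {"A", "D", "E"}}
sex = {"Male": {"A", "E"},
       "Female": {"B", "C", "D"}}                    # assumed for the sketch

def query_or(l1, l2):
    return l1 | l2          # merge the two address lists

def query_and(l1, l2):
    return l1 & l2          # intersect the two address lists

def query_not(l):
    return universe - l     # difference from the universal list
```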
The number of disk accesses = number of records being retrieved + the number to process the
indexes.
Inverted files result in space saving compared to other file structures when record retrieval
does not require retrieval of key fields. In this case the key fields may be deleted from the
records.
Insertion and deletion of records requires only the ability to insert and delete within indexes.
CELLULAR PARTITIONS:
In order to reduce file search times, the storage medium may be divided into cells (a
cell may, for example, be a disk cylinder). The lists of a multilist or inverted
organization are then restricted so that each list lies entirely within one cell, so the
records satisfying a query can be retrieved with fewer seeks.