
STUDY MATERIAL

BSC [COMPUTER SCIENCE]

SEMESTER- III

SUBJECT: CORE 4 : DATA STRUCTURES

AY-2021-2022
TIPS COLLEGE OF ARTS AND SCIENCE
B.Sc COMPUTER SCIENCE – III SEMESTER
CORE 4 : DATA STRUCTURES

UNIT I
Introduction: Introduction of Algorithms, Analyzing Algorithms. Arrays: Sparse Matrices -
Representation of Arrays. Stacks and Queues: Fundamentals - Evaluation of Expressions -
Infix to Postfix Conversion - Multiple Stacks and Queues

UNIT II
Linked List: Singly Linked List - Linked Stacks and Queues - Polynomial Addition - More
on Linked Lists - Sparse Matrices - Doubly Linked List and Dynamic - Storage Management
- Garbage Collection and Compaction.

UNIT III
Trees: Basic Terminology - Binary Trees - Binary Tree Representations - Binary Trees -
Traversal - More on Binary Trees - Threaded Binary Trees - Binary Tree Representation of
Trees - Counting Binary Trees. Graphs: Terminology and Representations - Traversals,
Connected Components and Spanning Trees, Shortest Paths and Transitive Closure

UNIT IV
External Sorting: Storage Devices -Sorting with Disks: K-Way Merging - Sorting with Tapes
Symbol Tables: Static Tree Tables - Dynamic Tree Tables - Hash Tables: Hashing Functions
- Overflow Handling.

UNIT V
Internal Sorting: Insertion Sort - Quick Sort - 2 Way Merge Sort - Heap Sort - Shell Sort -
Sorting on Several Keys. Files: Files, Queries and Sequential organizations - Index
Techniques -File Organizations.

TEXT BOOKS
1. Ellis Horowitz, Sartaj Sahni, Data Structures, Galgotia Publication.
2. Ellis Horowitz, Sartaj Sahni, Sanguthevar Rajasekaran, Computer Algorithms, Galgotia
Publication.

DEPARTMENT OF COMPUTER SCIENCE-TIPSCAS | 2


Unit – I
Introduction: Introduction of Algorithms, Analyzing Algorithms. Arrays: Sparse
Matrices - Representation of Arrays. Stacks and Queues: Fundamentals - Evaluation of
Expressions - Infix to Postfix Conversion - Multiple Stacks and Queues

INTRODUCTION

OVERVIEW
The study of computer science has four distinct areas.
• Machines for executing algorithms:
This area includes everything from the smallest pocket calculator to the largest
general purpose digital computer. The goal is to study various forms of machine
organizations so that algorithms can be effectively carried out.
• Languages for describing algorithms:
One often distinguishes between two phases of this area: language design and
translation. The first calls for methods of specifying the syntax and semantics of a
language. The second requires a means of translating into a more basic set of
commands.
• Foundations of algorithms:
Is a particular task accomplishable by a computing device? What is the minimum
number of operations necessary for any algorithm which performs a particular function?
Abstract models of computers are devised so that these properties can be studied.
• Analysis of algorithms:
An algorithm's behaviour pattern is measured in terms of the computing time and space
consumed while the algorithm is processing. Questions such as the worst-case and
average time, and how often they occur, are typical.

ALGORITHM
An algorithm is a finite set of instructions to achieve a particular task. Every algorithm must
satisfy the following criteria:
i) input: there are zero or more quantities which are externally supplied;
ii) output: at least one quantity is produced;
iii) definiteness: each instruction must be clear and unambiguous;
iv) finiteness: if we trace out the instructions of an algorithm, then for all cases the
algorithm will terminate after a finite number of steps;
v) effectiveness: every instruction must be feasible.

FLOWCHART
A flow chart is a pictorial representation of an algorithm.

Data represents facts and figures in raw form.


A data type is a term which refers to the kinds of data that variables may “hold” in a
programming language.
e.g. In FORTRAN the data types are integer, real, logical, complex and double precision.
A data object is a term referring to a set of elements, say D. For example, the data object
integers refers to D = {0, ±1, ±2, …}. The data object alphabetic character strings of
length less than thirty-one implies D = {'', 'A', 'B', …, 'Z', 'AA', …}. Thus D may be finite or
infinite.
A data structure is a set of objects and a set of operations which may legally be applied to
elements of the data object; i.e. we must specify the set of operations and show how they work.



For integers we would have the arithmetic operators +, -, *, / and many others such as MOD,
DIV, <, >, etc.

Definition of data structure

A data structure is a set of domains D, a designated domain d ∈ D, a set of functions F and a
set of axioms A. The triple (D, F, A) denotes the data structure d and it will usually be
abbreviated by writing d.

HOW TO CREATE PROGRAMS


(i) Requirements: Here the required inputs and outputs are specified.
(ii) Design: There are several data objects (such as a maze, a polynomial, or a list of
names). For each object there will be some basic operations to perform on it (such
as print the maze, add two polynomials, or find a name in the list). Write the
algorithm, which solves the problem.
(iii) Analysis: There can be more than one algorithm to solve a problem. Compare the
candidate algorithms and choose the one with the lower time complexity and the
shorter execution time.

(iv) Refinement and coding: We have to choose representations for the data objects (a
maze as a two-dimensional array of zeroes and ones, a polynomial as a one-dimensional
array of degrees and coefficients, a list of names possibly as an array)
and write algorithms for each of the operations on these objects. One of the
criteria of a good design is that it can absorb changes relatively easily. So choose
the best design and produce a program.
(v) Verification: Verification consists of three distinct aspects:
 program proving,
 testing
 debugging.
We should prove that our program is correct. If a correct proof can be
obtained, then one is assured that for all possible combinations of inputs, the program
and its specification agree.
Testing is the art of creating sample data upon which to run your program. If the
program fails to respond correctly then debugging is needed to determine what went
wrong and how to correct it. Many times during the proving process errors are
discovered in the code. The proof can't be completed until these are eliminated.
This is another use of program proving, namely as a methodology for discovering
errors. There are tools available to aid the testing process. One such tool
instruments the source program and then tells us for every data set:
a) the number of times a statement was executed
b) the number of times a branch was taken
c) the smallest and largest values of all variables.
As a minimal requirement, the test data we construct should force every
statement to execute and every condition to assume the value true or false at
least once.

RECURSION:
A function or procedure that calls itself is said to be recursive.
n! = n*(n-1) !
4! = 4 * 3!
e.g. Program using a recursive function



program sample12;
function fact (m : integer) : integer;
begin
  if m <= 0 then fact := 1
  else fact := m * fact(m - 1);
end;
begin
  writeln(fact(4));
end.
(There should be a terminating condition for recursion.)
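For comparison, the same factorial function can be sketched in a modern language (a Python illustration of the idea, not part of the original listing); the call fact(m-1) bottoms out at the terminating condition m <= 0:

```python
# Recursive factorial, mirroring the Pascal fact above:
# the terminating condition m <= 0 stops the recursion.

def fact(m):
    if m <= 0:
        return 1
    return m * fact(m - 1)

print(fact(4))   # -> 24, since 4! = 4 * 3!
```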

HOW TO ANALYZE PROGRAMS:


There are many criteria upon which we can judge a program, for instance:
1.) Does it do what we want it to do?
2.) Does it work correctly according to the original specifications of the task?
3.) Is there documentation, which describes how to use it and how it works?
4.) Are procedures created in such a way that they perform logical
sub-functions?
5.) Is the code readable?
The above criteria are all very important when it comes to writing software, most
especially for large systems.
There are other criteria for judging programs, which have a more direct relationship
to performance. These have to do with computing time and storage requirements of
the algorithms. Performance evaluation can be loosely divided into two major
phases.

a) a priori estimates and
b) a posteriori testing.

Both of these are equally important.


First consider a priori estimation. Suppose that somewhere in one of your programs
is the statement
x := x + 1
We would like to determine two numbers for this statement. The first is the
amount of time a single execution will take; the second is the number of times it is
executed. The product of these numbers will be the total time taken by this
statement. The second statistic is called the frequency count, and this may vary
from data set to data set. One of the hardest tasks in estimating frequency counts is
to choose adequate samples of data. It is impossible to determine exactly how much
time it takes to execute any command unless we have the following information:

1) the machine we are executing on;


2) its machine language instruction set;
3) the time required by each machine instruction;
4) the translation a compiler will make from the source to the machine language.
It is possible to determine these figures by choosing a real machine and an existing
compiler. Another approach would be to design a hypothetical machine (with
imaginary execution times), but make the times reasonably close to those of
existing hardware so that the resulting figures would be representative. In either
case the exact times would not apply to many machines or to any machine. Also,
there would be the problem of the compiler, which could vary from machine to
machine. Moreover, it is often difficult to get reliable timing figures because of clock
limitations and multiprogramming in a time-sharing environment. Finally, the
difficulty of learning another machine language outweighs the advantage of finding
'exact' fictitious times. All these considerations lead us to limit our goals for an a
priori analysis. Instead we will concentrate on developing only the frequency count
of all statements.

Consider the following examples:

(a) x := x + 1;

(b) for i := 1 to n do
      x := x + 1;

(c) for i := 1 to n do
      for j := 1 to n do
        x := x + 1;

In program (a) we assume that the statement x := x + 1 is not contained within any loop, either
explicit or implicit. Then its frequency count is 1. In program (b) the same statement will be
executed n times and in program (c) n² times (assuming n >= 1). Now 1, n and n² are said to
be different, increasing orders of magnitude, just as 1, 10, 100 would be if we let n = 10. In
our analysis of execution we will be concerned chiefly with determining the order of
magnitude of an algorithm.
To determine the order of magnitude, sums such as Σ1, Σi and Σi² often occur.
e.g. program to compute a Fibonacci number, for which we will find the frequency counts.
1  program fibonacci (input, output);
2  {compute the Fibonacci number Fn}
3  type natural = 0..maxint;
4  var fnm1, fnm2, fn, n, i : natural;
5  begin
6    readln(n);
7    if n <= 1 then writeln(n) {F0 = 0 and F1 = 1}
8    else
9    begin {compute Fn}
10     fnm2 := 0; fnm1 := 1;
11     for i := 2 to n do
12     begin
13       fn := fnm1 + fnm2;
14       fnm2 := fnm1;
15       fnm1 := fn;
16     end; {of for}
17     writeln(fn);
18   end; {of else}
19 end. {of fibonacci}



To analyze the time complexity of this program we need to consider two cases: (i)
n = 0 or 1, and (ii) n > 1. We shall count the total number of statement executions and use this
as a measure of the time complexity of the program. Further, we note that only executable
statements contribute to the complexity of the program. Thus, lines 3 and 4 do not add to the
time needed to compute a Fibonacci number. Also, begin statements will not be counted, as they
simply designate the start of a statement block. When n = 0 or 1, lines 6, 7 and 19 get executed
once each. We shall view line 7 as 7a and 7b with count 2 (1 for the conditional and 1 for the
then clause). The total statement count for this case is therefore 4. When n > 1 the for loop of
lines 11 to 16 gets executed. Line 11 gets executed n times while lines 12 to 16 get executed
n-1 times each (note that the last time line 11 is executed, i is incremented to n+1 and the loop
is exited). Each execution of line 10 contributes 2 to the statement count, as line 10
contains 2 statements. The contribution of each line of the program to the total statement
count is given in the following figure. The contribution of the remaining lines is 0. Line 7a refers
to the conditional of line 7 and 7b to the then clause.

line  count        line  count

 6     1            14    n-1
 7a    1            15    n-1
 7b    1            16    n-1
 10    2            17    1
 11    n            18    1
 12    0            19    1
 13    n-1

Statement counts for computing Fn, n > 1

Clearly, the actual time taken by each statement will vary. The for statement is really a
combination of several statements, but we will count it as 1. The total count then is 5n + 4. We
will often write this as O(n), ignoring the two constants 5 and 4. This notation means that the
order of magnitude is proportional to n.
The notation f(n) = O(g(n)) (read as 'f of n equals big-oh of g of n') has a precise
mathematical definition.

DEFINITION
f(n) = O(g(n)) iff there exist two constants c and n0 such that
|f(n)| <= c|g(n)| for all n >= n0.

f (n) will normally represent the computing time of some algorithm.

When we say that an algorithm is O(g(n)) we mean that its execution takes no more
than a constant times g(n). Here n is a parameter which characterizes the inputs and/or the
number of outputs. For example, n might be the number of inputs and outputs, or their sum, or
the magnitude of one of them. For the fibonacci program n represents the magnitude of the
input, and the time for this program is written as T(fibonacci) = O(n).
We write O(1) to mean a computing time which is constant. O(n) is called linear, O(n²)
is called quadratic, O(n³) is called cubic, and O(2ⁿ) is called exponential. If an algorithm takes
time O(log n) it is faster, for sufficiently large n, than one taking O(n). Similarly, O(n log
n) is better than O(n²) but not as good as O(n).
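As an illustrative check of this definition (our example, not from the text), the count 5n + 4 obtained for the fibonacci program is O(n): choosing, say, c = 9 and n0 = 1 satisfies the inequality:

```python
# Check the big-oh definition for f(n) = 5n + 4 and g(n) = n.
# With c = 9 and n0 = 1 we have |f(n)| <= c*|g(n)| for all n >= n0.

def f(n):
    return 5 * n + 4

c, n0 = 9, 1
assert all(f(n) <= c * n for n in range(n0, 1000))
print("5n + 4 is O(n) with c =", c, "and n0 =", n0)
```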



If we have two algorithms which perform the same task, and the first has a computing
time which is O(n) and the second O(n²), then we will usually take the first as superior. The
reason is that as n increases, the time for the second algorithm gets far worse than the
time for the first. For example, if the constants for algorithms one and two are 10 and 1/2
respectively, then we get the following table of times:

 n     10n    n²/2
 1      10     0.5
 5      50    12.5
10     100    50
15     150   112.5
20     200   200
25     250   312.5
30     300   450

For n <= 20, algorithm two has a smaller computing time, but once past that point algorithm
one becomes better. This shows why we choose the algorithm with the smaller order of
magnitude, but we emphasize that this is not the whole story. For small data sets, the
respective constants must be carefully determined. In practice these constants depend on
many factors, such as the language and the machine one is using.
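The crossover point in this table can be checked with a short sketch (our illustration; the constants 10 and 1/2 are those assumed above):

```python
# Compare 10n (algorithm one) with n^2/2 (algorithm two) and find
# the first n past which the O(n) algorithm is strictly faster.

def t1(n):
    return 10 * n          # constant 10

def t2(n):
    return n * n / 2       # constant 1/2

crossover = next(n for n in range(1, 100) if t1(n) < t2(n))
print("algorithm one is strictly faster from n =", crossover)   # -> 21
```

At n = 20 the two times tie at 200, matching the table above.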

ARRAYS:
An array is a set of consecutive memory locations.
Given below is the data structure of an array.
structure ARRAY(value, index)
  declare CREATE( ) → array
          RETRIEVE(array, index) → value;
          STORE(array, index, value) → array;
  for all A ∈ array, i, j ∈ index, x ∈ value let
    RETRIEVE(CREATE, i) ::= error
    RETRIEVE(STORE(A, i, x), j) ::=
        if EQUAL(i, j) then x else RETRIEVE(A, j)
  end
end array

The function CREATE produces a new, empty array. RETRIEVE takes as input an array
and an index, and either returns the appropriate value or an error. STORE is used to enter
new index-value pairs. The second axiom is read as "to retrieve the j-th item, where x has already
been stored at index i in A, is equivalent to checking whether i and j are equal: if so, the answer
is x; if not, search for the j-th value in the remaining array A."
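As a hedged sketch (the operation names follow the axioms above, but the dictionary model is our own), the ARRAY structure can be modeled as:

```python
# A minimal model of the ARRAY data structure: CREATE, STORE, RETRIEVE.
# Retrieving an index that was never stored raises an error, matching
# the axiom RETRIEVE(CREATE, i) ::= error.

def create():
    return {}

def store(a, i, x):
    b = dict(a)        # keep the operations applicative, as in the axioms
    b[i] = x
    return b

def retrieve(a, i):
    if i not in a:
        raise KeyError("error: index never stored")
    return a[i]

a = store(store(create(), 1, "x"), 2, "y")
print(retrieve(a, 2))   # -> y
```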

ORDERED LISTS
One of the simplest and most commonly found data objects is the
ordered or linear list. Examples are the days of the week:

(MONDAY,TUESDAY, WEDNESDAY,THURSDAY,FRIDAY,SATURDAY, SUNDAY)

or the values in a card deck.

( 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King, Ace)



There are a variety of operations that are performed on these lists. These
operations include:
(i) find the length of the list, n;
(ii) read the list from left-to-right (or right-to-left);
(iii) retrieve the i-th element, 1<=i<=n;
(iv) store a new value into the i-th position, 1<=i<=n;
(v) insert a new element at position i, 1<=i<=n+1, causing elements numbered i, i+1, …, n
to become numbered i+1, i+2, …, n+1;
(vi) delete the element at position i, 1<=i<=n, causing elements numbered i+1, …, n to
become numbered i, i+1, …, n-1.
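These six operations can be sketched on a Python list (our illustration; positions in the text are 1-based, hence the i-1 adjustments):

```python
# A sketch of the six ordered-list operations on a Python list.

days = ["MON", "TUE", "WED"]

n = len(days)                  # (i)  length of the list
left_to_right = list(days)     # (ii) read left-to-right
third = days[3 - 1]            # (iii) retrieve the i-th element
days[2 - 1] = "TUESDAY"        # (iv) store into the i-th position
days.insert(2 - 1, "X")        # (v)  insert at position i; later items shift right
del days[2 - 1]                # (vi) delete position i; later items shift left

print(n, third, days)   # -> 3 WED ['MON', 'TUESDAY', 'WED']
```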

Data structure of a polynomial is given below:


structure POLYNOMIAL
  declare ZERO( ) → poly; ISZERO(poly) → Boolean
          COEF(poly, exp) → coef;
          ATTACH(poly, coef, exp) → poly
          REM(poly, exp) → poly
          SMULT(poly, coef, exp) → poly
          ADD(poly, poly) → poly; MULT(poly, poly) → poly;
  for all P, Q ∈ poly, c, d ∈ coef, e, f ∈ exp let
    REM(ZERO, f) ::= ZERO
    REM(ATTACH(P, c, e), f) ::=
        if e = f then REM(P, f) else ATTACH(REM(P, f), c, e)
    ISZERO(ZERO) ::= true
    ISZERO(ATTACH(P, c, e)) ::=
        if COEF(P, e) = -c then ISZERO(REM(P, e)) else false
    COEF(ZERO, e) ::= 0
    COEF(ATTACH(P, c, e), f) ::=
        if e = f then c + COEF(P, f) else COEF(P, f)
    SMULT(ZERO, d, f) ::= ZERO
    SMULT(ATTACH(P, c, e), d, f) ::=
        ATTACH(SMULT(P, d, f), c*d, e+f)
    ADD(P, ZERO) ::= P
    ADD(P, ATTACH(Q, d, f)) ::= ATTACH(ADD(P, Q), d, f)
    MULT(P, ZERO) ::= ZERO
    MULT(P, ATTACH(Q, d, f)) ::= ADD(MULT(P, Q), SMULT(P, d, f))
  end
end POLYNOMIAL

Program segment for adding two polynomials


{c := a + b where a and b are the input polynomials}
c := ZERO;
while not ISZERO(a) and not ISZERO(b) do
begin
  case compare(EXP(a), EXP(b)) of
  '<' : begin
          c := ATTACH(c, COEF(b, EXP(b)), EXP(b));
          b := REM(b, EXP(b));
        end;
  '=' : begin
          c := ATTACH(c, COEF(a, EXP(a)) + COEF(b, EXP(b)), EXP(a));
          a := REM(a, EXP(a)); b := REM(b, EXP(b));
        end;
  '>' : begin
          c := ATTACH(c, COEF(a, EXP(a)), EXP(a));
          a := REM(a, EXP(a));
        end
  end; { of case }
end; { of while }

Applications
One application of arrays is the polynomial. Any polynomial can be represented in a
global array called terms, defined as below:

type term = record
       coef : real;
       exp  : 0..maxint;
     end;
var terms : array [1..maxterms] of term;

where maxterms is a constant.
Consider two polynomials A(x) = 2x^1000 + 1 and B(x) = x^4 + 10x^3 + 3x^2 + 1. These
polynomials could be stored in the array terms as shown below.

         af    al    bf                      bl    free
coef :    2     1     1    10     3     1
exp  :  1000    0     4     3     2     0

In the above table af and bf give the location of the first term of A and B respectively,
whereas al and bl give the location of the last term of A and B. free gives the location of the
next free position in the array terms. In our example, af = 1, al = 2, bf = 3, bl = 6, and free = 7.

We will write a procedure to add two polynomials A and B, represented as in the
above table, to obtain the sum C = A + B. Procedure padd adds A(x) and B(x) term by term to
produce C(x). The terms of C are entered into the array terms starting at the position free. If
there is not enough space in terms to accommodate C, an error message is printed and the
program terminates. 999 is the label of the last statement in the program; hence a goto 999
terminates the program.

procedure padd(af, al, bf, bl : integer; var cf, cl : integer);
{ add A(x) and B(x) to get C(x) }
var p, q : integer; c : real;
begin
  p := af; q := bf; cf := free;
  while (p <= al) and (q <= bl) do
    case compare(terms[p].exp, terms[q].exp) of
    '=' : begin
            c := terms[p].coef + terms[q].coef;
            if c <> 0 then newterm(c, terms[p].exp);
            p := p + 1; q := q + 1;
          end;
    '<' : begin
            newterm(terms[q].coef, terms[q].exp);
            q := q + 1;
          end;
    '>' : begin
            newterm(terms[p].coef, terms[p].exp);
            p := p + 1;
          end;
    end; { of case and while }
  { add in remaining terms of A(x) }
  while p <= al do
  begin
    newterm(terms[p].coef, terms[p].exp);
    p := p + 1;
  end;
  { add in remaining terms of B(x) }
  while q <= bl do
  begin
    newterm(terms[q].coef, terms[q].exp);
    q := q + 1;
  end;
  cl := free - 1;
end; { of padd }

procedure newterm(c : real; e : integer);
{ add a new term to C(x) }
begin
  if free > maxterms then
  begin
    writeln('too many terms in polynomials');
    goto 999;
  end;
  terms[free].coef := c;
  terms[free].exp := e;
  free := free + 1;
end;
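The same term-by-term merge can be sketched in Python (our illustration, not part of the original program), holding each polynomial as a list of (coef, exp) pairs in decreasing exponent order:

```python
# Add two polynomials stored as lists of (coef, exp) pairs,
# highest exponent first -- the same merge strategy as padd.

def padd(a, b):
    c, p, q = [], 0, 0
    while p < len(a) and q < len(b):
        if a[p][1] == b[q][1]:                  # '=' : add coefficients
            s = a[p][0] + b[q][0]
            if s != 0:
                c.append((s, a[p][1]))
            p += 1; q += 1
        elif a[p][1] < b[q][1]:                 # '<' : copy term of B
            c.append(b[q]); q += 1
        else:                                   # '>' : copy term of A
            c.append(a[p]); p += 1
    c.extend(a[p:])                             # remaining terms of A
    c.extend(b[q:])                             # remaining terms of B
    return c

# A(x) = 2x^1000 + 1,  B(x) = x^4 + 10x^3 + 3x^2 + 1
a = [(2, 1000), (1, 0)]
b = [(1, 4), (10, 3), (3, 2), (1, 0)]
print(padd(a, b))   # -> [(2, 1000), (1, 4), (10, 3), (3, 2), (2, 0)]
```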
Another application of arrays is sparse matrices.
SPARSE MATRICES
A matrix which contains a large number of zero entries is called a sparse matrix. In the
matrix below, only 8 out of 36 possible elements are non-zero, so it is sparse.
15   0   0  22   0 -15
 0  11   3   0   0   0
 0   0   0  -6   0   0
 0   0   0   0   0   0
91   0   0   0   0   0
 0   0  28   0   0   0

fig. 1 sparse matrix
It is very natural to store a matrix in a two-dimensional array, say A[1..m, 1..n]. Since a sparse
matrix contains a large number of zeros, another representation is used that stores only the non-zero
elements. Each element of a matrix is uniquely characterized by its row and column position,
say i, j. We can therefore store a matrix as a list of 3-tuples of the form (i, j, value). The 3-tuples
are stored row by row, with the columns increasing within each row.



        1   2   3
a[0]    6   6    8
a[1]    1   1   15
a[2]    1   4   22
a[3]    1   6  -15
a[4]    2   2   11
a[5]    2   3    3
a[6]    3   4   -6
a[7]    5   1   91
a[8]    6   3   28
The elements a[0,1] and a[0,2] contain the number of rows and columns of the matrix.
a[0,3] contains the number of nonzero terms. One of the normal operations is to compute the
transpose of a matrix, which is obtained by interchanging rows and columns.
The transpose of the given sparse matrix is given below.

        1   2   3
b[0]    6   6    8
b[1]    1   1   15
b[2]    1   5   91
b[3]    2   2   11
b[4]    3   2    3
b[5]    3   6   28
b[6]    4   1   22
b[7]    4   3   -6
b[8]    6   1  -15
Procedure transpose computes the transpose of A. A is initially stored as a sparse matrix
in the array a, and its transpose is obtained in the array b.
The data type sparsematrix is defined as below:
type sparsematrix = array[0..maxterms, 1..3] of integer;
where maxterms is a constant.
procedure transpose(a : sparsematrix; var b : sparsematrix);
{ b is set to be the transpose of a }
var m, n, p, q, t, col : integer;
{ m : number of rows in a
  n : number of columns in a
  t : number of terms in a
  q : position of next term in b
  p : current term in a }
begin
  m := a[0,1]; n := a[0,2]; t := a[0,3];
  b[0,1] := n; b[0,2] := m; b[0,3] := t;
  if t > 0 then { nonzero matrix }
  begin
    q := 1;
    for col := 1 to n do { transpose by columns }
      for p := 1 to t do
        if a[p,2] = col then
        begin
          b[q,1] := a[p,2]; b[q,2] := a[p,1];
          b[q,3] := a[p,3]; q := q + 1;
        end;
  end; { of if }
end; { of transpose }
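A sketch of the same column-by-column transpose in Python (our illustration; the triple layout follows the tables above, with the first triple holding rows, columns and the number of nonzero terms):

```python
# Sparse matrix as a list of triples: entry 0 holds (rows, cols, nonzeros),
# the rest hold (i, j, value) ordered by row, then column.

def transpose(a):
    rows, cols, t = a[0]
    b = [(cols, rows, t)]
    # transpose by columns: emit all terms of column 1, then column 2, ...
    for col in range(1, cols + 1):
        for (i, j, v) in a[1:]:
            if j == col:
                b.append((j, i, v))
    return b

a = [(6, 6, 8),
     (1, 1, 15), (1, 4, 22), (1, 6, -15),
     (2, 2, 11), (2, 3, 3),
     (3, 4, -6), (5, 1, 91), (6, 3, 28)]
print(transpose(a))
```

Running this on the triples of fig. 1 reproduces the transposed table above.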
REPRESENTATION OF ARRAYS

Memory is regarded as one-dimensional, with words numbered from 1 to m. So we are
concerned with representing n-dimensional arrays in a one-dimensional memory, such that
the location of an arbitrary element can be determined efficiently. Consider
var a : array [1..5] of integer. If the array starts at address 200, it is represented as

200  a[1]
201  a[2]
202  a[3]
203  a[4]
204  a[5]
The address of any element in the array is calculated as follows:

address(a[i]) = base address(a) + i - 1

e.g. address(a[2]) = 200 + 2 - 1 = 201.

If an array is declared as A[l1..u1, l2..u2, …, ln..un], then the number of elements can be
calculated using the following formula:

    n
    ∏ (ui - li + 1)
   i=1

l stands for lower bound and u stands for upper bound.

One of the common ways to represent an array is in row major order. If we have the
declaration
A[4..5,2..4,1..2,3..4]
Then we have a total of 2*3*2*2=24 elements.
Using row major order these elements will be stored as
A[4,2,1,3], A[4,2,1,4], A[4,2,2,3], A[4,2,2,4],
A[4,3,1,3], A[4,3,1,4], A[4,3,2,3], A[4,3,2,4],
A[4,4,1,3], A[4,4,1,4], A[4,4,2,3], A[4,4,2,4],
A[5,2,1,3], A[5,2,1,4], A[5,2,2,3], A[5,2,2,4],
A[5,3,1,3], A[5,3,1,4], A[5,3,2,3], A[5,3,2,4],
A[5,4,1,3], A[5,4,1,4], A[5,4,2,3], A[5,4,2,4],
Another synonym for row major order is lexicographic order.

The next step is to calculate the address of each element in memory. We assume that the
lower bound li of each dimension is 1.

DEPARTMENT OF COMPUTER SCIENCE-TIPSCAS | 13



• One dimensional array
A is declared as A[1..u1]. Then, assuming one word per element, we can represent it
in sequential memory as follows:

Array element : A[1], A[2], A[3],  …, A[i],   …, A[u1]
Address       : α,    α+1,  α+2,   …, α+i-1,  …, α+u1-1

Total number of elements = u1.
If α is the address of A[1], then the address of an arbitrary element A[i] is α + i - 1.
• Two dimensional array
A is declared as A[1..u1, 1..u2] and may be interpreted as u1 rows, row 1, row 2, …, row u1,
each row consisting of u2 elements. The sequential representation of A[1..u1, 1..u2] is:

| row 1 | row 2 | … | row i | … | row u1 |
  u2 elements per row; (i-1)u2 elements precede row i

If α is the address of A[1,1], then the address of A[i,1] is α + (i-1)u2, as there are i-1
rows, each of size u2, preceding the first element in the i-th row. The address of A[i,j] is then
α + (i-1)u2 + (j-1).
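The address formula can be checked with a small sketch (our illustration; alpha, u2, i and j are the quantities defined above):

```python
# Row-major address of A[i, j] for A[1..u1, 1..u2] stored from base
# address alpha, one word per element: alpha + (i-1)*u2 + (j-1).

def addr2(alpha, u2, i, j):
    return alpha + (i - 1) * u2 + (j - 1)

# e.g. A[1..3, 1..4] at base address 100: one full row (4 elements)
# plus 2 elements precede A[2,3].
print(addr2(100, 4, 2, 3))   # -> 106
```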


• Three dimensional array
A is declared as A[1..u1, 1..u2, 1..u3]. This array is interpreted as u1 two-dimensional arrays
of dimension u2 x u3. In the sequential row major representation, (i-1)u2u3 elements precede
the two-dimensional array containing A[i, j, k]; within it, the address is obtained as in the
two-dimensional case. Thus the address of A[i, j, k] is
α + (i-1)u2u3 + (j-1)u3 + (k-1).

• Multidimensional array
A is declared as A[1..u1, 1..u2, …, 1..un]. If α is the address of A[1,1,…,1], then
α + (i1-1)u2u3…un is the address of A[i1,1,…,1]. The address of A[i1, i2, …, in] is then

α + (i1-1)u2u3…un
  + (i2-1)u3u4…un
  + (i3-1)u4u5…un
  ...
  + (i(n-1)-1)un
  + (in-1)

DEPARTMENT OF COMPUTER SCIENCE-TIPSCAS | 14


The above formula can be written as follows:

address(A[i1, …, in]) = α + Σj=1..n (ij - 1)aj

where aj = u(j+1)·u(j+2)·…·un for 1 <= j < n, and an = 1.
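The general formula can be sketched as follows (our illustration; u holds the upper bounds u1..un and i the subscripts i1..in):

```python
# Row-major address of A[i1, ..., in] for bounds [1..u1] x ... x [1..un]:
# address = alpha + sum((i_j - 1) * a_j), with a_n = 1 and
# a_j = u_{j+1} * ... * u_n for j < n.

def address(alpha, u, i):
    n = len(u)
    a = [1] * n
    for j in range(n - 2, -1, -1):      # a_j = u_{j+1} * a_{j+1}
        a[j] = u[j + 1] * a[j + 1]
    return alpha + sum((i[j] - 1) * a[j] for j in range(n))

# A[1..2, 1..3, 1..4] at base 0: A[2,1,1] is preceded by 3*4 = 12 elements.
print(address(0, [2, 3, 4], [2, 1, 1]))   # -> 12
```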

STACKS AND QUEUES:

Two of the more common data objects found in computer algorithms are stacks and
queues. A stack is an ordered list in which all insertions and deletions are made at one end,
called the top. A queue is an ordered list in which all insertions take place at one end, called
the rear, while all deletions take place at the other end, called the front.

Stack                          Queue

| 4 | ← TOP
| 3 |
| 2 |                 | 1 | 2 | 3 | 4 |
| 1 |                   ↑           ↑
                      front       rear

Given a stack S = (a1, a2, …, an), we say that a1 is the bottom-most
element of the stack and that ai is on top of element a(i-1), 1 < i <= n.
Given a queue Q = (a1, a2, …, an), we say that a1 is the front element, an is the rear
element, and a(i+1) is behind ai, 1 <= i < n.

The restrictions on a stack imply that if the elements A, B, C, D and E are added to the
stack in that order, then the first element to be removed or deleted must be E. Equivalently,
the last element to be inserted into the stack will be the first to be removed. For this reason
stacks are sometimes called 'Last In First Out' (LIFO) lists.

The restrictions on a queue imply that if the elements A, B, C, D and E are added to the
queue in that order, then the first element which is inserted into the queue will be the first one
to be removed. Thus A is the first letter to be removed, and queues are known as 'First In
First Out' (FIFO) lists.

One natural example of stacks which arises in computer programming is the


processing of procedure calls and their terminations. Consider the following example.



The MAIN procedure invokes procedure A1. On completion of A1, execution of MAIN
will resume at location r. The address r is passed to A1, which saves it in some location for
later processing. A1 then invokes A2, which in turn invokes A3. In each case the invoking
procedure passes the return address to the invoked procedure. If we examine the memory while
A3 is computing, there will be an implicit stack which looks like (q, r, s, t). The first entry, q,
is the address to which MAIN returns control. The list operates as a stack, since the returns
will be made in the reverse order of the invocations. Thus t is removed before s, s before r,
and r before q. Equivalently, this means that A3 must finish processing before A2, A2 before
A1, and A1 before MAIN.

The operations on a stack are as follows:

CREATE(S)    which creates S as an empty stack

ADD(i,S)     which inserts the element i onto the stack S and returns
             the new stack
DELETE(S)    which removes the top element of stack S and returns the
             new stack
TOP(S)       which returns the top element of stack S
ISEMTS(S)    which returns true if S is empty, else false

The simplest way to represent a stack is by using a one-dimensional array, say
stack[1..n], where n is the maximum number of allowable entries. The first or bottom element
of the stack will be stored at stack[1], the second at stack[2], and the i-th at stack[i]. Associated
with the array will be a variable, top, which points to the top element in the stack.
The implementation of the remaining operations is as follows:

CREATE(stack) ::= var stack : array[1..n] of items; top : 0..n;
ISEMTS(stack) ::= if top = 0 then true
                  else false;
TOP(stack)    ::= if top = 0 then error
                  else stack[top];

The ADD and DELETE procedures are given below. The corresponding procedures
have been written assuming that stack, top and n are global.

Procedure to add an item onto the stack

procedure add(item : items);
{ add item to the global stack stack;
  top is the current top of the stack
  and n is its maximum size }
begin
  if top = n then stackfull
  else begin
    top := top + 1;
    stack[top] := item;
  end;
end; { of add }

Procedure to delete an item from the stack

procedure delete(var item : items);
{ remove the top element from the stack stack and put it in item }
begin
  if top = 0 then stackempty
  else begin
    item := stack[top];
    top := top - 1;
  end;
end; { of delete }
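The add and delete procedures translate directly into a Python sketch (our illustration; stackfull and stackempty are signalled here as exceptions):

```python
# A fixed-capacity stack on a Python list, mirroring the add/delete
# procedures above.

class Stack:
    def __init__(self, n):
        self.n = n
        self.stack = []

    def add(self, item):
        if len(self.stack) == self.n:
            raise OverflowError("stackfull")
        self.stack.append(item)          # top := top + 1; stack[top] := item

    def delete(self):
        if not self.stack:
            raise IndexError("stackempty")
        return self.stack.pop()          # item := stack[top]; top := top - 1

s = Stack(5)
for x in "ABCDE":
    s.add(x)
print(s.delete())   # -> E  (last in, first out)
```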

In a queue, when a job is submitted for execution it joins the rear of the job queue.
The job at the front of the queue is the next one to be executed. The operations on a queue are
as follows:

CREATEQ(Q)   which creates Q as an empty queue

ADDQ(i,Q)    which inserts the element i at the rear of queue Q and
             returns the new queue
DELETEQ(Q)   which removes the first element from queue Q and returns
             the resulting queue
FRONT(Q)     which returns the first element of queue Q
ISEMTQ(Q)    which returns true if Q is empty, else false

To represent a queue we need a one-dimensional array q[1..n] and two variables, front
and rear. front always points to one position less than the actual front, and rear always points
to the last element in the queue. Thus front = rear if and only if there are no elements in the
queue. The initial condition then is
front = rear = 0.
Consider the following example, which shows the state of the queue when a job enters
and leaves it:

front rear  q[1]  q[2]  q[3]  q[4]  ...   remarks
========================================================
  0    0                                  queue empty, initial
  0    1    J1                            job 1 joins Q
  0    2    J1    J2                      job 2 joins Q
  0    3    J1    J2    J3                job 3 joins Q
  1    3          J2    J3                job 1 leaves Q
  1    4          J2    J3    J4          job 4 joins Q
  2    4                J3    J4          job 2 leaves Q
========================================================

The following implementation of the CREATEQ, ISEMTQ and FRONTQ operations is
used for a queue with capacity n.

CREATEQ(q) ::= var q : array [1..n] of items; front, rear : 0..n;
               front := 0; rear := 0;
ISEMTQ(q) ::= if front = rear then true
              else false;
FRONTQ(q) ::= if ISEMTQ(q) then error
              else q[front+1];



The ADDQ and DELETEQ procedures are given below.

Procedure to add an item in to the queue

procedure addq (item : items);

{ add item to the queue q }
{ q, rear, and n are global variables }
begin
if rear = n then queuefull
else begin
rear := rear + 1;
q[rear] := item;
end;
end; {of addq}

Procedure to delete an item from the queue

procedure deleteq (var item : items);

{ delete from the front of the queue q and put it into item }
begin
if front = rear then queueempty
else begin
front := front + 1;
item := q[front];
end;
end; {of deleteq}
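The addq and deleteq pair can be sketched in Python (names are ours): front stays one position behind the actual front, and both indices only ever move right, which is exactly what causes the queuefull problem discussed next.

```python
class LinearQueue:
    """Linear (non-circular) queue mirroring q[1..n], front and rear."""
    def __init__(self, n):
        self.q = [None] * n
        self.front = 0   # one position less than the actual front
        self.rear = 0    # index one past the last element added

    def addq(self, item):
        if self.rear == len(self.q):
            # may fire even though deleteq has freed slots on the left
            raise OverflowError("queuefull")
        self.q[self.rear] = item
        self.rear += 1

    def deleteq(self):
        if self.front == self.rear:
            raise IndexError("queueempty")
        item = self.q[self.front]
        self.front += 1
        return item

jobs = LinearQueue(3)
jobs.addq('J1'); jobs.addq('J2')
print(jobs.deleteq())   # -> J1: first come, first served
```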

The queuefull signal does not imply that there are n elements in the queue. The queue
may still contain empty spaces at the front when rear = n, which triggers the queue-full
condition. To avoid this problem, elements must be moved left each time a deletion takes
place, which is time consuming.

Consider the following example which explains the shifting of elements of a queue
when deletions take place. Each time a new element is added, the entire queue of n-1
elements is moved left.

Front rear q[1] q[2] q[3] q[n] next operation


========================================================
0 n J1 J2 J3 ……. Jn initial state
1 n J2 J3 … …. Jn delete J1
0 n J2 J3 …… Jn+1 add Jn+1
(job J2 through
Jn are moved)
1 n J3 J4 …… Jn+1 delete J2
0 n J3 J4 J5 ……. Jn+2 add Jn+2
========================================================

Circular queue



A more efficient queue representation is obtained by regarding the array q as
circular, declared as q[0..n-1]. When rear = n-1, the next element is inserted at q[0],
provided that spot is free. Front always points one position counterclockwise from the first
element in the queue. Again front = rear if and only if the queue is empty. Initially we have
front = rear = 1. The following figure illustrates some of the possible configurations for a
circular queue containing the four elements J1-J4 with n>4.

[Figure: the array q[0..n-1] drawn as a circle, with positions labeled [0], [1], [2], [3],
..., [n-4], [n-3], [n-2], [n-1]. The four elements J1-J4 occupy consecutive positions,
with front = n-4 and rear = 0.]

In order to add an element, it will be necessary to move rear one position clockwise. That is,

If rear = n-1 then rear := 0


else rear := rear +1.

Using the modulo operator, which computes remainders, this is just

rear := (rear+1) mod n. Similarly, it will be necessary to move front one position each time
a deletion is made. Again using the modulo operator, this can be done by front := (front+1)
mod n.

Procedure to add an item in to the circular queue


procedure addq (item : items);
{ insert item into the circular queue q[0..n-1] }
{ q, rear, and n are global variables }
begin
rear := (rear + 1) mod n;
if front = rear then queuefull
else q[rear] := item;
end; {of addq}

Procedure to delete an item from a circular queue



procedure deleteq (var item : items);
{ remove the front element from q and put it into item }
begin
if front = rear then queueempty
else begin
front := (front + 1) mod n;
item := q[front];
end;
end; {of deleteq}

In the above two algorithms, the test condition for queue full in addq and the test for
queue empty in delete queue are the same. In the case of addq when front = rear is evaluated
and found to be true, there is actually one free space that is, q[rear], since the first element in
the queue is not at q[front] but is one position clockwise from this point. However, if we
insert an item here, then we will not be able to distinguish between the cases full and empty,
since this insertion would leave front = rear. To avoid this, we signal queuefull, thus permitting a
maximum of n-1 rather than n elements to be in the queue at any time.
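The circular queue can be sketched in Python as follows (the class name is ours). Note how both addq and deleteq advance their index with mod n, and how the front = rear test limits the queue to n-1 elements, as explained above; for simplicity this sketch starts with front = rear = 0.

```python
class CircularQueue:
    """Circular queue over q[0..n-1]; holds at most n-1 elements."""
    def __init__(self, n):
        self.q = [None] * n
        self.front = 0   # one position counterclockwise from the first element
        self.rear = 0

    def addq(self, item):
        rear = (self.rear + 1) % len(self.q)   # move rear one position clockwise
        if rear == self.front:
            raise OverflowError("queuefull")   # one slot is always left unused
        self.rear = rear
        self.q[rear] = item

    def deleteq(self):
        if self.front == self.rear:
            raise IndexError("queueempty")
        self.front = (self.front + 1) % len(self.q)
        return self.q[self.front]

cq = CircularQueue(4)        # capacity is n-1 = 3
cq.addq('J1'); cq.addq('J2'); cq.addq('J3')
print(cq.deleteq())          # -> J1
cq.addq('J4')                # rear wraps around to q[0]
```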
EVALUATION OF EXPRESSION:

An expression is made up of operands, operators and delimiters. Consider the following


expression

X := A / B – C + D * E – A * C

The expression above has five operands A, B,C, D and E. Though these are all one-
letter variables, operands can be any legal variable name or constant. In any expression the
values that variables take must be consistent with the operations performed on them. The
operations are described by the operators. The basic arithmetic operators are +, -, / and *. The
other arithmetic operators include unary minus, mod and div. Relational operators are <, <=,
=, <>,>= and >. The result of an expression, which contains relational operators, can be either
true or false. Such an expression is called Boolean. Logical operators are AND, OR and
NOT.

Priority of Operators in Turbo Pascal :

To fix the order of evaluation, we assign to each operator a priority. Then within any
pair of parenthesis the operators with the highest priority will be evaluated first. A set of
sample priorities from Turbo Pascal is given below:

========================================================
priority operator

1 **,unary minus,unary plus,


2 not
3 * , / , div, mod, and
4 + , - , or , xor
5 < , <= , = , <> , >= , > , in
========================================================

When we have an expression where two adjacent operators have the same priority, we
evaluate them from left to right.



Parsing and evaluationof arithmetic expressions using stacks:

There are many ways of representing the same arithmetic expression. For example, the
assignment statements

Z := A * B / C + D
Z := (A * B ) / C + D
Z := ( ( A * B ) / C ) + D

result in the same value. The process of collapsing such different expressions into one
unique form is called parsing the expression, and stacks are used for parsing.

Prefix, infix and postfix notation:

If e is an expression with operators and operands, the conventional way of writing e is


called infix because the operators come in-between the operands. The postfix of an
expression calls for each operator to appear after its operands. In prefix form of an expression
an operator immediately precedes the corresponding operands.

For example,

A * B / C is equivalent to

Infix   : A * B / C
Postfix : A B * C /
Prefix  : / * A B C
Algorithm for translating from infix to postfix.

1. Fully parenthesize the expression


2. Move all operators so that they replace their corresponding right parenthesis
3. Delete all parentheses.

For example,

A/B-C+D*E-A*C when fully parenthesized gives

(( ((A/B)–C)+ (D*E))–(A*C))

The arcs join an operator and its corresponding right parenthesis. Performing steps 2 and 3
gives

AB/C-DE*+AC*-

The importance of prefix and postfix notations in parsing arithmetic expressions is
that these notations are completely free of parentheses. Consequently, an expression in
postfix (or prefix) form is in unique form.



Procedure for converting infix to postfix form is given below:

procedure postfix (e : expression);

{output the postfix form of the infix expression e.
nexttoken and stack are as in procedure eval. It is
assumed that the last token in e is '#'. Also, '#' is used
at the bottom of the stack.}

var x, y : token;

begin
stack[1] := '#'; top := 1; {initialize stack}
x := nexttoken(e);
while x <> '#' do
begin
  if x is an operand
  then write(x)
  else if x = ')'
  then begin {unstack until '('}
    while stack[top] <> '(' do
    begin
      delete(y);
      write(y);
    end;
    delete(y); {delete '('}
  end
  else begin
    while isp[stack[top]] >= icp[x] do
      begin delete(y); write(y); end;
    add(x);
  end;
  x := nexttoken(e);
end; {of while}
{end of expression; empty the stack}
while top > 1 do
  begin delete(y); write(y); end;
writeln('#');
end; {of postfix}

Procedure for evaluating a postfix expression


procedure eval (e : expression);
{evaluate the postfix expression e. It is assumed that the last
token (a token is either an operator, an operand, or '#')
in e is '#'. A procedure nexttoken is used to extract from e the next token.
A one-dimensional array stack[1..n] is used as a stack.}
var x : token;
begin
top := 0; {initialize stack}
x := nexttoken(e);
while x <> '#' do
begin
  if x is an operand
  then add(x) {add to stack}
  else begin {operator}
    remove the correct number of operands for operator x from
    the stack; perform the operation x and store the result (if any)
    on to the stack;
  end;
  x := nexttoken(e);
end; {of while}
end; {of eval}

Example for converting infix expression to postfix form

Consider the expression A * ( B + C) * D


The algorithm behaves as

Next Token    Stack    Output
================================
none          empty    none
A             empty    A
*             *        A
(             *(       A
B             *(       AB
+             *(+      AB
C             *(+      ABC

At this point we want to unstack down to the corresponding left parenthesis, and then delete
the left and right parentheses; this gives us

)             *        ABC+
*             *        ABC+*
D             *        ABC+*D
done          empty    ABC+*D*

So the postfix form of the given expression is ABC+*D*
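The same stack-based translation can be sketched in Python (the function name and the priority tables are ours; only the binary operators +, -, *, / and parentheses are handled, with single-character operands):

```python
def postfix(e):
    """Translate an infix string to postfix using a stack of operators."""
    isp = {'+': 1, '-': 1, '*': 2, '/': 2, '(': 0}   # in-stack priorities
    icp = {'+': 1, '-': 1, '*': 2, '/': 2, '(': 3}   # incoming priorities
    stack, out = [], []
    for x in e:
        if x.isalnum():                  # operand: output immediately
            out.append(x)
        elif x == ')':                   # unstack down to the matching '('
            while stack[-1] != '(':
                out.append(stack.pop())
            stack.pop()                  # discard the '('
        else:                            # operator or '('
            while stack and isp[stack[-1]] >= icp[x]:
                out.append(stack.pop())
            stack.append(x)
    while stack:                         # end of expression: empty the stack
        out.append(stack.pop())
    return ''.join(out)

print(postfix('A*(B+C)*D'))    # -> ABC+*D*
```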


Example for postfix evaluation using stacks

Consider the postfix expression A B * C D E / - +

Let us suppose that the symbols A, B, C, D and E have associated with them the values:

Symbol   value
================
A        5
B        3
C        6
D        8
E        2
================
To evaluate such an expression, we repeatedly read characters from the postfix
expression. If the character read is an operand, push the value associated with it on
to the stack. If it is an operator, pop two values from the stack, apply the operator to them,
and push the result back on to the stack.
The stack contents after each read (top of stack at the right) are:

READ A :  5
READ B :  5  3
READ * :  15
READ C :  15  6
READ D :  15  6  8
READ E :  15  6  8  2
READ / :  15  6  4
READ - :  15  2
READ + :  17

So the result of A B * C D E / - + is 17 for the given values.
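A minimal Python version of this evaluation (names are ours; operand values are supplied in a dictionary, and / is ordinary division):

```python
def eval_postfix(e, values):
    """Evaluate a postfix string, one character per token."""
    stack = []
    for x in e:
        if x in '+-*/':
            b = stack.pop()      # the right operand is on top of the stack
            a = stack.pop()
            stack.append({'+': a + b, '-': a - b,
                          '*': a * b, '/': a / b}[x])
        else:
            stack.append(values[x])   # operand: push its value
    return stack.pop()

vals = {'A': 5, 'B': 3, 'C': 6, 'D': 8, 'E': 2}
print(eval_postfix('AB*CDE/-+', vals))   # -> 17.0
```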

MULTIPLE STACKS AND QUEUES:

Let us consider an array v[1..m]. If we have a single stack, then v[1] is the bottom-most
element in the stack and v[m] the top-most possible element of the stack. If we have two
stacks to represent, then we use v[1] for the bottom-most element of stack 1 and v[m] for the
corresponding element of stack 2. Stack 1 can grow towards v[m] and stack 2 towards v[1].
When more than two stacks, say n, are to be represented sequentially, we initially divide the
available memory v[1..m] into n segments and allocate one of these segments to each of the
n stacks. This initial division of 1 to m into segments may be done in proportion to the expected
sizes of the various stacks if these are known. In the absence of such information, v[1..m] may be
divided into equal segments. For each stack i we shall use b[i] to represent a position one less
than the position in v for the bottom most element of that stack. t[i] , 1<=i<=n will point to
the top most element of the stack i. The boundary condition b[i]=t[i] is used if and only if the



ith stack is empty. If we grow the ith stack in lower memory indices than the (i+1)th then with
roughly equal initial segments we have
b[i] = t[i] = ⌊m/n⌋ * (i-1), 1 <= i <= n
as the initial values of b[i] and t[i]. Stack i, 1<=i<=n can grow from b[i] +1 up to b[i+1]
before it catches up with the (i+1)th stack.. We define b[n+1] = m

[Figure: the array v[1..m] divided into n roughly equal segments; b[1], b[2], b[3], ...,
b[n+1] mark the segment boundaries, and t[1], t[2], t[3], ... mark the current tops of
the stacks.]

procedure to add an element into the i'th stack

procedure add (i : integer; x : items);

{add x to the i'th stack}
begin
if t[i] = b[i+1] then stackfull(i)
else begin
t[i] := t[i] + 1;
v[t[i]] := x; {add to i'th stack}
end;
end; {of add}

procedure to delete an element from the i'th stack

procedure delete (i : integer; var x : items);
{delete the topmost item of stack i}
begin
if t[i] = b[i] then stackempty(i)
else begin
x := v[t[i]];
t[i] := t[i] - 1;
end;
end; {of delete}
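A sketch of n stacks sharing one array, with the boundary and top arrays b and t kept as in the text (the class name is ours; the array keeps an unused slot 0 so that indices stay 1-based):

```python
class MultiStack:
    """n stacks inside one array v[1..m], divided into equal segments."""
    def __init__(self, m, n):
        self.v = [None] * (m + 1)      # v[1..m]; slot 0 is unused
        # b[i] is one less than the position of the bottom of stack i; b[n+1] = m
        self.b = [0] + [(m // n) * (i - 1) for i in range(1, n + 1)] + [m]
        self.t = self.b[:n + 1].copy()  # t[i] = b[i] iff stack i is empty

    def add(self, i, x):
        if self.t[i] == self.b[i + 1]:  # stack i has hit the next segment
            raise OverflowError("stackfull(%d)" % i)
        self.t[i] += 1
        self.v[self.t[i]] = x

    def delete(self, i):
        if self.t[i] == self.b[i]:
            raise IndexError("stackempty(%d)" % i)
        x = self.v[self.t[i]]
        self.t[i] -= 1
        return x

ms = MultiStack(6, 2)     # two stacks of three slots each
ms.add(1, 'a'); ms.add(2, 'x'); ms.add(1, 'b')
print(ms.delete(1))       # -> b: each stack keeps its own LIFO order
```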

SIMULATION

Simulation is the process of forming an abstract model from a real situation in order to
understand the impact of modifications and the effect of introducing various strategies on the
situation. The main objective of the simulation program is to aid the user, most often an
operations research specialist or systems analyst, in projecting what will happen to a given
physical situation under certain simplifying assumptions. It allows the user to experiment



with real and proposed situations otherwise impossible or impractical. Simulation is a very
powerful tool if applied correctly. The major advantage of simulation is that it permits
experimentation without modifying the real situation.

SIMULATION OF TIME-SHARING COMPUTER SYSTEM

We can share the processor by allowing one user‟s program to execute for a short time,
then allowing another user‟s program to execute , and another etc., until there is a return to
the execution of the initial user‟s program. This cycle is continued repeatedly on all active
user programs. This method of sharing the CPU among many users is referred to as
timesharing. We can also share memory among the user programs by simply dividing
memory into regions and allowing each user program to execute in its own region when it
receives CPU control. In a system such as this, each user is unaware of the presence of the
other users. In fact, each terminal appears like a separate computer to the user.

For example there are three users, Tremblay, Sorenson and Bunt.

The following statistics are gathered concerning their session.

Relative session    program ID    requested CPU
starting time                     time periods
========================================================
0                   Tremblay      4, 8, 3
1                   Sorenson      2, 1, 2, 2
2                   Bunt          4, 6, 1
========================================================

We interpret these figures in the following manner. After logging on the system at time
0 Tremblay initially is allotted 4 seconds of CPU time before he receives a typed response at
his terminal. A further 8 secs of CPU time is required before another response can be printed
at Tremblay's terminal. Finally, Tremblay's program uses three additional secs of CPU time
and then Tremblay logs off, terminating his session. Because we have only one CPU,
Sorenson's program and Bunt's program will be made to wait until Tremblay's program has
finished its first requested CPU time period of 4 secs. We can indicate the fact that
Sorenson's and Bunt's programs are waiting for the processor by placing their program IDs in
the queue behind Tremblay's program ID. Hence at time 2 the queue would appear
as in figure 1.

When Tremblay‟s program has completed its first requested CPU time period a
scheduling strategy would be to allocate the CPU to Sorenson‟s program. When Sorenson‟s
program has finished its CPU time period Bunt‟s program should be allowed to execute its
CPU time period. This type of scheduling strategy is called the 'First Come First Served'
(FCFS) scheduling strategy.

CPU utilization = (Total CPU time / Total session time) * 100

Total user waiting time = total time - total CPU time - total user delay
(for each user)

Total user delay = 5 * (no. of CPU requests - 1)
(for each user)

[Figure 1: processor timeline from time 0 to 40 for TREMBLAY, SORENSON and BUNT.
Legend: ***** executing; ..... user delay period; ----- waiting in queue.
The accompanying snapshot shows the processor queue at time 2: Tremblay's program at
the head, followed by Sorenson's and then Bunt's.]

The rules for the FCFS strategy in this application are follows:

1. When a program requests CPU time , it is placed at the back of the processor queue.
2. The program at the head of the processor queue is the program that is currently being
executed. It remains at the head of the queue for its entire CPU time period.



3. When an executing program completes its current requested CPU time period, it is
removed from the processor queue and is not placed back into the queue until a
further request is made .

The primary actions which occur in this system , and the manner in which the simulation
imitates the actions , are as follows:

User requests service: In the real system this request would be a teletype signal which would
alert the processor. In the model we use a variable set to the time
at which the user will require service. When the simulation time reaches or passes this time,
the user is added to the processor queue.

User program in execution: To imitate this action, we update the simulation time by the
length of the time period requested by the user.

User completes current requested execution period: At this point the user is removed from
the queue and enters a "thinking and typing" period. We will assume a delay period of five
time units for all users. At this point the variable used to signal service requests is updated
to indicate that the user is in the delay period.
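The FCFS rules can be sketched with an ordinary queue. This simplified Python model (the function name is ours) only tracks the order in which CPU bursts are served and ignores the five-unit delay period, but it follows rules 1-3: the program at the head of the queue runs its whole burst, then rejoins at the back if it has more bursts.

```python
from collections import deque

def fcfs(jobs):
    """jobs: (name, [burst, ...]) pairs in arrival order.
    Returns the order in which CPU bursts are served."""
    q = deque(jobs)                       # the processor queue
    served = []
    while q:
        name, bursts = q.popleft()        # head of queue runs its full burst
        served.append(name)
        if len(bursts) > 1:               # more bursts: rejoin at the back
            q.append((name, bursts[1:]))
    return served

order = fcfs([('Tremblay', [4, 8, 3]),
              ('Sorenson', [2, 1, 2, 2]),
              ('Bunt', [4, 6, 1])])
print(order[:3])   # -> ['Tremblay', 'Sorenson', 'Bunt']
```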

POINTS TO REMEMBER:

1. Applications of arrays include ordered lists and sparse matrices.
2. Multiple stacks and queues are used in simulation calculations.
3. Creating and analysing an algorithm correspond directly to creating and analysing
a program.
4. A stack follows the LIFO (last in, first out) discipline; a queue follows FIFO (first in,
first out).
5. A sparse matrix has a large number of zero elements and few non-zero elements.
6. Infix: the normal form of an expression.
7. Prefix: operator first, operands next.
8. Postfix: operands first, operator next.



Unit – II
Linked List: Singly Linked List - Linked Stacks and Queues - Polynomial Addition -
More on Linked Lists - Sparse Matrices - Doubly Linked List and Dynamic – Storage
Management - Garbage Collection and Compaction.

INTRODUCTION:

LINKED LISTS:

The representation of simple data structures using an array and sequential mapping
has the property that successive nodes of the data object are stored at fixed distance apart.
When a sequential mapping is used for ordered lists, operations such as insertion and deletion
of arbitrary elements become expensive. Consider the following list containing

(2, 4, 6, 10,12, 14).

To complete the list naturally we want to add 8 to the list. The insertion will require
us to move elements already in the list either one location lower or higher. We must either
move 10, 12, 14 or else move 2, 4, 6. If we have to do many insertions into the middle, then
neither alternative is attractive because of the amount of data movement. On the other hand,
suppose we decide to remove many elements; again, we have to move many elements so as to
maintain the sequential representation of the list.

When we have several ordered lists of varying sizes, sequential representation again
proved to be inadequate. By storing each list in a different array of maximum size, storage
may be wasted. By maintaining the lists in a single array a potentially large amount of data
movement is needed. Solution to this problem of data movement in sequential representation
is achieved by using linked representations.

The use of pointers or links to refer to elements of data structure implies that elements
which are logically adjacent need not be physically adjacent in memory. This type of
allocation is called linked allocation. A list has been defined to consist of an ordered set of
elements which may vary in number. A simple way to represent a linear list is to expand each
node to contain a link or pointer to the next node. This representation is called a one-way
chain or singly linked list. In general a node is a collection of data, data1, data2, ... datan, and
links link1, link2, ... linkn. Each item in a node is called a field. A field contains either
a data item or a link. The link address of nil in the last node signals the end of the list. Nil
is not the address of any possible node but is a special value. If a list does not contain any
node then it is called an empty list.



first -> [2] -> [4] -> [6] -> ... -> [14] -> nil

It is easy to make arbitrary insertions and deletions using a linked list rather than a
sequential list. To insert 8 between 6 and 10, get a new node and set its data field to 8. Create
a link from 6 to 8 and from 8 to 10, as illustrated below:

[2] -> [4] -> [6] -> [8] -> [10] -> [12] -> [14] -> nil

To delete the element 6 from the list, remove the existing link between 4 and 6 and
create a new link from 4 to 8.

Pointer data type:

↑T represents a pointer to the data type T.


Consider

Type student = record
    Rollno : integer;
    Name : string;
End;

Var Ram : student;

declares the data object named Ram of type student. Ram is the name of the object which
directly refers to its both components.

Consider
Type p = student;
Var s : p;

The type p is a pointer type that points to an unnamed object of type student. The variable s
is of type p. The value of the variable s will be an address in memory at which an object of
type student is created by us. Since no object exists currently, the value of s is said to be nil
(a reserved word). A new object of type student is created by executing the procedure call
new(s).

s.rollno = 200
s.name = „ram‟



[Figure: the named pointer variable s, of type ↑student, pointing to an unnamed object of
type student whose fields hold 200 and 'ram'.]

s serves as the name of the object. After the object s has served its
purpose, we can remove it from memory and release its location by using the procedure call
dispose(s). If a linked list is consisting of nodes which have data field of type integer and
link field, then the following type declaration is used.

Type pointer = ↑listnode;


listnode = record
data : integer;
link : pointer;
end;

Consider the declaration

Var f : pointer;

The fields of the node pointed at by f may be referenced in the following way:

f↑.data and f↑.link

Procedure to create a linked list with two nodes of type list node:

The data field of the first node is set to 10 and that of the second to 20. first is a pointer to
the first node:

first -> [10] -> [20] -> nil

procedure create (var first : pointer);

var second : pointer;
begin
new(first);
new(second);
first↑.link := second; {link first node to the second}
second↑.link := nil; {last node}
first↑.data := 10; {set data of the first node}
second↑.data := 20; {set data of the second node}
end; {of create}

procedure for inserting a node into a linked list

Example: Let first be a pointer to a linked list. first = nil if the list is empty. Let x be a
pointer to some arbitrary node in the list. The following procedure inserts a new node with
data field 50 following the node pointed at by x. The resulting list structures for the two
cases first = nil and first <> nil are:

first -> [50] -> nil                       (insert into an empty list)

first -> [10] -> ... -> x -> [50] -> ...   (new node t inserted after x)

procedure insert (var first : pointer; x : pointer);

var t : pointer;
begin
new(t); {get a new node}
t↑.data := 50;
if first = nil then begin {insert into empty list}
first := t;
t↑.link := nil;
end
else begin {insert after x}
t↑.link := x↑.link;
x↑.link := t;
end;
end;

Procedure to delete an element from the list:



Let first and x be as in the previous example. Let y point to the node that precedes x, and
let y = nil if x = first. The following procedure deletes the node x from the list.

procedure delete (x, y : pointer; var first : pointer);

begin
if y = nil then first := first↑.link
else y↑.link := x↑.link;
dispose(x); {return the node}
end;
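The create/insert/delete pattern above can be sketched in Python, where every object reference already behaves like a Pascal pointer (class and function names are ours):

```python
class ListNode:
    """A node with a data field and a link to the next node (None = nil)."""
    def __init__(self, data, link=None):
        self.data = data
        self.link = link

def insert_after(x, data):
    """Insert a new node carrying data immediately after node x."""
    x.link = ListNode(data, x.link)

def delete_after(x):
    """Unlink the node following x and return its data."""
    node = x.link
    x.link = node.link    # bypass the deleted node
    return node.data

first = ListNode(10, ListNode(20))   # 10 -> 20
insert_after(first, 50)              # 10 -> 50 -> 20
print(delete_after(first))           # -> 50; the list is 10 -> 20 again
```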

LINKED STACKS :

When several stacks and queues coexist, there is no way to represent them sequentially.
The solution is obtained by linked list representation of stacks and queues.

Linked stack:

top -> [ ] -> [ ] -> ... -> [ ] -> nil

Linked queue:

front -> [ ] -> [ ] -> ... -> [ ] -> nil
                             (rear points to the last node)

The direction of links for both the stack and queue are such as to facilitate easy insertion
and deletion of nodes. You can easily add or delete a node at the top of the stack. You can



add a node at the rear end and both deletion and addition can be performed at the front. But
normally we would not add nodes at the front.

Node and Pointer are defined as follows

Type pointer = ↑node;
     node = record
       data : integer;
       link : pointer;
     end;

The following global arrays of type pointer are used;


top[i] = node at top of the i th stack 1<=i<=n
front[i] = node at front of the i th queue 1<=i<=m
rear[i] = node at rear of the i th queue 1<=i<=m

The initial conditions are :

top[i] = nil , 1<=i<=n


front[i] = nil , 1<=i<=m

The boundary conditions are :

top[i] = nil, iff the i th stack is empty


front[i] = nil, iff the i th queue is empty

procedure for adding an element into a linked stack

procedure addstack (i, y : integer);

{add y to the i'th stack, 1<=i<=n}
var x : pointer;
begin
new(x); {get a node}
x↑.data := y; {set its data field}
x↑.link := top[i]; {attach to top of i'th stack}
top[i] := x; {update stack pointer}
end; {of addstack}

procedure for deleting an element from a linked stack

procedure deletestack (i : integer; var y : integer);

{delete the top node from stack i and set y to its data field, 1<=i<=n}
var x : pointer;
begin
if top[i] = nil then stackempty;
x := top[i];
y := x↑.data; {data field of the top node}
top[i] := x↑.link; {remove the top node}
dispose(x); {free the node}
end; {of deletestack}



procedure for adding an element into a linked queue

procedure addqueue (i, y : integer);
{add y to the i'th queue, 1<=i<=m}
var x : pointer;
begin
new(x); {get a node}
x↑.data := y; {set its data field}
x↑.link := nil;
if front[i] = nil then
front[i] := x {empty queue}
else
rear[i]↑.link := x;
rear[i] := x;
end; {of addqueue}

procedure for deleting an element from a linked queue

procedure deletequeue (i : integer; var y : integer);

{delete the first node from queue i and set y to its data field, 1<=i<=m}
var x : pointer;
begin
if front[i] = nil then queueempty;
x := front[i];
front[i] := x↑.link; {delete the first node}
y := x↑.data;
dispose(x); {free the node}
end; {of deletequeue}
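The four procedures can be sketched together in Python; for simplicity this version keeps one stack and one queue per object rather than the arrays top[i], front[i] and rear[i] (all names are ours):

```python
class Node:
    def __init__(self, data, link=None):
        self.data, self.link = data, link

class LinkedStack:
    def __init__(self):
        self.top = None               # nil: the stack is empty

    def add(self, y):
        self.top = Node(y, self.top)  # the new node becomes the top

    def delete(self):
        x = self.top
        self.top = x.link             # remove the top node
        return x.data

class LinkedQueue:
    def __init__(self):
        self.front = self.rear = None

    def add(self, y):
        x = Node(y)
        if self.front is None:
            self.front = x            # empty queue
        else:
            self.rear.link = x        # attach at the rear
        self.rear = x

    def delete(self):
        x = self.front
        self.front = x.link           # remove the front node
        return x.data
```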

THE STORAGE POOL


The storage pool contains all nodes that are not currently being used. The point that
we should note in case of linked representation how a node is to be constructed. The number
and size of the fields will depend upon the kind of problem. The number of pointers will
depend upon the structural properties of the data and the operations to be performed. The
amount of space to be allocated for each field depends partly on the problem and partly on
the addressing characteristics of the machine. The next point is whether or not any nodes will
ever be returned.
If we assume that the nodes are not returned then we can allocate storage in
consecutive order. So, if the storage pool has n nodes with fields DATA and LINK, then
GETNODE could be implemented as follows:
Procedure GETNODE(I)
// I is set as a pointer to the next available node //
if AV > n then call NO_MORE_NODES
I <- AV
AV <- AV + 1
end GETNODE

The variable AV must initially be set to one and we assume it as a global variable.
We consider a general case. Normally we link together all of the available nodes in a
single list we call AV. Link fields are used to link the available nodes.



Procedure is given below:

Procedure INIT(n)
// initialize the storage pool, through the LINK field, to contain nodes
with addresses 1, 2, 3, ... n and set AV to point to the first node in this
list //
for i <- 1 to n-1 do
  LINK(i) <- i + 1
end
LINK(n) <- 0
AV <- 1
end INIT

After executing the INIT procedure, the program can use nodes. Every time a new node is
needed, a call to the GETNODE procedure is made. GETNODE examines the list AV and
returns the first node on the list. The GETNODE procedure is as follows:

Procedure GETNODE(X)
// X is set to point to a free node if there is one on AV //
if AV = 0 then call NO_MORE_NODES
X <- AV
AV <- LINK(AV)
end GETNODE

RET procedure is used to return the node to AV. AV is used as a stack since the last node
inserted into AV is the first node removed (LIFO). Procedure RET is as follows:

Procedure RET(X)
// X points to a node which is to be returned to the available space list //
LINK(X) <- AV
AV <- X
end RET
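INIT, GETNODE and RET can be sketched together in Python, with the LINK array chaining the free nodes and index 0 standing for the null pointer (the class name is ours):

```python
class StoragePool:
    """A pool of n nodes; free nodes are chained through LINK, 0 = null."""
    def __init__(self, n):
        self.data = [None] * (n + 1)       # data[1..n]
        self.link = [0] * (n + 1)
        for i in range(1, n):              # INIT: chain node i to node i+1
            self.link[i] = i + 1
        self.av = 1                        # AV points to the first free node

    def getnode(self):
        if self.av == 0:
            raise MemoryError("no more nodes")
        x = self.av
        self.av = self.link[x]             # unlink the first free node
        return x

    def ret(self, x):
        self.link[x] = self.av             # push x back onto the free list
        self.av = x

pool = StoragePool(3)
a = pool.getnode()      # node 1
b = pool.getnode()      # node 2
pool.ret(a)
print(pool.getnode())   # -> 1: AV behaves as a stack (LIFO)
```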

Most of the time, if we look at the available space pool in the middle of processing, adjacent
nodes no longer have consecutive addresses. Suppose we have a variable ptr which is a
pointer to a node which is part of a list. Then the legal operations that we can perform on a
pointer variable are:
1) Set ptr=0
2) Set ptr to point to a node.

APPLICATIONS OF LINKED LIST

POLYNOMIAL ADDITION:

Consider the polynomial

A(x) = a_m x^(e_m) + ... + a_1 x^(e_1), where the a_i are non-zero coefficients with
exponents e_i

such that e_m > e_(m-1) > ... > e_1 >= 0. Each term will be represented by a node. A node will be



of fixed size having 3 fields which represent the coefficient and exponent a term plus a
pointer to the next term.

Assuming that all coefficients are integer, the required type declarations are:

Typepolypointer = polynode;
polynode = record
Coef : integer ; {coefficient}
Exp :integer; {exponent }
Link : polypointer;
end;

Polynomials will be drawn as below :

coef exp link

For example the polynomial a = 3x^14 + 2x^8 + 1 would be stored as

a -> [3|14] -> [2|8] -> [1|0] -> nil

and the polynomial b = 8x^14 - 3x^10 + 10x^6 would be stored as

b -> [8|14] -> [-3|10] -> [10|6] -> nil

In order to add the polynomials together we examine the terms starting at the nodes
pointed by a and b. Two pointers p and q are used to move among the terms of a and b. If the
exponents of the two terms are equal, then the coefficients are added and a new term is
created for the result. If the exponent of the current term of a is less than the exponent of the
current term of b, then a duplicate of the term of b is created and attached to c. The pointer
q is advanced to the next term. Similar action is taken on a if p↑.exp > q↑.exp. The following
figure illustrates the polynomial addition.

STEP 1:

a -> [3|14] -> [2|8] -> [1|0] -> nil
b -> [8|14] -> [-3|10] -> [10|6] -> nil

p↑.exp = q↑.exp, so the coefficients 3 and 8 are added:

c -> [11|14] -> nil
procedure padd (a, b : polypointer; var c : polypointer);
{polynomials a and b represented as singly linked lists
are summed to form the new polynomial named c}
var p, q, d : polypointer; x : integer;
begin
p := a; q := b; {p, q point to the next term of a and b}
new(c); d := c; {initial node for c, disposed of later}
while (p <> nil) and (q <> nil) do
  case compare(p↑.exp, q↑.exp) of
    '=': begin
           x := p↑.coef + q↑.coef;
           if x <> 0 then attach(x, p↑.exp, d);
           p := p↑.link; q := q↑.link; {advance to the next terms}
         end;
    '<': begin
           attach(q↑.coef, q↑.exp, d);
           q := q↑.link; {next term of b}
         end;
    '>': begin
           attach(p↑.coef, p↑.exp, d);
           p := p↑.link; {next term of a}
         end;
  end; {of case}

while p <> nil do {copy the rest of a}
begin
  attach(p↑.coef, p↑.exp, d);
  p := p↑.link;
end;
while q <> nil do {copy the rest of b}
begin
  attach(q↑.coef, q↑.exp, d);
  q := q↑.link;
end;
d↑.link := nil; {last node}
{delete the extra initial node}
p := c; c := c↑.link; dispose(p);
end; {of padd}

procedure attach (c, e : integer; var d : polypointer);

{create a new node with coef = c and exp = e and attach it after the node pointed at by d}
var x : polypointer;
begin
new(x);
with x↑ do begin coef := c; exp := e; end;
d↑.link := x;
d := x;
end;
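The merging logic of this polynomial addition can be sketched in Python with each polynomial held as a list of (coef, exp) pairs in decreasing exponent order, standing in for the linked nodes (the function name is ours):

```python
def padd(a, b):
    """Sum two polynomials given as (coef, exp) lists in decreasing exp order."""
    c, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][1] == b[j][1]:          # equal exponents: add coefficients
            s = a[i][0] + b[j][0]
            if s != 0:
                c.append((s, a[i][1]))  # drop terms that cancel to zero
            i += 1; j += 1
        elif a[i][1] > b[j][1]:         # copy the term with the larger exponent
            c.append(a[i]); i += 1
        else:
            c.append(b[j]); j += 1
    return c + a[i:] + b[j:]            # copy the rest of a or of b

a = [(3, 14), (2, 8), (1, 0)]           # 3x^14 + 2x^8 + 1
b = [(8, 14), (-3, 10), (10, 6)]        # 8x^14 - 3x^10 + 10x^6
print(padd(a, b))   # -> [(11, 14), (-3, 10), (2, 8), (10, 6), (1, 0)]
```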
A list in which the last node points back to the first is called a circular list. A singly
linked list in which the last node has a nil link is called a chain.

[Figure: a circular list; the link field of the last node points back to the first node.]

Procedure to free all the nodes in the chain t

procedure erase (var t : polypointer);

var x : polypointer;
begin
while t <> nil do
begin
x := t↑.link;
dispose(t);
t := x;
end;
end;
The reason we dispose of nodes that are no longer in use is so that these nodes may be
reused later.

Procedure to erase the nodes in a circular linked list.

procedure cerase (var t : polypointer);
{erase the circular list t; av points to the front of the available-space list}
var x : polypointer;
begin
if t <> nil
then begin
x := t↑.link; {second node}
t↑.link := av; {first node linked to av}
av := x; {second node of t becomes front of av list}
end;
t := nil;
end; {of cerase}

Procedure to invert a linked list.


procedure invert (var x : pointer);
{the chain pointed at by x is inverted, so that if x = (a1, …, an), then after execution
x = (an, …, a1)}
var p, q, r : pointer;
begin
p := x; q := nil; {q trails p}
while p <> nil do
begin
r := q; q := p; {r trails q}
p := p↑.link; {p moves to the next node}
q↑.link := r; {link q to the preceding node}
end;
x := q;
end; {of invert}

For a list of m >=1 nodes, the while loop is executed m times and so the computing time
is linear or O(m).
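The same pointer manipulation can be sketched in Python, assuming a minimal Node class (the class and helper names are illustrative, not from the text):

```python
class Node:
    def __init__(self, data, link=None):
        self.data = data   # element stored in the node
        self.link = link   # pointer to the next node

def invert(x):
    """Invert the chain headed by x; returns the new head."""
    p, q = x, None          # q trails p
    while p is not None:
        r = q               # r trails q
        q = p
        p = p.link          # p moves to the next node
        q.link = r          # link q to the preceding node
    return q

def to_list(head):
    """Collect the chain's elements for inspection."""
    out = []
    while head is not None:
        out.append(head.data)
        head = head.link
    return out

# build the chain (1, 2, 3) and invert it in one linear pass
head = Node(1, Node(2, Node(3)))
head = invert(head)
```

As in the Pascal version, each node is visited exactly once, so the running time stays O(m).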

Procedure for concatenating two chains x and y.

procedure concatenate (x, y : pointer; var z : pointer);


{x = (a1, …, am) and y = (b1, …, bn), m, n >= 0;
produces the new chain z = (a1, …, am, b1, …, bn)}
var p : pointer;
begin
if x = nil
then z := y
else begin
z := x;
if y <> nil
then begin {find the last node in x}
p := x;
while p↑.link <> nil do p := p↑.link;
p↑.link := y; {link the last node of x to the first of y}
end;
end;
end; {of concatenate}

Procedure to insert a node at the front of the linked list.

procedure insertfront (var a : pointer; x : pointer);

{insert the node pointed at by x at the front of the circular list a, where a points to
the last node in the list}
begin
if a = nil
then begin {empty list}
a := x;
x↑.link := x;
end
else begin
x↑.link := a↑.link;
a↑.link := x;
end;
end; {of insertfront}

To insert x at the rear, we need to add the additional statement a := x to the else part of the
insertfront

Function to find the length of a circular linked list

function length (a : pointer) : integer;


{find the length of the circular linked list a}
var x : pointer;
begin
length := 0;
if a <> nil
then begin
x := a;
repeat
length := length + 1;
x := x↑.link;
until x = a;
end;
end; {of length}
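A Python sketch of the same counting loop, using a do-while-style loop to mirror Pascal's repeat … until (the class name is illustrative, not from the text):

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.link = self   # a single node initially points to itself

def length(a):
    """Length of the circular list; a may point to any node (here, the last)."""
    if a is None:
        return 0
    n, x = 0, a
    while True:            # Pascal's repeat ... until x = a
        n += 1
        x = x.link
        if x is a:
            break
    return n

# circular list with three nodes: a -> b -> c -> a
a, b, c = Node(1), Node(2), Node(3)
a.link, b.link, c.link = b, c, a
```

The test `x = a` must come after the increment, so that a one-node list is counted correctly.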

Sparse matrices

In a matrix, if many of the entries are zero then we call it a sparse matrix. For
sparse matrices, space and computing time are saved if only the non-zero entries are
retained explicitly. In the sequential scheme each non-zero element is represented by a
node of three fields: row, column, and value. As matrix operations such as addition,
subtraction and multiplication are performed, the number of non-zero terms in the matrices
will vary, so a linked scheme facilitates efficient representation of varying-size
structures. In this data representation each column of a sparse matrix is represented by a
circularly linked list with a head node; in addition, each row is a circularly linked list
with a head node.

Each node has five fields: left, up, v, r and c. The left field points to the node
with the next smallest column subscript, and up points to the node with the next smallest
row subscript. V represents the value, r the row and c the column. The node structure is
also referred to as MATRIX_ELEMENT.



node structure

down row column right

value

Algorithm to construct sparse matrix is given below.

Algorithm Construct_Matrix. (In this algorithm M and N represent the number of rows and
columns respectively. The arrays AROW and ACOL contain pointers to the head nodes of the
circular lists; P and Q are auxiliary pointers. The row index, column index and value are
read into the variables ROW, COLUMN and VALUE respectively.)

1. Repeat for I = 1, 2, …, M
   Repeat for J = 1, 2, …, N

2. Repeat thru step 7 until the input records are exhausted

3. Read (ROW, COLUMN, VALUE)

4.

5. Repeat while C(P) < C(LEFT(Q))

6. Repeat while R(P) < R(UP(Q))

7. Exit
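Before the linked version, the basic idea of keeping only non-zero entries can be sketched in Python with the simpler (row, column, value) triple scheme described above. The function names are illustrative, not from the text:

```python
def to_triples(matrix):
    """Keep only the non-zero entries as (row, column, value) triples."""
    return [(i, j, v) for i, row in enumerate(matrix)
                      for j, v in enumerate(row) if v != 0]

def from_triples(triples, rows, cols):
    """Rebuild the dense matrix from its sparse triples."""
    m = [[0] * cols for _ in range(rows)]
    for i, j, v in triples:
        m[i][j] = v
    return m

dense = [[0, 0, 5],
         [0, 0, 0],
         [7, 0, 0]]
triples = to_triples(dense)
```

For this 3×3 matrix only two triples are stored instead of nine entries; the linked scheme in the algorithm above additionally threads each triple into its row and column lists so that the structure can grow and shrink during arithmetic.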

DOUBLY LINKED LIST



Llink data Rlink

A doubly-linked list is a linked data structure that consists of a set of sequentially


linked records called nodes. Each node contains two fields, called links, that are references to
the previous and to the next node in the sequence of nodes. The beginning and ending nodes'
previous and next links, respectively, point to some kind of terminator, typically a sentinel
node or null, to facilitate traversal of the list. If there is only one sentinel node, then the list is
circularly linked via the sentinel node. It can be conceptualized as two singly linked lists
formed from the same data items, but in opposite sequential orders.

A doubly-linked list whose nodes contain three fields: an integer value, the link to the next
node, and the link to the previous node.
The two node links allow traversal of the list in either direction. While adding or removing a
node in a doubly-linked list requires changing more links than the same operations on a
singly linked list, the operations are simpler and potentially more efficient (for nodes other
than the first node) because there is no need to keep track of the previous node during
traversal, nor to traverse the list to find the previous node so that its link can be modified.
Nomenclature and implementation
The first and last nodes of a doubly-linked list are immediately accessible (i.e., accessible
without traversal, and usually called head and tail) and therefore allow traversal of the list
from the beginning or end of the list, respectively: e.g., traversing the list from beginning to
end, or from end to beginning, in a search of the list for a node with specific data value. Any
node of a doubly-linked list, once obtained, can be used to begin a new traversal of the list, in
either direction (towards beginning or end), from the given node.
The link fields of a doubly-linked list node are often called next and previous or forward and
backward. The references stored in the link fields are usually implemented as pointers, but (as
in any linked data structure) they may also be address offsets or indices into an array where
the nodes live.
Basic algorithms
Open doubly-linked lists
Data type declarations
record DoublyLinkedNode {
    prev   // A reference to the previous node
    next   // A reference to the next node
    data   // Data or a reference to data
}
record DoublyLinkedList {
    DoublyLinkedNode firstNode   // points to first node of list
    DoublyLinkedNode lastNode    // points to last node of list
}



Traversing the list
Traversal of a doubly-linked list can be in either direction. In fact, the direction of traversal
can change many times, if desired. Traversal is often called iteration, but that choice of
terminology is unfortunate, for iteration has well-defined semantics (e.g., in mathematics)
which are not analogous to traversal.
Forwards
node := list.firstNode
while node ≠ null
<do something with node.data>
node := node.next
Backwards
node := list.lastNode
while node ≠ null
<do something with node.data>
node := node.prev
Inserting a node
These symmetric functions insert a node either after or before a given node, with the diagram
demonstrating after:
function insertAfter(List list, Node node, Node newNode)
newNode.prev := node
newNode.next := node.next
if node.next == null
list.lastNode := newNode
else
node.next.prev := newNode
node.next := newNode
function insertBefore(List list, Node node, Node newNode)
newNode.prev := node.prev
newNode.next := node
if node.prev == null
list.firstNode := newNode
else
node.prev.next := newNode
node.prev := newNode
We also need a function to insert a node at the beginning of a possibly empty list:
function insertBeginning(List list, Node newNode)
if list.firstNode == null
list.firstNode := newNode
list.lastNode := newNode
newNode.prev := null
newNode.next := null
else
insertBefore(list, list.firstNode, newNode)
A symmetric function inserts at the end:
function insertEnd(List list, Node newNode)
if list.lastNode == null
insertBeginning(list, newNode)
else



insertAfter(list, list.lastNode, newNode)
Removing a node
Removal of a node is easier than insertion, but requires special handling if the node to be
removed is the firstNode or lastNode:
function remove(List list, Node node)
if node.prev == null
list.firstNode := node.next
else
node.prev.next := node.next
if node.next == null
list.lastNode := node.prev
else
node.next.prev := node.prev
destroy node
One subtle consequence of the above procedure is that deleting the last node of a list sets both
firstNode and lastNode to null, and so it handles removing the last node from a one-element
list correctly. Notice that we also don't need separate "removeBefore" or "removeAfter"
methods, because in a doubly-linked list we can just use "remove(node.prev)" or
"remove(node.next)" where these are valid. This also assumes that the node being removed is
guaranteed to exist. If the node does not exist in this list, then some error handling would be
required.
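The open doubly-linked list operations above can be gathered into a small runnable Python sketch (class and method names are ours, chosen to mirror the pseudocode):

```python
class DNode:
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None

class DList:
    def __init__(self):
        self.first = None   # head of the list
        self.last = None    # tail of the list

    def insert_beginning(self, node):
        if self.first is None:          # empty list
            self.first = self.last = node
            node.prev = node.next = None
        else:                           # insertBefore(first, node)
            node.prev, node.next = None, self.first
            self.first.prev = node
            self.first = node

    def remove(self, node):
        if node.prev is None:           # node was the first
            self.first = node.next
        else:
            node.prev.next = node.next
        if node.next is None:           # node was the last
            self.last = node.prev
        else:
            node.next.prev = node.prev

    def forwards(self):
        out, n = [], self.first
        while n is not None:
            out.append(n.data)
            n = n.next
        return out

lst = DList()
for v in (3, 2, 1):
    lst.insert_beginning(DNode(v))
lst.remove(lst.first.next)   # remove the middle node (2)
```

Note how `remove` needs no traversal: the node's own `prev` and `next` fields supply both neighbours directly.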
Circular doubly-linked lists
Traversing the list
Assuming that someNode is some node in a non-empty list, this code traverses through that
list starting with someNode (any node will do):
Forwards
node := someNode
do
do something with node.value
node := node.next
while node ≠ someNode
Backwards
node := someNode
do
do something with node.value
node := node.prev
while node ≠ someNode
Notice the postponing of the test to the end of the loop. This is important for the case where
the list contains only the single node someNode.
Inserting a node
This simple function inserts a node into a doubly-linked circularly linked list after a given
element:
function insertAfter(Node node, Node newNode)
newNode.next := node.next
newNode.prev := node
node.next.prev := newNode



node.next := newNode
To do an "insertBefore", we can simply "insertAfter(node.prev, newNode)". Inserting an
element in a possibly empty list requires a special function:
function insertEnd(List list, Node node)
if list.lastNode == null
node.prev := node
node.next := node
else
insertAfter(list.lastNode, node)
list.lastNode := node
To insert at the beginning we simply "insertAfter(list.lastNode, node)". Finally, removing a
node must deal with the case where the list empties:
function remove(List list, Node node)
if node.next == node
list.lastNode := null
else
node.next.prev := node.prev
node.prev.next := node.next
if node == list.lastNode
list.lastNode := node.prev;
destroy node

Advanced concepts
Asymmetric doubly-linked list
An asymmetric doubly-linked list is somewhere between the singly-linked list and the
regular doubly-linked list. It shares some features with the singly linked list (single-direction
traversal) and others from the doubly-linked list (ease of modification)
It is a list where each node's previous link points not to the previous node, but to the link to
itself. While this makes little difference between nodes (it just points to an offset within the
previous node), it changes the head of the list: it allows the first node to modify the
firstNode link easily.
As long as a node is in a list, its previous link is never null.
Inserting a node
To insert a node before another, we change the link that pointed to the old node, using the
prev link; then set the new node's next link to point to the old node, and change that node's
prev link accordingly.
function insertBefore(Node node, Node newNode)
if node.prev == null
error "The node is not in a list"
newNode.prev := node.prev
atAddress(newNode.prev) := newNode
newNode.next := node
node.prev = addressOf(newNode.next)
function insertAfter(Node node, Node newNode)
newNode.next := node.next
if newNode.next != null



newNode.next.prev = addressOf(newNode.next)
node.next := newNode
newNode.prev := addressOf(node.next)
Deleting a node
To remove a node, we simply modify the link pointed by prev, regardless of whether the node
was the first one of the list.
function remove(Node node)
atAddress(node.prev) := node.next
if node.next != null
node.next.prev = node.prev
destroy node

Points to remember
1. Linked lists are of two types: singly linked lists and doubly linked lists.
2. The storage pool supports the GETNODE and RET operations.
3. Garbage collection reclaims unused or free memory space.
4. Each polynomial term has a coefficient and an exponent.

********

Unit – III
Trees: Basic Terminology - Binary Trees - Binary Tree Representations – Binary Trees
-Traversal - More on Binary Trees - Threaded Binary Trees - Binary Tree
Representation of Trees - Counting Binary Trees. Graphs: Terminology and
Representations - Traversals, Connected Components and Spanning Trees - Shortest
Paths and Transitive Closure

INTRODUCTION

Tree Terminology

 Parent: The predecessor of a node.

 Child: Any successor of a node.

 Siblings: Any pair of nodes that have the same parent.

 Ancestor: The predecessor of a node together with all the ancestors of the
predecessor of a node. The root node has no ancestors.

 Descendant: The children of a node together with all the descendants of the children
of a node. A leaf node has no descendants.

 Subtree: A node together with its descendants.

 The unique node with no predecessor is called the root of the tree. This is the start of
the data structure. A node with no successors is called a leaf. There will usually be
many leaves in a tree. The successors of a node are called its children; the unique
predecessor of a node is called its parent. If two nodes have the same parent, they are
called brothers or siblings. In a binary tree the two children are defined by the
position occupied, which will be either the left or right.

Tree Height

The level of a node is one greater than the level of its parent. The level of the root node is 1.
The height of a tree is the maximum level of any node in the tree. The following figure shows
how we calculate the height of a binary tree:

Looking at this example we can see:

 The level of node 1 is 1. This is the root node.

 The level of nodes 2 and 6 is 2.

 The level of nodes 3, 7 and 8 is 3.



 The level of nodes 4, 5, 9 and 10 is 4.

 The height of this tree is 4, which is the number of levels.
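The height calculation illustrated above can be written as a short recursive Python function (the class and field names are illustrative):

```python
class TNode:
    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right

def height(node):
    """Height = maximum level of any node; an empty tree has height 0."""
    if node is None:
        return 0
    return 1 + max(height(node.left), height(node.right))

# a small tree: the left branch goes down 4 levels, the right only 2
tree = TNode(TNode(TNode(TNode())), TNode())
```

Each node contributes one level, so the recursion returns one more than the taller of its two subtrees.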

Paths within a Tree

The path is any linear subset of a tree, eg: 1, 6, 7 and 1, 2, 3, 5 are paths. The length of a path
can be counted as either the number of nodes in the path or the number of lines in a given
path, we will count the nodes; eg 1, 6, 7 has a length of 3. However, remember that there is
no agreed definition.

There is always a unique path from the root to any node. Simple as this property seems, it is
extremely important. Our algorithms for processing trees will be based upon this property.
The depth or level of a node is the length of this path. When drawing a tree, it is useful if all
the nodes in the same level are drawn as a neat horizontal row.

Structure of Trees

Nodes in a tree exhibit a


hierarchical relationship. There is a unique first node, called the root node, which has no
predecessors and can have many successors. There can be many nodes that have no
successors. These nodes are called leaf nodes. Each leaf node has a unique predecessor. Each
node that is neither a root node nor a leaf node has a unique predecessor and at least one
successor.



In the figure shown, the root is at the top; below are its children. A line connects a node to
each of its children: we sometimes draw arrowheads on the line, but they are optional because
the direction parent → child is always the same as top → bottom.

In general, each child of a node is the root of a tree 'within the big tree'. For example, 7 is the
root of a little tree (8, 6 and 9). These inner trees are called subtrees. The subtrees of a node
are the trees whose roots are the children of the node, eg the subtrees of 2 are the subtrees
whose roots are 3 and 4. In a binary tree we refer to the left subtree and the right subtree.

Structural Features

The following can be said about the figure shown above:

 The parent of node 5 is node 3.

 The parent of node 3 is node 2.

 The children of node 2 are nodes 3 and 4.

 Node 5 has no children.

 Nodes 6 and 9 are siblings.

 Nodes 3 and 4 are siblings.

Structural Definition of Binary Trees

A binary tree is either empty or it consists of three parts:

 A node value, usually called an element, containing the data we are storing in the tree.

 A left subtree - a link / pointer variable.

 A right subtree - a link / pointer variable.

An example of a diagrammatic representation is alongside.



The next figure shows the typical node structure for our binary trees. If, for example, there
were three subtrees associated with each node then we would have three pointer variables:

With linked lists we had an alternative to recursion. We could scan through a list easily with
normal loops (while, do, repeat etc) as well as with recursion. This is not so easy to do for
trees. It is possible to scan through a tree non-recursively, but it is not nearly as easy as a
recursive scan.

The Binary Tree

A binary tree is either empty or it is a root node together with two subtrees, the left subtree
and the right subtree, each of which is a binary tree. A binary tree is full if every level
contains as many nodes as possible. A binary tree is complete if each level, except possibly
the last, contains as many nodes as possible, and the nodes on the last level are as far to
the left as possible. Binary trees have the following three
distinctive characteristics:

1. Each node has at most two children (subtrees).

2. Each subtree is identified as either the left-subtree or the right-subtree of the parent.

3. A binary tree may be empty.



The example on the right shows a binary tree diagram that is neither complete nor full. The
following example shows a complete binary tree. This is a balanced tree with each level
node being full apart from the final level, which has no subtrees:

binary tree representation of trees

(data structure)

Definition: A way to represent a multiway tree as a binary tree. The leftmost child, c, of
a node, n, in the multiway tree is the left child, c', of the corresponding node, n', in the
binary tree. The immediately right sibling of c is the right child of c'.

Formal Definition: A multiway tree T can be represented by a corresponding binary tree


B. Let {n1,..., nk} be nodes of the multiway tree, T. Let {n'1,..., n'k} be nodes of the
corresponding binary tree B. Node nk corresponds to n'k. In particular, nodes nk and n'k
have the same labels and if nk is the root of T, n'k is the root of B. Connections
correspond as follows:

 If nl is the leftmost child of nk, n'l is the left child of n'k. (If nk has no children, n'k
has no left child.)

 If ns is the next (immediately right) sibling of nk, n's is the right child of n'k.

Also known as first child-next sibling binary tree, doubly-chained tree, filial-heir chain.

The binary tree representation of a multiway tree or k-ary tree is based on the first child-
next sibling representation of the tree. In this representation every node is linked with its
leftmost child and its next (right nearest) sibling.

Let us see one example. Consider the following multiway tree



1
/|\
/|\
/ | \
2 3 4
/\ |
5 6 7
/\
8 9

This tree can be represented in first child-next sibling manner as follows

1
/
/
/
2---3---4
/ /
5---6 7
/
8---9

Now, if we look at the first child-next sibling representation of the tree closely, we will see
that it forms a binary tree. To see this better, we can rotate every next-sibling edge 45
degrees clockwise. After that we get the following binary tree:

        1
       /
      2
     / \
    5   3
     \   \
      6   4
         /
        7
       /
      8
       \
        9

This is binary tree representation of the given (multiway) tree.
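The conversion from child lists to the first child-next sibling form can be sketched in Python. Here the multiway tree is given as a dictionary mapping each node to the list of its children (the function name and data layout are ours, not from the text):

```python
def to_fcns(children):
    """Return (left, right) maps for the first child-next sibling form:
    left[n]  = leftmost child of n in the multiway tree,
    right[n] = next (immediately right) sibling of n."""
    left, right = {}, {}
    for node, kids in children.items():
        if kids:
            left[node] = kids[0]          # leftmost child becomes left child
        for a, b in zip(kids, kids[1:]):  # link each child to its next sibling
            right[a] = b
    return left, right

# the multiway tree from the example above
tree = {1: [2, 3, 4], 2: [5, 6], 3: [], 4: [7],
        5: [], 6: [], 7: [8, 9], 8: [], 9: []}
left, right = to_fcns(tree)
```

The resulting maps reproduce the binary tree drawn above: 2 is the left child of 1, 3 is the right child of 2, 4 the right child of 3, and so on.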

Binary Trees

A binary tree is an important type of structure which occurs very often. It is characterized by
the fact that any node can have at most two branches, i.e.,there is no node with degree greater
than two. For binary trees we distinguish between the subtree on the left and on the right,
whereas for trees the order of the subtreewas irrelevant. Also a binary tree may have zero
nodes. Thus a binary tree is really a different object than a tree.

Definition: A binary tree is a finite set of nodes which is either empty or consists of a root
and two disjoint binary trees called the left subtree and the right subtree.

We can define the data structure binary tree as follows:

structure BTREE
declare CREATE( ) --> btree
ISMTBT(btree,item,btree) --> boolean
MAKEBT(btree,item,btree) --> btree
LCHILD(btree) --> btree
DATA(btree) --> item
RCHILD(btree) --> btree
for all p,r in btree, d in item let
ISMTBT(CREATE)::=true
ISMTBT(MAKEBT(p,d,r))::=false
LCHILD(MAKEBT(p,d,r))::=p; LCHILD(CREATE)::=error
DATA(MAKEBT(p,d,r))::d; DATA(CREATE)::=error
RCHILD(MAKEBT(p,d,r))::=r; RCHILD(CREATE)::=error
end
end BTREE

This set of axioms defines only a minimal set of operations on binary trees. Other operations
can usually be built in terms of these. The distinctions between a binary tree and a tree should



be analyzed. First of all there is no tree having zero nodes, but there is an empty binary tree.
The two binary trees below are different. The first one has an empty right subtree while the
second has an empty left subtree. If these are regarded as trees, then they are the same despite
the fact that they are drawn slightly differently.

Binary Tree Representations

A full binary tree of depth k is a binary tree of depth k having pow(2,k)-1 nodes. This
is the maximum number of the nodes such a binary tree can have. A very elegant sequential
representation for such binary trees results from sequentially numbering the nodes, starting
with nodes on level 1, then those on level 2 and so on. Nodes on any level are numbered from
left to right. This numbering scheme gives us the definition of a complete binary tree. A
binary tree with n nodes and a depth k is complete iff its nodes correspond to the nodes which
are numbered one to n in the full binary tree of depth k. The nodes may now be stored in a
one dimensional array tree, with the node numbered i being stored in tree[i].

Lemma 5.3: If a complete binary tree with n nodes (i.e., depth = ⌊log2 n⌋ + 1) is represented
sequentially as above, then for any node with index i, 1 <= i <= n:
(i) parent(i) is at ⌊i/2⌋ if i <> 1. When i = 1, i is the root and has no parent.
(ii) lchild(i) is at 2i if 2i <= n. If 2i > n, then i has no left child.
(iii) rchild(i) is at 2i+1 if 2i+1 <= n. If 2i+1 > n, then i has no right child.

Proof: We prove (ii). (iii) is an immediate consequence of (ii) and the numbering of
nodes on the same level from left to right, and (i) follows from (ii) and (iii). We prove (ii) by
induction on i. For i = 1, clearly the left child is at 2 unless 2 > n, in which case 1 has no left
child. Now assume that for all j, 1 <= j <= i, lchild(j) is at 2j. Then the two nodes immediately
preceding lchild(i+1) in the representation are the right and left children of i, so lchild(i+1)
is at 2i + 2 = 2(i+1) unless 2(i+1) > n, in which case i+1 has no left child. This representation
can clearly be used for all binary trees, though in most cases there will be a lot of unutilized
space. For complete binary trees the representation is ideal, as no space is wasted. In the worst
case a skewed tree of depth k will require pow(2,k)-1 spaces. Of these only k will be occupied.
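The index arithmetic of Lemma 5.3 can be checked with a short Python sketch using 1-based indices (the helper names are ours, not from the text):

```python
def parent(i):
    """Parent index; valid for i > 1 (node 1 is the root)."""
    return i // 2

def lchild(i, n):
    """Index of the left child, or None when 2i > n."""
    return 2 * i if 2 * i <= n else None

def rchild(i, n):
    """Index of the right child, or None when 2i + 1 > n."""
    return 2 * i + 1 if 2 * i + 1 <= n else None

n = 10  # a complete binary tree with 10 nodes, stored in tree[1..10]
```

For example, node 5's left child sits at index 10, the last slot, so node 5 has no right child and nodes 6 through 10 are leaves.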

While the above representation appears to be good for complete binary trees it is
wasteful for many other binary trees. In addition, the representation suffers from the general



inadequacies of sequential representations. Insertion or deletion of nodes from the middle of a
tree requires the movement of potentially many nodes to reflect the change in level number of
these nodes. These problems can be easily overcome through the use of a linked
representation. Each node will have three fields leftchild, data and rightchild and is defined in
Pascal as

type treepointer = ^treerecord;


treerecord = record
leftchild : treepointer;
data : char;
rightchild : treepointer;
end;

Binary Tree Traversal

There are many operations that we often want to perform on trees. One notion that
arises frequently is the idea of traversing a tree, or visiting each node in the tree exactly once.
A full traversal produces a linear order for the information in a tree. This linear order may be
familiar and useful. When traversing a binary tree we want to treat each node and its subtrees in
the same fashion. If we let L, D, R stand for moving left, printing the data, and moving right
when at a node, then there are six possible combinations of traversal: LDR, LRD, DLR, DRL,
RDL, and RLD. If we adopt the convention that we traverse left before right then only three
traversals remain: LDR, LRD, and DLR. To these we assign the names inorder, postorder and
preorder because there is a natural correspondence between these traversals and producing
the infix, postfix and prefix forms of an expression. Inorder traversal: informally, this calls
for moving down the tree towards the left until you can go no farther. Then you "visit" the
node, move one node to the right and continue again. If you cannot move to the right, go back
one more node. A precise way of describing this traversal is to write it as a recursive
procedure.
procedure inorder(currentnode : treepointer);
{currentnode is a pointer to a node in a binary tree. For full
tree traversal, pass inorder the pointer to the top of the tree}
begin {inorder}
if currentnode <> nil
then
begin
inorder(currentnode^.leftchild);
write(currentnode^.data);
inorder(currentnode^.rightchild);
end
end; {of inorder}

Recursion is an elegant device for describing this traversal. A second form of traversal is
preorder:
procedure preorder(currentnode : treepointer);
{currentnode is a pointer to a node in a binary tree. For full
tree traversal, pass preorder the pointer to the top of the tree}
begin {preorder}
if currentnode <> nil
then
begin
write(currentnode^.data);
preorder(currentnode^.leftchild);
preorder(currentnode^.rightchild);
end {of if}
end; {of preorder}

In words we would say "visit a node, traverse left and continue again. When you cannot
continue, move right and begin again, or move back until you can move right and resume." At
this point it should be easy to guess the next traversal method, which is called postorder:

procedure postorder(currentnode : treepointer);
{currentnode is a pointer to a node in a binary tree. For full
tree traversal, pass postorder the pointer to the top of the tree}
begin {postorder}
if currentnode <> nil
then
begin
postorder(currentnode^.leftchild);
postorder(currentnode^.rightchild);
write(currentnode^.data);
end {of if}
end; {of postorder}
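The three traversals, and their correspondence to infix, prefix and postfix forms, can be demonstrated with a small Python sketch built around an expression tree for A + B * C (the class and helper names are ours):

```python
class BNode:
    def __init__(self, data, left=None, right=None):
        self.data, self.left, self.right = data, left, right

def inorder(node, out):          # L D R
    if node:
        inorder(node.left, out)
        out.append(node.data)
        inorder(node.right, out)

def preorder(node, out):         # D L R
    if node:
        out.append(node.data)
        preorder(node.left, out)
        preorder(node.right, out)

def postorder(node, out):        # L R D
    if node:
        postorder(node.left, out)
        postorder(node.right, out)
        out.append(node.data)

# expression tree for A + B * C:  '+' at the root, '*' as its right subtree
expr = BNode('+', BNode('A'), BNode('*', BNode('B'), BNode('C')))

def visit(traversal, root):
    out = []
    traversal(root, out)
    return out
```

Inorder yields the infix form A + B * C, preorder the prefix form + A * B C, and postorder the postfix form A B C * +.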

Threaded of binary tree

1. A Threaded Binary Tree is a binary tree in which every node that does not have
a right child has a THREAD (in actual sense, a link) to its INORDER successor.
By doing this threading we avoid the recursive method of traversing a tree,
which makes use of a stack and consumes a lot of memory and time.
The node structure for a threaded binary tree varies a bit and looks like this:

Code:

struct NODE
{
    struct NODE *leftchild;
    int node_value;
    struct NODE *rightchild;
    struct NODE *thread;
};

Advantages
1. By doing threading we avoid the recursive method of traversing a tree, which makes
use of a stack and consumes a lot of memory and time.
2. A node can keep a record of its root.
Disadvantages
1. It makes the tree more complex.
2. It is more prone to errors when both children are not present and both link fields
of a node point to its ancestors.

Graphs - Terminology and Representation

Overview

 Definitions: Vertices, edges, paths, etc

 Representations: Adjacency list and adjacency matrix

Definitions: Graph, Vertices, Edges



 Define a graph G = (V, E) by defining a pair of sets:

1. V = a set of vertices

2. E = a set of edges

 Edges:

1. Each edge is defined by a pair of vertices

2. An edge connects the vertices that define it

3. In some cases, the vertices can be the same

 Vertices:

1. Vertices also called nodes

2. Denote vertices with labels

 Representation:

1. Represent vertices with circles, perhaps containing a label

2. Represent edges with lines between circles

 Example:

1. V = {A,B,C,D}

2. E = {(A,B),(A,C),(A,D),(B,D),(C,D)}

Motivation

 Many algorithms use a graph representation to represent data or the problem to be


solved

 Examples:

o Cities with distances between

o Roads with distances between intersection points



o Course prerequisites

o Network

o Social networks

o Program call graph and variable dependency graph

Graph Classifications

 There are several common kinds of graphs

o Weighted or unweighted

o Directed or undirected

o Cyclic or acyclic

 Choose the kind required for problem and determined by data

 We examine each below

Kinds of Graphs: Weighted and Unweighted

 Graphs can be classified by whether or not their edges have weights

 Weighted graph: edges have a weight

o Weight typically shows cost of traversing

o Example: weights are distances between cities

 Unweighted graph: edges have no weight

o Edges simply show connections

o Example: course prereqs

Kinds of Graphs: Directed and Undirected

 Graphs can be classified by whether or not their edges have direction

o Undirected Graphs: each edge can be traversed in either direction

o Directed Graphs: each edge can be traversed only in a specified direction



Undirected Graphs

 Undirected Graph: no implied direction on edge between nodes

o The example from above is an undirected graph

o In diagrams, edges have no direction (ie they are not arrows)

o Can traverse edges in either directions

 In an undirected graph, an edge is an unordered pair

o Actually, an edge is a set of 2 nodes, but for simplicity we write it with parens

 For example, we write (A, B) instead of {A, B}

 Thus, (A,B) = (B,A), etc

 If (A,B) ∈ E then (B,A) ∈ E

o Formally: ∀ u,v ∈ V, (u,v) = (v,u) and u ≠ v

 A node normally does not have an edge to itself

Directed Graphs

 Digraph: A graph whose edges are directed (ie have a direction)

o Edge drawn as arrow

o Edge can only be traversed in direction of arrow

o Example: E = {(A,B), (A,C), (A,D), (B,C), (D,C)}

o Examples: courses and prerequisites, program call graph

 In a digraph, an edge is an ordered pair

o Thus: (u,v) and (v,u) are not the same edge



o In the example, (D,C) ∈ E, (C,D) ∉ E

o What would edge (B,A) look like? Remember (A,B) ≠ (B,A)

 A node can have an edge to itself (eg (A,A) is valid)

Subgraph

 If graph G=(V, E)

o Then graph G' = (V', E') is a subgraph of G if V' ⊆ V and E' ⊆ E, where every edge in E' connects vertices in V'

 Example ...

Degree of a Node

 The degree of a node is the number of edges the node is used to define

 In the example above:

o Degree 2: B and C

o Degree 3: A and D

o A and D have odd degree, and B and C have even degree

 Can also define in-degree and out-degree

o In-degree: Number of edges pointing to a node

o Out-degree: Number of edges pointing from a node

o What are the in- and out-degrees of the example?

Graphs: Terminology Involving Paths

 Path: sequence of vertices in which each pair of successive vertices is connected by


an edge

 Cycle: a path that starts and ends on the same vertex

 Simple path: a path that does not cross itself

o That is, no vertex is repeated (except first and last)

o Simple paths cannot contain cycles

 Length of a path: Number of edges in the path



o Sometimes the sum of the weights of the edges

 Examples

o A sequence of vertices: (A, B, C, D) [Is this path, simple path, cycle?]

o (A, B, D, A, C) [path, simple path, cycle?]


o Cycle: ?

o Simple Cycle: ?

o Lengths?

Cyclic and Acyclic Graphs

 A Cyclic graph contains cycles

o Example: roads (normally)

 An acyclic graph contains no cycles

o Example: Course prereqs!

 Examples - Are these cyclic or acyclic?



Connected and Unconnected Graphs and Connected Components

 An undirected graph is connected if every pair of vertices has a path between them

o Otherwise it is unconnected

o Give an example of a connected graph

 An unconnected graph can be broken into connected components

 A directed graph is strongly connected if every pair of vertices has a path between
them, in both directions

Trees and Minimum Spanning Trees

 Tree: undirected, connected graph with no cycles

o Example ...

o If G=(V, E) is a tree, how many edges in G?

 Spanning tree: a spanning tree of G is a connected subgraph of G that is a tree

o Example ...

 Minimum spanning tree (MST): a spanning tree with minimum weight

o Example ...

 Spanning trees and minimum spanning tree are not necessarily unique

 We will look at two famous MST algorithms: Prim's and Kruskal's

Data Structures for Representing Graphs

 Two common data structures for representing graphs:

o Adjacency lists

o Adjacency matrix

Adjacency List Representation

 Each node has a list of adjacent nodes

 Example (undirected graph):

o A: B, C, D

o B: A, D



o C: A, D

o D: A, B, C

 Example (directed graph):

o A: B, C, D

o B: D

o C: Nil

o D: C

 Weighted graph can store weights in list

 Space: Θ(V + E) (ie |V| + |E|)

 Time:

o To visit each node that is adjacent to node u: Θ(degree(u))

o To determine if node u is adjacent to node v: Θ(degree(u))
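As a concrete sketch (vertex names taken from the undirected example above), the adjacency list can be written as a Python dictionary mapping each vertex to the list of its neighbours:

```python
# Adjacency-list representation of the undirected example above:
# each vertex maps to the list of its neighbours.
adj = {
    'A': ['B', 'C', 'D'],
    'B': ['A', 'D'],
    'C': ['A', 'D'],
    'D': ['A', 'B', 'C'],
}

def degree(graph, u):
    # Degree of u is just the length of its neighbour list.
    return len(graph[u])

def is_adjacent(graph, u, v):
    # Scans u's whole list in the worst case: Theta(degree(u)).
    return v in graph[u]

print(degree(adj, 'A'))            # 3
print(is_adjacent(adj, 'B', 'C'))  # False
```

Note how the adjacency test has to scan u's list, matching the Θ(degree(u)) cost given above.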

Adjacency Matrix Representation

 Adjacency Matrix: 2D array containing weights on edges

o Row for each vertex

o Column for each vertex

o Entries contain weight of edge from row vertex to column vertex

o Entries contain ∞ (ie Integer'last) if no edge from row vertex to column vertex

o Entries contain 0 on diagonal (if self edges not allowed)

 Example undirected graph (assume self-edges not allowed):



A B C D

A 0 1 1 1

B 1 0 ∞ 1

C 1 ∞ 0 1

D 1 1 1 0

 Example directed graph (assume self-edges allowed):

A B C D

A ∞ 1 1 1

B ∞ ∞ ∞ 1

C ∞ ∞ ∞ ∞

D ∞ ∞ 1 ∞

 Can store weights in cells

 Space: Θ(V²)

 Time:

o To visit each node that is adjacent to node u: Θ(V)

o To determine if node u is adjacent to node v: Θ(1)
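A minimal Python sketch of the same undirected example as a matrix, with float('inf') standing in for ∞:

```python
INF = float('inf')  # plays the role of the "no edge" marker above

# Rows/columns in the order A, B, C, D (the undirected example;
# self-edges not allowed, so the diagonal is 0).
V = ['A', 'B', 'C', 'D']
M = [
    [0,   1,   1,   1],   # A
    [1,   0,   INF, 1],   # B
    [1,   INF, 0,   1],   # C
    [1,   1,   1,   0],   # D
]

def is_adjacent(i, j):
    # A single array lookup: Theta(1).
    return M[i][j] not in (0, INF)

def neighbours(i):
    # Must scan the whole row: Theta(V).
    return [V[j] for j in range(len(V)) if is_adjacent(i, j)]

print(neighbours(1))  # B's neighbours: ['A', 'D']
```

The two access costs in the list above fall straight out of the code: adjacency is one lookup, visiting all neighbours is a full row scan.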

GRAPH TRAVERSALS



Graphs – Depth and Breadth first Search

Big and ongoing topic! Emphases: understand basic definitions, matrix and adjacency
representation of graphs, breadth first and depth first search, and some simple algorithms
based on them and some that use the more complicated classifications of edges. In general
understand the idea of adding to DFS and BFS to obtain various objectives. Understand
logical arguments made with graph ideas and make simple logical arguments.

Make data structures concrete with example.

Below are edge list and adjacency matrix interpretations for an undirected and a directed
graph.

Like the book, I will follow the practice of labeling vertices with capital letters, avoiding any
confusion with various uses of numbers that will later be associated with graphs. In efficient
implementations the vertices are just associated with numeric indices.

An alternate search approach is breadth-first search BFS.py. We can show its


closeness to DFS with a nonrecursive explicit stack DFS implementation DFSStack.py. If
you want to separate distances, you can use two queues, BSFDist.py.
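The files named above are not reproduced here, but the two searches can be sketched in a few lines (the graph g and its visit orders are our own illustration, not the contents of DFS.py or BFS.py):

```python
from collections import deque

def dfs(graph, start, visited=None):
    # Recursive depth-first search; returns vertices in visit order.
    if visited is None:
        visited = []
    visited.append(start)
    for w in graph[start]:
        if w not in visited:
            dfs(graph, w, visited)
    return visited

def bfs(graph, start):
    # Breadth-first search: a FIFO queue replaces the recursion stack.
    order, queue, seen = [], deque([start]), {start}
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order

g = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
print(dfs(g, 'A'))  # ['A', 'B', 'D', 'C']
print(bfs(g, 'A'))  # ['A', 'B', 'C', 'D']
```

Swapping the queue for an explicit stack turns bfs into a nonrecursive dfs, which is the closeness referred to above.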

Many things can be done by inserting code into the basic DFS, and to a lesser extent, BFS.

I will generally use V as the vertices of a graph and E as the edges and |V| and |E| as their
sizes. All of the algorithms have order |E| + |V|: the main body of dfs or bfs is called once for
each vertex, and the inner loops are executed once for each edge. All the elaborations add
only an extra O(1) each time they are executed in the basic flow.



Given that the number of edges can vary up to Θ(|V|²), the running time may range from Θ(|V|), with a sparse collection of edges, to Θ(|V|²), with a dense set of edges.

Assuming fixed size descriptors for each piece of the data, the total size of the data has the
same order as the time of execution, and the algorithm time order is said to be linear (in the
size of the data).

Digraphs and DAG's

With directed edges the classifications are more complex.

Edge classifications:

Tree edge: to previously undiscovered node


Back edge: to ancestor
Forward or descendant edge: to descendant previously discovered
Cross edge: no ancestor/descendant relationship

How do we do this mechanically? digraphEdgeClassify.py illustrates some approaches.

Note the order of the start of visiting (None indicates still not visited).
Note the order of finishing visiting (None indicates not yet finished).

If u is first visited after a vertex v is first visited but before v is finished, u is a descendant of
v. (Everything on the stack after v is a descendant.)

The edge to such a u when u is first discovered is a tree edge.


The edge to such a u, but after u first discovered, is still to a descendant, but now the edge is
forward
Conversely an edge from u to ancestor v would be back.

The remaining case is when there is no ancestor-descendant relationship. This is a cross edge.
This occurs when dfs is working on v, and looks to a neighbor u that is already finished.

Why can't forward and cross edges occur in an undirected graph?

You cannot finish with a vertex in an undirected graph when a vertex directly connecting to it
is still unvisited. In a directed graph there can be a connection to a finished vertex - as long
as the reverse edge is missing.

DAG: Directed Acyclic Graph

In an operating system, at the most abstract level, how do you get deadlock?



You have a collection of processes waiting for (dependent on) other processes. This gives a
dependency graph.

A convention needs to be maintained with a dependency graph. You can either have the
arrows go with time, from what must come first to what must follow later (as the book does
it) OR you can have a task point to the things that must be done first, reversing all the arrows
in the text. This latter approach is actually more useful for further algorithms, making it
easier to check on all prerequisites being completed, so I will use it:

task --> prerequisite task.

With either convention, in deadlock there is a cycle in the dependencies. If you expect to
finish a collection of tasks with dependencies, there can be no cyclic dependency! A
workable dependency graph is a directed acyclic graph (DAG).

Topological Order: In a graph with n vertices, assign topological numbers t(v) = 1..n to the vertices in such a way that if vw is any edge, t(v) < t(w).
Reverse topological order, r(v): backwards from above: if vw is an edge, r(v) > r(w).

Theorem: For a DAG, the sequence of ending times end[v] in a DFS satisfies end[v] > end[w] for edges
vw.

Proof: Need to show for any edge vw, end[v] > end[w]. Look at the four edge types:

 Back edges imply a cycle - so excluded.

 Forward, tree edges: w is a descendant of v, and hence by the LIFO sequence of the
stack, the ending time for w is before v.

 Cross edge: w is already completed when the edge from v is checked.

A topological order or its reverse is a permutation of the vertices. What is actually useful is
the functional inverse of this permutation: the mapping from time order to vertex. By the
theorem above, the list for a reverse topological order is easily constructed in the DFS by
just adding to a list of finished vertices each time a vertex is finished. Reversing the final
order of that list provides a vertex listing in forward topological order. A direct translation of
this is in topoordersure.py. A possible issue with this algorithm is that it can be run with any
graph, not just a DAG, but if the graph is not a DAG, the result makes no sense as a
topological order. A more robust version, avoiding a meaningless answer, is topoorder.py. In
this algorithm there is a check for back edges as the DFS progresses, and None is returned
instead of a vertex sequence if the graph is not a DAG. As in DFSEdgeClassify.py, the check



for back edges requires three possible states for the visited list. The code is considerably
shortened by raising an exception when a cycle is found.
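Combining the two ideas, collecting vertices as they finish and raising an exception on a back edge, gives a short sketch (again, not topoorder.py itself):

```python
def topological_order(graph):
    # Vertices are appended as they finish (reverse topological order);
    # reversing that list gives a forward topological order.
    UNVISITED, ACTIVE, DONE = 0, 1, 2
    state = {v: UNVISITED for v in graph}
    finished = []

    def visit(v):
        state[v] = ACTIVE
        for w in graph[v]:
            if state[w] == ACTIVE:          # back edge: a cycle
                raise ValueError('graph is not a DAG')
            if state[w] == UNVISITED:
                visit(w)
        state[v] = DONE
        finished.append(v)

    for v in graph:
        if state[v] == UNVISITED:
            visit(v)
    return list(reversed(finished))

print(topological_order({'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}))
# ['A', 'C', 'B', 'D']
```

Note the three states for each vertex, exactly the "three possible states for the visited list" mentioned above.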
Critical Paths (important, but not in text)

Critical paths are applications for DAG's and reverse topological order:

The last application of reverse topological order was for a sequence with a single processor or
actor. In parallel processes, (human or machine), tasks that do not depend on each other can
be done at the same time. The chain of dependencies that takes the longest time determines
the minimum time to completion. This longest time chain of dependencies is a critical path.

Original DAG:

Each task is associated with a time for completion. Also we want one final node (the DONE
node, with 0 duration) depending on each originally terminal node:

With dependencies marked in this order, the direction of time and the direction of the graph
are backwards. I will use start and end to refer to time, and use the digraph jargon source
(for a vertex with no arrows to it) and sink (for a vertex with no arrows from it).

The table below gives dependencies and times required for each action.

The tasks with no dependencies are then sinks. This example illustrates that there can be
several sinks (actions with no dependencies).



For the critical path time calculation, the earliest finishing time for a task (EFT) =
maximum of the EFTs of its dependencies + duration of the task. WorstDep is the
dependency with the worst (largest) EFT.

For simplicity the alphabetical order of tasks is already a reverse topological order. This
means the table below can be filled in sequentially from top to bottom.

Task Dependencies Duration EFT WorstDep

A 4

B 3

C 1

D A, B 5

E B 7

F B, C 6

G D, E 2

H 5

I F, G 3

J F, H 9

Done I, J 0

If each vertex is associated with its Earliest Starting Time (EST), then the path between a
vertex and a dependency adds the duration for the dependency. The edges to the DONE node
then allow the duration of the otherwise terminal nodes to be encoded.

If rTop is a list of the vertices in reverse topological order, then the maximal path weight and
the path itself are easy to calculate. We assume we start with the vertex duration data in an
array, and we need extra arrays EFT and also worstDep, if we want to recover the longest
path itself.

for each vertex v in rTop:
    est = 0
    worstDep[v] = None
    for w in the neighbors of v:
        if EFT[w] > est:
            worstDep[v] = w
            est = EFT[w]
    EFT[v] = duration[v] + est

Then the EFT for the Done vertex is the overall earliest finishing time. Without the marked
Done vertex, there must also be a check for the vertex with the worst EFT.
Full code with testing in critPathTopo.py. By the same logic as for the DFS and BFS, this
algorithm is linear in the data size.

Task Dependencies Duration EFT WorstDep

A 4 4 -

B 3 3 -

C 1 1 -

D A, B 5 9 A

E B 7 10 B

F B, C 6 9 B

G D, E 2 12 E

H 5 5 -

I F, G 3 15 G

J F, H 9 18 F

Done I, J 0 18 J
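The filled-in table can be reproduced mechanically; a sketch following the pseudocode above, with the task data transcribed from the table:

```python
# Task -> (dependencies, duration), from the worked table above.
tasks = {
    'A': ([], 4), 'B': ([], 3), 'C': ([], 1),
    'D': (['A', 'B'], 5), 'E': (['B'], 7), 'F': (['B', 'C'], 6),
    'G': (['D', 'E'], 2), 'H': ([], 5),
    'I': (['F', 'G'], 3), 'J': (['F', 'H'], 9),
    'Done': (['I', 'J'], 0),
}

def critical_path(tasks, r_topo):
    # Earliest finishing times computed over a reverse topological order.
    eft, worst = {}, {}
    for v in r_topo:
        deps, duration = tasks[v]
        est, worst[v] = 0, None
        for w in deps:              # "neighbours" = prerequisites
            if eft[w] > est:
                worst[v], est = w, eft[w]
        eft[v] = duration + est
    return eft, worst

# Alphabetical order is already a reverse topological order here.
r_topo = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'Done']
eft, worst = critical_path(tasks, r_topo)
print(eft['Done'])                  # 18

# Recover the critical path by walking WorstDep back from Done.
path, v = [], worst['Done']
while v is not None:
    path.append(v)
    v = worst[v]
print(list(reversed(path)))         # ['B', 'F', 'J']
```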

So the minimum duration is 18, and the critical path is obtained by reversing the path from Done
given by the WorstDep column: J, F, B. Reversed, this gives the order of the critical parts to
complete (forward in time): B, F, J.
We will come back to this topic later, but for now note that DAGs and topological order are
central to dynamic programming.



Weighted graphs are a standard kind that we will discuss more later, but they weight the
edges, not the vertices. We can make a weighted graph by making each edge vw get the
weight or time needed to complete w. With task durations and Done vertex, and this shift, we
get a weighted graph:

Showing critical paths to each vertex with the weighted edges:

You could make a weighted graph model with the vertex duration consistently either on
edges to the vertex or from it. The addition of the Done vertex makes the approach I took, with
the weight on edges to the dependency, more convenient. You would have to add further
edges from the sources to a Start vertex to put the weights on the opposite side.

The module graph.py is worth discussion before getting to the parts for meta-graphs. We will
be using this module later when discussing weighted graphs. It includes a simple Edge class
which includes the edge's numerical weight.

I save a coding step by identifying all vertices purely by name rather than going back and
forth between vertex labels and vertex indices as I do in DFS.py. Admittedly the version
with indices, which involves a basic list or array data structure, is somewhat more efficient in
time than my representation which uses basically the same notation, but is dictionary based.

The neighbor lists are modified to be lists of edges to neighbors, so the weights are
included. Rather than my kludgy list structure for a graph used before in DFS.py, vertices
and edges are instance variables, and g.edges[v] is the list of edges with starting vertex v in
the graph g.
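A minimal sketch of that kind of structure (class and attribute names here are assumptions for illustration, not graph.py itself):

```python
class Edge:
    # A directed, weighted edge; attribute names are our own choice.
    def __init__(self, source, dest, weight=1):
        self.source, self.dest, self.weight = source, dest, weight

class Graph:
    def __init__(self):
        self.vertices = set()
        self.edges = {}  # vertex name -> list of outgoing Edge objects

    def add_edge(self, source, dest, weight=1):
        # Vertices are identified purely by name, dictionary-based.
        for v in (source, dest):
            self.vertices.add(v)
            self.edges.setdefault(v, [])
        self.edges[source].append(Edge(source, dest, weight))

g = Graph()
g.add_edge('A', 'B', 5)
g.add_edge('A', 'C', 2)
print([(e.dest, e.weight) for e in g.edges['A']])  # [('B', 5), ('C', 2)]
```

As in the description above, g.edges[v] is the list of edges with starting vertex v, and each neighbour entry carries its weight.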



I also take an object-oriented approach to modifications of depth-first search. In the earlier
programs like DFSComp.py, I copied and then modified the code for DFSSweep and
dfs. This was not a big deal, since the code is so short. As an alternative I use a consistent
dfs_sweep and dfs that have hooks allowing all the elaborations. All the details of the
elaborations are stored in a subclass of DFSHelper. This is not any shorter generally, but it
groups the elaborations together, which has some conceptual appeal. One extra piece of data I
pass in dfs is the edge from the parent, which is needed sometimes (like checking if an
undirected graph is cyclic). Algorithms for component finding
(numberConnectedComponents), active intervals (visitOrder), and topological order
(topoOrder) are included, following the same underlying ideas as DFSComp,
DFSActiveInterval.py, and topoordersure.py but formatted rather differently with the OO
approach.

Task Dependencies Duration EFT WorstDep

A C,E 1 -

B A, F 3 -

C 2 -

D B, F 4

E 3

F E 2

Points to remember

1. Threaded binary tree – null links are replaced by threads; dangling threads point back to the head (base) node

2. Siblings – children of the same parent

3. A graph with an edge between every pair of nodes is called a completely connected graph

4. Directed, Undirected graphs

Unit – IV

External Sorting: Storage Devices -Sorting with Disks: K-Way Merging – Sorting with
Tapes Symbol Tables: Static Tree Tables - Dynamic Tree Tables - Hash Tables:
Hashing Functions - Overflow Handling.

INTRODUCTION
SORTING

The two important uses of sorting are

1) as an aid in searching,

2) as a means for matching entries in files.

Sorting also finds application in the solution of many other more complex problems, e.g.,
from operations research and job scheduling.

Retrieval of information is made easier when it is stored in some predefined order.


Sorting is a very important computer application activity. Many sorting algorithms are
available. Differing environments require different sorting methods. Sorting algorithms can
be characterized in the following two ways:

1. Simple algorithms which require on the order of n² (written as O(n²)) comparisons to sort n items.

2. Sophisticated algorithms that require O(n log₂ n) comparisons to sort n items.

There are two basic types of sorting methods.

 Internal sorting.

 External sorting

Internal sorting methods are applied when the entire collection of data to be sorted is small
enough that the sorting can take place within the memory. The time required to read or write
is not considered to be significant in evaluating the performance of the internal sorting
methods.

External sorting methods are applied to larger collection of data which reside on
secondary devices, read and write access time are major concern in determining sort
performances.

Internal sorting:

 Insertion sort



 Bubble sort

 Quick sort

 Merge sort

 Heap sort

 Radix sort

External sorting:

 Sorting with disk

 Sorting with tape

EXTERNAL SORTING

MAGNETIC DISK

Disks are examples of direct access storage devices. Data are recorded on disk platter in
concentric tracks. A disk has two surfaces on which data can be recorded. Disk packs have
several such disks rigidly mounted on a common spindle. Data is read/written to the disk by a
read/write head. A disk pack would have one such disk per surface.

Each surface has a number of concentric circles called tracks. In a disk pack, a set of parallel
tracks on each surface is called a cylinder. Tracks are further divided in to sectors. A sector is
the smallest addressable segment of a track.

Data is stored along the tracks in blocks. Therefore to access a disk, the track or cylinder
number and the sector number of the starting block must be specified. For disk packs the
surface must also be specified. The read/write head moves horizontally to position itself over
the correct track for accessing disk data.

This introduces three time components in to disk access

 seek time : time taken to position to read/write head over the correct cylinder

 latency time : time taken to position the correct sector under the head.

 Transfer time(transmission time): time taken to actually transfer the block between
main memory and disk.

Sorting with disk



Example: The file F contains 6000 records to be sorted. The main memory is capable of
sorting 1000 records at a time. The block length of the input file is 500 records. The file is treated
as 6 sets of 1000 records each. Each set is sorted and stored on the disk as a 'run'. These 6 runs
will be merged as follows:

Allocate 3 blocks of memory, each capable of holding 500 records. Two of these buffers, B1
and B2, will be treated as input buffers and B3 will be treated as an output buffer.

1. 6 runs R1, R2, R3, R4, R5 and R6 on the disk.

2. 3 buffers B1, B2, B3.

3. Read 500 records from R1 into B1.

Read 500 records from R2 into B2.

Merge B1 and B2 and write into B3.

When B3 is full – write out on to the disk as R11

Similarly merge R3 & R4 to get run R12

Merge R5 & R6 to get run R13.

Thus from 6 runs of size 1000 each, we have now 3 runs of size 2000 each

Stage 1:

R1 R2 R6

………..

1 1000 1001 2000 5001 6000

Stage 2:



R11 R12 R13

1
2000 2001 4000 4001 6000

Stage 3:

R21 R13

1 4000 4001 6000

Stage 4:

RF

1 6000
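The staged run-merging above can be simulated in miniature, with small sorted in-memory lists standing in for the 1000-record disk runs (the numbers here are our own toy data):

```python
def merge_two(run1, run2):
    # Merge two sorted runs into one: the 2-way merge of each stage.
    out, i, j = [], 0, 0
    while i < len(run1) and j < len(run2):
        if run1[i] <= run2[j]:
            out.append(run1[i]); i += 1
        else:
            out.append(run2[j]); j += 1
    return out + run1[i:] + run2[j:]

def external_sort(runs):
    # Repeatedly merge pairs of runs until one sorted run remains,
    # mirroring stages 2-4 above; an odd run is carried forward unchanged.
    while len(runs) > 1:
        runs = [merge_two(runs[k], runs[k + 1]) if k + 1 < len(runs)
                else runs[k] for k in range(0, len(runs), 2)]
    return runs[0]

runs = [[1, 9], [2, 8], [3, 7], [4, 6], [0, 5], [10, 11]]
print(external_sort(runs))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

With 6 runs this takes three merge passes, just like stages 2-4 in the diagram.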

Analysis :

T1 : seek time

T2 : latency time

T3 : transmission time for 1 block of 500 records.

T : T1 + T2 + T3

T4 : time to internally sort 1000 records

nTm - time to merge n records from input buffers to the output buffer.

Stage 1:

We read 6000/500 = 12 blocks



Internally sort 6000/1000 = 6 sets of 1000 records.

Write 6000/500 = 12 blocks

So the time taken in stage 1 = 24T + 6T4

Stage 2:

We read 6000/500 = 12 blocks

Merge 3 X 2000 = 6000 records.

Write 6000/500 = 12 blocks

So the time taken in stage 2 = 24T + 6000Tm

Stage 3:

We merge only 2 runs

So we read 8 blocks & write 8 blocks

Merge 2 X 2000 = 4000 records.

So the time taken in stage 3 = 16T + 4000Tm

Stage 4:

We read = 12 blocks

Merge 4000 + 2000 = 6000 records.

Write = 12 blocks

So the time taken in stage 4 = 24T + 6000Tm

The total time = 24T + 6T4 + 24T + 6000 Tm + 16T + 4000Tm + 24T +6000 Tm

= 88T + 6T4 + 16000Tm

It is seen that the largest influencing factor is Tm, which depends on the number of passes
made over the data or the number of times runs must be combined.



K – way merging

In the above example we used 2-way merging, i.e. combining 2 runs at a time.
The number of passes over the data can be reduced by combining more runs at a time, hence the
k-way merge, where k >= 2.

In the same example, suppose we had used a three-way merge. Then

 At stage 2: we would have had 2 runs of size 3000 each.

 At stage 3: we would have had a single sorted run of size 6000

This would have affected our analysis as follows:

Stage 1 = 24T +6T4

Stage 2 = 24T + 6000Tm

Stage 3 = 24T + 6000Tm

Total = 72T + 6T4 + 12000Tm

This advantage is accompanied by certain side effects. The time to merge K runs would
obviously be more than the time to merge 2 runs. Here the value of Tm itself could increase.
For large values of K, an efficient method of selecting the smallest record out of K in each
step of the merge is needed. One such method is the use of a selection tree.

A selection tree is a binary tree in which each node represents the smaller of its children.
It therefore follows that the root will be the smallest node. The way it works is simple.
Initially, the selection tree is built from the 1st item of each of the K runs. The one which gets
to the root is selected as the smallest. Then the next item in the runs from which the root was
selected enters the tree and the tree gets restructured to get a new root and so on till the K
runs are merged.
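A binary heap plays the same role as the selection tree: the smallest of the K current candidates is always at the root, and removing it lets the next item from the same run enter. A sketch of a K-way merge built on Python's heapq (sorted runs of our own choosing):

```python
import heapq

def k_way_merge(runs):
    # The heap stands in for the selection tree: its root, like the
    # tree's, is always the current minimum of the K candidates.
    heap = [(run[0], k, 0) for k, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        value, k, i = heapq.heappop(heap)   # take the root
        out.append(value)
        if i + 1 < len(runs[k]):            # next item from the same run
            heapq.heappush(heap, (runs[k][i + 1], k, i + 1))
    return out

runs = [[2, 4, 10], [1, 2, 8], [1, 6, 7], [3, 6, 8]]
print(k_way_merge(runs))  # [1, 1, 2, 2, 3, 4, 6, 6, 7, 8, 8, 10]
```

Each step restructures only one root-to-leaf path, so selecting the smallest of K records costs O(log K), which is the point of the selection tree.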

Example

Consider a selection tree for the following 4 runs:

2 4 1 3 ……
R1



4 1 2 8 ……
R2

1 7 6 9 ……
R3

6 8 3 2 ……
R4

1.Construction

2 1

2 4 1 6



Node 1 is selected as the minimum and removed from the tree. Node 1 belongs to R3,
therefore the next item from R3, 7, enters the tree. The new root is 2; this came from R1. In
the next step, the next item from R1 enters the tree, and so on.

2 6

2 4 7 6

In going to a higher-order merge, we save input/output time. But with the increase in
the order of the merge, the number of buffers required also increases: at a minimum, to merge
K runs we need K + 1 buffers. The internal memory is a fixed quantity; therefore, if the number
of buffers increases, their size must decrease.

MAGNETIC TAPES

Magnetic tape devices for computer input/output are similar in principle to audio tape
recorders. Magnetic tape is wound on a spool. Tracks run across the length of the tape;
usually there are 7 to 9 tracks across the tape width. Data is recorded on the tape in a
sequence of bits. The number of bits that can be written per inch of track is called the tape
density, measured in bits per inch.

Information on tapes is usually grouped into blocks, which may be of fixed or


variable size. Blocks are separated by an inter-block gap. Because requests to read or write
blocks do not arrive at a tape drive at a constant rate, there must be a gap between each pair of
blocks, forming a space to be passed over as the tape accelerates to read/write speed. The
medium is not strong enough to withstand the stress it would sustain with instantaneous starts
and stops. Because the tape is not moving at a constant speed over a gap, a gap cannot contain
user data. Data is usually read/written from tapes in terms of blocks.

Inter block gaps

In order to read or write data to a tape, the block length and the address in memory
to/from which the data is to be transferred must be specified. These areas in memory from/to
which data is transferred are called buffers. Usually block size will correspond to buffer size.
Block size is a crucial factor in tape access. A large block size is preferred because of the
following reasons:

1. Consider a tape with a tape density of 600 bpi and an inter-block gap of ¾ inch. This gap is
enough to write 450 characters. With a small block size, the number of blocks per tape
length will increase. This means a large number of inter-block gaps, i.e. stretches of tape which
cannot be utilized for data storage, and thus decreased tape utilization. Thus the larger the
block size, the fewer the number of blocks, the fewer the number of inter-block gaps and the better the
tape utilization.

2. Larger block size reduces the input/output time. The delay time in tape access is the time
needed to cross the inter block gap. This delay time is larger when a tape starts from rest than
when the tape is already moving. With a small block size the number of halts in a read is
considerable, causing the delay time to be incurred each time.

While large block size is desirable from the view of efficient tape usage as well as
reduced access time, the amount of main memory available for use as I/O buffers is a limiting
factor.

SORTING WITH TAPES

Sorting on tapes is carried out using the same basic steps as sorting on disks. Sections of the
input file are internally sorted in to runs which are written out on to tape. These runs are
repeatedly merged together until there is only one run.

Example :



A file containing 4,500 records, R1, R2, …, R4500, is to be sorted using a computer
whose internal memory is large enough to hold only 800 records. The available tape storage
consists of four tapes, T1, T2, T3 and T4. Each block on the input tape contains 250 records. In this case
one output buffer and two input buffers are used. This requires only 750 records.

Stage 1:

Rewind all the tapes

Stage 2:

Internally sort sets of 3 blocks of 250 records each to obtain runs 750 records long.
Write these runs alternately on to tapes T1 and T2.

The total time taken = 36 Trw + 6 Tis

R1 R3 R5

T1

T2 R2 R4 R6

Stage 3:

Rewind T1, T2 and the input tape. Dismount the input tape and replace it by a work tape T4.

So the total time taken = 9 Trew + ∆

Stage 4:

Using two-way merge, merge runs from tape T1 with those on tape T2 and
distribute the resulting bigger runs alternately on to T3 and T4.



T3
R1 R3

R2
T4

So the time taken is 36 Trw + 4500 Tm

Stage 5:

Rewind T1 – T4. Time taken = 12 Trew

Stage 6:

Merge run 1 from T3 with run 2 from T4 on to T1

R3
T3

R1 = 12 blocks = 3000 records

T1

Total time taken = 24 Trw + 3000 Tm

Stage 7:

Rewind the tapes T1 and T4; time taken = 12 Trew

Stage 8:

Merge the run on T1 with run 3 on T3 on to T2 to obtain the sorted file

Total time taken = 36 Trw + 4500 Tm

Stage 9:

Rewind all the tapes. Time taken = 18 Trew



So the total time taken = 6Tis + 132 Trw + 12000Tm + 51 Trew + ∆

Where Trw – time to read or write one block of 250 records.

Trew - time to rewind tape over a length corresponding to 1 block.

nTm - time to merge n records from input file to output buffers using 2 – way

merge

∆ - delay results during mounting and dismounting the tapes

BALANCED MERGE:

As with sorting on disks, sorting on tapes depends on the number of passes made over the data. Use of a
higher-order merge results in a decrease in the number of passes being made without
changing the internal merge time, i.e. a higher-order merge results in less time.

In disk sorting, the order of merge is limited by the amount of main memory available for
input/output buffers. A k-way merge requires the use of 2k + 2 buffers. Another restriction in
the case of tape is the number of tapes available. In order to avoid excessive seek time, it is
necessary that runs being merged be on different tapes. Thus a k-way merge requires at least
k-tapes for use as input tapes during the merge. In addition, another tape is required for the
output generated during the merge. Hence at least k + 1 tapes must be available for a k- way
merge. Using k + 1 tapes for a k-way merge requires an additional pass over the output tape
to redistribute the runs on to k- tapes for the next level of merge. The redistribution can be
avoided by using 2k tapes. During the k-way merge with 2k tapes, k of these tapes are used as
input tapes and the remaining k as output tapes. Now we are going to see 3 procedures: procedures
m1 and m2 follow the k+1 tape strategy and procedure m3 follows the 2k tape strategy.

Procedure m1 always assumes tk+1 as the output tape; in procedure m2 we choose the output
tape in cyclic order, i.e. tk+1, t1, t2, …, tk. In this case, the redistribution from the output tape makes
less than a full pass over the file. Procedure m3 follows the 2k tape strategy.

procedure m1;
{sort a file of records from a given input tape using a k-way merge
 given tapes t1, ..., tk+1 are available for the sort}
label 99;
begin
  Create runs from the input tape distributing them evenly over tapes t1, ..., tk;
  Rewind t1, ..., tk and also the input tape;
  if there is only 1 run then goto 99; {the sorted file is on t1}
  replace input tape by tk+1;
  while true do
    {repeatedly merge onto tk+1 and redistribute back onto t1, ..., tk}
    begin
      merge runs from t1, ..., tk onto tk+1;
      rewind t1, ..., tk+1;
      if number of runs on tk+1 = 1 then goto 99; {output on tk+1}
      evenly distribute from tk+1 onto t1, ..., tk;
      rewind t1, ..., tk+1;
    end;
99: end; {of m1}

procedure m2;
{sort a file of records from a given input tape using a k-way merge
 given tapes t1, ..., tk+1 are available for the sort}
label 99;
begin
  Create runs from the input file distributing them evenly over tapes t1, ..., tk;
  Rewind t1, ..., tk and also the input tape;
  if there is only 1 run then goto 99; {sorted file on t1}
  replace input tape by tk+1;
  i := k+1; {i is the index of the output tape}
  while true do
    begin
      merge runs from the k tapes tj, 1 <= j <= k+1 and j <> i, onto ti;
      rewind t1, ..., tk+1;
      if number of runs on ti = 1 then goto 99; {output on ti}
      evenly distribute (k-1)/k of the runs on ti onto tapes tj,
        1 <= j <= k+1 and j <> i and j <> i mod (k+1) + 1;
      rewind tapes tj, 1 <= j <= k+1 and j <> i;
      i := i mod (k+1) + 1;
    end;
99: end; {of m2}

procedure m3;
{sort a file of records from a given input tape using a k-way merge on 2k tapes t1, ..., t2k}
begin
  Create runs from the input file distributing them evenly over tapes t1, ..., tk;
  rewind t1, ..., tk; rewind the input tape;
  replace the input tape by tape t2k;
  i := 0;
  while total number of runs on tik+1, ..., tik+k > 1 do
    begin
      j := 1 - i;
      perform a k-way merge from tik+1, ..., tik+k evenly
        distributing output runs onto tjk+1, ..., tjk+k;
      rewind t1, ..., t2k;
      i := j; {switch input and output tapes}
    end;
  {sorted file is on tik+1}
end; {of m3}

SYMBOL TABLES

When building loaders, assemblers, compilers, or any keyword-driven translator, a

symbol table is used. A symbol table is a set of name-value pairs. Associated with each name
in the table is an attribute, a collection of attributes, or some directions about what further
processing is needed. The operations that one generally wants to perform on symbol tables
are (i) ask if a particular name is already present, (ii) retrieve the attributes of that name, (iii)
insert a new name and its value, and (iv) delete a name and its value.

The structure of a symbol table is given below.

structure SYMBOL-TABLE

declare CREATE( ) → symtb

INSERT(symtb, name, value) → symtb

DELETE(symtb, name) → symtb

FIND(symtb, name) → value

HAS(symtb, name) → Boolean

ISMTST(symtb) → Boolean;

for all S ∈ symtb, a, b ∈ name, r ∈ value let

ISMTST(CREATE) ::= true

ISMTST(INSERT(S,a,r)) ::= false

HAS(CREATE, a) ::= false

HAS(INSERT(S,a,r), b) ::=

if EQUAL(a,b) then true else HAS(S,b)

DELETE(CREATE, a) ::= CREATE

DELETE(INSERT(S,a,r), b) ::=

if EQUAL(a,b) then S

else INSERT(DELETE(S,b), a, r)

FIND(CREATE, a) ::= error

FIND(INSERT(S,a,r), b) ::= if EQUAL(a,b) then r

else FIND(S,b)

end

end SYMBOL-TABLE

If the identifiers in a symbol table are known in advance, and no additions or deletions are
permitted, then the symbol table is called static; otherwise it is called dynamic.
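The six operations above can be sketched in Python (an illustrative translation of the ADT using a dictionary; the function names simply mirror the axioms):

```python
def create():
    # CREATE() -> an empty symbol table
    return {}

def insert(symtb, name, value):
    # INSERT(symtb, name, value) -> symtb
    symtb[name] = value
    return symtb

def delete(symtb, name):
    # DELETE(symtb, name) -> symtb; deleting an absent name is a no-op
    symtb.pop(name, None)
    return symtb

def find(symtb, name):
    # FIND(symtb, name) -> value (raises an error if absent, as in the axioms)
    return symtb[name]

def has(symtb, name):
    # HAS(symtb, name) -> Boolean
    return name in symtb

def ismtst(symtb):
    # ISMTST(symtb) -> Boolean
    return len(symtb) == 0
```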

STATIC TREE TABLES

Definition : A binary search tree t is a binary tree; either it is empty or each node in the tree
contains an identifier and

(i) all identifiers in the left subtree of t are less (numerically or
alphabetically) than the identifier in the root node t;

(ii) all identifiers in the right subtree of t are greater than the identifier in the
root node t;

(iii) the left and right subtrees of t are also binary search trees.

For a given set of identifiers several binary trees are possible. To determine whether an
identifier x is present in a binary search tree, x is compared with the root. If x is less than the
identifier in the root, then the search continues in the left subtree; if x equals the identifier in
the root, the search terminates successfully; otherwise the search continues in the right
subtree. The data types used by the algorithm are defined as:



type identifier = packed array [1 .. maxchar] of char;

     treepointer = ↑treerecord;

     treerecord = record

                    leftchild : treepointer;

                    ident : identifier;

                    rightchild : treepointer;

                  end;

where maxchar is a previously declared constant.

Ex. A binary search tree with 7 nodes:

            30
          /    \
        20      40
       /  \    /  \
      5   25  35   45

SEARCHING A BINARY SEARCH TREE



To determine whether an identifier X is present in a binary search tree, X is compared with
the root. If X is less than the identifier in the root, then the search continues in the left sub
tree. If X is equal to the identifier in the root, the search terminates successfully; otherwise
the search continues in the right sub tree.

1. procedure search(t : treepointer; x : identifier; var retpointer : treepointer);

2. {Search the binary search tree t for x. Each node has fields

3.  leftchild, ident, rightchild. Return retpointer = nil if x is not

4.  in t. Otherwise, return retpointer such that retpointer↑.ident = x.}

5. var found : boolean;

6. begin

7.   retpointer := t; found := false;

8.   while (retpointer <> nil) and (not found) do

9.     if x < retpointer↑.ident then {search left subtree}

10.      retpointer := retpointer↑.leftchild

11.    else if x > retpointer↑.ident then {search right subtree}

12.      retpointer := retpointer↑.rightchild

13.    else {we have found x}

14.      found := true;

15. end; {of search}
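The same search can be sketched in Python (an illustrative translation of procedure search; the node class and the sample tree built from the 7-node example above are assumptions for demonstration):

```python
class TreeNode:
    def __init__(self, ident, left=None, right=None):
        self.ident = ident
        self.left = left
        self.right = right

def search(t, x):
    # Walk left or right until x is found or a nil link is reached.
    while t is not None:
        if x < t.ident:
            t = t.left
        elif x > t.ident:
            t = t.right
        else:
            return t
    return None

# The 7-node binary search tree from the example.
root = TreeNode(30,
                TreeNode(20, TreeNode(5), TreeNode(25)),
                TreeNode(40, TreeNode(35), TreeNode(45)))
```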

In evaluating binary search trees, we add a square node at every place there is a null
link. Every binary tree with n nodes has n+1 null links and hence will have n+1 square
nodes. These are called external nodes or failure nodes. The remaining nodes are called
internal nodes. The external path length = sum over all external nodes of the lengths of the
paths from the root to those nodes. Similarly the internal path length = sum over all
internal nodes of the lengths of the paths from the root to those nodes.

[Figure: the binary search tree above, with nodes 30, 20, 40, 5, 25, 35 and 45 as the
internal nodes, and a square external node attached at each null link of the leaves.]

DYNAMIC TREE TABLES

Dynamic tables are also maintained as binary search trees. Here we can add an
identifier to the tree if it is not present. The following algorithm explains the insertion of an
identifier in to binary search tree.

procedure bst ( x : identifier; var t, retpointer : treepointer);

{Search the binary search tree t for the node pointed to by retpointer such that

retpointer↑.ident = x. If x is not present in the tree then it is entered at the appropriate point}

var trailpointer : treepointer;

    found : boolean;

begin

trailpointer := nil;

retpointer := t; found := false;

while (retpointer <> nil) and (not found) do

  if x < retpointer↑.ident then {search left subtree}

  begin

    trailpointer := retpointer;

    retpointer := retpointer↑.leftchild;

  end

  else if x > retpointer↑.ident then {search right subtree}

  begin

    trailpointer := retpointer;

    retpointer := retpointer↑.rightchild;

  end

  else {we have found x}

    found := true;

if not found then

begin {x is not in the tree and is entered as a child of trailpointer}

  new(retpointer);

  retpointer↑.ident := x;

  retpointer↑.leftchild := nil;

  retpointer↑.rightchild := nil;

  if t = nil then {insert into empty tree}

    t := retpointer

  else if x < trailpointer↑.ident then

    trailpointer↑.leftchild := retpointer

  else

    trailpointer↑.rightchild := retpointer;

end;

end; {of bst}
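Insertion into a dynamic table can be sketched in Python (an illustrative recursive translation; a recursive formulation makes the trailing pointer of the Pascal procedure unnecessary, and the dict-based node representation is an assumption for demonstration):

```python
def bst_insert(t, x):
    # Return the (possibly new) root after inserting x;
    # a duplicate x leaves the tree unchanged.
    if t is None:
        return {'ident': x, 'left': None, 'right': None}
    if x < t['ident']:
        t['left'] = bst_insert(t['left'], x)
    elif x > t['ident']:
        t['right'] = bst_insert(t['right'], x)
    return t

def inorder(t):
    # Inorder traversal of a binary search tree yields the idents in order.
    if t is None:
        return []
    return inorder(t['left']) + [t['ident']] + inorder(t['right'])

tree = None
for key in [30, 20, 40, 5, 25, 35, 45]:
    tree = bst_insert(tree, key)
```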

HEIGHT BALANCED TREE

A binary tree of height h is called completely balanced or balanced if all leaves occur at
nodes of level h or h-1 and if all nodes at levels lower than h-1 have two children. Adelson-
Velskii and Landis introduced a binary tree structure that is balanced with respect to the
heights of the sub trees. A tree is height balanced if, for each node in the tree, the height of
the left sub tree differs from the height of the right sub tree by no more than 1.

DEFINITION

An empty tree is height balanced. If T is a non-empty binary tree with TL and TR as
its left and right sub trees, then T is height balanced iff (i) TL and TR are height balanced
and (ii) |HL - HR| <= 1, where HL and HR are the heights of TL and TR respectively.

The balance factor, BF(T), of a node T in a binary tree is defined to be hL - hR, where hL
and hR are the heights of the left and right subtrees of T. For any node in an AVL tree BF(T) =
-1, 0 or 1.
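The balance condition can be checked directly from the definition. A small illustrative sketch in Python (the tuple representation (key, left, right) and the sample trees are assumptions for demonstration; a real AVL implementation would store heights in the nodes rather than recompute them):

```python
def height(t):
    # Height of a tree given as (key, left, right) tuples; None is empty.
    if t is None:
        return 0
    return 1 + max(height(t[1]), height(t[2]))

def is_height_balanced(t):
    # AVL condition: BF(T) = hL - hR must be -1, 0 or 1 at every node.
    if t is None:
        return True
    bf = height(t[1]) - height(t[2])
    return abs(bf) <= 1 and is_height_balanced(t[1]) and is_height_balanced(t[2])

balanced = (30, (20, (5, None, None), None), (40, None, None))
skewed = (5, None, (10, None, (15, None, (20, None, None))))
```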

HASH TABLES

In tree tables, the search for an identifier key is carried out through a sequence of
comparisons. Hashing differs from this in that the address or location of an identifier X is
obtained by computing some arithmetic function f(X). f(X) gives the address of X in the
table. This address is called the hash or home address of X. The memory is referred to as the
hash table, ht. The hash table is partitioned into b buckets, ht(0), ht(1), ... ht(b-1). Each
bucket is capable of holding s records. Thus, a bucket is said to consist of s slots, each slot
being large enough to hold one record. The hashing function f(X) is used to perform an
identifier transformation on X. f(X) maps the set of possible identifiers onto the integers 0
through b-1. Two identifiers I1, I2 are said to be synonyms with respect to f if f(I1) = f(I2).
Distinct synonyms are entered into the same bucket so long as all the s slots in the bucket
have not been used. An overflow is said to occur when a new identifier I is mapped or
hashed by f into a full bucket. A collision occurs when two non-identical identifiers are
hashed into the same bucket.

Example

Consider a hash table ht with 26 buckets, each bucket having exactly two slots. Assume that
there are n (= 6) distinct identifiers in the program and that each identifier begins with a letter.
The function f defined by f(X) = the first character of X will hash all identifiers X into the
hash table. The identifiers are GA, D, A, A1, G and E.



The following figure shows how the identifiers are stored in the hash table. The identifiers
GA and G are placed in ht(7); A and A1 are stored in ht(1), and so on. GA and G are
synonyms, as are A and A1.

        SLOT 1   SLOT 2

  1     A        A1

  2     0        0

  3     0        0

  4     D        0

  5     E        0

  7     GA       G

HASHING FUNCTIONS

A hashing function f transforms an identifier X into a bucket address in the hash table. A
hash function f is called a uniform hash function if for each identifier X there is an equal
chance of hashing into any bucket, i.e., for each identifier X in the identifier space the
probability that f(X) = i is 1/b for every bucket i.

Several kinds of uniform functions are in use.

MID-SQUARE:

One hash function that has found much use in symbol table applications is the
'middle of square' function. This function, fm, is computed by squaring the identifier and then
using an appropriate number of bits from the middle of the square to obtain the bucket
address; the identifier is assumed to fit into one computer word.

DIVISION:



Another simple choice for a hash function is obtained by using the modulo(mod)
operator. The identifier X is divided by some number M and the remainder is used as the hash
address for X.

fD (x) = X mod M

This gives bucket addresses in the range 0 – ( M - 1 ) and so the hash table is at least of
size b = M.
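Both functions are easy to sketch in Python (an illustrative decimal-digit version of mid-square rather than the bit-level one described above; the parameter names are assumptions for demonstration):

```python
def f_division(x, m):
    # fD(X) = X mod M gives bucket addresses in the range 0 .. M-1.
    return x % m

def f_mid_square(x, digits=3):
    # Square x and take `digits` digits from the middle of the square
    # (the text's version takes middle *bits*; digits are used here
    # purely for readability).
    sq = str(x * x)
    mid = len(sq) // 2
    start = max(0, mid - digits // 2)
    return int(sq[start:start + digits])
```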

FOLDING:

In this method the identifier X is partitioned into several parts, all but the last being of
the same length. These parts are then added together to obtain the hash address for X. There
are two ways of carrying out this addition. In the first, all but the last part are shifted so that
the least significant bit of each part lines up with the corresponding bit of the last part. The
different parts are now added together to get f(X).This method is is known as shift folding.
The other method of adding the parts is folding at the boundaries. In this method, the
identifier is folded at the part boundaries and digits falling into the same position are added
together.

Ex. X is partitioned into P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20.

Shift folding:          fS(X) = 123 + 203 + 241 + 112 + 20 = 699

Folding at boundaries:  every second part is reversed before adding,

                        fB(X) = 123 + 302 + 241 + 211 + 20 = 897



DIGIT ANALYSIS:

This method is particularly useful in the case of a static file where all the identifiers in
the table are known in advance. Each identifier X is interpreted as a number using some radix
r. The same radix is used for all the identifiers in the table. Using radix, the digits of each
identifier are examined. Digits having the most skewed distributions are deleted. Enough
digits are deleted so that the number of digits left is small enough to give an address in the
range of the hash table.

OVERFLOW HANDLING:

In order to be able to detect collisions and overflows, it is necessary to initialize the hash
table, ht, to represent the situation when all slots are empty. When a new identifier gets
hashed in to a full bucket, it is necessary to find another bucket for this identifier. The
simplest solution is to find the closest unfilled bucket. Consider the identifiers GA, G, A, A4,
A5, A2, Z, ZA.

The following table illustrates the storage of the identifiers

1 A

2 A4

3 A5

4 A2

5 ZA

The identifiers GA is placed in the bucket ht[7]. The next identifier G is placed in the nearest
free bucket, i.e ht[8]. The next identifier A is placed in the bucket ht[1]. The identifier A4 is
placed in the nearest free bucket ht[2] and so on. Finally the identifier Z is placed in ht[26].
The next identifier ZA is placed in the next free bucket in the circular list, ht[5]. The above
method of resolving overflows is called linear probing or linear open addressing. It is
characterized by searching the buckets (f(X) + i) mod b, 1 <= i <= b-1, where b is the number of
buckets.

Procedure to search for an identifier if linear probing method is used is given below:

type identifier = packed array [1..maxchar] of char;

hashtable = array[1..maxsize]of identifier;

procedure linsrch(x: identifier; ht: hashtable; var j: integer; b: integer);

var i : integer;

begin

i:=f(x); j:=i;

while (ht[j] <> x) and (ht[j] <> blankident) do

begin

j :=(j+1) mod b;

if j = i then tablefull;

end;

end;
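Insertion with linear probing can be sketched in Python (an illustrative translation with zero-based buckets, so the text's ht[7] becomes index 6; the hash on the first letter mirrors the example above):

```python
def first_letter_bucket(ident):
    # f(X) = first character of X, mapped to buckets 0 .. 25.
    return ord(ident[0].upper()) - ord('A')

def lin_insert(ht, ident):
    # Linear probing: on collision try (f(X) + i) mod b for i = 1, 2, ...
    b = len(ht)
    j = first_letter_bucket(ident) % b
    start = j
    while ht[j] is not None:
        j = (j + 1) % b
        if j == start:
            raise OverflowError('table full')
    ht[j] = ident

ht = [None] * 26
for ident in ['GA', 'G', 'A', 'A4', 'A5', 'A2', 'Z', 'ZA']:
    lin_insert(ht, ident)
```

With zero-based indexing, ZA wraps around from bucket 25 past A, A4, A5 and A2 to land in bucket 4 — the text's ht[5].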



Unit – V

Internal Sorting: Insertion Sort - Quick Sort - 2 Way Merge Sort - Heap Sort – Shell
Sort - Sorting on Several Keys. Files: Files, Queries and Sequential organizations –
Index Techniques -File Organizations.

Internal sorting:

 Insertion sort

 Quick sort

 Merge sort

 Heap sort

INSERTION SORT:

This is a naturally occurring sorting method exemplified by a card player arranging the
cards dealt to him. He picks up the cards as they are dealt and inserts them in to the required
position. Thus at every step, we insert an item into its proper place in an already ordered list.

Algorithm for insertion sort

type afile = array [0..maxn] of records;

1. procedure insert ( r : records; var list : afile; i : integer);

2. { Insert record r with key r.key into the ordered

3. sequence list[0], ..., list[i] in

4. such a way that the resulting sequence is

5. also ordered on the field key.

6. We assume that list contains a dummy record at index zero

7. such that r.key >= list[0].key for all r}

8. var j : integer;

9. begin

10. j :=i;

11. while r.key < list[j].key do



12. begin

13. list[j+1] :=list[j];

14. j:=j-1;

15. end;

16. list[j+1] :=r;

17. end;

1. procedure insort(var list : afile; n : integer);

2. { sort list into nondecreasing order of the field key. Assume n > 0.}

3. var j : integer;

4. begin

5. list[0].key := -maxint; {sentinel: smaller than every key}

6. for j:=2 to n do

7. insert(list[j],list,j-1);

8. end;
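The insert/insort pair can be sketched in one Python function (an illustrative translation; Python's bounds-checked loop replaces the sentinel record at index zero):

```python
def insertion_sort(keys):
    # Insert keys[i] into the already sorted prefix keys[0 .. i-1].
    a = list(keys)
    for i in range(1, len(a)):
        r = a[i]
        j = i - 1
        while j >= 0 and a[j] > r:  # shift larger keys one place right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = r
    return a
```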

For example, if n = 5 and the input sequence is (2,3,4,5,1), then after each execution of insert we
have:

-, 2, 3, 4, 5, 1   [initial]

-, 2, 3, 4, 5, 1   i=2

-, 2, 3, 4, 5, 1   i=3

-, 2, 3, 4, 5, 1   i=4

-, 1, 2, 3, 4, 5   i=5

This example has one record LOO (Left Out of Order). The
time taken for each of i = 2, 3, 4 is O(1), while for i = 5 it is O(n).



QUICK SORT:

This is the most widely used internal sorting algorithm. Its popularity lies in the ease
of implementation, moderate use of resources and acceptable behaviour for a variety of
sorting cases. The basis of quick sort is the 'divide and conquer' strategy, i.e., divide the
problem [list to be sorted] into sub-problems [sub-lists] until solved sub-problems [sorted sub-
lists] are found. The implementation is:

1. Choose one item A[I] from the list A[ ]

2. Rearrange the list so that this item is in the proper position .i.e all preceding items
have a lesser value and all succeeding items have a greater value than this item.

 A[0], A[1], ….. A[I-1] in sub list 1

 A[I]

 A[I+1], A[I+2], ….. A[N] in sub list 2

3. Repeat the steps 1 & 2 for sub list 1 and sub list 2 till A[ ] is a sorted list.

The input file has 10 records with keys (26,5,37,1,61,11,59,15,48,19). The table below
gives the status of the file at each call of qsort.

R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 m n

[26 5 37 1 61 11 59 15 48 19] 1 10

[11 5 19 1 15] 26 [59 61 48 37] 1 5

[1 5] 11 [19 15] 26 [59 61 48 37] 1 2

1 5 11 [19 15] 26 [59 61 48 37] 4 5

1 5 11 15 19 26 [59 61 48 37] 7 10



1 5 11 15 19 26 [48 37] 59 [61] 7 8

1 5 11 15 19 26 37 48 59 [61] 10 10

1 5 11 15 19 26 37 48 59 61

1  procedure qsort(var list : afile; m, n : integer);

2  {sort records list[m], ..., list[n] into nondecreasing
3  order on field key.
4  Key k = list[m].key is arbitrarily chosen as the control key.
5  Pointers i and j are used to partition the subfile so
6  that at any time list[l].key ≤ k, l < i and
7  list[l].key ≥ k, l > j. It is assumed that
8  list[m].key ≤ list[n + 1].key}
9  var i, j, k : integer;
10 begin
11   if m < n
12   then
13   begin
14     i := m; j := n + 1; k := list[m].key;
15     repeat
16       repeat
17         i := i + 1;
18       until list[i].key >= k;
19       repeat
20         j := j - 1;
21       until list[j].key <= k;
22       if i < j
23       then interchange(list[i], list[j]);
24     until i >= j;
25     interchange(list[m], list[j]);
26     qsort(list, m, j - 1);
27     qsort(list, j + 1, n);
28   end; {of if}

29 end; {of qsort}

The average computing time of quick sort is O(n log₂ n). It needs stack space to
implement the recursion.
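A compact functional sketch in Python (illustrative only: it keeps the procedure's choice of the first key as the control key, but builds new lists instead of performing the in-place interchange-based partition):

```python
def quicksort(keys):
    # The first key plays the role of the control key k = list[m].key;
    # partition the rest around it, then conquer each sublist.
    if len(keys) <= 1:
        return list(keys)
    k = keys[0]
    left = [x for x in keys[1:] if x <= k]
    right = [x for x in keys[1:] if x > k]
    return quicksort(left) + [k] + quicksort(right)
```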

MERGE SORT

Merge sort is one of the 'divide and conquer' class of algorithms. The
basic idea is to divide the list into a number of sub lists, sort each of these sub lists
and merge them to get a single sorted list.

Example: The input file (26,5,77,1,61,11,59,15,48,19) is to be sorted using the 2-way merge
sort. The two-way merge sort begins by interpreting the input as n sorted files of length 1.
These are merged pair-wise to obtain n/2 files of size 2. These n/2 files are then merged pair-
wise and so on until we are left with only one file.

Example:

[26] [5] [77] [1] [61] [11] [59] [15] [48] [19]

[5 26] [ 1 77] [11 61] [ 15 59 ] [ 19 48]

[1 5 26 77] [ 11 15 59 61] [19 48]

[1 5 11 15 26 59 61 77] [ 19 48 ]

[1 5 11 15 19 26 48 59 61 77]



Merge sort consists of several passes over the records being sorted. In the first pass
the files of size 1 are merged. In the second pass, files of size 2 are merged, and so on; on the
ith pass the files of size 2^(i-1) are merged. The total computing time is O(n log n).

2-way merge algorithm is given below.

1.procedure merge(var x, z : afile; l, m, n : integer);

2.{(x[l], ..., x[m]) and (x[m+1], ..., x[n]) are
3. two sorted lists with keys
4. such that x[l].key <= ... <= x[m].key, and
5. x[m+1].key <= ... <= x[n].key. These records are merged to
6. obtain the sorted list (z[l], ..., z[n]) such that
7. z[l].key <= ... <= z[n].key}
8.var i, j, k, t : integer;
9.begin
10. i := l;
11. k := l;
12. j := m + 1; {i, j and k are positions in the three files}
13. while ((i <= m) and (j <= n)) do
14. begin
15.   if x[i].key <= x[j].key
16.   then
17.   begin
18.     z[k] := x[i];
19.     i := i + 1;
20.   end
21.   else
22.   begin
23.     z[k] := x[j];
24.     j := j + 1;
25.   end;
26.   k := k + 1;
27. end; {of while}
28. if (i > m)
29. then {(z[k], ..., z[n]) := (x[j], ..., x[n])}
30.   for t := j to n do
31.     z[k+t-j] := x[t]
32. else {(z[k], ..., z[n]) := (x[i], ..., x[m])}
33.   for t := i to m do
34.     z[k+t-i] := x[t];
35.end; { of merge }

1.procedure mpass(var x, y : afile; n, l : integer);

2.{This algorithm performs one pass of the merge sort. It
3. merges adjacent pairs of subfiles of length l from the
4. list x to list y. n is the number of records in x.}
5.var i,t:integer;

6.begin

7.i:=1;

8.while i <= (n-2*l+1) do

9.begin

10. merge(x,y,i ,i+l-1,i+2*l –1);

11. i:=i + 2 * l;

12. end;

13.{merge remaining file of length < 2*l}

14.if(i+l-1)< n then merge(x,y,i,i+l-1,n)



15.else

16.for t:=i to n do

17. y[t] := x[t];

18.end;{of mpass}

1.procedure msort(var x : afile; n : integer);

2.{sort the file x = (x[1], ..., x[n]) into nondecreasing
3. order on the keys x[1].key, ..., x[n].key}

4.var l : integer;

5.y:afile;

6.begin

7.{l is the size of the subfile currently being merged}

8.l:=1;

9.while l<n do

10.begin

11. mpass(x,y,n,l)

12. l:=2*l;

13. mpass(y,x,n,l);{interchange role of x and y}

14. l:=2*l;

15.end

16.end; {of sort}
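The three procedures together can be sketched in Python (an illustrative translation: merge_runs corresponds to merge, and each iteration of the while loop in merge_sort performs one mpass over a list of runs):

```python
def merge_runs(a, b):
    # Corresponds to procedure merge: combine two sorted runs.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]  # copy whichever run is unfinished

def merge_sort(keys):
    # Corresponds to msort: start with n runs of length 1 and merge
    # adjacent pairs (one mpass) until a single run remains.
    runs = [[k] for k in keys]
    while len(runs) > 1:
        runs = [merge_runs(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
    return runs[0] if runs else []
```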



HEAP SORT:

In this method we interpret the file to be sorted R = (R1, R2, …Rn) as a binary tree. In a
binary tree the parent of node at i is at [i/2], the left child at 2i and right child at 2i + 1. If 2i
or 2i +1 is greater than n ( no. of nodes), then the corresponding children do not exist.

Thus, initially the file (X1, X2, ..., Xn) is interpreted as the complete binary tree:

                X1
              /    \
            X2      X3
           /  \    /  \
         X4   X5  X6   X7
              ...
        Xn-2  Xn-1  Xn

Heap sort is a two stage method. First the tree representing the file is converted into
a heap. A heap is defined to be a complete binary tree with the property that the value of
each node is at least as large as the values of its children nodes (i.e., K⌊j/2⌋ >= Kj for 1 <= ⌊j/2⌋ < j <= n).
This implies that the root of the heap has the largest value in the tree. In the second stage the
output sequence is generated in decreasing order by successively outputting the root and
restructuring the remaining tree into a heap.

A complete binary tree is said to satisfy the 'heap condition' if the key of each
node is greater than or equal to the keys in its children. Thus the root node will have the
largest key value.



Trees can be represented as arrays, by first numbering the nodes (starting from the root)
from left to right. The key values of the nodes are then assigned to array positions whose
index is given by the number of the nodes

A HEAP is a complete binary tree, in which each node satisfies the heap condition ,
represented as an array.

The operations on a heap work in 2 steps.

1 The required node is inserted / deleted / or replaced

2 step 1 may cause violation of the heap condition so the heap is traversed and

modified to rectify any such violations.

1  procedure adjust(var tree : afile; i, n : integer);

2  {Adjust the binary tree with root i to satisfy the heap property.
3  The left and right subtrees of i, i.e., with roots 2i and 2i + 1,
4  already satisfy the heap property. No node has index greater
5  than n.}
6  var j : integer;
7      k : integer;
8      r : records;
9      done : boolean;
10 begin
11   done := false;
12   r := tree[i];
13   k := tree[i].key;
14   j := 2 * i;
15   while (j <= n) and (not done) do
16   begin {first find max of left and right child}
17     if j < n then if tree[j].key < tree[j + 1].key then j := j + 1;
18     {compare max child with k. If k is max, then done.}
19     if k >= tree[j].key then
20       done := true
21     else
22     begin
23       tree[j div 2] := tree[j]; {move jth record up the tree}
24       j := 2 * j;
25     end;
26   end;
27   tree[j div 2] := r;

28 end; { of adjust}

HSORT

1  procedure hsort(var r : afile; n : integer);

2  {the file r = (r[1], ..., r[n]) is sorted into nondecreasing order on
3  the field key}
4  var i : integer;
5      t : records;
6  begin
7    for i := (n div 2) downto 1 do {convert r into a heap}
8      adjust(r, i, n);
9    for i := (n - 1) downto 1 do {sort r}
10   begin
11     t := r[i + 1]; {interchange r[1] and r[i+1]}
12     r[i + 1] := r[1];
13     r[1] := t;
14     adjust(r, 1, i); {recreate heap}
15   end;

16 end; { of hsort}

The computing time of heap sort is O(n log n).
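Both stages can be sketched in Python (an illustrative translation of adjust and hsort; a dummy slot at index 0 reproduces the 1-based indexing of the Pascal procedures):

```python
def adjust(tree, i, n):
    # Sift tree[i] down (1-based indexing) so the subtree rooted
    # at i satisfies the heap property.
    r = tree[i]
    j = 2 * i
    while j <= n:
        if j < n and tree[j] < tree[j + 1]:
            j += 1  # j now points to the larger child
        if r >= tree[j]:
            break
        tree[j // 2] = tree[j]  # move the larger child up
        j *= 2
    tree[j // 2] = r

def hsort(keys):
    # Stage 1: build the heap; stage 2: repeatedly swap the root
    # (largest key) to the end and re-adjust the shrunken heap.
    r = [None] + list(keys)  # dummy slot 0 gives 1-based indexing
    n = len(keys)
    for i in range(n // 2, 0, -1):
        adjust(r, i, n)
    for i in range(n - 1, 0, -1):
        r[1], r[i + 1] = r[i + 1], r[1]
        adjust(r, 1, i)
    return r[1:]
```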

FILE ORGANIZATIONS

File:

A file is a collection of records. A record is a collection of related fields; each field is a
data item. Ex. employee file, student file etc.

The primary objective of file organization is to provide a means for record retrieval and
update. The update of a record could involve its deletion, changes in some of the fields or the
insertion of an entirely new record. Certain fields in a record may be designated as key fields.
Records may be retrieved by specifying values for some or all of these keys. A combination
of key values specified for retrieval is called a query.

query types

The following are the different types of queries

1. Sex = M

Simple query ( the value of the key is specified)

2. Salary > 9000

Range query ( a range of values for a single key is specified)

3. Salary > average salary of all employees

Functional query ( some function of the key values in file is specified)

4. (sex = M and occupation = programmer) or (employee number > 700

and sex = F)

Boolean query ( Boolean operators are used )
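The four query types are easy to state over an in-memory list of records. An illustrative sketch in Python (the employee records and their field values are made up purely for demonstration):

```python
# Hypothetical employee records; field values are assumptions.
employees = [
    {'empno': 800, 'sex': 'M', 'occupation': 'programmer', 'salary': 9500},
    {'empno': 510, 'sex': 'F', 'occupation': 'analyst',    'salary': 12000},
    {'empno': 950, 'sex': 'M', 'occupation': 'analyst',    'salary': 8000},
]

# Simple query: sex = M
simple_q = [e['empno'] for e in employees if e['sex'] == 'M']

# Range query: salary > 9000
range_q = [e['empno'] for e in employees if e['salary'] > 9000]

# Functional query: salary > average salary of all employees
avg = sum(e['salary'] for e in employees) / len(employees)
functional_q = [e['empno'] for e in employees if e['salary'] > avg]

# Boolean query combining conditions with and/or
boolean_q = [e['empno'] for e in employees
             if (e['sex'] == 'M' and e['occupation'] == 'programmer')
             or (e['empno'] > 700 and e['sex'] == 'F')]
```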

The different types of file organizations are

 Sequential file organization

 Random file organization

 Linked organization

 Inverted files

 Cellular partition



 SEQUENTIAL FILE ORGANIZATION

In this organization the records are placed sequentially on the storage medium, i.e.
they occupy consecutive memory locations; in the case of a tape this would mean placing
records adjacent to each other. In addition, the physical sequence of records is ordered on
some key called the primary key.

Mode of retrieval

The mode of retrieval may be either batched or real time.

In real-time retrieval the response to any query is immediate. Example: in an airline
reservation system, one must be able to determine the status of a flight in a matter of
seconds.

In a batch processing system the response time is not significant. Requests for retrieval are
batched together on a 'transaction' file until either enough requests have been received or a
suitable amount of time has passed. Then the requests in the transaction file are processed.

Mode of update

The mode of update can either be batched or real time.

In a real-time system, the update is made immediately. For example, in a reservation system,
as soon as a seat on a flight is reserved, the file must be updated immediately to reflect the
changes made to the file.

In a batch system, the updating is made when the transaction file is processed. For example,
in a banking system, all deposits and withdrawals made on a particular day are collected on a
transaction file and updates are made at the end of the day. The system contains two types of
files:

the 'master file' and the 'transaction file'.

For a batch processing system magnetic tape is an adequate storage medium. The master file
represents the file status. The transaction file contains all update requests that have not yet been
reflected in the master file, so the master file is always 'out of date' to the extent that update
requests have been batched on the transaction file. The master file contains records which are
sorted on the primary key of the file. The requests for retrieval and update are on the
transaction file. When it is time to process the transaction file, the transactions are sorted on
the key and an update process is carried out to create a new master file. All the records in
the old master file are examined, changed if necessary and then written on to the new master
file. The time required for this process is O(n + m log m), where n is the number of master
file records and m the number of transactions.

Sequential organization is also possible on direct access storage devices (DASD). Even
though the disk storage is really two-dimensional (cylinder x surface) it can be mapped into
a one-dimensional memory. If a disk contains c cylinders and s surfaces, one way is to view
the disk memory sequentially as given in the following figure (figure 1).

Cylinder 1 Cylinder 2 . . . Cylinder c

Sequence for cylinders

Surface 1 Surface 2 ... Surface s

Sequence within a cylinder



Using the notation tij to represent the jth track of the ith surface, the sequence is t11, t21, ..., ts1,
t12, ..., ts2 and so on.

The other way of representing sequential file organization is to access tracks in order : all
tracks of surface 1, all tracks of surface 2, etc.

Surface 1 surface 2 … surface S

For each surface the tracks are

Track1 track2 track3 track c

If the records are of the same size, the binary search technique can be used to search for a record
with the required key. For a file containing n records, at most log₂ n accesses are to be made.
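Binary search over a key-ordered file can be sketched in Python (an illustrative sketch using parallel lists of keys and record addresses, matching the index example that follows; the sample data is an assumption):

```python
from bisect import bisect_left

def find_record(keys, addresses, x):
    # keys must be in increasing primary-key order; returns the
    # stored address for key x, or None, after O(log2 n) probes.
    i = bisect_left(keys, x)
    if i < len(keys) and keys[i] == x:
        return addresses[i]
    return None

keys = [510, 620, 750, 800, 950]
addresses = ['a2', 'a5', 'a4', 'a1', 'a3']
```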

If the records are of variable size, binary search cannot be used and sequential search has to be
applied. The retrieval time can, however, be reduced by maintaining an index. An index contains
(key, address) pairs. For record retrieval, first the index is referenced, then the record
is read directly from that address on the storage medium. For example, for the table given in
figure 1, one can maintain an index for the key empno as given below.

address key

A1 510

A2 620

A3 750 A1, A2, A3, A4,A5 are addresses

of records on the storage medium.



Disadvantages of sequential file organization:

 Updates are not easily accommodated.

 By definition, random access is not possible.

 All records must be structurally identical. If a new field is to be added, then every
record must be rewritten to provide space for the new field.

 Contiguous areas may not be possible because both the primary data file and the
transaction file must be locked during merging.

Area of use

Sequential files are most frequently used in commercial batch oriented data processing
applications where there is the concept of a master file to which details are added
periodically. Ex. Payroll applications

INDEX TECHNIQUES

One of the important components of a file is its directory. A directory is a collection of indexes.


The directory may contain one index for every key or may contain an index for only some of
the keys. Some of the indexes may be dense (i.e. contains an entry for every record ) while
the others may be non-dense ( contains an entry for some of the records )

An index is a collection of pairs of the form (key value, address). If the records of table 1
are stored at addresses a1, a2, a3, ..., a5 respectively, then an index for the key empnumber
would have entries (800, a1), (510, a2), (950, a3), (750, a4) and (620, a5). The index is dense
since it contains an entry for each record. In the case of the occupation key index, the number of
records with occupation = 'programmer' is three and with occupation = 'analyst' is two, so
each entry of the index corresponds to several records. The difficulty can be overcome by
keeping in the address field of each distinct key value a pointer to another address where we
maintain a list of addresses of records having this value. If at address b1 we store the list of
addresses of all programmer records, i.e. a1, a4 and a5, and at b2 the addresses of all analysts,
i.e. a2 and a3, then we achieve the index of the occupation field as ('programmer', b1)
and ('analyst', b2). Another method is to change the format of the entries in an index to (key
value, address 1, address 2, ..., address n). The second method is for records of variable size.

An index differs from a table essentially in its size. While a table was small enough to fit into
available internal memory, an index is too large for this and has to be maintained on external
storage devices ( floppy, hard disk, etc.).



Accessing a word of information from internal memory takes about 10^-8 seconds while
accessing the same word from a disk could take about 10^-1 seconds.

1. CYLINDER-SURFACE INDEXING

The simplest of all indexing techniques is cylinder-surface indexing. It is useful only for the
primary key index of a sequentially ordered file. The sequential interpretation of the disk
memory is shown in figure 1.

It is assumed that the records are stored sequentially in increasing order of the primary
key. The index consists of a cylinder index and several surface indexes. If the file requires
c cylinders (1 through c) for storage, then the cylinder index contains c entries. Associated
with each of the c cylinders is a surface index. If the disk has s usable surfaces, then each
surface index has s entries. The ith entry in the surface index for cylinder j is the value of the
largest key on the jth track of the ith surface. The total number of surface index entries is s·c.

A search for a record with a particular key value X is carried out by first reading the
cylinder index into memory. Since the number of cylinders in a disk is only a few hundred,
the cylinder index usually occupies only one track. The cylinder index is searched to determine
which cylinder possibly contains the desired record. The search can be carried out by binary
search in the case when each entry requires a fixed number of words. If that is not feasible,
the cylinder index can consist of an array of pointers to the starting points of individual key
values. In either case the search can be carried out in O(log c) time.

Once the cylinder index is searched and the appropriate cylinder determined, the surface index
corresponding to that cylinder is retrieved from the disk. The number of surfaces on a disk is
usually very small, so the best way to search a surface index is sequential search.
Having determined which surface and cylinder are to be accessed, this track is read in and
searched for the record with the desired key. So the total number of disk accesses is three (one
to access the cylinder index, one for the surface index and one to read the track itself).
When track sizes are very large it may not be feasible to read in the whole track. In this
case the disk will usually be sector addressable, and so an extra level of indexing is needed:
the sector index. In this case the number of accesses needed to retrieve a record will be four.
When the file extends over several disks, a disk index is also maintained.
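The three-access lookup described above can be sketched as follows. This is a minimal simulation, not a real disk driver: the index contents, the function name and the access counter are all illustrative assumptions, and each Python lookup stands in for one disk access.

```python
import bisect

# Hypothetical two-level ISAM index: cylinder_index[j] holds the largest key on
# cylinder j+1; surface_index[j][i] holds the largest key on surface i of that
# cylinder. Records are assumed sorted by primary key across cylinders.
def isam_search(x, cylinder_index, surface_index):
    accesses = 1                                  # access 1: read the cylinder index
    c = bisect.bisect_left(cylinder_index, x)     # binary search: O(log c)
    if c == len(cylinder_index):
        return None, accesses                     # x exceeds every key in the file
    accesses += 1                                 # access 2: read that cylinder's surface index
    for s, largest in enumerate(surface_index[c]):  # sequential search: few surfaces
        if x <= largest:
            accesses += 1                         # access 3: read the track itself
            return (c, s), accesses
    return None, accesses

# Example: 3 cylinders, 2 surfaces each (largest keys per track are made up).
cyl = [300, 600, 900]
surf = [[150, 300], [450, 600], [750, 900]]
print(isam_search(500, cyl, surf))   # key 500 -> cylinder 1, surface 1, 3 accesses
```

Note how the access count is fixed at three for any key present in the file, which is the property that makes ISAM attractive.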

This method of maintaining a file and index is referred to as ISAM (Indexed Sequential
Access Method). It is probably the most popular and simplest file organization in use for

DEPARTMENT OF COMPUTER SCIENCE-TIPSCAS | 118


single key values. When the file contains more than one key, it is not possible to use this
index organization for the remaining keys.

2. HASHED INDEXES

The principles involved in maintaining hashed indexes are essentially the same as those of
hash tables.

All the hash functions and overflow techniques of hash tables are applicable to hashed
indexes also.

The overflow techniques are

1. Rehashing

2. Open addressing

a. Random

b. Quadratic

c. Linear

3. Chaining

(refer to unit 4)

3. TREE INDEXING

AVL trees can be used to search, insert and delete entries from a table of size n using at
most O(log n) time. Suppose the AVL tree resides on a disk. If nodes are retrieved from the disk, one
at a time, then a search of an index with n entries would require at most 1.4 log n disk
accesses (the maximum depth of an AVL tree is 1.4 log n). This is far worse than the
cylinder-surface index. Therefore a balanced tree based upon an m-way search tree is used,
which performs better than a binary search tree.

Definition:

An m-way search tree, T, is a tree in which all nodes are of degree ≤ m. If T is
empty (T = nil) then T is an m-way search tree. When T is not empty it has the following
properties:

(i) T is a node of the type

n, A0, (K1, A1), (K2, A2), ..., (Kn, An)

where the Ai, 0 ≤ i ≤ n, are pointers to the subtrees of T, the Ki, 1 ≤ i ≤ n, are key
values, and 1 ≤ n < m.



(ii) Ki < Ki+1, 1 ≤ i < n.

(iii) All key values in the subtree Ai are less than the key value Ki+1, 0 ≤ i < n.

(iv) All key values in the subtree An are greater than Kn.

(v) The subtrees Ai, 0 ≤ i ≤ n, are also m-way search trees.

A B-tree is a balanced m-way tree. A node of the tree may contain many records or keys and
pointers to children.

A B-tree is also known as the balanced sort tree. It finds its use in external sorting. It is
not a binary tree.

To reduce disk accesses, several conditions of the tree must be true:

 The height of the tree must be kept to a minimum;

 there must be no empty subtrees above the leaves of the tree;

 the leaves of the tree must all be on the same level; and

 all nodes except the leaves must have at least some minimum number of children.

Figure A. A 3-way search tree: the root a contains the keys 20 and 40; its children are
b (keys 10 and 15), c (keys 25 and 30) and d (keys 45 and 50); node e, the right child
of c, contains the key 35.

The procedure to search an m-way search tree is given below:

procedure msearch(t: mtree; x: integer; var p: mtree; var i, j: integer);
{Search the m-way search tree t residing on disk for the key value x.
The individual node format is n, A0, (K1,A1), ..., (Kn,An), n < m.
A triple (p,i,j) is returned. j = 1 implies that x is found at node p with key Ki;
else j = 0 and p is the node into which x can be inserted.}
label 99;
begin
  p := t; K0 := -maxint; q := nil; j := 1;
  while p <> nil do
  begin
    input the node located at p from disk;
    let this node define n, A0, (K1,A1), ..., (Kn,An);
    Kn+1 := maxint;
    let i be such that Ki <= x < Ki+1;
    if x = Ki then begin return (p,i,1); goto 99 end;
    q := p; p := Ai;
  end;
  p := q; j := 0; return (q,i,0);
99: end;

In figure A, if we want to search for 35, a search in the root node indicates that the
appropriate subtree to be searched is the one with root A1, at address c. A search of this
node indicates that the next node to search is at address e. The key value 35 is found in this
node and the search terminates. So if this search tree resides on a disk, the search for
x = 35 requires accessing the nodes at addresses a, c and e, for a total of three disk
accesses. The maximum number of disk accesses made is equal to the height of the tree. (To
minimize the number of disk accesses, we minimize the height of the search tree.)
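The msearch procedure can be sketched in Python over the tree of Figure A. This is an in-memory sketch, not the on-disk version: the tuple node encoding and the list of visited node names (which simulates disk accesses) are illustrative assumptions.

```python
# Each node is (name, keys, children), where children[i] points to the subtree
# whose keys lie between keys[i-1] and keys[i] (children[n] holds keys > keys[n-1]).
# Node names a..e mirror the addresses in Figure A and are used only for tracing.
e = ("e", [35], [None, None])
b = ("b", [10, 15], [None, None, None])
c = ("c", [25, 30], [None, None, e])
d = ("d", [45, 50], [None, None, None])
a = ("a", [20, 40], [b, c, d])

def msearch(t, x):
    visited = []                       # one entry per simulated disk access
    while t is not None:
        name, keys, children = t
        visited.append(name)
        # find i such that keys[i-1] <= x < keys[i]
        i = 0
        while i < len(keys) and x >= keys[i]:
            i += 1
        if i > 0 and keys[i - 1] == x:
            return True, visited       # x found at this node
        t = children[i]                # descend into the appropriate subtree
    return False, visited

print(msearch(a, 35))   # (True, ['a', 'c', 'e']) -- three accesses, as in the text
```

The trace confirms the worked example: searching for 35 touches exactly the nodes at addresses a, c and e.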



 RANDOM ORGANIZATION:

In this organization the records are stored at random locations on disks. The following
techniques are used for randomization.

 Direct addressing

 Directory lookup

 Hashing

Direct addressing

In direct addressing with equal-size records, the available disk space is divided into nodes large
enough to hold a record. The numeric value of the primary key is used to determine the node
into which a particular record is stored. No index on this key is required. With primary key
= empno, the record for empno = 259 will be stored in node 259. In this organization,
searching for and deleting a record given its primary key value requires only one disk access,
whereas updating requires two disk accesses (one to read and another to write back the
modified record).
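The access counts above can be sketched as follows. The class, its field names and the employee record are illustrative assumptions; each list access stands in for one disk access.

```python
# Minimal sketch of direct addressing with equal-size records: the primary key
# value itself names the node (slot), so no index is needed.
class DirectFile:
    def __init__(self, capacity):
        self.nodes = [None] * capacity   # one node per possible key value

    def search(self, empno):             # 1 disk access
        return self.nodes[empno]

    def insert(self, empno, record):     # 1 disk access (write the node)
        self.nodes[empno] = record

    def update(self, empno, **fields):   # 2 disk accesses: read, then write back
        record = dict(self.nodes[empno])
        record.update(fields)
        self.nodes[empno] = record

f = DirectFile(1000)
f.insert(259, {"name": "A", "salary": 10000})   # stored directly in node 259
f.update(259, salary=12000)
print(f.search(259))   # {'name': 'A', 'salary': 12000}
```

The cost of this simplicity is space: every possible key value reserves a node, which is why the identifier density n/T matters.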

In the case of variable-size records, an index is set up with pointers to the actual records on the disk,
as shown in the following figure. The number of disk accesses is one more than for fixed-size
records.

(Figure: an index whose entries point to the variable-size records B, E, D, A and C stored on the disk.)

The space efficiency of direct addressing depends on the identifier density n/T (n = the number
of distinct primary key values in the file, T = the total number of possible primary key values).



Directory lookup

This is similar to the direct addressing technique for variable-size records, except that the index is
not of the direct access type but a dense index maintained using a structure suitable for index
operations. Retrieving a record involves searching the index for the record's address and then
accessing the record itself. This technique makes more efficient use of space than direct
addressing, but requires more accesses for retrieval and update, since searching the index will
generally require more than one access.

Hashing

The principle of hashed file organization is the same as that of a hash table. The available
space is divided into buckets and slots. Some space may have to be set aside for an overflow
area in case chaining is being used to handle overflows. When variable-size records are
present, the number of slots per bucket is only a rough indicator of the number of
records a bucket can hold; the actual number varies dynamically with the sizes of the records
in a particular bucket.
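A hashed file with chaining can be sketched as follows. The bucket count, slot count, division hash function and employee numbers are all illustrative assumptions, not values from the text.

```python
# Minimal sketch of a hashed file: b buckets of s slots each, with records that
# overflow a full bucket chained into a separate overflow area.
class HashedFile:
    def __init__(self, buckets=4, slots=2):
        self.slots = slots
        self.buckets = [[] for _ in range(buckets)]   # primary area
        self.overflow = {}                            # bucket number -> chained records

    def _h(self, key):
        return key % len(self.buckets)                # simple division hash

    def insert(self, key, record):
        b = self._h(key)
        if len(self.buckets[b]) < self.slots:
            self.buckets[b].append((key, record))     # free slot in the bucket
        else:
            self.overflow.setdefault(b, []).append((key, record))  # chain it

    def search(self, key):
        b = self._h(key)                              # one access for the bucket,
        for k, r in self.buckets[b] + self.overflow.get(b, []):  # maybe one more for the chain
            if k == key:
                return r
        return None

hf = HashedFile()
for empno in (510, 620, 750, 800, 514):
    hf.insert(empno, {"empno": empno})
print(hf.search(514))   # found on the overflow chain of its bucket
```

Here 510, 750 and 514 all hash to the same bucket; with two slots per bucket, 514 lands in the overflow area and is found by following the chain.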

Random organization on the primary key, using any of the above three techniques,
overcomes the difficulties of sequential organizations. One of the main disadvantages of random
organization is that batch processing of queries becomes inefficient, as the records are not
maintained in order of the primary key. For example, the query "retrieve the records of all
employees with employee number > 800" needs to examine every node.

 LINKED ORGANISATION

Linked organization differs from sequential organization essentially in that the logical
sequence of records is generally different from the physical sequence. In a sequential
organization, if the ith record of the file is at location Li, then the (i+1)th record will be at Li + c,
where c is the size of the ith record.

In a linked organization the next record is obtained by following the link value from the present
record. Linking records together by the primary key facilitates the deletion and insertion of
records once the place at which the insertion or deletion is to be made is known. An index with
ranges of employee numbers can be maintained to facilitate searching on the empno key.

For example, the ranges 501-700, 701-900 and 901-1100 of employee numbers can be created for the
EMPLOYEE table given in Table 1. All records having an empno in the same range are
linked together, as shown in the following figure.



(Figure: a range index with upper values 700, 900 and 1100. The entry with upper value 700
heads the chain containing records E and B; the entry with upper value 900 heads the chain
containing records D and A; record C lies on the chain for the last range.)
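A search through such a range index and its chains can be sketched as follows. The record addresses and link values below are illustrative assumptions chosen to match the figure (empnos for B, E, D and A follow the empno index shown later; C's empno is made up).

```python
# Minimal sketch of a linked organization: records in the same empno range are
# chained by link fields, and a small range index points at each chain's head.
records = {                       # address -> (empno, link to next record in chain)
    "E": (620, "B"), "B": (510, None),
    "D": (750, "A"), "A": (800, None),
    "C": (950, None),
}
range_index = [(700, "E"), (900, "D"), (1100, "C")]   # (upper value, chain head)

def search(empno):
    for upper, head in range_index:   # pick the right chain from the index
        if empno <= upper:
            p = head
            while p is not None:      # follow the link fields within the chain
                key, link = records[p]
                if key == empno:
                    return p          # return the record's address
                p = link
            return None
    return None

print(search(620))   # 'E' -- found on the first chain
```

Because each chain covers only one range, a search scans at most one short list rather than the whole file.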

Using an index in this way reduces the length of the lists and thus the search time. This idea
can be generalized to allow for easy secondary key retrieval. We set up indexes for each key
and allow records to be in more than one list. This leads to multi list representation.

Retaining the list lengths enables us to reduce search time by allowing us to search the smaller
list (in the case of Boolean queries).

(Figure: a multilist representation of the EMPLOYEE file. Each index entry holds a key value,
the length of its list, and a pointer to the first node on the list. The empno index has entries
with max empno 700, 900 and 1100 and list lengths 2, 2 and 1; the occupation index has the
entries analyst (length 2) and programmer (length 3); the salary index has the entries
<= 9000 (length 1), <= 12000 (length 3) and <= 15000 (length 1). Each record node B, E, D,
A and C carries an empno link, an occupation link and a salary link.)

Inserting a new record into a multilist structure is easy so long as the individual lists do not
have to be maintained in some order; a record may simply be inserted at the front of the appropriate
list. Deletion of a record is difficult, since there are no back pointers. Deletion may be
simplified if we maintain each list as a doubly linked list.

 CORAL RINGS

The coral ring structure is the same as the doubly linked multilist structure. Each list is
structured as a circular list with a head node. The head node for the list for the key value Ki =
x has an information field with value x. In each record, the field for the key Ki is replaced by a link
field. Thus for each record y the coral ring contains two fields: y↑.alink[i] and y↑.blink[i].

The alink is used to link together all records with the same key Ki. The alinks form a
circular list with a head node whose information field retains the value of Ki for the records
in the ring.

The blink field serves for some records as a back pointer and for others as a pointer to the
head node. To distinguish between these two, y↑.field[i] is used. If y↑.field[i] = 1 then y↑.blink[i] is
a back pointer to the nearest record z preceding y in its circular list for Ki for which
z↑.blink[i] is also a back pointer; if y↑.field[i] = 0 then y↑.blink[i] points to the head node.

In any given circular list, all records with back pointers form another circular list in the
reverse direction.

The presence of these back pointers makes it possible to carry out the deletion without having
to start from the beginning of the list.



(Figure: a coral ring for the key value "analyst", with head node α. The forward circular list,
formed by the alink fields, contains the nodes α, A, B, C and D; the reverse circular list of
back pointers contains the nodes α, C and A.)

 INVERTED FILES

Inverted files are similar to multilists. The difference is that while in multilists the records with
the same key value are linked together, with the link information kept in the individual records,
in inverted files this information is kept in the index itself. The following
are the indexes for the file given in figure 1.

Empno index           Occupation index

510     B             Analyst      B, C
620     E             Programmer   A, D, E
750     D
800     A

Salary index          Sex index

900     E             Female   B, C, D
10,000  A             Male     A, E
12,000  C, D

The index for every key is dense and contains a value entry for each distinct value in the file.
Since the index entries are of variable length (the number of records with the same key value is
variable), index maintenance becomes more complex than for multilists.

Benefits of inverted file organization

Boolean queries require only one access per record satisfying the query (in addition to the
accesses needed to process the indexes). A query of the type K1 = XX or K2 = XY can be processed
by first accessing the indexes and obtaining the address lists for all records with K1 = XX and
for all records with K2 = XY; these two lists are then merged to obtain a list of all records
satisfying the query. The query K1 = XX and K2 = XY is handled by intersecting the two lists.
Similarly, K1 ≠ XX can be obtained by taking the difference between the universal list (the list
of all records) and the list for K1 = XX.
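These list operations can be sketched directly as set operations on the address lists held in the indexes; the dictionaries below mirror the occupation and sex indexes shown above, and Python's built-in set operators are an illustrative stand-in for the merge, intersection and difference steps.

```python
# Address lists from the inverted indexes above (record addresses A..E).
occupation = {"Analyst": {"B", "C"}, "Programmer": {"A", "D", "E"}}
sex = {"Female": {"B", "C", "D"}, "Male": {"A", "E"}}
universe = {"A", "B", "C", "D", "E"}            # the universal list of all records

# occupation = Analyst OR sex = Male: merge (union) the two address lists
print(sorted(occupation["Analyst"] | sex["Male"]))       # ['A', 'B', 'C', 'E']

# occupation = Programmer AND sex = Female: intersect the two address lists
print(sorted(occupation["Programmer"] & sex["Female"]))  # ['D']

# NOT (occupation = Analyst): difference from the universal list
print(sorted(universe - occupation["Analyst"]))          # ['A', 'D', 'E']
```

Only after these list operations are the surviving records fetched, which is why the access count is one per record satisfying the query.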

The retrieval works in two steps.

 Indexes are processed to obtain a list of records satisfying the query

 The records are retrieved using the list.

The number of disk accesses needed is the number of records being retrieved plus the number
needed to process the indexes.

Inverted files result in space saving compared to other file structures when record retrieval
does not require retrieval of key fields. In this case the key fields may be deleted from the
records.

Insertion and deletion of records requires only the ability to insert and delete within indexes.

 CELLULAR PARTITIONS:



In order to reduce file search times, the storage medium may be divided into cells. A cell may
be an entire disk pack or it may simply be a cylinder. Lists are localized to lie within a
cell. If we have a multilist organization in which the list for key1 = prog includes
records on several different cylinders, then we can break the list into several smaller lists,
where each prog list includes only those records on the same cylinder. By doing this, all the
records in the same cell (i.e. the same cylinder) may be accessed without moving the
read/write heads. When a cell is a disk pack, cellular partitioning makes it possible to
search different cells in parallel.

Note: ( The information given below is not related to cellular partitions)

Traversing a binary tree can be done as follows. Consider the binary tree with root A, whose
left child is B (with left child C) and whose right child is D; D has left child E (with right
child F) and right child G.

Preorder : A B C D E F G
Inorder : C B A E F D G
Postorder : C B F E G D A

Preorder:

Traverse the root node.
Traverse the left subtree in preorder.
Traverse the right subtree in preorder.

Inorder:

Traverse the left subtree in inorder.
Traverse the root node.
Traverse the right subtree in inorder.

Postorder:

Traverse the left subtree in postorder.
Traverse the right subtree in postorder.
Traverse the root node.
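The three traversals can be sketched on the tree given in the example above; the (value, left, right) tuple encoding of nodes is an illustrative choice.

```python
# The binary tree from the example: root A; A.left = B (with left child C);
# A.right = D; D.left = E (with right child F); D.right = G.
tree = ("A",
        ("B", ("C", None, None), None),
        ("D", ("E", None, ("F", None, None)), ("G", None, None)))

def preorder(t):
    if t is None:
        return ""
    v, left, right = t
    return v + preorder(left) + preorder(right)       # root, left, right

def inorder(t):
    if t is None:
        return ""
    v, left, right = t
    return inorder(left) + v + inorder(right)         # left, root, right

def postorder(t):
    if t is None:
        return ""
    v, left, right = t
    return postorder(left) + postorder(right) + v     # left, right, root

print(preorder(tree), inorder(tree), postorder(tree))
# ABCDEFG CBAEFDG CBFEGDA
```

The three outputs match the traversal orders listed in the example.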
