Data Structures and Types
Data types
• Type: A collection of values (homogeneous and effectively described)
together with a set of operators on these values
• What is a type depends on the specific programming language
What are types used for?
• Project level: Organize the information
◦ Different types for different concepts
◦ “Comments” for the intended use of identifiers
• Program level: Identify and avoid errors
◦ Types (as opposed to comments) can be automatically verified
◦ 3 + “pipo” must be an error
• Implementation level: Permit certain optimizations
◦ A boolean needs fewer bits than a real
◦ Precalculate the offset for a record/struct
Type systems
• The type system of a language:
◦ Predefined types
◦ Mechanisms to define new types
◦ Control mechanisms
• Equivalence
• Compatibility
• Inference
◦ Specifies if types are static or dynamic
• A type system is type safe if
◦ At runtime, we cannot have undetected errors deriving from type
errors
Scalar types
• Boolean
◦ Val: True or false
◦ Op: or, and, not, conditionals
◦ Repr: One byte
◦ Note: C does not have this type
• Characters
◦ Val: a, A, b,B, è, ë, ; , . . .
◦ Equality, code/decode, language-dependent
◦ Repr: One byte (ASCII) or two bytes (UNICODE)
Scalar types
• Integer
◦ Val: 0, 1, -1, 2, -2, . . . , maxint
◦ Op: +, −, ∗,mod, div
◦ Repr: Several bytes (2 or 4)
◦ Note: long integers (also 8 bytes); could have portability problems,
when it is not specified by the language
• Reals
◦ Val: rational numbers in a certain interval
◦ +, −, ∗, /, etc
◦ Repr: Several bytes (4); mobile decimal point
◦ Note: Long reals (8 bytes); severe portability problems, when it is
not specified by the language
Scalar types
• Complex numbers
◦ Val:
◦ Op:
◦ Repr: 2 reals
◦ Note: Used in Scheme and Ada
• Fixed point
◦ Val: rational numbers in a certain interval
◦ +, −, ∗, /, etc
◦ Repr: Several bytes (2 or 4)
◦ Note: Supported by Ada
Scalar types
• The void type
◦ Val: A single value
◦ Op: None
◦ Used to define the type of operations that modify the state without
returning a value
void f(...) { ...}
• The type of f must have a value
• The value, of type void, returned by f , is always the same
An initial classification
• Ordinal/discrete types
◦ Booleans, integers, characters
◦ Each value has a successor and a predecessor (except for first/last
values)
◦ Other ordinal types
• Enumerations
• Intervals (subrange)
◦ One can iterate over them or use as indicies to an array
• Scalar types
◦ All those we have seen
• Ordinals
• Reals, complex numbers
◦ Have a “direct” representation in the implementation
◦ Not composed of aggregations of other values
Enumeration Types
• Introduced in Pascal
type Days = (Mon,Tue,Wed,Thu,Fri,Sat,Sun)
• Programs are easier to understand
• Ordered values Tue < Fri
• Iteration on values: for i=Mon to Sat do
• Successor, Predecesor
• Can be represented as short integer (1 byte)
• In C
enum days = (Mon,Tue,Wed,Thu,Fri,Sat,Sun)
Equivalent to
typedef int days;
const days Mon=0, Tue=1 ...
Intervals (subrange)
• Also introduced in Pascal
• Values: An interval of values of an ordinal type (the base type of an
interval)
• Examples
type LessThanTen = 0..9;
type WorkDays = Mon..Fri;
• Represented like the base type
• Using intervals rather than base type:
◦ “Controllable” documentation
◦ Generation of efficient code
Structured/non-scalar types
• Record
◦ Collection of “fields”, each of a (possibly different) type
◦ A field is selected by its name
• Variant records
◦ Record where only one out of several fields is active for a specific
datum
• Array
◦ Function from a (scalar) index type to another type
◦ Arrays of characters are called strings, with special operators
• Sets
◦ Subsets of the base type
• Pointer
◦ Reference to an object of another type
Records
• Manipulate data of different types in a uniform manner
• C, C++, CommonLisp, Algol67: struct
• Java: No special record type; subsumed by class
• Example (C)
struct student {
char name(20);
int number; };
• Selection of a field:
student s;
s.number = 123123;
• Records can be nested
• Storable, expressible and denotable
◦ Pascal: No notion of constant record, except at the initialization
step
◦ Equality usually not defined (exception Ada)
Records: Memory layout
• Sequential storage of fields
• Aligned to word boundaries
◦ Memory wasted
• Packed records
◦ Not aligned with word boundaries
◦ More costly access
Variant records
• In a variant record, there are several (exclusive) alternatives for some
records
type student = record
name : packed array [1..6] of char;
number: integer;
case finished : Boolean of
true: (last_year: 2000..maxint);
false:(year:(first,second,third));
end;
var s: student;
s.finished := true;
s.last_year:= 2001;
• The two variant fields last_year and year can use the same location
of memory
Variant records
• Permitted in many languages
◦ C: union with struct
struct student {char name[6];
int number;
bool finished;
union {
int last_year;
int : variant records};
}
• Pascal, Modula, Ada: Unions and records combined
Array
• Collection of homogeneous data
◦ Function from index type to element type
◦ Index type: Usually discrete
◦ Elements can be of any type (rarely function type)
• Declarations
◦ C: int vec[30] (index 0 to 29)
◦ Pascal: var vect:array[0..29] of integer;
• Multidimensional array
◦ Function from index type to array type
◦ Pascal: The following are equivalent
var mat : array [0..29,a..z] of real;
var mat : array [0..29] of array [a..z] of real;
◦ These are not equivalent in Ada: The second permits slicing
Array: Operations
• Main operations
◦ Selection of an element vect[3], mat[10,’c’]
◦ Note: Modification is not an array operation, but on the location
of memory that stores an array element
• Some languages permit slicing
◦ Selection of a contiguous part of an array
◦ Example: Ada with
mat : array(1..10) of array (1..10) of real;
mat(3) indicates the third row of matrix mat
• In the presence of slicing, other global operations on array may be
possible
◦ Fortran90: A+B: Sum of elements of A and B (of the same type)
◦ APL has many operations for manipulating arrays
Slicing an array
int W[21..30];
type Nano = {Brontolo, Cucciolo, Dotto, Eolo, Gongolo, Mammolo,
Pisolo};
int V[1..10,1..10];
char C[Dotto..Mammolo,0..10,1..10]
8.4 Composite Types 217
Fig. 8.4 A slice through an array
In some languages, a multidimensional array can be obtained by declaring that
the type of the array elements is, at the same time, an array. A possible syntax in our
pseudo-language could be the following, so we can declare the above as:
int V[1..10][1..10];
char C[Dopey..Sleepy][0..10][1..10];
In some languages, the two models (multidimensional array and array of array)
are equivalent (one is an abbreviation for the other). This is the case in Pascal. In
other languages, only one of the two possibilities is admitted. In C, a multidimen-
sional array is an array of arrays and must be declared as such. Other languages,
finally, allow both models, but arrays of arrays support additional operations (slic-
ing, as will shortly be seen) that are only defined over multidimensional arrays.
Operations on Arrays The simplest operation permitted on an array is element
selection. It is performed using an index variable. The most common notation uses
square brackets: W[e] indicates the element of W corresponding to the index de-
Storage of arrays
• Elements stored in contiguous locations. Two possibilities
◦ Row order: V [1, 1], V [1, 2], . . . , V [1, 10], V [2, 1], . . .
• Most commonly used
◦ Column order: V [1, 1], V [2, 1], V [3, 1], . . . , V [10, 1], V [1, 2], . . .
• Order is important in systems with caching
Array: Calculating addresses
• Location corresponding to A[i, j, k] (stored by row)
◦ A:array[l1..u1,l2..u2,l3..u3] of elem_type
• S3: size of an element of type elem_type
• S2 = (u3 − l3 + 1) ∗ S3 size of a row
• S1 = (u2 − l2 + 1) ∗ S2 size of a plane
◦ Location of A[i, j, k] is the sum of:
• α: Address of beginning of A
• (i − l1) ∗ S1: Start of plane of A[i, j, k]
• (j − l2) ∗ S2: Start of row of A[i, j, k]
• (k − l3) ∗ S3: Start of A[i, j, k]
◦ If the shape is known at compilation type, we can rewrite this as
i ∗ S1 + j ∗ S2 + k ∗ S3 + α
◦ If all indicies start at 0, no precomputation is needed
Form/shape
• Number of dimensions or intervals of an index
• Which is fixed?
◦ Static form
◦ Form fixed on evaluation of the declaration
◦ Dynamic form
Array: Dope vector
• Information about the form of the array is maintained by the compiler
• If the form can vary during runtime, the information is maintained in
an array called the dope vector of the array. This contains:
◦ Pointer to the beginning of the array
◦ The number of dimensions
◦ Lower bound (upper bound, if we want to control that at runtime)
◦ Space needed for each dimension
• This is stored in the fixed part of the activation record
Example of Activation record with dope vector
Pointer type
• Some languages allow referring to and modifying an l-value without
dereferencing it
• Values: References (l-values) and constant null or nil
• Operations:
◦ Creation (library function that returns a pointer; malloc)
◦ Dereferencing: Access to the object pointed to, *p
◦ Equality test (including test of equality with null)
Pointers and arrays in C
• In C, pointers and arrays are interchangeable
int n;
int *a; // pointers to integers
int b[10]; // array of 10 integers
...
a = b; // a points to the initial element of b
n = a[3]; // n has the value of the third element of b
n = *(a+3); // idem
n = b[3]; // idem
n = *(b+3); // idem
• But remember that a[3]=a[3]+1 also modifies b[3]
• Multidimensional arrays: b[i][j] is the same as
◦ (*(b+i))[j]
◦ *(b[i]+j)
◦ *(*(b+i)+j)
Type equivalence and compatibility
• Two types T and S are equivalent if every object of type T is also of
type S, and vice versa
• T is compatible with S, if any object of type T can be used in a
context where an object of type S is expected
Equivalence by name
• Two types are equivalent if they have the same name
• Used in Pascal, Ada, and Java
• Loose equivalence by name: Pascal, Modula-2
◦ A declaration of an alias of a type generates a new name, not a
new type
type A = record ... end;
type B = A;
◦ A and B are names of the same type
Structural equivalence
• Two types are equivalent if they have the same structure
◦ Definition (structural equivalence). Structural equivalence between
types is the minimal equivalence relation that satisfies:
• A type name is equivalent to itself
• If T is defined as T = expression, then T is equivalent to
expression
• Two types constructed using the same type constructor, based
on equivalent types, are equivalent
◦ Equivalence controlled by rewriting
Compatibility
• T is compatible with S if objects of type T can be used in contexts
where objects of type S are expected
◦ Example: int n; float r; r=r+n
• The definition depends on the language. S is compatible with T if
◦ T and S are equivalent
◦ The values of T are a subset of the values of S (interval)
◦ All the operations on values of S can be performed on values of T
(extension of record)
◦ There is a natural correspondence between values of T and values
of S (int to float)
◦ The values of T can be made to correspond to some values of S
(float to int with truncating)
Type conversion
• If T is compatible with S, there is some type conversion mechanism.
The main ones are
◦ Implicit conversion, also called coercion. The abstract machine
does the conversion, with no mention at the language level
◦ Explicit conversion, or cast, when the conversion is mentioned in
the program
Coercion
• Coercion indicated, in a case of compatibility. How the conversion
should be done
• Three possibilities. The types are different, but
◦ Have the same values and the same representation. E.g., types
that are structurally the same, but have different names
• Conversion only at compile time
◦ Different values, but the common values have the same represen-
tation. E.g., intervals and integers
• Code for dynamic control when there is an intersection
◦ Different representations for the values. E.g., reals and integers
◦ Code for the conversion
Cast
• In certain cases, the programmer must insert explicit type conversion
(C, Java: cast)
◦ S s = (S) T
For example r = (float) n and n=(int) r
• Cases similar to coercion
• Not every explicit conversion is allowed
◦ Only when the language knows how to do the conversion
◦ Can always be done when types are compatible (useful for docu-
mentation)
• Modern languages prefer cast to coercion.
Polymorphism
• A single value has multiple types
• Three forms of polymorphism
◦ Ad hoc polymorphism (overloading)
◦ Universal polymorphism
• Parametric polymorphism (explicit or implicit)
• Subtype polymorphism
Ad hoc polymorphism: Overloading
• The same symbol has different meanings
◦ 3+5
◦ 4.3+5.3
• The compiler translates “+” in different ways
• Resolved in compile time, after type inference
Parametric polymorphism
• A value has parametrized universal polymorphism when the value has
an infinite number of possible types, obtained by instantiation of a
general type schema
• Polymorphic function is a single definition that is applied uniformally
to all instances of a general type
◦ Ide(x)=x;
◦ sort(v) = ... ;
◦ Ide has type <T>-><T>
◦ Ide has type <T>[]->void
T is a type variable
Explicit parametrized polymorphism
• In C++: Function template
◦ Swap to exchange two integers
void swap (int &x, int &y){
int tmp = x; x=y; y=tmp;}
◦ A template to swap any two objects. T is the type (sort) of the
parameter
template <typename T>
void swap (T &x, T &y){
T tmp = x; x=y; y=tmp;}
◦ Automatic instantiation.
T becomes int at link time
int i,j;swap(i,j);
T becomes float at link time
float r,s;swap(r,s);
T becomes string at link time
String v,w;swap(v,w);
Parametric polymorphism in ML (implicit)
• Swap function in ML
swap(x,y) = let val tmp = !x in
x = !y; y = tmp end;
val swap = fn : ’a ref * ’a ref -> unit
• ML inserts the template automatically when possible
• Instantiation at compile time
x, y are integer variables
swap(x,y);
val it = (): unit
v, w are string variables
swap(v,w);
val it = (): unit
Subtype polymorphism
• Similar to explicit polymorphism, but not all types can be used to
instantiate the general one
• Suppose T is a subtype of S, written T < S
• Definition: A value has subtype (or limited) polymorphism if it has
an infinite number of possible types, obtained by substituting for a
parameter all the subtypes of the given type
• A polymorphic function is a single program that can be applied uni-
formally to all legal instances of the general type
Handling dangling references: Tombstones
• Dangling reference: A pointer to an object that is no longer valid
8.10 Avoiding Dangling References 249
Fig. 8.10 Tombstones
8.10.2 Locks and Keys
An alternative to the tombstone technique is called locks and keys. This solves the
problem of dangling references into a heap using a mechanism which does not suffer
from the accumulated problems of tombstones.
Every time an object is created on the heap, the object is associated with a lock
which is a word of memory in which an arbitrary value is stored. (Strictly speak-
ing, it should be a value that can be sequentially incremented every time an object
is allocated, but which avoids values such as 0 and 1, the code of frequently used
characters, and so on.) In this approach, a pointer is composed of a pair: the address
proper and a key (a word of memory that will be initialised with the value of the
lock corresponding to the object being pointed to). When one pointer is assigned to
another, the whole pair is assigned. Every time the pointer is dereferenced, the ab-
stract machine checks that the key opens the lock. In other words, it verifies that the
information contained in the key coincides with that in the lock. In the case in which
the values are not the same, an error is signalled. When an object is deallocated, its
lock is invalidated and some canonical value (for example, zero) is stored, so that
all the keys which previously opened it now cause an error (see Fig. 8.11). Clearly,
Locks and keys
Garbage collection
• User allocates memory
• Deallocation not allowed
• The system periodically recovers allocated memory that is no longer
in use
◦ No longer in use: No valid way to access it.
Garbage collection: Reference count
252 8 Structuring Data
Fig. 8.12 Reference counting
the counter belonging the object pointed to by q is incremented by one, while the
counter of object pointed to by p is decremented by one. When a local environ-
ment is exited, the counters of all the objects pointed to by pointers local to the
environment are decremented. Figure 8.12 shows operation of this technique in di-
agrammatic form.
When a counter returns to the value 0, its associated object can be deallocated
and returned to the free list. This object, moreover, might contain internal pointers,
so before returning the memory occupied by an object to the free list, the abstract
machine follows the pointers inside the object and decrements by one the counters
of the objects they point to, recursively collecting all the objects whose counter has
reached 0.
From the viewpoint of the abstract division into two phases of a garbage collector,
the update and checking of counters implement the garbage-detection phase. The
collection phase occurs when the counter reaches zero.
A clear advantage of this technique is that it is incremental because checking
and collection are interleaved with the normal operation of the program. With a few
Potential problem: Circular structure
ction 253
structures
nts
and Sweep
weep method takes its name from the way in which the two abstract
oned at the start are implemented:
gnize that something is unused, all the objects in the heap are tra-
ng each of them has “not in use”. Then, starting from the pointers
ive and present on the stack (the root set), all the data structures
heap are traversed recursively (the search is usually performed
d every object that is encountered is marked as “in use”.
eap is swept—all blocks marked “in use” are left alone, while those
Garbage collection: Mark and sweep
• Mark all the objects in the heap as “unused”
• Starting from a pointer not in the heap, visit all the linked structures,
marking each object visited as “used”
• Recover from the heap all unused items
Mark and sweep
• Space needed for garbage collection
• When garabage collector is running, the user sees a significant reduc-
tion in performance
• Java: Incremental GC
Stack for depth first visit
Pointer reversal
Exercise 1
Consider the declaration of the multi-dimensional array:
int A[10][10][10]
We know that an integer can be stored in 4 bytes. The array is stored in
row order, with increasing memory addresses (that is, if an element is at
address i, its successor is at i + 4, and so on). The element A[0][0][0] is
stored at address 0. State the address at which element A[2][2][5] is stored
Exercise 2
Instead of the contiguous multidimensional array allocation that we have
discussed, some languages allow (e.g., C), or adopt (Java), a different or-
ganisation, called row-pointer. Let us deal with the case of a 2-dimensional
array. Instead of storing rows one after the other, every row is stored sepa-
rately in some region of memory (for example on the heap). Corresponding
to the name of the vector, a vector of pointers is allocated. Each of the
pointers points to a proper row of the array.
• Give the formula for accessing an arbitrary element A[i][j] of an
array allocated using this scheme.
• Generalise this memory technique to arrays of more than 2 dimensions.
• Discuss the advantages and disadvantages of this organisation in the
general case.
• Discuss the advantages and disadvantages of this technique for the
storage of 2-dimensional arrays of characters (that is, arrays of strings)
Exercise 3
Consider the following (Pascal) declarations:
type string = packed array [1..16] of char;
type string_pointer = string;
type person = record
name : string;
case student: Boolean of
true: (year: integer);
false: (socialsecno: string_pointer)
end
Assume that the variable C contains a pointer to the string "LANGUAGES".
Describe the memory representation for the person record after each of the
following operations:
var pippo : person;
pippo.student := true;
pippo.year :=223344;
pippo.student := false;
pippo.socialsecno := C;
Exercise 4
Show that the integer type can be defined as a recursive type starting with
just the value null. Then write a function that checks if two integers defined
in this way are equal
Exercise 5
Given the following type definitions in a programming language which uses
structural type equivalence:
type T1 = struct{
int a;
bool b;
}
type T2 = struct{
int a;
bool b;
}
type T3 = struct{
T2 u;
T1 v;
}
type T4 = struct{
T1 u;
T2 v;
}
In the scope of the declarations T3 a and T4 b, is the assignment a = b
permitted? Why?
Exercise 6
Which type is assigned to each of the following functions using polymorphic
type inference?
fun G(f,x){return f(f(x));}
fun H(t,x,y){if (t(x)) return x;
else return y;}
fun K(x,y){return x;}
Exercise 7
Consider the following fragment in a pseudo-language with reference-based
variables and which uses locks and keys (C is a class whose structure is of
no importance):
C foo = new C(); // object OG1
C bar = new C(); // object OG2
C fie = foo;
bar = fie;
Give possible values after the execution of the fragment for all the keys and
all the locks involved.
Exercise 8
Under the same assumptions as in Exercise 7, consider the following code
fragment in which free(p) denotes the explicit deallocation of the object
referred to by the pointer p:
class C { int n; C next;}
C foo = new C(); // object OG1
C bar = new C(); // object OG2
foo.next = bar;
bar.next = foo;
free(bar);
For all the pointers in the fragment, give possible key values. For each
object in the fragment, give possible lock values. In both cases, the values
should be those after execution of the fragment terminates. After execution
of the fragment, also execute the code
foo.n = 1; foo.next = 0;
State what a possible result of this execution might be
Exercise 9
Consider the following fragment which is written in a pseudo-language with
a reference-model for variables and a reference-counting garbage collector.
If OGG is an arbitrary object in the heap, indicate by OGG.cont its (hidden)
contents:
class C { int n; C next;}
C foo(){
C p = new C(); // object OGG
p.next = new C(); // object OGG2
C q = new C(); // object OGG3
q.next = p.next;
return p.next;
}
C r = foo();
State what are the values of the reference counters for the three objects
after execution of line 6 and then of line 9.
Exercise 10
Under the same assumptions as in Exercise 9, state what the values of the
reference counters are for the two objects after the execution of the following
fragment. Which of the two can be returned to the free list?
C foo = new C(); // object OG1
C bar = new C(); // object OG2
C fie = foo;
bar = fie;
Exercise 11
Under the same assumptions as in Exercise 9, state what are the reference-
counter values for the three objects after execution of the following frag-
ment. Which of them can be returned to the free list?
class C { int n; C next;}
C foo = new C(); // object OG1
bar = new C(); // object OG2
foo.next = bar;
bar = new C(); // object OG3
foo = bar;