KEMBAR78
Computer Science Notes Year 3 | PDF | Data Type | Boolean Data Type
0% found this document useful (0 votes)
33 views130 pages

Computer Science Notes Year 3

The document contains notes by Norbert Ochieng on functional programming, discussing its significance, historical context, and the evolution of programming languages. It covers key concepts in Haskell, including types, functions, recursion, and polymorphism, emphasizing the differences between conventional and functional programming paradigms. The notes aim to provide useful insights for others studying computer science, though accuracy and completeness are not guaranteed.

Uploaded by

securenetcyber
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
33 views130 pages

Computer Science Notes Year 3

The document contains notes by Norbert Ochieng on functional programming, discussing its significance, historical context, and the evolution of programming languages. It covers key concepts in Haskell, including types, functions, recursion, and polymorphism, emphasizing the differences between conventional and functional programming paradigms. The notes aim to provide useful insights for others studying computer science, though accuracy and completeness are not guaranteed.

Uploaded by

securenetcyber
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 130

COMPUTER SCIENCE NOTES

This pdf contains notes I, Norbert Ochieng, have taken while studying. These notes are shared
here in the hope that they may be useful to others, but I do not guarantee their accuracy,
completeness, or whether they are fully up-to-date.

YEAR 3
PART A
CHAPTER 1:
Functional Programming
Why Functional Programming Still Matters
The relative cost of hardware and software has changed over time. Originally hardware cost a
lot more than software, however as machines got ever faster and cheaper, software got more
complicated and expensive. This lead to the "software crisis" of the 1970s, when for the first
time, software started to cost more than hardware.

An investigation into new programming paradigms began to try and find a solution to this
problem, however this was not universally popular. Naur in 1975 said "The era of significant
[programming] language development is now over".

The software crisis led to software being developed into smaller components, making the
entire system less prone to error, however hardware limits were still imposed. Hardware was
still stuck in serial processing, and massive parallelisation was needed to take advantage of
these components. Even today, massive parallelisation is not common.

Triangle of Programming Languages

Originally, the only aspect considered in programming was execution - getting the code to
run was the most important aspect. Early programming languages were machine-orriented.
This became an unsustainable approach in the long term, and a shift towards ease of
expression was encountered to make programming more accessible. This resulted in the high
level languages such as FORTRAN being developed. As hardware and software further
developed, a need to show correctness was developed. Most modern mainstream languages
currently implement this approach, by using things such as type-checking to help towards
correctness.

As languages developed through the eras, the effects of different vertices were felt within the
triangle of programming languages. Eventually, an all new approach was decided to be taken
with programming languages, which is to take Mathematical expressions and logic, and to
apply this Mathematical logic to machines. This results in the declarative languages, which
evolved from standard Mathematic notation.

Expression of programs

Historically, programs are something to instruct machines.

Conventional programming languages are structured like a procedure, or a recipe, and the
order of the statements is often critical. The result of the program is an accumulation of
indirect events

Functional programming read more like a series of declarations, or descriptions. The order of
these are rarely important, and the consequences of the program are a result of direct
expressions.

Conventional programming languages are quite obese and require their programs to be
verbose. Functional programming languages, however, tend to be leaner, and lend themselves
to more concise programs.

Can Programming Be Liberated


from the von Neumann Style?

John Backus, the inventor of Fortran, gave strong words against conventional programming
styles and in favour of functional programming at his Turing award lecture.

Execution and the von Neumann bottleneck

The von Neumann machine relies on sequential word at a time computing. This is not only a
physical architectural bottleneck, but a conceptual bottleneck in programming. The same
disgram and bottleneck will apply if you replaced processor with "loop", and store with array

Attempts have been made to introduce the parallelism required to take full advantage of a
functional programming paradigm by redesigning the computer architecture to move away
from the von Neumann model. This had led to some spectacular failures, such as the
Connection Machine, but more recently to some limited success, such as increased
parallelism in modern consumer CPUs. Functional programming computer architectures are
promising areas for future research.

Introduction to Haskell
A functional program is a collection of definitions, and an expression to be evaluated. The
result of the expression is simply the value of the expressions.

Types

In Hugs (the Haskell interpreter used in FUN), all values have types. Types can be
automatically inferred, but may also be declared. Haskell is a strongly typed language, and
whilst starting, it is recommended that you declare types. There is a distinction between data
types and functional types, but the aim to treat them both alike as first-class citizens.
Values of data types are built of constructors. The simplest constructors are just constants.

Char

The Char type is a primitive type. ASCII characters are the values of Char, and quoted
characters (e.g., '*', '\n') are the Char constructors.

Bool

Logical truth values are the values of Bool. True and False are the Bool constructors.

Int

A large contiguous range of integers make up the Int values, and decimal numerals
(e.g., 6345) are the Int constructors.

Note the convention that appears above, initial caps indicate fixed contents
(e.g., True and False), whereas lower case indicates variables.

Though rarely necessary, type declarations are often useful. The notation e :: T is used to
mean expression e of type T. For example, the typed constants '!' :: Char and False
:: Bool are both valid expressions, however just '!' or False would suffice. Type
declarations are more often used for names about to be given a declaration.

Pairs and Tuples

For any types t, u, there is a paired type (t, u). The types (t, u) and (u, t) are
different, unless t and u are the same.

The value of type (t, u) has two components, the first one of type t and the second
component of type u. To construct one of these pairs, the ( , ) notation is also used.

e.g., (False,'!') :: (Bool,Char)

It is sometimes useful to think of tuples are analogous with records. A tuple type (t1, t2,
..., tn) is analogous to an n-field record type. The ti can be of any type, even a function.

Lists

For any type t, there is a list type [t]. The values of type [t] are sequences of items, each
of type t.

The constant [ ] means an empty list.

A non-empty list has a construction of the form x:xs, where : (pronounced cons) is an infix
constructor. x is the first item (the head) of the list, and xs is the list of remaining items (the
tail). Convention dictates that lists have an s suffix to denote that it is plural.
The type [t] includes lists of all lengths. [] is a list 0 items, and if xs:[t] is a list
of n items, then for any x::t, (x:xs)::[t] is a list of n + 1 items.

By convention, : is right associative, so the list 0 : 1 : 2 : [] :: [Int] consists of


3 items, and the constructions looks like so:

Some useful abbreviations are also available, using the [ ] construction syntax. The above
list, for example, could be constructed using [0,1,2], or alternatively to construct lists of
numbers, using [0..2].

String is a synonym for type [Char], and strings can be represented as "hello", which
is identical to ['h','e','l','l','o'].

User-defined Data Types

New data types are defined by type equations. The LHS of a type equation is often just a type
name, and the RHS specifies alternative forms of construction by giving constructor name
and components types, with each alternate form separated by a vertical bar.

data Primary = Blue | Red | Yellow


data Road = M Int | A Int | B Int | Local Name
data Accom = Hotel Name Size Standard | BandB Price

Type synonyms can also be defined, e.g., type Name = String or type Price =
(Int,Int).

You should note that = here does not mean assignment, rather definition.

It is also possible to have recursively defined types, e.g., data Substance = Pure
String | Mix (Int,Int) Substance Substance. A value representing mortar
may therefore be Mix (5,2) (Mix (3,1) (Pure "sand") (Pure "lime"))
(Pure "water"). A diagram of this construction would look like:
Functions

A function is a rule determining a unique result value, given an argument value.

For any types a, r, there is a type a->r whose values are functions with an argument
type a and a result type r.

Functional types don't have constructors. Non-primitive functions can be defined by


equations, or expressed as lamba expressions.

If f::a->r and x::a are expressions, then f x :: r is an expression meaning the result
of applying function f to argument x.

Evaluation

Evaluation (sometimes called reduction) is the process of obtaining values for expressions.

We use the notation e ⇒ v to mean expression e has the value v. We also use the symbol ⊥
to mean undefined value. These symbols are not part of Haskell programs, they are simply
notations about programs.

Curried Functions and Multiple Arguments

By convention, the -> symbol is right associative, but function application is left-associative
and function results may themselves be functions. We can exploit these three facts so that
multiple-argument functions can conveniently be represented as a single-argument function
with a functional result - a curried function.

For a function f whose uncurried version has two arguments, currying works like so:

f :: a->b->r x::a y::b


f x y :: r

f x yields a function of type b->r.


Arithmetic Primitives

The following list contains five of the arithmetic primitives defined in the prelude: (+),(-
),(*),div,mod] :: [Int->Int->Int].

Arithmetic can also be expressed infix style: x + y means the same as (+) x y.
Identifiers such as div and mod can also be used infix too, if back-quoted: 20 `div`
3 means the same as div 20 3. This backtick style can be used to make any operator infix.

One caveat to be aware of is that function application is like an invisible operator with the
highest precedence. f x + y means (f x) + y, not f (x + y).

Constructor Functions and Polymorphism

The non-constant data constructors are functions for combining values.

For lists of Int items, such as 1:2:[], we need (:) :: Int -> [Int] -> Int, but
for lists of Char items, such as 'a':'b':[], we need (:) :: Char -> [Char] ->
[Char].

In fact, since list elements may be of any type, for all types t we will need: (:) :: t ->
[t] -> [t].

This is a legitimate type, and we refer to t as a type variable. It is distinguished from constant
type names by an initial non-capital. Types containing these type variables are polymorphic,
and functions with these parametric polymorphism always do the same thing, whatever
instance of their type is being used.

As well as the built in polymorphic types, we can also define our own polymorphic
constructions, for example to define a polymorphic tree, we could use: data TreeOf a =
Leaf a | Branch (TreeOf a) (TreeOf a), where
these Leaf and Branch constructor functions have types: Leaf :: a -> TreeOf
a and Branch :: TreeOf a -> TreeOf a -> TreeOf a.

Projectors

Projector functions are often defined to access the component values of constructions. e.g.,
non-empty lists have head and tail components, and pair types have first and second
components.

Where there is more than one type of constructions, projectors are often partial functions,
e.g., head [] = tail [] = ⊥.

Defining Values by Equations

To bind a value to a name, we make a definition comprising one or more equations. A


preceeding type declaration optional. The simplest form of definition is a single equation with
a name as the left hand side, and an expression as the right hand side.
linelength :: Int
linelength = 80

fullname :: String
fullname = title ++ forename ++ surname

The above example uses the list concatenation operator (++) :: [a] -> [a] ->
[a], which is defined in the Haskell prelude. It is important to note that these are definitions
of fact, and not assignments.

Functions

When a functional value is being defined, the left hand side can take the form of a curried
application to one or more arguments. These left hand side arguments are patterns, built with
variables, constructors and brackets. It is possible for there to be several equations, each
catering for different argument patterns. The right hand side of each equation is an expression
possibly involving variables from the left hand side.

When executing a function, the equation used is always the first whose left hand side patterns
match the actual arguments, therefore this is one case where ordering is important. The
language designers felt this was necessary to avoid ambiguity and overly verbose programs.
e.g., p `implies` q can be expressed as:

implies True False = False


implies p q = True

by using the ordering trick, but if ordering was irrelevant, the only unambiguous way
involves expressing the full truth table.

The expression for the result is derived from the right hand side by substituting corresponding
argument components for pattern variables.

If no left hand side matches an application, then the result is ⊥

Sometimes, a definition may only have a single equation, but bind values to several names. In
this kind of definition, the left hand side is a pattern whose variables are names being defined
and the right hand side is an expression whose values match the left hand side pattern.

We can also use the _ symbol as an anonymous variable that indicates (a component of) a
pattern that is not relevant to the equation in which it occurs.

Recursion

Every general purpose programming language provides functionality for the expression of
recurrence in some form. Most procedural programming languages use loops, whereas
functional languages use recursion as the basic tool, with control abstractions defined as
required.
Generally, to ensure useful recursion and to avoid useless circularity, define at least one
simple case (a simple case is an application that does not use recursion), and in each recursive
application, the arguments should be simpler than the case being defined (i.e., close to a
simple case). These rules are not hard and fast, and may be broken if you know what you are
doing.

Functions with arguments that consist of a recursive data type t (e.g., a list) are often defined
using a method called structural recursion, that is, with an equation for each
different t constructor.

For constructions without any component of type t, the result is expressed simply, without
recursion. For other constructions, the result is expressed recursively, that is, in terms of the
results of applying the function to each component of type t.

e.g., for a list argument, we can make [] the simple case, and x:xs a recursive case,
because of xs, so for a function length xs, then length [] will be an expression not
involving length (in this case length [] = 0), and length (x:xs) will be an
expression involving length xs.

Although the Int type is not recursively constructed, it is often convenient to define
functions of the non-negative integers as if there were two forms of construction: 0 and n+1.
For this reason, +1 is allowed in argument patterns where it plays the role of a quasi-
constructor.

For functions of a non-negative integer argument, frequently we can now make 0 a simple
case, and n+1 a recursive case.

For example, we could define replicate x n to yield a list of n x's:

replicate :: a -> Int -> [a]


replicate x 0 = []
replicate x (n+1) = x : replicate x n

Strictness
A function f is strict iff f ⊥ ⇒ ⊥.

It is strict if in its nth argument if f ... ⊥ ... ⇒ ⊥, no matter what the other argument
values are.

Here, ⊥ is a value nothing is known about - type may be known, but construction and
properties not. Therefore, we can say that strict is if nothing is known about the argument,
then nothing is known about the result.

Some predefined strict functions include the character coding functions ord and char (to
know the ASCII code, you need to know the character, and vice versa), the list
projectors head and tail (e.g., head ⊥ ⇒ ⊥, because if nothing is known about a list's
construction, then we don't even know if it has a first item, let alone what that item is), and
the arithmetic functions (+), (-), (*), div and mod (we can say that the arithmetic
functions are strict in both arguments - e.g., to know the value of m+n, we must know the
values of both m and n).

Polymorphic Equality Test

Equality testing is defined for each type belonging to the Eq class: (==) :: Eq a => a
-> a -> Bool.

All the types in the Haskell Prelude belong to Eq if their component types do. Functional
types, however, do not belong to Eq, as function equality is undecidable (thanks to the
halting problem).

It is important to not confuse == with =. == is used in expressions to test equality; = is used


in equations to assert equality.

We can also say that (==) is strict in both its arguments; if we don't know about one of the
arguments, we can't say whether it is the same as the other.

Constructors

Functional constructors are by default non-strict in all their arguments. e.g., one may still
know about a (:)-constructed list, even if nothing is known about its tail. O:⊥ is not the
same as ⊥ - head (0:⊥) ⇒ 0.

Even if nothing is known about the head of a list, a (:)-constructed list is not just
undefined; ⊥:⊥ is not the same as ⊥.

Later, we find that non-strict constructors will be the key to computing over infinite data
structures. We do not need to know about all the components to know something about the
structure as a whole.

Pattern Equations

⊥ can be substituted for variables when matching arguments against patterns, however we
cannot say whether ⊥ matches a construction, so any application involving such a match
⇒ ⊥.

Conditions
Logical conjunction and disjunction are predefined as functions strict in their first argument,
but non-strict in their second argument.

(&&) :: Bool -> Bool -> Bool


False && y = False
True && y = y
(||) :: Bool -> Bool -> Bool
False || y = y
True || y = True

We can also consider a cond function defined by:

cond :: Bool -> a -> a -> a


cond True x y = x
cond False x y = y

This is strict in its first argument - we need the value of the condition to determine the correct
branch, however it is non-strict in both other arguments - we do not need to know about one
or other of the alternatives.

As an alternative to this cond function, there is a special but familiar notation for something
equivalent: if c then x else y, which is equivalent to cond c x y. This
"distributed-fix" notation has lower priority than function application or any infix operator.

Conditions can also be specified at the definition level, e.g., f n | n > 0 = ....

Local Definitions
Before now, all definitions were flobal, howeer, the right hand side of a defining equation
may incorporate local definitions:

equation
where
local equations

Things in the parent scope (i.e., that of equation) are still in scope of the local definitions.

Local definitions make it easy to keep together a definition and all of its uses, enable
defintions to be split into pieces of suitable size for a human reader, and eliminates the need
for some function arguments, since local definitions may refer to variables on the left hand
side of the main equation. Local definitions also avoid problems of naming conflicts that may
arise in larger programs, and they provide a way to avoid repeated occurences of the same
subexpressions - and perhaps repeated evaluations of it too.

Languages without local definitions may be smaller and simpler, but programs may need to
defined more functions with more arguments.

Local defintions have come in and out of favour repeatedly over the years. Circa 1975, SASL
had local definitions, but circa 1980 KRC had none in an experimental language. In 1985
they were re-introduced in Miranda, and then kept in 1990 with Haskell.

There is also an alternate syntax for local definitions:

let
defns
in
expr

Projections of Recursive Results

When a recursively defined function computes a data structure as its result, to refer to
components of a recursive result, it is necessary to bind it to a local pattern, rather than to use
explicity projecions.

Auxiliaries

For an example see


page 38 of the lecture slides.

An auxiliary function is a function which exists to help another function do it's job.

The choice of auxiliary functions often determines the simplicity of solutions, as many
auxiliaries are intended to handle specialised sub-problems. Auxiliaries may also correspond
to generalised super-problems, however. The main method used to generalise a function is to
introduce auxiliary arguments.

Auxiliary arguments come with two conflicting intuitions. The first is that the additional
arguments mean additioonal work, so definition is more complex, but the second is that
addition arguments mean additional information, so the definition becomes simpler. Neither
of these intuitions are are adequate alone, so the potential validity of both needs to be
realised.

The most useful argument to be supplied to a function is the answer to the function. This is
obviously not going to be the case for most functions, but an argument known as an
accumulator can be used to carry a partial construction of the result to be computed.
Recursive applications extend the construction of the accumulator value.

To formulate a basic structure-recursive definition of a function f over lists, we study the


relationship of f xs and f ys to f (xs++ys).

A good rule of thumb is to introduce an auxiliary argument for each maximal subexpression
of the right hand side involving xs, but not ys.

Higher Order Functions


A higher order function is one that takes an argument of functional type, or yields a result of
functional type.

Curried functions are higher order since they represent multi-argument functions as single-
argument functions with a functional result. Higher order functions can be used to provide
parametrised recursion schemes, reducing the need for explicit recursion from scratch.
Polymorphic types are particularly valuable in the case of higher order functions.

map applies its first argument (a function) to each item of its second argument (a list) and
yields the list of results. For example:

map head ["Value","Added","Tax"] ⇒ "VAT"

filter is another higher-order function for list-processing. Its first argument is a


predicate p, and its second a list xs. Its result is a list of the items in xs that satisfy p, for
example:

filter set ["every","one","a","winner"] ⇒ ["one","a"]

Generalised Lists

foldr expresses a more fundamental list-processing pattern. It computes generalised lists,


that is, its result is a reconstruction of a list using a given function f in place of (:) and a
given value z in place of [].

foldr's type reflects the asymmetric nature of lists: foldr :: (a->b->b) -> b - >
[a] -> b.

foldr's defining equations make clear the list recusive pattern of computation that captures:

foldr f z [] = z
foldr f z (x:xs) = f x (foldr f z xs)

Intuition tells us that completely replacing all constructors is a very general way of
recombining components, and we can observe that foldr can be used for a surprisingly
wide variety of computations.
foldl is a mirror image of foldr. foldl captures the list-accumulation scheme, as its
definition reveals:

foldl :: (a->b->a) -> a -> [b] -> a

foldl f a [] = a
foldl f a (x:xs) = foldl f (f a x) xs

Higher order functions can be obtained by auxiliary argument generalisations. e.g., if we


define sum like so:

sum [] = 0
sum (x:xs) = x + sum xs

What makes this "summy" is the use of 0 and +. We can make sum more generally
applicable by replacing 0 with an auxiliary argument z, and by replacing + by an auxiliary
argument f, i.e.,

gensum f z [] = z
gensum f z (x:xs) = f x (gensum f z xs)

gensum is now just foldr by another name.

Partial Application

If f is a curried function of n arguments, it can be applied to fewer than n (say m) arguments.


The result here is a function of n-m arguments, that is, a specialised version of f with the
first m argument values "frozen in". e.g.,

contains :: Eq a => [a] -> a -> Bool


contains [] e = False
contains (x:xs) e = x==e || contains xs e

punctuation :: Char -> Bool


punctuation = contains ",;:.!?"
Such a partial application complements the technique of auxiliary argument generalisation. It
enables us to make fuller use of higher order functions that necessarily assume functional
arguments of a particular arity.

Parital applications of binary operators benefit from a special notation: e.g., (5-) or (:[]),
and are sometimes called operator sections.

Function composition is to functional programs what statement sequencing is to procedural


programs: (.) :: (b->c) -> (a->b) -> (a->c), so (f . g) x = f (g x).
This ability to compose functional arguments using (.) further enhances opportunities to
use higher order functions.

The (.) operator is a function level primitive - all its arguments and its results are functions.
Many defining equations for functions can be cast entirely at a functional level too:
arguments of data type need not be mentioned.

For example, if we had a function inter xs ys = filter (contains xs) ys,


both the left hand side and the right hand side apply a function to ys. By definition, for every
argument ys, their results are the same, that is, they are the same function. It is therefore
equivalent to define: inter xs = filter (contains xs). Now, the RHS is a
composed application, that is: inter xs = (filter . contains) xs. And just as
we eliminated ys before, we can now eliminate xs: inter = filter . contains.

Although the arguments may be unmentioned in a function-level definition, they are still
implicit in the types of the functions.

Redexs and Normal Forms


A reducible expression (called a redex) is an instance of the left-hand side of a defining
equation, but not an instance of any earlier equation, or a primitive application to which a
primitive rule of computation immediately applies.

A normal form is an expression containing no redex.

e.g., both 2+3 and contains [] 0 are redexs, and the normal forms to which they reduce
are 5 and False respectively.

These expressions are not redexs, though they contain two redexes as subexpressions: (3-1)
+ (5-2) and contains (tail[6]) (head[0]).

Running a functional program on a single-processor computer requires some evaluation


strategy to select and reduce successive redexes until a normal form is reached.

Fortunately, normal forms are unique - different strategies applied to the same expression can
not lead to different normal forms.

However, some strategies are less efficient than others, requiring more reductions to reach
normal form starting from the same expression. Some strategies, though efficient for many
expressions, are infinitely inefficient for others. They may fail to terminate, even though there
is a normal form that could be reached in a finite number of steps.

Here, we refer to efficiency as the number of reductions, but it is important to bear in mind
that the housekeeping may be complex.

Eager Evaluation

To evaluate a function application, the eager strategy of applicative order, or innermost


reduction, first evaluates the argument to normal form. Semantically, the effect is to make all
functions strict. A similar call-by-value rule is usually assumed for procedure calls of
imperative languages.

Lazy Evaluation

The lazy strategy of normal order evaluation always reduces the outermost redex available.
Semantically, this makes non-strict functions possible. A similar call-by-name rule is
available in some imperative languages, but it is little used.

Under this strategy, arguments are evaluated zero or more times, as often as they are needed.
The strategy always reaches a normal form if one exists.

This form of evaluation is considered to be näive. We can be lazy efficiently by using normal
order graph reduction. This avoids repeated evaluation of arguments by representing
expressions as graphs rather than as trees. All uses of an argument then point to a single
shared representation of it, so that when it is reduced, all uses feel the benefit.

Pattern Driven Reduction

Several outermost redexes may be subexpressions of an enclosing (non-redex) function


application. If the enclosing application is a constructor, its components are evaluated left-to-
right, otherwise the enclosing application must lack the information needed to match an
equation left hand side - the matching strategy is top-to-bottom and left-to-right.

Infinite Data Structures

Functional valures comprising infinitely many argument-result rules can be defined


recursively. It therefore follows that data values with infinitely many components can also be
defined recursively, e.g.,:

white :: String
white = ' ' : white

nats :: [Int]
nats = 0 : map (+1) nats

The second form of an infinite data structure can be represented more concisely as nats =
[0..].
Many data structures are naturally infinite. We can only use a finite part of them in any one
computation, but that does not need to force us to set limits specifying which part.

Although this data structure has a finite representation, it is infinite and from the users point
of view is the same as an infinite foldr with a spine of :.

Lazy evaluation provides a means of computing exactly those parts of an infinite data
structure that are needed for a particular computation. For example, with the nats example:

The last element in the list has not been used yet, and is waiting to be unrolled.

Lazy Tabulation

To tabulate a function, we can list its results for all arguments in order. Assuming natural
number arguments, we can define:

table f :: (Int->a) -> [a]


table f = map f [0..]

Looking up a result in such a table is a linear operation, defined like so:

(x:_) !!0 = x
(_:xs) !!(n+1) = xs!!n

Using table and (!!) we can arrange for amortised linear computation of any function of
the naturals without abandoning the structure of a specification-based definition.

Set Comprehensions

In some ways, lists are like sets. They are collections of elements, but ordered, and may
contain repeats. In Maths, sets are often defined by comprehensions such as : S = { n2 | 1
≤ n ≤ 10, odd(n) }.

In that example, the boolean conditions are all we have to constrain or generate appropriate
elements n. But the condition can often be expressed as an extra requirement on elements
drawn from a known set, e.g., S = { n2 | n ← {1, ..., 10{, odd(n) }. This is a relative
abstraction.

This kind of comprehension is available for lists in Haskell. The above example expressed in
Haskell would be s = [n*n | n <- [1..10], odd n].

The qualifiers <- are generators that bind successive items from a given list to fresh
variables. These variables are in scope to the right of the qualifier and in the expression to the
left of the |. The other qualifiers are constraints expressed as boolean expressions, they do
not introduce fresh variables. This is called the generate and test paradigm.

More generally, the syntactic forms of a comprehension for a list of type [t] include:
comp = [ t-expr | qual_1, ..., qual_n ]
qual = var <- list-expr | bool-expr

We could define the meaning of comprehensions directly, using extra rules of evaluation. But
they can also be translated into other forms of expression like these:

[e | c] => if x then [e] else []


[e | quals-prefix, quals-suffix] => concat [ [ e | quals-
suffix] | quals-prefix ]
[ e | v <- list ] => map (\v -> e) list

Implementations do something equivalent to these translation rules, but can often be more
efficient. For example, the comprehension [ x | x <- xs, p x ], is intuitively
equivalent to just filter p xs, but the translation rules give us something a lot more
verbose.

If a comprehension has more than one list generator, all combinations of elements (that
satisfy any boolean qualifiers) are generated. Rightward generators iterate within leftward
ones, e.g.,

[(x,y) | x <- [1..3], y <- "ABC"]


==> [(1,'A'), (1,'B'), (1,'C'), (2,'A'), ..., (3,'C')]

This is at the expense of a slightly more complex translation rule, instead of just a simple
variable on the left of <-, there may be a pattern with the effect that only elements matching
the pattern are drawn from the generating list.

Sometimes we don't want all combinations of elements from two lists, but only a combination
of elements at corresponding positions. The function zip :: [a] -> [b] ->
[(a,b)] is handy, e.g., positions xs x = [n | (n,y) <- zip [0..] xs,
x==y].

An apparently distinctive feature of logic programming is its support for multiple results,
reached by back-tracking to find alternatives. We can do the same thing in a lazy functional
language, just using lists of alternative results. Even if there are ininitely many alternatives,
they are only computed if they are needed.

Interaction
Processing Message Streams

Functions computing over infinite lists can also be viewed as processes communicating with
their context by streams of messages. Argument items correspond to messages received, and
result items correspond to messages sent.

Programs can be constructed as networks of such processes operating (conceptually) in


parallel.
These networks can often be expressed diagrammatically, using circles representing
functions, and a triangle representing :. Solid lines represent potentially infinite lists of
items, and dashed lines indicate a single item.

Until now, program execution has involved the user supplying one input - an expression. This
is reduced to a normal form, and then printed.

If results are values, then what about interactive behaviour?

We can formulate programs that take input beyond an original expressiong by defining a
function f :: String -> String, and then arrange to supply the list of characters
entered at the keyboard as f's argument, and for its result to be the list of characters sent to
the screen.

Unless f is hyper-strict or tail-strict (that is, it must know everything about its arguments
before returning), such a program described above behaves interactively. The lazy reduction
strategy ensures that its output is suspended only when a further portion of input is strictly
needed.

Monads are another way to implement interactivity, and are fashionable at the moment, but
are not being covered by this course.

Page 71 of the lecture notes

It is often convenient to define two line-based auxiliaries to provide line-based, rather than
character-based, interaction. lines chops text into lists of lines, and unlines assembles
lists of lines into text. The first can be defined by a recursive application of breakAt and
the second is easily defined as a composition of concat and map, but a foldr version is
more efficient.

A simple scheme for line-based interaction is the simple command-response loop, in which
the same function is applied to each command (line of input) to yield a response (line of
output).

If we require prompts, then the solution is slightly more complicated. One way is to merge a
stream of prompts into the output, using the repeat function and
the interleave auxiliary, defined as:

interleave [] _ = []
interleave (x:xs) ys = x : interlease ys xs

Stately Mappings

Using map means that the same function is applied to interpret every command. No
information from an interaction is carried forward for later use. One way to overcome this is
to generalise map, introducing state as an auxiliary argument.
By using this new mapState in the place of map, our scheme for expressing interaction can
be enriched further.

It is often useful to think of these as message streams at a process level, e.g.,

We also need to consider what type each of the message streams between processes should
have. In our previous example, the type of user is already determined by interact as
a String - the list of characters entered at the keyboard.

We can represent distinct commands as distinct constructors: data Command = Ins


Int String | Del Int, and assign the type [Command] to this stream.

Lines can sensibly be given a type string, so a simple text representation is a list of
lines, type Text = [String], giving [Text] as the type of the texts stream.

As the result of edit is a list of characters sent to the screen, edit user must also be
a String.

In the above process diagram, we should also consider the individual processes. For the
decode process, each line from user is interpreted similarly using
the bufferedLines and map auxiliaries. Each line must be then split into either two or
three fields, depending on the command, using the breakAt ' ' auxiliary, and we also
need to convert numerals into integers where appropriate using the strint auxiliary.

For the perform process, each combination of a command and a text is interpreted by the
same rules, using the zipWith function. Intuition tells us that we can fulfull each command
by cutting the text in two at the specified line number, inserting or deleting text at the cut and
rejoining the text. The auxiliaries take and drop can be used to cut, and then ++ can be
used to rejoin.

To express the display resulting from an entire session for the display process, we can
just concat the successive text displays. As each text is displayed in the same way, we can
just use map, and zipWith to attach line numbers.

As you can see, this problem was factorised by beginning with a process level design, and
explicit recursion in the definitions of processes was avoided by using higher order functions.
The remaining definition work was minimised by using polymorphic auxiliaries. No
conditionals were needed, because all case analysis was done by pattern matching.

More intrusive extensions and comparisons can be considered, such as a more efficient
representation of text supporting the introduction of a current line-number, or an undo
command (although) it is surprisingly easy to modify the perform process to receive and send
text histories, say as lists of texts.

Transformation
A common problem is that a function may have many different possible definitions.
Definition A may be simple and obviously correct, but inefficient in computation, but a
defintion B may be far more efficient, but also more complex or subtle.

The transformational approach begins by formulating a simple definition, ignoring efficiency


concerns, and then applying transformational rules to derive new equations guaranteed to
define the same function, but chosen to do so more efficiently.

1. Eureka - the inventive formulation (e.g., fstCol ys = map head ys)


2. Instantiate - substitute a constructor pattern for a variable in an existing equation
(e.g., instantiating ys to [] and (x:xs)).

fstCol [] = map head []


fstCol (x:xs) = map head (x:xs)
3. Unfold - apply one defining equation in another, replacing a left hand side
instance by corresponding right hand side (e.g., unfolding map in each of the
above equations)

fstCol [] = []
fstCol (x:xs) = head x : map head xs
4. Fold - the inverse of unfold (e.g., appealing to the original fstCol equation)

fstCol [] = []
fstCol (x:xs) = head x : fstCol xs
5. Abstract - introduce a local definition (e.g., if we already have the equation fib
(n+3) = fib (n+1) + fib n + fib (n+1) we may derive:

fib (n+3) = f + fib n + f


where
f = fib (n+1)
6. Law - substitute for one expression another known to be equivalent by appealing
to axioms about primitives, or theorems about explicitly defined functions. (e.g.,
another reformulation of the fib equation achieves a similar economy by a law
of arithmetic: fib (n+3) = fib n + 2 * fib (n+1)

This system of six equation-forming rules is known as the fold/unfold transformation system.
It should be noted that folding and unfolding here does not refer to foldl and foldr.

A good general strategy to follow is: eureka (make some definitions); instantiate equations of
interest; in each instantiated equation, unfold repeatedly and try to apply new laws, abstract
local definitions and fold in new places.

Some commonly applied heuristics for new definitions include:

• composition/fusion - if some composite application f .. (g .. )


.. has g producing a structure that f dismantles, make this composition a
defined function in its own right. Transformational composition replaces an
indirect route from an argument to a result by a more direct route, e.g., avoiding
building intermediate lists.
• generalisation - if folding proves difficult because some subexpression of the
relevant right hand side does not match, generalise the original equation so that
the right hand side has an auxiliary argument variable at that point.
• tupling - if two or more subexpressions involve recursive computations over the
same argument, say f .. x .. and g .. x .., introduce a function that
computes the tuple of these results. Intuition tells us that tupling replaces
independent travel of a common argument by shared travel as far as possible

Often, one (cheap) auxiliary function is applied to the result of another (recursive and
comparatively expensive) auxiliary, to obtain a desired result, e.g.,

prog xs = extract (fun xs)

fun [] = c
fun (x:xs) = f x (fun xs)

Suppose we want a directly recursive defintition of a function, the usual fold/unfold strategy
would give us the equation:

prog (x:xs) = extract (f x (fun xs))

But the intervening f x prevents a fold. If extact and f x could change places, then we
could fold. The promotion tactic is to calculate an f′ such that extract . f x = f′ x
. extract, hence creating the desired opportunity for a fold: prog (x:xs) = f′ x
(extract (fun xs))

Retaining Strictness

If we consider the reduction of a function application by appeal to the defining equations of


the function, then:
• if an argument is never matched against anything more than a simple
variable, the function may or may not be strict in that argument;
• if an argument is matched against a constructor pattern, the defined function
is strict in that argument;
• instantiation replaces simple variables by constructor patterns;

Therefore, an instantiated definition will be strict, whether or not the original was. This
means that a change in meaning due to instantiation in a non-strict context can be quite
subtle, but its effects are all too apparent when the function is used interactively. For
complete inputs, examining only the complete result printed, given a complete input sequence
may result in no apparent problem, but for partial input problems become apparent.

Adding instances creates an implicit equation f ⊥ = ⊥, so you mean consider the situations
of f ⊥ and f (x:⊥).

To achieve safe instantiation, we must check before instantiate whether the function is strict
in the instantiated variable, as only strict instantiation is safe.

To handle non-strict functions, instead of instantiating the definition of the entire non-strict
function, we apply a similar instantiation to a maximal strict auxiliary. One intuition tells us
do this by peeling away the thinnest possibly layer of non-strict material to reveal one or
more strict functions underneath, but another intuition tells us to extract the largest possible
strict expressions surrounding the actual uses of the variables to be instantiated.

To extract safe auxiliaries, we first unfold as far as possible without instantiation, and then
subsitute ⊥ for the argument and unfold ⊥ results only. Compare the before and after results
and define a strict auxiliary, folding it back in to the before result.

Proving Equivalence
How can we establish whether two functional expressions are equivalent? (i.e.,
interchangable). Testing for equal results across a range of arguments is of limited use (there
may be an infinite number of possible arguments, and when one or both results is ⊥, the
result of == is itself ⊥, not True as required). Indeed, automatic checking is not feasible by
any algorithm, because the equivalence problem is undecidable.

Proof of equivalence as a law or theorem appears to be necessary. We can expect inductive


proofs because we have recursive definitions. Fold/unfold can be regarded as a kind of proof,
though we have not been rigorous about the inductive assumptions involved.

Natural Induction

Using natural induction, to show a proposition P n holds for every natural number n, it
suffices to show P 0 and assuming P n holds, P (n+1) follows.

If n ranges over programmed expressions, whose results, if any, are natural numbers, we
must also show: P ⊥.

Structural Induction
To show a proposition P v holds for every fully-defined value v of some data type t, it
suffices to show:

• for every constant constructor C of type t, P C holds;


• for every functional constructor F of type t, if ∀i . P vi holds,
where i ranges overs positions of F's arguments of type t, then P (F
v1 ... vn) follows.

To extend this to cover (partially) undefined values of type t, we must also show P ⊥.

Applying this scheme to a list type, for example, P is true of every type-compatible list
expression if:

• P [];
• if P xs holds, P (x:xs) follows;
• P ⊥
CHAPTER 2:
Formal Specification of Systems
Alloy
FSS expands on the modelling techniques discussed in MSD, but using more formal methods
of modelling - specifically the Alloy language. UML is not a very formal language.

Formal languages allow for precise reasoning, and straight forward translations from
Mathematical formulae.

Alloy is a lightweight language, using purely relational and predicate calculus. It is first-
order, that is, relations can only relate atoms, not relationships, and atoms can only be
represented using relations. Alloy consists of property-oriented, not algorithm-oriented
statements, that is, it focuses on the whats (the properties), rather than the hows (the
algorithms), and Alloy specifications are non-deterministic, that is, they don't care about
determinism.

It is often necessary to use the small model hypothesis (sometimes called the small instance
hypothesis), as it is often impossible to prove a model completely (to do so would involve
solving the halting problem). Most interesting errors are found by looking at small instances,
however this leads us to a world where predicates are not necessarily true, but are true often
enough to be useful.

Predicate Calculus

And can be represented in the following ways:

p and q

p && q

{
p
q
}

Or can be represented in the following ways:

p or q

p || q

if-then statements can be represented using the implies keyword or =>, and if-then-else
using implies ... else.
if-and-only-if can be represented using the iff keyword, of the <=> symbol.

not p is used to mean not.

all x : t | p - for all x of type t, p is true, or ∀x∈t.p.

some x : t | p - for at least one x of typet, p is true, or ∃x∈t.p.

In a similar vein, no x : t | p, one x : t | p and lone x : t | p mean there


exists no x, exactly one x, or at most one x respectively of type t, for which p is true.

Other useful expressions (although not predicate calculus) in Alloy include: sum x : t |
E, which is equivalent to Σx∈t E.

Also useful is local assignment, which can be done using the syntax let x = e | ....

Relations

Relationships are defined using the expression abstract sig Object { eats :
set Object }. This defines a binary relationship "eats" between an object and an object.

Atoms are invisible - that is, they are never defined. Tuples are lists of atoms
(e.g., (Fox,Chicken) or Fox -> Chicken). The arity of a tuple is its length.

A relation is defined as a collection of tuples all of the same arity. The arity of the relation is
defined as the tuple arity. A relation is called binary if the arity is 2, a set if arity is 1, or a
scalar if the relation is a set with only 1 element.

If a and b are relationships, then the following notation can be used:

a=b - the relations are the same


a!=b - the relations are different
a in b - a ⊆ b
a not in b - a ⊄ b

a.b means join (note, in Haskell, this is the same thing as b.a). An example of a join
is: {x→y→z, x→z→y} . {z→e, c→d, z→d} = {x→y→e, x→y→d}, that is, all
matching pairs are found and the front and end connected. An alternate syntax is b[a],
which is equivalent to a.b.

If a and b are numbers, then we can also use standard arithmetic notation such as a <
b and a !< b.

Constants

univ is a constant representing the universe - an Alloy set consisting of all the atoms. It also
includes the 4 bit signed integers (-8..7).
none is the empty set.

iden the reflexive closure of the universe.

Operators

+ union of two relations of the same arity, e.g., a→b, c→d} + {c→e} = {a→b, c→d,
c→e}.

- is subtraction - the tuples on the right hand side are removed from the relation on the left
hand side.

& is intersection.

-> is the cross product, it generates a binary relation between two sets. e.g., Fox <: eats,
where Fox is a set and eats is any relation, then the result is all the tuples that start
with Fox.

<: is domain restriction, and :> is range restriction. e.g., eats :> Grain keeps all the
the tuples that end in Grain.

The operators can also be combined, so Fox <: eats :> Grains returns all the tuples
that start in Fox and end in Grain.

If Object is a set, then Fox >: Object = Fox & Object (note, this expression
would return a truth value).

++ is relational override (note, completely different to Haskell). It creates a relation which is


similar to the first relation, but with one thing overriden. e.g., eats ++ {Fox ->
Grain} results in {Fox -> Grain, Chicken -> Grain}. That is, everything
in eats starting with Fox is replaced with Fox -> Grain. The second operand has a
higher priority than the first.

~ is inverse. e.g., ~(Fox -> Chicken) = Chicken -> Fox

^ is the transitive closure, i.e., r + r.r + r.r.r + r.r.r.r + ...

* is the reflexive-transitive closure, i.e., iden + ^r.

We class all the above as the relational operators, but in addition we also have predicate
operators:

some r returns true if r has tuples in it.

no r returns true if r has no tuples in it.


one r returns true if there is exactly one tuple in r.

lone r returns true if there is at most one tuple in r.

Signatures

Signatures are used to define atoms

e.g., abstract sig Object { ... }. We call this a top-level signature, as it does not
extend any other signature. The abstract keyword means the same as in OO, that the
signature itself can not be instantiated, only extended. In between the { ... } we can
specify relations for that signature.

Also, we can use the syntax one sig SpecificObject extends Object { ...
}. Here, the one keyword enforces this to be a singleton, but the some or lone keywords
could also be used. By specifying extends Object, we inherit the relations from the
signature Object.

Relations are specified using the syntax sig S { r : T }, which defines exactly one T,
and r in S -> T. Alternate syntax includes sig S { r : set T } for any number
of Ts, or sig S { r : lone T } for 1 or no Ts.

We can also get increasingly complex, e.g., sig S { f : A m 0> n B } which is


equivalent to f in S -> A -> B. The m and n annotations are multiplicities, e.g., A ->
one B, or A one -> B, or A one -> lone B, etc.

fact N { ... } is used to state that the arguments are true in any universe. It is possible
to attack anonymous facts to a signature using the syntax sig S { relations }{
anonymous fact }

Additionally, we have pred to define predicates, fun to define functions, run to find
universes in which all atoms satisfy the facts and predicates (an upper limit must be given to
avoid running in infinite time), and check which looks for counter examples in the same
way as run.

When defining things in Alloy, you have to be explicit about things that don't change,
otherwise behaviour is left undefined and things may change.

It is important to note in Alloy that any property of the form some x : T | P will not
work, as Alloy does not generate the infinite world, only an approximation of it, so you may
miss out some states - including the one which satisfies the condition P. To ensure all states
are generated, you may need to force T to be everything, which is very hard to do for any
real T.

Mathematics of Program Specification


Programming is defining how to move from state to state. Modelling is defining what those
states are. It is this defining the what that makes formal specifications different - we can then
leave the how up the implementator.

e.g., if the before states are x = 1 and x = 2, and the after states are x = 2 and x =
3 repsectively, we can abstractly represesent these are predicates: x = Y before, and x = Y
+ 1 afterwards. This pair of predicates gives us a relationship.

We can represent these relations so: { x = Y } x := x + 1 {x = Y + 1 }. It is


best to think of this backwards - to get to the final state, you run the code in the middle, and
you must have started in the first state. Alternatively, we can refer to the first state as a
precondition and the final state as a postcondition.

Formal specifiers give the two outer predicates, leaving the code in the middle up to the
programmer. The code the programmer writes must satisfy the pre and post conditions.

We can also 0-subscript precondition variables in the postcondition, but we do not do this in
Alloy, e.g., {x = x0 + 1}. VDM uses this method - referred to as the Carroll-Morgan
specification statement.

The most common way is to prime postcondition variables in the postcondition, so x′ = x


+ 1. When using this notation, the precondition simply beomes true. This notation is used
in Z, and was the most popular convention when Alloy was developed, so it is the method
Alloy uses.

Preconditions can never refer to postcondition variables, and a common convention is to


write the pre and postconditions as just one, so in this case is the specification is just the
postcondition when the precondition is just true. This is the convention Z uses.

If a precondition is true, then you are guaranteed to get a result that satisfies the post-
condition. If the precondition is not true, however, it does not matter than happens with the
predicate.

We can use a special predicate to set up our initial state, which only needs an output state. No
input state could possibly be given as it is the first ever state, e.g., pred init[s' :
Stack] { no s'.content }.

A data invariant is one which is always true no matter which state you're in - this isn't
possible in normal languages, although asserts come close.

The best precondition is the weakest precondition, i.e., that which allows most. The strongest
precondition is just "False".

Domain and Range

When specifying programs using mathematics, it is important to take care when creating
those specifications from natural languages. Natural languages contain ambiguity, and this
ambiguity is sometimes missed and the specification is interpreted incorrectly.
Trace-based Specifications
Up until now, we have considered a basic form of modeling called state-based specification.
In this form, we consider properties, not algorithms (that is, the what), in an abstract non-
deterministic way. Pre and post conditions and combined, and we have looked at formula for
preconditions, as well as how to catch extra (syntactic and semantic) errors. An important
lesson in the ambiguity of natural language has also been learnt.

In this section, we will look at trace based specifications, which are increasingly abstract, as
well as refinement.

Traces are equivalent to logs, and both good traces (that satisfy the model) and bad traces
exist.

When specifying, one of the most important questions asked is "under what circumstances
may this event occur". Events are synonomous to operation, and act at the interface between
components, applying to both components instantly. Trace-based specifications work by
specifying only the preconditions, no postconditions.

Traces don't show which agent initiated an action and only shows successes, not failures.
This is problematic for concurrent systems, which need to know about failures.
CHAPTER 3:
Algorithms for Graphical Models
Probability
Probability is defined in one of two ways. It is either the degree of belief in a statistic, or the
frequency of occurance in an infinite population.

The Yule-Simpson paradox

Wikipedia

The Yule-Simpson paradox basically states how breaking down a set of figures can show a very
different picture than it first seems from looking at a high level overview.

Data

For actual data, see


the lecture slides.

If we consider a table of fictitious data, then we have columns of variables (in this case which has two
values), a field header, and for each possible combination, a count. A less compact possibility is to list
each combination for each time it occurs.

Such a table is called a discrete contingency table. This table is really a flattened version of an n-
dimensional object: one dimension for each variable For continuous data, contingency tables are not
suitable and new methods are needed. Unfortunately, there is not enough time in AGM to cover these
techniques.

A contingency table tells us what has been observed in the past; it contains data. We need to move
from this data to probability. A simple way to create this probability distribution from the data is to
find the sum of all counts, and divide by the count for each cell in the total. This probability
distribution results in what is effectively a prediction of what's likely in the future, and this specific
method produces a joint probability distribution.

It is a distribution because a probability mass of 1 has been distributed over the 8 cells. What makes it
a probability distribution is because each individual number is a probability, and the joint probability
distribution comes from each probability corresponding to a joint instantiation of the 3 variables.

This is also an example of maximum likelihood estimation. It is important to note that it is only an
estimation since the distribution it produces is an estimate of some unknown true distribution.

Putting aside the joint structure of our distribution, our maximum likelihood estimation distribution
defines a probability distribution with 8 possible outcomes, like throwing a biased 8-sided die. Given
a fixed data size, it defines a multinomial distribution over all the possible datasets of that size.
Adopting a multinomial distribution is tantamount to assuming that each datapoint is independently
'drawn' from our probability distribution.

For formula, see slides.

That is, our multinomial distribution is the probability of that particular observed data set happening.
The data set that has the highest multinomial distribution is the most likely set to occur within that
probabilistic model. It should be the same as or very close to the observed data. To figure out the
highest possible data set, we could use calculus, but plugging in the numbers from the observed
dataset generally works just as well.

A probabilistic model imposes structural constraints on what the true probability distribution, and a
graphical model is just one type of probabilistic model. Formally, a model is just a set of probability
distributions.

We can think of probabilistic models as qualitative structural constraints, where in a graphical model,
fewer edges mean more constraints, where all nodes connected mean there are no restrictions, and no
nodes are independent.

A saturated model is one where there are no constraints, therefore formally it is the set of all possible
probability distributions for a given collection of variables.

For a saturated model, we can compute the MLE as follows. Let there are n datapoints, where n(i) fall
into cell i, then the MLE distribution is defined by a probability for each cell. Let p(i) be the unknown
true probability for cell i, then MLE gives us p^(i) = n(i)/n, as the estimate for p(i) for all values of i.

Factors
A joint probability distribution can be represented using a factor. A factor is simply a mapping from
joint instantiations of variables to real numbers. Factors are the workhorses of graphical models.

Saturated models have only one factor, however most joint probability distributions we will study are
defined using more than one factor. Saturated models are unusual.

If we let Δ be a set of variables, then for each variable δ ∈ Δ, then define Ιδ to be the variable's values.
We can now have a table Ι = ×δ∈ΔΙδ. Each element i or Ι is a vector of values called a cell. A factor
with table Ι is therefore simply a function ƒ : Ι → ℜ

Marginalisation
Marginalisation throws some information away from a factor by reducing the number of variables
involved. It is easiest to understand by looking at contingency tables (which are also factors). Each
cell in the marginal table is simply the sum of all the corresponding cells in the original table.

Marginalisation is just addition that results in a reduction of dimensionality (projecting down a factor
to a smaller number of dimensions), and therefore a more coarse grained view. Marginalisation is
sometimes called "summing down", and can be done with both data and probability.

We can define marginalisation formally. If we let ƒ(i) be the value associated with the cell i in some
factor with the variables Δ and the table Ι, and let a ⊂ Δ then Ιa (the table for the marginal factor
defined by a) is given by Ιa = ×δ∈aΙδ, then for any cell i ∈ Ι, let iδ be its value for the variable δ, we can
let ia be the projection of i onto a ⊂ Δ defined as ×δ∈aiδ. We can now define ƒa : Ιa → ℜ (the function
defining the marginal factor) as: ƒa(ia) = Σj∈Ι:ja=ia ƒ(j).

We can more informally define marginalisation too. If we suppose a factor f involves the variables A,
B, C and D (i.e., these variables are the ones that define its table), then we can normally reflect this by
writing f as f(A, B, C, D). It is important to note in this informal definition that the factor is actually a
function operating on tuples of the values of its variables, not on the variables themselves. With this
informal notation, we can represent marginalising away A as: f′(B, C, D) = ΣAf(A, B, C, D).

Probability distributions, assign probabilities to a sample space of events, however in joint probability
distributions, each joint instantiation is an event, and they are all mutually exclusive.

A marginal distribution can the be computed by summing out one or more variables from the joint
distribution. If the original joint distribution is defined by a single factor, then it is computed by
marginalising the factor. As all events are exclusive, marginalising a factor is as simple as adding all
the instances involving that factor together, e.g., if factor variable A has two states, A 1 and A2,
then f′(B, C, D) = f(A1, B, C, D) + f(A2, B, C, D).

Big joint probability distributions are difficult to make sense of, hence the need to compute less
informative marginal distributions. Usually, we want marginal distributions with a single variable.
Many of the algorithms we will study are efficient ways to compute marginal distributions for all
variables in a joint distribution.

Independence
A = a and B = b are two independent events iff P(A = a ∩ B = b) = P(A = a)P(B = b).

Two random variables C and D are independent in a joint distribution P iff: ∀c∈ΙC, d∈ΙD : P(C=c,
D=d) = P(C=c)P(D=d), where P(C=c) = Σd∈ΙDP(C=c, D=d) is the marginal probability of C=c, and
P(D=d) is the marginal probability of D=d.

This is a strong condition; independendence has to hold for all combinations of values.

If C and D are indepdendent and the distributions for C and D are each represented by a single factor
(denoted P(C) and P(D) respectively), then computing a new factor for the joint (denoted P(C,D)) is
just a question of multiplying the appropriate values. To do this, we 'broadcast' each factor so that
both have the same variables, and then apply pointwise multiplication.

We can now formally define factors. Each factor involved in defining a joint distribution can be
defined on the same table. If we had some table Ι such that P(A) and P(B): Ι → ℜ, then this means
that factor multiplication is normal pointwise function, and that this table involves all the variables in
the joint distribution.

However, particular factors typically only depend on a subset of all the variables in the joint
distribution, so we take advantage of this to give them a more compact representation. The data
broadcasting makes the existence of this shared table more obvious.

If two random variables C and D are independent, then their joint distribution P(C,D) is simply the
product of the two relevant marginal distributions P(C) and P(D). It then follows that the joint
distribution can be represented by these two factors - this is our first example of a factored model.

Similarly, if we have n mutually independent random variables, then their joint distribution can be
represented by n factors.
As a model is just a set of probability distributions, then for a set of n variables, we can let the
independence model be the set of all distributions where there is independence between all variables.
Any distribution in this model can be represented by n univariate distributions (each represented by a
factor).

If we suppose that each variable has m values, then if no independence relations exist (i.e., the
saturated model), a single factor with mn numbers is needed. If we assume the independence model,
then n factors each with m values are needed - that is, only nm numbers in total (you can save a bit
more if you're desperate, since probabilities must sum to 1 you can figure out 1 of the values by
deduction). In addition to these space savings, computations are also far cheaper with the
independence model.

The main problem with this is that complete independence rarely models the real world well, so we
need models somewhere in between the saturated and independence models. Fortunately, there exists
such a class of models - those which use conditional independence.

Conditional Independence
Until now, we have multiplied factors representing independent random variables, but since all factors
can be implicitly defined on the same table, the 'broadcasting followed by pointwise multiplication'
algorithm can be applied to multiply any two factors.

If you are getting a set of factors {f1, f2, ..., fn} with non-negative values, then we can let Z = Σf1 × f2 ×
... × fn. This sum over all the variablse is the scalar you get by marginalising all the variables. As long
as Z > 0, the factors define a distribution: P = Z-1f1 × f2 × ... × fn

So P = f1 × f2 × ... × (fi/Z) × ... × fn, for any i.

Computing Z, known as the partition function, is a pain. If we only need probabilities up to a common
factor, it is better to represent P implicitly by {f1, f2, ..., fn}.

So, if P = f1 × f2 × ... × fn, we can deduce that if each fi is a univariate factor for a distinct variable, then
as we have seen, each variable is independent. We still need to consider the general case, however.

If we let X and Z be subsets of variables of the joint distribution P, then we can let P(X,Z) denote the
distribution produced by projecting P down onto X ∪ Z by marginalisation. Similarly, P(Z) is the
marginal distribution just on Z.

We can define the distribution on X conditional on Z as P(X|Z) = P(X,Z)/P(Z). P(X|Z) contains


conditional distributions P(X|Z = z), one for each instantiation z of Z. P(X|Z = z) is therefore a
distribution for X, given that Z = z.

If P(X,Z) and P(Z) are both represented by factors, then P(X|Z) can be computed by pointwise
division. P(X|Z) is undefined if there is division by 0.

If we let X and Z be subsets of variables of the joint distribution P, then we can say X is conditionally
independent of Y given Z under P if P(X,Y|Z) = P(X|Z)P(Y|Z), or X and Y are independent once we
know Z. We write this as X ⊥ Y|Z[P] or just X ⊥ Y|Z when the P is obvious. That is, ⊥ means "is
independent of".

Also, if P(X|Z) = P(X|Y,Z), then we can that that the probility of X given Z does not change if Y is
also given - that is, X and Y are conditionally independent.
It turns out that factored distributions have conditional independence properties which depend on their
structure.

Hypergraphs
What matters is not so much the numbers in the factors, but the relationship between the variables in
the various factors. We can consider H = { set of variables used by fi }i, e.g., H = { { A, B }, { B, C },
{ C, D }, { A, D } }. H is a hypergraph.

A hypergraph H is a set of subsets of a finite set H - the base set. The elements h ∈ H are called the
hyperedges, and we only need to consider hypergraphs where H = ∪h∈Hh.

If a hypergraph consists of sets which are all arity 2, then it can be represented as a graph, where the
nodes are the base set H, and the edges are defined by which two nodes are in the same set.

Reducing Hypergraphs

red(H) is the reduced hypergraph produced by removing all hyperedges in H that are contained in
some other hyperedge. The hyperedges that are left after reduction are called a generating class.

The example hypergraph earlier H = { { A, B }, { B, C }, { C, D }, { A, D } } is already reduced.

A second example, H′ = { { Smoking }, { Cancer, Smoking}, { Bronchitis, Smoking } }, then red(H′)


= { { Cancer, Smoking }, { Bronchitis, Smoking } } - redundant hyperedges correspond to redundant
factors.

A generating class (i.e., a reduced hypergraph) H = { h1, h2, ..., hn } defines a set of factored
probability distributions. This is simply the set of probability distributions of the form: f1 × f2 × ... × fn,
where hi = the set of variables used by fi.

Such a set of probability distributions is called a hierarchial model, so each generating class defines a
hierarchial models.

Graphs

A graph is a pair (V,E) where V is a set of vertices and E is a set of edges. E ⊂ V × V. Graphs can be
easily represented.

If (α,β) is an edge, and so is (β,α), then it is an undirected edge, which we can express using the
notation α~β. If all edges in a graph are undirected, then we can refer to the entire graph as undirected.

Each hypergraph H has an associated undirected graph H[2] = (V,E), where V = the base set H and
(α,β) ∈ E ⇔ {α,β} ⊂ h, for some h ∈ H.

H[2] is called the 2-section of H. If H is the hypergraph for a factored distribution, then H[2] is the
interaction graph for the distribution.

We call a set of vertices A ⊆ V complete if for all distinct α,β ∈ A: α~β - that is, if all vertices are
connected to every other vertice in that set. We call a set of vertices C a clique if it is maximally
complete - that is, it is complete and all vertices which added to that subset to make it more complete
are added, or, adding any new vertex to C would produce a set that is not complete. For any graph G,
the set of all of its cliques C(G) is the clique hypergraph of the graph.
If we let G be a graph whose vertices are the variables of a joint distribution P, then G has a clique
hypergraph C(G) and each clique c ∈ C(G) is a subset of the variables in the joint distribution P. The
distribution P is said to factorise according to the graph G if there are factors fc such that: P =
∏c∈C(G) fc, where each factor fc only depends on the variables in c.

The Global Markov Property

If we remember that each factored distribution has an associated hypergraph and interaction graph,
then because the interaction graph is generated from the distribution's hypergraph, it is easy to see that
a factored distribution always factorises according to its own interaction graph.

We say that a joint distribution P over variables V obeys the global Markov property relative to a
graph G = (V,E) if for any triple (A,B,S) of disjoint subsets of V, such that S separates A from B in G,
we have: A⊥B|S[P]. So, if the global Markov property occurs, we can use the graph to read off
conditional independence relations.

So, if P factorises to G, then P obeys the global Markov property relative to G. This means a
distribution's interaction graph can be used to read off its conditional independence relations.

The Hammersley-Clifford theorem


The Hammersley-Clifford theorem states that if a distribution P is everywhere positive, then it
factorises according to a graph G, if and only if it obeys the global Markov property relative to G.

If there are zeros in P, then we can have the global Markov property without factorisation holding.

There are other Markov properties other than the global one, however they are equivalent if P is
everywhere positive.

Variable Elimination
The basic inference problem is that given a joint distribution, we compute the marginal distribution
for a given variable.

One option to solve this is to multiply all the factors to get one (often enormous) factor, and then
marginalise on that factor as discussed previously. This is only really useful conceptually.

For factored distributions, it is better to interleave multiplication and addition, and this is the basis of
the variable elimination algorithm.

If we had to compute xy + xw + xz + xu, then doing so 'directly' would involve 4 multiplications and 3
additions. However, if we rewrote this as the equivalent x(y + w + z + u), then this involves only 1
multiplication and 3 additions.

We can exploit this elementary arithmetic fact for variable elimination:

1. Replacing several factors by the single factor which is their product does not alter the
distribution represented (multiplication is associative).
2. To sum out several variables, we can sum them out (i.e., eliminate them) one at a time
(addition is associative).
3. If a variables appears in only one factor, then to sum it out of the distribution, it is enough to
sum it out of that factor (this is the key to variable elimination).
For information on summing
out a variable occuring in
only one factor, see the
lecture slides 6 and 7.

To eliminate a single variable, we first grab all the factors containing the variable, and then compute
their product factor, and then delete them. Finally, we marginalise away the variable from the product
factor, and add the resulting factor to the distribution. gPy can do this for us for us, and the algorithm
is explained in the source and on slides 9-11.

The order in which variables are summed out is called an elimination ordering - the choice of
elimination ordering makes no difference to the final result (due to the commutativity of addition), but
it makes an immense difference to the efficiency of the variable elimination algorithm.

For any given elimination ordering, the variable elimination algorithm can be used to generate a
cluster forest. For each variable summed out, (in gPy) there is a prod_factor whose variables
are prod_factor.variables(). These sets of variables are called clusters, and make up the nodes of the
cluster forest.

There is an arrow from prod_factor_1.variables() to prod_factor_2.variables() if


the message produced by prod_factor_1 is one of the products in prod_factor_2.

Forests are collection of trees, and trees are graphs with no cycles. Since each cluster produces exactly
one message, and a message is deleted once it is used in another cluster, this is enough to ensure that
the cluster forest is indeed a forest.

Clusters are actually hyperedges, and the nodes of a cluster forest thus form a hypergraph - generally
with a lot of redundancy, however. Hypergraphs whose hyperedges form the nodes of forests will
prove to be the key data structure for probabilistic inference in factored distributions.

The general inference problem is that, given a joint distribution and some observed evidence, we need
to compute the marginal distribution for a given variable, e.g., how to computer
P(Cancer|XRay=abnormal,Smoking=absent).

One way to do this is to delay normalisation. Using the above example, we have
P(L,A,T,X,D,B,S,E|X=abnormal,S=absent), which by definition is equal to
P(L,A,T,X=abnormal,D,B,S=absent,E) / P(X=abnormal,S=absent), so we can work
withP(L,A,T,X=abnormal,D,B,S=absent,E) and delay normalising until the last moment.

If we imagine that P(L,A,T,X=abnormal,D,B,S=absent,E) were represented by one enormous factor,


then this factor has the same value of P(L,A,T,X,D,B,S,E) for those rows where X = abnormal and S
= absent, but has a value of 0 for those rows where this is not the case. That is, rows for instantiations
which contradict the evidence have a value of 0.

In other words, we can define the same unnormalised distribution by altering the original factors - in
each factor, just set any row corresponding to an instantiation contradicting the evidence to 0, and
leave the other rows as they were, and then just run variable eliminiation as before.

To avoid pointless multiplication (multiplying by 0), it's a bit neater to simply delete any rows which
are inconsistent with the evidence - this amounts to not even acknowledging the existence of variables
which contradict the evidence. This is just informal, in the real conditional distribution, values don't
just diappear, but it is better for algorithms.
The variable elimination algorithm presented here is overly simple, since it does not take advantage of
any conditional independence relations.

If we suppose we want the marginal distribution of X, if some factor contains only variables
independent of X (perhaps because we have conditioned on some evidence), then we just delete that
factor. It's not too difficult to concoct scenarios where this gives you a massive speed-up.

Bayesian Networks
The structure of a Bayesian net is given by a directed acyclic graph (DAG) whose nodes are the
variables of a distribution.

If there's an arrow A → B in a directed graph, then A is a parent of B and B is a child of A. A directed


graph (or a digraph) is acyclic if the graph contains no loops which follow the direction of the arrows.
A DAG defines a partial order on the nodes. A total ordering of nodes consistent with a given DAG is
one consistent with this partial ordering.

For each variable, there is a conditional probability table (CPT) which defines a distribution over the
values of the variable, for each joint instantiation of the parents of the variable.

Note that each CPT is a factor, and the distribution represented by a Bayesian network is the product
of all the CPTs. So, Bayesian nets are factored representations of probability distributions, and so
everything we have said about such more general factored representations holds for Bayesian
networks as well. Unsurprisingly, Bayesian networks have extra properties.

A product of CPTs, one for each variable, defines a distribution directly, no matter what the numbers
are in the CPTs, as long as the corresponding digraph is acyclic. Additionally, there is no need to
normalise.

If we ignore that CPTs are actually CPTs and treat them just as factors, then the resulting interaction
graph is called a moral graph. Unmarried parents get connected (since they are together in some
CPT), and the graph becomes undirected. With this in mind, it's not difficult to see that a Bayesian
network factorises according to its moral graph.

Since a Bayesian network factorises according to its moral graph, we can read off some conditional
independencies from its moral graph. However, there are further conditional independence relations
which hold in a Bayesian network, but can not be deduced from the moral graph. To get these, we
need to exploit properties of the CPTs.

A key property of a CPT is that, for any configuration of its parents, the values for the child must add
up to one. This is what makes it a CPT.

So, if we marginalise away the child variable in a CPT, we end up with a factor of ones. Factors
where all the data values = 1 can be deleted from a factored distribution without altering the
distribution. So, if a variable only appears in its own CPT, then we can marginalise it away from the
Bayesian network by simply removing the CPT.

In a given digraph, the ancestors of vertex X, denoted as an(X) is the set of nodes Y such that there is
a directed path from Y to X. That is, it's just the transitive closure of the parent relationship.

A set of nodes X is an ancestral set if and only if for any node x ∈ X, we have an(x) ⊆ X. We can now
let an(Z) denote the smallest ancestral set containing a set of nodes Z but just adding the ancestors
recursively.
If we let P be a joint distribution and X be some subset of its variables, then we denote the distribution
produced by marginalising P onto X as PX. If G is a graph, then let GX be the graph formed by
removing for G all the nodes not in X.

Using this, we can let P be defined by a Bayesian network with acyclic digraph G and let X be an
ancestral set in G, and then PX is given by the Bayesian network with the DAG GX (and the relevant
CPTs deleted).

Now, for conditional independence, if we want to know whether the Bayesian network structure
implies A⊥B|S for disjoint sets of variables A, B, S, then first construct the DAG for the distribution
PAn(A∪B∪S). This is just GAn(A∪B∪S) where G is the original DAG. Then we construct the moral graph for
this smaller DAG, and use this smaller moral graph to check for separation.

According to Lauritzen, if we let P factorise recursively according to G, then A⊥B|S, whenever A and
B are separated by S in (G An(A∪B∪S))m, the moral graph of the smallest ancestral set containing A ∪ B ∪
S.

This property, which recursive factorisation implies, is the directed global Markov property.

In some DAG G, we let nd(A) be the non-descendants of node A and pa(A) by the parents of A, then
a distribution P obeys the directed local Markov property relative to G, if each variable is independent
of its nondescendants given its parents: A⊥nd(A)|pa(A).

It turns out that for any DAG G, recursive factorsation, the global Markov property and the local
Markov property are all equivalent.

For inference in Bayesian networks, since a Bayesian net is a factored distribution, one option is just
to use variable eliminaton as previously described. There are also Bayesian network specific options
for inference, both exact and approximate.

Triangulation
In variable elimination, to sum out a variable we have to first multiply all factors containing that
variable. This corrseponds to merging the associated hyperedges. If there is more than one factor, this
may produce a new hyperedge. We can then sum out the variable - this corresponds to removing the
variable from the possibly new hyperedge. It's easy to see this process by looking at the distributions
interaction graph.

If we recall that two variables are connected in the interaction graph if they both appear in a common
hyperedge (i.e., they are both variables for some factor), then merging all hyperedges containing a
given variables v corresponds to connecting in the interaction graph all variables in these hyperedges.
This new connection is called a fill-in.

Deleting a vertex from the only hyperedge to contain it corresponds to deleting that vertex from the
graph.

This process is known as triangulation. If carried out exactly as described we would end up with an
empty graph (since all variables would eventually get deleted). Instead, we just mark 'eliminated'
vertices and treat them as if they were deleted in the algorithm.

The triangulated graph gives a nice graphical representation of the complexity of variable elimination
with a particular ordering. If there are n variables and N is the size (number of rows) of the biggest
factor, then variable elimination is Ο(nN).
If this biggest factor has μ variables each with no more than v values, then we only get an exponential
upper bound: N ≤ vμ. Therefore, we need to keep μ, the size of the biggest clique, small.

Even deciding whether there is an ordering which bounds the biggest clique by some value is NP-
complete, thus finding an ordering which minimises the biggest clique is a hard problem.

Nevertheless, there are some reasonable heuristic approaches. One greedy option is, at each point, to
eliminate whichever vertex produces the fewest fill-in lines (preferably none). Another option is to
compute an elimination ordering backwards, using maximum cardinality search.

Maximum Cardinality Search


To perform maximum cardinality search, we must number the vertices from [n - 1] to [0] in
descending order. As the next vertex to number, we select the vertex adjacent to the largest number of
previously numbered vertices, breaking ties arbitrarily.

If there are n vertices and m edges in the graph, the maximum cardinality search if Ο(n + m), that is,
it's linear. If there is a zero fill-in for a graph, maximum cardinality search will find one. If there is
not, then Kjærulff has shown that the order it finds is generally not that good.

A triangulated graph is defined to be an undirected graph such that every cycle of length n ≥ 4 has a
chord: two non-consecutive vertices which are neighbours. Because of this, sometimes triangulated
graphs are called chordal. A graph is triangulated if and only if it has zero fill-in.

An elimination ordering for a graph is just an ordering of the vertices of the graph.

The fill-in generated by an elimination ordering is the collection of extra lines added by running the
vertex elimination algorithm with this ordering.

An elimination ordering is a zero fill-in if its fill-in is empty.

Join Forests
The cliques of a triangulated graph can be arranged in an important data structure, a join forest. These
are very closely related to the cluster trees we have seen previously.

A join forest is a collection of trees (connected graphs without cycles), whose vertices are the cliques
of the triangulated graphs.

We can construct them by doing vertex elimination on a distribution's hypergraph, rather than its
interaction graph.

A hypergraph is decomposable if its reduction is the clique hypergraph of a decomposable (i.e.,


triangulated) graph. This is not usually the given definition for decomposable hypergraphs, but it may
as well be - in maths literature, decomposable hypergraphs are usually called acyclic hypergraphs.

Graham's Algorithm
Graham's algorithm is for variable elimination on hypergraphs. To apply it, you repeat the following
two operations on a hypergraph until neither apply:

1. Delete a vertex that appears in only one hyperedge.


2. Delete a hyperedge that is contained in another hyperedge.

A hypergraph is decomposable if and only if Graham's algorithm deletes all vertices.

Join Trees
A join tree T for a hypergraph H is a tree such that:

1. the vertices of T are the hyperedges of H, and


2. for all h1, h2 ∈ H, and any h on the unique path between h1 and h2 in T, we have h1 ∩ h2 ⊆ h.

This second condition is the join, or junction, property. Join trees are sometimes called junction trees.

A join forest for a hypergraph H is simply a collection of join trees for hypergraphs H i where H is the
disjoint union of the Hi such that hyperedges in different trees are disjoint.

However, most of the time our join forests contain just a single tree, so most of the literature talks
about junction trees, rather than join forests.

A fundamental theorem is that a hypergraph is decomposable if and only if it has a join forest.

Constructing Join Forests


There are a number of different ways of construction a join forest for a decomposable hypergraph. If
we recall that a decomposable hypergraph is always the clique hypergraph of some triangulated graph,
and that the triangulated graph has an associated elimination order which is a zero fill-in, then if we
run variable elimination with this ordering, then each elimination happens in a clique. If a clique
Ci produces a message which is used by clique C j, then we can connect these two cliques in a join
forest.

To construct a join forest with Graham's algorithm, then we first delete a vertex that appears in only
one hyperedge, and then delete a hyperedge that is contain in another hyperedge.

Graham's algorithm is just structural variable elimination on a decomposable hypergraph using a zero
fill-in ordering. When a hypergraph gets deleted, just connect it to one of the hyperedges which
contained it.

Typically, we want marginal distributions for all uninstantiated variables. We could run variable
elimination from scratch for each of them, but, whatever the elimination ordering, we could end up
repeating a lot of computation work.

A better way is to run variable elimination in a join forest. If we remember that the cluster tree
representation of the execution of the variable elimination algorithm, then a join forest is essentially a
cluster tree with redundancy removed, so, unsurprisingly, we can run variable elimination in a join
forest.

This means running variable elimination with an elimination ordering which can be used to run
Graham's algorithm. If we assume further that this ordering is such that, when we eliminate an
isolated variable (i.e., it's in only one clique), we also eliminate all other isolated variables in that
clique, if any. It is easy to see that this still gives us a zero fill-in.
Since our distribution is decomposable, there is a one-one mapping between the initial factors and the
nodes (i.e., cliques) in the join forest. It is useful to think of each factor as being located at its
corresponding clique.

Once all the isolated vertices in a clique have been summed out, a new factor is created (unless the
clique contained only isolated vertices). This factor is called a message. Rather than adding messages
as a new factor in the factored distribution, we will immediately multiply them into an existing factor.

Decomposability guarantees that there is always a receiving factor which includes all the variables of
the message. The corresponds to the 'deleting a redundant hyperedge' step of Graham's algorithm.

Since the join forest can be built from running Graham's algorithm, the sending and receiving cliques
will be directly connected in the join forest.

To retain the join forest, rather than actually deleting a factor once it has produced a message, we can
just treat it as if it were deleted. So, to run variable elimination in a join tree, each clique except the
last 'sends a message' to its neighbour. The last clique is the root clique - once it has collected all its
messages, it contains the marginal distribution for the variables it contains. The generalisation from
this to join forests is not very interesting.

Now, suppose we have run variable elimination in a join tree as just described. If we consider a clique
C2 neighbouring the root clique C1, then C2 has not received a message from the root C1, but has
received all the other messages it would have recieved had it been the root instead of C 1.

If C1 were to send it a message now, it would be the wrong message, since C 1 includes (by way of
multiplication) the message sent to it from C2.

If we store the message mC2→C1 that C2 sends to C1, then C1 can send the right message back to C2,
even after receiving mC2→C1. It just divides the normal message (produced by marginalisation)
by mC2→C1, thus cancelling out the effect of previously receiving it. Then, C2 will end up with exactly
the same factor it would have had, had it been the original root.

This argument can then be extended to all the cliques in the join forest. Storing messages is the key to
the probability propagation algorithm in a join forest. It's best to imagine the message being stored in
the lines connecting neighbouring cliques in the join forest.

It is now becoming obvious that join forests are the right structure for parallel variable elimination.
We can consider a more elegant characterisation of what's going on. A separator between two
neighbouring cliques in a join tree is just the intersection of the variables in the 2 cliques. The
variables of the message constitute a separator. We can associate factors with separators, just as we
associate factors with the cliques.

If we had a decomposable distribution p, such that p = ∏c∈C fc, where C are cliques, the vertices of a
join forest. If we now let S be the separators of a join forest, and set 1 s to be a table of 1s for each s ∈
S, we can then write p = ∏c∈C fC / ∏s ∈ S 1s.

If we suppose C1 sends a message mC1→C2 to C2, and their separator is S, to effect the probability
propagation algorithm, the factor for C2 gets multiplied by mC1→C2 / fS, and so does the factor for S.
Therefore, sending a message does not alter the distribution, and we have a loop invariant.

Once each clique has received all its messages, it will contain the marginal distribution for its
variables. We just need to ensure that all messages in both directions are passed.
One option to accomplish this is to choose an arbitrary root, and then send all messages towards this
root (the upward pass), and then send all messages in the direction away from the root (the downward
pass).

A perfect sequence is an ordering of the cliques in a join tree where the first clique is a root, and such
that if Ci is nearer this root than Cj, then Ci < Cj.

It is important to note that this is not the actual definition, and can be extended to join forests.

Perfect sequences can be used to schedule messages.

It is possible for distributions to be made to have a decomposable hypergraph.

1. Find a (hopefully) small fill-in for the interaction graph, or do something equivalent directly
on the hypergraph.
2. Map each original factor to a clique which includes variables (this is always possible).
3. The factor for each clique in the new hypergraph is a (possibly empty) product of those
factors mapped to it.
PART B
CHAPTER 4:
Code Generation and Optimisation
Compiler Passes
A good compiler has various requirements, including:

• correctly implementing the language;


• detecting all static errors;
• user interface: precise and clear diagnostics and easy to use processing options;
• generating optimised code;
• compiling quickly using modest resources;
• structure: modular, with low coupling, well-documented and maintainable with
lots of internal consistency checking.

The pipeline represents the various stages, from a front end dealing with the source language
(analysis), to a backend in the target language (synthesis). A properly modular architecture
allows for reuse of the backend components between different frontend languages.

A pass is a traversal of a program or its representation. In a multi-pass setup, the compiler


driver drives the syntactic analyser, semantic analyser and code generator separately (i.e.,
each one passes over the code). This gives us modularity, advanced optimisations and more
freedom for the source language.

In a single pass setup, the compiler driver only drives the syntactic analyser, which in turn
triggers the semantic analyser and code generator. This means code is partially generated as
the code is parsed, and therefore only one pass is required. This gives us a compiler that is
faster and uses less space, but some obscure language constructs may require information
from the semantic analyser in the syntactic analyser (e.g., (A)-1 in C), which a single-pass
can not cope with.
Semantic Analysis
Static semantics are program constraints - checkable at compile time (as opposed to run
time), but are not expressible by a context-free grammar. These constraints include elements
such as:

• scope rules, which govern occurences of identifiers in declarations and in use;


• type rules, which determine the type of expressions, and whether a program is
well-typed;
• name-related checks;
• flow-of-control checks;
• apply precedence and associativity of user-defined operators, etc

The semantic analysis phase checks static semantics and augments the abstract syntax tree
(e.g., with type information, etc). The main subphases of this phase are identification,
checking the scope rules and connecting all occurrences of an identifier within the same
scope, and type checking - inferring types and checking type rules, and possibly augments
expressions with their types.

Identification

A declaration binds an identifier to some meaning.

A scope is a region of program delimited by program constructs (e.g., { ... }). The scope
of a declaration is the region of the program in which the declaration is active.

Modern languages have similar scope rules, these are the most common:

• Every use of an identifier must be in the scope of a definition of the


identifier.
• An identifier may only be declared once per scope.
• Scopes may be nested.
• The scope of a declaration is the scope in which it appears, plus every nested
scope inside that scope where it has not been hidden.
• A declaration hides a declaration of the same identifier in its scope and
internally nested scope.

In other words, the declaration of a used identifier is in the smallest enclosing scope that
contains a declaration of the identifier.
Identifier have attributes which contain information about a programming language construct.
For a variable identifier, this might be type and memory address, or for a function identifier,
this might be types of arguments and results, and memory address, etc...

The identification phase connects the identifiers occuring in the abstract syntax tree to a
record of attributes (which is filled in by other phases) and reports violation of scope rules.
The identifier in a declaration and all occurrences of the identifier in the sope of the
declaration share a record.

Symbol Tables

The symbol table is a dictionary data structure mapping identifiers to attribute records.

The identification phases traverses the abstract syntax tree and for each identifier in a
declaration, a new attribute record is inserted into the symbol table. When each identifier is
used, it is looked up in the symbol table and connected to the found attribute record.

The symbol table is a stack of tables, with a table for each nested scope. Insertion is always in
the table for the the innermost scope, and lookup starts first with the innermost scope, moving
on to the next enclosing scope if it's not found. Opening and closing a scope is simply putting
a new empty table on the stack, or removing the top table from it.

For sequential declaration structure (as in C), the scope of the declaration starts at the
beginning of the declaration, and identification can be done in a single program traversal. For
mutually recursive declarations (e.g., an f(x) which calls g(x) which can in turn call f(x)), a
forward declaration is needed. The scope of declaration starts already at is forward
declaration, and identification has to check the consistency of the forward declaration.

For languages such as Java which support recursive declaration structures, the scope of
declaration encompasses the whole scope. Identification first inserts all declarations of a
scope in the symbol table, and then traverses the scope for lookup of use occurrences. This
has the downside of being incompatible with single-pass compilation.

The efficiency of the symbol table has a major impact on the efficiency of the whole
compiler, especially lookup.

Standard data structures for a table include a hash table with external chaining (fixed size
table with a linked list as buckets), a balanced tree (e.g., red-black tree, 2-3-4 tree or an AVL
tree), or a trie (which is good for looking up strings, however a scanner may have replaced
identifier strings by a number referring to an identifier table to save space).

Optimisation for a stack of tables also needs to be considered - after lookup, a copy is made
of the table in the current scope.

An alternative solution to using one stack of tables is to use one table of stacks. The nesting
level is stored as one attribute, opening a scope increases the nesting level, and closing the
scope decreases the nesting level and removes all stack entries of a higher level.

With this method, lookup is only needed to be done in one table instead of several (but
usually not many nested scopes), and as we never know the right size for a hash table, here
one big table is only a minor waste of space. However, closing the scope is less efficient and
more complex - making it harder to support the additional features we will discuss later.

In related literature, different approaches to identification may be found, including one where
identifiers in the abstract syntax tree are not connected to their attribute record, and the
symbol table is passed to subsequent compiler phases.

The disadvantages of this approach include inefficiency - every time an attribute is read or
written, the attribute record is searched anew in the symbol table; and complexity - the
symbol table has to associate identifiers for all scopes within attributes (either a tree, instead
of a stack of tables, or a single table with a serial number per scope and during traversal, a
scope stack of serial numbers).

Name Spaces

An identifier may be used for different purposes, even if in the same scope, if uses are in
different name spaces. Identifiers in different name spaces, but with the same scope rules,
may be stored in the same symbol table with its name space class.

Members/methods are in scope in certain expressions (e.g., myQueue.enqueue(...)). One


way to implement this is to have a table of members/methods for every structure/union/class,
and this table is linked to from the type entry in the main symbol table. To support
inheritance, a list of tree of tables is needed. Often, we need to combine identification with
type checking. Furthermore, scope depends on the context (e.g., public, protected, private,
etc).

Modules export sets of identifiers, and code and modules can import identifiers from other
modules. The simplest symbol table implementation to support this is having a separate table
for each export interface, and then connect the tables according to scope rules.

Overloading further complicates matters, as an identifier may have several meanings,


depending on the types being used. The symbol table entry must contain attribute records for
all the meanings that are in scope. This type information must be in function attributes during
identification.

Dynamic Scoping

So far we have covered static (or lexical) scoping. The scope of declaration is determined by
the program structure. Dyanamic scoping exists where the scope of the declaration is
determined by the control flow (i.e., the current declaration is the last declaration encountered
during execution that is still active).

Dynamic scoping is sometimes useful (implicit arguments), but identification and type
checking are deferred to run time, and it possibly makes programs harder to understand.
Various "safe" variants of dynamic scoping have been deployed.

Type Checking

Type checking is the phase that infers type and checks type rules, possibly augmenting
expressions with their types.
Types

Types are sets of values, and values of different types may be represented by the same bits, so
we need to know the type in order to interpret a bit pattern.

Type systems exist to prevent occurrence of errors at run time. If an execution error, its
effects depend on the implementation details that are not specified by the language (e.g.,
illegal instruction fault, illegal memory access fault, data corruption). We say an error is
untrapped if it goes unnoticed for a while at run time (e.g., writing data past the end of an
array in C).

A strong type system guarantees that a specified set of errors can not occur, and this set
should include all untrapped execution errors.

Static type systems have type expressions as part of the language systems, and the rules
define well-typed programs. It can be checked at compile time if a program is well typed.
Some run time checks can still be required if the type system is strong (e.g., array
subscripting, subrange types, etc).

Dyanamic typing systems are when the values carry type information at run time, and all type
checks are performed at run time.

Static type systems enable generation of better code, as memory requirements of all variables
are known, more specific code can be generated, run-time tests can be avoided and less space
is needed for run type information. Many errors (including logical ones) can be found at
compile time.

Types can be used to describe interfaces of modules, supporting modular software


deployment, and independent compilation of modules. Static type systems can guide the
design of powerful, but less complex, programming languages. Orthogonal type systems lead
to orthogonal languages.

A type is denoted by a type expression. A type expression is:

• a basic type (e.g., int, char, void);


• an application of a type constructor to type expressions (e.g., array,
record/structure, class, union, point, function);
• an identifier of a user defined type; or
• a type variable (see parametric polymorphism).

Often type systems are not orthogonal (there is no function as a function argument or result).

a language is polymorphic is it allows language constructs to have more than one type (the
opposite being monomorphic). Some languages allow program constructs to be parameterised
by type variables - parametric polymorphism.

A type annotation is a restricted form of formal specification. A type system should be


decidably verifiable to enforce it. An algorithm ensures that a program meets the properties
specified by its types, and some properties are checked at run time. A type system should be
transparent. A programmer should be able to predict if a program will type check. Type error
messages should be comprehensible.

Basic Type Checking

Basic type checking involves a single traversal of the abstract syntax tree. Only expression
have types - declarations and statements do not.

To type check a declaration, the type attribute is filled in for a the declared identifier and the
components recursively type checked.

For a statement, the components are recursively type checked, meaning that the type of the
expression components meet type rules (e.g., id = exp, id and exp must have equivalent
types).

To type check an expression, the type is inferred from the bottom-up, i.e., recursively type
check subexpressions and check that types of subexpressions meet type rules. The type rule
and types of subexpressions determine the type of the expression.

Type Equivalence

Two types are structually equivalent iff they are the same basic type, or they are applications
of the same type constructor to structually equivalent types. A problem is that type systems
shall enforce properties of the implemented systems. Often types with the same internal
structure shall be distinct.

New types can be declared in Ada as type Celsius is new Float which declares a new
type Celsius as distinct from all other types, including Float.

Synonyms can also be declared, e.g., in Ada as subtype Celsius is Float and in C
as typedef float Celsius.

The following types in C are distinct, despite being structurally equivalent: struct Celsius
{ float temp; } and struct Fahrenheit { float temp; }.

Most languages use a mixture of structural and name equivalence, and in some languages,
distinct occurrences of type expressions create unique types. Many languages do not define
type equivalence precisely.

A type may appear in the type expression that defines it. C permits recursion only in form of
pointer to tagged structure or union, similarly other languages use type constructs that are not
compared structurally to prevent infinite comparison.

Actually, structurally comparing cycling type expressions is easy. In recursive comparison,


assume that all previously visited type pairs are equivalent (equivalence is the largest relation
defined by structural equivalence).

Type conversion can be done when types are inequivalent and rules specify which type may
be converted into which other type, and how. This is implicit and is done by the type checker.
The two types of conversion are coercion, which usually require a transformation of
representation (e.g., int to float), and subtyping, which usually has no representation
transformation (e.g., downcasting).

Explicit type conversion can also be done (e.g., in C (float)3).

Overloading

A function identifier is overloaded if in one scope it refers to more than one function.
Equality and arithmetic operations are often overloaded. Some languages allow user-defined
overloaded functions.

Resolution can be local (as in C), where only the types of the arguments determines which
function is meant, or global (as in Ada), when the full usage context is taken into
consideration. A program is well typed in global resolution if the types of all subexpressions
are unique. Type checking can be one bottom-up and one top-down traversal.

Another form of resolution is at runtime (as in Haskell). The compiler infers predicates over
types, and solves constraints to ensure predicates are compatible, perhaps simplifying or
eliminating them. At runtime, functions receive dictionaries representing the evidence that a
predicate holds.

Attribute Grammars
So far, transformation of an abstract syntax tree (identification, type checking, etc) has been
described by an explicit traversal of the tree. However, we can use attribute grammars to
define a shorter, implemenation-indpendendent specification.

Transformation is a determination of attributes - an attribute contains information about a


programming language construct (e.g., the current symbol table, or even a new tree of
intermediate code). If we specifiy attributes locally for each grammar rule, we get syntax
direct definition.

Part of an example attribute grammar for an expression type may be:

exp0 → exp1 < exp2


exp0.type = if exp1.type = Int and exp2.type = Int then Bool else error

exp0 → exp1 + exp2


exp0.type = if exp1.type = Int and exp2.type = Int then Int else error

exp → num
exp.type = Int

X.a denotes the attribute a associated with the grammar symbol X. Subscripts denote multiple
occurrences of the same symbol. In these particular examples, error may be a special type, or
the grammar is intended to be refined later. One thing to bear in mind is that these are
expressions, not statements.
Given a context free grammar, we can now define an attribute grammar:

• associates with each grammar symbol X a set of attributes Att(X);


• contains for each grammar rule X0 → X1 ... Xn a set of attribute equations of the
form: X.a = exp { Y.b | Y ∈ { X0, ..., Xn }, b ∈ Att(Y) }, where X ∈ { X0, ...,
Xn }, a ∈ Att(X) and exp is an expression is some meta-language and may use
the listed attributes.

The attribution of a parse tree maps the attributes of all tree nodes to values such that the
attribute equations are fulfilled. (Note, attribution actually works on the AST as a
representation of the parse tree, but parse tree has a closer relationship to the grammar).

The dependency graph of an attribute equation X.a = exp has a node for each attribute
Y.b appearing in the equation and an edge from each node Y.b appearing in the right hand
side of the equation to the node X.a on the left-hand side of the equation. An example
dependency graph can be superimposed over a parse tree segment corresponding to the
grammar rule:

The dependency graph of a grammar rule is the union of the dependency graphs of all the
attribute equations associated with the grammar rule.

The dependency graph of a grammar rule is the union of the dependency graphs of all
attribute equations associated with the grammar rules analogously to the construction of the
parse tree.

An attribution of a parse tree is well-defined if for every attribute a of every tree node X there
is one defining attribute equation X.a = exp (actually, there may be no equations for some
attributes, because their values are already known from syntax analysis, e.g., Num.value = 6),
and the dependency graph of the parse tree is acyclic.

We say a grammar is non-circular if the dependency graph of every parse tree is acyclic. An
exponential time decision algorithm for non-circularity exists.

One way to ensure an attribute grammar is well-defined is to ensure that for all attributes X.a,
all defining equations X.a = exp either belong solely to productions X → Y1 ... Yn, or belong
solely to productions Y0 → Y1 ... X ... Yn, otherwise there will be two conflicting equations
for some tree node.
Attribute values can be computed in the order indicated by the dependency graph. In a lazy
functional programming language, attribute values can be computed by a single tree traversal.
For easy implementation in other languages, further restrictions are needed.

We say an attribute X.a is synthesised if every attribute equation of the form X0.a = exp
belongs to a grammar rule X0 → X1 ... Xn and exp contains only attributes of X1, ..., Xn,
otherwise an attribute is inherited.

In a parse tree, all dependencies for a synthesised attribute point from child to parent, and all
dependencies for an inherited attribute from point parent of sibling to child.

If an attribute grammar consists only of synthesised attributes, it is called an S-attributed


grammar. Computation of attribute values for S-attributed grammars is done by single tree
traversal, with the children in an arbitrary order.

An attribute grammar is L-attributed if for every inherited attribute X.a in every equation
Xi.a = exp belonging to a grammar rule X0 → X1 ... Xn the expression exp contains only
attributes of X1, ..., Xi - 1 and inherited attributes of X0. Computation of attribute values for L-
attributed grammars is by single tree traversal, children from left to right.

Two different implementations of attribute computation include a traversal of the syntax tree,
where some attributes are stored with their nodes, but most attributes are only temporarily
needed, so inherited attibutes are passed down as arguments, and synthesised attributes
passed upwards as a result of the traversal function. The other implementation is never
construct a syntax tree, but to combine attribute computation with parsing. A recursive-
descent parser can easily compute attributes of an L-attributed grammer. A LR-parser can
easily compute attibutes of a S-attributed grammer, but inherited attributes only in limited
form.

Example attribute grammars


are around slide 62.

Attribute grammars can also have side effects: stInsert creates a new symbol table, preserving
the old one. Every language construct having a symbol table leading to many different
symbol tables. In practice, most implementations just modify a global symbol table, meaning
the order of attribute value computation becomes important and the attribute grammar is only
meaningful for a specific attribute computation order.

Attribute grammars are modular. If you specify aspects of the compiler in separate attribute
grammars, then for implementation all attribute grammars are combined in a union, and
attribute value computation could possibly be by single tree traversal.

Attribute grammars can also be composed - the value of one attribte of the first grammar is a
tree structure, and this tree structure is the input to the second grammar. e.g., attribution of a
concrete syntax grammar constructs an abstract syntax tree, and attribution of the abstract
syntax grammar realises further compiler phases. Under some restrictions, both attributed
grammars can be fused into one.
Currently attribute grammar tools are not as popular as Yacc/Bison, as their full advantages
can only be reached if no side effects are used, but popular languages are based on side
effects. Lazy functional languages directly offer advantages of attribute grammars, except for
their modularity.

Run Time Organisation


Now, we can address general issues of mapping high-level programming language constructs
onto low-level features of the target machine. These issues include: data representation,
mapping language values to words, bytes and bits; memory organisation, subdivision of
RAM for code, static data, the stack and the heap; and the procedure call protocol,
subdivision of work between the caller and the callee.

Data Representations

One form of data representation is constant-size, where all values of a type occupy the same
amount of space, and knowing the type of the variable means knowing the space required for
that variable.

These representations can either be direct, which is the binary representation itself (and
allows for more efficient access) or indirect, which is a pointer to the actual representation.
With indirect representations, pointers are always constant size, even if the type is not or is
unknown (e.g., subtyping, polymorphism), and copying (argument passing) of the pointer is
efficient.

Some basic types include booleans, which are one word, byte, or at it's smallest, a single bit:
0 (False) and 1 (True); characters, which are 1 or 2 bytes depending on whether it's an ASCII
or Unicode representation; integers, which are a word (2 or 4 bytes depending on
architecture) normally using two's complement representation is signed; floats, using the
IEEE standard; and enumeration types (small integers).

More advanced types include records (or structs in C), which is a record composed of a list of
fields, each referred to by an identifier and a field selection operator for accessing a specified
field.

Fields are stored consecutively in memory, and the offset of each field is known at compile-
time from the size of the preceeding fields, and the overall size of the record representation is
known at compile-time.

The address of a 2/4/8 byte value is normally a multiple of 2/4/8. On many machines this is
either required, or yeilds faster memory access, therefore often unused fill is left between
consecutive values. Reordering fields may reduce the internal fill, but this is sometimes
forbidden by the language specification (such as in C).

Union types contain the value of only one of its components, and the operation is for
accessing a specific component. All components have the same address, and the size of the
union is the size of its biggest component. Unions introduce the danger of error in data
interpretation, and many languages provide safe variants (e.g., variant records in Ada).

Arrays provide a mapping from a bounded range of indices to elements of one type, and an
indexing operation is used for accessing an element of a given index. Elements are stored
consecutively in memory with the memory address of a specific element given as:
address(a[i]) = address(a) + i × size(typeof(a)). The value of i is usually only known at run-
time, so we use run-time address computation.

Arrays bring further issues, however. For type safety, indexing code should check array
bounds, and the first element may have an index unequal to 0. Multi-dimensional arrays are
arrays of arrays, and you have to decide which array to consider first.

Another problem is that the array size may not be known at compile time, a lower bound is
needed for indexing, and bounds are needed for a bound check. The bounds may be stored at
the beginning of an array, and sometimes an indirect representation may be chosen for a
constant size representation and simple copying.

Pointers are represented as the address of the value referred to, which means there is one
machine-dependent size for all pointers. They are often needed for recursive types (lists,
trees) or just indirect representation. The value referred to must be in RAM, not in a register.

Functions are often just represented as a pointer to the code entry point of the function, but
more complex representations come up later.

Box/pointer diagrams are


around slide 75.

Classes are similar to records, but the method calls depends on the receiver object, so we also
have a method table per class, with instance variables and methods at fixed offsets.

With inheritence, instance variables are inherited and new ones may be added. Methods are
also inherited or overwritten and new ones may be added. Inherited/overwritten instance
variables and methods are accessed at the same offsets. Indirect representation of objects is
usually used to handle different sizes of direct representations.

Memory Organisation & Procedure Call Protocol


For global variables, static variables of cuntions and constants and literals that are not stored
directly in the code (eg., printf("Hello world");), we can use static storage allocation, as
a fixed address means efficient access.

Now we need to consider procedures (or functions, subroutines, methods, etc). If we consider
simple recursive procedures without arguments and local variables, we have a program
counter, which is the address of the next instruction to be executed, and at the end of
procedure execution we need a return address (the address after the call instruction which
passed control to the procedure). For arbitrarily recursive procedures, a stack of return
addresses (corresponding to the nesting of the procedure calls) is needed.

If we consider a more advanced (and realistic) case of procedures with arguments and local
variables, then each procedure activation has its own arguments and local variables with the
same lifetime as the procedure is active for. Instead of a simple return address stack, we push
a stack frame (sometimes called an activation record) onto the stack for each procedure call.
When procedure execution completes, the frame is popped from the stack.
A typical stack frame is shown above. The global frame pointer (fp) points to the middle of
the current frame on the stack, and all frame components (especially variables) are accessed
by a fixed offset from fp. Sometimes, a global stack pointer (st) points to the top of the stack,
and the dynamic link (sometimes called the control link) points to the preceeding frame on
the stack (that is, the old value of fp). This is often only used because the frame size is not
known until late in the compler phases.

The procedure call protocol defines the exact stack frame format and the division of work
between the caller and the callee, e.g., the caller writes arguments, dynamic link, machine
registers and return address to the frame and sets the new frame pointer, and the callee writes
the return value and resets the fp.

The standard protocol is usually defined by OS and machine architecture, meaning it is not
optimal for every language, but is needed for OS and foreign language calls.

How the arguments are passed also should be considered here. There are two types of
argument passing mechanisms: pass by value and pass by reference. With pass by value, the
argument is evaluated by the caller and the value passed to the callee. Update of the argument
variable in the callee does not change an argument value in the caller. With pass by reference,
the argument must be a variable (or any expression that has an address), and the address of
tehv variable is passed (the l-value), thus an argument can be used to return a result.

With nested procedure definitions this becomes more complex, as we need to consider how to
access a local variable or argument of the enclosing procedure. One solution is to add a static
link (sometimes called an access link) to the frame of the inner procedure. Another solution is
to use a global display - an array of frame pointers. An entry at index n points to a frame of
nesting depth n. This allows faster access, and the maximal display size is the maximal
nesting depth.

We should also consider procedures as arguments. How can a procedure that is passed as an
argument access the variables or arguments of its enclosing procedures? The way to solve
this is to represent a procedure as a pair of the entry point and the static link. The display
technique is unsuitable, because the display is larger than one static link.

Something else to consider is procedures as results. How can a procedure that is returned as a
result access the variables and arguments of its enclosing procedures? If we need the lifetime
of arguments and local variables of a procedure longer than the procedure activation, we can
not store them on the stack. In this casr then, arguments and local variables are stored on the
heap, and garbage collection is used for deallocation after the end of the lifetime.

Access to registers is typically much faster than RAM, and modern CPUs have many
(typically 32), divided into global registers (pc, fp, etc), caller-save registers and callee-save
registers. Typically the procedure result is returned in the register, the first 4-6 procedure
arguments are passed in to register and the caller stores the return address in a register.
Registers may have to be spilled into a frame later, but many procedures do not call other
procedures, and register values may only be needed before the first procedure call.

Heap
For data where the amount of size is not known at compile-time, or the lifetime is shorter
than the whole program run-time, but not linked to procedure activation, we use dynamic
memory, called the heap.

Allocation on the heap can either be explicit (e.g., malloc()) or implicit, where all values are
heap allocated (e.g., Haskell). Similarly, deallocation can be explicit (e.g., free()), or
implicit through garbage collection.

At run-time, the heap expands and occasionally contracts, and this deallocation usually leaves
a gap. A run-time module called the heap manager maintains a free list, and this free list is a
linked list of gaps.

Fragmentation occurs over time where gaps become smaller and more numerous. The total
free space is large, but there is no single gap sufficient for an allocation. To combat this, we
can allocate a best-fit instead of first-fit algorithm, coalesce adjacent gaps, or compact the
heap by shifting all data (garbage collection).

Garbage Collection
Garbage is heap-allocated records unreachable by any chain of pointers from program
variables. Garbage collection is the process of reclaiming memory occupied by garbage for
new records.

Whether or not a record is "live" can be approximated by reachability.

Mark-and-Sweep Collection

In mark and sweep collection, all reachable records are marked using depth-first search, and
then the entire heap is sweeped, collecting unmarked records into a linked list.

The cost of mark and sweep collection can be computed with C = (c1R + c2H) / (H - R),
where R = the size of reachable data, H = the heap size and C is the eventual garbage-
collection cost per allocated word.

If R ≈ H, then C is very large, but if R << H, C ≈ c2. After garbage collection, R and H are
known. If R/H > 0.5, the heap should be grown.

This has problems as the DFS algorithm is recursive, and requires a stack proportional to the
longest path in the graph. This may require a huge stack. The solution is to use the current
field x.fi to point back to record t from which x was reached. Only a short integer done[x] is is
needed in every record to indicate the current field.

See slide 98 for a


pseudocode implementation.

Reference Counting
In reference counting, each record has a reference counter, which counts the number of
pointers pointing to the record. Before every assignment from x to y, the counter of x is
decremented and the counter of y incremented. When the counter reaches 0, the record is
added to the free list and the records x points to are also decremented.

This has a problem of circular references, where cycles of garbage all referring to each other
can not be reclaimed, and the cost of incrementing and decrementing.

Copying Collection

The idea here is that the heap is divided into from-space and to-space, and that it starts with
both reachable records and the garbage in the from-space, and then reachable records only in
the to-space.

Copying is compact, fragmentation is eliminated, and allocation fron contiguous free memory
is cheap.

Pseudocode for Cheney's


algorithm is in slide 101.

One algorithm that implements this idea of Cheney's breadth-first copying algorithm. The
cost of this algorithm is C = c3R / (H/2 - R).

With the cost algorithm for Cheney and mark-and-sweep, we can plot the two to see in which
case it is optimal to use which one:

A problem with breadth-first copying is poor locality of reference (i.e., record pointing to
another record that is located far from it). A good locality of reference improves memory
access time, because of memory cache and virtual memory.
Depth-first copying provides a better locality of reference, but is more complicated (requiring
pointer reversal). A solution is to use a hybrid of depth-first and breadth-first copying.

Generational Collection

We can observe that newly allocated records are likely to die soon, and older records are
likely to live longer. Using this, we can divide the heap into generations, where G0 consists of
the youngest records, and G1 consits of records older than the oldest in G0 and so on. Garbage
collection then only needs to collect some generations, and records get promoted when it has
survived 2-3 collection. Each generation is exponentially larger than the previous.

To do this collection, we can use any mark-and-sweep or copying collector. This has the
problem of roots including pointers in other generations. However, pointers from old to new
records are rare, and updates to newer records are rare. Using this, we can generate code to
remember each updated record.

Incremental Collection

Another problem is the interruption of program execution for long periods of time by garbage
collection, especially for interactive or real-time programs. One solution is to interleave
garbage collection and program execution, but this has the danger of collection missing
reachable records.

One solution to this is tricolour marking. Records are classified into three colours: white, not
yet visited; grey, visited but children not yet examined; and black, visited and all children
visisted. The invariant "no black record points to a white record" .

This causes additional work when program execution modifies the reachability graph. Two
different approaches can be used: a write barrier algorithm, when program execution stores a
white pointer in a black object, it colours the pointer grey; and a read barrier algorithm, when
program execution reads a white pointer, it colours it grey (this may use the virtual memory
fault handler).

Compilers can help for garbage collection. A collectior must find all pointers in heap-
allocated records, so each record is started with a pointer to a data layout descriptor. Pointers
must also be found in variables and temporaries (especially on the stack and in registers). The
use of temporaries changes continuously, and this information is only needed at function call
(allocation call). One way to implement this is to have a mapping from the return address in
the frame to the data layour descriptor.

Conservative collectors work mostly without compiler support by guessing what is a pointer.

Intermediate Languages
Intermediate languages exist to provide an interface between several front and back ends, and
to allow reuse of the optimisation phase.
Intermediate languages have a number of desirable attributes, including independence of the
source and target language, each construct having a clear and simple meaning (making it
suitable for optimisation), as well as being easy to translate to and from the source languages.

These simple languages typically have few control structures (for example, only conditional
jump and procedure calls), and no structured data types (only basic types and addresses),
which means that explicit address computation and memory access is needed.

Sometimes compilers have several intermediate languages.

Intermediate languages fall into three classes:

• Tree or directed acyclic graph based languages are close to the source language,
but are more simple, and are good for high-level optimisations;
• Stack-based languages are where all instructions obtain operands from, and
return results to, the stack;
• Three-address instruction languages are useful for RISC-like machines with an
unlimited number of registers.

Examples are around


slide 112 of the notes.

Evaluating expressions in a stack-based language results in code corresponding to the postfix


representation of expression. Function calls are integrated with a procedure call, where
arguments are pushed onto the stack, and the result is on top of the stack.

In three-address instruction languages, temporaries are the variables of the intermedite


language, and all operations only work on temporaries, like registers in the RISC machine.
The variables of the source code do not exist in the intermediate code.

The three-address instruction language from the notes (and the one we will undoubtedly need
to learn) has only a single type for integers, booleans and addresses, otherwise different move
instructions and possibly different temporaries would be needed.

• a ← b binop c - operation on temporaries


• a ← M[b] - move from memory
• M[a] ← b - move to memory
• a ← b - move between temporaries
• L: - mark code address with label L
• goto L - jump to label L
• if a relop b goto L - conditional jump
• f(a1, ..., an) - procedure call
• b ← f(a1, ..., an) - procedure call with return value
• enter f - beginning of procedure
• return - end a procedure
• return a - end a procedure that has a return value

a, b and c are temporaries or constants, and L and f are labels.


Both stack-based and three-address instruction languages yield linear programs (a sequence
of instructions. Stack-based generally leads to fewer addresses and compact code, and does
not need later register allocation like three-address code. Three-address code, however,
profits from fast registers whereas stack-based code does not, and three-address instructions
are easier to rearrange than stack instructions, so three-address code is easier to optimise than
stack code.

Intermediate Code Generation

It is important to remember that only the intermediate language is target language


independent, and the intermediate code contains addresses and offsets, which is target
machine specific. For portability, all address computation should be kept separate in code
generation.

The intermediate code generation phase therefore has two main parts: storeage allocation,
which fixes addresses or offsets for each variable identifier (memory organisation), where the
address or offset is a target machine specific attribute of an identifer (the identification phase
should ensure that this attribute is shared by all occurrences of the same identifier and this
attribute should be written when processing the declaration; and the actual translation to
intermediate code.

There are two ways to implement the translation which fall under the definition of "syntax-
directed" translation: specification by synthesised attribute, where intermediate code
generation can be viewed as an attribute computation similar to many of the attribute
problems discussed earlier; and specification by translation, where translation or compilation
functions are used.

We can have expressions other than simple identifiers on the left-hand-side of assignments,
so expressions on the left-hand-side need to be translated. We get l-values - the value of an
expression that appears on the left-hand-side of an assignment (i.e., an address) and the value
of an expression that appears on the right-hand-side of an assignment (e.g., the memory
content at an address). Different translations for the expressions are required, one for l-value
(transLExp) and one for the r-value (transExp). Not all expressions may be on the lhs,
so transLExp may not be defined for all expressions.

Slide 121 onwards contain


sample translation functions.

When translating operator expressions, the order of operand evaluation matters if the
evaluation causes side effects. Operand evaluation order may be specified by the language
(this is not the case in C, however). The sample function given assumes that op exists also in
the intermediate language.

Structured control constructs (e.g., if) gets translated into (conditional) jumps. Some
structures (such as while loops) can be translated in two different ways, which are sometimes
more efficient depending on the context (e.g., one translation uses less instructions overall if
there are multiple loops).
Most machines and intermediate languages do not support the boolean type directly,
so False is represented as 0 and True as a 1. The result of relational operators can only be
used in a conditional jump. Boolean operators can also be short circuited: a and b ⇒ if a
then b else False, and a or b as if a then True else b.

Similarly, procedure calls depend on an argument evaluation order. The intermediate


language must make sure it specifies the requirements of the procedure call protocol. The
function identifier is used as a label, and the code for nested procedures is not nested.

Target Code Generation

The target code might be assembly code, relocatable machine code, or absolute machine
code, and may be for different architecutres including RISC or CISC machines.

Three-address instructions are even more primitive than RISC instructions, so translating
each three-address instruction (after optimisation) separately yields inefficient code. It is
necessary to combine several three-address instructions into one target instruction.

Generation is therefore done in two steps, first by translating three-address code into tree-
based code by inlining of used-once-only definitions of temporaries, and then the tree-based
code is translated into target code with an unlimited number of registers by tiling each tree
with target code instructions.

To translate into a tree-based language, you must first locate definitions of temporaries that
are used only once (that is, there is no redefinition, label, jump or call between definition and
use) and these definitions are inlined at the use occurrence.

For some sample patterns,


see slide 133.

Target instructions are expressed as fragments of an intermediate code tree, called a tree
pattern. Instructions can usually correspond to more than one tree pattern - this is called a
tiling.

As usually several different tilings are possible, then each kind of instruction is assigned a
cost. The optimum tiling is a tiling with the lowest possible costs, and an optimal timing is
where no two adjacent tiles are combinable into a single tile of lower cost.

Optimum implies optimal, but this is not a transitive relationship.


In RISC architectures, instructions are typically all simple with a uniform cost, so there is
little difference between the cost of optimum and optimal tiling. For CISC machines, the
complex instructions have different costs, and there is a noticeable difference between the
cost of optimum and optimal tiling.

The Maximal Munch tiling algorithm is a top-down algorithm that yeilds an optimal tiling.
The algorithm starts at the root of a tree, finding the largest tile that fits. That tile covers the
root node and some other nodes, leaving subtrees, and the algorithm is repeated for each
subtree.

The Dynamic Programming tiling algorithm is bottom-up and yields an optimum tiling. The
algorithm here is to assign to each a subtree and optimum tiling and the cost of that tiling
computed by considering each tile that matches at the root. The cost of using a tile is the cost
of the tile plus the costs of the optimum tiling of the remaining subtrees. The tile chosen is
where the cost (as defined above) is minimal.

The grammar is
on slide 127.

Tiling can be considered parsing with respect to a highly ambiguous context-free tree
grammar.

Tools that generate code generators process grammars that specify machine instruction sets,
and associated with each rule of the grammar is a cost and action (or attribute equation).

Compilation functions can be used for a direct translation from the abstract syntax tree to
target code, but this often leads to inefficient code, so these compilation functions are
extended by special cases. The compilation functions is then based on overlapping tree
patterns, so a tiling algorithm can be used for code generation.

Another problem is that when generating machine code, absolute or relative code locations
have to be used instead of labels, which causes problems when addresses for forward jumps
need to be generated. One option is to leave a gap in the code and add the jump later - this is
called backpatching. You can either backpatch when the target location is known, or in a
separate pass. A problem however is that shorter machine instructions are often used for
shorter jumps.

Liveness Analysis & Register Allocation


Two temporaries can fit in the same register, as long as they are never live at the same time.
A temporary is live if it holds a value that will be used in the future.

A process known as liveness analysis yiels a conservative approximation of liveness (static


liveness). If the temporary is live, then the analysis will say so, but it does give false positives
- if the analysis says the temporary is live, this does not necessarily mean so.

Two analysis methods include data flow analysis, where liveness is determined by the flow of
data through the program; and backwards analysis, from the future to the past.
In general, liveness analysis and register allocation is only for one procedure body at a time.
It is easier to use three-address code rather than specific target code to do this analysis.

For examples, see


slide 142.

A control flow graph can be built, where nodes are the instructions of the program (excluding
labels), and an edge exists from x to y if the instruction x can be followed by instruction y.

The control flow graph is an approximation of dynamic control flow.

• out-edges of a node n lead to successor nodes: succ(n);


• in-edges of a node n come from predecessor nodes: pred(n);
• def(n) is a set of temporaries defined by an instruction at node n;
• use(n) is a set of temporaries used by an instruction at node n;
• A temporary t is live on an edge e iff there exists a path from e to a use of t that
does not go through a definition of t;
• in(n) is a set of temporaries that are live on any in-edge of n;
• out(n) is a set of temporaries that live on any out-edge of n.

We can more formally define in(n) and out(n) as:

• in(n) = use(n) ∪ (out(n) \ def(n))


• out(n) = ∪s ∈ succ(n) in(s)

The algorithm for liveness is then as follows:

foreach n:
in(n) := ∅
out(n) := ∅
repeat
foreach n:
in′(n) := in(n)
out′(n) := out(n)
out(n) := ∪s ∈ succ(n) in(s)
in(n) = use(n) ∪ (out(n) \ def(n))
until in′(n) = in(n) and out′(n) = out(n) for all n

The run time for this algorithm is better if nodes are processed approximately in the reverse
direction of control flow. In the worst case it is Ο(N$), but in practice it tends to lie between
Ο(N2) and Ο(N4) (where N is the number of nodes).

Another thing to consider for register allocation is an interference graph, where nodes are
temporaries, and edges are interferences (temporaries that can not be allocated to the same
register). All temporaries in the same live-in or live-out set interfere, with the extra case of a
temporary t defined by the instruction t ← ... shall always be in the live-out set of
instructions, even if not actually live-out.
Slide 146 contains the
graph colouring algorithm.

Mapping these temporaries to K registers saves memory access time. To compute the
registers, the interference graph needs to be K-coloured. This is NP-complete, so the
approximation algorithm should be used. If K-colouring is impossible, than some temporaries
get spilled over into memory.

Assignments such as t ← s do not make an interference edge between t and s, because they
can be mapped to the same register. We can therefore coalesce nodes t and s in the
interference graph and delete the instruction t ← s.

However, coalescing can hinder K-coloring, so we should only coalesce if the resulting node
has fewer than K neighbours. This coalescing needs to be integrated into the K-colouring
algorithm (as the number of neighbours change), and the preceeding compiler phases are
therefore free to insert move instructions where convenient.

We also need to aim to minimise the number of stack frame locations for spilled temporaries,
and the number of move instructions involving pairs of spilled temporaries. Hence, we should
construct an interference graph for spilled nodes, and coalesce aggressively (as the number of
stack frame locations is not limited) and finally colour the graph.

Sometimes it is necessary to have a register in the interference graph as a pre-coloured node,


and all pre-colour nodes interfere with each other. We can express that some temporaries
have to be in specific registers (e.g., the frame pointer), or that some instructions only work
with certain registers. Also, call instruction invaldiates all caller save registers (the callee
save registers used for temporaries live across procedure call), and the register allocator
ensures saving of caller save registers.

Optimisation
Optimisation takes intermediate code and returns optimised code. It is important to note that
optimisation does not yield optimal code. The requirements of optimisation are to preserve
the meaning of the program (including errors), to on average improve the program (mostly
runtime, but sometimes space and program size) and is worth the cost (optimisation can be
complex and extends compile time).

Programs should be modular and easy to understand, and many optimisations need access to
low-level details inaccessible to high-level languages (e.g., addressing), so it is often better to
have the compiler do optimisations, not the programmer. Additionally, ome optimisations use
target machine specific knowledge, so the compiler can optimise for different target machines
which if done in code would eliminate portability. However, optimisations are no substitute
for good algorithm selection.

Optimisations done in the compiler can often lead to simple code generation.

Optimisation phases typically consist of a series of optimisations.


Optimisations for Eliminating Unnecessary Computations

The general gist of this is to do at compile time what would otherwise be done at run time.
This can stretch from simple constant folding (if the operands of an instruction are constant,
the computation can be done at compile time) to state-of-the-art partial evaluation/program
specialisation (a whole program together with partial input data is transformed into a
specialised version of the program by precomputing parts that depend only on that specific
input data).

An example of program specialisation is compiling interpreted core - the interpreter is


specialised with respect to the input program.

Arithmetic identities can be used to delete superfluous operations (e.g., t1 ← t2 * 1 to t1 ←


t2).

Dead code (code that can never be reached) can be eliminated, and instructions which
compute values that are never used can also be.

Array bound checks can also be eliminated, as often the compiler can determine if the value
of an index variable is always within the array bounds.

Induction variables can also be eliminated. Some loops have a variable i which is just
incremented or decremented, and a variable j that is then set to i × c + d, where c and d are
loop-invariant. j's value can then be calculated without reference to i, and whenever i is
incremented, we can just incrememnt j by c.

Common subexpressions from multiple expressions can be eliminated by computing it only


once, then saving the value and using it whereever needed. Many common subexpressions
are actually compiler-generated address calculations.

Code motion can also be done, ranging from hoistnig loop-invariant computations (moving
instructions that compute loop-invariant values out of a loop) to lazy code motion (moving all
instructions as far up in the control-flow graph as possible, exposing exposes the maximum
number of redundancies and instructions are pushed as far down in the control-flow graph as
possible, minimising register lifetimes).

Optimisations for Reducing Costs

These optimisations include reduction in strength, replacing expensive instructions with


cheaper ones (e.g., t1 ← t1 * 4 to t1 ← t1 << 2).

Replacing procedure calls with the code of the procedure body (inlining) can also be an
optimisation. Here, the procedure parameters are also replaced in the instruction arguments.

Modern computers can execute parts of many different functions at the same time, but
dependencies between instructions hinder this instruction-level parallelism, so by reordering
instructions to minimise dependencies, we can optimise instruction scheduling.

Aligning and ordering instructions for data so they are mapped efficiently into cache can
speed up memory access (cache being must faster than main memory, but also much slower).
By using this cache-aware data access, we can also ensure that data structure traversals
(especialyl of arrays) take advantage of the cache.

Enabling Optimisations

Optimisations in this class don't themselves result in optimisations, but they enable
applications of other optimisations.

Constant propagation can be used when the value of a temporary is known to be a constant at
an instruction, so the temproary can simply be replaced by the constant.

Copy propagation is similar, but instead of a constant there is a temporary (similar to


coalescing in register allocation).

Loop unrolling can be used to put two or more copies of a loop body in a row.

Optimisations are typically applied to intermediate code to enable reuse in different front and
back ends, however some optimisations are most easily expressed on a different program
representation (closer to the source or target code), and some optimisations are target code
specific (special machine instructions).

Determining a sequence of optimisations is complex because each optimisation uncovers


opportunities for other optimisations, and the effects of many optimisations overlap, with
some subsuming others.

A basic block is a maximal-length sequence of three-address instuctions that is always


entered at the beginning and is exited at the end. That is, a label may only appear as the first
instruction, and a jump may only appear as the last instruction.

To prepare for optimisations, the code is divided into these basic blocks and a control-flow
graph is built of the basic blocks.

Locality

Optimisations can also be classed with regard to locality.

A peephole optimisation slides a several instruction window over the target code, looking for
a suboptimal pattern of instructions. Constant folding and strength reduction falls into this
class.

Local optimisations operate on single basic block.

Global (or intraprocedural) optimisations operate on procedure bodies. All loop optimisations
fall into this class.

Interprocedural optimisations operate on several procedure bodies. Procedure inlining and


partial evaluation fall into this class.

Many optimisations can be performec, with different effectiveness, on either a peephole,


local, global or interprocedural level.
The runtime of many optimisation algorithms is not linear, but often quadratic or worse.

Most optimisations require information from an analysis.

Dead code can be detected as any instruction which assigns to a variable a and a is not live-
out of the instruction, and the instruction can not cause a side effect (e.g., arithmetic
overflow, or division by zero).

Reaching Definitions

Slide 162 onwards contains


a reaching definition algorithm.

A definition of a temporary t is a control-flow node where it is assigned something, we can


say that a definition d of t reaches a control-flow node n if there is a path from d to n without
any definition of t. Several definitions of t may reach a node, because the control flow graph
approximates dynamic control flow.

For the actual applications,


see slides 165 onwards.

If d reaches n and no other definition of t reaches n, then we can apply constant propagation.

If d reaches n, no other definition of t reaches n and no definition of z on any path


from d to n, we can apply copy propagation.

Forward substitution can be done on associative operators, if d reaches n, no other definition


of t1 reaches n and there is no defintion of r on any path between d and n.

Common subexpression elimination can be applied if the right-hand-side of d equals the


right-hand-side of n, d reaches n, no other definition of t1 reaches n and there is no definition
of r on any path between d and n.

Dead code elimination can be applied if d1 reaches no node that uses t.

Data flow analysis can be done using liveness or reaching definitions. Liveness uses
backwards analysis, and reaching definitions uses forward analysis. Equations may have
more than one solution, but Knaster-Tarski fixpoint theorem states that the least solution is
the start with empty sets and add elements.
CHAPTER 5:
Cryptography, Attacks and
Countermeasures
Introduction
Secure communication is critical, for e-commerce, e-banking, and even e-mail. Additionally,
systems are becoming even more distributed (pervasive computing, grid computing, as well
as other systems - on-line gambing or MMOGs). We know that the real world has many
people who will try to gain advantage. Mischief in the virtual world is typically remote and
faceless.

In the real world, we want confidentiality, secrecy and privacy - criminal offences exist to
prosecute unauthorised opening of mail. Integrity is also an important aspect, it should be
apparent when artefacts have been altered - seals have been used in years gone by to protect
documents. Similarly, the authenticity of artefacts needs to be proved.

Other things we want in the real world includes non-repudiation (you should not be able to
deny doing specific actions, for example if you bought a share you should not later deny it),
which is often implemented using the idea of signatures and witnessing (but reputation is still
important in this system). Similarly, recorded delivery aims to protect against denial of the
receipt of goods.

We also want to have anonymity if need be, and also validation of properties (usually by
some expert, e.g., certificates of authenticity).

Variants of all these properties, and more, exist in the digital world - there are plenty of
opportunities for mischief. Tools ared needed to enable these sort of properties to be
guaranteed in digital systems - cryptography lays at the heart of providing such tools and
guarantees in digital systems.

Cryptography by itself is not security, but it can make a very significant contribution towards
it.

Loosely speaking, we are interested in two things, making intelligible information


unintelligible (i.e., scrambling it in some way), and also making unintelligible information
intelligible. These processes are generally known as encryption and decryption.

Cryptographers are the people who supply us with mechanisms for encrypting and decrypting
data, and they are in a perpetual war with the cryptanalysts - the code breakers. Often these
people work on the same side and together in order to improve security.

Data in its normal intelligible form is called plaintext. This data is encrypted into messages
unintelligible to the enemy, and these are called ciphertext. In modern computer-based
cryptography, the plaintext and ciphertext alphabets are simply {0,1}, but traditionally were
the alphabet.

A typical block diagram of the process looks similar to this:

Additionally, we also need a secure channel to ensure the appropriate keys are possessed by
the sender and receiver.

Private and Public Cryptography

Two major approaches to cryptography are the private, or symmetric, key cryptography
approach and the public key approach. In the private key, the encrypting and decrypting keys
K and K-1 are the same (are one is easily derivable from the other). With public key
cryptography, the two keys are different, and it is very difficult (without special knowledge)
to derive one key from the other.

Public key cryptography has arisen in the last 30 years, and has been invented more than
once: first secretly by GCHQ's Cox and Ellis, and then publicly by Diffie and Hellman.
Private key cryptography has a much longer history.

Attacks

With massive computational power become available, such as grid computing, or using cheap
specialised hardware such as FPGAs, brute force attacks are possible.

Another general idea of cryptanalysis is to detect and exploit structure in the relationship
between plaintext inputs, the key used and the ciphertext outputs. This can be used to derive
keys, but weaker breaks are important - being able to tell what type of cryptographic system a
stream is encrypted with gives us a foot in the door, as you should not be able to tell random
streams apart. This attack detects leaking strcture.

The general idea of cryptographic systems it to make the plaintext to ciphertext mapping as
random looking as we can. Sometimes, using an attack by approximation, we can detect a
structural relationship in the operation of an algorithm, which exists with some small bias
(i.e., an apparent deviation from random-ness).
Some schemes are based on the assumed computational difficulty of solving instances of
particular combinatorial problems (e.g., knapsacks, perceptron problems, syndrome
decoding). Solving such problems may not be as hard as we think. As NP-complete is about
the worst case, cryptography cares more about the average case.

Cryptography is based on the idea that factorisation really is hard.

We can also attack the implementation. Mathematical functions simply map inputs to outputs
and exist in some conceptual space, but when we have to implement a function, the
computation consumes resources: time and power.

Encrypting data with different keys may take different times. The key and the data affects the
time taken to encrypt or decrypt, which gives a leakage of information about the key. Even
monitoring power consumption may reveal which instructions are being executed.

Classical Ciphers
All classical ciphers work by doing plaintext substitution (replacing each character with
another), or transposition (shuffling the characters). These are very easy to break and no
longer used, but are instructive since some cryptanalytic themes recur in modern day
cryptanalysis.

Caesar Ciphers

The Caesar cipher works by shifting letters of the alphabet some uniform amount, e.g., if it
was a Caesar cipher of 3, A becomes X, B becomes Y, etc, so BAD becomes YXA. This is
hardly the most difficult cipher to break. These classical ciphers operate on characters or
blocks of characters (modern implementations operate on bits, or blocks of bits).

This early cipher was used by Julius Caesar, and works on the normal ordered alphabet, and
shifts letters by some number - this number is the key k. If we map the alphabet onto the
range of integers 0-25, then the Caeser cipher encrypts the characters corresponding to x as
the character corresponding to (x + k) mod 26.

As there are only 26 keys, it is simple to brute force this cipher. Occasionally, we may get
more than one possible decryption, but this is only likely if there is a very small amount of
ciphertext.

Substitution Ciphers

A simple substitution cipher is simply an injective mapping: f : Π → Π, where Π is the


alphabet.

All instances of a word are enciphered in exactly the same way. The appearance of these
repeated subsequences in the ciphertext may allow the cryptanalyst to guess the plaintext
mapping for that subsequence (using the context of the message to help).

If you actually know some of the plaintext, than these sort of ciphers are fairly trivial to
break, you don't need much in the way of inspired guessing.
This problem arises because identical plaintexts are encrypted to identical ciphertexts.

Another possible attack arises due to the nature of natural language - in particular, the
frequency of character occurence in text is not uniform. Some characters are more common
than others, e.g., "e" in English text, as well as pairs (bigraphs), "th", etc.

Unfortunately for simple substitution ciphers, this structure in the plaintext is simply
transformed to corresponding structures in the ciphertext. If the most common ciphertext
character is "r", then there is a good change that it will be the encryption of "e" (or one of the
other highly frequent characters).

One thing to be aware of is that the frequency tables are compiled using a great deal of text,
any particular (especially true if the ciphertext is shorter) text will have frequencies that differ
from the reference corpus. Thus, mapping the most frequent ciphertext character to the letter
"e" may not be the right thing to do, but mapping it to one of the most popular characters
probably is.

One way to address this problem is to allow each character to be encrypted in multiple ways,
or use multiple ciphers.

One solution homophonic substitution, that is. to have a separate (larger) ciphertext alphabet
and allow each plaintext character to map to a number of ciphertext characters. The number
of ciphertext characters corresponding to a plaintext character is directly proportional to the
frequency of occurrence of the plaintext character.

e.g., I has a frequency of 7 compared to V's 1, so I would map to 7 characters and V would
map to 1. In this way, the ciphertext unigram statistics become even.

Vigenere made an extremely important advance in classical ciphers. He chose to use different
ciphers to encrypt consecutive characters. A plaintext is put under a repeated keyword, with
the character for that particular point to index the actual cipher used.

This appears to solve the problem - the statistics are flattended, but if we were to measure the
frequencies of subsequences of the ciphertext (e.g., for key length 4: subsequences, 1, 5, 9,
13, etc, and 2, 6, 10, 14, etc...), we might expect these frequencies to be approximately those
of the natural language text as usual.

In order to figure out the size of the keyword, or the period of the cipher, we look for n-grams
occurring frequently in the ciphertext (e.g., frequent bigrams, trigrams - the bigger the better).
The chances are that they correspond to idential plaintext enciphered with the same
component ciphers (positions at some multiple of the period).

Once we have this period, we can apply earlier frequency analysis to each component cipher.
Bi-graph frequencies can help too.

Substitution ciphers have a nice property - keys that are nearly correct give rise to plaintexts
that are nearly correct. This allows from some interesting searches to be carried out. If you
can measure how "good" your current key guess is, you can seek over the keyspace in order
to maximuse this measure of goodness.
Of course, we can not know how good our current key guess is without knowing the actual
plaintext, but we can use frequency statistics under decryption to key K to give us a strong
hint.

A typical search starts with a key K (a substitution mapping), which makes repeated moves
swapping the mapping of two characters, with the general aim of minimising the cost
function derived from the frequency analysis. This is non-linear, and you run the risk of
getting stuck in local optima.

Transposition Ciphers

Transposition ciphers "shuffle" the text in some way - a simple transposition cipher f might
permute consecutive blocks of characters, e.g., for a reversal permutation: f = (4, 3, 2, 1),
encryptf(JOHNHITSJOHN) = NHOJSTIHNHOJ.

To create a transposition cipher, we chose a period L, and a permutation of the numbers 1..L.
The plaintext is then written out in rows of length L, and the columns are labelled according
to the permutation chosen (e.g., 6 3 1 2 5 4), and then the columns reorganised in ascending
order according to their label.

Unigram statistics can be used to tell us this is a transposition cipher, as the statistics will be
very similar to that of plaintext.

Bi-gram statistics are rather important, and we can use similar sort of tricks as earlier to
determine the period. If you guess the period wrongly, and try to find a permutation, you will
be hard pressed to find pairs of columns in the decryption which have appropriate bigram
statistics when considered consecutively. If you guess the period correctly, then it is possible
to find pairs of columns with good bi-gram frequency statistics.

We can do searches over the permutation to find the approximate order (using the bigram
cost function identified earlier).

We can also use the dictionary (or a subset of it, most people only use a limited amount of it
in daily conversation) to scan the obtained plaintext to see whether it contains known words.
If it does, this suggests that corresponding parts of the key are correct - this has been greatly
facilitated by computers.

This theme of power being obtains due to guessing correctly, and then being able to do so
again in the future is important and will return again.

Stream Ciphers
A common simple stream cipher works by generating a random bit stream, and then XORing
that stream on a bit by bit basis with the plaintext. This is called a Vernan cipher.

This depends on both the sender and receiver having synchronised streams of both ciphertext,
and the keys - packet loss would send the streams out of synch. A very large amount of work
has been done on random stream generation, it's more difficult than it first appears.
Linear Feedback Shift Registers

At each iteration, there is a right shift, a bit fall off the end and the leftmost bit is set
according to the linear feedback function.

We want the stream to be random looking. One feature should be that the stream should not
repeat itself too quickly. As this is, in effect, a finite state machine, it will repeat itself
eventually, and the maximal period for an n-bit register is 2n-1 (it can not be 2n as if the
register was all 0, it would get stuck).

The tap sequence defines the linear feedback function, and is often regarded as a finite field
polynomial. You have to choose the tap sequence very carefully. Some choices provide a
maximal length period, and these are primitive polynomials.

The above polynomial has a maximum period, and it is common to denote this as C(D) = 1 +
D + D4. A similar polynomial, C(D) = 1 + D + D3 does not give a maximal period sequence.

This kind of implementation is not good for pseudo-random number generation. If we had a
64-bit register, then once we know a very small amount of plaintext (e.g., 32 consecutive
bits), then you can calculate the corresponding key stream and so know the rightmost 32 bits
of the register. The remaining 232 bits can be brute forced quite easily, and when you get the
right one, the whole key stream should be generated.

This is far too easy to break, but LFSRs are very easy to implement, and are fast. To fix
matters, we can use a less primitive way of extracting the key stream, or even combining
several streams to achieve better security.
A very simple model would involve using some function, f to operate on some subset of the
LFSR register components.

Algebraic Normal Form

A boolean function on n-inputs can be represented in the minimal sum (XOR, or +) of


products (AND .) form:

f(x1, ..., xn) = a0 + a1.x1 + ... + an.xn + a1,2.x1.x2 + ... + an - 1, n.xn - 1.xn + ... + a1, 2, ..., n.x1.x2...xn.

This is called the algebraic normal form of the function.

The algebraic degree of the function is the size of the largest subset of inputs (i.e., the number
of xj in it) associated with a non-zero co-efficient.

e.g., 1 is a constant function (as is 0). x1 + x3 + x5 is a linear function, x1.x3 + x5 is a quadratic


function, etc.

A very simple model with a linear f would be pretty awful. If we knew a sequence of
keystream bits, then essentially every key stream output can be expressed of a linear function
of elements of the initial state. We can derice a number of these equations and then solve
them by standard linear algebra techniques.

A non-linear f is slightly better, but it is still possible to attack such systems if f is


approximated by a linear function.

Please see slide 5


for more detail.

Given f(x1, ..., xn), it is fairly straightforward to derive the ANF. If we consider the general
form, the constant term a0 is it is f(0, ..., 0) (the 0's cancel out any other term).

Classic Model

The classical model involves a series of n-bit registers with some function f, where the initial
register values form the key.

The LFSRs need not all be the same length, and the LFSRs will give a vector input which has
a period that is the product of the least common multiple of the periods of each of the LFSRs.

An awful choice for f would be f(x1j, x2j) = x1j. If the LFSRs were 32-bits long, this would
disregard 32 of those bits, leading to our key size to be effectively 32 bits.

XOR would use all 64 bits, but would still not be a good choice. If the stream bit is 0, then
there are only 2 possible pairs - similarly if the stream value is 1. This again reduces our key
size to 32 bits.

These examples are extreme, but beware of linearity, even a hint of it can cause weaknesses.
Divide and Conquer Attacks

A Geffe generator works by using a nx1 multiplexor, with n + 1 LFSRs, with one being used
for select. For a 2x1 multiplexor, this gives us the equation: Z = (a & b) + (not(a) & c).

If we put this into a truth table, we see that the output Z agrees with b 75% of the time, and
also agrees with c 75% of the time.

If we consider each possible initial state s of the second register and then determine the
stream produced from this initial state, we can check the degree of agreement of this stream
with the actual key stream. If state s is correct, we will get roughly the right amount of
agreement. If it is incorrect, the agreement will be roughly random.

We can target the third register in exactly the same way, and then can derive the first register
(the selector) very easily by trying every possible state. The correct one would allow us to
simulate the whole sequence. There are other ways too.

These divide and conquer attacks were suggested by Siegenthaler as a means of exploiting
approximate linear relationships between function inputs and its ouput. This led to new
criteria being developed as countermeasures to these correlation attacks.

If we have two functions f(x1, x2) and g(x1, x2), we say that f(x1, x2) is approximated
by g(x1, x2) if the percentage of pairs (x1, x2) which given the same values for f and g differs
from 50%.

If they agree precisely half the time, we say that they are uncorrelated. If the agreement is
less than 50%, we can do g(x1, x2) + true to find a function with positive agreement.

We can consider similar ideas for n-input functions: f(x1, x2, ..., xn) and f(x1, x2, ..., xn). The
degree of approximation with linear functions may be slight, and the smaller the degree of
approximation, the more data you need to have to break the system.

The idea of multiple LFSRs is that the size of the keyspace should be the product of the
keyspace sizes for each register. Divide and conquer reduces this to a sum of key sizes, and
you can attack each in turn. When you crack one LFSR, the complexity of the remaining task
is reduced (e.g., f(x1, x2) = x1.x2 + x1 - once you know what x1 is, then whenever you
know x1 = 1, you know what x2 is).

In a similar vein, if we suppose there is a small exploitable correlation with input x1 and there
is a small correlation with x1 + x2. If LFSR can be broken to reveal x1, then we now have a
straightforward correlation with x2 with which to exploit.

All these attacks assume you know the tap sequence. If you kept the feedback polynomial
secret (i.e., make it part of the key), this makes things harder, but there are in fact some
further attacks here too.

Boolean Functions

Boolean functions are, unsurprisingly, those of the type: f : {0,1}n → {0,1}. We can also
consider a polar representation (f^(x)), where f^(x) = (-1)f(x).
Boolean functions can be regarded as vectors in R2n with elements 1 or -1 (derived from our
polar form). Any vector space has a basis set of vectors, and given any vector v, it can always
be expressed uniquely as a weighted sum of the vectors in the basis set.

If the basis vectors are orthogonal and each have a normal length (1), we can say they form
an orthonormal basis. We can express any vector in terms of its projections onto each of the
basis vectors.

Given a basis, you can always turn it into an orthonormal basis using the Gram-Schmidt
procedure. Given an orthogonal basis, you can always create an orthonormal one by dividing
each vector by its norm.

n-dimensional vectors can be normalised in the same way, the norm is the square root of the
sum of the squres of its elements.

Linear Functions

Read the slides.

If we recall that for any ω in 0..(2n - 1), we can define a linear function for all x in 0..(2n - 1)
by Lω(x) = ω1x1 XOR ... XOR ωnxn, where ω and x are simply sequences of bits.

We will use natural decimal indexing where convenient (e.g., ω = 129 = 10000001).

The polar form of a linear function is a vector of 1 and -1 elements defined by L^ω(x) = (-
1)omega;1x1 XOR ... XOR ωnxn = ∏nj = 1(-1)ωjxj.

One criterion that we may desire for combining function is balance (that is, there are an equal
number of 0s and 1s in the truth table form, and there are an equal number of 1s and -1s in the
polar form), so the polar form has elements that sum to 0, or, if you take the dot product of
the polar form of a function with the constant function comprising all 1s, the result is 0.

Each linear function has an equal number of 1s and -1s (and so is a balanced function).

Dissimilar linear functions are orthogonal. If we consider the dot product of any two columns
of the 8×8 matrix given in the slides, the result is 0.

The linear functions are vectors of 2n elements each of which is 1 or -1, the norm is therefore
2n / 2, and thus we can form an orthonormal base set.

Since a function f is just a vector, and we have an orthonormal basis, we can represent it as
the sum of projections onto the elements of that basis using the Walsh Hadamard function.

Various desirable properties of functions are expressed in terms of the Walsh Hadamard
function, e.g., balance means that the projection onto the constant function should be 0.
We saw that functions that looked like linear functions too much were a problem, but a
measure of agreement is fairly easily calculable (Hamming distance with linear function in
usual bit form, in polar form, we simply take the dot product of the linear function).

A function f that has minimal useful agreement (i.e., 50% agreement) with Lω has Hamming
distance 2n / 2 with it (or, in polar terms, half the elements agree and half disagree). If all the
elements disagree (i.e., f = not Lω), we can form another function that agrees with Lω entirely
by negating f (i.e., f + 1).

If correlation with linear functions is a bad idea, let's have all such correlations equal to 0
(i.e., choose f such that the projections onto all linear functions are 0). This is not possible,
however. Because we have a basis of linear functions, if a vector has a null projection onto all
of them, it is the zero-vector.

A Boolean function is not a zero-vector, it must have projections onto some of the linear
function. But some projections are more harmful than others, from the point of view of the
correlation attacks.

These correlations with single inputs are particularly dangerous, with the danger decreasing
proportional to the number of inputs.

Correlations with single inputs correspond to projections onto the Lω where the ω has only a
single bit set. Similarly, correlations with linear functions on two inputs correspond to the
projections onto the linear functions Lω where the ω has only two bits set.

If a function has a null projection onto all linear Lω functions with 1, 2, ..., k bits set in ω (i.e.,
it is uncorrelated with any subset of k or fewer inputs), the function is said to be correlation
immune of order k. If it is also balanced, then we say it is resilient.

For a variety of reasons (there are other attacks that exploit linearity), we would like to keep
the degree of agreement with any linear function as low as possible. If we can not have all
that we want (all projections 0), we can try to keep the worst agreement to a minimum.

This leads to the definition of the non-linearity of a function. We want to keep the Hamming
distance to any linear function (or its negation) as close to 2 n / 2 as possible. Or, keep the
maximum absolute value of any projection on a linear function to a minimum, that is, keep
maxω|F^(ω)| as low as possible.

Non-linearity is defined by: Nf = 0.5(2n - maxω|F^(ω)|). It seems to minimise the worst


absolute value of the projection onto any linear function, but what is the maximum value we
can get for non-linearity?

We can project these vectors onto a basis of 2n orthogonal (Boolean function) vectors L0, ...,
L2n - 1, where Lω(x) = ω1x1 XOR ... XOR ωnxx.

Bent functions were first researched by Rothuas, and maximise non-linearity, and are
function based on even number of variables. Bent functions have projection magnitudes of
the same size, but with different signs. This includes projection onto the constant function,
not a balanced function, therefore if you want maximum non-linearity, you cannot have
balance.
Additionally, non-linearity and correlation immunity are in conflict, as it requires an increase
in the magnitude of the vectors.

All other things being equal, we would prefer more complex functions to simpler ones. One
aspect that is of interest is the algebraic degree of the function, which we would like to be as
high as possible.

This causes a conflict with correlation immunity. Sigenthaler has shown that for a
function f on n variables with correlation immunity of order m and algebraic degree d, we
must have m + d ≤ n, and for balanced functions, we must have m + d ≤ n - 1.

There is another structure that can be exploited. It is a form of correlation between outputs
corresponding to inputs that are related in a straightforward way. This is autocorrelation.

We begin to see the sorts of problems cryotpgraphers face. There are many different forms of
attack, and protecting against one may allow another form of attack. However, given the
mathematical constraints, we might still want to achieve the best profile of properties we can
- a lot of Boolean function research seeks constructions to derive such functions.

There is no such thing as a secure Boolean function. There may be functions that are
appropriate to be used in particular contexts to give a secure system, however, the treatment
here shows quite effectively that life is not easy and compromises have to be made.

Block Ciphers
Another modern type of cipher is the block cipher, which is very common. They work by
taking a block of data (e.g., 64, 80, 128 bits, etc) and transform it into a scrambled form
(itself a block of the same size).

The encryption key K determins the mapping between the plaintext blocks and the ciphertext
blocks. Without a knowledge of the key, this should look like a random association.

In the examples on the slide, this key is expressed by citing the whole permutation, however
in practice this would be very difficult. Usually, a key consists of a number of bits, the value
of which together with the algorithm defines a functional relationship between the input and
output blocks.

Defining this relationship by enumeration is rare, except for very small components of an
algorithm.

There are many different block ciphers, the most common are DES (Data Encryption
Standard), LUCIFER (a precursor to DES), IDEA (used in PGP) and AES (Rijndael's
Advanced Encryption Standard).

Data Encryption Standard

The Data Encryption Standard (DES) has a very controversial history. It was developed on
behalf of the US government based on previous work done by IBM. It was first issued in
1976 as FIPS 46, and has a 56-bit key (the key is actually 64-bits long, but it has check bits).
DES has generally been the first chose in commerce (e.g., banking).

DES has a public design which has been subject to attack by virtually all of the world's
leading cryptanalysts. The publication of the standard has had the greatest affect on the
development of modern day public cryptology than any other single development.

The 56-bit key length is controversial. It was originally 128, but was reduced by the NSA.
The NSA also changed the configuration of the S-boxes. Many people are suspicious of the
NSA's motives, and the design criteria for the algorithm was never revealed. Some suspect
that there is a trapdoor in the algorithm.

Each round is indexed by a 48-bit subkey K1 derived from the full key. 16 rounds in total are
applied.

The initial permutation simply permutes the 64 bits of input according to a specific pattern. It
has no security significance.

Each round takes the form given below, operating on the left and right 32-bit words. The
XOR is a bitwise XOR of the 32-bit words.
To decrypt, the same algorithm is applied with the subkeys in reverse order. The final round
is followed by a swap.

Each round includes a swap, so applying a swap immediately afterwards simply undoes it. It
is clear that the method simply decrypts as expected. Induction can be used to show that the
property holds for an arbitrary number of rounds.

The S-boxes take 6-bit inputs and give 4-bit outputs.

All S-boxes are different and are represented using tables. For an input b, we interpret b1b6 as
an integer in the range 0..3 as the row to use, and b2b3b4b5 as an integer in the range 0..15
indexing the column to use. Each entry in the table is represented by a decimal and indexed
as 4-bit binary.

For the expansion and P-permutation stages, each outer bit in a block of four influences the
row used to do substitution in the nearest neighbouring block of four. P-permutation is
straightforward permutation.
The 56-bit key is split into two 28-bit halves, and each half is subject to a cyclic shift of 1 or
2 bits in each round. After these shifts, a sub-key is extracted from the resulting halves. This
is repeated for each round.

The theoretically promising cryptanalytical method of differential cryptanalysis emerged in


the laste 80s, but DES was surprisingly resilient to this attack vector. Coppersmith (1994)
revealed some of the design criteria, and the reason for DES's resistence to differential
cryptanalysis - that it was specifically designed to be so.

This means that IBM and the NSA knew about this method of attack over 16 years before it
was discovered and published by leading cryptograhpic academics.

A later technique known as linear cryptanalysis exists which DES is vulnerable to, which
reduces the number of operations for a brute-force crack from 255 to 243.

Diffie & Hellman proposed a key search machine that would be able to break DES at a cost
of $20M (1977), and a detailed hardware design for a key search machine was published by
Wiener in 1993. The EFF in 1998 successfully cracked DES in 3 days using purpose built
hardware that took less than 1 year to build at a cost of less than $250,000.

FPGAs now exist that can crack DES in 2 hours.

Advanced Encryption Standard

An interational competition was held to find a replacement for DES, which resulted in
Rijndael's Advanced Encryption Standard being chosen. The competition was very public, as
was the evaluation of it.

AES uses quite simple operations and appears to give good security as well as being very
efficient. Many are now attacking the standard.

There are numerous standard modes of using block ciphers: electronic code book; cipher
block chaining; cipher feedback moe and output feedback mode. There are some others that
have been tried occasionally (e.g., propagating cipher block chaining), but we will only look
at mainstream ones.

Electrionic Code Book


Electronic code book is the most primitive means of using block algorithms for message or
data encryption. Successive blocks of plaintext are encrypted directly using the block cipher
(the plaintext block is the input and the encrypted result is the ciphertext sent).

This leads to a problem where message blocks that contain the same values are always
mapped to the same ciphertext values. Code books can be built up, where once we know that
a particular block contains a certain enciphered form, we can recognise it if it occurs again.
This kind of attack also occurred using classical ciphers.

More problematic is that the codebook can be used to forge messages pretending to belong to
one of the legitimate communicants, as elements of messages can be replaced very easily.

This kind of encryption may be more suited to randomly accessed data (e.g., databases). ECB
use is rare, and other modes are more common.

Cipher Block Chaining

There is clearly a need to mask the occurrence of identical plaintext blocks in messages. It is
possibly to do this by allowing the data sent so far to affect the subsequent ciphertext blocks.

Cipher Block Chaining (CBC) is a widely used means of doing this. It XORs the plaintext
block with the previously produced ciphertext block and then encrypts the result.

To recover the plaintext in such a system, you must first decrypt the appropriate ciphertext
block, and then XOR it with the preceeding ciphertext block to get the plaintext.

Initial blocks are crucially important. If the same initial block is used for different messages,
then it will be clear when messages have the same initial sequences. The usual solution to this
is to vary the initial block in some way. There is a general notion that initial blocks should
not be repeated between messages. This cna be achieved by randomess (this is generally the
best), but some texts advocate simply incrementing.

The matter of initial blocks is subtle, and many texts can be downright dangerous in their
advice.

Cipher Feedback Mode

If we want to transmit encrypted data a character at a time (e.g., from a terminal to a host),
cipher feedback mode is often useful as it can enxrypt plaintext of a size less than the block
size.

Typically 8-bit sub-blocks are used, but even 1-bit sub-blocks are possible.

We also need to consider the error propagation characteristics of the various modes of use,
and the effect of a single transmission error (a bit flip) in the ciphertext.
Triple-DES is where the plaintext is first encrypted with K1, then decrypted with K2 and then
re-encrypted with K1.

If K1 = K2 this is a very inefficient form of normal DES (and so can be compatible with
standard DES).

Linear Cryptanalysis
Linear cryptanalysis was invented by Matsui in 1991, and is a powerful attack against many
block ciphers.

It works by approximating block ciphers by means of linear expressions, which are true (or
false) with some bias. This bias can be utilised to discover the key bits.

For an S-box which has 4-bit inputs (X) and 4-bit outputs (Y), two linear equations on the
inputs and outputs are defined (e.g., X2 ⊕ X3 and Y1 ⊕ Y3 ⊕ Y4).

These expressions involve either inputs, or outputs, but not both. Agreement is calculated
between some input expression and some output expressions, and if this differs from equal
numbers of agreement and disagreement, than given a specific output expression, we have
strong information on the likely value of the input expression.

Slide 8 has
an example.

This shows simply how structrual relationships between input and output that hold with some
bias can be used to leak key information. Reality is a bit more sophisticated than this.

Generally ciphers are multiple round, and expressions between plaintext and intermediate
ciphertext hold with some bias. To crack such a scheme, the final round is peeled off by
trying all relevant subkeys and finding the one with the greatest approximation bias when
evaluated. This can also be done in reverse (finding an expression that holds with a bias over
rounds 2 to n which can be tested by using appropriate parts of the first key, and encrypting
to obtain intermediate ciphertext).

Differential Cryptanalysis
Differential cryptanalysis is a powerful chosen plaintext attack against many block ciphers
that was first published by Eli Biham and Adi Shamir.

In this attack, we work with differences. If two input X′ and X″ differ by bitwise XOR ΔX =
X′ ⊕ X″, then their corresponding outputs will differ similarly by some amount: ΔY = Y′ ⊕
Y″ (where Y′ = Sbox[X′] and Y″ = Sbox[X″]).

Some characteristic pairs (ΔX, ΔY) are more common than others, and this can be used to
build up a biased expression in terms of input and output differences in the same way as
linear cryptanalysis.
For a given input difference, certain output differences seem to occur more often than others.
In an ideal cipher, all output differences would be equally likely given an input difference,
but this is actually impossible.

In this analysis we seem to join likely input-output differences over consecutive rounds.
Output differences from one round feed in as the input differences to the next, in a similar
fashion to that done in linear cryptanalysis. However, working with differences gives us an
unexpected bonus, as the key bits cancel out.

Lecture 7 contains examples.

In a substitution-permutation network, the mth output of Si, j goes to the jth input of Si+1, m.
The 16-bit subkeys are assumed independent, but in practice subkeys are often derived from a
master key (as in DES). The key mixing is simply a bitwise XOR.

We can not access the intermediate ciphertext, but if we could obtain it we could use it to
show that the bias was present. However, if we use correct key bits, then the two relevant
ciphertext sub blocks will decrypt to the correct intermedia cipher text, and the relationship
when checked will hold with a bias.

If we use the wrong key bits, then after decryption we should have the same expectation that
the relationship will hold.

We can now try each target subkey in turn, and decrypt the ciphertext sub-blocks to see if
whether or not the required difference holds. The number of times this is true for the target
sub-key is counted, and the one with the greatest count is likely the correct target subkey.

If 10,000 plaintexts are encountered satisfying the required ΔP, then probability = count /
10000, and the largest deviation from 0 indicates the actual subkey.

However, as we mentioned above, the effect of the key disappears in differential


cryptanalysis, yet it matters in the final round.

Generally, out technique has been to show that there is a structure (a relationship that holds
with a bias) over n-1 rounds of the cipher, and if we can guess an idenitified subset of the
final round key bits, then the structure can be checked and seen to hold with a bias. If we
guess incorrectly for the subset, we do not get such a strong bias for the relationship. We can
therefore tell when we have the identified subset correct, as part of the key is shouting at you
- a form of divide and conquer.

We can generalise this further as there are many forms of structure, such as differences of
differences, approximating sets of differentials using linear expressions, exploiting
knowledge that certain differentials are impossible, having multiple approximations, or
defining difference differently (bitwise XOR is simple).

We can often exploit any form of structure.

The S-boxes clearly pay an important role in the provision of security, as they provide
confusion. We have seen that existence of a linear approximation leads to linear
cryptanalysis, and of a differential one to differential cryptanalysis, and with this known, can
S-boxes be strengthened to resist attack?

We want to avoid linear combinations of subsets of the inputs agreeing with linear
combinations of subsets of the outputs. That is, we don't want expressions of the following
form to hold with a bias (ωi and βj are 0 or 1): ω1x1 ⊕ … ⊕ ωnxn = β1S1(x) ⊕ … ⊕ βmSm(x).

However, the right hand side really is just a function of the inputs ƒβ(x), and so we do not
want the function ƒβ(x) to agree with a linear combination of a subset of the inputs.

We can measure agreements of this form using the Walsh Hadamard values. We want the
function ƒβ to have high non-linearity (a low maximum magnitude of Walsh Hadamards), but
as we can choose any input subset and any output subset, we take the worst non-linearity of
any derived function ƒβ(x) (i.e., over all possible β) as a measure of the strength of the S-box.

We also need to protect against the differential structure being exploited. For a simple
Boolean function with one output we could measure the degree to which inputs satisfying a
difference give rise to correlated outputs. We can do this using the autocorrelation function.
For any input difference s the autocorrelation function r(s) is given by: r^f(s) = ∑2n-
1
x=0 f^(x)f^(x ⊕ s).

If ∀x, ƒ(x) = f(x + s) then this has the value 2n. Similarly, if they always disagree the value is -
2n. Bent functions have zero autocorrelations for all non-zero s.

We take the highest magnitude value as an indication of the lack of differential input-output
structure of a function: ACf = maxs | ∑2n-1x=0 f^(x)f^(x ⊕ s) | = maxs | r^f(s) |.

This basically means that we would like a derived linear function to have low autocorrelation
(i.e., we want to reduce the biggest autocorrelation magnitude as much as possible), but we
are free to chose any dervied linear approximation for the S-box. So, we calculate the
autocorrelation for every possible derived linear function, and take that as an indicator of the
strength of the S-box.

Deriving S-boxes with a very low autocorrelation is not easy. Getting S-boxes with high non-
linearity and low autocorrelation is very hard.

Additionally, some linear and differential relationships are more worrying than others. Some
research is being done into developing correlation immune and resilient S-boxes (where all
the derived function are correlation immune). Additionaly, with small number of bits
changing, we might want for all single bits changing that the autocorrelation is 0. This leads
to a notion of propagation immunity of order 1 (a strict avalanche criterion), or more
generally a propagation immunity of order k.

Brute Force Attacks


Supercomputing, dedicated crypto-hardware and special purpose re-programmable hardware
(e.g., FPGAs) can be used to brute-force an algorithm - that is, try every key until it finds one
that works.
Special purpose hardware is expensive but has a long history. Early examples include that
designed to break the Enigma codes. Early work on the DES designs also considered special
purpose hardware, and Diffie and Hellman address brute force DES attacks in 1977. Wiener
gave a very detailed DES cracking design in 1993, and the EFF developed an engine that
cracked the RSA Labs' DES challenge in 56 hours.

Crypto is often simple and may not take a great deal of computation. Reprogrammable
hardware such as FPGAs allow software to be written and compiled down onto the
reconfigurable gate arrays for computation.

Massive parallelism allows multiple programs to be compiled to one chip and run
simultaneously. The different rounds of a multiple round encryption can be run
simultaneously in a pipeline. An FPGA attack on DES was announced by Richard Clayton in
November 2001.

The SETI@Home initiative showed that signal processing can be done all over the world as a
background task. This idea has been taken up by various groups intent on finding keys or
factorising primes.

A theoretical attack is called the 'Chinese radio attack'. If everyone in China had a radio with
a processor capable of checking 1,000,000 keys per second, then cracking DES would take
on average 30 seconds or so. At the moment this is theoretical, but with the rise of ubiquitous
computing, the mechanisms may be in place to carry out similar attacks. People are happy to
do signal processing on their PCs, so there should be no objection to spare-time computation
on their phones or TVs.

Additionally, with every increasing supercomputer speeds brute forcing is becoming ever
easier and faster.

Quantum computing allows us to dispense with the idea of distribution and allows us to carry
out all computations in parallel.

In quantum computing you can place registers in a superposition of all possible states. The
rules of quantum mechanics allow all states to be calculated at once. Technically, we carry
out 'unitary transformations' on the state. However, there's a catch as you're only allowed to
see one result. Suppose you 'look at' the final register and see either a 0 or a 1. The choice
between these is random. If we see a 1 (or a 0), then the whole superposition collapses to a
superposition consistent with that observation.

We then read off each register deterministically.

Unitary transformations can be used to bump up the probabilities of us seeing the result we
actually want (i.e., don't leave it to luck), and this is how quantum algorithms work. The
results can be surprising.

Grover's algorithm allows a state space of size 2 n to be searched for satisfying a particular
predicate in Ο(2n/2) iterations of a loop body. This gives us a massive increase in power.
Shor's algorith, which uses a quantum discrete fourier transform to extract periodicity
information from a function, gives us the biggest result and gives rise to a polynomial time
attack on RSA.

Another non-standard approach is to use DNA computing. With DNA computing, we can
perform all sorts of operations on DNA strands (splitting, merging, chopping bits out, etc)
and get strings to reproduce given the right environment. If we place the string in the right
environmental soup, a strand of DNA will attract base elements around it to give it a
complementary second strand which can be used to breed computational engines. This can be
used to give us massive potential parallelism.

Another approach is to corrupt program state in some way and then infer key information
from errant behaviour. Fault injection is a vector to be protected against.

Optimisation-based Approaches
Finding good boolean functions

Using the definitions of algorithm strength (non-linearity and auto-correlation), we can use
search to find the best ones satisfying these criteria.

The definitions of these (given in the relevant sections above) can easily be evaluated given a
function ƒ. They can therefore be used as the expressions to be optimised (and traditionally
they are).

Hill Climbing

One type of search is hill climbing (local optimisation), however if started at the the wrong
point we can get stuck at a local optimum. Simulated annearling is an algorithm that allows
non-improving moves so that it is possible to go down in order to rise again to reach a global
optimum. However, in practice the neighbourhood may be very large and the trial neighbour
is chosen randomly, so it is possible to accept a worsening move when an improving one may
exist.

Simulated Annealing

In simualted annealing, improving moves are always accepted, and non-improving moves
may be accepted probabilistically and in a manner depending on the temperature parameter T.
Loosely, the worse the move is, the less likely it is to be accepted, and a worsening move is
less likely to be accepted the cooler the temperature.

The temperature T starts high and is gradually cooled as the search progresses. Initially,
virtually anything is accepted, and at the end only improving moves are allowed (the search
effectively reduces to hill climbing).

More formally, we can represent this as pseudo-code:

current x = x0
temp = T0
until Frozen do:
do 400 times: (at each temperature, consider 400 moves
y = generateNeighbour(x)
Δ = ƒ(x) - ƒ(y)
if (Δ > 0):
current x = y
accept (always accept improving moves)
else if (exp-Δ/Temp > U(0,1))
current x = y
accept (accept worsening moves probabilistically
else reject (it gets harder to do the worse the move)
temp = temp × 0.95
return the best so far

However, there are more advanced versions of the cost function we can consider than the
simple non-linearity/autocorrelation ones above.

Parseval's theorem states ∑2n - 1ω = 0 F^(ω)2 = 22n, so loosely we can push down on F(ω)2 for
some particular ω and it appears elsewhere. This suggests that arranging for uniform values
of F(ω)2 will lead to good non-linearity - bent functions achieve this but we shall be
concerned with balanced functions. This is the motivation for a new cost function:

cost(f) = ∑2n-1ω=0 | | F^ƒ(ω) | - X | R.

This considers the cost not for one particular input but for all inputs, to ensure that the
function is optimised for all cases, not specific ones.

Moves in our search should also preserve balance. Balance is the number of 0s and 1s output.
In order to preserve balance, we start with a balanced (but otherwise random) solution, and
then our move strategy preserves it. This is done by defining the neighbourhood of a
particular function as the set of functions obtained by exchanging two dissimilar values. We
should also note that neighbouring functions have close non-linearity and autocorrelation,
providing some degree of continuity.

Minimising the cost function family by itself doesn't actually give good results, but is good in
getting in the right area. Our actual method is:

• Using simulated annealing to minimise the cost function given (for given
parameter values of X and R), with the resulting function being ƒsa;
• Hill-climb with respect to non-linearity (using the non-linearity targeted
techinque), or;
• Hill-climb with respect to autocorrelation (the autocorrelation targeted
technique).

In 1995, Zheng and Zhang introduced two global avalanche criteria (autocorrelation and sum-
of-squares), so autocorrelation bounds are receiving more attention. The sum-of-squares
conjecture seemed to give poor results.

Another approach is correlation immunity, which seeks to punish a lack of correlation


immunity (the first part of the formula) and low non-linearity. The cost function here is:

cost(ƒ) = ∑|ω|≤m | F^ƒ(ω) |R + A × maxω | F^ƒ(ω) |


We can do a linear transformation for correlation immunity. If we let WZƒ be the set of
Walsh zeros of the function ƒ: WZƒ = { ω : F^ƒ(ω) = 0 } (that is, the inputs where the Walsh
Hadamards are 0).

If Rank(WZƒ = n, then we can form the matrix Bƒ whose rows are the linearly independent
vectors from the Walsh zeros. We can now let Cƒ = Bƒ-1 and let ƒ′(x) = ƒ(Cƒ x).

This resulting function ƒ′ has the same nonlinearity and algebraic degree and is also CI(1) -
correlation immune. This can be applied to the basic functions generated earlier.

Designing Functions with Trapdoors

We can also use these techniques to plant, and detect, trapdoors left in cryptographic
functions.

If we are planting a trapdoor, we first want to find a function with high non-linearity and low
autocorrelation (that is, belongs in our set of Public Goodness functions - P), but also has our
trapdoor function (functions in the subset of the design space - T), so we need our function to
be in the set P ∩ T.

If we suppose we have an effective optimisation based approach to getting functions with this
public property (P) as discussed above, we can let our cost function be cost = honest(ƒ). If we
similarly had an effective optimisation technique to find functions with our trapdoor property
T, we could let our cost function = trapdoor(ƒ).

We can combine the two to find a function we actually want to use: sneakyCost(ƒ) = (1 - λ) ×
honest(ƒ) + λ × trapdoor(ƒ), where λ is the malice factor (λ = 0 is trult honest and λ = 1 is
wicked).

We want to be able to tell whether an unknown trapdoor has been inserted. Experiments have
used a randomly generated vector as a trapdoor, and closeness to this vector represents a good
trapdoor bias. We can investigate what happens when different malice factors are used if high
non-linearity and low autocorrelation are our goodness measures.

Of all our publicly good solutions, there are two distinct subsets, those found with a honest
cost function, and then with a high trapdoor bias found by combining trapdoor functions.

There appears to be nothing to distinguish the sets of solutions obtained, unless you know
what form the trapdoor takes, however... If we represent two groups of functions as vectors
and calculate the mean vectors, projecting one onto teh other and calculating the residual r.
This represents the λ value.

Other ways we can use these approaches is to S-box design (i.e., multiple input-multiple
output functions), however success has been less obvious, due to the enormity of the search
space, even for small S-boxes.

Public Key Encryption


A new way of doing cryptography was published in 1976 by Whitfield Diffie and Martin
Hellman, entitled New Directions in Cryptography.

Essentially, each user has two keys, a public key made available to the world (for user A, this
is KA), and a private key kept secret by the user (KA-1).

This system makes possible many things that can not be done with conventional key
encryption.

To send a message M in confidence to A, user B encrypts the message M using A's public
key KA. A (and only A) can decrypt the message using the private key KA-1.

Some public key schemes allow the public and private keys to be used for both encryption
and decryptiong (i.e., use the private key for encryption, and the public key for decryption).
Thus, to send a message and guarantee authenticity, A encrypts with KA-1, and then B
decrypts with KA. If the decrypted message makes sense, then B knows it came from A.

Rivest-Shamir-Adleman (RSA)

The best-known public key encryption algorithm was published in 1978 by Rivest, Shamir
and Adleman, it is universally called RSA. It is based on the difficulty of factoring a number
into its two constituent primes.

In practice, the prime factors of interest will be several hundred bits long.

First, we pick two large primes p and q, and calculate n = p × q. e is found relatively prime to
(p - 1)(q - 1), i.e., no factor in common other than 1. We then find a d such that e × d = 1 mod
(p - 1)(q - 1), e.g., using Euclid's algorithm.

e is the public key and d the private key. e and n are released to the public.

Encryption and decryption are actually the same operation: C = M e mod n and M = Cd mod n.

RSA set up a very successful company and became the world's most celebrated algorithm.
However, it turns out the ideas were developed at GCHQ in the late 60s and early 70s by
Ellis and Cox. Additionally, in 1974 Williamson developed an algorithm that would later be
known as the Diffie-Helman algorithm.

Public key is often used to distribute keys or encrypt small amounts of data. A major problem
is speed, the modular exponentiation eats up computational resources: "At it's fastest, RSA is
about 1000x slower than DES" - Schneier.

Hash Functions

A hash function takes some message M of an arbitrary length and produces a reduced form,
or digest, H(M) of a fixed length.

Hash functions have several desirable properties:

• Given M it is easy to compute H(M);


• Given a hash value h it is hard to compute M such that h = H(M) (i.e., hash
functions are one-way);
• Given M, it is hard to find another message M′ such that H(M) = H(M′)
(hash functions should be resistant to collisions).

In an open use of a hash function, sender A sends a message M together with hash value: (M,
H(M)). The receiver then calculated H(M) from M and checks that the result agrees with the
encrypted hash value. If so, then B may assume that the message has not been altered since it
was sent, but he can not prove it was sent by A.

However, we can use hash functions in an authenticated way. If we assume A and B share a
key K, then sender A sends a message M together with the encrypted hash value: (M, E(K,
H(M))). The receiver then calculates H(M) and encrypts it with the key K, checking whether
the result agrees with the encrypted hash value received. If it does, then B can assume that the
message has not been altered since it was sent, and that the message was sent by A.

Encrypting just the hash value is faster than encrypting the whole message.

This of course assumes that A and B trust other: A can not prove that B created a message or
vice-versa (either could have just made it up themselves). But with variants of public key
cryptography (e.g., RSA), we can get the sender to encrypt the hash using his or her private
key - we say the hash is signed.

If our hashing algorithm was not resistant to collisions, then we could monitor the network
for a message (M, E(K, H(M))) and record it, and then find a message such that H(M′) =
H(M), and then forge a message (M′, E(K, H(M))).

Factorisation

Pollard's Rho Method

This is a simple algorithm. If we suppose we wish to factor n, then we shall be concerned


with Zn, the positive integers modulo n. We then choose an easily evaluated map from
the Zn to Zn, e.g., ƒ(x) = x2 + 1 mod n.

If we start with some value x0, we generate a sequence such that: x1 = ƒ(x0), x2 = ƒ(x1) =
ƒ(ƒ(x0)), x3 = ƒ(x2) = ƒ(ƒ(ƒ(x0))) …

These various x's are compared to see if there is any xj, xk such that gcd(xj - xk, n) ≠ 1. If so,
we have a divisor of n (the greatest common divisor gcd(a, b) is the largest number that
divides both a and b.

Essentially, we are seeking values xj and xk such that they are in the same residue classes
modulo some factor r of n.

The method is faster than simply enumerating all the primes up to √n. Variations on this
theme exist.

Fermat Factorisation
This method is useful when n is the product of two primes that are close to one another.

We define n = a × b, t = (a + b)/2, s = (a - b)/2 so a = t + s and b = t - s, therefore n = t2 - s2.

If a and b are close, then s is small, and so t is only slightly larger than the square root of n.
We then try successive values of t, t = ⌊√n⌋ + 1, , t = ⌊√n⌋ + 2, t = ⌊√n⌋ + 3, …, until we find a
value where t2 - n = s2 is a perfect square.

Fermat can be generalised. If we discovered t and s such that t2 = s2 mod n, then we have n |
(t + s)(t - s), or a × b | (t + s)(t - s) (where | means divides).

a is prime and must divide one of (t + s) or (t - s), we can use gcd to find out.

However, we must be able to find appropriate congruences. Random generation is unlikely to


work.

A very important method is to generate various bi such that bi2 mod n is a product of a small
number of primes, and that when multiplied together give a number whose square is
congruent to a perfect square mod n.

A factor base is a set of distinct primes (that we may have p1 = -1). We say that the square of
an integer b2 is a B-number if the least absolute residue mod n is a product of factors from B.
The least absolute residue of a mod n is the number in the range -n/2 to n/2 to which a is a
congruent mod n.

Probably the most important factor base methods to emerge in recent years have been
variants of the quadratic sieve developed by Pommerance. This has been the best performer
for n with no factor or order of magnitude significantly less than √n. The most recent is the
number field sieve.

The best general purpose algorithm typically had a complexity of Ο(e(1 + ε)√log n log log n)),
however number field sieve has the complexity exp(Ο((log n)1/3 (log log n)2/3)).

Knapsacks

The knapsack scheme was at the heart of the first published public key scheme, but is fraught
with danger.

Variants on the knapsack problem have proven of considerable interest in public key
cryptography. They are not much used these days, but are of high historial importance.

If we take a sequence of positive integers B, with a super-increasing property: B = ⟨b1, …, bN


- 1, bN⟩, that is b2 > b1, b3 > b2 = b1, or generically, bN > ∑
N-1
j = 1 bj.

If we suppose we have a plaintext N-bit sequence p1…pN - 1pN, we encrypt this by E(p) =
∑Nj = 1 pjbj, but the plaintext can be easily recovered.
Example in lecture 14
slide 8.

By considering each value of the knapsack in decreasing order we can consider whether or
not it can possibly exist in the final solution, and therefore whether the corresponding value
of the key is 1 or 0.

In order to strengthen this, we can hide the super-increasing property of a sequence by


transforming it in some way or another. If we start with a super-increasing knapsack B′ = ⟨b′1,
…, b′N - 1, b′N⟩ and two integers (m, p) with no factors in common. A new knapsack is now
formed: B = ⟨b1, …, bN - 1, bN⟩, bi = m × b′i mod p. This new knapsack is non-super-
increasing.

Encryption is done as before with the new knapsack: C = E(p) = ∑Nj=1 pjbj.

To decrypt, we assume we can find an inverse to m such that m × m-1 = 1 mod p, then we can
calculate:

Cm-1 mod p = ∑Nj=1 pjbjm-1 mod p


Cm-1 mod p = ∑Nj=1 pjb′jmm-1 mod p
Cm-1 mod p = ∑Nj=1 pjb′j mod p

The final form is super-increasing.

Euclid's algorithm for greatest common divisor

d | X and d | X + Y ⇒ d | Y

r0 = q1r1 + r2, r1 = q2r2 + r3, …, rm - 2 = qm - 1rm - 1 + rm, rm - 1 = qmrm, where 0 < rm < rm - 1 < …
< r2 < r1

gcd(r0, r1) = gcd(r1, r2) = … = gcd(rm - 1, rm) = rm

Euclid's algorithm can be extended such: Given integers a and b with a greatest common
divisor d = gcd(a, b), we can find integers r and s such that ra + sb = d. For the case
where a and b are relatively prime (have no factor in common), we can find r and s such
that ra + sb = 1.

Real knapsacks may have hundreds of elements. The methods of attack aren't detailed here,
but there have been many knapsack variants, and virtually all have been broken. Schneir says
"There have all been analysed and broken, generally using the same techniques, and litter teh
cryptographic highway".

Other Schemes
Sometimes you may want to show that you have some secret, or capability, without actually
revealing what it is. For example, to buy alcohol you only need to demonstrate that you are
over 18 years old, though you are commonly required to additionally demonstrate who you
are - this is too strong. From this, zero-knowledge schemes have been defined.

The latest research in cryptography is focussing on using quantum effects to provide provable
security. Considerable strides are being made in using quantum key distribution.

Zero-Knowledge Protocols
Zero-knowledge proofs allow provers to show that they have a secret to a verifier without
verifying what that sequence is.

A classic example of this is called Quisquatier and Guillou's cave. If we suppose that there is
a secret to open a door in a circular cave, then if Peggy (our prover) has the secret, they can
exit the cave through both ways. If not, they can only exit it the way they came in.

Victor (our verifier) stands at the entrance the cave, and Peggy enters. Victor does not see
which why Peggy enters. Victor tells Peggy to either come out the right or left passage.
Peggy does so, and if necessary and she has the secret, uses the door.

There is 50% chance Peggy did not need to use the door, so Victor asks Peggy to repeat the
sequence, and other time the chance that Peggy never needs to use the door exponentially
decreases.

Graph Isomorphisms

See GRA.

We can use tests of graph isomorphism to apply this protocol to cryptography. Graph
isomorphism is a very hard problem in general and we can use this fact to allow us to
demonstate knowledge of an isomorphism without revealing the isomorphism itself.

If we start with two isomorphic graphs G and G′, then Peggy generates a random
isomorphism H of G and sends it to Victor. Victor then asks Peggy to prove H is isomorphic
to G, or H is isomorphic to G′.

If Peggy knows the secret mapping f she is able to respond appropriately to either of these
requests. For the G case, she just sends the inverse q of the random isomorphism p. For the G′
case, she responds with r = q;f. This is repeated as in the example above.

Many schemes have been proposed based on the supposed computational intractability of
various NP-complete problem types. The following example shows us that the security of
such schemes may be harder to ensure than it would first seem.

Perceptron Problems

Given a matrix Am × n where aij = ±1, find a vector Sn where sj = ±1, so that in the vector
Am × nSn × 1, wn ≥ 0.
This problem could be made harder to imposing an extra constraint, so Am × nSn × 1 has a
particular histogram H of positive values.

The suggested method to generate an instance is to generate a random matrix A, and then a
random secret S, and then calculate AS. If any (AS)i < 0, then the ith row of A is negated.

However, there is significant structure in this problem, as there is a high correlation between
the majority values of the matrix columns and the secret corresponding secret bits. Each
matrix row/secret dot product is the sum of n Bernouilli variables. The intial histogram has
Binomial shape, and is symmetric about 0, and after negation this simply folds over to be
positive.

Using Search

To solve the perceptron problem, we can use search to search the space of possible secret
vectors (Y) to find one that is an actual solution to the problem at hand.

We first need to define a cost function, where vectors that nearly solve the problem have a
low cost, and vectors that are far from solving the problem have a high cost. We also need to
define a means of generating neighbours to the current vector, and a means to determine
whether to move to that neighbour or not.

Pointcheval couched the perceptron problem as a search problem. The neighbourhood is


defined by single bit flips on the current solution (Y), and the cost function punishes any
negative image components.

A solution to this permuted Perceptron Problem (where a Perceptron Problem solution has a
particular histogram) is also a Perceptron Problem solution. Estimates of cracking the
permuted Perceptron Problem on a ratio of Perceptron Problem solutions to the permuted
Perceptron Problem solutions were used to calculate the size of the matrix for which this
should be the most difficult: (m, n) = (m, m + 16).

This gave rise to estimates for the number of years needed to solve the permuted Perceptron
Problem using annealing as the Perceptron Problem solution means that instances with
matrices of size 200 could usually be solved within a day, but no Pointcheval problem
instance greater than 71 was ever solved this way, despite months of computation.

Knudsen and Meier carried out a new approach in 1999, that loosely carried out sets of runs,
and noted the positions where the results obtained all agree. Those elements where there is
complete agreement is fixed, and new sets of runs are carried out. If repeated runs give some
values for particular bits, the assumption is that those bits are actually set correctly.

This sort of approach was used to solve instances of the Perceptron Problem up to 180 times
faster than the previous mechanism.

This approach is not without its problems, as not all bits that have complete agreement are
correct.

An enumeration stage is used to search for the wrong bbits, and a new cost function with w1 =
30 and w2 = 1 is used with histograph punishment: cost(y) = w1costNeg(y) + w2costHist(y).
Fault Injection

What limits the ability of annealing to find a solution? A move changes a single element of
the current solution, and we want current negative image values to go positive, but changing
a bit to cause negative values to go positive will often cause small positive values to go
negative (as it is near in the simulated annealing).

Results can be significantly improved by punishing at a positive value, which drags elements
away from the boundary during the search. Using a higher exponent in differences (e.g.,
|wi - k|2, where k is the boundary, e.g., 4) rather than simple deviation.

With this, we can use a similar cost function as Knudsen and Meier, but with fault injection
on the negativity part (and different exponents).

cost(Y) = G∑mi = 1 (max{ K - (AY)i, 0 })R + ∑ni = 1 (| Hy(i) - Hs(i) |)R.

Each problem instance is attacked using a variety of different weightings (G), bounds (K) and
values of the exponent (R). This resutls in a number of different viewpoints on each problem.

The consequence is that the warped problems typically give rise to solutions with more
agreement to the original secret than the non-warped ones. However, results may vary
considerably, even between runs for the same problem.

Democratic viewpoint analysis is used, as it is essentially the same as Knudsen and Meier as
before, but this time substantial rather than unanimous aggreement. If we choose the amount
of disagreement which is tolerated carefully, we can sometimes get over half the key this
way.

Timing Channel Attacks

A lot of information is thrown away, but if we monitor the search process as it slows down
we can profile annealing and gain information from the timing.

This is based on the notion of thermostatistical annealing. Analysis shows that some elements
will take some values early in the search and then never subsequently change (i.e., they get
stuck). These ones that get stuck early often do so for a good reason - they are the correct
values.

Timing profiles of warped problems can reveal significant information; the trajectory by
which a search reaches its final state may reveal more information about the sought secret
than the final state of the search.

Multiple clock watches analysis is essentially the same as for timing analysis, but this time
the times of all runs where each bit got stuck is added up. As we might expect, those bits that
often get stuck early (i.e., have low aggregate times to getting stuck) generally do so at their
correct values (they take the majority value). This seems to have a significant potential, but
needs some work.

What the above has shown is that search techniques have a computational dynamics too.
Profiling annealing on various warped problems (mutants of the original problem) has also
been looked at, and this shows an analogy with fault injection, but here it is public
mathematics that is being injected.

Attacking the Implementation


If we take the purely mathematical black box crypto function approach we' have taken so far,
then the input/output relation is all that can be viewed/analysed. However, crypto functions
are executed on physical devices. This execution takes time and consumes power, and the
precise characteristics of these are data-dependent. This can be observed and used as the basis
for an attack.

A simple example here is that of safe breaking with combination dial locks. If we make the
assumption that trying a combination is atomic, then we miss the intermediate stages. In
reality, a sequence of tumblers fall as the dial is turned correctly, which we can attack by
using a stephoscope to listen to the intermediate states as the tumblers fall. This reduces the
total state space greatly, from n1 × n2 × … × nt to n1 + n2 + … + nt.

The TENEX password attack is similar. It mistakenly assumes that passwords are atomic, but
in reality the password is checked byte by byte, and aborts if an incorrect byte is found. If the
password is stored across a page boundary in memory, then we can watch for page faults,
which occur only if all the characters up to that point are correct. This attack requires the
ability to engineer the password to be stored across a page boundary in memory.

The important thing to note here is that we are observing something that is outside of the
abstract model.

Timing Attacks

Many crypto algorithms calculate R = yk mod m, where y is the plaintext and k is the secret
key, however exponentiation is often very large and is broken down to be more efficient.

Examples on slide 9
of lecture 11.

In base 2, if we let the binary expression of the n-bit key k be kn - 1…k1k0, then xk = yk0k1…kn-1 =
(…((yk0)2 × yk1)2 … )2 × ykn - 1.

Similarly for modulo arithmetic, in general (x × Y) mod m = ((x mod m) × (y mod m)) mod m.

With this we can develop an algorithm for modular exponentiation:

s0 := 1
for i = 0 to n-1:
Ri := (if ki = 1 then (s1 × y) mod m else si)
si + 1 := (Ri × Ri) mod m
return Rn-1

We assume the attacker knows the total time to execute the algorithm for several unknown k.
We notice above that the algorithm performs key-dependent multiplications, that is, it does
when ki = 1, but not when ki = 0 (from line 4 of the algorithm). So, given a range of k, the
attacker can collect enough information to deduce the Hamming weight (number of 1s) of
each key.

This isn't very interesting, because the Hamming weight is normally approximately n/2. It is
also easily blinded, by doing dummy calculations when ki = 0, e.g.,

d := (si × y) mod m
Ri := (if ki = 1 then d else si)

We can do a better attack in the same theme. If we assume the attacker knows the total time
to execute the algorithm for a givem m and several known y: {(y1, t1), …, (yn, tn)}, then the data-
dependent time take to perform each multiplication comes from the code: Ri := (if ki = 1 then (si × y) mod m
else si)
si+1 := (Ri × Ri) mod m

We notice that the time for each multiplication is depdendent on s i, and hence on the key
bits ki…k0 used so far.

Different ys result in different calculations, which will take different times, so the measured
total times for the different y have a variance: σ2 = 1/N ∑Ni = 1(ti - μ)2.

The attack then emulates the algorithm for each yi. As we have assumed the attacker knows
the data-dependent multiplication times, the attacker can caculate how long each round takes,
assuming a key-bit value.

The first step is to calculate the time for the i = 0 loop twice, getting getting tk = 0 by assuming
that k0 = 0 and tk=1 for k0 = 1, and then subtract the calculated time from the measured ti.

We can now calculate new variances of the remaining times. On average, one of the new
variances is higher, and one lower than the original: σk = 02 < σ2 < σk = 12, or, σk = 12 < σ2 < σk =
2
0 .

The smaller variance indicates the correct choice of key bit, as the correct guess subtracts off
the correct times, leaving less variation in the remaining times (one algorithm loop fewer of
different timings). An incorrect guess subtracts off an arbitrary time, increasing the variation
in the timings.

This can then be interated. The attack progresses bit by bit, subtracting the times for the
relevant calculations and the result of the previous key bit guesses. If a mistake is made, the
variances start to increase (even if subsequent key bit guesses are correct, because of the
wrong value of si being used in the timing calculations), indicating the attack should
backtrack.

This continues until the whole key is revealed.

This attack is computationally feasible (it progresses one bit at a time), is self-correcting as
erroneous guesses become apparent, and works against many obvious blinding defences (but
not all).
Power Analysis

Time is not the only resource consumed by implementations; they all require some form of
power, and unsurprisingly monitoring power consumption can reveal secret key information
too.

To measure the power consumption of a circuit, a small (50 Ω) resistor is placed in series
with the power or ground input. The voltage difference across the resistor can be used to give
current (V/R=I). A well-equiped electronics lab can sample at over 1 GHz with less and 1%
error, and sampling at 20 MHz or faster can now be done quite cheaply.

Simple Power Analysis

Different instructions have different power consumption profiles.

If we consider DES, we recall that each half of the key in a round is subject to a cyclic shift
of 1 or 2 bits following a schedule. After the shifts, a sub-key is extracted from the resulting
halves, and this is repeated for each round. These difference in cyclic shifts are what we care
about.

The executing of these different paths reveals itself in the power trace as variations between
rounds. These are due to conditional jumps and computational intermediates.

We can use simple power analysis to watch the algorithm execute and the conditional
branches depending on key bit values (different instructions require different power for a 0 or
1), as well as other conditional branches, multiplications, etc.

It is relatively simple to blind the implementation against simple power analysis attacks by
avoiding conditional branching, generating "noise", etc.

Differential Power Analysis

Differential power analysis is used to extract correlations between different executions. The
structure of DES can be used to reduce the search space. We can guess some of the key bits,
and then look at the correlations that exist if the guess is correct.

In DES, the initial permutation (IP) is followed by 15 rounds with crossovers of halves
(remember the 16th does not have cross-over). This IP is public knowledge, so if we have the
ciphertext we know L16, R16 and R15. Additionally, if we new the key for round 16 (K16),
we would be able to calculate the outputs of ƒ and deduce L15 too.

For each 6-bit sub-key Ks corresponding to an identified S-box we can calculate four bits of
output after the permutation, and hence four bits of L15. We define D(Ks, C) as the predictor
for one of these bits.

The method here involves monitoring the power as each of m cipher-texts, C1, C2, …, Cm is
produced. Each power trace is taken over some number n time steps. For each cipher-text
Ci and key guess Ks, the predictor predicts either a 0 or a 1 for the identified bit.
For each key guess Ks, the predictor D(Ci, Ks) partitions the set of ciphertexts according to
whether the predicted bit value is 0 or 1.

ΔD[j] = ( ( ∑mi=1 D(Ci, Ks)Ti[j] ) / ( ∑mi=1 D(Ci, Ks) ) ) - ( ( ∑mi=1 (1 - D(Ci, Ks))Ti[j] ) / (
∑mi=1 (1 - D(Ci, Ks)) ) )

That is, the average of power samples at time step j for ciphertexts giving a predicted value of
1 minus the average of power samples at time step j for ciphertexts giving a predicted value
of 0.

If a random function is used to partition the ciphertexts, then we would expect Δ D[j] to be
very small. This is what happens if we guess the 6-bit S-box subkey Ks incorrectly.

If the key Ks is correct, then the predictor accurately produces two groups which are
correlated with what actually happens, as we would have expected the two groups to have
different power consumption averages.

For each of the 64 subkeys, we can plot values of ΔD[j] of the 64 subkey correlations against
time. The correct subkey will display a prominent peak at some time, the others will not have
peaks. The peak is at the time when the correct predicted bit value is being manipulated.

Countermeasures for these kind of attacks including reducing the signal strength (so leak less
information, balancing Hamming weights, etc), shielding behind nouse, and destroying the
correlation with blinding (hashing values, encrypting addresses, etc). However, much of this
obfuscation that reduces the signal-noise ratio can be overcome by obtaining more samples.

Cryptographic Themes, Principles and Lessons


Cryptography is not just a science, but an engineering subject. Kerckhoff's principles in 1883
were invented for ciphers at the time, and can be interpreted for the modern age:

• The system should be, if not theoretically secure, unbreakable in practice;


• Compromise of the should should not inconvenience the correspondents;
• The method for choosing the particular member of the cryptographic system
should be easy to memorise and change;
• The method for choosing the particular member of the cryptographic system
should be easy to memorise and change;
• Ciphertext should be transmittable by telegraph;
• The apparatus should be portable;
• Use of the system should not require a long list of rules or mental strain.

One key principle of cryptography is making the design public. DES is the most controversial
algorithm in cryptographic history - the biggest problem being that the design was kept
secret. The problems with this are now accepted. DES's successor, AES, was determined via
open competition with various analyses and the results open for all to see.

This gives rise to "ordeal by cryptanalysis". If everyone has had a go and it survives the
analysis, it says something about the cipher.
Cryptanalysis as seen above can be broken into a number of classes:

• Chosen plaintext;
• Adaptive chosen plaintext attack;
• Known plaintext;
• Known ciphertext;
• Chosen ciphertext;
• Chosen key;
• Rubber cosh cryptanalysis.

Another important principle is that it is not normally worth inventing your own algorithm.
Many are invented, but precious few are used. If you don't understand this principle, you
won't understand how easily your algorithm will be broken.

Another recurrent theme is guess-and-check, when we can tell how good a trial of some key
or sub-key is. Additionally, repetition is a problem - starting messages with the same greeting
or callsign, or using block ciphers in electronic code book mode, or with the same
initialisation block in CBC.

Key strategies are also important. If the key is symmetric, then it must be kept secret. In
public-private key systems, the private key must be secret, but the public key is shouted to the
world (maintaining integrity). Another alternative is to send it openly, but detect interference
(quantum key distribution), or to bombard them with science, broadcast trillions of bits per
second with only a small (unknown to the interceptor) time window actually containing the
key (this is not actually implemented).

Tradeoffs are also made all the time - one time pads are perfectly secure, but logistically
difficult, but other tradeoffs relate to efficiency and speed, e.g., length of public keys - 256 bit
RSA used to be common, but now 2048 is recommended. Low power algorithms are also
becoming important for embedded systems.

Cryptology is as much an art as a science, and in places we simply have faith in cipher
security, ever conscious it could all go wrong - do you believe in the inherent difficulty of
factorisation? At best, ciphers are secure against practical attack from well-established forms
of attack.
CHAPTER 6:
Computing by Graph Transformation
Graphs
Graphs are flexible structures for representing objects and the relations between objects. They
are ubiquitous in computer science, and lend themselves nicely to a natural visualisation.

A large body of mathematical knowledge has developed around graphs known as graph
theory.

Graphs consist of nodes, or vertices, typically represented using dots or circles, and edges,
which are represented using lines. Directed edges typically have an arrow indiciating
direction. Vertices and edges in graphs can be labelled.

Different varieties of graphs exist, such as restrictons on parallel edges, or loops, or even
graphs with hyperedges (edges that connect more than 2 nodes).

We can formally define a graph as so. If we let LV be a set of node labels, and LE a set of
edge labels and L = ‹LV, LE›, then a graph over L is a system G = ‹ V G, EG, sG, tG, lG, mG ›,
where VG and EG are finite sets of vertices and edges, sG, tG : EG → VG are functions
assigning a source and a target to each edge, and l G : VG → LV and mG : EG → LE are
functions assigning a label to each node and edge.

If VG = ∅, then G is called the empty graph, and can be denoted by ∅.

Concrete graphs are represented in tabular form and include identifiers (note, different from
labels!) for the nodes and edges. Abstract graphs are represented diagramatically and do not
include the identifiers, just the structure (including labels).
Isomorphic graphs are graphs with the same structure (abstract graph), but not necessarily the
same concrete graph (i.e., the identifiers can be different).

Graph Morphisms

Given graphs G and H over L, a graph morphism g : G → H is a pair of mappings g =


&lasquo; gV : VG → VH, gE : EG → EH &rasquo;, such that for all e ∈ EG and v ∈ VG:

1. gV(sG(e)) = sH(gE(e)) - that is, all sources are preserved;


2. gV(tG(e)) = tH(gE(e)) - that is, all targets are preserved;
3. lG(v) = lH(gV(v)) - that is, node labels are preserved;
4. mG(e) = mH(gE(e)) - that is, edge labels are preserved.

For examples, see slide 16.

We say a graph morphism g : G → H is injective, surjective or bijective if gV and gE are


injective, subjective or bijective.

A bijective graph morphism g : G → H is an isomorphism. In this case, G and H are


isomorphic, which is denoted by G =~ H (note, the ~ should be on top of the =, but I can't
typeset that in HTML).

An abstract graph [G] is the isomorphism class of a graph G: [G] = { G′ | G =~ G′ }.

Given graph morphisms f : F → G, and g : G → H, the composition of g o f: G → H is


defined by g o f = &lasquo; gV o fV, gE o fE &rasquo;.

For all graph morphisms f : F → G and g : G → H, g o f is a graph morphism from F to H.


Subgraphs and Inclusions
Graph G is a subgraph of graph H, denoted by G ⊆ H, if VG ⊆ VH, EG ⊆ EH, and &forall e ∈
EG, v ∈ VG:

1. sG(e) = sH(e);
2. tG(e) = tH(e);
3. lG(v) = lH(v);
4. mG(e) = mH(e).

A graph morphism inc : G → H is an inclusion if, ∀ v ∈ VG, e ∈ EG then incV(v) = v and


incE(e) = e.

For all graphs G and H, G is a subgraph of H iff there is an inclusion inc : G → H.

If we consider a graph G, and subsets V ⊆ VG and E ⊆ EG, then there is a subgraph S of G


with VS = V and ES = E iff ∀ e &isin E then sG(e) ∈ V and tG(e) ∈ V. If this property is
satsified, then S is the subgraph induced by V and E.

Given a graph morphism f : G → H and a subgraph S of G, fV(VS) and fE(ES) induce a


subgraph of G. The subgraph induced by fV(VS) and fE(ES) is the image of S under f and is
denoted by f(S).

Transforming Graphs
The idea of graph transformation is to manipulate graphs by rules L ⇒ R, where L and R are
graphs. Applying this rule L ⇒ R to a graph G replaces an occrrence of L in G with R:

We come across problems with this though - how do we define an occurrence of L, is it a


subgraph, a subgraph isomorphic to L, etc. Additionally, what happens to edges in G - L that
touch nodes in L, and how is R connected to G - L?

A rule r = &lasquo; L ← K → R &rasquo; over M consists of graphs L, K and R over M, and


inclusions K → L and K → R. L is the left-hand side, R is the right-hand side, and K is the
interface of r.

Using these definitions, we can form the intutition that L - K is deleted, K is preserved, and R
- K is added.

To apply a rule r = &lasquo; L ← K → R &rasquo; to a graph G:


1. Find an injective graph morphism g : L → G;
2. Check that no edge in G - g(L) is incident to a node in g(L - K);
3. Delete g(L - K) to obtain an intermediate graph D;
4. Add R - K to D.

Interfaces don't need to contain any edges, and should just contain nodes. In morphisms, you
can just remove and put any edges back in again in L and R, instead of preserving constantly.

Additionally, rules are symmetric - left and right can be swapped. Using the double pushout
method as we do, this means we can reverse steps, and dangling edges are forbidden (as upon
reverse, we would not know where to connect them back to).

In the single pushout technique, dangling edges are allowed (they are discarded), but in this
case, an inverse is not always obtainable as dangling edges may be lost.

Danging Condition

To deal with the problem of dangling edges (those that are left when L is deleted), we simply
deny transformations on graphs where this is the case (as the resulting graph is not a graph).

So, given a rule r = &lasquo; L ← K → R &rasquo;, an injective graph morphism g : L → G


satisfies the dangling condition if no edge in E G - gE(EL) is incident to a node in gV(VL - VK).

In other words, edges outside the occurrences of the left-hand side of the rule must not touch
nodes that will be deleted.

If we let &lasquo; L ← K → R &rasquo; be arule, and g : L → G an injective graph


morphism satisfying the dangling condition, then VD = VG - gV(VL - VK) and ED =
EG - gE(EL - EK) induce a subgraph D of G and there is an injective graph morphism of d : K
→ D such that for all v ∈ VK and e ∈ EK, dV(v) = gV(v) and dE(e) = gE(e). We say that this
graph morphism d : K → D is the restriction of g to K and D.

We can represent the disjoint union of two sets A and B as: A + B = (A × {1}) ∪ (B × {2}).
For readability, we will usually treat A + B as A ∪ B, assuming that A and B are disjoint.

Note that there are injective functions iA : A → A + B and iB : B → A + B such that iA(A)
∪ iB(B) = A + B and iA(A) ∩ iB(B) = ∅.

Gluing

If we let &lasquo; L ← K → R &rasquo; be a rule and d : K → D be an injective graph


morphism, then the following defines a graph H, the gluing of D and R according to d:

• VH = VD + (VR - VK);
• EH = ED + (ER - EK);
• sH(e) = { sD(e) if e ∈ ED, dV(sR(e)) if e ∈ ER - EK and sR(e) ∈ VK, sR(e)
otherwise;
• tH is analogous to sH;
• lH(v) = { lD(v) if v ∈ VD, lR(v) otherwise;
• mH is analogous to lH.
We can see in our definition for source and targets we deal with 3 cases: e ∈ ED defines
nothing changing, the second case defines the target in the new bit with the source in the
interface, and the third case handles where the edge is completely in the new bit less the
interface.

We can also let &lasquo; L ← K → R &rasquo; be a rule, d : K → D be an injective graph


morphism, and H be the gluing for D and R according to d. Now, D is a subgraph of H, and
there is an injective graph morphism h : R → H such that for all x in R, h(x) = { x if x ∈ R -
K, d(x) otherwise.

Direct Derivation

If we let G and M be graphs, r = lasquo; L ← K → R &rasquo; be a rule, and g : L → G be


an injective graph morphism satisfying the dangling condition, then G directly derives M
by r and g if M is isomorphic to the graph H constructed as follows:

1. D is the subgraph G - g(L - K) of G and d : K → D is the restriction of g to


K and D;
2. H is the gluing of D and R according to d.

In this case, we write G ⇒r,gM, or just G ⇒r M, or G ⇒ M, and then call g the match of r in
G. We also write G ⇒R M is r belongs to a set R of rules.

Given the graphs G and H, and a set R of rules, we say that G derives H by R if G =~ H, or
there is a sequence of direct derivations: G = G0 ⇒r1 G1 ⇒r2 ... ⇒rn Gn = H with r1, ..., rn ∈ R.
In this case, we can just write G ⇒* H.

Graph Grammars
A graph transformation system &lasquo; L, R &rasquo; consists of a pair L = &lasquo; LV,
LE &rasquo; of sets of node and edge labels, and a set R of rules over L.

A graph grammar is a system G = &lasquo; L, N, R, S &rasquo; such that &lasquo; L, R


&rasquo; is a graph transformaion system, N = &lasquo; NV, NE &rasquo; and is a pair of
sets of nonterminal labels with NV ⊆ LV and NE ⊆ LE, and S is a graph over L, the start
graph.

The graph language generated by G consists of all graphs derivable from S by R that do not
contain nonterminal labels.

For examples, see slide 41 onwwards.

Slide 48.
CHAPTER 7:
Natural Language Processing
The ultimate goal of NLP is to build machines that can understand human language, using
speech and language processing.

At present and in the near future, NLP can used for information extraction (robust, large-scale
shallow NLP for automated shallow understanding, e.g., e-mail scanning, CV processing,
database extraction, etc), language based interaction with everyday devices, domain specific
information systems (such as call centres with question/answering systems), and to aid the
aged and disabled (responding to emergency situations, being used for long-term health
monitoring, or for companionship and entertainment).

NLP is hard due to ambiguity in natural languages, whether phonetic (I scream vs. ice cream)
at the lowest level, syntactic, semantic, or pragmatic (different contexts give different
semantic meanings) at the highest level.

Natural language understanding systems go through many stages. e.g., a virtual agent system
starting with the user query and ending with an engaging response has:

• A spelling corrector;
• A tagger (which disambiguates syntactic words and performs lexical scanning, e.g.,
tagging things as nouns, adjectives, verbs, etc);
• A chunker/parser;
• A phrase matcher;
• A dialog engine (which feeds back into the matcher);
• A gesture planner (for an on-screen character - the gesture comes from the class of
word, e.g., if it's a happy work, the previous mood, and the personality of the
character, i.e., how often the mood changes);
• An animation client.

A more general version of the NLP pipeline starts with speech processing, morphological
analysis, syntactical analysis, semantic analysis, applying pragmatics, finally resulting in a
meaning.

Spelling Correction
Spelling errors are common in written user queries, and the types of errors can be classified:

• Insertions (e.g., beinefits);


• Deletions (e.g., benfits);
• Transposititons (e.g., bneefits);
• Repetitions (e.g., bennefits);
• Replacements (e.g., benifits).

A simple technique to detect which word a misspelt word is supposed to be is edit distance
(that is, number of errors of the types above a candidate word is from the misspelt word).
Another method is statistical spelling correction. If t is the typo and c is the correct word,
then p(c | t) = p(t | c) × p(c). We then choose the correct c using c* = argmax p(c | t) = p(t | c)
× p(c), c ∈ C. However, p(t | c) is too sparse, so we can approximate with p(t | previous
character).

This method only corrects single errors, but can be extended to multi-error cases.

The probabilities are estimated from real data, so therefore incorporate domain data
automatically. If there are two ways to get to a word, then their probabilities are combined.

Robust Parsing
In addition to spelling correction, two issues for robust natural language understanding
include robust parsing (dealing with unknown or ambiguous words) and robust semantic
tagging.

Robust parsing must consider the two stages of parsing: part-of-speech (PoS) tagging and
syntactic parsing (both shallow and deep).

PoS tagging is the pre-step to syntactic analysis - it tags words with their type, e.g., pronoun,
verb, noun, etc, but at this level there can be ambiguity and unknown words. The goal is to
assign the correct PoS tags.

One way to solve this is to construct canonical logical forms from sentences (e.g., the
following are roughly equivalent: "John bought a toy" and "A toy was bought by John").

Context free grammars are deficient in many ways for dealing with ambiguity, and can not
handle common phenomena such as relative clauses, questions or verbs which change
control.

Structural ambiguity, such as propositional phrase (PP) attachment ambiguity, where


attachment preference depends on semantics (e.g., "I ate pizza with ham" vs. "I ate pizza with
my hands"). Current systems are not very accurate at dealing with this (~74%), so it is often
better to leave PPs unattached rather than guessing wrong. Position-based and meaning-based
matching can be used to combat this.

For semantic tagging, we must also deal with robustness in the named entity recognition and
sense disambiguation phases. For named entity recognition, this deals with open class words
such as person, location, date or time or organisation names. Known words are left to a later
stage.

Word sense disambiguation is still difficult. The most frequent WordNet sense baseline gives
~64%, and the best supervised systems achieve ~66-70%, with unsupervised systems achieve
~62%. Question and answer systems can do without full sense disambiguation though.

Morphology
Morphology is the study of the structure within words.
Morphemes are the smallest grammatical unit, and can be free (which are words of their own
right), or bound (where they can only exist as part of other words, e.g., prefixations and
suffixations in English).

Inflectional morphology is used to get different forms of a verb or noun. They are functional
(in that the affix acts like a function e.g., suffix s on nouns always gives a plural form),
productive (new words conform to the rules) and preserve category (the major category, e.g.,
noun/verb is not changed).

Derivational morphology is used to get new words from existing stems (e.g., national from
nation+al). It is relational (adding the same suffix to two different stems can have different
semantic effects), partly productive (some are in current use, whereas others are historic
relics) and alters the category (e.g., national (adj) vs. nation (n)).

Morphological analysis typically uses just inflectional morphology, mainly as it is too


impractical to store the entire lexicon (which is too large, and can not deal with unknown
words), so our lexical lookup involves morphological analysis, so a word is a base form and
affixes/inflection.

Morphological analysis is non-trivial, as we can not assume, for example, that re- is a prefix
for all words, e.g., read is not re+ad. Our morphological analysis process takes 3 steps:

1. Segment the word into stem + affix, e.g., walked → walk+ed, bought → buy+ed
(this case is more difficult);
2. Lookup the grammatical category for the stem, e.g., walk → verb, base;
3. Apply the inflection rule for +ed, e.g., (verb, base) → (verb, psp).

Inductive Logic Programming

Inductive logic programming (ILP) is a symbolic machine learning framework, where logic
programs are learnt from training examples, usually consisting of positive and negative
examples. The generalisation and specialiation hierarchy of logic programs is exploited.

Morphological analysis
slide 11

In Prolog, _ is more general than (X, _), which in turn is more general than (X, Y).

For morphological learning, all base forms cover some positive examples, but no negative
examples.

First-order decision lists are essentially Prolog representation for functions. Every clause is
terminated by !, the cut operator. This tells Prolog to stop searching as soon as the cut is
reached, and therefore only get the first solution (and is therefore a function).

These can also be thought of as ordered classes, with the specific clauses at the top and the
general clauses at the bottom.
For morphology learning, we have training examples consisting of (baseform,
inflected_form) and the task to learn rules that predict inflected form given a base form, and a
base form given an inflected form. However, we must also consider the negative examples.
Fortunately, these are implicit, as if (likes, like+s) is a positive example, than the following
negative examples exist: (likes, lik+es), (likes, li+kes), etc. This is due to it being typically
sufficient to assume that functions are being learnt, and recall the universal property of
functions, that for any input x, ƒ(x) is unique.

Another form of learning is called bottom-up learning, where we go from examples to


clauses. For each example, a generalisation is generated that covers the example, and all such
clauses form a generalisation set.

Syntax
Formally, the coverage of a grammar G refers to the set of sentences generated by that
grammar, i.e., it is the language generated by that grammar.

In natural language, we say that a grammar overgenerates if it generates ungrammatical


sentences, or undergenerates if it does not generate all grammatical sentences. Typically
grammars undergenerate, but will also overgenerate to a lesser extent.

However, with natural language, adequacy is a more important concept, that is, how well
does the grammar capture the linguistic phenomena? It needs to be adequate for syntactic
phenomena to be captured.

In linguistics, grammars are more than just a syntax checking mechanism, they should also
provide a recipe for constructing a meaning. Therefore, grammars are needed to assign
structure to a sentence in such a way that language universal generalisation are preserved, and
language specific generalisations are preserved. Transformations can therefore be defined
that relate sentences with related meaning.

We say that grammars allow a productive method for constructing the meaning of a sentence
from the meaning of its parts.

Linguistic Universals

The following are assumed to be true for all human languages:

• Nouns denote classes of objects, concrete or abstract;


• Verbs denote events;
• Noun phrases denote objects;
• Verb phrases;
• Subcategorisation;
• Noun modifiers (adjectives);
• Verb/sentential modifiers (adverbs);
• Relative clauses.

Taking these linguistic views are useful as it allows for easier meaning construction, and
should result in better coverage. Ultimately, they will help build natural language
understanding systems that port easily across languages.
In addition to these linguistic universals, there are some language specific generalisations,
such as in English, where adjectives always precede nouns, and verbs normally precede their
objects, or in Japanese and Hindi, where objects precede the verb.

English Grammar

• Pronouns are shorthand references to nouns (he, she, it, that, etc).
• Adjectives are modifiers for nouns.
• Adverbs are modifiers for verbs and sentential expressions (quickly, very,
etc).
• Auxiliary verbs are used to express tense information (e.g., is - present, was
- past, -ing - continuous or progressive).
• Modal auxiliaries are used to express possibility, permission, wish, etc (e.g.,
must, can, may, should, would, shall).

We also have constraints on our grammar such as determiner-noun agreement (singular/plural


noun phrases, e.g., these hobbits vs. this hobbit), subject-verb agreement (subjects must agree
with the main verb, e.g., John likes tomatoes vs. I like tomatoes) and pronouns, which apply
to the person (first person, second person, etc) and the number being referred to (e.g., I, you,
he, she, we, they).

Applying these agreements to a context-free grammar result in a lot of rules, a lot of which
are roughly the same structure (see Introducing Syntax - slide 23), and this gets worse when
you start considering tensed forms of verbs.

Other rules to consider is the case system. Pronouns in English are marked for the nominative
and accusative case (e.g., He likes Mary, but not Mary likes he - Mary likes him).

There are also auxiliary verbs, such as the different forms of have and be. We can consider
two further factors here: tense and aspect. Tense refers to time (when an event happened) and
is usually expressed by the main verb, whereas aspect refers to the duration, completion or
repetition, and is usually expressed by an auxiliary verb.

Modal verbs can also be used to express the modality.

In English, there are syntactic constraints on auxiliary verbs. The order, in particular, is fixed.
The have auxiliary comes before be, using be/is selects the -ing (present participle) form.

As we can see above, problems with using context-free phrase structure grammars (CF-PSG)
include the size they can grow too, an inelegant form of expression, and a poor ability to
generalise.

Natural languages are believed to be at least context-free, but there is some evidence they are
context-sensitive. Simple natural language phenomena (e.g., NP-NP, V-NP-NP patterns) can
be described using regular grammars, but this does not cover all linguistic phenomena, such
as relative clauses, so natural languages must be at least context-free.

There is some evidence from Swiss-German and Dutch to suggest that natural languages are
not context free - these are known as cross-serial dependencies. However, CFGs are largely
sufficient for describing natural languages.
Definite Clause Grammars

Definite clause grammars (DCGs) generalise over context-free grammar rules. A DCG is a
CF-PSG except for one difference: non-terminals can be Prolog/first-order terms. This allows
us to collapse our CFG to be parameterised and considerably more compact. (see slides 4 and
5 in the DCG notes for an example).

However, DCGs are more powerful than CFGs, and rules such as s --> np(A,B),
vp(A,B) could have infinite instantiations. However, DCGs have a CFG backbone, and if we
restrict the arguments for a DCG, we get the CFG power.

DCGs in Prolog

Prolog supports the syntactic sugar -->, so that the following are equivalent: s --> np,
vp. and s(S) :- np(S1), vp(S2), append(S1, S2, S)., and np -->
[john]. becomes np([john])..

This sugar defines a top-down depth-first parser, and exploits the top-down depth-first Prolog
interpreter.

Using Prolog's difference lists, we can simply do something like: s(S1-S3) :- np(S1-S2),
vp(S2-S3)., or for terminals: np([john|X]-X).

However, the actual implementation is a bit different, as s(S1-S3) is actually represented


as s(S1,S3), similarly for lexical entries. They are essentially equivalent, however.

Modifiers are used to modify the meaning of a head (e.g., noun or verb) in a systematic way.
In other words, modifiers are functions that map the meaning of the head to another meaning
in a predictable manner. e.g., book on the table ( book(x) & on(x, y) & table(y) ) to book on
the table near the sofa ( book(x) & on(x, y) & (table(y) & near(y, z) & sofa(z)) ).

Usually, modifiers only further specialise the meaning of the verb/noun and do not alter the
basic meaning of the head. For a modifier to apply, the head must be compatible
semantically. Modifiers can be repeated, successively modifying the meaning of the head
(e.g., book on the box on the table near the sofa).

The interpretation of complements is totally determined by the head, and the number of
complements a head can take is also determined by the head (e.g., king of France vs. picture
of Jim - in both cases, the meaning of of is determined by the head, and it is different in both
cases).

Complements can not be repeated without turning them into modifiers. In sentences where
both modification and complementation are possible, then world or pragmatic knowledge will
dictate the preferred interpretation. When there is not strong pragmatic preference for either
readings, then complementation would be preferred.

Proposition phrases (PPs) are usually ambiguous in English, e.g., "I saw the man with a
telescope", and most PPs can be attached to both verbs and nouns.
This attachment preference is pragmatic (e.g., I ate pizza with chips vs. I ate pizza with fork),
and is difficult for current systems (accuracy is often below 80%).

Propositional verbs are verbs taking propositional (i.e., sentential) arguments, e.g., believe,
claim, trust, etc.

Control verbs are used to control the NP argument of a subcategorised VP, e.g., promise is a
subject control verb, and persuade an object control verb (contrast "Sandy promised Kim to
come home early" vs. "Sandy persuaded Kim to come home early").

Movement occurs when the argument or complement of some head word does not fall in the
standard place, but has moved elsewhere. We say that for every space, or gap, where there
must be a NP, there is a filler elsewhere in the sentence that replaces it (this is a one-to-one
dependency). This gap and filler must have the same category.

There can be an unbounded amount of words and structure between the head word and its
moved argument. We can add verbs taking sentential arguments an unbounded number of
times, and still maintain a syntactically allowable sentence - this gives us what are known as
unbounded dependencies between words.

In order to capture these unbounded dependencies in a grammar, we can use a technique that
uses DCGs called "gapping". A sentence with unbounded dependencies has three distinct
sections: top (where the filler is), the middle (the structure separating the gap from the filler)
and the bottom (where the gap is).

At the top, we add an argument to the nonterminals to indicate whether or not the non-
terminal has a gap (i.e., missing NP). Two rules are added to do this. The first rule allows
sentences to be built with or without a gap, and marks the S non-terminal with whether or not
it has a gap (S(Gap) --> NP VP(Gap)). The second rule takes a sentence that has a gap in it
and adds a filler to turn it into a sentence that no longer has a gap (S(nogap) --> NP
S(gap)). Complete sentences are now trees rooted by S(nogap).

At the bottom level, gaps must be identified. For transitive verbs, are rules will
be: VP(nogap) --> Vt NP and VP(gap) --> Vt.

To complete the picture, the middle must have rules that allow the gap to propagate from the
bottom to the top, where it is discharged by the filler. The S rule we had above will deal with
some cases, but for VPs, we need VP(Gap) --> Vs S(Gap)m and also for nouns that
move: N --> N S(gap).

These simple cases do not deal with semantics however. In the general case, the binding of
the semantic value must be retained.

Simple relative clauses can also be handled in a similar way, e.g., NP --> NP S(NP). This is
essentially a modifier rule.

Parsing
The task of parsing is defined as enumerating all parses for a given sentence. For any
sentence of length n, there are 2n ways of bracketing them. Bracketing ignores non-terminal
assignments on tree nodes. We would therefore expect that the complexity of parsing a CFG
is exponential. In reality, even regular grammars are exponential, but recognition can be done
in linear time (e.g., with a DFA).

A recogniser gives a yes or no answer for whether a string is allowed by the grammar or not.
A parser is a recogniser that also returns a parse tree. Parse trees are used to represent the
derivation, and the best representation is as a tree.

A parser should be sound, complete and efficient.

For backtrack parsers, worst-case recognition complexity is exponential. Tabulated parsing


avoids recomputation of parses by storing it in a table, known as a chart, or well-formed
substring table. In the worst-case, it has n3 complexity.

Tabulated parsing takes advantage of dynamic programming and stores results for a given set
of inputs in a table. This is very useful for recursive algorithms.

Top-Down Parsing

Top-down parsers start by proving S, and then rewrite goals until the sentence is reached.
Parsers have a choice of which rule to apply, and in which order. DCG parsing in Prolog is
top-down, which very little or no bottom-up prediction.

Top-down parsers are sound, but incomplete (top-down, left-to-right, depth-first parsers do
not terminate if the grammar contains left-recursive rules) and inefficient (due to the
recalculation in backtracking, and if there are lots of different rules with the same LHS.

Bottom-Up Parsing

Bottom-up parsing starts with words, and then matches right-hand sides to derive a left-hand
side. It completes when S is derived. The choices a parser has to make are which right-hand
side (typically there is less choice here) and the order it is parsed in.

Bottom-up parsers use a method called shift-reduce, consisting of a stack and a buffer.

Bottom-up parsers are sound and complete, but are inefficient in the recalculation in back-
tracking, and when there is lexical abiguity.

Left Corner Parsing

Left corners parsing uses the rules, and provides top-down lookahead to a bottom-up parser
by pre-building a lookahead table.

For any non-terminals A, B, C, the left-corner relation lc(A, B) is defined as follows:

1. lc(A, A) - reflexive;
2. lc(A, B) if A → B … is a CFG rule (immediate left-corner);
3. if lc(A, B) and lc(B, C) then lc(A, C) - transitive.
This definition is an approximate algorithm. The non-terminals are worked through (in
Prolog, this is done using a failure-driven loop).

Bottom-up parsing is used to wait for a complete right-hand side, and then left-corner parsing
predicts rules with the right left-corner. It removes many choices.

Left-corner parsing reduces local ambiguity, increases efficiency and increases psychological
plausibility.

CKY Algorithm

The CKY, or Cocke-Kasami-Younger algorithm requires grammars to be in Chomsky normal


form (i.e., binary branching). The theorem is that for every CFG, there is a weakly equivalent
CFG in Chomsky normal form.

A binary function is defined that takes two sets of non-terminals {a1, …, am} and {b1, …, bn}
and returns {g1, …, gp}, such that if α ∈ {a1, …, am} and β ∈ {b1, …, bn}, then g ∈ {g1,
…, gp} if g → αβ.

This function can be implemented efficiently, e.g., by storing the sets as a list of integers.
Sentence length does not affect the time complexity of this function.

We also create a matrix of size (n + 1)2, which we call the chart. This is initialised such
that chart(i - 1, i) = { A | A → wordi }.

The CKY algorithm essentially completes the chart entries by using the following
equation: chart(i, j) = ∪i < k < j chart(i, k) × chart(k, j).

To ensure completeness, some control is needed.

for j from 1 to n do
chart(j - 1, j) = { A | A → wordk }
for i from j - 2 to 0 do
for k from i + 1 to j - 1 do
chart(i, j) = chart(i, j) ∪ (chart(i, k) × chart(k, j))
if S ∈ chart(0, n) then return true else return false

The complexity of this algorithm is n3.

CKY is an instance of a passive chart parsing algorithm, i.e., chart entries correspond to
completed edges. A disadvantage of bottom-up parsing is that parsing is not sufficiently goal-
driven. All subconstituents, whether or not they get incorporated into the final parse, will be
found.

Earley Algorithm

CKY parsers have no top-down prediction, work only for Chomsky Normal Form grammars
and have no flexible control.

The Earley algorithm is a chart parsing algorithm that works for any CF-PSG. It has an active
chart that consists of parse bits, and goals.
The Earley algorithm uses dotted rules, that is, partially completed edges. For the rule vp -->
v np, it may exist in three states: vp --> .v np, which is a VP with nothing found, vp -->
v .np, which is a VP with a V found, and vp --> v np., which is a complete VP.

Proved and unproved goals are stored, and this is often represented in a chart form.

This edge can be written as: 2vp → v 3 np.

The Earley algorithm has a fundamental rule: (i L → α j X β) × (jX → γ k) = (i L → α X k β),


where α, β is a sequence of NTs, and X is a NT.

In other words, if we have an incomplete edge from i to j (this is the active edge), and a
complete edge (that is, where everything has been found) from j to k, then the two can be
combined and the active edge extended.

The Earley algorithm also defines two operations: complete, which takes a complete edge and
looks left for an active edge to match; and scan, which takes an active edge and looks right
for a complete edge to match.

To enter edges, first the algorithm checks an edge is not present, and then predicts either
active edges from complete ones (if bottom-up) or active edges from new active ones (top-
down). It finishes by completing (combining complete edges to the left) and then scanning
(combining active edges to the right).

In bottom-up active chart parsing, the active edge is predicted from complete; when a
complete edge is found, rules are predicted that could use it. The rule that defines this is if i C
→ α j is added, then for all rules B → C β, the edge i B → i C β is added. This is immediate
left corner parsing.

This needs initialisation, so all lexical edges are entered.

Top-down active chart parsing is similar, but the initialisation adds all the S rules at (0,0), and
the prediction adds new active edges that look to complete. Now, our predict rule is if
edge i C → α j X β then for all X → γ, add j X → j γ.

Agenda-based Parsing

Agenda-based parsing does not assert new edges immediately, but instead adds them to an
agenda or queue. The edges are them removed in some order from the agenda.

A more flexible control of parsing can be achieved by including an explicit agenda to the
parser. The agenda will consist of new edges that have been generated, but which yet to be
incorporated to the chart.
The parsing process will still be complete as long as all the consequence of adding a new
edge to the chart happen, and the resulting edges go to the agenda. This way, the order in
which new edges are added to the agenda does not matter.

If the agenda is organised as a queue, then the parsing proceeds breadth-first. If the agenda is
organsed as a stack, then parsing proceeds depth-first. Agenda-based parsing is especialyl
useful if any repair strategies need to be implemented (to recover from error during parsing).

Semantics
Compositionality is sometimes called Fregean semantics, due to Frege's conjecture.
Compositionality essentially means that the meaning of the whole can be derived from the
meaning of its parts. In the context of linguistics, we can interpret this to mean that the
meaning of a phrase can be determined from the meanings of the subphrases it contains.

In the late 19th centry, Gottlob Frege conjectured that semantic composition always consists
as the saturation of an unsaturated meaning component. Frege construed unsaturated
meanings as functions, and saturation as function application.

We represent the world by a set U of individuals, and a set of relations R that hold between
individuals in U. We denote an n-place relation in U as a subset of Un. We denote some basic
lexical semantics as follows:

Ann as an individual is represented as ann′ ∈ U. Runs is a 1-place relation represented


as runs′ ⊆ U, and likes is a 2-place relation represented as likes′ ⊆ U × U, etc.

Every relation R in Un can be equivalently represented by a function ƒ : Un → {0,1}, defined


by ƒ(a1, …, an) = 1 ⇔ (a1, …, an) ∈ R. Thus, there is no loss of generality in using functions,
and semantic rules are easier stated using functions.

We can now represent our lexical semantics from above using functions as follows. e.g., for
an intransitive verb: sleeps′ : U → B, or for a transitive verb: likes′ : U × U → B.

The rule-to-rule hypothesis says we can pair syntactic and semantic rules to achieve
compositionality, e.g., S → NP VP and S′ → VP′(NP′).

To derive the meaning of transitive VPs, we use the idea of currying from functional
programming (i.e., any function ƒ : U × U → B) can equivalently be written as
ƒc : U → U → B.

Instead of inventing new function names for partially saturated functions, we can use lambda
notation for functions, e.g., [[likes]] = λyλxlikes′(x, y), then the meaning of the VP likes
Mary can be [[like Mary]] = λxlikes′(x, mary′).

For quantifying NPs, the case is slightly different. If we have [[sleeps]] = λxsleeps′(x), then
our phrase [[Everyone sleeps]] is ∀x(person′(x) → sleeps′(x)). Our definition of [[Everyone]]
if λP ∀x(person′(x → P(x)). An alternate representation might be: [[Everyone sleeps]] = true
iff [[person]] ⊆ [[sleeps]].

Some other quantifying NPs (QNPs) that may be useful are:


• [[nothing]] = λP(∀x ¬P(x));
• [[something]] = ΛP(∃x P(x));
• [[everything]] = ΛP(∀x P(x));

QNPs change meaning depending on whether they are in the object or subject position. e.g.,
[[everyone]] = λP ∀x(person′(x) → P(y)) for the subject position, or [[everyone]] =
λPλx ∀y(person′(y) → P(y)(x)) in the object position.

Sentences involving multiple quantifiers are always ambiguous. Sentences such as "Every
man loves a woman" are ambigous between the wide scope forall interpretation, and the wide
scope for existential interpretation (i.e, every man loving a particular woman, vs. every man
loving someone who is a woman).

To generate a wide scope for existential interpretation, we can give additional semantic
entries for quantifiers, but this would mean that each quantifier will have 4 different
meanings: subject NP wide scope, object NP wide scope, subject NP narrow scope and object
NP narrow scope. This is clearly not a very workable solution.

There are a number of issues with quantifiers. The amount of semantic ambiguity explodes,
and syntactic processing forces semantic choices, leading to much backtracking. It is also
difficult to engineer delayed decision making in a processing pipeline.

Collocations
A collocation is an expression consisting of two or more words that correspond to some
conventional way of saying things, or a statement of habitual or customary places of its head
word. More accurately, we described a collocation as a sequence of two or more consecutive
words that has the characteristics of a syntactic and semantic unit whose exact and
unambiguous meaning or connotation can not be derived directly from the maning or
connotation of its components.

Examples of collocations include "New York", "strong tea" (but not powerful tea) or "kick
the bucket".

The notion of collocations can be generalised to include cases of words that are strongly
associated with each other, but do not necessarily occur in a common grammatical unit or in a
particular order.

Collocations typically have limited compositionality - kick the bucket meaning to die, spill
the beans to reveal a secret, where the meaning of the components do not combine to give the
meaning of the whole. Also, substitutability is limited (spill the lentils does not make sense
compared to spill the beans) and modifiability (kick the big blue bucket), etc.

Collocations are not directly translatable into other languages.

Frequency Analysis

If two or more words occur together a lot, this is evidence they have a special function that is
not simply explained as the function that results from their combination. However, just
computing the most frequent n-grams is not that interesting (such as, "of the", "in the", "to
the", etc).

There are several ways to improve results using heuristic measures. One way is to filter out
collocations containing at least one of the words in a stoplist (e.g., a list of frequent words,
prepositions, pronouns, etc). Another way is to consider N-grams that consist of words frmo
some parts-of-speech (e.g., nouns, adjectives, etc) only, or to consider N-grams on which
some part-of-speech filter applies.

However, N-gram frequency combined with a part-of-speech filter or a stoplist is not always
successful. Meaningful, but rare, collocations are found at a low rank.

N-grams are simple to compute, and can perform well when combined with a stoplist of PoS
filter, but is useful for fixed phrases only, and does require modification due to closed-class
words. High frequency can also be accidental; two words might co-occur a lot just be chance,
even if they do not form a collocation.

Pointwise Mutual Information

Pointwise mutual information (PMI) is a measure motivated by information theory.


If x and y are events, PMI is defined as: l(x, y) = log2(P(x, y)/P(x)P(y), which can be further
reduced to log2(P(x|y)/P(x)), or log2(P(y|x)/P(y)).

The PMI measure informs about the amount of information increase we have about the
occurrence of y given than x has occurred.

For proof, see slides.

In the case of perfect independence, then l(x, y) = log 1 = 0, and for perfect dependence l(x, y)
= log 1/P(y).

Unfortuantely, a decrease in uncertainty does not always correspond to what we want to


measure. PMI does not perform well when frequencies are low, it overestimates rare events.
PMI can be extended to accommodate more than two tokens.

Hypothesis Testing

The null hypothesis H0 is the case when two words do not form a collocation. In this case,
two words only co-occur by chance.

The independence hypothesis is therefore if P(w1) is the probability of w1 in some


corpus c and P(w2) is the probability of w2, then the probability of w1 and w2 co-occuring by
chance is: P(w1, w2) = P(w1)P(w2).

We therefore need to discover if two words occur together more often than chance.
Hypothesis testing allows for this, and there are 3 methods we can use: t-test, Pearon's
Χ2 test, and Dunning's likelihood ratios test.
With this method, we must first form a null hypothesis - that there is no association between
the words beyond occurrences by chance. The probability, p, of the co-occurence of words
given that this null hypothesis holds is then computed. The hypothesis testing methods
compute the ratio or the difference between the actual co-occurrence statistic and the
statistics under the null hypothesis, and then null hypothesis is accepted or rejected depending
on whether or not the computed ratio is above a certain significance level.

The binomial distribution b(k; n, x) is the discrete probability distribution of the number of
successes (k) is a sequence of n independent yes or no experiments, each of which yields
success with a probability, x.

This kind of success/failure experiment is called a Bernoulli experiment, or a Bernoulli trial.


When n = 1, the binomial distribution is a Bernoulli distribution.

The t-test

The t test looks at the mean and variance of a sample of measurements. The assumption is
made that the sample is drawn from a normal distribution with a mean μ.

The t-test looks at the difference between observed and expected means, scaled by the
variance of the data. It tells us how likely we are to get a sample (the real sample) of that
mean and variance.

If we let x be the sample mean and s2 the sample variance, N the sample size and μ the mean
of the distribution, then t = (x - μ)/√s2/N.

t values correspond to confidence level. If a t level is large enough (over a confidence level),
the null hypothesis can be rejected. t values per confidence level and degrees of freedom are
available in t distribution tables.

A confidence interval (CI) is an interval estimate of a population parameter. Instead of


estimating the parameter by a single value, an intevral likely to include the parameter is
given. CIs are used to indicate the reliability of an estimate. How likely the interval is to
contain the parameter is determind by the confidence level, or confidence co-efficient.

A confidence interval is always qualified by a particular confidence level (expressed as a


percentage). We therefore speak of a "95% confidence interval", for example.

Several statistical distributions have parameters that are commonly referred to as degrees of
freedom, an underlying random vector. If Χi (i = 1, …, n) are independent normal random
variables, the t statistic follows a t distribution with n - 1 degrees of freedom.

While computing the t-test on collocations, n is a alrge integer, thus the degrees of freedom
are approximated as ∞.

In the case of a chi-square test, the degrees of freedom correspond to the size of the
contingency table less one. Thus, for a 2 × 2 table, there is 1 degree of freedom.

For the null hypothesis, we will assume that the probability of bigrams follows a normal
distribution, with a mean μ equal to p1 × p2.
The variance measure the average squared difference from the mean. This can simply be
computed as s2 = p × (1 - p).

The process of randomly generating bigrams and assigning 1 if the bigram is w1w2 or 0
otherwise is a Bernoulli trial with the probability: s2 = p(1 - p) ≈ p. This approximation
of s2 holds, since p is very small.

The t test can be used to find words whose co-occurrence patterns best distinguish between
two words in focus.

The t test and other statistical tests are most useful as a method for ranking collocations, the
level of significance itself is less useful. The t test assumes that the probabilities are
approximately normally distributed, which is not true in general. The t-test also fails for large
probabilities, due to the normality assumption. However, unlike a frequency-based method,
the t-test can differentiate between bigrams which occur with the same frequency.

Pearson's chi-square test

The Χ2 does not assume normally distributed probabilities, and can be applied to any table of
frequencies. Table values are compared with the expected ones for independence. The
formula is as follows:

Χ2 = ∑i,j ((Oij - Eij)2)/Eij), where O denotes the table of observed values, and E the table of
expected values.

This quantity Χ2 is asymptotically χ2 distributed. The χ2 values correspond to confidence


levels, which are available in tables.

For 2 × 2 tables, Χ2 has a simpler form: Χ2 = N(O11O22 - O12O21)2/(O11 + O12)(O11 +


O21)(O12 + O22)(O21 + O22).

For the problem of finding collocations, the differences between t and Χ2 statistic do not
seem to be large, however the Χ2 test is also appropriate for large probabilities, for which
the t test fails (due to the normality assumption), however the Χ2 test still overestimates rare
bigrams.

Dunning's likelihood ratios test

The likelihood ratio is a number that tells us how much more likely a hypothesis is than
another. Therefore, we construct our two hypotheses:

1. Independence - the occurence of w2 is independent of the previous


occurrence of w1: P(w2 | w1) = p = P(w2 | ¬w1).
2. Dependence - good evidence for an interesting collocation: P(w2 | w1)
= p1 ≠ p2 = P(w2 | ¬w1).

These probabilities can be computed according to the usual maximum likelihood


estimates: p = c2/N, p1 = c12/c1, p2 = (c2 - c12)/(N - c1).
If we assume a binomial distribution, then the probability that c12 out of c1 bigrams
are w1w2 is b(c12; c1, p) under hypothesis 1, and b(c12; c1, p1) under hypothesis 2. The
probability that c2 - c12 out of N - c1 bigrams are ¬w1w2 is b(c2 - c12; N - c1, p) under
hypothesis 1, and b(c2 - c12; N - c1, p2) under hypothesis 2.

Hypothesis 1 therefore becomes L(H1) = p(w1) × p(w2) = b(c12; c1, p1) × b(c2 - c12; N - c1, p),
and hypothesis 2 becomes L(H2 = p1(w1) × p2(w2.

The log of the likelihood ratio λ is log λ = log H1/H2. 2 log λ is asymptotically χ2 distributed.

Likelihood ratios are more appropraite for spare data than the Χ 2 test (i.e., rare bigrams). The
test has a clear intuitive interpretation. If we take λ -1 = log H2/H1, then the result is λ-1 times
more likely that w2 is to follow w1 that is base rate of occurrence would suggest.

Lexical Relations
Lexical Semantics

Lexical semantics is the study of the meaning of words, and how these combine to form the
meaning of longer contexts (phrases, sentences, paragraphs, etc).

However, what is meaning - is it differente from a sense or a concept, and how do words map
to concepts?

A concept, or sense, is an abstract idea derived from or expressed by specific words. A word
is the linguistic realisation of one or more concepts. Words and concepts have a many-to-
many relationship.

Lexical ambiguity is inherent - a word or phrase has more than one sense. Finding and
choosing an appropriate sense is word sense disambiguation.

A dictionary is a reference book containing an alphabetical list of words, with definition,


etymology, etc. A thesaurus is a reference book containing a classified list of synonyms (and
sometimes definitions).

Wordnets are more expressive than dictionaries and thesauri, and are usually called large
lexical databases.

The English WordNet is one of the most useful resources in lexical semantics, and it can be
used for word sense disambiguation, question answering, sentiment analysis, information
retrieval and named entity recognition. A number of APIs to access it are available.

In the English WordNet, words and phrases are grouped into sets of synonyms (synsets),
which express a distinct concept. These synsets are interlinked by means of semantic
relations:

• Hypernymy - is a;
• Hyponomy - is supertype of;
• Synonymy - also known as;
• Antonymy - is opposite of;
• Meronymy - is part of;
• Holonymy - has as part;
• Entailment - implies.

In the English WordNet, nouns are organised as topical hierarchies, verbs as entailment
relations, and adjectives and adverbs as multi-dimensional clusters. For hyponym/hypernym
relations, synsets are organised into taxonomic relations. Meronymy is a relation that holds
between a part and the whole (e.g., kitchen is a meronym of house) - holonymy is the inverse
relation. Antonymy is used to represent oppositeness in meaning (e.g., rise is an antonym of
fall), and this is the opposite of synonymy.

Entailment is used to represent actions which are logical consequences of other actions (e.g.,
snoring entails sleeping).

There are problems with WordNet, such as a non-uniform sense granuality (some synsets are
vague, or unnecessarily precise when compared to other synsets). Other problems include a
lack of explicit relations between topically related concepts, or missing concepts, specifically
domain-specific ones (such as medical terminology).

Feature Modelling

Worse sense disambiguation takes a computational representation of a target word context,


and a computational representation of word sense, and outputs the winning word sense.

The representation of a context of a word is a computational formulation of the context which


defines the use of a word in a given corpus, e.g., "I rent a house", House is a direct object of
rent. These kind of representations can be built from grammatical relations, such as
subject/verb and object/verb. An alternate method is proximity representation, which instead
of using grammatical relations, defines a window size around the target word which is used to
build a set representation of context for the target word.

Feature modelling is the computational formulation of the context which defines the use of a
word in a given corpus. The features are a set of instantiated grammatical relations, or a set of
words in a proximity representation.

Frequencies can be used to denote the co-occurrence of a feature with a noun, either in a
grammatical relation, or in proximity with a specified window size in the whole corpus.

The distributional hypothesis is that words that occur in the same contexts tend to have
similar meanings. That is, two words are similar to the extent that they share similar contexts,
which are defined by their features.

The distributional hypothesis can be modelled by creating feature vectors, and then
comparing these feature vectors to determine if words are similar in meaning, or which
meaning a word has.

Measuring the discriminating power of a feature in the feature vector of a word can be done
using frequency analysis, TF-IDF (term frequency × inverse document frequency), or
statistical models (as used in collocation).
TF-IDF

The term frequency is the number of co-occurrences of a feature with the target word, and the
inverse document frequency is used to measure the general importance of a feature (the
number of all words, divided by the number of different words that appear with a feature F,
and then taking the logarithm).

So, TF-IDF(word, feature) = TF(word) × IDF(feature).

TF-IDF and frequency analysis can be used to weigh each feature, but now vector similarity
needs to be measured.

Cosine similartity is a measure between two vectors of n-dimensions which calculates the
cosine of the angle between them: cos(v, u) = (∑y v(y)u(y)) / √∑y v(y)2 ∑y u(y)2.

Jaccard similarity (JC) measures the similarity between sets, considering only binary features
(i.e., whether a feature exists or not): JC(A,B) = | A ∩ B | / | A ∪ B |.

The Dice coefficient (DC) also measures set similarity, but considers only binary features
(i.e., whether or not a feature exists): DC(A,B) = (2 | A ∩ B |) / ( |A| + |B| ).

There are many other similarity measures: mutual information, Kullback-Leibler, Jensen-
Shannon, Skew divergence, Jaro Winkler, etc. However, there is no agreement as to which is
the best, and which one is to be used depends on the application and domain.

The underlying assumption is that distributional similarity correlates with semantic similarity
(if the contexts that the two words appear in are similar, than these words are semantically
related). However, these assumptions are not always valid, and significant challenges lay
ahead for statistical methods in lexical semantics.

The distributional hypothesis is not valid when two words are semantically similar according
to a machine readable dictionary, yet they appear in significantly different contexts (in effect,
having a low distributional similarity).

Word Sense Diambiguation

Word sense disambiguation is the task of associating a given word, w in a given sentence to a
definition, or sense, which is distinguishable from other senses potentially attributable to that
word. These word senses come from machine readable dictionaries, thesauri, or most
popularly - WordNet.

The most frequent sense heuristic is used as a number to compare against to get performance
data. Given a set of sentences, where the target word w appears, the task is to assign to each
sentence the correct sense of w. This MFS baseline assigns the most frequent sense to
each w (taken from WordNet), and WSD systems are compared by its ability to improve upon
this MFS baseline.

If a system does not perform better than the MFS, then there is no practical reason to use that
system. The MFS heuristic is hard to beat because senses follow a log distribution - a target
word appears very frequently with its MFS, and very rarely with other senses.
The state-of-the-art supervised systems take pairs of input objects (e.g., context vectors) and
desired outputs (the correct sense), and then learn a function ƒ from the training data. To
evaluate, unseen data is given, and ƒ used to predict the correct sense. However, training data
is difficult to find for every domain, and there is a performance decreases when it is tested in
a domain different to the one trained in.

In unsupervised systems, there is no annotated training data, but raw unannotated training
data - this is called the bag-of-words model. Senses come from WordNet, and this is portable
across different domains. This is low performing, however. Semi-supervised methods are
another option.

We can use feature modelling to accomplish WSD. We can filter out some filters -
determiners have a low discriminating ability, similarly with the majority of verbs. Nouns
have a higher discriminating ability than any other part of speech.

The senses of a word w is just a fixed list, which can be represented in the same manner as a
context representation, either as a vector or a set.

A sense signature is a vector/set of words which are related to a particular sense. These
related words can be found by exploiting the relations in WordNet. In vectors, the numbers
are frequency counts. TF-IDF can also be used, or we can take the number of all target word
senses divided by the number of all the senses that appear with a feature F, and taking the
logarithm. Using this "bag-of-words" model, we then need to assign to the context the most
probable sense, by measuring the similarity between the context vector and the sense vectors.
Similarity can be measured using techniques discussed previously.

In the set-of-words model, we have sets instead of vectors, and we can use the set similarity
methods discussed above to find the sense set with the most similarity to the context set.
CHAPTER 8:
System Design Methodologies
Methodologies
The term methodology is used confusingly in software systems engineering.

A method is a set of procedures to follow to accomplish a set of goals within a particular domain.

A methodology involves the study of methods in a particular domain.

In software engineering, these two terms are often used (incorrectly) interchangeably.

Descriptions
A description is a statement about phenomena of interest. It may be a program, a model, or even a
picture scribbled on a whiteboard.

Descriptions may be precise (i.e., there are notational rules that must be obeyed), or they may be
rough sketches.

System design involves constructing and manipulating many forms of descriptions; these descriptions
help us formulate and understand the requirements and constraints underlying software system design.

Systems
A system is a set of interacting entities/things.

Systems are organised, that is, they have an architecture. Systems produce something something that
the individual interacting entities can not (that is, a system integrates individual entities).

Systems have boundaries, which separates its own entities from those in its environment. Different
types of system have different kinds of boundaries, and relationships with their environment.

If we recall sets, we say a set is closed under an operation if the result of an operation gives you a
member of the set. We can apply the same thinking to systems. A closed system is one that is isolated
from its environment.

Closed systems are sometimes called self-describing systems. That is, they not only describe a truth,
but also describe what truth is, and how it can be recognised. (i.e., everything you want to know about
a system can be learned by studying it, and only it).

This notion of closure can be applied to different kind of systems. UML is an example of a closed
language, it is determined in terms of a meta-model, which is defined in terms of itself.

A closed system has communicating entities, but no connection to its environment, therefore we need
to consider if any reasonable system is in fact closed.
An open system has communicating entities, but also interacts with the environment and other
systems. Theoretically, an open system communicates constantly with its environment.

If we recall logic, a description of a system can be said to be complete if all of its propositions can be
derived only from the axioms of the system (e.g., first-order predicate logic).

Complete systems allow the truth of any statement in a description to be verified by referring to the
basic axioms and definitions of the systems. We can apply this to software by assuming that our
requirements are axioms (the assumed truths), and that our design and how it satisfies its requirements
is a set of propositions.

Incomplete systems come around when the description of the propositions can not be deduced from
the systems basic axioms. The information about those propositions may come from elsewhere (e.g.,
if the system is open). Incompleteness is not necessarily bad.

Using these descriptions, we can group systems into four categories:

• Complete and Closed


• Complete and Open
• Incomplete and Closed
• Incomplete and Open

In reality, however, most systems do not fit exactly into one of these pigeonholes (i.e., systems are
more or less complete, or more or less closed, than other systems).

Programs and Software


We already know how to design programs - algorithms and data structure design. To know how to
design software we should first know what the difference between a program and software is. Both
involve descriptions (or are themselves descriptions).

A program is a sequence of operations that a machine can be set to perform automatically.

Formal Languages
Early programs required machine operators to change the physical order of the machine to change the
program. Reprogramming was a long, drawn out manual process. No such thing as program design
methods existed at this very early experimental stage.

When the von Neumann architecture was introduced, computers became more flexible, and the idea
of description became very important. This architecture began to impose methodology on to how
programs were designed.

Von Neumann did not use natural language to describe programs, he borrowed from the mathematic
theory developed by Turing, Post and Church in 1936. The languages proposed by these theories
relied heavily on formal logic, and consequently the languages themselves became known as formal
languages. Using these formal languages for description is only sensible under certain assumptions.

Formal languages are covered


in more details in
Formal Specification of Systems.
Formal languages are mathematically defined description languages, containing a finite number of
symbols, and rules for combining symbols. Atomic symbols can not be decomposed into other
symbols and form the alphabet of the language, and non-atomic symbols are created by combining
atomic symbols, following the set of rules that define precisely how they are created. These atomic
symbols, and the rules to create non-atomic symbols constitute the syntax (grammar, meta-model) of
the language.

Some basic assumptions can be made when we construct descriptions using formal languages - the
description language is complete (i.e., no new symbols or operations can be introduced), and that the
systems they describe are closed (i.e., a system can not behave in such a way that the behaviour
exhibited can not be described by the language).

Formal languages, and hence program design methods, target complete and closed systems. For a
complete and closed system, given a written description (e.g., a program), it is possible to work out
exactly what sequence of actions the machine will produce. Similarly, given a sequence of actions by
the machine, it is possible to work out exactly what the written description (program) must have been.
That is, the actions of the machine and the system description are equivalent (actions = instructions).

This is important for distinguishing programs and software.

Software
Before the 1960s, a separate notion of software to programs did not really exist, but as computer use
became more widespread, the need for separate terms became apparent.

There is often a mismatch between the needs of the machine and the needs of the business. The
actions of the machine and the system description are no longer equivalent - the system description
(business needs) must somehow be translated into machine actions.

Formal languages are not well suited to describing sensible and long established business rules, in part
because formal languages work well with closed and complete systems. Something else was needed,
and this something began to be called software.

Software has been given multiple definitions, including: "the printed materials supplied by a computer
manufacturer to its customers" (1972), "computer programs, procedures, rules and any associated
documentation concerned with the operation of a data processing system" (1982), and "those
components of a computer system that are intangible, rather than physical" (1996).

Program design methods were developed before software design methods. Structured code led to
structured design methods, and object-oriented languages led to object-oriented design methods.

Software design must lead to a software implementation (i.e., a program), so this so called "seamless
development" (i.e., applying the same idea to both software and program design) has some obvious
attractions.

History and expediency can lead to the assumption that program design methods are always right, but
it can sometimes be a struggle to make software descriptions fit into the confines of formal languages.

Validation and Verification

One of the main differences between software and programs is the relative importance of validation
and verification.
Validation and verification address the problem of demonstrating the suitability of a software
description.

Validation: are we building the right product? Validation is concerned with demonstrating that the
software description is based on the right assumptions.

Verification: are we building the product right? Verification is concerned with demonstrating the
internal consistency of the software description.

Validation is particularly important for software designers because it lets them assess closure - does
the design correspond to the requirements specified by the customer? If a design is valid (and the
system is closed), then the requirement descriptions can be composed and decomposed without
introducing errors. This allows the designer to reduce one large problem into several smaller ones, and
structure those problems in such a way that they are capable of being described in a formal language.

Verificiation is important for software designers because it lets them check completeness - is the
software correct in terms of the design established at the beginning of the activity? If verified, a
designer can claim that the relationship between the description and the thing being described is
completely and unambiguously defined. This allows the designer to ignore differences between
conceptual descriptions and the thing it describes within the scope of the software.

Closed and complete descriptions mean that a designer can claim the description and the thing being
described are equivalent. This allows the designer to treat the descriptions and things they describe as
being interchangeable. Because the of the unique environment of programs, program designers can do
this (at least in theory), but software designers (at least in practice), never can.

When you apply formal languages to describe things, you are working under assumptions of closure
and completeness. The problem is, software, and software design methods, only partially conforms to
the closure/completeness assumptions of formal languages. Software is not usually closed, it must
satisfy customers, and neither is it usally complete, it's enver fully implemented.

Program design methods are totally bounded by the completeness and closure assumptions of formal
languages, so software designers must go some way towards finding a complete and closed
description of the software system (i.e., of the requirements and the design), otherwise, program
designers can not design programs. This is, funamentally, what a software design method tries to do.

Software must describe something of the communities that design, use and form the environment for
the programs. But there are some difficulties for software designers, like: where do the boundaries of
these communities lie; what is the relationship between these communities and the computers attached
to them, and; how can we form closed or complete descriptions under these conditions.
Program descriptions are special cases of software descriptions.

Software descriptions are special cases of the full complexity of the relationships and associated
communities that have interest in the system. Thus, software is a description a special case of the
relationships among human communities, the environment of those communities, and machines
associated with them. By assuming that complete or closed descriptions exist, software design
methods may fail to create an adequate description of reality.

You might also like