KEMBAR78
2023 Bioinformatics Algorithms Day 1 | PDF | Time Complexity | Dynamic Programming
0% found this document useful (0 votes)
67 views74 pages

2023 Bioinformatics Algorithms Day 1

Uploaded by

ast56erx
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
67 views74 pages

2023 Bioinformatics Algorithms Day 1

Uploaded by

ast56erx
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 74

Bioinformatics

Algorithms
Stephan Peischl
Interfaculty Unit for Bioinformatics
Baltzerstrasse 6
CH-3012 Bern
Switzerland
Phone: +41 31 631 45 19
Email: stephan.peischl@bioinformatics.unibe.ch
Day 1
• Introduction and Organization
• Defintion Algorithm
• Fibonacci numbers
• Performance of algorithm
• Travelling salesman problem

17.09.23 Bioinformatics Algorithms 2


Some recources on Bioinformatics and
Algorithms
• An Introduction to Bioinformatics Algorithms
(Neil C Jones and Pavel A Pevzner 2004)

• Bioinformatics Algorithms: An Active Learning Approach Volume I


(Compeau and Pevzner 2015)

• Various pdf’s that will be found on ilias

• To be continued and updated throughout the course...

Bioinformatics Algorithms 3
Slides, Exercises, etc.
• I will provide pdfs of slides, book • If you have questions about the
chapters, R-scripts etc on ilias course, send me an email
stephan.peischl@unibe.ch

• The folder will be updated


continuously
• You can find me in
• Please join the course on ilias
Room D401, 4th floor
Balterstrasse 6
3012 Bern

Bioinformatics Algorithms 4
Organization of the course
• Course language: English
• Written exam at the end of the course (english or german)

• Prerequisites:
• Basic math knowledge
• Basic programming knowledge
• Basic biology knowledge

Bioinformatics Algorithms 5
Exam
• Written exam at the end of the semester

• 2 hours

• Date: TBA

• Mandatory to sign up on KSL (CST) at least 14 days before exam!

17.09.23 Bioinformatics Algorithms 6


Overview
1. Introduction to Algorithms 5. Graph algorithms
• Fibonacci numbers, Travelling Salesman Problem • Introduction to Graphs
• Complexity and Big-O notation • From genetics to graphs
2. Sequence Alignment • Shortest superstring problem, DNA sequencing
• Dynamic Programming 6. Classification and regression
• BLAST • SVM
• Needleman-Wunsch algorithm • Decision Trees
3. Exhaustive search algorithms • Bagging and boosting
• Motig finding problem 7. Phylogenetics
• Search trees • Parsimony
4. Greedy Algorithms • Neighborhood joining algorithm
• Genome rearrangements/Towers of Hanoi • Likelihood of trees
• Permutations
• Sorting by reversals

17.09.23 Bioinformatics Algorithms 7


17.09.23 Bioinformatics Algorithms 8
Answer questions in Biology
• What is the genome of the bear? (genomics)

• How can we explain changes and diversity within/across populations?


(population genetics/genomics)

• Which genes are expressed in healthy vs. diseased tissue?


(transcriptomics)

• How related are two species if we look at their whole genomes?


(phylogenetics / phylogenomics)

17.09.23 Bioinformatics Algorithms 9


Our capabilities for high-throughput measurement
of Biological data has been transformative

• 1990 - 2000
• Sequencing the first human genome took ~10 years and cost ~$2.7 billion

• 2018
• Today, sequencing a genome costs ~$1,000✢ and a “run” takes <3 days✢
~18 Tb per “run” at maximum capacity

✢ on an Illumina HiSeq X Ten — the machine costs ~$10M and sample prep takes a little extra
time.

17.09.23 Bioinformatics Algorithms 10


Tons of Data, but we need Knowledge

e.g., genome sequencing, hallmark limitations are:

• Short “reads” (75 — 250) characters when the texts we’re interested
in are 1,000s to 1,000,000s of characters long.
• Imperfect “reads” — results in infrequent but considerable “errors”;
modifying, inserting or deleting one or more characters in the “read”
• Biased “reads” — as a result of the underlying chemistry & physics,
sampling is not perfectly uniform and random. Biases are not always
known.

17.09.23 Bioinformatics Algorithms 11


Despite these limitations, scientists have taken a very
subtle and nuanced approach . . .

17.09.23 Bioinformatics Algorithms 12


Despite these limitations, scientists have taken a very
subtle and nuanced approach . . .

17.09.23 Bioinformatics Algorithms 13


Answering questions on such a scale becomes
a fundamentally computational endeavor:

• Assembly — Find a likely “super string” that parsimoniously explain


200M short sub-strings (string processing, graph theory)

• Alignment — Find an approximate match for 50M short string in a


5GB corpus of text (string processing, data structure & algorithm
design)

• Phylogenomics — Given a set of related gene sequences, and an


assumed model of sequence evolution, determine how these
sequences are related to each other (statistics & ML)

17.09.23 Bioinformatics Algorithms 14


Introduction to algorithms
What is an algorithm?
Wikipedia:
«[An algorithm is] a self-contained step-by-step set of operations to be performed. »

«An algorithm is an effective method expressed as a finite list of well-defined instructions for
calculating a function. Starting from an initial state and initial input (perhaps empty), the
instructions describe a computation that, when executed, proceeds through a finite number of well-
defined successive states, eventually producing "output" and terminating at a final ending state.
The transition from one state to the next is not necessarily deterministic; some algorithms, known
as randomized algorithms, incorporate random input.»

17.09.23 Bioinformatics Algorithms 15


What is an algorithm?
• finite list of well-defined instructions
• that lead from initial input
• through a finite number of well-defined successive states
• to an output

Recipe
1) Take ingredients
2) Mix them
together in
random way
3) Bake at random
temperature
4) If result is
cupcake: end
5) Else: go to 1

17.09.23 Bioinformatics Algorithms 16


X
X n21

Xn ¼
ðiÞ
jn : (A1) variation in offspring distributions within generations. Then,
i¼1 for each n, the jðiÞ
n are a family of independent identically
distributed random variables. We denote the offspring of
a mutant copy in generation n by jn. Further, we denote
n by
We denote the PGF of jðiÞ
the PGF of jn by

What is an algorithm? ðiÞ


fi;n ðxÞ ¼
N
X

k¼0
fn;k xk ;
ðiÞ
(A2)
fn ðxÞ ¼
N
X

k¼0
fn;k xk ; (A11)

where fn;k ¼ PðjðiÞ


n ¼ kÞ. The PGF of Xn is denoted Fn. We ! n ¼ fn , we im-
where fn,k = P(jn = k). By observing that f
assume that the jðiÞn are independent but not necessarily mediately get
identically distributed random variables. It follows that
N
Fnþ1 ðxÞ ¼ f0 ðf1 ð. . . fn ðxÞÞÞ (A12)
X
k
Fnþ1 ðxÞ ¼ PðXnþ1 ¼ kÞx (A3)
k¼0
from Equation A10. The probability of ultimate survival, p, is
then given by
N X
X N l
X
! Computational
1 2 p ¼ lim f0 ðf1 ð. . . fn ð0ÞÞÞ; (A13)
ðiÞ k
Biological problem ¼
k¼0 l¼0
PðXn ¼ lÞP
i¼1
jn ¼ k x (A4)
solution:
n/N

N
X N
X l
X
! Algorithm
which is Equation 3 in the main text.

¼ k x k : (A5)
ðiÞ
¼ PðXn ¼ lÞ P jn
l¼0 k¼0 i¼1
Appendix B: Derivation of Equation 5
PN P Pl ðiÞ
Note that k¼0 Pð Ii¼1 jðiÞ k
n ¼ kÞx is the PGF of i¼1 jn . Be- Here, we derive an approximation for the probability of
cause the PGF of a sum of independent random variables is establishment if there is no variation in offspring distribu-
the product of the PGFs of the random variables (e.g., Karlin tions within generations.
and Taylor 1975), it follows that Let qn = Fn(0) denote the probability of extinction by
generation n. It is straightforward to see that {qn} is
N
X l
Y
Fnþ1 ðxÞ ¼ PðXn ¼ lÞ fi;n ðxÞ (A6) an increasing sequence that is bounded by 1. Intuitively, this
l¼0 i¼1 can be seen by observing that the probability of extinction in
generation n + 1 has to be larger than the probability
Formal
XN description
! " of extinction in generation n. More formally, it follows be-
! n ðxÞ l ;
Mathematics/Statistics
¼ PðXn ¼ lÞ f (A7) cause the fn are monotonically increasing functions of x.

?
l¼0 Consequently, q = limn/N qn exists and q 2 [0, 1]. The
probability of establishment is then given by p = 1 2 q.
where
We introduce a time-independent branching process,
!1=Xn denoted reference process, and define
Xn
Y
! n ðxÞ :¼
f fi;n : (A8)
i¼1
dn ðxÞ :¼ fn ðxÞ 2 f*ðxÞ; (B1)

where f* is the PGF of the offspring distribution in the


Consequently, reference process. Since both fn and f* are PGFs, it follows
17.09.23 Bioinformatics Algorithms 17
! " that dn is a bounded and continuously differentiable func-
!n
Fnþ1 ðxÞ ¼ fn f (A9) tion of x 2 [0, 1] and that dn(1) = 0. Furthermore, dn can be
What is an algorithm?

We start with a pair of


rabbits that reproduce Just follow the
each month. Rabbits
mathematical
take one month to
reach maturity. How description to get a
many rabbits will we recursive algorithm.
have after n months?

Fibonacci numbers:
F0 = 0, F1 = 1
Fn = Fn-1 + Fn-2, for n > 2

0,1,1,2,3,5,8,13,21,34,....

17.09.23 Bioinformatics Algorithms 18


Example I: Fibonacci numbers
In R:

fib = function(n)
{
if(n < 2)
return(n)
return(fib(n-2)+fib(n-1))
}

> fib(10)
> 55

17.09.23 Bioinformatics Algorithms 19


Example I: Fibonacci numbers
In R:

fib = function(n)
{
if(n < 2)
return(n)
return(fib(n-2)+fib(n-1))
}

Compuation time «explodes» with large n!


> fib(10) Our algorithm is correct but it is computationally
> 55 unfeasible for large n!

17.09.23 Bioinformatics Algorithms 20


How good is our Fibonacci algorithm?
Whenever we have an algorithm, there are three questions we can ask
about it:

1. Is it correct?

2. How much time does it take, as a function of n?

3. Can we do better?

17.09.23 Bioinformatics Algorithms 21


How good is our algorithm?

Whenever we have an algorithm, there are three questions we can ask


about it:

1. Is it correct?
Answer: Yes
2. How much time does it take, as a function of n?

3. Can we do better?

17.09.23 Bioinformatics Algorithms 22


How good is our algorithm?

Whenever we have an algorithm, there are three questions we can ask


about it:

1. Is it correct?
Answer: Yes
2. How much time does it take, as a function of n?
Answer: ?
3. Can we do better?

17.09.23 Bioinformatics Algorithms 23


How good is our algorithm?

Whenever we have an algorithm, there are three questions we can ask


about it:

1. Is it correct?
Answer: Yes
2. How much time does it take, as a function of n?
Answer: ?
3. Can we do better?
Answer: ?

17.09.23 Bioinformatics Algorithms 24


How much time?
Let’s call T(n) the number of basic operations that are need to perform to get Fn using our algorithm.

The following relation holds:


T(n) = T(n-1) + T(n-2) + 2

We cannot solve this but one can show that


Tn grows faster than and ( 2)n .

Thus, the number of operations needed to calculate


groes exponentially as a function of n.

This is bad: T(200) ≥ F200 ≥ 2 138 ≥ 10 41!

17.09.23 Bioinformatics Algorithms 25


Can we do better?
Why is fib1 so bad?

Many calculations are done


multiple times:

Fn-2 is calculted twice, Fn-3 three


times, Fn-4 four times, ....

This suggests that we can do better!

Homework exercise:
Can you find a more efficient way to calculate Fn?
17.09.23 Bioinformatics Algorithms 26
Can we do better?
Why is fib1 so bad?

Many calculations are done


multiple times:

Fn-2 is calculted twice, Fn-3 three


times, Fn-4 four times, ....

This suggests that we can do better!

Homework exercise:
Can you find a more efficient way to calculate Fn?
17.09.23 Bioinformatics Algorithms 27
Can we do better?
Why is fib1 so bad?

Many calculations are done


multiple times:

Fn-2 is calculted twice, Fn-3 three


times, Fn-4 four times, ....

This suggests that we can do better!

Homework exercise:
Can you find a more efficient way to calculate Fn?
17.09.23 Bioinformatics Algorithms 28
Can we do better?
Why is fib1 so bad?

Many calculations are done


multiple times:

Fn-2 is calculted twice, Fn-3 three


times, Fn-4 four times, ....

This suggests that we can do better!

Question:
Can you find a more efficient way to calculate Fn?
17.09.23 Bioinformatics Algorithms 29
Fibonacci II
To avoid computing the same number several times, we store previous calculations in a
vector:

fib1 = function(n)
{
if(n <2) return(n)
f = 1:n; f[1] = 1; f[2] = 1;
if(n >2)
for(i in (3:n))
f[i] = f[i-1] + f[i-2]
return(f[n])
}

17.09.23 Bioinformatics Algorithms 30


Fibonacci II
• fib1 is much faster than fib!
• Number of computations of fib1 as a function of n:
T(n) ≈ n
• Number of computations of fib as a function of n:
T(n) ≈ 2n

These calculations are still a bit vague, so let’s try to formalize this as a
mathematical concept

17.09.23 Bioinformatics Algorithms 31


Example II: Sorting numbers

https://imgur.com/gallery/voutF

17.09.23 Bioinformatics Algorithms 32


How efficient is an algorithm or
piece of code?
Efficiency covers lots of resources, Be careful to differentiate between:
including:
Performance:
• CPU (time) usage How much time/memory/disk/... is
• memory usage actually used when a program is run. This
• disk usage depends on the machine, compiler, etc.
as well as the code.
• network usage
Complexity:
All are important but we will mostly talk How do the resource requirements of a
about time complexity (CPU usage). program or algorithm scale, i.e., what
happens as the size of the problem being
solved gets larger?

17.09.23 Bioinformatics Algorithms 33


How efficient is an algorithm or
piece of code?
Efficiency covers lots of resources, Be careful to differentiate between:
including:
Performance:


CPU (time) usage
memory usage
Complexity affects How much time/memory/disk/... is
actually used when a program is run. This


disk usage performance but not the
network usage
depends on the machine, compiler, etc.
as well as the code.

other way around!


Complexity:
All are important but we will mostly talk How do the resource requirements of a
about time complexity (CPU usage). program or algorithm scale, i.e., what
happens as the size of the problem being
solved gets larger?

17.09.23 Bioinformatics Algorithms 34


Complexity
• When we are trying to find the complexity of the algorithm, we are
not interested in the exact number of operations that are being
performed.
• We are interested in the relation of the number of operations to the
problem size: how does the number of operations scale with the
input size
• We are usually interested in the worst case: what is the maximum
number of operations that might be performed for a given problem
size.

17.09.23 Bioinformatics Algorithms 35


Big-O notation
• A concise way to describe the running-time behavior of algorithms is the
Big-O notation

• O(f(n)) means that the running time of an input of size n is limited by a


function of form f(n)

• E.g.: O(n2) means that the run-time grows with the square of n,
this can mean that
T(n) = 0.0005 n2 or
T(n) = 10000000 n2 + 3n + 2

• The important thing is the leading-order term of T(n)!


17.09.23 Bioinformatics Algorithms 36
Big-O notation
Formal definition:

A function f(x) is O (g(x)) if there are positive real


constants c and x0 such that f(x) < c g(x) for all
values of x > x0.

17.09.23 Bioinformatics Algorithms 37


Big-O notation
A function f(n) is O (g(n)) if there are positive real constants c and n0 such that f(n) < c g(n) for all values of n > n0.

What does that mean?


1) We are looking for an upper bound for our function f. ( f(n) < c g(n))
2) We are only interested in the behavior for large values. (n > n0)
3) We don’t care for constant factors. (c can be chosen arbitrarily)

Example:
f(n) = 6 n4 – 2 n3 + 5

f(n) = O(n4)

17.09.23 Bioinformatics Algorithms 38


Big-O notation
How do we get that f(n) = 6 n4 – 2 n3 + 5 is O(n4)?
Assume that n ≥ 1:

6 n4 – 2 n3 + 5 ≤ 6 n4 + 2 n3 + 5
≤ 6 n4 + 2 n4 + 5 n4
≤ 13 n4

We not set c = 13, n0 = 1, and g(n) = n4

And get that f(n) ≤ c g(n) for all n > n0!


17.09.23 Bioinformatics Algorithms 39
Big-O notation
Common notations:
Slow-growing

Notation name

O(1) Constant
O(log(n)) Logarithmic
O((log(n))k) Polylogarithmic
O(n) Linear
O(n2) Quadratic
O(nk) Polynomial
O(kn) Exponential

k is an arbitrary constant Fast-growing

17.09.23 Bioinformatics Algorithms 40


Determining Complexity
A few helpful rules:

1. O(k1 f(n) + k2) = O(f(n)) if k1 and k2 are constants


2. O(f(n)+ g(n)) = O(g(n)) if g(n) > f(n) for large n
e.g.: O(n2 + n3) = O(n3)

3. This means that only the fastest growing term of a function is important!
4. We can ignore constants, i.e., everything that does not depend on n!

17.09.23 Bioinformatics Algorithms 41


Determining Complexity
• Examples:

exp(n)+n4 = O(exp(n))
exp(n) n + log(n) = O(n)
n

log(nk) = O(log(n))
k*log(n) = log(n) n log(n) + n2 = O(n
n^22)

log(n)k = O(log(n)
log(n)^kk) sqrt(n) n3 = O(n 2)
n^3.5

log(kn) = O(n)
n * log(k) = n
10 log(n) + (log(n))3 + 6 n3 = O(n
n^33)

17.09.23 Bioinformatics Algorithms 42


Complexity
• When we are trying to find the complexity of the algorithm, we are
not interested in the exact number of operations that are being
performed.
• We are interested in the relation of the number of operations to the
problem size: how does the number of operations scale with the
input size
• We are usually interested in the worst case: what is the maximum
number of operations that might be performed for a given problem
size.

17.09.23 Bioinformatics Algorithms 43


Calculate complexity from code
In general, how can you determine the running time of a piece of code?
It depends on what kinds of statements are used.

Sequence of statements
• statement 1;
• statement 2;
• ...
• statement k;

The total time is found by adding the times for all statements:

Total time = time(statement 1) + time(statement 2) + ... + time(statement k)

17.09.23 Bioinformatics Algorithms 44


Calculate complexity from code
If each staement is simple (constant number of operations, independent of
n), the time for each statement is constant and so is the amount of time for
all statements.

Consequently we get that the complexity of the code is O(1).

If one of the statments requires n basic operations, and the other are simple,
we get that the complexity is linear in O(n).

If one statmenet requires exp(n) operations, and all others are simples, we
get O(exp(n)).
17.09.23 Bioinformatics Algorithms 45
Some examples
What is the complexity of this loop:

for (i in 3:n)
v[i] = v[i-2]+v[i-1]

The loop is repeated n-2 times, and we perform two operations at each
iteration, one addition and one assignment.

Consequently, T(n) = 2*(n-2) = 2*n – 4 = O(n).

17.09.23 Bioinformatics Algorithms 46


Loops and functions
for (i in 1:n) for(i in 1:n)
for (j in 1:n) f(i)
m[i,j] = a[i] * b[j]
Where f(n) is O(n)

Solution: O(n2)
Solution: O(n2)
The loop is exectued n times and the function inside
the loop is O(n).
The operation is performed n*n Thus T(n) = 1+2+3+... + n = ½ n (n+1) = O(n2)
times.

17.09.23 Bioinformatics Algorithms 47


If/Else
If(condition1)
function(1)
else
function(2)

funciton1 is O(log(n))
function2 is O(n2)

Solution: O(n2)

Either funciton 1 or function 2 is executed.


Since we are interested in the worst case, we only need to consider the funciton with the higher
complexity.

17.09.23 Bioinformatics Algorithms 48


Example
1. Take the fib and fib1 code and calculate the complexity of the two
algorithms:

fib: based on the left figure we can


find an upper boundary for the
number of operations:
T(n) ≈ 1 + 2 + 4 + 8 + ....
= 2! + 2" + … + 2#
= 𝑂(2# )

17.09.23 Bioinformatics Algorithms 49


Example
1. Take the fib and fib1 code and calculate the complexity of the two
algorithms:

fib: based on the left figure we can


find an upper boundary for the
number of operations:
T(n) ≈ 1 + 2 + 4 + 8 + ....
= 2! + 2" + … + 2#
= 𝑂(2# )

17.09.23 Bioinformatics Algorithms 50


Example
1. Take the fib and fib1 code and calculate the complexity of the two
algorithms:

fib: based on the left figure we can


find an upper boundary for the
number of operations:
T(n) ≈ 1 + 2 + 4 + 8 + ....
= 2! + 2" + … + 2#
= 𝑂(2# )

17.09.23 Bioinformatics Algorithms 51


Example
1. Take the fib and fib1 code and calculate the complexity of the two
algorithms:

fib: based on the left figure we can


find an upper boundary for the
number of operations:
T(n) ≈ 1 + 2 + 4 + 8 + ....
= 2! + 2" + … + 2#
= 𝑂(2# )

17.09.23 Bioinformatics Algorithms 52


Example
1. Take the fib and fib1 code and calculate the complexity of the two
algorithms:

fib: based on the left figure we can


find an upper boundary for the
number of operations:
T(n) ≈ 1 + 2 + 4 + 8 + ....
= 2! + 2" + … + 2#
= 𝑂(2# )

17.09.23 Bioinformatics Algorithms 53


Example
1. Take the fib and fib1 code and calculate the complexity of the two
algorithms:

fib: based on the left figure we can


find an upper boundary for the
number of operations:
T(n) ≈ 1 + 2 + 4 + 8 + ....
= 2! + 2" + … + 2#
= 𝑂(2# )

17.09.23 Bioinformatics Algorithms 54


Example
1. Take the fib and fib1 code and calculate the complexity of the two
algorithms:
fib1 = function(n)
fib1: {
if(n <2) return(n)
f = 1:n; f[1] = 1; f[2] = 1;
T(n) = 𝐂𝟏 + 𝒏 − 𝟐 𝑪𝟐
if(n >2)
= 𝑶(𝒏) for(i in (3:n))
f[i] = f[i-1] + f[i-2]
return(f[n])
}

17.09.23 Bioinformatics Algorithms 55


Travelling salesman problem (TSP)
Given a list of cities and the distances between each pair of cities, what is the
shortest possible route that visits each city exactly once and returns to the origin
city?

TSP is important for computer science, logistics,


computer chip architecture,
and even DNA sequencing algorithms!

Many problems are TSP in disguise

n cities: (n-1)!/2 possible routes

17.09.23 Bioinformatics Algorithms 56


Travelling Salesman Problem
• R-script that exhaustively tries all possible • Some benchmarking
sequences Number of possible sequences explodes quickly:

• Example for 10 cities:

~10 cities is limit of what is feasible:

17.09.23 Bioinformatics Algorithms 57


Complexity of the TSP

• Exhaustive search: O(n!)


• Dynamic programming approach:
O(n2 2n)
• No better solution known (yet)
• It is possible that no better solution
exists!

17.09.23 Bioinformatics Algorithms 58


Traveling salesman problem (TSP)
Dynamic programming algorithms can improve speed considerably, while still
finding an optimal solution.

Nevertheless, the time needed to find optimal solution grows more than
exponentially with the number of cities.

TSP is an example of a NP-complete problem:


It is possible that the worst-case running time
for any algorithm for the TSP increases
superpolynomially (or perhaps exponentially)
with the number of cities.
HTTP://XKCD.COM/399/
Exact TSP algorithms are not useful for practical purposes!
17.09.23 Bioinformatics Algorithms 59
Approximations
For the TSP it may be better to find an approximate solution that may
not be optimal but is stil reasonably good for practical purposes.

A simple algorithm that yields an approximation to the true solution:


1. Start with random city
2. Find closest city that has not yet been visited and go there
3. Continue until finished

17.09.23 Bioinformatics Algorithms 60


Nearest Neighbor Algorithm
A simple algorithm that yields an approximation to the true solution:
1. Start with random city
2. Find closest city that has not yet been visited and go there
3. Continue until finished

Solution is
obviously not
optimal!

Local optimization algorithm Exhaustive algorithm

17.09.23 Bioinformatics Algorithms 61


Nearest Neighbor Algorithm
A simple algorithm that yields an approximation to the true solution:
1. Start with random city
2. Find closest city that has not yet been visited and go there
3. Continue until finished

Algorithm yields
different
solutions
depending on
starting point.
Local optimization algorithm Local optimization algorithm

17.09.23 Bioinformatics Algorithms 62


Complexity of NN-algorithm

Pseudocode:
Finding nearest city takes (n-k) calculations where k is
the number of already visited cities.
pick starting city
for (i in 2:n) All other operations are O(1). Thus, we get:
find nearest city 𝑇 𝑛 = 𝑛 − 1 𝐶 + 𝑛 − 2 𝐶 + …+ 𝑛 − 𝑛 − 3 + (𝑛 − (𝑛 − 2))𝐶
return result = O(n2)

17.09.23 Bioinformatics Algorithms 63


Some facts about the NN algorithm
• This algorithm quickly yields an effectively short route.
• For n cities randomly distributed on a plane, the algorithm on average
yields a path 25% longer than the shortest possible path (Johnson and
McGeoch, 1997).
• There exist many specially arranged city distributions which make the
NN algorithm give the worst route (Gutin, Yeo, and Zverovich, 2002).
• Modifications exist that can improve the NN algorithm, e.g., breaking
the network into small subsets and connect those via NN algorithm.

17.09.23 Bioinformatics Algorithms 64


Other approximations
• Iterative improvement:
1. Start with a random tour
2. pick k edges and remove them
3. Add k new edges such that we get a new tour
4. If the new tour is shorter, keep it. If not, keep the old one.
5. Repeat until no shorter sequence can be found
• Genetic algorithms (radomised algorithm):
1. Generate k random sequences
2. Generate n mutant (and/or recombinant) copies of the k best sequences
3. Go to 2 until stopping criterion is satisfied
• Ant colony algorithms:
Models behavior observed in real ants to find short paths between food sources and their nest,
an emergent behaviour resulting from each ant's preference to follow trail pheromones
deposited by other ants.

17.09.23 Bioinformatics Algorithms 65


Ant Colony Algorithm

"AntColony" by Saurabh.harsh - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
http://commons.wikimedia.org/wiki/File:AntColony.gif#mediaviewer/File:AntColony.gif

17.09.23 Bioinformatics Algorithms 66


Exercise
• Use the exhaustive TSP algorithm from ilias to implement
approximate algorithms of your choice
• Compare performance and solutions of algorithms

17.09.23 Bioinformatics Algorithms 67


Hints for Genetic Algorithms
• You can mutate a sequence by changing it in some random way, for
instance by exchanging the order of two cities:
e.g.: 123456789 -> 127456389
• Try several different ways to mutate your sequence
• Try different choices of n (total population size) and k (selected
individuals)
• How do these choices affecte performance?
• Is some choice of k better than others?
• What might be a problem if you chose k = 1?
• (Use Google if you get stuck – there are many resources out there!)
17.09.23 Bioinformatics Algorithms 68
Tractable and intractable problems
• Algorithms can be categorized according to their complexity
• Different algorithms for the same problem can have different
complexity (cf. fib and fib1)
• Problems also have an inherent complexity
• For instance, listing all subsets of a 𝑛-element set will always be at
least 𝑂(exp(𝑛)), no matter how smart the algorithm is
• The class of problems for which we do not have an algorithm that can
solve them in polynomial time is denoted NP-complete

17.09.23 Bioinformatics Algorithms 69


Algorithm Designs
Depending on the type of problem and the desired solution, there are
different types of algorithms.:
• Exhaustive search algorithms
• Greedy algorithms
• Dynamic programming algorithms
• Machine learning algorithms
• Recursive algorithms
• Clustering algorithms
• Graph alogrihtms
• ...
17.09.23 Bioinformatics Algorithms 70
A few examples
• Exhaustive algorithms • Dynamic programming: • Machine learning
An algorithm that explores all possible Dynamic programming is a method for The construction and study
answers to a problem to find the best solving a complex problem by breaking of algorithms that can learn from data.
solution. E.g.: the TSP algorithm that it down into a collection of simpler sub- Such algorithms operate by building a
explores all possible routes to find the problems. Fib1 is a very simple example model from example inputs and using
shortest one. of dynamic programming. that to make predictions or decisions,
rather than following strictly static
• Greedy algorithms • Recursive algorithms program instructions.
A greedy algorithm returns an An algorithm which calls itself with
approximate solution using less "smaller (or simpler)" input values, and • Clustering Algorithms
resources (computing time, memory, which obtains the result for the current Grouping a set of objects in such a way
etc.) than an exhaustive algorithm. input by applying simple operations to that objects in the same group (called
Greedy algorithms usually try to find the the returned value for the smaller (or a cluster) are more similar (in some
«locally» best solution. simpler) input. sense or another) to each other than to
An example of a greedy algorithm for those in other groups (clusters). This
the TSP is: describes a set of problems rather than
1. Start with a random city.
a specific type of algorithm.
2. At each stage visit an unvisited
city nearest to the current city.

17.09.23 Bioinformatics Algorithms 71


A few examples
• Exhaustive algorithms • Dynamic programming: • Machine learning
An algorithm that explores all possible Dynamic programming is a method for The construction and study
answers to a problem to find the best solving a complex problem by breaking of algorithms that can learn from data.
solution. E.g.: the TSP algorithm that Correct vs.
it down into a collection of simpler sub- incorrect
Such algorithms operate by building a
explores all possible routes to find the
shortest one.
problems. Fib1 is a very simple
of dynamic programming.
algorithms
example model from example inputs and using
that to make predictions or decisions,
rather than following strictly static
• Greedy algorithms • Recursive algorithmsExhaustive algorithms return correct
program instructions.
A greedy algorithm returns an An algorithm which calls itself with
solutions.
approximate solution using less "smaller (or simpler)" input values, and • Clustering Algorithms
resources (computing time, memory, which obtains the result for the current Grouping a set of objects in such a way
etc.) than an exhaustive algorithm. Greedy algorithms
input by applying simple operations to may return a sub-in the same group (called
that objects
Greedy algorithms usually try to find the the returned valueoptimal solution(orbut do soa in
for the smaller shorter
cluster) are more similar (in some
«locally» best solution. simpler) input. time. sense or another) to each other than to
An example of a greedy algorithm for those in other groups (clusters). This
the TSP is: describes a set of problems rather than
1. Start with a random city.
a specific type of algorithm.
2. At each stage visit an unvisited
city nearest to the current city.

17.09.23 Bioinformatics Algorithms 72


A few examples
• Exhaustive algorithms • Dynamic programming: • Machine learning
An algorithm that explores all possible Dynamic programming is a method for The construction and study
answers to a problem to find the best solving a complex problem by breaking of algorithms that can learn from data.
solution. E.g.: the TSP algorithm that it down into a collection of simpler sub- Such algorithms operate by building a
explores all possible routes to find the problems. Fib1 is a very simple example model from example inputs and using
shortest one. of dynamic programming. that to make predictions or decisions,
Different methods rather than following strictly static
• Greedy algorithms • Recursive algorithms program instructions.
A greedy algorithm
Algorithms thatreturns an
yield the An algorithm which calls itself with
approximate solution using less "smaller (or simpler)" input values, and • Clustering Algorithms
same solution can be
resources (computing time, memory, which obtains the result for the current Grouping a set of objects in such a way
implemented in different
etc.) than an exhaustive algorithm. input by applying simple operations to that objects in the same group (called
waysalgorithms
Greedy (e.g., fib and fib1).
usually try to find the the returned value for the smaller (or a cluster) are more similar (in some
«locally» best solution. simpler) input. sense or another) to each other than to
An example of a greedy algorithm for those in other groups (clusters). This
the TSP is: describes a set of problems rather than
1. Start with a random city.
a specific type of algorithm.
2. At each stage visit an unvisited
city nearest to the current city.

17.09.23 Bioinformatics Algorithms 73


A few examples
• Exhaustive algorithms • Dynamic programming: • Machine learning
An algorithm that explores all possible Dynamic programming is a method for The construction and study
answers to a problem to find the best solving a complex problem by breaking of algorithms that can learn from data.
solution. E.g.: the TSP algorithm that it down into a collection of simpler sub- Such algorithms operate by building a
explores all possible routes to find the problems. Fib1 is a very simple example model from example inputs and using
shortest one. A different categoryofof problems
dynamic programming. that to make predictions or decisions,
rather than following strictly static
• Greedy algorithms • Recursive algorithms program instructions.
Machine
A greedy learning
algorithm andanclustering algorithms
returns oftenwhich
An algorithm have calls
no unique
itself with
«optimal» solution. These
approximate solution using less algorithms "smaller (or simpler)" input values, and • Clustering Algorithms
often have to make predicitions
about
resources data based
(computing time,on available information.
memory, The output
which obtains offor
the result thethe current Grouping a set of objects in such a way
etc.) than an exhaustive
algorithms algorithm.
may change input by
with available data. Theapplying simple
problems areoperations
often to that objects in the same group (called
Greedy algorithms usually try to find the the returned value for the smaller (or a cluster) are more similar (in some
related
«locally» best solution.
to some kind of pattern recognition.
simpler) input. sense or another) to each other than to
An example of a greedy algorithm for those in other groups (clusters). This
the TSP is: describes a set of problems rather than
1. Start with a random city.
a specific type of algorithm.
2. At each stage visit an unvisited
city nearest to the current city.

17.09.23 Bioinformatics Algorithms 74

You might also like