Bioinformatics
Algorithms
Stephan Peischl
Interfaculty Unit for Bioinformatics
Baltzerstrasse 6
CH-3012 Bern
Switzerland
Email: stephan.peischl@unibe.ch
Sequence Alignment
GATTACA
insertion
GAATTACA substitution
GATTACC
substitution
TAATTACA
GAATTACA TAATTACA GATTACC
22.09.24 Bioinformatics Algorithms 2
DNA and Protein sequences
• DNA (Deoxyribonucleic acid) uses a 4-letter alphabet (A,T,C,G)
• RNA (Ribonucleic acid) also uses a 4-letter alphabet (A,U,C,G)
• Proteins use 20 amino acids (A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y)
• We will represent sequences of DNA, RNA or Proteins as strings
e.g.: ATCTGTTA, TGCCATA, ...
22.09.24 Bioinformatics Algorithms 3
Sequence evolution
GATTACA
GAATTACA
GATTACC
TAATTACA
GAATTACA TAATTACA GATTACC
22.09.24 Bioinformatics Algorithms 4
Sequence evolution
GATTACA
We know real alignment:
insertion
GAATTACA substitution GAATTACA
GATTACC TAATTACA
GA-TTACC
substitution
TAATTACA How can we align sequences
if we don’t know the exact
evolutionary history?
GAATTACA TAATTACA GATTACC
22.09.24 Bioinformatics Algorithms 5
Sequence alignment
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA,
RNA, or protein to identify regions of similarity that may be a consequence of functional,
structural, or evolutionary relationships between the sequences.
If two sequences in an alignment share a common ancestor, mismatches can be
interpreted as point mutations and gaps as indels (that is, insertion or deletion
mutations) introduced in one or both lineages in the time since they diverged from one
another.
AAAGCGGAAGTCACAG AAGGCTGAAGTATAG
There are two different ways to align sequences:
AAAGCGGAAGTCACAG
1. Global alignment assumes that the two proteins are basically similar over the ||.||.||||| |.||
entire length of one another. The alignment attempts to match them to each other
from end to end. AAGGCTGAAGT-ATAG
2. Local alignment searches for segments of the two sequences that match well. AAAGCGGAAGTCACAG
There is no attempt to force entire sequences into an alignment, just those parts ......||||| ....
that appear to have good similarity, according to some criterion.
AAGGCTGAAGT-ATAG
22.09.24 Bioinformatics Algorithms 6
Manhattan tourist algorithm
The problem: Given a weighted grid G, travel from source (top left)
to the sink (bottom right) along the highest scoring path only travelling
south and east.
There are only finitely many possible paths and
we can solve problem by trying all of them.
But what does that have to do with sequence
alignments? (we will see later….)
22.09.24 Bioinformatics Algorithms 7
Manhattan Toursit
Trying all possible routes will
give us best solution, but as we
have ssen before, this is
probably not the best way to
tackle this problem!
22.09.24 Bioinformatics Algorithms 8
Manhattan Toursit
There are
𝑛+𝑚 𝑛+𝑚 !
=
𝑛 𝑛! 𝑚!
possible paths!
Therfore, an exhaustive search
!"# !
algorithm is O (! ! !&# !).
22.09.24 Bioinformatics Algorithms 9
Manhattan Toursit
There are
𝑛+𝑚 𝑛+𝑚 !
=
𝑛 𝑛! 𝑚!
possible paths!
Therfore, an exhaustive search
!"# !
algorithm is O (! ! !&# !).
22.09.24 Bioinformatics Algorithms 10
Manhattan Toursit
Like in the Fibonacci example,
the same sub-calculations are
made multiple times if we
exhaustively calculate all
possible paths!
22.09.24 Bioinformatics Algorithms 11
Manhattan Toursit
Like in the Fibonacci example,
the same sub-calculations are
made multiple times if we
exhaustively calculate all
possible paths!
22.09.24 Bioinformatics Algorithms 12
Manhattan Toursit
Like in the Fibonacci example,
the same sub-calculations are
made multiple times if we
exhaustively calculate all
possible paths!
The red piece is calculated
multiplie times!
22.09.24 Bioinformatics Algorithms 13
Manhattan Toursit
Like in the Fibonacci example,
the same sub-calculations are
made multiple times if we
exhaustively calculate all
possible paths!
The red piece is calculated
multiple times!
22.09.24 Bioinformatics Algorithms 14
Manhattan Toursit
In fact, the scores for many paths
are calculated multiple times.
This suggests that we can find the
highest scoring path by splitting
the problem into calculating the
highest scores for sub-paths.
22.09.24 Bioinformatics Algorithms 15
Manhattan Tourist
Let’s have a look at a more general problem:
What is the best way (i.e., highest scoring
path) to get from the source (upper left
corner) to any location on the grid?
Even though this may seem a stupid thing to do at first glance.... Instead
of one question we now have to answer n x m questions!
22.09.24 Bioinformatics Algorithms 16
Dynamic programming
• Dynamic programming describes breaking a large problem into
several smaller ones that are easier to solve
• Solving the more general problem of how we can get from the source
to any location is as easy as solving the more specific problem of
going from source to sink and...
• ... we get the solution of the specific problem en route!
• Finding the best path to a cell is easy because there are only two
possible ways to get there
• These observations outline the key ideas of dynamic programming
22.09.24 Bioinformatics Algorithms 17
Manhattan Toursit
Many paths are calculated multiple
times.
We can split the problem into
small peices by first calculating a
set of sub-paths, and then use this
information to find the optimal
global path.
How shall we do this?
22.09.24 Bioinformatics Algorithms 18
Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
22.09.24 Bioinformatics Algorithms 19
Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
2. Next, we fill in the top row and leftmost column
with the according scores (note that there is
only onw way to get to each cell!)
3. The next cell to fill is (2,2) – there are only two
ways to get there: from west or from north.
4. We take the higher scoring path and enter the
score
22.09.24 Bioinformatics Algorithms 20
Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
2. Next, we fill in the top row and leftmost column
with the according scores (note that there is
only onw way to get to each cell!)
3. The next cell to fill is (2,2) – there are only two
ways to get there: from west or from north.
4. We take the higher scoring path and enter the
score
5. We continue like this until every cell is filled
6. At every step we record the path taken
22.09.24 Bioinformatics Algorithms 21
Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
2. Next, we fill in the top row and leftmost column
with the according scores (note that there is
only onw way to get to each cell!)
3. The next cell to fill is (2,2) – there are only two
ways to get there: from west or from north.
4. We take the higher scoring path and enter the
score
5. We continue like this until every cell is filled
6. At every step we record the path taken
22.09.24 Bioinformatics Algorithms 22
Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
2. Next, we fill in the top row and leftmost column
with the according scores (note that there is
only onw way to get to each cell!)
3. The next cell to fill is (2,2) – there are only two
ways to get there: from west or from north.
4. We take the higher scoring path and enter the
score
5. We continue like this until every cell is filled
6. At every step we record the path taken
22.09.24 Bioinformatics Algorithms 23
Manhattan Tourist Algorithm
1. We start in the source and put a 0 in the source
2. Next, we fill in the top row and leftmost column
with the according scores (note that there is
only onw way to get to each cell!)
3. The next cell to fill is (2,2) – there are only two
ways to get there: from west or from north.
4. We take the higher scoring path and enter the
score
5. We continue like this until every cell is filled
6. At every step we record the path taken
22.09.24 Bioinformatics Algorithms 24
Manhattan Tourist Algorithm
Finally, there is a single (or multiple) optimal path(s) left:
22.09.24 Bioinformatics Algorithms 25
The Algorithm
SE(i,j), SS(i,j) .... Score to come to (i,j) from west or north
S .... nxm Matrix with score of highest scoring path
D ... nxm Matrix with direction of highest scoring path
S(0,0) = 0
for (i in 1:n) S(i,0) = S(i-1,0) + SS(i,0)
for(j in 1:m) S(0,j) = S(0,j-1) + SE(0,j)
For(i in 1:n)
for(j in 1:m)
S(i,j) = max(S(i-1,j) + SS(i,j), S(i,j-1) + SE(i,j))
D(i,j) = path.to(i,j)
22.09.24 Bioinformatics Algorithms 26
The Algorithm
SE(i,j), SS(i,j) .... Score to come to (i,j) from east or south
S .... nxm Matrix with score of highest scoring path
D ... nxm Matrix with direction of highest scoring path
Complexity:
S(0,0) = 0 T(n) = C1 + C2 (n + m) + C3 n m
for (i in 1:n) S(i,0) = S(i-1,0) + SS(i,0) = O(n m)
for(j in 1:m) S(0,j) = S(0,j-1) + SE(0,j)
For(i in 1:n)
for(j in 1:m)
S(i,j) = max(S(i-1,j) + SS(i,j), S(i,j-1) + SE(i,j))
D(i,j) = path.to(i,j)
22.09.24 Bioinformatics Algorithms 27
Global alignment
What does sequence alignment have to do
with the Manhattan Tourist algorithm?
AAAGCGGAAGTCACAG We can represent sequences as strings and
||.||.||||| |.|| give each alignment between two sites a score:
AAGGCTGAAGT-ATAG
w(i,j) = 1 if site i and j are equal (match)
w(i,j) = -1 if site i and j are not equal (substitution
or insertion/deletion )
22.09.24 Bioinformatics Algorithms 28
Global alignment
What does sequence alignment have to do
with the Manhattan Tourist algorithm?
AAAGCGGAAGTCACAG We can represent sequences as strings and
||.||.||||| |.|| give each alignment between two sites a score:
AAGGCTGAAGT-ATAG
w(i,j) = 1 if site i and j are equal (match)
w(i,j) = -1 if site i and j are not equal (substitution
1+1-1+1+1-1+1+1+1+1+1-1+1-1+1+1 or insertion/deletion )
22.09.24 Bioinformatics Algorithms 29
Global alignment
What does sequence alignment have to do
with the Manhattan Tourist algorithm?
AAAGCGGAAGTCACAG We can represent sequences as strings and
||.||.||||| |.|| give each alignment between two sites a score:
AAGGCTGAAGT-ATAG
w(i,j) = 1 if site i and j are equal (match)
w(i,j) = -1 if site i and j are not equal (substitution
1+1-1+1+1-1+1+1+1+1+1-1+1-1+1+1 or insertion/deletion )
22.09.24 Bioinformatics Algorithms 30
Global alignment
We now have three possible ways to from
a site to the next:
match, subsitution, or insertion/deletion
If we label a grid the letters of the two
strings, we can represent alignments as
paths through the grid.
Matches and substitutions are represented
by going diagonally to the next cell and
indels by going down.
22.09.24 Bioinformatics Algorithms 31
Global alignment
We now have three possible ways to from
a site to the next:
match, subsitution, or insertion/deletion
ATCGT-AC-
If we label a grid the letters of the two
||.||.|..
strings, we can represent alignments as
paths throughAT-GTTA-T
the grid.
Matches and substitutions are represented
by going diagonally to the next cell and
indels by going down/right.
22.09.24 Bioinformatics Algorithms 32
Global alignment
We now have:
• how to find the highest scoring path on
a grid
• how to asign scores to sequence
alignments
• How to represent sequence-alignments
on a grid
AAAGCGGAAGTCACAG
||.||.||||| |.||
AAGGCTGAAGT-ATAG
1+1-1+1+1-1+1+1+1+1+1-1+1-1+1+1
22.09.24 Bioinformatics Algorithms 33
Needleman-wunsch algorithm
w(i,j,x) .... Score to come to (i,j) via x (= match, mismatch, indel)
S .... nxm Matrix with score of highest scoring path
D ... nxm Matrix with direction of highest scoring path
S(0,0) = 0
for (i in 1:n) S(i,0) = S(i-1,0) + SS(i,0)
for(j in 1:m) S(0,j) = S(0,j-1) + SE(0,j)
For(i in 1:n)
for(j in 1:m)
S(i,j) = max(w(i,j,x))
D(i,j) = path.to(i,j)
22.09.24 Bioinformatics Algorithms 34
References
• Needlman and Wunsch (1970) - A general method applicable to the
search for similarities in the amino acid sequence of two proteins
• Online Tools, e.g.: BLAST:
22.09.24 Bioinformatics Algorithms 35
Comparing strings
To compare strings, there are two common measures:
1. Edit distance: the minimum number of operations (insertions, deletions
and substitutions) required to transform 1 string into another
AAAGCGGAAGTCACAG
||.||.|||||-|.|| Edit distance: 4
AAGGCTGAAGT-ATAG
2. Hamming distance: the number of differences when comparing the ith
value of a string against the ith value of another
AAAGCGGAAGTCACAG
||.||.|||||....- Hamming distance: 7
AAGGCTGAAGTATAG
22.09.24 Bioinformatics Algorithms 36
Which distance?
DNA sequences are subject to insertions, deletions, and substitutions.
Subsitutions can be measured by the edit distance, but because insertions and deletions occur as
well, we usually do not know which is the «ith» position.
Thus, the edit distance seems to be more appropriate for sequence alignment.
e.g.: ATATATAT and TATATATA
Edit distance = 2 Hamming distance = 7
-ATATATAT ATATATAT
-|||||||- ........
TATATATA TATATATA
22.09.24 Bioinformatics Algorithms 37
Scoring function
The simplest form of the edit distance gives equal weight to substitutions, insertions or
deletions.
This is not reflecting biological reality because not all substitutions occur with the same
probability, substitutions are more/less likely than insertions or deleteions, etc.
We therfore introduce a scoring function that assigns a score (or penality) to each «edit».
For instance, if we penalize mismatches by μ, insertions/deletions by σ, and score matches
by 1, we can calculate the edit distance for two strings by:
# matches – μ # mismatches – σ # insertions/deletions
22.09.24 Bioinformatics Algorithms 38
Example
AAAGCGGAAGTCACAG AAAG-C-GGAAGTCACAG
||.||.|||||-|.|| ||-|-|-|-||||-|.||
AAGGCTGAAGT-ATAG AA-GGCTG-AAGT-ATAG
#matches = 12 #matches = 12
#mismatches = 3 #mismatches = 1
# indels = 1 # indels = 5
distance = 12 – 3 μ – σ distance = 12 – μ – 5 σ
22.09.24 Bioinformatics Algorithms 39
Example
AAAGCGGAAGTCACAG AAAG-C-GGAAGTCACAG
Which is better||-|-|-|-||||-|.||
alignment?
||.||.|||||-|.||
AAGGCTGAAGT-ATAG d1 > d2 or AA-GGCTG-AAGT-ATAG
d1 < d2 ?
d1 = 12 – 3 μ – σ
μ = 1, σ = 1: d1 > d2 (8 > 6)
#matches = 12 #matches = 12
#mismatches = 3 μ = 2, σ = 1: d1 #mismatches
= d2 (5 = 5) =1
d2 = 12 – μ – 5 σ
# indels = 1 # indels = 5
μ = 3, σ = 1: d1 < d2 (2 < 4)
distance = 12 – 3 μ – σ distance = 12 – μ – 5 σ
22.09.24 Bioinformatics Algorithms 40
Example
AAAGCGGAAGTCACAG AAAG-C-GGAAGTCACAG
||.||.|||||-|.|| ||-|-|-|-||||-|.||
AAGGCTGAAGT-ATAG AA-GGCTG-AAGT-ATAG
How we score matches, mismatches
and indels impacts the outcome
#matches = 12
of the
#matches = 12
#mismatches = 3 sequence alignment!#mismatches = 1
# indels = 1 # indels = 5
distance = 12 – 3 μ – σ distance = 12 – μ – 5 σ
22.09.24 Bioinformatics Algorithms 41
Needleman-wunsch algorithm
w(i,j,x) .... Score to come to (i,j) via x (= match, mismatch, indel)
S .... nxm Matrix with score of highest scoring path
D ... nxm Matrix with direction of highest scoring path
S(0,0) = 0
for (i in 1:n) S(i,0) = S(i-1,0) + SS(i,0)
for(j in 1:m) S(0,j) = S(0,j-1) + SE(0,j)
For(i in 1:n)
for(j in 1:m)
S(i,j) = max(w(i,j,x))
D(i,j) = path.to(i,j)
22.09.24 Bioinformatics Algorithms 42
Example Needlman-Wunsch
TGATACCA
-|.||.||
-GCTAACA
μ=1
σ=1
22.09.24 Bioinformatics Algorithms 43
Example Needlman-Wunsch
TGATACCA
-|.||.||
-GCTAACA
μ=1
σ=1
--TGATACCA
--|-|-||-|
GCT-A-AC-A
μ=2
σ=1
22.09.24 Bioinformatics Algorithms 44
Scoring matrix
We can define a function that incorporates empirical
evidence about the frequency at which a given amino
acid/nucelotide/ etc. is subsituted by another. A C T G
A dAA dAC dAT dAG
One problem: C dCA dCC dCT dAG
To get the the empirical data to construct scoring
functions, one needs to align sequences!
T dTA dTC dTT dAG
Solution: G dGA dGC dGT dGG
If we have «good» data, we can use a naive scoring
function and incrementally improve it.
22.09.24 Bioinformatics Algorithms 45
Modifications to
Needleman-Wunsch algorithm
• So far we have assumed that indels are always of length 1
• A single indel of length n is, however, more likely than n small indels
• An indel of length n should therfore have a smaller penalty than n
indels of length 1
• We can implement this in our algorithm by changing the score
function for indels:
σ1 if new indel
w(i,j,indel) =
σ2 if extension of «open» indel
22.09.24 Bioinformatics Algorithms 46
Modifications to
Needleman-Wunsch algorithm
An alignment with two long indels might be biologically more plausible
than an alignment with fewer indels:
ACCTAACCGGTAGATAC--- ACCT-A-AC-CGGTAGATAC
||||------||||.||--- ||||-|-||-|--||||.||
ACCT------TAGAGACGAC ACCTTAGACAC--TAGAGAC
2 long indels (9 positions in total) 4 short indels (5 positions in total)
1 substitution 1 substitution
22.09.24 Bioinformatics Algorithms 47
Modifications to
Needleman-Wunsch algorithm
• It can also make sense to allow long indels
at beginning and end of alignment:
---AAGCATAGCCAT
---||||||||----
CTGAAGCATAG----
• This can be achieved by simply set the first
row and first column to zero in the
initalization step of the NW algorithm
22.09.24 Bioinformatics Algorithms 48
Modifications to
Needleman-Wunsch algorithm
• It can also make sense to allow long indels
at beginning and end of alignment:
---AAGCATAGCCAT
---||||||||----
CTGAAGCATAG----
• This can be achieved by simply set the first
row and first column to zero in the
initalization step of the NW algorithm
22.09.24 Bioinformatics Algorithms 49
Modifications to
Needleman-Wunsch algorithm
• A simple heuristic approximation for
the NW algorithm can be obtained by
assuming that the best alignmnet
should be close to the diagonal of the
scoring matrix
• We can therfore improve space and
time complexity considerably by
considering only a «band» along the
main diagonal
22.09.24 Bioinformatics Algorithms 50
Complexity of NW algorithm
• It is easy to see from the pseudo-code that NW algorithm has
• space complexity O(n m) (store nxm matrix) and
• time complexity O(n m) (calculate entries of nxm matrix)
• We can reduce the space complexity to O(n) at the cost of doubling
the computation time
• We need to store only 2 n entries to calculate the next column of
entries
• But we need to backtrack the highest scoring path after the calculation:
this requires to store a nxm matrix = O(nm)
• How can we calculate alignment in O(n) space?
22.09.24 Bioinformatics Algorithms 51
From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time
Step 1)
It is easy to calculate just the scores of paths
in linear space because we only need to store
at most 2n elements to calculate the next row
of entries.
22.09.24 Bioinformatics Algorithms 52
From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time
If (U,V) = NW(X,Y) is the alignment of sequences X and Y then we can
see the following:
Let Xl and Xr be a partition of X than there exists a partitioning Yl and
Yr such that NW(X,Y) = NW(Xl,Yl) + NW(Xr,Yr)
ACCT-A-AC-CGGTAGATAC ACCT-A-AC-C GGTAGATAC
||||-|-||-|--||||.|| ||||-|-||-| --||||.||
ACCTTAGACAC--TAGAGAC ACCTTAGACAC --TAGAGAC
22.09.24 Bioinformatics Algorithms 53
From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time
If (U,V) = NW(X,Y) is the alignment of sequences X and Y then we can see
the following:
Let Xl and Xr be a partition of X than there exists a partitioning Yl and Yr
such that NW(X,Y) = NW(Xl,Yl) + NW(Xr,Yr)
ACCT-A-AC-CGGTAGATAC ACCT-A-AC-C GGTAGATAC
||||-|-||-|--||||.|| ||||-|-||-| --||||.||
ACCTTAGACAC--TAGAGAC ACCTTAGACAC --TAGAGAC
X = Xl + Xr = ACCTAACC + GGTAGATAC, Y = Yl + Yr = ACCTTAGACAC+ TAGAGAC
22.09.24 Bioinformatics Algorithms 54
From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time
We start by finding the middle, i.e.
the point on the middle column
of the matrix that divides the highest scoring
path into two sub-paths.
This can be done in O(n) space and O(n m/2)
time.
Note that we only need to find the cell with
the highest score and note the whole path!
That‘s why we can do it in O(n) space.
22.09.24 Bioinformatics Algorithms 55
From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time
When we now the location of the «middle»,
we can simply repeat this process for the
remaining upper and lower sub-matrices to get
the location of next «middle» of the alignment
path.
This can be done in time proportional to the
area of the submatrices (gray rectangles).
22.09.24 Bioinformatics Algorithms 56
From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time
We recursively repeat this until we have
reconstructed the whole sequence alignment, i.e.,
until we have found the «middle» for all columns.
At each step, this can be done in time
proportional to the area of the submatrices.
22.09.24 Bioinformatics Algorithms 57
From O(n^2) -> O(n) (in terms of memory)
by doubling the computation time
We recursively repeat this until we have
reconstructed the whole sequence alignment, i.e.,
until we have found the «middle» for all columns.
At each step, this can be done in time proportional
to the area of the submatrices.
Total time =
nm + nm/2 + nm/4 + nm/8 + ... ≤ 2 nm = O(n m)
......
Total space = O(n)
22.09.24 Bioinformatics Algorithms 58
From O(n2) -> O(n) (in terms of memory)
by doubling the computation time
• This is an example of a «divide and conquer» algorithm:
A divide and conquer algorithm works by recursively breaking down a
problem into two or more sub-problems of the same (or related)
type, until these become simple enough to be solved directly. The
solutions to the sub-problems are then combined to give a solution
to the original problem.
22.09.24 Bioinformatics Algorithms 59
Implementation in R
• Open the neddleman wunsch R script from ilias
• Run the code
• Try some examples: Score for match
Penalty for
plotNeedlemanWunsch("GCTAACA","TGATACCA",1,1,1) insertion/deletion
Sequence 1 Sequence 2
Penalty for mismatch
22.09.24 Bioinformatics Algorithms 60
Needleman−Wunsch
Implementation in R match = 1
T
mismatch = −1
G A T A
gap = −1
C C A
0 −1 −2 −3 −4 −5 −6 −7 −8
A −1 −1 −2 −1 −2 −3 −4 −5 −6
plotNeedlemanWunsch T −2 0 −1 −2 0 −1 −2 −3 −4
("ATTCCAATA","TGATACCA",1,1,1) T −3 −1 −1 −2 −1 −1 −2 −3 −4
C −4 −2 −2 −2 −2 −2 0 −1 −2
--ATTCCAATA
--||.|||--- C −5 −3 −3 −3 −3 −3 −1 1 0
TGATACCA
A −6 −4 −4 −2 −3 −2 −2 0 2
A −7 −5 −5 −3 −3 −2 −3 −1 1
--ATTCCAATA
--||.||---| T −8 −6 −6 −4 −2 −3 −3 −2 0
TGATACC---A
A −9 −7 −7 −5 −3 −1 −2 −3 −1
22.09.24 Bioinformatics Algorithms 61
Needleman−Wunsch
Implementation in R match = 1
T
mismatch = −1
G A T A
gap = −5
C C A
0 −5 −10 −15 −20 −25 −30 −35 −40
A −5 −1 −6 −9 −14 −19 −24 −29 −34
plotNeedlemanWunsch T −10 −4 −2 −7 −8 −13 −18 −23 −28
("ATTCCAATA","TGATACCA",1,1,5) T −15 −9 −5 −3 −6 −9 −14 −19 −24
C −20 −14 −10 −6 −4 −7 −8 −13 −18
ATTCCAATA
C −25 −19 −15 −11 −7 −5 −6 −7 −12
-|...|..|
-TGATACCA A −30 −24 −20 −14 −12 −6 −6 −7 −6
A −35 −29 −25 −19 −15 −11 −7 −7 −6
T −40 −34 −30 −24 −18 −16 −12 −8 −8
A −45 −39 −35 −29 −23 −17 −17 −13 −7
22.09.24 Bioinformatics Algorithms 62
Scores for amino acids
• The scoring matrices for DNA sequence comparison are usually
defined only by a few parameters
• Random mutations of the nucleotide sequence within a gene may
change the amino acid sequence of the corresponding protein.
• Some of these mutations do not drastically alter the protein’s
structure, but others do and impair the protein’s ability to function.
• The common matrices for protein sequence comparison are point
accepted mutations (PAM) and block substitution (BLOSUM)
• They reflect the frequency with which amino acid x replaces amino
acid y in evolutionarily related sequences
22.09.24 Bioinformatics Algorithms 63
PAM scoring matrices
PAM = Point (or Percentage) Accepted Mutations.
(Dayhoff, Schwartz, and Orcutt, 1978).
Proteins that have, on average, only one mutation per 100 amino acids are
defined as being one PAM unit diverged.
Roughly speaking: 1 PAM is the amount of time in which an “average” protein
mutates 1% of its amino acids.
Accepted mutations are the ones that are able to spread in the population
(typically mutations that do not disrupt the protein function or even increase
fitness).
22.09.24 Bioinformatics Algorithms 64
PAM scoring matrices
1) Given a set of base alignments, define f(i, j) as the total number of times amino acids i and j are aligned
against each other, divided by the total number of aligned positions.
2) We also define g(i, j) as f(i,j)/f(i) , where f(i) is the frequency of amino acid i in all proteins (or seqeunces)
from the data set. g(i, j) defines the probability that an amino acid i mutates into amino acid j within 1
PAM unit.
3) The (i, j) entry of the PAM1 matrix is defined as
log (f(i,j)/ (f(i)·f(j))) = log (g(i,j)/f(j))
Note that f(i) f(j) is the frequency of alignment i against j by chance.
4) The PAM n matrix can be defined as the result of applying the PAM 1 matrix n times.
5) Multiplying g by itself n times gives the probability that amino acid i mutates into amino acid j during n
PAM units.
6) The (i, j) entry of the PAM n matrix is then defined as log(gn(i,j)/f(j)).
22.09.24 Bioinformatics Algorithms 65
PAM matrix
• Usually one starts with sequences that are easy to align because they
are not strongly diverged (e.g., human-chimp)
• Align with naive scoring function
• Pick subset that satisfy the condition that 1% of the sequence is
mutated
• Calculate PAM 1 matrix
• Calculate PAM n matrix
22.09.24 Bioinformatics Algorithms 66
Multiple alignments
• So far we have only seen how to align 2 sequences
• What if we want to align 3 or more sequences?
• Mulitple pariwise alignment:
22.09.24 Bioinformatics Algorithms 67
Multiple pairwise alignment
• In this case, the pairwise alignments are compatible:
22.09.24 Bioinformatics Algorithms 68
Incompatible Alignments
• This is not always the case:
22.09.24 Bioinformatics Algorithms 69
Multiple alignment: three sequences
We can represent all alignments
of three sequences as paths on a
3-dimensional grid.
We now have more directions to move:
- match in 2 sequences, mismatch in 1 (2x)
- match in 2 sequences, indel in 1 (2x)
- match in 3 sequences
- Mismatch in all sequences
- indel in 2 sequences (2x)
22.09.24 Bioinformatics Algorithms 70
Multiple alignment: three sequences
We can represent all alignments
of three sequences as paths on a
3-dimensional grid.
We now have more directions to move:
- match in 2 sequences, mismatch in 1 (2x)
- match in 2 sequences, indel in 1 (2x)
- match in 3 sequences
- Mismatch in all sequences
- indel in 2 sequences (2x)
22.09.24 Bioinformatics Algorithms 71
Multiple alignment algorithm
• Needleman-Wunsch algorithm can be extend to arbitrarily many
sequences
• Two modifications:
1. Calculate highest score over all possible “directions”
2. Store path-pointers in a k-dimensional array instead of a matrix
• Idea of algorithm remains the same
• Time and space complexity increases with number of sequences
22.09.24 Bioinformatics Algorithms 72
Complexity
Unforutnately, time-complexity for k sequences is O(nk)
Space-complexity is also O(nk)
Multiple heuristics have been proposed
• Compute all possible pairwise alignments and find best multiple alignment
Sometimes difficult because of incompatible aligmnents
still demanding: (k 2) possible pairwise alginments
• Align sequences iterativels: start with strong pairwise alignment and join one
sequence after another
• Iterative procedures work well if sequences are not too different from each
other
22.09.24 Bioinformatics Algorithms 73
Sequence alignment
1. Global alignment assumes that the two proteins are basically
similar over the entire length of one another. The alignment
attempts to match them to each other from end to end.
AAAGCGGAAGTCACAG
||.||.||||| |.||
AAGGCTGAAGT-ATAG
1. Local alignment searches for segments of the two sequences
that match well. There is no attempt to force entire sequences
into an alignment, just those parts that appear to have good
similarity, according to some criterion.
AAAGCGGAAGTCACAG
AAAGCGGAAGTCACAG AAGGCTGAAGTATAG
......||||| ....
AAGGCTGAAGT-ATAG
22.09.24 Bioinformatics Algorithms 74
Local alignments
• In many biological applications, the score of an alignment between
two substrings of v and w might actually be larger than the score of
an alignment between the entireties of v and w.
• For example, homeobox genes, which regulate embryonic
development, are present in a large variety of species. Although
homeobox genes are very different in different species, one region in
each gene—called the homeodomain— is highly conserved.
• How to find this conserved area and ignore the areas that show little
similarity?
22.09.24 Bioinformatics Algorithms 75
Local alignments
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
--|--||-|--||--|-|-|-|||----||-|--|-|--|-||||---|
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
versus
tccCAGTTATGTCAGgggacacgagcatgcagagac
||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
How can we represent such a local alignment on a grid?
22.09.24 Bioinformatics Algorithms 76
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-G
| || | || | | | ||| || | | | | ||||
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT-
Local alignments tccCAGTTATGTCAGgggacacgagcatgcaga
||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc
A A T T G C CG C C G T CG T T T T C A G C A G T T A T G T C A G A T C
T
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC C
C
C
--|--||-|--||--|-|-|-|||----||-|--|-|--|-||||---| A
G
T
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C T
A Global Local
T
G
T
C
A
versus G
G
G
G
A
C
tccCAGTTATGTCAGgggacacgagcatgcagagac A
C
G
|||||||||||| A
G
C
A
aattgccgccgtcgttttcagCAGTTATGTCAGatc T
G
C
A
G
A
G
A
C
22.09.24 Bioinformatics Algorithms 77
Figure 6.16 (a) Global and (b) local alignments of two hypothetical genes th
have a conserved domain. The local alignment has a much worse score accor
Local alignment
• Smith-Waterman algorithm (1981)
6.8 Local Sequence Alignment
• Introduces “free jumps” to arbitrary cells into Needleman-Wunsch
algorithm
• This can be represented by edges of weight 0
(here shown with dashed lines) from the
source (0, 0) to every other cell
on the grid.
Figure 6.17 The Smith-Waterman local alignment algorithm introdu
weight 0 (here shown with dashed lines) from the source vertex (0, 0) t
22.09.24 Bioinformatics Algorithms 78
vertex in the edit graph.
Smith-Waterman algorithm
w(i,j,x) .... Score to come to (i,j) via x (= match, mismatch, indel)
S .... nxm Matrix with score of highest scoring path
D ... nxm Matrix with direction of highest scoring path
S(0,0) = 0
for (i in 1:n) S(i,0) = 0
for(j in 1:m) S(0,j) = 0
for(i in 1:n)
for(j in 1:m)
S(i,j) = max(w(i,j,x),0)
D(i,j) = path.to(i,j)
22.09.24 Bioinformatics Algorithms 79
Smith-Waterman
How to find the best local alignment?
1. Find largest entry in matrix
22.09.24 Bioinformatics Algorithms 80
Smith-Waterman
How to find the best local alignment?
1. Find largest entry in matrix
2. Backtrack until cell with entry 0
22.09.24 Bioinformatics Algorithms 81
Smith-Waterman
How to find the best local alignment?
1. Find largest entry in matrix
2. Backtrack until cell with entry 0
It is also possible to find multiple
local alignments
22.09.24 Bioinformatics Algorithms 82
Mutiple
non-overlapping
local alignments:
22.09.24 Bioinformatics Algorithms 83
Mapping and re-sequencing
Alignemnt is also called mapping if you align (map) sequences against an
existing genome
This is an essential step in re-sequencing
If we have a reference genome of an organism, we can align new
sequences against this reference and find sites with genetic differences
between indivudals
Because genomes are quite big, mapping can be quite tedious
22.09.24 Bioinformatics Algorithms 84
NGS data
(next generation sequencing)
NGS generally produces shorter reads than Sanger sequencing (200 vs
1000 bp), which poses some challenges:
• The reads don’t come with position information
• Reference sequence can be quite long (3 billion bp for humans)
• Alignment is not perfect: we need to allow for small variations
• Reads have sequencing-errors
• Short reads are more likely to fit into several positions
• We need to map millions of reads
22.09.24 Bioinformatics Algorithms 85
Some tools
• BLAST (Basic Local Alignment Search Tool)
one of the most commonly used tools in BioInformatics
aligns sequences against existing databases
BLAST can be used for identifying species, locating domains, establishing
phylogeny, DNA mapping, and comparison
• Bowtie
Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short
DNA sequences (reads) to the human genome at a rate of over 25 million
35-bp reads per hour. Bowtie indexes the genome with a Burrows-
Wheeler index to keep its memory footprint small: typically about 2.2 GB
for the human genome (2.9 GB for paired-end).
• TopHat
Similar to BowTie but for Transcriptome data
22.09.24 Bioinformatics Algorithms 86
Excercises
• Take the NW-algorithm implementation and try to see how the
outcome changes when you change parameters
• Change the scoring function:
1. smaller penalty for longer indels
2. Include a scoring matrix (different penalities depending on the type of
substitution, e.g., C -> T is more costly than G -> T, etc.)
• Adapt NW algorithm for local alignments
22.09.24 Bioinformatics Algorithms 87
References
• Needleman and Wunsch - A general method applicable to the search
for similarities in the amino acid sequence of two proteins (1970)
• Smith and Waterman - Identification of common molecular
subsequences (1981)
• Chapter 6 in “An Introduction to Bioinformatics Algorithms” – Jones
and Pevzner
22.09.24 Bioinformatics Algorithms 88
Exhaustive search and search trees
22.09.24 Bioinformatics Algorithms 89
Exhaustive search algorithms
• NW and SW are algorithms where we can find solution not by
exhaustive search, but by smart algorithms (dynamic programming)
• Exhaustive search/brute force usually not good but first step towards
better algorithm design
• Sometimes exhaustive search is only option
• Can we find smart ways to ignore subsets of (wrong) solutions?
22.09.24 Bioinformatics Algorithms 90
Restriction sites
Hamilton Smith discovered in 1970 that the restriction enzyme HindII
cleaves DNA molecules at every occurrence, or site, of the sequences
GTGCAC or GTTAAC, breaking a long molecule into a set of restriction
fragments.
Shortly thereafter, maps of restriction sites in DNA molecules, or
restriction maps, became powerful research tools in molecular biology by
helping to narrow the location of certain genetic markers.
• Restiriciton sites are locations on a chromosome that are recognized
by restriction enzymes
• a particular restriction enzyme may cut the sequence between two
nucleotides within its recognition site, or somewhere nearby.
• E.g., the common restriction enzyme EcoRI recognizes the palindromic
sequence GAATTC and cuts between the G and the A on both the top
and bottom strands
• This can then be used to ligate in complementary pieces of DNA
22.09.24 Bioinformatics Algorithms 91
Restriction map
• A restriction map is a map of known
restriction sites within a sequence of
DNA.
• Restriction mapping requires the use of
restriction enzymes.
• In molecular biology, restriction maps are
used as a reference to engineer plasmids
or other relatively short pieces of DNA,
and sometimes for longer genomic DNA.
22.09.24 Bioinformatics Algorithms http://dx.doi.org/10.1590/S0074-02762007005000121
92
Restriction mapping
• If the genomic DNA sequence of an organism is known, then
construction of a restriction map for, say, HindII amounts to finding all
occurrences of GTGCAC and GTTAAC in the genome.
• However, not all restriction sites are known and
• sequencing is not always the best option if one wants to find a
restriction map
• Also: 25 years ago sequencing was not an option!
• How can we build restriction maps for genomes without prior
knowledge of the genomes’ DNA sequence?
22.09.24 Bioinformatics Algorithms 93
Electrophoresis
1. Aliquots of purified plasmid DNA for each digest
2. Digestion with enzyme
3. Samples are run on an electrophoresis gel
4. We can determine the lengths of DNA fragments from
band-pattern
Electrophoresis gives us number and
length of DNA fragments. How can we
infer restriciton map from this
information?
http://bio1151.nicerweb.com/Locked/media/ch20/electrophoresis.html
22.09.24 Bioinformatics Algorithms 94
Some notation
A multiset is a set that allows duplicate elements: e.g., {1,2,2,3,4,4,6}
Let X = {x1 = 0,x2, .... , xn} a set of n elements in increasing order
We denote by ΔX the multiset of all pairwise distances between
elements of X:
ΔX = {xj − xi : 1 ≤ i < j ≤ n} .
22.09.24 Bioinformatics Algorithms 95
Example
For example, if X={0, 2, 4, 7, 10}, then ΔX={2, 2, 3, 3, 4, 5, 6, 7, 8, 10}
22.09.24 Bioinformatics Algorithms 96
The problem
Given all pairwise differences between points on a
line, reconstruct the position of all those points.
!
Input: The multiset of pairwise distances L, containing "
integers.
Output: A set X of n integers such that ΔX = L.
22.09.24 Bioinformatics Algorithms 97
Restriction mapping: brute force
!
Input: The multiset of pairwise distances L, containing "
integers.
Output: A set X of n integers such that ΔX = L.
M = max(L)
for (every set of integers 0 < x2 < .. < xn-1 < M)
X = {0, x2, ...., xn-1,M}
ΔX = get.pairwise.distances(X)
if (ΔX == L)
return(X)
return(«no solution»)
22.09.24 Bioinformatics Algorithms 98
Complexity
• The algorithm is slow since it examines #$% !
different sets of
positions
• This requires about O(M(n−2)) time
• This makes the algorithm unpractical for most applications
• One could improve the algorithm by choosing the elements of the
sets more wisely (e.g., choosing n-2 distinct elements from L rather
than n-2 arbitrary integers)
• This reduces time-complexity to O(n2n-4)
(consider L = {2,998,1000}, n = 3, M = 1000)
22.09.24 Bioinformatics Algorithms 99
A practial algorithm (Skiena, 1990)
Idea:
1. Find largest distance in L (determines the two outermost points of X)
2. Find second-largest distance δ
3. two options to place δ:
left or right 0 δ(?) δ(?) max(X)
4. Pick left and add δ to preliminiary set
5. calculated all pairwise distances 0 δ(?) δ(?) xn-1 max(X)
6. check if they are contained in L
7. If yes: remove δ from L and go to 2.
8. If not: go back and pick other direction 0 x2 δ(?) δ(?) xn-1 max(X)
9. If L is empty, we have found a valid solution
22.09.24 Bioinformatics Algorithms 100
Example
L = {2,2,3,3,4,5,6,7,8,10} x3 = 3 is not possible because x3 – x2 = 1 is not in L
! Therfore x4 must be 7
L has 10 elements and therefore n = 5 ( "
= 10)
10 is the largest distance in L, so x1 = 0 and x5 = 10 X = {0,2,7,10} and L = {2,3,4,6}
X = {0,10} and L ={2,2,3,3,4,5,6,7,8} Largest remaining distance is 6
The largest remaining distance is 8. Two choices: x3 = 4 or x3 = 6
Set x2 = 2 and remove distances 8 and 2. (x2 - x1, x5 – x2) x3 = 6 does not work, so x3 must be 4
We get X = {0,2,10} and L = {2,3,3,4,5,6,7} We get X = {0,2,4,7,10} which is a valid solution
The largest remaining distance is 7 -> either x4 = 7 or x3 = 3
22.09.24 Bioinformatics Algorithms 101
Algorithm
get.rest.map = find.next = function(L,X,m)
If L is empty: X is solution, return
function(L,X)
y = max(L)
m = max(L) If Δ(y,X) is subset of L
L = L without m add y to X and remove Δ(y,X) from L
find.next(L,X,m)
X = {0,m} remove y from X and add Δ(y,X) to L
find.next(L,X,m) if Δ(m-y,X) is subset of L
add m-y to X and remove Δ(y,X) from L
find.next(L,X,m)
remove m – y from X and add Δ(m-y,X) to L
22.09.24 Bioinformatics Algorithms 102
Recursive tree
At each step we can go left or right,
which creates a recursive-tree like this:
If we find that going, say,
left doesn’t yield a viable
solution, we can ignore
whole subtree!
……
……
22.09.24 Bioinformatics Algorithms 103
Some notes
After each recursion we undo the modifications to sets X and L for the next recursive call
This algorithm will list all sets X with Δ(X) = L
This algorithm is usually very fast because we do not go deep into the «recursive tree»
However, there exists pathological examples where we explore 2k possible paths (left and right are
always viable paths)
So strictly speaking, this algorithm has exponential time-complexity, but is much faster in most cases
A polynomial-time algorithm was found in 2002 by Nivat et al.
22.09.24 Bioinformatics Algorithms 104
Search trees
As we have seen, trees can be useful structures to “organize”
information.
We can exploit the tree structure to efficiently chose whether a subset
of all possible combinations is “uninteresting”
22.09.24 Bioinformatics Algorithms 105
Example: Binary search tree
• One node is designated the root of the tree.
• Each node contains a key and has (at most) two subtrees.
• Each subtree is itself a binary search tree.
• The left subtree of a node contains only nodes with keys strictly less
than the node's key.
• The right subtree of a node contains only nodes with keys strictly
greater than the node's key.
22.09.24 Bioinformatics Algorithms 106
Example: Binary tree
See R-script: “tree.r” on ilias
22.09.24 Bioinformatics Algorithms 107