2024 Bioinformatics Algorithms Day 3 - 4
2024 Bioinformatics Algorithms Day 3 - 4
Algorithms
Stephan Peischl
Interfaculty Unit for Bioinformatics
Baltzerstrasse 6
CH-3012 Bern
Switzerland
Email: stephan.peischl@unibe.ch
Exhaustive search and search trees
http://bio1151.nicerweb.com/Locked/media/ch20/electrophoresis.html
06.10.24 Bioinformatics Algorithms 7
Some notation
A multiset is a set that allows duplicate elements: e.g., {1,2,2,3,4,4,6}
!
Input: The multiset of pairwise distances L, containing "
integers.
M = max(L)
for (every set of integers 0 < x2 < .. < xn-1 < M)
X = {0, x2, ...., xn-1,M}
ΔX = get.pairwise.distances(X)
if (ΔX == L)
return(X)
return(«no solution»)
……
This algorithm is usually very fast because we do not go deep into the «recursive tree»
However, there exists pathological examples where we explore 2k possible paths (left and right are
always viable paths)
So strictly speaking, this algorithm has exponential time-complexity, but is much faster in most
cases
• Located within the RR are the Transcription Factor Binding Sites (TFBS),
also known as motifs, specific for a given transcription factor
TTCCGG gene
ATCCCG gene
ATGCCG gene
ATGCCC gene
Sequence length n
06.10.24 Bioinformatics Algorithms 33
Example
l=8 DNA
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
t=5
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
n = 69
s s3 = 3 s2 = 21 s1 = 26 s4 = 56 s5 = 60
Some profiles show a very high degree of conseravtion while others don’t.
Some profiles show a very high degree of conseravtion while others don’t.
S = (8,19,3, ….)
Some profiles show a very high degree of conseravtion while others don’t.
S = (8,19,3, ….)
Some profiles show a very high degree of conseravtion while others don’t.
S = (8,19,3, ….)
Some profiles show a very high degree of conseravtion while others don’t.
S = (8,19,3, ….)
Let P(s) define the profile matrix for starting position vector s.
We denote the largest count in column j by MP(s) (j).
DespitedHthe dH fact
(w, s) =s)that
(w, = mintheminMedian dString
dH (v, problem
(v,=s)lt=
Hs) −ltScore(s,isDN
− Score(s,a minimization
A) A)
DN
all choices of vof v
all choices
problem and the Motif Finding problem is a maximization problem,
the
Since ttwo
and
Since problems
l arel are
t and are computationally
constants, the the
constants, smallest value
smallest ofequivalent.
value dof
H dcan alsoalso
H can be obtained by by
be obtained
maximizing Score(s,
maximizing DNDN
Score(s, A) over all choices
A) over of s:of s:
all choices
The problem on the left is the Median String problem while the problem on
the right is the Motif Finding problem.
In otherMedian
words, theProblem
String consensus string for the solution ofString
Consensus the Problem
Motif Find-
ing problem is the median string for the input DN A sample. The median
string for DN A can be used to generate a profile that solves the Motif Find-
ing problem, by searching in each
06.10.24 Bioinformatics t sequences for the substring with
of theAlgorithms 50
Median String/Consensus String
In both the Median String problem and the Motif Finding problem we
have to sift through a large number of alternatives to find the best
one.
1
In computer science, a search tree is a ----
3 10 18 25
11- - 12- - 21- - 22- -
subtrees on the left and less than any 1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222
1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222
1
----
2 17
1- - - 2- - -
3 10 18 25
11- - 12- - 21- - 22- -
4 7 11 14 19 22 26 29
111- 112- 121- 122- 211- 212- 221- 222-
5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222
---
A- - T- - G- - C- -
AA- AT- AG- AC- TA- TT- TG- TC- GA- GT- GG- GC- CA- CT- CG- CC-
ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC
06.10.24 Bioinformatics Algorithms 58
Search trees
4.7 Search Trees 105
mers
5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
• And so on… 1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222
P REORDER (v)
1 output v
2 if v has children
3 P REORDER ( left child of v )
06.10.24 Bioinformatics Algorithms 4 P REORDER ( right child of v ) 59
Search trees
• To represent all possible starting positions for the Motif Finding
problem, we can construct the tree with L = t levels and k = n − l + 1
children per vertex.
• For the Median String problem, L = l and k = 4.
• We are only interested in the leaves, and the internal nodes of the
tree are somewhat meaningless for our problem
• Internal nodes do not represent complete vectors of starting points or
complete L-mers
• The tree structure allows us, however, to efficiently scan all leaves!
1
----
2 17
1- - - 2- - -
3 10 18 25
11- - 12- - 21- - 22- -
4 7 11 14 19 22 26 29
111- 112- 121- 122- 211- 212- 221- 222-
5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222
P REORDER (v)
06.10.24 1 output v Bioinformatics Algorithms 61
2 if v has children
Traversing a tree
Before we can write down an algorithm, we need a routine to traverse
a tree such that all leaves are visited: 4.7 Search Trees 105
1
----
traverse = function(node)
2 17
1- - - 2- - -
print(node)
If(node has children) 3
11- - 12- -
10 18
21- - 22- -
25
traverse(left child) 4 7 11 14 19 22 26 29
111- 112- 121- 122- 211- 212- 221- 222-
traverse(right child)
5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222
bestScore ← Score(s,DNA)
bestMotif ← (s1,s2,...,st)
if s = (1,1,...,1)
return bestMotif
2 17
1- - - 2- - -
3 10 18 25
11- - 12- - 21- - 22- -
4 7 11 14 19 22 26 29
111- 112- 121- 122- 211- 212- 221- 222-
5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222
P REORDER (v)
06.10.24 1 output v Bioinformatics Algorithms 65
2 if v has children
Uninteresting subtrees
Each node shows the maximum score for all leaves that are descendants of this node.
If we traverse “left to right” the uninteresting sub-trees are the ones that have a lower
108 maximum score than the highest scoring node that has4been Exhaustive Search
visited.
30
24 16 30
24 20 10 16 3 15 21 30 26
24 15 3 20 4 5 10 6 3 8 16 4 3 1 1 2 1 15 17 21 23 15 27 30 26 18 19
06.10.24 Bioinformatics Algorithms 66
Figure 4.8 A tree that has uninteresting subtrees. The numbers next to a leaf rep-
Branch and bound algorithms
At each vertex we calculate a bound - the most wildly optimistic score
of any leaves in the subtree rooted at that vertex - and then decide
whether or not to consider its children.
1
----
This tree corresponds to 4
sequences of length n and a motif
of length n-1, e.g.: 2 17
1- - - 2- - -
ATA CTA ATG ATC
aTA 5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
cTA 1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222
ATg
ATc
06.10.24 Bioinformatics Algorithms 71
P REORDER (v)
Sketch of motif tree 4.7 Search Trees 105
1
----
This tree corresponds to 4
sequences of length n and a motif
Atofany given
length n-1,position
e.g.: the score of the 2 17
final
ATAalignment
CTA ATGofATC a leave in the 1- - - 2- - -
aTA 5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
cTA 1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222
ATg
ATc
06.10.24 Bioinformatics Algorithms 72
P REORDER (v)
Branch and bound algorithm
(median string)
S = (1,1,...,1) ; bestDistance = ∞
whilei>0
if i < l
prefix ← nucleotide string corresponding to (s1,s2,...,si)
optimisticDistance = TOTALDISTANCE(prefix,D)
if optimisticDistance > bestDistance D … Matrix with sequences
(s, i) ← BYPASS(s, i) s … vector with starting points
else i … number of starting positions
NEXTVERTEX … function that returns next vertex
(s, i) ← NEXTVERTEX(s, i)
(NEXTVERTEX returns (s,0) when at last leaf)
else BYPASS … function that skips subtree
word=nucleotidestringcorrespondingto (s1,s2,...sl) Score … function that returns score of a sequence s
if TOTALDISTANCE(word, D) < bestDistance
bestDistance ← TOTALDISTANCE(word, D)
bestWord ← word
(s, i) ← NEXTVERTEX(s, i)
Return bestWord
06.10.24 Bioinformatics Algorithms 73
Branch and bound algorithm
(median string)
S = (1,1,...,1) ; bestDistance = ∞
whilei>0 Here we decide whether a subtree is
if i < l uninteresting or not. We only consider
prefix ← nucleotide string corresponding to (s1,s2,...,si) the subtree if the current total
optimisticDistance = TOTALDISTANCE(prefix,D) distance is lower than the current best
if optimisticDistance > bestDistance D … Matrix with sequences
distance,
s … vector i.e, ifpoints
with starting all remaining
(s, i) ← BYPASS(s, i)
else i … nucleotides can bepositions
number of starting chosen to yield a
perfect that
NEXTVERTEX … function fit. returns next vertex
(s, i) ← NEXTVERTEX(s, i)
(NEXTVERTEX returns (s,0) when at last leaf)
else BYPASS … function that skips subtree
word=nucleotidestringcorrespondingto (s1,s2,...sl) Score … function that returns score of a sequence s
if TOTALDISTANCE(word, D) < bestDistance
bestDistance ← TOTALDISTANCE(word, D)
bestWord ← word
(s, i) ← NEXTVERTEX(s, i)
Return bestWord
06.10.24 Bioinformatics Algorithms 74
Sketch of median string tree
112 4 Exhaustive Search
---
In this tree the leaves
represent consensus
string of the profile, i.e.,
the string the maximizes A- - T- - G- - C- -
ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC
Figure 4.9 A search tree for the Median String problem. Each branching point can
06.10.24 give rise to only four children,
Bioinformatics as opposed to the n−l+1 children in the Motif Finding
Algorithms 75
Sketch of median string tree
112 4 Exhaustive Search
---
In this tree the leaves
At any given position the distances
represent consensus
in the subtree can only become
string of the profile, i.e.,
larger.
the Thus
string thethe current distance A-
maximizes is -a T- - G- - C- -
lower
the bound
score. forother
Or, in the optimal
distance.
words, the median string
that minimizes the total AA- AT- AG- AC- TA- TT- TG- TC- GA- GT- GG- GC- CA- CT- CG- CC-
Hamming distance.
ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC
Figure 4.9 A search tree for the Median String problem. Each branching point can
06.10.24 give rise to only four children,
Bioinformatics as opposed to the n−l+1 children in the Motif Finding
Algorithms 76
Complexity
• Worst case running time is bad (exponential)
• For most standard situations the algorithm is relatively fast,while still
searching exhaustively through all relevant states
• We used a conservative «bound» criterion
• More aggressive criteria can be used to increase speed of algorithm
(potentially at cost of accuracy)
Each internal
node has two
subtrees.
Each internal
node has two
subtrees.
The left subtree of a The right subtree of a
node contains only nodes node contains only nodes
with keys strictly less with keys strictly greater
than the node's key. than the node's key.
2. Print node
3. Go up one level
4. Print node
Basic:
- Create a binary search tree “by hand” (e.g., using tree.r from the dropbox)
- write a function that finds an element
- compare performance relative to searching an element in a vector
We start by finding the middle element of our list and set it as the root:
Next, we need to choose the roots of the left and the right subtree:
We can do this by finding the middle element of the left and right parts of
the remeinaing list:
"Tree Rotations" by Mtanti at English Wikipedia. Licensed under CC BY-SA 3.0 via Commons -
105
https://commons.wikimedia.org/wiki/File:Tree_Rotations.gif#/media/File:Tree_Rotations.gif
How to rotate a tree