KEMBAR78
2024 Bioinformatics Algorithms Day 3 - 4 | PDF | Restriction Enzyme | Gene
0% found this document useful (0 votes)
31 views106 pages

2024 Bioinformatics Algorithms Day 3 - 4

Uploaded by

ast56erx
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
31 views106 pages

2024 Bioinformatics Algorithms Day 3 - 4

Uploaded by

ast56erx
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 106

Bioinformatics

Algorithms
Stephan Peischl
Interfaculty Unit for Bioinformatics
Baltzerstrasse 6
CH-3012 Bern
Switzerland

Email: stephan.peischl@unibe.ch
Exhaustive search and search trees

06.10.24 Bioinformatics Algorithms 2


Exhaustive search algorithms
• NW and SW are algorithms where we can find solution not by
exhaustive search, but by smart algorithms (dynamic programming)
• Exhaustive search/brute force usually not good but first step towards
better algorithm design
• Sometimes exhaustive search is only option
• Can we find smart ways to ignore subsets of (wrong) solutions?

06.10.24 Bioinformatics Algorithms 3


Restriction sites
Hamilton Smith discovered in 1970 that the restriction enzyme HindII
cleaves DNA molecules at every occurrence, or site, of the sequences
GTGCAC or GTTAAC, breaking a long molecule into a set of restriction
fragments.
Shortly thereafter, maps of restriction sites in DNA molecules, or
restriction maps, became powerful research tools in molecular biology
by helping to narrow the location of certain genetic markers.

• Restiriciton sites are location on a chromosome that are recognized by


restriction enzymes
• a particular restriction enzyme may cut the sequence between two
nucleotides within its recognition site, or somewhere nearby.
• E.g., the common restriction enzyme EcoRI recognizes the
palindromic sequence GAATTC and cuts between the G and the A on
both the top and bottom strands
• This can then be used to ligate in a complementary pieces of DNA

06.10.24 Bioinformatics Algorithms 4


Restriction map

• A restriction map is a map of known


restriction sites within a sequence of
DNA.
• Restriction mapping requires the use of
restriction enzymes.
• In molecular biology, restriction maps
are used as a reference to engineer
plasmids or other relatively short pieces
of DNA, and sometimes for longer
genomic DNA.
06.10.24 Bioinformatics Algorithms http://dx.doi.org/10.1590/S0074-02762007005000121
5
Restriction mapping
• If the genomic DNA sequence of an organism is known, then
construction of a restriction map for, say, HindII amounts to finding
all occurrences of GTGCAC and GTTAAC in the genome.
• However, not all restriction sites are known and
• sequencing is not always the best option if one wants to find a
restriction map
• Also: 25 years ago sequencing was not an option!
• How can we build restriction maps for genomes without prior
knowledge of the genomes’ DNA sequence?

06.10.24 Bioinformatics Algorithms 6


Electrophoresis
1. Aliquots of purified plasmid DNA for each digest
2. Digestion with enzyme
3. Samples are run on an electrophoresis gel
4. We can determine the lengths of DNA fragments from
band-pattern

Electrophoresis gives us number and length


of DNA fragments. How can we infer
restriciton map from this information?

http://bio1151.nicerweb.com/Locked/media/ch20/electrophoresis.html
06.10.24 Bioinformatics Algorithms 7
Some notation
A multiset is a set that allows duplicate elements: e.g., {1,2,2,3,4,4,6}

Let X = {x1 = 0,x2, .... , xn} a set of n elements in increasing order

We denote by ΔX the multiset of all pairwise distance between


elements of X:
ΔX = {xj − xi : 1 ≤ i < j ≤ n} .

06.10.24 Bioinformatics Algorithms 8


Example
For example, if X={0, 2, 4, 7, 10}, then ΔX={2, 2, 3, 3, 4, 5, 6, 7, 8, 10}

06.10.24 Bioinformatics Algorithms 9


The problem

Given all pairwise differences between points on a line,


reconstruct the position of all those points.

!
Input: The multiset of pairwise distances L, containing "
integers.

Output: A set X of n integers such that ΔX = L.

06.10.24 Bioinformatics Algorithms 10


Restriction mapping: brute force
!
Input: The multiset of pairwise distances L, containing "
integers.
Output: A set X of n integers such that ΔX = L.

M = max(L)
for (every set of integers 0 < x2 < .. < xn-1 < M)
X = {0, x2, ...., xn-1,M}
ΔX = get.pairwise.distances(X)
if (ΔX == L)
return(X)
return(«no solution»)

06.10.24 Bioinformatics Algorithms 11


Complexity
• The algorithm is slow since it examines #$"
!$"
different sets of
positions
• This requires about O(M(n−2)) time
• This makes the algorithm unpractical for most applications
• One could improve the algorithm by choosing the elements of the
sets more wisely (e.g., choosing n-2 distinct elements from L rather
than n-2 arbitrary integers)
• This reduces time-complexity to O(n2n-4)
(consider L = {2,998,1000}, n = 3, M = 1000)

06.10.24 Bioinformatics Algorithms 12


A practial algorithm (Skiena, 1990)
Idea:
1. Find largest distance in L (determines the two outermost points of X)
2. Find second-largest distance δ
3. two options to place δ:
left or right 0 δ(?) δ(?) max(X)
4. Pick left and add δ to preliminiary set
5. calculated all pairwise distances 0 δ(?) δ(?) xn-1 max(X)
6. check if they are contained in L
7. If yes: remove δ from L and go to 2.
8. If not: go back and pick other direction 0 x2 δ(?) δ(?) xn-1 max(X)
9. If L is empty, we have found a valid solution

06.10.24 Bioinformatics Algorithms 13


Example
L = {2,2,3,3,4,5,6,7,8,10} The largest remaining distance is 7 -> either x4 = 7 or x3 = 3
x3 = 3 is not possible because x3 – x2 = 1 is not in L
!
L has 10 elements and therefore n = 5 ( "
= 10)
Therfore x4 must be 7
10 is the largest distance in L, so x1 = 0 and x5 = 10
X = {0,2,7,10} and L = {2,3,4,6}
X = {0,10} and L ={2,2,3,3,4,5,6,7,8}
Largest remaining distance is 6
The largest remaining distance is 8.
Two choices: x3 = 4 or x3 = 6
Set x2 = 2 and remove distances 8 and 2. (x2 - x1, x5 – x2)
x3 = 6 does not work, so x3 must be 4
We get X = {0,2,10} and L = {2,3,3,4,5,6,7}
We get X = {0,2,4,7,10} which is a valid solution

06.10.24 Bioinformatics Algorithms 14


Algorithm
get.rest.map = function(L,X) find.next = function(L,X,m)
m = max(L) If L is empty: X is solution, return
y = max(L)
L = L without m
If Δ(y,X) is subset of L
X = {0,m} add y to X and remove Δ(y,X) from L
find.next(L,X,m) find.next(L,X,m)
remove y from X and add Δ(y,X) to L
If Δ(m-y,X) is subset of L
add m-y to X and remove Δ(y,X) from L
find.next(L,X,m)
remove m – y from X and add Δ(m-y,X) to L

06.10.24 Bioinformatics Algorithms 15


Recursive tree
At each step we can go left or right,
which creates a recursive-tree like this:

If we find that going, say,


left doesn’t yield a viable
solution, we can ignore
whole subtree!
……

……

06.10.24 Bioinformatics Algorithms 16


Some notes
After each recursion we undo the modifications to sets X and L for the next recursive call

This algorithm will list all sets X with Δ(X) = L

This algorithm is usually very fast because we do not go deep into the «recursive tree»

However, there exists pathological examples where we explore 2k possible paths (left and right are
always viable paths)

So strictly speaking, this algorithm has exponential time-complexity, but is much faster in most
cases

A polynomial-time algorithm was found in 2002 by Nivat et al.

06.10.24 Bioinformatics Algorithms 17


Search trees
As we have seen, trees can be useful structures to “organize”
information.

We can exploit the tree structure to efficiently chose whether a subset


of all possible combinations is “uninteresting”

06.10.24 Bioinformatics Algorithms 18


Example: Binary search tree
• One node is designated the root of the tree.
• Each node contains a key and has (at most) two subtrees.
• Each subtree is itself a binary search tree.
• The left subtree of a node contains only nodes with keys strictly less
than the node's key.
• The right subtree of a node contains only nodes with keys strictly
greater than the node's key.

06.10.24 Bioinformatics Algorithms 19


Example: Binary tree

See R-script: “tree.r” on ilias:

06.10.24 Bioinformatics Algorithms 20


Motif finding

06.10.24 Bioinformatics Algorithms 21


Transcription factor binding sites
• Every gene contains a regulatory region (RR) upstream of the
transcriptional start site

• Located within the RR are the Transcription Factor Binding Sites (TFBS),
also known as motifs, specific for a given transcription factor

• A TFBS can be located anywhere within the Regulatory Region (RR).

• A single TF can regulate multiple genes if those genes’ RRs contain


corresponding TFBS
• Can find regulated genes via knock out experiments
06.10.24 Bioinformatics Algorithms
Sequence motifs
Sequence motifs are short, recurring patterns in DNA that are
presumed to have a biological function.

Often they indicate sequence-specific binding sites for


proteins such as nucleases and transcription factors (TF).

Others are involved in important processes at the RNA level,


including ribosome binding, mRNA processing (splicing,
editing, polyadenylation) and transcription termination.
Nature Biotechnology 24, 423 - 425 (2006)
doi:10.1038/nbt0406-423
06.10.24 Bioinformatics Algorithms 23
Complications
• We do not know the motif sequence
• May know its length

• We do not know where it is located relative to the genes start

• Motifs can differ slightly from one gene to the next


• Non-essential bases could mutate…

• How to discern functional motifs from random ones?


06.10.24 Bioinformatics Algorithms 24
ATCCCG gene

TTCCGG gene

ATCCCG gene

ATGCCG gene

ATGCCC gene

06.10.24 Bioinformatics Algorithms 25


Random sequences
Can you find a motif?

06.10.24 Bioinformatics Algorithms 26


Random sequences plus motif
How about now?

06.10.24 Bioinformatics Algorithms 27


Random sequences plus motif
There it is!

06.10.24 Bioinformatics Algorithms 28


Random sequences plus motif plus
mutations
Finding a motif is hard, especially if the motif has some variation.

06.10.24 Bioinformatics Algorithms 29


What is a motif?
Before we can tackle the problem, we need a formal way to describe a
motif. Representation by a single string is precise but fails for many
biological relevant applications.

06.10.24 Bioinformatics Algorithms 30


Alignment matrix
Consider a set of t DNA sequences, each n nucleotides long

Select a position in each of the t sequences:


(s1,s2, .... ,st) 1 ≤ si < n – l + 1, where l is the length of the motif

We can form a txl alignment-matrix by setting the (i,j)th element to


be the si + j – 1 th element of the ith sequence

06.10.24 Bioinformatics Algorithms 31


Notation
• t - number of sample DNA sequences
• n - length of each DNA sequence
• DNA - sample of DNA sequences (t x n array)

• l - length of the motif (l mer)


• si - starting position of an l-mer in sequence i
• s=(s1, s2,… st) - array of motif’s starting
positions

06.10.24 Bioinformatics Algorithms 32


Notation

Sequence length n
06.10.24 Bioinformatics Algorithms 33
Example
l=8 DNA
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat

agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc

t=5
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt

agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca

ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

n = 69

s s3 = 3 s2 = 21 s1 = 26 s4 = 56 s5 = 60

06.10.24 Bioinformatics Algorithms 34


Alignment matrix
We can form a txl alignment-matrix by setting the (i,j)th element to be the
si + j – 1 th element of the ith sequence.
e.g.: s = (8,19,3,5,31,27,15)

06.10.24 Bioinformatics Algorithms 35


Profile Matrix
We now calculate the 4xl profile-matrix.

The (i,j)th element holds the number of


times nucleotide i appears in column j,
i = C,G,T,A

This matrix illustrates the variability of the


nucleotide composition at each position for
a particular choice of l-mers.

06.10.24 Bioinformatics Algorithms 36


Consensus String
The consesus string is formed by taking the most common nucleotide
at each site of the l-mer. E.g.: ATGCAACT

06.10.24 Bioinformatics Algorithms 37


Consensus string
We now have a way to represent different alignments of l-mers.

Some profiles show a very high degree of conseravtion while others don’t.

A first, vague representation of the motif finding problem is to find the


starting positions s that correpsonds to the most conserved profile.

06.10.24 Bioinformatics Algorithms 38


Consensus string
We now have a way to represent different alignments of l-mers.

Some profiles show a very high degree of conseravtion while others don’t.

A first, vague representation of the motif finding problem is to find the


starting positions s that correpsonds to the most conserved profile.

S = (8,19,3, ….)

06.10.24 Bioinformatics Algorithms 39


Consensus string
We now have a way to represent different alignments of l-mers.

Some profiles show a very high degree of conseravtion while others don’t.

A first, vague representation of the motif finding problem is to find the


starting positions s that correpsonds to the most conserved profile.

S = (8,19,3, ….)

06.10.24 Bioinformatics Algorithms 40


Consensus string
We now have a way to represent different alignments of l-mers.

Some profiles show a very high degree of conseravtion while others don’t.

A first, vague representation of the motif finding problem is to find the


starting positions s that correpsonds to the most conserved profile.

S = (8,19,3, ….)

06.10.24 Bioinformatics Algorithms 41


Consensus string
We now have a way to represent different alignments of l-mers.

Some profiles show a very high degree of conseravtion while others don’t.

A first, vague representation of the motif finding problem is to find the


starting positions s that correpsonds to the most conserved profile.

S = (8,19,3, ….)

06.10.24 Bioinformatics Algorithms 42


Consensus score

Let P(s) define the profile matrix for starting position vector s.
We denote the largest count in column j by MP(s) (j).

We define the consesus score of


an alignment s by

Score(s) = ΣMP(s) (j)


06.10.24 Bioinformatics Algorithms 43
Example

MP(s)(1) = 5, MP(s)(2) = 5, …, MP(s)(8) = 6

Score (s) = 5+5+6+4+5+5+6+6 = 42

06.10.24 Bioinformatics Algorithms 44


The motif finding problem
Given a set of DNA sequences, find a set of l-mers, one from each
sequence, that maximizes the consensus score.

Input: A t×n matrix D , and l, the length of the pattern to find.

Output: An array of t starting positions s = (s1, s2, . . . , st) maximizing


the score.

06.10.24 Bioinformatics Algorithms 45


Median string problem
Another view onto this problem is to reframe the Motif Finding
problem as the problem of finding a median string.

Median String Problem:


Given a set of DNA sequences, find a median string.
Input: A t × n matrix D, and l, the length of the pattern to find.
Output: A string v of l nucleotides that minimizes the total Hamming
distance between dist(v,D) over all substrings of length l.

06.10.24 Bioinformatics Algorithms 46


Median String
4.6 The Motif Finding Problem
Total Hamming distance:
A T C C A G C T
The bold letters show the
G G G C A A C T
Positions where the elements
A T G G A T C T
in the (sub)matrix match the
A A G C A A C C
string v = ATGCAACT.
T T G G A A C T
A T G C C A T T
The total Hamming A T G G C A C T
Distance is the sum of all
non-bold letters.
Figure 4.4 Calculating the total Hamming distance for the conse
06.10.24 Bioinformatics Algorithms 47
CAACT (the alignment is the same as in figure 4.3). The bold lette
to find.
Median
Output: A String
string v of l nucleotides that minimizes
T otalDistance(v,
Notice that this isDN A) over
a double all strings of that length.
minimization:
We are finding a string v that minimizes dist(v,D), which is in turn the
Noticesmallest
that this is a double
distance minimization:
among all we points
choices of starting s in the a string v tha
are finding
sequences
nimizes in D.
T otalDistance(v, DN A), which is in turn the smallest distance
Letchoices
mong all dH(v,s) denote the Hamming
of starting points sdistance between
in the DNA string v and
sequences. the is, we are
That
strings that start at positions s.
culating
Then, we are calculating
min min dH (v, s).
all choices of all choices of
l-mers v starting positions s
06.10.24 Bioinformatics Algorithms 48
Despite the fact that the Median String problem is a minimization problem
For For
example, in figure
example, 4.4,4.4,
in figure the the
Hamming
Hamming distance between
distance the the
between consensus
consensus
string w and
string eacheach
w and of the seven
of the implanted
seven implantedpatterns is 2,isand
patterns dH (w,
2, and s)=2
dH (w, ·7=
s)=2 ·7=
Median String vs. Consesus String
7 · 87−· 842.
TheThe
− 42.
consensus string minimizes dH (v, of v,ofi.e.,
consensus string minimizes (v,over
dHs) all choices
s) over all choices v, i.e.,

DespitedHthe dH fact
(w, s) =s)that
(w, = mintheminMedian dString
dH (v, problem
(v,=s)lt=
Hs) −ltScore(s,isDN
− Score(s,a minimization
A) A)
DN
all choices of vof v
all choices
problem and the Motif Finding problem is a maximization problem,
the
Since ttwo
and
Since problems
l arel are
t and are computationally
constants, the the
constants, smallest value
smallest ofequivalent.
value dof
H dcan alsoalso
H can be obtained by by
be obtained
maximizing Score(s,
maximizing DNDN
Score(s, A) over all choices
A) over of s:of s:
all choices

minmin minmin dH (v, (v,=s)lt=−lt − maxmax Score(s,


dHs) DNDN
Score(s, A). A).
all choices of sofall
all choices s choices of vof v
all choices all choices of sof s
all choices
TheThe problem on the
problem left left
on the is the Median
is the Median StringString problem
problemwhile the the
while problem on on
problem
the the
right is the
right Motif
is the Finding
Motif Findingproblem.
problem.
In other words,
In other the the
words, consensus
consensus stringstringfor forthe the solution of the
solution Motif
of the Find-
Motif Find-
ing ing
problem
problemis the median
is the median string for for
string the theinput input DNDNA sample.
A sample.TheThe
median
median
string
string DNDN
for for A can be used
A can be usedto generate
to generate a profile
a profile thatthat
solves the the
solves Motif Find-
Motif Find-
ing problem,
ing
06.10.24 problem,by searching
by searching in each
in each t sequences
ofBioinformatics
the
of the t sequences
Algorithms for for
the the
substring withwith
substring 49
For example, in figure 4.4, the Hamming distance between the consensus
string w and each of the seven implanted patterns is 2, and dH (w, s)=2 · 7 =
Median String vs. Consensus String
7 · 8 − 42.
The consensus string minimizes dH (v, s) over all choices of v, i.e.,

Despite the dH fact


(w, s)that
= themin MediandString problem
H (v, s) = is a minimization
lt − Score(s, DN A)
all choices of v
problem and the Motif Finding problem is a maximization problem,
the twot and
Since problems are computationally
l are constants, the smallest valueequivalent.
of dH can also be obtained by
maximizing Score(s, DN A) over all choices of s:

min min dH (v, s) = lt − max Score(s, DN A).


all choices of s all choices of v all choices of s

The problem on the left is the Median String problem while the problem on
the right is the Motif Finding problem.
In otherMedian
words, theProblem
String consensus string for the solution ofString
Consensus the Problem
Motif Find-
ing problem is the median string for the input DN A sample. The median
string for DN A can be used to generate a profile that solves the Motif Find-
ing problem, by searching in each
06.10.24 Bioinformatics t sequences for the substring with
of theAlgorithms 50
Median String/Consensus String
In both the Median String problem and the Motif Finding problem we
have to sift through a large number of alternatives to find the best
one.

In the Motif Finding problem we have to consider all (n − l + 1)t possible


starting positions s.

For the Median String problem we need to consider all 4l possible l-


mers.

06.10.24 Bioinformatics Algorithms 51


4.7 Search Trees 101
Median String/Consensus String
( 1, 1, . . . , 1, 1 )
( 1, 1, . . . , 1, 2 )
In the Motif Finding problem we ( 1, 1, . . . , 1, 3 )
..
have to consider all (n − l + 1)t .
possible starting positions s: (
(
1,
1,
1, . . . ,
1, . . . ,
1, n − l + 1
2, 1
)
)
( 1, 1, . . . , 2, 2 )
( 1, 1, . . . , 2, 3 )
..
.
( 1, 1, . . . , 2, n − l + 1 )
..
.
( n − l + 1, n − l + 1, . . . , n − l + 1, 1 )
( n − l + 1, n − l + 1, . . . , n − l + 1, 2 )
( n − l + 1, n − l + 1, . . . , n − l + 1, 3 )
..
.
( n − l + 1, n − l + 1, . . . , n − l + 1, n − l + 1 )
For the Median String problem we need to consider all 4l possible l-mers:
06.10.24 Bioinformatics Algorithms 52
AA· · · TA
l
For the Median String problem we need to consider 4
AA· · · TT po
all
Median String/Consensus String AA· · · TG
AA· · · TC
AA· · · AA ..
AA· · · AT .
AA· · · AG CC· · · GG
For the Median String problem we need AA· · · AC CC· · · GC
to consider all 4l possible l-mers:
AA· · · TA CC· · · CA
AA· · · TT CC· · · CT
AA· · · TG CC· · · CG
AA· · · TC CC· · · CC
..
06.10.24
.
Bioinformatics Algorithms 53
Algorithms
We could easily implement a brute force algorithm to find the
consensus or median string by simply minimizing or maximizing over
all posible strings or starting positions.

Because this is impractical we will now introduce a powerful


algorithmic concept that allows us to solve the problem much more
quickly:
Search Trees

06.10.24 Bioinformatics Algorithms 54


Search trees
4.7 Search Trees 105

1
In computer science, a search tree is a ----

tree data structure used for locating


2 17
specific values from within a set. 1- - - 2- - -

3 10 18 25
11- - 12- - 21- - 22- -

In order for a tree to function as a


search tree, the key for each node 4
111- 112-
7 11
121- 122-
14 19
211- 212-
22 26
221- 222-
29

must be greater than any keys in 5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31

subtrees on the left and less than any 1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

keys in subtrees on the right.


P REORDER (v)
1 output v
2 if v has children
3 P REORDER ( left child of v )
06.10.24 4
Bioinformatics Algorithms P REORDER ( right child of v ) 55
Search trees
The elements of our search tree are the kL L-mers from a k letter alphabet.

In the consensus string problem:


k = n – l + 1 (the starting positions), L = number of sequences

In the median string problem:


k = 4 (the nucleotides), L = length of motif to find
4 Exhaustive S
Example: All 4-mers in a 2 letter alphabet.

1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

06.10.24 Bioinformatics Algorithms 56


Search trees
We will represent the
4.7 Search TreesL-mers as “leaves” in a tree: 105

1
----

2 17
1- - - 2- - -

3 10 18 25
11- - 12- - 21- - 22- -

4 7 11 14 19 22 26 29
111- 112- 121- 122- 211- 212- 221- 222-

5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

06.10.24 Bioinformatics Algorithms 57


P REORDER (v)
Search trees 4 Exhaustive Search

---

A- - T- - G- - C- -

AA- AT- AG- AC- TA- TT- TG- TC- GA- GT- GG- GC- CA- CT- CG- CC-

ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC
06.10.24 Bioinformatics Algorithms 58
Search trees
4.7 Search Trees 105

• A tree of L-mers has L levels (excluding 1


----
the root)
• Each vertex has k children 2
1- - - 2- - -
17

• The lowest level (leaves) carries the L- 3 10 18 25


mers 11- - 12- - 21- - 22- -

• The next level up represents the (L-1)- 4


111- 112-
7 11
121- 122-
14 19
211- 212-
22 26
221- 222-
29

mers
5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31

• And so on… 1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

P REORDER (v)
1 output v
2 if v has children
3 P REORDER ( left child of v )
06.10.24 Bioinformatics Algorithms 4 P REORDER ( right child of v ) 59
Search trees
• To represent all possible starting positions for the Motif Finding
problem, we can construct the tree with L = t levels and k = n − l + 1
children per vertex.
• For the Median String problem, L = l and k = 4.
• We are only interested in the leaves, and the internal nodes of the
tree are somewhat meaningless for our problem
• Internal nodes do not represent complete vectors of starting points or
complete L-mers
• The tree structure allows us, however, to efficiently scan all leaves!

06.10.24 Bioinformatics Algorithms 60


Search trees
We traverse the tree from vertex to vertex to get to105all leaves
4.7 Search Trees

1
----

2 17
1- - - 2- - -

3 10 18 25
11- - 12- - 21- - 22- -

4 7 11 14 19 22 26 29
111- 112- 121- 122- 211- 212- 221- 222-

5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

P REORDER (v)
06.10.24 1 output v Bioinformatics Algorithms 61
2 if v has children
Traversing a tree
Before we can write down an algorithm, we need a routine to traverse
a tree such that all leaves are visited: 4.7 Search Trees 105

1
----

traverse = function(node)
2 17
1- - - 2- - -

print(node)
If(node has children) 3
11- - 12- -
10 18
21- - 22- -
25

traverse(left child) 4 7 11 14 19 22 26 29
111- 112- 121- 122- 211- 212- 221- 222-

traverse(right child)
5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

06.10.24 Bioinformatics Algorithms 62


P REORDER (v)
Brute Force algorithm (motif search)
s = (1,1,...,1)
bestScore ← Score(s,D)
D … Matrix with sequences
while forever s … vector with starting points
s ← NEXTLEAF(s) NEXTLEAF … function that returns next leaf
(NEXTLEAF returns s = (1,1,1, …) when at last leaf)
if Score(s,D) > bestScore Score … function that returns score of a sequence s

bestScore ← Score(s,DNA)
bestMotif ← (s1,s2,...,st)
if s = (1,1,...,1)
return bestMotif

06.10.24 Bioinformatics Algorithms 63


Search trees
The tree representation to help reduce work relative to a brute force
search algorithm.

We ignore any children (or grandchildren, great-grandchildren, and so


on) of a vertex if we can show that they are all uninteresting.

If none of the descendents of a vertex could possibly have a better


score than the best leaf that has already been explored, then there
really is no point descending into the children of that vertex.

06.10.24 Bioinformatics Algorithms 64


Search trees
When we decide that a whole subtree is “uninteresting”,
4.7 Search Trees 105 we skip it:
1
----

2 17
1- - - 2- - -

3 10 18 25
11- - 12- - 21- - 22- -

4 7 11 14 19 22 26 29
111- 112- 121- 122- 211- 212- 221- 222-

5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

P REORDER (v)
06.10.24 1 output v Bioinformatics Algorithms 65
2 if v has children
Uninteresting subtrees
Each node shows the maximum score for all leaves that are descendants of this node.
If we traverse “left to right” the uninteresting sub-trees are the ones that have a lower
108 maximum score than the highest scoring node that has4been Exhaustive Search
visited.
30

24 16 30

24 20 10 16 3 15 21 30 26

24 15 3 20 4 5 10 6 3 8 16 4 3 1 1 2 1 15 17 21 23 15 27 30 26 18 19
06.10.24 Bioinformatics Algorithms 66
Figure 4.8 A tree that has uninteresting subtrees. The numbers next to a leaf rep-
Branch and bound algorithms
At each vertex we calculate a bound - the most wildly optimistic score
of any leaves in the subtree rooted at that vertex - and then decide
whether or not to consider its children.

In fact, the strategy is named branch-and-bound for exactly this


reason: at each point we calculate a bound and then decide whether or
not to branch out further

06.10.24 Bioinformatics Algorithms 67


Another brute force version
s = (1,...,1)
bestScore = 0
i=1 D … Matrix with sequences
s … vector with starting points
While i>0
i … number of starting positions
if i < t NEXTVERTEX … function that returns next vertex
(s,i) ← NEXTVERTEX(s,i) (NEXTVERTEX returns (s,0) when at last leaf)
else Score … function that returns score of a sequence s
if Score(s,D) > bestScore
bestScore ← Score(s,D)
bestMotif ← s
(s, i) ← NEXTVERTEX(s, i)
return bestMotif

06.10.24 Bioinformatics Algorithms 68


Branch and bound algorithm (motif)
While i>0
if i < t
optimisticScore ← Score(s, i, DNA) + (t − i) l
if optimisticScore < bestScore
D … Matrix with sequences
(s, i) ← BYPASS(s, i,) s … vector with starting points
else i … number of starting positions
(s, i) ← NEXTVERTEX(s, i) NEXTVERTEX … function that returns next vertex
(NEXTVERTEX returns (s,0) when at last leaf)
else BYPASS … function that skips subtree
if Score(s,D) > bestScore
Score … function that returns score of a sequence s
bestScore ← Score(s)
bestMotif = s
(s, i) ← NEXTVERTEX(s, i,)
return bestMotif

06.10.24 Bioinformatics Algorithms 69


Branch and bound algorithm
(motif finding)
Here we decide whether a subtree is
While i>0 interesting or not. We only consider
if i < t the subtree if it improves the score in
optimisticScore ← Score(s, i, DNA) + (t − i) l the best case, i.e, if all remaining
if optimisticScore < bestScore starting positions can be chosen to
D … Matrix with sequences
(s, i) ← BYPASS(s, i,) yield a perfect fit.
s … vector with starting points
else i … number of starting positions
(s, i) ← NEXTVERTEX(s, i) NEXTVERTEX … function that returns next vertex
(NEXTVERTEX returns (s,0) when at last leaf)
else BYPASS … function that skips subtree
if Score(s,D) > bestScore
Score … function that returns score of a sequence s
bestScore ← Score(s)
bestMotif = s
(s, i) ← NEXTVERTEX(s, i,)
return bestMotif

06.10.24 Bioinformatics Algorithms 70


Sketch of motif tree 4.7 Search Trees 105

1
----
This tree corresponds to 4
sequences of length n and a motif
of length n-1, e.g.: 2 17
1- - - 2- - -
ATA CTA ATG ATC

The entries at each node 3 10 18 25


correspond to the starting positions 11- - 12- - 21- - 22- -

e.g.: (2,2,1,1) corresponds to the 4 7 11 14 19 22 26 29


alignment 111- 112- 121- 122- 211- 212- 221- 222-

aTA 5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
cTA 1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

ATg
ATc
06.10.24 Bioinformatics Algorithms 71
P REORDER (v)
Sketch of motif tree 4.7 Search Trees 105

1
----
This tree corresponds to 4
sequences of length n and a motif
Atofany given
length n-1,position
e.g.: the score of the 2 17
final
ATAalignment
CTA ATGofATC a leave in the 1- - - 2- - -

decending subtree cannot be higher


Thethe
than entries at each
current node plus the
score 3 10 18 25
correspond to the starting positions 11- - 12- - 21- - 22- -
maximum score for the remaining
positions, i.e.,corresponds
e.g.: (2,2,1,1) if all positions
to theyield to4a 7 11 14 19 22 26 29
perfect alignment.
alignment 111- 112- 121- 122- 211- 212- 221- 222-

aTA 5 6 8 9 12 13 15 16 20 21 23 24 27 28 30 31
cTA 1111 1112 1121 1122 1211 1212 1221 1222 2111 2112 2121 2122 2211 2212 2221 2222

ATg
ATc
06.10.24 Bioinformatics Algorithms 72
P REORDER (v)
Branch and bound algorithm
(median string)
S = (1,1,...,1) ; bestDistance = ∞
whilei>0
if i < l
prefix ← nucleotide string corresponding to (s1,s2,...,si)
optimisticDistance = TOTALDISTANCE(prefix,D)
if optimisticDistance > bestDistance D … Matrix with sequences
(s, i) ← BYPASS(s, i) s … vector with starting points
else i … number of starting positions
NEXTVERTEX … function that returns next vertex
(s, i) ← NEXTVERTEX(s, i)
(NEXTVERTEX returns (s,0) when at last leaf)
else BYPASS … function that skips subtree
word=nucleotidestringcorrespondingto (s1,s2,...sl) Score … function that returns score of a sequence s
if TOTALDISTANCE(word, D) < bestDistance
bestDistance ← TOTALDISTANCE(word, D)
bestWord ← word
(s, i) ← NEXTVERTEX(s, i)
Return bestWord
06.10.24 Bioinformatics Algorithms 73
Branch and bound algorithm
(median string)
S = (1,1,...,1) ; bestDistance = ∞
whilei>0 Here we decide whether a subtree is
if i < l uninteresting or not. We only consider
prefix ← nucleotide string corresponding to (s1,s2,...,si) the subtree if the current total
optimisticDistance = TOTALDISTANCE(prefix,D) distance is lower than the current best
if optimisticDistance > bestDistance D … Matrix with sequences
distance,
s … vector i.e, ifpoints
with starting all remaining
(s, i) ← BYPASS(s, i)
else i … nucleotides can bepositions
number of starting chosen to yield a
perfect that
NEXTVERTEX … function fit. returns next vertex
(s, i) ← NEXTVERTEX(s, i)
(NEXTVERTEX returns (s,0) when at last leaf)
else BYPASS … function that skips subtree
word=nucleotidestringcorrespondingto (s1,s2,...sl) Score … function that returns score of a sequence s
if TOTALDISTANCE(word, D) < bestDistance
bestDistance ← TOTALDISTANCE(word, D)
bestWord ← word
(s, i) ← NEXTVERTEX(s, i)
Return bestWord
06.10.24 Bioinformatics Algorithms 74
Sketch of median string tree
112 4 Exhaustive Search

---
In this tree the leaves
represent consensus
string of the profile, i.e.,
the string the maximizes A- - T- - G- - C- -

the score. Or, in other


words, the median string
that minimizes the total AA- AT- AG- AC- TA- TT- TG- TC- GA- GT- GG- GC- CA- CT- CG- CC-
Hamming distance.

ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC

Figure 4.9 A search tree for the Median String problem. Each branching point can
06.10.24 give rise to only four children,
Bioinformatics as opposed to the n−l+1 children in the Motif Finding
Algorithms 75
Sketch of median string tree
112 4 Exhaustive Search

---
In this tree the leaves
At any given position the distances
represent consensus
in the subtree can only become
string of the profile, i.e.,
larger.
the Thus
string thethe current distance A-
maximizes is -a T- - G- - C- -
lower
the bound
score. forother
Or, in the optimal
distance.
words, the median string
that minimizes the total AA- AT- AG- AC- TA- TT- TG- TC- GA- GT- GG- GC- CA- CT- CG- CC-
Hamming distance.

ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC ATGC

Figure 4.9 A search tree for the Median String problem. Each branching point can
06.10.24 give rise to only four children,
Bioinformatics as opposed to the n−l+1 children in the Motif Finding
Algorithms 76
Complexity
• Worst case running time is bad (exponential)
• For most standard situations the algorithm is relatively fast,while still
searching exhaustively through all relevant states
• We used a conservative «bound» criterion
• More aggressive criteria can be used to increase speed of algorithm
(potentially at cost of accuracy)

06.10.24 Bioinformatics Algorithms 77


Some more on search trees
• Search trees are used for many applications
• The advantage of search trees is their efficient search time given the
tree is reasonably balanced
• The leaves at either end should be of comparable depths
• Various search-tree data structures exist, several of which also allow
efficient insertion and deletion of elements
• Operations should always maintain tree balance
• It is costly to build the tree, but it can then be used for efficient
searching

06.10.24 Bioinformatics Algorithms 78


References
• Chapter 4 in “An Introduction to Bioinformatics Algorithms” – Jones
and Pevzner
• Skiena, Steven S., Warren D. Smith, and Paul Lemke.
"Reconstructing sets from interpoint distances." Proceedings of the
sixth annual symposium on Computational geometry. ACM, 1990.

06.10.24 Bioinformatics Algorithms 79


http://www.stoimen.com/blog/2012/06/22/computer-
algorithms-binary-search-tree-data-structure/

06.10.24 Bioinformatics Algorithms 80


Binary search tree
• One node is designated the root of the tree.
• Each internal node contains a key and has two subtrees.
• Leaves are commonly represented by a special leaf or nil symbol, a
NULL pointer, etc.
• Each subtree is itself a binary search tree.
• The left subtree of a node contains only nodes with keys strictly less
than the node's key.
• The right subtree of a node contains only nodes with keys strictly
greater than the node's key.

06.10.24 Bioinformatics Algorithms 81


Example: Binary search tree

06.10.24 Bioinformatics Algorithms 82


Example: Binary search tree
One node is designated
the root of the tree.

06.10.24 Bioinformatics Algorithms 83


Example: Binary search tree
One node is designated
the root of the tree.
Each subtree is itself
a binary search tree.

06.10.24 Bioinformatics Algorithms 84


Example: Binary search tree
One node is designated
the root of the tree.
Each subtree is itself
a binary search tree.

Each internal
node has two
subtrees.

06.10.24 Bioinformatics Algorithms 85


Example: Binary search tree
One node is designated
the root of the tree.
Each subtree is itself
a binary search tree.

Each internal
node has two
subtrees.
The left subtree of a The right subtree of a
node contains only nodes node contains only nodes
with keys strictly less with keys strictly greater
than the node's key. than the node's key.

06.10.24 Bioinformatics Algorithms 86


Example: Binary search tree
• Binary Search Trees are useful for finding elements in a library:
At each step we ask if the element has a key larger or smaller than
the key of the node. In that we half the number of possible elements
at each step until we find the desired element.
• A binary search tree can also be used to implement a simple but
efficient sorting algorithm:
We insert all the values we wish to sort into a new ordered data
structure—in this case a binary search tree—and then traverse it in
order
• Worst case complexity is O(n log(n)) – if the tree is balanced!

06.10.24 Bioinformatics Algorithms 87


Example: Finding an element
First, construct a binary search tree. This can be done in
O(n) time.
1. Start at root.

2. If key of current node is larger than key of desired


element, go left, else go right.

3. Continue until right element is found.

Suppose we want to find the element with key 7.

8 is larger than 7 -> go left


3 is smaller than 7 -> go right
6 is smaller than 7 -> go right

06.10.24 Bioinformatics Algorithms 88


Example: Sorting
First, construct a binary search tree. This can be done in
O(n) time.

1. Go left until you hit a terminal node

2. Print node

3. Go up one level

4. Print node

5. Go right. If node has decendants, continue at 1.

6. If you are at a right hand terminal node, bo back up


until you are at the deepest node such that all nodes
in the subtree have been visited. Continue at 3.

7. Stop if all leaves have been visited


06.10.24 Bioinformatics Algorithms 89
Exercise
Implement a binary search tree

Basic:
- Create a binary search tree “by hand” (e.g., using tree.r from the dropbox)
- write a function that finds an element
- compare performance relative to searching an element in a vector

Advanced extra function:


- write a function that adds nodes to an existing tree at the right position
automatically

06.10.24 Bioinformatics Algorithms 90


Linked lists and arrays

• Easy to implement and maintain

• Linked list is good at adding


and deleteing elements

• Arrays are usually better at accessing


elements directly

• Finding a particular element is


inefficient in both cases

06.10.24 Bioinformatics Algorithms 91


Trees
• A tree is simialr to a linked list
• A tree has a unique root
• Each internal node has
children and parents
• The root doesn’t have parents
• Leaves have no children

06.10.24 Bioinformatics Algorithms 92


06.10.24 Bioinformatics Algorithms 93
If we’re looking at the root of
the tree we can assume there
are two sub-trees – one left
and one right.

If we isolate only one of these


sub-trees we can again think of
it as a tree and assume that it
has one left and one right sub-
trees.

06.10.24 Bioinformatics Algorithms 94


Balanced trees
• We say that a tree is
ablanced if the depth of all
subtrees is comparable
• The depth of a node is the
number of edges from the
node to the tree's root node
• The depth of a subtree is the
maximum of the depth of all
nodes in subtree
• A balanced tree is essential
for efficient searching!
06.10.24 Bioinformatics Algorithms 95
Creating a tree from a list
Example: [1,2,3,4,5,6]

We need to choose a root


such that the left and right subtrees of
the root are balanced.

How can we do that?

06.10.24 Bioinformatics Algorithms 96


Creating a tree from a list
Example: [1,2,3,4,5,6]

We start by finding the middle element of our list and set it as the root:

06.10.24 Bioinformatics Algorithms 97


Creating a tree from a list
Example: [1,2,3,4,5,6]

Next, we need to choose the roots of the left and the right subtree:

06.10.24 Bioinformatics Algorithms 98


Creating a tree from a list
Example: [1,2,3,4,5,6]

We can do this by finding the middle element of the left and right parts of
the remeinaing list:

06.10.24 Bioinformatics Algorithms 99


Creating a tree from a list
Example: [1,2,3,4,5,6]

Going on like this will yield a balanced tree:

06.10.24 Bioinformatics Algorithms 100


Inserting and deleting elements

• Inserting is fairly easy, just find the


right location and insert the element:

• Deleting is even easier, just find the


right the element and set it to NULL:

06.10.24 Bioinformatics Algorithms 101


Balancing a tree
• Inserting and deleting will unbalance the tree

06.10.24 Bioinformatics Algorithms 102


Balancing a tree
• Inserting and deleting will unbalance the tree

06.10.24 Bioinformatics Algorithms 103


Balancing a tree
• Inserting and deleting will unbalance the tree
• We can restore balance by «rotating» a tree
• A left (right) rotation will decrease the depth of
the left (right) and will increase the depth of the
right (left) subtree:

06.10.24 Bioinformatics Algorithms 104


How to rotate a tree

"Tree Rotations" by Mtanti at English Wikipedia. Licensed under CC BY-SA 3.0 via Commons -
105
https://commons.wikimedia.org/wiki/File:Tree_Rotations.gif#/media/File:Tree_Rotations.gif
How to rotate a tree

06.10.24 Bioinformatics Algorithms 106

You might also like