2024 Bioinformatics Algorithms Day 3 - 4


Bioinformatics Algorithms
Stephan Peischl
Interfaculty Unit for Bioinformatics
Baltzerstrasse 6
CH-3012 Bern
Switzerland

Email: stephan.peischl@unibe.ch

Exhaustive search and search trees

06.10.24 Bioinformatics Algorithms 2

Exhaustive search algorithms
• Needleman-Wunsch (NW) and Smith-Waterman (SW) find the optimal
solution not by exhaustive search, but by a smarter algorithm
(dynamic programming)
• Exhaustive search (brute force) is usually inefficient, but it is a
useful first step towards better algorithm design
• Sometimes exhaustive search is the only option
• Can we find smart ways to ignore subsets of (wrong) solutions?

Restriction sites
Hamilton Smith discovered in 1970 that the restriction enzyme HindII
cleaves DNA molecules at every occurrence, or site, of the sequences
GTGCAC or GTTAAC, breaking a long molecule into a set of restriction
fragments.
Shortly thereafter, maps of restriction sites in DNA molecules, or
restriction maps, became powerful research tools in molecular biology
by helping to narrow the location of certain genetic markers.

• Restriction sites are locations on a chromosome that are recognized by
restriction enzymes
• A particular restriction enzyme may cut the sequence between two
nucleotides within its recognition site, or somewhere nearby.
• E.g., the common restriction enzyme EcoRI recognizes the
palindromic sequence GAATTC and cuts between the G and the A on
both the top and bottom strands
• This can then be used to ligate in complementary pieces of DNA

Restriction map

• A restriction map is a map of known
restriction sites within a sequence of
DNA.
• Restriction mapping requires the use of
restriction enzymes.
• In molecular biology, restriction maps
are used as a reference to engineer
plasmids or other relatively short pieces
of DNA, and sometimes for longer
genomic DNA.
http://dx.doi.org/10.1590/S0074-02762007005000121

Restriction mapping
• If the genomic DNA sequence of an organism is known, then
construction of a restriction map for, say, HindII amounts to finding
all occurrences of GTGCAC and GTTAAC in the genome.
• However, not all restriction sites are known, and sequencing is not
always the best option if one wants to find a restriction map
• Also: 25 years ago sequencing was not an option!
• How can we build restriction maps for genomes without prior
knowledge of the genomes’ DNA sequence?

Electrophoresis
1. Aliquots of purified plasmid DNA for each digest
2. Digestion with enzyme
3. Samples are run on an electrophoresis gel
4. We can determine the lengths of the DNA fragments from the
band pattern

Electrophoresis gives us the number and lengths of the DNA fragments.
How can we infer the restriction map from this information?

http://bio1151.nicerweb.com/Locked/media/ch20/electrophoresis.html

Some notation
A multiset is a set that allows duplicate elements: e.g., {1,2,2,3,4,4,6}

Let X = {x1 = 0, x2, ..., xn} be a set of n elements in increasing order.

We denote by ΔX the multiset of all pairwise distances between
elements of X:
ΔX = {xj − xi : 1 ≤ i < j ≤ n}.

Example
For example, if X={0, 2, 4, 7, 10}, then ΔX={2, 2, 3, 3, 4, 5, 6, 7, 8, 10}
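
The definition of ΔX can be checked with a few lines of code. This is a minimal Python sketch; the function name `pairwise_distances` is chosen here for illustration and does not come from the slides. The multiset is represented as a sorted list so that duplicates are kept.

```python
from itertools import combinations

def pairwise_distances(X):
    """Return the multiset ΔX as a sorted list (duplicates are kept)."""
    points = sorted(X)
    # combinations() yields each pair (a, b) with a before b exactly once
    return sorted(b - a for a, b in combinations(points, 2))

print(pairwise_distances([0, 2, 4, 7, 10]))
# [2, 2, 3, 3, 4, 5, 6, 7, 8, 10]
```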

The problem

Given all pairwise differences between points on a line,
reconstruct the position of all those points.

Input: The multiset of pairwise distances L, containing $\binom{n}{2}$
integers.

Output: A set X of n integers such that ΔX = L.

Restriction mapping: brute force
Input: The multiset of pairwise distances L, containing $\binom{n}{2}$
integers.
Output: A set X of n integers such that ΔX = L.

M = max(L)
for (every set of integers 0 < x2 < ... < xn-1 < M)
    X = {0, x2, ..., xn-1, M}
    ΔX = get.pairwise.distances(X)
    if (ΔX == L)
        return(X)
return(«no solution»)
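
The brute-force pseudocode could be written in Python roughly as follows. This is a sketch under assumptions: the name `brute_force_pdp` is made up for illustration, n is recovered from the size of L, and `None` stands in for «no solution».

```python
from itertools import combinations

def brute_force_pdp(L):
    """Try every set 0 < x2 < ... < xn-1 < M and return the first X with ΔX == L."""
    L = sorted(L)
    M = L[-1]
    # Recover n from |L| = n(n-1)/2
    n = 2
    while n * (n - 1) // 2 < len(L):
        n += 1
    if n * (n - 1) // 2 != len(L):
        return None  # |L| is not a valid number of pairwise distances
    # combinations() enumerates the inner points in increasing order
    for inner in combinations(range(1, M), n - 2):
        X = (0, *inner, M)
        delta_X = sorted(b - a for a, b in combinations(X, 2))
        if delta_X == L:
            return list(X)
    return None

print(brute_force_pdp([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]))
# [0, 2, 4, 7, 10]
```

The loop stops at the first match, so the mirror-image solution {0, 3, 6, 8, 10} is not reported.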

Complexity
• The algorithm is slow since it examines $\binom{M-1}{n-2}$ different
sets of positions
• This requires about $O(M^{n-2})$ time
• This makes the algorithm impractical for most applications
• One could improve the algorithm by choosing the elements of the
sets more wisely (e.g., choosing n-2 distinct elements from L rather
than n-2 arbitrary integers)
• This reduces the time complexity to $O(n^{2n-4})$, since L contains
$\binom{n}{2} < n^2$ elements. The gain is largest when M is much
larger than n (consider L = {2,998,1000}, n = 3, M = 1000)
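
To make the difference concrete, the two candidate counts can be computed directly. This is a small Python sketch; the function name `candidate_counts` is illustrative, and it assumes the improved variant draws inner points only from the distinct values of L below M (the slide's bound uses the looser count |L|).

```python
from math import comb

def candidate_counts(L):
    """Count the candidate sets examined by each brute-force variant."""
    M = max(L)
    n = 2
    while n * (n - 1) // 2 < len(L):  # recover n from |L| = n(n-1)/2
        n += 1
    any_integers = comb(M - 1, n - 2)   # inner points chosen from 1..M-1
    values_in_L = len(set(L) - {M})     # distinct distances below M
    from_L = comb(values_in_L, n - 2)   # inner points chosen from L only
    return any_integers, from_L

print(candidate_counts([2, 998, 1000]))
# (999, 2)
```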

A practical algorithm (Skiena, 1990)
Idea:
1. Find the largest distance in L (it determines the two outermost points of X: 0 and max(X))
2. Find the largest remaining distance δ
3. There are two options to place δ: measured from the left end (a point
at δ) or from the right end (a point at max(X) − δ)
4. Pick left and add the point to the preliminary set
5. Calculate all pairwise distances between the new point and the points already placed
6. Check if they are contained in L
7. If yes: remove these distances (including δ) from L and go to 2.
8. If not: go back and pick the other direction
9. If L is empty, we have found a valid solution

Example
L = {2,2,3,3,4,5,6,7,8,10}
L has 10 elements and therefore n = 5 ($\binom{n}{2}$ = 10)
10 is the largest distance in L, so x1 = 0 and x5 = 10
X = {0,10} and L = {2,2,3,3,4,5,6,7,8}
The largest remaining distance is 8.
Set x2 = 2 and remove distances 2 and 8 (x2 − x1, x5 − x2)
We get X = {0,2,10} and L = {2,3,3,4,5,6,7}
The largest remaining distance is 7 -> either x4 = 7 or x3 = 3
x3 = 3 is not possible because x3 − x2 = 1 is not in L
Therefore x4 must be 7
X = {0,2,7,10} and L = {2,3,4,6}
The largest remaining distance is 6
Two choices: x3 = 4 or x3 = 6
x3 = 6 does not work, so x3 must be 4
We get X = {0,2,4,7,10} which is a valid solution

Algorithm
Here Δ(y,X) denotes the multiset of distances between y and the points in X.

get.rest.map = function(L,X)
    m = max(L)
    L = L without m
    X = {0,m}
    find.next(L,X,m)

find.next = function(L,X,m)
    If L is empty: X is solution, return
    y = max(L)
    If Δ(y,X) is subset of L
        add y to X and remove Δ(y,X) from L
        find.next(L,X,m)
        remove y from X and add Δ(y,X) to L
    If Δ(m-y,X) is subset of L
        add m-y to X and remove Δ(m-y,X) from L
        find.next(L,X,m)
        remove m-y from X and add Δ(m-y,X) to L
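
A runnable version of this backtracking scheme might look as follows in Python. It is a sketch, not the course implementation: the names `partial_digest`, `find_next`, `distances` and `fits` are illustrative, and the multiset L is represented as a `collections.Counter` (an assumption made here so that "is subset of L" and "remove from L" respect duplicates).

```python
from collections import Counter

def partial_digest(L):
    """Return every set X (as a sorted list) with ΔX == L, by backtracking."""
    remaining = Counter(L)   # multiset of distances not yet explained
    m = max(remaining)
    remaining[m] -= 1        # L = L without m
    X = [0, m]
    solutions = []

    def distances(y):
        # Δ(y, X): distances between y and every point already placed
        return Counter(abs(y - x) for x in X)

    def fits(d):
        # True if the multiset d is contained in the remaining distances
        return all(remaining[k] >= c for k, c in d.items())

    def find_next():
        if sum(remaining.values()) == 0:
            solutions.append(sorted(X))
            return
        y = max(k for k, c in remaining.items() if c > 0)
        # Try y measured from the left end, then from the right end
        for candidate in dict.fromkeys((y, m - y)):
            d = distances(candidate)
            if fits(d):
                X.append(candidate)
                remaining.subtract(d)
                find_next()
                X.pop()               # undo changes before trying the other branch
                remaining.update(d)

    find_next()
    return solutions

print(sorted(partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10])))
# [[0, 2, 4, 7, 10], [0, 3, 6, 8, 10]]
```

Both the solution from the worked example and its mirror image are listed, because the search continues after the first solution is found.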

Recursive tree
At each step we can go left or right,
which creates a recursion tree.

If we find that going, say, left doesn’t yield a viable
solution, we can ignore the whole subtree!

Some notes
After each recursion we undo the modifications to sets X and L for the next recursive call

This algorithm will list all sets X with Δ(X) = L

This algorithm is usually very fast because we do not go deep into the «recursive tree»

However, there exist pathological examples where we explore $2^k$ possible paths (left and right are
always viable paths)

So strictly speaking, this algorithm has exponential time-complexity, but is much faster in most
cases

A polynomial-time algorithm was found in 2002 by Nivat et al.

Search trees
As we have seen, trees can be useful structures to “organize”
information.

We can exploit the tree structure to efficiently decide whether a subset
of all possible combinations is “uninteresting”

Example: Binary search tree
• One node is designated the root of the tree.
• Each node contains a key and has (at most) two subtrees.
• Each subtree is itself a binary search tree.
• The left subtree of a node contains only nodes with keys strictly less
than the node's key.
• The right subtree of a node contains only nodes with keys strictly
greater than the node's key.
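
The properties listed above translate into a short data structure. The following Python sketch is independent of the course's R script; the names `Node`, `insert`, `contains` and `in_order` are chosen here for illustration. Equal keys are ignored on insertion, which keeps the "strictly less / strictly greater" rule intact.

```python
class Node:
    """A binary search tree node: one key and at most two subtrees."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key and return the (possibly new) root of the subtree."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # a key that is already present is not inserted again

def contains(root, key):
    """Search by discarding one whole subtree at every step."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False

def in_order(root):
    """Left subtree, node, right subtree: yields the keys in sorted order."""
    if root is None:
        return []
    return in_order(root.left) + [root.key] + in_order(root.right)

root = None
for k in [7, 3, 10, 1, 5]:
    root = insert(root, k)
print(in_order(root))      # [1, 3, 5, 7, 10]
print(contains(root, 5))   # True
print(contains(root, 4))   # False
```

The search in `contains` is the point of the structure: each comparison rules out an entire subtree as "uninteresting".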

Example: Binary tree

See R-script: “tree.r” on ilias:

Motif finding

Transcription factor binding sites
• Every gene contains a regulatory region (RR) upstream of the
transcriptional start site

• Located within the RR are the Transcription Factor Binding Sites (TFBS),
also known as motifs, specific for a given transcription factor

• A TFBS can be located anywhere within the Regulatory Region (RR).

• A single TF can regulate multiple genes if those genes’ RRs contain
corresponding TFBS
• Regulated genes can be found via knockout experiments

Sequence motifs
Sequence motifs are short, recurring patterns in DNA that are
presumed to have a biological function.

Often they indicate sequence-specific binding sites for
proteins such as nucleases and transcription factors (TF).

Others are involved in important processes at the RNA level,
including ribosome binding, mRNA processing (splicing,
editing, polyadenylation) and transcription termination.

Nature Biotechnology 24, 423 - 425 (2006)
doi:10.1038/nbt0406-423

Complications
• We do not know the motif sequence
• We may know its length
• We do not know where it is located relative to the gene’s start
• Motifs can differ slightly from one gene to the next
• Non-essential bases could mutate…
• How to discern functional motifs from random ones?

ATCCCG gene

TTCCGG gene

ATCCCG gene

ATGCCG gene

ATGCCC gene

Random sequences
Can you find a motif?

Random sequences plus motif
How about now?

Random sequences plus motif
There it is!

Random sequences plus motif plus mutations
Finding a motif is hard, especially if the motif has some variation.

What is a motif?
Before we can tackle the problem, we need a formal way to describe a
motif. Representation by a single string is precise but fails for many
biologically relevant applications.

Alignment matrix
Consider a set of t DNA sequences, each n nucleotides long

Select a starting position in each of the t sequences:

s = (s1, s2, ..., st), 1 ≤ si ≤ n − l + 1, where l is the length of the motif

We can form a t × l alignment matrix by setting the (i,j)th element to
be the (si + j − 1)th element of the ith sequence

Notation
• t - number of sample DNA sequences
• n - length of each DNA sequence
• DNA - sample of DNA sequences (t x n array)

• l - length of the motif (l-mer)

• si - starting position of an l-mer in sequence i
• s = (s1, s2, …, st) - array of the motif’s starting
positions

Notation

Sequence length n

Example
l = 8, t = 5, n = 69

DNA:
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc

s = (s1, s2, s3, s4, s5) = (26, 21, 3, 56, 60)

Alignment matrix
We can form a t × l alignment matrix by setting the (i,j)th element to be the
(si + j − 1)th element of the ith sequence.
e.g.: s = (8,19,3,5,31,27,15)
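
Building the alignment matrix amounts to cutting one l-mer out of each sequence. The Python sketch below applies this to the five example sequences with l = 8 and starting positions s1 = 26, s2 = 21, s3 = 3, s4 = 56, s5 = 60; the function name `alignment_matrix` is illustrative. Positions on the slides are 1-based, so they are shifted by one for Python's 0-based slicing.

```python
DNA = [
    "cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat",
    "agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc",
    "aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt",
    "agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca",
    "ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc",
]

def alignment_matrix(dna, s, l):
    """Row i is the l-mer of sequence i that starts at the 1-based position s[i]."""
    return [seq[si - 1 : si - 1 + l] for seq, si in zip(dna, s)]

for row in alignment_matrix(DNA, (26, 21, 3, 56, 60), 8):
    print(row)
# aGgtacTt
# CcAtacgt
# acgtTAgt
# acgtCcAt
# CcgtacgG
```

The capital letters are carried over from the slide, where they mark the mutated positions of the implanted motif.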

Profile Matrix
We now calculate the 4 × l profile matrix.

The (i,j)th element holds the number of
times nucleotide i appears in column j,
i = C,G,T,A

This matrix illustrates the variability of the
nucleotide composition at each position for
a particular choice of l-mers.
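
Counting nucleotides column by column gives the profile matrix. The Python sketch below uses the five 8-mers obtained from the earlier example sequences at s = (26, 21, 3, 56, 60), written in upper case; the function name `profile_matrix` and the dictionary layout (one list of counts per nucleotide) are assumptions made for illustration.

```python
from collections import Counter

def profile_matrix(alignment, alphabet="ACGT"):
    """Map each nucleotide to its count in every column of the alignment."""
    columns = zip(*(row.upper() for row in alignment))
    counts = [Counter(col) for col in columns]
    # Counter returns 0 for nucleotides that do not occur in a column
    return {nuc: [c[nuc] for c in counts] for nuc in alphabet}

alignment = ["AGGTACTT", "CCATACGT", "ACGTTAGT", "ACGTCCAT", "CCGTACGG"]
for nuc, row in profile_matrix(alignment).items():
    print(nuc, row)
# A [3, 0, 1, 0, 3, 1, 1, 0]
# C [2, 4, 0, 0, 1, 4, 0, 0]
# G [0, 1, 4, 0, 0, 0, 3, 1]
# T [0, 0, 0, 5, 1, 0, 1, 4]
```

Every column sums to t = 5, which is a quick sanity check on any profile matrix.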

Consensus String
The consensus string is formed by taking the most common nucleotide
at each site of the l-mer. E.g.: ATGCAACT
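
As a sketch of this step in Python: the function name `consensus` is illustrative, and the input is the set of five 8-mers taken from the earlier example sequences at s = (26, 21, 3, 56, 60), so the result differs from the string quoted above, which belongs to a different alignment. Ties are not discussed on the slides; here the nucleotide seen first in the column wins, which is an arbitrary choice.

```python
from collections import Counter

def consensus(alignment):
    """Take the most common nucleotide in every column of the alignment."""
    columns = zip(*(row.upper() for row in alignment))
    # most_common(1) returns [(nucleotide, count)] for the top entry
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

alignment = ["AGGTACTT", "CCATACGT", "ACGTTAGT", "ACGTCCAT", "CCGTACGG"]
print(consensus(alignment))
# ACGTACGT
```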