KEMBAR78
02-alignment.pdf
Lecture 2
Sequence Alignment
Burr Settles
IBS Summer Research Program 2008
bsettles@cs.wisc.edu
www.cs.wisc.edu/~bsettles/ibs08/
Sequence Alignment:
Task Definition
• given:
– a pair of sequences (DNA or protein)
– a method for scoring a candidate alignment
• do:
– determine the correspondences between substrings in
the sequences such that the similarity score is
maximized
Why Do Alignment?
• homology: similarity due to descent from a common
ancestor
• often we can infer homology from similarity
• thus we can sometimes infer structure/function from
sequence similarity
Homology Example:
Evolution of the Globins
Homology
• homologous sequences can be divided into two groups
– orthologous sequences: sequences that differ because
they are found in different species (e.g. human α
-globin and mouse α-globin)
– paralogous sequences: sequences that differ because of
a gene duplication event (e.g. human α-globin and
human β-globin, various versions of both )
Issues in Sequence Alignment
• the sequences we’re comparing probably differ in length
• there may be only a relatively small region in the
sequences that match
• we want to allow partial matches (i.e. some amino acid
pairs are more substitutable than others)
• variable length regions may have been inserted/deleted
from the common ancestral sequence
Sequence Variations
• sequences may have diverged from a common ancestor
through various types of mutations:
– substitutions (ACGA AGGA)
– insertions (ACGA ACCGGAGA)
– deletions (ACGGAGA AGA)
• the latter two will result in gaps in alignments
Insertions, Deletions and
Protein Structure
loop structures: insertions/deletions
here not so significant
• Why is it that two “similar” sequences may have large
insertions/deletions?
– some insertions and deletions may not significantly
affect the structure of a protein
Example Alignment: Globins
• figure at right shows prototypical
structure of globins
• figure below shows part of alignment
for 8 globins (-’s indicate gaps)
Three Key Questions
• Q1: what do we want to align?
• Q2: how do we “score” an alignment?
• Q3: how do we find the “best” alignment?
Q1: What Do We Want to Align?
• global alignment: find best match of both sequences in
their entirety
• local alignment: find best subsequence match
• semi-global alignment: find best match without penalizing
gaps on the ends of the alignment
The Space of Global Alignments
• some possible global alignments for ELV and VIS
ELV
VIS
-ELV
VIS-
--ELV
VIS--
ELV-
-VIS
ELV--
--VIS
E-LV
VIS-
EL-V
-VIS
Q2: How Do We Score
Alignments?
• gap penalty function
– w(k) indicates cost of a gap of length k
• substitution matrix
– s(a,b) indicates score of aligning character a with
character b
Linear Gap Penalty Function
• different gap penalty functions require somewhat different
dynamic programming algorithms
• the simplest case is when a linear gap function is used
€
w(k) = g× k
where g is a constant
• we’ll start by considering this case
Scoring an Alignment
• the score of an alignment is the sum of the scores for pairs of
aligned characters plus the scores for gaps
• example: given the following alignment
VAHV---D--DMPNALSALSDLHAHKL
AIQLQVTGVVVTDATLKNLGSVHVSKG
• we would score it by
s(V,A) + s(A,I) + s(H,Q) + s(V,L) + 3g + s(D,G) + 2g …
Q3: How Do We Find the Best
Alignment?
• simple approach: compute & score all possible alignments
• but there are
n
n
n
n
n n
π
2
2
2
)
!
(
)!
2
(
2
≈
=








possible global alignments for 2 sequences of length n
• e.g. two sequences of length 100 have possible
alignments
77
10
≈
Pairwise Alignment Via
Dynamic Programming
• dynamic programming: solve an instance of a
problem by taking advantage of solutions for
subparts of the problem
– reduce problem of best alignment of two sequences to
best alignment of all prefixes of the sequences
– avoid recalculating the scores already considered
• example: Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, 21, 34…
• first used in alignment by Needleman & Wunsch,
Journal of Molecular Biology, 1970
Dynamic Programming Idea
• consider last step in computing alignment of
AAAC with AGC
• three possible options; in each we’ll choose a different
pairing for end of alignment, and add this to best
alignment of previous characters
AAA
C
AG
C AAAC
C
AG
-
AAA
-
AGC
C consider best
alignment of
these prefixes
score of
aligning
this pair
+
Dynamic Programming Idea
• given an n-character sequence x, and an m-character
sequence y
• construct an (n+1) × (m+1) matrix F
• F ( i, j ) = score of the best alignment of
x[1…i ] with y[1…j ]
A
A
C
A G
A
C
score of best alignment of
AAA to AG
Needleman-Wunch Algorithm
• one way to specify the DP is in terms of its recurrence
relation:





+
−
+
−
+
−
−
=
g
j
i
F
g
j
i
F
y
x
s
j
i
F
j
i
F
j
i
)
1
,
(
)
,
1
(
)
,
(
)
1
,
1
(
max
)
,
(
match xi with yj
insertion in x
insertion in y
DP Algorithm Sketch:
Global Alignment
• initialize first row and column of matrix
• fill in rest of matrix from top to bottom, left to right
• for each F ( i, j ), save pointer(s) to cell(s) that resulted in
best score
• F (m, n) holds the optimal alignment score; trace pointers
back from F (m, n) to F (0, 0) to recover alignment
Initializing Matrix
A g
A 2g
C
A G
A 3g
C 4g
0 3g
g 2g
Global Alignment Example
• suppose we choose the following scoring scheme:
+1
-1
g (penalty for aligning with a gap) = -2
=
)
,
( i
i y
x
s
i
i y
x =
when
i
i y
x ≠
when
Global Alignment Example
A
A
C
A G
A
C
+1
-1
g = -2
=
)
,
( i
i y
x
s
i
i y
x =
when
i
i y
x ≠
when
Global Alignment Example
A -2
A -4
C
A G
A -6
C -8
0 -6
-2 -4
1 -1 -3
-1 0 -2
-3 -2 -1
-5 -4 -1
x:
y:
A
-
A
G
A
A
C
C
one optimal alignment
Equally Optimal Alignments
• many optimal alignments may exist for a given pair of
sequences
• can use preference ordering over paths when doing
traceback
highroad lowroad
1
2
3 1
2
3
• highroad and lowroad alignments show the two most
different optimal alignments
Highroad & Lowroad Alignments
A -2
A -4
C
A G
A -6
C -8
0 -6
-2 -4
1 -1 -3
-1 0 -2
-3 -2 -1
-5 -4 -1
x:
y:
A
G
A
A
A
-
C
C
lowroad alignment
x:
y:
A
-
A
G
A
A
C
C
highroad alignment
DP Comments
• works for either DNA or protein sequences, although the
substitution matrices used differ
• finds an optimal alignment
• the exact algorithm (and computational complexity)
depends on gap penalty function (we’ll come back to this)
Local Alignment
• so far we have discussed global alignment, where we are
looking for best match between sequences from one end to
the other
• more commonly, we will want a local alignment, the best
match between subsequences of x and y
Local Alignment Motivation
• useful for comparing protein sequences that share a
common motif (conserved pattern) or domain
(independently folded unit) but differ elsewhere
• useful for comparing DNA sequences that share a similar
motif but differ elsewhere
• useful for comparing protein sequences against genomic
DNA sequences (long stretches of uncharacterized
sequence)
• more sensitive when comparing highly diverged sequences
Local Alignment DP Algorithm
• original formulation: Smith & Waterman, Journal of
Molecular Biology, 1981
• interpretation of array values is somewhat different
– F ( i, j ) = score of the best alignment of a suffix of
x[1…i ] and a suffix of y[1…j ]
Local Alignment DP Algorithm







+
−
+
−
+
−
−
=
0
)
1
,
(
)
,
1
(
)
,
(
)
1
,
1
(
max
)
,
(
g
j
i
F
g
j
i
F
y
x
s
j
i
F
j
i
F
j
i
• the recurrence relation is slightly different than for global
algorithm
Local Alignment DP Algorithm
• initialization: first row and first column initialized with 0’s
• traceback:
– find maximum value of F(i, j); can be anywhere in
matrix
– stop when we get to a cell with value 0
Local Alignment Example
T
T
A
A
G
G
A A
A
+1
-1
g = -2
=
)
,
( i
i y
x
s
i
i y
x =
when
i
i y
x ≠
when
Local Alignment Example
0
0
0
0 0
0
0
0 0
0
0
T
T
A
A
G
0
0
0
0
0
0 0
G
0
A
0
A
0
A
1
0
1
1 2
3
1
1
1
x:
y:
G
G
A
A
A
A
More On Gap Penalty Functions
• a gap of length k is more probable than k gaps of length 1
– a gap may be due to a single mutational event that
inserted/deleted a stretch of characters
– separated gaps are probably due to distinct mutational
events
• a linear gap penalty function treats these cases the same
• it is more common to use an affine gap penalty function,
which involves two terms:
– a penalty h associated with opening a gap
– a smaller penalty g for extending the gap
Gap Penalty Functions
• linear



=
≥
+
=
0
,
0
1
,
)
(
k
k
gk
h
k
w
gk
k
w =
)
(
• affine
Dynamic Programming for the
Affine Gap Penalty Case
• to do in time, need 3 matrices instead of 1
)
,
( j
i
M
)
,
( j
i
Ix
)
,
( j
i
Iy
best score given that y[j] is
aligned to a gap
best score given that x[i] is
aligned to a gap
best score given that x[i] is
aligned to y[j]
)
( 2
n
O
Global Alignment DP for the
Affine Gap Penalty Case





+
−
−
+
−
−
+
−
−
=
)
,
(
)
1
,
1
(
)
,
(
)
1
,
1
(
)
,
(
)
1
,
1
(
max
)
,
(
j
i
y
j
i
x
j
i
y
x
s
j
i
I
y
x
s
j
i
I
y
x
s
j
i
M
j
i
M



+
−
+
+
−
=
g
j
i
I
g
h
j
i
M
j
i
I
x
x
)
,
1
(
)
,
1
(
max
)
,
(



+
−
+
+
−
=
g
j
i
I
g
h
j
i
M
j
i
I
y
y
)
1
,
(
)
1
,
(
max
)
,
(
match xi with yj
insertion in x
insertion in y
open gap in x
extend gap in x
open gap in y
extend gap in y
Global Alignment DP for the
Affine Gap Penalty Case
−∞
=
×
+
=
×
+
=
=
column
leftmost
and
row
in top
cells
other
)
,
0
(
)
0
,
(
0
)
0
,
0
(
j
g
h
j
I
i
g
h
i
I
M
y
x
• initialization
• traceback
– start at largest of
– stop at any of
– note that pointers may traverse all three
matrices
)
,
(
),
,
(
),
,
( n
m
I
n
m
I
n
m
M y
x
)
0
,
0
(
),
0
,
0
(
),
0
,
0
( y
x I
I
M
Global Alignment Example
(Affine Gap Penalty)
M : 0
Ix : -3
Iy : -3
-∞
-∞
-4
-∞
-∞
-5
-∞
-∞
-7
-∞
-∞
-6
-∞
-∞
-8
1
-∞
-∞
-5
-∞
-3
-7
-∞
-5
-4
-∞
-4
-8
-∞
-6
-∞
-4
-∞
-3
-3
-∞
0
-9
-7
-5
-11
-5
-2
-8
-4
-6
-12
-6
-∞
-5
-∞
-6
-4
-∞
-4
-4
-10
-3
-9
-5
-1
-6
-8
-4
-10
-6
-∞
-6
-∞
A C A C T
A
A
T
h = -3, g = -1
Global Alignment Example (Continued)
M : 0
Ix : -3
Iy : -3
-∞
-∞
-4
-∞
-∞
-5
-∞
-∞
-7
-∞
-∞
-6
-∞
-∞
-8
1
-∞
-∞
-5
-∞
-3
-7
-∞
-5
-4
-∞
-4
-8
-∞
-6
-∞
-4
-∞
-3
-3
-∞
0
-9
-7
-5
-11
-5
-2
-8
-4
-6
-12
-6
-∞
-5
-∞
-6
-4
-∞
-4
-4
-10
-3
-9
-5
-1
-6
-8
-4
-10
-6
-∞
-6
-∞
A C A C T
A
A
T
ACACT
--AAT
ACACT
A--AT
ACACT
AA--T
three optimal alignments:
Local Alignment DP for the
Affine Gap Penalty Case







+
−
−
+
−
−
+
−
−
=
0
)
,
(
)
1
,
1
(
)
,
(
)
1
,
1
(
)
,
(
)
1
,
1
(
max
)
,
(
j
i
y
j
i
x
j
i
y
x
s
j
i
I
y
x
s
j
i
I
y
x
s
j
i
M
j
i
M



+
−
+
+
−
=
g
j
i
I
g
h
j
i
M
j
i
I
x
x
)
,
1
(
)
,
1
(
max
)
,
(



+
−
+
+
−
=
g
j
i
I
g
h
j
i
M
j
i
I
y
y
)
1
,
(
)
1
,
(
max
)
,
(
Local Alignment DP for the
Affine Gap Penalty Case
−∞
=
=
=
=
,
of
column
leftmost
and
row
in top
cells
0
)
,
0
(
0
)
0
,
(
0
)
0
,
0
(
y
x I
I
j
M
i
M
M
• initialization
• traceback
– start at largest
– stop at
)
,
( j
i
M
0
)
,
( =
j
i
M
Gap Penalty Functions



=
≥
+
=
0
,
0
1
,
)
(
k
k
gk
h
k
w
gk
k
w =
)
(
)
log(
)
( k
g
h
k
w ×
+
=
• linear:
• affine:
• concave: a function for which the following holds for all
k, l, m ≥ 0
e.g.
)
(
)
(
)
(
)
( k
w
l
k
w
m
k
w
l
m
k
w −
+
≤
+
−
+
+
Concave Gap Penalty Functions
( ) ( )
w k m l w k m
+ + − +
0
1
2
3
4
5
6
7
8
1 2 3 4 5 6 7 8 9 10
( ) ( )
w k l w k
+ −
)
(
)
(
)
(
)
( k
w
l
k
w
m
k
w
l
m
k
w −
+
≤
+
−
+
+
l
More On Scoring Matches
• so far, we’ve discussed multiple gap penalty functions, but
only one match-scoring scheme:
+1
-1
=
)
,
( i
i y
x
s
i
i y
x =
when
i
i y
x ≠
when
• for protein sequence alignment, some amino acids have
similar structures and can be substituted in nature:
aspartic acid (D) glutamic acid (E)
Substitution Matrices
• two popular sets of matrices for protein sequences
– PAM matrices [Dayhoff et al., 1978]
– BLOSUM matrices [Henikoff & Henikoff, 1992]
• both try to capture the the relative substitutability of amino
acid pairs in the context of evolution
BLOSUM62 Matrix
Heuristic Methods
• the algorithms we learned today take O(nm) time to align
sequences, which is too slow for searching large databases
– imagine an internet search engine, but where queries and results
are protein sequences
• heuristic methods do fast approximation to dynamic
programming
– example: BLAST [Altschul et al., 1990; Altschul et al., 1997]
– break sequence into small (e.g. 3 base pair) “words”
– scan database for word matches
– extend all matches to seek high-scoring alignments
– tradeoff: sensitivity for speed
Multiple Sequence Alignment
• we’ve only discussed
aligning 2 sequences, but
we may want to do more
• discover common motifs in
a set of sequences
(e.g. DNA sequences that
bind the same protein)
• characterize a set of
sequences
(e.g. a protein family)
• much more complex
Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences
Next Time…
• basic molecular biology
• sequence alignment
• probabilistic sequence models
• gene expression analysis
• protein structure prediction
– by Ameet Soni

02-alignment.pdf

  • 1.
    Lecture 2 Sequence Alignment BurrSettles IBS Summer Research Program 2008 bsettles@cs.wisc.edu www.cs.wisc.edu/~bsettles/ibs08/
  • 2.
    Sequence Alignment: Task Definition •given: – a pair of sequences (DNA or protein) – a method for scoring a candidate alignment • do: – determine the correspondences between substrings in the sequences such that the similarity score is maximized
  • 3.
    Why Do Alignment? •homology: similarity due to descent from a common ancestor • often we can infer homology from similarity • thus we can sometimes infer structure/function from sequence similarity
  • 4.
  • 5.
    Homology • homologous sequencescan be divided into two groups – orthologous sequences: sequences that differ because they are found in different species (e.g. human α -globin and mouse α-globin) – paralogous sequences: sequences that differ because of a gene duplication event (e.g. human α-globin and human β-globin, various versions of both )
  • 6.
    Issues in SequenceAlignment • the sequences we’re comparing probably differ in length • there may be only a relatively small region in the sequences that match • we want to allow partial matches (i.e. some amino acid pairs are more substitutable than others) • variable length regions may have been inserted/deleted from the common ancestral sequence
  • 7.
    Sequence Variations • sequencesmay have diverged from a common ancestor through various types of mutations: – substitutions (ACGA AGGA) – insertions (ACGA ACCGGAGA) – deletions (ACGGAGA AGA) • the latter two will result in gaps in alignments
  • 8.
    Insertions, Deletions and ProteinStructure loop structures: insertions/deletions here not so significant • Why is it that two “similar” sequences may have large insertions/deletions? – some insertions and deletions may not significantly affect the structure of a protein
  • 9.
    Example Alignment: Globins •figure at right shows prototypical structure of globins • figure below shows part of alignment for 8 globins (-’s indicate gaps)
  • 10.
    Three Key Questions •Q1: what do we want to align? • Q2: how do we “score” an alignment? • Q3: how do we find the “best” alignment?
  • 11.
    Q1: What DoWe Want to Align? • global alignment: find best match of both sequences in their entirety • local alignment: find best subsequence match • semi-global alignment: find best match without penalizing gaps on the ends of the alignment
  • 12.
    The Space ofGlobal Alignments • some possible global alignments for ELV and VIS ELV VIS -ELV VIS- --ELV VIS-- ELV- -VIS ELV-- --VIS E-LV VIS- EL-V -VIS
  • 13.
    Q2: How DoWe Score Alignments? • gap penalty function – w(k) indicates cost of a gap of length k • substitution matrix – s(a,b) indicates score of aligning character a with character b
  • 14.
    Linear Gap PenaltyFunction • different gap penalty functions require somewhat different dynamic programming algorithms • the simplest case is when a linear gap function is used € w(k) = g× k where g is a constant • we’ll start by considering this case
  • 15.
    Scoring an Alignment •the score of an alignment is the sum of the scores for pairs of aligned characters plus the scores for gaps • example: given the following alignment VAHV---D--DMPNALSALSDLHAHKL AIQLQVTGVVVTDATLKNLGSVHVSKG • we would score it by s(V,A) + s(A,I) + s(H,Q) + s(V,L) + 3g + s(D,G) + 2g …
  • 16.
    Q3: How DoWe Find the Best Alignment? • simple approach: compute & score all possible alignments • but there are n n n n n n π 2 2 2 ) ! ( )! 2 ( 2 ≈ =         possible global alignments for 2 sequences of length n • e.g. two sequences of length 100 have possible alignments 77 10 ≈
  • 17.
    Pairwise Alignment Via DynamicProgramming • dynamic programming: solve an instance of a problem by taking advantage of solutions for subparts of the problem – reduce problem of best alignment of two sequences to best alignment of all prefixes of the sequences – avoid recalculating the scores already considered • example: Fibonacci sequence 1, 1, 2, 3, 5, 8, 13, 21, 34… • first used in alignment by Needleman & Wunsch, Journal of Molecular Biology, 1970
  • 18.
    Dynamic Programming Idea •consider last step in computing alignment of AAAC with AGC • three possible options; in each we’ll choose a different pairing for end of alignment, and add this to best alignment of previous characters AAA C AG C AAAC C AG - AAA - AGC C consider best alignment of these prefixes score of aligning this pair +
  • 19.
    Dynamic Programming Idea •given an n-character sequence x, and an m-character sequence y • construct an (n+1) × (m+1) matrix F • F ( i, j ) = score of the best alignment of x[1…i ] with y[1…j ] A A C A G A C score of best alignment of AAA to AG
  • 20.
    Needleman-Wunch Algorithm • oneway to specify the DP is in terms of its recurrence relation:      + − + − + − − = g j i F g j i F y x s j i F j i F j i ) 1 , ( ) , 1 ( ) , ( ) 1 , 1 ( max ) , ( match xi with yj insertion in x insertion in y
  • 21.
    DP Algorithm Sketch: GlobalAlignment • initialize first row and column of matrix • fill in rest of matrix from top to bottom, left to right • for each F ( i, j ), save pointer(s) to cell(s) that resulted in best score • F (m, n) holds the optimal alignment score; trace pointers back from F (m, n) to F (0, 0) to recover alignment
  • 22.
    Initializing Matrix A g A2g C A G A 3g C 4g 0 3g g 2g
  • 23.
    Global Alignment Example •suppose we choose the following scoring scheme: +1 -1 g (penalty for aligning with a gap) = -2 = ) , ( i i y x s i i y x = when i i y x ≠ when
  • 24.
    Global Alignment Example A A C AG A C +1 -1 g = -2 = ) , ( i i y x s i i y x = when i i y x ≠ when
  • 25.
    Global Alignment Example A-2 A -4 C A G A -6 C -8 0 -6 -2 -4 1 -1 -3 -1 0 -2 -3 -2 -1 -5 -4 -1 x: y: A - A G A A C C one optimal alignment
  • 26.
    Equally Optimal Alignments •many optimal alignments may exist for a given pair of sequences • can use preference ordering over paths when doing traceback highroad lowroad 1 2 3 1 2 3 • highroad and lowroad alignments show the two most different optimal alignments
  • 27.
    Highroad & LowroadAlignments A -2 A -4 C A G A -6 C -8 0 -6 -2 -4 1 -1 -3 -1 0 -2 -3 -2 -1 -5 -4 -1 x: y: A G A A A - C C lowroad alignment x: y: A - A G A A C C highroad alignment
  • 28.
    DP Comments • worksfor either DNA or protein sequences, although the substitution matrices used differ • finds an optimal alignment • the exact algorithm (and computational complexity) depends on gap penalty function (we’ll come back to this)
  • 29.
    Local Alignment • sofar we have discussed global alignment, where we are looking for best match between sequences from one end to the other • more commonly, we will want a local alignment, the best match between subsequences of x and y
  • 30.
    Local Alignment Motivation •useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere • useful for comparing DNA sequences that share a similar motif but differ elsewhere • useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence) • more sensitive when comparing highly diverged sequences
  • 31.
    Local Alignment DPAlgorithm • original formulation: Smith & Waterman, Journal of Molecular Biology, 1981 • interpretation of array values is somewhat different – F ( i, j ) = score of the best alignment of a suffix of x[1…i ] and a suffix of y[1…j ]
  • 32.
    Local Alignment DPAlgorithm        + − + − + − − = 0 ) 1 , ( ) , 1 ( ) , ( ) 1 , 1 ( max ) , ( g j i F g j i F y x s j i F j i F j i • the recurrence relation is slightly different than for global algorithm
  • 33.
    Local Alignment DPAlgorithm • initialization: first row and first column initialized with 0’s • traceback: – find maximum value of F(i, j); can be anywhere in matrix – stop when we get to a cell with value 0
  • 34.
    Local Alignment Example T T A A G G AA A +1 -1 g = -2 = ) , ( i i y x s i i y x = when i i y x ≠ when
  • 35.
    Local Alignment Example 0 0 0 00 0 0 0 0 0 0 T T A A G 0 0 0 0 0 0 0 G 0 A 0 A 0 A 1 0 1 1 2 3 1 1 1 x: y: G G A A A A
  • 36.
    More On GapPenalty Functions • a gap of length k is more probable than k gaps of length 1 – a gap may be due to a single mutational event that inserted/deleted a stretch of characters – separated gaps are probably due to distinct mutational events • a linear gap penalty function treats these cases the same • it is more common to use an affine gap penalty function, which involves two terms: – a penalty h associated with opening a gap – a smaller penalty g for extending the gap
  • 37.
    Gap Penalty Functions •linear    = ≥ + = 0 , 0 1 , ) ( k k gk h k w gk k w = ) ( • affine
  • 38.
    Dynamic Programming forthe Affine Gap Penalty Case • to do in time, need 3 matrices instead of 1 ) , ( j i M ) , ( j i Ix ) , ( j i Iy best score given that y[j] is aligned to a gap best score given that x[i] is aligned to a gap best score given that x[i] is aligned to y[j] ) ( 2 n O
  • 39.
    Global Alignment DPfor the Affine Gap Penalty Case      + − − + − − + − − = ) , ( ) 1 , 1 ( ) , ( ) 1 , 1 ( ) , ( ) 1 , 1 ( max ) , ( j i y j i x j i y x s j i I y x s j i I y x s j i M j i M    + − + + − = g j i I g h j i M j i I x x ) , 1 ( ) , 1 ( max ) , (    + − + + − = g j i I g h j i M j i I y y ) 1 , ( ) 1 , ( max ) , ( match xi with yj insertion in x insertion in y open gap in x extend gap in x open gap in y extend gap in y
  • 40.
    Global Alignment DPfor the Affine Gap Penalty Case −∞ = × + = × + = = column leftmost and row in top cells other ) , 0 ( ) 0 , ( 0 ) 0 , 0 ( j g h j I i g h i I M y x • initialization • traceback – start at largest of – stop at any of – note that pointers may traverse all three matrices ) , ( ), , ( ), , ( n m I n m I n m M y x ) 0 , 0 ( ), 0 , 0 ( ), 0 , 0 ( y x I I M
  • 41.
    Global Alignment Example (AffineGap Penalty) M : 0 Ix : -3 Iy : -3 -∞ -∞ -4 -∞ -∞ -5 -∞ -∞ -7 -∞ -∞ -6 -∞ -∞ -8 1 -∞ -∞ -5 -∞ -3 -7 -∞ -5 -4 -∞ -4 -8 -∞ -6 -∞ -4 -∞ -3 -3 -∞ 0 -9 -7 -5 -11 -5 -2 -8 -4 -6 -12 -6 -∞ -5 -∞ -6 -4 -∞ -4 -4 -10 -3 -9 -5 -1 -6 -8 -4 -10 -6 -∞ -6 -∞ A C A C T A A T h = -3, g = -1
  • 42.
    Global Alignment Example(Continued) M : 0 Ix : -3 Iy : -3 -∞ -∞ -4 -∞ -∞ -5 -∞ -∞ -7 -∞ -∞ -6 -∞ -∞ -8 1 -∞ -∞ -5 -∞ -3 -7 -∞ -5 -4 -∞ -4 -8 -∞ -6 -∞ -4 -∞ -3 -3 -∞ 0 -9 -7 -5 -11 -5 -2 -8 -4 -6 -12 -6 -∞ -5 -∞ -6 -4 -∞ -4 -4 -10 -3 -9 -5 -1 -6 -8 -4 -10 -6 -∞ -6 -∞ A C A C T A A T ACACT --AAT ACACT A--AT ACACT AA--T three optimal alignments:
  • 43.
    Local Alignment DPfor the Affine Gap Penalty Case        + − − + − − + − − = 0 ) , ( ) 1 , 1 ( ) , ( ) 1 , 1 ( ) , ( ) 1 , 1 ( max ) , ( j i y j i x j i y x s j i I y x s j i I y x s j i M j i M    + − + + − = g j i I g h j i M j i I x x ) , 1 ( ) , 1 ( max ) , (    + − + + − = g j i I g h j i M j i I y y ) 1 , ( ) 1 , ( max ) , (
  • 44.
    Local Alignment DPfor the Affine Gap Penalty Case −∞ = = = = , of column leftmost and row in top cells 0 ) , 0 ( 0 ) 0 , ( 0 ) 0 , 0 ( y x I I j M i M M • initialization • traceback – start at largest – stop at ) , ( j i M 0 ) , ( = j i M
  • 45.
    Gap Penalty Functions    = ≥ + = 0 , 0 1 , ) ( k k gk h k w gk k w= ) ( ) log( ) ( k g h k w × + = • linear: • affine: • concave: a function for which the following holds for all k, l, m ≥ 0 e.g. ) ( ) ( ) ( ) ( k w l k w m k w l m k w − + ≤ + − + +
  • 46.
    Concave Gap PenaltyFunctions ( ) ( ) w k m l w k m + + − + 0 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10 ( ) ( ) w k l w k + − ) ( ) ( ) ( ) ( k w l k w m k w l m k w − + ≤ + − + + l
  • 47.
    More On ScoringMatches • so far, we’ve discussed multiple gap penalty functions, but only one match-scoring scheme: +1 -1 = ) , ( i i y x s i i y x = when i i y x ≠ when • for protein sequence alignment, some amino acids have similar structures and can be substituted in nature: aspartic acid (D) glutamic acid (E)
  • 48.
    Substitution Matrices • twopopular sets of matrices for protein sequences – PAM matrices [Dayhoff et al., 1978] – BLOSUM matrices [Henikoff & Henikoff, 1992] • both try to capture the the relative substitutability of amino acid pairs in the context of evolution
  • 49.
  • 50.
    Heuristic Methods • thealgorithms we learned today take O(nm) time to align sequences, which is too slow for searching large databases – imagine an internet search engine, but where queries and results are protein sequences • heuristic methods do fast approximation to dynamic programming – example: BLAST [Altschul et al., 1990; Altschul et al., 1997] – break sequence into small (e.g. 3 base pair) “words” – scan database for word matches – extend all matches to seek high-scoring alignments – tradeoff: sensitivity for speed
  • 51.
    Multiple Sequence Alignment •we’ve only discussed aligning 2 sequences, but we may want to do more • discover common motifs in a set of sequences (e.g. DNA sequences that bind the same protein) • characterize a set of sequences (e.g. a protein family) • much more complex Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences
  • 52.
    Next Time… • basicmolecular biology • sequence alignment • probabilistic sequence models • gene expression analysis • protein structure prediction – by Ameet Soni