The Needleman-Wunsch Algorithm for Sequence Alignment

Saul B. Needleman & Christian D. Wunsch (1969)
KAVINDRI DILSHANI
H.M.K.G BANDARA
PARINDA RAJAPAKSHE

ABOUT RESEARCH
Title
• “A general method applicable to the search for similarities
in the amino acid sequence of two proteins”
Authors
• Saul B. Needleman & Christian D. Wunsch, Department of Biochemistry,
Northwestern University & Nuclear Medicine Service, V.A. Research
Hospital, Chicago, USA. (1969) [Cited by 8474]
S.B. Needleman & C.D. Wunsch, “A general method applicable to the search for similarities in
the amino acid sequence of two proteins”, J. Mol. Biol. (1970) 48, 443-453.

OUTLINE
• Introduction
- Sequence Alignment
- Approaches
- Needleman-Wunsch Algorithm vs. Dynamic Programming
• Example
- Optimal Alignment Score
- Optimal Alignment
- Algorithm Cost
• Applications
- Results & Discussion
- Methodology
- Usefulness

SEQUENCE ALIGNMENT
• Sequence alignment is a way of arranging two or more
sequences of characters to identify regions of similarity.
• Identification of residue-residue correspondences
• Sequence : Can be taken as ordered strings of letters.
• Sequences in Bio-Informatics ?
• DNA sequences
• RNA sequences
• Protein sequences

MOTIVATION
• Find homologous proteins
– Allows prediction of structure and function
• Locate similar subsequences in DNA
– e.g. allows identification of regulatory elements
– Infer Biological similarities
• Locate DNA sequences that might overlap
– Helps in sequence assembly

SEQUENCE ALIGNMENT - RESULTS
• Input: Two sequences over the same alphabet.
- GCGCATGGATTGAGCGA and TGCGCCATTGATGACCA
• Output: An alignment of the two sequences.
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
• The alignment contains three kinds of columns:
- Insertions / deletions (indels): a residue paired with a gap "-"
- Perfect matches: identical residues
- Mismatches: different residues

APPROACHES
• Sequence Alignment
- Qualitative: Dot-plot
- Quantitative: Global, Local, Multiple

QUALITATIVE
• Dot-plot
- Pictorial representation of the relationship between two sequences
- Uses a table or a matrix
- Does not quantify the similarity

QUANTITATIVE
• Construction of the best alignment between
the sequences.
• Assessment of the similarity from the
alignment. ( Numerically Quantifies)

GLOBAL SEQUENCE ALIGNMENT
• The best alignment over the entire length of two
sequences.
• Suitable : Two sequences are of similar length,
with a significant degree of similarity throughout .

LOCAL SEQUENCE ALIGNMENT
• Compares short portions of sequence or a whole
library of sequences with short portions of
another.
• Suitable : Comparing substantially different
sequences, which possibly differ significantly in
length and have only short patches of similarity.

MULTIPLE SEQUENCE ALIGNMENT
• Simultaneous alignment of more than two
sequences.
• Suitable : When searching for subtle
conserved sequence patterns in a protein family,
and when more than two sequences of the protein
family are available.

EXAMPLE
S1 : SIMILARITY S2 : PILLAR S3 : MOLARITY
• Global (S1, S2)
SIMILARITY
PI-LLAR---
• Local (S1, S2)
MILAR
ILLAR
• Multiple (S1, S2, S3)
SIMILARITY
PI-LLAR---
--MOLARITY

HOW TO QUANTIFY ?
• Introduces a Scoring Schema
• Set of rules which assigns the Alignment score to
any given alignment of two sequences.
• Alignment score : Goodness of Alignment
• Scoring Schema
Substitution scores
Gap penalties

THE SUBSTITUTION MATRIX
• Simple scoring schema for residue substitution
• The residue substitution costs can be expressed with
an N x N matrix (N is 4 for DNA and 20 for proteins).
      C    T    A    G
C     1   -1   -1   -1
T    -1    1   -1   -1
A    -1   -1    1   -1
G    -1   -1   -1    1

EXAMPLE
• Consider the "best" alignment of ATGGCGT and
ATGAGT
• +1 as a reward for a match, -1 as the penalty for a mismatch,
and ignore gaps (a column containing a gap scores 0)
ATGGCGT
ATG_AGT
Score: +1 + 1 + 1 + 0 - 1 + 1 + 1 = 4
Alternative alignment
ATGGCGT
A_TGAGT
Score: +1 + 0 - 1 + 1 - 1 + 1 + 1 = 2
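
The two scores above can be checked with a few lines of Python. This is only a sketch of the scoring rule stated on the slide (+1 match, -1 mismatch, gap columns ignored); the function name score_ignoring_gaps and the use of "_" as the gap character are assumptions made for this illustration.

```python
def score_ignoring_gaps(top, bottom, match=1, mismatch=-1, gap_char="_"):
    """Score an already-aligned pair of strings, column by column.

    Columns that contain a gap contribute 0, as in the slide's example.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            continue  # gap column: ignored in this simple scheme
        score += match if x == y else mismatch
    return score


print(score_ignoring_gaps("ATGGCGT", "ATG_AGT"))  # first alignment
print(score_ignoring_gaps("ATGGCGT", "A_TGAGT"))  # alternative alignment
```

Because gaps cost nothing in this scheme, any number of gaps can be inserted for free, which is why a gap penalty is needed in practice.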

BETTER MATRIX
• Certain changes in DNA /Protein sequences are more
likely to occur naturally than the others.
• Proteins are composed of twenty amino acids, and
physico-chemical properties of individual amino acids
vary considerably.
• Important to incorporate evolutionary relationships
for this substitution schema.

EVOLUTIONARY SUBSTITUTION MATRIX
• PAM ("point accepted mutation") family
- PAM250, PAM120, etc.
• BLOSUM ("Blocks substitution matrix") family
- BLOSUM62, BLOSUM50, etc.
• Derived from the analysis of known alignments of
closely related proteins
• Assigns variable weights to different substitution
operations.

GAPS
• A gap is a consecutive run of spaces in an
alignment; it may be introduced in either sequence
(insertion or deletion of residues).
• Objective: optimal sequence alignment with
meaningful alignments.
• Is a gap good?
– It interrupts the entire polymer chain
– In DNA it shifts the reading frame
• Therefore gaps receive a penalty

GAP PENALTIES
• Constant
- Whatever size the gap is, it receives the constant
negative penalty : -g
• Linear
- Depends linearly on the size of the gap.
- Parameter : -g, is the penalty per unit length of a gap.
• Affine
- Gap introduction cost > gap extension cost
- g = o + (L-1)e, where o is the introduction (opening) cost, e is the extension cost and L is the gap length
- |e| < |o|
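
The three penalty schemes can be compared side by side with a small sketch. The function names and the example parameter values are assumptions for illustration; here g, o and e are passed as positive magnitudes and each function returns the (negative) contribution to the alignment score.

```python
def constant_gap(length, g=2):
    """Same penalty for any gap, whatever its length."""
    return -g if length > 0 else 0


def linear_gap(length, g=2):
    """Penalty grows by g for every position in the gap."""
    return -g * length


def affine_gap(length, o=5, e=1):
    """Opening a gap costs o; each further position costs e (with e < o)."""
    return -(o + (length - 1) * e) if length > 0 else 0


for L in (1, 2, 3, 10):
    print(L, constant_gap(L), linear_gap(L), affine_gap(L))
```

With the affine scheme a single long gap is cheaper than several short ones of the same total length, which matches the idea that one insertion or deletion event can span many residues.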

ENRICHED SCORING SCHEMA
• The scoring scheme provides us with a quantitative
measure of how good an alignment is relative to
alternative alignments.
• Does this scoring scheme tell us how to find the
best alignment ?

BASIC APPROACH
• Brute-force approach
- Generate the list of all possible alignments between two
sequences, score them
- Select the alignment with the best score
– The number of possible global alignments between
two sequences of length N is approximately
$\frac{2^{2N}}{\sqrt{\pi N}}$
- For two sequences of 250 residues this is ~ $10^{149}$
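
The order of magnitude quoted for 250 residues can be reproduced numerically. The sketch below evaluates the approximation 2^(2N) / sqrt(πN) and, for comparison, the binomial coefficient C(2N, N) that this expression approximates; the function name approx_alignment_count is an assumption, and math.comb requires Python 3.8 or later.

```python
import math


def approx_alignment_count(n):
    """Approximate number of global alignments of two length-n sequences."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


n = 250
approx = approx_alignment_count(n)
exact_binomial = math.comb(2 * n, n)  # the quantity being approximated

print("approximation is about 10^%d" % math.floor(math.log10(approx)))
print("C(2N, N) has %d decimal digits" % len(str(exact_binomial)))
```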

NEEDLEMAN-WUNSCH ALGORITHM
• Reduce the massive number of possibilities that
need to be considered, yet still guarantees that the
best solution will be found.
• Global Sequence Alignment Technique.
• Build up the best alignment by using optimal
alignments of smaller sub sequences.
• Dynamic Programming

DYNAMIC PROGRAMMING
• Dynamic Programming is an algorithmic
paradigm.
- Breaking problem into sub problems
- Stores results of sub problems
- Avoids computing the same results again.
• Main properties of the problem
- Overlapping Sub problems
- Optimal substructure

OVERLAPPING SUB PROBLEMS
• Segregate main problem into sub-problems.
• Mainly used when solutions of same sub problems are
needed again and again.
• Computed solutions to sub problems are stored in a
table/matrix so that they do not have to be recomputed.
• Dynamic Programming is not useful when there are no
common (overlapping) sub problems.
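
Overlapping sub problems can be made visible with a memoized recursive scorer. This is a sketch only: it uses the alignment recurrence and the +1 / -1 / -2 scoring values of the worked example in this deck, the function name alignment_score is an assumption, and functools.lru_cache plays the role of the table that stores sub-problem results.

```python
from functools import lru_cache


def alignment_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (optimal global score, number of distinct sub problems solved)."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # best(i, j): optimal score for the prefixes s1[:i] and s2[:j]
        if i == 0:
            return j * gap
        if j == 0:
            return i * gap
        sigma = match if s1[i - 1] == s2[j - 1] else mismatch
        return max(best(i - 1, j - 1) + sigma,
                   best(i - 1, j) + gap,
                   best(i, j - 1) + gap)

    score = best(len(s1), len(s2))
    # Each cache miss is one sub problem that actually had to be computed.
    return score, best.cache_info().misses


print(alignment_score("TGGTG", "ATCGT"))
```

Without the cache the same prefixes would be re-scored many times; with it, each of the (len(s1)+1) x (len(s2)+1) sub problems is computed once.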

OPTIMAL SUBSTRUCTURE
• A problem has optimal substructure if an optimal solution
can be constructed efficiently from optimal solutions of its sub problems.
• Optimal global solution contains the optimal
solutions of all its sub problems.
• Dynamic Programming is not useful when there
isn’t optimal substructure in the problem.

HOW IT WORKS?
• Governed by three steps
- Break the problem into smaller sub problems.
- Solve the smaller problems optimally
- Use the sub-problem solutions to construct an
optimal solution for the original problem
• The Needleman-Wunsch Algorithm applies
the dynamic programming paradigm to obtain the optimal global
alignment and the corresponding score.

Definitions
• A scoring function (σ)
defines the score to give to a substitution mutation
e.g. +1 for a match, -1 for a mismatch
• A gap penalty
defines the score to give to an insertion or deletion
e.g. -1
• A recurrence relation
defines what actions we repeat at each iteration (step) of the
algorithm
$$T(i, j) = \max \begin{cases} T(i-1, j-1) + \sigma(S1(i), S2(j)) \\ T(i-1, j) + \text{gap penalty} \\ T(i, j-1) + \text{gap penalty} \end{cases}$$

Steps
• Step 1
– Fill up a matrix (table) T using the recurrence
relation
• Step 2
– The trace back step uses the filled-in matrix T to
work out the best alignment
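
The two steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' original program: the function name needleman_wunsch and its parameters are assumptions, it uses simple match/mismatch scores with a linear gap penalty, and when several trace backs are optimal it returns only one of them.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s1, aligned_s2) for a global alignment."""
    n, m = len(s1), len(s2)

    def sigma(i, j):
        return match if s1[i - 1] == s2[j - 1] else mismatch

    # Step 1: fill the table T with the recurrence relation.
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = i * gap
    for j in range(1, m + 1):
        T[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(T[i - 1][j - 1] + sigma(i, j),
                          T[i - 1][j] + gap,
                          T[i][j - 1] + gap)

    # Step 2: trace back from T[n][m] to T[0][0].
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + sigma(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1])
            i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-")
            i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1])
            j -= 1
    return T[n][m], "".join(reversed(a1)), "".join(reversed(a2))


score, top, bottom = needleman_wunsch("TGGTG", "ATCGT")
print(score)
print(top)
print(bottom)
```

The table here is indexed as T[i][j] with i running over S1 and j over S2, so it is the transpose of a layout that draws S1 along the columns; the scores are the same either way.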

Work Out
• Sequences
S1= TGGTG
S2= ATCGT
• Scoring function
For matches : +1
For mismatches : -1
• Substitution Matrix
      A    C    G    T
A    +1   -1   -1   -1
C    -1   +1   -1   -1
G    -1   -1   +1   -1
T    -1   -1   -1   +1

Work Out cont..
• Initializing the table
- Columns i=0..5 : S1 = T G G T G (i=0 is the empty prefix)
- Rows j=0..5 : S2 = A T C G T (j=0 is the empty prefix)
- The table is filled left to right, top to bottom
Step 1 - The value of T(0,0) is set to zero at the start

Work Out cont..
• T(0,0) = 0
• Gap penalty = -2
• Cells used to compute T(i, j)
- T(i-1, j-1) + σ(S1(i), S2(j)) : previous column & previous row
- T(i-1, j) + gap penalty : previous column & same row
- T(i, j-1) + gap penalty : same column & previous row
• Substitution matrix
      A    C    G    T
A    +1   -1   -1   -1
C    -1   +1   -1   -1
G    -1   -1   +1   -1
T    -1   -1   -1   +1

Work Out cont..
• Gap penalty = -2
• Rows j=0 and j=1 filled in, using T(i-1, j-1) + σ(S1(i), S2(j)) and the gap penalty
              T    G    G    T    G
       i=0  i=1  i=2  i=3  i=4  i=5
  j=0    0   -2   -4   -6   -8  -10
A j=1   -2   -1   -3   -5   -7   -9
T j=2
C j=3
G j=4
T j=5

Work Out cont..
• Gap penalty = -2
• Completed table
              T    G    G    T    G
       i=0  i=1  i=2  i=3  i=4  i=5
  j=0    0   -2   -4   -6   -8  -10
A j=1   -2   -1   -3   -5   -7   -9
T j=2   -4   -1   -2   -4   -4   -6
C j=3   -6   -3   -2   -3   -5   -5
G j=4   -8   -5   -2   -1   -3   -4
T j=5  -10   -7   -4   -3    0   -2

Work Out cont..
Trace Back
S1= TGGTG
S2= ATCGT
S1: - T G G T G
      |   | |
S2: A T C G T -
→ Score = 3-1-4 = -2 (3 matches, 1 mismatch, 2 gaps)

Work Out cont..
• Two possible trace backs ?
match:+1
mismatch:-1
gap:-2
           W    H    A    T
      0   -2   -4   -6   -8
W    -2    1   -1   -3   -5
H    -4   -1    2    0   -2
Y    -6   -3    0    1   -1
(Pink traceback)
W H A T
| |
W H - Y
(Orange traceback)
W H A T
| |
W H Y -
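
Ties in the table are what produce more than one trace back. The sketch below follows every branch that reproduces the value of a cell, so it lists all optimal alignments rather than just one. The function name all_optimal_alignments is an assumption, and the scoring values are the ones given for this example (+1, -1, -2).

```python
def all_optimal_alignments(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (score, list of (aligned_s1, aligned_s2)) for all optimal paths."""
    n, m = len(s1), len(s2)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = i * gap
    for j in range(1, m + 1):
        T[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            T[i][j] = max(T[i - 1][j - 1] + s, T[i - 1][j] + gap, T[i][j - 1] + gap)

    results = []

    def walk(i, j, a1, a2):
        # a1 and a2 are built back to front and reversed at the end.
        if i == 0 and j == 0:
            results.append((a1[::-1], a2[::-1]))
            return
        if i > 0 and j > 0:
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            if T[i][j] == T[i - 1][j - 1] + s:
                walk(i - 1, j - 1, a1 + s1[i - 1], a2 + s2[j - 1])
        if i > 0 and T[i][j] == T[i - 1][j] + gap:
            walk(i - 1, j, a1 + s1[i - 1], a2 + "-")
        if j > 0 and T[i][j] == T[i][j - 1] + gap:
            walk(i, j - 1, a1 + "-", a2 + s2[j - 1])

    walk(n, m, "", "")
    return T[n][m], results


score, alignments = all_optimal_alignments("WHAT", "WHY")
print(score)
for top, bottom in alignments:
    print(top, bottom)
```

Both alignments have the same score, so the algorithm alone cannot prefer one; the number of co-optimal alignments can grow quickly for longer sequences, which is why many implementations report a single trace back.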

Performance
• The N-W algorithm takes time proportional to $n^2$
• Assessing all possible alignments one by one takes time proportional to $\binom{2n}{n}$
• $n^2 < \binom{2n}{n}$
• N-W is much faster than assessing all possible alignments one by one

Role of weighting factors
in evaluating a maximum
match
• Proteins expected to exhibit homology
- Whale myoglobin
- Human β-hemoglobin
• Proteins not expected to exhibit homology
- Bovine pancreatic ribonuclease
- Hen’s egg lysozyme

APPLICATION OF THE METHOD
• Identification of the types of amino acid pairs
• Establish variable sets consisting of values to be assigned
to each type of pair
• Determine a value for the penalty

TYPES OF AMINO ACID PAIRS
• Type 3: Pairs having a maximum of three corresponding bases in their codons
• Type 2: Pairs having a maximum of two corresponding bases
• Type 1: Pairs having a maximum of one corresponding base
• Type 0: Pairs having no possible corresponding base
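
The pair type can be computed from the standard genetic code by comparing every codon of one amino acid with every codon of the other and keeping the largest number of positions that agree. The sketch below is an illustration under stated assumptions: it uses one-letter amino acid codes, writes codons with DNA letters (T in place of the RNA base U), and the names CODONS and pair_type are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G
# at each position; "*" marks stop codons.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")

CODONS = {}
for bases, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(bases))


def pair_type(aa1, aa2):
    """Maximum number of corresponding bases between codons of two amino acids."""
    return max(sum(x == y for x, y in zip(c1, c2))
               for c1 in CODONS[aa1]
               for c2 in CODONS[aa2])


print(pair_type("A", "A"))  # identical amino acids
print(pair_type("F", "L"))  # Phe / Leu
print(pair_type("M", "W"))  # Met / Trp
print(pair_type("M", "Y"))  # Met / Tyr
```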

METHODOLOGY
• Reading the amino acid sequences to be compared into the
computer
• Maximum-Correspondence array
– Contains all possible pairs of amino acids
– Identifies each pair with the corresponding type
• Generating the two-dimensional array row-by-row
• Assigning the variable set containing the type values and the
appropriate value from that set to the appropriate cell of the
comparison array

Nucleotide sequences of
RNA codons recognized by
AA-tRNA*
*Marshall RE, Caskey CT, Nirenberg M. Fine
structure of RNA codewords recognized by
bacterial, amphibian, and mammalian transfer
RNA. Science. 1967 Feb 17;155(3764):820–826.

METHODOLOGY
• Determination of the maximum-match by the procedure
of successive summations
• Randomizing the amino acid sequence of only one
member of the protein pair
– Sequences of β-hemoglobin and ribonuclease
– Randomization procedure: A sequence shuffling routine based
on computer-generated random numbers
• Repeating the cycle of sequence randomization &
maximum-match determination
• Estimating the average and standard deviation for the
random values of each variable set
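
The randomization cycle can be sketched as follows. This is not the paper's program: the paper's maximum-match values come from its own pair values and penalty, whereas this sketch substitutes a simple +1 / -1 / -2 Needleman-Wunsch score as a stand-in, and the function names, the seed parameter and the short test sequences are all assumptions for illustration.

```python
import random
import statistics


def nw_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Global alignment score, keeping only one row of the table at a time."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [i * gap]
        for j in range(1, len(s2) + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def random_baseline(s1, s2, trials=10, seed=0):
    """Shuffle s2 repeatedly; return mean and standard deviation of the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        shuffled = list(s2)
        rng.shuffle(shuffled)  # randomize only one member of the pair
        scores.append(nw_score(s1, "".join(shuffled)))
    return statistics.mean(scores), statistics.stdev(scores)


real = nw_score("TGGTG", "ATCGT")
mean, sd = random_baseline("TGGTG", "ATCGT", trials=10, seed=0)
print(real, mean, sd)
```

Comparing the real score with the mean and standard deviation of the shuffled scores gives a rough idea of whether the observed match is larger than what sequence composition alone would produce.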

RESULTS AND DISCUSSION
• A small random sample size (ten)
• Assumption: For each set of variables the random
values would be distributed in the fashion
of the normal-error curve
• The values of the first six random sets in the β-hemoglobin–
myoglobin comparison were converted to standard measures
• Probit plot

β-HEMOGLOBIN –
MYOGLOBIN MAXIMUM MATCHES

RIBONUCLEASE –
LYSOZYME MAXIMUM MATCHES

USEFULNESS
• To detect homology and define its nature
• Assumption:
– Homologous proteins are the result of gene duplication
and subsequent mutations
• Construct several hypothetical amino-acid sequences that would
be expected to show homology
– Following the duplications, point mutations occur at a
constant or variable rate
• After a relatively short period of time pairs will have nearly
identical sequences

DETECTION OF THE HIGH DEGREE
OF HOMOLOGY PRESENT
• Use of values for non-identical pairs
• Assigning a relative high penalty for gaps
• Attaching substantial weight
• Reducing the penalty
• Assessing a very small or even negative penalty factor

THE NATURE OF HOMOLOGY
• Indication?
–Variables which maximize the significance of
the difference between real and random
proteins

EVOLUTIONARY DIVERGENCE
• Similar populations accumulate difference over
evolutionary time, and so become increasingly
distinct

• “Divergent evolution” can be applied to molecular
biology characteristics.
• To genes and proteins derived from two or more
homologous genes
• Assignment of weight to type 2 pairs
– Enhances the significance of the results
– Substantial Evolutionary Divergence

• Exception??
– Evolutionary divergence manifested by
cytochrome and other heme proteins
• Non-random mutations along the genes

THE DEGREE
& TYPE OF HOMOLOGY
• Differ between protein pairs
• Due to the difference
– No a priori best set of cell and operation values
– No best set of values to detect only slight
homology

METHODS OF DETERMINING
THE DEGREE OF HOMOLOGY
• Counting the number of non-identical pairs in the
homologous comparison
• Counting the number of mutations represented by the
non-identical pairs
• Measure of evolutionary distance
