Pairwise sequence Alignment

                    Dr Avril Coghlan
                   alc@sanger.ac.uk

Note: this talk contains animations which can only be seen by
downloading and using ‘View Slide show’ in Powerpoint
Sequence comparison
• How can we compare the human & Drosophila
  melanogaster Eyeless protein sequences?
  One method is a dotplot
• A dotplot is a graphical (visual) approach
  Regions of local similarity between the 2 sequences appear as diagonal
       lines of coloured cells (‘dots’)
  [Figure: dotplot of fruitfly Eyeless against human Eyeless; window-size = 10, threshold = 5]
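A minimal Python sketch of how a windowed dotplot could be computed. The function name `dotplot_cells` and the toy input are illustrative assumptions, not the tool used to draw the figure. It also assumes that a cell is coloured when at least `threshold` of the `window` letters starting at that position are identical; real dotplot programs may define these parameters differently.

```python
def dotplot_cells(seq1, seq2, window=10, threshold=5):
    """Return the set of (i, j) window start positions at which the two
    windows share at least `threshold` identical letters at the same offset."""
    cells = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(
                1 for k in range(window) if seq1[i + k] == seq2[j + k]
            )
            if matches >= threshold:
                cells.add((i, j))
    return cells


# Toy example: two short, similar sequences (not the Eyeless proteins),
# with a smaller window so that the short input still yields several cells.
cells = dotplot_cells("QKGSYPVRSTC", "QKGSGPVRSTC", window=4, threshold=3)

# Cells on the main diagonal (i == j) form the diagonal line of 'dots'.
main_diagonal = sorted(cell for cell in cells if cell[0] == cell[1])
print(main_diagonal)
```

A larger window with a higher threshold suppresses chance matches, which is why similar regions stand out as diagonal lines.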
Sequence alignment
• A second method for comparing sequences is a
  sequence alignment
• An alignment is an arrangement in columns of 2
  sequences, highlighting their similarity
  The sequences are padded with gaps (dashes) so that wherever
  possible, alignment columns contain identical letters from the two
  sequences involved
  An insertion or deletion is represented by ‘–’ (a gap)
  The symbol “|” is used to represent matches
  eg. here is an alignment for amino acid sequences
  “QKGSYPVRSTC” & “QKGSGPVRSTC”:

            Q K G S Y P V R S T C
            | | | |   | | | | | |
            Q K G S G P V R S T C
            1 2 3 4 5 6 7 8 9 10 11

  This alignment has 11 columns: there are 10 matches and there is 1 mismatch
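The column counts for this alignment can be checked with a short Python sketch. The function name `alignment_summary` is a hypothetical helper introduced here; it assumes the two aligned rows are given as strings of equal length, with '-' marking a gap.

```python
def alignment_summary(row1, row2, gap="-"):
    """Count columns, matches, mismatches and gap columns in an alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same number of columns")
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            gaps += 1          # column containing an insertion/deletion
        elif a == b:
            matches += 1       # identical letters: drawn as '|'
        else:
            mismatches += 1    # different letters: a substitution
    return {
        "columns": len(row1),
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
    }


summary = alignment_summary("QKGSYPVRSTC", "QKGSGPVRSTC")
print(summary)
# {'columns': 11, 'matches': 10, 'mismatches': 1, 'gaps': 0}
```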
Sequence alignment
• An alignment of the human and fruitfly
  (Drosophila melanogaster) Eyeless proteins:
What does an alignment mean?
• An alignment tells you what mutations
  occurred in the sequences since the sequences
  shared a common ancestor
  eg. an alignment of the human & fruitfly Eyeless suggests:
  (i) there were probably deletion(s) at the start of the human
  Eyeless, or insertion(s) at the start of fruitfly Eyeless




  (ii) there was probably a G→N substitution in human Eyeless, or an N→G
         substitution in fruitfly Eyeless (see arrow)
How do we make an alignment?
• Given two or more sequences, what is the best way
  to align them to each other?
  We want the alignment columns to contain identical letters
• Comparison of similar sequences of similar length is
  straightforward
  eg. for amino acid sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”, we
       line up the identical letters in columns:

               Q K G S Y P V R S T C            sequence 1
               | | | |   | | | | | |
               Q K G S G P V R S T C            sequence 2

  The alignment implies that one mutation occurred since the two
  sequences shared a common ancestor
  That is, the alignment implies there was a G→Y substitution in
  sequence 1 or a Y→G substitution in sequence 2
Problem
• Are there other possible plausible alignments for
  sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”?
Answer
• Are there other possible plausible alignments for
  sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”?
  There are many other possible alignments, eg. :

  Q K G S Y - P V R S T C
  | | |       | | | | | |
  Q K G - S G P V R S T C

  Q K G S - Y P V R S T C
  | | | |       | | | | |
  Q K G S G P - V R S T C

  Q K G - - - - - S Y P V R S T C
  | | |           |           | |
  Q K G S G P V R S - - - - - T C

  Q K - G S Y P V R S T C
  | |                   |
  Q K G S G P V R S T - C

  etc. etc. etc. . . .
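As a hedged illustration, the Python sketch below checks that each of the four alternatives is a legitimate alignment of the two sequences and rebuilds its "|" match line. The helper names `is_valid_alignment` and `match_line` are assumptions made for this example. An alignment is treated as valid when removing the gaps recovers both original sequences and no column holds two gaps.

```python
def is_valid_alignment(row1, row2, seq1, seq2, gap="-"):
    """True if the rows have equal length, removing the gaps recovers the
    original sequences, and no column consists of two gaps."""
    same_length = len(row1) == len(row2)
    recovers = row1.replace(gap, "") == seq1 and row2.replace(gap, "") == seq2
    no_double_gap = all(not (a == gap and b == gap) for a, b in zip(row1, row2))
    return same_length and recovers and no_double_gap


def match_line(row1, row2, gap="-"):
    """Build the middle line: '|' under identical letters, ' ' elsewhere."""
    return "".join(
        "|" if a == b and a != gap else " " for a, b in zip(row1, row2)
    )


seq1, seq2 = "QKGSYPVRSTC", "QKGSGPVRSTC"
alternatives = [
    ("QKGSY-PVRSTC", "QKG-SGPVRSTC"),
    ("QKGS-YPVRSTC", "QKGSGP-VRSTC"),
    ("QKG-----SYPVRSTC", "QKGSGPVRS-----TC"),
    ("QK-GSYPVRSTC", "QKGSGPVRST-C"),
]

all_valid = all(is_valid_alignment(r1, r2, seq1, seq2) for r1, r2 in alternatives)
match_counts = [match_line(r1, r2).count("|") for r1, r2 in alternatives]
print(all_valid)       # True
print(match_counts)    # [9, 9, 6, 3]
```

All four are valid arrangements of the same two sequences, but they contain fewer matching columns than the gap-free alignment, which has 10.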
Number of possible pairwise alignments
• There are lots of different possible alignments for
  two sequences that are both of length n
  The number of possible alignments of 2 seqs of length n letters (amino
  acids/nucleotides) is $\binom{2n}{n}$ (“2n choose n”)
  $\binom{2n}{n}$ can be calculated as
      $\binom{2n}{n} = \frac{(2n)!}{n! \, n!}$
  where n! (‘n factorial’) = n * (n - 1) * (n - 2) * (n - 3) * ... * 3 * 2 * 1
• For example, for “QKGSYPVRSTC” &
  “QKGSGPVRSTC”, n (length) = 11 letters
  The number of possible alignments of these two sequences is
  $\binom{2 \times 11}{11} = \binom{22}{11} = \frac{22!}{11! \times 11!} = \frac{22!}{39916800 \times 39916800}$

  = 1.124001e+21/1.593351e+15 = 705,432 possible alignments
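The arithmetic can be reproduced with a small Python sketch (the function name `count_alignments` is an illustrative assumption). It uses exact integer arithmetic, so no rounding of 22! is involved, and cross-checks the factorial formula against the standard library's `math.comb` (Python 3.8 or later).

```python
import math


def count_alignments(n):
    """Number of possible alignments of two sequences of length n,
    using the (2n choose n) = (2n)! / (n! * n!) formula."""
    return math.factorial(2 * n) // (math.factorial(n) ** 2)


n = len("QKGSYPVRSTC")
print(n)                     # 11
print(math.factorial(n))     # 39916800
print(count_alignments(n))   # 705432

# The factorial formula agrees with the built-in binomial coefficient.
assert count_alignments(n) == math.comb(2 * n, n)
```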
Number of possible pairwise alignments
• Even for relatively short sequences, $\binom{2n}{n}$ is large, so
  there are lots of possible alignments
  eg. for two sequences that are both 11 letters long, there are
  705,432 possible alignments
• In fact, the number of possible alignments, $\binom{2n}{n}$,
  increases exponentially with the sequence length (n)
  ie. $\binom{2n}{n}$ grows roughly like $2^{2n}$ (more precisely, it is
  close to $2^{2n} / \sqrt{\pi n}$ for large n)

  [Figure: plot of the number of possible alignments against the length of sequences (n)]
  For two sequences that are both 17 letters long (n=17), there are 2.3 billion
  possible alignments
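A short Python sketch of this growth, under the assumption that the count is (2n choose n) as stated above. The function name `growth_table` is hypothetical. Alongside the exact count it prints the standard Stirling-type estimate 4^n / sqrt(pi * n), which is an addition for comparison and not part of the original slide.

```python
import math


def growth_table(lengths):
    """For each n, return (n, exact count, Stirling-type estimate)."""
    rows = []
    for n in lengths:
        exact = math.comb(2 * n, n)                    # (2n choose n)
        estimate = 4 ** n / math.sqrt(math.pi * n)     # 2^(2n) / sqrt(pi * n)
        rows.append((n, exact, estimate))
    return rows


rows = growth_table([5, 11, 17, 20])
for n, exact, estimate in rows:
    print(f"n={n:2d}  exact={exact:>15,d}  estimate={estimate:.3e}")

# For n = 17 the exact count is 2,333,606,220, i.e. about 2.3 billion.
```

Each extra letter multiplies the count by a factor approaching 4, so trying every possible alignment quickly becomes impractical.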
• Many of the possible alignments for 2 seqs are
  implausible as they imply many mutations occurred
  (but it’s known mutations are rare)
  eg. for amino acid sequences “QKGSYPVRSTC” & “QKGSGPVRSTC”, the
        alignment made by lining the identical letters into columns only
        implies one mutation:
  Q K G S Y P V R S T C              This alignment implies that 1 G→Y or
  | | | |   | | | | | |              Y→G substitution occurred
  Q K G S G P V R S T C

  Many of the alternative alignments for these two sequences imply
  that many more mutations occurred, eg. :

  Q K G S Y - P V R S T C
  | | |       | | | | | |
  Q K G - S G P V R S T C

  This alignment implies that 1 S→Y or Y→S substitution occurred;
  that 1 insertion of S or deletion of S occurred;
  and that 1 deletion of G or insertion of G occurred
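The reasoning can be sketched in Python by listing the mutation implied by each non-matching column. The function name `implied_events` is an assumption made for this example. As a simplification, every gap column is counted as its own event, whereas a run of adjacent gaps could equally be explained by a single longer insertion or deletion.

```python
def implied_events(row1, row2, gap="-"):
    """List the mutation implied by each column in which the rows differ."""
    events = []
    for a, b in zip(row1, row2):
        if a == b:
            continue  # identical letters imply no mutation
        if a == gap:
            events.append(
                f"deletion of {b} in sequence 1 or insertion of {b} in sequence 2"
            )
        elif b == gap:
            events.append(
                f"insertion of {a} in sequence 1 or deletion of {a} in sequence 2"
            )
        else:
            events.append(
                f"{b}->{a} substitution in sequence 1 or "
                f"{a}->{b} substitution in sequence 2"
            )
    return events


simple = implied_events("QKGSYPVRSTC", "QKGSGPVRSTC")
gapped = implied_events("QKGSY-PVRSTC", "QKG-SGPVRSTC")
print(len(simple), simple)
print(len(gapped))
for event in gapped:
    print(" ", event)
```

The gap-free alignment implies one mutation and the gapped alternative implies three, so the gap-free one is the more plausible if mutations are rare.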
Further Reading
•   Chapter 3 in Introduction to Computational Genomics by Cristianini & Hahn
•   Practical on pairwise alignment in R in the Little Book of R for
    Bioinformatics:
    https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html


Editor's Notes

  • #5 Made alignment of human.fa and fly.fa using Needleman-Wunsch with default parameters at: http://emboss.bioinformatics.nl/cgi-bin/emboss/needle (EMBOSS needle)
    Human Eyeless (PAX6) from: http://www.treefam.org/cgi-bin/TFseq.pl?id=ENST00000379111.1
    D. melanogaster Eyeless from: http://www.treefam.org/cgi-bin/TFseq.pl?id=FBtr0100396.5
    Viewed in Jalview, and saved as humanfly_needlemanwunsch.png
  • #10 In R: factorial(22) / (factorial(11) * factorial(11))
  • #11 N.B. (2n choose n) = the binomial coefficient = the number of ways that n things can be 'chosen' from a set of 2n things = (2n)! / (n! * n!). This can be shown to be approximately proportional to 2^(2*n) (Deonier, Tavare & Waterman book, pages 158-9). Graph made using Wolfram Alpha at http://www.wolframalpha.com/ and typing “plot 2n choose n from 1 to 20”.