Introduction to DNA Sequencing
Colin A. Graham and Alison J. M. Hill
From: Methods in Molecular Biology, vol. 167: DNA Sequencing Protocols, 2nd ed. Edited by: C. A. Graham and A. J. M. Hill. Humana Press Inc., Totowa, NJ
1. Introduction
DNA sequencing methods were first developed more than 20 years ago with the publication of two approaches that became known as Sanger sequencing (1), based on enzymatic synthesis from a single-stranded DNA template with chain termination using dideoxynucleotides (ddNTPs), and Maxam-Gilbert sequencing (2), which involved chemical degradation of end-radiolabeled DNA fragments. Both methods relied on four-lane, high-resolution polyacrylamide gel electrophoresis to separate the labeled fragments and allow the base sequence to be read in a staggered ladder-like fashion. Sanger sequencing was technically easier and faster, and thus became the main DNA sequencing method for the vast majority of applications.
The chain termination method relies on the dideoxynucleotide lacking a 3'-OH group, which is required for extension of the sugar phosphate backbone. Thus, DNA polymerases cannot extend the template copy chain beyond the incorporated ddNTP. In practice, four reactions are set up for each sequence, incorporating respectively a proportion of ddATP, ddGTP, ddCTP, and ddTTP. The ddATP reaction will terminate a proportion of chains at every A occurrence in the sequence; the chains are radiolabeled by incorporation of, for example, [α-32P]dATP or [α-35S]dATP. Polyacrylamide gel electrophoresis of the A-reaction will give a ladder on the autoradiograph representing the chain length from the sequencing primer to each A base in the sequence. When this is repeated for each base type, the resulting autoradiograph chain lengths can be read off as the DNA sequence.
Automated DNA sequencing using fluorescent/infrared primer or terminator labeling and a variety of detection systems, coupled with software-based sequence determination, has greatly improved the ease, speed, accuracy, and reliability of DNA sequencing. Manual DNA sequencing methods are still used in some smaller laboratories and are covered in Chapters 3–5 and 8. Methods for gel pouring and producing gradient gels and detection systems can be found in standard molecular biology laboratory protocol books (3).
2. DNA Sequencing Chemistries
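Before turning to the individual enzymes and labels, the four-reaction chain termination scheme described in Subheading 1 can be sketched in code. This is an idealized illustration only: it assumes that every possible termination product is present and perfectly resolved, and the function names and the example sequence are hypothetical, chosen purely for demonstration.

```python
def termination_ladders(synthesized, primer_length=0):
    """Return, for each ddNTP reaction, the chain lengths at which
    termination can occur (one band per occurrence of that base).

    Idealized assumption: every termination product is made and resolved.
    """
    ladders = {base: [] for base in "ACGT"}
    for position, base in enumerate(synthesized, start=1):
        ladders[base].append(primer_length + position)
    return ladders


def read_autoradiograph(ladders):
    """Read the four lanes from the shortest to the longest fragment."""
    bands = sorted(
        (length, base)
        for base, lengths in ladders.items()
        for length in lengths
    )
    return "".join(base for _, base in bands)


# Hypothetical newly synthesized strand (read 5' to 3' from the primer).
synthesized = "GATTACAGGC"
ladders = termination_ladders(synthesized)
print(ladders["A"])                  # band positions in the A lane
print(read_autoradiograph(ladders))  # sequence recovered from all four lanes
```

Each lane contributes only the chain lengths that end in its own base, so the sequence is recovered by interleaving the four ladders in order of increasing length.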
2.1. Sequencing Enzymes
A number of different enzymes are currently being used for DNA sequencing, with the choice of enzyme for any particular situation being dependent on the method of sequencing (manual or automated), the template being used (direct sequencing or cloned template), and the detection method (radiation, biotin, or fluorescence) (4). Some enzymes, such as Taq polymerase, are thermostable and readily lend themselves to use in automated sequencing reactions such as cycle sequencing. Others, such as Klenow polymerase and reverse transcriptase, can, like Taq polymerase, be applied to both direct sequencing of polymerase chain reaction (PCR) products and cloned template, but because of their thermal instability, cannot be used in cycle sequencing. Another enzyme, modified T7 polymerase (Sequenase), has also been used successfully in radioactive and fluorescence cycle sequencing based protocols. It is commonly used for solid-phase sequencing where the template is biotinylated and made
single stranded, then immobilized on streptavidin-coated beads or columns. The complementary nonbiotinylated strand is then washed away, leaving the single strand ready to sequence. Perhaps the biggest disadvantage of Klenow and reverse transcriptase is that they require relatively large amounts of template DNA, and whereas manual, radioactive solid-phase sequencing with Sequenase uses quite small amounts of template, the fluorescence-based systems, which are more easily automated, require larger amounts (5). Sequencing with these enzymes will therefore entail the use of much larger amounts of starting DNA, which may be in short supply.
Like T7 DNA polymerase, Taq polymerase has a high rate of dNTP incorporation and high processivity. When necessary, it can be used to incorporate dNTP analogs such as 7-deaza-2'-deoxyguanosine triphosphate, thus helping to resolve areas of compression in the template. Its thermostability allows high reaction temperatures to be used, again allowing resolution of secondary structures. Taq polymerase has a high Km for dNTPs and an absence of 3'-exonuclease activity. This can result in misincorporation of bases and inappropriate termination if the dNTP concentration drops too low; however, these problems are relatively minor when compared to the problems of sequence anomalies being propagated during the cloning process.
No matter which system is used, the required end result is to produce labeled sequence product, and there are several ways of labeling the DNA.
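The sensitivity of a high-Km enzyme to falling dNTP concentration, noted above for Taq polymerase, can be illustrated with the standard Michaelis-Menten relationship, v/Vmax = [S]/(Km + [S]). The sketch below is a simplification: it models only the relative incorporation rate, not misincorporation or premature termination, and the two Km values are hypothetical placeholders rather than measured constants for any enzyme.

```python
def relative_rate(dntp_conc_um, km_um):
    """Fraction of the maximal incorporation rate, v/Vmax = [S] / (Km + [S]).

    Both arguments are in micromolar; simple Michaelis-Menten kinetics
    are assumed for illustration.
    """
    return dntp_conc_um / (km_um + dntp_conc_um)


# Hypothetical Km values, chosen only to contrast a low-Km and a high-Km enzyme.
LOW_KM_UM = 2.0
HIGH_KM_UM = 20.0

for conc in (200.0, 20.0, 2.0):
    low = relative_rate(conc, LOW_KM_UM)
    high = relative_rate(conc, HIGH_KM_UM)
    print(f"[dNTP] = {conc:6.1f} uM  low-Km: {low:.2f}  high-Km: {high:.2f}")
```

Under these assumed values, the high-Km enzyme loses a much larger fraction of its maximal rate as the dNTP pool is depleted, which is why dNTP concentrations need to be kept well above Km in reactions using such an enzyme.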
2.2. Labeled Primers
When carrying out sequencing reactions, there are two main methods of incorporating the label of choice into the sequence product, and each method has its own particular advantages and disadvantages. The first of these involves attaching the label to the sequencing primer. Primers may be biotinylated or radioactively or fluorescently labeled, and are used in cycle sequencing in conjunction with unlabeled ddNTP terminators that stop DNA extension when they are incorporated. If a single label is used, for example a radioactive tag such as 35S, then four separate reactions must be set up, each containing one of the four ddNTPs (ddGTP, ddCTP, ddATP, or ddTTP) together with the deoxynucleotide triphosphates (dGTP, dCTP, dATP, and dTTP). The reactions are then run separately, that is, on four lanes of a sequencing gel, with the resultant bands in each lane corresponding to a specific base termination. The sequence is then read stepwise up the gel across all four lanes. Using single-label methods is therefore time consuming and expensive, with each sample requiring four reactions and four lanes on the gel.
Alternatively, the primer can be synthesized four times, each time with a distinct tag such as a different fluorescent dye. Again, four separate reactions are required, but these may be pooled and run in a single lane on the gel as long as the detection system in use is capable of distinguishing between the labels. Manufacturing the primer four times may be expensive, so this method is most cost effective if a particular locus is to be sequenced in many templates, perhaps during a screening program in a clinical diagnostic service. In its favor, however, the labeled primer technique (at least when using fluorescence) does result in very clean data with extremely even peak heights, which is a major advantage in heterozygote analysis. This is probably because the label, a relatively large molecule, is attached to the primer and does not have to be incorporated into an extending DNA chain (6).
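The difference between single-label and four-label runs can be sketched as follows. In a pooled four-dye lane, each fragment carries a dye that identifies its terminating base, so the sequence is read by ordering fragments by length and translating dye to base. The dye-to-base assignment, the function names, and the example peaks below are all hypothetical; real dye sets are specific to the instrument and chemistry in use.

```python
# Hypothetical dye-to-base assignment, for illustration only.
DYE_TO_BASE = {"blue": "C", "green": "A", "yellow": "G", "red": "T"}


def call_bases(peaks):
    """Read a pooled four-dye lane.

    peaks: iterable of (fragment_length, dye) pairs. The shortest fragment
    migrates fastest, so fragments are read in order of increasing length.
    """
    return "".join(DYE_TO_BASE[dye] for _, dye in sorted(peaks))


def lanes_needed(samples, single_label):
    """Gel lanes required: four per sample with one label, one per sample
    when four distinguishable labels are pooled."""
    return samples * 4 if single_label else samples


# Hypothetical detector output for one lane, in arbitrary order.
peaks = [(3, "red"), (1, "yellow"), (2, "green"), (4, "red")]
print(call_bases(peaks))
print(lanes_needed(24, single_label=True), lanes_needed(24, single_label=False))
```

For 24 samples, the single-label approach in this sketch occupies 96 lanes, whereas pooled four-dye reactions occupy 24, which is the saving in gel capacity described above.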
2.3. Labeled Terminators
The second method of labeling sequencing products is to tag the ddNTPs, usually fluorescently, each with a specific label. Thus, every strand of DNA produced during the sequencing reaction is flagged at its termination site with a label specific for the base at that position. As with labeled primer sequencing, it is possible to label all four ddNTPs with the same label, but this then requires four reactions to be set up and four lanes on the gel for analysis. If each of the four ddNTPs carries a distinct label, then it is possible to mix them all, carry out the reaction in a single tube, and analyze it on a single lane. This drastically reduces the bench time required and the cost of sequencing.
Labeling the terminators does, however, have some disadvantages. Incorporation of a ddNTP with a large, unwieldy molecule attached is more difficult for the enzyme, and large excesses of labeled nucleotides have to be used. The resulting data may have more background noise than with dye primer sequencing, and peak heights tend to be more variable (although within a specific region peak patterns show very little variation between individual samples). Many of these problems continue to be addressed by companies producing sequencing consumables, and improvements in the enzymes and general sequencing chemistries have made dye terminator sequencing a quick, easy, and efficient technique.
3. DNA Sequencing Instrumentation
Many companies also now produce both manual and automated DNA sequencing instruments. In this subheading, we will concentrate on the automated DNA sequencers, which are dominating the market at the present time. Some of these automated sequencers are covered in detail in later chapters, so only an overview of the types of machine available and their possible applications is given here. Other authors later in the book concentrate on more manual methods, and details of the equipment used are supplied in the relevant chapter.
Automated sequencers can be divided into two groups: those using the more conventional polyacrylamide slab gels to separate sequencing products and those using the more recently developed capillary systems (either single or in arrays). In the future, a third group of sequencing systems, the DNA microarray systems, is likely to become increasingly important, although at the time of writing, they are more commonly used in analysis of gene expression than in direct sequencing. Full details of the equipment described below can be found at the respective company websites:
Amersham Pharmacia Biotech: http://www.apbiotech.com
Applied Biosystems: http://www.appliedbiosystems.com
MWG: http://bio.licor.com
Beckman Coulter: http://www.beckmancoulter.com
(It should be emphasized that this is not meant to be an exhaustive list and other systems are available, but those discussed are probably the most commonly used.)
3.1. Gel-Based Systems
Until recently, almost all automated DNA sequencers relied on polyacrylamide slab gel-based systems similar to those used in manual sequencing. All of these machines separate fluorescently labeled fragments by electrophoresis through a denaturing polyacrylamide gel. The length of the gels varies from a minimum of approx 12 cm to a maximum of approx 60 cm, and the number of bases that can accurately be called in a particular run is a function of the gel length and run times employed. The slab gel systems include the Perkin-Elmer (Applied Biosystems Division) ABI PRISM 373A and 377 DNA Analyzers, the Amersham Pharmacia Biotech (APB) ALF range of DNA Analyzers, and the MWG LI-COR. The type of sequencer suited to any particular laboratory will depend on the read length and throughput required, the type of analysis required (sequencing, fragment analysis, microsatellite analysis, etc.), and the funds available. All the machines herein and in Table 1 can be used for sequencing and fragment analysis and should have sufficient capacity for the average research laboratory.
3.2. Capillary-Based Systems
There has been a gradual move, in recent years, away from the traditional slab gel-based sequencing systems toward the capillary-based systems. These machines use either an array or a single capillary filled with polyacrylamide or specially developed polymers through which the samples are electrophoresed. Advantages of these systems include greater automation (sample loading, electrophoresis, and analysis) and reduced operator time (no gel pouring). There is currently one system on the market aimed at the smaller laboratory (ABI PRISM 310 DNA Analyzer) and two medium-throughput instruments (ABI PRISM 3100 Genetic Analyzer and the Beckman Coulter CEQ 2000XL), which are 16- and 8-capillary instruments, respectively, and operate off microtiter plates. There are two very high-throughput machines (the ABI PRISM 3700 DNA Analyzer and APB MegaBACE 1000 DNA analyzer system), which are highly automated and contain arrays with 96 active capillaries. These can produce thousands of base pairs of sequence in very short times. These sequencers are now being used to generate large amounts of DNA sequence for the human genome projects.
Table 1
Automated DNA Sequencers (a)
Gel type    Model               Read length   Accuracy   Throughput (bases/24 h)
Slab gels   ABI 373/377         850 bp        98.5%      81,600
            APB ALF express     750 bp        99%        -
            MWG LI-COR          1200 bp       99%        130,000 (SBS)
Capillary   ABI 310             650 bp        98.5%      9120
            ABI 3100            650 bp        98.5%      90,720
            ABI 3700            550 bp        98.5%      211,200
            CEQ 2000XL          700 bp        98%        48,000
            APB MegaBACE 1000   550 bp        98.5%      605,000
(a) Comparison of automated DNA sequencers. Please note that all figures are as quoted by the manufacturers. Full details on all the sequencers can be obtained from the company websites (Subheading 3.) and prices are available on request.
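As a rough guide to what the read length and accuracy figures quoted in Table 1 mean in practice, the expected number of miscalled bases in a single read can be estimated as read length multiplied by the error fraction. The sketch below assumes that the quoted accuracy is an average per-base figure applied uniformly along the read, which is a simplification (errors typically cluster toward the end of a read); the function name and the selection of instruments are illustrative.

```python
# (read length in bp, accuracy in %) as quoted by the manufacturers in Table 1.
SEQUENCERS = {
    "MWG LI-COR": (1200, 99.0),
    "ABI 310": (650, 98.5),
    "ABI 3700": (550, 98.5),
    "CEQ 2000XL": (700, 98.0),
}


def expected_miscalls(read_length_bp, accuracy_percent):
    """Expected miscalled bases per read, assuming the quoted accuracy is a
    uniform per-base average (a simplifying assumption)."""
    return read_length_bp * (100.0 - accuracy_percent) / 100.0


for model, (read_length, accuracy) in SEQUENCERS.items():
    errors = expected_miscalls(read_length, accuracy)
    print(f"{model:12s} {read_length:5d} bp  {accuracy:5.1f}%  ~{errors:.2f} miscalls/read")
```

On this simple estimate, a longer read at a higher quoted accuracy can still contain more miscalled bases in total than a shorter read at a lower accuracy, which is one reason why sequencing both strands remains common practice.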
3.3. DNA Microarrays
Sequencing and mutational analysis may also be carried out using oligonucleotide microarray (DNA chip) based hybridization analysis or enzyme-dependent minisequencing (7). In the hybridization approaches, every possible sequence of the region of interest is represented in an array of oligonucleotides, usually 7–25-mers. The target DNA is labeled and hybridized to the array, then scanning detectors are used to monitor the signal and therefore the relative hybridization efficiency. The signal may increase or decrease depending on the method in use, that is, whether only the target DNA is labeled (signal gain) or whether reference DNA and target DNA are both labeled and loss of signal from the reference DNA is monitored (8). In the minisequencing approach, the target DNA is unlabeled and hybridized to an array of primer-style oligonucleotides. Fluorescently tagged ddNTPs are then incorporated during the enzyme-dependent extension reactions, which occur only if the primer is an exact match, and the fluorescent signals are analyzed.
Although this technique is theoretically very useful, in reality it has been found to be difficult to apply to large-scale sequencing projects because of problems associated with controlling the specificity of hybridization. It has, however, been shown to be very useful as a quick and easy method of screening for mutations, both new and previously characterized, in specific genes such as the breast cancer gene (BRCA1) and the cystic fibrosis gene (CFTR). With improvements in the technology, it should eventually be possible to use DNA microarray systems in situations where screening of individuals or populations is required. Large-scale projects are now underway to analyze polymorphisms in the human genome in the hope of identifying variations that predispose to or offer resistance to complex disease traits or disorders that are highly dependent on environment, and these types of variation could be screened for cheaply and quickly using this technology (9).
4. Genome Sequencing Projects and the Future of Genomics
During the past 5 yr, genome sequencing projects have progressed very rapidly, with more than 400 viral, 16 bacterial, 6 archaeal, and 2 eukaryote genomes now complete. Some of the landmark achievements in genome sequencing are noted in Table 2, and progress of the sequencing of other genomes can be obtained from the many genome sequencing websites that are now available, some of which are given in Table 3.
Table 2
Genome Sequencing Projects
Organism                   Genome size   No. of genes     Completed   Landmark                                     Ref.
Human                      3286 Mb (a)   50,000–100,000   2003 est.
C. elegans                 97 Mb         16,332           1998        First animal genome                          (10)
Saccharomyces cerevisiae   12 Mb         6241             1996        First eukaryote genome                       (11)
E. coli K-12               4.6 Mb        4288             1997                                                     (12)
Bacillus subtilis          4.2 Mb        4100             1997        Best characterized gram-positive bacterium   (13)
Haemophilus influenzae     1.8 Mb        1709             1995        First free-living organism                   (14)
φX174 bacteriophage        5386 bp       9                1977        First genome                                 (15)
(a) Mb, million base pairs.
Table 3
Genome Sequencing Websites
Center                                                  Web address
The Sanger Centre                                       http://www.sanger.ac.uk/
National Center for Biotechnology Information           http://www.ncbi.nlm.nih.gov/
The Genome Database                                     http://www.gdb.org/
UK Human Genome Mapping Project Resource Centre         http://www.hgmp.mrc.ac.uk/
Whitehead Institute Center for Genome Research          http://www.genome.wi.mit.edu/
Stanford Human Genome Center                            http://www-shgc.stanford.edu/
Human Genome Sequencing Center                          http://www.hgsc.bcm.tmc.edu/
The Institute for Genome Research (TIGR)                http://www.tigr.org/
Genome Science and Technology Center                    http://gestec.swmed.edu/
Genoscope                                               http://www.genoscope.cns.fr/
Genome Sequencing Center                                http://genome.wustl.edu/
Joint Genome Institute                                  http://www.jgi.doe.gov/
Japan Science and Technology Corporation (JST)
  Advanced Life Science Information System (ALIS)       http://www.alis.tokoyo.jst.go.jp/hgs/top.html
One of the greatest achievements is that more than 3000 million bases of the human genome have been sequenced, and the entire 3286 Mb should be completed as finished sequence by 2003. This will have a great influence on the future of medicine as we enter the realms of functional genomics, with DNA sequence data being used for a wide range of patient management programs such as:
1. Assessing genetic factors for susceptibility to common diseases (cancer, diabetes, heart disease)
2. Screening for specific gene mutations (BRCA1/2, CFTR, LDLR)
3. Determining genetic drug resistance factors (16)
4. Classifying bacterial and viral infections
5. Development of genetic vaccines (17)
6. Gene replacement therapy for mutated or deficient genes
The great strides made in DNA sequencing in recent years have largely been technology driven, using better equipment, improved chemistries, and more sophisticated software. Many of these developments are covered in this book; however, the ultimate goal of a fully automated DNA sequencer that does not require complex preparation steps and analysis based on fragment length determination remains elusive. Perhaps advances in biochip technology and microarray scanning will realize this goal before long.
References
1. Sanger, F., Nicklen, S., and Coulson, A. R. (1977) DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74, 5463–5467.
2. Maxam, A. M. and Gilbert, W. (1977) A new method for sequencing DNA. Proc. Natl. Acad. Sci. USA 74, 560–564.
3. Ausubel, F. M., Brent, R., Kingston, R. E., Moore, D. D., Seidman, J. G., Smith, J. A., and Struhl, K. (eds.) (1991) Current Protocols in Molecular Biology, vol. 1, John Wiley & Sons, New York.
4. Erlich, H. A. (ed.) (1992) PCR Technology: Principles and Applications for DNA Amplification. Oxford University Press, Oxford.
5. Wahlberg, J., Hultman, T., and Uhlén, M. (1995) Solid phase sequencing of PCR products, in PCR2: A Practical Approach (McPherson, M. J., Hames, B. D., and Taylor, G. R., eds.), Oxford University Press, Oxford.
6. Comparative PCR Sequencing: A Guide to Sequencing-Based Mutation Detection. Applied Biosystems. Available on line at http://www.appliedbiosystems.com
7. Hacia, J. G. (1999) Resequencing and mutational analysis using oligonucleotide arrays. Nat. Genet. 21(Suppl), 42–47.
8. Hacia, J. G., Brody, L. C., Chee, M. S., Fodor, S. P., and Collins, F. S. (1996) Detection of heterozygous mutations in BRCA1 using high density oligonucleotide arrays and two-colour fluorescence analysis. Nat. Genet. 14, 441–447.
9. The Chipping Forecast. (1999) Nat. Genet. 21(Suppl), 1–60.
10. The C. elegans Sequencing Consortium. (1998) Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282, 2012–2018.
11. Goffeau, et al. (1997) The yeast genome directory. Nature 387 (Suppl.), 1–105.
12. Blattner, F. R., Plunkett, G. III, Bloch, C. A., et al. (1997) The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1474.
13. Kunst, F., Ogasawara, N., Moszer, I., et al. (1997) The complete genome sequence of the gram-positive bacterium Bacillus subtilis. Nature 390, 249–256.
14. Fleischmann, R. D., Adams, M. D., White, O., et al. (1995) Whole genome random sequencing and assembly of Haemophilus influenzae. Science 269, 496–512.
15. Sanger, F., Air, G. M., Barrell, B. G., et al. (1977) Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265, 687–695.
16. Moukheiber, Z. (1998) This drug is for you. Forbes, Sept 7th.
17. Weiner, D. B. and Kennedy, R. C. (1999) Genetic vaccines. Scientific American, July.