Herein we will compare the application of a CleanAmp™ 7-deaza-dGTP Mix to a standard 7-deaza-dGTP mix for the PCR amplification of GC-rich targets in preparation for Sanger dideoxy seque
DNA SEQUENCING – METHODS AND APPLICATIONS
Edited by Anjana Munshi
As for readers, this license allows users to download, copy and build upon published chapters even for commercial purposes, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications
Notice
Statements and opinions expressed in the chapters are those of the individual contributors and not necessarily those of the editors or publisher. No responsibility is accepted for the accuracy of information contained in the published chapters. The publisher assumes no responsibility for any damage or injury to persons or property arising out of the use of any materials, instructions, methods or ideas contained in the book.
Publishing Process Manager Bojan Rafaj
Technical Editor Teodora Smiljanic
Cover Designer InTech Design Team
First published April, 2012
Printed in Croatia
A free online edition of this book is available at www.intechopen.com
Additional hard copies can be obtained from orders@intechopen.com
DNA Sequencing – Methods and Applications, Edited by Anjana Munshi
p cm
ISBN 978-953-51-0564-0
Contents
Preface
Chapter 1 DNA Representation
Bharti Rajendra Kumar
Chapter 2 Hot Start 7-Deaza-dGTP Improves Sanger Dideoxy Sequencing Data of GC-Rich Targets
Sabrina Shore, Elena Hidalgo Ashrafi and Natasha Paul
Chapter 3 Sequencing Technologies and Their Use in Plant Biotechnology and Breeding
Chapter 6 The Input of DNA Sequences to Animal Systematics: Rodents as Study Cases
Laurent Granjon and Claudine Montgelard
Chapter 7 The Application of Pooled DNA Sequencing in Disease Association Study
Chang-Yun Lin and Tao Wang
Chapter 8 Nucleic Acid Aptamers as Molecular Tags for Omics Analyses Involving Sequencing
Masayasu Kuwahara and Naoki Sugimoto
Preface
The story of DNA sequencing began more than a quarter of a century earlier, when Sanger’s studies of insulin first demonstrated the importance of sequence in biological macromolecules. Although two different DNA sequencing methods were developed during the same period, Sanger’s dideoxy chain-termination method became the method of choice over the Maxam–Gilbert method. The complete sequence of φX174 was published in 1977 and revised slightly in the following year using the dideoxy method. It demonstrated that a DNA sequence could tell a fascinating story when interpreted in terms of the genetic code. Recently, several next-generation high-throughput DNA sequencing techniques have arrived on the scene and are opening fascinating opportunities in the fields of biology and medicine.
This book, “DNA Sequencing - Methods and Applications”, illustrates methods of DNA sequencing and its applications in plant, animal and medical sciences. The book has two distinct sections. The first includes 2 chapters devoted to DNA sequencing methods, and the second includes 6 chapters focusing on various applications of this technology. The content of the articles presented in the book is guided by the knowledge and experience of the contributing authors. The book is intended to serve as an important resource and review for researchers in the field of DNA sequencing.
Chapter 1 presents an overview of DNA sequencing technologies, from Sanger’s method to the next-generation high-throughput techniques, including massively parallel signature sequencing, polony sequencing, pyrosequencing, Illumina sequencing and SOLiD sequencing. Chapter 2 reviews how Hot Start 7-deaza-dGTP improves Sanger dideoxy sequencing data of GC-rich template DNA.
Chapter 3 demonstrates how the use of sequencing methods, in combination with breeding strategies and molecular genetic modification, has contributed to our knowledge of plant genetics and to a remarkable increase in agricultural productivity.
Chapter 4 provides information on applications of DNA sequencing in crop protection. This chapter highlights the perspectives for new sustainable and environmentally friendly strategies for controlling pests and diseases of crop plants.
Chapter 5 discusses the application of DNA sequencing in improving the breeding strategies of farm animals. The development of molecular markers using DNA sequencing serves as an underlying tool for geneticists and breeders to create desirable farm animals.
Chapter 6 aims to show how DNA sequencing technology has reinvigorated rodent systematics, leading to a much better supported classification of this order. The molecular data generated by DNA sequencing have played an important role in rodent systematics over the last decades, indicating the importance of this kind of information in evolutionary biology as a whole.
Chapter 7 discusses the application of pooled DNA sequencing in disease association studies. It is a cost-effective strategy for genome-wide association studies (GWAS) and has successfully identified hundreds of variants associated with complex traits. Some pooling design strategies, including PI-deconvolution, shifted-transversal design, the multiplex scheme and overlapping pools to recover linkage disequilibrium information, are also introduced. Statistical methods for the detection of variants and for case-control association studies that account for high levels of sequencing errors are discussed.
Chapter 8 focuses on the development of nucleic acid aptamers and the outlook for related technologies. Aptamers can be readily amplified by PCR and decoded by sequencing, and it is possible to apply them as molecular tags in quantitative biomolecular analysis and single-cell analysis.
The scientific usefulness of DNA sequencing continues to be proven, and the number of sequenced and catalogued genomes has grown more than five times from where it was at the middle of the decade.
Anjana Munshi
Department of Molecular Biology, Institute of Genetics and Hospital for Genetic Diseases, Hyderabad,
India
Methods of DNA Sequencing
DNA Representation
Bharti Rajendra Kumar
B.T. Kumaon Institute of Technology, Dwarahat, Almora, Uttarakhand,
India
1 Introduction
The term DNA sequencing refers to methods for determining the order of the nucleotide bases adenine, guanine, cytosine and thymine in a molecule of DNA. The first DNA sequences were obtained in the early 1970s by academic researchers using laborious methods based on two-dimensional chromatography. With the development of dye-based sequencing methods with automated analysis, DNA sequencing has become easier and faster. Knowledge of the DNA sequences of genes and other parts of the genome of organisms has become indispensable for basic research into biological processes, as well as in applied fields such as diagnostic or forensic research.
DNA is the information store that ultimately dictates the structure of every gene product and delineates every part of the organism. The order of the bases along the DNA contains the complete set of instructions that make up the genetic inheritance.
The rapid speed of sequencing attained with modern DNA sequencing technology has been instrumental in the sequencing of the human genome in the Human Genome Project.
Fig 1 DNA Sequence Trace
DNA can be sequenced by a chemical procedure that breaks a terminally labelled DNA molecule partially at each repetition of a base. The lengths of the labelled fragments then identify the positions of that base. We describe reactions that cleave DNA preferentially at guanines, at adenines, at cytosines and thymines equally, and at cytosines alone. When the products of these four reactions are resolved by size, by electrophoresis on a polyacrylamide gel, the DNA sequence can be read from the pattern of radioactive bands. The technique will permit sequencing of at least 100 bases from the point of labelling. The purine-specific reagent is dimethyl sulphate, and the pyrimidine-specific reagent is hydrazine.
In 1973, Gilbert and Maxam reported the sequence of 24 base pairs using a method known as wandering-spot analysis.
The chain-termination method developed by Sanger and coworkers in 1977 soon became the method of choice, owing to its relative ease and reliability.
In 1977, bacteriophage φX174 became the first complete DNA genome to be sequenced.
Knowing the DNA sequence can reveal the cause of various diseases. Once the sequence responsible for a disease has been determined, the disease can potentially be treated with the help of gene therapy.
DNA sequencing is very significant in research and forensic science. The main objective of any DNA sequence generation method is to determine the sequence with very high accuracy and reliability.
Some common problems in automated DNA sequencing are:
1. Failure of the DNA sequencing reaction
2. Mixed signal in the trace (multiple peaks)
3. Short read lengths and poor-quality data
4. Excessive free dye peaks (“dye blobs”) in the trace
5. Primer-dimer formation in the sequencing reaction
6. DNA polymerase slippage on mononucleotide regions of the template
Sequencing should therefore be carried out in a way that avoids or minimizes these problems.
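Polymerase slippage (problem 6) is associated with mononucleotide runs, so it can help to flag such runs in a template before sequencing. The following is a minimal Python sketch under stated assumptions: the function name, the example template and the run-length threshold of 6 are hypothetical choices for illustration and are not taken from the chapter.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=6):
    """Return (start, base, length) for every run of a single base
    that is at least min_len long (min_len is an arbitrary threshold)."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Hypothetical template with a T run of 8 and a G run of 7
template = "ACG" + "T" * 8 + "GCAA" + "G" * 7 + "CT"
print(homopolymer_runs(template))  # [(3, 'T', 8), (15, 'G', 7)]
```

Positions are 0-based offsets into the template. A real workflow would pick the threshold from experience with the polymerase and chemistry in use.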
DNA sequencing can solve many problems and contribute a great deal to human welfare.
Sequencing can be done by different methods:
1. Maxam–Gilbert sequencing
2. Chain-termination methods
3. Dye-terminator sequencing
4. Automation and sample preparation
5. Large-scale sequencing strategies
6. New sequencing methods
2 Maxam-Gilbert sequencing
In 1976-1977, Allan Maxam and Walter Gilbert developed a DNA sequencing method based on chemical modification of DNA and subsequent cleavage at specific bases.
The method requires radioactive labelling at one end and purification of the DNA fragment to be sequenced. Chemical treatment generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). Thus a series of labelled fragments is generated, from the radiolabelled end to the first ‘cut’ site in each molecule. The fragments from the four reactions are run side by side in gel electrophoresis for size separation. To visualize the fragments, the gel is exposed to X-ray film for autoradiography, yielding a series of dark bands, each corresponding to a radiolabelled DNA fragment, from which the sequence may be inferred.
Fig 2 Part of a radioactively labelled sequencing gel
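The inference step for the four Maxam-Gilbert reactions can be sketched in code. This is an illustrative Python sketch only: it assumes each lane has already been reduced to a set of integer positions (the position of the cleaved base, counted from the labelled end), which is a hypothetical input format. A band in both the G and A+G lanes is read as G, a band in A+G alone as A, a band in both C and C+T as C, and a band in C+T alone as T.

```python
def read_maxam_gilbert(lanes):
    """Infer a sequence from band positions in the G, A+G, C and C+T lanes.

    lanes maps each reaction name to a set of positions where a band
    is seen (hypothetical, idealized input with no missing bands).
    """
    positions = sorted(set().union(*lanes.values()))
    bases = []
    for pos in positions:
        if pos in lanes["G"]:
            bases.append("G")      # G-specific cleavage
        elif pos in lanes["A+G"]:
            bases.append("A")      # purine band without a G band
        elif pos in lanes["C"]:
            bases.append("C")      # C-specific cleavage
        elif pos in lanes["C+T"]:
            bases.append("T")      # pyrimidine band without a C band
    return "".join(bases)


# Idealized band pattern for the sequence GATCCA
gel = {
    "G": {1},
    "A+G": {1, 2, 6},
    "C": {4, 5},
    "C+T": {3, 4, 5},
}
print(read_maxam_gilbert(gel))  # GATCCA
```

Real gels need band calling and handling of weak or missing bands, which this sketch deliberately ignores.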
The newly synthesized and labelled DNA fragments are heat denatured and separated by size by gel electrophoresis on a denaturing polyacrylamide-urea gel, with each of the four reactions run in one of four individual lanes (lanes A, T, G, C). The DNA bands are then visualized by autoradiography or UV light, and the DNA sequence can be read directly off the X-ray film or gel image. A dark band in a lane indicates a DNA fragment that is the result of chain termination after incorporation of a dideoxynucleotide (ddATP, ddGTP, ddCTP, or ddTTP). The relative positions of the different bands among the four lanes are then used to read (from bottom to top) the DNA sequence.
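Reading a four-lane chain-termination gel from bottom to top amounts to sorting the bands by fragment length and noting which lane each band is in. The following Python sketch illustrates this under stated assumptions: each lane is given as a list of fragment lengths, and the lengths in the example (a 20-nucleotide primer plus 1 to 6 added bases) are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_sanger_gel(lanes):
    """Read the synthesized strand from a four-lane gel.

    lanes maps a terminating base ('A', 'T', 'G', 'C') to the fragment
    lengths seen in that lane. Shorter fragments migrate further, so
    reading bottom to top means reading in order of increasing length.
    """
    band_to_base = {}
    for base, lengths in lanes.items():
        for n in lengths:
            if n in band_to_base:
                raise ValueError(f"band of length {n} appears in two lanes")
            band_to_base[n] = base
    return "".join(band_to_base[n] for n in sorted(band_to_base))


def reverse_complement(seq):
    """Return the complementary strand, i.e. the template that was copied."""
    return seq.translate(COMPLEMENT)[::-1]


gel = {"A": [22, 26], "T": [23], "G": [21], "C": [24, 25]}
read = read_sanger_gel(gel)
print(read)                      # GATCCA
print(reverse_complement(read))  # TGGATC
```

The read is the newly synthesized strand; its reverse complement is the corresponding stretch of the template.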
Technical variations of chain-termination sequencing include tagging with nucleotides containing radioactive phosphorus for labelling, or using a primer labelled at the 5’ end with a fluorescent dye. Dye-primer sequencing facilitates reading in an optical system for faster and more economical analysis and automation.
Chain-termination methods have greatly simplified DNA sequencing. Limitations include non-specific binding of the primer to the DNA, which affects accurate read-out of the DNA sequence, and DNA secondary structures, which affect the fidelity of the sequence.
Fig 3 Sequence ladder by radioactive sequencing compared to fluorescent peaks
3.2 Automation and sample preparation
Automated DNA sequencing instruments (DNA sequencers) can sequence up to 384 DNA samples in a single batch (run), in up to 24 runs a day. DNA sequencers carry out capillary electrophoresis for size separation, detection and recording of dye fluorescence, and data output as fluorescent peak trace chromatograms.
A number of commercial and non-commercial software packages can trim low-quality DNA traces automatically These programmes score the quality of each peak and remove low-quality base peaks (generally located at the ends of the sequence)
Fig 4 View of the start of an example dye-terminator read
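The idea of trimming low-quality ends can be illustrated with a short Python sketch. It assumes each base call comes with a numeric quality score (for example a Phred-style score) and simply removes bases from both ends until a base meets the threshold. The threshold of 20 and the function name are assumptions for illustration; this is not the algorithm of any particular software package.

```python
def trim_low_quality(bases, quals, min_q=20):
    """Trim bases from both ends while their quality is below min_q.

    bases: string of base calls
    quals: list of per-base quality scores, same length as bases
    Low-quality bases in the interior of the read are left in place.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    start, end = 0, len(bases)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


bases = "GGACGTACGTTA"
quals = [5, 8, 30, 35, 40, 12, 38, 36, 30, 25, 9, 4]
trimmed, trimmed_q = trim_low_quality(bases, quals)
print(trimmed)  # ACGTACGT
```

Production tools typically use windowed or cumulative-score criteria rather than this simple end scan.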
4 Large-scale sequencing strategies
Current methods can directly sequence only relatively short (300-1000 nucleotides long) DNA fragments in a single reaction. The main obstacle to sequencing DNA fragments above this size limit is insufficient power of separation for resolving large DNA fragments that differ in length by only one nucleotide.
Large-scale sequencing aims at sequencing very long DNA pieces, such as whole chromosomes. It consists of cutting (with restriction enzymes) or shearing (with mechanical forces) large DNA fragments into shorter DNA fragments. The fragmented DNA is cloned into a DNA vector and amplified in E. coli. Short DNA fragments purified from individual bacterial colonies are individually sequenced and assembled electronically into one long, contiguous sequence. This method does not require any pre-existing information about the sequence of the DNA and is referred to as de novo sequencing. Gaps in the assembled sequence may be filled by primer walking. The different strategies have different trade-offs in speed and accuracy.
Fig 5 Genomic DNA is fragmented into random pieces and cloned as a bacterial library. DNA from individual bacterial clones is sequenced and the sequence is assembled by using overlapping regions
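Assembly from overlapping regions can be illustrated with a toy greedy algorithm: repeatedly merge the two reads with the longest suffix-prefix overlap. The Python sketch below is a simplification under stated assumptions (error-free reads, a single strand, no repeats, no read contained inside another), and the reads and the minimum overlap of 3 are invented for the example; real assemblers are considerably more sophisticated.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if that overlap is shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by best overlap; return the resulting contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))  # ['ATGGCGTACGTTAGC']
```

If no overlap of at least `min_len` remains, the function returns several contigs, which corresponds to the gaps that primer walking is used to close.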
5 New sequencing methods
The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that parallelize the sequencing process, producing thousands or millions of sequences at once. High-throughput sequencing technologies are intended to lower the cost of DNA sequencing.
Molecular detection methods are not sensitive enough for single-molecule sequencing, so most approaches use an in vitro cloning step to amplify individual DNA molecules.
In microfluidic Sanger sequencing, the entire thermocycling amplification of DNA fragments, as well as their separation by electrophoresis, is done on a single chip (approximately 10 cm in diameter), thus reducing reagent usage as well as cost. In some instances researchers have shown that they can increase the throughput of conventional sequencing through the use of microchips.
6 High throughput sequencing
The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that parallelize the sequencing process, producing thousands or millions of sequences at once. High-throughput sequencing technologies are intended to lower the cost of DNA sequencing beyond what is possible with standard dye-terminator methods.
6.1 Lynx Therapeutics' massively parallel signature sequencing (MPSS)
The first of the "next-generation" sequencing technologies, MPSS was developed in the 1990s at Lynx Therapeutics, a company founded in 1992 by Sydney Brenner and Sam Eletr. MPSS is an ultra-high-throughput sequencing technology. When applied to expression profiling, it reveals almost every transcript in the sample and provides its accurate expression level. MPSS was a bead-based method that used a complex approach of adapter ligation followed by adapter decoding, reading the sequence in increments of four nucleotides; this method made it susceptible to sequence-specific bias or loss of specific sequences. However, the essential properties of the MPSS output were typical of later "next-gen" data types, including hundreds of thousands of short DNA sequences. In the case of MPSS, these were typically used for sequencing cDNA for measurements of gene expression levels. Lynx Therapeutics merged with Solexa in 2004, and this company was later purchased by Illumina.
6.2 Polony sequencing
Polony sequencing is an inexpensive but highly accurate multiplex sequencing technique that can be used to read millions of immobilized DNA sequences in parallel. The technique was first developed by Dr George Church at Harvard Medical School. It combined an in vitro paired-tag library with emulsion PCR, an automated microscope, and ligation-based sequencing chemistry to sequence an E. coli genome at an accuracy of > 99.9999% and a cost approximately 1/10 that of Sanger sequencing.
6.4 Illumina (Solexa) sequencing
Solexa developed a sequencing technology based on dye terminators. In this method, DNA molecules are first attached to primers on a slide and amplified; this is known as bridge amplification. Unlike in pyrosequencing, the DNA can only be extended one nucleotide at a time. A camera takes images of the fluorescently labelled nucleotides, then the dye, along with the terminal 3' blocker, is chemically removed from the DNA, allowing the next cycle.
6.5 SOLiD sequencing
The technology used in ABI SOLiD sequencing is sequencing by oligonucleotide ligation and detection. In this method, a pool of all possible oligonucleotides of a fixed length is labelled according to the sequenced position. The result is sequences of quantities and lengths comparable to Illumina sequencing.
6.6 DNA nanoball sequencing
DNA nanoball sequencing is a high-throughput sequencing technology that is used to determine the entire genomic sequence of an organism. The method uses rolling circle replication to amplify fragments of genomic DNA molecules. It allows a large number of DNA nanoballs to be sequenced per run, at low reagent cost compared to other next-generation sequencing platforms. However, only short sequences of DNA are determined from each DNA nanoball, which makes mapping the short reads to a reference genome difficult. This technology has been used for multiple genome sequencing projects and is scheduled to be used for more.
6.7 HeliScope(TM) single molecule sequencing
HeliScope sequencing uses DNA fragments with added polyA tail adapters, which are attached to the flow cell surface. The next steps involve extension-based sequencing with cyclic washes of the flow cell with fluorescently labelled nucleotides. The reads are performed by the HeliScope sequencer. The reads are short, up to 55 bases per run, but recent improvements in the methodology allow more accurate reads of homopolymers and RNA sequencing.
6.8 Single molecule SMRT(TM) sequencing
SMRT sequencing is based on the sequencing-by-synthesis approach. The DNA is synthesised in so-called zero-mode waveguides (ZMWs), small well-like containers with the capturing tools located at the bottom of the well. The sequencing is performed using an unmodified polymerase and fluorescently labelled nucleotides flowing freely in the solution. The wells are constructed in such a way that only the fluorescence occurring at the bottom of the well is detected. The fluorescent label is detached from the nucleotide on its incorporation into the DNA strand, leaving an unmodified DNA strand. The SMRT technology allows detection of nucleotide modifications through the observation of polymerase kinetics. This approach allows reads of 1000 nucleotides.
6.9 Single molecule real time (RNAP) sequencing
This method is based on RNA polymerase (RNAP), which is attached to a polystyrene bead, with the distal end of the DNA being sequenced attached to another bead and both beads placed in optical traps. RNAP motion during transcription brings the beads closer together, and the change in their relative distance can be recorded at single-nucleotide resolution. The sequence is deduced from four readouts, each with a lowered concentration of one of the four nucleotide types.
7 Other sequencing technologies
Sequencing by hybridization is a non-enzymatic method that uses a DNA microarray. A single pool of DNA whose sequence is to be determined is fluorescently labelled and hybridized to an array containing known sequences. A strong hybridization signal from a given spot on the array identifies that spot's sequence in the DNA being sequenced. Mass spectrometry may be used to determine mass differences between DNA fragments produced in chain-termination reactions.
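The hybridization step can be modelled very simply: a probe on the array gives a signal if the labelled target contains the probe's reverse complement. The Python sketch below illustrates only this idealized matching. The probe set and target are invented for the example, and perfect, error-free hybridization is an assumption that real arrays do not satisfy.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def hybridizing_probes(target, probes):
    """Return the probes whose reverse complement occurs in the target,
    i.e. the array spots expected to light up under ideal conditions."""
    return [p for p in probes if reverse_complement(p) in target]


target = "ATGCCGTA"                      # hypothetical labelled DNA
array = ["GCAT", "TACG", "AAAA", "CGGC"]  # hypothetical 4-mer probes
print(hybridizing_probes(target, array))  # ['GCAT', 'TACG', 'CGGC']
```

Reconstructing the full target from the set of positive probes is a separate computational problem that this sketch does not attempt.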
Some important applications of DNA sequencing are:
1. To analyse the structure and function of any protein, we must know its primary structure, that is, the DNA sequence that encodes it
2. By studying DNA sequences we can understand the function of a specific sequence and identify the sequence responsible for a disease
3. Comparative study of DNA sequences allows mutations to be detected
4. Kinship study
5. DNA fingerprinting
6. Whole-genome sequencing, which allowed the Human Genome Project to be completed
The main problem with sequencing is consistency. If the same sample is sequenced with different methods, the results may differ, so sequencing should be done in such a manner that at least 40-50% of the sequence is the same for similar samples.
8 Benchmarks in DNA sequencing
1953 Discovery of the structure of the DNA double helix
1972 Development of recombinant DNA technology, which permits isolation of defined fragments of DNA; prior to this, the only accessible samples for sequencing were from bacteriophage or virus DNA
1977 The first complete DNA genome to be sequenced is that of bacteriophage φX174
1977 Allan Maxam and Walter Gilbert publish "DNA sequencing by chemical degradation" Fred Sanger, independently, publishes "DNA sequencing by enzymatic synthesis"
1980 Fred Sanger and Wally Gilbert receive the Nobel Prize in Chemistry
EMBL-bank, the first nucleotide sequence repository, is started at the European Molecular Biology Laboratory
1982 Genbank starts as a public repository of DNA sequences
Andre Marion and Sam Eletr from Hewlett Packard start Applied Biosystems in May, which comes to dominate automated sequencing
Akiyoshi Wada proposes automated sequencing and gets support to build robots with help from Hitachi
1984 Medical Research Council scientists decipher the complete DNA sequence of the Epstein-Barr virus, 170 kb
1985 Kary Mullis and colleagues develop the polymerase chain reaction, a technique to replicate small fragments of DNA
1986 Leroy E. Hood's laboratory at the California Institute of Technology and Smith announce the first semi-automated DNA sequencing machine
1987 Applied Biosystems markets first automated sequencing machine, the model ABI 370
Walter Gilbert leaves the U.S. National Research Council genome panel to start Genome Corp., with the goal of sequencing and commercializing the data
1990 The U.S. National Institutes of Health (NIH) begins large-scale sequencing trials on Mycoplasma capricolum, Escherichia coli, Caenorhabditis elegans, and Saccharomyces cerevisiae (at 75 cents (US)/base)
Barry Karger (January), Lloyd Smith (August), and Norman Dovichi (September) publish on capillary electrophoresis
1991 Craig Venter develops strategy to find expressed genes with ESTs (Expressed Sequence Tags)
Uberbacher develops GRAIL, a gene-prediction program
1992 Craig Venter leaves NIH to set up The Institute for Genomic Research (TIGR)
William Haseltine heads Human Genome Sciences, to commercialize TIGR products
Wellcome Trust begins participation in the Human Genome Project
Simon et al develop BACs (Bacterial Artificial Chromosomes) for cloning
First chromosome physical maps published:
Page et al - Y chromosome;
Cohen et al chromosome 21
Lander - complete mouse genetic map;
Weissenbach - complete human genetic map
1993 Wellcome Trust and MRC open Sanger Centre, near Cambridge, UK
The GenBank database migrates from Los Alamos (DOE) to NCBI (NIH)
1995 Venter, Fraser and Smith publish first sequence of free-living organism, Haemophilus influenzae (genome size of 1.8 Mb)
Richard Mathies et al publish on sequencing dyes (PNAS, May)
Michael Reeve and Carl Fuller, thermostable polymerase for sequencing
1996 International HGP partners agree to release sequence data into public databases within 24 hours
International consortium releases genome sequence of yeast S. cerevisiae (genome size of 12.1 Mb)
Yoshihide Hayashizaki's group at RIKEN completes the first set of full-length mouse cDNAs
ABI introduces a capillary electrophoresis system, the ABI310 sequence analyzer
1997 Blattner, Plunkett et al. publish the sequence of E. coli (genome size of 5 Mb)
1998 Phil Green and Brent Ewing of Washington University publish “phred” for interpreting sequencer data (in use since ‘95)
Venter starts new company “Celera”; “will sequence HG in 3 yrs for $300m.”
Applied Biosystems introduces the 3700 capillary sequencing machine
Wellcome Trust doubles support for the HGP to $330 million for 1/3 of the sequencing
NIH & DOE goal: "working draft" of the human genome by 2001
Sulston, Waterston et al finish sequence of C elegans (genome size of 97Mb)
1999 NIH moves up completion date for rough draft, to spring 2000
NIH launches the mouse genome sequencing project
First sequence of human chromosome 22 published
2000 Celera and collaborators sequence fruit fly Drosophila melanogaster (genome size of 180 Mb) - validation of Venter's shotgun method
HGP and Celera debate issues related to data release
HGP consortium publishes sequence of chromosome 21
HGP & Celera jointly announce working drafts of HG sequence, promise joint publication
Estimates for the number of genes in the human genome range from 35,000 to 120,000
International consortium completes first plant sequence, Arabidopsis thaliana (genome size of 125 Mb)
2001 HGP consortium publishes Human Genome Sequence draft in Nature (15 Feb)
Celera publishes the Human Genome sequence
2005 420,000 VariantSEQr human resequencing primer sequences published on new NCBI Probe database
2007 For the first time, a set of closely related species (12 Drosophilidae) are sequenced, launching the era of phylogenomics
Craig Venter publishes his full diploid genome: the first human genome to be sequenced completely
2008 An international consortium launches The 1000 Genomes Project, aimed to study human genetic variability
2008 Leiden University Medical Center scientists decipher the first complete DNA sequence of a woman
Hot Start 7-Deaza-dGTP Improves Sanger Dideoxy Sequencing Data of GC-Rich Targets
Sabrina Shore, Elena Hidalgo Ashrafi and Natasha Paul
TriLink BioTechnologies, Inc
USA
1 Introduction
DNA sequencing has developed substantially over the years into a more cost-effective and accurate technique for scientific advancement in medical diagnostics, forensics, systematics, and genomics. Sanger dideoxy sequencing is currently one of the most established and popular sequencing methods (Sanger et al., 1977). Over the years, sequencing methods have become automated, faster and more specific, and now allow sequencing of difficult and unknown regions of DNA. Several protocols have been developed which include new dye chemistries, use of modified nucleotide analogs, use of additives in the sequencing reaction, and variations to the sequence cycling parameters (Prober et al., 1987; Kieleczawa, 2006). These modified protocols allow for sequencing through difficult regions of DNA and may be applied in the pre-sequencing PCR step or in the actual sequencing reaction itself. Despite these improvements, there continue to be DNA regions that are problematic to sequence, such as AT-, GT- and GC-rich regions, regions high in secondary structure, hairpins, homopolymer regions, and regions with repetitive DNA sequence (Kieleczawa, 2006; Frey et al., 2008). These challenging DNA templates often result in ambiguous sequencing data, which include false stops, compressions, weak signals, and premature termination of signal. In particular, sequences high in GC content still suffer from several of these problems despite all the advancements.
Templates high in GC content have higher melting temperatures that do not allow for adequate strand separation of the DNA duplex in standard sequencing protocols. The tendency for these sequences to form complex secondary structures, such as hairpins or G-quadruplexes (Simonsson, 2001), can prevent a DNA polymerase from processively replicating an entire stretch of sequence (Weitzmann et al., 1996). The innate secondary structure of GC-rich templates and the strength of the DNA duplex can be obstacles in sequencing reactions as well as in PCR. GC-rich PCR assays are often plagued by mis-priming and inadequate amplicon yield, which in turn provide poor quality DNA samples for downstream sequencing reactions (Shore & Paul, 2010). When a DNA template high in GC content is sequenced, base compressions, weak signals from high background noise, and truncated sequencing reads are typical results. Base compressions are due to secondary structure and cause abnormal migration during the electrophoretic separation step. These irregularities in fragment migration often plague the downstream sequence analysis, where the software is unable to discriminate between the fragments (Motz et al., 2000). As a result, several advancements have been developed to address the challenge of GC-rich DNA in the pre-sequencing PCR step as well as in the sequencing reaction itself.
To overcome the higher melting temperatures of GC-rich DNA sequences, sequence cycling parameters have been adjusted to use higher temperatures. Additives such as dimethyl sulfoxide (DMSO), betaine, formamide, and glycerol are often added into sequencing reactions to reduce secondary structure and to promote strand separation (Jung et al., 2002). It is well known that modified analogs such as 7-deaza-dGTP (7-deaza-2′-deoxyguanosine-5′-triphosphate) and dITP (2′-deoxyinosine-5′-triphosphate) can be used in the sequencing reaction in place of dGTP, or in some combination with it, to reduce secondary structure (Motz, Svante et al., 2000). 7-deaza-dGTP is a modified dGTP analog which lacks a nitrogen at the seven position of the purine ring. The absence of this nitrogen destabilizes G-quadruplex formation by preventing Hoogsteen base pairing without affecting Watson-Crick base pairing. Alternatively, dITP alters the Watson-Crick base pairing by reducing the number of hydrogen bonds from three to two in a base pair between inosine and cytosine. This reduces the strength of GC-rich duplexes by lowering the melting temperature.
Efforts have also focused on improving the PCR amplification of GC-rich targets to provide a good quality DNA template prior to the sequencing reaction. The addition of 7-deaza-dGTP in the PCR step permanently disrupts secondary structure by incorporation of this modified analog into the amplicon. The pre-sequencing PCR amplification effectively linearizes the DNA template and prepares it for sequencing. Though the optimal ratio of 7-deaza-dGTP to dGTP in a PCR reaction has been thoroughly investigated and determined, this method often requires additional components such as a Hot Start polymerase or additives to be included into the reaction (McConlogue et al., 1988).
A recently developed Hot Start technology named CleanAmp™ employs a thermolabile group on the 3′ hydroxyl of a deoxynucleoside triphosphate. The presence of the protecting group blocks low temperature primer extension, which can often be a significant problem in PCR. At higher temperatures, the protecting group is released to allow for incorporation by the DNA polymerase and for more specific amplification of the intended target (Koukhareva & Lebedev, 2009). CleanAmp™ dNTPs are Hot Start versions of standard deoxynucleotide triphosphates (dATP, dCTP, dGTP, and dTTP), while CleanAmp™ 7-deaza-dGTP applies the same concept to a modified nucleoside triphosphate, 7-deaza-dGTP. Previous results have shown that a dNTP mixture containing CleanAmp™ 7-deaza-dGTP provides a significant improvement over standard 7-deaza-dGTP in the amplification of GC-rich targets in PCR assays (Shore & Paul, 2010). CleanAmp™ 7-deaza-dGTP Mix is a formulation which contains the CleanAmp™ versions of dATP, dCTP, dGTP, dTTP, and 7-deaza-dGTP, where the 7-deaza-dGTP:dGTP ratio is 3:1. CleanAmp™ 7-deaza-dGTP applies a Hot Start technology to a secondary structure reducing analog that is permanently incorporated into the PCR amplicon. Preliminary sequencing results on PCR amplicons generated using the CleanAmp™ 7-deaza-dGTP Mix have shown an improvement in sequencing reads over a standard (unmodified) 7-deaza-dGTP mix. The significant improvement in data quality when CleanAmp™ 7-deaza-dGTP Mix was compared to an analogous mix containing unmodified versions of the dNTPs prompted further experiments to identify the optimal mix composition for use prior to sequencing of PCR products.
Herein we compare the application of a CleanAmp™ 7-deaza-dGTP Mix to a standard 7-deaza-dGTP mix for the PCR amplification of GC-rich targets in preparation for Sanger dideoxy sequencing. We show that CleanAmp™ 7-deaza-dGTP Mix provides an improvement compared to the standard version of a 7-deaza-dGTP mix and provide guidance as to the best ratio of 7-deaza-dGTP to dGTP to use for optimal PCR and downstream sequencing performance. Performance categories that weigh into this decision include measures of PCR performance, such as preliminary amplicon yield and amplicon quality, and measures of sequencing performance, including percent of high quality base calls, read length, and pairwise identity. The most crucial metric for determining sequence performance is the percent high quality base calls, which provides a numerical readout of fragment resolution and sequence determination. The effects of amplicon yield and amplicon quality were also assessed to determine if these parameters directly correlate with sequencing quality.
2 Methods
The use of the CleanAmp™ technology in a pre-sequencing PCR amplification step was explored by comparing analogous reactions with standard dNTPs to investigate the effect of Hot Start activation on the PCR step. The ratio of 7-deaza-dGTP to dGTP was thoroughly investigated to determine an optimal mixture that will provide robust PCR yield and accurate Sanger dideoxy sequencing results. The effect of magnesium chloride concentration on amplicon yield and subsequent sequencing results was also investigated for mixtures with high 7-deaza-dGTP ratios, which showed low amplicon yield at the normal 1.5 mM magnesium chloride concentration. The experimental data cover five GC-rich targets of varying GC content and amplicon length.
2.1 PCR
The following five GC-rich targets were investigated: ACE (60% GC), BRAF (74%), GNAQ (79%), GNAS (84%) (Frey, Bachmann et al., 2008) and B4GN4 (78%) (Zhang et al., 2009). PCR reactions were set up in a 50 μL total volume and consisted of 1x PCR buffer (20 mM Tris pH 8.4, 50 mM KCl; Invitrogen), 1.5 mM MgCl2 (Invitrogen), 2.0 U Taq DNA polymerase (Invitrogen), 0.2 µM primers (TriLink BioTechnologies), 0.2 mM d(A,C,T), X % dGTP and Y % 7-deaza-dGTP (TriLink BioTechnologies), where X and Y are varying percentages of 0.2 mM (X/Y = 30/70, 25/75, 20/80, 10/90, 0/100), and 10 ng Human Genomic DNA (Promega) as template. Nucleotide mixes consisted of all standard dNTPs (New England Biolabs) or all CleanAmp™ dNTPs (TriLink BioTechnologies). For PCR with the 90 and 100% 7-deaza-dGTP nucleotide mixtures, different concentrations of magnesium chloride were used: 1.5, 2, or 2.5 mM. Five replicates for each condition were prepared, each in a single, thin-walled 200-μL tube, and placed in a BioRad Tetrad 2 thermal cycler with the following thermal cycling protocol: 95 °C (10 min); [95 °C (40 s), X °C (1 s), 72 °C (1 min)] 35x; 72 °C (7 min), where X (here the annealing temperature) varied according to the target (ACE 70 °C; BRAF 67 °C; B4GN4 57 °C; GNAQ 66 °C; GNAS 64 °C). The GNAQ and GNAS targets went through 40 PCR cycles. After thermal cycling, all 5 reactions were combined, and 15 μL of product was run on a 2% agarose E-gel (Invitrogen) to determine preliminary amplicon yield before PCR cleanup. The five combined PCR reactions were then cleaned up to remove excess primers and dNTPs using the Qiagen PCR Cleanup kit (Qiagen). After cleanup, 5 μL of purified samples were run on an agarose gel to visualize the final product to be sequenced. ImageJ software was used to integrate amplicon and off-target bands on the E-gels both before and after PCR cleanup (Abramoff, 2004).
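As a worked illustration of the X/Y notation above, the following minimal sketch splits the 0.2 mM guanine nucleotide pool between dGTP and 7-deaza-dGTP for each listed ratio. The function name and the rounding to four decimals are illustrative choices, not part of the published protocol.

```python
# Illustrative sketch: split the 0.2 mM guanine nucleotide pool between
# dGTP and 7-deaza-dGTP for each X/Y ratio listed in the text.
TOTAL_G_MM = 0.2  # combined dGTP + 7-deaza-dGTP concentration in mM (from the text)
RATIOS = [(30, 70), (25, 75), (20, 80), (10, 90), (0, 100)]  # (X % dGTP, Y % 7-deaza-dGTP)


def split_g_pool(total_mm, pct_dgtp, pct_deaza):
    """Return (dGTP mM, 7-deaza-dGTP mM) for one X/Y ratio."""
    if pct_dgtp + pct_deaza != 100:
        raise ValueError("percentages must sum to 100")
    dgtp = round(total_mm * pct_dgtp / 100, 4)
    deaza = round(total_mm * pct_deaza / 100, 4)
    return dgtp, deaza


for x, y in RATIOS:
    dgtp, deaza = split_g_pool(TOTAL_G_MM, x, y)
    print(f"{x}/{y}: dGTP {dgtp} mM, 7-deaza-dGTP {deaza} mM")
```

For example, the 25/75 ratio corresponds to 0.05 mM dGTP and 0.15 mM 7-deaza-dGTP, which matches the 3:1 7-deaza-dGTP:dGTP ratio described for the CleanAmp™ 7-deaza-dGTP Mix.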
2.2 Sanger dideoxy Sequencing
PCR products generated with 70, 75, and 80% mixtures of 7-deaza-dGTP that were submitted for sequencing were amplified using 1.5 mM MgCl2. The following 90 and 100% samples which were submitted for sequencing required additional magnesium chloride in the PCR step: ACE (100% - 2 mM MgCl2 for standard and CleanAmp™), B4GN4 (90% - a mixture of the 1.5 and 2 mM MgCl2 for standard dNTPs and 100% - 2 mM MgCl2 for standard and CleanAmp™), GNAQ (100% - 2 mM MgCl2 for standard and CleanAmp™), GNAS (90 and 100% - 2 mM MgCl2 for standard and CleanAmp™). Those samples not mentioned contained the standard 1.5 mM MgCl2. Each target was amplified in five individual PCR experiments and analyzed by sequencing. Samples were submitted to Eton Bioscience, Inc. (San Diego, CA), where they were quantified using a NanoDrop and submitted for Sanger dideoxy sequencing, with the BigDye Terminator v3.1 kit (Applied Biosystems) used for cycle sequencing reactions. A specified amount of each PCR product was used in the sequencing reaction, depending on the number of base pairs in the amplicon: ACE (156 bp, 10 ng DNA), BRAF (185 bp, 12 ng DNA), B4GN4 (720 bp, 28 ng DNA), GNAQ (642 bp, 27 ng DNA), GNAS (242 bp, 15 ng DNA). All sequencing results were analyzed with the Geneious Pro v5.5.2 sequencing software (Drummond AJ, 2011).
2.3 Data analysis
Several categories of results were examined, including preliminary amplicon yield, amplicon quality, pairwise identity, read length of target, and percent of high quality base calls. All values were averaged over the five independent PCR or sequencing runs, and statistical analysis was done using GraphPad Prism software (San Diego, CA). A two-tailed t-test was used to test the null hypothesis that CleanAmp™ and standard dNTPs yield the same results in each 7-deaza-dGTP mixture. A one-way ANOVA was used to test the null hypothesis that all 7-deaza-dGTP compositions perform equally within a group (CleanAmp™ or standard dNTPs). The Tukey-Kramer post test was performed in addition to the one-way ANOVA to compare all levels of substitution with one another (Prism). Statistical probability values are indicated on graphs and tables.
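To make the two-group comparison concrete, the sketch below computes a two-sample t statistic for five replicates per group. It assumes the equal-variance (pooled) form of the test; the text does not state which variant was run in Prism, and the replicate values shown are hypothetical placeholders rather than data from this study.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Student's two-sample t statistic (equal variances assumed) and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled sample variance, weighting each group by its own degrees of freedom
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


# Hypothetical placeholder values for five replicates per group, NOT study data
standard = [1.0, 2.0, 3.0, 4.0, 5.0]
cleanamp = [3.0, 4.0, 5.0, 6.0, 7.0]

t_stat, dof = pooled_t(cleanamp, standard)
print(f"t = {t_stat:.3f}, df = {dof}")
```

With five replicates per group there are 8 degrees of freedom, so a two-tailed test at the 0.05 level requires |t| to exceed roughly 2.306. Obtaining an exact p-value needs the t distribution, which a statistics package such as the one named in the text provides.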
3.1 PCR specificity and yield
The first factor assessed was the PCR yield of reactions with CleanAmp™ (Hot Start) dNTPs and standard dNTPs, to determine if one would provide a superior result. The graphs in Figure 1 show this comparison, where each of the five targets was PCR amplified with five different 7-deaza-dGTP blends, ranging from 70 to 100% substitution of dGTP with 7-deaza-dGTP. The preliminary amplicon yield is presented as the average relative amplicon yield for each target, where the raw data values for target band gel densitometry in each experiment were normalized to the 75% CleanAmp™ 7-deaza-dGTP mixture. Amplicon yield was quantified prior to PCR cleanup and reflects approximately how much DNA was generated from an independent PCR reaction. While many reactions generated sufficient amplicon for downstream sequencing from a single set-up, all reactions were run as five replicates to ensure that the amplicon yield after PCR cleanup was sufficient for sequencing. It was also noted that reactions using a higher percentage of 7-deaza-dGTP often required an increase in magnesium chloride concentration for sufficient amplicon yield. Therefore, the 90 and 100% 7-deaza-dGTP blends were prepared with 1.5, 2, and 2.5 mM MgCl2 to ensure that enough product was formed. From the samples with varying MgCl2, the PCR product with the highest yield and least off-target per run was submitted for sequencing. Preliminary experiments indicated that altered magnesium chloride concentration in the PCR step did not significantly affect sequencing reads, so this variable was eliminated when analyzing results. When the amplicon yield of a reaction containing standard dNTPs was compared to an analogous reaction using CleanAmp™ dNTPs for a given percentage of 7-deaza-dGTP, twelve out of twenty-five reactions showed a significant improvement in amplicon yield with CleanAmp™ dNTPs (Figure 1). While there were not many statistically significant differences for the lower GC-rich targets ACE (60%) and BRAF (74%), the three highest GC-rich targets, B4GN4 (78%), GNAQ (79%), and GNAS (84%), showed an improved amplicon yield using CleanAmp™ dNTPs over standard dNTPs for several 7-deaza-dGTP compositions. In the case of B4GN4, CleanAmp™ dominates in all the 7-deaza-dGTP mixtures except 100%, where very little amplicon was formed with either type of dNTPs. Furthermore, the addition of magnesium chloride had little effect on amplicon yield for this target, which produced very low yields prior to PCR cleanup.
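The normalization described above can be sketched as follows: every integrated band intensity for a target is divided by the intensity of the reference condition (here, the 75% CleanAmp™ mixture). The dictionary layout, function name, and intensity values are hypothetical and serve only to show the arithmetic.

```python
def normalize_yields(raw, reference_key):
    """Divide every band intensity by the reference condition's intensity."""
    ref = raw[reference_key]
    if ref <= 0:
        raise ValueError("reference intensity must be positive")
    return {condition: value / ref for condition, value in raw.items()}


# Hypothetical integrated band intensities (arbitrary units), NOT study data.
# Keys are (dNTP type, % 7-deaza-dGTP).
raw_intensities = {
    ("CleanAmp", 75): 2000.0,
    ("CleanAmp", 70): 1500.0,
    ("Standard", 75): 500.0,
}

relative = normalize_yields(raw_intensities, ("CleanAmp", 75))
for condition, value in relative.items():
    print(condition, f"{value:.2f}")
```

By construction the reference condition has a relative yield of 1.0, so values below 1.0 indicate less amplicon than the 75% CleanAmp™ reaction for the same target.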
The second factor was the evaluation of several 7-deaza-dGTP nucleotide blends (from 70 to 100%) across each group (CleanAmp™ or standard dNTPs) for a given target to determine an ideal amount of 7-deaza-dGTP for optimal PCR yield. In addition to the graphs shown in Figure 1, Table 1 includes the raw data averages and numerical standard deviations for this figure. Also featured in this table are the results for statistical comparison of each of the 7-deaza-dGTP compositions with one another across each of the two groups: standard or CleanAmp™ dNTPs. The colored boxes in the tables represent which of the five 7-deaza-dGTP mixtures fared the best with each type of dNTP. The results of this analysis indicate that there was not just one composition that dominated over the others for all five targets, but there were some noteworthy trends. First, it was common that the 90 or 100% compositions produced amplicon yields lower than or comparable to those of the lower percentage compositions, despite the additional MgCl2 used. Second, although several different 7-deaza-dGTP blends gave comparable amplicon yields for a given target, the blend that was optimal was not consistent from one target to the next. Results were more well-defined for the CleanAmp™ group than for standard dNTPs. For CleanAmp™ dNTPs, a 75% 7-deaza-dGTP mix produced the most amplicon for two targets, B4GN4 and GNAQ. A 70% 7-deaza-dGTP mix provided the highest yield for BRAF and gave higher yields than at least three other mixtures for ACE and GNAS.
On the other hand, standard dNTP mixtures showed few obvious trends and varied considerably from one target to the next. For example, the B4GN4 target had comparable amplicon yields for the 75 and 80% mixtures, which provided higher yields than the remaining mixtures, while for the GNAS target, the 75, 90, and 100% yields were comparable and were greater than those of the 70 and 80% mixtures. Overall, one single 7-deaza-dGTP composition could not be identified for highest amplicon yield. In addition to amplicon yield, another important factor in PCR product preparation for sequencing is amplicon quality.
Amplicon yield for each target was normalized to the 75% CleanAmp™ 7-deaza-dGTP mixture, and all values were analyzed with a two-tailed t-test (* p < 0.05; ** p < 0.01).
Fig. 1. Average Relative Amplicon Yield for five GC-rich targets (A-E) amplified using a dNTP mixture with 70 to 100% 7-deaza-dGTP
Off-target (primer dimer and mis-priming) yield for each target was quantified relative to the desired amplicon and analyzed with a two-tailed t-test, where probability values (* p < 0.05; ** p < 0.01) specify statistically significant values.
Fig. 2. Average Relative Off-Target Yield for five GC-rich targets (A-E) amplified using a dNTP mixture with 70 to 100% 7-deaza-dGTP
[Table 1 body not recovered. Columns: Target; % 7-deaza-dGTP; Average Relative Amplicon Yield (Standard deviation), Standard and CleanAmp™; Average Relative Off-Target (Standard deviation), Standard and CleanAmp™]
Amplicon yields were integrated and normalized relative to the 75% CleanAmp™ 7-deaza-dGTP mixture. Off-target amplification yields are the fraction of off-target product formed relative to the desired amplicon as determined by gel densitometry. The five sets of 7-deaza-dGTP compositions in each group (standard or CleanAmp™) were analyzed by a one-way ANOVA and Tukey-Kramer post test (p < 0.05; p < 0.01; p < 0.001) for a given percentage of 7-deaza-dGTP. Boxes outlined in color represent means that give statistically significant values.
Table 1. Relative Average Amplicon Yield and Off-Target Integration Data
Amplicon quality indicates the purity of the sample that is being sent for sequencing. A high quality sample should contain only the DNA target to be sequenced and be free of any contaminants, excess primers, excess dNTPs, or off-target products which might interfere with the sequencing reaction. In this study, the PCR products went through a commonly used PCR cleanup process that rids the samples of excess dNTPs and primers but does not remove off-target products that were generated during the PCR. Generation of a high quality PCR product with no off-target would eliminate the more laborious step of gel purification prior to sequencing. Therefore, in this chapter the amplicon quality was assessed by integrating the amount of average relative off-target products in the sample after the five PCR replicates were pooled, cleaned, and concentrated. Amplicon quality is represented graphically as the fraction of off-target (mis-priming and primer dimer) generated relative to the amplicon (Figure 2, Table 1), so the lower this value, the higher the sample quality.
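The amplicon quality measure described above reduces to a simple ratio, sketched here under the assumption that each off-target band (primer dimer or mis-priming product) has been integrated separately and that the bands are summed before dividing by the target band. The function name and intensity values are hypothetical.

```python
def off_target_fraction(amplicon_intensity, off_target_intensities):
    """Summed off-target band intensity divided by the target band intensity."""
    if amplicon_intensity <= 0:
        raise ValueError("amplicon intensity must be positive")
    return sum(off_target_intensities) / amplicon_intensity


# Hypothetical intensities (arbitrary units), NOT study data:
# one primer-dimer band and one mis-priming band alongside the target band
fraction = off_target_fraction(1000.0, [150.0, 100.0])
print(f"off-target fraction = {fraction:.2f}")
```

A value of 0.0 corresponds to a lane with no detectable off-target bands, which is the case in which gel purification before sequencing could be skipped.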
Thirteen out of twenty-five reactions showed significant reduction in off-target when CleanAmp™ dNTPs were used (Figure 2). The ACE off-target products consisted entirely of primer dimer, while the other four targets were prone to a combination of primer dimer and mis-priming side products (Figure 3). The two lowest percent GC-rich targets, ACE and BRAF, produced the lowest amount of off-target and highest amplicon quality, with comparable performance between CleanAmp™ and standard dNTPs. The other three target reactions formed substantially more off-target, especially when amplified with standard dNTPs. Figure 2 results show that amplicon quality is highest in most cases when CleanAmp™ mixtures are used. For the B4GN4 and GNAQ targets, reactions with standard dNTPs produced significantly more off-target products than CleanAmp™ dNTPs regardless of the percent 7-deaza-dGTP. For all five amplicons, the reactions using 75 and 80% 7-deaza-dGTP produced no primer dimer or mis-priming when CleanAmp™ was used. As was the case for amplicon yield, there was not just one composition of 7-deaza-dGTP that produced the highest quality DNA product. However, it was evident that the use of CleanAmp™ dNTPs reduced the amount of off-target compared to standard dNTPs, providing an amplicon with a higher chance of successfully being sequenced.
Although it is important to generate a high quality PCR product with adequate yield, other measures of the sequencing results, such as read length, pairwise identity, and percent high quality base scores, should also be considered. Read length is the number of bases that were called for a given target. Optimally, this value should match the length of the reference sequence, provided it does not exceed the ~1000 base pair limit of current Sanger dideoxy sequencing technology (Slater & Drouin, 1992; Kieleczawa, Adam et al., 2009). For some challenging targets shorter than this upper limit, the read lengths can often become truncated due to complex secondary structures or regions of DNA that the polymerase cannot read through, such as GC-rich regions. Results in Table 2 indicate that the average read lengths for the five GC-rich targets were comparable between CleanAmp™ and standard dNTPs in nearly every case. These results were not surprising, since these five targets, which were chosen mainly for GC content, are not long enough to accurately assess the impact on read length. Therefore the effect of GC content on read length remains to be determined in assays with longer amplicons. B4GN4, the longest target with 720 base pairs, varied the most in read length based on standard deviations, suggesting that this was one of the most difficult targets out of the group and just beginning to approach the length threshold where sequencing becomes more of a challenge. For these targets, no one 7-deaza-dGTP mixture fared better than the rest, as all of the average read length values were statistically comparable to one another. In addition to the read length of the sequence, it is critical to identify the correct bases within a sequence.
Gel images show the variability among the 3 different GC-rich targets in amplicon yield and off-target products (primer dimer and mis-priming).
Fig. 3. Agarose gel images of BRAF, B4GN4, and GNAS amplicons post PCR cleanup
3.2 Sanger dideoxy sequencing data quality
The pairwise identity in an alignment of two sequences is the percentage of shared identical bases (Drummond AJ, 2011). In this chapter, all sequences were known, so all experimental sequencing read-outs were aligned to the appropriate GenBank reference sequences (Dennis A Benson, 2005). If the read-outs matched exactly, they would have 100% pairwise identity to the reference sequence. For ACE and BRAF, the sequencing data matched the reference sequences nearly 100% of the time for both standard and CleanAmp™ dNTPs in all 7-deaza-dGTP compositions. For the three targets with higher GC content, results showed that pairwise identity values for standard and CleanAmp™ dNTPs approached 100%, with no statistically significant differences for most 7-deaza-dGTP mix compositions. However, at 70% 7-deaza-dGTP, three targets amplified with CleanAmp™ dNTPs had higher pairwise identity values than with standard dNTPs. Another noteworthy outlier was B4GN4, which was prone to the highest level of off-target formation. Although no one 7-deaza-dGTP composition improved results over any of the others in the group, reactions employing CleanAmp™ dNTPs provided a higher pairwise identity to the B4GN4 reference sequences at lower 7-deaza-dGTP substitutions. While read length and correct base calls are important parameters in determining the quality of sequencing data, the most critical parameter is the percentage of high quality base calls, or HQ percentage.
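The definition of pairwise identity given above can be illustrated with a short sketch that compares two already-aligned sequences column by column. It assumes that gap columns count as mismatches and that the percentage is taken over the full alignment length; the sequencing software used in the study may treat gaps and ambiguous bases differently, and the example sequences are invented.

```python
def pairwise_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns where both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != gap  # a gap never counts as an identical base
    )
    return 100.0 * matches / len(aligned_a)


# Invented toy alignment: one mismatch in eight columns
read = "GCGCCGGA"
reference = "GCGCCGGC"
print(f"{pairwise_identity(read, reference):.1f}% identity")
```

A read that matches its reference at every aligned position returns 100.0, which corresponds to the ideal case described in the text.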
HQ percentage differs from pairwise identity, as it is a measure of the confidence with which the sequencing software can determine a sequence (Drummond AJ, 2011). Often, a read may be a 100% match to the reference sequence but have very low confidence values for each base call within the sequence. The confidence in base calling becomes more important when sequencing unknown regions of DNA where there is no reference sequence. In these cases, the resultant sequence can only be trusted if the confidence of the sequencing software is high enough. Phred scores, or quality scores, are a widely accepted measure of the quality of DNA sequences. Phred scores are numerical estimates of error probability for a given base (Ewing & Green, 1998). Each sequencing software package has its own scale for base scoring, and in these studies, the percent of high quality base calls in a sequence read-out (HQ%) is defined to be the percent of bases that have a quality score (phred score) greater than 40. The highest score corresponds to a one in a million (10⁻⁶) probability of a calling error, whereas a middle score (20-40) corresponds to a probability on the order of one in a thousand (10⁻³) (Drummond AJ, 2011). The data presented herein include the HQ percentages for each sample (Table 2 and Figure 4).
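The relationship between a phred score Q and the estimated error probability P is Q = -10 log10(P) (Ewing & Green, 1998), so a score of 30 corresponds to 10⁻³ and a score of 60 to 10⁻⁶. The sketch below applies this conversion and computes an HQ percentage from a list of per-base scores. The strict "greater than 40" comparison follows the definition in the paragraph above; the function names and the example scores are hypothetical.

```python
def phred_to_error_probability(q):
    """Invert Q = -10 * log10(P) to get the estimated base-calling error probability."""
    return 10 ** (-q / 10)


def hq_percent(quality_scores, threshold=40):
    """Percent of base calls whose quality score is strictly greater than the threshold."""
    if not quality_scores:
        raise ValueError("no quality scores supplied")
    high = sum(1 for q in quality_scores if q > threshold)
    return 100.0 * high / len(quality_scores)


# Hypothetical per-base quality scores for a short read, NOT study data
scores = [12, 35, 41, 52, 58, 61, 40, 47]
print(f"HQ% = {hq_percent(scores):.1f}")
print(f"P(error) at Q30 = {phred_to_error_probability(30):.0e}")
```

Whether the boundary value of exactly 40 counts as high quality is a convention of the analysis software, so the `threshold` comparison may need adjusting to reproduce a particular program's HQ% output.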
The HQ percent scores for the ACE and BRAF sequencing data showed minimal differences between CleanAmp™ and standard dNTPs for a given 7-deaza-dGTP mix composition. For B4GN4, GNAQ, and GNAS, the sequencing of amplicons generated with CleanAmp™ dNTPs showed significantly improved HQ scores at 70% 7-deaza-dGTP relative to analogous reactions with standard dNTPs, which did not reach a value of 50%. CleanAmp™ also yielded higher HQ scores (p < 0.05) in four out of the five mixtures for the B4GN4 target. When looking at the compositions of 7-deaza-dGTP across each group of CleanAmp™ or standard dNTPs, all HQ percent values for each 7-deaza-dGTP blend are comparable for low GC content targets. For targets with greater than 75% GC content, there are some noticeable differences in the lowest (70%) and the highest (100%) substitutions of 7-deaza-dGTP. However, there were no statistically significant differences between HQ values in the middle mix compositions of 75, 80, and 90% (Table 2). Although a specific optimal percentage of 7-deaza-dGTP could not be identified, results indicate that 75, 80, and 90% blends provided the best results across a wide range of targets. Furthermore, CleanAmp™ was found to improve high quality base calls for seven out of the twenty-five reactions, which included certain challenging targets and lower 7-deaza-dGTP compositions for higher GC-rich species.
[Table 2 body not recovered. Column groups, each split into Standard and CleanAmp™: Average HQ; Average Pairwise Identity (Standard deviation); Average Read Length (Standard deviation)]
The five sets of 7-deaza-dGTP compositions in each group (standard or CleanAmp™) were analyzed by a one-way ANOVA and Tukey-Kramer post test, where probability values shown with highlighted boxes (p < 0.05; p < 0.01; p < 0.001) find the means statistically significant for HQ values, pairwise identity, and read length. In addition, Pairwise Identity and Read Length comparisons of CleanAmp™ to standard dNTPs in individual 7-deaza-dGTP compositions, which are not shown by bar graph, are indicated by stars, where (* p < 0.05; ** p < 0.01; *** p < 0.001) represent values that are statistically significant.
Table 2. Average HQ, Pairwise Identity, and Read Length Data Averages
The percentage of bases in a sequencing read-out which had a high quality score of 40 or higher determined the HQ percent value. These values were averaged after alignments to each reference sequence and analyzed with a two-tailed t-test, where probability values (* p < 0.05; ** p < 0.01) find the means statistically significant.
Fig. 4. Average HQ Percentages from sequencing read-outs for all five GC-rich targets (A-E) amplified with 7-deaza-dGTP mixes of different compositions
In summary, these studies have investigated both the percent of 7-deaza-dGTP substitution and the influence of standard and CleanAmp™ versions of the nucleotide mix on PCR and Sanger sequencing performance. When different metrics such as amplicon yield, amplicon quality, HQ percentage and sequencing chromatogram quality are considered, there are notable instances where reactions employing CleanAmp™ dNTPs have either comparable performance or statistically significant improvements in performance. To better understand how these parameters interplay with one another, a more detailed analysis is presented in the Conclusion section.
4 Conclusion
Both standard and CleanAmp™ dNTPs can effectively generate PCR amplicons with GC-rich sequence when amplified in combination with 7-deaza-dGTP. The use of this nucleotide analog effectively linearizes the DNA sequences in preparation for Sanger dideoxy sequencing by destabilizing secondary structures such as G-quadruplexes. This allows the DNA fragments to migrate more predictably through the polyacrylamide gel and reduces the possibility of ambiguous base calls in sequencing results (Motz, Svante et al., 2000). CleanAmp™ dNTPs also offer the added benefit of reduced off-target amplification due to Hot Start activation in the PCR assay. Therefore the effect of Hot Start PCR activation in conjunction with the extent of 7-deaza-dGTP substitution was investigated to determine its potential benefit.
For the targets with lower GC content (less than 75%), CleanAmp™ dNTPs helped to reduce off-target product formation at the PCR step but gave comparable results to standard dNTPs in all other categories. For the three highest GC-rich targets, CleanAmp™ improved amplicon yield and amplicon quality with several different 7-deaza-dGTP compositions, indicating that the Hot Start activation is a much-needed benefit. In one case, B4GN4, CleanAmp™ significantly improved amplicon yield, amplicon quality, and percent HQ over standard dNTPs for at least four out of the five 7-deaza-dGTP mixtures (from 70-100% 7-deaza-dGTP). Quality sequencing results for this target were not achieved with standard dNTPs alone. The use of CleanAmp™ dNTPs at the PCR stage also improved amplicon yield, pairwise identity, and percent HQ with the 70% 7-deaza-dGTP composition. However, reactions with an analogous mixture of standard dNTPs were not as successful, indicating that standard dNTP mixtures may require a higher percentage of 7-deaza-dGTP. Although the categories of pairwise identity and read length showed minimal differences between standard and CleanAmp™ dNTPs, CleanAmp™ dNTPs improved the PCR assay and downstream sequencing results over standard dNTPs in DNA targets with GC content higher than 75%. After analyzing the results in five categories individually, it was determined that three of the categories (amplicon yield, amplicon quality, and percent HQ) were most affected by the variables being investigated. Therefore the influence of these categories on one another was more thoroughly studied to discern the optimal percent 7-deaza-dGTP mixture. Figure 5A (I to V) shows scatter plots of the percent HQ and amplicon yield for all variables being tested. The shaded portion of the plot highlights dNTP mixtures that reached a threshold of at least 50% relative amplicon yield and 50% high quality bases called. The dNTP compositions that were found in this region of the scatter plot were identified and re-plotted in a scatter plot of HQ percent versus amplicon quality (Figure 5B (I to V)). Optimal compositions for Figure 5B lie highest on the plots for HQ scores and furthest to the left for the least amount of off-target formed, or highest amplicon quality.
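The two-stage selection described above (a threshold on yield and HQ percentage, followed by ranking on HQ percentage and off-target fraction) can be sketched as a filter and a sort. The record layout, the tie-breaking order, and all values below are hypothetical; in the chapter the selection was made by inspecting the scatter plots rather than by code.

```python
from typing import NamedTuple


class MixResult(NamedTuple):
    label: str
    relative_yield: float  # fraction of the reference yield (1.0 = 100%)
    hq_percent: float      # percent high quality base calls
    off_target: float      # off-target fraction relative to the amplicon


def shortlist(results, min_yield=0.5, min_hq=50.0):
    """Keep mixes meeting both thresholds, best first (high HQ, then low off-target)."""
    passing = [
        r for r in results
        if r.relative_yield >= min_yield and r.hq_percent >= min_hq
    ]
    return sorted(passing, key=lambda r: (-r.hq_percent, r.off_target))


# Hypothetical placeholder values, NOT study data
candidates = [
    MixResult("mix A", 1.00, 90.0, 0.00),
    MixResult("mix B", 0.80, 90.0, 0.20),
    MixResult("mix C", 0.90, 40.0, 0.30),  # fails the HQ threshold
    MixResult("mix D", 0.30, 85.0, 0.05),  # fails the yield threshold
]

for mix in shortlist(candidates):
    print(mix.label, mix.hq_percent, mix.off_target)
```

Sorting on HQ percentage first and off-target fraction second mirrors the description of optimal compositions lying highest and furthest to the left on the second scatter plot; a different weighting of the two criteria would be an equally defensible choice.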
Several of the different CleanAmp™ dNTP mixtures met the threshold requirements for amplicon yield, were high in amplicon quality and yielded high HQ scores. While many of the standard dNTP mixtures also had adequate amplicon yield, the amplicon quality suffered for several targets. From the scatter plot analysis in Figure 5, the 75% CleanAmp™ mixture provided adequate amplicon yield, best amplicon quality and highest HQ scores for 4 out of the 5 targets. For the 75% mixture of CleanAmp™ dNTPs in the GNAS target, adequate yield and high amplicon quality were evident but the HQ scores were lower for this composition. The lesser correlation of GNAS with the other samples may be due to its higher GC content (84%) and the higher concentrations of magnesium chloride which were needed for adequate amplicon yield. For targets with GC content higher than 80%, the data indicated that complete substitution of 7-deaza-dGTP may be necessary. If more replicates are pooled to produce enough amplicon at a lower magnesium concentration, then it is likely that the off-target products will decrease and base call quality will still remain high. If standard dNTPs are used in the PCR step, an 80% 7-deaza-dGTP mixture was the optimal composition. However, amplicon quality can sometimes still be affected at this composition with standard dNTPs, which may require a more laborious gel purification step to remove off-target products prior to sequencing.
In addition to the numerical metrics from the sequencing data of the five targets, the sequencing chromatograms were studied. In Figure 6, representative chromatograms for the regions with high GC content for the BRAF, GNAS, and B4GN4 targets are presented. The comparisons show the optimal compositions: CleanAmp™ with 75% 7-deaza-dGTP and standard with 80% 7-deaza-dGTP. For BRAF (74% GC), a region from 50-100 bp is shown. Since the corresponding HQ percentages were similar for templates with standard and CleanAmp™ dNTPs, it was not surprising that the chromatograms were similar in this region. Similarly for GNAS, the HQ percentages were comparable for both CleanAmp™ and standard dNTP samples, with a modest improvement in chromatogram shape and base call confidence for the CleanAmp™ dNTP sample. For B4GN4, there were significant differences in the HQ percentages between standard (HQ: 43.7%) and CleanAmp™ (HQ: 82.2%). The sequencing trace for standard dNTPs died out at 600 bp, while the trace for CleanAmp™ dNTPs persisted to the end of the target (~700 bp). Two representative regions of sequence are shown (10-60 bp and 160-210 bp). Reactions with CleanAmp™ dNTPs performed strongly in both regions, whereas reactions with standard dNTPs performed poorly in the early part of the read and improved to stronger performance mid-sequence.
Overall, these studies represent a thorough investigation of both the effect of 7-deaza-dGTP substitution and the use of a Hot Start PCR technology on PCR amplification and downstream sequencing performance. Though the differences were subtle for the extent of 7-deaza-dGTP substitution when individual parameters were analyzed, the advantages of using CleanAmp™ over standard versions of the nucleotide mix were more pronounced. Upon a more detailed analysis of amplicon yield, specificity, and downstream sequencing quality, optimal nucleotide compositions were revealed. Future studies may include exploration of more targets with GC content greater than 80%, longer templates, and the incorporation of CleanAmp™ 7-deaza-dGTP into the sequencing step.
In column A, HQ and amplicon yield values that lie in the blue shaded boxes (top right) met the HQ and amplicon yield threshold values and were then re-plotted as scatter plots in column B (HQ versus relative off-target yield). Values that lie furthest to the left and highest on the plots in column B are the optimal mixtures for PCR and downstream sequencing.
Fig. 5. A) HQ-Amplicon Yield and B) HQ-relative off-target yield Scatter Plots for all five GC-rich targets (I-V).