Exhaustive analysis of subsequences present in the human and 83 dengue genome sequences
Catherine Putonti1, Sergei Chumakov2, Rahul Mitra3, George E Fox4, Richard C Willson4,5
and Yuriy Fofanov1,4
1 Department of Computer Science, University of Houston, Houston, TX, USA
2 Department of Physics, University of Guadalajara, Guadalajara, Jalisco, Mexico
3 Genomics USA, Houston, TX, USA
4 Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
5 Department of Chemical Engineering, University of Houston, Houston, TX, USA
Members of the Flavivirus genus are responsible for a number of diseases, including yellow fever, West Nile, St Louis encephalitis, and dengue fever. One or more of the four serotypes of the dengue virus are endemic in many parts of the world, including all of south-east Asia, parts of Africa, and Southern and Central America. The Aedes aegypti mosquito, which prefers to feed on humans, is a carrier of the dengue virus and is commonly found on the US Gulf Coast according to the CDC (Centers for Disease Control and Prevention) (http://www.cdc.gov/ncidod/dvbid/dengue/index.htm). Although the USA has had relatively few reported cases of dengue, epidemics have occurred in northern Mexico and hence dengue is a growing concern for bordering states.
As no vaccine or treatments are available for dengue, early detection of the viral infection is critical to avoid a potential epidemic. Dengue diagnosis has historically relied on either (a) isolation and growth of the virus in cell cultures in vitro or (b) serological tests. The former, while able to provide a more definitive diagnosis, is time consuming and ill-suited for use in
Keywords
dengue; diagnostic assay; flavivirus;
microarray; pathogen identification
Correspondence
C Putonti, University of Houston, 218 PGH,
Houston, TX 77204–3058, USA
Fax: +1 713 7431250
Tel: +1 713 7433992
E-mail: putonti@bioinfo.uh.edu
(Received 5 September 2005, revised 22
November 2005, accepted 23 November
2005)
doi:10.1111/j.1742-4658.2005.05074.x
Reliable detection and identification of pathogens in complex biological samples, in the presence of contaminating DNA from a variety of sources, is an important and challenging diagnostic problem for the development of field tests. The problem is compounded by the difficulty of finding a single, unique genomic sequence that is present simultaneously in all genomes of a species of closely related pathogens and absent in the genomes of the host or the organisms that contribute to the sample background. Here we describe ‘host-blind probe design’, a novel strategy of designing probes based on highly frequent genomic signatures found in the pathogen genomes of interest but absent from the host genome. Upon hybridization, an array of such informative probes will produce a unique pattern that is a genetic fingerprint for each pathogen strain. This multiprobe approach was applied to 83 dengue virus genome sequences, available in public databases, to design and perform in silico microarray experiments. The resulting patterns allow one to unequivocally distinguish the four major serotypes, and within each serotype to identify the most similar strain among those that have been completely sequenced. In an environment where dengue is indigenous, this would allow investigators to determine if a particular isolate belongs to an ongoing outbreak or is a previously circulating version. Using our probe set, the probability that misdiagnosis at the serotype level would occur is 1 : 10^150.
the field. Thus, serology has emerged as the primary method for dengue diagnosis. Serological tests are easy to use and able to accommodate a great number of samples, both necessities when confronting an epidemic. These benefits, however, come at a cost; tests such as hemagglutination inhibition, IgG-ELISA and MAC-ELISA cannot easily distinguish dengue at the serotype level and are likely to misidentify other flaviviruses as dengue [1,2]. Recently, specific tests have been developed for dengue identification using nucleic acid-based technologies [3] such as the PCR [4–14] and nucleic acid sequence-based amplification (NASBA) assays [15,16], and microarrays of cDNA [17] and oligonucleotides [18]. These methods are both quick and easy to use, while offering reliable serotype-specific detection. The probability of false positives, however, still remains a concern [4,15,16]. Regardless of which technology is used, identification is typically based on the presence of one or a few unique subsequences [15,16,19] as indicators of the target of interest.
Several inherent problems exist in basing detection and/or identification on recognition of unique sequences. First, to select a candidate one must know the pathogen’s genomic sequence. Moreover, even if appropriate unique sequences can be found for the entire group, they will not be able to distinguish the various subgroups of the target organism. This would require unique sequences for every subgroup of interest. However, an important observation was made previously, by McGill et al. [20], in that sequences need not be universally present in a group of interest or always absent from other groups to be informative about phylogenetic relationships. Recently it was shown that large numbers of such ‘characteristic’ sequences exist in the 16S ribosomal RNA [21]. Hence, an alternative approach [21] is to rely on multiple sequences that may individually not be uniquely or universally found in any particular grouping, but which are highly characteristic of particular groups. Recognition is then based on a set of such characteristic sequences that together form a signature [19] for a particular organism or grouping. In either approach, analysis is further complicated because viruses are obligate intracellular parasites; they are found in conjunction with host cells whose DNA might contain sequences that would interfere with the test. As separation of viral from host nucleic acids is quite difficult, it is important that the sequences used for virus detection are absent from any potentially contaminating DNA.
We have recently developed a set of novel algorithms that make it possible to efficiently calculate the frequency of all subsequences (n-mers) of length 5–25+ nucleotides in any sequenced genome within no more than a few hours, depending on the genome size. This allows exclusion of all subsequences that are present in a selected host/background genome (e.g. human) in the PCR primer/microarray probe design step, which greatly increases speed, predictability and effectiveness compared with current design methods. The microarray format is particularly attractive as it permits testing for multiple pathogens simultaneously (e.g. the set of viral pathogens causing similar symptoms in hosts or those rampant in the same regions in which the infection has occurred). We refer to the sequences that are present in the genome of interest and absent from the host genome as being ‘host-blind’ (human-blind, mosquito-blind, mouse-blind, rat-blind, etc.) sequences. The greater the number of changes necessary to ‘convert’ such a host-blind sequence to a sequence found in the host genome, the less likely the host-blind sequence, when used as a PCR primer and/or microarray probe, is to mispair with the host’s genomic sequence. Thus, our algorithms can also exclude, in the design step, all host-blind sequences one, two, three, etc. changes away from the nearest host sequence. We refer to such sequences as being host-blind and one [two, three, etc.] change away from the nearest host sequence (Fig. 1). This new approach can readily be extended to develop assays that are insensitive to the background of a host, such as a food (animal or plant) species, pathogen vector, or any other environmental background for which genomic sequence information is available. By using sequences three or four changes away, we can reduce and possibly eliminate false positives in the presence of background contaminating genomes.
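The notion of a sequence being a given number of changes away from the nearest host sequence can be illustrated with a small sketch. It is not the authors' algorithm: it assumes the host n-mers fit in an in-memory Python set (`host_nmers`, a hypothetical name), counts only substitutions as changes, and simply enumerates every neighbour of the candidate sequence.

```python
from itertools import combinations, product

BASES = "ACGT"

def neighbors(seq, changes):
    """Yield every sequence that differs from seq at exactly `changes` positions."""
    for positions in combinations(range(len(seq)), changes):
        # For each chosen position, the three bases that differ from the original.
        alternatives = [[b for b in BASES if b != seq[p]] for p in positions]
        for replacement in product(*alternatives):
            mutated = list(seq)
            for p, b in zip(positions, replacement):
                mutated[p] = b
            yield "".join(mutated)

def changes_to_nearest_host(seq, host_nmers, max_changes=4):
    """Smallest number of substitutions (0..max_changes) that turns seq into
    a member of host_nmers, or None if no host n-mer is that close."""
    for k in range(max_changes + 1):
        if any(candidate in host_nmers for candidate in neighbors(seq, k)):
            return k
    return None
```

Under this reading, a candidate is host-blind and at least k changes away when the function returns `None` or a value of k or more; a return value of 0 means the sequence itself occurs in the host.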
Fig. 1. Sensitivity of host-blind sequences. The pathogen sequences are examples of host-blind sequences that are one possible change (left) and two possible changes (right) away from the nearest human host sequence (above).
In the work presented here, 83 complete dengue virus genomes (representative of all four serotypes) were analyzed in conjunction with the available draft sequence of the human genome in order to find all potential probes/primers that could be used to detect or identify this pathogen in a human-derived sample. The analysis was conducted for all n-mers up to 22 nucleotides long. Our analysis focuses on those n-mers present in dengue and human-blind for all possible changes of one, two, three or four nucleotides. Several hundred human-blind sequences were identified, including those that were (a) present in each individual viral strain’s genome, (b) present in all 83 dengue strains regardless of their serotype, (c) unique to each serotype of the virus (present in all strains of the serotype), and (d) unique to each individual viral strain’s genome (present in the strain and absent from all other strains).
The results demonstrate that any method of identification based solely on hybridization with a particular unique sequence or a small set (typically fewer than six) of sequences, as used in the existing tests for dengue diagnosis, would not be able to reliably accommodate potential mispriming. To minimize the probability of misdiagnosis, sequences that require three, four or more bases to be altered for a mispriming to occur are considered ideal for identification purposes. A multiple-probe approach was taken in which detection and identification of any dengue virus strain in the presence of human DNA is based on characteristic sequences. A sample probe set that could be used in a microarray format was developed and tested by in silico hybridization. This probe set was designed to contain the minimal number of probes necessary to detect and identify dengue at the strain level and to unequivocally distinguish between the four major serotypes.
Results
Human n-mers
In order to generate the set of all probes/primers that are present in the dengue virus genome(s) and absent in (and distant from) the human genome, analysis of the human genome was first necessary. An analytical model was designed to provide us with an estimate of the absence of subsequences from the complete human genome given any one, two, three, or four changes. In addition, calculations were performed for the complete human genomic sequence for n < 18. (The results are included in the Supplementary material.) For n-mers of size less than 15 nucleotides, sequences present in the human genome lie within two changes of all possible sequences of 14 nucleotides. It is only when n = 16 that the human genome does not include a sequence within three changes of every possible n-mer. Therefore, in our search for human-blind n-mers, only values of n for which some sequences are actually absent from the human genome should be considered. Furthermore, it is important that the number of sequences absent from the human genome is large enough that there is a reasonable probability that some will occur within the much smaller viral genome. For large values of n, however, specificity becomes a greater concern. Thus, calculations and analysis were confined to n-mers of size 16–22.
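One way to see why short n-mers are rarely human-blind is to count how many sequences lie within k substitutions of a single n-mer and compare that with the 4^n possible n-mers. The sketch below is only this counting step; it is not the analytical model referred to above, and the function name is an illustrative assumption.

```python
from math import comb

def sequences_within(n, k):
    """Number of n-mers within k substitutions of a given n-mer
    (choose i positions to change, 3 alternative bases at each)."""
    return sum(comb(n, i) * 3**i for i in range(k + 1))

# Size of the neighbourhood relative to the whole sequence space, for n = 16..22
for n in range(16, 23):
    print(n, sequences_within(n, 3), sequences_within(n, 3) / 4**n)
```

The neighbourhood grows only polynomially with n while the sequence space grows as 4^n, so the fraction of the space covered by each genomic n-mer shrinks quickly as n increases.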
Human-blind sequences present in each individual viral genome
For each of the 83 dengue virus genomes, calculations identified each n-mer, for 16 ≤ n ≤ 22, that is at least one, two, three, or four changes away from the nearest human sequence (Fig. 2). The results are provided in the Supplementary data. The presence or absence of each n-mer was calculated, rather than the frequency of occurrence. There were no 16-mers three changes away from the nearest human sequence in any of the viral genomes and only a single 17-mer (and its complementary 17-mer), which was found in just two of the dengue strains. It is not until 19-mers were considered that all of the dengue genomes were found to have some human-blind sequences at least three changes away from the nearest human sequence. The sequences four changes away are ideal candidates for use in recognizing dengue because it is unlikely that mispriming will occur and a false positive will be reported. Each of the 83 strains had sequences at least four changes away when considering n-mers for n ≥ 21.
Fig. 2. Number of unique human-blind sequences found in each of the 83 complete dengue virus strain genomes considering different sizes of n and number of changes away from the nearest human sequence. Also listed is the average number of 22-mers present in an individual genome and absent from the human sequence given any one, two, three or four changes. The ideal set of probes would be 22-mers that are four changes away; 16-, 17- and 18-mers with one change away will lead to false-positive results because the mismatches could be tolerated in the hybridization between the host target and dengue probes.
Sequences present in all 83 dengue genomes
regardless of serotype
Prior to our calculations, we hypothesized that there would be some human-blind n-mers that are present in all 83 dengue genomes. Such sequences could then serve as a reliable indicator of the presence of the virus in a complex sample (e.g. an infected individual). Using all 83 genomic sequences, the number of such n-mers was calculated (Table 1). The number of unique sequences was quite small. There appear to be several reasons for this. First, as n increases, the number of common n-mers decreases. Second, because the number of human-blind sequences in general is smaller for small n-mer sizes (no human-blind n-mers for n < 11 and no human-blind n-mers two changes away from the nearest human sequence for n ≤ 15), the number of human-blind sequences decreases rapidly. It is also obvious that by requiring characteristic sequences to be at least two, three, etc., changes away from the nearest human sequence, one dramatically reduces the number of available sequences (Table 1). Such sequences are ideal primers/probes for identification of dengue because of their decreased probability of a false positive, yet there are no n-mers present in all 83 dengue sequences and absent from human given any 3+ changes. There is only one 18-mer (and its complement) two changes away from the nearest human sequence and shared by all 83 dengue genomes. This 18-mer could be used to detect the presence of dengue in a human sample; however, it is possible, and even likely, that this sequence could mispair to the host sequence or related flavivirus genomes. Our results lead us to the conclusion that there are no human-blind sequences common to all 83 dengue strains that are at least three changes away from the nearest human sequence.
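The search for n-mers shared by every genome amounts to a set intersection followed by removal of anything found in the host. A minimal sketch, assuming the genomes are plain strings and the host n-mers are already available as a set (both names are illustrative, and only exact host matches are excluded here):

```python
def nmer_set(genome, n):
    """All distinct n-mers occurring in one genome (forward strand only)."""
    return {genome[i:i + n] for i in range(len(genome) - n + 1)}

def shared_host_blind(genomes, host_nmers, n):
    """n-mers present in every genome and absent from host_nmers."""
    shared = set.intersection(*(nmer_set(g, n) for g in genomes))
    return shared - host_nmers

toy_genomes = ["AACGTT", "GACGTA", "TACGTC"]
print(shared_host_blind(toy_genomes, host_nmers=set(), n=4))
```

Requiring a larger number of changes from the host would replace the final set difference with a distance check against the host n-mers.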
Sequences unique for serotype 1 and 2
We also calculated the number of unique sequences for the dengue type 1 (DENV-1) and DENV-2 serotypes, as these types comprise the great majority of the 83 genomes considered (Table 2). It is likely that when a more extensive sample of DENV-3 and DENV-4 genomes becomes available the results will be similar. It is observed that while there are far more human-blind n-mers shared within each group, as the sequence length and stringency increase the number of common n-mers decreases. In the case of DENV-2, there are no n-mers four changes away from the nearest human sequence shared amongst all 46 virus genomes. Further analysis of all serotype-specific sequences is required to verify that they are unique with respect to other flavivirus genomes as well. Selecting host-blind primers/probes that are unique to the serotype and host-blind with the most changes possible
Table 1. The number of n-mers present simultaneously in all 83 dengue genomes. The first row does not consider if the sequences are absent from the human genome, just that they are present in all of the dengue genomes.
n                                                          16   17   18   19   20   21   22
Present in all dengue; absence in human not considered     20   14    8    4    2    0    0
Human-blind one change away                                 8   12    8    2    2    0    0
Human-blind two changes away                                0    0    2    0    0    0    0
Human-blind three changes away                              0    0    0    0    0    0    0
Human-blind four changes away                               0    0    0    0    0    0    0
Table 2. Human-blind sequences present simultaneously in all DENV-1 and DENV-2 genomes. DENV-3 and DENV-4 are not included, because the few sequences that are available are so similar to each other that the vast majority of the n-mers present in the sequence are unique to the serotype.
n                                                          16   17   18   19   20   21   22
DENV-1 (28 genomes)
Present in all dengue; absence in human not considered    664  558  458  392  336  284  254
Human-blind one change away                               218  372  408  382  334  284  254
Human-blind two changes away                                2   38   94  172  250  252  242
Human-blind three changes away                              0    0    2   12   54  118  182
Human-blind four changes away                               0    0    0    0    0    2   38
DENV-2 (46 genomes)
Present in all dengue; absence in human not considered     62   54   44   34   24   16   10
Human-blind one change away                                24   38   40   34   24   16   10
Human-blind two changes away                                0    0    6    8   16   16   10
Human-blind three changes away                              0    0    0    0    0    0    4
Human-blind four changes away                               0    0    0    0    0    0    0
from the nearest human sequence would ensure a more reliable method of detection with a lower false-positive rate than the currently available techniques.
Sequences unique for each individual viral
genome
For each n-mer present in a dengue genome, the number of other dengue genomes that also contain this particular n-mer was calculated. On average, 4.4% (16-mers) to 8.3% (22-mers) of the viral genome consists of n-mers that are not present in any of the other dengue genomes. For example, in the genome of the DENV-4 China Guangzhou B5 strain (AF289029), 75.4% of the 22-mers are unique to this genome. Three genomes (one DENV-1, two DENV-2, and one DENV-4) do not have any 16- to 22-mers that do not occur in any other dengue strain’s genomic sequence. Thus, no single sequence could be used as a primer/probe to identify one of these strains. Figure 3 shows the distribution of the percentage of unique n-mers per genome for 16- to 22-mers. This analysis was next extended to those n-mers that are human-blind. The average proportion of host-blind n-mers that are unique to a particular genome is less than 8%, and many genomes have no human-blind n-mers at least two changes away from the nearest human sequence. Despite this low average, there are several genomes that have a higher number of human-blind sequences than would be expected. In AF289029, 30 of the 34 16-mers that are two changes away from the nearest human sequence are unique, and 16 014 of its 21 248 22-mers one change away from the nearest human sequence are unique. Figure 4 reflects the distribution of host-blind 22-mers one, two, three or four changes away from the nearest human sequence. A complete report of our calculations of unique n-mers for all 83 genomes is available in the Supplementary data.
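The per-genome uniqueness figures can be approximated by counting, for each n-mer, how many genomes carry it and then keeping those carried by exactly one. The sketch below assumes a dictionary mapping strain names to sequences (hypothetical toy data) and measures the fraction over distinct n-mers, which may differ from the exact accounting used for the reported percentages.

```python
from collections import Counter

def unique_fraction(genomes, n):
    """For each strain, the fraction of its distinct n-mers found in no other strain."""
    sets = {name: {seq[i:i + n] for i in range(len(seq) - n + 1)}
            for name, seq in genomes.items()}
    # Number of genomes that contain each n-mer at least once.
    carriers = Counter(nmer for s in sets.values() for nmer in s)
    return {name: (sum(1 for x in s if carriers[x] == 1) / len(s)) if s else 0.0
            for name, s in sets.items()}

toy = {"a": "ACGTA", "b": "ACGTT", "c": "GGGGG"}
print(unique_fraction(toy, 4))
```

A strain whose fraction is zero has no single n-mer that could identify it on its own, which is the situation described for several of the genomes above.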
In silico array hybridization studies
To reduce the likelihood of false positives, host-blind sequences that are 3+ changes away from the nearest human sequence are ideal for diagnostic purposes. The fact that there is no single sequence meeting this criterion for all of the 83 dengue strains considered suggests the use of multiple probes in a parallel (e.g. array) assay in which a particular unique subset will hybridize with each dengue virus strain. Thus, in the case of a microarray assay, identification would be based not on a single unique sequence but rather on a unique pattern. The set of probes can be designed such that each serotype, or even each strain, will produce a unique pattern. The proposed approach can easily be extended to unsequenced strains of dengue because novel patterns can be compared with all known patterns and the affinity of the new isolate to the known strains can be inferred by clustering techniques. Identification of the particular strain of infection would, for example, allow epidemiologists and public health officials to rapidly determine if an isolate causing hemorrhagic fever represents a new outbreak or belongs to known circulating versions of the virus. Quick, inexpensive, and reliable diagnosis of dengue at the strain level in such a manner is not possible with existing techniques.
Based upon the results presented above for human-blind n-mers in the 83 dengue genomes, a set of 216 probes (22-mers, at least three changes away from the nearest human sequence) was designed for in silico
Fig. 3. Distribution of the percentage of n-mers per genome that are unique (i.e. not contained in any of the other dengue genomes considered).
Fig. 4. Distribution of the percentage of human-blind 22-mers per genome that are unique (i.e. not contained in any of the other dengue genomes considered).
experiments. The 216-probe set was computed as the minimum number of probes possible to uniquely identify each of the 83 genomes such that each genome was required to contain a subset of at least 28% (in this case 61) of the 216 22-mers or probe sequences. Furthermore, for any two strains’ genomes, the subsets contained in each must differ by at least two sequences. If two strains differ only by two sequences and mutations occur in these two sequences, the strains will be indistinguishable. The likelihood of such an occurrence can be reduced by demanding more sequences for distinguishing between any two individual strains; this, however, will necessitate a larger probe set size. We further stipulated that serotypes are distinguishable from each other such that any strain in one serotype differs significantly from any strain in any of the other three serotypes. To this end, it was required that serotypes be distinguishable by at least 20% of the 216 22-mers or probe sequences contained in any of their strain members. For the 216-probe set, the minimum number of probes differentiating a DENV-1 strain from all other strains belonging to one of the three other serotypes was 70, 56 for DENV-2, 65 for DENV-3, and 56 for DENV-4. Thus, in the event that identification is not possible at the strain level as a result of mutations, identification at the serotype level is possible.
To estimate the probability that a misdiagnosis occurs at the serotype level, we assume, in the worst case scenario, that a target sequence in dengue will no longer hybridize with its complementary probe if just one point mutation occurs. For a given sequence of length $l$, there are $l - n$ n-mers. As the length of the dengue genomic sequence is significantly larger than the sizes of n considered here, $l - n \approx l$, such that the probability that $m$ specific n-mers are mutated can be estimated as $m!/l^m$. To misdiagnose the infection at the serotype level in the 216-probe set would require at least 56 mutations ($m = 56$) to occur within a dengue genome of 10 000 bp ($l = 10\,000$). Thus, the probability that such an event would occur is 1 : 10^150.
The microarray of 216 probes represents what many researchers can produce in-house at low cost. We determined the pattern that would appear on the microarray given a particular genome’s ability to hybridize with the probe sequences. Figure 5 shows the overlapping expression patterns for two pairs of genomes for the set of 216 probes. The number of probes present on the 216-probe set microarray for each of the 83 genomes ranges from 61 to 95. Because dengue infections occur in regions in which other flaviviruses are also prevalent, it is imperative that a diagnostic tool is able to discriminate between the different viruses [2]. Considering the close relative of dengue virus, West Nile virus, we computed the number of probes expected to be present in 26 publicly available strains. Of the 216 dengue probes, at most three would hybridize with a West Nile strain. In fact, 24 of the West Nile virus strains share these same three 22-mers with the dengue virus strains. In the event that the clinical sample contained West Nile virus and not dengue virus, the expression pattern is expected to show only about 1% of the probes hybridized, far less than the 28% required during the set design. Thus, it is highly unlikely that a misidentification of the presence of dengue will be made using the 216-probe set, even in the presence of another flavivirus.
Fig. 5. Overlapping in silico expression patterns obtained using the 216 probes with pairs of dengue virus genomes. (A) DENV-1 strain BR/90 AF226685 (green) and DENV-2 M29095 (red); (B) DENV-2 from Cambodia AF309641 (green) and DENV-3 DENCME (red). Probes present in both genomes are shown in black, while probes absent in both genomes are shown in white.
To estimate the ability of such arrays to distinguish between different strains of a virus, as well as its possible genomic modifications, we introduce the distance (D) between any two patterns:
$$D = 1 - \frac{n_{12}}{\min(n_1, n_2)},$$
where $n_1$ and $n_2$ are the numbers of probes present in each of the genomes being compared and $n_{12}$ is the number of probes present in both genomes simultaneously. While there are many different ways to define such a distance between patterns, we chose this definition because of its simplicity; the distance is 0 if both genomes produce the same pattern and 1 if they do not share any of the same probes.
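The distance defined above translates directly into a few lines of code when each hybridization pattern is represented as the set of probes that are present. This is a minimal sketch with illustrative names; it assumes both patterns contain at least one probe.

```python
def pattern_distance(probes_a, probes_b):
    """D = 1 - n12 / min(n1, n2) for two sets of hybridizing probes."""
    smaller = min(len(probes_a), len(probes_b))
    if smaller == 0:
        raise ValueError("both patterns must contain at least one probe")
    return 1 - len(probes_a & probes_b) / smaller

print(pattern_distance({1, 2, 3, 4}, {3, 4, 5}))
```

One consequence of dividing by the smaller pattern is that the distance is also 0 whenever one pattern is a subset of the other, not only when the two patterns are identical.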
By computing the distances between each pair of the 83 patterns (the distance matrix), we were able to group virus isolates using PHYLIP’s KITSCH (University of Washington, Seattle, WA, USA) [22] and visualize these groups using publicly available software packages [23,24] based on the distances between the patterns observed on the microarray (Fig. 6). The trees generated clearly separate DENV-1 strains from the remainder of the serotypes. DENV-3 and DENV-4 are most closely clustered within their own respective serotypes but are nested within the DENV-2 branch. While this may be attributed to the fact that there are far fewer DENV-3 and DENV-4 genomes available to be included in this analysis, it is much more probable that it is a result of the design process itself. Because each strain must contain a percentage of the overall probe set, sequences that are unique to a strain are, in essence, selected against.
A second probe set was designed containing a random sampling of 4000 sequences (18-mers, two changes away from the nearest human sequence). Because members of this set were chosen at random, many more sequences unique to a single strain or to just a few strains are included. A tree displaying similarity between isolates was also created using this set (Fig. 7). This allows the dengue strains to be grouped by their origin and the time at which the samples were taken, as well as by serotype. Although a true evolutionary history of the virus can probably not be obtained in this way, the results suggest that an unknown isolate can be characterized with respect to its closest relatives by a comparison of hybridization patterns. Well-developed methods such as k-means [25], self-organising map (SOM) [26] and hierarchical clustering [27] might improve determinations of how a new isolate compares to the previously studied isolates.
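Characterizing an unknown isolate by its closest relatives can be sketched as a ranking of known strains by pattern distance. This is a simple nearest-pattern lookup rather than the tree-building or clustering methods cited above; the strain names and patterns are invented toy data, and the distance is restated so that the block is self-contained.

```python
def pattern_distance(a, b):
    """D = 1 - n12 / min(n1, n2); assumes both patterns are non-empty sets."""
    return 1 - len(a & b) / min(len(a), len(b))

def closest_strains(unknown, known, top=3):
    """Rank known strains by the distance of their pattern to the unknown one."""
    scored = [(name, pattern_distance(unknown, probes))
              for name, probes in known.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:top]

known = {"strain_A": {1, 2, 3}, "strain_B": {3, 4, 5}, "strain_C": {7, 8}}
print(closest_strains({1, 2, 4}, known, top=2))
```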
Discussion
We found no single sequence 16–22 bases in length present in all 83 dengue sequences and absent from the human genome given three base changes. Therefore, our approach was to use a unique pattern made by a group of oligos (minimum 216) to identify a particular virus strain. A probe set of 216 human-blind sequences was designed that can both diagnose and identify the most similar strain among those whose genome has previously been sequenced. The tests currently being used in the field can at best only distinguish between serotypes. With this decreased specificity, the probability of misdiagnosis remains a major concern. The assay proposed here will essentially reduce the error in misdiagnosing the serotype of dengue to 1 : 10^150. With microarray technology, the 216-probe set can easily be accommodated in a single diagnostic device. Here, just one experiment provides both diagnosis and phylogenetic tree construction. This assay will be able, without necessitating viral isolation, to quickly detect a new pattern signifying a new strain of dengue, almost akin to sequencing the genome. The ability to identify strains very similar to an unknown isolate in the data set of sequenced dengue genomes may be especially valuable in epidemiological studies where one would like to rapidly understand the origins of an outbreak of hemorrhagic fever. For example, if such an outbreak were to occur in a location where dengue fever is indigenous, it may be the result of a new variant of the virus which is common in that region, a re-emergence of an earlier version, a continuation of an outbreak from the previous season or the introduction of a new strain as the result of travel. If the needed complete sequences are obtained initially, hybridization arrays will allow these alternative explanations to be monitored on an ongoing basis. Such monitoring might be conducted routinely to detect changes in the local virus population before cases of hemorrhagic fever occur.
Fig. 6. Dengue groupings based on the similarity of the observed hybridization patterns for the 216-probe hypothetical microarray of 22-mers at least three changes away from the nearest human sequence.
If reducing the cost and size of this test is critical, the ability to identify dengue at the strain level can be sacrificed such that specificity is available only at the serotype level. The host-blind technology provides a much more reliable solution than is currently available by greatly decreasing the likelihood that a primer/probe sequence will mispair with the host sequence.
We are confident in the ability of this technology for reliable detection. Human-blind sequences have been successfully used as PCR primers generating amplicons matching those predicted computationally (M. Anez, R. C. Willson, et al., unpublished results). The development of host-blind diagnostic microarrays is underway. The human-blind dengue primer/probe sequences can be additionally improved to be blind not only to humans but also to single nucleotide polymorphisms and organisms known to be associated with humans (e.g. microflora), in addition to pathogens known to be transmitted by the same vector. Furthermore, the computation-based host-blind approach can easily be extended to include not only human hosts but also mouse, rat, chicken, chimpanzee, mosquito, and any other sequenced host genome.
Experimental procedures
Data
Version 3.2.2 of the human genome was used. This partially assembled human genome, located in 944 files containing 2 860 215 662 base pairs of sequence, is available from GenBank (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606). This version contains 794 007 unknown/unidentified bases. For simplicity, all n-mers containing such characters were excluded from the calculations. Moreover, because the file structure of the genome assembly does not allow the assembly of each chromosome without gaps, all n-mers having a subsequence belonging to one file and the remaining sequence in another file were not included in our calculations. All calculations on the human genome utilized both the original and complementary strand sequences.
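The handling rules described here (skip n-mers containing unknown bases, never join sequence across files, and include both strands) can be sketched as follows. The function names are illustrative, and the sketch assumes each file's sequence is passed in separately as a string, so that no n-mer spans two files.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")

def reverse_complement(seq):
    """Complementary-strand sequence, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def nmers_both_strands(seq, n):
    """Distinct n-mers of one sequence file on both strands,
    skipping any window that contains a non-ACGT character."""
    found = set()
    seq = seq.upper()
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if set(window) <= _VALID:
            found.add(window)
            found.add(reverse_complement(window))
    return found

print(sorted(nmers_both_strands("ACGNTT", 2)))
```

Calling the function once per file and taking the union of the results reproduces the rule that n-mers straddling a file boundary are not counted.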
Eighty-three complete sequences of the dengue virus (28 DENV-1, 46 DENV-2, two DENV-3, and seven DENV-4) were considered. This set of sequences, including their accession numbers, is provided in the Supplementary data. The dengue genome is 10 kb with minor variations in length. Although dengue is a single-stranded RNA positive-strand virus with no DNA stage, both the original and complementary strand sequences were used in our calculations as a precautionary measure.
Fig. 7. Dengue groupings obtained from the similarity of the observed hybridization patterns for the hypothetical 4000-probe microarray of randomly sampled 18-mers at least two changes away from the nearest human sequence.
Calculations
We have recently developed a set of novel algorithms that
make it possible to analyze the occurrence frequency of all
short subsequences (n-mers) of length 5–25+ nucleotides
in any sequenced genome within a reasonable time (hours)
[28–30]. The unique properties of this new approach are:
• exact consideration of each subsequence of size n
(n-mer), in contrast to traditional BLAST-based approaches
(no approximate heuristics – no missing cases);
• consideration of all sequences that can be derived from
each n-mer by up to any four changes;
• extremely good time efficiency: calculations for up to
19-mers can be performed on a regular desktop PC;
calculations for 20 ≤ n < 25+ can be performed using a
standard high-performance cluster;
• the large number of background (or host) genomic
sequences can be taken into consideration in one run, as
needed to avoid possible false positives; and
• it can be used for genomic sequences of all sizes of
practical interest, including the human genome (3 Gb).
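The second property in the list can be made concrete with a brute-force sketch. It assumes that a "change" is a single-base substitution (insertions and deletions are not modeled), and the names `variants_within` and `is_host_blind` are hypothetical. It enumerates variants directly and is meant only to show what is being computed, not how the algorithms of [28–30] achieve their speed.

```python
from itertools import combinations, product
from typing import Iterator, Set

ALPHABET = "ACGT"


def variants_within(nmer: str, max_changes: int) -> Iterator[str]:
    """Yield nmer and every sequence reachable by 1..max_changes substitutions."""
    yield nmer
    for k in range(1, max_changes + 1):
        for positions in combinations(range(len(nmer)), k):
            # At each chosen position, try the three bases that differ from the original.
            choices = [[b for b in ALPHABET if b != nmer[p]] for p in positions]
            for substitution in product(*choices):
                chars = list(nmer)
                for p, b in zip(positions, substitution):
                    chars[p] = b
                yield "".join(chars)


def is_host_blind(nmer: str, host_nmers: Set[str], min_changes: int) -> bool:
    """True if nmer is at least min_changes substitutions away from every host n-mer."""
    if min_changes < 1:
        raise ValueError("min_changes must be at least 1")
    return all(v not in host_nmers for v in variants_within(nmer, min_changes - 1))
```

The number of variants grows quickly: an 18-mer has 3060 × 81 = 247 860 variants at exactly four substitutions, which is why a direct enumeration like this one is only practical for illustration.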
The basic idea is to associate each of the 4^n possible
n-mers with a particular element of a counting array, A, and
to define a procedure that converts the n-mer character
sequence to the index of an element in that array. It
currently takes less than one minute to find the set of all
16-mers present in dengue and absent in the human
genome. For PCR assay applications, the spacing of pairs and
sets of primers is important; however, these can be designed
much faster if consideration can be limited to only the
subset of unique n-mers present in the genome of interest;
extension of the algorithms to PCR primer set design is
now in progress.
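A minimal sketch of the indexing idea follows, assuming a two-bit encoding (A = 0, C = 1, G = 2, T = 3) that treats each n-mer as a base-4 number. The encoding, the function names and the use of a plain Python list as the counting array are illustrative assumptions, suitable only for small n; they are not taken from the published implementation.

```python
from typing import Iterable, List

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def nmer_index(nmer: str) -> int:
    """Convert an n-mer to its position in an array of 4**n elements."""
    index = 0
    for base in nmer:
        index = index * 4 + BASE_CODE[base]
    return index


def index_to_nmer(index: int, n: int) -> str:
    """Inverse of nmer_index."""
    bases = []
    for _ in range(n):
        bases.append("ACGT"[index & 3])
        index >>= 2
    return "".join(reversed(bases))


def count_nmers(sequences: Iterable[str], n: int) -> List[int]:
    """Fill a counting array with the occurrences of every n-mer."""
    counts = [0] * (4 ** n)
    mask = 4 ** n - 1
    for seq in sequences:
        index, run = 0, 0
        for base in seq.upper():
            code = BASE_CODE.get(base)
            if code is None:  # unknown base: restart the window
                index, run = 0, 0
                continue
            # Rolling update: shift in the new base and drop the oldest one.
            index = ((index << 2) | code) & mask
            run += 1
            if run >= n:
                counts[index] += 1
    return counts


def present_in_target_absent_in_host(target, host, n: int) -> List[str]:
    """List n-mers seen at least once in target and never in host."""
    t, h = count_nmers(target, n), count_nmers(host, n)
    return [index_to_nmer(i, n) for i in range(4 ** n) if t[i] and not h[i]]
```

With this encoding `nmer_index("ACGT")` is 27, and `present_in_target_absent_in_host(["ACGT"], ["ACG"], 2)` returns `["GT"]`. An array for n = 16 would need 4^16 (about 4.3 billion) elements, so a practical implementation would require a far more compact representation than a Python list.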
Probe selection
It is our intent to define the minimum optimal set of
subsequences, smin, that can both identify the presence of a
particular pathogen and distinguish between different strains
of the pathogen. To ensure the sensitivity needed to
properly identify a genomic sequence, each genome under
consideration must contain at least a subsequences from smin.
If applicable, each subclass or type must be distinguishable
from any other subclass or type by at least b subsequences.
The set of subsequences present in each genome must differ
by at least c subsequences from the set present in every
other genome. Furthermore, for each element k in smin, its
complement k′ must not be a member of smin.
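One way to check a candidate set against these criteria is sketched below. Several points are assumptions rather than statements from the text: the complement k′ is taken to be the reverse complement, a probe is said to distinguish two types when it is present in every genome of one type and in no genome of the other, and `presence` is a hypothetical mapping from each genome to the probes of the candidate set that it contains. At least two genomes and two types are required.

```python
from itertools import combinations
from typing import Dict, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def coverage(presence: Dict[str, Set[str]]) -> int:
    """Smallest number of probes found in any single genome."""
    return min(len(probes) for probes in presence.values())


def genome_separation(presence: Dict[str, Set[str]]) -> int:
    """Smallest number of probes by which two genomes' probe sets differ."""
    return min(len(presence[x] ^ presence[y]) for x, y in combinations(presence, 2))


def type_separation(presence: Dict[str, Set[str]], genome_type: Dict[str, str]) -> int:
    """Smallest number of probes separating any two types (assumed definition)."""
    by_type: Dict[str, list] = {}
    for genome, label in genome_type.items():
        by_type.setdefault(label, []).append(presence[genome])
    # Probes present in every genome of a type, and in at least one genome of it.
    core = {t: set(sets[0]).intersection(*sets[1:]) for t, sets in by_type.items()}
    seen = {t: set().union(*sets) for t, sets in by_type.items()}
    return min(
        len(core[t] - seen[u]) + len(core[u] - seen[t])
        for t, u in combinations(by_type, 2)
    )


def has_complementary_pair(probes: Set[str]) -> bool:
    """True if some probe's reverse complement is also in the set."""
    return any(reverse_complement(k) in probes for k in probes)


def meets_criteria(probes, presence, genome_type, a: int, b: int, c: int) -> bool:
    return (
        not has_complementary_pair(probes)
        and coverage(presence) >= a
        and type_separation(presence, genome_type) >= b
        and genome_separation(presence) >= c
    )
```

Under the reverse-complement assumption a palindromic probe such as ACGT is its own complement and is therefore rejected by `has_complementary_pair`; whether such probes should be excluded is not stated in the text.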
In designing this optimal set, an evolutionary programming approach was taken. While many sets, s, may meet the criteria above, a fitness function is needed to measure how ‘good’ a particular set s is in order to determine whether it is, in fact, the optimal solution. For instance, a particular set may exceed the minimum values required of a, b and c and, in fact, have values A, B and G, where A ≥ a, B ≥ b and G ≥ c. While these values contribute to the fitness of a set, the size of the set plays a much greater role in an effort to reduce the number of probes needed and thus the cost of the array. Therefore, we chose to evaluate the fitness of a particular set as f(s) = (A + B + G) / set size, such that for the optimal set, smin, there exists no other set with a greater fitness value. For sets consisting of host-blind sequences, the number of changes away from the host genome must also be integrated into the assessment of fitness.
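The fitness measure can be written directly as a short function. The sketch below is illustrative only: the candidate names and values are invented for the example, and the additional term for the number of changes away from the host genome is omitted because its form is not specified here.

```python
from typing import Dict, Tuple


def fitness(A: int, B: int, G: int, set_size: int) -> float:
    """f(s) = (A + B + G) / set size."""
    if set_size <= 0:
        raise ValueError("set_size must be positive")
    return (A + B + G) / set_size


def fittest(candidates: Dict[str, Tuple[int, int, int, int]]) -> str:
    """Return the name of the candidate (A, B, G, set_size) with the highest fitness."""
    return max(candidates, key=lambda name: fitness(*candidates[name]))


# Hypothetical candidates: a small set that barely exceeds the thresholds
# and a larger set with more redundancy.
candidates = {"small": (3, 2, 2, 10), "large": (5, 4, 3, 30)}
best = fittest(candidates)
```

Here the 10-probe set scores 7/10 = 0.7 and the 30-probe set scores 12/30 = 0.4, so the smaller set is preferred even though the larger one has higher values of A, B and G, which reflects the dominant role of set size described above.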
Acknowledgements
We would like to express our gratitude to the Texas Learning and Computation Center (TLCC) and to NASA (Grant NNJ04HF43G to GEF and RCW) for partial support of this work. CP’s work was supported by a training fellowship from the Keck Center for Computational and Structural Biology of the Gulf Coast Consortia (NLM Grant no. 5T15LM07093). The authors would also like to thank Dr R Pad Padmanabhan for many interesting discussions and suggestions.
References
1 De Paula SO & da Fonseca BAL (2004) Dengue: a review of the laboratory tests a clinician must know to achieve a correct diagnosis. Braz J Infect Dis 8, 390–398.
2 Kao CL, King CC, Chao DY, Wu HL & Chang GJJ (2005) Laboratory diagnosis of dengue virus infection: current and future perspectives in clinical diagnosis and public health. J Microbiol Immunol Infect 38, 5–16.
3 Relman DA (1998) Detection and identification of previously unrecognized microbial pathogens. Emerg Infect Dis 4, 382–389.
4 Lanciotti RS, Calisher CH, Gubler DJ, Chang GJ & Vorndam AV (1992) Rapid detection and typing of dengue viruses from clinical samples by using reverse transcriptase-polymerase chain reaction. J Clin Microbiol 30, 545–551.
5 Harris E, Roberts TG, Smith L, Selle J, Krammer LD, Valle S, Sandoval E & Balmaseda A (1998) Typing of dengue viruses in clinical specimens and mosquitoes by single-tube multiplex reverse transcriptase PCR. J Clin Microbiol 36, 2634–2639.
6 De Paula SOD, Lima CDM, Torres MP, Pereira MR & da Fonseca BAL (2004) One-step RT-PCR protocols improve the rate of dengue diagnosis compared to two-step RT-PCR approaches. J Clin Virol 30, 297–301.
7 Wang WK, Sung TL, Tsai YC, Kao CL, Chang SM & King CC (2002) Detection of dengue virus replication in peripheral blood mononuclear cells from dengue virus type 2-infected patients by a reverse transcription-real-time PCR assay. J Clin Microbiol 40, 4472–4478.
8 Sudiro TM, Zivny J, Ishiko H, Green S, Vaughn DW, Kalayanorooj S, Nisalak A, Norman JE, Ennis FA & Rothman AL (2001) Analysis of plasma viral RNA levels during acute dengue virus infection using quantitative competitor reverse transcription-polymerase chain reaction. J Med Virol 63, 29–34.
9 Houng HH, Hritz D & Kanesa-thasan N (2000) Quantitative detection of dengue 2 virus using fluorogenic RT-PCR based on 3′-noncoding sequence. J Virol Methods 86, 1–11.
10 Drosten C, Gottig S, Schilling S, Asper M, Panning M, Schmitz H & Gunther S (2002) Rapid detection and quantification of RNA of Ebola and Marburg viruses, Lassa virus, Crimean-Congo hemorrhagic fever virus, Rift Valley fever virus, dengue virus, and yellow fever virus by real-time reverse transcription-PCR. J Clin Microbiol 40, 2323–2330.
11 Shu PY, Chang SF, Kuo YC, Yueh YY, Chien LJ, Sue CL, Lin TH & Huang JH (2003) Development of group- and serotype-specific one-step SYBR green I-based real-time reverse transcription-PCR assay for dengue virus. J Clin Microbiol 41, 2408–2416.
12 Tanaka M (1993) Rapid identification of flavivirus using the polymerase chain reaction. J Virol Methods 41, 311–322.
13 Figueiredo LT, Batista WC, Kashima S & Nassar ES (1998) Identification of Brazilian flaviviruses by a simplified reverse transcription-polymerase chain reaction method using flavivirus universal primers. Am J Trop Med Hyg 59, 357–362.
14 Ito M, Takasaki T, Yamada KI, Nerome R, Tajima S & Kurane I (2004) Development and evaluation of fluorogenic TaqMan reverse-transcriptase PCR assays for detection of dengue virus types 1–4. J Clin Microbiol 42, 5935–5937.
15 Wu SJL, Lee EM, Pubatana R, Shurtliff RN, Porter KR, Suharyono W, Watts DM, King CC, Murphey GS, Hayes CG et al. (2001) Detection of dengue viral RNA using a nucleic acid sequence-based amplification assay. J Clin Microbiol 39, 2794–2798.
16 Baeumner AJ, Schlesinger NA, Slutzki NS, Romano J, Lee EM & Montagna RA (2002) Biosensor for dengue virus detection: sensitive, rapid, and serotype specific. Anal Chem 74, 1442–1448.
17 Schena M, Shalon D, Davis RW & Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.
18 Lipshutz RJ, Fodor SP, Gingeras TR & Lockhart DJ (1999) High density synthetic oligonucleotide arrays. Nat Genet 21, 20–24.
19 Woese CR, Maniloff J & Zablen LB (1980) Phylogenetic analysis of the mycoplasmas. Proc Natl Acad Sci USA 77, 494–498.
20 McGill TR, Jurka J, Sobieski JM, Pickett MH, Woese CR & Fox GE (1986) Characteristic Archaebacterial 16S rRNA Oligonucleotides. Syst Appl Microbiol 7, 194–197.
21 Zhang Z, Willson RC & Fox GE (2002) Identification of characteristic oligonucleotides in the 16S ribosomal RNA sequence dataset. Bioinformatics 18, 244–250.
22 Felsenstein J (2005) PHYLIP (Phylogeny Inference Package), Version 3.6. Distributed by the Author. Department of Genome Sciences, University of Washington, Seattle.
23 Choi J-H, Jung H-Y, Kim H-S & Cho HG (2000) PhyloDraw: a phylogenetic tree drawing system. Bioinformatics 16, 1056–1058.
24 Perrière G & Gouy M (1996) WWW-Query: An on-line retrieval system for biological sequence banks. Biochimie 78, 364–369.
25 MacQueen J (1967) Methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (LeCam LM & Neyman J, eds), pp. 281–297. California Press, Berkeley, California.
26 Dopazo J & Carazo JM (1997) Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol 44, 226–233.
27 Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58, 236–244.
28 Fofanov Y, Belapurkar C, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T, Fofanov V, Li T-B, Chumakov S et al. (2004) How independent are the appearances of n-mers in different genomes? Bioinformatics 20, 2421–2428.
29 Chumakov S, Putonti C, Pettitt BM, Fox GE, Willson RC & Fofanov Y (2004) Using statistical properties of short subsequences in microbial identification. In Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (Valafar F & Valafar H, eds), pp. 363–367. CSREA Press, Las Vegas, NV.
30 Fofanov V, Putonti C, Chumakov S, Pettitt BM & Fofanov Y (2005) Fast Algorithm for the Analysis of the Presence of Short Oligonucleotide Sequences in Genomic Sequences. UH Technical Report #UH-CS-05-11, University of Houston, Houston, Texas [Online http://www.cs.uh.edu/Preprints/preprint/uh-cs-05-11.pdf].