
Scientific report: Human-blind probes and primers for dengue virus identification. Exhaustive analysis of subsequences present in the human and 83 dengue genome sequences


DOCUMENT INFORMATION

Title: Human-blind Probes And Primers For Dengue Virus Identification: Exhaustive Analysis Of Subsequences Present In The Human And 83 Dengue Genome Sequences
Authors: Catherine Putonti, Sergei Chumakov, Rahul Mitra, George E. Fox, Richard C. Willson, Yuriy Fofanov
Institution: University of Houston
Field: Computer Science, Biology and Biochemistry, Physics
Type: scientific report
Year of publication: 2005
City: Houston
Pages: 11
File size: 379.22 KB


Exhaustive analysis of subsequences present in the human and 83 dengue genome sequences

Catherine Putonti1, Sergei Chumakov2, Rahul Mitra3, George E. Fox4, Richard C. Willson4,5 and Yuriy Fofanov1,4

1 Department of Computer Science, University of Houston, Houston, TX, USA

2 Department of Physics, University of Guadalajara, Guadalajara, Jalisco, Mexico

3 Genomics USA, Houston, TX, USA

4 Department of Biology and Biochemistry, University of Houston, Houston, TX, USA

5 Department of Chemical Engineering, University of Houston, Houston, TX, USA

Members of the Flavivirus genus are responsible for a number of diseases, including yellow fever, West Nile, St. Louis encephalitis, and dengue fever. One or more of the four serotypes of the dengue virus are endemic in many parts of the world, including all of south-east Asia, parts of Africa, and Southern and Central America. The Aedes aegypti mosquito, which prefers to feed on humans, is a carrier of the dengue virus and is commonly found on the US Gulf Coast according to the CDC (Centers for Disease Control and Prevention) (http://www.cdc.gov/ncidod/dvbid/dengue/index.htm). Although the USA has had relatively few reported cases of dengue, epidemics have occurred in northern Mexico, and hence dengue is a growing concern for bordering states.

As no vaccine or treatments are available for dengue, early detection of the viral infection is critical to avoid a potential epidemic. Dengue diagnosis has historically relied on either (a) isolation and growth of the virus in cell cultures in vitro or (b) serological tests. The former, while able to provide a more definitive diagnosis, is time consuming and ill-suited for use in

Keywords
dengue; diagnostic assay; flavivirus; microarray; pathogen identification

Correspondence
C. Putonti, University of Houston, 218 PGH, Houston, TX 77204-3058, USA
Fax: +1 713 7431250
Tel: +1 713 7433992
E-mail: putonti@bioinfo.uh.edu

(Received 5 September 2005, revised 22 November 2005, accepted 23 November 2005)

doi:10.1111/j.1742-4658.2005.05074.x

Reliable detection and identification of pathogens in complex biological samples, in the presence of contaminating DNA from a variety of sources, is an important and challenging diagnostic problem for the development of field tests. The problem is compounded by the difficulty of finding a single, unique genomic sequence that is present simultaneously in all genomes of a species of closely related pathogens and absent in the genomes of the host or the organisms that contribute to the sample background. Here we describe ‘host-blind probe design’, a novel strategy of designing probes based on highly frequent genomic signatures found in the pathogen genomes of interest but absent from the host genome. Upon hybridization, an array of such informative probes will produce a unique pattern that is a genetic fingerprint for each pathogen strain. This multiprobe approach was applied to 83 dengue virus genome sequences, available in public databases, to design and perform in silico microarray experiments. The resulting patterns allow one to unequivocally distinguish the four major serotypes, and within each serotype to identify the most similar strain among those that have been completely sequenced. In an environment where dengue is indigenous, this would allow investigators to determine if a particular isolate belongs to an ongoing outbreak or is a previously circulating version. Using our probe set, the probability that misdiagnosis at the serotype level would occur is $1 : 10^{150}$.


the field. Thus, serology has emerged as the primary method for dengue diagnosis. Serological tests are easy to use and able to accommodate a great number of samples, both necessities when confronting an epidemic. These benefits, however, come at a cost: tests such as hemagglutination inhibition, IgG-ELISA and MAC-ELISA cannot easily distinguish dengue at the serotype level and are likely to misidentify other flaviviruses as dengue [1,2]. Recently, specific tests have been developed for dengue identification using nucleic acid-based technologies [3] such as PCR [4–14], nucleic acid sequence-based amplification (NASBA) assays [15,16], and microarrays of cDNA [17] and oligonucleotides [18]. These methods are both quick and easy to use, while offering reliable serotype-specific detection. The probability of false positives, however, still remains a concern [4,15,16]. Regardless of which technology is used, identification is typically based on the presence of one or a few unique subsequences [15,16,19] as indicators of the target of interest.

Several inherent problems exist in basing detection and/or identification on recognition of unique sequences. First, to select a candidate one must know the pathogen’s genomic sequence. Moreover, even if appropriate unique sequences can be found for the entire group, they will not be able to distinguish the various subgroups of the target organism. This would require unique sequences for every subgroup of interest. However, an important observation was made previously by McGill et al. [20]: sequences need not be universally present in a group of interest or always absent from other groups to be informative about phylogenetic relationships. Recently it was shown that large numbers of such ‘characteristic’ sequences exist in the 16S ribosomal RNA [21]. Hence, an alternative approach [21] is to rely on multiple sequences that may individually not be uniquely or universally found in any particular grouping, but which are highly characteristic of particular groups. Recognition is then based on a set of such characteristic sequences that together form a signature [19] for a particular organism or grouping. In either approach, analysis is further complicated because viruses are obligate intracellular parasites; they are found in conjunction with host cells whose DNA might contain sequences that would interfere with the test. As separation of viral from host nucleic acids is quite difficult, it is important that the sequences used for virus detection are absent from any potentially contaminating DNA.

We have recently developed a set of novel algorithms that make it possible to efficiently calculate the frequency of all subsequences (n-mers) of length 5–25+ nucleotides in any sequenced genome within no more than a few hours, depending on the genome size. This allows exclusion of all subsequences that are present in a selected host/background genome (e.g. human) in the PCR primer/microarray probe design step, which has greatly increased speed, predictability and effectiveness compared with current design methods. The microarray format is particularly attractive as it permits testing for multiple pathogens simultaneously (e.g. the set of viral pathogens causing similar symptoms in hosts or those rampant in the same regions in which the infection has occurred). We refer to the sequences that are present in the genome of interest and absent from the host genome as being ‘host-blind’ (human-blind, mosquito-blind, mouse-blind, rat-blind, etc.) sequences. The greater the number of changes necessary to ‘convert’ such a host-blind sequence to a sequence found in the host genome, the less likely the host-blind sequence, when used as a PCR primer and/or microarray probe, is to mispair with the host’s genomic sequence. Thus, our algorithms can also exclude, in the design step, all host-blind sequences that are only one, two, three, etc. changes away from the nearest host sequence. We refer to such sequences as being host-blind and one [two, three, etc.] change away from the nearest host sequence (Fig. 1). This new approach can readily be extended to develop assays that are insensitive to the background of a host, such as a food (animal or plant) species, pathogen vector, or any other environmental background for which genomic sequence information is available. By using sequences three or four changes away, we can reduce and possibly eliminate false positives in the presence of background contaminating genomes.

Fig. 1. Sensitivity of host-blind sequences. The pathogen sequences are examples of host-blind sequences that are one possible change (left) and two possible changes (right) away from the nearest human host sequence (above).
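The notion of a sequence being "k changes away from the nearest host sequence" can be illustrated with a brute-force sketch. It assumes that a "change" means a single-base substitution (Hamming distance), that only one strand is scanned, and it uses toy strings rather than genomic data; the function names are hypothetical, and the exhaustive algorithms referred to in the text are certainly more efficient than this linear scan.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def changes_from_host(probe: str, host: str) -> int:
    """Smallest number of substitutions turning `probe` into some n-mer of `host`."""
    n = len(probe)
    if len(host) < n:
        raise ValueError("host sequence is shorter than the probe")
    return min(hamming(probe, host[i:i + n]) for i in range(len(host) - n + 1))


def is_host_blind(probe: str, host: str, min_changes: int = 1) -> bool:
    """True if every host n-mer differs from `probe` in at least `min_changes` bases."""
    return changes_from_host(probe, host) >= min_changes


# Toy data, not real genomic sequence.
host = "ACGTACGTTAGC"
print(changes_from_host("ACGT", host))  # 0: present in the host, so not host-blind
print(changes_from_host("GGGG", host))  # 3: host-blind and three changes away
```

With `min_changes=3` the second toy probe would be accepted and with `min_changes=4` it would be rejected, which mirrors the stringency levels discussed above.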


In the work presented here, 83 complete dengue virus genomes (representative of all four serotypes) were analyzed in conjunction with the available draft sequence of the human genome in order to find all potential probes/primers that could be used to detect or identify this pathogen in a human-derived sample. The analysis was conducted for all n-mers up to 22 nucleotides long. Our analysis focuses on those n-mers present in dengue and human-blind for all possible changes of one, two, three or four nucleotides. Several hundred human-blind sequences were identified, including those that were (a) present in each individual viral strain’s genome, (b) present in all 83 dengue strains regardless of their serotype, (c) unique to each serotype of the virus (present in all strains of the serotype), and (d) unique to each individual viral strain’s genome (present in the strain and absent from all other strains).

The results demonstrate that any method of identification based solely on hybridization with a particular unique sequence or a small set (typically fewer than six) of sequences, as used in the existing tests for dengue diagnosis, would not be able to reliably accommodate potential mispriming. To minimize the probability of misdiagnosis, sequences that require three, four or more bases to be altered for a mispriming to occur are considered ideal for identification purposes. A multiple probe approach was taken in which detection and identification of any dengue virus strain in the presence of human DNA was developed using characteristic sequences. A sample probe set that could be used in a microarray format was developed and tested by in silico hybridization. This probe set was designed to contain the minimal number of probes necessary to detect and identify dengue at the strain level and to unequivocally distinguish between the four major serotypes.

Results

Human n-mers

In order to generate the set of all probes/primers that are present in the dengue virus genome(s) and absent in (and distant from) the human genome, analysis of the human genome was first necessary. An analytical model was designed to provide us with an estimate of the absence of subsequences from the complete human genome given any one, two, three, or four changes. In addition, calculations were performed for the complete human genomic sequence for n < 18. (The results are included in the Supplementary material.) For n-mers of size less than 15 nucleotides, sequences present in the human genome lie within two changes of all possible sequences of 14 nucleotides. It is only when n = 16 that the human genome does not include a sequence within three changes of any selected n-mer. Therefore, in our search for human-blind n-mers, only values of n for which some sequences are actually absent from the human genome should be considered. Furthermore, it is important that the number of sequences absent from the human genome is large enough that there is a reasonable probability that some will occur within the much smaller viral genome. For large values of n, however, specificity becomes a greater concern. Thus, calculations and analysis were confined to n-mers of size 16–22.

Human-blind sequences present in each individual viral genome

For each of the 83 dengue virus genomes, calculations identified each n-mer, with 16 ≤ n ≤ 22, that is at least one, two, three, or four changes away from the nearest human sequence (Fig. 2); the results are provided in the Supplementary data. The presence or absence of each n-mer was calculated, rather than the frequency of occurrence. There were no 16-mers three changes away from the nearest human sequence in any of the viral genomes and only a single 17-mer (and its complementary 17-mer), which was found in just two of the dengue strains. It is not until 19-mers were considered that all of the dengue genomes were found to have some human-blind sequences at least three changes away from the nearest human sequence. The sequences four changes away are ideal candidates for use in recognizing dengue because it is unlikely that mispriming will occur and a false positive will be reported. Each of the 83 strains had sequences at least four changes away when considering n-mers for n ≥ 21.

Fig. 2. Number of unique human-blind sequences found in each of the 83 complete dengue virus strain genomes considering different sizes of n and number of changes away from the nearest human sequence. Also listed is the average number of 22-mers present in an individual genome and absent from the human sequence given any one, two, three or four changes. The ideal set of probes would be 22-mers that are four changes away; 16-, 17- and 18-mers with one change away will lead to false-positive results because the mismatches could be tolerated in the hybridization between the host target and dengue probes.

Sequences present in all 83 dengue genomes

regardless of serotype

Prior to our calculations, we hypothesized that there would be some human-blind n-mers that are present in all 83 dengue genomes. Such sequences could then serve as a reliable indicator of the presence of the virus in a complex sample (e.g. an infected individual). Using all 83 genomic sequences, the number of such n-mers was calculated (Table 1). The number of unique sequences was quite small. There appear to be several reasons for this. First, as n increases, the number of common n-mers decreases. Second, because the number of human-blind sequences in general is smaller for small n-mer sizes (no human-blind n-mers for n < 11 and no human-blind n-mers two changes away from the nearest human sequence for n ≤ 15), the number of human-blind sequences decreases rapidly. It is also obvious that by requiring characteristic sequences to be at least two, three, etc., changes away from the nearest human sequence, one dramatically reduces the number of available sequences (Table 1). Such sequences are ideal primers/probes for identification of dengue because of their decreased probability of a false positive, yet there are no n-mers present in all of the 83 dengue sequences and absent from human given any 3+ changes. There is only one 18-mer (and its complement) two changes away from the nearest human sequence and shared by all 83 dengue genomes. This 18-mer could be used to detect the presence of dengue in a human sample; however, it is possible, and even likely, that this sequence could mispair with the host sequence or related flavivirus genomes. Our results lead us to the conclusion that there are no human-blind sequences common to all 83 dengue strains that are at least three changes away from the nearest human sequence.
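The search for n-mers shared by every genome and absent from the host amounts to a set intersection followed by a set difference. The following is a minimal sketch on toy strings, covering only the "absent from the host" (one change away) case; the helper names are hypothetical, and it keeps whole n-mer sets in memory, which would not scale to the human genome without the specialised algorithms the paper refers to.

```python
def nmers(seq: str, n: int) -> set[str]:
    """All distinct subsequences of length n (presence only, not frequency)."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def shared_nmers(genomes: list[str], n: int) -> set[str]:
    """n-mers present simultaneously in every genome of the list."""
    if not genomes:
        return set()
    common = nmers(genomes[0], n)
    for genome in genomes[1:]:
        common &= nmers(genome, n)
    return common


def host_blind(candidates: set[str], host: str, n: int) -> set[str]:
    """Candidates that do not occur anywhere in the host sequence."""
    return candidates - nmers(host, n)


# Toy data, not real genomic sequence.
viral = ["AACGTTG", "CCACGTT", "ACGTTAA"]
host = "TTCGTTAC"
common = shared_nmers(viral, 4)
print(sorted(common))                       # ['ACGT', 'CGTT']
print(sorted(host_blind(common, host, 4)))  # ['ACGT']
```

Requiring membership in all genomes shrinks the candidate set quickly, which is the same effect visible in the first row of Table 1 as n grows.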

Sequences unique for serotype 1 and 2

We also calculated the number of unique sequences for the dengue type 1 (DENV-1) and DENV-2 serotypes, as these types comprise the great majority of the 83 genomes considered (Table 2). It is likely that when a more extensive sample of DENV-3 and DENV-4 genomes becomes available, the results will be similar. It is observed that while there are far more human-blind n-mers shared within each group, as the sequence length and stringency increase the number of common n-mers decreases. In the case of DENV-2, there are no n-mers four changes away from the nearest human sequence shared amongst all 46 virus genomes. Further analysis of all serotype-specific sequences is required to verify that they are unique with respect to other flavivirus genomes as well. Selecting host-blind primers/probes that are unique to the serotype and host-blind with the most changes possible

Table 1. The number of n-mers present simultaneously in all 83 dengue genomes. The first row does not consider whether the sequences are absent from the human genome, only that they are present in all of the dengue genomes.

n                                                          16    17    18    19    20    21    22
Present in all dengue; absence in human not considered     20    14     8     4     2     0     0
Human-blind one change away                                 8    12     8     2     2     0     0
Human-blind two changes away                                0     0     2     0     0     0     0
Human-blind three changes away                              0     0     0     0     0     0     0
Human-blind four changes away                               0     0     0     0     0     0     0

Table 2. Human-blind sequences present simultaneously in all DENV-1 and DENV-2 genomes. DENV-3 and DENV-4 are not included, because the few sequences that are available are so similar to each other that the vast majority of the n-mers present in the sequence are unique to the serotype.

n                                                          16    17    18    19    20    21    22
DENV-1 (28 genomes)
  Present in all dengue; absence in human not considered  664   558   458   392   336   284   254
  Human-blind one change away                             218   372   408   382   334   284   254
  Human-blind two changes away                              2    38    94   172   250   252   242
  Human-blind three changes away                            0     0     2    12    54   118   182
  Human-blind four changes away                             0     0     0     0     0     2    38
DENV-2 (46 genomes)
  Present in all dengue; absence in human not considered   62    54    44    34    24    16    10
  Human-blind one change away                              24    38    40    34    24    16    10
  Human-blind two changes away                              0     0     6     8    16    16    10
  Human-blind three changes away                            0     0     0     0     0     0     4
  Human-blind four changes away                             0     0     0     0     0     0     0


from the nearest human sequence would ensure a more reliable method of detection with a lower false-positive rate than the currently available techniques.

Sequences unique for each individual viral

genome

For each n-mer present in a dengue genome, the number of other dengue genomes that also contain this particular n-mer was calculated. On average, 4.4% (16-mers) to 8.3% (22-mers) of the viral genome is comprised of n-mers that are not present in any of the other dengue genomes. For example, in the genome of the DENV-4 China Guangzhou B5 strain (AF289029), 75.4% of the 22-mers are unique to this genome. Four genomes (one DENV-1, two DENV-2, and one DENV-4) do not have any 16- to 22-mers that do not occur in any other dengue strain’s genomic sequence. Thus, no single sequence could be used as a primer/probe to identify one of these strains. Figure 3 shows the distribution of the percentage of unique n-mers per genome for 16- to 22-mers. This analysis was next extended to those n-mers that are human-blind. The average percentage of host-blind n-mers that are unique to a particular genome is less than 8%, and many genomes have no human-blind n-mers at least two changes away from the nearest human sequence. Despite this low average, there are several genomes that have a higher number of human-blind sequences than would be expected. In AF289029, 30 of the 34 16-mers that are two changes away from the nearest human sequence are unique, and 16 014 of its 21 248 22-mers one change away from the nearest human sequence are unique. Figure 4 reflects the distribution of host-blind 22-mers one, two, three or four changes away from the nearest human sequence. A complete report of our calculations of unique n-mers for all 83 genomes is available in the Supplementary data.
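The per-genome uniqueness statistic described above can be sketched by counting, for each n-mer, how many genomes carry it. This is an illustrative sketch only: the strain labels and sequences are toy values, the function names are hypothetical, and the fraction is taken over distinct n-mers, which is an assumption about how the reported percentages were normalised.

```python
from collections import Counter


def nmers(seq: str, n: int) -> set[str]:
    """All distinct subsequences of length n."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def unique_fraction(genomes: dict[str, str], n: int) -> dict[str, float]:
    """For each genome, the fraction of its distinct n-mers found in no other genome."""
    sets = {name: nmers(seq, n) for name, seq in genomes.items()}
    # carriers[m] = number of genomes that contain n-mer m
    carriers = Counter(m for s in sets.values() for m in s)
    return {
        name: (sum(carriers[m] == 1 for m in s) / len(s)) if s else 0.0
        for name, s in sets.items()
    }


# Toy data with hypothetical strain labels.
toy = {"strain_a": "AACGT", "strain_b": "ACGTT", "strain_c": "ACGT"}
print(unique_fraction(toy, 4))
# {'strain_a': 0.5, 'strain_b': 0.5, 'strain_c': 0.0}
```

In the toy data `strain_c` has no n-mer of its own, the situation in which no single sequence can serve as a strain-specific primer or probe.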

In silico array hybridization studies

To reduce the likelihood of false positives, host-blind sequences that are 3+ changes away from the nearest human sequence are ideal for diagnostic purposes. The fact that there is no single sequence meeting this criterion for all of the 83 dengue strains considered suggests the use of multiple probes in a parallel (e.g. array) assay in which a particular unique subset will hybridize with each dengue virus strain. Thus, in the case of a microarray assay, identification would be based not on a single unique sequence but rather on a unique pattern. The set of probes can be designed such that each serotype, or even each strain, will produce a unique pattern. The proposed approach can easily be extended to unsequenced strains of dengue because novel patterns can be compared with all known patterns and the affinity of the new isolate to the known strains can be inferred by clustering techniques. Identification of the particular strain of infection would, for example, allow epidemiologists and public health officials to rapidly determine if an isolate causing hemorrhagic fever represents a new outbreak or belongs to known circulating versions of the virus. The ability to quickly, inexpensively, and reliably diagnose dengue at the strain level in such a manner is not possible with existing techniques.

Fig. 3. Distribution of the percentage of n-mers per genome that are unique (i.e. not contained in any of the other dengue genomes considered).

Fig. 4. Distribution of the percentage of human-blind 22-mers per genome that are unique (i.e. not contained in any of the other dengue genomes considered).

Based upon the results presented above for human-blind n-mers in the 83 dengue genomes, a set of 216 probes (22-mers, at least three changes away from the nearest human sequence) was designed for in silico experiments. The 216-probe set was computed as the minimum number of probes possible to uniquely identify each of the 83 genomes such that each genome was required to contain a subset of at least 28% (in this case 61) of the 216 22-mers or probe sequences. Furthermore, for any two strains’ genomes, the subsets contained in each must differ by at least two sequences. If two strains differ only by two sequences and mutations occur in these two sequences, the strains will be indistinguishable. The likelihood of such an occurrence can be reduced by demanding more sequences for distinguishing between any two individual strains; this, however, will necessitate a larger probe set size. We further stipulated that serotypes are distinguishable from each other such that any strain in one serotype differs significantly from any strain in any of the other three serotypes. To this end, it was required that serotypes be distinguishable by at least 20% of the 216 22-mers or probe sequences contained in any of their strain members. For the 216-probe set, the minimum number of probes differentiating a strain from all strains belonging to the three other serotypes was 70 for DENV-1, 56 for DENV-2, 65 for DENV-3, and 56 for DENV-4. Thus, in the event that identification is not possible at the strain level as a result of mutations, identification at the serotype level is possible.
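The text does not say how the minimal probe set was computed, so the sketch below does not attempt to design one. It only checks a candidate set against the two strain-level constraints stated above, under the assumptions that each genome's pattern is the set of probe indices it contains and that "differ by at least two sequences" refers to the symmetric difference of two patterns. All names and toy patterns are hypothetical.

```python
from itertools import combinations


def check_probe_set(patterns: dict[str, set[int]], n_probes: int,
                    min_fraction: float = 0.28, min_pair_diff: int = 2) -> list[str]:
    """Return a list of violated constraints (empty if the probe set passes)."""
    problems = []
    # Constraint 1: every genome must contain at least min_fraction of the probes.
    for name, hits in patterns.items():
        if len(hits) < min_fraction * n_probes:
            problems.append(f"{name}: only {len(hits)} of {n_probes} probes present")
    # Constraint 2: any two patterns must differ in at least min_pair_diff probes.
    for (a, hits_a), (b, hits_b) in combinations(patterns.items(), 2):
        diff = len(hits_a ^ hits_b)
        if diff < min_pair_diff:
            problems.append(f"{a} vs {b}: patterns differ in only {diff} probe(s)")
    return problems


# Toy patterns on a hypothetical 10-probe array.
toy_patterns = {
    "s1": {0, 1, 2},
    "s2": {0, 1, 3},
    "s3": {0, 1, 2, 4},
    "s4": {5, 6},
}
for problem in check_probe_set(toy_patterns, n_probes=10):
    print(problem)
```

With the defaults and 216 probes, the first check requires at least 61 probes per genome, matching the figure quoted in the text. The serotype-level requirement (at least 20% of the probes) could be checked the same way by restricting the pairs to strains from different serotypes.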

To estimate the probability that a misdiagnosis occurs at the serotype level, we assume, in the worst case scenario, that a target sequence in dengue will no longer hybridize with its complementary probe if just one point mutation occurs. For a given sequence of length $l$, there are $l - n$ n-mers. As the length of the dengue genomic sequence is significantly larger than the sizes of n considered here, $l - n \approx l$, such that the probability that $m$ specific n-mers are mutated can be estimated as $m!/l^{m}$. To misdiagnose the infection at the serotype level in the 216-probe set would require at least 56 mutations ($m = 56$) to occur within a dengue genome of 10 000 bp ($l = 10\,000$). Thus, the probability that such an event would occur is $1 : 10^{150}$.
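The order of magnitude quoted above can be checked numerically. The sketch below evaluates the estimate $m!/l^{m}$ in log space, since the value itself is far below what a floating-point number can usefully display; the function name is an illustrative choice.

```python
from math import lgamma, log


def log10_mutation_probability(m: int, l: int) -> float:
    """log10 of m!/l**m, the estimate that m specific n-mers are all mutated."""
    # lgamma(m + 1) is ln(m!), so the whole expression stays in log space.
    return (lgamma(m + 1) - m * log(l)) / log(10)


value = log10_mutation_probability(56, 10_000)
print(round(value, 1))  # about -149.1, i.e. a probability on the order of 10**-150
```

For $m = 56$ and $l = 10\,000$ this gives roughly $7 \times 10^{-150}$, consistent with the stated $1 : 10^{150}$.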

The microarray of 216 probes represents what many researchers can produce in-house at low cost. We determined the pattern that would appear on the microarray given a particular genome’s ability to hybridize with the probe sequences. Figure 5 shows the overlapping expression patterns for two pairs of genomes for the set of 216 probes. The number of probes present on the 216-probe microarray for each of the 83 genomes ranges from 61 to 95. Because dengue infections occur in regions in which other flaviviruses are also prevalent, it is imperative that a diagnostic tool is able to discriminate between the different viruses [2]. Considering a close relative of dengue virus, West Nile virus, we computed the number of probes expected to be present in 26 publicly available strains. Of the 216 dengue probes, at most three would hybridize with a West Nile strain. In fact, 24 of the West Nile virus strains share these same three 22-mers with the dengue virus strains. In the event that the clinical sample contained West Nile virus and not dengue virus, the expression pattern is expected to show only about 1% of the probes hybridized, far less than the 28% required during the set design. Thus, it is highly unlikely that a misidentification of the presence of dengue will be made using the 216-probe set, even in the presence of another flavivirus.

Fig. 5. Overlapping in silico expression patterns obtained using the 216 probes with pairs of dengue virus genomes. (A) DENV-1 strain BR/90 AF226685 (green) and DENV-2 M29095 (red); (B) DENV-2 from Cambodia AF309641 (green) and DENV-3 DENCME (red). Probes present in both genomes are shown in black, while probes absent in both genomes are shown in white.

To estimate the ability of such arrays to distinguish between different strains of a virus, as well as its possible genomic modifications, we introduce the distance (D) between any two patterns:

$$D = 1 - \frac{n_{12}}{\min(n_1, n_2)},$$

where $n_1$ and $n_2$ are the numbers of probes present in each of the genomes being compared and $n_{12}$ is the number of probes present in both genomes simultaneously. While there are many different ways to define such a distance between patterns, we chose this definition because of its simplicity; the distance is 0 if both genomes produce the same pattern and 1 if they do not share any of the same probes.
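The distance just defined translates directly into code if each hybridization pattern is represented as the set of probe indices that are present. This representation, and the function name, are assumptions made for illustration.

```python
def pattern_distance(a: set[int], b: set[int]) -> float:
    """D = 1 - n12 / min(n1, n2) for two hybridization patterns."""
    if not a or not b:
        raise ValueError("both patterns must contain at least one probe")
    n12 = len(a & b)  # probes present in both genomes simultaneously
    return 1 - n12 / min(len(a), len(b))


print(pattern_distance({1, 2, 3}, {1, 2, 3}))  # 0.0, identical patterns
print(pattern_distance({1, 2}, {3, 4}))        # 1.0, no shared probes
print(pattern_distance({1, 2, 3, 4}, {3, 4, 5}))
```

One property worth keeping in mind is that D is also 0 whenever one pattern is a subset of the other, because the overlap is normalised by the smaller pattern.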

By computing the distances between each pair of the 83 patterns (the distance matrix), we were able to group virus isolates using PHYLIP’s KITSCH (University of Washington, Seattle, WA, USA) [22] and visualize these groups using publicly available software packages [23,24] based on the distances between the patterns observed on the microarray (Fig. 6). The trees generated clearly separate DENV-1 strains from the remainder of the serotypes. DENV-3 and DENV-4 are most closely clustered within their own respective serotypes but are nested within the DENV-2 branch. While this may be attributed to the fact that there are far fewer DENV-3 and DENV-4 genomes available to be included in this analysis, it is much more probable that it is a result of the design process itself. Because each strain must contain a percentage of the overall probe set, sequences that are unique to a strain are, in essence, selected against.

A second probe set was designed containing a random sampling of 4000 sequences (18-mers, two changes away from the nearest human sequence). Because members of this set were chosen at random, many more sequences unique to a single strain or to just a few strains are included. A tree displaying similarity between isolates was also created using this set (Fig. 7). This allows the dengue strains to be grouped by their origin and the time at which the samples were taken, as well as by serotype. Although a true evolutionary history of the virus can probably not be obtained in this way, the results suggest that an unknown isolate can be characterized with respect to its closest relatives by a comparison of hybridization patterns. Well-developed methods such as k-means [25], self-organising map (SOM) [26] and hierarchical clustering [27] might improve determinations of how a new isolate compares to the previously studied isolates.
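Characterising an unknown isolate by its closest relatives can be sketched as a nearest-neighbour lookup over known patterns, using the pattern distance $D = 1 - n_{12}/\min(n_1, n_2)$ defined earlier. This is a simple stand-in, not the tree-building or clustering methods cited in the text; the strain labels and patterns are hypothetical toy values.

```python
def pattern_distance(a: set[int], b: set[int]) -> float:
    """D = 1 - n12 / min(n1, n2); patterns are sets of probe indices."""
    if not a or not b:
        raise ValueError("both patterns must contain at least one probe")
    return 1 - len(a & b) / min(len(a), len(b))


def closest_strains(unknown: set[int], known: dict[str, set[int]],
                    top: int = 3) -> list[tuple[str, float]]:
    """Known strains ranked by increasing distance to the unknown pattern."""
    scored = sorted((pattern_distance(unknown, p), name) for name, p in known.items())
    return [(name, dist) for dist, name in scored[:top]]


# Toy patterns with hypothetical strain labels.
known = {
    "DENV-1_a": {0, 1, 2, 3},
    "DENV-1_b": {0, 1, 2, 4},
    "DENV-2_a": {5, 6, 7, 8},
}
print(closest_strains({0, 1, 2, 3, 9}, known))
# [('DENV-1_a', 0.0), ('DENV-1_b', 0.25), ('DENV-2_a', 1.0)]
```

The ranked list gives both a serotype call (the serotype of the nearest strains) and a measure of how novel the isolate is relative to the sequenced set.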

Discussion

We found no single sequence 16–22 bases in length present in all 83 dengue sequences and absent given three base changes from the human genome. Therefore, our approach was to use a unique pattern made by a group of oligos (minimum 216) to identify a particular virus strain. A probe set of 216 human-blind sequences was designed that can both diagnose and identify the most similar strain among those whose genome has previously been sequenced. The tests currently being used in the field can at best only distinguish between serotypes. With this decreased specificity, the probability of misdiagnosis remains a major concern. The assay proposed here will essentially reduce the error in misdiagnosing the serotype of dengue to $1 : 10^{150}$. With microarray technology, the 216-probe set can easily be accommodated in a single diagnostic device. Here, just one experiment provides both diagnosis and phylogenetic tree construction. This assay will be able, without necessitating viral isolation, to quickly detect a new pattern signifying a new strain of dengue, almost akin to sequencing the genome. The ability to identify strains very similar to an unknown isolate in the data set of sequenced dengue genomes may be especially valuable in epidemiological studies where one would like to rapidly understand the origins of an outbreak of hemorrhagic fever. For example, if such an outbreak were to occur in a location where dengue fever is indigenous, it may be the result of a new variant of the virus which is common in that region, a re-emergence of an earlier version, a continuation of an outbreak from the previous season or the introduction of a new strain as the result of travel. If the needed complete sequences are obtained initially, hybridization arrays will allow these alternative explanations to be monitored on an ongoing basis. Such monitoring might be conducted routinely to detect changes in the local virus population before cases of hemorrhagic fever occur.

Fig. 6. Dengue groupings based on the similarity of the observed hybridization patterns for the 216-probe hypothetical microarray of 22-mers at least three changes away from the nearest human sequence.

If reduction of cost and size of this test are critical, the ability to identify dengue at the strain level can be sacrificed such that specificity is available only at the serotype level. The host-blind technology provides a much more reliable solution than is currently available by greatly decreasing the likelihood that a primer/probe sequence will mispair with the host sequence. We are confident in the ability of this technology for reliable detection. Human-blind sequences have been successfully used as PCR primers generating amplicons matching those predicted computationally (M. Anez, R. C. Willson, et al., unpublished results). The development of host-blind diagnostic microarrays is underway. The human-blind dengue primer/probe sequences can be additionally improved to be blind not only to humans but also to single nucleotide polymorphisms and organisms known to be associated with humans (e.g. microflora), in addition to pathogens known to be transmitted by the same vector. Furthermore, the computation-based host-blind approach can easily be extended to include not only human hosts but also mouse, rat, chicken, chimpanzee, mosquito, and any other sequenced host genome.

Experimental procedures

Data

Version 3.2.2 of the human genome was used. This partially assembled human genome, located in 944 files containing 2 860 215 662 base pairs of sequence, is available from GenBank (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=9606). This version contains 794 007 unknown/unidentified bases. For simplicity, all n-mers containing such characters were excluded from the calculations. Moreover, because the file structure of the genome assembly does not allow the assembly of each chromosome without gaps, all n-mers having a subsequence belonging to one file and the remaining sequence in another file were not included in our calculations. All calculations on the human genome utilized both the original and complementary strand sequences.
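The two exclusion rules described above can be sketched as follows. This is a minimal illustration, assuming each genome file has already been read into a plain string; the function name `valid_nmers` and the toy input are hypothetical and do not come from the authors' software.

```python
def valid_nmers(files, n):
    """Yield every n-mer made only of A/C/G/T, never spanning two files."""
    for seq in files:                  # each file is scanned independently
        seq = seq.upper()
        run = 0                        # length of the current run of known bases
        for i, base in enumerate(seq):
            run = run + 1 if base in "ACGT" else 0
            if run >= n:               # the last n characters are all known bases
                yield seq[i - n + 1 : i + 1]


# Two toy "files"; the N stands for an unknown/unidentified base.
print(list(valid_nmers(["ACGTNAC", "GTA"], 3)))
```

Because every file is processed on its own, an n-mer that would start at the end of one file and finish at the start of the next is never produced, and any window containing an unknown base is skipped.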

Eighty-three complete sequences of the dengue virus (28 DENV-1, 46 DENV-2, two DENV-3, and seven DENV-4) were considered. This set of sequences, including their accession numbers, is provided in the Supplementary data. The dengue genome is approximately 10 kb with minor variations in length. Although dengue is a single-stranded RNA positive-strand virus with no DNA stage, both the original and complementary strand sequences were used in our calculations as a precautionary measure.

Fig. 7. Dengue groupings obtained from the similarity of the observed hybridization patterns for the hypothetical 4000-probe microarray of randomly sampled 18-mers at least two changes away from the nearest human sequence.
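Considering both strands amounts to processing each sequence together with its reverse complement. The sketch below assumes the sequences are stored in the DNA alphabet (A, C, G, T), as is usual for database entries of RNA viruses; the helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse, to read the opposite strand 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def both_strands(seq):
    """Return the original sequence and its complementary strand."""
    return [seq, reverse_complement(seq)]


print(both_strands("ATGC"))
```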

Calculations

We have recently developed a set of novel algorithms that make it possible to analyze the occurrence frequency of all short subsequences (n-mers) of length 5–25+ nucleotides in any sequenced genome within a reasonable time (hours) [28–30]. The unique properties of this new approach are:

• exact consideration of each subsequence of size n (n-mer), in contrast to traditional BLAST-based approaches (no approximate heuristics – no missing cases);

• consideration of all sequences that can be derived from each n-mer by up to any four changes;

• extremely good time efficiency: calculations for up to 19-mers can be performed on a regular desktop PC; calculations for 20 ≤ n < 25+ can be performed using a standard high-performance cluster;

• a large number of background (or host) genomic sequences can be taken into consideration in one run, as needed to avoid possible false positives; and

• it can be used for genomic sequences of all sizes of practical interest, including the human genome (3 Gb).
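To make the second property concrete, the sketch below enumerates every sequence that can be derived from an n-mer by a limited number of changes. It interprets a "change" as a single-base substitution, which is an assumption, and it is a brute-force illustration rather than the authors' algorithm; the function name `neighbours` is hypothetical.

```python
from itertools import combinations, product


def neighbours(nmer, max_changes, alphabet="ACGT"):
    """All sequences within max_changes substitutions of nmer (including nmer itself)."""
    found = {nmer}
    for k in range(1, max_changes + 1):
        for positions in combinations(range(len(nmer)), k):
            # At each chosen position, try every base other than the original one.
            choices = [[b for b in alphabet if b != nmer[p]] for p in positions]
            for replacement in product(*choices):
                chars = list(nmer)
                for p, b in zip(positions, replacement):
                    chars[p] = b
                found.add("".join(chars))
    return found


print(len(neighbours("ACG", 1)))   # 1 original + 3 positions x 3 alternative bases
```

A candidate probe is "at least d changes away" from a host genome when none of the sequences within d - 1 substitutions of it occurs in the host. The neighbourhood grows quickly with n and with the number of changes, which is why an efficient implementation matters at genome scale.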

The basic idea is to associate each of the 4^n possible n-mers with a particular element of a counting array, A, and to define a procedure that converts the n-mer character sequence into the index of that element. It currently takes less than one minute to find the set of all 16-mers present in dengue and absent in the human genome. For PCR assay applications, the spacing of pairs and sets of primers is important; however, these can be designed much faster if consideration can be limited to only the subset of unique n-mers present in the genome of interest; extension of the algorithms to a PCR primer set design is now in progress.
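The indexing idea can be sketched by reading an n-mer as a number in base 4. The code below is a simplified illustration under stated assumptions: it records presence (0 or 1) rather than full counts, it uses an assumed encoding A=0, C=1, G=2, T=3, and all function names are hypothetical. In this naive form the array of 4^n entries is only practical for small n.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def nmer_index(nmer):
    """Read the n-mer as a base-4 number: an index into an array of 4**n cells."""
    idx = 0
    for base in nmer:
        idx = idx * 4 + CODE[base]
    return idx


def presence_array(seq, n):
    """Mark which of the 4**n possible n-mers occur in seq."""
    present = bytearray(4 ** n)
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        if all(b in CODE for b in window):     # skip windows with unknown bases
            present[nmer_index(window)] = 1
    return present


def host_blind_nmers(target, host, n):
    """Sorted n-mers that occur in target but nowhere in host."""
    in_host = presence_array(host, n)
    blind = set()
    for i in range(len(target) - n + 1):
        window = target[i:i + n]
        if all(b in CODE for b in window) and not in_host[nmer_index(window)]:
            blind.add(window)
    return sorted(blind)


print(host_blind_nmers("ACGTA", "CGTT", 3))
```

Once the host array has been filled, testing whether a candidate n-mer is absent from the host is a single array lookup, independent of the size of the host genome.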

Probe selection

It is our intent to define the minimum optimal set of subsequences, s_min, that can both identify the presence of a particular pathogen and distinguish between different strains of the pathogen. To ensure the sensitivity needed to properly identify a genomic sequence, each genome under consideration must contain at least α subsequences from s_min. If applicable, each subclass or type must be distinguishable from any other subclass or type by at least β subsequences. The set of subsequences present in each genome must differ by at least γ subsequences from the set present in every other genome. Furthermore, for each element k in s_min, its complement k′ must not be a member of s_min.

In designing this optimal set, an evolutionary programming approach was taken. While many sets, s, may meet the criteria above, a fitness function is needed to measure how ‘good’ a particular set s is in order to determine whether it is, in fact, the optimal solution. For instance, a particular set may exceed the minimum values required of α, β and γ and, in fact, have values A, B and Γ, where A ≥ α, B ≥ β and Γ ≥ γ. While these values contribute to the fitness of a set, the size of the set plays a much greater role in an effort to reduce the number of probes needed and thus the cost of the array. Therefore, we chose to evaluate the fitness of a particular set as f(s) = (A + B + Γ) / (set size), such that for the optimal set, s_min, there exists no other set with a greater fitness value. For sets consisting of host-blind sequences, the number of changes away from the host genome must also be integrated into the assessment of fitness.
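The selection criteria and the fitness function can be illustrated on toy data. The sketch below rests on several assumptions that are not taken from the source: a probe "hits" a genome when it occurs as an exact substring, the complement of a probe is read as its reverse complement, and two types are considered distinguishable by the probes found in every genome of one type and in no genome of the other. All function names and the example data are hypothetical, and the host-distance term mentioned in the text is not included.

```python
from itertools import combinations


def reverse_complement(seq):
    """Reverse complement of a DNA-alphabet string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def type_separation(hits_1, hits_2):
    """Probes found in every genome of one type and in no genome of the other."""
    core_1, core_2 = set.intersection(*hits_1), set.intersection(*hits_2)
    any_1, any_2 = set.union(*hits_1), set.union(*hits_2)
    return len(core_1 - any_2) + len(core_2 - any_1)


def observed_values(probes, genomes, types):
    """Return (A, B, G) for a probe set; needs at least two genomes and two types."""
    hits = {g: {p for p in probes if p in seq} for g, seq in genomes.items()}
    by_type = {}
    for g, label in types.items():
        by_type.setdefault(label, []).append(hits[g])
    A = min(len(h) for h in hits.values())                 # weakest genome coverage
    B = min(type_separation(by_type[t], by_type[u])
            for t, u in combinations(by_type, 2))          # weakest type separation
    G = min(len(hits[x] ^ hits[y])
            for x, y in combinations(genomes, 2))          # weakest genome separation
    return A, B, G


def no_complement_pairs(probes):
    """True if no probe's reverse complement is also in the set."""
    return all(reverse_complement(p) not in probes for p in probes)


def fitness(probes, genomes, types):
    """f(s) = (A + B + G) / set size."""
    A, B, G = observed_values(probes, genomes, types)
    return (A + B + G) / len(probes)


# Toy data, invented for illustration only.
genomes = {"d1a": "AACCCG", "d1b": "AACTTT", "d2a": "GATTCA"}
types = {"d1a": "type1", "d1b": "type1", "d2a": "type2"}
probes = {"AAC", "CCG", "GAT", "TCA"}
print(observed_values(probes, genomes, types), fitness(probes, genomes, types))
```

A candidate set would be accepted only if its three observed values reach the required minimums and `no_complement_pairs` holds; among the accepted sets, a search procedure such as the evolutionary approach mentioned above would then favour the one with the highest fitness. Dividing by the set size is what rewards smaller probe sets.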

Acknowledgements

We would like to express our gratitude to the Texas Learning and Computation Center (TLCC) and to NASA (Grant NNJ04HF43G to GEF and RCW) for partial support of this work. CP’s work was supported by a training fellowship from the Keck Center for Computational and Structural Biology of the Gulf Coast Consortia (NLM Grant no. 5T15LM07093). The authors would also like to thank Dr R. Pad Padmanabhan for many interesting discussions and suggestions.

References

1 De Paula SO & da Fonseca BAL (2004) Dengue: a review of the laboratory tests a clinician must know to achieve a correct diagnosis. Braz J Infect Dis 8, 390–398.

2 Kao CL, King CC, Chao DY, Wu HL & Chang GJJ (2005) Laboratory diagnosis of dengue virus infection: current and future perspectives in clinical diagnosis and public health. J Microbiol Immunol Infect 38, 5–16.

3 Relman DA (1998) Detection and identification of previously unrecognized microbial pathogens. Emerg Infect Dis 4, 382–389.

4 Lanciotti RS, Calisher CH, Gubler DJ, Chang GJ & Vorndam AV (1992) Rapid detection and typing of dengue viruses from clinical samples by using reverse transcriptase-polymerase chain reaction. J Clin Microbiol 30, 545–551.

5 Harris E, Roberts TG, Smith L, Selle J, Krammer LD, Valle S, Sandoval E & Balmaseda A (1998) Typing of dengue viruses in clinical specimens and mosquitoes by single-tube multiplex reverse transcriptase PCR. J Clin Microbiol 36, 2634–2639.

6 De Paula SOD, Lima CDM, Torres MP, Pereira MR & da Fonseca BAL (2004) One-step RT-PCR protocols improve the rate of dengue diagnosis compared to two-step RT-PCR approaches. J Clin Virol 30, 297–301.

7 Wang WK, Sung TL, Tsai YC, Kao CL, Chang SM & King CC (2002) Detection of dengue virus replication in peripheral blood mononuclear cells from dengue virus type 2-infected patients by a reverse transcription-real-time PCR assay. J Clin Microbiol 40, 4472–4478.

8 Sudiro TM, Zivny J, Ishiko H, Green S, Vaughn DW, Kalayanorooj S, Nisalak A, Norman JE, Ennis FA & Rothman AL (2001) Analysis of plasma viral RNA levels during acute dengue virus infection using quantitative competitor reverse transcription-polymerase chain reaction. J Med Virol 63, 29–34.

9 Houng HH, Hritz D & Kanesa-thasan N (2000) Quantitative detection of dengue 2 virus using fluorogenic RT-PCR based on 3′-noncoding sequence. J Virol Methods 86, 1–11.

10 Drosten C, Gottig S, Schilling S, Asper M, Panning M, Schmitz H & Gunther S (2002) Rapid detection and quantification of RNA of Ebola and Marburg viruses, Lassa virus, Crimean-Congo hemorrhagic fever virus, Rift Valley fever virus, dengue virus, and yellow fever virus by real-time reverse transcription-PCR. J Clin Microbiol 40, 2323–2330.

11 Shu PY, Chang SF, Kuo YC, Yueh YY, Chien LJ, Sue CL, Lin TH & Huang JH (2003) Development of group- and serotype-specific one-step SYBR green I-based real-time reverse transcription-PCR assay for dengue virus. J Clin Microbiol 41, 2408–2416.

12 Tanaka M (1993) Rapid identification of flavivirus using the polymerase chain reaction. J Virol Methods 41, 311–322.

13 Figueiredo LT, Batista WC, Kashima S & Nassar ES (1998) Identification of Brazilian flaviviruses by a simplified reverse transcription-polymerase chain reaction method using flavivirus universal primers. Am J Trop Med Hyg 59, 357–362.

14 Ito M, Takasaki T, Yamada KI, Nerome R, Tajima S & Kurane I (2004) Development and evaluation of fluorogenic TaqMan reverse-transcriptase PCR assays for detection of dengue virus types 1–4. J Clin Microbiol 42, 5935–5937.

15 Wu SJL, Lee EM, Pubatana R, Shurtliff RN, Porter KR, Suharyono W, Watts DM, King CC, Murphey GS, Hayes CG et al. (2001) Detection of dengue viral RNA using a nucleic acid sequence-based amplification assay. J Clin Microbiol 39, 2794–2798.

16 Baeumner AJ, Schlesinger NA, Slutzki NS, Romano J, Lee EM & Montagna RA (2002) Biosensor for dengue virus detection: sensitive, rapid, and serotype specific. Anal Chem 74, 1442–1448.

17 Schena M, Shalon D, Davis RW & Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470.

18 Lipshutz RJ, Fodor SP, Gingeras TR & Lockhart DJ (1999) High density synthetic oligonucleotide arrays. Nat Genet 21, 20–24.

19 Woese CR, Maniloff J & Zablen LB (1980) Phylogenetic analysis of the mycoplasmas. Proc Natl Acad Sci USA 77, 494–498.

20 McGill TR, Jurka J, Sobieski JM, Pickett MH, Woese CR & Fox GE (1986) Characteristic Archaebacterial 16S rRNA Oligonucleotides. Syst Appl Microbiol 7, 194–197.

21 Zhang Z, Willson RC & Fox GE (2002) Identification of characteristic oligonucleotides in the 16S ribosomal RNA sequence dataset. Bioinformatics 18, 244–250.

22 Felsenstein J (2005) PHYLIP (Phylogeny Inference Package), Version 3.6. Distributed by the Author. Department of Genome Sciences, University of Washington, Seattle.

23 Choi J-H, Jung H-Y, Kim H-S & Cho HG (2000) PhyloDraw: a phylogenetic tree drawing system. Bioinformatics 16, 1056–1058.

24 Perrière G & Gouy M (1996) WWW-Query: An on-line retrieval system for biological sequence banks. Biochimie 78, 364–369.

25 MacQueen J (1967) Methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (LeCam LM & Neyman J, eds), pp. 281–297. California Press, Berkeley, California.

26 Dopazo J & Carazo JM (1997) Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol 44, 226–233.

27 Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58, 236–244.

28 Fofanov Y, Belapurkar C, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T, Fofanov V, Li T-B, Chumakov S et al. (2004) How independent are the appearances of n-mers in different genomes? Bioinformatics 20, 2421–2428.

29 Chumakov S, Putonti C, Pettitt BM, Fox GE, Willson RC & Fofanov Y (2004) Using statistical properties of short subsequences in microbial identification. In Proceedings of the International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (Valafar F & Valafar H, eds), pp. 363–367. CSREA Press, Las Vegas, NV.

30 Fofanov V, Putonti C, Chumakov S, Pettitt BM & Fofanov Y (2005) Fast Algorithm for the Analysis of the Presence of Short Oligonucleotide Sequences in Genomic Sequences. UH Technical Report #UH-CS-05-11, University of Houston, Houston, Texas. [Online: http://www.cs.uh.edu/Preprints/preprint/uh-cs-05-11.pdf]
