version 16.0: 2002-08-26: ma
The transposable elements of Drosophila melanogaster – a genomics perspective.
Joshua S Kaminker1,8, Casey M Bergman2,8, Brent Kronmiller2, Joseph Carlson2, Robert Svirskas3, Sandeep Patel2, Erwin Frise2, David A Wheeler5, Suzanna Lewis1, Gerald M. Rubin1,2,4, Michael Ashburner6,7 and Susan E Celniker2
1Department of Molecular and Cellular Biology, University of California, Berkeley, CA 94720, 2Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, 3Amersham Biosciences, 2100 East Elliot Rd., Tempe, AZ 85284, 4Howard Hughes Medical Institute, 5Human Genome Sequencing Center and Department of Molecular and Cell Biology, Baylor College of Medicine, Houston, TX 77030, 6Department of Genetics, University of Cambridge, England, CB2 3EH
Transposable elements are found in the genomes of nearly all eukaryotes. We have analyzed the Release 3 genomic sequence of Drosophila melanogaster to describe the euchromatic transposable elements in the sequenced strain of this species. We identified 85 known and 8 novel families of transposable element in the Release 3 sequences; these vary in copy number between 1 and 146. A total of 1,573 transposable elements were identified, which comprise 3.93% of the Release 3 sequences. The density of transposable elements is higher on chromosome 4 relative to the major chromosome arms, and transposable element abundance on the X chromosome is similar to that on the major autosome arms. The abundance of the three major classes of transposable elements (LTR, LINE-like, and TIR) is markedly higher in the proximal 2 Mb of each chromosome arm, reflecting the transition from euchromatin to heterochromatin, whereas the high abundance on chromosome 4 is due only to LINE-like and TIR elements. More than two-thirds of the transposable elements identified in Release 3 are partial. Analysis of structural variation of elements from different families reveals distinct patterns of deletion for each class. A large proportion of transposable elements are found "nested" within other elements of the same or different classes. Transposable elements are preferentially found outside genes; only 436 of 1,573 transposable elements are contained within the 61.4 Mb of sequences which are annotated as being transcribed. The high abundance, high proportion of complete elements and low levels of sequence diversity in LTR families suggest that individual LTR elements are more likely to be recent insertions into the D. melanogaster genome, relative to LINE-like or TIR elements. This work provides a starting point for future genomic analysis of transposable elements in Drosophila.
Transposable element sequences are abundant yet poorly understood components of almost all eukaryotic genomes (but see Arkhipova and Meselson 2000). As a result, many biologists have an interest in the description of transposable elements in completely sequenced eukaryotic genomes. The evolutionary biologist wants to understand the origin of transposable elements, how they are lost and gained by a species and the role they play in the processes of genome evolution; the population geneticist wants to know the factors that determine the frequency and distribution of elements within and between populations; the developmental geneticist wants to know what roles these elements may play in either normal developmental processes or in the response of the organism to external conditions; finally, the molecular geneticist wants to know the mechanisms that regulate the life cycle of these elements and how they interact with the cellular machinery of the host. It is for all of these reasons and more that a description of the transposable elements in the recently completed Release 3 genomic sequence of D. melanogaster is desirable.
The contribution of Drosophila to our understanding of transposable elements is long and glorious. Over 75 years ago, Milislav Demerec discovered highly mutable alleles of two genes in D. virilis, miniature and magenta (Demerec 1926; 1927; reviewed in Demerec 1935; Green 1976). Both genes were mutable in soma and germ-line and, for the miniature-3alpha alleles, dominant enhancers of mutability were isolated by Demerec. In retrospect, it seems clear that the mutability of these alleles was the result of transposition of mobile elements; the dominant enhancers may have been particularly active elements or mutations in host genes affecting transposability (see below). There matters essentially stood until McClintock's remarkable discovery of mutable alleles in maize and their basis – transposition of the Ac and Ds factors (McClintock 1950), and the discovery, some 20 years later, of insertion elements in the gal operon of Escherichia coli (see Starlinger 1977). Green (1977) synthesized the evidence then at hand to make a strong case for insertion as a mechanism of mutagenesis in Drosophila. Within a year or so Hogness' group had begun a molecular characterization of two elements in D. melanogaster, 412 and copia (Rubin et al. 1976; Finnegan et al. 1978), and evidence that they were transposable was soon available (Ilyin et al. 1978; Strobel et al. 1979; Young 1979). In fact, the Hogness group had already, but unknowingly, molecularly characterized the first eukaryotic transposable element, the insertion sequences of 28S rRNA encoding genes (see Glover 1977). The discovery of male recombination (Hiraizumi 1971), and two systems of hybrid dysgenesis in D. melanogaster (see Kidwell 1979), allowed the gap, then wide, between genetic and molecular analyses to be bridged. The discovery of the causal transposable elements, the P-element (Bingham, Kidwell and Rubin 1981) and the I-element (Bucheton et al. 1984), led to the first genomic analyses of transposable elements in a eukaryote.
The publication of the Release 1 genomic sequence in March 2000 (Adams et al. 2000) and the Release 2 genomic sequence in October 2000 encouraged several studies on the genomic distribution and abundance of transposable elements in D. melanogaster (Berezikov, Bucheton and Busseau 2000; Jurka 2000; Bowen and McDonald 2001; Rizzon et al. 2002; Bartolome, Maside and Charlesworth 2002). Unfortunately, neither release was suitable for rigorous analysis of its transposable elements since sequences corresponding to known transposable elements, along with other sequences known to be repetitive in the genome, were masked by the SCREENER algorithm and remained as gaps between unitigs (Myers et al. 2000). During the repeat resolution phase of the whole genome assembly, an attempt was made to fill these gaps. However, comparisons of small regions sequenced by the clone-by-clone approach versus the whole genome shotgun method show that this was not a very accurate process (Myers et al. 2000; Benos et al. 2001). It was clear that any rigorous analysis of the transposable elements, or any other repeat, required a sequence of higher quality. This has now been achieved by the finishing efforts of the Berkeley Drosophila Genome Project. This sequence, Release 3, is now publicly available (Celniker et al. 2002). For the first time, a reliable analysis can be performed of the nature, number and location of the transposable elements in the euchromatin of D. melanogaster.
Results and Discussion
Identification of known and novel transposable elements
Eukaryotic transposable elements are divided between those that transpose via an RNA intermediate (class I), retrotransposons, and those that transpose by DNA excision and repair (class II), non-retrotransposons (Craig et al. 2002). Within the retrotransposons, the major division is between those that possess long terminal repeats (LTR elements) and those that do not (LINE-like elements and SINE elements (Deininger 1989)). Among the non-retrotransposons, the majority transpose via a DNA intermediate, encode their own transposase and are flanked by relatively short terminally inverted repeat structures (TIR elements). Foldback elements, which are characterized by their property of reannealing after denaturation with zero-order kinetics, are quite distinct from prototypical class I or II elements, and have been included in our analyses (Truett et al. 1981). In addition, other classes of repetitive elements, such as INE-1 (Locke et al. 1999a; Locke et al. 1999b; Wilder and Hollocher 2001), which are structurally distinct from all other classes of elements, have not been included in this study.
While the classification of transposable elements by structural class is relatively easy, the taxonomy of transposable element families is somewhat arbitrary (Table 1). We used a criterion of greater than 90% identity over greater than 100 bp of sequence to assign individual elements to families (see Methods). Subsequently, to ensure proper inclusion of elements in appropriate families, we generated multiple alignments for all families of transposable elements represented by multiple copies. This allowed the identification and removal of spurious hits to highly repetitive regions of the genome, and it also enabled us to distinguish sequences of closely related families that share extensive regions of similarity.
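The family-assignment criterion above (greater than 90% identity over greater than 100 bp) can be illustrated with a minimal sketch. The `Hit` record, the `assign_family` helper and the rule of preferring the longest passing alignment are assumptions made for illustration; they are not taken from the Methods, and the subsequent multiple-alignment curation step is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One similarity-search hit of a genomic segment against a canonical element (hypothetical record)."""
    family: str              # family name of the canonical sequence that was hit
    percent_identity: float  # percent identity over the aligned region
    aligned_length: int      # aligned length in bp


def assign_family(hits, min_identity=90.0, min_length=100):
    """Return the family of the longest hit exceeding both thresholds, or None.

    The thresholds follow the text (strictly greater than 90% identity and
    strictly greater than 100 bp); choosing the longest passing hit is an
    assumption made here to break ties between families.
    """
    passing = [
        h for h in hits
        if h.percent_identity > min_identity and h.aligned_length > min_length
    ]
    if not passing:
        return None
    return max(passing, key=lambda h: h.aligned_length).family


# Illustrative values only; they are not data from the study.
example_hits = [Hit("roo", 95.2, 850), Hit("jockey", 91.0, 120), Hit("1360", 99.0, 80)]
print(assign_family(example_hits))
```

A hit that is highly similar but short (the third example hit), or one that sits exactly at 90% identity, is rejected under this reading of the criterion.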
A summary by class of the total number and number of complete transposable elements in the Release 3 Drosophila euchromatic sequence is presented in Table 2, and detailed results for individual families of transposable elements are listed in Table 3. Including those described here, there are 96 known families of transposable elements in D. melanogaster: 49 LTR families, 27 LINE-like families, 19 TIR families and the FB family. We have identified 1,573 full or partial elements from 93 of these 96 families (Table 2). In total, 3.88% (4.5 Mb) of the Release 3 sequence is composed of transposable elements. Previous analyses of both the euchromatic and heterochromatic sequences have suggested that 9% of the Drosophila genome is composed of repetitive elements (Spradling and Rubin 1981). One likely reason for this difference is that the proportion of transposable element sequences in heterochromatic regions is higher than the genomic average (Bartolome et al. 2002; Dimitri et al. 2002).
As shown in Table 2, the different classes vary in their contribution to the Drosophila euchromatin both in amount of sequence and number of elements. LTR elements make up the largest proportion of the euchromatin (2.65%), more sequence than the sum of all other classes of elements (LINE-like elements 0.87%, TIR elements 0.31%, and FB elements 0.04%). LTR elements are also the most numerous class of transposable elements in the euchromatic sequences (686), followed by LINE-like (482), TIR (373), and FB (32) elements. Thus, LTR elements are the most abundant class in the Drosophila euchromatin, both in terms of number (Rizzon et al. 2002; Bartolome et al. 2002) and amount of sequence.
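The class totals quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts and percentages given in the text; the variable names are ours. The per-class percentages are rounded values, so their sum is not expected to match a separately rounded overall figure exactly.

```python
# Element counts and percent of euchromatic sequence per class, as quoted in the text.
counts = {"LTR": 686, "LINE-like": 482, "TIR": 373, "FB": 32}
percent_of_euchromatin = {"LTR": 2.65, "LINE-like": 0.87, "TIR": 0.31, "FB": 0.04}

# Total number of elements across the four classes.
total_elements = sum(counts.values())

# Sequence contributed by all non-LTR classes combined, and by all classes.
non_ltr_percent = sum(v for k, v in percent_of_euchromatin.items() if k != "LTR")
total_percent = sum(percent_of_euchromatin.values())

# Share of all elements that belong to the LTR class.
ltr_share_of_elements = counts["LTR"] / total_elements

print(total_elements)
print(round(non_ltr_percent, 2), round(total_percent, 2))
print(round(100 * ltr_share_of_elements, 1))
```

The counts sum to the 1,573 elements reported, and the LTR contribution (2.65%) exceeds the combined contribution of the other three classes.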
The average size of all transposable elements in our study is 2.9 kb, smaller than the 5.6 kb average length of middle repetitive DNA estimated from reassociation kinetics (Manning, Schmid and Davidson 1975). LTR element sequences are on average significantly longer than either LINE-like or TIR elements (Figure 1). The average lengths of genomic LTR, LINE-like, and TIR element sequences are 4.5, 2.1, and 0.9 kb, respectively. This is in part because the average length of canonical LTR sequences (6.5 kb) is longer than that of LINE-like (4.7 kb) or TIR (2.1 kb) element sequences (Table 3).
While families with only one or two copies can be found within each class, the numbers of transposable elements in the largest families are more than an order of magnitude greater (Table 3). The largest family of LTR elements is roo (145 copies), the largest family of LINE-like elements is jockey (69 copies), and the largest family of TIR elements is 1360 (107 copies). The mean (median) number of elements per family for the LTR, LINE-like, and TIR classes of elements is 14.0 (9), 17.9 (10), and 19.6 (12), respectively. Over all classes of element, the mean number of elements per family is 16.4, and the median is 9. Based on a definition of fewer than eight elements per family, 42 of the 96 (43.8%) transposable element families are low copy number (Table 3). Three of these 42 families were not found in the Release 3 sequence though they have been reported in D. melanogaster. It is not surprising that we did not find any P-elements, since the sequenced strain was selected to be free of them. We did not find R2 and ZAM elements in the euchromatin, but we identified these elements in unmapped scaffolds. Consistent with this, the R2 element has previously been shown to be found only within the 28S rDNA locus and heterochromatin (Jakubczak et al. 1991). Strains of D. melanogaster are known to exist in which ZAM elements occur in low copy number in heterochromatic sequences (Baldrich et al. 1997). The absence of the telomere-associated HeT-A and TART from all chromosomes except chromosome 4 is not unexpected, as the tandemly repetitive telomeric sequences (Pardue and DeBaryshe 1999; Biessmann et al. 2002) are inherently difficult to assemble and are under-represented in Release 3.
We discovered eight new families of transposable elements within the Release 3 sequences. Six are members of the LTR class: frogger (EMBL:AF492763), rover (EMBL:AF492764), cruiser (a.k.a. Quasimodo) (EMBL:AF364550), McClintock (EMBL:??), qbert (EMBL:??), and Stalker4 (EMBL:??). Two are members of the TIR class: Bari2 (EMBL:??) and hopper2 (EMBL:??). The frogger element (one partial copy) was identified on the basis of its LTRs and a protein-coding open reading frame (ORF) which is 73% similar at the amino acid level to that of the Dm88 family. The rover family (six copies) was identified in a BLAST search for repetitive elements in the genome; it is most closely related to the 17.6 element (71% amino acid identity). The cruiser family (14 copies) was identified during the finishing project by virtue of its LTRs, and is most closely related in sequence to the Idefix family, sharing 60% amino acid identity. Bari2 (four copies) was identified by querying the D. melanogaster genome using a Bari1-like element isolated from D. erecta (EMBL:Y13853). The hopper2 (three copies) and Stalker4 (two copies) families were identified by an analysis of the multiple alignments of the hopper and Stalker families, respectively. These alignments exhibited evidence of distinct subfamilies on the basis of both nucleotide divergence and structural rearrangements over large regions of their alignment. The qbert family was identified by searching for regions of the genome that share similarity with protein-coding ORFs represented in our transposable element data set. The qbert family (one copy) is most closely related to the accord family and shares regions of similarity that are 66% identical at the amino acid level. The McClintock family (two copies), identified by its presence in a repeat region near the centromere of chromosome 4, is most closely related to the 17.6 family. Over its terminal 1,400 bp, McClintock shares 86% amino acid identity with 17.6. However, elsewhere these elements are quite divergent, sharing less than 50% amino acid identity over the proximal 5,000 bp of the McClintock element.
We have also discovered several other sequences with high sequence similarity to the protein-coding regions of transposable elements, but they are not associated with repeats (see also Berezikov et al. 2000). These elements cannot easily be classified to particular families. Although we have not included them in this analysis, they have been included in the Release 3 annotation of the genome (http://www.fruitfly.org/annot/).
Chromosomal distribution of elements.
The fraction of each chromosome arm composed of transposable elements varies between 3.13% and 4.27%, with the exception of chromosome 4, for which it is over 11%. This indicates an average transposable element density of ~10-15/Mb for the major chromosome arms, and over 82/Mb for chromosome 4. These densities of transposable elements are greater than the estimate of 5/Mb derived from lower resolution cytological methods (Charlesworth and Langley 1988), presumably because of clustered elements and partial elements that may give very weak in situ hybridization signals. In contrast with the theoretical prediction that the X chromosome should have the lowest density of transposable elements, if insertions are partially recessive and have deleterious effects in hemizygous males (Montgomery et al. 1987), we found no evidence for a reduction in density of transposable elements on the X chromosome relative to the major autosome arms (Table 2). In contrast to previous findings (Bartolome et al. 2002), this result suggests that the deleterious effects of transposable element insertions may not be the primary force controlling their distribution and abundance in the sequenced strain (Rizzon et al. 2002).
The densities of LINE-like elements and TIR elements on chromosome 4 are 25.04/Mb and 47.66/Mb, respectively, much higher than their densities on the major chromosome arms (2.5-5.3/Mb and 2.1-3.4/Mb, respectively) (Table 2). By contrast, the density of LTR elements on chromosome 4 is only slightly higher (8.08/Mb) than on the five major chromosome arms (5.1-6.9/Mb). Moreover, the fraction of chromosome 4 that is composed of LTR elements is only slightly higher than that for the major chromosome arms (3.56% versus 2.25-2.93%). Thus, the difference in density of transposable elements on chromosome 4 is predominantly due to the order-of-magnitude increase in the number of LINE-like and TIR elements.
Transposable element density is also known to vary along the major chromosome arms (Adams et al. 2000; Rizzon et al. 2002; Bartolome et al. 2002). As shown in Figure 2, the density of transposable elements increases in the proximal euchromatin, here defined as the proximal 2 Mb of the assembly of each of the five major chromosome arms (i.e., about 10% of the euchromatic sequence here analysed). On the major chromosome arms, 36.6% (576/1573) of the elements are located in centromere proximal euchromatin, which represents only 10% of the euchromatic sequence. This result is also consistent with previous observations that the density of transposable elements is higher in heterochromatin than within euchromatin (Charlesworth et al. 1994; Pimpinelli et al. 1995; Carmena and Gonzalez 1995; Dimitri 1997; Junakovic et al. 1998), since these centromere proximal sequences represent the transition between euchromatin and heterochromatin. There are 14 families located exclusively within the centromere proximal 2 Mb of the major chromosome arms; 12 of these are low copy number families (defined as fewer than eight copies per family). Although families which are exclusively found in the proximal euchromatin tend to be low copy number, elements belonging to low copy number families show no general tendency to be located in these regions. In fact, only a minority (27/142) of elements that belong to low copy number families are localized in the proximal 2 Mb regions of the chromosome arms. Finally, although the densities of transposable elements in the proximal euchromatin and chromosome 4 are both elevated with respect to the euchromatic average (58 and 82 elements/Mb, respectively), the composition of the elements in these regions is quite different. The increase in transposable element density in the proximal regions of the chromosome arms is due to a larger number of elements belonging to all classes (Figure 2), while the increase on chromosome 4 is due almost exclusively to LINE-like and TIR elements.
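The proximal-euchromatin density of about 58 elements/Mb follows from the figures quoted in this paragraph: 576 elements within the proximal 2 Mb of each of the five major arms. The sketch below reproduces that arithmetic; `density_per_mb` is a hypothetical helper name, and no chromosome lengths beyond those stated in the text are assumed.

```python
def density_per_mb(n_elements, length_mb):
    """Elements per megabase for a region of the given length (in Mb)."""
    if length_mb <= 0:
        raise ValueError("region length must be positive")
    return n_elements / length_mb


# Figures quoted in the text.
proximal_elements = 576      # elements in centromere proximal euchromatin
n_major_arms = 5             # X, 2L, 2R, 3L, 3R
proximal_mb_per_arm = 2.0    # proximal euchromatin is defined as 2 Mb per arm
total_elements = 1573        # all elements identified in Release 3

proximal_density = density_per_mb(proximal_elements, n_major_arms * proximal_mb_per_arm)
proximal_fraction = proximal_elements / total_elements

print(round(proximal_density, 1))         # elements per Mb in the proximal 10 Mb
print(round(100 * proximal_fraction, 1))  # percent of all elements in that region
```

The result, 57.6 elements/Mb, rounds to the 58 elements/Mb given in the text, and the fraction matches the 36.6% quoted.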
Analysis of structural variation.
It has been recognized since McClintock's definition of the Ac and Ds elements of maize that many elements can be autonomous or defective with respect to transposition. Defective elements often exhibit deletions in ORFs or terminal repeats which are necessary for transposition. Assuming that canonical elements represent full-length active copies, we defined any element less than 97% of the length of the canonical member of its family as partial. Based on this criterion, more than two-thirds (1087/1573) of the elements in the Release 3 sequence are partial (Table 2). The proportion of partial elements is reasonably uniform among major chromosome arms (64-73%), with the notable exception of chromosome 4 (91%). While more than half (576/1087) of the partial elements on the major chromosome arms lie within the proximal 2 Mb, nearly 90% of the transposable elements on chromosome 4 are partial. Since LINE-like and TIR elements make up 88% of the elements on chromosome 4, these data indicate differences in proportions of partial elements between classes. In fact, 78% of LINE-like elements and 84% of TIR elements are partial, whereas only 55% of LTR elements are partial (Table 2).
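The 97% length criterion for calling an element partial can be written down directly. In the sketch below the function names and the example lengths are ours and purely illustrative; only the threshold and the 1087/1573 ratio come from the text.

```python
def scaled_length(element_length, canonical_length):
    """Length of a genomic copy relative to its family's canonical sequence."""
    return element_length / canonical_length


def is_partial(element_length, canonical_length, threshold_percent=97):
    """True if the copy is shorter than 97% of the canonical length.

    Integer arithmetic is used so that a copy at exactly 97% is
    classified consistently, without floating-point rounding.
    """
    return 100 * element_length < threshold_percent * canonical_length


# Fraction of elements classified as partial, from the counts in the text.
partial_fraction = 1087 / 1573

# Illustrative lengths in bp against a hypothetical 6,500 bp canonical element.
print(is_partial(5000, 6500), is_partial(6400, 6500))
print(round(100 * partial_fraction, 1))
```

Under this definition a copy that exceeds its canonical length is treated as full length, which is consistent with the later observation that some elements are slightly longer than their canonical sequence.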
Analysis of the distribution of transposable element lengths scaled relative to the length of their canonical sequence shows that while all three classes have bimodal distributions of scaled element lengths, they differ significantly from one another (Figure 3). The bimodal shape of these distributions presumably reflects the boundary states of the dynamic process of deletion, excision and transposition. Only a very small number of LINE-like (9) and TIR (22) elements exceed their canonical length, indicating that rates of insertion into transposable elements are low relative to rates of deletion in D. melanogaster transposable elements (Petrov and Hartl 1998). A higher number of LTR elements (162) exceed their canonical length, but on average these elements are less than 2% longer than their canonical sequence; only 27 of the 162 LTR elements are more than 5% longer. Of these, 25/27 are due to the 412 family; we have subsequently determined that the canonical sequence used in this analysis was not full length.
We characterized the distribution of structural variation for a representative element from each of the three major classes, by determining the proportion of sequences represented in multiple alignments for a given nucleotide site (Figure 4). The resulting plot for the LINE-like jockey family approximates a negative exponential distribution starting from the 3' end (Figure 4a). LINE-like elements are known to become deleted preferentially at their 5' ends, as a consequence of the mechanism of their transposition (Finnegan 1997). The TIR element pogo shows a very different pattern; deletions predominantly occur internally, leaving the inverted repeat termini intact (see Tudor et al. 1992) (Figure 4b). By analogy with patterns of deletion in P elements (see Engels 1989), these deleted elements will be non-autonomous with respect to transposition; they presumably arise when double-stranded gap repair is interrupted (see Engels et al. 1990; Hsia and Schnable 1996). By contrast, for the representative LTR element, roo, there is a relatively uniform pattern of structural variation across the element, with the exception of two apparent deletion hotspots, both of which occur in regions that are expected to be coding (Figure 4c).
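The per-site measure described above, the proportion of copies represented at each nucleotide position of a multiple alignment, can be sketched as follows. The function name, the use of "-" for gaps, and the toy alignment are assumptions; the sketch also assumes the copies have already been aligned to equal length, which is not how the alignments themselves were produced.

```python
def site_coverage(aligned_copies, gap="-"):
    """Fraction of copies with a non-gap character at each alignment column."""
    if not aligned_copies:
        return []
    width = len(aligned_copies[0])
    if any(len(seq) != width for seq in aligned_copies):
        raise ValueError("all aligned sequences must have the same length")
    n_copies = len(aligned_copies)
    return [
        sum(seq[i] != gap for seq in aligned_copies) / n_copies
        for i in range(width)
    ]


# Toy alignment (not real data): copies progressively truncated at the 5' end,
# the pattern described in the text for LINE-like elements.
toy_alignment = [
    "ACGTACGT",
    "--GTACGT",
    "----ACGT",
    "------GT",
]
print(site_coverage(toy_alignment))
```

For 5'-truncated copies the coverage rises monotonically toward the 3' end, whereas internal deletions of the kind described for pogo would instead produce a dip in the middle with full coverage at both termini.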
Twenty-four of the 93 families (25.8%) represented in Release 3 are composed entirely of partial elements. An additional 19 families have only one full length element. Fifteen of the 24 partial-only families are low copy number (fewer than eight copies per family) and nine are high copy number. The majority of elements (125/181) in these 24 families are found in the proximal 2 Mb or on chromosome 4; all elements for 16 of these 24 families are found exclusively in these regions of the genome.
One class of defective LTR elements, solo LTR sequences, has been known for some time in Drosophila (Carbonara and Gehring 1985) and other species (S. cerevisiae, Boeke 1989; Schizosaccharomyces pombe, Wood et al. 2002; C. elegans, Ganko et al. 2001). These presumably arise by exchange between the two LTRs flanking an element, with the loss of the reciprocal product, a small circular molecule. In S. cerevisiae, 85% of all LTR element insertions are solo LTRs (Kim et al. 1998). We screened for solo LTRs of each family of element, using a criterion of 80% identity to the canonical LTR sequence of each family. Only 58 solo LTRs were identified, of which 14 are roo LTR elements.
Analysis of expressed transposable element sequences
Transcription is an essential process in the life cycle of transposable elements. Moreover, many transposable elements are known to be transcribed in developmentally regulated patterns (Ding and Lipshitz 1994; Danilevskaya et al. 1994; Kerber et al. 1996; Filatov, Morozova and Pasyukova 1998; Filatov, Nuzhdin and Pasyukova 1998). For these reasons it is important to identify transcripts produced by transposable elements. We identified which transposable elements are represented in the BDGP/LBNL's EST projects (Table 3) (Rubin et al. 2000; Stapleton et al. 2002). All LTR families and 88.9% (24/27) of LINE-like families have BLAST hits in the EST database, in contrast to only 63% of TIR families, which transpose through a DNA intermediate. (The single P-element EST was from the CK cDNA library, which was from a different strain of flies.)
Families which are composed of only partial elements may represent inactive families or families for which no active copy exists in the Release 3 sequences. Of the 24 families which contain only partial elements, the majority (19/24) have hits in the EST database, suggesting that these families are at least transcribed and may be active. This inference is supported by the fact that, of the 10 families which have no hits in the EST database, 9 have either no or only one full-length copy in the Release 3 sequence. It is possible that families which have only one full-length copy may also be inactive, if the canonical sequence is itself not a full-length functional copy. Taken at face value, these data suggest that the majority of transposable element families in Drosophila have actively expressing members.
Analysis of sequence variation within families.
There is evidence that individual point mutations can affect transposable elements. The high quality sequence of transposable elements in Release 3 allows such data to be reliably analyzed for the first time. Point mutations in coding regions of the gypsy family of retrotransposons correlate with both transposition frequency and copy number (Lyubomirskaya et al. 2001). We identified only one full-length gypsy class element (FBti:0011990). Sequence comparison of this gypsy's ORF2 with that of the mutator strain ORF2 shows these two ORFs to be identical and suggests that the single full-length gypsy element is "active" in the sequenced strain. Other families of elements have also been found to be polymorphic with respect to their coding potential in Drosophila. Kalmykova, Maisonhaute and Gvozdev (1999) found that while most 1731 elements have the +1 frameshift between their gag and pol gene regions typical of LTR elements, some do not and instead express a gag-pol fusion protein. The single full-length 1731 element (FBti:0020323) is of the latter type.
Sequence variation within families of elements was estimated by analysing the average pairwise distance within each family after multiple alignment (Table 3). These data show that intra-family variation ranges from complete identity to 26.8% average pairwise distance (S2), but with only 7 families having greater than 10% average pairwise distance. The average sequence divergence for the LTR class of elements is only 2.7%; for the LINE-like and TIR classes it is 5.4% and 7.6%, respectively. These estimates of within-family variation are remarkably similar to those of Wensink (1978) who, by studying the kinetics of DNA reassociation of cloned middle repetitive sequences, showed that families of middle repetitive sequence exhibited on the order of 3-7% sequence divergence. Analysis of the distribution of intra-family average pairwise distances by functional class shows that LTR families have lower levels of average pairwise distance relative to LINE-like or TIR families (Figure 4). These data suggest that elements in the LTR class are on average younger than those in either the TIR or LINE-like classes.
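As an illustration of the within-family measure, the sketch below computes an average pairwise distance over aligned copies. It assumes an uncorrected p-distance (proportion of differing sites) with columns skipped when either copy has a gap; the text does not state which distance measure or gap treatment was used, so both are assumptions, as are the function names and the toy sequences.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Proportion of differing sites among columns where neither copy has a gap."""
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # pairwise deletion of gapped columns (an assumption)
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no shared ungapped sites to compare")
    return differing / compared


def average_pairwise_distance(aligned_copies):
    """Mean p-distance over all unordered pairs of aligned copies in a family."""
    pairs = list(combinations(aligned_copies, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy family of three aligned copies (not real data): two identical copies
# and one copy that differs at a single site out of ten.
toy_family = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]
print(average_pairwise_distance(toy_family))
```

Because every pair contributes equally, a family dominated by recently inserted, near-identical copies yields a low average even if it contains one or two divergent members.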
Nesting and clustering of transposable elements.
The nesting of transposable elements is common in plant genomes (SanMiguel et al. 1996; SanMiguel et al. 1998; Tikhonov et al. 1999; Fu et al. 2001). For our analysis, transposable elements that have inserted within another element are termed nests, while groups of transposable elements located within 10 kb of each other are defined as clusters. We found 64 nests or clusters of transposable elements composed of 328 full or partial elements. This indicates that about 21% of all transposable elements in D. melanogaster are either inserted in another element or are positioned adjacent to another element. The number of nested or clustered elements per arm ranges from 1.4 to 3.6/Mb. The density of such elements is much higher in the proximal regions of the euchromatic arms; of the 64 nests or clusters, 25 are within the proximal 2 Mb regions of the major chromosome arms (see also O'Hare et al. 2002). A large proportion (89%) of the elements belonging to nests or clusters are partial, in contrast to that seen for all elements (69%). Nesting and clustering appear to be somewhat more frequent for LTR elements (29.3% nested or clustered) than for either LINE-like elements (12.0% nested or clustered) or TIR elements (15.8% nested or clustered). This is presumably due to the larger proportion of the LTR elements present in the Drosophila euchromatin.
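The 10 kb rule for grouping elements can be sketched as a single pass over coordinate-sorted elements on one chromosome arm. The sketch reads "within 10 kb" as a gap of at most 10,000 bp between the end of the current group and the start of the next element, and it groups nested and neighbouring elements together without distinguishing them; both choices are assumptions, as are the function name and the example coordinates.

```python
def group_clusters(elements, max_gap=10_000):
    """Group (start, end) intervals on one arm into nests/clusters.

    An element joins the current group if it starts no more than `max_gap`
    bp after the furthest end seen so far in that group (overlapping or
    nested elements therefore always join). Only groups containing two
    or more elements are returned.
    """
    groups = []
    current = []
    current_end = None
    for start, end in sorted(elements):
        if current and start - current_end <= max_gap:
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [(start, end)]
            current_end = end
    if len(current) > 1:
        groups.append(current)
    return groups


# Illustrative coordinates in bp (not real annotations).
example_elements = [
    (1_000, 6_000), (9_000, 9_500), (30_000, 31_000),
    (100_000, 105_000), (2_000, 3_000),
]
print(group_clusters(example_elements))

# Share of all elements that are nested or clustered, from the text.
print(round(100 * 328 / 1573, 1))
```

In the example, the element at 2,000-3,000 lies inside the first element (a nest) and the element at 9,000 lies 3 kb beyond it (a cluster), so all three form one group, while the two isolated elements are left out. The last line reproduces the "about 21%" figure from 328 of 1,573 elements.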
Foldback elements often contain non-FB DNA (Truett et al. 1981; see Hoffman-Liebermann, Liebermann and Cohen 1989 and Caceres et al. 2001). Both NOF and HB elements have been found flanked by FB arms (Truett et al. 1981 and Harden and Ashburner 1990). We identified two HB elements immediately adjacent to FB elements and four examples of NOF elements inserted into FB elements.
Patterns of element nesting can be very complex, as has been observed in other species (Tikhonov et al. 1999). As pointed out by Walbot and Petrov (2001), the insertion of a transposable element may trigger a runaway process, since it will provide a target into which other elements may insert without deleterious consequences. The largest euchromatic complex of elements is on chromosome arm 3R (coordinate ~8.3 Mb), a complex of 29 fragments of Dm88, 18 fragments of invader1 and three fragments of micropia elements, occupying 32.4 kb. Many of these fragments are identical; for example, of the 18 invader1 fragments, nine represent bases 1-424 of the canonical invader1 LTR sequence, three represent bases 143-424, two bases 80-424 and two bases 1-108. These patterns remind us of Jonathan Swift's warning to poets: "So, Nat'ralists observe, a Flea/ Hath smaller Fleas that on him prey;/ And these have smaller Fleas to bite 'em,/ And so proceed ad infinitum." (Swift 1733). Losada et al. (1999) have suggested that some novel transposable elements have evolved by nesting, in particular, that the Circe element arose as a consequence of the insertion of the Loa-like element of D. silvestris into the Ulysses-like element of D. virilis.
Nesting may involve elements of the same class or elements of different classes. In a sample of 31 simple nests each involving only two or three elements, we observed all nine possible combinations of nesting among the LTR, LINE-like and TIR classes. LTR elements have inserted into LTR elements most frequently (23/43), with the remainder spread approximately evenly over the eight other categories. This pattern is likely due to the fact that LTR elements are the most abundant as both target and source sequences for nesting.
Several complex nests involve many different families of element. The nest near the base of 2L (coordinate ~20.1 Mb), for example, involves 11 different families, of all three major classes of element. Large clusters containing only one family of element are also found. For example, there is a complex of seven GATE elements at coordinate ~14.2 Mb on chromosome arm 2R and a complex of six mdg1 elements at coordinate ~5.7 Mb on the same chromosome arm.
Some transposable elements are present as large tandem arrays. For example, the Tc1-like Bari1 is organized as a tandem array in the heterochromatin at the base of chromosome arm 2R (Caizzi et al. 1993). Tandem LTR element pairs have also been found in the D. melanogaster genome (e.g., FBti:0019752 and FBti:0019753); here, two elements share an internal LTR. A number of different mechanisms have been suggested to result in tandem Ty1 and Ty5 elements in S. cerevisiae (Ke and Voytas 1997; Kim et al. 1998); all involve recombination between either linear cDNAs or circular DNA generated by LTR transposition and a chromosomal element. The mechanism(s) by which tandem elements arise in Drosophila is not known.
Insertion site preferences of natural transposable elements.
For one group of elements, the R1 and R2 LINE-like elements, there is very high insertion site specificity for sites within the 28S rDNA gene (Jakubczak et al. 1991). Indeed, R1 elements are found only in the 28S rDNA gene (Eickbush et al. 1997). For some LTR retrotransposons, a preference for AT-rich sequences has been known for some time (Inoue, Yuki and Saigo 1984; Freund and Meselson 1984; Tanda et al. 1988). Transposable elements insert at a staggered cut in chromosomal DNA; after repair, this results in a duplication of the target sequence. We estimated the physical characteristics of 500 bp of DNA flanking the insertion sites of roo (LTR), jockey (LINE-like) and pogo (TIR) elements using the method described in Liao et al. (2000). In our analysis, we included only elements for which the duplicated target sequence could be unambiguously identified. These data suggest that roo and pogo prefer to insert in sequences of either higher than average (roo) or lower than average (pogo) denaturation temperatures; this may reflect functional differences in the insertion mechanism of these elements. There was no obvious bias in the sequences into which jockey elements insert (Figure 6).
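Because a target site duplication appears as a short direct repeat immediately flanking the element, a candidate can be found by comparing the end of the 5' flank with the start of the 3' flank. The sketch below is an illustration only: the function name, the length limits and the exact-match rule are assumptions, and it is not the criterion used in this study to decide that a duplication was unambiguous.

```python
def find_tsd(left_flank, right_flank, min_len=2, max_len=20):
    """Return the longest direct repeat that ends left_flank and starts right_flank.

    left_flank  -- genomic sequence immediately 5' of the element
    right_flank -- genomic sequence immediately 3' of the element
    Returns "" when no repeat of at least min_len bases is found.
    min_len and max_len are illustrative defaults, not values from the paper.
    """
    left = left_flank.upper()
    right = right_flank.upper()
    longest = min(max_len, len(left), len(right))
    # Try the longest candidate first and shrink until the two ends agree.
    for size in range(longest, min_len - 1, -1):
        if left[-size:] == right[:size]:
            return left[-size:]
    return ""
```

For example, `find_tsd("ACGTTAGGC", "TAGGCAAT")` returns `"TAGGC"`. Very short matches can arise by chance, so in practice a result would need to be checked against the duplication length expected for the family.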
LTR retrotransposons use a tRNA primer for first-strand synthesis during transposition. In S. cerevisiae, LTR retrotransposons and tRNA genes are clustered: 90% are within 750 bp of tRNA genes, and there is an average of 1.2 insertions per tRNA gene (Kim et al. 1998). We determined whether a similar relationship exists in D. melanogaster. Our data suggest no relationship between the location of tRNAs and transposable elements in this species. For example, of 311 elements on chromosome arm 2R, only five are within 10 kb of a tRNA gene, or tRNA gene cluster. However, Saigo (1986) has described an association of a tRNA pseudogene and the 3' end of a copia element, possibly resulting from an aberrant reverse transcription; an initiating tRNA:Met pseudogene has also been described as being associated with repetitive sequences (Sharp et al. 1981).
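A proximity count of this kind can be sketched as follows. This is a simplified illustration, not the procedure used here: it treats each element and each tRNA gene as a single coordinate on one chromosome arm, and the function name is hypothetical.

```python
from bisect import bisect_left

def count_near(element_positions, trna_positions, max_distance=10_000):
    """Count elements lying within max_distance bp of any tRNA gene."""
    trnas = sorted(trna_positions)
    near = 0
    for pos in element_positions:
        # Only the tRNA genes on either side of the element can be the closest.
        i = bisect_left(trnas, pos)
        neighbours = trnas[max(i - 1, 0):i + 1]
        if any(abs(pos - t) <= max_distance for t in neighbours):
            near += 1
    return near
```

Sorting the tRNA coordinates once and using a binary search keeps the count fast even for arms carrying several hundred elements.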
It has been known for many years that the P-element shows a marked preference to insert immediately 5' to genes or within 5' exons (Tsubota, Ashburner and Schedl 1985; Spradling et al. 1995). This preference presumably reflects the chromatin environment at the time of P-element transposition. We analysed the position of transposable elements with respect to the closest known or predicted gene from the Release 3 re-annotation (Misra et al. 2002). Using this criterion there are 577 elements located 5', 595 elements located 3', and 401 elements within transcribed regions. These ratios are consistent for all classes of element, suggesting that there is no insertion site bias with respect to genes. We also find no bias of insertion with respect to the transcribed strand.
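The classification relative to the closest gene can be illustrated with a short sketch. The function and its coordinate conventions are assumptions made for illustration; only the three counts are taken from the text.

```python
def classify(element_pos, gene_start, gene_end, strand):
    """Place an element 5', 3' or within the closest gene.

    gene_start <= gene_end are coordinates on the chromosome arm;
    strand is "+" or "-" for the transcribed strand (assumed convention).
    """
    if gene_start <= element_pos <= gene_end:
        return "within"
    upstream = element_pos < gene_start
    if strand == "-":
        # On the minus strand the 5' side lies at higher coordinates.
        upstream = not upstream
    return "5'" if upstream else "3'"

# Counts reported in the text.
counts = {"5'": 577, "3'": 595, "within": 401}
total = sum(counts.values())
fractions = {k: round(100 * v / total, 1) for k, v in counts.items()}
```

The three counts sum to 1,573, and the 5' and 3' categories each hold a little over a third of the elements, which is the near-symmetry the text interprets as an absence of bias.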
All 401 transposable elements that map within transcribed regions are predicted to be in introns. However, during the reannotation of the genome (Misra et al. 2002) coding exons were not annotated in sequences with homology to transposable elements. Thus it is possible that a small number of transposons within transcribed regions actually are inserted into a coding exon. It is worth noting that, of the four mutations known to be carried by the sequenced strain, one (bw1), and possibly two (sp1), are mutated by the insertion of 412 elements.
The insertions of transposable elements into coding regions on the X chromosome might be expected to be less frequent compared to the major autosomes due to its haploid state in males. However, we see no bias for elements inserted within genes on the X chromosome; of the 401 transposable elements inserted within genes, 79 are on the X chromosome, which is within the range seen on the other major chromosome arms (73-88). This is consistent with the percent of coding/non-coding sequence on the X chromosome (51.9%) relative to that of the other chromosome arms (53.8%).
In a recent study of five protein coding genes located in the proximal regions of chromosome arms, Dimitri et al. (2002) found that introns have 50% transposable element sequence; this contrasts with the introns of euchromatic genes having only 0.11% transposable element sequences.
Transposable elements in completely sequenced genomes
We can make a preliminary comparison of the transposable elements of D. melanogaster with those of the other fully sequenced eukaryote genomes: Saccharomyces cerevisiae (Goffeau et al. 1997), Schizosaccharomyces pombe (Wood et al. 2002), Caenorhabditis elegans (The C. elegans Sequencing Consortium 1998) and Arabidopsis thaliana (The Arabidopsis Genome Initiative 2000).
In S. cerevisiae, all transposable elements are of the LTR class (Boeke 1989); five different families are known and these comprise 3.1% of the entire genome, the majority of which are solo LTRs (85%) (Kim et al. 1998). This is in contrast to the relative scarcity of solo LTRs found in the Release 3 sequences of D. melanogaster. Transposable elements are quite rare in the sequenced strain of S. pombe; only 11 intact, and three defective, Tf2 LTR-retrotransposon elements are known (Weaver et al. 1993; Wood et al. 2002).
In C. elegans, all three major classes of transposable element are found: 19 families of LTR retrotransposons, with at most three full length members (Bowen and McDonald 1999; Ganko et al. 2001); three families of LINE-like retrotransposons, Rte-1, Sam and Frodo, with about 30 elements overall, 11 of Sam, three of Frodo and 10-15 of Rte-1 (Marin et al. 1998; Youngman et al. 1996); seven families of TIR (Tc) elements, with copy numbers between 61 and 294 (Duret et al. 2000); and five families of short DNA elements (Devine et al. 1997), with copy numbers from 81 to 1,204 (Duret et al. 2000). It is interesting, given the occurrence of retrotransposons in both C. elegans and D. melanogaster, that the genomes of both species are characterized by very few retrotransposed pseudogenes (Jeffs and Ashburner 1991; Harrison et al. 2001; Wang et al. 2002). This may be due to a low rate of retrotransposed pseudogene generation or to a very high rate of loss due to deletion, as suggested by studies of the lineages of Helena elements in D. virilis (Petrov et al. 1996) and D. melanogaster (Petrov and Hartl 1998).
Transposable elements are far more abundant in the genome of A. thaliana than in the euchromatic genomes of C. elegans or D. melanogaster. In Arabidopsis, over 5,500 transposable elements exist, representing some 10% of the "euchromatic" sequence (The Arabidopsis Genome Initiative 2000). The pericentromeric heterochromatin and the heterochromatic knob on chromosome 4 of Arabidopsis have a very high density of transposable elements and other repeats (Kapitonov and Jurka 1999; Lin et al. 1999; Mayer et al. 1999; CSHL/WUGSC/PEB Arabidopsis Sequencing Consortium 2000), an organization similar to that of the heterochromatin of Drosophila with respect to the transposable elements. As in Drosophila, certain families of elements appear only in these heterochromatic regions, for example the Arabidopsis retrotransposon Athila. In contrast to Drosophila, only four percent of the complete elements in Arabidopsis are transcribed, as judged by their representation in ESTs. Also in contrast to Drosophila, class I and class II elements in Arabidopsis show very different chromosomal distributions, the former in the centromeric regions and the latter on the periphery of pericentromeric domains.
One class of element that is absent or unrecognized in the genome of D. melanogaster is the MITEs, miniature inverted repeat elements, characterized as short (less than 500 bp) elements with inverted repeat termini and lacking a transposase necessary for autonomous transposition. Elements similar to MITEs have been described in D. subobscura and its relatives (Miller et al. 2000). Whether MITEs are indeed a separate class of element, or simply represent internally deleted (and hence non-autonomous) TIR elements, is unclear (Kapitonov and Jurka 2002b; see Feschotte, Zhang and Wessler 2002). In C. elegans, these elements are abundant, with 5,000 elements in four sequence families; they show a non-random chromosomal distribution (Surzycki and Belknap 2000). MITEs are characteristic of plant genomes; in A. thaliana Surzycki and Belknap (1999) have identified three families with a copy number of about 90. In maize there are an estimated 6,000 copies of the mPIF family of MITE elements alone, and evidence for autonomous family members (Zhang et al. 2001).
Comparison of sequence and cytological data
In this paper we have described the transposable elements of the euchromatin of D. melanogaster, as represented by the Release 3 sequences. Because Release 3 represents only a single sequence from the y1; cn1 bw1 sp1 isogenic strain first constructed in J. Kennison's laboratory in the early 1990s (Brizuela et al. 1994), it is important to address whether the composition of transposable elements in this strain is typical of the species as a whole.
It is well established that Drosophila strains vary in the number and location of transposable elements; these differences are often taken as de facto evidence of transposability (Young 1979). Large differences in the abundance of transposable elements have been observed between laboratory strains for families such as gypsy, Bari1, ZAM and Idefix (Kim, Belyaeva and Aslanian 1990; Caggese et al. 1995; LeBlanc et al. 1997; Desset et al. 1999). Such variation in transposable element copy number may be associated with a mutation either of the element itself, or of host genes that would normally regulate copy number [see, for example, the role of the flamenco gene in regulating gypsy activity (Prud'homme et al. 1995); see Labrador and Corces (1997; 2002) for a review of host element interactions].
Transposable elements also differ between laboratory strains and natural populations of D. melanogaster (Table 4; see Biemont and Cizeron 1999 for review). Perhaps the most salient examples are those elements that cause hybrid dysgenesis – the P-element, the I-element and the H-element (a.k.a. hobo) – which are either wholly absent from, or defective in, most laboratory strains, but abundant today in natural populations (Engels 1989; Streck et al. 1986; Crozatier et al. 1988). Only one of the H-elements appears to be full length, and comparison of its coding sequence to that of the canonical element suggests the possibility that this element is active in the sequenced strain. Further, eight of the I-elements identified in the sequenced strain appear to be of similar length and sequence to the active canonical element. The ORFs of these elements are very similar to those of the canonical element, suggesting that they too might be active.
There has been extensive sampling of laboratory and natural strains of D. melanogaster for euchromatic transposable elements by the method of in situ hybridization (Charlesworth and Langley 1988; Biemont and Cizeron 1999). These samples provide estimates of transposable element abundances which are relevant to compare with our data, since both types of studies sample euchromatic sequences. As shown in Table 4, the sequenced y1; cn1 bw1 sp1 isogenic strain is a typical D. melanogaster strain, at least with respect to the numbers of euchromatic elements. Overall, the Spearman rank order correlation coefficient between the number of elements of each family in Release 3 and the average mid-point of the ranges seen in other strains is 0.86 (p < 10^-6). The correlation between these two types of data is imperfect since closely located elements (within 100 kb) of the same family and grossly deleted elements will not be resolved by the method of in situ hybridization. Nevertheless, this correlation suggests that results based on analysis of the sequenced strain may be representative of the species as a whole.
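For readers unfamiliar with the statistic, the Spearman coefficient is the Pearson correlation computed on ranks rather than on raw counts, so it is insensitive to the different scales of sequence-based and in situ counts. The following is a minimal, self-contained sketch; the function names are hypothetical and no values from Table 4 are reproduced.

```python
def ranks(values):
    """Average ranks (1-based), ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Here `x` would hold the Release 3 copy number of each family and `y` the mid-point of the range reported for other strains, in the same family order. The sketch does not compute a p-value and fails if either list is constant.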
As previously noted, twenty-four of the 96 families of elements have only partial elements. Full length copies of these families may be discovered in other strains of D. melanogaster, or in closely related species. Indeed, full length copies of the aurora family are present in D. simulans (Dsim\ninja, Shevelyov 1993; Kanamori et al. 1998), of the mariner family in D. mauritiana (Medhora et al. 1988), and of the Helena family in D. virilis (Petrov et al. 1995). Our understanding of the evolutionary dynamics of the transposable elements will also be immeasurably improved by comparative studies with other Diptera. These are now, or will be shortly, possible in the Anopheles gambiae and Drosophila pseudoobscura genomic sequences. These and other families of transposable elements may yet be discovered in the heterochromatin of y1; cn1 bw1 sp1 as well. Since the abundance of transposable elements in the euchromatin (3.88%) is lower than the genome wide average (~9%), and the distribution of transposable elements increases in both centromere-proximal regions and the fourth chromosome, all indications suggest that a very high proportion of heterochromatic sequences will be transposable elements (see also Bartolome et al. 2002; Dimitri et al. 2002).
Conclusions.
This study provides the first whole genome analysis of the transposable elements in the Release 3 sequences of Drosophila melanogaster. By taking advantage of the high quality of transposable element sequences in Release 3 we have discovered previously unknown patterns of transposable element abundance and diversity, as well as supporting trends based on previous releases.
Of the three major classes of transposable element in Drosophila, the LTR elements stand out in several respects. First, LTR elements are the most abundant in terms of numbers and amount of sequence. Second, despite being the most common class of element and accumulating in regions of high transposable element density, such as the centromere proximal sequences, LTR elements do not contribute to the high density of elements on chromosome 4. Third, LTR elements have a higher proportion of full length elements relative to either LINE-like or TIR elements. Fourth, on average LTR element families exhibit lower average pairwise distances than LINE-like or TIR element families.
Taken together, these findings suggest that the LTR elements in the Release 3 sequences are on average more recent insertions into the D. melanogaster genome than the LINE-like or TIR elements. These observations can be explained by two alternative scenarios: 1) a more recent invasion/wake-up (Vieira et al. 1999) of LTR families relative to LINE-like or TIR families in the D. melanogaster genome or 2) a shorter persistence time of LTR families relative to LINE-like or TIR families. Since the patterns of variation in transposable elements are controlled by the joint effects of transposition, deletion and excision, it is clear that these scenarios are not mutually exclusive. Since many of the LTR families found in D. melanogaster are also found in closely related species (Biemont and Cizeron 1999), a recent wake-up is a more likely scenario than a recent invasion (Dowsett and Young 1982; Vieira et al. 1999). However, it remains to be explained what evolutionary or genomic mechanisms would cause the preferential wake-up of LTR elements but not LINE-like or TIR elements. In contrast, there are mechanistic reasons to suspect a higher rate of decay of LTR families since they can be eliminated by recombination between their homologous long terminal repeats, leaving only a single LTR as a footprint.
Regardless of the explanation for these differences in abundance and variation among functional classes of elements, our data demonstrate that the mechanisms of transposition strongly influence the composition of transposable elements in the Drosophila genome. We hope that the data and results provided here will serve as a useful resource for similar discoveries in Drosophila and other species in the future.
Materials and methods.
The sequence and other data sets.
The sequence releases. Release 1 of the "complete" euchromatic sequence of the genome of D. melanogaster was made available in March 2000 (Adams et al. 2000; Myers et al. 2000). As explained in the Introduction, this sequence, by and large derived from an assembly of a 12.8X whole genome shotgun sequence, is not suitable for the analysis of repeated sequences. Nor was the subsequent release made available from the Berkeley Drosophila Genome Project's (BDGP) website, Release 2 (October 2000), which filled 330 sequence gaps (but left some 1,300) and had improved the order and orientation data for scaffolds (these were not always reliable in the Release 1 data) (www.fruitfly.org/annot/release2.html). Release 3 is the first high quality complete sequence of this genome; it is of Phase 3 quality (www.ncbi.nlm.nih.gov:80/HTGS/divisions.html), that is, not only is the sequence quality itself high (an estimated sequence error rate of less than 1 in 100,000 bp), but there are very few gaps and the order and orientation of all regions is known. Release 3 has eight physical gaps: in region 39D at the base of chromosome arm 2L (coordinate ~21.4 Mb), at regions 42B (coordinate ~1.48 Mb) and 57B on chromosome arm 2R (coordinate ~15.8 Mb), in regions 3EF (coordinate ~3.5 Mb), 9EF (coordinate ~10.5 Mb) and 20A (coordinate ~21.3 Mb) of the X chromosome, on 3L in region 64C (coordinate ~5 Mb), and on chromosome 4 (coordinate ~1.2 Mb).
Five of the eight physical gaps in Release 3 are known to be associated with repetitive sequences. The gap in region 39D on 2L is due to a tandem array (0.5-1 Mb) of histone genes; the 50-100 kb gap on 2R at 42B is flanked by about 50 copies of an unknown 1 kb repeat; the less than 100 kb gap at 57B on 2R is flanked by a cluster of F-elements; the less than 100 kb gap on the X in region 20 is flanked on each side by complex nests of transposable elements; finally, the less than 100 kb gap in region 102 on chromosome 4 is flanked by the genes CG17467 and Caps, and the gap results from a simple, 9 bp, repeated sequence.
Release 3 was made incrementally, chromosome arm by chromosome arm, by the sequencing groups at the Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory (chromosome arms 2L, 2R, 3R, 4 and the proximal half of the X) and at the Baylor College of Medicine (the distal half of the X chromosome and chromosome arm 3L) (www.fruitfly.org/sequence/assembly.html). It has involved the finishing of 951 (BDGP) BAC-based clone assemblies that form a complete tiling path across the entire euchromatic genome. The assembly of each BAC has been checked by a comparison of the restriction enzyme fingerprint predicted from its sequence with that experimentally determined (Celniker et al. 2002). The few remaining gaps in Release 3 are being addressed and subsequent releases of the sequence will be made.
There are two caveats with respect to the assembly used in this analysis. The first is that regions containing arrays of similar elements may suffer from assembly artefacts. Resolving these will require the use of a constrained assembly program and more detailed comparison of the sequence with the restriction enzyme fingerprinted BAC clones. The second is that some parts of the distal X chromosome and of chromosome arm 3L are unfinished in Release 3 (Celniker et al. 2002). The transposable elements of these unfinished sequences were included in analyses of the abundance and distribution of elements, but were excluded from all analyses that required the alignment of element sequences. The number of elements included within Tables 2 and 3, but not included in alignments, was 75. These elements will be clearly indicated in our data set.
In Release 1 there was 3.8 Mb of sequence that could not be easily mapped to any chromosome arm (Adams et al. 2000). These were assembled into unmapped scaffolds which included sequences from gaps in the euchromatic arms and sequences from the centric "heterochromatin", including the entire Y chromosome. The highly repetitive nature of these sequences makes them very hard to assemble, either from a whole genome shotgun or from a clone-based sequencing strategy. The "heterochromatin" also includes sequences that cannot be readily cloned, for instance the satellite DNA sequences and, perhaps, others. The distinct genetic and cytological nature of the pericentromeric regions of the Drosophila chromosomes, both metaphase and polytene interphase, has been known for many years (Painter and Muller 1929; Heitz 1934) and its structure and properties clearly result from the nature of its sequences (Miklos et al. 1988).
The convention that we will follow in this paper is to define as euchromatin everything that has been assembled into a chromosome arm scaffold, and heterochromatin as the rest (unmapped scaffolds). This does some injury to the definitions of heterochromatin and euchromatin as cytological phenomena, but these terms have been so abused in the past that this is of minor consequence. We recognise, of course, that the transition from euchromatin to heterochromatin is not abrupt; indeed we show that the characteristics of the most basal regions of the chromosome arms differ in their sequence organization from the regions distal to them.
The data set we used for unmapped scaffolds is from a new assembly provided by Celera Genomics (E. Myers and G. Sutton, personal communication). This assembly, WGS3, is the result of improvements to the Celera assembly algorithms (see Myers et al. 2000). WGS3 assembles 115.5 Mb into 14 mapped scaffolds, and leaves 22.2 Mb in 2,761 scaffolds, each less than 1 Mb in length, assigned to unmapped scaffolds. One reason for the increase in size of unmapped scaffolds between the first and third whole genome shotgun assemblies is the inclusion in WGS3 of some 809,000 extra sequence reads not included in the two previous assemblies (Carvalho et al. 2002). The 22.2 Mb of unmapped scaffolds includes sequences that properly belong to the euchromatin as well as heterochromatic sequences. For this reason we simply used this data set as a subject sequence set for searching, by BLAST, for elements that we had been unable to discover in the euchromatic sequence. A much fuller description of the transposable elements of the heterochromatin, and of the telomeres, of D. melanogaster will be the subject of a publication from the Drosophila Heterochromatin Genome Project (G. Karpen and colleagues, personal communication).
The EST data used in this analysis were those from the BDGP/LBNL's EST projects (Rubin et al. 2000; Stapelton et al. 2002). These data are available from http://www.fruitfly.org/sequence/dlcDNA.shtml/
Reference data sets. A reference data set of "canonical" sequences of transposable elements was built by M. Ashburner, P. Benos and G. Laio during the early stages of developing methods for Drosophila genome annotation. It was first used during the annotation of the 2.9 Mb "Adh-region" (Ashburner et al. 1999) and has been maintained subsequently (http://www.fruitfly.org/sequence/sequence_db/na_te.dros). As new sequences were published by others these were added to this file. Most of these were "real" sequences, although some from Repbase (Jurka 2000; Kapitonov and Jurka 2002a) were consensus sequences. In addition, we made a determined effort to discover, from the evolving Release 3 sequence, "complete" sequences of the many elements known only from small sequence fragments, e.g. of their LTR regions, as well as all new elements identified in this study. The following elements were available too late to be included in our analyses, but will be included in further updates of the data: Tc3-like (Tu and Shao 2002), romano (Salles et al. 2002), ninja (EMBL: AF520587) and the elements "DREF", "BG.DS00797" and "CG.13775" of Robertson (2002).
Available data sets. The following data sets are freely available for download (www.fruitfly.org) and are maintained by FlyBase. When using these resources please note, and publish, the Release numbers associated with the files.
Many transposable elements of D. melanogaster have been described and named independently by several research groups. In this paper we use the names adopted by FlyBase, which attempts to reflect priority of publication (or sequence release). There are, in addition to those described here, many elements in FlyBase that have never been associated with a sequence or, even, a restriction map. In the absence of further evidence nothing more can be said about these, which will be marked as being of "uncertain status" in their FlyBase records.
1. A file containing a single sequence of each element; this is a frozen data set of the sequences used to search the genome for transposable elements.
2. A file of annotated "canonical" sequences, one for each identified family of transposable elements. These sequences were chosen as the longest discovered in the genome with (where relevant and where possible) intact open reading frames. There are a few families for which no intact element could be found. We have then attempted to construct an intact element from the available data. Such artifices are noted in the records. These data will be updated when new information becomes available, and will be further annotated by FlyBase. Each release will be archived.
3. A file, in FASTA format, of each element that has been discovered. The following data are to be found on the header line of each record:
FBgn_id is the FlyBase record for the family, FBti_id is the unique identifier of each occurrence of an element and the coordinates are from the Release 3 data. In addition to the sequence of each element, each record includes 500 bp of 5' and 3' flanking sequence which is represented in lowercase letters. These data will be regularly updated, in step with each new Release of the assembled sequence. Each Release will be archived.
4. The alignments of elements within a family used for the current analysis. This file is in MASE format (Galtier, Gouy, Gautier 1996) and each element is identified by its FBti number. This is a frozen data set that will not be updated.
5. The nested transposable elements and element complexes are available as an independent data set. Included within each sequence is 500 bp of flanking sequence on each side of the element complex. Each nest or complex has a unique FBti identifier number in FlyBase; in addition each component of a nest or complex has its own FBti identifier number. In the FASTA header line for each sequence in this file the data included are:
>FBti_of_nest_or_complex,FBti_of_component,…,chromosome_arm:coordinates
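A reader working with these files might handle them along the following lines. This is a sketch under stated assumptions, not a description of any FlyBase tool: the function names are hypothetical, the header is assumed to be comma-separated exactly as shown above with no commas inside the coordinates, and the coordinate text is returned unparsed because its layout is not given here. The second helper relies on the lowercase-flank convention described for the FASTA file of individual elements (data set 3).

```python
def parse_nest_header(header):
    """Split a nest/complex FASTA header into its parts."""
    fields = header.lstrip(">").strip().split(",")
    if len(fields) < 3:
        raise ValueError("expected nest id, component id(s) and location")
    arm, _, coordinates = fields[-1].partition(":")
    return {
        "nest": fields[0],
        "components": fields[1:-1],
        "arm": arm,
        "coordinates": coordinates,  # kept as text; its layout is an unknown here
    }

def split_flanks(sequence):
    """Separate lowercase flanking sequence from the uppercase element."""
    start = 0
    while start < len(sequence) and sequence[start].islower():
        start += 1
    end = len(sequence)
    while end > start and sequence[end - 1].islower():
        end -= 1
    return sequence[:start], sequence[start:end], sequence[end:]
```

For example, `split_flanks("acgtACGTTTacg")` returns `("acgt", "ACGTTT", "acg")`.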
Comparison with other data sets.
To support our claim that the Release 1 sequence is an inadequate substrate for rigorous analysis we have compared the sequences of transposable elements in that release with those of Release 3. We determined the identity of elements in the two releases by a comparison of the 500 bp on their 5' flanks. Our results suggest that many, if not most, of the sequences from Release 1 are artefacts of that assembly. Of the 1,579 elements characterized in Release 3 only 793 could be identified in Release 1; of these, only 33% were identical in sequence; for LTR elements (n = 290) the proportion of identical elements was only 13%. Not surprisingly, in view of their relatively small size, TIR elements were best determined in Release 1 (53% identical between the releases, n = 213). Of the 48% of elements that aligned and for which the Release 1 sequences did not include undetermined bases, the average pairwise distance between "identical" elements in Release 1 and Release 3 was 8.2% (11.4% for LTR elements). The complete data are available from http://www.fruitfly.org/
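The logic of pairing elements across releases by their 5' flank can be sketched as below. This is an illustration only: the text does not say how the flanks were compared, so exact string identity of the flank is an assumption, as are the function name and the input layout.

```python
def match_by_flank(release3, release1):
    """Pair elements whose 5' flanking sequences are identical.

    Both arguments map an element identifier to a (flank, sequence) tuple.
    Returns (matched, identical): the number of Release 3 elements found in
    Release 1, and how many of those also agree exactly in element sequence.
    """
    # Index Release 1 elements by flank (assumes flanks are unique per release).
    by_flank = {flank.upper(): seq.upper() for flank, seq in release1.values()}
    matched = identical = 0
    for flank, seq in release3.values():
        other = by_flank.get(flank.upper())
        if other is None:
            continue
        matched += 1
        if other == seq.upper():
            identical += 1
    return matched, identical
```

A similarity search over the flanks would tolerate sequencing differences that an exact comparison misses, so a sketch like this would tend to undercount matches.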
Analytical methods
Identification of known transposable elements. WU-BLASTN 2.0 (http://blast.wustl.edu/) was used to search all chromosome arms for regions of similarity to each element in the Release 3 dataset. The parameters for the BLAST search were M=3, N=3, Q=3, R=3, X=3, and S=3. BLAST searches were done on a 32 node dual PIII Linux based compute farm supplied by Linux Network. Distribution of BLAST jobs to the cluster was managed by the Portable Batch System (PBS, http://www.openpbs.org). Individual BLAST jobs were submitted via pbsrsh, a rsh like program (E. Frise, unpublished). Additionally, PBS was optimized and modified for the BDGP to handle a large number of queued jobs (E. Frise, unpublished).
BLAST reports were generated by searching a single chromosome arm with each individual element. The results were then parsed to generate a list of the coordinates of all High Scoring Pairs (HSPs) that were at least 50 bp long and whose query and subject sequences had a pairwise identity of at least 90%. All HSPs on this list that were within 10 kb of each other and summed to greater than 100 bp were pooled into a "span". Each span was bounded by two coordinates - a start coordinate that corresponds to the lowest coordinate of any HSP in a particular span, and an end coordinate that corresponds to the highest coordinate of any HSP in the same span. A master list was then generated that contained all spans for all elements on a particular arm. Any spans (for the same or different elements) that had overlapping coordinates were examined further by an analysis of the sequences of the HSPs. While this identified a small number of spurious spans that did not correspond to real elements, the majority of these instances correspond to the nested elements discussed below. Start and end coordinates for all spans belonging to each element were used to extract genomic sequences for multiple sequence alignment (see below). Spurious sequences that did not align with other family members were removed from both the list of spans and the multiple alignment. Other attempts to define transposable element families based on sequence identity have used a 90% cutoff with reference to the protein sequence of the reverse transcriptase motif of LTR-elements (Bowen and McDonald 1999; 2001). For non-LTR transposons, Berezikov et al. (2000) used a 70% nucleic acid sequence identity criterion.
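The filtering and pooling of HSPs into spans can be illustrated with a short sketch. It is one possible reading of the procedure, not the code used in this study: "within 10 kb of each other" is taken to mean that an HSP starts no more than 10 kb beyond the furthest end already in the group, and the summed length counts overlapping HSPs twice. The type and function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple

class HSP(NamedTuple):
    start: int        # lowest subject coordinate
    end: int          # highest subject coordinate (inclusive)
    identity: float   # percent identity between query and subject

def pool_spans(hsps: List[HSP], min_len=50, min_identity=90.0,
               max_gap=10_000, min_total=100) -> List[Tuple[int, int]]:
    """Filter HSPs, then chain neighbours into (start, end) spans."""
    kept = sorted(h for h in hsps
                  if h.end - h.start + 1 >= min_len and h.identity >= min_identity)
    groups, group = [], []
    for hsp in kept:
        # Close the current group when the next HSP lies too far downstream.
        if group and hsp.start - max(h.end for h in group) > max_gap:
            groups.append(group)
            group = []
        group.append(hsp)
    if group:
        groups.append(group)
    # Keep only groups whose HSP lengths sum to more than min_total bp.
    return [(min(h.start for h in g), max(h.end for h in g))
            for g in groups
            if sum(h.end - h.start + 1 for h in g) > min_total]
```

The default thresholds are the ones quoted in the text (50 bp, 90%, 10 kb, 100 bp); they are exposed as parameters so that a stricter screen can reuse the same helper.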
Identification of new transposable elements through genome-genome comparison. The first approach to discovering new transposable elements was an all-by-all BLAST using chromosome arms 2L, 2R, 3R, 4 and the proximal half of the X. The chromosome arms were divided into 20 kb segments, each segment overlapping the previous by 10 kb. We used NCBI-BLAST v.2 to BLAST each 20 kb section against the others. Hits with greater than 95% identity and 1000 bp in length were parsed and used as query sequences in a BLAST against the canonical element sequence data set. Redundant results were removed. The coordinates of the repeats were parsed and known repeats were tagged. New repeats were reviewed in CONSED (Gordon et al. 1998) for the presence of open reading frames and repeat structure.
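Cutting an arm into overlapping segments is a simple sliding window. A minimal sketch follows, with a hypothetical function name and 0-based half-open coordinates chosen for convenience; the coordinate convention actually used is not stated.

```python
def windows(length, size, step):
    """Yield (start, end) half-open windows covering a sequence of given length."""
    start = 0
    while start < length:
        yield start, min(start + size, length)
        if start + size >= length:
            break  # the last window already reaches the end of the sequence
        start += step

# 20 kb segments, each overlapping the previous one by 10 kb.
segments = list(windows(45_000, size=20_000, step=10_000))
```

With a step of half the window size, any repeat shorter than 10 kb lies wholly inside at least one segment, which is presumably the reason for the overlap.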
Identification of new transposable elements through isolation of LTR sequences. A second approach was taken to identify single copy elements containing LTRs. Each chromosome arm was divided into 1000 bp long pieces with neighboring pieces overlapping each other by 500 bp. WU-BLASTN 2.0 was used to search each chromosome arm for all regions of similarity to each 1000 bp piece (parameters: M=3, N=3, Q=3, R=3, X=3, and S=3). The BLAST report from such a search was parsed to generate a list of all HSPs that were at least 100 bp long and whose query and subject sequences had a pairwise identity of at least 95%. Then, all HSPs on this list that were greater than 500 bp apart and less than 15 kb apart were pooled into a span. As above, each span was bounded by two coordinates, a start coordinate which corresponds to the lowest coordinate of any HSP in a particular pool, and an end coordinate which corresponds to the highest coordinate of any HSP in the same pool. Each set of coordinates was compared to the list of coordinates of transposable elements identified in the screen for known elements, and spans corresponding to known elements were eliminated from this list. Then, the coordinates of the remaining spans were used to extract genomic sequence from the finished chromosome arms. Each piece of genomic sequence was then compared to the coding sequence of the known transposable elements using WU-TBLASTX 2.0 (with default parameters). Any span that produced a hit with a p-value less than 10^-8 was analyzed by searching through the non-redundant protein database at the NCBI using NCBI-BLASTX (http://ncbi.nlm.nih.gov).
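One possible reading of the pairing and elimination steps is sketched below. The text does not specify how the distance between HSPs was measured or how spans were matched to known elements, so this sketch assumes that each HSP is represented by its two matched regions on the same arm, that the distance is the gap between those regions, and that any coordinate overlap with a known element removes a span. All names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def candidate_spans(pairs: List[Tuple[Interval, Interval]],
                    min_sep: int = 500,
                    max_sep: int = 15_000) -> List[Interval]:
    """Keep matched region pairs separated by more than min_sep and less
    than max_sep, and return the span bounded by their outer coordinates."""
    spans = []
    for left, right in pairs:
        if left[0] > right[0]:
            left, right = right, left
        gap = right[0] - left[1]
        if min_sep < gap < max_sep:
            spans.append((left[0], max(left[1], right[1])))
    return spans


def drop_known(spans: List[Interval], known: List[Interval]) -> List[Interval]:
    """Remove spans that overlap any known element (assumed criterion)."""
    return [s for s in spans if not any(overlaps(s, k) for k in known)]
```

Under these assumptions, two similar regions a few kilobases apart are treated as a candidate pair of LTRs flanking an element, and the span between their outer coordinates is passed on to the protein-level search only if it does not coincide with an element found in the screen for known elements.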
Alignment and calculation of evolutionary distances. Preliminary multiple alignments of elements within families were performed using the default settings of DIALIGN v2-1 (Morgenstern 1999). The resulting multiple alignments were visualized in the SEAVIEW alignment editor (Galtier et al. 1996). Subsequent realignment was done using the manual refinement. Multiple alignments were used to calculate average pairwise distance within families using Kimura's 2-parameter substitution model (transition:transversion ratio = 2:1) (Kimura 1980) as implemented in the DNADIST program of the PHYLIP package (Felsenstein 1993).
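For orientation, the classical closed-form Kimura 2-parameter estimator, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the observed proportions of transitions and transversions, can be sketched as below. This is a stand-in, not the DNADIST calculation: DNADIST takes the transition:transversion ratio as a fixed input, whereas the closed form infers it from the observed differences, so the two need not agree exactly. The function names and the treatment of gaps and ambiguous bases (skipped) are assumptions.

```python
import math
from itertools import combinations
from typing import List

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Closed-form Kimura (1980) 2-parameter distance for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumed handling)
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def mean_pairwise_distance(aligned: List[str]) -> float:
    """Average K2P distance over all pairs of sequences in one family."""
    pairs = list(combinations(aligned, 2))
    return sum(k2p_distance(a, b) for a, b in pairs) / len(pairs)
```

The average within a family is taken over all unordered pairs of aligned copies, which matches the description of an "average pairwise distance" but is otherwise an assumption about how the averaging was done.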
Physical characteristics of element insertion sites. To analyze the physical properties of the insertion sites of transposable elements we used the programs developed by Liao, Rehm and Rubin (2000). The flanking sequences of elements with canonical ends were aligned, centered on a single copy of the element's target site sequence (the sequence that is duplicated on insertion). The sequences were then analyzed for A-philicity, propeller twist, duplex stability and denaturation temperature, as described by Liao, Rehm and Rubin (2000). As a baseline we used a randomly generated 500 bp sequence set of the same base composition as the overall genome of D. melanogaster (G. Liao, personal communication). These analyses were performed with 49 roo element sequences, 12 jockey sequences and 28 pogo sequences.
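A baseline set of random sequences with a fixed base composition could be generated along the following lines. The baseline used here was supplied by G. Liao and its construction is not described, so this is only an illustrative sketch; the function name, the independent sampling of each position, and the composition values in the example are placeholders rather than figures from this study.

```python
import random
from typing import Dict, List, Optional


def random_sequences(n: int, length: int,
                     composition: Dict[str, float],
                     seed: Optional[int] = None) -> List[str]:
    """Draw n sequences of the given length, sampling each base
    independently with probability proportional to its weight."""
    rng = random.Random(seed)
    bases = list(composition)
    weights = [composition[b] for b in bases]
    return ["".join(rng.choices(bases, weights=weights, k=length))
            for _ in range(n)]


# Placeholder composition (hypothetical values, not measured from the genome).
example_composition = {"A": 0.29, "C": 0.21, "G": 0.21, "T": 0.29}
baseline = random_sequences(n=100, length=500,
                            composition=example_composition, seed=1)
```

Sampling positions independently preserves only the overall base frequencies; it does not reproduce dinucleotide or longer-range structure, which may or may not matter for properties such as propeller twist.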
This work was supported by NIH grant H600750 to G.M. Rubin, by NIH Grant HG00739 to FlyBase (P.I. W.M. Gelbart) and by Programme Grant G822559 from the Medical Research Council to M. Ashburner, D. Gubb and S. Russell. M.A. thanks Jean Wiborg for her help in the logistics of his visits to Berkeley.
We thank Patrizio Dimitri, Bernardo Carvalho and Gary Karpen for allowing us to quote from their unpublished work, and Bob Levis, Mike Young, Stu Tsubota and Andy Flavell for information on individual elements. We also thank the following colleagues for their comments on a draft of this paper: …??
References.
Adams, M., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G.,
Sherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F et al 2000 The genome sequence of
Drosophila melanogaster Science 287: 2185-2195.
Arkhipova, I and Meselson, M 2000 Transposable elements in sexual and ancient asexual
taxa Proc natn Acad Sci USA 97: 14473-14477.
Ashburner, M., Misra, S., Roote, J et al 1999 An exploration of the sequence of a 2.9-Mb
region of the genome of Drosophila melanogaster: The Adh region Genetics 153: 179-219.
Baldrich, E., Dimitri, P., Desset, S., Leblanc, P., Codipietro, D and Vaury, C 1997
Genomic distribution of the retrovirus-like element ZAM in Drosophila Genetica 100:
131-140
Bartolome, C., Maside, X and Charlesworth, B 2002 On the abundance and distribution of
transposable elements in the genome of Drosophila melanogaster Molec Biol Evol
19:926-937.
Beleaeva, E.S., Ananiev, E.V and Gvozdev, V.A 1984 Distribution of mobile dispersed
genes (mdg-1 and mdg-3) in the chromosomes of Drosophila melanogaster Chromosoma
90: 16-19.
Benos, P.V., Gatt, M.K., Murphey, L et al 2001 From first base: The sequence of the tip of
the X chromosome of Drosophila melanogaster, a comparison of two sequencing strategies.
Genome Res 11: 710-730.
Berezikov, E., Bucheton, A and Busseau, I 2000 A search for reverse transcriptase-coding
sequences reveals new non-LTR retrotransposons in the genome of Drosophila
melanogaster Genome Biol 1(6): research0011.1-0011.15.
Berghella, L and Dimitri, P 1996 The heterochromatic rolled gene of Drosophila
melanogaster is extensively polytenized and transcriptionally active in the salivary gland
Biemont, C., Gautier, C and Heizmann, A 1988 Independent regulation of mobile element
copy number in Drosophila melanogaster inbred lines Chromosoma 96: 291-294.
Biessmann, H., Walter, M.F and Mason, J.M 2002 Telomeres in Drosophila and other insects In Telomeres and Telomerases: Cancer and Biology (eds G Krupp and R
Parwaresch) Landes Biosciences.
Bingham, P.M., Kidwell, M.G and Rubin, G.M 1981 The molecular basis of P-M hybrid
dysgenesis: the role of the P element, a P strain-specific transposon family Cell 29:
995-1004
Boeke, J.D 1989 Transposable elements in Saccharomyces cerevisiae Chapter 13 In
Mobile DNA (eds D.E Berg and M.M Howe) American Society of Microbiology,
Washington, DC
Bowen, N.J and McDonald, J.F 1999 Genomic analysis of Caenorhabditis elegans reveals
ancient families of retroviral-like elements Genome Res 9: 924-935.
Bowen, N.J and McDonald, J.F 2001 Drosophila euchromatic LTR retrotransposons are
much younger than the host species in which they reside Genome Res 11: 1527-1540.
Brizuela, B.J., Elfring, L.K., Ballard, J., Tamkun, J.W and Kennison, J.A 1994 Genetic
analysis of the brahma gene of Drosophila melanogaster and polytene chromosome
subdivisions 72AB Genetics 137: 803-813.
Bucheton, A., Paro, R., Sang, H.M., Pelisson, A and Finnegan, D.J 1984 The molecular
basis of I-R hybrid dysgenesis: identification, cloning and properties of the I factor Cell 38:
153-163
Caceres, M., Puig, M and Ruiz, A 2001 Molecular characterization of two natural
hotspots in the Drosophila buzzatii genome induced by transposon insertions Genome Res.
11: 1353-1364.
Caggese, C., Pimpinelli, S., Barsani, P and Caizzi, R 1995 The distribution of the
transposable element Bari1 in the Drosophila melanogaster and Drosophila simulans
genomes Genetica 96: 269-283.
Caizzi, R., Caggese, C and Pimpinelli, S 1993 Bari1, a new transposon-like family in
Drosophila melanogaster with a unique heterochromatic organization Genetics 133:
335-345
Carbonara, B.D and Gehring, W.J 1985 Excision of copia element in a revertant of the
white-apricot mutation of Drosophila melanogaster leaves behind one long terminal repeat.
Molec gen Genet 199: 1-6.
Carmena, M and Gonzalez, C 1995 Transposable elements map in a conserved pattern of
distribution extending from beta-heterochromatin to centromeres in Drosophila
melanogaster Chromosoma 103: 676-684.
Carvalho, A.B., Vibranovski, M.D., Carlson, J.W., Celniker, S.E., Hoskins, R.A., Rubin,
G.M., Sutton, G.G., Myers, E.M., Adams, M.D and Clark, A.G 2002 Y chromosome and other heterochromatic sequences of the Drosophila melanogaster genome: how far can we
go? Genetica (In press).
Celniker, S … 2002 Genome Res ??: ??-??.
Charlesworth, B., Jarne, P and Assimacopoulos, S 1994 The distribution of transposable
elements within and between chromosomes in a population of Drosophila melanogaster III.
Element abundances in heterochromatin Genet Res (Camb.) 64: 183-197.
Charlesworth, B and Langley, C.H 1988 The population genetics of Drosophila
transposable elements An Rev Genetics 23: 251-287.
Craig, N.L., Craigie, R., Gellert, M and Lambowitz, A.M 2002 Mobile DNA II American
Society of Microbiology, Washington, DC pp.1204
Crozatier, M., Vaury, C., Busseau, I., Pelisson, A and Bucheton, A 1988 Structure and
genomic organization of I elements in I-R hybrid dysgenesis in Drosophila melanogaster.
Nucleic Acids Res 16: 9199-9123.
CSHL/WUGSC/PEB Arabidopsis Sequencing Consortium 2000 The complete sequence of
a heterochromatic island from a higher eukaryote Cell 100: 377-386.
Danilevskaya, O.N., Slot, F., Traverse, K.L., Hogan, N.C. and Pardue, M.L 1994
Drosophila telomere transposon HeT-A produces a transcript with tightly bound protein.
Proc natn Acad Sci USA 91: 6679-6668.
Deininger, P.L 1989 SINEs: Short interspersed repeated DNA elements in higher
eukaryotes Chapter 27 In Mobile DNA (eds D.E Berg and M.M Howe) American Society
of Microbiology, Washington, DC
Demerec, M 1926 Miniature-alpha – a second frequently mutating character in Drosophila
virilis Proc natn Acad Sci USA 12: 687-690.
Demerec, M 1927 Magenta-alpha – a third frequently mutating character in Drosophila
virilis Proc natn Acad Sci USA 13: 249-253.
Demerec, M 1935 Mutable genes Bot Rev 1: 233-248.
Desset S., Conte, C., Dimitri, P., Calco, V., Dastugue, B and Vaury, C 1999 Mobilization
of two retroelements, ZAM and Idefix, in a novel unstable line of Drosophila melanogaster.
Molec Biol Evol 16: 54-66.
Devine, S.E., Chissoe, S.L., Eby, Y., Wilson, R.K and Boeke, J.D 1997 A transposon-based
strategy for sequencing repetitive DNA in eukaryotic genomes Genome Res 7: 551-563.
Dimitri, P 1997 Constitutive heterochromatin and transposable elements in Drosophila
melanogaster Genetica 100: 85-93.
Dimitri, P., Junakovic, N and Arca, B 2002 Colonization of heterochromatic genes by
transposable elements in Drosophila melanogaster [In preparation]
Ding, D and Lipshitz, H.D 1994 Spatially regulated expression of retrovirus-like
transposons during Drosophila melanogaster embryogenesis Genet Res (Camb.) 64:
167-181
di Nocera, P.P., Digan, M.E., and Dawid, I.B 1983 A family of oligo-adenylate-terminated
transposable sequences in Drosophila J molec Biol 168:715-727.
Dominguez, A and Albornoz, J 1996 Rates of movement of transposable elements in
Drosophila melanogaster Molec gen Genet 251: 130-138.
Dowsett, A.P and Young, M.W 1982 Differing levels of dispersed repetitive DNA among
closely related species of Drosophila Proc natn Acad Sci USA 79: 4570-4574.
Duret, L., Marais, G and Biemont, C 2000 Transposons but not retrotransposons are
located preferentially in regions of high recombination rate in Caenorhabditis elegans.
Genetics 156: 1661-1669.
Eickbush, T.H and Malik, H.S 2002 Origins and evolution of retrotransposons Chapter 49
In Mobile DNA II (eds N.L Craig, R Craigie, M Gellert and A.M Lambowitz) American
Society of Microbiology, Washington, DC