of human long inverted repeats reveals species-specific intronic inverted repeats
Yong Wang and Frederick C C Leung
School of Biological Sciences and Genome Research Centre, The University of Hong Kong, China
The inverted repeats present in a genome play dual roles. They can induce genomic instability and, on the other hand, regulate gene expression. In the present study, we report the distribution and sequence features of recombinogenic long inverted repeats (LIRs) that are capable of forming stable stem-loops or palindromes within the human genome. A total of 2551 LIRs were identified, and 37% of them were located in long introns (largely > 10 kb) of genes. Their distribution appears to be random in introns and is not restricted, even for regions near intron–exon boundaries. Almost half of them comprise TG/CA-rich repeats, inversely arranged Alu repeats and MADE1 mariners. The remaining LIRs are mostly unique in their sequence features. Comparative studies of human, chimpanzee, rhesus monkey and mouse orthologous genes reveal that human genes have more recombinogenic LIRs than other orthologs, and over 80% are human-specific. The human genes associated with the human-specific LIRs are involved in the pathways of cell communication, development and the nervous system, as based on significantly over-represented Gene Ontology terms. The functional pathways related to the development and functions of the nervous system are not enriched in chimpanzee and mouse orthologs. The findings of the present study provide insight into the role of intronic LIRs in gene regulation and primate speciation.
Keywords
human; intron; long inverted repeat; primates; stem-loop
Correspondence
Y Wang, School of Biological Sciences, The University of Hong Kong, Hong Kong, China
Fax: +852 2857 4672
Tel: +852 2299 0825
E-mail: wangyong@graduate.hku.hk
(Received 11 December 2008, revised 19 January 2009, accepted 23 January 2009)
doi:10.1111/j.1742-4658.2009.06930.x
Abbreviations
FDR, false discovery rate; GO, Gene Ontology; LIR, long inverted repeat; siRNA, small interfering RNA; TIR, terminal inverted repeat.
An inverted repeat consists of two repeat copies (hereafter termed arms) that are approximately complementary to each other. Generally, there is a spacer between the arms, and the full structure of an inverted repeat can form a stem-loop or palindrome. The potential to form a stable stem-loop is determined by the arm size, the spacer size and the degree of matching between the arms [1,2]. For example, a relatively large spacer makes it difficult for the two arms to form a stem.
Studies of inverted repeats show that they may cause instability in a genome and, on the other hand, regulate gene expression in both prokaryotes and eukaryotes. Being capable of forming secondary structures [3], inverted repeats can induce genomic instability via gene amplification, recombination, DNA double-strand breaks and rearrangement [1,2,4–8]. Moreover, inverted repeats provide sites for the integration of viruses into eukaryotic genomes [9,10] and also comprise replication stall sites, as shown in a recent study in which evidence obtained in vivo demonstrated replication stalling by hairpins formed by inverted repeats in bacteria, yeast and mammalian cells [11]. As a result, they are restricted in a genome to some extent. For example, neighboring repetitive elements, such as Alu repeats, are generally found to occur in the same orientation, and those arranged head-to-head or tail-to-tail are rarely observed, particularly when the spacer between them is small [1,12]. In a mouse transgene experiment, the introduction of a large palindrome was followed by numerous rearrangements, which were assumed to attenuate the impact of the palindrome in the progeny [13].
Inverted repeats also regulate gene expression. The stem-loops and palindromes formed by inverted repeats are involved in RNA interference, transcription initiation of genes, initiation of DNA replication and alternative splicing of exons. The small interfering (si)RNA genes active in RNA interference comprise inverted repeats capable of forming a stem-loop motif longer than 22 bp. Some are derived from miniature inverted-repeat transposable elements [14]. At present, studies have identified siRNA genes in organisms ranging from Caenorhabditis elegans to humans. RNA interference was initially discovered as an efficient mechanism for inhibiting the expression of specific genes [15,16], and was later found to be responsible for developmental regulation [17,18] and heterochromatin maintenance [19,20]. In promoters, inverted repeats can facilitate the recognition process and the subsequent binding of RNA polymerase during gene transcription [21,22]. Moreover, the inverted repeats in a cruciform structure will attract mediators of second messenger-directed transcription, hence altering the transcriptional response [23]. Many studies also show that inverted repeats are essential for the initiation of DNA replication in plasmids, bacteria, eukaryotic viruses and mammalian cells [24]. The inverted repeats in introns are able to affect the alternative splicing of exons [25,26] and the removal efficiency of introns [27,28]. For example, alternative splicing of exon 2 in the COL2A1 gene was mediated by a stem-loop adjacent to the exon–intron boundary [25].
Because inverted repeats are both unstable and functional elements in a genome, they are expected to be distributed in intergenic regions or large introns of genes. In the yeast genome, almost 100% of large palindromes (> 25 bp) are far away from coding regions [29]. Any insertion approaching conserved transcribed sites will ultimately be erased unless its presence provides an evolutionary advantage and, thus, is under positive selection. One line of evidence for this is that, compared to introns, upstream regions of genes have more palindromes, which probably developed for the initiation of transcription [30]. A recent study shows that Caenorhabditis lineages have conserved inverted repeats in intergenic regions [31], which were suggested to be functional and therefore actively maintained in the lineages. In the human genome, there are many such motifs, although we have little knowledge of their fine-scale distribution, sequence features and potential functions at present [32,33]. Human inverted repeats were investigated in a previous study, in which a majority of them were found, on the basis of their structural features, to have only a weak capacity to form a simple stem-loop or hairpin [32]. The genome-wide distribution of human palindromes has also been surveyed, and a database has been created for public use [30]. However, palindromes with mismatches and indels were not collected in the database.
In the present study, we first located all the long inverted repeats (LIRs) characterized by long arms, high arm similarity and a short internal spacer in the human genome. They were termed recombinogenic LIRs in our previous study on human chromosomes 21 and 22 [33], although their distribution and frequency had not been fully surveyed in the whole human genome. The present study aims to provide a panoramic view of recombinogenic LIRs. On the basis of evidence obtained in vivo [1,2,11], the LIRs identified in the present study can easily form stem-loops or palindromes. Their presence in the human genome by itself implies that they are functional in some manner. The results obtained showed that 37% of the LIRs were located in intronic regions and some were primate-specific. TG/CA-rich repeats are the most frequently observed feature in LIR arms. Considering that the LIRs probably have essential functions and drive the speciation of primates, we studied the degree of conservation and the species specificity of the LIRs among orthologous genes from the mouse (Mus musculus), rhesus monkey (Macaca mulatta), chimpanzee (Pan troglodytes) and human. The results obtained demonstrate that human orthologs have relatively more LIRs, most of which are human-specific. These human-specific LIRs are probably essential for the development of the advanced functions of the human nervous system, in light of the Gene Ontology (GO) profile of the human orthologs.
Results
Characters and distribution of LIRs in human autosomes
We identified 2551 LIRs in human autosomes, and approximately 87% of them have a short spacer (0–9 bp) and a short arm (31–59 bp) (Fig. 1). By contrast, the mismatch rate between the arms of an LIR varies from 0 to 0.15, showing a relatively lower standard deviation with respect to the numbers of LIRs in the different ranges (Fig. 1). These results indicate that a majority of the LIRs are able to form a stem-loop with a stem of 31–59 bp and a tiny loop (or none for a palindrome).
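The three features used above (arm length, spacer size and mismatch rate) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch assumes arms of equal length with no indels, which is a simplification and not the search algorithm used in the present study; the thresholds in `is_typical_lir` simply restate the ranges quoted in the text.

```python
# Illustrative sketch only: names and the equal-length-arm restriction are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def describe_inverted_repeat(left_arm: str, spacer: str, right_arm: str) -> dict:
    """Summarize an inverted repeat by arm length, spacer length and mismatch rate."""
    if len(left_arm) != len(right_arm):
        raise ValueError("this sketch only handles arms of equal length (no indels)")
    # A perfect inverted repeat has a right arm equal to the reverse
    # complement of the left arm; count positions where that fails.
    expected = reverse_complement(left_arm)
    mismatches = sum(a != b for a, b in zip(expected, right_arm.upper()))
    return {
        "arm_length": len(left_arm),
        "spacer_length": len(spacer),
        "mismatch_rate": mismatches / len(left_arm),
    }


def is_typical_lir(features: dict) -> bool:
    """Check the ranges quoted in the text: arm 31-59 bp, spacer 0-9 bp, mismatch <= 0.15."""
    return (
        31 <= features["arm_length"] <= 59
        and 0 <= features["spacer_length"] <= 9
        and features["mismatch_rate"] <= 0.15
    )
```

A spacer of length zero corresponds to a palindrome in the sense used above.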
The genomic distribution of the LIRs shows that the density of LIRs selected by our criteria is quite low (Fig. 2). The highest and lowest LIR densities were observed in chromosomes 4 (1.2/Mb) and 22 (0.44/Mb), respectively. Interestingly, the LIR density negatively co-varies with gene density among the chromosomes (t = 19.8; P < 10⁻⁴) (Fig. 3). The point denoting chromosome 19 is notably far from the regression line, which is accounted for by the gene clusters that contribute to the two-fold higher gene density of chromosome 19 compared to the genomic average [34].
Fig. 1. Characteristics of human LIRs. The 2551 LIRs were classified according to spacer size, mismatch rate and arm length.
Fig. 2. Distribution of LIRs in the human genome. The density is represented by the number of LIRs per 1 Mb of sequence. The shortest bars denote one LIR per 1 Mb.
We found that the negative correlation is due to the high frequency of LIRs in long genes. A total of 956 LIRs (in 702 genes) were located in genic regions, and 1595 in intergenic regions. In other words, 37% of the LIRs were found within genes. However, our calculation of the coverage of genes in the human genome was 26.9%, which is consistent with the value reported previously [35]. When introns were taken into account, the percentage was 25.1%. This implies that the distribution of the LIRs is not random. Statistical analysis performed on the results shows that the presence of LIRs is significantly biased towards genes (chi-square test; P < 0.0001). The LIRs that have long arms (> 400 bp) and the associated genes are listed in Table S1. Surprisingly, over half of them were found within genes. There are two cases of partial overlap between LIRs and exons. In one case, the left arm of an intronic LIR extends into an exon of c14orf165 and, in the other case, an LIR on chromosome 17 partially overlaps an exon of a putative gene. This frequency is much lower than expected. We did not find any LIRs overlapping either the start or end site of a gene.
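The bias of LIRs towards genic regions can be checked with a worked example. The sketch below is a two-class chi-square goodness-of-fit test in plain Python; it assumes that the expected genic fraction is the 26.9% gene coverage quoted above, which is an assumption made here because the exact contingency setup of the test is not described in the text. The function name is hypothetical.

```python
import math


def chi_square_two_classes(observed_in: int, observed_out: int,
                           expected_fraction_in: float) -> tuple:
    """Goodness-of-fit test for two classes (1 degree of freedom).

    Returns the chi-square statistic and its P value.
    """
    total = observed_in + observed_out
    expected_in = total * expected_fraction_in
    expected_out = total - expected_in
    chi2 = ((observed_in - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)
    # For 1 d.f., the upper-tail probability of a chi-square variable
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# 956 genic and 1595 intergenic LIRs; assumed expected genic fraction 0.269.
chi2, p_value = chi_square_two_classes(956, 1595, 0.269)
print(f"chi-square = {chi2:.1f}, P = {p_value:.2e}")
```

Under this assumption the statistic is approximately 145, so P falls far below the 0.0001 threshold reported in the text.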
Further results confirmed that LIRs tend to reside in large intronic and intergenic regions. Only five LIRs were found in introns < 1 kb (the smallest intron was 757 bp), and none in intergenic regions < 2 kb. The median sizes for the introns and the intergenic regions are 46 and 386 kb, respectively. Moreover, most of the LIR-containing intergenic regions are > 10 kb. Correspondingly, a chromosome that has more long genes will show a lower gene density, in agreement with the above negative correlation between LIR density and gene density (Fig. 2).
We then studied the positions of the LIRs in introns and intergenic regions. A short distance to the exon–intron boundary or transcription start site is an indication that an LIR is functioning in the gene. A ratio of 0–0.5 was applied to denote the relative distance to the boundaries, or to the center, and was divided into five ranges. We calculated the percentage of LIRs falling within the ranges and observed only a small difference in percentage between the ranges, suggesting a random distribution of LIRs in both intronic and intergenic regions (see Fig. S1). We also considered the effect of the length of these regions on the distribution. The intronic and intergenic regions were then classified on the basis of their lengths. Within each of the length groups, the numbers of LIRs in the ratio ranges do not show any significant difference (chi-square tests; d.f. = 4; P > 0.1) (see Fig. S1). Therefore, the LIRs do not avoid the boundaries of exonic or genic regions. The median distance to the exon boundaries is 7.8 kb, and that to the gene boundaries is 69 kb.
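The relative-distance ratio described above can be sketched as follows. This is an illustrative reading, not the procedure used in the study: it assumes that the ratio is the distance from the LIR to the nearest boundary divided by the length of the region (0 at a boundary, 0.5 at the center), and that the five ranges are of equal width. All function names are hypothetical.

```python
from collections import Counter


def relative_distance(position: int, region_start: int, region_end: int) -> float:
    """Distance to the nearest region boundary divided by the region length."""
    length = region_end - region_start
    if length <= 0 or not region_start <= position <= region_end:
        raise ValueError("position must lie inside a region of positive length")
    return min(position - region_start, region_end - position) / length


def ratio_bin(ratio: float, n_bins: int = 5) -> int:
    """Map a ratio in [0, 0.5] to one of n_bins equal-width ranges (0-based index)."""
    if not 0.0 <= ratio <= 0.5:
        raise ValueError("ratio must be between 0 and 0.5")
    # ratio * 2 rescales to [0, 1]; the center (0.5) is placed in the last bin.
    return min(int(ratio * 2 * n_bins), n_bins - 1)


def bin_counts(lirs, n_bins: int = 5) -> list:
    """Count LIRs per range; lirs is an iterable of (position, start, end) tuples."""
    counts = Counter(
        ratio_bin(relative_distance(pos, start, end), n_bins)
        for pos, start, end in lirs
    )
    return [counts.get(i, 0) for i in range(n_bins)]
```

The resulting five counts per length group are the kind of input that a chi-square test with d.f. = 4 would take.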
Strikingly, pseudogenes were frequently found around the intergenic LIRs. A total of 803 intergenic LIRs (50%) have one or two neighboring pseudogenes, of which 422 are RNA pseudogenes. According to the annotation in the Ensembl database (http://www.ensembl.org), approximately 27% of the human genes are pseudogenes. The occurrence of pseudogenes adjacent to LIRs is statistically significant (chi-square test; P < 0.0001).
Sequence features of the human LIRs
We found that over half (51%) of the identified LIRs could be placed into groups consisting of at least three members on the basis of sequence similarity. The group members comprise simple repeats, known repetitive elements, amplified genes or duplicated genomic fragments. The largest group consists of LIRs formed by stretches of TG/CA dinucleotides and interspersed TA dinucleotides. We defined them as TG/CA-rich LIRs, accounting for 33% and 39% of all the LIRs in the intronic and intergenic regions, respectively (Fig. 4). By contrast, we also identified TC/GA-rich LIRs, which account for only 3% in both of the regions. Thus, the frequency of TG/CA-rich LIRs is at least 11-fold higher than that of TC/GA-rich ones (for intronic LIRs: 11-fold; for intergenic LIRs: 13-fold). The difference is statistically significant (chi-square test; P < 0.0001). On average, the combination of TG/CA-rich and TC/GA-rich LIRs accounts for 38% of the identified LIRs. Additionally, we could not identify any LIRs constructed from simple repeats with longer repeat units (> 2 bp).
Fig. 3. Negative correlation between gene density and LIR frequency. The black dots show the correlation between gene density and LIR density in the 22 chromosomes. [Plot: x-axis, S-LIR density (/Mb); regression line y = -23.8x + 33.6, R² = 0.53; the outlying point is labeled chr. 19.]
The second largest group comprises the LIRs involved in known human repetitive elements. We found 145 MADE1 mariners and 108 inverted Alu repeats in our LIR collection. The mariner has a short spacer and long terminal inverted repeats (TIRs). In the present study, they were considered as LIRs in cases of high identity between TIRs. Within both intronic and intergenic LIRs, they account for 6% in total. Alu repeats in the LIRs are mostly partial in structure and are arranged head-to-head or tail-to-tail. In some large LIRs, more than one Alu was included in one arm, and the complete structure of Alu could be retained therein. The proportion of inverted Alus within the LIRs is 6% for intronic regions and 3% for intergenic regions.
The grouping of the LIRs is also a result of gene amplification or fragmental duplication, although the numbers of such groups and of the members inside the groups are not large. We identified 20 LIRs in genes encoding a novel protein similar to septin (NPS-Septin) and eight in genes encoding POTE. The genes belong to gene families and their duplication is coupled with the spread of the LIRs inside the gene. The remaining LIRs, aside from the above groups, show similarity either to one or to none of the others. They are labeled as rare LIRs, accounting for 49% of all the LIRs in the human genome.
We explored the LIRs in the NPS-Septin gene family in more detail. A BLAT search in the University of California Santa Cruz (UCSC) browser (http://genome.ucsc.edu) was used to confirm the association of the LIRs with the gene family. The longest gene, LOC400807, is approximately 107 kb, and an NPS-Septin LIR is positioned at approximately 10.6 kb. In addition, we also found more NPS-Septin LIRs on the Y chromosome, although they were not present in NPS-Septin genes. Sequence alignment displays highly identical arms but diverse spacers for the 20 NPS-Septin LIRs (Fig. 5). They are able to form variant stem-loop structures in which both the stem and the loop are of different sizes. Except for those on chromosomes 3, 10 and Y, all the LIRs were located in subtelomeric regions (see Table S2). The proximal LIRs show similar spacer motifs; for example, the three LIRs on chromosome 1p (nos 1–3) and the two LIRs on the Y chromosome (Fig. 5). This is evidence for inverted duplication of the fragments in these regions. We also noted that sequence similarity in the flanking regions of the LIRs declines gradually at all sites.
Species-specific LIRs inside orthologous genes
To determine the species specificity of the LIRs, we detected LIRs in mouse, rhesus monkey, chimpanzee and human orthologous genes. Among 12 723 groups of orthologous genes, we identified 546 LIRs for human orthologs, 481 for chimpanzee orthologs, 201 for mouse orthologs and 130 for rhesus monkey orthologs. With respect to species specificity, 421 (77%) are human-specific, 355 (74%) are chimpanzee-specific, 180 (90%) are mouse-specific and 107 (82%) are rhesus monkey-specific. For the nonspecific LIRs, 13 groups of orthologs from the three primate species all have at least one LIR, and 104 ortholog pairs from humans and chimpanzees possess LIR(s). This suggests that most of the nonspecies-specific LIRs are shared by the primates, and some LIRs were specifically developed in the primate lineage.
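The species-specific proportions quoted above follow directly from the reported counts, as the short Python check below shows. The dictionary layout and function name are illustrative choices; only the counts come from the text.

```python
# (total LIRs in orthologs, species-specific LIRs), as reported in the text.
LIR_COUNTS = {
    "human": (546, 421),
    "chimpanzee": (481, 355),
    "mouse": (201, 180),
    "rhesus monkey": (130, 107),
}


def specific_percentage(total: int, specific: int) -> int:
    """Percentage of species-specific LIRs, rounded to the nearest integer."""
    return round(100 * specific / total)


for species, (total, specific) in LIR_COUNTS.items():
    print(f"{species}: {specific}/{total} = {specific_percentage(total, specific)}%")
```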
Fig. 4. Composition of LIRs in the human genome. The LIRs in the POTE and NPS-Septin families account for 3% of all the intronic LIRs. The 'other' LIRs, accounting for 49% of all LIRs, refer to those with unique sequence features.
We next obtained the biological profile of the human orthologs that have human-specific LIR(s). Compared to randomly selected human genes, the orthologs are significantly enriched with GO terms within the categories of development, binding, membrane, cell communication and signal transduction (Table 1). An important finding is that a number of the terms are related to the nervous system, including neurotransmitter receptor activity (GO:0030594), central nervous system development (GO:0007417), GABA receptor activity (GO:0016917), axonogenesis (GO:0007409), projection, generation, differentiation and development of neurons (GO:0043005, GO:0048699, GO:0030182, GO:0048666), synapse (GO:0045202), and so on. The GO term that is under-represented in these genes is GO:0006955 for immune response [false discovery rate (FDR) = 0.048].
We also performed the same test on 104 orthologs with human- and chimpanzee-specific LIRs. The GO terms from the human orthologs were compared with those from randomly selected human genes, showing that the above over-represented GO terms related to the nervous system were largely not assigned to these orthologs (see Table S3). Only the term GO:0045202 is related to synapse. Basically, the terms for binding, membrane, signal transduction and cell communication are retained in the list.
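The FDR values reported for the GO enrichment tests come from a multiple-testing correction applied to per-term P values. The text does not specify which FDR procedure was used, so the sketch below shows the Benjamini–Hochberg adjustment purely as one widely used example; the function name and the input P values in the usage note are hypothetical and are not data from the study.

```python
def benjamini_hochberg(p_values: list) -> list:
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity:
    # adjusted_(k) = min over j >= k of p_(j) * m / j.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For instance, `benjamini_hochberg([0.01, 0.04, 0.03, 0.005])` gives adjusted values of roughly 0.02, 0.04, 0.04 and 0.02; terms with an adjusted value below 0.05 would then be called significant.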
As a control, we obtained over-represented GO terms from the mouse genes associated with mouse-specific LIRs by comparison with randomly selected mouse orthologs. A part of the result shown in Table S4 is similar to that shown in Table S3 (e.g. binding and signal transduction). The difference is that the result for the mouse orthologs includes GO terms for the regulation of transcription, the RNA biosynthetic process and the phosphate metabolic process. We found one over-represented term (GO:0007399) in a pathway for nervous system development (FDR = 0.0368).
Fig. 5. Alignment of the LIRs mostly located in genes encoding a novel protein similar to septin. The locations of the LIRs are listed in Table S2. Essentially, the arms of the LIRs are approximately 1–48 and 97–143 bp, and can be extended into the spacers in some LIRs.
Table 1. GO terms over-represented in human genes having human-specific LIRs. The genes are human orthologs that have at least one human-specific LIR. Reference genes are randomly selected from the list of orthologs, and are used for comparison with the test human genes with specific LIRs. The GO terms in 352 test genes were compared with those in 296 reference genes, using Fisher's exact test in BLAST2GO. FDR was applied to obtain significantly over-represented (FDR < 0.05) GO terms in the test genes. Several GO terms belonging to levels 1 or 2 are not included.
GO ID        GO term                                    FDR
GO:0007154   Cell communication                         2.65E-04
GO:0032501   Multicellular organismal process           2.65E-04
GO:0004888   Transmembrane receptor activity            7.24E-04
GO:0031224   Intrinsic to membrane                      7.24E-04
GO:0016021   Integral to membrane                       8.83E-04
GO:0032502   Developmental process                      0.001316
GO:0030695   GTPase regulator activity                  0.001397
GO:0007275   Multicellular organismal development       0.001397
GO:0044459   Plasma membrane part                       0.001397
GO:0030182   Neuron differentiation                     0.001526
GO:0007186   G-protein coupled receptor protein         0.002093
GO:0004872   Receptor activity                          0.002643
GO:0007166   Cell surface receptor                      0.003784
GO:0000166   Nucleotide binding                         0.003784
GO:0031420   Alkali metal ion binding                   0.004149
GO:0045211   Postsynaptic membrane                      0.004149
GO:0006811   Ion transport                              0.004408
GO:0046872   Metal ion binding                          0.005536
GO:0030554   Adenyl nucleotide binding                  0.005536
GO:0007155   Cell adhesion                              0.005536
GO:0005083   Small GTPase regulator activity            0.005868
GO:0005509   Calcium ion binding                        0.006231
GO:0016773   Phosphotransferase activity                0.006277
GO:0017076   Purine nucleotide binding                  0.00637
GO:0000902   Cell morphogenesis                         0.007098
GO:0032989   Cellular structure morphogenesis           0.007098
GO:0030234   Enzyme regulator activity                  0.007765
GO:0051056   Regulation of small GTPase                 0.009344
GO:0048869   Cellular developmental process             0.009344
GO:0030154   Cell differentiation                       0.009344
GO:0005215   Transporter activity                       0.009344
GO:0032559   Adenyl ribonucleotide binding              0.009362
GO:0032555   Purine ribonucleotide binding              0.01047
GO:0032553   Ribonucleotide binding                     0.01047
GO:0031226   Intrinsic to plasma membrane               0.011022
GO:0005230   Extracellular ion channel activity         0.014854
GO:0031175   Neurite development                        0.014854
GO:0030030   Cell projection organization               0.016226
GO:0022804   Active transmembrane transporter           0.016354
GO:0004672   Protein kinase activity                    0.017032
GO:0048856   Anatomical structure development           0.017199
GO:0043687   Post-translational protein modification    0.017406
GO:0015075   Ion transmembrane transporter              0.017424
GO:0006464   Protein modification process               0.017424
GO:0051234   Establishment of localization              0.018271
GO:0022803   Passive transmembrane transporter          0.019349
GO:0022838   Substrate-specific channel activity        0.019349
GO:0004713   Protein-tyrosine kinase activity           0.020021
GO:0004930   G-protein coupled receptor activity        0.020021
GO:0022857   Transmembrane transporter activity         0.020021
GO:0050793   Regulation of developmental process        0.021558
GO:0008509   Anion transmembrane transporter            0.021558
GO:0019199   Transmembrane protein kinase activity      0.023565
GO:0048667   Neuron morphogenesis                       0.023565
GO:0046578   Ras protein signal transduction            0.023565
GO:0004714   Kinase activity                            0.023565
GO:0022891   Transmembrane transporter activity         0.026669
GO:0009790   Embryonic development                      0.028903
GO:0006793   Phosphorus metabolic process               0.030622
GO:0005096   GTPase activator activity                  0.033582
GO:0015698   Inorganic anion transport                  0.033582
GO:0065007   Biological regulation                      0.034756
GO:0005089   Rho guanyl-nucleotide exchange factor      0.041165
GO:0005088   Ras guanyl-nucleotide exchange factor      0.041165
GO:0007010   Cytoskeleton organization                  0.041874
GO:0007417   Central nervous system development         0.043131
GO:0030594   Neurotransmitter receptor activity         0.043131
GO:0004674   Protein serine/threonine kinase            0.045995
GO:0008092   Cytoskeletal protein binding               0.046358
GO:0005515   Protein binding                            0.047541
The 104 LIRs common to human and chimpanzee orthologs were studied with the aim of uncovering the mechanism of their formation. We searched the arm sequences of the LIRs in the UCSC genome browser for homologous fragments in other mammalian genomes. The species specificity of the LIRs was demonstrated in several cases, where we found half-sized LIRs in the rhesus monkey genome. One case is the LIR in the human gene c9orf52, which has 19 ORFs and four transcription variants. Positioned between exons 17 and 18 (5.3 kb to exon 18; 36.59 kb to exon 17) (Fig. 6), it has a homolog in the chimpanzee genome. However, all homologous fragments from the rhesus monkey correspond to one arm of the LIR. Moreover, motif conservation was exhibited in the flanking sequences of the LIR in primates (Fig. 6). In other words, the half-sized LIR represents the ancestral status, and the full-sized LIR was developed in the chimpanzee and human lineages. We did not find fragments homologous to the LIR in the mouse genome. Instead, a half-sized LIR was observed in the dog genome, suggesting that nonprimate genomes also lack the full-sized LIR. This also serves as additional solid evidence for the presence of the half-sized LIR in the rhesus monkey. These results imply that some LIRs were derived by inverted duplication of one arm.
Discussion
A survey of recombinogenic LIRs across the human genome
In the present study, we identified LIRs in the human genome, and provide a fine map of the distribution of human LIRs. Due to a strong capability for forming a stem-loop, the LIRs are recombinogenic and account for only approximately 0.4% of all human LIRs, as suggested previously [33]. Our algorithm allows the presence of mismatches and insertions in the stem part of the secondary structure, and also provides settings for spacer size, arm size and arm similarity. Due to variant internal structures, inverted repeats differ in their efficiency with respect to the induction of instability. Evidence is available suggesting that arm size, arm similarity and internal spacer size are all important factors [1,2]. Therefore, the inverted repeats identified in the present study are generally associated with a high potential for stem-loop or palindrome formation. This is partially supported by the fact that approximately 87% of our LIRs have a short spacer of < 10 bp. Nonetheless, we cannot preclude the possibility that some of the LIRs experience difficulty regarding the formation of a stem-loop, such as the reversely duplicated genes and those extremely large LIRs with a huge spacer (see Table S1). In previous studies, the methods employed for inverted repeat identification could not search for inverted repeats by freely defining arm similarity, spacer size and indels [30,32,36]. Thus, the map of the LIRs obtained in the present study provides a more detailed distribution of stem-loops in the human genome, and confirms that the LIRs are mostly located in long introns and intergenic regions. Furthermore, the inverted repeats in the present study are more likely to be functional than those of previous studies because functional inverted repeats such as siRNA genes are rarely palindromes showing 100% arm similarity [30,32,36].
Because of the difficulties encountered in the design of the algorithm for LIR searching at the genome level and the complex folding structures of inverted repeats, we could not target all the inverted repeats with a strong potential to form a stem-loop or a palindrome. In particular, there are a large number of AT-rich regions in the human genome, and the frequency of (TA)n repeats is 19.4 per Mb [37]. The self-complementary (TA)n repeats can by themselves form variant secondary structures. To remove these simple repeats, we set the GC content of the arm sequences at > 20%. This step, however, unavoidably deleted AT-rich LIRs, and some of them have been implicated as the mediator of the constitutional t(11;22) translocation in humans [38]. Although there are also a large number of (CA)n and (GA)n repeats in the human genome, the frequency of their complementary repeats (TG)n and (CT)n is much lower [37]. Therefore, the presence of the TG/CA-rich and TC/GA-rich LIRs is not a result of the enrichment of (CA)n and (GA)n repeats.
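The GC-content filter on arm sequences described above can be written in a few lines. The sketch below is a minimal illustration with hypothetical function names; only the > 20% threshold is taken from the text, and how the filter was combined with the rest of the search is not assumed here.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_filter(arm: str, min_gc: float = 0.20) -> bool:
    """Keep an arm only if its GC content is strictly greater than min_gc."""
    return gc_content(arm) > min_gc
```

A pure (TA)n arm has a GC content of 0 and is discarded, whereas a (TG)n arm has a GC content of 0.5 and is kept, which matches the trade-off noted above: simple AT repeats are removed at the cost of losing genuine AT-rich LIRs.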
Fig. 6. An intronic LIR in c9orf52 and the flanking conserved sequence. The arrow denotes the intronic LIR, positioned between exons 17 and 18. The large arrows with an opposing orientation indicate the two arms of the LIR. Rhesus monkey (Rhesus macaque) and dog (Canis familiaris) genes possess half-sized LIRs.
Probable functions of the LIRs
The results obtained in the present study show that a considerable proportion of the LIRs are within genes and tend to be located in large introns of long genes. The LIRs in the large introns, although still unstable, will not greatly disturb the coding parts of the genes. Knowing the genomic distribution and sequence features of the LIRs enables us to speculate about the biological functions of the LIRs.
First, there are a large number of TG/CA-rich LIRs in our collection, and these intronic TG and CA tracts are probably involved in the alternative splicing of genes. One study revealed that intronic TG tracts, particularly in a hairpin structure, are important in the intron knockout process and help to create complicated splicing patterns [39]. On the other hand, CA tracts and CA-rich sequences are confirmed to be regulators of alternative splicing. One study showed that the insertion of a CA repeat into different intronic places will result in variant splicing patterns in a human gene [40]. Perhaps splicing sites at intron–exon boundaries can be recognized easily by a signal of secondary structure. Taken together, this allows us to propose that the TG/CA-rich LIRs are regulators in human genes.
Second, approximately half of the LIRs are unique in sequence features, and some of them are probably unidentified siRNA genes. In the present study, the LIRs are longer than the minimal length required for an siRNA. Although arm similarity is higher than that observed in most siRNAs, some of them are still candidates for siRNA genes. We used EMBOSS sirna (http://emboss.sourceforge.net/apps/cvs/sirna.html) to identify the candidates with a threshold score of 8, and found that 267 of the LIRs are potential siRNA genes. The validity of these motifs in gene silencing requires further empirical examination.
LIRs and recombination hotspots are not related
The question remains as to whether the recombinogenic LIRs identified in the present study are frequently associated with recombination hotspots in the human genome. Almost 47% of the human genome is composed of repeats [37], and direct repeats are predominant over inverted repeats in the human genome, partially because inverted repeats are able to induce instability five-fold more efficiently than direct repeats [41]. The UCSC browser provides the recombination rate data for the human genome. Essentially, recombination hotspots are concentrated in subtelomeric regions [42]. These regions, however, do not have more LIRs than other regions (Fig. 2) and, instead, some chromosomal LIR-rich regions are located in the inner part of the chromosomes. The lack of an association between LIRs and recombination hotspots is also suggested by a recent study on human recombination hotspots based on a computational simulation using single nucleotide polymorphism data, which showed that inverted repeats were not over-abundant in the hotspots [43]. In the present study, we did not detect over-represented LIRs in the hotspots (results not shown). Therefore, a contribution of LIRs to recombination hotspots is not supported, and the recombination-inducing effect of the recombinogenic LIRs probably acts only on specific genomic regions.
LIRs spreading via fragmental duplications
NPS-Septin genes are spread in the human genome, possibly due to interchromosomal recombination and fragmental duplication. One study showed that interchromosomal recombination frequently occurs at the subtelomeric regions in humans [42]. NPS-Septin was assumed to be one of the gene families that amplified themselves by this mechanism. The result of gene amplification is concurrent duplication of the intronic LIRs, as observed in chromosomes 1 and Y in the present study. By contrast, only two chimpanzee NPS-Septin LIRs were identified, which is in accordance with the low frequency of subtelomeric duplications in the chimpanzee genome [42]. Regarding the spread of LIRs in the POTE family, at least those on chromosome 2 subtelomeres are most likely the result of intrachromosomal recombination, as inferred from genomic locations. Similarly, the chimpanzee genome has two POTE homologs: one on chromosome 12 and another on chromosome 22. In addition, the LIRs in the POTE and NPS-Septin families were entirely absent from other mammals in current genomic assemblies.
Probable role of the LIRs in primate speciation
Among the orthologous genes, we found that human and chimpanzee genes contain more LIRs than their rhesus monkey and mouse orthologs. Our data suggest that most of the LIRs shared by human and chimpanzee orthologs were developed and maintained by the common ancestor of humans and chimpanzees. However, the true difference in LIR frequency among the primates may be somewhat smaller than observed. Our search for LIRs in rhesus monkey orthologs probably missed a proportion of the LIRs, even though the required similarity between arms was lowered to 75%. Where the similarity between arms was lower than 75%, some of the LIRs shared by all primates could not be detected. Indeed, we observed higher mismatch rates in the stems formed by monkey LIRs, and the corresponding chimpanzee and human LIRs have undergone compensating mutations that improve the stability of the stem-loops for human and chimpanzee LIRs relative to those for rhesus monkey (results not shown). The compensating mutations comprise one line of evidence for the functional role and adaptive evolution of the primate LIRs.
The biological profiling of the orthologs with human-specific LIRs implies their association with the development of the central nervous system. Moreover, GO terms in pathways such as cell communication and transmembrane signal transduction are enriched in these orthologs. The number of genes in eukaryotic genomes does not differ as much as previously thought [44,45], and the morphological and physiological differences among eukaryotes are considered to be the result of different levels of regulation of the existing genes. The intronic LIRs in the present study are probably novel, essential regulatory motifs that enable a complex expression profile and the fine regulation of human genes, as suggested previously [46]. The appearance of the LIRs probably provides humans with an evolutionary advantage and contributes to the speciation of primates.
Experimental procedures
Identification of LIRs
The human genome (Build 35) was downloaded from the NCBI (http://www.ncbi.nlm.nih.gov/), and the locations of all human genes and their exons (for protein-coding genes) were obtained from the Ensembl database (http://www.ensembl.org). From the gene list, we obtained the locations of the boundaries of the genic and nongenic regions. Exons belonging to the same gene were sorted again according to their genomic locations, and the introns were defined as the intervals between the exons. From this list, the boundaries of exons and introns were determined.
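The derivation of introns from sorted exons can be illustrated with a minimal Python sketch. It is not the authors' code: the function name, the 1-based inclusive coordinates and the merging of overlapping or abutting exons (for example, from alternative transcripts) are all assumptions made for illustration.

```python
def introns_from_exons(exons):
    """Return intron intervals for one gene.

    exons: iterable of (start, end) pairs, assumed 1-based and inclusive.
    Overlapping or directly adjacent exons are merged first (an assumption;
    the source only states that exons were sorted by genomic location).
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            # Exon overlaps or abuts the previous one: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Each intron is the gap between two consecutive merged exons.
    return [(left[1] + 1, right[0] - 1) for left, right in zip(merged, merged[1:])]


if __name__ == "__main__":
    print(introns_from_exons([(500, 600), (100, 200), (150, 250)]))
```

A gene with a single exon yields an empty list, so such genes contribute no introns to the survey.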
We first searched for inverted repeats across the human genome using bespoke software [33]. The settings for this step were: arm length > 30 bp; arm identity > 85%; and spacer < 2 kb. In addition, inverted repeats with an arm GC content of < 20% were filtered out. This aimed to exclude the abundant inverted repeats formed by (TA)n simple repeats, as shown in our primary study. A (TA)n repeat is by itself an inverted repeat, and can form variant secondary structures rather than a single, stable stem-loop. Therefore, (TA)n repeats were not the typical inverted repeats required here, and we removed them from the dataset in the present study. Several types of redundancies were removed, as described previously [33]. To define the recombinogenic LIRs, we screened the collection with a new criterion: the ratio of arm length to spacer length must be larger than the mismatch, where the mismatch is equal to 100% minus the identity. Therefore, the LIRs in our dataset were recombinogenic LIRs [33].
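The screening thresholds above can be summarized as a small Python sketch. It is illustrative only and does not reproduce the software of [33]: the class and function names are hypothetical, a spacer of 0 bp is assumed to pass the ratio test, and the mismatch is assumed to be compared as a fraction (0 to 1), because the text does not state its units. The `percent_units` flag switches to the alternative reading.

```python
from dataclasses import dataclass


@dataclass
class InvertedRepeat:
    arm_length: int     # bp
    spacer_length: int  # bp
    identity: float     # identity between arms, as a fraction (0..1)
    arm_gc: float       # GC content of the arms, as a fraction (0..1)


def passes_initial_search(ir):
    """Thresholds stated in the text: arm > 30 bp, identity > 85%,
    spacer < 2 kb, and arms with GC content < 20% are discarded."""
    return (ir.arm_length > 30 and ir.identity > 0.85
            and ir.spacer_length < 2000 and ir.arm_gc >= 0.20)


def is_recombinogenic(ir, percent_units=False):
    """Arm/spacer ratio must exceed the mismatch (100% minus identity).

    Assumption: mismatch is a fraction unless percent_units=True.
    Assumption: a spacer of 0 bp (perfect palindrome) always passes.
    """
    if not passes_initial_search(ir):
        return False
    mismatch = 1.0 - ir.identity
    if percent_units:
        mismatch *= 100.0
    if ir.spacer_length == 0:
        return True
    return ir.arm_length / ir.spacer_length > mismatch


if __name__ == "__main__":
    example = InvertedRepeat(arm_length=100, spacer_length=500,
                             identity=0.90, arm_gc=0.45)
    print(is_recombinogenic(example))
```

Under the fractional reading, an LIR with 100 bp arms, a 500 bp spacer and 90% identity passes (0.2 > 0.1), whereas one with 40 bp arms, a 1900 bp spacer and 86% identity does not.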
The LIRs within genes were identified and their relative distance to the exon–intron boundaries was calculated as a ratio. Here, a ratio approaching 0 indicates that the LIR lies close to the nearest exon–intron boundary, and a ratio close to 0.5 means that the LIR is positioned near the center of an intron, regardless of direction. Pseudogenes were not used in this survey. For LIRs in intergenic regions, the same ratios were also measured; the difference was that the ratios in that case represent the relative distance to the closest neighboring gene. We also attempted to identify cases of partial overlap between LIRs and genes or exons.
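One way to compute such a ratio is sketched below in Python. The exact formula is not given in the text, so this is an assumption: the LIR position is taken as the midpoint of the whole structure, and the distance to the nearer boundary is divided by the span of the enclosing interval, which yields values between 0 and 0.5. The function name is hypothetical.

```python
def relative_boundary_distance(lir_start, lir_end, region_start, region_end):
    """Distance from the LIR midpoint to the nearer region boundary,
    divided by the region span; ranges from 0 (at a boundary) to 0.5 (center).

    The region is an intron for genic LIRs, or the interval between two
    neighboring genes for intergenic LIRs.
    """
    if not (region_start <= lir_start <= lir_end <= region_end):
        raise ValueError("LIR must lie entirely within the region")
    span = region_end - region_start
    if span == 0:
        raise ValueError("region has zero span")
    midpoint = (lir_start + lir_end) / 2
    nearest = min(midpoint - region_start, region_end - midpoint)
    return nearest / span


if __name__ == "__main__":
    # LIR centered in an intron spanning positions 1000..2000.
    print(relative_boundary_distance(1450, 1550, 1000, 2000))
```

Because only the nearer boundary is used, an LIR at the 5' end and one at the 3' end of an intron receive the same ratio, which matches the statement that direction is ignored.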
Classification of LIRs
We first selected the LIRs that were essentially composed of dinucleotide repeats. Where TG + TA + CA made up > 80% of the arm of an LIR, it was considered to be TG/CA-rich; where TC + TA + GA made up > 80%, it was considered to be TC/GA-rich. The remaining LIRs were classified on the basis of similarity. First, we used consensus motifs of common human repetitive elements (from RepBase: http://www.girinst.org/) as templates. An LIR was considered to be formed by a known repetitive element if the identity of the homologous part (> 20 bp) was higher than 75%. For the results obtained, LIRs formed by inverted Alu repeats were further confirmed by RepeatMasker (http://repeatmasker.org). Second, LIRs homologous to each other were searched for. Similarly, the criteria were: homologous part > 20 bp and identity of the homologous part > 75%. Put simply, the algorithm for finding the homologous part locates an identical seed of 5 bp and then extends the seed at both ends until two consecutive mismatches occur on each side.
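The seed-and-extend search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the extension is ungapped, trailing mismatches are trimmed so that the hit ends on a match, the longest qualifying hit is reported, and all function names are hypothetical.

```python
def extend(a, b, i, j, step, max_run=2):
    """Extend an ungapped match from a[i], b[j] in direction step (+1 or -1).

    Stops at max_run consecutive mismatches or at a sequence end, and
    returns the number of extended positions up to the last match.
    """
    kept = 0  # extension length ending on the last matching base
    run = 0   # current run of consecutive mismatches
    k = 0
    while 0 <= i + k * step < len(a) and 0 <= j + k * step < len(b):
        if a[i + k * step] == b[j + k * step]:
            run = 0
            kept = k + 1
        else:
            run += 1
            if run == max_run:
                break
        k += 1
    return kept


def homologous_part(a, b, seed=5, min_len=20, min_identity=0.75):
    """Return (start_a, start_b, length, identity) of the longest hit with
    length > min_len and identity > min_identity, or None if there is none."""
    # Index every seed-length word of b by its start positions.
    index = {}
    for j in range(len(b) - seed + 1):
        index.setdefault(b[j:j + seed], []).append(j)

    best = None
    for i in range(len(a) - seed + 1):
        for j in index.get(a[i:i + seed], ()):
            left = extend(a, b, i - 1, j - 1, -1)
            right = extend(a, b, i + seed, j + seed, +1)
            start_a, start_b = i - left, j - left
            length = left + seed + right
            matches = sum(x == y for x, y in zip(a[start_a:start_a + length],
                                                 b[start_b:start_b + length]))
            identity = matches / length
            if length > min_len and identity > min_identity:
                if best is None or length > best[2]:
                    best = (start_a, start_b, length, identity)
    return best


if __name__ == "__main__":
    query = "ACGTTGCAAGCTTAGGCTAACGGATCCATG"
    # Template: the query with one substitution, flanked by unrelated bases.
    template = "TT" + query[:12] + "C" + query[13:] + "GG"
    print(homologous_part(query, template))
```

To test an LIR arm against a repeat consensus in both orientations, the same function could be called a second time with the reverse complement of the arm.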
LIRs in mammalian orthologous genes
We obtained orthologous genes for the human–chimpanzee, human–rhesus monkey and human–mouse species pairs from the BIOMART database (http://www.ensembl.org/biomart/), which employs the Ensembl 42 Homology Database. By searching for the same human gene IDs in the three ortholog tables, we created a new ortholog table containing 12 723 groups of orthologous genes from the four species. In the BIOMART database, some orthologous genes are of the types ‘one-to-many’ and ‘many-to-many’, which denote a multiple orthologous relationship between the genes. In the