GNOMIC
A Dictionary of Genetic Codes
E. N. Trifonov and V. Brendel
(1) Maruyama, T., T. Gojobori, S. Aota, and T. Ikemura
(1986) Nucl. Acids Res. 14: r151-r197
Morphology of Gnomic: Ab initio approach
Axiomatically, as in this book, Gnomic is a certain written language. The nucleotide sequences, indeed, appear as texts and, most naturally, one wonders whether this analogy could be traced any further.
A first difficulty is the absence of any interruptions in the continuous nucleotide sequences. Gnomic texts are not physically divided into separate “phrases” and “words”, very much like some ancient writings without blanks between words. In the latter case one could still read the texts provided the language is known. What could one do, however, with the undeciphered Gnomic texts? In what follows, a brief account is presented of recent work (1,2) on two morphologically distinct classes of Gnomic words which can be detected in these continuous texts by simple computational means without any a priori knowledge of what these words would mean.
Contrast words
The n-string is called a contrast word of length n if
a) its derivative (n-1)-strings appear in a large sequence ensemble significantly more frequently as part of the n-string than as separate (n-1)-strings with arbitrary extensions, and
b) occurrences of its derivative (n+1)-strings are statistically indistinguishable in the large ensemble.
The contrast word, as defined, is characterized by strong internal correlations which do not extend beyond the word’s limits. As in English, were it written continuously, the string “eyon” wherever encountered would have almost exclusively “b” in front (“beyon”) and “d” at the end (“eyond”), thus actually always belonging to the longer string “beyond”. In its further one-letter extensions, however, practically every letter could be encountered (for example “...extendbeyondthe...”, “...farbeyondlimits...”, “...reachingbeyondguarded...”, etc.). The internal correlation sharply drops as soon as the word’s limits (in this case “b” and “d” of “beyond”) are exceeded. The word, thus, can be recognized by its “contrast” on the background of surrounding letters, after all occurrences of the string in question are listed and compared.
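The “beyond” illustration can be reproduced in a few lines of code. What follows is only a minimal sketch, not the procedure of the cited work: the function names and the toy text are assumptions introduced here, and the significance testing that the definition calls for is left out.

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    count, start = 0, text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def internal_cohesion(text, word):
    """For the two (n-1)-substrings of word, the fraction of their
    occurrences that lie inside an occurrence of the full word."""
    n_word = count_occurrences(text, word)
    fractions = []
    for part in (word[:-1], word[1:]):
        n_part = count_occurrences(text, part)
        fractions.append(n_word / n_part if n_part else 0.0)
    return tuple(fractions)


def flanking_letters(text, word):
    """Letters found immediately before and after each occurrence of word."""
    before, after = set(), set()
    start = text.find(word)
    while start != -1:
        if start > 0:
            before.add(text[start - 1])
        end = start + len(word)
        if end < len(text):
            after.add(text[end])
        start = text.find(word, start + 1)
    return before, after


# Toy text (hypothetical): the three example fragments joined together.
toy_text = "extendbeyondthefarbeyondlimitsreachingbeyondguarded"
cohesion = internal_cohesion(toy_text, "beyond")
before, after = flanking_letters(toy_text, "beyond")
```

In this toy text both shorter strings occur only inside “beyond”, while the letters on either side of the word vary from one occurrence to the next, which is the contrast the definition describes.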
Table 1. Vocabulary of contrast words of E. coli
of E. coli, bacteriophages lambda and T7 (1) and human
intervening sequences (2), respectively. In cases of E. coli
vocabularies. Both preferred and avoided words are listed
in all cases.
Comparison of various vocabularies brings us to the
is 3 to 5 bases; 2) the vocabularies of different genomes are partly overlapping but distinct; 3) the degree of correlation between the different vocabularies reflects the biological
or functional relatedness of the corresponding sequences;
Table 2. Contrast words of the bacteriophage lambda genome
Table 3. Contrast words of the bacteriophage T7 genome
4) the contrast word vocabularies contain many strings
whose biological meaning has already been established by
restriction sites appear as avoided contrast words in
bacteriophages, see Tables 2 and 3)
The last point is of special interest. What is
use and imply any previous knowledge of what various strings and the whole message might mean. Any human language text written in a continuous manner after such
vocabulary of words (or morphemes) of this language, irrespective of whether the operator (computer) is familiar with the language or not. The fact that some short nucleotide sequences with already known meaning are picked by the contrast word technique suggests that the other words of the Gnomic ab initio vocabularies are likely to be of biological importance as well.
Tandem words
sequences attracting attention of even least scrupulous
nucleotide sequences of virtually any length are frequently
found tandemly repeated many times, in most cases without
any obvious reason. Their functions are largely unknown, though many of the tandem repeats are speculated to be involved in very important biological activities (multiple binding sites for regulatory proteins, recombination and mutation hot spots, potential Z-DNA sites, etc.). Contrary to contrast words, not only the single repeat word but rather a whole succession of the repeats appears as one large internally correlated string. This morphological class therefore needs a separate definition.
The n-string is called a tandem word of length n if it is encountered in a large sequence ensemble with unusual frequency as immediate exact repeat.
Table 4. Contrast words of human intervening sequences
CTC(GAG) ATGAGTA(TACTCAT) GTT(AAC)
CTG(CAG) CCATAGT(ACTATGG) TAG(CTA)
GAG(CTC) CGCAGCG(CGCTGCG) TCG(CGA)
GCG(CGC) GGTGTGG(CCACACC) TGA(TCA)
GTG(CAC) GTATGTC(GACATAC) TTG(CAA)
TAT(ATA) GTGCAAC(GTTGCAC) CAGA(TCTG)
TCT(AGA) TAAGTCT(AGACTTA) GGGGG(CCCCC)
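The definition of a tandem word suggests a simple scan for immediate exact repeats. The sketch below is an illustration under stated assumptions: `tandem_runs` and its parameters are hypothetical names, the alphabet is restricted to A, C, G and T, and the test for "unusual frequency" over a large ensemble is not attempted.

```python
import re


def tandem_runs(seq, unit_len, min_copies=2):
    """Return (start, unit, copies) for each run in which a unit of
    length unit_len is immediately and exactly repeated."""
    if unit_len < 1 or min_copies < 2:
        raise ValueError("unit_len must be >= 1 and min_copies >= 2")
    # A unit followed by at least (min_copies - 1) further exact copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // unit_len)
        for m in pattern.finditer(seq)
    ]


# Hypothetical test sequence containing four copies of AC.
runs = tandem_runs("TTACACACACGG", unit_len=2)
```

The unit reported depends on where the scan first meets the run, so the same repeat may appear as AC or as CA; a lookup against a dictionary of tandem words therefore has to consider circular permutations of the unit.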
Table 5 Tandem words of human
etc.) which is also very typical of the contrast vocabulary
symmetrical sequences in the primary RNA transcripts might correspond to single-stranded regions of RNA devoid of hairpins. These sites are probably involved in complementary interactions with distant parts of the same mRNA precursor or with other molecules (2).
(1) Brendel, V., J. S. Beckmann and E. N. Trifonov,
J. Biomol. Struct. and Dyn., in press
(2) Beckmann, J. S., V. Brendel and E. N. Trifonov,
submitted for publication.
Morphology of Gnomic: Semantic approach
Are the nucleotide sequences constructed as linear texts? One cannot escape a feeling that this is largely the case, especially after certain short sequences had been found to indeed have a unique biological meaning. The GNOMIC dictionary provides about 800 such “words”, all playing a certain structural or regulatory role, experimentally established or reasonably speculated.
The entries of the dictionary are clearly divided into two classes: isolated individual strings and tandemly repeated strings. A parallel with the above ab initio definitions of contrast and tandem words is, probably, more than superficial. Indeed, the tandem strings of the dictionary are structurally identical to the ab initio tandem words. The isolated strings, on the other hand, are similar to the ab initio contrast words, though here the connection is more complicated. Were the contrast words found to have some meaning, they would then be listed in the dictionary as isolated strings. Some of the restriction sites (see above) are an example. But many of the isolated strings cannot possibly be detected as contrast words. For example, certain recognition sequences appear only a few times in a whole genome, which would exclude them from any statistical treatment.
Fig. 1. Size distribution of 555 isolated words of the Gnomic dictionary (horizontal axis: isolated word length, bases). Modified bases are not included. Every entry description is counted
Fig. 2. Size distribution of 180 tandem words of the Gnomic dictionary. Every entry is counted as a separate word.
The words of Gnomic, therefore, can be defined in at
least three different ways (isolated words of some known
There are, probably, many other structural or functional classes of the words of this language, whose characterization is the subject of further studies.
One important question is whether the entries defined
morphological units of the language or rather they are only parts (“morphemes”, “roots”) of actual words, the latter being composed of several “roots”, “prefixes”, “suffixes”, etc. At this early stage the distribution of word sizes might provide a clue. The upper length limit of the entries of our dictionary has been chosen as 25 bases for purely practical purposes. Sequences of this size carrying certain specific messages, as discussed in the sources referred to, could well be morphologically composite structures. Were these entries primarily words, i.e. basic morphological elements of a certain typical size, the distribution of sizes would have a maximum corresponding to the typical word size. This turns out to be the case. In Figs. 1 and 2 the size distributions of isolated strings and of tandem repeats, respectively, are presented. The isolated strings are most frequently of lengths of 3 to 11 bases. The sharp peaks in Fig. 1 at 3 and 6 bases correspond to codons (3 bases) and
noteworthy that the word sizes derived on the basis of the
previous section) are also most frequently within 3 to 5 bases.
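A size distribution of the kind shown in Figs. 1 and 2 is a plain tally of entry lengths. The sketch below assumes a hypothetical helper `size_distribution` and uses only a handful of isolated entries taken from the dictionary pages further on, far too few to reproduce the published figures.

```python
from collections import Counter


def size_distribution(entries):
    """Map word length (bases) to the number of entries of that length."""
    return Counter(len(entry) for entry in entries)


# A small hand-picked sample of isolated entries (codons, restriction
# sites and one longer recognition site); not the full dictionary.
sample_entries = [
    "AAT", "ACA", "ACC", "AGC",
    "ACGCGT", "ATCGAT", "ATGCAT", "AGTACT", "AATATT",
    "AACNNNNNNGTGC",
]
distribution = size_distribution(sample_entries)

# Text histogram, one row per observed length.
for length in sorted(distribution):
    print(f"{length:2d} {'#' * distribution[length]}")
```

Even in this tiny sample the codons and the six-base restriction sites stand out as the two most populated lengths.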
We thus arrive at the conclusion that comparison of the nucleotide sequences with a language is more than just a metaphor. Gnomic is a language, built of detectable words with certain meaning and with rather narrow size distribution. Morphology of Gnomic, syntax, semantics: the whole linguistics of Gnomic is on the agenda.
Consensus sequences and distributional recognition
A simple early concept of unique recognition of a given nucleotide sequence by the protein specifically interacting with a corresponding stretch of DNA or RNA fails to
restriction-modification enzymes become rather exceptions on the growing background of various sequence-wise uncertain and yet specific interactions.
The sequences involved in the interactions are very
different unique sequences of this length exceeds 4^10 or about 10^6, which is several times larger than the total number of various proteins encoded by a typical eukaryotic genome, only a small part of which are involved in the specific protein-nucleic acid interactions. Thus, the potential information contained in a given sequence of this size exceeds the needs of specific recognition. This allows a certain degree of degeneracy, and most of the specific protein-binding sites are, indeed, degenerate. The sequences corresponding to the same binding protein are frequently discouragingly different and escape any simple description.
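The arithmetic behind this estimate, and the way degeneracy enlarges the set of sequences a site description covers, can be checked directly. This is a small sketch with hypothetical function names; the degeneracy table lists the standard nucleotide ambiguity codes that appear in the entries below.

```python
# Number of bases each code stands for (standard ambiguity codes).
DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "K": 2, "M": 2, "S": 2, "W": 2,
    "N": 4,
}


def distinct_sequences(length, alphabet_size=4):
    """Number of different sequences of the given length."""
    return alphabet_size ** length


def matching_sequences(pattern):
    """Number of concrete sequences covered by a degenerate pattern."""
    total = 1
    for code in pattern:
        total *= DEGENERACY[code]
    return total


n_decamers = distinct_sequences(10)              # 4^10, about 10^6
n_eco_k = matching_sequences("AACNNNNNNGTGC")    # entry listed in the dictionary
```

A fully specified 10-base site picks out one sequence in about a million, whereas a degenerate description with six free positions already covers several thousand.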
So-called consensus sequences are one provisional way to describe the recognition sites. The individual sequences all have some resemblance to the consensus, which is derived by averaging the whole ensemble of the sites and writing down
corresponding positions along the recognition site. The consensus presentation ignores the possibility of allowing some alternative sequence elements (next strongest) in place of the consensus ones, which might be recognized as well. It also does not answer the question as to which elements of the sequence (mono-, di-, or higher oligonucleotides) are actually important for the recognition.
A more adequate description of the recognition sites would be a complete listing of all sequence elements actually found in different positions along the sequences. The ones which are most frequent are likely to be essential for the
elements can be conveniently expressed in the form of a recognition matrix. Several examples of mono- and dinucleotide recognition matrices are presented below, in Appendix I.
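A mononucleotide recognition matrix of the kind described here is a per-position tally over aligned sites, and a consensus is what remains when only the most frequent element of each position is kept. The sketch below is an assumption-laden illustration: the function names are hypothetical and the four aligned sites are made up for the example, not taken from Appendix I.

```python
from collections import Counter


def recognition_matrix(sites):
    """One Counter per position, counting the bases seen at that position."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sites must be aligned and of equal length")
    return [Counter(site[i] for site in sites) for i in range(length)]


def consensus(matrix):
    """Most frequent base at each position (alternatives are discarded)."""
    return "".join(column.most_common(1)[0][0] for column in matrix)


# Made-up aligned sites, for illustration only.
sites = ["TTGACA", "TTGACT", "TTTACA", "CTGACA"]
matrix = recognition_matrix(sites)
cons = consensus(matrix)
```

The matrix keeps the alternative elements (here C in the first position, T in the third and last) that the consensus string throws away.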
Use of alternative sequence elements in the same position can be understood as a reflection of the fact that it is not the sequence itself which is recognized by the protein but rather local details and physical properties of the molecular structure of DNA, which can be common or similar for several different sequence elements. On the other hand, the observed degeneracy of the recognized sequences implies that not all of the important elements have to be simultaneously present in a given sequence and that various different combinations (subsets) of the key elements would suffice. Thus, the load of specific recognition is distributed amongst several sequence elements actually present. This type of specific protein-nucleic acid interaction is called distributional recognition (1,2) as opposed to simple prototype unique recognition. In a sense the various sites recognized by the same macromolecule are synonyms which not only have the same meaning but also bear some structural resemblance to one another.
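One way to picture distributional recognition is to score a candidate site by how many of the frequently observed elements it carries, without requiring all of them at once. The scoring rule below is a generic frequency sum chosen for illustration, not the method of the cited papers; the matrix is a made-up example and `site_score` is a hypothetical name.

```python
# Made-up mononucleotide recognition matrix built from 4 aligned sites:
# one dict per position, base -> number of sites with that base.
N_SITES = 4
MATRIX = [
    {"T": 3, "C": 1},
    {"T": 4},
    {"G": 3, "T": 1},
    {"A": 4},
    {"C": 4},
    {"A": 3, "T": 1},
]


def site_score(site, matrix=MATRIX, n_sites=N_SITES):
    """Average, over positions, of the observed frequency of the site's base."""
    if len(site) != len(matrix):
        raise ValueError("site length must match the matrix")
    total = sum(column.get(base, 0) / n_sites
                for base, column in zip(site, matrix))
    return total / len(matrix)


best = site_score("TTGACA")      # carries the most frequent element everywhere
partial = site_score("CTTACT")   # carries only a subset of them
unrelated = site_score("GGCTTG")
```

A site carrying only a subset of the preferred elements still scores well above an unrelated sequence, which is the sense in which the load of recognition is shared among the elements present.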
47: 271—978 (1983)
(2) Brendel, V., G. H. Hamm and E. N. Trifonov,
J. Biomol. Struct. and Dyn. 3: 705-723 (1986)
Typical sequence motifs of Gnomic word structure
For a continuous and largely undeciphered language like Gnomic, knowledge of the internal structure of its words would be crucial for the very detection of the words. What are the most typical elements of which the word is constructed: mono-, di-, trinucleotides, ...? Do these elements overlap, or are they joined sequentially like the syllables in human languages? Do they have any special molecular structure? Undoubtedly, a careful study of the Gnomic word structure is needed, and we believe that the collection of words presented in this book will be of substantial help in these efforts.
One interesting illumination is provided by the analysis of the Context Index of the dictionary. Originally the index was meant to help find out whether any new interesting site has common features (parts) with the sites
pentanucleotide composition of the words is very much non-uniform. The pentanucleotide AGCAG, e.g., is found in 17 entries, while its reshuffled derivative with the same base composition, AAGGC, does not appear at all (see Context
respectively, while their permutations GTAGC, ATGGC, and
unique possibility of mapping functionally interesting regions
composition of these sequences with the vocabulary specific for the function of interest. Recurrence of certain most frequent oligomers has been suggested recently as a possible way for identification of potential regulatory sequences (1). The use of specific vocabularies rather than individual oligomers is expected to be more selective for the mapping purposes. As an illustration of the potential power of the linguistic mapping technique we compared the trinucleotide composition of various parts of the human beta-globin gene
Fig. 3. Linguistic map of intervening sequences
trinucleotides are taken for the mapping, with their respective contrast values. Both IVS1 and IVS2 are characterized by positive correlation
vocabulary of the introns
intervening sequences (Table 4, ab initio morphology section)
Fig. 3 presents the resulting mapping function. As one would expect, the introns show strong positive correlation
the mapping results are not decisive. This does not keep us, however, from the conclusion that the linguistic mapping technique is potentially a very powerful sequence diagnostic tool.
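The idea of linguistic mapping can be sketched as a sliding window in which the trinucleotides of a sequence are looked up in a vocabulary and their contrast values summed. Everything specific in the sketch is an assumption: the window and step sizes, the function name, and above all the two contrast values, which are placeholders rather than values from Table 4.

```python
def linguistic_map(seq, vocabulary, window=30, step=10):
    """Return (window_start, score) pairs; the score is the sum of the
    vocabulary values of all overlapping trinucleotides in the window."""
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        score = sum(vocabulary.get(chunk[i:i + 3], 0.0)
                    for i in range(len(chunk) - 2))
        scores.append((start, score))
    return scores


# Placeholder vocabulary: one word weighted as preferred, one as avoided.
toy_vocabulary = {"CTG": 1.0, "TAG": -1.0}
toy_sequence = "CTGCTGCTGC" + "TAGTAGTAGT"
profile = linguistic_map(toy_sequence, toy_vocabulary, window=10, step=10)
```

Regions rich in preferred words of the chosen vocabulary give positive values and regions rich in avoided words give negative ones, so the profile plays the role of the mapping function plotted in Fig. 3.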
(2) Lawn, R M., A Efstratiadis, C O’Connell and
Leafing through the dictionary
Having the privilege of being the first users of the Gnomic dictionary, in this section we would like to share with the reader a few interesting observations we came across.
One rather unexpected thing is an apparent clustering of the CG dinucleotides in the entries of the dictionary. Indeed, the pentanucleotides containing CG are rather avoided. The pentamers of the structure CGNNN, NCGNN, NNCGN and NNNCG appear on average 2.6 times, while the average occurrence for all 1024 possible pentamers is close to 4.0. There are a few exceptions, like CCGCC which is found 16 times (see Context index) or GGCGG (11 times), but in general CG-pentamers are underrepresented. One
NCGCG, CGCGN) should be even more scarce. This is not
pentamers is 3.9, some being especially frequent (CGCGC, 7 times; GCGCG, 6 times) and all being encountered at least once. Thus, the CG dinucleotides have an apparent tendency to cluster. Interestingly, this is also seen in the contrast word vocabulary of the intervening sequences, where several CG-clusters appear as preferred words of the introns (see Morphology, ab initio approach). There must be some
dinucleotides advantageous for molecular recognition despite the general avoidance of the CG dinucleotides.
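The pentamer classes compared in this passage are easy to enumerate. The sketch below lists all 1024 pentamers, picks out those containing CG and those containing the CGCG cluster, and provides a helper for averaging occurrence counts; the helper name is hypothetical and the only counts used are the two quoted in the text.

```python
from itertools import product

# All possible pentanucleotides over A, C, G, T.
pentamers = ["".join(p) for p in product("ACGT", repeat=5)]

# Pentamers of the structure CGNNN, NCGNN, NNCGN or NNNCG.
with_cg = [p for p in pentamers if "CG" in p]

# Pentamers of the structure CGCGN or NCGCG (a CG cluster).
with_cgcg = [p for p in pentamers if "CGCG" in p]


def mean_occurrence(counts, words):
    """Average number of occurrences per word; absent words count as 0."""
    return sum(counts.get(word, 0) for word in words) / len(words)


# Only the two counts quoted in the text; the full index is not reproduced.
quoted_counts = {"CGCGC": 7, "GCGCG": 6}
mean_quoted = mean_occurrence(quoted_counts, ["CGCGC", "GCGCG"])
```

Applied to the full Context index, `mean_occurrence` over `with_cg` and over `with_cgcg` would give the two averages that the passage contrasts.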
Another striking feature of the Gnomic words is the frequent appearance of the same oligonucleotides in apparently unrelated entries. The 11-base-long sequence GTGACAAATTA, for example, is shared by the regulatory sites B of soybean nodulin and leghemoglobin genes, and by repeats in the origin of replication and in the incC region of plasmid F. Another 11-base-long sequence TGAATATTNAT
Drosophila topoisomerase II recognition sites. The sequence AATATAATA (10 bases) is shared by certain repeats in yeast mitochondria, tandem repeats in Leishmania tarentolae kinetoplasts, cores of autonomously replicating sequences of yeast, satellite DNA of Drosophila, and by repeats in 5’-regions of Dictyostelium genes. There are 150 isolated words of length 10 or more bases in the Gnomic dictionary. They consist of about 900 decamers, which is less than 1/1000 of the total number of various possible decanucleotides. Were these words a sample of randomly generated sequences, to find even one example of identical decamers in this sample would be a very improbable event. The Gnomic words, once again, appear to be composed of only a limited spectrum of oligonucleotides, which is probably dictated by structural requirements and limitations inherent to the molecular recognition processes. One cannot help turning again to the analogy with human speech, where also only a small subset of all possible combinations of letters (basically, alternations of vowels and consonants) is compatible with limitations of human speech organs.
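For a sense of scale, the counts quoted here can be turned into rough numbers. The sketch assumes a uniform random model in which all decamers are equally likely and independent, and uses a Poisson approximation for coincidences; both are assumptions introduced for illustration and are not part of the source.

```python
import math

total_decamers = 4 ** 10      # all possible decanucleotides
observed_decamers = 900       # approximate count quoted in the text

# Share of decamer space covered by the dictionary words.
fraction = observed_decamers / total_decamers

# Expected number of coinciding pairs among independent random decamers,
# and the Poisson estimate of seeing at least one coincidence.
expected_pairs = math.comb(observed_decamers, 2) / total_decamers
p_at_least_one = 1 - math.exp(-expected_pairs)

print(f"fraction of decamer space: {fraction:.5f}")
print(f"expected coinciding pairs: {expected_pairs:.3f}")
print(f"P(at least one coincidence): {p_at_least_one:.2f}")
```

Under this simple model a single chance coincidence between decamers is not negligible, so the weight of the observation rests on how many strings are shared and on their length, as with the 11-base examples shared across several unrelated entries.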
Guide to the use of the Gnomic dictionary
dictionary per se and Context index
Gnomic dictionary includes 780 primary entries (bold face) with descriptions and references. About a quarter of the primary entries are tandem repeats presented in the same form as in the original source. The dictionary also contains about 1300 secondary entries (bold face, smaller type) which are circular permutations of tandem words and their
directly whether any new tandem word has already been discussed in the literature, in the same or in derivative form. For non-repetitive “words” one has to consult the dictionary twice, taking both original sequence and its
In case the word of interest is not found in the dictionary, one might be interested in whether some related sequences appear in the dictionary. This purpose is served by the Context index. The index lists all entries in which any given pentanucleotide is present as a part. The new
Context index. This part of the dictionary is especially useful, since it delivers the user from the necessity of time consuming and tiresome search in literature.
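The lookups described in this guide can be mimicked in code. The sketch below is illustrative only: all function names are hypothetical, the complement table covers just the ambiguity codes seen in the entries reproduced here, and the three-entry index is a toy stand-in for the printed Context index.

```python
from collections import defaultdict

# Complements for A, C, G, T and the ambiguity codes R, Y, K, M, S, W, N
# (a subset; B, D, H and V are left out of this sketch).
_COMPLEMENT = str.maketrans("ACGTRYKMSWN", "TGCAYRMKSWN")


def reverse_complement(seq):
    """Complementary strand, read in the same 5'-to-3' sense."""
    return seq.translate(_COMPLEMENT)[::-1]


def circular_permutations(unit):
    """All rotations of a tandem repeat unit."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def tandem_lookup_keys(unit):
    """Forms under which a tandem word might be listed: rotations of the
    unit and rotations of its complementary unit."""
    return (circular_permutations(unit)
            | circular_permutations(reverse_complement(unit)))


def build_context_index(entries, k=5):
    """Map every k-nucleotide to the set of entries that contain it."""
    index = defaultdict(set)
    for entry in entries:
        for i in range(len(entry) - k + 1):
            index[entry[i:i + k]].add(entry)
    return index


# Toy index built from three strings that appear in this book.
index = build_context_index(["GTGACAAATTA", "ACAAATT", "AATAGG"])
```

For a non-repetitive word the two lookups are the word itself and `reverse_complement(word)`; for a tandem word every key from `tandem_lookup_keys` is a candidate heading.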
The bibliography includes 434 references which appear first in the entry descriptions in an abbreviated form sufficient for locating the source in the user’s library. Full references are provided as well, in the References section, listed alphabetically according to the titles of the journals, in chronological order. Every bibliographic description is provided with a return reference to the primary entry(ies), of which only the first five letters are indicated for brevity. This allows location of every source in the Gnomic dictionary. Similarly, every author can be located, provided her (his) list of publications is available. The bibliography should not be considered as comprehensive and credit-wise balanced. Our primary concern has been to provide the most informative sources. Priority is given, therefore, to reviews and recent summarizing work where references to earlier contributions can be found. We tried our best, however, to provide the original sources as well. Should readers find any more informative reference(s), we would be most grateful to
Res 13: 3021—3030, 1985)
GNOMIC DICTIONARY
A 1) adenine, first letter of the gnomic alphabet; appears as deoxyadenosine in DNA and as adenosine in RNA
2) site of attack and scission by
neocarzinostatin
PNAS 75, 3603 PNAS 75, 3608
=
A 1) N6-methyladenosine, M6A;
found in nuclear DNAs of several groups of lower eukaryotes and in
JP 27, 479
MR 44, 175 NAR 11, 5131
2) N6-isopentyladenosine, I6A;
modified base occurring in
position 37 of tRNAs
NAR 10-2, r1 PNARMB 31, 59
3) 1-methyladenosine, M1A; minor base of tRNAs, specific for gram-positive bacteria, in position 22
NAR 10-2, r1 PNARMB 31, 59
4) 2-methyladenosine, M2A; minor
tracks are frequent in the human
3’-ends of the human Alu interspersed repeats
JMB 180, 753 ARB 51, 813
3) n=17; found at the 3’-end of
the human beta-tubulin pseudogene
Nat 297, 83
4) n=16; found at the 3’-end of
the human metallothionein II B pseudogene
Nat 299, 797
5) n= 44 to 71; found at the
3’-ends of dispersed repeat
6) n=22; found at the 3’-end of
the rat repetitive element R.dre.1
Nat 300, 330
7) n= 20 to 55; found upstream
from the promoter of the yeast
alcohol dehydrogenase gene
Nat 304, 652
8) n= 50 to 200; post-transcriptionally added to the 3’-ends of most eukaryotic mRNAs
CSHSQB 35, 743
ARB 42, 329
rupted only once by G, found upstream of the topoisomerase gene
inherently curved DNA, in
particular nucleosomal DNA
quadruplet recognized by the
frameshift suppressor sufG
TRO 19B 37
AAAAAAAGACTTAGAAAAAA consensus sequence of repeat
AAAACAAGCAGGAGRGGCT sequence flanking the Alu family repeat located downstream from a human insulin gene
X laevis tRNA genes
Cell 23, 251
AAAATTYT consensus of a presumed polyadenylation signal in Trypanosoma
Cell 29, 291 Cell 37, 333
ably recognized by DNA invertases EMBO 4, 237
(AAACCCCA)n
s (CCCCAAAA)n
AAACCGGNTC
consensus sequence of the
African green monkey decasatellite; the core CCGG is strongly conserved JMB 157, 195
receptor protein) recognition site;
the core sequence GTGA seems to be
most important for recognition
PNAS 76, 5090
s (CAAAGTTTGAGTTTT)n
AAARNNAGA
downstream part of a bipartite consensus sequence found in the 3’-flanking regions of snRNA genes
consensus sequence of the -35
region of promoters recognized by
RNA-polymerase of Bacillus subtilis
with sigma factor sigma-32
Nat 302, 800
(AAATT)n
of the Leishmania tarentolae
kinetoplast maxicircle DNA
AACGTACTAAGCTCTCATGTTT
22 bp sequence, 7 times tandemly repeated (with small variations) in the origin of replication of plasmid R6K
PNAS 76, 1150
AACNNNNNNGTGC site recognized by the type I
restriction endonuclease /
methylase EcoK Gene 33, 1
Salmonella potsdam); the second A
of the entry and the first A of the complementary sequence are methylated
JMB 182, 579
AACNNNNNNRTAYG
a hybrid consensus sequence recognized by recombinant type I
sequence contains recognition sites for the enzymes SP (AAC) and
1) triplet appearing in the
universal genetic code as one of
two codons that specify lysine
2) site recognized by the
specific methylase BbvS III
Gene 33, 1
(AAG)n
motif in the MS satellite DNA of
sequence present near the 3’-ends of yeast mitochondrial genes; signal site for mRNA processing; cleavage occurs a few bases
repeating motifs in the 1.705 satellite DNA of Drosophila melanogaster
sequence responsible for
recognition and uptake of DNA during transformation of competent Haemophilus cells
PNAS 79, 2393
consensus sequence of a tentative transcription initiation signal in Halobacterium halobium
consensus sequence, presumably
regulatory, which occurs about positions -60 to -30 upstream of
the start methionine of five yeast
encoding glycolysis functions
JBC 258, 5291
AANNNNTTGAT consensus sequence of binding
site for E coli integration
host factor (IHF)
Cell 41, 489
AAT triplet appearing in the universal genetic code as one of two codons that specify asparagine
repeated with minor variations in the Leishmania tarentolae kinetoplast maxicircle DNA divergent
mitochondrial protein-coding genes;
presumably mRNA processing site
MGG 196, 266
(AATAATCACGCAGCTCAGCAGGC)n
s (AATCACGCAGCTCAGCAGGCAAT)n
sequence frequently found 5 to
15 bases upstream from coding regions of Dictyostelium actin genes
JMB 183, 311
(AATAGG)n
s (GAATAG)n
(AATAT)n
repeating motifs in the 1.672 satellite DNA of Drosophila melanogaster
JMB 135, 565
(AATATAAT)n
s (AATAATAT)n
sequence frequently found 5 to
15 bases upstream from coding regions of Dictyostelium actin genes
JMB 183, 311
(AATATAG)n
nucleotide of satellite Ic DNA of Drosophila virilis
Cell 17, 615
(AATATAT)n
repeating motifs in the 1.672 satellite DNA of Drosophila melanogaster
(AATATGGGAATAACTTGTTTT)n
s (GGGAATAACTTGTTTTAATAT)n
AATATT site recognized by the type II
1) n=12; sequence located
upstream from the human beta-globin gene cluster
NAR 10, 7809
2) n=21; repeat located in the
3’-nontranscribed spacer of rat ribosomal DNA
JMB 184, 389
3) n=29; repeat located in the promoter region of the mouse H-2K b gene (class I transplantation antigen)
Cell 44, 261
4) n up to 51; repeat found in several clones of Drosophila virilis DNA
JMB 172, 229
5) s (CA)n
6) s (GT)n
ACA triplet appearing in the universal genetic code as one of four codons that specify threonine
(ACA)n
s (CAA)n
ACAAAAACC consensus nonamer signal for
in Drosophila virilis and
D americana satellite DNAs
JMB 85, 633
(ACAAATT)n
in Drosophila virilis and
D americana satellite IT DNA
of the HS-beta satellite DNA of
motif is (ACACAGCGGG)n, n uncertain (second ref.)
FProc 35, 23 PNAS 70, 2642
(ACACCCCTGTCCCC)n
s (CCCCACACCCCTGT)n
(ACAG)n 1) s (AGAC)n 2) s (CTGT)n
(ACAGGGGTGTGGGG)n
almost perfect tandem repeat in the
5’-flanking region of the human insulin gene
Nat 295, 31
(ACAGTGGGGAGGGG)n n=38; sequence tandemly repeated with minor variations in intron 1 of the human zeta-globin gene
PNAS 82, 493
ACC triplet appearing in the universal genetic code as one of four codons that specify threonine
ACCA one of three quadruplets recognized by the frameshift suppressor sufJ; the other quadruplets are ACCC and ACCU
by single base substitutions (see also CCRCCATGG)
Cell 44, 283
(ACCATTCCCCCAGCAACCACAACA)n
s (CCACAACAACCATTCCCCCAGCAA)n
ACCC one of three quadruplets recognized by the frameshift suppressor sufJ; the other quadruplets are ACCA and ACCU Cell 25, 489
(ACCCACCCACTCCCC)n
s (CCACTCCCCACCCAC)n
ACCCATACATTT conserved sequence element present in intergenic regions of most yeast ribosomal protein genes; presumably a regulatory signal active in either orientation
NAR 13, 701
(ACCCCA)n
s (CCCCAA)n
(ACCCCAAA)n
sequence motif repeated with
small variations 18 times at the
ends of the En-1 transposon of Zea
ACCNNNNNNGGT site recognized by the type II
core of a consensus sequence
in the 5’-flanking regions of alpha -interferon induced genes in human Nat 314, 637
ACCTGAAGCAAOCTGAGCC),,
s GCCACCTGAAGCAACTGA),,
ACCTGC site recognized by the type II
restriction endonuclease /
ref M3
ACCU one of three quadruplets recognized by the frameshift suppressor sufJ; the other quadruplets are ACCA and ACCC
Cell 25, 489
ACG triplet appearing in the universal genetic code as one of four codons that specify threonine
s (CGCGACAAAAACGACGCGACG)n
(ACGCGGGGGCGGAGGAGGGGGGG)n
s (GGGCGGAGGAGGGGGGGACGCGG)n
(ACGCGGGTGCAGGAGGGG)n
s (GCAGGAGGGGACGCGGCT)n
ACGCGT site recognized by the type II
Gene 33, 1 NAR 13, r165
(ACGTGACTGAKCAKGCACTGATC)n
s (GATCACGTGACTGAKCAKGCACT)n
(ACGTGATCAGTGCMTGMTCAGTC)n
s (AGTGCMTGMTCAGTCACGTGATC)n
ACNNN
RNA-primer of bacteriophage T4
CSHSQB 43, 469
transcripts
ARB 50, 349
(AG)n
1) n=10; repeat located in the polymorphic region of the human factor IX gene
NAR 12, 8861
bases downstream from the transcription initiation site of the rat cytochrome P-450c gene
JBC 260, 5026
the mouse immunoglobulin G3 constant region gene
EMBO 3, 2041
4) s (GA)n
5) s (CT)n
AGA 1) triplet appearing in the universal genetic code as one of six codons that specify arginine
2) codon for serine (rather than
arginine) in mitochondrial DNA of Drosophila
motif in the polypurine transcripts
and in the 1.705 (IV) satellite DNA
from Drosophila melanogaster
Cell 5, 183 JMB 114, 441 JMB 135, 581
JMB 184, 389
2) s (ACAG)n
3) s (GACA)n
AGACC
site recognized by the
type III restriction endonuclease /
sequence repeated three times in
the region of the polyoma virus
genome with high affinity for the
polyoma large T-antigen
Gene 33, 1 NAR 13, r165
(AGATGCGGGAGCAGGAGGAGA)n
s (GGAGGAGAAGATGCGGGAGCA)n
AGC triplet appearing in the universal genetic code as one of six codons that specify serine
of the murine interleukin-3 gene
one of the most conserved
sequences within 49 base pair
tandem repeats in switch (S)
regions of mouse immunoglobulin
heavy chain genes
(AGG)n 1) n=15; repeat located in the 3’-nontranscribed spacer of rat ribosomal DNA
1) n=22; tandem repeat in mouse
immunoglobulin gene recombination
regions; see also (TGAGC)n
Gene 33, 1 NAR 13, r165
(AGGG)n 1) n=10; repeat found in one of the macrovariants of crab satellite DNA
CSHSQB 47, 1151
2) s (CCCT)n
RNA-polymerase of Bacillus subtilis
with sigma factor sigma-37
AGR
triplet used as stop codon
(rather than as arginine codon) in
mitochondrial DNA of Xenopus laevis
IRC 93, 93
(AGRGYTTAGTNCGTNARNCATG)n
s (ARNCATGAGRGYTTAGTNCGTN)n
AGT triplet appearing in the universal genetic code as one of six codons that specify serine
AGTACT site recognized by the type II
sequence of 3’-ends of known mRNAs
of the human respiratory syncytial
ARACTTAGARAAAWWW consensus sequence of binding sites for topoisomerase I in
NAR 13, 1543 Cell 41, 541
(ARC)n
s (CAR)n
ARCATG conserved sequence found in group If mitochondrial introns JMB 184, 353
region of replication initiation of
plasmid R6K; binding site(s) for
the replication initiator protein
Cell 34, 125
(AT)n
of crab C borealis light satellite
DNA
JBC 237, 1961
2) s (TA)n
ATA
1) triplet appearing in the
universal genetic code as one of
three codons that specify
isoleucine
2) initiation codon for some of
the maxicircle genes of
initiator codon in mitochondrial
DNA of mammals and of Drosophila;
the codon is read as N-formyl-
function as initiator codon in
place of ATG in mitochondrial DNA
in Drosophila virilis and
D americana satellite DNAs
JMB 85, 633 2) s CTATAAA
T glabrata EMBO 4, 465
PNAS 81, 7156
(ATATA)n
s (AATAT)n
sequence several copies of which are found in the region of the yeast mitochondrial genome responsible for autonomous replication
genome
two tandem clusters in the Leishmania tarentolae kinetoplast maxicircle DNA divergent region
NAR 13, 3241
2) triplet which can function as
initiator codon in mammalian mitochondrial DNA, being read as N-formyl-methionine
ATCCCTCAAACGRGGGAW consensus sequence of 4 sites found within the ori region of bacteriophage lambda, presumably binding sites for protein O, responsible for initiation of replication
CSHSQB 43, 155
NAR 9, 1789
ATCGAT site recognized by the type II
restriction endonuclease / methylase Cla I (prototype) and
their isoschizomers Gene 33, 1
initiates polypeptide chains with methionine, both in eukaryotes and
in the 5’-untranslated regions of the dlecl and dlec2 lectin genes
of Phaseolus vulgaris EMBO 4, 883
ATGCAAATNA
consensus sequence found upstream of all heavy chain variable region VH mouse immunoglobulin genes; see also
complementary sequence TNATTTGCAT
Nat 310, 71
ATGCAT
site recognized by the type II restriction endonuclease / methylase Ava III (prototype) and their isoschizomers
Gene 33, 1 NAR 13, r165
ATGCATGC
site, found twice in the SV40
1) triplet appearing in the
universal genetic code as one of
three codons that specify
isoleucine
2) rarely used as initiation
codon instead of the standard ATG
EMBO 1, 311
3) triplet which can function as
initiator codon in mitochondrial DNA of mammals and Drosophila, being read as N-formyl-methionine
s (ATATTATT)n
s (ATTGCCTGCTGAGCTGCGTGATT)n
(ATTC)n 1) n=15; repeating motif in the 3’-untranslated region of the murine anion exchange protein gene
Nat 316, 234
2) s (TCAT)n
ArrcAe M/
consensus of hexanucleotides
found in the 5’-untranslated regions of silkworm chorion genes PNAS 82, 6035
ATTCATAC major homology between tentative transcription terminators
of 5S rRNA and 7S RNA genes of Halobacterium halobium
s (CTATGTTATT)n
ATTCTTA sequence proposed as possible part of a signal necessary for RNA processing in mitochondria
JBC 258, 14065
ATTCTTTTT consensus sequence R2 located
in intergenic regions of the Sendai
sequence motif common for
central single-stranded parts of U-RNAs
EMBO 1, 1259
AWGTGACTC
consensus sequence, presumably
regulatory, located upstream from
the 5’ start methionine of yeast
genes regulated by general amino
frequently repeated up to four
times, nontandemly
JBC 258, 5238
MCB 4, 1326
AWTGCTTY
conserved sequence in the -30
to -25 region of phage T4 promoters
cytosine, one of four major
letters of the gnomic alphabet;
appears as deoxycytidine in DNA
and as cytidine in RNA
*
C
tically methylated cytosine present
in DNA of a wide variety of ,
found in the context C G
JBC 175, 315
Sci 210, 604
base of tRNAs NAR 10-2, r1
3) n=15 and 21; sequences located
in the 3’-untranslated region of the chicken myosin light chain 2 gene
JMB 181, 411
4) n=19; found at the ends of
the macronuclear DNA of Oxytricha
of retrovirus long terminal repeats Cell 27, 1
(CA)n
stream from the polyadenylation signal of the sea urchin
P miliaris histone H2A gene
PNAS 82, 1094
2) n=15, 21, and 22; repeats
located in the interspersed
repetitive elements of cytomegalovirus DNA
MCB 3, 1389
scription initiation site of the rat cytochrome P-450c gene
JBC 260, 5026
4) n=20; repeat located in the
first intron of the translocated murine c-myc gene
NAR 12, 8987
second intron of the C-delta gene
of the murine immunoglobulin mu-delta heavy chain region
Nat 306, 483
downstream from the 3’-end of coding sequences of mouse immunoglobulin kappa variable region genes
JBC 255, 3691
CA
the C-mu secreted and membrane regions of murine immunoglobulin mu-delta heavy chain genes
Nat 306, 483
8) s (AC)n 9) s (TG)n
CAA triplet appearing in the universal genetic code as one of two codons that specify glutamine
found at the left end of the AT-rich spacer of 5S DNA of X laevis
protein of bacteriophage lambda W2
CAAGAAAGA
1) homology block at the 3’-ends
of histone genes
Cell 25, 301
2) conserved motif near the 3’-
ends of sea urchin histone genes;
responsible for formation of 3’-termini of the sea urchin histone H3 mRNA
yeast genes
Cell 33, 607
CAARCA site recognized by the type II
restriction endonuclease /
methylase Tth111 II (prototype) and their isoschizomers
Gene 33, 1 NAR 13, r165
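Entries such as CAARCA or CCRCCATGG use degenerate letters (R for A or G, Y for C or T, N for any base), so locating them in a sequence requires expanding each letter into the set of bases it stands for. The sketch below is one way to do that with regular expressions, assuming the standard IUPAC code table; `find_sites` and the test string `demo` are hypothetical names and data introduced for illustration only, and only the strand as written is searched.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes (assumed notation).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(pattern, text):
    """Return 0-based start positions of all matches, including overlapping ones."""
    body = "".join(IUPAC[c] for c in pattern.upper())
    # A lookahead lets the regex engine report overlapping occurrences.
    regex = re.compile("(?=(" + body + "))")
    return [m.start() for m in regex.finditer(text.upper())]


demo = "TTCAAACAGGCAAGCATT"  # made-up string, not a real gene
print(find_sites("CAARCA", demo))  # [2, 10]
```

To cover both strands, the same search can be repeated with the reverse complement of the pattern.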
sequence
NAR 13, 1369
CAC 1) triplet appearing in the universal genetic code as one of two codons that specify histidine 2) site recognized by the specific methylase Hind I
Nat 280, 288 Nat 280, 370 Nat 302, 575
endonuclease / methylase Taq II
(prototype) and their isoschizomers (the other site is GACCGA)
sequence element located 150 -
200 bases upstream of Adh genes of
maize, possibly with regulatory
restriction endonuclease / methylase Dra III (prototype) and
their isoschizomers
Gene 33, 1
NAR 13, r165 NAR 13, 1517
(CAG)n
1) n=9; repeat found in one of Drosophila virilis DNA clones JMB 172, 229
2) n=12; found within the coding
region of the murine interleukin-2 gene
Nat 313, 402 NAR 12, 9323
putative 5’-end of human
repeated several times within the insulin gene and considered to be unique to the human insulin gene Nat 295, 31
(CAGCCCTCCCCGGCCC)n
s (CCCTCCCCGGCCCCAG)n
CAGCT one of the most conserved
sequences within 49 base pair
tandem repeats in switch (S)
regions of mouse immunoglobulin
heavy chain genes
Cell 23, 357
CAGCT),, 1) s ACCCC{AGCTC),, 2) s, GCTCA),
their isoschizomers Gene 33, 1
NAR 13, r165
sequence of satellite II DNA of
hermit crab Pagurus pollicaris
methylase PmaC I; cleavage
occurs within the GG dinucleotide
Nat 312, 616
CAT triplet appearing in the
universal genetic code as one of two codons that specify histidine
(CAT)n
s (TCA)n
(CATAGAATAA)n
region of promoters recognized by
CAT
RNA-polymerase of Bacillus subtilis with sigma factor sigma-29 ref, L1
Gene 33, 1 NAR 13, r165
downstream of the cap site in the
5’-untranslated regions of most Dictyostelium mRNA’s MCB 5, 1465
(CATT)n
four codons that specify proline
2) universal 3’-end of all
from mRNA capping sites of
mammalian beta-like globin genes
-80 region of eukaryotic promoters
Cell 21, 653
CCACAACAACCATTCCOCCAGCAA),
sequence of 24 base tandem repeats
in protein coding regions of barley B- and C-hordein cDNA
CCACTGTCCOCCTCC) in
s (CCCCTCCCCACTGT)n
(CCAG)n
Gene 33, 1 NAR 13, r165
CCARCA
site recognized by the type II
restriction endonuclease / methylase Tth111 II (prototype) and
their isoschizomers Gene 33, 1
universal genetic code as one of
four codons that specify proline 2) sequence within the promoter
of the human mitochondrial gene
one of two quadruplets
recognized by the frameshift
suppressor sufA; the other
quadruplet is CCCU
Sci 175, 650
(CCCCAA)n
1) n= 20 to 70; tandemly
repeated sequence at the termini of
the extrachromosomal rRNA genes in
Tetrahymena
JMB 120, 33
5’-termini of macronuclear DNA of the holotrichous ciliate Glaucoma chattoni
the sequence is involved in an imperfect complementarity contact with another conserved sequence of viroids, CCGGTGG
ala repeats in the bovine beta-
JMB 184, 389
2) s (AGGG)n
3) s (TCCC)n
(CCCTAA)n
motif of guinea-pig alpha- satellite DNA
Sci 175, 650
(CCCYAGCTCTCACCT)n
s (AGCTCTCACCTCCCY)n
CCG triplet appearing in the universal genetic code as one of four codons that specify proline
consensus sequence of the -10
region of promoters recognized by
RNA-polymerase of Bacillus subtilis
with sigma factor sigma-28
NAR 9, 5991
cccc Nj
sequence frequently found in
the promoter region of histone H4
core sequence of the promoter
of the human mitochondrial LSP
gene, most sensitive to point
(CCGCCCCCGCGTCCCCCCCTCCT)n
s (CCGCGTCCCCCCCTCCTCCGCCC)n
(CCGCCCCTCGCCCCCTC)n n=19; sequence located in the joint region of the herpes simplex virus genome
JGV 55, 315
sequence common to 5’-noncoding regions of chicken actin genes; the sequence is repeated several times, both in tandem and separately
NAR 13, 1223
CCGCGG site recognized by the type II
methylase: Hpa II (prototype) and their isoschizomers
Gene 33, 1 NAR 13, r165 2) s AAACCGGNTC
the sequence is involved in an
imperfect complementarity contact
with another conserved sequence of viroids, CCCCGGGG
Gene 33, 1
NAR 13, r165
CCRCCATGG consensus sequence for
eukaryotic translation initiation
triplet NAR 12, 857
presumed recognition site for the PPR1 regulatory gene product, located 35 bases downstream from TATA-boxes of yeast URA1 and URA3
JMB 185, 65
CCSGG site recognized by the type II
CCT
triplet appearing in the
universal genetic code as one of
four codons that specify proline
(CCT)n
1) n=15; repeat located in one of
macrovariants of crab satellite DNA
of small ribosomal RNA;
complementary to (Shine-Dalgarno) sequences upstream from translation start sites in mRNA
s (TAGCTTGCCCCTGCTCCTTC)n
(CCTGCTGAGCTGCGTGATTATTG)n
s (ATTGCCTGCTGAGCTGCGTGATT)n
(CCTGTCCCCACACC)n
s (CCCCACACCCCTGT)n
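Many tandem-repeat entries are cross-referenced to the same repeat written in a different phase: a run of CCTGTCCCCACACC units is indistinguishable from a run of CCCCACACCCCTGT units apart from where the reading starts. A minimal sketch of that equivalence test follows; the helper names `is_rotation` and `canonical_unit` are illustrative assumptions, not terms from the dictionary.

```python
def is_rotation(a, b):
    """True if b is a cyclic rotation of a (same repeat unit, different phase)."""
    return len(a) == len(b) and b in a + a


def canonical_unit(unit):
    """Pick the alphabetically first rotation as a phase-independent key."""
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


print(is_rotation("CCTGTCCCCACACC", "CCCCACACCCCTGT"))  # True
print(is_rotation("CTAACC", "CCCTAA"))                  # True
print(canonical_unit("CTAACC"), canonical_unit("CCCTAA"))  # AACCCT AACCCT
```

Indexing repeats by `canonical_unit` would group all phase variants of one repeat under a single key. Cross-references to the opposite strand, such as CA and TG, are not rotations and need a reverse-complement step in addition.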
CCTNAGG site recognized by the type II restriction endonuclease / methylase Sau I (prototype) and their isoschizomers
restriction endonuclease / methylase EcoR II (prototype) and
their isoschizomers Gene 33, 1
clustered in intergenic regions of
involved in the control of their
expression
Nat 314, 467
2) this dinucleotide is
clustered in restriction enzyme
Hpa II tiny fragments (HTF)
derived from nonmethylated HTF
islands of chicken, mouse, and
Gene 33, 1
NAR 13, r165
CGC triplet appearing in the
universal genetic code as one of
six codons that specify arginine
(CGCAO)n n=5; repeat located in one of the macrovariants of crab satellite DNA
methylase FnuD II (prototype) and
their isoschizomers Gene 33, 1
PNAS 78, 7047
CGCTCTTA box A sequence, presumably involved in the lambdoid nut antitermination system
FL
CGG triplet appearing in the universal genetic code as one of six codons that specify arginine
consensus of binding sites for
the GAL4 protein, a positive
regulatory protein of yeast
n=2-5; tandemly repeating runs
located in intron 2 of the zeta and
pseudo-zeta human globin genes
s (GGGGGAGGAGCG)n
CGGGS
consensus sequence of E coli
terminators, located about 10 bases upstream from the termination point
s (GTGCG)n
CGGWCCG site recognized by the type II
protein binding sites upstream from RNA polymerase III promoters in
yeast
PNAS 82, 43
CGT triplet appearing in the universal genetic code as one of six codons that specify arginine
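The statements of the form "one of six codons that specify arginine" follow directly from the standard codon table. As a hedged illustration, the sketch below rebuilds that table from the usual compact 64-letter string (codons enumerated in T, C, A, G order at each position) and counts synonymous codons. It covers only the standard (universal) code; the mitochondrial and other variant readings listed in this dictionary are not modelled. The names `CODON_TABLE`, `DEGENERACY`, and `synonyms` are assumptions made for this example.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, one letter per codon, with codons
# enumerated in TCAG order at each position (TTT, TTC, TTA, TTG, TCT, ...).
# "*" marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons assigned to each amino acid (or to stop).
DEGENERACY = Counter(CODON_TABLE.values())


def synonyms(codon):
    """Return all codons that specify the same amino acid as the given codon."""
    aa = CODON_TABLE[codon.upper()]
    return sorted(c for c, a in CODON_TABLE.items() if a == aa)


print(CODON_TABLE["CGT"], synonyms("CGT"))
# R ['AGA', 'AGG', 'CGA', 'CGC', 'CGG', 'CGT']
print(DEGENERACY["L"], DEGENERACY["P"], DEGENERACY["H"], DEGENERACY["Q"])
# 6 4 2 2
```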
the 5’-noncoding region of oocyte-specific 5S rRNA genes of Xenopus borealis
CGT
consensus sequence of the -35
region of promoters recognized by
RNA-polymerase of Bacillus
subtilis with phage SPO1 specific
sigma factor sigma-gp33-34
ref P2
CGTTTGCCCA
consensus sequence of a
frequent 10 base pair repeat in
Drosophila
Nat 297, 201
CGTTTGCCCACCCTTTAAAA
consensus sequence of a
frequent 20 base pair repeat in
the foldback transposon FB4 of
mouse cDNA PNAS 80, 3391
2) n up to 18; repeat sequence
found in clones of Drosophila virilis DNA
JMB 172, 229 3) n= 2 to 20; repeats found in the heterogeneity region of the human ribosomal spacer
JMB 183, 213
sequence in the 3’-flanking regions
of human U1 RNA genes PNAS 81, 7288
downstream from the polyadenylation signal of the sea urchin P miliaris histone H2A gene
PNAS 82, 1094
region between the H2A and H1 genes
of sea urchin S purpuratus Cell 15, 1033
aberrantly rearranged J segments of the mouse immunoglobulin kappa light chain gene
Nat 302, 260
bases downstream from the transcription initiation site of the rat cytochrome P-450c gene
second intron of the murine C-delta
gene of the immunoglobulin mu-delta heavy chain region
Nat 306, 483
spacer region upstream from the mouse kallikrein gene
Nat 303, 300
11) s (AG)n 12) s (TC)n
CTA triplet appearing in the universal genetic code as one of six codons that specify leucine
CTAAA
consensus sequence of the -35 region of promoters recognized by RNA-polymerase of Bacillus subtilis with sigma factor sigma-28
NAR 9, 5991
(CTAACC)n
s (CCCTAA)n
(CTAAGCCCACCACCA)n
complementary copies GTTTRGATTAG
in the negative strand template serve as binding sites for the viral mRNA leader
restriction endonuclease /
methylase Mae I (prototype) and their isoschizomers
Gene 33, 1 NAR 13, r165
(CTAGAAGGAGCAGGGGCAAG)n
s (GAAGGAGCAGGGGCAAGCTA)n
CTAGCAACWGATG consensus sequence of a 13-mer putative control element in the 5’-regions of class II genes
of the human major
sequence is followed by a second element, CTGATTGG, 19-20 bp downstream
PNAS 82, 1475
(CTAGCTTGCCCCTGCTCCTT)n
CTA
s (TAGCTTGCCCCTGCTCCTTC)n
(CTAT)n
second intron of the rat alpha
involved in termination of transcription of vaccinia virus early
13 bp sequence, repeated with small variations 14 times in the origin of replication of plasmid
R6K MGG 192, 32
(CTCC)n
s (CCCT)n
s (TCCC)n
CTCCAG site recognized by the type II
restriction endonuclease / methylase Gsu I (prototype) and
sequence present 31 times
(with few mismatches) in the 1.715
triplet appearing in the universal genetic code as one of six codons that specify leucine
by a second element, CTAGCAACWGATG, 19-20 bp upstream
Gene 33, 1 NAR 13, r165
*-nontranscribed spacer of rat ribosomal DNA
lacI gene of Escherichia coli;
mutation hotspot in this gene
JMB 126, 847
CTGGAATNTTCTAG
consensus sequence of Drosophila heat shock gene
promoters
Cell 30, 517 NAR 13, 4401
recognition sequence of gene 4
on single-stranded DNA
PNAS 78, 205
CTGGYAYRNNNNTTGCA
consensus sequence of promoters of nitrogen fixation
genes in Klebsiella pneumoniae and
Rhizobia
Cell 37, 5
(CTGT)n
1) n=8; tandem repeat within the
second intron of the rat alpha
the region of the origin of
involved in expression of incompatibility
Gene 15, 257
(CTGTGACAAAYNACCCTCAAAA)n
s (AAACTGTGACAAAYNACCCTCA)n
CTN triplet coding for threonine
in Saccharomyces cerevisiae
PNAS 77, 3167 IRC 93, 93
introns, analogous to the yeast
intron consensus TACTAAC and also found 20 to 55 bases upstream from the 3’-ends of introns
(CTRGGGAGGTGAGAG)n
s (RGGGAGGTGAGAGCT)n
CTT triplet appearing in the universal genetic code as one of six codons that specify leucine
(CTT)n
s (AAG)n
CTTAAG site recognized by the type II
restriction endonuclease / methylase Afl II (prototype) and
repeat in the heterogeneity region
of the human ribosomal spacer
s (AAAACTCAAACTTTG)n
CTTYTG sequence common to the 5’-noncoding regions of most eukaryotic mRNAs