
GNOMIC

A Dictionary of Genetic Codes

E. N. Trifonov and V. Brendel

(1) Maruyama, T., T. Gojobori, S. Aota, and T. Ikemura (1986) Nucl. Acids Res. 14: r151-r197


Morphology of Gnomic: Ab initio approach

Axiomatically, as in this book, Gnomic is a certain written language. The nucleotide sequences, indeed, appear as texts and, most naturally, one wonders whether this analogy could be traced any further.

A first difficulty is the absence of any interruptions in the continuous nucleotide sequences. Gnomic texts are not physically divided into separate “phrases” and “words”, very much like some ancient writings without blanks between words. In the latter case one could still read the texts provided the language is known. What could one do, however, with the undeciphered Gnomic texts? In what follows, a brief account is presented of recent work (1,2) on two morphologically distinct classes of Gnomic words which can be detected in these continuous texts by simple computational means without any a priori knowledge of what these words would mean.

Contrast words

The n-string is called a contrast word of length n if

a) its derivative (n-1)-strings appear in a large sequence ensemble significantly more frequently as part of the n-string than as separate (n-1)-strings with arbitrary extensions, and

b) occurrences of its derivative (n+1)-strings are statistically indistinguishable in the large ensemble.

The contrast word, as defined, is characterized by strong internal correlations which do not extend beyond the word’s limits. As in English, were it written continuously, the string “eyon” wherever encountered would have almost exclusively “b” in front (“beyon”) and “d” at the end (“eyond”), thus actually always belonging to the longer string “beyond”. In its further one-letter extensions, however, practically every letter could be encountered (for example “...extendbeyondthe...”, “...farbeyondlimits...”, “...reachingbeyondguarded...”, etc.). The internal correlation sharply drops as soon as the word’s limits (in this case “b” and “d” of “beyond”) are exceeded. The word, thus, can be recognized by its “contrast” on the background of surrounding letters, after all occurrences of the string in question are listed and compared.
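The “beyond” illustration can be reproduced with a few lines of Python. This is only a minimal sketch of the idea, not the statistical procedure of refs. (1,2): the function names `extension_profile` and `dominant_fraction` are hypothetical, and the toy ensemble is simply the three example phrases joined together. It lists every occurrence of a string and tabulates which letters flank it.

```python
from collections import Counter


def extension_profile(text, core):
    """Count the letters found immediately before and after each occurrence of core."""
    left, right = Counter(), Counter()
    start = text.find(core)
    while start != -1:
        end = start + len(core)
        if start > 0:
            left[text[start - 1]] += 1
        if end < len(text):
            right[text[end]] += 1
        start = text.find(core, start + 1)  # step by one so overlapping hits are kept
    return left, right


def dominant_fraction(counts):
    """Fraction of occurrences taken by the most common neighbouring letter."""
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


# Toy ensemble (an assumption): the three example phrases written continuously.
text = "extendbeyondthe" + "farbeyondlimits" + "reachingbeyondguarded"

inner_left, inner_right = extension_profile(text, "eyon")
outer_left, outer_right = extension_profile(text, "beyond")
```

In this toy text the inner string “eyon” is always flanked by the same letters, while the neighbours of the full word “beyond” all differ, which is the drop in correlation the definition relies on. A real analysis would need a large ensemble and a significance test, which this sketch leaves out.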

Table 1. Vocabulary of contrast words of E. coli

of E. coli, bacteriophages lambda and T7 (1) and human intervening sequences (2), respectively. In cases of E. coli

vocabularies. Both preferred and avoided words are listed in all cases.

Comparison of various vocabularies brings us to the

is 3 to 5 bases; 2) the vocabularies of different genomes are partly overlapping but distinct; 3) the degree of correlation between the different vocabularies reflects the biological or functional relatedness of the corresponding sequences;

Table 2 Contrast words of the bacteriophage lambda genome


Table 3 Contrast words of the bacteriophage T7 genome

4) the contrast word vocabularies contain many strings whose biological meaning has already been established by

restriction sites appear as avoided contrast words in bacteriophages, see Tables 2 and 3).

The last point is of special interest. What is

use and imply any previous knowledge of what various strings and the whole message might mean. Any human language text written in a continuous manner after such

vocabulary of words (or morphemes) of this language, irrespective of whether the operator (computer) is familiar with the language or not. The fact that some short nucleotide sequences with already known meaning are picked by the contrast word technique suggests that the other words of the Gnomic ab initio vocabularies are likely to be of biological importance as well.


Table 4. Contrast words of human intervening sequences

CTC(GAG)    ATGAGTA(TACTCAT)    GTT(AAC)
CTG(CAG)    CCATAGT(ACTATGG)    TAG(CTA)
GAG(CTC)    CGCAGCG(CGCTGCG)    TCG(CGA)
GCG(CGC)    GGTGTGG(CCACACC)    TGA(TCA)
GTG(CAC)    GTATGTC(GACATAC)    TTG(CAA)
TAT(ATA)    GTGCAAC(GTTGCAC)    CAGA(TCTG)
TCT(AGA)    TAAGTCT(AGACTTA)    GGGGG(CCCCC)

Tandem words

sequences attracting attention of even least scrupulous

nucleotide sequences of virtually any length are frequently found tandemly repeated many times, in most cases without any obvious reason. Their functions are largely unknown, though many of the tandem repeats are speculated to be involved in very important biological activities (multiple binding sites for regulatory proteins, recombination and mutation hot spots, potential Z-DNA sites, etc.). Contrary to contrast words, not only the single repeat word but rather a whole succession of the repeats appears as one large internally correlated string. This morphological class therefore needs a separate definition.

The n-string is called a tandem word of length n if it is encountered in a large sequence ensemble with unusual frequency as an immediate exact repeat.
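The detection half of this definition, finding immediate exact repeats, is easy to sketch. The function name `tandem_runs` and its parameters are hypothetical, and the sketch deliberately stops short of the second half of the definition: deciding whether a repeat occurs with "unusual frequency" requires a statistical comparison over a large ensemble that is not attempted here.

```python
def tandem_runs(seq, unit_len, min_copies=2):
    """Return (start, unit, copies) for each run of immediate exact repeats.

    Scans left to right; once a run is reported the scan resumes after it,
    so a run is reported once, in the phase in which it is first met.
    """
    runs = []
    i = 0
    while i + unit_len <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        # Extend the run while the next block is an exact copy of the unit.
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            runs.append((i, unit, copies))
            i = j
        else:
            i += 1
    return runs


# Toy sequence (an assumption): four copies of AC flanked by other bases.
example = tandem_runs("TTACACACACGG", unit_len=2)
```

Because the same repeat can be read in any phase (AC or CA), a tally over many sequences would first have to reduce each unit to one representative of its circular permutations.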

Table 5 Tandem words of human

etc.) which is also very typical of the contrast vocabulary

symmetrical sequences in the primary RNA transcripts might correspond to single-stranded regions of RNA devoid of hairpins. These sites are probably involved in complementary interactions with distant parts of the same mRNA precursor or with other molecules (2).

(1) Brendel, V., J. S. Beckmann and E. N. Trifonov, J. Biomol. Struct. and Dyn., in press.

(2) Beckmann, J. S., V. Brendel and E. N. Trifonov, submitted for publication.


Morphology of Gnomic: Semantic approach

Are the nucleotide sequences constructed as linear texts? One cannot escape a feeling that this is largely the case, especially after certain short sequences had been found to indeed have a unique biological meaning. The GNOMIC dictionary provides about 800 such “words”, all playing a certain structural or regulatory role, experimentally established or reasonably speculated.

The entries of the dictionary are clearly divided into two classes: isolated individual strings and tandemly repeated strings. A parallel with the above ab initio definitions of contrast and tandem words is, probably, more than superficial. Indeed, the tandem strings of the dictionary are structurally identical to the ab initio tandem words. The isolated strings, on the other hand, are similar to the ab initio contrast words, though here the connection is more complicated. Were the contrast words found to have some meaning, they would then be listed in the dictionary as isolated strings. Some of the restriction sites (see above) are an example. But many of the isolated strings cannot possibly be detected as contrast words. For example, certain recognition sequences appear only a few times in a whole genome, which would exclude them from any statistical treatment.

Fig. 1. Size distribution of 555 isolated words of the Gnomic dictionary. Modified bases are not included. Every entry description is counted. Horizontal axis: isolated word length (bases).

Fig. 2. Size distribution of 180 tandem words of the Gnomic dictionary. Every entry is counted as a separate word.

The words of Gnomic, therefore, can be defined in at least three different ways (isolated words of some known

There are, probably, many other structural or functional classes of the words of this language, whose characterization is the subject of further studies.

One important question is whether the entries defined

morphological units of the language or rather they are only parts (“morphemes”, “roots”) of actual words, the latter being composed of several “roots”, “prefixes”, “suffixes”, etc. At this early stage the distribution of word sizes might provide a clue. The upper length limit of the entries of our dictionary has been chosen as 25 bases for purely practical purposes. Sequences of this size carrying certain specific messages, as discussed in the sources referred to, could well be morphologically composite structures. Were these entries primarily words, i.e. basic morphological elements of a certain typical size, the distribution of sizes would have a maximum corresponding to the typical word size. This turns out to be the case. In Figs. 1 and 2 the size distributions of isolated strings and of tandem repeats, respectively, are presented. The isolated strings are most frequently of lengths of 3 to 11 bases. The sharp peaks in Fig. 1 at 3 and 6 bases correspond to codons (3 bases) and

noteworthy that the word sizes derived on the basis of the

previous section) are also most frequently within 3 to 5 bases.

We thus arrive at the conclusion that comparison of the nucleotide sequences with a language is more than just a metaphor. Gnomic is a language, built of detectable words with certain meaning and with a rather narrow size distribution. Morphology of Gnomic, syntax, semantics: the whole linguistics of Gnomic is on the agenda.

Consensus sequences and distributional recognition

A simple early concept of unique recognition of a given nucleotide sequence by the protein specifically interacting with a corresponding stretch of DNA or RNA fails to

restriction-modification enzymes become rather exceptions on the growing background of various sequence-wise uncertain and yet specific interactions.

The sequences involved in the interactions are very

different unique sequences of this length exceeds 4^10 or about 10^6, which is several times larger than the total number of various proteins encoded by a typical eukaryotic genome, only a small part of which are involved in the specific protein-nucleic acid interactions. Thus, the potential information contained in a given sequence of this size exceeds the needs of specific recognition. This allows a certain degree of degeneracy, and most of the specific protein-binding sites are, indeed, degenerate. The sequences corresponding to the same binding protein are frequently discouragingly different and escape any simple description.
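The counting argument can be checked directly. The sketch below assumes a site length of 10 bases, which is inferred from the quoted figure of 4^10 and is not stated explicitly in the surviving text; the variable names are illustrative only.

```python
# Assumed site length, inferred from the 4^10 figure quoted in the text.
site_length = 10

# Four possible bases at each position give 4**L distinct sequences.
n_sequences = 4 ** site_length

# Each base carries log2(4) = 2 bits, so a fully specified site holds 2*L bits.
information_bits = 2 * site_length
```

With these assumptions `n_sequences` is 1,048,576, i.e. about 10^6, in line with the estimate above.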

So-called consensus sequences are one pro tem way to describe the recognition sites. The individual sequences all have some resemblance to the consensus, which is derived by averaging the whole ensemble of the sites and writing down

corresponding positions along the recognition site. The consensus presentation ignores the possibility of allowing some alternative sequence elements (next strongest) in place of the consensus ones, which might be recognized as well. It also does not answer the question as to which elements of the sequence (mono-, di-, or higher oligonucleotides) are actually important for the recognition.

A more adequate description of the recognition sites would be a complete listing of all sequence elements actually found in different positions along the sequences. The ones which are most frequent are likely to be essential for the

elements can be conveniently expressed in the form of a recognition matrix. Several examples of mono- and dinucleotide recognition matrices are presented below, in Appendix I.
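A mononucleotide recognition matrix of this kind is just a table of base counts per position. The sketch below is one plausible reading of the description, not a reproduction of the matrices in Appendix I: the function names are hypothetical, the four aligned sites are invented for illustration, and ties between bases are broken arbitrarily in A, C, G, T order.

```python
from collections import Counter

BASES = "ACGT"


def mononucleotide_matrix(sites):
    """Count each base at each position of a set of aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sites must be aligned and of equal length")
    columns = [Counter(site[i] for site in sites) for i in range(length)]
    return {base: [column[base] for column in columns] for base in BASES}


def consensus(matrix):
    """Most frequent base at each position (ties resolved in BASES order)."""
    length = len(matrix["A"])
    return "".join(max(BASES, key=lambda b: matrix[b][i]) for i in range(length))


# Invented example sites, for illustration only.
sites = ["TTGACA", "TTGACT", "TTTACA", "ATGACA"]
matrix = mononucleotide_matrix(sites)
site_consensus = consensus(matrix)
```

The consensus string keeps only the top row of information, whereas the matrix also records the next strongest alternatives at each position, which is exactly what the consensus presentation is said to ignore.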

Use of alternative sequence elements in the same position can be understood as a reflection of the fact that it is not the sequence itself which is recognized by the protein but rather local details and physical properties of the molecular structure of DNA, which can be common or similar for several different sequence elements. On the other hand, the observed degeneracy of the recognized sequences implies that not all of the important elements have to be simultaneously present in a given sequence and that various different combinations (subsets) of the key elements would suffice. Thus, the load of specific recognition is distributed amongst several sequence elements actually present. This type of specific protein-nucleic acid interaction is called distributional recognition (1,2) as opposed to simple prototype unique recognition. In a sense the various sites recognized by the same macromolecule are synonyms which not only have the same meaning but also bear some structural resemblance to one another.

47: 271—978 (1983)

(2) Brendel, V., G. H. Hamm and E. N. Trifonov, J. Biomol. Struct. and Dyn. 3: 705-723 (1986)

Typical sequence motifs of Gnomic word structure

For a continuous and largely undeciphered language like Gnomic, knowledge of the internal structure of its words would be crucial for the very detection of the words. What are the most typical elements of which the word is constructed: mono-, di-, trinucleotides, ...? Do these elements overlap, or are they joined sequentially like the syllables in human languages? Do they have any special molecular structure? Undoubtedly, a careful study of the Gnomic word structure is needed, and we believe that the collection of words presented in this book will be of substantial help in these efforts.

One interesting illumination is provided by the analysis of the Context Index of the dictionary. Originally the index was meant to help find out whether any new interesting site has common features (parts) with the sites

pentanucleotide composition of the words is very much non-uniform. The pentanucleotide AGCAG, e.g., is found in 17 entries while its reshuffled derivative with the same base composition, AAGGC, does not appear at all (see Context

respectively, while their permutations GTAGC, ATGGC, and

unique possibility of mapping functionally interesting regions

composition of these sequences with the vocabulary specific for the function of interest. Recurrence of certain most frequent oligomers has been suggested recently as a possible way for identification of potential regulatory sequences (1). The use of specific vocabularies rather than individual oligomers is expected to be more selective for the mapping purposes. As an illustration of the potential power of the linguistic mapping technique we compared the trinucleotide composition of various parts of the human beta-globin gene


Fig. 3. Linguistic map of intervening sequences

trinucleotides are taken for the mapping, with their respective contrast values. Both IVS1 and IVS2 are characterized by positive correlation

vocabulary of the introns

intervening sequences (Table 4, ab initio morphology section). Fig. 3 presents the resulting mapping function. As one would expect, the introns show strong positive correlation

the mapping results are not decisive. This does not keep us, however, from the conclusion that the linguistic mapping technique is potentially a very powerful sequence diagnostic tool.
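The exact mapping function behind Fig. 3 is not recoverable from the text, so the following is only a guess at its general shape: a sliding window in which the contrast values of all overlapping trinucleotides are summed. The function name `linguistic_map`, the window size, and the contrast values in the toy vocabulary are all assumptions made for illustration.

```python
def linguistic_map(seq, contrast, window=50, word_len=3):
    """Sum the contrast values of all overlapping words inside each sliding window.

    contrast maps a word to its contrast value (positive for preferred words,
    negative for avoided ones); words absent from the vocabulary score zero.
    """
    word_scores = [
        contrast.get(seq[i:i + word_len], 0.0)
        for i in range(len(seq) - word_len + 1)
    ]
    words_per_window = window - word_len + 1
    return [
        sum(word_scores[i:i + words_per_window])
        for i in range(len(word_scores) - words_per_window + 1)
    ]


# Invented vocabulary and sequence; the values are not taken from Table 4.
toy_vocabulary = {"CTC": 1.0, "GAG": 1.0, "AAA": -1.0}
profile = linguistic_map("CTCGAGAAA", toy_vocabulary, window=6)
```

Regions whose profile stays positive would be read as matching the vocabulary; a realistic application would use the measured contrast values and a window of biologically sensible length.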

(2) Lawn, R. M., A. Efstratiadis, C. O’Connell and

Leafing through the dictionary

Having the privilege of being the first users of the Gnomic dictionary, in this section we would like to share with the reader a few interesting observations we came across.

One rather unexpected thing is an apparent clustering of the CG dinucleotides in the entries of the dictionary. Indeed, the pentanucleotides containing CG are rather avoided. The pentamers of the structure CGNNN, NCGNN, NNCGN and NNNCG appear on average 2.6 times, while the average occurrence for all 1024 possible pentamers is close to 4.0. There are a few exceptions, like CCGCC, which is found 16 times (see Context index), or GGCGG (11 times), but in general CG-pentamers are underrepresented. One

NCGCG, CGCGN) should be even more scarce. This is not

pentamers is 3.9, some being especially frequent (CGCGC, 7 times; GCGCG, 6 times) and all being encountered at least once. Thus, the CG dinucleotides have an apparent tendency to cluster. Interestingly, this is also seen in the contrast word vocabulary of the intervening sequences, where several CG-clusters appear as preferred words of the introns (see Morphology, ab initio approach). There must be some

dinucleotides advantageous for molecular recognition despite the general avoidance of the CG dinucleotides.
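The pentamer classes discussed here can be enumerated explicitly. The sketch below only counts how many of the 1024 pentamers contain one or more CG dinucleotides; it says nothing about their occurrence in the dictionary, and the helper name `count_cg` is hypothetical.

```python
from itertools import product


def count_cg(word):
    """Number of CG dinucleotides in word."""
    return sum(1 for i in range(len(word) - 1) if word[i:i + 2] == "CG")


pentamers = ["".join(p) for p in product("ACGT", repeat=5)]

# Pentamers of the structure CGNNN, NCGNN, NNCGN or NNNCG.
with_cg = [p for p in pentamers if count_cg(p) >= 1]

# Pentamers carrying two CG dinucleotides, e.g. CGCGC or GCGCG.
double_cg = [p for p in pentamers if count_cg(p) >= 2]
```

Enumeration gives 244 pentamers with at least one CG, of which 12 carry two; the small size of the second class is worth keeping in mind when comparing average occurrences between the two groups.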

Another striking feature of the Gnomic words is the frequent appearance of the same oligonucleotides in apparently unrelated entries. The 11 base long sequence GTGACAAATTA, for example, is shared by the regulatory sites B of soybean nodulin and leghemoglobin genes, and by repeats in the origin of replication and in the incC region of plasmid F. Another 11 base long sequence TGAATATTNAT

Drosophila topoisomerase II recognition sites. The sequence AATATAATA (10 bases) is shared by certain repeats in yeast mitochondria, tandem repeats in Leishmania tarentolae kinetoplasts, cores of autonomously replicating sequences of yeast, satellite DNA of Drosophila, and by repeats in 5’-regions of Dictyostelium genes. There are 150 isolated words of length 10 or more bases in the Gnomic dictionary. They consist of about 900 decamers, which is less than 1/1000 of the total number of various possible decanucleotides. Were these words a sample of randomly generated sequences, to find even one example of identical decamers in this sample would be a very improbable event. The Gnomic words, once again, appear to be composed of only a limited spectrum of oligonucleotides, which is probably dictated by structural requirements and limitations inherent to the molecular recognition processes. One cannot help turning again to the analogy with human speech, where also only a small subset of all possible combinations of letters (basically, alternations of vowels and consonants) is compatible with the limitations of human speech organs.

Guide to the use of the Gnomic dictionary

dictionary per se and Context index

The Gnomic dictionary includes 780 primary entries (bold face) with descriptions and references. About a quarter of the primary entries are tandem repeats presented in the same form as in the original source. The dictionary also contains about 1300 secondary entries (bold face, smaller type) which are circular permutations of tandem words and their

directly whether any new tandem word has already been discussed in the literature, in the same or in derivative form. For non-repetitive “words” one has to consult the dictionary twice, taking both the original sequence and its

In case the word of interest is not found in the dictionary, one might be interested in whether some related sequences appear in the dictionary. This purpose is served by the Context index. The index lists all entries in which any given pentanucleotide is present as a part. The new

Context index. This part of the dictionary is especially useful, since it delivers the user from the necessity of time-consuming and tiresome search in the literature.
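The two lookup aids described in this guide, derivative forms of a word and the pentanucleotide Context index, can be mimicked in a few lines. This is a sketch under assumptions, not the procedure used to compile the printed dictionary: the function names are hypothetical, and the reverse complement is assumed to be the second derivative form whose description is cut off in the text above.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def circular_permutations(unit):
    """All rotations of a tandem repeat unit (the same repeat read in another phase)."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def reverse_complement(seq):
    """The word as it reads on the opposite strand (plain A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def build_context_index(entries, k=5):
    """Map every k-mer (pentanucleotide by default) to the entries containing it."""
    index = defaultdict(set)
    for entry in entries:
        for i in range(len(entry) - k + 1):
            index[entry[i:i + k]].add(entry)
    return index


# Two isolated entries taken from the dictionary pages below.
index = build_context_index(["ACCCATACATTT", "ATTCATAC"])
shared = index["CATAC"]
```

Looking up a new word then amounts to checking its rotations and reverse complement against the entry list, and, failing that, querying the index with each of its pentanucleotides. Entries written with degenerate symbols (N, R, Y and so on) would need extra handling that is omitted here.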

The bibliography includes 434 references, which appear first in the entry descriptions in an abbreviated form sufficient for locating the source in the user’s library. Full references are provided as well, in the References section, listed alphabetically according to the titles of the journals, in chronological order. Every bibliographic description is provided with a return reference to the primary entry(ies), of which only the first five letters are indicated for brevity. This allows location of every source in the Gnomic dictionary. Similarly, every author can be located, provided her (his) list of publications is available. The bibliography should not be considered as comprehensive and credit-wise balanced. Our primary concern has been to provide the most informative sources. Priority is given, therefore, to reviews and recent summarizing work where references to earlier contributions can be found. We tried our best, however, to provide the original sources as well. Should readers find any more informative reference(s), we would be most grateful to

Res. 13: 3021-3030, 1985)


GNOMIC DICTIONARY

A 1) adenine, first letter of the gnomic alphabet; appears as deoxyadenosine in DNA and as adenosine in RNA

2) site of attack and scission by

neocarzinostatin

PNAS 75, 3603 PNAS 75, 3608

=

A 1) N6-methyladenosine, M6A;

found in nuclear DNAs of several groups of lower eukaryotes and in

JP 27, 479

MR 44, 175 NAR 11, 5131

2) N6-isopentyladenosine, I6A; modified base occurring in position 37 of tRNAs

NAR 10-2, r1 PNARMB 31, 59

3) 1-methyladenosine, M1A; minor base of tRNAs, specific for gram-positive bacteria, in position 22

NAR 10-2, r1 PNARMB 31, 59

4) 2-methyladenosine, M2A; minor

tracks are frequent in the human

3’-ends of the human Alu interspersed repeats

JMB 180, 753 ARB 51, 813

3) n=17; found at the 3’-end of the human beta-tubulin pseudogene

Nat 297, 83

4) n=16; found at the 3’-end of the human metallothionein II B pseudogene

Nat 299, 797

5) n=44 to 71; found at the 3’-ends of dispersed repeat

6) n=22; found at the 3’-end of the rat repetitive element R.dre.1

Nat 300, 330

7) n=20 to 55; found upstream from the promoter of the yeast alcohol dehydrogenase gene

Nat 304, 652

8) n=50 to 200; post-transcriptionally added to the 3’-ends of most eukaryotic mRNAs

CSHSQB 35, 743

ARB 42, 329

rupted only once by G, found upstream of the topoisomerase gene

inherently curved DNA, in particular nucleosomal DNA

quadruplet recognized by the frameshift suppressor sufG

TRO 19B 37

AAAAAAAGACTTAGAAAAAA consensus sequence of repeat

AAAACAAGCAGGAGRGGCT sequence flanking the Alu family repeat located downstream from a human insulin gene

X. laevis tRNA genes

Cell 23, 251

AAAATTYT consensus of a presumed polyadenylation signal in Trypanosoma

Cell 29, 291 Cell 37, 333

ably recognized by DNA invertases

EMBO 4, 237

(AAACCCCA)n

s (CCCCAAAA)n

AAACCGGNTC consensus sequence of the African green monkey decasatellite; the core CCGG is strongly conserved

JMB 157, 195

AAA

receptor protein) recognition site; the core sequence GTGA seems to be most important for recognition

PNAS 76, 5090

s (CAAAGTTTGAGTTTT)n

AAARNNAGA downstream part of a bipartite consensus sequence found in the 3’-flanking regions of snRNA genes

consensus sequence of the -35 region of promoters recognized by RNA-polymerase of Bacillus subtilis with sigma factor sigma-32

Nat 302, 800

(AAATT)n

of the Leishmania tarentolae kinetoplast maxicircle DNA

AACGTACTAAGCTCTCATGTTT 22 bp sequence, 7 times tandemly repeated (with small variations) in the origin of replication of plasmid R6K

PNAS 76, 1150

AACNNNNNNGTGC site recognized by the type I restriction endonuclease / methylase EcoK

Gene 33, 1

Salmonella potsdam); the second A of the entry and the first A of the complementary sequence are methylated

JMB 182, 579

AACNNNNNNRTAYG a hybrid consensus sequence recognized by recombinant type I

sequence contains recognition sites for the enzymes SP (AAC) and

1) triplet appearing in the universal genetic code as one of two codons that specify lysine

2) site recognized by the specific methylase BbvS III

Gene 33, 1

(AAG)n

motif in the MS satellite DNA of

sequence present near the 3’-ends of yeast mitochondrial genes; signal site for mRNA processing; cleavage occurs a few bases

repeating motifs in the 1.705 satellite DNA of Drosophila melanogaster

sequence responsible for recognition and uptake of DNA during transformation of competent Haemophilus cells

PNAS 79, 2393

consensus sequence of a tentative transcription initiation signal in Halobacterium halobium

consensus sequence, presumably regulatory, which occurs about positions -60 to -30 upstream of the start methionine of five yeast

encoding glycolysis functions

JBC 258, 5291

AANNNNTTGAT consensus sequence of binding site for E. coli integration host factor (IHF)

Cell 41, 489

AAT triplet appearing in the universal genetic code as one of two codons that specify asparagine

repeated with minor variations in the Leishmania tarentolae kinetoplast maxicircle DNA divergent

mitochondrial protein-coding genes; presumably mRNA processing site

MGG 196, 266

(AATAATCACGCAGCTCAGCAGGC)n

s (AATCACGCAGCTCAGCAGGCAAT)n

sequence frequently found 5 to 15 bases upstream from coding regions of Dictyostelium actin genes

JMB 183, 311

(AATAGG)n

s (GAATAG)n

(AATAT)n

repeating motifs in the 1.672 satellite DNA of Drosophila melanogaster

JMB 135, 565

(AATATAAT)n

s (AATAATAT)n

sequence frequently found 5 to 15 bases upstream from coding regions of Dictyostelium actin genes

JMB 183, 311

(AATATAG)n

nucleotide of satellite Ic DNA of Drosophila virilis

Cell 17, 615

(AATATAT)n

repeating motifs in the 1.672 satellite DNA of Drosophila melanogaster

(AATATGGGAATAACTTGTTTT)n

s (GGGAATAACTTGTTTTAATAT)n

AATATT site recognized by the type II

1) n=12; sequence located upstream from the human beta-globin gene cluster

NAR 10, 7809

2) n=21; repeat located in the 3’-nontranscribed spacer of rat ribosomal DNA

JMB 184, 389

3) n=29; repeat located in the promoter region of the mouse H-2K b gene (class I transplantation antigen)

Cell 44, 261

4) n up to 51; repeat found in several clones of Drosophila virilis DNA

JMB 172, 229

5) s (CA)n

6) s (GT)n

ACA triplet appearing in the universal genetic code as one of four codons that specify threonine

(ACA)n

s (CAA)n

ACAAAAACC consensus nonamer signal for

in Drosophila virilis and D. americana satellite DNAs

JMB 85, 633

(ACAAATT)n

in Drosophila virilis and D. americana satellite IT DNA

of the HS-beta satellite DNA of

motif is (ACACAGCGGG)n, n uncertain (second ref.)

FProc 35, 23 PNAS 70, 2642

(ACACCCCTGTCCCC)n

s (CCCCACACCCCTGT)n

(ACAG)n 1) s (AGAC)n 2) s (CTGT)n

(ACAGGGGTGTGGGG)n

almost perfect tandem repeat in the 5’-flanking region of the human insulin gene

Nat 295, 31

(ACAGTGGGGAGGGG)n n=38; sequence tandemly repeated with minor variations in intron 1 of the human zeta-globin gene

PNAS 82, 493

ACC triplet appearing in the universal genetic code as one of four codons that specify threonine

ACCA one of three quadruplets recognized by the frameshift suppressor sufJ; the other quadruplets are ACCC and ACCU

by single base substitutions (see also CCRCCATGG)

Cell 44, 283

(ACCATTCCCCCAGCAACCACAACA)n

s (CCACAACAACCATTCCCCCAGCAA)n

ACCC one of three quadruplets recognized by the frameshift suppressor sufJ; the other quadruplets are ACCA and ACCU

Cell 25, 489

(ACCCACCCACTCCCC)n

s (CCACTCCCCACCCAC)n

ACCCATACATTT conserved sequence element present in intergenic regions of most yeast ribosomal protein genes; presumably a regulatory signal active in either orientation

NAR 13, 701

(ACCCCA)n

s (CCCCAA)n

(ACCCCAAA)n

sequence motif repeated with small variations 18 times at the ends of the En-1 transposon of Zea

ACCNNNNNNGGT site recognized by the type II

core of a consensus sequence in the 5’-flanking regions of alpha-interferon induced genes in human

Nat 314, 637

(ACCTGAAGCAAOCTGAGCC)n

s (GCCACCTGAAGCAACTGA)n

ACCTGC site recognized by the type II restriction endonuclease /

ref M3

ACCU one of three quadruplets recognized by the frameshift suppressor sufJ; the other quadruplets are ACCA and ACCC

Cell 25, 489

ACG triplet appearing in the universal genetic code as one of four codons that specify threonine

s (CGCGACAAAAACGACGCGACG)n

(ACGCGGGGGCGGAGGAGGGGGGG)n

s (GGGCGGAGGAGGGGGGGACGCGG)n

(ACGCGGGTGCAGGAGGGG)n

s (GCAGGAGGGGACGCGGCT)n

ACGCGT site recognized by the type II

Gene 33, 1 NAR 13, r165

(ACGTGACTGAKCAKGCACTGATC)n

s (GATCACGTGACTGAKCAKGCACT)n

(ACGTGATCAGTGCMTGMTCAGTC)n

s (AGTGCMTGMTCAGTCACGTGATC)n

ACNNN

ACR

RNA-primer of bacteriophage T4

CSHSQB 43, 469

transcripts

ARB 50, 349

(AG)n

1) n=10; repeat located in the polymorphic region of the human factor IX gene

NAR 12, 8861

bases downstream from the transcription initiation site of the rat cytochrome P-450c gene

JBC 260, 5026

the mouse immunoglobulin G3 constant region gene

EMBO 3, 2041

4) s (GA)n

5) s (CT)n

AGA 1) triplet appearing in the universal genetic code as one of six codons that specify arginine

2) codon for serine (rather than arginine) in mitochondrial DNA of Drosophila

motif in the polypurine transcripts and in the 1.705 (IV) satellite DNA from Drosophila melanogaster

Cell 5, 183 JMB 114, 441 JMB 135, 581

JMB 184, 389

2) s (ACAG)n

3) s (GACA)n

AGACC site recognized by the type III restriction endonuclease /

sequence repeated three times in the region of the polyoma virus genome with high affinity for the polyoma large T-antigen

Gene 33, 1 NAR 13, r165

(AGATGCGGGAGCAGGAGGAGA)n

s (GGAGGAGAAGATGCGGGAGCA)n

AGC triplet appearing in the universal genetic code as one of six codons that specify serine

of the murine interleukin-3 gene

one of the most conserved sequences within 49 base pair tandem repeats in switch (S) regions of mouse immunoglobulin heavy chain genes

(AGG)n 1) n=15; repeat located in the 3’-nontranscribed spacer of rat ribosomal DNA

1) n=22; tandem repeat in mouse immunoglobulin gene recombination regions; see also (TGAGC)n

Gene 33, 1 NAR 13, r165

(AGGG)n 1) n=10; repeat found in one of the macrovariants of crab satellite DNA

CSHSQB 47, 1151

2) s (CCCT)n

RNA-polymerase of Bacillus subtilis with sigma factor sigma-37

AGR triplet used as stop codon (rather than as arginine codon) in mitochondrial DNA of Xenopus laevis

IRC 93, 93

(AGRGYTTAGTNCGTNARNCATG)n

s (ARNCATGAGRGYTTAGTNCGTN)n

AGT triplet appearing in the universal genetic code as one of six codons that specify serine

AGTACT site recognized by the type II

sequence of 3’-ends of known mRNAs of the human respiratory syncytial

ARACTTAGARAAAWWW consensus sequence of binding sites for topoisomerase I in

NAR 13, 1543 Cell 41, 541

(ARC)n

s (CAR)n

ARCATG conserved sequence found in group II mitochondrial introns

JMB 184, 353

region of replication initiation of plasmid R6K; binding site(s) for the replication initiator protein

Cell 34, 125

(AT)n

of crab C. borealis light satellite DNA

JBC 237, 1961

2) s (TA)n

ATA 1) triplet appearing in the universal genetic code as one of three codons that specify isoleucine

2) initiation codon for some of the maxicircle genes of

initiator codon in mitochondrial DNA of mammals and of Drosophila; the codon is read as N-formyl-

function as initiator codon in place of ATG in mitochondrial DNA

in Drosophila virilis and D. americana satellite DNAs

JMB 85, 633

2) s CTATAAA

T. glabrata

EMBO 4, 465

PNAS 81, 7156

(ATATA)n

s (AATAT)n

sequence several copies of which are found in the region of the yeast mitochondrial genome responsible for autonomous replication

genome

two tandem clusters in the Leishmania tarentolae kinetoplast maxicircle DNA divergent region

NAR 13, 3241

2) triplet which can function as initiator codon in mammalian mitochondrial DNA, being read as N-formyl-methionine

ATCCCTCAAACGRGGGAW consensus sequence of 4 sites found within the ori region of bacteriophage lambda, presumably binding sites for protein O, responsible for initiation of replication

CSHSQB 43, 155

NAR 9, 1789

ATCGAT site recognized by the type II restriction endonuclease / methylase Cla I (prototype) and their isoschizomers

Gene 33, 1

initiates polypeptide chains with methionine, both in eukaryotes and

in the 5’-untranslated regions of the dlec1 and dlec2 lectin genes of Phaseolus vulgaris

EMBO 4, 883

ATGCAAATNA consensus cd sequence found upstream of all heavy chain variable region VH mouse immunoglobulin genes; see also complementary sequence TNATTTGCAT

Nat 310, 71

ATGCAT site recognized by the type II restriction endonuclease / methylase Ava III (prototype) and their isoschizomers

Gene 33, 1 NAR 13, r165

ATGCATGC

Trang 25

ATG

site, found twice in the SV40

1) triplet appearing in the

universal genetic code as one of

three codons that specify

isoleucine

2) rarely used as initiation

codon instead of the standard ATG

EMBO 1, 311

3) triplet which can function as

initiator codon in mitochondrial DNA of mammals and Drosophila, being read as N-formyl-methionine

s (ATATTATT)n

s (ATTGCCTGCTGAGCTGCGTGATT)n

(ATTC)n 1) n=15; repeating motif in the 3’-untranslated region of the murine anion exchange protein gene

Nat 316, 234

2) s (TCAT)n

ArrcAe

consensus of hexanucleotides

found in the 5’-untranslated regions of silkworm chorion genes PNAS 82, 6035

ATTCATAC major homology between tentative transcription terminators

of 5S rRNA and 7S RNA genes of Halobacterium halobium

s (CTATGTTATT)n

ATTCTTA sequence proposed as possible part of a signal necessary for RNA processing in mitochondria

JBC 258, 14065

ATTCTTTTT consensus sequence R2 located

in intergenic regions of the Sendai

sequence motif common for

central single-stranded parts of U-RNAs


EMBO 1, 1259

AWGTGACTC

consensus sequence, presumably

regulatory, located upstream from

the 5’ start methionine of yeast

genes regulated by general amino

frequently repeated up to four

times, nontandemly

JBC 258, 5238

MCB 4, 1326

AWTGCTTY

conserved sequence in the -30

to -25 region of phage T4 promoters

cytosine, one of four major

letters of the gnomic alphabet;

appears as deoxycytidine in DNA

and as cytidine in RNA

*

C

tically methylated cytosine present

in DNA of a wide variety of ,

found in the context C G

JBC 175, 315

Sci 210, 604

base of tRNAs NAR 10-2, r1

3) n=15 and 21; sequences located

in the 3’-untranslated region of the chicken myosin light chain 2 gene

JMB 181, 411

4) n=19; found at the ends of

the macronuclear DNA of Oxytricha

of retrovirus long terminal repeats Cell 27, 1

(CA)n

stream from the polyadenylation signal of the sea urchin

P. miliaris histone H2A gene

PNAS 82, 1094

2) n=15, 21, and 22; repeats

located in the interspersed

repetitive elements of cytomegalovirus DNA

MCB 3, 1389

scription initiation site of the rat cytochrome P-450c gene

JBC 260, 5026

4) n=20; repeat located in the

first intron of the translocated murine c-myc gene

NAR 12, 8987

second intron of the C-delta gene

of the murine immunoglobulin mu-delta heavy chain region

Nat 306, 483

downstream from the 3’-end of coding sequences of mouse immunoglobulin kappa variable region genes

JBC 255, 3691

CA

the C-mu secreted and membrane regions of murine immunoglobulin mu-delta heavy chain genes

Nat 306, 483

8) s (AC)n 9) s (TG)n

CAA triplet appearing in the universal genetic code as one of two codons that specify glutamine
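
The codon entries in this dictionary ("one of two codons that specify glutamine", "one of six codons that specify arginine", and so on) can be cross-checked against the standard genetic code. The sketch below is an illustration only: it encodes the standard (universal) code as a 64-letter string in TCAG order, and the helper name `synonymous_codons` is an assumption introduced here, not something taken from the dictionary.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order: TTT, TTC, TTA, TTG, TCT, ...
# '*' marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(codon: str) -> list:
    """List every codon that encodes the same amino acid as `codon`."""
    amino_acid = CODON_TABLE[codon]
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


print(synonymous_codons("CAA"))       # ['CAA', 'CAG'] (glutamine)
print(len(synonymous_codons("ATA")))  # 3 (isoleucine)
print(len(synonymous_codons("CGC")))  # 6 (arginine)
```

The table covers only the universal code; the mitochondrial readings mentioned in several entries (for example ATA as an initiator codon) would need a separate table.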


found at the left end of the AT-

rich spacer of 5S DNA of X. laevis

protein of bacteriophage lambda W2

CAAGAAAGA

1) homology block at the 3’-ends

of histone genes

Cell 25, 301

2) conserved motif near the 3’-

ends of sea urchin histone genes;

responsible for formation of 3’-termini of the sea urchin histone H3 mRNA

yeast genes

Cell 33, 607

CAARCA site recognized by the type II

restriction endonuclease /

methylase Tth111 II (prototype) and their isoschizomers

Gene 33, 1 NAR 13, r165

sequence

NAR 13, 1369

CAC 1) triplet appearing in the universal genetic code as one of two codons that specify histidine 2) site recognized by the specific methylase Hind I

Nat 280, 288 Nat 280, 370 Nat 302, 575

endonuclease / methylase Taq II

(prototype) and their isoschizomers (the other site is GACCGA)


sequence element located 150 -

200 bases upstream of Adh genes of

maize, possibly with regulatory

restriction endonuclease / methylase Dra III (prototype) and

their isoschizomers

Gene 33, 1

NAR 13, r165 NAR 13, 1517

(CAG)n

1) n=9; repeat found in one of Drosophila virilis DNA clones JMB 172, 229

2) n=12; found within the coding

region of the murine interleukin-2 gene

Nat 313, 402 NAR 12, 9323

putative 5’-end of human

repeated several times within the insulin gene and considered to be unique to the human insulin gene Nat 295, 31

(CAGCCCTCCCCGGCCC)n

s (CCCTCCCCGGCCCCAG)n

CAGCT one of the most conserved

sequences within 49 base pair

tandem repeats in switch (S)

regions of mouse immunoglobulin

heavy chain genes

Cell 23, 357

(CAGCT)n 1) s ACCCC{AGCTC)n 2) s (GCTCA)n

their isoschizomers Gene 33, 1

NAR 13, r165


sequence of satellite II DNA of

hermit crab Pagurus pollicaris

methylase PmaC I; cleavage

occurs within the GG dinucleotide

Nat 312, 616

CAT triplet appearing in the

universal genetic code as one of two codons that specify histidine

(CAT)n

s (TCA)n

(CATAGAATAA)n

region of promoters recognized by

CAT

RNA-polymerase of Bacillus subtilis with sigma factor sigma-29 ref, L1

Gene 33, 1 NAR 13, r165

downstream of the cap site in the

5’-untranslated regions of most Dictyostelium mRNA’s MCB 5, 1465

(CATT)n


four codons that specify proline

2) universal 3’-end of all

from mRNA capping sites of

mammalian beta-like globin genes

-80 region of eukaryotic promoters

Cell 21, 653

(CCACAACAACCATTCCOCCAGCAA)n

sequence of 24 base tandem repeats

in protein coding regions of barley B- and C-hordein cDNA

(CCACTGTCCOCCTCC)n

s (CCCCTCCCCACTGT)n

(CCAG)n

Gene 33, 1 NAR 13, r165

CCARCA

site recognized by the type II

restriction endonuclease / methylase Tth111 II (prototype) and

their isoschizomers Gene 33, 1

universal genetic code as one of

four codons that specify proline 2) sequence within the promoter

of the human mitochondrial gene


one of two quadruplets

recognized by the frameshift

suppressor sufA; the other

quadruplet is CCCU

Sci 175, 650

(CCCCAA)n

1) n=20 to 70; tandemly

repeated sequence at the termini of

the extrachromosomal rRNA genes in

Tetrahymena

JMB 120, 33

5’-termini of macronuclear DNA of the holotrichous ciliate Glaucoma chattoni

the sequence is involved in an imperfect complementarity contact with another conserved sequence of viroids, CCGGTGG


ala repeats in the bovine beta-

JMB 184, 389

2) s (AGGG)n

3) s (TCCC)n

(CCCTAA)n

motif of guinea-pig alpha-satellite DNA

Sci 175, 650

(CCCYAGCTCTCACCT)n

s (AGCTCTCACCTCCCY)n

CCG triplet appearing in the universal genetic code as one of four codons that specify proline


consensus sequence of the -10

region of promoters recognized by

RNA-polymerase of Bacillus subtilis

with sigma factor sigma-28

NAR 9, 5991

cccc Nj

sequence frequently found in

the promoter region of histone H4

core sequence of the promoter

of the human mitochondrial LSP

gene, most sensitive to point

(CCGCCCCCGCGTCCCCCCCTCCT)n

s (CCGCGTCCCCCCCTCCTCCGCCC)n

(CCGCCCCTCGCCCCCTC)n n=19; sequence located in the joint region of the herpes simplex virus genome

JGV 55, 315

sequence common to 5’-noncoding regions of chicken actin genes; the sequence is repeated several times, both in tandem and separately

NAR 13, 1223

CCGCGG site recognized by the type II

methylase Hpa II (prototype) and their isoschizomers

Gene 33, 1 NAR 13, r165 2) s AAACCGGNTC

the sequence is involved in an

imperfect complementarity contact

with another conserved sequence of viroids, CCCCGGGG

Gene 33, 1

NAR 13, r165

CCRCCATGG consensus sequence for

eukaryotic translation initiation

triplet NAR 12, 857

presumed recognition site for the PPR1 regulatory gene product, located 35 bases downstream from TATA-boxes of yeast URA1 and URA3

JMB 185, 65

CCSGG site recognized by the type II


CCT

triplet appearing in the

universal genetic code as one of

four codons that specify proline

(CCT)n

1) n=15; repeat located in one of

macrovariants of crab satellite DNA

of small ribosomal RNA;

complementary to (Shine-Dalgarno) sequences upstream from translation start sites in mRNA

s (TAGCTTGCCCCTGCTCCTTC)n

(CCTGCTGAGCTGCGTGATTATTG)n

s (ATTGCCTGCTGAGCTGCGTGATT)n

(CCTGTCCCCACACC)n

s (CCCCACACCCCTGT)n

CCTNAGG site recognized by the type II restriction endonuclease / methylase Sau I (prototype) and their isoschizomers
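
Recognition sites such as CCTNAGG are written with degenerate letters, so locating them in a sequence means expanding each letter into the set of bases it stands for. The sketch below assumes the standard IUPAC ambiguity codes; the helper `find_sites` and the 20-base test sequence are hypothetical and serve only as an illustration.

```python
import re

# IUPAC ambiguity codes expanded into regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def find_sites(site: str, dna: str) -> list:
    """Return 0-based start positions of every (possibly overlapping) match."""
    body = "".join(IUPAC[letter] for letter in site.upper())
    # The lookahead makes overlapping occurrences visible to finditer.
    pattern = re.compile("(?=(" + body + "))")
    return [match.start() for match in pattern.finditer(dna.upper())]


# Hypothetical sequence containing CCTGAGG and CCTAAGG.
print(find_sites("CCTNAGG", "AACCTGAGGTTCCTAAGGCC"))  # [2, 11]
```

The same function works for the other degenerate entries (for example CAARCA or CCSGG), since only the lookup table depends on the alphabet.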

restriction endonuclease / methylase EcoR II (prototype) and

their isoschizomers Gene 33, 1


clustered in intergenic regions of

involved in the control of their

expression

Nat 314, 467

2) this dinucleotide is

clustered in restriction enzyme

Hpa II tiny fragments (HTF)

derived from nonmethylated HTF

islands of chicken, mouse, and

Gene 33, 1

NAR 13, r165

CGC triplet appearing in the

universal genetic code as one of

six codons that specify arginine

(CGCAO)n n=5; repeat located in one of the macrovariants of crab satellite DNA

methylase FnuD II (prototype) and

their isoschizomers Gene 33, 1

PNAS 78, 7047

CGCTCTTA box A sequence, presumably involved in the lambdoid nut antitermination system

FL

CGG triplet appearing in the universal genetic code as one of six codons that specify arginine


consensus of binding sites for

the GAL4 protein, a positive

regulatory protein of yeast

n=2-5; tandemly repeating runs

located in intron 2 of the zeta and

pseudo-zeta human globin genes

s (GGGGGAGGAGCG)n

CGGGS

consensus sequence of E. coli

terminators, located about 10 bases upstream from the termination point

s (GTGCG)n

CGGWCCG site recognized by the type II

protein binding sites upstream from RNA polymerase III promoters in

yeast

PNAS 82, 43

CGT triplet appearing in the universal genetic code as one of six codons that specify arginine

the 5’-noncoding region of oocyte-specific 5S rRNA genes of Xenopus borealis


CGT

consensus sequence of the -35

region of promoters recognized by

RNA-polymerase of Bacillus

subtilis with phage SPO1 specific

sigma factor sigma-gp33-34

ref P2

CGTTTGCCCA

consensus sequence of a

frequent 10 base pair repeat in

Drosophila

Nat 297, 201

CGTTTGCCCACCCTTTAAAA

consensus sequence of a

frequent 20 base pair repeat in

the foldback transposon FB4 of

mouse cDNA PNAS 80, 3391

2) n up to 18; repeat sequence

found in clones of Drosophila virilis DNA

JMB 172, 229 3) n= 2 to 20; repeats found in the heterogeneity region of the human ribosomal spacer

JMB 183, 213

sequence in the 3’-flanking regions

of human U1 RNA genes PNAS 81, 7288

downstream from the polyadenylation signal of the sea urchin P. miliaris histone H2A gene

PNAS 82, 1094

region between the H2A and H1 genes

of sea urchin S. purpuratus Cell 15, 1033

aberrantly rearranged J segments of the mouse immunoglobulin kappa light chain gene

Nat 302, 260

bases downstream from the transcription initiation site of the rat cytochrome P-450c gene

second intron of the murine C-delta

gene of the immunoglobulin mu-delta heavy chain region

Nat 306, 483

spacer region upstream from the mouse kallikrein gene

Nat 303, 300

11) s (AG)n 12) s (TC)n

CTA triplet appearing in the universal genetic code as one of six codons that specify leucine

CTAAA

consensus sequence of the -35 region of promoters recognized by RNA-polymerase of Bacillus subtilis with sigma factor sigma-28

NAR 9, 5991

(CTAACC)n

s (CCCTAA)n

(CTAAGCCCACCACCA)n

complementary copies GTTTRGATTAG

in the negative strand template serve as binding sites for the viral mRNA leader

restriction endonuclease /

methylase Mae I (prototype) and their isoschizomers

Gene 33, 1 NAR 13, r165

(CTAGAAGGAGCAGGGGCAAG)n

s (GAAGGAGCAGGGGCAAGCTA)n

CTAGCAACWGATG

consensus sequence of a 13-mer putative control element in the 5’-regions of class II genes

of the human major

sequence is followed by a second element, CTGATTGG, 19-20 bp downstream

PNAS 82, 1475

(CTAGCTTGCCCCTGCTCCTT)n

s (TAGCTTGCCCCTGCTCCTTC)n

(CTAT)n

second intron of the rat alpha

involved in termination of transcription of vaccinia virus early

13 bp sequence, repeated with small variations 14 times in the origin of replication of plasmid

R6K MGG 192, 32

(CTCC)n

s (CCCT)n

s (TCCC)n
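
The "s" cross-references under tandem repeat entries appear to point to the same repeat written from a different starting position or read from the complementary strand (this reading of "s" is an assumption based on the entries themselves, for example the CTCC repeat referring to CCCT and TCCC, and the CTT repeat referring to AAG). A minimal sketch of that equivalence test, with the hypothetical helper names `rotations` and `same_repeat`:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rotations(unit: str) -> set:
    """All cyclic permutations of a repeat unit, e.g. CTCC -> TCCC, CCCT, CCTC."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def same_repeat(a: str, b: str) -> bool:
    """True if (a)n and (b)n can describe the same double-stranded tandem repeat."""
    if len(a) != len(b):
        return False
    reverse_complement = b.translate(COMPLEMENT)[::-1]
    return a in rotations(b) or a in rotations(reverse_complement)


print(same_repeat("CTCC", "CCCT"))  # True, cyclic permutation
print(same_repeat("CTCC", "TCCC"))  # True, cyclic permutation
print(same_repeat("CTT", "AAG"))    # True, complementary strand
print(same_repeat("CTCC", "CTGT"))  # False
```

This sketch handles only the four plain bases; units containing degenerate letters such as Y or N would need the complement table extended accordingly.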

CTCCAG site recognized by the type II

restriction endonuclease / methylase Gsu I (prototype) and


sequence present 31 times

(with few mismatches) in the 1.715

triplet appearing in the universal genetic code as one of six codons that specify leucine

by a second element, CTAGCAACWGATG, 19-20 bp upstream

Gene 33, 1 NAR 13, r165

*-nontranscribed spacer of rat ribosomal DNA

lacI gene of Escherichia coli;

mutation hotspot in this gene

JMB 126, 847

CTGGAATNTTCTAG

consensus sequence of Drosophila heat shock gene

promoters

Cell 30, 517 NAR 13, 4401


recognition sequence of gene 4

on single-stranded DNA

PNAS 78, 205

CTGGYAYRNNNNTTGCA

consensus sequence of promoters of nitrogen fixation

genes in Klebsiella pneumoniae and

Rhizobia

Cell 37, 5

(CTGT)n

1) n=8; tandem repeat within the

second intron of the rat alpha

the region of the origin of

involved in expression of incompatibility

Gene 15, 257

(CTGTGACAAAYNACCCTCAAAA)n

s (AAACTGTGACAAAYNACCCTCA)n

CTN triplet coding for threonine

in Saccharomyces cerevisiae

PNAS 77, 3167 IRC 93, 93

introns, analogous to the yeast

intron consensus TACTAAC and also found 20 to 55 bases upstream from the 3’-ends of introns

(CTRGGGAGGTGAGAG)n

s (RGGGAGGTGAGAGCT)n

CTT triplet appearing in the universal genetic code as one of six codons that specify leucine

(CTT)n

s (AAG)n

CTTAAG site recognized by the type II

restriction endonuclease / methylase Afl II (prototype) and

repeat in the heterogeneity region

of the human ribosomal spacer

s (AAAACTCAAACTTTG)n

CTTYTG sequence common to the 5’-noncoding regions of most eukaryotic mRNAs
