Genetic Information: Nucleic Acids

Excerpt from the document Thiết kế các thuật toán sinh tin học bằng Python (pages 67 - 73)

DNA is a polymer composed of four nucleic acid units, called nucleotides or base pairs.

Nucleotides have a similar chemical composition, but can be distinguished by their respective nitrogenous bases: Adenine, Guanine, Thymine or Cytosine. Adenine and Guanine belong to the purine group, while Cytosine and Thymine are pyrimidines. Nucleotides are usually referred to by their first letter: A, G, T or C.

DNA is a molecule composed of two complementary strands that are held together by the bonds established between the nucleotides of both strands. This complementarity is possible due to a chemical phenomenon where Adenine bonds only to Thymine through two hydrogen bonds (A=T), and Guanine bonds only to Cytosine through three hydrogen bonds (G≡C). This results in two complementary and anti-parallel strands (connected in opposite directions). Knowing the sequence of nucleotides in one of the strands, it is possible to obtain the sequence of the opposite strand by taking the complement of its nucleotides. The two opposite strands are also read in reverse directions, and therefore we say that one strand is the reverse complement of the other. The existence of these two strands is essential for passing the genetic information to new cells and for producing proteins. Because of this complementarity and redundancy, it is standard to describe DNA through only one of its strand sequences using the four-letter alphabet {A, G, T, C}.
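The pairing rules above (A with T, G with C, strands read in opposite directions) can be sketched in a few lines of Python. This is a minimal illustration, not code from the book; the function names `complement` and `reverse_complement` are chosen here for the example, and the input is assumed to contain only the letters A, C, G and T.

```python
# Pairing rules: A <-> T and C <-> G (assumes only these four letters occur).
_PAIRING = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Return the complementary base for every position, in the same order."""
    return dna.upper().translate(_PAIRING)


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own direction."""
    return complement(dna)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Applying `reverse_complement` twice returns the original sequence, which is the redundancy that lets a single strand stand in for the whole molecule.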

DNA molecules can form sequences of hundreds of thousands or millions of nucleotides.

Each of these long individual DNA molecules is called a chromosome. Within chromosomes, we find genes, which are functional regions of the DNA that encode the instructions to make proteins. The complete set of DNA forms the genetic material of the organism and is called the genome. The size of the genome and the corresponding number of chromosomes in a cell vary and depend on the species. For instance, certain bacterial species have a genome of a few hundred thousand to a few million base pairs in a single chromosome, while mice and humans have a genome with approximately 3 billion (3×10^9) base pairs, and there are other species with even larger genomes.

An organism or a cell is haploid if it only contains a single set of unpaired chromosomes.

If it contains two complete sets of chromosomes it is called diploid. Humans, for instance, are diploid and have 23 chromosome pairs. The twenty-third chromosome pair (the sex chromosomes) determines the sex of the individual. If it contains two copies of the X chromosome, the individual is female, while if it contains one X chromosome and one Y chromosome, the individual is male. So, in total, the human genome has 24 distinct chromosomes.

The remaining 22 chromosome pairs are called autosomal chromosomes. Each chromosome contains many genes, and the human genome contains nearly 21,000 protein-coding genes in total. As we will see later, different types of genes can be found in the human genome.

Depending on the type of cell, the genome is organized differently. In a prokaryotic cell, the genome exists in the form of a circular chromosome located in the cytoplasm. In a eukaryotic cell, on the other hand, the genome is found in the nucleus, tightly packed into linear chromosomes. The chromosome organization is highly hierarchical and consists of a DNA-protein complex called chromatin, which is organized in an array of sub-units called nucleosomes. In these intermediate sub-structures, the DNA wraps around proteins called histones. This organization allows a long DNA molecule to be accommodated in a small space. It also provides a structure for regulating the expression of genes.

The genetic composition of an organism contained in its genome is called the genotype. The phenotype is the set of physical and observable traits of an individual. It results from the combination of the genotype and the environment. The height of an individual or the color of the eyes are phenotypic traits that are totally or partially encoded in the genotype.

In most multicellular organisms, each cell contains the same genetic information. At a certain point, the cell divides into two daughter cells, passing a copy of its DNA to each of them. The process of copying the DNA is called DNA replication. It is important that this process is accurate, to ensure that the new cells contain the same genetic material as the mother cell and are healthy. Since DNA has a double-helix structure, the two strands must separate for DNA replication to take place. Then, proteins (enzymes) called DNA polymerases synthesize a new DNA strand by adding nucleotides that complement each of the nucleotides in the original strand, also called the template strand.

So far, we have seen that DNA contains the genetic information required for cells to function and replicate, and that genes, which are regions found along the DNA, encode the information necessary to synthesize proteins and other molecular products. The process of using the information encoded in a gene to produce a functional gene product is called gene expression. What we have not yet discussed is how the information flows from DNA to proteins. DNA, RNA and proteins are the central elements in this flow of genetic information, which occurs in two steps, transcription and translation, in what is also called the central dogma of molecular and cell biology.

RNA is a single-stranded molecule that, in contrast to DNA, does not form a double-helix structure. It is also composed of four nucleotides, but in place of Thymine it contains a different base called Uracil (U), which is not present in DNA. So, while Thymine only exists in DNA, Uracil is only found in RNA. The other three bases are common to DNA and RNA. The sequence alphabet of RNA is {A, G, U, C}.

3.2.1 Transcription: RNA Synthesis

Transcription is the first step required to produce a protein. In this step, the nucleotide sequence of a gene from one of the DNA strands is transcribed, i.e. copied into a complementary molecule of RNA. Because of this complementarity, the information encoded in the original DNA sequence can be recovered from the RNA. Transcription is performed by an enzyme called RNA polymerase. Additional steps of RNA processing, including stabilizing the elements at the ends of the molecule, are performed by different protein complexes.

After these steps, which occur within the nucleus of the cell, an RNA molecule called mature messenger RNA or mRNA is obtained. The mature mRNA is then transported to the cytoplasm, where it is used by the cellular machinery to guide the production of a protein. Fig. 3.1 depicts the different steps involved in the flow of genetic information, including transcription, translation (covered in the next section) and replication (described earlier).
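At the sequence level, transcription can be sketched as a simple string operation. The sketch below is an illustration under a stated assumption, not the book's code: it follows the common convention that the sequence we are given is the strand whose letters match the mRNA, so transcription amounts to replacing T with U. The second function (a hypothetical helper named here for the example) covers the case where the template strand is given instead.

```python
# Base pairing within DNA, used to move from the template strand to its partner.
_DNA_PAIRING = str.maketrans("ACGT", "TGCA")


def transcribe(dna: str) -> str:
    """mRNA for a DNA strand that carries the same letters as the mRNA.

    Assumption: the input is the strand matching the mRNA (T instead of U),
    written 5'->3', and contains only A, C, G and T.
    """
    return dna.upper().replace("T", "U")


def transcribe_from_template(template_dna: str) -> str:
    """mRNA for a gene when the template strand is given instead.

    The mRNA is complementary to the template, so we take the reverse
    complement of the template first and then swap T for U.
    """
    partner = template_dna.upper().translate(_DNA_PAIRING)[::-1]
    return transcribe(partner)


print(transcribe("AATGCTCGTAATTTAG"))                # AAUGCUCGUAAUUUAG
print(transcribe_from_template("CTAAATTACGAGCATT"))  # AAUGCUCGUAAUUUAG
```

Both calls produce the same mRNA because the two inputs are reverse complements of each other.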

3.2.2 Translation: Protein Synthesis

Proteins are cellular entities that have either a structural function, participating in the physical definition of the cell, or a chemical function, being involved in the chemical reactions occurring in the cell. In order to function as expected, a protein needs to acquire the appropriate structure. This structure is often described at different levels of complexity. The primary structure is defined by the chain of amino acids, called a polypeptide. These polypeptides coil and fold, forming regular and local sub-structures called the secondary structures of the protein. One or more combined polypeptides with the appropriate structure form a fully functional protein. It is the sequence of amino acids and the current cellular conditions that determine how the polypeptide folds and how the protein acquires its structure.

Translation is the process in which the nucleotide sequence of an mRNA molecule is translated into a chain of amino acids forming a polypeptide, which constitutes part or the totality of a protein. This process is performed by the ribosomes, which attach to the mRNA and scan it from one end to the other in groups of nucleotide triplets or codons. Each position of the triplet holds one of four nucleotides, so there are 4×4×4 = 64 possible triplets. Each codon in the mRNA sequence corresponds to an amino acid in the polypeptide chain. Some of these codons represent specific signals that indicate the initiation or the termination of the translation process. Once the ribosome detects an initiation codon, it starts forming the amino acid chain, and when it reaches a stop codon it stops the translation and detaches from the mRNA molecule.

There are 20 types of amino acids used to form polypeptides, far fewer than the 64 possible codons. Therefore, more than one codon can correspond to the same amino acid. This mapping between codons and amino acids is provided in Table 3.1 and is commonly called the genetic code.
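The codon-by-codon process described above can be sketched as follows. This is a simplified illustration and not the book's implementation: the names `CODON_TABLE` and `translate` are chosen for this example, amino acids are written with their one-letter symbols, `*` marks a stop codon, and translation is assumed to start at the first AUG found in the sequence. The table is built from the standard genetic code by listing the 64 codons in the order U, C, A, G for each position.

```python
from itertools import product

# Standard genetic code: the 64 codons enumerated with each position cycling
# through U, C, A, G, paired with one-letter amino acid symbols ("*" = stop).
_BASES = "UCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon.

    Assumption: the input contains only A, C, G and U. Returns an empty
    string when no start codon is present.
    """
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: nothing more is added to the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))               # 64
print(translate("AAUGCUCGUAAUUUAG"))  # MLVI
```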

Figure 3.1: Genetic information flow. An example of a fragment of a DNA double helix is shown at the top. Replication of DNA forms two identical DNA double helices. Transcription allows the synthesis of an mRNA transcript. Translation synthesizes a polypeptide from the mRNA transcript.

Table 3.1: Genetic code: mapping between codons and amino acids. Each amino acid is given by its full name, three-letter abbreviation and one-letter symbol.

Codons                          Amino acids
UUU, UUC                        Phenylalanine / Phe / F
UUA, UUG, CUU, CUC, CUA, CUG    Leucine / Leu / L
AUU, AUC, AUA                   Isoleucine / Ile / I
AUG                             Methionine / Met / M (start)
GUU, GUC, GUA, GUG              Valine / Val / V
UCU, UCC, UCA, UCG, AGU, AGC    Serine / Ser / S
CCU, CCC, CCA, CCG              Proline / Pro / P
ACU, ACC, ACA, ACG              Threonine / Thr / T
GCU, GCC, GCA, GCG              Alanine / Ala / A
UAU, UAC                        Tyrosine / Tyr / Y
UAA, UAG, UGA                   Stop codons
CAU, CAC                        Histidine / His / H
CAA, CAG                        Glutamine / Gln / Q
AAU, AAC                        Asparagine / Asn / N
AAA, AAG                        Lysine / Lys / K
GAU, GAC                        Aspartic Acid / Asp / D
GAA, GAG                        Glutamic Acid / Glu / E
UGU, UGC                        Cysteine / Cys / C
UGG                             Tryptophan / Trp / W
CGU, CGC, CGA, CGG, AGA, AGG    Arginine / Arg / R
GGU, GGC, GGA, GGG              Glycine / Gly / G

During the translation process, small RNA molecules called transfer RNAs or tRNAs bring to the ribosome the amino acid that corresponds to the mRNA codon currently being scanned. Each tRNA recognizes its codon through a nucleotide triplet that is complementary to it. Each mRNA molecule can be scanned multiple times by different ribosomes, giving rise to multiple copies of the polypeptide.

The genetic code is an example where nature has developed a clever and robust cellular mechanism. While there are exceptions, the genetic code constitutes a standard language common to most cells, from the simpler to the more complex organisms. With its redundancy, where more than one codon encodes an amino acid, the genetic code encloses a very efficient code-correction mechanism that minimizes the impact of errors in the nucleotide sequence occurring in DNA replication.
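The redundancy mentioned above can be made concrete by counting how many codons map to each amino acid. The sketch below is only an illustration; it rebuilds the standard genetic code in one-letter notation (with `*` for stop) and the names `CODON_TABLE` and `degeneracy` are chosen for this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated with each position in U, C, A, G order.
_BASES = "UCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}

# Number of codons per amino acid (and per stop signal "*").
degeneracy = Counter(CODON_TABLE.values())
print(degeneracy["L"], degeneracy["M"], degeneracy["W"])  # 6 1 1

# For Alanine, any change in the third codon position leaves the amino acid unchanged.
third_position_variants = {CODON_TABLE["GC" + base] for base in _BASES}
print(third_position_variants)  # {'A'}
```

Only Methionine and Tryptophan are encoded by a single codon; for many other amino acids a substitution in the third position of the codon is silent, which is one way the code dampens the effect of sequence errors.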

During translation, the parsing of the mRNA sequence by the ribosome may start at different nucleotides. Given that a codon is composed of three nucleotides, the mRNA sequence may have three possible interpretations. Let's consider the example in Table 3.2 and the possibilities in which the sequence can be translated.

Table 3.2: An example of an mRNA sequence and the three reading frames.

AAUGCUCGUAAUUUAG

AAU-GCU-CGU-AAU-UUA → Asn-Ala-Arg-Asn-Leu
AUG-CUC-GUA-AUU-UAG → Met-Leu-Val-Ile-Stop
UGC-UCG-UAA-UUU → Cys-Ser-Stop-Phe

These three ways of parsing the sequence are called reading frames. Note that in the above reading frames, Stop is just an indicative signal: when a stop codon is found, translation stops and no other amino acid is added to the polypeptide chain. A reading frame of sufficient length with no stop codons is called an open reading frame (ORF). In genetics, when the sequence of the genome is known but the location of the genes has not been annotated yet, ORFs are of particular importance as they indicate parts of the genome where genes are potentially encoded.
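The three reading frames of the example sequence can be reproduced with a short sketch. It is an illustration rather than the book's code: `split_frame` and `translate_frame` are names chosen here, amino acids use one-letter symbols, and `*` marks a stop codon that is shown as a signal rather than used to cut the output, mirroring Table 3.2.

```python
from itertools import product

# Standard genetic code, codons enumerated with each position in U, C, A, G order.
_BASES = "UCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def split_frame(mrna: str, offset: int) -> list[str]:
    """Complete codons obtained when parsing starts at position `offset` (0, 1 or 2)."""
    return [mrna[i:i + 3] for i in range(offset, len(mrna) - 2, 3)]


def translate_frame(mrna: str, offset: int) -> str:
    """One-letter translation of a whole frame, keeping "*" where stops occur."""
    return "".join(CODON_TABLE[codon] for codon in split_frame(mrna, offset))


sequence = "AAUGCUCGUAAUUUAG"
for offset in range(3):
    print("-".join(split_frame(sequence, offset)), "->", translate_frame(sequence, offset))
# AAU-GCU-CGU-AAU-UUA -> NARNL
# AUG-CUC-GUA-AUU-UAG -> MLVI*
# UGC-UCG-UAA-UUU -> CS*F
```

A candidate ORF could then be searched for in each frame as a stretch between `*` symbols that is long enough; the length threshold is a choice left to the application.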

So far, we have considered a simplistic view where the genetic information is found in one of the DNA strands. Given the double-helix structure, genes can be found in both strands. Thus, the information in DNA should be read from both strands. To identify the directionality of the DNA strands, the ends of each strand have been named the 3’-end and the 5’-end. Since the information of both strands is complementary, typically only the sequence of a single strand of DNA is given. By convention, this represents the sequence read from the 5’-end to the 3’-end, also represented as 5’→3’.

The DNA strand from which the mRNA coding for a protein is read is called the DNA sense, positive or + strand. Its complementary strand is called the DNA antisense, negative or − strand. Genes can be found in both strands of DNA. Thus, when translating a DNA sequence, the three previous reading frames actually become six possible reading frames, three for each strand.
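The six reading frames can be enumerated by splitting both the given strand and its reverse complement at offsets 0, 1 and 2. The following sketch is an illustration under stated assumptions: the function names are chosen for this example, the input is a DNA sequence written 5'→3' containing only A, C, G and T, and only complete codons are kept.

```python
# Base pairing within DNA, used to obtain the opposite strand.
_PAIRING = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Opposite strand, read in its own 5'->3' direction."""
    return dna.upper().translate(_PAIRING)[::-1]


def split_frame(seq: str, offset: int) -> list[str]:
    """Complete codons obtained when parsing starts at position `offset`."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(dna: str) -> list[list[str]]:
    """Frames 0-2 come from the given strand, frames 3-5 from the opposite strand."""
    strands = (dna.upper(), reverse_complement(dna))
    return [split_frame(strand, offset) for strand in strands for offset in range(3)]


for frame in six_frames("AATGCTCGTAATTTAG"):
    print("-".join(frame))
```

Each frame could then be transcribed and translated independently to look for candidate coding regions on either strand.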

In this section, we have seen that DNA and RNA molecules can be represented as sequences written with four nucleotide symbols, while protein molecules can be represented as sequences of amino acids written in a twenty-letter alphabet. Transcription is the process that creates a complementary copy of one of the DNA strands of a gene. The result is a one-dimensional sequence of nucleotides called messenger RNA (mRNA). The mRNA serves as an intermediary between the nuclear DNA and the ribosomes in the cytoplasm. Translation is the process that uses the mRNA information to build the chain of amino acids of the polypeptide or protein. In this way, proteins are not directly synthesized from DNA. In the translation process, the AUG codon has a double function: it acts as a start codon indicating the initiation of translation, and it encodes the amino acid methionine. The translation stop signal is encoded by three codons and is never included in the translated sequence. In the next section, we will discuss how the genetic information to encode proteins is organized within a gene.

Figure 3.2: General gene structure. Different regions that are present in genes of eukaryotic species. The intermediate scheme shows some of the signals that define the exon/intron structure. After splicing, a mature mRNA is formed only with exonic regions.
