3.3.1 Gene Structure
We have seen that genes form discrete units of genetic information located on the chromosomes and consist of DNA that encodes the instructions to produce the polypeptides that will form proteins. Given the differences in complexity between prokaryotic and eukaryotic cells, it is to be expected that genes from these two types of cells also differ in complexity.
However, regardless of the different cell complexity, genes share a common general structure.
At one end of the gene there is a region of transcription initiation. This region contains signals where regulatory proteins bind to regulate gene transcription. These signals work as switches that control both the amount of mRNA produced and its periodicity. At the other end of the gene there is a region containing signals that determine the end of transcription.
In between these two regions lies the transcribed region, as shown in Fig. 3.2. It is in this region that the differences between prokaryotic and eukaryotic genes are most noticeable. In prokaryotic species, such as bacteria, genes are organized as a continuous stretch of DNA, and the whole of the transcribed mRNA sequence is used to build the polypeptide.

Figure 3.3: Exon structure for gene HBA1. Darker solid regions represent exons. Larger blocks indicate the coding part of exons and the thinner blocks the untranslated regions. Dashed regions represent introns.
In eukaryotes, on the other hand, the transcribed region of a gene consists of a combination of coding segments called exons interspersed with non-coding segments called introns. While the pre-mRNA produced by transcription contains the sequence of both exons and introns, a subsequent step called splicing removes the introns and concatenates the exons to form the mature mRNA. Splicing occurs in the nucleus and, therefore, before translation. This means that, effectively, only the nucleotide sequence corresponding to the exons is used in the translation process. Fig. 3.2 depicts the process.
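The splicing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `splice`, the toy transcript and the 0-based, end-exclusive exon coordinates are all hypothetical choices made for this example, and real splicing is carried out by the cellular machinery rather than by known coordinates.

```python
def splice(pre_mrna, exons):
    """Concatenate the exon segments of a pre-mRNA, dropping the introns.

    exons: list of (start, end) pairs, 0-based and end-exclusive
    (a coordinate convention assumed for this sketch).
    """
    ordered = sorted(exons)
    prev_end = 0
    for start, end in ordered:
        # Exons must be non-empty, non-overlapping and inside the transcript.
        if not (prev_end <= start < end <= len(pre_mrna)):
            raise ValueError("invalid or overlapping exon coordinates")
        prev_end = end
    return "".join(pre_mrna[start:end] for start, end in ordered)


# Toy transcript: exons in upper case, introns in lower case.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "UUUGGA" + "guaugucccuag" + "CACUAA"
exons = [(0, 6), (18, 24), (36, 42)]
mature = splice(pre_mrna, exons)
print(mature)  # AUGGCCUUUGGACACUAA
```

Only the three upper-case segments survive in the mature mRNA, which is the sequence that would go on to translation.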
The gene structure in eukaryotes represents an increased complexity. The DNA contains signals that easily allow the delineation of this exon/intron structure. On average, exons are significantly shorter than introns and marked by short and conserved nucleotide sequences.
The exon/intron borders include signal sequences that mark the end of an exon and the start of an intron, the so-called 5’ splice site or donor site, and the end of an intron and the start of an exon, called the 3’ splice site or acceptor site. These are the regions where, during the splicing process, the intronic sequence is cut out and removed from the pre-mRNA. Additional signals, for instance the polypyrimidine tract, can be found in introns or in exons; they help to recognize the exon/intron borders and to splice out the introns, as shown in Fig. 3.2.
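As an illustration of how such border signals can be checked computationally, the sketch below tests whether each intron of a toy transcript starts with GU and ends with AG. The GU/AG dinucleotides are the canonical donor and acceptor signals found in the large majority of eukaryotic introns; they are not stated in the text above and are brought in here as an assumption, as are the function names and the toy sequence. Real splice-site recognition relies on longer, degenerate motifs.

```python
def introns_between(exons):
    """Derive intron intervals from sorted, non-overlapping exon intervals.

    Coordinates are 0-based and end-exclusive (assumed for this sketch).
    """
    ordered = sorted(exons)
    return [(a_end, b_start)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def has_canonical_splice_sites(pre_mrna, intron_start, intron_end):
    """True if the intron starts with GU (donor) and ends with AG (acceptor)."""
    intron = pre_mrna[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GU") and intron.endswith("AG")


# Toy transcript: exons in upper case, introns in lower case.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "UUUGGA" + "guaugucccuag" + "CACUAA"
exons = [(0, 6), (18, 24), (36, 42)]

introns = introns_between(exons)
for start, end in introns:
    print(start, end, has_canonical_splice_sites(pre_mrna, start, end))
```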
We have seen, in the previous section, that translation only starts when the ribosome detects the start codon AUG and stops when one of the stop codons UAA, UAG or UGA is found.
The region between the start and the stop codon of the mRNA, once properly spliced, is called the coding region. This name comes from the fact that only this nucleotide sequence is used to code for the protein sequence. The remaining regions, before the start codon and after the stop codon, are part of the mRNA but are not translated into amino acids. They are therefore called, respectively, the 5’ and 3’ untranslated regions, or 5’-UTR and 3’-UTR for short.
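The partition of a mature mRNA into 5’-UTR, coding region and 3’-UTR can be sketched as follows. This is a simplified illustration: the function name `split_mrna` and the example sequence are hypothetical, the first AUG is assumed to be the start codon, and the coding region is taken to include the stop codon.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_mrna(mrna):
    """Split a mature mRNA into (5'-UTR, coding region, 3'-UTR).

    Simplifying assumptions of this sketch: translation starts at the
    first AUG and ends at the first in-frame stop codon, which is kept
    as part of the coding region. Returns None if either is missing.
    """
    start = mrna.find("AUG")
    if start == -1:
        return None
    # Walk codon by codon, in frame with the start codon.
    for i in range(start + 3, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return mrna[:start], mrna[start:end], mrna[end:]
    return None


mrna = "GGCACC" + "AUGGCCUUUGGACACUAA" + "GCUUAAA"
five_utr, cds, three_utr = split_mrna(mrna)
print(five_utr, cds, three_utr)
```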
Fig. 3.3 shows a representation of the structure of the human gene hemoglobin alpha 1 (HBA1). This gene is part of a cluster of alpha globin genes spanning about 30 kb and located on chromosome 16 of the human genome. It is involved in oxygen transport from the lungs to the different peripheral tissues. The pre-mRNA contains around 900 nucleotides and consists of three exons. The resulting protein is 144 amino acids long.
We have assumed that a gene corresponds to a genomic region that harbors a sequence containing the information necessary to produce a protein. However, this is not always the case. In fact, some genes do not code for proteins, i.e. the respective RNA sequence is not translated into an amino acid chain. While genes of the first type are called protein-coding genes, the products of the second type are called non-coding RNAs (ncRNAs). ncRNAs are often characterized as a class of relatively short RNAs. Many of them are known to be fully functional and to play regulatory roles within the cell, even though the function of the majority of them is currently unknown. There are several types of ncRNAs, including microRNAs, snoRNAs and long non-coding RNAs. Some of these ncRNAs share several properties with protein-coding genes [46].
For simpler eukaryotic species, protein-coding genes tend to cover a large fraction of the genome [37]. For instance, budding yeast (S. cerevisiae) contains a genome of 12 million base pairs, with roughly 6600 genes corresponding to approximately 70% of the genome being used as protein coding sequence. For genomes of higher animals and plants, this percentage becomes much lower.
This is the case of the human genome, with its 3.2 × 10^9 base pairs and ∼21,000 protein-coding genes, where only 3% of the genome corresponds to protein-coding sequences. So, it is often the case that a large proportion of the genome sequence of a species appears not to contain any genes. This is called the non-coding genome, and the regions free of genes are called intergenic regions. For some time, it was considered that these regions had no particular functional interest. With the advance of high-throughput sequencing technologies, several landmark studies, like the ENCODE [27,49,51] and FANTOM [13,33] projects, are revealing that these regions harbor many different functional signals that are used in cellular regulatory processes, mostly controlling gene expression in different ways.
Surprisingly, across diverse species, the number of protein-coding genes does not correlate well with genome size or with overall cellular complexity. Consider for instance the species in Table 3.3. The fission yeast has a longer genome than the budding yeast but a smaller number of genes. The fruit fly and thale cress (A. thaliana, a model organism for plants) have genomes of similar length, but the plant has almost twice as many genes as the fly. Mouse and human, although very different species, have a similar number of genes. This implies that there are several mechanisms, other than the total number of genes, that have contributed to this differential complexity through the evolution of the species.
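The lack of a simple relation between genome size and gene number can be made concrete by computing the gene density, in genes per Mbp, from the figures listed in Table 3.3. The snippet below is a small illustrative calculation; the variable names are our own, and the input numbers are the rounded values from the table.

```python
# (genome size in Mbp, protein-coding genes), rounded values from Table 3.3
genomes = {
    "S. cerevisiae": (12, 6600),
    "S. pombe": (13, 4800),
    "D. melanogaster": (140, 14000),
    "A. thaliana": (140, 27000),
    "M. musculus": (2800, 20000),
    "H. sapiens": (3200, 21000),
}

# Gene density: protein-coding genes per million base pairs.
density = {name: genes / size for name, (size, genes) in genomes.items()}

for name, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {value:8.1f} genes/Mbp")
```

With these figures the budding yeast has several hundred genes per Mbp, whereas mouse and human have fewer than ten, even though their gene counts are about three times larger.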
Table 3.3: Genomes by the numbers. Genome size, number of chromosomes and number of protein-coding genes for different model species. Numbers were retrieved from [129].

Organism                          Genome size (base pairs)   Protein-coding genes   Number of chromosomes
Budding yeast – S. cerevisiae     12 Mbp                     6600                   16
Fission yeast – S. pombe          13 Mbp                     4800                   3
Fruit fly – D. melanogaster       140 Mbp                    14,000                 8
Thale cress – A. thaliana         140 Mbp                    27,000                 10 (2n)
Mouse – M. musculus               2.8 Gbp                    20,000                 40 (2n)
Human – H. sapiens                3.2 Gbp                    21,000                 46 (2n)

Alternative splicing is one of the mechanisms that contribute to differential complexity. Given its combinatorial nature, it represents an interesting and challenging topic to approach from the computational point of view. Alternative splicing is the process in which the exons of a pre-mRNA are spliced in different combinations. As a result, one pre-mRNA sequence can give rise to several distinct mature mRNAs, leading to structurally and functionally distinct protein variants. This process, used most extensively in higher eukaryotes, allows for increased macromolecular and cellular complexity.
Recent studies using high-throughput RNA sequencing discovered that, in human cells, alternative splicing is a very common event, with 90% to 95% of human multi-exon genes producing mRNA transcripts that are alternatively spliced [122,151]. Alternative splicing has a combinatorial nature, where exons and introns are fully or partially combined in a multitude of ways [89]. Another important aspect of alternative splicing is its cell and tissue specificity, which allows different types of cells in higher eukaryotes to generate, from the same gene, a differentiated repertoire of protein isoforms [24,111].
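To give a feeling for this combinatorial nature, the sketch below enumerates isoforms under a deliberately simplified exon-skipping model: the first and last exons are always kept, every internal exon is either included or skipped, and the genomic order is preserved. The model and the function name are assumptions made for illustration only; real alternative splicing also involves alternative splice sites, intron retention and mutually exclusive exons, and only a subset of the possible combinations is observed in practice.

```python
from itertools import combinations


def exon_skipping_isoforms(exons):
    """Yield every isoform that keeps the first and last exon plus any
    subset of the internal exons, in genomic order.

    Requires at least two exons (simplified model assumed in this sketch).
    """
    first, *internal, last = exons
    for k in range(len(internal) + 1):
        # combinations() preserves the input order of the kept exons.
        for kept in combinations(internal, k):
            yield [first, *kept, last]


isoforms = list(exon_skipping_isoforms(["E1", "E2", "E3", "E4"]))
for iso in isoforms:
    print("-".join(iso))
print(len(isoforms), "isoforms")
```

Under this model a gene with n exons has 2^(n-2) possible isoforms, so the count grows exponentially with the number of internal exons.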
3.3.2 Regulation of Gene Expression
We have seen that genes encode proteins, that proteins perform a multitude of functions and that the concerted actions of these proteins determine the function of the cell. Gene expression refers to the process by which thousands of genes are expressed within the cell at a given moment.
Given the dynamic and adaptive nature of cells, the protein amounts required by the cell change over time. The available amount of a certain protein results from the balance between its synthesis and its degradation.
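This balance can be illustrated with a very simple kinetic model, which is an assumption of ours and not part of the text above: the protein is synthesized at a constant rate k_syn and degraded at a rate proportional to its current amount, so that dP/dt = k_syn - k_deg * P. The sketch below integrates this equation with the explicit Euler method; all parameter values and names are hypothetical.

```python
def simulate_protein(k_syn, k_deg, p0=0.0, dt=0.01, t_end=50.0):
    """Integrate dP/dt = k_syn - k_deg * P with the explicit Euler method.

    Assumed model: constant synthesis rate, first-order degradation.
    Returns the list of protein amounts at every time step.
    """
    steps = int(round(t_end / dt))
    p = p0
    trajectory = [p]
    for _ in range(steps):
        p += dt * (k_syn - k_deg * p)
        trajectory.append(p)
    return trajectory


k_syn, k_deg = 10.0, 0.5          # hypothetical rates, arbitrary units
traj = simulate_protein(k_syn, k_deg)
steady_state = k_syn / k_deg      # amount at which synthesis equals degradation
print(traj[-1], steady_state)
```

In this model the protein amount settles at k_syn / k_deg, so the cell can change the level of a protein by acting on either its synthesis or its degradation.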
Regulation of gene expression refers to the combination of processes that determine how much of a particular protein should be produced. As expected, these regulatory processes are much more complex in eukaryotic than in prokaryotic cells.
Since prokaryotic cells do not have a nucleus, the DNA floats freely in the cytoplasm, and transcription and translation occur almost simultaneously [36]. Whenever the production of a protein is no longer necessary, transcription stops. Given the immediate effect on the amount of protein produced, control of gene expression is essentially determined at the level of transcription. Another interesting aspect of some prokaryotic species, like bacteria, is the organization of their genes in clusters of co-regulated genes, called operons. Genes from an operon are close together in the genome and under a common control mechanism that allows them to be transcribed in a coordinated way. This gives the species the capacity to react rapidly to different external stimuli [131].
In eukaryotic cells, gene regulation is a multi-step process controlled by multiple mechanisms. Given that transcription and translation are decoupled, regulation may occur in the nucleus, during transcription and RNA processing, or in the cytoplasm, during translation. It may even occur after the protein has been produced, where biochemical modifications to the protein result in post-translational regulation [36,131].