In Chapter 5, we have seen that the string matching problem consists of finding, in a sequence S defined over an alphabet, all the occurrences of a pattern P (defined in the same alphabet). In Section 5.4, we introduced the Deterministic Finite Automaton (DFA), an abstract mathematical concept that can be used to represent a string language, i.e. the set of strings that match a given pattern. The basic idea is to build an automaton from the pattern.
Then, once the DFA is built, multiple strings can be scanned to look for matches of the pattern. The DFA is represented as a graph with a finite number of states. Each symbol in the pattern corresponds to a state, and every matched symbol sends the automaton to a new state. The DFA has a start and an end state. If the end state is reached, it means that the scanned string is accepted. Using a DFA, only one pass over the input string is required, making it a very efficient method for string matching. DFAs provide a deterministic pattern search, i.e. they either match the scanned sequence or not, as also happens with regular expressions.
However, in biological sequence analysis, the events associated with a sequence are often probabilistic. Examples include the classification of the subcellular localization (e.g. cytosol, nucleus, or membrane) of a protein given its sequence, the presence of protein domains within a sequence, or the binding probability of a transcription factor on the sequence of a gene promoter region.
Hidden Markov Models (HMMs) are probabilistic models that capture statistical regularities within a linear sequence. As we will see, HMMs are very well suited for the tasks mentioned above. Indeed, HMMs have been widely used in many different problems in molecular biology, including gene finding, profile searches, multiple sequence alignment, identification of regulatory sites, and the prediction of secondary structure in proteins.
Let us now examine the structure of an HMM by looking at a specific example. Consider the elements of a mature mRNA. It has a tripartite structure that starts with a 5' untranslated region (5' UTR), continues with a coding region (CDS) that will be translated into a protein, and ends with a 3' untranslated region (3' UTR) [112]. Moreover, we also know that the CDS should start with a start codon (ATG) and end with one of the three stop codons. Here, for the sake of simplicity, we will omit the codon details.
So, in this case, we have a grammatical structure that defines a mature mRNA. We can view the untranslated and CDS regions as words. The sentences of our grammar will follow certain rules, such as the fact that the 5'UTR should always precede the CDS and the 3'UTR should always come after the CDS. We can also assume that the three elements will always be present in a mature mRNA. Our goal will be to scan a sequence and label each of its nucleotide symbols with the corresponding region. A probability will be associated with every labeling point. The representation of each region or label will be called a state of the HMM.
When scanning a symbol at a given state, there is a certain probability that the next symbol will belong to the current state or not. The probabilities associated with a change of state are called transition probabilities and depend on several factors including, for instance, the typical sequence length of the corresponding region. In this case, let us consider that, when scanning a symbol in the 5'UTR, we have a 0.8 probability that the next symbol will also be part of the 5'UTR and a 0.2 probability that it will change to the CDS region. We can assign probabilities to the other regions in a similar manner. To model our HMM we will need three states, plus two additional states that define the start and the end of the sequence. The start state represents the probability of starting in any of the defined states.
It is well known that the composition of mRNA regions differs depending on the species [159]. Considering, for instance, that the G+C content is 60% in the 5'UTR, 50% in the CDS, and 30% in the 3'UTR, these compositional frequencies will determine the emission probabilities, which provide the probability of a symbol being emitted by a state.
To determine the probability of a sequence, we just need to multiply the emission probability of each symbol in the respective state by the respective transition probabilities. Consider, for instance, the string TAGTTA with 3 nucleotides within the CDS and 3 nucleotides in the 3'UTR. The probability of this sequence is given by:
\begin{align*}
& P_{CDS}(T) \times P(CDS|CDS) \times P_{CDS}(A) \times P(CDS|CDS) \times P_{CDS}(G) \\
& \quad \times P(CDS|3U) \times P_{3U}(T) \times P(3U|3U) \times P_{3U}(T) \times P(3U|3U) \times P_{3U}(A) \\
& = 0.25 \times 0.9 \times 0.25 \times 0.9 \times 0.25 \times 0.1 \times 0.35 \times 0.8 \times 0.35 \times 0.8 \times 0.35 \\
& = 3.47287 \times 10^{-5}
\end{align*}
where $P_S(N)$ is the emission probability of symbol $N$ in state $S$ and $P(S_A|S_B)$ the transition probability from state $S_A$ to state $S_B$, which can also be denoted as $P_{S_A,S_B}$.
Since the multiplication of the probabilities can lead to very small numbers, it is often more convenient to use a score based on the logarithm of the probability, multiplied by $-1$. If transition and emission probabilities are converted to logarithmic values, the score can be calculated as a sum of these components. In this case, the score will be approximately 10.27. The scanning of the input sequence to calculate its probability or score can also be viewed as a sequence generation process by the HMM.
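To make this calculation concrete, the following short sketch computes the probability of TAGTTA along the given state path and the corresponding log score. The dictionaries encode the emission and transition values from the example; all variable names are our own choices.

```python
import math

# Emission probabilities from the example: uniform composition in the CDS,
# 30% G+C content in the 3'UTR; transition values as given in the text.
e = {"CDS": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
     "3U":  {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}
t = {("CDS", "CDS"): 0.9, ("CDS", "3U"): 0.1, ("3U", "3U"): 0.8}

seq = "TAGTTA"
path = ["CDS", "CDS", "CDS", "3U", "3U", "3U"]

prob = e[path[0]][seq[0]]                  # emission of the first symbol
for i in range(1, len(seq)):
    prob *= t[(path[i - 1], path[i])] * e[path[i]][seq[i]]

score = -math.log(prob)                    # log score (natural logarithm)
print(prob)                # ~3.47287e-05
print(round(score, 2))     # ~10.27
```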
In the generation view, at a given state a symbol is emitted according to the emission probabilities of that state. The transition to the next state will depend on the distribution of transition probabilities of the current state. Thus, we can consider that the HMM generates two strings of information: a visible sequence of symbols emitted at each state, called the observed sequence, and the state path, the hidden sequence of states corresponding to the labels of each symbol. The above probability calculation follows a process known as a Markov chain, which defines a sequence of events based on a finite set of states with a serial dependence, i.e. the next state depends only on the current state.
Similarly to a deterministic finite automaton, an HMM is also defined by five different elements:
• Alphabet of symbols, $A = \{a_1, a_2, \ldots, a_k\}$.
• Set of states, $S = \{S_1, S_2, \ldots, S_n\}$.
• Initial state probabilities, $I(S_i)$, giving the probability of being at state $S_i$ at instant $t = 0$.
• Emission probabilities, with $e_{i,a}$ as the probability of emitting symbol $a$ in state $i$, with $\sum_{a \in A} e_{i,a} = 1$.
• Transition probabilities, with $T_{i,j}$ as the probability of changing from state $i$ to state $j$ (including itself), with $\sum_{j \in S} T_{i,j} = 1$ for every $i \in S$.
While the first three parameters are represented as uni-dimensional vectors or lists, the emission and transition probabilities are bi-dimensional matrices. The parameters of an HMM are frequently denoted as $\theta$. In certain situations one may refer only to a subset of parameters, e.g. $\theta = (e_{i,a}; T_{i,j})$ if only the emission and transition probabilities are relevant for the context.
Additionally, an observed sequence of length $T$ will be denoted as $O = o_1, o_2, \ldots, o_T$. Each $o_t$ denotes the symbol observed at position $t$. The sequence of states or state path will be denoted as $\pi$, with $\pi_t$ being the state corresponding to position $t$. We will use the big $P$ notation with square brackets to refer to a conditional probability. For instance, $P[O, \pi|M]$ refers to the probability of observing the sequence $O$ and state path $\pi$ given the model $M$.
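To make this notation concrete, the parameters $\theta$ of the mRNA example (summarized in Fig. 12.2) can be sketched as plain Python dictionaries. The state names 5, C, and 3 and all variable names are our own choices; the initial probabilities anticipate the values given below.

```python
# A minimal sketch of the parameters (theta) of the mRNA example, kept as
# plain dictionaries; names are illustrative, not a fixed API.
states = ["5", "C", "3"]          # 5'UTR, CDS, 3'UTR
alphabet = ["A", "C", "G", "T"]

init = {"5": 1.0, "C": 0.0, "3": 0.0}          # I(S_i)

# Emission probabilities derived from the G+C contents in the text
# (5'UTR: 60%, CDS: 50%, 3'UTR: 30%), split evenly between G/C and A/T.
emission = {"5": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
            "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
            "3": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}}

# Transition probabilities T_{i,j}; the remaining 0.2 mass in state 3 is
# assumed to go to the (implicit) end state.
transition = {"5": {"5": 0.8, "C": 0.2, "3": 0.0},
              "C": {"5": 0.0, "C": 0.9, "3": 0.1},
              "3": {"5": 0.0, "C": 0.0, "3": 0.8}}
```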
Fig. 12.1 depicts the HMM structure for a simplified version of mature mRNA region labeling. Consider initial state probabilities of $I(5) = 1$, $I(C) = 0$, and $I(3) = 0$.
Figure 12.1: HMM scheme to detect three gene regions. (Top) Division of a gene into three regions: 5' and 3' untranslated regions flanking the coding region (CDS). (Middle) Schematic representation of an HMM with 5 states to detect the 3 gene regions, including transition and emission probabilities. (Bottom) Example of a segment of a DNA sequence and the corresponding state of each symbol.
For a possible state transition sequence 55CCCC33 and an emitted sequence ATCGCGAA, we can calculate the corresponding probabilities as follows:
\begin{align*}
P_T[55CCCC33] &= I(5) \times T_{5,5} \times T_{5,C} \times T_{C,C} \times T_{C,C} \times T_{C,C} \times T_{C,3} \times T_{3,3} \\
&= 1 \times 0.8 \times 0.2 \times 0.9 \times 0.9 \times 0.9 \times 0.1 \times 0.8 \approx 0.0093312
\end{align*}

and

\begin{align*}
P_e[ATCGCGAA] &= 0.2 \times 0.2 \times 0.25 \times 0.25 \times 0.25 \times 0.25 \times 0.35 \times 0.35 \approx 1.9141 \times 10^{-5}
\end{align*}

Thus:

\begin{align*}
P[e(ATCGCGAA) \wedge T(55CCCC33)] = 0.0093312 \times 1.9141 \times 10^{-5} \approx 1.79 \times 10^{-7}
\end{align*}

The tasks that can be solved by an HMM can be divided into three main groups: scoring or probability calculation, decoding, and learning. These functions can be applied to a specific state path or to the universe of all possible state paths.
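Before going through each group, note that the joint probability just calculated can be written as a small function over the dictionaries sketched earlier; the function name is our own.

```python
# Joint probability P[O, pi | M] of an observed sequence and a state path,
# reusing the init/emission/transition dictionaries sketched above.
def joint_probability(obs, path, init, emission, transition):
    prob = init[path[0]] * emission[path[0]][obs[0]]
    for i in range(1, len(obs)):
        prob *= transition[path[i - 1]][path[i]] * emission[path[i]][obs[i]]
    return prob

print(joint_probability("ATCGCGAA", "55CCCC33", init, emission, transition))
# ~1.79e-07
```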
Scoring functions provide the calculation of the probabilities of an observed sequence. For an observed sequence $O$ and a state path $\pi$, the joint probability of $O$ and $\pi$ under the HMM $M$ is $P[O, \pi|M]$. This can be calculated as in the example above, where we obtain the joint product of emission and transition probabilities. If no state path is provided, the goal is to calculate the total probability of the sequence, $P(O|M)$. This probability can be obtained by summing the probabilities of all possible paths, which is an exponentially large number of paths, but it can be efficiently calculated with the Forward algorithm. This algorithm provides the probability that the HMM $M$ will output a sequence $o_1 \ldots o_t$ and reach the state $S_i$ at position $t$. The Forward probability is denoted as $\alpha_t(S_i)$ and equivalently as $P[O = o_1 \ldots o_t, \pi_t = S_i \mid M]$.
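A compact sketch of the Forward recursion, under the same dictionary representation; it omits the log-space or scaling tricks used in practice to avoid numerical underflow.

```python
# Forward algorithm sketch: alpha[t][s] = P[o_1..o_t, pi_t = s | M].
def forward(obs, states, init, emission, transition):
    alpha = [{s: init[s] * emission[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({s: emission[s][obs[t]] *
                         sum(alpha[t - 1][r] * transition[r][s] for r in states)
                      for s in states})
    return alpha

alpha = forward("ATCGCGAA", states, init, emission, transition)
print(sum(alpha[-1].values()))    # total probability P(O | M) over all paths
```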
Another important calculation is the Backward probability. This provides the probability of being at state $S_i$ at position $t$ and generating the remainder of the observed sequence, $o_{t+1} \ldots o_T$. The Backward probability is denoted as $\beta_t(S_i)$ and equivalently as $P[O = o_{t+1} \ldots o_T \mid \pi_t = S_i, M]$.
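A matching Backward sketch, using the convention that $\beta_t(S_i)$ covers the symbols after position $t$, which is what makes the posterior formula below consistent.

```python
# Backward algorithm sketch: beta[t][s] = P[o_{t+1}..o_T | pi_t = s, M].
def backward(obs, states, emission, transition):
    n = len(obs)
    beta = [{s: 1.0 for s in states} for _ in range(n)]   # beta at the last position is 1
    for t in range(n - 2, -1, -1):
        for s in states:
            beta[t][s] = sum(transition[s][r] * emission[r][obs[t + 1]] * beta[t + 1][r]
                             for r in states)
    return beta
```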
Decoding refers to the process of determining the most likely underlying sequence of states given an observed sequence. In our previous example, this refers to the task of determining the state 5U, CDS, or 3U for each of the observed sequence symbols. Given an HMM $M$ and an observed sequence $O$, the task of the decoder is to find the most probable state path $\pi$. Here, we will apply a dynamic programming algorithm, called the Viterbi algorithm, to find the path with the maximum score over all paths. A trace-back function can be used to keep track of the best predecessor of each state and to recover the most probable path.
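A sketch of the Viterbi recursion with trace-back over the same representation; a real implementation would sum log probabilities instead of multiplying raw probabilities.

```python
# Viterbi sketch: most probable state path for an observed sequence.
def viterbi(obs, states, init, emission, transition):
    V = [{s: init[s] * emission[s][obs[0]] for s in states}]   # best path scores
    back = [{}]                                                # best predecessors
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r] * transition[r][s])
            V[t][s] = V[t - 1][prev] * transition[prev][s] * emission[s][obs[t]]
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):       # trace-back of best transitions
        path.append(back[t][path[-1]])
    return "".join(reversed(path)), V[-1][last]

print(viterbi("ATCGCGAA", states, init, emission, transition))
```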
Another relevant question is to determine the total probability that symbol $o_t$ is emitted by state $k$, when considering all state paths. This is called the posterior probability and is denoted as $P[\pi_t = k|O, M]$. This probability is given by the product of the Forward probability $\alpha_t(k)$ and the Backward probability $\beta_t(k)$, divided by the probability of $O$, $P(O)$.
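Combining the two functions above gives the posterior probability of each state at each position; again a sketch, with $P(O)$ taken from the Forward pass.

```python
# Posterior probabilities: P[pi_t = k | O, M] = alpha_t(k) * beta_t(k) / P(O).
obs = "ATCGCGAA"
alpha = forward(obs, states, init, emission, transition)
beta = backward(obs, states, emission, transition)
p_obs = sum(alpha[-1].values())                # P(O | M)
posterior = [{s: alpha[t][s] * beta[t][s] / p_obs for s in states}
             for t in range(len(obs))]
print(posterior[3])     # state probabilities at the fourth position
```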
In our mRNA HMM example, we have defined a priori the parameters of the model, which we could have determined either using previous knowledge or by estimating them from observed data. If during the analysis process we acquire more data, we may want to adjust these parameters and optimize our model. By analyzing a given dataset, we can also learn these parameters. Our goal is to find the parameters $\theta = (e_{i,a}; T_{i,j})$ that maximize the probability of the observed sequence under the model $M$, $P(O|M)$. If the observed sequence has been labeled, i.e. the state path is known, then supervised learning is performed. In this case, the emission and transition probabilities are directly inferred from the training data. If the training sequences are unlabeled, then unsupervised learning is performed.
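For the supervised case, a sketch of direct estimation by relative counting over labeled training sequences; function and variable names are ours, and in practice pseudo-counts would be added to avoid zero probabilities.

```python
# Supervised estimation sketch: infer emission and transition probabilities
# by counting over labeled sequences (known state paths).
def estimate_parameters(sequences, paths, states, alphabet):
    e_cnt = {s: {a: 0 for a in alphabet} for s in states}
    t_cnt = {s: {r: 0 for r in states} for s in states}
    for obs, path in zip(sequences, paths):
        for i, (symbol, state) in enumerate(zip(obs, path)):
            e_cnt[state][symbol] += 1
            if i > 0:
                t_cnt[path[i - 1]][state] += 1
    emission = {s: {a: c / max(1, sum(e_cnt[s].values()))
                    for a, c in e_cnt[s].items()} for s in states}
    transition = {s: {r: c / max(1, sum(t_cnt[s].values()))
                      for r, c in t_cnt[s].items()} for s in states}
    return emission, transition

# Example with the labeled sequence used above:
print(estimate_parameters(["ATCGCGAA"], ["55CCCC33"], states, alphabet))
```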
Here, an Expectation-Maximization algorithm called Baum-Welch will be used for learning. Briefly, the algorithm starts with a best guess for the parameters $\theta$ of the model $M$. It then estimates the expected emission and transition counts from the data and updates the model according to these new values. It iterates on this procedure until the parameters $\theta$ of the model $M$ converge.
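A simplified single-sequence Baum-Welch sketch, built on the forward and backward functions above; it has no scaling or log-space arithmetic and no convergence test, both of which a real implementation needs.

```python
# Baum-Welch (EM) sketch: iteratively re-estimate theta to increase P(O | M).
def baum_welch(obs, states, alphabet, init, emission, transition, n_iter=10):
    for _ in range(n_iter):
        alpha = forward(obs, states, init, emission, transition)
        beta = backward(obs, states, emission, transition)
        p_obs = sum(alpha[-1].values())
        # E-step: expected state occupancies (gamma) and transitions (xi)
        gamma = [{s: alpha[t][s] * beta[t][s] / p_obs for s in states}
                 for t in range(len(obs))]
        xi = [{(r, s): alpha[t][r] * transition[r][s] *
                       emission[s][obs[t + 1]] * beta[t + 1][s] / p_obs
               for r in states for s in states}
              for t in range(len(obs) - 1)]
        # M-step: update the parameters from the expectations
        init = {s: gamma[0][s] for s in states}
        transition = {r: {s: sum(x[(r, s)] for x in xi) /
                             sum(gamma[t][r] for t in range(len(obs) - 1))
                          for s in states} for r in states}
        emission = {s: {a: sum(g[s] for g, o in zip(gamma, obs) if o == a) /
                           sum(g[s] for g in gamma)
                        for a in alphabet} for s in states}
    return init, emission, transition
```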
Figure 12.2: Parameters for an example HMM with three states, including the initial, transition, and emission probability matrices. The 5'UTR is represented by the letter 5, the CDS by M, and the 3'UTR by 3.