HMMs for Database Search

Proteins with related sequences and structures can be arranged into protein families. They often share similar functions and a clear evolutionary relationship. In Chapter 11 we saw that, by aligning conserved regions across multiple sequences of the same family, we can derive a position weight matrix (PWM) that captures the patterns of conservation along the positions of the alignment.

One of the most popular applications of HMMs in biological sequence analysis is the HMM-profile. HMM-profiles provide a probabilistic profile of a protein family; they resemble PWMs but offer a more flexible way to deal with insertions and deletions. The topology of an HMM-profile contains, apart from the start and end states, three groups of states: main, insertion, and deletion states. The main states model the columns of the alignment, with one main state per alignment position. The insertion states model highly variable regions of the alignment.

The deletion states, also called silent states, do not match any residue and make it possible to skip one or more columns of the alignment (corresponding to positions that are missing in some sequences).
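The topology described above can be sketched as a small data structure. The following is a minimal sketch, assuming the commonly used profile architecture in which every state at column k can move to the next main state, the insertion state of the same column, or the next deletion state. The function name `profile_hmm_topology` and the state labels (`M1`, `I0`, `D1`, ...) are hypothetical and chosen only for illustration.

```python
def profile_hmm_topology(n_columns):
    """Return (states, transitions) for a profile HMM with n_columns main states.

    Assumed architecture (illustrative, not taken from the text):
    main states M1..Mn, insertion states I0..In, deletion states D1..Dn,
    plus Start and End.
    """
    if n_columns < 1:
        raise ValueError("need at least one alignment column")

    main = [f"M{k}" for k in range(1, n_columns + 1)]
    insert = [f"I{k}" for k in range(0, n_columns + 1)]
    delete = [f"D{k}" for k in range(1, n_columns + 1)]
    states = ["Start"] + main + insert + delete + ["End"]

    transitions = {}
    for k in range(0, n_columns + 1):
        # States located at column k (column 0 holds only Start and I0).
        sources = ["Start"] if k == 0 else [f"M{k}", f"D{k}"]
        sources.append(f"I{k}")
        if k < n_columns:
            # Advance to the next column, insert at this column, or skip the next column.
            targets = [f"M{k + 1}", f"I{k}", f"D{k + 1}"]
        else:
            # After the last column only End or a trailing insertion is reachable.
            targets = ["End", f"I{k}"]
        for source in sources:
            transitions[source] = list(targets)
    return states, transitions


states, transitions = profile_hmm_topology(3)
print(len(states))          # number of states for a 3-column alignment
print(transitions["M2"])    # states reachable from the second main state
```

Under this assumption, an alignment with n columns yields 3n + 3 states. Only the main and insertion states would carry emission probabilities; the deletion states are silent.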

A typical bioinformatics application is to build an HMM-profile from a family of proteins and to search a database of sequences for other, previously unseen family members. For an HMM-profile M representing a protein family, all sequences W in a database D are scanned. Sequences with a probability or score above a threshold are potential members of the family. The probability P[W|M] can be calculated with the Forward algorithm. For further information on the theory and applications of HMM-profiles, we refer the reader to [50] and to Chapter 4, by Anders Krogh, of [6]. The Pfam database [143] contains a large collection of multiple alignments and HMM-profiles for a large number of manually verified protein families.
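The scan described above can be illustrated with a short sketch. To keep it small, it makes several simplifying assumptions: the model is a plain HMM with emitting states only (no silent deletion states, so it is not a full HMM-profile), the alphabet is DNA, and the score is a base-2 log-odds against a uniform background. The names `forward_probability` and `scan_database`, as well as the toy model and sequences, are hypothetical.

```python
import math


def forward_probability(seq, init, trans, emit):
    """P[seq | model] by the Forward algorithm (emitting states only)."""
    states = list(init)
    # alpha[s]: probability of the prefix read so far, ending in state s
    alpha = {s: init[s] * emit[s].get(seq[0], 0.0) for s in states}
    for symbol in seq[1:]:
        alpha = {
            s: emit[s].get(symbol, 0.0)
            * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def scan_database(database, model, threshold, background=0.25):
    """Return (name, score) pairs whose log-odds score exceeds threshold."""
    init, trans, emit = model
    hits = []
    for name, seq in database.items():
        if not seq:
            continue  # nothing to score
        p = forward_probability(seq, init, trans, emit)
        if p == 0.0:
            continue  # sequence cannot be generated by the model
        # Log-odds against a uniform background (an assumption of this sketch).
        score = math.log2(p) - len(seq) * math.log2(background)
        if score > threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy two-state model: "R" prefers A/T, "B" emits uniformly (illustrative values).
toy_model = (
    {"R": 0.5, "B": 0.5},
    {"R": {"R": 0.9, "B": 0.1}, "B": {"R": 0.1, "B": 0.9}},
    {
        "R": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "B": {"A": 0.25, "T": 0.25, "C": 0.25, "G": 0.25},
    },
)
toy_database = {"at_rich": "ATATATAT", "gc_rich": "GCGCGCGC"}
print(scan_database(toy_database, toy_model, threshold=1.0))
```

Multiplying raw probabilities underflows for long sequences, so a practical implementation would work in log space or rescale the forward values at each position.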

Bibliographic Notes and Further Reading

A general introduction to hidden Markov models can be found in [130], and Chapter 9 of [85] provides a very detailed algorithmic introduction. Material more oriented towards biological sequence analysis with HMMs can be found in the reference textbooks [50], [6], [84] and [19]. A quick introduction to the topic can be found in [54].

HMMs are very well suited for many tasks in biological sequence analysis. These include gene finding, profile search, multiple sequence alignment, regulatory site identification, secondary structure prediction, and copy-number detection. Here, we refer the reader to a few examples of applications on each of these topics. The list is certainly not exhaustive, and many other interesting works can be found elsewhere. In gene finding, we highlight the HMMgene [90] and GENSCAN [31] methods. For HMM-profiles, we refer to the seminal work of David Haussler, Anders Krogh and colleagues [73,91] and to an extensive review by Sean Eddy [53]; the theory behind HMM-profiles can be found in [19,50] and Chapter 4 of [6].

For multiple sequence alignment with HMMs, we refer to the work of Sean Eddy [52], and for prediction of secondary structure from sequence, to [14,86]. HMMs have also been applied to detect copy-number alterations from genome sequencing data [152].

The software suite hmmer, developed by Sean Eddy, is possibly the most complete tool implementing different methods based on hidden Markov models for multiple biological sequence analysis tasks, including searching for homologous sequences in databases and multiple sequence alignment. The algorithmic details behind the methods can be found in [50], and [55–57] describe the different functionalities implemented in the package.

Exercises and Programming Projects

1. Adapt the code to convert probabilities into scores by taking the log2 of each probability and summing instead of multiplying.

2. The probability of a sequence depends directly on its length. The log-odds measure is generally a more convenient way to represent the score of a sequence. For an observed sequence O of length T, the measure can be defined as: $\mathrm{logodds}(O) = \log \frac{P[O]}{0.25^{T}} = \log P[O] - T \times \log 0.25$. The value 0.25 is derived from a background model in which all nucleotides are expected to occur with equal frequency.

3. Adapt the code of the Viterbi algorithm to keep track of all possible paths, returning them in addition to the optimal path.
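As a quick numerical check of the definitions used in exercises 1 and 2, the following sketch shows that summing log2 values equals the log2 of the product, and evaluates the log-odds formula for a short sequence. It assumes base-2 logarithms throughout; the function names `log2_score` and `log_odds` and the probability values are hypothetical and are not tied to the HMM code of the chapter.

```python
import math


def log2_score(probabilities):
    """Sum of log2 probabilities; equals log2 of their product."""
    return sum(math.log2(p) for p in probabilities)


def log_odds(log2_prob, length, background=0.25):
    """log2 P[O] - T * log2(background), with T = length."""
    return log2_prob - length * math.log2(background)


# Illustrative per-position factors for a sequence of length T = 4.
factors = [0.5, 0.25, 0.5, 0.25]
score = log2_score(factors)
print(score)                             # -6.0
print(math.log2(math.prod(factors)))     # -6.0, same value from the product
print(log_odds(score, len(factors)))     # 2.0
```

A positive log-odds value indicates that the sequence is more probable under the model than under the uniform background, regardless of its length.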

Graphs: Concepts and Algorithms

In this chapter, we present the mathematical concept of a graph and its computational representation. We address some of the main algorithms over graphs and develop a set of Python classes to implement different types of graphs and the underlying algorithms. Graphs will be central in the development of algorithms for handling biological networks and genome assembly, tasks addressed in the next chapters.
