SEQUENCE ANALYSIS AND FURTHER DISCUSSION

Now we know how to find an optimal alignment. A major concern when interpreting alignment results is whether the similarity between sequences is biologically significant. Good alignments can occur by chance alone, and many chance mechanisms are involved in the creation of these data. How do we know whether an alignment is biologically meaningful, especially when the similarity is only marginal? Here we present two approaches. One is the classical approach, based on traditional statistics: calculating the probability of obtaining by chance a match score greater than the value observed. The other is the Bayesian approach, based on a comparison of models.

Under even the simplest random models and scoring systems, very little is known about the random distribution of optimal global alignment scores (Deken, 1983). Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions (Reich et al., 1984), but these cannot be generalized easily.

Therefore, one of the few methods available for assessing the statistical significance of a particular global alignment is to generate many random sequence pairs of the appropriate length and composition, and calculate the optimal alignment score for each (Altschul and Erickson, 1985; Fitch, 1983).
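As a rough sketch of this Monte Carlo step (assuming a simple independent-residue composition model; the lengths, composition, and number of random pairs below are purely illustrative), one might generate the random pairs as follows. Scoring each pair is left to whatever aligner is being assessed; a compact global-alignment scorer is sketched later, under the discussion of scoring functions.

    import random

    def random_sequence(length, composition):
        """Sample residues independently according to the given background
        frequencies, producing a random sequence of the requested length."""
        residues = list(composition.keys())
        weights = list(composition.values())
        return "".join(random.choices(residues, weights=weights, k=length))

    # Illustrative background composition and lengths matching the real pair.
    background = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}
    random_pairs = [(random_sequence(250, background), random_sequence(300, background))
                    for _ in range(100)]

    # Score each pair with the aligner under study to obtain a null
    # distribution of optimal alignment scores (scoring routine not shown).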

Although it is then possible to express the score of interest in terms of standard deviations from the mean, it is a mistake to assume that the relevant distribution is normal and to convert this Z-value into a P-value; the tail behavior of global alignment scores is unknown. The most that one can say reliably is that if 100 random alignments have scores inferior to the alignment of interest, the P-value in question is probably less than 0.01. One further pitfall to avoid is exaggerating the significance of a result found among multiple tests. When many alignments have been generated (e.g., in a database search), the significance of the best must be discounted accordingly. An alignment with a P-value of 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.
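A minimal sketch of turning such random scores into an empirical P-value, and of discounting for multiple independent trials, might look like the following; the helper names and the add-one smoothing are choices made here for illustration, not prescribed by the text.

    def empirical_p_value(observed_score, null_scores):
        """Fraction of random-alignment scores at least as large as the observed
        score. If every null score falls below it, the P-value is only bounded:
        roughly less than 1 / (number of random alignments)."""
        exceed = sum(1 for s in null_scores if s >= observed_score)
        return (exceed + 1) / (len(null_scores) + 1)   # add-one keeps the estimate conservative

    def discounted_p_value(p_single, num_trials):
        """Adjust a single-trial P-value for having picked the best result
        among many independent trials (e.g., a database search)."""
        return 1.0 - (1.0 - p_single) ** num_trials

    # 0.0001 in a single trial becomes roughly 0.1 when it was selected as
    # the best of 1000 independent trials.
    print(discounted_p_value(0.0001, 1000))   # ~0.095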

However, unlike those of global alignments, the statistics of local alignment scores are well understood. This is particularly true for local alignments lacking gaps, which we consider first. Such alignments were precisely those sought by the original BLAST database search programs (Altschul et al., 1990). A local alignment without gaps consists simply of a pair of equal-length segments, one from each of the two sequences being compared. A modification of the Smith–Waterman (1981) or Sellers (1984) algorithms will find all segment pairs whose scores cannot be improved by extension or trimming. These are called high-scoring segment pairs (HSPs). To analyze how high a score is likely to rise by chance, a model of random sequences is needed. For proteins, the simplest model chooses the amino acid residues in a sequence independently, with specific background probabilities for the various residues. Additionally, the expected score for aligning a random pair of amino acids is required to be negative. Were this not the case, long alignments would tend to have high scores independent of whether the segments aligned were related, and the statistical theory would break down. Just as the sum of a large number of independent, identically distributed (i.i.d.) random variables tends to a normal distribution, the maximum of a large number of i.i.d. random variables tends to an extreme value distribution (Gumbel, 1958). (We elide the many technical points required to make this statement rigorous.) In studying optimal local sequence alignments, we are essentially dealing with the latter case (Dembo et al., 1994; Karlin and Altschul, 1990). In the limit of sufficiently large sequence lengths m and n, the statistics of HSP scores are characterized by two parameters, K and λ (lambda). Most simply, the expected number of HSPs with a score of at least S is given by the formula E = Kmn e^(-λS). We call this the E-value for the score S.

This formula makes eminently intuitive sense. Doubling the length of either sequence should double the number of HSPs attaining a given score. Also, for an HSP to attain the score 2x, it must attain the score x twice in a row, so one expects E to decrease exponentially with the score. The parameters K and λ can be thought of simply as natural scales for the search space size and the scoring system, respectively.
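In code, the E-value formula, together with the standard Karlin–Altschul conversion of E into a P-value (treating the number of HSPs attaining the score as Poisson-distributed), might be sketched as follows. The values of K and λ below are purely illustrative; in practice they depend on the scoring system and residue composition and are estimated by the search program (BLAST reports them).

    import math

    def hsp_expect(K, lam, m, n, S):
        """Expected number of HSPs scoring at least S between random sequences
        of lengths m and n: E = K * m * n * exp(-lambda * S)."""
        return K * m * n * math.exp(-lam * S)

    def hsp_p_value(E):
        """Probability of at least one HSP with score >= S, treating the number
        of such HSPs as Poisson-distributed: P = 1 - exp(-E)."""
        return 1.0 - math.exp(-E)

    # Illustrative parameter values only (K and lambda are not universal constants).
    E = hsp_expect(K=0.13, lam=0.32, m=250, n=10**7, S=73)
    print(E, hsp_p_value(E))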

The other approach is based on a comparison of models.

1. Hidden Markov model (HMM). This is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis: for example, for pattern recognition applications. In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore, the sequence of tokens generated by an HMM gives some information about the sequence of states. Hidden Markov models are especially well known for their application in temporal pattern recognition, such as speech, handwriting, gesture recognition, musical score following, partial discharges, and bioinformatics.
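As a minimal illustration of how an HMM assigns a probability to an observed token sequence, the sketch below applies the standard forward algorithm to a hypothetical two-state DNA model; all state names and probabilities are invented for illustration.

    def forward(observations, states, start_p, trans_p, emit_p):
        """Forward algorithm: total probability that the HMM generated the
        observed token sequence, summing over all hidden state paths."""
        # alpha[s] = P(observations so far, current state = s)
        alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
        for obs in observations[1:]:
            alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][obs]
                     for s in states}
        return sum(alpha.values())

    # Hypothetical two-state model emitting DNA symbols: a GC-rich state and an
    # AT-rich state. The hidden state is never observed, only the emitted bases.
    states  = ("GC_rich", "AT_rich")
    start_p = {"GC_rich": 0.5, "AT_rich": 0.5}
    trans_p = {"GC_rich": {"GC_rich": 0.9, "AT_rich": 0.1},
               "AT_rich": {"GC_rich": 0.1, "AT_rich": 0.9}}
    emit_p  = {"GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
               "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

    print(forward("GGCACTGAA", states, start_p, trans_p, emit_p))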

2. Profile hidden Markov models. These have several advantages over standard profiles. Profile HMMs have a formal probabilistic basis and a consistent theory behind gap and insertion scores, in contrast to standard profile methods, which rely on heuristics. HMMs apply a statistical method to estimate the true frequency of a residue at a given position in the alignment from its observed frequency, whereas standard profiles use the observed frequency itself to assign the score for that residue. This means that a profile HMM derived from only 10 to 20 aligned sequences can be equivalent in quality to a standard profile created from 40 to 50 aligned sequences. In general, producing good profile HMMs requires less skill and manual intervention than does producing good standard profiles.
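The idea of estimating a residue's "true" frequency at a position from its observed frequency is commonly implemented with pseudocounts that mix observed counts with background frequencies. The sketch below uses a simple pseudocount scheme with invented, truncated background values; a real profile HMM would use all 20 amino acids and more careful priors.

    def column_frequencies(observed_counts, background, pseudocount_weight=5.0):
        """Estimate residue frequencies for one alignment column by mixing the
        observed counts with background frequencies (simple pseudocounts), so
        that residues unseen in a small alignment still get nonzero weight."""
        total = sum(observed_counts.values())
        return {r: (observed_counts.get(r, 0) + pseudocount_weight * background[r])
                   / (total + pseudocount_weight)
                for r in background}

    # A column observed in only 10 aligned sequences: 7 Leu, 3 Ile, nothing else.
    background = {"L": 0.10, "I": 0.06, "V": 0.07, "F": 0.04}   # truncated toy background
    print(column_frequencies({"L": 7, "I": 3}, background))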

3. Pattern discovery. Given a sequence of data such as a DNA or amino acid sequence, a motif or pattern is a repeating subsequence. Such repeated subsequences often have important biological significance, and hence discovering such motifs in various biological databases turns out to be a very important problem in computational biology. Of course, in biological applications the various occurrences of a pattern in the given sequence may not be exact, so it is important to be able to discover motifs even in the presence of small errors. Various tools are now available for carrying out automatic pattern discovery. This is usually the first step toward a more sophisticated task such as gene finding in DNA or secondary structure prediction in protein sequences at the system level.
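A small building block of such tools is inexact matching: finding where a candidate motif occurs with at most a few substitutions. A minimal sketch, with a toy sequence and motif chosen only for illustration, might be:

    def approximate_occurrences(sequence, motif, max_mismatches):
        """Start positions where the motif occurs with at most max_mismatches
        substitutions (Hamming distance), allowing for small errors."""
        k = len(motif)
        hits = []
        for i in range(len(sequence) - k + 1):
            window = sequence[i:i + k]
            mismatches = sum(1 for a, b in zip(window, motif) if a != b)
            if mismatches <= max_mismatches:
                hits.append(i)
        return hits

    # Occurrences of the motif "TATAAT" with up to one substitution.
    print(approximate_occurrences("CGTATACTGGCTATAATGC", "TATAAT", 1))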

4. Scoring functions. The choice of a scoring function that reflects biological or statistical observations about known sequences is important in producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM (point accepted mutation) matrices, originally defined by Margaret Dayhoff and sometimes referred to as Dayhoff matrices, explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as BLOSUM (blocks substitution matrix), encodes empirically derived substitution probabilities (Durbin et al., 1998). Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or to expand them to detect more divergent sequences (Durbin et al., 1998). Gap penalties account for the introduction of a gap (in the evolutionary model, an insertion or deletion mutation) in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the rate expected for such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function. It can be very useful and instructive to try the same alignment several times with different choices of scoring matrix and/or gap penalty values, and to compare the results. Regions where the solution is weak or nonunique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.
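As an illustration of rerunning the same alignment under different scoring choices, here is a compact Needleman–Wunsch scorer whose substitution scores and linear gap penalty are parameters. The +1/-1 match/mismatch scores are toy values, not entries from a PAM or BLOSUM matrix; a real protein alignment would look residue pairs up in such a matrix instead.

    def global_alignment_score(a, b, substitution, gap_penalty):
        """Needleman-Wunsch score with a linear gap penalty; substitution is a
        function giving the score for aligning two residues."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):
            score[i][0] = score[i - 1][0] - gap_penalty
        for j in range(1, cols):
            score[0][j] = score[0][j - 1] - gap_penalty
        for i in range(1, rows):
            for j in range(1, cols):
                score[i][j] = max(score[i - 1][j - 1] + substitution(a[i - 1], b[j - 1]),
                                  score[i - 1][j] - gap_penalty,
                                  score[i][j - 1] - gap_penalty)
        return score[-1][-1]

    # Toy substitution scores: +1 for a match, -1 for a mismatch.
    simple = lambda x, y: 1 if x == y else -1

    # Rerunning with different gap penalties shows how sensitive the result
    # is to the scoring choices.
    for gap in (1, 2, 4):
        print(gap, global_alignment_score("GATTACA", "GCATGCT", simple, gap))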

5. Structural alignments. These are usually specific to protein and sometimes RNA sequences, and use information about the secondary and tertiary structure of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can be used only for sequences whose corresponding structures are known (usually through x-ray crystallography or NMR spectroscopy). Because both protein and RNA structure is more evolutionarily conserved than sequence (Chothia and Lesk, 1986), structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.

Structural alignments are used as the gold standard in evaluating alignments for homology-based protein structure prediction (Zhang and Skolnick, 2005) because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information.

However, structural alignments clearly cannot be used in structure prediction itself, because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.

CHALLENGES AND PERSPECTIVES

We are witnessing the emergence of the "data-rich" era in biology. The myriad data available, ranging from sequence strings to complex phenotypic and disease-relevant data, pose a huge challenge to modern biology. The standard paradigm in biology of "hypothesis to experimentation (low-throughput data) to models" is gradually being replaced by "data to hypothesis to models and experimentation to more data and models." And unlike data in the physical sciences, those in the biological sciences are almost guaranteed to be highly heterogeneous and incomplete. To make significant advances in this data-rich era, it is essential that there be robust data repositories that allow interoperable navigation, query, and analysis across diverse data; a plug-and-play tools environment that will facilitate seamless interplay of tools and data; and versatile user interfaces that will allow biologists to visualize and present the results of analysis in the most intuitive and user-friendly manner. We address below several challenges posed by the enormous need for scientific data integration in biology, with specific examples and strategies. The issues that need to be addressed may include:

• Architecture of data and knowledge repositories
• Databases (flat, relational, and object-oriented; which is most appropriate?)
• The imminent need for ontologies in biology
• The middle layer (how to design it)
• Applications and integration of applications into the middle layer
• Reduction and analysis of data (the largest challenge!)
• How to integrate legacy knowledge with data
• User interfaces (Web browser and beyond)

The complex and diverse nature of biology mandates that there is no "one solution fits all" model for the issues listed above. Although there is a need for similar solutions across multiple disciplines within biology, the difficulty of dealing with context, which in some cases is everything, poses severe design challenges. For example, can a system that describes cellular signaling also describe developmental genetics? Can ontologies that span different areas (e.g., anatomy, gene and protein data, cellular biology) be compatible and connective? Can the detailed biological knowledge accrued painstakingly over decades be integrated easily with high-throughput data? These are only a few of the questions that arise in designing and building modern data and knowledge systems in biology.
