Diagnostic Imaging Methods: Medical Image Analysis Methods (Part 10)

10 Graph-Based Analysis of Amino Acid Sequences

Luciano da Fontoura Costa

Copyright 2005 by Taylor & Francis Group, LLC

CONTENTS
10.1 Introduction
10.2 Complex-Networks Concepts and Tools
10.2.1 Brief Historic Perspective
10.2.2 Basic Mathematical Concepts
10.2.2.1 Graph Theory Basics
10.2.2.2 Probabilistic Concepts
10.2.2.3 Random Graph Models
10.2.2.4 Small-World and Scale-Free Models
10.3 Complex-Networks Approaches to Bioinformatics
10.4 Sequences of Amino Acids as Weighted, Directed Complex Networks
10.5 Results
10.5.1 Zebra Fish
10.5.2 Xenopus
10.5.3 Rat
10.6 Discussion
10.7 Concluding Remarks and Future Work
Acknowledgments
References

10.1 INTRODUCTION

One of the most essential features underlying natural phenomena and dynamical systems is the multitude of connections, implications, and causalities among the several elements and processes involved. For instance, the whole dynamics of gene activation can be understood as a highly complex network of interactions, in the sense that some genes are enhanced while others are inhibited by several environmental factors, including the current biochemical composition of the individual (such as the presence of specific genes/proteins) as well as external effects such as temperature and interaction with other individuals. Interestingly, such a network of effects extends much beyond the individual in time and space, in the sense that any living being is affected by history (i.e., evolutionary processes) and spatial interactions (i.e., ecology). Although biology can only be fully understood and explained by considering the whole of such an intricate network of effects, reductionist approaches can still provide many insights about biological phenomena that are more localized in time and space, such as the genetic dynamics during an individual lifetime or an infectious process.
The large masses of data produced by experimental work in biology, molecular biology, and genetics can only be properly organized, analyzed, and modeled by using computer concepts including databases, networks, parallel computing, and artificial intelligence, with special emphasis placed on signal processing and pattern recognition. The incorporation of such modern computer concepts and tools into biology and genetics has been called bioinformatics [1]. The applications of this new area to genetics are manifold, ranging from nucleotide analysis to animal development. Among the several signal-processing methods considered in bioinformatics [2], we have the application of Markov random fields to model the sequences of nucleotides, the use of correlation and covariance to characterize sequences of nucleotides and amino acids, and wavelets [2, 3].

One particularly important problem concerns the analysis of proteins, the basic blocks of life [4, 5]. Constituted by sequences of amino acids, proteins participate in all vital processes, acting as catalysts; providing the mechanical scaffolding for cells, organs, and tissues; and participating in DNA expression. Proteins are polymers of amino acids, determined from the DNA through the process of protein expression. Many of the properties of proteins derive from their spatial shape and electrical affinities, which are both defined by the specific sequences of constituent amino acids [4, 5]. Therefore, given the sequence of amino acids specified by the DNA, the protein folds into specific forms while taking into account the interactions between the amino acids and the external influence of chaperones. It remains an open problem how to determine the structural properties of proteins from the respective amino acid sequences, a problem known as protein folding [4, 5].
Except for some basic motifs, such as alpha-helices and beta-sheets, which are structures that appear repeatedly in proteins, the prediction of protein shape remains an intense research area. Experimentally, the sequences of amino acids underlying proteins can be obtained by using sequencing machines capable of reading the nucleotides, which are subsequently translated into amino acids by considering triples of nucleotides, the so-called codons, translated according to the genetic code [3–5].

By being inherently oriented toward representing connections and implications, graphs stand out as one of the most general and interesting data structures that can be used to represent biological systems. Basically, a graph is a representational structure composed of nodes, which are connected through directed or undirected edges. Any structure or phenomenon can be represented to varying degrees of completeness in terms of graphs, where each node would correspond to an aspect of the phenomenon and the edges to interactions. Such a potential for representation and modeling is greatly extended by the many types of graphs, including those with weighted edges, different types of coexisting nodes or edges, and hypergraphs, to name only a few. Interestingly, most biological phenomena can be properly represented in terms of graphs, including gene activation, metabolic networks, evolution (recall that hierarchical structures such as trees are special kinds of graphs), ecological interactions, and so on.

However, despite the natural potential of graphs for representing and studying natural phenomena, their application was timid until the recent advent of the area of complex networks. One possible reason is that graphs had often been understood as representations of static interactions, in the sense that the connections between nodes were typically assumed not to change with time.
Thus, the uses of graphs in biology, for instance, were mainly constrained to representing evolutionary hierarchies (in terms of trees) and metabolic networks. This situation underwent an important recent change sparked mainly by the pioneering developments in random networks by Rapoport [6] and Erdős and Rényi [7], the small-world models of Watts and Strogatz [8], and the scale-free networks of Barabási [9]. The research of such types of complex graphs became united under the name of complex networks [10–12]. Now, in addition to the inherent potential of graphs to nicely represent natural phenomena, important connections were established with dynamical systems, statistical physics, and critical phenomena, while many possibilities for multidisciplinary research were established between areas such as graph theory, statistical physics, nonlinear dynamical systems, and complexity theory.

Despite such promising perspectives, one of the often overlooked reasons why complex networks have become so important for modern science is that studies in this area tend to investigate the dynamical evolution of the graphs [10–12], which can provide key insights about the relationship between the topology and function of such complex systems. For example, one of the most interesting properties exhibited by random graphs is the abrupt appearance, as new edges are progressively added at random, of a giant cluster that dominates the graph structure and connections henceforth. Thus, in addition to being typically large (several studies in complex networks consider infinitely large graphs), the graphs were now used to model growing processes. Allied to the inherent vocation of graphs to represent connections, interactions, and causality, the possibility of modeling dynamical evolution in terms of complex networks has made this area into one of the most promising scientific concepts and tools.
The present chapter addresses how complex-network research has been applied to bioinformatics, with special attention given to the characterization and analysis of amino acid sequences in proteins. The text starts by reviewing the basic context, concepts, and tools of complex-network research and continues by presenting some of the main applications of this area in bioinformatics. The remainder of the chapter describes the more specific investigation of amino acid sequences in terms of complex networks obtained for graphs derived from subsequence strings.

10.2 COMPLEX-NETWORKS CONCEPTS AND TOOLS

10.2.1 BRIEF HISTORIC PERSPECTIVE

The beginnings of complex-network research can be traced back to the pioneering and outstanding works by Rapoport [6] and Erdős and Rényi [7], who concentrated attention on the type of networks currently known as random networks. This name is somewhat misleading in the sense that many other network models are also random. The essential property of random networks as understood in graph theory, therefore, is not merely that they are random, but that they follow a particular probabilistic model, namely the uniform random distribution [13]. In other words, given a set of N nodes, connections are established by choosing pairs of nodes according to the uniform probability density. In the case of undirected graphs, the edges are uniformly sampled out of the N(N−1)/2 possible connections. Consequently, random networks correspond to the maximum entropy hypothesis of connectivity evolution, providing a suitable null hypothesis against which several real and theoretical models can be compared and contextualized. One of the most interesting features of random networks is the fact that the progressive addition of new edges tends to abruptly form a giant, dominating cluster (or connected component) in the graph.
Such a critical transition is particularly interesting not only because it represents a sudden change of the network connectivity, but because it provides a nice opportunity for connecting graph theory to statistical physics. Indeed, the appearance of the giant cluster can be understood as a percolation of the graph, similar to critical phenomena (phase transitions) underlying the transformation of ice into water. Basically, percolation corresponds to an abrupt change of some property of the analyzed system as some parameter is continually varied. This interesting connection between graph theory and statistical physics has provided unprecedented opportunities for multidisciplinary works and applications, nicely bridging the gap between areas such as complexity analysis, which is typical of graph theory, and the study of systems involving large numbers of elements, typical in statistical physics. In addition to such an exciting perspective, random networks attracted much interest as possible models of real structures and phenomena in nature, with special emphasis given to the Internet and the World Wide Web.

After the fruitful studies of Rapoport and Erdős and Rényi, the study of large networks (note that the term complex network was not typical at those times) went through a period of continuing academic investigation followed by few applications, except for promising investigations in areas such as sociology. Indeed, one of the next important steps shaping the modern area of complex networks was the investigation of personal interactions in society, of which the 1998 work by Watts and Strogatz [8] represents the basic reference. Basically, experimental investigations regarding social contacts led to the result that the average length between any two nodes (i.e., persons) is rather small, hence the name small-world networks.
The typical mathematical model of such networks starts with a regular graph, which subsequently has a percentage of its connections rewired according to uniform probability. Although such investigations brought many insights to the area, the small-world property was later verified to be an almost ubiquitous property of complex networks. The subsequent investigations of the topological properties of the Internet and WWW performed by Albert and Barabási [9] led to the important discovery that the statistical distribution of the node degrees (i.e., the number of connections of a node) in several complex networks tends to follow a power law, indicating scale-free behavior. Unlike the random model, this property favors the appearance of nodes concentrating many of the connections, the so-called hubs. Such underlying structure has several implications, for instance regarding resilience to attack: such networks are particularly fragile under attacks directed at hubs. From then on, the developments in complex-network research boomed, covering several types of natural systems, from epidemics to economy. The interested reader is encouraged to check the excellent surveys of this area [10–12] for complementary information.

10.2.2 BASIC MATHEMATICAL CONCEPTS

This section provides a brief introductory review of basic concepts and measurements in graph theory, statistics, random graphs, and small-world and scale-free networks. Readers who are already familiar with such topics can proceed directly to Section 10.3.

10.2.2.1 Graph Theory Basics

Basically, a typical graph [14–17] in complex-network theory [10–12] involves a collection of N nodes i = 1, 2, …, N that are connected through edges (i,j) that can have weights w(i,j). Such a data structure is precisely and completely represented by the respective weight matrix W, where each entry W(j,i) represents the weight of edge (i,j). Nonexistent edges are represented as null entries in that matrix.
The adjacency matrix K of the graph is a matrix where the value 1 is assigned to an element (i,j) whenever there is an edge connecting node j to i, and 0 otherwise. An adjacency matrix can also be obtained from the weight matrix by setting each element larger than or equal to a specific threshold value T to 1, assigning 0 otherwise. Such adjacency matrices, henceforth represented as K_T, indicate the network structure defined by the weights that are at least as high as the threshold. Therefore, the adjacency matrix for high values of T can be understood as the strongest component, or "kernel," of the weighted graph. Observe that it is also possible to consider the complementary matrix of K_T with respect to K, which is defined as follows. Each element (i,j) of such a matrix, hence abbreviated as Q_T, receives value 1 iff K_T(i,j) = 0 and K(i,j) ≠ 0.

An undirected graph is characterized by undirected edges, so that K(j,i) = 1 iff K(i,j) = 1, i.e., K is symmetric. A directed graph, or digraph, is characterized by directed edges and not necessarily by a symmetric adjacency matrix. One of the most basic and interesting local features of a graph or network is the number of connections of a specific node i, which is called the node degree and often abbreviated as k_i. Observe that a directed graph has two types of such a degree, the indegree and the outdegree, corresponding to the number of incoming and outgoing edges, respectively.

Figure 10.1 illustrates the concepts introduced here with respect to an undirected graph G and a directed graph H, identifying the nodes, edges, and weights. This figure also shows the respective weight matrices W_G and W_H and adjacency matrices A_G and A_H. The degree of node 1 in G is 2, the outdegree of node 1 in H is 2, and the indegree of node 1 in H is 1. N is equal to 4 for both graphs.
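The definitions above can be illustrated with a minimal Python sketch. The two weight matrices are the ones given for graphs G and H in Figure 10.1; the helper names (`adjacency`, `complement`, `in_degrees`, `out_degrees`) and the threshold value T = 2 are assumptions introduced only for this illustration. The code follows the convention stated in the text, in which entry W(j,i) holds the weight of the edge going from node i to node j, so rows correspond to destinations and columns to sources.

```python
# Weight matrices of the undirected graph G and the directed graph H.
# Convention from the text: W[j][i] is the weight of the edge from node i
# to node j (indices are 0-based here, so node 1 of the text is index 0).
W_G = [
    [0, 2, 4, 0],
    [2, 0, 0, 0],
    [4, 0, 0, 1],
    [0, 0, 1, 0],
]
W_H = [
    [0, 0, 2, 0],
    [2, 0, 0, 0],
    [3, 0, 0, 1],
    [0, 0, 0, 0],
]


def adjacency(W, T=None):
    """Return K (any non-null weight) if T is None, otherwise K_T (weight >= T)."""
    if T is None:
        return [[1 if w != 0 else 0 for w in row] for row in W]
    return [[1 if (w != 0 and w >= T) else 0 for w in row] for row in W]


def complement(K, K_T):
    """Q_T: edges present in K that did not survive the threshold."""
    return [
        [1 if (k == 1 and kt == 0) else 0 for k, kt in zip(row_k, row_kt)]
        for row_k, row_kt in zip(K, K_T)
    ]


def in_degrees(K):
    """Row sums: number of edges arriving at each node."""
    return [sum(row) for row in K]


def out_degrees(K):
    """Column sums: number of edges leaving each node."""
    return [sum(col) for col in zip(*K)]


K_G, K_H = adjacency(W_G), adjacency(W_H)
K_G_T = adjacency(W_G, T=2)          # hypothetical threshold, for illustration
Q_G_T = complement(K_G, K_G_T)
```

With these matrices, `in_degrees(K_G)[0]`, `out_degrees(K_H)[0]`, and `in_degrees(K_H)[0]` give the degree, outdegree, and indegree of node 1 quoted in the text. For an undirected graph the row and column sums coincide, so either helper yields the node degree.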
A great part of the importance of graphs stems from their generality for representing, in an intuitive and explicit way, virtually any discrete structure while emphasizing the involved entities (nodes) and connections. Indeed, virtually every data structure (e.g., tree, queue, list) is a particular case of a graph. In addition, graphs can be used to represent the most general mesh of points used for numeric simulation of dynamic systems, from the regular orthogonal lattice used in image representation to the most intricate adaptive triangulations. As such, graphs are poised to provide one of the keys for connecting not only structure and function, but also several different biological areas and even the whole of science.

Several measurements or features have been proposed and used to express meaningful and useful global properties of the network structure. In similar fashion to feature selection in the area of pattern recognition (e.g., [13]), the choice of such features has to take into account the specific problem of interest. For instance, a problem of communication along the network needs to take into account the distance between nodes. It should be observed that, in most cases, the selected set of features is degenerate, in the sense that it is not enough to reproduce the original network structure. Therefore, great attention must be paid when deriving general conclusions based on incomplete sets of measurements, as is almost always the case. Some of the more traditional network measurements are reviewed in the following paragraphs.

A global measurement usually derived from the node degree is its average value along the whole network. Observe that, for a digraph, the average indegree and outdegree are necessarily identical, because every edge contributes exactly one outgoing and one incoming connection. The average node degree gives a first idea about the overall connectivity of the network.
Additional information about the network connectivity can be obtained from the average clustering coefficient. Given one specific node i, the immediately connected nodes are identified, and the ratio between the number of connections between them and the maximum possible value of those connections defines the clustering coefficient of node i, i.e., C_i. This feature tends to express the local connectivity around each node.

Another interesting and frequently used network measurement is the length between any two nodes i and j, here denoted as L(i,j). This distance may refer either to the minimal sum of weights along a path from i to j, or to the minimal number of edges along a path between those two nodes. The present work is restricted to the latter. The respectively derived global feature is the average length considering all possible pairs of network nodes. This measurement provides an idea not only about the proximity between nodes, but also about the overall network connectivity, in the sense that low average-distance values tend to indicate a densely connected structure.

Another interesting measurement that has been used to characterize complex networks is the betweenness centrality. Roughly, the betweenness centrality of a specific network node in an undirected graph corresponds to the number of shortest paths between any pair of nodes in the network that cross that node [18].

FIGURE 10.1 Basic concepts in graph theory: examples of undirected (G) and directed (H) graphs, with respective nodes, edges, and weights. The weight matrices of G and H are W_G and W_H, and the respective adjacency matrices considering threshold T = 1 are given as A_G and A_H.

W_G =
0 2 4 0
2 0 0 0
4 0 0 1
0 0 1 0

W_H =
0 0 2 0
2 0 0 0
3 0 0 1
0 0 0 0

A_G =
0 1 1 0
1 0 0 0
1 0 0 0
0 0 0 0

A_H =
0 0 1 0
1 0 0 0
1 0 0 0
0 0 0 0
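As a hedged illustration of the clustering coefficient and of the average length based on the number of edges, the following Python sketch computes both for a small undirected graph. The example graph, the function names, and the convention of assigning a clustering coefficient of 0 to nodes with fewer than two neighbors are assumptions made for this sketch, not choices stated in the text. The average length here considers only pairs of nodes that are connected by some path.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adj, i):
    """Ratio between existing and possible connections among neighbors of i."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined cases count as 0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return links / (k * (k - 1) / 2)


def edge_lengths_from(adj, source):
    """Breadth-first search: minimal number of edges from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_length(adj):
    """Average L(i,j) over all ordered pairs of distinct, reachable nodes."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in edge_lengths_from(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical example: a triangle (0, 1, 2) with a pendant node 3.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coefficients = [clustering_coefficient(example, i) for i in sorted(example)]
average_clustering = sum(coefficients) / len(coefficients)
mean_length = average_length(example)
```

In this example, nodes 0 and 1 have all their neighbors interconnected, node 2 has one out of three possible connections among its neighbors, and node 3 has a single neighbor. Betweenness centrality is not covered by this sketch.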
10.2.2.2 Probabilistic Concepts

Any measurement whose outcome cannot be exactly predicted, such as the weight of an inhabitant of Chicago, can be represented in terms of a random variable [13, 19]. Such variables can be completely characterized in terms of the respective density functions, which can be approximated in terms of the respective relative frequency histogram. Alternatively, a random variable can also be represented in terms of its (possibly) infinite moments, including the mean, variance, and so on. Statistical density functions of special interest for this chapter include the uniform distribution, which assigns the same probability to any possible measurement, and the Poisson distribution, which is characterized in terms of a rate of event occurrence per length, area, or volume. For instance, we may have that the chance of having a failure in an electricity transmission cable is equal to one failure per 10,000 km. Therefore, the event is equally likely to be observed at any point of the considered structure (e.g., the transmission cable) along the considered parameter (e.g., length or time).

Such concepts can be immediately extended to multivariate measurements by introducing the concept of random vector. For instance, the temperature and pressure of an inhabitant of Chicago can be represented as the two-dimensional random vector (T, P). Such statistical entities are also completely characterized, in statistical terms, by their respective multivariate densities. Statistical and probabilistic concepts and techniques are essential for representing and modeling natural phenomena and biological data because of the intrinsic variation of such measurements.

10.2.2.3 Random Graph Models

The first type of complex networks to be systematically investigated were the random graphs [6, 7, 10–12, 20]. In using such graphs, one starts with N unconnected nodes and progressively adds edges between pairs of nodes chosen according to the uniform distribution.
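The growth of a random graph can be sketched in a few lines of Python. The code below is only an illustration under stated assumptions: the function name `giant_cluster_growth`, its parameters, and the use of a union-find structure to track connected components are choices made here. Edges are drawn without repetition from the N(N−1)/2 possible undirected pairs, and the size of the largest connected component is recorded after each addition.

```python
import random
from itertools import combinations


def giant_cluster_growth(n, n_edges, seed=0):
    """Add n_edges uniformly chosen undirected edges to n isolated nodes.

    Returns a list with the size of the largest connected component
    after each edge addition.
    """
    rng = random.Random(seed)
    pairs = list(combinations(range(n), 2))  # the N(N-1)/2 possible edges
    rng.shuffle(pairs)                       # uniform order, no repetition

    parent = list(range(n))                  # union-find forest
    size = [1] * n                           # component size at each root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    largest = 1
    history = []
    for a, b in pairs[:n_edges]:
        ra, rb = find(a), find(b)
        if ra != rb:                         # the edge merges two components
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra
            size[ra] += size[rb]
            largest = max(largest, size[ra])
        history.append(largest)
    return history


# Hypothetical usage: 200 nodes, up to 400 edges.
growth = giant_cluster_growth(200, 400, seed=1)
for edges in (50, 100, 200, 400):
    print(edges, growth[edges - 1])
```

Plotting the recorded sizes against the number of added edges is one simple way to observe how the largest component evolves; the specific values depend on the seed and on N.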
Although the measurements described in Section 10.2.2.1 are useful for characterizing the structure of such networks, it is also important to take into account parameters and measurements governing their dynamical evolution, including the critical phenomenon of percolation. As more connections are progressively added to a growing network, there is a definite tendency to form a giant cluster (percolation), which henceforth dominates the growing dynamics. Given a network, a cluster is understood as the set of nodes (and respective interconnecting edges) such that one can reach any node while starting from any other node in the cluster, i.e., the cluster is a connected component of the graph. The giant cluster corresponds to the cluster with the largest number of nodes at a given step of the network evolution. For an undirected random network, this phenomenon has been found to take place when the fraction of existing connections with respect to the maximum possible number of connections is about 1/N [5].

10.2.2.4 Small-World and Scale-Free Models

The types of complex networks known as small world and scale free were identified and studied years after Erdős and Rényi investigated random graphs. Small-world networks [8, 10] are characterized by short paths between any pair of their constituent nodes. A typical example of such a network is the social interactions within a given society, in the sense that there are just a few (about five or six) relations between any two persons. Characterized later than small-world models, the scale-free networks [10–12] are distinguished by the fact that the statistical distribution of the respective node degrees follows a power law, i.e., the representation of such a density in a log-log plot produces a straight line.
Such densities, unlike those observed for other types of networks, imply a substantially higher chance of having nodes of high degree, which are traditionally called hubs. As reviewed in the next section, such nodes have been identified as playing an especially important role in biological networks. Scale-free networks can be produced by using the preferential-attachment growth strategy [10–12], characterized by the progressive addition of new nodes with a fixed number of edges that are connected preferentially with nodes of higher degree, giving rise to the paradigm that has become known as "the rich get richer." At the same time, scale-free networks have also been shown to be less resilient to attacks on their highly connected nodes than other types of networks, such as random graphs [10].

10.3 COMPLEX-NETWORKS APPROACHES TO BIOINFORMATICS

Several possibilities of using complex networks and statistical physics in biology have been described and reviewed by Bose in his interesting and extensive survey [21]. Special attention is given to relationships between the network's topology and functional properties, and the following three situations are covered in considerable depth:

1. The topology of complex biological networks, such as metabolic and protein interaction
2. Nonlinear dynamics in gene expression
3. The effect of stochasticity on the network dynamics

While we review in the following some of the most representative works applying complex-network research to biology, the reader is encouraged to complement and extend our review by referring to Bose's survey.

Metabolic reactions, one of the key elements of life, were among the first to be studied by complex-network approaches. Such networks have their nodes representing the molecular compounds (or substrates), and the edges indicate the metabolic reactions connecting substrates.
Incoming links to a substrate are understood to correspond to the reactions of which that substrate is a product. The pioneering investigation by Jeong et al. [22] considered networks that are available for 43 organisms, yielding average node indegree and outdegree in the range from 2.5 to 4, with the respective distribution being understood as scale free with exponents close to 2.2. The metabolic reactions of E. coli have been studied as undirected graphs by Wagner and Fell [23], yielding an average node degree of 7 and a clustering coefficient (approximately 0.3) much larger than could be obtained for a random network.

An interesting investigation into whether the duplication of information in genomes can significantly affect the power-law exponents was reported by Chung et al. [24]. By using probabilistic methods as the means to analyze the evolution of graphs under duplication mechanisms, those authors were able to show that such mechanisms can produce networks with low power-law exponents, which are compatible with many biological networks [25].

The decomposition of biochemical networks into hierarchies of subnetworks, i.e., networks obtained by considering a subset of the nodes of the original graph and some of the respective edges, has been addressed by Holme and Huss [18]. These authors use the algorithm of Girvan and Newman [26] for tracing subnetworks, in a form adapted to bipartite representations of biochemical networks. The underlying principle of the algorithm is the fact that vertices between densely connected areas have high betweenness centrality, such that their removal leads to the partition of the whole network into subnetworks that are contained in previous clusters, thereby producing a hierarchy of subnetworks.

Another extremely important type of biological network, corresponding to genomic regulatory systems (i.e., the set of processes controlling gene expression), has also been the subject of increasing attention in complex-network research.
This type of directed network is characterized by having nodes corresponding to components of the system, with the edges representing the gene-expression regulations [11]. An important type of network in this category is that obtained from protein–protein interactions. In this type of network, each node corresponds to a protein, and the directed edges represent the interactions. A model of regulatory networks has been described by Kuo and Banzhaf [27]. A pioneering approach in this area is the work of Jeong et al. [28], which considered protein–protein interaction networks of S. cerevisiae, containing thousands of edges and nodes. The degree distribution was interpreted as following scale-free behavior with an approximate exponent of 2.5. One of the most important conclusions of that investigation was that the removal of the most-connected proteins (i.e., hubs, the nodes of a complex network receiving a large number of connections) can have disastrous effects on the proper functioning of the individual. The issue of protein–protein interaction networks has also been considered in a number of other works, including Qin et al. [29], Wagner [30], Pastor-Satorras et al. [31], and in studies of the properties and evolution of such networks. Another related work, described by Wuchty [32], considered graphs obtained by assigning a node to every protein domain (or module) and an edge whenever two such domains are found in the same protein.

The important problem of determining protein function has been addressed from the perspective of networks of physical interaction by Vazquez et al. [33]. Their method is based on the minimization of the number of interacting proteins with different categories, so that the function estimation can be performed on a global scale while considering the entire connectivity of the protein network.
The obtained results corroborate the validity of using protein–protein interaction networks as a means of inferring protein function, despite the unavoidable presence of imperfections and the incompleteness of protein networks.

The analysis of gene-expression networks in terms of embedded complex logistic maps (ECLM), a hybrid method blending some concepts from wavelets and coupled logistic maps, has been reported by Shaw [34]. That study considered 112 genes collected at nine different time instants along 25 days, with each time point being fitted to an ECLM model with high Pearson correlation coefficient, and the connections between genes were determined by considering models with high pairwise correlation. The obtained connections were interpreted as following scale-free behavior in both topology and dynamics.

A work by Bumble et al. [35] suggests that the study of pathways of network syntheses of genes, metabolism, and proteins should be extended to the investigation of the causes and treatment of diseases. Their approach involves methods capable of yielding, for a specific set of candidate reactions, a complete metabolic pathway network. Interesting results are obtained by investigating qualitative attributes, including relationships regarding the connectivity between vertices and the strength of connections, the relationship of interaction energies and chemical potentials with the coordination number of the lattice models, and how the stability of the networks is related to their topology.

An interesting approach to analyzing the amino acid sequences of a protein in terms of successive overlapping strings of length K has been described by Hao et al. [36]. The strings of amino acids are represented as graphs by associating each possible subsequence of length K to a graph node, and having the edges represent the observed successive transitions of subsequences.
Their investigation targeted the reconstruction of the original sequences from the overlapping string networks, which can be approached by counting the number of Eulerian loops (i.e., cyclic sequences of connected edges that are followed without repetition). More specifically, the sequences are reconstructed while starting with the same initial subsequence, using each of the subsequences the same number of times as observed in the original data, and respecting a fixed sequence length. It was therefore verified that the reconstruction is unique for K ≥ 5 for the majority of the considered networks (PDB.SEQ database [37]).

The present work addresses co-occurrence strings of amino acids (or any other basic biological element) similar to the scheme described in the previous paragraph, but here the subsequences do not necessarily overlap, and the number of times a subsequence is followed by another is represented by the weight of the respective edge in the associated graph, following the same scheme used for concept association as described in the literature [38, 39]. More specifically, whenever a subsequence of amino acids B is followed by another subsequence C, the weight of the edge connecting the two nodes representing those subsequences is increased by 1. Therefore, such a weighted, directed graph provides information about the number of times a specific subsequence is followed by other possible subsequences, which can be related to the statistical concept of correlation, with the difference that the order of the data is, unlike in the correlation, taken into account. As such, the obtained graph can be explored to characterize and model sequences of amino acids according to varying subsequence sizes.
Moreover, by thresholding the weight matrix for successive threshold values, it is possible to identify subgraphs of the network corresponding to a strongly connected kernel of subsequences.

10.4 SEQUENCES OF AMINO ACIDS AS WEIGHTED, DIRECTED COMPLEX NETWORKS

A protein can be specified in terms of its respective sequence of amino acids, represented by the string $S = A_1 A_2 \dots A_N$, where each element $A_i$ corresponds to one of the 20 possible amino acids, as indicated in Table 10.1. It is possible to subsume an amino acid sequence S by grouping subsequences of amino acids into new numerical codes with higher values, in a way similar to that described by Hao et al. [36]. The grouping scheme adopted in this work is illustrated in Figure 10.2, where the first and second groups contain m and n amino acids, respectively. While it is possible to consider m ≠ n, we henceforth adopt m = n. The groups are taken with an overlap of g positions, with 0 ≤ g ≤ m. For each reference position i, we have two numerical codes B and C, obtained as follows:

$$B = (A_i - 1)\,20^{m-1} + \dots + (A_{i+m-2} - 1)\,20 + A_{i+m-1} \tag{10.1}$$

and

$$C = (A_{i+m-g} - 1)\,20^{n-1} + \dots + (A_{i+m+n-g-2} - 1)\,20 + A_{i+m+n-g-1} \tag{10.2}$$

Therefore, we have that $1 \le B, C \le 20^m$.

FIGURE 10.2 The grouping scheme considered in this work, including two successive windows of size m and n, with overlap of g elements.

An example of this coding scheme is given in the following.
Let the original protein sequence in abbreviated amino acids be

S = MEQWPLLFVVALCI

or, in numerical codes,

S = (13)(6)(7)(18)(15)(11)(11)(14)(20)(20)(1)(11)(5)(10)

TABLE 10.1 Amino Acids and Respective Numerical Codes

Abbreviation   Numerical Code
A              1
R              2
D              3
N              4
C              5
E              6
Q              7
G              8
H              9
I              10
L              11
K              12
M              13
F              14
P              15
S              16
T              17
W              18
Y              19
V              20

For m = n = 2 and g = 0, we have:

i     B     C
1     246   138
2     107   355
3     138   291
4     355   211
5     291   214
6     211   280
7     214   400
8     280   381
9     400   11
10    381   205
11    11    90

For instance, at i = 1 the first window is ME = (13)(6), so that B = (13 − 1)·20 + 6 = 246, and the second window is QW = (7)(18), so that C = (7 − 1)·20 + 18 = 138.

Similarly, for m = n = 3 and g = 1, we obtain:

i     B      C
1     4907   2755
2     2138   7091
3     2755   5811
4     7091   4214
5     5811   4280
6     4214   5600
7     4280   7981
8     5600   205
9     7981   4090

Observe that the different ranges of i obtained in these two examples are a direct consequence of the fact that the larger size of the subsequences in the second example reduces the number of possible subsequence associations. Now, having defined the grouping scheme and the resulting sequences B and C, the graph representing the subsequent (with possible overlap) co-occurrences of numerical codes in this sequence is obtained as follows:

1. Each code in the sequences B and C is represented as one of the N nodes of the graph, whose number corresponds to the code produced for the respective sequence. For instance, the sequence (13)(6) implies a graph with two nodes identified as 13 and 6 containing a directed edge from node 13 to node 6. Therefore, for a given m = n, we have a maximum of $20^m$ nodes, numbered from 1 to $20^m$. Observe, however, that the resulting network does not necessarily include all possible nodes, allowing a reduction of the network size.

2. Every time a code B is followed by a code C, the weight of the edge from node B to node C is incremented by 1. In other words, the weight of the edge uniting two specific sequences B and C is equal to the number of times those two sequences are found to follow one another, in that same order, along the analyzed sequence of amino acids.

Figure 10.3 illustrates the graph obtained from the sequence (13)(6)(7)(18)(15)(11)(11)(14)(20)(20)(1)(11)(5)(10)(15)(11)(14) considering m = 1, where each node is represented by the respective code, and the edge weights (shown in italics) represent the number of successive subsequence (in this case a single amino acid) transitions. In this sense, the obtained graph represents the "unidirectional" correlations between two subsequent (with possible overlap) subsequences of amino acids in the analyzed protein. Such a network can be understood as a statistical model of the original protein for the specific correlation length implied by m and g. As such, it is possible to obtain simulated sequences of amino acids following such statistical models by performing Monte Carlo simulation over the out-degrees of each node, in the sense that each outgoing edge is taken with frequency corresponding to its respective normalized weight (i.e., the weights of the outgoing edges of a node must add up to 1). Therefore, the transition probabilities are proportional to the respective weights. Observe that the statistically normalized weight matrix of the network corresponds to a Markov chain, as the sum of any of its columns will be equal to 1.

By thresholding the weight matrix for successive values of T (see Section 10.2.2), it is possible to obtain a family of graphs that can be understood as follows. The clusters defined for the highest values of T represent the kernels of the whole weighted network, corresponding to the subsequence associations that are most representative and most frequent along the whole protein. As the threshold is lowered, these kernels are augmented by incorporation of new nodes and merging of existing clusters.
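The construction described in this section, from the codes of Equations 10.1 and 10.2 to the normalized transition weights used for Monte Carlo simulation, can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used to produce the reported results: the function names (`window_code`, `code_pairs`, `weight_matrix`, `transition_probabilities`, `simulate`) are hypothetical, m = n is assumed, and the weights are stored row-wise in dictionaries (source node first), so that the rows, rather than the columns, add up to 1.

```python
import random
from collections import defaultdict

# Numerical codes of Table 10.1 (A = 1, R = 2, ..., V = 20).
AMINO_ACIDS = "ARDNCEQGHILKMFPSTWYV"
CODE = {aa: k + 1 for k, aa in enumerate(AMINO_ACIDS)}


def window_code(codes, start, size):
    """Base-20 code of codes[start:start + size], as in Eq. (10.1)."""
    value = 0
    for a in codes[start:start + size - 1]:
        value = (value + (a - 1)) * 20
    return value + codes[start + size - 1]


def code_pairs(sequence, m, g=0):
    """List of (B, C) for two windows of size m = n overlapping by g."""
    codes = [CODE[aa] for aa in sequence]
    shift = m - g                     # offset of the second window
    last = len(codes) - (shift + m)   # last valid 0-based start position
    return [(window_code(codes, i, m), window_code(codes, i + shift, m))
            for i in range(last + 1)]


def weight_matrix(pairs):
    """weights[b][c] = number of times code b is followed by code c."""
    weights = defaultdict(lambda: defaultdict(int))
    for b, c in pairs:
        weights[b][c] += 1
    return weights


def transition_probabilities(weights):
    """Normalize the outgoing weights of each node so they add up to 1."""
    probs = {}
    for b, row in weights.items():
        total = sum(row.values())
        probs[b] = {c: w / total for c, w in row.items()}
    return probs


def simulate(probs, start, steps, rng):
    """Random walk over the chain; stops at a node with no outgoing edge."""
    path = [start]
    for _ in range(steps):
        row = probs.get(path[-1])
        if not row:
            break
        path.append(rng.choices(list(row), weights=list(row.values()))[0])
    return path


# First rows of the m = n = 2, g = 0 example given above.
print(code_pairs("MEQWPLLFVVALCI", m=2)[:3])
# [(246, 138), (107, 355), (138, 291)]
```

Applied with m = 1 to the sequence of Figure 10.3 (MEQWPLLFVVALCIPLF in abbreviated form), `weight_matrix` assigns weight 2 to the edges from 15 to 11 and from 11 to 14, and weight 1 to every other observed transition.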
Such a threshold-based evolution of the graph can be related to the evolutionary history of the protein formation, in the sense that the kernels would have appeared first and served as organizing structures around which the rest of the molecule evolved. At the same time, the strongest connections in the obtained network also reflect the repetition of basic protein motifs, such as alpha-helices and beta-sheets.

FIGURE 10.3 The network obtained for m = 1 for the amino acid sequence (13)(6)(7)(18)(15)(11)(11)(14)(20)(20)(1)(11)(5)(10)(15)(11)(14). The weights of the edges are shown in italics.

10.5 RESULTS

In the following investigations, we consider proteins from three animal species: zebra fish, Xenopus (frog), and rat. The gene sequencing data were obtained from the NIH Gene Collection repository (http://zgc.nci.nih.gov, files dr_mgc_cds_aa.fasta, xl_mgc_cds_aa.fasta, and rn_mgc_cds_aa.fasta). The raw data consisted of sequences of amino acids for the 2948, 1977, and 640 proteins (each containing an average of 400 amino acids) in each of those files. The obtained results, which consider m = n = 2 and g = 0, are presented for each species in the following subsections. The average node degree was obtained by adding all columns of the adjacency matrix. The clustering coefficient was obtained by identifying the n nodes connected to each node and dividing the number of existing edges between those nodes by n(n − 1)/2, i.e., the maximum number of edges between those nodes. The minimum distances were calculated by using Dijkstra's method [14].

10.5.1 ZEBRAFISH

The obtained 400 × 400 weight matrix (recall from the previous section that $400 = 20^m = 20^2$) had a maximum value of 487, obtained for the transition from SS to SS, and a minimum value of zero was obtained for 15,274 transitions.
The maximum weight for transition between different nodes was 170, observed for the transition from EE to ED. The performed measurements included the average node degree (Figure 10.4(a)), clustering coefficient (Figure 10.5(a)), average length (Figure 10.6(a)), and maximum cluster size (Figure 10.7(a)) for the series of thresholded matrices $K_T$ (solid lines) and $Q_T$ (dashed lines) obtained for T = 1, 2, …, 170. We also calculated the in-degree and out-degree densities, which are shown in Figure 10.8(a) and Figure 10.8(b), respectively, for T = 0. It is clear from this figure that both node degrees tend to be similar to one another, presenting a plateau for 6 < log(k) < 8.4 followed by a sharp decrease of node degree. The self-connections between nodes representing subsequences of two identical amino acids, immediately obtained from the diagonal of the respective adjacency matrices, are given in Table 10.2. The initial kernel was identified for T = 95, with the obtained digraph shown in Figure 10.9, where the edge widths correspond to the respective weights. Observe that although the original graph was thresholded at T to obtain the kernel in Figure 10.9, the graph in that figure incorporates all edges, including those with weight smaller than T, to provide a more comprehensive visualization of the obtained kernel. This fully connected digraph (except for self-connections, which were not considered in this case) presents dominance of the E, D, and S amino acids, with strong connections obtained for the node EE. The maximum weight was 170, obtained for the transition from node EE to ED.

10.5.2 XENOPUS

The weight matrix had a maximum value of 293, obtained for the transition from EE to EE, and a minimum value of zero was obtained for 22,787 transitions. The maximum weight for transition between different nodes was 207, observed for the transition from GP to LQ.
The performed measurements included the average node degree (Figure 10.4(b)), clustering coefficient (Figure 10.5(b)), average length (Figure 10.6(b)), and maximum cluster size (Figure 10.7(b)) for the series of thresholded matrices $K_T$ (solid lines) and $Q_T$ (dashed lines) obtained for T = 1, 2, …, 170. The in-degree and out-degree densities are shown in Figure 10.10(a) and Figure 10.10(b), respectively, for T = 0. Both densities again tend to be similar to one another, presenting a plateau for 6 < log(k) < 8 followed by a sharp decrease of node degree. The self-connections between nodes representing subsequences of two identical amino acids are given in Table 10.2. The initial kernel, containing nine nodes, was identified for T = 64, with the obtained digraph shown in Figure 10.11, which is dominated by the P and G amino acids.

10.5.3 RAT

The weight matrix had a maximum value of 98, obtained for the transition from LL to LL, and a minimum value of zero was obtained for 69,792 transitions. Such a large number of null transitions is a consequence of the smaller number of proteins available for this animal in the original data. The maximum weight for transition between different nodes was 35, observed for the transition from LL to AA. The performed measurements included the average node degree (Figure 10.4(c)), clustering coefficient (Figure 10.5(c)), average length (Figure 10.6(c)), and maximum cluster size (Figure 10.7(c)) for the series of thresholded matrices $K_T$ (solid lines) and $Q_T$ (dashed lines) obtained for T = 1, 2, …, 35.

FIGURE 10.4 The average node degree as a function of the weight threshold T (solid line = $K_T$, dashed line = $Q_T$) for (a) zebrafish data, (b) Xenopus, and (c) rat.
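The measurements reported for the thresholded matrices can be sketched as follows. The sketch rests on several assumptions that are not confirmed by the text: $K_T$ is taken to keep the entries whose weight is at least T, $Q_T$ is taken to be its entry-wise complement, the average degree is normalized by the number of nodes, and both the neighbourhoods used for the clustering coefficient and the clusters are computed while ignoring edge direction. All function names are hypothetical, and the toy weight matrix is invented purely for illustration.

```python
def threshold(weights, T):
    """Return (K, Q): K keeps entries with weight >= T, Q is its complement."""
    n = len(weights)
    K = [[1 if weights[i][j] >= T else 0 for j in range(n)] for i in range(n)]
    Q = [[1 - K[i][j] for j in range(n)] for i in range(n)]
    return K, Q


def average_degree(adj):
    """Sum of all entries of the adjacency matrix divided by the node count."""
    return sum(map(sum, adj)) / len(adj)


def neighbours(adj, i):
    """Nodes linked to i in either direction, ignoring self-connections."""
    return [j for j in range(len(adj)) if j != i and (adj[i][j] or adj[j][i])]


def clustering_coefficient(adj):
    """Average over nodes of (edges among the n neighbours) / (n(n - 1)/2)."""
    total = 0.0
    for i in range(len(adj)):
        nb = neighbours(adj, i)
        n = len(nb)
        if n < 2:
            continue  # such nodes contribute zero to the average
        links = sum(1 for a in range(n) for b in range(a + 1, n)
                    if adj[nb[a]][nb[b]] or adj[nb[b]][nb[a]])
        total += links / (n * (n - 1) / 2)
    return total / len(adj)


def largest_cluster(adj):
    """Size of the largest connected component (edge direction ignored)."""
    seen, best = set(), 0
    for s in range(len(adj)):
        if s in seen:
            continue
        seen.add(s)
        stack, size = [s], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in neighbours(adj, u):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


# Toy 4-node weight matrix (illustrative values only, not chapter data).
W = [[0, 5, 1, 0],
     [4, 0, 3, 0],
     [1, 2, 0, 0],
     [0, 0, 0, 0]]
for T in (1, 3):
    K, Q = threshold(W, T)
    print(T, average_degree(K), clustering_coefficient(K), largest_cluster(K))
```

Sweeping T over the range of observed weights and recording these values for both K and Q yields curves of the kind shown in Figures 10.4, 10.5, and 10.7.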
The in-degree and out-degree densities are shown in Figure 10.12(a) and Figure 10.12(b), respectively, for T = 0. Both of the resulting node degrees were again similar to one another, presenting a plateau for 4 < log(k) < 6 followed by a moderate decrease of node degree. The self-connections between nodes representing subsequences of two identical amino acids are given in Table 10.2. The initial kernel was identified for T = 22, with the obtained digraph shown in Figure 10.13. The dominant amino acids were L and A.

10.6 DISCUSSION

Although the number of proteins and the overall amino acid sequence lengths available differ among the three species, the clustering coefficient, average length, and maximum cluster size are determined from the respective adjacency matrices (not from the weights). They are therefore more significant statistically, so that we can attempt a comparison between such measurements in the case of zebra fish and Xenopus. It is clear from Figure 10.4 that, as expected, the average node degree of the graph $K_T$ decreases monotonically with the threshold value T, while the opposite happens for $Q_T$. The abrupt way in which the average node degree varies for the thresholded and complementary matrices suggests that a kind of phase transition (critical phenomenon) takes place as the values of T are increased. As shown in Figure 10.5, the average clustering coefficient for $K_T$ tends to decrease steadily with the threshold values, undergoing a relatively abrupt transition (near T = 20 for zebra fish), while the clustering coefficient of $Q_T$ increases even more abruptly near T = 10, suggesting a phase transition also for this measurement.
Generally, the local connectivity reaches less than 10% of its maximum value after just one-third of the considered T excursion, which suggests that the network connectivity is dominated by stronger connections surrounded by much smaller connection weights.

FIGURE 10.5 Average clustering coefficient as a function of the weight threshold T (solid line = $K_T$, dashed line = $Q_T$) for (a) zebra fish, (b) Xenopus, and (c) rat data.

The average lengths of $K_T$ shown in Figure 10.6 suffer from the typical problem that such distances tend to fall as a consequence of the disappearance of connections. In other words, because nonexistent edges are not considered for the average length calculation, a network containing no connections has null average length, less than for a fully connected network, for which the average length would be 1 (overlooking self-connections). In any case, the average length presents a sharp discontinuity (near T = 80 for zebra fish, 60 for Xenopus, and 20 for rat), possibly indicating that a large number of edges are cut by thresholds larger than these values. At the same time, the maximum average lengths in each case are similar and relatively small. An abrupt increase of the average length is observed for $Q_T$ for small values of T, indicating that this matrix indeed undergoes an abrupt change in its connectivity for small threshold values. The graphs in Figure 10.7 show that the maximum cluster size for $K_T$ decreases steadily for higher threshold values, as expected.
The maximum cluster size for $Q_T$ remained fixed at 400, confirming that the complementary matrix is highly connected. As indicated in Figure 10.8, Figure 10.10, and Figure 10.12, the node degree densities tend to present two distinct regions: one plateau portion at the left-hand side, followed by an abrupt descending portion at the right-hand side of the graph. While the in-degree and out-degree densities also produced similar profiles for the three species, the respective kernels identified at different threshold levels (because of the different length of the amino acid sequences) were found to be rather different, with distinct pairs of amino acids dominating each kernel. While such a result may be strongly affected by the different amounts of data available for each of the considered species, it may also suggest different fundamental structures for the amino acid sequencing in those animals.

10.7 CONCLUDING REMARKS AND FUTURE WORK

This chapter has addressed the promising perspective of using modern complex-network concepts and tools as a means of characterizing, modeling, and analyzing biological sequences, with special attention given to amino acid sequences in proteins. After presenting a brief historic perspective of complex-network research and some of its most representative applications to bioinformatics, the basic concepts of complex networks and respective topological measurements were presented. The problem of characterizing proteins in terms of weighted digraphs obtained from consecutive (with possible overlap) subsequences of amino acids was addressed next, with respect to a specific protein in zebra fish, Xenopus, and rat.
This investigation included the calculation of the average node degree, the average clustering coefficient, the average length (in number of edges), and the size of the maximum cluster in the graph for a sequence of threshold values. The obtained curves were found to provide interesting insights about the structure of the overall protein, especially regarding the appearance of critical transitions of several of the considered measurements as T was increased. In addition, kernels were identified for each case, suggesting an interesting basic organization in the amino acid sequences. Despite the different sizes of the amino acid sequences, which do imply problems of statistical meaningfulness, some interesting trends have been identified regarding the comparison of the measurements obtained for the three different species, especially the general similarity between the topological properties for each species, while completely different kernels and dominant amino acids have been identified for those cases.

FIGURE 10.6 Average length as a function of the weight threshold T (solid line = $K_T$, dashed line = $Q_T$) for (a) zebra fish, (b) Xenopus, and (c) rat data.

TABLE 10.2 Self-Connections of Subsequences Composed of Two Identical Amino Acids

Subsequence   Number of Self-Connections   Number of Self-Connections   Number of Self-Connections
              (Zebra fish)                 (Xenopus)                    (Rat)
AA            274                          126                          27
RR            85                           41                           10
DD            216                          186                          11
NN            23                           13                           2
CC            14                           3                            3
EE            467                          293                          49
QQ            216                          95                           11
GG            310                          79                           21
HH            67                           48                           0
II            8                            8                            2
LL            161                          126                          98
KK            188                          104                          29
MM            0                            4                            0
FF            24                           9                            0
PP            299                          176                          71
SS            487                          233                          61
TT            52                           16                           6
WW            0                            0                            0
YY            6                            2                            0
VV            13                           19                           3
Future extensions of this work include the consideration of other m, n, and g configurations, the use of additional structural features such as betweenness centrality as well as the ratios suggested in the literature [40, 41], and the identification of the hierarchical backbone of the directed network, as suggested in the literature [39]. It would also be possible to consider the progressive merging of nodes and connected components into the initial kernel to obtain the hierarchical structure underlying the growth of the kernel, with possible applications to the complex problem of protein folding [42]. Finally, it would be interesting to use such measurements to compare proteins (in terms of amino acids and bases) from the same or distinct individuals, as well as to infer the phylogenetic evolution of the proteins. In the case of DNA analysis, the obtained topological measurements can provide a means for distinguishing between coding and noncoding regions.

FIGURE 10.7 Maximum cluster size of $K_T$ as a function of the weight threshold T for (a) zebra fish, (b) Xenopus, and (c) rat data.

ACKNOWLEDGMENTS

The author is grateful to Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP, proc. 99/12765-2), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq, proc. 308231/03-1), and the Human Frontier Science Program for financial support.

FIGURE 10.8 Log-log plot of (a) in-degree and (b) out-degree distributions for T = 0, weighted by the intensity of the edges (zebrafish data).
FIGURE 10.9 The ten-node kernel obtained for T = 95 for zebrafish data. The weights are represented in terms of the edge widths. The maximum and minimum weights are 170 and zero, the latter corresponding to self-connections, as these have been excluded from the matrix used to obtain this picture. Nodes: EE, ED, EL, EK, LE, SD, SE, AE, DD, DE.

FIGURE 10.10 Log-log plot of (a) in-degree and (b) out-degree distributions for T = 0, weighted by the intensity of the edges (Xenopus data).

FIGURE 10.11 The nine-node kernel obtained for T = 64 for Xenopus data. The weights are represented in terms of the edge widths. Nodes: GE, GA, RG, AG, SG, PP, PG, GP, GL.

FIGURE 10.12 Log-log plot of (a) in-degree and (b) out-degree distributions for T = 0, weighted by the intensity of the edges (rat data).

FIGURE 10.13 The ten-node kernel obtained for T = 2 for the rat data. The weights are represented in terms of the edge widths. Nodes: LA, GL, EA, AL, AA, VL, LS, LF, LL, LG.

REFERENCES

1. Baldi, P. and Brunak, S., Bioinformatics, MIT Press, Cambridge, MA, 2001.
2. da F. Costa, L., Signal processing in bioinformatics, IEEE Proc. Digital Signal Process. Conf., New Jersey, 2002, pp. 23–27.
3. Durbin, R., Eddy, S., Krogh, A., and Mitchison, G., Biological Sequence Analysis, Cambridge University Press, Cambridge, U.K., 1998.
4. Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., and Watson, J.D., Molecular Biology of the Cell, 3rd ed., Garland Publishing, New York, 1994.
5. Garrett, R.H. and Grisham, C.M., Biochemistry, Saunders College Publishing, Fort Worth, TX, 1995.
6. Rapoport, A., Contribution to the theory of random and biased nets, Bull. Math. Biophys., 19, 257–277, 1957.
7. Erdös, P. and Rényi, A., On the evolution of random graphs, Publicationes Mathematicae, 6, 290–297, 1959.
8. Watts, D.J. and Strogatz, S.H., Collective dynamics of small-world networks, Nature, 393, 440–442, 1998.
9. Albert, R., Jeong, H., and Barabási, A.L., The diameter of the world-wide web, Nature, 401, 130–131, 1999.
10. Albert, R. and Barabási, A.L., Statistical mechanics of complex networks, Rev. Mod. Phys., 74, 47–97, 2002.
11. Dorogovtsev, S.N. and Mendes, J.F.F., Evolution of networks, Adv. Phys., 51, 1079–1187, 2002.
12. Newman, M.E.J., The structure and function of complex networks, SIAM Rev., 45, 167–256, 2003.
13. da F. Costa, L. and Cesar, R.M., Jr., Shape Analysis and Classification: Theory and Practice, CRC Press, Boca Raton, FL, 2001.
14. Aldous, J.M. and Wilson, R.J., Graphs and Applications: An Introductory Approach, Springer-Verlag, London, 2000.
15. West, D.B., Introduction to Graph Theory, Prentice Hall, Upper Saddle River, NJ, 2001.
16. Harary, F., Graph Theory, Addison-Wesley, Reading, MA, 1995.
17. Bollobas, B., Modern Graph Theory, Springer-Verlag, Heidelberg, 1998.
18. Holme, P., Huss, M., and Jeong, H., Subnetwork hierarchies of biochemical pathways, Bioinformatics, 19, 532–538, 2003.
19. Alon, N. and Spencer, J.H., The Probabilistic Method, Wiley Interscience, New York, 2000.
20. Bollobas, B., Random Graphs, Cambridge University Press, Cambridge, U.K., 2001.
21. Bose, I., Biological networks, available online at http://arxiv.org/abs/cond-mat/0202192, last accessed March 2005.
22. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., and Barabási, A.L., The large-scale organization of metabolic networks, Nature, 407, 651–654, 2000.
23. Wagner, A. and Fell, D.A., The small world inside large metabolic networks, Proc. R. Soc. London B, 268, 1803–1810, 2001.
24. Chung, F., Lu, L., Dewey, T.G., and Galas, D.J., Duplication models for biological networks, Journal of Computational Biology, 10(5), 677–688, 2003.
25. Aiello, W., Chung, F., and Lu, L., in Proc. 32nd Annu. ACM Symp. Theory Computing, 171–180, 2000.
26. Girvan, M. and Newman, M.E.J., Community structure in social and biological networks, Proc. Nat. Acad. Sci. USA, 99, 7821–7826, 2002.
27. Kuo, P.D. and Banzhaf, W., Small world and scale-free network topologies in an artificial regulatory network model, Journal of Biological Physics and Chemistry, 4, 85–92, 2004.
28. Jeong, H., Mason, S.P., Barabási, A.L., and Oltvai, Z.N., Lethality and centrality in protein networks, Nature, 411, 41–42, 2001.
29. Qin, H., Lu, H.S.S., Wu, W.B., and Li, W.H., Evolution of the yeast protein interaction network, Proc. Nat. Acad. Sci., 100, 12,820–12,824, 2003.
30. Wagner, A., How large protein interaction networks evolve, Proceedings of the Royal Society of London Series B, 270, 457–466, 2003.
31. Pastor-Satorras, R., Smith, E., and Solé, R.V., Evolving protein interaction networks through gene duplication, J. Theor. Biol., 222, 199–210, 2003.
32. Wuchty, S., Scale-free behavior in protein domain networks, Mol. Biol. Evol., 18, 1697–1702, 2001.
33. Vazquez, A., Flammini, A., Maritan, A., and Vespignani, A., Global protein function prediction in protein-protein interaction networks, Nat. Biotech., 21, 697–700, 2003.
34. Shaw, S., Evidence of scale-free topology and dynamics in gene regulatory networks, in Proc. ISCA 12th International Conference on Intelligent and Adaptive Systems and Software Engineering, 37–40, 2003 (ISBN 1880843471).
35. Bumble, S., Friedler, F., and Fan, L.T., A toy model for comparative phenomenon in molecular biology and the utilization of biochemical applications of PNS in genetic applications, available online at http://arxiv.org/abs/cond-mat/0304348.
36. Hao, B., Xie, H., and Zhang, S., Compositional representation of protein sequence and the number of Eulerian loops, available online at http://arxiv.org/abs/physics/0103028.
37. PDB.SEQ, A Collection of SWISS-PROT Entries, available online at http://www.expasy.org/sprot.
38. da F. Costa, L., What's in a name?, International Journal of Modern Physics C, 15, 371–379, 2004.
39. da F. Costa, L., The hierarchical backbone of complex networks, Physical Review Letters, 93(9), paper 98702, 4 p., 2004.
40. da F. Costa, L., L-percolations of complex networks, Physical Review E: Statistical Physics, Plasmas, Fluids and Related Interdisciplinary Topics, 70, paper 056106, 8 p., 2004.
41. da F. Costa, L., Reinforcing the resilience of complex networks, Physical Review E: Statistical Physics, Plasmas, Fluids and Related Interdisciplinary Topics, 69, paper 066127, 7 p., 2004.
42. Crescenzi, P., Goldman, D., Papadimitriou, C., Piccolboni, A., and Yannakakis, M., On the complexity of protein folding, in Annu. Conf. Res. Computational Molecular Biol., ACM, New York, 1998, pp. 61–62.

10 Graph-Based Analysis of Amino Acid Sequences

Luciano da Fontoura Costa

CONTENTS

10.1 Introduction
10.2 Complex-Networks Concepts and Tools
    10.2.1 Brief Historic Perspective
    10.2.2 Basic Mathematical Concepts
        10.2.2.1 Graph Theory Basics
        10.2.2.2 Probabilistic Concepts
        10.2.2.3 Random Graph Models
        10.2.2.4 Small-World and Scale-Free Models
10.3 Complex-Networks Approaches to Bioinformatics
10.4 Sequences of Amino Acids as Weighted, Directed Complex Networks
10.5 Results
    10.5.1 Zebra Fish
    10.5.2 Xenopus
    10.5.3 Rat
10.6 Discussion
10.7 Concluding Remarks and Future Work
Acknowledgments
References

10.1 INTRODUCTION

One of the most essential features underlying natural phenomena and dynamical systems is the many connections, implications, and causalities between the several involved elements and processes. For instance, the whole dynamics of gene activation can be understood as a highly complex network of interactions, in the sense that some genes are enhanced while others are inhibited by several environmental factors, including the current biochemical composition of the individual (such as the presence of specific genes/proteins) as well as external effects such as temperature and interaction with other individuals. Interestingly, such a network of effects extends much beyond the individual in time and space, in the sense that any living being is affected by history (i.e., evolutionary processes) and spatial interactions (i.e., ecology). Although biology can only be fully understood and explained by considering the whole of such an intricate network of effects, reductionist approaches can still provide many insights about biological phenomena that are more localized in time and space, such as the genetic dynamics during an individual lifetime or an infectious process.

The large masses of data produced by experimental works in biology, molecular biology, and genetics can only be properly organized, analyzed, and modeled by using computer concepts including databases, networks, parallel computing, and artificial intelligence, with special emphasis placed on signal processing and pattern recognition. The incorporation of such modern computer concepts and tools into biology and genetics has been called bioinformatics [1]. The applications of this new area to genetics are manifold, ranging from nucleotide analysis to animal development. Among the several signal-processing methods considered in bioinformatics [2], we have the application of Markov random fields to model the sequences of nucleotides, the use of correlation and covariance to characterize sequences of nucleotides and amino acids, and wavelets [2, 3].

One particularly important problem concerns the analysis of proteins, the basic blocks of life [4, 5]. Constituted by sequences of amino acids, proteins participate in all vital processes, acting as catalysts; providing the mechanical scaffolding for cells, organs, and tissues; and participating in DNA expression. Proteins are polymers of amino acids, determined from the DNA through the process of protein expression. Many of the properties of proteins derive from their spatial shape and electrical affinities, which are both defined by the specific sequences of constituent amino acids [4, 5]. Therefore, given the sequence of amino acids specified by the DNA, the protein folds into specific forms while taking into account the interactions between the amino acids and external influence of chaperones. It remains an open problem how to determine the structural properties of proteins from the respective amino acid sequences, a problem known as protein folding [4, 5]. Except for some basic motifs, such as alpha-helices and beta-sheets, which are structures that appear repeatedly in proteins, the prediction of protein shape constitutes an intense research area. Experimentally, the sequences of amino acids underlying proteins can be obtained by using sequencing machines capable of reading the nucleotides, which are subsequently translated into amino acids by considering triples of nucleotides, the so-called codons, translated according to the genetic code [3–5].

By being inherently oriented toward representing connections and implications, graphs stand out as one of the most general and interesting data structures that can be used to represent biological systems. Basically, a graph is a representational structure composed of nodes, which are connected through directed or undirected edges. Any structure or phenomenon can be represented to varying degrees of completeness in terms of graphs, where each node would correspond to an aspect of the phenomenon and the edges to interactions. Such a potential for representation and modeling is greatly extended by the many types of graphs, including those with weighted edges, different types of coexisting nodes or edges, and hypergraphs, to name only a few. Interestingly, most biological phenomena can be properly represented in terms of graphs, including gene activation, metabolic networks, evolution (recall that hierarchical structures such as trees are special kinds of graphs), ecological interactions, and so on. However, despite the natural potential of graphs for representing and studying natural phenomena, their application was timid until the recent advent of the area of complex networks. One of the possible reasons for that is that graphs had often been understood as representations of static interactions, in the sense that the connections between nodes were typically assumed not to change with time. Thus, the uses of graphs in biology, for instance, were mainly constrained to representing evolutionary hierarchies (in terms of trees) and metabolic networks.

This situation underwent an important recent change sparked mainly by the pioneering developments in random networks by Rapoport [6] and Erdős and Rényi [7], the small-world models of Watts and Strogatz [8], and the scale-free networks of Barabási [9]. The research of such types of complex graphs became united under the name of complex networks [10–12]. Now, in addition to the inherent potential of graphs to nicely represent natural phenomena, important connections were established with dynamical systems, statistical physics, and critical phenomena, while many possibilities for multidisciplinary research were established between areas such as graph theory, statistical physics, nonlinear dynamical systems, and complexity theory.

Despite such promising perspectives, one of the often overlooked reasons why complex networks have become so important for modern science is that studies in this area tend to investigate the dynamical evolution of the graphs [10–12], which can provide key insights about the relationship between the topology and function of such complex systems. For example, one of the most interesting properties exhibited by random graphs is the abrupt appearance, as new edges are progressively added at random, of a giant cluster that dominates the graph structure and connections henceforth. Thus, in addition to being typically large (several studies in complex networks consider infinitely large graphs), the graphs were now used to model growing processes. Allied to the inherent vocation of graphs to represent connections, interactions, and causality, the possibility of modeling dynamical evolution in terms of complex networks has made this area into one of the most promising scientific concepts and tools.

The present chapter is aimed at addressing how complex-network research has been applied to bioinformatics, with special attention given to the characterization and analysis of amino acid sequences in proteins. The text starts by reviewing the basic context, concepts, and tools of complex-network research and continues by presenting some of the main applications of this area in bioinformatics. The remainder of the chapter describes the more specific investigation of amino acid sequences in terms of complex networks obtained for graphs derived from subsequence strings.

10.2 COMPLEX-NETWORKS CONCEPTS AND TOOLS

10.2.1 BRIEF HISTORIC PERSPECTIVE

The beginnings of complex-network research can be traced back to the pioneering and outstanding works by Rapoport [6] and Erdős and Rényi [7], who concentrated attention on the type of networks currently known as random networks. This name is somewhat misleading in the sense that many other network models are also random. The essential property of random networks as understood in graph theory, therefore, is not merely being random, but following a particular probabilistic model, namely the uniform random distribution [13]. In other words, given a set of N nodes, connections are established by choosing pairs of nodes according to the uniform probability density. In the case of undirected graphs, the edges are uniformly sampled out of the N(N–1)/2 possible connections. Consequently, random networks correspond to the maximum entropy hypothesis of connectivity evolution, providing a suitable null hypothesis against which several real and theoretical models can be compared and contextualized.
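The uniform sampling of edges described above can be sketched in a few lines. This is only an illustration under simple assumptions: nodes are numbered 1 to N, the graph is undirected, and a fixed number of distinct edges is drawn without repetition; the function name `random_graph` is hypothetical.

```python
import itertools
import random


def random_graph(n_nodes, n_edges, seed=None):
    """Draw n_edges distinct undirected edges uniformly among all node pairs."""
    rng = random.Random(seed)
    # All N(N-1)/2 possible undirected connections, each written as (i, j), i < j.
    all_pairs = list(itertools.combinations(range(1, n_nodes + 1), 2))
    return rng.sample(all_pairs, n_edges)


edges = random_graph(n_nodes=6, n_edges=5, seed=0)
print(edges)
```

Enumerating all pairs is convenient for small N; for large networks one would normally sample pairs directly instead of building the full list.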

One of the most interesting features of random networks is the fact that the progressive addition of new edges tends to abruptly form a giant, dominating cluster (or connected component) in the graph. Such a critical transition is particularly interesting not only because it represents a sudden change of the network connectivity, but because it provides a nice opportunity for connecting graph theory to statistical physics. Indeed, the appearance of the giant cluster can be understood as a percolation of the graph, similar to critical phenomena (phase transitions) underlying the transformation of ice into water. Basically, percolation corresponds to an abrupt change of some property of the analyzed system as some parameter is continually varied. This interesting connection between graph theory and statistical physics has provided unprecedented opportunities for multidisciplinary works and applications, nicely bridging the gap between areas such as complexity analysis, which is typical of graph theory, and the study of systems involving large numbers of elements, typical in statistical physics. In addition to such an exciting perspective, random networks attracted much interest as possible models of real structures and phenomena in nature, with special emphasis given to the Internet and the World Wide Web.

After the fruitful studies of Rapoport and Erdős and Rényi, the study of large networks (note that the term complex network was not typical at those times) went through a period of continuing academic investigation followed by few applications, except for promising investigations in areas such as sociology. Indeed, one of the next important steps shaping the modern area of complex networks was the investigation of personal interactions in society, of which the 1998 work by Watts and Strogatz [8] represents the basic reference. Basically, experimental investigations regarding social contacts led to the result that the average length between any two nodes (i.e., persons) is rather small, hence the name small-world networks. The typical mathematical model of such networks starts with a regular graph, which subsequently has a percentage of its connections rewired according to uniform probability. Although such investigations brought many insights to the area, the small-world property was later verified to be an almost ubiquitous property of complex networks. The subsequent investigations of the topological properties of the Internet and WWW performed by Albert and Barabási [9] led to the important discovery that the statistical distribution of the node degrees (i.e., the number of connections of a node) in several complex networks tends to follow a power law, indicating scale-free behavior. Unlike the random model, this property favors the appearance of nodes concentrating many of the connections, the so-called hubs. Such underlying structure has several implications, for instance regarding resilience to attack, with such networks being particularly fragile under attacks on hubs. From then on, the developments in complex-network research boomed, covering several types of natural systems, from epidemics to economy. The interested reader is encouraged to check the excellent surveys of this area [10–12] for complementary information.
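The small-world construction mentioned above (a regular graph with a percentage of its connections rewired uniformly) can be sketched as follows. The details are assumptions made for illustration only: the regular graph is a ring in which each node is linked to its k nearest neighbours on each side, and a rewired edge keeps its first endpoint while the second is redrawn uniformly among nodes not yet connected to it. The names `ring_lattice` and `rewire` are hypothetical.

```python
import random


def ring_lattice(n_nodes, k):
    """Regular ring: every node is linked to its k nearest neighbours per side."""
    edges = set()
    for i in range(n_nodes):
        for offset in range(1, k + 1):
            j = (i + offset) % n_nodes
            edges.add((min(i, j), max(i, j)))
    return edges


def rewire(edges, n_nodes, fraction, seed=None):
    """Rewire each edge with probability `fraction`, keeping the edge count."""
    rng = random.Random(seed)
    rewired = set(edges)
    for edge in sorted(edges):
        if rng.random() >= fraction:
            continue
        i = edge[0]
        # Candidate new endpoints: no self-loops and no duplicate edges.
        free = [j for j in range(n_nodes)
                if j != i and (min(i, j), max(i, j)) not in rewired]
        if free:
            j = rng.choice(free)
            rewired.remove(edge)
            rewired.add((min(i, j), max(i, j)))
    return rewired


lattice = ring_lattice(n_nodes=20, k=2)
small_world = rewire(lattice, n_nodes=20, fraction=0.2, seed=1)
print(len(lattice), len(small_world))
```

Because every rewiring removes one edge and adds one new edge, the number of connections is preserved, so only the topology changes between the regular and the rewired graph.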

10.2.2 BASIC MATHEMATICAL CONCEPTS

This section provides a brief introductory review of basic concepts and measurements in graph theory, statistics, random graphs, and small-world and scale-free networks. Readers who are already familiar with such topics can proceed directly to Section 10.3.

10.2.2.1 Graph Theory Basics

Basically, a typical graph [14–17] in complex-network theory [10–12] involves a collection of N nodes i = 1, 2, …, N that are connected through edges (i,j) that can have weights w(i,j). Such a data structure is precisely and completely represented by the respective weight matrix W, where each entry W(j,i) represents the weight of edge (i,j). Nonexistent edges are represented as null entries in that matrix. The adjacency matrix K of the graph is a matrix where the value 1 is assigned to an element (i,j) whenever there is an edge connecting node j to i, and 0 otherwise. The adjacency matrix can be obtained from the weight matrix by setting each element larger than or equal to a specific threshold value T to 1, assigning 0 otherwise. Such adjacency matrices, henceforth represented as K_T, provide indication about the network structure defined by the weights that are not smaller than the threshold. Therefore, the adjacency matrix for high values of T can be understood as the strongest component, or “kernel,” of the weighted graph. Observe that it is also possible to consider the complementary matrix of K_T with respect to K, which is defined as follows. Each element (i,j) of such a matrix, hence abbreviated as Q_T, receives value 1 iff K_T(i,j) = 0 and K(i,j) ≠ 0. An undirected graph is characterized by undirected edges, so that K(j,i) = 1 iff K(i,j) = 1, i.e., K is symmetric. A directed graph, or digraph, is characterized by directed edges and not necessarily by a symmetric adjacency matrix.

One of the most basic and interesting local features of a graph or network is the number of connections of a specific node i, which is called the node degree and often abbreviated as k_i. Observe that a directed graph has two types of such a degree, the indegree and the outdegree, corresponding to the number of incoming and outgoing edges, respectively.
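The matrices defined above can be made concrete with a small sketch. The four-node digraph and its weights are invented for illustration (they are not the graphs of Figure 10.1), nodes are indexed from 0, the threshold is assumed to be positive, and the text's convention is followed: entry W[j][i] holds the weight of the edge from node i to node j. The function names are hypothetical.

```python
def adjacency(W, T=None):
    """Return K (T is None) or the thresholded matrix K_T (weights >= T)."""
    if T is None:
        return [[1 if w != 0 else 0 for w in row] for row in W]
    return [[1 if w != 0 and w >= T else 0 for w in row] for row in W]


def complement(K, K_T):
    """Q_T: edges that exist in K but fall below the threshold in K_T."""
    return [[1 if k == 1 and kt == 0 else 0 for k, kt in zip(row_k, row_t)]
            for row_k, row_t in zip(K, K_T)]


def indegrees(K):
    """Row j collects the edges arriving at node j."""
    return [sum(row) for row in K]


def outdegrees(K):
    """Column i collects the edges leaving node i."""
    return [sum(column) for column in zip(*K)]


# Hypothetical digraph: 0->1 (weight 3), 0->2 (1), 1->2 (2), 3->0 (2).
W = [[0, 0, 0, 2],
     [3, 0, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 0, 0]]

K = adjacency(W)
K_2 = adjacency(W, T=2)
Q_2 = complement(K, K_2)
print(indegrees(K), outdegrees(K))
```

In this example only the edge of weight 1 is dropped by the threshold T = 2, so it is the single nonzero entry of Q_2.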

Figure 10.1 illustrates the concepts introduced here with respect to an undirected graph G and a directed graph H, identifying the nodes, edges, and weights. This figure also shows the respective weight matrices W_G and W_H and adjacency matrices A_G and A_H. The degree of node 1 in G is 2, the outdegree of node 1 in H is 2, and the indegree of node 1 in H is 1. N is equal to 4 for both graphs.

A great part of the importance of graphs stems from their generality for representing, in an intuitive and explicit way, virtually any discrete structure while emphasizing the involved entities (nodes) and connections. Indeed, virtually every data structure (e.g., tree, queue, list) is a particular case of a graph. In addition, graphs can be used to represent the most general mesh of points used for numeric simulation of dynamic systems, from the regular orthogonal lattice used in image representation to the most intricate adaptive triangulations. As such, graphs are poised to provide one of the keys for connecting not only structure and function, but also several different biological areas and even the whole of science.

FIGURE 10.1 Basic concepts in graph theory: examples of undirected (G) and directed (H) graphs, with respective nodes, edges, and weights. The weight matrices of G and H are W_G and W_H, and the respective adjacency matrices considering threshold T = 1 are given as A_G and A_H.

Several measurements or features have been proposed and used to express meaningful and useful global properties of the network structure. In similar fashion to feature selection in the area of pattern recognition (e.g., [13]), the choice of such features has to take into account the specific problem of interest. For instance, a problem of communication along the network needs to take into account the distance between nodes. It should be observed that, in most cases, the selected set of features is degenerate, in the sense that it is not enough to reproduce the original network structure. Therefore, great attention must be paid when deriving general conclusions based on incomplete sets of measurements, as is almost always the case. Some of the more traditional network measurements are reviewed in the following paragraph.

The global measurement most commonly derived from the node degree is its average value <k> along the whole network. Observe that, for a digraph, the average indegree and outdegree are necessarily identical. The average node degree gives a first idea about the overall connectivity of the network. Additional information about the network connectivity can be obtained from the average clustering coefficient <C>. Given one specific node i, the immediately connected nodes are identified, and the ratio between the number of connections between them and the maximum possible value of those connections defines the clustering coefficient of node i, i.e., C_i. This feature tends to express the local connectivity around each node. Another interesting and frequently used network measurement is the length between any two nodes i and j, here denoted as L(i,j). This distance may refer either to the minimal sum of weights along a path from i to j, or to the minimal number of edges along a path between those two nodes. The present work is restricted to the latter. The respectively derived global feature is the average length considering all possible pairs of network nodes, hence <L>. This measurement provides an idea not only about the proximity between nodes, but also about the overall network connectivity, in the sense that low average-distance values tend to indicate a densely connected structure. Another interesting measurement that has been used to characterize complex networks is the betweenness centrality. Roughly, the betweenness centrality of a specific network node in an undirected graph corresponds to the number of shortest paths between any pair of nodes in the network that cross that node [18].
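The measurements <k>, <C>, and <L> can be illustrated on a small undirected graph. The sketch below makes several assumptions that are not stated in the text: the graph is stored as a dictionary of neighbour sets, nodes with fewer than two neighbours receive a clustering coefficient of 0, path lengths count edges and are found by breadth-first search, and the graph is connected. The example graph and all names are hypothetical.

```python
from collections import deque
from itertools import combinations


def clustering(adj, i):
    """Fraction of the possible links among the neighbours of i that exist."""
    neighbours = adj[i]
    n = len(neighbours)
    if n < 2:
        return 0.0  # convention assumed here for nodes with < 2 neighbours
    links = sum(1 for a, b in combinations(sorted(neighbours), 2) if b in adj[a])
    return links / (n * (n - 1) / 2)


def path_lengths(adj, source):
    """Number of edges on the shortest path from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Hypothetical undirected graph: a triangle 1-2-3 with a pendant node 4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

mean_degree = sum(len(nb) for nb in adj.values()) / len(adj)
mean_clustering = sum(clustering(adj, i) for i in adj) / len(adj)
pair_lengths = [path_lengths(adj, i)[j] for i, j in combinations(sorted(adj), 2)]
mean_length = sum(pair_lengths) / len(pair_lengths)
print(mean_degree, mean_clustering, mean_length)
```

For this graph the node clustering coefficients are 1, 1, 1/3, and 0, which shows how a single hub-like node with an unconnected neighbour lowers the average.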

10.2.2.2 Probabilistic Concepts

Any measurement whose outcome cannot be exactly predicted, such as the weight of an inhabitant of Chicago, can be represented in terms of a random variable [13, 19]. Such variables can be completely characterized in terms of the respective density functions, which can be approximated in terms of the respective relative frequency histogram. Alternatively, a random variable can also be represented in terms of its (possibly) infinite moments, including the mean, variance, and so on. Statistical density functions of special interest for this chapter include the uniform distribution, which assigns the same probability to any possible measurement, and the Poisson distribution, which is characterized in terms of a rate of event occurrence per length, area, or volume. For instance, we may have that the chance of having a failure in an electricity transmission cable is equal to one failure per 10,000 km. Therefore, the chance of observing the event is the same at any point of the considered structure (e.g., the transmission cable), i.e., the event is equiprobable along the considered parameter (e.g., length or time).

Such concepts can be immediately extended to multivariate measurements by introducing the concept of random vector. For instance, the temperature and pressure of an inhabitant of Chicago can be represented as the two-dimensional random vector [T, P]. Such statistical entities are also completely characterized, in statistical terms, by their respective multivariate densities. Statistical and probabilistic concepts and techniques are essential for representing and modeling natural phenomena and biological data because of the intrinsic variation of such measurements.
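As a worked illustration of the Poisson example above, the following sketch evaluates the standard Poisson probability of observing k events along a given extent, using the rate of one failure per 10,000 km quoted in the text. The function name and the chosen cable lengths are assumptions made for the example.

```python
import math


def poisson_probability(k, rate, extent):
    """Probability of exactly k events when events occur at `rate` per unit extent."""
    expected = rate * extent  # mean number of events along the extent
    return math.exp(-expected) * expected ** k / math.factorial(k)


RATE = 1 / 10_000  # one failure per 10,000 km, as in the text

# Chance of no failure at all along 10,000 km and along 1,000 km of cable.
print(poisson_probability(0, RATE, 10_000))
print(poisson_probability(0, RATE, 1_000))
```

With an expected count of exactly one failure, the probability of observing none is exp(-1), i.e., a little over one third.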

10.2.2.3 Random Graph Models

The first type of complex networks to be systematically investigated were the random graphs [6, 7, 10–12, 20]. In using such graphs, one starts with N unconnected nodes and progressively adds edges between pairs of nodes chosen according to the uniform distribution. Although the measurements described in Section 10.2.2.1 are useful for characterizing the structure of such networks, it is also important to take into account parameters and measurements governing their dynamical evolution, including the critical phenomenon of percolation. As more connections are progressively added to a growing network, there is a definite tendency to form a giant cluster (percolation), which henceforth dominates the growing dynamics. Given a network, a cluster is understood as the set of nodes (and respective interconnecting edges) such that one can reach any node while starting from any other node in the cluster, i.e., the cluster is a connected component of the graph. The giant cluster corresponds to the cluster with the largest number of nodes at a given step of the network evolution. For an undirected random network, this phenomenon has been found to take place when the fraction of existing connections with respect to the maximum possible number of connections is about 1/N [5].
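The growth of the largest cluster as edges are added can be followed with a short simulation. This is a sketch under stated assumptions rather than the procedure used in the chapter: edges are drawn uniformly without repetition, clusters are tracked with a union-find structure, and the function name `largest_cluster_sizes` is hypothetical.

```python
import itertools
import random


def largest_cluster_sizes(n_nodes, n_edges, seed=None):
    """Size of the largest cluster after each uniformly added undirected edge."""
    rng = random.Random(seed)
    parent = list(range(n_nodes))   # union-find forest over the nodes
    size = [1] * n_nodes            # cluster size stored at each root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    pairs = list(itertools.combinations(range(n_nodes), 2))
    largest = 1
    history = []
    for a, b in rng.sample(pairs, n_edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merge the smaller cluster into the larger one.
            if size[root_a] < size[root_b]:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a
            size[root_a] += size[root_b]
            largest = max(largest, size[root_a])
        history.append(largest)
    return history


history = largest_cluster_sizes(n_nodes=200, n_edges=300, seed=0)
print(history[49], history[99], history[199], history[299])
```

Plotting `history` against the number of added edges for increasing N is one way to observe how sharp the transition becomes; the printed values depend on the random seed.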

10.2.2.4 Small-World and Scale-Free Models

The types of complex networks known as small world and scale free were identified and studied years after Erdős and Rényi investigated random graphs. Small-world networks [8, 10] are characterized by a short path between any pair of their constituent nodes. A typical example of such a network is the social interactions within a given society, in the sense that there are just a few (about five or six) relations between any two persons. Characterized later than small-world models, the scale-free networks [10–12] are characterized by the fact that the statistical distribution of the respective node degrees follows a power law, i.e., the representation of such a density in a log-log plot produces a straight line. Such densities, unlike those observed for other types of networks, imply a substantially higher chance of having nodes of high degree, which are traditionally called hubs. As reviewed in the next section, such nodes have been identified as playing an especially important role in biological networks. Scale-free networks can be produced by using the preferential-attachment growth strategy [10–12], characterized by the progressive addition of new nodes with a fixed number of edges that are connected preferentially with nodes of higher degree, giving rise to the paradigm that has become known as “the rich get richer.” At the same time, scale-free networks have also been shown to be less resilient to attacks targeting their hubs than other types of networks, such as random graphs [10].
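The preferential-attachment growth strategy can be sketched as follows. The starting configuration (a small fully connected core of m + 1 nodes) and the way ties are avoided (each new node links to m distinct existing nodes) are assumptions made for this illustration, and `preferential_attachment` is a hypothetical name.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, m, seed=None):
    """Grow a network where each new node links to m nodes chosen by degree."""
    rng = random.Random(seed)
    # Assumed starting point: a fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # Every node appears in `ends` once per incident edge, so a uniform draw
    # from this list selects nodes with probability proportional to degree.
    ends = [node for edge in edges for node in edge]
    for new in range(m + 1, n_nodes):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(ends))
        for old in sorted(chosen):
            edges.append((old, new))
            ends.extend((old, new))
    return edges


edges = preferential_attachment(n_nodes=500, m=2, seed=0)
degree = Counter(node for edge in edges for node in edge)
print(len(edges), max(degree.values()), min(degree.values()))
```

A histogram of `degree.values()` on log-log axes is the usual way to inspect whether the resulting distribution is compatible with a power law; a run of this size gives only a rough indication.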

10.3 COMPLEX-NETWORKS APPROACHES TO BIOINFORMATICS

Several possibilities of using complex networks and statistical physics in biology have been described and reviewed by Bose in his interesting and extensive survey [21]. Special attention is given to relationships between the network’s topology and functional properties, and the following three situations are covered in considerable depth:

1. The topology of complex biological networks, such as metabolic and protein interaction networks
2. Nonlinear dynamics in gene expression
3. The effect of stochasticity on the network dynamics


While we review in the following some of the most representative works applying complex-network research to biology, the reader is encouraged to complement and extend our review by referring to Bose’s survey.

Metabolic reactions, one of the key elements of life, were among the first to be studied by complex-network approaches. Such networks have their nodes representing the molecular compounds (or substrates), and the edges indicate the metabolic reactions connecting substrates. Incoming links to a substrate are understood to correspond to the reactions of which that substrate is a product. The pioneering investigation by Jeong et al. [22] considered networks that are available for 43 organisms, yielding average node indegree and outdegree in the range from 2.5 to 4, with the respective distribution being understood as scale free with exponents close to 2.2. The metabolic reactions of E. coli have been studied as undirected graphs by Wagner and Fell [23], yielding average node degree of 7 and a clustering coefficient (approximately 0.3) much larger than could be obtained for a random network. An interesting investigation into whether the duplication of information in genomes can significantly affect the power law exponents was reported by Chung et al. [24]. By using probabilistic methods as the means to analyze the evolution of graphs under duplication mechanisms, those authors were able to show that such mechanisms can produce networks with low power-law exponents, which are compatible with many biological networks [25].

The decomposition of biochemical networks into hierarchies of subnetworks, i.e., networks obtained by considering a subset of the nodes of the original graph and some of the respective edges, has been addressed by Holme and Huss [18]. These authors use the algorithm of Girvan and Newman [26] for tracing subnetworks, in a form adapted to bipartite representations of biochemical networks. The underlying principle of the algorithm is the fact that vertices between densely connected areas have high betweenness centrality, such that the removal of such vertices leads to the partition of the whole network into subnetworks that are contained in previous clusters, thereby producing a hierarchy of subnetworks.

Another extremely important type of biological network, corresponding to genomic regulatory systems (i.e., the set of processes controlling gene expression), has also been the subject of increasing attention in complex-network research. This type of directed network is characterized by having nodes corresponding to components of the system, with the edges representing the gene-expression regulations [11]. An important type of network in this category is that obtained from protein-protein interactions. In this type of network, each node corresponds to a protein, and the directed edges represent the interactions. A model of regulatory networks has been described by Kuo and Banzhaf [27]. A pioneering approach in this area is the work of Jeong et al. [28], which considered protein–protein interaction networks of S. cerevisiae, containing thousands of edges and nodes. The degree distribution was interpreted as following scale-free behavior with an approximate exponent of 2.5. One of the most important conclusions of that investigation was that the removal of the most-connected proteins (i.e., hubs, the nodes of a complex network receiving a large number of connections) can have disastrous effects on the proper functioning of the individual. The issue of protein–protein interaction networks has also been considered in a number of other works, including Qin et al. [29], Wagner [30], and Pastor-Satorras et al. [31], as well as in studies of the properties and evolution of such networks. Another related work, described by Wuchty [32], considered graphs obtained by assigning a node to every protein domain (or module) and an edge whenever two such domains are found in the same protein.

The important problem of determining protein function has been addressed from the perspective of networks of physical interaction by Vazquez et al. [33]. Their method is based on the minimization of the number of interacting proteins with different categories, so that the function estimation can be performed on a global scale while considering the entire connectivity of the protein network. The obtained results corroborate the validity of using protein-protein interaction networks as a means of inferring protein function, despite the unavoidable presence of imperfections and the incompleteness of protein networks.

The analysis of gene-expression networks in terms of embedded complex logistic maps (ECLM), a hybrid method blending some concepts from wavelets and coupled logistic maps, has been reported by Shaw [34]. That study considered 112 genes collected at nine different time instants along 25 days, with each time point being fitted to an ECLM model with high Pearson correlation coefficient, and the connections between genes were determined by considering models with high pairwise correlation. The obtained connections were interpreted as following scale-free behavior in both topology and dynamics.

A work by Bumble et al. [35] suggests that the study of pathways of network syntheses of genes, metabolism, and proteins should be extended to the investigation of the causes and treatment of diseases. Their approach involves methods capable of yielding, for a specific set of candidate reactions, a complete metabolic pathway network. Interesting results are obtained by investigating qualitative attributes, including relationships regarding the connectivity between vertices and the strength of connections, the relationship of interaction energies and chemical potentials with the coordination number of the lattice models, and how the stability of the networks is related to their topology.

An interesting approach to analyzing the amino acid sequences of a protein in terms of successively overlapping strings of length K has been described by Hao et al. [36]. The strings of amino acids are represented as graphs by associating each possible subsequence of length K to a graph node, and having the edges represent the observed successive transitions of subsequences. Their investigation targeted the reconstruction of the original sequences from the overlapping string networks, which can be approached by counting the number of Eulerian loops (i.e., cyclic sequences of connected edges that are followed without repetition). More specifically, the sequences are reconstructed while starting with the same initial subsequence, using each of the subsequences the same number of times as observed in the original data, and respecting a fixed sequence length. It was verified that the reconstruction is unique for K ≥ 5 for the majority of the considered networks (PDB.SEQ database [37]).

The present work addresses co-occurrence strings of amino acids (or any other basic biological element) similar to the scheme described in the previous paragraph, but here the subsequences do not necessarily overlap, and the number of times a subsequence is followed by another is represented by the weight of the respective edge in the associated graph, following the same scheme used for concept association as described in the literature [38, 39]. More specifically, whenever a subsequence of amino acids B is followed by another subsequence C, the weight of the edge connecting the two nodes representing those subsequences is increased by 1. Therefore, such a weighted, directed graph provides information about the number of times a specific subsequence is followed by other possible subsequences, which can be related to the statistical concept of correlation, with the difference that the order of the data is, unlike in the correlation, taken into account. As such, the obtained graph can be explored to characterize and model sequences of amino acids according to varying subsequence sizes. Moreover, by thresholding the weight matrix for successive threshold values, it is possible to identify subgraphs of the network corresponding to a strongly connected kernel of subsequences.

10.4 SEQUENCES OF AMINO ACIDS AS WEIGHTED, DIRECTED COMPLEX NETWORKS

A protein can be specified in terms of its respective sequence of amino acids, represented by the string S = A_1 A_2 … A_N, where each element A_i corresponds to one of the 20 possible amino acids, as indicated in Table 10.1.

It is possible to subsume an amino acid sequence S by grouping subsequences of amino acids into new numerical codes with higher values, in a way similar to that described by Hao et al. [36]. The grouping scheme adopted in this work is illustrated in Figure 10.2, where the first and second groups contain m and n amino acids, respectively. While it is possible to consider m ≠ n, we henceforth adopt m = n. The groups are taken with an overlap of g positions, with 0 ≤ g ≤ m.

For each reference position i, we have two numerical codes B and C, obtained

FIGURE 10.2 The grouping scheme considered in this work, including two successive windows of size m and n, with overlap of g elements. (Position labels shown in the figure: i-1, i, i+m-g-1, i+m-g, i+m-1, i+m, i+m+n-g-1, i+m+n+g.)

An example of this coding scheme is given in the following. Let the original protein sequence in abbreviated amino acids be

Abbreviation Numerical Code


Similarly, for m = n = 3 and g = 1, we obtain:

Observe that the different ranges of i obtained in these two examples are a direct consequence of the fact that the larger size of the subsequences in the second example reduces the number of possible subsequence associations.

Now, having defined the grouping scheme and the resulting sequences B and C, the graph representing the subsequent (with possible overlap) co-occurrences of numerical codes in this sequence is obtained as follows:

1. Each code in the sequences B and C is represented as one of the N nodes of the graph, whose number corresponds to the code produced for the respective sequence. For instance, the sequence (13)(6) implies a graph with two nodes identified as 13 and 6 containing a directed edge from node 13 to node 6. Therefore, for a given m = n, we have a maximum of 20^m nodes, numbered from 1 to 20^m. Observe, however, that the resulting network does not necessarily include all possible nodes, allowing a reduction of the network size.

2. Every time a code B is followed by a code C, the weight of the edge connecting from node B to C is incremented by 1. In other words, the weight of the edge uniting two specific sequences B and C is equal to the number of times those two sequences are found to follow one another, in that same order, along the analyzed sequence of amino acids.
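The two steps above can be sketched as a short routine. Several details are assumptions, since the chapter's own coding equations are not reproduced here: a window of m amino acid codes (each between 1 and 20) is packed into a single number between 1 and 20^m by base-20 positional coding, the reference position i advances one amino acid at a time, and n = m. The names `window_code` and `cooccurrence_graph` are hypothetical.

```python
from collections import Counter


def window_code(codes):
    """Pack amino acid codes (each 1..20) into one number in 1..20**len(codes).

    Base-20 positional packing is an assumption made for this sketch.
    """
    value = 0
    for code in codes:
        value = value * 20 + (code - 1)
    return value + 1


def cooccurrence_graph(sequence, m, g=0):
    """Edge weights {(B, C): count} for windows of size m overlapping by g."""
    weights = Counter()
    step = m - g  # offset between the start of window B and window C
    for i in range(len(sequence) - (2 * m - g) + 1):
        b = window_code(sequence[i:i + m])
        c = window_code(sequence[i + step:i + step + m])
        weights[(b, c)] += 1
    return weights


# The example sequence used in the text for Figure 10.3, with m = 1.
sequence = [13, 6, 7, 18, 15, 11, 11, 14, 20, 20, 1, 11, 5, 10, 15, 11, 14]
weights = cooccurrence_graph(sequence, m=1)
nodes = {node for edge in weights for node in edge}
print(len(nodes), len(weights), weights[(15, 11)], weights[(11, 14)])
```

For m = 1 the window code equals the amino acid code itself, so the edges are simply the observed successive transitions; the transitions 15 to 11 and 11 to 14 each occur twice in this sequence and therefore receive weight 2.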

Figure 10.3 illustrates the graph obtained from the sequence (13)(6)(7)(18)(15)(11)(11)(14)(20)(20)(1)(11)(5)(10)(15)(11)(14) considering m = 1, where each node is represented by the respective code, and the edge weights (shown in italics) represent the number of successive subsequence (in this case a single amino acid) transitions.

In this sense, the obtained graph represents the “unidirectional” correlations between two subsequent (with possible overlap) subsequences of amino acids in the analyzed protein. Such a network can be understood as a statistical model of the original protein for the specific correlation length implied by m and g. As such, it is possible to obtain simulated sequences of amino acids following such statistical models by performing Monte-Carlo simulation over the outgoing edges of each node, in the sense that each outgoing edge is taken with frequency corresponding to its respective normalized weight (i.e., the weights of the outgoing edges of each node are normalized so that they add up to 1). Therefore, the transition probabilities are proportional to the respective weights. Observe that the statistically normalized weight matrix of the network corresponds to a Markov chain, as the sum of any of its columns will be equal to 1.
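The Monte-Carlo simulation described above can be sketched for the m = 1 case, where each node is a single amino acid code. The edge weights are rebuilt here from the example sequence of Figure 10.3 so that the block is self-contained; the function names and the choice of stopping when a node has no outgoing edge are assumptions of this illustration.

```python
import random
from collections import Counter

sequence = [13, 6, 7, 18, 15, 11, 11, 14, 20, 20, 1, 11, 5, 10, 15, 11, 14]
# Edge weights for m = 1: number of times code B is immediately followed by C.
weights = Counter(zip(sequence, sequence[1:]))


def transition_probabilities(weights):
    """Normalize the outgoing weights of every node so that they sum to 1."""
    totals = Counter()
    for (b, _), w in weights.items():
        totals[b] += w
    return {(b, c): w / totals[b] for (b, c), w in weights.items()}


def simulate(weights, start, length, seed=None):
    """Random walk that follows outgoing edges in proportion to their weights."""
    rng = random.Random(seed)
    outgoing = {}
    for (b, c), w in weights.items():
        outgoing.setdefault(b, []).append((c, w))
    path = [start]
    while len(path) < length and path[-1] in outgoing:
        targets, target_weights = zip(*outgoing[path[-1]])
        # random.choices normalizes the weights internally.
        path.append(rng.choices(targets, weights=target_weights, k=1)[0])
    return path


probabilities = transition_probabilities(weights)
print(probabilities[(11, 14)])
print(simulate(weights, start=13, length=12, seed=0))
```

Node 11 has outgoing edges to 11, 14, and 5 with weights 1, 2, and 1, so the transition from 11 to 14 is taken with probability 0.5, while nodes with a single outgoing edge are always followed deterministically.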

By thresholding the weight matrix for successive values of T (see Section 12.2.2), it is possible to obtain a family of graphs that can be understood as follows. The clusters defined for the highest values of T represent the kernels of the whole weighted network, corresponding to the subsequence associations that are most representative and most frequent along the whole protein. As the threshold is lowered, these kernels are augmented by incorporation of new nodes and merging of existing clusters. Such a threshold-based evolution of the graph can be related to the evolutionary history of the protein formation, in the sense that the kernels would have appeared first and served as organizing structures around which the rest of the molecule evolved. At the same time, the strongest connections in the obtained network also reflect the repetition of basic protein motifs, such as alpha helices and beta sheets.
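The growth and merging of clusters as T is lowered can be illustrated with a small union-find sketch. Several choices here are assumptions rather than the chapter's definitions: an edge is kept when its weight is at least T, self-connections are ignored, and a cluster is taken to be a set of nodes connected when edge direction is disregarded.

```python
from collections import Counter

def clusters_at_threshold(weights, threshold):
    """Clusters formed by edges with weight >= threshold (direction ignored)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for (source, target), weight in weights.items():
        if weight >= threshold and source != target:
            parent[find(source)] = find(target)  # merge the two clusters

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=len, reverse=True)

# Weights for the example sequence of Figure 10.3 (m = 1).
sequence = [13, 6, 7, 18, 15, 11, 11, 14, 20, 20, 1, 11, 5, 10, 15, 11, 14]
weights = Counter(zip(sequence, sequence[1:]))

for threshold in (2, 1):
    print(threshold, clusters_at_threshold(weights, threshold))
```

For this toy sequence, T = 2 isolates the nodes 11, 14, and 15, and lowering the threshold to T = 1 absorbs every remaining node into the same cluster.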

10.5 RESULTS

In the following investigations, we consider proteins from three animal species: zebra fish, Xenopus (frog), and rat. The gene sequencing data were obtained from the NIH Gene Collection repository (http://zgc.nci.nih.gov/, files dr_mgc_cds_aa.fasta, xl_mgc_cds_aa.fasta, and rn_mgc_cds_aa.fasta). The raw data consisted of sequences of amino acids for the 2948, 1977, and 640 proteins (each containing about 400 amino acids on average) in each of those files, respectively. The obtained results, which consider m = n = 2 and g = 0, are presented for each species in the following subsections. The average node degree was obtained by adding all columns of the adjacency matrix. The clustering coefficient was obtained by identifying the n nodes connected to each node and dividing the number of existing edges between those nodes by n(n - 1)/2, i.e., the maximum number of

FIGURE 10.3 The network obtained for m = 1 for the amino acid sequence (13)(6)(7)(18)(15)(11)(11)(14)(20)(20)(1)(11)(5)(10)(15)(11)(14). The weights of the edges are shown in italics.

edges between those nodes. The minimum distances were calculated by using Dijkstra's method [14].

10.5.1 ZEBRA FISH

The obtained 400×400 weight matrix (recall from the previous section that 400 = 20^m = 20^2) had a maximum value of 487, obtained for the transition from SS to SS, and a minimum value of zero was obtained for 15,274 transitions. The maximum weight for transition between different nodes was 170, observed for the transition from EE to ED. The performed measurements included the average node degree (Figure 10.4(a)), clustering coefficient (Figure 10.5(a)), average length (Figure 10.6(a)), and maximum cluster size (Figure 10.7(a)) for the series of thresholded matrices K_T (solid lines) and Q_T (dashed lines) obtained for T = 1, 2, …, 170.
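Two of these measurements, the average node degree and the clustering coefficient described at the start of Section 10.5, can be sketched as follows. This is a simplified illustration under stated assumptions: edges are kept when their weight is at least the threshold, direction and self-connections are ignored so that n(n - 1)/2 counts unordered neighbor pairs, and the toy weights and function names are hypothetical rather than taken from the data.

```python
from itertools import combinations

def undirected_neighbors(weights, threshold=1):
    """Map each node to its neighbors, keeping edges with weight >= threshold."""
    neighbors = {}
    for (source, target), weight in weights.items():
        if weight >= threshold and source != target:
            neighbors.setdefault(source, set()).add(target)
            neighbors.setdefault(target, set()).add(source)
    return neighbors

def average_degree(neighbors):
    """Mean number of neighbors per node."""
    return sum(len(adjacent) for adjacent in neighbors.values()) / len(neighbors)

def clustering_coefficient(neighbors, node):
    """Edges among the n neighbors of a node divided by n(n - 1)/2."""
    adjacent = neighbors[node]
    n = len(adjacent)
    if n < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adjacent, 2) if b in neighbors[a])
    return links / (n * (n - 1) / 2)

# Hypothetical toy weights, chosen only to exercise the functions.
toy_weights = {("A", "B"): 3, ("B", "C"): 2, ("C", "A"): 1, ("A", "D"): 1}
neighbors = undirected_neighbors(toy_weights)

print(average_degree(neighbors))
print({node: clustering_coefficient(neighbors, node) for node in sorted(neighbors)})
```

Repeating the computation for increasing values of the `threshold` argument gives one value per T, which is the kind of curve plotted in the figures cited above.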

We also calculated the indegree and outdegree densities, which are shown in Figure 10.8(a) and Figure 10.8(b), respectively, for T = 0. It is clear from this figure that both node degrees tend to be similar to one another, presenting a plateau for 6 < log(k) < 8.4 followed by a sharp decrease of node degree. The self-connections between nodes representing subsequences of two identical amino acids, immediately obtained from the diagonal of the respective adjacency matrices, are given in Table 10.2.
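The indegree, outdegree, and self-connection values can be read directly from the weights, as the following sketch shows for the m = 1 example sequence of Figure 10.3. Treating the degree k as the sum of edge weights (rather than the number of edges) is an assumption made here, and the variable names are illustrative.

```python
from collections import Counter

sequence = [13, 6, 7, 18, 15, 11, 11, 14, 20, 20, 1, 11, 5, 10, 15, 11, 14]
weights = Counter(zip(sequence, sequence[1:]))

out_degree = Counter()  # total weight leaving each node
in_degree = Counter()   # total weight arriving at each node
for (source, target), weight in weights.items():
    out_degree[source] += weight
    in_degree[target] += weight

# Self-connections correspond to the diagonal of the weight matrix.
self_connections = {s: w for (s, t), w in weights.items() if s == t}

# Number of nodes having each outdegree value, i.e., an unnormalized density.
out_degree_histogram = Counter(out_degree.values())

print(dict(in_degree))
print(dict(out_degree))
print(self_connections)
```

Dividing the histogram counts by the number of nodes would turn them into the relative frequencies plotted as degree densities.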

The initial kernel was also identified for T = 95, with the obtained digraph shown in Figure 10.9, where the edge widths correspond to the respective weights. Observe that although the original graph was thresholded at T to obtain the kernel in Figure 10.9, the graph in that figure incorporates all edges, including those with weight smaller than T, to provide a more comprehensive visualization of the obtained kernel. This fully connected (except self-connections, which were not considered in this case) digraph presents dominance of the E, D, and S amino acids, with strong connections obtained for the node EE. The maximum weight was 170, obtained for the transition from node EE to ED.

10.5.2 XENOPUS

The weight matrix had a maximum value of 293, obtained for the transition from EE to EE, and a minimum value of zero was obtained for 22,787 transitions. The maximum weight for transition between different nodes was 207, observed for the transition from GP to LQ. The performed measurements included the average node degree (Figure 10.4(b)), clustering coefficient (Figure 10.5(b)), average length (Figure 10.6(b)), and maximum cluster size (Figure 10.7(b)) for the series of thresholded matrices K_T (solid lines) and Q_T (dashed lines) obtained for T = 1, 2, …, 170.

The indegree and outdegree densities are shown in Figure 10.10(a) and Figure 10.10(b), respectively, for T = 0. Both densities again tend to be similar to one another, presenting a plateau for 6 < log(k) < 8 followed by a sharp decrease of node degree. The self-connections between nodes representing subsequences of two identical amino acids are given in Table 10.2. The initial kernel containing nine nodes was identified for T = 64, with the obtained digraph shown in Figure 10.11, which is dominated by the P and G amino acids.
