MACHINE LEARNING APPROACH
ZHANG HAILEI (B.Sc & M.S., Dalian University of Technology)
A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF PHARMACY NATIONAL UNIVERSITY OF SINGAPORE
2008
ACKNOWLEDGEMENTS
Foremost, I would like to express my sincere thanks to my supervisor, Professor Chen Yu Zong, for his excellent guidance and invaluable advice throughout my PhD study.
I would like to thank Professor Cao Zhiwei and Professor Ji Zhiliang for their insightful suggestions on my work on the prediction of disease related proteins and multifunctional enzymes.
My sincere gratitude also goes to the BIDD group members, especially Dr Lin HongHuang, Dr Han Lianyi, Dr Zheng Chanjuan, Dr Cui Juan, Dr Wang Rong, Ms Tang Zhiqun, Mr Xie Bin, Ms Ma Xiaohua, Miss Jia Jia, Miss Liu Xin, Miss Shi Zhe, Miss Wingyee, Mr Zhu Feng, Mr Liu Xianghui, Ms Ong Serene and others. I am truly thankful for their valuable suggestions and support in my project, and I have enjoyed the close friendship among us.
Last but not least, I am eternally grateful to my parents and my husband for supporting and encouraging me throughout my life.
Zhang Hailei
April 2008
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
SUMMARY
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS
1 Introduction
1.1 Introduction to multifunctional enzymes (MFEs)
1.2 Introduction to disease related proteins
1.2.1 Antimicrobial proteins
1.2.2 Antibiotic resistance proteins
1.2.3 Cancer associated proteins
1.3 Introduction to microRNAs
1.4 Overview of computational methods for biological function prediction
1.4.1 Sequence similarity method
1.4.2 Motif based methods
1.4.3 Machine learning approach
1.5 Scope and objective
2 Methods
2.1 Machine learning methods
2.1.1 Support Vector Machine (SVM)
2.1.2 K-Nearest Neighbors (KNN)
2.1.3 Neural Networks (NN)
2.1.4 Decision Tree (DT)
2.2 Feature selection
2.3 Performance evaluation
2.4 Construction of feature vectors
2.4.1 Protein feature vectors
2.4.2 MiRNA feature vectors
3 In silico search and characterization of multifunctional enzymes
3.1 Selection of MFEs and non-MFEs
3.2 Evaluation and discussion
3.2.1 Structural preference of MFEs
3.2.2 Characteristics of MFEs from pathway and evolution perspective
3.2.3 Identification of novel MFEs
3.2.4 Contribution of physicochemical properties in the classification of MFEs
3.3 Server for identification of multifunctional enzyme (SIME)
3.4 MFEs database
3.5 Summary
4 Prediction of disease related proteins by support vector machine
4.1 Prediction of antimicrobial proteins
4.1.1 Selection of antimicrobial proteins and non-antimicrobial proteins
4.1.2 Prediction performance for antimicrobial proteins
4.1.3 Prediction of novel antimicrobial proteins
4.1.4 Contribution of feature properties
4.1.5 Server for antimicrobial protein identification (SAPI)
4.2 Prediction of antibiotic resistance proteins
4.2.1 Selection of ARPs and non-ARPs
4.2.2 Prediction performance
4.2.3 Prediction of novel ARPs
4.2.4 Scanning bacteria genomes
4.2.5 Contribution of feature properties to the classification of ARPs
4.2.6 Server for antibiotic resistance protein identification (SARPI)
4.3 Prediction of cancer associated proteins
4.3.1 Data preparation
4.3.2 Overall prediction accuracies and performance evaluation
4.3.3 Contribution of feature properties to the classification of cancer associated proteins
4.3.4 Analysis of individual feature contribution by feature selection
4.3.5 Cancer associated protein identification server (CAPIS)
4.4 Comparison with other statistical learning methods
4.5 Summary
5 Prediction of microRNAs by machine learning methods
5.1 Data preparation
5.1.1 Retrieval of precursor miRNAs and non-precursor miRNAs
5.1.2 Retrieval of mature miRNAs and non-mature miRNAs
5.2 Evaluation and discussion
5.2.1 Prediction performance for precursor miRNAs and mature miRNAs
5.2.2 Screening non-coding RNAs within four representative genomes
5.2.3 Comparison with other statistical learning methods
5.3 MiRNA prediction server
5.3.1 Comparison with other microRNA prediction servers
5.4 Summary
6 Conclusion and future work
6.1 Major findings
6.2 Limitation of methods applied in this work
6.3 Future studies
BIBLIOGRAPHY
APPENDICES
LIST OF PUBLICATIONS
SUMMARY
Proteins and functional RNAs are important components of biological organisms and play essential roles in biological systems. Therefore, the identification of functional proteins and RNAs is of great importance for understanding biological processes, discovering new therapeutic targets, and accelerating drug development. This thesis describes my work on applying machine learning methods to facilitate the identification of multifunctional enzymes, disease related proteins and microRNAs.
Multifunctional enzymes (MFEs) are enzymes that perform multiple catalytic activities. The identification and characterization of MFEs would provide valuable insights into the molecular mechanisms underlying the crosstalk between different cellular processes. In this study, a total of 3,120 experimentally verified MFEs were collected from various sources. A support vector machine (SVM) based classifier was then developed to distinguish MFEs from non-MFEs. The classifier was also applied to search the ExPASy ENZYME database for potential novel MFEs. Moreover, we investigated the mechanism of multiple catalytic properties, as well as their evolutionary basis. Our results suggest that MFEs are unevenly distributed among species, but there is no solid evidence that complex life forms such as humans favor more MFEs than simple life forms such as yeast. Further KEGG ontology (KO) analysis indicated that MFEs most likely evolved from ancestor enzymes in primitive life forms. From a structural perspective, the alpha and beta fold topology seems to be the most favored for MFEs. The analysis of physicochemical properties indicated that four properties (charge, polarizability, hydrophobicity, and solvent accessibility) are the most important for the characterization of MFEs.
Another objective of this work is to identify disease related proteins, which hold promise for discovering new therapeutic targets. Three groups of disease related proteins were studied: antimicrobial proteins, antibiotic resistance proteins and cancer associated proteins. Corresponding SVM based prediction systems were developed to identify these proteins from their primary sequences. Independent data sets that were not included in model development were then used to evaluate the performance of the classification systems, showing that the prediction accuracies for members and non-members of these disease related protein classes are in the ranges of 81.8% to 97.5% and 99.2% to 99.9%, respectively. In addition, most of the non-homologous antimicrobial proteins and antibiotic resistance proteins were correctly predicted. These results suggest the usefulness of the SVM method for facilitating the identification of disease related proteins, especially non-homologous functional proteins.
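The member and non-member accuracies quoted here correspond to the sensitivity (SE) and specificity (SP) defined in the caption of Table 4-2, alongside the positive prediction value (PPV) and the overall accuracy (Q). As a minimal illustrative sketch, these measures could be computed from the confusion counts as shown below. The function name and the example counts are hypothetical and are not taken from the thesis data.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute SE, SP, PPV and Q from confusion counts.

    Formulas follow the caption of Table 4-2:
      SE  = TP / (TP + FN)
      SP  = TN / (TN + FP)
      PPV = TP / (TP + FP)
      Q   = (TN + TP) / (TP + FN + TN + FP)
    """

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class is empty.
        return numerator / denominator if denominator else math.nan

    return {
        "SE": ratio(tp, tp + fn),
        "SP": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "Q": ratio(tp + tn, tp + fn + tn + fp),
    }


# Hypothetical counts for an independent evaluation set (illustration only).
metrics = classification_metrics(tp=90, fn=10, tn=990, fp=10)
for name, value in metrics.items():
    print(f"{name} = {value:.3f}")
```

Reporting SE and SP separately matters here because the non-member sets are much larger than the member sets, so the overall accuracy Q alone would be dominated by the non-members.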
The third objective of this work is to identify microRNAs (miRNAs) from sequence derived physicochemical properties by four machine learning methods: decision trees (DT), k-nearest neighbors (KNN), probabilistic neural networks (PNN), and support vector machines (SVM). SVM was found to reach the best performance, with prediction accuracies for precursor miRNAs and mature miRNAs of 92.2% and 94.8%, and accuracies for non-precursor miRNAs and non-mature miRNAs of 98.4% and 99.5%, respectively. Screening non-coding RNA sequences within four representative genomes (Homo sapiens, Mus musculus, Drosophila melanogaster and Saccharomyces cerevisiae) identifies 2.2% to 5.6% of non-coding RNAs as potential precursor miRNAs, which suggests fewer false positives than in previous studies. These findings indicate that our prediction system is capable of identifying miRNAs with relatively high accuracy. A similar strategy could in principle be applied to the prediction of other functional RNA classes.
Beyond in-house prediction models, we also developed a series of online prediction tools to help the scientific community identify novel functional proteins and RNAs. Our prediction systems can be accessed at the following links
LIST OF TABLES
Table 2-1 Example of training data for decision tree
Table 2-2 Division of amino acids into 3 different groups by different physicochemical properties
Table 2-3 List of features for proteins
Table 2-4 Characteristic descriptors of cellular tumor antigen p53 (Swiss-Prot AC P04637). The feature vector of this protein is constructed by combining all of the descriptors in sequential order.
Table 2-5 Division of nucleotides into different groups for different physicochemical properties
Table 2-6 List of features for miRNA
Table 2-7 Example of computed descriptors of miRNA precursor (cel-mir-243). The feature vector of this precursor is constructed by combining all the descriptors in sequential order.
Table 3-1 Statistics of the datasets and prediction accuracy of individual class of MFE and that of all MFEs (σ = 21)
Table 3-2 Distribution of known and predicted enzymes of multiple catalytic domains in different kingdoms and in top 20 host species. Not all protein sequences studied in this work are included because the host species information of some protein sequences is not yet available in the protein sequence databases.
Table 3-3 Distribution of known and predicted enzymes with single multi-catalytic domain in different kingdoms and in top 20 host species
Table 3-4 Orthologs of multifunctional enzymes (MFEs) in S. cerevisiae and H. sapiens. 36.7% (22 out of 60) of the MFEs in H. sapiens had orthologs in S. cerevisiae, while 56.8% (21 out of 37) of the MFEs in S. cerevisiae had orthologs in H. sapiens.
Table 4-1 Distribution of AMPs in top 10 host species
Table 4-2 Statistics of the datasets and prediction accuracy of individual class of AMPs. The predicted results are given in TP, FN, TN, FP, sensitivity SE=TP/(TP+FN), specificity SP=TN/(TN+FP), positive prediction value PPV=TP/(TP+FP) and overall accuracy Q=(TN+TP)/(TP+FN+TN+FP). The numbers of members and non-members in the testing and independent evaluation sets are TP+FN and TN+FP, respectively.
Table 4-3 Statistics of prediction accuracy of antimicrobial proteins measured by 5-fold cross validation
Table 4-4 Prediction results of novel antimicrobial proteins by SVM-Prot, where “+” represents proteins correctly predicted as antimicrobial proteins, and “-” represents proteins incorrectly predicted as non-antimicrobial proteins
Table 4-5 List of prediction results of 177 antimicrobial proteins in AMPer database (“+” represents proteins correctly predicted as antimicrobial proteins, and “-” represents proteins incorrectly predicted as non-antimicrobial proteins)
Table 4-7 Distribution of ARPs in top 10 bacteria species
Table 4-8 Statistics of the datasets and prediction accuracy of ARPs (σ = 18)
Table 4-9 Statistics of accuracy for SVM prediction of antibiotic resistance proteins evaluated by using 10-fold cross validation
Table 4-10 Prediction results of novel ARPs
Table 4-11 Statistics of datasets and prediction accuracy of cancer associated proteins
Table 4-12 Distribution of cancer associated proteins in top 10 bacteria species
Table 4-13 Features important for characterizing cancer associated proteins as selected by recursive feature elimination method
Table 4-14 Comparison of prediction performance of all AMPs and non-AMPs with different machine learning methods
Table 4-15 Comparison of prediction performance of antibiotic resistances and non-antibiotic resistances with different machine learning methods
Table 4-16 Comparison of prediction performance of all CAPs and non-CAPs with different machine learning methods
Table 5-1 Distribution of precursor miRNAs in top 10 host species
Table 5-2 Statistics of the datasets and prediction accuracy for precursor miRNAs and mature miRNAs
Table 5-3 Location of predicted and validated rhesus miRNAs within putative precursor sequences. Sequences in italic denote those predicted by MiRDetector while those with underline denote experimentally validated miRNAs.
Table 5-5 Screening results of non-coding RNAs from four representative genomes
Table 5-6 Comparison of prediction performance of precursor miRNAs and non-precursor miRNAs with different machine learning methods
Table 5-7 Comparison of prediction performance of mature miRNAs and non-mature miRNAs with different machine learning methods
S1 Scanning results of E. coli K12 genome (# indicates that data were not included in our model development)
S2 Scanning results of S. aureus Mu50 genome (* indicates functional classification by SVMProt followed by probability of correct characterization P-value, while # indicates the data are not included in our model data set)
S3 Prediction result of potential precursor miRNAs (“+” and “–” indicate that the RNA is predicted as precursor miRNA and non-precursor miRNA, respectively)
LIST OF FIGURES
Figure 1-1 MiRNA biosynthesis. MiRNA is produced from precursor microRNA (pre-miRNA), which in turn is formed from a miRNA primary transcript (pri-miRNA).
Figure 2-1 Architecture of support vector machines
Figure 2-2 Different hyperplanes could be used to separate examples
Figure 2-3 Mapping input space to feature space
Figure 2-4 Schematic diagrams illustrating the process of the training and prediction of the functional class of proteins by using SVM. Sequence-derived features hi, pi, vi, … represent structural and physicochemical properties such as hydrophobicity, polarizability, and volume. Features di, si, mi, … represent properties such as domain information, subcellular localization, and post-translational (PT) modification profiles.
Figure 2-5 Example of k-nearest neighbors (squares and triangles represent training samples and the star symbol indicates an unknown sample)
Figure 2-6 Architecture of a simple three-layer neural network
Figure 2-7 Example of a decision tree classifier
Figure 2-8 The sequence of a hypothetical protein for illustration of derivation of the feature vector*
Figure 3-1 Top 10 Pfam families for known enzymes of single multi-catalytic domain (SMAD-MFEs). About 38% of SMAD-MFEs contain the ArgJ domain, and the majority of them are involved in the urea cycle and metabolism of amino groups pathway (amino acid metabolism map00220).
Figure 3-2 Top 10 Pfam families of known enzymes of multiple catalytic domains (MCD-MFEs)
Figure 3-3 Distribution of known and predicted putative MFEs (enzymes of single multi-catalytic domain SMAD-MFEs, enzymes of multiple catalytic domains MCD-MFEs) in SCOP fold families. 42% of MCD-MFEs and 69% of SMAD-MFEs belong to the alpha and beta fold class (a/b).
Figure 3-4 Statistics of known MFEs according to the number of biological pathways they participate in. In total, 1,293 known enzymes of multiple catalytic domains (MCD-MFEs) and 285 known enzymes of single multi-catalytic domain (SMAD-MFEs) were employed in this study.
Figure 3-5 Statistics of known and predicted enzymes of multiple catalytic domains (MCD-MFEs) with KEGG ontology (KO). MCD-MFEs are involved in 4 level one, 17 level two, and 74 level three pathways. The majority of them participate in carbohydrate metabolism (CAR), lipid metabolism (LIP), nucleotide metabolism (NUC), amino acid metabolism (AAC) and metabolism of cofactors and vitamins (COF). A number with “*” denotes the number of predicted MCD-MFEs.
Figure 3-6 Statistics of known enzymes of single multi-catalytic domains (SMAD-MFEs) in KEGG ontology (KO). SMAD-MFEs are involved in 3 level one, 10 level two and 52 level three pathways. The majority of them participate in carbohydrate metabolism (CAR), amino acid metabolism (AAC) and metabolism of cofactors and vitamins (COF). A number with “*” denotes the number of predicted SMAD-MFEs.
Figure 3-7 Distribution of MFEs in different kingdoms. In total, 2,551 known enzymes of multiple catalytic domains (MCD-MFEs), 4,075 predicted MCD-MFEs, 537 known enzymes of single multi-catalytic domain (SMAD-MFEs), and 245 predicted SMAD-MFEs were included in the statistics. Bacteria dominate both known and predicted MCD-MFEs and SMAD-MFEs in total enzyme number.
Figure 3-8 Statistics of currently known MFEs and predicted MFEs by screening the ExPASy Enzyme database. In total there are 3,120 currently known MFEs, including 2,279 enzymes of multiple catalytic domains (MCD-MFEs), 572 known enzymes of single multi-catalytic domain (SMAD-MFEs). In total, 2,641 novel MFEs with prediction probability >50% (4,320 with probability >80%), including 2,515 MCD-MFEs (4,075 with probability >80%) and 126 SMAD-MFEs (245 with probability >80%) were identified from 91,140 enzymes of ExPASy Enzyme database.
Figure 3-9 SIME interface. The sequence of a protein, in RAW format and containing no non-amino acid letters, can be input in a window provided.
Figure 3-10a Result page of SIME showing that a query sequence is predicted as a multifunctional enzyme with multiple catalytic domain
Figure 3-10b Result page of SIME showing that a query sequence is predicted as a multifunctional enzyme with single catalytic domain
Figure 3-10c Result page of SIME showing that a query sequence is predicted as non multifunctional enzyme
Figure 3-11 Graphical searching interface of MFEs database
Figure 3-12 Graphical user interface of MFEs database
Figure 3-13 Graphical searching interface of MFEs database
Figure 3-14 Biological analysis results interface of MFEs
Figure 4-1 Graphical user interface for SAPI
Figure 4-2 Result page of SAPI showing that a query sequence is an antimicrobial protein
Figure 4-3 Interface for SARPI
Figure 4-4 Result page of SARPI showing that the query sequence is not an antibiotic resistance protein
Figure 4-5 CAPIS interface. The sequence of a protein, in RAW format and containing no non-amino acid letters, can be input in a window provided.
Figure 4-6 Result page of CAPIS showing that the query sequence is a proto-oncogene
Figure 5-1 Graphical user interface of MiRDetector. The sequence of a query, in RAW format and containing no non-AU(T)GC characters, can be input in a window provided.
Figure 5-2 Result page of MiRDetector showing that a query sequence is a potential precursor miRNA
Figure 5-3 Result page of MiRDetector showing the location of the predicted mature miRNA within the precursor
LIST OF ACRONYMS
AMP Antimicrobial Protein
ARP Antibiotic Resistance Protein
CAP Cancer Associated Protein
CAPIS Cancer Associated Protein Identification Server
FP False Positive
IHA Inter-base hydrogen bonds acceptor
IHD Inter-base hydrogen bonds donor
KNN K-Nearest Neighbors
MCC Matthews correlation coefficient
MCD-MFEs MFEs with multiple catalytic domains
MFE Multifunctional Enzyme
MFP Multifunctional Proteins
MiRDetector MicroRNA Detector
miRNA MicroRNA
ncRNAs non-coding RNAs
NMFEP non-MFE proteins
ORFs Open Reading Frames
PNN Probabilistic Neural Network
PSI-BLAST Position Specific Iterative-Basic Local Alignment Search Tool
QP Quadratic Programming
RFE Recursive Feature Elimination
rRNA ribosomal RNA
SAPI Server for Antimicrobial Protein Identification
SARPI Server for Antibiotic Resistance Protein Identification
SIME Server for Identification of Multifunctional Enzyme
SMAD-MFEs MFEs with single multi-activity domain
SVM Support Vector Machine
tRNA transfer RNA
1 Introduction
Proteins are important components of biological systems and essential to any life form. They participate in almost every biological process, such as catalyzing chemical reactions, providing structural rigidity to cells, and transmitting signals and nutrients.
A number of proteins are involved in different disease related pathways, and dysfunction of these proteins accounts for most human diseases. For example, overexpression of oncogenes can cause cancer, while mutations in antimicrobial proteins may reduce their capacity to defend against microbial infection. Therefore, identifying these proteins and understanding their mechanisms would be of great importance for discovering novel therapeutic targets and developing new drugs to treat diseases.
Besides proteins, RNAs are also well recognized as important components of biological systems. According to the central dogma of molecular biology, the genetic information stored in DNA is transcribed into RNA, which is then translated into protein sequences. However, since the late 1990s, a number of non-coding RNAs have been identified by experimental or computational methods. They are not translated into proteins; instead, they function at the RNA level. In particular, a group of small non-coding RNAs, called microRNAs (miRNAs), has attracted intensive interest. It is estimated that one third of human genes are regulated by miRNAs, which opens a new door to controlling the expression of genes of interest and may profoundly influence the current drug discovery process.
Since the sequencing of phage φX174 in 1977, a tremendous amount of genomic information has been decoded and deposited in a variety of databases. As of April 2008, more than 360,000 proteins had been collected in the curated protein database Swiss-Prot, and the number continues to increase rapidly. On the other hand, low-homology and non-homologous proteins of unknown function constitute a substantial part (20% to 100%) of the Open Reading Frames (ORFs) in many newly sequenced genomes. Although wet-lab experiments are still the most effective methods to determine the functions of proteins and RNAs, they are costly and time consuming for annotating such a large amount of data. Therefore, there is a need to explore other methods, including computational approaches, that facilitate the identification of protein and RNA function and complement wet-lab experimental methods.
In this thesis, I will introduce my work on the application of machine learning to the prediction of multifunctional enzymes, disease related proteins, and miRNAs.
1.1 Introduction to multifunctional enzymes (MFEs)
It has long been noticed that some enzymes are able to perform multiple functions [1-4]; these are called multifunctional enzymes (MFEs). An increasing number of such enzymes have been discovered in recent years. MFEs are found to be beneficial to living systems and to provide a competitive survival edge in a variety of ways. They are able to employ alternative approaches to coordinate multiple activities and regulate their own expression [1], which demonstrates an evolutionary advantage as part of a clever strategy for generating complexity from existing proteins without expansion of the genome [3, 5, 6]. The combination of multiple functions enables an enzyme to act as a switch point in biochemical or signaling pathways so that a cell can rapidly respond to changes in the surrounding environment [7]. Multifunctionality seems to be a common mechanism of communication and cooperation between many different functions and pathways within a complex cellular system or between cells [2].
Identification of MFEs and subsequent investigation of the mechanistic and structural basis of their multifunctionality is important for studying the biological roles of enzymes [3, 7] and for the exploration of multiple activities in protein engineering [8] and inhibitor design [9]. Studies of the sequences, structures and components of MFEs have demonstrated that useful information can be derived to facilitate the understanding of the mechanism of action [10], organizational and evolutionary features [11], and assembly patterns [12] of MFEs. An in-depth study of a comprehensive collection of MFEs is expected to provide a more complete picture of the functional, evolutionary, and structural features of multifunctional enzymes.
A recent study indicates that current sequence analysis algorithms (alignment, clustering and motif approaches) are capable of disclosing individual functions of MFEs [13]. Algorithms based on remote homology, such as PSI-BLAST (Position Specific Iterative-Basic Local Alignment Search Tool) [14], have been found to perform well in finding alternative functions of MFEs [13]. However, in some cases it is difficult to determine whether the multiple functions predicted by these methods are due to true multifunctionality or to false identification [2-4]. Thus it is highly desirable to develop a method to determine the multifunctionality of proteins. MFEs have certain common structural and physicochemical characteristics in spite of the diversity of their sequences and structures, which can potentially be exploited to determine whether enzymes are multifunctional or not. Active sites of enzymes with multiple catalytic activities are inherently reactive environments packed with nucleophiles, electrophiles, acids, bases and cofactors [3]. Special structural features are present in some MFEs to enable them to bind to different substrates [3]. The surface of some MFEs allows the formation of complexes with different proteins or substrates in different cellular environments [2, 7].
Proteins with multiple functions are known to have high sequence and structure diversity but nonetheless possess common structural and physicochemical features that allow them to perform common functions. Such characteristics make it difficult to identify MFEs by homology-based approaches. Thus it is desirable to explore other methods to identify MFEs.
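To illustrate how shared physicochemical characteristics can be captured independently of sequence homology, the following minimal sketch encodes a protein sequence as the fraction of residues falling into three hydrophobicity classes. The grouping used here is an assumption borrowed from commonly used composition descriptors; the groups actually used in this work are those of Table 2-2, and the function name and the example sequence are hypothetical.

```python
from collections import Counter

# Assumed three-class hydrophobicity grouping (illustrative only; the
# thesis defines its own amino acid groups in Table 2-2).
HYDROPHOBICITY_GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def composition_descriptor(sequence, groups=HYDROPHOBICITY_GROUPS):
    """Return the fraction of residues in each physicochemical group.

    Residues that belong to no group (e.g. 'X') are ignored.
    """
    residue_to_group = {
        residue: name for name, members in groups.items() for residue in members
    }
    counts = Counter(
        residue_to_group[residue]
        for residue in sequence.upper()
        if residue in residue_to_group
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no recognized amino acids")
    # One value per group, in the order the groups are declared.
    return [counts[name] / total for name in groups]


# Hypothetical short peptide: 1 polar, 4 neutral and 5 hydrophobic residues.
vector = composition_descriptor("MKVLAAGIVG")
print(vector)
```

Two sequences with little alignable similarity can still yield similar descriptor values, which is why such fixed-length vectors are a convenient input for classifiers such as SVM.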
1.2 Introduction to disease related proteins
1.2.1 Antimicrobial proteins
Microbes, such as bacteria, viruses and fungi, are responsible for a number of diseases in humans and other organisms, such as acute bacterial meningitis [15], human immunodeficiency virus (HIV) infection [16] and latent tuberculosis infection [17]. On the other hand, host organisms have developed a variety of sophisticated mechanisms to fight the invasion of microbes, among which antimicrobial peptides play an important role. Antimicrobial peptides are able to induce both innate and adaptive immune responses in host organisms [18, 19]. They usually take effect by inserting into the microbial membrane, either disrupting the physical integrity of the bilayer or translocating across the membrane to act on internal targets [18]. Due to their broad-spectrum antimicrobial properties, antimicrobial peptides are increasingly used as molecular therapies [19]. A number of databases have also been developed to collect and characterize antimicrobial peptides [20-23].
Antimicrobial peptides are derived from antimicrobial proteins (AMPs) upon bacterial attack [24, 25]. Therefore, knowledge of AMPs would be helpful for identifying novel therapeutic targets and inventing new antimicrobial agents to treat diseases caused by bacteria. The characterization of AMPs to date has relied mainly on experimental approaches such as NMR [26], electron microscopy [27], and fluorescent dyes [28]. However, many of them require a purified or semi-purified target of interest and are usually time consuming, which limits their application to the large-scale identification of antimicrobial peptides [29]. Therefore, alternative approaches, including computational methods, would be helpful for the identification of AMPs.
1.2.2 Antibiotic resistance proteins
Antibiotics are believed to be one of the greatest medical inventions of the 20th century and have extended human life expectancy by 10 years [30, 31]. Antibiotics have been widely used to treat various diseases caused by bacteria, such as tuberculosis, pneumonia and leprosy, which were lethal before the invention of antibiotics. Antibiotics take effect by inhibiting or killing bacteria while causing little or no harm to the host. Various mechanisms are used by antibiotics to achieve this selective effect. For instance, some antibiotics inhibit the synthesis of key proteins that play critical roles in bacterial growth and proliferation [32], whilst others disrupt the bacterial membrane structure and result in bacterial death [33].
However, the widespread use of antibiotics also applies selective pressure on bacteria [34]. Antibiotic resistance began to emerge almost as soon as penicillin was first used clinically. The emergence of highly virulent and multi-drug resistant bacterial strains has presented a serious challenge to traditional therapies for infectious diseases [35]. Antibiotic resistance accounts for a number of treatment failures, and it can be fatal to critically ill patients who rely on antibiotics to fight bacteria [34]. To make the situation worse, resistant bacteria can spread widely, posing more serious problems for infection control [36].
Antibiotic resistance is a consequence of natural selection or programmed evolution. Multiple mechanisms contribute to antibiotic resistance, such as drug modification by enzymatic mechanisms, mutation of drug targets, enhanced efflux pump expression, and altered membrane permeability [36]. A number of proteins have been found to be responsible for antibiotic resistance. For instance, many multi-drug resistance efflux systems can pump antibiotics out of the cell by means of a collection of membrane associated proteins [37]. Specific mutations in antibiotic targets may hinder the binding, and thus the effectiveness, of certain antibiotics [38, 39]. In addition, resistance determinants borne on plasmids, bacteriophages, transposons and other mobile genetic elements can be transferred to naive recipients [36, 40]. Therefore, antibiotic resistance proteins may come from diverse sources, ranging from DNA gyrase and topoisomerase to mutated enzymes, or from gene duplication and over-expression of certain carrier proteins.
Recognizing these proteins is critically important for studying the evolution of antibiotic resistance, which will facilitate the design of novel drugs to control the potential spread of antibiotic resistance [40]. As part of the efforts to understand and identify these proteins, two antibiotic resistance protein databases, ARGO [41] and MvirDB [42], have been developed to collect and characterize antibiotic resistance proteins (ARPs). Various experimental methods have been explored for the identification of ARPs [43-46].
However, these methods are usually costly, time consuming, and resource intensive, and the fluidity of microbial genomes further increases the burden. Therefore, it would be helpful to explore alternative methods, including computational approaches, to identify ARPs.
1.2.3 Cancer associated proteins
Cancer is the second leading cause of death in the western world, second only to cardiovascular diseases. Intense efforts have been devoted to the study of cancer genesis, progression, and therapeutic implications. Cancer refers to a group of diseases in which normal growth-control mechanisms have no effect on the affected cells. Unlike normal cells, which respond to growth-control mechanisms, cancer cells are capable of growing indefinitely and will invade nearby healthy tissue. Moreover, cancer cells can also migrate and proliferate in other places through metastasis, which accounts for 90% of human cancer deaths.
The induction of cancer involves the accumulation of multiple genetic alterations. A wide variety of chemical and physical agents can cause mutations in normal cells and induce malignant transformation, which eventually leads to cancer. For instance, extensive exposure to UV radiation may lead to the mutation and inactivation of p53 [47, 48], which plays an important role in suppressing tumors. Another important cause of tumors is DNA or RNA viruses, which may integrate their genomes into host chromosomes and result in malignant transformation of virus-infected cells. For example, HIV-1 [49] can reverse transcribe its RNA into DNA and integrate it into the human genome, which may lead to malignant transformation.
Within a normal tissue, cellular proliferation and cell death are carefully regulated by a number of signals. A number of genes responsible for malignant transformation have been identified in the past three decades [50]. The growth and death of normal cells are delicately maintained by two categories of cancer related genes: proto-oncogenes and tumor suppressors. Proto-oncogenes are normal genes whose mutated forms, called oncogenes, code for proteins that cause cancer [51-53]. Proto-oncogenes are converted to oncogenes by mutation or genetic rearrangement. Some oncogenes cause the overproduction of growth factors, leading to uncontrolled cell growth, while others perturb parts of the signaling cascade [54]. Tumor suppressors, on the other hand, regulate cell proliferation or initiate apoptosis, which reduces the possibility that a cell develops into a tumor cell [55, 56]. For example, mutational inactivation of the retinoblastoma gene results in unregulated tumor proliferation.
Identification of cancer associated proteins will facilitate efforts to understand the mechanism of cancer development, and will therefore help to discover novel pharmaceutical agents and therapeutic targets against cancer. To date, the characterization of cancer-related proteins has relied mainly on experimental approaches such as molecular cloning [57]. For example, RB was the first tumor suppressor gene isolated from the human genome, in 1986 [57]. To complement these experimental approaches, it would therefore be helpful to explore computational methods for finding these proteins.
1.3 Introduction to microRNAs
Non-coding genes function without being translated into protein products; instead, their products function at the RNA level. For many years, it was believed that there were only a few non-coding RNAs (ncRNAs), such as transfer RNA (tRNA) and ribosomal RNA (rRNA), both of which are involved in translation and gene expression [58]. However, since the late 1990s, a number of new non-coding RNAs have been found to participate in various regulatory events, which opens a new door to the investigation of gene regulatory networks.
MicroRNAs (miRNAs) are among the smallest functional ncRNAs that regulate gene expression. Since the discovery of the first miRNA in 1993 [59], miRNAs have attracted increasing scientific interest. MiRNA genes may be located in intergenic regions or in introns, and some of them are found in clusters [60]. Many miRNAs have heterogeneous expression profiles in different tissues, and could therefore be used as potential cancer markers [61-63]. The majority of miRNAs are 21 to 25 nucleotides (nt) in length [64], with an average length of 21 nt. Many miRNAs are conserved in both sequence and structure during evolution [65]. Mature miRNAs are derived from miRNA precursors (pre-miRNAs), which are about 70-100 nt long and have an imperfect stem-loop structure with one or two miRNAs in the arms [66, 67]. Figure 1-1 shows the biosynthesis of miRNAs in humans. MiRNAs are first transcribed by RNA polymerase II as primary transcripts (pri-miRNAs) with a cap and a poly-A tail [68]. Pri-miRNAs are then processed into pre-miRNAs by the microprocessor complex, which comprises Drosha [69] and DGCR8 [70]. The pre-miRNAs are then transported from the nucleus to the cytoplasm by another complex consisting of exportin 5 and RanGTP [71]. In the cytoplasm, pre-miRNAs are released and processed by Dicer into short double-stranded RNAs [72]. One strand, the mature miRNA, is integrated into the RNA-induced silencing complex (RISC) [73, 74]. This complex is responsible for the gene silencing observed in miRNA expression and RNA interference [75, 76].
MiRNAs play important roles in gene regulation at the post-transcriptional level. It is estimated that approximately one third of protein coding genes are regulated by miRNAs [77]. MiRNAs are involved in a surprisingly diverse range of biological processes and have been implicated in a number of human diseases [78, 79]. The exact mechanisms of gene regulation by miRNAs remain to be discovered. Evidence shows that a miRNA can degrade its target transcript or inhibit protein translation [64]. MiRNAs negatively regulate their targets through sequence-specific base pairing [80]: they bind to mRNA targets in the 3’ untranslated region (3’ UTR), repressing translation and mediating degradation [72]. The regulatory mechanisms of miRNAs differ between plants and animals. Most plant miRNAs bind almost perfectly to their target mRNAs, and their binding sites are not limited to the 3’ UTR but may occur throughout the target transcript [81]. In contrast, the pairing of animal miRNAs to the 3’ UTRs of their targets is imperfect.
Figure 1-1 MiRNA biosynthesis. MiRNA is produced from precursor microRNA (pre-miRNA), which in turn is formed from a miRNA primary transcript (pri-miRNA).
The number of miRNAs in a vertebrate genome is estimated to be about 800-1000 [82, 83], and approximately 0.5-1.5% of human genes are estimated to encode miRNAs [84]. Efforts have been devoted to collecting and annotating miRNAs through various approaches [85, 86]. A number of experimental methods have been developed to identify and characterize miRNAs [87-90]. However, these methods are usually costly, time consuming, and resource intensive [91, 92]. The short sequences, redundancy, and heterogeneous expression profiles of miRNAs make their discovery even more difficult [92, 93]. Numerous computational methods have also been developed to facilitate the identification of miRNAs in different genomes, including sequence alignment [94, 95], structure based approaches [96] and conservation based approaches [97, 98]. One statistical learning method, the support vector machine (SVM), has also been applied to identify new miRNA candidates [93, 99, 100]. However, these methods usually produce too many false positives when applied to large genomes. Predicting miRNAs with a lower false positive rate therefore remains a challenging task.
1.4 Overview of computational methods for biological function prediction
1.4.1 Sequence similarity method
The sequence similarity method (also called the sequence alignment method) is the most popular method for protein or RNA function prediction. Its underlying assumption is that similar sequences imply similar structures, and in turn similar functions, which holds in most cases.
Modern sequence alignment methods began with the global homology algorithm of Needleman and Wunsch [101], which uses an iterative matrix method to optimize the alignment between two sequences. More rigorous methods then started to emerge, such as the Sankoff alignment (1972) [102] and the Reichert alignment (1973) [103], although their biological implications were difficult to formulate. Later, Smith and Waterman developed a local sequence alignment method [104] that searches only for relatively conserved subsequences, so a single sequence may yield more than one subsequence, and only these conserved subsequences contribute to the alignment score. Although this method was more useful for searching sequences in databases, it was still quite time consuming and had to be run on supercomputers when large databases needed to be searched. Heuristic algorithms were proposed to solve this problem. One of the first attempts was the FASTA program developed by Lipman and Pearson [105], which aims to identify locally similar regions between two sequences using a PAM matrix. This strategy significantly decreased the computation time needed for comparison. In 1990, a breakthrough sequence comparison method, the Basic Local Alignment Search Tool (BLAST), was developed [106]. At that time, BLAST was significantly faster than any other sequence alignment tool while maintaining comparable sensitivity, balancing accuracy against computation speed. Gapped BLAST [14] was later developed to generate gapped alignments, and runs approximately three times as fast as the original BLAST. Meanwhile, Position Specific Iterated BLAST (PSI-BLAST) [107] allows the BLAST search to iterate, which is particularly useful for identifying remotely homologous proteins.
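The global alignment idea described above can be sketched as a small dynamic program. The following is a minimal illustration in the spirit of the Needleman-Wunsch algorithm, not a reproduction of the original formulation or of any alignment tool; the function name and the scoring values (match = 1, mismatch = -1, linear gap penalty = -1) are assumptions chosen for illustration only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    The scoring scheme is an illustrative assumption (simple match/mismatch
    scores with a linear gap penalty), not a substitution matrix such as PAM.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))
```

The table has (n + 1) x (m + 1) cells, so both time and memory grow with the product of the sequence lengths. This is one way to see why exhaustive alignment against large databases was considered expensive and why heuristic tools were developed.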
Although sequence alignment methods perform well in sequence analysis, they still have some limitations. Some proteins are so unique that it is difficult to find their “neighbors” in existing protein databases. Moreover, not all similar proteins have analogous functions [108]. There is therefore a need for methods of assigning protein function that go beyond sequence alignment.
1.4.2 Motif based methods
Many proteins or RNAs are found to share consensus sequences, or motifs, which may provide important clues for function prediction [109]. Motif based methods, such as Motifs, Prosite [109] and Sequence Clustering [110], have been developed to detect common motifs among proteins and RNAs. Motif databases, such as PROSITE, ProDom and Rfam, are also widely used in sequence analysis.
PROSITE [111] consists of a large collection of biologically significant signature patterns that were manually annotated and are used to determine the function of a given protein. The first release of PROSITE was published in 1992 and contained 397 entries describing 433 different patterns. By April 2008, the number of patterns in the database had increased to 1318. The problem with PROSITE is that these patterns are usually too short, so they may match many unrelated sequences and produce too many false positives.
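To illustrate how a short signature pattern is matched against a sequence, the sketch below translates a PROSITE-style pattern into a Python regular expression. It is a minimal example that assumes only a subset of the pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repetition, `<` and `>` as terminal anchors); the helper names `prosite_to_regex` and `find_motif` and the test sequence are hypothetical. The example pattern `N-{P}-[ST]-{P}` is the N-glycosylation site signature.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Only a subset of the syntax is handled; this is an illustrative sketch,
    not a complete parser.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Optional repetition such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:
            element = m.group(1)
            upper = "," + m.group(3) if m.group(3) else ""
            repeat = "{" + m.group(2) + upper + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"   # forbidden residues
        else:
            core = element               # single residue or [ABC] class
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return start positions (0-based) of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]
```

Because the example pattern constrains only four positions, it is expected to match many sequences by chance, which is the false positive problem noted above.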
To address this problem, structurally defined regions, called domains, are used to characterize parts of protein sequences with well defined functions. ProDom and Pfam are two examples of this kind of database. ProDom [112] is a comprehensive database of protein domain families generated from the Swiss-Prot database by automated sequence comparisons. It can be used to analyze the domain arrangements of complex protein families and protein homology relationships. Similarly, the Pfam database [113, 114] covers a large collection of manually curated protein domain families. Each family is represented by two multiple sequence alignments, two profile Hidden Markov Models (profile-HMMs) and an annotation file [114]. Pfam can automatically classify query proteins into protein domain families [115]. As of April 2008, Pfam covered 9,318 families.
Although many motif databases exist and contain a large amount of pattern and domain information, not all newly sequenced proteins or RNAs are covered by these databases. If a new sequence does not contain any domain defined in current domain databases, its function cannot be identified. It is therefore desirable to explore alternative methods, besides motif based methods, for predicting protein function.
1.4.3 Machine learning approach
Unlike the sequence similarity and motif based approaches, machine learning methods take a different strategy to predict protein function. They derive rules from characteristics shared by proteins of a class, and then apply these rules to classify unseen examples. Machine learning methods have been successfully applied to the identification of novel enzymes [116], bacterial proteins [117], lipid-binding proteins [118], transporters [119] and other protein functional classes [120, 121].
A number of challenges remain to be solved, such as the generation of effective negative samples, ambiguous information in biological data, and data imbalance.
1.5 Scope and objective
The objective of this study is to develop computational tools that facilitate the identification of multifunctional enzymes, disease related proteins and miRNAs from physicochemical properties derived from their primary sequences. Machine learning methods were employed in the study. These computational tools are expected to offer an alternative solution for identifying functional proteins and non-coding RNAs, and to help accelerate the pace of drug development and the discovery of new therapeutic targets.
The objective can be divided into three parts:
1. To develop a classification system for predicting MFEs directly from their primary sequences, and to further analyze their mechanisms, evolution, and species distribution.
2. To develop prediction systems for disease related proteins, including antimicrobial proteins, antibiotic resistance proteins, and cancer related proteins.
3. To apply machine learning methods to predict miRNAs.
To achieve these three parts of the objective, a machine learning method, the support vector machine (SVM), is employed to develop the prediction systems. It is particularly useful for predicting proteins or miRNAs that are not homologous to those of known function, where traditional sequence similarity or motif based approaches are likely to fail.
This thesis includes six chapters. Chapter 1 introduces multifunctional enzymes, disease related proteins, microRNAs and current methods for predicting protein and RNA function. Chapter 2 describes the algorithms of different machine learning methods, as well as the construction of feature vectors. The application of machine learning methods to the prediction of multifunctional enzymes, disease related proteins and microRNAs is described in Chapter 3, Chapter 4 and Chapter 5, respectively. Chapter 6 presents the conclusions and future work.
2 Methods
In this chapter, the algorithms of four well known machine learning methods are introduced; these are used to develop computational methods for predicting functional proteins and RNAs. Feature selection and performance evaluation are also described. As most machine learning methods accept only numerical values rather than protein or RNA sequences, the sequences must be converted into numerical vectors before machine learning can be applied. The construction of feature vectors is covered in the last part of this chapter.
2.1 Machine learning methods
The term machine learning refers to algorithms and techniques that allow computers to extract information from past experience. Although it emerged as a separate research field in the early 1980s, the study of machine learning can be traced back to the 1960s [122]. Over the past 50 years, various machine learning methods have been developed and applied in a wide spectrum of fields, such as k-nearest neighbor algorithms in text categorization, decision tree methods in pharmaceutical research, artificial neural networks in stock market analysis and prediction, and support vector machines in bioinformatics and cheminformatics.
Machine learning uses computational and statistical methods to build mathematical models and make inferences from training samples [123]. Machine learning is a branch of artificial intelligence (AI) and is closely related to statistics and pattern recognition, since they all study the analysis of data. However, unlike statistics and pattern recognition, machine learning is primarily concerned with the algorithmic complexity of computational implementations [124].
To be learnt by computational methods, all samples, or instances, should be represented by feature vectors, whose components may be categorical, binary or continuous. Machine learning can be divided into two categories: if the samples are given with known classes, it is called supervised learning; otherwise, it is called unsupervised learning [125]. In supervised learning, the learning process optimizes an objective function on the training examples so that the value of the function can be predicted for any valid input object. This category includes well known machine learning methods such as k-nearest neighbors, support vector machines, and decision trees. Unsupervised learning, on the other hand, is never given the answer set, and all answers are assumed to be latent variables. All data under investigation are allowed to speak for themselves and are treated equally. This category includes self-organizing maps and clustering methods.
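As a small illustration of representing a sample by a continuous feature vector, the sketch below converts a protein sequence into its amino acid composition. This descriptor is an assumption chosen for simplicity and is not necessarily the feature set used in this work; the function name `composition_vector` is hypothetical.

```python
# The 20 standard amino acids, in a fixed order that defines the vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Return a 20-dimensional vector of amino acid fractions.

    Characters outside the 20 standard residues are ignored. This is an
    illustrative descriptor only, assumed for the example.
    """
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]
```

Every sequence, whatever its length, is mapped to a vector of the same dimension, which is the property a learning algorithm needs. In a supervised setting each such vector would be paired with a known class label.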
In the following sections, four machine learning algorithms are introduced: support vector machines, k-nearest neighbors, neural networks, and decision trees. Their specific properties, and their advantages and disadvantages in real-world problems, are also discussed.
2.1.1 Support Vector Machine (SVM)
The support vector machine (SVM) is one of the newest members of the supervised learning family [126]. It was first formally proposed by Vladimir Vapnik in 1995 [126], and then further explained by Burges in 1998 [127]. A special property of SVM is that it simultaneously minimizes the empirical classification error and maximizes the geometric margin. Since its introduction, SVM has been successfully applied to a wide range of real-world problems, including hand-written digit recognition [128], tone recognition [129] and image classification [130-133], as well as broad areas of biology, such as protein function prediction [134, 135], protein-protein interaction prediction [136], protein remote homology detection [137, 138], and the discrimination of coronary heart disease patients [139]. SVM is the primary method used in our study, so its theory and algorithm are discussed in more detail in the following sections.
2.1.1.1 Linear SVM
In two-class problems, SVM aims to separate the examples of the two classes with the maximum-margin hyperplane (Figure 2-1). Mathematically, the data consist of n examples of two classes, denoted as

$$\chi = \{(x_1, y_1), \ldots, (x_n, y_n)\},$$

where $x_i \in \mathbb{R}^N$ is a vector in feature space and $y_i \in \{-1, +1\}$ denotes its class. A hyperplane can be drawn to separate the examples of one class (positive examples) from those of the other (negative examples). The hyperplane is represented by $w \cdot x + b = 0$, where $w$ is the normal (slope) vector and $b$ is the bias. The objective of SVM thus becomes minimizing the squared Euclidean norm $\|w\|^2$ subject to the following constraints:

$$w \cdot x_i + b \geq +1 \quad \text{for } y_i = +1,$$
$$w \cdot x_i + b \leq -1 \quad \text{for } y_i = -1.$$

The class of a new instance is determined by the side of the hyperplane on which it lies, so the decision function becomes

$$f_{w,b}(x) = \operatorname{sign}(\langle w, x \rangle + b).$$
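The decision rule above can be sketched in a few lines. The example assumes that a weight vector w and bias b are already available (here they are picked by hand for a toy two-dimensional data set rather than obtained by training), and the constraint check uses the standard combined form y_i (w · x_i + b) ≥ 1. All function names and values are hypothetical.

```python
def decision_value(w, b, x):
    """Compute <w, x> + b for a single feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane x lies on."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def satisfies_margin_constraints(w, b, samples):
    """Check y_i * (<w, x_i> + b) >= 1 for every (x_i, y_i) pair."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in samples)


# Toy data: one negative and one positive example (values chosen by hand).
samples = [([1.0, 1.0], -1), ([2.0, 2.0], +1)]
w, b = [1.0, 1.0], -3.0  # the hyperplane x1 + x2 - 3 = 0
```

With these hand-picked values both training points lie exactly on the margin, so `satisfies_margin_constraints(w, b, samples)` holds with equality, and a new point such as `[3.0, 3.0]` is assigned to the positive class.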
Figure 2-1 Architecture of support vector machines
Geometrically, all points are divided into two regions by a hyperplane H. As shown in Figure 2-2, there are numerous ways in which a hyperplane can separate these examples. The objective of SVM is to choose the “optimal” hyperplane. As all new examples are assumed to follow a distribution similar to that of the training examples, the hyperplane should be chosen such that small shifts in the data do not cause the prediction result to fluctuate. Therefore, the hyperplane separating the examples of the two classes should have the largest margin, which is expected to give the best generalization performance. Such a hyperplane is called the Optimal Separating Hyperplane (OSH) [30].
Figure 2-2 Different hyperplanes could be used to separate examples
Examples located on the margins are called support vectors, and their positions determine the location of the hyperplane. The OSH can thus be represented by a linear combination of support vectors. The margin $\gamma_i(w, b)$ of a training point $x_i$ is defined as the distance between $H$ and $x_i$:

$$\gamma_i(w, b) = \frac{y_i (w \cdot x_i + b)}{\|w\|}.$$

Maximizing the margin is an equivalent statement of the problem

$$\text{Minimize } \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \ldots, n.$$

This QP problem can be solved efficiently by several standard algorithms, such as Sequential Minimal Optimization [140] or decomposition algorithms [141]. The training points for which the Lagrange multiplier $\alpha_i > 0$ are called support vectors, and they lie on the margin [127].
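For reference, the quadratic programming problem and the multipliers $\alpha_i$ mentioned above are commonly written in the following form. This is the standard textbook derivation for the hard-margin linear SVM, given here as a sketch; the equation numbering and notation used elsewhere in this thesis may differ.

```latex
% Lagrangian of the primal problem, with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \frac{1}{2}\|w\|^2
  - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]

% Setting the derivatives with respect to w and b to zero
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0

% Dual quadratic programming problem
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
\quad \text{subject to } \alpha_i \ge 0, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0

% Decision function expressed through the multipliers
f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b \right)
```

Only the points with $\alpha_i > 0$ contribute to the sums in $w$ and $f(x)$, which is why the hyperplane can be described by the support vectors alone.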
2.1.1.2 Nonlinear SVM
Many real-world problems are too complicated to be solved with linear classifiers. With the introduction of kernel techniques, the input data can be mapped to a higher-dimensional space, where a new linear classifier can be used to classify the examples (Figure 2-3).
Figure 2-3 Mapping input space to feature space
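Let $\Phi$ denote an implicit mapping function from the input space to the feature space $F$. All the previous equations are then transformed by substituting the input vector $x_i$ and the inner product $\langle x_i, x \rangle$ with $\Phi(x_i)$ and the kernel $K(x_i, x)$ respectively, where

$$K(x_i, x) = \Phi(x_i) \cdot \Phi(x).$$

Equation (13) is then replaced by its kernel form, in which each inner product is substituted by $K(x_i, x)$. Commonly used kernel functions include:

Polynomial: $k(x, z) = (\langle x, z \rangle + 1)^p$
Sigmoid: $k(x, z) = \tanh(\kappa \langle x, z \rangle - \delta)$
Radial basis function (RBF): $k(x, z) = \exp(-\|x - z\|^2 / 2\sigma^2)$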
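The three kernel functions listed above can be written directly from their formulas. The sketch below uses plain Python lists as vectors; the function names and the default parameter values (p = 2, κ = 1, δ = 0, σ = 1) are assumptions chosen for illustration, not values used in this work.

```python
import math


def dot(x, z):
    """Inner product <x, z> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, p=2):
    """k(x, z) = (<x, z> + 1)^p"""
    return (dot(x, z) + 1) ** p


def sigmoid_kernel(x, z, kappa=1.0, delta=0.0):
    """k(x, z) = tanh(kappa * <x, z> - delta)"""
    return math.tanh(kappa * dot(x, z) - delta)


def rbf_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

For the RBF kernel, the value is 1 when the two vectors coincide and decays towards 0 as they move apart; a larger σ makes the decay slower, so σ controls how local the influence of each training example is.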
In this work, the RBF kernel, also known as the Gaussian kernel, is used because of the advantages demonstrated in previous studies [118, 144, 145]. Different SVM models can then be developed by using different σ values. It is thus necessary to scan a number of σ values to find the best model, as evaluated by performance on the classification task. In our work, SVM models with σ value