
5.7 Results of Multi-view Prioritization

5.7.1 Multi-view Performs Better than Single View

According to the experimental results, integrating the multi-view data gives significantly better performance than the best individual data source. Among the approaches we tried, the best performance is obtained by combining the 9 CV profiles as kernels in the LSI-reduced dimensionality (1-SVM+LSI, Error of AUC = 0.0335). This error is less than half that of the best single CV profile (LDDB, 0.0792). The ROC curves of the leave-one-out performance of the different approaches are presented in Figure 5.5. Without LSI, 1-SVM data fusion reduces the error from 0.0792 to 0.0453. Coupling LSI with 1-SVM reduces the error further, from 0.0453 to 0.0335. Considering the cost and effort of validating false positive genes in lab experiments, the improvement from 0.0792 to 0.0335 is quite meaningful: when prioritizing 100 candidate genes, our proposed method saves the effort of validating about 4 false positive genes.

The obtained result can also be compared with the performance of existing systems. The Endeavour system [3] uses the same disease gene benchmark dataset and the same evaluation method. Endeavour differs from our approach in two main aspects. Firstly, Endeavour combines one textual data source (a GO-IDF profile obtained from free literature text mining) with nine other biological data sources, and it uses no dimensionality reduction. Secondly, Endeavour applies order statistics to integrate the models. The performance obtained in our paper is much better than that of Endeavour (Error = 0.0833). Moreover, it is also better than the result obtained by De Bie et al. [15], who use the same data sources as Endeavour and apply the 1-SVM MKL algorithm for model integration (best performance Error = 0.0477, θmin = 0.5/k). Since the methods and the disease gene benchmark data are exactly the same, the improvement can only be attributed to the multi-view text mining and the LSI dimensionality reduction. It is also notable that the L2-norm 1-SVM MKL performs better than the L∞ (θmin = 0) approach. When optimizing the L2-norm, the integration of 9 CVs has an error of 0.0587, and the integration of 9 LSI profiles has an error of 0.0392. This result is consistent with our hypothesis that non-sparse kernel fusion may perform better than sparse kernel fusion. However, the best result in the multi-view prioritization is obtained by the L1 (θmin = 1) approach.
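The saving of roughly four validations quoted above can be checked with a short calculation. The sketch below assumes that the reported error is 1 - AUC and that each run ranks one true disease gene among 100 candidates, in which case the error multiplied by 99 is the expected number of non-disease genes ranked above the true gene. The function name is illustrative and the exact evaluation code of the study may differ.

```python
def expected_false_positives_ahead(auc_error, n_candidates=100):
    """Expected number of non-disease genes ranked above the one true gene.

    Assumes auc_error = 1 - AUC and exactly one true gene among n_candidates,
    in which case 1 - AUC is the fraction of the other genes ranked above it.
    """
    return auc_error * (n_candidates - 1)


best_single = expected_false_positives_ahead(0.0792)  # LDDB alone
multi_view = expected_false_positives_ahead(0.0335)   # 1-SVM + LSI
saved = best_single - multi_view

print(f"best single view: {best_single:.2f} false positives ahead")
print(f"multi-view:       {multi_view:.2f} false positives ahead")
print(f"saved:            {saved:.2f}")
```

Under these assumptions the difference comes to about 4.5 genes per 100 candidates, in line with the figure of 4 given in the text.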

[Figure: error of prioritization (axis range 0 to 0.25) for each controlled vocabulary (eVOC, GO, KO, LDDB, MeSH, MPO, OMIM, SNOMED, Uniprot); legend: complete CV, LSI]

Fig. 5.3 Prioritization results obtained by complete CV profiles and LSI profiles

126 5 Multi-view Text Mining for Disease Gene Prioritization and Clustering

[Figure: error of prioritization (axis range 0 to 0.2) for complete and subset controlled vocabularies: eVOC (anatomical system, human development, pathology), GO (biological process, molecular function, cellular component), MeSH (diseases, biological, analytical), SNOMED (situation, observable entity, morphologic abnormality)]

Fig. 5.4 Prioritization results obtained by complete CV profiles and subset CV profiles

5.7.2 Effectiveness of Multi-view Demonstrated on Various Number of Views

To further demonstrate the effectiveness of multi-view text mining, we evaluated the performance on various numbers of views. The number was increased from 2 to 9, and three different strategies were adopted to add the views. Firstly, we simulated a random strategy by enumerating all combinations of views for each number from 2 to 9. The number of combinations of 2 out of 9 views is C(9,2), that of 3 out of 9 is C(9,3), and so on. We calculated the average performance of all combinations for each number of views. In the second and third experiments, the views were added according to two different heuristic rules. We ranked the nine views by performance; from high to low, the order was LDDB, eVOC, MPO, GO, MeSH, SNOMED, OMIM, Uniprot, and KO. The second strategy combined the best views first and increased the number from 2 to 9. In the third strategy, the irrelevant views were integrated first. The results obtained by these three strategies are presented in Figure 5.6. The performance of the random strategy increased steadily with the number of views involved in the integration. In the best-view-first strategy, the performance increased until it reached the ideal performance, then started to decrease as more irrelevant views were involved. The ideal performance of prioritization was obtained by combining the five best views (Error of AUC = 0.0431) with the 1-SVM method applied on the averaged combination of kernels. The generic integration method (order statistics) did not perform well on high-dimensional gene-by-term data. The performance of integrating all CVs was comparable to the ideal performance, which shows that the proposed multi-view approach is quite robust to irrelevant views. Furthermore, the merit in practical explorative analysis is that a near-optimal result can be obtained without evaluating each individual model. In the third strategy, because the combination starts from the irrelevant views, the performance was not comparable to the random or the ideal case. Nevertheless, as shown, the performance of the multi-view approach was always better than the best single view involved in the integration. Collectively, this experiment clearly illustrated that the multi-view approach is a promising and reliable strategy for disease gene prioritization.

Fig. 5.5 ROC curves of prioritization obtained by various integration methods. The light grey curves represent individual textual data. The near-diagonal curve is obtained by prioritization of random genes.
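The three ways of adding views can be sketched in a few lines. The example below only builds the view subsets that each strategy would evaluate; it does not run any prioritization. The function names are illustrative, and the view order is the single-view ranking given in the text.

```python
from itertools import combinations
from math import comb

# Views ranked from best to worst single-view performance, as listed in the text.
RANKED_VIEWS = ["LDDB", "eVOC", "MPO", "GO", "MeSH", "SNOMED", "OMIM", "Uniprot", "KO"]


def random_strategy_subsets(views, k):
    """All k-view combinations; the text averages performance over these."""
    return list(combinations(views, k))


def best_first_subset(views, k):
    """The k best views, in the order they are added (views sorted best to worst)."""
    return views[:k]


def irrelevant_first_subset(views, k):
    """The k worst views, in the order they are added (views sorted best to worst)."""
    return views[::-1][:k]


for k in range(2, 10):
    n_subsets = len(random_strategy_subsets(RANKED_VIEWS, k))
    # Sanity check against the closed-form binomial coefficient C(9, k).
    assert n_subsets == comb(len(RANKED_VIEWS), k)
    print(k, n_subsets,
          best_first_subset(RANKED_VIEWS, k)[-1],
          irrelevant_first_subset(RANKED_VIEWS, k)[-1])
```

Each printed row gives the number of views, the number of combinations averaged by the random strategy, and the view most recently added by the best-first and irrelevant-first strategies.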

5.7.3 Effectiveness of Multi-view Demonstrated on Disease Examples

To explain why the improvements take place when combining multiple views, we show an example taken from the prioritization of MTM1, a gene relevant to the disease Myopathy. In the disease benchmark data set, Myopathy contains 41 relevant genes, so we build the disease model using the other 40 genes and leave MTM1 out with 99 randomly selected genes for validation. In order to compare the rankings, only in the experiment for this example, the 99 random candidate genes are kept identical across the different views. In Table 5.4, we list the ranking positions of MTM1 and the false positive genes in all 9 CVs.

[Figure: three panels plotting AUC error on prioritization against the number of views (random, best first, irrelevant first); legend: Order, SVM, best single]

Fig. 5.6 Multi-view prioritization with various number of views

When using the LDDB vocabulary, 3 “false positive genes” (C3orf1, HDAC4, and CNTFR) are ranked higher than MTM1. To investigate the terms causing this, we sort the correlation score of each term between the disease model and the candidate genes. It seems that C3orf1 is ranked at the top mainly because of its high correlation with the terms “skeletal, muscle, and heart”. HDAC4 is ranked at the second position because of the terms “muscle, heart, calcium, and growth”. CNTFR is ranked at the third position due to terms like “muscle, heart, muscle weak, and growth”. As for the real disease relevant gene MTM1, the highly correlated terms are “muscle, muscle weak, skeletal, hypotonia, growth, and lipid”.

However, to our knowledge, none of these three genes (C3orf1, HDAC4, and CNTFR) is actually known to cause any disease. Escarceller et al. [19] show that C3orf1 seems to be enhanced in heart and skeletal muscle, but there is no clue that it has a direct relation with the disease. HDAC4 is found in the work of Little et al. “as a specific downstream substrate of CaMKIIdeltaB in cardiac cells and have broad applications for the signaling pathways leading to cardiac hypertrophy and heart failure” [31]. In the papers of Glenisson et al. [24] and Cohen et al. [13], HDAC4 is found to have a role in muscle, which means it might be a good candidate but has not yet been proved to be directly related to the disease. For CNTFR, it has been found that in heterozygotic mice inactivation of CNTFR leads to a slight muscle weakness [41]. In the papers of Roth et al. [44] and De Mars et al. [16], CNTFR is shown to be related to muscular strength in humans. Collectively, although these 3 genes found by the LDDB vocabulary have not yet been reported as direct disease-causing factors, the prioritization result is still meaningful because in the literature they share many correlated terms with the disease model, as the real disease-causing gene does. In particular, according to the literature, HDAC4 and CNTFR seem to be good candidates for muscular disorders.

Though LDDB ranks 3 “false positive” genes higher than the real disease relevant gene, eVOC and GO rank the left-out gene MTM1 as the top candidate gene. In eVOC, the most important correlated terms are “muscle, sever, disorder, ...”. In GO, the terms are “muscle, mutation, family, sever, ...”. When combining multi-view data for prioritization, the ranking of LDDB is complemented by eVOC and GO, and thus MTM1 is ranked as the top gene.
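The term-level inspection used in this example, sorting per-term correlation scores between the disease model and a candidate gene, can be sketched as follows. The sketch assumes that the disease model and each gene are represented as term-weight vectors and that a term's contribution is the product of its two weights. Both the scoring rule and the toy weights are illustrative assumptions, not values or code from the study.

```python
def top_correlated_terms(model_profile, gene_profile, n_terms=3):
    """Rank shared terms by their contribution to the dot product.

    Both profiles map a term to a non-negative weight (for example TF-IDF).
    The product of the two weights is used here as a simple stand-in for the
    per-term correlation score mentioned in the text.
    """
    contributions = {
        term: weight * gene_profile[term]
        for term, weight in model_profile.items()
        if term in gene_profile
    }
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n_terms]


# Hypothetical weights, chosen only to make the example run.
disease_model = {"muscle": 0.9, "hypotonia": 0.6, "skeletal": 0.5, "growth": 0.2}
candidate_gene = {"muscle": 0.8, "heart": 0.7, "skeletal": 0.4, "growth": 0.1}

for term, score in top_correlated_terms(disease_model, candidate_gene):
    print(f"{term}: {score:.2f}")
```

Terms that appear in only one of the two profiles (here "hypotonia" and "heart") contribute nothing, which is why a candidate can rank highly on a few generic shared terms such as "muscle".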


Table 5.4 Prioritization results of MTM1 by different CV profiles

CV          Rank  Gene     Correlated terms
LDDB        1     C3orf1   muscle, heart, skeletal
            2     HDAC4    muscle, heart, calcium, growth
            3     CNTFR    muscle, heart, muscle weak, growth
            4     MTM1     muscle, muscle weak, skeletal, hypotonia, growth, lipid
eVOC        1     MTM1     muscle, sever, disorder, affect, human, recess
MPO         1     HDAC4    muscle, interact, protein, domain, complex
            2     HYAL1    sequence, human, protein, gener
            3     WTAP     protein, human, sequence, specif
            4     FUT3     sequence, alpha, human
            ...
            15    MTM1     myopathy, muscle, link, sequence, disease, sever
GO          1     MTM1     muscle, mutate, family, gene, link, sequence, sever
MeSH        1     HYAL1    human, protein, clone, sequence
            2     LUC7L2   protein, large, human, function
            3     MTM1     myopathy, muscle, mutate, family, gene, missens
SNOMED      1     S100A8   protein, large, human, function
            2     LUC7L2   protein, large, human, function
            3     LGALS3   human, protein, express, bind
            ...
            23    MTM1     muscle, mutate, family, gene, link
OMIM        1     HDAC4    muscle, interact, protein, bind
            2     MAFK     sequence, protein, gene, asthma relat trait
            3     LUC7L2   protein, large, function, sequence
            4     SRP9L1   sequence, protein, length, function
            ...
            50    MTM1     muscle, family, gene, link, sequence, disease, sever, weak
Uniprot     1     MTM1     gene, protein, function
KO          1     S100A8   protein, bind, complex, specif, associ, relat
            2     PRF1     specif, protein, contain, activ
            ...
            56    MTM1     protein, large, specif, contain
Multi-view  1     MTM1
            2     HDAC4
            3     CNTFR
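The ranks of MTM1 in Table 5.4 can be turned into a per-view error for this single example. The sketch assumes that the error is 1 - AUC with one left-out gene among 100 candidates (MTM1 plus the 99 random genes), so that a rank r corresponds to an error of (r - 1)/99. The helper name is illustrative.

```python
# Rank of the left-out gene MTM1 among 100 candidates, taken from Table 5.4.
MTM1_RANK = {
    "LDDB": 4, "eVOC": 1, "MPO": 15, "GO": 1, "MeSH": 3,
    "SNOMED": 23, "OMIM": 50, "Uniprot": 1, "KO": 56, "Multi-view": 1,
}


def auc_error_from_rank(rank, n_candidates=100):
    """1 - AUC when a single true gene sits at position `rank` (1 = top)."""
    return (rank - 1) / (n_candidates - 1)


errors = {view: auc_error_from_rank(rank) for view, rank in MTM1_RANK.items()}

# List the views from lowest to highest error for this one gene.
for view, error in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{view:<10} rank {MTM1_RANK[view]:>2}  error {error:.4f}")
```

For this gene the single views range from zero error (eVOC, GO, Uniprot) to more than 0.5 (OMIM, KO), while the multi-view ranking places MTM1 first.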
