
Genome-wide prediction and prioritization of human aging genes by data fusion: a machine learning approach


DOCUMENT INFORMATION

Basic information

Title: Genome-Wide Prediction and Prioritization of Human Aging Genes by Data Fusion: A Machine Learning Approach
Authors: Masoud Arabfard, Mina Ohadi, Vahid Rezaei Tabar, Ahmad Delbari, Kaveh Kavousi
Institution: University of Tehran
Field: Bioinformatics
Type: Research article
Year of publication: 2019
City: Tehran
Pages: 7
File size: 548.89 KB


Content


RESEARCH ARTICLE Open Access

Genome-wide prediction and prioritization of human aging genes by data fusion: a machine learning approach

Masoud Arabfard1,2, Mina Ohadi3*, Vahid Rezaei Tabar4, Ahmad Delbari3 and Kaveh Kavousi2*

Abstract

Background: Machine learning can effectively nominate novel genes for various research purposes in the laboratory. On a genome-wide scale, we implemented multiple databases and algorithms to predict and prioritize the human aging genes (PPHAGE).

Results: We fused data from 11 databases, and used the Naïve Bayes classifier and positive unlabeled learning (PUL) methods, NB, Spy, and Rocchio-SVM, to rank human genes with respect to their implication in aging. The PUL methods enabled us to identify a list of negative (non-aging) genes to use alongside the seed (known age-related) genes in the ranking process. Comparison of the PUL algorithms revealed that none of the methods for identifying a negative sample was advantageous over the others, and their simultaneous use in the form of fusion was critical for obtaining optimal results (PPHAGE is publicly available at https://cbb.ut.ac.ir/pphage).

Conclusion: We predict and prioritize over 3,000 candidate age-related genes in humans, based on significant ranking scores. The identified candidate genes are associated with pathways, ontologies, and diseases that are linked to aging, such as cancer and diabetes. Our data offer a platform for future experimental research on the genetic and biological aspects of aging. Additionally, we demonstrate that fusion of PUL methods and data sources can be successfully used for aging and disease candidate gene prioritization.

Keywords: Genome-wide, Prioritization, Human aging genes, Positive unlabeled learning, Machine learning

Background

Prior understanding of the genetic basis of a disease is a crucial step for the better diagnosis and treatment of the disease [1]. Machine learning methods help specialists and biologists use functional or inherent properties of genes in the selection of candidate genes [2]. Perhaps the question that is posed to researchers is why all research is aimed at identifying pathogenic rather than non-pathogenic genes. The answer may lie in the fact that genes introduced as non-pathogenic may be documented as disease genes later on.

Biologists apply computational and mathematical methods and algorithms to develop machine learning methods for identifying novel candidate disease genes [3]. Based on the principle of “guilt by association”, similar or identical diseases share genes that are very similar in function or intrinsic properties, or have direct physical protein-protein interactions [4]. Most methods of predicting candidate genes employ various biological data, such as protein sequence, functional annotation, gene expression, protein-protein interaction networks, regulatory data and even orthogonal and conservation data, to identify similarities with respect to the principle of association based on similarity [5]. These methods are semi-supervised [6]. Unsupervised methods cluster the genes based on their proximity and similarity to the known disease genes, and rank them by various methods. Supervised methods create a boundary between disease genes and non-disease genes, and utilize this boundary

© The Author(s) 2019. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver

* Correspondence: mi.ohadi@uswr.ac.ir; ohadi.mina@yahoo.com; kkavousi@ut.ac.ir

3 Iranian Research Center on Aging, University of Social Welfare and Rehabilitation Sciences, Tehran, Iran

2 Laboratory of Complex Biological Systems and Bioinformatics (CBB), Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran

Full list of author information is available at the end of the article


to select candidate genes. Several studies have been performed to address different aspects of the methodology and have expanded the use of various methods and tools [3, 7–12].

The tools that are available for candidate gene prioritization can be classified with respect to efficiency, computational algorithms, data sources, and availability [13–15]. Available prioritization tools can be categorized into specific and general tools [16]. Specific tools are used to prioritize candidate genes associated with a specific disease. In these methods, information related to a specific tissue involved in the disease or other information related to the disease is employed. General tools can be applied for most diseases, and various data sources are often used in these tools. Gene prioritization tools can also be divided into single-species and multi-species tools. Single-species tools are only usable for a specific species, such as human or mouse. Multi-species tools have the ability to prioritize candidate genes in several different species. For example, the ENDEAVOR

Table 1 Datasets used to evaluate reliable negative sample extraction algorithms

Number of instances Number of attributes Data set names

Table 2 Performance evaluation of the reliable negative sample extraction algorithms

Connectionist Bench (Sonar, Mines vs Rocks) NB 13.85 12.26 91.18 87.74 89.42


software can prioritize the candidate genes in six different species [17]. With respect to computational algorithms, candidate prioritization tools are primarily divided into two groups: complex network-based methods and similarity-based methods [5]. The inevitable incompleteness and existence of errors in biological data sources necessitate fusion of multiple data sources [18]. Most gene targeting methods, therefore, use multiple data sources to improve performance.

The purpose of this study was to design a machine to identify and prioritize novel candidate aging genes in human. We examined the existing methods of identifying human non-aging (negative) genes in machine learning techniques, and then built a binary classifier for predicting novel candidate genes, based on the positively and negatively learned genes. Gene ranking was based on the principle of similarity among positive genes through “guilt by association”. Thus, across the unlabeled genes, genes that were less similar to the known genes were employed as the negative sample.

Results

The three positive unlabeled learning (PUL) algorithms, Naïve Bayes (NB), Spy, and Rocchio-SVM, were used to evaluate the underlying data, and were compared with respect to performance on the eight datasets introduced. All samples of the class with the higher frequency were treated as unlabeled. We applied the algorithms to predict the labels. These methods utilize a two-step strategy and are intended to extract a reliable negative sample from the main data (Table 1).
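The first step of such a two-step strategy can be illustrated with a Rocchio-style prototype comparison. The following is a minimal sketch only, not the authors' implementation: the function names, the toy vectors, and the weights `alpha=16` and `beta=4` (values commonly quoted for Rocchio classifiers) are assumptions, and the second step (training an SVM on the positives and the extracted negatives) is omitted.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def rocchio_reliable_negatives(positives, unlabeled, alpha=16.0, beta=4.0):
    """Return indices of unlabeled samples that are more similar to the
    'unlabeled' prototype than to the positive prototype."""
    p_mean = mean_vector([normalize(v) for v in positives])
    u_mean = mean_vector([normalize(v) for v in unlabeled])
    pos_proto = [alpha * p - beta * u for p, u in zip(p_mean, u_mean)]
    neg_proto = [alpha * u - beta * p for p, u in zip(p_mean, u_mean)]
    return [i for i, v in enumerate(unlabeled)
            if cosine(v, neg_proto) > cosine(v, pos_proto)]

# Toy feature vectors (hypothetical): positives point along the first axis.
positives = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
unlabeled = [[0.95, 0.1], [0.1, 1.0], [0.0, 0.9], [0.2, 1.1]]
reliable_negative_idx = rocchio_reliable_negatives(positives, unlabeled)
```

In this toy case the first unlabeled sample resembles the positives and is left out, while the other three are returned as reliable negatives.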

We also randomly selected 70% of the positive samples as the training set, and the remainder as the test set. To train the classifier, positive and negative samples were selected in equal numbers to ensure that the classifier did not have any bias at the training step. We compared the three algorithms on eight data sets extracted from the UCI database (Additional file 1). Comparison of the parameters of the three algorithms for all data sets revealed similar results in F-measure. For example, in data set 1, the precision of the Roc-SVM method was better (by approximately 2–3%) than those of the other two methods. However, the recall of the NB method was better (by approximately 4–6%) than those of the other two methods, and the Roc-SVM method had a lower false positive rate than the other two methods (Table 2). In addition, comparison of the parameters of the three algorithms for data set 2 revealed that the precision of the NB method was better than that of the other two methods, the recall of the Spy method was 5% better than that of the other two methods, and the NB method had a lower false positive rate than the other two methods. Therefore, none of the methods had absolute superiority. Since the results were very similar, the output of the three methods was combined.
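For reference, the quantities compared here (precision, recall, F-measure, and false positive rate) follow from the four cells of a confusion matrix. The sketch below uses the standard definitions; the function name and the counts are hypothetical and chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return precision, recall, F-measure and false positive rate as fractions."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "fp_rate": fp_rate}

# Hypothetical confusion counts, not taken from the paper.
m = classification_metrics(tp=80, fp=20, tn=70, fn=10)
```

The F-measure is the harmonic mean of precision and recall. Applying that relation to the training row of Table 3 gives 2 × 80.78 × 76.95 / (80.78 + 76.95) ≈ 78.82, in line with the reported 78.81.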

The three PUL algorithms were applied to extract reliable negative samples and to compare them with respect to performance. In this algorithm, only 303 positive samples were given as input, which enabled extraction of reliable negative samples from the remaining data. Subsequently, from the positive and negative data, a new

Table 3 Model performance evaluation by Naïve Bayes on the aging data

        Precision %  Recall %  F measure %  Accuracy %  AUC %
Train   80.78        76.95     78.81        78.52       83.81
Test    87.09        81.82     84.37        84.13       88.99

Fig. 1 ROC curves. ROC analysis was performed to evaluate the performance of the Naïve Bayes model at the training and test steps, which resulted in similar values for both curves


classifier was trained to identify novel candidate genes to be utilized for prioritization and ranking. A total of 328 negative genes were extracted from each positive and negative gene, with a threshold of 11 replicates per negative gene (Additional file 2), and the Naïve Bayes binary classifiers were trained in a 10-fold cross-validation (Table 3). Additional file 2 contains results for all thresholds. The ROC chart for training and test data is shown in Fig. 1.
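One plausible reading of the replicate threshold is a vote count: a gene is kept as a reliable negative only if it is flagged in at least a given number of extraction runs. That interpretation is an assumption, as are the function name and the toy gene identifiers in the sketch below.

```python
from collections import Counter

def consensus_negatives(runs, min_votes):
    """Keep genes flagged as reliable negatives in at least `min_votes` runs."""
    votes = Counter(gene for run in runs for gene in set(run))
    return sorted(g for g, n in votes.items() if n >= min_votes)

# Toy example: three extraction runs over hypothetical gene identifiers.
runs = [{"G1", "G2", "G3"}, {"G2", "G3"}, {"G3", "G4"}]
stable = consensus_negatives(runs, min_votes=2)
```

Raising `min_votes` trades a smaller negative set for higher agreement between runs, which may be why results for several thresholds are tabulated.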

We trained multiple binary classifiers using all features in the positive genes and reliable negative data to compare the NB classifier to other classifiers. We investigated the performance of binary SVM [27], NB, and libD3C [28] classifiers in the dataset with 10-fold cross-validation, using Weka [29]. All classifiers had similar performance in the main data set (Table 4).
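The partitioning behind k-fold cross-validation can be sketched in a few lines. This is a plain, unstratified illustration and not Weka's implementation; the function name, the seed, and the sample count of 103 are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(103, k=10))
```

Each sample appears in exactly one test fold, so every sample is scored once by a model that did not see it during training.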

A major challenge in classification is to reduce the dimensionality of the feature space. Some methods, such as PCA, build linear combinations of the original features. In this research, we investigated the PCA method in the final model, which eliminated some of the original input features and retained a minimum subset of features that yielded the best classification performance. In addition, feature selection was used to select the best subset of the main features for the model. A fixed number of top-ranked features were selected to design a classifier. A suitable technique for feature selection is minimal-redundancy-maximal-relevance (mRMR) [30]. We also used mRMR for feature selection in the main data, and then compared multiple binary classifiers in the positive and reliable negative genes. We investigated the top 500 ranked features that were extracted from the mRMR tool to compare the classifiers. All of the selected classifiers yielded acceptable results (Table 5).

Model accuracy assurance is very difficult when the model is applied to a separate test set that includes only positive and unlabeled samples. This challenge is critical in instances which lack a negative sample. Thus, we compared the evaluation metric with the data. We generated data for all 10 models in the training section to predict the residual genes, and extracted the genes that were identified by the 10 models as positive genes, yielding a total of 3531 final candidate genes.
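If "identified by the 10 models as positive" means that all ten models must agree (an assumption about the selection rule), the candidate list is the intersection of the per-model positive sets. The function name and the toy predictions below are hypothetical.

```python
def unanimous_positives(predictions_per_model):
    """Return genes that every model labels positive.

    `predictions_per_model` is a list of dicts mapping gene -> bool.
    """
    positive_sets = [{g for g, is_pos in preds.items() if is_pos}
                     for preds in predictions_per_model]
    return sorted(set.intersection(*positive_sets)) if positive_sets else []

# Three toy models instead of ten, for brevity.
models = [
    {"A": True, "B": True, "C": False},
    {"A": True, "B": False, "C": False},
    {"A": True, "B": True, "C": True},
]
candidates = unanimous_positives(models)
```

Requiring unanimity is the strictest possible vote and favors precision over the size of the candidate list.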

To compare the output of the method with known tools for prioritizing genes, the output of the model was compared with two software tools, Endeavor [17] and ToppGene [31], on the seed genes (the list of seed genes in the form of K-fold with K = 3 was utilized for the mentioned tools). Two metrics for comparing the tools with the proposed model were considered. The first metric calculated the average ranking of the seed genes, and the second metric determined the number of seed genes on the top 10, 50, 100, 500, and 1000 lists.
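Both comparison metrics are straightforward to compute from a ranked gene list. The sketch below is illustrative only; the function names, gene identifiers, and shortened cutoffs are assumptions.

```python
def average_rank(ranked_genes, seed_genes):
    """Mean 1-based rank of the seed genes that appear in the ranked list."""
    position = {g: i + 1 for i, g in enumerate(ranked_genes)}
    ranks = [position[g] for g in seed_genes if g in position]
    return sum(ranks) / len(ranks) if ranks else float("nan")

def seeds_in_top_k(ranked_genes, seed_genes, cutoffs=(10, 50, 100, 500, 1000)):
    """Count seed genes within the first k entries for each cutoff k."""
    seeds = set(seed_genes)
    return {k: sum(1 for g in ranked_genes[:k] if g in seeds) for k in cutoffs}

ranking = ["G%d" % i for i in range(1, 21)]  # hypothetical ranked output
seeds = ["G1", "G4", "G15"]
avg = average_rank(ranking, seeds)
top = seeds_in_top_k(ranking, seeds, cutoffs=(5, 10, 20))
```

A lower average rank and higher top-k counts both indicate that a tool places held-out seed genes nearer the top of its list.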

A tool that had more seed genes at the top of the list and a lower average rank compared with the remaining tools received a higher ranking. Table 6 shows the output of the tools and the PPHAGE method for determining the number of test genes on the known lists. Table 7

Table 4 Performance evaluation comparison by multiple binary classifiers in the aging data

         TP rate %  FP rate %  Precision %  Recall %  F measure %  AUC %
libD3C   85.1       15.3       85.3         85.1      85           91.9

Table 5 Performance evaluation comparison by multiple binary classifiers in the aging data after feature selection

         TP rate %  FP rate %  Precision %  Recall %  F measure %  AUC %
SVM      83.5       17.1       84.2         83.5      83.4         83.2
libD3C   84.6       15.7       84.8         84.6      84.6         92.3

Table 6 Number of detected seed genes in comparison to the output of tools

Table 7 Average rank of the seed genes in comparison to the output of tools


shows the output of tools and the PPHAGE method for the average rank score on different lists.

The top 25 genes that received the highest weight among all candidate aging genes (Table 8) were validated in a number of instances, based on experimental evidence, age-related diseases, and genome-wide association studies (GWAS). A list of all candidate positive aging genes is provided in Additional file 3.

Discussion

On a genome-wide scale, we used three PUL methods to create a method for the isolation of human aging genes from other genes. The combined use of several methods as a fusion of their output was advantageous over using one single method.

Following are examples of the identified genes and experimental or GWAS links between these genes and

Table 8 The top 25 human candidate aging genes

Osteoporosis, Postmenopausal Colorectal Cancer

Diabetes Mellitus, Non-Insulin-Dependent Colorectal Cancer

Atherosclerosis Parkinson Disease Alzheimer’s Disease Arthritis Heart failure

[ 43 , 44 ] [ 45 – 47 ] [ 48 , 49 ] [ 50 , 51 ] [ 52 – 54 ] [ 55 – 57 ] [ 58 – 60 ] [ 61 – 63 ]

CTD_human RGD LHGDN BEFREE HPO

Cataract

GENOMICS_ENGLAND HPO

CTD_human

GWASCAT BEFREE

Colorectal Cancer Osteopetrosis

[ 72 – 75 ] [ 76 , 77 ] [ 78 ]

BEFREE GWASDB GWASCAT

Colorectal Cancer

UNIPROT

Colorectal Cancer

[ 81 ] [ 82 ]

BEFREE

Hereditary Diffuse Gastric Cancer Coronary heart disease Increased gastric cancer

[ 83 ] [ 84 ] [ 85 ]

BEFREE CTD_human HPO

Cataract

HPO HPO


aging. On the list of the 25 top genes, NAP1L4 encodes a member of the nucleosome assembly protein (NAP) family, which interacts with both core and linker histones, and shuttles between the cytoplasm and nucleus, suggesting a role as a histone chaperone. Histone protein levels decline during aging, and dramatically affect chromatin structure. Remarkably, the lifespan can be extended by manipulations that reverse the age-dependent changes to chromatin structure, indicating the pivotal role of chromatin structure in aging [32]. In another example, gene expression of NAP1L4 increases with age in the skin tissue [33]. Findings of GWAS link a number of the identified genes to age-related disorders, such as GAB2 and late onset Alzheimer’s disease [86], and QKI and coronary heart disease/myocardial infarction [79]. Interestingly, GWAS reports also link QKI to successful aging [87]. RPL3 encodes a ribosomal protein that is a component of the 60S subunit. The encoded protein belongs to the L3P family of ribosomal proteins, and is increased in gene expression during aging of skeletal muscle [88]. In another example, FZD5 is involved in prostate cancer, which is the most common malignancy in older men. ATP8A2 is another gene subject to deterioration and loss of function over time. RYR2 (Additional file 3) encodes a ryanodine receptor found in cardiac muscle sarcoplasmic reticulum. Mutations in this gene are associated with stress-induced polymorphic ventricular tachycardia and arrhythmogenic right ventricular dysplasia, and methylation analysis of CpG sites in DNA from blood cells showed a positive correlation between RYR2 and age [89]. In additional examples, differential expression with age was identified in BCAS3, TUFM and DST in the skin [33]. Gene expression analysis revealed a significant increase in the expression of hippocampal TLR3 in cells from elderly individuals (aged 69–99 years) compared to cells from younger individuals (aged 20–52 years) [90]. Similarly, differential expression with age was identified in RORA in the adipose tissue [33].

Table 9 Indicative diseases associated with the candidate aging genes

Fig 2 Significant biological processes associated with the candidate aging genes


In order to investigate the implication of the identified candidate genes in aging, we conducted a comprehensive analysis of 330 human pathways in the KEGG. Each of the pathways was examined in the seed and candidate genes, and direct association was detected in a number of instances. For example, IL10 activates STAT3 in the FOXO signaling pathway. In another example, GAB2 has a regulatory role for PLCG2 in the osteoclast differentiation pathway, as well as an activating role in the chronic myeloid leukemia pathway. Likewise, FOS is an expression target for IL10 in the T cell receptor signaling pathway.

We used the Enrichr tool [91] on the candidate genes and the negative genes to examine whether the candidate and negative genes were correctly selected with respect to aging. The analysis of candidate genes was performed on 3531 genes from the rest of the test genes (i.e., excluding the positive seed and reliable negative genes). Most diseases that were associated with the candidate genes were diseases that occur with aging (e.g., colorectal cancer and diabetes) (Table 9).

Ontology analysis of the candidate genes was performed by FUNRICH [92] (Fig. 2), which revealed enrichment for the aging process and apoptosis. A list of all biological processes associated with the candidate aging genes is provided in Additional file 4.

In the analysis of the enriched biological pathways, using Enrichr (Table 10), cancer pathways had the highest score. Interestingly, viral pathways (e.g., EBV and HSV) were enriched in the positive aging genes compartment, which is in line with the previously reported immunosenescence and activation of such viruses as a result of aging [93]. A list of all biological pathways of the candidate genes extracted by FUNRICH is provided in Additional file 5.

No specific age-related diseases were detected for the identified negative genes (Table 11), which supports the validity of the model training used. Ontology analysis of the reliable negative genes (Fig. 3), which was also performed by FUNRICH, revealed that most of the extracted processes had a general role in all cells and could not be related to specific aging processes. Analyzing the biological pathways in the negative genes indicated pathways that were predominantly unrelated to the aging processes.

Based on the principle that similar disease genes are likely to have similar characteristics, some machine learning methods have been employed to predict new disease genes from known disease genes. Previous approaches developed a binary classification model that used known disease genes as a positive training set and unknown genes as a negative training set. However, the negative sets were often noisy because unknown genes could include both healthy genes and positive genes. Therefore, the results presented by these methods may not be reliable. Using computational machine learning methods and similarity metrics, we identified reliable negative samples, and then tested the samples

Table 10 Indicative biological pathways associated with the candidate aging genes

Rank  Term  P-value  Adjusted P-value  Z-score  Combined score
1   Pathways in cancer_Homo sapiens_hsa05200  4.07e-41  1.19e-38  −2.11  196.21
2   Proteoglycans in cancer_Homo sapiens_hsa05205  1.91e-31  2.78e-29  −1.99  140.58
3   Epstein-Barr virus infection_Homo sapiens_hsa05169  3.24e-30  3.15e-28  −1.9  128.92
4   Endocytosis_Homo sapiens_hsa04144  1.19e-28  8.70e-27  −1.89  121.38
5   Regulation of actin cytoskeleton_Homo sapiens_hsa04810  4.30e-26  2.51e-24  −1.82  106.42
6   HTLV-I infection_Homo sapiens_hsa05166  1.01e-25  4.21e-24  −1.79  103.2
7   Protein processing in endoplasmic reticulum_Homo sapiens_hsa04141  7.55e-26  3.68e-24  −1.69  98.04
8   Herpes simplex infection_Homo sapiens_hsa05168  1.24e-25  4.54e-24  −1.61  92.36
9   PI3K-Akt signaling pathway_Homo sapiens_hsa04151  1.79e-22  4.96e-21  −1.83  91.82
10  Focal adhesion_Homo sapiens_hsa04510  1.12e-22  3.63e-21  −1.72  86.98
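Assuming the four numeric columns of Table 10 are Enrichr's p-value, adjusted p-value, z-score, and combined score (the column headers did not survive in this copy), the last column can be checked against Enrichr's combined score, the natural logarithm of the p-value multiplied by the z-score. The function name below is hypothetical.

```python
import math

def combined_score(p_value, z_score):
    """Enrichr-style combined score: ln(p-value) multiplied by the z-score."""
    return math.log(p_value) * z_score

# First row of Table 10: p = 4.07e-41, z = -2.11, reported score 196.21.
score = combined_score(4.07e-41, -2.11)
```

The recomputed value differs from the tabulated one only in the second decimal place, which is consistent with the z-score having been rounded to two decimals before printing.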

Table 11 Indicative diseases associated with the reliable negative genes

Date posted: 28/02/2023, 20:12
