Using multiple windows scanning and natural language processing techniques to study electron transport proteins



Yuan Ze University

Department of Computer Science and Engineering

Doctoral Dissertation

Using multiple windows scanning and natural language processing techniques to study electron transport proteins

Student: Ho Quang Thai

Advisor: Dr. Yu-Yen Ou

April 2022 (Republic of China year 111)


Using multiple windows scanning and natural language processing techniques to study electron transport proteins

Student: Ho Quang Thai

Advisor: Dr. Yu-Yen Ou

A Dissertation Submitted to the Department of Computer Science and Engineering

Yuan Ze University

in Partial Fulfillment of the Requirements

for the Degree of Doctor of Philosophy

in Computer Science and Engineering

April 2022

Chungli, Taiwan, Republic of China


Using multiple windows scanning and natural language processing techniques to study electron transport proteins

Student: Ho Quang Thai

Advisor: Dr. Yu-Yen Ou

Department of Computer Science and Engineering

College of Informatics, Yuan Ze University

Abstract

Nature is an inexhaustible source of inspiration for discovering and recreating remarkable inventions. Inspired by the way neurons work in the human brain, Convolutional Neural Networks (CNNs) were proposed and have become a powerful and widely used tool in imaging-related tasks. CNNs and their structural variants continue to be developed and have achieved cutting-edge results not only in computer vision but also in many other fields. CNNs are also known as an effective tool for extracting hidden information from visual data. In bioinformatics, CNNs have rapidly gained interest over the past decade, especially in biomedical imaging. However, the problem of applying CNNs to non-visual data, such as protein sequences, is still not fully resolved.

Unlike image data, protein sequences cannot be decomposed into color channels, which provide a lot of useful information for a CNN's pattern recognition capabilities. Another problem that complicates applying CNNs to protein chains is the limitation of the input layer: the input layer has a fixed dimension, whereas the length of a protein chain is usually variable. In this work, different filters are applied to identify sequence motifs within the input protein sequence, which is represented by multiple sequence alignment profiles. The model uses a number of filters to capture many different sequence motifs and then uses multiple window sizes to capture further, more significant patterns for protein prediction problems.
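The idea of scanning a variable-length profile with several window sizes can be illustrated with a minimal NumPy sketch. It is not the dissertation's architecture: the function name `scan_windows`, the window sizes (3, 5, 7), the number of filters, the random filter weights, the ReLU activation, and the global max pooling step are all assumptions made for illustration. The only fixed element is that a PSSM profile has one row per residue and 20 columns, one per standard amino acid.

```python
import numpy as np

def scan_windows(pssm, filters_by_window):
    """Slide filters of several window sizes along a PSSM profile.

    pssm: array of shape (L, 20), one row per residue.
    filters_by_window: dict mapping window size w -> array (n_filters, w, 20).
    Returns a vector whose length does not depend on L
    (L must be at least the largest window size).
    """
    features = []
    for w, filters in sorted(filters_by_window.items()):
        # All contiguous windows of w residues: shape (L - w + 1, w, 20).
        windows = np.lib.stride_tricks.sliding_window_view(
            pssm, (w, pssm.shape[1])
        )[:, 0]
        # Response of every filter at every position: (L - w + 1, n_filters).
        responses = np.einsum("pwc,fwc->pf", windows, filters)
        # ReLU, then global max pooling over positions (assumed here) keeps
        # the strongest match of each filter, giving a fixed-size output.
        features.append(np.maximum(responses, 0.0).max(axis=0))
    return np.concatenate(features)

# Hypothetical usage: random filters and random profiles of two lengths.
rng = np.random.default_rng(0)
filters = {w: rng.normal(size=(4, w, 20)) for w in (3, 5, 7)}
short = scan_windows(rng.normal(size=(50, 20)), filters)
long = scan_windows(rng.normal(size=(300, 20)), filters)
print(short.shape, long.shape)  # (12,) (12,)
```

Pooling over positions is one common way to reconcile a fixed-size downstream layer with variable-length sequences; padding or truncating the input to a fixed length is another.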

Recently, the NLP field has achieved many state-of-the-art results by applying transformers, which help computers attend to a word's position and context rather than relying only on how often the word appears in a sentence. Natural language and protein sequences share several common points, such as how they are formed and how they are presented. This study treats protein sequences as an unknown language, with each amino acid as a “word” in a biological vocabulary. We use NLP models pre-trained on an extensive corpus of natural language data to determine whether an association exists between natural language and an undiscovered language inside our body.

Our study analyzed electron transport proteins using multiple windows scanning and natural language processing techniques in three works. First, we used the multiple windows scanning technique to predict electron transport proteins among transport proteins. On independent data, our model achieved an average sensitivity of 92.59%, a specificity of 98.19%, an accuracy of 97.41%, and a Matthews correlation coefficient (MCC) of 0.89. Additionally, our method can identify complexes with different molecular functions in electron transport proteins; across five independent data sets, the MCCs were 0.86, 0.80, 0.88, 1.00, and 0.92, respectively. In the second work, we combined a feature set extracted from a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model with Position Specific Scoring Matrix (PSSM) profiles and the Amino Acid Index database (AAIndex) to identify Flavin Adenine Dinucleotide (FAD) binding sites in electron transport proteins, with an average sensitivity of 85.19%, a specificity of 85.62%, an accuracy of 85.60%, and an MCC of 0.35 on the independent data set. In the last work, we used the multiple windows scanning technique to address the FAD binding site identification problem. To overcome the modest amount of data available in nature, we first trained the model on PSSM profiles of FAD binding sites in transporters and then used it to predict the FAD binding sites in electron transport proteins. On the independent data set, this approach achieved an average sensitivity of 92.59%, a specificity of 98.19%, an accuracy of 97.41%, and an MCC of 0.89. Our method outperforms other published methods on all measurement metrics. The proposed technique may help researchers gain a deeper understanding of transport proteins and can also be deployed as a powerful web tool. Further, the results of this study pave the way for further research on deep learning to enrich the bioinformatics field.
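The four metrics reported above follow the standard confusion-matrix definitions. The sketch below shows how they are computed from counts of true positives, true negatives, false positives, and false negatives. The function name `binary_metrics` and the counts in the usage example are hypothetical and are not taken from the dissertation's data.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of true positives recovered
    specificity = tn / (tn + fp)   # fraction of true negatives recovered
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical, strongly imbalanced counts (100 positives, 10,000 negatives).
m = binary_metrics(tp=85, tn=8500, fp=1500, fn=15)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, sensitivity, specificity, and accuracy all come out at 0.85 while the MCC stays far lower, which shows why MCC is usually reported alongside the other three when the positive class is rare.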

Keywords: machine learning; deep learning; convolutional neural network; electron transport proteins; position specific scoring matrix; FAD binding site; natural language processing; multiple windows scanning
