Yuan Ze University
Department of Computer Science and Engineering
Doctoral Dissertation
Using multiple windows scanning and natural language processing techniques to study electron transport proteins
Student: Ho Quang Thai (胡光泰) Advisor: Dr. Yu-Yen Ou (歐昱言)
April 2022
A Dissertation Submitted to the Department of Computer Science and Engineering
Yuan Ze University
in Partial Fulfillment of the Requirements
for the Degree of Doctor of Philosophy
in Computer Science and Engineering
April 2022, Chungli, Taiwan, Republic of China
Using multiple windows scanning and natural language processing
techniques to study electron transport proteins
Student: Ho Quang Thai Advisor: Dr Yu-Yen Ou
Department of Computer Science and Engineering
College of Informatics Yuan Ze University
Abstract
Nature is an inexhaustible source of inspiration for discovery and invention. Inspired by the way neurons work in the human brain, Convolutional Neural Networks (CNNs) were proposed and have become a powerful, widely used tool for imaging-related tasks. CNNs and their structural variants continue to be developed and have achieved state-of-the-art results not only in computer vision but also in many other fields. CNNs are also known to be effective at extracting hidden information from visual data. In bioinformatics, CNNs have rapidly gained interest over the past decade, especially in biomedical imaging. However, the problem of applying CNNs to non-visual data, such as protein sequences, has not been fully resolved.
Unlike image data, protein sequences cannot be decomposed into color channels, which provide much of the information that CNNs exploit for pattern recognition. Another difficulty in applying CNNs to protein chains is the input layer, which has a fixed dimension, whereas the length of a protein chain is usually variable. In this work, different filters are applied to identify sequence motifs within the input protein sequence, which is represented by multiple sequence alignment profiles. The model uses a number of filters to capture many different sequence motifs, and then uses multiple window sizes to capture further motifs and more significant patterns for protein prediction problems.
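The idea of scanning a profile with several window sizes can be illustrated with a minimal NumPy sketch. It is not the dissertation's model: the function name `multi_window_features`, the window sizes (3, 5, 7), the four random filters per window, and the 20-column profile are assumptions chosen only for illustration. The sketch shows how global max pooling over every window position yields a feature vector whose length does not depend on the sequence length.

```python
import numpy as np

def multi_window_features(profile, filters_by_window):
    """Scan a (length x channels) profile with filters of several window sizes.

    filters_by_window maps a window size w to an array of shape
    (num_filters, w, channels). Each filter is slid along the sequence,
    passed through ReLU, and reduced with global max pooling.
    """
    features = []
    for window, filters in sorted(filters_by_window.items()):
        # Zero-pad so that sequences shorter than the window still work.
        pad = max(0, window - profile.shape[0])
        padded = np.pad(profile, ((0, pad), (0, 0)))
        # All windows of this size: shape (num_positions, window, channels).
        windows = np.lib.stride_tricks.sliding_window_view(
            padded, (window, padded.shape[1])
        )[:, 0]
        # Response of every filter at every position: (num_positions, num_filters).
        responses = np.einsum("pwc,fwc->pf", windows, filters)
        # ReLU, then keep the strongest response of each filter.
        features.append(np.maximum(responses, 0.0).max(axis=0))
    return np.concatenate(features)

# Hypothetical setup: random filters and random stand-ins for PSSM profiles.
rng = np.random.default_rng(0)
filters = {w: rng.normal(size=(4, w, 20)) for w in (3, 5, 7)}
short_features = multi_window_features(rng.normal(size=(30, 20)), filters)
long_features = multi_window_features(rng.normal(size=(200, 20)), filters)
print(short_features.shape, long_features.shape)  # both (12,)
```

In a trained network the filters would be learned rather than random; the sketch only shows how a 30-residue and a 200-residue input map to vectors of the same size.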
Recently, the natural language processing (NLP) field has achieved state-of-the-art performance by applying transformers, which help computers attend to a word's position and context rather than relying only on how often the word appears in a sentence. Natural language and protein sequences share several common points, such as how they are formed and how they are represented. This study treats protein sequences as an unknown language, with each amino acid as a “word” in a biological vocabulary. We apply NLP models pre-trained on an extensive corpus of natural language data to determine whether an association exists between natural language and this undiscovered language inside our bodies.
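Treating each amino acid as a "word" can be sketched as follows. This is an illustrative assumption, not the tokenization used in the dissertation: the vocabulary `VOCAB`, the special tokens `[PAD]` and `[UNK]`, and the helper names are hypothetical, and a real pipeline would rely on the tokenizer shipped with the chosen pre-trained model.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Hypothetical vocabulary: two special tokens followed by one id per residue.
VOCAB = {"[PAD]": 0, "[UNK]": 1}
VOCAB.update({aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)})

def sequence_to_sentence(sequence):
    """Write a protein sequence as a 'sentence' of single-residue words."""
    return " ".join(sequence.upper())

def encode(sequence, max_len):
    """Map residues to integer ids, then truncate or pad to max_len."""
    words = sequence_to_sentence(sequence).split()
    ids = [VOCAB.get(word, VOCAB["[UNK]"]) for word in words][:max_len]
    return ids + [VOCAB["[PAD]"]] * (max_len - len(ids))

print(sequence_to_sentence("MKV"))  # M K V
print(encode("MKV", 5))             # [12, 10, 19, 0, 0]
```

Non-standard residue codes such as `X` fall back to the `[UNK]` id in this sketch.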
Our study analyzed electron transport proteins using multiple windows scanning and natural language processing techniques in three works. First, we used the multiple windows scanning technique to identify electron transport proteins among transport proteins. On independent data, our model achieved an average sensitivity of 92.59%, a specificity of 98.19%, an accuracy of 97.41%, and a Matthews correlation coefficient (MCC) of 0.89. Additionally, our method can identify complexes with different molecular functions in electron transport proteins; across five independent data sets, the MCCs were 0.86, 0.80, 0.88, 1.00, and 0.92, respectively. In the second work, we combined a feature set extracted from a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model with Position Specific Scoring Matrix (PSSM) profiles and the Amino Acid Index database (AAIndex) to identify Flavin Adenine Dinucleotide (FAD) binding sites in electron transport proteins, with an average sensitivity of 85.19%, a specificity of 85.62%, an accuracy of 85.60%, and an MCC of 0.35 on the independent data set. In the last work, we used the multiple windows scanning technique to address the FAD binding site identification problem. To cope with the limited amount of available data, we first trained the model on PSSM profiles of FAD binding sites in transporters and then used it to predict FAD binding sites in electron transport proteins. On the independent data set, this approach achieved an average sensitivity of 92.59%, a specificity of 98.19%, an accuracy of 97.41%, and an MCC of 0.89. Our method outperforms other published methods on all evaluation metrics. The proposed technique may help researchers gain a deeper understanding of transport proteins and can also serve as a web tool. Further, the results of this study pave the way for further research on deep learning in bioinformatics.
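The four metrics reported above follow the standard definitions based on the confusion matrix. The sketch below computes them from true/false positive and negative counts. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the dissertation's data.

```python
import math

def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),   # recall on the positive class
        "specificity": _ratio(tn, tn + fp),   # recall on the negative class
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Hypothetical, imbalanced example: 60 positives and 940 negatives.
example = binary_metrics(tp=50, fp=40, tn=900, fn=10)
print({name: round(value, 4) for name, value in example.items()})
```

On imbalanced data such as binding-site prediction, accuracy can stay high while the MCC remains modest, because the MCC takes all four cells of the confusion matrix into account.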
Keywords: machine learning; deep learning; convolutional neural network; electron transport proteins; position specific scoring matrix; FAD binding site; natural language processing; multiple windows scanning