VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
LE HOANG QUYNH
MACHINE LEARNING-BASED EXTRACTION
OF SEMANTIC RELATIONS FROM BIOMEDICAL LITERATURE
DOCTOR OF PHILOSOPHY IN INFORMATION TECHNOLOGY DISSERTATION
Hanoi, 2022
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
LE HOANG QUYNH
MACHINE LEARNING-BASED EXTRACTION
OF SEMANTIC RELATIONS FROM BIOMEDICAL LITERATURE
Major: Information Systems Code: 9480104.01
DOCTOR OF PHILOSOPHY IN INFORMATION TECHNOLOGY DISSERTATION
SUPERVISORS:
1. Prof. Dr. Nigel Collier
2. Dr. Dang Thanh Hai
Hanoi, 2022
Declaration

I hereby declare that this Doctoral Dissertation was carried out by me for the degree of Doctor of Philosophy under the guidance and supervision of my supervisors.

This dissertation is my own work and includes nothing which is the outcome of work done in collaboration, except as specified in the text.

It is not substantially the same as any work I have submitted for a degree, diploma or other qualification at any other university; and no part has already been, or is currently being, submitted for any degree, diploma or other qualification.

Hanoi, January 2022
Author
Le Hoang Quynh
Table of Contents
DECLARATION iii
TABLE OF CONTENTS iv
ABBREVIATIONS viii
LIST OF FIGURES xi
LIST OF TABLES xiii
PREFACE 1
1 INTRODUCTION TO BIOMEDICAL RELATION EXTRACTION 11
1.1 Problem statement 11
1.1.1 Semantic relation extraction 11
1.1.2 Biomedical named entity recognition 12
1.1.3 Biomedical relation classification 15
1.2 Literature review 19
1.2.1 Literature review of biomedical named entity recognition 19
1.2.2 Literature review of biomedical relation extraction 24
1.2.3 Related doctoral dissertations 29
1.3 Related resources 30
1.3.1 Datasets for named entity recognition experiments 31
1.3.2 Datasets for relation classification experiments 32
1.4 Evaluation metrics 34
1.4.1 Evaluation metrics 34
1.4.2 Named entity recognition evaluation 35
1.4.3 Relation classification evaluation 36
1.5 Summary 37
2 AN END-TO-END PIPELINE MODEL FOR BIOMEDICAL RELATION EXTRACTION 38
2.1 Distant supervision learning with silverCID corpus 39
2.2 Proposed UET-CAM system 42
2.2.1 Joint model of named entity recognition and normalization (DNER) 43
2.2.2 Coreference resolution 49
2.2.3 Intra-sentence relation classification with support vector machine 52
2.3 Experimental results and discussion 54
2.3.1 Choosing the combining manner of SSI and skip-gram for named entity normalization results 54
2.3.2 Named entity recognition and normalization results 55
2.3.3 CID relation classification results 57
2.3.4 Discussion 58
2.4 Summary 62
3 AN IMPROVED CRF-BILSTM MODEL FOR BIOMEDICAL NAMED ENTITY RECOGNITION 64
3.1 Introduction to deep learning for named entity recognition 65
3.2 Proposed D3NER model 67
3.2.1 Data pre-processing 67
3.2.2 The TPAC embeddings layer 68
3.2.3 Context representing biLSTM layer 71
3.2.4 Project layer 72
3.2.5 Conditional random fields layer 72
3.3 Experimental results and discussion 72
3.3.1 Experimental environment and model settings 73
3.3.2 Comparative models 75
3.3.3 The performance of D3NER model and comparisons 76
3.3.4 Contribution of the model components 80
3.3.5 Error analysis 82
3.4 Summary 86
4 HYBRID, ATTENTION-BASED AND ENSEMBLE DEEP LEARNING MODELS FOR BIOMEDICAL RELATION CLASSIFICATION 87
4.1 The shortest dependency path 89
4.1.1 Dependency tree 89
4.1.2 The shortest dependency path 90
4.1.3 Dependency Unit 91
4.2 A hybrid adaptive deep learning model for biomedical relation extraction 91
4.2.1 Proposed MASS model 92
4.2.2 Experimental corpora and comparative models 98
4.2.3 Experimental environment and model settings 100
4.2.4 Experimental results and discussion 100
4.3 An attentive augmented deep learning model for biomedical relation extraction 106
4.3.1 Richer-but-smarter SDP 106
4.3.2 Proposed RbSP model 107
4.3.3 Experimental environment and model settings 114
4.3.4 Experimental results and discussion 114
4.4 A multi-fragment ensemble deep learning model for biomedical relation extraction 118
4.4.1 Over-fitting problem of deep learning-based models 118
4.4.2 Bagging with bootstrap training data 119
4.4.3 Proposed multi-fragment ensemble architecture 121
4.4.4 Experimental results and discussion 124
4.5 Summary 129
5 GRAPH-BASED INTER-SENTENCE RELATION CLASSIFICATION IN BIOMEDICAL TEXT 131
5.1 Inter-sentence relations classification problem 132
5.2 Proposed graph-based inter-sentence relation classification model 134
5.2.1 Model overview 134
5.2.2 Document sub-graph construction 135
5.2.3 Paths finding, merging and choosing 138
5.2.4 Shared-weight convolutional neural network 140
5.3 Experimental results and discussion 143
5.3.1 Experimental environment and model settings 143
5.3.2 Contribution of the added virtual edges in document sub-graph 144
5.3.3 Different sliding window size w for training and testing 145
5.3.4 Contribution of the model components 146
5.3.5 Comparison to comparative model 148
5.4 Discussion 150
5.5 Summary 152
CONCLUSION 156
LIST OF PUBLICATIONS 158
BIBLIOGRAPHY 158
Abbreviations

ANN Artificial Neural Network
bagging Bootstrap Aggregating
BC5 CDR corpus BioCreative V Chemical-Disease Relation corpus
BERT Bidirectional Encoder Representations from Transformers
biLSTM Bidirectional Long Short-term Memory
CNN Convolutional Neural Network
CTD Comparative Toxicogenomics Database
DNER Disease Named Entity Recognition
ELMO Embeddings from Language Models
FP False Positive
FSU-PRGE The FSU PRotein GEne Corpus
HAScO Human-Aware Science Ontology
HHEAR Human Health Exposure Analysis Resource
MUC Message Understanding Conferences
NCBI National Center for Biotechnology Information
NCIT National Cancer Institute Thesaurus
SDP The Shortest Dependency Path
SilverCID A Silver-standard Corpus for Chemical-induced Disease Relation Extraction
SNOMED Systematized Nomenclature of Medicine
SSI Supervised Semantic Indexing
swCNN Shared-weight Convolutional Neural Network
TPAC Token-POS tag-Abbreviation-Character Embeddings
UMLS Unified Medical Language System
w/o REP Without Replacement
List of Figures
1 Growth of MEDLINE citations from 1986 to 2019 2
2 Challenges’ subtasks/tracks organized based on NLP perspectives [64] 3
3 The dissertation outline 10
1.1 An example taken from the BC5 CDR corpus with recognized names of Disease, Chemical and Species 14
1.2 Examples of (a) inter-sentence relation and (b) intra-sentence relation 17
1.3 Examples of relations with specific and unspecific location 18
1.4 Examples of (a) Promotes - a directed relation and (b) Associated - an undirected relation taken from Phenebank corpus 18
1.5 Named entity recognition approaches taxonomy 20
1.6 Relation extraction approaches taxonomy 25
1.7 The statistics of corpora used in our experiments for relation classification 34
2.1 Analysis of the Direct Evidence field in the CTD databases 40
2.2 An example of constructing silverCID corpus 41
2.3 Architecture of the proposed UET-CAM system 44
2.4 Advanced SSI model using skip-gram information for NEN 45
2.5 Hybrid model of SSI and skip-gram model for NEN 47
2.6 Sequential back-off model of SSI and skip-gram model for NEN 48
2.7 An example of coreference in text 49
2.8 An examples of using multi-pass sieve for coreference resolution 51
3.1 The D3NER architecture 68
3.2 The TPAC embedding architecture of D3NER 70
4.1 Example of a dependency tree 89
4.2 Examples of the shortest dependency paths 90
4.3 Examples of the dependency unit in the shortest dependency paths 91
4.4 The architecture of MASS model for relation classification 93
4.5 The multi-channel LSTM for word representation 95
4.6 Ablation test results for various components and information sources of MASS model 104
4.7 Examples of SDPs and attached child nodes 107
4.8 The architecture of RbSP model for relation classification 108
4.9 The multi-layer attention architecture to extract the augmented information from the children of a token on SDP 110
4.10 Ablation test results for compositional embeddings of RbSP model 116
4.11 Ablation test results for augmented information of RbSP model 117
4.12 Training loss, training accuracy, validation loss and validation accuracy of our RbSP model in BC5 CDR corpus 119
4.13 The range of RbSP model’s results on BC5 CDR test set 120
4.14 The multi-fragment ensemble architecture 122
4.15 The changes of multi-fragment ensemble model’s results with different size of training data 125
4.16 The changes of F1 of multi-fragment ensemble model with different vote threshold 126
5.1 Examples of complicated cross-sentence relations 132
5.2 The proposed model for inter-sentence relation classification 134
5.3 Use sliding window to choose adjacent sentences for building document sub-graph 136
5.4 Examples of a document sub-graph 137
5.5 Examples of two unexpected problems while generating the instance from document sub-graph 139
5.6 Example of an abstract with many NER annotations that leads to the ex-plosion of similar paths 140
5.7 Diagram illustrating a swCNN architecture 141
5.8 Ablation test results for virtual edges of the document sub-graph 145
5.9 The change of results with different size of sliding window 146
List of Tables
1.1 Example sentences labeled using different tagging schema 15
1.2 Examples for different relation types 17
1.3 Information about the BC5 CDR, NCBI and FSU-PRGE corpora for NER 31
1.4 Information about the BC5 CDR, BB3, DDI and Phenebank corpora for relation classification 33
1.5 Defining the test metrics 35
2.1 Detailed Input/Output and the objectives of UET-CAM components 43
2.2 Large-scale feature set used in the intra-sentence relation extraction module of UET-CAM system 53
2.3 Named Entity Normalization results with different combining architectures 55
2.4 Disease named entity recognition results on BC5 CDR corpus of UET-CAM system 55
2.5 Relation classification results on BC5 CDR corpus of UET-CAM system 57
2.6 Analysis of the contribution of methods and resources used in the UET-CAM system for capturing CID relationships 60
2.7 Sources of errors by our system on the CDR test set 61
3.1 Configurations and parameters of D3NER model 75
3.2 Experimental results of D3NER for 20 runs each with different random initialization on BC5 CDR and NCBI corpora 77
3.3 Performance of D3NER and compared state-of-the-art models on two benchmark corpora for Disease and Chemical NER 78
3.4 Experimental results of D3NER for 20 runs each with different random initialization on FSU-PRGE corpus (4-fold cross validation) 80
3.5 Performance of D3NER and compared state-of-the-art model on FSU-PRGE corpus for Gene/protein NER 80
3.6 Ablation test results for different embeddings of D3NER model 81
3.7 Impact of fine-tuning embeddings as the D3NER's hyper-parameters 82
3.8 D3NER confusion matrix on the CDR corpus 82
3.9 Examples of errors caused by D3NER on the BC5 CDR and FSU-PRGE corpora 84
4.1 Examples for different relation types 87
4.2 Configurations and parameters of MASS model 100
4.3 Results of MASS model on the BC5 CDR corpus 101
4.4 Results of MASS model on the DDI-2013 corpus 102
4.5 Results of MASS model on the BB3 corpus 103
4.6 Results of MASS model on the Phenebank corpus 103
4.7 Examples of MASS model’s errors 105
4.8 Configurations and parameters of RbSP model 115
4.9 The RbSP model’s performance on BC5 CDR corpus 115
4.10 Multi-fragment ensemble results on BC5 CDR corpus 124
4.11 The comparison of our ensemble proposed models with other comparative models on BC5 CDR corpus 127
4.12 The comparison of our ensemble proposed models with other comparative models on DDI corpus 128
5.1 Tuned hyper-parameter of proposed model 144
5.2 Ablation test results for added virtual edges in the document sub-graph 144
5.3 Results of the document sub-graph based model on BC5 CDR corpus with different size of sliding window for training and testing 147
5.4 Ablation test results for various components of the document sub-graph based model on BC5 CDR corpus 148
5.5 The performance of document sub-graph-based model and some comparative models 149
5.6 The detailed results of the document sub-graph based model 150
5.7 Examples of errors on the BC5 CDR test set 151
Preface

The necessities of the dissertation:
In the past several decades, biomedicine and human health care have become one of the major service industries, receiving increasing attention from both the research community and society as a whole. For example, in 2011, biomedical research in the United States received about 100 billion dollars of investment, with approximately 65% supported by industry, 30% by the government, and the remaining 5% by charities, foundations, or individual donors [137]. Up to the present, many researchers are still working hard with the expectation that further advances will support biomedical science and healthcare. Understanding and analyzing the existing information and knowledge bases is therefore an inevitable need.

As a result, the field of biomedical research has grown rapidly, and the number of biomedical scientific publications is increasing at an extremely high rate. Accessing and processing this data to keep abreast of the state of the art and to make discoveries in biomedical/healthcare research is essential for several types of users, including biomedical researchers, clinicians, database curators, and bibliometricians [77]. More than 3,000 articles are published in biomedical journals every day [64]. MEDLINE®, a biomedical database of the US National Library of Medicine, is one of the most prominent and largest biomedical digital repositories. As of 2019, it already contains more than 26 million citations, with a fast-increasing number of articles in the life sciences concentrated on biomedicine. Figure 1 illustrates the growth of MEDLINE from about 1 million citations in 1970 to about 26 million in 2019. More impressively, this number has nearly doubled in 14 years, from 2005 (about 13.5 million) to 2019 (about 26.2 million).
Figure 1: Growth of MEDLINE citations from 1986 to 2019. The vertical axis shows the number of citations (in millions); for clear visualization, the statistics before 2005 are presented every 5 years.

PubMed® is a free resource developed and maintained by the NCBI which provides free access to MEDLINE and some other databases. According to the statistics reported in November 2019, the cumulative total of PubMed citations has surpassed 30 million. However, even when we get results returned from PubMed, the difficulty of processing this literature is ever-increasing. It comes from the fast-growing volume of biomedical literature, the scope of topical coverage, its interdisciplinary nature, and its unstructured form. For example, when searching for 'Influenza' in PubMed, we get 105,066 results. The rapid growth of the volume and variety of biomedical scientific literature makes it an exemplary case of Big Data [169]. It is an unprecedented opportunity to explore biomedical science and an enormous challenge when facing a massive amount of unstructured and semi-structured data.
Recent research progress in biomedicine needs to be supported by methodologies capable of assisting human experts in formulating hypotheses. Biomedical natural language processing (BioNLP) is a sub-field of natural language processing (NLP) that seeks to help scientists understand the wealth of results hidden in large-scale scientific text collections. BioNLP does this through the analysis, understanding, and production of structured data from unstructured free text in large-scale text collections. BioNLP now has a wide range of applications in biomedical literature mining and has attracted significant investment from research communities worldwide, reflecting its central role in many areas of biomedical research and healthcare science. As a result, the market for biomedical text data analysis and BioNLP is growing rapidly. In particular, the NLP in healthcare and life sciences market in the United States is estimated to grow from USD 1,030.2 million in 2016 to USD 2,650.2 million by 2021.
Relation extraction (RE) is a vital intermediate step in a variety of BioNLP applications. Its contributions range from precision medicine [6], adverse drug reaction identification [30, 53], drug abuse event extraction [71], and major life event extraction [19, 106] to question answering systems [31, 120] and clinical decision support systems.

Figure 2: Challenges' subtasks/tracks organized based on NLP perspectives [64]. In general, NLP tasks closer to the top of the pyramid are more difficult.
Because of these motivations, several challenge evaluations have been organized to assess and advance BioNLP research. These challenge evaluations often attract many scientists around the world, who attend and publish their latest research on biomedical text analysis. Huang and Lu (2015) [64] categorized the prevalent challenges by the targeted problems in NLP research, as shown in Figure 2. We observe that BioNLP shared tasks pay much of their attention to information extraction, including relation extraction/classification and named entity recognition, which are listed in the middle two parts of the pyramid. Some examples of well-known shared tasks include BioNLP, BioCreative, i2b2, ShARe/CLEF eHealth, and SemEval.
There have been a number of doctoral dissertations across the world on relation extraction related topics (more detailed information is given in Section 1.2.3). Some of them focused on a specific type of relation, for example disease-gene relations [66] and drug-drug relations [172]. The data types that they targeted are also very diverse (e.g., scientific literature [100] and electronic health records [96]). Many machine learning methods have been proposed for relation extraction: supervised feature-based machine learning [9], semi-supervised learning [172], deep learning [96], etc.
In this Dissertation, we consider Relation Extraction as two text mining sub-tasks, i.e., Named Entity Recognition (NER) and Relation Classification (RC). The task of biomedical named entity recognition (NER) seeks to locate named entities in free-form biomedical text and classify them into a set of pre-defined categories/types such as gene/protein, phenotype, disease, and chemical, or 'none-of-the-above'. The NER problem consists of three sub-problems: (i) defining the entity boundary, (ii) assigning the delimited entity to a pre-defined class, and (iii) named entity normalization, i.e., matching the extracted entities to a concept in a knowledge base. The named entity normalization problem is often treated as an independent problem. The popular methods used for biomedical NER include dictionary-based methods, rule-based methods, classification-based methods, sequence labeling methods, and hybrid methods that combine other techniques [138, 167]. Relation classification (RC) is the task of discovering semantic connections between biomedical entities. The common biomedical relations include drug-drug interactions [164], chemical-disease relations [180], protein-protein interactions [83], and many others. The most typical methods for relation classification are co-occurrence approaches, rule-based methods, several machine learning methods, and hybrid methods [5, 142, 167].
sub-In line with worldwide research trend, this dissertation differs from the other search in several aspects: (i) We try to solve both NER and RC of RE as two separatetasks Most other works focus on only one task, NER or RC Some research addressed
re-RE as RC, and NER was be solved in the previous phase as a pre-processing step (ii)
We focus on the scientific literature abstracts and capitalize on their characteristics, notjust consider them as normal documents (iii) The dissertation research and apply a vari-ety of machine learning methods, including supervised feature-based machine learning,unsupervised machine learning, distant learning, and deep learning (iv) The dissertation
Trang 19does not entirely focus on a specific type of relationship CID is just a typical ship used to facilitate the comparison of results Many experiments were conducted forother relation types, and all have positive results.
relation-Research challenges:
The biomedical research community pays much attention to developing dedicated data and resources. Recently, it has been acknowledged that biomedicine is a field with one of the most abundant amounts of available public resources and tools. However, the specific characteristics of biomedical data still bring many challenges for the research communities [2, 167]:

– Firstly, biomedical NLP still faces many existing NLP problems, i.e., problems that exist not only in the biomedical domain but also in the general field of NLP. We list here some widespread problems: the imbalanced data problem, special linguistic units such as negation and conjunction, and directed relation types.

– Secondly, information extraction in the biomedical domain often suffers from errors caused by the relatively low performance of pre-processing steps. Because biomedical texts are highly specialized, generic data analysis and NLP tools are not appropriate.

– Thirdly, biomedical terms have their own diversity and characteristics, such as the lack of nomenclatures and the extreme use of unknown words, which make terms highly variable and ambiguous compared to other domains.

– The fourth problem comes from ambiguity and inconsistency, i.e., NEs with the same orthographic features may fall into different categories.

– Finally, biomedicine is an interdisciplinary field. The complexity of the biological domain is growing, and biomedical research relies increasingly on the development of methods and concepts crossing these boundaries.
Research objectives and methodology:
Motivated by the above necessities and challenges, the Dissertation aims at the following research objectives:

– [RO1] Appropriately represent the biomedical literature text to make the best use of linguistic, syntactic, and semantic information.

– [RO2] Take advantage of state-of-the-art advanced methods and resources to propose combination architectures, and then improve them to resolve the NER and RC problems with good results.
To reach these research aims, we focus on addressing the following main research question: How to build an effective machine learning-based architecture for NER and RC systems? It includes two sub-questions that supplement the main research question:

– [Sub-question sQ1] How to convert the biomedical literature text, annotated with named entity and relation labels, into a rich representation containing useful information that can be processed by machine learning models?

This research question is addressed throughout the Dissertation; for example, we represent the relations by using the engineered features in Chapter 2, embeddings and the shortest dependency path in Chapter 4, and the graph in Chapter 5.

– [Sub-question sQ2] How to apply, combine, and improve advanced machine learning methods for building NER and RC systems?

This research question is addressed in Chapter 2, Chapter 3, Chapter 4 and Chapter 5.

The research methodology of the Dissertation is a combination of qualitative research and quantitative research:
• Qualitative research includes: (i) analyzing the ideas, proposed methods and techniques of related works; (ii) detecting problems, advantages and disadvantages of these methods; (iii) improving, combining and proposing new solutions and models to resolve problems.

• Quantitative research includes: (i) analyzing available corpora, (ii) deploying experiments, (iii) verifying the performance of proposed methods and models, and (iv) publishing scientific reports to receive verification from the research community.

Overview of our approach:
The Dissertation participates in the research trend of BioNLP in general and biomedical relation extraction in particular. Our focus is on improving the methods, exploiting rich information in data representation, and building a capable architecture for biomedical named entity recognition and relation classification, rather than on developing new machine learning algorithms.

We state that being able to achieve better performance in biomedical relation extraction tasks depends on improvements in machine learning and data representation. We first build an end-to-end model for named entity recognition and relation classification. This model is mostly based on several supervised feature-based learning techniques. BioNLP, like its parent field NLP, has been through a step-change in the last five years with a move from machine learning based on expert features to deep learning techniques that learn feature representations for themselves. Following this research trend, we then propose several deep architectures for improving named entity recognition and relation classification.

The main contribution of the Dissertation:
The Dissertation has three main contributions:
– Researching, improving, and proposing several data representation manners to make use of linguistic, syntactic, and semantic information. This contribution is reflected in the proposal of a rich feature set in Chapter 2, a combination of several information types in Chapter 3 and Chapter 4, as well as a graph-based representation in Chapter 5.

– Studying and constructing some machine learning architectures to solve the NER and RC problems based on combining and improving advanced machine learning methods from multiple perspectives: (i) the UET-CAM system in Chapter 2 is a joint model of a NER-NEN system and rich feature-based machine learning with distant supervision learning for RC; (ii) the D3NER system in Chapter 3 combines several types of information in a deep learning model; (iii) the MASS and RbSP models in Chapter 4 are deep learning-based models with several improvements, including an attention mechanism; (iv) the multi-fragment ensemble model is also proposed in Chapter 4; (v) finally, Chapter 5 focuses on inter-sentence relation extraction with a novel graph-based model. Most applied methods/techniques are carefully analyzed to evaluate their contribution to system performance.

– Contributing to the research community by creating a silver-standard dataset called 'silverCID' for distant supervision learning. This dataset is used in Chapter 2 and Chapter 5 and is demonstrated to have a good influence on system performance.

Scope of the Dissertation:
The Dissertation focuses on solving the relation extraction problem in English biomedical literature text by applying natural language processing (NLP) techniques. Two sub-problems (i.e., named entity recognition and relation classification) are solved separately by applying several advanced machine learning methods in an appropriate architecture.
The biomedical named entity recognition problem is considered as a sequence labelling problem. Note that the nested entity problem is excluded, i.e., we do not consider the cases where named entities contain other named entities inside them or where several entities intersect. In a part of the Dissertation, named entity recognition is processed simultaneously with the named entity normalization phase to increase performance. The dissertation experiments work on three fundamental biomedical entities, i.e., Chemical, Disease, and Protein/Gene. They are three of the most frequently requested entities by PubMed users worldwide [68] and are annotated in many well-known biomedical knowledge bases (Medical Subject Headings (MeSH), Unified Medical Language System (UMLS), Systematized Nomenclature of Medicine (SNOMED), and many others).

In this Dissertation, we delineate the scope of the study of the biomedical relation classification problem according to the following characteristics:
• Only binary biomedical relations are extracted. We aim to address n-ary relations as further extensions of our model in future works.

• We focus on both intra- and inter-sentence relations.

• Both directed and undirected relations are considered in the research scope.

• Depending on the corpus that the relation classification system works on, it can be a binary classification or a multi-label classification problem.
In experiments, we mostly focus on the chemical-induced disease relation (also known as the adverse drug reaction or side effect). This relation attracts much attention from the research community as well as industry. It is annotated in many biomedical ontologies, e.g., SNOMED, Orthology Ontology (OWL), Human Health Exposure Analysis Resource (HHEAR), Human-Aware Science Ontology (HAScO), National Cancer Institute Thesaurus (NCIT), Radiology Gamuts Ontology (RGO), and the Comparative Toxicogenomics Database (CTD). Various other relations are also considered in some experiments for further comparisons. Examples include the drug-drug interaction (including mechanism, effect, advice and int), the locations (biotopes and geographical places) of bacteria, and many others.

The BioCreative V CDR corpus was selected as the benchmark dataset for experimentation throughout the Dissertation. Besides, depending on the verification direction we desired, some other datasets were selected, including the DDI corpus, BB3 corpus, and Phenebank corpus.
The dissertation outline:
The Dissertation outline is illustrated in Figure 3, which contains the Preface, five Chapters and the Conclusion. The related publications are marked with their corresponding Chapter.

Chapter 1: INTRODUCTION TO BIOMEDICAL RELATION EXTRACTION provides an introduction to important concepts relevant throughout this work. The main focuses of this chapter are the problem statement, literature review, related resources and the evaluation method.

Chapter 2: AN END-TO-END PIPELINE MODEL FOR BIOMEDICAL RELATION EXTRACTION describes the architecture of our UET-CAM system that participated in the BioCreative V CDR track. It is an end-to-end architecture for chemical-induced disease relation extraction that consists of several advanced feature-based machine learning components.

Chapter 3: AN IMPROVED CRF-BILSTM MODEL FOR BIOMEDICAL NAMED ENTITY RECOGNITION improves biomedical named entity recognition by proposing a deep learning model with several embedding sources. In addition to chemical and disease entities, gene/protein entities are also considered in this chapter's experiments.

Chapter 4: HYBRID, ATTENTION-BASED AND ENSEMBLE DEEP LEARNING MODELS FOR BIOMEDICAL RELATION CLASSIFICATION proposes some deep architectures for biomedical relation classification. Several corpora with various relation types are also used to demonstrate the flexibility and adaptability of the proposed models. The on-trend attention technique and the ensemble manner are also applied to propose a novel deep architecture with promising results.

Chapter 5: GRAPH-BASED INTER-SENTENCE RELATION CLASSIFICATION IN BIOMEDICAL TEXT presents our approach for inter-sentence relation classification. To exploit the graph-based representation effectively, we develop a novel shared-weight deep learning model.

Lastly, in the Conclusion, we summarize the dissertation's main contributions and limitations, then end with an outlook on future works.
Figure 3: The dissertation outline. The related publications are listed in their corresponding Chapter.
Chapter 1

Introduction to Biomedical Relation Extraction
Information extraction is the process of extracting information from unstructured or semi-structured data and turning it into structured data, or also the activity of populating a structured knowledge source with information from an unstructured knowledge source [43]. One of the most fundamental sub-tasks in information extraction is semantic relation extraction.

1.1 Problem statement

1.1.1 Semantic relation extraction
First of all, we present the definition of semantic relations in Definition 1.1.

Definition 1.1. Semantic relations (or semantic relationships) are the associations that exist between the meanings of linguistic components (e.g., semantic relations at the word level, entity level, phrase level or sentence level, etc.).
Semantic relation extraction (see Definition 1.2) is useful in many fact extraction applications, ranging from question answering [31, 120] to identifying adverse drug reactions [53].

Definition 1.2. Relation Extraction (RE) is the task of detecting and characterizing the semantic relations between pairs of named entity mentions in the text [2]. Receiving a (set of) document(s) as input, the relation extraction system aims to extract all pre-defined relationships mentioned in the document(s) by identifying the corresponding entities and determining the type of relationship between each pair of entities.

In this Dissertation, we focus on two sub-tasks of Relation Extraction: Named Entity Recognition (NER) and Relation Classification. The former, named entity recognition (NER, entity tagging), is an intermediate step for relation extraction. It refers to locating and classifying named entities in text into predefined categories. In the Dissertation scope, NER is the problem of finding biomedical entity mentions such as diseases, chemicals, genes, proteins, or organisms in natural language biomedical literature text, then tagging them with their location and type. The latter, relation classification (RC), goes after NER to find the semantic relations between the corresponding entities [2]. Biomedical relation classification often tries to classify the relationship between pairs of biomedical entities into relations such as drug-drug interaction, chemical-induced disease, or bacteria live-in location, or to tag them as 'none' if we cannot find any relationship between them. We describe these two sub-problems in detail in Sections 1.1.2 and 1.1.3 below.
1.1.2 Biomedical named entity recognition
We give the definition of a named entity in Definition 1.3.

Definition 1.3. A named entity (NE) (also called an entity mention) is a continuous sequence of words that designates some real-world entity [2].

The automated recognition of named entities in text has been a highly active area for over two decades and is referred to variously as 'terminology extraction', 'term recognition', 'entity identification', 'entity chunking', 'entity extraction' and 'named entity recognition'. In this dissertation, we use the term 'named entity recognition'. The task of named entity recognition (NER) seeks to locate NEs in free-form text and classify them into a set of predefined categories/types such as person, organization, location, expressions of time, quantities, monetary values, percentages or 'none-of-the-above'. In other words, NER is the problem of finding the mentions of entities in natural language text and labelling them with their location and type. Oftentimes this task cannot be simply accomplished by string matching against pre-compiled gazetteers, because named entities of a given entity type usually do not form a closed set and therefore any gazetteer would be incomplete. Another reason is that the type of a named entity can be context-dependent [2]. For example, 'Ho Chi Minh' may refer to the person who was a Vietnamese Communist revolutionary leader, the location 'Ho Chi Minh City', 'Ho Chi Minh Museum', or any other entity sharing the same part 'Ho Chi Minh'. To determine the entity type of this text span occurring in a particular document, its context has to be considered.
Named entity recognition is typically modeled as a sequence labeling problem. We treat each word in a sentence as an observation and the sentence as a sequence of observations, and try to assign a label to each observation of this sequence. It is defined formally in Definition 1.4.

Definition 1.4. Given a sequence of input tokens X = (x1, ..., xn) and a set of labels L, the named entity recognition (NER) task determines a sequence of labels Y = (y1, ..., yn) such that yi ∈ L for 1 ≤ i ≤ n [88].

While one may apply standard classification to predict the label yi based solely on xi, in sequence labelling it is assumed that the label yi depends not only on its corresponding observation xi but also possibly on other observations and other labels in the sequence. Typically this dependency is limited to observations and labels within a close neighbourhood of the current position i.
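As an illustration only (not code from the Dissertation's systems), Definition 1.4 can be expressed as a mapping from a token sequence to an equally long label sequence; the label set and the tagger interface in this minimal Python sketch are assumptions made for the example:

# Minimal sketch of NER as sequence labelling: X = (x1, ..., xn) -> Y = (y1, ..., yn).
from typing import List

LABELS = {"O", "B-CHEMICAL", "I-CHEMICAL", "B-DISEASE", "I-DISEASE"}  # an example label set L

def tag(tokens: List[str]) -> List[str]:
    """Hypothetical tagger interface: returns one label per input token."""
    # A real model (perceptron, CRF, biLSTM-CRF, ...) would predict here;
    # this placeholder labels every token as outside an entity.
    return ["O" for _ in tokens]

tokens = ["fusidic", "acid", "treatment", "in", "chronic", "active", "patients"]
labels = tag(tokens)
assert len(labels) == len(tokens) and all(y in LABELS for y in labels)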
The label of NER often incorporates two concepts: the type of the entity (e.g., whether the mention refers to a person, location, chemical or disease) and the position of the token within the entity. Hence, the label set should follow a formal tagging scheme (also called a 'label model' or 'tagging format'), which is a format for tagging tokens in a chunking task in computational linguistics, such as NER. The simplest model for the token position is the IO model, which indicates whether the token is inside (I) or outside (O) of an entity mention. While simple, this model cannot differentiate between a single mention containing several words and distinct mentions comprising consecutive terms. IOB is the well-known tagging scheme that overcomes the limitation of the IO scheme. Differently from the IO scheme, with the IOB scheme a token is tagged as B if it marks the beginning of an entity. This model is capable of differentiating between consecutive entities and has excellent support in the literature. The more complex model commonly used is IOBES (or IOBEW), which is the expressive variant of the IOB tagging scheme. In addition to I, O and B, IOBES uses E for Ending and S for Singleton (a one-word entity). While the IOBES scheme does not provide higher expressive power than the IOB model, it was shown to marginally improve labelling models' performance [157] and has been used in several NER studies [87, 181]. Example sentences annotated using each label scheme can be found in Table 1.1.
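To make the schemes concrete, the following Python sketch (an illustration only, not code from the Dissertation) converts token-level entity spans into IO, IOB or IOBES labels; the span format used here is an assumption for the example:

def spans_to_tags(n_tokens, spans, scheme="IOB"):
    """spans: list of (start, end, type) with token indices, end exclusive."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        length = end - start
        for i in range(start, end):
            if scheme == "IO":
                tags[i] = "I-" + etype
            elif scheme == "IOB":
                tags[i] = ("B-" if i == start else "I-") + etype
            else:  # IOBES
                if length == 1:
                    tags[i] = "S-" + etype
                elif i == start:
                    tags[i] = "B-" + etype
                elif i == end - 1:
                    tags[i] = "E-" + etype
                else:
                    tags[i] = "I-" + etype
    return tags

# 'fusidic acid' (a Chemical) covers tokens 1-2 of the example in Table 1.1:
print(spans_to_tags(7, [(1, 3, "CHEMICAL")], scheme="IOBES"))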
In reality, entity mentions can appear in various forms, including names, pronouns (e.g., 'he', 'her', 'who', etc.), and nominals (e.g., nouns, noun phrases, etc.). In many domains such as newswire and literature, NEs were often defined as proper names and their quantities of interest. The most popularly studied named entity types are person, organization and location, which were first defined by the sixth in a series of Message Understanding Conferences (MUC-6) [48]. These types are general enough to be useful for many application domains. Extraction of expressions of dates, times, monetary values and percentages, which was also introduced by MUC-6, is often also studied under NER, although strictly speaking these expressions are not named entities. Besides these general entity types, other types of entities are usually defined for specific domains and applications. NEs and NER in the biomedical domain are described below.

Biomedical named entities are phrases or combinations of phrases that denote important concepts in biomedicine. They can be chemicals, diseases, anatomies, pathways and genes/proteins, etc. that are named in biomedical literature, which has been growing at an unprecedented speed. Automatically extracting them, a task known as biomedical named entity recognition, involves the demarcation of entity names of a specific semantic type, e.g., proteins. It results in annotations corresponding to a name's in-text locations as well as the predefined semantic category it has been assigned to [158]. Over the last fourteen years, there has been considerable interest in this problem, with a variety of generic and entity-specific algorithms applied to extract the names of biomedical concepts. Recent NER research in the biomedical domain has been primarily focused on the entities most frequently requested by PubMed users worldwide, including Disorder (Disease, Symptom, Phenotype), Gene/Protein, Chemical/Drug, Biological Process, Medical Procedure, Living Being, Research Procedure, Cell Component, Body Part, Device or Tissue [68]. Still, there are few proposed solutions for other entities such as phenotypes and anatomy [24].

Figure 1.1: An example taken from the BC5 CDR corpus with recognized names of Disease, Chemical and Species.
Trang 29Table 1.1: Example sentences labeled using different tagging schema
IO
of |O fusidic |I − CHEM ICAL acid |I − CHEM ICAL treatment |O in |O chronic |O
active |O , therapy |O - |O resistant |O patients |I − SP ECIES
IOB
of |O fusidic |B − CHEM ICAL acid |I − CHEM ICAL treatment |O in |O chronic |O
active |O , therapy |O - |O resistant |O patients |B − SP ECIES
IOBES
active |O , therapy |O - |O resistant |O patients |S − SP ECIES
Examples are taken from the BC5 CDR corpus.
Figure 1.1 shows an example of biomedical named entities in text, chosen from the BioCreative V Chemical-Disease Relation corpus [105]. In this sentence, all disease, chemical, and species names have been demarcated. Table 1.1 compares the three tagging schemes for annotating this example.
1.1.3 Biomedical relation classification
Relation classification (RC) typically follows NER in the relation extraction system. Culotta et al. (2006) [29] define relation extraction as the task of discovering semantic connections between entities. In text processing, it usually amounts to examining pairs of entities in a document and determining (from local language cues) whether a relationship exists between them.

We take the pairwise approach for the task of relation classification, i.e., after NER, we consider all pairs of recognized NEs as potential candidates and give them as the input to the relation classification system. The relation classification system then classifies these candidates to assign them to a pre-defined relation type or 'None-of-the-above' (i.e., the negations). In reality, there may be multi-label instances, i.e., there is more than one relationship between an entity pair. In this Dissertation, we ignore these cases and only accept a single label for each instance. Generally, a semantic relationship can be defined among multiple entities (n-ary), but within the scope of this dissertation, we only consider binary relationships. Extracted binary relationships have the structure of a triple <e1, R, e2>, where e1 and e2 are named entities (or noun phrases) in a sentence (or abstract) from which the relationship is being extracted, and R is a relation type that connects the two corresponding entities. As relation classification is treated as a classification problem, we give its formal definition in Definition 1.5.
Definition 1.5. The relation classification task is defined by a real-valued function fR that decides whether the corresponding entities are in a relation or not, which can be written as fR(T(d), e1, e2), where:

– e1 and e2 are two entities that create a candidate for relation classification;

– d is a document which includes the corresponding entities e1 and e2; d can be a sentence, a paragraph or a document depending on the scope of the relationship;

– T(d) is the information that is extracted from d.
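A minimal sketch of this pairwise formulation (an illustration under assumptions, not the Dissertation's implementation): recognized mentions are paired exhaustively and a placeholder scoring function stands in for fR; the entity texts below are toy examples.

from itertools import combinations

def generate_candidates(entities):
    """entities: list of dicts like {"id": "D1", "type": "Disease", "text": "..."}.
    Every unordered pair of recognized mentions becomes a candidate instance."""
    return list(combinations(entities, 2))

def classify(candidate, document_features):
    """Placeholder for fR(T(d), e1, e2): returns a relation label or 'None'."""
    e1, e2 = candidate
    # A trained classifier (SVM, CNN, LSTM, ...) would score the pair here.
    return "None"

entities = [{"id": "C1", "type": "Chemical", "text": "carbachol"},
            {"id": "D1", "type": "Disease", "text": "headache"}]
for cand in generate_candidates(entities):
    print(cand[0]["text"], cand[1]["text"], classify(cand, document_features={}))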
Many aspects should be considered in a relation classification system, and they often differ for different types of entities:

– There may be several relations or only one relation in a corpus. For example, the BC5 CDR [105] and BioNLP-ST 2016 BB3 [33] corpora were annotated with only one relation type, whilst the Phenebank and SemEval-2013 DDI-2013 [61] corpora have several relation types.

– Several relations are directed and order-sensitive, such as the Mechanism relation in the DDI corpus [60] and the Inheres-in relation in the Phenebank corpus. Such relations require the model to predict both the relation type and the entity order correctly. In contrast, for undirected relations, such as Associated in Phenebank, both directions can be accepted. Another example is the Chemical-Induced Disease relation in BC5 CDR [105], whose direction always comes from a chemical to a disease.

– The relation may be an intra-sentence relation (i.e., the two corresponding entities appear in the same sentence) or an inter-sentence relation (i.e., the two corresponding entities may appear in different sentences).
Biomedical relation classification concerns the detection of semantic relations between biomedical named entities or noun phrases. Recently, there has been considerable interest in biomedical relation extraction and relation classification with a variety of relationships. The common biomedical relations include drug-drug interactions [164], chemical-disease relations [180], protein-protein interactions [83] and many others. With a multitude of possible relation types, it is critical to understand how systems will behave in a variety of settings. In the biomedical domain, relation classification is useful in many fact extraction applications, ranging from identifying adverse drug reactions to major life events. It is also important in tasks such as Question Answering and Knowledge Acquisition.

We give some examples of biomedical relations in Table 1.2, Figure 1.2, Figure 1.3, and Figure 1.4. Table 1.2 presents two examples among a multitude of possible relation types in the biomedical domain. Sentence (i) shows an example of a Synonym-of relation, which is represented by an abbreviation pattern. This is very different from the predicate relation Mechanism in (ii).

Table 1.2: Examples for different relation types
(i) <e1>Three-dimensional digital subtraction angiographic</e1> (<e2>3D-DSA</e2>) images from diagnostic cerebral angiography were obtained.

(ii) Dexamethasone: Steady-state trough concentrations of albendazole sulfoxide were about 56% higher when 8 mg <e1>dexamethasone</e1> was coadministered with each dose of <e2>albendazole</e2> (15 mg/kg/day) in eight neurocysticercosis patients.

Sentence (i) shows a Synonym-of relation, represented by an abbreviation pattern, which is very different from the predicate relation Mechanism in (ii).
Figure 1.2 includes examples from the BC5 CDR corpus [105] of an inter-sentence relation (i.e., the two corresponding entities belong to two separate sentences) and an intra-sentence relation (i.e., the two corresponding entities belong to the same sentence).

Figure 1.2: Examples of (a) inter-sentence relation and (b) intra-sentence relation. Examples are taken from the BC5 CDR corpus with a recognized chemical-induced disease relation between a chemical (highlighted in bold) and a disease (highlighted in underlined bold).

Figure 1.3 indicates the difference between unspecific location and specific location relations. While a relation with a specific location has information on the exact positions of the two corresponding entities, an unspecific location relation does not, i.e., all pairs of corresponding entities should be considered as positive instances. These examples also come from the BC5 CDR corpus.

Figure 1.3: Examples of relations with specific and unspecific location. (a) An unspecific location relation taken from the BC5 CDR corpus with recognized chemicals (highlighted in bold) and diseases (highlighted in underlined bold); the annotation points out that there are chemical-induced disease relations between the chemical carbachol and the diseases, but does not give the specific locations of the corresponding entities. (b) A specific location relation taken from the DDI corpus; the annotation specifies the Effect relation between two drugs (highlighted in bold) at their specific locations.

Figure 1.4 shows examples extracted from the Phenebank corpus. It includes examples of directed and undirected relations. In a directed relation, the order of entities in the relation annotation should be considered; vice versa, in an undirected relation, the two entities have the same role.

Figure 1.4: Examples of (a) Promotes - a directed relation and (b) Associated - an undirected relation, taken from the Phenebank corpus. (Entities are highlighted in bold.)
1.2 Literature review
1.2.1 Literature review of biomedical named entity recognition
Over the last fourteen years, there has been considerable interest in the biomedical NER problem, with a variety of generic and entity-specific algorithms applied to extract many biomedical NEs such as genes, gene products, cells, chemical compounds and diseases. Figure 1.5 gives an overview of NER approaches, in which some other developmental branches of machine learning methods, such as transfer learning, lifelong learning, and reinforcement learning, are not within the scope of this Dissertation. The specific methods that we used to construct the proposed models are highlighted. In general, similar to NER in the newswire domain, approaches to biomedical NER can be categorized as knowledge-based methods and machine learning-based methods. We also discuss hybrid approaches (combining several methods into an architecture) and joint modeling (a research trend that tries to integrate and handle different tasks as a single task).
Knowledge-based approaches:
The earliest and most straightforward solutions to biomedical NER relied on dictionary-based approaches. They rely on the use of existing biomedical resources containing a comprehensive list of terms and determine whether expressions in the text match any of the biomedical terms in the provided list [158]. Many knowledge bases are used in this approach; examples include MeSH, UMLS, SNOMED, etc. Up to now, this method is still used in several studies, such as Eftimov et al. (2017) [41], who proposed a rule-based named-entity recognition method for knowledge extraction of evidence-based dietary recommendations.

Rule-based methods try to craft patterns/rules manually to recognize NEs. In this approach, manually creating the rules for named entity recognition requires human expertise and is labour-intensive. An example of research in the biomedical field that follows this strategy is Hanisch et al. (2005) [58], who applied a staged rule-based system on UMLS, HPO and MetaMap.

Knowledge-based methods often require human expertise, and it is labour-intensive to create such knowledge bases and patterns. Since there are millions of entity names in use, and new ones are added constantly, these methods will never be sufficiently comprehensive and cannot catch up with the growth rate of the biomedical literature.
Figure 1.5: Named entity recognition approaches taxonomy. The specific methods that we applied in the proposed models are highlighted. Some other developmental branches of machine learning methods, such as transfer learning, lifelong learning, and reinforcement learning, are not within the scope of this Dissertation.
Feature-based supervised machine learning approaches:
Several recent works on biomedical NER use statistical supervised feature-based machine learning methods, which are often more robust in terms of system performance. Traditionally, to perform well and efficiently, NER models require a set of informative features (i.e., linguistic patterns) that are well-engineered and carefully selected, heuristically based on domain knowledge [17]. These methods utilize a large annotated corpus and the pre-defined feature set to infer optimal prediction functions by training the model, and then use it to predict the labels of new data. Supported by the availability of various annotated biomedical corpora, supervised machine learning methods have become popular, owing to the satisfactory performance they have demonstrated.
The perceptron [161] is a classic machine learning algorithm with many extended versions. Some recent research successfully applies structured perceptrons to sequence labeling tasks, including NER [62, 126]. In this Dissertation, the perceptron is used for NER in the UET-CAM system (Chapter 2).

Conditional Random Fields (CRF) [86] is the most popular discriminative machine learning model as an alternative to the previous ones for sequence labelling, as it combines the advantage of the Maximum Entropy Markov Model (MEMM) in exploiting non-independent contextual features of the entity, without the label bias problem. The CRF-based models that have especially shown reliable performance in the biomedical NER problem are the linear-chain CRF [45, 88, 90, 91] and the skip-chain CRF [110]. In this Dissertation, CRF is used as the labeling phase in the D3NER model (Chapter 3).
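As a hedged illustration of how a linear-chain CRF tagger is typically trained (a sketch assuming the sklearn-crfsuite package is available; this is not the UET-CAM or D3NER code, and the features are simplified placeholders):

import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    # Simple hand-engineered features for token i; real systems use far richer sets.
    word = tokens[i]
    return {"lower": word.lower(),
            "is_title": word.istitle(),
            "suffix3": word[-3:],
            "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"}

# X: list of sentences, each a list of per-token feature dicts; y: parallel IOB label lists.
sentence = ["fusidic", "acid", "treatment"]
X = [[token_features(sentence, i) for i in range(len(sentence))]]
y = [["B-CHEMICAL", "I-CHEMICAL", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)
print(crf.predict(X))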
In addition to the structured perceptron and CRF, supervised machine learning methods that can be used for NER are extremely abundant, with many variants such as the Hidden Markov Model (HMM) [24], the semi-Markov model [90], MEMM [38], Support Vector Machines (SVM) [25], decision trees [136], transition-based models [118], and more. Machine learning with feature engineering, however, is still time-consuming and very often yields incomplete, non-satisfactory feature sets. Moreover, the resulting feature sets are both domain- and model-specific.
Deep learning-based approaches:
In the past few years, the advent of deep neural networks with the capability of automatic feature engineering, even from noisy data, has leveraged the development of NER models. Deep learning models aim to automatically induce robust representations of data by manipulating multiple hidden layers. They have produced state-of-the-art results in many NLP tasks as well as in biomedical NER. A variety of deep learning methods and architectures have been used in the field of NLP in general and biomedical NER in particular, in which the most typical deep neural networks (DNNs) are Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and their variants. All of them often require the use of additional techniques to solve the over-fitting problem and reduce the impact of the initialization.
The Recurrent Neural Network (RNN) [162] performs effectively on sequential data, and it has furthermore had many different improvements among several state-of-the-art NLP systems, including NER [107, 184]. An advanced RNN type, the RNN with Long Short-Term Memory (LSTM) units [63], is a specific type of RNN that models dependencies between elements in a sequence through recurrent connections. Since the LSTM architecture can only process the input in one direction, the bidirectional LSTM (biLSTM) network improves the LSTM by feeding the input to the LSTM network twice, in two directions: forward, from the beginning to the end of the sequence, and, vice versa, backward, from the end to the beginning of the sequence. This design allows for the detection of dependencies from both previous and subsequent words in a sequence. Very recently, LSTM has increasingly been employed for biomedical NER, yielding state-of-the-art performance at the time of publication [55, 121, 122, 181]. Realizing the potential of LSTM for the NER problem, we use LSTM in combination with CRF in the biomedical NER model in Chapter 3.
The Convolutional Neural Network (CNN) [92] is good at capturing n-gram features in a flat structure and has also been proved effective in NLP, including NER [28, 184].

One of the fundamental steps in a deep learning model is word representation, i.e., transforming each word into a representation vector in the first layer of the model. There are several approaches to create a word representation, including randomly initialized embeddings, one-hot vectors, and character-level word embeddings, which represent a token's meaning in the sense of its morphological surface [55, 177]. The most common approach to convert a word into a vector is by looking it up in an embeddings matrix (i.e., a lookup table) created from pre-trained word embeddings. Word embeddings are a technique to represent a word by a low-dimensional continuous vector representation (embedding) that is pre-trained from an extremely huge amount of unlabeled text. One of the pre-trained word vector sets that has been widely used in biomedical named entity recognition is provided by Pyysalo et al. (2013) [151]. It is a pre-trained word embedding of 200 dimensions that was induced from PubMed and PMC texts (6 million distinct words) employing the word2vec skip-gram model [130]. Another well-known pre-trained embedding is FastText [10], whose 300-dimensional vectors represent words as the sum of the skip-gram vector and character n-gram vectors to incorporate sub-word information. FastText is provided for the general domain, but it also allows us to re-train the model with our biomedical data. Since these pre-trained word embedding models learn the word representation based on the usage of words, they allow words that are used in similar ways (similar contexts) to have similar representations, naturally capturing their meaning.
In recent years, the use of word embeddings in deep learning-based models has gradually been replaced by more effective methods that have proven remarkably successful on NLP problems including NER, namely ELMO (Embeddings from Language Models, 2018) [149] and BERT (Bidirectional Encoder Representations from Transformers, 2019) [35]. In the early stage of their release, both ELMO and BERT only provided
pre-trained models for common-domain English text. Re-training them is resource-expensive: pre-training a BERT-base model on English Wikipedia (2.5 billion words) and BooksCorpus (0.8 billion words) on a TPUv2 takes about 54 hours; pre-training a BERT-base model on PubMed abstracts (4.5 billion words) and PMC full texts (13.5 billion words) on eight NVIDIA V100 (32GB) GPUs takes 23 days [97]; moreover, fine-tuning a pre-trained BERT on a specific task often takes a few additional hours on a GPU. In 2020, Lee et al. [97] introduced BioBERT, the first domain-specific BERT-based model pre-trained on biomedical corpora. We leave this research direction to future works.
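As a hedged illustration of this direction, the sketch below shows how a pre-trained (Bio)BERT checkpoint could be reused for token-level NER with the Hugging Face transformers library; the checkpoint identifier and number of labels are assumptions, and any BERT-compatible model can be substituted.

from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "dmis-lab/biobert-base-cased-v1.1"   # assumed BioBERT identifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=5)

inputs = tokenizer("Aspirin may cause gastric ulcer.", return_tensors="pt")
logits = model(**inputs).logits   # (1, number_of_subword_tokens, num_labels)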
Unsupervised and semi-supervised machine learning:
Several unsupervised and semi-supervised methods have been utilized to tackle the biomedical NER task. Unsupervised machine learning methods for biomedical NER are often based on phrase chunking and distributional semantics, in which entity recognition may leverage terminologies, shallow syntactic knowledge (noun phrase chunking), and corpus statistics (inverse document frequency and context vectors) [190]. These methods are not within the scope of this Dissertation.
Semi-supervised methods take advantage of both supervised and unsupervised approaches. They are applied in various manners, such as self-training (bootstrapping) [174], co-training [52], transfer learning [179] and distant supervision learning [98]. We apply distant supervision learning in Chapter 2 and Chapter 4 to improve the performance of the proposed models.
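The following minimal sketch illustrates the distant supervision idea in its simplest form: sentence-level labels are induced from a knowledge base of known chemical-disease pairs instead of manual annotation; the pairs and the sentence are purely illustrative.

# A knowledge base of known chemical-disease pairs (illustrative entries only).
known_pairs = {("aspirin", "gastric ulcer"), ("cisplatin", "nephrotoxicity")}

def label_instance(sentence, chemical, disease):
    """Mark a co-occurring (chemical, disease) pair as a positive example
    if the pair is listed in the knowledge base, otherwise as negative."""
    return 1 if (chemical, disease) in known_pairs else 0

print(label_instance("Aspirin-induced gastric ulcer was observed.",
                     "aspirin", "gastric ulcer"))   # 1 (silver-standard positive)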
Hybrid model and joint modeling:
Hybrid architectures are proposed to take advantage of several different methods by combining them into a single model. This approach integrates heuristics/rule/pattern-based methods, domain knowledge, and learning-based methods in various combination manners. One state-of-the-art hybrid architecture successfully applied to NER is the combination of a deep learning network for data representation with a CRF for sequence labelling [55].
Following reports of the high performance of joint-inference models in other NLP tasks, several studies have tried to model NER jointly with another NLP task to improve performance. Sometimes, after NER, we need to link each recognized entity to a concept (or data entry) in an ontology or database; this task is called named entity normalization (NEN). Traditionally, NER and NEN were treated as two separate tasks, in which NEN took the output of NER as its input in a pipeline manner. Several studies [89, 116] have pointed out the limitations of this pipeline approach, i.e., it causes cascading errors from NER to NEN and limits the ability of the NER system to directly exploit the lexical information provided back by the normalization step. A joint model of NER and NEN is expected to overcome these disadvantages of the traditional pipeline. Several works have tried to build such a NER-NEN joint model; for example, [118] proposed a transition-based model to jointly perform disease NER and NEN, and TaggerOne [89] is a joint model combining a semi-Markov structured linear classifier with a rich feature approach for NER and supervised semantic indexing for NEN. We exploit this idea in the proposed model in Chapter 2 to join the NER and NEN modules in the decoding phase.
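The following schematic sketch (purely illustrative, not the actual decoder of Chapter 2) shows why joint decoding can help: each candidate mention is scored by both the recognizer and the normalizer, so lexicon evidence from NEN can promote or suppress NER candidates. The interpolation weight, scores and concept identifiers are all hypothetical toy values.

def joint_score(ner_score, nen_score, weight=0.5):
    # weight is a hypothetical interpolation parameter between NER and NEN evidence
    return (1 - weight) * ner_score + weight * nen_score

candidates = [
    # (mention span, candidate concept ID, NER score, NEN score) -- fictitious values
    ("gastric ulcer", "CONCEPT:0001", 0.60, 0.90),
    ("gastric",       "CONCEPT:0002", 0.70, 0.20),
]
best = max(candidates, key=lambda c: joint_score(c[2], c[3]))
print(best[:2])   # the longer, normalizable mention wins under the joint score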
1.2.2 Literature review of biomedical relation extraction
We categorize approaches to biomedical relation classification into knowledge-based methods and machine learning methods, as illustrated in Figure 1.6, in which the specific methods that we used to construct the proposed models are highlighted. Note that there are some other developing branches of machine learning methods, such as transfer learning, lifelong learning, and reinforcement learning, but they are not within the scope of this Dissertation.
Knowledge-based approaches:
The simplest approach to detecting potential relationships is based on co-occurrence statistics. Based on the hypothesis that if two entities are frequently mentioned together they are likely to be somehow related, this method reveals biomedical relationships by counting their co-occurrences in the same sentences or entire abstracts [20]. More accurate alternatives for relation classification are based on manually crafted rules [78, 115] and patterns [79, 146]. These methods do not require any annotated data to train a system but typically suffer from two disadvantages: (i) the rules and patterns are manually crafted, which is very expensive, time-consuming and often requires domain expert knowledge; (ii) they are limited to extracting specific relation types. Since co-occurrence methods often have low precision and rule/pattern-based methods are labour-intensive and do not generalize, machine learning approaches are currently among the top choices for relation extraction.
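For reference, the co-occurrence baseline can be summarized in a few lines: the sketch below counts sentence-level co-mentions of chemical-disease pairs and flags frequent pairs as candidate relations; the entity annotations and threshold are illustrative.

from collections import Counter
from itertools import product

# Toy sentence-level entity annotations (illustrative only).
sentences = [
    {"chemicals": {"aspirin"}, "diseases": {"gastric ulcer"}},
    {"chemicals": {"aspirin"}, "diseases": {"gastric ulcer", "asthma"}},
    {"chemicals": {"cisplatin"}, "diseases": set()},
]

pair_counts = Counter()
for sent in sentences:
    for chem, dis in product(sent["chemicals"], sent["diseases"]):
        pair_counts[(chem, dis)] += 1

candidates = [pair for pair, n in pair_counts.items() if n >= 2]  # toy threshold
print(candidates)   # [('aspirin', 'gastric ulcer')]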
Feature-based supervised learning approaches:
Figure 1.6: Relation extraction approaches taxonomy. The specific methods that we applied in the proposed model are highlighted.
Some literature reviews on relation extraction [5, 142] divide supervised learning
approaches into two sub-categories, i.e., kernel-based and feature-based methods, based
on their input to the classifier. While feature-based methods require a set of pre-defined features extracted from sentences, kernel-based techniques often take advantage of rich structural representations such as dependency trees. In this Dissertation, we only focus on feature-based methods. Feature-based methods represent each labeled instance as a feature vector, in which each element represents a feature. These feature vectors are then fed to a classifier for training the model and predicting whether a candidate entity pair is related or not. These methods are data-driven, i.e., based on domain-specific manually annotated corpora. In the biomedical domain, these approaches are widely used since they can take advantage of various annotated biomedical corpora that are freely available, while still yielding promising performance.
The most popular feature-based supervised machine learning algorithm is the Support Vector Machine (SVM) [27], which tries to find a linear hyperplane in an n-dimensional space with the largest distance to the nearest instances of the positive and negative classes. Feature-based SVMs have been used for extracting chemical-induced disease relations [186], Live-in events [99], drug-drug interactions [156], protein-protein interactions [132], protein-organism-location relations [117] and many other biomedical relations. An SVM with a rich feature set is used for relation classification in Chapter 2 of this Dissertation.
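A minimal sketch of feature-based relation classification with an SVM is given below (using scikit-learn): each candidate entity pair is described by hand-crafted features, vectorized, and classified. The features, labels and data are illustrative and do not correspond to the Chapter 2 feature set.

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Toy hand-crafted features for two candidate entity pairs.
train_features = [
    {"e1_type": "Chemical", "e2_type": "Disease", "trigger": "induce", "token_dist": 3},
    {"e1_type": "Chemical", "e2_type": "Disease", "trigger": "treat",  "token_dist": 5},
]
train_labels = ["CID", "NONE"]   # chemical-induced disease vs. no relation

clf = make_pipeline(DictVectorizer(), LinearSVC())
clf.fit(train_features, train_labels)
print(clf.predict([{"e1_type": "Chemical", "e2_type": "Disease",
                    "trigger": "induce", "token_dist": 2}]))   # expected: ['CID']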
In addition to SVM, many other machine learning methods have been applied to biomedical relation extraction, such as Conditional Random Fields [14], Naive Bayes [102], maximum entropy [49] and logistic regression [73].
These machine learning methods for relation classification require a careful feature-engineering process. Creating the feature sets is time- and money-consuming, yet the resulting features are often limited to a specific model and domain.
Deep learning approaches:
Recent successes in deep learning have stimulated interest in applying neural architectures to the task of relation classification. They are extremely good at automatic feature engineering from noisy data, thus not requiring a handcrafted feature set while still yielding good performance. Deep learning models often require the use of additional techniques to address the over-fitting problem and to reduce the impact of parameter initialization. This Dissertation applies both CNNs and RNNs, with several different improvements, to classifying biomedical relations.
Convolutional Neural Networks (CNNs) [92] were among the early approaches successfully applied to the biomedical relation classification problem, yielding state-of-the-art results. Zhao et al. [194] used a syntax CNN for extracting drug-drug interactions; Verga et al. [176] and Zhou et al. [198] applied CNNs to chemical-induced disease relation extraction.
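As an illustration of how a CNN captures n-gram features for relation classification, the following is a minimal sketch (in PyTorch, not any of the cited systems): 1-D convolutions slide over the token embeddings of a sentence, max-pooling keeps the strongest filter responses, and a linear layer scores the relation types.

import torch
import torch.nn as nn

class CNNRelationClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, filters=64, kernel=3, num_relations=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, filters, kernel_size=kernel, padding=1)
        self.out = nn.Linear(filters, num_relations)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)        # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values   # max-pool over positions
        return self.out(x)                               # relation scores

logits = CNNRelationClassifier(vocab_size=5000)(torch.randint(0, 5000, (1, 20)))
print(logits.shape)   # torch.Size([1, 2])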
Recurrent Neural Networks [162] are another approach to capturing relations and are naturally good at modelling long-distance dependencies within sequential language data. Several RNN variants have been applied to the biomedical relation classification task, including the original RNN [128], the RNN with LSTM units, which is used to extend the range of context [108, 197], and the Recursive neural network [111].
The deep learning-based research on relation classification in this Dissertation is mostly based on the shortest dependency path (SDP). Nodes (tokens) and dependencies in the SDP can be represented as vectors by using the methods outlined in Section 1.2.1. While tokens are often represented based on word embeddings, dependencies are often converted to one-hot vectors or randomly initialized. As introduced in Section 1.2.1, ELMO (2018) [149] and BERT (2019) [35] have been shown to be effective in numerous NLP studies, including RC. However, since the research on relation classification in this Dissertation is mostly based on the shortest dependency path (SDP), the use of