
Nguyen Quoc An

AN ONTOLOGY-BASED IMPROVEMENT FOR MULTI-ANSWER SUMMARIZATION IN CONSUMER HEALTH QUESTION ANSWERING SYSTEM

GRADUATION THESIS Major: Computer Science

HA NOI - 2022


VIETNAM NATIONAL UNIVERSITY, HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY


Supervisors: Assoc.Prof Tran Trong Hieu

MSc Can Duy Cat

HA NOI - 2022


Abstract

Automatic question answering (QA) systems help users quickly resolve everyday questions. During the COVID-19 pandemic, healthcare has been one of the topics users care about most. In an era of information explosion, distilling helpful information from QA system responses takes time. The multi-answer summarization problem is studied to address this. A model for this task takes the user's question and all answers as input and returns a summary. Such summaries have been shown to aid information absorption.

This thesis focuses on the extractive summarization problem and presents ontology-based improvements to the baseline multi-answer summarization model in a consumer health question answering system, with two main sub-tasks: ontology construction and building an extractive multi-answer summarization model. The ontology construction task focuses on building an ontology that is leveraged to extend biological knowledge such as related terms, chemicals, diseases, and symptoms. Additionally, WordNet is used to enhance common-sense knowledge. In the summarization phase, several sentence scoring methods are proposed that use the expanded keywords. Compared to the baseline, the improved model performs better by a large margin. As a result, the proposed model outperforms current state-of-the-art systems with 0.511 ROUGE-2 F1. An application is built that creates a question-answering summarization model from the websites of five of the world's leading independent biotechnology companies in Japan.

Keywords: multi-answer summarization, extractive summarization, query-based summarization, ontology construction, ROUGE


Acknowledgements

I want to thank my supervisors, Assoc. Prof. Tran Trong Hieu and MSc. Can Duy Cat. They always offered insightful comments on both my work and this thesis. Their dedication has given me more motivation to complete the thesis in the best way.

Furthermore, I am very thankful to Dr. Le Hoang Quynh and the members of the Data Science and Knowledge Technology Laboratory at the VNU University of Engineering and Technology. We had many discussion meetings, and their comments will help me improve myself and become more mature in the future.

Finally, my deep thanks go to my family, relatives, and friends, who are always with me during the most challenging times and always encourage me in life and at work.

Although I have tried my best to complete this report, minor errors are unavoidable. I sincerely look forward to the understanding and guidance of my teachers and professors.


Declaration

I declare that this thesis has been composed by myself and that the work has not been submitted for any other degree or professional qualification. I confirm that the work submitted is my own, except where work which has formed part of jointly-authored publications has been included. My contribution and those of the other authors to this work have been explicitly indicated below. I confirm that appropriate credit has been given within this thesis where reference has been made to the work of others.

I certify that, to the best of my knowledge, my thesis does not infringe upon anyone's copyright nor violate any proprietary rights, and that any ideas, techniques, quotations, or any other material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with standard referencing practices.

I take full responsibility for my commitments and will accept all prescribed disciplinary actions. I declare that this thesis has not been submitted for a higher degree to any other University or Institution.

Student

Nguyen Quoc An


Table of Contents

Abstract
Acknowledgements
Declaration
Table of Contents
List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Difficulties and Challenges
1.4 Contributions of the thesis

2 Related work
2.1 Summarization approach
2.2 Ontology Construction Approach

3 Proposed model
3.1 Summarization baseline model
3.1.1 Pre-processing
3.1.2 Single-answer extractive summarization
3.1.3 Multi-answer extractive summarization
3.2 Ontology Construction
3.2.1 Motivation
3.2.2 Overview of proposed ontology construction
3.2.3 Biomedical databases
3.2.4 Independence Ontology Construction
3.2.5 Ontologies Integration
3.2.6 Ontology Population
3.3 Apply Ontology-based Improvements to Summarization model
3.3.1 Baseline Model Improvements
3.3.2 Question's Keyword Expanding
3.3.3 Customised scoring methods

4 Experiments and Results
4.1 Implementation and Configurations
4.2 Dataset and Evaluation methods
4.2.1 Metrics and Evaluation
4.3 Experimental results
4.3.1 Ontology Construction
4.3.2 Summarization Experiments
4.3.3 Errors Analysis
4.4 Application on medical website
4.4.1 System overview
4.4.2 Application's result

Conclusions
List of Publications
References


List of Figures

1.1 The evolution of MEDLINE citations between 1986 and 2019
1.2 Typical tasks / competitions in the field of natural language processing for biomedical data
1.3 Classification of Text Summarization Approaches
1.4 Multi-Answer Summarization pipeline
2.1 Summarization approaches
3.1 Summarization baseline model
3.2 Overview of proposed ontology construction
3.3 CTD disease-chemical relations
3.4 Proposed summarization model overview
3.5 Ontology expanding method
3.6 WordNet expanding method
4.1 The statistic of nodes and terms in three independent ontologies
4.2 The statistic of nodes and terms in three integrated ontology
4.3 The reduction of ROUGE-2 F1 per each scoring method when replacing the proposed weighted score with the before version
4.4 Ablation test results for various components


List of Tables

1.1 The result summary example responses to a question in medical question and answer system (MEDIQA)
3.1 MeSH's topic category list
4.1 Configurations and parameters of proposed model
4.2 The statistics of extract summary in datasets
4.3 The statistic of relations and terms in ontology population
4.4 Comparison model's results of the MEDIQA 2021 Task 2 - Extractive Summarization
4.5 Examples of some errors in test set
4.6 Five biotechnology companies' websites in Japan


Chapter 1

Introduction

This chapter presents the motivation for and the urgency of the thesis topic in Section 1.1. The summarization problem and the query-based summarization problem are then discussed in Section 1.2.

1.1 Motivation

Many experts and leaders have identified data as an invaluable asset in the era of information explosion. For example, Clive Humby, a British mathematician and entrepreneur in the field of data science, said "Data is the new oil". Indeed, exploiting data effectively brings great value. Biomedical text mining is a topic of increasing interest in the research community. For example, the expansion of MEDLINE¹, one of the largest and best-known biomedical online databases in the world, is depicted in Figure 1.1 [20]. The number of citations grew from 1 million in 1970 to 13.5 million in 2005, and then doubled in 14 years to 26.2 million in 2019.

However, in this age of information overload, the overabundance of data has made it difficult for humans to absorb. In that context, automatic question answering systems have been built. For example, a question answering system can help users find information about treatments for common COVID-19 symptoms from reliable data, which allows them to handle infection situations more scientifically and easily.

¹ The US National Library of Medicine's biomedical database


Figure 1.1: The evolution of MEDLINE citations between 1986 and 2019

The vertical axis represents the number of citations (in millions). For a clearer representation, the statistics from before 2005 are shown every 5 years.

Nowadays, several automatic question answering systems about health are available, such as PubMed, CHiQA, and Google. Although the answers returned by these search engines have been selected, independent answers from different sources still overlap. For instance, for the question "How long have SARS-CoV-2 existed?", PubMed provides about 1000 long answers, and Google returns 5,070,000,000 responses.

The idea is to use a summary engine to condense all the responses into a short paragraph. The summary gathers all of the necessary information and eliminates any duplicates. Therefore, users can read one paragraph instead of a massive number of documents. This thesis focuses on the summarization model in the health question answering system. However, question answering and summarization are the two most demanding tasks for biomedical text, according to the experts surveyed in Figure 1.2 [6].

Recognizing the potential of biomedical summarization, a number of competitions have been launched in recent years to support research and development in this field. The BioNLP workshop series, which is co-hosted by the ACL SIGBIOMED specialized research community, has grown into an exceptional yearly event for researchers to present their research ideas in the field of natural language processing for biological and medical data (bioNLP). In 2021, the BioNLP workshop with the topic MEDIQA 2021: Summarization in the Medical Domain⁶ was held, consisting of three separate tasks. My team chose the Summarization of Multiple Answers task, which is similar to the summary engine in a question answering system. Our team won second prize (in extractive summarization) and third prize (in abstractive summarization) in this contest. Besides, our team won second prize in the student scientific research competition at my university and has four papers about summarization.

Figure 1.2: Typical tasks / competitions in the field of natural language processing for biomedical data

After participating in this competition, the error analysis indicated that the model focuses only on the terms mentioned in the question. Meanwhile, related terms such as synonyms, related chemicals, and related diseases also have a certain degree of importance. This is the main reason this thesis continues the research on question-driven improvements. This thesis proposes ontology-based improvements that advance significantly over the previous model.

⁶ https://sites.google.com/view/mediqa2021


1.2 Problem Statement

Text summarization aims to select or generate important information from the original text(s) to create a short version [7]. Humans often read all documents to develop an understanding and then write a summary highlighting the main points. Because machines lack human experience and understanding, generating a text summary automatically is exceedingly difficult and time-consuming.

Based on the different characteristics of the summary, text summarization can be classified in many ways, as shown in Figure 1.3 [3].

Figure 1.3: Classification of Text Summarization Approaches
[Figure labels: Text Summarization Approach, divided into Based on Input Document, Based on Summary Usage, Based on Techniques, and Based on Characteristics]

• According to the input document(s): Single-document summarization and Multi-document summarization. Single-document summarization focuses on a single text, while multi-document summarization takes multiple documents as input.

• According to the summary usage: Generic and Query-based. The generic approach does not focus on a specific topic or aspect and gives an overview of the sources. In the query-based approach, the result focuses on the user's question.

• According to techniques: Supervised and Unsupervised. Unsupervised approaches rely on algorithms that do not depend on human support, such as labelling training datasets. These models are suitable for big data, such as website data. Supervised learning methods are based on sentence-level classification, where the model learns to distinguish summary from non-summary sentences.

• According to output characteristics: Extractive summarization and Abstractive summarization. The extractive method selects the most crucial sentences from the documents and builds the summary by combining them. As a result, every sentence in the summary belongs to the original document. The abstractive approach, in contrast, tries to rewrite the summary based on the original sentences.

Formal definition: According to the Multi-Answer Summarization task requirements⁷, different answers can bring complementary perspectives that are likely to benefit the users of QA systems. The purpose of this task is to build a multi-answer summarization model that can summarize the numerous relevant replies to a medical question. The input to the model is the customer's question $Q$ and all answers $A = \{A_1, A_2, \ldots, A_n\}$. The output is a summary that answers the given question (Figure 1.4). Table 1.1 shows an example of a resulting summary.

Figure 1.4: Multi-Answer Summarization pipeline
[Figure labels: User's question; Multiple related answers; Summarization]

Thesis scope: In this work, the model follows the query-based multi-document extractive summarization approach. In terms of the classification above, the model has four properties: multi-document, query-based, unsupervised, and extractive. The extractive approach has several advantages, such as (i) short summarization time, (ii) low hardware cost, and (iii) easily controlled summary quality. Besides, compressing multiple replies into a single answer saves users time and effort. The summary is built around the user's question, which makes it highly applicable.

⁷ https://www.aicrowd.com/challenges/mediqa-2021/problems/mediqa-2021-multi-answer-summarization-mas


Table 1.1: The result summary example responses to a question in medical question and answer system (MEDIQA)

to expose the spine On your side, if you are having surgery on your lower back The surgeon will use tools called retractors to gently separate, hold the soft tissues and blood vessels apart, and have room to work A synthetic bone substitute is used With a cut on the front of the neck, toward the side The surgeon will use a graft (such as bone) to hold (or fuse) the bones together permanently There are several ways of fusing vertebrae together Strips of bone graft material may be placed over the back part of the spine Bone graft material may be placed between the vertebrae Special cages may be placed between the vertebrae These cages are packed with bone graft material The surgeon may get the bone graft from different places From another part of your body (usually around your pelvic bone) This is called an autograft Your surgeon will make a small cut over your hip and remove some bone from the back of the rim of the pelvis From a bone bank This is called an allograft A synthetic bone substitute can also be used The vertebrae may also fixed together with rods, screws, plates, or cages They are used to

3 to 4 hours

A bone graft can be taken from the person’s own healthy bone (this is called an au-

eral anesthesia).During surgery, the surgeon makes a cut over the bone defect The bone graft can be taken from areas close to the bone defect or more commonly from the pelvis The bone graft is shaped and inserted into and around the area The bone graft can be held in place with pins, plates, or screws

Extractive summary

A bone graft can be taken from the person’s own healthy bone (this is called an au-

with rods, screws, plates, or cages They are used to keep the vertebrae from moving until the bone grafts are fully healed.


1.3 Difficulties and Challenges

Figure 1.2 shows that biomedical summarization is a challenging problem. The challenges encountered are text processing in the biomedical domain, multi-answer summarization, and query-based summarization.

Text processing problems in the biomedical domain: Understanding biomedical text requires much human experience and knowledge because biomedical texts are scientific texts containing many biomedical terms and acronyms. In addition, natural language processing tools for the biomedical field are still limited. Some related problems are:

• Lack of standard nomenclature: A biomedical term can have multiple variations. Variations can be caused by inconsistent naming or by a writer's mistake. Therefore, it is necessary to retrieve term variations to support the summarization model.

• Extreme use of unknown words: New biomedical terms are not always covered by pre-processing models. As a result, pre-processing models label them as unknown words.

• Complex relationships between biomedical terms: Some questions require relations such as chemical-disease, disease-gene, and disease-symptom. Identifying biological relationships plays an important role in the summarization problem.

Summarization challenges: Compared with single-answer summarization, the multi-answer summarization model faces the following problems:

• Diversity of answer length: Both short and long answers can appear in the model's input, resulting in unbalanced contributions to the summary. As a result, each answer contributes a different number of selected sentences, and predicting that number per answer is a problem.

• Overlap and inconsistency: Different authors write the answers, so their vocabulary and style differ. Besides, the essential parts of the answers may overlap in meaning.


Query-based summarization challenges: Rather than retaining the text's main idea, the summary must include all pertinent and accurate information needed to answer the question.

• Relationship between question and answers: Answer keywords can relate to question keywords through semantic relationships, such as the biological relations mentioned above.

• Question presentation: A user can ask the same question in a variety of ways. Therefore, it is necessary to identify the context and the important keywords to determine the relevant sentences correctly.

• Ambiguity of meaning: Some questions have several meanings, eliciting answers on a variety of topics. Besides, some questions do not clearly define their subject.

1.4 Contributions of the thesis

The previous work: Our team has three papers at the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021) and at the 2021 International Conference on Knowledge and Systems Engineering (KSE 2021), with the Runner-up Student Paper Award. The details of the related publications are in the List of Publications section.

Contributions of the thesis: This thesis proposes ontology-based improvements that advance significantly over the previous model. The main contributions are:

• Ontology construction and application scenario in expanding the question: The proposed ontology construction process has three main steps: independent ontology construction, ontology integration, and ontology population. One expansion approach based on signal transfer is implemented for the expanded keywords. The integrated ontology provides diverse related terms such as synonyms, related chemicals, related genes, and related symptoms. The integrated ontology enables quick access and can be used for other biomedical tasks.

• Ontology-based improvements for the summarization model: The ontology is leveraged to extend biological knowledge, and WordNet is used to enhance common-sense knowledge. Some customized methods are proposed based on TF-IDF, the keyword-based score, and the weighted-relaxed word mover's distance score.

• The model's applicability and performance: The model can be used for practical problems in health question answering systems. The proposed model performs better than the results of the other teams in the BioNLP workshop 2021. Additionally, a summarization model is built for the websites of five of the world's leading independent biotechnology companies.

Structure of the thesis: The thesis includes four main chapters, as follows:

Chapter 1 Introduction: This chapter introduces the summarization problem and the query-based summarization problem. Additionally, the motivation discusses the urgency of the thesis topic.

Chapter 2 Related work: This chapter presents related work on two main problems: summarization and ontology construction. For summarization, the thesis focuses on two related approaches: extractive summarization and query-based summarization. For ontology construction, the thesis discusses some ontologies and some ontology integration methods.

Chapter 3 Proposed model: This chapter presents the baseline model and its problems, which motivate ontology construction. Several strategies are used to build a unified ontology by extracting information from databases and term definitions. Finally, some customized scoring methods based on the ontology are proposed.

Chapter 4 Experiments and Results: This chapter describes the implementation of the models and the hyper-parameter settings. After that, the results of ontology construction are discussed. The model's performance is presented through the final results and the contribution of each component. An application to the websites of five of the world's leading independent biotechnology companies is also shown, with a few changes to handle big data. Finally, some errors are analysed for better insight into the model.

Conclusions: This chapter concludes the thesis by summarizing the important contributions and results. Besides, it presents some possible extensions for future work.


Chapter 2

Related work

For a model in a health question answering summarization system, the query-based multi-document extractive summarization approach is studied. Section 2.1 discusses related work on extractive summarization approaches and query-based approaches. Section 2.2 discusses biomedical ontologies and some approaches for ontology construction.

2.1 Summarization approach

Figure 2.1: Summarization approaches
[Figure labels: Summarization approach, divided into Extractive Approach, Query-based Approach, and Question-driven improvements]

Extractive summarization approach: The extractive approach has been studied for many years. The frequency-based method is the earliest; it estimates the importance of components based on the frequency of words or sentences. Term Frequency - Inverse Document Frequency (TF-IDF) helps select vital words and remove noise such as stop-words. Newer methods combine TF-IDF with other methods, such as TF-IDF with K-Means [9].

Text objects often have a graph structure: a sentence has a subject, predicate, and object, and a noun phrase can include many nouns. This is why graph-based scoring is one of the standard approaches. Inspired by PageRank, Google's website ranking algorithm, LexRank and TextRank build a graph from sentences and words respectively and use degree centrality as the ranking score [8].

Many studies have applied machine learning methods to extractive summarization, from Naive Bayes, decision trees, and Support Vector Machines to deep learning models [5]. The performance of such models depends heavily on the size and nature of the training data set.

Query-based summarization approach: Query-based summarization is well suited to question answering systems. Understanding the question is a complex problem because users can ask a question with the same meaning in many different ways.

Some question-driven scoring methods are used to estimate the relation between the question and the components of the answer. Weighted-relaxed word mover's distance (wRWMD) calculates the shortest distance between the question and the answer's sentences [17]. The HSO method filters important sentences that are tightly related to the question and to other sentences [19]. Query-based scoring and keyword-based scoring are proposed to estimate the contextual relation and the ratio of matched keywords in the question keyword set [Pub2].

Besides scoring based on knowledge internal to the sentences, external databases can be used to improve the understanding of the question. Wikipedia is a sizeable unstructured database in which keywords can be searched to detect related keywords [15]. ConceptNet and WordNet provide pre-existing relationships between words [19]. For an extended biological sense, the Medical Subject Headings database can be used in a query expansion method [10].

Supervised methods focus on learning the relationship between questions and answers through machine learning or deep learning models. For example, a model can concatenate the question and sentence vectors and then use a single-layer neural network to obtain the distribution over positive and negative classes [22].


Applied in proposed model: Some methods are used or customised in the baseline model and the proposed model:

• Scoring methods: Some scoring methods are proposed based on TF-IDF, for detecting important sentences by scoring their words; LexRank, for calculating sentence degree centrality; and wRWMD, the query-based score, and the keyword-based score, for estimating the relation between the question and the answer's sentences.

• Multi-answer summarization: Maximal Marginal Relevance (MMR) [2] is a diversity-based re-ranking method built on similarities and can be used to remove redundancy in the summaries.
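To make the MMR re-ranking idea concrete, the following is a minimal sketch of greedy MMR selection in Python. It is not the thesis implementation: the function names, the trade-off parameter `lam`, the budget `k`, and the use of cosine similarity over plain vectors are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(relevance, vectors, k, lam=0.7):
    """Greedily pick k sentence indices, trading relevance against redundancy.

    relevance: one relevance score per sentence (assumed precomputed).
    vectors:   one embedding per sentence (assumed precomputed).
    lam:       hypothetical trade-off weight; 1.0 ignores redundancy.
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate sentences and one distinct sentence, a lower `lam` makes the second pick skip the duplicate even though its relevance is higher.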

2.2 Ontology Construction Approach

Ontologies have proved to be the most effective way for humans and machines to communicate and share information [16]. In this thesis, the ontology helps extract related terms in the medical field, which WordNet and ConceptNet lack. There are many ontologies in the biomedical domain. Disease Ontology¹ is built for human diseases and includes terms, phenotype characteristics, and related medical vocabulary concepts. Gene Ontology² is the biggest gene database. However, there are no ontologies built for summarization purposes, and existing ontologies focus on only one aspect of health, such as chemicals, diseases, genes, or symptoms. This makes them challenging to apply to summarization because user questions cover many aspects.

Ontology Integration: Ontology integration is the process of unifying existing ontologies to create a general or comprehensive ontology [16]. Ontology integration is a massive problem because biomedical knowledge requires a lot of expert experience and knowledge. Most appropriate methods use semi-automatic techniques that depend on human support. OMEN is a probabilistic mapping tool that creates mappings with rules and then estimates mapping paths using conditional probability tables [12]. In 2020, CoMerger was published with a clustering approach [1]. CoMerger groups related concepts across ontologies into partitions and merges first within and then across those partitions. CoMerger allows merging more than two ontologies at a time.

¹ https://disease-ontology.org/

² http://geneontology.org/


Ontology Population: Ontology population is the process of adding new concepts, semantic relations, and rules to an existing ontology and placing them in the correct position in the ontology [18]. Content ontology design patterns are a framework for semi-automatic pattern-based ontology population [4]. The definition term-based algorithm searches term definitions to extract relations between nodes [13]. Besides, learning relations with deep learning methods has also received attention. Some existing databases, such as the Comparative Toxicogenomics Database (CTD), can be used for training such models.

Applied in proposed model: In the ontology construction process, the following databases and methods are used:

• Databases: The Medical Subject Headings (MeSH) database is used as the backbone and is then integrated with the Mondo Disease Ontology (Mondo). The Comparative Toxicogenomics Database (CTD) and the Symptom Ontology (SYMP) are used for extracting biological relationships. Additionally, Open English WordNet is used for enhancing common-sense knowledge.

• Algorithms: The definition term-based algorithm [13] is used for extracting chemical-disease and symptom-disease relations.


Chapter 3

Proposed model

Section 3.1 presents the baseline model. The ontology-based improvement approach is then proposed to solve the baseline model's problems in two sections: Section 3.2 covers ontology construction, and Section 3.3 covers using the ontology to improve the baseline model.

3.1 Summarization baseline model

The baseline model is based on the Prosper-thy-neighbour model [Pub 2]. While it includes several methods that improve performance, it also has problems understanding the question because related keywords, such as synonyms and biomedical relationships, are not taken into account. The baseline model has three main parts: pre-processing, single-answer extractive summarization, and multi-answer extractive summarization. Figure 3.1 shows the baseline model, with the improved parts highlighted in gray.

3.1.1 Pre-processing

The pre-processing phase takes the raw data, consisting of the question and its multiple answers, as input. First, answers are segmented into sentences. Next, the normalization method removes noise (HTML tags, duplicate spacing, etc.) from the raw answers and question. Keywords, which include tokens and named entities (NERs), are extracted for the summarization phase. spaCy¹ is the primary pre-processing tool. The scispaCy package is used instead of the standard spaCy models to process biomedical text efficiently. Because spaCy's segmentation makes some mistakes in paragraphs that lack punctuation, a rule-based sentence boundary detection module is added to the pre-processing phase [21].
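The normalization and segmentation steps can be illustrated with a small standard-library sketch. It is only a stand-in under stated assumptions: the thesis uses spaCy/scispaCy together with a rule-based boundary detector [21], whereas the regular expressions and function names below are hypothetical simplifications.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")      # crude HTML tag matcher (assumption)
SPACE_RE = re.compile(r"\s+")        # any run of whitespace
# Split after ., ! or ? when the next sentence starts with a capital letter.
SENT_END_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def normalize(text):
    """Remove HTML tags, decode entities, and collapse duplicate spacing."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return SPACE_RE.sub(" ", text).strip()


def split_sentences(text):
    """Very small rule-based sentence splitter for normalized text."""
    return [part for part in SENT_END_RE.split(text) if part]
```

A rule this simple will mis-split abbreviations such as "Dr. Smith", which is one reason a dedicated biomedical pipeline is preferable in practice.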

¹ spaCy: An industrial-strength NLP system in Python: https://spacy.io


Figure 3.1: Summarization baseline model
Note: Improved parts are highlighted in gray.

Finally, BioBERT is used to create word embedding vectors and sentence embedding vectors from text. BioBERT is a version of BERT specialized for biomedical text mining. The model uses the large BioBERT model, which is pre-trained on PubMed.

3.1.2 Single-answer extractive summarization

After pre-processing, three scoring strategies are used for sentence scoring and ranking: frequency-based scores, graph-based scores, and question-driven scores. Each scoring method takes sentences and returns a score per sentence.

Frequency-based score: The Term Frequency - Inverse Document Frequency (TF-IDF) method calculates a word's score in a corpus. TF-IDF scores are calculated from the TF and IDF scores with two boosting approaches: boosting the TF-IDF score of keywords, and filtering out low-scoring words with a fixed threshold.

[Figure 3.1 labels, multi-answer stage: Merging single summaries; Re-calculate sentence scores; Sentence Filtering; Maximal Marginal Relevance; Post-processing]

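Graph-based scores: LexRank is used to detect the essential sentences in an answer based on the document's graph. In LexRank, the nodes of the graph represent sentences, and each weighted edge carries the similarity score of a pair of nodes. The score of a sentence is calculated with Formula 3.1, which expresses the centrality of the sentence in the answer:

$$p(u) = \frac{d}{N} + (1 - d) \sum_{v \in adj[u]} \frac{w(u, v)}{\sum_{z \in adj[v]} w(z, v)}\, p(v) \qquad (3.1)$$

where N is the number of sentences, d is the damping factor, adj[u] is the set of sentences adjacent to u, and w(u, v) is the similarity weight of the edge between u and v.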
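The LexRank centrality can be approximated by power iteration. The sketch below assumes the standard continuous LexRank update, a symmetric similarity matrix given as nested lists, and an illustrative damping factor of 0.15; the baseline's actual similarity measure and parameters are not specified here.

```python
def lexrank(sim, d=0.15, tol=1e-8, max_iter=200):
    """Power iteration for p(u) = d/N + (1 - d) * sum_v w(u,v)/deg(v) * p(v).

    sim: symmetric N x N list of non-negative sentence similarities.
    Returns one centrality score per sentence.
    """
    n = len(sim)
    degree = [sum(row) for row in sim]   # total edge weight of each node
    p = [1.0 / n] * n                    # start from the uniform distribution
    for _ in range(max_iter):
        new_p = []
        for u in range(n):
            acc = sum(
                sim[u][v] / degree[v] * p[v]
                for v in range(n)
                if degree[v] > 0
            )
            new_p.append(d / n + (1 - d) * acc)
        delta = sum(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:                  # converged
            break
    return p
```

Sentences that are similar to many other sentences accumulate more score mass and therefore rank as more central.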

Question-driven scores: A query-based score and a keyword-based score are used to select sentences related to the question. The query-based score is the similarity between the question vector and the sentence vector, as described in Formula 3.2:

$$query(s) = sim(\vec{q}, \vec{s}) \qquad (3.2)$$

where s is the sentence, q is the question, and sim is the cosine similarity between the question and sentence vectors.

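The keyword-based score uses the longest common subsequence (lcs) to calculate the ratio of matched keywords per sentence with Formula 3.3:

$$kw(s) = \frac{\left|\{\, k : k \in q \mid \max_{i \in s} lcs(k, i) \ge thres \,\}\right|}{|q|} \qquad (3.3)$$

where s is the set of sentence keywords, q is the set of question keywords, and thres is the matching threshold.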
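Both question-driven scores can be sketched in a few lines. Two points are assumptions of this example rather than statements about the baseline: the longest common subsequence length is normalized by the longer string so that it is comparable with a threshold in [0, 1], and the ratio is taken over the number of question keywords. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def keyword_score(sentence_kws, question_kws, thres=0.8):
    """Share of question keywords matched by at least one sentence keyword."""
    if not question_kws:
        return 0.0
    matched = 0
    for k in question_kws:
        # Assumed normalization: LCS length divided by the longer string.
        best = max(
            (lcs_len(k, i) / max(len(k), len(i)) for i in sentence_kws if i),
            default=0.0,
        )
        if best >= thres:
            matched += 1
    return matched / len(question_kws)
```

The subsequence match tolerates small surface differences, so a question keyword such as "neoplasm" still matches the sentence keyword "neoplasms".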

Scores combination: After the sentence scores are calculated by each algorithm, all scores are normalized with min-max normalization. The final score is calculated as a weighted sum, as in Formula 3.4:

$$score(s) = \sum_{t} w_t \cdot score_t(s) \qquad (3.4)$$

where t is one of four scores (TF-IDF, LexRank, query-based scoring, and keyword-based scoring) and w_t is the weight of score t.
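The normalization and weighted combination can be sketched as follows. The dictionary layout and the weight values in the usage note are illustrative assumptions; the weights used by the baseline are not given in this section.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine(score_table, weights):
    """Weighted sum of min-max normalized scores.

    score_table: {score name: [score per sentence]}
    weights:     {score name: weight}
    """
    normalized = {name: min_max(vals) for name, vals in score_table.items()}
    n = len(next(iter(normalized.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in normalized)
        for i in range(n)
    ]
```

For example, `combine({"tfidf": [1.0, 3.0, 2.0], "lexrank": [0.2, 0.1, 0.4]}, {"tfidf": 0.5, "lexrank": 0.5})` ranks the third sentence highest, because it is strong on both scores after normalization.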


Prosper-thy-neighbour strategies: Some adjacent sentences are structured in a deductive manner (e.g., a stated sentence is followed by additional explanatory sentences) or in an inductive manner (e.g., the last sentence summarizes the preceding sentences). As a result, answers frequently contain runs of consecutive sentences that form a cluster within the paragraph. Center boosting increases the scores of sentences that are close to high-scoring sentences [Pub 2]. Let score_i be the score of the i-th sentence. The final score final_i of the i-th sentence is updated by the following formula:

$$final_i = \max_{j = \ldots}^{\min(i+R-1,\, n)} \ldots$$

where n is the number of sentences, and L and R are the numbers of sentences that influence the current sentence i in the two directions, left and right. With center boosting, a main sentence significantly influences the scores of the sentences around it.
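One possible reading of center boosting is sketched below. Only the maximum over a window bounded by min(i+R-1, n) is taken from the text; the symmetric lower bound, the geometric `decay` applied to neighbouring scores, and all parameter names are assumptions of this example and may differ from the formula in [Pub 2].

```python
def center_boost(scores, left=2, right=2, decay=0.5):
    """Raise each score to the best decayed score found in its window.

    Assumed window for sentence i (0-based): [i - left + 1, i + right - 1],
    clipped to the answer. Assumed neighbour term: score_j * decay**|i - j|.
    """
    if left < 1 or right < 1:
        raise ValueError("left and right must be at least 1")
    n = len(scores)
    boosted = []
    for i in range(n):
        lo = max(i - left + 1, 0)
        hi = min(i + right - 1, n - 1)
        boosted.append(
            max(scores[j] * decay ** abs(i - j) for j in range(lo, hi + 1))
        )
    return boosted
```

Under these assumptions, the scores `[0.1, 1.0, 0.1, 0.1]` become `[0.5, 1.0, 0.5, 0.1]`: the two neighbours of the high-scoring sentence are lifted, while the distant last sentence is unchanged.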

Generate single-answer extractive summarization: After calculating and boosting the sentence scores, the model ranks the sentences and keeps the top-ranked ones. It then restores the original order of the selected sentences and concatenates them to create the single-answer extractive summary.
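The ranking, filtering, and position-restoring step can be sketched as follows. The cut-off `top_k` is a hypothetical parameter; the section does not state how many sentences the baseline keeps.

```python
def extract_summary(sentences, scores, top_k=3):
    """Keep the top_k highest-scoring sentences in their original order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_k])  # restore the original positions
    return " ".join(sentences[i] for i in chosen)
```

Restoring the original order keeps deductive and inductive sentence clusters readable, since an explanation still follows the statement it explains.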

3.1.3 Multi-answer extractive summarization

After the single-answer summarization phase, all single-answer summaries are concatenated. The sentences are scored again with all score types from Section 3.1.2. The scores are then combined, and the sentences are ranked and filtered to create a summary, as in the single-answer summarization phase.

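Maximal Marginal Relevance (MMR): To solve the problem of duplicated ideas in multi-document summaries, a sentence filtering method called Maximal Marginal Relevance (MMR) is added to the model with Formula 3.6:

$$MMR = \arg\max_{D_i \in R \setminus S} \left[ \lambda\, Sim_1(D_i, Q) - (1 - \lambda) \max_{D_j \in S} Sim_2(D_i, D_j) \right] \qquad (3.6)$$

where Q is the question, R is the set of candidate sentences, S is the set of already selected sentences, and λ balances relevance to the question against redundancy with the selected sentences.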
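A greedy MMR selection loop can be sketched as below, assuming the standard MMR criterion (relevance to the question minus the maximum similarity to already selected sentences). The precomputed similarity inputs and the default value of `lam` are assumptions of this example.

```python
def mmr_select(query_sim, pair_sim, k, lam=0.7):
    """Greedily pick k sentence indices by Maximal Marginal Relevance.

    query_sim: similarity of each sentence to the question.
    pair_sim:  matrix of sentence-to-sentence similarities.
    lam:       trade-off between relevance (1.0) and diversity (0.0).
    """
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:

        def mmr(i):
            # Redundancy is the closest match among selected sentences.
            redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

When two sentences are near-duplicates, the second one is penalized once the first is selected, so a less relevant but novel sentence can be chosen instead.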

Post-processing: This step removes noise from the summary, such as questionable sentences, example sentences, long sentences, and duplicated information.


3.2.1 Motivation

The baseline model includes many approaches that increase performance based on the question's keywords. However, related keywords, such as synonyms and terms linked by biomedical relationships, are not taken into account. In a biomedical dataset, there are several types of related terms:

• Synonyms: Keywords often have several different names that refer to the same topic. For example, the words “Cancer” and “Neoplasm” both mean cancer.

• Phenomena and symptoms: For diseases, the writer can also use phenomena and symptoms to name the disease. For example, the words “Tumor”, “Malignancy”, and “Benign Neoplasms” refer to cancer.

• Keyword variations: Keywords may vary when included in a particular sentence, either to comply with grammar or because of a spelling mistake by the writer. For example, “Malignant Neoplasm” can be represented in several other forms such as “Malignancy”, “Malignancies”, “Malignant Neoplasms”, “Neoplasm, Malignant”, and “Neoplasms, Malignant”.

• Biomedical relations: Related words can be inferred from biomedical relations. For example, “Bone Cysts” is a type of cancer, so it may be relevant to sentences that mention cancer.

Building an ontology for keyword expansion addresses the above problems. The ontology captures the medical relationships between terms. As a result, the model can expand the query's keywords to understand the question more clearly and focus on the related paragraphs.
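The idea of keyword expansion can be illustrated with a toy lookup table. The table below is a hypothetical stand-in for the ontology, filled only with the example terms mentioned above; the real ontology and its lookup interface are not shown in this section.

```python
# Hypothetical stand-in for the ontology: keyword -> related terms.
RELATED_TERMS = {
    "cancer": {
        "neoplasm",
        "tumor",
        "malignancy",
        "malignancies",
        "malignant neoplasm",
        "malignant neoplasms",
    },
}


def expand_keywords(keywords, related=RELATED_TERMS):
    """Return the lower-cased keywords plus their related terms."""
    expanded = set()
    for kw in keywords:
        key = kw.lower()
        expanded.add(key)
        expanded |= related.get(key, set())
    return expanded
```

A question keyword such as "Cancer" then also matches sentences that only mention "tumor" or "malignancy", while keywords without an entry are passed through unchanged.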

Title	An Ontology Based Improvement for Multi Answer Summarization in Consumer Health Question Answering System
Author	Nguyen Quoc An
Supervisors	Assoc.Prof. Tran Trong Hieu, MSc. Can Duy Cat
University	Vietnam National University, Hanoi University of Engineering and Technology
Major	Computer Science
Type	Graduation thesis
Year of publication	2022
City	Hanoi

Format
Pages	55
File size	1.1 MB