Nguyen Quoc An
AN ONTOLOGY-BASED IMPROVEMENT FOR MULTI-ANSWER SUMMARIZATION IN CONSUMER HEALTH QUESTION ANSWERING SYSTEM
GRADUATION THESIS Major: Computer Science
HA NOI - 2022
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
Supervisors: Assoc.Prof Tran Trong Hieu
MSc Can Duy Cat
Abstract
Automatic question answering (QA) systems help users quickly resolve everyday questions. During the COVID-19 pandemic, healthcare has been one of the topics users care about most. In the era of information explosion, distilling helpful information from QA system responses takes time. The multi-answer summarization problem is studied to address this. A model for this task takes the customer’s question and all answers as input and returns a summary, which has been shown to aid information absorption.
This thesis focuses on the extractive summarization problem and presents some ontology-based improvements to the baseline multi-answer summarization model in the consumer health question answering system, with two main sub-tasks: ontology construction and building an extractive multi-answer summarization model. The ontology construction task focuses on building an ontology, which is leveraged to extend biological knowledge such as related terms, chemicals, diseases, and symptoms. Additionally, WordNet is used to enhance common-sense knowledge. In the summarization phase, some sentence scoring methods are proposed to use the extended keywords. Compared to the baseline, the improved model performs better by a large margin. As a result, the proposed model outperforms current state-of-the-art comparatives with 0.511 ROUGE-2 F1. An application model is built for creating a question-answering summarization model from the websites of five of the world’s leading independent biotechnology companies in Japan.
Keywords: multi-answer summarization, extractive summarization, query-based summarization, ontology construction, ROUGE
Acknowledgements
I want to thank my supervisors, Assoc. Prof. Tran Trong Hieu and MSc. Can Duy Cat. They always had insightful comments both on my work and on this thesis. Their dedication has given me more motivation to complete the thesis in the best way.
Furthermore, I am very thankful to Dr. Le Hoang Quynh and the members of the Data Science and Knowledge Technology Laboratory at the VNU University of Engineering and Technology. We had many discussion meetings, and their comments will help me improve myself and become more mature in the future.
Finally, my deep thanks go to my family, relatives, and friends, who are always with me during the most challenging times and always encourage me in life and at work.
Although I have tried my best to complete this report, it will undoubtedly contain some errors. I sincerely welcome the understanding and guidance of the teachers and professors.
Declaration
I declare that the thesis has been composed by myself and that the work has not been submitted for any other degree or professional qualification. I confirm that the work submitted is my own, except where work which has formed part of jointly-authored publications has been included. My contribution and those of the other authors to this work have been explicitly indicated below. I confirm that appropriate credit has been given within this thesis where reference has been made to the work of others.
I certify that, to the best of my knowledge, my thesis does not infringe upon anyone’s copyright nor violate any proprietary rights and that any ideas, techniques, quotations, or any other material from the work of other people included in my thesis, published or otherwise, are fully acknowledged in accordance with the standard referencing practices.
I take full responsibility for my commitments and accept all prescribed disciplinary actions. I declare that this thesis has not been submitted for a higher degree to any other University or Institution.
Student
Nguyen Quoc An
Table of Contents
Abstract
Acknowledgements
Declaration
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Difficulties and Challenges
1.4 Contributions of the thesis
2 Related work
2.1 Summarization approach
2.2 Ontology Construction Approach
3 Proposed model
3.1 Summarization baseline model
3.1.1 Pre-processing
3.1.2 Single-answer extractive summarization
3.1.3 Multi-answer extractive summarization
3.2 Ontology Construction
3.2.1 Motivation
3.2.2 Overview of proposed ontology construction
3.2.3 Biomedical databases
3.2.4 Independence Ontology Construction
3.2.5 Ontologies Integration
3.2.6 Ontology Population
3.3 Apply Ontology-based Improvements to Summarization model
3.3.1 Baseline Model Improvements
3.3.2 Question’s Keyword Expanding
3.3.3 Customised scoring methods
4 Experiments and Results
4.1 Implementation and Configurations
4.2 Dataset and Evaluation methods
4.2.1 Metrics and Evaluation
4.3 Experimental results
4.3.1 Ontology Construction
4.3.2 Summarization Experiments
4.3.3 Errors Analysis
4.4 Application on medical website
4.4.1 System overview
4.4.2 Application’s result
Conclusions
List of Publications
References
List of Figures
1.1 The evolution of MEDLINE citations between 1986 and 2019
1.2 Typical tasks / competitions in the field of natural language processing for biomedical data
1.3 Classification of Text Summarization Approaches
1.4 Multi-Answer Summarization pipeline
2.1 Summarization approaches
3.1 Summarization baseline model
3.2 Overview of proposed ontology construction
3.3 CTD disease-chemical relations
3.4 Proposed summarization model overview
3.5 Ontology expanding method
3.6 WordNet expanding method
4.1 The statistic of nodes and terms in three independent ontologies
4.2 The statistic of nodes and terms in three integrated ontology
4.3 The reduction of ROUGE-2 F1 per each scoring method when replacing the proposed weighted score with the before version
4.4 Ablation test results for various components
List of Tables
1.1 An example summary of the responses to a question in the medical question answering system (MEDIQA)
3.1 MeSH’s topic category list
4.1 Configurations and parameters of proposed model
4.2 The statistics of extract summary in datasets
4.3 The statistic of relations and terms in ontology population
4.4 Comparison model’s results of the MEDIQA 2021 Task 2 - Extractive Summarization
4.5 Examples of some errors in test set
4.6 Five biotechnology companies’ websites in Japan
Chapter 1
Introduction
This chapter presents the motivation and the urgency of the thesis topic in Section 1.1. The summarization problem and the query-based summarization problem are discussed in Section 1.2.
1.1 Motivation
Many experts and leaders have identified data as an invaluable asset in the era of information explosion. For example, Clive Humby, a British mathematician and entrepreneur in the field of data science, said “Data is the new oil”. Indeed, exploiting data effectively will bring great value. Biomedical text mining is a topic of increasing interest in the research community. For example, the expansion of MEDLINE¹, one of the largest and most well-known biomedical online databases in the world, is depicted in Figure 1.1 [20]. The number of citations grew from 1 million in 1970 to 13.5 million in 2005, and then nearly doubled in 14 years to 26.2 million in 2019.
However, in this age of information overload, the overabundance of data has made it difficult for humans to absorb. In that context, automatic question-answering systems have been built. For example, a question-answering system can help users get information about treatments for common symptoms of COVID-19 from reliable data, which allows them to handle infection situations more scientifically and easily.
¹ The US National Library of Medicine’s biomedical database.
Figure 1.1: The evolution of MEDLINE citations between 1986 and 2019. The vertical axis represents the number of citations (in millions). For a clearer representation, the statistics from before 2005 are reported every 5 years.
Nowadays, several automatic health question answering systems and search engines are available, such as PubMed, CHiQA, and Google. Although the answers returned by these engines have been selected, independent answers from different sources still overlap. For instance, for the question “How long have SARS-CoV-2 existed?”, PubMed provides about 1000 long answers, and Google returns 5,070,000,000 responses.
The idea is to use a summary engine to summarize all the responses into a short paragraph. The summary gathers all of the necessary information and eliminates duplicates. Therefore, users can read one paragraph instead of a massive number of documents. This thesis focuses on the summarization model in the health question-answering system. However, question answering and summarization are the two most demanding tasks for biomedical text, according to experts (Figure 1.2 [6]).
Realizing the potential of biomedical summarization, a number of competitions have been launched in recent years to support research and development in this field. The BioNLP workshop series, which is co-hosted by the ACL SIGBIOMED specialized research community, has grown into an exceptional yearly event for researchers to present their research ideas in the field of natural language processing for biological and medical data (BioNLP). In 2021, the BioNLP workshop hosted the shared task MEDIQA 2021: Summarization in the Medical Domain⁶, consisting of three separate tasks. My team chose the Summarization of Multiple Answers task, which is similar to the summary engine in a question-answering system. Our team won second prize (in extractive summarization) and third prize (in abstractive summarization) in this contest. Besides, our team won second prize in the student scientific research competition at my university and has four papers about summarization.
Figure 1.2: Typical tasks / competitions in the field of natural language processing for biomedical data
After participating in this competition, the error analysis indicated that the model only focuses on the terms mentioned in the question. Meanwhile, related terms such as synonyms, related chemicals, and related diseases also have a certain degree of importance. This is the main reason for this thesis to continue research on question-driven improvements. This thesis proposes some ontology-based improvements that significantly improve on the previous model.
⁶ https://sites.google.com/view/mediqa2021
1.2 Problem Statement
Text summarization aims to select or generate important information from the original text(s) to create a short version [7]. Humans often read all documents to develop understanding, and then write a summary highlighting the main points. Because machines lack human experience and understanding, generating a text summary is exceedingly difficult and time-consuming for them.
Based on the different characteristics of the summary paragraphs, text summarization can be classified in many different ways, as shown in Figure 1.3 [3].
Figure 1.3: Classification of Text Summarization Approaches
• According to the input document(s): single-document summarization and multi-document summarization. The difference is that single-document summarization focuses on a single text, while multi-document summarization uses multiple documents as input.
• According to the summary usage: generic and query-based. The generic approach does not focus on a specific topic or aspect; it gives an overview of the sources. In the query-based approach, the result focuses on the user’s question.
• According to techniques: supervised and unsupervised. Unsupervised approaches are based on algorithms that do not depend on human support, such as labelled training datasets. These models are suitable for big data, such as website data. Supervised learning methods are based on a sentence-level classification approach in which the model learns to distinguish summary sentences from non-summary sentences.
• According to output characteristics: extractive summarization and abstractive summarization. The extractive method extracts the most important sentences from the documents, and the summary is then made by combining them. As a result, every sentence in the summary belongs to the original document. In contrast, the abstractive approach tries to rewrite the summary based on the original sentences.
Formal definition. According to the Multi-Answer Summarization task requirements⁷, different answers can bring complementary perspectives that are likely to benefit the users of QA systems. The purpose of this task is to build a multi-answer summarization model that can summarize numerous relevant answers to a medical question. The input to the model is the customer’s question $Q$ and all answers $A = \{A_1, A_2, \dots, A_n\}$. The output is a summary that answers the given question (Figure 1.4). Table 1.1 shows an example of a resulting summary.
Figure 1.4: Multi-Answer Summarization pipeline
Thesis scope. In this work, the model focuses on the query-based multi-document extractive summarization approach. In terms of the classification above, the model has four properties: multi-document, query-based, unsupervised, and extractive. The extractive approach has many advantages, such as (i) quick summarization time, (ii) low cost of hardware resources, and (iii) easy management of summary quality. Besides, compressing multiple replies into a single answer saves time and effort for users. Summarizing based on the user’s question also makes the approach highly applicable.
⁷ https://www.aicrowd.com/challenges/mediqa-2021/problems/mediqa-2021-multi-answer-summarization-mas
Table 1.1: An example summary of the responses to a question in the medical question answering system (MEDIQA)
to expose the spine. On your side, if you are having surgery on your lower back. The surgeon will use tools called retractors to gently separate, hold the soft tissues and blood vessels apart, and have room to work. A synthetic bone substitute is used. With a cut on the front of the neck, toward the side. The surgeon will use a graft (such as bone) to hold (or fuse) the bones together permanently. There are several ways of fusing vertebrae together. Strips of bone graft material may be placed over the back part of the spine. Bone graft material may be placed between the vertebrae. Special cages may be placed between the vertebrae. These cages are packed with bone graft material. The surgeon may get the bone graft from different places. From another part of your body (usually around your pelvic bone). This is called an autograft. Your surgeon will make a small cut over your hip and remove some bone from the back of the rim of the pelvis. From a bone bank. This is called an allograft. A synthetic bone substitute can also be used. The vertebrae may also fixed together with rods, screws, plates, or cages. They are used to
3 to 4 hours
A bone graft can be taken from the person’s own healthy bone (this is called an au-
eral anesthesia). During surgery, the surgeon makes a cut over the bone defect. The bone graft can be taken from areas close to the bone defect or more commonly from the pelvis. The bone graft is shaped and inserted into and around the area. The bone graft can be held in place with pins, plates, or screws.
Extractive summary
A bone graft can be taken from the person’s own healthy bone (this is called an au-
with rods, screws, plates, or cages They are used to keep the vertebrae from moving until the bone grafts are fully healed.
1.3 Difficulties and Challenges
Figure 1.2 shows that biomedical summarization is a challenging problem. The challenges encountered are text processing in the biomedical domain, multi-answer summarization challenges, and query-based summarization challenges.
Text processing problems in the biomedical domain. Understanding biomedical text requires much human experience and knowledge because biomedical texts are scientific texts containing many biomedical terms and acronyms. In addition, natural language processing tools in the biomedical field are still limited. Some related problems are:
• Lack of standard nomenclature: A biomedical term can have multiple variations. Variations can be caused by a lack of consistency in naming or by a writer’s mistake. Therefore, it is necessary to retrieve term variations to support the summarization model.
• Extensive use of unknown words: New biomedical terms may not yet be covered by pre-processing models. As a result, pre-processing models label them as unknown words.
• Complex relationships between biomedical terms: Some questions require relations such as chemical-disease, disease-gene, and disease-symptom. Identifying biological relationships plays an important role in the summarization problem.
Summarization challenges. Compared with single-answer summarization, the multi-answer summarization model faces the following problems:
• Diversity of answer length: Short and long answers can both be included in the model’s input, resulting in unbalanced contributions to the summary. As a result, each answer contributes a different number of selected sentences, and predicting the number of selected sentences per answer is a problem.
• Overlap and inconsistency: Different authors write the answers, so the vocabulary and style of the answers also differ. Besides, the essential parts of the answers may overlap in meaning.
Query-based summarization challenges. Rather than retaining the text’s main idea, the summary must include all pertinent and accurate information needed to answer the question.
• Relationship between question and answers: Answer keywords can relate to question keywords through semantic relationships, such as the biological relations mentioned above.
• Question presentation: The user can ask the same question in a variety of ways. Therefore, it is necessary to identify the context and important keywords to determine the correct relevant sentences.
• Ambiguity of meaning: Some questions have several meanings, eliciting answers on a variety of topics. Besides, some questions do not clearly define the subject.
1.4 Contributions of the thesis
The previous work. Our team has three papers at the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021) and at the 2021 International Conference on Knowledge and Systems Engineering (KSE 2021), with the Runner-up Student Paper Award. The details of related publications are in the List of Publications section.
Contributions of the thesis. This thesis proposes some ontology-based improvements that significantly improve on the previous model. The main contributions are:
• Ontology construction and application scenario in expanding the question: The proposed ontology construction process has three main steps: independent ontology construction, ontology integration, and ontology population. One expansion approach based on signal transfer is implemented for the expanded keywords. The integrated ontology provides diverse related terms such as synonyms, related chemicals, related genes, and related symptoms. It enables quick access and can be used for other biomedical tasks.
• Ontology-based improvements for the summarization model: The ontology is leveraged to extend biological knowledge, and WordNet is used to enhance common-sense knowledge. Some customized scoring methods are proposed based on TF-IDF, the keyword-based score, and the weighted-relaxed word mover’s distance score.
• The model’s applicability and performance: The model can be used for practical problems in the health question-answering system. The proposed model performs better than the results of other teams in the BioNLP workshop 2021. Additionally, one summarization model is built for the websites of five of the world’s leading independent biotechnology companies.
Structure of the thesis. The thesis includes four main chapters, as follows:
Chapter 1, Introduction: This chapter introduces the summarization problem and the query-based summarization problem. Additionally, the motivation section discusses the urgency of the thesis topic.
Chapter 2, Related work: This chapter presents related work on two main problems: summarization and ontology construction. For summarization, the thesis focuses on two related approaches: extractive summarization and query-based summarization. For ontology construction, the thesis discusses some ontologies and some ontology integration methods.
Chapter 3, Proposed model: This chapter presents the baseline model and its problems, which motivate the ontology construction. Some strategies are used to build a unified ontology by extracting data from several databases and from term definitions. Finally, some customized scoring methods based on the ontology are proposed.
Chapter 4, Experiments and Results: This chapter describes the implementation of the models and the hyper-parameter settings. After that, the results of ontology construction are discussed. The model’s performance is presented with the final results and the contribution of each component. Also, an application to the websites of five of the world’s leading independent biotechnology companies is shown, with a few changes to handle big data. Finally, some errors are analysed for better insight into the model.
Conclusions: This chapter concludes the thesis by summarizing the important contributions and results. Besides, it presents some further extensions for future work.
Chapter 2
Related work
For the model in a health question answering summarization system, the query-based multi-document extractive summarization approach is studied. Section 2.1 discusses related work on extractive summarization approaches and query-based approaches. Section 2.2 discusses biomedical ontologies and some approaches for ontology construction.
2.1 Summarization approach
Figure 2.1: Summarization approaches
Extractive summarization approach. The extractive approach has been studied for many years. The frequency-based method is the earliest; it estimates the importance of components based on the frequency of words or sentences. Term Frequency - Inverse Document Frequency (TF-IDF) supports selecting important words and removing noise such as stop-words. Newer methods combine TF-IDF with other methods, such as TF-IDF with K-Means [9].
Text often has a graph structure: for example, a sentence has a subject, a predicate, and an object, and a noun phrase can include many nouns. This is why graph-based scoring is one of the standard approaches. Inspired by PageRank, a website ranking algorithm by Google, LexRank and TextRank build a graph from sentences and words respectively and use degree centrality as the score for ranking [8].
Many studies have applied machine learning methods to extractive summarization, from Naive Bayes, decision trees, and support vector machines to deep learning models [5]. The model’s performance is highly dependent on the size and nature of the training dataset.
Query-based summarization approach. Query-based summarization is well suited to question-answering systems. Understanding the question is a complex problem because users can ask a question with the same meaning in many different ways.
Some question-driven scoring methods are used to estimate the relation between the question and the components of the answer. The weighted-relaxed word mover’s distance (wRWMD) calculates the shortest distance between the question and the answer’s sentences [17]. The HSO method is used for filtering important sentences that have a tight relation to the question and to other sentences [19]. Query-based scoring and keyword-based scoring are proposed to estimate the context relation and the keyword ratio in the question keyword set [Pub2].
Besides scoring based on knowledge internal to the sentences, some databases can be used to improve the understanding of the question. Wikipedia is a sizeable unstructured database in which keywords can be searched in articles to detect related keywords [15]. ConceptNet and WordNet provide pre-existing relationships between words [19]. For extended biological knowledge, the Medical Subject Headings database can be used in a query expansion method [10].
Supervised methods focus on learning the relationship between questions and answers through machine learning or deep learning models. For example, a model can concatenate the question and sentence vectors and then use a single-layer neural network to obtain the distribution over the positive and negative classes [22].
Applied in proposed model. Some methods are used or customised in the baseline model and the proposed model:
• Scoring methods: Some scoring methods are proposed based on TF-IDF, for detecting important sentences by calculating the scores of their words; LexRank, for calculating sentence degree centrality; and wRWMD, the query-based score, and the keyword-based score, for estimating the relation between the question and the answer’s sentences.
• Multi-answer summarization: Maximal Marginal Relevance (MMR) [2] is a diversity-based re-ranking method based on similarities and can be used to remove redundancy in the summaries.
2.2 Ontology Construction Approach
Ontologies have proved to be the most effective way for humans and machines to communicate and share information [16]. In this thesis, the ontology helps extract related terms in the medical field, which are lacking in WordNet and ConceptNet. There are many ontologies in the biomedical domain. Disease Ontology¹ is built for human diseases and includes terms, phenotype characteristics, and related medical vocabulary concepts. Gene Ontology² is the biggest gene database. However, there are no ontologies built for summarization purposes, and existing ontologies focus on only one aspect of health, such as chemicals, diseases, genes, or symptoms. That makes them challenging to apply to summarization because user questions are diverse in many aspects.
Ontology integration. Ontology integration is the process of unifying existing ontologies to create a general or comprehensive ontology [16]. It is a difficult problem because biomedical knowledge requires a lot of expert experience and knowledge. Most appropriate methods use semi-automatic techniques that depend on human support. OMEN is a probabilistic mapping tool that creates mappings with rules and then estimates the mapping path based on conditional probability tables [12]. In 2020, CoMerger was published with a clustering approach [1]. CoMerger groups related concepts across ontologies into partitions and merges first within and then across those partitions. CoMerger allows the merging of more than two ontologies at a time.
¹ https://disease-ontology.org/
² http://geneontology.org/
Ontology population. Ontology population is the process of adding new concepts, semantic relations, and rules to an existing ontology and setting them in the correct position in the ontology [18]. Content ontology design patterns are a framework for semi-automatic pattern-based ontology population [4]. The definition term-based algorithm searches term definitions to extract relations between nodes [13]. Besides, learning relations with deep learning methods has also received attention. Some existing databases, such as the Comparative Toxicogenomics Database (CTD), can be used for training models.
Applied in proposed model. In the ontology construction process, some databases and methods are used:
• Databases: The Medical Subject Headings (MeSH) database is used as the backbone and is then integrated with the Mondo Disease Ontology (Mondo). The Comparative Toxicogenomics Database (CTD) and the Symptom Ontology (SYMP) are used for extracting biological relationships. Additionally, Open English WordNet is used for enhancing common-sense knowledge.
• Algorithms: The definition term-based algorithm [13] is used for extracting chemical-disease relations and symptom-disease relations.
Chapter 3
Proposed model
Section 3.1 presents the baseline model. The ontology-based improvement approach is then proposed to solve the baseline model’s problems in two sections: Section 3.2 covers ontology construction, and Section 3.3 covers using the ontology to improve the baseline model.
3.1 Summarization baseline model
The baseline model is based on the Prosper-thy-neighbour model [Pub 2]. While it includes some methods proposed for higher performance, it also has problems in understanding the question, because related keywords such as synonyms and biomedical relationships are not considered. The baseline model has three main parts: pre-processing, single-answer extractive summarization, and multi-answer extractive summarization. Figure 3.1 shows the baseline model, where the improved parts are highlighted in gray.
3.1.1 Pre-processing
The pre-processing phase takes the raw data, that is, the question and its multiple answers, as input. Firstly, answers are segmented into sentences. Next, the normalization method removes noise (HTML tags, duplicate spacing, etc.) from the raw answers and question. Keywords, which include tokens and named entities (NERs), are extracted for the summarization phase. spaCy¹ is the primary pre-processing tool. The scispaCy package is used instead of the standard spaCy models to process biomedical text efficiently. Because spaCy’s segmentation makes some mistakes in paragraphs that lack punctuation, a rule-based sentence boundary detection module is added to the pre-processing phase [21].
¹ spaCy: An industrial-strength NLP system in Python: https://spacy.io
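The normalization and sentence segmentation steps can be illustrated with a minimal sketch. It uses only the Python standard library as a stand-in for the spaCy/scispaCy pipeline and the rule-based boundary detector [21]; the function names `normalize_text` and `split_sentences` and the splitting rule are assumptions made for illustration, not the thesis implementation.

```python
import html
import re
from typing import List


def normalize_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse repeated whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)   # replace each HTML tag with a space
    text = html.unescape(text)            # decode entities such as &amp;
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Naive rule (assumed): split after ., ! or ? when followed by
    whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part for part in parts if part]
```

A real biomedical pipeline needs far more careful boundary rules (abbreviations, dosages, list items); the sketch only shows where normalization and segmentation sit in the flow.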
Figure 3.1: Summarization baseline model (improved parts are highlighted in gray)
Finally, BioBERT is used for creating word embedding vectors and sentence embedding vectors from text. BioBERT is a version of BERT specialized for biomedical text mining. The model uses the large version of BioBERT, which is pre-trained on PubMed.
3.1.2 Single-answer extractive summarization
After pre-processing, three scoring strategies are used for sentence scoring and ranking: the frequency-based score, graph-based scores, and question-driven scores. Each scoring method takes the sentences and returns each sentence’s score.
Frequency-based score. The Term Frequency - Inverse Document Frequency (TF-IDF) method calculates a word’s score in a corpus. TF-IDF scores are calculated from the TF and IDF scores with two adjustments: boosting the TF-IDF scores of keywords and filtering out low word scores with a fixed threshold.
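As a rough illustration of the frequency-based score, the sketch below sums per-word TF-IDF weights for each sentence, multiplies the weight of question keywords by a boost factor, and drops weights below a threshold. Treating each sentence as a document, the `boost` and `threshold` defaults, and the function name are all assumptions; the thesis does not specify these details here.

```python
import math
from collections import Counter
from typing import List, Set


def tfidf_sentence_scores(sentences: List[List[str]], keywords: Set[str],
                          boost: float = 2.0, threshold: float = 0.0) -> List[float]:
    """Score each tokenized sentence by the sum of its word TF-IDF weights."""
    n = len(sentences)
    # Document frequency: in how many sentences each token appears.
    doc_freq = Counter(tok for sent in sentences for tok in set(sent))
    scores = []
    for sent in sentences:
        total = 0.0
        for tok, count in Counter(sent).items():
            weight = (count / len(sent)) * math.log(n / doc_freq[tok])
            if tok in keywords:
                weight *= boost          # boost question keywords (assumed factor)
            if weight >= threshold:      # filter out low word scores
                total += weight
        scores.append(total)
    return scores
```

With this weighting, a word that occurs in every sentence gets an IDF of zero and therefore contributes nothing, which is the usual stop-word damping effect of TF-IDF.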
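Graph-based scores. LexRank is used for detecting essential sentences in the answer based on the document’s graph. In LexRank, the graph’s nodes represent sentences, and a weighted edge carries the similarity score of a node pair. The score of a sentence is calculated with Formula 3.1, which represents the centrality of the sentence in the answer:
$$p(u) = \frac{d}{N} + (1 - d) \sum_{v \in adj[u]} \frac{sim(u, v)}{\sum_{z \in adj[v]} sim(z, v)}\, p(v) \qquad (3.1)$$
where $N$ is the number of sentences, $d$ is the damping factor, $adj[u]$ is the set of sentences adjacent to $u$, and $sim$ is the similarity between two sentences.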
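The LexRank score can be computed by power iteration. The sketch below assumes the standard continuous LexRank update, $p(u) = d/N + (1-d)\sum_{v} \frac{sim(u,v)}{\sum_{z} sim(z,v)}\,p(v)$, applied to a precomputed symmetric similarity matrix; the damping value, iteration count, and function name are illustrative choices rather than the configuration used in the thesis.

```python
from typing import List


def lexrank(sim: List[List[float]], d: float = 0.15, iters: int = 100) -> List[float]:
    """Power iteration for continuous LexRank on a sentence similarity matrix."""
    n = len(sim)
    # Total similarity mass of each neighbour v, excluding self-similarity.
    totals = [sum(sim[z][v] for z in range(n) if z != v) for v in range(n)]
    p = [1.0 / n] * n                    # start from a uniform distribution
    for _ in range(iters):
        p = [
            d / n + (1 - d) * sum(
                sim[u][v] / totals[v] * p[v]
                for v in range(n) if v != u and totals[v] > 0
            )
            for u in range(n)
        ]
    return p
```

Sentences that are similar to many other well-connected sentences receive a higher score, which matches the intuition of centrality within an answer.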
Question-driven scores. The query-based score and the keyword-based score are used for selecting sentences related to the question. The query-based score calculates the similarity between the question vector and the sentence vector, as described in Formula 3.2:
$$qb(s) = sim(\mathbf{v}_q, \mathbf{v}_s) \qquad (3.2)$$
where $s$ is the sentence, $\mathbf{v}_q$ and $\mathbf{v}_s$ are the question and sentence vectors, and $sim$ is the cosine similarity between them.
The keyword-based score uses the longest common subsequence (LCS) to calculate the keyword ratio per sentence with Formula 3.3:
$$kw(s) = \frac{\left|\{k : k \in q \mid \max_{i \in s} lcs(k, i) \ge thres\}\right|}{|q|} \qquad (3.3)$$
where $s$ is the set of sentence keywords and $q$ is the set of question keywords.
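A minimal sketch of the keyword-based score follows. It counts the question keywords whose best LCS match against any sentence keyword reaches a threshold and divides by the number of question keywords. Normalizing the LCS length by the longer of the two strings, the threshold value of 0.8, and the function names are assumptions for illustration; the thesis does not state how `lcs` and `thres` are scaled.

```python
from typing import List


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def keyword_score(sentence_keywords: List[str], question_keywords: List[str],
                  thres: float = 0.8) -> float:
    """Fraction of question keywords matched by some sentence keyword."""
    if not question_keywords:
        return 0.0
    matched = 0
    for k in question_keywords:
        best = max(
            (lcs_length(k.lower(), i.lower()) / max(len(k), len(i), 1)
             for i in sentence_keywords),
            default=0.0,
        )
        if best >= thres:                # assumed normalized-LCS threshold
            matched += 1
    return matched / len(question_keywords)
```

Using a subsequence match rather than exact equality lets inflected forms such as "neoplasm" and "neoplasms" count as the same keyword.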
Scores combination. After the sentence scores are calculated by each algorithm, all scores are normalized by min-max normalization. The final score is calculated as a weighted sum, as in Formula 3.4:
$$score(s) = \sum_{t} w_t \cdot score_t(s) \qquad (3.4)$$
where $t$ is one of four scores: TF-IDF, LexRank, query-based scoring, and keyword-based scoring, and $w_t$ is the weight of score $t$.
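The combination step can be sketched as min-max normalization of each score list followed by a weighted sum per sentence. The score names and weight values in the usage note are hypothetical; the thesis lists its actual configuration in Chapter 4.

```python
from typing import Dict, List


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combine_scores(score_table: Dict[str, List[float]],
                   weights: Dict[str, float]) -> List[float]:
    """Weighted sum of min-max normalized scores, one value per sentence."""
    normalized = {name: min_max(vals) for name, vals in score_table.items()}
    n = len(next(iter(normalized.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in normalized)
        for i in range(n)
    ]
```

For example, `combine_scores({"tfidf": [1, 3, 2], "lexrank": [0.2, 0.2, 0.6]}, {"tfidf": 0.5, "lexrank": 0.5})` weighs both signals equally, so a sentence must do well on both to rank first.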
Prosper-thy-neighbour strategies. Some adjacent sentences are structured in a deductive manner (e.g., following a stated sentence, additional explanatory sentences are given) or an inductive manner (e.g., the last sentence summarizes the preceding sentences). As a result, answers frequently contain consecutive sentences that form a cluster in the paragraph. Center boosting increases the scores of sentences close to high-scoring sentences [Pub 2]. Let $score_i$ be the score of the $i$-th sentence. The final score $final_i$ of the $i$-th sentence is updated by the following formula:
$$final_i = \max_{j = \dots}^{\min(i+R-1,\,n)} \dots \qquad (3.5)$$
where $n$ is the number of sentences, and $L$ and $R$ are the numbers of sentences that impact the current sentence $i$ in the two directions, left and right. With center boosting, the main sentence significantly impacts the sentences around it.
Generate single-answer extractive summarization. After calculating and boosting sentence scores, the model ranks the sentences and keeps the top ones. After that, the model restores the original order of the selected sentences and concatenates them to create the single-answer extractive summary.
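The rank, filter, and restore-order steps can be sketched in a few lines. The function name and the fixed `k` cut-off are assumptions; the thesis may instead filter by a score threshold or a length budget.

```python
from typing import List


def select_top_sentences(sentences: List[str], scores: List[float], k: int) -> str:
    """Keep the k best-scoring sentences and emit them in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])          # restore original sentence positions
    return " ".join(sentences[i] for i in chosen)
```

Restoring the original order matters for readability: the selected sentences keep the deductive or inductive flow they had in the source answer.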
3.1.3 Multi-answer extractive summarization
After the single-answer summarization phase, all single-answer summaries are concatenated. The sentences are scored again with all the score types in Section 3.1.2. Then, the scores are combined, and the sentences are ranked and filtered to create a summary, as in the single-answer summarization phase.
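Maximal Marginal Relevance (MMR). To solve the problem of duplicate ideas in multi-document summaries, a sentence filtering method called Maximal Marginal Relevance (MMR) is added to the model with Formula 3.6:
$$MMR = \arg\max_{s_i \in R \setminus S} \left[ \lambda \, sim_1(s_i, q) - (1 - \lambda) \max_{s_j \in S} sim_2(s_i, s_j) \right] \qquad (3.6)$$
where $R$ is the set of candidate sentences, $S$ is the set of already selected sentences, $q$ is the question, $sim_1$ and $sim_2$ are similarity measures, and $\lambda$ balances relevance against redundancy.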
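A greedy MMR selection loop is sketched below, assuming the standard criterion of Carbonell and Goldstein [2]: at each step, pick the candidate that maximizes $\lambda$ times its relevance to the question minus $(1-\lambda)$ times its maximum similarity to the sentences already selected. Cosine similarity over plain vectors, the default $\lambda$, and the function names are illustrative assumptions.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mmr_select(candidates: List[List[float]], query: List[float],
               k: int, lam: float = 0.7) -> List[int]:
    """Greedily pick k candidate indices, trading relevance for diversity."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr(i: int) -> float:
            relevance = cosine(candidates[i], query)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With $\lambda = 1$ the loop reduces to ranking by relevance alone; lowering $\lambda$ increasingly penalizes sentences that repeat what has already been selected.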
Post-processing. This step removes noise from the summary, such as question sentences, example sentences, long sentences, and duplicated information.
3.2 Ontology Construction
3.2.1 Motivation
The baseline model has many proposed approaches to increase performance based on the question’s keywords. However, related keywords such as synonyms and biomedical relationships are not considered. In a biomedical dataset, there are several types of related terms:
• Synonyms: Keywords often have many different names that refer to the same topic. For example, the words “Cancer” and “Neoplasm” both mean cancer.
• Phenomena and symptoms: For diseases, the writer can also use phenomena and symptoms to name the disease. For example, the words “Tumor”, “Malignancy”, and “Benign Neoplasms” refer to cancer.
• Keyword variations: Keywords may vary when included in a particular sentence, to comply with grammar or because of a spelling mistake by the writer. For example, “Malignant Neoplasm” can be represented in several other forms such as “Malignancy”, “Malignancies”, “Malignant Neoplasms”, “Neoplasm, Malignant”, and “Neoplasms, Malignant”.
• Biomedical relations: Related words can be inferred from biomedical relations. For example, “Bone Cysts” is a type of cancer, so it may be related to sentences that mention cancer.
Building an ontology for keyword expansion is necessary to address the above problems. The ontology shows the medical relationships between terms. As a result, the model can expand the query’s keywords to understand the question more clearly and focus on related paragraphs.
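The idea of expanding question keywords through an ontology can be sketched with a toy lookup table. The dictionary below is a hypothetical stand-in built only from the examples in this section; its structure, the relation names, and the `expand_keywords` function are assumptions for illustration and do not reflect the integrated ontology constructed in the thesis.

```python
from typing import Dict, List

# Hypothetical toy ontology using the examples given in this section.
TOY_ONTOLOGY: Dict[str, Dict[str, List[str]]] = {
    "cancer": {
        "synonyms": ["neoplasm"],
        "symptoms": ["tumor", "malignancy"],
        "relations": ["bone cysts"],
    },
}


def expand_keywords(keywords: List[str],
                    ontology: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Return lowercased keywords plus related terms, without duplicates."""
    expanded: List[str] = []
    seen = set()
    for keyword in keywords:
        key = keyword.lower()
        terms = [key]
        for related in ontology.get(key, {}).values():
            terms.extend(related)        # add synonyms, symptoms, relations
        for term in terms:
            if term not in seen:         # keep first occurrence, preserve order
                seen.add(term)
                expanded.append(term)
    return expanded
```

The expanded list can then be passed to the keyword-based scoring in place of the raw question keywords, so that a sentence mentioning "neoplasm" is still matched for a question about cancer.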