Advanced Machine Learning Models and Applications in Scientific Multi-Document Summarization


DOCUMENT INFORMATION

Basic information

Title: Advanced machine learning models and applications in scientific multi-document summarization
Author: Nguyen Quoc An
Supervisor: Dr. Le Hoang Quynh
Institution: Vietnam National University, Hanoi University of Engineering and Technology
Major: Information Technology
Document type: Thesis
Year: 2024
City: Ha Noi
Format:
Pages: 69
Size: 2.39 MB


Contents

  • 1.1 Motivation
  • 1.2 Problem Definition
  • 1.3 Difficulties and Challenges
  • 1.4 Research Contributions
  • 1.5 Structure of the Thesis
  • 2.1 Citation-based Summarization
    • 2.1.1 Direct Citation-based Summarization
    • 2.1.2 Cited Text Span-based Summarization
    • 2.1.3 Applied in Proposed Model
  • 2.2 Graph-based Summarization
    • 2.2.1 Unsupervised Graph-based Summarization
    • 2.2.2 Deep Learning-based Summarization
    • 2.2.3 Graph Neural Network-based Summarization
    • 2.2.4 Applied in Proposed Model
  • 3.1 Model Overview
  • 3.2 Pre-processing Phase
  • 3.3 Important Sentence Selection Phase
    • 3.3.1 Cited Sentence Prediction
    • 3.3.2 Adjacent Sentence Extraction
  • 3.4 Summary Generation Phase
    • 3.4.1 Large Language Models
    • 3.4.2 Combination
  • 4.1 Model Overview
  • 4.2 Pre-processing Phase
  • 4.3 Heterogeneous Graph Construction Phase
    • 4.3.1 Graph Architecture Construction
    • 4.3.2 Graph Node Initialization
    • 4.3.3 Graph Edge Initialization
  • 4.4 Heterogeneous Graph Encoder Phase
    • 4.4.1 Graph Attention Network
    • 4.4.2 Contrastive Learning Loss
  • 4.5 Important Score Estimation Phase
    • 4.5.1 Important Score Estimation
    • 4.5.2 Summary Loss
  • 4.6 Training and Inference Process
  • 5.1 Implementation and Configurations
    • 5.1.1 Model Implementation
    • 5.1.2 Training and Testing Environment
  • 5.2 Dataset and Evaluation Methods
    • 5.2.1 Dataset
    • 5.2.2 Evaluation Methods
  • 5.3 Model Results and Comparisons
  • 5.4 Contribution of Proposed Components
    • 5.4.1 Citation-guided Summarization Model
    • 5.4.2 Citation-integrated Summarization Model
  • 5.5 Errors Analysis
List of Figures

  • 1.1 Overview of the input and output of summarization model
  • 1.2 Classification of Text Summarization Approaches
  • 3.1 Overview of Citation-guided Summarization model
  • 3.2 Two combination strategies in Citation-guided Summarization model
  • 4.1 Overview of Citation-integrated Summarization Model
  • 4.2 The architecture of the heterogeneous graph
  • 4.3 An example of an adjacency matrix initialization
  • 5.1 The change in performance when components are ablated or replaced
  • 5.2 The experiment results in different sections with/without the use of citations
  • 5.3 The experiments with components in the Heterogeneous Graph Encoder
  • 5.4 Experiment with the Summary Inference Process
  • 5.5 ROUGE-2 F1 score analysis between CiS and CgS models

List of Tables

  • 1.1 Example of input and corresponding golden summary
  • 2.1 Heterogeneous graph-based methods
  • 5.1 The statistics of the dataset
  • 5.2 The performance of comparative models
  • 5.3 The experiment results in different graph structure
  • 5.4 Evaluate the result of some samples in test set
  • 5.5 Compare the results from the two models on sample W06-3114

Content


Motivation

In today's rapidly advancing scientific landscape, the volume of scientific articles doubles approximately every nine years, creating a substantial challenge for researchers due to the limited capacity of human information processing. To address this issue, effective tools are essential for managing the vast amount of knowledge. The summarization model plays a crucial role by allowing researchers to efficiently screen and analyze scientific texts, condensing the most important information into a concise summary.

The abstract in scientific documents serves as an initial overview of the study but has notable limitations that hinder its effectiveness as a comprehensive summary. It may be influenced by the authors' biases, often highlighting positive outcomes while downplaying limitations. Additionally, some publications lack a detailed abstract, which restricts readers' understanding of the research's full scope and implications. Furthermore, abstracts can become outdated, failing to represent the research's relevance in the current context. Thus, a thorough summary is crucial for providing readers with more objective information about the article.

Scientific documents are linked through citation networks, allowing one document to reference another. Citation sentences contain crucial information about the evaluated input document, selected by humans for their conciseness and objectivity. As a result, these citations offer valuable insights for summarization models.

This thesis investigates summarization techniques that utilize information from diverse scientific documents. The primary objective is to create detailed summaries that reflect both the author's viewpoint and insights from the broader research community. These summaries incorporate essential content from the original document along with analyses from related articles, ensuring a comprehensive understanding of the subject matter.

The research community has shown significant interest in this topic, exemplified by the CL-SciSumm Shared Tasks, which have been conducted for six years and feature numerous competing teams from across the globe.

Problem Definition

In Natural Language Processing (NLP), the Elementary Discourse Unit (EDU) represents a smaller unit of text. The scientific field has distinct characteristics that contribute to the definitions of frequently used terms.

The Elementary Discourse Unit (EDU) serves as a fundamental component in discourse analysis, representing the smallest meaningful element, such as a clause or simple sentence. EDUs offer finer granularity compared to full sentences, making them a superior choice for summarization. For instance, the sentence about Mark Wahlberg starring in a film can be segmented into two EDUs: "Boston native Mark Wahlberg will star in a film about the Boston Marathon bombing and the subsequent manhunt" and "Deadline reported Wednesday." Additionally, the sentence regarding Wahlberg's film, titled "Patriots' Day," can be divided into three EDUs: "Wahlberg's film," "titled Patriots' Day," and "is being produced by CBS Films."

Citations in scientific articles serve to reference other works for purposes such as providing evidence, supporting claims, and acknowledging prior research. A citation network illustrates the relationships between academic documents through these citations, where each node represents a document and each edge signifies that one document cites another.

A citation span is a text segment that explicitly references other research articles, playing a crucial role in academic writing by acknowledging previous contributions. It guides readers to the original sources for further exploration and reflects the evolving perspectives of the research community regarding the cited work.

A cited text span refers to specific portions of a source document that are referenced by a citation. For instance, in Table 1.1, Citation 1 originates from the paper "Measuring Contextual Fitness Using Error Contexts Extracted from the Wikipedia Revision History." This citation points to the input document, highlighting the cited text span: "The focus of our work is on the use of contextual role knowledge for coreference resolution."

Text summarization is the process of selecting or generating essential information from original texts to produce a concise version. The goal is to develop a summarization model capable of creating summaries from scientific documents and extracting citations from related works. An overview of the model's input and output for summarization is depicted in Figure 1.1.

Given an input document $D$ and its list of citations $\{c_k\}_{k=1}^{n_c}$, the document $D$ and each citation $c_k$ can be composed of smaller text units as follows:

$$D = \{t_i^D\}_{i=1}^{n_D} \quad (1.1)$$

$$c_k = \{t_i^{c_k}\}_{i=1}^{n_{c_k}} \quad (1.2)$$

where $t_i^D$ and $t_i^{c_k}$ are text units, and $n_D$ and $n_{c_k}$ are the numbers of text units in $D$ and $c_k$, respectively. The model creates a summary $S$ from $D$ and $\{c_k\}_{k=1}^{n_c}$.
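For concreteness, a minimal sketch of this input/output contract in Python (the class and function names are illustrative, not from the thesis):

```python
from dataclasses import dataclass

@dataclass
class Document:
    """A document as an ordered list of text units {t_i} (sentences or EDUs)."""
    units: list[str]

@dataclass
class SummarizationInput:
    document: Document         # the target document D (Eq. 1.1)
    citations: list[Document]  # the citation sentences {c_k} (Eq. 1.2)

def summarize(inp: SummarizationInput) -> str:
    """The model maps (D, {c_k}) to a summary S; concrete implementations
    are the subject of Chapters 3 and 4."""
    raise NotImplementedError
```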

Figure 1.1: Overview of the input and output of summarization model

Table 1.1 illustrates an example of input and expected output, where the input includes a scientific document along with its citations. The output is a detailed summary produced by experts, showcasing the distinctive features of scientific texts.

The scientific document follows a traditional structure, including sections such as abstract, introduction, method, results, and conclusion. However, the section names have been tailored to enhance clarity, with "Learning Contextual Role Knowledge" for the method and "Evaluation Results" for the results. This customization aids in understanding the paper's objectives but may complicate the model's ability to recognize a consistent flow.

The input includes citations from various sources that reference the target document and related scientific texts (Citation 1, 12). These citations are concise and effectively summarize the key points of the input text. Additionally, since most scientific texts are published in PDF format, converting them using Optical Character Recognition (OCR) can lead to certain types of errors (as noted in Citation 7).

The summary synthesizes key information from multiple sources to deliver a comprehensive overview of the input text, highlighting essential details for clarity and coherence.

Table 1.1: Example of input and corresponding golden summary

Unsupervised Learning of Contextual Role Knowledge for Coreference Resolution

Introducing BABAR, a coreference resolver that leverages contextual role knowledge to assess potential antecedents for anaphors. By utilizing information extraction patterns, BABAR effectively identifies contextual roles and establishes four distinct contextual role categories, enhancing its ability to process and understand language.

Coreference resolution has been a significant area of research, with various theoretical discourse models proposed (Grosz et al., 1995; Grosz and Sidner, 1998). This study emphasizes the application of contextual role knowledge to enhance coreference resolution. Our findings indicate that BABAR performs well across different domains, with contextual role knowledge notably improving accuracy, particularly for pronouns.

This section outlines the representation and learning of contextual role knowledge. However, utilizing the top-level semantic classes of WordNet has presented challenges due to the overly broad distinctions between classes. In response to this issue, BABAR adopts a conservative approach.

We assessed the performance of BABAR in two key areas: terrorism and natural disasters, utilizing the MUC4 terrorism corpus and relevant news articles from the Reuters text collection. The results showed an improvement in the F-measure score for both domains, indicating a significant enhancement in recall, albeit with a slight reduction in precision.

The goal of our research was to explore the use of contextual role knowledge for coreference ( )

Measuring the contextual fitness of a term is essential for various NLP applications, including speech recognition, optical character recognition, and co-reference resolution.

Difficulties and Challenges

Summarization is a complex challenge in Natural Language Processing, particularly when dealing with scientific texts that have unique characteristics. The development of effective summarization models faces numerous difficulties, including fundamental natural language issues and various specific challenges.

Integrating diverse viewpoints from multiple sources can be challenging when summarizing information, as each source may present different perspectives. To create a cohesive and accurate summary, it is essential to account for potential biases and ensure the summary remains balanced and objective. Summarization models must navigate these complexities while preserving the integrity of each source to produce informative summaries.

Natural language is complex and often ambiguous, as a single idea can be conveyed in various ways. This ambiguity presents a challenge for summarization models in pinpointing essential information. Additionally, sentences can enhance the meaning of surrounding sentences, contributing to a cohesive context. Consequently, it is crucial for models to have a profound understanding of linguistic subtleties and context to generate clear and precise summaries.

Effectively managing redundancy is crucial in summarization models, as it involves recognizing the significance of various pieces of information. The challenge lies in identifying and removing repetitive content while still covering all essential points. Scientific texts often include reiterations of concepts to highlight their importance or clarify complex ideas. Therefore, models must be skilled at detecting and addressing redundancy to create concise summaries that communicate vital information without unnecessary repetition.

Scientific texts possess unique features, including connections to related documents via citation networks and intricate structures. These characteristics can present challenges for summarization models.

Citation texts offer updated discussions about an article but often include references to other papers, which can introduce irrelevant information. This excess of unrelated content may cause models to produce noisy summaries. Thus, it is essential for the model to effectively discern relevant information in citation texts that relates specifically to the target document.

Long scientific documents, especially survey articles, can be significantly longer than typical texts, which complicates the summarization process. The increased length introduces diverse aspects that challenge models in maintaining coherence. Thus, effectively managing long inputs is essential for creating accurate and informative summaries that capture the essence of the original content.

Scientific documents are structured hierarchically, featuring sections like introduction, methodology, results, and discussion that collectively support a central argument. These sections interact to reinforce key points, provide context, and compare findings. Each section comprises smaller components, such as sentences and elementary discourse units (EDUs), along with various relationships, including discourse and coreference relations. The complexity of these relationships presents an ongoing challenge in modeling the connections between text intervals within a document.

Scientific documents exhibit logical complexity, featuring intricate arguments, hypotheses, and reasoning that unfold across various sections. To maintain coherence and accurately represent the original argument, summarization models must effectively track and summarize these detailed logical progressions.

Research Contributions

The primary objective is to create detailed summaries that incorporate both the author's viewpoint and insights from the research community, utilizing information from various data sources to enrich the model's knowledge. By leveraging the document's structure and the interconnections among its components, the model's performance can be significantly improved. This thesis introduces two innovative models: Citation-guided Summarization and Citation-integrated Summarization, which contribute to the field in three key ways.

This thesis presents two innovative methods for creating a detailed summary of scientific documents and their citation networks. These approaches highlight the importance of utilizing various data sources for effective scientific summarization, demonstrating how citations can offer valuable insights and current perspectives from the research community.

This thesis introduces a Deep Neural Network for Citation-guided Summarization, which selects key sentences based on citation sentences to generate a coherent summary. In Citation-integrated Summarization, it integrates citations with structural components of documents using heterogeneous graphs. A novel heterogeneous graph architecture is proposed for the summarization task, along with a contrastive loss function to improve representation learning, enabling better differentiation between important and less important nodes.

Experimental results on the CL-SciSumm dataset show that the proposed approaches outperform existing models, achieving remarkable performance. The analysis highlights the significant effectiveness of these models' contributions.

Structure of the Thesis

My thesis is structured into five main chapters and a conclusion, focusing on the introduction of the problem, a review of related research, proposed solutions, and an analysis of the results.

Chapter 1 introduces the motivation for the thesis topic, emphasizing its significance and context. It articulates the problem statement while illustrating the key challenges and complexities involved in addressing the issue. The chapter concludes by outlining our contributions and the overall structure of the thesis.

• Chapter 2: Related Work. This section presents and analyzes in detail the related research on two main research directions: leveraging external knowledge from citations and leveraging internal knowledge from the document.

Chapter 3 focuses on the Citation-guided Summarization model, providing a comprehensive overview of its structure and functionality. It outlines the key phases of the model, including Pre-processing, Important Sentence Selection, and Summary Generation, ensuring a clear understanding of the summarization process.

Chapter 4 focuses on the Citation-integrated Summarization model, providing a comprehensive overview of its structure and functionality. It details each phase of the model, including Structure Normalization, Heterogeneous Graph Construction, Heterogeneous Graph Encoder, and Important Score Estimation, to enhance understanding of the summarization process.

Chapter 5 delves into the implementation of the models, showcasing their performance through final results and the contributions of various components. Additionally, it highlights specific errors to enhance understanding of the model's functionality.

• Conclusions. This chapter concludes the thesis by summarizing the key contributions and results. Additionally, it highlights the limitations of our models and identifies potential extensions for future work.

Recent advancements in technology have led to the development of various models that enhance performance in summarization tasks. This section highlights key research in summarization, emphasizing two main approaches: utilizing external knowledge from citations and harnessing internal knowledge from the document itself.

Citation-based Summarization

Direct Citation-based Summarization

Early studies proposed using citations to create summaries, as they provide concise, human-rewritten information about a paper. However, citation sentences often blend discussions of the input document with other sources, leading to an excess of irrelevant information.

In 2008, a pioneering study introduced C-LexRank, a model designed to generate summaries from citations using graph-based techniques. In this model, citation sentences serve as vertices, while the weighted edges indicate the semantic relatedness between them. By constructing this graph, C-LexRank effectively clusters the vertices and identifies the most significant sentences from each cluster for summarization.

Mohammad et al. conducted a survey on diverse summarization methods, including linguistically-based rules, LexRank, and C-LexRank. They found that Trimmer compressions, which utilize linguistically driven rules to obscure syntactic elements of a parsed text, demonstrate a superior ability to remove irrelevant information. This effectiveness highlights the advantages of using linguistically-based rules in summarization tasks.

Abu-Jbara et al. introduced an architecture that utilizes the functional categories of citation sentences to enhance document summarization. Their model classifies citations into five distinct categories: background, problem statement, method, results, and limitations, ensuring a comprehensive representation of the input document. To streamline citation usage, the model further categorizes citations into three types: "keep" to retain the citation, "remove" to delete it, and "replace" to substitute it with an appropriate pronoun. Additionally, they employed LexRank to generate the final summary effectively.

Citation-based summarization models have shown encouraging outcomes, focusing on graph-based methodologies. While efforts to eliminate irrelevant information have been explored, rule-based approaches continue to face notable limitations.

Cited Text Span-based Summarization

Recent studies suggest that using cited text spans can enhance summary generation. A cited text span consists of specific sections from a source document that correspond to the citation. Unlike citations, which may include extraneous information, cited text spans are directly relevant. However, accurately identifying these spans poses a challenge, as citations do not always clearly indicate a specific section of the document.

Initial studies in summarization have utilized unsupervised methods, focusing on the importance and frequency of words to select key sentences. For instance, Qazvinian et al. developed a technique that identifies sentences containing the highest number of significant key phrases extracted from citations. They employed n-grams to locate these phrases within the input document, ultimately selecting sentences that maximize the presence of these key phrases. Conversely, other approaches have integrated clustering and sentence selection models, such as CitationAS, which operates in two stages: first, clustering similar citation sentences, and then labeling and merging these clusters using Word2Vec and WordNet. The clusters are ranked by size, and important sentences are extracted based on TF-IDF and Maximum Marginal Relevance (MMR) to create the final summary. Similarly, Agrawal et al. proposed a straightforward system that employs a k-nearest neighbor classifier and a bootstrapping method to categorize citations into three groups, followed by the construction of a knowledge graph to summarize the input document.
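As an illustration of the selection step used in CitationAS-style pipelines, here is a minimal MMR sketch over precomputed sentence vectors (the function name, cosine scoring, and the lambda weighting are assumptions; the thesis does not specify the exact formulation):

```python
import numpy as np

def mmr_select(doc_vec, sent_vecs, k=5, lam=0.7):
    """Maximal Marginal Relevance: trade off relevance to the document
    against redundancy with already-selected sentences."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    selected, candidates = [], list(range(len(sent_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cos(sent_vecs[i], doc_vec)
            redundancy = max((cos(sent_vecs[i], sent_vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of chosen sentences, in selection order
```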

Advancements in traditional machine learning have led to the integration of these methods into new architectures for text summarization. Various machine learning techniques have been utilized to predict cited text and generate summaries. The authors introduced multiple features from both source documents and citations to assess whether a sentence should be included in the summary. Recognizing the significance of features in machine learning models, Abura'ed et al. proposed additional features to improve the model's learning capabilities. They subsequently applied a linear regression-based approach to evaluate the importance of each sentence, ultimately selecting the most significant ones to form the final summary.

Deep learning methods have enhanced model learning by minimizing reliance on expert-extracted features. Lauscher et al. introduced a ranking model that assesses sentence importance based on similarity to citation sentences, utilizing features like lexical and semantic similarity scores. Important sentences were selected through a learning-to-rank model, with adjacent sentences providing additional context. Support Vector Machines (SVM) and Convolutional Neural Networks (CNN) were then used for sentence classification, leading to summaries generated via clustering algorithms and TextRank. Yasunaga et al. proposed a Graph Convolutional Network (GCN) for summarization, initially predicting cited sentences using a TF-IDF-based method, followed by representing text and importance scores with the GCN. They concluded with a greedy heuristic method to select key sentences for the final summary.

Language models are highly effective for generating coherent summaries by utilizing knowledge from pre-training datasets to grasp contextual relationships. Mei et al. approached the summarization task as an information retrieval problem, introducing a query method that employs a unigram language model to produce words that accurately represent the content of citations.

Recent advancements in general summarization tasks have led to the development of several effective language models. Notably, Large Language Models (LLMs) have demonstrated their capability in generating summaries through zero-shot learning. My team and I conducted experiments utilizing LLMs for summarization, yielding promising outcomes. In 2023, LLMs were also employed to summarize relevant content from citations, where a retrieval model was first used to identify cited text spans, followed by the application of prompt-based instruction-tuned LLMs like LLaMA and GPT-4 for summary generation.

The research community has shown significant interest in span-based summarization of scientific texts, particularly through the CL-SciSumm Shared Tasks, which have been ongoing for six years and involve numerous competing teams globally. The shared task focuses on three primary objectives: identifying cited text spans, labeling their corresponding facets, and generating a concise summary from these spans, limited to 250 words.

Applied in Proposed Model

Some ideas and methods have been employed or customized in the proposed models in this thesis. The specific details are as follows:

The Citation-integrated Summarization model leverages citation information within a heterogeneous graph to enhance summarization. By directly incorporating citations, the model forges connections between internal and external data in the input document, thereby improving its ability to identify key information efficiently.

The Citation-guided Summarization model aims to remove irrelevant information from citations by leveraging insights from related studies. It identifies cited sentences based on citation sentences and selects the top k cited sentences for summarization. Furthermore, the model explores two summarization strategies: Fusion-based and Augmentation-based summarization, as utilized in prior research.

Graph-based Summarization

Unsupervised Graph-based Summarization

Traditional methods typically involve constructing a graph to identify significant text units, which are then combined to create a final summary. As a result, these approaches do not necessitate training on labeled data.

LexRank is a sentence-level summarization model that represents sentences as nodes in a graph, with edges indicating their similarity through cosine similarity. It utilizes a PageRank-based algorithm to determine the importance of each sentence, identifying central sentences that encapsulate the key information of the document. Similarly, TextRank also models sentences as nodes and uses similarity scores as edges, but it typically calculates these scores based on the shared words between sentences.
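A minimal sketch of this family of methods, assuming a precomputed pairwise sentence-similarity matrix (the threshold and damping values are illustrative):

```python
import numpy as np

def lexrank_scores(sim, threshold=0.1, d=0.85, iters=50):
    """LexRank-style centrality: power iteration over a thresholded
    cosine-similarity graph. sim is an (n, n) similarity matrix."""
    adj = (sim >= threshold).astype(float)   # keep sufficiently similar pairs
    np.fill_diagonal(adj, 0.0)
    row_sums = adj.sum(axis=1, keepdims=True)
    trans = np.divide(adj, row_sums,
                      out=np.full_like(adj, 1.0 / len(adj)),
                      where=row_sums > 0)    # row-stochastic transition matrix
    n = len(sim)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):                   # PageRank-style update
        scores = (1 - d) / n + d * (trans.T @ scores)
    return scores                            # higher = more central sentence
```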

Recent advancements have introduced methods that enhance graph representations by integrating additional information. Zheng et al. developed PacSum, which includes two key innovations: the use of BERT for embedding sentences to enhance meaning and similarity, and the implementation of directed graphs to more accurately reflect sentence centrality within a document. They argue that directed edges are essential, as the influence of relationships between nodes on their centralities is not symmetrical.

In 2021, the Hierarchical and Positional Ranking model (HipoRank) was introduced, which constructs a directed and hierarchical graph where nodes represent sentences and edges indicate the distance between sentences and their boundaries. This model enhances structural information by incorporating section nodes alongside sentence nodes, allowing for the pruning of weak connections. However, summarization methods can inadvertently create facet bias by selecting sentences from the same facet. To mitigate this issue, the Facet-Aware Ranking (FAR) model was developed, introducing a novel sentence-document weight that encourages the model to focus on diverse facets, which is integrated into the sentence centrality score.

Unsupervised graph-based summarization methods show promise in processing structured documents through graphs. However, their lack of training on labeled data may hinder their ability to accurately capture the unique characteristics of input documents across various domains, potentially impacting their overall performance.

Deep Learning-based Summarization

The advancement of Deep Learning has led to the introduction of methods that effectively capture cross-sentence relationships within documents. These deep learning techniques offer numerous benefits, including the ability to be trained on labeled datasets and the capability to identify hidden features.

Recurrent Neural Networks (RNNs) are widely used deep learning models for extractive summarization, featuring a two-layer architecture that processes sentences and words to capture internal and external relationships. NeuSum enhances this approach by integrating sentence scoring and selection into a single step, utilizing a joint learning method to predict the importance of sentences based on previously selected ones. Despite their advancements, RNNs struggle with long-distance dependencies in lengthy documents, such as scientific articles, which have a complex hierarchical structure. The Transformer model addresses these challenges through an attention mechanism, but it is not well-suited for long documents. To overcome this limitation, Longformer introduces a new attention mechanism that scales linearly with sequence length, enabling it to manage extensive texts. However, its automatic attention may miss important structural elements, as it does not consider the titles of sections and subsections effectively.

Graph Neural Network-based Summarization

Researchers are increasingly utilizing graph neural networks (GNN) to merge deep learning with graph architecture. GNNs effectively leverage graph connectivity to process contextual information in lengthy documents and navigate complex discourse structures. Furthermore, deep learning enables these models to be trained on labeled data, allowing GNNs to comprehend intricate structures and recognize the unique characteristics of the data.

The Approximate Discourse Graph (ADG) is an early approach for training models using graph structures to tackle summarization tasks. In this method, graphs are created with nodes that represent sentences and edges that illustrate the relationships between them. A linear regression classifier is then used to assess the importance of each sentence. While the model has shown promising results, it relies on homogeneous graphs with a single type of node, limiting its ability to capture the intricate discourse structure of the document.

Heterogeneous graph-based methods aim to integrate diverse information through the use of various types of nodes and edges. These models typically involve three key steps: constructing the graph, encoding representations, and predicting summaries. Notably, the proposed models showcase several unique features, as outlined in Table 2.1.

Huang et al. introduce a heterogeneous graph model that incorporates multiple node and edge types, contrasting with traditional homogeneous graphs. This model represents documents as heterogeneous graphs featuring three node types: sentence nodes, EDU nodes, and entity nodes, along with three corresponding edge types: entity-EDU, EDU-EDU, and EDU-sentence. To enhance node representations, Graph Attention Networks (GAT) are employed, utilizing a labeled dataset. The final prediction label is derived from the EDU nodes' representations, allowing for the identification of the most significant EDUs. Similarly, HAESum constructs a graph that emphasizes the importance of different node types.

Table 2.1: Heterogeneous graph-based methods

Model        | Types of nodes            | Types of edges
Huang et al. | Entity, EDU, Sentence     | Entity-EDU, EDU-EDU, EDU-Sentence
HAESum       | Word, Sentence            | Word-Sentence, Hyperedge
SAPGraph     | Section, Sentence, Entity | Entity-Sentence, Sentence-Sentence, Sentence-Section, Section-Section
CHANGES      | Section, Sentence         | Sentence-Sentence, Sentence-Section, Section-Section

The authors introduce a method for learning representations at both the sentence and sub-sentence levels, utilizing a novel edge type known as hyperedges. Unlike regular edges, hyperedges can represent multidimensional relationships by linking multiple nodes. By focusing on nodes at the sub-sentence level to create summaries, this approach effectively reduces redundant components, leading to more concise extractive summaries.

Recent studies have explored heterogeneous graphs that incorporate various node types at higher levels. One such model, SAPGraph, features a heterogeneous graph comprising section nodes, sentence nodes, and entity nodes. These nodes are interconnected by four distinct edge types: Entity-Sentence, Sentence-Sentence, Sentence-Section, and Section-Section. The model employs Graph Attention Networks (GAT) to represent the nodes after initializing their values. Additionally, it predicts the importance level of the nodes using trigram blocking, enabling the clustering of sentences with similar aspects and revealing hidden structures within the data.

Recent proposals have introduced an aspect-aware heterogeneous graph model featuring a novel objective loss. This model constructs a graph comprising two types of nodes, section nodes and sentence nodes, and three types of edges: sentence-sentence, sentence-section, and section-section. Node representations are learned using Graph Attention Networks (GAT) with two optimization functions: contrastive loss and summary loss. The contrastive loss principle ensures that important sentences are positioned closer to the summary than less significant sentences. By incorporating section nodes, this model effectively groups sentences within the same section, revealing hidden structures, similar to SAPGraph.

I explored the idea of CHANGES for the summarization task on Vietnamese data and achieved promising results [Pub 4].

Heterogeneous graph-based methods enhance information diversity by integrating various node and edge types. EDU nodes help eliminate redundant sentence components, leading to more concise extractive summaries, while section nodes facilitate the grouping of sentences within the same section to uncover hidden structures. Edges are categorized into neighbor relations and cross relations, yet the effective integration of nodes and the development of corresponding edge types remain inadequately understood.

Applied in Proposed Model

Some ideas and methods have been employed or customized in the Citation-integrated Summarization model. The specific details are as follows:

• Graph Construction: This thesis introduces a heterogeneous graph featuring various types of nodes and edges. The edges are initialized through an adjacency matrix, drawing on concepts akin to the CHANGES model [42].

• Representation Encoding: The node representation learning is conducted using Graph Attention Networks (GAT), aligning with previous research. Alongside the Summary Loss, we incorporate Contrastive Loss, enhanced by various proposals and improvements.

• Summary Prediction: This thesis uses an MLP to predict the importance level of each node, similar to related studies [13, 42, 44].

This chapter outlines the Citation-guided Summarization model, beginning with an overview in Section 3.1. It then elaborates on each phase of the model, including the Pre-processing phase in Section 3.2, the Important Sentence Selection phase in Section 3.3, and the Summary Generation phase in Section 3.4.

Model Overview

Figure 3.1 shows an overview of the proposed architecture for scientific summarization.

Our architecture generates detailed summaries by utilizing both the target document and its citation network. By leveraging citations, we can pinpoint relevant sentences that enhance the initial abstract with crucial additional information.

The pipeline consists of three key phases: Pre-processing, Important Sentence Selection, and Summary Generation. It begins with the input document D and its citation network, which includes citation sentences and their corresponding citing documents. During the Pre-processing phase, both the document and citations undergo tokenization, segmentation, and normalization, followed by the generation of sentence vectors using an embedding model. In the Sentence Selection phase, significant sentences are identified based on citation information, with adjacent sentences extracted to enhance context. Finally, two distinct combination strategies are employed to create the summary from the abstract.

Figure 3.1: Overview of the Citation-guided Summarization model (phases shown: Pre-processing with segmentation and normalization; Sentence Selection with cited sentence prediction, a similarity-based method, adjacent sentence extraction, and filtering with sentence-text mapping; Summary Generation).

The Citation-guided Summarization model includes two main strategies: Fusion-based summarization, which combines the abstract with selected sentences, and Augmentation-based summarization, which first summarizes the selected sentences before integrating this information into the abstract.

Pre-processing Phase

Most scientific documents are published in PDF format, creating challenges for information extraction. To maintain fairness among models, scientific data in Natural Language Processing tasks is usually provided as text, which is converted using Optical Character Recognition (OCR) modules. However, this conversion often introduces errors. Therefore, it is essential to correct these OCR errors to ensure accurate data for subsequent analysis.

Segmentation and Tokenization. In the segmentation step, document $D$ is divided into a list of sentences as follows:

$$D = \{s_i^D\}_{i=1}^{n_D} \quad (3.1)$$

In the tokenization step, each sentence $s_i^D$ in $D$ and each citation sentence $c_k$ is divided into a list of tokens as follows:

$$s_i^D = \{t_j^{s_i}\}_{j=1}^{n_{s_i}} \quad (3.2)$$

$$c_k = \{t_j^{c_k}\}_{j=1}^{n_{c_k}} \quad (3.3)$$

where $t_j^{s_i}$ is a token in $s_i$ and $t_j^{c_k}$ is a token in $c_k$.

During the normalization phase, a rule-based approach is employed to identify and correct OCR errors. This process addresses common issues encountered during the OCR conversion.

OCR processes characters in a sequential manner, often overlooking the overall structure of the document. Elements like page numbers, footnotes, and subsection titles can be extracted, but they may introduce noise and disrupt the content. To enhance clarity, the model eliminates sentences containing fewer than five tokens.

In scientific documents, tables are often converted into plain text, which can obscure the original meaning of the data. To enhance clarity, this process eliminates sentences where the ratio of non-alphabetic tokens and invalid English words exceeds 80%.
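A sketch of these two filter rules, assuming a simple whitespace tokenizer and an external English vocabulary set (both assumptions; the thesis does not name its tokenizer or dictionary):

```python
def keep_sentence(sentence: str, english_vocab: set[str],
                  min_tokens: int = 5, max_invalid_ratio: float = 0.8) -> bool:
    """Rule-based OCR-noise filter: drop very short sentences (page numbers,
    footnotes, titles) and sentences dominated by non-alphabetic or
    out-of-vocabulary tokens (e.g. flattened tables)."""
    tokens = sentence.split()                 # assumed whitespace tokenizer
    if len(tokens) < min_tokens:
        return False
    invalid = sum(1 for t in tokens
                  if not t.isalpha() or t.lower() not in english_vocab)
    return invalid / len(tokens) <= max_invalid_ratio
```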

Embedding. The sentences are embedded based on their tokens using pre-trained models. The output of this step is a vector $v_i$ for sentence $s_i^D$ or $c_k$.

Important Sentence Selection Phase

Cited Sentence Prediction

In this stage, a Deep Neural Network is developed to predict cited sentences. The model processes a pair consisting of a citation sentence vector and a document sentence vector to determine if the document sentence corresponds to the citation sentence. This task is framed as a binary classification problem.

The Deep Neural Network architecture estimates the probability that a document sentence $s_j^D$ is referenced by a citation sentence $c_i$ through three feature types. First, the vector $v_{c_i}$ of the citation sentence $c_i$ is fed into a Multi-Layer Perceptron (MLP) network to extract its features $\alpha_{c_i}$:

$$\alpha_{c_i} = \phi_c(W_c^T v_{c_i} + b_c) \quad (3.4)$$

where $\phi_c$ is the ReLU activation function, $W_c$ is the weight matrix, and $b_c$ is the bias.

Similarly, features $\alpha_j^D$ are extracted from the vector $v_j^D$ of a sentence $s_j^D$ in the document $D$ using an MLP network as follows:

$$\alpha_j^D = \phi_D(W_D^T v_j^D + b_D) \quad (3.5)$$

where $\phi_D$ is the ReLU activation function, $W_D$ is the weight matrix, and $b_D$ is the bias.

The last feature type is the similarity score between the citation sentence vector $v_{c_i}$ and the document sentence vector $v_j^D$. This score is computed using a similarity-based method:

$$\alpha_{f_{ij}} = f(v_{c_i}, v_j^D) \quad (3.6)$$

In our experiment, $f$ is the cosine similarity:

$$f(v_i^c, v_j^D) = \frac{v_i^c \cdot v_j^D}{|v_i^c|\,|v_j^D|} \quad (3.7)$$

where $(\cdot)$ is the dot product, and $|v_i^c|$ and $|v_j^D|$ are the Euclidean norms of the citation vector $v_i^c$ and the sentence vector $v_j^D$.

After computing the three distinct feature types, all features are concatenated, and a Multi-Layer Perceptron (MLP) network is employed to estimate the probability that citation $c_i$ and document sentence $s_j^D$ form a citation pair:

$$\hat{\alpha}_{ij} = \alpha_{c_i} \,\|\, \alpha_{D_j} \,\|\, \alpha_{f_{ij}} \quad (3.8)$$

$$\alpha_{ij} = \phi(W^T \hat{\alpha}_{ij} + b) \quad (3.9)$$

where $\|$ denotes the concatenation operator, $\phi$ is the ReLU activation function, $W$ represents the weight matrix, and $b$ is the bias term.
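Putting Equations 3.4-3.9 together, a minimal PyTorch sketch of the classifier (the hidden sizes and the final sigmoid for binary classification are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CitedSentencePredictor(nn.Module):
    """Two feature MLPs (Eq. 3.4-3.5), a cosine-similarity feature
    (Eq. 3.6-3.7), concatenation, and a scoring MLP (Eq. 3.8-3.9)."""
    def __init__(self, emb_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.citation_mlp = nn.Linear(emb_dim, hidden)  # alpha_c features
        self.document_mlp = nn.Linear(emb_dim, hidden)  # alpha_D features
        self.scorer = nn.Linear(2 * hidden + 1, 1)      # over concatenated features

    def forward(self, v_c: torch.Tensor, v_d: torch.Tensor) -> torch.Tensor:
        alpha_c = F.relu(self.citation_mlp(v_c))
        alpha_d = F.relu(self.document_mlp(v_d))
        alpha_f = F.cosine_similarity(v_c, v_d, dim=-1, eps=1e-8).unsqueeze(-1)
        feats = torch.cat([alpha_c, alpha_d, alpha_f], dim=-1)
        # assumption: sigmoid turns the final score into P(cited pair)
        return torch.sigmoid(self.scorer(feats)).squeeze(-1)
```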

The model training process utilizes labeled samples derived from the CL-SciSumm dataset, which establishes a connection between citation sentences and their corresponding cited sentences in the target document. All mapping pairs from this dataset serve as positive samples, assigned a label value of 1. Additionally, for each citation sentence, five random sentences from the target document are selected to create negative samples, which are labeled as 0.

The model inference process involves utilizing the trained model to score each citation sentence against the sentences of the target document. The highest-scoring sentences are retained, while the others are filtered out. Ultimately, the selected sentences' text and positions are retrieved, yielding the final cited sentences.

Adjacent Sentence Extraction

In our previous research, we discovered that important sentences often require nearby sentences for clarification or support. These adjacent sentences can be organized either deductively, where explanatory sentences follow a stated sentence, or inductively, where a concluding sentence summarizes the preceding content. Consequently, a summary may consist of contiguous sentences grouped together within a paragraph. Given that the Deep Neural Network does not take into account neighboring sentences, it is essential to select adjacent sentences to enhance the context surrounding important sentences.

To select a sentence based on its position relative to a cited sentence, we assign adjacent labels with the following formula:

$$\text{label}^{adj}_i = \max_{j=\max(i-L,\,1)}^{\min(i+R,\,n)} \text{label}^{cited}_j$$

where $L$ and $R$ are the numbers of sentences influencing the current sentence $i$ from the left and right, respectively, $n$ denotes the total number of sentences, and $\text{label}^{adj}_i$ and $\text{label}^{cited}_j$ refer to the adjacent and cited labels of the sentences.
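A direct implementation of this windowed maximum over 0/1 cited labels (0-indexed, as is natural in Python):

```python
def adjacent_labels(cited: list[int], L: int = 1, R: int = 1) -> list[int]:
    """A sentence receives the adjacent label 1 if any sentence within
    L positions to its left or R positions to its right is cited."""
    n = len(cited)
    return [max(cited[max(i - L, 0): min(i + R + 1, n)]) for i in range(n)]

# Example: sentence 2 is cited, so sentences 1-3 get the adjacent label.
assert adjacent_labels([0, 0, 1, 0, 0]) == [0, 1, 1, 1, 0]
```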

Summary Generation Phase

Large Language Models

In this step, large language models are used to generate the summary as follows:

BART, or Bidirectional and Auto-Regressive Transformers, is a model developed from the Transformer architecture. It uniquely integrates bidirectional pre-training with a sequence-to-sequence framework that allows for auto-regressive generation.

By leveraging masked language modeling during pretraining and fine-tuning on summarization-specific datasets, BART produces concise and coherent summaries that capture the essence of longer texts.

PEGASUS, like BART, is based on the Transformer architecture and employs a unique pre-training strategy that masks sentences within a document. This enables the model to generate the omitted text by leveraging the surrounding context, allowing PEGASUS to effectively learn the relationships and essential information within the text.

GPT-3.5 Turbo, developed by OpenAI, is an advanced iteration of GPT-3 that offers enhanced speed and accuracy for a variety of language tasks, such as question answering and text summarization. The model excels at managing longer contexts, resulting in coherent and relevant responses in real-world applications. GPT-3.5 Turbo is accessed via its API and used with zero-shot learning to generate concise summaries. For our specific use case, we requested the output as a paragraph in English without any additional explanations, using the following prompt:

Your task is to generate a summary for following document: Document: '''{text_input}'''

[1] https://platform.openai.com/docs/models/gpt-3-5-turbo
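A minimal sketch of the zero-shot call with the prompt above, using the OpenAI Python client (the message wrapping and temperature setting are assumptions):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def gpt_summarize(text_input: str) -> str:
    """Zero-shot summarization with the prompt quoted above."""
    prompt = ("Your task is to generate a summary for following document: "
              f"Document: '''{text_input}'''")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # assumption: deterministic decoding for evaluation
    )
    return response.choices[0].message.content
```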

Combination

Citations offer valuable insights from the input document, but they do not ensure the inclusion of all pertinent information. To create a thorough summary, it is essential to incorporate details from both the abstract and the cited sentences. We consider the abstract as the authors' primary perspective and the cited sentences as insights gathered from the research community. In this process, we explore two strategies to effectively merge data from the abstract and cited sentences, as illustrated in Figure 3.2.


Figure 3.2: Two combination strategies in Citation-guided Summarization model

Fusion-based summarization involves a large language model that combines information from both the abstract and cited sentences. Initially, the abstract is concatenated with the cited sentences, and this combined input is then processed by the summarization model to produce a coherent summary.

Augmentation-based summarization involves using the abstract of an input document as a foundational overview. The approach focuses on supplementing this abstract with essential information derived from cited sentences. These sentences are concatenated and processed through a large language model, and the resulting generated summary is then appended to the original abstract.
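The two strategies reduce to a small difference in where the abstract enters the pipeline. A sketch, where `summarize` stands for any of the models described above (function names are illustrative):

```python
def fusion_summary(abstract: str, cited_sents: list[str], summarize) -> str:
    """Fusion-based: concatenate abstract and cited sentences, summarize once."""
    return summarize(abstract + " " + " ".join(cited_sents))

def augmentation_summary(abstract: str, cited_sents: list[str], summarize) -> str:
    """Augmentation-based: summarize only the cited sentences, then append
    the community-view summary to the authors' abstract."""
    return abstract + " " + summarize(" ".join(cited_sents))
```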

This chapter introduces the Citation-integrated Summarization model, beginning with an overview in Section 4.1. It then elaborates on each phase of the model, including Pre-processing in Section 4.2, Heterogeneous Graph Construction in Section 4.3, Heterogeneous Graph Encoder in Section 4.4, and Important Score Estimation in Section 4.5.

Model Overview

A novel model utilizing a heterogeneous graph has been developed to effectively leverage both internal document components, such as hierarchical structure, and external components, like citations. Unlike existing models that focus solely on the input document, our approach incorporates both the target document and its citation network. By constructing a heterogeneous graph that integrates citation sentences, the model learns representations during training, enabling it to autonomously understand node representations and their relationships.

The proposed architecture for scientific summarization, illustrated in Figure 4.1, consists of four key phases: Pre-processing, Heterogeneous Graph Construction, Heterogeneous Graph Encoder, and Important Score Estimation. The architecture takes an input document $D$ along with its citation network, which includes citation sentences $\{c_k\}_{k=1}^{n_c}$ and their corresponding citing documents.


Figure 4.1: Overview of Citation-integrated Summarization Model.

The initial step involves pre-processing and normalizing input data into structured formats, which consist of structured documents and citations. This structured data is then organized into heterogeneous graphs. Subsequently, the graphs are processed through a Heterogeneous Graph Encoder to enhance their representation for summarization purposes. Finally, the EDU nodes within the graph undergo Important Score Estimation to assess their significance, allowing for the selection of the most critical nodes to create the predicted summary.

The importance of a node is defined through its ROUGE-2 Precision (ROUGE-2 P) score. For a given EDU $e_i$, the ROUGE-2 P is calculated between EDU $e_i$ and the golden summary $S^*$ as follows:

$$\text{ROUGE-2 P}(e_i) = \frac{|\text{bigrams}(e_i) \cap \text{bigrams}(S^*)|}{|\text{bigrams}(e_i)|}$$

A higher ROUGE-2 Precision score reflects that the EDU $e_i$ contains more bigrams present in the reference summary, indicating its greater significance as a positive sample. Conversely, a lower score suggests that the EDU $e_i$ holds less importance, categorizing it as a negative sample.
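A simplified sketch of this labeling criterion (set-based bigram overlap; full ROUGE uses clipped multiset counts):

```python
def rouge2_precision(edu_tokens: list[str], summary_tokens: list[str]) -> float:
    """ROUGE-2 precision of an EDU against the golden summary:
    matched bigrams divided by the number of bigrams in the EDU."""
    def bigrams(toks):
        return {(a, b) for a, b in zip(toks, toks[1:])}
    edu_bi = bigrams(edu_tokens)
    if not edu_bi:
        return 0.0
    return len(edu_bi & bigrams(summary_tokens)) / len(edu_bi)
```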

Pre-processing Phase

Scientific documents require pre-processing to eliminate noise from raw text and prepare the data for model input. This model incorporates various preprocessing steps such as segmentation, tokenization, normalization, and embedding. Unlike the Citation-guided Summarization model, this model's preprocessing phase features specialized segmentation and normalization techniques tailored to its specific requirements.

Segmentation and Tokenization. Firstly, document $D$ is divided into a list of sentences as follows:

$$D = \{s_i^D\}_{i=1}^{n_D}$$

This model utilizes the Elementary Discourse Unit (EDU), a smaller text unit than a sentence, to create summaries. An EDU is the smallest meaningful element in discourse analysis, such as a clause or simple sentence, and it is preferred over a sentence for summary composition due to its finer granularity. Consequently, the DMRST tool divides each sentence $s_i^D$ and each citation sentence $c_k$ into lists of EDUs:

$$s_i^D = \{e_j^{s_i}\}_{j=1}^{n_{s_i}} \qquad c_k = \{e_j^{c_k}\}_{j=1}^{n_{c_k}}$$

where $e_j^{s_i}$ and $e_j^{c_k}$ are the respective EDUs. This approach allows the model to include the entire sentence in the EDU list if all of its information is deemed important.

In the tokenization step, each Elementary Discourse Unit (EDU) $e_j$ is divided into a list of tokens to facilitate the embedding process:

$$e_j = \{t_r^{e_j}\}_{r=1}^{n_{e_j}}$$

where $e_j$ denotes an EDU from the document or the citations, and $t_r^{e_j}$ represents a token within $e_j$.

Normalization. Firstly, errors from converting texts from PDF are addressed. This step uses GPT-4o-mini to correct spelling and grammatical errors.

Scientific texts typically adhere to a common structure, including sections like abstract, introduction, related work, method, result, and conclusion, though their names and positions may vary. To address this, GPT is employed to identify section positions and labels, facilitating the retrieval of information for documents with a standardized format. Additionally, I noted that certain terms, such as summary, conclusion, and conclusion & future work, are often used interchangeably to denote the conclusion section. A dictionary was utilized to map these terms to a normalized label. After identifying and normalizing the sections, sentences within each section are grouped accordingly.

Heterogeneous Graph Construction Phase

Graph Architecture Construction

A structured document is represented as a heterogeneous graph $G = \{V, E\}$, where $V$ denotes the set of nodes and $E$ represents the set of edges. The graph construction phase and the architecture of the heterogeneous graph are illustrated in Figure 4.2.

Figure 4.2: The architecture of the heterogeneous graph

The graph $G$ is structured with three node levels: section, sentence, and EDU, aiming to integrate the advantages of various node types. The model focuses on selecting significant EDUs to reduce noise in sentences, while section nodes group sentences to utilize the structural characteristics of scientific documents. Sentence nodes serve as connections between EDU and section nodes. The node set of $G$ is defined as $V = V_{sec} \cup V_{sent} \cup V_{EDU}$, where $V_{sec}$ represents section nodes, $V_{sent}$ denotes sentence nodes, and $V_{EDU}$ includes EDU nodes. The section nodes are categorized into six standardized types: Abstract, Conclusion, Introduction, Method, Result, and Other.

The nodes are connected through edges. The edge set $E$ of $G$ is defined as $E = E_{sec} \cup E_{sent} \cup E_{EDU} \cup E_{cross}$, encompassing both peer-level and non-peer relations. The sets $E_{sec}$, $E_{sent}$, and $E_{EDU}$ represent peer-level relations among sections, sentences, and elementary discourse units (EDUs), respectively. In contrast, $E_{cross}$ captures non-peer relations between text units and their corresponding higher-level units. The graph is represented by an adjacency matrix, where the relationship between nodes $i$ and $j$ is denoted by the values at positions $(i, j)$ and $(j, i)$ within the matrix.

Citation sentences are integrated into the document graph, organized under a section titled "Citations." Within this section, the Elementary Discourse Units (EDUs) in each citation sentence are interconnected by the relation $E_{EDU}$, while citation sentences themselves are linked through $E_{sent}$. The "Citations" section is further connected to other sections via $E_{sec}$. Additionally, cross edges are established to connect EDU-sentence and sentence-section pairs.

Graph Node Initialization

After obtaining the vectors of the tokens in preprocessing, this step initializes the vectors of the higher-level nodes, including EDU nodes, sentence nodes, and section nodes.

EDU Node Representation. Token vectors are generated using pre-trained models as the token encoder. The vector for EDU $e_i$ is derived by averaging the vectors of its constituent tokens. Additionally, positional encoding vectors are utilized to indicate the position of each EDU within the document, as highlighted in our previous study [Pub 4].

The initial representation is defined as:

$$x_i = v_i + PE(e_i)$$

where $x_i$ denotes the initial representation of EDU $e_i$, $v_i$ is the vector of $e_i$ calculated from its tokens, and $PE(\cdot)$ represents the position encoding function used in the Transformer model.

Sentence Node Representation. Each sentence node's representation is initialized by aggregating the initialized representations of its elementary discourse units (EDUs). For a given sentence $s_i$, the initialized representation is calculated as follows:

$$x^{s_i} = \frac{1}{n_{s_i}} \sum_{j=1}^{n_{s_i}} x_j \quad (4.8)$$

where $n_{s_i}$ is the number of EDUs in $s_i$ and $x_j$ is the initialized representation of EDU $e_j$.

Section Node Representation. Higher-level representations are created by aggregating sentence-level representations, effectively summarizing the main topics of each section. Each section node representation is initialized by combining the initialized representations of its constituent sentences. For a given section $sec_i$:

$$x^{sec_i} = \frac{1}{n_{sec_i}} \sum_{j=1}^{n_{sec_i}} x^{s_j} \quad (4.9)$$

where $n_{sec_i}$ is the number of sentences in $sec_i$ and $x^{s_j}$ is the initialized representation of $s_j$.
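A sketch of this bottom-up initialization, with the EDU step adding a positional encoding as described above (the parent-index list format is an assumption):

```python
import torch

def init_node_representations(edu_token_vecs, edu_to_sent, sent_to_sec, pos_enc):
    """EDU = mean of its token vectors plus PE(e_i); sentence = mean of its
    EDUs (Eq. 4.8); section = mean of its sentences (Eq. 4.9).
    edu_to_sent / sent_to_sec map each unit to its parent index."""
    # EDU nodes: average token vectors, then add the positional encoding
    x_edu = torch.stack([vecs.mean(dim=0) for vecs in edu_token_vecs])
    x_edu = x_edu + pos_enc[: x_edu.size(0)]
    # Sentence nodes: average the EDUs belonging to each sentence
    n_sent = max(edu_to_sent) + 1
    x_sent = torch.stack([
        x_edu[[j for j, s in enumerate(edu_to_sent) if s == i]].mean(dim=0)
        for i in range(n_sent)
    ])
    # Section nodes: average the sentences belonging to each section
    n_sec = max(sent_to_sec) + 1
    x_sec = torch.stack([
        x_sent[[j for j, s in enumerate(sent_to_sec) if s == i]].mean(dim=0)
        for i in range(n_sec)
    ])
    return x_edu, x_sent, x_sec
```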

Graph Edge Initialization

This thesis, similar to previous work [42], avoids using cosine similarity for initializing graph edges. Instead, edges are initialized with values of 0 or 1 to clarify structural information, where 1 indicates a connection and 0 signifies no connection. An adjacency matrix is employed to store these edges, as illustrated in Figure 4.3.

At the EDU level, the edge \(E_{EDU}(e_i, e_j)\) between EDU \(e_i\) and EDU \(e_j\) is initialized with a value of 1 if they belong to the same sentence. Similarly, the sentence edge \(E_{sent}(s_i, s_j)\) between sentence \(s_i\) and sentence \(s_j\) is initialized with a value of 1 if they belong to the same section. All section edges are, by default, initialized with a value of 1, since all section nodes belong to the same document.

[Figure 4.3: An example of an adjacency matrix initialization (Node 1 through Node 8).]

Peer-level edges enable the sharing of information among nodes at the same level, facilitating the flow of both local information through EDU edges and global information via section edges.

On the other hand, nodes at different levels also exchange information through \(E_{cross}\) edges, which enhance the flow of information from local to global levels. The value of \(E_{cross}(e_i, s_j)\) is set to 1 when EDU \(e_i\) is part of sentence \(s_j\). Likewise, \(E_{cross}(s_i, sec_j)\) is initialized to 1 if sentence \(s_i\) is included in section \(sec_j\).
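
A small sketch of how such a 0/1 adjacency matrix could be assembled is shown below. The index layout (EDUs first, then sentences, then sections) and the mapping arguments `edu2sent` and `sent2sec` are assumptions made for illustration.

```python
import numpy as np

def build_adjacency(edu2sent: list, sent2sec: list, n_sec: int) -> np.ndarray:
    """0/1 symmetric adjacency over [EDU nodes | sentence nodes | section nodes]."""
    n_edu, n_sent = len(edu2sent), len(sent2sec)
    n = n_edu + n_sent + n_sec
    A = np.zeros((n, n), dtype=np.int8)
    # Peer-level E_EDU: EDUs in the same sentence
    for i in range(n_edu):
        for j in range(i + 1, n_edu):
            if edu2sent[i] == edu2sent[j]:
                A[i, j] = A[j, i] = 1
    # Peer-level E_sent: sentences in the same section
    for i in range(n_sent):
        for j in range(i + 1, n_sent):
            if sent2sec[i] == sent2sec[j]:
                A[n_edu + i, n_edu + j] = A[n_edu + j, n_edu + i] = 1
    # Peer-level E_sec: all section pairs (same document)
    A[n_edu + n_sent:, n_edu + n_sent:] = 1
    # Cross-level E_cross: EDU-sentence and sentence-section membership
    for e, s in enumerate(edu2sent):
        A[e, n_edu + s] = A[n_edu + s, e] = 1
    for s, sec in enumerate(sent2sec):
        A[n_edu + s, n_edu + n_sent + sec] = A[n_edu + n_sent + sec, n_edu + s] = 1
    return A
```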

Heterogeneous Graph Encoder Phase

Graph Attention Network

Graph Attention Networks (GAT) are neural networks specialized for graph-structured data that leverage an attention mechanism to dynamically assign weights to neighboring nodes based on their importance. Unlike traditional methods that use fixed weights, GATs learn attention coefficients during training, enabling the model to prioritize the most relevant neighbors. When a node \(x_i\) aggregates information from its neighbors, the attention weight \(w_{ij}\) between node \(x_i\) and node \(x_j\) is computed as
\[ z_{ij} = \mathrm{LeakyReLU}\big( W_a [W_q x_i \,\|\, W_k x_j] \big) \qquad (4.10) \]
\[ w_{ij} = \frac{\exp(z_{ij})}{\sum_{l \in \mathrm{Neigh}_i} \exp(z_{il})} \qquad (4.11) \]
where \(W_a\), \(W_q\), and \(W_k\) are trainable weights and \(\mathrm{Neigh}_i\) denotes the first-degree neighbors of node \(x_i\). With multi-head attention, \(K\) independent attention mechanisms are applied and their outputs are concatenated to form the final output for node \(x_i\):
\[ x'_i = \big\Vert_{k=1}^{K} \, \sigma\Big( \sum_{j \in \mathrm{Neigh}_i} w^k_{ij} W^k x_j \Big) \qquad (4.12) \]
where \(\|\) is the concatenation operation, \(\sigma\) is an activation function, \(w^k_{ij}\) are the attention weights computed by the \(k\)-th attention mechanism, and \(W^k\) is the corresponding trainable weight.
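
As a reference point, a minimal PyTorch sketch of one such multi-head GAT layer is given below. It follows Equations 4.10–4.12 directly; the dense adjacency mask, the ELU activation, and the default of 8 heads are assumptions for illustration, and the adjacency matrix is assumed to contain self-loops so every node has at least one neighbor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """One multi-head GAT layer following Eqs. (4.10)-(4.12)."""
    def __init__(self, d_in: int, d_out: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.W_q = nn.ModuleList([nn.Linear(d_in, d_out, bias=False) for _ in range(heads)])
        self.W_k = nn.ModuleList([nn.Linear(d_in, d_out, bias=False) for _ in range(heads)])
        self.W   = nn.ModuleList([nn.Linear(d_in, d_out, bias=False) for _ in range(heads)])
        self.W_a = nn.ModuleList([nn.Linear(2 * d_out, 1, bias=False) for _ in range(heads)])

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (n, d_in); adj: (n, n) 0/1 matrix, assumed to include self-loops
        n = x.size(0)
        heads_out = []
        for k in range(self.heads):
            q, key, v = self.W_q[k](x), self.W_k[k](x), self.W[k](x)
            # z_ij = LeakyReLU(W_a [W_q x_i || W_k x_j])           (4.10)
            pair = torch.cat([q.unsqueeze(1).expand(n, n, -1),
                              key.unsqueeze(0).expand(n, n, -1)], dim=-1)
            z = F.leaky_relu(self.W_a[k](pair).squeeze(-1))
            # Softmax restricted to first-degree neighbours        (4.11)
            w = torch.softmax(z.masked_fill(adj == 0, float("-inf")), dim=-1)
            # Weighted aggregation for one head                    (4.12)
            heads_out.append(F.elu(w @ v))
        return torch.cat(heads_out, dim=-1)  # concatenation over the K heads
```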

Contrastive Learning Loss

Contrastive learning plays a crucial role in representation learning by focusing on EDU representations. It minimizes the distance between the ideal representation and significant EDU vectors, while simultaneously maximizing the distance from irrelevant EDU vectors.

The contrastive learning loss is a key optimization function in this model. Unlike earlier research that employed the InfoNCE loss function for contrastive learning, we opted for triplet loss. This choice is driven by the advantages of triplet loss over InfoNCE, especially in terms of its simplicity and interpretability. Triplet loss requires a triplet consisting of anchor, positive, and negative samples:
\[ L_{contrastive} = \frac{1}{N} \sum_{i=1}^{N} \max\big\{ d(x_{a_i}, x_{p_i}) - d(x_{a_i}, x_{n_i}) + \alpha,\ 0 \big\} \qquad (4.13) \]
where \(N\) is the total number of triplets, \(x_{a_i}\) denotes the representation of the anchor (golden) node, \(x_{p_i}\) the representation of a positive node, and \(x_{n_i}\) the representation of a negative node. The function \(d(\cdot, \cdot)\) is the Euclidean distance, and \(\alpha\) is the defined loss margin.

We use the node importance score to differentiate between positive and negative samples, since selecting hard samples significantly enhances model performance. Hard negatives are negatives that lie near the positives without overlapping them, which challenges the model to discern subtle differences. Conversely, hard positives are positive samples that lie far from the anchor, promoting diversity in learning. To prevent the overlap of positive and negative samples, we establish a margin that clearly separates the two categories, allowing thresholds to be applied to eliminate uncertain samples:
\[ \text{label}_i = \begin{cases} \text{Positive} & \text{if ROUGE-2 } P_i > \beta \\ \text{Negative} & \text{if ROUGE-2 } P_i < \gamma \end{cases} \qquad (4.14) \]
The thresholds \(\beta\) and \(\gamma\) define a hard margin \(\Delta = \beta - \gamma\), where \(\Delta > 0\). The model identifies hard positives by selecting the top-\(k\) EDUs with the lowest scores among positive samples, while hard negatives are the top-\(k\) EDUs with the highest scores among negative samples. These hard samples are used for calculating the contrastive learning loss.
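
The sketch below illustrates one way to mine hard samples and compute the triplet loss of Equation 4.13. Pairing the i-th hard positive with the i-th hard negative is a simplification, and all argument names are assumptions rather than the thesis API.

```python
import torch
import torch.nn.functional as F

def triplet_contrastive_loss(anchor, edu_vecs, rouge2_p, beta, gamma, k, alpha):
    """Triplet loss (Eq. 4.13) with hard-sample mining.
    Positives: ROUGE-2 P > beta; negatives: ROUGE-2 P < gamma (Eq. 4.14).
    Hard positives = k lowest-scoring positives; hard negatives = k highest-scoring negatives."""
    pos_idx = (rouge2_p > beta).nonzero(as_tuple=True)[0]
    neg_idx = (rouge2_p < gamma).nonzero(as_tuple=True)[0]
    if len(pos_idx) == 0 or len(neg_idx) == 0:
        return torch.tensor(0.0)
    hard_pos = pos_idx[rouge2_p[pos_idx].argsort()[:k]]                   # lowest among positives
    hard_neg = neg_idx[rouge2_p[neg_idx].argsort(descending=True)[:k]]   # highest among negatives
    d_pos = F.pairwise_distance(anchor.expand(len(hard_pos), -1), edu_vecs[hard_pos])
    d_neg = F.pairwise_distance(anchor.expand(len(hard_neg), -1), edu_vecs[hard_neg])
    # max{d(a,p) - d(a,n) + alpha, 0}, averaged over the available triplets
    m = min(len(hard_pos), len(hard_neg))
    return torch.clamp(d_pos[:m] - d_neg[:m] + alpha, min=0).mean()
```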

Important Score Estimation Phase

Important Score Estimation

The heterogeneous graph representations of the EDU node \(x'_i\), its corresponding sentence node \(x'_{s_i}\), and section node \(x'_{sec_i}\) are concatenated:
\[ x''_i = x'_i \,\|\, x'_{s_i} \,\|\, x'_{sec_i} \qquad (4.15) \]
This concatenated representation is then fed into multi-layer perceptron (MLP) layers to predict an importance score for the EDU:
\[ \hat{y}_i = \phi(W^T x''_i + b) \qquad (4.16) \]
where \(\phi\) is the Sigmoid activation function, \(W\) is the weight matrix, and \(b\) is the bias. The resulting score \(\hat{y}_i\) indicates the importance of the EDU and its likelihood of being included in the summary; the golden label for this importance score is derived from the ROUGE-2 P score as in Formula 4.1.
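
A minimal sketch of this scoring head follows; the single hidden layer with ReLU is an assumption, since the thesis only specifies "MLP layers" ending in a Sigmoid.

```python
import torch
import torch.nn as nn

class ImportanceScorer(nn.Module):
    """Concatenate EDU, sentence, and section representations, then score with an MLP."""
    def __init__(self, d: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, x_edu, x_sent, x_sec):
        x = torch.cat([x_edu, x_sent, x_sec], dim=-1)  # x''_i = x'_i || x'_{s_i} || x'_{sec_i}
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # y_hat_i = phi(W^T x''_i + b)
```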

Summary Loss

During training, the summary loss is computed to refine the importance scores of the EDUs. The summarization loss is defined as a binary cross-entropy between the predicted score \(\hat{y}_i\) and the ground-truth label \(y_i \in (0,1)\):
\[ L_{summ} = -\frac{1}{N_{EDU}} \sum_{i=1}^{N_{EDU}} \big[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \big] \qquad (4.17) \]
where \(N_{EDU}\) is the number of EDUs.

The model's final loss combines the contrastive and summarization losses in a weighted manner, balancing the learning of discourse connections against the generation of high-quality summaries:
\[ L_{total} = \lambda L_{contrastive} + (1 - \lambda) L_{summ} \qquad (4.18) \]
where \(\lambda\) is a hyperparameter that controls the trade-off between contrastive learning and summarization quality.
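
For reference, Equations 4.17–4.18 reduce to a few lines of PyTorch; the function below is a sketch, with the contrastive term assumed to be computed elsewhere (for example by the triplet-loss sketch above).

```python
import torch.nn.functional as F

def total_loss(y_hat, y, contrastive_loss, lam):
    """L_total = lam * L_contrastive + (1 - lam) * L_summ  (Eqs. 4.17-4.18)."""
    summ_loss = F.binary_cross_entropy(y_hat, y)  # BCE averaged over all EDUs (Eq. 4.17)
    return lam * contrastive_loss + (1 - lam) * summ_loss
```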

Training and Inference Process

The training and inference processes differ in the Citation-integrated Summarization model; the training process focuses on updating parameters, whereas the inference process is designed to produce the predicted summary.

The training process involves several key steps: input data is processed through Pre-processing, Heterogeneous Graph Construction, Heterogeneous Graph Encoder, and Important Score Estimation to generate an importance score for each EDU. During this phase, the contrastive loss is computed from the node representations after encoding, while the summary loss is derived from the importance scores. The total loss is then calculated to update parameters through backpropagation. Notably, since the contrastive loss tends to converge more quickly than the summary loss, optimization of the contrastive loss is paused if it fails to decrease over a specified number of epochs, and training continues on the summary loss alone.
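
A rough sketch of this schedule is given below; the `model(batch)` interface returning scores and a contrastive loss, and the `batch.labels` attribute, are hypothetical placeholders for the thesis pipeline.

```python
import torch.nn.functional as F

def train(model, loader, optimizer, lam, patience, epochs):
    """Training loop with the early-stop trick for the contrastive term:
    once L_contrastive stops improving for `patience` epochs,
    keep optimizing the summary loss alone."""
    best_c, stall, use_contrastive = float("inf"), 0, True
    for epoch in range(epochs):
        epoch_c = 0.0
        for batch in loader:
            scores, l_contrastive = model(batch)
            l_summ = F.binary_cross_entropy(scores, batch.labels)
            loss = lam * l_contrastive + (1 - lam) * l_summ if use_contrastive else l_summ
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_c += l_contrastive.item()
        if use_contrastive:
            stall = 0 if epoch_c < best_c else stall + 1
            best_c = min(best_c, epoch_c)
            use_contrastive = stall < patience  # pause the contrastive term
```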

The inference process generates a summary from the most significant EDUs. Initially, input data is processed through the four steps above to predict the importance scores of all EDUs. These EDUs are then ranked according to their importance, and the model selects them while ensuring diversity in the summary: an EDU is skipped if it has a high overlap, measured by the ROUGE-2 precision score, with a previously selected EDU. The selection continues until the desired summary length is achieved, after which the selected EDUs are reordered to their original sequence and concatenated to form the final summary.
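
The selection loop amounts to a greedy ranking with a redundancy filter, as sketched below; `overlap_fn` stands in for a ROUGE-2 precision computation, and the threshold `tau` is an assumed hyperparameter not specified here.

```python
def select_summary(edus, scores, max_tokens, overlap_fn, tau=0.6):
    """Greedy selection: rank EDUs by predicted importance, skip an EDU whose
    ROUGE-2 precision against any already-selected EDU exceeds tau, stop at
    the length budget, then restore the original document order."""
    order = sorted(range(len(edus)), key=lambda i: scores[i], reverse=True)
    chosen, length = [], 0
    for i in order:
        if any(overlap_fn(edus[i], edus[j]) > tau for j in chosen):
            continue  # too redundant with an already-selected EDU
        if length + len(edus[i].split()) > max_tokens:
            break
        chosen.append(i)
        length += len(edus[i].split())
    return " ".join(edus[i] for i in sorted(chosen))  # original order
```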

This chapter presents the model implementation, experimental environment, and settings in Section 5.1, the dataset and evaluation methods in Section 5.2, and summarization performance in Section 5.3.

Implementation and Configurations

Dataset and Evaluation Methods

Contribution of Proposed Components


References

[1] A. Abu-Jbara and D. Radev, "Coherent citation-based summarization of scientific papers," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 500–509.
[2] A. G. T. AbuRa'ed, L. Chiruzzo, H. Saggion, P. Accuosto, and À. Bravo Serrano, "LaSTUS/TALN@CLSciSumm-17: Cross-document sentence matching and scientific text summarization systems," CEUR, 2017.
[4] K. Agrawal, A. Mittal, and V. Pudi, "Scalable, semi-supervised extraction of structured information from scientific literature," in Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications, 2019, pp. 11–20.
[5] N. I. Altmami and M. E. B. Menai, "Automatic summarization of scientific articles: A survey," Journal of King Saud University - Computer and Information Sciences, vol. 34, no. 4, pp. 1011–1028, 2022.
[6] I. Beltagy, M. E. Peters, and A. Cohan, "Longformer: The long-document transformer," arXiv preprint arXiv:2004.05150, 2020.
[7] L. Bornmann and R. Mutz, "Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references," Journal of the Association for Information Science and Technology, vol. 66, no. 11, pp. 2215–2222, 2015.
[8] M. K. Chandrasekaran, G. Feigenblat, E. Hovy, A. Ravichander, M. Shmueli-Scheuer, and A. de Waard, "Overview and insights from the shared tasks at scholarly document processing 2020: CL-SciSumm, LaySumm and LongSumm," in Proceedings of the First Workshop on Scholarly Document Processing, 2020, pp. 214–224.
[9] J. Christensen, S. Soderland, O. Etzioni et al., "Towards coherent multi-document summarization," in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 1163–1173.
[10] Y. Dong, A. Mircea, and J. C. K. Cheung, "Discourse-aware unsupervised summarization for long scientific documents," in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 1089–1102.
[11] W. S. El-Kassas, C. R. Salama, A. A. Rafea, and H. K. Mohamed, "Automatic text summarization: A comprehensive survey," Expert Systems with Applications, vol. 165, p. 113679, 2021.
[12] G. Erkan and D. R. Radev, "LexRank: Graph-based lexical centrality as salience in text summarization," Journal of Artificial Intelligence Research, vol. 22, pp. 457–479, 2004.
[13] Y. J. Huang and S. Kurohashi, "Extractive summarization considering discourse and coreference relations based on heterogeneous graph," in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 3046–3052.
[14] A. Joshi, E. Fidalgo, E. Alegre, and L. Fernández-Robles, "DeepSumm: Exploiting topic models and sequence to sequence networks for extractive text summarization," Expert Systems with Applications, vol. 211, p. 118442, 2023.
[15] A. Lauscher, G. Glavaš, and K. Eckert, "University of Mannheim@CLSciSumm-17: Citation-based summarization of scientific articles using semantic textual similarity," in CEUR Workshop Proceedings, vol. 2002, RWTH Aachen, 2017, pp. 33–42.
[16] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880.
[17] Z. Li, W. Wu, and S. Li, "Composing elementary discourse units in abstractive summarization," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6191–6196.
[18] X. Liang, S. Wu, M. Li, and Z. Li, "Improving unsupervised extractive summarization with facet-aware modeling," in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 1685–1697.
[19] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74–81.
[20] Z. Liu, K. Shi, and N. Chen, "DMRST: A joint framework for document-level multilingual RST discourse segmentation and parsing," in Proceedings of the 2nd Workshop on Computational Approaches to Discourse, 2021, pp. 154–164.
[21] W. C. Mann and S. A. Thompson, "Rhetorical structure theory: Toward a functional theory of text organization," Text - Interdisciplinary Journal for the Study of Discourse, vol. 8, no. 3, pp. 243–281, 1988.
