Thesis: Automatic Presentation Slides Generator for Scientific Academic Paper

Contents

  • 1.1 Problem Statement
  • 1.2 Goals
  • 1.3 Scope
  • 1.4 Thesis Structure
  • 2.1 Overview of scientific papers
  • 2.2 Overview of slide presentation
  • 3.1 Question Answering problem
    • 3.1.1 Introduction
    • 3.1.2 Frequently used question forms
  • 3.2 Question answering system architecture
    • 3.2.1 Information retrieval-based system
    • 3.2.2 Reading comprehension based question answering system
  • 3.3 WordPiece Segmentation
  • 3.4 Sequence to Sequence
    • 3.4.1 Recurrent Neural Network
    • 3.4.2 Seq2seq's architecture
  • 3.5 Attention
    • 3.5.1 Transformer
    • 3.5.2 Encoder and Decoder stacks
    • 3.5.3 Scaled Dot-Product Attention
    • 3.5.4 Multi-Head Attention
    • 3.5.5 Position-wise Feed-Forward Networks
    • 3.5.6 Embeddings and Softmax
  • 4.1 Automatic slide generation
  • 4.2 Document summarization
  • 4.3 State-of-the-Art model utilization
  • 5.1 Query-based text summarization approach (Baseline)
    • 5.1.1 Keyword module
    • 5.1.2 Dense IR module
    • 5.1.3 QA module
    • 5.1.4 Figure extraction module
  • 5.2 Filtering data by proposed semantic search method
    • 5.2.1 Semantic search
    • 5.2.2 Filtering data by using semantic search
  • 5.3 Fine-tuning Dense IR module
    • 5.3.1 Dense IR model
    • 5.3.2 Data Preparation
    • 5.3.3 Loss Function
    • 5.3.4 Evaluation Metrics
  • 5.4 Data Augmentation
  • 6.1 Dataset Analysis
  • 6.2 Data processing approach
    • 6.2.1 Text extraction from papers
    • 6.2.2 Text extraction from slides
  • 6.3 Experiment setup
    • 6.3.1 Dataset
    • 6.3.2 Evaluation metrics
    • 6.3.3 Environment setup
    • 6.3.4 Parameter setup
  • 6.4 Results
    • 6.4.1 State-of-the-Art model utilization
    • 6.4.2 Fine-tune Dense IR model
    • 6.4.3 Filtering data by using semantic search
    • 6.4.4 Data Augmentation
  • 6.5 Web Application
    • 6.5.1 Functional and non-functional requirements
    • 6.5.2 Activity diagram
    • 6.5.3 Technology stack
    • 6.5.4 User interface

List of Figures

  • 3.1 Information retrieval based system with three main components [3]
  • 3.2 Information retrieval based system with four main components [4]
  • 3.3 Reading comprehension based question answering system [5]
  • 3.4 Encoder and Decoder [6]
  • 3.5 The Transformer architecture [7]
  • 3.6 Scaled Dot-Product Attention [7]
  • 3.7 Multi-Head Attention [7]
  • 4.1 The architecture of DOC2PPT [8]
  • 5.1 The system architecture of D2S model [9]
  • 5.2 The structure of semantic search using DistilBERT and Faiss
  • 5.3 The data filtering process using semantic search
  • 5.4 The difficulty in finding benchmark and data
  • 5.5 The difficulty in finding pre-trained model and data
  • 5.6 An example in the training dataset
  • 5.7 Using CATTS model to generate summaries for data augmentation
  • 6.1 Title and author of a paper in the dataset
  • 6.2 Title and author are extracted using GROBID
  • 6.3 An example of slide extraction by using Azure OCR
  • 6.4 The flow of generating slides
  • 6.5 Home page
  • 6.6 Upload page
  • 6.7 Input slide title and keywords page

List of Tables

  • 6.2 Descriptive statistics of presentations: total count and average number (in parentheses)
  • 6.3 Descriptive statistics of documents: total count and average number (in parentheses)
  • 6.4 Descriptive statistics of presentations: total count and average number (in parentheses)
  • 6.5 Evaluation results on the test set when using BERT and RoBERTa models
  • 6.6 Evaluation results of fine-tuned models
  • 6.7 Evaluation results on the test set when using the fine-tuned DistilBERT model
  • 6.8 The number of examples in each prefiltered and filtered dataset
  • 6.9 Evaluation results of fine-tuned models using datasets filtered by using semantic search
  • 6.10 Evaluation results on the test set when using the DistilBERT model fine-tuned ...
  • 6.11 The number of examples in each augmented dataset
  • 6.12 Evaluation results of fine-tuned models using datasets combining filtered ...
  • 6.13 Evaluation results on the test set when using the DistilBERT model fine-tuned ...

Content


Problem Statement

Presentations are commonly used in business, education, and research as they are visually effective in summarizing and explaining work to an audience. Designing a presentation is often considered an art form, requiring patience and effort to create an excellent presentation. It is essential to be able to concisely and aesthetically explain complex ideas while abstracting them. We can enhance presentations by selecting layouts, adding illustrations, and incorporating effects. The content presented significantly affects the presentation's quality, as it is the summary of the subject matter.

In natural language processing, text summarization is a common task. It involves breaking down lengthy text into manageable paragraphs or sentences. There are two types of text summarization: extractive and abstractive. Extractive summarization selects a subset of the text's sentences to create a summary. Abstractive summarization reorganizes the vocabulary in the source text and may add new words or phrases to create a more "human-friendly" summary. With the effectiveness of deep learning models, researchers have integrated the selection strategy into the scoring model to predict the relative importance of previously selected sentences, resulting in better extractive summaries [10]. It can be challenging to generate abstractive summaries for long documents. With an encoder-decoder architecture, the researchers in [11] have added deep communicating agents to address this issue. A group of cooperating agents, each in charge of a segment of the input text, divides the duty of encoding a long text. Each of them individually encodes the given text and broadcasts its encoding to the other agents, which allows the agents to share global context information regarding various sections of the document.

Text summarization models are employed to summarize scientific research papers. This helps researchers save time in the literature review phase. It also facilitates the speedy selection of pertinent research papers and the elimination of less significant ones. As part of a search engine that allows users to select categorized values like scientific tasks, datasets, and more, IBM researchers introduced a novel system in 2019 [12] that provides summaries for articles in the field of computer science. Building on the success of deep learning models utilizing the transformer architecture, such as BART and BERT, researchers developed the CATTS model in 2020 to generate TLDRs for scientific papers [13].

When presenting scientific research in seminars, a presentation is necessary. To make this easier, researchers have developed ways to automate the creation of slides. They use extractive summarization to select important sentences or phrases in the document and an algorithm to divide them into slides. However, this method focuses only on the text and does not include graphic elements. To address this problem, researchers in [14], [9], and [8] have developed abstractive summarization and embedding methods.

While there have been numerous studies regarding the automatic generation of slides for scientific research articles, certain issues remain unresolved. Specifically, the accuracy, quality, and naturalness of the slide content require further improvement. Additionally, insufficient attention has been paid to the data used to train the models. As a result, our team has decided to delve into the topic of "Automatic Slide Creation for Scientific Academic Paper" to utilize recent advancements to enhance the outcomes of previous studies.

Goals

The project's main objective is to enhance the quality of automatically created slides by incorporating deep learning and advances in natural language processing into the models. The committee established the following sub-goals to accomplish the project's main goal:

1. Re-implement models from earlier studies that address issues with automatic slide generation for research papers.

2. Investigate and apply advances in deep learning and natural language processing to the actual model.

Scope

In this thesis, our team will create models to generate presentations for scientific papers automatically. To match the skills and experience of our team and make the thesis feasible to complete, we set the following limitations for the "Automatic Slides Generation for Scientific Academic Paper" topic:

• All of the papers utilized are from the discipline of computer science.

• English is used in the generated slides.

• Each slide’s title will be based on the title of the section it corresponds to in the paper.

• One or more slides present each section in the papers.

• Each slide’s content will have numerous bullet points.

• Each bullet point will have one or more sentences to explain it.

• Generated slides can contain the graphical elements in the papers.

• The papers will serve as the source for all graphics in the slides.

Thesis Structure

The structure of the thesis consists of the following chapters:

• Chapter 1: Introduces the problem statement, the reason behind choosing the topic, and the thesis's structure.

• Chapter 2: Provides analyses of the test corpus to identify a solution to the stated problem, the goal, and the details and scope of the subject.

• Chapter 3: Gives an overview of text summarization and the machine learning models used in natural language processing.

• Chapter 4: Briefly describes the topic's content, the results, and any limits to the current body of knowledge.

• Chapter 5: Solutions for Automatic Slide Generation. Clearly describes the methods used and the specific finished works.

• Chapter 6: Provides an overview of multiple datasets, outlines the experimental methodology employed, and presents the resultant findings.

• Chapter 7: Summarizes the results achieved and gives plans for future work on the thesis.

Overview of scientific papers

A scientific paper is a written report presenting the findings of original research, and its structure has been established over many years by editorial practice, scientific ethics, and interactions with printing and publishing companies. A wide variety of articles are published; a few examples include original articles, case studies, technical notes, picture essays, reviews, comments, and editorials. Authors should understand the many types of papers and the criteria used to evaluate them. The abbreviation IMRAD (Introduction, Methods, Results, And Discussion) stands for the fundamental components of a scientific publication. However, it should be noted that most publications have requirements for the format of a paper. Some publishers divide manuscripts into all or part of these components, while others do not, and the sequence may vary between publications.

In general, a scientific paper usually has the following sections:

• Title: The title of a scientific research paper is usually the research topic of that paper. It should use the fewest possible words.

• Keyword list: The keyword list provides the opportunity to add keywords, used by the indexing and abstracting services, in addition to those already present in the title. Judicious use of keywords may increase the ease with which interested parties can locate your article.

• Abstract: The abstract concisely states the principal objectives and scope of the investigation where these are not obvious from the title. It concisely summarizes the results and principal conclusions.

• Introduction: The introduction begins by introducing the reader to the pertinent literature. It helps them establish the significance of your current work.

• Materials and methods: The main purpose of the "Materials and Methods" section is to provide enough detail for a competent worker to repeat your study and reproduce the results. The usual order of presentation of methods is chronological. However, related methods may need to be described together, and strict chronological order cannot always be followed.

• Results: The results section presents the authors' findings. Display items like figures and tables are central to this section.

• Discussion: The discussion section is one of the final parts of a scientific paper, in which an author describes, analyzes, and interprets their findings. It also tells readers why the research results are important and where they fit in the current literature.

• References: The references section provides information for readers who may want to access the sources the authors cite in their papers. It is located at the end of each paper.

Overview of slide presentation

A slide is a single page of a presentation. Collectively, a group of slides may be known as a slide deck. A slide show is an exposition of a series of slides or images on an electronic device or a projection screen. A slide typically consists of two parts:

• Slide title: Shows a slide’s or a series of slides’ principal objective.

• Slide content: Displays the information pertinent to the slide's title. Slide content is often brief and succinct, including only the most important information that has to be delivered. It might be outlined in bullet points. The ideas expressed in each bullet point may be further broken down into sub-bullets. Text, photos, billboards, and slideshow effects are all acceptable slide content.

A presentation comprises several slides organized in a certain order. The strength of a presentation is primarily determined by the information given and the slide designer's imagination. The following three parts will often be featured in a presentation:

• Introduction: The introduction sets the tone for the entire presentation and explains what the audience will come away with after viewing it. Required components for the introduction part are:

– The title of the presentation: Introduce the topic of your presentation and provide a brief description.

– A table of contents / main menu: Presents the main content that will be in the presentation.

– Objectives: Present the goals and outcomes that can be achieved through the presentation.

• The body: The presentation's primary focus is on the body part. The sections indicated in the introduction will all be covered in this part's content. Each presentation's body will be organized differently depending on the presenter's ingenuity.

• Conclusion: This part is a summary of the information presented that emphasizes the key takeaways for the audience. After expressing gratitude for their attention, the speaker will invite questions from the audience.

Question Answering problem

Introduction

Question Answering (QA) is a complex task in natural language processing. It requires knowledge of the text's meaning as well as the capacity to infer relevant facts [15]. In essence, QA is a type of information retrieval that takes the form of genuine human-language questions or queries. When using a question-answering system, the user's purpose is to obtain essential information or data through short, correct replies to questions, in order to acquire knowledge about one or more specific situations. In terms of Artificial Intelligence, developing a question-answering system attempts to train artificial "brains", assisting them in understanding and processing natural human language. QA is such an important issue that most tasks in natural language processing may be described as question-and-answer problems over input language [15].

The question-answering system works on the following principle: after users enter the content to be queried, the system generates answers by searching for information in an already established knowledge system, which can be pre-existing documents, specialist websites, or, even more broadly, standard datasets used in assessment. Question-answering systems are classified into two broad types based on the knowledge system's content source:

• Closed-domain: The purpose is to find answers to queries using data from a specific topic or area. The need to build models with small and constrained datasets is the major problem of closed-domain systems [16].

• Open-domain: The purpose is to get answers to inquiries from large-scale literature sources, regardless of specific themes or fields [17]. Such systems need data sources of appropriate scale and precision.

Frequently used question forms

Users can input questions of many sorts and queries of various types in question-answering systems. Nonetheless, the majority of inquiries fit into one of the following categories [18]:

• Factoid: These questions are about natural or societal facts. In English, these queries typically begin with wh-words such as who, what, where, and when. The answer is generally a single fact mentioned in the context. "Which is Vietnam's largest city?" is an example of a factoid question.

• List: The query's content is similar to a factoid question's, but the answer to this form of inquiry will be a series of entities or data. For example, the context reveals the names of Mr. H's three children, and the question is "What are Mr. H's children's names?"

• Definition: Requires extensive data access. A term, phrase, whole sentence, or even a chunk of text retrieved from the input context is generally the response to a definition query. In English, this type of question commonly begins with "what is" or "what are".

• Hypothetical: Obtains knowledge from the context of hypothetical (non-real) occurrences. In English, the question is commonly phrased as "What would happen if ...?". The answer is a piece of information in the context event.

• Causal: Unlike definition inquiries, which seek to understand the notion of an entity, cause-effect queries seek to understand the "how" and "why" of events or objects. Because text analysis necessitates extensive language processing techniques, this is one of the most difficult question types to answer.

• Confirmation: Questions with a Yes or No answer, verifying the occurrence of an event or phenomenon in the context; or confirming whether the query content fits what is stated in the context.

Question answering system architecture

Information retrieval-based system

The units are split differently between studies. The system will be comprised of three major components [3], which are as follows (Figure 3.1):

• Question processing: Building the query and determining the structure of the answer are the two basic processes in answering questions. During the query creation process, a query for the question is built to fetch related documents. In the response type identification stage, a classifier is utilized to categorize the questions based on the predicted response type.

• Document and passage retrieval: The query formed in the query creation process is fed into an information retrieval engine, which returns the most relevant retrieved documents. Because response analysis models primarily work on small segments of documents, the recovered texts are subjected to a paragraph retrieval model to get short text fragments. This is the element responsible for locating passages or phrases that are comparable to the input query.

• Answer extraction: The most appropriate response is derived from the supplied paragraph. The similarity between the input question and the retrieved responses must be measured in this stage.

According to [4], the two components of document retrieval and paragraph retrieval have been separated, resulting in an information retrieval system with four primary components (Figure 3.2).

Figure 3.1: Information retrieval based system with three main components [3]

Figure 3.2: Information retrieval based system with four main components [4]

Reading comprehension based question answering system

The system will be divided into four major components: embedding, feature extraction, question-context interaction, and answer prediction. Each module serves the following purposes:

• Embeddings: Because the computer cannot understand natural language, the embedding module is required at the front of the system to convert the input words into fixed-length vectors. Taking the context and questions as input, this module embeds the context and the questions using several methods.

• Feature extraction: The contexts and questions are fed into the feature extraction module after passing through the embedding module to assist in better comprehending the context and questions. The primary goal of this module is to extract additional contextual information.

• Context-question interaction: The relationship between the context and the question is significant in anticipating the answer. With this information, the machine can determine which portion of the context is most essential in answering the question. To that purpose, a one-way or two-way attention mechanism is extensively utilized in this module to emphasize aspects of the context that are important to the question. The following part will go over the attention process in further depth.

• Answer prediction: This is the system’s last component, providing an answer to a question based on all of the information gathered from earlier modules.

An example model for a reading comprehension-based system is presented in Figure 3.3.

Figure 3.3: Reading comprehension based question answering system [5]

WordPiece Segmentation

WordPiece is a natural language processing algorithm that is used to segment subwords. The vocabulary in this method is initialized with language-independent characters. The most common character combinations are then added to the lexicon in an ongoing process. Yonghui Wu et al. constructed the first WordPiece model, which was published in 2016 [19].

WordPiece generates tokens for single words using the greedy method of longest-match-first: the algorithm repeatedly selects the longest prefix of the remaining text that matches a word in the model's vocabulary. This method, known as MaxMatch, has been used to segment Chinese words since the 1980s. Although it has been widely used in NLP for decades, the commonly used MaxMatch approach has a complexity of O(n^2), with n the length of the input word, or O(nm), with m the maximum length of a token in the dictionary. It should be noted that when the dictionary contains long terms, the factor m might be rather significant [20].
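A minimal Python sketch may make the greedy longest-match-first rule concrete. The toy vocabulary and the "##" continuation prefix (as used by BERT's WordPiece) are illustrative assumptions, not the thesis's implementation:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first (MaxMatch) segmentation sketch."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:                       # try the longest prefix first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate     # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                        # no vocabulary entry matches
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))    # ['un', '##aff', '##able']
```

The inner loop shortens the candidate one character at a time, which is what gives the O(nm) worst-case behavior discussed above.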

Sequence to Sequence

Recurrent Neural Network

A Recurrent Neural Network (RNN) is a network that operates on a sequence and uses its output as input for the next steps. The RNN is made up of a hidden state h and an optional output y, and it operates on a variable-length input sequence x = (x_1, x_2, ..., x_t) [6]. At each time t, the hidden state h_t of the neural network is updated according to equation 3.1:

h_t = f(h_{t-1}, x_t)    (3.1)

where f denotes a non-linear activation function. The function f might be as basic as a sigmoid function or as sophisticated as a Long Short-Term Memory (LSTM) network. An RNN is trained to anticipate the next symbol in a sequence, thereby learning the probability distribution of that sequence. In this situation, the output at each time t is expressed as a conditional probability dependent on the previous time steps [6].
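As a small illustration, the recurrence in equation 3.1 can be sketched in a few lines of numpy; f is chosen here as tanh over a linear map, and all parameters are random placeholders rather than trained weights:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # h_t = f(h_{t-1}, x_t) with f = tanh of a linear combination
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

hidden_dim, input_dim = 4, 3
rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a sequence of 5 input vectors
    h = rnn_step(h, x_t, W_h, W_x, b)         # the hidden state summarizes the prefix
```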

Seq2seq’s architecture

An encoder and a decoder are the two major components of a Seq2seq model. The encoder takes information from an input string and outputs a single vector, while the decoder receives that vector from the encoder and creates an output string as a result.

The encoder is a recurrent neural network that reads each symbol of an input sequence x sequentially [6]. The hidden state of the network changes as it reads each symbol, according to equation 3.1. After reading to the end of the sequence (indicated by a terminator symbol), the hidden state of the neural network is a summary of the complete input sequence, expressed as a context vector c. The Attention section contains more information on the context vector c.

The decoder is a recurrent neural network trained to generate an output sequence by predicting the next symbol y_t when the hidden state h_t is known [6]. Both y_t and h_t depend on the preceding symbol y_{t-1} and the vector c of the input string passed from the encoder. So, the hidden state of the decoder at time t is calculated according to equation 3.2:

h_t = f(h_{t-1}, y_{t-1}, c)    (3.2)

The conditional probability of the symbol y_t is given by equation 3.3:

P(y_t | y_{t-1}, y_{t-2}, ..., y_1) = g(h_t, y_{t-1}, c)    (3.3)

where f and g are the given activation functions. It is important to note that the function g must produce valid probabilities, for example by using a softmax.

Encoder and Decoder representations are provided in Figure 3.4.

Attention

Transformer

The Transformer architecture, proposed by Ashish Vaswani et al. in 2017, is a modeling architecture that eliminates recurrence and instead depends exclusively on the attention mechanism to infer global connections between inputs and outputs [7].

The architecture of the model is shown in Figure 3.5.

Encoder and Decoder stacks

Encoder: A stack of N identical layers. Within each layer are two sublayers: a multi-head self-attention layer and a simple, position-wise fully connected feed-forward network. A residual connection is applied around each of the two sublayers, followed by layer normalization. Each sublayer's output is expressed as LayerNorm(x + SubLayer(x)), where SubLayer(x) is the function implemented by the sublayer itself. All sublayers and embedding layers produce outputs of the same size, 512.

Decoder: Like the encoder, the decoder is made up of N identical layers. In addition to the two sublayers found in each encoder layer, the decoder adds a third sublayer that implements a multi-head attention mechanism over the encoder stack's output. The decoder likewise employs residual connections around the sublayers, followed by layer normalization. The self-attention sublayer is modified to prevent positions from attending to subsequent positions; this masking guarantees that predictions for position i are based solely on the known outputs before i.

Figure 3.5: The Transformer architecture [7]
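A minimal numpy sketch of the residual-plus-normalization pattern LayerNorm(x + SubLayer(x)) described above; the sublayer here is an arbitrary placeholder linear map:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def apply_sublayer(x, sublayer):
    # residual connection around the sublayer, then layer normalization
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)                  # 10 positions, d_model = 512
W = np.random.randn(512, 512) * 0.01          # placeholder sublayer weights
out = apply_sublayer(x, lambda v: v @ W)      # shape (10, 512)
```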

Scaled Dot-Product Attention

The architecture of this Attention is illustrated in Figure 3.6.

The inputs consist of queries and keys of dimension d_k and values of dimension d_v. The dot product of the query with all keys is calculated, each dot product is divided by √d_k, and the softmax function is then applied to obtain the weights on the values.

Figure 3.6: Scaled Dot-Product Attention [7]

In practice, the attention function is computed concurrently on a collection of queries, which are packed together into a matrix Q. The matrix K is formed by packing the keys together, while the matrix V is formed by packing the values together. Equation 3.4 is used to compute the output matrix:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (3.4)

Two commonly used attention functions are additive attention and dot-product attention [22]. With small d_k, the two functions behave similarly, but as d_k increases, additive attention outperforms unscaled dot-product attention [23]. Vaswani and colleagues suspected that when the value of d_k is big, the magnitude of the dot products grows fast, pushing the softmax function into regions where its gradient is extremely small. To compensate, the dot products are multiplied by the scaling factor 1/√d_k.

Vaswani's team assumes that the components of the two matrices Q and K are independent random variables with mean 0 and variance 1. The dot product q · k = Σ_{i=1}^{d_k} q_i k_i then has a mean of 0 and a variance of d_k, which motivates the scaling.

Multi-Head Attention

Compared to performing a single attention function on d_model-dimensional inputs, it is more effective to perform the projection h times with different learned projections, namely projecting the queries and keys to d_k dimensions and the values to d_v dimensions.

This attention method enables the model to attend to information from distinct subspaces at different positions at the same time. This would be impossible with a single-head attention mechanism.

The Multi-Head Attention function is computed as in equation 3.5, using the matrices described in Section 3.5.3:

MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W^O    (3.5)

where each head_i applies the attention function of equation 3.4 to separately projected queries, keys, and values.
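The following numpy sketch illustrates equation 3.5; the projection matrices are random placeholders standing in for learned parameters:

```python
import numpy as np

def attention(Q, K, V):
    w = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(x, h=8, d_model=512):
    d_k = d_model // h
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        # per-head projections of queries, keys, and values
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) * 0.01 for _ in range(3))
        heads.append(attention(x @ W_q, x @ W_k, x @ W_v))
    W_O = rng.normal(size=(h * d_k, d_model)) * 0.01
    return np.concatenate(heads, axis=-1) @ W_O       # Concat(heads) * W^O

out = multi_head_attention(np.random.randn(10, 512))  # self-attention, shape (10, 512)
```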

Position-wise Feed-Forward Networks

In addition to the attention sublayers, each encoder and decoder layer contains a fully connected feed-forward network that is applied to each position separately and identically. It is made up of two linear transformations with a ReLU activation in between. The transformation of x can be calculated using equation 3.6:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2    (3.6)

Although the linear transformations are the same across all positions, the parameters vary from layer to layer. This may also be described as two convolutional layers with kernel size 1. The dimensionality of the input and output is d_model = 512, while the inner layer has dimensionality d_ff = 2048.
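A minimal numpy sketch of equation 3.6 with the dimensions stated above; the weights are untrained placeholders:

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.01, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.01, np.zeros(d_model)

def ffn(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

out = ffn(rng.normal(size=(10, d_model)))   # shape (10, 512)
```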

Embeddings and Softmax

Vaswani's team [7], like other sequence transduction models, converts input and output tokens into d_model-sized vectors using learned embeddings. In addition, they utilize a conventional linear transformation and a softmax function to turn the decoder output into predicted probabilities for the next token. In Vaswani et al.'s model, the two embedding layers and the pre-softmax linear transformation share the same weight matrix. In the embedding layers, the weights are multiplied by √d_model.

Automatic slide generation

Text summarization has a practical use in automatic slide generation. Scientific papers have their material distilled and condensed, and slides are used to depict the essential points succinctly and effectively. As machine learning and deep learning models have evolved, many researchers have used them to address the issue of automatic slide creation in a variety of ways.

Effective text summarization requires the ability to identify important sentences. The importance scores of the sentences in an academic paper were determined using a regression method [24]. This technique chooses an appropriate regression model using sentence characteristics as input. However, this approach has the drawback of selecting numerous lengthy sentences for presentations, and the chosen key phrases are frequently poor candidates for bullet points, making the slides tedious to read. A phrase-based strategy has been developed to address this issue [25]. It chooses and arranges material using phrases from the academic paper as the fundamental building blocks. When creating slides, the importance score of each phrase and the hierarchical links between phrases are taken into consideration. Support Vector Regression (SVR) is a different technique that may be used to identify important sentences in the text and focus on them [26]. The integer linear programming (ILP) model [24] aids in designing the arrangement of slides once the list of important sentences has been constructed. With a maximum slide length option, it chooses the number of slides on its own. The Microsoft Office API is used to create the draft slides once they have content.

Besides machine learning-based approaches, many researchers have employed deep learning networks to address this issue. Some of them score the sentences by extracting a more thorough list of surface features, taking into account the semantics or meaning of the statement and the context around the current sentence [14]. These models are fitted using syntactic, semantic, and contextual embeddings. A Bullet Point Generation algorithm is suggested to create slides, whose number depends on the information in each section of the papers.

Recently, query-based text summarization has been used as another approach to automatic slide generation, making the contents and titles of slides more natural rather than matched one-to-one from the papers [9]. This approach considers the user-entered slide titles as questions and the paper document as the corpus. It retrieves information related to the slide titles and integrates title-specific key phrases to guide the model in generating slide content. To make the automatically generated slides more vivid and natural, graphic elements in the paper should be used during slide creation. Dense vector IR can be used to compute vector similarities between the captions of figures and tables and the slide titles. A slide with a high similarity score has the corresponding element inserted. In addition, a random forest classifier can be used to filter out slide content that likely cannot be derived from the paired papers.

Tsu-Jui et al. [8] introduced a new approach that creates slides by integrating paraphrasing and layout prediction modules and making use of the underlying patterns found in documents and presentations. They create a network with modules that focus on related tasks. The architecture of their network, DOC2PPT, is illustrated in Figure 4.1.

All of these modules are:

• A Document Reader (DR) encodes sentences and figures in a document.

• A Progress Tracker (PT) retains references to the input (i.e., which section is now being processed) and the output (i.e., which slide is currently being generated) and, depending on the progress made so far, decides when to proceed to the next section or slide.

• An Object Placer (OP) selects the text or figure from the current section to be displayed on the current slide. Additionally, it predicts where and how big each item will be placed on the slide.

• A Paraphraser (PAR) takes the chosen sentence and condenses it before adding it to a slide.

Figure 4.1: The architecture of DOC2PPT [8]

Document summarization

The task of document summarization aims to generate a very short summary of a given document or document set. The NEUSUM model [10] was built by integrating the selection strategy into the scoring model, which directly predicts the relative importance given previously selected sentences. This model consists of a document encoder with a hierarchical architecture and a sentence extractor built with an RNN. The authors also designed a rule-based system to label the sentences in a given document, determining the sentences to be extracted.

Representing a long document for abstractive summarization is a challenge in document summarization. An approach to address this challenge is using deep communicating agents in an encoder-decoder architecture [11]. The task of encoding a long text is divided across multiple collaborating agents, each in charge of a subsection of the input text. Each of these agents encodes its assigned text independently and broadcasts its encoding to the others, allowing the agents to share global context information about different sections of the document.

Researchers at IBM [12] introduced a novel system providing summaries for Computer Science publications as part of a search system that can choose categorized values such as scientific tasks, datasets, and more. They generated a standalone summary for each section. Each of these section-based summaries is eventually composed together into one paper summary.

State-of-the-Art model utilization

Edward Sun and his team [9] have observed several promising results in their research. They present their findings through concise and information-packed slides. The slides are automatically generated and include graphic elements from the original research paper. However, the generated content is still mostly "extractive", and the metrics used to evaluate their research have yielded unsatisfactory outcomes. Since the publication of this academic paper, there have been many successful scientific studies that focus on improving deep learning models.

The BART-LS model [27] is an innovative method of generating natural language that combines the benefits of pre-training and fine-tuning. It uses the BART architecture, a model that can process both the left and right context of a token. The BART-LS model takes this a step further by incorporating a latent space that can capture the meaning and style of the input text. The latent space is learned by minimizing the reconstruction loss of the input text and the contrastive loss between the input and output texts. The BART-LS model can generate varied and articulate text for a range of natural language tasks, including summarizing, paraphrasing, simplifying, and changing writing style.

The Pegasus-X model [28] is an advanced tool for summarizing long texts, capable of handling documents up to 4096 tokens. It builds on the Pegasus architecture, incorporating a larger Transformer encoder and decoder, an extended vocabulary, and a new pre-training objective. Pegasus-X delivers impressive outcomes across various benchmark datasets like CNN/Daily Mail, arXiv, PubMed, and BigPatent. It produces concise, coherent summaries that capture the key points of the original documents while maintaining factual accuracy and grammatical correctness.

The Long T5 model [29] is an advanced natural language processing system that can create high-quality texts from a set of keywords. It is based on the T5 framework, which uses an encoder-decoder structure with a transformer architecture. The Long T5 model improves upon the original by adding more layers, parameters, and attention heads, and by using a unique attention mechanism called Longformer. These upgrades allow it to handle longer and more complex texts like summaries, essays, and stories. The Long T5 model is trained on a vast and diverse collection of text data that covers multiple domains and languages, enabling it to produce texts in multiple languages. It is a robust and adaptable tool for natural language generation, including tasks like text summarization, rewriting, expansion, and completion.

The RoBERTa model [30] is a highly advanced natural language processing system that can carry out a variety of tasks like text classification, question answering, sentiment analysis, and more. It is an improved version of the BERT model, utilizing more data, bigger batch sizes, more training steps, and a more extensive vocabulary. Additionally, it eliminates the next-sentence prediction objective and replaces WordPiece encoding with byte-pair encoding. The RoBERTa model has outperformed the BERT model on various datasets and benchmarks, proving its effectiveness and reliability.

Query-based text summarization approach (Baseline)

Keyword module

The Keyword module extracts the hierarchical structure and vague titles (such as "Experiments" or "Results") from papers. Weak titles are those that are generic and do not provide much information, resembling section headers. This is problematic because such sections can be quite lengthy. By using the Keyword module to extract the parent-child tree of section names and subsection headers, information retrieval and slide content generation can be improved.

Dense IR module

Recent research has introduced various embedding-based retrieval methods ([31], [32], [33]) that perform better than traditional IR techniques like BM25. These methods use the leaf nodes of the parent-child trees from the Keyword module and integrate them into the re-ranking function of a dense vector IR system based on a distilled BERT miniature [34]. The dense vector IR model is trained without gold passage annotations, as the titles of papers and their content are similar. Training reduces the cross-entropy loss of the titles against their original content (derived from the original slides).

The pre-trained IR model generates vector representations for all paper snippets (4-sentence paragraphs) and slide titles. The similarity between the vectors of all snippets from a paper and the vector of a slide title is measured using pairwise inner products.

Maximum Inner Product Search is used to rank paper passage candidates based on their relevance to a specific title. The top ten candidates are chosen as input contexts for the QA module. Section titles and subsection heads (keywords) collected from the Keyword module are also re-ranked using a weighted ranking function, as shown in equation 5.1:

α(emb_title · emb_text) + (1 - α)(emb_title · emb_text_kw)    (5.1)

where emb_title, emb_text, and emb_text_kw are the embedding vectors, based on the pre-trained IR model, for a given title, a text snippet, and the leaf-node keyword from the Keyword module that contains the text snippet.
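A small numpy sketch of the re-ranking rule in equation 5.1; the value of α and the assumption that embeddings have already been computed by the dense IR model are illustrative:

```python
import numpy as np

def rerank_score(emb_title, emb_text, emb_text_kw, alpha=0.5):
    # alpha * (title . snippet) + (1 - alpha) * (title . containing keyword)
    return alpha * (emb_title @ emb_text) + (1 - alpha) * (emb_title @ emb_text_kw)

def top_candidates(emb_title, snippet_embs, keyword_embs, k=10, alpha=0.5):
    scores = np.array([rerank_score(emb_title, t, kw, alpha)
                       for t, kw in zip(snippet_embs, keyword_embs)])
    return np.argsort(scores)[::-1][:k]   # indices of the top-k snippets
```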

QA module

The QA module merges the slide title with associated keywords to form a query, while the top ten dense IR text snippets are concatenated to create the context. The hierarchy of keywords from the Keyword module is used to match titles with groups of keywords. If a title matches header 2.1, it and all its recursive children are added as keywords for the QA module. It is worth noting that not all titles have associated keywords.

The QA model utilized is a fine-tuned BART [35]. The query and context are encoded in the "title[SEP1]keywords[SEP2]context" format. Upon input, keywords are embedded as a comma-separated list after the slide title. Integrating keywords into the query aids the model in paying attention to relevant context across all text snippets.
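As a small illustration of this encoding, the snippet below builds the query/context string; the separator tokens follow the format quoted above, while the title, keywords, and snippets are invented placeholders:

```python
title = "Paired Training"
keywords = ["data augmentation", "self-training"]
snippets = [
    "Obtaining a corpus of jointly annotated pairs of sentences and AMR graphs is expensive ...",
    "We set the initial Gigaword sample size to k = 200,000 ...",
]

# "title[SEP1]keywords[SEP2]context", with keywords as a comma-separated list
qa_input = f"{title}[SEP1]{', '.join(keywords)}[SEP2]{' '.join(snippets)}"
```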

Creating slides from a paper is a creative process that depends on the author's style. Some authors may include anecdotes or facts not present in the related publication. To account for this, the QA model is refined with filtered training data, so that generated slide content is better supported by the paired paper's sentences.

Figure extraction module

Slide decks are considered incomplete without captivating visual elements that capture the audience's attention. To create a comprehensive slide deck, a procedure is established for incorporating linked figures and tables from the article. The dense vector IR module is used to determine vector similarities between the slide title and the expanded keywords present in the captions of the figures and tables. Based on the ranking of the figures and tables, a final set of recommendations is prepared and shared with the user.

Filtering data by proposed semantic search method

Semantic search

Semantic search [36] is a technique that aims to improve the accuracy and relevance of search results by understanding the intent and context of the user query. Unlike traditional keyword-based search, semantic search uses natural language processing, machine learning, and knowledge graphs to analyze the meaning and structure of the query and match it with the most appropriate documents or data. Semantic search can provide more precise and personalized answers, as well as suggestions for related topics or queries. It can also handle complex or conversational queries that involve multiple concepts, synonyms, or ambiguities.

The concept of semantic search involves placing all entries in a corpus, whether they are sentences, paragraphs, or documents, into a vector space. During a search, the query is also placed into this vector space, and the closest match from the corpus is located. To begin this process, a large language model is used to convert input content into arrays of numbers, which is known as vectorization. Similar concepts will have similar numerical values. These vectors are then stored in vector databases, which are specialized systems for storing these numerical arrays and finding matches. Once the vector indexes are created, the final step is to conduct the search. Vector search transforms the input query into a vector and searches for the best match, resulting in the most suitable conceptual fit. The way semantic search works is shown in Fig 5.2. Here are some benefits of semantic search:

• Improve accuracy and relevance of search results.

• Enhance user experience and satisfaction.

• Increase discoverability and diversity of information.

• Better understanding of user behavior and preferences.

There are two types of semantic search (a minimal sketch of the symmetric case follows the list):

• Symmetric semantic search: Both the input query and the entries in the corpus should be of similar length and have the same amount of content. For instance, when searching for similar questions, if the input query is "How to learn Python online?", the matched entry could be "How to learn Python on the web?". For symmetric tasks, the query and the entries in the corpus can be flipped.

• Asymmetric semantic search: Typically, a short input query is used, such as a question or some keywords, with the expectation of receiving longer paragraphs in response that answer the query. For instance, if a user asks "What is Python?", the expected answer would be something like "Python is a programming language that is interpreted, high-level, and general-purpose. It follows a design philosophy ...". In cases where tasks are asymmetric, flipping the query and the entries in the corpus does not make sense.
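The sketch below shows symmetric semantic search with the sentence-transformers library; the model name matches the one used later in this section, and the corpus sentences are invented:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

corpus = [
    "How to learn Python on the web?",
    "Best laptops for programming",
    "Online courses for learning Python",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("How to learn Python online?", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```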

Figure 5.2: The structure of semantic search using DistilBERT and Faiss

Filtering data by using semantic search

We have found semantic search to be effective in tasks related to information retrieval. Recognizing the potential of this method, we utilized it to filter data and remove content from slides that did not match the paper. To accomplish this task, we use the sentences within each scientific paper as a source of knowledge. The sentences within each slide that pertain to the scientific paper are used as search queries. We apply a symmetric semantic search to match the sentences in the slide with those in the scientific paper. The results returned are similarity scores and indexes that correspond to the matching sentences within the scientific paper. To classify the content in a slide for a scientific paper, we establish a threshold value called t. If the similarity score between the slide content and the paper is equal to or greater than t, we keep the slide content. If the similarity score is less than t, we discard it. The threshold value t can vary between 0 and 1.

To embed sentences in the papers into the vector space, we use the sentence-transformers library of SBert. The sentence-transformers library is a Python framework that allows users to easily compute semantic similarity between sentences. It is based on the popular transformers library and provides various models and methods for sentence embedding, clustering, paraphrasing, and evaluation. The library also supports multilingual and cross-lingual scenarios, as well as custom training and fine-tuning of existing models. The sentence-transformers library is designed to be user-friendly, efficient, and scalable, and can be integrated with various downstream applications such as information retrieval, question answering, text summarization, and natural language generation. For this task, we utilize the "all-mpnet-base-v2" model, which outperforms other models in symmetric semantic search.

To build the vector database, we use the Faiss library. Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It was developed by Facebook AI Research.
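A minimal sketch of the filtering pipeline built from these two libraries: paper sentences are indexed with Faiss, each slide sentence is searched against the index, and it is kept only if its best similarity score reaches the threshold t. The sentences and the threshold value here are illustrative:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

paper_sentences = [
    "Obtaining a corpus of jointly annotated pairs of sentences and AMR graphs is expensive.",
    "Our paired training procedure is largely inspired by Sennrich et al.",
]
slide_sentences = [
    "Paired Training: scalable data augmentation algorithm",
    "Thank you for listening!",   # no support in the paper; should be discarded
]

paper_emb = model.encode(paper_sentences, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(paper_emb.shape[1])   # inner product on unit vectors = cosine
index.add(paper_emb)

t = 0.6
slide_emb = model.encode(slide_sentences, normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(slide_emb, k=len(paper_sentences))
kept = [s for s, best in zip(slide_sentences, scores[:, 0]) if best >= t]
```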

After the analysis and design process, the data filtering process using semantic search is shown in Fig 5.3. To illustrate this data filtering method, we use the content of the paper "Neural AMR: Sequence-to-Sequence Models for Parsing and Generation" as a knowledge base. Semantic search is used to find the top 5 sentences in the paper that are similar to the sentence "Paired Training: scalable data augmentation algorithm". The returned results are shown in Table 5.1. If the threshold value t is set to 0.6, this sentence is considered to be derived from the paper and is retained, because the highest similarity score is 0.691. If the threshold value t were higher than 0.691, this sentence would be discarded. Therefore, finding a suitable threshold value t is very important.

Rank  Score  Sentence
1     0.691  Paired Training Obtaining a corpus of jointly annotated pairs of sentences and AMR graphs is expensive and current datasets only extend to thousands of examples.
2     0.486  We set the initial Gigaword sample size to k = 200,000 and executed a maximum of 3 iterations of self-training.
3     n/a    While much of this improvement comes from self-training, our model without Gigaword data outperforms these approaches by 3.5 points on F1.
4     n/a    We show that our seq2seq model can learn the same information as a language model, especially after pretraining on the external corpus.
5     n/a    AMR Generation Data Augmentation Our paired training procedure is largely inspired by Sennrich et al.

Table 5.1: Top 5 sentences with the highest similarity score to the given sentence.

Figure 5.3: The data filtering process using semantic search

Fine-tuning Dense IR module

Dense IR model

To retrieve relevant context for a given question, the Dense IR model encodes the question and context into the same vector space. The context vectors are stored in a vector database, and the retriever encodes the question to compare it with the context vectors in the database. To maintain semantic links between sentences in scientific papers, we group four consecutive sentences into passages and encode each passage into a vector embedding using the Dense IR model. For the task of retrieving texts comparable to a query, the sentence-transformers library in SBert offers several suitable models that are trained on the MS MARCO dataset. Based on the published training results (https://www.sbert.net/docs/pretrained-models/msmarco-v3.html), we chose the "msmarco-distilbert-base-tas-b" model for optimal outcomes.

Figure 5.5: The difficulty in finding pre-trained model and data
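A minimal sketch of this passage construction and encoding step; the sentences are placeholders, while the model name is the one stated above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("msmarco-distilbert-base-tas-b")

def make_passages(sentences, size=4):
    # group four consecutive sentences into one passage
    return [" ".join(sentences[i:i + size])
            for i in range(0, len(sentences), size)]

sentences = ["Sentence 1.", "Sentence 2.", "Sentence 3.",
             "Sentence 4.", "Sentence 5."]      # placeholder paper sentences
passages = make_passages(sentences)
passage_embs = model.encode(passages)            # one vector per passage
```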

Data Preparation

To successfully train a sentence transformer model, each example in the dataset must have a label or structure that helps the model understand whether two sentences are similar or different. There are typically four types of dataset configurations:

• Case 1: The example consists of a pair of sentences with a label indicating their degree of similarity, which could be an integer or a float value.

• Case 2: The example comprises a pair of positive sentences that share similarities, without any label. Examples of such pairs include paraphrases, summaries of full texts, duplicate questions, queries and their responses, and source and target language texts.

• Case 3: The example includes a sentence with an integer label. Using suitable loss functions, this format can easily be converted into three sentences (triplets) consisting of an "anchor", a "positive" of the same class as the anchor, and a "negative" of a different class.

• Case 4: The example is a triplet (anchor, positive, negative) without classes or labels for the sentences.

To generate scientific paper presentations, we faced a shortage of labeled data. Therefore, we chose the second data setup, wherein each example consists of a query and a response. We used the slides that correspond to the scientific papers to train the Dense IR model. The slide names act as queries, and their contents serve as responses. Please refer to Figure 5.6 for a better understanding of the data structure.

Figure 5.6: An example in the training dataset
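In code, this second setup corresponds to unlabeled positive pairs; the sketch below uses the sentence-transformers InputExample class, with invented titles and contents as placeholders:

```python
from sentence_transformers import InputExample

# each example pairs a slide title (query) with the slide's content (response)
train_examples = [
    InputExample(texts=["Paired Training",
                        "Obtaining a corpus of jointly annotated pairs is expensive ..."]),
    InputExample(texts=["Evaluation Metrics",
                        "We report MRR and MAP on the held-out test queries ..."]),
]
```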

Loss Function

Each of the four different data formats is accompanied by its specific loss function. Based on our structured data, we employ the multiple negatives ranking (MNR) loss function [38], which proves especially valuable when handling datasets comprising positive sentence pairs. The MNR loss function aims to maximize the similarity between an input and its corresponding output while minimizing the similarity between an input and any other output in a given batch. The MNR loss function can be seen as a generalization of the contrastive loss function, which only considers one negative output per input.

To understand the MNR loss function, we must first establish that P(y|x) represents the probability that a response y matches a query x, which is then utilized to rank potential responses y for an input x. Equation 5.2 is employed to determine this probability:

P(y|x) = P(x, y) / Σ_{y'} P(x, y')    (5.2)

Next, the joint probability P(x, y) is evaluated by a neural network scoring function S, as in equation 5.3:

P(x, y) ∝ e^{S(x, y)}    (5.3)

The combination of equations 5.2 and 5.3 provides an approximate probability used to train the neural networks, as in equation 5.4:

P_approx(y|x) = e^{S(x, y)} / Σ_{y'} e^{S(x, y')}    (5.4)

The MNR loss function is determined in each training batch of stochastic gradient descent by evaluating K input slide titles x = (x_1, x_2, ..., x_K) and their corresponding responses y = (y_1, y_2, ..., y_K). Each slide content y_j functions as a negative candidate for x_i if i ≠ j, with the K - 1 negative examples for each x_i being different at every pass through the data due to shuffling in stochastic gradient descent. The loss function is computed as the approximated mean negative log probability of the data, as in equation 5.5:

L(θ) = -(1/K) Σ_{i=1}^{K} [ S(x_i, y_i) - log Σ_{j=1}^{K} e^{S(x_i, y_j)} ]    (5.5)

using equation 5.4, where θ represents the word embeddings and neural network parameters used to calculate S.
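In practice, this objective is available directly in the sentence-transformers library; the sketch below fine-tunes the dense IR model with it on positive pairs as described above (the model name follows the earlier section, while the data and hyperparameters are illustrative):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("msmarco-distilbert-base-tas-b")

train_examples = [
    InputExample(texts=["slide title 1", "slide content 1"]),
    InputExample(texts=["slide title 2", "slide content 2"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# in a batch of size K, the other K - 1 contents act as negatives (equation 5.5)
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```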

The MNR loss function has several advantages and disadvantages. On the positive side, it can:

• Easily handle datasets with many classes and imbalanced labels without requiring the computation of the softmax over all possible classes.

• Learn from difficult negative samples, which can enhance the model's discriminative power and decrease the false positive rate.

• Be straightforward to implement and optimize using stochastic gradient descent, by sampling a batch of positive and negative samples for each iteration.

On the negative side, the MNR loss function may:

• Suffer from the sampling bias issue, where the negative samples in a batch may not accurately represent the true distribution of negative samples. This can lead to suboptimal ranking performance and slow convergence.

• Be impacted significantly by the choice of hyperparameters, such as the batch size, the number of negative samples per positive sample, and the margin parameter, which must be carefully calibrated to balance ranking accuracy and computational efficiency.

• Be affected by label noise, which can introduce confusion and inconsistency in the ranking objective and negatively impact the model’s quality.

Evaluation Metrics

When evaluating the performance of an information retrieval model, various metrics are considered. In our system, the Dense IR model retrieves the top 10 passages related to the slide title entered by the user. These become the contexts used by the QA model to generate slide content. This process involves ranking, and we evaluate the Dense IR model using two commonly used rank-aware metrics: Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP).

Mean reciprocal rank (MRR) [39] is a metric that evaluates the quality of a ranking system. It is defined as the average of the reciprocal ranks of the first relevant item for a set of queries. The reciprocal rank is the inverse of the rank position of the first relevant item; for example, if the first relevant item is ranked 3rd, the reciprocal rank is 1/3. MRR ranges from 0 to 1, with higher values indicating better ranking performance. Equation 5.6 shows how MRR is calculated:

MRR = (1/|Q|) Σ_{q∈Q} 1/k_q    (5.6)

where Q is the set of queries and 1/k_q is the reciprocal rank of the query q.
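A few lines of Python make equation 5.6 concrete; the rank list is an invented example:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    # each entry is the 1-based rank of the first relevant item for one query
    return sum(1.0 / k for k in first_relevant_ranks) / len(first_relevant_ranks)

print(mean_reciprocal_rank([1, 3, 2]))   # (1 + 1/3 + 1/2) / 3 ≈ 0.611
```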

MRR has some advantages:

• Simple to compute and easy to interpret.

• Put a high focus on the first relevant element of the list.

• Good for known-item searches such as navigational queries or looking for a fact.

It also has some disadvantages:

• Does not evaluate the rest of the list of recommended items; it focuses on a single item from the list.

• Gives a list with a single relevant item just as much weight as a list with many relevant items.

• Might not be a good metric for comparing lists with multiple relevant items.
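As a small illustration, a minimal Python sketch of MRR over binary relevance judgments (1 = relevant, 0 = not relevant, in ranked order) might look like this; the input format is an assumption made for the example:

    def mean_reciprocal_rank(results):
        # results: one list per query with the relevance of the retrieved
        # items in ranked order, e.g. [0, 0, 1, 0] gives a reciprocal rank of 1/3.
        total = 0.0
        for rels in results:
            for rank, rel in enumerate(rels, start=1):
                if rel:
                    total += 1.0 / rank
                    break  # only the first relevant item counts
        return total / len(results)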

Mean average precision (MAP) [40] is a metric that evaluates the performance of an information retrieval system. It measures how well the system retrieves relevant documents for a given query and ranks them according to their relevance. MAP is calculated by averaging the precision values at different recall levels for each query and then averaging the results over all queries, as shown in Equation 5.7:

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (5.7)

where N is the total number of queries and AP_i is the average precision of the i-th query.

To understand Equation 5.7, it is crucial to comprehend precision, precision at k, and average precision. Precision is calculated by Equation 5.8 as the ratio of correctly identified positive values to the total number of predicted positive values:

\text{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \quad (5.8)

where the relevant documents set refers to the ground-truth values or the expected documents, and the retrieved documents set includes the documents selected by the retrieval model based on similarity scores. Usually, precision takes all the retrieved documents into account; however, we can calculate precision only for the top k documents. This metric is called precision at k, or P(k), and it is used to calculate average precision (AP).

AP is the area under the precision-recall curve. For information retrieval models, where recall is a less critical metric, AP can be calculated using Equation 5.9:

AP = \frac{1}{RD} \sum_{k=1}^{N} P(k)\, r(k) \quad (5.9)

where RD is the number of relevant documents for the query, N is the total number of documents, P(k) is the precision at k, and r(k) is the relevance of the k-th retrieved document (0 if not relevant, 1 if relevant).

MAP has several advantages:

• Gives a single metric that represents the complex area under the precision-recall curve, providing the average precision per list.

• Handles the ranking of recommended lists naturally, in contrast to metrics that treat the retrieved items as a set.

• Is sensitive to changes in the system's performance across different queries and recall levels.

MAP also has some disadvantages:

• Is not suitable for fine-grained numerical ratings, because MAP cannot extract an error measure from such information.

• May be influenced by the choice of relevance threshold or the number of documents to evaluate, which can vary across different domains or tasks.
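Analogously, AP and MAP from Equations 5.9 and 5.7 can be sketched in a few lines of Python, again assuming binary relevance lists in ranked order:

    def average_precision(rels):
        # rels: relevance (1/0) of the retrieved documents in ranked order;
        # RD in Equation 5.9 is the number of relevant documents.
        rd = sum(rels)
        if rd == 0:
            return 0.0
        hits, ap = 0, 0.0
        for k, rel in enumerate(rels, start=1):
            if rel:                  # r(k) = 1
                hits += 1
                ap += hits / k       # hits / k is the precision at k, P(k)
        return ap / rd

    def mean_average_precision(all_rels):
        # Equation 5.7: the mean of the per-query average precisions.
        return sum(average_precision(r) for r in all_rels) / len(all_rels)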

Data Augmentation

Data augmentation is a technique that can enhance the performance of natural language processing models by creating new training examples from existing ones. It can help to overcome the limitations of small or imbalanced datasets, reduce overfitting, and increase the generalization ability of the models. Data augmentation can be applied at different levels of natural language processing, such as the word, sentence, or document level. Some common methods of data augmentation include synonym replacement, back translation, paraphrasing, and text mixing.

Generating slides automatically, like other tasks, can be challenging due to the lack of sufficient training data. When the model does not have enough data to generalize, it cannot perform well. Our system's Dense IR module uses slides from scientific papers as training data; however, some slides contain information not present in the papers, added to explain related concepts. Additionally, during data processing, errors can occur when converting slides to text format. Training on such data can negatively impact the model's quality. Therefore, acquiring more accurate and error-free data is crucial for improving the model's performance.

When writing a scientific paper, it is important to organize the information into clear sections and subsections with titles that summarize what will be discussed. This is similar to how slides in a presentation have titles that provide a summary of the content on that slide. To generate more data for our model training, we use the content from these sections and subsections in the paper to correspond with the slide titles and content. However, the section/subsection content tends to be longer than the slide content, so we use text summarization techniques to make the content more concise without sacrificing important information. This summarized content, along with the section/subsection titles and corresponding data from slides, is then used for training our Dense IR model. Figure 5.7 shows how we perform data augmentation using the content of scientific papers.

The quality of the data we generate depends on the model we choose. After careful research, we selected two models, T5 and CATTS, to create summaries for different sections or subsections. T5 [41] is a natural language processing model that can perform various tasks, such as text summarization, translation, question answering, and text generation. T5 stands for Text-To-Text Transfer Transformer, which means it takes text as input and produces text as output. T5 uses attention mechanisms to understand the relationships between words and sentences. It is trained on a large corpus of text from various domains and languages, making it versatile for new tasks. CATTS is the model used in research [13] to generate TLDRs, a new form of extreme summarization for scientific papers; this model outperforms strong baselines.

Figure 5.7: Using the CATTS model to generate summaries for data augmentation
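As an illustrative sketch of the summarization step, a section can be condensed with a publicly available T5 checkpoint from the Hugging Face transformers library; the checkpoint name and generation parameters below are assumptions for the example, not the exact configuration of our pipeline:

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    section_text = "..."  # content of one paper section or subsection
    inputs = tokenizer("summarize: " + section_text, return_tensors="pt",
                       truncation=True, max_length=512)
    summary_ids = model.generate(inputs["input_ids"], max_length=128,
                                 num_beams=4, early_stopping=True)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))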

Dataset Analysis

The DOC2PPT dataset is the dataset used in the "DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents" paper [8]. The authors collected 5,873 paired scientific documents and associated presentation slide decks (for a total of about 70K pages and 100K slides, respectively) from academic proceedings, focusing on three research communities: computer vision (CVPR, ECCV, BMVC), natural language processing (ACL, NAACL, EMNLP), and machine learning (ICML, NeurIPS, ICLR). The DOC2PPT dataset can be downloaded from this URL: https://drive.google.com/file/d/1Cjc3Ae_knnMzjAIzg1sHc4jIEs1-5x7_/view

This dataset contains 2,600 paper-slide pairs in computer vision, 931 paper-slide pairs in natural language processing, and 2,342 paper-slide pairs in machine learning. All scientific papers are in PDF format, and each slide in the corresponding presentation is a JPG file. Tables 6.1 and 6.2 report the descriptive statistics of this dataset.

          Train/Val/Test   #Sections       #Sentences          #Figures
    ML    1,872/234/236    17,735 (7.6)    801,754 (45.2)      15,687 (6.7)
    Total 4,686/592/595    41,066 (6.99)   1,757,566 (42.8)    48,799 (8.3)

Table 6.1: Descriptive statistics of documents: total count and average number (in parentheses)

          Train/Val/Test   #Slides   #Sentences   #Figures

Table 6.2: Descriptive statistics of presentations: total count and average number (in parentheses)

Data processing approach

Text extraction from papers

Scientific papers are stored in PDF format, which is not suitable for model training. To convert scientific papers from PDF to text format, we use a tool called GROBID [42].

GROBID (GeneRation Of BIbliographic Data) is a machine learning library for extracting, parsing, and re-structuring raw documents such as PDFs into structured TEI/XML encoded documents, with a particular focus on technical and scientific publications. TEI/XML is a standard format for encoding and exchanging texts in the humanities and social sciences. TEI stands for Text Encoding Initiative, a consortium of scholars and institutions that develops and maintains the TEI Guidelines. XML stands for Extensible Markup Language, a generic syntax for creating structured documents. TEI/XML combines the flexibility of XML with the specificity of the TEI Guidelines, which define over 500 elements and attributes for describing various aspects of textual content, structure, and metadata. TEI/XML is widely used for creating digital editions, corpora, dictionaries, manuscripts, and other types of text-based resources. Some of the functionalities of GROBID are:

• Header extraction and parsing from the article in PDF format. The extraction covers the usual bibliographical information (e.g., title, abstract, authors, affiliations, keywords, etc.).

• References extraction and parsing from articles in PDF format.

• Full-text extraction and structuring from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, etc.).

• PDF coordinates for extracted information, allowing the creation of "augmented" interactive PDFs based on bounding boxes of the identified structures.

We use GROBID version 0.7.3 for converting scientific papers to TEI/XML format. The results are illustrated in Figs. 6.1 and 6.2.
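As a sketch, assuming a GROBID server running locally on its default port, the full-text extraction service can be called through GROBID's REST API:

    import requests

    # Send one PDF to the full-text processing service; the response body
    # is the paper encoded as a TEI/XML document.
    with open("paper.pdf", "rb") as pdf:
        response = requests.post(
            "http://localhost:8070/api/processFulltextDocument",
            files={"input": pdf},
        )
    tei_xml = response.text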

Figure 6.1: Title and author of a paper in the dataset

The content of the paper, after being extracted and saved in TEI/XML format, is still not suitable for model training. We therefore take one more step: converting the TEI/XML format to JSON format. The converted data contains the following fields:

• title: stores the title of the paper.

• abstract: stores the content of the abstract section in the paper.

• text: stores the list of sentences in the paper and assigns an ID to each sentence.

• header: stores the information of sections in the paper, which includes the name and the IDs of the start and end sentences of each section.

Figure 6.2: Title and author are extracted using GROBID

• figures: stores the information of graphical elements extracted from the paper, which includes the name, caption, and position in the paper of each graphical element.

An example result of converting TEI/XML to JSON format is shown in Listing 6.1.

3 " a b s t r a c t ": " A u t o m a t e d p r o c e s s i n g of h i s t o r i c a l t e x t s o f t e n r e l i e s on pre - n o r m a l i z a t i o n to m o d e r n word f o r m s

T r a i n i n g encoder - d e c o d e r a r c h i t e c t u r e s to s o l v e such p r o b l e m s t y p i c a l l y r e q u i r e s a lot of t r a i n i n g data, w h i c h is not a v a i l a b l e for the n a m e d task ",

6 " s t r i n g ": " I n t r o d u c t i o n T h e r e is a g r o w i n g i n t e r e s t in a u t o m a t e d p r o c e s s i n g of h i s t o r i c a l d o c u m e n t s, as e v i d e n c e d by the g r o w i n g f i e l d of d i g i t a l h u m a n i t i e s and the i n c r e a s i n g n u m b e r of d i g i t a l l y a v a i l a b l e c o l l e c t i o n s of h i s t o r i c a l d o c u m e n t s "

16 " c a p t i o n ": " T a b l e 2: S e l e c t e d p r e d i c t i o n s from some of our m o d e l s on the M4 text ; B = BEAM, F = F I L T E R, A A T T E N T I O N ",

Listing 6.1: An example result after converting TEI/XML format to JSON format

We also consider the graphical elements contained in scientific papers. Therefore, we use a tool called pdffigures2 [43] to extract the graphical elements and save them in JPG format. Pdffigures2 is a tool for automatically extracting figures, captions, and tables from scholarly documents. It is designed to handle a variety of document formats, layouts, and languages. Pdffigures2 can be used as a standalone application or as a library for other applications that need to process PDF documents. It is written in Scala and uses the PDFBox library for PDF parsing and rendering. The information about the graphical elements is added to the figures field in the JSON file containing the information of the corresponding scientific paper.

Text extraction from slides

Slides corresponding to scientific papers are saved as images in JPG format. To make them suitable for model training, we use Azure OCR to extract the characters and content contained in the slides. Azure OCR (Optical Character Recognition) is a cloud-based service that allows us to extract text and information from images and documents. Azure OCR supports over 100 languages and can handle complex and mixed-language documents. It is powered by advanced machine learning models that are constantly improving and expanding their capabilities. The result of extracting content from slides using Azure OCR is shown in Fig. 6.3.

Figure 6.3: An example of slide extraction by using Azure OCR
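A hedged sketch of this step using the REST form of the Azure Computer Vision Read API (v3.2) is shown below; the endpoint and key are placeholders for an actual Azure resource, and the polling loop reflects the API's asynchronous design:

    import time
    import requests

    ENDPOINT = "https://<resource-name>.cognitiveservices.azure.com"
    KEY = "<subscription-key>"

    # Submit one slide image for analysis; the operation is asynchronous.
    with open("slide_01.jpg", "rb") as f:
        resp = requests.post(
            f"{ENDPOINT}/vision/v3.2/read/analyze",
            headers={"Ocp-Apim-Subscription-Key": KEY,
                     "Content-Type": "application/octet-stream"},
            data=f.read(),
        )
    resp.raise_for_status()
    operation_url = resp.headers["Operation-Location"]

    # Poll until the OCR result is ready, then print the extracted lines.
    while True:
        result = requests.get(operation_url,
                              headers={"Ocp-Apim-Subscription-Key": KEY}).json()
        if result["status"] in ("succeeded", "failed"):
            break
        time.sleep(1)

    if result["status"] == "succeeded":
        for page in result["analyzeResult"]["readResults"]:
            for line in page["lines"]:
                print(line["text"])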

In a presentation, slides often have visual effects, so many slides have similar content.

To clean the content extracted from the slides, we merge the content of slides that are more than 80% similar or that have the same title (a minimal sketch of this check follows Listing 6.2). The content of the slides, after being extracted, deduplicated, and aggregated, is saved in JSON format and includes the following fields:

• slides: stores all of the slide content corresponding to the papers.

• text: stores the list of sentences in the slides after merging.

• page nums: stores the IDs of the slides used in the merge process.

• paper title: stores the name of the corresponding paper.

An example of the final results after the slide extraction is shown in Listing 6.2.

8 "1 Non - t e r m i n a l n o d e s - e n t i t i e s and e v e n t s over the text ",

9 " You want to long bath take ",

11 " You want to take long bath ",

Listing 6.2:An example of final results after slide extraction
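The similarity check referenced above can be sketched with Python's standard difflib; treating the first extracted line as the slide title is an assumption made for the example, and the real pipeline may detect titles differently:

    from difflib import SequenceMatcher

    def should_merge(slide_a, slide_b, threshold=0.8):
        # slide_a, slide_b: lists of text lines extracted from two slides.
        same_title = slide_a[:1] == slide_b[:1]
        similarity = SequenceMatcher(None, " ".join(slide_a),
                                     " ".join(slide_b)).ratio()
        return same_title or similarity > threshold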

Experiment setup

Dataset

For this research, we utilized the DOC2PPT dataset, which includes pairs of papers and slides. All papers cover topics related to computer science and have been presented at various conferences. Since our available resources for training the model were limited, we chose to work with only 500 paper-slide pairs from DOC2PPT to conduct our experiment. To ensure thorough testing, the dataset was separated into three sub-datasets: training, validation, and test sets. The training set includes 350 paper-slide pairs, which make up 70% of the data. The validation set includes 100 paper-slide pairs, accounting for 20%. Lastly, the test set includes 50 paper-slide pairs, accounting for 10% of the data. For more information on each set, please refer to Tables 6.3 and 6.4.

Table 6.3: Descriptive statistics of documents: total count and average number (in parentheses)

Table 6.4: Descriptive statistics of presentations: total count and average number (in parentheses)

Evaluation metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics and a software package specifically designed for evaluating automatic summarization. We use ROUGE scores to evaluate the generated content against the ground-truth slide content. The following metrics are used (a usage sketch follows the list):

• ROUGE-1: measures the number of matching uni-grams between the model-generated text and a human-produced reference.

• ROUGE-2: measures the number of matching bi-grams between the model-generated text and a human-produced reference.

• ROUGE-L: measures the longest matching sequence of words using LCS (Longest Common Subsequence) based statistics.
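As a usage sketch, these scores can be computed with the rouge-score Python package; the two strings below are placeholders:

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    scores = scorer.score("the ground-truth slide content",  # reference
                          "the generated slide content")     # prediction
    print(scores["rouge1"].fmeasure,
          scores["rouge2"].fmeasure,
          scores["rougeL"].fmeasure)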

Environment setup

We utilized the Google Colaboratory (Colab) service for efficient training and utilization of transformer models. Google Colab is a cloud-based platform that provides an interactive notebook environment for writing and executing Python code. It is equipped with free access to GPUs and TPUs and offers a variety of data science, machine learning, and deep learning libraries and tools. Users can also collaborate, share their notebooks, and import data from Google Drive or other sources. We opted for the Google Colab Pro service to ensure a smooth experiment process. This service allows us to:

• Utilize GPUs and TPUs for faster machine learning model training and inference.

• Increase notebook memory and disk space to handle larger datasets and complex computations.

• Run notebooks for a longer duration without interruptions or idle timeouts.

• Collaborate with others in real-time.

• Access thousands of pre-installed libraries and frameworks or install our own using pip or conda.

Parameter setup

Using Google Colab Pro, we can use a powerful NVIDIA A100 graphics card with 40GB of RAM to train our models. During the training process, we establish the following parameters:

Additionally, to gauge our proposed method's effectiveness, we set the threshold value t for filtering data through semantic search to 0.5, 0.4, and 0.3, respectively.

Results

Web Application


References

[1] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, "Zero-shot text-to-image generation," 2021.
[3] Z. Abbasiyantaeb and S. Momtazi, "Text-based question answering from information retrieval and deep neural network perspectives: A survey," CoRR, vol. abs/2002.06612, 2020.
[4] Y. Boreshban, S. M. Mirbostani, G. Ghassem-Sani, S. A. Mirroshandel, and S. Amiriparian, "Improving question answering performance using knowledge distillation and active learning," CoRR, vol. abs/2109.12662, 2021.
[5] S. Liu, X. Zhang, S. Zhang, H. Wang, and W. Zhang, "Neural machine reading comprehension: Methods and trends," CoRR, vol. abs/1907.01118, 2019.
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," CoRR, vol. abs/1706.03762, 2017.
[8] T.-J. Fu, W. Y. Wang, D. McDuff, and Y. Song, "Doc2ppt: Automatic presentation slides generation from scientific documents," 2022.
[9] E. Sun, Y. Hou, D. Wang, Y. Zhang, and N. X. R. Wang, "D2S: Document-to-slide generation via query-based text summarization," 2021.
[10] Q. Zhou, N. Yang, F. Wei, S. Huang, M. Zhou, and T. Zhao, "Neural document summarization by jointly learning to score and select sentences," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (Melbourne, Australia), pp. 654–663, Association for Computational Linguistics, July 2018.
[11] A. Celikyilmaz, A. Bosselut, X. He, and Y. Choi, "Deep communicating agents for abstractive summarization," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), (New Orleans, Louisiana), pp. 1662–
[13] I. Cachola, K. Lo, A. Cohan, and D. Weld, "TLDR: Extreme summarization of scientific documents," in Findings of the Association for Computational Linguistics: EMNLP 2020, (Online), pp. 4766–4777, Association for Computational Linguistics, Nov. 2020.
[14] A. Sefid, J. Wu, P. Mitra, and C. L. Giles, "Automatic slide generation for scientific papers," CEUR Workshop Proceedings, vol. 2526, pp. 11–16, 2019.
[15] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, and R. Socher, "Ask me anything: Dynamic memory networks for natural language processing," CoRR, vol. abs/1506.07285, 2015.
[16] M. A. Kia, A. Garifullina, M. Kern, J. Chamberlain, and S. Jameel, "Adaptable closed-domain question answering using contextualized CNN-attention models and question expansion," IEEE Access, vol. 10, pp. 45080–45092, 2022.
[17] F. Zhu, W. Lei, C. Wang, J. Zheng, S. Poria, and T. Chua, "Retrieving and reading: A comprehensive survey on open-domain question answering," CoRR, vol. abs/2101.00774, 2021.
[18] B. Ojokoh and E. Adebisi, "A review of question answering systems," J. Web Eng., vol. 17, no. 8, pp. 717–758, 2019.
[20] X. Song, A. Salcianu, Y. Song, D. Dopson, and D. Zhou, "Fast wordpiece tokenization," CoRR, vol. abs/2012.15524, 2020.
[21] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," CoRR, vol. abs/1409.3215, 2014.
[22] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (Y. Bengio and Y. LeCun, eds.), 2015.
[23] D. Britz, A. Goldie, M. Luong, and Q. V. Le, "Massive exploration of neural machine translation architectures," CoRR, vol. abs/1703.03906, 2017.
[24] Y. Hu and X. Wan, "Ppsgen: Learning-based presentation slides generation for academic papers," IEEE Transactions on Knowledge and Data Engineering, vol. 27, pp. 1085–1097, 4 2015.