VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY
HO CHI MINH UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
CAPSTONE PROJECT REPORT
Automatic Presentation Slides Generation
Problem Statement
The visual impact of presentations is a powerful tool for summarizing and elucidating bodies of work to an audience, making them invaluable in research, education, and business. A presentation design is often an artistic creation that requires patience, effort, and the ability to abstract complex concepts and explain them clearly and beautifully. By thoughtfully arranging layouts, adding illustrations, and applying effects, the author can elevate the overall quality of a presentation. The body of a presentation serves as an overview of the topic, and its effectiveness hinges on careful content selection and clear communication.
Text summarization is a core task in natural language processing that condenses long text into digestible phrases or paragraphs. It comes in two main flavors: extractive summarization, which creates a summary by selecting a subset of sentences from the original text, and abstractive summarization, which rephrases and, if needed, adds new words to produce more human-friendly summaries. With deep learning, the sentence-selection process can be integrated into a scoring model that directly predicts the relative importance of a sentence given the previously selected sentences, yielding better extractive summaries. Generating abstractive summaries for long documents remains challenging. One approach uses an encoder–decoder architecture with deep communicating agents, where encoding is divided among multiple agents, each handling a portion of the input, and agents share global context by broadcasting their encoded representations to the others.
Text summarization models are used to condense scientific research papers, speeding up literature reviews by saving researchers time and helping them quickly identify relevant work while filtering out less important studies. In 2019, IBM researchers introduced a novel search system that allows users to filter by categories such as scientific tasks and datasets, providing concise summaries for computer science articles. Building on the success of transformer-based deep learning, the CATT model was developed in 2020 to generate TLDRs for scientific papers, leveraging architectures like BART and BERT.
Seminars for presenting scientific findings rely on effective slides, and researchers have extensively studied automating slide creation for scientific papers. These efforts primarily use extractive summarization, which assigns an importance score to each phrase or sentence and applies a selection strategy to assemble key components into slides, often employing integer linear programming to arrange them. Yet these studies largely overlook graphic elements and concentrate on text alone. To address this gap, recent work advances abstractive summarization and embedding methods to better integrate both textual content and visual graphics in automated slide generation.
Although automatic slide creation for scientific research articles has been studied extensively, several challenges remain unresolved, especially in the quality, accuracy, and naturalness of slide content. More emphasis is needed on the training data used to build these models, as data quality and representativeness significantly affect outcomes. To address this gap, our group investigates Automatic Slide Creation for Scientific Academic Papers, aiming to improve on the findings of prior research.
Goals
The project aims to improve automatically generated slides by integrating advanced deep learning and natural language processing techniques into the models. To achieve this overarching objective, the committee defined a set of concrete sub-goals that guide progress, emphasizing improvements in model accuracy, output coherence, linguistic quality, and computational efficiency, along with robust evaluation criteria.
1. Harness the Power of Large Language Models: Employ the capabilities of Long-Context Large Language Models to enhance the depth of understanding in automatic slide generation.
2. Implement Specialized Prompt Engineering Strategies: Develop specialized prompt engineering techniques tailored for Long-Context Models, optimizing their performance in generating concise and contextually rich slide content.
3. Optimize and evaluate prompt-tuned long-context models by rigorously assessing their effectiveness in translating research papers into presentation slides, with ROUGE scores and overall coherence serving as the primary metrics to gauge translation quality, narrative flow, and readability.
Scope
This thesis leverages long-context large language models (LLMs) and external APIs to automatically generate presentation materials for scientific papers. By aligning with our team's expertise and capabilities, this approach streamlines the thesis process, enhances feasibility, and accelerates slide creation while maintaining a coherent narrative and supporting reproducible outputs across publications.
• All of the papers utilized are from the discipline of computer science.
• English is used in the generated slides.
• Each slide’s title will be based on the title of the section it corresponds to in the paper.
• One or more slides present each section in the papers.
• Each slide’s content will have numerous bullet points.
• Each bullet point will have one or more sentences to explain it.
Thesis Structure
The structure of the thesis consists of the following chapters:
Introducing the thesis's structure, the issue statement, and the rationale for selecting the subject.
Examining the available data in order to frame a solution for the stated problem, and presenting the objectives, details, and scope of the topic.
This chapter provides a concise overview of the theoretical foundations of text summarization, underscoring the role of long-context large language models in advancing natural language processing for our specific application. By examining how these models handle extended documents, identify salient information, and generate coherent, high-quality summaries, the discussion links theoretical insights to practical implementation and highlights the potential for scalable, accurate NLP solutions in real-world tasks.
Giving a succinct explanation of the topic’s content, findings, and any knowledge gaps.
• Chapter 5: Solutions for Automatic Slide Generation
Clearly outlining the techniques employed and the particular completed products.
Delivering a summary of several datasets, outlining the experimental strategy used, and presenting the conclusions that were reached.
Providing a summary of the outcomes attained and future plans for the thesis.
Overview of scientific papers
A scientific paper is a written report that summarizes the results of original research. Its format has evolved over many years through interactions with publishers, scientific ethics, and editorial practices, and a range of article types exists, including original articles, case studies, technical notes, picture essays, reviews, comments, and editorials. Authors should be aware of the various paper types and the standards by which they are judged. The essential elements of a scientific publication are represented by IMRAD, which stands for Introduction, Methods, Results, And Discussion. Many publications specify their own format requirements; some publishers structure submissions into these categories, while others do not, and the order may vary from one publication to another.
In general, a scientific paper usually has the following sections:
• Title: A scientific research paper's title typically refers to the paper's research topic. It should contain as few words as possible.
• Keywords: Beyond the keywords already in the title, the keyword list provides space to add additional terms that indexing and abstracting services rely on. By selecting relevant, specific keywords and using them thoughtfully, you improve your article's visibility and make it easier for readers to discover your content.
• Abstract: The abstract clearly outlines the study's primary goals and key parameters that are not immediately evident from the title, and it provides a concise summary of the findings and the main conclusions.
• Introduction: The reader is first introduced to the relevant literature in the introduction section. It aids their understanding of the importance of the ongoing work.
• Materials and methods: This section should provide sufficient detail to allow a capable researcher to replicate the study and reproduce its findings, with experimental procedures described in a clear, typically chronological order. When strict chronological sequencing is not feasible, related methods can be grouped and presented together to maintain coherence while preserving transparency, reproducibility, and a logical flow of the experimental design and protocols.
• Results: The results section presents the authors' findings. Display items like figures and tables are central in this section.
• Discussion: Within a scientific paper, the discussion section is one of the final parts, where the author interprets and analyzes their findings, explaining what the results mean and how they relate to the study's hypotheses and methodology. It highlights the significance of the findings, clarifies their implications for the field, and shows how the work fits into the current literature, addressing whether results support or challenge existing theories. The discussion also acknowledges limitations, suggests directions for future research, and underscores the broader impact of the study.
• References: The references section at the end of the document provides complete source details for all cited works, helping readers access the original sources, verify information, and pursue further research.
Overview of slide presentation
In presentation terminology, a slide is a single page of content, a slide deck is the collection of slides used together to form a complete presentation, and a slide show refers to the display of these slides or images on a projection screen or electronic device. A slide is usually made up of two components:
• Slide title: Shows a slide’s or a series of slides’ principal objective.
• Slide content: Presents only the data relevant to the slide title, keeping the information concise and focused on the core takeaways. When helpful, ideas are organized as a bulleted list, with each bullet's concepts further divided into smaller subpoints to provide detail without clutter. Acceptable components include text, images, tables, and slideshow effects, as long as they reinforce the message and keep the slide visually uncluttered. By aligning every element with the slide's title and prioritizing the most crucial information, the slide communicates clearly.
A presentation consists of multiple slides arranged in a deliberate order, and its strength hinges on the quality of the content and the slide designer's creativity. In practice, a presentation will frequently consist of three sections, forming a clear structure that supports the overall message:
The introduction gives the audience an overview of the presentation, including what it covers, how it is organized, and what they can expect to take away from it. It typically contains the following parts:
– The title of the presentation: Introduce the topic of your presentation and provide a brief description.
– A table of contents / main menu: Presents the main content that will be in the presentation.
– Objectives: Present the goals and outcomes that can be achieved through the presentation.
The body of the presentation is its focal point, delivering content that covers every section outlined in the introduction. This central section addresses all key topics in thorough detail, while its organization varies from one presentation to the next, shaped by the presenter's creativity.
The conclusion summarizes the main information and emphasizes the key takeaways for the audience. A presentation typically ends by thanking the audience and inviting questions.
This chapter provides the essential theoretical background needed for implementing the topic, beginning with Section 3.1, which defines the question-and-answer problem and surveys its widely used question types; Section 3.2 then outlines the overall design of a QA system; and Sections 3.3 through 3.5 present core strategies for building effective QA systems, including WordPiece segmentation, sequence-to-sequence modeling, and attention mechanisms.
Question Answering problem
Introduction
Question Answering (QA) is a complex task in natural language processing that requires understanding text meaning and identifying relevant facts. Essentially, QA is a form of information retrieval expressed through authentic human-language questions. When using a QA system, the goal is to obtain concise, correct answers that convey essential information about a specific situation. In artificial intelligence, developing QA systems aims to train machines to understand and process natural language, enabling them to interpret queries and extract precise knowledge. QA is a foundational area in NLP, with many language-processing tasks described as question-and-answer problems based on input language.
Question-answering systems operate by accepting user queries and returning answers drawn from a pre-built knowledge base. This knowledge base may consist of existing documents, specialized websites, or widely used benchmark datasets. Depending on the knowledge source, QA systems can be categorized into two primary types.
Closed-domain systems aim to solve queries using data from a specific topic or area, focusing on domain-specific knowledge. A key challenge for these models is their need to operate with small, constrained datasets, which can limit performance and generalization. To build effective closed-domain AI, strategies such as careful data curation, domain-adaptive transfer learning, data augmentation, and specialized architectures designed for learning from limited data are employed within the context of domain-specific research [21].
Open-domain question answering aims to retrieve answers from vast, cross-domain literature sources, spanning diverse themes and fields rather than being limited to a specific domain. To provide accurate and comprehensive results, these systems depend on data sources that are large in scale and precise in content [22].
Frequently used question forms
Users can input questions of many sorts and queries of various types in question-answering systems. Nonetheless, the majority of inquiries fit into one of the following inquiry categories [23]:
• Factoid: Factoid questions focus on facts about nature or society and typically begin with wh-words such as who, what, where, when, why, or how. The answer is usually a single factual detail drawn from the surrounding context. For example, "Which is the largest city in Vietnam?" is a classic factoid question.
• List: List-type queries typically yield a collection of entities or data rather than a single item, even when the query resembles an ordinary question; for example, asking "What are Mr H's children's names?" in a context that reveals the names of Mr H's three children produces a compiled list as the response.
• Definition: Extensive data access is necessary to answer definition queries, because responses can be a term, a phrase, an entire sentence, or even a passage drawn from the input context. In English, questions that seek definitions commonly start with "what is" or "what are," indicating that the user expects a concise explanation or a direct excerpt from the source text.
• Hypothetical: Obtains knowledge from the context of hypothetical (non-real) occurrences. The question is commonly phrased as "What would happen if ...?" (English). The answer is a piece of information about the hypothetical event described in the context.
• Causal: Causal inquiries differ from definitional inquiries, which seek to understand the notion of an entity; cause-and-effect questions aim to uncover how and why events or objects occur. In text analysis, answering these questions hinges on extensive language processing techniques, making causal queries among the most challenging tasks to tackle.
• Confirmation: Consists of yes-or-no questions used to verify whether an event or phenomenon has occurred within a given context, and to confirm whether the query content aligns with what is stated there.
Question answering system architecture
Information retrieval-based system
The division into components differs between studies. The system is comprised of three major components [3], which are as follows (Figure 3.1):
• Question processing: Involves two essential steps: constructing the inquiry and determining the structure of the answer. During the question creation phase, a query is built to fetch related documents, enabling effective information retrieval. In the response type identification phase, a classifier is used to categorize the question by its predicted response type, shaping the appropriate answer format.
• Document and passage retrieval: During query formulation, the search query is fed into an information retrieval engine, which returns the top, most relevant documents. Because response analysis models operate on small segments of text, the retrieved documents are processed by a paragraph retrieval model to extract concise text fragments. This paragraph retrieval step locates passages or phrases that closely match the input query, providing short, query-aligned snippets for downstream analysis.
• Answer extraction: The most appropriate response is derived from the supplied paragraph. The similarity between the input question and the retrieved responses must be measured in this stage.
Figure 3.1: Information retrieval based system with three main components [3]
According to [4], the two components of document retrieval and paragraph retrieval have been separated, resulting in an information retrieval system with four primary components (Figure 3.2).
Figure 3.2: Information retrieval based system with four main components [4]
Reading comprehension based question answering system
The system will be divided into four major components: embedding, feature extraction, question-context interaction, and answer prediction. Each module serves the following purposes:
• Embedding: Embeddings are used because computers do not understand natural language directly. The embedding module at the top of the system converts input words into fixed-length vectors, taking context and questions as input to create contextual representations and embed the questions through several approaches.
• Feature extraction: This module takes the contexts and questions processed by the embedding module and enriches them to enhance our understanding of both the context and the questions. The primary goal of this stage is to extract additional contextual information that supports more accurate interpretation and downstream tasks.
• Question-context interaction: This interaction is crucial for answer prediction because it helps identify the most relevant parts of the context for a given question. To achieve this, one-way and two-way attention mechanisms are widely used to spotlight the context elements that matter most to the query. The next section will delve deeper into how the attention process works and why it matters for effective question answering.
• Answer prediction: This is the system’s last component, providing an answer to a question based on all of the information gathered from earlier modules.
An example model for a reading comprehension-based system is presented in Figure 3.3.
Figure 3.3: Reading comprehension based question answering system [5]
WordPiece Segmentation
WordPiece is a natural language processing algorithm used to segment text into subwords. The vocabulary is initialized with language-independent characters, and the lexicon expands over time by continually adding the most frequent character combinations. The first WordPiece model was developed by Yonghui Wu and colleagues and published in 2016 [24].
WordPiece tokenization uses a greedy, longest-match-first strategy to generate tokens for individual words. At each step, it selects the longest prefix of the remaining text that appears in the model's vocabulary, a technique widely known as MaxMatch. This approach has been used for Chinese word segmentation since the 1980s and remains common in natural language processing today. However, the computational cost of MaxMatch can be substantial: it often runs in O(n²) time, where n is the input word length, or O(nm) time if m denotes the maximum token length in the dictionary, and when the vocabulary contains long terms the parameter m can significantly impact performance [25].
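The following is a minimal Python sketch of this longest-match-first procedure. The small vocabulary, the "##" continuation prefix, and the [UNK] fallback are illustrative assumptions in the spirit of common WordPiece implementations, not the exact algorithm of [24]:

# Greedy longest-match-first (MaxMatch) subword tokenization of a single word.
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, cur_substr = len(word), None
        # Take the longest prefix of the remaining characters found in the vocabulary.
        while start < end:
            substr = word[start:end]
            if start > 0:
                substr = "##" + substr  # mark word-internal pieces
            if substr in vocab:
                cur_substr = substr
                break
            end -= 1
        if cur_substr is None:
            return [unk_token]          # no match: the whole word becomes unknown
        tokens.append(cur_substr)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))   # ['un', '##aff', '##able']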
Sequence to Sequence
Recurrent Neural Network
A Recurrent Neural Network (RNN) processes data in sequence, using its own previous output as input for the next step, and it maintains a hidden state h with an optional output y. It accepts a variable-length input sequence x = (x_1, x_2, ..., x_t) and updates the hidden state at each time step t via h_t = f(h_{t−1}, x_t) (3.1), where f is a nonlinear activation function that can be as simple as a sigmoid or as sophisticated as an LSTM (Long Short-Term Memory) network. RNNs are trained to predict the next signal in a sequence, effectively learning the conditional probability distribution of the sequence given the prior observations, with the output at time t reflecting this conditional probability conditioned on all earlier inputs.
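As a small illustration, the recurrent update h_t = f(h_{t−1}, x_t) can be sketched in a few lines of Python; the tanh activation, the dimensions, and the random weights are illustrative assumptions:

# One recurrent step and a loop over a short sequence; the hidden state summarizes the prefix seen so far.
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)   # h_t = f(h_{t-1}, x_t)

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W_h = rng.normal(size=(hidden_size, hidden_size))
W_x = rng.normal(size=(hidden_size, input_size))
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a toy sequence of 5 input vectors
    h = rnn_step(h, x_t, W_h, W_x, b)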
Seq2seq’s architecture
In a sequence-to-sequence (Seq2Seq) model, the encoder and the decoder are the two core components. The encoder processes the input string and encodes it into a single fixed-length vector that summarizes the input. The decoder then takes this vector as its input and generates the corresponding output string, producing the result from the encoded representation.
An encoder is a recurrent neural network that processes an input sequence x symbol by symbol. As each symbol is read, the hidden state updates according to equation (3.1), producing a running summary of the sequence. Once the entire input has been read (typically marked by an end-of-sequence symbol), the hidden state is used as a context vector c that summarizes the complete input sequence. The Attention section provides more information about the context vector c.
The decoder is a recurrent neural network that generates an output sequence by predicting the next symbol y_t, conditioned on the current hidden state h_t. Both the predicted symbol y_t and the hidden state h_t depend on the preceding symbol y_{t−1} and the context vector c produced by the encoder from the input sequence. Therefore, the decoder's hidden state at time t is computed as h_t = f(h_{t−1}, y_{t−1}, c) (3.2).
The conditional probability of the symbol y_t is given by equation (3.3): P(y_t | y_{t−1}, ..., y_1, c) = g(h_t, y_{t−1}, c) (3.3), where f and g are the chosen activation functions. The function g must produce legitimate probabilities, for example via a softmax over the current hidden state h_t and the preceding outputs. The encoder and decoder that make up this framework are illustrated in Figure 3.4.
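The flow of equations (3.1) to (3.3) can be sketched as follows; the encoder and decoder transition functions f_enc and f_dec and the output projection W_out are stand-ins for learned components, so this is an illustrative skeleton rather than a trained model:

# Encoder: compress x_1..x_T into a context vector c; decoder: predict y_t from (h_{t-1}, y_{t-1}, c).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs, f_enc, h0):
    h = h0
    for x_t in xs:
        h = f_enc(h, x_t)              # h_t = f(h_{t-1}, x_t)            (3.1)
    return h                           # final hidden state used as context vector c

def decode_step(h_prev, y_prev, c, f_dec, W_out):
    h_t = f_dec(h_prev, y_prev, c)     # h_t = f(h_{t-1}, y_{t-1}, c)      (3.2)
    p_t = softmax(W_out @ np.concatenate([h_t, y_prev, c]))   # g(h_t, y_{t-1}, c)  (3.3)
    return h_t, p_t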
Attention
Transformer
Introduced in 2017 by Ashish Vaswani and collaborators, the Transformer is a modeling architecture that forgoes recurrence and relies entirely on an attention mechanism to capture global dependencies between inputs and outputs.
The architecture of the model is shown in Figure 3.5.
Encoder and Decoder stacks
An encoder built as a stack of six identical layers forms the core of the Transformer model. Each layer contains two sublayers: a multi-head self-attention mechanism that captures dependencies across positions in the input, and a position-wise fully connected feed-forward network that applies a separate transformation to each position. After each sublayer, a residual connection adds the sublayer's output to its input, followed by layer normalization to stabilize training. The outputs of the first sublayer feed into the second, and the processed representations advance through all six layers to produce the final encoded representation.
Figure 3.5: The Transformer architecture [7]
The output of each sublayer is expressed as LayerNorm(x + SubLayer(x)), where SubLayer(x) is the function implemented by the sublayer itself. All sublayers and embedding layers produce outputs with the same dimension of 512.
The decoder consists of six identical layers. Each layer contains the same two sublayers as the encoder, plus a third sublayer that applies a multi-head attention mechanism over the encoder stack's outputs. Like the encoder, the decoder uses residual connections around each sublayer and applies normalization. The self-attention sublayer is masked to prevent attending to future positions, so predictions at position i are based only on previously observed outputs at positions less than i.
Scaled Dot-Product Attention
The architecture of this Attention is illustrated in Figure 3.6.
Figure 3.6: Scaled Dot-Product Attention [7]
In an attention mechanism, the inputs include queries and keys of dimension d_k and values of dimension d_v. First, compute the dot product between the query and each key to obtain similarity scores. Then, divide each score by the square root of d_k to scale them, and apply the softmax function to convert these scores into a set of weights on the values. Finally, use these weights to form a weighted sum of the values, producing the attention output.
In practice, the attention mechanism computes on a collection of queries in parallel by stacking them into a single matrix Q. The keys and values are likewise aggregated into matrices K and V. The output matrix is then computed using equation 3.4.
Two widely used attention mechanisms are additive attention and dot-product attention. When the key dimension d_k is small, the two mechanisms perform similarly, but as d_k grows, additive attention outperforms unscaled dot-product attention. Vaswani and colleagues noted that with large d_k, the dot-product magnitudes can grow quickly, pushing the softmax into regions with very small gradients. To maintain stable gradients, the dot product is scaled by 1/√d_k, mitigating this growth and preserving effective learning.
Vaswani's team assumes that the components of the query and key vectors are independent random variables with mean 0 and variance 1. Their dot product, q · k = Σ_{i=1}^{d_k} q_i k_i, then has a mean of 0 and a variance of d_k, which motivates the scaling factor 1/√d_k.
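Equation (3.4) corresponds to Attention(Q, K, V) = softmax(QK^T / √d_k) V. A minimal PyTorch sketch of this computation, with illustrative tensor shapes, is given below:

# Scaled dot-product attention: similarity scores, scaling by sqrt(d_k), softmax weights, weighted sum of values.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ V

Q = torch.randn(2, 5, 64)   # (batch, number of queries, d_k)
K = torch.randn(2, 7, 64)   # (batch, number of keys,    d_k)
V = torch.randn(2, 7, 32)   # (batch, number of keys,    d_v)
out = scaled_dot_product_attention(Q, K, V)   # shape (2, 5, 32)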
Multi-Head Attention
Compared with a single attention function operating in a space of dimension d_model, multi-head attention achieves higher efficiency by projecting the inputs through h distinct learned projections: the queries and keys into dimension d_k and the values into dimension d_v. Each head performs attention in a lower-dimensional subspace, reducing computation and enabling parallel processing, while preserving the model's ability to capture diverse relationships across the input.
The multi-head attention mechanism enables a model to attend to inputs from distinct subspaces at different positions simultaneously, capturing diverse patterns across the sequence. This parallel attention capability goes beyond what single-head attention can offer, which only focuses on a single subspace at a time.
The Multi-Head Attention function is computed as in equation 3.5, using the matrices described in Section 3.5.3:
MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) · W^O   (3.5)
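A compact PyTorch sketch of equation (3.5) is shown below; d_model = 512 matches the dimension mentioned above, while the choice of 8 heads and the linear projections are illustrative assumptions:

# Multi-head attention: project into h heads, attend per head, concatenate, and apply W^O.
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k, self.n_heads = d_model // n_heads, n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, q, k, v):
        B = q.size(0)
        def split(x, proj):   # (B, seq, d_model) -> (B, n_heads, seq, d_k)
            return proj(x).view(B, -1, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(q, self.W_q), split(k, self.W_k), split(v, self.W_v)
        weights = (Q @ K.transpose(-2, -1) / self.d_k ** 0.5).softmax(dim=-1)
        heads = weights @ V
        heads = heads.transpose(1, 2).contiguous().view(B, -1, self.n_heads * self.d_k)
        return self.W_o(heads)   # Concat(head_1, ..., head_h) W^O

x = torch.randn(2, 10, 512)
print(MultiHeadAttention()(x, x, x).shape)   # torch.Size([2, 10, 512])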
Position-wise Feed-Forward Networks
In transformer architectures, every encoder and decoder layer includes a position-wise feed-forward network that is applied to each position independently and identically. This block consists of two linear transformations separated by a ReLU activation, delivering a nonlinear per-position transformation. The transformation of an input x is computed from equation 3.6:
FFN(x) = max(0, x·W_1 + b_1)·W_2 + b_2   (3.6)
Although the linear transformations are the same across all positions, the parameters vary from layer to layer. This can also be described as two consecutive 1×1 convolutions performing the same operation at each location. The input and output dimensions are d_model, while the inner (hidden) dimension is d_ff.
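A minimal PyTorch sketch of equation (3.6) follows; d_model = 512 matches the dimension used above, and d_ff = 2048 is the inner dimension of the original Transformer, used here as an illustrative choice:

# Position-wise feed-forward network: FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied at every position.
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        return self.w2(torch.relu(self.w1(x)))

out = PositionwiseFFN()(torch.randn(2, 10, 512))   # same shape in and out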
Embeddings and Softmax
Vaswani and colleagues treat both input and output tokens as d_model-dimensional vectors using learned embeddings, as in other sequence-to-sequence models. The decoder's outputs are projected through a linear transformation and a softmax to yield the probability distribution for the next token. In their Transformer design, the input embedding matrix and the pre-softmax projection share the same weight matrix (weight tying), and the embeddings are scaled by √d_model. This approach helps stabilize training and aligns input and output representations.
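A minimal sketch of this weight tying and embedding scaling is given below; the vocabulary size is an illustrative assumption:

# Token embedding and pre-softmax projection sharing one weight matrix; embeddings scaled by sqrt(d_model).
import math
import torch
import torch.nn as nn

class TiedEmbeddingSoftmax(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight      # weight tying

    def embed_tokens(self, token_ids):
        return self.embed(token_ids) * math.sqrt(self.d_model)

    def logits(self, hidden_states):
        return self.proj(hidden_states)           # softmax is applied to these logits

m = TiedEmbeddingSoftmax()
probs = m.logits(m.embed_tokens(torch.tensor([[1, 2, 3]]))).softmax(dim=-1)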
Long-context large language models
Large language model
Language models (LMs) are computational systems that understand and generate human language [29]. They have a transformative ability to predict the likelihood of word sequences and to generate new text from a given input, enabling a wide range of natural language processing applications.
N-gram language models are among the most widely used approaches, estimating word probabilities based on their contextual history. However, they encounter challenges such as rare or unseen terms, the risk of overfitting, and difficulty in capturing complex language patterns. To tackle these limitations, researchers continually develop and refine more advanced LM structures and training approaches, including smoothing techniques and larger training corpora, to improve robustness and performance in language modeling.
Large Language Models (LLMs) are powerful models with massive parameter counts and advanced learning capabilities. The self-attention mechanism in Transformers, the core building block behind models like GPT-3, InstructGPT, GPT-4, LLaMA, and LLaMA 2, enables efficient natural language modeling and underpins modern NLP systems. Transformers revolutionize NLP by handling sequential data with parallelization while capturing long-range dependencies in text. A key theme in current research is in-context learning, where LLMs generate coherent, contextually relevant text based on a provided prompt, making them well suited for interactive and conversational applications. Reinforcement Learning from Human Feedback (RLHF) is another crucial facet, guiding model outputs to align with human preferences and improve real-world usefulness [37]. This technique involves fine-tuning the model using human-generated responses as rewards, allowing the model to learn from its mistakes and progressively improve its performance over time.
An autoregressive language model, such as GPT-3 or LLaMA 2, predicts the next token y from a context X by maximizing the conditional likelihood P(y|X). If the context consists of tokens x_1, x_2, ..., x_{t−1}, then the model estimates the probability P(y_t | x_1, x_2, ..., x_{t−1}). The chain rule lets the probability of an entire output sequence be written as a product over positions: P(y_1, y_2, ..., y_T | X) = ∏_{t=1}^{T} P(y_t | X, y_1, ..., y_{t−1}), where T is the sequence length. In this manner, the model generates an entire text sequence by autoregressively predicting each token at each position.
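This token-by-token factorization translates directly into a greedy generation loop. In the sketch below, `model` stands for any callable that returns next-token logits for a batch of token id sequences; its interface, the output shape, and the end-of-sequence id are assumptions for illustration:

# Greedy autoregressive decoding: repeatedly pick the most probable next token and append it to the context.
import torch

def generate(model, prompt_ids, max_new_tokens=50, eos_id=2):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor([ids]))[0, -1]   # logits for the next position (assumed output shape)
        next_id = int(torch.argmax(logits))
        ids.append(next_id)
        if next_id == eos_id:                        # stop at the end-of-sequence token
            break
    return ids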
Large language models (LLMs) are typically pre-trained with fixed context lengths, for example GPT-3.5-turbo at 4096 tokens, LLaMA at 2048, and LLaMA 2 at 4096, making training with very long contexts prohibitively expensive for many researchers. To push these limits, several works explore fine-tuning-based approaches to extend context length, with Position Interpolation modifying rotary position encoding to scale LLaMA to 32,768 tokens and Focused Transformer using contrastive learning to train LongLLaMA; both rely on full fine-tuning and thus entail substantial computational cost (e.g., 128 A100 GPUs or 128 TPUv3). Landmark attention offers a more efficient, though somewhat lossy, option by compressing long inputs into retrieved tokens. LongLoRA provides an efficient fine-tuning strategy that extends context sizes with limited computation, while LoRA uses low-rank weight updates to approximate full fine-tuning. Notably, LongLoRA shows that short attention can approximate long-context behavior during training. On a single 8× A100 machine, LongLoRA scales LLaMA 2 7B from 4k to 100k context, LLaMA 13B to 64k, and LLaMA 2 70B to 32k, enabling practical tasks such as summarizing long documents or answering extended questions.
LongLoRA introduces Shifted Sparse Attention (S²-Attn) during fine-tuning, while at inference time the model preserves its standard self-attention. In addition to training LoRA weights in linear layers, LongLoRA enhances the trainability of embedding and normalization layers, expanding the model's adaptability without a large parameter overhead. This extension adds only a small number of new trainable parameters, yet it supports richer context extension.
Shifted sparse attention
Standard self-attention costs O(n²) computation, making LLMs on long sequences memory-hungry and slow. To avoid these issues during training, S²-Attn is applied.
Two patterns are introduced to enhance self-attention for long-context inputs. The first splits the input into several groups, enabling efficient local attention within each group. The second shifts the group partition by half the group size in half of the attention heads, allowing communication between adjacent groups. This design does not increase computational cost but promotes information flow across groups.
Figure 3.9 demonstrates S²-Attn. The head dimension is first split into two chunks. Next, tokens in one chunk are shifted by half the group size. Then the tokens are divided into groups and reshaped into the batch dimension. The attention computation happens only within each group, while information can still flow between groups through the shifting operation. While shifting can introduce potential information leakage, this risk is easy to prevent with a small modification to the attention mask.
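The grouping and shifting step can be sketched as below; this is an illustrative reconstruction of the idea, not the official LongLoRA implementation, and the tensor layout is an assumption:

# S2-Attn grouping: half of the heads use fixed groups, the other half use groups shifted by half the group size.
import torch

def s2_attn_groups(x, group_size):
    # x: (batch, seq_len, n_heads, head_dim); seq_len must be divisible by group_size
    B, T, H, D = x.shape
    half = H // 2
    shifted = x.clone()
    shifted[:, :, half:] = torch.roll(x[:, :, half:], shifts=-group_size // 2, dims=1)
    # Reshape so attention is computed independently inside each group of tokens.
    return shifted.view(B, T // group_size, group_size, H, D)

x = torch.randn(1, 16, 4, 8)
print(s2_attn_groups(x, group_size=4).shape)   # torch.Size([1, 4, 4, 4, 8])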
Prompt engineering
Understanding LLM settings in prompt engineering
When working with prompts, you interact with the LLM via an API or directly. You can configure a few parameters to get different results for your prompts.
Temperature controls the balance between determinism and randomness in language models: at lower temperatures the model tends to pick the most probable next token, yielding deterministic and concise outputs, while higher temperatures increase randomness by boosting the weights of less likely tokens, producing more diverse and creative results. Practically, use a low temperature for fact-based QA and tasks requiring accuracy and brevity, and raise the temperature for poem generation or other creative writing to foster variety and imaginative responses.
Top-p sampling, also known as nucleus sampling, is a probabilistic text generation technique that, together with temperature settings, lets you control how deterministic the model’s output is If you’re after exact, factual answers, keep the top-p value low; for more diverse and creative responses, increase it to a higher value.
Max length lets you control how many tokens the model can generate by adjusting the 'max length' setting Setting an appropriate max length helps prevent long or irrelevant responses and allows you to manage costs more effectively.
Stop sequences are strings that halt the model's token generation, giving you control over the length and structure of the response. By specifying stop sequences, you can define boundaries for the output and guide how the content is formed, such as limiting a generated list to no more than 10 items. For example, adding "11" as a stop sequence instructs the model to stop after producing ten items, ensuring concise results.
Frequency penalty is a setting that applies a penalty to the next token proportional to how often that token has appeared in the prompt and the model's prior response. As the frequency penalty increases, the likelihood of repeating the same word decreases, encouraging the model to use a more varied vocabulary. This mechanism reduces repetition in generated text by assigning higher penalties to tokens that appear more frequently, resulting in more diverse and concise outputs.
Presence penalty also applies to repeated tokens, but unlike the frequency penalty, its effect is the same for all repeated tokens: a token that appears twice and a token that appears ten times are penalized equally. This setting helps prevent the model from repeating phrases too often in its output. If you want the model to generate more diverse or creative text, consider using a higher presence penalty; conversely, if you need the model to stay focused, try using a lower presence penalty.
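As a concrete illustration, the sketch below sets these parameters in a single request, assuming the OpenAI Python client; the model name, prompt, and parameter values are illustrative choices rather than recommendations:

# One chat completion request exercising the sampling and penalty parameters described above.
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "List 10 key facts about climate change."}],
    temperature=0.2,        # low temperature: factual, deterministic answers
    top_p=0.9,              # nucleus sampling threshold
    max_tokens=300,         # cap on the length of the generated response
    stop=["11"],            # stop sequence: cut a numbered list off after ten items
    frequency_penalty=0.5,  # discourage repeating frequent tokens
    presence_penalty=0.0,   # uniform penalty for any repeated token
)
print(response.choices[0].message.content)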
Prompt elements
As we cover more and more tasks and applications with prompt engineering, you will notice that certain elements make up a prompt. A prompt contains any of the following elements:
• Instruction: a specific task or instruction you want the model to perform.
• Context: external information or additional context that can steer the model to better responses.
• Input Data: the input or question that we are interested in finding a response for.
• Output Indicator: the type or format of the output.
Not all four elements are required for a prompt; the format depends on the task at hand. Instructions should be placed at the beginning of the prompt, and a clear separator such as ### should be used to distinguish the instruction from the context, for example: "Return the results as a paragraph in English, with no further explanation."
Listing 3.1: An example of translating task
The output of 3.1 would be:
Another example for text classification tasks:
Listing 3.2: An example of text classification tasks
The output of 3.2 can be:
An example with full 4 elements in the prompt:
You are an AI language model with access to a vast amount of information on climate change.

This comprehensive review investigates the intricate dynamics of climate change and its profound implications for global ecosystems. The study synthesizes current research on rising temperatures, changing precipitation patterns, and their cascading effects on biodiversity, ecosystems, and human societies. It explores the role of anthropogenic activities in driving climate change and analyzes potential mitigation strategies.

Climate change, driven primarily by human activities, is a defining challenge of the 21st century. This article aims to provide a nuanced understanding of the complex dynamics associated with climate change and its far-reaching consequences. From altering weather patterns to influencing sea levels, the impacts of climate change are multifaceted and extend across spatial and temporal scales.

The article delves into the unprecedented temperature rise observed over the past century and its ramifications on ecosystems. Warmer temperatures contribute to the melting of polar ice caps, leading to rising sea levels and disruptions in marine ecosystems. The consequences of temperature anomalies are further examined in the context of extreme weather events, including hurricanes, droughts, and heat waves.

Shifts in precipitation patterns are analyzed for their effects on terrestrial ecosystems. Changes in rainfall distribution impact agriculture, water availability, and the prevalence of wildfires. The article reviews regional variations in precipitation and their implications for vulnerable ecosystems, emphasizing the need for adaptive strategies to address these challenges.

A critical focus of the study is the accelerating rate of biodiversity loss and the looming threat of species extinction. Climate change exacerbates habitat loss, disrupts migration patterns, and introduces novel stressors to ecosystems. The article examines case studies highlighting the vulnerability of various species and ecosystems to these disruptions.

Human activities, such as the burning of fossil fuels and deforestation, are identified as major contributors to climate change. The article elucidates the mechanisms through which these activities release greenhouse gases into the atmosphere, trapping heat and intensifying the greenhouse effect.

In conclusion, this article synthesizes diverse strands of research to provide a holistic understanding of the complex dynamics of climate change. It underscores the urgency of global cooperation to mitigate the impacts and adapt to the changing climate. The knowledge distilled from this review serves as a foundation for informed decision-making and policy formulation in the face of this existential challenge.
Listing 3.3: An example of summarization task with full of 4 elements in the prompt
The output of 3.3 when we input the prompt to ChatGPT:
This comprehensive review explores the intricate dynamics of climate change and its profound implications for global ecosystems. The study synthesizes current research on rising temperatures, shifting precipitation patterns, and their cascading effects on biodiversity, ecosystems, and human societies. Climate change, primarily driven by anthropogenic activities, presents a multifaceted challenge, altering weather patterns, influencing sea levels, and contributing to extreme weather events. The article emphasizes the accelerating rate of biodiversity loss, highlighting the vulnerability of various species and ecosystems. Additionally, it identifies human activities, such as fossil fuel combustion and deforestation, as major contributors to climate change, elucidating their mechanisms and emphasizing the need for global cooperation to develop effective mitigation and adaptation strategies. In conclusion, the review underscores the urgency of addressing climate change through informed decision-making and global collaboration.
Automatic slide generation
Automatic slide generation benefits significantly from text summarization, enabling lengthy scientific papers to be condensed and distilled into concise, digestible content. By extracting and highlighting the core findings, summaries illustrate the key points clearly and concisely through slides. As machine learning and deep learning models have advanced, researchers have increasingly applied these techniques to automate slide creation, leveraging summarization and content synthesis to produce effective presentations from complex documents. This approach streamlines knowledge transfer, improves readability, and supports efficient, scalable creation of presentation material from scientific literature.
Effective text summarization requires the ability to identify important sentences. The importance scores of the sentences in an academic paper were determined using a regression method [16]. This technique chooses an appropriate regression model using sentence characteristics as input. However, this approach has the drawback of selecting numerous lengthy sentences for presentations, and the chosen key phrases are frequently poor candidates for bullet points, making the slides tedious to read. A phrase-based strategy has been developed to address this issue [17]. It chooses and arranges material using terms from the academic paper as the fundamental building blocks. When creating slides, the importance score of each phrase and the hierarchical links between phrases are taken into consideration. Support Vector Regression (SVR) is a different technique that may be used to identify important sentences in the text and focus on them [18]. The integer linear programming (ILP) model [16] aids in designing the arrangement of slides once the list of important sentences has been constructed. With a maximum slide length option, it chooses the number of slides on its own. The Microsoft Office API is used to create the draft slides once they have content.
Many researchers have employed deep learning networks besides machine learning-based approaches to address this issue. Some of them score the sentences by extracting a more thorough list of surface features, taking into account the semantics or meaning of the statement, and using the context around the current sentence [19]. These models are fitted using syntactic, semantic, and contextual embeddings. A Bullet Point Generation algorithm is suggested to create slides, the number of which depends on the information in each section of the papers.
Query-based text summarization has emerged as an effective approach for automatic slide generation, producing slide content and titles that naturally reflect the source papers rather than a strict one-to-one extraction. In this framework, user-entered slide titles are treated as questions and the paper corpus as the knowledge base; the system retrieves information related to the titles and injects title-specific key phrases to guide slide content generation. To make the generated slides vivid and natural, graphic elements from the papers should be integrated during slide creation. Dense vector information retrieval computes vector similarities between figure/table captions and slide titles, and slides with higher similarity scores receive the corresponding elements. Additionally, a random forest classifier can be used to filter out slide content that is unlikely to be supported by the paired papers.
Tsu-Jui et al. present DOC2PPT, a novel slide-generation approach that integrates a paraphrase module with a layout-prediction module to leverage the patterns inherent in documents and presentations. They construct a modular network where each component targets related tasks, enabling joint text rewriting and layout design. The architecture of this network, DOC2PPT, is illustrated in Figure 4.1.
All of these modules are:
• A Document Reader (DR) encodes sentences and figures in a document.
• A Progress Tracker (PT) maintains references to the input (the current section being processed) and the output (the slide currently being generated). It uses these references, together with overall progress, to decide when to advance to the next section or the next slide. By tracking input and output states, the PT orchestrates a smooth, step-by-step workflow and ensures timely progression through the content.
• An Object Placer (OP) selects the text or figure from the current section to be displayed on the current slide, ensuring that the most relevant content appears in the right place. It also forecasts the layout by determining where each item will be positioned and how large it will be on the slide, guiding a balanced, readable presentation.
• A Paraphraser (PAR) takes the chosen sentence and condenses it before adding it to a slide.
Figure 4.1: The architecture of DOC2PPT [9]
Document summarization
Document summarization aims to generate a concise summary of a document or a set of documents. The NEUSUM model integrates a selection strategy into the scoring model, directly predicting the relative importance of sentences based on previously selected ones. It employs a hierarchical document encoder and an RNN-based sentence extractor. Additionally, the authors design a rule-based labeling system to identify which sentences should be extracted.
Representing a long document for abstractive summarization is challenging; a solution is to deploy deep communicating agents inside an encoder–decoder architecture. The input text is split into multiple subsections, with each subsection assigned to a dedicated agent that encodes its portion independently. The agents then broadcast their encodings to one another, allowing the system to share global context information about different sections of the document. This collaborative, multi-agent communication enables cross-sectional information flow and helps the decoder generate higher-quality abstractive summaries for long texts.
IBM researchers introduced a novel system that generates concise summaries for computer science publications within a search interface, enabling retrieval by categorized elements such as scientific tasks and datasets. The system creates a standalone summary for each section of a paper, and these section-level summaries are later combined into a single, cohesive paper summary.
State-of-the-Art model utilization
Edward Sun and his team [10] report several promising results from their research, presenting findings logically in concise, information-packed slides that are automatically generated and incorporate graphic elements from the original paper. However, the generated content remains largely extractive, and the metrics used to evaluate their work yield unsatisfactory outcomes. Since the publication of this academic paper, numerous studies have emerged that focus on improving deep learning models.
The BART-LS model is an innovative method for natural language generation that blends pre-training with fine-tuning to achieve strong performance. Built on the BART architecture, it can process both the left and right context of a token, enabling more coherent and context-aware text generation. By combining robust pre-trained knowledge with task-specific fine-tuning, the BART-LS model delivers improved adaptability and higher-quality outputs across a range of NLP tasks.
The BART-LS model further advances text generation by integrating a latent space that captures the meaning and style of the input text. This latent representation is learned by minimizing the reconstruction loss of the input and a contrastive loss that aligns input and output texts. The BART-LS model uses this latent space to produce varied, fluent text across natural language tasks, including summarization, paraphrasing, simplification, and stylistic transformation.
Pegasus-X is an advanced summarization model designed for long texts, capable of handling documents up to 4096 tokens. It extends the Pegasus architecture by adding a larger Transformer encoder and decoder, an expanded vocabulary, and a new pre-training objective. On benchmark datasets such as CNN/Daily Mail, arXiv, PubMed, and BigPatent, Pegasus-X delivers concise, coherent summaries that faithfully capture the main points of the source while preserving factual accuracy and grammatical correctness.
Long T5 is an advanced natural language processing model that can generate high-quality text from keyword inputs. It builds on the T5 framework, employing an encoder-decoder architecture based on transformers. Compared with the original, Long T5 adds more layers, parameters, and attention heads, and uses the Longformer attention mechanism to better handle longer and more complex texts such as summaries, essays, and stories. It is trained on a vast, diverse corpus spanning multiple domains and languages, enabling multilingual text generation. This makes Long T5 a robust and adaptable tool for natural language generation across tasks like text summarization, rewriting, expansion, and completion.
RoBERTa is a state-of-the-art natural language processing model designed to handle a range of tasks, including text classification, question answering, and sentiment analysis. As an enhanced variant of BERT, it trains on more data, uses larger batch sizes, goes through more training steps, and features a broader vocabulary. It eliminates the next sentence prediction objective and switches from WordPiece to byte-pair encoding, contributing to its stronger performance. Across diverse datasets and benchmarks, RoBERTa consistently outperforms BERT, underscoring its effectiveness and reliability.
Query-based text summarization approach (Baseline)
Keyword module
By applying the Keyword module, we extract the hierarchical structure of papers and identify weak, undescriptive titles that resemble generic section headers. These generic titles obscure the apparent scope and size of each section, highlighting a challenge in organizing content. Building a parent–child tree from the extracted section names and sub-sections enhances information retrieval and supports workflows like slide content generation.
Dense IR module
Recent research presents embedding-based retrieval approaches that outperform traditional IR methods like BM25 [49], [50], [51]. In these systems, leaf nodes from the Keyword module's parent–child trees are integrated into the re-ranking function of a dense-vector information retrieval pipeline based on a distilled BERT miniature [52]. Because titles resemble paper snippets, the dense-vector IR model is trained without gold passage annotations to reduce the cross-entropy loss between titles and their original content derived from the original slides.
To align paper content with presentation slides, a pre-trained information retrieval (IR) model computes vector representations for every four-sentence paper snippet and for each slide title. The system measures similarity by calculating pairwise inner products between the snippet vectors and the slide title vectors, yielding a scalar score for each pairing. This vector-based approach creates a common embedding space in which paper fragments and titles can be directly compared, enabling efficient retrieval and ranking of the paper snippets most relevant to each slide title. The resulting method supports scalable matching of publication content to slide topics, enhancing semantic search and content reuse across papers and presentations.
Maximum Inner Product Search (MIPS) ranks paper passage candidates by their relevance to a given title, selecting the top ten candidates as input contexts for the QA module. The candidates are then re-ranked using the section titles and subsection heads (keywords) from the Keyword module, with the weighted function defined in Equation 5.1: score = α(emb_title · emb_text) + (1 − α)(emb_title · emb_text_kw), where emb_title, emb_text, and emb_text_kw are the embedding vectors produced by the pre-trained IR model for the title, a text snippet, and the leaf-node keyword associated with that snippet.
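For illustration, a minimal sketch of this re-ranking step is given below; it assumes the embedding vectors have already been computed by the IR model, and the value α = 0.5 is an arbitrary placeholder rather than a tuned setting.

import numpy as np

def rerank_candidates(emb_title, emb_texts, emb_text_kws, alpha=0.5, top_k=10):
    # emb_texts and emb_text_kws hold one row per candidate snippet.
    title_text = emb_texts @ emb_title        # inner products used by MIPS
    title_kw = emb_text_kws @ emb_title       # title vs. leaf-node keyword of each snippet
    # Equation 5.1: alpha * (emb_title . emb_text) + (1 - alpha) * (emb_title . emb_text_kw)
    scores = alpha * title_text + (1 - alpha) * title_kw
    order = np.argsort(-scores)[:top_k]       # highest-scoring candidates first
    return order, scores[order]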
QA module
To drive the QA module, we combine the slide title with its associated keywords, while the context is built by concatenating the top ten ranked text snippets from the dense IR module. The process relies on the parent–child hierarchy from the keyword module to map each title to a corresponding group of keywords; when a title matches a header like 2.1, that title and all of its recursive children (2.1.x, 2.1.x.x) are added as keywords for the QA module. Note that not every title has a related keyword.
This QA model is a fine-tuned BART system that delivers accurate answers by encoding the context and the query in a structured format: "title[SEP1]keywords[SEP2]context." The input places the slide title first, followed by the keywords as a comma-separated list, which helps organize the data for retrieval. When generating slide content, integrating keywords into the query directs the model to focus on the most relevant context across all returned text fragments, improving both coherence and relevance of the output.
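A minimal sketch of how such an input string could be assembled is shown below; it reuses the tree nodes from the earlier keyword-module sketch, and the helper names are illustrative rather than part of the original system.

def collect_keywords(node):
    # Gather this node's title and the titles of all recursive children
    # (e.g. 2.1 together with 2.1.x and 2.1.x.x).
    keywords = [node["title"]]
    for child in node["children"]:
        keywords.extend(collect_keywords(child))
    return keywords

def build_qa_input(title, keyword_node, top_snippets):
    # Format described above: "title[SEP1]keywords[SEP2]context".
    keywords = ", ".join(collect_keywords(keyword_node)) if keyword_node else ""
    context = " ".join(top_snippets)   # the top ten snippets from the dense IR module
    return f"{title}[SEP1]{keywords}[SEP2]{context}"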
Turning a research paper into slides blends creativity with the author's unique style, which means slide content may reflect anecdotes or details not present in the original publication. A QA model trained on carefully filtered data can perform this task more reliably, enabling slide sentences that faithfully mirror the source paper. This approach can enhance presentation quality by aligning slide content with the paper while preserving essential ideas.
Figure extraction module
Visual elements are essential to keep audiences engaged; without them, slide decks feel incomplete. The slide-generation workflow therefore uses the article's linked figures and tables to build a comprehensive set of slides. A dense vector IR module measures vector similarities between each slide title and the expanded keywords in figure and table captions. Based on these similarities, the figures and tables are ranked, and a final set of recommendations is produced and delivered to the user.
Filtering data method
Data filtering is a fundamental step in deep learning that ensures the quality and reliability of training data and, in turn, the resulting models. It can be applied at various stages of the data pipeline (data collection, preprocessing, augmentation, and sampling) to remove noisy, corrupted, irrelevant, or redundant data points that can degrade model performance and generalization. Effective filtering helps balance the data distribution, reducing biases and overfitting risks, while also cutting computational costs by limiting unnecessary data. By filtering data, practitioners can improve training efficiency, model robustness, and overall predictive accuracy, enabling more dependable deployment in real-world tasks.
During training, slide content serves as ground-truth data for a model that automatically generates slides, with the model leveraging context from scientific papers to align content with slide titles; however, slides often include clarifications or illustrations not present in the papers, and Azure OCR can introduce invalid characters when extracting text. When slide content diverges from the corresponding paper, training accuracy declines. To mitigate this, we propose a data-filtering approach based on semantic search that, for each slide sentence, retrieves the most similar content from the source paper and uses the similarity to assess whether the slide content is actually derived from the paper. This process helps purge non-paper or inappropriate content, ensuring the training corpus emphasizes accurate, relevant information. Implementing this filtering is expected to improve data quality, reduce noise from misaligned content, and enhance the model's performance in slide generation.
Semantic search is a technique that enhances the accuracy and relevance of search results by understanding user intent and context. Unlike traditional keyword-based search, it relies on natural language processing, machine learning, and knowledge graphs to analyze the meaning and structure of queries and align them with the most appropriate documents or data. This approach delivers more precise, personalized answers and suggests related topics or queries. It also handles complex or conversational queries that involve multiple concepts, synonyms, or ambiguities.
Semantic search places every corpus entry (whether a sentence, paragraph, or document) into a shared vector space, and it also converts the user query into a vector so the closest matching entry can be located. Vectorization, typically powered by a large language model, turns content into numerical arrays so that similar concepts have similar vectors, which are then stored in vector databases that index and retrieve based on similarity. Once the vector indexes are built, the search process transforms the input query into a vector and finds the best conceptual fit, yielding the most relevant results; the workflow is illustrated in Fig 5.2. The core benefits of semantic search are:
• Improve accuracy and relevance of search results.
• Enhance user experience and satisfaction.
• Increase discoverability and diversity of information.
• Better understanding of user behaviour and preferences.
There are two types of semantic search:
In symmetric semantic search, the input query and the corpus entries are of similar length and content, which improves semantic matching. In practice, if you search for "How to learn Python online?", a matching entry such as "How to learn Python on the web?" shares equivalent meaning and comparable word count, which boosts relevance. For symmetric tasks, the query and the corpus entries can be swapped, and the results remain consistent when the search direction is reversed.
Asymmetric semantic search relies on short queries, such as questions or keywords, to elicit longer, paragraph-length responses that address the user's underlying intent. For example, when a user asks "What is Python?", the expected output is a concise explanation like "Python is a programming language that is interpreted, high-level, and general-purpose, designed with a philosophy that emphasizes readability and simplicity." In tasks where the query and the corpus are asymmetric, flipping the roles does not generally yield meaningful results.
Figure 5.2: The structure of semantic search using DistilBERT and Faiss
Filtering data by using semantic search
Semantic search proves effective for information retrieval, and we used it to filter slide content by removing items that do not align with the paper. We treat the sentences in each scientific paper as knowledge sources and use the slide sentences that reference the paper as search queries. A symmetric semantic search matches slide sentences to the corresponding sentences in the paper, yielding similarity scores and indices of the best matches. To classify slide content for a scientific paper, we define a threshold t: if the slide content's similarity score is at least t, it is kept; if it is below t, it is discarded. The threshold t can vary depending on the application and dataset.
To map sentences into vector space for downstream tasks, we use the sentence-transformers (SBERT) library, a Python framework that makes it easy to compute semantic similarity between sentences and builds on the popular transformers ecosystem. It offers various models and methods for sentence embedding, clustering, paraphrasing, and evaluation, with support for multilingual and cross-lingual scenarios and options for custom training and fine-tuning of existing models. Designed for user-friendliness, efficiency, and scalability, the library integrates with applications such as information retrieval, question answering, text summarization, and natural language generation. For this work, we rely on the all-mpnet-base-v2 model, which delivers strong performance in symmetric semantic search compared with other models.
To construct the vector database, we use Faiss, a library built for efficient similarity search and clustering of dense vectors. Faiss offers algorithms that can search through vector datasets of any size, including those that do not fit in RAM, along with supporting code for evaluation and parameter tuning. Written in C++ with complete Python/numpy bindings, Faiss also includes GPU-accelerated implementations of many algorithms. It was developed by Facebook AI Research.
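A minimal sketch of the filtering step combining these two libraries is given below; it assumes normalized embeddings so that inner products correspond to cosine similarity, and uses t = 0.6 only because that is the threshold used in the worked example that follows.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("all-mpnet-base-v2")

def filter_slide_sentences(paper_sentences, slide_sentences, t=0.6):
    # Embed and index the paper sentences for inner-product (cosine) search.
    paper_vecs = model.encode(paper_sentences, normalize_embeddings=True)
    index = faiss.IndexFlatIP(paper_vecs.shape[1])
    index.add(np.asarray(paper_vecs, dtype="float32"))

    # Embed the slide sentences and look up their best match in the paper.
    query_vecs = model.encode(slide_sentences, normalize_embeddings=True)
    scores, _ = index.search(np.asarray(query_vecs, dtype="float32"), 1)

    # Keep only slide sentences whose best similarity score reaches the threshold t.
    return [s for s, score in zip(slide_sentences, scores[:, 0]) if score >= t]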
Following the analysis and design stage, data filtering through semantic search is illustrated in Fig 5.3. To demonstrate the method, we use the paper Neural AMR: Sequence-to-Sequence Models for Parsing and Generation as the knowledge base. Semantic search retrieves the top five sentences from the paper that are most similar to the query "Paired Training: scalable data augmentation algorithm." The results appear in Table 5.1. When the threshold value t is set to 0.6, this sentence is considered derived from the paper and is retained because its highest similarity score is 0.691. If t were set higher than 0.691, the sentence would be discarded. Thus, the choice of threshold t determines whether a sentence is retained.
Table 5.1: Top 5 sentences with the highest similarity scores for the given query sentence.

Score   Sentence
        Paired Training Obtaining a corpus of jointly annotated pairs of sentences and AMR graphs is expensive and current datasets only extend to thousands of examples.
0.486   We set the initial Gigaword sample size to k = 200,000 and executed a maximum of 3 iterations of self-training.
        While much of this improvement comes from self-training, our model without Gigaword data outperforms these approaches by 3.5 points on F1.
        We demonstrate that our seq2seq model can capture the same information as a language model, particularly after pretraining on an external corpus.
        In AMR generation, we employ data augmentation to enhance robustness, and our paired training procedure is largely inspired by the work of Sennrich et al.
Figure 5.3: The data filtering process using semantic search
Prompt engineering approach
LLMs module
The LLMs module is the central powerhouse of our automatic slide generation framework, capable of comprehensively summarizing academic papers based on carefully crafted instructions. It leverages the intrinsic summarization abilities of advanced language models and provides a customizable approach that lets users tailor the output content and formatting to meet specific preferences.
In our approach, the LLMs module serves as the linchpin for distilling complex academic content into concise, presentation-ready summaries that align with user-defined prompts. By applying targeted prompt engineering, we guide large language models to generate outputs that resemble human-generated slides, while fine-tuning instructions to achieve desired tones, levels of formality, and stylistic elements essential for polished, effective presentations.
Table 5.2: Human evaluation results for zero-shot and five-shot LLMs, finetuned LMs, and reference summaries [11].
Setting                      Model    Faithfulness  Coherence  Relevance  Faithfulness  Coherence  Relevance
Fine-tuned language models   Brio     0.94          3.94       4.40       0.58          4.68       3.89
                             Pegasus  0.97          3.93       4.38       0.57          4.73       3.85
(The two groups of metric columns report results on the two evaluation datasets used in [11].)
Large language models (LLMs) exhibit summarization capabilities that approach the quality of human-authored summaries, a claim supported by the empirical evidence in Table 5.2, which reports their performance on summarizing news articles. Human evaluators assessed the LLM outputs for faithfulness, coherence, and relevance, with faithfulness measured on a binary scale and coherence and relevance rated on a 1-to-5 scale. The resulting scores highlight the nuanced and high-quality nature of the summaries, and Table 5.2 provides a clear overview of how closely LLM-generated summaries align with their human-authored counterparts.
Across zero-shot language models, increasing model size yields stronger performance, as GPT-3 improves in faithfulness, coherence, and relevance when scaling from 350 million to 6.7 billion and 175 billion parameters; instruction-focused variants like Ada Instruct v1, Curie Instruct v1, and Davinci Instruct v2 outperform GPT-3 across multiple benchmarks, with Davinci Instruct v2 delivering particularly impressive results.
Moving to the five-shot models, Anthropic-LM, Cohere XL, GLM, OPT, and GPT-3, among others, display competitive scores. Notably, fine-tuned models such as Brio and Pegasus exhibit commendable performance, reinforcing the idea that fine-tuning can significantly enhance summarization quality.
Compared with the reference summaries, large language models consistently score higher, underscoring the progress these models bring to text summarization. The results also reveal nuanced differences: each model demonstrates strengths in different aspects, which reinforces the need for a diverse set of evaluation metrics.
Table 5.2 highlights the nuanced summarization capabilities of large language models (LLMs). Faithfulness scores near 1, along with high coherence and relevance ratings, show that these models generate summaries that closely mirror human-authored ones. Collectively, the findings underscore the potential of LLMs to automate the production of concise, informative, and human-like summaries across diverse domains.
LLMs demonstrate leading summarization capabilities, as shown in this evaluation, placing them at the forefront of natural language processing advancements. By employing nuanced evaluation metrics and robust empirical evidence, these models distill complex information into concise, coherent, and highly relevant summaries that rival human-authored work.
The LongAlpaca-7B-16k model is fine-tuned on the LongAlpaca-12k dataset with LongLoRA in a supervised fine-tuning setting, using a 16k context length [8]. LongAlpaca-7B-16k was evaluated on the LongBench [55] and L-Eval [56] benchmarks.
Table 5.3: LongAlpaca-7B-16k Evaluation on LongBench English tasks
Model Avg Single-Doc QA Multi-Doc QA Summarization Few-shot Learning Code Synthetic
Table 5.3 displays the evaluation results, where the model demonstrates competitive performance across diverse tasks compared to the baseline models.
Table 5.4: Evaluation on L-Eval open-ended tasks, comparing to GPT-3.5-Turbo and judging win rates via GPT-4
Model               Win-rate   Wins   Ties
LongAlpaca-7B-16k   39.06%     45     60
Table 5.4 presents the win-rate comparison for open-ended tasks in the L-Eval benchmark, showing that the evaluated model achieves a win rate of 39.06% with 45 wins and 60 ties when judged against GPT-3.5-Turbo using the criteria provided by GPT-4.
Because OpenAI keeps the details of GPT-3.5-Turbo proprietary, very little information about it is publicly available. However, we do have information about the architecture, training dataset, training methodology, and evaluation of its GPT-3 predecessor.
The authors adopt the GPT-2–style model and architecture, including the modified initialization, pre-normalization, and reversible tokenization, but replace the standard attention with alternating dense and locally banded sparse patterns in the transformer layers, in a fashion similar to the Sparse Transformer. Language-model datasets have rapidly expanded, with Common Crawl now comprising nearly a trillion words, enough to train the largest models without repeating any sequence, yet unfiltered or lightly filtered Common Crawl tends to be lower quality than curated corpora. To raise average dataset quality, they (1) downloaded and filtered Common Crawl against a range of high-quality reference corpora, (2) performed fuzzy document-level deduplication within and across datasets to reduce redundancy and preserve the integrity of the held-out validation set as a reliable measure of overfitting, and (3) augmented the training data with known high-quality reference corpora to increase diversity. All models were trained on Nvidia V100 GPUs on a high-bandwidth Microsoft cluster.
Table 5.5: Zero-shot results on PTB language modeling dataset
Setting            Perplexity
SOTA (Zero-Shot)   35.8
GPT-3 Zero-Shot    20.5
Table 5.5 presents zero-shot results on the PTB language modeling dataset, highlighting the model's performance relative to the previous state-of-the-art (SOTA) in zero-shot evaluation. The model establishes a new PTB SOTA, surpassing the previous best by 15 perplexity points and achieving a perplexity of 20.50.
Model selections
In constructing our automatic slide generation framework, the pivotal decision of model selection is centred around two Long Context Language Models that harmonize distinct strengths: LongAlpaca-7B-16k and GPT-3.5-Turbo.
LongAlpaca-7B-16k occupies a central role in our framework, fine-tuned on the LongAlpaca-12k dataset via LongLoRA in supervised fine-tuning [8], and it demonstrates strong adaptability to the nuances of academic paper summarization, achieving competitive performance across diverse tasks as depicted in Table 5.3.
Across open-ended tasks evaluated with L-Eval, LongAlpaca-7B-16k achieves a substantial win rate of 39.06%, outperforming GPT-3.5-Turbo in a large number of instances (Table 5.4). This performance underscores the model's strength in handling complex, open-ended prompts and reinforces its suitability for our framework.
Even with limited insights into the internal architecture of GPT-3.5-Turbo, its selection is based on demonstrated strengths in language understanding and language generation. The model's design, rooted in a GPT-2 foundation with targeted enhancements, enables it to handle a diverse range of prompts effectively.
GPT-3.5-Turbo demonstrates a competitive edge in language tasks with its zero-shot performance on the Penn Treebank (PTB) language modeling dataset, as shown in Table 5.5. The model achieves a perplexity of 20.50 in this setting, underscoring strong language modeling capabilities even without task-specific fine-tuning. While the PTB dataset does not directly align with summarization tasks, the result highlights GPT-3.5-Turbo's proficiency across language-related tasks. This benchmark informs expectations for the model's performance in broader natural language processing applications, including text summarization, and reinforces its robust zero-shot abilities.
The synergy between LongAlpaca-7B-16k and GPT-3.5-Turbo in our LLMs module creates a robust foundation for automatic slide generation. These models are selected for complementary strengths that enable nuanced, high-quality summarization of academic papers. Their collaboration aligns with our research goals, speeding up slide production while preserving accuracy and depth for scholarly presentations.
Prompt tuning
Within our automatic slide generation framework, the effectiveness of large language models (LLMs) stems not only from their intrinsic capabilities but is amplified by a focused process called Prompt Tuning. This approach involves customizing prompts to steer LLMs toward outputs that precisely match user-defined preferences for both content and formatting. By refining prompts, we align the model's outputs with specific slide objectives, ensuring coherence, relevance, and the desired presentation style while maintaining efficiency and scalability across topics.
The Prompt Tuning Workflow, illustrated in Figure 5.5, starts by creating a comprehensive Prompt template guided by the content and structure of the academic paper, including its sections. This template then forms the foundation for developing a Full prompt, a detailed instruction set that captures nuanced preferences such as tone, level of formality, and stylistic elements.
Feeding the complete prompt into large language models (LLMs) generates a JSON output that encapsulates a summary based on the provided instructions. This output then undergoes a dual evaluation: first, checking its alignment with the desired format, and second, assessing accuracy, coherence, and relevance to the original prompt. By combining format conformity with content fidelity, this process produces summaries that are clear, well-structured, and ready to be placed on slides.
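The sketch below shows what such a prompt template might look like; the wording, the placeholder names, and the JSON fields are illustrative assumptions rather than the project's final template.

PROMPT_TEMPLATE = """You are preparing presentation slides for an academic paper.
Summarize the section below into at most {max_bullets} concise bullet points.
Return the result as JSON with the fields "slide_title" and "bullets".

Section title: {section_title}
Section text:
{section_text}
"""

def build_full_prompt(section_title, section_text, max_bullets=5):
    # Fill the template to produce the Full prompt sent to the LLM.
    return PROMPT_TEMPLATE.format(section_title=section_title,
                                  section_text=section_text,
                                  max_bullets=max_bullets)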
Figure 5.5 presents a prompt-tuning workflow that ensures LLM-generated content adheres to user-defined preferences for structure and style, while concurrently evaluating the output with the ROUGE metric to quantify the quality of the summarization.
After the initial LLM outputs and the dual evaluation, the workflow enters an iterative Prompt Tuning phase where the Full prompt is refined based on the preliminary results. ROUGE scores are used as quantitative feedback to guide optimization, ensuring the adjustments improve alignment with the target criteria. The refined Prompt template produced by tuning is then reintegrated into the workflow, becoming the updated starting point for subsequent iterations and further improvements in the prompt design for better downstream performance.
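As an example of how this quantitative feedback can be computed, the sketch below scores a generated summary against a reference with the rouge-score package; the choice of package and of the reported ROUGE variants is an assumption, since the document does not prescribe a specific implementation.

from rouge_score import rouge_scorer

def rouge_feedback(reference_summary, generated_summary):
    # ROUGE-1, ROUGE-2, and ROUGE-L F1 scores used as feedback for prompt tuning.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference_summary, generated_summary)
    return {name: round(score.fmeasure, 4) for name, score in scores.items()}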
By employing a cyclic refinement loop, the model's summaries are iteratively adjusted until they consistently meet the target format while preserving fidelity to the original material. The iterative Prompt Tuning workflow provides a structured yet dynamic approach to crafting precise instructions, maximizing the proficiency of large language models in generating tailored, high-quality academic summaries.