Design and implementation of a prototype for the automatic summarization of scientific articles


Problem definition

Discussions of automatic text summarization are abundant and varied. At its simplest, a summary is a condensed textual representation of a document's content: it captures the essential information while reducing the length of the original text.

CHAPTER 1 INTRODUCTION

text compression with loss of information. With regard to the ISO 24617-

Automatic text summarization, a key theme in resource management, aims to create guidelines for developing new resources that are immediately interoperable with existing ones. It involves generating a concise and meaningful synthesis of textual content using techniques from Natural Language Processing (NLP). This process is essential for quickly extracting information from various sources, including books, news articles, research papers, and social media posts. The most recognizable form of text condensation is the summary, which accurately represents a document's content. However, producing a relevant, high-quality summary requires carefully selecting, evaluating, organizing, and assembling information segments according to their relevance. Managing redundancy, coherence, and cohesion is crucial for creating automatic summaries that are comprehensible and acceptable to human readers.

Context and motivation

Context

Manually summarizing text is time-consuming, labor-intensive, and often impractical given the vast amount of content we encounter daily. This highlights the need for tools that can summarize text automatically, as a human would. For instance, after a busy day, someone who remembers a presentation due the next day might use an automatic summarization system to condense the material before a text-to-speech synthesizer reads it aloud. This saves time and helps prevent mental fatigue. Automatic summarization tools make it possible to:

- save reading time;

- condense information so that it fits on small mobile devices such as PDAs and tablets;

- select content more effectively;

- better prepare reviews, presentations, and similar work.

Motivation

Summarizing the content of important documents is essential for scientists, and particularly for students. Despite advances in automatic summarization techniques, the challenge of preserving the original structure of the document remains unaddressed. Often, a generic summary is generated without examining the major chapters or sections that may contain valuable information. In fields like computer science, it is difficult to understand the content of a scientific article without reviewing these major headings. This gap in existing research motivates us to develop a system that produces summaries highlighting the most relevant information from each section of a scientific article.

Problem statement

Automatic text summarization systems provide valuable assistance at all levels. They aim to generate a condensed version of the source document using computational techniques, based on statistical or linguistic approaches, that analyze the text as a whole. As a result, the structure of the original document is not preserved, and the summary conveys only limited information when the text is long. Given that scientific articles are usually organized into chapters or sections that each address a specific topic, the question is whether scientific articles in PDF format can be summarized automatically section by section. Developing a prototype that extracts the textual content of scientific articles and summarizes it chapter by chapter is therefore both necessary and interesting.

Objectives

In this work, we set ourselves the following objectives:

1 Study automatic text summarization in general, and for scientific articles in particular.

2 Propose and implement a prototype application capable of automatic text summarization by section (chapter or main heading).

Structure of the thesis

This thesis consists of five chapters covering the following points:

1 The first chapter analyzes the subject with a brief presentation of the context of the work, the problem statement, the objectives, and the structure of the thesis.

2 The second chapter is devoted to the state of the art in automatic text summarization and scientific articles.

3 The third chapter presents the proposed solution through the overall system architecture and the model pipeline.

4 The fourth chapter presents the implementation, experimentation, and evaluation of the work.

5 The fifth chapter concludes the work and outlines future perspectives.

General concepts

Artificial Intelligence

Artificial Intelligence (AI) is a field of computer science focused on developing methods that enable machines to perform tasks typically carried out by humans, such as reasoning, acting, and adapting. The term was coined in 1956 by John McCarthy, who is regarded as the father of AI and who defined it as "the science and engineering of creating intelligent machines." Although McCarthy introduced the term, the idea of machines simulating human-like intelligence had been explored earlier by influential researchers such as Alan Turing. In industrial contexts, AI is defined as algorithms that mimic human actions with varying degrees of sophistication.

Machine learning

Machine learning, a subset of artificial intelligence, uses statistical learning algorithms to build intelligent systems. It applies statistical methods to algorithms so that they learn without being explicitly programmed and improve as they process more data. Machine learning algorithms fall into three categories: supervised, unsupervised, and reinforcement learning. Examples of machine learning applications include recommendation systems on music and video streaming services and online customer support through chatbots.

Deep learning

Deep learning is a subfield of machine learning based on artificial neural networks, which are inspired by the neural networks of the human brain. These networks mimic the function of neurons to tackle complex learning problems. An artificial neural network can be viewed as a series of small calculators that perform mathematical operations according to their connections, leading to the output neurons. These networks typically consist of multiple layers, each of which processes the outputs of the previous one. This representation learning method uses a cascade of nonlinear processing units to extract and transform features for the subsequent layers. Deep learning algorithms can learn in both supervised and unsupervised settings through several levels of feature layers, which are not designed by humans but learned automatically by a general-purpose learning procedure. As a result, deep learning can recognize faces, generate text, and even drive autonomous vehicles.

FIGURE 2.1 – Relationship between AI, ML and DL (Pier Paolo Ippolito) [1]

Natural Language Processing

Natural Language Processing (NLP), known in French as Traitement Automatique des Langues Naturelles (TALN), is a scientific field focused on developing automated methods for manipulating human language. It enables computer programs to understand spoken or written language, and it combines aspects of computer science, artificial intelligence, and linguistics. The primary goal of NLP is to allow machines to read, decipher, comprehend, and derive meaning from human language. NLP is one of the most widely applied areas of artificial intelligence, and the demand for models that analyze speech and language, identify contextual patterns, and generate information from text and audio is expected to grow alongside advances in AI. Key applications of NLP include the following.

Information retrieval is a field of computer science focused on processing documents that contain unstructured text, enabling quick access to information based on user-specified keywords. It is the foundation of web search engines and is crucial for scientific and documentary research. Documents can be indexed by their content and by associated concepts using domain-specific thesauri. However, the evolution of living languages and the emergence of new concepts raise several practical challenges.

Information extraction is the process of extracting data from unstructured textual sources in order to identify predefined entities or facts for structured classification or storage. Often referred to as text extraction, it uses machine learning to analyze text automatically and extract relevant keywords and phrases from sources such as news articles, surveys, and customer service tickets.

2.1.4.3 Sentiment Analysis

Sentiment analysis is a method for identifying the emotional tone of a piece of text. This popular technique helps organizations and businesses assess and categorize opinions about a product, service, or idea. The emotion conveyed can be classified as positive, negative, or neutral.

Machine translation is the automated translation of content from a source language to a target language without human intervention. The content can be text or audio. In its simplest form, translation proceeds word by word, which often produces poor results, and various techniques have been developed to improve translation quality. Popular translation engines such as Google Translate and DeepL illustrate these advances.

Question answering involves developing a system designed to answer questions, focusing primarily on contextual reading comprehension while avoiding out-of-context questions. Typically, the answers are generated by querying a structured database of knowledge or information.

FIGURE 2.2 – NLP and its subtypes [2]

Automatic text summarization

Single-document or multi-document summarization

Summarization systems fall into two types according to the number of input documents: single-document and multi-document. Single-document summarizers produce a summary of one document at a time and vary in the document sizes they can handle, from web articles to longer PDF articles. Multi-document summarization systems, which are more recent, produce adjustable-length summaries from a collection of documents.

Generic or oriented summarization

There are two types of summaries: generic and oriented. A generic summary has no particular focus and presents a general overview based solely on the source text, without considering any context or specific need. In contrast, an oriented summary is driven by a specific task or request and selects and processes only the information relevant to that task. This type of summary depends heavily on the context and on the information being sought.

Informative or indicative summarization

With respect to the style of the summary produced, a summary is either informative or indicative.

An informative summary serves as a concise version of the original text that covers its most relevant information. In contrast, an indicative summary only points out the key topics discussed in the document.

CHAPTER 2 STATE OF THE ART

Extractive or abstractive summarization

Text summarization methods can be categorized as extractive, abstractive, or hybrid. The extractive method selects the most important sentences from the source documents and concatenates them to form the summary. The abstractive method, in contrast, interprets and rephrases the content of the documents to generate a more coherent summary: the input is converted into an intermediate representation, and the output summary is generated from this representation. Unlike extractive summaries, abstractive summaries consist of sentences that differ from those of the original document or documents. The hybrid method combines the extractive and abstractive methods.

Monolingual, multilingual, or cross-lingual summarization

A summarization system can be categorized by language as monolingual, multilingual, or cross-lingual. A monolingual system works with source and target documents in the same language. A multilingual system handles source texts written in several languages, such as English, Arabic, and French, and generates summaries in those same languages. Finally, a cross-lingual system takes a source text in one language, such as English, and produces the summary in a different language, such as French.

FIGURE 2.3 – Types of automatic text summarization [3]

Automatic text summarization methods

Statistical method

The statistical method, which originated in the 1950s with Luhn's research on automatic summarization, processes a text by analyzing its units (words or sentences). It extracts significant sentences and words through a statistical analysis of features, defining the "most important" sentences by their position, frequency, and other criteria. This approach does not rely on linguistic resources; instead, it computes a score for each unit to gauge its importance. Importance is measured through word frequency, tf-idf scores, the presence of prototypical expressions, and position relative to titles. Statistical techniques work with numerical, quantitative values, in contrast to artificial intelligence methods, which are more symbolic and qualitative. These numerical techniques are prevalent in extraction-based approaches with surface analysis, because sentence relevance can be assessed and compared through computed numerical values. The simplicity of comparing these values makes the summarization process easier to automate.

Moreover, it becomes easier to combine a set of criteria of different kinds into an overall relevance value for a sentence, a paragraph, and so on.

To combine these values, a linear function with coefficients can be applied, giving some criteria more weight than others. The process involves three steps.

1 The first step computes statistics for each unit (word or sentence) using specific criteria (frequency of the units, their position, etc.).

2 The second step selects the salient units of the text on the basis of these statistics, assigning each unit a score according to the criteria.

3 The third step, extraction, discards the units with very low scores and combines those with the highest scores to produce the summary.

The core of the statistical approach is the computation of the score of each unit, obtained by accumulating the weights of every criterion considered and present in the sentence (unit), each multiplied by a coefficient specific to that criterion.

The score of a unit P is given by the equation [13]:

$$\mathrm{Score}(P) = \sum_{1 \le i \le k} \alpha_i \, C_i(P)$$

where $k$ is the total number of criteria used to compute the score of the unit, $C_i$ is the function that computes the numerical value of criterion $i$ applied to the unit, and $\alpha_i$ is the coefficient associated with criterion $i$.
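As a minimal sketch of the linear scoring function above, the following Python snippet scores each unit as a weighted sum of criteria. The two criteria (`position_score`, `title_overlap`), the title vocabulary, the coefficients, and the sample sentences are assumptions chosen only for illustration; they do not come from the thesis.

```python
from typing import Callable, Sequence

# A criterion C_i maps a unit (a sentence and its index) to a number.
Criterion = Callable[[str, int], float]


def score(unit: str, index: int,
          criteria: Sequence[Criterion],
          alphas: Sequence[float]) -> float:
    """Score(P) = sum over i of alpha_i * C_i(P)."""
    if len(criteria) != len(alphas):
        raise ValueError("one coefficient per criterion is required")
    return sum(a * c(unit, index) for a, c in zip(alphas, criteria))


# Hypothetical criterion: earlier sentences get a higher value (1, 1/2, 1/3, ...).
def position_score(unit: str, index: int) -> float:
    return 1.0 / (index + 1)


TITLE_WORDS = {"automatic", "summarization"}  # assumed title vocabulary


# Hypothetical criterion: number of distinct title words present in the unit.
def title_overlap(unit: str, index: int) -> float:
    words = {w.strip(".,;:").lower() for w in unit.split()}
    return float(len(words & TITLE_WORDS))


sentences = [
    "Automatic summarization condenses a document.",
    "The weather was pleasant.",
]
# Assumed coefficients: title overlap counts twice as much as position.
scores = [score(s, i, [position_score, title_overlap], [0.5, 1.0])
          for i, s in enumerate(sentences)]
```

Keeping the criteria as separate functions makes it easy to add or reweight a criterion without touching the scoring function itself.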

Linguistic method

The linguistic method emerged in the 1980s with the integration of artificial intelligence into automatic text summarization systems. It aims to address the limitations of statistical approaches and to improve the comprehensibility of the generated summaries. This technique focuses on understanding the context of the document itself, using linguistic knowledge alongside statistical methods to produce summaries. Summarization can be achieved through extraction or abstraction, which are generally recognized as the two primary approaches to automatic text summarization. Various theories, such as Rhetorical Structure Theory and lexical chains, underpin this approach. The linguistic method proceeds in the following steps.

1 For one or more documents received as input, the system uses linguistic information to create a representation of the input.

2 This representation is then reduced using reduction rules, either by keeping the most important sentences or by creating a new representation.

3 Finally, the summary generation step either merges the extracted sentences to obtain an extractive summary or transforms the reduced representation into an abstractive summary.

Other methods

Many other methods have been created on the basis of the methods above, depending on the approach considered.

2.3.3.1 Topic-based methods

These methods focus on identifying the main subject of a document, that is, its core topic. Common techniques for topic representation include Term Frequency, Term Frequency-Inverse Document Frequency (TF-IDF), and lexical chains. According to Nenkova & McKeown (2012), topic-based extractive summarization proceeds in the following steps.

1 Convert the input text into an intermediate representation that captures the key topics discussed.

2 Assign a score to each sentence of the input document according to this representation.

3 Generate the summary from the sentences with the highest scores.
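The three steps above can be sketched in a few lines of Python. This is a simplified illustration, assuming plain term frequency as the intermediate representation (TF-IDF or lexical chains could be substituted), a toy stop-word list, and a sentence score equal to the average weight of its terms; the function names and the sample document are hypothetical.

```python
import re
from collections import Counter
from typing import List

# Assumed toy stop-word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "is", "are", "and", "to", "in", "it"}


def tokenize(text: str) -> List[str]:
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(sentences: List[str], n: int = 1) -> List[str]:
    # Step 1: intermediate representation = relative term frequencies.
    tokens = [tokenize(s) for s in sentences]
    counts = Counter(w for sent in tokens for w in sent)
    total = sum(counts.values())
    tf = {w: c / total for w, c in counts.items()}
    # Step 2: score each sentence by the average weight of its terms.
    scores = [sum(tf[w] for w in sent) / len(sent) if sent else 0.0
              for sent in tokens]
    # Step 3: keep the n best sentences, restored to document order.
    best = sorted(range(len(sentences)), key=lambda i: -scores[i])[:n]
    return [sentences[i] for i in sorted(best)]


doc = [
    "Summarization systems produce a summary of a document.",
    "Extractive systems select sentences.",
    "Lunch was served at noon.",
]
top_two = summarize(doc, n=2)
```

Averaging over the sentence length keeps long sentences from winning simply because they contain more words.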

2.3.3.2 Graph-based methods

These methods use sentence-based graphs to represent a document or a collection of documents. Random walks on a graph of sentences have proved effective for extractive summarization, and such representations are commonly used for this purpose by algorithms such as LexRank and TextRank, which build on PageRank. Sentence scoring in LexRank consists of two main steps.

1 Represent the sentences of the document as an undirected graph in which each node represents a sentence of the input text and, for each pair of sentences, the weight of the connecting edge is the semantic similarity between the two sentences, measured with cosine similarity.

2 Use a ranking algorithm to determine the importance of each sentence. The sentences are ranked according to their LexRank scores, in the same way as in the PageRank algorithm.
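The two steps above can be illustrated with a short Python sketch. It is a simplification made under stated assumptions: sentences are compared with plain bag-of-words cosine similarity (no IDF weighting and no similarity threshold), and the ranking is a PageRank-style power iteration with an assumed damping factor of 0.85. The function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_sentences(sentences: List[str], damping: float = 0.85,
                   iterations: int = 50) -> List[float]:
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    # Step 1: undirected graph, edge weight = cosine similarity.
    w = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out = [sum(row) for row in w]
    # Step 2: PageRank-style power iteration on the weighted graph.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(w[j][i] / out[j] * scores[j]
                            for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores


ranks = rank_sentences([
    "The cat sat on the mat.",
    "The cat lay on the mat.",
    "Stock prices fell sharply.",
])
```

In this example the first two sentences reinforce each other, while the unrelated third sentence only receives the baseline score.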

Various machine learning models have been proposed for automatic text summarization, most of which frame the task as a classification problem. These models decide whether to include a sentence in the summary by learning from examples: each sentence of the document is classified as "summary" or "non-summary" using a training set, that is, a collection of documents with their respective human-written summaries. Summarization based on machine learning involves the following steps:

1 Extract features from the preprocessed document, based on several sentence-level and/or word-level features.

2 Pass the extracted features to an algorithm that produces a single value as the output score.
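As a rough illustration of these two steps, the sketch below extracts three simple features from a sentence and passes them to a logistic function that returns a single score. The features, the weights, and the bias are hypothetical values chosen for the example; in a real system the weights would be learned from the training collection.

```python
import math
from typing import List, Set


def sentence_features(sentence: str, index: int, total: int,
                      title_words: Set[str]) -> List[float]:
    # Step 1: a few illustrative sentence-level and word-level features.
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    return [
        1.0 - index / total,                          # relative position (1.0 = first)
        len(words) / 20.0,                            # length, scaled by an assumed typical length
        float(sum(w in title_words for w in words)),  # number of title-word hits
    ]


def summary_probability(feats: List[float], weights: List[float],
                        bias: float) -> float:
    # Step 2: a logistic model turns the features into one output score.
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))


feats = sentence_features("Summarization of scientific articles.", 0, 4,
                          {"summarization", "articles"})
# Assumed weights and bias, not learned from data.
prob = summary_probability(feats, [1.0, 0.0, 1.0], -3.0)
```

A sentence would then be labeled "summary" when its score exceeds a chosen threshold, for example 0.5.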

Research on abstractive summarization techniques has leveraged deep learning methods to address the limitations of classical machine learning. These techniques include rule-based approaches, which identify segments containing significant events and integrate this information into the summary, as well as tree-based and ontology-based approaches. Embedding-based approaches are also used, in which the embedding of a word represents its meaning.

A document can be viewed as a collection of sentences, and a sentence as a collection of words. Summarization is then framed as the maximization of a submodular function, defined as the negative sum of the nearest-neighbor distances over the embedding distributions. The basic steps are the same as for the ML-based methods, except that a neural network is used to produce the single output score.

Approaches to automatic text summarization

Extractive summarization

Extractive summarization generates summaries through surface analysis, drawing on approaches from information retrieval. It extracts the most relevant sentences from the text and chains them together to form a coherent excerpt. It was introduced by Luhn in 1958 and Edmundson in 1969. The main objective is to provide a summary quickly using only the surface elements of the text. There are two main ways of handling these surface elements.

First, statistical techniques, in line with information retrieval models, use relevance criteria based on a numerical value computed for each text segment by a scoring function over several precise and variable criteria. A segment is extracted if this value is sufficiently high with respect to a threshold and to the values of the other segments. The criteria used to assess the relevance of a segment are fairly heterogeneous. The main ones are the frequency of terms that are relevant or representative of the text, the position in the text, the presence of terms that appear in the titles, and so on. What characterizes these statistical methods is that they work on purely numerical values, computed from coefficients that are either set arbitrarily or learned.

Second, linguistic techniques rely on the presence of surface linguistic markers and on discourse criteria tied to these markers, such as the position in the discourse structure, to establish whether a sentence of the text is relevant. The objective here is to identify the most important segments through linguistic knowledge about the text (markers, discourse structures, etc.) without a quantitative relevance assessment that mixes criteria of different kinds. This approach generally rests on the idea that certain surface markers, in a precise textual context, make it possible to assign a semantic value to the segment that contains them and thus to determine its relevance in the text. Its advantage is that it avoids deep analysis and delivers a simpler, faster summary made of typical excerpts from the original text.

Abstractive summarization

Abstractive summarization uses artificial intelligence techniques to generate a summary from an understanding of the content. It is particularly valuable in deep learning contexts because it can overcome the grammatical inconsistencies often found in extractive summaries. The approach mimics human summarization: it requires a comprehensive analysis of the text to build a representation, which is then transformed to produce the summary. The system must be able to paraphrase and condense sections of the source document. Abstractive algorithms generate new sentences that convey the most important information of the original text, as humans naturally do when summarizing. Although many methods follow this understanding-based approach, most research in the field still focuses on extraction. Two steps are essential: 1) understanding the source text and 2) generating the summary.

The advantages and disadvantages of the two approaches are as follows:

EXTRACTION

Advantages:

- Faster and simpler than the abstractive approach.

- Higher accuracy, because sentences and exact terminology are extracted directly from the original text.

Disadvantages:

- Redundancy among some of the summary sentences.

- Extracted sentences may be longer than average.

- Conflicting temporal expressions in the multi-document case, because the extracted sentences are selected from different input documents.

- Lack of semantics and cohesion in the summary sentences.

- The output summary may be unfair to input texts that cover several topics with contradictory information.

ABSTRACTION

Advantages:

- Produces better summaries with a vocabulary that differs from the original text, using more flexible expressions obtained through paraphrasing, compression, or fusion.

- The generated summary is closer to a manual summary.

- Reduces the text more than extractive methods.

Disadvantages:

- Generating a high-quality abstractive summary is very difficult.

- Requires a complete interpretation of the input text in order to generate new sentences.

- Cannot properly handle out-of-vocabulary words or anything their representations cannot capture.

TABLE 2.1 – Advantages and disadvantages of the extractive and abstractive approaches

Steps of automatic text summarization

Steps of extractive summarization

1 Build an intermediate representation of the input text in order to find the salient content.

2 Score the sentences on the basis of this representation, assigning each sentence a value that indicates how likely it is to be included in the summary.

3 Produce a summary from the most important (highest-scoring) sentences.

FIGURE 2.5 – Steps of extractive summarization [4]

Steps of abstractive summarization

1 Topic identification is the first step in producing an abstractive summary. The input file is filtered to keep only the most important topics of the text. Once identified, these topics are presented in the form of an extract. To carry out this step, almost all systems use several independent modules. Each module assigns a score to the input units (word, sentence, or paragraph), and a combination module then adds up the scores of each unit to give it a single score. Finally, the system returns the highest-scoring units, according to the summary length requested by the user or fixed in advance by the system. This step yields a simple summary by detecting the important units of the document (words, sentences, paragraphs, etc.).

2 The interpretation step compacts the extract by reinterpreting and merging the extracted topics into briefer ones. This is essential because abstractive summaries are generally shorter than the equivalent extracts. This second phase (moving from extract to abstract) is naturally more complex than the first. To complete it, the system needs knowledge about the vocabulary (for example, ontologies), since without such knowledge no system can merge the extracted topics into a smaller number of topics to form an abstraction. During interpretation, the topics identified as important are merged, represented in new terms, and expressed with a new formulation, using new concepts or words that do not appear in the original document.

3 The summary generation step takes the result of the interpretation step and formulates new sentences in order to produce a text that is coherent and readable by humans.

Considering the significant influence of artificial intelligence, particularly in the areas of knowledge representation and cognitive activity description, additional steps can certainly be incorporated.

Algorithms

Algorithms for extractive summarization

As a general rule, an extraction-based approach to text summarization can work as follows:

1 Introduce a method for extracting the key phrases worth keeping from the source document. For example, part-of-speech tagging, word sequences, or other linguistic patterns can be used to identify key phrases.

2 Gather text documents with positively labeled key phrases. The key phrases must be compatible with the chosen extraction technique. To increase accuracy, negatively labeled key phrases can also be created.

3 Train a binary machine learning classifier to perform the text summarization. Other features can be included.

The Luhn algorithm is a frequency-based method closely related to TF-IDF (Term Frequency-Inverse Document Frequency), a statistical measure commonly used in information retrieval and text mining to assess the importance of a term within a document relative to a larger collection or corpus. In one of the earliest text summarization techniques, Luhn proposed that the significance of each word in a document is related to its frequency. Sentences containing the most frequent words are considered more meaningful, while those containing fewer of them are considered less significant. However, this approach is not regarded as highly accurate.
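The frequency-based idea can be sketched as follows. This is a simplified variant written for illustration, not Luhn's full procedure: it assumes a toy stop-word list, treats every word occurring at least `min_count` times as significant, and scores a sentence by the square of its number of significant words divided by the span between the first and last of them. The function names and the sample text are hypothetical.

```python
import re
from collections import Counter
from typing import List, Set

# Assumed toy stop-word list.
STOPWORDS = {"the", "a", "an", "of", "is", "are", "and", "to", "in", "on", "with"}


def significant_words(sentences: List[str], min_count: int = 2) -> Set[str]:
    # Words that are frequent enough in the document and are not stop words.
    words = [w for s in sentences for w in re.findall(r"[a-z]+", s.lower())
             if w not in STOPWORDS]
    return {w for w, c in Counter(words).items() if c >= min_count}


def luhn_score(sentence: str, significant: Set[str]) -> float:
    words = re.findall(r"[a-z]+", sentence.lower())
    hits = [i for i, w in enumerate(words) if w in significant]
    if not hits:
        return 0.0
    span = hits[-1] - hits[0] + 1  # words between first and last significant word
    return len(hits) ** 2 / span


text = [
    "Neural networks learn representations.",
    "Deep networks with many neural layers.",
    "The garden is quiet.",
]
sig = significant_words(text)
luhn_scores = [luhn_score(s, sig) for s in text]
```

Sentences in which the significant words are close together score higher than sentences in which they are spread out.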

The PageRank algorithm, developed by Larry Page and Sergey Brin at Stanford University in the late 1990s, ranks web pages according to their incoming links (backlinks). It serves as the foundation of Google's search engine, where it is used to measure the importance of web pages and to rank them in search results.
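To make the idea of ranking by incoming links concrete, here is a small power-iteration sketch in Python. It assumes a damping factor of 0.85, a fixed number of iterations, and a link dictionary in which every link target is also a key; the three-page graph is invented for the example.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 100) -> Dict[str, float]:
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                # A page passes its rank equally to the pages it links to.
                share = damping * rank[p] / len(outgoing)
                for q in outgoing:
                    new[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph, page C ends up with the highest rank because it receives links from both A and B.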

TextRank is a graph-based ranking algorithm for text processing that identifies the most relevant sentences or keywords in a text. It is an unsupervised, extractive approach that applies PageRank to a graph built from the text. For summarization, the nodes of the graph represent sentences, and the edge weights indicate the similarity between those sentences.
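The text above does not say which similarity measure weights the edges. The sketch below assumes the word-overlap measure commonly associated with TextRank, in which the number of shared words is normalized by the sum of the logarithms of the sentence lengths; the function name and the tokenization are illustrative choices.

```python
import math
import re


def textrank_similarity(s1: str, s2: str) -> float:
    w1 = re.findall(r"[a-z]+", s1.lower())
    w2 = re.findall(r"[a-z]+", s2.lower())
    if not w1 or not w2:
        return 0.0
    # Normalizing by log lengths avoids favoring long sentences.
    denom = math.log(len(w1)) + math.log(len(w2))
    if denom <= 0:  # both sentences contain a single word
        return 0.0
    return len(set(w1) & set(w2)) / denom


sim = textrank_similarity("the cat sat", "the cat ran")
```

The resulting values can be used directly as edge weights in a PageRank-style ranking of the sentences.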

LexRank is an unsupervised, graph-based algorithm for automatic text summarization inspired by PageRank. It assesses the importance of sentences in a graph representation using the concept of eigenvector centrality. Like TextRank, LexRank builds a graph of sentences; the edges are weighted by a modified cosine similarity measure that incorporates IDF. LexRank also includes a post-processing step that ensures that the key sentences selected for the summary are not too similar to one another.

Latent Semantic Analysis (LSA) is a technique that builds a vector representation of a document, so that documents can be compared by computing the distance between their vectors. LSA projects the data into a lower-dimensional space without significant loss of information, and the singular vectors capture recurring patterns of word combinations in the corpus. The magnitude of a singular value indicates the importance of the corresponding pattern in a document. This algebraic method reveals relationships between sentences and words while reducing noise, which improves accuracy.

Algorithms for abstractive summarization

The abstractive approach relies on deep learning, using neural networks to perform tasks traditionally handled by humans, such as understanding the input text and generating summaries. Generally, the following steps can be identified:

3 Produce output sequences in the form of summaries.

Recurrent Neural Networks (RNNs) address the limitations of traditional neural networks in reasoning over sequences and retaining information. They contain loops that allow information to persist over time. RNNs are an important variant of neural networks and are widely used in Natural Language Processing (NLP).

RNNs are a class of neural networks that use the outputs of previous steps as inputs to the following ones while maintaining hidden states. An RNN has a notion of "memory" that retains information about what has been computed up to time step t. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. They belong to the class of artificial neural networks represented as graphical models, in which the nodes form a directed graph along a sequence, which allows them to exhibit temporal dynamic behavior. In traditional (feed-forward) neural networks, inputs and outputs are treated as independent of one another, so previous information is not taken into account. The connections between nodes never form a cycle: information moves in one direction, from the input states through the hidden states to the output state. The outputs are also independent of each other, so the output at time step t does not depend on time step t-1. In such a setting, predicting the next word of a sentence with a feed-forward network is ineffective. With RNNs, the internal states can be used to process a sequence of inputs, and the network is trained by backpropagation. RNNs are effective when the inputs and outputs depend on one another.

FIGURE 2.6 – Standard recurrent neural network [5]
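The recurrence described above is commonly written as follows. The notation is an assumption made for this illustration (the weight matrices W_{xh}, W_{hh}, W_{hy} and biases b_h, b_y do not appear in the source): the hidden state h_t depends on the current input x_t and on the previous hidden state h_{t-1}, which is what gives the network its memory.

```latex
\[
\begin{aligned}
h_t &= \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right) \\
y_t &= W_{hy}\, h_t + b_y
\end{aligned}
\]
```

Setting W_{hh} = 0 removes the dependence on h_{t-1} and recovers a feed-forward network whose output at step t ignores step t-1.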

LSTM (Long Short-Term Memory) is a specialized type of Recurrent Neural Network (RNN) designed to learn long-term dependencies. Introduced by Hochreiter and Schmidhuber in 1997, LSTMs work well on a wide variety of problems. They have a chain-like structure in which each repeating module contains four interacting layers, which allows them to retain information over long periods. This architecture combines a cell state with gates that control the flow of information, which lets LSTMs handle tasks that require understanding context over time.

The cell state is the horizontal line running through the repeating module and acts like a conveyor belt with minimal interactions: it is a path along which information is carried. In some scenarios, only recent information is needed to perform a given task, for example a language model trying to predict the last word of a sentence. When the cell state carries information that is not relevant or recent, the gates discard it and keep what is important. In such cases, the LSTM network has proven very effective. The basic structure of an LSTM module is as follows:

FIGURE 2.7 – Long Short-Term Memory networks [5]
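In the usual textbook formulation, the gates and the cell state of one LSTM module can be written as below. The symbols are assumptions introduced for this illustration: f_t, i_t and o_t are the forget, input and output gates, C_t is the cell state, \sigma is the logistic sigmoid and \odot is the element-wise product.

```latex
\[
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
\]
```

The fourth line is the "conveyor belt": the previous cell state is only scaled by the forget gate and incremented by the gated candidate, so information can pass through many steps almost unchanged.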

Encoder-decoder models are the foundation of sequence-to-sequence (seq2seq) architectures, and the same idea underlies attention models and Transformer-based models such as GPT and BERT. They consist of two main components, the encoder and the decoder, which in the classical setting are recurrent neural networks that transform an input sequence from one domain into an output sequence in another domain. The encoder processes a sequence of words and converts it into a vector representation. The decoder generates the output by predicting the next word from the preceding context, and can be enhanced with an attention mechanism that focuses on specific input words at each step. Besides automatic summarization, one of the best-known applications of these models is translation, where words in one language are converted into another. Basic models perform well on short sentences but struggle with longer sequences. Long Short-Term Memory (LSTM) networks are a popular choice for these models, but they are slow to train and lose information on long sequences. Transformers address this by relying on an attention mechanism instead of recurrent structures.

Transformers were introduced to eliminate recurrence, which enables parallel computation and reduces training time while addressing the performance problems caused by long-range dependencies. Their architecture uses an attention mechanism that encodes each position and relates distant words to one another: at each step it examines the input sequence and determines how important the other parts are. This is done by computing an "importance vector" that captures the relevance of each input encoding for the current step and thus indicates which parts of the input should be prioritized.
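The "importance vector" can be illustrated with scaled dot-product attention for a single query, the form used in Transformers. This is a minimal pure-Python sketch. The two-dimensional vectors are hypothetical values chosen for readability.

```python
import math


def softmax(values):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # the importance of each input position
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Hypothetical encodings of three input positions.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = attention([1.0, 0.0], keys, values)
```

Positions whose key points in the same direction as the query receive larger weights, and the context vector is the corresponding weighted mix of the values.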

Use cases for automatic summarization

E-commerce sector

E-commerce refers to buying and selling products or services online. E-commerce platforms let customers share their opinions through ratings and written comments. Automatic text summarization can help analyze and synthesize this data to better understand customer needs and expectations, identify trends, and assist potential buyers in their decisions.

Media monitoring

Automatic summarization makes it possible to condense the continuous streams of information from social media into easily accessible summaries. Users can stay connected to social media while reducing the overwhelming flow of information that comes with it.

Newsletter selection

Newsletters consist of an introduction followed by a curated selection of relevant articles. Automatic summarization provides a condensed overview of each article, so readers can quickly identify the most important and pertinent content and select the articles that match their interests without opening each one.

Financial sector

Banks and insurance companies invest heavily in acquiring information, such as news articles and various subscriptions, to inform their decision-making and to automate the negotiation of their services and actions. Analyzing and synthesizing financial documents such as market reports allows a quick assessment of a financial situation and supports informed recommendations for actions within the bank or for its investors.

Email summarization

As the number of emails received each day keeps growing, various experiments have been carried out to make them easier to manage. Some focus on classifying emails and extracting keywords, while others aim to distill the essential content by automatically summarizing emails, either by extraction or by abstraction.

E-learning

E-learning is an educational system based on formalized teaching supported by electronic resources. Many teachers use case studies and current-affairs content in their teaching. Automatic summarization can help them update their teaching content more quickly by producing summaries related to their areas of interest. It is also an important tool for students who are writing an internship report, a final-year thesis, and so on.

Examples of existing systems

Many automatic text summarization systems have emerged. We present some of these industrial tools in the table below:

APPLICATION NAME | URL | FUNCTIONS

TLDR This | | Free automatic text summarization tool that condenses long articles, documents, or web pages into concise summaries in one click. It uses natural language processing to filter out distractions and highlight key points, and it is aimed at students, writers, and anyone who needs to grasp the essence of long texts quickly. It also extracts metadata such as the author and publication date and estimates reading time. It is recognized by top websites.

Text Summarizer | https://aws.amazon.com/marketplace/ai/model-evaluation?productId dfde1c3c-0b54-45eb\ | Built on models based on transfer learning and Transformers. The input can be at most 512 words and the output is 3 sentences (about 30 words).

QuillBot | https://quillbot.com/summarize | Summarizes web articles and documents and lets users choose the summary length, as paragraphs or as key sentences. It is available as an extension for Chrome, Word, and Google Docs, and it can identify confusing sentences and paragraphs.

Charli | https://www.charli.ai/ | AI-powered solution that can organize, understand, interpret, analyze, and automate tasks. It integrates with over 500 applications, including ERP, CRM, and cloud-based systems such as Dropbox, Slack, Asana, and Google Workspace.

Scholarcy | https://www.scholarcy.com/ | Chrome and Edge extensions that break each paragraph of an article or document into smaller sections and identify key information for quick extraction and use.

Summarization of scientific articles

Scientific article: definition

A scientific article is a report that presents scientific findings, typically describing the work and results of experimental research, and is published in one or more scientific journals.

Structure of a scientific article

Scientific articles are well-structured documents that follow established scientific methods. The main criteria for a scientific article are that its authors are researchers in the field and that it has an editor and a scientific journal responsible for editing and validation. These articles generally share textual characteristics such as predictable placement of typical elements, signal words, and a template-like structure. Seven main sections can usually be distinguished: Abstract, Introduction, Method, Results, Discussion, Conclusion, and References.

Types of scientific article summaries

According to N. Ibrahim Altmami and M. El Bachir Menai (May 5, 2020) [17], there are two main types of automatic summaries of scientific articles:

1. a summary that gives a general overview of the article, which can be called a generic summary, and

2. a summary based on specific citation sentences, which can be called an oriented summary.

As mentioned above, a generic summary refers only to the overall content of the source document and is produced without considering any specific context. This type of summary is criticized for lacking scientific precision, because it presents contributions in a general, untargeted way, and for conveying the author's point of view poorly. These problems led to the development of context-oriented summary generation.

An oriented summary is driven by a specific request or objective and focuses on particular sentences or citations. Only the information relevant to the specified context is selected and processed. This type of summary uses a set of relevant citations of the article to produce a concise overview of the main contributions and conclusions of the article. It therefore contains more targeted information and contributions than the generic summary.

Nikhil Alampalli Ramu, Mohana Sai Bandarupalli, Manoj Sri Surya Nekkanti, and Gowtham Ramesh: "Summarization of Research Publications Using Automatic Extraction", International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI 2019) - https: | The authors address the difficulty of handling large volumes of data, which complicates the extraction and transformation needed to identify relevant scientific documents. They see this as a major obstacle for the scientific research community and propose a model that extracts the problem statement from scientific articles. The extracted statement can then be used to find related articles, which makes research more efficient.

Arman Cohan and Nazli Goharian: "Scientific document summarization via citation contextualization and scientific discourse", International Journal on Digital Libraries (IJDL, May 2017) - https: | The authors present a method for summarizing scientific articles based on citation contextualization and scientific discourse. The two proposed approaches build on earlier work on citation-based summaries, in which the key points of an article are extracted from a given set of citations, and add the reference context to correct the inaccuracies often found in citation texts.

Junsheng ZHANG, Kun LI, Changqing YAO: "Event-based Summarization for Scientific Literature in Chinese", International Conference on Identification, Information and Knowledge in the Internet of Things 2017 - https://www.sciencedirect.com/science/article/pii/S1877050918302758 | The authors present a method that generates a summary from an event structure known as 5W1H. Sentences are categorized and selected according to their relevance to the different event elements, the importance of each candidate sentence is then assessed, and the most pertinent and significant sentences are selected to form an event-based summary.

TABLE 2.3 – Some works on the summarization of scientific articles

Evaluation techniques

Intrinsic evaluation

Intrinsic metrics assess the generated summary directly, by comparing it with one or more human-written summaries or by checking for the presence of keywords. Among the intrinsic evaluation metrics available, the best known and most widely used are BLEU and ROUGE.

BLEU (Bilingual Evaluation Understudy) is a metric that scores a generated sentence against a reference sentence. A perfect match scores 1.0 and a complete mismatch scores 0.0. The metric was developed to evaluate the output of machine translation systems.

The BLEU equation is as follows:
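In its standard formulation, given here as an assumption about the intended equation, BLEU combines the modified n-gram precisions p_n with a brevity penalty BP. The symbols are introduced for this illustration: c is the length of the candidate, r the length of the reference, and w_n the weight of each n-gram order (commonly w_n = 1/N with N = 4).

```latex
\[
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\]
```

A candidate identical to the reference has p_n = 1 for every n and BP = 1, which gives the score of 1.0 mentioned above.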

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a recall-based set of measures that uses n-grams as content units to compare candidate summaries with reference summaries. It is mainly used to evaluate automatic text summarization and machine translation: an automatically generated summary or translation is compared with a set of human-produced references, and ROUGE measures the n-gram overlap between the two. A high ROUGE score indicates strong agreement with the human-written summaries, which suggests effective summarization.

1. Precision score: the number of sentences that appear in both the reference and the candidate (i.e., system) summaries, divided by the number of sentences in the candidate summary.

2. Recall score: the number of sentences that appear in both the reference and the candidate summaries, divided by the number of sentences in the reference summary.

3. F-measure score: a measure that combines recall and precision. It is the harmonic mean of precision and recall.

The ROUGE equation is as follows:
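ROUGE-N applies the same three ratios to n-grams: recall is the number of overlapping n-grams divided by the number of n-grams in the reference, precision divides by the number of n-grams in the candidate, and the F-measure is their harmonic mean. The sketch below is a simplified illustration, assuming whitespace tokenization and a single reference. It is not the official ROUGE toolkit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall and F-measure against one reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f_measure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f": f_measure}


rouge_1 = rouge_n("the cat sat on the mat", "the cat is on the mat", n=1)
rouge_2 = rouge_n("the cat sat on the mat", "the cat is on the mat", n=2)
```

In this example five of the six unigrams and three of the five bigrams are shared, so ROUGE-1 is 5/6 and ROUGE-2 is 3/5 for precision, recall, and F-measure alike.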

Extrinsic evaluation

Whereas intrinsic evaluation assesses the coherence, content coverage, and informativeness of a summary, extrinsic evaluation measures how useful summaries are in a specific application context. It focuses on the tasks for which a document summarization system can be beneficial: the system's output is judged by its impact on an external task, which often requires human intervention. In general, automated techniques are easier to develop for intrinsic evaluation than for extrinsic evaluation.

In this chapter, we present the proposed solutions: the overall architecture adopted, the pipelines of the models, the dataset used, and related elements.

Overall architecture of our system

In this work, we examined existing models and systems for automatic text summarization. We found that current Machine Learning and Deep Learning models can generate coherent, high-quality extractive or abstractive summaries. However, the systems built around these models often produce summaries that carry limited information and do not help the reader understand the content under each main heading of a document.

Given the importance of scientific articles and of the information they contain, our system summarizes each main heading of a scientific article, treating the content under it as a distinct section, so that the article is easier to understand.

Our system receives one or more scientific articles in PDF format and applies a preprocessing step that extracts the text while preserving the original structure. The extracted content is then processed by either an extractive or an abstractive summarization model. Finally, a summary is generated for each section, based on the main headings of the article and the chosen model.

The overall architecture of our system is therefore designed as follows:

FIGURE 3.1 – Overall architecture of our system

Extractive model pipeline

For extractive summarization we used the LSA model. It is an unsupervised model that applies natural language processing (NLP) techniques based on Singular Value Decomposition (SVD) to extract information.

Latent Semantic Analysis (LSA) uses an algebraic-statistical method to uncover hidden semantic structures in words and sentences. It identifies associations between words: which words are used together and which words are common to different sentences. A large number of words shared between sentences indicates that the sentences are semantically related. The meaning of a sentence is determined by the words it contains, and the meaning of a word is determined by the sentences in which it appears. The fundamental steps of the LSA pipeline are as follows:

Creation of the input matrix: the document is represented as a matrix so that it can be analyzed and used in calculations. This produces a document-term matrix whose cells indicate the importance of each word in each sentence.

Singular Value Decomposition (SVD): SVD is applied to the matrix generated from the text. This algebraic-statistical method models the relationships between words and sentences and groups them into coherent contextual semantic trends. The matrix is weighted with Term Frequency-Inverse Document Frequency (TF-IDF), where TF is the number of times a word is repeated in a sentence and DF is the number of sentences in the document.

Sentence selection: sentences are selected according to the highest value, or the average, of their TF-IDF scores, so that the chosen sentences contain the most frequent words or expressions of the whole document.
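The three steps can be sketched in pure Python as follows. This is a simplified illustration and not the implementation used in this work: it assumes a smoothed IDF that counts the sentences containing each term, and instead of a full SVD it uses power iteration on the sentence Gram matrix to obtain only the leading singular vector, whose entries give the weight of each sentence in the dominant topic. The example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_rows(sentences):
    """Step 1: one TF-IDF vector (a dict) per sentence."""
    tokens = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    n = len(sentences)
    df = Counter(word for sent in tokens for word in set(sent))
    rows = []
    for sent in tokens:
        tf = Counter(sent)
        # Smoothed IDF (an assumption) so that no weight is exactly zero.
        rows.append({w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1.0) for w in tf})
    return rows


def leading_sentence_weights(rows, iterations=100):
    """Step 2: leading singular vector via power iteration on A.A^T."""
    n = len(rows)
    gram = [
        [sum(rows[i][w] * rows[j].get(w, 0.0) for w in rows[i]) for j in range(n)]
        for i in range(n)
    ]
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt)) or 1.0
        v = [x / norm for x in nxt]
    return [abs(x) for x in v]


def lsa_summary(sentences, k=1):
    """Step 3: keep the k highest-weighted sentences, in document order."""
    weights = leading_sentence_weights(tfidf_rows(sentences))
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]


sentences = [
    "Graph methods rank sentences by similarity.",
    "Summaries keep the sentences that rank highest in the graph.",
    "The cafeteria closes early on Fridays.",
]
summary = lsa_summary(sentences, k=1)
```

A library implementation such as the LSA summarizer in Sumy computes a full SVD and combines several singular vectors. The sketch keeps only the first one to stay short.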

Abstractive model pipeline

For abstractive summarization we used T5, a model released by Google as part of a study that explored the limits of transfer learning and which techniques work best when applied at scale.

T5 ("Text-To-Text Transfer Transformer") is a model designed for a wide range of natural language processing (NLP) tasks. It is an encoder-decoder model trained on a multi-task mixture of unsupervised and supervised datasets, in which every task is converted into a text-to-text format. Unlike BERT-style models, which can only produce class labels or embeddings from the input, T5 takes text as input and generates new text as output. This text-to-text formulation makes T5 versatile for applications such as automatic summarization, question answering, and machine translation.

This model is available in different sizes [19], namely:

— T5-small avec 60 millions de paramètres.

— T5-base avec 220 millions de paramètres.

— T5-large avec 770 millions de paramètres.

Because it is a pre-trained model, which would otherwise require heavy hardware resources, its implementation is simplified. The T5 pipeline can be summarized in 2 main steps:

1. Pre-training: initial training on large datasets such as C4, with a denoising objective and an encoder-decoder architecture.

2. Fine-tuning: the model is then fine-tuned on downstream tasks with a supervised objective and input formatting suited to the text-to-text setting.

Dataset

Although our system does not depend on training a specific model, we fine-tuned the T5-base model as an experiment. We used a standard dataset for automatic text summarization known as "news_summary". It is designed for automatic summarization, can be downloaded in CSV format, and has six columns and 4,514 examples from various contributors.

We divided the dataset into three parts: "TRAIN", with 80% of the data for training; "VAL", with 10% for validation; and "TEST", with the remaining 10% for testing the model.
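An 80/10/10 split of this kind can be sketched as below. The shuffling seed and the helper name `split_dataset` are hypothetical. The row indices stand in for the 4,514 examples of the CSV file.

```python
import random


def split_dataset(rows, train=0.8, val=0.1, seed=42):
    """Shuffle the rows and cut them into TRAIN, VAL and TEST parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    # Whatever is left after TRAIN and VAL goes to TEST.
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


train_set, val_set, test_set = split_dataset(range(4514))
sizes = (len(train_set), len(val_set), len(test_set))
```

With 4,514 examples this yields 3,611 training, 451 validation, and 452 test examples. Fixing the seed keeps the split reproducible between runs.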

We evaluated our T5-base model on data from the dataset (30 articles) and on external data consisting of about 30 scientific articles in PDF format, 20 from computer science and 10 from various other fields.

Contribution

The core of our contribution lies in the preprocessing steps, which extract the text of the submitted PDF scientific articles while preserving their original structure. We made sure that our models receive text with a clearly identified structure, delimited by the main headings, which guides them in generating the summaries. The models we tried (TextRank, LSA, BART, and T5-base) operate after this preprocessing phase, as shown in our system architecture. We selected LSA and T5-base for the clarity and coherence of the summaries they produced under the same experimental conditions. A summary that lacks clarity and coherence, so that human readers cannot accurately grasp the information it provides, is considered ineffective.

For TextRank and LSA we used the Python library Sumy and configured it so that each section produces a one-line summary. For the Transformer models BART and T5-base, we constrained the generated summaries to specified minimum and maximum lengths. TextRank generally produces more n-grams that match the human summaries than LSA does, because it tends to return more information regardless of the implementation settings.
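A per-section summary with Sumy might look like the sketch below. The helper names `summarize_sections` and `sumy_lsa` and the section dictionary are hypothetical. The Sumy calls themselves (`PlaintextParser.from_string`, `Tokenizer`, `LsaSummarizer`) are part of the library, which also needs the NLTK tokenizer data to be installed. Replacing `LsaSummarizer` with `TextRankSummarizer` from `sumy.summarizers.text_rank` would give the TextRank variant.

```python
def summarize_sections(sections, summarize_fn, sentences_per_section=1):
    """Apply a summarizer to every section, keyed by its heading."""
    return {
        title: summarize_fn(text, sentences_per_section)
        for title, text in sections.items()
    }


def sumy_lsa(text, sentences_count, language="english"):
    """Summarize one section with Sumy's LSA summarizer."""
    # Imported lazily so that summarize_sections can be used without Sumy.
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.summarizers.lsa import LsaSummarizer

    parser = PlaintextParser.from_string(text, Tokenizer(language))
    summarizer = LsaSummarizer()
    selected = summarizer(parser.document, sentences_count)
    return " ".join(str(sentence) for sentence in selected)


# Usage, assuming Sumy is installed:
# summaries = summarize_sections({"Introduction": intro_text}, sumy_lsa)
```

Passing the summarizer as a function keeps the per-section loop independent of the model, so the same loop can serve the extractive and the abstractive models.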

FIGURE 3.5 – Section-wise summary with LSA [9]

FIGURE 3.6 – Evaluation of the section-wise summary with LSA [9]

FIGURE 3.7 – Section-wise summary with TextRank [9]

FIGURE 3.8 – Evaluation of the section-wise summary with TextRank [9]

In this chapter, we present the implementation, experimentation, and evaluation of our work.

Working environment

Hardware resources

For development and experiments we used a computer with the following characteristics:

• Processor: Intel(R) Core(TM) i3-3110 M10 CPU @ 2.40GHz (4CPUs)

• Graphics card: Intel(R) HD Graphics 4000

Software resources

• Integrated development environment (IDE): Jupyter Notebook and PyCharm

• Miscellaneous libraries: BeautifulSoup, Sumy, PyMuPDF, GenSim, etc.

CHAPTER 4 IMPLEMENTATION, EXPERIMENTATION AND EVALUATION

Implementation

Text extraction

Our system architecture starts with the raw extraction of text from the submitted PDF document. This first step converts the PDF into plain text (txt). For this we used the fitz module of PyMuPDF.

FIGURE 4.1 – Raw text extraction
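A raw extraction with PyMuPDF can be sketched as follows. The helper names and the way pages are joined are assumptions made for this illustration. `fitz.open` and `Page.get_text` are PyMuPDF calls (older releases name the latter `getText`).

```python
def pages_to_text(page_texts):
    """Join the text of the pages, dropping empty ones."""
    stripped = (text.strip() for text in page_texts)
    return "\n".join(text for text in stripped if text)


def extract_pdf_text(pdf_path):
    """Return the plain text of every page of a PDF file."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as document:
        return pages_to_text(page.get_text() for page in document)


def pdf_to_txt(pdf_path, txt_path):
    """Write the extracted text to a .txt file."""
    with open(txt_path, "w", encoding="utf-8") as handle:
        handle.write(extract_pdf_text(pdf_path))
```

Plain-text extraction loses the heading information, which is why a second, structure-aware extraction is needed to recover the sections.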

Section-wise extraction

Text is extracted section by section so that our models receive it in the same structure as the original document. We used Beautiful Soup, a Python library for extracting data from HTML and XML files. It works with a parser to navigate, search, and modify the parse tree. This lets us walk through the text extracted from the PDF document and retrieve the main headings according to their header size (H2, H3, or H4). In this example, we extracted the H2 headings.

FIGURE 4.2 – Identification of the main headings H

We then delimit all the text attached to each main heading, which we call a section. Each main heading identified in this way can cover several paragraphs.
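The delimitation of sections can be illustrated as below. To stay dependency-free, this sketch swaps Beautiful Soup for the standard-library `html.parser`, and it assumes that the extracted text is available as HTML in which the main headings are real `h2` tags. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class SectionCollector(HTMLParser):
    """Group the text that follows each heading of the chosen level."""

    def __init__(self, heading_tag="h2"):
        super().__init__()
        self.heading_tag = heading_tag
        self.sections = {}
        self._current = None      # heading of the section being filled
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == self.heading_tag:
            self._in_heading = True
            self._current = ""

    def handle_endtag(self, tag):
        if tag == self.heading_tag:
            self._in_heading = False
            self._current = self._current.strip()
            self.sections[self._current] = []

    def handle_data(self, data):
        text = data.strip()
        if not text or self._current is None:
            return  # ignore whitespace and text before the first heading
        if self._in_heading:
            self._current += " " + text
        else:
            self.sections[self._current].append(text)


def split_sections(html, heading_tag="h2"):
    """Return a dict mapping each heading to the text of its section."""
    collector = SectionCollector(heading_tag)
    collector.feed(html)
    return {title: " ".join(parts) for title, parts in collector.sections.items()}


page = (
    "<h2>Introduction</h2><p>First paragraph.</p><p>Second paragraph.</p>"
    "<h2>Method</h2><p>Details.</p>"
)
sections = split_sections(page)
```

Each heading becomes a key and all paragraphs up to the next heading of the same level form its section, which is the unit later passed to the summarization models.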

Preprocessing

A preprocessing step removes unnecessary special characters before the adopted models are run. This step is essential to obtain a clear and precise representation of the original input text. Preprocessing is crucial in nearly all text processing and Natural Language Processing systems.
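A cleaning step of this kind can be sketched with regular expressions. Which characters count as unnecessary is an assumption made here: the sketch drops bracketed citation markers, replaces symbols other than letters, digits, and basic punctuation with a space, and collapses whitespace.

```python
import re


def clean_text(text):
    """Remove citation markers and stray symbols, then normalise spaces."""
    text = re.sub(r"\[\d+(?:\s*,\s*\d+)*\]", "", text)  # markers such as [5] or [3, 7]
    text = re.sub(r"[^\w\s.,;:!?()%'-]", " ", text)     # other special characters
    text = re.sub(r"\s+", " ", text)                    # runs of spaces and newlines
    return text.strip()


cleaned = clean_text("Results★ improve [12] by 5%\n\n• see §3.")
```

Here `cleaned` is "Results improve by 5% see 3.". The set of characters to keep should be adapted to the articles being processed, for example to preserve mathematical symbols.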

Experimentation

Extractive summarization

We submitted the text extracted section by section to the LSA model to produce an extractive summary.

FIGURE 4.4 – Extractive summary by section

We also underlined the text of our summaries in the PDF file.

FIGURE 4.5 – Extractive summary by section underlined in the PDF file

We also produced a general summary of the whole text, so that we could concatenate the section-wise and global summaries and examine the result.

Our mixed summary is therefore the following:

Abstractive summarization

To produce an abstractive summary, we used the T5-base model from the Transformer family. These models come with accessible APIs that make it easy to download and use state-of-the-art pre-trained models. Using pre-trained models reduces computational costs and simplifies deployment on limited hardware.
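With the Hugging Face `transformers` library, generating a summary with T5-base can be sketched as follows. The helper names, the length limits, and the beam size are assumptions chosen for illustration, not the settings used in this work. The "summarize: " prefix is the task prefix that T5 expects for summarization.

```python
def build_t5_input(text, task_prefix="summarize: "):
    """T5 is text-to-text, so the task is named in the input itself."""
    return task_prefix + " ".join(text.split())


def summarize_with_t5(text, model_name="t5-base", min_length=30, max_length=120):
    """Generate an abstractive summary with a pre-trained T5 checkpoint."""
    # Imported lazily: the model download and PyTorch are only needed here.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    inputs = tokenizer(
        build_t5_input(text),
        return_tensors="pt",
        max_length=512,
        truncation=True,
    )
    output_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        min_length=min_length,
        max_length=max_length,
        num_beams=4,
        early_stopping=True,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `summarize_with_t5` once per section gives a section-wise abstractive summary, and the minimum and maximum lengths bound the size of each generated summary.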

4.3.2.1 Fine-tuning of the T5-base model

Fine-tuning is a form of transfer learning: a pre-trained model is adjusted for a similar task, reusing its existing knowledge instead of starting from scratch. We fine-tuned the T5-base model with a consistent set of hyperparameters, layers, loss functions, and checkpoints. We trained for 10 epochs with a batch size of 8, using the T5 tokenizer for faster training.

— 3072 feed-forward hidden-state outputs

Excerpts of the training losses are shown below:

FIGURE 4.10 – Training loss curve

FIGURE 4.11 – Validation loss curve

4.3.2.2 Test with the dataset

After setting aside part of the dataset for testing, we used a range-based selection to randomly pick 30 articles from our test split for section-wise summarization. At this stage, we made sure that the system is flexible enough to accept one or several PDF articles as input and to generate a summary for each of them.

FIGURE 4.12 – Test with the dataset

4.3.2.3 Test with scientific articles in PDF

We tested our model on 30 scientific articles in PDF format. Each article found in the input directory of our system is processed to generate a summary and is then evaluated. The experimental results are presented in the following figures.

FIGURE 4.13 – Section-wise summary of the article 2101.00029.pdf

FIGURE 4.14 – Evaluation of the article 2101.00029.pdf

We applied the same processing to the scientific articles in PDF format to summarize the text as a whole, without preserving the original structure. This also lets us compare our section-wise summaries with the summaries of the full text.

FIGURE 4.15 – General summary of the article 2101.00029.pdf and its evaluation

Title	Conception et implémentation d’un prototype pour le résumé automatique des articles scientifiques
Author	Michelet Juste
Supervisor	Dr. Ho Tuong Vinh, Responsable du Master - IFI
University	Université Nationale Du Vietnam, Hà Nội
Major	Systèmes Intelligents et Multimédia
Type	Master's thesis
Year of publication	2022
City	Hanoi
