
Extraction d’information pour une population d’un graphe de connaissance en écologie (Information extraction for populating a knowledge graph in ecology)


DOCUMENT INFORMATION

Basic information

Title: Extraction d’information pour une population d’un graphe de connaissance en écologie
Author: Tshitenge Mupuwe Jojo
Supervisor: Nicolas Le Guillarme, teacher-researcher at LECA
University: Université Nationale du Vietnam
Specialization: Systèmes Intelligents et Multimédia
Document type: Mémoire de fin d’études (master's thesis)
Year: 2022
City: Hanoi
Format
Pages: 66
Size: 8.13 MB


Structure

  • 2.1 Introduction
  • 2.2 Presentation of the research topic
    • 2.2.1 Problem statement and context
    • 2.2.2 Internship objectives
  • 2.3 Presentation of the internship setting
  • 2.4 Presentation of the study framework
  • 2.5 Conclusion
  • 3.1 Information extraction
    • 3.1.1 History of information extraction
  • 3.2 Named entities
    • 3.2.1 Named entity recognition
    • 3.2.2 The main steps in named entity recognition (NER)
    • 3.2.3 Preprocessing
    • 3.2.4 Feature extraction
    • 3.2.5 Model
    • 3.2.6 Evaluation metrics
    • 3.2.7 Tools for named entity recognition
    • 3.2.8 TaxoNERD
    • 3.2.9 Disambiguation
  • 3.3 Relation extraction
    • 3.3.1 Presentation
    • 3.3.2 Types of relations
    • 3.3.3 Methods and approaches for binary relations
  • 3.4 Relation extraction under distant supervision
    • 3.4.1 How it works
  • 3.5 Knowledge base
    • 3.5.1 Globi
  • 3.6 Conclusion
  • 4.1 Presentation
  • 4.2 Description of the approaches and techniques
    • 4.2.1 Data acquisition
    • 4.2.2 Results
    • 4.2.3 Recognition of entity mentions
    • 4.2.4 TaxoNERD disambiguation
    • 4.2.5 Annotation
    • 4.2.6 Model training
    • 4.2.7 Conclusion
  • Bibliography
    • A.1 Some functions
      • 3.2 NER steps
      • 3.3 CRF formula [Dupont, 2017]
      • 3.4 BI-LSTM-CRF architecture for NER
      • 3.5 Transformer model architecture
      • 3.6 Scaled dot-product attention; multi-head attention consists of several attention layers in parallel
      • 3.7 BERT architecture
      • 3.8 BioBERT architecture
      • 3.9 TaxoNERD pipeline
      • 3.10 Example of relation extraction
      • 3.11 Distant supervision pipeline
      • 4.1 General architecture
      • 4.2 Paragraph extraction algorithm
      • 4.3 Paragraph extraction algorithm
      • 4.4 Extraction query
      • 4.5 Connection to the ISTEX server
      • 4.6 Server response
      • 4.7 Mention recognition with TaxoNERD
      • 4.8 Example of mention recognition with TaxoNERD output
      • 4.9 Disambiguation with TaxoNERD
      • 4.10 Disambiguation with TaxoNERD
      • 4.11 Globi connection and query
      • 4.12 Example of the Globi annotation implementation
      • 4.13 Example of relation annotation with Globi
      • 4.14 Pre-training and fine-tuning of BioBERT

Nội dung

Extraction d’information pour une population d’un graphe de connaissance en écologie (Information extraction for populating a knowledge graph in ecology)

Introduction

In this chapter, we will provide a general overview of the research topic, the hosting structure for the research, which is the LECA (Laboratoire d’Écologie Alpine), and the study framework, represented by the Institut Francophone International.

Présentation du sujet de recherche

Problématique et Contexte

The growing number of scientific publications available through online libraries is a source of information waiting to be exploited, and it has driven the emergence of research on information extraction.

At the same time, understanding how ecosystems function and how they respond to the disturbances they face are major challenges today.

An ecosystem is typically represented by the various ecological interactions occurring among the species of a community. These interactions are powerful tools for visualizing biodiversity data and assessing ecosystem health, which is strongly influenced by both biotic and abiotic factors. Interaction networks are categorized according to the way species interact: predation, competition, commensalism, amensalism, mutualism, symbiosis, and parasitism. In all cases, these interactions involve the sharing of resources such as space, light, and nutrients.

In the present study, the network considered is the trophic network (food web), which represents relations between consumers and resources.

According to Wikipedia, a trophic network is a collection of interconnected food chains within an ecosystem, through which energy and biomass are transferred. This involves the exchange of elements such as carbon and nitrogen between the different levels of the food chain, from autotrophic plants to heterotrophic organisms.

According to Dunne (2006), a trophic network represents the relationships of predation and feeding among the species of an ecological community. Since the 1970s, ecologists have been striving to gain a deeper understanding of these networks. In this line of research, scientists distinguish "trophic" species from "taxonomic" species.

Trophic species are functional groupings of taxonomic species that share the same predators and prey, in contrast with a taxonomic approach that distinguishes species on the basis of morphological and phylogenetic distances. While a trophic species can coincide with a taxonomic species when it contains only one taxon, the notion is mainly used to mitigate methodological biases in network construction. When taxonomic differentiation is not feasible, a "morphological" species classification is used, distinguishing species solely by their morphology.

The construction of ecological interaction networks is a complex and labor-intensive process that is difficult to carry out through manual literature surveys. Text mining can significantly speed it up by aggregating knowledge about relevant traits from large collections of documents. One possible method is to link species to interactions that have already been observed and documented, using structured open-access knowledge bases that centralize the available information. However, much of the data on species interactions remains scattered in unstructured form in the scientific literature, accessible through specialized search engines. It is therefore essential to choose a method for leveraging this knowledge and enriching existing databases, by equipping information extraction tools to target mentions of ecological interactions in scientific publications.

Indeed, a great deal of information that could help preserve many ecosystem interactions is buried in a mass of scientific publications. The main difficulties of this work are:

— Collecting scientific articles that deal with trophic networks;

— Segmenting each article into paragraphs;

— The ambiguities inherent in natural language;

— The ambiguities involved in recognizing mentions of taxon names;

— Finding a knowledge base of interactions;

— Obtaining an annotated corpus on trophic networks.

Due to the lack of structured and annotated data, we must navigate through a vast amount of unstructured information to identify the necessary details for extracting relationships between species.

Internship objectives

The objectives of the internship were multiple:

— Searching for and retrieving scientific articles that deal with trophic networks;

— Segmenting these articles into paragraphs;

— Recognizing mentions of taxon names;

— Disambiguating the mention names against the taxonomy;

— Building a training and test dataset by leveraging the interactions recorded in an existing knowledge base and by querying the APIs of scientific publication search engines;

— Implementing one or more relation extraction methods and evaluating the performance of these approaches on the test set.

Presentation of the internship setting

The Laboratoire d’écologie alpine (LECA) is a joint research unit (UMR 5553) associating researchers from the CNRS, Université Grenoble Alpes, and Université Savoie Mont-Blanc, and is a member of the Observatoire des Sciences de l’Univers de Grenoble (OSUG).

Its research focuses on short- and long-term observation, experimentation, and modeling to develop predictive models of biodiversity responses to environmental changes. These models are applied to societal issues related to ecosystem service assessment, environmental management, and biodiversity conservation, with a strong emphasis on high-altitude ecosystems.

Presentation of the study framework

The Institut Francophone International (IFI) was created in 1993 through the development of the Institut de la Francophonie pour l’Informatique and the integration of the French University Pole in Hanoi.

1 https://leca.osug.fr/-Qui-sommes-nous-

2 http://ifi.edu.vn/fr/news/

The French University Pole in Hanoi, established in 2006, is located within the National University of Vietnam. Officially named the "Institut Francophone International" since 18 November 2014, the IFI is a high-quality international research and training organization affiliated with the Vietnam National University, Hanoi. Its missions include providing logistical and technical support in computer science to enterprises and research laboratories, as well as offering a training framework in computer science. The training program comprises two specializations: Intelligent Systems and Multimedia (SIM) and Communicating Networks and Systems (RSC). Since its creation in 2009, the IFI has been training its students toward a double research master's degree.

Conclusion

In this chapter, we introduced the research topic, the host organization, and the study framework. The following chapter is devoted to the literature review on information extraction from a corpus of textual documents.

In this chapter, we discuss the various methods and techniques relevant to our work that will help us achieve our objectives. The goals of this internship are diverse, and for each of them several approaches are available. We therefore review existing approaches and the different methods for information extraction.

Information extraction

History of information extraction

According to Minard (2013), the first significant efforts in information extraction began with the Message Understanding Conferences (MUC), initiated and funded by DARPA (Defense Advanced Research Projects Agency). These conferences aimed to evaluate and promote research in information extraction. For instance, during MUC-3 in 1991, the extraction task focused on news dispatches about terrorist activities in Latin America.

The first conference took place in 1987. The last two conferences, MUC-6 and MUC-7, held in 1995 and 1998, focused on information extraction into templates and defined tasks such as named entity recognition (names of people, organizations, and locations), coreference resolution, predefined form filling, and scenario extraction. At MUC-7 in 1998, the task of extracting relations between entities was introduced, covering relations such as employee-of, product-of, and location-of, marking the beginning of research on semantic relation extraction. The MUC evaluation campaigns also established two key metrics for information extraction: recall, the number of correctly extracted entities divided by the total number of entities to extract, and precision, the number of correctly extracted entities divided by the total number of extracted entities, both correct and incorrect. After the MUC conferences, the ACE (Automatic Content Extraction) conferences continued to promote research in information extraction by proposing tasks for the detection and characterization of entities, relations, and later events.

In 2009, the first Knowledge Base Population (KBP) task was held as part of the Text Analysis Conference (TAC) evaluation campaign. The main objective of this task was to extract information about entities for a knowledge base. Participants were given two sub-tasks: linking text entities to those in the knowledge base, and populating fields of the base with relevant information about these entities.

Named entities

Named entity recognition

Named entity recognition (NER) is a technique for identifying and categorizing linguistic expressions in text. It targets textual objects such as names of people, organizations, and locations, as well as quantities, distances, values, and dates. Originally developed for information retrieval, NER is now a foundational step in many natural language processing (NLP) tasks, including semantic annotation, machine translation, classification, and ontology development.

In many cases it is used to extract domain-specific terms; in our case it will be used to identify mentions of taxon names.

The main steps in named entity recognition (NER)

Named entity recognition comprises several steps: preprocessing, feature extraction, and modeling. The following sections describe each of these steps.

Preprocessing

Preprocessing consists of cleaning the data through tokenization and, in some cases, normalization, in order to reduce ambiguity during feature processing. In natural language processing, and in Bio-NER in particular, this includes data cleaning, tokenization, name normalization, and lemmatization to minimize ambiguities during feature extraction. Some studies follow the TTL (Tokenization, Tagging, and Lemmatization) model proposed by Ion in 2007 as a standard preprocessing framework for biomedical text mining applications, as noted by Maria Mitrofan in 2017.

Feature extraction

In systems based on rules and dictionaries, the extraction of orthographic and morphological features, focusing in particular on word formation, is the primary choice. Consequently, these systems rely heavily on techniques based on word formation and language syntax.

The current state-of-the-art method for feature extraction in biomedical text mining is word embeddings, which are sensitive to semantic and syntactic details. Word embedding consists of learning, in an unsupervised or semi-supervised manner, a real-valued vector representing a word from a text corpus. While the foundations of word embeddings were laid by Ronan Collobert in 2008, significant advances have since been made in text embedding through neural networks that take context, semantics, and syntax into account in natural language processing (NLP) applications. This section discusses some of the most important approaches to word representation and embeddings applicable to the biomedical domain.

Model

Named entity recognition systems can be categorized into four types: rule-based models, dictionary-based models, machine learning-based models, and hybrid models. Recently there has been a shift towards pure machine learning approaches, or hybrid techniques that combine rules and dictionaries with machine learning methods. Current research in NER focuses on deep learning for sequential data and on Conditional Random Fields (CRF). This section discusses only the deep learning and CRF approaches to named entity recognition.

Conditional Random Fields (CRFs) are a class of statistical models used in pattern recognition and statistical learning. They effectively account for the interactions between neighboring variables and are commonly applied to sequential data, including natural language, biological sequences, and computer vision.

CRFs take a structured input set of elements x and produce a structured labeling output y for that set. They are discriminative models, because they estimate the conditional probability of a set of labels y given an input x. When the graph representing the dependencies between labels is linear, the probability distribution of a sequence of annotations y given an observable sequence x is defined by the formula of [Dupont, 2017], reproduced below, where:

— Z(x) is a normalization factor depending on x;

— the K features f_k are user-supplied functions with values in {0, 1};

— the weights λ_k associated with the features f_k are the parameters of the model, determined during training.
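The formula itself did not survive the extraction of the document; as a reference, the standard linear-chain CRF distribution, consistent with the components listed above (and with Figure 3.3 of the original), can be written as:

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right).
$$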

Over the past five years, the literature has shifted towards general deep neural network models, including feedforward neural networks (FFNN), recurrent neural networks (RNN), and convolutional neural networks (CNN), which have been used for BioNER systems. Frequent variations of RNNs include the Elman and Jordan models, as well as unidirectional and bidirectional architectures.

To achieve the best results, Bi-LSTM and CRF models are combined with word-level and character-level embeddings in a common architecture (Habibi et al., 2017; Wang et al., 2018a; Giorgi and Bader, 2019; Ling et al., 2019; Weber et al., 2019; Yoon et al., 2019). In this setting, a pre-trained lookup table provides the word embeddings, and a separate Bi-LSTM is applied to each word sequence; the two are then combined to obtain x1, x2, ..., xn as word representations (Habibi et al., 2017).

Figure 3.4 – BI-LSTM-CRF architecture for NER

The vectors are fed into a bidirectional LSTM, whose forward and backward outputs, hf and hb, are combined using an activation function. This combined output is then passed to a CRF layer, which is typically set up to predict the class of each word in the IOB (Inside-Outside-Beginning) format.

In Figure 3.4, the embedding layer transforms the word "gene" into a vector xn. This vector is used as input to both the forward LSTM and the backward LSTM, where the forward hidden state depends on the previous state h(n-1) and the backward hidden state depends on the future state h(n+1).

The combined output of the LSTM layers, passed through a tanh activation function, produces the final output yn. A Conditional Random Field (CRF) layer on top uses yn to label the token as I (inside), B (beginning), or O (outside) of a named entity. In this example, yn is tagged I-gene, indicating that the word is inside a gene named entity.
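To make the tagging scheme concrete, here is a small, hypothetical example of an IOB-labeled sentence using the LIVB (living being) label that TaxoNERD assigns to taxon mentions later in this document; the tokens and tags are illustrative, not taken from the thesis corpus.

```python
# Hypothetical IOB tagging of taxon mentions (label LIVB, as in Table 4.2).
tokens = ["The", "brown",  "trout",  "Salmo",  "trutta", "feeds", "on", "Gammarus", "pulex"]
tags   = ["O",   "B-LIVB", "I-LIVB", "B-LIVB", "I-LIVB", "O",     "O",  "B-LIVB",   "I-LIVB"]
```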

Transformers are a deep learning model introduced in 2017, used primarily in natural language processing. According to Ashish Vaswani et al. (2017), the Transformer is a sequence-to-sequence (Seq2Seq) architecture. Seq2Seq is a type of neural network that converts a given sequence of elements, such as the words of a sentence, into another sequence. LSTM-based models are a popular choice for this type of architecture.

LSTM modules are well suited to sequential data, since they can interpret sequences while retaining or discarding the information they judge important or unimportant. In sentences, for instance, the order of words is essential for comprehension, which makes LSTMs a natural fit for such data.

Seq2Seq models consist of an encoder and a decoder. The encoder processes the input sequence and maps it into a higher-dimensional space (an n-dimensional vector). This abstract vector is then fed into the decoder, which transforms it into an output sequence. The output sequence can take various forms: another language, symbols, or a copy of the input.

The attention mechanism analyzes an input sequence and determines, at each step, the importance of its different parts. For each input processed by the encoder, it simultaneously considers several other inputs, assigning them different weights to reflect their importance. The decoder then uses the encoded sentence together with the weights provided by the attention mechanism.

[Ashish Vaswani et al., 2017] present a new architecture called the Transformer. The Transformer uses the attention mechanism described above. Like LSTM-based models, the Transformer is an architecture for transforming one sequence into another; it uses an encoder-decoder structure, but it differs from traditional sequence-to-sequence models in that it does not use recurrent networks such as GRU or LSTM. Instead, the architecture eliminates recurrence entirely and relies solely on an attention mechanism to establish global dependencies between input and output.

The encoder is on the left and the decoder on the right, both consisting of modules that can be stacked, as indicated by Nx. These modules mainly comprise multi-head attention and feed-forward layers. Inputs and target outputs are first embedded in an n-dimensional space, since strings cannot be used directly. A crucial aspect of the model is the positional encoding of the words: since there is no recurrent network to remember how the sequence is ordered, every word must be given a relative position in the sequence, as the arrangement of the elements matters. These positions are added to the n-dimensional embedding of each word.

Figure 3.6 – Scaled dot-product attention; multi-head attention consists of several attention layers running in parallel

Let us start with the scaled dot-product attention shown on the left of the figure. It can be described by the following equation [Ashish Vaswani et al., 2017], where:

— Q (query) is a matrix containing the query;

— K (keys) contains all the keys.
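The equation itself was lost in the extraction of the document; the standard form given in [Ashish Vaswani et al., 2017], using the Q, K, V notation defined here, is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where d_k is the dimension of the keys.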

The key/value/query concept comes from retrieval systems, where a search engine maps a user's query against a set of keys associated with the candidates in its database and returns the best matches as values. In the multi-head attention modules of the encoder and the decoder, V is the same sequence of words as Q. In the attention module that links the encoder and decoder sequences, however, V differs from the sequence represented by Q.

To simplify, we could say that the values in V are multiplied and summed with attention weights a, where our weights are defined by:
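The weight formula is again missing from the extracted text; in the notation of [Ashish Vaswani et al., 2017] it is simply the softmax factor of the previous equation:

$$a = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)$$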

Evaluation metrics

Evaluating a classifier can be difficult, as many performance metrics are available. The literature highlights three key metrics for assessing named entity recognition systems: precision, recall, and F-score.

To calculate precision, we divide the number of correctly classified positive examples by the total number of examples predicted as positive. A high precision value indicates that a positive label is assigned accurately, i.e. there are few false positives (FP).

High recall, low precision: most positive examples are correctly recognized (few false negatives), but there are many false positives.

Low recall, high precision: many positive examples are missed (many false negatives), but those predicted as positive are indeed positive (few false positives).

Recall is the ratio of correctly classified positive examples to the total number of positive examples. A high recall indicates that the class is well recognized, i.e. there are few false negatives (FN).

The F-measure combines precision and recall into a single metric, providing an overall evaluation of a model's performance. By using the harmonic mean instead of the arithmetic mean, the F-measure penalizes extreme values more heavily, ensuring a balanced assessment. Consequently, the F-measure is always closer to the lower of the two values, precision and recall.

The formulas for computing these metrics are given below:

— Precision: the proportion of correctly classified elements among those assigned to a given class.

— Recall: the proportion of correctly classified elements relative to the number of elements of the class to be predicted.

— F-measure: a trade-off between precision and recall.
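The formulas themselves were lost during extraction; using the four classes defined just below (TP, FP, TN, FN), the standard definitions are:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$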

In this context, the four classes used are true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). They represent:

— True positive (TP): the observation is positive and is predicted positive.

— False negative (FN): the observation is positive but is predicted negative.

— True negative (TN): the observation is negative and is predicted negative.

— False positive (FP): the observation is negative but is predicted positive.

Tools for named entity recognition

Deep learning-based named entity recognition systems have proliferated and achieve excellent results; however, there are still very few tools for recognizing taxon mentions in text or extracting information from the biodiversity and ecology literature. This work focuses solely on the TaxoNERD tool.

TaxoNERD

TaxoNERD, developed by Dr. Nicolas Le Guillarme at the Laboratoire d’Écologie Alpine, is a tool that provides two deep neural network (DNN) models for recognizing taxon mentions in ecological documents. To achieve high performance, TaxoNERD leverages existing DNN models pre-trained on large biomedical corpora. Figure 3.9 below presents the deep learning models used by TaxoNERD [Guillarme and Thuiller, 2021].

TaxoNERD thus offers two models: the first uses spaCy's standard Tok2Vec encoder with word vectors for speed, while the second uses a pre-trained Transformer-based language model, BioBERT.

Disambiguation

Entity linking, also known as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD), or named-entity normalization (NEN), is a natural language processing (NLP) task that assigns a unique identity to entities mentioned in a text. In practice, it automatically links the entities found in the text to known entities in an existing knowledge base. It is thus finer-grained than named entity recognition alone, which only identifies the category of an entity mention.

7 https://en.wikipedia.org/wiki/Entity_linking

Relation extraction

Presentation

Relation extraction is a subtask of information extraction that consists of identifying and categorizing semantic relations in unstructured text. It aims to determine whether a typed association exists between several entities; such relations are called n-ary according to the number of entities involved, and a binary relation is an association between two entities. Relation extraction can also be generic, i.e. concerned only with the existence of a relation without specifying its nature. Extraction can operate at the sentence level or across several sentences of a document. Sentence-level relation extraction is the most common setting, while extraction across several sentences is less explored, even with supervised learning. In this study we are concerned with relation extraction across several sentences.

Figure 3.10 – Example of relation extraction

Types of relations

Relationships are typically categorized into two types: n-ary relationships, which involve n arguments, and binary relationships, which pertain to interactions between two entities and are a specific case of n-ary relationships.

An n-ary relation is a relation involving more than two arguments (or individuals, in the ontological sense). Extracting an n-ary relation consists of identifying the arguments and the relation itself when the relation is explicit, or of recognizing the arguments and linking them together when the relation is implicit.

A binary relation involves two entities, called the arguments of the relation. The arguments can play distinct roles, giving an oriented (asymmetric) relation, or equal roles, giving a non-oriented (symmetric) relation.

In this section, we will focus on the method of extracting relationships based on binary relations, where entities are annotated within the text, and our goal is to identify the relationships that exist between these entities.

Methods and approaches for binary relations

The methods for extracting relations vary according to the subtask being addressed: detecting a relation between two entities, identifying the direction of the relation, and categorizing it. This section focuses only on the techniques used for detecting and categorizing relations.

3.3.3.1 Co-occurrence-based approach

One of the simplest methods for associating entities is the co-occurrence approach, which assumes that entities are linked if they appear together in the target sentences. The underlying hypothesis is that the more frequently two entities occur together, the more likely they are to be associated. By extension, a relation is also assumed to exist between two or more entities if they are both connected to a third entity acting as a link between them.

Hierarchical and lexical relations, such as hypernymy, are consistently present in a domain, and a co-occurrence-based method identifies them effectively, as noted by Mans Minekus in 2005. This approach is commonly used as a baseline for assessing new techniques (see the sketch below). It rests on the premise that words frequently appearing in the same context are semantically related. While this method yields high recall in relation extraction, it often suffers from low precision.
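As an illustration of this baseline, here is a minimal sketch in Python that counts how often pairs of taxa are mentioned in the same paragraph; the input format (one set of taxon names per paragraph) is an assumption of the sketch, not a structure defined in the thesis.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(paragraph_taxa):
    """Count how often each pair of taxa appears in the same paragraph.

    `paragraph_taxa` is a list of sets, one set of taxon names per paragraph.
    """
    counts = Counter()
    for taxa in paragraph_taxa:
        for pair in combinations(sorted(taxa), 2):  # every unordered pair in the paragraph
            counts[pair] += 1
    return counts

# The more often two taxa co-occur, the more likely the baseline assumes they interact.
example = [{"Salmo trutta", "Gammarus pulex"},
           {"Salmo trutta", "Gammarus pulex", "Baetis rhodani"}]
print(cooccurrence_counts(example).most_common(2))
```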

3.3.3.2 Rule-based approach

In a rule-based approach, relation extraction relies heavily on the syntactic and semantic analysis of sentences. Such methods use part-of-speech (POS) tagging tools to identify associations in text: by analyzing the verbs and prepositions that relate several nouns or noun phrases, they recognize relations between named entities.

The most commonly used machine learning approaches rely on corpora annotated with pre-identified relations as training data (supervised learning). For a long time, the main obstacle to applying these methods to relation detection was the acquisition of annotated training and test data; however, the datasets produced by biomedical text mining competitions such as BioCreative and BioNLP have significantly mitigated this issue.

Supervised approaches require costly annotated training data, which makes them difficult to scale, since detecting new types of relations requires additional training data. Furthermore, supervised classifiers tend to be biased towards the text domain used for training, and therefore give suboptimal results on other textual content. Semi-supervised methods still require a small labeled dataset of entity relations, together with an unlabeled portion, to learn how to extract relations. However, most such systems have not reached accuracy levels comparable to supervised approaches, which highlights how expensive and tedious it is to generate labeled data.

Extending these approaches to extract new types of relations typically requires human effort. Unsupervised relation extraction techniques, which evolved into the subdomain of open information extraction, do not rely on training data. However, they often yield results that are difficult to interpret and to align with existing sets of relations, schemas, or ontologies, which is a problem for many applications, including knowledge base population.

Hybrid approaches: here, all of the above approaches are combined and then used to determine the relations between entities.

In recent years, a new paradigm for automatic relation extraction, known as distant supervision, has been proposed. This approach combines the benefits of semi-supervised and unsupervised methods by using a knowledge base as a source of training data for relation extraction. The next section describes this paradigm in more detail [Perera et al., 2020].

Relation extraction under distant supervision

How it works

This section describes the mechanism of distant supervision for binary relation extraction. Given a text corpus and a knowledge base, distant supervision aligns the relations of the knowledge base with the sentences of the corpus. First, the sentences mentioning a pair of entities are collected; if a sentence or paragraph mentions two entities (e1, e2) that the knowledge base relates, distant supervision labels these sentences as instances (or mentions) of that relation. The set of sentences sharing the same entity pair is usually called a bag of sentences. The idea of using a knowledge base to automatically generate training data for relation extraction was first proposed by Mintz et al. in 2009 (see the sketch below).
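A minimal sketch of this labeling step is given below, assuming the paragraphs have already been reduced to sets of taxon identifiers and that the knowledge base is available as a set of interacting pairs; both input formats are assumptions of the sketch.

```python
def distant_supervision_bags(paragraph_taxa, kb_pairs):
    """Group paragraphs into bags labeled by a knowledge-base relation.

    `paragraph_taxa` maps a paragraph id to the set of taxon ids found in it;
    `kb_pairs` is a set of (consumer, resource) pairs known to interact (e.g. from GloBI).
    Returns a dict mapping each known pair to the paragraphs that mention both taxa.
    """
    bags = {}
    for pid, taxa in paragraph_taxa.items():
        for e1 in taxa:
            for e2 in taxa:
                if e1 != e2 and (e1, e2) in kb_pairs:
                    bags.setdefault((e1, e2), []).append(pid)  # a "bag of sentences" for the pair
    return bags
```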

Knowledge base

Globi

GloBI (Global Biotic Interactions) is an open-access knowledge base that supports research on species interactions (predator-prey, pollinator-plant, pathogen-host, parasite-host) by integrating existing open datasets with open-source software. GloBI continuously analyzes existing data infrastructures and registries and tracks the species interaction data they make available. The discovered interaction data are then resolved and integrated, so that GloBI is not a centralized repository of species interaction data but rather a research index that helps users locate existing species interaction datasets in their native cyber-habitats.

Conclusion

In this chapter we reviewed the various approaches that can help achieve the objectives of this internship, starting with methods for recognizing named entities in a corpus and ending with relation extraction. The next chapter presents our proposed solution and the approaches chosen for each objective.

8 https://www.atlassian.com/fr/itsm/knowledge-management/what-is-a-knowledge-base

9 https://www.globalbioticinteractions.org/about

In this chapter we describe the proposed solution for achieving the objectives of our information extraction project on an ecological corpus. We discuss the methods and techniques selected for each objective, as well as the results obtained after implementing the solution. Figure 4.1 illustrates the overall architecture of our proposed solution.

Presentation

The literature review led us to adopt a distant supervision approach. This method relies on unlabeled textual data and on a knowledge base that contains known instances of each relation of interest.

Implementing this approach requires several ordered tasks. First, full-text articles must be retrieved in PDF or XML format. Next, the articles are parsed to extract the paragraphs to be explored. The third task is to identify mentions of taxon names in these paragraphs in order to build a list of taxa. The Globi knowledge base is then queried to annotate the relations between the identified mentions, producing an annotated corpus. Finally, we use a state-of-the-art method based on the pre-trained BioBERT transformer, designed for biomedical text, to train on our data.

Having presented the general architecture of our system, we now describe each step and the results obtained.

Description of the approaches and techniques

Data acquisition

Because no annotated corpus of ecological interactions for trophic networks was available, this study builds a data corpus from full-text scientific articles. Many existing information extraction methods work only on article abstracts, although full-text electronic articles, which are much richer sources of data, are available in numerous online databases and libraries and remain underused. For our work, we use full-text articles from the ISTEX library, as it offers the most relevant resources for our needs.

ISTEX provides free access to data through its API and its FTP server. It contains more than 25 million documents across 31 scientific literature corpora, including more than 9,318 journals and 349,009 ebooks published from 1473 to 2020, making it a vast data repository.

ISTEX is one of the leading providers of scientific articles worldwide. Each article is assigned a Digital Object Identifier (DOI) to make the resource easier to identify.

1 https://www.istex.fr/

A DOI associates metadata with the article, that is, information describing it: author or creator, title, keywords, abstract, publisher, language, publication date, source, ownership rights, location, access conditions, etc. We therefore rely on the DOIs to download the articles. The published versions of the articles consist of several files, such as image, PDF, and XML files containing all of the articles' metadata. These downloaded formats have different structures, which makes extracting the text of a document difficult. Moreover, Elsevier, one of the world's largest publishers of scientific literature, has its own DTD (Document Type Definition) specification, which must not be used by an external organization. There is an abundance of text extraction tools, and the choice among them depends on the format of the documents and the quality of the extracted data.

A processing pipeline allowed us to extract the paragraphs of the articles in order to build our corpus on interactions in a trophic network.

XML files are documents that encapsulate pieces of information within tags, much like HTML files. To extract the desired information from these files, we must analyze the XML structure, identify the relevant tags, and extract the data they contain. For this purpose we use BeautifulSoup, one of the most popular Python libraries for web scraping, together with Python's lxml parser to analyze our XML files and locate the necessary tags.

Algorithm: extraction of paragraphs from the XML files (a sketch is given below; the implementation is shown in Figures 4.2 and 4.3).
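As an indication of what this step can look like, here is a minimal sketch using BeautifulSoup with the lxml XML parser, as mentioned above; the tag name "p" is an assumption, since the actual tag depends on each publisher's DTD and must be adapted after inspecting the files.

```python
from bs4 import BeautifulSoup  # requires beautifulsoup4 and lxml

def extract_paragraphs(xml_path):
    """Return the text of every paragraph-like tag found in an article's XML file."""
    with open(xml_path, encoding="utf-8") as handle:
        soup = BeautifulSoup(handle, features="xml")  # lxml-based XML parsing
    # "p" is a placeholder tag name; adapt it to the publisher's DTD.
    return [tag.get_text(" ", strip=True) for tag in soup.find_all("p")]
```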

Results

We now present the results of the methods used to build our corpus. Building and preparing the corpus is both crucial and delicate, since it is the essential source of information for the whole extraction process. In our case, the constructed corpus is domain-specific and contains a collection of texts related to trophic networks.

Figure 4.4 shows the query used to ask the library whether articles containing the selected keywords exist.

Figure 4.5 shows the parameters and the connection to the ISTEX server.

Figure 4.5 – Connection to the ISTEX server

Figure 4.6 shows the response returned by the server.

The server's response to a request is provided in PDF and JSON formats. It includes the request status code, which indicates the result: 200 means the download started successfully, 400 indicates a bad request, and 500 a server error.
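For illustration, the sketch below shows how such a query can be sent to the ISTEX API with the Python requests library and how the status codes listed above can be handled. The endpoint, query string, and output fields are indicative assumptions; the exact request used in the thesis is the one shown in Figure 4.4.

```python
import requests

ISTEX_API = "https://api.istex.fr/document/"   # assumed public search endpoint

params = {
    "q": 'fulltext:("food web" OR "trophic network")',  # hypothetical keyword query
    "output": "id,doi,title",                            # fields to return
    "size": 100,                                         # number of hits per page
}

response = requests.get(ISTEX_API, params=params, timeout=30)
if response.status_code == 200:        # request succeeded
    hits = response.json().get("hits", [])
    print(f"{len(hits)} documents returned")
elif response.status_code == 400:      # malformed request
    print("Bad request:", response.text)
else:                                  # e.g. 500: server-side error
    print("Server error:", response.status_code)
```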

Table 4.1 summarizes the volume of data obtained with this extraction method from ISTEX, together with the segmentation into paragraphs of the extracted articles. Each paragraph carries the identifier of its article so that it can easily be retrieved.

Number of articles in PDF: 1200
Number of articles in XML: 1200
Number of paragraphs: 10500

Recognition of entity mentions

For the recognition of entity mentions in text we use TaxoNERD, because of its high performance in identifying taxon mentions in the ecological literature. Entity name recognition with TaxoNERD involves two main sub-modules: named entity recognition and named entity disambiguation.

TaxoNERD takes the corpus of article paragraphs as input and performs named entity recognition, producing a file in ANN format whose structure is summarized in Table 4.2 below.

Figure 4.7 – Mention recognition with TaxoNERD

Figure 4.7 highlights the named entities identified by TaxoNERD, which recognizes the scientific names of the species mentioned in the paragraph as well as abbreviations and vernacular names.

Para_article: the paragraph of the article
T1 ... Tn: numbering of the taxa found in the paragraph
LIVB (LIVING BEING): the label associated with a taxon entity, together with the index of the first character of the mention in the paragraph, the index of the character following the last one, and the text of the mention

Table 4.2 – Structure of the TaxoNERD NER output file
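To show how such an output can be consumed downstream, here is a minimal parser for mention lines of the form described in Table 4.2 (identifier, label with start and end offsets, mention text); the tab-separated layout is an assumption based on the usual ANN/brat format.

```python
def parse_ann(ann_path):
    """Parse TaxoNERD-style ANN output into a list of mention dictionaries.

    Assumes mention lines of the form:  T1<TAB>LIVB 17 29<TAB>Salmo trutta
    """
    mentions = []
    with open(ann_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.startswith("T"):          # skip non-mention lines
                continue
            ident, annotation, text = line.rstrip("\n").split("\t")
            label, start, end = annotation.split()
            mentions.append({"id": ident, "label": label,
                             "start": int(start), "end": int(end), "text": text})
    return mentions
```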

This section presents the results of applying named entity recognition to the text. TaxoNERD takes our paragraphs as input and generates an output file in ANN format containing the list of mentions identified in the text. The figure below shows an example of the output produced by TaxoNERD.

Figure 4.8 – Example of mention recognition with TaxoNERD output

TaxoNERD disambiguation

After identifying the named entities, each mention must be associated with an identifier from the reference taxonomy. We use the NCBI taxonomy to resolve the species mentions detected in the text and assign them a taxon. The NCBI taxonomy tracks synonyms and name changes, which is valuable information when researching older literature. It also serves as a reference for describing organisms in various databases, including NCBI resources such as the Sequence Read Archive (SRA) and GenBank, the European Nucleotide Archive (EMBL-EBI), and the DNA Data Bank of Japan (DDBJ) [Federhen, 2014].

If a mention is not linked to any entity in the taxonomy, either the mention was incorrect or the taxon is absent from the taxonomy.

Disambiguation thus uses the NCBI identifier of a species to resolve the confusion that can arise among taxonomic names: each taxonomic name is identified uniquely by its NCBI number. Table 4.3 presents the information contained in the TaxoNERD output after disambiguation.

Para_article: the paragraph of the article
T1 ... Tn: numbering of the taxa found in the paragraph
LIVB (LIVING BEING): a taxon-type entity, with the index of the first and last characters of the mention in the paragraph and the text of the mention
NCBI: reference identifier in the taxonomy

Table 4.3 – Structure of the TaxoNERD disambiguation output file

4.2.4.1 Disambiguation results. Figure ?? shows an example of the result obtained after disambiguating the mention names in the text with TaxoNERD.

Annotation

Relations are annotated between pairs of entity mentions using the Globi knowledge base. The system takes a pair of entity mentions as input and queries the Globi knowledge base through an HTTPS request to determine whether a relation exists between the two entities. If a relation is found the pair is annotated; otherwise no annotation is given. Globi covers many types of ecological interactions, including preysOn, kills, eats, parasiteOf, endoparasiteOf, ectoparasiteOf, and parasitoidOf.

In this work we restrict ourselves to EAT-type interactions, although Globi also records many other interaction types, such as hostOf, pollinates, pathogenOf, vectorOf, dispersalVectorOf, createsHabitatFor, hasEpiphyte, acquiresNutrientsFrom, mutualistOf, flowersVisitedBy, ecologicallyRelatedTo, coRoostsWith, and adjacentTo.

Below we outline the steps used to annotate the mention texts with the GLOBI knowledge base. They are applied to the list of mention names produced by TaxoNERD after disambiguation:

— Retrieval of the entity pair by their NCBI numbers (a sketch of the query is given after this list).

The annotation was applied to all previously identified taxa. As mentioned earlier, the relation considered in this work is of the EATS type, and it is only sought between entities of the same paragraph.
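As an indication, the sketch below queries GloBI's public REST API for an 'eats' interaction between two taxa using the Python requests library; the endpoint and parameter names follow GloBI's documented interaction API, but the exact request built in the thesis is the one shown in Figure 4.11.

```python
import requests

GLOBI_API = "https://api.globalbioticinteractions.org/interaction"

def eats_relation_exists(source_taxon, target_taxon):
    """Return True if GloBI records an 'eats' interaction from source to target."""
    params = {"sourceTaxon": source_taxon,
              "targetTaxon": target_taxon,
              "interactionType": "eats"}
    reply = requests.get(GLOBI_API, params=params, timeout=30)
    reply.raise_for_status()
    rows = reply.json().get("data", [])
    return len(rows) > 0   # annotate the pair as EATS only if at least one record is returned

# Example: eats_relation_exists("Salmo trutta", "Gammarus pulex")
```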

Figure 4.11 shows the connection to the Globi knowledge base, the retrieval of the mention pair, the definition of the relation, and the formulation of the query.

Figure 4.11 – Globi connection and query

Figure 4.12 – Example of the Globi annotation implementation

Manual annotation was also carried out by experts in order to assess the quality of our relation extraction model. The manually annotated data are used in the testing phase of the model. The figure below illustrates the manual annotation of the data.

Figure 4.13 – Example of relation annotation with Globi

Model training

To learn from our data we use BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), a domain-specific language representation model pre-trained on large-scale biomedical corpora. The BioBERT model is open source and can be used as is. However, for the model to perform a task such as relation extraction, it must be fine-tuned: a task-specific layer, trained on an annotated dataset, is added on top of the representations produced by BioBERT.

Figure 4.14 – Pre-training and fine-tuning of BioBERT

The next section is devoted to our experimental results, with a detailed analysis of their variants.

The Bio-BERT model we developed takes the input text and annotates taxa with the "EAT" relation type. It was trained for 100 epochs; the table below lists the parameters used to train the model.

Model type: Bio-BERT (pre-trained model)
Batch size: 32 (at each iteration, 32 sequences are selected for training)
max-seq-length: 128 (maximum length of a sentence after tokenization)

Table 4.4 – Parameters of the Bio-BERT model
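A minimal sketch of such a fine-tuning setup is given below, assuming the Hugging Face Transformers library and the public BioBERT checkpoint dmis-lab/biobert-base-cased-v1.1; the entity markers and the two-label scheme (no relation vs. EATS) are illustrative assumptions, not the thesis's exact training script.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "dmis-lab/biobert-base-cased-v1.1"   # public BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
# A task-specific classification head is added on top of BioBERT's representations.
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# One annotated paragraph with the two taxon mentions marked (markers are an assumption).
text = "[E1] Salmo trutta [/E1] feeds mainly on [E2] Gammarus pulex [/E2] in small streams."
batch = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")  # max-seq-length 128

with torch.no_grad():
    logits = model(**batch).logits
print(torch.softmax(logits, dim=-1))  # probabilities for (no relation, EATS) before fine-tuning
```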

We note that, despite the parameters above, the model learns poorly and makes many errors. The table below presents the model's results.

Table 4.5 – Metric results on our test data

Conclusion

In this chapter we presented our proposed solution and the results obtained. The solution comprises four main tasks: building a corpus dedicated to trophic network interactions, setting up the automatic recognition of taxon names in text, annotating the relations between entities, and finally training a model on our corpus. Various methods, resources, and tools were put in place to meet the objectives of this research project.

The results show good performance for corpus construction and for entity recognition with the TaxoNERD tool. The performance of the BioBERT transformer model, however, remained low, with F-measure scores around 30%.

This document describes the work carried out during the final internship of the master's degree in Intelligent Systems and Multimedia. The internship dealt with information extraction for populating a knowledge graph in the field of ecology. The main goal was to reconstruct the network of ecological interactions from the scientific literature that reports which species interact.

To achieve this objective, we first conducted a literature review of the approaches and techniques relevant to our goal. Because of the lack of data, we built our own corpus from the scientific literature of the ISTEX online library, using several preprocessing techniques. For entity recognition and linking we used the TaxoNERD tool, which gave satisfactory results validated by domain experts. For annotating and extracting the relations between named entities we applied a distant supervision method with GLOBI as the knowledge base, and successfully annotated relations. After the automatic annotation, part of the data was annotated manually in order to evaluate our machine learning model. Finally, we fine-tuned the pre-trained BioBERT model on our annotated data; the results were poor, and we still need to analyze and address this issue, since the adaptation of the BioBERT model to our data is in question.

Through this work we proposed a distant supervision approach that proved very effective for building our annotated corpus. This study could therefore facilitate the reconstruction of an interaction network. With such results it would become possible to monitor biodiversity conservation more accurately and to define clearer policies to protect ecological interactions that are currently at risk of disappearing. In the future, we plan to explore approaches that could address the remaining issues.

In particular, we plan to examine the importance of reducing noise in the text during data preprocessing, and to consider relation extraction with a hybrid architecture combining Bidirectional Encoder Representations from Transformers with graph transformers (BERT-GT). This approach incorporates a neighborhood attention mechanism, which improves the ability to capture relations in complex paragraphs containing several sentences, and has been applied to biomedical relation extraction datasets.

Some functions

SOCIALIST REPUBLIC OF VIETNAM
Independence - Liberty - Happiness

(For master's students)

I. CURRICULUM VITAE:

Full name: Tshitenge Mupuwe Jojo

Sex: Male

Date of birth: 06/04/1991

Place of birth: Kinshasa

Home town: Kinshasa

Ethnicity: Mukua sumba

Position and workplace before these studies: None

Current address: 07 Rue du Gros Raisin, Orléans, France. Work phone: none. Home phone: none. Mobile phone: 0609923255.

II. EDUCATION:

1. Undergraduate degree:

Mode of study: formal university program

Period: from 11/2009 to 04/2015

Place of study (university, city/province): Université de Kinshasa

Field of study: Computer Engineering (Génie Informatique)

Undergraduate thesis topic: deployment of a video surveillance and traffic light system to improve road traffic on the main arteries of Kinshasa.

Place of defense (university, city): Université de Kinshasa

2. Master's degree:

Decision recognizing the master's student No. 3234/QĐ-ĐHQGHN, dated 28 September 2018, of the President of the Vietnam National University, Hanoi.

Place of study (faculty, university): L’Institut Francophone International (IFI)

Field: Information Technology (Informatique)

Specialization: Intelligent Systems and Multimedia (Systèmes Intelligents et Multimédia)

Graduation thesis title: "Extraction d’information pour une population d’un graphe de connaissance en écologie" (Information extraction for populating a knowledge graph in ecology).

Supervisor: Nicolas LE GUILLARME

Supervisor's workplace: Université Grenoble Alpes, CS 40700, 38058

3. Foreign languages (which languages, level):

Hanoi, 25 October 2022

CERTIFICATION BY THE SENDING AGENCY
OR BY THE TRAINING INSTITUTION

SOCIALIST REPUBLIC OF VIETNAM
Independence - Liberty - Happiness

MASTER'S THESIS REVIEW REPORT

Topic:

INFORMATION EXTRACTION FOR POPULATING A KNOWLEDGE GRAPH IN ECOLOGY

(French: Extraction d’information pour une population d’un graphe de connaissance en écologie)

Sector: Information Technology (Informatique)

Specialization: Intelligent Systems and Multimedia (Systèmes Intelligents et Multimédia)

Specialization code: pilot program

Student: Tshitenge Mupuwe Jojo

Reviewer: Hồ Tường Vinh

Reviewer's institution: VNU-IFI

1. Necessity, theoretical and practical significance of the thesis topic

This thesis studies machine learning, deep learning, and information extraction from texts, and applies them to identify relations within a food web in order to build an annotated corpus. The research is significant both theoretically and practically.

The objectives of this thesis are to:

- Study automatic information extraction from texts and deep learning;

- Propose and experiment with a processing pipeline, based on deep neural networks, for extracting the relations of a trophic network from scientific articles.

2. Research methodology

The author carried out his work in a clear and methodical way: studying the state of the art, drawing up the requirements, proposing a solution, and implementing and evaluating the proposed solution.

3. Theoretical foundation and literature review of the research topic

The author studied automatic information extraction from texts and then reviewed existing methods and models for extracting the relations of a trophic network. Building on this review, he proposed a text and natural language processing pipeline for extracting relevant information from a collection of scientific articles. The implementation and experimentation of the solution show its effectiveness and potential value.

4. New contributions of the thesis

A text and natural language processing pipeline designed to extract relevant information from a collection of scientific articles, with the goal of building an annotated corpus on the trophic network.

5. Structure, presentation and style

The thesis is well structured and the chapters are easy to follow.

6. Shortcomings and limitations in content and form (if any)

- Form:

- Content:

A more detailed analysis of the advantages and disadvantages of the solution proposed in this project would be desirable.

7. Level achieved by the research relative to the requirements of a master's thesis

Based on the requirements of a master's thesis and the research results presented, I agree that the student Tshitenge Mupuwe Jojo may defend the thesis before the jury.

REVIEWER

MASTER'S THESIS REVIEW REPORT

Topic: INFORMATION EXTRACTION FOR POPULATING A KNOWLEDGE GRAPH IN ECOLOGY (Extraction d’information pour une population d’un graphe de connaissance en écologie)

Sector: Information Technology (Informatique)

Specialization: Intelligent Systems and Multimedia (Systèmes Intelligents et Multimédia)

Specialization code: pilot program. Student: TSHITENGE MUPUWE Jojo

Reviewer: Nguyễn Mạnh Hùng

Reviewer's institution: Posts and Telecommunications Institute of Technology, Hanoi.

1. Necessity, theoretical and practical significance of the thesis topic

2. Research methodology

- The research methodology is reasonable and suited to the problem.

3. Theoretical foundation and literature review of the research topic

- The intern's work is part of LECA's activities.

4. New contributions of the thesis

- Building a training and test dataset by extracting interactions from an existing knowledge base and by querying the APIs of scientific publication search engines;

- Implementing one or more relation extraction methods and evaluating the performance of these approaches on the test set.

5. Structure, presentation and style

- The thesis contains the necessary sections;

6. Shortcomings and limitations in content and form (if any)

- Form:

- Too many figures do not cite their sources;

- The formulas are not numbered.

- Content:

- The results in Table 4.5 are not good, with an F1 score of 25%?

- There is no comparison with other methods.

7. Level achieved by the research relative to the requirements of a master's thesis

Based on the requirements of a master's thesis and the research results presented, I agree that the student TSHITENGE MUPUWE Jojo may defend the thesis before the jury.

Hanoi, ____ 20__
