
Khóa luận tốt nghiệp: Constructing knowledge graphs with triple extraction techniques


DOCUMENT INFORMATION

Basic information

Title: Constructing Knowledge Graphs with Triple Extraction Techniques
Author: Nguyen Minh Hieu
Supervisors: Do Phuc, Ngo Duc Thanh
University: University of Information Technology
Specialization: Information Systems
Document type: Graduation Thesis
Year of publication: 2020
City: Ho Chi Minh City

Format

Number of pages: 112
File size: 58.14 MB


Structure

  • CHAPTER 1. INTRODUCTION
    • 1.1 Context
    • 1.2 The problem and its significance
    • 1.3 Related works
      • 1.3.3 Semantic role labeling
      • 1.3.4 Knowledge graph construction
    • 1.4 Motivation
    • 1.7 Chapter summary
  • CHAPTER 2. BACKGROUND AND THEORY
    • 2.1 Knowledge graph
      • 2.1.3 Graph data models
        • 2.1.3.1 Resource Description Framework
        • 2.1.3.2 Labeled Property Graph
    • 2.2 Neo4j graph database
      • 2.2.2 Cypher query language
    • 2.3 Named Entity Recognition
    • 2.4 Coreference resolution
    • 2.5 Semantic roles
    • 2.6 BERT — Bidirectional Encoder Representations from Transformers
    • 2.7 BERT for semantic role labeling
    • 2.8 Chapter summary
  • CHAPTER 3. SYSTEM DESIGN AND IMPLEMENTATION
    • 3.2 Extractor system
      • 3.2.2 Sentence decomposer
      • 3.2.4 SRL extractor
      • 3.2.5 Coreference resolver
    • 3.3 Question-answer pair generator
    • 3.4 Question-answer pair searcher
    • 3.5 Visualization system
  • CHAPTER 4. SYSTEM EVALUATION
    • 4.3 Knowledge graph construction
    • 4.4 Question generation
  • CHAPTER 5. CONCLUSION AND FUTURE WORK
    • 5.2 Future work

Content

There have been several existing works on this topic; however, information loss is one big problem that they are still dealing with. Therefore, the center of attention in this thesis …

INTRODUCTION

Context

In recent years, the buzzword Industry 4.0 has been mentioned and has appeared dominantly on social media, with its promise to take advantage of data to revolutionize manufacturing. It can be said that data is the foundation of the megatrends in information technology relating to Artificial Intelligence (AI), which can strongly support industry. On the grounds that the number of online activities keeps rising, the amount of data has been increasing exponentially and is becoming more varied in nature. We obviously know that data may contain underlying information and knowledge. However, data seems useless if it is not processed properly. Therefore, a big challenge is to extract value from collected raw data for further uses.

In order to make use of information from text data, knowledge graphs appeared as a good way of knowledge representation. The term knowledge graph became notably prominent due to the introduction of Google’s Knowledge Graph in 2012, whose basic motto is to focus on searching for things, not strings [1], by utilizing knowledge graphs. Special interest in this technology led many other companies to develop their own knowledge graphs, such as DBpedia [2], YAGO [3], Wikidata [4] and Freebase [5]. Due to the knowledge graph’s powerful capability of knowledge representation and reasoning, it has been used as the backbone of artificial intelligence to serve both academic and industrial purposes.

To construct a knowledge graph, a set of triples (head, relation, tail) must be produced from text data. Therefore, triple extraction is one of the key basic steps. Until now, methods of triple extraction have been developed with many different techniques, based on both statistical and neural-network models. However, triples extracted by these methods are often too noisy and incomplete to be applied directly to knowledge graph construction. As a result, the main purpose of this thesis is to make triples extracted from unstructured text more useful and valuable, so that high-quality knowledge graphs can be created from them to support advanced uses.

The problem and its significance

Triple extraction, a subset of information extraction, is an area of Natural Language Processing that plays an important role in knowledge graph construction. One of the most well-known information extraction paradigms is Open Information Extraction (OpenIE) [6], which was first introduced in 2007. After that, a wide range of OpenIE systems were developed with different approaches, the modern ones using neural networks to enhance performance. However, by experimenting with some OpenIE systems such as OpenIE-5 (including CALMIE [7], BONIE [8], RelNoun [9] and SRLIE [10]), MinIE [11], Supervised-OIE or RnnOIE [12], and IMoJIE [13], it can be found that those systems tend to produce many triples from a sentence to cover all kinds of results. Subjects or objects extracted by these techniques are regularly long sequences of words which include semantic roles. In addition, the handcrafted triple extraction methods that I have experienced only focus on noun phrases and verb phrases, which can make triples lose the metadata of facts about time, place, manner, purpose and so on.

For example, with the sentence “John went to California with Tim in 2010 by airplane in order to travel.”, the results from some OpenIE systems are listed below:

• IMoJIE: (John, went, to California with Tim in 2010 by airplane in order)
• MinIE:
  o (John, went to California with, Tim)
  o (John, went to California in, 2010)
  o (John, went to California by, airplane)
  o (John, went to California in to, travel)
  o (John, went to, California)
• OpenIE-5: (John, went, to California, with Tim, in 2010, in order)

From the above outputs, it can be seen that IMoJIE tries to select the remaining words after the verb as the object and fails to complete this object, MinIE tries to split the input sentence into shorter sentences and fails to recognize the verb phrase correctly, and OpenIE-5 lacks the argument “by airplane” and also fails to complete the phrase “in order to travel”. The expected output from the example sentence based on semantic roles is described below:

[Experiencer: John] [Verb: went] [Ending point: to California] [Comitative: with Tim] [Temporal: in 2010] [Manner: by airplane] [Purpose: in order to travel]

• (John, went to, California) + properties: (comitative, temporal, manner, purpose)
• (John, went with, Tim)
• (John, went in, 2010)
• (John, went by, airplane)
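To make the idea concrete, below is a minimal sketch, assuming PropBank-style role tags rather than the descriptive labels above, of how one SRL frame can be turned into a main triple plus metadata properties. The function and the role mapping are illustrative, not the thesis's actual code:

```python
# A minimal sketch (not the thesis's actual implementation) of turning a
# semantic role labeling frame into a main triple plus metadata properties.
# The thesis further folds prepositions into the predicate, e.g.
# (John, went to, California); that step is omitted here for brevity.

def frame_to_triples(frame):
    """frame: dict mapping PropBank-style role tags to text spans."""
    subject = frame.get("ARG0") or frame.get("ARG1")
    verb = frame["V"]
    # Core object: a goal/patient argument if present.
    obj = frame.get("ARG2") or frame.get("ARG4")
    modifiers = {role: span for role, span in frame.items()
                 if role.startswith("ARGM-")}
    main = (subject, verb, obj)
    # Each modifier also yields a triple so no metadata is lost.
    extras = [(subject, verb, span) for span in modifiers.values()]
    return main, modifiers, extras

frame = {"ARG0": "John", "V": "went", "ARG4": "to California",
         "ARGM-COM": "with Tim", "ARGM-TMP": "in 2010",
         "ARGM-MNR": "by airplane", "ARGM-PRP": "in order to travel"}
main, mods, extras = frame_to_triples(frame)
print(main)    # ('John', 'went', 'to California')
print(extras)  # [('John', 'went', 'with Tim'), ('John', 'went', 'in 2010'), ...]
```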

Moreover, some knowledge graph construction pipelines link entities by using coreference resolution in an ineffective way. In detail, for each coreference chain, the first mention is used to replace all of its succeeding mentions. In this way, entities do not keep their diversity compared to the original text.
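As an illustration, here is a toy sketch (assumed for illustration, not taken from any of the surveyed pipelines) of this naive replacement strategy:

```python
# Toy illustration of the naive strategy described above: every mention in
# a coreference chain is replaced by the chain's first mention.

def naive_resolve(tokens, chains):
    """chains: list of lists of (start, end) token spans, ordered by position."""
    tokens = list(tokens)
    for chain in chains:
        first_start, first_end = chain[0]
        first_mention = " ".join(tokens[first_start:first_end])
        for start, end in chain[1:]:
            # Pad with empty strings so later span indices stay valid.
            tokens[start:end] = [first_mention] + [""] * (end - start - 1)
    return " ".join(t for t in tokens if t)

text = "John went to California . He liked it .".split()
chains = [[(0, 1), (5, 6)]]          # "He" corefers with "John"
print(naive_resolve(text, chains))   # "John went to California . John liked it ."
```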

As a consequence, constructed knowledge graphs can contain duplicate and unnecessary information and suffer from poor information preservation, which causes difficulties and inconveniences for further uses.

Another problem is that there seem to be very few open-source, complete pipelines or applications that ingest raw data, produce knowledge graphs, and finally prove their usefulness. Although there are tutorials and guides about knowledge graph construction, they stop at triple extraction and visualize the results in a very simple way. Grakn [14] can be considered a typical product for knowledge graphs and their applications, but it may take users much time to become familiar with Grakn’s ecosystem, including its database, functions and query language.

To solve the above problems, my approach is to build a pipeline that applies semantic role labeling to triple extraction, combined with other NLP tasks like named entity recognition and coreference resolution, and stores triples with their metadata in property graphs.
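For instance, a hedged sketch of this storage step, using the official Neo4j Python driver (node labels, property names and credentials here are my own assumptions, not the thesis's schema), might look like:

```python
# A sketch of storing a triple with its semantic-role metadata as a labeled
# property graph in Neo4j. Requires the `neo4j` package (driver >= 5.x for
# execute_write; older drivers use write_transaction instead).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def store_triple(tx, head, relation, tail, metadata):
    # Relationship properties carry the modifiers (time, manner, purpose, ...)
    tx.run(
        """
        MERGE (h:Entity {text: $head})
        MERGE (t:Entity {text: $tail})
        MERGE (h)-[r:REL {name: $relation}]->(t)
        SET r += $metadata
        """,
        head=head, tail=tail, relation=relation, metadata=metadata,
    )

with driver.session() as session:
    session.execute_write(
        store_triple, "John", "went to", "California",
        {"comitative": "with Tim", "temporal": "in 2010",
         "manner": "by airplane", "purpose": "in order to travel"},
    )
driver.close()
```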

Up to now, there have been systems trying to enhance information extraction and semantic role labeling methods and to develop knowledge graphs for many purposes. With the growth of neural networks, most state-of-the-art information extraction and semantic role labeling methods apply deep learning techniques to outperform existing ones. Knowledge graphs have gained considerable interest from both academic communities and industry, and methods of knowledge graph construction have been a hot topic attracting many researchers.

Triple extraction has been a key research field in natural language processing. It takes unstructured data expressed in natural language as input and turns it into a structured representation in the form of relational tuples (arg1, rel, arg2), in which arg1 and arg2 are normally the subject and object of a sentence, whereas rel is the predicate denoting the relation between them. Traditional triple extraction systems take pre-defined target relations with hand-crafted extraction patterns as input to answer specific requests on small and homogeneous corpora.

Open Information Extraction (OpenIE) was first introduced by TEXTRUNNER [6] in 2007, with a paradigm that aims at all types of relations discovered in the input documents, thereby reducing traditional methods' limitation to a small set of pre-defined relations. After that, OpenIE was followed by popular triple extraction systems such as ReVerb [15], OLLIE [16], Stanford-IE [17], ClausIE [18], MinIE [11], OpenIE-4 [9] and OpenIE-5. Most of these systems use syntactic or semantic parsers to build hand-crafted patterns for triple extraction from sentences.

Another approach to triple extraction takes advantage of neural networks. Based on the information from IMoJIE and OpenIE6 [19], there are three main types of supervised neural models for neural triple extraction: labeling-based systems, generation-based systems and span-based systems. For the first type, OpenIE was formulated as a sequence labeling problem which involves tagging every word in the input sentence: RnnOIE (or Supervised-OIE) takes the input sentence, identifies the word index of the predicate's syntactic head, then for each head word performs sequence labeling to get arguments. Generation-based systems use a Seq2Seq [20] model to generate extractions; CopyAttention [21] and IMoJIE are two typical systems using this technique. While CopyAttention uses an LSTM to encode and decode, IMoJIE uses a BERT-based [22] encoder and an iterative decoder which re-encodes the generated triples. For the last type, SpanOIE [23] applied a modified span selection model that consists of two modules: a predicate module to find potential candidate relation spans, and an argument module that takes a sentence and a predicate span and produces argument spans for that predicate.

According to the benchmark from OpenIE6 on the OIE16-C dataset, OpenIE6 gets the highest F1 score, 65.6, compared to 56.8, 56.0, 54.0 and 52.3 for IMoJIE, RnnOIE, SpanOIE and MinIE, respectively.

Semantic role labeling (SRL) is the task of automatically specifying the semantic roles for each predicate, including arguments and modifiers, which can answer questions about sentence meaning like “who” did “what” to “whom”. Most research has worked with the PropBank standard [24], which provides 5 types of numbered arguments and 15 types of modifiers. The BIO sequential tagging approach is used by many state-of-the-art SRL models. The first end-to-end neural SRL model was introduced by Collobert et al. in 2011 [25] and relied on convolutional neural networks (CNNs). Later research made use of BiLSTMs with and without self-attention architectures. In 2018, He et al. used a span-based approach [26] to identify the label for each span, while the model by Ouchi et al. [27] selected the span for each label. AllenNLP also provides an end-to-end SRL model which is a reimplementation, with some modifications, of the BERT-based model by Shi et al. (2019) [28].
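For illustration, the following sketch shows what a PropBank-style BIO tag sequence looks like and how labeled spans can be read off it; the tags for this sentence are my own assumption, not actual model output:

```python
# An illustrative BIO tagging of semantic roles for one predicate,
# following the PropBank-style scheme mentioned above.
tokens = ["John", "went", "to", "California", "in", "2010"]
tags   = ["B-ARG0", "B-V", "B-ARG4", "I-ARG4", "B-ARGM-TMP", "I-ARGM-TMP"]

def spans_from_bio(tokens, tags):
    """Collect (role, text) spans from a BIO tag sequence."""
    spans, role, buf = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if role:
                spans.append((role, " ".join(buf)))
            role, buf = tag[2:], [tok]
        elif tag.startswith("I-") and role == tag[2:]:
            buf.append(tok)
    if role:
        spans.append((role, " ".join(buf)))
    return spans

print(spans_from_bio(tokens, tags))
# [('ARG0', 'John'), ('V', 'went'), ('ARG4', 'to California'), ('ARGM-TMP', 'in 2010')]
```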

On the Ontonotes 5.0 benchmark, the models by Ouchi et al., AllenNLP and He et al. achieve F1 scores of 87.0, 86.49 and 85.5, respectively. Due to the lack of computing resources and time, this thesis makes use of the pre-trained model by AllenNLP instead of training the other models from scratch.

Knowledge graphs have been a key component of many Artificial Intelligence (AI) applications such as question-answering systems, recommendation systems and web search.

A lot of well-known knowledge graphs have been developed, like Wikidata, YAGO, DBpedia and Freebase. It is really time-consuming and expensive to build these knowledge graphs. Knowledge graph construction from unstructured text has been a challenging problem for years, given the goal of enhancing automation and minimizing human intervention. It relates to key techniques in NLP: entity recognition, coreference resolution and triple extraction.

Motivation

As mentioned in section 1.2, the main challenge in knowledge graph construction from unstructured text lies in the efficient use of the raw input documents. In detail, most triple extraction techniques used in current knowledge graph construction systems generate triples with long sequences of words in subjects and objects. If triples are not correctly pruned before being plugged into knowledge graphs, information problems including information loss, information redundancy and information duplication can occur. As a result, high-quality knowledge graphs cannot be constructed.

To deal with the above challenge, my proposal is to integrate semantic role labeling into triple extraction so that triples are generated with their metadata. After that, to answer the question about the utility of the constructed knowledge graphs, simple additional tasks, question generation and a question-answer search engine, are built on top of them.

In this thesis, my efforts are devoted to building an automatic knowledge graph construction pipeline, a question generation module and a question-answer search engine.

The pipeline can dynamically ingest raw text documents, using triple extraction combined with semantic role labeling, coreference resolution and named entity recognition to generate triples along with their metadata, and then store them in a graph database. Entities are also enriched by linking them with Wikidata entities. Moreover, the ability to store, load, merge and use multiple knowledge graphs together is taken into consideration.

The question generation module uses the generated knowledge graph as its backbone. Besides questions and answers relating to entities and relations, those about the metadata of triples, which relates to semantic roles like “how”, “when”, “where” and “why” questions, and entity types are also covered.

The final product is intended to work correctly and produce some outstanding results, so that it can make contributions to the NLP field and serve the community as an open-source project.

This thesis consists of 5 chapters with the following structure:

• Chapter 1: This chapter introduces and provides information about the context, the problems and their significance, related works, the motivation and the contribution of this thesis.
• Chapter 2: This chapter covers the background knowledge and theories involved in this thesis.
• Chapter 3: This chapter is about how the proposed system is designed, as well as its components.
• Chapter 4: This chapter is about how the proposed system is implemented and its evaluation.
• Chapter 5: This chapter is about the conclusion and future developments.

Chapter summary

To summarize, this chapter introduces the context and the reasons why the approach of applying semantic role labeling to knowledge graph construction is useful and significant. An investigation of existing related works is made to point out their limitations. Finally, this chapter also indicates the three main elements to be created in this thesis: an automatic knowledge graph construction pipeline, a question generation module and a question search engine. The next chapter describes the theoretical background required to finish this thesis.

BACKGROUND AND THEORY

Knowledge graph

With the huge growth in the quantity of data created by billions of users on the Internet, there appears a need to understand and make use of this data in some productive analytical way. The importance of making machines truly understand human language raises a requirement for an effective way to represent natural language data, especially text data. Knowledge graphs were born to deal with these problems and have been considered a crucial area in Artificial Intelligence.

Some potential definitions of knowledge graphs were mentioned in a valuable and concise survey in the research paper “Towards a Definition of Knowledge Graphs” [33]. After that, from an abstract knowledge graph architecture illustrated in Figure 2.1, a new definition was provided by the authors: “A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge.” Another way to define knowledge graphs can be found in the book “Knowledge Graphs: Methodology, Tools and Selected Use Cases” [34], which states that “Knowledge Graphs are very large semantic nets that integrate various and heterogeneous information sources to represent knowledge about certain domains of discourse”.

Figure 2.1 An abstract KG architecture (Source: “Towards a Definition of Knowledge Graphs”)

On the grounds that knowledge graphs make use of a graph-theoretic representation of human knowledge, in the simplest functional definition a knowledge graph is a set of triples in which each triple intuitively represents a fact. A triple is a 3-tuple (h, r, t), where h and t denote a head entity and a tail entity respectively, whereas r indicates the relation between the two entities h and t. In the graph representation, entities are represented as nodes or vertices, with the associations between them captured as edges. For example, the set of triples (Hieu; was born in; Ben Tre), (Hieu; worked in; Fossil Group), (Hieu; is studying in; UIT), (Ben Tre; is a province in; Vietnam), (Fossil Group; is; a U.S. company), (UIT; belongs to; HCM-VNU) is illustrated as a knowledge graph in Figure 2.2.
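As a quick illustration (my own sketch, not part of the thesis), this triple set can be materialized as a directed graph with networkx:

```python
# Materializing the example triple set as a directed graph with networkx.
import networkx as nx

triples = [
    ("Hieu", "was born in", "Ben Tre"),
    ("Hieu", "worked in", "Fossil Group"),
    ("Hieu", "is studying in", "UIT"),
    ("Ben Tre", "is a province in", "Vietnam"),
    ("Fossil Group", "is", "a U.S. company"),
    ("UIT", "belongs to", "HCM-VNU"),
]

g = nx.DiGraph()
for head, relation, tail in triples:
    g.add_edge(head, tail, relation=relation)   # edge label = relation

print(g.number_of_nodes(), g.number_of_edges())  # 7 nodes, 6 edges
print(g.edges["Hieu", "Ben Tre"]["relation"])    # 'was born in'
```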


Figure 2.2 An example of a knowledge graph

2.1.3 Graph data models

There are two main ways to store graph data: the Resource Description Framework (RDF) model and the Labeled Property Graph (LPG) model.

RDF is a World Wide Web Consortium (W3C) standard for exchanging data on the web that presents data as a graph [35]. It was born out of the need for applications to exchange machine-readable data contained on the web interoperably. RDF emphasizes provisions to enable automated processing of web resources. RDF can be applied in various application areas: in resource discovery to enhance search engine power, in cataloging to describe the content and content relationships of a particular web page, in AI to empower knowledge storage and exchange, and so on. The basic RDF data model has three object types: resources, properties and statements.

• Resources are all things described in RDF expressions. A resource can be an HTML page, a specific element of an HTML or XML document, a whole collection of pages of a website, or an object that is not accessible through the Web, like books and papers. The name of a resource takes the form of a URI plus an optional anchor ID.
• Properties are specific aspects, characteristics, attributes or relations used to describe a resource. Each property has a particular definition, defines permitted values, the resource types it can describe, and its relationships with other properties.
• Statements are expressions, each of which includes a specific resource together with a named property plus the value of that property for that resource. There are three parts to a statement: the subject, the predicate and the object. The object (the property value) can be another resource or a literal, i.e., a URI, a simple string, or another primitive data type defined by XML.

Here is an example of an RDF graph based on a tutorial about “RDF Graph and Syntax” [36]:

Figure 2.3 An RDF graph linking http://www.example.org/~joe/contact.rdf#joesmith via the property http://xmlns.com/foaf/0.1/homepage to Joe's homepage

In Figure 2.3, from the sentence “Joe Smith has homepage http://www.example.org/~joe/”, the triple (Joe Smith; has homepage; http://www.example.org/~joe/) is extracted. All elements of this triple are resources defined by URIs. Joe Smith is identified by the first resource, http://www.example.org/~joe/contact.rdf#joesmith, which is the subject. The second resource, http://xmlns.com/foaf/0.1/homepage, is the predicate homepage from the FOAF (friend-of-a-friend) vocabulary [37]. The final resource (the object) is Joe’s homepage address http://www.example.org/~joe/. The following figure shows extended information about Joe Smith.

Figure 2.4 An RDF graph describing Joe Smith

The information illustrated in Figure 2.4 can be translated into a set of triples: (Joe Smith; is; a person), (Joe Smith; has a homepage; http://www.example.org/~joe/), (Joe Smith; has an email address; joe.smith@example.org), (Joe Smith; has given name; Joe), (Joe Smith; has family name; Smith). The syntax describing the graph in Figure 2.4 is shown in Figure 2.5:

"http: //www.example.org/~joe/contact.rdf#joesmith">

Smith

Joe

Figure 2.5 Joe Smith information RDF graph syntax

There are many syntax types for the RDF model, such as Turtle [38], TriG [39], JSON-LD [40], RDFa (RDFA-PRIMER) [41], N-Triples and N-Quads; however, the syntax using XML is the most basic one. It is easier to get familiar with RDF/XML through examples. The following table shows two records in a CD list:

[Table: two example records in a CD list; only fragments such as “Linkin Park”, “Machine Shop” and a price of 10 survive in this extract]

The entity positions p1 and p2 are embedded into position vectors. The concatenation of the position embeddings and the contextual representation H is fed into a BiLSTM layer to obtain hidden states from the two directions, followed by a multilayer perceptron (MLP) with a single hidden layer to make the relation prediction. The whole relation extraction process is illustrated in Figure 2.20:

Figure 2.20 BERT-based relation extraction model (Source: Figure 1 in “Simple BERT Models for Relation Extraction and Semantic Role Labeling”)

The semantic role labeling process consists of 4 steps: predicate detection, predicate sense disambiguation, argument identification and argument classification. The task of finding the predicate is quite trivial, and some benchmark datasets provide the predicate for testing and training, so the authors focus on the other 3 tasks. Predicate sense disambiguation is the task of pointing out the sense of a predicate in a given context. This task is treated as sequence labeling: after the input sequence is converted into tokens by WordPiece, the predicate token is tagged with the appropriate label, while the O and X labels are used for the first sub-token of any other word and the remaining fragments, respectively. Sequences are fed into the BERT encoder to obtain contextual vector representations. A concatenation of a “predicate indicator” embedding and the contextual vector representation is used to let the model distinguish which tokens are predicate tokens. The final prediction is made from the concatenated embedding and the label set fed into the MLP. The purpose of the argument identification and classification task is to assign semantic roles, combined with the BIO tagging scheme, to argument spans. The input of the BERT encoder is a sentence-predicate pair, designed as [[CLS] sentence [SEP] predicate [SEP]], which allows the attention mechanism to cover the interaction between the predicate and the entire sentence. The output hidden states of BERT are then concatenated with the predicate indicator embedding, followed by a single-layer BiLSTM. The final prediction is decoded by feeding the hidden states of the current token and the predicate to an MLP.
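As a concrete illustration of this input format, here is a small sketch using the HuggingFace BERT tokenizer; the use of this particular library is my own assumption, and the paper's preprocessing may differ in details:

```python
# Constructing the [CLS] sentence [SEP] predicate [SEP] input pair with a
# BERT tokenizer, plus a simple predicate indicator vector.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence = "Barack Obama went to Paris"
predicate = "went"

encoding = tokenizer(sentence, predicate)  # text pair -> [CLS] a [SEP] b [SEP]
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens)
# ['[CLS]', 'barack', 'obama', 'went', 'to', 'paris', '[SEP]', 'went', '[SEP]']

# Predicate indicator: 1 for the predicate token inside the sentence, else 0.
first_sep = tokens.index("[SEP]")
indicator = [1 if tok == predicate and i < first_sep else 0
             for i, tok in enumerate(tokens)]
print(indicator)  # [0, 0, 0, 1, 0, 0, 0, 0, 0]
```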

[CLS] Barack Obama went to Paris [SEP] went [SEP]

Figure 2.21 Predicate argument identification and classification model (Source: Figure 2 in “Simple BERT Models for Relation Extraction and Semantic Role Labeling”)

Figure 2.21 shows the point where the model predicts the argument label of the token “Barack” based on the hidden state of the predicate “went” and that of the current token; (b) denotes the concatenation of the output contextual embedding from BERT and the predicate indicator embedding.

AllenNLP reimplemented this architecture, combined with some modifications, to create an end-to-end semantic role labeling model following the PropBank style, which achieves 86.49 test F1 on the Ontonotes 5.0 dataset. Below is an example of using the semantic role labeling model provided by AllenNLP to annotate sentences:

• Input sentence: “Fernando José Torres Sanz is a Spanish former professional footballer who played as a striker.”
• Output annotated sentences:
  1. [ARG1: Fernando José Torres Sanz] [V: is] [ARG2: a Spanish former professional footballer who played as a striker]
  2. Fernando José Torres Sanz is [ARG0: a Spanish former professional footballer] [R-ARG0: who] [V: played] [ARGM-PRD: as a striker]

In the above example, there are two verbs, which lead to two annotations. The first relies on the verb “is”: its arguments include “Fernando José Torres Sanz”, indicating a patient, and “a Spanish former professional footballer who played as a striker”, which is an attribute. The arguments of the verb “played” include “a Spanish former professional footballer who played as a striker” as a causer, and “who”, the relative pronoun referring to the causer. The modifier “as a striker” represents a predicate adjunct (tagged ARGM-PRD).
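A hedged sketch of producing the annotations above with AllenNLP's pre-trained SRL model is shown below; the model archive URL is an assumption on my part and may change between AllenNLP releases:

```python
# Annotating a sentence with AllenNLP's pre-trained BERT-based SRL model.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/"
    "structured-prediction-srl-bert.2020.12.15.tar.gz"
)
result = predictor.predict(
    sentence="Fernando José Torres Sanz is a Spanish former professional "
             "footballer who played as a striker."
)
# One frame per detected verb, with a BIO-decoded description string.
for verb in result["verbs"]:
    print(verb["verb"], "->", verb["description"])
```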

In this thesis, semantic role labeling is the core technique utilized to extract triples of (subject, predicate, object) as well as metadata including arguments and modifiers.

Chapter summary

This chapter presents the knowledge, concepts, theories and models surrounding knowledge graphs, graph storage, common NLP tasks and deep learning networks applied in this thesis. The Neo4j labeled property graph, coreference resolution, named entity recognition and BERT for semantic role labeling are important elements of the automatic knowledge graph construction pipeline. In the next chapter, the design of the system, as well as its components, is described in detail.

SYSTEM DESIGN AND IMPLEMENTATION

Visualization system

All operations and results are obviously easier to demonstrate and interact with through a web UI compared to terminal commands. The visualization system (Figure 3.19) in this project has to show information for the three processing modules mentioned: the extractor system, the question-answer pair generator and the similar question searcher.

Extractor system UI

As the most complicated module, the extractor system UI needs the greatest number of components, including a section to input the text data, a section to visualize the knowledge graph, a section to view the list of decomposed sentences, and a section to monitor the coreference resolution result.

Neo4j provides well-designed graph visualization as well as UI tools for data manipulation. However, those features are only usable internally in the Neo4j application, and creating a web graph visualization library connected to a Neo4j database from scratch is truly challenging in a short period, so it was necessary to find available resources for this task. After investigation and experiments, the neo4jd3 (https://eisman.github.io/neo4jd3/) and neovis.js (https://github.com/neo4j-contrib/neovis.js/) libraries seemed the best options. The neovis.js library attracts more attention because of its easy usability, well-organized documentation and long-term maintenance, whereas neo4jd3 has been abandoned since 2016. However, due to the need to visualize labeled property graphs, neo4jd3 is chosen, though it required a lot of effort to modify its source code. Figure 3.20 illustrates how neo4jd3 and neovis display a knowledge graph.

The next component is a carousel of charts containing information about the generated graphs, such as the number of nodes, the number of edges, the average node degree, the relationship distribution, the named entity distribution and the predicate modifier distribution.

The viewer for decomposed sentences is simply a JSON viewer. For visualizing coreference-resolved paragraphs, my UI tries to imitate the demonstration from AllenNLP shown in Figure 3.21 below.

[Figure 3.21 content: AllenNLP's highlighting of coreference chains in a passage about Paul Allen, where mentions such as "Paul Allen", "he", "Seattle, Washington" and "Lakeside School" are color-grouped by chain]

Figure 3.21 AllenNLP coreference resolution visualization

The user interfaces for the question-answer pair generator and the searcher are much simpler. The first one includes a table containing a list of question-answer pairs and some data viewing operations (e.g., searching, sorting, paging). The searcher section is a place where the user can input a question and view the resulting list of similar questions.

The back-end APIs of the web app are built with Flask, a lightweight web framework that is extremely suitable for simple demonstrations. The front-end is developed purely with the stack of HTML, CSS and JavaScript.
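A minimal sketch of what such a Flask back-end could look like is shown below; the endpoint name and the run_pipeline wrapper are hypothetical, not the thesis's actual code:

```python
# A minimal Flask back-end exposing the extractor pipeline to the web UI.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/extract", methods=["POST"])
def extract():
    text = request.get_json()["text"]
    # run_pipeline is a hypothetical wrapper around the extractor system.
    graph, sentences, coref_chains = run_pipeline(text)
    return jsonify({
        "graph": graph,            # nodes/edges for the knowledge graph viewer
        "sentences": sentences,    # output of the sentence decomposer
        "coref": coref_chains,     # chains for the coreference visualization
    })

if __name__ == "__main__":
    app.run(debug=True)
```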

In general, the web UI contains 7 sections: the raw input section, the sentence decomposer viewer, the coreference resolution visualization, the knowledge graph viewer, the knowledge graph information (KG Stats) viewer, the question-answer pair viewer, and the question-answer search viewer. The overview of the visualization system is illustrated in Figure 3.22.


Figure 3.22 Visualization system overview

Figure 3.23 shows the place where the user inputs the raw text for the extractor pipeline.


Figure 3.23 Raw text input section

The extractor pipeline ingests raw text and returns the knowledge graph data, the output sentences from the sentence decomposer, and the coreference chains, which are demonstrated in Figures 3.24, 3.25 and 3.26, respectively.

[Figure 3.24 content: the knowledge graph viewer with a focused node whose properties include the node text, NER labels, and linked Wikidata metadata such as wikidata_label, wikidata_des, wikidata_url and wikidata_cats]

The right side of the knowledge graph viewer is a legend indicating named entity types. Whenever a node or relationship is focused, its properties appear. Moreover, it is possible for users to zoom and drag the nodes.

[Figure 3.25 content: the sentence decomposer viewer showing a JSON list of generated sentences, e.g. "Benjamin James Chilwell was born 21 December 1996.", "Benjamin James Chilwell is an English professional footballer.", "Benjamin James Chilwell plays as a left-back.", "Benjamin James Chilwell plays for Premier League club Chelsea.", "Benjamin James Chilwell plays for the England national team."]

SYSTEM EVALUATION

Knowledge graph construction

Evaluation of knowledge graph construction from unstructured text is actually challenging. According to my survey, there are no official resources providing ground truth, standard datasets or benchmarks for this task; each research project has its own evaluation method. These methods can be classified into three types: evaluation by human intervention, calculation of the graph edit distance, and evaluation on applied tasks using the constructed knowledge graphs. The first two methods are referenced in the “Grading Rules and Scores” section of the Automatic Knowledge Graph Construction Contest (Wu et al., 2019) [29]. For the first one, extracted triples are manually compared with labeled ones by experts. The second method calculates the similarity between constructed knowledge graphs and prepared ones. This similarity is measured through the graph edit distance, which is the minimum cost of graph edit operations (e.g., vertex insertion, vertex deletion, vertex substitution, edge insertion, edge deletion, edge substitution) needed to transform one graph into another. The final method is scoring a task built over the knowledge graphs, like question answering [58].
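For illustration, a hedged sketch of computing a graph edit distance with networkx is shown below; the node and edge match functions here are my own illustrative choices, not those prescribed by the contest:

```python
# Measuring graph edit distance between a constructed and a reference graph.
import networkx as nx

constructed = nx.DiGraph()
constructed.add_edge("John", "California", relation="went to")

reference = nx.DiGraph()
reference.add_edge("John", "California", relation="went to")
reference.add_edge("John", "Tim", relation="went with")

distance = nx.graph_edit_distance(
    constructed, reference,
    node_match=lambda a, b: True,  # nodes matched by structure only here
    edge_match=lambda a, b: a["relation"] == b["relation"],
)
print(distance)  # 2.0: one vertex insertion plus one edge insertion
```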

To reproduce a benchmark based on the first two methods, labeled knowledge graphs are crucial; however, none have been published. Therefore, the showcases presented in the contest are used as the comparison standard between the knowledge graphs generated by my pipeline (SRL-KG) and the ones produced by the winning team. The first example text is:

“BYD Auto debuted its E-SEED GT concept car and Song Pro SUV alongside its all-new e-series models at the Shanghai International Automobile Industry Exhibition. The company also showcased its latest Dynasty series of vehicles, which was recently unveiled at the company’s spring product launch in Beijing.”

Figure 4.1 SRL-KG’s knowledge graph on example text 1

Figure 4.2 Team UWA’s knowledge graph on example text 1

As can be seen from Figures 4.1 and 4.2, team UWA’s knowledge graph explicitly contains more entities because they consider that a single preposition can take the role of a predicate. In the knowledge graph constructed by my pipeline (Figure 4.1), information like “all-new e-series models” and “Shanghai International Automobile Industry Exhibition” is implicitly stored in the modifiers of the predicate “debuted”, along with their preceding prepositions, “alongside” and “at” respectively (Figure 4.3).

"args": "{"ARGO": [["BYD Auto", [["BYD Auto", "0RG"]]]],

"ARG1": [["its E-SEED GT concept car and Song Pro SUV", [["Song Pro SUV", "ORG"]]]], "ARG2": [], "ARG3": [],

"mdfs": "{"ARGM-ADJ": [], "ARGM-ADV": [], "ARGM-CAU": [],

"ARGM-COM": [], "ARGM-DIR": [], "ARGM-DIS": [], "ARGM- DSP": [], "ARGM-EXT": [], "ARGM-GOL": [], "ARGM-LOC":

[["alongside its all-new e-series models", []], ["at the

Shanghai International Automobile Industry Exhibition",

[["the Shanghai International Automobile Industry Exhibition", "FAC"]]]], "ARGM-LVB": [], "ARGM-MNR": []

"ARGM-MOD": [], "ARGM-NEG": [], "ARGM-PNC": [], "ARGM- PRD": [], "ARGM-PRP": [], "ARGM-REC": [], "ARGM-TMP":

Figure 4.3 Modifiers of predicate “debuted”

There are two points where SRL-KG performs better on the above paragraph. First, team UWA’s knowledge graph lacks the predicate “unveiled”. Second, their system fails to recognize that “the company” in “the company’s spring product launch” corefers with “BYD Auto”. However, my pipeline’s drawback is producing two duplicated entities, “the company” and “The company”, due to the sentence decomposer module.

Team UWA’s use of prepositions as predicates causes an information conflict problem when there is a vast number of sentences. A sentence with many prepositions obviously produces a multi-hop route, which makes it difficult to trace back what the original fact is. Let us take a look at the knowledge graph constructed from the paragraph below (example text 2):

“Benjamin James Chilwell (born 21 December 1996) is an English professional footballer who plays as a left-back for Premier League club Chelsea and the England national team. Mason Mount (born 10 January 1999) is an English professional footballer who plays as a central midfielder for Premier League club Chelsea and the England national team.”


Figure 4.4 Team UWA’s knowledge graph on example text 2

In Figure 4.4, it is impossible to distinguish exactly what roles Mason Mount and Benjamin James Chilwell play. Moreover, a standalone triple whose predicate is a preposition does not really express meaningful information. Team UWA’s pipeline also fails to correctly extract the two entities “Benjamin James Chilwell” and “Mason Mount” with their birthdays, whereas the result from my pipeline, shown in Figure 4.5, is more accurate.


Figure 4.5 SRL-KG’s knowledge graph on example text 2

In Figure 4.5, it is easy to see that Benjamin James Chilwell plays as a left-back and Mason Mount plays as a central midfielder. In addition, SRL-KG also successfully extracts the birthdays of the two footballers. However, as mentioned when analyzing the results on example text 1, information redundancy is a problem SRL-KG is facing because of the sentence decomposer module.

Let us continue to investigate the results (Figures 4.6 and 4.7) on example text 3:

“Nguyen Minh Hieu usually wakes up at 5 am and goes to school by bus at 7 am. He returns home at 8 pm every day.”

Figure 4.6 Team UWA’s knowledge graph on example text 3

Figure 4.7 SRL-KG’s knowledge graph on example text 3

From Figures 4.6 and 4.7, we can see that team UWA’s pipeline completely fails to extract correct triples and build a knowledge graph. Meanwhile, SRL-KG’s output is better, although the information about “at 8 pm every day” is not clearly shown. When digging into the predicate “returns”, it is possible to see this information, as illustrated in Figure 4.8.

"ares": "{"ARG0": [], "ARG1": [["He", []]], "ARG2": [],

"mdfs": "{"ARGM-ADJ": [], "ARGM-ADV": [], "ARGM-CAU": [],

"ARGM-COM": [], "ARGM-DIR": [], "ARGM-DIS": [], "ARGM- psp": [], "ARGM-EXT": [], "ARGM-GOL": [], "ARGM-LOC": [],

"ARGM-LVB": [], "ARGM-MNR": [], "ARGM-MOD": [], "ARGM- NEG": [], "ARGM-PNC": [], "ARGM-PRD": [], "ARGM-PRP": [],

"ARGM-REC": [], "ARGM-TMP": [["at 8 pm", [["8 pm",

Figure 4.8 Modifiers of predicate “returns”

Additionally, SRL-KG also succeeds in handling relative clauses, as proved by example text 4:

“The medication which the patient’s GP had previously prescribed has been changed following surgery.”

Figure 4.9 SRL-KG’s knowledge graph on example text 4

In Figure 4.9, the entity “the medication” is detected as the object of the subject “the patient’s GP” with the predicate “had previously prescribed”, and as the subject of the object “following surgery” with the predicate “has been changed”. On example text 4, team UWA’s pipeline is not able to produce any output.

After the assessment with the 4 example texts above, it can be partially concluded that the SRL-KG pipeline performs better at triple extraction, coreference resolution and information preservation compared to team UWA’s pipeline.

Regarding time performance, SRL-KG is measured by running on datasets ranging from 1-sentence texts to 10-sentence texts, where each dataset contains 10 texts randomly collected from International VnExpress News. The result is shown in Table 4.3, where:

• avg_run_time: average time to process a text in a dataset
• avg_num_tokens: average number of tokens of a text in a dataset
• min_num_tokens: minimum number of tokens of a text in a dataset
• max_num_tokens: maximum number of tokens of a text in a dataset
• avg_num_nodes: average number of generated nodes of a text in a dataset
• min_num_nodes: minimum number of generated nodes of a text in a dataset
• max_num_nodes: maximum number of generated nodes of a text in a dataset
• avg_num_edges: average number of generated edges of a text in a dataset
• min_num_edges: minimum number of generated edges of a text in a dataset
• max_num_edges: maximum number of generated edges of a text in a dataset

The run time grows along with the numbers of tokens, nodes and edges. Taking the ratio between avg_num_tokens and avg_run_time, SRL-KG is able to process 9.3 tokens per second. This speed is not very fast, which causes a time performance problem when applying SRL-KG to long unstructured texts. After investigation and observation, for a given sentence the sentence decomposer can generate up to 7 simple-structured sentences. Consequently, the good side is that SRL-KG can analyze the original sentence more deeply and produce more relationships, about 8 relationships per sentence on the above dataset. However, the processing modules need to handle more data; coreference resolution and graph pruning are the two processing stages that cost the most running time.

Question generation

In a short period of time, this thesis successfully proposes a semantic role labeling based approach to flexibly construct knowledge graphs, which are then utilized to serve the question generation task. The research provides knowledge about not only knowledge graph construction but also other important NLP tasks and models, like information extraction, semantic role labeling, the Transformer, BERT and so on. Moreover, one of the most crucial problems that I finally solved was finding methods to evaluate my outcome against other existing works.

At first glance this thesis may seem simple and underrated because it is implemented by reusing available resources instead of providing a state-of-the-art architecture or network. However, after diving deep into it, I realized the most important point was that I succeeded in discovering and solving a significant problem which other existing works are facing. In fact, combining different technologies to serve one purpose is not as easy as expected and requires a lot of effort to make them work correctly.

From an almost zero background in NLP, thanks to this thesis I have gained a great deal of valuable knowledge about not only NLP and deep learning models but also problem-solving and research skills.


CONCLUSION AND FUTURE WORK

Future work

Due to time limitations, there are problems that the current system is still dealing with. First of all, the processing speed is not fast enough to handle long unstructured texts. Second, the sentence decomposer depends on the Sapien Language Engine, which is a closed-source service. Third, although question generation produced promising results, not much effort has been spent on this task yet. Finally, all of the NLP tasks in this project only work on English data.

To overcome the identified system limitations and improve the current work, future actions are described as follows:

• Building the sentence decomposer from scratch to remove the dependency on the Sapien Language Engine.
• Linking extracted entities and relationships with many knowledge bases by using deep learning approaches.
• Researching how to embed property graphs for the question-answering task.
• Designing more structured patterns as well as applying deep learning models for the question generation task.
• Building an end-to-end system to process Vietnamese texts.

APPENDIX A: EXAMPLES OF THE SENTENCE DECOMPOSER MODULE

“Barack Hussein Obama II is an American politician who served as the 44th President of the United States from 2009 to 2017.”

1 "Barack Hussein Obama II is an American politician."

2 "Barack Hussein Obama II served as the 44th President."

3 "Barack Hussein Obama II served of the United States."

4 "Barack Hussein Obama II served from 2009."

5 "Barack Hussein Obama II served to 2017."

6 "Barack Hussein Obama II served as the 44th President of the United States from

7 "an American politician served as the 44th President."

8 "an American politician served of the United States."

9 "an American politician served from 2009."

10."an American politician served to 2017."

11."an American politician served as the 44th President of the United States from 2009 to 2017."

“Fernando José Torres Sanz is a Spanish former professional footballer who played as a striker.”

1 "Fernando Jose Torres Sanz is a Spanish former professional footballer."

2 "Fernando Jose Torres Sanz played as a striker."

3 "a Spanish former professional footballer played as a striker."

“Torres started his career with Atlético Madrid and progressed through their youth system to their first-team squad.”

1 "Torres started his career with Atletico Madrid.",

"Torres progressed through their youth system.",

"Torres progressed to their first-team squad.",

-F YN "Torres progressed through their youth system to their first-team squad."

APPENDIX B: AN EXAMPLE OF PROPERTY TRIPLES

“Fernando Torres is a Spanish former professional footballer. Torres started his career with Atlético Madrid and progressed through their youth system to the first-team squad.”

Output knowledge graph:

Figure B.1 Knowledge graph about Fernando Torres

Details about the properties of the triple (Fernando Torres; is; a Spanish former professional footballer):

“wikidata_des": "Spanish association football player",

"wikidata_urL": "https: //ww.wikidata.org/wiki/Q42731",

“polyphyletic group of organisms known by one particular common name",

"wikidata_des": “person who plays association football

"wikidata_url": "https: //ww.wikidata.org/wiki/Q937857",

"wikidata_Label": "association football player",

"text": "a Spanish former professional footballer",

"args": "{"ARGO": [], "ARG1": [["Fernando Torres",

[["Fernando Torres", "PERSON"]]]], "ARG2": [["a Spanish former professional footballer", [["Spanish", "NORP"]]]],

"mdfs": "{"ARGM-ADJ": [], "ARGM-ADV": [], “ARGM-CAU": [],

"ARGM-COM": [], "ARGM-DTR": [], "ARGM-DIS": [], "ARGM-

DSP": [], "ARGM-EXT": [], "ARGM-G0L": [], "ARGM-L0C": [],

"ARGM-LVB": [], "ARGM-MNR": [], "ARGM-MOD": [], "ARGM-

NEG": [], "ARGM-PNC": [], "ARGM-PRD": [], "ARGM-PRP": [],

APPENDIX C: EXAMPLES OF QUESTION GENERATION

“Barack Hussein Obama II is an American politician who served as the 44th President of the United States from 2009 to 2017.”

• Who is an American politician? - Barack Hussein Obama II
• Who served as the 44th President of the United States? - Barack Hussein Obama II
• Who served from 2009? - Barack Hussein Obama II
• Who served to 2017? - Barack Hussein Obama II
• What country did Barack Hussein Obama II serve? - the United States
• When did Barack Hussein Obama II serve? - 2017
• When did Barack Hussein Obama II serve? - from 2009
• When did Barack Hussein Obama II serve as the 44th President of the United States? - from 2009 - to 2017
• What did Barack Hussein Obama II serve? - as the 44th President of the United States

“John went to California with Tim by airplane in order to travel.”

• Who went by airplane? - John
• Who went to California? - John
• Who went with Tim? - John
• How did John go? - by airplane
• How did John go to California? - by airplane
• What state of the united states did John go? - California
• Where did John go? - California
• Who did John go? - Tim
• For what purpose did John go to California? - in order to travel
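The examples above follow a clear pattern, mapping each semantic-role modifier of a triple to a wh-word. The sketch below is my own simplification of that mapping, not the thesis's implementation:

```python
# Template-based question generation from a triple's semantic-role metadata.
# Comitative ("who ... with") questions need a different template and are
# omitted here for simplicity.
WH_BY_MODIFIER = {
    "ARGM-TMP": "When",
    "ARGM-LOC": "Where",
    "ARGM-MNR": "How",
    "ARGM-PRP": "For what purpose",
}

def questions_from_triple(subject, verb, modifiers):
    """modifiers: dict of PropBank modifier tag -> text span."""
    pairs = []
    for tag, span in modifiers.items():
        wh = WH_BY_MODIFIER.get(tag)
        if wh:
            pairs.append((f"{wh} did {subject} {verb}?", span))
    return pairs

mods = {"ARGM-MNR": "by airplane", "ARGM-TMP": "in 2010",
        "ARGM-PRP": "in order to travel"}
for q, a in questions_from_triple("John", "go", mods):
    print(q, "-", a)
# How did John go? - by airplane
# When did John go? - in 2010
# For what purpose did John go? - in order to travel
```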

REFERENCES

A. Singhal, "Introducing the Knowledge Graph: things, not strings," Official Google Blog, 2012.

S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, "DBpedia: A nucleus for a Web of open data," 2007, doi: 10.1007/978-3-540-76298-0_52.

F. M. Suchanek, G. Kasneci, and G. Weikum, "Yago: A core of semantic knowledge," 2007, doi: 10.1145/1242572.1242667.

D. Vrandečić and M. Krötzsch, "Wikidata: A free collaborative knowledgebase," Commun. ACM, 2014, doi: 10.1145/2629489.

K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: A collaboratively created graph database for structuring human knowledge," 2008, doi: 10.1145/1376616.1376746.

M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, "Open information extraction from the web," 2007.

S. Saha and Mausam, "Open Information Extraction from Conjunctive Sentences," Proc. of the 27th Int. Conf. on Comput. Linguist., 2018.

S. Saha, H. Pal, and Mausam, "Bootstrapping for numerical open IE," 2017, doi: 10.18653/v1/P17-2050.

H. Pal and Mausam, "Demonyms and Compound Relational Nouns in Nominal Open IE," 2016, doi: 10.18653/v1/w16-1307.

J. Christensen, Mausam, S. Soderland, and O. Etzioni, "Semantic Role Labeling for Open Information Extraction," 2010.

K. Gashteovski, R. Gemulla, and L. del Corro, "MinIE: Minimizing facts in open information extraction," 2017, doi: 10.18653/v1/d17-1278.

G. Stanovsky, J. Michael, L. Zettlemoyer, and I. Dagan, "Supervised open information extraction," 2018, doi: 10.18653/v1/n18-1081.

K. Kolluru, S. Aggarwal, V. Rathore, Mausam, and S. Chakrabarti, "IMoJIE: Iterative Memory-Based Joint Open Information Extraction," 2020, doi: 10.18653/v1/2020.acl-main.521.

A. Fader, S. Soderland, and O. Etzioni, "Identifying relations for Open Information Extraction," in EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 2011, pp. 1535-1545.

Mausam, M. Schmitz, R. Bart, S. Soderland, and O. Etzioni, "Open language learning for information extraction," 2012.

G. Angeli, M. J. Premkumar, and C. D. Manning, "Leveraging linguistic structure for open domain information extraction," 2015, doi: 10.3115/v1/p15-1034.

L. Del Corro and R. Gemulla, "ClausIE: Clause-based open information extraction," 2013.

K. Kolluru, V. Adlakha, S. Aggarwal, Mausam, and S. Chakrabarti, "OpenIE6: Iterative Grid Labeling and Coordination Analysis for Open Information Extraction," 2020.

I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," 2014.

L. Cui, F. Wei, and M. Zhou, "Neural open information extraction," 2018, doi: 10.18653/v1/p18-2065.

J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2019.

J. Zhan and H. Zhao, "Span Model for Open Information Extraction on Accurate Corpus," Proc. AAAI Conf. Artif. Intell., 2020, doi: 10.1609/aaai.v34i05.6497.

M. Palmer, P. Kingsbury, and D. Gildea, "The proposition bank: An annotated corpus of semantic roles," Comput. Linguist., 2005.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., 2011.

L. He, K. Lee, O. Levy, and L. Zettlemoyer, "Jointly predicting predicates and arguments in neural semantic role labeling," 2018, doi: 10.18653/v1/p18-2058.

H. Ouchi, H. Shindo, and Y. Matsumoto, "A span selection model for semantic role labeling," 2018, doi: 10.18653/v1/d18-1191.

P. Shi and J. Lin, "Simple BERT Models for Relation Extraction and Semantic Role Labeling," arXiv, 2019.

X. Wu, J. Wu, X. Fu, J. Li, P. Zhou, and X. Jiang, "Automatic knowledge graph construction: A report on the 2019 ICDM/ICBK Contest," in Proceedings - IEEE International Conference on Data Mining (ICDM), 2019, pp. 1540-1545, doi: 10.1109/ICDM.2019.00204.

M. Stewart, M. Enkhsaikhan, and W. Liu, "ICDM 2019 knowledge graph contest: Team UWA," 2019, doi: 10.1109/ICDM.2019.00205.

N. Kertkeidkachorn and R. Ichise, "T2KG: An end-to-end system for creating knowledge graph from unstructured text," 2017.

R. Clancy, I. F. Ilyas, J. Lin, and D. R. Cheriton, "Knowledge Graph Construction from Unstructured Text with Applications to Fact Verification and Beyond," 2019.

L. Ehrlinger and W. Wöß, "Towards a definition of knowledge graphs," 2016.

D. Fensel, U. Şimşek, K. Angele, E. Huaman, E. Kärle, O. Panasiuk, I. Toma, J. Umbrich, and A. Wahler, Knowledge Graphs: Methodology, Tools and Selected Use Cases, 2020.

W3C, "Resource Description Framework (RDF) Model and Syntax Specification," 1999. https://www.w3.org/TR/PR-rdf-syntax/Overview.html.

M. Obitko, "RDF Graph and Syntax." https://www.obitko.com/tutorials/ontologies-semantic-web/rdf-graph-and-syntax.html.

D. Brickley and L. Miller, "FOAF Vocabulary Specification 0.99." http://xmlns.com/foaf/spec/.

D. Beckett, T. Berners-Lee, E. Prud'hommeaux, and G. Carothers, "RDF 1.1 Turtle: Terse RDF Triple Language," W3C Recommendation, 2014. http://www.w3.org/TR/turtle/.

G. Carothers and A. Seaborne, "RDF 1.1 TriG: RDF Dataset Language," W3C Recommendation, 2014. http://www.w3.org/TR/trig/.

M. Sporny, G. Kellogg, and M. Lanthaler, "JSON-LD 1.0," 2014.

I. Herman, B. Adida, M. Sporny, and M. Birbeck, "RDFa 1.1 Primer - Second Edition," 2013.

J. Li, A. Sun, J. Han, and C. Li, "A Survey on Deep Learning for Named Entity Recognition," arXiv, 2018, doi: 10.1109/tkde.2020.2981314.

K. Clark, D. Jurafsky, and C. Manning, "Coreference Resolution." https://nlp.stanford.edu/projects/coref.shtml.

D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Third Edition, 2019.

C. F. Baker, C. J. Fillmore, and J. B. Lowe, "The Berkeley FrameNet Project," 1998, doi: 10.3115/980845.980860.

C. J. Fillmore, C. R. Johnson, and M. R. L. Petruck, "Background to FrameNet," Int. J. Lexicography, 2003.

C. Bonial, J. Hwang, J. Bonn, K. Conger, O. Babko-Malaya, and M. Palmer, "English PropBank Annotation Guidelines," Tech. Report, Univ. of Colorado at Boulder, 2012.

K. W. Church, "Word2Vec," Nat. Lang. Eng., 2017, doi: 10.1017/s1351324916000334.

J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," 2014, doi: 10.3115/v1/d14-1162.

M. E. Peters et al., "Deep contextualized word representations," 2018, doi: 10.18653/v1/n18-1202.

Y. Wu et al., "Google's NMT," ArXiv e-prints, 2016.

Plasticity, "Sapien Language Engine." https://www.plasticity.ai/api/docs.

N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," 2019, doi: 10.18653/v1/d19-1410.

G. Stanovsky and I. Dagan, "Creating a large benchmark for open information extraction," 2016, doi: 10.18653/v1/d16-1252.

S. Bhardwaj, S. Aggarwal, and Mausam, "CaRB: A crowdsourced benchmark for open IE," 2019, doi: 10.18653/v1/d19-1651.

Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou, "Neural question generation from text: A preliminary study," 2018, doi: 10.1007/978-3-319-73618-1_56.

L. He, M. Lewis, and L. Zettlemoyer, "Question-answer driven semantic role labeling: Using natural language to annotate natural language," 2015, doi: 10.18653/v1/d15-1076.
