Relation Extraction in Vietnamese Text via Piecewise Convolution Neural Network with Word-Level Attention
Van-Nhat Nguyen1, Ha-Thanh Nguyen1, Dinh-Hieu Vo1, Le-Minh Nguyen2
1VNU University of Engineering and Technology
2Japan Advanced Institute of Science and Technology
Abstract— With the explosion of information technology, the Internet now contains enormous amounts of data, so the role of information extraction systems has become very important. Relation Extraction is a sub-task of Information Extraction that focuses on classifying the relationship between entity pairs mentioned in text. In recent years, although many new methods have been introduced, Relation Extraction still receives attention from researchers, for languages in general and for Vietnamese in particular.
Relation Extraction can be addressed in a variety of ways, including supervised, unsupervised, and semi-supervised learning methods. Recent studies on English have shown that deep learning approaches to Relation Extraction, in both supervised and semi-supervised settings, achieve superior results over traditional non-deep-learning methods. However, research on Vietnamese is scarce, and in our survey of the literature we found no published results applying deep learning to Relation Extraction in Vietnamese. Therefore, this research focuses on studying deep learning methods for solving the Relation Extraction task in Vietnamese. To solve the task, the research proposes and constructs a deep learning model named Piecewise Convolution Neural Network with Word-Level Attention.
Keywords— Relation Extraction, deep learning, convolution
neural network, attention mechanism
I. INTRODUCTION
With the explosion of information technology, the Internet now contains an enormous amount of data. According to internetlivestats's statistics, there are currently over 1,868,000,000 websites; every second, over 2,683,187 emails are sent, along with over 66,423 Google searches, 8,003 Twitter posts, 840 Instagram photos, over 1,366 Tumblr posts, 3,090 Skype calls, over 73,450 YouTube videos, and 55,730 GB of Internet traffic, and these numbers continue to increase.
Data on the Internet contains enormous amounts of information, but most of it exists as unstructured text, with redundant information that makes the data difficult to analyze. Therefore, information extraction systems play a very important role in extracting meaningful information from the data for analysis. Information Extraction (IE) is a field of study in natural language processing (NLP) concerned with extracting structured information (which can easily be interpreted as typed data) from unstructured text. Data extracted by IE systems can be applied in a variety of areas, such as studying users' primary business trends, disease prevention, crime prevention, bioinformatics, stock analysis, etc. Moreover, knowledge bases (KBs) such as Freebase [1] or DBpedia [2] still need a lot of knowledge to improve, so IE systems can be used to expand these knowledge bases.
According to Jiang [3], the information extraction task has existed since the 1970s (DeJong's FRUMP program) but only began to attract attention when DARPA (Defense Advanced Research Projects Agency) initiated and sponsored the Message Understanding Conferences (MUC) in the 1990s. Information extraction is a larger task that involves several sub-tasks such as Named Entity Recognition (NER), Relation Extraction (RE), Event Extraction, etc. These sub-tasks are closely interrelated within the larger task; for example, NER is considered a preprocessing step for the more complex task of Relation Extraction.
Relation Extraction is a sub-task of information extraction focused on recognizing and classifying relationships between entities in sentences or text. Relationships derived from RE systems can be applied to many tasks such as question answering systems, biomedical text mining, and medical support. Thus, in recent years, the Relation Extraction task has received great attention from researchers worldwide, with much research presented at important conferences such as COLING, ACL, and SensEval. Relation Extraction is also a part of international knowledge mining projects such as Automatic Content Extraction (ACE) and Global WordNet.
In recent years, despite the modest number of existing works, the Relation Extraction task continues to receive attention from a large number of researchers around the world. In particular, Relation Extraction is a task with great potential for deep learning. For English, recent work [4, 5, 6] has shown that applying deep learning to this task is more effective than traditional non-deep-learning approaches. Relation Extraction is also part of the information extraction task for Vietnamese text. However, research on Relation Extraction for Vietnamese is still limited, and in our review of the literature we found no published results applying deep learning to Relation Extraction in Vietnamese. The objective of this research is to investigate and propose a deep learning model for the Relation Extraction task in Vietnamese. To reach this goal, the research studies and introduces methods of solving Relation Extraction problems and some models within each method. Drawing on ideas from [4, 6, 7], the study develops a deep learning model called Piecewise Convolution Neural Network with Word-Level Attention. To assess the effectiveness of the model, a Vietnamese dataset was constructed and the proposed model was tested on it.
II. RELATED WORKS
A. Simple CNN models
The simple convolutional neural network model [8] was the earliest work to use a CNN to learn features automatically instead of hand-crafting them. This model first encodes the input sentence using word embeddings and lexical features, followed by a convolutional layer, a single neural network layer, and a softmax output layer that produces the probability distribution over all relation classes.
The convolutional neural network model with a max-pooling layer [9] also uses a CNN to encode sentence-level features, but differs in applying a max-pooling layer to the output of the convolution layer. This paper is also the first work to use positional embeddings. The model additionally uses lexical-level features such as information about the nouns in the sentence and their hypernyms in WordNet.
The convolutional neural network with multi-sized window kernels [10] builds on the results of Liu et al. [8] and Zeng et al. [9]. This model completely removes the lexical word features so as to enrich the representation of the input sentence and let the CNN learn the necessary features by itself. Its architecture is similar to that of Zeng et al., consisting of word and positional embeddings followed by convolution and max-pooling. In addition, it incorporates convolutional kernels of varying window sizes to capture wider ranges of n-gram features.
B. CNN with Attention Mechanism models
The attention-based convolutional neural network [4] uses word embeddings, positional embeddings, and part-of-speech features to construct the word vectors. These are followed by a convolutional layer and a max-pooling layer to obtain the sentence-level convolution feature. The attention weights are learned by the model itself to capture the correlation between the words in the sentence, and an attention-based context feature is calculated as the weighted sum of the word vectors. The convolution feature vector and the two attention-based context feature vectors (one for each entity) are concatenated before being passed through a multi-layer perceptron with softmax activation.
With an F1-score of 85.9% on the SemEval-2010 Task 8 dataset [11], this model proved the effectiveness of applying a deep learning model to the Relation Extraction task.
The Multi-Level Attention CNN [6] is perhaps the best-performing model at present, with an F1-score of 88.0% on the SemEval-2010 Task 8 dataset [11]. Its biggest contribution is the combination of two attention layers: attention on the input layer and attention on the max-pooling layer. We can see that the attention mechanism has a positive effect on these models.
III. PROPOSED MODEL
To solve the Relation Extraction problem, the study proposes and builds a deep learning model called Piecewise Convolution Neural Network with Word-Level Attention. This section details the architecture and the process of building the model. The general architecture of our model is shown in Figure 1.
A. Input representation
Suppose the input sentence is a sequence of words $S = [w_1, w_2, \dots, w_n]$ of length $n$, and the two entities of the sentence are $e_1 = w_p$ and $e_2 = w_t$ $(p, t \in [1, n];\ p \neq t)$. Similar to the models mentioned above, the research uses word embeddings and positional embeddings to encode each word $w_i$ into a vector $x_i$.
First, the model uses word embeddings to capture the semantics of each word. Given a word embedding matrix of size $|V| \times d_w$, where $V$ is the vocabulary and $d_w$ is the embedding dimension, each word $w_i$ is looked up in the embedding matrix to retrieve its vector $w_i^d \in \mathbb{R}^{d_w}$.
Next, the model uses positional embeddings to capture the distance from each word in the sentence to the two entities. First, the relative distance of each word to each of the two entities is calculated. Over the set of input sentences, the two sets of relative positions are encoded by two positional embedding matrices. The two relative distances of each word are then looked up in the respective embedding matrices to retrieve two positional embedding vectors $w_{i,1}^p$ and $w_{i,2}^p$ of the same size $d_p$.
Finally, the vector representation of the word is the concatenation of the three vectors and has size $d = d_w + 2 d_p$:
$$x_i = w_i^d \oplus w_{i,1}^p \oplus w_{i,2}^p,$$
where $\oplus$ is the vector concatenation operator.
Figure 1. General architecture of the model.
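As a concrete illustration, the following is a minimal PyTorch sketch of this input representation; the vocabulary size, dimensions, and maximum relative distance are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class InputRepresentation(nn.Module):
    """Encodes each word as word embedding + two positional embeddings."""
    def __init__(self, vocab_size=10000, d_w=300, d_p=50, max_dist=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_w)        # |V| x d_w lookup table
        # Relative distances are assumed shifted into [0, 2*max_dist] before lookup.
        self.pos_emb1 = nn.Embedding(2 * max_dist + 1, d_p)  # distance to entity e1
        self.pos_emb2 = nn.Embedding(2 * max_dist + 1, d_p)  # distance to entity e2

    def forward(self, words, dist1, dist2):
        # words, dist1, dist2: LongTensors of shape (batch, n)
        # returns x of shape (batch, n, d) with d = d_w + 2 * d_p
        return torch.cat([self.word_emb(words),
                          self.pos_emb1(dist1),
                          self.pos_emb2(dist2)], dim=-1)
```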
B. Word-level attention mechanism
According to the studies of Huang et al. [4] and Wang et al. [6], the words in a sentence differ in how important they are for predicting the relationship between the pair of entities. For example, in "[Hoàng Văn Trà]e1 sinh ở xã Nghi Hưng, huyện Nghi Lộc, tỉnh [Nghệ An]e2" ("[Hoang Van Tra]e1 was born in Nghi Hung commune, Nghi Loc district, [Nghe An]e2 province"), the relationship between the two entities is "Hometown", and the word "sinh" ("born") is the most important signal for predicting it. We therefore need to train the model to focus its attention on words that carry such important information (shown in Figure 2). The research proposes a word-level attention mechanism in which every word representation vector is multiplied by an attention weight learned by the model.
First, we concatenate the word embedding vector of each word with the word embedding vectors of the two entities; the resulting vector is called $u_i$:
$$u_i = w_i^d \oplus w_p^d \oplus w_t^d, \quad (p, t \in [1, n],\ p \neq t)$$
Next, $u_i$ is passed through a fully-connected layer to compute the correlation $c_i$ between each word and the two entities:
$$c_i = W_u u_i + b_u$$
Finally, the attention weight $\alpha_i$ for each word is calculated by applying the softmax function over the correlations of the sentence:
$$\alpha_i = \frac{\exp(c_i)}{\sum_j \exp(c_j)}$$
After obtaining the attention weights, the new representation vector $\tilde{x}_i$ of each word is the old representation vector multiplied by its attention weight:
$$\tilde{x}_i = \alpha_i x_i$$
Figure 2. Word-level attention mechanism.
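A minimal PyTorch sketch of this word-level attention follows, under the assumption that the fully-connected layer maps each $u_i$ to a single correlation score:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelAttention(nn.Module):
    """Scores each word against the two entity embeddings and rescales x_i."""
    def __init__(self, d_w=300):  # d_w is an assumed embedding size
        super().__init__()
        self.fc = nn.Linear(3 * d_w, 1)  # implements c_i = W_u u_i + b_u

    def forward(self, x, word_emb, e1_emb, e2_emb):
        # x:        (batch, n, d)   full word representations
        # word_emb: (batch, n, d_w) word embeddings only
        # e1_emb, e2_emb: (batch, d_w) word embeddings of the two entities
        n = word_emb.size(1)
        u = torch.cat([word_emb,
                       e1_emb.unsqueeze(1).expand(-1, n, -1),
                       e2_emb.unsqueeze(1).expand(-1, n, -1)], dim=-1)
        c = self.fc(u).squeeze(-1)      # correlations, shape (batch, n)
        alpha = F.softmax(c, dim=-1)    # attention weights over the sentence
        return x * alpha.unsqueeze(-1)  # new representation: alpha_i * x_i
```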
C. The convolutional layer
Similar to other deep learning models for Relation Extraction, this model uses a convolutional layer with a window of size 3 to capture tri-gram level features. Assuming the convolution layer has $m$ filters, the output value of the $j$-th filter $(j \in [1, m])$ for each word $\tilde{x}_i$ $(i \in [1, n])$ is calculated by convolving the filter matrix of the $j$-th filter with the representation matrix of the three-word phrase $[\tilde{x}_{i-1}, \tilde{x}_i, \tilde{x}_{i+1}]$, followed by a tanh activation function:
$$c_{ij} = \tanh\left(W_{c_j} \circledast [\tilde{x}_{i-1}, \tilde{x}_i, \tilde{x}_{i+1}] + b_{c_j}\right),$$
where $W_{c_j} \in \mathbb{R}^{3 \times d}$ is the filter matrix of the $j$-th filter and $\circledast$ is the convolution operator. The representation of the sentence for filter $j$ is then the vector
$$S_j = [c_{1j}, c_{2j}, \dots, c_{nj}].$$
So with $m$ filters, the representation matrix of the sentence is
$$S = [S_1, S_2, \dots, S_m].$$
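The tri-gram convolution can be sketched with a stride-1 one-dimensional convolution; the filter count m below is an illustrative assumption, since the paper's value is not recoverable from the extracted parameter table:

```python
import torch
import torch.nn as nn

class TriGramConv(nn.Module):
    """Window-3 convolution: each of the m filters scores every 3-word window."""
    def __init__(self, d=400, m=230):  # d = d_w + 2*d_p; m is assumed
        super().__init__()
        self.conv = nn.Conv1d(d, m, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (batch, n, d); Conv1d expects channels first, i.e. (batch, d, n)
        return torch.tanh(self.conv(x.transpose(1, 2)))  # S: (batch, m, n)
```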
D. Piecewise max-pooling layer
The purpose of the max-pooling layer is to obtain the most prominent value from the output of each filter of the convolution layer. The research uses the piecewise max-pooling layer of Zeng et al. [7]: the two entities in the sentence separate the representation $S_j$ of the sentence into three parts,
$$S_j = [S_{j1}, S_{j2}, S_{j3}].$$
The model takes the maximum value of each part, $p_j = [p_{j1}, p_{j2}, p_{j3}]$ with $p_{jk} = \max(S_{jk})$. Finally, the representation of the sentence is the concatenation of the vectors $p_j$:
$$S^* = p_1 \oplus p_2 \oplus \dots \oplus p_m$$
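A sketch of piecewise max-pooling for a single sentence; exactly where the entity positions split the segments is an assumption, and edge cases such as an entity at the sentence boundary are ignored:

```python
import torch

def piecewise_max_pool(s, p, t):
    # s: (m, n) convolution outputs for one sentence; p < t are entity positions.
    # Split each filter's outputs into three segments at the two entities,
    # take the max of each segment, and concatenate into S* of length 3 * m.
    segments = [s[:, :p + 1], s[:, p + 1:t + 1], s[:, t + 1:]]
    pooled = [seg.max(dim=1).values for seg in segments if seg.size(1) > 0]
    return torch.cat(pooled)
```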
E. Lexical features
The studies of Huang et al. [4] and Zeng et al. [9] show that information from the two entities and the words around them is very important. Based on these studies, this model also uses a lexical feature vector built from the entities and the two words around them. Suppose $w_p^d, w_t^d$ $(p, t \in [1, n];\ p \neq t)$ are the word embeddings of the two entities; the lexical feature vector is computed as
$$L = w_{p-1}^d \oplus w_p^d \oplus w_{p+1}^d \oplus w_{t-1}^d \oplus w_t^d \oplus w_{t+1}^d$$
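In code, the lexical feature vector reduces to an index-and-concatenate operation; this sketch omits the padding that entities at the sentence boundaries would require:

```python
import torch

def lexical_features(word_emb, p, t):
    # word_emb: (n, d_w) word embeddings of the sentence; p, t entity positions.
    # Concatenates the embeddings of each entity and its immediate neighbours.
    return torch.cat([word_emb[i] for i in (p - 1, p, p + 1, t - 1, t, t + 1)])
```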
F. Output
After obtaining the sentence feature vector $S^*$ and the lexical feature vector $L$, we concatenate them into a single vector $v_o = S^* \oplus L$. Next, this vector is passed through a fully-connected layer with a softmax activation function to obtain the output vector
$$o = \mathrm{softmax}(W_o v_o + b_o) = [o_1, o_2, \dots, o_l],$$
where $l$ is the number of relation classes and $o_r$ $(r \in [1, l])$ is the probability that the sentence expresses relation $r$. The relation predicted by the model for sentence $S$ is the one with the highest probability:
$$r_S = \operatorname{argmax}(o)$$
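A minimal sketch of the output layer; the input size follows from the m pooled filters and the six concatenated embeddings of the lexical feature, and all sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputLayer(nn.Module):
    def __init__(self, m=230, d_w=300, num_classes=4):
        super().__init__()
        self.fc = nn.Linear(3 * m + 6 * d_w, num_classes)  # W_o, b_o

    def forward(self, s_star, lex):
        # s_star: (batch, 3*m) pooled sentence feature; lex: (batch, 6*d_w)
        o = F.softmax(self.fc(torch.cat([s_star, lex], dim=-1)), dim=-1)
        return o, o.argmax(dim=-1)  # class probabilities and predicted relation
```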
IV. EXPERIMENTS
A. Dataset and Evaluation Metrics
The dataset of the study was collected automatically from Vietnamese Wikipedia pages about people. Based on the page parameters (living_place, occupation), we extract "human-place" and "human-occupation" entity pairs. Next, from the pages of the human entities, we analyze all the sentences containing the specified entity pairs and automatically assign them a label. For example, human-occupation pairs are assigned the Occupation label, and sentences containing keywords such as "born" and "home" are assigned the Hometown label. Finally, the data is reviewed and re-labeled manually to avoid mistakes; a sketch of the automatic labeling step is given below.
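This is a hedged sketch of the automatic labeling step; the keyword list and the fallback rule are illustrative assumptions, not the authors' exact heuristic:

```python
# Hypothetical Vietnamese keyword list ("born", "home(town)").
HOMETOWN_KEYWORDS = ("sinh", "quê")

def auto_label(sentence, pair_type):
    # pair_type comes from the Wikipedia parameter the pair was mined with.
    if pair_type == "human-occupation":
        return "Occupation"
    if pair_type == "human-place":
        if any(kw in sentence.lower() for kw in HOMETOWN_KEYWORDS):
            return "Hometown"
        return "Workplace"  # assumed fallback for place pairs
    return "Other"
```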
This dataset includes 1,716 sentences covering 3 relations and one Other relation, described in detail in Table I:
Table I: Number of sentences in the dataset

Relation      Number of sentences
Hometown      358
Workplace     632
Occupation    435
Other         291
Total         1,716
We evaluate the model using the macro-averaged F1 score over the 3 relations (excluding Other). The study also uses k-fold cross-validation with k = 5.
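This protocol can be sketched with scikit-learn; the label encoding (0-2 for the three target relations), the array inputs, and the training helper are assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, train_model, n_splits=5):
    # Macro-F1 over the three target relations (labels 0, 1, 2, excluding
    # Other), averaged over stratified 5-fold cross-validation.
    scores = []
    for tr, te in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        model = train_model(X[tr], y[tr])  # hypothetical training helper
        preds = model.predict(X[te])
        scores.append(f1_score(y[te], preds, labels=[0, 1, 2], average="macro"))
    return float(np.mean(scores))
```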
B. Experimental setup
For word embeddings, the study uses a Word2Vec model [12] pre-trained on a Vietnamese Wikipedia corpus. For training, the study uses the Adam optimizer with a learning rate of 0.01 and early stopping with a patience of 5. The remaining parameters are described in Table II.
Table II: The list of the model's parameters

Parameter       Value
Learning rate   0.01
Patience        5
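The following is a self-contained sketch of the stated training setup (Adam, learning rate 0.01, early stopping with patience 5); the model and data below are dummy placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 4)  # placeholder for the PCNN model (4 relation classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
X, y = torch.randn(64, 10), torch.randint(0, 4, (64,))  # dummy data

best_val, wait, patience = float("inf"), 0, 5
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    val_loss = loss.item()  # a real run would compute this on held-out data
    if val_loss < best_val:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:  # stop after 5 epochs without improvement
            break
```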
C. Experiment results
With the dataset and parameters described above, the proposed model achieves a macro-averaged F1-score of 94.89%. In addition, we evaluated 5 different designs of the neural network to assess the contribution of three techniques: lexical features, piecewise max-pooling, and word-level attention. The results are shown in Table III.
Table III: Experimental results of the designs

Design    Lexical Feature    Piecewise Max-pooling    Word-level Attention    F1 Score
Based on Table III, we can see that the combination of all three techniques yields the best results on the dataset.
D. Error analysis
Analyzing the errors in the experiment, we found that our model shares the general weakness of machine learning algorithms: data dependence. Most of the errors in our experiment are related to the label Other. The first reason is that the number of examples with this label in our dataset is small, accounting for only 16.96% of the sentences. The second reason is the ambiguity of the labels: sentences labeled Hometown, Workplace, or Occupation can be wrongly classified as Other and vice versa. The third reason is that the label Other has no specific features. For example, for the sentence "Từ 1967 đến 1968, [Thanh Tuyền]e1 hát song ca cùng [ca sĩ]e2 Chế Linh…" ("From 1967 to 1968, [Thanh Tuyen]e1 sang duets with [singer]e2 Che Linh…"), our model predicted the label Occupation, but the correct label is Other. In this case, our model had not learned the characteristics of sentences with the label Other, so it relied on the cues "Thanh Tuyền", "Chế Linh", and "singer" to classify the sentence into the label Occupation.
As we can see, there are still some problems in our model. These issues will be addressed in subsequent studies.
V. CONCLUSION AND FUTURE WORK
In this study, we have researched a number of deep learning models for solving Relation Extraction problems in English and proposed a Relation Extraction model for Vietnamese texts. In our experiment, we constructed a Vietnamese dataset collected from Vietnamese Wikipedia and obtained an F1 score of 94.89% on this dataset.
However, the experimental data and the number of relations in this study are limited and focus only on a specific domain related to people. In the future, we will continue to improve the quality of the data. The model obtained in this study can also be used in the labeling process for additional data. We will also apply this research result to specific domains such as legal engineering.
ACKNOWLEDGEMENT
This work has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.16.91.
REFERENCES
[1] BOLLACKER, Kurt, et al. Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008. p. 1247-1250.
[2] AUER, Sören, et al. DBpedia: a nucleus for a web of open data. In: The Semantic Web. Springer, Berlin, Heidelberg, 2007. p. 722-735.
[3] JIANG, Jing. Information extraction from text. In: Mining Text Data. Springer, Boston, MA, 2012. p. 11-41.
[4] HUANG, Xuanjing, et al. Attention-based convolutional neural network for semantic relation extraction. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 2016. p. 2526-2536.
[5] HUANG, Yi Yao; WANG, William Yang. Deep residual learning for weakly-supervised relation extraction. arXiv preprint arXiv:1707.08866, 2017.
[6] WANG, Linlin, et al. Relation classification via multi-level attention CNNs. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. p. 1298-1307.
[7] ZENG, Daojian, et al. Distant supervision for relation extraction via piecewise convolutional neural networks. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015. p. 1753-1762.
[8] LIU, ChunYang, et al. Convolution neural network for relation extraction. In: International Conference on Advanced Data Mining and Applications. Springer, Berlin, Heidelberg, 2013. p. 231-242.
[9] ZENG, Daojian, et al. Relation classification via convolutional deep neural network. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 2014. p. 2335-2344.
[10] NGUYEN, Thien Huu; GRISHMAN, Ralph. Relation extraction: perspective from convolutional neural networks. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. 2015. p. 39-48.
[11] HENDRICKX, Iris, et al. SemEval-2010 Task 8: multi-way classification of semantic relations between pairs of nominals. In: Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics, 2009. p. 94-99.
[12] MIKOLOV, Tomas, et al. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. 2013. p. 3111-3119.