Relation Extraction in Vietnamese Text via Piecewise Convolution Neural Network with Word-Level Attention
Van-Nhat Nguyen1, Ha-Thanh Nguyen1, Dinh-Hieu Vo1, Le-Minh Nguyen2
1VNU University of Engineering and Technology
2Japan Advanced Institute of Science and Technology
Abstract— With the explosion of information technology, the Internet now contains enormous amounts of data, so the role of information extraction systems has become very important. Relation Extraction is a sub-task of Information Extraction that focuses on classifying the relationship between entity pairs mentioned in text. In recent years, although many new methods have been introduced, Relation Extraction still receives attention from researchers, for languages in general and for Vietnamese in particular.
Relation Extraction can be addressed in a variety of ways, including supervised, unsupervised, and semi-supervised learning methods. Recent studies on English have shown that deep learning approaches to Relation Extraction, in both supervised and semi-supervised settings, achieve superior results over traditional non-deep-learning methods. However, research on Vietnamese is scarce, and in our survey of the literature we found no published results applying deep learning to Relation Extraction in Vietnamese. Therefore, this research focuses on studying deep learning methods for solving the Relation Extraction task in Vietnamese. To solve the task, the research proposes and constructs a deep learning model named Piecewise Convolution Neural Network with Word-Level Attention.
Keywords— Relation Extraction, deep learning, convolution
neural network, attention mechanism
I. INTRODUCTION
With the explosion of information technology, the Internet now contains an enormous amount of data. According to internetlivestats's statistics, there are currently over 1,868,000,000 websites; every second, over 2,683,187 emails are sent, along with over 66,423 Google searches, 8,003 Twitter posts, 840 Instagram photos, over 1,366 Tumblr posts, 3,090 Skype calls, over 73,450 YouTube videos, and 55,730 GB of Internet traffic, and these numbers continue to increase.
Data on the Internet contains enormous amounts of information, but most of it exists as unstructured text, with redundant information that makes the data difficult to analyze. Therefore, information extraction systems play a very important role in extracting meaningful information from the data for analysis. Information Extraction (IE) is a field of study in natural language processing (NLP) concerned with extracting structured information (which can easily be interpreted as typed data) from unstructured text. Data extracted by IE systems can be applied in a variety of areas, such as studying users' primary business trends, disease prevention, crime prevention, bioinformatics, stock analysis, etc. Moreover, knowledge bases (KBs) such as Freebase [1] or DBpedia [2] still need a lot of knowledge to improve, so IE systems can be used to expand these knowledge bases.
According to Jiang [3], the information extraction task has existed since the 1970s (DeJong's FRUMP program) but only began to attract attention when DARPA (Defense Advanced Research Projects Agency) initiated and sponsored the Message Understanding Conferences (MUC) in the 1990s. Information extraction is a larger task that involves several sub-tasks such as Named Entity Recognition (NER), Relation Extraction (RE), Event Extraction, etc. These sub-tasks are closely interrelated within the larger task; for example, NER is considered a preprocessing step for the more complex task of Relation Extraction.
Relation Extraction is a sub-task of information extraction focused on recognizing and classifying relationships between entities in sentences or text. Relationships derived from RE systems can be applied to many tasks such as question answering systems, biomedical text mining, and medical support. Thus, in recent years, the Relation Extraction task has received great attention from researchers worldwide, with much research presented at important conferences such as COLING, ACL, and SensEval. Relation Extraction is also a part of international knowledge mining projects such as Automatic Content Extraction (ACE) and Global WordNet.
In recent years, despite the modest number of existing works, the Relation Extraction task continues to receive attention from a large number of researchers around the world. In particular, Relation Extraction is a task with great potential for deep learning. For English, recent work [4, 5, 6] has shown that applying deep learning to this task is more effective than traditional non-deep-learning approaches. Relation Extraction is also part of the information extraction task for Vietnamese text. However, research on Relation Extraction for Vietnamese is still limited, and in our review of the literature we found no published results applying deep learning to Relation Extraction in Vietnamese. The objective of this research is to investigate and propose a deep learning model for the Relation Extraction task in Vietnamese. To reach this goal, the research studies and introduces methods of solving Relation Extraction problems and some models within each method. Drawing on ideas from [4, 6, 7], the study develops a deep learning model called Piecewise Convolution Neural Network with Word-Level Attention. To assess the effectiveness of the model, a Vietnamese dataset was constructed and the proposed model was tested on it.
II. RELATED WORKS
A. Simple CNN models
The simple convolutional neural network model [8] was the earliest work to use a CNN to learn features automatically instead of hand-crafting them. This model first encodes the input sentence using word embeddings and lexical features, followed by a convolutional layer, a single neural network layer, and a softmax output layer that produces the probability distribution over all relation classes.
The convolutional neural network model with a max-pooling layer [9] also uses a CNN to encode sentence-level features, but differs in applying a max-pooling layer to the output of the convolution layer. This paper is also the first work to use positional embeddings. The model additionally uses lexical-level features such as information about the nouns in the sentence and their hypernyms in WordNet.
The convolutional neural network with multi-sized window kernels [10] builds on the results of Liu et al. [8] and Zeng et al. [9]. This model completely removes the lexical word features so as to enrich the representation of the input sentence and let the CNN learn the necessary features by itself. Its architecture is similar to that of Zeng et al., consisting of word and positional embeddings followed by convolution and max-pooling. In addition, it incorporates convolutional kernels of varying window sizes to capture wider ranges of n-gram features.
B. CNN with Attention Mechanism models
The attention-based convolutional neural network [4] uses word embeddings, positional embeddings, and part-of-speech features to construct the word vectors. These are followed by a convolutional layer and a max-pooling layer to obtain the sentence-level convolution feature. The attention weights are learned by the model itself to capture the correlation between the words in the sentence, and an attention-based context feature is calculated as the weighted sum of the word vectors. The convolution feature vector and the two attention-based context feature vectors (one for each entity) are concatenated before being passed through a multi-layer perceptron with softmax activation.
With an F1-score of 85.9% on the SemEval-2010 Task 8 dataset [11], this model proved the effectiveness of applying a deep learning model to the Relation Extraction task.
The Multi-Level Attention CNN [6] is perhaps the best-performing model at present, with an F1-score of 88.0% on the SemEval-2010 Task 8 dataset [11]. Its biggest contribution is the combination of two attention layers: attention on the input layer and attention on the max-pooling layer. We can see that the attention mechanism has a positive effect on these models.
III. PROPOSED MODEL
To solve the Relation Extraction problem, the study proposes and builds a deep learning model called Piecewise Convolution Neural Network with Word-Level Attention. This section details the architecture and the process of building the model. The general architecture of our model is shown in Figure 1.
A. Input representation
Suppose the input sentence is a sequence of words $S = [w_1, w_2, \dots, w_n]$ of length $n$, and the two entities of the sentence are $e_1 = w_p$ and $e_2 = w_t$ $(p, t \in [1, n];\ p \neq t)$. Similar to the models mentioned above, the research uses word embeddings and positional embeddings to encode each word $w_i$ into a vector $x_i$.
First, the model uses word embeddings to capture the semantics of each word. Given a word embedding matrix of size $|V| \times d_w$, where $V$ is the vocabulary and $d_w$ is the embedding dimension, each word $w_i$ is looked up in the embedding matrix to retrieve its vector $w_i^d \in \mathbb{R}^{d_w}$.
Next, the model uses positional embeddings to capture the distance from each word in the sentence to the two entities. First, the relative distance of each word to each of the two entities is calculated. Over the set of input sentences, the two sets of relative positions are encoded by two positional embedding matrices. The two relative distances of each word are then looked up in the respective embedding matrices to retrieve two positional embedding vectors $w_{i,1}^p$ and $w_{i,2}^p$ of the same size $d_p$.
Finally, the vector representation of the word is the concatenation of the three vectors and has size $d = d_w + 2 d_p$:
$$x_i = w_i^d \oplus w_{i,1}^p \oplus w_{i,2}^p,$$
where $\oplus$ is the vector concatenation operator.
Figure 1. General architecture of the model.
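As a concrete illustration, the following is a minimal PyTorch sketch of this input representation; the vocabulary size, dimensions, and maximum relative distance are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class InputRepresentation(nn.Module):
    """Encodes each word as word embedding + two positional embeddings."""
    def __init__(self, vocab_size=10000, d_w=300, d_p=50, max_dist=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_w)        # |V| x d_w lookup table
        # Relative distances are assumed shifted into [0, 2*max_dist] before lookup.
        self.pos_emb1 = nn.Embedding(2 * max_dist + 1, d_p)  # distance to entity e1
        self.pos_emb2 = nn.Embedding(2 * max_dist + 1, d_p)  # distance to entity e2

    def forward(self, words, dist1, dist2):
        # words, dist1, dist2: LongTensors of shape (batch, n)
        # returns x of shape (batch, n, d) with d = d_w + 2 * d_p
        return torch.cat([self.word_emb(words),
                          self.pos_emb1(dist1),
                          self.pos_emb2(dist2)], dim=-1)
```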
B. Word-level attention mechanism
According to the studies of Huang et al. [4] and Wang et al. [6], the words in a sentence differ in how important they are for predicting the relationship between the pair of entities. For example, in "[Hoàng Văn Trà]e1 sinh ở xã Nghi Hưng, huyện Nghi Lộc, tỉnh [Nghệ An]e2" ("[Hoang Van Tra]e1 was born in Nghi Hung commune, Nghi Loc district, [Nghe An]e2 province"), the relationship between the two entities is "Hometown", and the word "sinh" ("born") is the most important signal for predicting it. We therefore need to train the model to focus its attention on words that carry such important information (shown in Figure 2). The research proposes a word-level attention mechanism in which every word representation vector is multiplied by an attention weight learned by the model.
First, we concatenate the word embedding vector of each word with the word embedding vectors of the two entities; the resulting vector is called $u_i$:
$$u_i = w_i^d \oplus w_p^d \oplus w_t^d, \quad (p, t \in [1, n],\ p \neq t)$$
Next, $u_i$ is passed through a fully-connected layer to compute the correlation $c_i$ between each word and the two entities:
$$c_i = W_u u_i + b_u$$
Finally, the attention weight $\alpha_i$ for each word is calculated by applying the softmax function over the correlations of the sentence:
$$\alpha_i = \frac{\exp(c_i)}{\sum_j \exp(c_j)}$$
After obtaining the attention weights, the new representation vector $\tilde{x}_i$ of each word is the old representation vector multiplied by its attention weight:
$$\tilde{x}_i = \alpha_i x_i$$
Figure 2. Word-level attention mechanism.
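A minimal PyTorch sketch of this word-level attention follows, under the assumption that the fully-connected layer maps each $u_i$ to a single correlation score:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordLevelAttention(nn.Module):
    """Scores each word against the two entity embeddings and rescales x_i."""
    def __init__(self, d_w=300):  # d_w is an assumed embedding size
        super().__init__()
        self.fc = nn.Linear(3 * d_w, 1)  # implements c_i = W_u u_i + b_u

    def forward(self, x, word_emb, e1_emb, e2_emb):
        # x:        (batch, n, d)   full word representations
        # word_emb: (batch, n, d_w) word embeddings only
        # e1_emb, e2_emb: (batch, d_w) word embeddings of the two entities
        n = word_emb.size(1)
        u = torch.cat([word_emb,
                       e1_emb.unsqueeze(1).expand(-1, n, -1),
                       e2_emb.unsqueeze(1).expand(-1, n, -1)], dim=-1)
        c = self.fc(u).squeeze(-1)      # correlations, shape (batch, n)
        alpha = F.softmax(c, dim=-1)    # attention weights over the sentence
        return x * alpha.unsqueeze(-1)  # new representation: alpha_i * x_i
```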
C. The convolutional layer
Similar to other deep learning models for Relation Extraction, this model uses a convolutional layer with a window of size 3 to capture tri-gram level features. Assuming the convolution layer has $m$ filters, the output value of the $j$-th filter $(j \in [1, m])$ for each word $\tilde{x}_i$ $(i \in [1, n])$ is calculated by convolving the filter matrix of the $j$-th filter with the representation matrix of the three-word phrase $[\tilde{x}_{i-1}, \tilde{x}_i, \tilde{x}_{i+1}]$, followed by a tanh activation function:
$$c_{ij} = \tanh\left(W_{c_j} \circledast [\tilde{x}_{i-1}, \tilde{x}_i, \tilde{x}_{i+1}] + b_{c_j}\right),$$
where $W_{c_j} \in \mathbb{R}^{3 \times d}$ is the filter matrix of the $j$-th filter and $\circledast$ is the convolution operator. The representation of the sentence for filter $j$ is then the vector
$$S_j = [c_{1j}, c_{2j}, \dots, c_{nj}].$$
So with $m$ filters, the representation matrix of the sentence is
$$S = [S_1, S_2, \dots, S_m].$$
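The tri-gram convolution can be sketched with a stride-1 one-dimensional convolution; the filter count m below is an illustrative assumption, since the paper's value is not recoverable from the extracted parameter table:

```python
import torch
import torch.nn as nn

class TriGramConv(nn.Module):
    """Window-3 convolution: each of the m filters scores every 3-word window."""
    def __init__(self, d=400, m=230):  # d = d_w + 2*d_p; m is assumed
        super().__init__()
        self.conv = nn.Conv1d(d, m, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (batch, n, d); Conv1d expects channels first, i.e. (batch, d, n)
        return torch.tanh(self.conv(x.transpose(1, 2)))  # S: (batch, m, n)
```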
D. Piecewise max-pooling layer
The purpose of the max-pooling layer is to obtain the most prominent value from the output of each filter of the convolution layer. The research uses the piecewise max-pooling layer of Zeng et al. [7]: the two entities in the sentence separate the representation $S_j$ of the sentence into three parts,
$$S_j = [S_{j1}, S_{j2}, S_{j3}].$$
The model takes the maximum value of each part, $p_j = [p_{j1}, p_{j2}, p_{j3}]$ with $p_{jk} = \max(S_{jk})$. Finally, the representation of the sentence is the concatenation of the vectors $p_j$:
$$S^* = p_1 \oplus p_2 \oplus \dots \oplus p_m$$
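A sketch of piecewise max-pooling for a single sentence; exactly where the entity positions split the segments is an assumption, and edge cases such as an entity at the sentence boundary are ignored:

```python
import torch

def piecewise_max_pool(s, p, t):
    # s: (m, n) convolution outputs for one sentence; p < t are entity positions.
    # Split each filter's outputs into three segments at the two entities,
    # take the max of each segment, and concatenate into S* of length 3 * m.
    segments = [s[:, :p + 1], s[:, p + 1:t + 1], s[:, t + 1:]]
    pooled = [seg.max(dim=1).values for seg in segments if seg.size(1) > 0]
    return torch.cat(pooled)
```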
E. Lexical features
The studies of Huang et al. [4] and Zeng et al. [9] show that information from the two entities and the words around them is very important. Based on these studies, this model also uses a lexical feature vector built from the entities and the two words around them. Suppose $w_p^d, w_t^d$ $(p, t \in [1, n];\ p \neq t)$ are the word embeddings of the two entities; the lexical feature vector is computed as
$$L = w_{p-1}^d \oplus w_p^d \oplus w_{p+1}^d \oplus w_{t-1}^d \oplus w_t^d \oplus w_{t+1}^d$$
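In code, the lexical feature vector reduces to an index-and-concatenate operation; this sketch omits the padding that entities at the sentence boundaries would require:

```python
import torch

def lexical_features(word_emb, p, t):
    # word_emb: (n, d_w) word embeddings of the sentence; p, t entity positions.
    # Concatenates the embeddings of each entity and its immediate neighbours.
    return torch.cat([word_emb[i] for i in (p - 1, p, p + 1, t - 1, t, t + 1)])
```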
F. Output
After obtaining the sentence feature vector $S^*$ and the lexical feature vector $L$, we concatenate them into a single vector $v_o = S^* \oplus L$. Next, this vector is passed through a fully-connected layer with a softmax activation function to obtain the output vector
$$o = \mathrm{softmax}(W_o v_o + b_o) = [o_1, o_2, \dots, o_l],$$
where $l$ is the number of relation classes and $o_r$ $(r \in [1, l])$ is the probability that the sentence expresses relation $r$. The relation predicted by the model for sentence $S$ is the one with the highest probability:
$$r_S = \operatorname{argmax}(o)$$
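A minimal sketch of the output layer; the input size follows from the m pooled filters and the six concatenated embeddings of the lexical feature, and all sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputLayer(nn.Module):
    def __init__(self, m=230, d_w=300, num_classes=4):
        super().__init__()
        self.fc = nn.Linear(3 * m + 6 * d_w, num_classes)  # W_o, b_o

    def forward(self, s_star, lex):
        # s_star: (batch, 3*m) pooled sentence feature; lex: (batch, 6*d_w)
        o = F.softmax(self.fc(torch.cat([s_star, lex], dim=-1)), dim=-1)
        return o, o.argmax(dim=-1)  # class probabilities and predicted relation
```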
IV. EXPERIMENTS
A. Dataset and Evaluation Metrics
The dataset of the study was collected automatically from Vietnamese Wikipedia pages about people. Based on the page parameters (living_place, occupation), we extract "human-place" and "human-occupation" entity pairs. Next, from the pages of the human entities, we analyze all the sentences containing the specified entity pairs and automatically assign them a label. For example, human-occupation pairs are assigned the Occupation label, and sentences containing keywords such as "born" and "home" are assigned the Hometown label. Finally, the data is reviewed and re-labeled manually to avoid mistakes; a sketch of the automatic labeling step is given below.
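This is a hedged sketch of the automatic labeling step; the keyword list and the fallback rule are illustrative assumptions, not the authors' exact heuristic:

```python
# Hypothetical Vietnamese keyword list ("born", "home(town)").
HOMETOWN_KEYWORDS = ("sinh", "quê")

def auto_label(sentence, pair_type):
    # pair_type comes from the Wikipedia parameter the pair was mined with.
    if pair_type == "human-occupation":
        return "Occupation"
    if pair_type == "human-place":
        if any(kw in sentence.lower() for kw in HOMETOWN_KEYWORDS):
            return "Hometown"
        return "Workplace"  # assumed fallback for place pairs
    return "Other"
```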
This dataset includes 1,716 sentences covering 3 relations and one Other relation, described in detail in Table I:
Table I: Number of sentences in the dataset

Relation      Number of sentences
Hometown      358
Workplace     632
Occupation    435
Other         291
Total         1,716
We evaluate the model using the macro-averaged F1 score over the 3 relations (excluding Other). The study also uses k-fold cross-validation with k = 5.
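This protocol can be sketched with scikit-learn; the label encoding (0-2 for the three target relations), the array inputs, and the training helper are assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, train_model, n_splits=5):
    # Macro-F1 over the three target relations (labels 0, 1, 2, excluding
    # Other), averaged over stratified 5-fold cross-validation.
    scores = []
    for tr, te in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        model = train_model(X[tr], y[tr])  # hypothetical training helper
        preds = model.predict(X[te])
        scores.append(f1_score(y[te], preds, labels=[0, 1, 2], average="macro"))
    return float(np.mean(scores))
```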
B. Experimental setup
For word embeddings, the study uses a Word2Vec model [12] pre-trained on a Vietnamese Wikipedia corpus. For training, the study uses the Adam optimizer with a learning rate of 0.01 and early stopping with a patience of 5. The remaining parameters are described in Table II.
Table II: The list of the model's parameters

Parameter       Value
Learning rate   0.01
Patience        5
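The following is a self-contained sketch of the stated training setup (Adam, learning rate 0.01, early stopping with patience 5); the model and data below are dummy placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 4)  # placeholder for the PCNN model (4 relation classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
X, y = torch.randn(64, 10), torch.randint(0, 4, (64,))  # dummy data

best_val, wait, patience = float("inf"), 0, 5
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    val_loss = loss.item()  # a real run would compute this on held-out data
    if val_loss < best_val:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:  # stop after 5 epochs without improvement
            break
```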
C. Experiment results
With the dataset and parameters described above, the proposed model achieves a macro-averaged F1-score of 94.89%. In addition, we evaluated 5 different designs of the neural network to assess the contribution of three techniques: lexical features, piecewise max-pooling, and word-level attention. The results are shown in Table III.
Table III: Experimental results of the designs

Design    Lexical Feature    Piecewise Max-pooling    Word-level Attention    F1 Score
Based on Table III, we can see that the combination of all three techniques yields the best results on the dataset.
D. Error analysis
Analyzing the errors in the experiment, we found that our model shares the general weakness of machine learning algorithms: data dependence. Most of the errors in our experiment are related to the label Other. The first reason is that the number of examples with this label in our dataset is small, accounting for only 16.96% of the sentences. The second reason is the ambiguity of the labels: sentences labeled Hometown, Workplace, or Occupation can be wrongly classified as Other and vice versa. The third reason is that the label Other has no specific features. For example, for the sentence "Từ 1967 đến 1968, [Thanh Tuyền]e1 hát song ca cùng [ca sĩ]e2 Chế Linh…" ("From 1967 to 1968, [Thanh Tuyen]e1 sang duets with [singer]e2 Che Linh…"), our model predicted the label Occupation, but the correct label is Other. In this case, our model had not learned the characteristics of sentences with the label Other, so it relied on the cues "Thanh Tuyền", "Chế Linh", and "singer" to classify the sentence into the label Occupation.
As we can see, there are still some problems in our model. These issues will be addressed in subsequent studies.
V. CONCLUSION AND FUTURE WORK
In this study, we have researched a number of deep learning models for solving Relation Extraction problems in English and proposed a Relation Extraction model for Vietnamese texts. In our experiment, we constructed a Vietnamese dataset collected from Vietnamese Wikipedia and obtained an F1 score of 94.89% on this dataset.
However, the experimental data and the number of relations in this study are limited and focus only on a specific domain related to people. In the future, we will continue to improve the quality of the data. The model obtained in this study can also be used in the labeling process for additional data. We will also apply this research result to specific domains such as legal engineering.
ACKNOWLEDGEMENT
This work has been supported by Vietnam National University, Hanoi (VNU), under Project No. QG.16.91.
REFERENCES
[1] BOLLACKER, Kurt, et al. Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008. p. 1247-1250.
[2] AUER, Sören, et al. DBpedia: a nucleus for a web of open data. In: The Semantic Web. Springer, Berlin, Heidelberg, 2007. p. 722-735.
[3] JIANG, Jing. Information extraction from text. In: Mining Text Data. Springer, Boston, MA, 2012. p. 11-41.
[4] HUANG, Xuanjing, et al. Attention-based convolutional neural network for semantic relation extraction. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 2016. p. 2526-2536.
[5] HUANG, Yi Yao; WANG, William Yang. Deep residual learning for weakly-supervised relation extraction. arXiv preprint arXiv:1707.08866, 2017.
[6] WANG, Linlin, et al. Relation classification via multi-level attention CNNs. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. p. 1298-1307.
[7] ZENG, Daojian, et al. Distant supervision for relation extraction via piecewise convolutional neural networks. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015. p. 1753-1762.
[8] LIU, ChunYang, et al. Convolution neural network for relation extraction. In: International Conference on Advanced Data Mining and Applications. Springer, Berlin, Heidelberg, 2013. p. 231-242.
[9] ZENG, Daojian, et al. Relation classification via convolutional deep neural network. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 2014. p. 2335-2344.
[10] NGUYEN, Thien Huu; GRISHMAN, Ralph. Relation extraction: perspective from convolutional neural networks. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing. 2015. p. 39-48.
[11] HENDRICKX, Iris, et al. SemEval-2010 Task 8: multi-way classification of semantic relations between pairs of nominals. In: Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics, 2009. p. 94-99.
[12] MIKOLOV, Tomas, et al. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. 2013. p. 3111-3119.