UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY, HANOI
MASTER'S THESIS IN INFORMATION TECHNOLOGY
Supervisor: Assoc. Prof. Dr. Le Anh Cuong
Ha Noi - 2015
Originality statement
"I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person, except where due references or acknowledgements are made."
Signature:
Supervisor's approval
"I hereby approve that the thesis in its current form is ready for committee examination as a requirement for the Master of Computer Science degree at the University of Engineering and Technology."
Signature:
Abstract
Named Entity Recognition (NER) aims to extract and classify words in documents into pre-defined entity classes. NER has been studied extensively for popular languages such as English, Chinese, etc. However, NER for Vietnamese is still challenging due to the characteristics of the language and the lack of Vietnamese corpora.

In this thesis, we study approaches to NER including handcrafted rules, machine learning, and hybrid methods. We present challenges in NER for Vietnamese, such as the lack of a standard evaluation corpus and of standard methods for constructing data sets. Especially, we focus on labeling entities in Vietnamese, since most studies have not presented the details of handcrafting entities in Vietnamese. We apply supervised machine learning methods for Vietnamese NER based on Conditional Random Fields and Support Vector Machines, with changes in feature selection suitable for Vietnamese. The evaluation shows that these methods outperform other traditional methods in NER, such as Hidden Markov Models and rule-based methods.
Acknowledgement
First, I would like to thank my supervisor, Assoc. Prof. Dr. Le Anh Cuong, for his advice and support. This thesis would not have been possible without him and without the freedom and encouragement he has given me over the last two years I spent at the Faculty of Technology, University of Engineering and Technology, Vietnam National University (VNU), Ha Noi.
I have been working with amazing friends in the K19CS class. I dedicate my gratitude to each of them: Tai Pham Dinh, Tuan Dinh Vu, and Nam Thanh Pham. I would especially like to thank the teachers at the University of Engineering and Technology, VNU, for the collaboration, great ideas, and feedback during my dissertation.

Finally, I thank my parents and my brother, Hoang Le, for their encouragement, advice, and support. I especially thank my wife, Linh Thi Nguyen, and my lovely daughter, Ngoc Khanh Le, for their endless love and sacrifice over the last two years. They gave me strength and encouragement to do this thesis.
Ha Noi, September 2015
Table of Contents

Chapter 1 Introduction
1.1 Information extraction
1.2 Named entity recognition
1.3 Evaluation for NER
1.4 Our work
Chapter 2 Approaches to Named Entity Recognition
2.1 Rule-based methods
2.2 Machine learning methods
3.2 Feature selection for NER
3.2.1 Feature selection methods
3.2.2 Mask method
4.2 CRF method
4.3 Experimental results
4.4 An example of experimental results and error analysis
Chapter 5 Conclusion
References
List of Figures
Figure 1.1: Example of automatically extracted information from a news article on a terrorist attack. Source: [18]
Figure 1.2: A named entity recognition system
Figure 2.1: Directed graph representing an HMM
Figure 2.2: How to compute transition probabilities
Figure 2.3: A two-dimensional SVM, with the hyperplanes representing the margin drawn as dashed lines
Figure 2.4: The mapping Φ of input data from the input space into an infinite-dimensional Hilbert space in a non-linear SVM classifier. Source: [17]
Figure 3.1: A taxonomy of feature selection methods. Source: [21]
Figure 4.1: Generating training data stages
List of Tables

Table 3.1: Document and corpus features
Table 3.2: Orthographic features for Vietnamese
Table 3.3: Lexical and POS features for Vietnamese
Table 4.1: Statistics of training data at the entity level
Table 4.2: An example of a sentence in training data
Table 4.3: Results on testing data of the SVM learner
Table 4.4: Results on testing data of NER using the CRF method
List of Abbreviations
Abbreviation   Stands for
IE             Information Extraction
MUC            Message Understanding Conferences
CoNLL          Conferences on Natural Language Learning
MET            Multilingual Entity Tasks
SIGNLL         Special Interest Group on Natural Language Learning
Chapter 1 Introduction

1.1 Information extraction

Information Extraction (IE) is the task of automatically extracting structured information from unstructured text. For instance, we are interested in extracting information on violent events from online news, which involves the identification of the main actors of the event, its location, and the number of people affected [18]. Figure 1.1 shows an example of a text snippet from a news article about a terrorist attack and the structured information derived from that snippet. The process of extracting such structured information involves the identification of certain small-scale structures, such as noun phrases denoting a person or a group of persons, geographical references, and numerical expressions, as well as finding semantic relations between them. However, in this scenario some domain-specific knowledge is required (e.g., understanding the fact that terrorist attacks might result in people being killed or injured) in order to correctly aggregate the partially extracted information into a structured form.
"Three bombs have exploded in north-eastern Nigeria, killing 25 people and wounding 12 in an attack carried out by an Islamic sect. Authorities said the bombs exploded on Sunday afternoon in the city of Maiduguri."
Figure 1.1: Example of automatically extracted information from a news article on a terrorist attack. Source: [18]
Starting from 1987, a series of Message Understanding Conferences (MUC) has been held, focusing on the following domains:
- MUC-1 (1987), MUC-2 (1989): Naval operations messages
- MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries
- MUC-5 (1993): Joint ventures and microelectronics domain
- MUC-6 (1995): News articles on management changes
- MUC-7 (1998): Satellite launch reports
The significance of IE is related to the growing amount of information available in unstructured form. Tim Berners-Lee, the inventor of the World Wide Web (WWW), refers to the existing Internet as the web of documents and advocates that more content be made available as a web of data. Until this transpires, the web largely consists of unstructured documents without semantic metadata. Knowledge contained in these documents can be made more accessible for machine processing by transforming the information into relational form, or by marking it up with XML tags. For instance, an intelligent agent monitoring a news data feed requires IE to transform unstructured data (i.e., text) into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the extracted information.
IE on text aims at creating a structured view, i.e., a representation of the information that is machine-understandable. According to [18], the classical IE tasks include:
Named Entity Recognition addresses the problem of the identification (detection) and classification of predefined types of named entities, such as organizations (e.g., 'World Health Organisation'), persons (e.g., 'Muammar Kaddafi'), place names (e.g., 'the Baltic Sea'), temporal expressions (e.g., '1 September 2011'), numerical and currency expressions (e.g., '20 Million Euros'), etc.
Co-reference Resolution requires the identification of multiple (co-referring) mentions of the same entity in the text. For example, "International Business Machines" and "IBM" refer to the same real-world entity. If we take the two sentences "Mr. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect that "he" refers to the previously detected person "Mr. Smith".
Relation Extraction focuses on detecting and classifying predefined relationships between entities identified in text. For example:
- PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
- PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
Event Extraction refers to the identification of events in free text and deriving
detailed and structured information about them. Ideally, it should identify who did what to whom, when, where, through what methods (instruments), and why. Normally, event extraction involves the extraction of several entities and the relationships between them.
For instance, extraction of information on terrorist attacks from the text fragment
‘Masked gunmen armed with assault rifles and grenades attacked a wedding party in
mainly Kurdish southeast Turkey, killing at least 44 people.’ would reveal the
identification of perpetrators (masked gunmen), victims (people), number of
killed/injured (at least 44), weapons used (rifles and grenades), and location (southeast
Turkey).
1.2 Named entity recognition
In IE, named entity recognition (NER) is slightly more complex. The named entities (e.g., the location, actors, and targets of a terrorist event) need to be recognized as such. This NER task (also known as 'proper name classification') involves the identification and classification of named entities: people, places, organizations, products, companies, and even dates, times, or monetary amounts. For example, Figure 1.2 demonstrates an English NER system which identifies and classifies entities in text documents. It identifies four entity types: person, location, organization, and misc.
[Figure 1.2: screenshot of a named entity recognition demo ("Named Entity Recognition Demo Results") applied to a news report on the Apollo 11 moon landing, with recognized entities marked inline, e.g., [LOC Houston], [PER Mr Armstrong], [ORG Air Force], and [MISC Apollo 11].]
Figure 1.2: A named entity recognition system. Source*
* http://www.iti.illinois.edu/tech-transfer/technologies/natural-language-processing-nlp
Previous NER studies mainly focus on popular languages such as English, French, Spanish, and Japanese. Methods developed in these studies are based on supervised learning. Tri et al. [22] performed NER using Support Vector Machines (SVM) and obtained an overall F-score of 87.75%. The VN-KIM IE system* builds an ontology and then applies it in JAPE grammars to define target named entities on the web. Nguyen Ba Dat et al. [7] employed a rule-based approach using the JAPE grammar plug-in of the GATE framework for NER. For example, the text "Chủ tịch nước Trương Tấn Sang sang thăm chính thức Nhật Bản vào ngày 20/7/2015" will be annotated as follows:

Chủ tịch nước <PER>Trương Tấn Sang</PER> sang thăm chính thức <LOC>Nhật Bản</LOC> vào ngày 20/7/2015 / President Truong Tan Sang paid an official visit to Japan on
20/07/2015.
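Before training, annotated text in this inline form is typically converted into token-level labels (the B_/I_-prefixed tags used later in this thesis). The following is a minimal illustrative sketch, not the thesis's actual preprocessing tool; splitting on whitespace and ignoring Vietnamese word segmentation are simplifying assumptions.

```python
import re

def inline_to_labels(annotated):
    """Convert inline <PER>...</PER>-style annotations into
    (token, label) pairs using B_/I_ prefixes. A simplifying sketch:
    it splits on whitespace and ignores Vietnamese word segmentation."""
    pairs = []
    for tag, inside, outside in re.findall(r'<(\w+)>(.*?)</\1>|([^<]+)',
                                           annotated):
        if tag:  # tokens inside an entity span
            tokens = inside.split()
            pairs.append((tokens[0], 'B_' + tag))
            pairs.extend((tok, 'I_' + tag) for tok in tokens[1:])
        else:    # tokens outside any entity span
            pairs.extend((tok, 'O') for tok in outside.split())
    return pairs

print(inline_to_labels('Chủ tịch nước <PER>Trương Tấn Sang</PER> sang thăm '
                       'chính thức <LOC>Nhật Bản</LOC> vào ngày 20/7/2015'))
# [('Chủ', 'O'), ..., ('Trương', 'B_PER'), ('Tấn', 'I_PER'),
#  ('Sang', 'I_PER'), ..., ('Nhật', 'B_LOC'), ('Bản', 'I_LOC'), ...]
```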
1.3 Evaluation for NER
To evaluate an NER algorithm, many metrics developed in data mining and machine learning can be used. These metrics measure the frequency with which an algorithm makes correct or incorrect identifications and classifications of named entities. The most common measures are:

Precision: measures the ratio of relevant items selected to the total number of items selected, i.e., the fraction of the entities proposed by the system that are correct.
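For concreteness, the sketch below computes precision, together with the closely related recall and F-score used when reporting results later, over sets of entity mentions. Representing each mention as a (start, end, type) triple with exact-match scoring is an assumption of this example.

```python
def ner_scores(gold, predicted):
    """Precision, recall and F1 over entity mentions. Each mention is a
    (start, end, type) triple; only exact matches count as correct."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Two gold mentions; the system proposed two, one of which is correct:
print(ner_scores(gold=[(0, 3, 'PER'), (7, 9, 'LOC')],
                 predicted=[(0, 3, 'PER'), (4, 6, 'ORG')]))  # (0.5, 0.5, 0.5)
```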
* http://www.cse.hcmut.edu.vn/~tru/VN-KIM/
evaluation. This data set consists of 10,000 Vietnamese sentences from the VLSP* project. First, the raw data is processed to obtain word-segmented data. Then the word-segmented data is manually annotated to obtain NER data. In another stage, the word-segmented data is automatically tagged by a part-of-speech (POS) tagging tool. Finally, the data from these two stages are combined into the final data set used for our experiments.
We study feature extraction methods to improve system performance and learning time. For the Vietnamese NER problem, we focus on the characteristics of the Vietnamese language and the method to select features. We compare the performance of our method with well-known methods such as CRF and SVM.

The rest of this thesis is organized as follows. Chapter 2 introduces approaches to NER. Chapter 3 presents how to extract features from data for training in supervised machine learning, the characteristics of the Vietnamese language, and features for Vietnamese NER. Chapter 4 shows our experiments on NER in Vietnamese documents. Finally, in Chapter 5, we summarize our thesis and discuss future work.
* http://vlsp.vietlp.org:8080
Chapter 2 Approaches to Named Entity Recognition

This chapter reviews popular methods for the detection and classification of named entities. The chapter is organized as follows. In Section 2.1, we present rule-based methods. In Section 2.2, machine learning methods and their variations are described in detail. Section 2.3 shows hybrid methods for recognizing named entities. The methods presented in this chapter are the basis of our approach for Vietnamese NER.

2.1 Rule-based methods
These methods rely on the intuition of the human designers who assemble a large number of rules capturing intuitive notions. For example, in many languages person names are usually preceded by some kind of title. Thus, the name in the text "Ông Ôn Gia Bảo đã đến thăm Việt_Nam vào năm 2004 / Mr. Wen Jiabao visited Vietnam in 2004" can easily be discovered by a rule like [21]:

title + capitalized word => title + PERSON

In another example, the left or right context of an expression is used to classify a named entity. For instance, the location in the text "Tỉnh Quảng Ninh đang phải hứng chịu trận mưa lịch sử / Quảng Ninh province is suffering a historic rainfall" can be recognized by the rule:

location marker + capitalized word => location marker + LOCATION
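Such rules map directly onto regular expressions. The sketch below is only an illustration: the tiny title and location-marker lexicons are hypothetical, the uppercase test is approximate, and a real rule-based system such as [6] relies on much larger dictionaries and richer patterns.

```python
import re

# Hypothetical, tiny lexicons; a real system would use far larger lists.
# Input is assumed word-segmented (multi-syllable units joined by '_').
TITLES = r'(?:Ông|Bà|Mr|Mrs|Dr)'
LOC_MARKERS = r'(?:Tỉnh|Thành_phố|Huyện)'

# Rule 1: title + capitalized word => title + PERSON
person_rule = re.compile(TITLES + r'\s+([A-ZÀ-Ỹ]\w*)')
# Rule 2: location marker + capitalized word => location marker + LOCATION
location_rule = re.compile(LOC_MARKERS + r'\s+([A-ZÀ-Ỹ]\w*)')

text = ('Ông Ôn_Gia_Bảo đã đến thăm Việt_Nam vào năm 2004. '
        'Tỉnh Quảng_Ninh đang phải hứng chịu trận mưa lịch sử.')
print([('PER', m.group(1)) for m in person_rule.finditer(text)])    # Ôn_Gia_Bảo
print([('LOC', m.group(1)) for m in location_rule.finditer(text)])  # Quảng_Ninh
```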
In many studies, rule-based methods are combined with other techniques, for example to filter news by fine-grained knowledge-based categorization [2].

In [6], a rule-based system for Vietnamese NER is used in which rules are incrementally created while annotating a named entity corpus. It is the first publicly available open-sourced project for building an annotated named entity corpus. The NER system built in this project achieves high performance with an F-measure of 83%. The VN-KIM IE system uses JAPE grammars to recognize entities of various types (Organization, Location, Person, Date, Time, Money, and Percent) with an F-measure of 81%.

In comparison with machine learning approaches, the advantage of the rule-based approach is that it does not need a large amount of annotated data. That means the system can be activated and produce results immediately after the rules are constructed.
However, a major disadvantage of the rule-based method is that a certain set of rules may work well for a certain domain but may not be suitable for other domains. That means the entire knowledge base of a rule-based system has to be rewritten to fit the requirements of a new domain. Furthermore, constructing any rule base of sufficient size is very expensive in terms of time and money.
2.2 Machine learning methods
Machine learning methods are used in many NER systems for different languages. These methods include the Hidden Markov Model (HMM), the Maximum Entropy Model (MEM), SVM, and CRF. This section describes these methods in detail.
2.2.1 Hidden Markov Model
HMMs were introduced in the 1960s and are now applied in many research fields such as speech recognition, bioinformatics, and natural language processing. HMMs are probabilistic finite-state models with parameters for state-transition probabilities and state-specific observation probabilities; they allow estimating the probabilities of unobserved events and are very common in natural language processing.
Definition of HMM
Given a token (observation) sequence $O = (o_1, o_2, \dots, o_T)$, the goal is to find the stochastically optimal tag sequence $S = (s_1, s_2, \dots, s_T)$ of hidden states.
Figure 2.1: Directed graph representing an HMM
In Figure 2.1, $s_t$ is the state at time $t$ in the state chain $S$.

The underlying states follow a Markov chain where, given the current state, the future state is independent of the past:

$$P(s_{t+1} \mid s_t, s_{t-1}, \dots, s_0) = P(s_{t+1} \mid s_t)$$

and the transition probabilities are

$$a_{kl} = P(s_{t+1} = l \mid s_t = k)$$
Given the state $s_t$, the observation $o_t$ is independent of the other observations and states. For a fixed state, the observation $o_t$ is generated according to a fixed probability law: given state $k$, the probability law of $o_t$ is specified by $b_k(o)$.
The parameters of an HMM are:
- Transition probabilities: $A = \{a_{kl}\}$, $k, l = 1, \dots, M$. Each $a_{kl}$ represents the probability of transitioning from state $s_k$ to $s_l$.
- Initial probabilities: $\pi_k$, $k = 1, \dots, M$; $\pi_k$ is the probability that $s_k$ is a start state.
- Emission probabilities: a set $B$ of functions of the form $b_k(o)$, giving, for each state $k$, the probability of observation $o_t$ being emitted by $s_k$.
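Concretely, an HMM is the parameter triple $\lambda = (A, B, \pi)$. The following minimal sketch instantiates these parameters for a toy NER tag set; the state names, vocabulary, and all probabilities are made up purely for illustration.

```python
import numpy as np

# A toy HMM lambda = (A, B, pi) for NER tagging. States are entity labels,
# observations are indices into a toy vocabulary. All numbers are made up.
states = ['B_LOC', 'I_LOC', 'O']          # M = 3 hidden states
vocab = ['Hà_Nội', 'Nội_Bài', 'thăm']     # observation symbols

pi = np.array([0.3, 0.0, 0.7])            # pi_k = P(s_1 = k)
A = np.array([[0.1, 0.6, 0.3],            # a_kl = P(s_{t+1} = l | s_t = k)
              [0.1, 0.4, 0.5],
              [0.3, 0.0, 0.7]])
B = np.array([[0.7, 0.2, 0.1],            # b_k(o) = P(o | state k)
              [0.3, 0.6, 0.1],
              [0.1, 0.1, 0.8]])

# pi and each row of A and B are probability distributions:
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```

Note that $\pi$ for I_LOC is zero in this toy model: a label sequence cannot start inside an entity.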
The learning problem is to find the model $\lambda$ that maximizes the probability of the observation sequence:

$$\lambda^* = \arg\max_{\lambda} P(O \mid \lambda)$$

Unfortunately, there is no known way to analytically find such a global maximum, i.e., a model $\lambda^*$ satisfying the equation above. But it is possible to find a local maximum: given an initial model $\lambda$, we can always find a model $\lambda'$ such that

$$P(O \mid \lambda') \geq P(O \mid \lambda)$$
Parameter Re-estimation

We use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm. Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability with which the given observation sequence is generated by the new parameters. Three sets of parameters need to be re-estimated:
- Initial state distribution: $\pi_k$
- Transition probabilities: $a_{kl}$
- Emission probabilities: $b_k(o_t)$

Re-estimation of transition probabilities: what is the probability of being in state $s_k$ at time $t$ and going to state $s_l$, given the current model and parameters?

$$\xi_t(k, l) = P(s_t = k,\, s_{t+1} = l \mid O, \lambda)$$
Figure 2.2: How to compute transition probabilities
The intuition behind the re-estimation equation for transition probabilities is

$$\hat{a}_{kl} = \frac{\text{expected number of transitions from state } s_k \text{ to state } s_l}{\text{expected number of transitions from state } s_k}$$

which, in terms of $\xi$, is $\hat{a}_{kl} = \sum_t \xi_t(k, l) \big/ \sum_t \sum_{l'} \xi_t(k, l')$.

Estimation of initial state probabilities: the initial state distribution $\pi_k$ is the probability that $s_k$ is a start state. Re-estimation is easy: $\pi'_k$ is the expected number of times in state $s_k$ at time 1. Formally, $\pi'_k = \gamma_1(k)$, where $\gamma_t(k)$ is the probability of being in state $s_k$ at time $t$.

Estimation of emission probabilities: emission probabilities are re-estimated as

$$\hat{b}_k(v_m) = \frac{\sum_t \gamma_t(k)\, \delta(o_t, v_m)}{\sum_t \gamma_t(k)}$$

where $\delta(o_t, v_m) = 1$ if $o_t = v_m$, and 0 otherwise.
Note that $\delta$ here is the Kronecker delta function and is not related to the $\delta$ in the discussion of the Viterbi algorithm [19].
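Written in code, the transition update above is compact. The sketch below is illustrative only: it assumes the forward and backward probability matrices (alpha and beta, with alpha[t, k] corresponding to $\alpha_t(k)$) have already been computed by the forward-backward algorithm, and it re-estimates only the transition matrix.

```python
import numpy as np

def reestimate_transitions(A, B, alpha, beta, obs):
    """One Baum-Welch update of the transition matrix A.

    alpha[t, k] and beta[t, k] are assumed precomputed forward and
    backward probabilities for a model (A, B, pi); obs is the sequence
    of observation indices. Returns the re-estimated matrix A-hat."""
    T, M = alpha.shape
    xi = np.zeros((T - 1, M, M))
    for t in range(T - 1):
        # xi_t(k, l) = alpha_t(k) * a_kl * b_l(o_{t+1}) * beta_{t+1}(l),
        # normalized so that it is P(s_t = k, s_{t+1} = l | O, lambda).
        xi[t] = (alpha[t][:, None] * A
                 * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :])
        xi[t] /= xi[t].sum()
    # a'_kl = expected transitions k -> l / expected transitions out of k.
    return xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
```

The initial-state and emission updates, $\pi'_k = \gamma_1(k)$ and the $\hat{b}_k(v_m)$ formula above, can be implemented the same way from $\gamma_t(k) = \sum_l \xi_t(k, l)$.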
Updating the HMM model with these re-estimated parameters, we can apply it to NER: the observations are the words of a sentence, and the hidden states are named entity labels such as B_LOC and I_LOC. So we have to find the chain of labels (the labels of named entities) which describes the observed words with the highest probability.
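Finding that most probable label chain is exactly what the Viterbi algorithm [19] computes. Below is a minimal sketch under the same toy model as above (all numbers remain illustrative):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable hidden-state sequence: argmax_S P(S, O | lambda)."""
    T, M = len(obs), len(pi)
    delta = np.zeros((T, M))           # best path score ending in state k
    psi = np.zeros((T, M), dtype=int)  # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A  # scores[k, l]: from k, to l
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]        # best final state ...
    for t in range(T - 1, 0, -1):           # ... then follow back-pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Same toy model as in the parameter sketch above:
pi = np.array([0.3, 0.0, 0.7])
A = np.array([[0.1, 0.6, 0.3], [0.1, 0.4, 0.5], [0.3, 0.0, 0.7]])
B = np.array([[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.1, 0.1, 0.8]])
# Decode "thăm Hà_Nội" (observation indices 2 and 0):
print(viterbi(pi, A, B, obs=[2, 0]))  # -> [2, 0], i.e. labels O, B_LOC
```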
Limitation of HMM
In sequence problems, we assume that the state at time $t$ depends only on the previous state. However, this assumption is not sufficient to represent the relationships between factors in the sequence.
2.2.2 Support Vector Machine
An SVM is a system trained to classify input data into categories. The SVM is trained on a large training corpus containing marked samples of the categories. Each training sample is represented by a point plotted in a hyperspace. The SVM then attempts to draw a hyperplane between the categories, splitting the points as evenly as possible. For the two-dimensional case, this is illustrated in Figure 2.3.
Figure 2.3: A two-dimensional SVM, with the hyperplanes representing the margin drawn as dashed lines
Suppose that there is training data for the SVM in the form of $n$ $k$-dimensional real vectors $\mathbf{x}_i$ and integers $y_i$, where each $y_i$ is either 1 or -1. Whether $y_i$ is positive or negative indicates the category of vector $i$. The aim of the training phase is to plot the vectors in a $k$-dimensional hyperspace and draw a hyperplane which separates the points of the two categories as evenly as possible.
Suppose that this hyperplane has normal vector $\mathbf{w}$. Then the hyperplane can be written as the set of points $\mathbf{x}$ satisfying

$$\mathbf{w} \cdot \mathbf{x} - b = 0$$

where $b$ determines the offset of the hyperplane from the origin along $\mathbf{w}$. We choose this
hyperplane so that it maximizes the margin between the points representing the two categories. Imagine two hyperplanes lying at the "border" of the two regions, in each of which there are only points of one category. These two hyperplanes are perpendicular to $\mathbf{w}$ and cut through the outermost training data points in their respective regions. Two such planes are illustrated as dashed lines in Figure 2.3. Maximizing the margin between the points representing the two categories can be considered as keeping these two hyperplanes as far apart as possible. The training data points which end up on the dashed lines in Figure 2.3 are called Support Vectors, hence
the name Support Vector Machine. The two margin hyperplanes can be described by the equations

$$\mathbf{w} \cdot \mathbf{x} - b = 1 \quad \text{and} \quad \mathbf{w} \cdot \mathbf{x} - b = -1$$

and the distance between them is $2/\|\mathbf{w}\|$, so maximizing the margin amounts to minimizing $\|\mathbf{w}\|$. If we replace $\|\mathbf{w}\|$ with $\frac{1}{2}\|\mathbf{w}\|^2$, Lagrange multipliers can be used to rewrite this optimization problem into the following quadratic optimization problem:

$$\min_{\mathbf{w},\, b} \; \max_{\alpha_i \geq 0} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \big[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \big] \right\}$$

where the $\alpha_i$ are Lagrange multipliers [20]. Data sets which can be divided in this way are called linearly separable. Depending on how the data is arranged, this may not be possible. It is, however, possible to use an alternative model involving a soft margin [24]. The soft margin model allows for a minimal number of mislabeled examples. This is done by introducing a slack variable $\xi_i$ for each training data vector $\mathbf{x}_i$. The function to be minimized, $\frac{1}{2}\|\mathbf{w}\|^2$, is modified by adding a term representing the slack variables. This can be done in several ways, but a common way is to introduce a linear penalty, so that the problem is to minimize

$$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$

subject to

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1 - \xi_i \quad (1 \leq i \leq n)$$
By using Lagrange multipliers as before, the problem can be rewritten as

$$\min_{\mathbf{w},\, \boldsymbol{\xi},\, b} \; \max_{\boldsymbol{\alpha},\, \boldsymbol{\beta}} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \big[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 + \xi_i \big] - \sum_{i=1}^{n} \beta_i \xi_i \right\}$$

for $\alpha_i, \beta_i \geq 0$ [3]. To get rid of the slack variables, one can also rewrite this problem into its dual form:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0$$

Figure 2.4: The mapping $\Phi$ of input data from the input space into an infinite-dimensional Hilbert space in a non-linear SVM classifier. Source: [17]
It is worth mentioning that there are also non-linear classifiers. While linear classifiers require that the data be linearly separable, or nearly so, in order to give good results, in a non-linear classifier the input vectors $\mathbf{x}_i$ are transformed so as to lie in an infinite-dimensional Hilbert space, where it is always possible to linearly separate the two data categories (see Figure 2.4).
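As a concrete illustration of the soft margin (the constant $C$ weighting the slack variables) and of non-linear classification, the sketch below trains a linear and an RBF-kernel SVM on synthetic data that is not linearly separable. The use of scikit-learn and the circular toy data set are assumptions of this example, not part of this thesis's experiments.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two classes that are NOT linearly separable: points inside vs. outside
# the unit circle.
X = rng.normal(size=(200, 2))
y = np.where((X ** 2).sum(axis=1) > 1.0, 1, -1)

# C is the soft-margin constant weighting the slack variables xi_i. The
# RBF kernel implicitly maps the data into a higher-dimensional space in
# which the two classes become linearly separable.
for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, 'training accuracy:', clf.score(X, y))
# The linear model stays near the majority-class baseline, while the RBF
# model separates the classes almost perfectly.
```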