VIETNAM NATIONAL UNIVERSITY, HANOI
MASTER’S THESIS OF INFORMATION TECHNOLOGY
Supervisor: Assoc. Prof. Dr. Le Anh Cuong
Ha Noi - 2015
Originality statement
"I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no materials previously published or written by another person except where due references or acknowledgements are made."
Signature:………
Supervisor's approval
“I hereby approve that the thesis in its current form is ready for committee examination as a requirement for the Master of Computer Science degree at the University of Engineering and Technology.”
Signature:………
Abstract
Named Entity Recognition (NER) aims to extract and classify words in documents into pre-defined entity classes. It is fundamental for many natural language processing tasks such as machine translation, information extraction and question answering. NER has been extensively studied for other languages such as English, Japanese and Chinese. However, NER for Vietnamese is still challenging due to the characteristics of the language and the lack of a Vietnamese corpus.
In this thesis, we study approaches to NER including handcrafted rules, machine learning and hybrid methods. We present challenges in NER for Vietnamese, such as the lack of a standard evaluation corpus and of standard methods for constructing data sets. Especially, we focus on labeling entities in Vietnamese, since most studies have not presented the details of hand-annotating entities in Vietnamese. We apply supervised machine learning methods for Vietnamese NER based on Conditional Random Fields and Support Vector Machines, with changes in feature selection suitable for Vietnamese. The evaluation shows that these methods outperform other traditional methods in NER, such as Hidden Markov Models and rule-based methods.
Acknowledgement
First, I would like to thank my supervisor Assoc. Prof. Dr. Le Anh Cuong for his advice and support. This thesis would not have been possible without him and without the freedom and encouragement he has given me over the last two years I spent at the Faculty of Technology of the University of Engineering and Technology, Vietnam National University (VNU), Ha Noi.
I have been working with amazing friends in the K19CS class. I dedicate my gratitude to each of them: Tai Pham Dinh, Tuan Dinh Vu, and Nam Thanh Pham. I would especially like to thank the teachers at the University of Engineering and Technology, VNU, for the collaboration, great ideas and feedback during my dissertation.
Finally, I thank my parents and my brother, Hoang Le, for their encouragement, advice and support. I especially thank my wife, Linh Thi Nguyen, and my lovely daughter, Ngoc Khanh Le, for their endless love and sacrifice over the last two years. They gave me strength and encouragement to do this thesis.
Ha Noi, September, 2015
Contents
Supervisor's approval ii
Abstract iii
Acknowledgement iv
List of Figures vii
List of Tables viii
List of Abbreviations ix
Chapter 1 Introduction 1
1.1 Information Extraction 1
1.2 Named entity recognition 3
1.3 Evaluation for NER 4
1.4 Our work 4
Chapter 2 Approaches to Named Entity Recognition 6
2.1 Rule-based methods 6
2.2 Machine learning methods 7
2.3 Hybrid methods 17
Chapter 3 Feature Extraction 18
3.1 Characteristics of Vietnamese language 18
3.1.1 Lexical Resource 18
3.1.2 Word Formation 18
3.1.3 Spelling Variation 18
3.2 Feature selection for NER 19
3.2.1 Feature selection methods 20
3.2.2 Mask methods 21
3.2.3 Taxonomy of features 21
3.3 Feature selection for Vietnamese NER 23
4.1 Data preparation 26
4.2 Machine learning methods for Vietnamese NER 29
4.2.1 SVM method 29
4.2.2 CRF method 30
4.3 Experimental results 31
4.4 An example of experimental results and error analysis 32
Chapter 5 Conclusion 37
References 38
List of Figures
Figure 1.1: Example of automatically extracted information from a news article on a terrorist attack. Source [18] 1
Figure 2.1: Directed graph representing an HMM 7
Figure 2.2: How to compute transition probabilities 10
Figure 2.3: A two dimensional SVM, with the hyperplanes representing the margin drawn as dashed lines 12
Figure 2.4: The mapping of input data from the input space into an infinitely dimensional Hilbert Space in a non-linear SVM classifier Source [17] 14
Figure 3.1: A taxonomy of feature selection methods Source [21] 20
Figure 4.1: Generating training data stages 27
Figure 4.2: Vietnamese NER based on SVM 30
Figure 4.3: Vietnamese NER based on CRF 31
List of Tables
Table 3.1: Word-level features 22
Table 3.2: Gazetteer features 23
Table 3.3: Document and corpus features 23
Table 3.4: Orthographic features for Vietnamese 24
Table 3.5: Lexical and POS features for Vietnamese 24
Table 4.1: An example of a sentence in training data 28
Table 4.2: Statistics of training data in entity level 28
Table 4.3: The number of label types in training data and test data 29
Table 4.4: Results on testing data of SVM Learner 31
Table 4.5: Results on testing data of NER using CRF method 32
Table 4.6: Annotating table 32
List of Abbreviations
Chapter 1 Introduction
1.1 Information Extraction
Information Extraction (IE) is a research area in Natural Language Processing (NLP). It focuses on techniques to identify a predefined set of concepts in a specific domain, where a domain consists of a text corpus together with a well-defined information need. In other words, IE is about deriving structured information from unstructured text. For instance, we may be interested in extracting information on violent events from online news, which involves the identification of the main actors of the event, its location and the number of people affected [18]. Figure 1.1 shows an example of a text snippet from a news article about a terrorist attack and the structured information derived from that snippet. The process of extracting such structured information involves the identification of certain small-scale structures, such as noun phrases denoting a person or a group of persons, geographical references and numerical expressions, as well as finding semantic relations between them. However, in this scenario some domain-specific knowledge is required (e.g., understanding the fact that terrorist attacks might result in people being killed or injured) in order to correctly aggregate the partially extracted information into a structured form.
Figure 1.1: Example of automatically extracted information from a news article on a terrorist attack. Source [18]
"Three bombs have exploded in north-eastern Nigeria, killing 25 people and wounding 12 in an attack carried out by an Islamic sect. Authorities said the bombs exploded on Sunday afternoon in the city of Maiduguri."
Starting in 1987, a series of Message Understanding Conferences (MUC) has been held, focusing on the following domains:
MUC-1 (1987), MUC-2 (1989): Naval operations messages
MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries
MUC-5 (1993): Joint ventures and microelectronics domain
MUC-6 (1995): News articles on management changes
MUC-7 (1998): Satellite launches reports
The significance of IE is related to the growing amount of information available in unstructured form. Tim Berners-Lee, the inventor of the World Wide Web (WWW), refers to the existing Internet as the web of documents and advocates that more content should be available as a web of data. Until this transpires, the web largely consists of unstructured documents without semantic metadata. Knowledge contained in these documents can be made more accessible for machine processing by transforming the information into relational form, or by marking it up with XML tags. For instance, an intelligent agent monitoring a news data feed requires IE to transform unstructured data (i.e., text) into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the extracted information.
IE on text aims at creating a structured view, i.e., a representation of the information that is machine understandable. According to [18], the classical IE tasks include:
Named Entity Recognition addresses the problem of the identification (detection) and classification of predefined types of named entities, such as organizations (e.g., 'World Health Organisation'), persons (e.g., 'Muammar Kaddafi'), place names (e.g., 'the Baltic Sea'), temporal expressions (e.g., '1 September 2011'), and numerical and currency expressions (e.g., '20 Million Euros').
Co-reference Resolution requires the identification of multiple (co-referring) mentions of the same entity in the text. For example, "International Business Machines" and "IBM" refer to the same real-world entity. If we take the two sentences "M. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect that "he" refers to the previously detected person "M. Smith".
Relation Extraction focuses on detecting and classifying predefined relationships between entities identified in text. For example:
PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
Event Extraction refers to the identification of events in free text and the derivation of detailed and structured information about them. Ideally, it should identify who did what to whom, when, where, through what methods (instruments), and why. Normally, event extraction involves the extraction of several entities and the relationships between them. For instance, extracting information on terrorist attacks from the text fragment 'Masked gunmen armed with assault rifles and grenades attacked a wedding party in mainly Kurdish southeast Turkey, killing at least 44 people.' would reveal the perpetrators (masked gunmen), the victims (people), the number killed/injured (at least 44), the weapons used (rifles and grenades), and the location (southeast Turkey).
1.2 Named entity recognition
In IE, named entity recognition (NER) is slightly more complex. The named entities (e.g., the location, actors and targets of a terrorist event) need to be recognized as such. This NER task (also known as 'proper name classification') involves the identification and classification of named entities: people, places, organizations, products, companies, and even dates, times, or monetary amounts. For example, Figure 1.2 demonstrates an English NER system which identifies and classifies entities in text documents. It identifies four entity types: person, location, organization and misc.
Figure 1.2: A named entity recognition system Source 1
1 http://www.iti.illinois.edu/tech-transfer/technologies/natural-language-processing-nlp
Previous NER studies mainly focus on popular languages such as English, French, Spanish and Japanese. The methods developed in these studies are based on supervised learning. Tri et al. [22] performed NER using Support Vector Machines (SVM) and obtained an overall F-score of 87.75%. The VN-KIM IE system2 built an ontology and then applied Jape grammars to define target named entities on the web. Nguyen Ba Dat et al. [7] employed a rule-based approach using the Jape grammar plug-in of the Gate framework for NER. For example, the text "Chủ tịch nước Trương Tấn Sang sang thăm chính thức Nhật Bản vào ngày 20/7/2015" would be annotated as follows:
Chủ tịch nước <PER>Trương Tấn Sang</PER> sang thăm chính thức <LOC>Nhật Bản</LOC> vào ngày 20/7/2015./ President Truong Tan Sang visited Japan on 20/07/2015
1.3 Evaluation for NER
To evaluate a NER algorithm, many metrics developed in data mining and machine learning can be used. These metrics measure the frequency with which an algorithm makes correct or incorrect identifications and classifications of named entities. The most common measures are:
Precision: measures the ratio of relevant items selected to the number of items selected
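As a small illustration (the function name and the example counts below are hypothetical, and entity-level counting is assumed), these measures can be computed as:

```python
def prf(num_correct, num_predicted, num_gold):
    """Entity-level precision, recall and F-score from raw counts.

    precision = correct / predicted, recall = correct / gold,
    and F1 is the harmonic mean of the two.
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: a tagger predicts 8 entities, 6 of them correct; gold has 10.
p, r, f = prf(6, 8, 10)
```

Recall, analogously, is the ratio of relevant items selected to the total number of relevant items, and the F-score balances the two.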
2 http://www.cse.hcmut.edu.vn/~tru/VN-KIM/
evaluation. This data set consists of 10,000 Vietnamese sentences from the VLSP3 project. First, the raw data is processed to obtain word-segmented data. Then the word-segmented data is manually annotated to obtain NER data. In another stage, the word-segmented data is automatically tagged by a Part-of-Speech (POS) tagging tool. Finally, the data from these two stages are combined into the final data set used for our experiments.
We study feature extraction methods to improve system performance and learning time. For the Vietnamese NER problem, we focus on the characteristics of the Vietnamese language and on the method used to select features. We compare the performance of our method with well-known methods such as CRF and SVM.
The rest of this thesis is organized as follows. Chapter 2 introduces approaches to NER. Chapter 3 presents how to extract features from data for training in supervised machine learning, the characteristics of the Vietnamese language, and features for Vietnamese NER. Chapter 4 presents our experiments on NER in Vietnamese documents. Finally, in Chapter 5, we summarize the thesis and discuss future work.
3 http://vlsp.vietlp.org:8080
Chapter 2 Approaches to Named Entity Recognition
This chapter reviews popular methods for the detection and classification of named entities. The chapter is organized as follows. In Section 2.1, we present rule-based methods. In Section 2.2, machine learning methods and their variations are described in detail. Section 2.3 presents hybrid methods for recognizing named entities. The methods presented in this chapter are the basis of our approach for Vietnamese NER.
2.1 Rule-based methods
This approach relies on the intuition of the human designers who assemble a large number of rules capturing intuitive notions. For example, in many languages person names are usually preceded by some kind of title. For instance, the name in the text "Ông Ôn_Gia_Bảo đã đến thăm Việt_Nam vào năm 2004 / Mr. Wen Jiabao visited Vietnam in 2004" can easily be discovered by a rule like [21]:

(1) title capitalized_word => title PERSON
In another example, the left or right context of an expression is used to classify a named entity. For instance, the location in the text "Tỉnh Quảng_Ninh đang phải hứng chịu trận mưa lịch sử / Quảng_Ninh province is experiencing historic rainfall." can be recognized by the rule:

(2) location_marker capitalized_word => location_marker LOCATION
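As a rough illustration, rules of this shape can be sketched as regular expressions in Python. The marker lists and the tagging function below are hypothetical simplifications for this example, not the actual rules used in the cited systems:

```python
import re

# Illustrative marker lists; the thesis does not enumerate them.
TITLES = r"(?:Ông|Bà|Mr|Dr)"
LOC_MARKERS = r"(?:Tỉnh|Thành_phố)"

# Rule (1): title capitalized_word => title PERSON
person_rule = re.compile(rf"({TITLES})\s+([A-ZÀ-Ỹ]\w*)")
# Rule (2): location_marker capitalized_word => location_marker LOCATION
location_rule = re.compile(rf"({LOC_MARKERS})\s+([A-ZÀ-Ỹ]\w*)")

def tag(text):
    """Apply the two context rules, wrapping matches in <PER>/<LOC> tags."""
    text = person_rule.sub(r"\1 <PER>\2</PER>", text)
    text = location_rule.sub(r"\1 <LOC>\2</LOC>", text)
    return text

tagged = tag("Ông Ôn_Gia_Bảo đã đến thăm Tỉnh Quảng_Ninh")
```

A real rule-based system would use far richer patterns, gazetteers and cascaded grammars (e.g., Jape), but the left-context idea is the same.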
The rule-based approach used in this project achieves high performance, with an F-measure of 83%. The VN-KIM IE system uses Jape grammars to recognize entities of various types (Organization, Location, Person, Date, Time, Money and Percent) with an F-measure of 81%.
Compared with machine learning approaches, the advantage of the rule-based approach is that it does not need a large annotated data set. This means the system can be activated and produce results immediately after the rules are constructed.
However, a major disadvantage of the rule-based method is that a certain set of rules may work well for a certain domain but may not be suitable for other domains. This means the entire knowledge base of a rule-based system has to be rewritten to fit the requirements of a new domain. Furthermore, constructing any rule base of sufficient size is very expensive in terms of time and money.
2.2 Machine learning methods
Machine learning methods are used in many NER systems for different languages. These methods include HMM, the Maximum Entropy Model (MEM), SVM and CRF. This section describes these methods in detail.
2.2.1 Hidden Markov Model
HMMs were introduced in the 1960s and are now applied in many research fields such as speech recognition, bioinformatics and natural language processing. HMMs are probabilistic finite state models with parameters for state-transition probabilities and state-specific observation probabilities. They allow estimating the probabilities of unobserved events and are very common in Natural Language Processing.
Definition of HMM
Figure 2.1: Directed graph representing an HMM
In Figure 2.1, $s_i$ is the state at time $t = i$ in the state chain $S$, and $o_i$ is the corresponding observation.

The underlying states follow a Markov chain where, given the current state, the future state is independent of the past:

$$P(s_{t+1} \mid s_1, \dots, s_t) = P(s_{t+1} \mid s_t) \tag{2.1}$$

and the transition probabilities are

$$a_{kl} = P(s_{t+1} = l \mid s_t = k).$$

Here $k, l = 1, 2, \dots, M$, where $M$ is the total number of states. The initial probabilities of the states are

$$\pi_k = P(s_1 = k)$$

for any $k$, with the $\pi_k$ summing to one.

Given the state $s_t$, the observation $o_t$ is independent of the other observations and states. For a fixed state, the observation is generated according to a fixed probability law: given state $k$, the probability law of $o_t$ is specified by

$$b_k(o_t) = P(o_t \mid s_t = k).$$

In summary, an HMM is fully specified by the parameter triple $\lambda = (A, B, \pi)$.
Model Parameters
The parameters of an HMM $\lambda = (A, B, \pi)$ are:

• Transition probabilities: a set $A$ of values $a_{ij}$, the probability of transitioning from state $s_i$ to state $s_j$.
• Initial probabilities: a set $\pi$ of values $\pi_i$, where $\pi_i$ is the probability that $s_i$ is a start state, for each state $s_i$.
• Emission probabilities: a set $B$ of functions of the form $b_i(o_t)$, which is the probability of observation $o_t$ being emitted by state $s_i$.
Model Learning
Up to now we have assumed that we know the underlying model $\lambda = (A, B, \pi)$.

We want to maximize the parameters with respect to the current data, i.e., given an observation sequence $O$, we are looking for a model $\lambda'$ such that $P(O \mid \lambda')$ is maximal.

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\lambda^*$ such that $P(O \mid \lambda^*) \ge P(O \mid \lambda)$ for all $\lambda$. But it is possible to find a local maximum: given an initial model $\lambda$, we can always find a model $\lambda'$ such that $P(O \mid \lambda') \ge P(O \mid \lambda)$.
Parameter Re-estimation
Use the forward-backward (or Baum-Welch) algorithm, a hill-climbing algorithm. Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence is generated by the new parameters. Three sets of parameters need to be re-estimated:
• Initial state distribution: $\pi_i$
• Transition probabilities: $a_{i,j}$
• Emission probabilities: $b_i(o_t)$
Re-estimation of transition probabilities:

What is the probability of being in state $s_i$ at time $t$ and going to state $s_j$, given the current model and parameters? Using the forward probabilities $\alpha_t(i)$ and backward probabilities $\beta_t(j)$, it is

$$\xi_t(i, j) = P(s_t = i, s_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}.$$

Figure 2.2: How to compute transition probabilities

The intuition behind the re-estimation equation for transition probabilities is

$$\hat{a}_{ij} = \frac{\text{expected number of transitions from } s_i \text{ to } s_j}{\text{expected number of transitions from } s_i}.$$

Formally:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'} \xi_t(i, j')}.$$
Estimation of initial state probabilities:

In the initial state distribution, $\pi_i$ is the probability that $s_i$ is a start state. Re-estimation is easy, since it is the expected frequency of state $s_i$ at time $t = 1$. Formally:

$$\hat{\pi}_i = \gamma_1(i) = P(s_1 = i \mid O, \lambda).$$
Estimation of emission probabilities:

With $\gamma_t(i) = P(s_t = i \mid O, \lambda)$, the emission probabilities are re-estimated as

$$\hat{b}_i(k) = \frac{\sum_{t \,:\, o_t = k} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)},$$

i.e., the expected number of times observation symbol $k$ is emitted from state $s_i$, divided by the expected number of times state $s_i$ is visited.
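To make the HMM quantities concrete, the following sketch implements the forward algorithm, which computes P(O | λ) and underlies the re-estimation formulas above. The two-state model and its probabilities are hypothetical toy values, not from the thesis:

```python
def forward(obs, pi, A, B):
    """Forward algorithm: returns P(obs | model).

    pi[k]   : initial probability of state k
    A[k][l] : transition probability from state k to state l
    B[k][o] : probability that state k emits observation symbol o
    """
    n_states = len(pi)
    # Initialization: alpha_1(k) = pi_k * b_k(o_1)
    alpha = [pi[k] * B[k][obs[0]] for k in range(n_states)]
    # Induction: alpha_{t+1}(l) = (sum_k alpha_t(k) * a_{kl}) * b_l(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[k] * A[k][l] for k in range(n_states)) * B[l][o]
                 for l in range(n_states)]
    # Termination: P(O | model) = sum_k alpha_T(k)
    return sum(alpha)

# Hypothetical two-state model over two observation symbols (0 and 1).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p_obs = forward([0, 1, 0], pi, A, B)
```

The backward pass is symmetric, and Baum-Welch combines the two to compute the expected counts used in the re-estimation equations.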
Limitation of HMM

In sequence problems, we assume that the state at time $t$ depends only on the previous state. However, this assumption is not enough to represent the relationships between factors in the sequence.
2.2.2 Support Vector Machine
An SVM is a system trained to classify input data into categories. The SVM is trained on a large corpus containing marked samples of the categories. Each training sample is represented by a point plotted in a hyperspace. The SVM then attempts to draw a hyperplane between the categories, splitting the points as evenly as possible. For the two-dimensional case, this is illustrated in Figure 2.3.
Figure 2.3: A two-dimensional SVM, with the hyperplanes representing the margin drawn as dashed lines
Suppose that there is training data for the SVM in the form of $n$ $k$-dimensional real vectors $\mathbf{x}_i$ and integers $y_i$, where $y_i$ is either $1$ or $-1$. Whether $y_i$ is positive or negative indicates the category of vector $i$. The aim of the training phase is to plot the vectors in a $k$-dimensional hyperspace and draw a hyperplane which as evenly as possible separates the points of the two categories.

Suppose that this hyperplane has normal vector $\mathbf{w}$. Then the hyperplane can be written as the points $\mathbf{x}$ satisfying

$$\mathbf{w} \cdot \mathbf{x} - b = 0,$$

where $b/\|\mathbf{w}\|$ is the offset of the hyperplane from the origin along $\mathbf{w}$. We choose this hyperplane so that it maximizes the margin between the points representing the two categories. Imagine two hyperplanes lying at the "border" of the two regions, in each of which there are only points of one category. These two hyperplanes are perpendicular to $\mathbf{w}$ and cut through the outermost training data points in their respective regions. Two such planes are illustrated as dashed lines in Figure 2.3. Maximizing the margin between the points representing the two categories can be considered as keeping these two hyperplanes as far apart as possible. The training data points which end up on the dashed lines in Figure 2.3 are called Support Vectors, hence the name Support Vector Machine. The hyperplanes can be described by the equations

$$\mathbf{w} \cdot \mathbf{x} - b = 1$$

and

$$\mathbf{w} \cdot \mathbf{x} - b = -1.$$

The distance between the two hyperplanes is $2/\|\mathbf{w}\|$. Since the SVM wants to maximize the margin, we need to minimize $\|\mathbf{w}\|$. We also do not want training data points to lie inside the margin, so the following constraints are added:

$$\mathbf{w} \cdot \mathbf{x}_i - b \ge 1$$

for $\mathbf{x}_i$ in the first category, and

$$\mathbf{w} \cdot \mathbf{x}_i - b \le -1$$

for $\mathbf{x}_i$ in the second category. This can be rewritten as the optimization problem

$$\min_{\mathbf{w}, b} \|\mathbf{w}\| \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1, \; i = 1, \dots, n.$$

If we replace $\|\mathbf{w}\|$ with $\frac{1}{2}\|\mathbf{w}\|^2$, Lagrange multipliers can be used to rewrite this optimization problem into the following quadratic optimization problem:

$$\min_{\mathbf{w}, b} \max_{\boldsymbol{\alpha} \ge 0} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \right] \right\}$$
where the $\alpha_i$ are Lagrange multipliers [20]. Data sets which can be divided into two in this way are called linearly separable. Depending on how the data is arranged, this may not be possible. It is, however, possible to use an alternative model involving a soft margin [24]. The soft margin model allows for a minimal number of mislabeled examples. This is done by introducing a slack variable $\xi_i$ for each training data vector. The function to be minimized is modified by adding a term representing the slack variables. This can be done in several ways, but a common way is to introduce a linear function, so that the problem is to minimize

$$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$

for some constant $C$. To this minimization, the following modified constraint is added:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
By using Lagrange multipliers as before, the problem can be rewritten as

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i} \xi_i - \sum_{i} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 + \xi_i \right] - \sum_{i} \beta_i \xi_i \right\}$$

for $\alpha_i, \beta_i \ge 0$ [3]. To get rid of the slack variables, one can also rewrite this problem into its dual form:

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$

subject to the constraints

$$0 \le \alpha_i \le C$$

and

$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$

Figure 2.4: The mapping of input data from the input space into an infinite-dimensional Hilbert space in a non-linear SVM classifier. Source [17]
It is worth mentioning that there are also non-linear classifiers. Linear classifiers require that the data be linearly separable, or nearly so, in order to give good results. In a non-linear classifier, the input vectors are transformed so as to lie in an infinite-dimensional Hilbert space, where it is always possible to linearly separate the two data categories.
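As an illustration of the soft-margin formulation, the following sketch trains a linear SVM by stochastic subgradient descent on the primal objective 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i - b)). This is a simplification of how SVM solvers actually work (production solvers typically optimize the dual), and the data and hyperparameters are hypothetical toy values:

```python
import random

def train_linear_svm(xs, ys, C=1.0, lr=0.01, epochs=200, seed=0):
    """Stochastic subgradient descent on the primal soft-margin objective
    0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i - b))."""
    rng = random.Random(seed)
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = ys[i] * (sum(wj * xj for wj, xj in zip(w, xs[i])) - b)
            if margin < 1:
                # Hinge term active: subgradient is (w - C*y_i*x_i) in w
                # and C*y_i in b.
                w = [wj - lr * (wj - C * ys[i] * xj)
                     for wj, xj in zip(w, xs[i])]
                b -= lr * C * ys[i]
            else:
                # Only the regularizer 0.5*||w||^2 contributes.
                w = [wj - lr * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Classify by the sign of the decision function w . x - b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) - b >= 0 else -1

# Toy, linearly separable 2-D data (hypothetical).
xs = [[2.0, 2.0], [2.5, 1.5], [3.0, 2.5],
      [-2.0, -2.0], [-1.5, -2.5], [-3.0, -2.0]]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(xs, ys)
```

The subgradient form is convenient because it needs no quadratic-programming machinery, at the cost of only approximating the exact maximum-margin solution.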