VIETNAM NATIONAL UNIVERSITY, HANOI
MASTER’S THESIS OF INFORMATION TECHNOLOGY
Supervisor: Assoc. Prof. Dr. Le Anh Cuong
Ha Noi - 2015
Originality statement
"I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no materials previously published or written by another person except where due references or acknowledgements are made."
Signature:………
Supervisor's approval
“I hereby approve that the thesis in its current form is ready for committee examination as a requirement for the Master of Computer Science degree at the University of Engineering and Technology.”
Signature:………
Abstract
Named Entity Recognition (NER) aims to extract and classify words in documents into pre-defined entity classes. It is fundamental for many natural language processing tasks such as machine translation, information extraction and question answering. NER has been extensively studied for other languages such as English, Japanese and Chinese. However, NER for Vietnamese is still challenging due to the characteristics of the language and the lack of a Vietnamese corpus.
In this thesis, we study approaches to NER including handcrafted rules, machine learning and hybrid methods. We present challenges in NER for Vietnamese, such as the lack of a standard evaluation corpus and of standard methods for constructing data sets. Especially, we focus on labeling entities in Vietnamese, since most studies have not presented the details of hand-annotating entities in Vietnamese. We apply supervised machine learning methods for Vietnamese NER based on Conditional Random Fields and Support Vector Machines, with changes in feature selection suitable for Vietnamese. The evaluation shows that these methods outperform other traditional methods in NER, such as Hidden Markov Models and rule-based methods.
Acknowledgement
First, I would like to thank my supervisor Assoc. Prof. Dr. Le Anh Cuong for his advice and support. This thesis would not have been possible without him and without the freedom and encouragement he has given me over the last two years I spent at the Faculty of Technology of the University of Engineering and Technology, Vietnam National University (VNU), Ha Noi.
I have been working with amazing friends in the K19CS class. I dedicate my gratitude to each of them: Tai Pham Dinh, Tuan Dinh Vu, and Nam Thanh Pham. I would especially like to thank the teachers at the University of Engineering and Technology, VNU, for the collaboration, great ideas and feedback during my dissertation.
Finally, I thank my parents and my brother, Hoang Le, for their encouragement, advice and support. I especially thank my wife, Linh Thi Nguyen, and my lovely daughter, Ngoc Khanh Le, for their endless love and sacrifice over the last two years. They gave me strength and encouragement to do this thesis.
Ha Noi, September, 2015
Contents
Supervisor's approval ii
Abstract iii
Acknowledgement iv
List of Figures vii
List of Tables viii
List of Abbreviations ix
Chapter 1 Introduction 1
1.1 Information Extraction 1
1.2 Named entity recognition 3
1.3 Evaluation for NER 4
1.4 Our work 4
Chapter 2 Approaches to Named Entity Recognition 6
2.1 Rule-based methods 6
2.2 Machine learning methods 7
2.3 Hybrid methods 17
Chapter 3 Feature Extraction 18
3.1 Characteristics of Vietnamese language 18
3.1.1 Lexical Resource 18
3.1.2 Word Formation 18
3.1.3 Spelling Variation 18
3.2 Feature selection for NER 19
3.2.1 Feature selection methods 20
3.2.2 Mask methods 21
3.2.3 Taxonomy of features 21
3.3 Feature selection for Vietnamese NER 23
4.1 Data preparation 26
4.2 Machine learning methods for Vietnamese NER 29
4.2.1 SVM method 29
4.2.2 CRF method 30
4.3 Experimental results 31
4.4 An example of experimental results and error analysis 32
Chapter 5 Conclusion 37
References 38
List of Figures
Figure 1.1: Example of automatically extracted information from a news article on a terrorist attack. Source [18] 1
Figure 2.1: Directed graph representing an HMM 7
Figure 2.2: How to compute transition probabilities 10
Figure 2.3: A two dimensional SVM, with the hyperplanes representing the margin drawn as dashed lines 12
Figure 2.4: The mapping of input data from the input space into an infinitely dimensional Hilbert Space in a non-linear SVM classifier Source [17] 14
Figure 3.1: A taxonomy of feature selection methods Source [21] 20
Figure 4.1: Generating training data stages 27
Figure 4.2: Vietnamese NER based on SVM 30
Figure 4.3: Vietnamese NER based on CRF 31
List of Tables
Table 3.1: Word-level features 22
Table 3.2: Gazetteer features 23
Table 3.3: Document and corpus features 23
Table 3.4: Orthographic features for Vietnamese 24
Table 3.5: Lexical and POS features for Vietnamese 24
Table 4.1: An example of a sentence in training data 28
Table 4.2: Statistics of training data in entity level 28
Table 4.3: The number of label types in training data and test data 29
Table 4.4: Results on testing data of SVM Learner 31
Table 4.5: Results on testing data of NER using CRF method 32
Table 4.6: Annotating table 32
List of Abbreviations
Chapter 1 Introduction
1.1 Information Extraction
Information Extraction (IE) is a research area in Natural Language Processing (NLP). It focuses on techniques to identify a predefined set of concepts in a specific domain, where a domain consists of a text corpus together with a well-defined information need. In other words, IE is about deriving structured information from unstructured text. For instance, we may be interested in extracting information on violent events from online news, which involves the identification of the main actors of the event, its location and the number of people affected [18]. Figure 1.1 shows an example of a text snippet from a news article about a terrorist attack and the structured information derived from that snippet. The process of extracting such structured information involves the identification of certain small-scale structures, such as noun phrases denoting a person or a group of persons, geographical references and numerical expressions, as well as finding semantic relations between them. However, in this scenario some domain-specific knowledge is required (e.g., understanding the fact that terrorist attacks might result in people being killed or injured) in order to correctly aggregate the partially extracted information into a structured form.
Figure 1.1: Example of automatically extracted information from a news article on a terrorist attack. Source [18]
"Three bombs have exploded in north-eastern Nigeria, killing 25 people and wounding 12 in an attack carried out by an Islamic sect. Authorities said the bombs exploded on Sunday afternoon in the city of Maiduguri."
Starting in 1987, a series of Message Understanding Conferences (MUC) has been held, focusing on the following domains:
MUC-1 (1987), MUC-2 (1989): Naval operations messages
MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries
MUC-5 (1993): Joint ventures and microelectronics domain
MUC-6 (1995): News articles on management changes
MUC-7 (1998): Satellite launches reports
The significance of IE is related to the growing amount of information available in unstructured form. Tim Berners-Lee, the inventor of the World Wide Web (WWW), refers to the existing Internet as the web of documents and advocates that more content should be available as a web of data. Until this transpires, the web largely consists of unstructured documents without semantic metadata. Knowledge contained in these documents can be made more accessible for machine processing by transforming the information into relational form, or by marking it up with XML tags. For instance, an intelligent agent monitoring a news data feed requires IE to transform unstructured data (i.e., text) into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the extracted information.
IE on text aims at creating a structured view, i.e., a representation of the information that is machine understandable. According to [18], the classical IE tasks include:
Named Entity Recognition addresses the problem of the identification (detection) and classification of predefined types of named entities, such as organizations (e.g., 'World Health Organisation'), persons (e.g., 'Muammar Kaddafi'), place names (e.g., 'the Baltic Sea'), temporal expressions (e.g., '1 September 2011'), and numerical and currency expressions (e.g., '20 Million Euros').
Co-reference Resolution requires the identification of multiple (co-referring) mentions of the same entity in the text. For example, "International Business Machines" and "IBM" refer to the same real-world entity. If we take the two sentences "M. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect that "he" refers to the previously detected person "M. Smith".
Relation Extraction focuses on detecting and classifying predefined relationships between entities identified in text. For example:
PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
Event Extraction refers to the identification of events in free text and the derivation of detailed and structured information about them. Ideally, it should identify who did what to whom, when, where, through what methods (instruments), and why. Normally, event extraction involves the extraction of several entities and the relationships between them. For instance, extracting information on terrorist attacks from the text fragment 'Masked gunmen armed with assault rifles and grenades attacked a wedding party in mainly Kurdish southeast Turkey, killing at least 44 people.' would reveal the perpetrators (masked gunmen), the victims (people), the number killed/injured (at least 44), the weapons used (rifles and grenades), and the location (southeast Turkey).
1.2 Named entity recognition
In IE, named entity recognition (NER) is slightly more complex. The named entities (e.g., the location, actors and targets of a terrorist event) need to be recognized as such. This NER task (also known as 'proper name classification') involves the identification and classification of named entities: people, places, organizations, products, companies, and even dates, times, or monetary amounts. For example, Figure 1.2 demonstrates an English NER system which identifies and classifies entities in text documents. It identifies four entity types: person, location, organization and misc.
Figure 1.2: A named entity recognition system Source 1
1 http://www.iti.illinois.edu/tech-transfer/technologies/natural-language-processing-nlp
Previous NER studies mainly focus on popular languages such as English, French, Spanish and Japanese. The methods developed in these studies are based on supervised learning. Tri et al. [22] performed NER using Support Vector Machines (SVM) and obtained an overall F-score of 87.75%. The VN-KIM IE system2 built an ontology and then applied Jape grammars to define target named entities on the web. Nguyen Ba Dat et al. [7] employed a rule-based approach using the Jape grammar plug-in of the Gate framework for NER. For example, the text "Chủ tịch nước Trương Tấn Sang sang thăm chính thức Nhật Bản vào ngày 20/7/2015" would be annotated as follows:
Chủ tịch nước <PER>Trương Tấn Sang</PER> sang thăm chính thức <LOC>Nhật Bản</LOC> vào ngày 20/7/2015./ President Truong Tan Sang visited Japan on 20/07/2015
1.3 Evaluation for NER
To evaluate a NER algorithm, many metrics developed in data mining and machine learning can be used. These metrics measure the frequency with which an algorithm makes correct or incorrect identifications and classifications of named entities. The most common measures are:
Precision: measures the ratio of relevant items selected to the number of items selected
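As a small illustration (the function name and the example counts below are hypothetical, and entity-level counting is assumed), these measures can be computed as:

```python
def prf(num_correct, num_predicted, num_gold):
    """Entity-level precision, recall and F-score from raw counts.

    precision = correct / predicted, recall = correct / gold,
    and F1 is the harmonic mean of the two.
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: a tagger predicts 8 entities, 6 of them correct; gold has 10.
p, r, f = prf(6, 8, 10)
```

Recall, analogously, is the ratio of relevant items selected to the total number of relevant items, and the F-score balances the two.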
2 http://www.cse.hcmut.edu.vn/~tru/VN-KIM/
evaluation. This data set consists of 10,000 Vietnamese sentences from the VLSP3 project. First, the raw data is processed to obtain word-segmented data. Then the word-segmented data is manually annotated to obtain NER data. In another stage, the word-segmented data is automatically tagged by a Part-of-Speech (POS) tagging tool. Finally, the data from these two stages are combined into the final data set used for our experiments.
We study feature extraction methods to improve system performance and learning time. For the Vietnamese NER problem, we focus on the characteristics of the Vietnamese language and on the method used to select features. We compare the performance of our method with well-known methods such as CRF and SVM.
The rest of this thesis is organized as follows. Chapter 2 introduces approaches to NER. Chapter 3 presents how to extract features from data for training in supervised machine learning, the characteristics of the Vietnamese language, and features for Vietnamese NER. Chapter 4 presents our experiments on NER in Vietnamese documents. Finally, in Chapter 5, we summarize the thesis and discuss future work.
3 http://vlsp.vietlp.org:8080
Chapter 2 Approaches to Named Entity Recognition
This chapter reviews popular methods for the detection and classification of named entities. The chapter is organized as follows. In Section 2.1, we present rule-based methods. In Section 2.2, machine learning methods and their variations are described in detail. Section 2.3 presents hybrid methods for recognizing named entities. The methods presented in this chapter are the basis of our approach for Vietnamese NER.
2.1 Rule-based methods
This approach relies on the intuition of the human designers who assemble a large number of rules capturing intuitive notions. For example, in many languages person names are usually preceded by some kind of title. For instance, the name in the text "Ông Ôn_Gia_Bảo đã đến thăm Việt_Nam vào năm 2004 / Mr. Wen Jiabao visited Vietnam in 2004" can easily be discovered by a rule like [21]:

(1) title capitalized_word => title PERSON
In another example, the left or right context of an expression is used to classify a named entity. For instance, the location in the text "Tỉnh Quảng_Ninh đang phải hứng chịu trận mưa lịch sử / Quảng_Ninh province is experiencing historic rainfall." can be recognized by the rule:

(2) location_marker capitalized_word => location_marker LOCATION
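As a rough illustration, rules of this shape can be sketched as regular expressions in Python. The marker lists and the tagging function below are hypothetical simplifications for this example, not the actual rules used in the cited systems:

```python
import re

# Illustrative marker lists; the thesis does not enumerate them.
TITLES = r"(?:Ông|Bà|Mr|Dr)"
LOC_MARKERS = r"(?:Tỉnh|Thành_phố)"

# Rule (1): title capitalized_word => title PERSON
person_rule = re.compile(rf"({TITLES})\s+([A-ZÀ-Ỹ]\w*)")
# Rule (2): location_marker capitalized_word => location_marker LOCATION
location_rule = re.compile(rf"({LOC_MARKERS})\s+([A-ZÀ-Ỹ]\w*)")

def tag(text):
    """Apply the two context rules, wrapping matches in <PER>/<LOC> tags."""
    text = person_rule.sub(r"\1 <PER>\2</PER>", text)
    text = location_rule.sub(r"\1 <LOC>\2</LOC>", text)
    return text

tagged = tag("Ông Ôn_Gia_Bảo đã đến thăm Tỉnh Quảng_Ninh")
```

A real rule-based system would use far richer patterns, gazetteers and cascaded grammars (e.g., Jape), but the left-context idea is the same.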
The rule-based approach used in this project achieves high performance, with an F-measure of 83%. The VN-KIM IE system uses Jape grammars to recognize entities of various types (Organization, Location, Person, Date, Time, Money and Percent) with an F-measure of 81%.
Compared with machine learning approaches, the advantage of the rule-based approach is that it does not need a large annotated data set. This means the system can be activated and produce results immediately after the rules are constructed.
However, a major disadvantage of the rule-based method is that a certain set of rules may work well for a certain domain but may not be suitable for other domains. This means the entire knowledge base of a rule-based system has to be rewritten to fit the requirements of a new domain. Furthermore, constructing any rule base of sufficient size is very expensive in terms of time and money.
2.2 Machine learning methods
Machine learning methods are used in many NER systems for different languages. These methods include HMM, the Maximum Entropy Model (MEM), SVM and CRF. This section describes these methods in detail.
2.2.1 Hidden Markov Model
HMMs were introduced in the 1960s and are now applied in many research fields such as speech recognition, bioinformatics and natural language processing. HMMs are probabilistic finite state models with parameters for state-transition probabilities and state-specific observation probabilities. They allow estimating the probabilities of unobserved events and are very common in Natural Language Processing.
Definition of HMM
Figure 2.1: Directed graph representing an HMM
In Figure 2.1, $s_i$ is the state at time $t = i$ in the state chain $S$, and $o_i$ is the corresponding observation.

The underlying states follow a Markov chain where, given the current state, the future state is independent of the past:

$$P(s_{t+1} \mid s_1, \dots, s_t) = P(s_{t+1} \mid s_t) \tag{2.1}$$

and the transition probabilities are

$$a_{kl} = P(s_{t+1} = l \mid s_t = k).$$

Here $k, l = 1, 2, \dots, M$, where $M$ is the total number of states. The initial probabilities of the states are

$$\pi_k = P(s_1 = k)$$

for any $k$, with the $\pi_k$ summing to one.

Given the state $s_t$, the observation $o_t$ is independent of the other observations and states. For a fixed state, the observation is generated according to a fixed probability law: given state $k$, the probability law of $o_t$ is specified by

$$b_k(o_t) = P(o_t \mid s_t = k).$$

In summary, an HMM is fully specified by the parameter triple $\lambda = (A, B, \pi)$.
Model Parameters
The parameters of an HMM $\lambda = (A, B, \pi)$ are:

• Transition probabilities: a set $A$ of values $a_{ij}$, the probability of transitioning from state $s_i$ to state $s_j$.
• Initial probabilities: a set $\pi$ of values $\pi_i$, where $\pi_i$ is the probability that $s_i$ is a start state, for each state $s_i$.
• Emission probabilities: a set $B$ of functions of the form $b_i(o_t)$, which is the probability of observation $o_t$ being emitted by state $s_i$.
Model Learning
Up to now we have assumed that we know the underlying model $\lambda = (A, B, \pi)$.

We want to maximize the parameters with respect to the current data, i.e., given an observation sequence $O$, we are looking for a model $\lambda'$ such that $P(O \mid \lambda')$ is maximal.

Unfortunately, there is no known way to analytically find a global maximum, i.e., a model $\lambda^*$ such that $P(O \mid \lambda^*) \ge P(O \mid \lambda)$ for all $\lambda$. But it is possible to find a local maximum: given an initial model $\lambda$, we can always find a model $\lambda'$ such that $P(O \mid \lambda') \ge P(O \mid \lambda)$.
Parameter Re-estimation
Use the forward-backward (or Baum-Welch) algorithm, a hill-climbing algorithm. Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters, improving the probability that a given observation sequence is generated by the new parameters. Three sets of parameters need to be re-estimated:
• Initial state distribution: $\pi_i$
• Transition probabilities: $a_{i,j}$
• Emission probabilities: $b_i(o_t)$
Re-estimation of transition probabilities:

What is the probability of being in state $s_i$ at time $t$ and going to state $s_j$, given the current model and parameters? Using the forward probabilities $\alpha_t(i)$ and backward probabilities $\beta_t(j)$, it is

$$\xi_t(i, j) = P(s_t = i, s_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}.$$

Figure 2.2: How to compute transition probabilities

The intuition behind the re-estimation equation for transition probabilities is

$$\hat{a}_{ij} = \frac{\text{expected number of transitions from } s_i \text{ to } s_j}{\text{expected number of transitions from } s_i}.$$

Formally:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{j'} \xi_t(i, j')}.$$
Estimation of initial state probabilities:

In the initial state distribution, $\pi_i$ is the probability that $s_i$ is a start state. Re-estimation is easy, since it is the expected frequency of state $s_i$ at time $t = 1$. Formally:

$$\hat{\pi}_i = \gamma_1(i) = P(s_1 = i \mid O, \lambda).$$
Estimation of emission probabilities:

With $\gamma_t(i) = P(s_t = i \mid O, \lambda)$, the emission probabilities are re-estimated as

$$\hat{b}_i(k) = \frac{\sum_{t \,:\, o_t = k} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)},$$

i.e., the expected number of times observation symbol $k$ is emitted from state $s_i$, divided by the expected number of times state $s_i$ is visited.
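To make the HMM quantities concrete, the following sketch implements the forward algorithm, which computes P(O | λ) and underlies the re-estimation formulas above. The two-state model and its probabilities are hypothetical toy values, not from the thesis:

```python
def forward(obs, pi, A, B):
    """Forward algorithm: returns P(obs | model).

    pi[k]   : initial probability of state k
    A[k][l] : transition probability from state k to state l
    B[k][o] : probability that state k emits observation symbol o
    """
    n_states = len(pi)
    # Initialization: alpha_1(k) = pi_k * b_k(o_1)
    alpha = [pi[k] * B[k][obs[0]] for k in range(n_states)]
    # Induction: alpha_{t+1}(l) = (sum_k alpha_t(k) * a_{kl}) * b_l(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[k] * A[k][l] for k in range(n_states)) * B[l][o]
                 for l in range(n_states)]
    # Termination: P(O | model) = sum_k alpha_T(k)
    return sum(alpha)

# Hypothetical two-state model over two observation symbols (0 and 1).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p_obs = forward([0, 1, 0], pi, A, B)
```

The backward pass is symmetric, and Baum-Welch combines the two to compute the expected counts used in the re-estimation equations.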
Limitation of HMM

In sequence problems, we assume that the state at time $t$ depends only on the previous state. However, this assumption is not enough to represent the relationships between factors in the sequence.
2.2.2 Support Vector Machine
An SVM is a system trained to classify input data into categories. The SVM is trained on a large corpus containing marked samples of the categories. Each training sample is represented by a point plotted in a hyperspace. The SVM then attempts to draw a hyperplane between the categories, splitting the points as evenly as possible. For the two-dimensional case, this is illustrated in Figure 2.3.
Figure 2.3: A two-dimensional SVM, with the hyperplanes representing the margin drawn as dashed lines
Suppose that there is training data for the SVM in the form of $n$ $k$-dimensional real vectors $\mathbf{x}_i$ and integers $y_i$, where $y_i$ is either $1$ or $-1$. Whether $y_i$ is positive or negative indicates the category of vector $i$. The aim of the training phase is to plot the vectors in a $k$-dimensional hyperspace and draw a hyperplane which as evenly as possible separates the points of the two categories.

Suppose that this hyperplane has normal vector $\mathbf{w}$. Then the hyperplane can be written as the points $\mathbf{x}$ satisfying

$$\mathbf{w} \cdot \mathbf{x} - b = 0,$$

where $b/\|\mathbf{w}\|$ is the offset of the hyperplane from the origin along $\mathbf{w}$. We choose this hyperplane so that it maximizes the margin between the points representing the two categories. Imagine two hyperplanes lying at the "border" of the two regions, in each of which there are only points of one category. These two hyperplanes are perpendicular to $\mathbf{w}$ and cut through the outermost training data points in their respective regions. Two such planes are illustrated as dashed lines in Figure 2.3. Maximizing the margin between the points representing the two categories can be considered as keeping these two hyperplanes as far apart as possible. The training data points which end up on the dashed lines in Figure 2.3 are called Support Vectors, hence the name Support Vector Machine. The hyperplanes can be described by the equations

$$\mathbf{w} \cdot \mathbf{x} - b = 1$$

and

$$\mathbf{w} \cdot \mathbf{x} - b = -1.$$

The distance between the two hyperplanes is $2/\|\mathbf{w}\|$. Since the SVM wants to maximize the margin, we need to minimize $\|\mathbf{w}\|$. We also do not want training data points to lie inside the margin, so the following constraints are added:

$$\mathbf{w} \cdot \mathbf{x}_i - b \ge 1$$

for $\mathbf{x}_i$ in the first category, and

$$\mathbf{w} \cdot \mathbf{x}_i - b \le -1$$

for $\mathbf{x}_i$ in the second category. This can be rewritten as the optimization problem

$$\min_{\mathbf{w}, b} \|\mathbf{w}\| \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1, \; i = 1, \dots, n.$$

If we replace $\|\mathbf{w}\|$ with $\frac{1}{2}\|\mathbf{w}\|^2$, Lagrange multipliers can be used to rewrite this optimization problem into the following quadratic optimization problem:

$$\min_{\mathbf{w}, b} \max_{\boldsymbol{\alpha} \ge 0} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \right] \right\}$$
where the $\alpha_i$ are Lagrange multipliers [20]. Data sets which can be divided into two in this way are called linearly separable. Depending on how the data is arranged, this may not be possible. It is, however, possible to use an alternative model involving a soft margin [24]. The soft margin model allows for a minimal number of mislabeled examples. This is done by introducing a slack variable $\xi_i$ for each training data vector. The function to be minimized is modified by adding a term representing the slack variables. This can be done in several ways, but a common way is to introduce a linear function, so that the problem is to minimize

$$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$

for some constant $C$. To this minimization, the following modified constraint is added:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
By using Lagrange multipliers as before, the problem can be rewritten as

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i} \xi_i - \sum_{i} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 + \xi_i \right] - \sum_{i} \beta_i \xi_i \right\}$$

for $\alpha_i, \beta_i \ge 0$ [3]. To get rid of the slack variables, one can also rewrite this problem into its dual form:

$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$

subject to the constraints

$$0 \le \alpha_i \le C$$

and

$$\sum_{i=1}^{n} \alpha_i y_i = 0.$$

Figure 2.4: The mapping of input data from the input space into an infinite-dimensional Hilbert space in a non-linear SVM classifier. Source [17]
It is worth mentioning that there are also non-linear classifiers. Linear classifiers require that the data be linearly separable, or nearly so, in order to give good results. In a non-linear classifier, the input vectors are transformed so as to lie in an infinite-dimensional Hilbert space, where it is always possible to linearly separate the two data categories.
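As an illustration of the soft-margin formulation, the following sketch trains a linear SVM by stochastic subgradient descent on the primal objective 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i - b)). This is a simplification of how SVM solvers actually work (production solvers typically optimize the dual), and the data and hyperparameters are hypothetical toy values:

```python
import random

def train_linear_svm(xs, ys, C=1.0, lr=0.01, epochs=200, seed=0):
    """Stochastic subgradient descent on the primal soft-margin objective
    0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i - b))."""
    rng = random.Random(seed)
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = ys[i] * (sum(wj * xj for wj, xj in zip(w, xs[i])) - b)
            if margin < 1:
                # Hinge term active: subgradient is (w - C*y_i*x_i) in w
                # and C*y_i in b.
                w = [wj - lr * (wj - C * ys[i] * xj)
                     for wj, xj in zip(w, xs[i])]
                b -= lr * C * ys[i]
            else:
                # Only the regularizer 0.5*||w||^2 contributes.
                w = [wj - lr * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Classify by the sign of the decision function w . x - b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) - b >= 0 else -1

# Toy, linearly separable 2-D data (hypothetical).
xs = [[2.0, 2.0], [2.5, 1.5], [3.0, 2.5],
      [-2.0, -2.0], [-1.5, -2.5], [-3.0, -2.0]]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(xs, ys)
```

The subgradient form is convenient because it needs no quadratic-programming machinery, at the cost of only approximating the exact maximum-margin solution.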