UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY, HANOI
MASTER'S THESIS IN INFORMATION TECHNOLOGY
Supervisor: Assoc. Prof. Dr. Le Anh Cuong
Ha Noi - 2015
Originality statement
"I hereby declare that the work contained in this thesis is my own and has not been previously submitted for a degree or diploma at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person, except where due references or acknowledgements are made."
Signature:
Supervisor's approval
"I hereby approve that the thesis in its current form is ready for committee examination as a requirement for the Master of Computer Science degree at the University of Engineering and Technology."
Signature:
Abstract
Named Entity Recognition (NER) aims to extract and classify words in documents into pre-defined entity classes. NER has been studied extensively for popular languages such as English, Chinese, etc. However, NER for Vietnamese is still challenging due to the characteristics of the language and the lack of Vietnamese corpora.

In this thesis, we study approaches to NER including handcrafted rules, machine learning, and hybrid methods. We present challenges in NER for Vietnamese, such as the lack of a standard evaluation corpus and of standard methods for constructing data sets. Especially, we focus on labeling entities in Vietnamese, since most studies have not presented the details of handcrafting entities in Vietnamese. We apply supervised machine learning methods for Vietnamese NER based on Conditional Random Fields and Support Vector Machines, with changes in feature selection suitable for Vietnamese. The evaluation shows that these methods outperform other traditional methods in NER, such as Hidden Markov Models and rule-based methods.
Acknowledgement
First, I would like to thank my supervisor, Assoc. Prof. Dr. Le Anh Cuong, for his advice and support. This thesis would not have been possible without him and without the freedom and encouragement he has given me over the last two years I spent at the Faculty of Technology, University of Engineering and Technology, Vietnam National University (VNU), Ha Noi.
I have been working with amazing friends in the K19CS class. I dedicate my gratitude to each of them: Tai Pham Dinh, Tuan Dinh Vu, and Nam Thanh Pham. I would especially like to thank the teachers at the University of Engineering and Technology, VNU, for the collaboration, great ideas, and feedback during my dissertation.

Finally, I thank my parents and my brother, Hoang Le, for their encouragement, advice, and support. I especially thank my wife, Linh Thi Nguyen, and my lovely daughter, Ngoc Khanh Le, for their endless love and sacrifice over the last two years. They gave me strength and encouragement to do this thesis.
Ha Noi, September 2015
Table of Contents

Chapter 1 Introduction
1.1 Information extraction
1.2 Named entity recognition
1.3 Evaluation for NER
1.4 Our work
Chapter 2 Approaches to Named Entity Recognition
2.1 Rule-based methods
2.2 Machine learning methods
3.2 Feature selection for NER
3.2.1 Feature selection methods
3.2.2 Mask method
4.2 CRF method
4.3 Experimental results
4.4 An example of experimental results and error analysis
Chapter 5 Conclusion
References
List of Figures
Figure 1.1: Example of automatically extracted information from a news article on a terrorist attack. Source: [18]
Figure 1.2: A named entity recognition system
Figure 2.1: Directed graph representing an HMM
Figure 2.2: How to compute transition probabilities
Figure 2.3: A two-dimensional SVM, with the hyperplanes representing the margin drawn as dashed lines
Figure 2.4: The mapping Φ of input data from the input space into an infinite-dimensional Hilbert space in a non-linear SVM classifier. Source: [17]
Figure 3.1: A taxonomy of feature selection methods. Source: [21]
Figure 4.1: Generating training data stages
List of Tables

Table 3.1: Document and corpus features
Table 3.2: Orthographic features for Vietnamese
Table 3.3: Lexical and POS features for Vietnamese
Table 4.1: Statistics of training data at the entity level
Table 4.2: An example of a sentence in training data
Table 4.3: Results on testing data of the SVM learner
Table 4.4: Results on testing data of NER using the CRF method
List of Abbreviations
Abbreviation   Stands for
IE             Information Extraction
MUC            Message Understanding Conferences
CoNLL          Conferences on Natural Language Learning
MET            Multilingual Entity Tasks
SIGNLL         Special Interest Group on Natural Language Learning
Chapter 1 Introduction

1.1 Information extraction

Information Extraction (IE) is the task of automatically extracting structured information from unstructured text. For instance, we are interested in extracting information on violent events from online news, which involves the identification of the main actors of the event, its location, and the number of people affected [18]. Figure 1.1 shows an example of a text snippet from a news article about a terrorist attack and the structured information derived from that snippet. The process of extracting such structured information involves the identification of certain small-scale structures, such as noun phrases denoting a person or a group of persons, geographical references, and numerical expressions, as well as finding semantic relations between them. However, in this scenario some domain-specific knowledge is required (e.g., understanding the fact that terrorist attacks might result in people being killed or injured) in order to correctly aggregate the partially extracted information into a structured form.
"Three bombs have exploded in north-eastern Nigeria, killing 25 people and wounding 12 in an attack carried out by an Islamic sect. Authorities said the bombs exploded on Sunday afternoon in the city of Maiduguri."
Figure 1.1: Example of automatically extracted information from a news article on a terrorist attack. Source: [18]
Starting from 1987, a series of Message Understanding Conferences (MUC) has been held, focusing on the following domains:
- MUC-1 (1987), MUC-2 (1989): Naval operations messages
- MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries
- MUC-5 (1993): Joint ventures and microelectronics domain
- MUC-6 (1995): News articles on management changes
- MUC-7 (1998): Satellite launch reports
The significance of IE is related to the growing amount of information available in unstructured form. Tim Berners-Lee, the inventor of the World Wide Web (WWW), refers to the existing Internet as the web of documents and advocates that more content be made available as a web of data. Until this transpires, the web largely consists of unstructured documents without semantic metadata. Knowledge contained in these documents can be made more accessible for machine processing by transforming the information into relational form, or by marking it up with XML tags. For instance, an intelligent agent monitoring a news data feed requires IE to transform unstructured data (i.e., text) into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the extracted information.
IE on text aims at creating a structured view, i.e., a representation of the information that is machine-understandable. According to [18], the classical IE tasks include:
Named Entity Recognition addresses the problem of the identification (detection) and classification of predefined types of named entities, such as organizations (e.g., 'World Health Organisation'), persons (e.g., 'Muammar Kaddafi'), place names (e.g., 'the Baltic Sea'), temporal expressions (e.g., '1 September 2011'), numerical and currency expressions (e.g., '20 Million Euros'), etc.
Co-reference Resolution requires the identification of multiple (co-referring) mentions of the same entity in the text. For example, "International Business Machines" and "IBM" refer to the same real-world entity. If we take the two sentences "Mr. Smith likes fishing. But he doesn't like biking", it would be beneficial to detect that "he" refers to the previously detected person "Mr. Smith".
Relation Extraction focuses on detecting and classifying predefined relationships between entities identified in text. For example:
- PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.")
- PERSON located in LOCATION (extracted from the sentence "Bill is in France.")
Event Extraction refers to the identification of events in free text and deriving
detailed and structured information about them. Ideally, it should identify who did what to whom, when, where, through what methods (instruments), and why. Normally, event extraction involves the extraction of several entities and the relationships between them.
For instance, extraction of information on terrorist attacks from the text fragment
‘Masked gunmen armed with assault rifles and grenades attacked a wedding party in
mainly Kurdish southeast Turkey, killing at least 44 people.’ would reveal the
identification of perpetrators (masked gunmen), victims (people), number of
killed/injured (at least 44), weapons used (rifles and grenades), and location (southeast
Turkey).
1.2 Named entity recognition
In IE, named entity recognition (NER) is slightly more complex. The named entities (e.g., the location, actors, and targets of a terrorist event) need to be recognized as such. This NER task (also known as 'proper name classification') involves the identification and classification of named entities: people, places, organizations, products, companies, and even dates, times, or monetary amounts. For example, Figure 1.2 demonstrates an English NER system which identifies and classifies entities in text documents. It identifies four entity types: person, location, organization, and misc.
[Figure 1.2: screenshot of a named entity recognition demo ("Named Entity Recognition Demo Results") applied to a news report on the Apollo 11 moon landing, with recognized entities marked inline, e.g., [LOC Houston], [PER Mr Armstrong], [ORG Air Force], and [MISC Apollo 11].]
Figure 1.2: A named entity recognition system. Source*
* http://www.iti.illinois.edu/tech-transfer/technologies/natural-language-processing-nlp
Previous NER studies mainly focus on popular languages such as English, French, Spanish, and Japanese. Methods developed in these studies are based on supervised learning. Tri et al. [22] performed NER using Support Vector Machines (SVM) and obtained an overall F-score of 87.75%. The VN-KIM IE system* builds an ontology and then applies it in JAPE grammars to define target named entities on the web. Nguyen Ba Dat et al. [7] employed a rule-based approach using the JAPE grammar plug-in of the GATE framework for NER. For example, the text "Chủ tịch nước Trương Tấn Sang sang thăm chính thức Nhật Bản vào ngày 20/7/2015" will be annotated as follows:

Chủ tịch nước <PER>Trương Tấn Sang</PER> sang thăm chính thức <LOC>Nhật Bản</LOC> vào ngày 20/7/2015 / President Truong Tan Sang paid an official visit to Japan on
20/07/2015.
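Before training, annotated text in this inline form is typically converted into token-level labels (the B_/I_-prefixed tags used later in this thesis). The following is a minimal illustrative sketch, not the thesis's actual preprocessing tool; splitting on whitespace and ignoring Vietnamese word segmentation are simplifying assumptions.

```python
import re

def inline_to_labels(annotated):
    """Convert inline <PER>...</PER>-style annotations into
    (token, label) pairs using B_/I_ prefixes. A simplifying sketch:
    it splits on whitespace and ignores Vietnamese word segmentation."""
    pairs = []
    for tag, inside, outside in re.findall(r'<(\w+)>(.*?)</\1>|([^<]+)',
                                           annotated):
        if tag:  # tokens inside an entity span
            tokens = inside.split()
            pairs.append((tokens[0], 'B_' + tag))
            pairs.extend((tok, 'I_' + tag) for tok in tokens[1:])
        else:    # tokens outside any entity span
            pairs.extend((tok, 'O') for tok in outside.split())
    return pairs

print(inline_to_labels('Chủ tịch nước <PER>Trương Tấn Sang</PER> sang thăm '
                       'chính thức <LOC>Nhật Bản</LOC> vào ngày 20/7/2015'))
# [('Chủ', 'O'), ..., ('Trương', 'B_PER'), ('Tấn', 'I_PER'),
#  ('Sang', 'I_PER'), ..., ('Nhật', 'B_LOC'), ('Bản', 'I_LOC'), ...]
```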
1.3 Evaluation for NER
To evaluate an NER algorithm, many metrics developed in data mining and machine learning can be used. These metrics measure the frequency with which an algorithm makes correct or incorrect identifications and classifications of named entities. The most common measures are:

Precision: measures the ratio of relevant items selected to the total number of items selected, i.e., the fraction of the entities proposed by the system that are correct.
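For concreteness, the sketch below computes precision, together with the closely related recall and F-score used when reporting results later, over sets of entity mentions. Representing each mention as a (start, end, type) triple with exact-match scoring is an assumption of this example.

```python
def ner_scores(gold, predicted):
    """Precision, recall and F1 over entity mentions. Each mention is a
    (start, end, type) triple; only exact matches count as correct."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Two gold mentions; the system proposed two, one of which is correct:
print(ner_scores(gold=[(0, 3, 'PER'), (7, 9, 'LOC')],
                 predicted=[(0, 3, 'PER'), (4, 6, 'ORG')]))  # (0.5, 0.5, 0.5)
```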
* http://www.cse.hcmut.edu.vn/~tru/VN-KIM/
evaluation. This data set consists of 10,000 Vietnamese sentences from the VLSP* project. First, the raw data is processed to obtain word-segmented data. Then the word-segmented data is manually annotated to obtain NER data. In another stage, the word-segmented data is automatically tagged by a part-of-speech (POS) tagging tool. Finally, the data from these two stages are combined into the final data set used for our experiments.
We study feature extraction methods to improve system performance and learning time. For the Vietnamese NER problem, we focus on the characteristics of the Vietnamese language and the method to select features. We compare the performance of our method with well-known methods such as CRF and SVM.

The rest of this thesis is organized as follows. Chapter 2 introduces approaches to NER. Chapter 3 presents how to extract features from data for training in supervised machine learning, the characteristics of the Vietnamese language, and features for Vietnamese NER. Chapter 4 shows our experiments on NER in Vietnamese documents. Finally, in Chapter 5, we summarize our thesis and discuss future work.
* http://vlsp.vietlp.org:8080
Chapter 2 Approaches to Named Entity Recognition

This chapter reviews popular methods for the detection and classification of named entities. The chapter is organized as follows. In Section 2.1, we present rule-based methods. In Section 2.2, machine learning methods and their variations are described in detail. Section 2.3 shows hybrid methods for recognizing named entities. The methods presented in this chapter are the basis of our approach for Vietnamese NER.

2.1 Rule-based methods
These methods rely on the intuition of the human designers who assemble a large number of rules capturing intuitive notions. For example, in many languages person names are usually preceded by some kind of title. Thus, the name in the text "Ông Ôn Gia Bảo đã đến thăm Việt_Nam vào năm 2004 / Mr. Wen Jiabao visited Vietnam in 2004" can easily be discovered by a rule like [21]:

title + capitalized word => title + PERSON

In another example, the left or right context of an expression is used to classify a named entity. For instance, the location in the text "Tỉnh Quảng Ninh đang phải hứng chịu trận mưa lịch sử / Quảng Ninh province is suffering a historic rainfall" can be recognized by the rule:

location marker + capitalized word => location marker + LOCATION
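Such rules map directly onto regular expressions. The sketch below is only an illustration: the tiny title and location-marker lexicons are hypothetical, the uppercase test is approximate, and a real rule-based system such as [6] relies on much larger dictionaries and richer patterns.

```python
import re

# Hypothetical, tiny lexicons; a real system would use far larger lists.
# Input is assumed word-segmented (multi-syllable units joined by '_').
TITLES = r'(?:Ông|Bà|Mr|Mrs|Dr)'
LOC_MARKERS = r'(?:Tỉnh|Thành_phố|Huyện)'

# Rule 1: title + capitalized word => title + PERSON
person_rule = re.compile(TITLES + r'\s+([A-ZÀ-Ỹ]\w*)')
# Rule 2: location marker + capitalized word => location marker + LOCATION
location_rule = re.compile(LOC_MARKERS + r'\s+([A-ZÀ-Ỹ]\w*)')

text = ('Ông Ôn_Gia_Bảo đã đến thăm Việt_Nam vào năm 2004. '
        'Tỉnh Quảng_Ninh đang phải hứng chịu trận mưa lịch sử.')
print([('PER', m.group(1)) for m in person_rule.finditer(text)])    # Ôn_Gia_Bảo
print([('LOC', m.group(1)) for m in location_rule.finditer(text)])  # Quảng_Ninh
```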
In many studies, rule-based methods are combined with other techniques, for example to filter news by fine-grained knowledge-based categorization [2].

In [6], a rule-based system for Vietnamese NER is used in which rules are incrementally created while annotating a named entity corpus. It is the first publicly available open-sourced project for building an annotated named entity corpus. The NER system built in this project achieves high performance with an F-measure of 83%. The VN-KIM IE system uses JAPE grammars to recognize entities of various types (Organization, Location, Person, Date, Time, Money, and Percent) with an F-measure of 81%.

In comparison with machine learning approaches, the advantage of the rule-based approach is that it does not need a large amount of annotated data. That means the system can be activated and produce results immediately after the rules are constructed.
However, a major disadvantage of the rule-based method is that a certain set of rules may work well for a certain domain but may not be suitable for other domains. That means the entire knowledge base of a rule-based system has to be rewritten to fit the requirements of a new domain. Furthermore, constructing any rule base of sufficient size is very expensive in terms of time and money.
2.2 Machine learning methods
Machine learning methods are used in many NER systems for different languages. These methods include the Hidden Markov Model (HMM), the Maximum Entropy Model (MEM), SVM, and CRF. This section describes these methods in detail.
2.2.1 Hidden Markov Model
HMMs were introduced in the 1960s and are now applied in many research fields such as speech recognition, bioinformatics, and natural language processing. HMMs are probabilistic finite-state models with parameters for state-transition probabilities and state-specific observation probabilities; they allow estimating the probabilities of unobserved events and are very common in natural language processing.
Definition of HMM
Given a token (observation) sequence $O = (o_1, o_2, \dots, o_T)$, the goal is to find the stochastically optimal tag sequence $S = (s_1, s_2, \dots, s_T)$ of hidden states.
Figure 2.1: Directed graph representing an HMM
In Figure 2.1, $s_t$ is the state at time $t$ in the state chain $S$.

The underlying states follow a Markov chain where, given the current state, the future state is independent of the past:

$$P(s_{t+1} \mid s_t, s_{t-1}, \dots, s_0) = P(s_{t+1} \mid s_t)$$

and the transition probabilities are

$$a_{kl} = P(s_{t+1} = l \mid s_t = k)$$
Given the state $s_t$, the observation $o_t$ is independent of the other observations and states. For a fixed state, the observation $o_t$ is generated according to a fixed probability law: given state $k$, the probability law of $o_t$ is specified by $b_k(o)$.
The parameters of an HMM are:
- Transition probabilities: $A = \{a_{kl}\}$, $k, l = 1, \dots, M$. Each $a_{kl}$ represents the probability of transitioning from state $s_k$ to $s_l$.
- Initial probabilities: $\pi_k$, $k = 1, \dots, M$; $\pi_k$ is the probability that $s_k$ is a start state.
- Emission probabilities: a set $B$ of functions of the form $b_k(o)$, giving, for each state $k$, the probability of observation $o_t$ being emitted by $s_k$.
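Concretely, an HMM is the parameter triple $\lambda = (A, B, \pi)$. The following minimal sketch instantiates these parameters for a toy NER tag set; the state names, vocabulary, and all probabilities are made up purely for illustration.

```python
import numpy as np

# A toy HMM lambda = (A, B, pi) for NER tagging. States are entity labels,
# observations are indices into a toy vocabulary. All numbers are made up.
states = ['B_LOC', 'I_LOC', 'O']          # M = 3 hidden states
vocab = ['Hà_Nội', 'Nội_Bài', 'thăm']     # observation symbols

pi = np.array([0.3, 0.0, 0.7])            # pi_k = P(s_1 = k)
A = np.array([[0.1, 0.6, 0.3],            # a_kl = P(s_{t+1} = l | s_t = k)
              [0.1, 0.4, 0.5],
              [0.3, 0.0, 0.7]])
B = np.array([[0.7, 0.2, 0.1],            # b_k(o) = P(o | state k)
              [0.3, 0.6, 0.1],
              [0.1, 0.1, 0.8]])

# pi and each row of A and B are probability distributions:
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)
```

Note that $\pi$ for I_LOC is zero in this toy model: a label sequence cannot start inside an entity.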
The learning problem is to find the model $\lambda$ that maximizes the probability of the observation sequence:

$$\lambda^* = \arg\max_{\lambda} P(O \mid \lambda)$$

Unfortunately, there is no known way to analytically find such a global maximum, i.e., a model $\lambda^*$ satisfying the equation above. But it is possible to find a local maximum: given an initial model $\lambda$, we can always find a model $\lambda'$ such that

$$P(O \mid \lambda') \geq P(O \mid \lambda)$$
Parameter Re-estimation

We use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm. Starting from an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability with which the given observation sequence is generated by the new parameters. Three sets of parameters need to be re-estimated:
- Initial state distribution: $\pi_k$
- Transition probabilities: $a_{kl}$
- Emission probabilities: $b_k(o_t)$

Re-estimation of transition probabilities: what is the probability of being in state $s_k$ at time $t$ and going to state $s_l$, given the current model and parameters?

$$\xi_t(k, l) = P(s_t = k,\, s_{t+1} = l \mid O, \lambda)$$
Figure 2.2: How to compute transition probabilities
The intuition behind the re-estimation equation for transition probabilities is

$$\hat{a}_{kl} = \frac{\text{expected number of transitions from state } s_k \text{ to state } s_l}{\text{expected number of transitions from state } s_k}$$

which, in terms of $\xi$, is $\hat{a}_{kl} = \sum_t \xi_t(k, l) \big/ \sum_t \sum_{l'} \xi_t(k, l')$.

Estimation of initial state probabilities: the initial state distribution $\pi_k$ is the probability that $s_k$ is a start state. Re-estimation is easy: $\pi'_k$ is the expected number of times in state $s_k$ at time 1. Formally, $\pi'_k = \gamma_1(k)$, where $\gamma_t(k)$ is the probability of being in state $s_k$ at time $t$.

Estimation of emission probabilities: emission probabilities are re-estimated as

$$\hat{b}_k(v_m) = \frac{\sum_t \gamma_t(k)\, \delta(o_t, v_m)}{\sum_t \gamma_t(k)}$$

where $\delta(o_t, v_m) = 1$ if $o_t = v_m$, and 0 otherwise.
Note that $\delta$ here is the Kronecker delta function and is not related to the $\delta$ in the discussion of the Viterbi algorithm [19].
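Written in code, the transition update above is compact. The sketch below is illustrative only: it assumes the forward and backward probability matrices (alpha and beta, with alpha[t, k] corresponding to $\alpha_t(k)$) have already been computed by the forward-backward algorithm, and it re-estimates only the transition matrix.

```python
import numpy as np

def reestimate_transitions(A, B, alpha, beta, obs):
    """One Baum-Welch update of the transition matrix A.

    alpha[t, k] and beta[t, k] are assumed precomputed forward and
    backward probabilities for a model (A, B, pi); obs is the sequence
    of observation indices. Returns the re-estimated matrix A-hat."""
    T, M = alpha.shape
    xi = np.zeros((T - 1, M, M))
    for t in range(T - 1):
        # xi_t(k, l) = alpha_t(k) * a_kl * b_l(o_{t+1}) * beta_{t+1}(l),
        # normalized so that it is P(s_t = k, s_{t+1} = l | O, lambda).
        xi[t] = (alpha[t][:, None] * A
                 * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :])
        xi[t] /= xi[t].sum()
    # a'_kl = expected transitions k -> l / expected transitions out of k.
    return xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
```

The initial-state and emission updates, $\pi'_k = \gamma_1(k)$ and the $\hat{b}_k(v_m)$ formula above, can be implemented the same way from $\gamma_t(k) = \sum_l \xi_t(k, l)$.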
Updating the HMM model with these re-estimated parameters, we can apply it to NER: the observations are the words of a sentence, and the hidden states are named entity labels such as B_LOC and I_LOC. So we have to find the chain of labels (the labels of named entities) which describes the observed words with the highest probability.
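Finding that most probable label chain is exactly what the Viterbi algorithm [19] computes. Below is a minimal sketch under the same toy model as above (all numbers remain illustrative):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable hidden-state sequence: argmax_S P(S, O | lambda)."""
    T, M = len(obs), len(pi)
    delta = np.zeros((T, M))           # best path score ending in state k
    psi = np.zeros((T, M), dtype=int)  # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A  # scores[k, l]: from k, to l
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]        # best final state ...
    for t in range(T - 1, 0, -1):           # ... then follow back-pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Same toy model as in the parameter sketch above:
pi = np.array([0.3, 0.0, 0.7])
A = np.array([[0.1, 0.6, 0.3], [0.1, 0.4, 0.5], [0.3, 0.0, 0.7]])
B = np.array([[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.1, 0.1, 0.8]])
# Decode "thăm Hà_Nội" (observation indices 2 and 0):
print(viterbi(pi, A, B, obs=[2, 0]))  # -> [2, 0], i.e. labels O, B_LOC
```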
Limitation of HMM
In sequence problems, we assume that the state at time $t$ depends only on the previous state. However, this assumption is not sufficient to represent the relationships between factors in the sequence.
2.2.2 Support Vector Machine
An SVM is a system trained to classify input data into categories. The SVM is trained on a large training corpus containing marked samples of the categories. Each training sample is represented by a point plotted in a hyperspace. The SVM then attempts to draw a hyperplane between the categories, splitting the points as evenly as possible. For the two-dimensional case, this is illustrated in Figure 2.3.
Figure 2.3: A two-dimensional SVM, with the hyperplanes representing the margin drawn as dashed lines
Suppose that there is training data for the SVM in the form of $n$ $k$-dimensional real vectors $\mathbf{x}_i$ and integers $y_i$, where each $y_i$ is either 1 or -1. Whether $y_i$ is positive or negative indicates the category of vector $i$. The aim of the training phase is to plot the vectors in a $k$-dimensional hyperspace and draw a hyperplane which separates the points of the two categories as evenly as possible.
Suppose that this hyperplane has normal vector $\mathbf{w}$. Then the hyperplane can be written as the set of points $\mathbf{x}$ satisfying

$$\mathbf{w} \cdot \mathbf{x} - b = 0$$

where $b$ determines the offset of the hyperplane from the origin along $\mathbf{w}$. We choose this
hyperplane so that it maximizes the margin between the points representing the two categories. Imagine two hyperplanes lying at the "border" of the two regions, in each of which there are only points of one category. These two hyperplanes are perpendicular to $\mathbf{w}$ and cut through the outermost training data points in their respective regions. Two such planes are illustrated as dashed lines in Figure 2.3. Maximizing the margin between the points representing the two categories can be considered as keeping these two hyperplanes as far apart as possible. The training data points which end up on the dashed lines in Figure 2.3 are called Support Vectors, hence
the name Support Vector Machine. The two margin hyperplanes can be described by the equations

$$\mathbf{w} \cdot \mathbf{x} - b = 1 \quad \text{and} \quad \mathbf{w} \cdot \mathbf{x} - b = -1$$

and the distance between them is $2/\|\mathbf{w}\|$, so maximizing the margin amounts to minimizing $\|\mathbf{w}\|$. If we replace $\|\mathbf{w}\|$ with $\frac{1}{2}\|\mathbf{w}\|^2$, Lagrange multipliers can be used to rewrite this optimization problem into the following quadratic optimization problem:

$$\min_{\mathbf{w},\, b} \; \max_{\alpha_i \geq 0} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n} \alpha_i \big[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \big] \right\}$$

where the $\alpha_i$ are Lagrange multipliers [20]. Data sets which can be divided in this way are called linearly separable. Depending on how the data is arranged, this may not be possible. It is, however, possible to use an alternative model involving a soft margin [24]. The soft margin model allows for a minimal number of mislabeled examples. This is done by introducing a slack variable $\xi_i$ for each training data vector $\mathbf{x}_i$. The function to be minimized, $\frac{1}{2}\|\mathbf{w}\|^2$, is modified by adding a term representing the slack variables. This can be done in several ways, but a common way is to introduce a linear penalty, so that the problem is to minimize

$$\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$$

subject to

$$y_i(\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1 - \xi_i \quad (1 \leq i \leq n)$$
By using Lagrange multipliers as before, the problem can be rewritten as

$$\min_{\mathbf{w},\, \boldsymbol{\xi},\, b} \; \max_{\boldsymbol{\alpha},\, \boldsymbol{\beta}} \left\{ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \big[ y_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 + \xi_i \big] - \sum_{i=1}^{n} \beta_i \xi_i \right\}$$

for $\alpha_i, \beta_i \geq 0$ [3]. To get rid of the slack variables, one can also rewrite this problem into its dual form:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0$$

Figure 2.4: The mapping $\Phi$ of input data from the input space into an infinite-dimensional Hilbert space in a non-linear SVM classifier. Source: [17]
It is worth mentioning that there are also non-linear classifiers. While linear classifiers require that the data be linearly separable, or nearly so, in order to give good results, in a non-linear classifier the input vectors $\mathbf{x}_i$ are transformed so as to lie in an infinite-dimensional Hilbert space, where it is always possible to linearly separate the two data categories (see Figure 2.4).
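As a concrete illustration of the soft margin (the constant $C$ weighting the slack variables) and of non-linear classification, the sketch below trains a linear and an RBF-kernel SVM on synthetic data that is not linearly separable. The use of scikit-learn and the circular toy data set are assumptions of this example, not part of this thesis's experiments.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two classes that are NOT linearly separable: points inside vs. outside
# the unit circle.
X = rng.normal(size=(200, 2))
y = np.where((X ** 2).sum(axis=1) > 1.0, 1, -1)

# C is the soft-margin constant weighting the slack variables xi_i. The
# RBF kernel implicitly maps the data into a higher-dimensional space in
# which the two classes become linearly separable.
for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(kernel, 'training accuracy:', clf.score(X, y))
# The linear model stays near the majority-class baseline, while the RBF
# model separates the classes almost perfectly.
```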