Thesis: Building a semantic role labeling system for Vietnamese sentences


UNIVERSITY OF ENGINEERING AND TECHNOLOGY

VIETNAM NATIONAL UNIVERSITY, HANOI

NGUYEN THANH HUY

BUILDING A SEMANTIC ROLE LABELING SYSTEM FOR VIETNAMESE SENTENCES

Major: Computer Science Code: 60 48 01

MASTER THESIS

Supervised by: PhD Nguyen Phuong Thai

Hanoi - 2011

Chapter I: Introduction

3 Current studies on Function tagging

Chapter II: Related works

3 Selected Features

4 Features and Constraints

5.5 Maximum Entropy Principle

Corpora and Tools

Functional Labeling Precisions

Figure 7 Scenarios in constrained optimization

Figure 8 Pseudo-code for extracting function labels

Figure 9 An example of word cluster

Figure 10 Learning curve

Figure 11 The dependency between two function labels

List of Tables

Table 1 Functional Labeling Approaches

Table 2 Result of labeling by parsing approach following Collins model

Table 3 Function Tags on Viet Treebank

Table 4 Vietnamese Treebank statistics

Table 5 Evaluation of Vietnamese functional labeling system

Table 6 Increases in precision by using word cluster feature

Chapter I: Introduction

In this chapter, I introduce function tags and the value of function tags in NLP applications, some current approaches, the objective of the thesis, and our contributions. Finally, I describe the structure of the thesis.

1 Function tags

There are two kinds of tags in linguistics: syntactic tags and function tags. For syntactic tags there are several theories and research projects for English, Spanish, Chinese, and other languages [4][13][14][18]. This research mainly focuses on finding parts of speech and tagging constituents. Function tags are understood as abstract labels because they are not similar to syntactic labels. Whereas a syntactic label is a single notation for a group of words, a function tag represents the relationship between a phrase and its utterance in each different context. So the function tag of a phrase may change, depending on the context of its neighbors. For example, consider the phrase “baseball bat”. Its syntactic category is “noun phrase” (annotated as NP in most research). But its function tag might be subject in this sentence:

This baseball bat is very expensive.

In another case its function tag might be direct object:

I bought this baseball bat last month.

Or an instrument (the agent of a passive sentence):

That man was attacked by this baseball bat.

Function tags are discussed explicitly by Blaheta (2003) [2]. Much research focuses on how to assign function tags to a sentence. This kind of research problem is called the function tag labeling problem, a class of problems concerned with finding the semantic information of a phrase. To sum up, function tag labeling is the problem of finding the semantic information of a group of words and then tagging them with a given annotation in context.
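The baseball-bat example can be written down as data. The following is only an illustrative sketch: the class name `Constituent` and its fields are hypothetical, and the function tags are spelled out in words because no tag abbreviations have been introduced at this point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constituent:
    words: str          # surface text of the phrase
    syntactic_tag: str  # fixed by the phrase itself, e.g. NP
    function_tag: str   # depends on the sentence the phrase occurs in


# The same noun phrase in the three example sentences above.
examples = [
    Constituent("this baseball bat", "NP", "subject"),
    Constituent("this baseball bat", "NP", "direct object"),
    Constituent("this baseball bat", "NP", "instrument"),
]

syntactic_tags = {c.syntactic_tag for c in examples}
function_tags = [c.function_tag for c in examples]
print(syntactic_tags)  # a single syntactic tag
print(function_tags)   # one function tag per context
```

The syntactic tag stays the same across contexts, while the function tag has to be predicted per occurrence; this is what makes function tagging a labeling problem.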

2 Corpora for function tag labeling

Nowadays machine learning is the popular method for most modern problems, especially in Natural Language Processing. To build a machine learning system we need a training data set. There are function tag labeling corpora for languages such as English and Chinese. In English, two main corpora are used for semantic role labeling and function tag labeling: FrameNet (Baker, 1998; Fillmore and Baker, 2000) and PropBank (Palmer et al., 2005). The main idea of FrameNet is to group similar words together and then represent the relationships between groups in a network of groups. That is why it is called FrameNet. Figure 1 shows a small example branch of FrameNet.

Figure 1 Sample domain and frame element of FrameNet

The second corpus is PropBank (Palmer et al., 2005), which extends Penn Treebank with additional annotated information: function tags. Penn Treebank and PropBank are organized as sets of trees: each tree represents a sentence tagged with syntactic labels (Penn Treebank) or with both syntactic and functional labels (PropBank).

PropBank and Chinese Treebank are linguistic resources that have long been available for research. Viet Treebank¹, in contrast, was developed recently by Nguyen [17] using the experience and approaches of Penn Treebank. Hence, Viet Treebank has the same structure as Penn Treebank and Chinese Treebank: each word is a leaf node of a tree, and non-terminal nodes are tagged with a syntactic label or a functional label. Figure 2 shows an example from Viet Treebank with function tags.

¹ ietlp.org:8080/demo/?page=resources


Figure 2 A parsing with function tags in Viet Treebank
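To make the tree organization concrete, the sketch below reads a bracketed tree and lists the constituents that carry a function tag. It rests on assumptions: the Penn-style bracketing with labels such as `NP-SUB`, the helper names `parse` and `function_tags`, and the sample sentence are illustrative and are not taken from Viet Treebank itself.

```python
def parse(text):
    """Parse a Penn-style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def function_tags(tree):
    """Yield (syntactic label, function tag) for every tagged constituent."""
    label, children = tree
    if "-" in label:
        syntactic, function = label.split("-", 1)
        yield syntactic, function
    for child in children:
        if isinstance(child, tuple):
            yield from function_tags(child)


# Hypothetical sentence: "Tôi mua sách" (I buy books).
tree = parse("(S (NP-SUB (P Tôi)) (VP (V mua) (NP-DOB (N sách))))")
tagged = list(function_tags(tree))
print(tagged)
```

Splitting the label at the first hyphen is a simplification; a real treebank reader would also have to handle labels that carry several tags or special symbols.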

3 Current studies on Function tagging

Function tag labeling is an important processing step for many natural language processing applications such as question answering, information extraction, and summarization. Thus, some research has focused on the function tagging problem to recover additional semantic information, which is more useful than syntactic labels alone.

In 1997, Collins [7] introduced the idea of adding useful syntactic information and proposed a parser able to guess the complement tag. This parser, known as Collins's parser, is considered the first system to tag such labels. Function tag labeling was defined precisely by Blaheta (2003) [2]. His research used data from Penn Treebank II, which includes extra function tags. Following Blaheta's proposal, various investigations have focused on function tag labeling, such as Merlo and Musillo (2005), Blaheta and Charniak (2004), Chrupala and Genabith (2006), and Sun and Sui (2009). These studies extend the topic by addressing new languages such as Chinese, proposing new approaches, or investigating new features. Nowadays, there are three main approaches to the function tag labeling problem.

The first approach is called parsing: function labels are tagged during the parsing process. This approach is a modification of Collins's parser. Studies following this approach include Gabbard [9] and Marcus [17].

The second approach is called the labeling method, which includes two phases: extracting features and classifying function labels. This approach offers more techniques because of the diversity of classification techniques. The most typical research following this approach is Blaheta's [3]. In his research, he applied several techniques to show the impact of each on the function tag labeling problem.

The third approach is sequential labeling. In this approach, function tags are predicted from the observed chain of words (Yuan [23]). It is similar to the classification approach in how features are selected, but the difference is that it uses a prediction model instead of a classification model. These approaches are discussed in detail in the next chapter.

Today there is a class of problems that covers function tagging. It was described by Carreras (2004) [5] and is called Semantic Role Labeling. Semantic Role Labeling is similar to function tag labeling but works at a more abstract level. When building a Semantic Role Labeling system, the training data carry more information: they include not only time, location, manner, etc., but also object, instrument, agent, etc. This problem is a promising new research direction for NLP applications that need to understand the meaning of sentences.

4 Objective of the thesis

As mentioned above, assigning function tags has been widely researched, especially for English. Recently, some studies have addressed Spanish and Chinese. All function tagging systems have contributed to their corpora a semantic class that is very useful for other NLP applications such as Question Answering, Summarization, Information Retrieval, etc.

In recent years, Natural Language Processing research in Vietnam has developed rapidly. For Vietnamese in particular, many studies have focused on recognizing the syntax of sentences with a POS tagging system. Unfortunately, these NLP applications do not provide semantic information for a sentence, whereas some NLP applications need semantic information to answer the questions who, where, what, and whom.

To deal with this problem, our research focuses on building an automatic function tag labeler. In this thesis, which I provisionally call stage one, our research builds a function tagging system for Vietnamese, a problem shallower than Semantic Role Labeling (SRL). Our research tags only some basic function labels such as time, location, direction, etc. Other semantic roles such as agent, instrument, etc., which belong to the Semantic Role Labeling problem, will be added to our system in the future. Our system has two phases. First, we extract function tags from Viet Treebank, a treebank annotated with hand-crafted semantic labels. After that, we select features to train the classification model. Some features extend the studies of Blaheta [2] and Yuan [22]; we also propose a new feature that has a significant impact on the function tag labeling system. In later chapters, we introduce our system in detail: feature extraction, the selected model, building the new feature, etc.

5 Our contributions

As discussed in the previous section, we aim to build a system for Vietnamese sentences that is shallower than Semantic Role Labeling. To our knowledge, although there have been other investigations in NLP for Vietnamese, our research on function tag labeling may be the first.

Furthermore, our system provides a new tool for tagging functional labels in Viet Treebank. This will enrich Viet Treebank with automatic tagging instead of the hand-crafted tagging done before. Viet Treebank is one of the important resources for research in Natural Language Processing for Vietnamese. If it is enriched with function tags, other research will have more information on Vietnamese, especially semantic information.

Moreover, in this thesis I also introduce some approaches to the function tag labeling problem to show the techniques applied. Later researchers may build on these techniques. Because our system is the first for the function tag labeling problem, we expect it to provide a baseline for later function tag labeling research.

We define the function tagging problem in two phases. In phase one, we build a training data set from Viet Treebank and other resources, such as the Lao Dong and PC World newspapers. In the second phase, we apply a classification model [1] to obtain the function tag for each input constituent. These two phases are described as follows:

* In phase one: We extract the features described in Chapter 3 from Viet Treebank to build a training data set. Independently, we build a corpus of words grouped into clusters. We call that corpus a word cluster, and the process of building it word clustering. We do not claim that our selected features are the best selection. However, our experiments show that they are reliable enough to recognize a function tag. Some of the selected features draw on results from other studies such as Blaheta [2], Gabbard [9], and Yuan [23]; the remaining features are our own proposals.

* In phase two, selecting the classification model: in the NLP domain there are many techniques for classifying a data set. As a result, there are several classification models that can be used to recognize a function tag. In this step, we selected the Maximum Entropy Model (MEM) as our classification model because of its advantages. After that, we evaluate our system by calculating the precision and the distribution proportion for each function tag. The results are reported in Chapter 4.
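The evaluation step in phase two can be sketched as follows. This is a minimal illustration under assumptions: the function name `per_tag_precision` and the toy gold and predicted labels are invented, and "distribution proportion" is read here as the share of each tag among the gold labels.

```python
from collections import Counter


def per_tag_precision(gold, predicted):
    """Precision per tag: correct predictions of the tag / all predictions of it."""
    predicted_count = Counter(predicted)
    correct_count = Counter(p for g, p in zip(gold, predicted) if g == p)
    return {tag: correct_count[tag] / n for tag, n in predicted_count.items()}


# Toy data, one label per constituent (not real experimental output).
gold = ["SUB", "DOB", "TMP", "LOC", "SUB", "TMP"]
predicted = ["SUB", "DOB", "TMP", "TMP", "SUB", "LOC"]

precision = per_tag_precision(gold, predicted)
distribution = {tag: n / len(gold) for tag, n in Counter(gold).items()}
print(precision)
print(distribution)
```

Reporting precision next to the distribution helps to see whether a low score belongs to a frequent tag or to a rare one.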

In conclusion, our research is the first on the function tag labeling problem. It should encourage researchers who intend to work on semantics in NLP applications.

6 Thesis structure

In this section, we give a brief outline of the thesis. It is an overview of the following chapters, where we give more details.

Chapter 2: Related works

In this chapter we present further research related to our work and modern approaches to function tagging.

Chapter 3: The proposal

We propose our approach to function labeling for Vietnamese, including the model, data preparation, techniques, selected features, and the method for extracting features from Viet Treebank.

Chapter 4: Experiments

This chapter evaluates the effectiveness of our model. Because no similar function tagging system existed when this thesis was completed, we consider our model a baseline.

Chapter 5: Conclusions and Future works

In this chapter, we give general conclusions about our work, its advantages, and its restrictions. We also propose future work to improve our model.

Chapter 5 is followed by a list of related references.

Chapter II: Related works

In this chapter, I introduce some recent research on function tagging from several aspects: approach strategies, techniques, and experiments. As discussed above, there are three approaches to this problem. I describe one method for each approach in this chapter. For an overview, Table 1 summarizes the approaches to the function tag labeling problem.

Approach               Input            Technique                                     Features
1st: Gabbard et al.    word sequence    parsing (PCFG)                                tree-based
2nd: Blaheta           syntactic tree   classification (decision tree, perceptron)    tree-based
3rd: Yuan et al.       word sequence    sequential labeling (HM-SVM)                  word-based
Ours                   syntactic tree   classification (MEM)                          tree-based, word clusters

Table 1 Functional Labeling Approaches

Note that in modern research Functional Labeling is strongly related to the Semantic Role Labeling (SRL) task [5, 10]. However, SRL is more complex than Functional Labeling. It requires a proposition bank, which is not yet available for Vietnamese. For this reason, our research focuses on the Function Tag Labeling problem only.

1 Function Tags Labeling by Parsing

The parsing approach is based on the technique used to build Penn Treebank, in which function labels are tagged while the tree is parsed. For this approach I describe Gabbard et al. (2006) [9].

1.1 Motivation

Gabbard et al. focus on three important types of empty category annotations in Penn Treebank to identify function tags for a constituent:

* Null complementizers: in Penn Treebank they are annotated with the symbol “0”. They often stand in for relative pronouns such as that, who, or which that are omitted from the sentence. For example: “She is the girl 0 I told you, yesterday”.

* Traces of wh-movement: they are annotated as “*T*”. This type concerns the object of wh-questions: the trace is co-indexed with the position of the constituent that the wh-word refers to. The following example shows this type in a sentence: “What-1 do you want (NP *T*-1)?”

* (NP *)s: these annotations are used for several purposes in Penn Treebank, but most commonly to mark the passive, as in “(NP-1 this dog) was hit (NP *-1) by a drunken man”.

According to [9], these types were ignored in statistical parsing until Collins's Model 2 was proposed in 1997. Model 2 used heuristics and function tags during training to identify the arguments of constituents (e.g. TMP for NP, combined as the NP-TMP label). Extending Model 2, Collins's Model 3 tried to cover traces of wh-movement, with limited success.

1.2 Approach

They modified Collins's parser (2003) by adding semantic information to tag function labels. Their parser has two stages. In stage one the parser analyzes syntactic structure, including both function tags and empty categories. The challenge in this stage is to produce function tags and mark empty categories without lowering the regular Parseval metric, which was used as one of the evaluation criteria. The second stage focuses on recovering empty categories by combining a linguistically informed architecture (Campbell, 2004) and a rich feature set with machine learning methods.

Extending Collins's Model 2, they keep the function tags in all parameter classes after the training-time heuristics for identifying arguments have been applied. Then they augment the argument identification heuristic to treat any nonterminal carrying any of the tags in the Syntactic group as an argument. These function tags are treated as synonyms for the internal argument tag. So the function tags are used not only for excluding potential arguments but also for including arguments, as in Bikel's parser (Bikel, 2004).

1.3 Result

In the testing phase, Gabbard et al. used all sentences of fewer than 40 words. Their system used two measures that are common in NLP research: precision and recall. As a result, they found that the sparse data problem in Blaheta [2] is fixed by this approach. Table 2 shows the recall and precision reported by Gabbard et al.

Parser              Recall/Precision
Model 2             88.12/88.31
Model 2 - FuncB     88.23/88.31

Table 2 Result of labeling by parsing approach following Collins model
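The recall and precision in Table 2 are bracket-based scores. The sketch below shows the usual idea on a toy example. It makes several assumptions: brackets are represented as (label, start, end) triples, the function tag is kept as part of the label, and the function name `parseval` and the two bracket sets are invented. A full Parseval scorer applies further conventions that are omitted here.

```python
def parseval(gold_brackets, test_brackets):
    """Labeled precision and recall over (label, start, end) brackets."""
    gold, test = set(gold_brackets), set(test_brackets)
    matched = len(gold & test)
    precision = matched / len(test)   # share of proposed brackets that are correct
    recall = matched / len(gold)      # share of gold brackets that were found
    return precision, recall


# Toy brackets for one five-word sentence; the parser misses the SBJ tag.
gold = {("S", 0, 5), ("NP-SBJ", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}
test = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)}

precision, recall = parseval(gold, test)
print(precision, recall)
```

In this toy case a missing function tag costs one bracket in both scores, which illustrates why producing function tags without lowering the metric is a challenge.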

2 Sequential Function Tag Labeling

This approach formulates function tagging as a sequential labeling problem, a formulation that has been applied to other important natural language processing tasks such as named entity recognition and chunking. It does not require tree-based information. All features are word-based, including surrounding words and their part-of-speech (POS) tags. The model for predicting a label sequence from an observed sequence can be learned with techniques such as the Hidden Markov Model [18] and Conditional Random Fields [12]. In their paper, Yuan et al. [22] chose the Hidden Markov Support Vector Machine (HM-SVM) for their learning model. The resulting tagger was very accurate, reaching an accuracy of 96.18%.

2.1 Features

According to the Chinese Treebank (CTB) guidelines, the grammatical functions of Chinese, and the reference on English verbs (Levin, 1993), five features for function tagging are defined as follows:

* [...] accuracy of prediction. In their experiments, they started from a context range of [-2, +2] words and went up to [-4, +4].

* Bi-gram of POS tags: bi-grams over the POS tags of the input constituent.

* [...] and its arguments. Besides, each class of verbs is associated with a set of syntactic frames. In this sense, Yuan et al. relied on the surface verb for distinguishing syntactic roles.

* [...] highly correlated with its grammatical function. For example, in Chinese, subjects generally appear before a verb, and objects after it. This feature was used to compensate for the lack of syntactic structure that could otherwise be read from the parse tree.
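The word-based features above can be sketched in a few lines. This is a hedged illustration, not the feature extractor of Yuan et al.: the function name `word_features`, the feature key names, the padding symbol and the exact bigram definition are assumptions, and the English example sentence is reused from Chapter I.

```python
def word_features(words, pos_tags, i, window=2):
    """Word-based features for position i: context words/POS and POS bigrams."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < len(words)
        features[f"w[{offset}]"] = words[j] if inside else "<PAD>"
        features[f"pos[{offset}]"] = pos_tags[j] if inside else "<PAD>"
    # POS bigrams ending at and starting from the current position.
    prev_pos = pos_tags[i - 1] if i > 0 else "<PAD>"
    next_pos = pos_tags[i + 1] if i + 1 < len(words) else "<PAD>"
    features["pos_bigram[-1,0]"] = prev_pos + "_" + pos_tags[i]
    features["pos_bigram[0,+1]"] = pos_tags[i] + "_" + next_pos
    return features


words = ["I", "bought", "this", "baseball", "bat"]
pos_tags = ["PRP", "VBD", "DT", "NN", "NN"]
feats = word_features(words, pos_tags, 1)   # features for "bought"
print(feats)
```

Calling the function with `window=4` gives the wider [-4, +4] context mentioned in the first feature.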

2.2 Learning model

In the research of Caixia Yuan and colleagues, function tagging is defined as the problem of predicting a sequence of function tags from the given input words. Hence, it can be formulated as a sequence learning problem. Several algorithms exist for predicting label sequences, such as the hidden Markov model (Rabiner, 1989), conditional random fields (Lafferty et al., 2001), and the hidden Markov support vector machine (HM-SVM) (Altun et al., 2003; Tsochantaridis et al., 2004).

The model selected in this paper is HM-SVM because of its advantages in learning labels. As a result, the tagger reached an accuracy of 96.18%.
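Sequence models of this family usually pick the best tag sequence with Viterbi decoding. The sketch below is a generic decoder over additive scores, given only to illustrate sequence prediction; it is not the HM-SVM training procedure of [22]. The function name `viterbi`, the two tags, the observation symbols and all scores are invented for the example.

```python
def viterbi(observations, tags, score_start, score_trans, score_emit):
    """Return the highest-scoring tag sequence under additive scores."""
    unseen = -10.0  # score for an observation a tag has never emitted
    # best[t][tag] = (score of the best path ending in tag at step t, previous tag)
    best = [{tag: (score_start[tag] + score_emit[tag].get(observations[0], unseen), None)
             for tag in tags}]
    for obs in observations[1:]:
        column = {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + score_trans[p][tag])
            score = best[-1][prev][0] + score_trans[prev][tag] + score_emit[tag].get(obs, unseen)
            column[tag] = (score, prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


tags = ["SUB", "DOB"]
score_start = {"SUB": 0.0, "DOB": -1.0}
score_trans = {"SUB": {"SUB": -2.0, "DOB": 0.0}, "DOB": {"SUB": -1.0, "DOB": -1.0}}
score_emit = {"SUB": {"before_verb": 0.0, "after_verb": -2.0},
              "DOB": {"before_verb": -2.0, "after_verb": 0.0}}

path = viterbi(["before_verb", "after_verb"], tags, score_start, score_trans, score_emit)
print(path)
```

The difference from the classification approach is visible in `score_trans`: the choice of one tag depends on the tag chosen for its neighbor.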

3 Function Tag Labeling by Classification

This approach carries out functional labeling after syntactic parsing. Functional labeling is treated as a classification problem or, more specifically, a tree-based classification problem. Syntactic trees, the output of parsing, are the input of functional labeling. Tree nodes (syntactic constituents) are labeled with function tags independently. Many machine learning techniques can be applied under this approach. It was first used by Blaheta [2] for English. He employed two machine learning techniques: decision trees and the perceptron.

3.1 Features

Blaheta (2003) proposed nine features which were used in this model, including:

* Label: the syntactic label of the constituent. Blaheta suggests that it is probably the most important feature, especially for syntactic tags. For example, a VP² is never given any syntactic tag, and an ADJP might receive the PRD label.

* cc-Label: this feature represents the coordinating-conjunction label. Normally the cc-label is the same as the regular label. But if the constituent combines two or more phrases such as NP, VP, or PP, the whole phrase is tagged CCNP, CCVP, or CCPP instead of NP, VP, or PP.

* Head word: this is a basic feature in parsing studies such as [7]. The idea is that each word in a sentence is always related to other words in the same phrase, except the word at the beginning of the phrase. Moreover, these words may be adjuncts or complements of other words outside the phrase. This relation is lexical information that is used for function tagging.

* Head POS: the POS tag of the head word. Charniak's experiment [3] showed that the parsing system improved by approximately 2% when the part-of-speech feature was added.

* Alternative head: if the constituent is a prepositional phrase, the head word of its noun phrase is the alternative head.

* Alternative POS tag: the POS tag of the alternative head.

* Function tags: in most prediction models, function tag information is useful for predicting function tags in another context.

* Label clusters: labels are manually grouped into clusters. This feature is very useful for labels that have low frequency in the training data, such as WHNP, ADJP, JJ, etc. It is used to improve labeling performance when data are sparse.

* Word clusters: an algorithm was run to group all words with a given POS tag into a relatively small number of clusters. This feature is used in the same way as the label cluster feature. With nearly 40,000 words, Blaheta hoped that it would fend off the sparse data problem.

² These notations, such as NP, VP, ADJP, PP, ..., come from the Penn Treebank guidelines, so I do not list their annotations here.

The advantage of the classification strategy is that new features, both local and non-local, can be incorporated easily. Because of this advantage, we follow the classification approach in our system.
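Some of the tree-based features above can be read directly off a tree node. The sketch below is an illustration under assumptions: the `Node` class, the `head_index` field (which presumes that head children have already been marked by some head-finding rule) and the function names are hypothetical, and only the label, head word, head POS and alternative head features are covered.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                      # syntactic label or POS tag
    word: Optional[str] = None      # set only on leaves
    children: List["Node"] = field(default_factory=list)
    head_index: int = 0             # which child carries the head (assumed given)


def head_leaf(node):
    """Follow head children down to the leaf that holds the head word."""
    while node.children:
        node = node.children[node.head_index]
    return node


def classification_features(node):
    head = head_leaf(node)
    features = {"label": node.label, "head_word": head.word, "head_pos": head.label}
    if node.label == "PP":
        # Alternative head: head word of the noun phrase inside the PP.
        for child in node.children:
            if child.label == "NP":
                alt = head_leaf(child)
                features["alt_head"] = alt.word
                features["alt_pos"] = alt.label
                break
    return features


pp = Node("PP", children=[
    Node("IN", word="in"),
    Node("NP", children=[Node("DT", word="the"), Node("NN", word="park")], head_index=1),
])
feats = classification_features(pp)
print(feats)
```

Adding a new feature only means adding one more key to the dictionary, which is the flexibility that motivates the classification approach.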

3.2 Model

Blaheta [2] applied two classification methods in his research: decision trees and the perceptron. He ran the experiment with each method to compare their performance on this problem. To do that, Blaheta had to prepare a training data set in the format of each method.

* Decision trees: the decision tree method has been applied in many Natural Language Processing studies (Magerman, 1994; Bahl et al.), especially for English. To build a decision tree for the function tag labeling problem, Blaheta used an existing package, C4.5, from Ross Quinlan's research. He converted all training and testing data into the format that C4.5 reads. In the end, Blaheta decided to abandon this method because of its performance. He explained that the performance of the decision tree increased only slightly while the memory needed to store the model (the decision tree) grew sharply. Hence, this method is limited by memory size.

* Perceptron: the output of a classic perceptron is only binary: ‘True’ or ‘False’, ‘Yes’ or ‘No’, 0 or 1, etc. For the function tag labeling problem Blaheta applied a multi-valued perceptron classifier, in which each node in the output layer of the neural network is a function tag. A perceptron system can be described as: [...] adjusted, and the model is trained again.

Figure 3 presents the perceptron model for function tag labeling.
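The adjust-and-retrain loop of a multi-valued perceptron can be sketched as follows. This is the textbook multi-class perceptron update, shown as an assumption about the general technique rather than as Blaheta's exact network; the function name `train_perceptron`, the feature strings and the three training examples are invented.

```python
from collections import defaultdict


def train_perceptron(examples, tags, epochs=5):
    """Multi-class perceptron: one weight vector per function tag."""
    weights = {tag: defaultdict(float) for tag in tags}

    def predict(features):
        # The output node (tag) with the highest activation wins.
        return max(tags, key=lambda t: sum(weights[t][f] for f in features))

    for _ in range(epochs):
        for features, gold in examples:
            guess = predict(features)
            if guess != gold:
                # Wrong output: move weights toward the gold tag, away from the guess.
                for f in features:
                    weights[gold][f] += 1.0
                    weights[guess][f] -= 1.0
    return predict


examples = [
    ({"label=NP", "father=S"}, "SUB"),
    ({"label=NP", "father=VP"}, "DOB"),
    ({"label=PP", "father=VP"}, "LOC"),
]
predict = train_perceptron(examples, tags=["SUB", "DOB", "LOC"])
print([predict(features) for features, _ in examples])
```

Weights change only when the prediction is wrong, so training stops changing the model once every training example is classified correctly.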


Chapter III: The proposed approach

In this chapter, we introduce our approach, including an overview of the system architecture, how features are defined for the function tag labeling problem in Vietnamese, and the selection of the classification model.

1 System Architecture

Our task includes two phases, as introduced in Chapter 1: training and testing. In the training phase, we use two resources as training data: Vietnamese Treebank and an unlabeled corpus collected from online newspapers. The two main steps in this phase are feature extraction and Maximum Entropy Model (MEM) training. In addition, building the training data requires a step that groups the words of the unlabeled corpus into word clusters. In the testing phase, the input is a syntactic tree and the output is the set of functional labels over the input tree. This phase uses two main steps that mirror the training phase: feature extraction and MEM classification.

In this section we present our model graphically. This gives an overview of our task and the steps we carry out. Figure 4 shows our system for the function tag labeling problem.

Figure 4 Model of Function Tag Labeling System for Vietnamese sentences
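The MEM classification step can be illustrated with the standard conditional form of a maximum entropy model, in which the probability of a tag is the normalized exponential of the summed weights of its active features. The sketch below assumes that form; the function name `maxent_probabilities`, the feature strings and the weights are invented and do not come from our trained model.

```python
import math


def maxent_probabilities(features, weights, tags):
    """p(tag | features) = exp(sum of weights of active (feature, tag) pairs) / Z."""
    scores = {tag: sum(weights.get((f, tag), 0.0) for f in features) for tag in tags}
    z = sum(math.exp(s) for s in scores.values())   # normalizing constant
    return {tag: math.exp(s) / z for tag, s in scores.items()}


tags = ["SUB", "DOB"]
# Hypothetical weights; in a real system they are learned in the training phase.
weights = {
    ("father=S", "SUB"): 1.5,
    ("father=VP", "DOB"): 1.5,
    ("label=NP", "SUB"): 0.2,
}

probs = maxent_probabilities({"label=NP", "father=S"}, weights, tags)
best = max(probs, key=probs.get)
print(best, probs)
```

In the testing phase the tag with the highest probability would be assigned to the constituent.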

2 Function Tags in Vietnamese

Vietnamese is written with Latin characters, with characteristics of our own culture. Hence, the constituents of a Vietnamese sentence play the same roles as in English. In [13], 20 function tags were defined for Viet Treebank; Table 3 presents the functional labels tagged in Viet Treebank. It can clearly be seen that the function tags have almost the same notations and roles as in English.

[Table 3: only the row group headings survive: Clause types, Syntactic roles, Adverbials]

Table 3 Function Tags on Viet Treebank

[...] [19]. The features are described as follows³:

* Label: the syntactic label of the current constituent, which is being functionally classified, is very important in recognizing its role.

* Father's label: this feature is useful in certain cases. For example, suppose that the current constituent is a noun phrase (NP). If its father is a clause (S), it is more likely to be a subject (SUB); if its father is a verb phrase (VP), it is more likely to be a direct object (DOB).

* Head word: this feature has been proved useful in parsing. It is also important for discriminating functions, for example between temporal (TMP) and locative (LOC).

³ Underlined features are our proposed features, which differ from those in [2] and [9].

Title	Building a semantic role labeling system for Vietnamese sentences
Author	Nguyen Thanh Huy
Supervisor	PhD. Nguyen Phuong Thai
University	University of Engineering and Technology, Vietnam National University, Hanoi
Major	Computer Science
Type	Master thesis
Year of publication	2011
City	Hanoi

Format
Pages	42
File size	0.9 MB