Building a semantic role labeling system for Vietnamese sentences


UNIVERSITY OF ENGINEERING AND TECHNOLOGY

VIETNAM NATIONAL UNIVERSITY, HANOI

NGUYEN THANH HUY

BUILDING A SEMANTIC ROLE LABELING SYSTEM FOR VIETNAMESE SENTENCES

Major: Computer Science

Code: 60 48 01

MASTER THESIS

Supervised by: PhD Nguyen Phuong Thai

Hanoi - 2011


Table of Contents

Acknowledgements
Abstract
List of Figures
List of Tables
Chapter I: Introduction
1 Function tags
2 Corpora for function tag labeling
3 Current studies on Function tagging
4 Objective of the thesis
5 Our contributions
6 Thesis structure
Chapter II: Related works
1 Function Tags Labeling by Parsing
1.1 Motivation
1.2 Approach
1.3 Result
2 Sequential Function Tag Labeling
2.1 Features
2.2 Learning model
3 Function Tag Labeling by Classification
3.1 Feature
3.2 Model
Chapter III: The proposed approach
1 System Architecture
2 Function Tags in Vietnamese
3 Selected Features
4 Word Clustering
5 Classification Model
5.1 Maximum Entropy by Motivating Example
5.2 Maximum Entropy Modeling
5.3 Training data
5.4 Features and Constraints
5.5 Maximum Entropy Principle
6 Summarization
Chapter IV: Experiment
1 Corpora and Tools
2 Functional Labeling Precisions
3 Error Analyses
4 Effectiveness of Word Cluster Feature
5 Summary
Chapter V: Conclusion and Future work
1 Contributions
2 Future work
Bibliography
Publications


List of Figures

Figure 1 Sample domain and frame element of FrameNet
Figure 2 A parsing with function tags in Viet Treebank
Figure 3 The perceptron model for function tags labeling problem
Figure 4 Model of Function Tag Labeling System for Vietnamese sentences
Figure 5 An example for selected features in Viet Treebank
Figure 6 Example of word cluster hierarchy
Figure 7 Scenarios in constrained optimization
Figure 8 Pseudo-code for extracting function labels
Figure 9 An example of word cluster
Figure 10 Learning curve
Figure 11 The dependency between two function labels

List of Tables

Table 1 Functional Labeling Approaches
Table 2 Result of labeling by parsing approach following Collins model
Table 3 Function Tags on Viet Treebank
Table 4 Vietnamese Treebank statistics
Table 5 Evaluation of Vietnamese functional labeling system
Table 6 Increases in precision by using word cluster feature

Chapter I: Introduction

In this chapter, I introduce function tags and their value in NLP applications, some current approaches, the objective of the thesis, and our contributions. Finally, I describe the structure of the thesis.

1 Function tags

There are two kinds of tags in linguistics: syntactic tags and function tags. For syntactic tags, there are several theories and research projects for languages including English, Spanish, and Chinese [4][13][14][18]. This research mainly focuses on finding the part of speech of words and tagging their constituents. Function tags are more abstract than syntactic labels. Whereas a syntactic label gives a single notation to a group of words, a function tag expresses the relationship between a phrase and the utterance it belongs to, so it differs from context to context. The function tag of a phrase may therefore change depending on its neighboring context. For example, consider the phrase “baseball bat”. Syntactically it is a noun phrase (annotated as NP in most research), but its function tag might be subject in this sentence:

This baseball bat is very expensive.

In another case its function tag might be direct object:

I bought this baseball bat last month.

Or instrument or agent in the passive voice:

That man was attacked by this baseball bat.
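To make the distinction concrete, the three sentences above can be written down as data. The following is only an illustrative sketch in Python; the class name `Constituent` and its fields are assumptions introduced here, not part of any treebank format. It shows one phrase keeping a single syntactic label while its function tag changes with the context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constituent:
    words: str      # surface text of the phrase
    syntactic: str  # syntactic label, fixed for the phrase (e.g. NP)
    function: str   # function tag, depends on the sentence context


# The same noun phrase in the three example sentences.
examples = [
    Constituent("this baseball bat", "NP", "subject"),
    Constituent("this baseball bat", "NP", "direct object"),
    Constituent("this baseball bat", "NP", "instrument"),
]

syntactic_labels = {c.syntactic for c in examples}  # one label for all contexts
function_tags = {c.function for c in examples}      # one tag per context
```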

Function tags are discussed directly by Blaheta (2003) [2]. A lot of research focuses on how to assign function tags to a sentence. This kind of research problem is called the function tag labeling problem, a class of problems concerned with finding the semantic information of a phrase. To sum up, function tag labeling is the problem of finding the semantic information of a group of words and tagging it with a given annotation in its context.

2 Corpora for function tag labeling

Nowadays machine learning is the most popular method for many modern problems, especially in Natural Language Processing. To build a machine learning system we need a training data set. There are function tag labeling corpora for languages such as English and Chinese. In English, two main corpora are used for the semantic role labeling and function tag labeling problems: FrameNet (Baker, 1998; Fillmore and Baker, 2000) and PropBank (Palmer et al., 2005). The main idea of FrameNet is to put similar words into the same group and then represent the relationships between this group and other groups in a network of groups. That is why it is called FrameNet. Figure 1 shows a small example branch of FrameNet.

Figure 1 Sample domain and frame element of FrameNet

The second corpus is PropBank (Palmer et al., 2005), which extends the Penn Treebank with additional annotation: function tags. The Penn Treebank and PropBank are organized as sets of trees; each tree represents a sentence tagged with syntactic labels (in the Penn Treebank) or with both syntactic and functional labels (in PropBank).

PropBank and the Chinese Treebank are linguistic resources that have been available for research purposes for a long time, whereas Viet Treebank (see footnote 1) was developed recently by Nguyen [17], drawing on the experience and approaches of the Penn Treebank. Hence, Viet Treebank has the same structure as the Penn Treebank and the Chinese Treebank: each word is a leaf node of a tree, and non-terminal nodes are tagged with a syntactic label or a functional label. Figure 2 shows an example from Viet Treebank with function tags.

Footnote 1: http://vlsp.vietlp.org:8080/demo/?page=resources
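As an illustration of this tree structure, the sketch below reads a Penn-style bracketed tree and separates each nonterminal label into its syntactic part and its function tags. It assumes that labels are written as LABEL or LABEL-TAG (for example NP-SUB); the sample sentence and the helper names are hypothetical, and real treebank files may need extra handling for co-indices and empty categories.

```python
def parse_tree(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # a nested constituent
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def split_label(label):
    """Split 'NP-SUB' into ('NP', ['SUB']); assumes the LABEL-TAG convention."""
    syntactic, *function_tags = label.split("-")
    return syntactic, function_tags


def tagged_constituents(tree):
    """Yield (syntactic label, function tags) for nonterminals that carry a tag."""
    label, children = tree
    syntactic, tags = split_label(label)
    if tags:
        yield syntactic, tags
    for child in children:
        if isinstance(child, tuple):
            yield from tagged_constituents(child)


# Hypothetical sentence ("I buy books"), not taken from the treebank.
tree = parse_tree("(S (NP-SUB (P Tôi)) (VP (V mua) (NP-DOB (N sách))))")
found = list(tagged_constituents(tree))
```

Here `found` holds the two tagged noun phrases, each split into its syntactic label and its list of function tags.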


3 Current studies on Function tagging

Function tag labeling is an important processing step for many natural language processing applications such as question answering, information extraction, and summarization. Thus, some research has focused on the function tagging problem to recover additional semantic information, which is more useful than syntactic labels alone.

In 1997, Collins [7] introduced the idea of adding useful syntactic information and proposed a parser able to guess the complement tag. This parser is known as Collins's parser, and it is considered the first system for tagging such labels. Function tag labeling was defined precisely by Blaheta (2003) [2]. His research used data from Penn Treebank II, which includes extra function tags. Following Blaheta's proposal, various investigations have focused on function tag labeling, such as Merlo and Musillo (2005), Blaheta and Charniak (2004), Chrupala and van Genabith (2006), and Sun and Sui (2009). These studies extend the topic by addressing new languages such as Chinese, proposing new approaches, or investigating new features. Nowadays, there are three main strategies for the function tag labeling problem.

The first approach, called parsing, tags function labels during the parsing process; it is a modification of Collins's parser. Examples of this approach are the studies of Gabbard [9] and Marcus [17].

Figure 2 A parsing with function tags in Viet Treebank


The second approach, called the labeling method, has two phases: extracting features and classifying function labels. This approach admits more techniques because of the diversity of classification techniques. The most typical research following this approach is Blaheta's [3]. In his research, he applied several techniques to show the impact of each on the function tag labeling problem.

The third approach is sequential labeling. In this approach, function tags are predicted from the observed chain of words (Yuan [23]). Its feature selection is similar to that of the classification approach, but it uses a sequence prediction model instead of a classification model. These approaches are discussed in detail in the next chapter.

Today, there is a class of problems that covers function tagging. This class was described by Carreras (2004) [5] and is called Semantic Role Labeling. Semantic Role Labeling is similar to function tag labeling but works at a more abstract level. When building a Semantic Role Labeling system, the training data carries more information: it includes not only time, location, manner, etc., but also object, instrument, agent, etc. This problem is a promising new research direction for NLP applications that need to understand the meaning of sentences.

4 Objective of the thesis

As mentioned above, assigning function tags has been widely researched, especially for English. Recently, some studies have addressed Spanish and Chinese. All function tagging systems have contributed a semantic class to their corpora, which is very useful for other NLP applications such as question answering, summarization, and information retrieval.

In recent years, Natural Language Processing research in Vietnam has developed rapidly. For Vietnamese in particular, many studies have focused on recognizing the syntax of sentences with a POS tagging system. Unfortunately, these NLP applications do not provide semantic information for a sentence, whereas some NLP applications need semantic information to answer questions such as who, where, what, and whom.

To deal with this problem, our research focuses on building an automatic function tag labeling system. In this thesis, which I provisionally call stage one, we build a function tagging system for Vietnamese, a problem shallower than Semantic Role Labeling (SRL). Our system tags only some basic function labels such as time, location, and direction. Other semantic roles such as agent and instrument, which belong to the Semantic Role Labeling problem, will be added to our system in the future. Our system has two phases. First, we extract function tags from Viet Treebank, a treebank annotated with hand-crafted semantic labels. After that, we select features to train the classification model. Some features extend the studies of Blaheta [2] and Yuan [22]; we also propose a new feature that has a significant impact on the function tag labeling system. In later chapters, we introduce our system in detail, including feature extraction, the selected model, and the construction of the new feature.

5 Our contributions

As discussed in the previous section, we aim to build a system for Vietnamese sentences that is shallower than Semantic Role Labeling. To our knowledge, although there have been other investigations in NLP for Vietnamese, our research on function tag labeling may be the first.

Furthermore, our system provides a new tool for tagging functional labels in Viet Treebank. This allows Viet Treebank to be enriched automatically instead of by hand, as has been done before. Viet Treebank is one of the important resources for research in Natural Language Processing for Vietnamese. If it is enriched with function tags, other research will have more information about Vietnamese, especially semantic information.

Moreover, in this thesis I also introduce some approaches to the function tag labeling problem to show the techniques that have been applied. These techniques may be reused by later researchers. Because our system is the first for this problem, we expect it to serve as a baseline system for later function tag labeling research.

We define the function tagging problem in two phases. In phase one, we build a training data set from Viet Treebank and other resources, such as the LaoDong and PC World newspapers. In phase two, we apply a classification model [1] to obtain a function tagging model for each input constituent. These two phases are described as follows:

· In phase one: we extract the features described in Chapter 3 from Viet Treebank to build a training data set. Independently, we build a corpus of words grouped into clusters. We call this corpus a word cluster, and the process of making it word clustering. We do not claim that our selected features are the best possible selection. However, our experiments show that they are reliable enough to recognize a function tag. Some of the selected features are drawn from the results of other studies such as Blaheta [2], Gabbard [9], and Yuan [23]; the remaining features are our own proposals.

· In phase two, selecting the classification model: in the NLP domain there are many techniques for classifying a data set, and consequently several classification models could be used to recognize a function tag. In this step, we selected the Maximum Entropy Model (MEM) as our classification model because of its advantages. After that, we evaluate our system by calculating the precision and the distribution proportion for each function tag. The results are reported in Chapter 4.
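One simple way to use the word clusters from phase one as a feature is to replace a word by the identifier of its cluster, so that rare words share statistics with more frequent words in the same cluster. The sketch below assumes a ready-made word-to-cluster table; the cluster identifiers, the sample words, and the choice of the head word as the clustered word are hypothetical.

```python
# Hypothetical word-to-cluster table. In practice the clusters are learned
# from an unlabeled corpus; they are written by hand here for illustration.
WORD_CLUSTERS = {
    "hôm_qua": "C17",  # "yesterday"
    "hôm_nay": "C17",  # "today"
    "Hà_Nội": "C42",
    "Huế": "C42",
}


def cluster_feature(head_word, clusters=WORD_CLUSTERS, unknown="C_UNK"):
    """Return a feature string naming the cluster of the given word.

    Words missing from the table fall back to a shared unknown cluster.
    """
    return "cluster=" + clusters.get(head_word, unknown)


same_cluster = cluster_feature("hôm_qua") == cluster_feature("hôm_nay")
```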

In conclusion, our research is the first on the function tag labeling problem for Vietnamese; it will encourage researchers who are going to work on the semantic domain in NLP applications.

6 Thesis structure

In this section, we give a brief outline of the thesis. It is an overview of the following chapters, where we give more details.

Chapter 2 – Related works

In this chapter we present research related to ours and modern approaches to function tagging.

Chapter 3 – The proposal

We propose our approach to function labeling for Vietnamese, including the model, data preparation, techniques, selected features, and the method for extracting features from Viet Treebank.

Chapter 4 – Experiments

This chapter evaluates the effectiveness of our model. Because no similar function tagging system existed when this thesis was completed, we consider our model a baseline.

Chapter 5 – Conclusions and Future works

In this chapter, we give general conclusions about our work, including its advantages and limitations. We also propose future work to improve our model.

Chapter 5 is followed by a list of references.


Chapter II: Related works

In this chapter, I introduce some recent research on function tagging from several angles: approach strategies, techniques, and experiments. As discussed above, there are three approaches to this problem. I describe one method for each approach in this chapter. For an overview, Table 1 summarizes the approaches to the function tag labeling problem.

Approach              Input            Method                                       Features
1st: Gabbard et al.   word sequence    parsing (PCFG)                               tree-based
2nd: Blaheta          syntactic tree   classification (decision tree, perceptron)   tree-based
3rd: Yuan et al.      word sequence    sequential labeling (HM-SVM)                 word-based
Ours                  syntactic tree   classification (MEM)                         tree-based, word clusters

Table 1 Functional Labeling Approaches

Note that in modern research Functional Labeling is closely related to the Semantic Role Labeling (SRL) task [5, 10]. However, SRL is more complex than Functional Labeling. It requires a proposition bank, which is not yet available for Vietnamese. For this reason, our research focuses on the Function Tag Labeling problem only.

1 Function Tags Labeling by Parsing

The parsing approach is based on the technique that was used to build the Penn Treebank, in which function labels are tagged while the tree is parsed. As an example of this approach I discuss Gabbard et al. (2006) [9].

1.1 Motivation

Gabbard et al. focus on three important types of empty category annotation in the Penn Treebank to identify function tags for a constituent:

· Null complementizers: in the Penn Treebank they are annotated with the symbol 0. Null complementizers often stand in for relative pronouns such as that, who, or which that are omitted from the sentence. For example: “She is the girl 0 I told you, yesterday”

· Traces of wh-movement: they are annotated as “*T*”. This type concerns the object of wh-questions; the trace is co-indexed with the wh-constituent it refers to. The following example shows this type in a sentence: “What1 do you want (NP *T*-1)?”

· (NP *)s: this notation is used for several purposes in the Penn Treebank, but it is commonly used to mark the passive, as in: “(NP-1 this dog) was hit (NP *-1) by a drunken man”
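The three notations above are regular enough to be recognized with simple patterns. The sketch below is one possible way to do so, assuming the check is applied only to leaves that are already known to be empty categories (a literal token 0 elsewhere in a sentence is an ordinary word). The function name and the category names are illustrative and do not come from [9].

```python
import re

# Patterns are tried in order; a captured group holds the co-index, if any.
EMPTY_PATTERNS = [
    ("null complementizer", re.compile(r"^0$")),
    ("wh-trace", re.compile(r"^\*T\*(?:-(\d+))?$")),
    ("NP *", re.compile(r"^\*(?:-(\d+))?$")),
]


def classify_empty(token):
    """Return (kind, co-index or None) for an empty-category token, else None."""
    for kind, pattern in EMPTY_PATTERNS:
        match = pattern.match(token)
        if match:
            index = match.group(1) if pattern.groups else None
            return kind, (int(index) if index else None)
    return None


wh_trace = classify_empty("*T*-1")   # trace co-indexed with constituent 1
passive_gap = classify_empty("*-1")  # passive (NP *) co-indexed with NP-1
```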

According to [9], statistical parsers had ignored these types until Collins Model 2 was proposed in 1997. Model 2 used heuristics and function tags during training to identify the arguments of constituents (e.g. TMP for NP, combined as a TMP-NP label). Extending Model 2, Collins Model 3 tried to cover traces of wh-movement, with limited success.

1.2 Approach

They modified Collins's parser (2003) by adding semantic information to tag function labels. Their parser has two stages. In stage one the parser analyzes syntactic structure, including both function tags and empty categories. The challenge in this stage is to produce function tags and mark empty categories without decreasing the regular Parseval metric, which was used as one of the evaluation criteria. The second stage focuses on recovering empty categories by combining a linguistically informed architecture (Campbell, 2004) and a rich feature set with machine learning methods.

Extending Collins's Model 2, the function tags that the heuristics use during training to identify arguments are kept in all parameter classes. They then augment the argument identification heuristic to treat as arguments all nonterminals carrying any of the tags in the Syntactic group. The function tags are treated as synonyms of the internal tag. So these function tags are used not only for excluding potential arguments but also for including arguments, as in Bikel's parser (Bikel, 2004).

1.3 Result

In the testing phase, Gabbard et al. used all sentences shorter than 40 words. Their system was evaluated with two measures that are commonly used in NLP research: precision and recall. As a result, they found that the data sparseness problem in Blaheta [2] is fixed by this approach. Table 2 shows the recall and precision of Gabbard's system.

Parser              Recall/Precision
Model 2             88.12/88.31
Model 2 - FuncB     88.23/88.31

Table 2 Result of labeling by parsing approach following Collins model
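For reference, precision is the fraction of predicted tags that are correct and recall is the fraction of gold tags that are recovered. The sketch below computes both per function tag from aligned gold and predicted tags. It is a simplified illustration, not the Parseval bracket scoring used in [9], and the sample data is invented.

```python
from collections import Counter


def precision_recall(gold, predicted):
    """Per-tag (precision, recall) for two aligned tag lists; None means no tag."""
    correct, n_pred, n_gold = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if p is not None:
            n_pred[p] += 1
        if g is not None:
            n_gold[g] += 1
        if g is not None and g == p:
            correct[g] += 1
    scores = {}
    for tag in set(n_pred) | set(n_gold):
        precision = correct[tag] / n_pred[tag] if n_pred[tag] else 0.0
        recall = correct[tag] / n_gold[tag] if n_gold[tag] else 0.0
        scores[tag] = (precision, recall)
    return scores


# Invented example: five constituents, one of which has no gold tag.
gold = ["SUB", "DOB", "TMP", None, "SUB"]
predicted = ["SUB", "SUB", "TMP", "TMP", "SUB"]
scores = precision_recall(gold, predicted)
```

In this invented example TMP is predicted twice but is correct only once, so its precision is 0.5 while its recall is 1.0.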

2 Sequential Function Tag Labeling

This approach formulates function tagging as a sequential labeling problem, a formulation that has been applied to other important natural language processing tasks such as named entity recognition and chunking. This approach does not require tree-based information. All features are word-based, including surrounding words and their part-of-speech (POS) tags. The model for predicting a label sequence from an observed sequence can be learned with techniques such as Hidden Markov Models [18] and Conditional Random Fields [12]. In their paper, Yuan et al. [22] chose the Hidden Markov Support Vector Machine (HM-SVM) for their learning model. The resulting tagger reached an accuracy of 96.18%.

2.1 Features

Based on the Chinese Treebank (CTB) guidelines, the grammatical functions of Chinese, and the reference on English verbs (Levin, 1993), five features for function tagging are defined as follows:

· Word and POS tags: the context formed by surrounding words can increase the accuracy of prediction. In their experiments, they started from a context window of [-2, +2] words and went up to [-4, +4].

· Bi-gram of POS tags: bi-grams of the POS tags of the input constituent.

· Verbs: function tags like subject and object describe the relations between a verb and its arguments. Besides, each class of verbs is associated with a set of syntactic frames. In this sense, Yuan et al. relied on the surface verb for distinguishing syntactic roles.

· POS tags of verbs.

· Position indicators: whether a constituent occurs before or after the verb is highly correlated with its grammatical function. For example, in Chinese, subjects generally appear before a verb and objects after it. This feature was used to compensate for the lack of syntactic structure that could otherwise be read from the parse tree.
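A rough sketch of how these five feature types could be extracted for one word is given below. The exact feature templates of Yuan et al. are not reproduced here, so the feature names, the padding symbol, the choice of the previous and current POS tag as the bi-gram, and the assumption that the index of the governing verb is known are all illustrative. The example uses English words only for readability.

```python
def sequence_features(words, pos_tags, i, verb_index, window=2):
    """Build word-based features for the word at position i."""
    feats = {}
    n = len(words)
    # 1. Words and POS tags in a [-window, +window] context.
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < n
        feats[f"w[{offset}]"] = words[j] if inside else "<PAD>"
        feats[f"pos[{offset}]"] = pos_tags[j] if inside else "<PAD>"
    # 2. POS bi-gram (assumed here: previous tag plus current tag).
    prev_pos = pos_tags[i - 1] if i > 0 else "<PAD>"
    feats["pos_bigram"] = prev_pos + "_" + pos_tags[i]
    # 3 and 4. Surface verb and its POS tag.
    feats["verb"] = words[verb_index]
    feats["verb_pos"] = pos_tags[verb_index]
    # 5. Position of the word relative to the verb.
    if i < verb_index:
        feats["position"] = "before"
    elif i > verb_index:
        feats["position"] = "after"
    else:
        feats["position"] = "verb"
    return feats


words = ["I", "bought", "this", "bat"]
pos_tags = ["PRP", "VBD", "DT", "NN"]
features = sequence_features(words, pos_tags, i=3, verb_index=1)
```

Widening the context from [-2, +2] to [-4, +4] corresponds to calling the function with `window=4`.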

2.2 Learning model

In the research of Caixia Yuan and colleagues, function tagging is defined as the problem of predicting a sequence of function tags from the given input words. Hence, it can be formulated as a sequence learning problem. Several algorithms exist for predicting label sequences, such as the hidden Markov model (Rabiner, 1989), conditional random fields (Lafferty et al., 2001), and the hidden Markov support vector machine (HM-SVM) (Altun et al., 2003; Tsochantaridis et al., 2004).

The model selected in their paper is HM-SVM because of its advantages in learning labels. As a result, the tagger reached an accuracy of 96.18%.

3 Function Tag Labeling by Classification

This approach carries out functional labeling after syntactic parsing. Functional labeling is treated as a classification problem, or more specifically a tree-based classification problem. Syntactic trees, the output of parsing, are the input to functional labeling. Tree nodes, that is, syntactic constituents, are labeled with function tags independently. Many machine learning techniques can be applied within this approach. It was first used by Blaheta [2] for English. He employed two machine learning techniques: decision trees and the perceptron.

3.1 Feature

Blaheta (2003) proposed nine features for this model:

· Label: the syntactic label of the constituent. Blaheta suggests that it is probably the most important feature, especially for syntactic tags. For example, a VP (see footnote 2) is never given a syntactic tag, and an ADJP might have a PRD label.

· CC label: this feature represents the coordinate conjunction label. Normally the cc-label is the same as the regular label. But if the constituent is a combination of two or more phrases such as NP, VP, or PP, the whole phrase is tagged CCNP, CCVP, or CCPP instead of NP, VP, or PP.

· Head word: this is a basic feature in parsing studies such as [7]. The idea is that each word in a sentence is related to other words in the same phrase, except the word that heads the phrase. Moreover, such words may be adjuncts or complements of words outside the phrase. This relation is lexical information that is useful for predicting function tags.

· Head POS: the POS tag of the head word. Charniak's experiment [3] showed that the parsing system improved by approximately 2% when the part-of-speech feature was added.

· Alternative head: if the constituent is a prepositional phrase, the head word of its noun phrase is the alternative head.

· Alternative POS tag: the POS tag of the alternative head.

· Function tags: in most prediction models, function tag information is useful for predicting function tags in another context.

· Label clusters: labels are manually grouped into clusters. This feature is very useful for labels with low frequency in the training data such as WHNP, ADJP, and JJ. It is used to improve labeling performance when the data is sparse.

· Word clusters: an algorithm is run to group all words with a given POS tag into a relatively small number of clusters. This feature is used in the same way as the label cluster feature. With nearly 40,000 words, Blaheta hoped that it would fend off the sparse data problem.

Footnote 2: These notations, such as NP, VP, ADJ, and PP, are taken from the Penn Treebank guidelines, so I do not list their
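The label and head-related features above can be sketched as follows. Head finding itself is not shown; the sketch assumes that each node already stores its head word and, for prepositional phrases, the head of the embedded noun phrase. The `Node` class and the feature names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str                           # syntactic label, e.g. NP or PP
    head_word: str                       # head word of the constituent
    head_pos: str                        # POS tag of the head word
    alt_head_word: Optional[str] = None  # head of the NP inside a PP
    alt_head_pos: Optional[str] = None   # POS tag of the alternative head


def node_features(node):
    """Collect label, head, and (for PPs) alternative-head features."""
    feats = {
        "label": node.label,
        "head_word": node.head_word,
        "head_pos": node.head_pos,
    }
    # The alternative head is only defined for prepositional phrases.
    if node.label == "PP" and node.alt_head_word is not None:
        feats["alt_head"] = node.alt_head_word
        feats["alt_pos"] = node.alt_head_pos
    return feats


# "in the morning": head "in", alternative head "morning".
prep_phrase = Node("PP", "in", "IN", alt_head_word="morning", alt_head_pos="NN")
noun_phrase = Node("NP", "bat", "NN")
pp_feats = node_features(prep_phrase)
np_feats = node_features(noun_phrase)
```

For a temporal prepositional phrase such as this one, the alternative head carries information that the preposition alone does not.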

The advantage of the classification strategy is that new features, both local and non-local, can be incorporated easily. Because of this advantage, we follow the classification approach in our system.

3.2 Model

Blaheta [2] applied two classification methods in his research: decision trees and the perceptron. He ran an experiment with each method to compare their performance on this problem. To do that, Blaheta had to prepare a separate training data set for each method.

· Decision trees: decision trees have been applied in many Natural Language Processing studies (Magerman, 1994; Bahl et al.), especially for English. To build a decision tree for the function tag labeling problem, Blaheta used an existing package, C4.5, from Ross Quinlan's research. Blaheta converted all training and testing data sets into the format that the C4.5 package reads. In the end, Blaheta decided to abandon this method because of its performance. He explained that the performance of the decision tree increased only slightly while the memory needed to store the model (the decision tree) kept growing. Hence, this method is limited by the size of memory.

· Perceptron: a classical perceptron network performs binary classification. Its output is only “True” or “False”, “Yes” or “No”, 0 or 1, etc. For the function tag labeling problem Blaheta applied a multi-valued perceptron classifier, in which each node in the output layer of the network is a function tag. A perceptron system can be described as follows:

o First, function tags are extracted from the Penn Treebank to create the training data.

o Second, an initial perceptron network is set up and then trained with the function tags obtained in the previous step to build a multi-class classification model.

o Last, the model is run on the test data and system performance is evaluated. If the result is poor, the input weights of the neurons in the input layer are adjusted and the model is trained again.

Figure 3 below presents the perceptron model for the function tag labeling problem.

[Figure: constituents as inputs, function tags (for example PRD) as outputs]

Figure 3 The perceptron model for function tags labeling problem
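The steps above can be sketched with a standard multi-class perceptron, in which every function tag has its own weight vector and a wrong prediction moves weight from the predicted tag to the gold tag. This is a generic textbook formulation, given here as an assumption about the kind of model meant; it is not claimed to be Blaheta's exact network or update rule, and the training examples are invented.

```python
from collections import defaultdict


def train_perceptron(examples, tags, epochs=5):
    """examples: list of (feature list, gold tag). Returns a predict function."""
    weights = {tag: defaultdict(float) for tag in tags}

    def predict(features):
        # Each tag plays the role of one output node; the highest score wins.
        return max(tags, key=lambda t: sum(weights[t][f] for f in features))

    for _ in range(epochs):
        for features, gold in examples:
            guess = predict(features)
            if guess != gold:
                for f in features:
                    weights[gold][f] += 1.0   # reward the correct tag
                    weights[guess][f] -= 1.0  # penalize the wrong guess
    return predict


# Invented training data using Penn Treebank style tags.
examples = [
    (["label=NP", "head=she"], "SBJ"),
    (["label=NP", "head=yesterday"], "TMP"),
    (["label=ADJP", "head=happy"], "PRD"),
]
predict = train_perceptron(examples, tags=["SBJ", "TMP", "PRD"])
```

On this small separable data set the weights stop changing after three passes, so the default of five epochs is enough to classify all three training examples correctly.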


Chapter III: The proposed approach

In this chapter, we introduce our approach, including an overview of the system architecture, the definition of features for the function tag labeling problem in Vietnamese, and the choice of classification model.

1 System Architecture

Our task includes two phases, as introduced in Chapter 1: training and testing. In the training phase, we use two resources as training data for our system: the Vietnamese Treebank and an unlabeled corpus collected from online newspapers. The two main steps in this phase are feature extraction and Maximum Entropy Model (MEM) training. In addition, building the training data requires a further step that groups the words of the unlabeled corpus into word clusters. In the testing phase, the input is a syntactic tree and the output of our system is the set of functional labels on that tree. This phase again has two main steps: feature extraction and MEM classification.

In this section we present our model graphically. This gives an overview of our task and the steps we carry out. Figure 4 shows our system for the function tag labeling problem.

[Figure: Input: syntactic tree. Output: syntactic tree with functional tags. Resources: online newspapers (700,000 sentences) and the Vietnamese treebank (10,417 trees). Components: feature extraction, word clusters]

Figure 4 Model of Function Tag Labeling System for Vietnamese sentences
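To illustrate the training and testing phases, the sketch below trains a small maximum entropy (multinomial logistic) classifier on string features and then classifies a new constituent. It assumes plain batch gradient ascent on the conditional log-likelihood, which is only one of several possible training methods and is not necessarily the one used in our system. The feature strings and the three training examples are invented for illustration.

```python
import math
from collections import defaultdict


def train_maxent(examples, tags, epochs=200, rate=0.5):
    """examples: list of (feature list, gold tag). Returns a probability function."""
    weights = defaultdict(float)  # one weight per (feature, tag) pair

    def probabilities(features):
        scores = {t: sum(weights[(f, t)] for f in features) for t in tags}
        top = max(scores.values())  # subtract the maximum for numerical stability
        exps = {t: math.exp(s - top) for t, s in scores.items()}
        total = sum(exps.values())
        return {t: e / total for t, e in exps.items()}

    for _ in range(epochs):
        gradient = defaultdict(float)
        for features, gold in examples:
            probs = probabilities(features)
            for f in features:
                gradient[(f, gold)] += 1.0        # observed feature count
                for t in tags:
                    gradient[(f, t)] -= probs[t]  # expected feature count
        for key, g in gradient.items():
            weights[key] += rate * g / len(examples)
    return probabilities


# Training phase on invented examples.
examples = [
    (["label=NP", "parent=S"], "SUB"),
    (["label=NP", "parent=VP"], "DOB"),
    (["label=PP", "parent=VP"], "LOC"),
]
classify = train_maxent(examples, tags=["SUB", "DOB", "LOC"])

# Testing phase: label a noun phrase whose parent is a verb phrase.
probs = classify(["label=NP", "parent=VP"])
best = max(probs, key=probs.get)
```

The model returns a probability for every function tag, so a system built this way can either take the most probable tag or leave a constituent untagged when no tag is probable enough.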

2 Function Tags in Vietnamese

Vietnamese is written with Latin letters, adapted to the characteristics of our culture. Hence, the constituents of a Vietnamese sentence play the same roles as in English. In [13], we defined 20 function tags for Viet Treebank; Table 3 below presents the functional labels tagged in Viet Treebank. It can be clearly seen that the function tags have almost the same notations and roles as in English.

Clause types
CMD   Command       EXC   Exclamation

Syntactic roles
SUB   Subject       SQ    Question
TPC   Topic         DOB   Direct object
PRD   Predicate     IOB   Indirect object
EXT   Extent        VOC   Vocative

Adverbials
TMP   Time          LOC   Location
MNR   Manner        DIR   Direction
PRP   Purpose       CNC   Concession
CND   Condition     ADV   Adverbial
MDP   Modus

Table 3 Function Tags on Viet Treebank

· Label: the syntactic label of the current constituent, the one being functionally classified, is very important in recognizing its role.

· Father’s label: this feature is useful in certain cases. For example, suppose that the current constituent is a noun phrase (NP). If its father is a clause (S), it is more likely to be a subject (SUB); if its father is a verb phrase (VP), it is more likely to be a direct object (DOB).

· Head word: this feature has proved useful in parsing. It is also important for discriminating functions, for example, between temporal (TMP)
