UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY, HANOI
NGUYEN THANH HUY
BUILDING A SEMANTIC ROLE LABELING SYSTEM FOR VIETNAMESE SENTENCES
Major : Computer Science
Code : 60 48 01
MASTER THESIS
Supervised by: PhD Nguyen Phuong Thai
Hanoi - 2011
Table of Contents
Acknowledgements
Abstract
List of Figures
List of Tables
Chapter I: Introduction
1 Function tags
2 Corpora for function tag labeling
3 Current studies on Function tagging
4 Objective of the thesis
5 Our contributions
6 Thesis structure
Chapter II: Related works
1 Function Tags Labeling by Parsing
1.1 Motivation
1.2 Approach
1.3 Result
2 Sequential Function Tag Labeling
2.1 Features
2.2 Learning model
3 Function Tag Labeling by Classification
3.1 Feature
3.2 Model
Chapter III: The proposed approach
1 System Architecture
2 Function Tags in Vietnamese
3 Selected Features
4 Word Clustering
5 Classification Model
5.1 Maximum Entropy by Motivating Example
5.2 Maximum Entropy Modeling
5.3 Training data
5.4 Features and Constraints
5.5 Maximum Entropy Principle
6 Summarization
Chapter IV: Experiment
1 Corpora and Tools
2 Functional Labeling Precisions
3 Error Analyses
4 Effectiveness of Word Cluster Feature
5 Summary
Chapter V: Conclusion and Future work
1 Contributions
2 Future work
Bibliography
Publications
List of Figures
Figure 1 Sample domain and frame element of FrameNet
Figure 2 A parsing with function tags in Viet Treebank
Figure 3 The perceptron model for function tags labeling problem
Figure 4 Model of Function Tag Labeling System for Vietnamese sentences
Figure 5 An example for selected features in Viet Treebank
Figure 6 Example of word cluster hierarchy
Figure 7 Scenarios in constrained optimization
Figure 8 Pseudo-code for extracting function labels
Figure 9 An example of word cluster
Figure 10 Learning curve
Figure 11 The dependency between two function labels
List of Tables
Table 1 Functional Labeling Approaches
Table 2 Result of labeling by parsing approach following Collins model
Table 3 Function Tags on Viet Treebank
Table 4 Vietnamese Treebank statistics
Table 5 Evaluation of Vietnamese functional labeling system
Table 6 Increases in precision by using word cluster feature
Chapter I: Introduction
In this chapter, I introduce function tags and their value in NLP applications, some current approaches, the objective of the thesis, and our contributions. Finally, I describe the structure of the thesis.
1 Function tags
There are two kinds of tags in linguistics: syntactic tags and function tags. For syntactic tags, there are several theories and research projects for English, Spanish, Chinese, etc. [4][13][14][18]. This research mainly focuses on finding parts of speech and tagging constituents. Function tags can be understood as more abstract labels because they differ from syntactic labels. Whereas a syntactic label is a single notation for a group of words, a function tag represents the relationship between a phrase and its utterance in a particular context. The function tag of a phrase may therefore change, depending on the context of its neighbors. For example, consider the phrase “baseball bat”. Its syntactic category is “noun phrase” (annotated as NP in most research). But its function tag might be subject, as in this sentence:
This baseball bat is very expensive.
In another case, its function tag might be direct object:
I bought this baseball bat last month.
Or it might be instrument or agent in the passive voice:
That man was attacked by this baseball bat.
Function tags were first addressed directly by Blaheta (2003) [2]. Much research has focused on how to assign function tags to a sentence. This kind of research problem is called the function tag labeling problem, a class of problems concerned with finding the semantic information of a phrase. To sum up, function tag labeling is the problem of finding the semantic information of a group of words and then tagging it with a given annotation in its context.
2 Corpora for function tag labeling
Nowadays, machine learning is the prevailing method for most modern problems, especially in Natural Language Processing. To build a machine learning system, we need a training data set. There are function tag labeling corpora for languages such as English and Chinese. For English, two main corpora are used for semantic role labeling and function tag labeling: FrameNet (Baker, 1998; Fillmore and Baker, 2000) and PropBank (Palmer et al., 2005). The main idea of FrameNet is to group similar words together and then represent the relationships between these groups as a network of groups. That is why it is called FrameNet. Figure 1 shows a small example branch of FrameNet.
Figure 1 Sample domain and frame element of FrameNet
The second corpus is PropBank (Palmer et al., 2005), which extends Penn Treebank with additional annotated information: function tags. Penn Treebank and PropBank are organized as sets of trees; each tree represents a sentence tagged with syntactic labels (in Penn Treebank) or with both syntactic and functional labels (in PropBank).
PropBank and Chinese Treebank are linguistic resources that have long been available for research purposes. In contrast, Viet Treebank (see footnote 1) was developed only recently by Nguyen [17], drawing on the experience and approaches of Penn Treebank. Hence, Viet Treebank has the same structure as Penn Treebank and Chinese Treebank: each word is a leaf node of a tree, and nonterminal nodes are tagged with syntactic or functional labels. Figure 2 shows an example from Viet Treebank with function tags.
1 http://vlsp.vietlp.org:8080/demo/?page=resources
3 Current studies on Function tagging
Function tag labeling is an important processing step for many natural language processing applications such as question answering, information extraction, and summarization. Thus, some research has focused on the function tagging problem to recover additional semantic information, which is more useful than syntactic labels alone.
In 1997, Collins [7] introduced the idea of adding useful syntactic information and proposed a parser able to guess the complement tag. This parser, known as Collins's parser, is considered the first system to tag such labels. Function tag labeling was defined precisely by Blaheta (2003) [2]. His research used data from Penn Treebank II, which includes extra function tags. Following Blaheta's proposal, various studies have focused on function tag labeling, such as Merlo and Musillo (2005), Blaheta and Charniak (2004), Chrupala and Genabith (2006), and Sun and Sui (2009). These studies extend the topic by addressing new languages such as Chinese, proposing new approaches, or investigating new features. Nowadays, there are three main approaches to the function tag labeling problem:
The first approach is parsing, in which function labels are tagged during the parsing process; this approach is a modification of Collins's parser. Studies following this approach include Gabbard [9] and Marcus [17].
Figure 2 A parsing with function tags in Viet Treebank
The second approach is the labeling method, which includes two phases: extracting features and classifying function labels. This approach offers more techniques because of the diversity of classification methods. The most typical research following this approach is Blaheta's [3]. In his research, he applied several techniques to show the impact of each on the function tag labeling problem.
The third approach is sequential labeling. In this approach, function tags are predicted from the observed chain of words (Yuan [23]). It resembles the classification approach in how features are selected, but it uses a sequence prediction model instead of a classification model. These approaches are discussed in detail in the next chapter.
Today, there is a broader class of problems that covers function tagging. This class was described by Carreras (2004) [5] and is called Semantic Role Labeling. Semantic Role Labeling is similar to function tag labeling but works at a more abstract level. When building a Semantic Role Labeling system, the training data carries more information: it includes not only time, location, manner, etc., but also object, instrument, agent, etc. This problem is a promising new research direction for NLP applications that need to understand the meaning of sentences.
4 Objective of the thesis
As mentioned above, assigning function tags has been widely researched, especially for English. Recently, some studies have addressed Spanish and Chinese. Function tagging systems contribute to their corpora a semantic class that is very useful for other NLP applications such as Question Answering, Summarization, and Information Retrieval.
In recent years, Natural Language Processing research in Vietnam has developed rapidly. For Vietnamese in particular, many studies have focused on recognizing the syntax of Vietnamese sentences with POS tagging systems. Unfortunately, these NLP applications do not provide semantic information about a sentence, whereas some NLP applications need semantic information to answer questions such as who, where, what, and whom.
To deal with this problem, our research focuses on building an automatic function tag labeling system. In this thesis, which I regard provisionally as stage one, our research builds a function tagging system for Vietnamese, a problem shallower than Semantic Role Labeling (SRL). Our system tags only some basic function labels such as time, location, and direction. Other semantic roles such as agent and instrument, which belong to the Semantic Role Labeling problem, will be added to our system in the future. Our system has two phases. First, we extract function tags from Viet Treebank, a treebank annotated with hand-crafted semantic labels. After that, we select features to train the classification model. Some features are extended from the studies of Blaheta [2] and Yuan [22]; we also propose a new feature that has a significant impact on the function tag labeling system. Later chapters introduce our system in detail, including feature extraction, the selected model, and the construction of the new feature.
5 Our contributions
As discussed in the previous section, we aim to build a system for Vietnamese sentences that is shallower than Semantic Role Labeling. To our knowledge, although there have been other NLP investigations for Vietnamese, our research on function tag labeling may be the first.
Furthermore, our system provides a new tool for tagging functional labels in Viet Treebank. This will allow Viet Treebank to be enriched automatically rather than by hand-crafted tagging, as has been done before. Viet Treebank is one of the important resources for research in Natural Language Processing for Vietnamese. If it is enriched with function tags, other research will have more information about Vietnamese, especially semantic information.
Moreover, in this thesis I also introduce some approaches to the function tag labeling problem to show the techniques applied. Later researchers may build on these techniques. Because our system is the first for the function tag labeling problem, we expect it to provide a baseline system for later research.
We define the function tagging problem in two phases. In phase one, we build a training data set from Viet Treebank and other resources, such as the LaoDong and PC World newspapers. In phase two, we apply a classification model [1] to obtain the function tagging model for each input constituent. The two phases are described as follows:
· In phase one: We extract features, as shown in Chapter 3, from Viet Treebank to build a training data set. Independently, we build a corpus of words grouped into clusters. We call this corpus a word cluster, and the process of building it word clustering. We do not claim that our selected features are the best possible selection. However, our experiments show that they are reliable enough to recognize a function tag. Some of the selected features draw on results from other studies such as Blaheta [2], Gabbard [9], and Yuan [23]; the remaining features are our own proposals.
· In phase two, selecting the classification model: in the NLP domain there are many techniques for classifying a data set, and hence several classification models that can be used to recognize a function tag. In this step, we chose the Maximum Entropy Model (MEM) as our classification model because of its advantages. After that, we evaluate our system by calculating the precision and the distribution of each function tag. These results are presented in Chapter 4.
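To make phase two more concrete, the following is a minimal sketch of a maximum entropy (log-linear) classifier over binary indicator features. It is an illustration under stated assumptions and not the implementation used in this thesis: the feature strings, the toy training set, and the plain gradient-ascent training loop are all hypothetical, and a real system would more likely use an established estimation method and toolkit.

```python
import math
from collections import defaultdict

def predict_proba(weights, feats, labels):
    """Return p(label | feats) under a log-linear (maximum entropy) model."""
    scores = {y: sum(weights.get((f, y), 0.0) for f in feats) for y in labels}
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def train_maxent(samples, labels, epochs=200, lr=0.5):
    """Fit weights by gradient ascent on the conditional log-likelihood.

    The gradient for each (feature, label) pair is the empirical count
    minus the count expected under the current model.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in samples:
            probs = predict_proba(weights, feats, labels)
            for f in feats:
                grad[(f, gold)] += 1.0          # empirical count
                for y in labels:
                    grad[(f, y)] -= probs[y]    # expected count
        for key, g in grad.items():
            weights[key] += lr * g / len(samples)
    return weights

def classify(weights, feats, labels):
    """Pick the function tag with the highest model probability."""
    probs = predict_proba(weights, feats, labels)
    return max(labels, key=probs.get)

# Hypothetical toy data: each constituent is a list of feature strings.
LABELS = ["SUB", "DOB", "TMP"]
TRAIN = [
    (["label=NP", "father=S"], "SUB"),
    (["label=NP", "father=VP"], "DOB"),
    (["label=PP", "father=VP"], "TMP"),
]
WEIGHTS = train_maxent(TRAIN, LABELS)
print(classify(WEIGHTS, ["label=NP", "father=S"], LABELS))
```

The sketch keeps one weight per (feature, tag) pair, so adding a new feature, such as a word cluster identifier, only means adding another string to the feature list of a constituent.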
In conclusion, our research is, to our knowledge, the first on the function tag labeling problem for Vietnamese; we hope it will encourage researchers who are going to focus on the semantic domain in NLP applications.
6 Thesis structure
In this section, we give a brief outline of the thesis. It is an overview of the following chapters, where we give more details.
Chapter 2 – Related works
In this chapter, we present research related to ours and modern approaches to function tagging.
Chapter 3 – The proposal
We propose our approach to function labeling for Vietnamese, including the model, data preparation, techniques, selected features, and the method for extracting features from Viet Treebank.
Chapter 4 – Experiments
This chapter evaluates the effectiveness of our model. Because no similar function tagging system existed at the time this thesis was completed, we consider our model a baseline.
Chapter 5 – Conclusions and Future works
In this chapter, we give general conclusions about our work, its advantages, and its limitations. We also propose future work to improve our model.
Chapter 5 is followed by a list of related references.
Chapter II: Related works
In this chapter, I introduce some recent research on function tagging from several aspects: approach strategies, techniques, and experiments. As discussed above, there are three approaches to this problem. I describe one method for each approach in this chapter. Table 1 provides a summary of approaches to the function tag labeling problem.
Approach              Input            Method                                        Features
1st: Gabbard et al.   word sequence    parsing (PCFG)                                tree-based
2nd: Blaheta          syntactic tree   classification (decision tree, perceptron)   tree-based
3rd: Yuan et al.      word sequence    sequential labeling (HM-SVM)                  word-based
Ours                  syntactic tree   classification (MEM)                          tree-based, word clusters
Table 1 Functional Labeling Approaches
Note that in modern research, functional labeling is closely related to the Semantic Role Labeling (SRL) task [5, 10]. However, SRL is more complex than functional labeling. It requires a proposition bank, which is not yet available for Vietnamese. For this reason, our research focuses on the function tag labeling problem only.
1 Function Tags Labeling by Parsing
The parsing approach is based on the technique used to build Penn Treebank, in which function labels are assigned while the parse tree is built. For this approach, I discuss Gabbard et al. (2006) [9].
1.1 Motivation
Gabbard et al. focus on three important types of empty category annotation in Penn Treebank to identify function tags for a constituent:
· Null complementizers: in Penn Treebank they are denoted by the symbol 0. Null complementizers often stand in for relative pronouns such as that, who, and which that are omitted from a sentence. For example: “She is the girl 0 I told you, yesterday”
· Traces of wh-movement: they are annotated as “*T*”. This type concerns the object of wh-questions; the trace is co-indexed with the position of the constituent referred to by the wh-word. The following example shows this type in a sentence: “What-1 do you want (NP *T*-1)?”
· (NP *)s: this notation is used for several purposes in Penn Treebank, but it most commonly denotes the passive, as in: “(NP-1 this dog) was hit (NP *-1) by a drunken man”
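The bracketed notation in these examples can be read mechanically. As a minimal sketch, the code below parses a simplified bracketed string and lists the empty elements it contains. The simplified format (bare words directly under phrase labels, without part-of-speech nodes), the clause labels S, SBARQ, SQ, and WHNP around the examples, and the function names are assumptions made for illustration; this is not the procedure of Gabbard et al.

```python
def parse_brackets(text):
    """Parse a bracketed string into nested (label, children) tuples.

    Words are kept as plain strings; phrases become (label, [children]).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip ")"
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()

def empty_categories(tree):
    """Collect (parent label, marker) for every empty element in the tree.

    An empty element is assumed to be "0" (null complementizer) or any
    token starting with "*" (traces such as *T*-1 or *-1).
    """
    label, children = tree
    found = []
    for child in children:
        if isinstance(child, str):
            if child == "0" or child.startswith("*"):
                found.append((label, child))
        else:
            found.extend(empty_categories(child))
    return found

PASSIVE = "(S (NP-1 this dog) (VP was (VP hit (NP *-1) (PP by (NP a drunken man)))))"
QUESTION = "(SBARQ (WHNP-1 What) (SQ do (NP you) (VP want (NP *T*-1))))"

print(empty_categories(parse_brackets(PASSIVE)))   # [('NP', '*-1')]
print(empty_categories(parse_brackets(QUESTION)))  # [('NP', '*T*-1')]
```

The index after the marker (here 1) is what links the empty element back to its antecedent, NP-1 or WHNP-1.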
According to [9], statistical parsing ignored these types until Collins proposed Model 2 in 1997. Model 2 used heuristics and function tags during training to identify the arguments of constituents (e.g. TMP for an NP, combined into a TMP-NP label). Extending Model 2, Collins's Model 3 tries to cover traces of wh-movement, with limited success.
1.2 Approach
They modified Collins's parser (2003) by adding semantic information to tag function labels. Their parser has two stages. In stage one, the parser analyzes syntactic structure together with both function tags and empty categories. The challenge at this stage is to produce function tags and mark empty categories without lowering the standard Parseval metric, which was used as one of the evaluation criteria. The second stage focuses on recovering empty categories by combining a linguistically informed architecture (Campbell, 2004) and a rich feature set with machine learning methods.
Extending Collins's Model 2, in which function tags are used with heuristics during training to identify arguments, they keep the function tags in all parameter classes. They then augment the argument identification heuristic to cover every nonterminal carrying any of the tags in the Syntactic group. These function tags are treated as synonyms for the parser's internal tag. Thus the function tags are used not only to exclude potential arguments but also to include arguments, as in Bikel's parser (Bikel, 2004).
1.3 Result
In the testing phase, Gabbard et al. use all sentences of fewer than 40 words. Their system is evaluated with two measures commonly used in NLP research: precision and recall. They found that the data sparseness problem reported by Blaheta [2] is addressed by this approach. Table 2 shows the precision and recall of Gabbard et al.'s system.
Parser             Recall/Precision
Model 2            88.12/88.31
Model 2 - FuncB    88.23/88.31
Table 2 Result of labeling by parsing approach following Collins model
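For readers unfamiliar with the two measures, the sketch below computes precision and recall over labeled brackets, in the spirit of the Parseval metric. It is a simplified illustration: each bracket is assumed to be a (start, end, label) triple, brackets are treated as a set for a single sentence, and the toy brackets are hypothetical rather than taken from Gabbard et al.

```python
def precision_recall(gold, predicted):
    """Precision and recall of predicted labeled brackets against gold ones.

    precision = correct / number of predicted brackets
    recall    = correct / number of gold brackets
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical brackets for one five-word sentence: (start, end, label).
GOLD = {(0, 5, "S"), (0, 2, "NP-SBJ"), (2, 5, "VP"), (3, 5, "NP-TMP")}
PRED = {(0, 5, "S"), (0, 2, "NP-SBJ"), (2, 5, "VP"), (3, 5, "NP"), (3, 4, "NP")}

p, r = precision_recall(GOLD, PRED)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.60 recall=0.75
```

In this toy case the parser found the right span (3, 5) but missed the TMP function tag, so that bracket counts as an error for both measures.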
2 Sequential Function Tag Labeling
This approach formulates function tagging as a sequential labeling problem, a formulation that has been applied to other important natural language processing tasks such as named entity recognition and chunking. It does not require tree-based information. All features are word-based, including surrounding words and their part-of-speech (POS) tags. A model for predicting a label sequence from an observed sequence can be learned with techniques such as the Hidden Markov Model [18] and Conditional Random Fields [12]. In their paper, Yuan et al. [22] choose the Hidden Markov Support Vector Machine (HM-SVM) as their learning model. The resulting tagger reached a high accuracy of 96.18%.
2.1 Features
According to the Chinese Treebank (CTB) guidelines, the grammatical functions of Chinese, and the classification of English verbs (Levin, 1993), five features for function tagging are defined as follows:
· Words and POS tags: the context formed by surrounding words can increase prediction accuracy. In their experiments, they started with a context window of [-2, +2] words and went up to [-4, +4].
· Bigrams of POS tags: bigrams over the POS tags of the input.
· Verbs: function tags such as subject and object describe the relations between a verb and its arguments. Moreover, each class of verbs is associated with a set of syntactic frames. In this sense, Yuan et al. rely on the surface verb to distinguish syntactic roles.
· POS tags of verbs.
· Position indicators: whether a constituent occurs before or after the verb is highly correlated with its grammatical function. For example, in Chinese, subjects generally appear before a verb and objects after it. This feature was used to compensate for the lack of syntactic structure that could otherwise be read from the parse tree.
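The five word-based features above can be sketched as a single extraction function. This is an illustration only: the feature string format, the choice of the nearest verb as the reference verb, and the assumption that verb POS tags start with "V" are mine, not details confirmed for Yuan et al.'s system, and the example sentence is in English for readability.

```python
def nearest_verb(pos_tags, i):
    """Index of the verb closest to position i, or None if there is no verb.

    Assumption: verb POS tags start with "V" (e.g. VBD, VV).
    """
    verbs = [j for j, tag in enumerate(pos_tags) if tag.startswith("V")]
    return min(verbs, key=lambda j: abs(j - i)) if verbs else None

def sequence_features(words, pos_tags, i, window=2):
    """Word-based features for position i in a sentence."""
    feats = []
    # 1. Words and POS tags in a [-window, +window] context.
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(words):
            feats.append(f"w[{offset}]={words[j]}")
            feats.append(f"p[{offset}]={pos_tags[j]}")
    # 2. Bigram of POS tags (previous tag, current tag).
    if i > 0:
        feats.append(f"p[-1]|p[0]={pos_tags[i - 1]}|{pos_tags[i]}")
    v = nearest_verb(pos_tags, i)
    if v is not None:
        # 3. Surface verb and 4. its POS tag.
        feats.append(f"verb={words[v]}")
        feats.append(f"verb_pos={pos_tags[v]}")
        # 5. Position indicator relative to the verb.
        if i < v:
            feats.append("position=before")
        elif i > v:
            feats.append("position=after")
        else:
            feats.append("position=is_verb")
    return feats

WORDS = ["I", "bought", "this", "bat", "yesterday"]
TAGS = ["PRP", "VBD", "DT", "NN", "NN"]
print(sequence_features(WORDS, TAGS, 0))
print(sequence_features(WORDS, TAGS, 3))
```

For position 0 ("I") the position indicator is "before" the verb, and for position 3 ("bat") it is "after", which mirrors the subject and object tendency described in the last bullet.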
2.2 Learning model
In the research of Caixia Yuan and colleagues, function tagging is defined as the problem of predicting a sequence of function tags from given input words. Hence, it can be formulated as a sequence learning problem. Several algorithms exist for sequence prediction, such as the hidden Markov model (Rabiner, 1989), conditional random fields (Lafferty et al., 2001), and the hidden Markov support vector machine (HM-SVM) (Altun et al., 2003; Tsochantaridis et al., 2004).
The model selected in this paper is HM-SVM because of its advantages in learning labels. As a result, the tagger reached an accuracy of 96.18%.
3 Function Tag Labeling by Classification
This approach carries out functional labeling after syntactic parsing. Functional labeling is treated as a classification problem, or more specifically a tree-based classification problem. Syntactic trees, the output of parsing, are the input to functional labeling. Tree nodes (syntactic constituents) are labeled with function tags independently. Many machine learning techniques can be applied in this approach. It was first used by Blaheta [2] for English, who employed two machine learning techniques: decision trees and the perceptron.
3.1 Feature
Blaheta (2003) proposed nine features for this model:
· Label: the syntactic label of the constituent. Blaheta suggests that it is probably the most important feature, especially for syntactic tags. For example, a VP (see footnote 2) is never assigned any syntactic tag, while an ADJP might receive the PRD label.
· CC label: this feature represents the coordinating conjunction label. Normally, the CC label is the same as the regular label. But if a constituent combines two or more phrases such as NP, VP, or PP, the whole phrase is tagged CCNP, CCVP, or CCPP instead of NP, VP, or PP.
2 These notations, such as NP, VP, ADJ, PP, are referred from Penn Treebank guideline, so I do not list their
· Head word: a basic feature in parsing studies such as [7]. The idea is that each word in a sentence is related to the other words in the same phrase, except the word at the beginning of the phrase. Moreover, these words may be adjuncts or complements of words outside the phrase. This relation is lexical information that can be used for function tagging.
· Head POS: the POS tag of the head word. Charniak's experiment [3] showed that the parsing system improved by approximately 2% when the part-of-speech feature was added.
· Alternative head: if the constituent is a prepositional phrase, the head word of its noun phrase is the alternative head.
· Alternative POS tag: the POS tag of the alternative head.
· Function tags: in most prediction models, function tag information is useful for predicting function tags in other contexts.
· Label clusters: labels are manually grouped into clusters. This feature is very useful for labels with low frequency in the training data, such as WHNP, ADJP, and JJ. It is used to improve labeling performance when data is sparse.
· Word clusters: an algorithm is run to group all words with a given POS tag into a relatively small number of clusters. This feature is used like the label cluster feature. With nearly 40,000 words, Blaheta hoped it would fend off the sparse data problem.
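To illustrate how the label, head word, head POS, and alternative head features might be read off a syntactic tree, here is a minimal sketch. The Node class, the tiny head table, and the rightmost-child fallback are hypothetical simplifications for illustration; they are not Blaheta's head-finding rules.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str                                   # phrase label or POS tag
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                   # set only on POS (leaf) nodes

# Hypothetical, deliberately tiny head table: for each phrase label,
# the child labels to try, in order, when looking for the head.
HEAD_CHILD = {
    "NP": ["NN", "NNS", "NP"],
    "VP": ["VBD", "VB", "VP"],
    "PP": ["IN"],
    "S": ["VP"],
}

def head_leaf(node):
    """Follow head children down to a leaf; fall back to the rightmost child."""
    if node.word is not None:
        return node
    for wanted in HEAD_CHILD.get(node.label, []):
        for child in node.children:
            if child.label == wanted:
                return head_leaf(child)
    return head_leaf(node.children[-1])

def constituent_features(node):
    """Label, head word, head POS, and (for a PP) alternative head features."""
    head = head_leaf(node)
    feats = {"label": node.label, "head_word": head.word, "head_pos": head.label}
    if node.label == "PP":
        for child in node.children:
            if child.label == "NP":
                alt = head_leaf(child)           # head of the PP's noun phrase
                feats["alt_head"] = alt.word
                feats["alt_pos"] = alt.label
                break
    return feats

PP = Node("PP", [
    Node("IN", word="in"),
    Node("NP", [Node("DT", word="the"), Node("NN", word="morning")]),
])
print(constituent_features(PP))
# {'label': 'PP', 'head_word': 'in', 'head_pos': 'IN', 'alt_head': 'morning', 'alt_pos': 'NN'}
```

The example shows why the alternative head matters: the head word "in" says little about the function of the phrase, while the alternative head "morning" points toward a temporal reading.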
The advantage of the classification strategy is that new features, both local and non-local, can be incorporated easily. Because of this advantage, we follow the classification approach in our system.
3.2 Model
Blaheta [2] applied two classification methods in his research: decision trees and the perceptron. He ran experiments with each method to compare their performance on this problem. To do so, Blaheta had to prepare a separate training data set for each method.
· Decision trees: decision trees have been applied in many Natural Language Processing studies (Magerman, 1994; Bahl et al.), especially for English. To build a decision tree for the function tag labeling problem, Blaheta used an existing package, C4.5, from Ross Quinlan's research. He converted all training and testing data into the format that the C4.5 package reads. In the end, Blaheta abandoned this method because of its performance. He explained that the performance of the decision tree improved only slightly while the memory needed to store the model (the decision tree) kept growing. Hence, this method is limited by memory size.
· Perceptron: a classical perceptron performs binary classification. Its output is only “True” or “False”, “Yes” or “No”, 0 or 1, etc. For the function tag labeling problem, Blaheta applied a multi-valued perceptron classifier in which each node in the output layer of the network is a function tag. The perceptron system can be described as follows:
o First, function tags are extracted from Penn Treebank to create training data.
o Second, an initial perceptron network is set up and then trained with the function tags obtained in the previous step to build a multi-class classification model.
o Last, the model is run on the test data and system performance is evaluated. If the result is poor, the input weights of the neurons in the input layer are adjusted and the model is trained again.
Figure 3 below presents the perceptron model for the function tag labeling problem.
[Figure 3: perceptron network with constituents as input and function tags (e.g. PRD) as output]
Figure 3. The perceptron model for function tags labeling problem
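As a rough illustration of a multi-class perceptron with one output per function tag, consider the sketch below. It uses the standard mistake-driven update rule, which I assume here for illustration; the toy features and training set are hypothetical, and Blaheta's actual network and training procedure may differ.

```python
from collections import defaultdict

def predict(weights, feats, labels):
    """Return the label whose output node has the highest score."""
    return max(labels, key=lambda y: sum(weights.get((f, y), 0.0) for f in feats))

def train_perceptron(samples, labels, epochs=10):
    """Mistake-driven training: on an error, reward the gold label's
    weights and penalize the wrongly predicted label's weights."""
    weights = defaultdict(float)
    for _ in range(epochs):
        mistakes = 0
        for feats, gold in samples:
            guess = predict(weights, feats, labels)
            if guess != gold:
                mistakes += 1
                for f in feats:
                    weights[(f, gold)] += 1.0
                    weights[(f, guess)] -= 1.0
        if mistakes == 0:       # every training sample is classified correctly
            break
    return weights

# Hypothetical toy data: constituents described by feature strings.
LABELS = ["SUB", "DOB", "PRD"]
TRAIN = [
    (["label=NP", "father=S"], "SUB"),
    (["label=NP", "father=VP"], "DOB"),
    (["label=ADJP", "father=VP"], "PRD"),
]
WEIGHTS = train_perceptron(TRAIN, LABELS)
print(predict(WEIGHTS, ["label=ADJP", "father=VP"], LABELS))  # PRD
```

On this toy set the loop stops after three passes, once no training sample is misclassified.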
Chapter III: The proposed approach
In this chapter, we introduce our approach, including an overview of the system architecture, the definition of features for the function tag labeling problem in Vietnamese, and the choice of classification model.
1 System Architecture
Our task includes two phases, as introduced in Chapter 1: training and testing. In the training phase, we use two resources as training data for our system: the Vietnamese Treebank and an unlabeled corpus collected from online newspapers. The two main steps in this phase are feature extraction and Maximum Entropy Model (MEM) training. In addition, building the training data requires a step that groups the words of the unlabeled corpus into word clusters. In the testing phase, the input is a syntactic tree and the output of our system is the set of functional labels over the input tree. This phase again has two main steps: feature extraction and MEM classification.
In this section, we present our model graphically to give an overview of our task and the steps involved. Figure 4 shows our system for the function tag labeling problem.
[Figure 4: input: syntactic tree; output: syntactic tree with functional tags; resources: online newspapers (700,000 sentences) and Vietnamese Treebank (10,417 trees); steps: feature extraction, word clusters]
Figure 4 Model of Function Tag Labeling System for Vietnamese sentences
2 Function Tags in Vietnamese
Vietnamese is written with Latin characters adapted to the characteristics of our culture. Hence, the constituents of a Vietnamese sentence play the same roles as in English. In [13], 20 function tags are defined for Viet Treebank; Table 3 below presents the functional labels tagged in Viet Treebank. It can be clearly seen that the function tags have almost the same notation and roles as in English.
Clause types
CMD   Command       EXC   Exclamation
Syntactic roles
SUB   Subject       SQ    Question
TPC   Topic         DOB   Direct object
PRD   Predicate     IOB   Indirect object
EXT   Extent        VOC   Vocative
Adverbials
TMP   Time          LOC   Location
MNR   Manner        DIR   Direction
PRP   Purpose       CNC   Concession
CND   Condition     ADV   Adverbial
MDP   Modus
Table 3 Function Tags on Viet Treebank
3 Selected Features
· Label: the syntactic label of the current constituent, the one being functionally classified, is very important in recognizing its role.
· Father’s label: this feature is useful in certain cases. For example, suppose that the current constituent is a noun phrase (NP). If its father is a clause (S), it is more likely to be a subject (SUB); if its father is a verb phrase (VP), it is more likely to be a direct object (DOB).
· Head word: this feature has proved useful in parsing. It is also important for discriminating functions, for example, between temporal (TMP)