UNIVERSITY OF ENGINEERING AND TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY, HANOI
NGUYEN THANH HUY
BUILDING A SEMANTIC ROLE LABELING SYSTEM FOR VIETNAMESE SENTENCES
Major: Computer Science Code: 60 48 01
MASTER THESIS
Supervised by: PhD Nguyen Phuong Thai
Hanoi - 2011
Chapter I: Introduction
3 Current studies on Function tagging
Chapter II: Related works
3 Selected Features
4 Features and Constraints
5.5 Maximum Entropy Principle
Corpora and Tools
Functional Labeling Precisions
List of Figures
Figure 7 Scenarios in constrained optimization
Figure 8 Pseudo-code for extracting function labels
Figure 9 An example of word cluster
Figure 10 Learning curve
Figure 11 The dependency between two function labels
List of Tables
Table 1 Functional Labeling Approaches
Table 2 Result of labeling by parsing approach following the Collins model
Table 3 Function Tags on Viet Treebank
Table 4 Vietnamese Treebank statistics
Table 5 Evaluation of Vietnamese functional labeling system
Table 6 Increases in precision by using word cluster feature
Chapter I: Introduction
In this chapter, I introduce function tags and their value in NLP applications, some current approaches, the objective of the thesis, and our contributions. Finally, I describe the structure of the thesis.
1 Function tags
There are two kinds of tags in linguistics: syntactic tags and function tags. For syntactic tags, there are several theories and research projects for English, Spanish, and Chinese [4][13][14][18]. This research mainly focuses on finding the part of speech of words and tagging their constituents. Function tags are understood as more abstract labels because they behave differently from syntactic labels. While a syntactic label gives one notation to a group of words wherever it appears, a function tag represents the relationship between a phrase and its utterance in each different context. The function tag of a phrase may therefore change, depending on the context of its neighbors. For example, consider the phrase "baseball bat". The syntactic label of this phrase is "noun phrase" (annotated as NP in most research). But its function tag might be subject in this sentence:
This baseball bat is very expensive
In another case, its function tag might be direct object:
I bought this baseball bat last month
Or instrument, the agent in a passive sentence:
That man was attacked by this baseball bat
Function tags were directly addressed by Blaheta (2003) [2]. A lot of research focuses on how to assign function tags to a sentence. This kind of research problem is called the function tag labeling problem, a class of problems concerned with finding the semantic information of a phrase. To sum up, function tag labeling is defined as the problem of finding the semantic information of a group of words and then tagging it with a given annotation in its context.
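The baseball bat example above can be written down as a small data sketch. The class name `Constituent` and the tag strings are illustrative assumptions (SUB and DOB follow the Viet Treebank tags used later in this thesis; INSTRUMENT is a hypothetical name), not part of any corpus format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Constituent:
    """A phrase with a fixed syntactic label and a context-dependent function tag."""
    text: str
    syntactic_label: str                 # e.g. "NP"; the same in every context
    function_tag: Optional[str] = None   # assigned per sentence


# The same noun phrase in the three example sentences.
# Tag names are illustrative assumptions, not an official tag set.
examples = [
    ("This baseball bat is very expensive",
     Constituent("this baseball bat", "NP", "SUB")),
    ("I bought this baseball bat last month",
     Constituent("this baseball bat", "NP", "DOB")),
    ("That man was attacked by this baseball bat",
     Constituent("this baseball bat", "NP", "INSTRUMENT")),
]

syntactic_labels = {c.syntactic_label for _, c in examples}
function_tags = [c.function_tag for _, c in examples]

for sentence, c in examples:
    print(f"{c.syntactic_label}-{c.function_tag}: {sentence}")
```

The syntactic label stays NP in all three sentences, while the function tag differs each time, which is the distinction a function tagger has to learn.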
2 Corpora for function tag labeling
Machine learning is now the most popular method for many problems, especially in Natural Language Processing. To build a machine learning system, we need a training data set. There are function tag labeling corpora for languages such as English and Chinese. For English, two main corpora are used for the semantic role labeling and function tag labeling problems: FrameNet (Baker, 1998; Fillmore and Baker, 2000) and PropBank (Palmer et al., 2005). The main idea of FrameNet is to group similar words together and then represent the relationship of each group with other groups in a network of groups. That is why it is called FrameNet. Figure 1 shows a small example branch of FrameNet.
Figure 1 Sample domain and frame element of FrameNet
The second corpus is PropBank (Palmer et al., 2005), which is a modification of Penn Treebank annotated with additional information: function tags. Penn Treebank and PropBank are organized as sets of trees. Each tree represents a sentence tagged with syntactic labels (for Penn Treebank) or with both syntactic and functional labels (for PropBank).
PropBank and Chinese Treebank are linguistic resources that have been available for research purposes for a long time, whereas Viet Treebank¹ was developed recently by Nguyen [17], using the experience and approaches of Penn Treebank. Hence, Viet Treebank has the same structure as Penn Treebank and Chinese Treebank: each word is represented as a leaf node of a tree, and non-terminal nodes are tagged with a syntactic label or a functional label. Figure 2 shows an example from Viet Treebank with function tags.
¹ ietlp.org:8080/demo/?page=resources
Figure 2 A parse with function tags in Viet Treebank
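As a rough illustration of the tree organization described above, the following sketch reads a Penn-Treebank-style bracketed string and separates syntactic labels from function tags. The bracketed sentence, the `Node` class, and the assumption that a hyphen separates the label from its tags are all illustrative; special labels such as -NONE- are not handled, and this is not the actual Viet Treebank reader.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str                            # syntactic label, e.g. "NP"
    function_tags: List[str]              # e.g. ["SUB"]; empty if untagged
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None            # set only on leaf (word) nodes


def split_label(raw: str) -> Tuple[str, List[str]]:
    """Split 'NP-SUB' into ('NP', ['SUB']); assumes '-' separates the tags."""
    parts = raw.split("-")
    return parts[0], parts[1:]


def parse(text: str) -> Node:
    """Parse one bracketed tree such as '(S (NP-SUB (P w1)) (VP (V w2)))'."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label, tags = split_label(tokens[pos])
        pos += 1
        node = Node(label, tags)
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())
            else:
                node.word = tokens[pos]
                pos += 1
        pos += 1  # consume the closing ")"
        return node

    return read()


def tagged(node: Node) -> List[Tuple[str, str]]:
    """Collect (syntactic label, function tag) pairs in top-down order."""
    found = [(node.label, tag) for tag in node.function_tags]
    for child in node.children:
        found.extend(tagged(child))
    return found


# Hypothetical example sentence ("I buy books"), not taken from the corpus.
tree = parse("(S (NP-SUB (P Tôi)) (VP (V mua) (NP-DOB (N sách))))")
print(tagged(tree))
```

Keeping the function tags in a separate field makes it easy to strip them for parser training or to use them as gold labels for a tagger.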
3 Current studies on Function tagging
Function tag labeling is an important processing step for many natural language processing applications such as question answering, information extraction, and summarization. Thus, some research has focused on the function tagging problem in order to recover additional semantic information, which is more useful than syntactic labels alone.
In 1997, Collins [7] introduced the idea of adding some useful syntactic information and proposed a parser capable of guessing the complement tag. This parser is called Collins's parser, and it is considered the first system for tagging such labels. Function tag labeling was defined precisely by Blaheta (2003) [2]. His research used data from Penn Treebank II, which includes extra function tags. Following Blaheta's proposal, various investigations have focused on function tag labeling, such as Merlo and Musillo (2005), Blaheta and Charniak (2004), Chrupala and van Genabith (2006), and Sun and Sui (2009). These studies extend the function tag labeling topic by focusing on new languages such as Chinese, proposing new approaches, or investigating new features. At present, there are three main approaches to the function tag labeling problem.
The first approach is called parsing: function labels are tagged during the parsing process. This approach is a modification of Collins's parser. Following this approach, we can consider the studies of Gabbard [9] and Marcus [17].
The second approach is called the labeling method, which includes two phases: extracting features and classifying function labels. This approach offers more techniques because of the diversity of classification techniques. The most typical research following this approach is Blaheta's [3]. In his research, he applied several techniques to show the impact of each one on the function tag labeling problem.
The third approach is defined as the sequential labeling approach. In this approach, function tags are predicted from the observed chain of words (Yuan [23]). This approach is similar to the classification approach in how features are selected, but the difference is that it uses a prediction model instead of a classification model. These approaches will be discussed in detail in the next chapter.
Today there is a class of problems that covers function tagging. This class was mentioned by Carreras (2004) [5] and is called Semantic Role Labeling. Semantic Role Labeling is similar to function tag labeling, but it works at a more abstract level. When building a Semantic Role Labeling system, the training data contains more information: it includes not only time, location, manner, etc., but also object, instrument, agent, etc. This problem is a promising new research direction for NLP applications that need to understand the meaning of sentences.
4 Objective of the thesis
As mentioned above, assigning function tags has been widely researched, especially for English. Recently, some studies have been applied to Spanish and Chinese. All function tagging systems have contributed to their corpora a semantic layer that is very useful for other NLP applications such as Question Answering, Summarization, Information Retrieval, etc.
In recent years, Natural Language Processing research in Vietnam has developed rapidly. For Vietnamese in particular, many studies have focused on how to recognize the syntax of Vietnamese sentences with a POS tagging system. Unfortunately, these NLP applications do not provide semantic information for a sentence, whereas some NLP applications need semantic information to answer questions such as who, where, what, and whom.
To deal with this problem, our research focuses on building an automatic function tag labeling system. In this thesis, which I provisionally call stage one, our research builds a function tagging system for Vietnamese, a problem that is shallower than the Semantic Role Labeling (SRL) problem. Our research only tags some basic function labels such as time, location, direction, etc. Other semantic roles such as agent, instrument, etc., which belong to the Semantic Role Labeling problem, will be added to our system in the future. Our system has two phases. First, we extract function tags from Viet Treebank, a treebank annotated with hand-crafted semantic labels. After that, we select features to train the classification model. Some features are extended from the studies of Blaheta [2] and Yuan [22]; we also propose a new feature which has a significant impact on the function tag labeling system. In later chapters, we will introduce our system in detail, including feature extraction, the selected model, building the new feature, etc.
5 Our contributions
As discussed in the previous section, we aim to build a system for Vietnamese sentences that is shallower than Semantic Role Labeling. To our knowledge, although there have been other NLP investigations for Vietnamese, our research on function tag labeling may be the first.
Furthermore, our system provides a new tool for tagging functional labels in Viet Treebank. This will enrich Viet Treebank with automatic tagging instead of the hand-crafted tagging done before. Viet Treebank is one of the important resources for research in Natural Language Processing for Vietnamese. If it is enriched with function tags, other research will have more information on Vietnamese, especially semantic information.
Moreover, in this thesis I also introduce some approaches to the function tag labeling problem to show the techniques applied. These techniques may be adopted by other researchers later. Because our system is the first for the function tag labeling problem in Vietnamese, we expect it to serve as a baseline system for later function tag labeling research.
We define the function tagging problem in two phases. In phase one, we build a training data set from Viet Treebank and other resources, such as the Lao Dong and PC World newspapers. In phase two, we apply a classification model [1] to obtain the function tag for each input constituent. These two phases are described as follows:
• In phase one: We extract features, as shown in chapter 3, from Viet Treebank to build a training data set. Independently, we build a corpus of words grouped into clusters. We call that corpus a word cluster, and the process of making it is called word clustering. We do not claim that our selected features are the best selection. However, our experiments show that they are reliable enough to recognize a function tag. Some of the selected features are drawn from the results of other studies such as Blaheta [2], Gabbard [9], and Yuan [23]; the remaining features are our own proposals.
• In phase two, selecting the classification model: in the NLP domain there are many techniques for classifying a data set. As a result, there are several classification models that can be used to recognize a function tag. In this step, we select the Maximum Entropy Model (MEM) as our classification model because of its advantages. After that, we evaluate our system by calculating the precision and the distribution proportion for each function tag. These results are presented in chapter 4.
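The evaluation step in phase two can be sketched as follows. This is a minimal illustration that assumes the usual definition of precision (correct predictions of a tag divided by all predictions of that tag); the function names and the toy gold and predicted sequences are hypothetical and are not results from this thesis.

```python
from collections import Counter
from typing import Dict, List


def per_tag_precision(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """Precision per tag: correct predictions of a tag / all predictions of it."""
    predicted_count = Counter(predicted)
    correct_count = Counter(p for g, p in zip(gold, predicted) if g == p)
    return {tag: correct_count[tag] / n for tag, n in predicted_count.items()}


def tag_distribution(gold: List[str]) -> Dict[str, float]:
    """Proportion of each function tag among the gold labels."""
    total = len(gold)
    return {tag: n / total for tag, n in Counter(gold).items()}


# Toy data for illustration only.
gold = ["SUB", "DOB", "TMP", "SUB", "LOC", "TMP"]
pred = ["SUB", "DOB", "LOC", "SUB", "LOC", "TMP"]

precision = per_tag_precision(gold, pred)
distribution = tag_distribution(gold)

for tag in sorted(precision):
    print(f"{tag}: precision={precision[tag]:.2f}")
```

Reporting the distribution next to the precision helps to see whether a low score comes from a rare tag with few training examples.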
In conclusion, our research is the first on the function tag labeling problem for Vietnamese, and it will be an encouragement for researchers who are going to focus on the semantic domain in NLP applications.
6 Thesis structure
In this section, we give a brief outline of the thesis. It is an overview of the following chapters, where we give more details.
Chapter 2: Related works
In this chapter, we present further research related to ours and modern approaches to function tagging.
Chapter 3: The proposal
We propose our approach to function labeling for Vietnamese, including the model, data preparation, techniques, selected features, and the method for extracting features from Viet Treebank.
Chapter 4: Experiments
This chapter evaluates the effectiveness of our model. Because no similar system for the function tagging problem existed when this thesis was completed, we consider our model a baseline.
Chapter 5: Conclusions and future works
In this chapter, we give general conclusions about our work, including its advantages and restrictions. In addition, we propose future work to improve our model.
Chapter 5 is followed by a list of related references.
Chapter II: Related works
In this chapter, I introduce some recent research on function tagging from several aspects, such as approach strategies, techniques, and experiments. As discussed above, there are three approaches to this problem. I will present one method for each approach in this chapter. For an overview, Table 1 provides a summary of approaches to the function tag labeling problem.
Approach | Input | Technique | Features
1st: Gabbard et al. | word sequence | parsing (PCFG) | tree-based
2nd: Blaheta | syntactic tree | classification (decision tree, perceptron) | tree-based
3rd: Yuan et al. | word sequence | sequential labeling (HM-SVM) | word-based
Ours | syntactic tree | classification (MEM) | tree-based, word clusters
Table 1 Functional Labeling Approaches
Note that in modern research, Functional Labeling has a strong relation to the Semantic Role Labeling (SRL) task [5, 10]. However, SRL is more complex than Functional Labeling. It requires a proposition bank, which is not yet available for Vietnamese. For this reason, our research focuses on the Function Tag Labeling problem only.
1 Function Tags Labeling by Parsing
The parsing approach is based on the technique that was used to build Penn Treebank, in which function labels are tagged while the parse tree is built. For this approach, I discuss Gabbard et al. (2006) [9].
1.1 Motivation
Gabbard et al. focus on three important types of empty category annotation in Penn Treebank to identify function tags for a constituent:
• Null complementizers: in Penn Treebank they are annotated with the symbol "0". They often stand in for relative pronouns such as that, who, and which that are omitted from the sentence. For example: "She is the girl 0 I told you, yesterday"
• Traces of wh-movement: they are annotated as "*T*". This type concerns the object of wh-questions; the trace is co-indexed with the position of the constituent referred to by the wh-word. The following example shows this type in a sentence: "What-1 do you want (NP *T*-1)?"
• (NP *)s: these annotations are used for several purposes in Penn Treebank, but most commonly to mark the passive, as in: "(NP-1 this dog) was hit (NP *-1) by a drunken man"
According to [9], these types were ignored in statistical parsing until Collins's Model 2 was proposed in 1997. Model 2 used heuristics and function tags during training to identify the arguments of constituents (e.g. TMP combined with NP to form the label TMP-NP). Extending Model 2, Collins's Model 3 tried to cover traces of wh-movement, with limited success.
1.2 Approach
They modified Collins's parser (2003) by adding semantic information in order to tag function labels. Their parser has two stages. In stage one, the parser analyzes syntactic structure together with function tags and empty categories. In this stage, the challenge is how to produce function tags and mark empty categories without lowering the standard Parseval metric, which was used as one of the evaluation criteria. The second stage focuses on the recovery of empty categories by combining a linguistically informed architecture (Campbell, 2004) and a rich feature set with machine learning methods.
Extending Collins's Model 2, which uses function tags during training together with heuristics to identify arguments, they keep the function tags in all parameter classes. They then augment the argument identification heuristic to treat as arguments all nonterminals with any of the tags in the Syntactic group. These function tags are treated as synonyms for the internal tag. Thus the function tags are used not only for excluding potential arguments but also for including arguments, as in Bikel's parser (Bikel, 2004).
1.3 Result
In the testing phase, Gabbard et al. used all sentences shorter than 40 words. Their system was evaluated with two measures that are commonly used in NLP research: precision and recall. As a result, they found that the data sparseness problem reported by Blaheta [2] is fixed by this approach. Table 2 shows the recall and precision of Gabbard's system.
Parser | Recall/Precision
Model 2 | 88.12/88.31
Model 2 - FuncB | 88.23/88.31
Table 2 Result of labeling by parsing approach following the Collins model
2 Sequential Function Tag Labeling
This approach formulates function tagging as a sequential labeling problem, a formulation which has been applied to other important natural language processing tasks such as named entity recognition and chunking. This approach does not require tree-based information. All features are word-based, including surrounding words and their part-of-speech (POS) tags. The learning model for predicting a label sequence from an observed sequence can be built with techniques such as the Hidden Markov Model [18] and Conditional Random Fields [12]. In their paper, Yuan et al. [22] choose the Hidden Markov Support Vector Machine (HM-SVM) technique for their learning model. The resulting tagger was very accurate, reaching an accuracy of 96.18%.
2.1 Features
According to the Chinese Treebank (CTB) guidelines, the grammatical functions of Chinese, and the reference on English verbs (Levin, 1993), five features for function tagging are defined as follows:
• Word and POS tags: the context formed by surrounding words can increase the accuracy of prediction. In their experiments, they started from a context range of [-2, +2] words and went up to [-4, +4].
• Bi-gram of POS tags: the bi-grams of the POS tags of the input constituent.
• Verbs: function tags like subject and object describe the relations between a verb and its arguments. In addition, each class of verbs is associated with a set of syntactic frames. In this sense, Yuan et al. relied on the surface verb for distinguishing syntactic roles.
• POS tags of verbs
• Position indicators: whether a constituent occurs before or after the verb is highly correlated with its grammatical function. For example, in Chinese, subjects generally appear before a verb and objects after it. This feature was used to compensate for the lack of syntactic structure that could otherwise be read from the parse tree.
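The five word-based features listed above can be sketched as a small extraction function. This is only an illustration under several assumptions: it works on single tokens rather than constituents, the feature names and the `<PAD>` placeholder are hypothetical, and the English sentence with Penn-style POS tags is a toy example rather than Chinese Treebank data.

```python
from typing import Dict, List, Tuple

PAD = "<PAD>"  # placeholder for positions outside the sentence


def sequential_features(
    tokens: List[Tuple[str, str]],  # (word, POS tag) pairs
    index: int,                     # position of the item being tagged
    verb_index: int,                # position of the governing verb
    window: int = 2,                # context range [-window, +window]
) -> Dict[str, str]:
    feats: Dict[str, str] = {}
    # 1. Words and POS tags in the surrounding window.
    for offset in range(-window, window + 1):
        j = index + offset
        word, pos = tokens[j] if 0 <= j < len(tokens) else (PAD, PAD)
        feats[f"word[{offset:+d}]"] = word
        feats[f"pos[{offset:+d}]"] = pos
    # 2. Bi-gram of POS tags ending at the current position.
    prev_pos = tokens[index - 1][1] if index > 0 else PAD
    feats["pos_bigram"] = f"{prev_pos}_{tokens[index][1]}"
    # 3 and 4. The surface verb and its POS tag.
    feats["verb"], feats["verb_pos"] = tokens[verb_index]
    # 5. Position indicator relative to the verb.
    if index < verb_index:
        feats["position"] = "before"
    elif index > verb_index:
        feats["position"] = "after"
    else:
        feats["position"] = "verb"
    return feats


sentence = [("I", "PRP"), ("bought", "VBD"), ("this", "DT"), ("baseball", "NN"),
            ("bat", "NN"), ("last", "JJ"), ("month", "NN")]
feats = sequential_features(sentence, index=4, verb_index=1)
print(feats["word[-1]"], feats["pos_bigram"], feats["position"])
```

Widening `window` from 2 to 4 corresponds to the [-4, +4] context range mentioned in the first feature.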
2.2 Learning model
In the research of Caixia Yuan and colleagues, function tagging is defined as the problem of predicting a sequence of function tags from the given input words. Hence, this problem can be formulated as a sequence learning problem. For sequence prediction there are several algorithms, such as the hidden Markov model (Rabiner, 1989), conditional random fields (Lafferty et al., 2001), and the hidden Markov support vector machine (HM-SVM) (Altun et al., 2003; Tsochantaridis et al., 2004).
The model selected in that paper is HM-SVM because of its advantages in learning labels. As a result, the tagger reached an accuracy of 96.18%.
3 Function Tag Labeling by Classification
This approach carries out functional labeling after syntactic parsing. Functional labeling is considered a classification problem, or more specifically a tree-based classification problem. Syntactic trees, the output of parsing, are used as the input of functional labeling. Tree nodes (syntactic constituents) are labeled with function tags independently. Following this approach, many machine learning techniques can be applied. This labeling approach was first used by Blaheta [2] for English. He employed two machine learning techniques: decision trees and the perceptron.
3.1 Features
Blaheta (2003) proposed nine features for this model:
• Label: the syntactic label of the constituent. Blaheta suggests that it is probably the most important feature, especially for syntactic tags. For example, a VP² is never given any syntactic tag, and an ADJP might receive a PRD label.
• cc-Label: this feature represents the coordinating conjunction label. Normally, the cc-label is identical to the regular label. But if a constituent is a combination of two or more phrases such as NP, VP, or PP, it is tagged CCNP, CCVP, or CCPP instead of NP, VP, or PP for the whole phrase.
• Head word: this is a basic feature in parsing studies such as [7]. The idea of this feature is that each word in a sentence is always related to other words in the same phrase, except the word at the beginning of the phrase. Moreover, these words may be adjuncts or complements of other words outside the phrase. This relation is lexical information which is used for function tagging.
• Head POS: the POS tag of the head word. Charniak's experiment [3] showed that the parsing system improved by approximately 2% when the part-of-speech feature was added.
• Alternative head: if the constituent is a prepositional phrase, the head word of its noun phrase is the alternative head.
• Alternative POS tag: the POS tag of the alternative head.
• Function tags: in most prediction models, function tag information is useful for predicting function tags in other contexts.
• Label clusters: labels are manually grouped into clusters. This feature is very useful for labels which have low frequency in the training data, such as WHNP, ADJP, JJ, etc. It is used to increase the performance of the labeling task when data is sparse.
• Word clusters: an algorithm was run to group all words with a given POS tag into a relatively small number of clusters. This feature is used in the same way as the label cluster feature. With nearly 40,000 words, Blaheta hoped that it would fend off the sparse data problem.
² Notations such as NP, VP, ADJP, PP, etc. are taken from the Penn Treebank guidelines, so I do not list their annotations here.
The advantage of the classification strategy is that new features, both local and non-local, can be incorporated easily. Because of this advantage, we follow the classification approach in our system.
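A few of the tree-based features above (label, head word, head POS, alternative head) can be sketched as follows. The `Tree` class and the toy head rules in `HEAD_CHILD` are assumptions made for illustration; real head-finding rules are considerably more detailed, and this is not Blaheta's actual feature extractor.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Tree:
    label: str
    children: List["Tree"] = field(default_factory=list)
    word: Optional[str] = None  # set only on leaves, whose label is a POS tag


# Toy head rules (an assumption): index of the head child for each parent label.
HEAD_CHILD = {"NP": -1, "VP": 0, "PP": 0, "S": -1}


def head_leaf(tree: Tree) -> Tree:
    """Follow head children down to a leaf."""
    while tree.children:
        tree = tree.children[HEAD_CHILD.get(tree.label, -1)]
    return tree


def constituent_features(tree: Tree) -> Dict[str, Optional[str]]:
    head = head_leaf(tree)
    feats: Dict[str, Optional[str]] = {
        "label": tree.label,
        "head_word": head.word,
        "head_pos": head.label,
    }
    # Alternative head: for a PP, the head of its noun phrase.
    if tree.label == "PP":
        noun_phrases = [c for c in tree.children if c.label == "NP"]
        if noun_phrases:
            alt = head_leaf(noun_phrases[0])
            feats["alt_head_word"] = alt.word
            feats["alt_head_pos"] = alt.label
    return feats


pp = Tree("PP", [
    Tree("IN", word="in"),
    Tree("NP", [Tree("DT", word="the"), Tree("NN", word="park")]),
])
feats = constituent_features(pp)
print(feats)
```

Because each feature is just another dictionary entry, adding a non-local feature such as the parent's label only requires passing more context into `constituent_features`.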
3.2 Model
Blaheta [2] applied two classification methods in his research: decision trees and the perceptron. He carried out an experiment with each method in order to compare their performance on this problem. To do that, Blaheta had to prepare a separate training data set for each method.
• Decision trees: the decision tree method has been applied in many Natural Language Processing studies (Magerman, 1994; Bahl et al.), especially for English. To build a decision tree for the function tag labeling problem, Blaheta used an existing package, C4.5, from Ross Quinlan's research. Following the format of C4.5, Blaheta converted all training and testing data into a format that the package can read. In the end, Blaheta decided to abandon this method because of its performance. He explained that the performance of the decision tree increased only slightly, while the memory needed to store the model (the decision tree) grew steadily. Hence, this method is limited by memory size.
• Perceptron: a classical perceptron network performs binary classification. The output of a classic perceptron is only 'True'/'False', 'Yes'/'No', 0/1, etc. For the function tag labeling problem, Blaheta applied a multi-valued perceptron classification, in which each node in the output layer of the neural network is a function tag. A perceptron system can be described as follows:
  ◦ First, function tags from Penn Treebank are extracted to create a training [...] adjusted, and the model is trained again.
Figure 3 below presents the perceptron model for function tag labeling.
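The idea of a perceptron with one output node per function tag can be illustrated with a standard multi-class perceptron. This is a minimal sketch under assumptions: the class `MulticlassPerceptron`, the feature names, and the three toy training examples are hypothetical, and the update rule shown is the textbook one rather than a reconstruction of Blaheta's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Features = Dict[str, str]


class MulticlassPerceptron:
    def __init__(self, tags: List[str]) -> None:
        self.tags = list(tags)
        # One weight vector per output node (function tag).
        self.weights = {tag: defaultdict(float) for tag in self.tags}

    @staticmethod
    def _keys(feats: Features) -> List[str]:
        return [f"{name}={value}" for name, value in feats.items()]

    def score(self, feats: Features, tag: str) -> float:
        w = self.weights[tag]
        return sum(w.get(k, 0.0) for k in self._keys(feats))

    def predict(self, feats: Features) -> str:
        return max(self.tags, key=lambda t: self.score(feats, t))

    def train(self, data: List[Tuple[Features, str]], epochs: int = 5) -> None:
        for _ in range(epochs):
            for feats, gold in data:
                guess = self.predict(feats)
                if guess != gold:
                    # On a mistake, move weights toward the gold tag
                    # and away from the wrongly predicted tag.
                    for k in self._keys(feats):
                        self.weights[gold][k] += 1.0
                        self.weights[guess][k] -= 1.0


# Toy training data for illustration only.
data = [
    ({"label": "NP", "parent": "S"}, "SUB"),
    ({"label": "NP", "parent": "VP"}, "DOB"),
    ({"label": "PP", "parent": "VP"}, "LOC"),
]
model = MulticlassPerceptron(["SUB", "DOB", "LOC"])
model.train(data)
print([model.predict(feats) for feats, _ in data])
```

Weights change only when a prediction is wrong, so training stops changing the model once every training example is classified correctly.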
Chapter III: The proposed approach
In this chapter, we introduce our approach, including an overview of the system architecture, how features are defined for the function tag labeling problem in Vietnamese, and the selection of the classification model.
1 System Architecture
Our task includes two phases, as introduced in Chapter 1: the training phase and the testing phase. In the training phase, we use two resources as training data for our system: the Vietnamese Treebank and an unlabeled corpus collected from online newspapers. In this phase, the two main steps are feature extraction and Maximum Entropy Model (MEM) training. In addition, building the training data requires a further step that groups words from the unlabeled corpus into word clusters. In the testing phase, the input is a syntactic tree and the output of our system is the set of functional labels over the input tree. This phase again uses two main steps corresponding to those of the training phase: feature extraction and MEM classification.
In this section, we present our system as a diagram, which gives an overview of our task and the steps we carry out. Figure 4 shows our system for the function tag labeling problem.
Figure 4 Model of Function Tag Labeling System for Vietnamese sentences
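The training and testing phases of the architecture can be sketched with a very small maximum entropy classifier. This is an illustration under assumptions, not the system built in this thesis: the class `MaxEnt`, the indicator features, the stochastic gradient training loop, and the toy data are all hypothetical, and the word cluster step is omitted.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Features = Dict[str, str]


def active(feats: Features, tag: str) -> List[str]:
    """Indicator features f(x, y): one per (attribute=value, tag) pair."""
    return [f"{name}={value}|{tag}" for name, value in feats.items()]


class MaxEnt:
    def __init__(self, tags: List[str]) -> None:
        self.tags = list(tags)
        self.w: Dict[str, float] = defaultdict(float)

    def probs(self, feats: Features) -> Dict[str, float]:
        """p(tag | feats) proportional to exp(sum of active weights)."""
        scores = {t: sum(self.w.get(f, 0.0) for f in active(feats, t))
                  for t in self.tags}
        top = max(scores.values())  # subtract the maximum for stability
        exps = {t: math.exp(s - top) for t, s in scores.items()}
        z = sum(exps.values())
        return {t: e / z for t, e in exps.items()}

    def train(self, data: List[Tuple[Features, str]],
              epochs: int = 50, rate: float = 0.5) -> None:
        """Training phase: stochastic gradient ascent on the log-likelihood."""
        for _ in range(epochs):
            for feats, gold in data:
                p = self.probs(feats)
                for t in self.tags:
                    # Gradient: observed count minus expected count.
                    delta = (1.0 if t == gold else 0.0) - p[t]
                    for f in active(feats, t):
                        self.w[f] += rate * delta

    def predict(self, feats: Features) -> str:
        """Testing phase: choose the most probable function tag."""
        p = self.probs(feats)
        return max(p, key=p.get)


# Toy feature vectors standing in for the feature extraction step.
data = [
    ({"label": "NP", "parent": "S"}, "SUB"),
    ({"label": "NP", "parent": "VP"}, "DOB"),
    ({"label": "PP", "parent": "VP"}, "LOC"),
]
model = MaxEnt(["SUB", "DOB", "LOC"])
model.train(data)
print([model.predict(feats) for feats, _ in data])
```

Unlike the perceptron, this model returns a probability for every tag, which can be useful when more than one label has to be considered for a constituent.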
2 Function Tags in Vietnamese
Vietnamese is written with Latin symbols and has characteristics of our culture. Hence, the constituents of a Vietnamese sentence play the same roles as in English. In [13], 20 function tags were defined for Viet Treebank; Table 3 below presents the functional labels tagged in Viet Treebank. It can be clearly seen that the function tags are defined with almost the same notations and roles as in English.
Table 3 Function Tags on Viet Treebank (tag groups: clause types, syntactic roles, adverbials)
[19]. The features are described as follows³:
• Label: the syntactic label of the current constituent, which is being functionally classified, is very important in recognizing its role.
• Father's label: this feature is useful in certain cases. For example, suppose that the current constituent is a noun phrase (NP). If its father is a clause (S), it is more likely to be a subject (SUB); otherwise, if its father is a verb phrase (VP), it is more likely to be a direct object (DOB).
• Head word: this feature has been proved to be useful in parsing. It is also important for discriminating functions, for example, between temporal (TMP) and location (LOC).
³ Underlined features are our proposed features, which differ from those in [2] and [9].