Combining statistical machine learning with
transformation rule learning for Vietnamese Word
Sense Disambiguation
Phu-Hung Dinh, Ngoc-Khuong Nguyen, and Anh-Cuong Le
Dept. of Computer Science, University of Engineering and Technology, Vietnam National University, Ha Noi
144 Xuan Thuy, Cau Giay, Ha Noi, Viet Nam
{hungdp@wru.edu.vn, khuongnn.mcs09@vnu.edu.vn, cuongla@vnu.edu.vn}
Abstract—Word Sense Disambiguation (WSD) is the task of determining the right sense of a word depending on the context in which it appears. Among the various approaches developed for this task, statistical machine learning methods have shown advantages in comparison with others. However, there are some cases which cannot be solved by a general statistical model. This paper proposes a novel framework in which we use rules generated by transformation-based learning (TBL) to improve the performance of a statistical machine learning model. This framework can be considered a combination of a rule-based method and a statistics-based method. We have developed this method for the problem of Vietnamese WSD and achieved some promising results.
Index Terms—Machine Learning, Transformation-Based Learning, Naive Bayesian Classification
I. INTRODUCTION
Due to the ambiguity of natural languages, a word may have multiple meanings (senses). Practically speaking, an ambiguous word may be ambiguous in both its part-of-speech and its meaning. WSD usually aims to disambiguate the meaning of a word within a specific part-of-speech. A word which has several meanings in a specific part-of-speech is called polysemous. For example, the noun “bank” has at least two different meanings: “bank” in “Bank of England” and “bank” in “river bank”.
Polysemous words also exist in Vietnamese. For instance, consider the following sentences:
• Anh ta đang câu cá ở ao.
(He is fishing in a pond.)
• Đại bác câu trúng lô cốt.
(The guns lobbed shells onto the blockhouse.)
The occurrences of the word “câu” in the two sentences clearly denote different meanings: “to fish” and “to lob”. WSD means determining the right sense of such a word in its particular context. Success in this task benefits many Natural Language Processing (NLP) problems, such as information retrieval, machine translation, human-computer communication, and so on.
The automatic disambiguation of word senses has received attention since the 1950s [1]. Since then, many studies have investigated various methods for this problem, but the performance of available WSD systems and published results remains limited. These methods can be divided into two main approaches: knowledge-based and machine learning (corpus-based).
Knowledge-based methods rely on previously acquired linguistic knowledge: the WSD task is performed by matching the context in which a word appears against information from an external knowledge source. The methods in this approach draw on knowledge resources such as the WordNet thesaurus, as well as grammar rules or hand-coded rules for disambiguation (see [1] for more detail and discussion).
In the machine learning approach, empirical and statistical methods have attracted most studies in the NLP field since the 1990s. Many machine learning methods have been applied to a large variety of NLP tasks (including WSD) with remarkable success. The methods in this approach use techniques from statistics and machine learning to induce models of language usage from large samples of text. Generally, based on whether they use labeled data, unlabeled data, or both, machine learning methods can be divided into three groups: supervised, unsupervised, and semi-supervised. Because supervised systems are trained on annotated data, they achieve better results. Many machine learning methods have been applied to WSD systems, such as maximum entropy models [2], [3], support vector machines (SVM) [4], decision lists [5], [6], and Naive Bayesian (NB) classifiers [7], [8]. Other studies have tried to use linguistic knowledge from dictionaries and thesauri, as in [9], [10].
The machine learning approach seems to show advantages in comparison with the knowledge-based approach. While the knowledge-based approach depends on rules generated by experts, and thus on their ability, and meets difficulties in covering a large number of cases, the machine learning approach can solve the problem on a large scale without paying much attention to linguistic aspects.
However, the obtained results for WSD (e.g., in English) are still far from applicable in a real system. Although the average
accuracy on Senseval-2 and Senseval-3¹ is around 70%, some other studies, such as [13], achieve higher accuracy (about 90% for several words) when implemented on large training data.
¹ See more detail about these corpora at http://www.senseval.org/
From our observation, the first reason causing unexpected results in statistical machine learning systems for WSD is a sparse corpus. The second reason is that any NLP problem (and WSD in particular) usually has exceptional cases which do not follow a general principle (or model). Therefore, in this paper we focus on correcting the cases which may be misclassified by a statistical machine learning system. Borrowing the idea from the knowledge-based approach, but instead of having an expert write these rules, we apply the techniques of TBL to produce the rules automatically.
Firstly, based on the training corpus, a machine learning model is trained and used as the initial classification system for a TBL-based error-driven learning procedure, which amends the initial predictions using a development corpus. Consequently, a set of TBL rules is produced. Secondly, in the final model, we first use the machine learning model to predict senses for the polysemous words and then apply the obtained transformation rules to the results of the first step to get the final senses.
The paper is organized into six parts, including this introduction. In Section II, we present the background, including TBL and a statistical machine learning method. The details of our proposed model are presented in Section III. In Section IV, we present feature selection and rule template selection. Data preparation and experiments are presented in Section V. Finally, we conclude the paper in Section VI.
II. BACKGROUND
In this section we introduce NB classification (from the corpus-based approach) and TBL (from the rule-based approach), which are the two basic methods used in the combination method we propose.
A. NB Model
The NB method has been used in much classification work and was first applied to WSD by Gale et al. [7]. NB classifiers work on the assumption that all the feature variables representing a problem are conditionally independent given the classes.
Assume that the polysemous word w is to be disambiguated. Suppose that w has a set of potential senses (classes) S = {s_1, ..., s_c}, and that a context of w is given, represented by a set of features F = {f_1, ..., f_n}. Bayesian theory suggests that the word w should be assigned to the class s_k whose posterior probability is maximum, namely:
$$s_k = \arg\max_{s_j} P(s_j \mid F), \quad j \in \{1, \ldots, c\}$$

where the value of P(s_j | F) is computed by the following equation:

$$P(s_j \mid F) = \frac{P(s_j)\, P(F \mid s_j)}{P(F)}$$

P(F) is constant for all senses and therefore does not influence the value of P(s_j | F). The sense s_k of w is then:

$$s_k = \arg\max_{s_j} P(s_j \mid F) = \arg\max_{s_j} \frac{P(s_j)\, P(F \mid s_j)}{P(F)} = \arg\max_{s_j} P(s_j) \prod_{i=1}^{n} P(f_i \mid s_j) = \arg\max_{s_j} \Big[\log P(s_j) + \sum_{i=1}^{n} \log P(f_i \mid s_j)\Big]$$
The values of P(f_i | s_j) and P(s_j) are computed via maximum-likelihood estimation as:

$$P(s_j) = \frac{C(s_j)}{N} \quad \text{and} \quad P(f_i \mid s_j) = \frac{C(f_i, s_j)}{C(s_j)}$$

where C(f_i, s_j) is the number of occurrences of f_i in a context of sense s_j in the training corpus, C(s_j) is the number of occurrences of s_j in the training corpus, and N is the total number of occurrences of the polysemous word w, i.e., the size of the training dataset. To avoid the effect of zero counts when estimating the conditional probabilities of the model, we set P(f_i | s_j) equal to 1/N for each sense s_j when meeting a feature f_i in a test context that was unseen in training. The NB algorithm for WSD is presented as follows:
Training:
    for all senses s_j of w do
        for all features f_i extracted from the training data do
            P(f_i | s_j) = C(f_i, s_j) / C(s_j)
        end
    end
    for all senses s_j of w do
        P(s_j) = C(w, s_j) / C(w)
    end

Disambiguation:
    for all senses s_j of w do
        score(s_j) = log(P(s_j))
        for all features f_i in the context window c do
            score(s_j) = score(s_j) + log(P(f_i | s_j))
        end
    end
    choose s_k = argmax_{s_j} score(s_j)
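To make this procedure concrete, the following is a minimal Python sketch of the classifier. The class and method names, and the representation of a training example as a (feature list, sense) pair, are our own choices; unseen features fall back to 1/N as described above.

import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Minimal NB sense classifier following the algorithm above."""

    def train(self, corpus):
        # corpus: list of (features, sense) pairs; features is the list of
        # the 16 feature strings extracted from one context of w.
        self.n = len(corpus)                    # N: occurrences of w
        self.sense_count = Counter()            # C(s_j)
        self.feat_count = defaultdict(Counter)  # C(f_i, s_j), keyed by sense
        for features, sense in corpus:
            self.sense_count[sense] += 1
            for f in features:
                self.feat_count[sense][f] += 1

    def disambiguate(self, features):
        best_sense, best_score = None, float("-inf")
        for sense, c_s in self.sense_count.items():
            score = math.log(c_s / self.n)      # log P(s_j)
            for f in features:
                c_fs = self.feat_count[sense][f]
                # A feature unseen with this sense backs off to 1/N,
                # as described in the text above.
                p = c_fs / c_s if c_fs > 0 else 1.0 / self.n
                score += math.log(p)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense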
B. Transformation-Based Learning
TBL is known as one of the most successful methods in the rule-based approach for many NLP tasks because it provides a method for automatically learning the rules.
Brill [11] introduced TBL and showed that it can perform part-of-speech tagging with fairly high accuracy. The same method can be applied to many natural language processing tasks, for example text chunking, parsing, named entity recognition, and word sense disambiguation. The method’s key idea is to compare the golden corpus, which is correctly tagged, with the current corpus, which is created by an initial tagger, and then to automatically generate rules that correct the errors, based on predefined templates.
The transformation-based learning algorithm runs in multiple iterations as follows:
Input: A raw corpus containing the entire raw text without labels, extracted from the golden corpus that contains manually labeled context/label pairs.
• Step 1: Generate the initial corpus by running an initial labeler on the raw corpus.
• Step 2: Compare the initial corpus with the golden corpus to determine the initial corpus’s label errors, from which all rule templates are used to create potential rules.
• Step 3: Apply each potential rule to a copy of the initial corpus. The score of a rule is computed by subtracting the number of additional errors from the number of correctly changed labels. The rule with the best score is selected.
• Step 4: Update the initial corpus by applying the selected rule, and move this rule to the list of transformation rules.
• Step 5: Stop if the best score is smaller than a predefined threshold T; otherwise repeat from Step 2.
Output: The list of transformation rules.
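The loop below is a compact Python sketch of this algorithm under simplifying assumptions: a rule is a (from-label, to-label, predicate) triple, a context is a 7-word window with the target word at index 3, and only one sample template is shown. All names are illustrative, not prescribed by the algorithm.

def prev_word_template(i, cur, ref, contexts):
    """Instantiate the template 'A -> B word C @ [-1]' at error position i.
    A context is assumed to be a 7-word window with the target at index 3."""
    c = contexts[i][2]  # the word at offset -1
    return [(cur, ref, lambda j, ctx, c=c: ctx[j][2] == c)]

def learn_tbl_rules(initial, gold, contexts, templates, threshold=1):
    """Sketch of the TBL loop; `initial` is modified in place, exactly as
    the algorithm updates the initial corpus."""
    rules = []
    while True:
        # Step 2: propose candidate rules from the current label errors.
        candidates = []
        for i, (cur, ref) in enumerate(zip(initial, gold)):
            if cur != ref:
                for template in templates:
                    candidates.extend(template(i, cur, ref, contexts))
        # Step 3: score = corrected labels minus newly broken labels.
        def score(rule):
            frm, to, pred = rule
            s = 0
            for i, cur in enumerate(initial):
                if cur == frm and pred(i, contexts):
                    if gold[i] == to:
                        s += 1   # a wrong label gets corrected
                    elif gold[i] == cur:
                        s -= 1   # a right label gets broken
            return s
        best = max(candidates, key=score, default=None)
        # Step 5: stop when no rule scores at least the threshold T.
        if best is None or score(best) < threshold:
            return rules
        # Step 4: apply the best rule and keep it.
        frm, to, pred = best
        for i, cur in enumerate(initial):
            if cur == frm and pred(i, contexts):
                initial[i] = to
        rules.append(best)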
III. OUR APPROACH
In this section, we describe our approach to inducing a model which corrects the wrongly tagged senses of a statistical machine learning model (note that we choose the NB classification model here). This approach includes a training phase and a testing phase.
Notice that in this model we use a training dataset for generating an NB classifier and then use a development dataset for learning the transformation rules. These two sets of tagged data are constructed by manually labeling a set of selected contexts of the polysemous word.
A. The Training Phase
The training process consists of two stages. In the first stage, the error list is determined based on the NB model. This stage is described in Figure 1.
Input: A training corpus and a developing corpus containing manually labeled context/label pairs.
• Step 1: Obtain the raw developing corpus by removing the labels from the developing corpus.
• Step 2: Use the training corpus to train an NB classification model. This classification model is then run on the raw developing corpus obtained in Step 1. The result is called the initial corpus.
• Step 3: Compare the initial corpus with the developing corpus to determine the list of all contexts that received wrong labels from the NB classification model.
Output: The list of contexts with wrong labels (we call it the error list, as shown in Figure 1).
Figure 1. The training algorithm (first stage)
In the second stage, the set of TBL rules is determined by applying the TBL algorithm to the error list obtained in the first stage. Notice that in this stage we use predefined templates for generating potential TBL rules (this is described in detail in Section IV-B). This stage proceeds as follows (shown in Figure 2):
Input: The developing corpus, the initial corpus, and the error list.
• Step 1: Apply the rule templates to the error list to generate a list of transformation rules (the potential rules).
• Step 2: Apply each potential rule to a copy of the initial corpus. The score of each rule is calculated as s2 − s1, where s1 is the number of cases in which right labels are transformed into wrong labels and s2 is the number of cases that are corrected. The rule with the highest score is selected.
• Step 3: Update the initial corpus by applying the rule with the highest score, and move this rule to the selected TBL rules. The error list is also updated accordingly by comparing the initial corpus with the developing corpus.
• Step 4: Stop if the highest score is smaller than a predefined threshold T; otherwise go to Step 1.
Output: The list of transformation rules (i.e., the selected TBL rules).
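Putting the two stages together, a hedged sketch of the training phase might look as follows. It reuses the NaiveBayesWSD and learn_tbl_rules sketches above and an extract_features helper like the one sketched in Section IV; corpus items are assumed to be (window, sense) pairs.

def train_combined_model(training_corpus, developing_corpus, templates):
    """Two-stage training phase of Section III-A (illustrative names)."""
    # Stage 1: train the NB model and label the raw developing corpus.
    nb = NaiveBayesWSD()
    nb.train([(extract_features(win), sense) for win, sense in training_corpus])
    windows = [win for win, _ in developing_corpus]
    gold = [sense for _, sense in developing_corpus]
    initial = [nb.disambiguate(extract_features(win)) for win in windows]
    # The error list of Figure 1: positions where NB disagrees with gold
    # (learn_tbl_rules recomputes this list on every iteration).
    errors = [i for i, (a, b) in enumerate(zip(initial, gold)) if a != b]
    # Stage 2: learn transformation rules that repair those errors.
    rules = learn_tbl_rules(initial, gold, windows, templates)
    return nb, rules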
B. The Test Phase
Figure 2. The training algorithm (second stage)
The proposed approach uses the selected TBL rules obtained in the training phase for testing, as follows (shown in Figure 3):
Input: The test corpus and the selected TBL rules.
• Step 1: Obtain the raw test corpus by removing the labels from the test corpus.
• Step 2: Run the NB classification model on the raw test corpus, obtaining the so-called initial corpus.
• Step 3: Apply the selected TBL rules to the initial corpus to create the labeled corpus.
• Step 4: Compare the labeled corpus with the test corpus to evaluate the system (i.e., compute the accuracy).
Output: The accuracy of the proposed model.
Figure 3. The test phase
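A minimal sketch of this test procedure, under the same assumptions and helper names as the training sketches above:

def test_phase(nb, rules, test_corpus):
    """Test phase of Section III-B; corpus items are (window, sense) pairs."""
    windows = [win for win, _ in test_corpus]
    gold = [sense for _, sense in test_corpus]
    # Steps 1-2: label the raw test corpus with the NB model (initial corpus).
    labels = [nb.disambiguate(extract_features(win)) for win in windows]
    # Step 3: apply the selected TBL rules in the order they were learned.
    for frm, to, pred in rules:
        for i, cur in enumerate(labels):
            if cur == frm and pred(i, windows):
                labels[i] = to
    # Step 4: accuracy of the final labels against the gold labels.
    return sum(p == g for p, g in zip(labels, gold)) / len(gold)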
IV. FEATURES AND RULE TEMPLATES
A. Feature Selection
One of the most important tasks in WSD is the determination of useful information related to word senses. In the corpus-based approach, most studies have considered only the information extracted from the context in which the target word appears.
Context is the only means to identify the meaning of a polysemous word. Therefore, all work on sense disambiguation relies on the context of the target word to provide the information used for its disambiguation. For corpus-based methods, context also provides the prior knowledge against which the current context is compared to achieve disambiguation.
Suppose that w is the polysemous word to be disambiguated, and S = {s_1, s_2, ..., s_m} is the set of its potential senses. A context W of w is represented as:

$$W = \{w_{-3}, w_{-2}, w_{-1}, w_0, w_1, w_2, w_3\}$$

that is, W is a context of w within a window (−3, +3) in which w_0 = w is the target word; for each i ∈ {−3, ..., +3}, w_i is the word appearing at position i relative to w.
Based on previous studies [12], [13], [14] and our experiments, we propose to use two kinds of knowledge and represent them as subsets of features, as follows:
• Bag-of-words, F_1(l, r) = {w_{−l}, ..., w_{+r}}: We choose F_1(−3, +3), which has seven elements (features):
F_1(−3, +3) = {w_{−3}, w_{−2}, w_{−1}, w_0, w_1, w_2, w_3}
• Collocations of words, F_2 = {w_{−l} ... w_{+r}}: We choose collocations whose lengths (including the target word) are less than or equal to 4, i.e., (l + r + 1) ≤ 4. This gives nine elements (features):
F_2 = {w_{−1}w_0, w_0w_1, w_{−2}w_{−1}w_0, w_{−1}w_0w_1, w_0w_1w_2, w_{−3}w_{−2}w_{−1}w_0, w_{−2}w_{−1}w_0w_1, w_{−1}w_0w_1w_2, w_0w_1w_2w_3}
In summary, we obtain 16 features, denoted (f_1, f_2, ..., f_16). These features are used in the NB classification model and for building the TBL rules.
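Under the assumption that a context is handed over as a 7-word window, the 16 features could be extracted as in the following sketch. Encoding the bag-of-words features together with their positions is our own choice, not stated in the text above.

def extract_features(window):
    """Build the 16 features of Section IV-A from a 7-word window
    [w-3, w-2, w-1, w0, w1, w2, w3] centered on the target word."""
    w = dict(zip(range(-3, 4), window))
    # F1(-3,+3): bag-of-words features, tagged with their position.
    f1 = [f"{i}:{w[i]}" for i in range(-3, 4)]
    # F2: the nine collocations containing w0 with total length <= 4.
    spans = [(-1, 0), (0, 1), (-2, 0), (-1, 1), (0, 2),
             (-3, 0), (-2, 1), (-1, 2), (0, 3)]
    f2 = ["_".join(w[i] for i in range(l, r + 1)) for l, r in spans]
    return f1 + f2  # 7 + 9 = 16 features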
B. Rule Templates for Building TBL Rules
Rule templates are an important part of the TBL algorithm; they are used to automatically generate TBL rules. Based on previous studies [15], [16] and the features presented above, we propose the rule templates shown in Figure 4:
A → B word C @ [ -1 ]
A → B word C @ [ 1 ]
A → B word C @ [ -2 ] & word D @ [ -1 ]
A → B word C @ [ -1 ] & word D @ [ 1 ]
A → B word C @ [ 1 ] & word D @ [ 2 ]
Figure 4. The rule templates
For example, two of the rule templates can be read as follows: the template “A → B word C @ [ 1 ]” means “change the label of the current word from A to B if the next word is C”, and the template “A → B word C @ [ -1 ] & word D @ [ 1 ]” means “change the label of the current word from A to B if the previous word is C and the next word is D”.
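One possible encoding of such templates as executable rules, matching the (from-label, to-label, predicate) triples used in the sketches above; the helper name and the window layout are our assumptions.

def make_rule(frm, to, *conditions):
    """Encode a rule like 'A -> B word C @ [-1] & word D @ [1]':
    conditions are (offset, word) pairs that must all hold around
    the target word (index 3 of a 7-word window)."""
    def pred(i, contexts):
        window = contexts[i]
        return all(window[3 + off] == word for off, word in conditions)
    return (frm, to, pred)

# Example: the learned rule '2 -> 3 word tiền @ [ 1 ]' for the word "bạc".
rule = make_rule(2, 3, (1, "tiền"))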
V. EXPERIMENT
A. Data Preparation
Since we consider WSD as a classification problem, we need an annotated corpus for this task. For English, many studies use corpora such as Senseval-1, Senseval-2, Senseval-3, and so on. Because a standard corpus does not exist for Vietnamese, it is necessary to build a training corpus. To this end, we first use a crawler to collect data from web sites and obtain about 1.2 GB of raw data (approximately 120,000 articles from more than 50 Vietnamese web sites such as www.vnexpress.net, www.dantri.com.vn, etc.). We then extract from this corpus the contexts (containing several sentences around the ambiguous word) for 10 ambiguous words. For example, a context for the ambiguous word “bạc” is shown in Figure 5.
Trọng tâm của tháng là sự hòa hợp trong gia đình, khi các thành viên đồng thuận về con đường sự nghiệp của bạn. Giữa tháng 3, tình hình tài chính của bạn cải thiện rất nhiều. Tiền bạc vẫn đổ dồn về, nhưng phải luôn biết cách chi tiêu hợp lý. Đây cũng là khoảng thời gian thích hợp để bạn đầu tư vào các tài sản cố định. Nếu may mắn, bạn sẽ thu về một khoản tiền lớn.
(The focus of the month is harmony in the family, as the members agree on your career path. In mid-March your financial situation improves considerably. Money keeps pouring in, but you must always know how to spend it sensibly. This is also a suitable time to invest in fixed assets. If you are lucky, you will collect a large sum.)
Figure 5. A context of the word “bạc”
These contexts of the 10 ambiguous words are then manually labeled to obtain the labeled corpus. Table I describes in detail the number of samples and senses of each ambiguous word.
Table I
STATISTICS ON THE LABELED DATA
No.  Word  Part of Speech  Senses  Examples
To conduct the experiments we build several datasets as follows:
Firstly, we divide the labeled corpus into two parts in the ratio 3:1, called data-corpus 1 and data-corpus 2, respectively. Data-corpus 1 is used for training and data-corpus 2 for testing in the NB, TBL, SVM², and proposed models.
Secondly, data-corpus 1 is used for building the TBL rules, so it is randomly divided 10 times into two parts in the ratio 3:1: one part is used for training (the training corpus) and the other for development (the developing corpus). Notice that the training phase for building TBL rules is therefore run 10 times, over the corresponding training and developing data, in order to cover as many TBL rules as possible; a sketch of this preparation follows.
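The following small Python sketch illustrates this preparation, assuming the labeled corpus is given as a list of examples; the fixed random seed is our own choice.

import random

def prepare_data(labeled, seed=0):
    """Data preparation of Section V-A: one 3:1 split into data-corpus 1
    and data-corpus 2, then ten random 3:1 splits of corpus 1 into
    (training, developing) pairs."""
    rng = random.Random(seed)
    data = labeled[:]
    rng.shuffle(data)
    k = 3 * len(data) // 4
    corpus1, corpus2 = data[:k], data[k:]  # corpus2 is held out for testing
    splits = []
    for _ in range(10):
        rng.shuffle(corpus1)
        m = 3 * len(corpus1) // 4
        splits.append((corpus1[:m], corpus1[m:]))  # (training, developing)
    return splits, corpus2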
Table II shows the sizes of the training, developing, and test sets.
² We use LIBSVM for the SVM model. See more details about LIBSVM at http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Table II
DATA SETS
No.  Word  Part of Speech  Corpus 1: Training  Corpus 1: Developing  Corpus 2: Test
B. Experimental Results
In this section, we present experimental results for four models: the NB model, the TBL model, the SVM model, and the proposed model, which is the combination of the NB and TBL models. From the datasets above, we first evaluate the accuracy of the NB model and obtain the results shown in Table III. The average accuracy of this model is about 86.5%.
Table III
NB MODEL RESULTS
No.  Word  Part of Speech  Training  Test  Accuracy (%)
Secondly, for each ambiguous word, using the training algorithm in Section III, we obtain lists of TBL rules. Since this phase is run 10 times, we obtain 10 lists of TBL rules. Table IV shows the experimental results when each of these TBL lists is tested separately (in the combination model) for the word “bạc” with part-of-speech adjective. Moreover, if we combine all the rules into one list, we obtain a better accuracy of 92.8%. Some of the TBL rules obtained for the word “bạc” are shown in Figure 6.
4 → 2 word vàng @ [ -1 ]
2 → 4 word sới @ [ -1 ]
2 → 1 word cao @ [ 1 ] & word cấp @ [ 2 ]
2 → 3 word tiền @ [ 1 ]
2 → 3 word mấy @ [ -2 ] & word triệu @ [ -1 ]
3 → 2 word tờ @ [ -1 ]
4 → 1 word két @ [ -1 ]
Figure 6. Some TBL rules for the word “bạc”
Table IV
NB & RULE-BASED MODEL RESULTS FOR THE AMBIGUOUS WORD “BẠC”
No.  List of Rules   Accuracy of NB & TBL (%)
1    list rules 1    89.2
2    list rules 2    89.9
3    list rules 3    89.2
4    list rules 4    89.9
5    list rules 5    89.9
6    list rules 6    89.9
7    list rules 7    89.2
8    list rules 8    90.6
9    list rules 9    92.1
10   list rules 10   89.2
11   combined rules  92.8

Finally, we show the experimental results of our system for the 10 ambiguous words. It can be seen from Table V that the results obtained from the proposed model (combining NB classification and TBL) are better than those obtained from the NB classification model, the TBL model, and the SVM model. The average accuracy of the proposed model is about 91.3% for the 10 ambiguous words, which is 4.8%, 7.4%, and 3.1% more accurate than the NB classification model, the TBL model, and the SVM model, respectively.
Table V
RESULTS OF THE NB, TBL, SVM, AND PROPOSED MODELS
No.  Word  Part of Speech  Accur.1 (%)  Accur.2 (%)  Accur.3 (%)  Accur.4 (%)
1    Bạc   Noun            81.8         82.4         84.4         88.6
3    Cất   Verb            84.4         79.7         86.4         89.7
4    Câu   Noun            97.6         97.3         97.8         98.3
5    Câu   Verb            85.3         88.0         86.7         96.0
6    Cầu   Noun            95.6         85.4         95.6         95.9
7    Khai  Verb            90.4         88.2         91.2         92.9
8    Pha   Verb            79.2         76.5         81.2         83.9
9    Phát  Verb            73.6         75.2         77.1         80.9
10   Sắc   Noun            91.6         83.2         92.8         94.0
1 Accuracy of the NB model
2 Accuracy of the TBL model
3 Accuracy of the SVM model
4 Accuracy of the NB & TBL model
VI. CONCLUSIONS
This paper has proposed a new method that combines the advantages of the machine learning approach and the rule-based approach for the task of word sense disambiguation. In particular, we have used NB classification as the machine learning method and combined it with TBL. We have experimented on some Vietnamese polysemous words, and the obtained accuracy increased by 4.8%, 7.4%, and 3.1% compared with the results of the NB model, the TBL model, and the SVM model, respectively. This also shows that TBL can be utilized to correct wrong results from statistical machine learning models.
This model can be applied to other languages for the task of WSD, and we believe that it can also be applied to some other natural language processing tasks, such as part-of-speech tagging, syntactic parsing, and so on.
ACKNOWLEDGMENT
This work is partially supported by the Vietnam National Foundation for Science and Technology Development (NAFOSTED), project code 102.99.35.09.
REFERENCES
[1] N. Ide and J. Véronis, “Introduction to the special issue on word sense disambiguation: the state of the art,” Comput. Linguist., vol. 24, pp. 2–40, March 1998.
[2] A. Suárez and M. Palomar, “A maximum entropy-based word sense disambiguation system,” in Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, ser. COLING ’02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 1–7.
[3] A. L. Berger, S. A. D. Pietra, and V. J. D. Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, vol. 22, pp. 39–71, 1996.
[4] Y. K. Lee, H. T. Ng, and T. K. Chia, “Supervised word sense disambiguation with support vector machines and multiple knowledge sources,” in Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 2004, pp. 137–140.
[5] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods,” in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ser. ACL ’95. Stroudsburg, PA, USA: Association for Computational Linguistics, 1995, pp. 189–196.
[6] T. Pedersen, “A decision tree of bigrams is an accurate predictor of word sense,” in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, ser. NAACL ’01. Stroudsburg, PA, USA: Association for Computational Linguistics, 2001, pp. 1–8.
[7] W. A. Gale, K. W. Church, and D. Yarowsky, “A method for disambiguating word senses in a large corpus,” Computers and the Humanities, vol. 26, pp. 415–439, 1992.
[8] T. Pedersen, “A simple approach to building ensembles of naive bayesian classifiers for word sense disambiguation,” 2000.
[9] M. Lesk, “Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone,” in Proceedings of the 5th Annual International Conference on Systems Documentation, ser. SIGDOC ’86. New York, NY, USA: ACM, 1986, pp. 24–26.
[10] R. Navigli and P. Velardi, “Structural semantic interconnections: A knowledge-based approach to word sense disambiguation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, pp. 1075–1086, July 2005.
[11] E. Brill, “Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging,” Comput. Linguist., vol. 21, pp. 543–565, December 1995.
[12] C. A. Le, “A study of classifier combination and semi-supervised learning for word sense disambiguation,” Ph.D. dissertation, School of Information Science, Japan Advanced Institute of Science and Technology, 2007.
[13] C. A. Le and A. Shimazu, “High word sense disambiguation using naive bayesian classifier with rich features,” in The 18th Pacific Asia Conference on Language, Information and Computation (PACLIC-2004), 2004, pp. 105–113.
[14] R. F. Mihalcea, “Word sense disambiguation with pattern learning and automatic feature selection,” Nat. Lang. Eng., vol. 8, pp. 343–358, December 2002.
[15] G. Ngai and R. Florian, “Transformation-based learning in the fast lane,” in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, ser. NAACL ’01. Stroudsburg, PA, USA: Association for Computational Linguistics, 2001, pp. 1–8.
[16] R. L. Milidiú, J. C. Duarte, and C. Nogueira dos Santos, “TBL template selection: an evolutionary approach,” in Current Topics in Artificial Intelligence, D. Borrajo, L. Castillo, and J. M. Corchado, Eds. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 180–189.