Combining statistical machine learning with
transformation rule learning for Vietnamese Word
Sense Disambiguation
Phu-Hung Dinh, Ngoc-Khuong Nguyen, and Anh-Cuong Le
Dept. of Computer Science, University of Engineering and Technology, Vietnam National University, Ha Noi
144 Xuan Thuy, Cau Giay, Ha Noi, Viet Nam
{hungdp@wru.edu.vn, khuongnn.mcs09@vnu.edu.vn, cuongla@vnu.edu.vn}
Abstract—Word Sense Disambiguation (WSD) is the task of determining the right sense of a word depending on the context in which it appears. Among the various approaches developed for this task, statistical machine learning methods have shown advantages in comparison with others. However, there are some cases which cannot be solved by a general statistical model. This paper proposes a novel framework in which we use rules generated by transformation-based learning (TBL) to improve the performance of a statistical machine learning model. This framework can be considered a combination of a rule-based method and a statistics-based method. We have developed this method for the problem of Vietnamese WSD and achieved some promising results.
Index Terms—Machine Learning, Transformation-Based Learning, Naive Bayesian Classification
I. INTRODUCTION
Due to the ambiguity of natural languages, a word may have multiple meanings (senses). Practically speaking, an ambiguous word may be ambiguous in both its part-of-speech and its meaning. WSD usually aims to disambiguate the meaning of a word within a specific part-of-speech. A word which has several meanings in a specific part-of-speech is called polysemous. For example, the noun “bank” has at least two different meanings: “bank” in “Bank of England” and “bank” in “river bank”.
Polysemous words also exist in Vietnamese. For instance, consider the following sentences:
• Anh ta đang câu cá ở ao.
(He is fishing in a pond.)
• Đại bác câu trúng lô cốt.
(The guns lobbed shells onto the blockhouse.)
The occurrences of the word “câu” in the two sentences clearly denote different meanings: “to fish” and “to lob”. WSD means determining the right sense of such a word in its particular context. Success in this task benefits many Natural Language Processing (NLP) problems, such as information retrieval, machine translation, human-computer communication, and so on.
The automatic disambiguation of word senses has received attention since the 1950s [1]. Since then, many studies have investigated various methods for this problem, but the performance of available WSD systems and published results remains limited. These methods can be divided into two main approaches: knowledge-based and machine learning (corpus-based).
Knowledge-based methods rely on previously acquired linguistic knowledge: the WSD task is performed by matching the context in which a word appears against information from an external knowledge source. The methods in this approach draw on knowledge resources such as the WordNet thesaurus, as well as grammar rules or hand-coded rules for disambiguation (see [1] for more detail and discussion).
In the machine learning approach, empirical and statistical methods have attracted most studies in the NLP field since the 1990s. Many machine learning methods have been applied to a large variety of NLP tasks (including WSD) with remarkable success. The methods in this approach use techniques from statistics and machine learning to induce models of language usage from large samples of text. Generally, based on whether they use labeled data, unlabeled data, or both, machine learning methods can be divided into three groups: supervised, unsupervised, and semi-supervised. Because supervised systems are trained on annotated data, they achieve better results. Many machine learning methods have been applied to WSD systems, such as maximum entropy models [2], [3], support vector machines (SVM) [4], decision lists [5], [6], and Naive Bayesian (NB) classifiers [7], [8]. Other studies have tried to use linguistic knowledge from dictionaries and thesauri, as in [9], [10].
The machine learning approach seems to show advantages in comparison with the knowledge-based approach. While the knowledge-based approach depends on rules generated by experts, and thus on their ability, and meets difficulties in covering a large number of cases, the machine learning approach can solve the problem on a large scale without paying much attention to linguistic aspects.
However, the obtained results for WSD (e.g., in English) are still far from applicable in a real system. Although the average
accuracy on Senseval-2 and Senseval-3¹ is around 70%, some other studies, such as [13], achieve higher accuracy (about 90% for several words) when implemented on large training data.
¹ See more detail about these corpora at http://www.senseval.org/
From our observation, the first reason causing unexpected results in statistical machine learning systems for WSD is a sparse corpus. The second reason is that any NLP problem (and WSD in particular) usually has exceptional cases which do not follow a general principle (or model). Therefore, in this paper we focus on correcting the cases which may be misclassified by a statistical machine learning system. Borrowing the idea from the knowledge-based approach, but instead of having an expert write these rules, we apply the techniques of TBL to produce the rules automatically.
Firstly, based on the training corpus, a machine learning model is trained and used as the initial classification system for a TBL-based error-driven learning procedure, which amends the initial predictions using a development corpus. Consequently, a set of TBL rules is produced. Secondly, in the final model, we first use the machine learning model to predict senses for the polysemous words and then apply the obtained transformation rules to the results of the first step to get the final senses.
The paper is organized into six parts, including this introduction. In Section II, we present the background, including TBL and a statistical machine learning method. The details of our proposed model are presented in Section III. In Section IV, we present feature selection and rule template selection. Data preparation and experiments are presented in Section V. Finally, we conclude the paper in Section VI.
II. BACKGROUND
In this section we introduce NB classification (from the corpus-based approach) and TBL (from the rule-based approach), which are the two basic methods used in the combination method we propose.
A. NB Model
The NB method has been used in much classification work and was first applied to WSD by Gale et al. [7]. NB classifiers work on the assumption that all the feature variables representing a problem are conditionally independent given the classes.
Assume that the polysemous word w is to be disambiguated. Suppose that w has a set of potential senses (classes) S = {s_1, ..., s_c}, and that a context of w is given, represented by a set of features F = {f_1, ..., f_n}. Bayesian theory suggests that the word w should be assigned to the class s_k whose posterior probability is maximum, namely:
$$s_k = \arg\max_{s_j} P(s_j \mid F), \quad j \in \{1, \ldots, c\}$$

where the value of P(s_j | F) is computed by the following equation:

$$P(s_j \mid F) = \frac{P(s_j)\, P(F \mid s_j)}{P(F)}$$

P(F) is constant for all senses and therefore does not influence the value of P(s_j | F). The sense s_k of w is then:

$$s_k = \arg\max_{s_j} P(s_j \mid F) = \arg\max_{s_j} \frac{P(s_j)\, P(F \mid s_j)}{P(F)} = \arg\max_{s_j} P(s_j) \prod_{i=1}^{n} P(f_i \mid s_j) = \arg\max_{s_j} \Big[\log P(s_j) + \sum_{i=1}^{n} \log P(f_i \mid s_j)\Big]$$
The values of P(f_i | s_j) and P(s_j) are computed via maximum-likelihood estimation as:

$$P(s_j) = \frac{C(s_j)}{N} \quad \text{and} \quad P(f_i \mid s_j) = \frac{C(f_i, s_j)}{C(s_j)}$$

where C(f_i, s_j) is the number of occurrences of f_i in a context of sense s_j in the training corpus, C(s_j) is the number of occurrences of s_j in the training corpus, and N is the total number of occurrences of the polysemous word w, i.e., the size of the training dataset. To avoid the effect of zero counts when estimating the conditional probabilities of the model, we set P(f_i | s_j) equal to 1/N for each sense s_j when meeting a feature f_i in a test context that was unseen in training. The NB algorithm for WSD is presented as follows:
Training:
    for all senses s_j of w do
        for all features f_i extracted from the training data do
            P(f_i | s_j) = C(f_i, s_j) / C(s_j)
        end
    end
    for all senses s_j of w do
        P(s_j) = C(w, s_j) / C(w)
    end

Disambiguation:
    for all senses s_j of w do
        score(s_j) = log(P(s_j))
        for all features f_i in the context window c do
            score(s_j) = score(s_j) + log(P(f_i | s_j))
        end
    end
    choose s_k = argmax_{s_j} score(s_j)
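To make this procedure concrete, the following is a minimal Python sketch of the classifier. The class and method names, and the representation of a training example as a (feature list, sense) pair, are our own choices; unseen features fall back to 1/N as described above.

import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Minimal NB sense classifier following the algorithm above."""

    def train(self, corpus):
        # corpus: list of (features, sense) pairs; features is the list of
        # the 16 feature strings extracted from one context of w.
        self.n = len(corpus)                    # N: occurrences of w
        self.sense_count = Counter()            # C(s_j)
        self.feat_count = defaultdict(Counter)  # C(f_i, s_j), keyed by sense
        for features, sense in corpus:
            self.sense_count[sense] += 1
            for f in features:
                self.feat_count[sense][f] += 1

    def disambiguate(self, features):
        best_sense, best_score = None, float("-inf")
        for sense, c_s in self.sense_count.items():
            score = math.log(c_s / self.n)      # log P(s_j)
            for f in features:
                c_fs = self.feat_count[sense][f]
                # A feature unseen with this sense backs off to 1/N,
                # as described in the text above.
                p = c_fs / c_s if c_fs > 0 else 1.0 / self.n
                score += math.log(p)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense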
B. Transformation-Based Learning
TBL is known as one of the most successful methods in the rule-based approach for many NLP tasks because it provides a method for automatically learning the rules.
Brill [11] introduced TBL and showed that it can perform part-of-speech tagging with fairly high accuracy. The same method can be applied to many natural language processing tasks, for example text chunking, parsing, named entity recognition, and word sense disambiguation. The method’s key idea is to compare the golden corpus, which is correctly tagged, with the current corpus, which is created by an initial tagger, and then to automatically generate rules that correct the errors, based on predefined templates.
The transformation-based learning algorithm runs in multiple iterations as follows:
Input: A raw corpus containing the entire raw text without labels, extracted from the golden corpus that contains manually labeled context/label pairs.
• Step 1: Generate the initial corpus by running an initial labeler on the raw corpus.
• Step 2: Compare the initial corpus with the golden corpus to determine the initial corpus’s label errors, from which all rule templates are used to create potential rules.
• Step 3: Apply each potential rule to a copy of the initial corpus. The score of a rule is computed by subtracting the number of additional errors from the number of correctly changed labels. The rule with the best score is selected.
• Step 4: Update the initial corpus by applying the selected rule, and move this rule to the list of transformation rules.
• Step 5: Stop if the best score is smaller than a predefined threshold T; otherwise repeat from Step 2.
Output: The list of transformation rules.
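The loop below is a compact Python sketch of this algorithm under simplifying assumptions: a rule is a (from-label, to-label, predicate) triple, a context is a 7-word window with the target word at index 3, and only one sample template is shown. All names are illustrative, not prescribed by the algorithm.

def prev_word_template(i, cur, ref, contexts):
    """Instantiate the template 'A -> B word C @ [-1]' at error position i.
    A context is assumed to be a 7-word window with the target at index 3."""
    c = contexts[i][2]  # the word at offset -1
    return [(cur, ref, lambda j, ctx, c=c: ctx[j][2] == c)]

def learn_tbl_rules(initial, gold, contexts, templates, threshold=1):
    """Sketch of the TBL loop; `initial` is modified in place, exactly as
    the algorithm updates the initial corpus."""
    rules = []
    while True:
        # Step 2: propose candidate rules from the current label errors.
        candidates = []
        for i, (cur, ref) in enumerate(zip(initial, gold)):
            if cur != ref:
                for template in templates:
                    candidates.extend(template(i, cur, ref, contexts))
        # Step 3: score = corrected labels minus newly broken labels.
        def score(rule):
            frm, to, pred = rule
            s = 0
            for i, cur in enumerate(initial):
                if cur == frm and pred(i, contexts):
                    if gold[i] == to:
                        s += 1   # a wrong label gets corrected
                    elif gold[i] == cur:
                        s -= 1   # a right label gets broken
            return s
        best = max(candidates, key=score, default=None)
        # Step 5: stop when no rule scores at least the threshold T.
        if best is None or score(best) < threshold:
            return rules
        # Step 4: apply the best rule and keep it.
        frm, to, pred = best
        for i, cur in enumerate(initial):
            if cur == frm and pred(i, contexts):
                initial[i] = to
        rules.append(best)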
III. OUR APPROACH
In this section, we describe our approach to inducing a model which corrects the wrongly tagged senses of a statistical machine learning model (note that we choose the NB classification model here). This approach includes a training phase and a testing phase.
Notice that in this model we use a training dataset for generating an NB classifier and then use a development dataset for learning the transformation rules. These two sets of tagged data are constructed by manually labeling a set of selected contexts of the polysemous word.
A. The Training Phase
The training process consists of two stages. In the first stage, the error list is determined based on the NB model. This stage is described in Figure 1.
Input: A training corpus and a developing corpus containing manually labeled context/label pairs.
• Step 1: Obtain the raw developing corpus by removing the labels from the developing corpus.
• Step 2: Use the training corpus to train an NB classification model. This classification model is then run on the raw developing corpus obtained in Step 1. The result is called the initial corpus.
• Step 3: Compare the initial corpus with the developing corpus to determine the list of all contexts that received wrong labels from the NB classification model.
Output: The list of contexts with wrong labels (we call it the error list, as shown in Figure 1).
Figure 1. The training algorithm (first stage)
In the second stage, the set of TBL rules is determined by applying the TBL algorithm to the error list obtained in the first stage. Notice that in this stage we use predefined templates for generating potential TBL rules (this is described in detail in Section IV-B). This stage proceeds as follows (shown in Figure 2):
Input: The developing corpus, the initial corpus, and the error list.
• Step 1: Apply the rule templates to the error list to generate a list of transformation rules (the potential rules).
• Step 2: Apply each potential rule to a copy of the initial corpus. The score of each rule is calculated as s2 − s1, where s1 is the number of cases in which right labels are transformed into wrong labels and s2 is the number of cases that are corrected. The rule with the highest score is selected.
• Step 3: Update the initial corpus by applying the rule with the highest score, and move this rule to the selected TBL rules. The error list is also updated accordingly by comparing the initial corpus with the developing corpus.
• Step 4: Stop if the highest score is smaller than a predefined threshold T; otherwise go to Step 1.
Output: The list of transformation rules (i.e., the selected TBL rules).
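Putting the two stages together, a hedged sketch of the training phase might look as follows. It reuses the NaiveBayesWSD and learn_tbl_rules sketches above and an extract_features helper like the one sketched in Section IV; corpus items are assumed to be (window, sense) pairs.

def train_combined_model(training_corpus, developing_corpus, templates):
    """Two-stage training phase of Section III-A (illustrative names)."""
    # Stage 1: train the NB model and label the raw developing corpus.
    nb = NaiveBayesWSD()
    nb.train([(extract_features(win), sense) for win, sense in training_corpus])
    windows = [win for win, _ in developing_corpus]
    gold = [sense for _, sense in developing_corpus]
    initial = [nb.disambiguate(extract_features(win)) for win in windows]
    # The error list of Figure 1: positions where NB disagrees with gold
    # (learn_tbl_rules recomputes this list on every iteration).
    errors = [i for i, (a, b) in enumerate(zip(initial, gold)) if a != b]
    # Stage 2: learn transformation rules that repair those errors.
    rules = learn_tbl_rules(initial, gold, windows, templates)
    return nb, rules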
B. The Test Phase
Figure 2. The training algorithm (second stage)
The proposed approach uses the selected TBL rules obtained in the training phase for testing, as follows (shown in Figure 3):
Input: The test corpus and the selected TBL rules.
• Step 1: Obtain the raw test corpus by removing the labels from the test corpus.
• Step 2: Run the NB classification model on the raw test corpus, obtaining the so-called initial corpus.
• Step 3: Apply the selected TBL rules to the initial corpus to create the labeled corpus.
• Step 4: Compare the labeled corpus with the test corpus to evaluate the system (i.e., compute the accuracy).
Output: The accuracy of the proposed model.
Figure 3. The test phase
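A minimal sketch of this test procedure, under the same assumptions and helper names as the training sketches above:

def test_phase(nb, rules, test_corpus):
    """Test phase of Section III-B; corpus items are (window, sense) pairs."""
    windows = [win for win, _ in test_corpus]
    gold = [sense for _, sense in test_corpus]
    # Steps 1-2: label the raw test corpus with the NB model (initial corpus).
    labels = [nb.disambiguate(extract_features(win)) for win in windows]
    # Step 3: apply the selected TBL rules in the order they were learned.
    for frm, to, pred in rules:
        for i, cur in enumerate(labels):
            if cur == frm and pred(i, windows):
                labels[i] = to
    # Step 4: accuracy of the final labels against the gold labels.
    return sum(p == g for p, g in zip(labels, gold)) / len(gold)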
IV. FEATURES AND RULE TEMPLATES
A. Feature Selection
One of the most important tasks in WSD is the determination of useful information related to word senses. In the corpus-based approach, most studies have considered only the information extracted from the context in which the target word appears.
Context is the only means to identify the meaning of a polysemous word. Therefore, all work on sense disambiguation relies on the context of the target word to provide the information used for its disambiguation. For corpus-based methods, context also provides the prior knowledge against which the current context is compared to achieve disambiguation.
Suppose that w is the polysemous word to be disambiguated, and S = {s_1, s_2, ..., s_m} is the set of its potential senses. A context W of w is represented as:

$$W = \{w_{-3}, w_{-2}, w_{-1}, w_0, w_1, w_2, w_3\}$$

that is, W is a context of w within a window (−3, +3) in which w_0 = w is the target word; for each i ∈ {−3, ..., +3}, w_i is the word appearing at position i relative to w.
Based on previous studies [12], [13], [14] and our experiments, we propose to use two kinds of knowledge and represent them as subsets of features, as follows:
• Bag-of-words, F_1(l, r) = {w_{−l}, ..., w_{+r}}: We choose F_1(−3, +3), which has seven elements (features):
F_1(−3, +3) = {w_{−3}, w_{−2}, w_{−1}, w_0, w_1, w_2, w_3}
• Collocations of words, F_2 = {w_{−l} ... w_{+r}}: We choose collocations whose lengths (including the target word) are less than or equal to 4, i.e., (l + r + 1) ≤ 4. This gives nine elements (features):
F_2 = {w_{−1}w_0, w_0w_1, w_{−2}w_{−1}w_0, w_{−1}w_0w_1, w_0w_1w_2, w_{−3}w_{−2}w_{−1}w_0, w_{−2}w_{−1}w_0w_1, w_{−1}w_0w_1w_2, w_0w_1w_2w_3}
In summary, we obtain 16 features, denoted (f_1, f_2, ..., f_16). These features are used in the NB classification model and for building the TBL rules.
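Under the assumption that a context is handed over as a 7-word window, the 16 features could be extracted as in the following sketch. Encoding the bag-of-words features together with their positions is our own choice, not stated in the text above.

def extract_features(window):
    """Build the 16 features of Section IV-A from a 7-word window
    [w-3, w-2, w-1, w0, w1, w2, w3] centered on the target word."""
    w = dict(zip(range(-3, 4), window))
    # F1(-3,+3): bag-of-words features, tagged with their position.
    f1 = [f"{i}:{w[i]}" for i in range(-3, 4)]
    # F2: the nine collocations containing w0 with total length <= 4.
    spans = [(-1, 0), (0, 1), (-2, 0), (-1, 1), (0, 2),
             (-3, 0), (-2, 1), (-1, 2), (0, 3)]
    f2 = ["_".join(w[i] for i in range(l, r + 1)) for l, r in spans]
    return f1 + f2  # 7 + 9 = 16 features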
B. Rule Templates for Building TBL Rules
Rule templates are an important part of the TBL algorithm; they are used to automatically generate TBL rules. Based on previous studies [15], [16] and the features presented above, we propose the rule templates shown in Figure 4:
A → B word C @ [ -1 ]
A → B word C @ [ 1 ]
A → B word C @ [ -2 ] & word D @ [ -1 ]
A → B word C @ [ -1 ] & word D @ [ 1 ]
A → B word C @ [ 1 ] & word D @ [ 2 ]
Figure 4. The rule templates
For example, two of the rule templates can be read as follows: the template “A → B word C @ [ 1 ]” means “change the label of the current word from A to B if the next word is C”, and the template “A → B word C @ [ -1 ] & word D @ [ 1 ]” means “change the label of the current word from A to B if the previous word is C and the next word is D”.
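One possible encoding of such templates as executable rules, matching the (from-label, to-label, predicate) triples used in the sketches above; the helper name and the window layout are our assumptions.

def make_rule(frm, to, *conditions):
    """Encode a rule like 'A -> B word C @ [-1] & word D @ [1]':
    conditions are (offset, word) pairs that must all hold around
    the target word (index 3 of a 7-word window)."""
    def pred(i, contexts):
        window = contexts[i]
        return all(window[3 + off] == word for off, word in conditions)
    return (frm, to, pred)

# Example: the learned rule '2 -> 3 word tiền @ [ 1 ]' for the word "bạc".
rule = make_rule(2, 3, (1, "tiền"))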
V. EXPERIMENT
A. Data Preparation
Since we consider WSD as a classification problem, we need an annotated corpus for this task. For English, many studies use corpora such as Senseval-1, Senseval-2, Senseval-3, and so on. Because a standard corpus does not exist for Vietnamese, it is necessary to build a training corpus. To this end, we first use a crawler to collect data from web sites and obtain about 1.2 GB of raw data (approximately 120,000 articles from more than 50 Vietnamese web sites such as www.vnexpress.net, www.dantri.com.vn, etc.). We then extract from this corpus the contexts (containing several sentences around the ambiguous word) for 10 ambiguous words. For example, a context for the ambiguous word “bạc” is shown in Figure 5.
Trọng tâm của tháng là sự hòa hợp trong gia đình, khi các thành viên đồng thuận về con đường sự nghiệp của bạn. Giữa tháng 3, tình hình tài chính của bạn cải thiện rất nhiều. Tiền bạc vẫn đổ dồn về, nhưng phải luôn biết cách chi tiêu hợp lý. Đây cũng là khoảng thời gian thích hợp để bạn đầu tư vào các tài sản cố định. Nếu may mắn, bạn sẽ thu về một khoản tiền lớn.
(The focus of the month is harmony in the family, as the members agree on your career path. In mid-March your financial situation improves considerably. Money keeps pouring in, but you must always know how to spend it sensibly. This is also a suitable time to invest in fixed assets. If you are lucky, you will collect a large sum.)
Figure 5. A context of the word “bạc”
These contexts of the 10 ambiguous words are then manually labeled to obtain the labeled corpus. Table I describes in detail the number of samples and senses of each ambiguous word.
Table I
STATISTICS ON THE LABELED DATA
No.  Word  Part of Speech  Senses  Examples
To conduct the experiments we build several datasets as follows:
Firstly, we divide the labeled corpus into two parts in the ratio 3:1, called data-corpus 1 and data-corpus 2, respectively. Data-corpus 1 is used for training and data-corpus 2 for testing in the NB, TBL, SVM², and proposed models.
Secondly, data-corpus 1 is used for building the TBL rules, so it is randomly divided 10 times into two parts in the ratio 3:1: one part is used for training (the training corpus) and the other for development (the developing corpus). Notice that the training phase for building TBL rules is therefore run 10 times, over the corresponding training and developing data, in order to cover as many TBL rules as possible; a sketch of this preparation follows.
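The following small Python sketch illustrates this preparation, assuming the labeled corpus is given as a list of examples; the fixed random seed is our own choice.

import random

def prepare_data(labeled, seed=0):
    """Data preparation of Section V-A: one 3:1 split into data-corpus 1
    and data-corpus 2, then ten random 3:1 splits of corpus 1 into
    (training, developing) pairs."""
    rng = random.Random(seed)
    data = labeled[:]
    rng.shuffle(data)
    k = 3 * len(data) // 4
    corpus1, corpus2 = data[:k], data[k:]  # corpus2 is held out for testing
    splits = []
    for _ in range(10):
        rng.shuffle(corpus1)
        m = 3 * len(corpus1) // 4
        splits.append((corpus1[:m], corpus1[m:]))  # (training, developing)
    return splits, corpus2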
Table II shows the sizes of the training, developing, and test sets.
² We use LIBSVM for the SVM model. See more details about LIBSVM at http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Table II
DATA SETS
No.  Word  Part of Speech  Corpus 1: Training  Corpus 1: Developing  Corpus 2: Test
B. Experimental Results
In this section, we present experimental results for four models: the NB model, the TBL model, the SVM model, and the proposed model, which is the combination of the NB and TBL models. From the datasets above, we first evaluate the accuracy of the NB model and obtain the results shown in Table III. The average accuracy of this model is about 86.5%.
Table III
NB MODEL RESULTS
No.  Word  Part of Speech  Training  Test  Accuracy (%)
Secondly, for each ambiguous word, using the training algorithm in Section III, we obtain lists of TBL rules. Since this phase is run 10 times, we obtain 10 lists of TBL rules. Table IV shows the experimental results when each of these TBL lists is tested separately (in the combination model) for the word “bạc” with part-of-speech adjective. Moreover, if we combine all the rules into one list, we obtain a better accuracy of 92.8%. Some of the TBL rules obtained for the word “bạc” are shown in Figure 6.
4 → 2 word vàng @ [ -1 ]
2 → 4 word sới @ [ -1 ]
2 → 1 word cao @ [ 1 ] & word cấp @ [ 2 ]
2 → 3 word tiền @ [ 1 ]
2 → 3 word mấy @ [ -2 ] & word triệu @ [ -1 ]
3 → 2 word tờ @ [ -1 ]
4 → 1 word két @ [ -1 ]
Figure 6. Some TBL rules for the word “bạc”
Table IV
NB & RULE-BASED MODEL RESULTS FOR THE AMBIGUOUS WORD “BẠC”
No.  List of Rules   Accuracy of NB & TBL (%)
1    list rules 1    89.2
2    list rules 2    89.9
3    list rules 3    89.2
4    list rules 4    89.9
5    list rules 5    89.9
6    list rules 6    89.9
7    list rules 7    89.2
8    list rules 8    90.6
9    list rules 9    92.1
10   list rules 10   89.2
11   combined rules  92.8

Finally, we show the experimental results of our system for the 10 ambiguous words. It can be seen from Table V that the results obtained from the proposed model (combining NB classification and TBL) are better than those obtained from the NB classification model, the TBL model, and the SVM model. The average accuracy of the proposed model is about 91.3% for the 10 ambiguous words, which is 4.8%, 7.4%, and 3.1% more accurate than the NB classification model, the TBL model, and the SVM model, respectively.
Table V
RESULTS OF THE NB, TBL, SVM, AND PROPOSED MODELS
No.  Word  Part of Speech  Accur.1 (%)  Accur.2 (%)  Accur.3 (%)  Accur.4 (%)
1    Bạc   Noun            81.8         82.4         84.4         88.6
3    Cất   Verb            84.4         79.7         86.4         89.7
4    Câu   Noun            97.6         97.3         97.8         98.3
5    Câu   Verb            85.3         88.0         86.7         96.0
6    Cầu   Noun            95.6         85.4         95.6         95.9
7    Khai  Verb            90.4         88.2         91.2         92.9
8    Pha   Verb            79.2         76.5         81.2         83.9
9    Phát  Verb            73.6         75.2         77.1         80.9
10   Sắc   Noun            91.6         83.2         92.8         94.0
1 Accuracy of the NB model
2 Accuracy of the TBL model
3 Accuracy of the SVM model
4 Accuracy of the NB & TBL model
VI. CONCLUSIONS
This paper has proposed a new method that combines the advantages of the machine learning approach and the rule-based approach for the task of word sense disambiguation. In particular, we have used NB classification as the machine learning method and combined it with TBL. We have experimented on some Vietnamese polysemous words, and the obtained accuracy increased by 4.8%, 7.4%, and 3.1% compared with the results of the NB model, the TBL model, and the SVM model, respectively. This also shows that TBL can be utilized to correct wrong results from statistical machine learning models.
This model can be applied to other languages for the task of WSD, and we believe that it can also be applied to some other natural language processing tasks, such as part-of-speech tagging, syntactic parsing, and so on.
ACKNOWLEDGMENT
This work is partially supported by the Vietnam National Foundation for Science and Technology Development (NAFOSTED), project code 102.99.35.09.
REFERENCES
[1] N. Ide and J. Véronis, “Introduction to the special issue on word sense disambiguation: the state of the art,” Comput. Linguist., vol. 24, pp. 2–40, March 1998.
[2] A. Suárez and M. Palomar, “A maximum entropy-based word sense disambiguation system,” in Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, ser. COLING ’02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 1–7.
[3] A. L. Berger, S. A. D. Pietra, and V. J. D. Pietra, “A maximum entropy approach to natural language processing,” Computational Linguistics, vol. 22, pp. 39–71, 1996.
[4] Y. K. Lee, H. T. Ng, and T. K. Chia, “Supervised word sense disambiguation with support vector machines and multiple knowledge sources,” in Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 2004, pp. 137–140.
[5] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods,” in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ser. ACL ’95. Stroudsburg, PA, USA: Association for Computational Linguistics, 1995, pp. 189–196.
[6] T. Pedersen, “A decision tree of bigrams is an accurate predictor of word sense,” in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, ser. NAACL ’01. Stroudsburg, PA, USA: Association for Computational Linguistics, 2001, pp. 1–8.
[7] W. A. Gale, K. W. Church, and D. Yarowsky, “A method for disambiguating word senses in a large corpus,” Computers and the Humanities, vol. 26, pp. 415–439, 1992.
[8] T. Pedersen, “A simple approach to building ensembles of naive bayesian classifiers for word sense disambiguation,” 2000.
[9] M. Lesk, “Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone,” in Proceedings of the 5th Annual International Conference on Systems Documentation, ser. SIGDOC ’86. New York, NY, USA: ACM, 1986, pp. 24–26.
[10] R. Navigli and P. Velardi, “Structural semantic interconnections: A knowledge-based approach to word sense disambiguation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, pp. 1075–1086, July 2005.
[11] E. Brill, “Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging,” Comput. Linguist., vol. 21, pp. 543–565, December 1995.
[12] C. A. Le, “A study of classifier combination and semi-supervised learning for word sense disambiguation,” Ph.D. dissertation, School of Information Science, Japan Advanced Institute of Science and Technology, 2007.
[13] C. A. Le and A. Shimazu, “High word sense disambiguation using naive bayesian classifier with rich features,” in The 18th Pacific Asia Conference on Language, Information and Computation (PACLIC-2004), 2004, pp. 105–113.
[14] R. F. Mihalcea, “Word sense disambiguation with pattern learning and automatic feature selection,” Nat. Lang. Eng., vol. 8, pp. 343–358, December 2002.
[15] G. Ngai and R. Florian, “Transformation-based learning in the fast lane,” in Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, ser. NAACL ’01. Stroudsburg, PA, USA: Association for Computational Linguistics, 2001, pp. 1–8.
[16] R. L. Milidiú, J. C. Duarte, and C. Nogueira dos Santos, “TBL template selection: an evolutionary approach,” in Current Topics in Artificial Intelligence, D. Borrajo, L. Castillo, and J. M. Corchado, Eds. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 180–189.