VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

NGUYEN KIM ANH
VIETNAMESE WORD CLUSTERING
AND ANTONYM IDENTIFICATION
MASTER THESIS OF INFORMATION TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN KIM ANH
VIETNAMESE WORD CLUSTERING
AND ANTONYM IDENTIFICATION
Major: Computer science
Code: 60 48 01
MASTER THESIS OF INFORMATION TECHNOLOGY
SUPERVISOR: Dr. Nguyen Phuong Thai
Hanoi - 2013
Contents

2.1.2 Sticky Pairs and Semantic Classes
Chapter III - Our approach
3.1 Word clustering in Vietnamese
List of Figures

Figure 1. An example of Brown's clustering algorithm
Figure 2. An example of Vietnamese word clusters
Figure 3. The syntax tree of a sentence
Figure 4. An example of Vietnamese word similarity
Figure 5. Selecting word clusters by dictionary
Figure 6. An example of sentence parsing
Figure 7. The accuracy of k clusters
List of Tables

Table 1. Results of CBC with discovering word senses
Table 2. Results of CBC with document clustering
Table 3. Ancillary antonym frames
Table 4. Coordinated antonym frames
Table 5. Transitional antonym frames
Table 6. An unlabeled corpus in Vietnamese
Table 7. The result of five initial clusters
Table 8. The comparison between word clustering and word similarity
Table 9. The result of antonym frames
Table 10. The relation of w1 and w2 pairs
Table 11. The effectiveness of the word cluster feature
Chapter I
Introduction
In recent years, statistical learning methods have been very successful in natural language processing tasks. Most machine learning algorithms used in natural language processing are supervised, and they require labeled data. These labeled data are often made by hand or in some other ways, which can be time consuming and expensive. However, while labeled data is difficult to create by hand, unlabeled data is basically free on the Internet in the form of raw text. This raw text can easily be preprocessed to make it suitable for use in an unsupervised or semi-supervised learning algorithm. Previous works have shown that using unlabeled data alongside traditional labeled data can improve performance (Miller et al., 2004; Abney, 2004; Collins and Singer, 1999) [19][21][7].
In this thesis, I focus on word clustering algorithms for unlabeled data, in which I mainly apply two methods: word clustering by Brown's algorithm [22] and word similarity by Dekang Lin [10]. These two methods are used for clustering the words in a corpus. While Brown's method clusters words based on the relationships between a word and the words standing before and after it, Dekang Lin's method uses the grammatical relationships among those words. To compare the advantages and disadvantages of these two methods, I experimented with them on the same corpus, using the same evaluation method and the same main words in the clusters. The result of word clustering contained different clusters, and each cluster included words that occur in the same contexts. This result was used as features for an application: Vietnamese functional labeling. I also evaluated the influence of word clusters when using them as features in this application; for example, word clusters were used to solve the data sparseness problem of the head word feature. Besides, I used a statistical method to extract 20 antonym frames, which can be used to identify antonym classes in the clusters.
In this chapter, I describe word similarity, hierarchical clustering of words and their applications in natural language processing. Besides, I would like to introduce function tags, word segmentation tasks, the objectives of the thesis and our contributions. Finally, I will describe the structure of the thesis.
1.1 Word Similarity
The meaning of an unknown word can be inferred from its context. Consider the following examples:
A bottle of Beer is on the table.
Everyone likes Beer.
Beer makes you drunk.
The contexts in which the word Beer is used suggest that Beer could be a kind of alcoholic beverage. This means that other alcoholic beverages may occur in the same contexts as Beer, and they may be related. Consequently, two words are similar if they appear in similar contexts or if they are exchangeable to some extent. For example, "Tổng thống" (President) and "Chủ tịch" (Chairman) are similar according to this definition. In contrast, the two words "Kéo" (Scissors) and "Dao" (Knife) are not similar under this definition, although they are semantically related. Intuitively, if I can generate a good clustering, the words in each cluster should be similar.
1.2 Hierarchical Clustering of Words
In recent years, some algorithms have been proposed to automatically cluster words based on a large unlabeled corpus, such as (Brown et al., 1992; Lin, 1998) [22][10]. Consider a corpus of T words, a vocabulary of V words, and a partition π of the vocabulary. The likelihood L(π) of a bigram class model generating the corpus is given by:

L(π) = I - H

Here, H is the entropy of the 1-gram word distribution, and I is the average mutual information of adjacent classes in the corpus:

I = Σ_{c1,c2} Pr(c1, c2) log [ Pr(c1, c2) / (Pr(c1) Pr(c2)) ]
Here, Pr(c1, c2) is the probability that a word in class c1 is followed by a word in class c2. Since H does not depend on the partition π, the partition that maximizes the average mutual information also maximizes the likelihood L(π) of the corpus. Thus, I can use the average mutual information to construct the clusters of words by repeating the merging step until the number of clusters is reduced to the predefined number C.
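To make this concrete, the following is a minimal Python sketch of how the average mutual information I could be estimated from a tokenized corpus and a word-to-class mapping. The function name and data structures are illustrative assumptions, not part of the thesis's implementation.

from collections import Counter
from math import log


def average_mutual_information(tokens, word2class):
    """Estimate I = sum over (c1, c2) of Pr(c1, c2) * log(Pr(c1, c2) / (Pr(c1) * Pr(c2)))
    from the adjacent class pairs observed in a tokenized corpus."""
    classes = [word2class[w] for w in tokens]
    bigrams = Counter(zip(classes, classes[1:]))   # adjacent class pairs
    left = Counter(classes[:-1])                   # class counts in first position
    right = Counter(classes[1:])                   # class counts in second position
    n = len(classes) - 1                           # number of adjacent pairs

    ami = 0.0
    for (c1, c2), count in bigrams.items():
        p12 = count / n
        ami += p12 * log(p12 / ((left[c1] / n) * (right[c2] / n)))
    return ami

With a single class for the whole vocabulary every ratio inside the logarithm is 1, so the sum is zero, while the merging procedure described above tries to keep this value as high as possible at each step.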
1.3 Function tags
Functional tag labeling is an important processing step for many natural language processing applications such as question answering, information extraction, and summarization. Thus, there has been some research focusing on the function tagging problem to recover additional semantic information, which is more useful than syntactic labels alone.
There are two kinds of tags in linguistics: syntactic tags and functional tags. For syntactic tags, there are many theories and projects in English, Spanish, Chinese, etc. The main tasks of these projects are similar: finding the parts of speech and tagging the constituents. Functional tags are understood as abstract labels because they are not like syntactic labels. While a syntactic label has a single notation for a group of words across all contexts, functional tags represent the relationship between a phrase and its utterance in each particular context. So, for each phrase, the functional tag might change; it depends on the context of its neighbors. For example, consider the phrase "baseball bat": the syntax of this phrase is a noun phrase (in most research annotated as NP), but its functional tag might be a subject, as in this sentence:
This baseball bat is very expensive.
In another case, its functional tag might be a direct object:
I bought this baseball bat last month.
Or an instrument in a passive-voice sentence:
That man was attacked by this baseball bat.
Functional tags were directly addressed by Blaheta (2003) [13]. Since then, a lot of research has focused on how to assign functional tags to a sentence. This kind of research problem is called the functional tag labeling problem, a class of problems aiming at finding the semantic information of phrases. To sum up, functional tag labeling is defined as the problem of finding the semantic information of a group of words and then tagging them with a given annotation in context.
1.4 Objectives of the Thesis
Most successful machine learning algorithms are supervised, and they usually use labeled data. These labeled data are often created by hand, which is time consuming and expensive. Unlabeled data, in contrast, is free; it can be obtained from newspapers and websites, and it exists as raw text on the Internet.
In this thesis, I investigate some methods of clustering words from unlabeled data, which is easily extracted from online sources. Among automatic clustering methods, I focus on two: hierarchical word clustering by Brown's algorithm and word similarity by Dekang Lin. Besides, I also suggest a common evaluation tool for both methods when they are applied to the same Vietnamese corpus.
The output of the word clustering was used as features in natural language processing tasks such as Vietnamese functional labeling. I also evaluated the influence of word clusters when they were used as features in this task.
1.5 Our Contributions
As I discussed above, the main aim of this thesis is to cluster unlabeled Vietnamese words. Thus, the contributions of this thesis are as follows:
• Firstly, I performed automatic word clustering on unlabeled Vietnamese data with a corpus of about 700,000 sentences.
• Secondly, I suggested a quality evaluation method for the clusters, using a thesaurus dictionary and five criteria.
• Thirdly, I compared two clustering methods for Vietnamese: word clustering by Brown and word similarity by Dekang Lin. I used the resulting clusters as features in the Vietnamese functional labeling task to increase the task's efficiency. Besides, I used a statistical method to extract 20 antonym frames, which can be used to identify antonym classes in the clusters.
In conclusion, our contribution is that I have implemented word clustering on about 700,000 Vietnamese sentences with a hierarchical word clustering algorithm, used a Vietnamese thesaurus dictionary and five criteria to evaluate the accuracy of the clusters, and used the clusters as features in NLP tasks such as Vietnamese functional labeling. Finally, I extracted 20 antonym frames to identify antonym pairs in antonym classes.
1.6 Thesis structure
In this section, I would like to give a brief outline of the thesis, so that you can have an overview of the following chapters.
Chapter 2 — Related works
In this chapter, I would like to introduce some recent research on word clustering, functional labeling and word segmentation.
Chapter 3— Our approach
This chapter presents the methods I applied to cluster Vietnamese words, how I evaluate the quality of the clusters after the word clustering process, how to use those clusters as features in the Vietnamese functional labeling task, and how to extract antonym frames from the corpus.
Chapter 4— Experiment
In this chapter, I discuss the corpus I used for clustering and some tools applied in this thesis. Besides, I point out and analyze some errors in erroneous clusters from the word clustering process. Finally, I evaluate the influence of the clusters when I apply them to the Vietnamese functional labeling task.
Chapter 5—Conclusions and Future works
In the last chapter, I give a general conclusion about the advantages and limitations of our work. Besides, I propose some work which I will do in the future to improve our model.
Finally, the references list the related research that our system referred to.
Chapter II
Related works

2.1.1 The Brown algorithm
Word clustering is considered here as a method for estimating the probabilities of low-frequency events. One of the aims of word clustering is the problem of predicting a word from the previous words in a sample of text. In this task, the authors used the bottom-up agglomerative word clustering algorithm (Brown et al., 1992) [22] to derive a hierarchical clustering of words. The input to the algorithm is a corpus of unlabeled data which consists of a vocabulary of words to be clustered. The output of the word clustering algorithm is a binary tree, shown in Figure 1, in which the leaves of the tree are the words in the vocabulary and each internal node is interpreted as a cluster containing the words in that sub-tree. Initially, each word in the corpus is considered to be in its own distinct cluster. The algorithm then repeatedly merges the pair of clusters that maximizes the quality of the clustering result, with each word belonging to exactly one cluster, until the number of clusters is reduced to the predefined number, as follows:
Initial mapping: put a single word in each cluster
Compute the initial AMI of the collection
repeat
    for each pair of clusters do
        merge the pair of clusters temporarily
        compute the AMI of the new collection
    keep the merge that results in the highest AMI
until the predefined number of clusters is reached
Figure 1. An example of Brown's clustering algorithm
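The merging loop above can be sketched in Python as follows. This is a naive illustration only: it recomputes the AMI from scratch for every trial merge, whereas Brown et al.'s implementation updates the value incrementally and restricts merging to a window of frequent words. The helper reuses the average_mutual_information function sketched in Section 1.2, and all names here are assumptions.

import itertools


def brown_style_clustering(tokens, num_clusters):
    """Greedy bottom-up merging: start with one cluster per word and repeatedly
    merge the pair of clusters whose merge keeps the AMI of adjacent clusters highest."""
    word2cluster = {w: w for w in set(tokens)}     # initially, each word is its own cluster

    while len(set(word2cluster.values())) > num_clusters:
        best_pair, best_ami = None, float("-inf")
        current = set(word2cluster.values())
        for c1, c2 in itertools.combinations(current, 2):
            trial = {w: (c1 if c == c2 else c) for w, c in word2cluster.items()}
            score = average_mutual_information(tokens, trial)
            if score > best_ami:
                best_pair, best_ami = (c1, c2), score
        keep, absorb = best_pair
        word2cluster = {w: (keep if c == absorb else c) for w, c in word2cluster.items()}
    return word2cluster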
To identify the quality of a clustering, this algorithm considers a training text of T words t_1^T, a vocabulary of V words, and a partition π of the vocabulary. The maximum likelihood estimates of the parameters of a 1-gram class model generating the corpus are given by:
Pr(w | c) = C(w) / C(c)   and   Pr(c) = C(c) / T

For the 2-gram class model, the sequential maximum likelihood estimates of the order-2 parameters, which maximize Pr(t_2^T | t_1), are given by:

Pr(c2 | c1) = C(c1 c2) / Σ_c C(c1 c)

Pr(w | c) = C(w) / C(c)

where C(c) is the number of words in the corpus whose class is c, and C(c1 c2) is the number of times a word in class c1 is immediately followed by a word in class c2. Let L(π) = (T - 1)^(-1) log Pr(t_2^T | t_1); then

L(π) = Σ_{w1,w2} Pr(w1 w2) log [ Pr(c(w2) | c(w1)) Pr(w2 | c(w2)) ]
     = Σ_w Pr(w) log Pr(w) + Σ_{c1,c2} Pr(c1 c2) log [ Pr(c1 c2) / (Pr(c1) Pr(c2)) ]
     = I(c1, c2) - H(w)
in which H represents the entropy of the 1-gram word distribution, and I represents the average mutual information of adjacent classes c1 and c2.
2.1.2 Sticky Pairs and Semantic Classes
One of the aims of word clustering is to group words together based on the statistical similarity of their surroundings. In addition, the contextual information of words can also be used as features to group words together. For example, consider two words w1 and w2; the mutual information of the two words occurring as an adjacent pair is:

log [ Pr(w1 w2) / (Pr(w1) Pr(w2)) ]

Adjacent pairs with high mutual information are called sticky pairs. Semantic classes can be found with a looser notion of co-occurrence.
Let Pr_near(w1 w2) be the probability that a word chosen at random from a window of 1,001 words centered on w1, but excluding the words in a window of 5 centered on w1, is w2. If Pr_near(w1 w2) is much larger than Pr(w1) Pr(w2), then w1 and w2 are semantically sticky. Using Pr_near(w1 w2), the algorithm finds some interesting classes such as:
we our us ourselves ours
question questions asking answer answers answering
performance performed perform performs performing
tie jacket suit
write writes writing written wrote
morning noon evening night nights midnight bed
attorney counsel trial court judge
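As an illustration, the following is a rough Python sketch of how such semantically sticky pairs could be collected from a tokenized corpus. The window sizes follow the description above; the function name, the normalization, and the threshold are my assumptions, not details given in the thesis or in Brown et al.

from collections import Counter


def semantically_sticky_pairs(tokens, wide=1001, narrow=5, threshold=20.0):
    """Return pairs (w1, w2) whose Pr_near(w1, w2) is much larger than Pr(w1) * Pr(w2).
    Pr_near counts occurrences of w2 inside a wide window centered on w1,
    excluding a narrow window centered on w1."""
    n = len(tokens)
    unigram = Counter(tokens)
    near = Counter()
    half_wide, half_narrow = wide // 2, narrow // 2

    for i, w1 in enumerate(tokens):
        for j in range(max(0, i - half_wide), min(n, i + half_wide + 1)):
            if abs(j - i) > half_narrow:            # skip the narrow window around w1
                near[(w1, tokens[j])] += 1

    total = sum(near.values())
    sticky = []
    for (w1, w2), count in near.items():
        if count / total > threshold * (unigram[w1] / n) * (unigram[w2] / n):
            sticky.append((w1, w2))
    return sticky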
2.2 Word Similarity

As in Section 1.1, the meaning of a word can be inferred from the contexts in which it occurs, as in the following examples:
A bottle of vodka is on the table.
Everyone likes vodka.
Vodka makes you drunk.
We make vodka out of corn.
The similarity between two objects is defined to be the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects (Lin, 1997) [11]. Considering two words w1 and w2, the similarity of the two words in context is:

sim(w1, w2) = log P(common(w1, w2)) / log P(describe(w1, w2))
To compute the similarity of two words in context, Dekang Lin used dependency triples (w, r, w'), each of which contains two words and the grammatical relationship between them in the input sentence. Here, w is the word under consideration; r is the grammatical relationship between w and w'; w' is the word context of w; and the notation ||w, r, w'|| is the frequency count of the dependency triple (w, r, w') in the parsed corpus. When w, r, or w' is the wild card (*), the frequency counts of all the dependency triples which match the rest of the pattern are summed up. For example, ||uống, obj, *|| (||drink, obj, *||) is the frequency count of "uống-obj" (drink-obj) relationships in the parsed corpus.
The description of a word w contains the frequency counts of all the dependency triples that match the pattern (w, *, *). Let I(w, r, w') denote the amount of information contained in ||w, r, w'||; this value is computed as follows.
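The standard definition of this quantity from Lin (1998) [10] is:

I(w, r, w') = log [ (||w, r, w'|| × ||*, r, *||) / (||w, r, *|| × ||*, r, w'||) ]

that is, a triple carries much information when it occurs much more often than would be expected if w and w' were independent given the relation r.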
b) Object and Object-of (Obj and Obj-of): In this relationship, the central word is a verb and the context word is a noun (the object). The signs on the tree are: parent node as VP, central node as V, context node as NP.
c) Complement and Complement-of (Mod and Mod-of): In this relationship, the central word is a noun and the context word is a modifier of the noun (modifiers may be N, A or V). The signs on the tree are: parent node as NP, central node as N, context node as N.
For each word w, the output of the algorithm lists the words most similar to it as follows:

w (pos): w1 s1, w2 s2, ..., wn sn

in which pos is a part of speech, wi is a word and si = sim(w, wi). The top-10 words in the noun, verb and adjective entries for the word "brief" are as follows:
brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04.
brief (verb): tell 0.09, urge 0.07, ask 0.07, meet 0.06, appoint 0.06, elect 0.05, name 0.05, empower 0.05, summon 0.05, overrule 0.04.
brief (adjective): lengthy 0.13, short 0.12, recent 0.09, prolonged 0.09, long 0.09, extended 0.09, daylong 0.08, scheduled 0.08, stormy 0.07, planned 0.06.
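A minimal Python sketch of this idea, following the instantiation in Lin (1998) [10] in which the commonality of two words is their set of shared dependency features. The counts table, the function names, and the small Vietnamese example are illustrative assumptions only.

from math import log


def triple_information(counts, w, r, w_ctx):
    # I(w, r, w') = log( ||w,r,w'|| * ||*,r,*|| / (||w,r,*|| * ||*,r,w'||) )
    n_wrw = counts.get((w, r, w_ctx), 0)
    n_r = sum(c for (_, r1, _), c in counts.items() if r1 == r)
    n_wr = sum(c for (w1, r1, _), c in counts.items() if w1 == w and r1 == r)
    n_rw = sum(c for (_, r1, w2), c in counts.items() if r1 == r and w2 == w_ctx)
    if 0 in (n_wrw, n_wr, n_rw):
        return 0.0
    return log(n_wrw * n_r / (n_wr * n_rw))


def features(counts, w):
    # Features of w: pairs (r, w') whose information I(w, r, w') is positive.
    return {(r, wc) for (w1, r, wc) in counts
            if w1 == w and triple_information(counts, w, r, wc) > 0}


def lin_similarity(counts, w1, w2):
    # Information in the shared features of w1 and w2, divided by the total
    # information in the features of w1 plus the features of w2.
    f1, f2 = features(counts, w1), features(counts, w2)
    shared = f1 & f2
    num = sum(triple_information(counts, w1, r, wc) +
              triple_information(counts, w2, r, wc) for r, wc in shared)
    den = (sum(triple_information(counts, w1, r, wc) for r, wc in f1) +
           sum(triple_information(counts, w2, r, wc) for r, wc in f2))
    return num / den if den else 0.0


# A tiny hypothetical triple table: uống (drink) and thích (like) share the objects
# bia/rượu, while đọc (read) takes different objects, so it ends up dissimilar to both.
counts = {("uống", "obj", "bia"): 4, ("uống", "obj", "rượu"): 3,
          ("thích", "obj", "bia"): 3, ("thích", "obj", "rượu"): 2,
          ("đọc", "obj", "sách"): 6, ("đọc", "obj", "báo"): 4}
print(lin_similarity(counts, "uống", "thích"))   # 1.0: identical object contexts
print(lin_similarity(counts, "uống", "đọc"))     # 0.0: no shared contexts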
2.3 Clustering By Committee
Clustering By Committee (CBC) focuses on addressing the general goal of clustering, that is, to group data elements so that the intra-group similarities are high and the inter-group similarities are low. CBC comes in two versions of the algorithm: a hard clustering version, in which each element is assigned to only one cluster, and a soft clustering version, in which elements can be assigned to multiple clusters.
2.3.1 Motivation
Clustering By Committee is motivated by the desire to automatically extract concepts and word senses from large unlabeled collections of text. In previous research, word senses were usually defined by using a manually constructed lexicon such as WordNet. However, such lexicons have several problems: manually created lexicons often contain rare senses and miss many domain-specific senses. One way to solve these problems is to use a clustering algorithm to automatically induce semantic classes (Lin and Pantel, 2001) [9].
Many clustering algorithms represent a cluster by the centroid of all its members (as K-means does) or by representative elements. For example, when clustering words, the task can use the contexts of the words as features and group the words together. In CBC, the centroid of a cluster is constructed by averaging the feature vectors of a subset of the cluster members, and this subset is viewed as a committee that determines which other elements belong to the cluster.
2.3.2 The algorithm

CBC consists of three phases: in the first phase, the top-similar elements of each element are computed; in the second phase, tight clusters are found, the members of which form committees; in the final phase of the algorithm, each element is assigned to its most similar clusters.
2.3.2.1 Phase I: Find top-similar elements
To compute the top-similar elements of an element e, the algorithm first sorts e's features according to their point-wise mutual information values and then considers only a subset of the features with the highest mutual information. Finally, Phase I computes the pairwise similarity between e and the elements that share a feature from this subset.
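A rough Python sketch of this phase, assuming each element is represented by a dictionary mapping features to point-wise mutual information values and that cosine is used as the pairwise similarity; the thesis does not fix these details, so they are assumptions here.

from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def top_similar_elements(element, vectors, num_features=50, top_k=20):
    """Phase I of CBC: keep only the element's highest-PMI features, then rank the
    elements sharing at least one of those features by their similarity to it."""
    vec = vectors[element]
    kept = set(sorted(vec, key=vec.get, reverse=True)[:num_features])
    candidates = [e for e, v in vectors.items() if e != element and kept & v.keys()]
    scored = [(e, cosine(vec, vectors[e])) for e in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]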
2.3.2.2 Phase II: Find committees
The second phase of the clustering algorithm recursively finds tight clusters scattered in the similarity space. In each recursive step, the algorithm finds a set of tight clusters called committees. A committee covers an element if the element's similarity to the centroid of the committee exceeds some high similarity threshold. Phase II is presented as follows:
• Input: a list of elements E to be clustered, a similarity database S from Phase I, thresholds θ1 and θ2.
• Step 1: For each element e ∈ E: cluster the top-similar elements of e from S using average-link clustering. For each discovered cluster c, compute the following score: |c| × avgsim(c), where |c| is the number of elements in c and avgsim(c) is the average pairwise similarity between elements in c. Store the highest-scoring cluster in a list L.
• Step 2: Sort the clusters in L in descending order of their scores.
• Step 3: Let C be a list of committees, initially empty. For each cluster c ∈ L in sorted order, compute the centroid of c by averaging the frequency vectors of its elements and computing the mutual information vector of the centroid in the same way. If c's similarity to the centroid of every committee previously added to C is below a threshold θ1, add c to C.
• Step 4: If C is empty, the algorithm is done and returns C.
• Step 5: For each element e ∈ E, if e's similarity to every committee in C is below threshold θ2, add e to a list of residues R.
• Step 6: If R is empty, the algorithm is done and returns C. Otherwise, return the union of C and the output of a recursive call to Phase II with the same input, except replacing E with R.
• Output: a list of committees (a simplified sketch of this procedure is given below).
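The following is a much-simplified Python sketch of the committee-selection loop in Steps 2 to 6, with Step 1's average-link clustering and the centroid vectors replaced by cruder set-based stand-ins (each candidate cluster is an element plus its nearest neighbours, and cluster-to-committee similarity is an average of pairwise similarities). The parameter names and thresholds are assumptions, not values from the thesis.

def find_committees(elements, sim, theta1=0.35, theta2=0.25, top_k=10):
    """Simplified CBC Phase II: score candidate clusters by |c| * avgsim(c), greedily
    keep those not too similar to committees already chosen, then recurse on residues."""
    def avgsim(cluster):
        pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
        return sum(sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

    def cluster_sim(c1, c2):
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    # Step 1 (stand-in): one candidate cluster per element, built from its nearest neighbours.
    candidates = []
    for e in elements:
        ranked = sorted((x for x in elements if x != e), key=lambda x: sim(e, x), reverse=True)
        candidates.append([e] + ranked[:top_k])

    # Step 2: sort candidates by |c| * avgsim(c).
    candidates.sort(key=lambda c: len(c) * avgsim(c), reverse=True)

    # Step 3: keep a candidate only if it is not too similar to any chosen committee.
    committees = []
    for c in candidates:
        if all(cluster_sim(c, comm) < theta1 for comm in committees):
            committees.append(c)

    # Steps 4-6: collect residue elements not covered by any committee and recurse.
    if not committees:
        return committees
    residues = [e for e in elements
                if all(max(sim(e, m) for m in comm) < theta2 for comm in committees)]
    if not residues or len(residues) == len(elements):
        return committees
    return committees + find_committees(residues, sim, theta1, theta2, top_k)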
2.3.2.3 Phase III: Assign elements to clusters
Phase III has two versions: the hard clustering version for document clustering and the soft clustering version for discovering word senses. In the first version, every element is assigned to the cluster containing the committee to which it is most similar. This version resembles K-means in that every element is assigned to its closest centroid.
Unlike K-means, the number of clusters is not fixed and the centroids do not change.
In the second version, each element e is assigned to its most similar clusters in the following way:
let C be a list of clusters, initially empty
let S be the top-200 similar clusters to e
while S is not empty {
    let c ∈ S be the cluster most similar to e
    if similarity(e, c) < σ
        exit the loop
    if c is not similar to any cluster in C {
        assign e to c
        remove from e its features that overlap with the features of c
        add c to C
    }
    remove c from S
}
For discovering word senses, the evaluation used a test set consisting of 13,403 words.
Table 1. Results of CBC with discovering word senses
• The goal of document clustering is to discover documents with similar topics. On the 20-news data set, the implementation of Chameleon was unable to complete in reasonable time. For K-means, CBC used K = 1000 over five iterations for 20-news and K = 30 over eight iterations for Reuters. For Buckshot, CBC used K = 80 for 20-news and K = 50 for Reuters, also over eight iterations. For the 20-news corpus, CBC spends the vast majority of the time finding the top similar documents (38 minutes) and computing the similarity between documents and committee centroids (119 minutes). The rest of the computation, which includes clustering the top-20 similar documents for each of the 18,828 documents and sorting the clusters, took less than 5 minutes. Table 2 shows the results of various clustering algorithms on document clustering.
Table 2. Results of CBC with document clustering

Algorithms | Reuters | 20-News
Chapter III
Our approach
In this chapter, I will describe our approach, including word clustering for Vietnamese, evaluating the influence of word clustering in Vietnamese, further analyzing the classes of antonym relationships within clusters, and how to use word clusters as features in the Vietnamese functional labeling task.
3.1 Word clustering in Vietnamese
To cluster words in Vietnamese, I apply two methods of word clustering on the same corpus, as mentioned in Chapter 1: word clustering by Brown's algorithm and word similarity by Dekang Lin.
3.1.1 Brown’s algorithm
In order to extract word clusters in Vietnamese, I used Brown's algorithm, one of the most well-known and effective clustering algorithms for language models, to cluster the Vietnamese corpus. The Brown algorithm uses the mutual information between cluster pairs in a bottom-up approach to maximize the Average Mutual Information between adjacent clusters. The output of the algorithm is a hierarchical clustering of words, in which words are hierarchically clustered by bit strings, as in Figure 2.

A word cluster contains a main word and subordinate words; each subordinate word has the same bit string and a corresponding frequency. The number of subordinate words is different in each cluster. To keep cluster sizes manageable, I randomly remove some subordinate words from clusters containing more than 10 subordinate words, because I observed that the subordinate words I dismiss often have low frequency and often have no semantic relation to the main word.
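A small Python sketch of this pruning step. The cluster representation (a bit string mapping to a main word and a list of subordinate words with frequencies) and the cap of 10 subordinate words come from the description above; everything else, including the example words, is a hypothetical illustration.

import random


def prune_clusters(clusters, max_subordinates=10, seed=0):
    """Randomly drop subordinate words from any cluster that has more than
    max_subordinates of them, keeping the main word untouched.

    clusters maps a bit string to (main_word, [(subordinate_word, frequency), ...]).
    """
    rng = random.Random(seed)
    pruned = {}
    for bits, (main_word, subordinates) in clusters.items():
        if len(subordinates) > max_subordinates:
            subordinates = rng.sample(subordinates, max_subordinates)
        pruned[bits] = (main_word, subordinates)
    return pruned


# Hypothetical example: a single cluster keyed by its bit string.
clusters = {"0110": ("bia", [("rượu", 120), ("nước", 95), ("trà", 40)])}
print(prune_clusters(clusters, max_subordinates=2))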