VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN KIM ANH
VIETNAMESE WORD CLUSTERING AND ANTONYM IDENTIFICATION
MASTER THESIS OF INFORMATION TECHNOLOGY
Hanoi - 2013
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN KIM ANH
VIETNAMESE WORD CLUSTERING AND ANTONYM IDENTIFICATION
Major: Computer science
Code: 60 48 01
MASTER THESIS OF INFORMATION TECHNOLOGY
SUPERVISOR: PhD Nguyen Phuong Thai
Hanoi - 2013
Table of Contents
Acknowledgements 4
Abstract 5
Chapter I - Introduction 10
1.1 Word Similarity 11
1.2 Hierarchical Clustering of Words 11
1.3 Function tags 12
1.4 Objectives of the Thesis 13
1.5 Our Contributions 13
1.6 Thesis structure 14
Chapter II - Related Works 15
2.1 Word Clustering 15
2.1.1 The Brown algorithm 15
2.1.2 Sticky Pairs and Semantic Classes 17
2.2 Word Similarity 18
2.2.1 Approach 18
2.2.2 Grammar Relationships 19
2.2.3 Results 20
2.3 Clustering By Committee 20
2.3.1 Motivation 21
2.3.2 Algorithm 21
2.3.3 Results 23
Chapter III - Our approach 25
3.1 Word clustering in Vietnamese 25
3.1.1 Brown's algorithm 25
3.1.2 Word similarity 26
3.2 Evaluating Methodology 28
3.3 Antonym classes 31
3.3.1 Ancillary antonym 31
3.3.2 Coordinated antonym 32
3.3.3 Minor classes 33
3.4 Vietnamese functional labeling 34
Chapter IV - Experiment 37
4.1 Results and Comparison 37
4.2 Antonym frames 40
4.3 Effectiveness of Word Cluster feature in Vietnamese Functional labeling 42
4.4 Error analyses 43
4.5 Summarization 44
Chapter V - Conclusion and Future works 45
5.1 Conclusion 45
5.2 Future works 45
Bibliography 46
List of Figures
Figure 1 An example of Brown's cluster algorithm 16
Figure 2 An example of Vietnamese word cluster 26
Figure 3 The syntax tree of a sentence 26
Figure 4 An example of Vietnamese word similarity 28
Figure 5 Select word clusters by dictionary 30
Figure 6 An example of sentence parses 35
Figure 7 The correctness of k clusters 38
List of Tables
Table 1 Results of CBC with discovering word senses 24
Table 2 Results of CBC with document clustering 24
Table 3 Ancillary antonym frames 32
Table 4 Coordinated antonym frames 33
Table 5 Transitional antonym frames 34
Table 6 An unlabeled corpus in Vietnamese 37
Table 7 The result of five initial clusters 39
Table 8 The comparison between Word clustering and Word similarity 40
Table 9 The result of antonym frames 41
Table 10 The relation of w1 and w2 pairs 42
Table 11 The effectiveness of word cluster feature 43
Chapter I
Introduction
In recent years, statistical learning methods have been highly successful in natural language processing tasks. Most machine learning algorithms used in natural language processing are supervised and require labeled data. These labeled data are often created by hand or in other ways that are time consuming and expensive. However, while labeled data is difficult to create by hand, unlabeled data is basically free on the Internet in the form of raw text. This raw text can easily be preprocessed to make it suitable for an unsupervised or semi-supervised learning algorithm. Previous works have shown that using unlabeled data in addition to traditional labeled data can improve performance (Miller et al., 2004; Abney, 2004; Collins and Singer, 1999) [19][2][7].
In this thesis, I focus on word clustering algorithms for unlabeled data, mainly applying two methods: word clustering by Brown's algorithm [22] and word similarity by Dekang Lin [10]. Both methods are used to cluster the words of a corpus. While Brown's method clusters a word based on its relationships with the words standing before and after it, Dekang Lin's method uses the relationships among those three words. To compare the advantages and disadvantages of these two methods, I experimented with them on the same corpus, using the same evaluation method and the same main words in clusters. The result of word clustering contained different clusters, and each cluster included words appearing in the same contexts. This result was used as features for an application, Vietnamese functional labeling; for example, word clusters were used to solve the data sparseness problem of the head word feature. I also evaluated the influence of word clusters when using them as features in this application. Besides, I used a statistical method to extract 20 antonym frames which can be used to identify antonym classes in clusters.
In this chapter, I describe word similarity, hierarchical clustering of words, and their applications in natural language processing. Besides, I introduce function tags, the word segmentation task, the objectives of the thesis, and our contributions. Finally, I describe the structure of the thesis.
1.1 Word Similarity
The semantics of an unknown word can be inferred from its context. Consider the following examples:
A bottle of Beer is on the table
Everyone likes Beer
Beer makes you drunk
The contexts in which the word Beer is used suggest that Beer could be a kind of alcoholic beverage. This means that other alcoholic beverages may occur in the same contexts as Beer, so they may be related. Consequently, two words are similar if they appear in similar contexts or if they are exchangeable to some extent. For example, "Tổng_thống" (President) and "Chủ_tịch" (Chairman) are similar according to this definition. In contrast, the two words "Kéo" (Scissors) and "Cắt" (Knife) are not similar under this definition, although they are semantically related. Intuitively, if I can generate a good clustering, the words in each cluster should be similar.
1.2 Hierarchical Clustering of Words
In recent years, some algorithms have been proposed to automatically cluster words based on a large unlabeled corpus, such as (Brown et al., 1992; Lin, 1998) [22][10]. I consider a corpus of T words, a vocabulary of V words, and a partition π of the vocabulary. The likelihood L(π) of a bigram class model generating the corpus is given by:

L(π) = I − H

Here, H is the entropy of the 1-gram word distribution, and I is the average mutual information of adjacent classes in the corpus:

I = Σ_{c1,c2} Pr(c1 c2) log [ Pr(c1 c2) / ( Pr(c1) Pr(c2) ) ]

where Pr(c1 c2) is the probability that a word in c1 is followed by a word in c2. Since H and π are independent, the partition that maximizes the average mutual information also maximizes the likelihood L(π) of the corpus. Thus, I can use the average mutual information to construct the word clusters by repeating the merging step until the number of clusters is reduced to the predefined number C.
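As a concrete illustration, the following minimal Python sketch (not the tool used in this thesis; the toy corpus and partition are invented for illustration) computes H, I and L(π) for a given partition:

from collections import Counter
from math import log2

def likelihood(corpus, word2class):
    """Compute L(pi) = I - H for a corpus (a list of tokens) and a partition."""
    T = len(corpus)
    word_counts = Counter(corpus)
    class_counts = Counter(word2class[w] for w in corpus)
    adjacent = Counter((word2class[a], word2class[b]) for a, b in zip(corpus, corpus[1:]))
    # H: entropy of the 1-gram word distribution
    H = -sum((n / T) * log2(n / T) for n in word_counts.values())
    # I: average mutual information of adjacent classes
    I = 0.0
    for (c1, c2), n in adjacent.items():
        p12 = n / (T - 1)                                  # Pr(c1 c2)
        p1, p2 = class_counts[c1] / T, class_counts[c2] / T
        I += p12 * log2(p12 / (p1 * p2))
    return I - H

corpus = "the dog barks the cat meows the dog sleeps".split()
partition = {"the": 0, "dog": 1, "cat": 1, "barks": 2, "meows": 2, "sleeps": 2}
print(likelihood(corpus, partition))

Because merging two classes changes only I and never H, the clustering procedure only needs to track the average mutual information.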
1.3 Function tags
Functional tag labeling is an important processing step for many natural language processing applications such as question answering, information extraction, and summarization. Thus, some research has focused on the function tagging problem in order to provide additional semantic information that is more useful than syntactic labels alone.
There are two kinds of tags in linguistics: syntactic tags and functional tags. For syntactic tags there are many theories and projects in English, Spanish, Chinese, etc. The main task of this research is to find the part-of-speech and syntactic categories and tag the constituents with them. Functional tags can be understood as more abstract labels because they are not like syntactic labels. While a syntactic label has one notation for a group of words across all contexts, a functional tag represents the relationship between a phrase and its utterance in each different context. So the functional tag of a phrase may change; it depends on the context of its neighbors. For example, consider the phrase "baseball bat": the syntactic category of this phrase is a noun phrase (in most research annotated as NP). But its functional tag might be a subject, as in this sentence:
This baseball bat is very expensive
In another case, its functional tag might be a direct object:
I bought this baseball bat last month
Or an instrument in a passive-voice sentence:
That man was attacked with this baseball bat
Functional tags were directly addressed by Blaheta (2003) [13]. Since then, there has been a lot of research focusing on how to assign functional tags to a sentence. This kind of research problem is called the functional tag labeling problem, a class of problems aiming at finding the semantic information of phrases. To sum up, functional tag labeling is defined as the problem of finding the semantic information of a group of words and then tagging them with a given annotation in their context.
1.4 Objectives of the Thesis
Most successful machine learning algorithms are supervised algorithms, and they usually use labeled data. These labeled data are often created by hand, which is time consuming and expensive. Unlabeled data, in contrast, is free: it can be obtained from newspapers, websites, etc., and it exists as raw text on the Internet.
In this thesis, I investigate methods for clustering words from unlabeled data, which can easily be extracted from online sources. Among automatic clustering methods, I focus on two: hierarchical word clustering by Brown's algorithm and word similarity by Dekang Lin. Besides, I also propose a common evaluation method for both approaches when they are applied to the same Vietnamese corpus.
The output of the word clustering was used as features in natural language processing tasks such as Vietnamese functional labeling. I also evaluated the influence of word clusters when they were used as features in this task.
1.5 Our Contributions
As discussed above, the main aim of this thesis is to cluster unlabeled Vietnamese words. The contributions of this thesis are as follows:
• Firstly, I performed automatic word clustering for unlabeled Vietnamese data with a corpus of about 700,000 sentences.
• Secondly, I proposed a method for evaluating the quality of the clusters, based on a thesaurus dictionary and 5 criteria.
• Thirdly, I compared two clustering methods for Vietnamese: word clustering by Brown and word similarity by Dekang Lin. I used the resulting clusters as features in the Vietnamese functional labeling task to increase that task's efficiency. Besides, I used a statistical method to extract 20 antonym frames which can be used to identify antonym classes in the clusters.
In conclusion, I have implemented word clustering on about 700,000 Vietnamese sentences with a hierarchical word clustering algorithm, used a Vietnamese thesaurus dictionary and five criteria to evaluate the correctness of the clusters, and used the clusters as features in NLP tasks such as Vietnamese functional labeling. Finally, I extracted 20 antonym frames to identify antonym pairs in antonym classes.
1.6 Thesis structure
In this section, I would like to introduce a brief outline of the thesis, so that you can have an overview of the chapters that follow.
Chapter 2 – Related works
In this chapter I would like to introduce some recent research on word clustering, functional labeling and word segmentation.
Chapter 3 – Our approach
This chapter presents the methods I applied to cluster Vietnamese words, how I evaluate the quality of the clusters after the word clustering process, how those clusters are used as features in the Vietnamese functional labeling task, and how antonym frames are extracted from the corpus.
Chapter 4 – Experiment
In this chapter, I discuss the corpus I used for clustering and the tools applied in this thesis. Besides, I point out and analyze some errors in erroneous clusters from the word clustering process. Finally, I evaluate the influence of the clusters when applying them to the Vietnamese functional labeling task.
Chapter 5 – Conclusions and Future works
In the last chapter, I give a general conclusion about the advantages and limitations of our work. Besides, I propose some work I will do in the future to improve our model.
Finally, the references list the related research that our work refers to.
Chapter II
Related Works
In this chapter, I introduce some recent research on word clustering and word similarity, namely class-based n-gram models by Brown's algorithm [22], word similarity [10], and clustering by committee [23].
2.1 Word Clustering
2.1.1 The Brown algorithm
Word clustering is considered here as a method for estimating the probabilities of low-frequency events. One of the aims of word clustering is to address the problem of predicting a word from the previous words in a sample of text. For this task, the authors used the bottom-up agglomerative word clustering algorithm of (Brown et al., 1992) [22] to derive a hierarchical clustering of words. The input to the algorithm is a corpus of unlabeled data which provides the vocabulary of words to be clustered. The output of the word clustering algorithm is a binary tree, as in Figure 1, in which the leaves of the tree are the words of the vocabulary and each internal node is interpreted as a cluster containing the words in its sub-tree. Initially, each word in the corpus is placed in its own distinct cluster. The algorithm then repeatedly merges the pair of clusters that maximizes the quality of the clustering result, with each word belonging to exactly one cluster, until the number of clusters is reduced to the predefined number, as follows (a Python sketch of this merging loop is given after Figure 1):
Initial mapping: put a single word in each cluster.
Compute the initial AMI of the collection.
repeat
    for each pair of clusters do
        Merge the pair of clusters temporarily.
        Compute the AMI of the collection.
    end for
    Select and merge the pair of clusters with the minimum decrement of AMI.
    Compute the AMI of the new collection.
until the predefined number of clusters is reached
repeat
    Move each term to the cluster for which the resulting partition has the greatest AMI.
until no more increment in AMI
Figure 1 An example of Brown's cluster algorithm
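The following Python sketch mirrors the merging loop above in a direct but inefficient way: it recomputes the AMI from scratch after every tentative merge (Brown et al. [22] describe incremental updates) and omits the final term-swapping refinement. The toy corpus is invented for illustration.

from collections import Counter
from itertools import combinations
from math import log2

def ami(corpus, word2cluster):
    """Average mutual information between the clusters of adjacent words."""
    T = len(corpus)
    cc = Counter(word2cluster[w] for w in corpus)
    bc = Counter((word2cluster[a], word2cluster[b]) for a, b in zip(corpus, corpus[1:]))
    return sum((n / (T - 1)) * log2((n / (T - 1)) / ((cc[c1] / T) * (cc[c2] / T)))
               for (c1, c2), n in bc.items())

def brown_clusters(corpus, num_clusters):
    word2cluster = {w: w for w in set(corpus)}       # initial mapping: one word per cluster
    clusters = set(word2cluster.values())
    while len(clusters) > num_clusters:
        best_pair, best_score = None, float("-inf")
        for a, b in combinations(clusters, 2):       # try every pair of clusters
            trial = {w: (a if c == b else c) for w, c in word2cluster.items()}
            score = ami(corpus, trial)               # AMI after the tentative merge
            if score > best_score:
                best_pair, best_score = (a, b), score
        a, b = best_pair                             # keep the merge that loses the least AMI
        word2cluster = {w: (a if c == b else c) for w, c in word2cluster.items()}
        clusters = set(word2cluster.values())
    return word2cluster

corpus = "the dog barks the cat meows the dog sleeps the cat purrs".split()
print(brown_clusters(corpus, 3))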
To measure the quality of a clustering, the algorithm considers a training text t_1^T = t_1, ..., t_T. For a 1-gram class model, the sequential maximum likelihood estimates of the parameters generating the corpus are given by

Pr(w) = C(w) / T

where C(w) is the number of times the word w occurs in the text. For a 2-gram class model, the sequential maximum likelihood estimates of the parameters maximizing Pr(t_2^T | t_1) are given by

Pr(c1 c2) = C(c1 c2) / (T − 1)   and   Pr(w | c) = C(w) / C(c)

where C(c1 c2) is the number of positions at which a word whose class is c1 is followed by a word whose class is c2. Letting L(π) = (T − 1)^{-1} log Pr(t_2^T | t_1), these estimates give

L(π) = I − H

in which H represents the entropy of the 1-gram word distribution, and I represents the average mutual information of adjacent classes, as in Section 1.2.
2.1.2 Sticky Pairs and Semantic Classes
One of the aims of word clustering is to group words together based on the statistical similarity of their surroundings. In addition to adjacency, the wider context of words can also be used. For two words w_a and w_b that appear as adjacent words, the mutual information of the pair is:

log [ Pr(w_a w_b) / ( Pr(w_a) Pr(w_b) ) ]

Pairs for which this value is much greater than expected under independence are called sticky pairs. To find semantically related words, the second word w_b is chosen from a window of 1,001 words centered on w_a but excluding the words in a window of 5 centered on w_a. If Pr_near(w_a w_b) is much larger than Pr(w_a)Pr(w_b), then w_a and w_b tend to appear near each other and belong to the same semantic class. This procedure produces interesting classes such as the following (a small sketch of the adjacency test follows the examples):

we our us ourselves ours
question questions asking answer answers answering
performance performed perform performs performing
tie jacket suit
write writes writing written wrote pen
morning noon evening night nights midnight bed
attorney counsel trial court judge
problems problem solution solve analyzed solved solving
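A small Python sketch (invented example text, not thesis code) of the adjacent-pair test: score each bigram by its mutual information and keep the pairs that score far above what independence would predict. The semantic-class variant replaces the bigram count with the near-co-occurrence count Pr_near over the 1,001-word window described above.

from collections import Counter
from math import log2

def sticky_pairs(corpus, threshold=1.0, min_count=1):
    T = len(corpus)
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    pairs = []
    for (wa, wb), n in bigrams.items():
        if n < min_count:
            continue
        mi = log2((n / (T - 1)) / ((unigrams[wa] / T) * (unigrams[wb] / T)))
        if mi > threshold:               # much larger than expected under independence
            pairs.append((wa, wb, mi))
    return sorted(pairs, key=lambda p: -p[2])

text = "new york stock prices rose while new york bond prices fell".split()
print(sticky_pairs(text))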
2.2 Word Similarity
2.2.1 Approach
As in Section 1.1, the meaning of an unknown word can be inferred from the contexts in which it occurs. Consider the following sentences:
A bottle of vodka is on the table
Everyone likes vodka
Vodka makes you drunk
We make vodka out of corn
The similarity between two objects is defined as the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects (Lin, 1997) [11]. Considering two words w_a and w_b, the similarity of the two words is:

sim(w_a, w_b) = log P(common(w_a, w_b)) / log P(describe(w_a, w_b))
To compute the similarity of two words in context, Dekang Lin used dependency triples (w, r, w') which contain two words and the grammatical relationship between them in the input sentence. Here, w is the word being considered; r is the grammatical relationship between w and w'; and w' is a context word of w. The notation ||w, r, w'|| is the frequency count of the dependency triple (w, r, w') in the parsed corpus. When w, r, or w' is the wild card (*), the frequency counts of all the dependency triples that match the rest of the pattern are summed up. For example, ||uống, obj, *|| (||drink, obj, *||) is the frequency count of "uống-obj" (drink-obj) relationships in the parsed corpus.
The description of a word w consists of the frequency counts of all the dependency triples that match the pattern (w, *, *). Let I(w, r, w') denote the amount of information contained in ||w, r, w'||; this value is computed as follows:

I(w, r, w') = log [ ||w, r, w'|| × ||*, r, *|| / ( ||w, r, *|| × ||*, r, w'|| ) ]

Let T(w) be the set of pairs (r, w') such that I(w, r, w') is positive. The similarity sim(w_a, w_b) of two words w_a and w_b is then computed as follows:

sim(w_a, w_b) = Σ_{(r,w) ∈ T(w_a) ∩ T(w_b)} [ I(w_a, r, w) + I(w_b, r, w) ] / [ Σ_{(r,w) ∈ T(w_a)} I(w_a, r, w) + Σ_{(r,w) ∈ T(w_b)} I(w_b, r, w) ]
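To make the two formulas concrete, here is a minimal Python sketch (invented toy triples; Lin's actual system works over millions of parsed triples) that computes I(w, r, w') and the similarity of two words:

from collections import Counter
from math import log2

def lin_similarity(triples, w1, w2):
    """triples: a list of (w, r, w') dependency triples extracted from a parsed corpus."""
    count = Counter(triples)
    w_r = Counter((w, r) for w, r, _ in triples)      # ||w, r, *||
    r_wp = Counter((r, wp) for _, r, wp in triples)   # ||*, r, w'||
    r_any = Counter(r for _, r, _ in triples)         # ||*, r, *||

    def info(w, r, wp):
        # I(w, r, w') = log( ||w,r,w'|| x ||*,r,*|| / (||w,r,*|| x ||*,r,w'||) )
        return log2(count[(w, r, wp)] * r_any[r] / (w_r[(w, r)] * r_wp[(r, wp)]))

    def T(w):
        # T(w): the (r, w') pairs carrying positive information about w
        return {(r, wp) for ww, r, wp in triples if ww == w and info(w, r, wp) > 0}

    t1, t2 = T(w1), T(w2)
    shared = t1 & t2
    num = sum(info(w1, r, wp) + info(w2, r, wp) for r, wp in shared)
    den = sum(info(w1, r, wp) for r, wp in t1) + sum(info(w2, r, wp) for r, wp in t2)
    return num / den if den else 0.0

triples = (3 * [("uống", "obj", "bia")] + 2 * [("uống", "obj", "rượu")]
           + [("thích", "obj", "bia"), ("thích", "obj", "rượu"),
              ("thích", "obj", "phim"), ("mua", "obj", "phim"), ("mua", "obj", "nhà")])
print(lin_similarity(triples, "uống", "thích"))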
b) Object and Object-of (Obj and Obj-of): in this relationship, the central word is a verb and the context word is a noun (the object). The signs on the tree are: parent node VP, central node V, context node NP.
c) Complement and Complement-of (Mod and Mod-of): in this relationship, the central word is a noun and the context word is a modifier of the noun (the modifier may be an N, A or V). The signs on the tree are: parent node NP, central node N, context node N, A or V.
d) Prepositional object (Proj and Proj-of): in this relationship, the central word is a preposition (E) and the context word is a noun. The signs are: parent node PP, central node E, context node NP.
2.2.3 Results
Dekang Lin extracted about 56.5 million dependency triples from a parsed 64-million-word corpus containing the Wall Street Journal (24 million words), San Jose Mercury (21 million words) and AP Newswire (19 million words). The experiment computed the pairwise similarity between all the nouns, all the verbs and all the adjectives/adverbs. For each word, Dekang Lin created a thesaurus entry which contains the top-N words most similar to it, in the form:

w (pos): w1, s1, w2, s2, ..., wN, sN

For example, the noun, verb and adjective entries for the word "brief" are as follows:
brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04, ...
brief (verb): tell 0.09, urge 0.07, ask 0.07, meet 0.06, appoint 0.06, elect 0.05, name 0.05, empower 0.05, summon 0.05, overrule 0.04, ...
brief (adjective): lengthy 0.13, short 0.12, recent 0.09, prolonged 0.09, long 0.09, extended 0.09, daylong 0.08, scheduled 0.08, stormy 0.07, planned 0.06, ...
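As a small illustration of the entry layout above (the exact textual format is assumed here, not taken from Lin's distribution), a few lines of Python suffice to turn an entry back into a ranked list of (word, similarity) pairs:

import re

def parse_entry(line):
    head, body = line.split(":", 1)
    word, pos = re.match(r"\s*(\S+)\s*\((\w+)\)", head).groups()
    neighbors = []
    for item in body.split(","):
        item = item.strip()
        if not item:
            continue
        name, score = item.rsplit(" ", 1)
        neighbors.append((name.strip(), float(score)))
    return word, pos, neighbors

print(parse_entry("brief (adjective): lengthy 0.13, short 0.12, recent 0.09"))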
2.3 Clustering By Committee
Clustering By Committee (CBC) addresses the general goal of clustering: to group data elements such that intra-group similarities are high and inter-group similarities are low. CBC is presented in two versions of the algorithm: a hard clustering version, in which each element is assigned to only one cluster, and a soft clustering version, where elements can be assigned to multiple clusters.
2.3.1 Motivation
Clustering By Committee was motivated by the desire to automatically extract concepts and word senses from large unlabeled collections of text. In previous research, word senses were usually defined using a manually constructed lexicon such as WordNet. Such lexicons have several problems: manually created lexicons often contain rare senses and miss many domain-specific senses. One way to solve these problems is to automatically discover word senses from the corpus itself (Lin and Pantel 2001) [9].
Many clustering algorithms represent a cluster by the centroid of all its members (e.g., K-means) or by representative elements. For example, when clustering words, one can use the contexts of the words as features and group similar words together. In CBC, the centroid of a cluster is constructed by averaging the feature vectors of a subset of the cluster members, and this subset is viewed as a committee that determines which other elements belong to the cluster.
2.3.2 Algorithm
The CBC algorithm consists of three phases. In phase I, the algorithm computes each element's top-k similar elements, for some small value of k. In phase II, the algorithm constructs a collection of tight clusters using the top-k similar elements from phase I, where the elements of each cluster form a committee. In the final phase, each element is assigned to its most similar clusters.
2.3.2.1 Phase I: Find top-similar elements
To compute the top similar elements of an element e, the algorithm first sorts the features according to their pointwise mutual information values and then considers only a subset of the features with the highest mutual information. Finally, phase I computes the pairwise similarity between e and the elements that share a feature from this subset.
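A compact sketch of phase I under assumed data structures (each element is represented as a dict mapping features to pointwise mutual information values; the feature cut-off and k are arbitrary illustrative choices, and the toy Vietnamese vectors are invented):

def cosine(a, b):
    shared = a.keys() & b.keys()
    num = sum(a[f] * b[f] for f in shared)
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def top_similar(element, vectors, k=10, n_features=20):
    """vectors: dict mapping each element to its {feature: PMI} vector."""
    feats = vectors[element]
    # keep only the subset of features with the highest mutual information
    strong = set(sorted(feats, key=feats.get, reverse=True)[:n_features])
    # compare only against elements sharing at least one of those features
    candidates = [e for e, v in vectors.items() if e != element and strong & v.keys()]
    scored = [(e, cosine(feats, vectors[e])) for e in candidates]
    return sorted(scored, key=lambda x: -x[1])[:k]

vectors = {
    "bia":  {"obj-of:uống": 2.1, "mod:lạnh": 1.3},
    "rượu": {"obj-of:uống": 1.8, "mod:mạnh": 1.1},
    "xe":   {"obj-of:lái": 2.4},
}
print(top_similar("bia", vectors, k=2))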
2.3.2.2 Phase II: Find committee
The second phase of the clustering algorithm recursively finds tight clusters scattered in the similarity space. In each recursive step, the algorithm finds a set of tight clusters, called committees. A committee covers an element if the element's similarity to the centroid of the committee exceeds some high similarity threshold. Phase II proceeds as follows (a sketch of the committee filter follows the list):
• Input: a list of elements E to be clustered, a similarity database S from phase I, and thresholds θ1 and θ2.
• Step 1: for each element e ∈ E, cluster the top similar elements of e from S using average-link clustering; for each discovered cluster c, compute the score |c| × avgsim(c), where |c| is the number of elements in c and avgsim(c) is the average pairwise similarity between the elements in c. Store the highest-scoring cluster in a list L.
• Step 2: sort the clusters in L in descending order of their scores.
• Step 3: let C be a list of committees, initially empty. For each cluster c ∈ L in sorted order, compute the centroid of c by averaging the frequency vectors of its elements and computing the mutual information vector of the centroid in the same way. If c's similarity to the centroid of every committee previously added to C is below the threshold θ1, add c to C.
• Step 4: if C is empty, the algorithm is done and returns C.
• Step 5: for each element e ∈ E, if e's similarity to every committee in C is below the threshold θ2, add e to a list of residues R.
• Step 6: if R is empty, the algorithm is done and returns C. Otherwise, return the union of C and the output of a recursive call to phase II with the same input, except that E is replaced by R.
• Output: a list of committees.
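A short sketch of the committee filter in Steps 1-3, under the assumptions that elements are {feature: value} dicts, cosine() is the similarity measure, and θ1 plays the role described above; this is an illustration of the selection logic, not the CBC implementation:

def cosine(a, b):
    shared = a.keys() & b.keys()
    num = sum(a[f] * b[f] for f in shared)
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def avgsim(cluster, vectors):
    """Average pairwise similarity between the elements of a cluster."""
    pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs) if pairs else 0.0

def centroid(cluster, vectors):
    """Average the feature vectors of the cluster's elements."""
    avg = {}
    for e in cluster:
        for f, v in vectors[e].items():
            avg[f] = avg.get(f, 0.0) + v / len(cluster)
    return avg

def select_committees(clusters, vectors, theta1=0.35):
    # Steps 2-3: rank candidate clusters by |c| * avgsim(c), then keep a cluster only if its
    # centroid is not too similar to the centroid of any committee already accepted.
    ranked = sorted(clusters, key=lambda c: -(len(c) * avgsim(c, vectors)))
    committees = []
    for c in ranked:
        cent = centroid(c, vectors)
        if all(cosine(cent, centroid(k, vectors)) < theta1 for k in committees):
            committees.append(c)
    return committees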
2.3.2.3 Phase III: Assign element to clusters
Phase III has two versions: a hard clustering version for document clustering and a soft clustering version for discovering word senses. In the first version, every element is assigned to the cluster containing the committee to which it is most similar. This version resembles K-means in that every element is assigned to its closest centroid; unlike K-means, however, the number of clusters is not fixed and the centroids do not change. In the second version, each element e is assigned to its most similar clusters in the following way:
let C be a list of clusters, initially empty
let S be the top-200 clusters most similar to e
while S is not empty {
    let c ∈ S be the cluster most similar to e
    if similarity(e, c) < σ
        exit the loop
    if c is not similar to any cluster in C {
        assign e to c
        remove from e its features that overlap with the features of c
        add c to C
    }
    remove c from S
}
When an element e is assigned to a cluster c, the intersecting features between e and c are removed from e. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses.
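The loop above can be written directly in Python; the sketch below assumes elements and committee centroids are {feature: weight} dicts, and the threshold values for σ and for the "not similar to any cluster in C" test are illustrative choices rather than the ones used by CBC:

def cosine(a, b):
    shared = a.keys() & b.keys()
    num = sum(a[f] * b[f] for f in shared)
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def soft_assign(e_vec, committees, sigma=0.18, overlap=0.25):
    """committees: list of (cluster_id, centroid_vector) pairs."""
    e = dict(e_vec)                                   # working copy; features get removed below
    S = sorted(committees, key=lambda kc: -cosine(e, kc[1]))[:200]   # top-200 similar clusters
    C = []                                            # clusters already assigned to e
    while S:
        cid, cent = max(S, key=lambda kc: cosine(e, kc[1]))          # most similar remaining cluster
        if cosine(e, cent) < sigma:
            break
        if all(cosine(cent, prev) < overlap for _, prev in C):       # c not similar to any cluster in C
            C.append((cid, cent))
            for f in set(e) & set(cent):              # strip overlapping features so rarer senses surface
                del e[f]
        S.remove((cid, cent))
    return [cid for cid, _ in C]

committees = [("beverage", {"obj-of:uống": 1.0, "mod:lạnh": 0.5}),
              ("measure",  {"mod:lớn": 0.8, "obj-of:đo": 0.7})]
print(soft_assign({"obj-of:uống": 0.9, "mod:lạnh": 0.4}, committees))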
2.3.3 Results
• CBC discovers word senses by assigning words to more than one cluster. Each cluster to which a word is assigned represents a sense of that word. Using the soft clustering version of phase III allows CBC to assign words to multiple clusters, to discover the less frequent senses of a word, and to avoid discovering duplicate senses. CBC applied the precision/recall evaluation methodology and used a test set consisting of 13,403 words, S13403.
Table 1 Results of CBC with discovering word senses
Algorithm Precision (%) Recall (%) F-measure (%)
For document clustering on the 20-News corpus, CBC spends the vast majority of the time finding the top similar documents (38 minutes) and computing the similarity between documents and committee centroids (119 minutes). The rest of the computation, which includes clustering the top-20 similar documents for each of the 18,828 documents and sorting the clusters, took less than 5 minutes. Table 2 shows the results of various clustering algorithms on document clustering.
Table 2 Results of CBC with document clustering
Algorithms Reuters 20-News
Chapter III
Our approach
In this chapter, I describe our approach, including word clustering for Vietnamese, evaluating the influence of word clustering in Vietnamese, analyzing the classes of antonym relationships within clusters, and using the word clusters as features in the Vietnamese functional labeling task.
3.1 Word clustering in Vietnamese
To cluster words in Vietnamese, I apply two methods of word clustering on the same corpus, as mentioned in Chapter 1: word clustering by Brown's algorithm and word similarity by Dekang Lin.
3.1.1 Brown’s algorithm
In order to extract word clusters for Vietnamese, I used Brown's algorithm, one of the most well-known and effective clustering algorithms in language modeling, to cluster the words of a Vietnamese corpus. The Brown algorithm uses the mutual information between cluster pairs in a bottom-up approach to maximize the average mutual information between adjacent clusters. The algorithm's output is a hierarchical clustering of words, in which words are hierarchically clustered by bit strings, as in Figure 2.
A word cluster contains a main word and subordinate words; the subordinate words of a cluster share the same bit string and each has a corresponding frequency. The number of subordinate words differs from cluster to cluster. To deal with this, I randomly take away some subordinate words from the larger clusters. The subordinate words I dismiss usually have low frequency and often have no semantic relation to the main word.
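For reference, a small Python sketch of how such clusters can be loaded and trimmed. The three-column line format (bit string, word, frequency) is an assumption based on common Brown-clustering tools, not necessarily the exact file layout used here, and low-frequency subordinate words are dropped first rather than sampled at random:

from collections import defaultdict

def load_clusters(path):
    """Read tab-separated lines of the assumed form: bit string, word, frequency."""
    clusters = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            bits, word, freq = line.rstrip("\n").split("\t")
            clusters[bits].append((word, int(freq)))
    return clusters

def trim_clusters(clusters, max_words=30):
    """Keep at most max_words per cluster, dropping the lowest-frequency words first."""
    return {bits: sorted(members, key=lambda wf: -wf[1])[:max_words]
            for bits, members in clusters.items()}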