VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN KIM ANH
VIETNAMESE WORD CLUSTERING AND ANTONYM IDENTIFICATION
MASTER THESIS OF INFORMATION TECHNOLOGY
Hanoi - 2013
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN KIM ANH
VIETNAMESE WORD CLUSTERING AND ANTONYM IDENTIFICATION
Major: Computer science
Code: 60 48 01
MASTER THESIS OF INFORMATION TECHNOLOGY
SUPERVISOR: PhD Nguyen Phuong Thai
Hanoi - 2013
Table of Contents
Acknowledgements 4
Abstract 5
Chapter I - Introduction 10
1.1 Word Similarity 11
1.2 Hierarchical Clustering of Words 11
1.3 Function tags 12
1.4 Objectives of the Thesis 13
1.5 Our Contributions 13
1.6 Thesis structure 14
Chapter II - Related Works 15
2.1 Word Clustering 15
2.1.1 The Brown algorithm 15
2.1.2 Sticky Pairs and Semantic Classes 17
2.2 Word Similarity 18
2.2.1 Approach 18
2.2.2 Grammar Relationships 19
2.2.3 Results 20
2.3 Clustering By Committee 20
2.3.1 Motivation 21
2.3.2 Algorithm 21
2.3.3 Results 23
Chapter III - Our approach 25
3.1 Word clustering in Vietnamese 25
3.1.1 Brown's algorithm 25
3.1.2 Word similarity 26
3.2 Evaluating Methodology 28
3.3 Antonym classes 31
3.3.1 Ancillary antonym 31
3.3.2 Coordinated antonym 32
3.3.3 Minor classes 33
3.4 Vietnamese functional labeling 34
Chapter IV - Experiment 37
4.1 Results and Comparison 37
4.2 Antonym frames 40
4.3 Effectiveness of Word Cluster feature in Vietnamese Functional labeling 42
4.4 Error analyses 43
4.5 Summarization 44
Chapter V - Conclusion and Future works 45
5.1 Conclusion 45
5.2 Future works 45
Bibliography 46
List of Figures
Figure 1 An example of Brown's cluster algorithm 16
Figure 2 An example of Vietnamese word cluster 26
Figure 3 The syntax tree of a sentence 26
Figure 4 An example of Vietnamese word similarity 28
Figure 5 Select word clusters by dictionary 30
Figure 6 An example of sentence parses 35
Figure 7 The correctness of k clusters 38
List of Tables
Table 1 Results of CBC with discovering word senses 24
Table 2 Results of CBC with document clustering 24
Table 3 Ancillary antonym frames 32
Table 4 Coordinated antonym frames 33
Table 5 Transitional antonym frames 34
Table 6 An unlabeled corpus in Vietnamese 37
Table 7 The result of five initial clusters 39
Table 8 The comparison between Word clustering and Word similarity 40
Table 9 The result of antonym frames 41
Table 10 The relation of w1 and w2 pairs 42
Table 11 The effectiveness of word cluster feature 43
Chapter I
Introduction
In recent years, statistical learning methods have been highly successful in natural language processing tasks. Most machine learning algorithms used in natural language processing are supervised and require labeled data. These labeled data are often created by hand or in other ways that are time consuming and expensive. However, while labeled data is difficult to create by hand, unlabeled data is basically free on the Internet in the form of raw text. This raw text can easily be preprocessed to make it suitable for an unsupervised or semi-supervised learning algorithm. Previous works have shown that using unlabeled data in addition to traditional labeled data can improve performance (Miller et al., 2004; Abney, 2004; Collins and Singer, 1999) [19][2][7].
In this thesis, I focus on word clustering algorithms for unlabeled data, mainly applying two methods: word clustering by Brown's algorithm [22] and word similarity by Dekang Lin [10]. Both methods are used to cluster the words of a corpus. While Brown's method clusters a word based on its relationships with the words standing before and after it, Dekang Lin's method uses the relationships among those three words. To compare the advantages and disadvantages of these two methods, I experimented with them on the same corpus, using the same evaluation method and the same main words in clusters. The result of word clustering contained different clusters, and each cluster included words appearing in the same contexts. This result was used as features for an application, Vietnamese functional labeling; for example, word clusters were used to solve the data sparseness problem of the head word feature. I also evaluated the influence of word clusters when using them as features in this application. Besides, I used a statistical method to extract 20 antonym frames which can be used to identify antonym classes in clusters.
In this chapter, I describe word similarity, hierarchical clustering of words, and their applications in natural language processing. Besides, I introduce function tags, the word segmentation task, the objectives of the thesis, and our contributions. Finally, I describe the structure of the thesis.
1.1 Word Similarity
The semantics of an unknown word can be inferred from its context. Consider the following examples:
A bottle of Beer is on the table
Everyone likes Beer
Beer makes you drunk
The contexts in which the word Beer is used suggest that Beer could be a kind of alcoholic beverage. This means that other alcoholic beverages may occur in the same contexts as Beer, so they may be related. Consequently, two words are similar if they appear in similar contexts or if they are exchangeable to some extent. For example, "Tổng_thống" (President) and "Chủ_tịch" (Chairman) are similar according to this definition. In contrast, the two words "Kéo" (Scissors) and "Cắt" (Knife) are not similar under this definition, although they are semantically related. Intuitively, if I can generate a good clustering, the words in each cluster should be similar.
1.2 Hierarchical Clustering of Words
In recent years, some algorithms have been proposed to automatically cluster words based on a large unlabeled corpus, such as (Brown et al., 1992; Lin, 1998) [22][10]. I consider a corpus of T words, a vocabulary of V words, and a partition π of the vocabulary. The likelihood L(π) of a bigram class model generating the corpus is given by:

L(π) = I − H

Here, H is the entropy of the 1-gram word distribution, and I is the average mutual information of adjacent classes in the corpus:

I = Σ_{c1,c2} Pr(c1 c2) log [ Pr(c1 c2) / ( Pr(c1) Pr(c2) ) ]

where Pr(c1 c2) is the probability that a word in c1 is followed by a word in c2. Since H and π are independent, the partition that maximizes the average mutual information also maximizes the likelihood L(π) of the corpus. Thus, I can use the average mutual information to construct the word clusters by repeating the merging step until the number of clusters is reduced to the predefined number C.
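As a concrete illustration, the following minimal Python sketch (not the tool used in this thesis; the toy corpus and partition are invented for illustration) computes H, I and L(π) for a given partition:

from collections import Counter
from math import log2

def likelihood(corpus, word2class):
    """Compute L(pi) = I - H for a corpus (a list of tokens) and a partition."""
    T = len(corpus)
    word_counts = Counter(corpus)
    class_counts = Counter(word2class[w] for w in corpus)
    adjacent = Counter((word2class[a], word2class[b]) for a, b in zip(corpus, corpus[1:]))
    # H: entropy of the 1-gram word distribution
    H = -sum((n / T) * log2(n / T) for n in word_counts.values())
    # I: average mutual information of adjacent classes
    I = 0.0
    for (c1, c2), n in adjacent.items():
        p12 = n / (T - 1)                                  # Pr(c1 c2)
        p1, p2 = class_counts[c1] / T, class_counts[c2] / T
        I += p12 * log2(p12 / (p1 * p2))
    return I - H

corpus = "the dog barks the cat meows the dog sleeps".split()
partition = {"the": 0, "dog": 1, "cat": 1, "barks": 2, "meows": 2, "sleeps": 2}
print(likelihood(corpus, partition))

Because merging two classes changes only I and never H, the clustering procedure only needs to track the average mutual information.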
1.3 Function tags
Functional tag labeling is an important processing step for many natural language processing applications such as question answering, information extraction, and summarization. Thus, some research has focused on the function tagging problem in order to provide additional semantic information that is more useful than syntactic labels alone.
There are two kinds of tags in linguistics: syntactic tags and functional tags. For syntactic tags there are many theories and projects in English, Spanish, Chinese, etc. The main task of this research is to find the part-of-speech and syntactic categories and tag the constituents with them. Functional tags can be understood as more abstract labels because they are not like syntactic labels. While a syntactic label has one notation for a group of words across all contexts, a functional tag represents the relationship between a phrase and its utterance in each different context. So the functional tag of a phrase may change; it depends on the context of its neighbors. For example, consider the phrase "baseball bat": the syntactic category of this phrase is a noun phrase (in most research annotated as NP). But its functional tag might be a subject, as in this sentence:
This baseball bat is very expensive
In another case, its functional tag might be a direct object:
I bought this baseball bat last month
Or an instrument in a passive-voice sentence:
That man was attacked with this baseball bat
Functional tags were directly addressed by Blaheta (2003) [13]. Since then, there has been a lot of research focusing on how to assign functional tags to a sentence. This kind of research problem is called the functional tag labeling problem, a class of problems aiming at finding the semantic information of phrases. To sum up, functional tag labeling is defined as the problem of finding the semantic information of a group of words and then tagging them with a given annotation in their context.
1.4 Objectives of the Thesis
Most successful machine learning algorithms are supervised algorithms, and they usually use labeled data. These labeled data are often created by hand, which is time consuming and expensive. Unlabeled data, in contrast, is free: it can be obtained from newspapers, websites, etc., and it exists as raw text on the Internet.
In this thesis, I investigate methods for clustering words from unlabeled data, which can easily be extracted from online sources. Among automatic clustering methods, I focus on two: hierarchical word clustering by Brown's algorithm and word similarity by Dekang Lin. Besides, I also propose a common evaluation method for both approaches when they are applied to the same Vietnamese corpus.
The output of the word clustering was used as features in natural language processing tasks such as Vietnamese functional labeling. I also evaluated the influence of word clusters when they were used as features in this task.
1.5 Our Contributions
As discussed above, the main aim of this thesis is to cluster unlabeled Vietnamese words. The contributions of this thesis are as follows:
• Firstly, I performed automatic word clustering for unlabeled Vietnamese data with a corpus of about 700,000 sentences.
• Secondly, I proposed a method for evaluating the quality of the clusters, based on a thesaurus dictionary and 5 criteria.
• Thirdly, I compared two clustering methods for Vietnamese: word clustering by Brown and word similarity by Dekang Lin. I used the resulting clusters as features in the Vietnamese functional labeling task to increase that task's efficiency. Besides, I used a statistical method to extract 20 antonym frames which can be used to identify antonym classes in the clusters.
In conclusion, I have implemented word clustering on about 700,000 Vietnamese sentences with a hierarchical word clustering algorithm, used a Vietnamese thesaurus dictionary and five criteria to evaluate the correctness of the clusters, and used the clusters as features in NLP tasks such as Vietnamese functional labeling. Finally, I extracted 20 antonym frames to identify antonym pairs in antonym classes.
1.6 Thesis structure
In this section, I would like to introduce a brief outline of the thesis, so that you can have an overview of the chapters that follow.
Chapter 2 – Related works
In this chapter I would like to introduce some recent research on word clustering, functional labeling and word segmentation.
Chapter 3 – Our approach
This chapter presents the methods I applied to cluster Vietnamese words, how I evaluate the quality of the clusters after the word clustering process, how those clusters are used as features in the Vietnamese functional labeling task, and how antonym frames are extracted from the corpus.
Chapter 4 – Experiment
In this chapter, I discuss the corpus I used for clustering and the tools applied in this thesis. Besides, I point out and analyze some errors in erroneous clusters from the word clustering process. Finally, I evaluate the influence of the clusters when applying them to the Vietnamese functional labeling task.
Chapter 5 – Conclusions and Future works
In the last chapter, I give a general conclusion about the advantages and limitations of our work. Besides, I propose some work I will do in the future to improve our model.
Finally, the references list the related research that our work refers to.
Chapter II
Related Works
In this chapter, I introduce some recent research on word clustering and word similarity, namely class-based n-gram models by Brown's algorithm [22], word similarity [10], and clustering by committee [23].
2.1 Word Clustering
2.1.1 The Brown algorithm
Word clustering is considered here as a method for estimating the probabilities of low-frequency events. One of the aims of word clustering is to address the problem of predicting a word from the previous words in a sample of text. For this task, the authors used the bottom-up agglomerative word clustering algorithm of (Brown et al., 1992) [22] to derive a hierarchical clustering of words. The input to the algorithm is a corpus of unlabeled data which provides the vocabulary of words to be clustered. The output of the word clustering algorithm is a binary tree, as in Figure 1, in which the leaves of the tree are the words of the vocabulary and each internal node is interpreted as a cluster containing the words in its sub-tree. Initially, each word in the corpus is placed in its own distinct cluster. The algorithm then repeatedly merges the pair of clusters that maximizes the quality of the clustering result, with each word belonging to exactly one cluster, until the number of clusters is reduced to the predefined number, as follows (a Python sketch of this merging loop is given after Figure 1):
Initial mapping: put a single word in each cluster.
Compute the initial AMI of the collection.
repeat
    for each pair of clusters do
        Merge the pair of clusters temporarily.
        Compute the AMI of the collection.
    end for
    Select and merge the pair of clusters with the minimum decrement of AMI.
    Compute the AMI of the new collection.
until the predefined number of clusters is reached
repeat
    Move each term to the cluster for which the resulting partition has the greatest AMI.
until no more increment in AMI
Figure 1 An example of Brown's cluster algorithm
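The following Python sketch mirrors the merging loop above in a direct but inefficient way: it recomputes the AMI from scratch after every tentative merge (Brown et al. [22] describe incremental updates) and omits the final term-swapping refinement. The toy corpus is invented for illustration.

from collections import Counter
from itertools import combinations
from math import log2

def ami(corpus, word2cluster):
    """Average mutual information between the clusters of adjacent words."""
    T = len(corpus)
    cc = Counter(word2cluster[w] for w in corpus)
    bc = Counter((word2cluster[a], word2cluster[b]) for a, b in zip(corpus, corpus[1:]))
    return sum((n / (T - 1)) * log2((n / (T - 1)) / ((cc[c1] / T) * (cc[c2] / T)))
               for (c1, c2), n in bc.items())

def brown_clusters(corpus, num_clusters):
    word2cluster = {w: w for w in set(corpus)}       # initial mapping: one word per cluster
    clusters = set(word2cluster.values())
    while len(clusters) > num_clusters:
        best_pair, best_score = None, float("-inf")
        for a, b in combinations(clusters, 2):       # try every pair of clusters
            trial = {w: (a if c == b else c) for w, c in word2cluster.items()}
            score = ami(corpus, trial)               # AMI after the tentative merge
            if score > best_score:
                best_pair, best_score = (a, b), score
        a, b = best_pair                             # keep the merge that loses the least AMI
        word2cluster = {w: (a if c == b else c) for w, c in word2cluster.items()}
        clusters = set(word2cluster.values())
    return word2cluster

corpus = "the dog barks the cat meows the dog sleeps the cat purrs".split()
print(brown_clusters(corpus, 3))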
To measure the quality of a clustering, the algorithm considers a training text t_1^T = t_1, ..., t_T. For a 1-gram class model, the sequential maximum likelihood estimates of the parameters generating the corpus are given by

Pr(w) = C(w) / T

where C(w) is the number of times the word w occurs in the text. For a 2-gram class model, the sequential maximum likelihood estimates of the parameters maximizing Pr(t_2^T | t_1) are given by

Pr(c1 c2) = C(c1 c2) / (T − 1)   and   Pr(w | c) = C(w) / C(c)

where C(c1 c2) is the number of positions at which a word whose class is c1 is followed by a word whose class is c2. Letting L(π) = (T − 1)^{-1} log Pr(t_2^T | t_1), these estimates give

L(π) = I − H

in which H represents the entropy of the 1-gram word distribution, and I represents the average mutual information of adjacent classes, as in Section 1.2.
2.1.2 Sticky Pairs and Semantic Classes
One of the aims of word clustering is to group words together based on the statistical similarity of their surroundings. In addition to adjacency, the wider context of words can also be used. For two words w_a and w_b that appear as adjacent words, the mutual information of the pair is:

log [ Pr(w_a w_b) / ( Pr(w_a) Pr(w_b) ) ]

Pairs for which this value is much greater than expected under independence are called sticky pairs. To find semantically related words, the second word w_b is chosen from a window of 1,001 words centered on w_a but excluding the words in a window of 5 centered on w_a. If Pr_near(w_a w_b) is much larger than Pr(w_a)Pr(w_b), then w_a and w_b tend to appear near each other and belong to the same semantic class. This procedure produces interesting classes such as the following (a small sketch of the adjacency test follows the examples):

we our us ourselves ours
question questions asking answer answers answering
performance performed perform performs performing
tie jacket suit
write writes writing written wrote pen
morning noon evening night nights midnight bed
attorney counsel trial court judge
problems problem solution solve analyzed solved solving
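A small Python sketch (invented example text, not thesis code) of the adjacent-pair test: score each bigram by its mutual information and keep the pairs that score far above what independence would predict. The semantic-class variant replaces the bigram count with the near-co-occurrence count Pr_near over the 1,001-word window described above.

from collections import Counter
from math import log2

def sticky_pairs(corpus, threshold=1.0, min_count=1):
    T = len(corpus)
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    pairs = []
    for (wa, wb), n in bigrams.items():
        if n < min_count:
            continue
        mi = log2((n / (T - 1)) / ((unigrams[wa] / T) * (unigrams[wb] / T)))
        if mi > threshold:               # much larger than expected under independence
            pairs.append((wa, wb, mi))
    return sorted(pairs, key=lambda p: -p[2])

text = "new york stock prices rose while new york bond prices fell".split()
print(sticky_pairs(text))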
2.2 Word Similarity
2.2.1 Approach
As in Section 1.1, the meaning of an unknown word can be inferred from the contexts in which it occurs. Consider the following sentences:
A bottle of vodka is on the table
Everyone likes vodka
Vodka makes you drunk
We make vodka out of corn
The similarity between two objects is defined as the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects (Lin, 1997) [11]. Considering two words w_a and w_b, the similarity of the two words is:

sim(w_a, w_b) = log P(common(w_a, w_b)) / log P(describe(w_a, w_b))
To compute the similarity of two words in context, Dekang Lin used dependency triples (w, r, w') which contain two words and the grammatical relationship between them in the input sentence. Here, w is the word being considered; r is the grammatical relationship between w and w'; and w' is a context word of w. The notation ||w, r, w'|| is the frequency count of the dependency triple (w, r, w') in the parsed corpus. When w, r, or w' is the wild card (*), the frequency counts of all the dependency triples that match the rest of the pattern are summed up. For example, ||uống, obj, *|| (||drink, obj, *||) is the frequency count of "uống-obj" (drink-obj) relationships in the parsed corpus.
The description of a word w consists of the frequency counts of all the dependency triples that match the pattern (w, *, *). Let I(w, r, w') denote the amount of information contained in ||w, r, w'||; this value is computed as follows:

I(w, r, w') = log [ ||w, r, w'|| × ||*, r, *|| / ( ||w, r, *|| × ||*, r, w'|| ) ]

Let T(w) be the set of pairs (r, w') such that I(w, r, w') is positive. The similarity sim(w_a, w_b) of two words w_a and w_b is then computed as follows:

sim(w_a, w_b) = Σ_{(r,w) ∈ T(w_a) ∩ T(w_b)} [ I(w_a, r, w) + I(w_b, r, w) ] / [ Σ_{(r,w) ∈ T(w_a)} I(w_a, r, w) + Σ_{(r,w) ∈ T(w_b)} I(w_b, r, w) ]
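To make the two formulas concrete, here is a minimal Python sketch (invented toy triples; Lin's actual system works over millions of parsed triples) that computes I(w, r, w') and the similarity of two words:

from collections import Counter
from math import log2

def lin_similarity(triples, w1, w2):
    """triples: a list of (w, r, w') dependency triples extracted from a parsed corpus."""
    count = Counter(triples)
    w_r = Counter((w, r) for w, r, _ in triples)      # ||w, r, *||
    r_wp = Counter((r, wp) for _, r, wp in triples)   # ||*, r, w'||
    r_any = Counter(r for _, r, _ in triples)         # ||*, r, *||

    def info(w, r, wp):
        # I(w, r, w') = log( ||w,r,w'|| x ||*,r,*|| / (||w,r,*|| x ||*,r,w'||) )
        return log2(count[(w, r, wp)] * r_any[r] / (w_r[(w, r)] * r_wp[(r, wp)]))

    def T(w):
        # T(w): the (r, w') pairs carrying positive information about w
        return {(r, wp) for ww, r, wp in triples if ww == w and info(w, r, wp) > 0}

    t1, t2 = T(w1), T(w2)
    shared = t1 & t2
    num = sum(info(w1, r, wp) + info(w2, r, wp) for r, wp in shared)
    den = sum(info(w1, r, wp) for r, wp in t1) + sum(info(w2, r, wp) for r, wp in t2)
    return num / den if den else 0.0

triples = (3 * [("uống", "obj", "bia")] + 2 * [("uống", "obj", "rượu")]
           + [("thích", "obj", "bia"), ("thích", "obj", "rượu"),
              ("thích", "obj", "phim"), ("mua", "obj", "phim"), ("mua", "obj", "nhà")])
print(lin_similarity(triples, "uống", "thích"))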
b) Object and Object-of (Obj and Obj-of): in this relationship, the central word is a verb and the context word is a noun (the object). The signs on the tree are: parent node VP, central node V, context node NP.
c) Complement and Complement-of (Mod and Mod-of): in this relationship, the central word is a noun and the context word is a modifier of the noun (the modifier may be an N, A or V). The signs on the tree are: parent node NP, central node N, context node N, A or V.
d) Prepositional object (Proj and Proj-of): in this relationship, the central word is a preposition (E) and the context word is a noun. The signs are: parent node PP, central node E, context node NP.
2.2.3 Results
Dekang Lin extracted about 56.5 million dependency triples from a parsed 64-million-word corpus containing the Wall Street Journal (24 million words), San Jose Mercury (21 million words) and AP Newswire (19 million words). The experiment computed the pairwise similarity between all the nouns, all the verbs and all the adjectives/adverbs. For each word, Dekang Lin created a thesaurus entry which contains the top-N words most similar to it, in the form:

w (pos): w1, s1, w2, s2, ..., wN, sN

For example, the noun, verb and adjective entries for the word "brief" are as follows:
brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04, ...
brief (verb): tell 0.09, urge 0.07, ask 0.07, meet 0.06, appoint 0.06, elect 0.05, name 0.05, empower 0.05, summon 0.05, overrule 0.04, ...
brief (adjective): lengthy 0.13, short 0.12, recent 0.09, prolonged 0.09, long 0.09, extended 0.09, daylong 0.08, scheduled 0.08, stormy 0.07, planned 0.06, ...
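As a small illustration of the entry layout above (the exact textual format is assumed here, not taken from Lin's distribution), a few lines of Python suffice to turn an entry back into a ranked list of (word, similarity) pairs:

import re

def parse_entry(line):
    head, body = line.split(":", 1)
    word, pos = re.match(r"\s*(\S+)\s*\((\w+)\)", head).groups()
    neighbors = []
    for item in body.split(","):
        item = item.strip()
        if not item:
            continue
        name, score = item.rsplit(" ", 1)
        neighbors.append((name.strip(), float(score)))
    return word, pos, neighbors

print(parse_entry("brief (adjective): lengthy 0.13, short 0.12, recent 0.09"))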
2.3 Clustering By Committee
Clustering By Committee (CBC) addresses the general goal of clustering: to group data elements such that intra-group similarities are high and inter-group similarities are low. CBC is presented in two versions of the algorithm: a hard clustering version, in which each element is assigned to only one cluster, and a soft clustering version, where elements can be assigned to multiple clusters.
2.3.1 Motivation
Clustering By Committee was motivated by the desire to automatically extract concepts and word senses from large unlabeled collections of text. In previous research, word senses were usually defined using a manually constructed lexicon such as WordNet. Such lexicons have several problems: manually created lexicons often contain rare senses and miss many domain-specific senses. One way to solve these problems is to automatically discover word senses from the corpus itself (Lin and Pantel 2001) [9].
Many clustering algorithms represent a cluster by the centroid of all its members (e.g., K-means) or by representative elements. For example, when clustering words, one can use the contexts of the words as features and group similar words together. In CBC, the centroid of a cluster is constructed by averaging the feature vectors of a subset of the cluster members, and this subset is viewed as a committee that determines which other elements belong to the cluster.
2.3.2 Algorithm
The CBC algorithm consists of three phases. In phase I, the algorithm computes each element's top-k similar elements, for some small value of k. In phase II, the algorithm constructs a collection of tight clusters using the top-k similar elements from phase I, where the elements of each cluster form a committee. In the final phase, each element is assigned to its most similar clusters.
2.3.2.1 Phase I: Find top-similar elements
To compute the top similar elements of an element e, the algorithm first sorts the features according to their pointwise mutual information values and then considers only a subset of the features with the highest mutual information. Finally, phase I computes the pairwise similarity between e and the elements that share a feature from this subset.
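A compact sketch of phase I under assumed data structures (each element is represented as a dict mapping features to pointwise mutual information values; the feature cut-off and k are arbitrary illustrative choices, and the toy Vietnamese vectors are invented):

def cosine(a, b):
    shared = a.keys() & b.keys()
    num = sum(a[f] * b[f] for f in shared)
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def top_similar(element, vectors, k=10, n_features=20):
    """vectors: dict mapping each element to its {feature: PMI} vector."""
    feats = vectors[element]
    # keep only the subset of features with the highest mutual information
    strong = set(sorted(feats, key=feats.get, reverse=True)[:n_features])
    # compare only against elements sharing at least one of those features
    candidates = [e for e, v in vectors.items() if e != element and strong & v.keys()]
    scored = [(e, cosine(feats, vectors[e])) for e in candidates]
    return sorted(scored, key=lambda x: -x[1])[:k]

vectors = {
    "bia":  {"obj-of:uống": 2.1, "mod:lạnh": 1.3},
    "rượu": {"obj-of:uống": 1.8, "mod:mạnh": 1.1},
    "xe":   {"obj-of:lái": 2.4},
}
print(top_similar("bia", vectors, k=2))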
2.3.2.2 Phase II: Find committee
The second phase of the clustering algorithm recursively finds tight clusters scattered in the similarity space. In each recursive step, the algorithm finds a set of tight clusters, called committees. A committee covers an element if the element's similarity to the centroid of the committee exceeds some high similarity threshold. Phase II proceeds as follows (a sketch of the committee filter follows the list):
• Input: a list of elements E to be clustered, a similarity database S from phase I, and thresholds θ1 and θ2.
• Step 1: for each element e ∈ E, cluster the top similar elements of e from S using average-link clustering; for each discovered cluster c, compute the score |c| × avgsim(c), where |c| is the number of elements in c and avgsim(c) is the average pairwise similarity between the elements in c. Store the highest-scoring cluster in a list L.
• Step 2: sort the clusters in L in descending order of their scores.
• Step 3: let C be a list of committees, initially empty. For each cluster c ∈ L in sorted order, compute the centroid of c by averaging the frequency vectors of its elements and computing the mutual information vector of the centroid in the same way. If c's similarity to the centroid of every committee previously added to C is below the threshold θ1, add c to C.
• Step 4: if C is empty, the algorithm is done and returns C.
• Step 5: for each element e ∈ E, if e's similarity to every committee in C is below the threshold θ2, add e to a list of residues R.
• Step 6: if R is empty, the algorithm is done and returns C. Otherwise, return the union of C and the output of a recursive call to phase II with the same input, except that E is replaced by R.
• Output: a list of committees.
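A short sketch of the committee filter in Steps 1-3, under the assumptions that elements are {feature: value} dicts, cosine() is the similarity measure, and θ1 plays the role described above; this is an illustration of the selection logic, not the CBC implementation:

def cosine(a, b):
    shared = a.keys() & b.keys()
    num = sum(a[f] * b[f] for f in shared)
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def avgsim(cluster, vectors):
    """Average pairwise similarity between the elements of a cluster."""
    pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs) if pairs else 0.0

def centroid(cluster, vectors):
    """Average the feature vectors of the cluster's elements."""
    avg = {}
    for e in cluster:
        for f, v in vectors[e].items():
            avg[f] = avg.get(f, 0.0) + v / len(cluster)
    return avg

def select_committees(clusters, vectors, theta1=0.35):
    # Steps 2-3: rank candidate clusters by |c| * avgsim(c), then keep a cluster only if its
    # centroid is not too similar to the centroid of any committee already accepted.
    ranked = sorted(clusters, key=lambda c: -(len(c) * avgsim(c, vectors)))
    committees = []
    for c in ranked:
        cent = centroid(c, vectors)
        if all(cosine(cent, centroid(k, vectors)) < theta1 for k in committees):
            committees.append(c)
    return committees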
2.3.2.3 Phase III: Assign element to clusters
Phase III has two versions: a hard clustering version for document clustering and a soft clustering version for discovering word senses. In the first version, every element is assigned to the cluster containing the committee to which it is most similar. This version resembles K-means in that every element is assigned to its closest centroid; unlike K-means, however, the number of clusters is not fixed and the centroids do not change. In the second version, each element e is assigned to its most similar clusters in the following way:
let C be a list of clusters, initially empty
let S be the top-200 clusters most similar to e
while S is not empty {
    let c ∈ S be the cluster most similar to e
    if similarity(e, c) < σ
        exit the loop
    if c is not similar to any cluster in C {
        assign e to c
        remove from e its features that overlap with the features of c
        add c to C
    }
    remove c from S
}
When an element e is assigned to a cluster c, the intersecting features between e and c are removed from e. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses.
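The loop above can be written directly in Python; the sketch below assumes elements and committee centroids are {feature: weight} dicts, and the threshold values for σ and for the "not similar to any cluster in C" test are illustrative choices rather than the ones used by CBC:

def cosine(a, b):
    shared = a.keys() & b.keys()
    num = sum(a[f] * b[f] for f in shared)
    den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

def soft_assign(e_vec, committees, sigma=0.18, overlap=0.25):
    """committees: list of (cluster_id, centroid_vector) pairs."""
    e = dict(e_vec)                                   # working copy; features get removed below
    S = sorted(committees, key=lambda kc: -cosine(e, kc[1]))[:200]   # top-200 similar clusters
    C = []                                            # clusters already assigned to e
    while S:
        cid, cent = max(S, key=lambda kc: cosine(e, kc[1]))          # most similar remaining cluster
        if cosine(e, cent) < sigma:
            break
        if all(cosine(cent, prev) < overlap for _, prev in C):       # c not similar to any cluster in C
            C.append((cid, cent))
            for f in set(e) & set(cent):              # strip overlapping features so rarer senses surface
                del e[f]
        S.remove((cid, cent))
    return [cid for cid, _ in C]

committees = [("beverage", {"obj-of:uống": 1.0, "mod:lạnh": 0.5}),
              ("measure",  {"mod:lớn": 0.8, "obj-of:đo": 0.7})]
print(soft_assign({"obj-of:uống": 0.9, "mod:lạnh": 0.4}, committees))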
2.3.3 Results
• CBC discovers word senses by assigning words to more than one cluster. Each cluster to which a word is assigned represents a sense of that word. Using the soft clustering version of phase III allows CBC to assign words to multiple clusters, to discover the less frequent senses of a word, and to avoid discovering duplicate senses. CBC applied the precision/recall evaluation methodology and used a test set consisting of 13,403 words, S13403.
Table 1 Results of CBC with discovering word senses
Algorithm Precision (%) Recall (%) F-measure (%)
For document clustering on the 20-News corpus, CBC spends the vast majority of the time finding the top similar documents (38 minutes) and computing the similarity between documents and committee centroids (119 minutes). The rest of the computation, which includes clustering the top-20 similar documents for each of the 18,828 documents and sorting the clusters, took less than 5 minutes. Table 2 shows the results of various clustering algorithms on document clustering.
Table 2 Results of CBC with document clustering
Algorithms Reuters 20-News
Chapter III
Our approach
In this chapter, I describe our approach, including word clustering for Vietnamese, evaluating the influence of word clustering in Vietnamese, analyzing the classes of antonym relationships within clusters, and using the word clusters as features in the Vietnamese functional labeling task.
3.1 Word clustering in Vietnamese
To cluster words in Vietnamese, I apply two methods of word clustering on the same corpus, as mentioned in Chapter 1: word clustering by Brown's algorithm and word similarity by Dekang Lin.
3.1.1 Brown’s algorithm
In order to extract word clusters for Vietnamese, I used Brown's algorithm, one of the most well-known and effective clustering algorithms in language modeling, to cluster the words of a Vietnamese corpus. The Brown algorithm uses the mutual information between cluster pairs in a bottom-up approach to maximize the average mutual information between adjacent clusters. The algorithm's output is a hierarchical clustering of words, in which words are hierarchically clustered by bit strings, as in Figure 2.
A word cluster contains a main word and subordinate words; the subordinate words of a cluster share the same bit string and each has a corresponding frequency. The number of subordinate words differs from cluster to cluster. To deal with this, I randomly take away some subordinate words from the larger clusters. The subordinate words I dismiss usually have low frequency and often have no semantic relation to the main word.
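For reference, a small Python sketch of how such clusters can be loaded and trimmed. The three-column line format (bit string, word, frequency) is an assumption based on common Brown-clustering tools, not necessarily the exact file layout used here, and low-frequency subordinate words are dropped first rather than sampled at random:

from collections import defaultdict

def load_clusters(path):
    """Read tab-separated lines of the assumed form: bit string, word, frequency."""
    clusters = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            bits, word, freq = line.rstrip("\n").split("\t")
            clusters[bits].append((word, int(freq)))
    return clusters

def trim_clusters(clusters, max_words=30):
    """Keep at most max_words per cluster, dropping the lowest-frequency words first."""
    return {bits: sorted(members, key=lambda wf: -wf[1])[:max_words]
            for bits, members in clusters.items()}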