VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY

NGUYEN KIM ANH
VIETNAMESE WORD CLUSTERING
AND ANTONYM IDENTIFICATION
MASTER THESIS OF INFORMATION TECHNOLOGY
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
NGUYEN KIM ANH
VIETNAMESE WORD CLUSTERING
AND ANTONYM IDENTIFICATION
Major: Computer science
Code: 60 48 01
MASTER THESIS OF INFORMATION TECHNOLOGY
SUPERVISOR: Dr. Nguyen Phuong Thai
Hanoi - 2013
Contents

2.1.2 Sticky Pairs and Semantic Classes
Chapter III - Our approach
3.1 Word clustering in Vietnamese
List of Figures

Figure 1. An example of Brown's clustering algorithm
Figure 2. An example of Vietnamese word clusters
Figure 3. The syntax tree of a sentence
Figure 4. An example of Vietnamese word similarity
Figure 5. Selecting word clusters by dictionary
Figure 6. An example of sentence parsing
Figure 7. The accuracy of k clusters
List of Tables

Table 1. Results of CBC with discovering word senses
Table 2. Results of CBC with document clustering
Table 3. Ancillary antonym frames
Table 4. Coordinated antonym frames
Table 5. Transitional antonym frames
Table 6. An unlabeled corpus in Vietnamese
Table 7. The result of five initial clusters
Table 8. The comparison between word clustering and word similarity
Table 9. The result of antonym frames
Table 10. The relation of w1 and w2 pairs
Table 11. The effectiveness of the word cluster feature
Chapter I
Introduction
In recent years, statistical learning methods have been very successful in natural language processing tasks. Most machine learning algorithms used in natural language processing are supervised, and they require labeled data. These labeled data are often made by hand or in some other ways, which can be time consuming and expensive. However, while labeled data is difficult to create by hand, unlabeled data is basically free on the Internet in the form of raw text. This raw text can easily be preprocessed to make it suitable for use in an unsupervised or semi-supervised learning algorithm. Previous works have shown that using unlabeled data alongside traditional labeled data can improve performance (Miller et al., 2004; Abney, 2004; Collins and Singer, 1999) [19][21][7].
In this thesis, I focus on word clustering algorithms for unlabeled data, in which I mainly apply two methods: word clustering by Brown's algorithm [22] and word similarity by Dekang Lin [10]. These two methods are used for clustering the words in a corpus. While Brown's method clusters words based on the relationships between a word and the words standing before and after it, Dekang Lin's method uses the grammatical relationships among those words. To compare the advantages and disadvantages of these two methods, I experimented with them on the same corpus, using the same evaluation method and the same main words in the clusters. The result of word clustering contained different clusters, and each cluster included words that occur in the same contexts. This result was used as features for an application: Vietnamese functional labeling. I also evaluated the influence of word clusters when using them as features in this application; for example, word clusters were used to solve the data sparseness problem of the head word feature. Besides, I used a statistical method to extract 20 antonym frames, which can be used to identify antonym classes in the clusters.
In this chapter, I describe word similarity, hierarchical clustering of words and their applications in natural language processing. Besides, I would like to introduce function tags, word segmentation tasks, the objectives of the thesis and our contributions. Finally, I will describe the structure of the thesis.
1.1 Word Similarity
The meaning of an unknown word can be inferred from its context. Consider the following examples:
A bottle of Beer is on the table.
Everyone likes Beer.
Beer makes you drunk.
The contexts in which the word Beer is used suggest that Beer could be a kind of alcoholic beverage. This means that other alcoholic beverages may occur in the same contexts as Beer, and they may be related. Consequently, two words are similar if they appear in similar contexts or if they are exchangeable to some extent. For example, "Tổng thống" (President) and "Chủ tịch" (Chairman) are similar according to this definition. In contrast, the two words "Kéo" (Scissors) and "Dao" (Knife) are not similar under this definition, although they are semantically related. Intuitively, if I can generate a good clustering, the words in each cluster should be similar.
1.2 Hierarchical Clustering of Words
In recent years, some algorithms have been proposed to automatically cluster words based on a large unlabeled corpus, such as (Brown et al., 1992; Lin, 1998) [22][10]. Consider a corpus of T words, a vocabulary of V words, and a partition π of the vocabulary. The likelihood L(π) of a bigram class model generating the corpus is given by:

L(π) = I - H

Here, H is the entropy of the 1-gram word distribution, and I is the average mutual information of adjacent classes in the corpus:

I = Σ_{c1,c2} Pr(c1, c2) log [ Pr(c1, c2) / (Pr(c1) Pr(c2)) ]
Here, Pr(c1, c2) is the probability that a word in class c1 is followed by a word in class c2. Since H does not depend on the partition π, the partition that maximizes the average mutual information also maximizes the likelihood L(π) of the corpus. Thus, I can use the average mutual information to construct the clusters of words by repeating the merging step until the number of clusters is reduced to the predefined number C.
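To make this concrete, the following is a minimal Python sketch of how the average mutual information I could be estimated from a tokenized corpus and a word-to-class mapping. The function name and data structures are illustrative assumptions, not part of the thesis's implementation.

from collections import Counter
from math import log


def average_mutual_information(tokens, word2class):
    """Estimate I = sum over (c1, c2) of Pr(c1, c2) * log(Pr(c1, c2) / (Pr(c1) * Pr(c2)))
    from the adjacent class pairs observed in a tokenized corpus."""
    classes = [word2class[w] for w in tokens]
    bigrams = Counter(zip(classes, classes[1:]))   # adjacent class pairs
    left = Counter(classes[:-1])                   # class counts in first position
    right = Counter(classes[1:])                   # class counts in second position
    n = len(classes) - 1                           # number of adjacent pairs

    ami = 0.0
    for (c1, c2), count in bigrams.items():
        p12 = count / n
        ami += p12 * log(p12 / ((left[c1] / n) * (right[c2] / n)))
    return ami

With a single class for the whole vocabulary every ratio inside the logarithm is 1, so the sum is zero, while the merging procedure described above tries to keep this value as high as possible at each step.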
1.3 Function tags
Functional tag labeling is an important processing step for many natural language processing applications such as question answering, information extraction, and summarization. Thus, there has been some research focusing on the function tagging problem to recover additional semantic information, which is more useful than syntactic labels alone.
There are two kinds of tags in linguistics: syntactic tags and functional tags. For syntactic tags, there are many theories and projects in English, Spanish, Chinese, etc. The main tasks of these projects are similar: finding the parts of speech and tagging the constituents. Functional tags are understood as abstract labels because they are not like syntactic labels. While a syntactic label has a single notation for a group of words across all contexts, functional tags represent the relationship between a phrase and its utterance in each particular context. So, for each phrase, the functional tag might change; it depends on the context of its neighbors. For example, consider the phrase "baseball bat": the syntax of this phrase is a noun phrase (in most research annotated as NP), but its functional tag might be a subject, as in this sentence:
This baseball bat is very expensive.
In another case, its functional tag might be a direct object:
I bought this baseball bat last month.
Or an instrument in a passive-voice sentence:
That man was attacked by this baseball bat.
Functional tags were directly addressed by Blaheta (2003) [13]. Since then, a lot of research has focused on how to assign functional tags to a sentence. This kind of research problem is called the functional tag labeling problem, a class of problems aiming at finding the semantic information of phrases. To sum up, functional tag labeling is defined as the problem of finding the semantic information of a group of words and then tagging them with a given annotation in context.
1.4 Objectives of the Thesis
Most successful machine learning algorithms are supervised, and they usually use labeled data. These labeled data are often created by hand, which is time consuming and expensive. Unlabeled data, in contrast, is free; it can be obtained from newspapers and websites, and it exists as raw text on the Internet.
In this thesis, I investigate some methods of clustering words from unlabeled data, which is easily extracted from online sources. Among automatic clustering methods, I focus on two: hierarchical word clustering by Brown's algorithm and word similarity by Dekang Lin. Besides, I also suggest a common evaluation tool for both methods when they are applied to the same Vietnamese corpus.
The output of the word clustering was used as features in natural language processing tasks such as Vietnamese functional labeling. I also evaluated the influence of word clusters when they were used as features in this task.
1.5 Our Contributions
As I discussed above, the main aim of this thesis is to cluster unlabeled Vietnamese words. Thus, the contributions of this thesis are as follows:
• Firstly, I performed automatic word clustering on unlabeled Vietnamese data with a corpus of about 700,000 sentences.
• Secondly, I suggested a quality evaluation method for the clusters, using a thesaurus dictionary and five criteria.
• Thirdly, I compared two clustering methods for Vietnamese: word clustering by Brown and word similarity by Dekang Lin. I used the resulting clusters as features in the Vietnamese functional labeling task to increase the task's efficiency. Besides, I used a statistical method to extract 20 antonym frames, which can be used to identify antonym classes in the clusters.
In conclusion, our contribution is that I have implemented word clustering on about 700,000 Vietnamese sentences with a hierarchical word clustering algorithm, used a Vietnamese thesaurus dictionary and five criteria to evaluate the accuracy of the clusters, and used the clusters as features in NLP tasks such as Vietnamese functional labeling. Finally, I extracted 20 antonym frames to identify antonym pairs in antonym classes.
1.6 Thesis structure
In this section, I would like to give a brief outline of the thesis, so that you can have an overview of the following chapters.
Chapter 2 — Related works
In this chapter, I would like to introduce some recent research on word clustering, functional labeling and word segmentation.
Chapter 3— Our approach
This chapter presents the methods I applied to cluster Vietnamese words, how I evaluate the quality of the clusters after the word clustering process, how to use those clusters as features in the Vietnamese functional labeling task, and how to extract antonym frames from the corpus.
Chapter 4— Experiment
In this chapter, I discuss the corpus I used for clustering and some tools applied in this thesis. Besides, I point out and analyze some errors in erroneous clusters from the word clustering process. Finally, I evaluate the influence of the clusters when I apply them to the Vietnamese functional labeling task.
Chapter 5—Conclusions and Future works
In the last chapter, I give a general conclusion about the advantages and limitations of our work. Besides, I propose some work which I will do in the future to improve our model.
Finally, the references list the related research that our system referred to.
Chapter II
Related works

2.1.1 The Brown algorithm
Word clustering is considered here as a method for estimating the probabilities of low-frequency events. One of the aims of word clustering is the problem of predicting a word from the previous words in a sample of text. In this task, the authors used the bottom-up agglomerative word clustering algorithm (Brown et al., 1992) [22] to derive a hierarchical clustering of words. The input to the algorithm is a corpus of unlabeled data which consists of a vocabulary of words to be clustered. The output of the word clustering algorithm is a binary tree, shown in Figure 1, in which the leaves of the tree are the words in the vocabulary and each internal node is interpreted as a cluster containing the words in that sub-tree. Initially, each word in the corpus is considered to be in its own distinct cluster. The algorithm then repeatedly merges the pair of clusters that maximizes the quality of the clustering result, with each word belonging to exactly one cluster, until the number of clusters is reduced to the predefined number, as follows:
Initial mapping: put a single word in each cluster
Compute the initial AMI of the collection
repeat
    for each pair of clusters do
        merge the pair of clusters temporarily
        compute the AMI of the new collection
    keep the merge that results in the highest AMI
until the predefined number of clusters is reached
Figure 1. An example of Brown's clustering algorithm
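The merging loop above can be sketched in Python as follows. This is a naive illustration only: it recomputes the AMI from scratch for every trial merge, whereas Brown et al.'s implementation updates the value incrementally and restricts merging to a window of frequent words. The helper reuses the average_mutual_information function sketched in Section 1.2, and all names here are assumptions.

import itertools


def brown_style_clustering(tokens, num_clusters):
    """Greedy bottom-up merging: start with one cluster per word and repeatedly
    merge the pair of clusters whose merge keeps the AMI of adjacent clusters highest."""
    word2cluster = {w: w for w in set(tokens)}     # initially, each word is its own cluster

    while len(set(word2cluster.values())) > num_clusters:
        best_pair, best_ami = None, float("-inf")
        current = set(word2cluster.values())
        for c1, c2 in itertools.combinations(current, 2):
            trial = {w: (c1 if c == c2 else c) for w, c in word2cluster.items()}
            score = average_mutual_information(tokens, trial)
            if score > best_ami:
                best_pair, best_ami = (c1, c2), score
        keep, absorb = best_pair
        word2cluster = {w: (keep if c == absorb else c) for w, c in word2cluster.items()}
    return word2cluster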
To identify the quality of a clustering, this algorithm considers a training text of T words t_1^T, a vocabulary of V words, and a partition π of the vocabulary. The maximum likelihood estimates of the parameters of a 1-gram class model generating the corpus are given by:
Pr(w | c) = C(w) / C(c)   and   Pr(c) = C(c) / T

For the 2-gram class model, the sequential maximum likelihood estimates of the order-2 parameters, which maximize Pr(t_2^T | t_1), are given by:

Pr(c2 | c1) = C(c1 c2) / Σ_c C(c1 c)

Pr(w | c) = C(w) / C(c)

where C(c) is the number of words in the corpus whose class is c, and C(c1 c2) is the number of times a word in class c1 is immediately followed by a word in class c2. Let L(π) = (T - 1)^(-1) log Pr(t_2^T | t_1); then

L(π) = Σ_{w1,w2} Pr(w1 w2) log [ Pr(c(w2) | c(w1)) Pr(w2 | c(w2)) ]
     = Σ_w Pr(w) log Pr(w) + Σ_{c1,c2} Pr(c1 c2) log [ Pr(c1 c2) / (Pr(c1) Pr(c2)) ]
     = I(c1, c2) - H(w)
in which H represents the entropy of the 1-gram word distribution, and I represents the average mutual information of adjacent classes c1 and c2.
2.1.2 Sticky Pairs and Semantic Classes
One of the aims of word clustering is to group words together based on the statistical similarity of their surroundings. In addition, the contextual information of words can also be used as features to group words together. For example, consider two words w1 and w2; the mutual information of the two words occurring as an adjacent pair is:

log [ Pr(w1 w2) / (Pr(w1) Pr(w2)) ]

Adjacent pairs with high mutual information are called sticky pairs. Semantic classes can be found with a looser notion of co-occurrence.
Let Pr_near(w1 w2) be the probability that a word chosen at random from a window of 1,001 words centered on w1, but excluding the words in a window of 5 centered on w1, is w2. If Pr_near(w1 w2) is much larger than Pr(w1) Pr(w2), then w1 and w2 are semantically sticky. Using Pr_near(w1 w2), the algorithm finds some interesting classes such as:
we our us ourselves ours
question questions asking answer answers answering
performance performed perform performs performing
tie jacket suit
write writes writing written wrote
morning noon evening night nights midnight bed
attorney counsel trial court judge
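As an illustration, the following is a rough Python sketch of how such semantically sticky pairs could be collected from a tokenized corpus. The window sizes follow the description above; the function name, the normalization, and the threshold are my assumptions, not details given in the thesis or in Brown et al.

from collections import Counter


def semantically_sticky_pairs(tokens, wide=1001, narrow=5, threshold=20.0):
    """Return pairs (w1, w2) whose Pr_near(w1, w2) is much larger than Pr(w1) * Pr(w2).
    Pr_near counts occurrences of w2 inside a wide window centered on w1,
    excluding a narrow window centered on w1."""
    n = len(tokens)
    unigram = Counter(tokens)
    near = Counter()
    half_wide, half_narrow = wide // 2, narrow // 2

    for i, w1 in enumerate(tokens):
        for j in range(max(0, i - half_wide), min(n, i + half_wide + 1)):
            if abs(j - i) > half_narrow:            # skip the narrow window around w1
                near[(w1, tokens[j])] += 1

    total = sum(near.values())
    sticky = []
    for (w1, w2), count in near.items():
        if count / total > threshold * (unigram[w1] / n) * (unigram[w2] / n):
            sticky.append((w1, w2))
    return sticky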
2.2 Word Similarity

As in Section 1.1, the meaning of a word can be inferred from the contexts in which it occurs, as in the following examples:
A bottle of vodka is on the table.
Everyone likes vodka.
Vodka makes you drunk.
We make vodka out of corn.
The similarity between two objects is defined to be the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects (Lin, 1997) [11]. Considering two words w1 and w2, the similarity of the two words in context is:

sim(w1, w2) = log P(common(w1, w2)) / log P(describe(w1, w2))
To compute the similarity of two words in context, Dekang Lin used dependency triples (w, r, w'), each of which contains two words and the grammatical relationship between them in the input sentence. Here, w is the word under consideration; r is the grammatical relationship between w and w'; w' is the word context of w; and the notation ||w, r, w'|| is the frequency count of the dependency triple (w, r, w') in the parsed corpus. When w, r, or w' is the wild card (*), the frequency counts of all the dependency triples which match the rest of the pattern are summed up. For example, ||uống, obj, *|| (||drink, obj, *||) is the frequency count of "uống-obj" (drink-obj) relationships in the parsed corpus.
The description of a word w contains the frequency counts of all the dependency triples that match the pattern (w, *, *). Let I(w, r, w') denote the amount of information contained in ||w, r, w'||; this value is computed as follows.
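The standard definition of this quantity from Lin (1998) [10] is:

I(w, r, w') = log [ (||w, r, w'|| × ||*, r, *||) / (||w, r, *|| × ||*, r, w'||) ]

that is, a triple carries much information when it occurs much more often than would be expected if w and w' were independent given the relation r.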
b) Object and Object-of (Obj and Obj-of): In this relationship, the central word is a verb and the context word is a noun (the object). The signs on the tree are: parent node as VP, central node as V, context node as NP.
c) Complement and Complement-of (Mod and Mod-of): In this relationship, the central word is a noun and the context word is a modifier of the noun (modifiers may be N, A or V). The signs on the tree are: parent node as NP, central node as N, context node as N.
For each word w, the output of the algorithm lists the words most similar to it as follows:

w (pos): w1 s1, w2 s2, ..., wn sn

in which pos is a part of speech, wi is a word and si = sim(w, wi). The top-10 words in the noun, verb and adjective entries for the word "brief" are as follows:
brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04, paper 0.04.
brief (verb): tell 0.09, urge 0.07, ask 0.07, meet 0.06, appoint 0.06, elect 0.05, name 0.05, empower 0.05, summon 0.05, overrule 0.04.
brief (adjective): lengthy 0.13, short 0.12, recent 0.09, prolonged 0.09, long 0.09, extended 0.09, daylong 0.08, scheduled 0.08, stormy 0.07, planned 0.06.
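A minimal Python sketch of this idea, following the instantiation in Lin (1998) [10] in which the commonality of two words is their set of shared dependency features. The counts table, the function names, and the small Vietnamese example are illustrative assumptions only.

from math import log


def triple_information(counts, w, r, w_ctx):
    # I(w, r, w') = log( ||w,r,w'|| * ||*,r,*|| / (||w,r,*|| * ||*,r,w'||) )
    n_wrw = counts.get((w, r, w_ctx), 0)
    n_r = sum(c for (_, r1, _), c in counts.items() if r1 == r)
    n_wr = sum(c for (w1, r1, _), c in counts.items() if w1 == w and r1 == r)
    n_rw = sum(c for (_, r1, w2), c in counts.items() if r1 == r and w2 == w_ctx)
    if 0 in (n_wrw, n_wr, n_rw):
        return 0.0
    return log(n_wrw * n_r / (n_wr * n_rw))


def features(counts, w):
    # Features of w: pairs (r, w') whose information I(w, r, w') is positive.
    return {(r, wc) for (w1, r, wc) in counts
            if w1 == w and triple_information(counts, w, r, wc) > 0}


def lin_similarity(counts, w1, w2):
    # Information in the shared features of w1 and w2, divided by the total
    # information in the features of w1 plus the features of w2.
    f1, f2 = features(counts, w1), features(counts, w2)
    shared = f1 & f2
    num = sum(triple_information(counts, w1, r, wc) +
              triple_information(counts, w2, r, wc) for r, wc in shared)
    den = (sum(triple_information(counts, w1, r, wc) for r, wc in f1) +
           sum(triple_information(counts, w2, r, wc) for r, wc in f2))
    return num / den if den else 0.0


# A tiny hypothetical triple table: uống (drink) and thích (like) share the objects
# bia/rượu, while đọc (read) takes different objects, so it ends up dissimilar to both.
counts = {("uống", "obj", "bia"): 4, ("uống", "obj", "rượu"): 3,
          ("thích", "obj", "bia"): 3, ("thích", "obj", "rượu"): 2,
          ("đọc", "obj", "sách"): 6, ("đọc", "obj", "báo"): 4}
print(lin_similarity(counts, "uống", "thích"))   # 1.0: identical object contexts
print(lin_similarity(counts, "uống", "đọc"))     # 0.0: no shared contexts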
2.3 Clustering By Committee
Clustering By Committee (CBC) focuses on addressing the general goal of clustering, that is, to group data elements so that the intra-group similarities are high and the inter-group similarities are low. CBC comes in two versions of the algorithm: a hard clustering version, in which each element is assigned to only one cluster, and a soft clustering version, in which elements can be assigned to multiple clusters.
2.3.1 Motivation
Clustering By Committee is motivated by the desire to automatically extract concepts and word senses from large unlabeled collections of text. In previous research, word senses were usually defined by using a manually constructed lexicon such as WordNet. However, such lexicons have several problems: manually created lexicons often contain rare senses and miss many domain-specific senses. One way to solve these problems is to use a clustering algorithm to automatically induce semantic classes (Lin and Pantel, 2001) [9].
Many clustering algorithms represent a cluster by the centroid of all its members (as K-means does) or by representative elements. For example, when clustering words, the task can use the contexts of the words as features and group the words together. In CBC, the centroid of a cluster is constructed by averaging the feature vectors of a subset of the cluster members, and this subset is viewed as a committee that determines which other elements belong to the cluster.
2.3.2 The algorithm

CBC consists of three phases: in the first phase, the top-similar elements of each element are computed; in the second phase, tight clusters are found, the members of which form committees; in the final phase of the algorithm, each element is assigned to its most similar clusters.
2.3.2.1 Phase I: Find top-similar elements
To compute the top-similar elements of an element e, the algorithm first sorts e's features according to their point-wise mutual information values and then considers only a subset of the features with the highest mutual information. Finally, Phase I computes the pairwise similarity between e and the elements that share a feature from this subset.
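A rough Python sketch of this phase, assuming each element is represented by a dictionary mapping features to point-wise mutual information values and that cosine is used as the pairwise similarity; the thesis does not fix these details, so they are assumptions here.

from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(u[f] * v[f] for f in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def top_similar_elements(element, vectors, num_features=50, top_k=20):
    """Phase I of CBC: keep only the element's highest-PMI features, then rank the
    elements sharing at least one of those features by their similarity to it."""
    vec = vectors[element]
    kept = set(sorted(vec, key=vec.get, reverse=True)[:num_features])
    candidates = [e for e, v in vectors.items() if e != element and kept & v.keys()]
    scored = [(e, cosine(vec, vectors[e])) for e in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]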
2.3.2.2 Phase II: Find committees
The second phase of the clustering algorithm recursively finds tight clusters scattered in the similarity space. In each recursive step, the algorithm finds a set of tight clusters called committees. A committee covers an element if the element's similarity to the centroid of the committee exceeds some high similarity threshold. Phase II is presented as follows:
• Input: a list of elements E to be clustered, a similarity database S from Phase I, thresholds θ1 and θ2.
• Step 1: For each element e ∈ E: cluster the top-similar elements of e from S using average-link clustering. For each discovered cluster c, compute the following score: |c| × avgsim(c), where |c| is the number of elements in c and avgsim(c) is the average pairwise similarity between elements in c. Store the highest-scoring cluster in a list L.
• Step 2: Sort the clusters in L in descending order of their scores.
• Step 3: Let C be a list of committees, initially empty. For each cluster c ∈ L in sorted order, compute the centroid of c by averaging the frequency vectors of its elements and computing the mutual information vector of the centroid in the same way. If c's similarity to the centroid of every committee previously added to C is below a threshold θ1, add c to C.
• Step 4: If C is empty, the algorithm is done and returns C.
• Step 5: For each element e ∈ E, if e's similarity to every committee in C is below threshold θ2, add e to a list of residues R.
• Step 6: If R is empty, the algorithm is done and returns C. Otherwise, return the union of C and the output of a recursive call to Phase II with the same input, except replacing E with R.
• Output: a list of committees (a simplified sketch of this procedure is given below).
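The following is a much-simplified Python sketch of the committee-selection loop in Steps 2 to 6, with Step 1's average-link clustering and the centroid vectors replaced by cruder set-based stand-ins (each candidate cluster is an element plus its nearest neighbours, and cluster-to-committee similarity is an average of pairwise similarities). The parameter names and thresholds are assumptions, not values from the thesis.

def find_committees(elements, sim, theta1=0.35, theta2=0.25, top_k=10):
    """Simplified CBC Phase II: score candidate clusters by |c| * avgsim(c), greedily
    keep those not too similar to committees already chosen, then recurse on residues."""
    def avgsim(cluster):
        pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
        return sum(sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

    def cluster_sim(c1, c2):
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    # Step 1 (stand-in): one candidate cluster per element, built from its nearest neighbours.
    candidates = []
    for e in elements:
        ranked = sorted((x for x in elements if x != e), key=lambda x: sim(e, x), reverse=True)
        candidates.append([e] + ranked[:top_k])

    # Step 2: sort candidates by |c| * avgsim(c).
    candidates.sort(key=lambda c: len(c) * avgsim(c), reverse=True)

    # Step 3: keep a candidate only if it is not too similar to any chosen committee.
    committees = []
    for c in candidates:
        if all(cluster_sim(c, comm) < theta1 for comm in committees):
            committees.append(c)

    # Steps 4-6: collect residue elements not covered by any committee and recurse.
    if not committees:
        return committees
    residues = [e for e in elements
                if all(max(sim(e, m) for m in comm) < theta2 for comm in committees)]
    if not residues or len(residues) == len(elements):
        return committees
    return committees + find_committees(residues, sim, theta1, theta2, top_k)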
2.3.2.3 Phase III: Assign elements to clusters
Phase III has two versions: the hard clustering version for document clustering and the soft clustering version for discovering word senses. In the first version, every element is assigned to the cluster containing the committee to which it is most similar. This version resembles K-means in that every element is assigned to its closest centroid.
Unlike K-means, the number of clusters is not fixed and the centroids do not change.
In the second version, each element e is assigned to its most similar clusters in the following way:
let C be a list of clusters, initially empty
let S be the top-200 similar clusters to e
while S is not empty {
    let c ∈ S be the cluster most similar to e
    if similarity(e, c) < σ
        exit the loop
    if c is not similar to any cluster in C {
        assign e to c
        remove from e its features that overlap with the features of c
        add c to C
    }
    remove c from S
}
For discovering word senses, the evaluation used a test set consisting of 13,403 words.
Table 1. Results of CBC with discovering word senses
• The goal of document clustering is to discover documents with similar topics. On the 20-news data set, the implementation of Chameleon was unable to complete in reasonable time. For K-means, CBC used K = 1000 over five iterations for 20-news and K = 30 over eight iterations for Reuters. For Buckshot, CBC used K = 80 for 20-news and K = 50 for Reuters, also over eight iterations. For the 20-news corpus, CBC spends the vast majority of the time finding the top similar documents (38 minutes) and computing the similarity between documents and committee centroids (119 minutes). The rest of the computation, which includes clustering the top-20 similar documents for each of the 18,828 documents and sorting the clusters, took less than 5 minutes. Table 2 shows the results of various clustering algorithms on document clustering.
Table 2. Results of CBC with document clustering

Algorithms | Reuters | 20-News
Chapter III
Our approach
In this chapter, I will describe our approach, including word clustering for Vietnamese, evaluating the influence of word clustering in Vietnamese, further analyzing the classes of antonym relationships within clusters, and how to use word clusters as features in the Vietnamese functional labeling task.
3.1 Word clustering in Vietnamese
To cluster words in Vietnamese, I apply two methods of word clustering on the same corpus, as mentioned in Chapter 1: word clustering by Brown's algorithm and word similarity by Dekang Lin.
3.1.1 Brown’s algorithm
In order to extract word clusters in Vietnamese, I used Brown's algorithm, one of the most well-known and effective clustering algorithms for language models, to cluster the Vietnamese corpus. The Brown algorithm uses the mutual information between cluster pairs in a bottom-up approach to maximize the Average Mutual Information between adjacent clusters. The output of the algorithm is a hierarchical clustering of words, in which words are hierarchically clustered by bit strings, as in Figure 2.

A word cluster contains a main word and subordinate words; each subordinate word has the same bit string and a corresponding frequency. The number of subordinate words is different in each cluster. To keep cluster sizes manageable, I randomly remove some subordinate words from clusters containing more than 10 subordinate words, because I observed that the subordinate words I dismiss often have low frequency and often have no semantic relation to the main word.
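A small Python sketch of this pruning step. The cluster representation (a bit string mapping to a main word and a list of subordinate words with frequencies) and the cap of 10 subordinate words come from the description above; everything else, including the example words, is a hypothetical illustration.

import random


def prune_clusters(clusters, max_subordinates=10, seed=0):
    """Randomly drop subordinate words from any cluster that has more than
    max_subordinates of them, keeping the main word untouched.

    clusters maps a bit string to (main_word, [(subordinate_word, frequency), ...]).
    """
    rng = random.Random(seed)
    pruned = {}
    for bits, (main_word, subordinates) in clusters.items():
        if len(subordinates) > max_subordinates:
            subordinates = rng.sample(subordinates, max_subordinates)
        pruned[bits] = (main_word, subordinates)
    return pruned


# Hypothetical example: a single cluster keyed by its bit string.
clusters = {"0110": ("bia", [("rượu", 120), ("nước", 95), ("trà", 40)])}
print(prune_clusters(clusters, max_subordinates=2))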