However, existing algorithms suffer from various aspects; hard clustering algorithms where each document belongs to exactly one cluster cannot detect the multiple themes of a document, w
Trang 1A WORD-BASED SOFT CLUSTERING ALGORITHM FOR DOCUMENTS
King-Ip Lin, Ravikumar Kondadadi Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152, USA
linki,kondadir@msci.memphis.edu
Abstract:
Document clustering is an important tool for
applications such as Web search engines It enables the
user to have a good overall view of the information
contained in the documents However, existing algorithms
suffer from various aspects; hard clustering algorithms
(where each document belongs to exactly one cluster)
cannot detect the multiple themes of a document, while
soft clustering algorithms (where each document can
belong to multiple clusters) are usually inefficient We
propose WBSC (Word-based Soft Clustering), an efficient
soft clustering algorithm based on a given similarity
measure WBSC uses a hierarchical approach to cluster
documents having similar words WBSC is very effective
and efficient when compared with existing hard clustering
algorithms like K-means and its variants
Keywords:
Document clustering, Word based clustering, soft
clustering
1 Introduction
Document clustering is a very useful tool in today’s
world where a lot of documents are stored and retrieved
electronically It enables one to discover hidden similarity
and key concepts Moreover, it enables one to summarize
a large amount of documents using key or common
attributes of the clusters Thus clustering can be used to
categorize document databases and digital libraries, as
well as providing useful summary information of the
categories for browsing purposes For instance, a typical
search on the World Wide Web can return thousands of
documents Automatic clustering of the documents
enables the user to have a clear and easy grasp of what
kind of documents are retrieved, providing tremendous
help for him/her to locate the right information
Many clustering techniques have been developed and
they can be applied to clustering documents [3, 6] contain
examples of using such techniques Most of these
traditional approaches use documents as the basis of
clustering In this paper, we took an alternative approach,
word-based soft clustering (WBSC).Instead of comparing
documents and clustering them directly, we cluster the
words used in such documents For each word, we form a
cluster containing documents that have that word After
that, we combine clusters that contain similar sets of
documents, by an agglomerative approach [5] The
resulting clusters will contain documents that share a similar set of words
We believe this approach has many advantages Firstly, it allows a document to reside in multiple clusters This allows one to capture documents that contain multiple topics Secondly, it leads itself to a natural representation of the clusters, which are the words that is associated with the clusters themselves Also, the running time is strictly linear to the number of documents Even though the running time grows quadratically with the number of clusters, this number is bounded by the number
of words, which has a rather fixed upper bound (the number of words in the language) This allows the algorithm to scale well
The rest of the paper is organized as follows: Section
2 summarizes related work in the field Section 3 describes our algorithm in more detail Section 4 provides some preliminary experimental results Section 5 outlines future direction of this work
2 Related work
Clustering algorithms have been developed and used
in many fields [4] provides an extensive survey of various clustering techniques In this section, we highlight work done on document clustering
Many clustering techniques have been applied to clustering documents For instance, Willett [9] provided a survey on applying hierarchical clustering algorithms into clustering documents Cutting et al [1] adapted various partition-based clustering algorithms to clustering documents Two of the techniques are Buckshot and Fractionation Buckshot selects a small sample of documents to pre-cluster them using a standard clustering algorithm and assigns the rest of the documents to the
clusters formed Fractionation splits the N documents into
‘m’ buckets where each bucket contains N/m documents.
Fractionation takes an input parameter ρ, which indicates the reduction factor for each bucket The standard
clustering algorithm is applied so that if there are ‘n’ documents in each bucket, they are clustered into n/ρ
clusters Now each of these clusters is treated as if they were individual documents and the whole process is
repeated until there are only ‘K’ clusters
A relatively new technique is proposed by Zamir and Etzioni [10] They introduce the notion of phrase-based document clustering They use a generalized suffix-tree to obtain information about the phrases and use them
to cluster the documents
Trang 23 WBSC
As the name suggests, WBSC uses a word-based
approach to build clusters It first forms initial clusters of
the documents, with each cluster representing a single
word For instance, WBSC forms a cluster for the word
‘tiger’ made up of all the documents that contain the word
‘tiger’ After that, WBSC merges similar clusters –clusters
are similar if they contain the similar set of documents –
using a hierarchical based approach until some stopping
criterion is reached At the end, the clusters are displayed
based on the words associated with them
Cluster Initialization:
We represent each cluster as a vector of
documents For each word, we create a bit vector (called
term vector), with each dimension representing whether a
certain document is present or absent (denoted by 1 and 0,
respectively) This step requires only one pass through the
set of the documents However, we do maintain some sort
of index (e.g a hash table) on the list of words in the
document to speed up the building process
Notice that the number of clusters is independent
to the number of documents However, it grows with the
total number of words in all the documents Thus, we do
not consider stop-words words that either carry no
meaning (like prepositions) or very common words The
second idea can be extended Words that appear in too
many documents are not going to help clustering
Similarly, if a word appear in only one document, it is not
likely that it will merge with other clusters and remain
useless in clustering Thus, we remove all initial clusters
that have only one document or have more than half of the
total number of documents in it
Cluster Building:
Once the initial clusters are formed, we apply a
hierarchical-based algorithm to combine the clusters Here
we need a measure of similarity between any pair of
clusters Let us assume we have two clusters A and B,
with n1 and n2 documents each respectively (without loss
of generality, assume n1 ≤ n2), with c documents in
common Intuitively we would like to combine the two
clusters if they have many common documents (relative
to the size of the clusters) We tried c min( n1, n2) as
our similarity measure However, this measure does not
perform well After trying with different measures, we
settled with the Tanimoto measure [2],
c n n
c
− + 2 1
WBSC iterates over the clusters to merge similar
clusters using this similarity measure Clusters are merged
if their similarity value is higher than a pre-defined
threshold We consider every pair of clusters, merging
them if the similarity is larger than the threshold Also,
there are cases when one cluster is simply the subset of
the other We merge them in that case Currently we use
0.33 as our threshold We are experimenting with different ways of finding a better threshold
Displaying results
In the algorithm, we keep track of the words that are used to represent each cluster When two clusters merge, the new cluster acquires the words from both clusters Thus at the end of the algorithm, each cluster will have a set of representative words This set can be used to provide a description of the cluster Also, we found out that there are a lot of clusters that have very few words associated with them – as they have either never been merged, or are only merged 2-3 times Our results show that discarding such clusters does not affect the results significantly In fact, it allows the results to be interpreted more clearly
4 Experiments & Results:
This section describes the results of the various experiments that we carried out In order to evaluate the performance of WBSC, we compared it with other algorithms like K-Means, Fractionation and Buckshot [1]
Experimental Setup:
Our test bed consists of two separate datasets:
• Web: This contains 2000 documents downloaded
from the Web They are from different categories such as Food, Cricket, Astronomy, Clustering, Genetic Algorithms, Baseball, Movies, XML, Salsa (Dance) and Olympics We use search engines to help us locate the appropriate documents
• Newsgroups: This contains 2000 documents from
the UCI KDD archive (http://kdd.ics.uci.edu/) [8] It has documents retrieved from 20 different news groups
All the experiments are carried out on a 733 MHz,
260 MB RAM PC We did experiments on different document sets of different sizes We ran the algorithm to get the clusters and tested the effectiveness of clustering (the types of clusters formed), both qualitatively and quantitatively We also compared the execution times of all the algorithms for document sets of different sizes
Effectiveness of Clustering:
We did many experiments with different number
of documents taken from the above-mentioned test bed All the algorithms were run to produce the same number
of clusters with same input parameters WBSC formed clusters for each of the different categories in the document sets, while the other algorithms (K-Means, Fractionation and Buckshot) did not In addition, the other algorithms formed clusters with documents of many categories that are not related to each other Figure 1 shows sample output from one of the experiments with
500 documents picked from 10 of the categories of the
Trang 3newsgroups data set The categories include atheism,
graphics, x-windows, hardware, space, electronics,
Christianity, guns, baseball and hockey
BTW, remind, NHL, hockey, Canada, Canucks, hockey, teams, players Hockey
Human, Christian, bible, background, research, errors, April, President, members, member, private,
Cult, MAGIC, baseball, Bob, rules, poor, teams, players, season Baseball
Distribution, experience, anti, message, Windows, error, week, X Server, Microsoft, problems Windows-x
Military, fold, NASA, access, backup, systems, station, shuttle Space
Notion, soul, several, hell, Personally, created, draw, exists, god Atheism
Figure 1: clusters formed by WBSC on a sample from UCI data archive
WBSC formed 14 clusters It formed more than
one cluster for some of the categories like Hockey, Guns
etc The document set on hockey has different documents
related to Detroit Redwings and Vancouver Canucks Our
algorithm formed two different clusters for these two
different sets We ran the other algorithms with 10 as the
required number of clusters to be formed Many of the
clusters have documents that are not at all related to each
other (they have documents from different categories)
To show the effectiveness of our algorithm we
run the various algorithms on our web data set, where
some documents are in multiple categories (for example a
document related to both baseball and movies should be
in all the three clusters: baseball, movie, and
baseball-movie) We cannot present the results due to limitation on
space K-Means, Fractionation and Buckshot formed only
clusters about baseball and Movies and did not form a
cluster for documents related to both categories However,
WBSC formed a cluster for baseball-movies and put the
documents related to baseball-movies in that cluster This
shows the effectiveness of our method
We also measure the effectiveness of WBSC
quantitatively We compared the clusters formed by the
documents against the documents in the original
categories and matched the clusters with the categories
one-to-one For each matching, we counted the number of
documents that are common in corresponding clusters
The matching with the largest number of common
documents is used to measure the effectiveness This
matching can be found by a maximum weight bipartite
matching algorithm [7] We return the number of
documents in the matching The more documents that are matched, the more resemble the clusters are to the original categories For our algorithm, and for the purpose of this comparison, we assigned each document to the cluster that has the largest similarity value
Figure 2 shows the number of matches with the original categories for different algorithms for different sets of documents
0.00 10.00 20.00 30.00 40.00 50.00 60.00
Number of documents
Figure 2: Comparison of quality of clusters by
different clustering algorithms
We can clearly see that WBSC outperforms the other algorithms in effectiveness
Execution Time:
We also measured the execution time of various algorithms Figure 3 gives the comparison of execution time As the graph shows, WBSC outperforms almost all
Trang 4other algorithms in execution time, especially as the
number of documents increases
0.00
20.00
40.00
60.00
80.00
100.00
120.00
140.00
Number of documents
Figure 3: Execution times of various clustering
algorithms
Web Search Results:
We also tested WBSC on the results from Web
search engines We downloaded documents returned from
the Google search engine (www.google.com) and we
apply WBSC on them Limitation on space prohibits us
from showing all the results Here we show some of the
clusters found by WBSC by clustering the top 100 URLs
returned from searching the term “cardinal” The
categories correspond to the common usage of the word
“cardinal” in documents over the Web (name of baseball
teams, nickname for schools, a kind of bird, and a
Catholic clergy) Figure 4 shows some of the clusters
formed by WBSC
Cluster results: Keywords and sample
documents Relatedtopic
Bishops, Catholic, world, Roman,
Church, cardinals, College, Holy
1 Cardinals in the Catholic
Church Catholicpages.com
2 The Cardinals of the Holy
Roman Church
Roman Catholic Church Cardinals library
Dark, Bird, female, color
1 Cardinal Page
2 First Internet Bird Club
Cardinal Bird
Benes, rookie, players, hit, innings,
runs, Garrett, teams, league, Saturday
1 St.Louis Cardinals Notes
2 National League Baseball
-Brewers vs Cardinals
St.Louis Cardinals (MLB team)
Figure 4: Clusters formed for search term “cardinals” by
WBSC This shows that our algorithm is effective even with the
web search results
5 Conclusions
In this paper, we presented WBSC, a word-based document-clustering algorithm We have shown that it is a promising means for clustering documents as well as presenting the clusters in a meaningful way to the user
We have compared WBSC with other existing algorithms and have shown that it performs well, both in terms of running time and quality of the clusters formed
Future work includes tuning the parameters of the algorithm, as well as adapting new similarity measures and data structures to speed up the algorithm
6 References:
[1] Douglass R Cutting, David R Karger, Jan O Pedersen, John W Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, In
Proceedings of the Fifteenth Annual International ACM SIGIR Conference, pp 318-329, June 1992
[2] Dean, P M Ed., Molecular Similarity in Drug Design, Blackie Academic & Professional, 1995, pp
111 –137
[3] D R Hill, A vector clustering technique, in:
Samuelson (Ed.), Mechanized Information Storage,
Retrieval and Dissemination, North-Holland, Amsterdam,
1968
[4] A.K Jain, M.N Murty and P.J Flynn, Data
Clustering: A Review, ACM Computing Surveys 31(3):
264-323, Sept 1999
[5] F Murtagh, A Survey of Recent Advances in
Hierarchical Clustering Algorithms", The Computer
Journal, 26(4): 354-359, 1983.
[6] J J Rocchio, Document retrieval systems –
optimization and evaluation, Ph.D Thesis, Harvard
University, 1966
[7] Robert E Tarjan, Data Structures and
Network Algorithms, Society for Industrial and Applied
Mathematics, 1983
[8]http://kdd.ics.uci.edu/databases/20newsgroup s/20newsgroups.html, last visited November 18th, 2000
[9] P.Willett, Recent trends in hierarchical
document clustering: a critical review, Information
processing and management, 24: 577-97, 1988.
[10] O.Zamir, O.Etzioni, Web document
clustering: a feasibility demonstration, in Proceedings of
19 th international ACM SIGIR conference on research and development in information retrieval (SIGIR 98), 1998, pp
46-54