A WORD-BASED SOFT CLUSTERING ALGORITHM FOR DOCUMENTS


King-Ip Lin, Ravikumar Kondadadi

Department of Mathematical Sciences, The University of Memphis, Memphis, TN 38152, USA

linki,kondadir@msci.memphis.edu

Abstract:

Document clustering is an important tool for applications such as Web search engines. It enables the user to have a good overall view of the information contained in the documents. However, existing algorithms have various shortcomings: hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorithms (where each document can belong to multiple clusters) are usually inefficient. We propose WBSC (Word-based Soft Clustering), an efficient soft clustering algorithm based on a given similarity measure. WBSC uses a hierarchical approach to cluster documents having similar words. WBSC is very effective and efficient when compared with existing hard clustering algorithms like K-means and its variants.

Keywords:

Document clustering, word-based clustering, soft clustering

1 Introduction

Document clustering is a very useful tool in today’s world, where large numbers of documents are stored and retrieved electronically. It enables one to discover hidden similarities and key concepts. Moreover, it enables one to summarize a large number of documents using key or common attributes of the clusters. Thus clustering can be used to categorize document databases and digital libraries, as well as to provide useful summary information about the categories for browsing purposes. For instance, a typical search on the World Wide Web can return thousands of documents. Automatic clustering of the documents gives the user a clear and easy grasp of what kinds of documents are retrieved, which is a tremendous help in locating the right information.

Many clustering techniques have been developed, and they can be applied to clustering documents; [3, 6] contain examples of using such techniques. Most of these traditional approaches use documents as the basis of clustering. In this paper, we take an alternative approach, word-based soft clustering (WBSC). Instead of comparing documents and clustering them directly, we cluster the words used in those documents. For each word, we form a cluster containing the documents that have that word. After that, we combine clusters that contain similar sets of documents, using an agglomerative approach [5]. The resulting clusters contain documents that share a similar set of words.

We believe this approach has many advantages. Firstly, it allows a document to reside in multiple clusters, which makes it possible to capture documents that contain multiple topics. Secondly, it lends itself to a natural representation of the clusters, namely the words that are associated with the clusters themselves. Also, the running time is strictly linear in the number of documents. Even though the running time grows quadratically with the number of clusters, this number is bounded by the number of words, which has a rather fixed upper bound (the number of words in the language). This allows the algorithm to scale well.

The rest of the paper is organized as follows: Section 2 summarizes related work in the field. Section 3 describes our algorithm in more detail. Section 4 provides some preliminary experimental results. Section 5 outlines future directions of this work.

2 Related work

Clustering algorithms have been developed and used in many fields; [4] provides an extensive survey of various clustering techniques. In this section, we highlight work done on document clustering.

Many clustering techniques have been applied to clustering documents. For instance, Willett [9] provided a survey on applying hierarchical clustering algorithms to clustering documents. Cutting et al. [1] adapted various partition-based clustering algorithms to clustering documents. Two of the techniques are Buckshot and Fractionation. Buckshot selects a small sample of documents, pre-clusters them using a standard clustering algorithm, and assigns the rest of the documents to the clusters formed. Fractionation splits the N documents into ‘m’ buckets, where each bucket contains N/m documents. Fractionation takes an input parameter ρ, which indicates the reduction factor for each bucket. The standard clustering algorithm is applied so that if there are ‘n’ documents in each bucket, they are clustered into n/ρ clusters. Each of these clusters is then treated as if it were an individual document, and the whole process is repeated until there are only ‘K’ clusters.

A relatively new technique is proposed by Zamir and Etzioni [10]. They introduce the notion of phrase-based document clustering. They use a generalized suffix tree to obtain information about the phrases and use this information to cluster the documents.

3 WBSC

As the name suggests, WBSC uses a word-based approach to build clusters. It first forms initial clusters of the documents, with each cluster representing a single word. For instance, WBSC forms a cluster for the word ‘tiger’ made up of all the documents that contain the word ‘tiger’. After that, WBSC merges similar clusters (clusters are similar if they contain similar sets of documents) using a hierarchical approach until some stopping criterion is reached. At the end, the clusters are displayed based on the words associated with them.

Cluster Initialization:

We represent each cluster as a vector of documents. For each word, we create a bit vector (called the term vector), with each dimension representing whether a certain document is present or absent (denoted by 1 and 0, respectively). This step requires only one pass through the set of documents. However, we do maintain an index (e.g. a hash table) on the list of words in the documents to speed up the building process.

Notice that the number of clusters is independent of the number of documents. However, it grows with the total number of words in all the documents. Thus, we do not consider stop-words, that is, words that either carry no meaning (like prepositions) or are very common. The second idea can be extended: words that appear in too many documents are not going to help clustering. Similarly, if a word appears in only one document, its cluster is unlikely to merge with other clusters and so remains useless for clustering. Thus, we remove all initial clusters that contain only one document or more than half of the total number of documents.
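The initialization step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses Python sets of document ids in place of bit vectors, a dictionary as the hash-table index, whitespace tokenization, and a tiny made-up stop-word list. The names `build_initial_clusters` and `STOP_WORDS` are hypothetical.

```python
from collections import defaultdict

# Hypothetical, very small stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to"}

def build_initial_clusters(documents, stop_words=STOP_WORDS):
    """Return {word: set of ids of documents containing it} after pruning."""
    index = defaultdict(set)  # plays the role of the hash-table index
    for doc_id, text in enumerate(documents):  # single pass over the documents
        for word in text.lower().split():
            if word not in stop_words:
                index[word].add(doc_id)
    half = len(documents) / 2
    # Drop clusters with only one document or with more than half of all documents.
    return {w: docs for w, docs in index.items() if 1 < len(docs) <= half}

docs = [
    "the tiger hunts in the jungle",
    "a tiger is a big cat",
    "the cat sleeps",
    "jungle rain",
]
clusters = build_initial_clusters(docs)
```

In this toy collection only ‘tiger’, ‘jungle’ and ‘cat’ survive the pruning, because every other non-stop word occurs in a single document.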

Cluster Building:

Once the initial clusters are formed, we apply a hierarchical algorithm to combine the clusters. Here we need a measure of similarity between any pair of clusters. Let us assume we have two clusters A and B, with n1 and n2 documents respectively (without loss of generality, assume n1 ≤ n2), and with c documents in common. Intuitively, we would like to combine the two clusters if they have many common documents (relative to the size of the clusters). We first tried c / min(n1, n2) as our similarity measure. However, this measure does not perform well. After trying different measures, we settled on the Tanimoto measure [2]:

\[ \frac{c}{n_1 + n_2 - c} \]
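As a small illustration of the two measures mentioned above, the sketch below computes both for a pair of clusters represented as Python sets of document ids. The set representation and the function names are assumptions made for this example, and both clusters are assumed to be non-empty.

```python
def overlap_similarity(a, b):
    """c / min(n1, n2): the first measure that was tried."""
    c = len(a & b)
    return c / min(len(a), len(b))

def tanimoto_similarity(a, b):
    """c / (n1 + n2 - c): the Tanimoto measure."""
    c = len(a & b)
    return c / (len(a) + len(b) - c)

small = {1, 2}                  # n1 = 2
large = set(range(1, 11))       # n2 = 10, contains both documents of `small`

overlap = overlap_similarity(small, large)    # 2 / 2
tanimoto = tanimoto_similarity(small, large)  # 2 / (2 + 10 - 2)
```

Here the min-based measure reaches its maximum of 1.0 even though the clusters differ greatly in size, while the Tanimoto value is 0.2.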

WBSC iterates over the clusters to merge similar clusters using this similarity measure. We consider every pair of clusters and merge them if their similarity value is higher than a pre-defined threshold. We also merge two clusters when one is simply a subset of the other. Currently we use 0.33 as our threshold. We are experimenting with different ways of finding a better threshold.
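One possible reading of the merging step is sketched below. The text does not specify the order in which pairs are visited or how the iteration terminates, so the restart-after-each-merge loop here is an assumption, as are the function names and the representation of a cluster as a pair (label words, document ids). Clusters are assumed to be non-empty.

```python
THRESHOLD = 0.33  # threshold value reported in the text

def tanimoto(a, b):
    c = len(a & b)
    return c / (len(a) + len(b) - c)

def should_merge(a, b, threshold=THRESHOLD):
    # Merge when one cluster is a subset of the other or the similarity is high.
    return a <= b or b <= a or tanimoto(a, b) > threshold

def merge_clusters(initial, threshold=THRESHOLD):
    """initial: {word: set of doc ids}. Returns a list of (words, docs) pairs."""
    clusters = [({word}, set(docs)) for word, docs in initial.items()]
    merged = True
    while merged:  # repeat until no pair qualifies (assumed stopping rule)
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if should_merge(clusters[i][1], clusters[j][1], threshold):
                    words_i, docs_i = clusters[i]
                    words_j, docs_j = clusters[j]
                    # The merged cluster keeps the words and documents of both.
                    clusters[i] = (words_i | words_j, docs_i | docs_j)
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters

initial = {
    "tiger": {0, 1, 2},
    "lion": {1, 2, 3},      # Tanimoto with "tiger" is 2/4 = 0.5
    "stock": {4, 5},        # subset of "market"
    "market": {4, 5, 6},
    "rain": {7, 8},
}
merged = merge_clusters(initial)
```

In this example ‘tiger’ and ‘lion’ merge through the threshold rule, ‘stock’ and ‘market’ merge through the subset rule, and ‘rain’ stays on its own.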

Displaying Results:

In the algorithm, we keep track of the words that are used to represent each cluster. When two clusters merge, the new cluster acquires the words from both clusters. Thus, at the end of the algorithm, each cluster has a set of representative words. This set can be used to provide a description of the cluster. We also found that many clusters have very few words associated with them, as they have either never been merged or have been merged only 2-3 times. Our results show that discarding such clusters does not affect the results significantly. In fact, it allows the results to be interpreted more clearly.
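A rough sketch of this display step follows. The cutoff `min_words` and the function name `describe_clusters` are hypothetical; the text only says that clusters with very few associated words are discarded, without giving an exact limit.

```python
def describe_clusters(clusters, min_words=3):
    """Return one description line per cluster that has enough label words."""
    lines = []
    for words, docs in clusters:
        if len(words) < min_words:
            continue  # sparsely labelled cluster: left out of the display
        lines.append(f"{', '.join(sorted(words))}: {len(docs)} documents")
    return lines

clusters = [
    ({"hockey", "nhl", "players", "teams"}, {0, 1, 2}),
    ({"remind"}, {3, 4}),  # never merged, so it carries a single word
    ({"nasa", "shuttle", "station"}, {5, 6}),
]
summary = describe_clusters(clusters)
```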

4 Experiments & Results

This section describes the results of the various experiments that we carried out. In order to evaluate the performance of WBSC, we compared it with other algorithms: K-Means, Fractionation and Buckshot [1].

Experimental Setup:

Our test bed consists of two separate datasets:

• Web: This contains 2000 documents downloaded from the Web. They are from different categories such as Food, Cricket, Astronomy, Clustering, Genetic Algorithms, Baseball, Movies, XML, Salsa (Dance) and Olympics. We used search engines to help us locate the appropriate documents.

• Newsgroups: This contains 2000 documents from the UCI KDD archive (http://kdd.ics.uci.edu/) [8]. It has documents retrieved from 20 different newsgroups.

All the experiments were carried out on a 733 MHz PC with 260 MB of RAM. We ran experiments on document sets of different sizes. We ran the algorithm to get the clusters and tested the effectiveness of clustering (the types of clusters formed), both qualitatively and quantitatively. We also compared the execution times of all the algorithms for document sets of different sizes.

Effectiveness of Clustering:

We ran many experiments with different numbers of documents taken from the above-mentioned test bed. All the algorithms were run to produce the same number of clusters with the same input parameters. WBSC formed clusters for each of the different categories in the document sets, while the other algorithms (K-Means, Fractionation and Buckshot) did not. In addition, the other algorithms formed clusters with documents from many categories that are not related to each other. Figure 1 shows sample output from one of the experiments with 500 documents picked from 10 of the categories of the newsgroups data set. The categories include atheism, graphics, x-windows, hardware, space, electronics, Christianity, guns, baseball and hockey.

Keywords: BTW, remind, NHL, hockey, Canada, Canucks, hockey, teams, players
Category: Hockey

Keywords: Human, Christian, bible, background, research, errors, April, President, members, member, private,

Keywords: Cult, MAGIC, baseball, Bob, rules, poor, teams, players, season
Category: Baseball

Keywords: Distribution, experience, anti, message, Windows, error, week, X Server, Microsoft, problems
Category: Windows-x

Keywords: Military, fold, NASA, access, backup, systems, station, shuttle
Category: Space

Keywords: Notion, soul, several, hell, Personally, created, draw, exists, god
Category: Atheism

Figure 1: Clusters formed by WBSC on a sample from the UCI data archive

WBSC formed 14 clusters. It formed more than one cluster for some of the categories, such as Hockey and Guns. The document set on hockey has different documents related to the Detroit Red Wings and the Vancouver Canucks. Our algorithm formed two different clusters for these two sets. We ran the other algorithms with 10 as the required number of clusters to be formed. Many of the resulting clusters have documents that are not at all related to each other (they have documents from different categories).

To show the effectiveness of our algorithm, we ran the various algorithms on our Web data set, where some documents are in multiple categories (for example, a document related to both baseball and movies should be in all three clusters: baseball, movies, and baseball-movies). We cannot present the full results due to space limitations. K-Means, Fractionation and Buckshot formed only clusters about baseball and movies, and did not form a cluster for documents related to both categories. However, WBSC formed a cluster for baseball-movies and put the documents related to both topics in that cluster. This shows the effectiveness of our method.

We also measured the effectiveness of WBSC quantitatively. We compared the clusters formed by each algorithm against the documents in the original categories and matched the clusters with the categories one-to-one. For each matching, we counted the number of documents that are common to the corresponding clusters and categories. The matching with the largest number of common documents is used to measure the effectiveness. This matching can be found by a maximum weight bipartite matching algorithm [7]. We report the number of documents in the matching. The more documents that are matched, the more closely the clusters resemble the original categories. For our algorithm, and for the purpose of this comparison, we assigned each document to the cluster with which it has the largest similarity value.
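The matching score can be illustrated with the sketch below. It is not the maximum weight bipartite matching algorithm of [7]: for brevity it tries every one-to-one assignment by brute force, which is only practical for a handful of clusters. The function name and the toy data are made up for this example, and each document is assumed to have already been assigned to a single cluster.

```python
from itertools import permutations

def matching_score(clusters, categories):
    """Largest total overlap over all one-to-one cluster/category matchings."""
    k = max(len(clusters), len(categories))
    # Pad the shorter side with empty sets so both sides have k entries.
    cl = list(clusters) + [set()] * (k - len(clusters))
    ca = list(categories) + [set()] * (k - len(categories))
    best = 0
    for perm in permutations(range(k)):  # brute force stand-in for [7]
        total = sum(len(cl[i] & ca[perm[i]]) for i in range(k))
        best = max(best, total)
    return best

categories = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8, 9}]
clusters = [{4, 5, 7}, {0, 1, 2, 6}, {3, 8, 9}]
score = matching_score(clusters, categories)
```

In this toy case the best matching pairs the first cluster with the second category (2 common documents), the second cluster with the first category (3), and the third cluster with the third category (2), so 7 of the 10 documents are matched.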

Figure 2 shows the number of matches with the original categories for the different algorithms on different sets of documents.

[Figure 2 plot not recoverable; axis label: Number of documents]

Figure 2: Comparison of quality of clusters by different clustering algorithms

We can clearly see that WBSC outperforms the other algorithms in effectiveness.

Execution Time:

We also measured the execution time of the various algorithms. Figure 3 compares the execution times. As the graph shows, WBSC outperforms almost all other algorithms in execution time, especially as the number of documents increases.

[Figure 3 plot not recoverable; axis label: Number of documents]

Figure 3: Execution times of various clustering algorithms

Web Search Results:

We also tested WBSC on the results from Web search engines. We downloaded documents returned by the Google search engine (www.google.com) and applied WBSC to them. Space limitations prevent us from showing all the results. Here we show some of the clusters found by WBSC when clustering the top 100 URLs returned from searching for the term “cardinal”. The categories correspond to the common usages of the word “cardinal” in documents on the Web (a name of baseball teams, a nickname for schools, a kind of bird, and a member of the Catholic clergy). Figure 4 shows some of the clusters formed by WBSC.

Cluster results (keywords and sample documents) and related topic:

Keywords: Bishops, Catholic, world, Roman, Church, cardinals, College, Holy
Sample documents:
1. Cardinals in the Catholic Church Catholicpages.com
2. The Cardinals of the Holy Roman Church
Related topic: Roman Catholic Church Cardinals library

Keywords: Dark, Bird, female, color
Sample documents:
1. Cardinal Page
2. First Internet Bird Club
Related topic: Cardinal Bird

Keywords: Benes, rookie, players, hit, innings, runs, Garrett, teams, league, Saturday
Sample documents:
1. St.Louis Cardinals Notes
2. National League Baseball - Brewers vs Cardinals
Related topic: St.Louis Cardinals (MLB team)

Figure 4: Clusters formed for search term “cardinals” by WBSC

This shows that our algorithm is effective even with Web search results.

5 Conclusions

In this paper, we presented WBSC, a word-based document-clustering algorithm. We have shown that it is a promising means for clustering documents as well as presenting the clusters in a meaningful way to the user.

We have compared WBSC with other existing algorithms and have shown that it performs well, both in terms of running time and quality of the clusters formed.

Future work includes tuning the parameters of the algorithm, as well as adapting new similarity measures and data structures to speed up the algorithm.

6 References

[1] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, in Proceedings of the Fifteenth Annual International ACM SIGIR Conference, pp. 318-329, June 1992.

[2] P. M. Dean (Ed.), Molecular Similarity in Drug Design, Blackie Academic & Professional, 1995, pp. 111-137.

[3] D. R. Hill, A vector clustering technique, in: Samuelson (Ed.), Mechanized Information Storage, Retrieval and Dissemination, North-Holland, Amsterdam, 1968.

[4] A. K. Jain, M. N. Murty and P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, 31(3): 264-323, Sept. 1999.

[5] F. Murtagh, A Survey of Recent Advances in Hierarchical Clustering Algorithms, The Computer Journal, 26(4): 354-359, 1983.

[6] J. J. Rocchio, Document retrieval systems – optimization and evaluation, Ph.D. Thesis, Harvard University, 1966.

[7] Robert E. Tarjan, Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, 1983.

[8] http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html, last visited November 18th, 2000.

[9] P. Willett, Recent trends in hierarchical document clustering: a critical review, Information Processing and Management, 24: 577-97, 1988.

[10] O. Zamir, O. Etzioni, Web document clustering: a feasibility demonstration, in Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 98), 1998, pp. 46-54.
