Cybernetics and Systems
An International Journal
ISSN: 0196-9722 (Print) 1087-6553 (Online) Journal homepage: http://www.tandfonline.com/loi/ucbs20
Text Clustering Using Frequent Weighted Utility Itemsets
Tram Tran, Bay Vo, Tho Thi Ngoc Le & Ngoc Thanh Nguyen
To cite this article: Tram Tran, Bay Vo, Tho Thi Ngoc Le & Ngoc Thanh Nguyen (2017) Text Clustering Using Frequent Weighted Utility Itemsets, Cybernetics and Systems, 48:3, 193-209
To link to this article: http://dx.doi.org/10.1080/01969722.2016.1276774
Published online: 02 Mar 2017.
Text Clustering Using Frequent Weighted Utility Itemsets
Tram Tran a, Bay Vo b,c, Tho Thi Ngoc Le d, and Ngoc Thanh Nguyen e
a University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam;
b Division of Data Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam; c Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam; d Faculty of Information Technology,
Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam; e Faculty of Computer Science and Management, Wroclaw University of Science and Technology, Wroclaw, Poland
ABSTRACT
Text clustering is an important topic in text mining. One of the most effective methods for text clustering is an approach based on frequent itemsets (FIs), and thus there are many related algorithms that aim to improve the accuracy of text clustering. However, these do not focus on the weights of terms in documents, even though the frequency of each term in each document has a great impact on the results. In this work, we propose a new method for text clustering based on frequent weighted utility itemsets (FWUI). First, we calculate the Term Frequency (TF) for each term in the documents to create a quantitative matrix for all documents. The weights of terms in documents are based on the Inverse Document Frequency. Next, we use the Modification Weighted Itemset Tidset (MWIT)-FWUI algorithm for mining FWUI from the quantitative matrix and the weights of terms in documents. Finally, based on the frequent utility itemsets, we cluster documents using the Maximum Capturing (MC) algorithm. The proposed method has been evaluated on three data sets consisting of 1,600 documents covering 16 topics. The experimental results show that our method, using FWUI, improves the accuracy of text clustering compared to methods using FIs.
KEYWORDS
Frequent itemsets; frequent weighted utility itemsets; quantitative databases; text clustering; weight of terms
Introduction
Text clustering is widely studied in text mining due to its important roles in many applications such as spam filtering, claims investigation, and opinion monitoring. Researchers have exploited different approaches for text clustering, including applying common clustering algorithms to the text domain, utilizing the nature of word patterns/context, and probabilistic approaches (Aggarwal and Zhai 2012). In this article, we approach the text clustering problem from the patterns of words in documents, specifically utilizing itemsets in text clustering. Beil, Ester, and Xu (2002) introduced the frequent term-based clustering (FTC) approach for text clustering based on frequent terms. Their experimental results show that, compared to bisecting K-means (Steinbach, Karypis, and Kumar 2000), FTC achieves higher accuracy and faster processing.
CONTACT Bay Vo vodinhbay@tdt.edu.vn Division of Data Science, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam
Their work inspired a new line of approaches for text clustering with some variations, such as Frequent Itemset-based Hierarchical Clustering (FIHC), Clustering based on Maximal Sequences (CMS), and Clustering based on Frequent Word Sequences (CFWS). Zhang et al. (2010) analyzed some disadvantages of these approaches: (1) FTC (Beil, Ester, and Xu 2002) causes isolated documents; (2) FIHC (Fung, Wang, and Ester 2003) cannot solve cluster conflicts; (3) CMS (Hernández-Reyes et al. 2006) depends on the effectiveness of document representation; and (4) CFWS (Li, Chung, and Holt 2008) may produce trivial clustering results. To overcome these issues, Zhang et al. (2010) proposed the Maximum Capturing (MC) approach using frequent itemsets (FIs). In practice, previous works mainly focus on whether a term occurs in documents and count the frequencies of items in itemsets. In this article, in addition to considering the frequencies of itemsets as in previous works, we consider the weights of terms in documents to improve the performance of MC. First, we utilize Term Frequency-Inverse Document Frequency (TF-IDF) as weights to mine frequent weighted utility itemsets (FWUIs) from a database of text documents. Then, the resulting FWUIs are used with the MC approach for text clustering. We evaluated our proposed method on three data sets, and the experimental results show that FWUI improves the performance of MC for text clustering.
The contributions of our work are as follows:
- Generating a quantitative matrix for a collection of documents, where TF-IDF (Term Frequency-Inverse Document Frequency) is used as the term weight;
- Applying the MWIT-FWUI (Modification Weighted Itemset Tidset-Frequent Weighted Utility Itemset) algorithm for mining frequent utility itemsets from the weighted matrix;
- Applying the MC (Maximum Capturing) approach for text clustering;
- Evaluating our system on new data sets to measure its performance.

The rest of this article is organized as follows: Section "Related Concepts" reviews some related concepts used in this article; Section "Related Work" outlines related work on frequent itemset mining and frequent itemset-based text clustering; Section "Text Clustering Based on Frequent Weighted Utility Itemsets" describes our proposal; Section "Experiments" presents the experiments and evaluation of our approach in comparison to previous methods; and Section "Conclusions and Future Work" concludes this article and introduces some directions for future work.
Related Concepts
Quantitative Transaction Databases
A quantitative transaction database QD (Vo, Le, and Jung 2012) is defined as a triple QD = ⟨T, I, W⟩, which contains a set of transactions T = {t1, t2, …, tm}, a set of items I = {i1, i2, …, in}, and a set of weights W = {w1, w2, …, wn} corresponding to the items in I. Each transaction has the form tk = {xk1, xk2, …, xkn}, where xki is the quantity of the i-th item in transaction tk.
Intuitively, Table 1 shows an example of a quantitative transaction database. There are six transactions T = {t1, t2, …, t6} and five items I = {A, B, C, D, E}. Transaction t1 = {2, 0, 3, 0, 4} is interpreted as follows: in transaction t1, a customer purchases two units of item A, three units of item C, and four units of item E, but none of item B or D.
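For concreteness, such a database can be held in memory as a quantity matrix plus a per-item weight vector. The following is a minimal sketch; the variable names are ours, and all values except t1's row (taken from the example above) are illustrative placeholders, since Table 1 is not fully reproduced here.

```python
# A quantitative transaction database QD = <T, I, W> as plain Python data.
items = ["A", "B", "C", "D", "E"]        # the item set I
weights = [0.6, 0.9, 0.3, 0.5, 0.7]      # W: one weight per item (illustrative values)
transactions = [                          # T: one row of quantities per transaction
    [2, 0, 3, 0, 4],                      # t1: two A, three C, four E (as in Table 1)
    [0, 1, 0, 2, 0],                      # t2 ... t6 are placeholders, since Table 1's
    [1, 0, 2, 0, 1],                      # remaining rows are not reproduced here
]
```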
Term Frequency-Inverse Document Frequency (TF-IDF)
The TF-IDF (Salton and McGill 1986) of a word is a score indicating the importance of that word/term in a document with regard to a collection of documents. This score is the product of Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency, annotated as tf(t, d), is the relative number of occurrences of a term t in a document d, computed by the following formula:

$$tf(t, d) = \frac{n(t, d)}{n(d)} \qquad (1)$$

where n(t, d) is the number of occurrences of term t in document d and n(d) is the total number of occurrences of all terms in document d.
IDF, annotated as idf(t, D), measures the informativeness of the term t in a collection of documents D. It is calculated as the logarithmically scaled inverse fraction of the number of documents in the corpus that contain the term t:

$$idf(t, D) = \log\frac{|D|}{df(t, D)} \qquad (2)$$

where |D| is the number of documents in D and df(t, D) is the number of documents in D containing term t.

The IDF score of a term thus indicates its importance within the collection of documents D; i.e., rare terms have high scores and frequent terms have low scores.
Table 1. An example of a quantitative transaction database.
TF-IDF is then the product of Term Frequency and Inverse Document Frequency:

$$tfidf(t, d, D) = tf(t, d) \times idf(t, D) \qquad (3)$$
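As a concrete reading of formulas (1)-(3), the following Python sketch computes TF-IDF for tokenized documents. The function and variable names are ours; the paper itself does not prescribe an implementation.

```python
import math

def tf(term: str, doc: list) -> float:
    """Formula (1): n(t, d) / n(d), the share of doc's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term: str, corpus: list) -> float:
    """Formula (2): log(|D| / df(t, D)), the log-scaled inverse document fraction."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term: str, doc: list, corpus: list) -> float:
    """Formula (3): the product tf(t, d) * idf(t, D)."""
    return tf(term, doc) * idf(term, corpus)

# Tiny usage example with a three-document corpus.
corpus = [["art", "paint", "paint"], ["ball", "match"], ["paint", "match"]]
print(tf_idf("paint", corpus[0], corpus))  # (2/3) * log(3/2) ≈ 0.27
```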
Related Work
Frequent Itemset Mining Approaches
There are many approaches for mining FIs. Agrawal and Srikant (1994) introduced Apriori for mining association rules from a database of sales transactions. Apriori is the most basic join-based algorithm; it identifies the frequent individual items in the database and extends the size of itemsets as long as they remain frequent. Soon after, Park, Chen, and Yu (1995) proposed the Direct Hashing and Pruning (DHP) algorithm to optimize Apriori by pruning candidate itemsets in each iteration and trimming the transactions.
Another branch of approaches applies tree-based algorithms based on the concept of set enumeration. In this strategy, candidates are explored using a subgraph of the lattice of itemsets. As such, the problem of frequent itemset generation becomes one of constructing an enumeration or lexicographic tree. Agrawal, Imieliński, and Swami (1993) introduced a simple version of a tree-based algorithm for mining association rules of items in large databases, called the AIS algorithm. The AIS algorithm constructs trees in a level-wise fashion, and itemsets at each level are counted using a transaction database. Agarwal, Aggarwal, and Prasad (2001) proposed the TreeProjection algorithm to optimize the counting work at the lower levels of a tree by reusing the counting work at previous levels. Zaki et al. (1997) proposed Eclat (an IT-tree approach) for quickly mining association rules, in which the database is scanned only once and candidates are not generated. Eclat thus achieves better and faster performance than previous algorithms that require the generation of candidates. Similarly, Han, Pei, and Yin (2000) introduced FP-Growth (an FP-tree-based approach) to mine frequent patterns without candidate generation, thus improving processing time and saving memory. Constraint-based approaches for mining have also been developed in recent years (Duong, Truong, and Vo 2014; Truong, Duong, and Ngan 2016). Tao, Murtagh, and Farid (2003) proposed the WARM (Weighted Association Rule Mining) approach to discover significant relationships in transaction databases, in which the weights of items are integrated into the mining process.
More recently, Vo, Tran, and Ngo (2013) proposed FWI (a WIT-tree-based approach) for quickly mining frequent weighted itemsets from weighted item transaction databases. Vo et al. (2013) also introduced the FWCI approach, an IT-tree-based approach for mining frequent weighted closed itemsets. The FWCI approach has since been improved by exploiting diffsets and developing features for the fast removal of itemsets that are not closed (Vo 2017). Many proposals have been made for the problem of frequent weighted (closed) itemset mining, which is concerned with the weights (or benefit) of items but not their quantities. Vo, Le, and Jung (2012) therefore introduced FWUI based on the MWIT-tree. FWUI is an extension of FWI based on the weighted utility of items for association rule mining, which is a development of frequent weighted itemsets. Mining FWUI considers both the quantities and the weights of items.
Frequent Itemsets-Based Text Clustering Approaches
Beil, Ester, and Xu (2002) proposed the FTC algorithm, which works in a bottom-up way. FTC starts with an empty set and then continuously enrolls one element from the remaining frequent term sets until all items are assigned to clusters. At each step, FTC selects the remaining FI that has the minimum overlap with the other cluster candidates.
Hernández-Reyes et al. (2006) introduced a Maximal Frequent Sequence (MFS)-based approach for document clustering (the CMS approach). Their approach uses maximal frequent sequences as features in a vector space model and applies the K-means algorithm with cosine similarity to cluster documents.
Li, Chung, and Holt (2008) proposed the CFWS approach, where documents are treated as sequences of meaningful words instead of bags of words, and clustering is based on the differences between the documents and the transaction data set. Zhang et al. (2010) proposed the Maximum Capturing (MC) approach for text clustering based on frequent itemsets. MC assumes that two documents should be clustered together if they have maximum similarity.
Text Clustering Based on Frequent Weighted Utility Itemsets
Figure 1 shows an overview of our approach for text clustering using FWUI. First, the input documents are preprocessed into sets of terms. Second, a weight is assigned to each term to indicate its importance in the document with regard to the corpus; the output of this step is a weight matrix of all terms from all documents. Third, from the weight matrix, we extract utility itemsets, which are weighted by their benefit to the contents of documents. Finally, documents are clustered based on the FWUI. The following sections explain each step in detail: Section "Preprocess Documents" describes the preprocessing of text; Section "Algorithm for Mining Frequent Weighted Utility Itemsets" describes how weights are assigned to terms and FWUI are extracted; and Section "Text Clustering Algorithm" explains the text clustering algorithm.
Preprocess Documents
This step transforms each document into a set of words. First, all documents are tokenized into words. Note that tokenization is not the same for all languages due to their different characteristics; for example, English text is separated by spaces, while Vietnamese text is not.

Figure 2 presents an example of the tokenization of a Vietnamese sentence and an English sentence with the same meaning. We assume that the task of tokenization is done by employing existing tools.

When all documents are tokenized, we eliminate stopwords, which serve grammatical roles or are not informative in the sentence. Figure 3 shows the same sentences as above after stopwords are removed.
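For illustration, a minimal preprocessing pipeline along these lines might look as follows. The whitespace tokenizer and the tiny stopword list are placeholders for the real tools and lists (Vietnamese in particular needs a dedicated word segmenter), not the ones used in the paper.

```python
# A minimal preprocessing sketch: tokenize, then drop stopwords.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "and"}  # illustrative subset

def preprocess(document: str) -> list:
    tokens = document.lower().split()   # placeholder for a language-specific tokenizer
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The cat is in the garden"))  # ['cat', 'garden']
```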
Algorithm for Mining Frequent Weighted Utility Itemsets
As discussed in the first section, previous approaches consider either the weights of items or the weights of items in transactions. For a quantitative database constructed from documents, we consider only the weights of items in transactions.
Figure 2. An example of tokenization for Vietnamese and English texts
Figure 3. An example of eliminating stopwords for Vietnamese and English texts
Figure 1. Diagram of text clustering using frequent weighted utility itemsets
Hence, we apply the MWIT-FWUI algorithm for mining FWUI. We modify the algorithm to mine FWUI using a matrix of term weights from documents, specifically by (1) changing the matrix of term frequencies into a matrix of term weights, and (2) using a new way to mine frequent utility itemsets based on the matrix of term weights. The pseudocode of the proposed MWIT-FWUI algorithm is presented in Algorithm 1, with the details of the scores explained below.
Vo, Le, and Jung (2012) defined the transaction weighted utility twu of a transaction tk as

$$twu(t_k) = \frac{\sum_{i_j \in S(t_k)} w_j \times x_{k i_j}}{|t_k|} \qquad (4)$$

where S(tk) is the set of items present in tk, x_{k i_j} is the quantity of item ij in transaction tk, wj is the weight of item ij, and |tk| is the total number of items in transaction tk.
The weighted utility support wus of an itemset X is calculated as

$$wus(X) = \frac{\sum_{t_k \in t(X)} twu(t_k)}{\sum_{t_k \in T} twu(t_k)} \qquad (5)$$

where t(X) is the set of transactions containing X.
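A direct, unoptimized reading of formulas (4) and (5) in Python might look as follows. The names are ours, and we read |tk| as the number of items present in the transaction, per the definition above.

```python
def twu(quantities: list, weights: list) -> float:
    """Formula (4): sum of w_j * x_kij over the items present in the
    transaction, divided by |t_k|, the number of items present."""
    present = [q * w for q, w in zip(quantities, weights) if q > 0]
    return sum(present) / len(present)

def wus(itemset: set, db: list, weights: list) -> float:
    """Formula (5): total twu of the transactions containing `itemset`,
    relative to the total twu of the whole database."""
    total = sum(twu(t, weights) for t in db)
    supporting = (t for t in db if all(t[i] > 0 for i in itemset))
    return sum(twu(t, weights) for t in supporting) / total

# Toy run: two transactions over five items, illustrative weights.
db = [[2, 0, 3, 0, 4], [1, 1, 0, 0, 2]]
w = [0.6, 0.9, 0.3, 0.5, 0.7]
print(wus({0, 4}, db, w))  # both rows contain items 0 and 4 -> 1.0
```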
Text Clustering Algorithm
Algorithm 2 describes our clustering algorithm using FWUI.

First, we construct a similarity matrix A, where each element Aij is the number of itemsets common to documents di and dj.

Second, we find the maximum and minimum similarities that are nonzero.
Algorithm 1. Mining frequent weighted utility itemsets
Input: quantitative transaction database QD = ⟨T, I, W⟩ and threshold minwus
Output: set U of frequent weighted utility itemsets that satisfy the threshold minwus
Method:
1. for each document t ∈ T and each term i ∈ I do
2.   compute tfidf(i, t, T); // using formulas (1), (2), and (3)
3. for each document t ∈ T do
4.   compute twu(t); // using formula (4)
5. for each term i ∈ I do
6.   compute wus(i); // using formula (5)
7. P ← {i | i ∈ I ∧ wus(i) ≥ minwus};
8. U ← MWIT-FWUI(P, minwus);
9. function MWIT-FWUI(itemsets P, minwus)
10. for each wi ∈ P do
11.   U ← U ∪ {wi};
12.   Pi ← ∅;
13.   for each wj ∈ P with j > i do
14.     X ← wi ∪ wj;
15.     compute wus(X); // using formula (5)
16.     if wus(X) ≥ minwus then
17.       Pi ← Pi ∪ {X};
18.   U ← MWIT-FWUI(Pi, minwus);
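To make the recursion in lines 9-18 concrete, here is a compact Python sketch of the same search. This is our own simplification, not the authors' MWIT-tree implementation; `wus_of` stands for a formula-(5) scorer such as the one sketched earlier.

```python
def mine_fwui(candidates: list, min_wus: float, wus_of) -> list:
    """Lines 9-18 of Algorithm 1: keep each candidate, join it with the
    candidates after it, and recurse on the joins that stay frequent."""
    found = []
    for i, x in enumerate(candidates):
        found.append(x)                      # line 11: x is already frequent
        extensions = []                      # line 12: Pi <- empty set
        for y in candidates[i + 1:]:         # line 13
            joined = x | y                   # line 14: X = wi U wj
            if wus_of(joined) >= min_wus:    # lines 15-17
                extensions.append(joined)
        found.extend(mine_fwui(extensions, min_wus, wus_of))  # line 18
    return found

# Toy run with stubbed wus scores (illustrative numbers only).
scores = {frozenset({0}): 0.6, frozenset({1}): 0.5, frozenset({0, 1}): 0.4}
wus_of = lambda s: scores.get(s, 0.0)
print(mine_fwui([frozenset({0}), frozenset({1})], 0.4, wus_of))
# -> the three FWUIs {0}, {0, 1}, {1}
```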
Third, if the maximum value is equal to the minimum value, all unclustered documents are grouped into a new cluster. Otherwise, i.e., if the maximum value is greater than the minimum value, then for each pair with maximum similarity, if either document in the pair already belongs to a cluster, the other document in the pair is assigned to the same cluster; if neither does, the pair forms a new cluster.

The algorithm repeats these three steps until all documents are grouped into clusters.
Illustration Example
Given a quantitative transaction database covering two topics, as shown in Table 3, where each document is treated as a transaction, we have the transaction/document set T = {d1, d2, …, d9} and the item set I = {"Paint", "Art", "Place", "Ball", "Match"}. In this database, d1 = {2, 0, 3, 0, 4} means that d1 contains two occurrences of "Paint," three occurrences of "Place," and four occurrences of "Match," and does not contain "Art" or "Ball."

Table 3. Database of word occurrences in nine documents.
Algorithm 2. Text clustering using FWUI
Input: set of documents D = {d1, d2, …, dn} and frequent weighted utility itemsets W
Output: clusters of documents C = {c1, c2, …, cm}
Method:
1. construct the similarity matrix A, where Aij is the number of itemsets common to documents di and dj;
2. cluster ← 0; // the current cluster number
3. DP ← {(di, dj) | i < j ∧ di, dj ∈ D}; // all document pairs
4. while DP ≠ ∅ do
5.   min ← min(A);
6.   max ← max(A);
7.   if max = min then
8.     group all unclustered documents into a new cluster;
9.   else // max > min
10.    P ← {(di, dj) | i < j ∧ Aij = max}; // pairs whose similarity is max
11.    for each p = (di, dj) ∈ P do
12.      if di, dj ∉ c, ∀c ∈ C then // neither document is clustered yet
13.        cluster ← cluster + 1;
14.        Aij ← cluster;
15.        DP ← DP \ {(di, dj)};
16.      if dk ∈ c for some k ∈ {i, j}, c ∈ C then // one document is already clustered
17.        c ← c ∪ {di, dj};
18.        Aij ← cluster number of c;
19.        DP ← DP \ {(di, dj)};
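The following Python sketch captures the core of this procedure in a simplified form; it is our own rendering, not the authors' implementation. Visiting pairs in order of decreasing similarity plays the role of repeatedly taking the maximum in lines 5-11, and documents that match nothing end up in a final catch-all cluster, as in line 8.

```python
from itertools import combinations

def mc_cluster(doc_itemsets: list) -> list:
    """Cluster documents by Maximum Capturing: similarity = number of
    frequent weighted utility itemsets two documents share."""
    n = len(doc_itemsets)
    sim = {(i, j): len(doc_itemsets[i] & doc_itemsets[j])
           for i, j in combinations(range(n), 2)}
    clusters = []                       # list of sets of document indices
    assigned = {}                       # document index -> cluster index
    for (i, j), s in sorted(sim.items(), key=lambda kv: -kv[1]):
        if s == 0:
            break                       # remaining pairs share no itemsets
        if i in assigned and j in assigned:
            continue                    # both already placed; skip the pair
        if i in assigned or j in assigned:
            c = assigned[i] if i in assigned else assigned[j]
            clusters[c] |= {i, j}       # join the existing cluster
        else:
            clusters.append({i, j})     # open a new cluster for the pair
            c = len(clusters) - 1
        assigned[i] = assigned[j] = c
    leftover = set(range(n)) - set(assigned)
    if leftover:                        # documents similar to nothing
        clusters.append(leftover)
    return clusters

# Toy run: documents 0 and 1 share an itemset, document 2 shares none.
docs = [{frozenset({"paint"})}, {frozenset({"paint"})}, {frozenset({"ball"})}]
print(mc_cluster(docs))                 # -> [{0, 1}, {2}]
```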
Mining Frequent Weighted Utility Itemsets
Step 1: Formula (1) is used to calculate the Term Frequency (TF) scores of the words in each document.

For example, the TF scores of the item "Paint" in documents d1, d4, d6, and d7 are computed in this way; for instance, since d1 = {2, 0, 3, 0, 4} contains nine word occurrences in total, tf(Paint, d1) = 2/9, and likewise tf(Paint, d6) = 1/6 and tf(Paint, d7) = 2/7. The TF scores of all words are calculated in the same way and are shown in Table 4.

Step 2: Since the Inverse Document Frequency (IDF) of a word is a single database-wide score indicating its importance, we use it as the weight of the word. Formula (2) is used to calculate the IDF scores of words; for example, the IDF scores of the words "Paint," "Art," "Place," "Ball," and "Match" are computed from the fraction of the nine documents containing each word. The weights of all items calculated in this way are shown in Table 5.
Table 4. TF scores of all words in each document.

Table 5. IDF scores of all words in the database.
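As a quick sanity check on the Step 1 numbers, the snippet below recomputes tf(Paint, d1); only d1's row of Table 3 is reproduced in the text, so the other documents are omitted.

```python
# d1's word counts, taken from the example database above.
d1 = {"Paint": 2, "Art": 0, "Place": 3, "Ball": 0, "Match": 4}

def tf(term: str, counts: dict) -> float:
    """Formula (1): occurrences of `term` over all word occurrences in the document."""
    return counts[term] / sum(counts.values())

print(tf("Paint", d1))  # 2/9, matching tf(Paint, d1) above
```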