Cybernetics and Systems
An International Journal
ISSN: 0196-9722 (Print) 1087-6553 (Online) Journal homepage: http://www.tandfonline.com/loi/ucbs20
Text Clustering Using Frequent Weighted Utility Itemsets
Tram Tran, Bay Vo, Tho Thi Ngoc Le & Ngoc Thanh Nguyen
To cite this article: Tram Tran, Bay Vo, Tho Thi Ngoc Le & Ngoc Thanh Nguyen (2017) Text Clustering Using Frequent Weighted Utility Itemsets, Cybernetics and Systems, 48:3, 193-209
To link to this article: http://dx.doi.org/10.1080/01969722.2016.1276774
Published online: 02 Mar 2017.
Text Clustering Using Frequent Weighted Utility Itemsets
Tram Tran a, Bay Vo b,c, Tho Thi Ngoc Le d, and Ngoc Thanh Nguyen e
a University of Information Technology, Vietnam National University, Ho Chi Minh City, Vietnam;
b Division of Data Science, Ton Duc Thang University, Ho Chi Minh City, Vietnam; c Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam; d Faculty of Information Technology,
Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam; e Faculty of Computer Science and Management, Wroclaw University of Science and Technology, Wroclaw, Poland
ABSTRACT
Text clustering is an important topic in text mining. One of the most effective methods for text clustering is an approach based on frequent itemsets (FIs), and thus there are many related algorithms that aim to improve the accuracy of text clustering. However, these do not focus on the weights of terms in documents, even though the frequency of each term in each document has a great impact on the results. In this work, we propose a new method for text clustering based on frequent weighted utility itemsets (FWUI). First, we calculate the Term Frequency (TF) for each term in the documents to create a quantitative matrix for all documents. The weights of terms in documents are based on the Inverse Document Frequency. Next, we use the Modification Weighted Itemset Tidset (MWIT)-FWUI algorithm for mining FWUI from the quantitative matrix and the weights of terms in documents. Finally, based on the frequent utility itemsets, we cluster documents using the Maximum Capturing (MC) algorithm. The proposed method has been evaluated on three data sets consisting of 1,600 documents covering 16 topics. The experimental results show that our method, using FWUI, improves the accuracy of text clustering compared to methods using FIs.
KEYWORDS
Frequent itemsets; frequent weighted utility itemsets; quantitative databases; text clustering; weight of terms
Introduction
Text clustering is widely studied in text mining due to its important roles in many applications such as spam filtering, claims investigation, and opinion monitoring. Researchers have exploited different approaches for text clustering, including applying common clustering algorithms to the text domain, utilizing the nature of word patterns/context, and probabilistic approaches (Aggarwal and Zhai 2012). In this article, we approach the text clustering problem from the patterns of words in documents, specifically utilizing itemsets in text clustering. Beil, Ester, and Xu (2002) introduced the frequent term-based clustering (FTC) approach for text clustering based on frequent terms. Their experimental results show that, compared to bisecting K-means (Steinbach, Karypis, and Kumar 2000), FTC achieves higher accuracy and faster processing.
CONTACT Bay Vo vodinhbay@tdt.edu.vn Division of Data Science, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam
Their work inspired a new line of approaches for text clustering with some variations, such as Frequent Itemset-based Hierarchical Clustering (FIHC), Clustering based on Maximal Sequences (CMS), and Clustering based on Frequent Word Sequences (CFWS). Zhang et al. (2010) analyzed some disadvantages of these approaches: (1) FTC (Beil, Ester, and Xu 2002) causes isolated documents; (2) FIHC (Fung, Wang, and Ester 2003) cannot solve cluster conflicts; (3) CMS (Hernández-Reyes et al. 2006) depends on the effectiveness of document representation; and (4) CFWS (Li, Chung, and Holt 2008) may produce trivial clustering results. To overcome these issues, Zhang et al. (2010) proposed the Maximum Capturing (MC) approach using frequent itemsets (FIs). In practice, previous works mainly focus on whether a term occurs in documents and count the frequencies of items in itemsets. In this article, in addition to considering the frequencies of itemsets as in previous works, we consider the weights of terms in documents to improve the performance of MC. First, we utilize Term Frequency-Inverse Document Frequency (TF-IDF) as weights to mine frequent weighted utility itemsets (FWUIs) from a database of text documents. Then, the resulting FWUIs are used with the MC approach for text clustering. We evaluated our proposed method on three data sets, and the experimental results show that FWUI improves the performance of MC for text clustering.
The contributions of our work are as follows:
- Generating a quantitative matrix for a collection of documents, where TF-IDF (Term Frequency-Inverse Document Frequency) is used as the term weight;
- Applying the MWIT-FWUI (Modification Weighted Itemset Tidset-Frequent Weighted Utility Itemset) algorithm for mining frequent utility itemsets from the weighted matrix;
- Applying the MC (Maximum Capturing) approach for text clustering;
- Evaluating our system on new data sets to measure its performance.

The rest of this article is organized as follows: Section "Related Concepts" reviews some related concepts used in this article; Section "Related Work" outlines related work on frequent itemset mining and frequent itemset-based text clustering; Section "Text Clustering Based on Frequent Weighted Utility Itemsets" describes our proposal; Section "Experiments" presents the experiments and evaluation of our approach in comparison to previous methods; and Section "Conclusions and Future Work" concludes this article and introduces some directions for future work.
Related Concepts
Quantitative Transaction Databases
A quantitative transaction database QD (Vo, Le, and Jung 2012) is defined as a triple QD = ⟨T, I, W⟩, which contains a set of transactions T = {t1, t2, …, tm}, a set of items I = {i1, i2, …, in}, and a set of weights W = {w1, w2, …, wn} corresponding to the items in I. Each transaction has the form tk = {xk1, xk2, …, xkn}, where xki is the quantity of the i-th item in transaction tk.
Intuitively, Table 1 shows an example of a quantitative transaction database. There are six transactions T = {t1, t2, …, t6} and five items I = {A, B, C, D, E}. Transaction t1 = {2, 0, 3, 0, 4} is interpreted as follows: in transaction t1, a customer purchases two units of item A, three units of item C, and four units of item E, but none of item B or D.
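For concreteness, such a database can be held in memory as a quantity matrix plus a per-item weight vector. The following is a minimal sketch; the variable names are ours, and all values except t1's row (taken from the example above) are illustrative placeholders, since Table 1 is not fully reproduced here.

```python
# A quantitative transaction database QD = <T, I, W> as plain Python data.
items = ["A", "B", "C", "D", "E"]        # the item set I
weights = [0.6, 0.9, 0.3, 0.5, 0.7]      # W: one weight per item (illustrative values)
transactions = [                          # T: one row of quantities per transaction
    [2, 0, 3, 0, 4],                      # t1: two A, three C, four E (as in Table 1)
    [0, 1, 0, 2, 0],                      # t2 ... t6 are placeholders, since Table 1's
    [1, 0, 2, 0, 1],                      # remaining rows are not reproduced here
]
```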
Term Frequency-Inverse Document Frequency (TF-IDF)
The TF-IDF (Salton and McGill 1986) of a word is a score indicating the importance of that word/term in a document with regard to a collection of documents. This score is the product of Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency, annotated as tf(t, d), is the relative number of occurrences of a term t in a document d, computed by the following formula:

$$tf(t, d) = \frac{n(t, d)}{n(d)} \qquad (1)$$

where n(t, d) is the number of occurrences of term t in document d and n(d) is the total number of occurrences of all terms in document d.
IDF, annotated as idf(t, D), measures the informativeness of the term t in a collection of documents D. It is calculated as the logarithmically scaled inverse fraction of the number of documents in the corpus that contain the term t:

$$idf(t, D) = \log\frac{|D|}{df(t, D)} \qquad (2)$$

where |D| is the number of documents in D and df(t, D) is the number of documents in D containing term t.

The IDF score of a term thus indicates its importance within the collection of documents D; i.e., rare terms have high scores and frequent terms have low scores.
Table 1. An example of a quantitative transaction database.
TF-IDF is then the product of Term Frequency and Inverse Document Frequency:

$$tfidf(t, d, D) = tf(t, d) \times idf(t, D) \qquad (3)$$
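As a concrete reading of formulas (1)-(3), the following Python sketch computes TF-IDF for tokenized documents. The function and variable names are ours; the paper itself does not prescribe an implementation.

```python
import math

def tf(term: str, doc: list) -> float:
    """Formula (1): n(t, d) / n(d), the share of doc's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term: str, corpus: list) -> float:
    """Formula (2): log(|D| / df(t, D)), the log-scaled inverse document fraction."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term: str, doc: list, corpus: list) -> float:
    """Formula (3): the product tf(t, d) * idf(t, D)."""
    return tf(term, doc) * idf(term, corpus)

# Tiny usage example with a three-document corpus.
corpus = [["art", "paint", "paint"], ["ball", "match"], ["paint", "match"]]
print(tf_idf("paint", corpus[0], corpus))  # (2/3) * log(3/2) ≈ 0.27
```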
Related Work
Frequent Itemset Mining Approaches
There are many approaches for mining FIs. Agrawal and Srikant (1994) introduced Apriori for mining association rules from a database of sales transactions. Apriori is the most basic join-based algorithm; it identifies the frequent individual items in the database and extends the size of itemsets as long as they remain frequent. Soon after, Park, Chen, and Yu (1995) proposed the Direct Hashing and Pruning (DHP) algorithm to optimize Apriori by pruning candidate itemsets in each iteration and trimming the transactions.
Another branch of approaches applies tree-based algorithms based on the concept of set enumeration. In this strategy, candidates are explored using a subgraph of the lattice of itemsets. As such, the problem of frequent itemset generation becomes one of constructing an enumeration or lexicographic tree. Agrawal, Imieliński, and Swami (1993) introduced a simple version of a tree-based algorithm for mining association rules of items in large databases, called the AIS algorithm. The AIS algorithm constructs trees in a level-wise fashion, and itemsets at each level are counted using a transaction database. Agarwal, Aggarwal, and Prasad (2001) proposed the TreeProjection algorithm to optimize the counting work at the lower levels of a tree by reusing the counting work at previous levels. Zaki et al. (1997) proposed Eclat (an IT-tree approach) for quickly mining association rules, in which the database is scanned only once and candidates are not generated. Eclat thus achieves better and faster performance than previous algorithms that require the generation of candidates. Similarly, Han, Pei, and Yin (2000) introduced FP-Growth (an FP-tree-based approach) to mine frequent patterns without candidate generation, thus improving processing time and saving memory. Constraint-based approaches for mining have also been developed in recent years (Duong, Truong, and Vo 2014; Truong, Duong, and Ngan 2016). Tao, Murtagh, and Farid (2003) proposed the WARM (Weighted Association Rule Mining) approach to discover significant relationships in transaction databases, in which the weights of items are integrated into the mining process.
More recently, Vo, Tran, and Ngo (2013) proposed FWI (a WIT-tree-based approach) for quickly mining frequent weighted itemsets from weighted item transaction databases. Vo et al. (2013) also introduced the FWCI approach, an IT-tree-based approach for mining frequent weighted closed itemsets. The FWCI approach has since been improved by exploiting diffsets and developing features for the fast removal of itemsets that are not closed (Vo 2017). Many proposals have been made for the problem of frequent weighted (closed) itemset mining, which is concerned with the weights (or benefit) of items but not their quantities. Vo, Le, and Jung (2012) therefore introduced FWUI based on the MWIT-tree. FWUI is an extension of FWI based on the weighted utility of items for association rule mining, which is a development of frequent weighted itemsets. Mining FWUI considers both the quantities and the weights of items.
Frequent Itemsets-Based Text Clustering Approaches
Beil, Ester, and Xu (2002) proposed the FTC algorithm, which works in a bottom-up way. FTC starts with an empty set and then continuously enrolls one element from the remaining frequent term sets until all items are assigned to clusters. At each step, FTC selects the remaining FI that has the minimum overlap with the other cluster candidates.
Hernández-Reyes et al. (2006) introduced a Maximal Frequent Sequence (MFS)-based approach for document clustering (the CMS approach). Their approach uses maximal frequent sequences as features in a vector space model and applies the K-means algorithm with cosine similarity to cluster documents.
Li, Chung, and Holt (2008) proposed the CFWS approach, where documents are treated as sequences of meaningful words instead of bags of words, and clustering is based on the differences between the documents and the transaction data set. Zhang et al. (2010) proposed the Maximum Capturing (MC) approach for text clustering based on frequent itemsets. MC assumes that two documents should be clustered together if they have maximum similarity.
Text Clustering Based on Frequent Weighted Utility Itemsets
Figure 1 shows an overview of our approach for text clustering using FWUI. First, the input documents are preprocessed into sets of terms. Second, a weight is assigned to each term to indicate its importance in the document with regard to the corpus; the output of this step is a weight matrix of all terms from all documents. Third, from the weight matrix, we extract utility itemsets, which are weighted by their benefit to the contents of documents. Finally, documents are clustered based on the FWUI. The following sections explain each step in detail: Section "Preprocess Documents" describes the preprocessing of text; Section "Algorithm for Mining Frequent Weighted Utility Itemsets" describes how weights are assigned to terms and FWUI are extracted; and Section "Text Clustering Algorithm" explains the text clustering algorithm.
Preprocess Documents
This step transforms each document into a set of words. First, all documents are tokenized into words. Note that tokenization is not the same for all languages due to their different characteristics; for example, English text is separated by spaces, while Vietnamese text is not.

Figure 2 presents an example of the tokenization of a Vietnamese sentence and an English sentence with the same meaning. We assume that the task of tokenization is done by employing existing tools.

When all documents are tokenized, we eliminate stopwords, which serve grammatical roles or are not informative in the sentence. Figure 3 shows the same sentences as above after stopwords are removed.
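For illustration, a minimal preprocessing pipeline along these lines might look as follows. The whitespace tokenizer and the tiny stopword list are placeholders for the real tools and lists (Vietnamese in particular needs a dedicated word segmenter), not the ones used in the paper.

```python
# A minimal preprocessing sketch: tokenize, then drop stopwords.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "and"}  # illustrative subset

def preprocess(document: str) -> list:
    tokens = document.lower().split()   # placeholder for a language-specific tokenizer
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The cat is in the garden"))  # ['cat', 'garden']
```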
Algorithm for Mining Frequent Weighted Utility Itemsets
As discussed in the first section, previous approaches consider either the weights of items or the weights of items in transactions. For a quantitative database constructed from documents, we consider only the weights of items in transactions.
Figure 2. An example of tokenization for Vietnamese and English texts
Figure 3. An example of eliminating stopwords for Vietnamese and English texts
Figure 1. Diagram of text clustering using frequent weighted utility itemsets
Hence, we apply the MWIT-FWUI algorithm for mining FWUI. We modify the algorithm to mine FWUI using a matrix of term weights from documents, specifically by (1) changing the matrix of term frequencies into a matrix of term weights, and (2) using a new way to mine frequent utility itemsets based on the matrix of term weights. The pseudocode of the proposed MWIT-FWUI algorithm is presented in Algorithm 1, with the details of the scores explained below.
Vo, Le, and Jung (2012) defined the transaction weighted utility twu of a transaction tk as

$$twu(t_k) = \frac{\sum_{i_j \in S(t_k)} w_j \times x_{k i_j}}{|t_k|} \qquad (4)$$

where S(tk) is the set of items present in tk, x_{k i_j} is the quantity of item ij in transaction tk, wj is the weight of item ij, and |tk| is the total number of items in transaction tk.
The weighted utility support wus of an itemset X is calculated as

$$wus(X) = \frac{\sum_{t_k \in t(X)} twu(t_k)}{\sum_{t_k \in T} twu(t_k)} \qquad (5)$$

where t(X) is the set of transactions containing X.
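A direct, unoptimized reading of formulas (4) and (5) in Python might look as follows. The names are ours, and we read |tk| as the number of items present in the transaction, per the definition above.

```python
def twu(quantities: list, weights: list) -> float:
    """Formula (4): sum of w_j * x_kij over the items present in the
    transaction, divided by |t_k|, the number of items present."""
    present = [q * w for q, w in zip(quantities, weights) if q > 0]
    return sum(present) / len(present)

def wus(itemset: set, db: list, weights: list) -> float:
    """Formula (5): total twu of the transactions containing `itemset`,
    relative to the total twu of the whole database."""
    total = sum(twu(t, weights) for t in db)
    supporting = (t for t in db if all(t[i] > 0 for i in itemset))
    return sum(twu(t, weights) for t in supporting) / total

# Toy run: two transactions over five items, illustrative weights.
db = [[2, 0, 3, 0, 4], [1, 1, 0, 0, 2]]
w = [0.6, 0.9, 0.3, 0.5, 0.7]
print(wus({0, 4}, db, w))  # both rows contain items 0 and 4 -> 1.0
```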
Text Clustering Algorithm
Algorithm 2 describes our clustering algorithm using FWUI.

First, we construct a similarity matrix A, where each element Aij is the number of itemsets common to documents di and dj.

Second, we find the maximum and minimum similarities that are nonzero.
Algorithm 1. Mining frequent weighted utility itemsets
Input: quantitative transaction database QD = ⟨T, I, W⟩ and threshold minwus
Output: set U of frequent weighted utility itemsets that satisfy the threshold minwus
Method:
1. for each document t ∈ T and each term i ∈ I do
2.   compute tfidf(i, t, T); // using formulas (1), (2), and (3)
3. for each document t ∈ T do
4.   compute twu(t); // using formula (4)
5. for each term i ∈ I do
6.   compute wus(i); // using formula (5)
7. P ← {i | i ∈ I ∧ wus(i) ≥ minwus};
8. U ← MWIT-FWUI(P, minwus);
9. function MWIT-FWUI(itemsets P, minwus)
10. for each wi ∈ P do
11.   U ← U ∪ {wi};
12.   Pi ← ∅;
13.   for each wj ∈ P with j > i do
14.     X ← wi ∪ wj;
15.     compute wus(X); // using formula (5)
16.     if wus(X) ≥ minwus then
17.       Pi ← Pi ∪ {X};
18.   U ← MWIT-FWUI(Pi, minwus);
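To make the recursion in lines 9-18 concrete, here is a compact Python sketch of the same search. This is our own simplification, not the authors' MWIT-tree implementation; `wus_of` stands for a formula-(5) scorer such as the one sketched earlier.

```python
def mine_fwui(candidates: list, min_wus: float, wus_of) -> list:
    """Lines 9-18 of Algorithm 1: keep each candidate, join it with the
    candidates after it, and recurse on the joins that stay frequent."""
    found = []
    for i, x in enumerate(candidates):
        found.append(x)                      # line 11: x is already frequent
        extensions = []                      # line 12: Pi <- empty set
        for y in candidates[i + 1:]:         # line 13
            joined = x | y                   # line 14: X = wi U wj
            if wus_of(joined) >= min_wus:    # lines 15-17
                extensions.append(joined)
        found.extend(mine_fwui(extensions, min_wus, wus_of))  # line 18
    return found

# Toy run with stubbed wus scores (illustrative numbers only).
scores = {frozenset({0}): 0.6, frozenset({1}): 0.5, frozenset({0, 1}): 0.4}
wus_of = lambda s: scores.get(s, 0.0)
print(mine_fwui([frozenset({0}), frozenset({1})], 0.4, wus_of))
# -> the three FWUIs {0}, {0, 1}, {1}
```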
Third, if the maximum value is equal to the minimum value, all unclustered documents are grouped into a new cluster. Otherwise, i.e., if the maximum value is greater than the minimum value, then for each pair with maximum similarity, if either document in the pair already belongs to a cluster, the other document in the pair is assigned to the same cluster; if neither does, the pair forms a new cluster.

The algorithm repeats these three steps until all documents are grouped into clusters.
Illustration Example
Given a quantitative transaction database covering two topics, as shown in Table 3, where each document is treated as a transaction, we have the transaction/document set T = {d1, d2, …, d9} and the item set I = {"Paint", "Art", "Place", "Ball", "Match"}. In this database, d1 = {2, 0, 3, 0, 4} means that d1 contains two occurrences of "Paint," three occurrences of "Place," and four occurrences of "Match," and does not contain "Art" or "Ball."

Table 3. Database of word occurrences in nine documents.
Algorithm 2. Text clustering using FWUI
Input: set of documents D = {d1, d2, …, dn} and frequent weighted utility itemsets W
Output: clusters of documents C = {c1, c2, …, cm}
Method:
1. construct the similarity matrix A, where Aij is the number of itemsets common to documents di and dj;
2. cluster ← 0; // the current cluster number
3. DP ← {(di, dj) | i < j ∧ di, dj ∈ D}; // all document pairs
4. while DP ≠ ∅ do
5.   min ← min(A);
6.   max ← max(A);
7.   if max = min then
8.     group all unclustered documents into a new cluster;
9.   else // max > min
10.    P ← {(di, dj) | i < j ∧ Aij = max}; // pairs whose similarity is max
11.    for each p = (di, dj) ∈ P do
12.      if di, dj ∉ c, ∀c ∈ C then // neither document is clustered yet
13.        cluster ← cluster + 1;
14.        Aij ← cluster;
15.        DP ← DP \ {(di, dj)};
16.      if dk ∈ c for some k ∈ {i, j}, c ∈ C then // one document is already clustered
17.        c ← c ∪ {di, dj};
18.        Aij ← cluster number of c;
19.        DP ← DP \ {(di, dj)};
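The following Python sketch captures the core of this procedure in a simplified form; it is our own rendering, not the authors' implementation. Visiting pairs in order of decreasing similarity plays the role of repeatedly taking the maximum in lines 5-11, and documents that match nothing end up in a final catch-all cluster, as in line 8.

```python
from itertools import combinations

def mc_cluster(doc_itemsets: list) -> list:
    """Cluster documents by Maximum Capturing: similarity = number of
    frequent weighted utility itemsets two documents share."""
    n = len(doc_itemsets)
    sim = {(i, j): len(doc_itemsets[i] & doc_itemsets[j])
           for i, j in combinations(range(n), 2)}
    clusters = []                       # list of sets of document indices
    assigned = {}                       # document index -> cluster index
    for (i, j), s in sorted(sim.items(), key=lambda kv: -kv[1]):
        if s == 0:
            break                       # remaining pairs share no itemsets
        if i in assigned and j in assigned:
            continue                    # both already placed; skip the pair
        if i in assigned or j in assigned:
            c = assigned[i] if i in assigned else assigned[j]
            clusters[c] |= {i, j}       # join the existing cluster
        else:
            clusters.append({i, j})     # open a new cluster for the pair
            c = len(clusters) - 1
        assigned[i] = assigned[j] = c
    leftover = set(range(n)) - set(assigned)
    if leftover:                        # documents similar to nothing
        clusters.append(leftover)
    return clusters

# Toy run: documents 0 and 1 share an itemset, document 2 shares none.
docs = [{frozenset({"paint"})}, {frozenset({"paint"})}, {frozenset({"ball"})}]
print(mc_cluster(docs))                 # -> [{0, 1}, {2}]
```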
Mining Frequent Weighted Utility Itemsets
Step 1: Formula (1) is used to calculate the Term Frequency (TF) scores of the words in each document.

For example, the TF scores of the item "Paint" in documents d1, d4, d6, and d7 are computed in this way; for instance, since d1 = {2, 0, 3, 0, 4} contains nine word occurrences in total, tf(Paint, d1) = 2/9, and likewise tf(Paint, d6) = 1/6 and tf(Paint, d7) = 2/7. The TF scores of all words are calculated in the same way and are shown in Table 4.

Step 2: Since the Inverse Document Frequency (IDF) of a word is a single database-wide score indicating its importance, we use it as the weight of the word. Formula (2) is used to calculate the IDF scores of words; for example, the IDF scores of the words "Paint," "Art," "Place," "Ball," and "Match" are computed from the fraction of the nine documents containing each word. The weights of all items calculated in this way are shown in Table 5.
Table 4. TF scores of all words in each document.

Table 5. IDF scores of all words in the database.
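As a quick sanity check on the Step 1 numbers, the snippet below recomputes tf(Paint, d1); only d1's row of Table 3 is reproduced in the text, so the other documents are omitted.

```python
# d1's word counts, taken from the example database above.
d1 = {"Paint": 2, "Art": 0, "Place": 3, "Ball": 0, "Match": 4}

def tf(term: str, counts: dict) -> float:
    """Formula (1): occurrences of `term` over all word occurrences in the document."""
    return counts[term] / sum(counts.values())

print(tf("Paint", d1))  # 2/9, matching tf(Paint, d1) above
```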