MR. DUONG VAN HIEU
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
IN INFORMATION TECHNOLOGY
DEPARTMENT OF INFORMATION TECHNOLOGY
GRADUATE COLLEGE KING MONGKUT'S UNIVERSITY OF TECHNOLOGY NORTH BANGKOK
ACADEMIC YEAR 2016
COPYRIGHT OF KING MONGKUT'S UNIVERSITY OF TECHNOLOGY NORTH BANGKOK
Name : Mr. Duong Van Hieu
Thesis Title : A Combination of Graph-Based and Cell-Based Clustering Techniques for Big Datasets
Major Field : Information Technology (International Program), King Mongkut's University of Technology North Bangkok
Thesis Advisor : Associate Professor Dr. Phayung Meesad
Academic Year : 2016
Abstract
Big dataset analysis is challenging for data scientists. Big datasets cause technological challenges when setting up a big data project, in terms of choosing the right platform technologies and suitable algorithms. This study proposes a fast outlier detection algorithm for big datasets (Cell-RDOS) and two clustering algorithms for big datasets on a limited-memory computer (Cell-MST-based and Weighted Cell-MST-based). The Cell-RDOS algorithm is a combination of cell-based algorithms and a revised version of the ranking-based outlier detection algorithm with various depths (RDOS). The Cell-RDOS algorithm produces the same results as the RDOS algorithm while reducing its execution time by up to 99% when working with big datasets. The proposed clustering algorithms are combinations of cell-based algorithms, MST-based algorithms, and the K-means clustering algorithm. These two proposed algorithms outperform many other algorithms in terms of memory usage, accuracy, and speed. Firstly, they can reduce the required memory size by up to 99% compared to previous methods such as the Similarity-based, Decision-theoretic Rough Set, and MST-based algorithms.
(Total 101 pages)
Keywords : Big dataset, cell-based clustering, graph-based clustering,
minimum-spanning tree, outlier detection
Advisor
Abstract (Thai)

Analyzing big datasets is a challenge for data scientists. The challenge is to choose the right platform technologies and algorithms for a big data project, especially for big datasets containing outliers, which affect the analysis of the main groups of data. Outliers need to be detected and dealt with before the main data are analyzed. This thesis presents new techniques for big datasets: 1) a fast outlier detection algorithm called Cell-RDOS, and 2) two clustering algorithms, Cell-MST-based and Weighted Cell-MST-based. The Cell-RDOS outlier detection algorithm is a combination of the cell-based and RDOS techniques. The results show that Cell-RDOS detects outliers as well as the original RDOS method while reducing execution time by up to 99% when working with big datasets. The clustering algorithms developed here combine the cell-based, MST-based, and K-means techniques. The results show that both newly proposed techniques cluster better than conventional algorithms such as Similarity-based, Decision-theoretic Rough Set, and MST-based, using up to 99% less memory than before, with better clustering accuracy and faster processing.

(Total 101 pages)

Keywords : Big dataset, cell-based clustering, graph-based clustering, minimum spanning tree, outlier detection

Principal Thesis Advisor
ACKNOWLEDGEMENTS

Thanks to the Decision 911/QĐ-TTg signed on 17/06/2010 by the Vietnamese Prime Minister (known as Project 911) and the Vietnam International Education Development Department (VIED), who supported my scholarship during my three years of doctoral study in Thailand. Without their support, it would have been impossible for me to have a chance to finish my PhD study.

I would like to thank Mr. Gary Sheriff, the international coordinator at the Faculty of Information Technology, KMUTNB, and thanks to friends who helped me overcome the obstacles of overseas life.

Last but certainly not least, thanks to my parents who gave birth to me and brought me up, thanks to my siblings, and thanks to my ideal wife who took care of our two children during the three years of my study in Thailand.

Mr. Dương Văn Hiếu
TABLE OF CONTENTS
2.2 Difficulties of Working with Big Datasets
2.3 Graphs and Minimum-Cost Spanning Trees
3.4 A Fast Outlier Detection Algorithm for Big Datasets
3.5 A Cell-MST Clustering Algorithm for Big Datasets
3.6 A Weighted Cell-MST-Based Clustering Algorithm for Big Datasets
4.3 Results of the Proposed Cell-RDOS Algorithm and Analysis
4.4 Results of the Proposed Cell-MST-based Clustering Algorithm
4.5 Results of the Proposed Weighted Cell-MST-based Clustering Algorithm
LIST OF TABLES
2-2 An example of objects ranked in descending order of outlier scores
4-2 Datasets for cluster number estimation experiments
4-3 Comparison of matching results between the previous RDOS algorithm and the proposed Cell-RDOS algorithm with C_L = 29,000 cells
4-4 Comparison of executing time between the previous RDOS algorithm and the proposed Cell-RDOS algorithm with C_L = 29,000 cells
4-5 Comparison of required memory sizes of the previous similarity-based, MST-based methods and the proposed Cell-MST-based method
4-6 Comparison of values of T_0 and sizes of G_0 between the Prim Trajectory and the proposed 1st MST-based OTE method running on machine 1
4-7 Comparison of values of T_0 and sizes of G_0 between the Prim Trajectory and the proposed 1st MST-based OTE method running on machine 2
4-8 Estimated cluster numbers of the proposed Cell-MST-based method
4-9 Comparison of the best results of the QEM method and results of the proposed Cell-MST-based method
4-10 Comparison of the averages of executing time between the QEM method and the proposed Cell-MST-based method
4-11 Comparison of required memory sizes between similarity-based, MST-based and the proposed methods
4-12 Comparison of values of T_0 and sizes of G_0 between the Prim Trajectory method and the proposed 2nd MST-based OTE algorithm on machine 1
4-13 Comparison of values of T_0 and sizes of G_0 between the Prim Trajectory method and the proposed 2nd MST-based OTE algorithm on machine 2
4-14 Estimated cluster numbers of the proposed Weighted Cell-MST-based method
4-15 Comparison of the estimated cluster numbers of the proposed Cell-MST-based and Weighted Cell-MST-based methods
4-16 Comparison of the estimated results of the proposed Cell-MST-based method and Weighted Cell-MST-based method when clustering datasets
4-17 Comparison of the best results of the QEM method and results of the proposed Weighted Cell-MST-based method
4-18 Comparison of the averages of executing time between the QEM method and the proposed Weighted Cell-MST-based method
4-19 Comparison of the best results of the QEM method, results of the proposed Cell-MST-based and Weighted Cell-MST-based methods
LIST OF FIGURES
2-2 An example of a dataset having 2 clusters and 2 outliers
2-3 An example of hard clustering and fuzzy clustering
2-6 An illustration of relationships between a dataset and a graph
2-7 An illustration of a graph and a minimum-cost spanning tree
2-9 An illustration of k-distance(o) and reach-dist_k(p, o)
2-11 An example of k nearest neighborhood and influential space
2-12 A traditional partitioning of data clustering algorithms
2-13 Categories of clustering algorithms for big datasets
2-14 Poor results of random selection and first K point selection
2-15 An example of centroid initialization using the simple selection method
3-4 A difference between the Euclidean distance function and the proposed weighted distance function
3-5 An illustration of dividing a dataset into a set of weighted cells
3-6 An example of adjusting values of M to increase or decrease U
3-9 An illustration of retrieving ids of outliers and removing outliers
3-10 A proposed Cell-MST-based clustering algorithm for big datasets
4-2 An example of an adjacency-list structure for graphs
4-3 An example of an adjacency-list structure for a forest
4-4 Comparison of the actual cluster numbers and estimated cluster numbers of the QEM method and the proposed Cell-MST-based method
4-5 Comparison of Inter-Intra index values between the QEM method and the proposed Cell-MST-based method
4-6 Comparison of results from dataset Simulation2D1 (1)
4-7 Comparison of results from dataset Simulation2D2 (1)
4-8 Comparison of results from dataset Simulation2D3 (1)
4-9 Comparison of results from dataset Simulation2D4 (1)
4-10 Comparison of results from dataset Simulation3D1 (1)
4-11 Comparison of results from dataset Simulation3D2 (1)
4-12 Comparison of results from dataset Simulation3D3 (1)
4-13 Comparison of results from dataset Simulation3D4 (1)
4-14 Comparison of results from dataset Transactions70k (1)
4-15 Comparison of results from dataset Transactions90k (1)
4-16 Comparison of results from dataset TDriveTrajectory (1)
4-17 Comparison of results from dataset GeolifeTrajectory (1)
4-18 Comparison of the actual cluster numbers and estimated cluster numbers by the QEM and the proposed Weighted Cell-MST-based methods
4-19 Comparison of Inter-Intra index values between the QEM method and the proposed Weighted Cell-MST-based method
4-20 Comparison of results from dataset Simulation2D1 (2)
4-21 Comparison of results from dataset Simulation2D2 (2)
4-22 Comparison of results from dataset Simulation2D3 (2)
4-23 Comparison of results from dataset Simulation2D4 (2)
CHAPTER 1 INTRODUCTION
1.1 Problem Statement and Background
In the big data era, datasets generated by enterprises are not only massive, heterogeneous, inconsistent, and dynamically changing but also erroneous. These large datasets not only provide potential benefits to enterprises, business, and scientific applications but also pose technical challenges to data scientists in terms of choosing the right platform technologies and selecting suitable algorithms to optimize system performance when setting up a big data project (Philip Chen and Zhang, 2014; Torre-Bastida et al., 2015; Zicari et al., 2016; Pop et al., 2016).
The first step of a data analysis project should be data preprocessing. One of the challenging tasks of preprocessing is to identify and remove outliers from the provided datasets. An outlier is defined as an object which deviates markedly from the remaining objects in the same dataset. Outlying objects may be generated by different sources or mechanisms than the remaining objects in the same dataset (Hawkins, 1980). In data mining and statistics, outliers are also called abnormal, deviant, or discordant individuals (Aggarwal, 2013).
Outlier analysis plays an important role in data science because of its applicability to various applications such as credit card fraud detection, insurance, health care, security intrusion detection, interesting sensor event detection, fault detection in safety systems, military surveillance, medical diagnosis, law enforcement, and earth science (Chandola, Banerjee and Kumar, 2009; Aggarwal, 2013; Aggarwal, 2015). Outliers should be identified and eliminated before the provided datasets are used by other machine learning algorithms such as attribute selection, classification, association rule extraction, and data clustering algorithms. Well-known density-based outlier detection algorithms use the local outlier factor (LOF) (Breunig et al., 2000), the connectivity-based outlier factor (COF) (Tang et al., 2002), the influential measure of outliers by symmetric neighborhood relationship (INFLO) (Jin et al., 2006), and rankings combined with clusters (RBDA, ODMR, ODMRD) (Huang, Mehrotraa and Mohana, 2013).
The well-known outlier detection algorithms have a few limitations in terms of using density. The LOF and COF algorithms may produce poor results when outliers are located between low- and high-density clusters. The INFLO algorithm may produce incorrect outliers because it assumes that all neighbors of a tested object have the same density. Results of the RBDA, ODMR, and ODMRD algorithms may be negatively affected by local irregularities of datasets.
Recently, to overcome the limitations of the aforementioned well-known density-based outlier detection algorithms, a precise outlier detection method based on multi-time sampling (Ha, Seok and Lee, 2015) and a ranking-based outlier detection algorithm with various depths (RDOS) (Bhattacharya, Ghosh and Chowdhury, 2015) were proposed. It was reported that these newly proposed algorithms can produce very high-precision results compared to the previous algorithms. However, these new algorithms cannot be applied to datasets having large numbers of objects due to their very long execution times.
Similar to outlier detection algorithms, data clustering techniques are essential tools for analyzing and mining big datasets, and have been applied to many areas of life such as business, engineering, science, and education (Romero and Ventura, 2013; Kokol, 2015). Conventionally, data clustering algorithms are divided into hard clustering and fuzzy clustering algorithms. Hard clustering is separated into partitioning and hierarchical clustering. Due to the rapid increase of datasets in size, hierarchical clustering became unfeasible and partitioning clustering became more important (Gan, Ma and Wu, 2007). Among partitioning-based algorithms such as K-Means, K-Medoids, K-Modes, PAM, CLARANS, CLARA, FCM, CSO, and PSO, K-Means is the best-known and most used algorithm (Fahad et al., 2014; Arora and Chana, 2014).
Working with big datasets is challenging (Barioni et al., 2014; Hodge, 2014; Bedi, Jindal and Gautam, 2014; Fahad et al., 2014; Ishikawa, 2015; Van Hieu and Meesad, 2015a). Existing partitioning-based clustering methods have three main limitations. Firstly, clustering results are sensitive to noise and depend on the initial centroids. Secondly, they need a user-predefined cluster number 𝐾 to perform the clustering processes. Moreover, it is difficult to work with big datasets on a limited-memory computer due to the need for a huge amount of available memory (Van Hieu and Meesad, 2015b). Several solutions have been proposed to address the aforementioned limitations of the clustering tasks. However, the problems have not been completely solved.
Based on the obstacles of identifying and eliminating outliers from large datasets, and the limitations of existing partitioning-based clustering algorithms, this study proposes a comprehensive solution for big dataset analysis. The proposed solution includes identifying and removing outliers contained in a big dataset; estimating an optimal cluster number; determining optimal initial cluster centers; and performing the K-means clustering algorithm to cluster a big dataset on a personal computer with limited memory. The proposed solution can solve not only the outlier detection problem but also the data clustering problems of big datasets.
1.2 Purposes of the Study
To be able to identify and eliminate outliers from very large datasets in a limited time, estimate optimal cluster numbers of big datasets, determine optimal initial centroids of clusters, and perform data clustering on a personal computer, this study attempts the following:
1.2.1 To propose a fast outlier detection algorithm for big datasets
1.2.2 To propose combinations of graph-based and cell-based clustering algorithms for big datasets on a limited memory personal computer
1.2.3 To test and evaluate the proposed algorithms
1.3 Scope of the Study
To meet the purposes of the study, the scope of this study includes:
1.3.1 Firstly, two cell-based algorithms are proposed to transform a dataset having an extremely large number of objects into a fairly small set of weighted cells. The first cell-based transformation algorithm converts a big dataset into a small set of weighted cells based on predefined lower and upper bounds of the expected cell set. This cell-based transformation algorithm will be used by the proposed outlier detection algorithm. The second cell-based transformation algorithm converts a very large dataset into a very small set of weighted cells based on the available memory size of a computer. This second cell-based transformation algorithm will be used by a Cell-MST-based clustering algorithm and a Weighted Cell-MST-based clustering algorithm.
1.3.2 Secondly, a weighted distance function is defined to measure distances between weighted cells based on their coordinates and weights. This new weighted distance function will be used by the proposed outlier detection algorithm and the Weighted Cell-MST-based clustering algorithm.
1.3.3 Thirdly, two MST-based optimal threshold estimation algorithms are proposed to estimate optimal threshold values and optimal graphs from weighted cell sets. The first MST-based optimal threshold estimation algorithm uses the Euclidean distance definition to estimate the best initial threshold 𝑇0 from a set of cells to construct an optimal edge-weighted graph 𝐆0 from the cells. This algorithm will be used by the Cell-MST clustering algorithm. The second MST-based optimal threshold estimation algorithm uses the proposed weighted distance definition to estimate the best initial threshold 𝑇0 from a set of weighted cells to construct an optimal edge-weighted graph 𝐆0 from the weighted cells. The second MST-based optimal threshold estimation algorithm will be used by the Weighted Cell-MST clustering algorithm.
1.3.4 A fast outlier detection algorithm is designed to detect outliers in big datasets. A Cell-MST-based clustering algorithm and a Weighted Cell-MST-based clustering algorithm are designed to estimate an optimal cluster number, initialize cluster centers, and perform the clustering processes.
1.3.5 All algorithms will be implemented in the C programming language, compiled with the TDM-GCC 4.8.1 64-bit release bundled with Dev-C++ 5.9.2. All programs run on two personal computers with different memory sizes. The first machine is a laptop computer with an Intel Core i5-3230M CPU at 2.60 GHz, 6 GB of RAM, and Windows 8.1. The second machine is a desktop computer with an Intel Core i5-2400 CPU at 3.10 GHz, 8 GB of RAM, and Windows 7.
1.3.6 Datasets used in the experiments will be collected from the UCI website, the Microsoft website, and the NASA website. Moreover, simulated datasets generated by simulation programs coded by ourselves will also be used in the experiments.
1.4 Utilization of the Study
Results of the study can be used in many aspects of data science, data mining and analytics, business, engineering, and education. The developed outlier detection and data clustering algorithms can be used in various applications such as credit card fraud detection, insurance, health care, security intrusion detection, interesting sensor event detection, fault detection in safety systems, military surveillance, medical diagnosis, law enforcement, earth science, marketing, management, pattern recognition, image analysis, bioinformatics, machine learning, text mining, web clustering, and educational applications.
CHAPTER 2 LITERATURE REVIEW
This chapter first defines the notation used to present mathematical formulae and introduces a few basic definitions which will be used throughout this research. Secondly, the difficulties of working with big datasets are presented. Related literature on graphs and minimum-cost spanning trees, outlier detection, and dataset clustering is presented in three separate sections. Last but not least, the aforementioned contents are summarized in the last section.
2.1 Basic Notations and Definitions
2.1.1 Basic Notations
For convenience and consistency in presenting mathematical notation throughout this research, an italic lower-case letter will be used to indicate a scalar value or a variable, a bold lower-case letter will be used to denote a vector or an object, an italic upper-case letter will be used to represent a constant, and a bold upper-case letter will be used to denote a matrix or a set of vectors, which can be a set of objects. For instance, 𝑖 represents a scalar value or a variable, 𝐱 denotes a vector or an object, 𝑁 indicates a constant, and 𝐗 symbolizes a matrix or a set of vectors. The symbol 𝐗 can be a set of objects such as 𝐗 = (𝐱1, 𝐱2, … , 𝐱𝑁).
Moreover, when a string of letters is used to refer to a variable, a scalar value, a vector, a matrix, or a set of values, it will be printed in the aforementioned styles. Furthermore, names of functions will be printed in italics, and parameters of functions will also be printed following the aforementioned conventions. For example, max must be understood as a variable or the name of a function, Dist must be understood as a matrix or a set of values, and max(Dist) must be understood as a function named max whose parameter Dist is a set of values.
2.1.2 Basic Definitions
2.1.2.1 Data Mining
Data mining is a technology which combines traditional data analysis methods with sophisticated algorithms to process large volumes of data (Tan, Steinbach and Kumar, 2005). Data mining is a process of discovering useful patterns and trends in a large dataset (Daniel and Chantal, 2014). Data mining originated from database systems, statistics, machine learning, artificial intelligence, and pattern recognition. Currently, data mining is a multidisciplinary field including database technology, statistics, machine learning, artificial intelligence, data visualization, and related disciplines (Herrera, 2009), as illustrated in Figure 2-1.
FIGURE 2-1 Data mining is a multidisciplinary field
2.1.2.2 Datasets, Objects and Attributes
In data mining, a dataset or data set is a collection of data objects. An object or data object is a sample or an observation in a dataset. An attribute of an object, or a feature of a dataset, is a component of an object. When a dataset is presented in the form of a table, an object is presented as a row and an attribute as a column of that table. Moreover, a dataset can also be presented as a matrix. For example, a dataset 𝐗 having 𝑁 objects and 𝐷 attributes can be presented as an 𝑁 × 𝐷 matrix.
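In the notation of Section 2.1.1, a standard rendering of this matrix form is:

\[
\mathbf{X} =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1D} \\
x_{21} & x_{22} & \cdots & x_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{ND}
\end{pmatrix}
\]

where row 𝑖 contains object 𝐱𝑖 and column 𝑗 contains the values of attribute 𝑗 over all objects.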
2.1.2.3 Distance, Similarity and Dissimilarity
Distances between objects play an important role in data mining. The distance between two objects tells us how far apart they are, and is generally measured by the
Euclidean distance. The Euclidean distance between 𝐱 = (𝑥1, 𝑥2, ⋯ , 𝑥𝐷) and 𝐲 = (𝑦1, 𝑦2, ⋯ , 𝑦𝐷) is a special case of the Minkowski distance defined by (2-2) (Tan et al., 2005):

\[ dist_{Minkowski}(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{D} |y_i - x_i|^r \right)^{1/r} \tag{2-2} \]

When r = 1, (2-2) is the city-block distance; when r = 2, (2-2) is the Euclidean distance; and when r = ∞, (2-2) is the supremum distance.
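For illustration, (2-2) can be computed by a short C function such as the following minimal sketch (the function names are ours):

#include <math.h>

/* Minkowski distance between two D-dimensional objects, as in (2-2).
   r = 1 gives the city-block distance, r = 2 the Euclidean distance. */
double dist_minkowski(const double *x, const double *y, int D, double r)
{
    double sum = 0.0;
    for (int i = 0; i < D; i++)
        sum += pow(fabs(y[i] - x[i]), r);
    return pow(sum, 1.0 / r);
}

/* The supremum distance is the limit r -> infinity: the largest
   coordinate-wise difference. */
double dist_supremum(const double *x, const double *y, int D)
{
    double max = 0.0;
    for (int i = 0; i < D; i++) {
        double d = fabs(y[i] - x[i]);
        if (d > max) max = d;
    }
    return max;
}

Calling dist_minkowski(x, y, D, 2.0) reproduces the Euclidean distance, and dist_supremum covers the limiting case r = ∞.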
On the other hand, similarity between two objects measures how similar they are, while dissimilarity measures how different they are. Proximity is a common name for similarity and dissimilarity. Proximities between two objects which are widely used in data mining include the simple matching coefficient, cosine similarity, correlation, the Jaccard coefficient, and the extended Jaccard coefficient. Similarity and dissimilarity between two objects are defined based on the data types of the attributes of the objects. Details of those definitions can be found in the references (Tan et al., 2005; The Pennsylvania State University, 2015).
2.1.2.4 Outliers and Outlier Detection
Hawkins defined an outlier as "an observation which deviates so much from the other observations as to arouse suspicion that it was generated by a different mechanism" (Hawkins, 1980). In data mining and statistics, outliers are also called abnormal, deviant, or discordant individuals, and outliers are objects which are different from the remaining objects (Aggarwal, 2015).
FIGURE 2-2 An example of a dataset having 2 clusters and 2 outliers
Outlier analysis plays an important role in data mining and analysis because of its applicability to various applications such as credit card fraud detection, insurance, health care, security intrusion detection, fault detection in safety systems, and military surveillance (Chandola et al., 2009). Outlier detection algorithms normally find models of normal data objects from the provided datasets. Consequently, outliers are defined as objects that do not naturally fit those models. Most outlier detection algorithms produce results of one of two types: binary labels or real-valued outlier scores. In the first type, objects are marked as outliers or non-outliers. In the second type, a real-valued score is assigned to each object, the objects are sorted in descending order of score, and the top 𝑘 objects are considered the 𝑘 outliers (Shaikh and Kitagawa, 2014).
2.1.2.5 Clustering
Clustering is a technique for creating clusters of data objects in such a way that objects in the same cluster are more similar to each other than to objects in different clusters, and objects in different clusters are more dissimilar than objects in the same cluster (Gan et al., 2007). Based on the nature of the data, distances or similarities between objects are used to measure how similar two data objects are.
2.1.2.6 Hard Clustering and Fuzzy Clustering
Hard clustering assigns each object to one cluster only. In other words, a hard clustering algorithm divides a dataset 𝐗 having 𝑁 objects into 𝐾 ≤ 𝑁 disjoint clusters. In contrast, fuzzy clustering assigns each object to one or more clusters: a fuzzy clustering algorithm divides a dataset 𝐗 having 𝑁 objects into 𝐾 ≤ 𝑁 overlapping clusters. To measure the membership levels of objects in clusters in fuzzy clustering, an 𝑁 × 𝐾 matrix 𝐌 is defined in such a way that the sum of the membership levels of an object over the 𝐾 clusters must equal 1 (Gan et al., 2007). In this matrix, 𝑢𝑖𝑗 is the membership of the 𝑖th object in the 𝑗th cluster.
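Written out, the defining constraint on 𝐌 is the standard one:

\[ u_{ij} \in [0, 1], \qquad \sum_{j=1}^{K} u_{ij} = 1 \quad \text{for each object } i = 1, \ldots, N \]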
A centroid or center of a cluster is normally calculated as the average of the objects belonging to that cluster, as in (2-5):

\[ \mathbf{z}_i = \frac{1}{\|\mathbf{C}_i\|} \sum_{\mathbf{x}_j \in \mathbf{C}_i} \mathbf{x}_j \tag{2-5} \]
a) Hard clustering with 6 clusters b) Fuzzy clustering with 4 clusters
FIGURE 2-3 An example of hard clustering and fuzzy clustering
2.1.2.7 Validity Indexes
In data clustering, a validity index is a function or criterion used to evaluate the goodness of clustering results. Based on the nature of the validity index, a clustering result is considered best or worst when the validity index obtains its maximum or minimum value. Using the same validity index, if the validity index value of clustering result A is better than that of clustering result B, we can conclude that clustering result A is better than clustering result B. Well-known validity indexes used in clustering validation are divided into three types: external, internal, and relative indexes (Tan, Steinbach and Kumar, 2005).
2.1.2.8 Big Data and Big Datasets
Big data is a name used to refer to datasets whose sizes are beyond the ability of typical software tools to capture, store, manage, and analyze (Zicari, 2014). Big data are often characterized by three critical characteristics known as Volume, Variety, and Velocity (Ishikawa, 2015). Moreover, data mining researchers today usually characterize big data using 7Vs: Volume, Variety, Velocity, Veracity, Value, Variability, and Viability (Bedi et al., 2014).
As presented in (Shirkhorshidi, Aghabozorgi and Wah, 2014), there is no concise definition of a big dataset. However, big datasets can be categorized into five groups based on their sizes, as shown in Table 2-1.
TABLE 2-1 Categories of Big Datasets

Size in bytes : 10^6      10^8      10^10     10^12     >10^12
Name          : Medium    Large     Huge      Monster   Very Large
2.1.2.9 Memory

Memory, known as RAM (Random Access Memory), is a crucial hardware component of a computer. It is an intermediary storage place where data are temporarily stored and data operations are performed. Limits on memory vary by hardware platform and operating system. A personal computer running a 32-bit Windows operating system can handle up to 4 gigabytes (GB) of RAM, and a personal computer with a 64-bit Windows operating system can handle up to 512 GB (Microsoft, 2015).
2.1.2.10 Limited Memory Computers

Limited memory computers, or computers with limited memory, refers to standard desktop or laptop computers. These computers, used by standard users or power users, usually have from 4 to 16 gigabytes of RAM (Texas Tech University, 2015).
2.2 Difficulties of Working with Big Datasets
Big datasets not only provide benefits to enterprise, business, and scientific applications but also cause challenges for data scientists. Working with big datasets is challenging. Difficulties of working with big datasets include data capturing, storing, searching, sharing, analyzing, visualizing, and processing. Because of their characteristics, big datasets cause many technical challenges when setting up big data projects, in terms of choosing the right platform technologies and selecting suitable algorithms to optimize system performance (Philip et al., 2014; Barioni et al., 2014; Bedi et al., 2014; Fahad et al., 2014; Ishikawa, 2015; Torre-Bastida et al., 2015; Lytras et al., 2015; Zicari et al., 2016; Pop et al., 2016).

One of the challenging tasks of data analysis is to estimate an optimal cluster number. Graphs and minimum-cost spanning trees are important tools used to estimate optimal cluster numbers in clustering analysis.
2.3 Graphs and Minimum-Cost Spanning Trees
2.3.1 Graph
A graph 𝐆 is a finite set of vertices and a collection of unordered edges connecting pairs of vertices. Graphs are usually presented as diagrams in which vertices are points and edges are lines connecting two points (Wallis, 2007).
FIGURE 2-4 A sample graph
A tree 𝐆′ is defined as a connected component of a graph 𝐆 containing no cycles. Thus, a graph 𝐆 can be viewed as a group of connected components or a group of trees.
FIGURE 2-5 An example of three trees built from a graph
Graphs have been used in many applications, including data mining and machine learning (Bunke, 2003; Aggarwal and Wang, 2010; Parthasarathy, Tatikonda and Ucar, 2010; Donato and Gionis, 2010; Saiful Islam et al., 2015). In terms of data clustering, graphs have been used successfully to detect cluster numbers in datasets (Foggia et al., 2006; Barbakh, Wu and Fyfe, 2009; Li, 2015). There are three crucial relationships between a graph and a dataset used in data clustering:
1. A vertex of a graph is equivalent to an object of a dataset;
2. A connected component of a graph is equivalent to a cluster, which is a part of a dataset;
3. The number of connected components of a graph is equivalent to the number of clusters in a dataset.
a) A dataset has 6 groups of objects b) A graph comprises of 6 trees
FIGURE 2-6 An illustration of relationships between a dataset and a graph
2.3.2 Minimum-Cost Spanning Trees
A spanning tree of a graph is defined as a connected subgraph containing no cycles that includes all vertices of the graph. A minimum-cost spanning tree, also called a minimum spanning tree (MST), is an important tool for finding the cheapest set of edges connecting all vertices of a graph such that each vertex is visited only once. A minimum spanning tree of an edge-weighted graph is a spanning tree whose weight is not larger than the weight of any other spanning tree of the same graph (Sedgewick and Wayne, 2011).
a) An edge-weighted graph b) An MST of the edge-weighted graph
FIGURE 2-7 An illustration of a graph and a minimum-cost spanning tree
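Prim's algorithm is one standard way to construct such a tree. The following is a minimal C sketch on a small graph stored as an adjacency matrix, for illustration only; Section 2.3.3 describes the adjacency-list storage actually needed for large graphs:

#include <float.h>

#define NV 5                     /* number of vertices (example size) */

/* Prim's algorithm on an adjacency matrix; g[i][j] is the edge weight,
   or DBL_MAX if there is no edge. parent[v] records the MST edge
   (parent[v], v); the start vertex keeps parent -1. */
void prim_mst(double g[NV][NV], int parent[NV])
{
    double key[NV];
    int inMST[NV] = {0};
    for (int v = 0; v < NV; v++) { key[v] = DBL_MAX; parent[v] = -1; }
    key[0] = 0.0;                /* grow the tree from vertex 0 */

    for (int n = 0; n < NV; n++) {
        /* pick the cheapest vertex not yet in the tree */
        int u = -1;
        for (int v = 0; v < NV; v++)
            if (!inMST[v] && (u == -1 || key[v] < key[u])) u = v;
        inMST[u] = 1;
        /* relax the edges out of u */
        for (int v = 0; v < NV; v++)
            if (!inMST[v] && g[u][v] < key[v]) {
                key[v] = g[u][v];
                parent[v] = u;
            }
    }
}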
2.3.3 Graphs and MST Storage
A large graph cannot be stored in a static data structure such as a matrix or array because the size of the required matrix would exceed the management capability of a programming language such as C. Moreover, the number of edges connected to each node of a graph varies. To store a large edge-weighted graph in such a way that it can be easily accessed by the algorithms which build MSTs from graphs, an adjacency-list data structure has been utilized. The adjacency-list data structure is designed as an array of dynamic linked lists. Each element of the array represents a node in the edge-weighted graph and points to a dynamic linked list of the edges connected to this node. Each component of the linked list is a structure containing two node ids, a real value used as the weight of an edge, and a pointer to the next component. The number of elements in a list is equal to the number of edges connected to that node.
FIGURE 2-8 An illustration of storing a graph
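A minimal C rendering of this adjacency-list design, following the description above (the struct and field names are ours), might be:

#include <stdlib.h>

/* One edge record in a node's linked list: the two node ids, the edge
   weight, and a pointer to the next edge incident to the same node. */
typedef struct Edge {
    long from, to;          /* ids of the two endpoints */
    double weight;          /* edge weight */
    struct Edge *next;      /* next edge connected to this node */
} Edge;

/* The graph is an array of list heads, one per node. */
typedef struct {
    long nNodes;
    Edge **adj;             /* adj[i] points to node i's edge list */
} Graph;

Graph *graph_create(long nNodes)
{
    Graph *g = malloc(sizeof(Graph));
    g->nNodes = nNodes;
    g->adj = calloc(nNodes, sizeof(Edge *));
    return g;
}

/* Prepend an edge (u, v, w) to node u's list; for an undirected graph
   the same edge is stored once more under v. */
void graph_add_edge(Graph *g, long u, long v, double w)
{
    Edge *e = malloc(sizeof(Edge));
    e->from = u; e->to = v; e->weight = w;
    e->next = g->adj[u];
    g->adj[u] = e;
}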
Big datasets may contain outliers. Outlier objects need to be detected and eliminated before the datasets are used by other machine learning algorithms.
2.4 Outlier Detection Algorithms
Outlier detection methods normally find models of normal data objects from the provided datasets. Consequently, outliers are defined as objects that do not naturally fit those models. Many outlier detection algorithms have been used in credit card fraud detection, security intrusion detection, fault detection in safety systems, military surveillance, and other areas (Chandola et al., 2009). Most outlier detection algorithms produce results in one of two types: binary labels or real-valued outlier scores (Aggarwal, 2015). In the first type, objects are marked as outliers or non-outliers. In the second type, a real-valued score is assigned to each object and the objects are sorted in descending order of score; the top 𝑘 objects are considered the 𝑘 outliers (Shaikh and Kitagawa, 2014). This study focuses on the second type of outlier detection.
TABLE 2-2 An example of objects ranked in descending order of outlier scores
Object IDs Outlier Scores
The first density-based outlier detection algorithm uses the local outlier factor (LOF) to rank objects (Breunig et al., 2000). The LOF algorithm first calculates the abnormality degree of each object based on its local reachability density and its 𝑘-distance neighbors. Then, all objects are ranked based on their outlier degree values. The objects with the highest outlier scores are considered outliers.
Let 𝐩 and 𝐨 be objects, 𝑘 a positive integer, and 𝐗 a set of objects. The 𝑘-distance of 𝐩, 𝑘-distance(𝐩), is the distance 𝑑(𝐩, 𝐨) between 𝐩 and an object 𝐨 ∈ 𝐗 such that:
1. at least 𝑘 objects 𝐨′ ∈ 𝐗\{𝐩} satisfy 𝑑(𝐩, 𝐨′) ≤ 𝑑(𝐩, 𝐨), and
2. at most 𝑘 − 1 objects 𝐨′ ∈ 𝐗\{𝐩} satisfy 𝑑(𝐩, 𝐨′) < 𝑑(𝐩, 𝐨).
The 𝑘-distance neighbors of an object 𝐩 are the objects whose distances from 𝐩 are not greater than 𝑘-distance(𝐩), denoted by (2-6). The reachability distance of an object 𝐩 with respect to an object 𝐨 and the local reachability density of 𝐩 are defined by (2-7) and (2-8), respectively, and the local outlier factor of 𝐩 is defined by (2-9):

\[ N_k(\mathbf{p}) = \{\mathbf{q} \in \mathbf{X} \setminus \{\mathbf{p}\} \mid d(\mathbf{p}, \mathbf{q}) \le k\text{-}distance(\mathbf{p})\} \tag{2-6} \]

\[ reach\text{-}dist_k(\mathbf{p}, \mathbf{o}) = \max\{k\text{-}distance(\mathbf{o}), d(\mathbf{p}, \mathbf{o})\} \tag{2-7} \]

\[ lrd_k(\mathbf{p}) = \left( \frac{1}{|N_k(\mathbf{p})|} \sum_{\mathbf{o} \in N_k(\mathbf{p})} reach\text{-}dist_k(\mathbf{p}, \mathbf{o}) \right)^{-1} \tag{2-8} \]

\[ LOF_k(\mathbf{p}) = \frac{1}{|N_k(\mathbf{p})|} \sum_{\mathbf{o} \in N_k(\mathbf{p})} \frac{lrd_k(\mathbf{o})}{lrd_k(\mathbf{p})} \tag{2-9} \]
FIGURE 2-9 An illustration of k-distance(o) and reach-dist_k(p, o)
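Assuming that the 𝑘-nearest-neighbor lists and 𝑘-distances have already been computed, (2-7)-(2-9) translate into C roughly as follows; this is a sketch of the published definitions, not code from the thesis, and knn, kdist, and d are assumed inputs:

/* knn[p][j] : id of the j-th nearest neighbor of object p (j = 0..k-1)
   kdist[p]  : k-distance of object p
   d(p, o)   : any distance function between objects p and o          */

double reach_dist(int p, int o, const double *kdist,
                  double (*d)(int, int))
{
    double dp = d(p, o);                        /* (2-7) */
    return kdist[o] > dp ? kdist[o] : dp;
}

/* local reachability density, (2-8) */
double lrd(int p, int k, int **knn, const double *kdist,
           double (*d)(int, int))
{
    double sum = 0.0;
    for (int j = 0; j < k; j++)
        sum += reach_dist(p, knn[p][j], kdist, d);
    return (double)k / sum;
}

/* local outlier factor, (2-9): average ratio of the neighbors'
   densities to p's own density */
double lof(int p, int k, int **knn, const double *kdist,
           double (*d)(int, int))
{
    double lrd_p = lrd(p, k, knn, kdist, d);
    double sum = 0.0;
    for (int j = 0; j < k; j++)
        sum += lrd(knn[p][j], k, knn, kdist, d) / lrd_p;
    return sum / k;
}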
The LOF algorithm is computationally expensive when working with datasets having large numbers of objects. To improve the results of the LOF algorithm on low-density patterns, the connectivity-based outlier factor (COF) was proposed (Tang et al., 2002). The COF of an object 𝐩 is defined as the ratio of its average chaining distance to the average chaining distances of its 𝑘 nearest neighbors.
Let 𝐏 and 𝐐 be two disjoint subsets of 𝐗. The distance between the two sets 𝐏 and 𝐐 is defined as the minimum distance between objects 𝐱 ∈ 𝐏 and 𝐲 ∈ 𝐐. A set-based nearest path (SBN-path) from 𝐩1 is a sequence ⟨𝐩1, 𝐩2, ⋯ , 𝐩𝑟⟩ such that 𝐩𝑖+1, with 1 ≤ 𝑖 ≤ 𝑟 − 1, is the nearest neighbor of the set {𝐩1, 𝐩2, ⋯ , 𝐩𝑖} in the set {𝐩𝑖+1, 𝐩𝑖+2, ⋯ , 𝐩𝑟}.

A set-based nearest trail (SBN-trail) with respect to the SBN-path ⟨𝐩1, 𝐩2, ⋯ , 𝐩𝑟⟩ is defined as an ordered collection ⟨𝐞1, 𝐞2, ⋯ , 𝐞𝑟−1⟩ such that for all 1 ≤ 𝑖 ≤ 𝑟 − 1, the edge 𝐞𝑖 = (𝐨𝑖, 𝐩𝑖+1), where 𝐨𝑖 ∈ {𝐩1, 𝐩2, ⋯ , 𝐩𝑖}. The average chaining distance from an object 𝐩 to 𝐐\{𝐩} is defined by (2-10), and the connectivity-based outlier factor of an object 𝐩 with respect to its 𝑘 neighbors is defined by (2-11):

\[ ac\text{-}dist_{\mathbf{Q}}(\mathbf{p}) = \frac{1}{r-1} \sum_{i=1}^{r-1} \frac{2(r-i)}{r} \, dist(\mathbf{e}_i) \tag{2-10} \]

\[ COF_k(\mathbf{p}) = \frac{|N_k(\mathbf{p})| \cdot ac\text{-}dist_{N_k(\mathbf{p})}(\mathbf{p})}{\sum_{\mathbf{o} \in N_k(\mathbf{p})} ac\text{-}dist_{N_k(\mathbf{o})}(\mathbf{o})} \tag{2-11} \]

Suppose that we have a dataset as in Figure 2-10, with 𝑑(𝐩1, 𝐩2) = 7, 𝑑(𝐩2, 𝐩7) = 3, and the distance between any two adjacent points on a line equal to 1, and let 𝑘 = 10.
FIGURE 2-10 An illustration of calculating COF
As with LOF, after obtaining the connectivity-based outlier factor values of all objects, the objects are ranked in descending order of COF value, and the objects with the highest COF values are considered outliers. Using COF may be better than using LOF when the provided datasets have low-density patterns. However, COF is more computationally expensive than LOF.
To detect outliers in more complex situations, the influential measure of outliers (INFLO) was proposed (Jin et al., 2006). This method uses the reverse nearest-neighbor relationship to measure the outlier degree of an object: the influential measure of an object 𝐩 compares the density in its neighborhood with the average density of the objects in its influential space. Like COF, INFLO is a computationally expensive algorithm and inadequate for large datasets. Let the density of an object 𝐩, 𝑑𝑒𝑛(𝐩), be defined as the inverse of 𝑘-distance(𝐩). The reverse nearest neighborhood of 𝐩 is defined by (2-12), the 𝑘-influential space of 𝐩 by (2-13), and the influential outlier score of 𝐩 by (2-14). Following Jin et al. (2006), these can be written as:

\[ RNN_k(\mathbf{p}) = \{\mathbf{q} \in \mathbf{X} \mid \mathbf{p} \in N_k(\mathbf{q})\} \tag{2-12} \]

\[ IS_k(\mathbf{p}) = N_k(\mathbf{p}) \cup RNN_k(\mathbf{p}) \tag{2-13} \]

\[ INFLO_k(\mathbf{p}) = \frac{\frac{1}{|IS_k(\mathbf{p})|} \sum_{\mathbf{o} \in IS_k(\mathbf{p})} den(\mathbf{o})}{den(\mathbf{p})} \tag{2-14} \]
FIGURE 2-11 An example of k nearest neighborhood and influential space
Similar to INFLO, the rank-based outlier detection algorithm (RBDA) (Huang et al., 2013) uses the concept of the reverse neighborhood to rank objects. First, the rank of an object 𝐩 among all its neighbors is calculated. Next, the outlier degree value of 𝐩 is defined as the ratio of the sum of the ranks of 𝐩 among its neighbors to the number of its neighbors. Then, the outlier degree values of the objects are normalized based on these values and the number of outliers to be detected. An object whose normalized outlier degree is greater than 𝐿 is considered an outlier. Other outlier algorithms that combine ranking and clustering are RADA, ODMR, and ODMRD (Huang, Mehrotraa and Mohana, 2012). Judging by execution time, the aforementioned algorithms are inappropriate for datasets having very large numbers of objects.
To provide a high-accuracy detection algorithm, a precise ranking method has been proposed (Ha et al., 2015). It is essentially a sampling method. First, 𝑆 subsets of the same size 𝑀 are arbitrarily extracted from the provided dataset. Then, outlier scores are calculated for the objects in each subset. Next, the final outlier score of each object is calculated as the sum of its outlier scores over the 𝑆 subsets. Finally, the objects are ranked based on their outlier scores, and the objects with the highest scores are considered outliers. This method can provide highly accurate results, but its execution time is tremendously long when working with big datasets, because the subset size is also large and the number of subsets is inversely proportional to the subset size. To use this sampling method, the sample size 𝑀 of the subsets and the number of sampling repetitions 𝑟𝑅𝑒𝑝 are given in Table 2-3, in which 𝑁 is the number of objects of the provided dataset.
TABLE 2-3 Values of M and rRep
The main advantage of the RDOS algorithm is that it determines an optimal 𝑘′, which may be less than the 𝑘 defined by users. It was reported that, by using this optimal 𝑘′, the RDOS algorithm produces more accurate results than the LOF, COF, INFLO, RBDA, and ODMRD algorithms. As a consequence of testing a number of values of 𝑘 to obtain an optimal 𝑘′, the execution time of the RDOS algorithm is longer than that of the LOF, COF, INFLO, RBDA, and ODMRD algorithms.
After removing outliers, the provided datasets can be used by data clustering algorithms to estimate cluster numbers, initialize cluster centers, and perform the clustering processes.
2.5 Data Clustering Algorithms
Clustering, also called cluster analysis, taxonomy analysis, segmentation analysis, or unsupervised classification, is a technique that divides a dataset into a number of clusters in such a way that objects in the same cluster are more similar to each other than objects in different clusters, and objects in different clusters are more dissimilar than objects in the same cluster (Gan et al., 2007). Data clustering is one of the most popular data object labelling techniques (Tang and Liu, 2014). Conventionally, data clustering algorithms are divided into hard clustering and fuzzy clustering algorithms, and hard clustering algorithms include partitioning clustering and hierarchical clustering algorithms.
FIGURE 2-12 A traditional partitioning of data clustering algorithms
Because of the rapid increase of datasets in size, hierarchical clustering became unfeasible and partitioning clustering became more important (Gan et al., 2007). Therefore, the study of partitioning-based clustering algorithms for big datasets has gained more interest from academic and industrial researchers.
2.5.1 Partitioning Clustering
A partitioning clustering algorithm splits a dataset 𝐗 = (𝐱1, 𝐱2, … , 𝐱𝑁) into 𝐾 ≤ 𝑁 disjoint clusters 𝐂 = (𝐜1, 𝐜2, … , 𝐜𝐾) in such a way that the sum of within-cluster distances is minimized and the inter-cluster distance is maximized. The sum of within-cluster distances and the inter-cluster distance are calculated by (2-15) and (2-16), respectively (Ray and Turi, 1999):

\[ Intra = \sum_{i=1}^{K} \sum_{\mathbf{x} \in \mathbf{c}_i} |\mathbf{x} - \mathbf{z}_i|^2, \quad \text{where } \mathbf{z}_i \text{ is the center of cluster } \mathbf{c}_i \tag{2-15} \]

\[ Inter = \sum \left( |\mathbf{z}_i - \mathbf{z}_j|^2 \right), \quad i = 1, \ldots, K-1; \; j = i+1, \ldots, K \tag{2-16} \]

According to (Fahad et al., 2014), clustering algorithms for big datasets can be divided into five groups: partitioning-based, hierarchical-based, density-based, grid-based, and model-based algorithms (Figure 2-13).
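For reference, a minimal C sketch of (2-15) and (2-16), under the assumption that objects, labels, and centers are stored in flattened arrays (a layout of our choosing):

/* x[i*D + d] : attribute d of object i
   label[i]   : cluster index of object i (0..K-1)
   z[c*D + d] : attribute d of the center of cluster c                */

double sq_dist(const double *a, const double *b, int D)
{
    double s = 0.0;
    for (int d = 0; d < D; d++) { double t = a[d] - b[d]; s += t * t; }
    return s;
}

/* sum of within-cluster distances, (2-15) */
double intra(const double *x, const int *label, const double *z,
             int N, int D)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += sq_dist(&x[i * D], &z[label[i] * D], D);
    return s;
}

/* sum of pairwise distances between cluster centers, (2-16) */
double inter(const double *z, int K, int D)
{
    double s = 0.0;
    for (int i = 0; i < K - 1; i++)
        for (int j = i + 1; j < K; j++)
            s += sq_dist(&z[i * D], &z[j * D], D);
    return s;
}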
FIGURE 2-13 Categories of clustering algorithms for big datasets
Among partitioning-based clustering algorithms such as K-Means, K-Medoids, K-Modes, PAM, CLARANS, CLARA, FCM, CSO, and PSO, K-Means is the most widely used. It assigns the 𝑁 objects of a dataset 𝐗 to a user-predefined number 𝐾 of clusters based on the distances from each object to the centers of the 𝐾 clusters. The assignment process is repeated for a number of iterations until a user-predefined stopping condition is reached (Barioni et al., 2014). The K-Means algorithm can be depicted as Algorithm 2-1:
1. Initialize 𝐾 centers 𝐙 = (𝐳1, 𝐳2, … , 𝐳𝐾) of the 𝐾 clusters.
2. Assign each object 𝐱𝑖 = (𝑥𝑖1, 𝑥𝑖2, … , 𝑥𝑖𝐷), with 𝑖 = 1, … , 𝑁, to the cluster whose center is nearest.
3. Recompute each center 𝐳𝑗 as the mean of the objects currently assigned to cluster 𝑗.
4. Check the user-predefined stopping condition:
   - if the stopping condition is not satisfied, repeat from step 2;
   - otherwise, return the set of 𝐾 final centers and stop.
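A compact C sketch of Algorithm 2-1 follows (our rendering, using flattened arrays and a fixed iteration cap; the thesis's own implementation details are not shown here):

#include <stdlib.h>
#include <string.h>
#include <float.h>

/* One run of K-Means: x holds N objects of D attributes each
   (x[i*D+d]), z holds K initial centers on entry and the final centers
   on return, and label receives the cluster of each object. */
void kmeans(const double *x, double *z, int *label,
            int N, int D, int K, int maxIter)
{
    double *sum = malloc((size_t)K * D * sizeof(double));
    int *cnt = malloc((size_t)K * sizeof(int));
    for (int i = 0; i < N; i++) label[i] = -1;

    for (int it = 0; it < maxIter; it++) {
        int changed = 0;
        /* step 2: assign each object to the nearest center */
        for (int i = 0; i < N; i++) {
            int best = 0; double bestD = DBL_MAX;
            for (int c = 0; c < K; c++) {
                double d2 = 0.0;
                for (int d = 0; d < D; d++) {
                    double t = x[i * D + d] - z[c * D + d];
                    d2 += t * t;
                }
                if (d2 < bestD) { bestD = d2; best = c; }
            }
            if (label[i] != best) { label[i] = best; changed = 1; }
        }
        /* step 3: recompute each center as the mean of its objects */
        memset(sum, 0, (size_t)K * D * sizeof(double));
        memset(cnt, 0, (size_t)K * sizeof(int));
        for (int i = 0; i < N; i++) {
            cnt[label[i]]++;
            for (int d = 0; d < D; d++)
                sum[label[i] * D + d] += x[i * D + d];
        }
        for (int c = 0; c < K; c++)
            if (cnt[c] > 0)
                for (int d = 0; d < D; d++)
                    z[c * D + d] = sum[c * D + d] / cnt[c];
        /* step 4: stop when no assignment changed */
        if (!changed) break;
    }
    free(sum); free(cnt);
}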
When clustering small datasets, the K-Means algorithm can stop when the clusters are unchanged from one iteration to the next. However, when clustering datasets with very large numbers of objects, the number of iterations can also be large, and the algorithm consumes a long period of time before reaching a stable state. Thus, a stopping condition for the K-Means algorithm can be that the difference between the index values of two successive iterations is less than a small threshold, or a new cutting method can be used (Van Hieu and Meesad, 2015a).
2.5.2 Cluster Centroid Initialization
Centroid initialization is the first step of a partitioning-based clustering algorithm such as K-Means. The clustering results and execution time of a partitioning-based clustering algorithm depend heavily on the initial centers: different initial centers lead to different clustering results and to varying numbers of iterations. Several algorithms have been proposed to initialize the first 𝐾 centroids, including random selection, first 𝐾 points selection, simple selection, greatest minimum distance selection, and others (Celebi, Kingravi and Vela, 2013).
Random selection arbitrarily picks 𝐾 unique objects and uses them as the 𝐾 initial centroids. It is a fast and easy method. However, the selected initial centroids may not be good centroids. Moreover, the clustering results, numbers of iterations, and execution times are unstable when clustering the same dataset twice with the same number of clusters.
First 𝐾 points selection selects the first 𝐾 objects of the provided dataset and uses them as the 𝐾 initial centroids. Like random selection, picking the first 𝐾 points is fast and simple. However, the clustering results of first 𝐾 distinct points selection are heavily affected by data ordering.
For example, given a dataset having 6 clusters as shown in Figure 2-14 and 𝐾 = 6, with the dataset sorted in ascending order of the first attribute: when using the random selection method, 4 initial centroids 𝐜1, 𝐜2, 𝐜3, and 𝐜4 were selected from cluster 1, and the other two initial centroids 𝐜5 and 𝐜6 were picked from cluster 4. When using the first 𝐾 points as initial centroids, 4 initial centroids 𝐜1, 𝐜2, 𝐜4, and 𝐜5 were selected from cluster 1 and the 2 initial centroids 𝐜3 and 𝐜6 from cluster 2.
a) A result of random selection b) A result of first 𝐾 point selection
FIGURE 2-14 Poor results of random selection and first K point selection
On the other hand, the simple selection method is better than the two aforementioned methods in that it is non-arbitrary and independent of data ordering. The center of the dataset can be used as the first initial centroid 𝐜1, and an unselected object can be selected as the next centroid 𝐜𝑖 (𝑖 = 2, … , 𝐾) if it is at least 𝑇 units away from the previously selected objects. However, even with a good threshold 𝑇, this method selects the first candidate it meets, and the selected candidate may not be the best one among many candidates.
For example, given a dataset having 5 clusters, as shown in Figure 2-15, and 𝐾 = 5, the simple selection method is applied to obtain the initial centroids.
FIGURE 2-15 An example of centroid initialization using the simple selection method with K = 5, T = 8
The greatest minimum distance strategy has been used by a few researchers. The first initial centroid 𝐜1 can be the object that has the maximum norm. Then, the distances from all unselected objects to the selected objects are calculated, and the object with the largest distance is selected as the next centroid (Katsavounidis, Jay Kuo and Zhen, 1994). In K-Means++, the first initial centroid is arbitrarily chosen. The next centroid
𝐜𝑖 = 𝐱′ (𝑖 = 2, … , 𝐾) is selected with probability 𝐷(𝐱′)²/∑𝐱 𝐷(𝐱)², where 𝐷(𝐱) is the shortest distance from the data object 𝐱 to the closest already-chosen centroid (Arthur and Vassilvitskii, 2007). This method is considered a good choice when the cluster number 𝐾 is known to be an optimal value.
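A short C sketch of this K-Means++ seeding rule (our implementation of the published rule; rand() stands in for a proper random source):

#include <stdlib.h>
#include <float.h>

/* K-Means++ seeding: the first center is picked at random, and each
   further center is picked with probability proportional to D(x)^2,
   the squared distance from x to the nearest center chosen so far. */
void kmeanspp_seed(const double *x, double *z, int N, int D, int K)
{
    double *dist2 = malloc((size_t)N * sizeof(double));
    int first = rand() % N;
    for (int d = 0; d < D; d++) z[d] = x[first * D + d];

    for (int c = 1; c < K; c++) {
        double total = 0.0;
        for (int i = 0; i < N; i++) {
            double best = DBL_MAX;
            for (int j = 0; j < c; j++) {       /* nearest chosen center */
                double d2 = 0.0;
                for (int d = 0; d < D; d++) {
                    double t = x[i * D + d] - z[j * D + d];
                    d2 += t * t;
                }
                if (d2 < best) best = d2;
            }
            dist2[i] = best;
            total += best;
        }
        /* sample an index i with probability dist2[i] / total */
        double r = ((double)rand() / RAND_MAX) * total;
        int pick = N - 1;                       /* fallback for rounding */
        for (int i = 0; i < N; i++) {
            r -= dist2[i];
            if (r <= 0.0) { pick = i; break; }
        }
        for (int d = 0; d < D; d++) z[c * D + d] = x[pick * D + d];
    }
    free(dist2);
}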
Other centroid initialization methods include a density-based approach (Hui and Wei, 2012). The density-based approach includes three steps. The first step divides the provided dataset 𝐗 into 𝑀 × 𝑀 cells; let 𝑁𝑖𝑗 be the number of objects belonging to each cell. Step 2 randomly selects 𝐾𝑖𝑗 = 𝐾 × 𝑁𝑖𝑗/𝑁 samples within each cell as initial centers. Step 3 checks a stopping condition: if 𝑠𝑢𝑚(𝐾𝑖𝑗) = 𝐾, then stop; if 𝑠𝑢𝑚(𝐾𝑖𝑗) < 𝐾 and (𝐾 − 𝑠𝑢𝑚(𝐾𝑖𝑗))/𝐾 > 10%, then set 𝑀 = 𝑀 − (√𝑀 − 1)² and repeat step 2; otherwise, randomly select 𝐾 points among the 𝑠𝑢𝑚(𝐾𝑖𝑗) points as initial centers.
As mentioned above, a crucial prerequisite of all the aforementioned methods is a user-predefined value of 𝐾; these centroid initialization algorithms cannot be performed if the value of 𝐾 is unknown. Thus, estimating an optimal number of clusters is a critical step before conducting data clustering.
2.5.3 Cluster Number Estimation
The number of clusters 𝐾 is a vital parameter of clustering algorithms because it tells them how many clusters the provided dataset should be divided into. Values of 𝐾 can be estimated by domain experts or based on data visualization. However, it is extremely difficult to visualize big datasets having more than three attributes.
a) A dataset is visualized in a 2D space b) A dataset is visualized in a 3D space
FIGURE 2-16 An example of visualization
Algorithms proposed to estimate the number of clusters can be classified into two groups: those that do not use graphs and those that do. The first group includes methods using locally adaptive clustering, similarity matrices, decision-theoretic rough sets, weighted gap statistics, and heuristic validity functions.
2.5.3.1 Cluster Number Estimation Without Using Graphs
The locally adaptive clustering method uses a set of cluster weights 𝐰 = (𝑤1, 𝑤2, … , 𝑤𝐷) associated with a set of clusters 𝐂 = (𝐜1, 𝐜2, … , 𝐜𝐾), where 𝑤𝑗 is the degree of participation of feature 𝑗 in the set of clusters 𝐂 (Han et al., 2008). A validity function 𝑉𝐾 is defined based on the clustering results and the weights. Values of 𝐾 ∈ [2, √𝑁] are tested, and the best value of 𝐾 is selected as the one corresponding to the best value of 𝑉𝐾.
A given dataset having 𝑁 objects in a 𝐷-dimensional space is divided into 𝐾 disjoint subsets 𝐒 = {𝐬1, 𝐬2, ⋯ , 𝐬𝐾} using (2-17). Let 𝑋𝑗𝑖 be the variance of the data in cluster 𝑗 along dimension 𝑖, ℎ ≥ 0 a coefficient, 𝐜𝑗 the center of cluster 𝑗, and 𝐜0 the center of the given dataset. 𝑋𝑗𝑖 is calculated by (2-18):

\[ X_{ji} = \frac{1}{|\mathbf{S}_j|} \sum_{\mathbf{x} \in \mathbf{S}_j} (x_i - c_{ji})^2 \tag{2-18} \]

The weight vectors 𝑤𝑗𝑖 are calculated by (2-19), which normalizes exp(−𝑋𝑗𝑖/ℎ) over the 𝐷 dimensions, and the validity function 𝑉𝐾 is defined by (2-20) from the clustering results and the weights.
The main disadvantage of the locally adaptive clustering method is its time consumption when 𝑁 is large. Suppose that a given dataset has 2,000,000 objects; this dataset must be clustered 1,413 times, with 𝐾 running from 2 to 1,414, to select an optimal cluster number.
Using a similarity matrix to find similar objects is another way to estimate the number of clusters (Shao, Pi and Liu, 2013). First, an 𝑁 × 𝑁 matrix of similarities between the 𝑁 objects is calculated using (2-21). Then, the values in each row of the matrix are sorted in ascending order, and a user-predefined threshold 𝑞 is used to find the 𝑞 most similar objects of each object. This method may work well on small datasets, but it is inadequate for large datasets due to the huge amount of available memory needed to store the similarity matrix:

\[ sim(i, j) = \frac{\sum_{p=1}^{D} x_{ip} \, x_{jp}}{\sqrt{\sum_{p=1}^{D} x_{ip}^2} \; \sqrt{\sum_{p=1}^{D} x_{jp}^2}}, \quad i = 1, \ldots, N; \; j = 1, \ldots, N \tag{2-21} \]
If a 4-byte float data type is used to store the similarity values, it requires 𝑁 × 𝑁 × 4 bytes to store the similarity matrix. This means that it requires 14,901.16 gigabytes to store the similarity matrix when clustering a dataset having 2,000,000 objects, which is far larger than the memory size of a computer.

The decision-theoretic rough set method has been proposed to estimate the number of clusters (Yu, Liu and Wang, 2014). This method not only uses a similarity matrix, like the similarity-based method (Shao et al., 2013), but also uses a risk matrix to measure whether two objects 𝐱𝑖 and 𝐱𝑗 belong to the same cluster or not. It was considered a good algorithm for estimating the cluster number and the initial centroids. However, it is not appropriate for big datasets due to the huge amount of memory required to store the similarity and risk matrices. In this approach, the similarities between objects are distances calculated using (2-22), and the risk matrix is calculated using (2-24). Equation (2-24) defines the risk piecewise around a threshold 𝑣𝑎𝑙: for a pair assumed not to belong to the same cluster (¬𝐂), the risk is 0.5 + (𝑠𝑖𝑚(𝐱𝑖, 𝐱𝑗) − 𝑣𝑎𝑙)/(2 − 2𝑣𝑎𝑙) when 𝑠𝑖𝑚(𝐱𝑖, 𝐱𝑗) ≥ 𝑣𝑎𝑙, and 0.5 − (𝑣𝑎𝑙 − 𝑠𝑖𝑚(𝐱𝑖, 𝐱𝑗))/(2𝑣𝑎𝑙) when 𝑠𝑖𝑚(𝐱𝑖, 𝐱𝑗) < 𝑣𝑎𝑙, with symmetric branches for a pair assumed to belong to the same cluster (𝐂).
The required memory size is enormously greater than the memory size of a computer. If a provided dataset has 2,000,000 objects and a 4-byte float data type is used to store the similarity and risk values, it requires at least 𝑁 × 𝑁 × 4 × 2 bytes to store a similarity matrix and a risk matrix. This required memory size is 29,802.32 gigabytes.
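These figures follow directly from the matrix sizes, as a few lines of C confirm (the adjacency-matrix figure anticipates Section 2.5.3.2):

#include <stdio.h>

int main(void)
{
    double N = 2000000.0;                      /* number of objects */
    double gb = 1024.0 * 1024.0 * 1024.0;      /* bytes per gigabyte,
                                                  as used in the thesis */
    /* one N x N matrix of 4-byte floats (similarity-based method) */
    printf("similarity matrix : %.2f GB\n", N * N * 4 / gb);
    /* similarity + risk matrices (decision-theoretic rough set) */
    printf("similarity + risk : %.2f GB\n", N * N * 4 * 2 / gb);
    /* one N x N matrix of 1-byte chars (adjacency matrix) */
    printf("adjacency matrix  : %.2f GB\n", N * N * 1 / gb);
    return 0;
}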
In extended K-Means, the Bayesian Information Criterion (BIC) is used to test values of 𝐾 ∈ [𝐿𝐵, 𝑈𝐵] to find the best 𝐾 (Ishioka, 2000). Step 1 performs data clustering with 𝐾 = 𝐿𝐵. Then, step 2 splits each cluster into two clusters until a user-predefined stopping criterion is reached. After dividing a cluster into two new clusters, a new BIC value is calculated; if the new BIC value of the current structure is better than the BIC value of the previous structure, the new centroids are accepted, and otherwise the new structure is discarded. The difficulty with this method is how to obtain the best values of 𝐿𝐵 and 𝑈𝐵.
Other methods use heuristic functions, including 𝐶𝐻(𝑚), 𝑇𝑊𝐻(𝑚), 𝑋𝑍𝐹(𝑚), 𝐿𝐴(𝑚), and 𝑋𝑢(𝑚), to evaluate the number of clusters (Kolesnikov, Trichina and Kauranne, 2015). To obtain more stable results, a new parameterized cost function 𝑃𝐶𝐹(𝑚) was defined based on quantization error modeling (QEM) to find the best value of 𝐾 in a range of values [2, 𝑚𝑀𝑎𝑥] by (2-27), where 𝐱𝑗 is object 𝑗, 𝐂𝑖 is cluster 𝑖, and 𝐳𝑖 is the center of cluster 𝑖. The difficulty of this method is how to obtain the best value of 𝑚𝑀𝑎𝑥. Moreover, no specific centroid initialization method was proposed to accompany the quantization error modeling method.
Using the weighted gap statistic was considered a good choice for cluster number estimation (Yan and Ke, 2007). Unfortunately, the weighted gap statistic is very computationally expensive, like its predecessor, the gap statistic (Tibshiran, Walther and Hastie, 2001). The weighted gap statistic and gap statistic methods can work well with small datasets; however, they are not suitable for estimating the number of clusters of very large datasets.
2.5.3.2 Cluster Number Estimation Using Graphs
The second group of cluster number estimation algorithms uses graph theory. Graph-based algorithms work on undirected graphs created from datasets instead of working on the original datasets. Graph-based K-Means clustering using an adjacency matrix has been proposed to estimate the cluster number and determine the initial centroids (Stokes, 2013). This algorithm was considered a good choice. However, its main weakness is that the adjacency matrix is too large to fit in the available memory of a computer when working with a large dataset: if a dataset has 2,000,000 objects and a 1-byte char data type is used to store the adjacency values, it requires 3,725.29 gigabytes to store the adjacency matrix. This required memory tremendously exceeds the memory size of a computer.
Other graph-based algorithms use minimum-cost spanning trees to estimate cluster numbers (Zhong et al., 2015; Reddy, Mishra and Jana, 2011; Galluccio, Michel and Comon, 2012). First, these MST-based algorithms build undirected graphs from datasets: each vertex represents a data object, and the distance between two vertices represents the distance between two data objects. The standard deviation of the edge lengths is used as an initial threshold 𝑇0 to select edges to construct an initial graph 𝐆0 (Galluccio, Michel and Comon, 2008). Then, edges whose lengths are greater than ∆ are removed from the graph 𝐆0 to form a forest 𝐅, and the number of trees of the forest 𝐅 is considered the number of clusters. The common drawback of the aforementioned graph-based algorithms is that the size of the undirected graph increases significantly when the number of objects in the dataset increases only slightly. Thus, similar to the similarity-based (Shao et al., 2013) and decision-theoretic rough set methods (Yu et al.,