VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF TECHNOLOGY
———oOo———
NGUYEN LE HOANG
CLUSTERING LARGE DATASETS
BASED ON DATA SAMPLING AND SPARK
Major: Computer Science
Code: 8480101
MASTER THESIS
Ho Chi Minh City, January 2020
THE RESEARCH WORK FOR THIS THESIS HAS BEEN CARRIED OUT AT UNIVERSITY OF TECHNOLOGY, VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
Under the supervision of
• Supervisor: ASSOC PROF DR DANG TRAN KHANH
• Co-supervisor: DR LE HONG TRANG
Examiner Board
• Examiner 1: DR PHAN TRONG NHAN
• Examiner 2: ASSOC PROF DR HUYNH TRUNG HIEU
This thesis was reviewed and defended at University of Technology, VNU-HCMC, on December 30, 2019. The members of the Thesis Defense Committee are:
1 ASSOC PROF DR NGUYEN THANH BINH
2 DR NGUYEN AN KHUONG
3 DR PHAN TRONG NHAN
4 ASSOC PROF DR HUYNH TRUNG HIEU
5 ASSOC PROF DR NGUYEN TUAN DANG
Confirmation from the President of the Thesis Defense Committee and the Dean of the Faculty of Computer Science and Engineering:

President of Thesis Defense Committee
ASSOC PROF DR NGUYEN THANH BINH

Dean of Faculty of Computer Science and Engineering
MASTER THESIS OBLIGATIONS
Student: NGUYEN LE HOANG
Student ID: 1770472
Date of Birth: March 12, 1988
Place of Birth: Ho Chi Minh City
Major: Computer Science
Code: 8480101
I THESIS TITLE: CLUSTERING LARGE DATASETS BASED ON DATA SAMPLING AND SPARK
II OBLIGATIONS AND CONTENTS:
• Study and research clustering problems, data sampling methods, data generalization, and the Apache Spark framework for big data.
• Based on data sampling, we propose and prove algorithms for coreset constructions in order to find the most suitable subsets that can both be used to reduce the computational cost and be used as representative subsets of the full original datasets in clustering problems.

• Do experiments and evaluate the proposed methods.
III START DATE: January 4, 2019
IV END DATE: December 7, 2019
V SUPERVISORS: ASSOC PROF DR DANG TRAN KHANH
and DR LE HONG TRANG
Ho Chi Minh City, December 7, 2019
Supervisor

Dean of Faculty of Computer Science and Engineering
Acknowledgements
I am very grateful to my supervisor, Assoc Prof Dr DANG TRAN KHANH, and my co-supervisor, Dr LE HONG TRANG, for the guidance, inspiration and constructive suggestions that helped me in the preparation of this graduation thesis.

I would like to thank my family very much, especially my parents, who have always been by my side and supported me in whatever I do.
Ho Chi Minh City, December 7, 2019
Abstract
With the development of technology, data has become one of the most essential factors of the 21st century. However, the explosion of the Internet has transformed these data into very big data that are hard to handle and process. In this thesis, we propose solutions for clustering large-scale data, a vital problem in machine learning and a widely applied matter in industry.

To solve this problem, we use data sampling methods based on the concept of coresets: subsets of the data that must be small enough to reduce computational complexity but must keep all the representative characteristics of the original data. In other words, we can scale down big datasets to much smaller ones that can be clustered efficiently, while the results obtained on them can be considered solutions for the whole original datasets. Besides, in order to make the solving process for large-scale datasets much faster, we apply the open framework for big data, Apache Spark.
In the scope of this thesis, we propose and prove two methods of coreset construction for k-means clustering. We also run experiments and evaluate the proposed algorithms to estimate the advantages and disadvantages of each one. This thesis can be divided into four parts, as follows:

• Chapter 1 and Chapter 2 are the introduction and overview of coresets and the related background. These chapters also provide a brief account of Apache Spark and some definitions as well as theorems that are used in this thesis.

• In Chapter 3, we propose and prove the first coreset construction, which is based on the Farthest-First-Traversal algorithm and the ProTraS algorithm [58], for k-median and k-means clustering. We also evaluate this method at the end of the chapter.

• In Chapter 4, based on prior work about the Lightweight Coreset [12], we propose and prove the correctness of the second coreset construction, the α-lightweight coreset for k-means clustering, a general form of the lightweight coreset with an adjustable parameter.

• In Chapter 5, we apply the α-lightweight coreset and the data generalization method to solve the overall problem of this thesis: clustering large-scale datasets. We also apply Apache Spark to solve the problem faster. To evaluate the correctness, we do experiments with some large-scale benchmark data samples.
Declaration of Authorship
I, NGUYEN LE HOANG, declare that this thesis, Clustering Large Datasets based on Data Sampling and Spark, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a Master of Science at University of Technology, VNU-HCMC.

• No part of this thesis has previously been submitted for any degree or any other qualification at this University or any other institution.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
Signed:
Date:
Contents
1 Introduction
  1.1 Overview
  1.2 The Scope of Research
  1.3 Research Contributions
    1.3.1 Scientific Significance
    1.3.2 Practical Significance
  1.4 Organization of Thesis
  1.5 Publications relevant to this Thesis
2 Background and Related Works
  2.1 k-Means and k-Means++ Clustering
    2.1.1 k-Means Clustering
    2.1.2 k-Means++ Clustering
  2.2 Coresets
    2.2.1 Definition
    2.2.2 Some Coreset Constructions
  2.3 Apache Spark
    2.3.1 What is Apache Spark?
    2.3.2 Why Apache Spark?
  2.4 Bounds on Sample Complexity of Learning
3 FFT-based Coresets
  3.1 Farthest-First-Traversal Algorithm
  3.2 FFT-based Coresets for k-Median and k-Means Clustering
  3.3 ProTraS algorithm and limitations
    3.3.1 ProTraS algorithm
    3.3.2 Drawbacks in ProTraS
  3.4 Proposed FFT-based Coreset Construction
    3.4.1 Proposed Algorithm
    3.4.2 Initial Step
    3.4.3 Decrease the Computational Complexity
  3.5 Experiments
    3.5.1 Experiment Setup
    3.5.2 Results and Discussion
4 General Lightweight Coresets
  4.1 Lightweight Coreset
    4.1.1 Definition
    4.1.2 Algorithm
  4.2 The α-Lightweight Coreset
    4.2.1 Definition
    4.2.2 Theorem about the Optimal Solutions
  4.3 Algorithm
  4.4 Analysis
5 Clustering Large Datasets via Coresets and Spark
  5.1 Processing Method
    5.1.1 Data Generalization
    5.1.2 Built-in k-Means clustering in Spark
    5.1.3 Realistic Method
  5.2 Experiments
    5.2.1 Experimental Method
    5.2.2 Experimental Data Sets
    5.2.3 Results
6 Conclusions
References
List of Figures

1.1 Big Data properties
1.2 Machine Learning: Supervised vs Unsupervised
2.1 Spark Logo - https://spark.apache.org
2.2 The Components of Spark
3.1 Original-R15 (small cluster) and scaling-R15
3.2 Some data sets for experiments
3.3 ARI in relation to subsample size for datasets D1 - D8
3.4 ARI in relation to subsample size for datasets D9 - D16
5.1 ARI and Runtime of Birch1 in relation to full data
5.2 ARI and Runtime of Birch2 in relation to full data
5.3 ARI and Runtime of Birch3 in relation to full data
5.4 ARI and Runtime of ConfLongDemo in relation to full data
5.5 ARI and Runtime of KDDCupBio in relation to full data
List of Tables

3.1 Data sets for Experiments
3.2 Experimental Results - Adjusted Rand Index Comparison
3.3 Experimental Results - Time Comparison
5.1 Data sets for Experiments
5.2 Experimental Results for dataset Birch1
5.3 Experimental Results for dataset Birch2
5.4 Experimental Results for dataset Birch3
5.5 Experimental Results for dataset ConfLongDemo
5.6 Experimental Results for dataset KDDCup Bio
List of Algorithms

1 k-Means Clustering - Lloyd's Algorithm [42]
2 D²-Sampling for k-Means++ [6]
3 Farthest-First-Traversal algorithm
4 ProTraS Algorithm [58]
5 Proposed FFT-based Coreset Construction
6 Lightweight Coreset [12]
7 Proposed α-Lightweight Coreset Construction
Chapter 1

Introduction

1.1 Overview

The world is creating increasing amounts of data every second. Besides, the growth of mobile and personal devices such as smartphones or tablets also makes the data bigger every day, especially through photo and video sharing on popular social networks like Facebook, Twitter, YouTube, etc. Many studies have shown that the amount of data created each year is growing faster than ever before, and they estimate that by 2020, every human on the planet will be creating 1.7 megabytes of information each second; in only a year, the accumulated world data will grow to 44 zettabytes¹ [46]. Another study from IDC predicts that the amount of global data captured in 2025 will reach 163 zettabytes, a tenfold increase compared to 2016 [55].

Consequently, researchers now face a new, hard situation: solving problems for data that is big in volume, variety, velocity, veracity and value (Figure 1.1²). Given the demand of understanding and explaining these data in order to solve real problems, the task is very hard for humans without help from machines. That is why machine learning plays an important role in this decade as well as in the future. By applying machine learning combined with artificial intelligence (AI), scientists can create systems having the ability to automatically learn and improve from experience without being explicitly programmed.
For each specific purpose, machine learning is divided into two categories: supervised and unsupervised. Supervised learning is a kind of training model where the training sets come with provided target labels; the system learns from these training sets and is then used to predict or classify future instances. In contrast, unsupervised machine learning approaches extract information from data sets where such explicit labels are not available. The importance of this field is expected to grow, as it is estimated that 85% of global data in 2025 will be unlabeled [55]. In particular, data clustering, the task of grouping similar objects together into clusters, seems to be a fruitful approach for analyzing that data [13]. Applications are broad and include fields such as computer vision [61], information retrieval [35], computational geometry [36] and recommendation systems [41]. Furthermore, clustering techniques can also be used to learn data representations that are used in downstream prediction tasks such as classification and regression [16]. Machine learning categories are described briefly in Figure 1.2³.

FIGURE 1.1: Big Data properties

¹ One zettabyte is equivalent to one billion gigabytes.
² Image source: https://www.edureka.co/blog/what-is-big-data/
FIGURE 1.2: Machine Learning: Supervised vs Unsupervised

³ Image source: https://towardsdatascience.com/supervised-vs-unsupervised-learning

In general, clustering is one of the most popular techniques in machine learning and is used widely in large-scale data analysis. The target of clustering is to partition a set of objects into groups such that objects in the same group are similar to each other and objects in different groups are dissimilar to each other. Due to its importance and application in reality, this technique has been investigated extensively and has various algorithms. For example, we can use BIRCH [68] and CURE [27], which belong to hierarchical clustering (also known as connectivity-based clustering), for solving problems based on the idea of objects being more related to nearby objects than to objects farther away. If the problems are closely related to statistics, we can use distribution-based clustering such as the Gaussian Mixture Model (GMM) [66] or DBCLASD [53]. For problems based on density clustering, in which data in high-density regions of the data space are considered to belong to the same cluster [38], we can use Mean-shift [17], DBSCAN [20] (the most well-known density-based clustering algorithm), or OPTICS [4] (an improvement of DBSCAN). One of the most common approaches to clustering is based on partitioning, in which the basic idea is to assign the centers of data points to some random objects; the actual centers are revealed through several iterations until a stop condition is satisfied. Some common algorithms of this kind are k-means [45], k-medoids [49], CLARA [37], and CLARANS [48]. For more detail, we refer readers to the surveys of clustering algorithms by R. Xu (2005) [65] and by D. Xu (2015) [64].
In fact, there are many clustering algorithms and improvements that can be used in applications, each with its own benefits and drawbacks. The question of choosing a suitable clustering algorithm is an important and difficult problem that users must deal with when they have to solve situations with specific configurations and settings. There is some research about this, such as [14], [39], [67], which explains the quality of clusters in some circumstances. However, in the scope of this thesis, we do not cover this issue or the various clustering algorithms; instead, we fix and select one of the most popular clustering algorithms, k-means clustering. We will use this algorithm throughout this report and investigate methods that can deal with k-means clustering for large-scale data sets.
Moreover, designing a complete solution that can cluster and analyze large-scale data is still a challenge for data scientists. Many methods have been proposed over the years to deal with machine learning for big data. One of the simplest ways depends on infrastructure and hardware: the more powerful and modern the machines we have, the more complicated and larger the amount of data we can solve. This solution is quite easy but costs a lot of money, and few people can afford it. Another option is finding suitable algorithms to reduce the computational complexity arising from the input size, which may contain millions or billions of data points. There are some approaches such as data compression [69], [1], data deduplication [19], dimension reduction [25], [60], [51], etc. For a survey about this, readers can find more useful information in [54]. Among big data reduction methods, data sampling is one of the popular options closely related to machine learning and data mining for researchers. The key idea of data sampling is that instead of solving problems on the full data with large-scale size, we can find the answer on a subset of this data; this result is then used as the baseline for finding the actual solution for the original data set. This leads us to a new difficulty: finding a subset that must be small enough to effectively reduce computational complexity but must keep all the representative characteristics of the original data. This difficulty is the motivation for us to do this research and this thesis as well.
1.2 The Scope of Research
In this thesis, we will propose a solution for the problem of clustering large datasets. We use the word "large" to indicate data that is "big" in volume, not in all the characteristics of big data described in the previous section with the 5 V's (Volume, Variety, Value, Velocity and Veracity) (Figure 1.1). However, the Volume, in other words the data size, is one of the most non-trivial difficulties that most researchers have to face when solving a big-data-related problem.
For the clustering algorithm, even though there are a lot of investigations and methods, we consider fixed clustering problems with the prototypical k-means clustering. We select this because k-means is the most well-known clustering algorithm and is widely applied in reality as well as in industry and scientific research.
While there is a wealth of prior work on clustering small and medium-sized data sets, there are unique challenges in the massive data setting. The traditional algorithms have a super-linear computational complexity in the size of the data set, making them infeasible when there are many data points. In the scope of this thesis, we apply data sampling to deal with the massive data setting. A very basic approach of this method is random sampling, or uniform sampling. In fact, while uniform sampling is feasible for some problems, there are instances where it performs very poorly due to the naive nature of the sampling strategy. For example, real-world data is often imbalanced and contains clusters of different sizes. As a consequence, a small fraction of data points can be very important and have an enormous impact on the objective function. Such imbalanced data sets are problematic for methods based on uniform sampling since, with high probability, these methods only sample points in large clusters and the information in small clusters is discarded [13].
The idea of finding a relevant subset of the original data to decrease the computational cost brings scientists to the concept of coresets, which was first applied in geometric approximation by Agarwal et al. in 2004 [2], [3]. The problem of coreset constructions for k-median and k-means clustering was then stated and investigated by Har-Peled et al. in [28], [29]. Since that time, many coreset construction algorithms have been proposed for a wide variety of clustering problems. In this thesis, we continue these prior works and propose new coreset construction algorithms for k-means clustering.
1.3 Research Contributions
In this thesis, we will solve the problem of clustering large data sets by using data sampling methods and the framework for big data, Apache Spark. Since the Spark framework belongs to the technical field and is maintained by Apache, we do not make any changes to its configuration, nor do we improve anything belonging to Spark. Instead, our research focuses on the data sampling methods that find the most relevant subsets, called coresets, of a full data set. Coresets, in other words, can be described as compact subsets such that models trained on a coreset also provide a good fit compared with models trained on the full data set. By using coresets, we can scale down big data to a tiny subset in order to reduce the computational cost of a machine learning problem. Through deep research about coresets, our thesis makes some scientific and practical contributions as follows.

1.3.1 Scientific Significance

• Based on the farthest-first-traversal algorithm and the ProTraS algorithm by Ros & Guillaume [58], we propose an FFT-based coreset construction. This is proved in Chapter 3.

• Based on the lightweight coreset of Bachem, Lucic and Krause [12], we propose a general model, the α-lightweight coreset, and then a general lightweight coreset construction that is very fast and practical. This is proved in Chapter 4.
1.3.2 Practical Significance
• Due to its high runtime, our proposed FFT-based coreset construction is very hard to use in reality. However, through experiments with some state-of-the-art coreset constructions, this proposed algorithm is shown to produce one of the best sample coresets that can be used in experiments.

• Our proposed α-lightweight coreset model is a generalization of the traditional lightweight coreset. This proposal can be used in various practical cases, especially in situations that need to focus on multiplicative errors or additive errors of the samples.

1.4 Organization of Thesis
The remainder of this thesis is organized as follows.
• Chapter 2. This chapter is an overview of prior works related to this thesis, including the k-means and k-means++ algorithms, the definition of coresets, a short brief about Apache Spark, and a theorem about bounds on sample complexity.

• Chapter 3. We introduce the farthest-first-traversal algorithm as well as ProTraS for finding coresets. Then we propose an FFT-based algorithm for coreset construction.

• Chapter 4. This chapter is about the lightweight coreset and our general lightweight coreset model. We also prove the correctness of this model and propose a general algorithm for the α-lightweight coreset.

• Chapter 5. This chapter shows the experimental runs for clustering large datasets. We use the α-lightweight coreset for the sampling process and k-means++ for clustering on the Apache Spark framework.

• Chapter 6. We present the conclusion of the thesis.
1.5 Publications relevant to this Thesis
• Nguyen Le Hoang, Tran Khanh Dang and Le Hong Trang. A Comparative Study of the Use of Coresets for Clustering Large Datasets. pp. 45-55, LNCS 11814, Future Data and Security Engineering, FDSE 2019.

• Le Hong Trang, Nguyen Le Hoang and Tran Khanh Dang. A Farthest-First-Traversal-based Sampling Algorithm for k-clustering. International Conference on Ubiquitous Information Management and Communication, IMCOM 2020.
Chapter 2
Background and Related Works
In this chapter, we provide a short introduction to the background and prior works related to this thesis:
• k-Means and k-means++ Clustering
• Coresets
• Apache Spark
• Bounds & Pseudo-dimension
2.1 k-Means and k-Means++ Clustering

2.1.1 k-Means Clustering
In 1957, an algorithm, now often referred to simply as "k-means", was proposed by S. Lloyd of Bell Labs; it was then published in 1982 [42]. Lloyd's algorithm begins with k arbitrary "centers", typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These last two steps are repeated until the process stabilizes [6].
Lloyd's algorithm is described in Algorithm 1.
Algorithm 1 k-Means Clustering - Lloyd's Algorithm [42]
Require: data set X, number of clusters k
Ensure: k separated clusters
1: Randomly initialize k centers C = {c_j}_{j=1}^{k} ∈ R^{d×k}
2: while not converged do
3:    Assign each point x ∈ X to its nearest center in C
4:    Recompute each center c_j as the mean of all points assigned to it
5: end while
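For concreteness, the following is a minimal NumPy sketch of Lloyd's algorithm as described above; the convergence tolerance, iteration cap and seeding scheme are our own illustrative choices rather than part of [42].

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal sketch of Lloyd's algorithm; X is an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Line 1 of Algorithm 1: pick k initial centers uniformly at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: recompute each center as the mean of its points,
        # keeping the old center if a cluster happens to be empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:  # process stabilized
            break
        centers = new_centers
    return centers, labels
```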
The algorithm was later developed further by Inaba et al. [33], Matousek [47], Vega et al. [63], etc. However, one of the most notable improvements of k-means is k-means++ by Arthur and Vassilvitskii [6]. We give an overview of this algorithm in the next section.
2.1.2 k-Means++ Clustering
In Algorithm 1, the initial set of cluster centers (line 1) is based on random sampling, where k points are selected uniformly at random from the data set. This simple approach is fast and easy to implement. However, there are many natural examples for which the algorithm generates arbitrarily bad clusterings. This happens due to the conflicting placement of the starting centers; in particular, it can hold with high probability even if the centers are chosen uniformly at random from the data points [6].
To overcome this problem, Arthur and Vassilvitskii [6] proposed the algorithm named k-means++, which uses adaptive seeding based on a technique called D²-sampling to create its initial seed set before running Lloyd's algorithm to convergence [8]. Given an existing set of centers S, the D²-sampling strategy, as the name suggests, samples each point x ∈ X with probability proportional to its squared distance to the selected centers, i.e.,

p(x | S) = d(x, S)² / ∑_{x′∈X} d(x′, S)²
D²-sampling is described in Algorithm 2.
Algorithm 2 D²-Sampling for k-Means++ [6]
Require: data set X, number of clusters k
Ensure: initial set S used for k-means
1: Sample the first center uniformly at random from X and add it to S
2: while |S| < k do
3:    Sample x ∈ X with probability p(x | S) and add it to S
4: end while
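A compact Python sketch of Algorithm 2 follows; the incremental maintenance of d(x, S)² is an implementation convenience we add, not part of the pseudocode.

```python
import numpy as np

def d2_sampling(X, k, seed=0):
    """Sketch of Algorithm 2: D^2-sampling seed set for k-means++."""
    rng = np.random.default_rng(seed)
    # First center: uniform over the data set.
    S = [X[rng.integers(len(X))]]
    # d(x, S)^2 for every point, updated as centers are added.
    d2 = np.linalg.norm(X - S[0], axis=1) ** 2
    while len(S) < k:
        # p(x|S) proportional to the squared distance to the chosen centers.
        p = d2 / d2.sum()
        c = X[rng.choice(len(X), p=p)]
        S.append(c)
        d2 = np.minimum(d2, np.linalg.norm(X - c, axis=1) ** 2)
    return np.array(S)
```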
2.2 Coresets
In this thesis, we apply data sampling via coresets to deal with the massive data setting. In computational geometry, a coreset is a small set of points that approximates the shape of a larger point set, in the sense that applying some geometric measure to the two sets results in approximately equal numbers [Wikipedia]. In terms of clustering problems, a coreset is a weighted subset of the data such that the quality of any clustering evaluated on the coreset closely approximates the quality on the full data set.

In most cases, it is not easy to find this most relevant subset. Consequently, attention has shifted to developing approximation algorithms. The goal now is to compute a (1 + ε)-approximation subset, for some 0 < ε < 1. The framework of coresets has recently emerged as a general approach to achieve this goal [3].
2.2.1 Definition

The definition of a coreset depends on each machine learning problem. For k-means clustering, the definition of coresets can be stated as follows.

Definition 1 (Coresets for k-Means Clustering) (S. Har-Peled and S. Mazumdar [28])
Let ε > 0. The weighted set C is a (k, ε)-coreset of X if, for any Q ⊂ R^d of cardinality at most k,

|φ_X(Q) − φ_C(Q)| ≤ ε φ_X(Q)

which is equivalent to

(1 − ε) φ_X(Q) ≤ φ_C(Q) ≤ (1 + ε) φ_X(Q)
This is a strong theoretical guarantee, as the cost evaluated on the coreset, φ_C(Q), has to approximate the cost on the full data set, φ_X(Q), up to a 1 ± ε multiplicative factor uniformly for all possible sets of cluster centers. As a direct consequence, solving on the coreset yields provably competitive solutions when evaluated on the full data set [43]. More formally, Lucic et al. [43] showed that if C is a coreset of X with ε ∈ (0, 1/3), then

φ_X(Q*_C) ≤ (1 + 3ε) φ_X(Q*_X)

where Q*_C denotes the optimal solution of k centers on C and Q*_X denotes the optimal solution on X. This means that the optimal solution on the coreset produces a (1 + 3ε)-approximation on the original data. As a result, we can solve the clustering problem on the coreset while retaining strong theoretical guarantees.
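This multiplicative guarantee can also be inspected numerically. The sketch below, our own illustration rather than code from [28] or [43], computes the quantization costs φ_X(Q) and φ_C(Q) and reports the resulting relative error for a candidate set of centers Q:

```python
import numpy as np

def quantization_cost(points, Q, weights=None):
    """phi(Q): sum of (weighted) squared distances to the nearest center in Q."""
    d2 = ((points[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return d2.sum() if weights is None else (weights * d2).sum()

def relative_error(X, C, w, Q):
    """|phi_X(Q) - phi_C(Q)| / phi_X(Q); a (k, eps)-coreset keeps this <= eps."""
    phi_X = quantization_cost(X, Q)
    phi_C = quantization_cost(C, Q, weights=w)
    return abs(phi_X - phi_C) / phi_X
```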
2.2.2 Some Coreset Constructions
Many coreset construction algorithms have been proposed in recent years for k-means clustering problems. One of the first methods is based on exponential grids, by Har-Peled and Mazumdar [28], with an improved version by Har-Peled and Kushal [29]. The sampling-based approach to coreset construction was first used by Chen [15] and was investigated deeply by Feldman et al. in plenty of research: coresets for k-means (Feldman, Monemizadeh and Sohler [21]), for high-dimensional subspaces (Feldman, Monemizadeh, Sohler and Woodruff [22]), for mixture models (Feldman, Faulkner and Krause [23]), and for PCA and projective clustering (Feldman, Schmidt and Sohler [24]). Based on these prior works, Lucic et al. constructed coresets for the estimation of Gaussian mixture models (Lucic, Faulkner and Krause [44]) and for clustering with Bregman divergences (Lucic, Bachem and Krause [43]). Recently, Bachem et al. proposed coresets for nonparametric clustering (Bachem, Lucic and Krause [7]), one-shot coresets for k-clustering (Bachem, Lucic and Lattanzi [11]) and lightweight coresets (Bachem, Lucic and Krause [12]). Besides, Ros and Guillaume proposed a coreset construction based on the Farthest-First-Traversal algorithm in [58]. For a more detailed survey of results about coresets, we refer the reader to the paper [9] by Bachem, Lucic and Krause.
In this thesis, we will continue these prior works and propose new coreset construction algorithms for k-means clustering.

• Based on the farthest-first-traversal algorithm and the ProTraS algorithm by Ros & Guillaume [58], we propose an FFT-based coreset construction. This is explained and proved in Chapter 3.

• Based on the lightweight coreset of Bachem, Lucic and Krause [12], we propose a general model called the α-lightweight coreset. This is proved in Chapter 4.
2.3 Apache Spark
Apache Spark, an open-source distributed cluster computing framework, was originally developed at the University of California, Berkeley's AMPLab. The Spark codebase was later donated to the Apache Software Foundation. There are many resources about this on the Internet; to provide brief and clear details about Spark, most of the content of this section is taken from https://mapr.com/products/apache-spark/
Apache Spark is a powerful unified analytics engine for large-scale distributed data processing and machine learning. On top of the Spark core data processing engine are libraries for SQL, machine learning, graph computation, and stream processing. These libraries can be used together in many stages of modern data pipelines and allow for code reuse across batch, interactive, and streaming applications.

Spark is useful for ETL processing, analytics and machine learning workloads, and for batch and interactive processing of SQL queries, machine learning inference, and artificial intelligence applications.
FIGURE 2.1: Spark Logo - https://spark.apache.org
Data Pipelines. Much of Spark's power lies in its ability to combine very different techniques and processes into a single, coherent whole. Outside Spark, the discrete tasks of selecting data, transforming that data in various ways, and analyzing the transformed results might easily require a series of separate processing frameworks, such as Apache Oozie. Spark, on the other hand, offers the ability to combine these, crossing boundaries between batch, streaming, and interactive workflows in ways that make the user more productive.

Spark jobs perform multiple operations consecutively, in memory, only spilling to disk when required by memory limitations. Spark simplifies the management of these disparate processes, offering an integrated whole: a data pipeline that is easier to configure, run, and maintain.
Challenges with Previous Technologies. Before Spark, there was MapReduce. With MapReduce, iterative algorithms require chaining multiple MapReduce jobs together, which causes a lot of reading and writing to disk. For each MapReduce job, data is read from a distributed file block into a map process, written to and read from an intermediate file, and then written to an output file from a reducer process.
FIGURE 2.2: The Components of Spark
Advantages of Spark. The goal of the Spark project was to keep the benefits of MapReduce's scalable, distributed, fault-tolerant processing framework while making it more efficient and easier to use. Spark is designed for speed:

• Spark runs multi-threaded lightweight tasks inside of JVM processes, providing fast job startup and parallel multi-core CPU utilization.

• Spark caches data in memory across multiple parallel operations, making it especially fast for parallel processing of distributed data with iterative algorithms.

• Spark provides a rich functional programming model and comes packaged with higher-level libraries for SQL, machine learning, streaming, and graphs.
The components of Apache Spark are shown in Figure 2.2¹.
¹ https://backtobazics.com/big-data/spark/understanding-apache-spark-architecture/
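To illustrate how Spark is used later in this thesis (Chapter 5 relies on Spark's built-in k-means), the sketch below runs MLlib's KMeans on a dataset loaded into a DataFrame. The file path, column handling and parameter values are illustrative assumptions; only the KMeans and VectorAssembler APIs come from the pyspark.ml library.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("coreset-kmeans").getOrCreate()

# Illustrative input: a CSV of numeric columns (the path is hypothetical).
df = spark.read.csv("data/coreset.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
data = assembler.transform(df)

# Spark MLlib's built-in k-means; k and seed are illustrative choices.
model = KMeans(k=15, seed=1, featuresCol="features").fit(data)
print(model.clusterCenters())

spark.stop()
```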
2.4 Bounds on Sample Complexity of Learning
In Chapter 4 of this thesis, we consider function families which map from a dataset X to [0, 1] and where each function in the family corresponds to a solution in our solution space. Intuitively, the pseudo-dimension provides a measure of the combinatorial complexity of the underlying machine learning problem and may be viewed as the generalization of the VC-dimension to [0, 1]-valued functions [13]. The definition of the pseudo-dimension was first proposed by Haussler [31]. For an overview of the pseudo-dimension, we refer to Anthony and Bartlett [5].

Definition 2 (Pseudo-Dimension) (Haussler [31])
Fix a countably infinite domain X. The pseudo-dimension of a set F of functions from X to [0, 1], denoted by Pdim(F), is the largest d such that there is a sequence x_1, x_2, ..., x_d of domain elements from X and a sequence r_1, r_2, ..., r_d of real thresholds such that for each b_1, b_2, ..., b_d ∈ {above, below}, there is an f ∈ F such that for all i = 1, 2, ..., d, we have f(x_i) ≥ r_i ⟺ b_i = above.
By using the properties of the pseudo-dimension, Li, Long, and Srinivasan proposed a theorem giving a tight bound on the number of samples required for all functions in the function family [40].

Theorem 1 (Bounds on the Sample Complexity of Learning) (Y. Li, P. M. Long and A. Srinivasan [40])
Let α > 0, ν > 0 and δ > 0. Fix a countably infinite domain X and let p(·) be any probability distribution over X. Let F be a set of functions from X to [0, 1] with Pdim(F) = d, and denote by C a sample of m points from X sampled independently according to p(·).
Chapter 3

FFT-based Coresets

In this chapter, we propose a coreset construction based on the Farthest-First-Traversal (FFT) algorithm which, despite requiring a lot of computations, can be considered one of the best relevant coresets that may be built for research and scientific purposes.
• Firstly, we show that the Farthest-First-Traversal algorithm (FFT) can yield a (k, ε)-coreset for both k-median and k-means clustering.

• Secondly, we illustrate some existing limitations of ProTraS [58], the state-of-the-art coreset construction based on FFT.

• From that, based on FFT combined with the good points of ProTraS, we propose an algorithm for coreset construction for both k-means and k-median clustering.

• We compare this proposed coreset with other state-of-the-art sample coresets, from the Lightweight Coreset of Bachem et al. [12] and the Adaptive Sampling of Feldman et al. [23], with Uniform Sampling as a baseline, to show that this proposed coreset can be considered the most suitable subset of an original full data set.
Even though this thesis is mainly about coresets for k-means clustering, in this chapter we also prove results about coresets for k-median clustering. Therefore, the FFT-based coresets can be applied not only to k-means but also to k-median clustering.
3.1 Farthest-First-Traversal Algorithm
We start this chapter with a short introduction to the Farthest-First-Traversal (FFT) algorithm. In computational geometry, the FFT of a metric space is a set of points selected sequentially; after the first point is chosen arbitrarily, each successive point is located as the farthest one from the set of previously selected points. The first use of the FFT was by Rosenkrantz, Stearns & Lewis [59] in connection with heuristics for the traveling salesman problem. Then, Gonzalez [26] used it as part of a greedy approximation algorithm for the problem of finding k clusters that minimize the maximum diameter of a cluster. Later, Arthur & Vassilvitskii [6] used an FFT-like algorithm to propose the k-means++ algorithm.
The FFT is described in Algorithm 3.
Algorithm 3 Farthest-First-Traversal algorithm
Require: dataset X with |X| = n, sample size s
Ensure: set C of s selected points
1: Select an initial point y_1 ∈ X arbitrarily; C ← {y_1}
2: while |C| < s do
3:    Select y = argmax_{x∈X\C} min_{c∈C} d(x, c), the farthest point from C
4:    C ← C ∪ {y}
5: end while
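A direct Python sketch of Algorithm 3 is shown below; it runs in O(n·s) time, and the random choice of the initial point is an illustrative assumption.

```python
import numpy as np

def farthest_first_traversal(X, s, seed=0):
    """Sketch of FFT: greedily pick s points, each farthest from those chosen."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]            # first point chosen arbitrarily
    d = np.linalg.norm(X - X[idx[0]], axis=1)    # d(x, C) for every x
    while len(idx) < s:
        j = int(d.argmax())                      # farthest point from current set
        idx.append(j)
        d = np.minimum(d, np.linalg.norm(X - X[j], axis=1))
    return np.array(idx)
```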
As mentioned, the FFT algorithm can be used to solve many complicated problems in data mining and machine learning. In the next section, we prove that the process of FFT can yield coresets for both k-median and k-means clustering; then, we can find a coreset by applying the FFT algorithm.
3.2 FFT-based Coresets for k-Median and k-Means Clustering
Firstly, we define some expressions used in this section. Let X ⊂ R^d be a data set and x ∈ X. Let C ⊂ X be a subset of X. For each c ∈ C, we denote

• w_c = |T(c)|, where T(c) is the set of items from X whose closest point in C is c, i.e.

T(c) = {y | (y ∈ X \ C) ∧ (d(y, c) = min_{c′∈C} d(y, c′))}

• d_c = max_{t∈T(c)} d(t, c), the largest distance from any point in T(c) to c.
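Given X and a sample C, these quantities can be computed in one pass over the nearest-representative assignment; a small sketch (the helper name is ours) follows.

```python
import numpy as np

def coreset_stats(X, C):
    """For each c in C, return w_c = |T(c)| and d_c = max distance within T(c)."""
    # Nearest representative for every point in X; points of C assign to
    # themselves at distance zero, which leaves d_c unaffected.
    dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    w = np.bincount(nearest, minlength=len(C))            # w_c
    d = np.zeros(len(C))
    for j in range(len(C)):
        if np.any(nearest == j):
            d[j] = dist[nearest == j, j].max()            # d_c
    return w, d
```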
Theorem 2 (FFT-based Coresets for k-Median Clustering)
There exists an ε > 0 such that the subset obtained from the FFT algorithm on dataset X is a (k, ε)-coreset of X for k-median clustering.
Proof: k-median is a variation of k-means where, instead of calculating the mean for each cluster to determine its centroid, one instead calculates the median. With k-median, the functions φ of X and C are defined as

φ_X(Q) = ∑_{x∈X} d(x, Q),    φ_C(Q) = ∑_{c∈C} w_c d(c, Q)

By the triangle inequality, for every c ∈ C and every x ∈ T(c),

|d(x, Q) − d(c, Q)| ≤ d(x, c) ≤ d_c

Summing this inequality over all x ∈ T(c) and then over all c ∈ C bounds the gap between the two costs:

|φ_X(Q) − φ_C(Q)| ≤ ∑_{c∈C} w_c d_c

Choosing

ε = (∑_{c∈C} w_c d_c) / φ_X(Q)    (3.3)

gives |φ_X(Q) − φ_C(Q)| ≤ ε φ_X(Q), so C is a (k, ε)-coreset of X for k-median clustering.
Theorem 3 (FFT-based Coresets for k-Means Clustering)
There exists an ε > 0 such that the subset obtained from the FFT algorithm on dataset X is a (k, ε)-coreset of X for k-means clustering.
Proof: With k-means, the functions φ of X and C are

φ_X(Q) = ∑_{x∈X} d(x, Q)²,    φ_C(Q) = ∑_{c∈C} w_c d(c, Q)²

We denote d_max = max_{x∈X} d(x, Q). Using the triangle inequality, ∀x ∈ X and c ∈ C, we have

d(x, Q) ≤ d(x, c) + d(c, Q)
d(x, Q) ≤ d_c + d(c, Q), ∀x ∈ T(c)
d(x, Q)² ≤ (d_c + d(c, Q))²
d(x, Q)² ≤ d_c² + d(c, Q)² + 2 d_c d_max    (since d(c, Q) ≤ d_max)

Summing the last inequality over all x ∈ T(c), and then over all c ∈ C, bounds φ_X(Q) by φ_C(Q) plus an additive term. We set

Δ = ∑_{c∈C} w_c d_c (d_c + 2 d_max)

and choose ε = Δ / φ_X(Q), which yields |φ_X(Q) − φ_C(Q)| ≤ ε φ_X(Q).
Theorem 4 (FFT-based Coresets for both k-Median and k-Means Clustering)
The sample obtained by applying the FFT algorithm on data set X is a (k, ε)-coreset of X for both k-median and k-means clustering.
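Theorems 2 and 3 make the approximation level realized by a given FFT sample computable. The sketch below evaluates the k-means bound ε = Δ/φ_X(Q) from the proof of Theorem 3, assuming the coreset_stats helper sketched earlier:

```python
import numpy as np

def kmeans_epsilon_bound(X, C, Q):
    """Evaluate eps = Delta / phi_X(Q) with Delta = sum_c w_c d_c (d_c + 2 d_max)."""
    w, d = coreset_stats(X, C)                   # w_c and d_c per representative
    dist2_Q = ((X[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    d_max = np.sqrt(dist2_Q).max()               # d_max = max_x d(x, Q)
    delta = (w * d * (d + 2.0 * d_max)).sum()    # Delta from Theorem 3
    return delta / dist2_Q.sum()                 # phi_X(Q) = sum of squared dists
```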
3.3 ProTraS algorithm and limitations
In the previous section, we have shown that the FFT algorithm can be used to build a coreset for both k-median and k-means clustering. However, there is little research about FFT-based coresets. In 2018, the FFT algorithm was first used to find coresets by Ros and Guillaume [58]. The sample from their coreset construction can be considered a coreset for k-median clustering. Even though their work has some limitations, it is still a very valuable resource and a great idea for further study.

3.3.1 ProTraS algorithm
In 2017, Ros and Guillaume proposed DENDIS [56] and DIDES [57], which are iterative algorithms based on the hybridization of distance and density concepts. They differ in the priority given to distance or density, and in the stopping criterion defined accordingly; however, they have drawbacks. In 2018, based on the FFT algorithm and the good points of DENDIS and DIDES, Ros and Guillaume proposed a new algorithm named ProTraS [58] that is both easy to tune and scalable. The ProTraS algorithm is based on a sampling cost which is computed according to the within-group distance and to the representativeness of the sample item. The algorithm is designed to produce a (k, ε)-coreset and uses the approximation level ε as the stopping criterion. It has been used and received good reviews in some research, such as [62] by Le Hong Trang et al.
The original ProTraS is described in Algorithm 4.
Algorithm 4 ProTraS Algorithm [58]
Require: T = {x_i}, for i = 1, 2, ..., n, and ε
Ensure: S = {y_j} and T(y_j), for j = 1, 2, ..., s
1: Select an initial pattern x_init ∈ T
...
11: Find d_max(y_k) = max_{x_m∈T(y_k)} d(x_m, y_k)
12: Store d_max(y_k) and x_max(y_k), where d_max(y_k) = d(x_max(y_k), y_k)
...
FIGURE 3.1: Original-R15 (small cluster) and scaling-R15
ProTraS stops when cost < ε_0, where ε_0 is a given input constant (line 4). This cost function is taken from the expression (3.3) of ε for k-median clustering, under an assumption equivalent to cost_X(Q) ≥ n, which leads to some errors as follows.

• For data with large distances between points, Algorithm 4 needs a lot of iterations before reaching the stop condition; in some cases, with a given ε_0 = 0.1, Algorithm 4 returns a coreset with size equal to the size of the full data set.

• Algorithm 4 cannot treat two data sets with the same shape but different scale parameters in the same way (Figure 3.1). With the same given ε_0, the data with the larger scale will have a bigger coreset size.
In fact, with the assumption above, Algorithm 4 can only be optimized and create good samples in circumstances that satisfy the given assumption. Although the original ProTraS is just for k-median clustering and has some limitations, the idea of coreset construction from the FFT algorithm is worth deeper research. In the next section, we propose an FFT-based coreset construction for both k-means and k-median clustering that is independent of the value of ε.
3.4 Proposed FFT-based Coreset Construction
Based on the FFT algorithm and the three theorems in Section 3.2, we propose an algorithm for coreset construction for both k-median and k-means clustering. The pseudocode of the proposed algorithm is described in Algorithm 5. To complete the proposed algorithm, we also propose a strategy for the initial step and a strategy for decreasing the computational complexity.

Algorithm 5 Proposed FFT-based Coreset Construction
...
The main loop (lines 5-20) includes two sub-loops.

• In the first one (lines 6-9), each unselected pattern x_i ∈ X \ C is attached to the closest selected element in C.

• The second loop finds the next selected representative, x* = x_max(c*), which is the farthest item from the current C. In this step, if we simply choose x* at the farthest distance between all points in X and the current set C, like the original FFT, i.e.

d(x*, C) = max_{(x∈X\C) ∧ (c∈C)} d(x, c)