VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
UNIVERSITY OF TECHNOLOGY
———oOo———
NGUYEN LE HOANG
CLUSTERING LARGE DATASETS
BASED ON DATA SAMPLING AND SPARK
Major: Computer Science
Code: 8480101
MASTER THESIS
Ho Chi Minh City, January 2020
THE RESEARCH WORK FOR THIS THESIS HAS BEEN CARRIED OUT AT UNIVERSITY OF TECHNOLOGY, VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY
Under the supervision of
• Supervisor: ASSOC PROF DR DANG TRAN KHANH
• Co-supervisor: DR LE HONG TRANG
Examiner Board
• Examiner 1: DR PHAN TRONG NHAN
• Examiner 2: ASSOC PROF DR HUYNH TRUNG HIEU
This thesis was reviewed and defended at University of Technology, VNU-HCMC, on December 30, 2019. The members of the Thesis Defense Committee are:
1 ASSOC PROF DR NGUYEN THANH BINH
2 DR NGUYEN AN KHUONG
3 DR PHAN TRONG NHAN
4 ASSOC PROF DR HUYNH TRUNG HIEU
5 ASSOC PROF DR NGUYEN TUAN DANG
Confirmation from the President of the Thesis Defense Committee and the Dean of the Faculty of Computer Science and Engineering:

President of Thesis Defense Committee
ASSOC PROF DR NGUYEN THANH BINH

Dean of Faculty of Computer Science and Engineering
MASTER THESIS OBLIGATIONS
Student: NGUYEN LE HOANG
Student ID: 1770472
Date of Birth: March 12, 1988
Place of Birth: Ho Chi Minh City
Major: Computer Science
Code: 8480101
I THESIS TITLE: CLUSTERING LARGE DATASETS BASED ON DATA SAMPLING AND SPARK
II OBLIGATIONS AND CONTENTS:
• Study and research clustering problems, data sampling methods, data generalization, and the Apache Spark framework for big data.
• Based on data sampling, we propose and prove algorithms for coreset constructions in order to find the most suitable subsets that can both be used to reduce the computational cost and be used as representative subsets of the full original datasets in clustering problems.

• Do experiments and evaluate the proposed methods.
III START DATE: January 4, 2019
IV END DATE: December 7, 2019
V SUPERVISORS: ASSOC PROF DR DANG TRAN KHANH
and DR LE HONG TRANG
Ho Chi Minh City, December 7, 2019
Supervisor

Dean of Faculty of Computer Science and Engineering
Acknowledgements
I am very grateful to my supervisor, Assoc Prof Dr DANG TRAN KHANH, and my co-supervisor, Dr LE HONG TRANG, for the guidance, inspiration and constructive suggestions that helped me in the preparation of this graduation thesis.

I would like to thank my family very much, especially my parents, who have always been by my side and supported me in whatever I do.
Ho Chi Minh City, December 7, 2019
Abstract
With the development of technology, data has become one of the most essential factors of the 21st century. However, the explosion of the Internet has transformed these data into very big data that are hard to handle and process. In this thesis, we propose solutions for clustering large-scale data, a vital problem in machine learning and a widely applied matter in industry.

To solve this problem, we use data sampling methods based on the concept of coresets: subsets of the data that must be small enough to reduce computational complexity but must keep all the representative characteristics of the original data. In other words, we can scale down big datasets to much smaller ones that can be clustered efficiently, while the results obtained on them can be considered solutions for the whole original datasets. Besides, in order to make the solving process for large-scale datasets much faster, we apply the open framework for big data, Apache Spark.
In the scope of this thesis, we propose and prove two methods of coreset construction for k-means clustering. We also run experiments and evaluate the proposed algorithms to estimate the advantages and disadvantages of each one. This thesis can be divided into four parts, as follows:

• Chapter 1 and Chapter 2 are the introduction and overview of coresets and the related background. These chapters also provide a brief account of Apache Spark and some definitions as well as theorems that are used in this thesis.

• In Chapter 3, we propose and prove the first coreset construction, which is based on the Farthest-First-Traversal algorithm and the ProTraS algorithm [58], for k-median and k-means clustering. We also evaluate this method at the end of the chapter.

• In Chapter 4, based on prior work about the Lightweight Coreset [12], we propose and prove the correctness of the second coreset construction, the α-lightweight coreset for k-means clustering, a general form of the lightweight coreset with an adjustable parameter.

• In Chapter 5, we apply the α-lightweight coreset and the data generalization method to solve the overall problem of this thesis: clustering large-scale datasets. We also apply Apache Spark to solve the problem faster. To evaluate the correctness, we do experiments with some large-scale benchmark data samples.
Declaration of Authorship
I, NGUYEN LE HOANG, declare that this thesis, Clustering Large Datasets based on Data Sampling and Spark, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a Master of Science at University of Technology, VNU-HCMC.

• No part of this thesis has previously been submitted for any degree or any other qualification at this University or any other institution.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
Signed:
Date:
Contents
1 Introduction
  1.1 Overview
  1.2 The Scope of Research
  1.3 Research Contributions
    1.3.1 Scientific Significance
    1.3.2 Practical Significance
  1.4 Organization of Thesis
  1.5 Publications relevant to this Thesis
2 Background and Related Works
  2.1 k-Means and k-Means++ Clustering
    2.1.1 k-Means Clustering
    2.1.2 k-Means++ Clustering
  2.2 Coresets
    2.2.1 Definition
    2.2.2 Some Coreset Constructions
  2.3 Apache Spark
    2.3.1 What is Apache Spark?
    2.3.2 Why Apache Spark?
  2.4 Bounds on Sample Complexity of Learning
3 FFT-based Coresets
  3.1 Farthest-First-Traversal Algorithm
  3.2 FFT-based Coresets for k-Median and k-Means Clustering
  3.3 ProTraS algorithm and limitations
    3.3.1 ProTraS algorithm
    3.3.2 Drawbacks in ProTraS
  3.4 Proposed FFT-based Coreset Construction
    3.4.1 Proposed Algorithm
    3.4.2 Initial Step
    3.4.3 Decrease the Computational Complexity
  3.5 Experiments
    3.5.1 Experiment Setup
    3.5.2 Results and Discussion
4 General Lightweight Coresets
  4.1 Lightweight Coreset
    4.1.1 Definition
    4.1.2 Algorithm
  4.2 The α-Lightweight Coreset
    4.2.1 Definition
    4.2.2 Theorem about the Optimal Solutions
  4.3 Algorithm
  4.4 Analysis
5 Clustering Large Datasets via Coresets and Spark
  5.1 Processing Method
    5.1.1 Data Generalization
    5.1.2 Built-in k-Means clustering in Spark
    5.1.3 Realistic Method
  5.2 Experiments
    5.2.1 Experimental Method
    5.2.2 Experimental Data Sets
    5.2.3 Results
6 Conclusions
References
List of Figures

1.1 Big Data properties
1.2 Machine Learning: Supervised vs Unsupervised
2.1 Spark Logo - https://spark.apache.org
2.2 The Components of Spark
3.1 Original-R15 (small cluster) and scaling-R15
3.2 Some data sets for experiments
3.3 ARI in relation to subsample size for datasets D1 - D8
3.4 ARI in relation to subsample size for datasets D9 - D16
5.1 ARI and Runtime of Birch1 in relation to full data
5.2 ARI and Runtime of Birch2 in relation to full data
5.3 ARI and Runtime of Birch3 in relation to full data
5.4 ARI and Runtime of ConfLongDemo in relation to full data
5.5 ARI and Runtime of KDDCupBio in relation to full data
List of Tables

3.1 Data sets for Experiments
3.2 Experimental Results - Adjusted Rand Index Comparison
3.3 Experimental Results - Time Comparison
5.1 Data sets for Experiments
5.2 Experimental Results for dataset Birch1
5.3 Experimental Results for dataset Birch2
5.4 Experimental Results for dataset Birch3
5.5 Experimental Results for dataset ConfLongDemo
5.6 Experimental Results for dataset KDDCup Bio
List of Algorithms

1 k-Means Clustering - Lloyd's Algorithm [42]
2 D²-Sampling for k-Means++ [6]
3 Farthest-First-Traversal algorithm
4 ProTraS Algorithm [58]
5 Proposed FFT-based Coreset Construction
6 Lightweight Coreset [12]
7 Proposed α-Lightweight Coreset Construction
Chapter 1

Introduction

1.1 Overview

The world is creating increasing amounts of data every second. Besides, the growth of mobile and personal devices such as smartphones or tablets also makes the data bigger every day, especially through photo and video sharing on popular social networks like Facebook, Twitter, YouTube, etc. Many studies have shown that the amount of data created each year is growing faster than ever before, and they estimate that by 2020, every human on the planet will be creating 1.7 megabytes of information each second; in only a year, the accumulated world data will grow to 44 zettabytes¹ [46]. Another study from IDC predicts that the amount of global data captured in 2025 will reach 163 zettabytes, a tenfold increase compared to 2016 [55].

Consequently, researchers now face a new, hard situation: solving problems for data that is big in volume, variety, velocity, veracity and value (Figure 1.1²). Given the demand of understanding and explaining these data in order to solve real problems, the task is very hard for humans without help from machines. That is why machine learning plays an important role in this decade as well as in the future. By applying machine learning combined with artificial intelligence (AI), scientists can create systems having the ability to automatically learn and improve from experience without being explicitly programmed.
For each specific purpose, machine learning is divided into two categories: supervised and unsupervised. Supervised learning is a kind of training model where the training sets come with provided target labels; the system learns from these training sets and is then used to predict or classify future instances. In contrast, unsupervised machine learning approaches extract information from data sets where such explicit labels are not available. The importance of this field is expected to grow, as it is estimated that 85% of global data in 2025 will be unlabeled [55]. In particular, data clustering, the task of grouping similar objects together into clusters, seems to be a fruitful approach for analyzing that data [13]. Applications are broad and include fields such as computer vision [61], information retrieval [35], computational geometry [36] and recommendation systems [41]. Furthermore, clustering techniques can also be used to learn data representations that are used in downstream prediction tasks such as classification and regression [16]. Machine learning categories are described briefly in Figure 1.2³.

FIGURE 1.1: Big Data properties

¹ One zettabyte is equivalent to one billion gigabytes.
² Image source: https://www.edureka.co/blog/what-is-big-data/
FIGURE 1.2: Machine Learning: Supervised vs Unsupervised

³ Image source: https://towardsdatascience.com/supervised-vs-unsupervised-learning

In general, clustering is one of the most popular techniques in machine learning and is used widely in large-scale data analysis. The target of clustering is to partition a set of objects into groups such that objects in the same group are similar to each other and objects in different groups are dissimilar to each other. Due to its importance and application in reality, this technique has been investigated extensively and has various algorithms. For example, we can use BIRCH [68] and CURE [27], which belong to hierarchical clustering (also known as connectivity-based clustering), for solving problems based on the idea of objects being more related to nearby objects than to objects farther away. If the problems are closely related to statistics, we can use distribution-based clustering such as the Gaussian Mixture Model (GMM) [66] or DBCLASD [53]. For problems based on density clustering, in which data in high-density regions of the data space are considered to belong to the same cluster [38], we can use Mean-shift [17], DBSCAN [20] (the most well-known density-based clustering algorithm), or OPTICS [4] (an improvement of DBSCAN). One of the most common approaches to clustering is based on partitioning, in which the basic idea is to assign the centers of data points to some random objects; the actual centers are revealed through several iterations until a stop condition is satisfied. Some common algorithms of this kind are k-means [45], k-medoids [49], CLARA [37], and CLARANS [48]. For more detail, we refer readers to the surveys of clustering algorithms by R. Xu (2005) [65] and by D. Xu (2015) [64].
In fact, there are many clustering algorithms and improvements that can be used in applications, each with its own benefits and drawbacks. The question of choosing a suitable clustering algorithm is an important and difficult problem that users must deal with when they have to solve situations with specific configurations and settings. There is some research about this, such as [14], [39], [67], which explains the quality of clusters in some circumstances. However, in the scope of this thesis, we do not cover this issue or the various clustering algorithms; instead, we fix and select one of the most popular clustering algorithms, k-means clustering. We will use this algorithm throughout this report and investigate methods that can deal with k-means clustering for large-scale data sets.
Moreover, designing a complete solution that can cluster and analyze large-scale data is still a challenge for data scientists. Many methods have been proposed over the years to deal with machine learning for big data. One of the simplest ways depends on infrastructure and hardware: the more powerful and modern the machines we have, the more complicated and larger the amount of data we can solve. This solution is quite easy but costs a lot of money, and few people can afford it. Another option is finding suitable algorithms to reduce the computational complexity arising from the input size, which may contain millions or billions of data points. There are some approaches such as data compression [69], [1], data deduplication [19], dimension reduction [25], [60], [51], etc. For a survey about this, readers can find more useful information in [54]. Among big data reduction methods, data sampling is one of the popular options closely related to machine learning and data mining for researchers. The key idea of data sampling is that instead of solving problems on the full data with large-scale size, we can find the answer on a subset of this data; this result is then used as the baseline for finding the actual solution for the original data set. This leads us to a new difficulty: finding a subset that must be small enough to effectively reduce computational complexity but must keep all the representative characteristics of the original data. This difficulty is the motivation for us to do this research and this thesis as well.
1.2 The Scope of Research
In this thesis, we will propose a solution for the problem of clustering large datasets. We use the word "large" to indicate data that is "big" in volume, not in all the characteristics of big data described in the previous section with the 5 V's (Volume, Variety, Value, Velocity and Veracity) (Figure 1.1). However, the Volume, in other words the data size, is one of the most non-trivial difficulties that most researchers have to face when solving a big-data-related problem.
For the clustering algorithm, even though there are a lot of investigations and methods, we consider fixed clustering problems with the prototypical k-means clustering. We select this because k-means is the most well-known clustering algorithm and is widely applied in reality as well as in industry and scientific research.
While there is a wealth of prior work on clustering small and medium-sized data sets, there are unique challenges in the massive data setting. The traditional algorithms have a super-linear computational complexity in the size of the data set, making them infeasible when there are many data points. In the scope of this thesis, we apply data sampling to deal with the massive data setting. A very basic approach of this method is random sampling, or uniform sampling. In fact, while uniform sampling is feasible for some problems, there are instances where it performs very poorly due to the naive nature of the sampling strategy. For example, real-world data is often imbalanced and contains clusters of different sizes. As a consequence, a small fraction of data points can be very important and have an enormous impact on the objective function. Such imbalanced data sets are problematic for methods based on uniform sampling since, with high probability, these methods only sample points in large clusters and the information in small clusters is discarded [13].
The idea of finding a relevant subset of the original data to decrease the computational cost brings scientists to the concept of coresets, which was first applied in geometric approximation by Agarwal et al. in 2004 [2], [3]. The problem of coreset constructions for k-median and k-means clustering was then stated and investigated by Har-Peled et al. in [28], [29]. Since that time, many coreset construction algorithms have been proposed for a wide variety of clustering problems. In this thesis, we continue these prior works and propose new coreset construction algorithms for k-means clustering.
1.3 Research Contributions
In this thesis, we will solve the problem of clustering large data sets by using data sampling methods and the framework for big data, Apache Spark. Since the Spark framework belongs to the technical field and is maintained by Apache, we do not make any changes to its configuration, nor do we improve anything belonging to Spark. Instead, our research focuses on the data sampling methods that find the most relevant subsets, called coresets, of a full data set. Coresets, in other words, can be described as compact subsets such that models trained on a coreset also provide a good fit compared with models trained on the full data set. By using coresets, we can scale down big data to a tiny subset in order to reduce the computational cost of a machine learning problem. Through deep research about coresets, our thesis makes some scientific and practical contributions as follows.

1.3.1 Scientific Significance

• Based on the farthest-first-traversal algorithm and the ProTraS algorithm by Ros & Guillaume [58], we propose an FFT-based coreset construction. This is proved in Chapter 3.

• Based on the lightweight coreset of Bachem, Lucic and Krause [12], we propose a general model, the α-lightweight coreset, and then a general lightweight coreset construction that is very fast and practical. This is proved in Chapter 4.
1.3.2 Practical Significance
• Due to its high runtime, our proposed FFT-based coreset construction is very hard to use in reality. However, through experiments with some state-of-the-art coreset constructions, this proposed algorithm is shown to produce one of the best sample coresets that can be used in experiments.

• Our proposed α-lightweight coreset model is a generalization of the traditional lightweight coreset. This proposal can be used in various practical cases, especially in situations that need to focus on multiplicative errors or additive errors of the samples.

1.4 Organization of Thesis
The remainder of this thesis is organized as follows.
• Chapter 2. This chapter is an overview of prior works related to this thesis, including the k-means and k-means++ algorithms, the definition of coresets, a short brief about Apache Spark, and a theorem about bounds on sample complexity.

• Chapter 3. We introduce the farthest-first-traversal algorithm as well as ProTraS for finding coresets. Then we propose an FFT-based algorithm for coreset construction.

• Chapter 4. This chapter is about the lightweight coreset and our general lightweight coreset model. We also prove the correctness of this model and propose a general algorithm for the α-lightweight coreset.

• Chapter 5. This chapter shows the experimental runs for clustering large datasets. We use the α-lightweight coreset for the sampling process and k-means++ for clustering on the Apache Spark framework.

• Chapter 6. We present the conclusion of the thesis.
1.5 Publications relevant to this Thesis
• Nguyen Le Hoang, Tran Khanh Dang and Le Hong Trang. A Comparative Study of the Use of Coresets for Clustering Large Datasets. pp. 45-55, LNCS 11814, Future Data and Security Engineering, FDSE 2019.

• Le Hong Trang, Nguyen Le Hoang and Tran Khanh Dang. A Farthest-First-Traversal-based Sampling Algorithm for k-clustering. International Conference on Ubiquitous Information Management and Communication, IMCOM 2020.
Chapter 2
Background and Related Works
In this chapter, we provide a short introduction to the background and prior works related to this thesis:
• k-Means and k-means++ Clustering
• Coresets
• Apache Spark
• Bounds & Pseudo-dimension
2.1 k-Means and k-Means++ Clustering

2.1.1 k-Means Clustering
In 1957, an algorithm, now often referred to simply as "k-means", was proposed by S. Lloyd of Bell Labs; it was then published in 1982 [42]. Lloyd's algorithm begins with k arbitrary "centers", typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These last two steps are repeated until the process stabilizes [6].
Lloyd's algorithm is described in Algorithm 1.
Algorithm 1 k-Means Clustering - Lloyd's Algorithm [42]
Require: data set X, number of clusters k
Ensure: k separated clusters
1: Randomly initialize k centers C = {c_j}_{j=1}^{k} ∈ R^{d×k}
2: while not converged do
3:    Assign each point x ∈ X to its nearest center in C
4:    Recompute each center c_j as the mean of all points assigned to it
5: end while
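For concreteness, the following is a minimal NumPy sketch of Lloyd's algorithm as described above; the convergence tolerance, iteration cap and seeding scheme are our own illustrative choices rather than part of [42].

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal sketch of Lloyd's algorithm; X is an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Line 1 of Algorithm 1: pick k initial centers uniformly at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: recompute each center as the mean of its points,
        # keeping the old center if a cluster happens to be empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:  # process stabilized
            break
        centers = new_centers
    return centers, labels
```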
The algorithm was later developed further by Inaba et al. [33], Matousek [47], Vega et al. [63], etc. However, one of the most notable improvements of k-means is k-means++ by Arthur and Vassilvitskii [6]. We give an overview of this algorithm in the next section.
2.1.2 k-Means++ Clustering
In Algorithm 1, the initial set of cluster centers (line 1) is based on random sampling, where k points are selected uniformly at random from the data set. This simple approach is fast and easy to implement. However, there are many natural examples for which the algorithm generates arbitrarily bad clusterings. This happens due to the conflicting placement of the starting centers; in particular, it can hold with high probability even if the centers are chosen uniformly at random from the data points [6].
To overcome this problem, Arthur and Vassilvitskii [6] proposed the algorithm named k-means++, which uses adaptive seeding based on a technique called D²-sampling to create its initial seed set before running Lloyd's algorithm to convergence [8]. Given an existing set of centers S, the D²-sampling strategy, as the name suggests, samples each point x ∈ X with probability proportional to its squared distance to the selected centers, i.e.,

p(x | S) = d(x, S)² / ∑_{x′∈X} d(x′, S)²
D²-sampling is described in Algorithm 2.
Algorithm 2 D²-Sampling for k-Means++ [6]
Require: data set X, number of clusters k
Ensure: initial set S used for k-means
1: Sample the first center uniformly at random from X and add it to S
2: while |S| < k do
3:    Sample x ∈ X with probability p(x | S) and add it to S
4: end while
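A compact Python sketch of Algorithm 2 follows; the incremental maintenance of d(x, S)² is an implementation convenience we add, not part of the pseudocode.

```python
import numpy as np

def d2_sampling(X, k, seed=0):
    """Sketch of Algorithm 2: D^2-sampling seed set for k-means++."""
    rng = np.random.default_rng(seed)
    # First center: uniform over the data set.
    S = [X[rng.integers(len(X))]]
    # d(x, S)^2 for every point, updated as centers are added.
    d2 = np.linalg.norm(X - S[0], axis=1) ** 2
    while len(S) < k:
        # p(x|S) proportional to the squared distance to the chosen centers.
        p = d2 / d2.sum()
        c = X[rng.choice(len(X), p=p)]
        S.append(c)
        d2 = np.minimum(d2, np.linalg.norm(X - c, axis=1) ** 2)
    return np.array(S)
```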
2.2 Coresets
In this thesis, we apply data sampling via coresets to deal with the massive data setting. In computational geometry, a coreset is a small set of points that approximates the shape of a larger point set, in the sense that applying some geometric measure to the two sets results in approximately equal numbers [Wikipedia]. In terms of clustering problems, a coreset is a weighted subset of the data such that the quality of any clustering evaluated on the coreset closely approximates the quality on the full data set.

In most cases, it is not easy to find this most relevant subset. Consequently, attention has shifted to developing approximation algorithms. The goal now is to compute a (1 + ε)-approximation subset, for some 0 < ε < 1. The framework of coresets has recently emerged as a general approach to achieve this goal [3].
2.2.1 Definition

The definition of a coreset depends on each machine learning problem. For k-means clustering, the definition of coresets can be stated as follows.

Definition 1 (Coresets for k-Means Clustering) (S. Har-Peled and S. Mazumdar [28])
Let ε > 0. The weighted set C is a (k, ε)-coreset of X if, for any Q ⊂ R^d of cardinality at most k,

|φ_X(Q) − φ_C(Q)| ≤ ε φ_X(Q)

which is equivalent to

(1 − ε) φ_X(Q) ≤ φ_C(Q) ≤ (1 + ε) φ_X(Q)
This is a strong theoretical guarantee, as the cost evaluated on the coreset, φ_C(Q), has to approximate the cost on the full data set, φ_X(Q), up to a 1 ± ε multiplicative factor uniformly for all possible sets of cluster centers. As a direct consequence, solving on the coreset yields provably competitive solutions when evaluated on the full data set [43]. More formally, Lucic et al. [43] showed that if C is a coreset of X with ε ∈ (0, 1/3), then

φ_X(Q*_C) ≤ (1 + 3ε) φ_X(Q*_X)

where Q*_C denotes the optimal solution of k centers on C and Q*_X denotes the optimal solution on X. This means that the optimal solution on the coreset produces a (1 + 3ε)-approximation on the original data. As a result, we can solve the clustering problem on the coreset while retaining strong theoretical guarantees.
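This multiplicative guarantee can also be inspected numerically. The sketch below, our own illustration rather than code from [28] or [43], computes the quantization costs φ_X(Q) and φ_C(Q) and reports the resulting relative error for a candidate set of centers Q:

```python
import numpy as np

def quantization_cost(points, Q, weights=None):
    """phi(Q): sum of (weighted) squared distances to the nearest center in Q."""
    d2 = ((points[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return d2.sum() if weights is None else (weights * d2).sum()

def relative_error(X, C, w, Q):
    """|phi_X(Q) - phi_C(Q)| / phi_X(Q); a (k, eps)-coreset keeps this <= eps."""
    phi_X = quantization_cost(X, Q)
    phi_C = quantization_cost(C, Q, weights=w)
    return abs(phi_X - phi_C) / phi_X
```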
2.2.2 Some Coreset Constructions
Many coreset construction algorithms have been proposed in recent years for k-means clustering problems. One of the first methods is based on exponential grids, by Har-Peled and Mazumdar [28], with an improved version by Har-Peled and Kushal [29]. The sampling-based approach to coreset construction was first used by Chen [15] and was investigated deeply by Feldman et al. in plenty of research: coresets for k-means (Feldman, Monemizadeh and Sohler [21]), for high-dimensional subspaces (Feldman, Monemizadeh, Sohler and Woodruff [22]), for mixture models (Feldman, Faulkner and Krause [23]), and for PCA and projective clustering (Feldman, Schmidt and Sohler [24]). Based on these prior works, Lucic et al. constructed coresets for the estimation of Gaussian mixture models (Lucic, Faulkner and Krause [44]) and for clustering with Bregman divergences (Lucic, Bachem and Krause [43]). Recently, Bachem et al. proposed coresets for nonparametric clustering (Bachem, Lucic and Krause [7]), one-shot coresets for k-clustering (Bachem, Lucic and Lattanzi [11]) and lightweight coresets (Bachem, Lucic and Krause [12]). Besides, Ros and Guillaume proposed a coreset construction based on the Farthest-First-Traversal algorithm in [58]. For a more detailed survey of results about coresets, we refer the reader to the paper [9] by Bachem, Lucic and Krause.
In this thesis, we will continue these prior works and propose new coreset construction algorithms for k-means clustering.

• Based on the farthest-first-traversal algorithm and the ProTraS algorithm by Ros & Guillaume [58], we propose an FFT-based coreset construction. This is explained and proved in Chapter 3.

• Based on the lightweight coreset of Bachem, Lucic and Krause [12], we propose a general model called the α-lightweight coreset. This is proved in Chapter 4.
2.3 Apache Spark
Apache Spark, an open-source distributed cluster computing framework, was originally developed at the University of California, Berkeley's AMPLab. The Spark codebase was later donated to the Apache Software Foundation. There are many resources about this on the Internet; to provide brief and clear details about Spark, most of the content of this section is taken from https://mapr.com/products/apache-spark/
Apache Spark is a powerful unified analytics engine for large-scale distributed data processing and machine learning. On top of the Spark core data processing engine are libraries for SQL, machine learning, graph computation, and stream processing. These libraries can be used together in many stages of modern data pipelines and allow for code reuse across batch, interactive, and streaming applications.

Spark is useful for ETL processing, analytics and machine learning workloads, and for batch and interactive processing of SQL queries, machine learning inference, and artificial intelligence applications.
FIGURE 2.1: Spark Logo - https://spark.apache.org
Data Pipelines. Much of Spark's power lies in its ability to combine very different techniques and processes into a single, coherent whole. Outside Spark, the discrete tasks of selecting data, transforming that data in various ways, and analyzing the transformed results might easily require a series of separate processing frameworks, such as Apache Oozie. Spark, on the other hand, offers the ability to combine these, crossing boundaries between batch, streaming, and interactive workflows in ways that make the user more productive.

Spark jobs perform multiple operations consecutively, in memory, only spilling to disk when required by memory limitations. Spark simplifies the management of these disparate processes, offering an integrated whole: a data pipeline that is easier to configure, run, and maintain.
Challenges with Previous Technologies. Before Spark, there was MapReduce. With MapReduce, iterative algorithms require chaining multiple MapReduce jobs together, which causes a lot of reading and writing to disk. For each MapReduce job, data is read from a distributed file block into a map process, written to and read from an intermediate file, and then written to an output file from a reducer process.
FIGURE 2.2: The Components of Spark
Advantages of Spark. The goal of the Spark project was to keep the benefits of MapReduce's scalable, distributed, fault-tolerant processing framework while making it more efficient and easier to use. Spark is designed for speed:

• Spark runs multi-threaded lightweight tasks inside of JVM processes, providing fast job startup and parallel multi-core CPU utilization.

• Spark caches data in memory across multiple parallel operations, making it especially fast for parallel processing of distributed data with iterative algorithms.

• Spark provides a rich functional programming model and comes packaged with higher-level libraries for SQL, machine learning, streaming, and graphs.
The components of Apache Spark are shown in Figure 2.2¹.
¹ https://backtobazics.com/big-data/spark/understanding-apache-spark-architecture/
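To illustrate how Spark is used later in this thesis (Chapter 5 relies on Spark's built-in k-means), the sketch below runs MLlib's KMeans on a dataset loaded into a DataFrame. The file path, column handling and parameter values are illustrative assumptions; only the KMeans and VectorAssembler APIs come from the pyspark.ml library.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("coreset-kmeans").getOrCreate()

# Illustrative input: a CSV of numeric columns (the path is hypothetical).
df = spark.read.csv("data/coreset.csv", header=True, inferSchema=True)
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
data = assembler.transform(df)

# Spark MLlib's built-in k-means; k and seed are illustrative choices.
model = KMeans(k=15, seed=1, featuresCol="features").fit(data)
print(model.clusterCenters())

spark.stop()
```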
2.4 Bounds on Sample Complexity of Learning
In Chapter 4 of this thesis, we consider function families which map from a dataset X to [0, 1] and where each function in the family corresponds to a solution in our solution space. Intuitively, the pseudo-dimension provides a measure of the combinatorial complexity of the underlying machine learning problem and may be viewed as the generalization of the VC-dimension to [0, 1]-valued functions [13]. The definition of the pseudo-dimension was first proposed by Haussler [31]. For an overview of the pseudo-dimension, we refer to Anthony and Bartlett [5].

Definition 2 (Pseudo-Dimension) (Haussler [31])
Fix a countably infinite domain X. The pseudo-dimension of a set F of functions from X to [0, 1], denoted by Pdim(F), is the largest d such that there is a sequence x_1, x_2, ..., x_d of domain elements from X and a sequence r_1, r_2, ..., r_d of real thresholds such that for each b_1, b_2, ..., b_d ∈ {above, below}, there is an f ∈ F such that for all i = 1, 2, ..., d, we have f(x_i) ≥ r_i ⟺ b_i = above.
By using the properties of the pseudo-dimension, Li, Long, and Srinivasan proposed a theorem giving a tight bound on the number of samples required for all functions in the function family [40].

Theorem 1 (Bounds on the Sample Complexity of Learning) (Y. Li, P. M. Long and A. Srinivasan [40])
Let α > 0, ν > 0 and δ > 0. Fix a countably infinite domain X and let p(·) be any probability distribution over X. Let F be a set of functions from X to [0, 1] with Pdim(F) = d, and denote by C a sample of m points from X sampled independently according to p(·).
Chapter 3

FFT-based Coresets

In this chapter, we propose a coreset construction based on the Farthest-First-Traversal (FFT) algorithm which, despite requiring a lot of computations, can be considered one of the best relevant coresets that may be built for research and scientific purposes.
• Firstly, we show that the Farthest-First-Traversal algorithm (FFT) can yield a (k, ε)-coreset for both k-median and k-means clustering.

• Secondly, we illustrate some existing limitations of ProTraS [58], the state-of-the-art coreset construction based on FFT.

• From that, based on FFT combined with the good points of ProTraS, we propose an algorithm for coreset construction for both k-means and k-median clustering.

• We compare this proposed coreset with other state-of-the-art sample coresets, from the Lightweight Coreset of Bachem et al. [12] and the Adaptive Sampling of Feldman et al. [23], with Uniform Sampling as a baseline, to show that this proposed coreset can be considered the most suitable subset of an original full data set.
Even though this thesis is mainly about coresets for k-means clustering, in this chapter we also prove results about coresets for k-median clustering. Therefore, the FFT-based coresets can be applied not only to k-means but also to k-median clustering.
3.1 Farthest-First-Traversal Algorithm
We start this chapter with a short introduction to the Farthest-First-Traversal (FFT) algorithm. In computational geometry, the FFT of a metric space is a set of points selected sequentially; after the first point is chosen arbitrarily, each successive point is located as the farthest one from the set of previously selected points. The first use of the FFT was by Rosenkrantz, Stearns & Lewis [59] in connection with heuristics for the traveling salesman problem. Then, Gonzalez [26] used it as part of a greedy approximation algorithm for the problem of finding k clusters that minimize the maximum diameter of a cluster. Later, Arthur & Vassilvitskii [6] used an FFT-like algorithm to propose the k-means++ algorithm.
The FFT is described in Algorithm 3.
Algorithm 3 Farthest-First-Traversal algorithm
Require: dataset X with |X| = n, sample size s
Ensure: set C of s selected points
1: Select an initial point y_1 ∈ X arbitrarily; C ← {y_1}
2: while |C| < s do
3:    Select y = argmax_{x∈X\C} min_{c∈C} d(x, c), the farthest point from C
4:    C ← C ∪ {y}
5: end while
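A direct Python sketch of Algorithm 3 is shown below; it runs in O(n·s) time, and the random choice of the initial point is an illustrative assumption.

```python
import numpy as np

def farthest_first_traversal(X, s, seed=0):
    """Sketch of FFT: greedily pick s points, each farthest from those chosen."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]            # first point chosen arbitrarily
    d = np.linalg.norm(X - X[idx[0]], axis=1)    # d(x, C) for every x
    while len(idx) < s:
        j = int(d.argmax())                      # farthest point from current set
        idx.append(j)
        d = np.minimum(d, np.linalg.norm(X - X[j], axis=1))
    return np.array(idx)
```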
As mentioned, the FFT algorithm can be used to solve many complicated problems in data mining and machine learning. In the next section, we prove that the process of FFT can yield coresets for both k-median and k-means clustering; then, we can find a coreset by applying the FFT algorithm.
3.2 FFT-based Coresets for k-Median and k-Means Clustering
Firstly, we define some expressions used in this section. Let X ⊂ R^d be a data set and x ∈ X. Let C ⊂ X be a subset of X. For each c ∈ C, we denote

• w_c = |T(c)|, where T(c) is the set of items from X whose closest point in C is c, i.e.

T(c) = {y | (y ∈ X \ C) ∧ (d(y, c) = min_{c′∈C} d(y, c′))}

• d_c = max_{t∈T(c)} d(t, c), the largest distance from any point in T(c) to c.
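Given X and a sample C, these quantities can be computed in one pass over the nearest-representative assignment; a small sketch (the helper name is ours) follows.

```python
import numpy as np

def coreset_stats(X, C):
    """For each c in C, return w_c = |T(c)| and d_c = max distance within T(c)."""
    # Nearest representative for every point in X; points of C assign to
    # themselves at distance zero, which leaves d_c unaffected.
    dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    w = np.bincount(nearest, minlength=len(C))            # w_c
    d = np.zeros(len(C))
    for j in range(len(C)):
        if np.any(nearest == j):
            d[j] = dist[nearest == j, j].max()            # d_c
    return w, d
```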
Theorem 2 (FFT-based Coresets for k-Median Clustering)
There exists an ε > 0 such that the subset obtained from the FFT algorithm on dataset X is a (k, ε)-coreset of X for k-median clustering.
Proof: k-median is a variation of k-means where, instead of calculating the mean for each cluster to determine its centroid, one instead calculates the median. With k-median, the functions φ of X and C are defined as

φ_X(Q) = ∑_{x∈X} d(x, Q),    φ_C(Q) = ∑_{c∈C} w_c d(c, Q)

By the triangle inequality, for every c ∈ C and every x ∈ T(c),

|d(x, Q) − d(c, Q)| ≤ d(x, c) ≤ d_c

Summing this inequality over all x ∈ T(c) and then over all c ∈ C bounds the gap between the two costs:

|φ_X(Q) − φ_C(Q)| ≤ ∑_{c∈C} w_c d_c

Choosing

ε = (∑_{c∈C} w_c d_c) / φ_X(Q)    (3.3)

gives |φ_X(Q) − φ_C(Q)| ≤ ε φ_X(Q), so C is a (k, ε)-coreset of X for k-median clustering.
Theorem 3 (FFT-based Coresets for k-Means Clustering)
There exists an ε > 0 such that the subset obtained from the FFT algorithm on dataset X is a (k, ε)-coreset of X for k-means clustering.
Proof: With k-means, the functions φ of X and C are

φ_X(Q) = ∑_{x∈X} d(x, Q)²,    φ_C(Q) = ∑_{c∈C} w_c d(c, Q)²

We denote d_max = max_{x∈X} d(x, Q). Using the triangle inequality, ∀x ∈ X and c ∈ C, we have

d(x, Q) ≤ d(x, c) + d(c, Q)
d(x, Q) ≤ d_c + d(c, Q), ∀x ∈ T(c)
d(x, Q)² ≤ (d_c + d(c, Q))²
d(x, Q)² ≤ d_c² + d(c, Q)² + 2 d_c d_max    (since d(c, Q) ≤ d_max)

Summing the last inequality over all x ∈ T(c), and then over all c ∈ C, bounds φ_X(Q) by φ_C(Q) plus an additive term. We set

Δ = ∑_{c∈C} w_c d_c (d_c + 2 d_max)

and choose ε = Δ / φ_X(Q), which yields |φ_X(Q) − φ_C(Q)| ≤ ε φ_X(Q).
Theorem 4 (FFT-based Coresets for both k-Median and k-Means Clustering)
The sample obtained by applying the FFT algorithm on data set X is a (k, ε)-coreset of X for both k-median and k-means clustering.
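Theorems 2 and 3 make the approximation level realized by a given FFT sample computable. The sketch below evaluates the k-means bound ε = Δ/φ_X(Q) from the proof of Theorem 3, assuming the coreset_stats helper sketched earlier:

```python
import numpy as np

def kmeans_epsilon_bound(X, C, Q):
    """Evaluate eps = Delta / phi_X(Q) with Delta = sum_c w_c d_c (d_c + 2 d_max)."""
    w, d = coreset_stats(X, C)                   # w_c and d_c per representative
    dist2_Q = ((X[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    d_max = np.sqrt(dist2_Q).max()               # d_max = max_x d(x, Q)
    delta = (w * d * (d + 2.0 * d_max)).sum()    # Delta from Theorem 3
    return delta / dist2_Q.sum()                 # phi_X(Q) = sum of squared dists
```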
3.3 ProTraS algorithm and limitations
In the previous section, we have shown that the FFT algorithm can be used to build a coreset for both k-median and k-means clustering. However, there is little research about FFT-based coresets. In 2018, the FFT algorithm was first used to find coresets by Ros and Guillaume [58]. The sample from their coreset construction can be considered a coreset for k-median clustering. Even though their work has some limitations, it is still a very valuable resource and a great idea for further study.

3.3.1 ProTraS algorithm
In 2017, Ros and Guillaume proposed DENDIS [56] and DIDES [57], which are iterative algorithms based on the hybridization of distance and density concepts. They differ in the priority given to distance or density, and in the stopping criterion defined accordingly; however, they have drawbacks. In 2018, based on the FFT algorithm and the good points of DENDIS and DIDES, Ros and Guillaume proposed a new algorithm named ProTraS [58] that is both easy to tune and scalable. The ProTraS algorithm is based on a sampling cost which is computed according to the within-group distance and to the representativeness of the sample item. The algorithm is designed to produce a (k, ε)-coreset and uses the approximation level ε as the stopping criterion. It has been used and received good reviews in some research, such as [62] by Le Hong Trang et al.
The original ProTraS is described in Algorithm 4.
Algorithm 4 ProTraS Algorithm [58]
Require: T = {x_i}, for i = 1, 2, ..., n, and ε
Ensure: S = {y_j} and T(y_j), for j = 1, 2, ..., s
1: Select an initial pattern x_init ∈ T
...
11: Find d_max(y_k) = max_{x_m∈T(y_k)} d(x_m, y_k)
12: Store d_max(y_k) and x_max(y_k), where d_max(y_k) = d(x_max(y_k), y_k)
...
FIGURE 3.1: Original-R15 (small cluster) and scaling-R15
ProTraS stops when cost < ε_0, where ε_0 is a given input constant (line 4). This cost function is taken from the expression (3.3) of ε for k-median clustering, under an assumption equivalent to cost_X(Q) ≥ n, which leads to some errors as follows.

• For data with large distances between points, Algorithm 4 needs a lot of iterations before reaching the stop condition; in some cases, with a given ε_0 = 0.1, Algorithm 4 returns a coreset with size equal to the size of the full data set.

• Algorithm 4 cannot treat two data sets with the same shape but different scale parameters in the same way (Figure 3.1). With the same given ε_0, the data with the larger scale will have a bigger coreset size.
In fact, with the assumption above, Algorithm 4 can only be optimized and create good samples in circumstances that satisfy the given assumption. Although the original ProTraS is just for k-median clustering and has some limitations, the idea of coreset construction from the FFT algorithm is worth deeper research. In the next section, we propose an FFT-based coreset construction for both k-means and k-median clustering that is independent of the value of ε.
3.4 Proposed FFT-based Coreset Construction
Based on the FFT algorithm and the three theorems in Section 3.2, we propose an algorithm for coreset construction for both k-median and k-means clustering. The pseudocode of the proposed algorithm is described in Algorithm 5. To complete the proposed algorithm, we also propose a strategy for the initial step and a strategy for decreasing the computational complexity.

Algorithm 5 Proposed FFT-based Coreset Construction
...
The main loop (lines 5-20) includes two sub-loops.

• In the first one (lines 6-9), each unselected pattern x_i ∈ X \ C is attached to the closest selected element in C.

• The second loop finds the next selected representative, x* = x_max(c*), which is the farthest item from the current C. In this step, if we simply choose x* at the farthest distance between all points in X and the current set C, like the original FFT, i.e.

d(x*, C) = max_{(x∈X\C) ∧ (c∈C)} d(x, c)