JOHANNES KEPLER UNIVERSITY LINZ
Altenberger Str. 69
4040 Linz, Austria www.jku.at DVR 0093696
to confer the academic degree of
Doktor der technischen Wissenschaften
in the Doctoral Program
Engineering Sciences
EIDESSTATTLICHE ERKLÄRUNG
Ich erkläre an Eides statt, dass ich die vorliegende Dissertation selbstständig und ohne fremde Hilfe verfasst, andere als die angegebenen Quellen und Hilfsmittel nicht benutzt bzw. die wörtlich oder sinngemäß entnommenen Stellen als solche kenntlich gemacht habe.
Die vorliegende Dissertation ist mit dem elektronisch übermittelten Textdokument identisch.
Linz, 2016
Trong Nhan Phan
SWORN DECLARATION
I hereby declare under oath that the submitted Doctoral Thesis has been written solely by me without any third-party assistance, information other than provided sources or aids have not been used and those used have been fully documented. Sources for literal, paraphrased and cited quotes have been accurately credited.
The submitted document here present is identical to the electronically submitted text document.
Linz, 2016
Trong Nhan Phan
I am very grateful to Professor Roland Wagner and Mr Knud Steiner for providing me with good conditions and assistance while I was doing my PhD at the Institute for Application Oriented Knowledge Processing (FAW) in Linz and in Hagenberg, Austria.
I would also like to extend my sincere thanks to the European Commission via GATE (knowledGe mAnagement Technology transfer and Education programme), an Erasmus Mundus mobility project, and especially to Ms Christine Hinterleitner and Ms Emma Huss for their support during my mobility in Austria.
My special thanks go to Mrs Gabriela Wagner, Mrs Gabriela Küng, Richard Küng, Eric Küng, and Felix Küng. Their kindness, generosity, and sense of humor made my PhD life much more colorful and beautiful.
My sincere thanks to Mr Faruk Kujundžić of the Scientific Computing, Information Management team, Johannes Kepler University Linz, for kindly supporting us with the Alex Cluster.
My acknowledgement would not be complete without my colleagues: Markus Jäger, Stefan Nadschläger, Pablo Gómez-Pérez, and Christian Huber. It has been my great pleasure working with you.
I would like to thank Ms Monika Neubauer and Mr Andreas Dreiling for their help with administrative and technical matters.
I will never forget the warm welcome from Ms Dagmar Auer, Hilda Kosorus, Jan Kubovy, Peter Regner, and all colleagues at FAW.
I am thankful to the reviewers who have spent their time on our research work and provided us with their invaluable feedback.
I appreciate Professor Tran Khanh Dang and my colleagues at the Faculty of Computer Science and Engineering, HCMC University of Technology, Vietnam, for their encouragement and support while I was doing my PhD in Austria.
I also appreciate my lecturers and teachers who have guided me with their knowledge.
My family has always and forever been by my side, no matter how harsh and hard the times were.
Though I cannot list all the names in this section, I feel very lucky to have had them, my friends, and my companions throughout my life. The people I have met have come into my life for a reason. Anyhow, I thank you for everything and never stop hoping and believing that I will see you, the Angels, again and again.
KURZFASSUNG
Similarity Search (Ähnlichkeitssuche) ist eine der zentralen Operationen, nicht nur in Datenbanken, sondern ebenso in anderen Hauptgebieten der Datenverarbeitung wie Information Retrieval, Machine Learning oder Data Mining. Darüber hinaus wird sie in verschiedenen Anwendungen verwendet, beispielsweise Duplicate Detection, Data Cleaning oder Data Clustering. Trotz der hohen Verbreitung und Verwendung von Similarity Search ist ihre Anwendung wegen der hohen Kosten der Ähnlichkeitsberechnungen sehr teuer. Similarity Search wird auch sehr zeitintensiv und zeitraubend, wenn sie bei der Ausführung auf irrelevante Objekte zugreift und deren Ähnlichkeitswerte unnötigerweise berechnet. Mehr noch muss Similarity Search mit den Herausforderungen von Big Data klarkommen, die größte darunter das Verwalten großer Datenmengen. Diese Herausforderungen machen Similarity Search teuer, hinterlassen uns aber eine große Motivation.
Als ein Paradigma für riesige (large-scale) Berechnungen auf parallelen und verteilten Systemen, bestehend aus herkömmlichen Computern, zeigt MapReduce schnell seine Fähigkeit, große Datenmengen mit hoher Fehlertoleranz verarbeiten zu können. In dieser Arbeit analysieren wir die Leistung von Similarity Search in Kombination mit MapReduce. Darüber hinaus untersuchen wir die Probleme der Skalierbarkeit, der Redundanz und des Lastausgleichs, welche bei Similarity Search immer ein Thema sind. Natürlich wird eine genaue Similarity Search ohne Verlust von Genauigkeit bevorzugt. Unter Verwendung verschiedener Ansätze streben wir eine Verbesserung der Performanz von Similarity-Search-Vorgängen unter Zuhilfenahme von MapReduce an. Genau genommen untersuchen bzw. verwenden wir drei unterschiedliche Ansätze für eine schnelle Similarity Search, diese sind: Instant, Build-In und Hybrid.
Zuerst wenden wir MapReduce an, um Erfahrungen mit Similarity Search mit besonderen Ähnlichkeitsmaßen zu sammeln – typische bzw. beliebte Vertreter sind Jaccard und Cosinus (Jaccard-Koeffizient und Kosinus-Ähnlichkeit). Die Idee dahinter ist, einen invertierten Index zu erhalten, welcher ein bekanntes Werkzeug für die Indexierung einer schnellen Volltext-Suche ist. Unsere Strategie ist es, die gegebenen Daten einer bestimmten Query zu indexieren und nicht alle vorhandenen Ausgangsdaten. Als Folge dieser Strategie wird nur ein kleiner Teil der Daten für die Similarity Search verarbeitet. Gleichzeitig wird der riesige Datenbestand von unwichtigen Daten befreit, indem sowohl in die Indizierung als auch in den Suchprozess eingegriffen wird. Da der Prozess von einer bestimmten Query abhängig ist, ist er brauchbar für einmalige Abfragen, weil weniger Daten in MapReduce verarbeitet werden müssen.
Im zweiten Schritt wollen wir die Indexierung und die Similarity Search voneinander trennen, damit Similarity-Search-Abfragen mehrmals ausgeführt werden können, ohne den originalen Datenbestand erneut verarbeiten zu müssen. Dieser Ansatz gehört zur Klasse der Build-In-Ansätze, bei denen die indexierten Daten bereits gut vorbereitet sind und dann für Similarity-Search-Abfragen verwendet werden können. Wir haben herausgefunden, dass die Verwendung invertierter Indizes zu Nachteilen führen kann, welche nicht adäquat für Similarity Search unter der Verwendung von MapReduce sind. Folglich schlagen wir stattdessen dokumentbasierte Indizes vor, um diese Nachteile auszugleichen. Darüber hinaus werden Datenobjekte gebündelt (Clustering), womit der mögliche Suchraum eingegrenzt wird.
Im dritten Schritt beschäftigen wir uns mit den Hybrid-Ansätzen, welche die Vorteile sowohl des Instant- als auch des Build-In-Ansatzes vereinen. Unser Ziel ist das Erreichen einer schnellen Indexierung und einer schnellen Similarity Search. Weiters muss jedem bewusst sein, dass die Daten richtig organisiert gehören, wenn man die Indizes bildet. Einerseits sind zu fest oder zu lose geclusterte Objekte nicht hilfreich für die Full-Scan-Anwendung von MapReduce, andererseits sollten die Datenobjekte so organisiert werden, dass unnötige Zugriffe auf ein Minimum reduziert werden, vor allem auch deswegen, weil die dokumentbasierten Indizes das Grundgerüst unserer Indizierungsphase darstellen. Darüber hinaus widmen wir uns einer guten Arbeitsverteilung und schlagen ein Verfahren vor, den Lastausgleich und somit die Laufzeit zu verbessern.
Außerdem statten wir unsere vorgeschlagenen Methoden mit Filter- und Kürzungsstrategien aus. Zusätzlich schlagen wir eine hybride MapReduce-basierte Architektur vor, deren Hauptaufgabe es ist, mit den „drei Vs“ von Big Data umzugehen (Volume/Volumen – Velocity/Geschwindigkeit – Variety/Vielfalt). Darüber hinaus sind wir uns der Verwendung von minimalistischen MapReduce-Methoden bewusst. Da die Ausführung eines MapReduce-Jobs teuer ist, führt die Verwendung weniger Jobs ohne wiederholte Zugriffe auf die Originaldaten zu weniger Zusatzaufwand. Schließlich haben wir intensiv eine Reihe von Experimenten an realen Datensätzen durchgeführt. Die Ergebnisse zeigen, dass unsere vorgeschlagenen Verfahren eine bessere Leistung erzielen als die Basismethode und relevante vergleichbare Arbeiten bzw. Methoden von Similarity Search.
Schlüsselwörter: Ähnlichkeitssuche, schnelle Abfrageverarbeitung, Skalierbarkeit, Clustering,
Filter, Pruning, redundanzfreie Verarbeitung, Lastverteilung, BigData, MapReduce, Hadoop
ABSTRACT
Similarity search is a principal operation not only in databases but also in related disciplines such as information retrieval, machine learning, or data mining. In addition, it has been widely used in various applications like duplicate detection, data cleaning, or data clustering. Nevertheless, a similarity search process is expensive due to the cost of similarity computations. Moreover, similarity search becomes time-consuming when it has to access irrelevant objects and then unnecessarily evaluate their similarity. Furthermore, it has to deal with challenges from big data, first and foremost with the large amounts of data. Such challenges make similarity search costly but leave big motivations for us.
Emerging as a paradigm for large-scale processing in the fashion of parallel and distributed computing on a cluster of commodity machines, MapReduce rapidly shows its capability as a candidate for processing massive datasets with high fault tolerance. In our dissertation, we study the performance of similarity search with MapReduce. Moreover, we also study the problems of scalability, redundancy, and load balance when doing similarity search. Without loss of accuracy, we prefer an exact similarity search. Among various approaches, we choose to improve performance via similarity search schemes using MapReduce. More specifically, we propose three different kinds of approaches towards fast similarity search: the instant approaches, the build-in approaches, and the hybrid approaches.
Firstly, we employ MapReduce to experience similarity search with particular measures, whose typical and popular representatives are the Cosine and Jaccard measures. The idea behind them is to utilize an inverted index, which is a well-known index data structure for fast full-text search. Our strategy is, however, to index only those data that appear in a given query rather than indexing all original data. As a consequence, only a small portion of the data is processed for similarity search. At the same time, we minimize the large amount of inessential data engaged in both the indexing and the search processes. Because it depends on a certain query, this approach belongs to the category of instant approaches and is considered suitable for one-time querying while maintaining less data throughout the MapReduce jobs.
Secondly, we want to separate the indexing phase from the similarity search phase so that similarity queries can run multiple times on the indexed data without re-accessing the original data. This approach belongs to the build-in approaches in that the indexed data are prepared in advance and are readily available for similarity queries. Moreover, we observe that using an inverted index leads to some drawbacks that are not appropriate for similarity search using MapReduce. Consequently, we propose using a document-based index instead in order to overcome these drawbacks. Furthermore, we cluster data objects into different compartments so that we reduce the search space for the task of searching.
Thirdly, we move towards the hybrid approaches that take the advantages of both the instant approaches and the build-in approaches. Our goal is to achieve fast index building as well as fast similarity search. Moreover, we are aware of how data are organized when building the indices. On the one hand, clustering objects too tightly or too loosely would not be useful for the full-scan fashion of MapReduce. On the other hand, although the document-based index is exploited as a skeleton in our indexing phase, data objects should be organized in ways that allow us to minimize unnecessary data accesses. Furthermore, we address the load imbalance and then propose a straggler-mitigating method to achieve better load balance and at the same time improve the runtime at the reducers.
Besides, we equip our proposed methods with filtering and pruning strategies. Additionally, we propose a hybrid MapReduce-based architecture whose main aim is to deal with the “three Vs” (Volume, Velocity, Variety) of big data. Moreover, we are conscious of employing a minimal number of MapReduce jobs. Due to the fact that a single MapReduce job is expensive, using fewer MapReduce jobs without re-accessing the original data results in fewer penalties. Furthermore, we intensively conduct a series of empirical experiments on real datasets. The results demonstrate that our proposed methods have better performance than the baseline method and some related work.
Key words: Similarity search, fast query processing, scalability, clustering, filtering, pruning,
redundancy-free capability, load balance, big data, MapReduce, Hadoop
LIST OF FIGURES
Page
Figure 1-1: Examples of typical similarity queries; (a) Range query; (b) k-Nearest Neighbor
query; (c) Self-join query 4
Figure 1-2: Data revolution since 2005 (Letouzé, 2012) 7
Figure 1-3: MapReduce paradigm (Phan et al., 2015c) 10
Figure 1-4: Data redundancy throughout MapReduce processes 18
Figure 2-1: Performance-improving approaches with MapReduce 26
Figure 2-2: The architectures of Hadoop MapReduce and Hadoop YARN 28
Figure 2-3: Pruned document pair (Baraglia et al., 2010; De Francisci et al., 2010) 33
Figure 2-4: The generation of word frequency dictionary (Li et al., 2011) 35
Figure 2-5: The generation of text vector (Li et al., 2011) 35
Figure 2-6: The generation of PLT inverted file (Li et al., 2011) 35
Figure 2-7: Query text search (Li et al., 2011) 35
Figure 2-8: Computing pairwise similarity of a toy collection of 3 documents 36
Figure 2-9: Example of task computation in partition-based similarity search with hybrid indexing (Alabduljalil et al., 2013) 36
Figure 2-10: Example of pair-wise similarity computation using a 2-pass blocking of 9 objects (Kolb et al., 2013) 39
Figure 2-11: Example of redundancy-free MR-based pair-wise similarity computation 39
Figure 2-12: A hybrid MapReduce-based architecture (Phan et al., 2016) 41
Figure 3-1: An overview scheme of an instant approach 46
Figure 3-2: The overview scheme of Cosine-based method (Phan et al., 2014a) 48
Figure 3-3: MapReduce-1 from the Cosine-based method for pairwise similarity 54
Figure 3-4: MapReduce-2 from the Cosine-based method for pairwise similarity 55
Figure 3-5: MapReduce-3 from the Cosine-based method for pairwise similarity 56
Figure 3-6: MapReduce-4 from the Cosine-based method for pairwise similarity 56
Figure 3-7: MapReduce-1 from the Cosine-based method when given a pivot 57
Figure 3-8: MapReduce-2 from the Cosine-based method when given a pivot 57
Figure 3-9: MapReduce-4 from the Cosine-based method with Pre-pruning-2 57
Figure 3-10: MapReduce-3 from the Cosine-based method with Pre-pruning-1 58
Figure 3-11: Performance with DBLP Datasets (Phan et al., 2014a) 60
Figure 3-12: Similarity queries with DBLP Datasets (Phan et al., 2014a) 60
Figure 3-13: Performance with Gutenberg Datasets (Phan et al., 2015c) 61
Figure 3-14: Similarity queries with Gutenberg Datasets (Phan et al., 2015c) 61
Figure 3-15: Performance with shingles and Gutenberg Datasets (Phan et al., 2015c) 61
Figure 3-16: The overview scheme of Jaccard-based method (Phan et al., 2014b) 63
Figure 3-17: MapReduce-1 from the Jaccard-based method with pairwise similarity 68
Figure 3-18: MapReduce-2 from the Jaccard-based method with pairwise similarity 68
Figure 3-19: MapReduce-1 from the Jaccard-based method when given a query object 69
Figure 3-20: MapReduce-2 from the Jaccard-based method when given a query object 70
Figure 3-21: Performance with pairwise similarity (Phan et al., 2014b) 71
Figure 3-22: Performance with similarity queries (Phan et al., 2014b) 71
Figure 4-1: An overview scheme of a build-in approach 74
Figure 4-2: An example of an object-identifying process (Phan et al., 2015b) 75
Figure 4-3: Data redundancy with an inverted index (Phan et al., 2015b) 77
Figure 4-4: MapReduce-1 job from a document-indexing method (Phan et al., 2015b) 81
Figure 4-5: MapReduce-1 job with Dq from a document-indexing method (Phan et al., 2015b) 81
Figure 4-6: MapReduce-2 job from a document-indexing method (Phan et al., 2015b) 82
Figure 4-7: Performance evaluation (Phan et al., 2015b) 84
Figure 4-8: Data output evaluation (Phan et al., 2015b) 84
Figure 4-9: The relevance between an inverted index and a document index 84
Figure 4-10: Candidate measurement with Gutenberg Datasets (Phan et al., 2015b) 85
Figure 4-11: The spiral clustering scheme (Phan et al., 2016) 86
Figure 4-12: The element-based clustering 88
Figure 4-13: MapReduce-1 job from a two-level clustering method 94
Figure 4-14: MapReduce-2 job from a two-level clustering method 94
Figure 4-15: Performance evaluation with a single query 96
Figure 4-16: Performance evaluation with query batches 97
Figure 4-17: Overview comparison 97
Figure 5-1: An overview scheme of a hybrid approach 99
Figure 5-2: The granularity-enhanced spiral clustering scheme (Phan et al., 2016) 102
Figure 5-3: MapReduce-1 job from a granularity-enhanced spiral clustering method 109
Figure 5-4: MapReduce-2 job from a granularity-enhanced spiral clustering method 110
Figure 5-5: The distribution of data and clusters (Phan et al., 2016) 112
Figure 5-6: Performance with MapReduce (Phan et al., 2016) 113
Figure 5-7: The granularity-enhanced spiral clustering scheme (Phan et al., 2016) 113
Figure 5-8: An example of the load imbalance by skewed size of key-value pairs 115
Figure 5-9: Elapsed time among reducers with P5K from II, BI, DI, and eHSim 115
Figure 5-10: Elapsed time among reducers with P5K from eHSim 115
Figure 5-11: Load distribution with default hashing 116
Figure 5-12: Performance measurement with a single machine 125
Figure 5-13: Total reducer time with Alex cluster 126
Figure 5-14: Elapsed time among reducers with P1K 127
Figure 5-15: Elapsed time among reducers with P3K 127
Figure 5-16: Elapsed time among reducers with P5K 127
Figure 6-1: File size visualization on datasets 134
Figure 6-2: Shingle distribution among datasets 135
Figure 6-3: Experiments with a single machine on Gutenberg datasets 138
Figure 6-4: Experiments on Alex with PQ and PQE on Gutenberg datasets 140
Figure 6-5: Experiments with the cluster of commodity machines on Gutenberg datasets 141
Figure 6-6: Experiments with different queries and multiple Gutenberg data sources 143
Figure 6-7: Experiments with key-value pairs and candidate pairs on Gutenberg datasets 145
Figure 6-8: Experiments on Alex with DBLP datasets 146
Figure 6-9: Experiments with different queries and multiple DBLP data sources 147
Figure 6-10: Experiments with key-value pairs on DBLP datasets 148
LIST OF TABLES
Page
Table 1-1: Some typical instances of metrics 5
Table 1-2: Some typical instances of metric variants 5
Table 1-3: The evolving concepts of Big Data 8
Table 1-4: The main advantages of Parallel Database Management Systems over MapReduce (Pavlo et al., 2009; Stonebraker et al., 2010) 15
Table 1-5: The main advantages of MapReduce over Parallel Database Management Systems (Dittrich et al., 2013a; Dittrich et al., 2013b; Pavlo et al., 2009; Stonebraker et al., 2010) 16
Table 1-6: MapReduce versus Message Passing Interface 17
Table 2-1: Some examples of parameter configuration with Hadoop, (Impetus, 2009) 27
Table 2-2: Hadoop MapReduce vs Hadoop YARN (Murthy et al., 2014; Radia & Srinivas, 2014) 28
Table 2-3: A piece of Hadoop ecosystem 32
Table 3-1: An overview of MapReduce jobs from the Cosine-based method 49
Table 3-2: MAP-1 algorithm of Cosine-based method 50
Table 3-3: REDUCE-1 algorithm of Cosine-based method 50
Table 3-4: MAP-2 algorithm of Cosine-based method 51
Table 3-5: REDUCE-2 algorithm of Cosine-based method 52
Table 3-6: MAP-3 algorithm of Cosine-based method 52
Table 3-7: REDUCE-3 algorithm of Cosine-based method 53
Table 3-8: MAP-4 algorithm of Cosine-based method 53
Table 3-9: REDUCE-4 algorithm of Cosine-based method 54
Table 3-10: An overview of MapReduce jobs from the Jaccard-based method 64
Table 3-11: MAP-1 algorithm of Jaccard-based method 65
Table 3-12: REDUCE-1 algorithm of Jaccard-based method 66
Table 3-13: MAP-2 algorithm of Jaccard-based method 66
Table 3-14: REDUCE-2 algorithm of Jaccard-based method 67
Table 4-1: An overview of MapReduce jobs with document-based indices (Phan et al., 2015b) 78
Table 4-2: MAP-1 algorithm of a document-indexing method 79
Table 4-3: REDUCE-1 algorithm of a document-indexing method 79
Table 4-4: MAP-2 algorithm of a document-indexing method 80
Table 4-5: REDUCE-2 algorithm of a document-indexing method 81
Table 4-6: An overview of MapReduce jobs with the two-level clustering method 89
Table 4-7: MAP-1 algorithm of a two-level clustering method 91
Table 4-8: REDUCE-1 algorithm of a two-level clustering method 92
Table 4-9: MAP-2 algorithm of a two-level clustering method 93
Table 4-10: REDUCE-2 algorithm of a two-level clustering method 93
Table 5-1: An overview of MapReduce jobs with the granularity-enhanced spiral clustering
method (Phan et al., 2016) 103
Table 5-2: MAP-1 algorithm of a granularity-enhanced spiral clustering method 105
Table 5-3: REDUCE-1 algorithm of a granularity-enhanced spiral clustering method 106
Table 5-4: MAP-2 algorithm of a granularity-enhanced spiral clustering method 107
Table 5-5: REDUCE-2 algorithm of a granularity-enhanced spiral clustering method 108
Table 5-6: An overview of MapReduce jobs with eHSimWLB 118
Table 5-7: MAP-1 algorithm of eHSimWLB 119
Table 5-8: REDUCE-1 algorithm of eHSimWLB 120
Table 5-9: MAP-2 algorithm of eHSimWLB 121
Table 5-10: REDUCE-2 algorithm of eHSimWLB 123
Table 5-11: eHSim versus eHSimWLB with a single machine 128
Table 5-12: eHSim versus eHSimWLB with Alex cluster 128
Table 6-1: Gutenberg dataset organization 132
Table 6-2: DBLP dataset organization 132
Table 6-3: Query data 133
Table 6-4: Query group 133
Table 6-5: The two facilities for Hadoop deployment 136
Table 6-6: Query-processing time from a single machine 138
Table 6-7: Running time of a single MapReduce job with one-letter file 139
Table 6-8: Total key-value pairs among methods on Gutenberg datasets 144
Table 7-1: Summary of our contributions 153
Table 7-2: Our assessment in general 154
TABLE OF CONTENTS
Page
1.1 Similarity Search 3
1.2 Big Data 6
1.3 MapReduce Paradigm 10
1.4 Motivation 11
1.4.1 The Essential Role of Similarity Search 11
1.4.2 The Era of Big Data 13
1.4.3 The Fashion of Parallel and Distributed Computing 14
1.4.4 Redundancy Problem 17
1.4.5 Load Balancing Problem 18
1.5 Problem Statement 19
1.6 Objectives 20
1.7 Scope and Constraints 21
1.8 Our Approach and Contributions 21
1.9 Organization 24
CHAPTER 2 – RELATED WORK AND DATA PROCESSING STRATEGIES 25
2.1 Overview 25
2.2 Related Work 26
2.2.1 Framework Configuration 26
2.2.2 Implementation Improvement 27
2.2.3 Specialized Third Party and Hybrid Systems 31
2.2.4 Scheme-based Improvement 32
2.3 Filtering and Pruning Strategies 40
2.4 Our General MapReduce-based Architecture 41
2.5 Conclusion 43
PART 2 – APPROACHES AND METHODS 44
CHAPTER 3 – INSTANT APPROACHES 45
3.1 Overview 45
3.2 Cosine-based Method 46
3.2.1 Theory 46
3.2.2 Algorithms and Examples 49
3.2.3 Performance Study 59
3.3 Jaccard-based Method 63
3.3.1 Theory 63
3.3.2 Algorithms and Examples 65
3.3.3 Performance Study 70
3.4 Conclusion 72
CHAPTER 4 – BUILD-IN APPROACHES 73
4.1 Overview 73
4.2 Document-indexing Method 74
4.2.1 Theory 74
4.2.2 Algorithms and Examples 78
4.2.3 Performance Study 83
4.3 Two-level Clustering Method 85
4.3.1 Theory 85
4.3.2 Algorithms and Examples 90
4.3.3 Performance Study 95
4.4 Conclusion 98
CHAPTER 5 – HYBRID APPROACHES 99
5.1 Overview 99
5.2 Granularity-enhanced Spiral Clustering 100
5.2.1 Theory 100
5.2.2 Algorithms and Examples 104
5.2.3 Performance Study 110
5.3 Load Balancing Strategy 114
5.3.1 Theory 114
5.3.2 Algorithms 119
5.3.3 Performance Study 123
5.4 Conclusion 129
PART 3 – ASSESSMENT AND SUMMARY 130
CHAPTER 6 – EVALUATION 131
6.1 Overview 131
6.2 Datasets 131
6.3 Environmental Settings 135
6.4 Experiment Measurement 136
6.5 Empirical Experiments 137
6.5.1 Experiments with a Single Machine on Gutenberg Datasets 137
6.5.2 Experiments with a One-letter File 139
6.5.3 Experiments with a Cluster of Commodity Machines on Gutenberg Datasets 140
6.5.4 Experiments with Different Queries and Multiple Gutenberg Datasets 142
6.5.5 Experiments with Key-value Pairs and Candidate Pairs on Gutenberg Datasets 144
6.5.6 Experiments with a Cluster on DBLP Datasets 145
6.5.7 Experiments with Different Queries and Multiple DBLP Datasets 146
6.5.8 Experiments with Key-value Pairs on DBLP Datasets 147
6.6 Conclusion 148
CHAPTER 7 – SUMMARY 149
7.1 Conclusion 149
7.2 Work Assessment 154
7.3 Open Research and Challenges 156
7.3.1 The Optimization Problem 156
7.3.2 More Efficient Data Organization 157
7.3.3 Query Grouping Strategies 157
PART 1 – OVERVIEW
CHAPTER 1 – INTRODUCTION
Similarity search is a concept that is either explicitly or implicitly attached to our daily lives. It is the combination of the two terms “similarity” and “search.” According to WordNet, a lexical database for English, we have the definition of “similarity” as follows:
Similarity exists and affects different aspects, especially in the cognitive sciences (Larkey & Markman, 2005). For instance, similarity in a clustering process decides how similar objects are grouped together. In a classification process, similarity determines how a new object is labeled among diverse categories. In a knowledge-based system, similarity supports identifying related knowledge to solve a problem. Besides, similarity also exerts its influence in social psychology (Zezula, 2012). The most common example is that humans tend to be close to those who share similarity with them in terms of interest, habit, personality, age, and so forth. No matter whether similarity appears in the natural or the social sciences, its existence really does matter.
On the other hand, we have the definition of “search” from WordNet as follows:
When considered in computer science, search is thus a common and fundamental operation whose goal is to look for desired results from queries. Due to the fact that almost everything can be stored in a digital form, searching is used very intensively. A report in (Zezula, 2012) shows that users spend quite a lot of time, approximately one week of each month, on their search for specific needs such as looking for news, weather forecasts, knowledge, or experiences. Hence, searching naturally becomes our daily task.
“Similarity (n) The quality of being similar, a Gestalt principle of organization
holding that (other things being equal) parts of a stimulus field that are similar to
each other tend to be perceived as belonging together as a unit.” (WordNet, 2010)
“Search (n) The activity of looking thoroughly in order to find something or
someone, an investigation seeking answers, an operation that determines whether one
or more of a set of items has a specified property, the examination of alternative
hypotheses, boarding and inspecting a ship on the high seas.” (WordNet, 2010)
As a consequence, similarity search denotes the kind of searching that is based on similarity. For instance, web search engines like Google, Yahoo!, or Bing find related information on the World Wide Web according to users’ search keywords. Other than being used directly, other important processes like clustering, classification, data mining, decision making, and so on can employ similarity search to achieve their aims. For example, a clustering process exploits similarity search to find similar objects that have a high probability of being in the same group. Therefore, the big and important role of similarity search has spread widely to different domains.
Furthermore, similarity search is not only about finding results but also about the quality of the results in terms of user expectations. This is due to the fact that an exact-match search is usually limited when there are complex objects (e.g., multimedia objects), and the desired results of a search task might change with the context. Meanwhile, the beauty of similarity search is to return relevant results instead of emptiness when there is no exact match at all. Because of this, similarity search dominates the exact-match search and becomes a widely used operation serving a vast collection of users’ needs.
Even though similarity search was investigated and developed a long time ago, it still faces new challenges and calls for attention. In the following sections, we introduce the concept of similarity search in section 1.1, big data in section 1.2, and the MapReduce paradigm in section 1.3. Next, we show the main inspirations of our work in section 1.4. In addition, we identify the problem statement in section 1.5 as well as the objectives in section 1.6. Besides, we determine the scope and constraints of our study in section 1.7. Moreover, we generally introduce our approach and contributions in section 1.8. Last but not least, we show the outline of our dissertation in section 1.9.
1.1 Similarity Search
A metric is a function $d$ on the domain of objects $D$ such that $d: D \times D \to \mathbb{R}$, where $\mathbb{R}$ is the set of real numbers. A metric space $\mathcal{M} = (D, d)$ is defined by the mapping above, and the function $d$ holds the following conditions (Fréchet, 1906; Choudhary, 1992; Zezula et al., 2010); a small sanity-check sketch follows the list:
• Non-negativity: $d(x, y) \ge 0$
• Coincidence axiom: $d(x, y) = 0 \Leftrightarrow x = y$
• Symmetry: $d(x, y) = d(y, x)$
• Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$
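For illustration only (this sketch is ours and not part of the thesis), the metric axioms above can be checked by brute force on a small sample of objects; here the Hamming distance from Table 1-1 is assumed as the candidate function, and the sample strings are made up:

```python
# Brute-force check of the metric axioms on a small sample of objects.
from itertools import product

def hamming(x: str, y: str) -> int:
    """Hamming distance between two equal-length binary strings (Table 1-1)."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y))

def check_metric_axioms(d, objects) -> bool:
    """Verify non-negativity, coincidence, symmetry, and the triangle inequality."""
    for x, y in product(objects, repeat=2):
        if d(x, y) < 0:                   # non-negativity
            return False
        if (d(x, y) == 0) != (x == y):    # coincidence axiom
            return False
        if d(x, y) != d(y, x):            # symmetry
            return False
    for x, y, z in product(objects, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):   # triangle inequality
            return False
    return True

sample = ["0000", "0101", "1111", "1010"]
print(check_metric_axioms(hamming, sample))  # True: Hamming is a metric on this sample
```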
DEFINITION 1-1 (SIMILARITY SEARCH) Let $D$ be the domain of objects, $d$ be a metric, and $\mathcal{M} = (D, d)$ be a metric space. Given a set $X \subseteq D$, a similarity search on $X$ is an operation that retrieves all objects in the set $X$ that satisfy the required constraints with respect to a query object, which is given either explicitly or implicitly.
There are several variants of similarity search, which accord with specific needs and are known as similarity queries. In our work, we define the fundamental ones as follows.
DEFINITION 1-2 (RANGE QUERY) Let $D$ be the domain of objects, $d$ be a metric, and $\mathcal{M} = (D, d)$ be a metric space. Given a set $X \subseteq D$, a query object $q \in D$, and a pre-defined range threshold $r$, a range query, denoted as $R(q, r)$, retrieves all objects in $X$ such that $R(q, r) = \{o \in X \mid d(q, o) \le r\}$.
EXAMPLE 1-2 (RANGE QUERY) Figure 1-1a illustrates an example of a range query $R(q, r)$. There are 8 objects denoted as $O_1, O_2, O_3, O_4, O_5, O_6, O_7$, and $O_8$. When given a query object $q$ and a pre-defined range threshold $r$, the range query $R(q, r)$ returns $O_3$ and $O_5$, which are similar to $q$ and satisfy the threshold $r$.
DEFINITION 1-3 (k-NEAREST NEIGHBOR QUERY) Let $D$ be the domain of objects, $d$ be a metric, and $\mathcal{M} = (D, d)$ be a metric space. Given a set $X \subseteq D$, a query object $q \in D$, and a pre-defined number $k$, a k-Nearest Neighbor query, denoted as $kNN(q)$, retrieves a set $A \subseteq X$ such that $|A| = k$ and $\forall o \in A, \forall o' \in X \setminus A: d(q, o) \le d(q, o')$.
EXAMPLE 1-3 (k-NEAREST NEIGHBOR QUERY) Figure 1-1b illustrates an example of a k-Nearest Neighbor query $kNN(q)$. There are 8 objects denoted as $O_1, O_2, O_3, O_4, O_5, O_6, O_7$, and $O_8$. When given a query object $q$ and the parameter $k = 3$, the 3-Nearest Neighbor query returns $O_5$, $O_3$, and $O_2$, which are the three objects nearest to $q$.
DEFINITION 1-4 (JOIN QUERY) Let $D$ be the domain of objects, $d$ be a metric, and $\mathcal{M} = (D, d)$ be a metric space. Given a set $X \subseteq D$, a set $Y \subseteq D$, and a threshold $\varepsilon$, a join query $J(X, Y, \varepsilon)$ retrieves all pairs of objects such that $J(X, Y, \varepsilon) = \{(x, y) \in X \times Y \mid d(x, y) \le \varepsilon\}$.
EXAMPLE 1-4 (JOIN QUERY) Figure 1-1c illustrates an example of a self-join query $J(X, X, \varepsilon)$. There are 8 objects denoted as $O_1, O_2, O_3, O_4, O_5, O_6, O_7$, and $O_8$. When given a distance threshold $\varepsilon$, the self-join query $J(X, X, \varepsilon)$ returns the pairs of objects $(O_1, O_2)$, $(O_1, O_3)$, $(O_2, O_5)$, $(O_3, O_4)$, $(O_3, O_5)$, $(O_5, O_6)$, and $(O_7, O_8)$, whose distances are smaller than or equal to the given distance threshold.
Figure 1-1: Examples of typical similarity queries; (a) Range query; (b) k-Nearest Neighbor
query; (c) Self-join query
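The three query types can be illustrated with a short sketch (ours, not the thesis implementation); it assumes a toy Euclidean metric on 2-D points, in the spirit of Figure 1-1, and evaluates the queries by a naive linear scan:

```python
# Naive in-memory implementations of the fundamental similarity queries.
import math
from itertools import combinations

def euclidean(p, q):
    """Assumed example metric: Euclidean distance on 2-D points."""
    return math.dist(p, q)

def range_query(X, q, r, d=euclidean):
    """R(q, r) = {o in X | d(q, o) <= r}."""
    return [o for o in X if d(q, o) <= r]

def knn_query(X, q, k, d=euclidean):
    """kNN(q): the k objects of X closest to q."""
    return sorted(X, key=lambda o: d(q, o))[:k]

def self_join(X, eps, d=euclidean):
    """J(X, X, eps): all pairs of distinct objects within distance eps."""
    return [(x, y) for x, y in combinations(X, 2) if d(x, y) <= eps]

points = [(0, 0), (1, 0), (0, 2), (3, 3), (3, 4)]
print(range_query(points, q=(0, 0), r=1.5))   # [(0, 0), (1, 0)]
print(knn_query(points, q=(0, 0), k=3))       # the three nearest points
print(self_join(points, eps=1.0))             # [((0, 0), (1, 0)), ((3, 3), (3, 4))]
```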
Table 1-1: Some typical instances of metrics
Metric | Distance Formula/Description | Types of Data
Hamming (Hamming, 1950) | $d(x, y) = \sum_{i} [x_i \ne y_i]$, i.e., the number of positions at which the two strings differ | Binary strings

Table 1-2: Some typical instances of metric variants
Distance Function | Formula/Description | Types of Data
Dice's coefficient or Sørensen index (Dice, 1945; Sorensen, 1948) | $s(X, Y) = \frac{2|X \cap Y|}{|X| + |Y|}$ | Sets
Cosine (Singhal, 2001) | $s(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$ | Vectors
Overlap (Manning & Schütze, 1999) | $s(X, Y) = \frac{|X \cap Y|}{\min(|X|, |Y|)}$ | Sets
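As a hedged illustration of these measures (our sketch, not code from the thesis), the following functions compute the Hamming distance from Table 1-1, the set-based variants from Table 1-2, and the Jaccard and Cosine measures used later in the dissertation; the toy documents and term weights are assumptions for the example:

```python
# Simple implementations of the listed similarity and distance measures.
import math

def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))          # assumes equal length

def jaccard(X: set, Y: set) -> float:
    return len(X & Y) / len(X | Y)

def dice(X: set, Y: set) -> float:
    return 2 * len(X & Y) / (len(X) + len(Y))

def overlap(X: set, Y: set) -> float:
    return len(X & Y) / min(len(X), len(Y))

def cosine(x: dict, y: dict) -> float:
    """Cosine similarity of sparse term-weight vectors (term -> weight)."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    norm = math.sqrt(sum(w * w for w in x.values())) * \
           math.sqrt(sum(w * w for w in y.values()))
    return dot / norm if norm else 0.0

d1, d2 = {"big", "data", "search"}, {"similarity", "search", "data"}
print(jaccard(d1, d2), dice(d1, d2), overlap(d1, d2))   # 0.5 0.666... 0.666...
v1, v2 = {"big": 1.0, "data": 2.0}, {"data": 1.0, "search": 1.0}
print(round(cosine(v1, v2), 3))                         # 0.632
```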
Other variants can be seen as either special cases of or complex forms combining the fundamental types of similarity queries (Zezula et al., 2010). For instance, a self-join, also known as pairwise similarity, is the special case of a join query where $X = Y$. On the other hand, we can combine $R(q, r)$ and $kNN(q)$ into a new variant of similarity query $V(q, r)$ that returns the $k$ nearest neighbors of $q$ that additionally lie within the range $r$, i.e., $V(q, r) = \{o \in kNN(q) \mid d(q, o) \le r\}$.
Furthermore, a similarity search process consists of the two main phases below (a small sketch of this two-phase pattern follows the list):
• Candidate generation phase: indicates the phase where similar pairs of objects are identified and generated as candidates.
• Candidate verification phase: denotes the phase where the candidates are verified against the similarity constraints.
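A minimal sketch of the two phases (our illustration under simplifying assumptions, not the thesis pipeline): candidates are generated from a token-sharing condition via an inverted index, and only those candidates are verified against a Jaccard threshold; the sample records and the threshold are hypothetical:

```python
# Two-phase similarity search: candidate generation, then candidate verification.
from collections import defaultdict
from itertools import combinations

def jaccard(X, Y):
    return len(X & Y) / len(X | Y)

def candidate_generation(records):
    """Phase 1: any two records sharing at least one token become a candidate pair."""
    index = defaultdict(set)              # token -> record ids (an inverted index)
    for rid, tokens in records.items():
        for t in tokens:
            index[t].add(rid)
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(sorted(rids), 2))
    return candidates

def candidate_verification(records, candidates, threshold):
    """Phase 2: compute the real similarity only for the candidate pairs."""
    return [(a, b, jaccard(records[a], records[b]))
            for a, b in candidates
            if jaccard(records[a], records[b]) >= threshold]

records = {1: {"big", "data"}, 2: {"big", "data", "search"}, 3: {"load", "balance"}}
cands = candidate_generation(records)
print(candidate_verification(records, cands, threshold=0.5))  # [(1, 2, 0.666...)]
```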
In order to evaluate how similar or dissimilar a pair of objects is, metrics are indispensable parts of similarity search. Table 1-1 lists some typical examples of metrics. It is worth noting that in case one of the conditions of a metric space does not hold, we obtain variants of metric spaces (Zezula et al., 2010). For example, if a function only satisfies the condition $d(x, x) = 0$ but not the condition $d(x, y) = 0 \Rightarrow x = y$, then it is called a pseudo-metric. If a function does not hold the symmetry property $d(x, y) = d(y, x)$, it is considered a quasi-metric. If a function satisfies a stronger constraint on the triangle inequality property, namely $d(x, z) \le \max\{d(x, y), d(y, z)\}$, it is known as a super-metric or an ultra-metric. And if a function does not hold the triangle inequality property $d(x, z) \le d(x, y) + d(y, z)$, it belongs to the semi-metrics. Table 1-2 shows some typical instances of metric variants.
1.2 Big Data
Due to the rapid development of modern technologies and applications, the amounts of generated and collected data quickly increase and become more and more plentiful In addition, we are now in the time in that data are everywhere Such enormous data reside in different data sources and can be collected not only from sensors, machines, automation processes, and mobile devices but also from World-Wide-Web, the world of Internet of Things, social networks, transactions, and user interactions For example in practice, data with the support nowadays technologies in agricultural domains (Jäger et al., 2015) can be collected from different kinds of sensors like those measure the air temperature, air humidity, air pressure, soil temperature, soil moisture, fertilizers and pesticides, and density of weeds; from services like those provide weather forecasts, maps, and plant diagnosis; and from machines like tractors, ploughs, loaders, trailed sprayers, and harvest machines
On the other hand, IBM (IBM, 2011) and Letouzé (Letouzé, 2012) show a phenomenon of
data revolution known as “data deluge,” (Hey & Trefethen, 2003; Hey et al., 2009; Letouzé,
2012) which indicates from the time when there was a relatively small amount of data available through some limited channels to the time when data have been enormously rising from diversified data sources The evidence from the paper reveals the high rate of data growth in today’s digital age, where global information reaches thousands of Exabyte of data, and
Trang 25predicts that the data capacity will continually increase and be doubled in every 20 months For
a simple illustration of how big and fast data have grown in recent years, we would prefer some
facts as follows Gantz et al denote that the digital universe in 2007 is around 281 Exabytes, and its size since 2006 increases 10 times by 2011 (Gantz et al., 2008) Besides, Villars et al say that the data generated over the World in 2010 reached over 1 Zettabyte and its amount is expected to reach 7 Zettabytes a year by 2014 (Villars et al., 2011) In fact, Letouzé indicates that the overall data grew from 150 Exabytes in 2005 to 1200 Exabytes in 2010 and illustrates
the data revolution since 2005 (Letouzé, 2012), which is illustrated in Figure 1-2 Moreover, Cloud Security Alliance estimates that the amount of data generated are predicted to be double
in every two years, from 2500 Exabytes in 2012 to 40.000 Exabytes in 2020 (Cloud Security
Alliance, 2013) Furthermore, Bryant et al show how we are involved in the huge amounts of data in reality (Bryant et al., 2008) For instance, big companies like Google, Yahoo!, or
Microsoft daily collect Terabytes of data Or Wal-Mart has to deal with around 267 million transactions per day at their 6000 stores world-wide In addition, the Large Synoptic Survey
Telescope2 records 30 Terabytes of image data every day while the large Hadron Collider3generates 60 Terabytes of data per day and 15 Petabytes per year Moreover, the documents
on the World-Wide-Web are about several hundred Terabytes of text data
Figure 1-2: Data revolution since 2005 (Letouzé, 2012)
Nevertheless, the term “Big Data” is not just about how big the data are, and there is currently no clear consensus on its definition. We have surveyed many works in the literature, and Table 1-3 shows, on the one hand, what the term “Big Data” refers to. On the other hand, many studies commonly use the first “three Vs” model to define the term “Big Data” (Demchenko et al., 2014; Dong & Srivastava, 2013; IBM, 2011; Kaisler et al., 2013; Laney, 2001; Letouzé, 2012; McKendrick, 2012; Mitchell et al., 2012; Nessi, 2012; Russom, 2011), which also forms its fundamental properties as follows:
• Volume: It points out the tremendous amounts of data. There is currently no exact number from which we would say how big they are, but the volumes usually range from tens of Terabytes and beyond.
• Velocity: It reflects the high rate at which data are generated, collected, and changed.
• Variety: It describes the heterogeneity of data coming from diverse data sources and in different formats as well as structures.
Other than these, more properties have been recognized, as follows:
• Veracity: It refers to how far the data from different data sources can be trusted, given that they are of widely differing quality (Demchenko et al., 2014; Dong & Srivastava, 2013; Labrinidis & Jagadish, 2012).
• Value: It concerns how useful data are and how they are exploited after being recorded (Demchenko et al., 2014; Kaisler et al., 2013; McKendrick, 2012; Nessi, 2012).
• Complexity: It measures how data are inter-connected and inter-dependent in big data structures (Kaisler et al., 2013).
In short, big data emerge as an inevitable consequence, which brings both opportunities and challenges not only to researchers but also to companies and organizations world-wide.
Table 1-3: The evolving concepts of Big Data
No | Year | Source | Definition
1 | 2011 | (Gantz & Reinsel, 2011) from International Data Corporation (IDC) | “Big data technologies describe a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.”
2 | 2011 | (Manyika et al., 2011) | “Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”
3 | 2011 | (Villars et al., 2011) | “Big Data is about the growing challenge that organizations face as they deal with large and fast-growing sources of data or information that also present a complex range of analysis and use problems.”
4 | 2012 | (Halevi & Moed, 2012) | “The term “Big Data” is coined by Roger Magoulas from O’Reilly media in 2005, refers to a wide range of large data sets almost impossible to manage and process using traditional data management tools – due to their size, but also their complexity.”
5 | 2012 | (Nessi, 2012) | “Big Data is a term encompassing the use of techniques to capture, process, analyse and visualize potentially large datasets in a reasonable timeframe not accessible to standard IT technologies. By extension, the platform, tools and software used for this purpose are collectively called Big Data technologies.”
6 | 2013 | (Kaisler et al., 2013) | “Big Data is the amount of data just beyond technology’s capability to store, manage and process efficiently.”
7 | 2013 | (Cloud Security Alliance, 2013) | “Big Data refers to the massive amounts of digital information companies and governments collect about human beings and our environment.”
9 | 2013 | (Gartner, 2013; Sicular, 2013) | “Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
10 | 2015 | Wikipedia (https://en.wikipedia.org/wiki/Big_data) | “Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.”
1.3 MapReduce Paradigm
MapReduce is a parallel programming paradigm that supports end-users in performing specialized real-world tasks in a distributed environment without caring about the underlying parallelism (Dean & Ghemawat, 2008). Its basic idea exploits the “divide-and-conquer” strategy, dividing a big problem into smaller ones that are then handled on commodity machines. With the MapReduce paradigm, end-users only need to specify their computations by a MAP and a REDUCE function. Moreover, a MAP task is specified by a MAP function and a REDUCE task is specified by a REDUCE function. Once MapReduce is deployed in a cluster of commodity machines, one machine acts as the master while the others act as workers. The master delivers m MAP tasks and r REDUCE tasks to idle workers. Those assigned a MAP task are called mappers, whilst those assigned a REDUCE task are called reducers. According to (Dittrich et al., 2012), MapReduce has become a “de facto” standard for large-scale data processing in many enterprises.
Figure 1-3: MapReduce paradigm (Phan et al., 2015c)
An overview workflow of a MapReduce job is illustrated in Figure 1-3. Suppose that there are m MAP tasks and r REDUCE tasks within the job. Its single workflow is briefly described by the following steps (a minimal local sketch follows the list):
1. The data input in a distributed file system is first partitioned into multiple chunks according to the block size.
2. A mapper reads its corresponding input partition and then parses the input data into key-value pairs of the form [key1, value1]. Next, it passes each pair to a MAP function, which emits intermediate key-value pairs of the form [key2, value2]. These key-value pairs are then buffered, partitioned into r regions, and periodically written to the local disk of the mapper.
3. When getting a signal from the master, a reducer pulls the intermediate key-value pairs belonging to its partition from the local disks of the mappers. It then groups these pairs according to their keys by a shuffling process, which produces key-value pairs of the form [key2, [value2]].
4. Afterwards, the above key-value pairs from a reducer are fed to a REDUCE function, which produces its result from the reduce partition.
5. The final results are written back to the distributed file system.
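The following is a minimal, purely local sketch (ours, not the Hadoop implementation used in the thesis) that mimics the five steps above with a toy word-count job; the hash-based partitioning and the in-memory grouping stand in for what the framework does across machines:

```python
# Local simulation of a single MapReduce job: map, partition, shuffle, reduce.
from collections import defaultdict

def map_fn(doc_id, text):                  # step 2: emit [key2, value2] pairs
    for term in text.split():
        yield term, 1

def reduce_fn(term, counts):               # step 4: aggregate the values per key
    yield term, sum(counts)

def run_job(inputs, map_fn, reduce_fn, num_reducers=2):
    # step 1: here the "chunks" are simply the input records
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key1, value1 in inputs:
        for key2, value2 in map_fn(key1, value1):
            r = hash(key2) % num_reducers  # step 2: partition into r regions
            partitions[r][key2].append(value2)
    output = []                            # step 3: the grouping above is the shuffle
    for part in partitions:
        for key2, values in part.items():
            output.extend(reduce_fn(key2, values))
    return output                          # step 5: would be written back to the DFS

docs = [("d1", "big data similarity search"),
        ("d2", "similarity search with mapreduce")]
print(sorted(run_job(docs, map_fn, reduce_fn)))
```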
1.4 Motivation
1.4.1 The Essential Role of Similarity Search
Similarity search has been a principal operation, spreading from interdisciplinary fields of study such as information retrieval (Manning et al., 2008; Moffat et al., 1993), clustering (Beeferman & Berger, 2000; Broder et al., 1997; Lin & Cohen, 2010), data mining (Han et al., 2011), and machine learning (Ding et al., 2014; Weinberger & Saul, 2009) to a wide range of application domains such as duplicate detection (Hajishirzi et al., 2010; Kolcz et al., 2004), plagiarism detection (Hoad & Zobel, 2003; Stein & zu Eissen, 2006), recommender systems (Adomavicius & Tuzhilin, 2005), and data cleaning (Arasu et al., 2006; Chaudhuri et al., 2006), to name a few. The main goal of similarity search is to look for similar objects in a universal set. In order to do that, either similarity measures or distance measures (Gomaa & Fahmy, 2013; Zezula et al., 2010) are employed to estimate how similar a pair of objects is.
Unfortunately, similarity search is a non-trivial task since it is bound not only by CPU but also by I/O costs (Zezula et al., 2010). In addition, evaluating either distance or similarity measures is expensive, which also strongly influences its performance. For example, if we want to compute the similarity among all pairs in a set whose cardinality is $n$, we will need to do $n(n-1)/2$ similarity computations. If one computation takes on the order of a second (i.e., 1.000 milliseconds) and we have $n = 10.000$ objects in the set, then it approximately costs us 19 months to finish the computation. Hence, there is a crucial need to reduce these costs so that we can improve the performance of similarity search. The two main targets are, therefore, to diminish irrelevant object accesses and to decrease the amount of distance or similarity computations.
Other than that, the phenomenon called the “curse of dimensionality” (Gionis et al., 1999) has been a big issue for many years. Due to the fact that an object in reality is characterized by a collection of interesting features, the collection size is usually large. For instance, a text object is represented by a set of its unique words or terms, whose size can be up to thousands or more. If we consider each feature as a dimension, an object becomes a point in a multi-dimensional space. Such high-dimensional data cause serious obstacles to data structures that neither scale well nor manage huge amounts of data. Weber et al. show that when the dimensionality exceeds 10, the performance of all current indexing techniques that are based on space partitioning is generally slower than that of the brute-force, linear-scan approach (Weber et al., 1998). Consequently, doing similarity search in high-dimensional spaces is a major issue for future similarity-based systems.
Because of this, the performance problem of similarity search has gained much attention from both academia and industry. In order to avoid such high costs, many works propose different approaches, including indexing (Fenz et al., 2012; Hjaltason & Samet, 2003; Wang et al., 2015), filtering and pruning (Rajaraman & Ullman, 2011; Xiao et al., 2008a), hashing (Satuluri & Parthasarathy, 2012; Zhang et al., 2010), approximate searching (Dorneles et al., 2011; Patella & Ciaccia, 2009), and some combinations among them (Andoni & Indyk, 2008; Bayardo et al., 2007; Xiao et al., 2008b; Xiao et al., 2011; Zhang et al., 2013). For example, Fenz et al. introduce the State Set Index, which is based on a prefix index and interpreted as a Non-deterministic Finite Automaton (NFA), to efficiently do similarity search in very large string sets (Fenz et al., 2012). To index a string, each of its characters is mapped to a state of the NFA, and the last character defines the accepting state. When given a string query, its characters are processed one by one by the NFA. To this end, those strings which are in the final accepting state set are retrieved as candidates. After that, their similarity is computed by using the Edit distance.
In addition, Wang et al. propose a hierarchical segment tree index, known as the HS-tree, with effective pruning techniques (Wang et al., 2015). Their basic idea of building the HS-tree is to first group the strings by length and then recursively partition the strings into two segments of half length. Finally, each segment has an inverted index showing which strings contain the segments. When given a string query, the HS-tree search process identifies the i-th level from the threshold. Next, it uses the length filter to choose the visited nodes in the HS-tree. Then it generates a set of sub-strings of the given query. A candidate is specified by the strings that appear in the inverted indices corresponding to the sub-strings of the given query. On the other hand, Bayardo et al. propose the All-Pairs algorithm with the Cosine measure (Bayardo et al., 2007). Their goal is to reduce the candidate size by exploiting thresholds during indexing, a specific sort order, and thresholds during matching. Consequently, there is no need to build a complete inverted index from the vector inputs. In addition, Xiao et al. extend the All-Pairs algorithm with a proposed combination of positional filtering, prefix filtering, and suffix filtering (Xiao et al., 2008b; Xiao et al., 2011).
Meanwhile, Satuluri and Parthasarathy present BayesLSH, which overcomes the drawbacks of standard similarity estimation using the Locality-Sensitive Hashing technique (Satuluri & Parthasarathy, 2012). With BayesLSH, they can estimate similarities without tuning the number of hashes and perform early pruning, whose main goal is to quickly discard a large number of candidate pairs which should not be in the result set. Moreover, Zhang et al. study the combination of an index-structure-based method and a hashing-based method to approximately solve k-Nearest Neighbor queries (Zhang et al., 2013). The basic idea of the former is to use a k-means clustering tree performing a pruning strategy, while that of the latter is to use the Hamming distance performing fast distance computation.
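To make the prefix-filtering idea mentioned above concrete, here is a hedged sketch (our simplification, not the cited algorithms): for a Jaccard threshold t, two records can only reach the threshold if their prefixes, taken under a global token ordering, share at least one token, so only such pairs are verified; the sample records and the threshold are made up for the example:

```python
# Prefix filtering for a Jaccard self-join with threshold t (All-Pairs-style idea).
import math
from collections import defaultdict

def jaccard(X, Y):
    return len(X & Y) / len(X | Y)

def prefix_filter_join(records, t):
    """Only record pairs whose prefixes share a token are verified."""
    # A global token order (here: ascending frequency, rare tokens first).
    freq = defaultdict(int)
    for tokens in records.values():
        for tok in tokens:
            freq[tok] += 1
    order = lambda tokens: sorted(tokens, key=lambda tok: (freq[tok], tok))

    index = defaultdict(set)        # token -> ids whose prefix contains the token
    candidates = set()
    for rid, tokens in records.items():
        ordered = order(tokens)
        # Standard prefix length for Jaccard threshold t: |x| - ceil(t*|x|) + 1.
        prefix_len = len(ordered) - math.ceil(t * len(ordered)) + 1
        for tok in ordered[:prefix_len]:
            candidates.update((min(rid, o), max(rid, o)) for o in index[tok])
            index[tok].add(rid)
    return [(a, b) for a, b in candidates if jaccard(records[a], records[b]) >= t]

records = {1: {"a", "b", "c", "d"}, 2: {"a", "b", "c", "e"}, 3: {"x", "y", "z"}}
print(prefix_filter_join(records, t=0.6))   # [(1, 2)]; record 3 is never compared
```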
In general, those different approaches are typical examples of work focusing on improving the performance of similarity search. Nevertheless, recent studies point out that using data access structures is not very efficient for high-dimensional data, while sequential scanning may be a better way but does not scale to large datasets (Patella & Ciaccia, 2009). Moreover, approximate methods can be alternative choices but have to relax their correctness. Furthermore, most of those techniques are, unfortunately, centralized solutions that do not take scalability into account. Especially now that we are approaching the era of big data and services, the relevant data become larger and more complex than ever, which poses big challenges on the development of large-scale data processing systems.
1.4.2 The Era of Big Data
We cannot deny the severe influences of big data from different perspectives (Bizer et al., 2012; Labrinidis & Jagadish, 2012; Patil et al., 2014). Basically, the properties of big data in section 1.2, of which the first “three Vs” are the most important ones, impose top challenges on the way we are living today. Trelles et al. estimate the costs of working with large volumes of data (Trelles et al., 2011). For example, it currently takes them at least 9 hours to process 500 GB of data in a cluster of 1000 cloud nodes, which costs 3000 US$ for 500 Gigabytes to 500 Terabytes of total data. They then estimate that it will take 2 years to process 1 Petabyte in a cluster of 1000 cloud nodes, which costs 6.000.000 US$ for 1 Petabyte to 1 Exabyte of total data. With these numbers, they question whether we are ready for big data with current computational technologies. Thus, advances in different technologies and applications are actually needed (Bryant et al., 2008).
On the other side, there are several surveys of enterprises showing their active responses to big data. For example, key findings from the survey of Bange et al. show that a majority of organizations are seriously aware of big data (Bange et al., 2013). They also recognize opportunities from big data, the top two being making better decisions and improving their own operational processes. According to the survey, 72% of the respondents address the large volumes, whilst 66% of them are conscious of the variety of data and 43% of them want to have faster data integration. In addition, the survey from (Gigaspaces, 2012) concludes that 43% out of the 80% of enterprises responding positively consider big data a mission-critical matter in their business, and most of them are still looking for better infrastructures and tools that are able to resolve the challenges from big data. Moreover, the survey from (McKendrick, 2012) notes that more than 50% of respondents deem big data extremely important to their business due to the fact that it raises questions they never asked before and gives them opportunities for effective competition and better revenue streams. Furthermore, the survey from (Syncsort, 2013) identifies that the two most important opportunities organizations are facing nowadays are how to find insights into their business that they have never had before and how to reduce data costs.
Apart from these, big data also poses challenging problems regarding privacy (Cloud Security Alliance, 2013; Letouzé, 2012; Wong, 2012), usability (Labrinidis & Jagadish, 2012), data integration (Dong & Srivastava, 2013; Intel, 2013), big data turning into lost data (Megler & Maier, 2012), and data visualization (Bizer et al., 2012), to name a few. What is more, investigating big data yields potential benefits such as advanced approaches and technologies, valuable profits, and effectiveness and efficiency in data processing (Kaisler et al., 2013). In terms of both opportunities and challenges, big data may cause a big chaos if there is no improvement or innovation of existing systems to deal with it. For industry, big data becomes a key factor involved in transforming companies and bringing changes to their processes, models, plans, and trends as well. As a consequence, an essential demand emerges for rethinking traditional data-processing mechanisms, since big data makes the data more and more complex than ever and keeps posing not only opportunities but also long-term challenges to running systems in general and to similarity search in particular.
1.4.3 The Fashion of Parallel and Distributed Computing
The outbreak of data poses real challenges to data processing as well as to the quality of services. Thus, it is very hard for a single machine to afford to process such big data. A large amount of data easily leads an operation or a service to a slow response, which is especially undesirable for (near) real-time services and applications. Even worse, a single machine may face a total breakdown whenever there is any failure, and operations have to be re-run from scratch. These scenarios more or less affect what we call the quality of services.
Table 1-4: The main advantages of Parallel Database Management Systems over MapReduce (Pavlo et al., 2009; Stonebraker et al., 2010)
No | MapReduce | Parallel Database Management Systems
1 | Too young; developed since 2004 | Long-time developed since the mid-1980s
2 | Developing components (e.g., HIVE, PIG, Zookeeper) | Professional tools (e.g., high-level access languages such as SQL, optimizer, compression, tuning tools, the knowledge of data distribution and location)
4 | Query processing by writing algorithms at the low level | Query processing by exploiting high-level declarative languages
5 | No integrated indexing | Built-in indexing
6 | Poor support for iteration and join tasks | Good support for iteration and join tasks
7 | Intermediate results when processing a query are written to disks before being pulled to the next stage | Intermediate results when processing a query are pipelined from producer to consumer without being written to disks
8 | Data parsing at run time | Data parsing at load time
An efficient way of data processing is popularly from parallel and distributed computing Due to scalability, the parallelization in a cluster of commodity machines where data are managed in a distributed file system becomes a must so that we can afford to efficiently process our big data as well as consider performance improvement There are currently two big candidate technologies known as Parallel Database Management Systems (DeWitt & Gray, 1992) and MapReduce paradigm (Dean & Ghemawat, 2008), which are designed to deal with this manner Table 1-4 shows the main advantages of Parallel Database Management Systems over MapReduce whereas Table 1-5 shows the main advantages of MapReduce over Parallel Database Management Systems
Even though there are lots of debates about the two state-of-the-art technologies (Pavlo et al., 2009; Stonebraker et al., 2010), and MapReduce has fewer strong points than Parallel DBMSs, we believe that MapReduce is a promising technique capable of dealing with the challenges of big data because its key characteristics are geared towards big data. This point of view is consolidated by the fact that MapReduce, together with its popular framework known as Hadoop, is nowadays used by many institutions as well as companies with big names such as Google, IBM, Facebook, Yahoo, and eBay. For example, there are hundreds of Terabytes of data processed daily by MapReduce at Google (Dean & Ghemawat, 2008) and 75 Terabytes of compressed data processed by Hadoop at Facebook day after day (Thusoo et al., 2010). Furthermore, the high fault tolerance of MapReduce gives better scalability once the size of a cluster of commodity machines increases.
It is worth noting that there is also an alternative option other than MapReduce and Parallel Database Management Systems. In terms of parallel computing, the Message Passing Interface (MPI) has become a "de facto" standard since its beginning in 1991 (Message Passing Interface Forum, 1994). For a better illustration, Table 1-6 shows the major comparison between MapReduce and MPI. Overall, MapReduce still outperforms MPI in the aspects of high fault tolerance, large-scale processing, and friendly programming. Such characteristics have been promoting MapReduce as a good choice for dealing with challenges arising not only from scalability issues but also from big data.
Table 1-5: The main advantages of MapReduce over Parallel Database Management Systems (Dittrich et al., 2013a; Dittrich et al., 2013b; Pavlo et al., 2009; Stonebraker et al., 2010)

No | MapReduce | Parallel Database Management Systems
1 | Upfront investment is small (no schemas, no SQL, no integrity constraints, no normalization, no data cleaning) | Upfront investment is not small
4 | It is schema-free, which supports semi-structured data and flexibly deals with unstructured data | Relational schemas are suitable for structured data but are unsuitable for unstructured data and awkward for semi-structured data
5 | Load time is fast | Load time is long due to query planning, the optimization process, and index building, to name a few
6 | Designed for complex tasks that manipulate diverse data | User-Defined Functions (UDFs) are restricted
Table 1-6: MapReduce versus Message Passing Interface (Chen et al., 2011; Kang et al., 2015; Singh, n.d.)

No | MapReduce | Message Passing Interface
1 | A parallel computing paradigm based on user-specified MAP and REDUCE functions | A message-passing library interface specification for parallel programming
2 | High fault-tolerance mechanism; moreover, the distributed file system (e.g., HDFS) has an advantage for fault tolerance due to data replication | Poor fault-tolerance mechanism, even though an application can employ checkpoints to achieve fault tolerance
3 | No experience in parallel computing is needed to use large distributed systems; hence, it has a user-friendly programming style | More complicated to use due to its various communication functions; hence, it requires a great deal of programming skill
4 | The compute nodes and the data nodes are the same, since each node holds a copy of some of the data | An MPI program runs on each processor core of the compute nodes, and data flow from the data nodes to the compute nodes
5 | Good for data parallelism; specifically, it is suitable for non-iterative algorithms where nodes require little data exchange to proceed; moreover, it is a choice when the data size is large and there is no iterative processing | Good for task parallelism; it is appropriate for iterative algorithms where nodes require data exchange to proceed; moreover, it is a choice when the data size is moderate and the problem is computation-intensive
Generally speaking, parallel and distributed computing is one of the essential trends not only for similarity search but also for other operations when dealing with massive datasets while preserving scalability. Furthermore, similarity search does not require iterations over the data, which makes it suitable for state-of-the-art technologies like MapReduce.
1.4.4 Redundancy Problem
By our observations, we are aware of redundancy throughout computing processes, which plays a part in the slow performance of similarity search. The redundancy problem arises when objects or operations that are neither necessary nor relevant to the final result take part in a computing process and cause extra costs. For instance, duplicate or irrelevant objects should not be stored, accessed, or evaluated during similarity search because they are sooner or later excluded from the final result. Resolving the redundancy problem, therefore, contributes to improving the overall performance of similarity search and other tasks alike.
Figure 1-4: Data redundancy throughout MapReduce processes
The data redundancy problem becomes serious with MapReduce because of its full-scan manner. Moreover, MapReduce processes are strictly I/O-bound. That also means that the data, including intermediate and final key-value pairs emitted from mappers or reducers, are written back to the distributed file system. Redundant data hence trigger additional costs. Let us consider the example illustrated in Figure 1-4. Suppose that there are 4 mappers and 3 reducers in a single MapReduce job. Additionally, assume that each mapper and reducer emits 3 output files, so we have in total 12 output files from the mappers and 9 output files from the reducers. According to the MapReduce paradigm described in Section 1.3, each reducer sequentially processes the 12 files from the mappers, and the mappers of the next MapReduce job, if any, will process the 9 files from the reducers. If there are redundant data, illustrated as files with bold borders, we suffer extra costs not only in the current MapReduce job but also in its successors, which leads to accumulated costs. Moreover, there are inevitable overheads from the shuffle process between mappers and reducers (Dean & Ghemawat, 2008). Consequently, redundant data should be pruned as soon as possible.
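As a concrete illustration of such early pruning, the following is a minimal Hadoop-streaming-style mapper sketch. It is not this dissertation's implementation; the tab-separated "doc_id<TAB>content" record layout and the is_relevant test are assumptions made only for this example.

```python
#!/usr/bin/env python3
# Minimal Hadoop-streaming-style mapper that prunes redundant input early,
# so duplicate or irrelevant records never reach the shuffle phase.
# The "doc_id<TAB>content" layout and the relevance test are illustrative assumptions.
import sys

def is_relevant(content: str, query_terms: set) -> bool:
    """Keep only documents sharing at least one term with the query."""
    return bool(query_terms & set(content.lower().split()))

def run_mapper(stream, query_terms: set) -> None:
    seen = set()                      # fingerprints of documents already emitted
    for line in stream:
        doc_id, _, content = line.rstrip("\n").partition("\t")
        fingerprint = hash(content)   # cheap duplicate check within this mapper
        if fingerprint in seen or not is_relevant(content, query_terms):
            continue                  # prune: no key-value pair is written at all
        seen.add(fingerprint)
        print(f"{doc_id}\t{content}")

if __name__ == "__main__":
    run_mapper(sys.stdin, query_terms={"similarity", "search"})
```

Pruning already in the map phase keeps redundant pairs out of the intermediate output, so neither the shuffle nor subsequent jobs pay for them.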
1.4.5 Load Balancing Problem
As we know, Hadoop is conscious of slow tasks by supporting speculative execution at runtime. This means that Hadoop tries to schedule redundant copies of the remaining prolonged tasks on idle compute nodes. Whenever one of these copies completes first, the others are discarded. Nevertheless, this mechanism only detects and tolerates slow tasks without fixing them. Meanwhile, a slow task emerges from two main causes as follows:
 The external cause. A task is slow when running in an inappropriate context. This relates to either hardware degradation of the node on which it runs or software misconfiguration in the distributed environment.
 The internal cause. A task is slow due to load imbalance. This denotes the case in which one task has to process much more data than the others and finally takes a long time to finish.
In the scope of our work, we take the problem of load imbalance into account. Since a MapReduce job ends only when its last reducer finishes writing its result to the distributed file system, a MapReduce job suffers from the "curse of the last reducer", in that the last reducer bounds the execution time of the whole job. By our observations, there are several main reasons that lead to this scenario, briefly described as follows:
 Skewed raw data. Data in reality are usually skewed, from the natural sciences to the social sciences. For instance, we have documents of diverse sizes in the corpus for the task of similarity search. When comparing them with a given document, we obtain tasks of different sizes for the similarity comparison. Consequently, skewed data lead to load imbalance throughout MapReduce jobs, first and foremost at the mappers.
 Skewed intermediate data by the key attribute. Intermediate data emitted from mappers may be skewed. This may happen due to either skewed raw data or load-unaware algorithm designs. By default, Hadoop hashes the key attribute of the intermediate key-value pairs emitted from mappers in order to distribute them to partitions, and hence to reducers. This mechanism is, however, unaware of a skewed distribution of keys. For instance, if the frequency of a key is high, one reducer takes more pairs than the others because of the high number of occurrences of that key. As a result, different reducers take loads of different sizes (see the sketch after this list).
 Skewed intermediate data by the value attribute. Intermediate data transferred between mappers and reducers are in the form of key-value pairs. Complex values in those pairs still make tasks process them longer, even when the tasks share the same number of pairs. For instance, when the same query document is given to two tasks with one document per task, the task computing a similarity score against a long document costs more time than the one against a short document.
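To make the key-skew issue concrete, here is a minimal sketch, not taken from this dissertation, that simulates hash partitioning over an invented, skewed key distribution; the key frequencies and the three-reducer setting are illustrative assumptions, while the modulo-of-hash idea mirrors Hadoop's default HashPartitioner.

```python
# A small simulation of hash partitioning under key skew (illustrative only).
# Hadoop's default HashPartitioner assigns a key to (hashCode & MAX_INT) % numReducers;
# Python's hash() stands in for Java's hashCode() here.
from collections import Counter

def partition(key: str, num_reducers: int) -> int:
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# Hypothetical intermediate key frequencies: one "hot" key dominates the stream.
key_frequencies = {"the": 50_000, "data": 1_200, "search": 900, "metric": 300}
num_reducers = 3

load = Counter()
for key, freq in key_frequencies.items():
    load[partition(key, num_reducers)] += freq

for reducer_id in range(num_reducers):
    print(f"reducer {reducer_id}: {load[reducer_id]} pairs")
# The reducer receiving the hot key processes far more pairs than the others,
# finishes last, and thus bounds the execution time of the whole job.
```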
The load imbalance problem highly impacts the overall performance of a MapReduce job. It also means that we cannot make the best use of computing resources and facilities to achieve high efficiency for computing tasks in general and for the operation of similarity search in particular.
1.5 Problem Statement
Our problem is to find ways that effectively and efficiently support similarity search in the era of big data when using the MapReduce paradigm. More specifically, we study the problems of scalability, performance, redundancy, and load balance throughout similarity search processes with MapReduce as follows:
 The scalability problem is to maintain the capability of similarity search so that it can handle growing workloads. Moreover, it also refers to large-scale processing with the distributed facilities of MapReduce.
 The performance problem is to achieve high performance for the whole process when processing data, especially for the operation of similarity search.
 The redundancy problem is to minimize unrelated objects and unnecessary operations among computations as well as intermediate and final outputs.
 The load balance problem is to mitigate prolonged tasks so as to speed up response times and increase system utilization.
1.6 Objectives
Our main goals are three-fold, as follows:
 We aim at improving the overall performance of similarity search in big data with the MapReduce paradigm.
 We want to resolve the related problems towards a unified solution.
 By our work, we would like to contribute to and promote the potential development of MapReduce for large-scale processing.
Keeping these objectives in mind, we initially come up with related research questions, which are briefly presented as follows.
RESEARCH QUESTION 1. Which key factors affect the performance of similarity search?
Identifying the factors that impact the performance of similarity search is essential. For each factor found, we need to take the chance of either improving it or dealing with it. Knowing the problems that really cause slow runtimes makes us aware of them and urges us to find appropriate solutions.
- What are they?
- How do they influence the performance of similarity search?
RESEARCH QUESTION 2. How can we make the best use of the MapReduce paradigm for fast similarity search? Answering the questions below helps us understand more about the capability of MapReduce to accelerate the operation of similarity search.
- Although MapReduce emerges to target many large-scale problems and data-intensive applications, is it capable of supporting similarity search?
- We need to investigate how to perform the process of similarity search with MapReduce. Is there any problem when we employ the MapReduce paradigm for the operation of similarity search?
- What does related work say about improving the performance of similarity search using the MapReduce paradigm?
- Which direction should we follow and why?
- Can we take advantage of indexing techniques to speed up the process in the fashion of parallel and distributed computing? If so, how should the data stored in the indices be organized to support fast similarity search?
- How many possible approaches are there for fast similarity search using MapReduce?
- How can we achieve efficiency by improving MapReduce, the searching process itself, or both?
RESEARCH QUESTION 3. Once an approach or a method is found, is it good enough? To answer this theme research question, we need to evaluate the approach or method to double-check its drawbacks and compare it with related work.
- Though the approach or method is useful for resolving a specific problem, will it have any side effects or lead to other problems?
- Can we further improve it?
- What makes our approach or method different from other approaches or methods?
- How good is it in comparison with other approaches or methods?
RESEARCH QUESTION 4. If several approaches or methods are found, how can we make them work together as a whole? Discrete approaches or methods are not very helpful on their own. In this case, we make an effort to build a unified solution in which each approach or method not only resolves its own problems but also works collaboratively with the others.
- Are they mutually exclusive?
- In case there is any conflict among them, how can we resolve it?
RESEARCH QUESTION 5. Besides scalability, how can MapReduce-based similarity search tackle other challenges from big data?
- As we still have other challenges from big data, basically known as the “three Vs,” is the
approach conscious of them?
Throughout our research work, we try to validate these questions and find answers not only to the theme research questions but also to their related sub-questions.
1.7 Scope and Constraints
In the scope of our dissertation, our work is constrained by the following:
 We study the problem of similarity search in a metric space.
 We work with plain text data as a simple illustration.
 We have not taken semantics into account yet.
 We mostly tackle the "three Vs" challenges in the context of big data. We, however, concentrate on data processing rather than data management.
1.8 Our Approach and Contributions
For the ease of illustration, and due to the fact that text similarity search is a successful application of similarity search (Zezula, 2012), we work with document objects and show their representation in the vector-space model as follows.
DEFINITION 1-5 (DOCUMENT REPRESENTATION BY TERMS). Suppose a workset Ω consists of a set of n document objects D_i, represented as Ω = {D_1, D_2, D_3, …, D_n}. Given a document object D_i composed of a set of words as terms, the document D_i is represented by its terms such that D_i = {term_1, term_2, term_3, …, term_w}.
DEFINITION 1-6 (DOCUMENT REPRESENTATION BY SHINGLES). Suppose a workset Ω consists of a set of n document objects D_i, represented as Ω = {D_1, D_2, D_3, …, D_n}. Given a document D_i as a string of characters, and with K-shingles defined as any substring of length K found in the document, the document D_i is represented by its shingles such that D_i = {SH_1, SH_2, …, SH_z}.
The concept of K-shingles (Rajaraman & Ullman, 2011; Theobald et al., 2008) is exploited in the field of natural language processing to represent documents because it helps avoid the mismatch that occurs when two document objects share the same terms but in different term positions. Hence, representing a document by its shingles is semantically better than representing it by its terms.
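To make the shingle representation concrete, here is a minimal sketch of character-level K-shingling; it is not this dissertation's implementation, and the choice K = 4 as well as the whitespace normalization are assumptions made only for illustration.

```python
# Minimal sketch of character-level K-shingling; K = 4 and the whitespace
# normalization are illustrative choices rather than the dissertation's settings.
def k_shingles(document: str, k: int = 4) -> set:
    """Return the set of all length-k substrings of the normalized document."""
    text = " ".join(document.split())          # collapse runs of whitespace
    return {text[i:i + k] for i in range(len(text) - k + 1)}

d1 = "similarity search in big data"
d2 = "search similarity in big data"           # same words, different order
print(sorted(k_shingles(d1))[:5])              # [' big', ' dat', ' in ', ' sea', 'arch']
print(k_shingles(d1) == k_shingles(d2))        # False: word order affects the shingles
```

Although the two example documents share exactly the same words, their shingle sets differ, which is precisely the property that makes shingles preferable to plain term sets.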
In our study, we generally employ the Jaccard coefficient (Jaccard, 1912) as a typical example of a metric to compute similarity scores. Other metric variants, if used, will be mentioned explicitly. Additionally, we use the sign || || to denote the cardinality of a set. Consequently, the cardinality of a document object D_i, denoted as ||D_i||, is the total number of elements belonging to the set. Moreover, we use the sign [,] to indicate a list, the sign [[,], [,]] to indicate a list of lists, the sign [,]ord to denote an ordered list, and the sign (u•v) to denote the inner product between u and v. Furthermore, we let one sign denote the greater string in a comparison between u and v while another sign denotes the smaller string. Last but not least, since we are working in a distributed environment, we additionally use Uniform Resource Locators (URLs) rather than plain identifiers so that we can uniquely specify a resource.
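As a concrete illustration of the similarity metric, here is a minimal sketch of the Jaccard coefficient between two documents represented as sets of terms; the example documents and the term-level representation are assumptions made for illustration, and in practice the sets could just as well contain shingles.

```python
# Minimal sketch of the Jaccard coefficient between two set-represented documents;
# the example documents are invented for illustration.
def jaccard(a: set, b: set) -> float:
    """Jaccard(a, b) = ||a intersect b|| / ||a union b||, defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

d1 = set("similarity search in big data".split())
d2 = set("fast similarity search with mapreduce".split())
print(round(jaccard(d1, d2), 3))   # 2 shared terms out of 8 distinct terms -> 0.25
```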
Given the potential of the MapReduce paradigm when compared to the other state-of-the-art technologies in Section 1.4.3, we decide to employ it in our work for large-scale data processing. Additionally, Hadoop, on the one hand, is the popular framework that implements the MapReduce paradigm. On the other hand, surveys either from or towards industry (Bange et al., 2013; Gigaspaces, 2012; McKendrick, 2012; Syncsort, 2013) positively show the potential of using and experimenting with Hadoop among companies and organizations to explore big data. Hence, the Hadoop framework becomes a very good candidate among big data tools. Furthermore, due to the fact that the streaming data transfer approach has lower execution time than the file-based communication mechanism (Fox et al., 2008), Hadoop streaming is equipped with