For ease of illustration, and because text similarity search is a successful application of similarity search (Zezula, 2012), we work with document objects and present their representation in the vector-space model as follows.
DEFINITION 1-5 (DOCUMENT REPRESENTATION BY TERMS). Suppose a workset Ω consists of a set of n document objects Di, represented as Ω = {D1, D2, D3, …, Dn}.
Given a document object Di composed of a set of words, called terms, the document Di is represented by its terms as Di = {term1, term2, term3, …, termw}.
DEFINITION 1-6 (DOCUMENT REPRESENTATION BY SHINGLES). Suppose a workset Ω consists of a set of n document objects Di, represented as Ω = {D1, D2, D3, …, Dn}.
Given a document Di as a string of characters, where a K-shingle is defined as any substring of length K found in the document, the document Di is represented by its shingles as Di = {SH1, SH2, …, SHz}.
The concept of K-shingles (Rajaraman & Ullman, 2011; Theobald et al., 2008) is exploited in the field of natural language processing to represent documents because it avoids the mismatch that arises when two document objects share the same terms but in different positions. Hence, representing a document by its shingles is semantically better than representing it by its terms.
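The two representations can be contrasted with a short sketch (the helper names are ours, not from the thesis): two documents that contain the same terms in a different order have identical term sets, while their shingle sets differ, so the shingle representation retains word order.

```python
def terms(doc):
    """Represent a document as the set of its terms (Definition 1-5)."""
    return set(doc.split())

def k_shingles(doc, k):
    """Represent a document as the set of its character K-shingles (Definition 1-6)."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

d1 = "the cat chased the dog"
d2 = "the dog chased the cat"

# Same term sets despite the different word order...
print(terms(d1) == terms(d2))                  # True
# ...but different shingle sets, so shingles capture positional information.
print(k_shingles(d1, 4) == k_shingles(d2, 4))  # False
```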
In our study, we generally employ the Jaccard coefficient (Jaccard, 1912) as a typical example of a metric for computing similarity scores. Any other metric variants used will be mentioned explicitly.
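For two sets A and B, the Jaccard coefficient is |A ∩ B| / |A ∪ B|; a minimal sketch (the convention for two empty sets is our assumption):

```python
def jaccard(a, b):
    """Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are identical
    return len(a & b) / len(a | b)

d1 = {"term1", "term2", "term3"}
d2 = {"term2", "term3", "term4"}
print(jaccard(d1, d2))  # 0.5 (2 shared elements out of 4 distinct)
```

The same function applies unchanged whether the sets contain terms or shingles.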
Additionally, we use the sign || || to denote the cardinality of a set; accordingly, the cardinality of a document object Di, denoted ||Di||, is the total number of elements belonging to the set. Moreover, we use the sign [,] to indicate a list, the sign [[,], [,]] to indicate a list of lists, the sign [,]ord to denote an ordered list, and the sign (u•v) to give the inner product between u and v. Furthermore, we let one sign denote the greater of two strings u and v under comparison, and another sign denote the smaller. Last but not least, since we are working in a distributed environment, we additionally use Uniform Resource Locators (URLs) rather than plain identifiers so that we can uniquely specify a resource.
Given the potential of the MapReduce paradigm compared to other state-of-the-art technologies, as discussed in section 1.4.3, we decide to employ it in our work for large-scale data processing. Hadoop13, on the one hand, is the popular framework implementing the MapReduce paradigm. On the other hand, surveys both from and oriented towards industry (Bange et al., 2013; Gigaspaces, 2012; McKendrick, 2012; Syncsort, 2013) show the strong uptake of Hadoop among companies and organizations exploring big data.
Hence, the Hadoop framework is a very good candidate among big data tools. Furthermore, because the streaming data transfer approach has a lower execution time than the file-based communication mechanism (Fox et al., 2008), Hadoop streaming14 is shipped with
13 https://hadoop.apache.org/
14 http://hadoop.apache.org/docs/r1.2.1/streaming.html
Hadoop. This utility, included in the Hadoop distribution, allows us to create and run MapReduce jobs with any executable scripts as Map and/or Reduce tasks, which avoids the extra cost of unnecessary Map or Reduce tasks. As a result, we make the best use of this feature in our work.
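A Hadoop streaming mapper is just an executable that reads lines from stdin and writes tab-separated key-value pairs to stdout. The sketch below is illustrative only (the script name, key-value scheme, and demo input are our assumptions, not the thesis's actual tasks):

```python
#!/usr/bin/env python3
# mapper.py -- a minimal mapper script for Hadoop streaming
# (script name and key-value scheme are illustrative assumptions).
import sys

def map_terms(lines):
    """Turn each input document line into tab-separated (term, 1) pairs,
    the textual key-value format Hadoop streaming expects on stdout."""
    for line in lines:
        for term in line.split():
            yield f"{term}\t1"

# In a real job this would iterate over sys.stdin; a small demo input here:
for pair in map_terms(["big data tools"]):
    print(pair)
```

A map-only job (no Reduce task at all, avoiding that extra cost) can then be launched by setting the number of reducers to zero, e.g. with `-D mapred.reduce.tasks=0` in the streaming job invocation (the exact option name varies between Hadoop versions).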
Aside from the power of MapReduce, we note that a MapReduce job is expensive, while most state-of-the-art methods employ at least two MapReduce jobs (Deng et al., 2014; Drew & Hahsler, 2014; Elsayed et al., 2008; Lin & Cohen, 2010; Li et al., 2011; Metwally & Faloutsos, 2012; Rong et al., 2013; Vernica et al., 2010). The more MapReduce jobs we use, the higher the cost. Consequently, it is worthwhile to solve a given problem with the minimum number of MapReduce jobs.
Meanwhile, we define the redundancy problem as the case where extra overheads arise from inessential objects, unnecessary computations, or both. It is one of the most common reasons for slow similarity search performance. From our observations, we classify the redundancy problem into two main categories as follows.
DEFINITION 1-7 (REDUNDANCY BY SIMILARITY SEARCH PROCESSES). The redundancy by similarity search processes occurs when a similarity search process examines irrelevant objects and/or performs needless similarity or distance computations among them.
DEFINITION 1-8 (REDUNDANCY BY MAPREDUCE PROCESSES). The redundancy by MapReduce processes emerges when a MapReduce process accesses unrelated key-value pairs and/or generates undesired duplicate data.
The redundancy problem strongly affects the effectiveness and efficiency of tasks in general and similarity search in particular. Resolving it yields fewer extra overheads, such as fewer computations (i.e., lower CPU costs), lower I/O costs, lower storage costs, and so on, which certainly improves the overall performance of similarity search. Under the MapReduce paradigm, we observe that redundant data mostly emerge at the very first stage of a Map task. Pruning them early, therefore, brings greater performance benefits.
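The idea of pruning redundant data at the very first stage of a Map task can be sketched as follows. This is a minimal illustration of the principle, not the thesis's actual pruning method; the function, key-value scheme, and sample data are our assumptions:

```python
def pruning_mapper(docs, query_terms):
    """Map stage with early pruning (a sketch): emit a (term, doc_id)
    key-value pair only when the term also occurs in the query, so
    redundant pairs are discarded as soon as the input is read rather
    than being shuffled to the Reduce stage."""
    for doc_id, text in docs:
        for term in set(text.split()):
            if term in query_terms:  # prune irrelevant key-value pairs early
                yield (term, doc_id)

docs = [("d1", "big data tools"), ("d2", "small data")]
query = {"data", "tools"}
print(sorted(pruning_mapper(docs, query)))
# [('data', 'd1'), ('data', 'd2'), ('tools', 'd1')]
```

Without the pruning test, the mapper would also emit ('big', 'd1') and ('small', 'd2'), pairs that no later stage needs but that would still incur shuffle, I/O, and storage costs.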
Generally, we improve the performance of similarity search from the perspective of schemes and algorithms. Our main contributions are briefly summed up as follows:
1. We design a hybrid MapReduce-based architecture in CHAPTER 2, which provides a firm skeleton for our approaches (Phan et al., 2016).
2. We propose the instant approaches, based on a given query, with the Cosine and Jaccard measures, respectively, in CHAPTER 3 (Phan et al., 2014a; Phan et al., 2014b; Phan et al., 2015c).
3. We propose the build-in approaches in CHAPTER 4 (Phan et al., 2015a; Phan et al., 2015b; Phan et al., 2016).
4. We propose the hybrid approaches in CHAPTER 5 (Phan et al., 2016).
5. We conduct empirical experiments and evaluations with real datasets for our proposed methods in CHAPTER 6.