A Thesis
entitled
ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce
by
Mahalakshmi Lakshminarayanan
Submitted to the Graduate Faculty as partial fulfillment of the requirements for the
Master of Science Degree in Engineering
Dr. Vijay Devabhaktuni, Committee Chair
Dr. William F. Acosta, Committee Member
Dr. Robert C. Green II, Committee Member
Dr. Mansoor Alam, Committee Member
Dr. Patricia R. Komuniecki, Dean, College of Graduate Studies
The University of Toledo
Copyright 2013, Mahalakshmi Lakshminarayanan

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.
An Abstract of
ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce
by
Mahalakshmi Lakshminarayanan
Submitted to the Graduate Faculty as partial fulfillment of the requirements for the
Master of Science Degree in Engineering
The University of Toledo
December 2013
Similarity Join is an important operation for data mining, with a diverse range of real-world applications. Three efficient MapReduce algorithms for performing Similarity Joins between multisets are proposed in this thesis. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence are vital for improving the efficiency of the algorithm. Multisets represent real-world data better by considering the frequency of their elements. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either do not incorporate any filtering technique or inefficiently incorporate prefix filtering with poor scalability.
This work extends the filtering techniques, namely the prefix, size, positional and suffix filters, to multisets, and also achieves the challenging task of efficiently incorporating them in the shared-nothing MapReduce model. Adeptly incorporating the filtering techniques in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency.
In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce Framework. The pairs that survive filtering are joined suavely in the third Similarity Join Stage, utilizing a Multiset File generated in the second stage. We also developed a rational and creative technique to enhance the scalability of the algorithm as a contingency need.
In the ESSJ algorithm, all the filtering techniques, namely prefix, size, positional as well as suffix filtering, are incorporated in the MapReduce Framework. It is designed with a seamless and scalable Similarity Join Stage, where the similarity joins are performed without dependency on a file.
In the EASE algorithm, all the filtering techniques, namely prefix, size, positional and suffix, are incorporated in the MapReduce Framework. However, it is tailored as a hybrid algorithm to exploit the strategies of both SSS and ESSJ for performing the joins. Some multiset pairs are joined utilizing the Multiset File, similar to SSS, and some multisets are joined without utilizing it, similar to ESSJ. The algorithm harvests the benefits of both strategies.
The SSS and ESSJ algorithms were developed using Hadoop and tested using real-world Twitter data. For both SSS and ESSJ, experimental results demonstrate phenomenal performance gains of over 70% in comparison to the competing state-of-the-art algorithm.

I dedicate this work to the Almighty!
Acknowledgments

I thank Dr. Rob for his timely, creative, elegant, prudent and thorough guidance and support. It was wonderful and comfortable working under him! Without the guidance of Dr. Acosta and Dr. Rob, this work would not have been possible.

I thank Dr. Alam for his wise, kind and gracious support throughout my Master's program! Special thanks to Dr. Vijay for his benevolent, erudite and gracious guidance and support!

I thank Dr. Acosta, Dr. Alam, and the EECS and ET Departments for the financial support. I thank the EECS and ET faculty members and staff members who have helped me.

I thank my parents, grandparents, brothers, relatives and friends for their support, with special thanks to my mom! Ultimately, I thank God for showering His grace on us!
Contents

1 Introduction
2 Background
  2.1 MapReduce Model
  2.2 Hadoop Features
  2.3 Multisets and Similarity Measures
  2.4 Literature Review
    2.4.1 Serial Algorithms
    2.4.2 Parallel Algorithms
3 Strategic and Suave Processing for Similarity Joins Using MapReduce
  3.1 Stage I - Map Phase
  3.2 Stage I - Reduce Phase
  3.3 Stage II - Map Phase
  3.4 Stage II - Reduce Phase
    3.4.1 Positional Filtering
    3.4.2 Positional Filtering in Stage II - Reduce Phase
  3.5 Stage III - Map Phase
  3.6 Stage III - Reduce Phase
  3.7 Preprocessing
  3.8 Comparison of SSS with SSJ-2R
  3.9 Experimental Results
  3.10 Enhancing the Scalability of the Algorithm
  3.11 Summary
4 Adept and Agile Processing for Efficient and Scalable Similarity Joins Using MapReduce
  4.1 Stage I - Map Phase
  4.2 Stage I - Reduce Phase
  4.3 Stage II - Map Phase
  4.4 Stage II - Reduce Phase
    4.4.1 Suffix Filtering
    4.4.2 Optimizing the Minimum Prefix Hamming Distance, Hpmin
    4.4.3 Suffix Filtering in Stage II - Reduce Phase
  4.5 Stage III - Map Phase
  4.6 Stage III - Reduce Phase
  4.7 Comparison of ESSJ with SSJ-2R
  4.8 Experimental Results
  4.9 Summary
5 Efficient, Adaptable and Scalable MapReduce Algorithm For Similarity Joins Using Hybrid Strategies
  5.1 Stage II - Reduce Phase
  5.2 Stage III - Map Phase
  5.3 Stage III - Reduce Phase
  5.4 Discussion
List of Tables

2.1 Multiset Similarity Measures and their formulae
3.1 The number of pairs for which similarity joins are performed in SSJ-2R and SSS algorithms
3.2 Running times of the Stages of SSS algorithm, for 16,000 records and a similarity threshold of 0.8
3.3 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.8
3.4 Running times of SSS and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.7
3.5 Running times of SSS and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.8
3.6 Running times of the Waves of Stage III of SSS-SE algorithm, for 16,000 records and a similarity threshold of 0.8
4.1 The number of pairs for which similarity joins are performed in SSJ-2R and ESSJ algorithms
4.2 Running times of the Stages of ESSJ algorithm, for 16,000 records and a similarity threshold of 0.7
4.3 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.7
4.4 Running times of the Stages of ESSJ algorithm, for 16,000 records and a similarity threshold of 0.8
4.5 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.8
4.6 Running times of ESSJ and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.7
4.7 Running times of ESSJ and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.8
List of Figures

2-1 MapReduce Model
3-1 Mapper and Reducer Instances of Stage I
3-2 Type I and Type II Mapper Instances of Stage II
3-3 Reducer Instance of Stage II
3-4 Partitioning a Multiset Mi for Positional Filtering
3-5 Type I and Type II Mapper Instances of Stage III
3-6 Reducer Instance of Stage III
3-7 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.9
3-8 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.8
3-9 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.7
3-10 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.8
4-1 Mapper and Reducer Instances of Stage I
4-2 Type I and Type II Mapper Instances of Stage II
4-3 Reducer Instance of Stage II
4-4 Partitioning a Multiset Mi for Suffix Filtering
4-5 Type I and Type II Mapper Instances of Stage III
4-6 Reducer Instance of Stage III
4-7 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.7
4-8 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.8
5-1 Mapper Instances of Stage III
5-2 Reducer Instance of Stage III
Chapter 1
Introduction
This era has seen massive growth of online applications and their users, which has resulted in an enormous increase in the volume of data that needs to be processed. Besides, there are numerous applications that require big data processing by their very nature, including processing of large corpora, environmental and medical data gathered over a period of time, data from Smart Grids, and so on. Simple, yet effective and essential operations are always the need of the moment for any application. Similarity Joins are vital operations of that nature, essential for a diverse range of applications, and in the current scenario big data is omnipresent. Some interesting applications of similarity joins include Duplicate Detection [1-5], Plagiarism Detection [6, 7], Data Cleaning [8, 9], Record Linkage [10-13], String Searching [14-19], Community Discovery [20, 21], Internet Traffic Anomaly Detection and Advertisement Targeting [22, 23], and Collaborative Filtering for Recommendation Systems [24]. As the size of the data in such applications is typically very large, distributed processing is generally a necessity. The MapReduce framework [25] and Hadoop [26] are very popular tools that are used in this study for accomplishing these purposes.
In this thesis, the focus is on the creation of ACE (Agile, Contingent and Efficient) MapReduce algorithms that effectively handle similarity joins between multisets.

Stated concisely, the issue addressed through this study is as follows: given a collection of multisets, S = {M1, ..., M|S|}, where Mi represents a multiset, and a similarity threshold, t, all pairs of multisets < Mi, Mj > whose similarity Sim(Mi, Mj) exceeds t must be discovered.

In addressing this issue, the entirety of the presented work focuses on a trilogy of challenges involved in efficiently performing similarity joins in the MapReduce paradigm, including:
1. In a naive implementation, all of the possible pairs of entities must be joined (a naive sketch is given after this list). In an efficient implementation, filtering techniques are first applied to reduce the set of pairs that must be joined. Real-world data can be better represented using multisets because the frequency of an entity is taken into account. Thus, filtering techniques must be developed for multisets, though existing work has designed filtering techniques only for sets;
2. These filtering techniques must be designed in a distributed way suitable to the MapReduce framework; and
3. Similarity Joins must be performed for the pairs that survive filtering. The challenge is to bring together the data corresponding to the surviving entity pairs in the MapReduce-style workflow.
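For reference, the naive strategy in the first item above can be sketched in a few lines of Java; the class, the similarity() helper and the use of the Ruzicka measure here are illustrative placeholders and are not part of the proposed algorithms.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Naive baseline: join every possible pair of multisets and keep the pairs whose
 *  similarity exceeds the threshold t. Quadratic in the collection size, with no
 *  filtering; the algorithms proposed in this thesis are designed to avoid this. */
public class NaiveAllPairsJoin {

    public static List<int[]> join(List<Map<String, Integer>> multisets, double t) {
        List<int[]> similarPairs = new ArrayList<>();
        for (int i = 0; i < multisets.size(); i++) {
            for (int j = i + 1; j < multisets.size(); j++) {
                if (similarity(multisets.get(i), multisets.get(j)) > t) {
                    similarPairs.add(new int[] { i, j });
                }
            }
        }
        return similarPairs;
    }

    /** Placeholder similarity measure: Ruzicka, |Mi ∩ Mj| / |Mi ∪ Mj| over frequencies. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        int inter = 0, union = 0;
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        for (String k : keys) {
            int fa = a.getOrDefault(k, 0), fb = b.getOrDefault(k, 0);
            inter += Math.min(fa, fb);
            union += Math.max(fa, fb);
        }
        return union == 0 ? 0.0 : (double) inter / union;
    }
}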
This thesis proposes three algorithms to address the above mentioned trilogy of concerns, named SSS (Strategic and Suave Processing for Similarity Joins Using MapReduce), ESSJ (Adept and Agile Processing for Efficient and Scalable Similarity Joins Using MapReduce) and EASE (Efficient, Adaptable and Scalable MapReduce Algorithm For Similarity Joins Using Hybrid Strategies).

The second and third problems listed above are particularly challenging, as they require designing the algorithms to suit the shared-nothing MapReduce model.
The prior MapReduce similarity join algorithms have either incorporated no filtering techniques or have attempted to incorporate filtering techniques in a way that results in inefficiency and poor scalability, due to the large quantity of data generated causing I/O and network bottlenecks. The algorithms in this study, however, achieve this task by adeptly applying the prefix, size, positional and suffix filtering techniques in a strategic sequence, which minimizes the candidate pairs generated, resulting in I/O and network efficiency. The dramatic reduction in the number of pairs that are joined leads to computational efficiency.
In the MapReduce-style workflow, to bring together the elements of the multisets corresponding to a pair for performing a similarity join, three agile strategies are developed in this thesis. The first strategy utilizes a file, which must be distributed to the various nodes in a cluster and loaded in the memory of the nodes, for performing the joins (used in SSS). In the second strategy, the elements of the multisets corresponding to a pair are brought together without utilizing a file (used in ESSJ). The third strategy (used in EASE) is a hybrid one, utilizing both the previous strategies, where the elements of the multisets corresponding to some pairs are brought together utilizing a file (like SSS) and for other pairs without using a file (like ESSJ). These strategies are contingent in nature, as they are designed to remain scalable as the dataset being processed grows, as well as when the size of the cluster increases.

The main contributions of this thesis are:
1. Three efficient MapReduce algorithms, namely SSS, ESSJ and EASE, for performing similarity joins between multisets are presented; they are applicable to sets, multisets and vectors as well;

2. For the first time, the filtering techniques developed for efficiently performing similarity joins on sets, namely prefix [4, 27], size [28], positional [4] and suffix [4] filtering, are extended for application to multisets;

3. In SSS, prefix, size and positional filtering are incorporated in the MapReduce Framework, and the surviving pairs are suavely joined utilizing a file, named the Multiset File, in the Similarity Join MapReduce Stage;

4. A rational and creative technique to enhance the scalability of the SSS algorithm is presented as a contingency need;

5. In ESSJ, prefix, size, positional as well as suffix filtering are incorporated in the MapReduce Framework;

6. ESSJ is equipped with a seamless and scalable similarity join stage, which manages to avoid the need for distributing a file to all the nodes of a cluster and loading it into memory;

7. In EASE, all the filtering techniques, viz. prefix, size, positional and suffix filtering, are incorporated in the MapReduce Framework, similar to ESSJ. EASE is designed as a hybrid algorithm to exploit the strategies of both SSS (utilizing the Multiset File) and ESSJ (without dependency on the Multiset File) for performing the joins. As a result, it yields the benefits of both strategies;

8. For all three algorithms, as a result of applying the filtering techniques, the number of pairs generated and joined is minimized, resulting in I/O, network and computational efficiency;

9. A scalable technique to pre-process the input data is presented; and

10. The SSS and ESSJ algorithms are compared with and analyzed in light of the competing state-of-the-art algorithm, by testing them on the Twitter dataset for discovering similar users.
In Chapter 2, the background information required for understanding and constructing the SSS and ESSJ algorithms is explored. In Chapter 3, the SSS algorithm is presented, and in Chapter 4, the ESSJ algorithm. In both chapters, the presented algorithm is compared against the state-of-the-art algorithm, followed by the experimental analysis to demonstrate the performance improvements and a summary. Additionally, in Chapter 3, the technique to enhance the scalability of the SSS algorithm and a general technique to preprocess the input data are also presented. In Chapter 5, EASE, which is designed as a hybrid algorithm to exploit the strategies of both SSS and ESSJ for performing the joins, is presented and discussed. Chapter 6 concludes this thesis.
Chapter 2
Background
This chapter introduces the preliminary background necessary for creating the SSS, ESSJ and EASE algorithms, including the MapReduce Model, Hadoop features, multisets, measures of similarity and the literature review.
2.1 MapReduce Model

The MapReduce Framework [25] was developed for large-scale parallel and distributed processing. MapReduce-based technologies are widely used in industry [29-37]. The academic research communities are also contributing a lot towards it [38-48].
The MapReduce framework manages the parallelization, fault tolerance, data transfers and load balancing over a cluster of machines, and the programmer has to concentrate only on the problem at hand that has to be solved in a distributed manner. The data is written to and read from a Distributed File System (DFS). A programmer-defined Map Function processes a key/value pair input record and sends out a list of intermediate key/value pairs. A programmer-defined Reduce Function receives a list of values corresponding to the same intermediate key, and processes it to send out a list of values.
The intermediate keys are partitioned to the right reducer by the partitioning function, and it is ensured that the partitioned records are sorted, forming the Shuffle Phase. The default partitioning function is hash(key) mod R, where R is the number of reducers or partitions. The programmer also has the flexibility of choosing a custom partitioning function. The MapReduce Model is represented in Fig. 2-1, and the Map and Reduce functions can be mathematically represented as shown below:

Map: (k1, v1) → list(k2, v2)
Reduce: (k2, list(v2)) → list(v3)

Figure 2-1: MapReduce Model
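The skeleton below is a minimal Hadoop (new API) sketch of these Map and Reduce signatures; the class names and the pass-through logic are illustrative only and are not taken from this thesis.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (k1, v1) -> list(k2, v2). Here the input value is one text record.
public class SampleMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit an intermediate key/value pair; the framework shuffles by key.
        context.write(new Text("someKey"), value);
    }
}

// Reduce: (k2, list(v2)) -> list(v3). All values sharing k2 arrive together.
class SampleReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            // The default partitioner routes k2 by hash(k2) mod R, as described above.
            context.write(key, v);
        }
    }
}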
2.2 Hadoop Features

Hadoop exploits data locality: whenever possible, a map task is scheduled on the node that stores its input data block, so that the task is executed on that node.

Secondary sorting can be performed in Hadoop. When composite keys are used, records in the Shuffle Phase are sorted based on the primary key; secondary sorting means that records with the same primary key are additionally sorted based on the secondary key. This is possible with the help of a custom partitioning function, a sorting comparator and a grouping comparator. The custom partitioner ensures that all the records with the same primary key reach the same reducer. The sort comparator ensures that all the records with the same primary key are sorted based on the secondary key. Thirdly, the grouping comparator ensures that all records with the same primary key reach the same reduce instance in a reducer. Thus, the records that reach a reduce instance are grouped based on the primary key, and among the records that have the same primary key, they are sorted based on the secondary key.
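A minimal sketch of how such secondary sorting can be wired up in Hadoop is shown below, assuming composite keys encoded as Text values of the form "primary\tsecondary"; the class names and this key encoding are illustrative assumptions, not the implementation used in the thesis.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Keys are Text values of the form "primary\tsecondary".
public class SecondarySortSupport {

    /** Partition on the primary key only, so all records with the same
     *  primary key reach the same reducer. */
    public static class PrimaryKeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String primary = key.toString().split("\t", 2)[0];
            return (primary.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    /** Sort comparator: order by primary key, then by secondary key. */
    public static class FullKeyComparator extends WritableComparator {
        public FullKeyComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String[] ka = a.toString().split("\t", 2);
            String[] kb = b.toString().split("\t", 2);
            int cmp = ka[0].compareTo(kb[0]);
            if (cmp != 0) return cmp;
            return (ka.length > 1 ? ka[1] : "").compareTo(kb.length > 1 ? kb[1] : "");
        }
    }

    /** Grouping comparator: group on the primary key only, so one reduce()
     *  call sees all records sharing that primary key, already sorted by
     *  the secondary key. */
    public static class PrimaryKeyGroupingComparator extends WritableComparator {
        public PrimaryKeyGroupingComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String pa = a.toString().split("\t", 2)[0];
            String pb = b.toString().split("\t", 2)[0];
            return pa.compareTo(pb);
        }
    }
}

These classes would be registered on a job with job.setPartitionerClass(...), job.setSortComparatorClass(...) and job.setGroupingComparatorClass(...).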
2.3 Multisets and Similarity Measures

Based on [23] and [49], the following representations for multiset, multiset union, multiset intersection and multiset size are considered in this study. Further, this study also defines the representation for the position of a multiset element.

Multiset: Consider a set, S, of multisets {M1, ..., Mi, ..., M|S|} on the data elements D = {d1, ..., d|D|}. A multiset identified by Mi is represented as Mi = <D, D → N> = {mi,1, ..., mi,|D|}, where mi,k represents the element in multiset Mi that has the data element dk. The element of Mi is mi,k = <dk, fi(dk)>, where dk is the kth data element and fi(dk) denotes the number of times dk occurs in Mi. Thus Mi = {mi,1, ..., mi,|D|} = {<d1, fi(d1)>, ..., <d|D|, fi(d|D|)>}.

Let Mi and Mj be two multisets over a given domain D.
Multiset Union: The Multiset Union of Mi and Mj is given by:

Mi ∪ Mj = {<dk, max(fi(dk), fj(dk))> : dk ∈ D}

Multiset Intersection: The Multiset Intersection of Mi and Mj is given by:

Mi ∩ Mj = {<dk, min(fi(dk), fj(dk))> : dk ∈ D}

Multiset Size: The Multiset Size, |Mi|, is the sum of the frequencies of the data elements of all the elements present in Mi and is given by:

|Mi| = Σ(k = 1..|D|) fi(dk)

Position: The position of mi,j in Mi, denoted by Pos(mi,j), is the sum of the frequencies of the data elements of all the elements in Mi that are prior to mi,j, as defined below:

Pos(mi,j) = Σ(mi,k prior to mi,j in Mi) fi(dk)
The similarity between two multisets can then be computed using multiset similarity measures, as shown in Table 2.1.

Table 2.1: Multiset Similarity Measures and their formulae

Similarity Measure | Formula
Ruzicka            | |Mi ∩ Mj| / |Mi ∪ Mj|
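The following illustrative Java sketch mirrors these definitions, representing a multiset as a map from data element to frequency; the class and method names are placeholders, and the Ruzicka measure is computed as |Mi ∩ Mj| / |Mi ∪ Mj|.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative multiset utilities; a multiset is a map from data element to frequency. */
public class MultisetOps {

    /** |Mi| : sum of the frequencies of all data elements in Mi. */
    public static int size(Map<String, Integer> m) {
        int s = 0;
        for (int f : m.values()) s += f;
        return s;
    }

    /** |Mi ∩ Mj| : sum over all data elements of min(fi(dk), fj(dk)). */
    public static int intersectionSize(Map<String, Integer> mi, Map<String, Integer> mj) {
        int s = 0;
        for (Map.Entry<String, Integer> e : mi.entrySet()) {
            Integer fj = mj.get(e.getKey());
            if (fj != null) s += Math.min(e.getValue(), fj);
        }
        return s;
    }

    /** |Mi ∪ Mj| : sum over all data elements of max(fi(dk), fj(dk)). */
    public static int unionSize(Map<String, Integer> mi, Map<String, Integer> mj) {
        int s = 0;
        Set<String> keys = new HashSet<>(mi.keySet());
        keys.addAll(mj.keySet());
        for (String k : keys) {
            s += Math.max(mi.getOrDefault(k, 0), mj.getOrDefault(k, 0));
        }
        return s;
    }

    /** Ruzicka similarity: |Mi ∩ Mj| / |Mi ∪ Mj|. */
    public static double ruzicka(Map<String, Integer> mi, Map<String, Integer> mj) {
        int u = unionSize(mi, mj);
        return u == 0 ? 0.0 : (double) intersectionSize(mi, mj) / u;
    }
}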
2.4 Literature Review

2.4.1 Serial Algorithms
In order to improve the efficiency, the algorithms in the literature have focused on reducing the number of candidate pairs for which Similarity Joins have to be performed.
Locality Sensitive Hashing (LSH) is a well-known, approximate technique, where the data elements are hashed so that similar elements land in the same bucket with a high probability [50]. An LSH technique based on min-wise independent permutations to compute the Jaccard Similarity for near-duplicate detection of web documents was proposed by [51].
An inverted index [52] maps an element to the entities that contain it. Instead of comparing all the entities in a collection, an inverted index helps to compare only the entities that contain at least one common element. An inverted-index-based algorithm that indexes every set element to generate candidate pairs for similarity computation, with optimizations based on using threshold information and sorting- and clustering-based techniques, was proposed by [53].
A signature-based technique called PartEnum, where two sets share a signature in common if their Hamming Distance is less than a particular value, k, was developed by [28]; similarity is evaluated only for sets sharing at least one signature. A different technique of generating candidate pairs from candidates that have a common prefix element in their sets, named prefix filtering, was proposed by [27].
An All Pairs algorithm to compute the similarity of all pairs of similar candidates, based on exploiting the threshold during indexing to reduce the candidate pairs generated, was proposed by [24]. A prefix-filtering-based algorithm, where the top-k similar elements are returned if the similarity threshold is unknown, was developed by [54]. An adaptive model to select an approximate prefix length for performing the joins, which outperforms traditional prefix filtering, was proposed by [55].
A state-of-the-art technique for detecting duplicate documents, named PPJoin+, where prefix filtering is combined with further filtering techniques for sets, namely positional filtering (based on the position of the set elements) and suffix filtering (based on the suffix Hamming distance), was proposed by [4].
2.4.2 Parallel Algorithms
In this section, the prior parallel MapReduce algorithms for performing similarity joins are discussed.
A MapReduce algorithm for finding traffic anomalies and load-balancing proxies in the Internet using similarity joins was proposed by [23]. In the first stage of the algorithm, the size of every multiset is calculated. In the second stage, all the multiset elements are indexed in the Map Phase, and the MIDs that share a common multiset element are paired in the Reduce Phase. Since every multiset element is indexed without incorporating any filtering technique, a humongous number of pairs will be generated, which must be written to the DFS by the Reduce Phase, causing I/O inefficiency. In the next stage, this humongous number of pairs is read by the Mappers and passed through the network, causing congestion and inefficiency; it is not a scalable approach.
A MapReduce algorithm for computing the similarity between normalized document vectors, where the similarity is expressed as the product of the document vector weights, i.e. the intersection, was developed by [8]. Similar to [23], every element of the document vector is indexed and the possible candidate pairs are generated, as a result of which I/O inefficiency and network congestion result. The difference between this algorithm and the algorithm in [23] is that [23] takes into account the size of the multisets, whereas here the document vectors are normalized and their size is not taken into account.
The VCL Algorithm [56] is a MapReduce algorithm designed for computing similarity joins between sets, which tries to implement PPJoin+ [4] in a distributed manner. This algorithm was developed for sets and is therefore not capable of handling multisets. However, it is worthwhile to discuss its characteristics. In the Map Phase, each set is replicated as many times as the number of elements in its prefix. In the Reduce Phase, each reduce instance pertains to a unique set element, and its value list consists of the complete sets that share that element in their prefix and are hence potential candidates of being similar; their similarity is evaluated there. Replicating each set as many times as the number of elements in its prefix in the Mappers makes the algorithm unfit for large sets, as the large amount of replication causes network congestion and I/O constraints, hindering its scalability. This drawback has been mentioned in [57] and has been demonstrated in [23].
A MapReduce similarity join algorithm for computing the vector cosine similarity between documents, called SSJ-2R, was proposed by [9]. It incorporates the prefix filtering technique to reduce the potential candidate document vector pairs. However, this algorithm fails to incorporate the other filtering techniques.
The parallel algorithms for multiset similarity joins discussed above are directly pertinent to this work. However, it is worth mentioning other relevant work done in this area. A theoretical comparison of MapReduce algorithms for various similarity measures, such as the Edit Distance, Jaccard and Hamming Distance measures, was presented by [58]. Multiway join optimization in the MapReduce Framework has been the focus of [59]. Performing joins between log data and reference data in the MapReduce Framework was proposed by [60].
Chapter 3

Strategic and Suave Processing for Similarity Joins Using MapReduce

In the SSS algorithm, prefix filtering is applied in Stage I to generate the possible candidate pairs for which similarity joins are to be performed. This is followed by size filtering to further reduce the pairs. In Stage II, positional filtering is applied to the pairs. In Stage III, similarity joins are performed for the surviving pairs.

The Map and Reduce Phases of the three Stages are detailed below (Sections 3.1 to 3.6), along with the preprocessing stage (Section 3.7). In Section 3.8, SSS is compared with the competing state-of-the-art algorithm, SSJ-2R, and in Section 3.9, through experimental analysis, the unprecedented performance gain of SSS in comparison to SSJ-2R is reported. In Section 3.10, the technique to enhance the scalability of the algorithm is presented as a contingency need. Section 3.11 summarizes this chapter.
3.1 Stage I - Map Phase

The preprocessed input to the Map Phase consists of records, each containing the Multiset ID, Mi, followed by the elements of Mi. The elements of Mi are arranged based on the increasing order of the global frequency of their data elements across the entire multiset collection, with the least frequent being first. Each Mapper calculates the prefix size of the multiset in the input record.
From the prefix filtering technique, Lemma 1 of [27], rephrased as Lemma 1 of [4], it is known that if two sets Si and Sj have an overlap of at least α, there must be a non-zero intersection in their prefixes: the Si-prefix (of size |Si| − α + 1) and the Sj-prefix (of size |Sj| − α + 1) must have a non-zero intersection.
From [4], for a similarity threshold, t,

α = ⌈(t / (1 + t)) ∗ (|Si| + |Sj|)⌉    (3.1)
From [4], knowing only one set Si, without knowledge of the other sets with which it will intersect, the prefix size can be found by the formula |Si| − (t ∗ |Si|) + 1, where t is the similarity threshold. From [27], prefix filtering is also applicable to multisets. Therefore, the prefix size of a multiset Mi, for a similarity threshold t, is given by |Mi| − (t ∗ |Mi|) + 1.
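For example, for |Mi| = 10 and t = 0.8, the prefix covers 10 − 8 + 1 = 3 units. A minimal Java sketch of this prefix computation is given below; it assumes the elements are already sorted by increasing global frequency, that the prefix is taken as the leading elements whose cumulative frequency first reaches the prefix size, and that a ceiling is applied so the prefix size is an integer. These details are illustrative assumptions, not taken verbatim from the thesis.

import java.util.ArrayList;
import java.util.List;

public class PrefixFilter {

    /** One multiset element: a data element and its frequency in the multiset. */
    public static class Element {
        public final String dataElement;
        public final int frequency;
        public int position;          // Pos(mi,k), filled in below
        public Element(String dataElement, int frequency) {
            this.dataElement = dataElement;
            this.frequency = frequency;
        }
    }

    /** Prefix size for similarity threshold t: |Mi| - ceil(t * |Mi|) + 1. */
    public static int prefixSize(int multisetSize, double t) {
        return multisetSize - (int) Math.ceil(t * multisetSize) + 1;
    }

    /**
     * Returns the prefix elements of a multiset whose elements are already sorted
     * by increasing global frequency, with Pos(mi,k) filled in. Assumption: the
     * prefix is the leading elements whose cumulative frequency reaches the prefix size.
     */
    public static List<Element> prefix(List<Element> sortedElements, double t) {
        int multisetSize = 0;
        for (Element e : sortedElements) multisetSize += e.frequency;
        int p = prefixSize(multisetSize, t);

        List<Element> result = new ArrayList<>();
        int fsum = 0;                 // running sum of frequencies seen so far
        for (Element e : sortedElements) {
            if (fsum >= p) break;     // prefix covered
            e.position = fsum;        // Pos(mi,k): frequencies of the prior elements
            result.add(e);
            fsum += e.frequency;
        }
        return result;
    }
}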
For every element mi,k present in the prefix of the multiset Mi, with data element dk, the following information, viz. the MID Mi, the size of Mi, |Mi|, dk's frequency in Mi, fi(dk), and the position of mi,k in Mi, Pos(mi,k), is sent as the Mapper value and is denoted by I(dk(Mi)), the key being dk.
Stage I - Map Phase can be expressed as shown below:

< Mi, Elements of Mi >  −−(∀ mi,k ∈ Mi's Prefix)−−→  (< dk, I(dk(Mi)) >)*    (3.2)

I(dk(Mi)) = < Mi, |Mi|, fi(dk), Pos(mi,k) >    (3.3)

The pseudo-code of Stage I - Map Phase is shown in Algorithm 1.
Algorithm 1: Stage I - Map Phase
1 Input: For each Mapper instance of Stage I, the input record has the value valin = < Mi, Elements of Mi >
2 Output: The output records are of the format < keyout, valout >, where keyout is the prefix data element, dk, and valout is I(dk(Mi))
3 Compute the size of Mi, viz. |Mi|, by iterating through its elements and summing up the frequencies of its data elements;
4 Compute the prefix size of the multiset Mi using the formula |Mi| − (t ∗ |Mi|) + 1;
5 /* t is the similarity threshold */
6 fsum ← 0; /* Initialize fsum, the running sum of the frequencies of the data elements present in the multiset */
7 for every mi,k ∈ Mi's Prefix do
8     Pos(mi,k) ← fsum;
9     /* The position of mi,k, viz. Pos(mi,k), is the sum of the frequencies of the data elements prior to it in Mi */
10    Update fsum as fsum = fsum + fi(dk);
11    keyout = dk;
12    valout = I(dk(Mi)) = < Mi, |Mi|, fi(dk), Pos(mi,k) >;
13    output(keyout, valout);
14 end
3.2 Stage I - Reduce Phase
At each reducer, the records that share the same prefix data element, dk, as the key are grouped together. From the prefix filtering principle, it is known that if two multisets have a common prefix data element, they are potential candidates of being similar. Therefore, all the possible MID pairs that share the same dk are generated. If, say, k records share a prefix data element, then k ∗ (k − 1)/2 MID pairs are generated.

To reduce this number further, the size filtering technique [28] is applied, which gives effective pruning results. This technique is applied with the help of the size information sent with every record; it was also employed in the serial algorithms [24] and [4]. For every MID pair < Mi, Mj > and threshold t, if the size filtering condition, viz. |Mj| ≥ t ∗ |Mi|, is satisfied, the pair passes the filter; otherwise it is pruned. The multiset with the smaller size is taken as Mj, and the bigger as Mi. For every MID pair < Mi, Mj > that survives size filtering, the frequency of dk and the size and position information of both Mi and Mj (denoted by I(dk(Mi, Mj)) in (3.5)) are appended and sent as the reducer output.
This process is defined mathematically as:

< dk, (I(dk(Mi)))* >  −−(∀ Mi, Mj that survive Size Filtering)−−→  (< Mi, Mj, I(dk(Mi, Mj)) >)*    (3.4)

where I(dk(Mi)) is given by (3.3), and

I(dk(Mi, Mj)) = < |Mi|, |Mj|, fi(dk), fj(dk), Pos(mi,k), Pos(mj,k) >    (3.5)

The Stage I Map and Reduce instances are also detailed in Fig. 3-1. The pseudo-code of Stage I - Reduce Phase is shown in Algorithm 2.
Algorithm 2: Stage I - Reduce Phase
1 Input: For each Reduce instance, the input has the key keyin = dk and the corresponding value list (value)* = (I(dk(Mi)))*
2 Output: The output consists of records having the value valout = < Mi, Mj, I(dk(Mi, Mj)) >, where I(dk(Mi, Mj)) is given by < |Mi|, |Mj|, fi(dk), fj(dk), Pos(mi,k), Pos(mj,k) >
3 for every I(dk(Mi)) ∈ (I(dk(Mi)))* do
4     for every I(dk(Mj)) ∈ (I(dk(Mi)))* do
5         if |Mj| ≥ t ∗ |Mi| then
6             /* Size Filtering Condition */
7             valout = < Mi, Mj, I(dk(Mi, Mj)) >;
8             output(valout);
9         end
10    end
11 end
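A compact, illustrative Java sketch of this reduce-side pair generation with size filtering is shown below; the Entry class and the method names are placeholders rather than the thesis implementation.

import java.util.ArrayList;
import java.util.List;

/** Illustrative pair generation with size filtering, as performed for one prefix
 *  data element dk in the Stage I reducer. Each entry of the value list carries
 *  an MID and the size of that multiset. */
public class SizeFilteredPairs {

    public static class Entry {
        public final String mid;
        public final int size;     // |M|
        public Entry(String mid, int size) { this.mid = mid; this.size = size; }
    }

    /** Emits every pair <Mi, Mj> sharing dk whose sizes satisfy |Mj| >= t * |Mi|,
     *  with the larger multiset taken as Mi and the smaller as Mj. */
    public static List<String[]> pairs(List<Entry> valueList, double t) {
        List<String[]> out = new ArrayList<>();
        for (int a = 0; a < valueList.size(); a++) {
            for (int b = a + 1; b < valueList.size(); b++) {
                Entry bigger  = valueList.get(a).size >= valueList.get(b).size
                              ? valueList.get(a) : valueList.get(b);
                Entry smaller = bigger == valueList.get(a) ? valueList.get(b) : valueList.get(a);
                if (smaller.size >= t * bigger.size) {          // size filtering condition
                    out.add(new String[] { bigger.mid, smaller.mid });
                }
            }
        }
        return out;
    }
}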
Figure 3-1: Mapper and Reducer Instances of Stage I
3.3 Stage II - Map Phase

Stage II - Map Phase consists of two types of Mappers:

1. Type I Mappers. The records from the output of the Stage I - Reduce Phase are read. These records pertain to MID pairs and are denoted as MID Pair records. The output key is the MID pair < Mi, Mj >, and the value is Mj together with I(dk(Mi, Mj)), which comprises the frequency of dk and the size and position information of both Mi and Mj, in order to facilitate positional filtering in the Reduce Phase. The process can be represented by (3.6):

< Mi, Mj, I(dk(Mi, Mj)) >  →  << Mi, Mj >, < Mj, I(dk(Mi, Mj)) >>    (3.6)

where I(dk(Mi, Mj)) is given by (3.5). The pseudo-code of Stage II - Map Phase for Type I Mappers is shown in Algorithm 3.
2. Type II Mappers. The preprocessed input of the Stage I - Map Phase, where each record consists of the MID, Mi, and its elements, is read here and sent as output with < Mi, m > as the key and the elements of Mi as the value. Here, the 'm' in the key denotes that it is a record containing the multiset elements. These records can be called the multiset records. This can be mathematically represented as:

< Mi, Elements of Mi >  →  << Mi, m >, < Elements of Mi >>    (3.7)
Figure 3-2: Type I and Type II Mapper instances of Stage II
Fig. 3-2 represents the Type I and Type II Mapper instances.
Algorithm 3: Stage II - Map Phase (Type I)
1 Input: The input record to each Mapper instance has the value,
valin =< Mi, Mj, I(dk(Mi, Mj)) >
2 Output: The output record has the key, < Mi, Mj >, and value,
< Mj, I(dk(Mi, Mj)) >
3 keyout =< Mi, Mj >;
4 valout =< Mj, I(dk(Mi, Mj)) >;
5 output(keyout, valout);
Algorithm 4: Stage II - Map Phase (Type II)
1 Input: The input record to each Mapper instance has the value,
valin =< Mi, Elements of Mi >
2 Output: The output record has the key, < Mi, m >, and value,
< Elements of Mi >
3 keyout =< Mi, m >;
4 valout = Elements of Mi;
5 output(keyout, valout);
Partitioning, Grouping and Sorting are customized rather than using the defaults, so that in each reduce instance records will be grouped based on the MID Mi, the primary key, and sorted based on the secondary key.

Custom Partitioning: A multiset record has the composite key < Mi, m >, where Mi is the primary key and m is the secondary key. An MID Pair record has the composite key < Mi, Mj >, where the primary key is Mi and Mj is the secondary key. Records are partitioned based on the primary key. Both types of records for which the primary key, viz. Mi, is the same are partitioned to the same reducer.

Custom Grouping: Custom Grouping ensures that records that have the same MID, Mi, as the primary key reach the same reduce instance. Each reduce instance thus pertains to a unique MID, Mi.

Custom Sorting: Sorting is based on the secondary key, so that the record containing the multiset elements arrives first in an instance and is followed by the MID Pair records.
The records which have the same Mi as primary key are grouped in the same instance. These include the multiset record corresponding to Mi and the MID Pair records with the same Mi as their primary key. In every reduce instance, the multiset record with key < Mi, m > arrives first. Following that, the MID Pair records arrive, in sorted order based on their secondary key, Mj. The MID Pair records that pertain to the same < Mi, Mj > pair are grouped together and positional filtering is applied. Every unique pair < Mi, Mj > that survives positional filtering is sent as output. If there is at least one pair that survives positional filtering, the MID Mi and its elements are written to a file named the Multiset File. The Reduce Phase is diagrammatically represented in Fig. 3-3 and can also be expressed by (3.8) below.
<< Mi, x >, (value)* >  −−(Positional Filtering)−−→  (< Mi, Mj >)*;  < Mi, Elements of Mi > → Multiset File, if at least one pair survives    (3.8)
Algorithm 5: Stage II - Reduce Phase
1 Input: Each reduce instance has as input the key keyin = < Mi, x > and a (value)* containing < Elements of Mi > and (< Mj, I(dk(Mi, Mj)) >)*
2 Output: It sends as output the MID pairs, with the value valout = < Mi, Mj >. It also writes Mi and its elements to the Multiset File, if any MID pair survives positional filtering
3 MapMID ← Group (I(dk(Mi, Mj)))* into groups in MapMID, where each Map entry consists of a key Mj and a value consisting of the (I(dk(Mi, Mj)))* pertaining to that particular Mj;
4 PassPositionalFilter ← false; PassFilterCheck ← false;
5 for each Mj ∈ MapMID do
6     Get the (I(dk(Mi, Mj)))* corresponding to Mj;
7     PassPositionalFilter ← Apply Positional Filtering for the pair < Mi, Mj > using (I(dk(Mi, Mj)))*;
8     if PassPositionalFilter == true then
9         output(valout = < Mi, Mj >);
10        PassFilterCheck ← true;
11    end
12 end
13 if PassFilterCheck == true then
14    Write < Mi, Elements of Mi > to the Multiset File;
15 end
Figure 3-3: Reducer Instance of Stage II
The Positional Filtering technique and how it is performed here in the Stage II - Reduce Phase are explained below.

3.4.1 Positional Filtering
Positional filtering is the technique of filtering pairs of sets based on the positional information of the overlapping token between the sets, and is defined in Lemma 2 of [4]. Lemma 1 describes how to extend this technique to multisets.

Lemma 1 (Positional Filtering for Multisets): Consider an ordering O of the multiset element universe U and a set of multisets, each with multiset elements sorted in the order of O. Let element w = mi,k partition the multiset Mi into a left partition Mil(w) = {mi,1, ..., mi,k} and a right partition Mir(w) = {mi,k+1, ..., mi,|D|}. If Overlap(Mi, Mj) ≥ α, then for every element w ∈ Mi ∩ Mj, Overlap(Mil(w), Mjl(w)) + min(|Mir(w)|, |Mjr(w)|) ≥ α.
The Positional Filtering Condition is thus:

Overlap(Mil(w), Mjl(w)) + min(|Mir(w)|, |Mjr(w)|) ≥ α    (3.9)

Figure 3-4: Partitioning a Multiset Mi for Positional Filtering
It can be seen in Fig. 3-4 how a multiset Mi is partitioned for positional filtering at w = mi,k, the multiset element pertaining to a data element dk, into the left partition Mil(w) and the right partition Mir(w).
Let us discuss the reasoning behind Lemma 1. The Overlap(Mil(w), Mjl(w)) is the overlap between the left partitions of Mi and Mj. The maximum possible overlap between the right partitions of Mi and Mj is min(|Mir(w)|, |Mjr(w)|), as there can be no overlap greater than that in the right partitions. There can be no overlap between the left partition of Mi and the right partition of Mj, and vice versa, as the multiset elements are arranged in a global ordering. Thus the sum of the actual overlap in the left partitions and the maximum possible overlap between the right partitions represents the maximum overlap possible between Mi and Mj, and is represented by the left hand side of (3.9). This must be at least α (given by (3.1)), the minimum required overlap that must exist between Mi and Mj for a similarity threshold of t, which is represented by the right hand side of (3.9).
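A small illustrative Java check implementing condition (3.9) for one candidate pair, using the quantities carried in I(dk(Mi, Mj)), could look as follows; the class and parameter names are placeholders, and alpha is the minimum required overlap from (3.1).

/** Illustrative positional filtering check for a candidate pair <Mi, Mj>,
 *  applied at the last overlapping prefix element w (condition (3.9)). */
public class PositionalFilterCheck {

    public static boolean survives(int prefixOverlap,       // Overlap(Mil(w), Mjl(w))
                                   int sizeI, int sizeJ,     // |Mi|, |Mj|
                                   int posI, int freqI,      // Pos(mi,l), fi(dl)
                                   int posJ, int freqJ,      // Pos(mj,l), fj(dl)
                                   double alpha) {           // minimum required overlap
        int rightI = sizeI - (posI + freqI);                 // |Mir(w)|
        int rightJ = sizeJ - (posJ + freqJ);                 // |Mjr(w)|
        int upperBound = prefixOverlap + Math.min(rightI, rightJ);
        return upperBound >= alpha;                          // condition (3.9)
    }
}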
3.4.2 Positional Filtering in Stage II-Reduce Phase
Here in the Stage II - Reduce Phase, positional filtering is performed at the last overlapping prefix element, w = mi,l, where dl is the last overlapping prefix data element. So, the Overlap(Mil(w), Mjl(w)) is the total overlap between the prefixes of the two multisets Mi and Mj, and can be denoted by αp. MID Pair records that reach a reduce instance are grouped based on the pair < Mi, Mj > they pertain to. Records pertaining to the same pair are grouped together, and the records of a group are sorted based on the position of the element pertaining to the overlapping data element in Mi, which is Pos(mi,k) (Line 3 of Algorithm 6). This is followed by accumulating the overlap by iterating over the sorted records, thereby calculating the total prefix overlap (Lines 7-9). At the last record, which pertains to the last overlapping prefix element, w = mi,l, positional filtering is applied using (3.9) (Lines 10-16). It can be seen in Line 12 that the right partition size of a multiset Mi is calculated by the formula |Mir(w)| = |Mi| − (Pos(mi,l) + fi(dl)); Pos(mi,l) + fi(dl) is the sum of the frequencies of the multiset elements present up till w, and represents the left partition size, |Mil(w)|.
Algorithm 6: Positional Filtering
7 for each I(dk(Mi, Mj)) in (I(dk(Mi, Mj)))* do
8     /* the information needed for the formulae below is taken from I(dk(Mi, Mj)) */ ;
9     αp = αp + min(fi(dk), fj(dk)) ;
10    if Last I(dk(Mi, Mj)) then
11        /* Positions of the last overlapping prefix element, w = mi,l, containing data element dl, in Mi and Mj are Pos(mi,l) and Pos(mj,l) respectively */ ;
12        ubound = min(|Mir(w)|, |Mjr(w)|) = min((|Mi| − (Pos(mi,l) + fi(dl))), (|Mj| − (Pos(mj,l) + fj(dl))));