A Thesis
entitled
ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce
by
Mahalakshmi Lakshminarayanan
Submitted to the Graduate Faculty as partial fulfillment of the requirements for the
Master of Science Degree in Engineering
Dr. Vijay Devabhaktuni, Committee Chair
Dr. William F. Acosta, Committee Member
Dr. Robert C. Green II, Committee Member
Dr. Mansoor Alam, Committee Member
Dr. Patricia R. Komuniecki, Dean, College of Graduate Studies
The University of Toledo
Copyright 2013, Mahalakshmi Lakshminarayanan

This document is copyrighted material. Under copyright law, no parts of this document may be reproduced without the expressed permission of the author.
An Abstract of
ACE: Agile, Contingent and Efficient Similarity Joins Using MapReduce
by
Mahalakshmi Lakshminarayanan
Submitted to the Graduate Faculty as partial fulfillment of the requirements for the
Master of Science Degree in Engineering
The University of Toledo
December 2013
Similarity Join is an important operation for data mining, with a diverse range of real-world applications. Three efficient MapReduce algorithms for performing Similarity Joins between multisets are proposed in this thesis. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence are vital for improving the efficiency of the algorithm. Multisets represent real-world data better by considering the frequency of their elements. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either do not incorporate any filtering technique or inefficiently incorporate prefix filtering with poor scalability.
This work extends the filtering techniques, namely the prefix, size, positional and suffix filters, to multisets, and also achieves the challenging task of efficiently incorporating them in the shared-nothing MapReduce model. Adeptly incorporating the filtering techniques in a strategic sequence minimizes the pairs generated and joined, resulting in I/O, network and computational efficiency.
In the SSS algorithm, prefix, size and positional filtering are incorporated in the MapReduce Framework. The pairs that survive filtering are joined suavely in the third Similarity Join Stage, utilizing a Multiset File generated in the second stage. We also developed a rational and creative technique to enhance the scalability of the algorithm as a contingency need.
In the ESSJ algorithm, all the filtering techniques, namely prefix, size, positional as well as suffix filtering, are incorporated in the MapReduce Framework. It is designed with a seamless and scalable Similarity Join Stage, where the similarity joins are performed without dependency on a file.
In the EASE algorithm, all the filtering techniques, namely prefix, size, positional and suffix, are incorporated in the MapReduce Framework. However, it is tailored as a hybrid algorithm to exploit the strategies of both SSS and ESSJ for performing the joins. Some multiset pairs are joined utilizing the Multiset File, similar to SSS, and some multisets are joined without utilizing it, similar to ESSJ. The algorithm harvests the benefits of both strategies.
The SSS and ESSJ algorithms were developed using Hadoop and tested using real-world Twitter data. For both SSS and ESSJ, experimental results demonstrate phenomenal performance gains of over 70% in comparison to the competing state-of-the-art algorithm.

I dedicate this work to the Almighty!
Acknowledgments

I thank Dr. Rob for his timely, creative, elegant, prudent and thorough guidance and support. It was wonderful and comfortable working under him! Without the guidance of Dr. Acosta and Dr. Rob, this work would not have been possible.

I thank Dr. Alam for his wise, kind and gracious support throughout my Master's program! Special thanks to Dr. Vijay for his benevolent, erudite and gracious guidance and support!

I thank Dr. Acosta, Dr. Alam, and the EECS and ET Departments for the financial support. I thank the EECS and ET faculty members and staff members who have helped me.

I thank my parents, grandparents, brothers, relatives and friends for their support, with special thanks to my mom! Ultimately, I thank God for showering His grace on us!
Contents

1 Introduction
2 Background
  2.1 MapReduce Model
  2.2 Hadoop Features
  2.3 Multisets and Similarity Measures
  2.4 Literature Review
    2.4.1 Serial Algorithms
    2.4.2 Parallel Algorithms
3 Strategic and Suave Processing for Similarity Joins Using MapReduce
  3.1 Stage I - Map Phase
  3.2 Stage I - Reduce Phase
  3.3 Stage II - Map Phase
  3.4 Stage II - Reduce Phase
    3.4.1 Positional Filtering
    3.4.2 Positional Filtering in Stage II - Reduce Phase
  3.5 Stage III - Map Phase
  3.6 Stage III - Reduce Phase
  3.7 Preprocessing
  3.8 Comparison of SSS with SSJ-2R
  3.9 Experimental Results
  3.10 Enhancing the Scalability of the Algorithm
  3.11 Summary
4 Adept and Agile Processing for Efficient and Scalable Similarity Joins Using MapReduce
  4.1 Stage I - Map Phase
  4.2 Stage I - Reduce Phase
  4.3 Stage II - Map Phase
  4.4 Stage II - Reduce Phase
    4.4.1 Suffix Filtering
    4.4.2 Optimizing the Minimum Prefix Hamming Distance, Hpmin
    4.4.3 Suffix Filtering in Stage II - Reduce Phase
  4.5 Stage III - Map Phase
  4.6 Stage III - Reduce Phase
  4.7 Comparison of ESSJ with SSJ-2R
  4.8 Experimental Results
  4.9 Summary
5 Efficient, Adaptable and Scalable MapReduce Algorithm For Similarity Joins Using Hybrid Strategies
  5.1 Stage II - Reduce Phase
  5.2 Stage III - Map Phase
  5.3 Stage III - Reduce Phase
  5.4 Discussion
List of Tables

2.1 Multiset Similarity Measures and their formulae
3.1 The number of pairs for which similarity joins are performed in SSJ-2R and SSS algorithms
3.2 Running times of the Stages of SSS algorithm, for 16,000 records and a similarity threshold of 0.8
3.3 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.8
3.4 Running times of SSS and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.7
3.5 Running times of SSS and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.8
3.6 Running times of the Waves of Stage III of SSS-SE algorithm, for 16,000 records and a similarity threshold of 0.8
4.1 The number of pairs for which similarity joins are performed in SSJ-2R and ESSJ algorithms
4.2 Running times of the Stages of ESSJ algorithm, for 16,000 records and a similarity threshold of 0.7
4.3 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.7
4.4 Running times of the Stages of ESSJ algorithm, for 16,000 records and a similarity threshold of 0.8
4.5 Running times of the Stages of SSJ-2R algorithm, for 16,000 records and a similarity threshold of 0.8
4.6 Running times of ESSJ and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.7
4.7 Running times of ESSJ and SSJ-2R algorithms, for varying number of input records and corresponding Performance Improvement for a similarity threshold of 0.8
List of Figures

2-1 MapReduce Model
3-1 Mapper and Reducer Instances of Stage I
3-2 Type I and Type II Mapper Instances of Stage II
3-3 Reducer Instance of Stage II
3-4 Partitioning a Multiset Mi for Positional Filtering
3-5 Type I and Type II Mapper Instances of Stage III
3-6 Reducer Instance of Stage III
3-7 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.9
3-8 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.8
3-9 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.7
3-10 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.8
4-1 Mapper and Reducer Instances of Stage I
4-2 Type I and Type II Mapper Instances of Stage II
4-3 Reducer Instance of Stage II
4-4 Partitioning a Multiset Mi for Suffix Filtering
4-5 Type I and Type II Mapper Instances of Stage III
4-6 Reducer Instance of Stage III
4-7 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.7
4-8 Running times of the algorithms vs. Number of Records, for a similarity threshold of 0.8
5-1 Mapper Instances of Stage III
5-2 Reducer Instance of Stage III
Chapter 1
Introduction
This era has seen massive growth of online applications and their users, which has resulted in an enormous increase in the volume of data that needs to be processed. Besides, there are numerous applications that require big data processing by their very nature, including processing of large corpora, environmental and medical data gathered over a period of time, data from Smart Grids, and so on. Simple, yet effective and essential operations are always the need of the moment for any application. Similarity Joins are vital operations of that nature, essential for a diverse range of applications, and in the current scenario big data is omnipresent. Some interesting applications of similarity joins include Duplicate Detection [1-5], Plagiarism Detection [6, 7], Data Cleaning [8, 9], Record Linkage [10-13], String Searching [14-19], Community Discovery [20, 21], Internet Traffic Anomaly Detection and Advertisement Targeting [22, 23], and Collaborative Filtering for Recommendation Systems [24]. As the size of the data in such applications is typically very large, distributed processing is generally a necessity. The MapReduce framework [25] and Hadoop [26] are very popular tools that are used in this study for accomplishing these purposes.
In this thesis, the focus is on the creation of ACE (Agile, Contingent and Efficient) MapReduce algorithms that effectively handle similarity joins between multisets.

Stated concisely, the issue addressed through this study is as follows: given a collection of multisets, S = {M1, ..., M|S|}, where Mi represents a multiset, and a similarity threshold, t, all pairs of multisets < Mi, Mj > whose similarity Sim(Mi, Mj) exceeds t must be discovered.

In addressing this issue, the entirety of the presented work focuses on a trilogy of challenges involved in efficiently performing similarity joins in the MapReduce paradigm, including:
1. In a naive implementation, all of the possible pairs of entities must be joined (a naive sketch is given after this list). In an efficient implementation, filtering techniques are first applied to reduce the set of pairs that must be joined. Real-world data can be better represented using multisets because the frequency of an entity is taken into account. Thus, filtering techniques must be developed for multisets, though existing work has designed filtering techniques only for sets;
2. These filtering techniques must be designed in a distributed way suitable to the MapReduce framework; and
3. Similarity Joins must be performed for the pairs that survive filtering. The challenge is to bring together the data corresponding to the surviving entity pairs in the MapReduce-style workflow.
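For reference, the naive strategy in the first item above can be sketched in a few lines of Java; the class, the similarity() helper and the use of the Ruzicka measure here are illustrative placeholders and are not part of the proposed algorithms.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Naive baseline: join every possible pair of multisets and keep the pairs whose
 *  similarity exceeds the threshold t. Quadratic in the collection size, with no
 *  filtering; the algorithms proposed in this thesis are designed to avoid this. */
public class NaiveAllPairsJoin {

    public static List<int[]> join(List<Map<String, Integer>> multisets, double t) {
        List<int[]> similarPairs = new ArrayList<>();
        for (int i = 0; i < multisets.size(); i++) {
            for (int j = i + 1; j < multisets.size(); j++) {
                if (similarity(multisets.get(i), multisets.get(j)) > t) {
                    similarPairs.add(new int[] { i, j });
                }
            }
        }
        return similarPairs;
    }

    /** Placeholder similarity measure: Ruzicka, |Mi ∩ Mj| / |Mi ∪ Mj| over frequencies. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        int inter = 0, union = 0;
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        for (String k : keys) {
            int fa = a.getOrDefault(k, 0), fb = b.getOrDefault(k, 0);
            inter += Math.min(fa, fb);
            union += Math.max(fa, fb);
        }
        return union == 0 ? 0.0 : (double) inter / union;
    }
}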
This thesis proposes three algorithms to address the above mentioned trilogy of concerns, named SSS (Strategic and Suave Processing for Similarity Joins Using MapReduce), ESSJ (Adept and Agile Processing for Efficient and Scalable Similarity Joins Using MapReduce) and EASE (Efficient, Adaptable and Scalable MapReduce Algorithm For Similarity Joins Using Hybrid Strategies).

The second and third problems listed above are particularly challenging, as they require designing the algorithms to suit the shared-nothing MapReduce model.
The prior MapReduce similarity join algorithms have either incorporated no filtering techniques or have attempted to incorporate filtering techniques in a way that results in inefficiency and poor scalability, due to the large quantity of data generated causing I/O and network bottlenecks. The algorithms in this study, however, achieve this task by adeptly applying the prefix, size, positional and suffix filtering techniques in a strategic sequence, which minimizes the candidate pairs generated, resulting in I/O and network efficiency. The dramatic reduction in the number of pairs that are joined leads to computational efficiency.
In the MapReduce-style workflow, to bring together the elements of the multisets corresponding to a pair for performing a similarity join, three agile strategies are developed in this thesis. The first strategy utilizes a file, which must be distributed to the various nodes in a cluster and loaded in the memory of the nodes, for performing the joins (used in SSS). In the second strategy, the elements of the multisets corresponding to a pair are brought together without utilizing a file (used in ESSJ). The third strategy (used in EASE) is a hybrid one, utilizing both the previous strategies, where the elements of the multisets corresponding to some pairs are brought together utilizing a file (like SSS) and for other pairs without using a file (like ESSJ). These strategies are contingent in nature, as they are designed to remain scalable as the dataset being processed grows, as well as when the size of the cluster increases.

The main contributions of this thesis are:
1. Three efficient MapReduce algorithms, namely SSS, ESSJ and EASE, for performing similarity joins between multisets are presented; they are applicable to sets, multisets and vectors as well;

2. For the first time, the filtering techniques developed for efficiently performing similarity joins on sets, namely prefix [4, 27], size [28], positional [4] and suffix [4] filtering, are extended for application to multisets;

3. In SSS, prefix, size and positional filtering are incorporated in the MapReduce Framework, and the surviving pairs are suavely joined utilizing a file, named the Multiset File, in the Similarity Join MapReduce Stage;

4. A rational and creative technique to enhance the scalability of the SSS algorithm is presented as a contingency need;

5. In ESSJ, prefix, size, positional as well as suffix filtering are incorporated in the MapReduce Framework;

6. ESSJ is equipped with a seamless and scalable similarity join stage, which manages to avoid the need for distributing a file to all the nodes of a cluster and loading it into memory;

7. In EASE, all the filtering techniques, viz. prefix, size, positional and suffix filtering, are incorporated in the MapReduce Framework, similar to ESSJ. EASE is designed as a hybrid algorithm to exploit the strategies of both SSS (utilizing the Multiset File) and ESSJ (without dependency on the Multiset File) for performing the joins. As a result, it yields the benefits of both strategies;

8. For all three algorithms, as a result of applying the filtering techniques, the number of pairs generated and joined is minimized, resulting in I/O, network and computational efficiency;

9. A scalable technique to pre-process the input data is presented; and

10. The SSS and ESSJ algorithms are compared with and analyzed in light of the competing state-of-the-art algorithm, by testing them on the Twitter dataset for discovering similar users.
In Chapter 2, the background information required for understanding and constructing the SSS and ESSJ algorithms is explored. In Chapter 3, the SSS algorithm is presented, and in Chapter 4, the ESSJ algorithm. In both chapters, the presented algorithm is compared against the state-of-the-art algorithm, followed by the experimental analysis to demonstrate the performance improvements and a summary. Additionally, in Chapter 3, the technique to enhance the scalability of the SSS algorithm and a general technique to preprocess the input data are also presented. In Chapter 5, EASE, which is designed as a hybrid algorithm to exploit the strategies of both SSS and ESSJ for performing the joins, is presented and discussed. Chapter 6 concludes this thesis.
Chapter 2
Background
This chapter introduces the preliminary background necessary for creating the SSS, ESSJ and EASE algorithms, including the MapReduce Model, Hadoop features, multisets, measures of similarity and the literature review.
2.1 MapReduce Model

The MapReduce Framework [25] was developed for large-scale parallel and distributed processing. MapReduce-based technologies are widely used in industry [29-37]. The academic research communities are also contributing a lot towards it [38-48].
The MapReduce framework manages the parallelization, fault tolerance, data transfers and load balancing over a cluster of machines, and the programmer has to concentrate only on the problem at hand that has to be solved in a distributed manner. The data is written to and read from a Distributed File System (DFS). A programmer-defined Map Function processes a key/value pair input record and sends out a list of intermediate key/value pairs. A programmer-defined Reduce Function receives a list of values corresponding to the same intermediate key, and processes it to send out a list of values.
The intermediate keys are partitioned to the right reducer by the partitioning function, and it is ensured that the partitioned records are sorted, forming the Shuffle Phase. The default partitioning function is hash(key) mod R, where R is the number of reducers or partitions. The programmer also has the flexibility of choosing a custom partitioning function. The MapReduce Model is represented in Fig. 2-1, and the Map and Reduce functions can be mathematically represented as shown below:

Map: (k1, v1) → list(k2, v2)
Reduce: (k2, list(v2)) → list(v3)

Figure 2-1: MapReduce Model
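The skeleton below is a minimal Hadoop (new API) sketch of these Map and Reduce signatures; the class names and the pass-through logic are illustrative only and are not taken from this thesis.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: (k1, v1) -> list(k2, v2). Here the input value is one text record.
public class SampleMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit an intermediate key/value pair; the framework shuffles by key.
        context.write(new Text("someKey"), value);
    }
}

// Reduce: (k2, list(v2)) -> list(v3). All values sharing k2 arrive together.
class SampleReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text v : values) {
            // The default partitioner routes k2 by hash(k2) mod R, as described above.
            context.write(key, v);
        }
    }
}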
2.2 Hadoop Features

Hadoop exploits data locality: whenever possible, a map task is scheduled on the node that stores its input data block, so that the task is executed on that node.

Secondary sorting can be performed in Hadoop. When composite keys are used, records in the Shuffle Phase are sorted based on the primary key; secondary sorting means that records with the same primary key are additionally sorted based on the secondary key. This is possible with the help of a custom partitioning function, a sorting comparator and a grouping comparator. The custom partitioner ensures that all the records with the same primary key reach the same reducer. The sort comparator ensures that all the records with the same primary key are sorted based on the secondary key. Thirdly, the grouping comparator ensures that all records with the same primary key reach the same reduce instance in a reducer. Thus, the records that reach a reduce instance are grouped based on the primary key, and among the records that have the same primary key, they are sorted based on the secondary key.
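A minimal sketch of how such secondary sorting can be wired up in Hadoop is shown below, assuming composite keys encoded as Text values of the form "primary\tsecondary"; the class names and this key encoding are illustrative assumptions, not the implementation used in the thesis.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Keys are Text values of the form "primary\tsecondary".
public class SecondarySortSupport {

    /** Partition on the primary key only, so all records with the same
     *  primary key reach the same reducer. */
    public static class PrimaryKeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String primary = key.toString().split("\t", 2)[0];
            return (primary.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    /** Sort comparator: order by primary key, then by secondary key. */
    public static class FullKeyComparator extends WritableComparator {
        public FullKeyComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String[] ka = a.toString().split("\t", 2);
            String[] kb = b.toString().split("\t", 2);
            int cmp = ka[0].compareTo(kb[0]);
            if (cmp != 0) return cmp;
            return (ka.length > 1 ? ka[1] : "").compareTo(kb.length > 1 ? kb[1] : "");
        }
    }

    /** Grouping comparator: group on the primary key only, so one reduce()
     *  call sees all records sharing that primary key, already sorted by
     *  the secondary key. */
    public static class PrimaryKeyGroupingComparator extends WritableComparator {
        public PrimaryKeyGroupingComparator() { super(Text.class, true); }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String pa = a.toString().split("\t", 2)[0];
            String pb = b.toString().split("\t", 2)[0];
            return pa.compareTo(pb);
        }
    }
}

These classes would be registered on a job with job.setPartitionerClass(...), job.setSortComparatorClass(...) and job.setGroupingComparatorClass(...).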
2.3 Multisets and Similarity Measures

Based on [23] and [49], the following representations for multiset, multiset union, multiset intersection and multiset size are considered in this study. Further, this study also defines the representation for the position of a multiset element.

Multiset: Consider a set, S, of multisets {M1, ..., Mi, ..., M|S|} on the data elements D = {d1, ..., d|D|}. A multiset identified by Mi is represented as Mi = <D, D → N> = {mi,1, ..., mi,|D|}, where mi,k represents the element in multiset Mi that has the data element dk. The element of Mi is mi,k = <dk, fi(dk)>, where dk is the kth data element and fi(dk) denotes the number of times dk occurs in Mi. Thus Mi = {mi,1, ..., mi,|D|} = {<d1, fi(d1)>, ..., <d|D|, fi(d|D|)>}.

Let Mi and Mj be two multisets over a given domain D.
Multiset Union: The Multiset Union of Mi and Mj is given by:

Mi ∪ Mj = {<dk, max(fi(dk), fj(dk))> : dk ∈ D}

Multiset Intersection: The Multiset Intersection of Mi and Mj is given by:

Mi ∩ Mj = {<dk, min(fi(dk), fj(dk))> : dk ∈ D}

Multiset Size: The Multiset Size, |Mi|, is the sum of the frequencies of the data elements of all the elements present in Mi and is given by:

|Mi| = Σ(k = 1..|D|) fi(dk)

Position: The position of mi,j in Mi, denoted by Pos(mi,j), is the sum of the frequencies of the data elements of all the elements in Mi that are prior to mi,j, as defined below:

Pos(mi,j) = Σ(mi,k prior to mi,j in Mi) fi(dk)
The similarity between two multisets can then be computed using multiset similarity measures, as shown in Table 2.1.

Table 2.1: Multiset Similarity Measures and their formulae

Similarity Measure | Formula
Ruzicka            | |Mi ∩ Mj| / |Mi ∪ Mj|
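The following illustrative Java sketch mirrors these definitions, representing a multiset as a map from data element to frequency; the class and method names are placeholders, and the Ruzicka measure is computed as |Mi ∩ Mj| / |Mi ∪ Mj|.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative multiset utilities; a multiset is a map from data element to frequency. */
public class MultisetOps {

    /** |Mi| : sum of the frequencies of all data elements in Mi. */
    public static int size(Map<String, Integer> m) {
        int s = 0;
        for (int f : m.values()) s += f;
        return s;
    }

    /** |Mi ∩ Mj| : sum over all data elements of min(fi(dk), fj(dk)). */
    public static int intersectionSize(Map<String, Integer> mi, Map<String, Integer> mj) {
        int s = 0;
        for (Map.Entry<String, Integer> e : mi.entrySet()) {
            Integer fj = mj.get(e.getKey());
            if (fj != null) s += Math.min(e.getValue(), fj);
        }
        return s;
    }

    /** |Mi ∪ Mj| : sum over all data elements of max(fi(dk), fj(dk)). */
    public static int unionSize(Map<String, Integer> mi, Map<String, Integer> mj) {
        int s = 0;
        Set<String> keys = new HashSet<>(mi.keySet());
        keys.addAll(mj.keySet());
        for (String k : keys) {
            s += Math.max(mi.getOrDefault(k, 0), mj.getOrDefault(k, 0));
        }
        return s;
    }

    /** Ruzicka similarity: |Mi ∩ Mj| / |Mi ∪ Mj|. */
    public static double ruzicka(Map<String, Integer> mi, Map<String, Integer> mj) {
        int u = unionSize(mi, mj);
        return u == 0 ? 0.0 : (double) intersectionSize(mi, mj) / u;
    }
}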
2.4 Literature Review

2.4.1 Serial Algorithms
In order to improve the efficiency, the algorithms in the literature have focused on reducing the number of candidate pairs for which Similarity Joins have to be performed.
Locality Sensitive Hashing (LSH) is a well-known, approximate technique, where the data elements are hashed so that similar elements land in the same bucket with a high probability [50]. An LSH technique based on min-wise independent permutations to compute the Jaccard Similarity for near-duplicate detection of web documents was proposed by [51].
An inverted index [52] maps an element to the entities that contain it. Instead of comparing all the entities in a collection, an inverted index helps to compare only the entities that contain at least one common element. An inverted-index-based algorithm that indexes every set element to generate candidate pairs for similarity computation, with optimizations based on using threshold information and sorting- and clustering-based techniques, was proposed by [53].
A signature-based technique called PartEnum, where two sets share a signature in common if their Hamming Distance is less than a particular value, k, was developed by [28]; similarity is evaluated only for sets sharing at least one signature. A different technique of generating candidate pairs from candidates that have a common prefix element in their sets, named prefix filtering, was proposed by [27].
An All Pairs algorithm to compute the similarity of all pairs of similar candidates, based on exploiting the threshold during indexing to reduce the candidate pairs generated, was proposed by [24]. A prefix-filtering-based algorithm, where the top-k similar elements are returned if the similarity threshold is unknown, was developed by [54]. An adaptive model to select an approximate prefix length for performing the joins, which outperforms traditional prefix filtering, was proposed by [55].
A state-of-the-art technique for detecting duplicate documents, named PPJoin+, where prefix filtering is combined with further filtering techniques for sets, namely positional filtering (based on the position of the set elements) and suffix filtering (based on the suffix Hamming distance), was proposed by [4].
2.4.2 Parallel Algorithms
In this section, the prior parallel MapReduce algorithms for performing similarity joins are discussed.
A MapReduce algorithm for finding traffic anomalies and load-balancing proxies in the Internet using similarity joins was proposed by [23]. In the first stage of the algorithm, the size of every multiset is calculated. In the second stage, all the multiset elements are indexed in the Map Phase, and the MIDs that share a common multiset element are paired in the Reduce Phase. Since every multiset element is indexed without incorporating any filtering technique, a humongous number of pairs will be generated, which must be written to the DFS by the Reduce Phase, causing I/O inefficiency. In the next stage, this humongous number of pairs is read by the Mappers and passed through the network, causing congestion and inefficiency; it is not a scalable approach.
A MapReduce algorithm for computing the similarity between normalized document vectors, where the similarity is expressed as the product of the document vector weights, i.e. the intersection, was developed by [8]. Similar to [23], every element of the document vector is indexed and the possible candidate pairs are generated, as a result of which I/O inefficiency and network congestion result. The difference between this algorithm and the algorithm in [23] is that [23] takes into account the size of the multisets, whereas here the document vectors are normalized and their size is not taken into account.
The VCL Algorithm [56] is a MapReduce algorithm designed for computing similarity joins between sets, which tries to implement PPJoin+ [4] in a distributed manner. This algorithm was developed for sets and is therefore not capable of handling multisets. However, it is worthwhile to discuss its characteristics. In the Map Phase, each set is replicated as many times as the number of elements in its prefix. In the Reduce Phase, each reduce instance pertains to a unique set element, and its value list consists of the complete sets that share that element in their prefix and are hence potential candidates of being similar; their similarity is evaluated there. Replicating each set as many times as the number of elements in its prefix in the Mappers makes the algorithm unfit for large sets, as the large amount of replication causes network congestion and I/O constraints, hindering its scalability. This drawback has been mentioned in [57] and has been demonstrated in [23].
A MapReduce similarity join algorithm for computing the vector cosine similarity between documents, called SSJ-2R, was proposed by [9]. It incorporates the prefix filtering technique to reduce the potential candidate document vector pairs. However, this algorithm fails to incorporate the other filtering techniques.
The parallel algorithms for multiset similarity joins discussed above are directly pertinent to this work. However, it is worth mentioning other relevant work done in this area. A theoretical comparison of MapReduce algorithms for various similarity measures, such as the Edit Distance, Jaccard and Hamming Distance measures, was presented by [58]. Multiway join optimization in the MapReduce Framework has been the focus of [59]. Performing joins between log data and reference data in the MapReduce Framework was proposed by [60].
Chapter 3

Strategic and Suave Processing for Similarity Joins Using MapReduce

In the SSS algorithm, prefix filtering is applied in Stage I to generate the possible candidate pairs for which similarity joins are to be performed. This is followed by size filtering to further reduce the pairs. In Stage II, positional filtering is applied to the pairs. In Stage III, similarity joins are performed for the surviving pairs.

The Map and Reduce Phases of the three Stages are detailed below (Sections 3.1 to 3.6), along with the preprocessing stage (Section 3.7). In Section 3.8, SSS is compared with the competing state-of-the-art algorithm, SSJ-2R, and in Section 3.9, through experimental analysis, the unprecedented performance gain of SSS in comparison to SSJ-2R is reported. In Section 3.10, the technique to enhance the scalability of the algorithm is presented as a contingency need. Section 3.11 summarizes this chapter.
3.1 Stage I - Map Phase

The preprocessed input to the Map Phase consists of records, each containing the Multiset ID, Mi, followed by the elements of Mi. The elements of Mi are arranged based on the increasing order of the global frequency of their data elements across the entire multiset collection, with the least frequent being first. Each Mapper calculates the prefix size of the multiset in the input record.
From the prefix filtering technique, Lemma 1 of [27], rephrased as Lemma 1 of [4], it is known that if two sets Si and Sj have an overlap of at least α, there must be a non-zero intersection in their prefixes: the Si-prefix (of size |Si| − α + 1) and the Sj-prefix (of size |Sj| − α + 1) must have a non-zero intersection.
From [4], for a similarity threshold, t,

α = ⌈(t / (1 + t)) ∗ (|Si| + |Sj|)⌉    (3.1)
From [4], knowing only one set Si, without knowledge of the other sets with which it will intersect, the prefix size can be found by the formula |Si| − (t ∗ |Si|) + 1, where t is the similarity threshold. From [27], prefix filtering is also applicable to multisets. Therefore, the prefix size of a multiset Mi, for a similarity threshold t, is given by |Mi| − (t ∗ |Mi|) + 1.
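For example, for |Mi| = 10 and t = 0.8, the prefix covers 10 − 8 + 1 = 3 units. A minimal Java sketch of this prefix computation is given below; it assumes the elements are already sorted by increasing global frequency, that the prefix is taken as the leading elements whose cumulative frequency first reaches the prefix size, and that a ceiling is applied so the prefix size is an integer. These details are illustrative assumptions, not taken verbatim from the thesis.

import java.util.ArrayList;
import java.util.List;

public class PrefixFilter {

    /** One multiset element: a data element and its frequency in the multiset. */
    public static class Element {
        public final String dataElement;
        public final int frequency;
        public int position;          // Pos(mi,k), filled in below
        public Element(String dataElement, int frequency) {
            this.dataElement = dataElement;
            this.frequency = frequency;
        }
    }

    /** Prefix size for similarity threshold t: |Mi| - ceil(t * |Mi|) + 1. */
    public static int prefixSize(int multisetSize, double t) {
        return multisetSize - (int) Math.ceil(t * multisetSize) + 1;
    }

    /**
     * Returns the prefix elements of a multiset whose elements are already sorted
     * by increasing global frequency, with Pos(mi,k) filled in. Assumption: the
     * prefix is the leading elements whose cumulative frequency reaches the prefix size.
     */
    public static List<Element> prefix(List<Element> sortedElements, double t) {
        int multisetSize = 0;
        for (Element e : sortedElements) multisetSize += e.frequency;
        int p = prefixSize(multisetSize, t);

        List<Element> result = new ArrayList<>();
        int fsum = 0;                 // running sum of frequencies seen so far
        for (Element e : sortedElements) {
            if (fsum >= p) break;     // prefix covered
            e.position = fsum;        // Pos(mi,k): frequencies of the prior elements
            result.add(e);
            fsum += e.frequency;
        }
        return result;
    }
}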
For every element mi,k present in the prefix of the multiset Mi, with data element dk, the following information, viz. the MID Mi, the size of Mi, |Mi|, dk's frequency in Mi, fi(dk), and the position of mi,k in Mi, Pos(mi,k), is sent as the Mapper value and is denoted by I(dk(Mi)), the key being dk.
Stage I - Map Phase can be expressed as shown below:

< Mi, Elements of Mi >  −−(∀ mi,k ∈ Mi's Prefix)−−→  (< dk, I(dk(Mi)) >)*    (3.2)

I(dk(Mi)) = < Mi, |Mi|, fi(dk), Pos(mi,k) >    (3.3)

The pseudo-code of Stage I - Map Phase is shown in Algorithm 1.
Algorithm 1: Stage I - Map Phase
1 Input: For each Mapper instance of Stage I, the input record has the value valin = < Mi, Elements of Mi >
2 Output: The output records are of the format < keyout, valout >, where keyout is the prefix data element, dk, and valout is I(dk(Mi))
3 Compute the size of Mi, viz. |Mi|, by iterating through its elements and summing up the frequencies of its data elements;
4 Compute the prefix size of the multiset Mi using the formula |Mi| − (t ∗ |Mi|) + 1;
5 /* t is the similarity threshold */
6 fsum ← 0; /* Initialize fsum, the running sum of the frequencies of the data elements present in the multiset */
7 for every mi,k ∈ Mi's Prefix do
8     Pos(mi,k) ← fsum;
9     /* The position of mi,k, viz. Pos(mi,k), is the sum of the frequencies of the data elements prior to it in Mi */
10    Update fsum as fsum = fsum + fi(dk);
11    keyout = dk;
12    valout = I(dk(Mi)) = < Mi, |Mi|, fi(dk), Pos(mi,k) >;
13    output(keyout, valout);
14 end
3.2 Stage I - Reduce Phase
At each reducer, the records that share the same prefix data element, dk, as the key are grouped together. From the prefix filtering principle, it is known that if two multisets have a common prefix data element, they are potential candidates of being similar. Therefore, all the possible MID pairs that share the same dk are generated. If, say, k records share a prefix data element, then k ∗ (k − 1)/2 MID pairs are generated.

To reduce this number further, the size filtering technique [28] is applied, which gives effective pruning results. This technique is applied with the help of the size information sent with every record; it was also employed in the serial algorithms [24] and [4]. For every MID pair < Mi, Mj > and threshold t, if the size filtering condition, viz. |Mj| ≥ t ∗ |Mi|, is satisfied, the pair passes the filter; otherwise it is pruned. The multiset with the smaller size is taken as Mj, and the bigger as Mi. For every MID pair < Mi, Mj > that survives size filtering, the frequency of dk and the size and position information of both Mi and Mj (denoted by I(dk(Mi, Mj)) in (3.5)) are appended and sent as the reducer output.
This process is defined mathematically as:

< dk, (I(dk(Mi)))* >  −−(∀ Mi, Mj that survive Size Filtering)−−→  (< Mi, Mj, I(dk(Mi, Mj)) >)*    (3.4)

where I(dk(Mi)) is given by (3.3), and

I(dk(Mi, Mj)) = < |Mi|, |Mj|, fi(dk), fj(dk), Pos(mi,k), Pos(mj,k) >    (3.5)

The Stage I Map and Reduce instances are also detailed in Fig. 3-1. The pseudo-code of Stage I - Reduce Phase is shown in Algorithm 2.
Algorithm 2: Stage I - Reduce Phase
1 Input: For each Reduce instance, the input has the key keyin = dk and the corresponding value list (value)* = (I(dk(Mi)))*
2 Output: The output consists of records having the value valout = < Mi, Mj, I(dk(Mi, Mj)) >, where I(dk(Mi, Mj)) is given by < |Mi|, |Mj|, fi(dk), fj(dk), Pos(mi,k), Pos(mj,k) >
3 for every I(dk(Mi)) ∈ (I(dk(Mi)))* do
4     for every I(dk(Mj)) ∈ (I(dk(Mi)))* do
5         if |Mj| ≥ t ∗ |Mi| then
6             /* Size Filtering Condition */
7             valout = < Mi, Mj, I(dk(Mi, Mj)) >;
8             output(valout);
9         end
10    end
11 end
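A compact, illustrative Java sketch of this reduce-side pair generation with size filtering is shown below; the Entry class and the method names are placeholders rather than the thesis implementation.

import java.util.ArrayList;
import java.util.List;

/** Illustrative pair generation with size filtering, as performed for one prefix
 *  data element dk in the Stage I reducer. Each entry of the value list carries
 *  an MID and the size of that multiset. */
public class SizeFilteredPairs {

    public static class Entry {
        public final String mid;
        public final int size;     // |M|
        public Entry(String mid, int size) { this.mid = mid; this.size = size; }
    }

    /** Emits every pair <Mi, Mj> sharing dk whose sizes satisfy |Mj| >= t * |Mi|,
     *  with the larger multiset taken as Mi and the smaller as Mj. */
    public static List<String[]> pairs(List<Entry> valueList, double t) {
        List<String[]> out = new ArrayList<>();
        for (int a = 0; a < valueList.size(); a++) {
            for (int b = a + 1; b < valueList.size(); b++) {
                Entry bigger  = valueList.get(a).size >= valueList.get(b).size
                              ? valueList.get(a) : valueList.get(b);
                Entry smaller = bigger == valueList.get(a) ? valueList.get(b) : valueList.get(a);
                if (smaller.size >= t * bigger.size) {          // size filtering condition
                    out.add(new String[] { bigger.mid, smaller.mid });
                }
            }
        }
        return out;
    }
}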
Figure 3-1: Mapper and Reducer Instances of Stage I
3.3 Stage II - Map Phase

Stage II - Map Phase consists of two types of Mappers:

1. Type I Mappers. The records from the output of the Stage I - Reduce Phase are read. These records pertain to MID pairs and are denoted as MID Pair records. The output key is the MID pair < Mi, Mj >, and the value is Mj together with I(dk(Mi, Mj)), which comprises the frequency of dk and the size and position information of both Mi and Mj, in order to facilitate positional filtering in the Reduce Phase. The process can be represented by (3.6):

< Mi, Mj, I(dk(Mi, Mj)) >  →  << Mi, Mj >, < Mj, I(dk(Mi, Mj)) >>    (3.6)

where I(dk(Mi, Mj)) is given by (3.5). The pseudo-code of Stage II - Map Phase for Type I Mappers is shown in Algorithm 3.
2. Type II Mappers. The preprocessed input of the Stage I - Map Phase, where each record consists of the MID, Mi, and its elements, is read here and sent as output with < Mi, m > as the key and the elements of Mi as the value. Here, the 'm' in the key denotes that it is a record containing the multiset elements. These records can be called the multiset records. This can be mathematically represented as:

< Mi, Elements of Mi >  →  << Mi, m >, < Elements of Mi >>    (3.7)
Figure 3-2: Type I and Type II Mapper instances of Stage II
Fig. 3-2 represents the Type I and Type II Mapper instances.
Algorithm 3: Stage II - Map Phase (Type I)
1 Input: The input record to each Mapper instance has the value,
valin =< Mi, Mj, I(dk(Mi, Mj)) >
2 Output: The output record has the key, < Mi, Mj >, and value,
< Mj, I(dk(Mi, Mj)) >
3 keyout =< Mi, Mj >;
4 valout =< Mj, I(dk(Mi, Mj)) >;
5 output(keyout, valout);
Algorithm 4: Stage II - Map Phase (Type II)
1 Input: The input record to each Mapper instance has the value,
valin =< Mi, Elements of Mi >
2 Output: The output record has the key, < Mi, m >, and value,
< Elements of Mi >
3 keyout =< Mi, m >;
4 valout = Elements of Mi;
5 output(keyout, valout);
Partitioning, Grouping and Sorting are customized rather than using the defaults, so that in each reduce instance records will be grouped based on the MID Mi, the primary key, and sorted based on the secondary key.

Custom Partitioning: A multiset record has the composite key < Mi, m >, where Mi is the primary key and m is the secondary key. An MID Pair record has the composite key < Mi, Mj >, where the primary key is Mi and Mj is the secondary key. Records are partitioned based on the primary key. Both types of records for which the primary key, viz. Mi, is the same are partitioned to the same reducer.

Custom Grouping: Custom Grouping ensures that records that have the same MID, Mi, as the primary key reach the same reduce instance. Each reduce instance thus pertains to a unique MID, Mi.

Custom Sorting: Sorting is based on the secondary key, so that the record containing the multiset elements arrives first in an instance and is followed by the MID Pair records.
The records which have the same Mi as primary key are grouped in the same instance. These include the multiset record corresponding to Mi and the MID Pair records with the same Mi as their primary key. In every reduce instance, the multiset record with key < Mi, m > arrives first. Following that, the MID Pair records arrive, in sorted order based on their secondary key, Mj. The MID Pair records that pertain to the same < Mi, Mj > pair are grouped together and positional filtering is applied. Every unique pair < Mi, Mj > that survives positional filtering is sent as output. If there is at least one pair that survives positional filtering, the MID Mi and its elements are written to a file named the Multiset File. The Reduce Phase is diagrammatically represented in Fig. 3-3 and can also be expressed by (3.8) below.
<< Mi, x >, (value)* >  −−(Positional Filtering)−−→  (< Mi, Mj >)*;  < Mi, Elements of Mi > → Multiset File, if at least one pair survives    (3.8)
Algorithm 5: Stage II - Reduce Phase
1 Input: Each reduce instance has as input the key keyin = < Mi, x > and a (value)* containing < Elements of Mi > and (< Mj, I(dk(Mi, Mj)) >)*
2 Output: It sends as output the MID pairs, with the value valout = < Mi, Mj >. It also writes Mi and its elements to the Multiset File, if any MID pair survives positional filtering
3 MapMID ← Group (I(dk(Mi, Mj)))* into groups in MapMID, where each Map entry consists of a key Mj and a value consisting of the (I(dk(Mi, Mj)))* pertaining to that particular Mj;
4 PassPositionalFilter ← false; PassFilterCheck ← false;
5 for each Mj ∈ MapMID do
6     Get the (I(dk(Mi, Mj)))* corresponding to Mj;
7     PassPositionalFilter ← Apply Positional Filtering for the pair < Mi, Mj > using (I(dk(Mi, Mj)))*;
8     if PassPositionalFilter == true then
9         output(valout = < Mi, Mj >);
10        PassFilterCheck ← true;
11    end
12 end
13 if PassFilterCheck == true then
14    Write < Mi, Elements of Mi > to the Multiset File;
15 end
Figure 3-3: Reducer Instance of Stage II
The Positional Filtering technique and how it is performed here in the Stage II - Reduce Phase are explained below.

3.4.1 Positional Filtering
Positional filtering is the technique of filtering pairs of sets based on the positional information of the overlapping token between the sets, and is defined in Lemma 2 of [4]. Lemma 1 describes how to extend this technique to multisets.

Lemma 1 (Positional Filtering for Multisets): Consider an ordering O of the multiset element universe U and a set of multisets, each with multiset elements sorted in the order of O. Let element w = mi,k partition the multiset Mi into a left partition Mil(w) = {mi,1, ..., mi,k} and a right partition Mir(w) = {mi,k+1, ..., mi,|D|}. If Overlap(Mi, Mj) ≥ α, then for every element w ∈ Mi ∩ Mj, Overlap(Mil(w), Mjl(w)) + min(|Mir(w)|, |Mjr(w)|) ≥ α.
The Positional Filtering Condition is thus:

Overlap(Mil(w), Mjl(w)) + min(|Mir(w)|, |Mjr(w)|) ≥ α    (3.9)

Figure 3-4: Partitioning a Multiset Mi for Positional Filtering
It can be seen in Fig. 3-4 how a multiset Mi is partitioned for positional filtering at w = mi,k, the multiset element pertaining to a data element dk, into the left partition Mil(w) and the right partition Mir(w).
Let us discuss the reasoning behind Lemma 1. The Overlap(Mil(w), Mjl(w)) is the overlap between the left partitions of Mi and Mj. The maximum possible overlap between the right partitions of Mi and Mj is min(|Mir(w)|, |Mjr(w)|), as there can be no overlap greater than that in the right partitions. There can be no overlap between the left partition of Mi and the right partition of Mj, and vice versa, as the multiset elements are arranged in a global ordering. Thus the sum of the actual overlap in the left partitions and the maximum possible overlap between the right partitions represents the maximum overlap possible between Mi and Mj, and is represented by the left hand side of (3.9). This must be at least α (given by (3.1)), the minimum required overlap that must exist between Mi and Mj for a similarity threshold of t, which is represented by the right hand side of (3.9).
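A small illustrative Java check implementing condition (3.9) for one candidate pair, using the quantities carried in I(dk(Mi, Mj)), could look as follows; the class and parameter names are placeholders, and alpha is the minimum required overlap from (3.1).

/** Illustrative positional filtering check for a candidate pair <Mi, Mj>,
 *  applied at the last overlapping prefix element w (condition (3.9)). */
public class PositionalFilterCheck {

    public static boolean survives(int prefixOverlap,       // Overlap(Mil(w), Mjl(w))
                                   int sizeI, int sizeJ,     // |Mi|, |Mj|
                                   int posI, int freqI,      // Pos(mi,l), fi(dl)
                                   int posJ, int freqJ,      // Pos(mj,l), fj(dl)
                                   double alpha) {           // minimum required overlap
        int rightI = sizeI - (posI + freqI);                 // |Mir(w)|
        int rightJ = sizeJ - (posJ + freqJ);                 // |Mjr(w)|
        int upperBound = prefixOverlap + Math.min(rightI, rightJ);
        return upperBound >= alpha;                          // condition (3.9)
    }
}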
3.4.2 Positional Filtering in Stage II-Reduce Phase
Here in the Stage II - Reduce Phase, positional filtering is performed at the last overlapping prefix element, w = mi,l, where dl is the last overlapping prefix data element. So, the Overlap(Mil(w), Mjl(w)) is the total overlap between the prefixes of the two multisets Mi and Mj, and can be denoted by αp. MID Pair records that reach a reduce instance are grouped based on the pair < Mi, Mj > they pertain to. Records pertaining to the same pair are grouped together, and the records of a group are sorted based on the position of the element pertaining to the overlapping data element in Mi, which is Pos(mi,k) (Line 3 of Algorithm 6). This is followed by accumulating the overlap by iterating over the sorted records, thereby calculating the total prefix overlap (Lines 7-9). At the last record, which pertains to the last overlapping prefix element, w = mi,l, positional filtering is applied using (3.9) (Lines 10-16). It can be seen in Line 12 that the right partition size of a multiset Mi is calculated by the formula |Mir(w)| = |Mi| − (Pos(mi,l) + fi(dl)); Pos(mi,l) + fi(dl) is the sum of the frequencies of the multiset elements present up till w, and represents the left partition size, |Mil(w)|.
Algorithm 6: Positional Filtering
7 for each I(dk(Mi, Mj)) in (I(dk(Mi, Mj)))* do
8     /* the information needed for the formulae below is taken from I(dk(Mi, Mj)) */ ;
9     αp = αp + min(fi(dk), fj(dk)) ;
10    if Last I(dk(Mi, Mj)) then
11        /* Positions of the last overlapping prefix element, w = mi,l, containing data element dl, in Mi and Mj are Pos(mi,l) and Pos(mj,l) respectively */ ;
12        ubound = min(|Mir(w)|, |Mjr(w)|) = min((|Mi| − (Pos(mi,l) + fi(dl))), (|Mj| − (Pos(mj,l) + fj(dl))));