Fast Approximate Near Neighbor Algorithm by Clustering in High Dimensions

Hung Tran-The
Z8 FPT Software JSC, Hanoi, Vietnam
Email: hungtt5@fsoft.com.vn

Vinh Nguyen Van
University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
Email: vinhnv@vnu.edu.vn

Minh Hoang Anh
Z8 FPT Software JSC, Hanoi, Vietnam
Email: minhha@fsoft.com.vn
Abstract—This paper addresses the (r, 1 + ε)-approximate near neighbor problem (or (r, 1 + ε)-NN), which is defined as follows: given a set of n points in a d-dimensional space, a query point q and a parameter 0 < δ < 1, build a data structure that reports a point within distance (1 + ε)r from q with probability 1 − δ, provided there is a point in the data set within distance r from q. We present an algorithm for this problem in the Hamming space that uses a new clustering technique, the triangle inequality and the locality sensitive hashing approach. Our algorithm achieves O(n^{1 + 2ρ/(1+2ρ)} + dn) space and O(f·d·n^{2ρ/(1+2ρ)}) query time, where f is generally a small integer and ρ is the exponent parameter of the algorithm in (STOC 1998) [1] or (STOC 2015) [2]. Our results show that we can improve the algorithms in [1], [2] when ε is small (ε < 1 for [1] and ε < 0.5 for [2]).
I. INTRODUCTION
A similarity search problem involves a collection of objects (documents, images, etc.) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space. Given a query and a set of n objects in the form of points in the d-dimensional space, we are required to find the nearest (most similar) objects to the query. This is called the nearest neighbor problem. The problem is of major importance to a variety of applications, such as data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis.
There are several efficient algorithms known for the case when the dimension d is "low" (see [3] for an overview). However, the main issue is dealing with high-dimensional data. Despite decades of intensive effort, the current solutions suffer from either space or query time that is exponential in d. In fact, for large enough d, in theory or in practice, they often provide little improvement over a linear-scan algorithm that compares the query to each point in the database. This phenomenon is often called "the curse of dimensionality".
In recent years, several researchers have proposed methods for overcoming this state of affairs by using approximation algorithms for the problem. In that formulation, a (1 + ε)-approximate nearest neighbor algorithm is allowed to return a point whose distance from the query is at most (1 + ε) times the distance from the query to its nearest point, where ε > 0 is called the approximation factor. Many algorithms for the (1 + ε)-approximate nearest neighbor problem can be found in [4], [1], [5], [6], [7], [8], [9], [10], [11]. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. An efficient approximation algorithm can also be used to solve the exact nearest neighbor problem, by enumerating all approximate nearest neighbors and choosing the closest point.
In [6], [1], [11], [8], the authors constructed data structures for the (1 + ε)-approximate nearest neighbor problem which avoid the curse of dimensionality. To be more specific, for any constant ε > 0, the data structures support queries in time O(d·log(n)) and use space which is polynomial in n. Unfortunately, the exponent in the space bounds is roughly C/ε² (for ε < 1), where C is a non-negligible constant. Thus, even for, say, ε = 1, the space used by the data structure is large enough that the algorithm becomes impractical even for relatively small data sets. From a practical perspective, the space used by an algorithm should be as close to linear as possible. If the space bound is (say) sub-quadratic and the approximation factor c is a constant, the best existing solutions are based on locality sensitive hashing (LSH). One other attractive feature of the locality sensitive hashing approach is that it enjoys a rigorous theoretical performance guarantee even in the worst case.
Locality Sensitive Hashing: The core idea of the LSH approach is to hash items in a similarity-preserving way, i.e., it tries to store similar items in the same buckets while keeping dissimilar items in different buckets. In [1], [12] the authors provided such locality-sensitive hash functions for the case when the points live in the binary Hamming space {0, 1}^d. The LSH approach proceeds in two steps: training and querying. In the training step, LSH first builds a hash function h(p) = (h_1(p), h_2(p), ..., h_m(p)) with h_i(p) ∈ {0, 1}, where m is the code length. The h_i are projections of the input point onto one of the coordinates, that is, hash functions of the form h_i(p) = p_i for i = 1, ..., m. Then, LSH represents each item in the database as a hash code via the mapping h(x), and constructs a hash table by
hashing each item into the bucket indexed by its code. In the querying step, LSH first converts the query into a hash code, and then searches for its near neighbor in the bucket indexed by that code. If the probability of collision is at least p = 1 − r/d for the close points and at most q = 1 − (1 + ε)r/d for the far points, this yields an algorithm for (r, 1 + ε)-NN using O(n^{1+ρ} + nd) space, O(dn^{1+ρ}) preprocessing time, and O(dn^ρ) query time, where ρ = log(1/p)/log(1/q). For r < d/log(n), ρ < 1/(1 + ε).
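To make the scheme above concrete, the following Python sketch illustrates bit-sampling LSH over {0, 1}^d in the spirit of [1], [12]. It is only an illustration under our own naming (the class BitSamplingLSH and the parameters m and L are ours, not the authors'); the analysis above roughly corresponds to choosing m = log_{1/q}(n) sampled bits per table and L = n^ρ tables.

```python
import random
from collections import defaultdict

class BitSamplingLSH:
    """Illustrative bit-sampling LSH for binary vectors.
    Each of the L tables hashes a point by concatenating m randomly
    chosen coordinates (the projections h_i(p) = p_i described above)."""

    def __init__(self, dim, m, L, seed=0):
        rng = random.Random(seed)
        # For each table, sample the m coordinates it projects on.
        self.projections = [[rng.randrange(dim) for _ in range(m)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, point, coords):
        return tuple(point[i] for i in coords)

    def index(self, points):
        self.points = points
        for pid, p in enumerate(points):
            for table, coords in zip(self.tables, self.projections):
                table[self._key(p, coords)].append(pid)

    def query(self, q, radius):
        # Scan the colliding points and report one within the target radius, if any.
        for table, coords in zip(self.tables, self.projections):
            for pid in table.get(self._key(q, coords), []):
                if sum(a != b for a, b in zip(self.points[pid], q)) <= radius:
                    return pid
        return None
```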
Related Works: In a follow-up work [9], the authors introduced LSH functions that work directly in Euclidean space and result in a faster running time. The latter algorithm forms the basis of the E2LSH package for high-dimensional similarity search, which has been used in several applied scenarios.

In this paper, we focus only on the Hamming space and on improving the efficiency of the locality sensitive hashing approach. The efficiency can be improved via the exponent ρ. The LSH algorithm has since been used in numerous applied settings, such as [13], [12], [14], [15], [16].
The work of [17] showed that LSH for the Hamming space must have ρ ≥ 1/(2(1 + ε)) − O(1/(1 + ε)) − o(1). In a recent paper (SODA 2014) [18], Alexandr Andoni et al. proposed a version of the LSH approach using essentially the same LSH function families as described in [1]. However, the properties of those hash functions, as well as the overall algorithm, are different. Their approach leads to a two-level hashing algorithm. The outer hash table partitions the data set into buckets of bounded diameter. Then, for each bucket, they build an inner hash table, which uses (after some pruning) the center of the minimum enclosing ball of the points in the bucket as a center point. This hashing type is data-dependent hashing, i.e., a randomized hash family that itself depends on the actual points in the dataset. Their algorithm achieves O(n^ρ + d·log(n)) query time and O_c(n^{1+ρ} + d·log(n)) space, where ρ ≤ 7/(8(1 + ε)) + O(1/(1 + ε)^{3/2}) + o(1). In a very recent paper of the same authors (accepted at STOC 2015) [2], they continue to study data-dependent hashing and the optimality of the exponent ρ. Their algorithm achieves O(n^{1+ρ} + dn) space and O(dn^ρ) query time, where ρ ≤ 1/(2(1 + ε) − 1).
Contribution: We present an algorithm for the (r, 1 + ε)-NN problem in the Hamming space that uses a new clustering technique and the triangle inequality as a preprocessing stage. To the best of our knowledge, this is the first paper to use a clustering algorithm as a preprocessing step before applying the LSH approach to solve the (r, 1 + ε)-NN problem. Our algorithm achieves the following advantages:
• We give an improvement over the classic LSH algorithm of [1] and over the very recent algorithm of [2] in the case where ε is small. By choosing K = O(n^{1/(1+2ρ)}), we obtain an algorithm with O(n^{1 + 2ρ/(1+2ρ)} + dn) space and O(f·d·n^{2ρ/(1+2ρ)}) query time, where f is generally a small integer; see Table I. We have n^{2ρ/(1+2ρ)} < n^ρ if 2/(1+2ρ) < 1, and this holds if ε < 1 for the case of [1] and if ε < 0.5 for the case of [2] (see the short derivation after this list). Hence, our algorithm is better than the original algorithms of [1] and [2] when ε is small as mentioned and f is small. In fact, we also prove that the average value of f is smaller than d/(d − 2r); therefore, if d > 4r, the average value is 1. Note that there are not many LSH algorithms for the (r, 1 + ε)-NN problem in the Hamming space; we can cite here the two algorithms in [1], [2].

Table I. Space and time bounds for LSH algorithms and our algorithm.

Algorithm      | Query time             | Space                       | Exponent
STOC 1998 [1]  | O(dn^ρ)                | O(n^{1+ρ} + dn)             | ρ < 1/(1+ε)
STOC 2015 [2]  | O(dn^ρ)                | O(n^{1+ρ} + dn)             | ρ ≤ 1/(2(1+ε)−1)
Our results    | O(f·d·n^{2ρ/(1+2ρ)})   | O(n^{1+2ρ/(1+2ρ)} + dn)     | ρ of the original algorithm
• We provide a simple algorithm consisting of two procedures: indexing and querying.
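For completeness, here is the short calculation behind the exponent comparison claimed in the first item; it uses only the values of ρ stated above for [1] and [2]:

\[
n^{\frac{2\rho}{1+2\rho}} < n^{\rho}
\;\Longleftrightarrow\; \frac{2\rho}{1+2\rho} < \rho
\;\Longleftrightarrow\; \frac{2}{1+2\rho} < 1
\;\Longleftrightarrow\; \rho > \frac{1}{2}.
\]

With ρ = 1/(1+ε) as in [1], the condition ρ > 1/2 is exactly ε < 1; with ρ = 1/(2(1+ε)−1) = 1/(1+2ε) as in [2], it is exactly ε < 0.5.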
The remainder of the paper is structured as follows. Section II provides a formal definition of the problem we are tackling. Our solution to the problem is presented in Section III. We then give the correctness of our algorithm in Section IV and the complexity analysis in Section V. Finally, Section VI concludes the paper.
II. PROBLEM DEFINITION
In this paper, we solve the (r, 1 + ε)-approximate near neighbor problem in the Hamming space:

Definition 1 ((r, 1 + ε)-approximate near neighbor, or (r, 1 + ε)-NN). Given a set P of points in a d-dimensional Hamming space H^d and δ > 0, construct a data structure which, given any query point q, does the following task with probability 1 − δ: if there exists an r-near neighbor of q in P, it reports a (1 + ε)r-near neighbor of q in P.

Similarly to [1], [12], δ is an absolute constant bounded away from 1. Formally, an r-near neighbor of q is a point p such that d(p, q) ≤ r, where d(p, q) denotes the Hamming distance between p and q. The set C(q, r) is the set containing all r-near neighbors of q.
Observe that (r, 1 + ε)-NN is simply a decision version of the approximate nearest neighbor problem. Although in many applications solving the decision version is good enough, one can also reduce the approximate nearest neighbor problem to the approximate near neighbor problem via a binary-search-like approach. In particular, it is known [1] that the (1 + ε)-approximate NN problem reduces to O(log(n/ε)) instances of (r, 1 + ε)-NN. Thus, the complexity of (1 + ε)-approximate NN is the same (within a log factor) as that of the (r, 1 + ε)-NN problem.
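The flavor of this reduction can be sketched as follows. This is a simplified illustration, not the exact construction of [1]: it glosses over the failure probability of the individual structures and over the precise choice of radii, and the helper build_rnn is a hypothetical factory for (r, 1 + ε)-NN structures.

```python
def build_nn_structures(points, r_min, r_max, eps, build_rnn):
    """Build (r, 1+eps)-NN structures for a geometric sequence of radii.
    `build_rnn(points, r, eps)` is assumed to return an object whose
    query(q) reports a (1+eps)r-near neighbor of q, or None."""
    radii = []
    r = r_min
    while r <= r_max:
        radii.append(r)
        r *= (1 + eps)
    return radii, [build_rnn(points, rad, eps) for rad in radii]

def approx_nearest_neighbor(q, radii, structures):
    """Binary search for the smallest radius whose structure reports a point."""
    lo, hi, best = 0, len(radii) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        p = structures[mid].query(q)
        if p is not None:
            best, hi = p, mid - 1   # a point was found; try a smaller radius
        else:
            lo = mid + 1            # nothing reported at this radius; go larger
    return best
```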
III. ALGORITHM
The key idea of our algorithm is that one can greatly reduce the search space by cheaply partitioning the data into clusters, which we call stage 1. Then we use the triangle inequality to select a few matched clusters and finally use an LSH algorithm for each selected cluster; we call this stage 2. Such a space-partitioning approach is very natural for searching problems and is applicable in many settings. For example, [19] solves the exact K-nearest neighbors problem using K-means clustering in order to reduce the search space, and [20] solves exact K-nearest neighbors classification based on homogeneous clusters.
There are a few clustering algorithms that could be applied in our case, such as the K-means algorithm and the Canopy algorithm [21]. The key idea of K-means is to partition the data into K clusters in which each data point belongs to the cluster with the nearest mean, while the key idea of the Canopy algorithm is to use a cheap, approximate distance measure to efficiently divide the data into overlapping subsets, called canopies, with the same radius. However, the disadvantage of such algorithms is that (1) we do not know how many points are in a cluster and (2) the computation for the K-means algorithm is expensive, while the canopies may overlap for the Canopy algorithm. We introduce here a simple clustering algorithm that ensures two properties: (1) we know exactly how many points are in a cluster, and (2) the computation for this algorithm is cheap.
A. Clustering Algorithm
Definition 2. A K-cluster with center c is a subset S of P such that:
1) |S| = O(K),
2) all points of S are among the O(K) nearest neighbors of c.
The idea of K-nearest-neighbor clustering is as follows: first, we randomly choose a point c of P as the center of a cluster, then we compute the O(K) nearest neighbors of c using the Merge sort algorithm. The Merge sort algorithm sorts the distances of the data set to c in ascending order and stores them in an array. We select the O(K) points that correspond to the O(K) first elements of the array. These O(K) points and the point c form a cluster whose radius is the O(K)-th element of the array; hence, the distances of these O(K) points to c are at most this radius. Afterwards, we remove these points from P and randomly choose another point as the center of the next cluster. The process stops when all points of P have been removed and each belongs to a cluster. The details of the algorithm are presented in Figure 1.
Figure 2 illustrates an example of the result of the K-nearest-neighbor clustering algorithm. Given a set P with 12 points {a, b, c, d, e, f, g, m, n, o, p, q}, we obtain 3 clusters with centers p, q, t, and each cluster contains 4 points.
Input: a parameter K and a set P = {p_1, p_2, ..., p_n}
Output: a set of clusters C
    C = ∅
    while P ≠ ∅:
        select randomly a point c from P to initialize a cluster
        search for the set N(c) of the O(K) nearest neighbors of c in P,
            by using the Merge sort algorithm on the distances of the data to c
        add the points in N(c) to the cluster of c
        add the cluster of c to the set C
        remove all points of the cluster of c from P

Figure 1. K-Nearest-Neighbor Clustering Algorithm
Figure 2. Example of the clustering algorithm.
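The clustering stage of Figure 1 is short enough to sketch directly. The Python snippet below is an illustrative implementation under our own naming (knn_clustering, hamming), not the authors' code; it sorts the remaining distances in full, matching the Merge-sort description above, and records each cluster's radius for use in the searching stage.

```python
import random

def hamming(p, q):
    """Hamming distance between two equal-length binary tuples."""
    return sum(a != b for a, b in zip(p, q))

def knn_clustering(points, K, seed=0):
    """Sketch of the K-nearest-neighbor clustering of Figure 1.
    Returns a list of clusters, each with its center index, member indices and radius."""
    rng = random.Random(seed)
    remaining = list(range(len(points)))                      # indices of points still in P
    clusters = []
    while remaining:
        c = remaining.pop(rng.randrange(len(remaining)))      # random center
        # Sort the remaining points by distance to the center (Merge sort in the paper).
        remaining.sort(key=lambda i: hamming(points[i], points[c]))
        members = remaining[:K]                               # the K nearest neighbors of c
        radius = hamming(points[members[-1]], points[c]) if members else 0
        clusters.append({"center": c, "members": [c] + members, "radius": radius})
        remaining = remaining[K:]                             # remove clustered points from P
    return clusters
```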
B. Searching Algorithm
In the second stage, we select clusters according to some conditions, which we call step 1, and then use an LSH algorithm on each selected cluster in order to search for the near neighbor, which we call step 2. Since there are O(K) points in each cluster, there are at most O(n/K) clusters. If, by some elimination mechanism such as branch and bound, we select only a few clusters, then the search space for the LSH algorithm is greatly reduced.
Step 1: After the clustering algorithm, we obtain a set of clusters C. For simplicity, for each cluster with center c, let R(c) be the radius of the cluster. We use the triangle inequality in order to select only a few clusters among all clusters. Such a cluster is called a candidate, defined as follows:

Definition 3. Given a query point q, we say that a cluster with center c is a candidate of q if the two following conditions are satisfied:
• d(q, c) ≤ R(c) + r,
• for each cluster with center c′ such that c′ ≠ c, d(q, c′) ≥ R(c′) − r.
Step 2: In this step, we select an LSH algorithm ([1] or [2]) for (r, 1 + ε)-NN in order to search for the near neighbor in each cluster selected in step 1. All points in each cluster are organized according to the preprocessing algorithm of the selected LSH algorithm. The only difference is that we use twice the number of buckets of the selected algorithm in each hash table, and n^{2ρ} hash tables instead of n^ρ hash tables.
Input: a parameter K and a set of clusters C
    Candidates = ∅
Query algorithm for a query point q:
    Step 1:
        for each c ∈ C:
            if c is a candidate of q then
                add c to Candidates
    Step 2:
        for each c in Candidates:
            LSHAlgorithm(q, c); if a point is reported, stop

Figure 3. Searching Algorithm using the LSH Approach
The reason for this will be discussed in Section IV. We denote by LSHAlgorithm(q, c) the query algorithm of the selected LSH algorithm, where q is the query point and c is the center of a cluster. This query algorithm returns a near neighbor if it finds one in the cluster with center c; otherwise, it returns null. The searching algorithm is given in Figure 3.
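As an illustration, the following Python sketch wires the two steps together, using the clusters produced by knn_clustering from the previous sketch. The parameter lsh_query stands for the query procedure of the selected LSH algorithm applied to one cluster; it is an assumed placeholder, not part of the paper.

```python
def hamming(p, q):
    # Hamming distance, as in the clustering sketch above.
    return sum(a != b for a, b in zip(p, q))

def is_candidate(q, cluster, clusters, r, points):
    """Check the two conditions of Definition 3 for one cluster."""
    c = points[cluster["center"]]
    if hamming(q, c) > cluster["radius"] + r:
        return False
    # Every other cluster must satisfy d(q, c') >= R(c') - r.
    return all(hamming(q, points[other["center"]]) >= other["radius"] - r
               for other in clusters if other is not cluster)

def search(q, clusters, r, points, lsh_query):
    """Step 1: keep only the candidate clusters. Step 2: query each with LSH."""
    candidates = [cl for cl in clusters if is_candidate(q, cl, clusters, r, points)]
    for cl in candidates:
        p = lsh_query(q, cl)      # query procedure of the selected LSH scheme
        if p is not None:
            return p              # report the near neighbor found and stop
    return None
```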
IV. CORRECTNESS
Lemma 1. Given any query point q, if there exists p ∈ P with p ∈ C(q, r), then there exists a cluster in C such that (1) p belongs to the cluster and (2) this cluster is a candidate of q.

Proof: According to the clustering algorithm, p belongs to exactly one cluster, say the cluster with center c. We show that this cluster is a candidate of q. Indeed, by the triangle inequality in the Hamming space, d(q, c) ≤ d(c, p) + d(p, q). Moreover, as p is within the cluster with center c, d(c, p) ≤ R(c), and as p ∈ C(q, r), d(p, q) ≤ r. So we get d(q, c) ≤ R(c) + r.

Now consider any cluster with center c′ other than c. By the triangle inequality, d(q, c′) ≥ d(c′, p) − d(p, q). As p is within the cluster with center c, p is not within the cluster with center c′, so d(c′, p) ≥ R(c′). Moreover, d(p, q) ≤ r. Hence, d(q, c′) ≥ R(c′) − r. By Definition 3, the lemma holds.
Similarly to [1], the correctness of our algorithm holds if we can ensure that, with constant probability, the following two properties hold:
• if there exists p ∈ C(q, r), then q collides with p;
• the total number of collisions of q with points in P \ C(q, (1 + ε)r) is small.

Property 1 holds thanks to Lemma 1. Indeed, given any query point q, if there is a point p that is an r-near neighbor of q, then p must lie in a candidate cluster of q; hence, by the correctness of the LSH algorithm used for this candidate cluster in stage 2, we obtain property 1. For property 2, unlike in [1] or [2], we need to use twice the number of buckets of the original algorithm in each hash table, and n^{2ρ} hash tables instead of n^ρ. If we used n^ρ hash tables, then in the worst case we would have to consider all O(n/K) clusters, and with the parameters of the original algorithm the number of collisions of q with points in P \ C(q, (1 + ε)r) would be a function of O(n/K), which is large. To overcome this, we use twice the number of buckets of the original algorithm; the probability of collision between q and a point p ∈ P \ C(q, (1 + ε)r) then decreases by a factor of n. Thus, the total number of collisions of q with points in P \ C(q, (1 + ε)r) remains small even when we must examine all O(n/K) clusters.
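The effect of the doubled hash length can be made explicit with the standard parameter choice of [1]. This is only a sketch under the assumption that each hash table of the original scheme concatenates k = log_{1/q}(n) sampled bits, so that the doubled table uses 2k bits (here p and q denote the collision probabilities defined in the Introduction, not the query point):

\[
q^{2k} = \left(q^{k}\right)^{2} = \frac{1}{n^{2}} = \frac{1}{n}\, q^{k},
\qquad
p^{2k} = \left(p^{k}\right)^{2} = n^{-2\rho}.
\]

So the collision probability with a far point drops by a factor of n compared with the original parameters, while a near point now collides in a fixed hash table with probability n^{−2ρ}; keeping the success probability constant therefore requires n^{2ρ} hash tables instead of n^ρ.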
V. COMPLEXITY ANALYSIS
Given any query point q, let f be the number of candidates of q. We get:

Theorem 1. Given a parameter K > 0 and any LSH algorithm (with exponent parameter ρ as in [1] or [2]), there exists a data structure for the (r, 1 + ε)-NN problem in the Hamming space with:
• preprocessing time O(d(n/K)K^{1+2ρ} + dn·log(K)),
• space O((n/K)K^{1+2ρ} + dn),
• query time O(d·(n/K) + f·d·K^{2ρ}).
Proof: For each loop of the clustering algorithm, we need to determine the O(K) nearest neighbors of a center. This requires O(dK·log(K)) processing time using a sorting algorithm. As each loop determines a cluster containing O(K) points of P, Algorithm 1 terminates after O(n/K) loops. Hence, Algorithm 1 uses O((n/K)·dK·log(K)) = O(dn·log(K)) processing time. The processing time for building the buckets in each cluster is O(d·K^{1+2ρ}). Hence, the total preprocessing time is O(d(n/K)K^{1+2ρ} + dn·log(K)).

For each loop of the clustering algorithm, we determine the O(K) nearest neighbors of a center using the sorting algorithm. An efficient sorting algorithm only requires O(dK) space. Hence, after O(n/K) loops, Algorithm 1 uses O((n/K)·dK) = O(dn) space. The space needed for storing the buckets of each cluster in the searching algorithm is O(K^{1+2ρ} + dK). Hence, the total space is O((n/K)K^{1+2ρ} + dn).

The total query time consists of the time for finding the f candidate clusters among the O(n/K) clusters in step 1 and the time for searching a (1 + ε)r-near neighbor in each cluster found in step 1. Hence, the total query time is O(d·(n/K)) + O(f·d·K^{2ρ}).
If we let K = O(n^{1/(1+2ρ)}), then the preprocessing time is O(n^{1 + 2ρ/(1+2ρ)} + dn·log(n)), the space is O(n^{1 + 2ρ/(1+2ρ)} + dn), and the query time is O((f + 1)·d·n^{2ρ/(1+2ρ)}). So we get:

Theorem 2. There exists a data structure for the (r, 1 + ε)-NN problem in the Hamming space with:
• preprocessing time O(n^{1 + 2ρ/(1+2ρ)} + dn·log(n)),
• space O(n^{1 + 2ρ/(1+2ρ)} + dn),
• query time O(f·d·n^{2ρ/(1+2ρ)}).

Next, we show that the value of f is small on average.
Lemma 2. Given any query point q and a parameter f ≥ 2, if c_1, c_2, ..., c_f are candidates of q, then for every 1 ≤ i ≤ f: R(c_i) − r ≤ d(q, c_i) ≤ R(c_i) + r.

Proof: It follows directly from Definition 3.
Lemma 3. Given any query point q and a K-cluster with center c, the probability that R(c) − r ≤ d(q, c) ≤ R(c) + r is smaller than 2r/d.

Proof: There are two cases. If R(c) > r, then since R(c) − r ≤ d(q, c) ≤ R(c) + r, the distance d(q, c) belongs to a range of length 2r. If R(c) ≤ r, then since d(q, c) ≤ R(c) + r, we have 0 < d(q, c) ≤ 2r, so d(q, c) also belongs to a range of length 2r. Thus, in both cases d(q, c) must fall in a range of length 2r out of the d possible values of the Hamming distance, so the probability that a point q satisfies this is smaller than 2r/d.
Lemma 4. Given any query point q, the probability that q has f candidates is (2r/d)^f.

Proof: It follows directly from Lemma 2 and Lemma 3.

Lemma 5. Given any query point q, the average value of f for q is smaller than d/(d − 2r).

Proof: As 0 ≤ f ≤ n/K, the average value of f is
Σ_{i=0}^{n/K} (2r/d)^i = [1 − (2r/d)^{n/K+1}] / [1 − 2r/d] < d/(d − 2r).
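For a sense of scale (the numbers d = 128 and r = 16 are chosen purely for illustration and satisfy d > 4r), writing f̄ for the average value of f, the bound of Lemma 5 gives

\[
\bar{f} < \frac{d}{d-2r} = \frac{128}{128-32} = \frac{4}{3} < 2 ,
\]

so on average a query has a single candidate cluster in this setting.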
From Lemma 5, for d > 4r the average number of candidates is smaller than 2, so on average f = 1. We now return to the complexity analysis of the algorithm. We see that n^{2ρ/(1+2ρ)} < n^ρ if 2/(1+2ρ) < 1; hence n^{2ρ/(1+2ρ)} < n^ρ if ε < 1 for the case of [1] and if ε < 0.5 for the case of [2]. As proved above, the average value of f is very small (if d > 4r then f = 1). Hence, our algorithm is better than the original algorithms of [1] and [2] when ε is small as mentioned and f is small.
VI. CONCLUSION

In this paper, we presented an algorithm for the (r, 1 + ε)-NN problem in the Hamming space. The algorithm uses a new clustering technique, the triangle inequality and the locality sensitive hashing approach, which allows solving the (r, 1 + ε)-NN problem in high-dimensional spaces. We achieved O(n^{1 + 2ρ/(1+2ρ)} + dn) space and O(f·d·n^{2ρ/(1+2ρ)}) query time, where f is generally a small integer and ρ is the exponent parameter of the underlying algorithm. Our result is an improvement over those of [1], [2] in the case where the parameters ε and f are small.

Our algorithm can be investigated more deeply by tuning its parameters, and it can be applied in different applications with high-dimensional data, such as movie recommendation and speech recognition, which will be our future work.
ACKNOWLEDGEMENT

We thank our colleagues in Z8, FPT Software, in particular Henry Tu, for insightful discussions about the problem and for reviewing our paper.
REFERENCES

[1] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ser. STOC '98. New York, NY, USA: ACM, 1998, pp. 604–613. [Online]. Available: http://doi.acm.org/10.1145/276698.276876

[2] A. Andoni and I. Razenshteyn, "Optimal data-dependent hashing for approximate near neighbors," CoRR, vol. abs/1501.01062, 2015. [Online]. Available: http://arxiv.org/abs/1501.01062

[3] H. Samet, Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005.

[4] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," J. ACM, vol. 45, no. 6, pp. 891–923, Nov. 1998. [Online]. Available: http://doi.acm.org/10.1145/293347.293348

[5] J. M. Kleinberg, "Two algorithms for nearest-neighbor search in high dimensions," in Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, ser. STOC '97. New York, NY, USA: ACM, 1997, pp. 599–608. [Online]. Available: http://doi.acm.org/10.1145/258533.258653

[6] E. Kushilevitz, R. Ostrovsky, and Y. Rabani, "Efficient search for approximate nearest neighbor in high dimensional spaces," in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ser. STOC '98. New York, NY, USA: ACM, 1998, pp. 614–623. [Online]. Available: http://doi.acm.org/10.1145/276698.276877

[7] S. Har-Peled, "A replacement for Voronoi diagrams of near linear size," in Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, ser. FOCS '01. Washington, DC, USA: IEEE Computer Society, 2001, pp. 94–. [Online]. Available: http://dl.acm.org/citation.cfm?id=874063.875592

[8] N. Ailon and B. Chazelle, "Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform," in Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, ser. STOC '06. New York, NY, USA: ACM, 2006, pp. 557–563. [Online]. Available: http://doi.acm.org/10.1145/1132516.1132597

[9] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the Twentieth Annual Symposium on Computational Geometry, ser. SCG '04. New York, NY, USA: ACM, 2004, pp. 253–262. [Online]. Available: http://doi.acm.org/10.1145/997817.997857

[10] S. Har-Peled and S. Mazumdar, "On coresets for k-means and k-median clustering," in Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, ser. STOC '04. New York, NY, USA: ACM, 2004, pp. 291–300. [Online]. Available: http://doi.acm.org/10.1145/1007352.1007400

[11] A. Chakrabarti and O. Regev, "An optimal randomised cell probe lower bound for approximate nearest neighbor searching," in Proceedings of the Symposium on Foundations of Computer Science.

[12] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in Proceedings of the 25th International Conference on Very Large Data Bases, ser. VLDB '99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, pp. 518–529. [Online]. Available: http://dl.acm.org/citation.cfm?id=645925.671516

[13] J. Buhler, "Efficient large-scale sequence comparison by locality-sensitive hashing," Bioinformatics, vol. 17, no. 5, pp. 419–428, 2001. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/17.5.419

[14] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang, "Finding interesting associations without support pruning," in ICDE, 2000, pp. 489–500. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2000.839448

[15] B. Georgescu, I. Shimshoni, and P. Meer, "Mean shift based clustering in high dimensions: A texture classification example," in Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2, ser. ICCV '03. Washington, DC, USA: IEEE Computer Society, 2003, pp. 456–. [Online]. Available: http://dl.acm.org/citation.cfm?id=946247.946595

[16] J. Buhler, "Provably sensitive indexing strategies for biosequence similarity search," in Proceedings of the Sixth Annual International Conference on Computational Biology, ser. RECOMB '02. New York, NY, USA: ACM, 2002, pp. 90–99. [Online]. Available: http://doi.acm.org/10.1145/565196.565208

[17] R. Motwani, A. Naor, and R. Panigrahy, "Lower bounds on locality sensitive hashing," in Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, ser. SCG '06. New York, NY, USA: ACM, 2006, pp. 154–157. [Online]. Available: http://doi.acm.org/10.1145/1137856.1137881

[18] A. Andoni, P. Indyk, H. L. Nguyen, and I. Razenshteyn, "Beyond locality-sensitive hashing," in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2014.

[19] X. Wang, "A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality," in The 2011 International Joint Conference on Neural Networks (IJCNN 2011), San Jose, California, USA, July 31 - August 5, 2011, pp. 1293–1299. [Online]. Available: http://dx.doi.org/10.1109/IJCNN.2011.6033373

[20] S. Ougiaroglou and G. Evangelidis, "Efficient k-NN classification based on homogeneous clusters," Artif. Intell. Rev., vol. 42, no. 3, pp. 491–513, Oct. 2014. [Online]. Available: http://dx.doi.org/10.1007/s10462-013-9411-1

[21] A. McCallum, K. Nigam, and L. H. Ungar, "Efficient clustering of high-dimensional data sets with application to reference matching," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '00. New York, NY, USA: ACM, 2000, pp. 169–178. [Online]. Available: http://doi.acm.org/10.1145/347090.347123