Fast Approximate Near Neighbor Algorithm by Clustering in High Dimensions

Hung Tran-The
Z8 FPT Software JSC, Hanoi, Vietnam
Email: hungtt5@fsoft.com.vn

Vinh Nguyen Van
University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
Email: vinhnv@vnu.edu.vn

Minh Hoang Anh
Z8 FPT Software JSC, Hanoi, Vietnam
Email: minhha@fsoft.com.vn
Abstract—This paper addresses the (r, 1 + ε)-approximate near neighbor problem (or (r, 1 + ε)-NN), which is defined as follows: given a set of n points in a d-dimensional space, a query point q and a parameter 0 < δ < 1, build a data structure that reports a point within distance (1 + ε)r from q with probability 1 − δ, provided there is a point in the data set within distance r from q. We present an algorithm for this problem in the Hamming space that uses a new clustering technique, the triangle inequality and the locality sensitive hashing approach. Our algorithm achieves O(n^{1 + 2ρ/(1+2ρ)} + dn) space and O(f·d·n^{2ρ/(1+2ρ)}) query time, where f is generally a small integer and ρ is the exponent parameter of the algorithm in (STOC 1998) [1] or (STOC 2015) [2]. Our results show that we can improve the algorithms in [1], [2] when ε is small (ε < 1 for [1] and ε < 0.5 for [2]).
I. INTRODUCTION
A similarity search problem involves a collection of objects (documents, images, etc.) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space. Given a query and a set of n objects in the form of points in the d-dimensional space, we are required to find the nearest (most similar) objects to the query. This is called the nearest neighbor problem. The problem is of major importance to a variety of applications, such as data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis.
There are several efficient algorithms known for the case when the dimension d is "low" (see [3] for an overview). However, the main issue is dealing with high-dimensional data. Despite decades of intensive effort, the current solutions suffer from either space or query time that is exponential in d. In fact, for large enough d, in theory or in practice, they often provide little improvement over a linear-scan algorithm that compares the query to each point in the database. This phenomenon is often called "the curse of dimensionality".
In recent years, several researchers have proposed methods for overcoming this state of affairs by using approximation algorithms for the problem. In that formulation, a (1 + ε)-approximate nearest neighbor algorithm is allowed to return a point whose distance from the query is at most (1 + ε) times the distance from the query to its nearest point, where ε > 0 is called the approximation factor. Many algorithms for the (1 + ε)-approximate nearest neighbor problem can be found in [4], [1], [5], [6], [7], [8], [9], [10], [11]. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. An efficient approximation algorithm can also be used to solve the exact nearest neighbor problem, by enumerating all approximate nearest neighbors and choosing the closest point.
In [6], [1], [11], [8], the authors constructed data structures for the (1 + ε)-approximate nearest neighbor problem which avoid the curse of dimensionality. To be more specific, for any constant ε > 0, the data structures support queries in time O(d·log(n)) and use space which is polynomial in n. Unfortunately, the exponent in the space bounds is roughly C/ε² (for ε < 1), where C is a non-negligible constant. Thus, even for, say, ε = 1, the space used by the data structure is large enough that the algorithm becomes impractical even for relatively small data sets. From a practical perspective, the space used by an algorithm should be as close to linear as possible. If the space bound is (say) sub-quadratic and the approximation factor c is a constant, the best existing solutions are based on locality sensitive hashing (LSH). One other attractive feature of the locality sensitive hashing approach is that it enjoys a rigorous theoretical performance guarantee even in the worst case.
Locality Sensitive Hashing: The core idea of the LSH approach is to hash items in a similarity-preserving way, i.e., it tries to store similar items in the same buckets while keeping dissimilar items in different buckets. In [1], [12] the authors provided such locality-sensitive hash functions for the case when the points live in the binary Hamming space {0, 1}^d. The LSH approach proceeds in two steps: training and querying. In the training step, LSH first builds a hash function h(p) = (h_1(p), h_2(p), ..., h_m(p)) with h_i(p) ∈ {0, 1}, where m is the code length. The h_i are projections of the input point onto one of the coordinates, that is, hash functions of the form h_i(p) = p_i for i = 1, ..., m. Then, LSH represents each item in the database as a hash code via the mapping h(x), and constructs a hash table by
hashing each item into the bucket indexed by its code. In the querying step, LSH first converts the query into a hash code, and then searches for its near neighbor in the bucket indexed by that code. If the probability of collision is at least p = 1 − r/d for the close points and at most q = 1 − (1 + ε)r/d for the far points, this yields an algorithm for (r, 1 + ε)-NN using O(n^{1+ρ} + nd) space, O(dn^{1+ρ}) preprocessing time, and O(dn^ρ) query time, where ρ = log(1/p)/log(1/q). For r < d/log(n), ρ < 1/(1 + ε).
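To make the scheme above concrete, the following Python sketch illustrates bit-sampling LSH over {0, 1}^d in the spirit of [1], [12]. It is only an illustration under our own naming (the class BitSamplingLSH and the parameters m and L are ours, not the authors'); the analysis above roughly corresponds to choosing m = log_{1/q}(n) sampled bits per table and L = n^ρ tables.

```python
import random
from collections import defaultdict

class BitSamplingLSH:
    """Illustrative bit-sampling LSH for binary vectors.
    Each of the L tables hashes a point by concatenating m randomly
    chosen coordinates (the projections h_i(p) = p_i described above)."""

    def __init__(self, dim, m, L, seed=0):
        rng = random.Random(seed)
        # For each table, sample the m coordinates it projects on.
        self.projections = [[rng.randrange(dim) for _ in range(m)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, point, coords):
        return tuple(point[i] for i in coords)

    def index(self, points):
        self.points = points
        for pid, p in enumerate(points):
            for table, coords in zip(self.tables, self.projections):
                table[self._key(p, coords)].append(pid)

    def query(self, q, radius):
        # Scan the colliding points and report one within the target radius, if any.
        for table, coords in zip(self.tables, self.projections):
            for pid in table.get(self._key(q, coords), []):
                if sum(a != b for a, b in zip(self.points[pid], q)) <= radius:
                    return pid
        return None
```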
Related Works: In a follow-up work [9], the authors introduced LSH functions that work directly in Euclidean space and result in a faster running time. The latter algorithm forms the basis of the E2LSH package for high-dimensional similarity search, which has been used in several applied scenarios.

In this paper, we focus only on the Hamming space and on improving the efficiency of the locality sensitive hashing approach. The efficiency can be improved via the exponent ρ. The LSH algorithm has since been used in numerous applied settings, such as [13], [12], [14], [15], [16].
The work of [17] showed that LSH for the Hamming space must have ρ ≥ 1/(2(1 + ε)) − O(1/(1 + ε)) − o(1). In a recent paper (SODA 2014) [18], Alexandr Andoni et al. proposed a version of the LSH approach using essentially the same LSH function families as described in [1]. However, the properties of those hash functions, as well as the overall algorithm, are different. Their approach leads to a two-level hashing algorithm. The outer hash table partitions the data set into buckets of bounded diameter. Then, for each bucket, they build an inner hash table, which uses (after some pruning) the center of the minimum enclosing ball of the points in the bucket as a center point. This hashing type is data-dependent hashing, i.e., a randomized hash family that itself depends on the actual points in the dataset. Their algorithm achieves O(n^ρ + d·log(n)) query time and O_c(n^{1+ρ} + d·log(n)) space, where ρ ≤ 7/(8(1 + ε)) + O(1/(1 + ε)^{3/2}) + o(1). In a very recent paper of the same authors (accepted at STOC 2015) [2], they continue to study data-dependent hashing and the optimality of the exponent ρ. Their algorithm achieves O(n^{1+ρ} + dn) space and O(dn^ρ) query time, where ρ ≤ 1/(2(1 + ε) − 1).
Contribution: We present an algorithm for the (r, 1 + ε)-NN problem in the Hamming space that uses a new clustering technique and the triangle inequality as a preprocessing stage. To the best of our knowledge, this is the first paper to use a clustering algorithm as a preprocessing step before applying the LSH approach to solve the (r, 1 + ε)-NN problem. Our algorithm achieves the following advantages:
• We give an improvement over the classic LSH algorithm of [1] and over the very recent algorithm of [2] in the case where ε is small. By choosing K = O(n^{1/(1+2ρ)}), we obtain an algorithm with O(n^{1 + 2ρ/(1+2ρ)} + dn) space and O(f·d·n^{2ρ/(1+2ρ)}) query time, where f is generally a small integer; see Table I. We have n^{2ρ/(1+2ρ)} < n^ρ if 2/(1+2ρ) < 1, and this holds if ε < 1 for the case of [1] and if ε < 0.5 for the case of [2] (see the short derivation after this list). Hence, our algorithm is better than the original algorithms of [1] and [2] when ε is small as mentioned and f is small. In fact, we also prove that the average value of f is smaller than d/(d − 2r); therefore, if d > 4r, the average value is 1. Note that there are not many LSH algorithms for the (r, 1 + ε)-NN problem in the Hamming space; we can cite here the two algorithms in [1], [2].

Table I. Space and time bounds for LSH algorithms and our algorithm.

Algorithm      | Query time             | Space                       | Exponent
STOC 1998 [1]  | O(dn^ρ)                | O(n^{1+ρ} + dn)             | ρ < 1/(1+ε)
STOC 2015 [2]  | O(dn^ρ)                | O(n^{1+ρ} + dn)             | ρ ≤ 1/(2(1+ε)−1)
Our results    | O(f·d·n^{2ρ/(1+2ρ)})   | O(n^{1+2ρ/(1+2ρ)} + dn)     | ρ of the original algorithm
• We provide a simple algorithm consisting of two procedures: indexing and querying.
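For completeness, here is the short calculation behind the exponent comparison claimed in the first item; it uses only the values of ρ stated above for [1] and [2]:

\[
n^{\frac{2\rho}{1+2\rho}} < n^{\rho}
\;\Longleftrightarrow\; \frac{2\rho}{1+2\rho} < \rho
\;\Longleftrightarrow\; \frac{2}{1+2\rho} < 1
\;\Longleftrightarrow\; \rho > \frac{1}{2}.
\]

With ρ = 1/(1+ε) as in [1], the condition ρ > 1/2 is exactly ε < 1; with ρ = 1/(2(1+ε)−1) = 1/(1+2ε) as in [2], it is exactly ε < 0.5.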
The remainder of the paper is structured as follows. Section II provides a formal definition of the problem we are tackling. Our solution to the problem is presented in Section III. We then give the correctness of our algorithm in Section IV and the complexity analysis in Section V. Finally, Section VI concludes the paper.
II. PROBLEM DEFINITION
In this paper, we solve the (r, 1 + ε)-approximate near neighbor problem in the Hamming space:

Definition 1 ((r, 1 + ε)-approximate near neighbor, or (r, 1 + ε)-NN). Given a set P of points in a d-dimensional Hamming space H^d and δ > 0, construct a data structure which, given any query point q, does the following task with probability 1 − δ: if there exists an r-near neighbor of q in P, it reports a (1 + ε)r-near neighbor of q in P.

Similarly to [1], [12], δ is an absolute constant bounded away from 1. Formally, an r-near neighbor of q is a point p such that d(p, q) ≤ r, where d(p, q) denotes the Hamming distance between p and q. The set C(q, r) is the set containing all r-near neighbors of q.
Observe that (r, 1 + ε)-NN is simply a decision version of the approximate nearest neighbor problem. Although in many applications solving the decision version is good enough, one can also reduce the approximate nearest neighbor problem to the approximate near neighbor problem via a binary-search-like approach. In particular, it is known [1] that the (1 + ε)-approximate NN problem reduces to O(log(n/ε)) instances of (r, 1 + ε)-NN. Thus, the complexity of (1 + ε)-approximate NN is the same (within a log factor) as that of the (r, 1 + ε)-NN problem.
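The flavor of this reduction can be sketched as follows. This is a simplified illustration, not the exact construction of [1]: it glosses over the failure probability of the individual structures and over the precise choice of radii, and the helper build_rnn is a hypothetical factory for (r, 1 + ε)-NN structures.

```python
def build_nn_structures(points, r_min, r_max, eps, build_rnn):
    """Build (r, 1+eps)-NN structures for a geometric sequence of radii.
    `build_rnn(points, r, eps)` is assumed to return an object whose
    query(q) reports a (1+eps)r-near neighbor of q, or None."""
    radii = []
    r = r_min
    while r <= r_max:
        radii.append(r)
        r *= (1 + eps)
    return radii, [build_rnn(points, rad, eps) for rad in radii]

def approx_nearest_neighbor(q, radii, structures):
    """Binary search for the smallest radius whose structure reports a point."""
    lo, hi, best = 0, len(radii) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        p = structures[mid].query(q)
        if p is not None:
            best, hi = p, mid - 1   # a point was found; try a smaller radius
        else:
            lo = mid + 1            # nothing reported at this radius; go larger
    return best
```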
III. ALGORITHM
The key idea of our algorithm is that one can greatly reduce the search space by cheaply partitioning the data into clusters, which we call stage 1. Then we use the triangle inequality to select a few matched clusters and finally use an LSH algorithm for each selected cluster; we call this stage 2. Such a space-partitioning approach is very natural for searching problems and is applicable in many settings. For example, [19] solves the exact K-nearest neighbors problem using K-means clustering in order to reduce the search space, and [20] solves exact K-nearest neighbors classification based on homogeneous clusters.
There are a few clustering algorithms that could be applied in our case, such as the K-means algorithm and the Canopy algorithm [21]. The key idea of K-means is to partition the data into K clusters in which each data point belongs to the cluster with the nearest mean, while the key idea of the Canopy algorithm is to use a cheap, approximate distance measure to efficiently divide the data into overlapping subsets, called canopies, with the same radius. However, the disadvantage of such algorithms is that (1) we do not know how many points are in a cluster and (2) the computation for the K-means algorithm is expensive, while the canopies may overlap for the Canopy algorithm. We introduce here a simple clustering algorithm that ensures two properties: (1) we know exactly how many points are in a cluster, and (2) the computation for this algorithm is cheap.
A. Clustering Algorithm
Definition 2. A K-cluster with center c is a subset S of P such that:
1) |S| = O(K),
2) all points of S are among the O(K) nearest neighbors of c.
The idea of K-nearest-neighbor clustering is as follows: first, we randomly choose a point c of P as the center of a cluster, then we compute the O(K) nearest neighbors of c using the Merge sort algorithm. The Merge sort algorithm sorts the distances of the data set to c in ascending order and stores them in an array. We select the O(K) points that correspond to the O(K) first elements of the array. These O(K) points and the point c form a cluster whose radius is the O(K)-th element of the array; hence, the distances of these O(K) points to c are at most this radius. Afterwards, we remove these points from P and randomly choose another point as the center of the next cluster. The process stops when all points of P have been removed and each belongs to a cluster. The details of the algorithm are presented in Figure 1.
Figure 2 illustrates an example of the result of the K-nearest-neighbor clustering algorithm. Given a set P with 12 points {a, b, c, d, e, f, g, m, n, o, p, q}, we obtain 3 clusters with centers p, q, t, and each cluster contains 4 points.
Input: a parameter K and a set P = {p_1, p_2, ..., p_n}
Output: a set of clusters C
    C = ∅
    while P ≠ ∅:
        select randomly a point c from P to initialize a cluster
        search for the set N(c) of the O(K) nearest neighbors of c in P,
            by using the Merge sort algorithm on the distances of the data to c
        add the points in N(c) to the cluster of c
        add the cluster of c to the set C
        remove all points of the cluster of c from P

Figure 1. K-Nearest-Neighbor Clustering Algorithm
Figure 2. Example of the clustering algorithm.
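The clustering stage of Figure 1 is short enough to sketch directly. The Python snippet below is an illustrative implementation under our own naming (knn_clustering, hamming), not the authors' code; it sorts the remaining distances in full, matching the Merge-sort description above, and records each cluster's radius for use in the searching stage.

```python
import random

def hamming(p, q):
    """Hamming distance between two equal-length binary tuples."""
    return sum(a != b for a, b in zip(p, q))

def knn_clustering(points, K, seed=0):
    """Sketch of the K-nearest-neighbor clustering of Figure 1.
    Returns a list of clusters, each with its center index, member indices and radius."""
    rng = random.Random(seed)
    remaining = list(range(len(points)))                      # indices of points still in P
    clusters = []
    while remaining:
        c = remaining.pop(rng.randrange(len(remaining)))      # random center
        # Sort the remaining points by distance to the center (Merge sort in the paper).
        remaining.sort(key=lambda i: hamming(points[i], points[c]))
        members = remaining[:K]                               # the K nearest neighbors of c
        radius = hamming(points[members[-1]], points[c]) if members else 0
        clusters.append({"center": c, "members": [c] + members, "radius": radius})
        remaining = remaining[K:]                             # remove clustered points from P
    return clusters
```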
B. Searching Algorithm
In the second stage, we select clusters according to some conditions, which we call step 1, and then use an LSH algorithm on each selected cluster in order to search for the near neighbor, which we call step 2. Since there are O(K) points in each cluster, there are at most O(n/K) clusters. If, by some elimination mechanism such as branch and bound, we select only a few clusters, then the search space for the LSH algorithm is greatly reduced.
Step 1: After the clustering algorithm, we obtain a set of clusters C. For simplicity, for each cluster with center c, let R(c) be the radius of the cluster. We use the triangle inequality in order to select only a few clusters among all clusters. Such a cluster is called a candidate, defined as follows:

Definition 3. Given a query point q, we say that a cluster with center c is a candidate of q if the two following conditions are satisfied:
• d(q, c) ≤ R(c) + r,
• for each cluster with center c′ such that c′ ≠ c, d(q, c′) ≥ R(c′) − r.
Step 2: In this step, we select an LSH algorithm ([1] or [2]) for (r, 1 + ε)-NN in order to search for the near neighbor in each cluster selected in step 1. All points in each cluster are organized according to the preprocessing algorithm of the selected LSH algorithm. The only difference is that we use twice the number of buckets of the selected algorithm in each hash table, and n^{2ρ} hash tables instead of n^ρ hash tables.
Input: a parameter K and a set of clusters C
    Candidates = ∅
Query algorithm for a query point q:
    Step 1:
        for each c ∈ C:
            if c is a candidate of q then
                add c to Candidates
    Step 2:
        for each c in Candidates:
            LSHAlgorithm(q, c); if a point is reported, stop

Figure 3. Searching Algorithm using the LSH Approach
The reason for this will be discussed in Section IV. We denote by LSHAlgorithm(q, c) the query algorithm of the selected LSH algorithm, where q is the query point and c is the center of a cluster. This query algorithm returns a near neighbor if it finds one in the cluster with center c; otherwise, it returns null. The searching algorithm is given in Figure 3.
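As an illustration, the following Python sketch wires the two steps together, using the clusters produced by knn_clustering from the previous sketch. The parameter lsh_query stands for the query procedure of the selected LSH algorithm applied to one cluster; it is an assumed placeholder, not part of the paper.

```python
def hamming(p, q):
    # Hamming distance, as in the clustering sketch above.
    return sum(a != b for a, b in zip(p, q))

def is_candidate(q, cluster, clusters, r, points):
    """Check the two conditions of Definition 3 for one cluster."""
    c = points[cluster["center"]]
    if hamming(q, c) > cluster["radius"] + r:
        return False
    # Every other cluster must satisfy d(q, c') >= R(c') - r.
    return all(hamming(q, points[other["center"]]) >= other["radius"] - r
               for other in clusters if other is not cluster)

def search(q, clusters, r, points, lsh_query):
    """Step 1: keep only the candidate clusters. Step 2: query each with LSH."""
    candidates = [cl for cl in clusters if is_candidate(q, cl, clusters, r, points)]
    for cl in candidates:
        p = lsh_query(q, cl)      # query procedure of the selected LSH scheme
        if p is not None:
            return p              # report the near neighbor found and stop
    return None
```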
IV. CORRECTNESS
Lemma 1. Given any query point q, if there exists p ∈ P with p ∈ C(q, r), then there exists a cluster in C such that (1) p belongs to the cluster and (2) this cluster is a candidate of q.

Proof: According to the clustering algorithm, p belongs to exactly one cluster, say the cluster with center c. We show that this cluster is a candidate of q. Indeed, by the triangle inequality in the Hamming space, d(q, c) ≤ d(c, p) + d(p, q). Moreover, as p is within the cluster with center c, d(c, p) ≤ R(c), and as p ∈ C(q, r), d(p, q) ≤ r. So we get d(q, c) ≤ R(c) + r.

Now consider any cluster with center c′ other than c. By the triangle inequality, d(q, c′) ≥ d(c′, p) − d(p, q). As p is within the cluster with center c, p is not within the cluster with center c′, so d(c′, p) ≥ R(c′). Moreover, d(p, q) ≤ r. Hence, d(q, c′) ≥ R(c′) − r. By Definition 3, the lemma holds.
Similarly to [1], the correctness of our algorithm holds if we can ensure that, with constant probability, the following two properties hold:
• if there exists p ∈ C(q, r), then q collides with p;
• the total number of collisions of q with points in P \ C(q, (1 + ε)r) is small.

Property 1 holds thanks to Lemma 1. Indeed, given any query point q, if there is a point p that is an r-near neighbor of q, then p must lie in a candidate cluster of q; hence, by the correctness of the LSH algorithm used for this candidate cluster in stage 2, we obtain property 1. For property 2, unlike in [1] or [2], we need to use twice the number of buckets of the original algorithm in each hash table, and n^{2ρ} hash tables instead of n^ρ. If we used n^ρ hash tables, then in the worst case we would have to consider all O(n/K) clusters, and with the parameters of the original algorithm the number of collisions of q with points in P \ C(q, (1 + ε)r) would be a function of O(n/K), which is large. To overcome this, we use twice the number of buckets of the original algorithm; the probability of collision between q and a point p ∈ P \ C(q, (1 + ε)r) then decreases by a factor of n. Thus, the total number of collisions of q with points in P \ C(q, (1 + ε)r) remains small even when we must examine all O(n/K) clusters.
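The effect of the doubled hash length can be made explicit with the standard parameter choice of [1]. This is only a sketch under the assumption that each hash table of the original scheme concatenates k = log_{1/q}(n) sampled bits, so that the doubled table uses 2k bits (here p and q denote the collision probabilities defined in the Introduction, not the query point):

\[
q^{2k} = \left(q^{k}\right)^{2} = \frac{1}{n^{2}} = \frac{1}{n}\, q^{k},
\qquad
p^{2k} = \left(p^{k}\right)^{2} = n^{-2\rho}.
\]

So the collision probability with a far point drops by a factor of n compared with the original parameters, while a near point now collides in a fixed hash table with probability n^{−2ρ}; keeping the success probability constant therefore requires n^{2ρ} hash tables instead of n^ρ.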
V. COMPLEXITY ANALYSIS
Given any query point q, let f be the number of candidates of q. We get:

Theorem 1. Given a parameter K > 0 and any LSH algorithm (with exponent parameter ρ as in [1] or [2]), there exists a data structure for the (r, 1 + ε)-NN problem in the Hamming space with:
• preprocessing time O(d(n/K)K^{1+2ρ} + dn·log(K)),
• space O((n/K)K^{1+2ρ} + dn),
• query time O(d·(n/K) + f·d·K^{2ρ}).
Proof: For each loop of the clustering algorithm, we need to determine the O(K) nearest neighbors of a center. This requires O(dK·log(K)) processing time using a sorting algorithm. As each loop determines a cluster containing O(K) points of P, Algorithm 1 terminates after O(n/K) loops. Hence, Algorithm 1 uses O((n/K)·dK·log(K)) = O(dn·log(K)) processing time. The processing time for building the buckets in each cluster is O(d·K^{1+2ρ}). Hence, the total preprocessing time is O(d(n/K)K^{1+2ρ} + dn·log(K)).

For each loop of the clustering algorithm, we determine the O(K) nearest neighbors of a center using the sorting algorithm. An efficient sorting algorithm only requires O(dK) space. Hence, after O(n/K) loops, Algorithm 1 uses O((n/K)·dK) = O(dn) space. The space needed for storing the buckets of each cluster in the searching algorithm is O(K^{1+2ρ} + dK). Hence, the total space is O((n/K)K^{1+2ρ} + dn).

The total query time consists of the time for finding the f candidate clusters among the O(n/K) clusters in step 1 and the time for searching a (1 + ε)r-near neighbor in each cluster found in step 1. Hence, the total query time is O(d·(n/K)) + O(f·d·K^{2ρ}).
If we let K = O(n^{1/(1+2ρ)}), then the preprocessing time is O(n^{1 + 2ρ/(1+2ρ)} + dn·log(n)), the space is O(n^{1 + 2ρ/(1+2ρ)} + dn), and the query time is O((f + 1)·d·n^{2ρ/(1+2ρ)}). So we get:

Theorem 2. There exists a data structure for the (r, 1 + ε)-NN problem in the Hamming space with:
• preprocessing time O(n^{1 + 2ρ/(1+2ρ)} + dn·log(n)),
• space O(n^{1 + 2ρ/(1+2ρ)} + dn),
• query time O(f·d·n^{2ρ/(1+2ρ)}).

Next, we show that the value of f is small on average.
Lemma 2. Given any query point q and a parameter f ≥ 2, if c_1, c_2, ..., c_f are candidates of q, then for every 1 ≤ i ≤ f: R(c_i) − r ≤ d(q, c_i) ≤ R(c_i) + r.

Proof: It follows directly from Definition 3.
Lemma 3. Given any query point q and a K-cluster with center c, the probability that R(c) − r ≤ d(q, c) ≤ R(c) + r is smaller than 2r/d.

Proof: There are two cases. If R(c) > r, then since R(c) − r ≤ d(q, c) ≤ R(c) + r, the distance d(q, c) belongs to a range of length 2r. If R(c) ≤ r, then since d(q, c) ≤ R(c) + r, we have 0 < d(q, c) ≤ 2r, so d(q, c) also belongs to a range of length 2r. Thus, in both cases d(q, c) must fall in a range of length 2r out of the d possible values of the Hamming distance, so the probability that a point q satisfies this is smaller than 2r/d.
Lemma 4. Given any query point q, the probability that q has f candidates is (2r/d)^f.

Proof: It follows directly from Lemma 2 and Lemma 3.

Lemma 5. Given any query point q, the average value of f for q is smaller than d/(d − 2r).

Proof: As 0 ≤ f ≤ n/K, the average value of f is
Σ_{i=0}^{n/K} (2r/d)^i = [1 − (2r/d)^{n/K+1}] / [1 − 2r/d] < d/(d − 2r).
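For a sense of scale (the numbers d = 128 and r = 16 are chosen purely for illustration and satisfy d > 4r), writing f̄ for the average value of f, the bound of Lemma 5 gives

\[
\bar{f} < \frac{d}{d-2r} = \frac{128}{128-32} = \frac{4}{3} < 2 ,
\]

so on average a query has a single candidate cluster in this setting.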
From Lemma 5, for d > 4r the average number of candidates is smaller than 2, so on average f = 1. We now return to the complexity analysis of the algorithm. We see that n^{2ρ/(1+2ρ)} < n^ρ if 2/(1+2ρ) < 1; hence n^{2ρ/(1+2ρ)} < n^ρ if ε < 1 for the case of [1] and if ε < 0.5 for the case of [2]. As proved above, the average value of f is very small (if d > 4r then f = 1). Hence, our algorithm is better than the original algorithms of [1] and [2] when ε is small as mentioned and f is small.
VI. CONCLUSION

In this paper, we presented an algorithm for the (r, 1 + ε)-NN problem in the Hamming space. The algorithm uses a new clustering technique, the triangle inequality and the locality sensitive hashing approach, which allows solving the (r, 1 + ε)-NN problem in high-dimensional spaces. We achieved O(n^{1 + 2ρ/(1+2ρ)} + dn) space and O(f·d·n^{2ρ/(1+2ρ)}) query time, where f is generally a small integer and ρ is the exponent parameter of the underlying algorithm. Our result is an improvement over those of [1], [2] in the case where the parameters ε and f are small.

Our algorithm can be investigated more deeply by tuning its parameters, and it can be applied in different applications with high-dimensional data, such as movie recommendation and speech recognition, which will be our future work.
ACKNOWLEDGEMENT

We thank our colleagues in Z8, FPT Software, in particular Henry Tu, for insightful discussions about the problem and for reviewing our paper.
REFERENCES

[1] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ser. STOC '98. New York, NY, USA: ACM, 1998, pp. 604–613. [Online]. Available: http://doi.acm.org/10.1145/276698.276876

[2] A. Andoni and I. Razenshteyn, "Optimal data-dependent hashing for approximate near neighbors," CoRR, vol. abs/1501.01062, 2015. [Online]. Available: http://arxiv.org/abs/1501.01062

[3] H. Samet, Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005.

[4] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," J. ACM, vol. 45, no. 6, pp. 891–923, Nov. 1998. [Online]. Available: http://doi.acm.org/10.1145/293347.293348

[5] J. M. Kleinberg, "Two algorithms for nearest-neighbor search in high dimensions," in Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, ser. STOC '97. New York, NY, USA: ACM, 1997, pp. 599–608. [Online]. Available: http://doi.acm.org/10.1145/258533.258653

[6] E. Kushilevitz, R. Ostrovsky, and Y. Rabani, "Efficient search for approximate nearest neighbor in high dimensional spaces," in Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, ser. STOC '98. New York, NY, USA: ACM, 1998, pp. 614–623. [Online]. Available: http://doi.acm.org/10.1145/276698.276877

[7] S. Har-Peled, "A replacement for Voronoi diagrams of near linear size," in Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, ser. FOCS '01. Washington, DC, USA: IEEE Computer Society, 2001, pp. 94–. [Online]. Available: http://dl.acm.org/citation.cfm?id=874063.875592

[8] N. Ailon and B. Chazelle, "Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform," in Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, ser. STOC '06. New York, NY, USA: ACM, 2006, pp. 557–563. [Online]. Available: http://doi.acm.org/10.1145/1132516.1132597

[9] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the Twentieth Annual Symposium on Computational Geometry, ser. SCG '04. New York, NY, USA: ACM, 2004, pp. 253–262. [Online]. Available: http://doi.acm.org/10.1145/997817.997857

[10] S. Har-Peled and S. Mazumdar, "On coresets for k-means and k-median clustering," in Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, ser. STOC '04. New York, NY, USA: ACM, 2004, pp. 291–300. [Online]. Available: http://doi.acm.org/10.1145/1007352.1007400

[11] A. Chakrabarti and O. Regev, "An optimal randomised cell probe lower bound for approximate nearest neighbor searching," in Proceedings of the Symposium on Foundations of Computer Science.

[12] A. Gionis, P. Indyk, and R. Motwani, "Similarity search in high dimensions via hashing," in Proceedings of the 25th International Conference on Very Large Data Bases, ser. VLDB '99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, pp. 518–529. [Online]. Available: http://dl.acm.org/citation.cfm?id=645925.671516

[13] J. Buhler, "Efficient large-scale sequence comparison by locality-sensitive hashing," Bioinformatics, vol. 17, no. 5, pp. 419–428, 2001. [Online]. Available: http://dx.doi.org/10.1093/bioinformatics/17.5.419

[14] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang, "Finding interesting associations without support pruning," in ICDE, 2000, pp. 489–500. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2000.839448

[15] B. Georgescu, I. Shimshoni, and P. Meer, "Mean shift based clustering in high dimensions: A texture classification example," in Proceedings of the Ninth IEEE International Conference on Computer Vision - Volume 2, ser. ICCV '03. Washington, DC, USA: IEEE Computer Society, 2003, pp. 456–. [Online]. Available: http://dl.acm.org/citation.cfm?id=946247.946595

[16] J. Buhler, "Provably sensitive indexing strategies for biosequence similarity search," in Proceedings of the Sixth Annual International Conference on Computational Biology, ser. RECOMB '02. New York, NY, USA: ACM, 2002, pp. 90–99. [Online]. Available: http://doi.acm.org/10.1145/565196.565208

[17] R. Motwani, A. Naor, and R. Panigrahy, "Lower bounds on locality sensitive hashing," in Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, ser. SCG '06. New York, NY, USA: ACM, 2006, pp. 154–157. [Online]. Available: http://doi.acm.org/10.1145/1137856.1137881

[18] A. Andoni, P. Indyk, H. L. Nguyen, and I. Razenshteyn, "Beyond locality-sensitive hashing," in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2014.

[19] X. Wang, "A fast exact k-nearest neighbors algorithm for high dimensional search using k-means clustering and triangle inequality," in The 2011 International Joint Conference on Neural Networks (IJCNN 2011), San Jose, California, USA, July 31 - August 5, 2011, pp. 1293–1299. [Online]. Available: http://dx.doi.org/10.1109/IJCNN.2011.6033373

[20] S. Ougiaroglou and G. Evangelidis, "Efficient k-NN classification based on homogeneous clusters," Artif. Intell. Rev., vol. 42, no. 3, pp. 491–513, Oct. 2014. [Online]. Available: http://dx.doi.org/10.1007/s10462-013-9411-1

[21] A. McCallum, K. Nigam, and L. H. Ungar, "Efficient clustering of high-dimensional data sets with application to reference matching," in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '00. New York, NY, USA: ACM, 2000, pp. 169–178. [Online]. Available: http://doi.acm.org/10.1145/347090.347123