
42 open problems in mathematics




DOCUMENT INFORMATION

Basic information

Title: 42 open problems in mathematics
Author: Afonso S. Bandeira
Institution: Massachusetts Institute of Technology
Subject: Mathematics of Data Science
Type: lecture notes
Year: 2015
City: Cambridge

Format
Pages: 165
Size: 2.69 MB


Structure

  • MIT18_S096F15_Lec7.pdf

    • 1. Introduction

    • 2. Local weak convergence of graphs

      • 2.1. Local weak limit and simple random walk

      • 2.2. Local weak limit of random regular graphs

    • 3. Enumeration of spanning trees

      • 3.1. The Matrix-Tree Theorem

      • 3.2. Tree entropy of Td and Zd

    • 4. Open problems

    • References

Contents

Topics in Mathematics of Data Science, Lecture Notes

Ten Lectures and Forty Two Open Problems in the Mathematics of Data Science

Afonso S. Bandeira, December 2015

Preface. These are notes from a course I taught at MIT in the Fall of 2015.

A couple of Open Problems

Komlós Conjecture

We start with a fascinating problem in Discrepancy Theory.

Open Problem 0.1 (Komlós Conjecture) Given $n$, let $K(n)$ denote the infimum over all real numbers such that: for all sets of $n$ vectors $u_1,\dots,u_n \in \mathbb{R}^n$ satisfying $\|u_i\|_2 \le 1$, there exist signs $\epsilon_i = \pm 1$ such that
$$\|\epsilon_1 u_1 + \epsilon_2 u_2 + \cdots + \epsilon_n u_n\|_\infty \le K(n).$$

There exists a universal constant $K$ such that $K(n) \le K$ for all $n$.

An early reference for this conjecture is Joel Spencer's book [Spe94], which emphasizes the conjecture's close connection to Spencer's Six Standard Deviations Suffice Theorem [Spe85]. In the course, we will study semidefinite programming relaxations, and recent work shows that a particular semidefinite relaxation of the conjecture holds [Nik13]. The same paper also provides a thorough account of partial progress toward the conjecture.

• It is not so difficult to show that $K(n) \le \sqrt{n}$.
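As a small illustration (not part of the original notes), the sketch below, assuming NumPy is available, brute-forces the best sign pattern for a few random unit vectors and reports the resulting $\ell_\infty$ discrepancy; the value of `n` and the Gaussian construction of the $u_i$ are arbitrary demo choices.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 8  # small enough to brute-force all 2^n sign patterns

# Random unit vectors u_1, ..., u_n in R^n (columns of U).
U = rng.standard_normal((n, n))
U /= np.linalg.norm(U, axis=0, keepdims=True)

best = np.inf
for signs in itertools.product([-1.0, 1.0], repeat=n):
    # ell-infinity norm of the signed sum eps_1 u_1 + ... + eps_n u_n
    disc = np.abs(U @ np.array(signs)).max()
    best = min(best, disc)

print(f"best achievable discrepancy for this instance: {best:.3f}")
print(f"sqrt(n) = {np.sqrt(n):.3f}")  # for comparison with the K(n) <= sqrt(n) bound
```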

Matrix AM-GM inequality

Moving to a compelling generalization of the arithmetic–geometric means inequality, this result sheds light on how with- and without-replacement sampling compare in certain randomized algorithms. By connecting a classical inequality to practical performance differences, the generalization provides a framework for analyzing sampling schemes and quantifying their impact on algorithmic behavior (see [RR12]).

Open Problem 0.2 For any collection of $d\times d$ positive semidefinite matrices $A_1,\dots,A_n$, the following are true:
$$\text{(a)}\quad \left\| \frac{1}{n!} \sum_{\sigma \in \mathrm{Sym}(n)} \prod_{j=1}^{n} A_{\sigma(j)} \right\| \le \left\| \frac{1}{n^{n}} \sum_{k_1,\dots,k_n=1}^{n} \prod_{j=1}^{n} A_{k_j} \right\|,$$
$$\text{(b)}\quad \frac{1}{n!} \sum_{\sigma \in \mathrm{Sym}(n)} \left\| \prod_{j=1}^{n} A_{\sigma(j)} \right\| \le \frac{1}{n^{n}} \sum_{k_1,\dots,k_n=1}^{n} \left\| \prod_{j=1}^{n} A_{k_j} \right\|,$$
where $\mathrm{Sym}(n)$ denotes the group of permutations of $n$ elements, and $\|\cdot\|$ the spectral norm.

These conjectures claim that products of matrices with repeated factors are larger than the corresponding products without repetitions. For a detailed discussion of the motivations and the precise formulations behind these conjectures, refer to [RR12] for conjecture (a) and [Duc12] for conjecture (b).

Recently, these conjectures have been solved for the particular case of $n = 3$, in [Zha14] for (a) and in [IKW14] for (b).

Brief Review of some linear algebra tools

Singular Value Decomposition

The Singular Value Decomposition (SVD) is one of the most useful tools for this course! Given a matrix $M \in \mathbb{R}^{m\times n}$, the SVD of $M$ is given by
$$M = U \Sigma V^T, \qquad (1)$$
where $U \in O(m)$ and $V \in O(n)$ are orthogonal matrices (meaning that $U^T U = U U^T = I$ and $V^T V = V V^T = I$) and $\Sigma \in \mathbb{R}^{m\times n}$ is a matrix with non-negative entries on its diagonal and zero entries elsewhere.

The columns of $U$ and $V$ are referred to, respectively, as left and right singular vectors of $M$, and the diagonal elements of $\Sigma$ as singular values of $M$.

Remark 0.1 Say $m \le n$; it is easy to see that we can also think of the SVD as having $U \in \mathbb{R}^{m\times n}$ where $U U^T = I$, $\Sigma \in \mathbb{R}^{n\times n}$ a diagonal matrix with non-negative entries, and $V \in O(n)$.

Spectral Decomposition

If $M \in \mathbb{R}^{n\times n}$ is symmetric then it admits a spectral decomposition
$$M = V \Lambda V^T,$$
where $V \in O(n)$ is a matrix whose columns $v_k$ are the eigenvectors of $M$ and $\Lambda$ is a diagonal matrix whose diagonal elements $\lambda_k$ are the eigenvalues of $M$. Similarly, we can write
$$M = \sum_{k=1}^{n} \lambda_k v_k v_k^T.$$

When all of the eigenvalues of $M$ are non-negative we say that $M$ is positive semidefinite and write $M \succeq 0$. In that case we can write
$$M = \left( V \Lambda^{1/2} \right) \left( V \Lambda^{1/2} \right)^T.$$
A decomposition of $M$ of the form $M = U U^T$ (such as the one above) is called a Cholesky decomposition.

The spectral norm of $M$ is defined as $\|M\| = \max_k |\lambda_k(M)|$.

Trace and norm

Given a matrix $M \in \mathbb{R}^{n\times n}$, its trace is given by
$$\mathrm{Tr}(M) = \sum_{k=1}^{n} M_{kk} = \sum_{k=1}^{n} \lambda_k(M).$$

Its Frobenius norm is given by
$$\|M\|_F = \sqrt{\sum_{i,j} M_{ij}^2} = \sqrt{\mathrm{Tr}\left(M^T M\right)}.$$

A particularly important property of the trace is that
$$\mathrm{Tr}(AB) = \sum_{i,j=1}^{n} A_{ij} B_{ji} = \mathrm{Tr}(BA).$$

Note that this implies, e.g., that $\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB)$; it does not imply, e.g., that $\mathrm{Tr}(ABC) = \mathrm{Tr}(ACB)$, which is not true in general!

Quadratic Forms

During the course we will be interested in solving problems of the type
$$\max_{\substack{V \in \mathbb{R}^{n\times d} \\ V^T V = I_{d\times d}}} \mathrm{Tr}\left( V^T M V \right).$$

Note that this is equivalent to
$$\max_{\substack{v_1,\dots,v_d \in \mathbb{R}^n \\ v_i^T v_j = \delta_{ij}}} \sum_{k=1}^{d} v_k^T M v_k, \qquad (2)$$
where $\delta_{ij}$ is the Kronecker delta (equal to 1 if $i = j$ and 0 otherwise).

When $d = 1$ this reduces to the more familiar
$$\max_{\substack{v \in \mathbb{R}^n \\ \|v\|_2 = 1}} v^T M v. \qquad (3)$$

It is easy to see (for example, using the spectral decomposition of $M$) that (3) is maximized by the leading eigenvector of $M$ and
$$\max_{\substack{v \in \mathbb{R}^n \\ \|v\|_2 = 1}} v^T M v = \lambda_{\max}(M).$$

According to Fan's theorem (see, for example, [Mos11], page 3), problem (2) is maximized by taking $v_1,\dots,v_d$ to be the $d$ leading eigenvectors of $M$, and its value is the sum of the $d$ largest eigenvalues of $M$. A practical consequence is that the solution to (2) can be computed sequentially: we solve for $d = 1$ to obtain $v_1$, then for $d = 2$ to obtain $v_2$, and so on, up to $v_d$.
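As a quick sanity check (not from the notes), the following sketch, assuming NumPy is available, compares the value of (2) at the top-$d$ eigenvectors with the sum of the $d$ largest eigenvalues of a random symmetric matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3

# Random symmetric matrix M.
A = rng.standard_normal((n, n))
M = (A + A.T) / 2

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(M)
V = eigvecs[:, -d:]                      # d leading eigenvectors as columns of V

value_from_V = np.trace(V.T @ M @ V)     # objective of (2) at this V
value_from_eigs = eigvals[-d:].sum()     # sum of the d largest eigenvalues

print(value_from_V, value_from_eigs)     # the two numbers agree (Fan's theorem)
```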

Remark 0.2 All of the tools and results above have natural analogues when the matrices have complex entries (and are Hermitian instead of symmetric).

This course is a mostly self-contained, research-oriented program designed for undergraduate students, while also welcoming graduate students who want to pursue theoretical work on algorithms that extract information from data. It emphasizes interdisciplinary connections across Mathematics, Applied Mathematics, Computer Science, Electrical Engineering, Statistics, and Operations Research, showcasing how ideas from these fields converge to address data-driven questions in theory and practice.

1 Principal Component Analysis (PCA) and some random matrix theory that will be used to understand the performance of PCA in high dimensions, through spike models.

2 Manifold Learning and Diffusion Maps: a nonlinear dimension reduction tool, alternative to PCA. Semi-supervised Learning and its relations to the Sobolev Embedding Theorem.

3 Spectral Clustering and a guarantee for its performance: Cheeger’s inequality.

4 Concentration of Measure and tail bounds in probability, both for scalar variables and matrix variables.

5 Dimension reduction through Johnson-Lindenstrauss Lemma and Gordon’s Escape Through a Mesh Theorem.

6 Compressed Sensing/Sparse Recovery, Matrix Completion, etc. If time permits, I will present Number Theory inspired constructions of measurement matrices.

7 Group Testing. This section leverages combinatorial tools to establish lower bounds on testing procedures, outlining the theoretical limits of how efficiently tests can be performed. If time permits, it also includes a concise crash course on error-correcting codes and shows how these codes can be applied to group testing to improve performance.

8 Approximation algorithms in Theoretical Computer Science and the Max-Cut problem.

9 Clustering on random graphs: the Stochastic Block Model. Basics of duality in optimization.

10 Synchronization, inverse problems on graphs, and estimation of unknown variables from pairwise ratios on compact groups.

11 Some extra material may be added, depending on time available.

At the end of nearly every lecture, I present a pair of open problems. They aren't necessarily the field's most important questions, though some are indeed significant; I aim for a mix of important, approachable, and fun challenges. Here I present two such problems, and a similar discussion of these issues is also available on my blog.

1 Principal Component Analysis in High Dimensions and the Spike Model

Dimension Reduction and PCA

PCA as best d-dimensional affine fit

We are trying to approximate each $x_k$ by
$$x_k \approx \mu + \sum_{i=1}^{d} (\beta_k)_i v_i.$$

Here $v_1,\dots,v_d$ is an orthonormal basis of a $d$-dimensional subspace of $\mathbb{R}^p$, $\mu \in \mathbb{R}^p$ is a translation, and the coefficients $\beta_k \in \mathbb{R}^d$ represent $x_k$ in this basis. If we assemble the basis vectors into the matrix $V = [v_1 \cdots v_d] \in \mathbb{R}^{p\times d}$, then $x_k$ is approximated by $x_k \approx \mu + V \beta_k$, with $V^T V = I_{d\times d}$ since the $v_i$ are orthonormal.

We will measure goodness of fit in terms of least squares and attempt to solve
$$\min_{\mu,\, V,\, \beta_k} \sum_{k=1}^{n} \left\| x_k - (\mu + V \beta_k) \right\|_2^2. \qquad (8)$$

We start by optimizing for $\mu$. It is easy to see that the first-order conditions for $\mu$ correspond to
$$\nabla_\mu \sum_{k=1}^{n} \left\| x_k - (\mu + V \beta_k) \right\|_2^2 = 0 \;\Leftrightarrow\; \sum_{k=1}^{n} \left( x_k - \mu - V \beta_k \right) = 0.$$

Thus, the optimal value $\mu^*$ of $\mu$ satisfies
$$\sum_{k=1}^{n} \left( x_k - \mu^* - V \beta_k \right) = 0.$$

Because we may take $\sum_{k=1}^{n} \beta_k = 0$ without loss of generality (any nonzero sum can be absorbed into $\mu$), the optimal $\mu$ is given by
$$\mu^* = \frac{1}{n} \sum_{k=1}^{n} x_k = \mu_n,$$
the sample mean.

We can then proceed to find the solution of (8) by solving
$$\min_{V,\, \beta_k} \sum_{k=1}^{n} \left\| x_k - \mu_n - V \beta_k \right\|_2^2. \qquad (9)$$

Let us proceed by optimizing for $\beta_k$. Since the problem decouples for each $k$, we can focus, for each $k$, on
$$\min_{\beta_k} \left\| x_k - \mu_n - V \beta_k \right\|_2^2 = \min_{\beta_k} \left\| x_k - \mu_n - \sum_{i=1}^{d} (\beta_k)_i v_i \right\|_2^2. \qquad (10)$$

Since $v_1,\dots,v_d$ are orthonormal, it is easy to see that the solution is given by $(\beta_k^*)_i = v_i^T (x_k - \mu_n)$, which can be succinctly written as $\beta_k^* = V^T (x_k - \mu_n)$. Thus, (9) is equivalent to

$$\min_{V} \sum_{k=1}^{n} \left[ (x_k - \mu_n)^T (x_k - \mu_n) - (x_k - \mu_n)^T V V^T (x_k - \mu_n) \right].$$
Since $(x_k - \mu_n)^T (x_k - \mu_n)$ does not depend on $V$, minimizing (9) is equivalent to
$$\max_{\substack{V \in \mathbb{R}^{p\times d} \\ V^T V = I_{d\times d}}} \sum_{k=1}^{n} (x_k - \mu_n)^T V V^T (x_k - \mu_n).$$

A few more simple algebraic manipulations using properties of the trace give
$$\sum_{k=1}^{n} (x_k - \mu_n)^T V V^T (x_k - \mu_n) = \sum_{k=1}^{n} \mathrm{Tr}\left[ V^T (x_k - \mu_n)(x_k - \mu_n)^T V \right] = n\, \mathrm{Tr}\left[ V^T \Sigma_n V \right],$$
where $\Sigma_n = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu_n)(x_k - \mu_n)^T$ is the sample covariance matrix. This means that the solution is given by solving
$$\max_{\substack{V \in \mathbb{R}^{p\times d} \\ V^T V = I_{d\times d}}} \mathrm{Tr}\left( V^T \Sigma_n V \right). \qquad (13)$$

As we saw above (recall (2)), the solution is given by $V = [v_1 \cdots v_d]$ where $v_1,\dots,v_d$ correspond to the $d$ leading eigenvectors of $\Sigma_n$.

Let us now show that the second interpretation of PCA, finding the $d$-dimensional projection of $x_1,\dots,x_n$ that preserves the most variance, also arrives at the optimization problem (13).

PCA as d-dimensional projection that preserves the most variance

We seek an orthonormal basis $v_1,\dots,v_d$ of a $d$-dimensional subspace, arranged as $V = [v_1 \cdots v_d]$ with $V^T V = I_{d\times d}$, so that projecting the data points $x_1,\dots,x_n$ onto this subspace preserves as much variance as possible. Put another way, we want the $d$-dimensional projection that retains the most variation in the data. This objective can be written as maximizing the total variance of the projected coordinates, i.e., the trace of $V^T \Sigma_n V$ subject to $V^T V = I_{d\times d}$, and the solution is given by the top $d$ eigenvectors of $\Sigma_n$, the principal components that define the optimal subspace.

More precisely, we want the projections $V^T (x_k - \mu_n)$, $k = 1,\dots,n$, to have as much variance as possible. Hence, we are interested in solving
$$\max_{\substack{V \in \mathbb{R}^{p\times d} \\ V^T V = I_{d\times d}}} \sum_{k=1}^{n} \left\| V^T (x_k - \mu_n) - \frac{1}{n} \sum_{r=1}^{n} V^T (x_r - \mu_n) \right\|_2^2. \qquad (14)$$
Since $\sum_{r=1}^{n} (x_r - \mu_n) = 0$, this objective equals
$$\sum_{k=1}^{n} \left\| V^T (x_k - \mu_n) \right\|_2^2 = \sum_{k=1}^{n} (x_k - \mu_n)^T V V^T (x_k - \mu_n) = n\, \mathrm{Tr}\left( V^T \Sigma_n V \right),$$
showing that (14) is equivalent to (13) and that the two interpretations of PCA are indeed equivalent.

Finding the Principal Components

When given a dataset $x_1,\dots,x_n \in \mathbb{R}^p$, in order to compute the Principal Components one needs to find the leading eigenvectors of
$$\Sigma_n = \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu_n)(x_k - \mu_n)^T.$$

A straightforward approach is to form the matrix $\Sigma_n$, costing $O(np^2)$ operations, and then compute its spectral decomposition, costing $O(p^3)$ operations. Consequently, the overall computational complexity is $O(\max\{np^2, p^3\})$ (see [HJ85] and/or [Gol96]).

An alternative is to use the Singular Value Decomposition (1). Let $X = [x_1 \cdots x_n]$ and recall that
$$\Sigma_n = \frac{1}{n} \left( X - \mu_n 1^T \right) \left( X - \mu_n 1^T \right)^T.$$

Let us take the SVD of $X - \mu_n 1^T = U_L D\, U_R^T$ with $U_L \in O(p)$, $D$ diagonal, and $U_R^T U_R = I$. Then,
$$\Sigma_n = \frac{1}{n} U_L D\, U_R^T U_R D\, U_L^T = \frac{1}{n} U_L D^2 U_L^T,$$

so the columns of $U_L$ are the eigenvectors of the sample covariance $\Sigma_n$: the left singular vectors of the centered data are the principal directions. Computing the SVD of $X - \mu_n 1^T$ costs $O(\min\{n^2 p, p^2 n\})$. If only the top $d$ singular vectors are needed, the computational burden decreases accordingly, and it can be reduced further with randomized algorithms, which compute an accurate approximate (rank-$d$) SVD in roughly
$$O\!\left( pn \log d + (p + n) d^2 \right)$$
time (see for example [HMT09, RST09, MM15]).
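To make the two computational routes concrete, here is a minimal sketch (not from the notes), assuming NumPy; it computes the principal directions both from the covariance matrix and from the SVD of the centered data matrix and checks that they span the same subspace.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, d = 5, 200, 2

X = rng.standard_normal((p, n))          # data points are the columns x_1, ..., x_n
mu_n = X.mean(axis=1, keepdims=True)     # sample mean
Xc = X - mu_n                            # centered data, X - mu_n 1^T

# Route 1: eigendecomposition of the sample covariance Sigma_n.
Sigma_n = (Xc @ Xc.T) / n
eigvals, eigvecs = np.linalg.eigh(Sigma_n)
V_cov = eigvecs[:, -d:][:, ::-1]         # d leading eigenvectors

# Route 2: left singular vectors of the centered data matrix.
U_L, D, _ = np.linalg.svd(Xc, full_matrices=False)
V_svd = U_L[:, :d]

# The two d-dimensional subspaces coincide (up to signs of the basis vectors).
print(np.allclose(np.abs(V_cov.T @ V_svd), np.eye(d), atol=1e-8))
print(np.allclose(D[:d] ** 2 / n, eigvals[-d:][::-1]))
```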

Which d should we pick?

When the goal is data visualization, projecting to d = 2 or d = 3 dimensions often makes the most sense. But PCA is useful for many purposes beyond visualization: it can reveal that data lie in a lower-dimensional subspace while being corrupted by high-dimensional noise, and projecting onto the principal components can reduce the noise while preserving the signal; it can also enable running algorithms that would be too computationally expensive in the full high-dimensional space.

If time allows, we may discuss these methods later in the course. Dimension reduction can be helpful in these applications and many others, but it is often unclear how to choose the most appropriate approach and how to set the parameters for a given problem.

If we denote the $k$-th largest eigenvalue of $\Sigma_n$ as $\lambda_k^{(+)}(\Sigma_n)$, then the $k$-th principal component accounts for a $\frac{\lambda_k^{(+)}(\Sigma_n)}{\mathrm{Tr}(\Sigma_n)}$ proportion of the variance.

A fairly popular heuristic for selecting the cutoff is to keep the components that have significantly more variance than the ones immediately after them. This is usually visualized with a scree plot: a plot of the ordered eigenvalues in descending order, which reveals where the variance concentrates. One then looks for an "elbow" on the scree plot as a practical criterion for how many components to retain. In the next section we will use random matrix theory to better understand the behavior of the eigenvalues of $\Sigma_n$, which will clarify where such a cutoff should be placed.
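A minimal sketch of the scree-plot heuristic (not part of the notes), assuming NumPy and Matplotlib; the synthetic data with three strong directions is an arbitrary demo choice.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
p, n = 20, 500

# Synthetic data with 3 strong directions plus isotropic noise.
signal = rng.standard_normal((p, 3)) @ rng.standard_normal((3, n)) * 2.0
X = signal + rng.standard_normal((p, n))

Xc = X - X.mean(axis=1, keepdims=True)
eigvals = np.linalg.eigvalsh(Xc @ Xc.T / n)[::-1]   # eigenvalues of Sigma_n, descending

plt.plot(np.arange(1, p + 1), eigvals, "o-")
plt.xlabel("component k")
plt.ylabel("eigenvalue of Sigma_n")
plt.title("Scree plot: look for the elbow")
plt.show()
```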

A related open problem

We now present an interesting open problem posed by Mallat and Zeitouni in [MZ11].

Open Problem 1.1 (Mallat and Zeitouni [MZ11]) Let $g$ be a Gaussian random vector in $\mathbb{R}^p$ with mean zero and a known covariance matrix $\Sigma$, and let $d < p$. For any orthonormal basis $V = [v_1,\dots,v_p]$ of $\mathbb{R}^p$, define the random variable $\Gamma_V$ as the squared $\ell_2$ norm of the largest projection of $g$ onto a subspace generated by $d$ elements of the basis $V$.

What is the basis $V$ for which $\mathbb{E}[\Gamma_V]$ is maximized?

The conjecture in [MZ11] states that the optimal basis is given by the eigendecomposition of Σ This is known to hold for d = 1 (see [MZ11]), but the issue remains open for d > 1 It is not difficult to observe that, without loss of generality, Σ can be assumed to be diagonal.

A particularly intuitive way of stating the problem is:

5. Score: $\sum_{i=1}^{d} \left( \tilde{v}_i^T g \right)^2$. The objective is to pick the basis in order to maximize the expected value of the Score.

Notice that if the steps of the procedure were taken in a slightly different order, with step 4 occurring before the draw of $g$ in step 3, then the optimal basis would be the eigenbasis of $\Sigma$ and the best subset would simply be its $d$ leading eigenvectors, a choice that mirrors PCA as described above.

More formally, we can write the problem as finding
$$\arg\max_{V \text{ basis of } \mathbb{R}^p} \; \mathbb{E}\left[ \max_{S \subset [p],\ |S| = d} \sum_{i \in S} \left( v_i^T g \right)^2 \right],$$
where $g \sim \mathcal{N}(0, \Sigma)$. The observation regarding the different ordering of the steps amounts to saying that the eigenbasis of $\Sigma$ is the optimal solution for
$$\arg\max_{V \text{ basis of } \mathbb{R}^p} \; \max_{S \subset [p],\ |S| = d} \mathbb{E}\left[ \sum_{i \in S} \left( v_i^T g \right)^2 \right].$$

PCA in high dimensions and Marcenko-Pastur

A related open problem

Open Problem 1.2 (Monotonicity of singular values [BKS13a]) Consider the setting above but with $p = n$; then $X \in \mathbb{R}^{n\times n}$ is a matrix with i.i.d. $\mathcal{N}(0,1)$ entries. Let $\sigma_i\!\left(\frac{1}{\sqrt{n}} X\right)$ denote the $i$-th singular value$^4$ of $\frac{1}{\sqrt{n}} X$, and define
$$\alpha_{\mathbb{R}}(n) := \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} \sigma_i\!\left( \frac{1}{\sqrt{n}} X \right) \right],$$
the expected value of the average singular value of $\frac{1}{\sqrt{n}} X$.

The conjecture is that, for every $n \ge 1$, $\alpha_{\mathbb{R}}(n+1) \ge \alpha_{\mathbb{R}}(n)$.

In the complex setting, $\alpha_{\mathbb{C}}(n)$ denotes the analogous quantity, defined with $X$ having i.i.d. complex-valued standard Gaussian entries $\mathcal{CN}(0,1)$. It is conjectured that, for all $n \ge 1$, $\alpha_{\mathbb{C}}(n+1) \le \alpha_{\mathbb{C}}(n)$, i.e., the sequence $\alpha_{\mathbb{C}}(n)$ is nonincreasing in the complex case. This extends the monotonicity intuition from the real case to complex Gaussian matrices.

Notice that the singular values of $\frac{1}{\sqrt{n}} X$ are simply the square roots of the eigenvalues of $S_n = \frac{1}{n} X X^T$:
$$\sigma_i\!\left( \frac{1}{\sqrt{n}} X \right) = \sqrt{\lambda_i(S_n)}.$$

4 The $i$-th diagonal element of $\Sigma$ in the SVD $\frac{1}{\sqrt{n}} X = U \Sigma V^T$.

This means that we can compute $\alpha_{\mathbb{R}}$ in the limit (since we know the limiting distribution of $\lambda_i(S_n)$) and get (since $p = n$ we have $\gamma = 1$, $\gamma^- = 0$, and $\gamma^+ = 4$)
$$\lim_{n\to\infty} \alpha_{\mathbb{R}}(n) = \frac{8}{3\pi} \approx 0.8488.$$

Also, $\alpha_{\mathbb{R}}(1)$ simply corresponds to the expected value of the absolute value of a standard Gaussian $g$,
$$\alpha_{\mathbb{R}}(1) = \mathbb{E}|g| = \sqrt{\frac{2}{\pi}} \approx 0.7979,$$
which is compatible with the conjecture.

In the complex-valued setting, the Marchenko–Pastur distribution remains applicable, so the limits of α_C(n) and α_R(n) coincide as n → ∞: lim_{n→∞} α_C(n) = lim_{n→∞} α_R(n) Moreover, α_C(1) can be readily computed and is larger than this limiting value.
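As an illustration (not from the notes), a quick Monte Carlo estimate of $\alpha_{\mathbb{R}}(n)$ for a few small $n$, assuming NumPy; it is only a numerical sanity check of the monotonicity conjecture, not evidence either way.

```python
import numpy as np

rng = np.random.default_rng(4)

def alpha_R(n, trials=2000):
    """Monte Carlo estimate of E[(1/n) * sum of singular values of X/sqrt(n)]."""
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, n))
        total += np.linalg.svd(X / np.sqrt(n), compute_uv=False).mean()
    return total / trials

for n in [1, 2, 3, 4, 5, 10, 20]:
    print(n, round(alpha_R(n), 4))
# The estimates increase with n toward the limiting value 8/(3*pi) ~ 0.8488.
```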

Spike Models and BBP transition

A brief mention of Wigner matrices

Another important random matrix model is the Wigner matrix, which we will encounter again later in this course. A standard Gaussian Wigner matrix $W \in \mathbb{R}^{n\times n}$ is a symmetric random matrix whose entries above the diagonal are i.i.d. $\mathcal{N}(0,1)$, whose diagonal entries are i.i.d. $\mathcal{N}(0,2)$, and whose symmetry is enforced by $W_{ij} = W_{ji}$. In the large-$n$ limit, the eigenvalues of $\frac{1}{\sqrt{n}} W$ converge in distribution to the Wigner semicircle law, supported on $[-2, 2]$.

More precisely, the eigenvalues of $\frac{1}{\sqrt{n}} W$ are distributed according to the so-called semicircle law,
$$dSC(x) = \frac{1}{2\pi} \sqrt{4 - x^2}\, \mathbf{1}_{[-2,2]}(x)\, dx,$$
and there is also a BBP-like transition for this matrix ensemble [FP06]. More precisely, if $v$ is a unit-norm vector in $\mathbb{R}^n$ and $\xi \ge 0$, then the largest eigenvalue of $\frac{1}{\sqrt{n}} W + \xi v v^T$ satisfies
$$\lambda_{\max}\!\left( \frac{1}{\sqrt{n}} W + \xi v v^T \right) \to \begin{cases} 2 & \text{if } \xi \le 1, \\ \xi + \frac{1}{\xi} & \text{if } \xi \ge 1. \end{cases}$$

In the preceding argument, it isn't clearly demonstrated where the leading eigenvalue is used. The full proof requires first establishing the existence of an eigenvalue outside the support, and the argument then applies solely to that eigenvalue (see [Pau]).

An open problem about spike models

Open Problem 1.3 (Spike Model for cut–SDP [MS15]; this problem has since been solved, see Remark 1.5) Let

$W$ denote a symmetric Wigner matrix with i.i.d. entries $W_{ij} \sim \mathcal{N}(0,1)$. Also, given $B \in \mathbb{R}^{n\times n}$ symmetric, define
$$Q(B) = \max\left\{ \mathrm{Tr}(BX) : X \succeq 0,\ X_{ii} = 1 \right\},$$
and
$$q(\xi) = \lim_{n\to\infty} \frac{1}{n}\, \mathbb{E}\, Q\!\left( \frac{\xi}{n} 11^T + \frac{1}{\sqrt{n}} W \right).$$

What is the value of $\xi^*$, defined as
$$\xi^* = \inf\{ \xi \ge 0 : q(\xi) > 2 \}?$$

It is known that if $0 \le \xi \le 1$, then $q(\xi) = 2$ [MS15].

One can show that $\frac{1}{n} Q(B) \le \lambda_{\max}(B)$. In fact,
$$\max\left\{ \mathrm{Tr}(BX) : X \succeq 0,\ X_{ii} = 1 \right\} \le \max\left\{ \mathrm{Tr}(BX) : X \succeq 0,\ \mathrm{Tr}\, X = n \right\}.$$

It is also not difficult to show (hint: take the spectral decomposition of $X$) that
$$\max\left\{ \mathrm{Tr}(BX) : X \succeq 0,\ \mathrm{Tr}\, X = n \right\} = n\, \lambda_{\max}(B).$$

Remark 1.4 Optimization problems of the type $\max\{ \mathrm{Tr}(BX) : X \succeq 0,\ X_{ii} = 1 \}$ are semidefinite programs; they will be a major player later in the course!

Since $\frac{1}{n}\, \mathbb{E}\, \mathrm{Tr}\!\left[ 11^T \left( \frac{\xi}{n} 11^T + \frac{1}{\sqrt{n}} W \right) \right] \approx \xi$, by taking $X = 11^T$ we expect that $q(\xi) \ge \xi$.

These observations imply that $1 \le \xi^* \le 2$.

A natural conjecture is that $\xi^* = 1$. If this were the case, it would imply that an SDP-based clustering algorithm achieves the optimal detection threshold for the two-cluster Stochastic Block Model; we will discuss these topics in detail later in the course (see [MS15]).

Remark 1.5 We remark that Open Problem 1.3 has since been solved [MS15].

Later in the course, we will cover clustering under the Stochastic Block Model in depth and show how the same semidefinite programming (SDP) approach is known to be optimal for exact recovery, as established in the literature [ABH14, HWX14, Ban15c].

2 Graphs, Diffusion Maps, and Semi-supervised Learning

Graphs

Cliques and Ramsey numbers

Cliques are fundamental structures in graphs with broad application-specific significance For example, in social networks—where vertices represent people and edges connect friends—a clique represents a tightly knit group in which every member is directly connected to every other member, providing a clear interpretation for detecting close-knit communities and studying social dynamics.

A natural question is whether there exist arbitrarily large graphs in which neither the graph nor its complement contains a large clique; Ramsey answered this in the negative in 1928. We begin with a few definitions: for a graph $G$, let $c(G)$ denote the size of its largest clique and $c(G^c)$ the size of the largest clique in its complement $G^c$ (equivalently, the largest independent set of $G$), and define $r(G) := \max\{ c(G), c(G^c) \}$.

Given $r$, let $R(r)$ denote the smallest integer $n$ such that every graph $G$ on $n$ nodes must have $r(G) \ge r$. Ramsey [Ram28] showed that $R(r)$ is finite, for every $r$.

Remark 2.1 It is easy to show that $R(3) \le 6$; try it!

We will need a simple estimate for what follows (it is a very useful consequence of Stirling’s approximation, e.g.).

Proposition 2.2 For every pair of positive integers $k \le n$,
$$\left( \frac{n}{k} \right)^{k} \le \binom{n}{k} \le \left( \frac{en}{k} \right)^{k}.$$

We will show a simple lower bound on $R(r)$. But first we introduce a random graph construction, the Erdős–Rényi graph.

Definition 2.3 Given $n$ and $p$, the random Erdős–Rényi graph $G(n, p)$ is a random graph on $n$ vertices where each possible edge appears, independently, with probability $p$.

To establish a lower bound on $R(r)$, we use the probabilistic method, an elegant non-constructive approach pioneered by Erdős to prove the existence of combinatorial objects. The central idea is that if a random variable has a certain expectation, then there must exist an outcome on which its value is at least that expectation, and likewise an outcome on which its value is at most that expectation. This intuition is best understood through a simple example.

Theorem 2.4 For every $r \ge 2$,
$$R(r) \ge 2^{\frac{r-1}{2}}.$$

Proof Let $G$ be drawn from the $G\!\left(n, \frac{1}{2}\right)$ distribution, $G \sim G\!\left(n, \frac{1}{2}\right)$. For every set $S$ of $r$ nodes, let $X(S)$ denote the random variable
$$X(S) = \begin{cases} 1 & \text{if } S \text{ is a clique or independent set,} \\ 0 & \text{otherwise.} \end{cases}$$

Also, let $X$ denote the random variable
$$X = \sum_{S \in \binom{V}{r}} X(S).$$

We will proceed by estimating $\mathbb{E}[X]$. Note that, by linearity of expectation,
$$\mathbb{E}[X] = \sum_{S \in \binom{V}{r}} \mathbb{E}[X(S)],$$
and
$$\mathbb{E}[X(S)] = \mathrm{Prob}\left\{ S \text{ is a clique or independent set} \right\} = \frac{2}{2^{\binom{r}{2}}},$$
so that $\mathbb{E}[X] = \binom{n}{r} \frac{2}{2^{\binom{r}{2}}}$.

That means that if $n \le 2^{\frac{r-1}{2}}$ and $r \ge 3$ then $\mathbb{E}[X] < 1$. Since $X$ is a non-negative integer-valued random variable, $\mathbb{E}[X] < 1$ forces $X = 0$ with positive probability: there must be at least one instance with $X < 1$, and hence $X = 0$ in that instance. This implies the existence of a graph on $n$ nodes containing neither a clique of size $r$ nor an independent set of size $r$, which establishes the theorem.
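A tiny numerical illustration (not in the notes), assuming only the Python standard library: for a given $r$, it finds the largest $n$ for which the expected count $\binom{n}{r} 2^{1-\binom{r}{2}}$ is still below 1, which is exactly the regime where the argument above guarantees a Ramsey-free graph.

```python
from math import comb

def expected_count(n, r):
    """E[X] = C(n, r) * 2^(1 - C(r, 2)) for G(n, 1/2)."""
    return comb(n, r) * 2 ** (1 - comb(r, 2))

for r in range(3, 9):
    n = r
    while expected_count(n + 1, r) < 1:
        n += 1
    print(f"r = {r}: E[X] < 1 up to n = {n}, so R({r}) > {n}")
```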

Remarkably, this lower bound is not very different from the best known. In fact, the best known lower and upper bounds [Spe75, Con09] for $R(r)$ are
$$\left(1 + o(1)\right) \frac{\sqrt{2}\, r}{e}\, 2^{\frac{r}{2}} \le R(r) \le r^{-\frac{c \log r}{\log\log r}}\, 4^{r},$$
for some constant $c > 0$.

Open Problem 2.1 Recall the definition of $R(r)$ above; the following questions are open:

• What is the value of $R(5)$?

• What are the asymptotics of $R(r)$? In particular, improve on the base of the exponent of either the lower bound ($\sqrt{2}$) or the upper bound ($4$).

• Construct a family of graphs $G = (V, E)$ with an increasing number of vertices for which there exists $\varepsilon > 0$ such that$^{9}$ $r(G)$ grows at most polylogarithmically in $|V|$.

It is known that $43 \le R(5) \le 49$. There is a famous quote in Joel Spencer's book [Spe94] that conveys the difficulty of computing Ramsey numbers:

Erdős asks us to imagine an alien force, vastly more powerful than us, landing on Earth and demanding the value of R(5) or they will destroy our planet. In that case, he says, we should marshal all our computers and all our mathematicians to search for the exact value. But suppose they demand R(6) instead; in that scenario he believes we should attempt to destroy the aliens. There is also an alternative useful way to think about these bounds: taking the base-2 logarithm of each side shows that $\log_2 R(r)$ is known only up to roughly a factor of 4, between about $\frac{r}{2}$ and $2r$.

The current "world record" (see [CZ15, Coh15]) for deterministic construction of families of graphs with small $r(G)$ achieves $r(G) \lesssim 2^{(\log\log|V|)^{c}}$, for some constant $c > 0$. Note that this is still considerably larger than $\mathrm{polylog}|V|$. In contrast, it is very easy for randomized constructions to satisfy $r(G) \le 2\log_2 n$, as made precise by the following theorem.

Theorem 2.5 Let $G \sim G\!\left(n, \frac{1}{2}\right)$ be an Erdős–Rényi graph with edge probability $\frac{1}{2}$. Then, with high probability,$^{10}$
$$r(G) \le 2\log_2 n.$$

Proof Given $n$, we are interested in upper bounding $\mathrm{Prob}\left\{ r(G) \ge \lceil 2\log_2 n \rceil \right\}$, and we proceed by union bounding (and making use of Proposition 2.2):

$$\mathrm{Prob}\left\{ \exists\, S \subset V,\ |S| = \lceil 2\log_2 n \rceil :\ S \text{ is a clique or independent set} \right\}
= \mathrm{Prob}\left\{ \bigcup_{S:\ |S| = \lceil 2\log_2 n \rceil} \left\{ S \text{ is a clique or independent set} \right\} \right\}
\le \sum_{S:\ |S| = \lceil 2\log_2 n \rceil} \mathrm{Prob}\left\{ S \text{ is a clique or independent set} \right\}.$$
Bounding the number of such sets with Proposition 2.2 and each probability by $2^{1-\binom{\lceil 2\log_2 n \rceil}{2}}$, as in the proof of Theorem 2.4, shows that this quantity goes to zero as $n$ grows, which concludes the proof.

9 By $a_k \lesssim b_k$ we mean that there exists a constant $c$ such that $a_k \le c\, b_k$.

10 We say an event happens with high probability if its probability is $\ge 1 - n^{-\Omega(1)}$.

The following is one of the most fascinating conjectures in Graph Theory.

Open Problem 2.2 (Erdős–Hajnal Conjecture [EH89]): Prove or disprove that for every finite graph H there exists a constant δ_H > 0 such that every H-free graph on n vertices contains a clique or an independent set of size at least n^{δ_H}.

It is known that, for $H$-free graphs, $r(G) \gtrsim \exp\left( c_H \sqrt{\log n} \right)$ for some constant $c_H > 0$ (see [Chu13] for a survey on this conjecture). Note that this lower bound already shows that $H$-free graphs need to have considerably larger $r(G)$. This is an amazing local-to-global effect, where imposing a constraint on how small groups of vertices are connected (being $H$-free is a local property) creates extremely large cliques or independent sets (much larger than $\mathrm{polylog}(n)$, as in random Erdős–Rényi graphs).

Since we do not know how to deterministically construct graphs with $r(G) \le \mathrm{polylog}\, n$, one approach could be to take $G \sim G\!\left(n, \frac{1}{2}\right)$ and check that it indeed has small clique and independence numbers. However, finding the largest clique of a graph is known to be NP-hard (meaning that there is no polynomial-time algorithm to solve it, provided that the widely believed conjecture $NP \neq P$ holds). That is a worst-case statement and thus does not necessarily mean that it is difficult to find the clique number of random graphs. That being said, the next open problem suggests that this is indeed still difficult.

Consider a planted clique model: start with a random graph $G(n, 1/2)$, pick $\omega$ vertices uniformly at random, and connect every pair among them to plant a clique of size $\omega$. If $\omega > 2\log_2 n$, this planted clique is, with high probability, larger than any clique already present in the graph, meaning the planted structure is, in principle, detectable from the graph alone. A naive approach is to examine all subsets of size $2\log_2 n + 1$ and test whether each forms a clique; if a subset is a clique, it is very likely that its vertices belong to the planted clique. However, checking all such subgraphs takes super-polynomial time, roughly $n^{O(\log n)}$, which motivates the natural question of whether this can be done in polynomial time.

Consider the planted clique model on $G(n, 1/2)$. In this setting the degrees of the nodes follow a binomial distribution with mean $\frac{n-1}{2}$ and standard deviation of order $\sqrt{n}$. If the planted clique size $\omega$ exceeds $c\sqrt{n}$ for a sufficiently large constant $c$, the planted clique can be detected and recovered efficiently (for $\omega$ somewhat larger, of order $\sqrt{n\log n}$, the degrees alone already reveal it). Remarkably, no method is known that works when $\omega$ is substantially smaller than $\sqrt{n}$. A quasi-linear-time algorithm [DM13] can, with high probability, find the largest clique as long as $\omega$ is at least a suitable constant times $\sqrt{n}$.
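A minimal simulation sketch (not from the notes), assuming NumPy: it plants a clique whose size is a few times $\sqrt{n}$ and recovers candidate clique vertices by degree alone, illustrating why large $\omega$ makes the problem easy; the values of `n` and `omega` are arbitrary demo choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, omega = 2000, 140                      # omega is a few times sqrt(n) ~ 45

# Adjacency matrix of G(n, 1/2).
A = rng.integers(0, 2, size=(n, n))
A = np.triu(A, 1)
A = A + A.T

# Plant a clique on omega randomly chosen vertices.
clique = rng.choice(n, size=omega, replace=False)
A[np.ix_(clique, clique)] = 1
np.fill_diagonal(A, 0)

# Degree heuristic: planted-clique vertices gain roughly omega/2 extra edges,
# while typical degree fluctuations are only of order sqrt(n).
degrees = A.sum(axis=1)
candidates = np.argsort(degrees)[-omega:]

recovered = np.intersect1d(candidates, clique).size
print(f"recovered {recovered} of {omega} planted vertices by degree alone")
```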

Open Problem 2.3 (The planted clique problem) Let $G$ be a random graph constructed by taking a $G\!\left(n, \frac{1}{2}\right)$ and planting a clique of size $\omega$.

1 Is there a polynomial-time algorithm that is able to find the largest clique of $G$ (with high probability) for $\omega \ll \sqrt{n}$? For example, for $\omega \approx \frac{\sqrt{n}}{\log n}$.

There is an amplification technique that, for any fixed $c > 0$, finds the planted clique when $\omega \approx c\sqrt{n}$ in polynomial time, with the exponent in the runtime depending on $c$. The core idea is to examine all vertex subsets of a fixed constant size and, for each subset, restrict attention to its common neighborhood; if the subset happens to lie inside the planted clique, the clique becomes proportionally larger inside that common neighborhood and can then be found.

2 Is there a polynomial-time algorithm that is able to distinguish, with high probability, $G$ from a draw of $G\!\left(n, \frac{1}{2}\right)$ for $\omega \ll \sqrt{n}$? For example, for $\omega \approx \frac{\sqrt{n}}{\log n}$.

3 Is there a quasi-linear time algorithm able to find the largest clique of $G$ (with high probability) for $\omega \le c\sqrt{n}$, with $c$ below the threshold of the algorithm in [DM13]?

This open problem is especially significant because the conjectured hardness of finding planted cliques for small values of $\omega$ underpins several cryptographic protocols and average-case hardness results, a notable example being the hardness of Sparse PCA [BR13].

Diffusion Maps

A couple of examples

The ring graph is a graph on $n$ nodes $\{1,\dots,n\}$ such that node $k$ is connected to $k-1$ and $k+1$, and node $1$ is connected to node $n$. Figure 2 shows its Diffusion Map truncated to two dimensions.

Another simple graph is $K_n$, the complete graph on $n$ nodes (where every pair of nodes shares an edge); see Figure 3.

Figure 2 shows the diffusion map of the ring graph, which matches the natural way to draw it: if asked to draw the ring graph, most people would draw a circle, and the two-dimensional diffusion map indeed arranges the points on a circle. For this graph the diffusion map can also be computed analytically, confirming the circular embedding.

Diffusion Maps of point clouds

A common task in data analysis is to embed a point cloud $x_1,\dots,x_n \in \mathbb{R}^p$ in a lower-dimensional space. A typical option is Principal Component Analysis, but PCA is designed to capture only linear structure and may miss nonlinear, low-dimensional manifolds in the data. For example, a dataset of face images taken from different angles and under varying lighting conditions has intrinsic dimensionality governed by facial muscles, head and neck movement, and lighting degrees of freedom, yet this nonlinear structure is not apparent in the raw pixel values.

Consider a point cloud sampled from a two-dimensional manifold, the Swiss roll, embedded in three dimensions. To uncover its intrinsic two-dimensional structure, we must distinguish points that are near each other because they lie close on the manifold from points that merely appear nearby due to curvature in 3D space. We achieve this by constructing a graph whose nodes are data points and whose edges connect local neighbors, typically using a k-nearest-neighbor or epsilon-ball criterion. This graph preserves local geometry and provides the foundation for manifold learning methods that recover the 2D coordinates from the 3D embedding.

To capture the manifold structure, we build a graph where each data point becomes a node. We connect only pairs that are close on the manifold, not just nearby in Euclidean space, to avoid misrepresenting curvature. This is done at a small scale $\varepsilon$: rather than hard-thresholding, the connections are weighted with a kernel $K_\varepsilon$, setting $w_{ij} = K_\varepsilon\!\left( \|x_i - x_j\|^2 \right)$. A common choice is the Gaussian kernel $K_\varepsilon(u) = \exp\!\left( -\frac{u}{2\varepsilon} \right)$, i.e., $w_{ij} = \exp\!\left( -\frac{\|x_i - x_j\|^2}{2\varepsilon} \right)$,

which gives essentially zero weight to edges corresponding to pairs of nodes for which $\|x_i - x_j\| \gg \sqrt{\varepsilon}$. We can then take the Diffusion Map of the resulting graph.
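A minimal sketch of this construction (not from the notes), assuming NumPy: it builds the Gaussian-kernel weight matrix, forms the random-walk matrix $M = D^{-1}W$, and uses its nontrivial eigenvectors as diffusion-map coordinates; the choice of $\varepsilon$, the noisy-circle data, and the two-dimensional truncation are arbitrary demo choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# Point cloud on a noisy circle (a simple manifold).
theta = rng.uniform(0, 2 * np.pi, 300)
pts = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((300, 2))

eps = 0.1
sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dists / (2 * eps))        # Gaussian kernel weights w_ij
np.fill_diagonal(W, 0)

deg = W.sum(axis=1)

# Diffusion map coordinates: eigenvectors of M = D^{-1} W for the largest
# nontrivial eigenvalues. M is similar to the symmetric D^{-1/2} W D^{-1/2}.
S = W / np.sqrt(np.outer(deg, deg))
evals, evecs = np.linalg.eigh(S)
phi = evecs / np.sqrt(deg)[:, None]      # right eigenvectors of M (columns)

t = 1
embedding = (evals[-2:-4:-1] ** t) * phi[:, -2:-4:-1]   # skip the trivial top eigenvector
print(embedding.shape)                   # (300, 2): the 2-D diffusion map
```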

Figure 3 shows the three-dimensional diffusion map of the complete graph on four nodes, which forms a regular tetrahedron, indicating the absence of low-dimensional structure for this graph. This is unsurprising: every pair of nodes is connected, so the graph has no natural low-dimensional representation.

A simple example

An illustrative dataset can be formed by images of a white square on a black background at different positions Each data point corresponds to the square’s two-dimensional position, so the dataset is intrinsically two-dimensional However, this two-dimensional structure is not directly visible from the raw pixel vectors of each image; the pixel values do not lie in a two-dimensional affine subspace This gap between intrinsic and ambient dimensionality highlights the need for dimensionality reduction and manifold learning in image analysis, since the data’s natural geometry lives on a two-dimensional manifold inside a high-dimensional pixel space.

To begin, we examine the one-dimensional case where the blob is a vertical stripe that simply moves left and right We can think of the space as a wraparound world, like in the arcade game Asteroids: when a stripe exits the screen on the right, it reappears on the left, and the same wraparound behavior applies for the vertical direction in higher dimensions The resulting point cloud should exhibit not only a one-dimensional structure but also a circular topology Remarkably, this structure becomes evident when we compute the two-dimensional Diffusion Map of the dataset (see Figure 5).

In the two-dimensional example, the underlying manifold is expected to be a two-dimensional torus, and Figure 6 demonstrates that the three-dimensional diffusion map captures the toroidal structure of the data.

Similar non-linear dimensional reduction techniques

There are several other non-linear dimensionality reduction methods; a particularly popular one is ISOMAP. ISOMAP seeks an embedding in $\mathbb{R}^d$ such that Euclidean distances between embedded points approximate the geodesic distances along the data manifold. This is achieved by first estimating the geodesic distance between pairs of points $v_i$ and $v_j$ as shortest-path distances on a neighborhood graph (built from the data, typically via k-nearest neighbors or an epsilon ball), and then applying Multidimensional Scaling (MDS) to the resulting geodesic distance matrix to obtain the low-dimensional embedding.

Figure 4 shows a Swiss roll point cloud, with points sampled from a two-dimensional manifold embedded in $\mathbb{R}^3$, together with a graph built on these samples. Given the estimated geodesic distances $\delta_{ij}$ between pairs of points, one then seeks points $y_1,\dots,y_n \in \mathbb{R}^d$ that minimize
$$\min_{y_1,\dots,y_n \in \mathbb{R}^d} \sum_{i,j} \left( \|y_i - y_j\|^2 - \delta_{ij}^2 \right)^2$$

, which can be done with spectral methods (it is a good exercise to compute the optimal solution to the above optimization problem).
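A compact sketch of this pipeline (not from the notes), assuming NumPy and SciPy: a k-nearest-neighbor graph, shortest-path (geodesic) distances, then classical MDS on the resulting distance matrix; the values of `k` and `d` and the toy data are arbitrary demo choices.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(points, k=10, d=2):
    """Minimal ISOMAP: kNN graph -> geodesic distances -> classical MDS."""
    n = len(points)
    D = squareform(pdist(points))             # Euclidean distances

    # Keep only the k nearest neighbors of each point (symmetrized).
    W = np.full((n, n), np.inf)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    W[rows, nn.ravel()] = D[rows, nn.ravel()]
    W = np.minimum(W, W.T)

    # Geodesic distances = shortest paths on the neighborhood graph.
    G = shortest_path(np.where(np.isinf(W), 0, W), method="D", directed=False)

    # Classical MDS on the squared geodesic distances.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    evals, evecs = np.linalg.eigh(B)
    return evecs[:, -d:] * np.sqrt(np.maximum(evals[-d:], 0))

pts = np.random.default_rng(7).standard_normal((200, 3))
print(isomap(pts).shape)                      # (200, 2)
```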

Semi-supervised learning

An interesting experience and the Sobolev Embedding Theorem

Let us try a simple experiment on a $d$-dimensional grid in the cube $[-1, 1]^d$ with roughly $m^d$ points. We label the center $+1$, every node whose distance from the center is at least a chosen radius $r$ gets label $-1$, and all remaining nodes are unlabeled. By varying $r$ and $d$ we can examine how the learned labels behave on the unlabeled nodes as the dimension grows. This setup provides a clean testbed for understanding label propagation on high-dimensional grids.

In non-degenerate cases, where the labeled and unlabeled parts of the graph stay connected, the relevant matrix is invertible, and this is straightforward to verify; only degenerate configurations, such as the unlabeled portion being disconnected from the labeled portion, could prevent invertibility.

Figure 8: A two-dimensional embedding of a dataset of hand images obtained with ISOMAP [TdSL00].

Returning to the grid experiment, the center is labeled $+1$ and the points far from the center are labeled $-1$. The aim is to understand how the labeling algorithm extends these labels to the remaining unlabeled points; the expectation is that points near the labeled boundary receive values close to $-1$, while points close to the center receive values close to $+1$.

Figure 12 shows the results for $d = 1$, Figure 13 for $d = 2$, and Figure 14 for $d = 3$. When $d \le 2$ the method appears to interpolate smoothly between the labels, but at $d = 3$ the solution is essentially $-1$ at every point, offering little information. To gain intuition, consider the continuous analogue of this problem: we want to find a function on $\mathbb{R}^d$ that takes the value $1$ at zero and $-1$ on the unit sphere, and that minimizes $\int_{B_0(1)} \|\nabla f(x)\|^2\, dx$, where $B_0(1)$ is the ball centered at $0$ with unit radius. Consider the following function on $B_0(1)$:
$$f_\varepsilon(x) = \begin{cases} 1 - 2\frac{\|x\|}{\varepsilon} & \text{if } \|x\| \le \varepsilon, \\ -1 & \text{if } \|x\| \ge \varepsilon. \end{cases}$$
A direct computation gives
$$\int_{B_0(1)} \|\nabla f_\varepsilon(x)\|^2\, dx \approx \varepsilon^{d-2},$$
meaning that, if $d > 2$, this function performs better and better as $\varepsilon \to 0$, which explains the results in Figure 14.

One convenient way to think about the situation is through the Sobolev Embedding Theorem. The Sobolev space $H^m(\mathbb{R}^d)$ consists of functions on $\mathbb{R}^d$ whose derivatives up to order $m$ are square-integrable. The Sobolev Embedding Theorem says that if $m > \frac{d}{2}$, then functions in $H^m(\mathbb{R}^d)$ are continuous (more precisely, agree almost everywhere with a continuous function).

In particular, if a function $f$ belongs to $H^m(\mathbb{R}^d)$ with $m > \frac{d}{2}$, then $f$ must be continuous. In the experiment above we only control first derivatives (the $H^1$ norm), which for $d \ge 2$ does not force continuity, consistent with the degenerate behavior observed for $d = 3$.

Figure 9: The two-dimensional representation of a dataset of handwritten digits obtained in [TdSL00] using ISOMAP. Remarkably, the two dimensions are interpretable.

Figure 10: The task of labeling an unlabeled point given a few labeled points.

The discussion above suggests that if we could also control the second derivatives of $f$, this phenomenon would disappear. Although not described in detail here, there is a way to achieve this by minimizing $f^T L^2 f$ instead of $f^T L f$. Figure 15 reproduces the same experiment with $f^T L f$ replaced by $f^T L^2 f$, confirming our intuition that the discontinuity issue should vanish (see [NSZ09] for more on this phenomenon).

Figure 11: In this example we are given many unlabeled points; the unlabeled points help us learn the geometry of the data.

Figure 12 shows the d = 1 instance of applying this method to the example described above The node values are represented by color coding, and for d = 1 the method smoothly interpolates between the labeled points.

3 Spectral Clustering and Cheeger’s Inequality

Clustering

k-means Clustering

One of the most popular methods for clustering is k-means clustering Given data points x1, x2, , xn in R^p, k-means partitions them into k clusters S1 through Sk with centers μ1, μ2, , μk in R^p The aim is to minimize the total within-cluster variation by assigning each point to a cluster so that the sum of squared distances to its cluster center is minimized, i.e., minimize ∑_{i=1}^n ||x_i − μ_{c(i)}||^2 over the cluster assignments and centers.

Figure 13 shows the $d = 2$ case of applying this method to the example described above, with node values represented by color coding. The method appears to interpolate smoothly between the labeled points.

Note that, given the partition, the optimal centers are given by
$$\mu_l = \frac{1}{|S_l|} \sum_{i \in S_l} x_i.$$

Lloyd's algorithm [Llo82] (also known as the k-means algorithm) is an iterative algorithm that alternates between

• Given centers $\mu_1,\dots,\mu_k$, assign each point $x_i$ to the cluster $l = \operatorname{argmin}_{l=1,\dots,k} \| x_i - \mu_l \|$.

• Given the cluster assignments, update each center $\mu_l$ to the average of the points assigned to cluster $l$ (see the formula above).

A minimal sketch of these two alternating steps is given right after this list.
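The following sketch of Lloyd's iterations is not from the notes and assumes NumPy; the value of `k`, the random initialization, and the iteration count are arbitrary demo choices.

```python
import numpy as np

def lloyd(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: X has one data point per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Assignment step: send each point to its nearest center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        for l in range(k):
            if np.any(labels == l):
                centers[l] = X[labels == l].mean(axis=0)
    return labels, centers

X = np.r_[np.random.default_rng(1).normal(0, 1, (100, 2)),
          np.random.default_rng(2).normal(5, 1, (100, 2))]
labels, centers = lloyd(X, k=2)
print(centers.round(2))                  # two centers, near (0, 0) and (5, 5)
```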

There is no guarantee that Lloyd's algorithm will converge to the solution of (25); in fact, it often gets stuck in local optima of (25). In upcoming lectures, we will discuss convex relaxations for clustering as an alternative algorithmic approach. Because optimizing (25) is NP-hard, no polynomial-time algorithm can solve it in the worst case (under the widely believed conjecture $P \neq NP$).

While popular, k-means clustering has some potential issues:

• One needs to set the number of clusters a priori (a typical way to overcome this issue is to try the algorithm for several different numbers of clusters).

Equation (25) is defined under the assumption that the data points reside in Euclidean space, but in many clustering tasks we only have pairwise affinities and no embedding in $\mathbb{R}^p$. To address this, (25) can be reformulated using distances only, eliminating the need for explicit Euclidean coordinates. This distance-based reformulation allows clustering from an affinity matrix or non-Euclidean data while preserving the relationships among data points.

Figure 14 presents the $d = 3$ case of applying this method to the previously described example, with node values shown using color coding. In this configuration the solution appears to learn only the label $-1$.

Figure 15 illustrates a $d = 3$ example of applying this method with the extra regularization $f^T L^2 f$ to the previously described case, where node values are indicated by color coding. The added regularization appears to resolve the discontinuities, yielding smoother results.

• The formulation is computationally hard, so algorithms may produce suboptimal solutions.

• The solutions of k-means are always convex clusters. This means that k-means may have difficulty finding clusters such as those in Figure 17.

Spectral Clustering

A natural way to address the issues of k-means illustrated in Figure 17 is to use Diffusion Maps. From the data points we construct a weighted graph $G = (V, E, W)$ using a kernel $K_\varepsilon$, such as the Gaussian kernel: we associate each point to a vertex and, for each pair of nodes, set the edge weight as
$$w_{ij} = K_\varepsilon\!\left( \| x_i - x_j \|^2 \right).$$
The edge weights encode local affinities and define a diffusion process on the graph that reveals the intrinsic geometry of the data.

Figure 16: Examples of points separated in clusters.

Recall the construction of the matrix $M = D^{-1} W$ as the transition matrix of a random walk
$$\mathrm{Prob}\{ X(t+1) = j \mid X(t) = i \} = \frac{w_{ij}}{\deg(i)} = M_{ij},$$
where $D$ is the diagonal matrix with $D_{ii} = \deg(i)$. The $d$-dimensional Diffusion Map is given by
$$\phi_t^{(d)}(i) = \begin{bmatrix} \lambda_2^t\, \varphi_2(i) \\ \vdots \\ \lambda_{d+1}^t\, \varphi_{d+1}(i) \end{bmatrix},$$
where $M = \Phi \Lambda \Psi^T$, $\Lambda$ is the diagonal matrix with the eigenvalues of $M$, and $\Phi$ and $\Psi$ are, respectively, the right and left eigenvectors of $M$ (note that they form a bi-orthogonal system, $\Phi^T \Psi = I$).

For graph clustering into $k$ clusters, the standard approach is to truncate the Diffusion Map to $k - 1$ dimensions, since this dimensionality allows up to $k$ linearly separable regions. When the embedded clusters are linearly separable, applying k-means to the embedding can reveal the clusters; this is the core motivation behind Spectral Clustering.

Algorithm 3.1 (Spectral Clustering) Given a graph $G = (V, E, W)$ and a number of clusters $k$ (and $t$), Spectral Clustering consists in taking a $(k-1)$-dimensional Diffusion Map
$$\phi_t^{(k-1)}(i) = \begin{bmatrix} \lambda_2^t\, \varphi_2(i) \\ \vdots \\ \lambda_k^t\, \varphi_k(i) \end{bmatrix}$$
and clustering the points $\phi_t^{(k-1)}(1), \phi_t^{(k-1)}(2), \dots, \phi_t^{(k-1)}(n) \in \mathbb{R}^{k-1}$ using, for example, k-means clustering.
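A minimal end-to-end sketch of this idea (not from the notes), assuming NumPy, reusing the random-walk eigenvectors as the embedding and plain Lloyd-style iterations for the final clustering; the two-ring toy data, $\varepsilon$, and $k$ are arbitrary demo choices.

```python
import numpy as np

rng = np.random.default_rng(8)

# Two concentric rings: a cluster shape that plain k-means cannot separate.
def ring(radius, n):
    a = rng.uniform(0, 2 * np.pi, n)
    return radius * np.c_[np.cos(a), np.sin(a)] + 0.05 * rng.standard_normal((n, 2))

X = np.r_[ring(1.0, 150), ring(3.0, 150)]
k, eps = 2, 0.2

# Weighted graph with Gaussian kernel weights.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / (2 * eps))
np.fill_diagonal(W, 0)
deg = W.sum(1)

# Right eigenvectors of M = D^{-1} W via the symmetric matrix D^{-1/2} W D^{-1/2}.
S = W / np.sqrt(np.outer(deg, deg))
evals, evecs = np.linalg.eigh(S)
embedding = (evecs / np.sqrt(deg)[:, None])[:, -k:-1]   # k-1 nontrivial coordinates

# Cluster the embedded points with simple Lloyd iterations (spread-out init).
centers = embedding[[embedding[:, 0].argmin(), embedding[:, 0].argmax()]]
for _ in range(50):
    labels = ((embedding[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    centers = np.array([embedding[labels == l].mean(0) for l in range(k)])

print(np.bincount(labels))               # roughly 150 points in each ring
```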

Figure 17: Because the solutions of k-means are always convex clusters, it is not able to handle some cluster structures.

Two clusters

Normalized Cut

Given a graph $G = (V, E, W)$, a natural way to measure a vertex partition $(S, S^c)$ is the cut
$$\mathrm{cut}(S) = \sum_{i \in S} \sum_{j \in S^c} w_{ij}.$$

Note however that the minimum cut is achieved for S = ∅ (since cut(∅) = 0) which is a rather meaningless choice of partition.

Remark 3.3 One way to circumvent this issue is to ask that $|S| = |S^c|$ (say the number of vertices $n = |V|$ is even), corresponding to a balanced partition. We can then identify a partition with a label vector $y \in \{\pm 1\}^n$ where $y_i = 1$ if $i \in S$, and $y_i = -1$ otherwise. The balanced condition can be written as $\sum_{i=1}^{n} y_i = 0$. This means that we can write the minimum balanced cut as
$$\min_{\substack{y \in \{\pm 1\}^n \\ 1^T y = 0}} \mathrm{cut}(S) = \min_{\substack{y \in \{\pm 1\}^n \\ 1^T y = 0}} \frac{1}{4} \sum_{i \le j} w_{ij} (y_i - y_j)^2 = \min_{\substack{y \in \{\pm 1\}^n \\ 1^T y = 0}} \frac{1}{4}\, y^T L_G y,$$
where $L_G = D - W$ is the graph Laplacian.$^{13}$

When a perfectly balanced partition is too restrictive, researchers explore variations of the cut(S) objective that honor the intuition that both S and its complement S^c should be reasonably large, even if not exactly |V|/2 These measures penalize small sides and encourage meaningful separation, yielding partitions that are informative in practice A prime example is Cheeger’s cut, which captures the trade-off between cut size and the sizes of the two sides and has become a canonical tool in graph partitioning.

Definition 3.4 (Cheeger's cut) Given a graph and a vertex partition $(S, S^c)$, the Cheeger cut (also known as conductance, and sometimes expansion) of $S$ is given by
$$h(S) = \frac{\mathrm{cut}(S)}{\min\{\mathrm{vol}(S), \mathrm{vol}(S^c)\}},$$
where $\mathrm{vol}(S) = \sum_{i \in S} \deg(i)$.

Also, the Cheeger constant of $G$ is given by
$$h_G = \min_{S \subset V} h(S).$$

Normalized Cut, Ncut, is a graph-partitioning measure that quantifies the quality of splitting a set S from its complement S^c by normalizing the crossing edge weight with the volumes of both sides It is defined as Ncut(S) = cut(S, S^c)/vol(S) + cut(S, S^c)/vol(S^c), where cut(S, S^c) is the total weight of edges that cross the cut and vol(S) and vol(S^c) are the sums of degrees (volumes) of the nodes in S and S^c The Normalized Cut is closely related to the conductance h(S); in fact, h(S) ≤ Ncut(S) ≤ 2 h(S).

13 W is the matrix of weights and D the degree matrix, a diagonal matrix with diagonal entries D ii = deg(i).

Both $h(S)$ and $\mathrm{Ncut}(S)$ favor nearly balanced partitions; Proposition 3.5 below will give an interpretation of Ncut via random walks.
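A small helper (not from the notes), assuming NumPy, that evaluates these quantities for a given partition of a weighted graph; it is a direct transcription of the definitions above on a toy graph.

```python
import numpy as np

def cut_measures(W, S):
    """cut(S), h(S) and Ncut(S) for a weight matrix W and a boolean mask S."""
    S = np.asarray(S, dtype=bool)
    deg = W.sum(axis=1)
    cut = W[np.ix_(S, ~S)].sum()
    vol_S, vol_Sc = deg[S].sum(), deg[~S].sum()
    h = cut / min(vol_S, vol_Sc)
    ncut = cut / vol_S + cut / vol_Sc
    return cut, h, ncut

# Toy example: two triangles joined by a single edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

print(cut_measures(W, [True, True, True, False, False, False]))
# cut = 1, vol(S) = vol(S^c) = 7, so h(S) = 1/7 and Ncut(S) = 2/7.
```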

Let us recall the construction from previous lectures of a random walk on $G = (V, E, W)$:
$$\mathrm{Prob}\{ X(t+1) = j \mid X(t) = i \} = \frac{w_{ij}}{\deg(i)} = M_{ij},$$
where $M = D^{-1} W$. Recall that $M = \Phi \Lambda \Psi^T$, where $\Lambda$ is the diagonal matrix with the eigenvalues $\lambda_k$ of $M$, and $\Phi$ and $\Psi$ form a biorthogonal system, $\Phi^T \Psi = I$, and correspond to, respectively, the right and left eigenvectors of $M$. Moreover, they are given by $\Phi = D^{-\frac{1}{2}} V$ and $\Psi = D^{\frac{1}{2}} V$, where $V^T V = I$ and $D^{-\frac{1}{2}} W D^{-\frac{1}{2}} = V \Lambda V^T$ is the spectral decomposition of $D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$.

Recall that $M$ has eigenvalue $1$ with a left eigenvector $\psi_1$ satisfying $\psi_1^T M = \psi_1^T$, and that $(\psi_1)_i$ is proportional to $\deg(i)$ (since $\psi_1 = D^{\frac{1}{2}} v_1 = D \varphi_1 \propto [\deg(i)]_i$). This means that the distribution with $\mathrm{Prob}\{X = i\} = \frac{\deg(i)}{\mathrm{vol}(G)}$, where $\mathrm{vol}(G) = \sum_i \deg(i)$, is the stationary distribution of the random walk driven by $M$. This can be checked easily: if $X(t)$ has distribution $p_t$, then $X(t+1)$ has distribution $p_{t+1}^T = p_t^T M$, and $p^T M = p^T$ when $p_i \propto \deg(i)$.

Proposition 3.5 states that for a graph G=(V,E,W) and a partition (S, S^c) of V, Ncut(S) corresponds to the probability, in the stationary distribution of the random walk on G, that a walker currently in S moves to S^c plus the probability that a walker currently in S^c moves to S; in other words, Ncut(S) is the sum of the two conditional transition probabilities P(go to S^c | in S) and P(go to S | in S^c), with these probabilities taken with respect to the stationary distribution.

$$\mathrm{Ncut}(S) = \mathrm{Prob}\left\{ X(t+1) \in S^c \mid X(t) \in S \right\} + \mathrm{Prob}\left\{ X(t+1) \in S \mid X(t) \in S^c \right\},$$
where $\mathrm{Prob}\{ X(t) = i \} = \frac{\deg(i)}{\mathrm{vol}(G)}$.

Proof Without loss of generality we can take $t = 0$. The second term in the sum corresponds to the first with $S$ replaced by its complement $S^c$ and vice versa, so we focus on the first term. We have
$$\mathrm{Prob}\{ X(1) \in S^c \mid X(0) \in S \} = \frac{\sum_{i \in S} \frac{\deg(i)}{\mathrm{vol}(G)} \sum_{j \in S^c} \frac{w_{ij}}{\deg(i)}}{\frac{\mathrm{vol}(S)}{\mathrm{vol}(G)}} = \frac{\mathrm{cut}(S)}{\mathrm{vol}(S)},$$
and analogously the second term equals $\frac{\mathrm{cut}(S)}{\mathrm{vol}(S^c)}$, which concludes the proof. □

Normalized Cut as a spectral relaxation

Below we will show that Ncut can be written in terms of a minimization of a quadratic form involving the graph Laplacian $L_G$, analogously to the balanced partition.

Recall that the minimum balanced cut can be written as
$$\min_{\substack{y \in \{\pm 1\}^n \\ 1^T y = 0}} \frac{1}{4}\, y^T L_G y.$$

An intuitive way to relax the balanced condition is to let the labels $y$ take two real values, $a$ and $b$, with $y_i = a$ for $i \in S$ and $y_i = b$ for $i \notin S$, rather than forcing them to be $\pm 1$. We can then use the volume of a set to define a less restrictive balance by requiring that $a\, \mathrm{vol}(S) + b\, \mathrm{vol}(S^c) = 0$, which is equivalent to $1^T D y = 0$.

We also need to fix a scale/normalization for $a$ and $b$:
$$a^2\, \mathrm{vol}(S) + b^2\, \mathrm{vol}(S^c) = 1,$$
which corresponds to $y^T D y = 1$.

As we will see below, this corresponds precisely to Ncut.

Proposition 3.6 For $a$ and $b$ to satisfy $a\, \mathrm{vol}(S) + b\, \mathrm{vol}(S^c) = 0$ and $a^2\, \mathrm{vol}(S) + b^2\, \mathrm{vol}(S^c) = 1$ it must be that
$$a = \sqrt{\frac{\mathrm{vol}(S^c)}{\mathrm{vol}(S)\, \mathrm{vol}(G)}} \quad \text{and} \quad b = -\sqrt{\frac{\mathrm{vol}(S)}{\mathrm{vol}(S^c)\, \mathrm{vol}(G)}}.$$

Proof The proof involves only simple algebraic manipulations together with noticing that $\mathrm{vol}(S) + \mathrm{vol}(S^c) = \mathrm{vol}(G)$. □

With $a$ and $b$ as in Proposition 3.6 and $y_i = a$ for $i \in S$, $y_i = b$ for $i \in S^c$, we have $\mathrm{Ncut}(S) = y^T L_G y$. Indeed,
$$y^T L_G y = \sum_{i \in S}\sum_{j \in S^c} w_{ij} (a - b)^2 = \sum_{i \in S}\sum_{j \in S^c} w_{ij} \left( \frac{\mathrm{vol}(S^c)}{\mathrm{vol}(S)\, \mathrm{vol}(G)} + \frac{2}{\mathrm{vol}(G)} + \frac{\mathrm{vol}(S)}{\mathrm{vol}(S^c)\, \mathrm{vol}(G)} \right) = \frac{\mathrm{cut}(S)}{\mathrm{vol}(S)} + \frac{\mathrm{cut}(S)}{\mathrm{vol}(S^c)} = \mathrm{Ncut}(S),$$
where the last equality uses $\mathrm{vol}(S) + \mathrm{vol}(S^c) = \mathrm{vol}(G)$.

This means that finding the minimum Ncut corresponds to solving
$$\begin{array}{ll} \min & y^T L_G y \\ \text{s.t.} & y \in \{a, b\}^n \ \text{for some } a \text{ and } b \\ & y^T D y = 1 \\ & y^T D 1 = 0. \end{array} \qquad (26)$$

Since solving (26) is, in general, NP-hard, we consider a similar problem where the constraint that $y$ can only take two values is removed:
$$\begin{array}{ll} \min & y^T L_G y \\ \text{s.t.} & y \in \mathbb{R}^n \\ & y^T D y = 1 \\ & y^T D 1 = 0. \end{array} \qquad (27)$$

Given a solution of (27) we can round it to a partition by setting a threshold τ and taking

Let $S = \{ i \in V : y_i \le \tau \}$. As shown below, (27) is an eigenvector problem, which is why we call it a spectral relaxation; moreover, the solution $y$ is proportional to $\varphi_2$, meaning that this approach exactly matches Algorithm 3.2.

In order to better see that (27) is an eigenvector problem (and thus computationally tractable), set $z = D^{\frac{1}{2}} y$ and $\mathcal{L}_G = D^{-\frac{1}{2}} L_G D^{-\frac{1}{2}}$; then (27) is equivalent to
$$\begin{array}{ll} \min & z^T \mathcal{L}_G z \\ \text{s.t.} & z \in \mathbb{R}^n \\ & \|z\|_2 = 1 \\ & z^T D^{\frac{1}{2}} 1 = 0. \end{array} \qquad (28)$$

Note that $\mathcal{L}_G = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$ is the normalized Laplacian. Its eigenvalues can be ordered increasingly as $0 = \lambda_1(\mathcal{L}_G) \le \lambda_2(\mathcal{L}_G) \le \cdots \le \lambda_n(\mathcal{L}_G)$, and the eigenvector corresponding to the smallest eigenvalue is $D^{\frac{1}{2}} 1$. By the variational interpretation of the eigenvalues, the minimum of (28) is $\lambda_2(\mathcal{L}_G)$, and the minimizer is given by the second smallest eigenvector of $\mathcal{L}_G = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$, which is the second largest eigenvector of $D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$, which we know is $v_2$. This means that the optimal $y$ in (27) is given by $\varphi_2 = D^{-\frac{1}{2}} v_2$. This confirms that this approach is equivalent to Algorithm 3.2.

Because the relaxation (27) is obtained from (26) by removing a constraint, we immediately have
$$\lambda_2(\mathcal{L}_G) \le \min_{S} \mathrm{Ncut}(S).$$

In what follows we will show a guarantee for Algorithm 3.2.

Lemma 3.8 There is a threshold $\tau$ producing a partition $S$ such that
$$h(S) \le \sqrt{2\, \lambda_2(\mathcal{L}_G)}.$$

This implies in particular that
$$h(S) \le \sqrt{4\, h_G} = 2\sqrt{h_G},$$
meaning that Algorithm 3.2 is suboptimal by at most a square-root factor.

Note that this also directly implies the famous Cheeger's Inequality.

Theorem 3.9 (Cheeger's Inequality) Recall the definitions above. The following holds:
$$\frac{\lambda_2(\mathcal{L}_G)}{2} \le h_G \le \sqrt{2\, \lambda_2(\mathcal{L}_G)}.$$

Cheeger's inequality was first established for manifolds by Jeff Cheeger in 1970 [Che70]; the graph version is due to Noga Alon and Vitali Milman [Alo86, AM85] in the mid 1980s.

The upper bound in Cheeger's inequality (corresponding to Lemma 3.8) is more interesting but more difficult to prove; it is often referred to as the "difficult part" of Cheeger's inequality. We will prove this Lemma in what follows. There are several proofs of this inequality (see [Chu10] for four different proofs!). The proof that follows is an adaptation of the proof in the blog post [Tre11] to the case of weighted graphs.

We will show that, given $y \in \mathbb{R}^n$ satisfying
$$R(y) := \frac{y^T L_G y}{y^T D y} \le \delta \quad \text{and} \quad y^T D 1 = 0,$$
there is a "rounding of it", meaning a threshold $\tau$ and a corresponding choice of partition $S = \{ i \in V : y_i \le \tau \}$ such that
$$h(S) \le \sqrt{2\delta}.$$
Since $y = \varphi_2$ satisfies the conditions and gives $\delta = \lambda_2(\mathcal{L}_G)$, this proves the Lemma.

We will pick this threshold at random and use the probabilistic method to show that at least one of the thresholds works.

First we can, without loss of generality, assume that $y_1 \le \cdots \le y_n$ (we can simply relabel the vertices). Also, note that scaling $y$ does not change the value of $R(y)$. Moreover, if $y^T D 1 = 0$, adding a multiple of $1$ to $y$ can only decrease the value of $R(y)$: the numerator does not change and the denominator satisfies $(y + c1)^T D (y + c1) = y^T D y + c^2\, 1^T D 1 \ge y^T D y$.

This means that we can construct (from $y$, by adding a multiple of $1$ and scaling) a vector $x$ such that $x_1 \le \cdots \le x_n$, $x_m = 0$, and $x_1^2 + x_n^2 = 1$, and
$$\frac{x^T L_G x}{x^T D x} \le \delta,$$
where $m$ is the index for which $\mathrm{vol}(\{1,\dots,m-1\}) \le \mathrm{vol}(\{m,\dots,n\})$ but $\mathrm{vol}(\{1,\dots,m\}) > \mathrm{vol}(\{m,\dots,n\})$.

We consider a random construction of $S$ with the following distribution: $S = \{ i \in V : x_i \le \tau \}$, where $\tau \in [x_1, x_n]$ is drawn at random with density proportional to $|\tau|$, i.e.,
$$\mathrm{Prob}\{ \tau \in [a, b] \} = \int_a^b 2|\tau|\, d\tau.$$

It is not difficult to check that, for $a \le b$,
$$\mathrm{Prob}\{ \tau \in [a, b] \} = \begin{cases} \left| b^2 - a^2 \right| & \text{if } a \text{ and } b \text{ have the same sign,} \\ a^2 + b^2 & \text{if } a \text{ and } b \text{ have different signs.} \end{cases}$$

Let us start by estimating $\mathbb{E}\, \mathrm{cut}(S)$.

Note that $\mathrm{Prob}\{(S, S^c) \text{ cuts the edge } (i,j)\}$ is $\left| x_i^2 - x_j^2 \right|$ if $x_i$ and $x_j$ have the same sign and $x_i^2 + x_j^2$ otherwise. Both cases can be conveniently upper bounded by $|x_i - x_j|\left( |x_i| + |x_j| \right)$. This means that
$$\mathbb{E}\, \mathrm{cut}(S) \le \sum_{i < j} w_{ij}\, |x_i - x_j| \left( |x_i| + |x_j| \right) \le \sqrt{ \sum_{i<j} w_{ij} (x_i - x_j)^2 }\; \sqrt{ \sum_{i<j} w_{ij} \left( |x_i| + |x_j| \right)^2 },$$
where the second inequality follows from the Cauchy–Schwarz inequality.

We know that
$$\sum_{ij} w_{ij} (x_i - x_j)^2 = 2\, x^T L_G x \le 2\delta\, x^T D x.$$

Also,
$$\sum_{ij} w_{ij} \left( |x_i| + |x_j| \right)^2 \le \sum_{ij} w_{ij} \left( 2 x_i^2 + 2 x_j^2 \right) = 2 \sum_{i} \deg(i)\, x_i^2 + 2 \sum_{j} \deg(j)\, x_j^2 = 4\, x^T D x.$$
Combining these bounds gives $\mathbb{E}\, \mathrm{cut}(S) \le \sqrt{2\delta}\; x^T D x$.

On the other hand,
$$\mathbb{E}\, \min\{\mathrm{vol}\, S, \mathrm{vol}\, S^c\} = \sum_{i=1}^{n} \deg(i)\, \mathrm{Prob}\{ x_i \text{ is in the smallest set (in terms of volume)} \};$$
to break ties, if $\mathrm{vol}(S) = \mathrm{vol}(S^c)$,

we take the smallest set to be the one with the earliest indices. Note that the vertex $m$ is always in the largest set. A vertex $j < m$ lies in the smallest set precisely when $x_j \le \tau < x_m = 0$, and a vertex $j > m$ lies in the smallest set precisely when $0 = x_m \le \tau < x_j$; together with the distribution of $\tau$, this determines exactly which vertices belong to the smallest set.

Hence,
$$\mathrm{Prob}\{ x_j \text{ is in the smallest set (in terms of volume)} \} = x_j^2,$$
and so

$$\mathbb{E}\, \min\{ \mathrm{vol}\, S, \mathrm{vol}\, S^c \} = \sum_{i=1}^{n} \deg(i)\, x_i^2 = x^T D x.$$

Note however that, because $\frac{\mathbb{E}\, \mathrm{cut}(S)}{\mathbb{E}\, \min\{\mathrm{vol}\, S, \mathrm{vol}\, S^c\}}$ is not necessarily the same as $\mathbb{E}\left[ \frac{\mathrm{cut}(S)}{\min\{\mathrm{vol}\, S, \mathrm{vol}\, S^c\}} \right]$, we do not necessarily have
$$\mathbb{E}\left[ \frac{\mathrm{cut}(S)}{\min\{\mathrm{vol}\, S, \mathrm{vol}\, S^c\}} \right] \le \sqrt{2\delta}.$$
However, since both random variables are positive,
$$\mathbb{E}\left[ \mathrm{cut}(S) - \sqrt{2\delta}\, \min\{\mathrm{vol}\, S, \mathrm{vol}\, S^c\} \right] \le 0,$$
which guarantees, by the probabilistic method, the existence of $S$ such that
$$\mathrm{cut}(S) \le \sqrt{2\delta}\, \min\{\mathrm{vol}\, S, \mathrm{vol}\, S^c\},$$
which is equivalent to
$$h(S) = \frac{\mathrm{cut}(S)}{\min\{\mathrm{vol}\, S, \mathrm{vol}\, S^c\}} \le \sqrt{2\delta},$$
which concludes the proof of the Lemma. □

Small Clusters and the Small Set Expansion Hypothesis

We now restrict to unweighted regular graphs $G = (V, E)$.

Cheeger's inequality allows one to efficiently approximate the Cheeger constant of a graph up to a square-root factor. In particular, given $G = (V, E)$ and $\phi$, one can efficiently distinguish between the cases $h_G \le \phi$ and $h_G \ge 2\sqrt{\phi}$. Can this be improved?

Open Problem 3.1 Does there exist a constant $c > 0$ such that it is NP-hard, given $\phi$ and $G$, to distinguish between the cases
1. $h_G \le \phi$, and
2. $h_G \ge c\sqrt{\phi}$?

This question is related to a central conjecture in theoretical computer science, described in [BS14], which is known [RS10] to imply the Unique Games Conjecture [Kho10]; we will explore the latter in forthcoming lectures.

Conjecture 3.10 (Small-Set Expansion Hypothesis [RS10]) For every $\varepsilon > 0$ there exists $\delta > 0$ such that it is NP-hard to distinguish between the cases

1. There exists a subset $S \subset V$ with $\mathrm{vol}(S) = \delta\, \mathrm{vol}(V)$ such that $\frac{\mathrm{cut}(S)}{\mathrm{vol}(S)} \le \varepsilon$.

2. $\frac{\mathrm{cut}(S)}{\mathrm{vol}(S)} \ge 1 - \varepsilon$, for every $S \subset V$ satisfying $\mathrm{vol}(S) \le \delta\, \mathrm{vol}(V)$.

Computing Eigenvectors

In spectral clustering one needs to compute $\varphi_2$, the second smallest eigenvector of $\mathcal{L}_G$;$^{14}$ this can be reduced to computing the leading eigenvector of a sparse symmetric matrix $A \in \mathbb{R}^{n\times n}$ with $m$ nonzero entries (provided that $|\lambda_{\max}(A)| \ge |\lambda_{\min}(A)|$). The power method starts from an initial vector $x_0$ and iteratively updates $x_{t+1} = \frac{A x_t}{\|A x_t\|}$, converging toward the leading eigenvector. As shown in [KW92], a randomized version of the power method produces a vector $x$ satisfying $x^T A x \ge \lambda_{\max}(A)(1 - \delta)\, x^T x$ in time $O\!\left( \delta^{-1} (m + n) \log n \right)$, meaning an approximate solution can be found in quasi-linear time.
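A minimal power-method sketch (not from the notes), assuming NumPy; the fixed iteration count is a crude stand-in for a proper stopping rule.

```python
import numpy as np

def power_method(A, iters=200, seed=0):
    """Approximate the leading eigenvector/eigenvalue of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    return x, x @ A @ x          # Rayleigh quotient approximates lambda_max

A = np.diag([3.0, 1.0, 0.5]) + 0.01 * np.ones((3, 3))
x, lam = power_method(A)
print(round(lam, 4))             # close to the largest eigenvalue of A
```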

One limitation of the power method is that, a posteriori, we cannot rule out the existence of an eigenvalue of $A$ much larger than the ones we have found, because all of our iterates might happen to be essentially orthogonal to the corresponding eigenvector, however unlikely that is. This subtlety motivates the open problem described below, which asks for a fast way to certify that no such larger eigenvalue exists.

Open Problem 3.2 asks whether, for a symmetric matrix M with a small condition number, there exists a quasi-linear time procedure (in the number of nonzero entries of M) that certifies M ≽ 0 The procedure may be randomized and may, with some probability, fail to certify M ≽ 0 even when M is PSD, but it must never produce erroneous certificates Moreover, when M ≽ 0, the probability of success should be bounded away from zero.

14 Note that, in spectral clustering, an error in the computation of λ_2 propagates gracefully to the guarantee given by Cheeger's inequality.

Note that the Cholesky decomposition produces such certificates, but it is not known whether it can be computed in quasi-linear time. The power method applied to αI − M can produce certificates whose probability of being erroneous is arbitrarily small, but not zero. In future lectures we will discuss the relevance of such a procedure as a fast tool to certify solutions produced by heuristics [Ban15b].
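The following sketch (our own illustration, not a certificate in the sense of Open Problem 3.2) shows the idea mentioned above: running the power method on αI − M estimates α − λ_min(M); turning this estimate into the statement M ⪰ 0 is only valid with high probability over the random initialization. The Gershgorin-based choice of α is ours.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((300, 60))
M = B.T @ B / 300                               # a PSD matrix with modest condition number
alpha = np.abs(M).sum(axis=1).max()             # alpha >= lambda_max(M) by Gershgorin

# Power method on alpha*I - M, whose leading eigenvalue is alpha - lambda_min(M).
x = rng.standard_normal(M.shape[0]); x /= np.linalg.norm(x)
for _ in range(2000):
    y = alpha * x - M @ x
    x = y / np.linalg.norm(y)
lam_min_est = x @ (M @ x)                       # Rayleigh quotient of M at the final iterate
print(lam_min_est, np.linalg.eigvalsh(M).min()) # a high-probability estimate, not a proof
```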

Multiple Clusters

Given a graph G = (V, E, W), a natural way of evaluating a k-way clustering is via the k-way expansion constant (see [LGT12]):

ρ_G(k) = min_{S_1, …, S_k} max_{l=1,…,k} cut(S_l)/vol(S_l),

where the minimization is over all choices of k disjoint subsets of V (but not necessarily forming a partition). Another natural definition is

φ_G(k) = min_{S ⊂ V : vol(S) ≤ (1/k) vol(V)} cut(S)/vol(S).

It is easy to see that φ_G(k) ≤ ρ_G(k).

Theorem 3.11 ([LGT12]) Let G = (V, E, W) be a graph and k a positive integer. Then

ρ_G(k) ≤ O(k²) √(λ_k).   (29)

Open Problem 3.3 Let G = (V, E, W) be a graph and k a positive integer; is the following true?

ρ_G(k) ≤ polylog(k) √(λ_k).   (30)

We note that (30) is known not to hold if the sets are required to form a partition (meaning that every vertex needs to belong to one of the sets), see [LRTV12]. Note also that some dependence on k in (30) is needed, as a bound independent of k would contradict the Small-Set Expansion Hypothesis described above.

4 Concentration Inequalities, Scalar and Matrix Versions

Large Deviation Inequalities

Sums of independent random variables

In what follows we'll show two useful inequalities involving sums of independent random variables. The intuitive idea is that if we have a sum X = X_1 + ⋯ + X_n of i.i.d. centered random variables, then, while X can in principle be as large as O(n), it is typically of order O(√n), the order of its standard deviation. The inequalities below control precisely the probability that X exceeds a level of order √n. One could use Chebyshev's inequality for this, but the bounds that follow decay exponentially rather than only quadratically, which is advantageous in many applications.

Theorem 4.3 (Hoeffding's Inequality) Let X_1, X_2, …, X_n be independent bounded random variables, i.e., |X_i| ≤ a and E[X_i] = 0. Then,

Prob{ | Σ_{i=1}^n X_i | > t } ≤ 2 exp( − t² / (2na²) ).

The inequality implies that fluctuations larger than O(√n) have small probability. For example, for t = a√(2n log n) we get that the probability is at most 2/n.
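A quick Monte Carlo sanity check of the bound (our own illustration): for Rademacher summands (a = 1) we compare the empirical tail with 2·exp(−t²/(2n)).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 1_000_000
# Sum of n Rademacher variables, simulated through a Binomial(n, 1/2).
sums = 2.0 * rng.binomial(n, 0.5, size=trials) - n
for t in [50, 100, 150]:
    empirical = np.mean(np.abs(sums) > t)
    bound = 2 * np.exp(-t**2 / (2 * n))
    print(f"t={t:4d}  empirical={empirical:.2e}  Hoeffding bound={bound:.2e}")
```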

Proof. We first derive a tail bound for the event Σ_{i=1}^n X_i > t. The argument uses Markov's inequality together with the classical exponential trick: for any λ > 0 we bound the moment generating function of Σ_{i=1}^n X_i, and then optimize over the choice of λ.

Prob{ Σ_{i=1}^n X_i > t } = Prob{ e^{λ Σ_{i=1}^n X_i} > e^{λt} } ≤ e^{−λt} E[ e^{λ Σ_{i=1}^n X_i} ] = e^{−λt} Π_{i=1}^n E[ e^{λ X_i} ],   (33)

where the penultimate step follows from Markov's inequality and the last equality follows from independence of the X_i's.

We now use the fact that |X_i| ≤ a to bound E[e^{λX_i}]. Because the function f(x) = e^{λx} is convex,

e^{λx} ≤ (a + x)/(2a) · e^{λa} + (a − x)/(2a) · e^{−λa},  for all x ∈ [−a, a].

Since, for all i, E[X_i] = 0, we get

E[e^{λX_i}] ≤ (e^{λa} + e^{−λa}) / 2 = cosh(λa).

Note that^15 cosh(x) ≤ e^{x²/2} for all x ∈ R. Hence,

E[e^{λX_i}] ≤ e^{(λa)²/2}.

15 This follows immediately from the Taylor expansions cosh(x) = Σ_{n=0}^∞ x^{2n}/(2n)! and e^{x²/2} = Σ_{n=0}^∞ x^{2n}/(2^n n!), together with the fact that (2n)! ≥ 2^n n!.

This inequality holds for any choice of λ ≥ 0, so we choose the value of λ that minimizes

n (λa)²/2 − λt.

Differentiating readily shows that the minimizer is given by λ = t/(na²), which satisfies λ > 0. For this choice of λ,

n(λa)²/2 − λt = t²/(2na²) − t²/(na²) = − t²/(2na²),

and hence Prob{ Σ_{i=1}^n X_i > t } ≤ exp( − t²/(2na²) ). By using the same argument on Σ_{i=1}^n (−X_i), and union bounding over the two events, we get

Prob{ | Σ_{i=1}^n X_i | > t } ≤ 2 exp( − t²/(2na²) ).

Remark 4.4 Let's say that we have random variables r_1, …, r_n i.i.d. distributed as

r_i = { −1 with probability p/2;  0 with probability 1 − p;  +1 with probability p/2 }.

Then, E[r_i] = 0 and |r_i| ≤ 1, so Hoeffding's inequality gives:

Prob{ | Σ_{i=1}^n r_i | > t } ≤ 2 exp( − t²/(2n) ).

Intuitively, the smaller p is, the more concentrated |Σ_{i=1}^n r_i| should be; however, Hoeffding's inequality does not capture this behavior.

A natural way to quantify this intuition is to observe that the variance of the sum ∑_{i=1}^n r_i depends on p; in particular, Var(r_i) = p Bernstein’s inequality then uses the summands’ variance to yield a tighter bound than Hoeffding’s inequality.

To improve the proof, we refine step (33) by using the variance to obtain a sharper estimate of E[e^{λX_i}]. If X_i is centered, has variance E[X_i²] = σ², and satisfies |X_i| ≤ a almost surely, then, for every k ≥ 2, |E[X_i^k]| ≤ E|X_i|^k ≤ σ² a^{k−2}. Feeding this moment bound into the Taylor expansion of the moment generating function gives improved control of E[e^{λX_i}] in terms of λ, σ², and a.

Theorem 4.5 (Bernstein's Inequality) Let X_1, X_2, …, X_n be independent centered bounded random variables, i.e., |X_i| ≤ a and E[X_i] = 0, with variance E[X_i²] = σ². Then,

Prob{ | Σ_{i=1}^n X_i | > t } ≤ 2 exp( − t² / ( 2nσ² + (2/3) a t ) ).

Remark 4.6 Before proving Bernstein's Inequality, note that on the example of Remark 4.4 we get

Prob{ | Σ_{i=1}^n r_i | > t } ≤ 2 exp( − t² / ( 2np + (2/3) t ) ),

which exhibits a dependence on p and, for small values of p, is considerably smaller than what Hoeffding's inequality gives.
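A numerical comparison (our own illustration, using the form of the bounds stated above) for the sparse variables of Remark 4.4: Hoeffding ignores the variance p, while Bernstein exploits it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 10_000, 0.01, 1_000_000
t = 4 * np.sqrt(n * p)                            # a deviation of four standard deviations
hoeffding = 2 * np.exp(-t**2 / (2 * n))
bernstein = 2 * np.exp(-t**2 / (2 * n * p + (2 / 3) * t))

# Simulate sum r_i: k nonzero entries (Binomial(n, p)), each carrying an independent sign.
k = rng.binomial(n, p, size=trials)
s = 2.0 * rng.binomial(k, 0.5) - k
empirical = np.mean(np.abs(s) > t)
print(f"Hoeffding: {hoeffding:.2e}  Bernstein: {bernstein:.2e}  empirical: {empirical:.2e}")
```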

As before, we will prove

Prob{ Σ_{i=1}^n X_i > t } ≤ exp( − t² / ( 2nσ² + (2/3) a t ) ),

and then union bound with the same result for Σ_{i=1}^n (−X_i) to prove the Theorem.

Now comes the source of the improvement over Hoeffding's: instead of bounding E[e^{λX_i}] using only |X_i| ≤ a, we use the moment bound |E[X_i^k]| ≤ σ² a^{k−2} to get

E[e^{λX_i}] = 1 + Σ_{k=2}^∞ λ^k E[X_i^k]/k! ≤ 1 + (σ²/a²) Σ_{k=2}^∞ (λa)^k/k! = 1 + (σ²/a²)( e^{λa} − 1 − λa ).

We will use a few simple inequalities (that can be easily proved with calculus), such as^16 1 + x ≤ e^x for all x ∈ R. This gives E[e^{λX_i}] ≤ exp( (σ²/a²)( e^{λa} − 1 − λa ) ), and hence

Prob{ Σ_{i=1}^n X_i > t } ≤ exp( − λt + n (σ²/a²)( e^{λa} − 1 − λa ) ).

As before, we try to find the value of λ that minimizes the exponent. Setting its derivative with respect to λ to zero gives

−t + n (σ²/a²) ( a e^{λa} − a ) = 0,

which implies that the optimal choice of λ is given by λ* = (1/a) log( 1 + at/(nσ²) ). If we set

u = at/(nσ²),   (34)

then λ* = (1/a) log(1 + u). Now, the value of the exponent at the minimum is given by

−λ* t + n (σ²/a²) ( e^{λ* a} − 1 − λ* a ) = − (nσ²/a²) [ (1+u) log(1+u) − u ].

The rest of the proof follows by noting that, for every u > 0,

(1+u) log(1+u) − u ≥ u² / ( 2 + (2/3) u ),

which, combined with the above, gives

Prob{ Σ_{i=1}^n X_i > t } ≤ exp( − t² / ( 2nσ² + (2/3) a t ) ).

16 In fact, y = 1 + x is the tangent line to the graph of f(x) = e^x at x = 0.

Gaussian Concentration

Spectral norm of a Wigner Matrix

To illustrate the utility of Gaussian concentration, consider W ∈ R^{n×n} a standard Gaussian Wigner matrix: a symmetric matrix with (otherwise) independent Gaussian entries, where the off-diagonal entries have unit variance and the diagonal entries have variance 2. The spectral norm ‖W‖ typically scales like 2√n, and Gaussian concentration explains why its fluctuations around this value are only of constant order: viewing ‖W‖ as a function of the independent Gaussian random variables that generate the upper-triangular part of W, it is a √2-Lipschitz function of these variables, since

| ‖W^(1)‖ − ‖W^(2)‖ | ≤ ‖ W^(1) − W^(2) ‖ ≤ ‖ W^(1) − W^(2) ‖_F.

The symmetry of the matrix and the variance 2 of the diagonal entries are responsible for the extra factor of √2.

Using Gaussian Concentration (Theorem 4.7) we immediately get

Proposition 4.8 Let W ∈ R^{n×n} be a standard Gaussian Wigner matrix, a symmetric matrix with (otherwise) independent gaussian entries, the off-diagonal entries having unit variance and the diagonal entries variance 2. Then, for every t ≥ 0,

Prob{ | ‖W‖ − E‖W‖ | ≥ t } ≤ 2 exp( − t²/4 ).

Note that this gives an extremely precise control of the fluctuations of ‖W‖. In fact, for t = 2√(log n) this gives

Prob{ | ‖W‖ − E‖W‖ | ≥ 2√(log n) } ≤ 2 exp( − log n ) = 2/n.
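A small simulation (our own illustration) of this phenomenon: the mean of ‖W‖ grows like 2√n while the observed standard deviation stays of constant order (in fact it even shrinks, since the Gaussian concentration bound is not tight for the largest eigenvalue).

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [100, 400, 900]:
    norms = []
    for _ in range(20):
        G = rng.standard_normal((n, n))
        W = (G + G.T) / np.sqrt(2)        # off-diagonal variance 1, diagonal variance 2
        norms.append(np.abs(np.linalg.eigvalsh(W)).max())
    norms = np.array(norms)
    print(f"n={n:4d}  mean ||W||={norms.mean():7.2f}  2*sqrt(n)={2*np.sqrt(n):7.2f}  std={norms.std():.3f}")
```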

Talagrand’s concentration inequality

A remarkable result by Talagrand [Tal95], Talagrand's concentration inequality, provides an analogue of Gaussian concentration for bounded random variables.

Theorem 4.9 (Talagrand concentration inequality, Theorem 2.1.13 in [Tao12]) Let K > 0, and let X_1, …, X_n be independent bounded random variables, |X_i| ≤ K for all 1 ≤ i ≤ n. Let F : R^n → R be a σ-Lipschitz and convex function. Then, for any t ≥ 0,

Prob{ | F(X) − E[F(X)] | ≥ t K } ≤ c_1 exp( − c_2 t²/σ² ),

for positive universal constants c_1 and c_2.

Other useful similar inequalities (with explicit constants) are available in [Mas00].

17 It is an excellent exercise to prove E‖W‖ ≤ 2√n using Slepian's inequality.

Other useful large deviation inequalities

Additive Chernoff Bound

The additive Chernoff bound, also known as the Chernoff–Hoeffding theorem, concerns Bernoulli random variables.

Theorem 4.10 Given 0 < p < 1 and X_1, …, X_n i.i.d. random variables distributed as Bernoulli(p), then, for every ε > 0:

Prob{ (1/n) Σ_{i=1}^n X_i ≥ p + ε } ≤ [ (p/(p+ε))^{p+ε} ((1−p)/(1−p−ε))^{1−p−ε} ]^n,

and

Prob{ (1/n) Σ_{i=1}^n X_i ≤ p − ε } ≤ [ (p/(p−ε))^{p−ε} ((1−p)/(1−p+ε))^{1−p+ε} ]^n.

Multiplicative Chernoff Bound

There is also a multiplicative version (see, for example Lemma 2.3.3 in [Dur06]), which is particularly useful.

Theorem 4.11 Let X_1, …, X_n be independent random variables taking values in {0,1} (meaning they are Bernoulli distributed but not necessarily identically distributed). Let μ = E[ Σ_{i=1}^n X_i ]. Then, for any δ > 0,

Prob{ Σ_{i=1}^n X_i > (1 + δ) μ } ≤ [ e^δ / (1+δ)^{1+δ} ]^μ.
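A quick empirical check (our own illustration) of the multiplicative bound for a sum of non-identically distributed Bernoulli variables; the probabilities `probs` are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.02, 0.1, size=200)          # heterogeneous success probabilities
mu = probs.sum()
trials = 500_000
X = np.zeros(trials)
for p in probs:                                    # X = sum of independent Bernoulli(p_i)
    X += rng.random(trials) < p
for delta in [0.25, 0.5, 1.0]:
    bound = (np.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
    empirical = np.mean(X > (1 + delta) * mu)
    print(f"delta={delta:4.2f}  empirical={empirical:.2e}  Chernoff bound={bound:.2e}")
```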

Deviation bounds on χ 2 variables

A particularly useful deviation inequality is Lemma 1 in Laurent and Massart [LM00]:

Theorem 4.12 (Lemma 1 in Laurent and Massart [LM00]) Let X_1, …, X_n be i.i.d. standard gaussian random variables (N(0,1)), and a_1, …, a_n non-negative numbers. Let

Z = Σ_{k=1}^n a_k ( X_k² − 1 ).

The following inequalities hold for any x > 0:

• Prob{ Z ≥ 2‖a‖₂ √x + 2‖a‖_∞ x } ≤ exp(−x),

• Prob{ Z ≤ −2‖a‖₂ √x } ≤ exp(−x),

where ‖a‖₂² = Σ_{k=1}^n a_k² and ‖a‖_∞ = max_{1≤k≤n} |a_k|.

Note that if a_k = 1 for all k, then Z is a centered χ² random variable with n degrees of freedom, so this theorem immediately gives a deviation inequality for χ² random variables.
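A numerical check (our own illustration) of the bound in the plain χ² case a_k = 1, comparing the empirical probability of the upper deviation with exp(−x).

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 500_000
Z = rng.chisquare(n, size=trials) - n             # Z = sum_k (X_k^2 - 1) with a_k = 1
norm2, norm_inf = np.sqrt(n), 1.0                 # ||a||_2 and ||a||_infinity
for x in [2.0, 5.0, 10.0]:
    threshold = 2 * norm2 * np.sqrt(x) + 2 * norm_inf * x
    print(f"x={x:4.1f}  empirical={np.mean(Z >= threshold):.2e}  bound exp(-x)={np.exp(-x):.2e}")
```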

Matrix Concentration

In many important applications, some of which we will see in subsequent lectures, one needs to use a matrix version of the inequalities above.

Given {X_k}_{k=1}^n independent random symmetric d×d matrices, one is interested in deviation inequalities for

λ_max( Σ_{k=1}^n X_k ).

For example, a very useful adaptation of Bernstein's inequality exists for this setting.

Theorem 4.13 (Theorem 1.4 in [Tro12]) Let {X_k}_{k=1}^n be a sequence of independent random symmetric d×d matrices. Assume that each X_k satisfies:

E X_k = 0 and λ_max(X_k) ≤ R almost surely.

Then, for all t ≥ 0,

Prob{ λ_max( Σ_{k=1}^n X_k ) ≥ t } ≤ d · exp( − t² / ( 2σ² + (2/3) R t ) ),  where σ² := ‖ Σ_{k=1}^n E( X_k² ) ‖.

Note that ‖A‖ denotes the spectral norm of A.
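The following simulation (our own illustration) checks the bound of Theorem 4.13 on a simple ensemble of centered rank-one summands X_k = v_k v_kᵀ − I/d with v_k uniform on the unit sphere; here R = 1 − 1/d and E[X_k²] = (1/d)(1 − 1/d)·I, so σ² = (n/d)(1 − 1/d). At this small dimension the bound is valid but far from tight.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 20, 300, 20_000
maxeigs = np.empty(trials)
for i in range(trials):
    V = rng.standard_normal((n, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)    # rows v_k uniform on the sphere
    S = V.T @ V - (n / d) * np.eye(d)                # sum_k X_k, with X_k = v_k v_k^T - I/d
    maxeigs[i] = np.linalg.eigvalsh(S).max()

R = 1 - 1 / d                                        # lambda_max(X_k) = 1 - 1/d
sigma2 = n * (1 / d) * (1 - 1 / d)                   # || sum_k E X_k^2 ||
for t in [9.0, 10.0, 12.0]:
    bound = d * np.exp(-t**2 / (2 * sigma2 + (2 / 3) * R * t))
    print(f"t={t:5.1f}  empirical={np.mean(maxeigs >= t):.2e}  matrix Bernstein={min(bound, 1):.2e}")
```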

In what follows we state and prove several matrix concentration results closely related to Theorem 4.13. Motivated by the derivation of Proposition 4.8, which allows one to convert bounds on the expected spectral norm of a random matrix into tail bounds, we will mostly focus on bounding the expected spectral norm. Tropp's monograph is a nice introduction to matrix concentration and includes a proof of Theorem 4.13 as well as many other useful inequalities.

Among inequalities for Gaussian series, a particularly important one is intimately related to the non-commutative Khintchine inequality [Pis03], and for this reason it is often called the Non-commutative Khintchine inequality (see, for example, equation (4.9) in [Tro12]).

Theorem 4.14 (Non-commutative Khintchine (NCK)) Let A_1, …, A_n ∈ R^{d×d} be symmetric matrices and g_1, …, g_n ∼ N(0,1) i.i.d., then:

E‖ Σ_{k=1}^n g_k A_k ‖ ≲ √(log d) · ‖ Σ_{k=1}^n A_k² ‖^{1/2}.

Note that, akin to Proposition 4.8, we can also use Gaussian concentration to get a tail bound on ‖ Σ_{k=1}^n g_k A_k ‖. We consider the function F : R^n → R given by F(g) = ‖ Σ_{k=1}^n g_k A_k ‖.

We now estimate its Lipschitz constant; let g, h ∈ R^n, then

| F(g) − F(h) | ≤ ‖ Σ_{k=1}^n (g_k − h_k) A_k ‖ = max_{v: ‖v‖=1} | Σ_{k=1}^n (g_k − h_k) v^T A_k v | ≤ max_{v: ‖v‖=1} ( Σ_{k=1}^n (g_k − h_k)² )^{1/2} ( Σ_{k=1}^n (v^T A_k v)² )^{1/2} = ( max_{v: ‖v‖=1} Σ_{k=1}^n (v^T A_k v)² )^{1/2} ‖ g − h ‖,

where the first inequality made use of the triangle inequality and the last one of the Cauchy–Schwarz inequality.

This motivates us to define a new parameter, the weak varianceσ∗.

Definition 4.15 (Weak Variance (see, for example, [Tro15b])) Given A_1, …, A_n ∈ R^{d×d} symmetric matrices, we define the weak variance parameter as

σ_*² = max_{v: ‖v‖=1} Σ_{k=1}^n ( v^T A_k v )².

This means that, using Gaussian concentration (and setting t = u σ_*), we have

Prob{ ‖ Σ_{k=1}^n g_k A_k ‖ ≥ E‖ Σ_{k=1}^n g_k A_k ‖ + u σ_* } ≤ exp( − u²/2 ).

This means that although the expected value of ‖ Σ_{k=1}^n g_k A_k ‖ is controlled by the parameter σ, its fluctuations seem to be controlled by σ_*. We compare the two quantities in the following Proposition.

Proposition 4.16 Given A_1, …, A_n ∈ R^{d×d} symmetric matrices, recall that

σ = ‖ Σ_{k=1}^n A_k² ‖^{1/2}  and  σ_* = ( max_{v: ‖v‖=1} Σ_{k=1}^n ( v^T A_k v )² )^{1/2}.

Then σ_* ≤ σ.

Proof. Using the Cauchy–Schwarz inequality,

σ_*² = max_{v: ‖v‖=1} Σ_{k=1}^n ( v^T A_k v )² ≤ max_{v: ‖v‖=1} Σ_{k=1}^n ( ‖v‖ ‖A_k v‖ )² = max_{v: ‖v‖=1} Σ_{k=1}^n v^T A_k² v = ‖ Σ_{k=1}^n A_k² ‖ = σ².
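A numerical illustration (our own) of the two parameters for a random Gaussian matrix series: σ controls the expected norm (up to the logarithmic factor in NCK), σ_* is never larger, and here we only lower-bound σ_* by a random search over unit vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 30, 80
A = rng.standard_normal((n, d, d)); A = (A + A.transpose(0, 2, 1)) / 2   # symmetric A_k

sigma = np.sqrt(np.linalg.norm((A @ A).sum(axis=0), 2))                  # || sum A_k^2 ||^{1/2}
vs = rng.standard_normal((2000, d)); vs /= np.linalg.norm(vs, axis=1, keepdims=True)
quad = np.einsum('vi,kij,vj->vk', vs, A, vs, optimize=True)              # v^T A_k v for each v, k
sigma_star_lb = np.sqrt((quad**2).sum(axis=1).max())                     # lower bound on sigma_*

norms = [np.linalg.norm(np.tensordot(rng.standard_normal(n), A, axes=1), 2) for _ in range(300)]
print(f"sigma={sigma:.2f}  sigma_* >= {sigma_star_lb:.2f}")
print(f"E||sum g_k A_k|| ~ {np.mean(norms):.2f}  (std {np.std(norms):.2f}),  sqrt(log d)*sigma={np.sqrt(np.log(d)) * sigma:.2f}")
```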

Optimality of matrix concentration result for gaussian series

An interesting observation regarding random matrices with independent entries is that, in this setting, the bound of Theorem 4.19 is tight (up to constants) for a broad class of variance profiles {b_ij}_{i≤j}; this is made precise in Corollary 3.15 of [BvH15]. Roughly speaking, tightness holds whenever the largest variances are comparable to the variances of sufficiently many other entries.

However, the situation is not as well understood when the variance profile {b²_ij}_{i≤j} is arbitrary.

Since the spectral norm of a matrix is always at least the ℓ₂ norm of any of its rows, the following lower bound holds (for X a symmetric random matrix with independent gaussian entries):

E‖X‖ ≥ E max_k ‖ X e_k ‖₂.

The results of Latała [Lat05] and Riemer and Schutt [RS13], together with those in [BvH15], suggest that this lower bound is tight up to multiplicative constants.

Open Problem 4.2 (Latała–Riemer–Schutt) Given X a symmetric random matrix with independent gaussian entries, is the following true?

E‖X‖ ≲ E max_k ‖ X e_k ‖₂.

The results in [BvH15] answer this question affirmatively for a broad range of variance profiles, but not in full generality. More recently, van Handel [vH15] proved the conjecture in the positive with an extra factor of √(log log d); more precisely,

E‖X‖ ≲ √(log log d) · E max_k ‖ X e_k ‖₂,

where d is the number of rows (and columns) of X.

19 We briefly discuss this improvement in Remark 4.32

A matrix concentration inequality for Rademacher Series

A small detour on discrepancy theory

The following conjecture appears in a nice blog post of Raghu Meka [Mek14].

Conjecture 4.23 [Matrix Six-Deviations Suffice] There exists a universal constant C such that, for any choice of n symmetric matrices H_1, …, H_n ∈ R^{n×n} satisfying ‖H_k‖ ≤ 1 (for all k = 1, …, n), there exist ε_1, …, ε_n ∈ {±1} such that

‖ Σ_{k=1}^n ε_k H_k ‖ ≤ C √n.

Open Problem 4.3 Prove or disprove Conjecture 4.23.

Note that, when the matricesH k are diagonal, this problem corresponds to Spencer’s Six Standard Deviations Suffice Theorem [Spe85].

Remark 4.24 Also, using Theorem 4.22, it is easy to show that if one picks ε_i as i.i.d. Rademacher random variables, then with positive probability (via the probabilistic method) the inequality will be satisfied with an extra √(log n) term. In fact one has

E‖ Σ_{k=1}^n ε_k H_k ‖ ≲ √(log n) ‖ Σ_{k=1}^n H_k² ‖^{1/2} ≤ √(n log n).

Remark 4.25 Remark 4.24 motivates asking whether Conjecture 4.23 can be strengthened to ask for ε_1, …, ε_n such that

‖ Σ_{k=1}^n ε_k H_k ‖ ≲ ‖ Σ_{k=1}^n H_k² ‖^{1/2}.
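A small experiment (our own illustration) in the spirit of Remarks 4.24–4.25: for generic random H_k with ‖H_k‖ = 1, i.i.d. random signs already give ‖Σ ε_k H_k‖ of the same order as σ = ‖Σ H_k²‖^{1/2} ≈ √n, without the √(log n) loss; the conjectures above concern the worst-case choice of the H_k's.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [50, 100, 200]:
    H = rng.standard_normal((n, n, n)); H = (H + H.transpose(0, 2, 1)) / 2
    H /= np.abs(np.linalg.eigvalsh(H)).max(axis=1)[:, None, None]   # normalize ||H_k|| = 1
    eps = rng.choice([-1.0, 1.0], size=n)
    norm = np.abs(np.linalg.eigvalsh(np.tensordot(eps, H, axes=1))).max()
    sigma = np.sqrt(np.abs(np.linalg.eigvalsh((H @ H).sum(axis=0))).max())
    print(f"n={n:4d}  ||sum eps_k H_k||={norm:7.2f}  sigma={sigma:7.2f}  sqrt(n)={np.sqrt(n):6.2f}")
```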

Back to matrix concentration

Using Theorem 4.22, we’ll prove the following Theorem.

Theorem 4.26 Let T1, , Tn∈R d×d be random independent positive semidefinite matrices, then

A key step in the proof of Theorem 4.26 is an idea that is extremely useful in Probability, the trick of symmetrization For this reason we isolate it in a lemma.

Lemma 4.27 (Symmetrization) Let T_1, …, T_n be independent random matrices (note that they don't necessarily need to be positive semidefinite, for the sake of this lemma) and ε_1, …, ε_n i.i.d. Rademacher random variables (independent also from the matrices). Then

E‖ Σ_{i=1}^n T_i ‖ ≤ ‖ Σ_{i=1}^n E T_i ‖ + 2 E‖ Σ_{i=1}^n ε_i T_i ‖.

Let us now introduce, for each i, a random matrix T_i′ identically distributed to T_i and independent of everything else (all 2n matrices are independent). Then

E_T ‖ Σ_i T_i ‖ = E_T ‖ Σ_i T_i − E_{T′}[ Σ_i T_i′ ] + Σ_i E T_i ‖ ≤ ‖ Σ_i E T_i ‖ + E_T ‖ Σ_i T_i − E_{T′}[ Σ_i T_i′ ] ‖ ≤ ‖ Σ_i E T_i ‖ + E_T E_{T′} ‖ Σ_i ( T_i − T_i′ ) ‖,

where we use the notation E_a to denote that the expectation is taken with respect to the variable a, and the last step follows from Jensen's inequality with respect to E_{T′}.

Since T_i − T_i′ is a symmetric random variable (it has the same distribution as its negative), it is identically distributed to ε_i ( T_i − T_i′ ), which gives

E_T E_{T′} ‖ Σ_i ( T_i − T_i′ ) ‖ = E_{T, T′, ε} ‖ Σ_i ε_i ( T_i − T_i′ ) ‖ ≤ E_{T, ε} ‖ Σ_i ε_i T_i ‖ + E_{T′, ε} ‖ Σ_i ε_i T_i′ ‖ = 2 E‖ Σ_i ε_i T_i ‖,

concluding the proof.

Using Lemma 4.27 and Theorem 4.22 we get

The trick is now to make the term E‖ Σ_{i=1}^n T_i ‖ appear in the right-hand side. For that we use the following simple fact, for which an elementary proof is given in Fact 2.3 of [Tro15a]: since the T_i are positive semidefinite,

Σ_{i=1}^n T_i² ⪯ ( max_i ‖T_i‖ ) Σ_{i=1}^n T_i,  and hence  ‖ Σ_{i=1}^n T_i² ‖ ≤ ( max_i ‖T_i‖ ) ‖ Σ_{i=1}^n T_i ‖.

Further applying the Cauchy–Schwarz inequality for the expectation gives,

Now that the term E‖ Σ_{i=1}^n T_i ‖ appears in the RHS, the proof can be finished with a simple application of the quadratic formula (see Section 6.1 in [Tro15a] for details).

We now show an inequality for general symmetric matrices

Theorem 4.28 Let Y_1, …, Y_n ∈ R^{d×d} be random independent symmetric matrices, then

2 and L =Emax i kY i k 2 (43) and, as in (42),

Using Symmetrization (Lemma 4.27) and Theorem 4.22, we get

and the proof can be concluded by noting that Y_i² ⪰ 0 and using Theorem 4.26.

Remark 4.29 (The rectangular case) One can extend Theorem 4.28 to general rectangular matrices S_1, …, S_n ∈ R^{d_1×d_2} by setting

Y_i = [ 0  S_i ; S_i^T  0 ] ∈ R^{(d_1+d_2)×(d_1+d_2)}.

We defer the details to [Tro15a].

In order to prove Theorem 4.22 we will use an AM-GM type inequality for matrices for which, unlike the matrix AM-GM inequality of Open Problem 0.2 in [Ban15d], an elementary proof is known.

Lemma 4.30 Given symmetric matrices H, W, Y ∈ R d×d and non-negative integers r, q satisfying q≤2r,

An elementary proof is given in Fact 2.4 of [Tro15a]. The inequality is a matrix analogue of the weighted AM–GM inequality a^θ b^{1−θ} ≤ θ a + (1−θ) b for a, b ≥ 0 and 0 ≤ θ ≤ 1; the scalar version of the Lemma follows by adding this inequality to the one obtained by exchanging the roles of a and b.

Let X = Σ_{k=1}^n ε_k H_k ; then, for any positive integer p,

E‖X‖ ≤ ( E‖X‖^{2p} )^{1/(2p)} = ( E‖X^{2p}‖ )^{1/(2p)} ≤ ( E Tr X^{2p} )^{1/(2p)}.

The first inequality follows from Jensen’s inequality, while the last follows from X^{2p} ≥ 0 and the observation that the trace of a positive semidefinite matrix is at least its spectral norm In the sequel, we upper-bound E[Tr X^{2p}] To facilitate the analysis, we define X_i^{+} and X_i^{-} as X conditioned on ε_i taking the values +1 and −1, respectively.

E Tr X^{2p} = Σ_{i=1}^n E[ ε_i Tr( H_i X^{2p−1} ) ] = Σ_{i=1}^n (1/2) E Tr[ H_i ( (X_i^+)^{2p−1} − (X_i^-)^{2p−1} ) ],

where the expectation can be taken over ε_j for j ≠ i.

Now we rewrite (X_i^+)^{2p−1} − (X_i^-)^{2p−1} as a telescopic sum:

(X_i^+)^{2p−1} − (X_i^-)^{2p−1} = Σ_{q=0}^{2p−2} (X_i^+)^q ( X_i^+ − X_i^- ) (X_i^-)^{2p−2−q},

and note that X_i^+ − X_i^- = 2 H_i.

We now make use of Lemma 4.30 to get^20

20 See Remark 4.32 regarding the suboptimality of this step.

E Tr X^{2p} ≤ σ² (2p−1) E Tr X^{2p−2}.   (47)

Applying this inequality recursively, we get

E Tr X^{2p} ≤ [ (2p−1)!! ] σ^{2p} d.

Taking p = ⌈log d⌉ and using the fact that

(2p−1)!! ≤ ( (2p+1)/e )^p

(see [Tro15a] for an elementary proof, consisting essentially of taking logarithms and comparing the sum with an integral), we get

E‖X‖ ≤ ( (2p−1)!! σ^{2p} d )^{1/(2p)} ≤ √( 2⌈log d⌉ + 1 ) · σ.

Remark 4.31 A similar argument can be used to prove Theorem 4.14 (the gaussian series case) based on gaussian integration by parts, see Section 7.2 in [Tro15c].

Remark 4.32 Note that, up until the step from (44) to (45), all steps are equalities, suggesting that this step may be the lossy one responsible for the suboptimal dimensional factor in several cases (although (46) can also potentially be lossy, it is not uncommon that H_i² is a multiple of the identity matrix, which would render that step an equality as well).

In fact, Joel Tropp [Tro15c] recently proved an improvement over the NCK inequality that, essentially, replaces inequality (45) with a tighter argument. In a nutshell, the idea is that, if the H_i's are non-commutative, then most summands in (44) are actually smaller than the ones corresponding to q = 0 and q = 2p − 2, which are the terms appearing in (45).

Other Open Problems

The Johnson-Lindenstrauss Lemma

Gordon’s Theorem

Sparse vectors and Low-rank matrices

Coherence and Gershgorin Circle Theorem

Some Coding Theory and the proof of Theorem 7.3

In terms of linear Bernoulli algebra

What does the spike model suggest?

The analysis

Angular Synchronization

Signal Alignment
