often compute various MLMs as fast as O(N) or O(N log N). We will highlight a few of the basic concepts behind such methods.
2.3 Seven Types of Computational Problem
There are a large number of statistical/machine learning methods described in this book. Making them run fast boils down to a number of different types of computational problems, including the following:
1. Basic problems: These include simple statistics, like means, variances, and covariance matrices. We also put basic one-dimensional sorts and range searches in this category. These are all typically simple to compute in the sense that they are O(N) or O(N log N) at worst. We will discuss some key basic problems in §2.5.1; a minimal code sketch appears after this list.
2. Generalized N-body problems: These include virtually any problem involving distances or other similarities between (all or many) pairs (or higher-order n-tuples) of points, such as nearest-neighbor searches, correlation functions, or kernel density estimates. Such problems are typically O(N²) or O(N³) if computed straightforwardly, but more sophisticated algorithms are available (WSAS, [12]). We will discuss some such problems in §2.5.2; a nearest-neighbor sketch appears after this list.
3. Linear algebraic problems: These include all the standard problems of computational linear algebra, including linear systems, eigenvalue problems, and inverses. Assuming typical cases with N ≫ D, these can be O(N), but in some cases the matrix of interest is N × N, making the computation O(N³). Some common examples where parameter fitting ends up being conveniently phrased in terms of linear algebra problems appear in dimensionality reduction (chapter 7) and linear regression (chapter 8); a sketch contrasting the two regimes appears after this list.
4. Optimization problems: Optimization is the process of finding the minimum or maximum of a function. This class includes all the standard subclasses of optimization problems, from unconstrained to constrained, convex and nonconvex. Unconstrained optimizations can be fast (though somewhat indeterminate, as they generally only lead to local optima), being O(N) for each of a number of iterations; a gradient descent sketch appears after this list. Constrained optimizations, such as the quadratic programs required by nonlinear support vector machines (discussed in chapter 9), are O(N³) in the worst case. Some optimization approaches beyond the widely used unconstrained optimization methods, such as gradient descent or conjugate gradient, are discussed in §4.4.3 on the expectation maximization algorithm for mixtures of Gaussians.
5. Integration problems: Integration arises heavily in the estimation of Bayesian models, and typically involves high-dimensional functions. Performing integration with high accuracy via quadrature has a computational complexity which is exponential in the dimensionality D. In §5.8 we describe the Markov chain Monte Carlo (MCMC) algorithm, which can be used for efficient high-dimensional integration and related computations; a minimal sampler sketch appears after this list.
6. Graph-theoretic problems: These problems involve traversals of graphs, as in probabilistic graphical models or nearest-neighbor graphs for manifold learning; a neighbor-graph sketch appears after this list. The most difficult computations here are those involving discrete variables, in which the computational cost may be O(N) but is exponential in the number of interacting discrete variables among the D dimensions. Exponential computations are by far the most time consuming and generally must be avoided at all costs.
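To make these categories concrete, the following short Python sketches illustrate each problem type in turn; all variable names, data sizes, and library choices are our own illustrative assumptions rather than anything prescribed above. First, the basic problems of item 1: O(N) summary statistics, plus a range search that costs O(log N) per query after an O(N log N) sort.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))  # N=10000 points in D=3 dimensions

# O(N) statistics: a single pass over the data suffices
mean = X.mean(axis=0)            # per-dimension means
var = X.var(axis=0)              # per-dimension variances
cov = np.cov(X, rowvar=False)    # D x D covariance matrix, O(N D^2)

# O(N log N) sort, then O(log N) range searches via binary search
x_sorted = np.sort(X[:, 0])
lo, hi = np.searchsorted(x_sorted, [-1.0, 1.0])
print("points with first coordinate in [-1, 1]:", hi - lo)
```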
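For the generalized N-body problems of item 2, the canonical example is an all-nearest-neighbor search: the brute-force version below scales as O(N²) in both time and memory, while a space-partitioning tree (here SciPy's cKDTree, one of several possible choices) answers the same queries far faster in low dimensions.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
X = rng.uniform(size=(2000, 2))  # N=2000 points in the unit square

# Brute force: all pairwise squared distances, O(N^2) time and memory
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)     # exclude self-matches
nn_brute = d2.argmin(axis=1)     # each point's nearest neighbor

# Tree-based: build in O(N log N), then query all N points
tree = cKDTree(X)
_, idx = tree.query(X, k=2)      # column 0 is the point itself
nn_tree = idx[:, 1]

assert np.array_equal(nn_brute, nn_tree)
```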
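For the linear algebraic problems of item 3, this sketch contrasts the two regimes mentioned there: a least-squares fit costing O(N D²) when N ≫ D, and the direct solution of an N × N system, as arises with kernel matrices, costing O(N³). The matrices here are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5000, 4

# Linear regression: N >> D, so the fit costs O(N D^2)
A = rng.normal(size=(N, D))
y = A @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=N)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# An N x N system (e.g., a kernel matrix) costs O(N^3) to solve directly
M = rng.normal(size=(500, 500))
M = M @ M.T + 500 * np.eye(500)  # symmetric positive definite
b = rng.normal(size=500)
x = np.linalg.solve(M, b)
```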
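For item 4, a minimal unconstrained optimization: plain gradient descent on a least-squares objective, where each iteration makes one O(N D) pass over the data. The fixed step size is an assumption that happens to suit this well-conditioned example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 10000, 3
A = rng.normal(size=(N, D))
y = A @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

# Minimize f(w) = ||A w - y||^2 / (2N) by gradient descent
w = np.zeros(D)
step = 0.1                        # fixed step size (assumed suitable here)
for _ in range(200):              # each iteration costs O(N D)
    grad = A.T @ (A @ w - y) / N  # gradient of the mean squared residual
    w -= step * grad

print("fitted coefficients:", w)
```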
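For item 5, a minimal random-walk Metropolis sampler (a simple MCMC variant) estimates a 10-dimensional expectation that grid quadrature could not touch; the target density, proposal scale, chain length, and burn-in are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 10  # a dimensionality at which grid quadrature is already hopeless

def log_target(x):
    """Unnormalized log density: a standard D-dimensional Gaussian."""
    return -0.5 * np.dot(x, x)

x = np.zeros(D)
samples = []
for _ in range(20000):
    proposal = x + 0.5 * rng.normal(size=D)  # random-walk proposal
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal                         # Metropolis accept step
    samples.append(x.copy())

samples = np.array(samples[5000:])           # discard burn-in
# Estimate E[||x||^2]; for a standard Gaussian the true value is D
print("estimated E[||x||^2]:", (samples ** 2).sum(axis=1).mean())
```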
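Finally, for item 6, this sketch builds the kind of k-nearest-neighbor graph used in manifold learning and then traverses it to count connected components; the value of k and the SciPy utilities are our choices for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(4)
# Two well-separated clumps of points
X = np.vstack([rng.normal(0, 0.1, size=(100, 2)),
               rng.normal(5, 0.1, size=(100, 2))])

# k-nearest-neighbor graph: an edge from each point to its k neighbors
k = 5
tree = cKDTree(X)
_, idx = tree.query(X, k=k + 1)               # column 0 is the point itself
rows = np.repeat(np.arange(len(X)), k)
cols = idx[:, 1:].ravel()
graph = coo_matrix((np.ones(len(rows)), (rows, cols)),
                   shape=(len(X), len(X)))

# Graph traversal: count connected components (here, the two clumps)
n_comp, labels = connected_components(graph, directed=False)
print("connected components:", n_comp)        # expected: 2
```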