often compute various MLMs as fast as O(N) or O(N log N). We will highlight a few of the basic concepts behind such methods.
2.3 Seven Types of Computational Problem
There are a large number of statistical/machine learning methods described in this book. Making them run fast boils down to a number of different types of computational problems, including the following:
1. Basic problems: These include simple statistics, like means, variances, and covariance matrices. We also put basic one-dimensional sorts and range searches in this category. These are all typically simple to compute in the sense that they are O(N) or O(N log N) at worst. We will discuss some key basic problems in §2.5.1; a minimal code sketch appears after this list.
2. Generalized N-body problems: These include virtually any problem involving distances or other similarities between (all or many) pairs (or higher-order n-tuples) of points, such as nearest-neighbor searches, correlation functions, or kernel density estimates. Such problems are typically O(N²) or O(N³) if computed straightforwardly, but more sophisticated algorithms are available (WSAS, [12]). We will discuss some such problems in §2.5.2; a nearest-neighbor sketch appears after this list.
3. Linear algebraic problems: These include all the standard problems of computational linear algebra, including linear systems, eigenvalue problems, and inverses. Assuming typical cases with N ≫ D, these can be O(N), but in some cases the matrix of interest is N × N, making the computation O(N³). Some common examples where parameter fitting ends up being conveniently phrased in terms of linear algebra problems appear in dimensionality reduction (chapter 7) and linear regression (chapter 8); a sketch contrasting the two regimes appears after this list.
4. Optimization problems: Optimization is the process of finding the minimum or maximum of a function. This class includes all the standard subclasses of optimization problems, from unconstrained to constrained, convex and nonconvex. Unconstrained optimizations can be fast (though somewhat indeterminate, as they generally only lead to local optima), being O(N) for each of a number of iterations; a gradient descent sketch appears after this list. Constrained optimizations, such as the quadratic programs required by nonlinear support vector machines (discussed in chapter 9), are O(N³) in the worst case. Some optimization approaches beyond the widely used unconstrained optimization methods, such as gradient descent or conjugate gradient, are discussed in §4.4.3 on the expectation maximization algorithm for mixtures of Gaussians.
5. Integration problems: Integration arises heavily in the estimation of Bayesian models, and typically involves high-dimensional functions. Performing integration with high accuracy via quadrature has a computational complexity which is exponential in the dimensionality D. In §5.8 we describe the Markov chain Monte Carlo (MCMC) algorithm, which can be used for efficient high-dimensional integration and related computations; a minimal sampler sketch appears after this list.
6. Graph-theoretic problems: These problems involve traversals of graphs, as in probabilistic graphical models or nearest-neighbor graphs for manifold learning; a neighbor-graph sketch appears after this list. The most difficult computations here are those involving discrete variables, in which the computational cost may be O(N) but is exponential in the number of interacting discrete variables among the D dimensions. Exponential computations are by far the most time consuming and generally must be avoided at all costs.
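To make these categories concrete, the following short Python sketches illustrate each problem type in turn; all variable names, data sizes, and library choices are our own illustrative assumptions rather than anything prescribed above. First, the basic problems of item 1: O(N) summary statistics, plus a range search that costs O(log N) per query after an O(N log N) sort.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))  # N=10000 points in D=3 dimensions

# O(N) statistics: a single pass over the data suffices
mean = X.mean(axis=0)            # per-dimension means
var = X.var(axis=0)              # per-dimension variances
cov = np.cov(X, rowvar=False)    # D x D covariance matrix, O(N D^2)

# O(N log N) sort, then O(log N) range searches via binary search
x_sorted = np.sort(X[:, 0])
lo, hi = np.searchsorted(x_sorted, [-1.0, 1.0])
print("points with first coordinate in [-1, 1]:", hi - lo)
```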
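For the generalized N-body problems of item 2, the canonical example is an all-nearest-neighbor search: the brute-force version below scales as O(N²) in both time and memory, while a space-partitioning tree (here SciPy's cKDTree, one of several possible choices) answers the same queries far faster in low dimensions.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
X = rng.uniform(size=(2000, 2))  # N=2000 points in the unit square

# Brute force: all pairwise squared distances, O(N^2) time and memory
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)     # exclude self-matches
nn_brute = d2.argmin(axis=1)     # each point's nearest neighbor

# Tree-based: build in O(N log N), then query all N points
tree = cKDTree(X)
_, idx = tree.query(X, k=2)      # column 0 is the point itself
nn_tree = idx[:, 1]

assert np.array_equal(nn_brute, nn_tree)
```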
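For the linear algebraic problems of item 3, this sketch contrasts the two regimes mentioned there: a least-squares fit costing O(N D²) when N ≫ D, and the direct solution of an N × N system, as arises with kernel matrices, costing O(N³). The matrices here are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 5000, 4

# Linear regression: N >> D, so the fit costs O(N D^2)
A = rng.normal(size=(N, D))
y = A @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=N)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# An N x N system (e.g., a kernel matrix) costs O(N^3) to solve directly
M = rng.normal(size=(500, 500))
M = M @ M.T + 500 * np.eye(500)  # symmetric positive definite
b = rng.normal(size=500)
x = np.linalg.solve(M, b)
```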
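For item 4, a minimal unconstrained optimization: plain gradient descent on a least-squares objective, where each iteration makes one O(N D) pass over the data. The fixed step size is an assumption that happens to suit this well-conditioned example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 10000, 3
A = rng.normal(size=(N, D))
y = A @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

# Minimize f(w) = ||A w - y||^2 / (2N) by gradient descent
w = np.zeros(D)
step = 0.1                        # fixed step size (assumed suitable here)
for _ in range(200):              # each iteration costs O(N D)
    grad = A.T @ (A @ w - y) / N  # gradient of the mean squared residual
    w -= step * grad

print("fitted coefficients:", w)
```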
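For item 5, a minimal random-walk Metropolis sampler (a simple MCMC variant) estimates a 10-dimensional expectation that grid quadrature could not touch; the target density, proposal scale, chain length, and burn-in are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 10  # a dimensionality at which grid quadrature is already hopeless

def log_target(x):
    """Unnormalized log density: a standard D-dimensional Gaussian."""
    return -0.5 * np.dot(x, x)

x = np.zeros(D)
samples = []
for _ in range(20000):
    proposal = x + 0.5 * rng.normal(size=D)  # random-walk proposal
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal                         # Metropolis accept step
    samples.append(x.copy())

samples = np.array(samples[5000:])           # discard burn-in
# Estimate E[||x||^2]; for a standard Gaussian the true value is D
print("estimated E[||x||^2]:", (samples ** 2).sum(axis=1).mean())
```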
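Finally, for item 6, this sketch builds the kind of k-nearest-neighbor graph used in manifold learning and then traverses it to count connected components; the value of k and the SciPy utilities are our choices for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(4)
# Two well-separated clumps of points
X = np.vstack([rng.normal(0, 0.1, size=(100, 2)),
               rng.normal(5, 0.1, size=(100, 2))])

# k-nearest-neighbor graph: an edge from each point to its k neighbors
k = 5
tree = cKDTree(X)
_, idx = tree.query(X, k=k + 1)               # column 0 is the point itself
rows = np.repeat(np.arange(len(X)), k)
cols = idx[:, 1:].ravel()
graph = coo_matrix((np.ones(len(rows)), (rows, cols)),
                   shape=(len(X), len(X)))

# Graph traversal: count connected components (here, the two clumps)
n_comp, labels = connected_components(graph, directed=False)
print("connected components:", n_comp)        # expected: 2
```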