Inverse Problems, Statistical Mechanics and Simulated Annealing

K. Venkatesh Prasad
Ford Motor Company

28.1 Background
28.2 Inverse Problems in DSP
28.3 Analogies with Statistical Mechanics
     Combinatorial Optimization • The Metropolis Criterion • Gibbs’ Distribution
28.4 The Simulated Annealing Procedure
Defining Terms
References
Further Reading
28.1 Background
The focus of this chapter is on inverse problems — what they are, where they manifest themselves
in the realm of digital signal processing (DSP), and how they might be “solved.”¹ Inverse problems
deal with estimating hidden causes, such as a set of transmitted symbols {t}, given observable effects,
such as a set of received symbols {r}, and a system (H) responsible for mapping {t} into {r}. Inverse
problems are succinctly stated using vector-space notation and take the form of estimating t ∈ R^M,
given

    r = Ht ,                                                (28.1)

where r ∈ R^N and H ∈ R^{N×M}, and R denotes the space of real numbers whose dimensions are
specified in the superscript(s). Such problems call for the inversion of H, an operation which may or
may not be numerically possible. We will shortly address these issues, but we should note here for
completeness that these problems contrast with direct problems — where r is to be estimated directly
(without matrix inversion), given H and t.
¹The quotes are used to stress that unique deterministic solutions might not exist for such problems, and that the observed
effects might not continuously track the underlying causes. Formally speaking, this is a result of such problems being
ill-posed in the sense of Hadamard [1]. What is typically sought is an optimal solution, such as a minimum-norm/minimum-energy solution.
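As a concrete illustration of the distinction, here is a minimal sketch of a toy forward model; the dimensions, the matrix H, and the symbol values are hypothetical and chosen only to make the notation tangible.

```python
import numpy as np

# Hypothetical dimensions: M transmitted symbols, N received samples.
M, N = 4, 6
rng = np.random.default_rng(0)

H = rng.standard_normal((N, M))       # system matrix, H in R^(N x M)
t = np.array([1.0, -1.0, 1.0, 1.0])   # hidden cause: transmitted symbols {t}

# Direct problem: given H and t, compute the observable effect r (Eq. 28.1).
r = H @ t

# Inverse problem: given H and r (and, in practice, noise), estimate t.
# The rest of the chapter is concerned with this harder direction.
print("observed r:", r)
```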
28.2 Inverse Problems in DSP
Inverse problems manifest themselves in a broad range of DSP applications, in fields as diverse as
digital astronomy, electronic communications, geophysics [2], medicine [3], and oceanography. The
core of all these problems takes the form shown in Eq. (28.1). This, in fact, is the discrete version of
the Fredholm integral equation of the first kind, for which, by definition², the limits of integration are
fixed and the unknown function f appears only inside the integral. To motivate our discussion, we
will describe an application-specific problem, and in the process introduce some of the notation and
concepts to be used in the later sections. The inverse problem in the field of electronic communications
has to do with estimating t, given r, which is often received with noise, commonly modeled to be
additive white Gaussian (AWG) in nature. The communication system and the transmission channel
are typically stochastically characterizable and are represented by a linear system matrix (H). The
problem, therefore, is to solve for t in the system of linear equations

    r = Ht + n ,                                            (28.2)
where vector n denotes AWG noise. Two tempting solutions might come to mind: if matrix H is
invertible, i.e., H⁻¹ exists, then why not solve for t as

    t̂ = H⁻¹ r ,                                             (28.3)

or else why not compute a minimum-norm solution such as the pseudoinverse solution

    t̂ = H† r ,                                              (28.4)

where H† is referred to as the pseudoinverse [5] of H and is defined to be (H′H)⁻¹H′, where H′
denotes the transpose of H. There are several reasons why neither solution [Eq. (28.3) or (28.4)]
might be viable. One reason is that the dimensions of the system might be extremely large, placing a
greater computational load than might be affordable. Another reason is that H is often numerically
ill-conditioned, implying that inversions or pseudo-inversions might not be reliable even if otherwise
reliable numerical inversion procedures, such as Gaussian elimination or singular value decomposition
[6, 19], were to be employed. Furthermore, even if preconditioning [6] were possible on the
system of linear equations r = Ht + n, resulting in a numerical improvement of the coefficients of H,
there is an even more serious hurdle that often has to be dealt with: such problems are frequently
ill-posed. In practical terms³ this means that small changes in the inputs might result in arbitrarily
large changes in the outputs. For all these reasons the most tempting solution approaches are often
ruled out. As we describe in the next section, inverse problems may be recast as combinatorial
optimization problems. We will then show how combinatorial optimization problems may be solved
using a powerful tool called simulated annealing [7] that has evolved from our understanding of
statistical mechanics [8] and the simulation of the annealing (cooling) behavior of physical matter [9].
²There exist two classes of integral equations ([4], p. 865): if the limits of integration are fixed, the equations are referred
to as Fredholm integral equations; if one of the limits is a variable, the equations are referred to as Volterra integral
equations. Further, if the unknown function appears only inside the integral the equation is called “first kind,” but if it
appears both inside and outside the integral the equation is called “second kind.”
³For a more complete description see [1].
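The following sketch is a minimal numerical illustration, not a prescription: it computes the pseudoinverse solution of Eq. (28.4) for an assumed, deliberately ill-conditioned matrix and an assumed noise level, and shows why a large condition number makes the estimate fragile.

```python
import numpy as np

rng = np.random.default_rng(1)

# A deliberately ill-conditioned system matrix (nearly dependent columns).
H = np.array([[1.0, 1.0],
              [1.0, 1.0001],
              [1.0, 0.9999]])
t_true = np.array([2.0, -1.0])

r_clean = H @ t_true
r_noisy = r_clean + 1e-3 * rng.standard_normal(3)   # AWG noise, as in Eq. (28.2)

# Pseudoinverse solution, Eq. (28.4): t_hat = (H'H)^(-1) H' r.
t_hat = np.linalg.pinv(H) @ r_noisy

print("condition number of H:", np.linalg.cond(H))
print("true t:     ", t_true)
print("estimated t:", t_hat)   # tiny data perturbations move t_hat a lot
```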
28.3 Analogies with Statistical Mechanics
Understanding the analogies between inverse problems in DSP and problems in statistical mechanics
is valuable to us because we can then draw upon the analytical and computational tools developed
in statistical mechanics [8] over the past century to solve inverse problems. The broad analogy
is that just as the received symbols r in Eq. (28.1) are the observed effects of hidden underlying
causes (the transmitted symbols t) — the measured temperature and state (solid, liquid, or gaseous)
of physical matter are the effects of underlying causes such as the momenta and velocities of the
particles that compose the matter. A more specific analogy comes from the reasoning that if the
inverse problem were to be treated as a combinatorial optimization problem, where each candidate
solution is one possible configuration (or combination of the scalar elements of t), then we could use
the criterion developed by Metropolis et al. [9] for physical systems to select the optimal configuration.
The Metropolis criterion is based on the assumption that candidate configurations have probabilistic
distributions of the form originally described by Gibbs [8] to guarantee statistical equilibrium of
ensembles of systems. In order to apply Metropolis’ selection criterion, we must make one final
analogy: we need to treat the combinatorial optimization problem as if it were the outcome of an
imaginary physical system in which matter has been brought to a boil. When such a physical system
is gradually cooled (a process referred to as annealing), then, provided the cooling rate is neither too
fast nor too slow, the system will eventually solidify into a minimum-energy configuration. As depicted
in Fig. 28.1, to solve inverse problems we first recast the problem as a combinatorial optimization
problem and then solve this recast problem using simulated annealing — a procedure that numerically
mimics the annealing of physical systems. In this section we describe the basic principles of
combinatorial optimization, Metropolis’ criterion for selecting or discarding potential configurations,
and the origins of Gibbs’ distribution. We outline the simulated annealing algorithm in the following
section and follow that with examples of implementation and applications.
FIGURE 28.1: The direct path (a → d) to solving the inverse problem is often not viable since it
relies on the inversion of a system matrix. An optimal solution, however, may be obtained by an
indirect path (a → b → c → d), which involves recasting the inverse problem as an equivalent
combinatorial optimization problem and then solving this problem using simulated annealing.
28.3.1 Combinatorial Optimization
The optimal solution to the inverse problem [Eq. (28.1)], as explained above, amounts to estimating
the vector t. Under the assumptions enumerated below, the inverse problem can be recast as a
combinatorial problem whose solution then yields the desired optimal solution to the inverse problem.
The assumptions required are:

1. Each (scalar) element t(i), 1 ≤ i ≤ M, of t ∈ R^M can take on only a finite set of finite
   values. That is, −∞ < t_j(i) < ∞, ∀ i and j, where t_j(i) denotes the jth possible value that
   the ith element of t can take, and j is a finite-valued index, 1 ≤ j ≤ J_i < ∞, ∀ i; J_i denotes
   the number of possible values the ith element of t can take.
2. Let each combination of M scalar values t(i) of t be referred to as a candidate vector or a
   feasible configuration t_k, where the index k ≤ K < ∞. Associated with each candidate
   vector t_k we must have a quantifiable measure of error, cost, or energy (E_k).
Given the above assumptions, the combinatorial form of the inverse problem may be stated as: out
of K possible candidate vectors t_k, 1 ≤ k ≤ K, search for the vector t_kopt with the lowest error E_kopt.
Although easily stated, the time and computational efficiency with which the solution is obtained
hinge on at least two significant factors — the design of the error-function and the choice of the search
strategy. The error-function (E_k) must provide a quantifiable measure of dissimilarity, or distance,
between a feasible configuration (t_k) and the true (but unknown) configuration (t_true), i.e.,

    E_k = d(t_k, t_true) ,                                  (28.5)

where d denotes a distance function. The goal of the combinatorial optimization problem is to
efficiently search through the combinatorial space and stop at the optimal, minimum-error (E_opt)
configuration t_kopt:

    E_opt = E_kopt = min_{1 ≤ k ≤ K} E_k = δ ,              (28.6)

where k_opt denotes the value of the index k associated with the optimal configuration. In the ideal
case, when δ = 0, from Eq. (28.5) we have that t_kopt = t_true. In practice, however, owing to a
combination of factors such as noise [Eq. (28.2)] or the system [Eq. (28.1)] being underdetermined,
E_opt = δ > 0, implying that t_kopt ≠ t_true, but that t_kopt is the best possible solution given what is
known about the problem and its solutions. In general the error-function must satisfy the requirements
of a distance function, or metric (adapted from [10], p. 237):
    E(t_k) = 0  ⟺  t_k = t_true ,                           (28.7a)
    E(t_k) = d(t_k, t_true) = d(t_true, t_k) ,              (28.7b)
    E(t_k) ≤ E(t_j) + d(t_k, t_j) ,                         (28.7c)

where Eq. (28.7a) follows from Eq. (28.5), and where, like k, the index j is defined in the range (1, K)
and K < ∞. Eq. (28.7a) states that if the error is zero, t_k is the true configuration. The implication
of Eq. (28.7b) is that the error depends only on the distance between a configuration and the true
configuration, irrespective of the direction in which that distance is measured. Eq. (28.7c) implies
that the triangle inequality law holds.
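To make the combinatorial form concrete, the sketch below exhaustively enumerates every feasible configuration of a toy problem and keeps the one with the lowest error. The alphabet, dimensions, and squared-error cost are illustrative assumptions; the point is that the K = J^M candidates quickly become too many to enumerate, which is what motivates simulated annealing.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

M, N = 4, 6
alphabet = [-1.0, 1.0]                 # J_i = 2 possible values per element
H = rng.standard_normal((N, M))
t_true = np.array([1.0, -1.0, 1.0, 1.0])
r = H @ t_true + 0.05 * rng.standard_normal(N)

best_err, best_t = np.inf, None
for cand in itertools.product(alphabet, repeat=M):   # K = 2^M configurations
    t_k = np.array(cand)
    err = float(np.sum((r - H @ t_k) ** 2))          # E_k: a squared distance
    if err < best_err:
        best_err, best_t = err, t_k

print("t_kopt:", best_t, " E_opt:", best_err)
# Exhaustive search is exact but scales as J^M; simulated annealing trades
# that exactness for a tractable search of the same configuration space.
```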
In designing the error-function, one can classify the sources of error into two distinct categories.
The first category of error, denoted by E_k^signal, provides a measure of error (or distance) between
the observed signal (r) and the estimated signal (r̂_k), computed for the current configuration t_k
using Eq. (28.1). The second category, denoted by E_k^constraints, accounts for the price to be “paid”
when an estimated solution deviates from the constraints we would want to impose on it based
on our understanding of the physical world. The physical world, for instance, might suggest that
each element of the signal is very probably positive-valued. In this case, a negative-valued estimate
of a signal element will result in an error value that is proportional to the magnitude of the signal's
negativity. This constraint is popularly known as the non-negativity constraint. Another constraint
might arise from the assumption that the solution is expected to be smooth [11]:
    t̂′ S t̂ = δ_smooth ,                                     (28.8)

where S is a smoothing matrix and δ_smooth is the degree of smoothness of the signal. The
error-function, therefore, takes the following form:

    E_k = E_k^signal + E_k^constraints , where
    E_k^signal = ||r − r̂_k||²  and
    E_k^constraints = Σ_{c ∈ C} (α_c E_c) ,                 (28.9)

where E^constraints represents the total error from all other factors or constraints that might be imposed
on the solution, {C} represents the set of constraint indices, and α_c and E_c represent the weight and
the error-function, respectively, associated with the cth constraint.
28.3.2 The Metropolis Criterion
The core task in solving the combinatorial optimization problem described above is to search for a
configuration t_k for which the error-function E_k is a minimum. Standard gradient descent methods
[6, 12, 13] would have been the natural choice had E_k been a function with just one minimum (or
maximum) value, but this function typically has multiple minima (or maxima) — gradient descent
methods would tend to get locked into a local minimum. The simulated annealing procedure
(Fig. 28.2 — discussed in the next section), suggested by Metropolis et al. [9] for the problem of
finding stable configurations of interacting atoms and adapted for combinatorial optimization by
Kirkpatrick [7], provides a scheme to traverse the surface of E_k, get out of local minima, and
eventually cool into a global minimum. The contribution of Metropolis et al., commonly referred to
in the literature as Metropolis’ criterion, is based on the assumption that the difference in the error
of two consecutive feasible configurations (denoted as ΔE = E_{k+1} − E_k) takes the form of Gibbs’
distribution [Eq. (28.11)]. The criterion states that even if a configuration were to result in increased
error, i.e., ΔE > 0, one can select the new configuration if

    random < exp(−ΔE / T) ,                                 (28.10)

where random denotes a random number drawn from a uniform distribution in the range [0, 1) and
T denotes the temperature of the physical system.
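The test of Eq. (28.10) reduces to a few lines of code. This sketch simply assumes the convention used above, in which ΔE > 0 means the error increased.

```python
import numpy as np

def metropolis_accept(delta_e, temperature, rng=None):
    """Decide whether to accept a new configuration, per Eq. (28.10)."""
    if rng is None:
        rng = np.random.default_rng()
    if delta_e <= 0.0:
        return True          # error decreased (or is unchanged): always accept
    # Error increased: accept with probability exp(-delta_e / T).
    return rng.random() < np.exp(-delta_e / temperature)
```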
28.3.3 Gibbs’ Distribution
At the turn of the 20th century, Gibbs [8], building upon the work of Clausius, Maxwell, and
Boltzmann in statistical mechanics, proposed the probability distribution

    P = exp((ψ − ε) / Θ) ,                                  (28.11)

where ψ and Θ were constants and ε denoted the energy of a system. This distribution was crafted
to satisfy the condition of statistical equilibrium ([8], p. 32) for ensembles of (thermodynamical)
systems:

    Σ_i [ (∂P/∂p_i) ṗ_i + (∂P/∂q_i) q̇_i ] = 0 ,             (28.12)

where p_i and q_i represented the generalized momentum and velocity, respectively, of the ith degree
of freedom. The negative sign on ε in Eq. (28.11) was required to satisfy the condition

    ∫ ⋯ ∫ P dp_1 ⋯ dq_n = 1 ,                               (28.13)

where the integration extends over all phases.

FIGURE 28.2: The outline of the annealing algorithm.
28.4 The Simulated Annealing Procedure
The simulated annealing algorithm, as outlined in Fig. 28.2, mimics the annealing (or controlled
cooling) of an imaginary physical system. The unknown parameters are treated like particles in a
physical system. An initial configuration t_initial is chosen along with an initial (“boiling”) temperature
value (T_initial). The choice of T_initial is made so as to ensure that a vast majority, say 90%, of
configurations are acceptable even if they result in an increase in error (ΔE_k > 0). The initial
configuration is perturbed, either by using a random number generator or by sequential selection,
to create a second configuration, and ΔE_2 is computed. The Metropolis criterion is applied to decide
whether or not to accept the new configuration. After equilibrium is reached, i.e., after |ΔE_2| ≤ δ_equilib,
where δ_equilib is a small, heuristically chosen threshold, the temperature is lowered according to a
cooling schedule and the process is repeated until a pre-selected frozen temperature is reached. Several
different cooling schedules have been proposed in the literature ([18], p. 59). In one popular schedule
[18, 19] each subsequent temperature T_{k+1} is less than the current temperature T_k by a fixed
percentage of T_k, i.e., T_{k+1} = β_k T_k, where β_k is typically in the range of 0.8 to unity. Based on the
behavior of physical systems, which attain minimum (free) energy (or global minimum) states when
they freeze at the end of an annealing process, the assumption underlying the simulated annealing
procedure is that the t_opt that is finally attained is also a global minimum.
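Pulling the pieces together, the following sketch follows the outline of Fig. 28.2 under several explicit assumptions: a binary alphabet for t, the squared-error cost from the earlier sketches, single-element perturbations, a fixed number of trials per temperature as a crude stand-in for reaching equilibrium, and the geometric cooling schedule T_{k+1} = β T_k. It is a schematic of the procedure, not the implementation used in the chapter's examples.

```python
import numpy as np

def simulated_annealing(r, H, alphabet, t_init, error,
                        T_init=10.0, T_frozen=1e-3, beta=0.9,
                        trials_per_T=100, seed=3):
    rng = np.random.default_rng(seed)
    t = np.array(t_init, dtype=float)
    e = error(t, r, H)
    T = T_init
    while T > T_frozen:                      # cool until "frozen"
        for _ in range(trials_per_T):        # crude stand-in for equilibrium
            i = rng.integers(len(t))         # perturb one randomly chosen element
            cand = t.copy()
            cand[i] = rng.choice(alphabet)
            e_cand = error(cand, r, H)
            delta_e = e_cand - e
            # Metropolis criterion, Eq. (28.10)
            if delta_e <= 0 or rng.random() < np.exp(-delta_e / T):
                t, e = cand, e_cand
        T *= beta                            # cooling schedule T_{k+1} = beta * T_k
    return t, e

# Illustrative use with a toy system (all values hypothetical):
rng = np.random.default_rng(4)
M, N = 8, 12
H = rng.standard_normal((N, M))
t_true = rng.choice([-1.0, 1.0], size=M)
r = H @ t_true + 0.05 * rng.standard_normal(N)
cost = lambda t, r, H: float(np.sum((r - H @ t) ** 2))
t_opt, e_opt = simulated_annealing(r, H, [-1.0, 1.0], np.ones(M), cost)
print("E_opt:", e_opt, " recovered t_true:", bool(np.array_equal(t_opt, t_true)))
```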
The results of applying the simulated annealing procedure to the problem of three-dimensional
signal restoration [14] are shown in Fig. 28.3. In this problem, a defocused image, vector r, of an
opaque eight-step staircase object was provided, along with the space-varying point-spread-function
matrix (H) and a well-focused image. The unknown vector t represented the intensities of the
volume elements (voxels), with the visible voxels taking on positive values and hidden voxels having
a value of zero. The vector t was lexicographically indexed so that, by knowing which elements of
t were positive, one could reconstruct the three-dimensional structure. Using simulated annealing
and constraints (opacity, non-negativity of intensity, smoothness of intensity and depth, and tight
bounds on the voxel intensity values obtained from the well-focused image), the original object was
reconstructed.
FIGURE 28.3: Three-dimensional signal recovery using simulated annealing. The staircase object
shown corresponding to era 17 is recovered from a defocused image by testing a number of feasible
configurations and applying the Metropolis criterion in a simulated annealing procedure.
Defining Terms
In the following definitions, as in the preceding discussion, t ∈ R^M, r ∈ R^N, and H ∈ R^{N×M}.
Combinatorial Optimization: The process of selecting the optimal (lowest-cost) configuration
from a large space of candidate or feasible configurations.
Configuration: Any vector t is a configuration. The term is used in the combinatorial
optimization literature.
Cost/energy/error function: The terms cost, energy, or error function are frequently used
interchangeably in the literature. Cost function is often used in the optimization literature
to represent the mapping of a candidate vector into a (scalar) functional whose value is
indicative of the optimality of the candidate vector. Energy function is frequently used in
electronic communication theory as a pseudonym for the L₂ norm or root-mean-square value
of a vector. Error function is typically used to measure a mismatch between an estimated
(vector) and its expected value. For the purposes of this discussion we use the terms cost,
energy, and error function interchangeably.
Gibbs’ distribution: The distribution (in reality a probability density function (pdf)) in which
η, the index of probability (P), is a linear function of energy, i.e., η = log P = (ψ − ε)/Θ,
where ψ and Θ are constants and ε represents energy, giving the familiar pdf:

    P = exp((ψ − ε) / Θ) .                                  (28.14)
Inverse problem: Given matrix H and vector r, find t that satisfies r = Ht.
Metropolis’ criterion: The criterion first suggested by Metropolis et al. [9] to decide whether
or not to accept a configuration that results in an increased error, when trying to search for
minimum-error configurations in a combinatorial optimization problem.
Minimum-norm: The norm between two vectors is a (scalar) measure of distance between them,
such as the L₁ norm, the L₂ (or Euclidean) norm, the L∞ norm, the Mahalanobis distance
([10], p. 24), or the Manhattan metric [7]. Minimum-norm, unless otherwise noted, implies
minimum Euclidean (L₂) norm (denoted by || ||):
    min_t ||t|| .
Pseudoinverse: Let t_opt be the unique minimum-norm vector satisfying

    ||H t_opt − r|| = min_t ||Ht − r|| .

The pseudoinverse of matrix H, denoted by H† ∈ R^{M×N}, is the matrix mapping every r into
its corresponding t_opt.
Statistical mechanics: That branch of mechanics in which the problem is to find the statistical
distribution of the parameters of ensembles (large numbers) of systems (each differing not
just infinitesimally, but embracing every possible combination of the parameters) at a desired
instant in time, given those distributions at the present time. Maxwell, according to Gibbs [8],
coined the term “statistical mechanics.” This field owes its origin to the desire to explain the
laws of thermodynamics, as stated by Gibbs ([8], p. viii): “The laws of thermodynamics, as
empirically determined, express the approximate and probable behavior of systems of a great
number of particles, or, more precisely, they express the laws of mechanics for such systems
as they appear to beings who have not the fineness of perception to enable them to appreciate
quantities of the order of magnitude of those which relate to single particles, and who cannot
repeat their experiments often enough to obtain any but the most probable results.”
References
[1] Hadamard, J., Sur les problèmes aux dérivées partielles et leur signification physique, Princeton University Bulletin, 13, 1902.
[2] Frolik, J.L. and Yagle, A.E., Reconstruction of multilayered lossy dielectrics from plane-wave impulse responses at 2 angles of incidence, IEEE Trans. Geosci. Remote Sens., 33: 268–279, March 1995.
[3] Greensite, F., Well-posed formulation of the inverse problem of electrocardiography, Ann. Biomed. Eng., 22(2): 172–183, 1994.
[4] Arfken, G., Mathematical Methods for Physicists, Academic Press, 1985.
[5] Greville, T.N.E., The pseudoinverse of a rectangular or singular matrix and its application to the solution of systems of linear equations, SIAM Rev., 1: 38–43, 1959.
[6] Golub, G.H. and Van Loan, C.F., Matrix Computations, 2nd ed., The Johns Hopkins University Press, Baltimore, 1989.
[7] Kirkpatrick, S., Optimization by simulated annealing: quantitative studies, J. Stat. Phys., 34(5,6): 975–986, 1984.
[8] Gibbs, J.W., Elementary Principles in Statistical Mechanics, Yale University Press, New Haven, 1902.
[9] Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E., Equation of state calculations by fast computing machines, J. Chem. Phys., 21: 1087–1092, June 1953.
[10] Duda, R.O. and Hart, P.E., Pattern Classification and Scene Analysis, John Wiley, 1973.
[11] Pratt, W.K., Digital Image Processing, John Wiley, New York, 1978.
[12] Luenberger, D.G., Optimization by Vector Space Methods, John Wiley & Sons, New York, 1969.
[13] Gill, P.E. and Murray, W., Quasi-Newton methods for linearly constrained optimization, in Numerical Methods for Constrained Optimization, Gill, P.E. and Murray, W., Eds., Academic Press, London, 1974.
[14] Prasad, K.V., Mammone, R.J., and Yogeshwar, J., 3-D image restoration using constrained optimization techniques, Opt. Eng., 29: 279–288, April 1990.
[15] Tikhonov, A.N. and Arsenin, V.Y., Solutions of Ill-Posed Problems, V.H. Winston & Sons, Washington, D.C., 1977.
[16] Soumekh, M., Reconnaissance with ultra wideband UHF synthetic aperture radar, IEEE Acoust., Speech, Signal Process., 12: 21–40, July 1995.
[17] van Laarhoven, P.J.M. and Aarts, E.H.L., Simulated Annealing: Theory and Applications, D. Reidel, Dordrecht, Holland, 1987.
[18] Aarts, E. and Korst, J., Simulated Annealing and Boltzmann Machines, John Wiley, New York, 1989.
[19] Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T., Numerical Recipes in C, Cambridge University Press, U.K., 1988.
[20] Geman, S. and Geman, D., Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., PAMI-6: 721–741, November 1984.
Further Reading
Inverse problems — The classic by Tikhonov [15] provides a good introduction to the subject matter.
For a description of inverse problems related to synthetic aperture radar applications see [16].
Statistical mechanics — Gibbs’ [8] work is a historical treasure.
Vector spaces and optimization — The books by Luenberger [12] and Gill and Murray [13] provide
a broad introductory foundation.
Simulated annealing — Two recent books, by van Laarhoven and Aarts [17] and by Aarts and
Korst [18], contain comprehensive coverage of the theory and application of simulated annealing. A
useful simulated annealing algorithm, along with tips for numerical implementation and random
number generation, can be found in Numerical Recipes in C [19]. An alternative simulated annealing