Springer Series on
Signals and Communication Technology
Circuits and Systems
Based on Delta Modulation
Linear, Nonlinear and Mixed Mode Processing
D.G Zrilic ISBN 3-540-23751-8
Functional Structures in Networks
AMLn – A Language for Model Driven
Development of Telecom Systems
T Muth ISBN 3-540-22545-5
Radio Wave Propagation
for Telecommunication Applications
H Sizun ISBN 3-540-40758-8
Electronic Noise and Interfering Signals
Principles and Applications
G Vasilescu ISBN 3-540-40741-3
DVB
The Family of International Standards
for Digital Video Broadcasting, 2nd ed.
U Reimers ISBN 3-540-43545-X
Digital Interactive TV and Metadata
Future Broadcast Multimedia
A Lugmayr, S Niiranen, and S Kalli
ISBN 3-387-20843-7
Adaptive Antenna Arrays
Trends and Applications
S Chandran (Ed.) ISBN 3-540-20199-8
Digital Signal Processing
with Field Programmable Gate Arrays
U Meyer-Baese ISBN 3-540-21119-5
Neuro-Fuzzy and Fuzzy Neural Applications
in Telecommunications
P Stavroulakis (Ed.) ISBN 3-540-40759-6
SDMA for Multipath Wireless Channels
Limiting Characteristics
and Stochastic Models
I.P Kovalyov ISBN 3-540-40225-X
Processing of SAR Data
Fundamentals, Signal Processing, Interferometry
A Hein ISBN 3-540-05043-4
Chaos-Based Digital Communication Systems
Operating Principles, Analysis Methods, and Performance Evaluation
F.C.M Lau and C.K Tse ISBN 3-540-00602-8
Adaptive Signal Processing
Application to Real-World Problems
J Benesty and Y Huang (Eds.) ISBN 3-540-00051-8
Multimedia Information Retrieval and Management
Technological Fundamentals and Applications
D Feng, W.C Siu, and H.J Zhang (Eds.) ISBN 3-540-00244-8
Structured Cable Systems
A.B Semenov, S.K Strizhakov, and I.R Suncheley ISBN 3-540-43000-8
Advanced Theory of Signal Detection
Weak Signal Detection in Generalized Observations
I Song, J Bae, and S.Y Kim ISBN 3-540-43064-4
Wireless Internet Access over GSM and UMTS
M Taferner and E Bonek ISBN 3-540-42551-9
The Variational Bayes Method
in Signal Processing
V. Šmídl and A. Quinn ISBN 3-540-28819-8
Dr Václav Šmídl
Institute of Information Theory and Automation
Academy of Sciences of the Czech Republic, Department of Adaptive Systems
PO Box 18, 18208 Praha 8, Czech Republic
E-mail: smidl@utia.cas.cz
Dr Anthony Quinn
Department of Electronic and Electrical Engineering
University of Dublin, Trinity College
Dublin 2, Ireland
E-mail: aquinn@tcd.ie
ISBN-10 3-540-28819-8 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-28819-0 Springer Berlin Heidelberg New York
Library of Congress Control Number: 2005934475
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media.
Typesetting and production: SPI Publisher Services
Cover design: design & production GmbH, Heidelberg
Printed on acid-free paper SPIN: 11370918 62/3100/SPI - 5 4 3 2 1 0
Do mo Thuismitheoirí
A.Q.
Preface

Gaussian linear modelling cannot address current signal processing demands. In modern contexts, such as Independent Component Analysis (ICA), progress has been made specifically by imposing non-Gaussian and/or non-linear assumptions. Hence, standard Wiener and Kalman theories no longer enjoy their traditional hegemony in the field, revealing the standard computational engines for these problems. In their place, diverse principles have been explored, leading to a consequent diversity in the implied computational algorithms. The traditional on-line and data-intensive preoccupations of signal processing continue to demand that these algorithms be tractable.
Increasingly, full probability modelling (the so-called Bayesian approach)—or partial probability modelling using the likelihood function—is the pathway for design of these algorithms. However, the results are often intractable, and so the area of distributional approximation is of increasing relevance in signal processing. The Expectation-Maximization (EM) algorithm and Laplace approximation, for example, are standard approaches to handling difficult models, but these approximations (certainty equivalence, and Gaussian, respectively) are often too drastic to handle the high-dimensional, multi-modal and/or strongly correlated problems that are encountered. Since the 1990s, stochastic simulation methods have come to dominate Bayesian signal processing. Markov Chain Monte Carlo (MCMC) sampling, and related methods, are appreciated for their ability to simulate possibly high-dimensional distributions to arbitrary levels of accuracy. More recently, the particle filtering approach has addressed on-line stochastic simulation. Nevertheless, the wider acceptability of these methods—and, to some extent, Bayesian signal processing itself—has been undermined by the large computational demands they typically make.

The Variational Bayes (VB) method of distributional approximation originates—as does the MCMC method—in statistical physics, in the area known as Mean Field Theory. Its method of approximation is easy to understand: conditional independence is enforced as a functional constraint in the approximating distribution, and the best such approximation is found by minimization of a Kullback-Leibler divergence (KLD). The exact—but intractable—multivariate distribution is therefore factorized into a product of tractable marginal distributions, the so-called VB-marginals. This straightforward proposal for approximating a distribution enjoys certain optimality properties. What is of more pragmatic concern to the signal processing community, however, is that the VB-approximation conveniently addresses the following key tasks:

1. The inference is focused (or, more formally, marginalized) onto selected subsets of parameters of interest in the model: this one-shot (i.e. off-line) use of the VB method can replace numerically intensive marginalization strategies based, for example, on stochastic sampling.

2. Parameter inferences can be arranged to have an invariant functional form when updated in the light of incoming data: this leads to feasible on-line tracking algorithms involving the update of fixed- and finite-dimensional statistics. In the language of the Bayesian, conjugacy can be achieved under the VB-approximation. There is no reliance on propagating certainty equivalents, stochastically-generated particles, etc.
Unusually for a modern Bayesian approach, then, no stochastic sampling is required for the VB method. In its place, the shaping parameters of the VB-marginals are found by iterating a set of implicit equations to convergence. This Iterative Variational Bayes (IVB) algorithm enjoys a decisive advantage over the EM algorithm, whose computational flow is similar: by design, the VB method yields distributions in place of the point estimates emerging from the EM algorithm. Hence, in common with all Bayesian approaches, the VB method provides, for example, measures of uncertainty for any point estimates of interest, inferences of model order/rank, etc.
The machine learning community has led the way in exploiting the VB method in model-based inference, notably in inference for graphical models. It is timely, however, to examine the VB method in the context of signal processing where, to date, little work has been reported. In this book, at all times, we are concerned with the way in which the VB method can lead to the design of tractable computational schemes for tasks such as (i) dimensionality reduction, (ii) factor analysis for medical imagery, (iii) on-line filtering of outliers and other non-Gaussian noise processes, (iv) tracking of non-stationary processes, etc. Our aim in presenting these VB algorithms is not just to reveal new flows-of-control for these problems, but—perhaps more significantly—to understand the strengths and weaknesses of the VB-approximation in model-based signal processing. In this way, we hope to dismantle the current psychology of dependence in the Bayesian signal processing community on stochastic sampling methods. Without doubt, the ability to model complex problems to arbitrary levels of accuracy will ensure that stochastic sampling methods—such as MCMC—will remain the gold standard for distributional approximation. Notwithstanding this, our purpose here is to show that the VB method of approximation can yield highly effective Bayesian inference algorithms at low computational cost. In showing this, we hope that Bayesian methods might become accessible to a much broader constituency than has been achieved to date.
Contents

1 Introduction 1
1.1 How to be a Bayesian 1
1.2 The Variational Bayes (VB) Method 2
1.3 A First Example of the VB Method: Scalar Additive Decomposition 3
1.3.1 A First Choice of Prior 3
1.3.2 The Prior Choice Revisited 4
1.4 The VB Method in its Context 6
1.5 VB as a Distributional Approximation 8
1.6 Layout of the Work 10
1.7 Acknowledgement 11
2 Bayesian Theory 13
2.1 Bayesian Benefits 13
2.1.1 Off-line vs On-line Parametric Inference 14
2.2 Bayesian Parametric Inference: the Off-Line Case 15
2.2.1 The Subjective Philosophy 16
2.2.2 Posterior Inferences and Decisions 16
2.2.3 Prior Elicitation 18
2.2.3.1 Conjugate priors 19
2.3 Bayesian Parametric Inference: the On-line Case 19
2.3.1 Time-invariant Parameterization 20
2.3.2 Time-variant Parameterization 20
2.3.3 Prediction 22
2.4 Summary 22
3 Off-line Distributional Approximations and the Variational Bayes Method 25
3.1 Distributional Approximation 25
3.2 How to Choose a Distributional Approximation 26
3.2.1 Distributional Approximation as an Optimization Problem 26
3.2.2 The Bayesian Approach to Distributional Approximation 27
3.3 The Variational Bayes (VB) Method of Distributional Approximation 28
3.3.1 The VB Theorem 28
3.3.2 The VB Method of Approximation as an Operator 32
3.3.3 The VB Method 33
3.3.4 The VB Method for Scalar Additive Decomposition 37
3.4 VB-related Distributional Approximations 39
3.4.1 Optimization with Minimum-Risk KL Divergence 39
3.4.2 Fixed-form (FF) Approximation 40
3.4.3 Restricted VB (RVB) Approximation 40
3.4.3.1 Adaptation of the VB method for the RVB Approximation 41
3.4.3.2 The Quasi-Bayes (QB) Approximation 42
3.4.4 The Expectation-Maximization (EM) Algorithm 44
3.5 Other Deterministic Distributional Approximations 45
3.5.1 The Certainty Equivalence Approximation 45
3.5.2 The Laplace Approximation 45
3.5.3 The Maximum Entropy (MaxEnt) Approximation 45
3.6 Stochastic Distributional Approximations 46
3.6.1 Distributional Estimation 47
3.7 Example: Scalar Multiplicative Decomposition 48
3.7.1 Classical Modelling 48
3.7.2 The Bayesian Formulation 48
3.7.3 Full Bayesian Solution 49
3.7.4 The Variational Bayes (VB) Approximation 51
3.7.5 Comparison with Other Techniques 54
3.8 Conclusion 56
4 Principal Component Analysis and Matrix Decompositions 57
4.1 Probabilistic Principal Component Analysis (PPCA) 58
4.1.1 Maximum Likelihood (ML) Estimation for the PPCA Model 59
4.1.2 Marginal Likelihood Inference of A 61
4.1.3 Exact Bayesian Analysis 61
4.1.4 The Laplace Approximation 62
4.2 The Variational Bayes (VB) Method for the PPCA Model 62
4.3 Orthogonal Variational PCA (OVPCA) 69
4.3.1 The Orthogonal PPCA Model 70
4.3.2 The VB Method for the Orthogonal PPCA Model 70
4.3.3 Inference of Rank 77
4.3.4 Moments of the Model Parameters 78
4.4 Simulation Studies 79
4.4.1 Convergence to Orthogonal Solutions: VPCA vs FVPCA 79
4.4.2 Local Minima in FVPCA and OVPCA 82
4.4.3 Comparison of Methods for Inference of Rank 83
4.5 Application: Inference of Rank in a Medical Image Sequence 85
4.6 Conclusion 87
5 Functional Analysis of Medical Image Sequences 89
5.1 A Physical Model for Medical Image Sequences 90
5.1.1 Classical Inference of the Physiological Model 92
5.2 The FAMIS Observation Model 92
5.2.1 Bayesian Inference of FAMIS and Related Models 94
5.3 The VB Method for the FAMIS Model 94
5.4 The VB Method for FAMIS: Alternative Priors 99
5.5 Analysis of Clinical Data Using the FAMIS Model 102
5.6 Conclusion 107
6 On-line Inference of Time-Invariant Parameters 109
6.1 Recursive Inference 110
6.2 Bayesian Recursive Inference 110
6.2.1 The Dynamic Exponential Family (DEF) 112
6.2.2 Example: The AutoRegressive (AR) Model 114
6.2.3 Recursive Inference of non-DEF models 117
6.3 The VB Approximation in On-Line Scenarios 118
6.3.1 Scenario I: VB-Marginalization for Conjugate Updates 118
6.3.2 Scenario II: The VB Method in One-Step Approximation 121
6.3.3 Scenario III: Achieving Conjugacy in non-DEF Models via the VB Approximation 123
6.3.4 The VB Method in the On-Line Scenarios 126
6.4 Related Distributional Approximations 127
6.4.1 The Quasi-Bayes (QB) Approximation in On-Line Scenarios 128
6.4.2 Global Approximation via the Geometric Approach 128
6.4.3 One-step Fixed-Form (FF) Approximation 129
6.5 On-line Inference of a Mixture of AutoRegressive (AR) Models 130
6.5.1 The VB Method for AR Mixtures 130
6.5.2 Related Distributional Approximations for AR Mixtures 133
6.5.2.1 The Quasi-Bayes (QB) Approximation 133
6.5.2.2 One-step Fixed-Form (FF) Approximation 135
6.5.3 Simulation Study: On-line Inference of a Static Mixture 135
6.5.3.1 Inference of a Many-Component Mixture 136
6.5.3.2 Inference of a Two-Component Mixture 136
6.5.4 Data-Intensive Applications of Dynamic Mixtures 139
6.5.4.1 Urban Vehicular Traffic Prediction 141
6.6 Conclusion 143
7 On-line Inference of Time-Variant Parameters 145
7.1 Exact Bayesian Filtering 145
7.2 The VB-Approximation in Bayesian Filtering 147
7.2.1 The VB method for Bayesian Filtering 149
7.3 Other Approximation Techniques for Bayesian Filtering 150
7.3.1 Restricted VB (RVB) Approximation 150
7.3.2 Particle Filtering 152
7.3.3 Stabilized Forgetting 153
7.3.3.1 The Choice of the Forgetting Factor 154
7.4 The VB-Approximation in Kalman Filtering 155
7.4.1 The VB method 156
7.4.2 Loss of Moment Information in the VB Approximation 158
7.5 VB-Filtering for the Hidden Markov Model (HMM) 158
7.5.1 Exact Bayesian filtering for known T 159
7.5.2 The VB Method for the HMM Model with Known T 160
7.5.3 The VB Method for the HMM Model with Unknown T 162
7.5.4 Other Approximate Inference Techniques 164
7.5.4.1 Particle Filtering 164
7.5.4.2 Certainty Equivalence Approach 165
7.5.5 Simulation Study: Inference of Soft Bits 166
7.6 The VB-Approximation for an Unknown Forgetting Factor 168
7.6.1 Inference of a Univariate AR Model with Time-Variant Parameters 169
7.6.2 Simulation Study: Non-stationary AR Model Inference via Unknown Forgetting 173
7.6.2.1 Inference of an AR Process with Switching Parameters 173
7.6.2.2 Initialization of Inference for a Stationary AR Process 174
7.7 Conclusion 176
8 The Mixture-based Extension of the AR Model (MEAR) 179
8.1 The Extended AR (EAR) Model 179
8.1.1 Bayesian Inference of the EAR Model 181
8.1.2 Computational Issues 182
8.2 The EAR Model with Unknown Transformation: the MEAR Model 182
8.3 The VB Method for the MEAR Model 183
8.4 Related Distributional Approximations for MEAR 186
8.4.1 The Quasi-Bayes (QB) Approximation 186
8.4.2 The Viterbi-Like (VL) Approximation 187
8.5 Computational Issues 188
8.6 The MEAR Model with Time-Variant Parameters 191
8.7 Application: Inference of an AR Model Robust to Outliers 192
8.7.1 Design of the Filter-bank 192
8.7.2 Simulation Study 193
8.8 Application: Inference of an AR Model Robust to Burst Noise 196
8.8.1 Design of the Filter-Bank 196
8.8.2 Simulation Study 197
8.8.3 Application in Speech Reconstruction 201
8.9 Conclusion 201
9 Concluding Remarks 205
9.1 The VB Method 205
9.2 Contributions of the Work 206
9.3 Current Issues 206
9.4 Future Prospects for the VB Method 207
Required Probability Distributions 209
A.1 Multivariate Normal distribution 209
A.2 Matrix Normal distribution 209
A.3 Normal-inverse-Wishart (N iW A,Ω) Distribution 210
A.4 Truncated Normal Distribution 211
A.5 Gamma Distribution 212
A.6 Von Mises-Fisher Matrix distribution 212
A.6.1 Definition 213
A.6.2 First Moment 213
A.6.3 Second Moment and Uncertainty Bounds 214
A.7 Multinomial Distribution 215
A.8 Dirichlet Distribution 215
A.9 Truncated Exponential Distribution 216
References 217
Index 225
Notational Conventions

a_i, a_i,D    ith column of matrix A, A_D, respectively.
a_i,j, a_i,j,D    (i, j)th element of matrix A, A_D, respectively, i = 1,...,n, j = 1,...,m.
b_i, b_i,D    ith element of vector b, b_D, respectively.
A = diag(a)    Diagonal matrix with the elements of a ∈ R^q on its diagonal.
a = diag(A)    Diagonal vector of given matrix A (the context will distinguish this from a scalar, a (see 2nd entry, above)).
A_(r)    Matrix A with restricted rank, rank(A) = r ≤ min(n, m).
I_r ∈ R^{r×r}    Square identity matrix.
1_{p,q}, 0_{p,q}    Matrix of size p × q with all elements equal to one, zero, respectively.
a = vec(A)    Operator restructuring the elements of A = [a_1, ..., a_n] into a vector.
SVD    Singular Value Decomposition of matrix A ∈ R^{n×m}. In this monograph, the SVD is expressed in the 'economic' form.
{A}_c    Set of objects A with cardinality c.
A_(i)    ith element of set {A}_c, i = 1,...,c.
Analysis

χ_X(·)    Indicator (characteristic) function of set X.
erf(x)    Error function: erf(x) = (2/√π) ∫_0^x exp(−t²) dt.
ln(A), exp(A)    Natural logarithm and exponential of matrix A, respectively. Both operations are performed on the elements of the matrix (or vector).
Γ_r(p/2)    Multivariate Gamma function: Γ_r(p/2) = π^{r(r−1)/4} ∏_{j=1}^{r} Γ((p − j + 1)/2), r ≤ p.
0F1(a; AA')    Hypergeometric function, pFq(·), with p = 0, q = 1, scalar parameter a, and symmetric matrix parameter, AA'.
δ(x)    Delta function of the argument, x. If x is a continuous variable, then δ(x) is the Dirac δ-function; if x is discrete, δ(x) = 1 if x = 0, and δ(x) = 0 otherwise.
δ_p(i)    Vector [δ(i − 1), δ(i − 2), ..., δ(i − p)]', i = 1,...,p.
I_(a,b]    Interval (a, b] in R.
Probability Calculus

Pr(·)    Probability of the given argument.
f(x|θ)    Distribution of (discrete or continuous) random variable x, conditioned by known θ.
f̆(x)    Variable distribution to be optimized ('wildcard' in functional optimization).
x^[i], f^[i](x)    x and f(x) in the ith iteration of an iterative algorithm.
θ̂    Point estimate of unknown parameter θ.
E_f(x)[·]    Expected value of the argument with respect to distribution f(x).
x̄, x̲    Upper bound, lower bound, respectively, on the range of random variable x.
N_X(M, Σ_p ⊗ Σ_n)    Matrix Normal distribution of X with mean value, M, and covariance matrices, Σ_p and Σ_n.
tN_x(μ, r; X)    Truncated scalar Normal distribution of x, of type N(μ, r), confined to support set X ⊂ R.
M_X(F)    Von Mises-Fisher matrix distribution of X with matrix parameter, F.
G_x(α, β)    Scalar Gamma distribution of x with parameters, α and β.
U_x(X)    Scalar Uniform distribution of x on the support set X ⊂ R.
List of Acronyms
AR AutoRegressive (model, process)
ARD Automatic Rank Determination (property)
CDEF Conjugate (parameter) distribution to a DEF (observation) model
DEF Dynamic Exponential Family
DEFS Dynamic Exponential Family with Separable parameters
DEFH Dynamic Exponential Family with Hidden variables
EAR Extended AutoRegressive (model, process)
FAMIS Functional Analysis for Medical Image Sequences (model)
FVPCA Fast Variational Principal Component Analysis (algorithm)
HPD Highest Posterior Density (region)
ICA Independent Component Analysis
IVB Iterative Variational Bayes (algorithm)
KLD Kullback-Leibler Divergence
MCMC Markov Chain Monte Carlo
MEAR Mixture-based Extension of the AutoRegressive model
OVPCA Orthogonal Variational Principal Component Analysis
PCA Principal Component Analysis
PPCA Probabilistic Principal Component Analysis
RLS Recursive Least Squares
RVB Restricted Variational Bayes
1 Introduction
1.1 How to be a Bayesian
In signal processing, as in all quantitative sciences, we are concerned with data, D, and how we can learn about the system or source which generated D. We will often refer to learning as inference. In this book, we will model the data parametrically, so that a set, θ, of unknown parameters describes the data-generating system. In deterministic problems, knowledge of θ determines D under some notional rule, D = g(θ). This accounts for very few of the data contexts in which we must work. In particular, when D is information-bearing, then we must model the uncertainty (sometimes called the randomness) of the process. The defining characteristic of Bayesian methods is that we use probabilities to quantify our beliefs amid uncertainty, and the calculus of probability to manipulate these quantitative beliefs [1–3]. Hence, our beliefs about the data are completely expressed via the parametric probabilistic observation model, f(D|θ). In this way, knowledge of θ determines our beliefs about D, not D themselves.
In practice, the result of an observational experiment is that we are given D, and our problem is to use them to learn about the system—summarized by the unknown parameters, θ—which generated them. This learning amid uncertainty is known as inductive inference [3], and it is solved by constructing the distribution f(θ|D), namely, the distribution which quantifies our a posteriori beliefs about the system, given a specific set of data, D. The simple prescription of Bayes' rule solves the implied inverse problem [4], allowing us to reverse the order of the conditioning in the observation model, f(D|θ):

f(θ|D) ∝ f(D|θ) f(θ). (1.1)
Bayes' rule specifies how our prior beliefs, quantified by the prior distribution, f(θ), are updated in the light of D. Hence, a Bayesian treatment requires prior quantification of our beliefs about the unknown parameters, θ, whether or not θ is by nature fixed or randomly realized. The signal processing community, in particular, has been resistant to the philosophy of strong Bayesian inference [3], which assigns probabilities to fixed, as well as random, unknown quantities. Hence, they relegate Bayesian methods to inference problems involving only random quantities [5, 6]. This book adheres to the strong Bayesian philosophy.
Tractability is a primary concern to any signal processing expert seeking to develop a parametric inference algorithm, both in the off-line case and, particularly, on-line. The Bayesian approach provides f(θ|D) as the complete inference of θ, and this must be manipulated in order to solve problems of interest. For example, we may wish to concentrate the inference onto a subset, θ1, by marginalizing over their complement, θ2:

f(θ1|D) ∝ ∫_Θ2* f(θ1, θ2|D) dθ2. (1.2)
A decision, such as a point estimate, may be required. The mean a posteriori estimate may then be justified:

θ̂1 = ∫_Θ1* θ1 f(θ1|D) dθ1. (1.3)
Finally, we might wish to select a model from a set of candidates, {M1, ..., Mc}, via computation of the marginal probability of D with respect to each candidate:

f(Ml|D) ∝ Pr[Ml] ∫_Θl* f(D|θl, Ml) f(θl|Ml) dθl. (1.4)
Here, θl ∈ Θl* are the parameters of the competing models, and Pr[Ml] is the necessary prior on those models.
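The computations in (1.1)–(1.3) can be sketched numerically on a discretized parameter grid. The model below (Normal data with unknown mean and precision, under nearly flat priors) is an illustrative assumption, not an example taken from the text.

```python
# A numerical sketch of (1.1)-(1.3) on a parameter grid: Bayes' rule,
# marginalization over a nuisance parameter, and a mean a posteriori
# point estimate. Model and prior settings are illustrative only.
import numpy as np

D = np.array([0.9, 1.1, 1.3, 0.7, 1.0])   # observed data (illustrative)
m = np.linspace(-2.0, 4.0, 601)           # grid over the unknown mean, theta_1
w = np.linspace(0.01, 100.0, 1000)        # grid over the unknown precision, theta_2
dm, dw = m[1] - m[0], w[1] - w[0]
M, W = np.meshgrid(m, w, indexing="ij")

# Observation model: D_i ~ N(m, w^-1), i.i.d.
loglik = 0.5 * len(D) * np.log(W) - 0.5 * W * ((D[:, None, None] - M) ** 2).sum(axis=0)

# Nearly non-informative priors: m ~ N(0, 10^2), w ~ Exponential(0.01)
logprior = -0.5 * (M / 10.0) ** 2 - 0.01 * W

# Bayes' rule (1.1): joint posterior f(m, w | D) on the grid
logpost = loglik + logprior
post = np.exp(logpost - logpost.max())
post /= post.sum() * dm * dw              # normalize the posterior on the grid

# Marginalization (1.2): integrate out the nuisance precision, w
post_m = post.sum(axis=1) * dw

# Mean a posteriori point estimate (1.3) of m
m_hat = (m * post_m).sum() * dm
print(round(m_hat, 2))                    # -> 1.0
```

With nearly flat priors, the mean a posteriori estimate coincides, to grid accuracy, with the sample mean of the data, here 1.0.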
1.2 The Variational Bayes (VB) Method
The integrations required in (1.2)–(1.4) will often present computational burdens that compromise the tractability of the signal processing algorithm. In Chapter 3, we will review some of the approximations which can help to address these problems, but the aim of this book is to advocate the use of the Variational Bayes (VB) approximation as an effective pathway to the design of tractable signal processing algorithms for parametric inference. These VB solutions will be shown, in many cases, to be novel and attractive alternatives to currently available Bayesian inference algorithms.
The central idea of the VB method is to approximate f(θ|D), ab initio, in terms of a conditionally independent factorization:

f̃(θ|D) = f̃(θ1|D) f̃(θ2|D). (1.5)

The best such approximation is found by minimizing a measure of divergence from f̃(θ|D) to f(θ|D), namely, a particular Kullback-Leibler Divergence (KLD), which we will call KLD_VB in Section 3.2.2:

KLD_VB = ∫ f̃(θ|D) ln [ f̃(θ|D) / f(θ|D) ] dθ. (1.6)
In practical terms, functional optimization of (1.6) yields a known functional form for f̃(θ1|D) and f̃(θ2|D), which will be known as the VB-marginals. However, the shaping parameters associated with each of these VB-marginals are expressed via particular moments of the others. Therefore, the approximation is possible if all moments required in the shaping parameters can be evaluated. Mutual interaction of the VB-marginals via their moments presents an obstacle to evaluation of their shaping parameters, since a closed-form solution is available only for a limited number of problems. However, a generic iterative algorithm for evaluation of the VB-moments and shaping parameters is available for tractable VB-marginals (i.e. marginals whose moments can be evaluated). This algorithm—reminiscent of the classical Expectation-Maximization (EM) algorithm—will be called the Iterative Variational Bayes (IVB) algorithm in this book. Hence, the computational burden of the VB-approximation is confined to iterations of the IVB algorithm. The result is a set of moments and shaping parameters, defining the VB-approximation (1.5).
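The divergence in (1.6) can be evaluated numerically for a simple case. In the sketch below, f is a correlated bivariate Normal, and the factorized f̃ is taken, purely for illustration, to be the product of the exact marginals: one admissible member of the class (1.5), though not necessarily the KLD_VB-optimal member. The correlation value and grid settings are assumptions for the example.

```python
# Numerical evaluation of the KLD in (1.6) on a grid, for a correlated
# bivariate Normal f and a factorized approximation f_tilde (here, the
# product of the exact marginals, chosen for illustration only).
import numpy as np

x = np.linspace(-5.0, 5.0, 401)
dx = x[1] - x[0]
X1, X2 = np.meshgrid(x, x, indexing="ij")

rho = 0.8  # correlation that the factorized approximation must discard
# Exact joint: zero-mean bivariate Normal with unit variances
logf = -(X1 ** 2 - 2 * rho * X1 * X2 + X2 ** 2) / (2 * (1 - rho ** 2))
f = np.exp(logf)
f /= f.sum() * dx * dx                  # normalize on the grid

f1 = f.sum(axis=1) * dx                 # exact marginal of theta_1
f2 = f.sum(axis=0) * dx                 # exact marginal of theta_2
f_tilde = np.outer(f1, f2)              # a conditionally independent product, cf. (1.5)

# The KLD of (1.6), evaluated from f_tilde to f
kld = np.sum(f_tilde * np.log(f_tilde / f)) * dx * dx
print(round(kld, 2))                    # -> 1.27
```

For rho = 0.8 the divergence is about 1.27 nats; it shrinks to zero as rho → 0, when the factorization becomes exact.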
1.3 A First Example of the VB Method: Scalar Additive Decomposition

Consider the scalar additive decomposition

d = m + e, (1.7)

where the noise, e, is zero-mean Normal with precision ω, so that the observation model is

f(d|m, ω) = N(m, ω⁻¹). (1.8)

The task is to infer the two unknown parameters—i.e. the mean, m, and precision, ω—of the Normal distribution, N, given just one scalar data point, d. This constitutes a stressful regime for inference. In order to 'be a Bayesian', we assign a prior distribution to m and ω. Given the poverty of data, we can expect our choice to have some influence on our posterior inference. We will now consider two choices for prior elicitation.
1.3.1 A First Choice of Prior
The following choice seems reasonable:

f(m|φ) = N(0, φ⁻¹), (1.9)
f(ω|α, β) = G(α, β). (1.10)

The prior precision, φ > 0, may be chosen small, so that (1.9) becomes flatter. The Gamma distribution, G, in (1.10) was chosen to reflect the positivity of ω. Its parameters, α > 0 and β > 0, may again be chosen to yield a non-informative prior. For α → 0 and β → 0, (1.10) approaches Jeffreys' improper prior on scale parameters, 1/ω [7].
Joint inference of the Normal mean and precision, m and ω respectively, is well studied in the literature [8, 9]. From Bayes' rule, the posterior distribution is

f(m, ω|d, α, β, φ) ∝ f(d|m, ω) f(m|φ) f(ω|α, β), (1.11)

where the required distributions are summarized in Appendices A.2 and A.5 respectively. Even in this simple case, evaluation of the marginal distribution of the mean, m, i.e. f(m|d, α, β, φ), is not tractable. Hence, we seek the best approximation in the class of conditionally independent posteriors on m and ω, by minimizing KLD_VB (1.6), this being the VB-approximation. The solution can be found in the following form:
f̃(m|d, α, β, φ) = N((ω̂ + φ)⁻¹ ω̂ d, (ω̂ + φ)⁻¹), (1.12)

where ω̂ denotes the VB-moment of ω under the companion Gamma VB-marginal, f̃(ω|d, α, β, φ) (1.13).
The VB-moments (1.14) fully determine the VB-marginals, (1.12) and (1.13). It can be shown that this set of VB-equations (1.14) has three possible solutions (being roots of a 3rd-order polynomial), only one of which satisfies ω̂ > 0. Hence, the optimized KLD_VB has three 'critical' points for this model. The exact distribution and its VB-approximation are compared in Fig. 1.1.
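The IVB iteration for this example can be sketched numerically. The first update below is the Normal VB-marginal (1.12); the Gamma update for the VB-moment of ω follows from the standard VB treatment of this Normal model, and is stated here as an assumption, since (1.13)–(1.14) are not reproduced above. The prior values are illustrative, chosen mildly informative so that the iteration converges quickly.

```python
# Iterative Variational Bayes (IVB) for the scalar additive
# decomposition under priors (1.9)-(1.10). The shaping parameters of
# the two VB-marginals are mutually coupled through their moments, so
# they are iterated to convergence. All numerical values are
# illustrative assumptions, not taken from the text.
d = 2.0                    # the single observed datum
alpha, beta = 1.0, 1.0     # Gamma prior parameters for omega
phi = 0.1                  # Normal prior precision for m

omega_hat = 1.0            # initial VB-moment of the precision, omega
for _ in range(100):
    # Normal VB-marginal of m, cf. (1.12): mean m_hat, variance var_m
    var_m = 1.0 / (omega_hat + phi)
    m_hat = omega_hat * d * var_m
    # Gamma VB-marginal of omega: its moment uses E[(d - m)^2] under f~(m)
    omega_new = (alpha + 0.5) / (beta + 0.5 * ((d - m_hat) ** 2 + var_m))
    if abs(omega_new - omega_hat) < 1e-12:   # VB-equations have converged
        break
    omega_hat = omega_new

print(round(m_hat, 3), round(omega_hat, 3))  # -> 1.823 1.028
```

Note that the point d = 2 is slightly shrunk toward the prior mean of m, and that the converged VB-moments jointly satisfy both implicit equations, as required.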
1.3.2 The Prior Choice Revisited
For comparison, we now consider a different choice of the priors:

f(m|γ, ω) = N(0, (γω)⁻¹), (1.15)
f(ω|α, β) = G(α, β). (1.16)

Here, (1.16) is the same as (1.10), but (1.15) has been parameterized differently from (1.9). It still expresses our lack of knowledge of the polarity of m, and it still penalizes extreme values of m if γ → 0. Hence, both prior structures, (1.9) and (1.15), can express non-informative prior knowledge. However, the precision parameter, γω, of m is now chosen proportional to the precision parameter, ω, of the noise (1.8).

Fig. 1.1 The VB-approximation, (1.12) and (1.13), for the scalar additive decomposition (dash-dotted contours). Full contour lines denote the exact posterior distribution (1.11).
From Bayes' rule, the posterior distribution is now

f(m, ω|d, α, β, γ) ∝ f(d|m, ω) f(m|γ, ω) f(ω|α, β). (1.17)

In this case, the VB-marginals, (1.19) and (1.20), are again of Normal and Gamma type, and their shaping parameters (1.22) are available in closed form. The exact and VB-approximated posterior distributions are compared in Fig. 1.2.
Remark 1.1 (Choice of priors for the VB-approximation). Even in the stressful regime of this example (one datum, two unknowns), each set of priors had a similar influence on the posterior distribution. In more realistic contexts, the distinctions will be even less, as the influence of the data—via f(D|θ) in (1.1)—begins to dominate the prior, f(θ). However, from an analytical point-of-view, the effects of the prior choice can be very different, as we have seen in this example. Recall that the moments of the exact posterior distribution were tractable in the case of the second prior (1.17), but were not tractable in the first case (1.11). This distinction carried through to the respective VB-approximations. Once again, the second set of priors implied a far simpler solution (1.22) than the first (1.14). Therefore, in this book, we will take care to design priors which can facilitate the task of VB-approximation. We will always be in a position to ensure that our choice is non-informative.
1.4 The VB Method in its Context
Fig. 1.2 The VB-approximation, (1.19) and (1.20), for the scalar additive decomposition (dash-dotted contours), using the alternative priors, (1.15) and (1.16). Full contour lines denote the exact posterior distribution (1.17).

Statistical physics has long been concerned with high-dimensional probability functions and their simplification [10]. Typically, the physicist is considering a system of many interacting particles and wishes to infer the state, θ, of this system. Boltzmann's law [11] relates the energy of the state to its probability, f(θ). If we wish to infer a sub-state, θ_i, we must evaluate the associated marginal, f(θ_i). Progress can be made by replacing the exact probability model, f(θ), with an approximation, f̃(θ). Typically, this requires us to neglect interactions in the physical system, by setting many such interactions to zero. The optimal such approximate distribution, f̃(θ), can be chosen using the variational method [12], which seeks a free-form solution within the approximating class that minimizes some measure of disparity between f(θ) and f̃(θ). Strong physical justification can be advanced for minimization of a Kullback-Leibler divergence (1.6), which is interpretable as a relative entropy. The Variational Bayes (VB) approximation is one example of such an approximation, where independence between all the θ_i is enforced (1.5). In this case, the approximating marginals depend on expectations of the remaining states. Mean Field Theory (MFT) [10] generalizes this approach, exploring many such choices for the approximating function, f̃(θ), and its disparity with respect to f(θ). Once the variational approximation has been obtained, the exact system is studied by means of this approximation [13].

The machine learning community has adopted Mean Field Theory [12] as a way to cope with problems of learning and belief propagation in complex systems such as neural networks [14–16]. Ensemble learning [17] is an example of the use of the VB-approximation in this area. Communication between the machine learning and physics communities has been enhanced by the language of graphical models [18–20]. The Expectation-Maximization (EM) algorithm [21] is another important point of tangency, and was re-derived in [22] using KLD_VB minimization. The EM algorithm has long been known in the signal processing community as a means of finding the Maximum Likelihood (ML) solution in high-dimensional problems—such as image segmentation—involving hidden variables. Replacement of the EM equations with Variational EM (i.e. IVB) [23] equations allows distributional approximations to be used in place of point estimates.
In signal processing, the VB method has proved to be of importance in addressing problems of model structure inference, such as the inference of rank in Principal Component Analysis (PCA) [24] and Factor Analysis [20, 25], and in the inference of the number of components in a mixture [26]. It has been used for identification of non-Gaussian AutoRegressive (AR) models [27, 28], for unsupervised blind source separation [29], and for pattern recognition of hand-written characters [15].
1.5 VB as a Distributional Approximation
The VB method of approximation is one of many techniques for approximation of probability functions. In the VB method, the approximating family is taken as the set of all possible distributions expressed as the product of required marginals, with the optimal such choice made by minimization of a KLD. The following are among the many other approximations, deterministic and stochastic, that have been used in signal processing:
Point-based approximations: examples include the Maximum a Posteriori (MAP) and ML estimates. These are typically used as certainty equivalents [30] in decision problems, leading to highly tractable procedures. Their inability to take account of uncertainty is their principal drawback.

Local approximations: the Laplace approximation [31], for example, performs a Taylor expansion at a point, typically the ML estimate. This method is known to the signal processing community in the context of criteria for model order selection, such as the Schwarz criterion and Bayes' Information Criterion (BIC), both of which were derived using the Laplace method [31]. Their principal disadvantage is their inability to cope with multimodal probability functions.

Spline approximations: tractable approximations of the probability function may be proposed on a sufficiently refined partition of the support. The computational load associated with integrations typically increases exponentially with the number of dimensions.

MaxEnt and moment matching: the approximating distribution may be chosen to match a selected set of the moments of the true distribution [32]. Under the MaxEnt principle [33], the optimal such moment-matching distribution is the one possessing maximum entropy subject to these moment constraints.

Empirical approximations: a random sample is generated from the probability function, and the distributional approximation is simply a set of point masses placed at these independent, identically-distributed (i.i.d.) sampling points. The key technical challenge is efficient generation of i.i.d. samples from the true distribution. In recent years, stochastic sampling techniques [34], particularly the class known as Markov Chain Monte Carlo (MCMC) methods [35], have overtaken deterministic methods as the gold standard for distributional approximation. They can yield approximations to an arbitrary level of accuracy, but typically incur major computational overheads. It can be instructive to examine the performance of any deterministic method, such as the VB method, in terms of the accuracy-vs-complexity trade-off achieved by these stochastic sampling techniques.
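The empirical approximation just described can be sketched in a few lines: point masses at i.i.d. samples turn expectations into sample averages, with Monte Carlo error shrinking as the sample grows. The Gaussian target and sample size below are illustrative assumptions, not taken from the text.

```python
import random

random.seed(0)

# True distribution: N(mu, sigma^2); its mean and variance are the moments
# we approximate empirically.
mu, sigma = 2.0, 0.5

# Empirical approximation: point masses at N i.i.d. samples.
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

# Expectations under the empirical approximation are sample averages.
empirical_mean = sum(samples) / len(samples)
empirical_var = sum((x - empirical_mean) ** 2 for x in samples) / len(samples)

print(abs(empirical_mean - mu))        # Monte Carlo error, O(sigma / sqrt(N))
print(abs(empirical_var - sigma ** 2))
```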
The VB method has the potential to offer an excellent trade-off between computational complexity and accuracy of the distributional approximation. This is suggested in Fig. 1.3. The main computational burden associated with the VB method is the need to solve iteratively, via the IVB algorithm, a set of simultaneous equations in order to reveal the required moments of the VB-marginals. If computational cost is of concern, VB-marginals may be replaced by simpler approximations, or the evaluation of moments can be approximated, without, hopefully, diminishing the overall quality of approximation significantly. This pathway of approximation is suggested by the dotted arrow in Fig. 1.3, and will be traversed in some of the signal processing applications presented in this book. Should the need exist to increase accuracy, the VB method is situated in the flexible context of Mean Field Theory, which offers more sophisticated techniques that might be explored.
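The cost of the independence assumption underlying the VB and MFT approximations can be made concrete. For a zero-mean bivariate Gaussian with correlation ρ, it is a standard result (assumed here, not derived in this text) that the variational mean-field product recovers the correct means but the conditional, rather than marginal, variances; its KLD from the true distribution then has the closed form -(1/2) ln(1 - ρ²). A sketch:

```python
import math

def kl_meanfield_gauss(rho):
    """KL(q || p) where p = N(0, [[1, rho], [rho, 1]]) and q is the
    mean-field product whose factors carry the conditional variances,
    q_i = N(0, 1 - rho^2)."""
    # Closed form for zero-mean Gaussians:
    # KL = 0.5 * (tr(Sigma_p^{-1} Sigma_q) - d + ln det Sigma_p - ln det Sigma_q)
    d = 2
    det_p = 1 - rho ** 2
    det_q = (1 - rho ** 2) ** 2
    trace_term = 2.0  # tr(Sigma_p^{-1} Sigma_q) works out to d for this q
    return 0.5 * (trace_term - d + math.log(det_p) - math.log(det_q))

# Independence costs nothing when the target already factorizes...
assert abs(kl_meanfield_gauss(0.0)) < 1e-12
# ...and grows without bound as the neglected interaction strengthens.
assert kl_meanfield_gauss(0.9) > kl_meanfield_gauss(0.5) > 0
print(kl_meanfield_gauss(0.5))  # equals -0.5 * ln(1 - 0.25)
```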
[Figure 1.3 locates deterministic methods, the EM algorithm, Variational Bayes (IVB), mean field theory and sampling methods on the accuracy-vs-complexity plane.]
Fig. 1.3. The accuracy-vs-complexity trade-off in the VB method.
1.6 Layout of the Work
We now briefly summarize the main content of the Chapters of this book.
Chapter 2 This provides an introduction to Bayesian theory relevant for distributional approximation. We review the philosophical framework, and we introduce basic probability calculus which will be used in the remainder of the book. The important distinction between off-line and on-line inference is outlined.
Chapter 3 Here, we are concerned with the problem of distributional approximation. The VB-approximation is defined, and from it we synthesize an ergonomic procedure for deducing these VB-approximations. This is known as the VB method. Related distributional approximations are briefly reviewed and compared to the VB method. A simple inference problem, scalar multiplicative decomposition, is considered.
Chapter 4 The VB method is applied to the problem of matrix multiplicative decompositions. The VB-approximation for these models reveals interesting properties of the method, such as initialization of the Iterative VB algorithm (IVB) and the existence of local minima. These models are closely related to Principal Component Analysis (PCA), and we show that the VB inference provides solutions to problems not successfully addressed by PCA, such as the inference of rank.
Chapter 5 We use our experience from Chapter 4 to derive the VB-approximation for the inference of physiological factors in medical image sequences. The physical nature of the problem imposes additional restrictions which are successfully handled by the VB method.
Chapter 6 The VB method is explored in the context of recursive inference of signal processes. In this Chapter, we confine ourselves to time-invariant parameter models. We isolate three fundamental scenarios, each of which constitutes a recursive inference task where the VB-approximation is tractable and adds value. We apply the VB method to the recursive identification of mixtures of AR models. The practical application of this work in prediction of urban traffic flow is outlined.
Chapter 7 The time-invariant parameter assumption from Chapter 6 is relaxed. Hence, we are concerned here with Bayesian filtering. The use of the VB method in this context reveals interesting computational properties in the resulting algorithm, while also pointing to some of the difficulties which can be encountered.
Chapter 8 We address a practical signal processing task, namely, the reconstruction of AR processes corrupted by unknown transformation and noise distortions. The use of the VB method in this ambitious context requires synthesis of experience gained in Chapters 6 and 7. The resulting VB inference is shown to be successful in optimal data pre-processing tasks such as outlier removal and suppression of burst noise. An application in speech denoising is presented.
Chapter 9 We summarize the main findings of the work, and point to some interesting future prospects.
1.7 Acknowledgement
The first author acknowledges the support of Grants AV ČR 1ET 100 750 401 and MŠMT 1M6798555601.
2 Bayesian Theory
In this Chapter, we review the key identities of probability calculus relevant to Bayesian inference. We then examine three fundamental contexts in parametric modelling, namely (i) off-line inference, (ii) on-line inference of time-invariant parameters, and (iii) on-line inference of time-variant parameters. In each case, we use the Bayesian framework to derive the formal solution. Each context will be examined in detail in later Chapters.
2.1 Bayesian Benefits
A Bayesian is someone who uses only probabilities to quantify degrees of belief in an uncertain hypothesis, and uses only the rules of probability as the calculus for operating on these degrees of belief [7, 8, 36, 37]. At the very least, this approach to inductive inference is consistent, since the calculus of probability is consistent, i.e. any valid use of the rules of probability will lead to a unique conclusion. This is not true of classical approaches to inference, where degrees of belief are quantified using one of a vast range of criteria, such as relative frequency of occurrence, distance in a normed space, etc. If the Bayesian's probability model is chosen to reflect such criteria, then we might expect close correspondence between Bayesian and classical methods. However, a vital distinction remains. Since probability is a measure function on the space of possibilities, the marginalization operator (i.e. integration) is a powerful inferential tool uniquely at the service of the Bayesian. Careful comparison of Bayesian and classical solutions will reveal that the real added value of Bayesian methods derives from being able to integrate, thereby concentrating the inference onto a selected subset of quantities of interest. In this way, Bayesian methods naturally embrace the following key problems, all problematical for the non-Bayesian:
1 projection into a desired subset of the hypothesis space;
2 reduction of the number of parameters appearing in the probability function (so-called 'elimination of nuisance parameters' [38]);
3 quantification of the risk associated with a data-informed decision;
4 evaluation of expected values and moments;
5 comparison of competing model structures and penalization of complexity (Ockham's Razor) [39, 40];
6 prediction of future data.
All of these tasks require integration with respect to the probability measure on the space of possibilities. In the case of 5 above, competing model structures are measured, leading to consistent quantification of model complexity. This natural engendering of Ockham's razor is among the most powerful features of the Bayesian framework.
Why, then, are Bayesian methods still so often avoided in application contexts such as statistical signal processing? The answer is mistrust of the prior, and philosophical angst about (i) its right to exist, and (ii) its right to influence a decision or algorithm. With regard to (i), it is argued by non-Bayesians that probabilities may only be attached to objects or hypotheses that vary randomly in repeatable experiments [41]. With regard to (ii), the non-Bayesian (objectivist) perspective is that inferences should be based only on data, and never on prior knowledge. Preoccupation with these issues is to miss where the action really is: the ability to marginalize in the Bayesian framework. In our work, we will eschew detailed philosophical arguments in favour of a policy that minimizes the influence of the priors we use, and points to the practical added value over frequentist methods that arises from use of probability calculus.

2.1.1 Off-line vs On-line Parametric Inference
In an observational experiment, we may wish to infer knowledge of an unknown quantity only after all data, D, have been gathered. This batch-based inference will be called the off-line scenario, and Bayesian methods must be used to update our beliefs given no data (i.e. our prior), to beliefs given D. It is the typical situation arising in database analysis. In contrast, we may wish to interleave the process of observing data with the process of updating our beliefs. This on-line scenario is important in control and decision tasks, for example. For convenience, we refer to the independent variable indexing the occasions (temporal, spatial, etc.) when our inferences must be updated, as time, t = 0, 1, .... The incremental data observed between inference times is d_t, and the aggregate of all data observed up to and including time t is denoted by D_t. Hence:

D_t = D_{t-1} ∪ d_t,  t = 1, 2, ...,

with D_0 = {}, by definition. For convenience, we will usually assume that d_t ∈ R^{p×1}, p ∈ N+, ∀t, and so D_t can be structured into a matrix of dimension p × t, with the incremental data, d_t, as its columns:

D_t = [d_1, d_2, ..., d_t].  (2.1)
In this on-line scenario, Bayesian methods are required to update our state of knowledge conditioned by D_{t-1}, to our state of knowledge conditioned by D_t. Of course, the update is achieved using exactly the same 'inference machine', namely Bayes' rule (1.1). Indeed, one step of on-line inference is equivalent to an off-line step, with D = d_t, and with the prior at time t being conditioned on D_{t-1}. Nevertheless, it will be convenient to handle the off-line and on-line scenarios separately, and we now review the Bayesian probability calculus appropriate to each case.

2.2 Bayesian Parametric Inference: the Off-Line Case
Let the measured data be denoted by D. A parametric probabilistic model of the data is given by the probability distribution, f(D|θ), conditioned by knowledge of the parameters, θ. In this book, the notation f(·) can represent either a probability density function for continuous random variables, or a probability mass function for discrete random variables. We will refer to f(·) as a probability distribution in both cases. In this way a significant harmonization of formulas and nomenclature can be achieved. We need only keep in mind that integrations should be replaced by summations whenever the argument is discrete.¹
Our prior state of knowledge of θ is quantified by the prior distribution, f(θ). Our state of knowledge of θ after observing D is quantified by the posterior distribution, f(θ|D). These functions are related via Bayes' rule,

f(θ|D) = f(θ, D) / f(D) = f(D|θ) f(θ) / ∫_{Θ*} f(D|θ) f(θ) dθ,  (2.2)

where Θ* is the space of θ. We will refer to f(θ, D) as the joint distribution of parameters and data, or, more concisely, as the joint distribution. We will refer to f(D|θ) as the observation model. If this is viewed as a (non-measure) function of θ, it is known as the likelihood function [3, 43–45]:

ℓ(θ; D) ≡ f(D|θ).  (2.3)

ζ = f(D) is the normalizing constant, sometimes known as the partition function in the physics literature [46]:

ζ = f(D) = ∫_{Θ*} f(D|θ) f(θ) dθ.  (2.4)

Hence, Bayes' rule (2.2) may be written as

f(θ|D) ∝ f(D|θ) f(θ),  (2.5)

where ∝ means equal up to the normalizing constant, ζ. The posterior is fully determined by the product f(D|θ) f(θ), since the normalizing constant follows from the
requirement that f(θ|D) be a probability distribution, i.e. ∫_{Θ*} f(θ|D) dθ = 1. Evaluation of ζ (2.4) can be computationally expensive, or even intractable. If the integral in (2.4) does not converge, the distribution is called improper [47]. The posterior distribution with explicitly known normalization (2.5) will be called the normalized distribution. In Fig. 2.1, we represent Bayes' rule (2.2) as an operator, B, transforming the prior into the posterior, via the observation model, f(D|θ).

¹ This can also be achieved via measure theory, operating in a consistent way for both discrete and continuous distributions, with probability densities generalized in the Radon-Nikodym sense [42]. The practical effect is the same, and so we will avoid this formality.
transform-B
f (D|θ)
Fig 2.1 Bayes’ rule as an operator.
2.2.1 The Subjective Philosophy
All our beliefs about θ, and their associated quantifiers via f(θ), f(θ|D), etc., are conditioned on the parametric probability model, f(θ, D), chosen by us a priori (2.2). Its ingredients are (i) the deterministic structure relating D to an unknown parameter set, θ, i.e. the observation model f(D|θ), and (ii) a chosen measure on the space, Θ, of this parameter set, i.e. the prior measure f(θ). In this sense, Bayesian methods are born from a subjective philosophy, which conditions all inference on the prior knowledge of the observer [2, 36]. Jeffreys' notation [7], I, is used to condition all probability functions explicitly on this corpus of prior knowledge; e.g. f(θ) → f(θ|I). For convenience, we will not use this notation, nor will we forget the fact that this conditioning is always present. In model comparison (1.4), where we examine competing model assumptions, f_l(θ_l, D), l = 1, ..., c, this conditioning becomes more explicit, via the indicator variable or pointer, l ∈ {1, 2, ..., c}, but once again we will suppress the implied Jeffreys' notation.
2.2.2 Posterior Inferences and Decisions
The task of evaluating the full posterior distribution (2.5) will be called parameter inference in this book. We favour this phrase over the alternative, density estimation, used in some decision theory texts [48]. The full posterior distribution is a complete description of our uncertainty about the parameters of the observation model (2.3), given prior knowledge, f(θ), and all available data, D. For many practical tasks, we need to derive conditional and marginal distributions of model parameters, and their moments. Consider the (vector of) model parameters to be partitioned into two sub-vectors, θ = [θ_1, θ_2]. The marginal distribution of θ_1 is obtained by integrating out θ_2:

f(θ_1|D) = ∫ f(θ_1, θ_2|D) dθ_2.  (2.6)

Fig. 2.2. The marginalization operator.

In Fig. 2.2, we represent (2.6) as an operator. This graphical representation will be convenient in later Chapters.
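Marginalization (2.6) and posterior expectation (2.7) reduce to sums over a table when the posterior is discrete. The probability values below are illustrative, not drawn from any model in the text.

```python
# Marginalization and posterior expectation over a discrete two-parameter
# posterior f(theta1, theta2 | D), stored as a table of probabilities.
f_joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

# f(theta1 | D) = sum over theta2 of f(theta1, theta2 | D)
f_theta1 = {}
for (t1, t2), p in f_joint.items():
    f_theta1[t1] = f_theta1.get(t1, 0.0) + p

# E[g(theta1)] with g the identity: the posterior mean of theta1.
mean_theta1 = sum(t1 * p for t1, p in f_theta1.items())

print(f_theta1)      # theta1 = 0 carries mass 0.3, theta1 = 1 carries 0.7
print(mean_theta1)   # 0.7, up to floating-point rounding
```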
The moments of the posterior distribution, i.e. the expected or mean value of known functions, g(θ), of the parameter, will be denoted by

E_{f(θ|D)}[g(θ)] = ∫_{Θ*} g(θ) f(θ|D) dθ.  (2.7)

In general, we will use the notation ĝ(θ) to refer to a posterior point estimate of g(θ). Hence, for the choice (2.7), we have

ĝ(θ) = E_{f(θ|D)}[g(θ)].  (2.8)

Other point estimates may be designed in a decision-theoretic setting, by minimizing the posterior expected loss. The ML estimate,

θ̂_ML = arg max_{θ∈Θ*} f(D|θ),

is the workhorse of classical inference, since it avoids the issue of defining a prior over the space of possibilities. In particular, it is the dominant tool for probabilistic methods in signal processing [5, 53, 54]. Consider the special case of an additive Gaussian noise model for vector data, D = d ∈ R^p, with
d = s(θ) + e,
e ∼ N (0, Σ) ,
where Σ is known, and s(θ) is the (non-linearly parameterized) signal model. In this case, θ̂_ML = θ̂_LS, the traditional non-linear, weighted Least-Squares (LS) estimate [55] of θ. From the Bayesian perspective, these classical estimators, θ̂_ML and θ̂_LS, can be justified only to the extent that a uniform prior over Θ* might be justified. When Θ* has infinite Lebesgue measure, this prior is improper, leading to technical and philosophical difficulties [3, 8]. In this book, it is the strongly Bayesian choice, ĝ(θ) = E_{f(θ|D)}[g(θ)] (2.8), which predominates. Hence, the notation ĝ ≡ ĝ(θ) will always denote the posterior mean of g(θ), unless explicitly stated otherwise.
As an alternative to point estimation, the Bayesian may choose to describe a continuous posterior distribution, f(θ|D) (2.2), in terms of a region or interval within which θ has a high probability of occurrence. These credible regions [37] replace the confidence intervals of classical inference, and have an intuitive appeal. The following special case provides a unique specification, and will be used in this book.

Definition 2.1 (Highest Posterior Density (HPD) Region). R ⊂ Θ* is the 100(1 − α)% HPD region of (continuous) distribution, f(θ|D), where α ∈ (0, 1), if (i) ∫_R f(θ|D) dθ = 1 − α, and if (ii) almost surely (a.s.) for any θ_1 ∈ R and θ_2 ∉ R, then f(θ_1|D) ≥ f(θ_2|D).
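Definition 2.1 suggests a simple construction on a discretized posterior: collect grid cells in order of decreasing density until probability 1 − α is accumulated. A sketch, with an illustrative Gaussian posterior (the distribution and grid are assumptions made for this example only):

```python
import math

# HPD region on a grid: take cells from the highest density downwards until
# mass 1 - alpha is reached.  Illustrative posterior: N(0.5, 0.2^2),
# discretized and renormalized on [-4, 4].
step = 0.01
grid = [i * step for i in range(-400, 401)]
dens = [math.exp(-0.5 * ((x - 0.5) / 0.2) ** 2) for x in grid]
total = sum(dens)
post = [d / total for d in dens]                      # cell probabilities

def hpd(grid, post, alpha):
    cells = sorted(zip(post, grid), reverse=True)     # highest density first
    mass, region = 0.0, []
    for p, x in cells:
        region.append(x)
        mass += p
        if mass >= 1.0 - alpha:
            break
    return min(region), max(region)

lo, hi = hpd(grid, post, alpha=0.05)
# For this Gaussian, the exact 95% HPD region is 0.5 +/- 1.96 * 0.2,
# roughly (0.11, 0.89); the grid answer agrees to cell resolution.
print(lo, hi)
```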
2.2.3 Prior Elicitation
The prior distribution (2.2) required by Bayes' rule is a function that must be elicited by the designer of the model. It is an important part of the inference problem, and can significantly influence posterior inferences and decisions (Section 2.2.2). General methods for prior elicitation have been considered extensively in the literature [7, 8, 37, 56], as has the problem of choosing priors for specific signal models in Bayesian signal processing [3, 35, 57]. In this book, we are concerned with the practical impact of prior choices on the inference algorithms which we develop. The prior distribution will be used in the following ways:

1 To supplement the data, D, in order to obtain a reliable posterior estimate, in cases where there are insufficient data and/or a poorly defined model. This will be called regularization (via the prior);
2 To impose various restrictions on the parameter θ, reflecting physical constraints such as positivity. Note, from (2.2), that if the prior distribution on a subset of the parameter support, Θ*, is zero, then the posterior distribution will also be zero on this subset;
3 To express prior ignorance about θ. If the data are assumed to be informative enough, we prefer to choose a non-informative prior (i.e. a prior with minimal impact on the posterior distribution). Philosophical and analytical challenges are encountered in the design of non-informative priors, as discussed, for example, in [7, 46].
In this book, we will typically choose our prior from a family of distributions providing analytical tractability during the Bayes update (Fig. 2.1). Notably, we will work with conjugate priors, as defined in the next Section. In such cases, we will design our non-informative prior by choosing its parameters to have minimal impact on the parameters of the posterior distribution.
2.2.3.1 Conjugate priors
In parametric inference, all distributions, f(·), have a known functional form, and are completely determined once the associated shaping parameters are known. Hence, the shaping parameters of the posterior distribution, f(θ|D, s_0) (2.5), are, in general, the complete data record, D, and any shaping parameters, s_0, of the prior, f_0(θ|s_0). Hence, a massive increase in the degrees-of-freedom of the inference may occur during the prior-to-posterior update. It will be computationally advantageous if the form of the posterior distribution is identical to the form of the prior, f_0(·|s_0), i.e. the inference is functionally invariant with respect to Bayes' rule, and is determined from a finite-dimensional vector shaping parameter:

f(θ|D, s_0) = f_0(θ|s),  s = s(D, s_0) ∈ R^q,

with s_0 forming the parameters of the prior. If s_0 are unknown, then they are called hyper-parameters [37], and are assigned a hyperprior, s_0 ∼ f(s_0). As we will see in Chapter 6, the choice of conjugate priors is of key importance in the design of tractable Bayesian recursive algorithms, since they confine the shaping parameters to R^q, and prevent a linear increase in the number of degrees-of-freedom with D_t (2.1). From now on, we will not use the subscript '0' in f_0. The fixed functional form will be implied by the conditioning on sufficient statistics s.
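A minimal sketch of conjugacy, using the standard Beta-Bernoulli pair (an illustrative choice, not a model used in the text): the shaping parameters stay in R² however much data arrives, and folding the data in one observation at a time reproduces the batch statistics.

```python
# Conjugacy sketch: for Bernoulli(theta) observations, the Beta(a, b) family
# is conjugate.  The posterior stays Beta, and the shaping parameters
# s = (a, b) live in R^2 no matter how much data arrives.
def beta_update(a, b, data):
    """One Bayes update: s0 = (a, b) -> s = (a + #successes, b + #failures)."""
    k = sum(data)
    return a + k, b + len(data) - k

a0, b0 = 1.0, 1.0                  # flat (non-informative) prior on [0, 1]
D = [1, 0, 1, 1, 0, 1, 1, 1]
a, b = beta_update(a0, b0, D)

print(a, b)                        # 7.0 3.0
print(a / (a + b))                 # posterior mean E[theta | D] = 0.7

# Processing the data one observation at a time gives the same statistics,
# which is what makes conjugate priors attractive for on-line inference.
sa, sb = a0, b0
for d in D:
    sa, sb = beta_update(sa, sb, [d])
assert (sa, sb) == (a, b)
```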
2.3 Bayesian Parametric Inference: the On-line Case
We now specialize Bayesian inference to the case of learning in tandem with data acquisition, i.e. we wish to update our inference in the light of incremental data, d_t (Section 2.1.1). We distinguish two situations, namely time-invariant and time-variant parameterizations.
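The basic on-line mechanism, reusing the posterior conditioned on D_{t-1} as the prior at time t, can be checked numerically on a grid: it reproduces the off-line (batch) posterior exactly. The Gaussian observation model and prior below are illustrative assumptions.

```python
import math

# On-line updating with a time-invariant parameter: the posterior after
# t-1 observations serves as the prior for observation t.
step = 0.01
grid = [i * step for i in range(-500, 501)]

def lik(d, th):
    return math.exp(-0.5 * (d - th) ** 2)     # N(theta, 1) observations

def normalize(f):
    z = sum(f) * step
    return [v / z for v in f]

data = [0.9, 1.4, 1.1]

# On-line: fold the observations in one at a time.
belief = normalize([math.exp(-0.5 * th ** 2) for th in grid])   # N(0,1) prior
for d in data:
    belief = normalize([lik(d, th) * b for th, b in zip(grid, belief)])

# Off-line: a single batch update with all the data.
batch = normalize([math.exp(-0.5 * th ** 2) *
                   math.prod(lik(d, th) for d in data)
                   for th in grid])

# The two routes agree up to floating-point rounding.
assert max(abs(x - y) for x, y in zip(belief, batch)) < 1e-9
```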
2.3.1 Time-Invariant Parameterization

In this case, the posterior distribution is updated recursively via Bayes' rule (2.2):

f(θ|D_t) ∝ f(d_t|θ, D_{t-1}) f(θ|D_{t-1}),  t = 1, 2, ...,  (2.14)

where f(θ|D_0) ≡ f(θ), the parameter prior (2.2). This scenario is illustrated in Fig. 2.3.

Fig. 2.3. The Bayes' rule operator in the on-line scenario with time-invariant parameterization.

The observation model, f(d_t|θ, D_{t-1}), at time t is related to the observation model for the accumulated data, D_t, which we can interpret as the likelihood function of θ (2.3), via the chain rule of probability:

f(D_t|θ) = ∏_{τ=1}^{t} f(d_τ|θ, D_{τ-1}).
2.3.2 Time-Variant Parameterization

In this case, new parameters, θ_t, are required to explain d_t, i.e. the observation model, f(d_t|θ_t, D_{t-1}), t = 1, 2, ..., is an explicitly time-varying function. For convenience, we assume that θ_t ∈ R^r, ∀t, and we aggregate the parameters into a matrix, Θ_t, as we did the data (2.1):

Θ_t = [θ_1, θ_2, ..., θ_t],
with Θ_0 = {} by definition. Once again, Bayes' rule (2.2) is used to update our knowledge of Θ_t in the light of new data, d_t:

f(Θ_t|D_t) ∝ f(d_t|Θ_t, D_{t-1}) f(Θ_t|D_{t-1}).  (2.18)

Note that the dimension of the integration is r(t − 1) at time t. If the integrations need to be carried out numerically, this increasing dimensionality proves prohibitive in real-time applications. Therefore, the following simplifying assumptions are typically adopted [42]:

Proposition 2.1 (Markov observation model and parameter evolution models).
The observation model is to be simplified as follows:

f(d_t|Θ_t, D_{t-1}) = f(d_t|θ_t, D_{t-1}),  (2.19)

i.e. d_t is conditionally independent of Θ_{t-1}, given θ_t.
The parameter evolution model is to be simplified as follows:

f(θ_t|Θ_{t-1}, D_{t-1}) = f(θ_t|θ_{t-1}).  (2.20)

In many applications, (2.20) may depend on exogenous (observed) data, ξ_t, which can be seen as shaping parameters, and need not be explicitly listed in the conditioning part of the notation.
This Markov model (2.20) is the required extra ingredient for Bayesian time-variant on-line inference. Employing Proposition 2.1 in (2.18), we obtain the following two-step recursion.

The time update of Bayesian filtering:

f(θ_t|D_{t-1}) = ∫_{Θ*} f(θ_t|θ_{t-1}, D_{t-1}) f(θ_{t-1}|D_{t-1}) dθ_{t-1},  t = 2, 3, ...  (2.21)

The data update of Bayesian filtering:

f(θ_t|D_t) ∝ f(d_t|θ_t, D_{t-1}) f(θ_t|D_{t-1}),  t = 1, 2, ...  (2.22)

Note, therefore, that the integration dimension is fixed at r, ∀t (2.21). We will
refer to this two-step update for Bayesian on-line inference of θ_t as Bayesian filtering, in analogy to Kalman filtering, which involves the same two-step procedure, and which is, in fact, a specialization to the case of Gaussian observation (2.19) and parameter evolution (2.20) models. On-line inference of time-variant parameters is illustrated in schematic form in Fig. 2.4. In Chapter 7, the problem of designing tractable Bayesian recursive filtering algorithms will be addressed for a wide class of models, (2.19) and (2.20), using Variational Bayes (VB) techniques.
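The two-step recursion (2.21)-(2.22) specializes, for Gaussian models, to the Kalman filter mentioned above. A scalar sketch, with illustrative noise variances and data (not taken from the text):

```python
# Bayesian filtering in its Gaussian special case: a scalar Kalman filter.
# Parameter evolution (2.20): theta_t = theta_{t-1} + w_t, w_t ~ N(0, q)
# Observation model (2.19):   d_t = theta_t + e_t,        e_t ~ N(0, r)
def kalman_step(mean, var, d, q=0.1, r=0.5):
    # Time update (2.21): propagate through the parameter evolution model.
    mean_pred, var_pred = mean, var + q
    # Data update (2.22): condition on the new observation d_t.
    gain = var_pred / (var_pred + r)
    mean_post = mean_pred + gain * (d - mean_pred)
    var_post = (1 - gain) * var_pred
    return mean_post, var_post

mean, var = 0.0, 1.0                       # prior f(theta_0)
for d in [1.0, 1.2, 0.9, 1.1]:
    mean, var = kalman_step(mean, var, d)

# The posterior concentrates near the data while the variance stays
# bounded, with a fixed computational cost per step.
print(round(mean, 3), round(var, 3))
```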
Fig. 2.4. The inferential scheme for Bayesian filtering. The operator '×' denotes multiplication of distributions.
2.3.3 Prediction
Our purpose in on-line inference of parameters will often be to predict future data. In the Bayesian paradigm, k-steps-ahead prediction is achieved by eliciting the following distribution:

f(d_{t+k}|D_t).  (2.23)

This will be known as the predictor.
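A concrete instance of the predictor, under the illustrative Beta-Bernoulli model (not a model used in the text): marginalizing the observation model over the posterior gives the predictive probability in closed form, which a grid integration confirms.

```python
# One-step-ahead predictor for Bernoulli(theta) data with a Beta(a, b)
# posterior: integrating theta against f(theta | D_t) gives
# f(d_{t+1} = 1 | D_t) = a / (a + b).
def predictor(a, b):
    return a / (a + b)

# Posterior Beta(7, 3), e.g. after 6 successes and 2 failures on a flat prior.
p1 = predictor(7.0, 3.0)
print(p1)  # 0.7

# A numerical check of the same marginalization on a grid over (0, 1):
step = 1e-4
grid = [(i + 0.5) * step for i in range(10_000)]
w = [th ** 6 * (1 - th) ** 2 for th in grid]       # unnormalized Beta(7, 3)
z = sum(w)
p1_grid = sum(th * wi for th, wi in zip(grid, w)) / z
assert abs(p1 - p1_grid) < 1e-6
```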
The one-step-ahead predictor (i.e. k = 1 in (2.23)) for a model with time-invariant parameters (2.14) is as follows:

f(d_{t+1}|D_t) = ∫_{Θ*} f(d_{t+1}|θ, D_t) f(θ|D_t) dθ.
In later Chapters, we will study the use of the Variational Bayes (VB) approximation
in all three contexts of Bayesian learning reviewed in this Chapter, namely:
1 off-line parameter inference (Section 2.2), in Chapter 3;
2 on-line inference of Time-Invariant (TI) parameters (Section 2.3.1), in Chapter 6;
3 on-line inference of Time-Variant (TV) parameters (Section 2.3.2), in Chapter 7.