Springer Series on
Signals and Communication Technology
Circuits and Systems
Based on Delta Modulation
Linear, Nonlinear and Mixed Mode Processing
D.G Zrilic ISBN 3-540-23751-8
Functional Structures in Networks
AMLn – A Language for Model Driven
Development of Telecom Systems
T Muth ISBN 3-540-22545-5
Radio Wave Propagation
for Telecommunication Applications
H Sizun ISBN 3-540-40758-8
Electronic Noise and Interfering Signals
Principles and Applications
G Vasilescu ISBN 3-540-40741-3
DVB
The Family of International Standards
for Digital Video Broadcasting, 2nd ed.
U Reimers ISBN 3-540-43545-X
Digital Interactive TV and Metadata
Future Broadcast Multimedia
A Lugmayr, S Niiranen, and S Kalli
ISBN 3-387-20843-7
Adaptive Antenna Arrays
Trends and Applications
S Chandran (Ed.) ISBN 3-540-20199-8
Digital Signal Processing
with Field Programmable Gate Arrays
U Meyer-Baese ISBN 3-540-21119-5
Neuro-Fuzzy and Fuzzy Neural Applications
in Telecommunications
P Stavroulakis (Ed.) ISBN 3-540-40759-6
SDMA for Multipath Wireless Channels
Limiting Characteristics
and Stochastic Models
I.P Kovalyov ISBN 3-540-40225-X
Processing of SAR Data
Fundamentals, Signal Processing, Interferometry
A Hein ISBN 3-540-05043-4
Chaos-Based Digital Communication Systems
Operating Principles, Analysis Methods, and Performance Evaluation
F.C.M Lau and C.K Tse ISBN 3-540-00602-8
Adaptive Signal Processing
Application to Real-World Problems
J Benesty and Y Huang (Eds.) ISBN 3-540-00051-8
Multimedia Information Retrieval and Management
Technological Fundamentals and Applications
D Feng, W.C Siu, and H.J Zhang (Eds.) ISBN 3-540-00244-8
Structured Cable Systems
A.B Semenov, S.K Strizhakov, and I.R Suncheley ISBN 3-540-43000-8
Advanced Theory of Signal Detection
Weak Signal Detection in Generalized Observations
I Song, J Bae, and S.Y Kim ISBN 3-540-43064-4
Wireless Internet Access over GSM and UMTS
M Taferner and E Bonek ISBN 3-540-42551-9
The Variational Bayes Method
in Signal Processing
V. Šmídl and A. Quinn ISBN 3-540-28819-8
Dr Václav Šmídl
Institute of Information Theory and Automation
Academy of Sciences of the Czech Republic, Department of Adaptive Systems
PO Box 18, 18208 Praha 8, Czech Republic
E-mail: smidl@utia.cas.cz
Dr Anthony Quinn
Department of Electronic and Electrical Engineering
University of Dublin, Trinity College
Dublin 2, Ireland
E-mail: aquinn@tcd.ie
ISBN-10 3-540-28819-8 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-28819-0 Springer Berlin Heidelberg New York
Library of Congress Control Number: 2005934475
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media.
Typesetting and production: SPI Publisher Services
Cover design: design & production GmbH, Heidelberg
Printed on acid-free paper SPIN: 11370918 62/3100/SPI - 5 4 3 2 1 0
Do mo Thuismitheoirí
A.Q.
Preface

Gaussian linear modelling cannot address current signal processing demands. In modern contexts, such as Independent Component Analysis (ICA), progress has been made specifically by imposing non-Gaussian and/or non-linear assumptions. Hence, standard Wiener and Kalman theories no longer enjoy their traditional hegemony in the field, revealing the standard computational engines for these problems. In their place, diverse principles have been explored, leading to a consequent diversity in the implied computational algorithms. The traditional on-line and data-intensive preoccupations of signal processing continue to demand that these algorithms be tractable.
Increasingly, full probability modelling (the so-called Bayesian approach)—or partial probability modelling using the likelihood function—is the pathway for design of these algorithms. However, the results are often intractable, and so the area of distributional approximation is of increasing relevance in signal processing. The Expectation-Maximization (EM) algorithm and Laplace approximation, for example, are standard approaches to handling difficult models, but these approximations (certainty equivalence, and Gaussian, respectively) are often too drastic to handle the high-dimensional, multi-modal and/or strongly correlated problems that are encountered. Since the 1990s, stochastic simulation methods have come to dominate Bayesian signal processing. Markov Chain Monte Carlo (MCMC) sampling, and related methods, are appreciated for their ability to simulate possibly high-dimensional distributions to arbitrary levels of accuracy. More recently, the particle filtering approach has addressed on-line stochastic simulation. Nevertheless, the wider acceptability of these methods—and, to some extent, Bayesian signal processing itself—has been undermined by the large computational demands they typically make.

The Variational Bayes (VB) method of distributional approximation originates—as does the MCMC method—in statistical physics, in the area known as Mean Field Theory. Its method of approximation is easy to understand: conditional independence is enforced as a functional constraint in the approximating distribution, and the best such approximation is found by minimization of a Kullback-Leibler divergence (KLD). The exact—but intractable—multivariate distribution is therefore factorized into a product of tractable marginal distributions, the so-called VB-marginals. This straightforward proposal for approximating a distribution enjoys certain optimality properties. What is of more pragmatic concern to the signal processing community, however, is that the VB-approximation conveniently addresses the following key tasks:

1. The inference is focused (or, more formally, marginalized) onto selected subsets of parameters of interest in the model: this one-shot (i.e. off-line) use of the VB method can replace numerically intensive marginalization strategies based, for example, on stochastic sampling.

2. Parameter inferences can be arranged to have an invariant functional form when updated in the light of incoming data: this leads to feasible on-line tracking algorithms involving the update of fixed- and finite-dimensional statistics. In the language of the Bayesian, conjugacy can be achieved under the VB-approximation. There is no reliance on propagating certainty equivalents, stochastically-generated particles, etc.
Unusually for a modern Bayesian approach, then, no stochastic sampling is required for the VB method. In its place, the shaping parameters of the VB-marginals are found by iterating a set of implicit equations to convergence. This Iterative Variational Bayes (IVB) algorithm enjoys a decisive advantage over the EM algorithm, whose computational flow is similar: by design, the VB method yields distributions in place of the point estimates emerging from the EM algorithm. Hence, in common with all Bayesian approaches, the VB method provides, for example, measures of uncertainty for any point estimates of interest, inferences of model order/rank, etc.
The machine learning community has led the way in exploiting the VB method in model-based inference, notably in inference for graphical models. It is timely, however, to examine the VB method in the context of signal processing where, to date, little work has been reported. In this book, at all times, we are concerned with the way in which the VB method can lead to the design of tractable computational schemes for tasks such as (i) dimensionality reduction, (ii) factor analysis for medical imagery, (iii) on-line filtering of outliers and other non-Gaussian noise processes, (iv) tracking of non-stationary processes, etc. Our aim in presenting these VB algorithms is not just to reveal new flows-of-control for these problems, but—perhaps more significantly—to understand the strengths and weaknesses of the VB-approximation in model-based signal processing. In this way, we hope to dismantle the current psychology of dependence in the Bayesian signal processing community on stochastic sampling methods. Without doubt, the ability to model complex problems to arbitrary levels of accuracy will ensure that stochastic sampling methods—such as MCMC—will remain the gold standard for distributional approximation. Notwithstanding this, our purpose here is to show that the VB method of approximation can yield highly effective Bayesian inference algorithms at low computational cost. In showing this, we hope that Bayesian methods might become accessible to a much broader constituency than has been achieved to date.
Contents

1 Introduction 1
1.1 How to be a Bayesian 1
1.2 The Variational Bayes (VB) Method 2
1.3 A First Example of the VB Method: Scalar Additive Decomposition 3
1.3.1 A First Choice of Prior 3
1.3.2 The Prior Choice Revisited 4
1.4 The VB Method in its Context 6
1.5 VB as a Distributional Approximation 8
1.6 Layout of the Work 10
1.7 Acknowledgement 11
2 Bayesian Theory 13
2.1 Bayesian Benefits 13
2.1.1 Off-line vs On-line Parametric Inference 14
2.2 Bayesian Parametric Inference: the Off-Line Case 15
2.2.1 The Subjective Philosophy 16
2.2.2 Posterior Inferences and Decisions 16
2.2.3 Prior Elicitation 18
2.2.3.1 Conjugate priors 19
2.3 Bayesian Parametric Inference: the On-line Case 19
2.3.1 Time-invariant Parameterization 20
2.3.2 Time-variant Parameterization 20
2.3.3 Prediction 22
2.4 Summary 22
3 Off-line Distributional Approximations and the Variational Bayes Method 25
3.1 Distributional Approximation 25
3.2 How to Choose a Distributional Approximation 26
3.2.1 Distributional Approximation as an Optimization Problem 26
3.2.2 The Bayesian Approach to Distributional Approximation 27
3.3 The Variational Bayes (VB) Method of Distributional Approximation 28
3.3.1 The VB Theorem 28
3.3.2 The VB Method of Approximation as an Operator 32
3.3.3 The VB Method 33
3.3.4 The VB Method for Scalar Additive Decomposition 37
3.4 VB-related Distributional Approximations 39
3.4.1 Optimization with Minimum-Risk KL Divergence 39
3.4.2 Fixed-form (FF) Approximation 40
3.4.3 Restricted VB (RVB) Approximation 40
3.4.3.1 Adaptation of the VB method for the RVB Approximation 41
3.4.3.2 The Quasi-Bayes (QB) Approximation 42
3.4.4 The Expectation-Maximization (EM) Algorithm 44
3.5 Other Deterministic Distributional Approximations 45
3.5.1 The Certainty Equivalence Approximation 45
3.5.2 The Laplace Approximation 45
3.5.3 The Maximum Entropy (MaxEnt) Approximation 45
3.6 Stochastic Distributional Approximations 46
3.6.1 Distributional Estimation 47
3.7 Example: Scalar Multiplicative Decomposition 48
3.7.1 Classical Modelling 48
3.7.2 The Bayesian Formulation 48
3.7.3 Full Bayesian Solution 49
3.7.4 The Variational Bayes (VB) Approximation 51
3.7.5 Comparison with Other Techniques 54
3.8 Conclusion 56
4 Principal Component Analysis and Matrix Decompositions 57
4.1 Probabilistic Principal Component Analysis (PPCA) 58
4.1.1 Maximum Likelihood (ML) Estimation for the PPCA Model 59
4.1.2 Marginal Likelihood Inference of A 61
4.1.3 Exact Bayesian Analysis 61
4.1.4 The Laplace Approximation 62
4.2 The Variational Bayes (VB) Method for the PPCA Model 62
4.3 Orthogonal Variational PCA (OVPCA) 69
4.3.1 The Orthogonal PPCA Model 70
4.3.2 The VB Method for the Orthogonal PPCA Model 70
4.3.3 Inference of Rank 77
4.3.4 Moments of the Model Parameters 78
4.4 Simulation Studies 79
4.4.1 Convergence to Orthogonal Solutions: VPCA vs FVPCA 79
4.4.2 Local Minima in FVPCA and OVPCA 82
4.4.3 Comparison of Methods for Inference of Rank 83
4.5 Application: Inference of Rank in a Medical Image Sequence 85
4.6 Conclusion 87
5 Functional Analysis of Medical Image Sequences 89
5.1 A Physical Model for Medical Image Sequences 90
5.1.1 Classical Inference of the Physiological Model 92
5.2 The FAMIS Observation Model 92
5.2.1 Bayesian Inference of FAMIS and Related Models 94
5.3 The VB Method for the FAMIS Model 94
5.4 The VB Method for FAMIS: Alternative Priors 99
5.5 Analysis of Clinical Data Using the FAMIS Model 102
5.6 Conclusion 107
6 On-line Inference of Time-Invariant Parameters 109
6.1 Recursive Inference 110
6.2 Bayesian Recursive Inference 110
6.2.1 The Dynamic Exponential Family (DEF) 112
6.2.2 Example: The AutoRegressive (AR) Model 114
6.2.3 Recursive Inference of non-DEF models 117
6.3 The VB Approximation in On-Line Scenarios 118
6.3.1 Scenario I: VB-Marginalization for Conjugate Updates 118
6.3.2 Scenario II: The VB Method in One-Step Approximation 121
6.3.3 Scenario III: Achieving Conjugacy in non-DEF Models via the VB Approximation 123
6.3.4 The VB Method in the On-Line Scenarios 126
6.4 Related Distributional Approximations 127
6.4.1 The Quasi-Bayes (QB) Approximation in On-Line Scenarios 128
6.4.2 Global Approximation via the Geometric Approach 128
6.4.3 One-step Fixed-Form (FF) Approximation 129
6.5 On-line Inference of a Mixture of AutoRegressive (AR) Models 130
6.5.1 The VB Method for AR Mixtures 130
6.5.2 Related Distributional Approximations for AR Mixtures 133
6.5.2.1 The Quasi-Bayes (QB) Approximation 133
6.5.2.2 One-step Fixed-Form (FF) Approximation 135
6.5.3 Simulation Study: On-line Inference of a Static Mixture 135
6.5.3.1 Inference of a Many-Component Mixture 136
6.5.3.2 Inference of a Two-Component Mixture 136
6.5.4 Data-Intensive Applications of Dynamic Mixtures 139
6.5.4.1 Urban Vehicular Traffic Prediction 141
6.6 Conclusion 143
7 On-line Inference of Time-Variant Parameters 145
7.1 Exact Bayesian Filtering 145
7.2 The VB-Approximation in Bayesian Filtering 147
7.2.1 The VB method for Bayesian Filtering 149
7.3 Other Approximation Techniques for Bayesian Filtering 150
7.3.1 Restricted VB (RVB) Approximation 150
7.3.2 Particle Filtering 152
7.3.3 Stabilized Forgetting 153
7.3.3.1 The Choice of the Forgetting Factor 154
7.4 The VB-Approximation in Kalman Filtering 155
7.4.1 The VB method 156
7.4.2 Loss of Moment Information in the VB Approximation 158
7.5 VB-Filtering for the Hidden Markov Model (HMM) 158
7.5.1 Exact Bayesian filtering for known T 159
7.5.2 The VB Method for the HMM Model with Known T 160
7.5.3 The VB Method for the HMM Model with Unknown T 162
7.5.4 Other Approximate Inference Techniques 164
7.5.4.1 Particle Filtering 164
7.5.4.2 Certainty Equivalence Approach 165
7.5.5 Simulation Study: Inference of Soft Bits 166
7.6 The VB-Approximation for an Unknown Forgetting Factor 168
7.6.1 Inference of a Univariate AR Model with Time-Variant Parameters 169
7.6.2 Simulation Study: Non-stationary AR Model Inference via Unknown Forgetting 173
7.6.2.1 Inference of an AR Process with Switching Parameters 173
7.6.2.2 Initialization of Inference for a Stationary AR Process 174
7.7 Conclusion 176
8 The Mixture-based Extension of the AR Model (MEAR) 179
8.1 The Extended AR (EAR) Model 179
8.1.1 Bayesian Inference of the EAR Model 181
8.1.2 Computational Issues 182
8.2 The EAR Model with Unknown Transformation: the MEAR Model 182
8.3 The VB Method for the MEAR Model 183
8.4 Related Distributional Approximations for MEAR 186
8.4.1 The Quasi-Bayes (QB) Approximation 186
8.4.2 The Viterbi-Like (VL) Approximation 187
8.5 Computational Issues 188
8.6 The MEAR Model with Time-Variant Parameters 191
8.7 Application: Inference of an AR Model Robust to Outliers 192
8.7.1 Design of the Filter-bank 192
8.7.2 Simulation Study 193
8.8 Application: Inference of an AR Model Robust to Burst Noise 196
8.8.1 Design of the Filter-Bank 196
8.8.2 Simulation Study 197
8.8.3 Application in Speech Reconstruction 201
8.9 Conclusion 201
9 Concluding Remarks 205
9.1 The VB Method 205
9.2 Contributions of the Work 206
9.3 Current Issues 206
9.4 Future Prospects for the VB Method 207
Required Probability Distributions 209
A.1 Multivariate Normal distribution 209
A.2 Matrix Normal distribution 209
A.3 Normal-inverse-Wishart (N iW A,Ω) Distribution 210
A.4 Truncated Normal Distribution 211
A.5 Gamma Distribution 212
A.6 Von Mises-Fisher Matrix distribution 212
A.6.1 Definition 213
A.6.2 First Moment 213
A.6.3 Second Moment and Uncertainty Bounds 214
A.7 Multinomial Distribution 215
A.8 Dirichlet Distribution 215
A.9 Truncated Exponential Distribution 216
References 217
Index 225
Notational Conventions

a_i, a_i,D    ith column of matrix A, A_D, respectively.
a_i,j, a_i,j,D    (i, j)th element of matrix A, A_D, respectively, i = 1,...,n, j = 1,...,m.
b_i, b_i,D    ith element of vector b, b_D, respectively.
A = diag(a)    Diagonal matrix with the elements of a ∈ R^q on its diagonal.
a = diag(A)    Diagonal vector of given matrix A (the context will distinguish this from a scalar, a (see 2nd entry, above)).
A_(r)    Matrix A with restricted rank, rank(A) = r ≤ min(n, m).
I_r ∈ R^{r×r}    Square identity matrix.
1_{p,q}, 0_{p,q}    Matrix of size p × q with all elements equal to one, zero, respectively.
a = vec(A)    Operator restructuring the elements of A = [a_1, ..., a_n] into a vector.
SVD    Singular Value Decomposition of matrix A ∈ R^{n×m}. In this monograph, the SVD is expressed in the 'economic' form.
{A}_c    Set of objects A with cardinality c.
A_(i)    ith element of set {A}_c, i = 1,...,c.
Analysis

χ_X(·)    Indicator (characteristic) function of set X.
erf(x)    Error function: erf(x) = (2/√π) ∫_0^x exp(−t²) dt.
ln(A), exp(A)    Natural logarithm and exponential of matrix A, respectively. Both operations are performed on the elements of the matrix (or vector).
Γ_r(p/2)    Multivariate Gamma function: Γ_r(p/2) = π^{r(r−1)/4} ∏_{j=1}^{r} Γ((p − j + 1)/2), r ≤ p.
0F1(a; AA')    Hypergeometric function, pFq(·), with p = 0, q = 1, scalar parameter a, and symmetric matrix parameter, AA'.
δ(x)    Delta function of the argument, x. If x is a continuous variable, then δ(x) is the Dirac δ-function; if x is discrete, δ(x) = 1 if x = 0, and δ(x) = 0 otherwise.
δ_p(i)    Vector [δ(i − 1), δ(i − 2), ..., δ(i − p)]', i = 1,...,p.
I_(a,b]    Interval (a, b] in R.
Probability Calculus

Pr(·)    Probability of the given argument.
f(x|θ)    Distribution of (discrete or continuous) random variable x, conditioned by known θ.
f̆(x)    Variable distribution to be optimized ('wildcard' in functional optimization).
x^[i], f^[i](x)    x and f(x) in the ith iteration of an iterative algorithm.
θ̂    Point estimate of unknown parameter θ.
E_f(x)[·]    Expected value of the argument with respect to distribution f(x).
x̄, x̲    Upper bound, lower bound, respectively, on the range of random variable x.
N_X(M, Σ_p ⊗ Σ_n)    Matrix Normal distribution of X with mean value, M, and covariance matrices, Σ_p and Σ_n.
tN_x(μ, r; X)    Truncated scalar Normal distribution of x, of type N(μ, r), confined to support set X ⊂ R.
M_X(F)    Von Mises-Fisher matrix distribution of X with matrix parameter, F.
G_x(α, β)    Scalar Gamma distribution of x with parameters, α and β.
U_x(X)    Scalar Uniform distribution of x on the support set X ⊂ R.
List of Acronyms
AR AutoRegressive (model, process)
ARD Automatic Rank Determination (property)
CDEF Conjugate (parameter) distribution to a DEF (observation) model
DEF Dynamic Exponential Family
DEFS Dynamic Exponential Family with Separable parameters
DEFH Dynamic Exponential Family with Hidden variables
EAR Extended AutoRegressive (model, process)
FAMIS Functional Analysis for Medical Image Sequences (model)
FVPCA Fast Variational Principal Component Analysis (algorithm)
HPD Highest Posterior Density (region)
ICA Independent Component Analysis
IVB Iterative Variational Bayes (algorithm)
KLD Kullback-Leibler Divergence
MCMC Markov Chain Monte Carlo
MEAR Mixture-based Extension of the AutoRegressive model
OVPCA Orthogonal Variational Principal Component Analysis
PCA Principal Component Analysis
PPCA Probabilistic Principal Component Analysis
RLS Recursive Least Squares
RVB Restricted Variational Bayes
1 Introduction
1.1 How to be a Bayesian
In signal processing, as in all quantitative sciences, we are concerned with data, D, and how we can learn about the system or source which generated D. We will often refer to learning as inference. In this book, we will model the data parametrically, so that a set, θ, of unknown parameters describes the data-generating system. In deterministic problems, knowledge of θ determines D under some notional rule, D = g(θ). This accounts for very few of the data contexts in which we must work. In particular, when D is information-bearing, then we must model the uncertainty (sometimes called the randomness) of the process. The defining characteristic of Bayesian methods is that we use probabilities to quantify our beliefs amid uncertainty, and the calculus of probability to manipulate these quantitative beliefs [1–3]. Hence, our beliefs about the data are completely expressed via the parametric probabilistic observation model, f(D|θ). In this way, knowledge of θ determines our beliefs about D, not D themselves.
In practice, the result of an observational experiment is that we are given D, and our problem is to use them to learn about the system—summarized by the unknown parameters, θ—which generated them. This learning amid uncertainty is known as inductive inference [3], and it is solved by constructing the distribution f(θ|D), namely, the distribution which quantifies our a posteriori beliefs about the system, given a specific set of data, D. The simple prescription of Bayes' rule solves the implied inverse problem [4], allowing us to reverse the order of the conditioning in the observation model, f(D|θ):

f(θ|D) ∝ f(D|θ) f(θ). (1.1)
Bayes' rule specifies how our prior beliefs, quantified by the prior distribution, f(θ), are updated in the light of D. Hence, a Bayesian treatment requires prior quantification of our beliefs about the unknown parameters, θ, whether or not θ is by nature fixed or randomly realized. The signal processing community, in particular, has been resistant to the philosophy of strong Bayesian inference [3], which assigns probabilities to fixed, as well as random, unknown quantities. Hence, they relegate Bayesian methods to inference problems involving only random quantities [5, 6]. This book adheres to the strong Bayesian philosophy.
Tractability is a primary concern to any signal processing expert seeking to develop a parametric inference algorithm, both in the off-line case and, particularly, on-line. The Bayesian approach provides f(θ|D) as the complete inference of θ, and this must be manipulated in order to solve problems of interest. For example, we may wish to concentrate the inference onto a subset, θ1, by marginalizing over their complement, θ2:

f(θ1|D) ∝ ∫_Θ2* f(θ1, θ2|D) dθ2. (1.2)
A decision, such as a point estimate, may be required. The mean a posteriori estimate may then be justified:

θ̂1 = ∫_Θ1* θ1 f(θ1|D) dθ1. (1.3)
Finally, we might wish to select a model from a set of candidates, {M1, ..., Mc}, via computation of the marginal probability of D with respect to each candidate:

f(Ml|D) ∝ Pr[Ml] ∫_Θl* f(D|θl, Ml) f(θl|Ml) dθl. (1.4)
Here, θl ∈ Θl* are the parameters of the competing models, and Pr[Ml] is the necessary prior on those models.
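The computations in (1.1)–(1.3) can be sketched numerically on a discretized parameter grid. The model below (Normal data with unknown mean and precision, under nearly flat priors) is an illustrative assumption, not an example taken from the text.

```python
# A numerical sketch of (1.1)-(1.3) on a parameter grid: Bayes' rule,
# marginalization over a nuisance parameter, and a mean a posteriori
# point estimate. Model and prior settings are illustrative only.
import numpy as np

D = np.array([0.9, 1.1, 1.3, 0.7, 1.0])   # observed data (illustrative)
m = np.linspace(-2.0, 4.0, 601)           # grid over the unknown mean, theta_1
w = np.linspace(0.01, 100.0, 1000)        # grid over the unknown precision, theta_2
dm, dw = m[1] - m[0], w[1] - w[0]
M, W = np.meshgrid(m, w, indexing="ij")

# Observation model: D_i ~ N(m, w^-1), i.i.d.
loglik = 0.5 * len(D) * np.log(W) - 0.5 * W * ((D[:, None, None] - M) ** 2).sum(axis=0)

# Nearly non-informative priors: m ~ N(0, 10^2), w ~ Exponential(0.01)
logprior = -0.5 * (M / 10.0) ** 2 - 0.01 * W

# Bayes' rule (1.1): joint posterior f(m, w | D) on the grid
logpost = loglik + logprior
post = np.exp(logpost - logpost.max())
post /= post.sum() * dm * dw              # normalize the posterior on the grid

# Marginalization (1.2): integrate out the nuisance precision, w
post_m = post.sum(axis=1) * dw

# Mean a posteriori point estimate (1.3) of m
m_hat = (m * post_m).sum() * dm
print(round(m_hat, 2))                    # -> 1.0
```

With nearly flat priors, the mean a posteriori estimate coincides, to grid accuracy, with the sample mean of the data, here 1.0.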
1.2 The Variational Bayes (VB) Method
The integrations required in (1.2)–(1.4) will often present computational burdens that compromise the tractability of the signal processing algorithm. In Chapter 3, we will review some of the approximations which can help to address these problems, but the aim of this book is to advocate the use of the Variational Bayes (VB) approximation as an effective pathway to the design of tractable signal processing algorithms for parametric inference. These VB solutions will be shown, in many cases, to be novel and attractive alternatives to currently available Bayesian inference algorithms.
The central idea of the VB method is to approximate f(θ|D), ab initio, in terms of a conditionally independent factorization:

f̃(θ|D) = f̃(θ1|D) f̃(θ2|D). (1.5)

The best such approximation is found by minimizing a measure of divergence from f̃(θ|D) to f(θ|D), namely, a particular Kullback-Leibler Divergence (KLD), which we will call KLD_VB in Section 3.2.2:

KLD_VB = ∫ f̃(θ|D) ln [ f̃(θ|D) / f(θ|D) ] dθ. (1.6)
In practical terms, functional optimization of (1.6) yields a known functional form for f̃(θ1|D) and f̃(θ2|D), which will be known as the VB-marginals. However, the shaping parameters associated with each of these VB-marginals are expressed via particular moments of the others. Therefore, the approximation is possible if all moments required in the shaping parameters can be evaluated. Mutual interaction of the VB-marginals via their moments presents an obstacle to evaluation of their shaping parameters, since a closed-form solution is available only for a limited number of problems. However, a generic iterative algorithm for evaluation of the VB-moments and shaping parameters is available for tractable VB-marginals (i.e. marginals whose moments can be evaluated). This algorithm—reminiscent of the classical Expectation-Maximization (EM) algorithm—will be called the Iterative Variational Bayes (IVB) algorithm in this book. Hence, the computational burden of the VB-approximation is confined to iterations of the IVB algorithm. The result is a set of moments and shaping parameters, defining the VB-approximation (1.5).
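The divergence in (1.6) can be evaluated numerically for a simple case. In the sketch below, f is a correlated bivariate Normal, and the factorized f̃ is taken, purely for illustration, to be the product of the exact marginals: one admissible member of the class (1.5), though not necessarily the KLD_VB-optimal member. The correlation value and grid settings are assumptions for the example.

```python
# Numerical evaluation of the KLD in (1.6) on a grid, for a correlated
# bivariate Normal f and a factorized approximation f_tilde (here, the
# product of the exact marginals, chosen for illustration only).
import numpy as np

x = np.linspace(-5.0, 5.0, 401)
dx = x[1] - x[0]
X1, X2 = np.meshgrid(x, x, indexing="ij")

rho = 0.8  # correlation that the factorized approximation must discard
# Exact joint: zero-mean bivariate Normal with unit variances
logf = -(X1 ** 2 - 2 * rho * X1 * X2 + X2 ** 2) / (2 * (1 - rho ** 2))
f = np.exp(logf)
f /= f.sum() * dx * dx                  # normalize on the grid

f1 = f.sum(axis=1) * dx                 # exact marginal of theta_1
f2 = f.sum(axis=0) * dx                 # exact marginal of theta_2
f_tilde = np.outer(f1, f2)              # a conditionally independent product, cf. (1.5)

# The KLD of (1.6), evaluated from f_tilde to f
kld = np.sum(f_tilde * np.log(f_tilde / f)) * dx * dx
print(round(kld, 2))                    # -> 1.27
```

For rho = 0.8 the divergence is about 1.27 nats; it shrinks to zero as rho → 0, when the factorization becomes exact.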
1.3 A First Example of the VB Method: Scalar Additive Decomposition

Consider the scalar additive decomposition

d = m + e, (1.7)

where the noise, e, is zero-mean Normal with precision ω, so that the observation model is

f(d|m, ω) = N(m, ω⁻¹). (1.8)

The task is to infer the two unknown parameters—i.e. the mean, m, and precision, ω—of the Normal distribution, N, given just one scalar data point, d. This constitutes a stressful regime for inference. In order to 'be a Bayesian', we assign a prior distribution to m and ω. Given the poverty of data, we can expect our choice to have some influence on our posterior inference. We will now consider two choices for prior elicitation.
1.3.1 A First Choice of Prior
The following choice seems reasonable:

f(m|φ) = N(0, φ⁻¹), (1.9)
f(ω|α, β) = G(α, β). (1.10)

The prior precision, φ > 0, may be chosen small, so that (1.9) becomes flatter. The Gamma distribution, G, in (1.10) was chosen to reflect the positivity of ω. Its parameters, α > 0 and β > 0, may again be chosen to yield a non-informative prior. For α → 0 and β → 0, (1.10) approaches Jeffreys' improper prior on scale parameters, 1/ω [7].
Joint inference of the Normal mean and precision, m and ω respectively, is well studied in the literature [8, 9]. From Bayes' rule, the posterior distribution is

f(m, ω|d, α, β, φ) ∝ f(d|m, ω) f(m|φ) f(ω|α, β), (1.11)

where the required distributions are summarized in Appendices A.2 and A.5 respectively. Even in this simple case, evaluation of the marginal distribution of the mean, m, i.e. f(m|d, α, β, φ), is not tractable. Hence, we seek the best approximation in the class of conditionally independent posteriors on m and ω, by minimizing KLD_VB (1.6), this being the VB-approximation. The solution can be found in the following form:
f̃(m|d, α, β, φ) = N((ω̂ + φ)⁻¹ ω̂ d, (ω̂ + φ)⁻¹), (1.12)

where ω̂ denotes the VB-moment of ω under the companion Gamma VB-marginal, f̃(ω|d, α, β, φ) (1.13).
The VB-moments (1.14) fully determine the VB-marginals, (1.12) and (1.13). It can be shown that this set of VB-equations (1.14) has three possible solutions (being roots of a 3rd-order polynomial), only one of which satisfies ω̂ > 0. Hence, the optimized KLD_VB has three 'critical' points for this model. The exact distribution and its VB-approximation are compared in Fig. 1.1.
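The IVB iteration for this example can be sketched numerically. The first update below is the Normal VB-marginal (1.12); the Gamma update for the VB-moment of ω follows from the standard VB treatment of this Normal model, and is stated here as an assumption, since (1.13)–(1.14) are not reproduced above. The prior values are illustrative, chosen mildly informative so that the iteration converges quickly.

```python
# Iterative Variational Bayes (IVB) for the scalar additive
# decomposition under priors (1.9)-(1.10). The shaping parameters of
# the two VB-marginals are mutually coupled through their moments, so
# they are iterated to convergence. All numerical values are
# illustrative assumptions, not taken from the text.
d = 2.0                    # the single observed datum
alpha, beta = 1.0, 1.0     # Gamma prior parameters for omega
phi = 0.1                  # Normal prior precision for m

omega_hat = 1.0            # initial VB-moment of the precision, omega
for _ in range(100):
    # Normal VB-marginal of m, cf. (1.12): mean m_hat, variance var_m
    var_m = 1.0 / (omega_hat + phi)
    m_hat = omega_hat * d * var_m
    # Gamma VB-marginal of omega: its moment uses E[(d - m)^2] under f~(m)
    omega_new = (alpha + 0.5) / (beta + 0.5 * ((d - m_hat) ** 2 + var_m))
    if abs(omega_new - omega_hat) < 1e-12:   # VB-equations have converged
        break
    omega_hat = omega_new

print(round(m_hat, 3), round(omega_hat, 3))  # -> 1.823 1.028
```

Note that the point d = 2 is slightly shrunk toward the prior mean of m, and that the converged VB-moments jointly satisfy both implicit equations, as required.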
1.3.2 The Prior Choice Revisited
For comparison, we now consider a different choice of the priors:

f(m|γ, ω) = N(0, (γω)⁻¹), (1.15)
f(ω|α, β) = G(α, β). (1.16)

Here, (1.16) is the same as (1.10), but (1.15) has been parameterized differently from (1.9). It still expresses our lack of knowledge of the polarity of m, and it still penalizes extreme values of m if γ → 0. Hence, both prior structures, (1.9) and (1.15), can express non-informative prior knowledge. However, the precision parameter, γω, of m is now chosen proportional to the precision parameter, ω, of the noise (1.8).

Fig. 1.1 The VB-approximation, (1.12) and (1.13), for the scalar additive decomposition (dash-dotted contours). Full contour lines denote the exact posterior distribution (1.11).
From Bayes' rule, the posterior distribution is now

f(m, ω|d, α, β, γ) ∝ f(d|m, ω) f(m|γ, ω) f(ω|α, β). (1.17)

In this case, the VB-marginals, (1.19) and (1.20), are again of Normal and Gamma type, and their shaping parameters (1.22) are available in closed form. The exact and VB-approximated posterior distributions are compared in Fig. 1.2.
Remark 1.1 (Choice of priors for the VB-approximation). Even in the stressful regime of this example (one datum, two unknowns), each set of priors had a similar influence on the posterior distribution. In more realistic contexts, the distinctions will be even less, as the influence of the data—via f(D|θ) in (1.1)—begins to dominate the prior, f(θ). However, from an analytical point-of-view, the effects of the prior choice can be very different, as we have seen in this example. Recall that the moments of the exact posterior distribution were tractable in the case of the second prior (1.17), but were not tractable in the first case (1.11). This distinction carried through to the respective VB-approximations. Once again, the second set of priors implied a far simpler solution (1.22) than the first (1.14). Therefore, in this book, we will take care to design priors which can facilitate the task of VB-approximation. We will always be in a position to ensure that our choice is non-informative.
1.4 The VB Method in its Context
Fig. 1.2 The VB-approximation, (1.19) and (1.20), for the scalar additive decomposition (dash-dotted contours), using the alternative priors, (1.15) and (1.16). Full contour lines denote the exact posterior distribution (1.17).

Statistical physics has long been concerned with high-dimensional probability functions and their simplification [10]. Typically, the physicist is considering a system of many interacting particles and wishes to infer the state, θ, of this system. Boltzmann's law [11] relates the energy of the state to its probability, f(θ). If we wish to infer a sub-state, θ_i, we must evaluate the associated marginal, f(θ_i). Progress can be made by replacing the exact probability model, f(θ), with an approximation, f̃(θ). Typically, this requires us to neglect interactions in the physical system, by setting many such interactions to zero. The optimal such approximate distribution, f̃(θ), can be chosen using the variational method [12], which seeks a free-form solution within the approximating class that minimizes some measure of disparity between f(θ) and f̃(θ). Strong physical justification can be advanced for minimization of a Kullback-Leibler divergence (1.6), which is interpretable as a relative entropy. The Variational Bayes (VB) approximation is one example of such an approximation, where independence between all the θ_i is enforced (1.5). In this case, the approximating marginals depend on expectations of the remaining states. Mean Field Theory (MFT) [10] generalizes this approach, exploring many such choices for the approximating function, f̃(θ), and its disparity with respect to f(θ). Once the variational approximation has been obtained, the exact system is studied by means of this approximation [13].

The machine learning community has adopted Mean Field Theory [12] as a way to cope with problems of learning and belief propagation in complex systems such as neural networks [14–16]. Ensemble learning [17] is an example of the use of the VB-approximation in this area. Communication between the machine learning and physics communities has been enhanced by the language of graphical models [18–20]. The Expectation-Maximization (EM) algorithm [21] is another important point of tangency, and was re-derived in [22] using KLD_VB minimization. The EM algorithm has long been known in the signal processing community as a means of finding the Maximum Likelihood (ML) solution in high-dimensional problems—such as image segmentation—involving hidden variables. Replacement of the EM equations with Variational EM (i.e. IVB) [23] equations allows distributional approximations to be used in place of point estimates.
In signal processing, the VB method has proved to be of importance in addressing problems of model structure inference, such as the inference of rank in Principal Component Analysis (PCA) [24] and Factor Analysis [20, 25], and in the inference of the number of components in a mixture [26]. It has been used for identification of non-Gaussian AutoRegressive (AR) models [27, 28], for unsupervised blind source separation [29], and for pattern recognition of hand-written characters [15].
1.5 VB as a Distributional Approximation
The VB method of approximation is one of many techniques for approximation of probability functions. In the VB method, the approximating family is taken as the set of all possible distributions expressed as the product of required marginals, with the optimal such choice made by minimization of a KLD. The following are among the many other approximations, deterministic and stochastic, that have been used in signal processing:
Point-based approximations: examples include the Maximum a Posteriori (MAP) and ML estimates. These are typically used as certainty equivalents [30] in decision problems, leading to highly tractable procedures. Their inability to take account of uncertainty is their principal drawback.

Local approximations: the Laplace approximation [31], for example, performs a Taylor expansion at a point, typically the ML estimate. This method is known to the signal processing community in the context of criteria for model order selection, such as the Schwarz criterion and Bayes' Information Criterion (BIC), both of which were derived using the Laplace method [31]. Their principal disadvantage is their inability to cope with multimodal probability functions.

Spline approximations: tractable approximations of the probability function may be proposed on a sufficiently refined partition of the support. The computational load associated with integrations typically increases exponentially with the number of dimensions.

MaxEnt and moment matching: the approximating distribution may be chosen to match a selected set of the moments of the true distribution [32]. Under the MaxEnt principle [33], the optimal such moment-matching distribution is the one possessing maximum entropy subject to these moment constraints.

Empirical approximations: a random sample is generated from the probability function, and the distributional approximation is simply a set of point masses placed at these independent, identically-distributed (i.i.d.) sampling points. The key technical challenge is efficient generation of i.i.d. samples from the true distribution. In recent years, stochastic sampling techniques [34], particularly the class known as Markov Chain Monte Carlo (MCMC) methods [35], have overtaken deterministic methods as the gold standard for distributional approximation. They can yield approximations to an arbitrary level of accuracy, but typically incur major computational overheads. It can be instructive to examine the performance of any deterministic method, such as the VB method, in terms of the accuracy-vs-complexity trade-off achieved by these stochastic sampling techniques.
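The empirical approximation just described can be sketched in a few lines: point masses at i.i.d. samples turn expectations into sample averages, with Monte Carlo error shrinking as the sample grows. The Gaussian target and sample size below are illustrative assumptions, not taken from the text.

```python
import random

random.seed(0)

# True distribution: N(mu, sigma^2); its mean and variance are the moments
# we approximate empirically.
mu, sigma = 2.0, 0.5

# Empirical approximation: point masses at N i.i.d. samples.
samples = [random.gauss(mu, sigma) for _ in range(100_000)]

# Expectations under the empirical approximation are sample averages.
empirical_mean = sum(samples) / len(samples)
empirical_var = sum((x - empirical_mean) ** 2 for x in samples) / len(samples)

print(abs(empirical_mean - mu))        # Monte Carlo error, O(sigma / sqrt(N))
print(abs(empirical_var - sigma ** 2))
```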
The VB method has the potential to offer an excellent trade-off between computational complexity and accuracy of the distributional approximation. This is suggested in Fig. 1.3. The main computational burden associated with the VB method is the need to solve iteratively, via the IVB algorithm, a set of simultaneous equations in order to reveal the required moments of the VB-marginals. If computational cost is of concern, VB-marginals may be replaced by simpler approximations, or the evaluation of moments can be approximated, without, hopefully, diminishing the overall quality of approximation significantly. This pathway of approximation is suggested by the dotted arrow in Fig. 1.3, and will be traversed in some of the signal processing applications presented in this book. Should the need exist to increase accuracy, the VB method is situated in the flexible context of Mean Field Theory, which offers more sophisticated techniques that might be explored.
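The cost of the independence assumption underlying the VB and MFT approximations can be made concrete. For a zero-mean bivariate Gaussian with correlation ρ, it is a standard result (assumed here, not derived in this text) that the variational mean-field product recovers the correct means but the conditional, rather than marginal, variances; its KLD from the true distribution then has the closed form -(1/2) ln(1 - ρ²). A sketch:

```python
import math

def kl_meanfield_gauss(rho):
    """KL(q || p) where p = N(0, [[1, rho], [rho, 1]]) and q is the
    mean-field product whose factors carry the conditional variances,
    q_i = N(0, 1 - rho^2)."""
    # Closed form for zero-mean Gaussians:
    # KL = 0.5 * (tr(Sigma_p^{-1} Sigma_q) - d + ln det Sigma_p - ln det Sigma_q)
    d = 2
    det_p = 1 - rho ** 2
    det_q = (1 - rho ** 2) ** 2
    trace_term = 2.0  # tr(Sigma_p^{-1} Sigma_q) works out to d for this q
    return 0.5 * (trace_term - d + math.log(det_p) - math.log(det_q))

# Independence costs nothing when the target already factorizes...
assert abs(kl_meanfield_gauss(0.0)) < 1e-12
# ...and grows without bound as the neglected interaction strengthens.
assert kl_meanfield_gauss(0.9) > kl_meanfield_gauss(0.5) > 0
print(kl_meanfield_gauss(0.5))  # equals -0.5 * ln(1 - 0.25)
```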
[Figure 1.3 locates deterministic methods, the EM algorithm, Variational Bayes (IVB), mean field theory and sampling methods on the accuracy-vs-complexity plane.]
Fig. 1.3. The accuracy-vs-complexity trade-off in the VB method.
1.6 Layout of the Work
We now briefly summarize the main content of the Chapters of this book.
Chapter 2 This provides an introduction to Bayesian theory relevant for distributional approximation. We review the philosophical framework, and we introduce basic probability calculus which will be used in the remainder of the book. The important distinction between off-line and on-line inference is outlined.
Chapter 3 Here, we are concerned with the problem of distributional approximation. The VB-approximation is defined, and from it we synthesize an ergonomic procedure for deducing these VB-approximations. This is known as the VB method. Related distributional approximations are briefly reviewed and compared to the VB method. A simple inference problem, scalar multiplicative decomposition, is considered.
Chapter 4 The VB method is applied to the problem of matrix multiplicative decompositions. The VB-approximation for these models reveals interesting properties of the method, such as initialization of the Iterative VB algorithm (IVB) and the existence of local minima. These models are closely related to Principal Component Analysis (PCA), and we show that the VB inference provides solutions to problems not successfully addressed by PCA, such as the inference of rank.
Chapter 5 We use our experience from Chapter 4 to derive the VB-approximation for the inference of physiological factors in medical image sequences. The physical nature of the problem imposes additional restrictions which are successfully handled by the VB method.
Chapter 6 The VB method is explored in the context of recursive inference of signal processes. In this Chapter, we confine ourselves to time-invariant parameter models. We isolate three fundamental scenarios, each of which constitutes a recursive inference task where the VB-approximation is tractable and adds value. We apply the VB method to the recursive identification of mixtures of AR models. The practical application of this work in prediction of urban traffic flow is outlined.
Chapter 7 The time-invariant parameter assumption from Chapter 6 is relaxed. Hence, we are concerned here with Bayesian filtering. The use of the VB method in this context reveals interesting computational properties in the resulting algorithm, while also pointing to some of the difficulties which can be encountered.
Chapter 8 We address a practical signal processing task, namely, the reconstruction of AR processes corrupted by unknown transformation and noise distortions. The use of the VB method in this ambitious context requires synthesis of experience gained in Chapters 6 and 7. The resulting VB inference is shown to be successful in optimal data pre-processing tasks such as outlier removal and suppression of burst noise. An application in speech denoising is presented.
Chapter 9 We summarize the main findings of the work, and point to some interesting future prospects.
1.7 Acknowledgement
The first author acknowledges the support of Grants AV ČR 1ET 100 750 401 and MŠMT 1M6798555601.
2 Bayesian Theory
In this Chapter, we review the key identities of probability calculus relevant to Bayesian inference. We then examine three fundamental contexts in parametric modelling, namely (i) off-line inference, (ii) on-line inference of time-invariant parameters, and (iii) on-line inference of time-variant parameters. In each case, we use the Bayesian framework to derive the formal solution. Each context will be examined in detail in later Chapters.
2.1 Bayesian Benefits
A Bayesian is someone who uses only probabilities to quantify degrees of belief in an uncertain hypothesis, and uses only the rules of probability as the calculus for operating on these degrees of belief [7, 8, 36, 37]. At the very least, this approach to inductive inference is consistent, since the calculus of probability is consistent, i.e. any valid use of the rules of probability will lead to a unique conclusion. This is not true of classical approaches to inference, where degrees of belief are quantified using one of a vast range of criteria, such as relative frequency of occurrence, distance in a normed space, etc. If the Bayesian's probability model is chosen to reflect such criteria, then we might expect close correspondence between Bayesian and classical methods. However, a vital distinction remains. Since probability is a measure function on the space of possibilities, the marginalization operator (i.e. integration) is a powerful inferential tool uniquely at the service of the Bayesian. Careful comparison of Bayesian and classical solutions will reveal that the real added value of Bayesian methods derives from being able to integrate, thereby concentrating the inference onto a selected subset of quantities of interest. In this way, Bayesian methods naturally embrace the following key problems, all problematical for the non-Bayesian:
1 projection into a desired subset of the hypothesis space;
2 reduction of the number of parameters appearing in the probability function (so-called 'elimination of nuisance parameters' [38]);
3 quantification of the risk associated with a data-informed decision;
4 evaluation of expected values and moments;
5 comparison of competing model structures and penalization of complexity (Ockham's Razor) [39, 40];
6 prediction of future data.
All of these tasks require integration with respect to the probability measure on the space of possibilities. In the case of 5 above, competing model structures are measured, leading to consistent quantification of model complexity. This natural engendering of Ockham's razor is among the most powerful features of the Bayesian framework.
Why, then, are Bayesian methods still so often avoided in application contexts such as statistical signal processing? The answer is mistrust of the prior, and philosophical angst about (i) its right to exist, and (ii) its right to influence a decision or algorithm. With regard to (i), it is argued by non-Bayesians that probabilities may only be attached to objects or hypotheses that vary randomly in repeatable experiments [41]. With regard to (ii), the non-Bayesian (objectivist) perspective is that inferences should be based only on data, and never on prior knowledge. Preoccupation with these issues is to miss where the action really is: the ability to marginalize in the Bayesian framework. In our work, we will eschew detailed philosophical arguments in favour of a policy that minimizes the influence of the priors we use, and points to the practical added value over frequentist methods that arises from use of probability calculus.

2.1.1 Off-line vs On-line Parametric Inference
In an observational experiment, we may wish to infer knowledge of an unknown quantity only after all data, D, have been gathered. This batch-based inference will be called the off-line scenario, and Bayesian methods must be used to update our beliefs given no data (i.e. our prior), to beliefs given D. It is the typical situation arising in database analysis. In contrast, we may wish to interleave the process of observing data with the process of updating our beliefs. This on-line scenario is important in control and decision tasks, for example. For convenience, we refer to the independent variable indexing the occasions (temporal, spatial, etc.) when our inferences must be updated, as time, t = 0, 1, .... The incremental data observed between inference times is d_t, and the aggregate of all data observed up to and including time t is denoted by D_t. Hence:

D_t = D_{t-1} ∪ d_t,  t = 1, 2, ...,

with D_0 = {}, by definition. For convenience, we will usually assume that d_t ∈ R^{p×1}, p ∈ N+, ∀t, and so D_t can be structured into a matrix of dimension p × t, with the incremental data, d_t, as its columns:

D_t = [d_1, d_2, ..., d_t].  (2.1)
In this on-line scenario, Bayesian methods are required to update our state of knowledge conditioned by D_{t-1}, to our state of knowledge conditioned by D_t. Of course, the update is achieved using exactly the same 'inference machine', namely Bayes' rule (1.1). Indeed, one step of on-line inference is equivalent to an off-line step, with D = d_t, and with the prior at time t being conditioned on D_{t-1}. Nevertheless, it will be convenient to handle the off-line and on-line scenarios separately, and we now review the Bayesian probability calculus appropriate to each case.

2.2 Bayesian Parametric Inference: the Off-Line Case
Let the measured data be denoted by D. A parametric probabilistic model of the data is given by the probability distribution, f(D|θ), conditioned by knowledge of the parameters, θ. In this book, the notation f(·) can represent either a probability density function for continuous random variables, or a probability mass function for discrete random variables. We will refer to f(·) as a probability distribution in both cases. In this way a significant harmonization of formulas and nomenclature can be achieved. We need only keep in mind that integrations should be replaced by summations whenever the argument is discrete.¹
Our prior state of knowledge of θ is quantified by the prior distribution, f(θ). Our state of knowledge of θ after observing D is quantified by the posterior distribution, f(θ|D). These functions are related via Bayes' rule,

f(θ|D) = f(θ, D) / f(D) = f(D|θ) f(θ) / ∫_{Θ*} f(D|θ) f(θ) dθ,  (2.2)

where Θ* is the space of θ. We will refer to f(θ, D) as the joint distribution of parameters and data, or, more concisely, as the joint distribution. We will refer to f(D|θ) as the observation model. If this is viewed as a (non-measure) function of θ, it is known as the likelihood function [3, 43–45]:

ℓ(θ; D) ≡ f(D|θ).  (2.3)

ζ = f(D) is the normalizing constant, sometimes known as the partition function in the physics literature [46]:

ζ = f(D) = ∫_{Θ*} f(D|θ) f(θ) dθ.  (2.4)

Hence, Bayes' rule (2.2) may be written as

f(θ|D) ∝ f(D|θ) f(θ),  (2.5)

where ∝ means equal up to the normalizing constant, ζ. The posterior is fully determined by the product f(D|θ) f(θ), since the normalizing constant follows from the
requirement that f(θ|D) be a probability distribution, i.e. ∫_{Θ*} f(θ|D) dθ = 1. Evaluation of ζ (2.4) can be computationally expensive, or even intractable. If the integral in (2.4) does not converge, the distribution is called improper [47]. The posterior distribution with explicitly known normalization (2.5) will be called the normalized distribution. In Fig. 2.1, we represent Bayes' rule (2.2) as an operator, B, transforming the prior into the posterior, via the observation model, f(D|θ).

¹ This can also be achieved via measure theory, operating in a consistent way for both discrete and continuous distributions, with probability densities generalized in the Radon-Nikodym sense [42]. The practical effect is the same, and so we will avoid this formality.
transform-B
f (D|θ)
Fig 2.1 Bayes’ rule as an operator.
2.2.1 The Subjective Philosophy
All our beliefs about θ, and their associated quantifiers via f(θ), f(θ|D), etc., are conditioned on the parametric probability model, f(θ, D), chosen by us a priori (2.2). Its ingredients are (i) the deterministic structure relating D to an unknown parameter set, θ, i.e. the observation model f(D|θ), and (ii) a chosen measure on the space, Θ, of this parameter set, i.e. the prior measure f(θ). In this sense, Bayesian methods are born from a subjective philosophy, which conditions all inference on the prior knowledge of the observer [2, 36]. Jeffreys' notation [7], I, is used to condition all probability functions explicitly on this corpus of prior knowledge; e.g. f(θ) → f(θ|I). For convenience, we will not use this notation, nor will we forget the fact that this conditioning is always present. In model comparison (1.4), where we examine competing model assumptions, f_l(θ_l, D), l = 1, ..., c, this conditioning becomes more explicit, via the indicator variable or pointer, l ∈ {1, 2, ..., c}, but once again we will suppress the implied Jeffreys' notation.
2.2.2 Posterior Inferences and Decisions
The task of evaluating the full posterior distribution (2.5) will be called parameter inference in this book. We favour this phrase over the alternative, density estimation, used in some decision theory texts [48]. The full posterior distribution is a complete description of our uncertainty about the parameters of the observation model (2.3), given prior knowledge, f(θ), and all available data, D. For many practical tasks, we need to derive conditional and marginal distributions of model parameters, and their moments. Consider the (vector of) model parameters to be partitioned into two sub-vectors, θ = [θ_1, θ_2]. The marginal distribution of θ_1 is obtained by integrating out θ_2:

f(θ_1|D) = ∫ f(θ_1, θ_2|D) dθ_2.  (2.6)

Fig. 2.2. The marginalization operator.

In Fig. 2.2, we represent (2.6) as an operator. This graphical representation will be convenient in later Chapters.
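Marginalization (2.6) and posterior expectation (2.7) reduce to sums over a table when the posterior is discrete. The probability values below are illustrative, not drawn from any model in the text.

```python
# Marginalization and posterior expectation over a discrete two-parameter
# posterior f(theta1, theta2 | D), stored as a table of probabilities.
f_joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

# f(theta1 | D) = sum over theta2 of f(theta1, theta2 | D)
f_theta1 = {}
for (t1, t2), p in f_joint.items():
    f_theta1[t1] = f_theta1.get(t1, 0.0) + p

# E[g(theta1)] with g the identity: the posterior mean of theta1.
mean_theta1 = sum(t1 * p for t1, p in f_theta1.items())

print(f_theta1)      # theta1 = 0 carries mass 0.3, theta1 = 1 carries 0.7
print(mean_theta1)   # 0.7, up to floating-point rounding
```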
The moments of the posterior distribution, i.e. the expected or mean value of known functions, g(θ), of the parameter, will be denoted by

E_{f(θ|D)}[g(θ)] = ∫_{Θ*} g(θ) f(θ|D) dθ.  (2.7)

In general, we will use the notation ĝ(θ) to refer to a posterior point estimate of g(θ). Hence, for the choice (2.7), we have

ĝ(θ) = E_{f(θ|D)}[g(θ)].  (2.8)

Other point estimates may be designed in a decision-theoretic setting, by minimizing the posterior expected loss. The ML estimate,

θ̂_ML = arg max_{θ∈Θ*} f(D|θ),

is the workhorse of classical inference, since it avoids the issue of defining a prior over the space of possibilities. In particular, it is the dominant tool for probabilistic methods in signal processing [5, 53, 54]. Consider the special case of an additive Gaussian noise model for vector data, D = d ∈ R^p, with
d = s(θ) + e,
e ∼ N (0, Σ) ,
where Σ is known, and s(θ) is the (non-linearly parameterized) signal model. In this case, θ̂_ML = θ̂_LS, the traditional non-linear, weighted Least-Squares (LS) estimate [55] of θ. From the Bayesian perspective, these classical estimators, θ̂_ML and θ̂_LS, can be justified only to the extent that a uniform prior over Θ* might be justified. When Θ* has infinite Lebesgue measure, this prior is improper, leading to technical and philosophical difficulties [3, 8]. In this book, it is the strongly Bayesian choice, ĝ(θ) = E_{f(θ|D)}[g(θ)] (2.8), which predominates. Hence, the notation ĝ ≡ ĝ(θ) will always denote the posterior mean of g(θ), unless explicitly stated otherwise.
As an alternative to point estimation, the Bayesian may choose to describe a continuous posterior distribution, f(θ|D) (2.2), in terms of a region or interval within which θ has a high probability of occurrence. These credible regions [37] replace the confidence intervals of classical inference, and have an intuitive appeal. The following special case provides a unique specification, and will be used in this book.

Definition 2.1 (Highest Posterior Density (HPD) Region). R ⊂ Θ* is the 100(1 − α)% HPD region of (continuous) distribution, f(θ|D), where α ∈ (0, 1), if (i) ∫_R f(θ|D) dθ = 1 − α, and if (ii) almost surely (a.s.) for any θ_1 ∈ R and θ_2 ∉ R, then f(θ_1|D) ≥ f(θ_2|D).
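Definition 2.1 suggests a simple construction on a discretized posterior: collect grid cells in order of decreasing density until probability 1 − α is accumulated. A sketch, with an illustrative Gaussian posterior (the distribution and grid are assumptions made for this example only):

```python
import math

# HPD region on a grid: take cells from the highest density downwards until
# mass 1 - alpha is reached.  Illustrative posterior: N(0.5, 0.2^2),
# discretized and renormalized on [-4, 4].
step = 0.01
grid = [i * step for i in range(-400, 401)]
dens = [math.exp(-0.5 * ((x - 0.5) / 0.2) ** 2) for x in grid]
total = sum(dens)
post = [d / total for d in dens]                      # cell probabilities

def hpd(grid, post, alpha):
    cells = sorted(zip(post, grid), reverse=True)     # highest density first
    mass, region = 0.0, []
    for p, x in cells:
        region.append(x)
        mass += p
        if mass >= 1.0 - alpha:
            break
    return min(region), max(region)

lo, hi = hpd(grid, post, alpha=0.05)
# For this Gaussian, the exact 95% HPD region is 0.5 +/- 1.96 * 0.2,
# roughly (0.11, 0.89); the grid answer agrees to cell resolution.
print(lo, hi)
```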
2.2.3 Prior Elicitation
The prior distribution (2.2) required by Bayes' rule is a function that must be elicited by the designer of the model. It is an important part of the inference problem, and can significantly influence posterior inferences and decisions (Section 2.2.2). General methods for prior elicitation have been considered extensively in the literature [7, 8, 37, 56], as has the problem of choosing priors for specific signal models in Bayesian signal processing [3, 35, 57]. In this book, we are concerned with the practical impact of prior choices on the inference algorithms which we develop. The prior distribution will be used in the following ways:

1 To supplement the data, D, in order to obtain a reliable posterior estimate, in cases where there are insufficient data and/or a poorly defined model. This will be called regularization (via the prior);
2 To impose various restrictions on the parameter θ, reflecting physical constraints such as positivity. Note, from (2.2), that if the prior distribution on a subset of the parameter support, Θ*, is zero, then the posterior distribution will also be zero on this subset;
3 To express prior ignorance about θ. If the data are assumed to be informative enough, we prefer to choose a non-informative prior (i.e. a prior with minimal impact on the posterior distribution). Philosophical and analytical challenges are encountered in the design of non-informative priors, as discussed, for example, in [7, 46].
In this book, we will typically choose our prior from a family of distributions providing analytical tractability during the Bayes update (Fig. 2.1). Notably, we will work with conjugate priors, as defined in the next Section. In such cases, we will design our non-informative prior by choosing its parameters to have minimal impact on the parameters of the posterior distribution.
2.2.3.1 Conjugate priors
In parametric inference, all distributions, f(·), have a known functional form, and are completely determined once the associated shaping parameters are known. Hence, the shaping parameters of the posterior distribution, f(θ|D, s_0) (2.5), are, in general, the complete data record, D, and any shaping parameters, s_0, of the prior, f_0(θ|s_0). Hence, a massive increase in the degrees-of-freedom of the inference may occur during the prior-to-posterior update. It will be computationally advantageous if the form of the posterior distribution is identical to the form of the prior, f_0(·|s_0), i.e. the inference is functionally invariant with respect to Bayes' rule, and is determined from a finite-dimensional vector shaping parameter:

f(θ|D, s_0) = f_0(θ|s),  s = s(D, s_0) ∈ R^q,

with s_0 forming the parameters of the prior. If s_0 are unknown, then they are called hyper-parameters [37], and are assigned a hyperprior, s_0 ∼ f(s_0). As we will see in Chapter 6, the choice of conjugate priors is of key importance in the design of tractable Bayesian recursive algorithms, since they confine the shaping parameters to R^q, and prevent a linear increase in the number of degrees-of-freedom with D_t (2.1). From now on, we will not use the subscript '0' in f_0. The fixed functional form will be implied by the conditioning on sufficient statistics s.
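A minimal sketch of conjugacy, using the standard Beta-Bernoulli pair (an illustrative choice, not a model used in the text): the shaping parameters stay in R² however much data arrives, and folding the data in one observation at a time reproduces the batch statistics.

```python
# Conjugacy sketch: for Bernoulli(theta) observations, the Beta(a, b) family
# is conjugate.  The posterior stays Beta, and the shaping parameters
# s = (a, b) live in R^2 no matter how much data arrives.
def beta_update(a, b, data):
    """One Bayes update: s0 = (a, b) -> s = (a + #successes, b + #failures)."""
    k = sum(data)
    return a + k, b + len(data) - k

a0, b0 = 1.0, 1.0                  # flat (non-informative) prior on [0, 1]
D = [1, 0, 1, 1, 0, 1, 1, 1]
a, b = beta_update(a0, b0, D)

print(a, b)                        # 7.0 3.0
print(a / (a + b))                 # posterior mean E[theta | D] = 0.7

# Processing the data one observation at a time gives the same statistics,
# which is what makes conjugate priors attractive for on-line inference.
sa, sb = a0, b0
for d in D:
    sa, sb = beta_update(sa, sb, [d])
assert (sa, sb) == (a, b)
```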
2.3 Bayesian Parametric Inference: the On-line Case
We now specialize Bayesian inference to the case of learning in tandem with data acquisition, i.e. we wish to update our inference in the light of incremental data, d_t (Section 2.1.1). We distinguish two situations, namely time-invariant and time-variant parameterizations.
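The basic on-line mechanism, reusing the posterior conditioned on D_{t-1} as the prior at time t, can be checked numerically on a grid: it reproduces the off-line (batch) posterior exactly. The Gaussian observation model and prior below are illustrative assumptions.

```python
import math

# On-line updating with a time-invariant parameter: the posterior after
# t-1 observations serves as the prior for observation t.
step = 0.01
grid = [i * step for i in range(-500, 501)]

def lik(d, th):
    return math.exp(-0.5 * (d - th) ** 2)     # N(theta, 1) observations

def normalize(f):
    z = sum(f) * step
    return [v / z for v in f]

data = [0.9, 1.4, 1.1]

# On-line: fold the observations in one at a time.
belief = normalize([math.exp(-0.5 * th ** 2) for th in grid])   # N(0,1) prior
for d in data:
    belief = normalize([lik(d, th) * b for th, b in zip(grid, belief)])

# Off-line: a single batch update with all the data.
batch = normalize([math.exp(-0.5 * th ** 2) *
                   math.prod(lik(d, th) for d in data)
                   for th in grid])

# The two routes agree up to floating-point rounding.
assert max(abs(x - y) for x, y in zip(belief, batch)) < 1e-9
```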
2.3.1 Time-Invariant Parameterization

In this case, the posterior distribution is updated recursively via Bayes' rule (2.2):

f(θ|D_t) ∝ f(d_t|θ, D_{t-1}) f(θ|D_{t-1}),  t = 1, 2, ...,  (2.14)

where f(θ|D_0) ≡ f(θ), the parameter prior (2.2). This scenario is illustrated in Fig. 2.3.

Fig. 2.3. The Bayes' rule operator in the on-line scenario with time-invariant parameterization.

The observation model, f(d_t|θ, D_{t-1}), at time t is related to the observation model for the accumulated data, D_t, which we can interpret as the likelihood function of θ (2.3), via the chain rule of probability:

f(D_t|θ) = ∏_{τ=1}^{t} f(d_τ|θ, D_{τ-1}).
2.3.2 Time-Variant Parameterization

In this case, new parameters, θ_t, are required to explain d_t, i.e. the observation model, f(d_t|θ_t, D_{t-1}), t = 1, 2, ..., is an explicitly time-varying function. For convenience, we assume that θ_t ∈ R^r, ∀t, and we aggregate the parameters into a matrix, Θ_t, as we did the data (2.1):

Θ_t = [θ_1, θ_2, ..., θ_t],
with Θ_0 = {} by definition. Once again, Bayes' rule (2.2) is used to update our knowledge of Θ_t in the light of new data, d_t:

f(Θ_t|D_t) ∝ f(d_t|Θ_t, D_{t-1}) f(Θ_t|D_{t-1}).  (2.18)

Note that the dimension of the integration is r(t − 1) at time t. If the integrations need to be carried out numerically, this increasing dimensionality proves prohibitive in real-time applications. Therefore, the following simplifying assumptions are typically adopted [42]:

Proposition 2.1 (Markov observation model and parameter evolution models).
The observation model is to be simplified as follows:

f(d_t|Θ_t, D_{t-1}) = f(d_t|θ_t, D_{t-1}),  (2.19)

i.e. d_t is conditionally independent of Θ_{t-1}, given θ_t.
The parameter evolution model is to be simplified as follows:

f(θ_t|Θ_{t-1}, D_{t-1}) = f(θ_t|θ_{t-1}).  (2.20)

In many applications, (2.20) may depend on exogenous (observed) data, ξ_t, which can be seen as shaping parameters, and need not be explicitly listed in the conditioning part of the notation.
This Markov model (2.20) is the required extra ingredient for Bayesian time-variant on-line inference. Employing Proposition 2.1 in (2.18), we obtain the following two-step recursion.

The time update of Bayesian filtering:

f(θ_t|D_{t-1}) = ∫_{Θ*} f(θ_t|θ_{t-1}, D_{t-1}) f(θ_{t-1}|D_{t-1}) dθ_{t-1},  t = 2, 3, ...  (2.21)

The data update of Bayesian filtering:

f(θ_t|D_t) ∝ f(d_t|θ_t, D_{t-1}) f(θ_t|D_{t-1}),  t = 1, 2, ...  (2.22)

Note, therefore, that the integration dimension is fixed at r, ∀t (2.21). We will
refer to this two-step update for Bayesian on-line inference of θ_t as Bayesian filtering, in analogy to Kalman filtering, which involves the same two-step procedure, and which is, in fact, a specialization to the case of Gaussian observation (2.19) and parameter evolution (2.20) models. On-line inference of time-variant parameters is illustrated in schematic form in Fig. 2.4. In Chapter 7, the problem of designing tractable Bayesian recursive filtering algorithms will be addressed for a wide class of models, (2.19) and (2.20), using Variational Bayes (VB) techniques.
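The two-step recursion (2.21)-(2.22) specializes, for Gaussian models, to the Kalman filter mentioned above. A scalar sketch, with illustrative noise variances and data (not taken from the text):

```python
# Bayesian filtering in its Gaussian special case: a scalar Kalman filter.
# Parameter evolution (2.20): theta_t = theta_{t-1} + w_t, w_t ~ N(0, q)
# Observation model (2.19):   d_t = theta_t + e_t,        e_t ~ N(0, r)
def kalman_step(mean, var, d, q=0.1, r=0.5):
    # Time update (2.21): propagate through the parameter evolution model.
    mean_pred, var_pred = mean, var + q
    # Data update (2.22): condition on the new observation d_t.
    gain = var_pred / (var_pred + r)
    mean_post = mean_pred + gain * (d - mean_pred)
    var_post = (1 - gain) * var_pred
    return mean_post, var_post

mean, var = 0.0, 1.0                       # prior f(theta_0)
for d in [1.0, 1.2, 0.9, 1.1]:
    mean, var = kalman_step(mean, var, d)

# The posterior concentrates near the data while the variance stays
# bounded, with a fixed computational cost per step.
print(round(mean, 3), round(var, 3))
```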
Fig. 2.4. The inferential scheme for Bayesian filtering. The operator '×' denotes multiplication of distributions.
2.3.3 Prediction
Our purpose in on-line inference of parameters will often be to predict future data. In the Bayesian paradigm, k-steps-ahead prediction is achieved by eliciting the following distribution:

f(d_{t+k}|D_t).  (2.23)

This will be known as the predictor.
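A concrete instance of the predictor, under the illustrative Beta-Bernoulli model (not a model used in the text): marginalizing the observation model over the posterior gives the predictive probability in closed form, which a grid integration confirms.

```python
# One-step-ahead predictor for Bernoulli(theta) data with a Beta(a, b)
# posterior: integrating theta against f(theta | D_t) gives
# f(d_{t+1} = 1 | D_t) = a / (a + b).
def predictor(a, b):
    return a / (a + b)

# Posterior Beta(7, 3), e.g. after 6 successes and 2 failures on a flat prior.
p1 = predictor(7.0, 3.0)
print(p1)  # 0.7

# A numerical check of the same marginalization on a grid over (0, 1):
step = 1e-4
grid = [(i + 0.5) * step for i in range(10_000)]
w = [th ** 6 * (1 - th) ** 2 for th in grid]       # unnormalized Beta(7, 3)
z = sum(w)
p1_grid = sum(th * wi for th, wi in zip(grid, w)) / z
assert abs(p1 - p1_grid) < 1e-6
```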
The one-step-ahead predictor (i.e. k = 1 in (2.23)) for a model with time-invariant parameters (2.14) is as follows:

f(d_{t+1}|D_t) = ∫_{Θ*} f(d_{t+1}|θ, D_t) f(θ|D_t) dθ.
In later Chapters, we will study the use of the Variational Bayes (VB) approximation
in all three contexts of Bayesian learning reviewed in this Chapter, namely:
1 off-line parameter inference (Section 2.2), in Chapter 3;
2 on-line inference of Time-Invariant (TI) parameters (Section 2.3.1), in Chapter 6;
3 on-line inference of Time-Variant (TV) parameters (Section 2.3.2), in Chapter 7.