Bayesian Reasoning and Machine Learning

David Barber
Notation List
V a calligraphic symbol typically denotes a set of random variables 7
dom(x) Domain of a variable 7
x = x The variable x is in the state x 7
p(x = tr) probability of event/variable x being in the state true 7
p(x = fa) probability of event/variable x being in the state false 7
p(x, y) probability of x and y 8
p(x ∩ y) probability of x and y 8
p(x ∪ y) probability of x or y 8
p(x|y) The probability of x conditioned on y 8
X ⊥⊥ Y | Z Variables X are independent of variables Y conditioned on variables Z 11
X >> Y | Z Variables X are dependent on variables Y conditioned on variables Z 11
∫_x f(x) For continuous variables this is shorthand for ∫ f(x) dx and for discrete variables means summation over the states of x, ∑_x f(x) 18
I [S] Indicator : has value 1 if the statement S is true, 0 otherwise 19
pa (x) The parents of node x 26
ch (x) The children of node x 26
ne (x) Neighbours of node x 26
dim (x) For a discrete variable x, this denotes the number of states x can take 34
⟨f(x)⟩_p(x) The average of the function f(x) with respect to the distribution p(x) 158
δ(a, b) Delta function. For discrete a, b, this is the Kronecker delta, δ_{a,b}, and for continuous a, b the Dirac delta function δ(a − b) 160
dim x The dimension of the vector/matrix x 171
♯(x = s, y = t) The number of times x is in state s and y in state t simultaneously 197
♯_x^y The number of times variable x is in state y 278
D Dataset 291
n Data index 291
N Number of dataset training points 291
S Sample Covariance matrix 315
σ(x) The logistic sigmoid 1/(1 + exp(−x)) 353
erf(x) The (Gaussian) error function 353
x_{a:b} x_a, x_{a+1}, . . . , x_b 455
i ∼ j The set of unique neighbouring edges on a graph 585
I_m The m × m identity matrix 605
The data explosion
We live in a world that is rich in data, ever increasing in scale. This data comes from many different sources in science (bioinformatics, astronomy, physics, environmental monitoring) and commerce (customer databases, financial transactions, engine monitoring, speech recognition, surveillance, search). Possessing the knowledge as to how to process and extract value from such data is therefore a key and increasingly important skill. Our society also expects ultimately to be able to engage with computers in a natural manner, so that computers can 'talk' to humans, 'understand' what they say and 'comprehend' the visual world around them. These are difficult large-scale information processing tasks and represent grand challenges for computer science and related fields. Similarly, there is a desire to control increasingly complex systems, possibly containing many interacting parts, such as in robotics and autonomous navigation. Successfully mastering such systems requires an understanding of the processes underlying their behaviour. Processing and making sense of such large amounts of data from complex systems is therefore a pressing modern day concern and will likely remain so for the foreseeable future.
Machine Learning

Inevitably, our limited data and understanding of the problem force us to address uncertainty. In the broadest sense, Machine Learning and related fields aim to 'learn something useful' about the environment within which the agent operates. Machine Learning is also closely allied with Artificial Intelligence, with Machine Learning placing more emphasis on using data to drive and adapt the model.
In the early stages of Machine Learning and related areas, similar techniques were discovered in relatively isolated research communities. This book presents a unified treatment via graphical models, a marriage between graph and probability theory, facilitating the transference of Machine Learning concepts between different branches of the mathematical and computational sciences.
Whom this book is for
The book is designed to appeal to students with only a modest mathematical background in undergraduate calculus and linear algebra. No formal computer science or statistical background is required to follow the book, although a basic familiarity with probability, calculus and linear algebra would be useful. The book should appeal to students from a variety of backgrounds, including Computer Science, Engineering, applied Statistics, Physics, and Bioinformatics, that wish to gain an entry to probabilistic approaches in Machine Learning. In order to engage with students, the book introduces fundamental concepts in inference using only minimal reference to algebra and calculus. More mathematical techniques are postponed until as and when required, always with the concept as primary and the mathematics secondary.
The concepts and algorithms are described with the aid of many worked examples. The exercises and demonstrations, together with an accompanying MATLAB toolbox, enable the reader to experiment and more deeply understand the material. The ultimate aim of the book is to enable the reader to construct novel algorithms. The book therefore places an emphasis on skill learning, rather than being a collection of recipes. This is a key aspect since modern applications are often so specialised as to require novel methods. The approach taken throughout is to describe the problem as a graphical model, which is then translated into a mathematical framework, ultimately leading to an algorithmic implementation in the BRML toolbox.
The book is primarily aimed at final year undergraduates and graduates without significant experience in mathematics. On completion, the reader should have a good understanding of the techniques, practicalities and philosophies of probabilistic aspects of Machine Learning and be well equipped to understand more advanced research level material.
The structure of the book
The book begins with the basic concepts of graphical models and inference. For the independent reader, chapters 1, 2, 3, 4, 5, 9, 10, 13, 14, 15, 16, 17, 21 and 23 would form a good introduction to probabilistic reasoning, modelling and Machine Learning. The material in chapters 19, 24, 25 and 28 is more advanced, with the remaining material being of more specialised interest. Note that in each chapter the level of material is of varying difficulty, typically with the more challenging material placed towards the end of each chapter. As an introduction to the area of probabilistic modelling, a course can be constructed from the material as indicated in the chart.
The material from parts I and II has been successfully used for courses on Graphical Models. I have also taught an introduction to Probabilistic Machine Learning using material largely from part III, as indicated. These two courses can be taught separately and a useful approach would be to teach first the Graphical Models course, followed by a separate Probabilistic Machine Learning course.
A short course on approximate inference can be constructed from introductory material in part I and the more advanced material in part V, as indicated. The exact inference methods in part I can be covered relatively quickly, with the material in part V considered in more depth.
A timeseries course can be made by using primarily the material in part IV, possibly combined with material from part I for students that are unfamiliar with probabilistic modelling approaches. Some of this material, particularly in chapter 25, is more advanced and can be deferred until the end of the course, or considered for a more advanced course.
The references are generally to works at a level consistent with the book material and which are for the most part readily available.
Accompanying code
The BRML toolbox is provided to help readers see how mathematical models translate into actual MATLAB code. There are a large number of demos that a lecturer may wish to use or adapt to help illustrate the material. In addition, many of the exercises make use of the code, helping the reader gain confidence in the concepts and their application. Along with complete routines for many Machine Learning methods, the philosophy is to provide low level routines whose composition intuitively follows the mathematical description of the algorithm. In this way students may easily match the mathematics with the corresponding algorithmic implementation.
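To give a flavour of this composition style, the sketch below builds a small joint distribution from two potentials and then marginalises it. This is illustrative only: the struct fields (.variables, .table) and the exact call signatures are assumptions based on the routine descriptions listed later, not verbatim toolbox code.

% Hypothetical sketch of the low-level compositional style; field names and
% signatures of multpots/sumpot/disptable are assumptions for illustration.
x = 1; y = 2;                       % integer labels for the two variables
pot(1).variables = x;               % potential representing p(x)
pot(1).table     = [0.4 0.6];
pot(2).variables = [y x];           % potential representing p(y|x)
pot(2).table     = [0.3 0.8;
                    0.7 0.2];       % each column sums to 1 over y
jointpot = multpots(pot);           % p(x,y) = p(y|x)p(x)
ypot     = sumpot(jointpot, x);     % sum over x to obtain the marginal p(y)
disptable(ypot);                    % print the resulting table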
[Chart: suggested course structures over the chapters. 1: Probabilistic Reasoning; 2: Basic Graph Concepts; 3: Belief Networks; 4: Graphical Models; 5: Efficient Inference in Trees; 6: The Junction Tree Algorithm; 7: Making Decisions; 8: Statistics for Machine Learning; 9: Learning as Inference; 10: Naive Bayes; 11: Learning with Hidden Variables; 12: Bayesian Model Selection; 13: Machine Learning Concepts; 14: Nearest Neighbour Classification; 15: Unsupervised Linear Dimension Reduction; 16: Supervised Linear Dimension Reduction; 17: Linear Models; 18: Bayesian Linear Models; 19: Gaussian Processes; 20: Mixture Models; 21: Latent Linear Models; 22: Latent Ability Models; 23: Discrete-State Markov Models; 24: Continuous-State Markov Models; 25: Switching Linear Dynamical Systems; 26: Distributed Computation; 27: Sampling; 28: Deterministic Approximate Inference]
Other books in this area
The literature on Machine Learning is vast, with much relevant literature also contained in statistics, engineering and other physical sciences. A small list of more specialised books that may be referred to for deeper treatments of specific topics is:

• Graphical models
– Graphical Models by S. Lauritzen, Oxford University Press, 1996.
– Bayesian Networks and Decision Graphs by F. Jensen and T. D. Nielsen, Springer Verlag, 2007.
– Probabilistic Networks and Expert Systems by R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter, Springer Verlag, 1999.
– Probabilistic Reasoning in Intelligent Systems by J. Pearl, Morgan Kaufmann, 1988.
– Graphical Models in Applied Multivariate Statistics by J. Whittaker, Wiley, 1990.
– Probabilistic Graphical Models: Principles and Techniques by D. Koller and N. Friedman, MIT Press, 2009.
• Machine Learning and Information Processing
– Information Theory, Inference and Learning Algorithms by D. J. C. MacKay, Cambridge University Press, 2003.
– Pattern Recognition and Machine Learning by C. M. Bishop, Springer Verlag, 2006.
– An Introduction To Support Vector Machines by N. Cristianini and J. Shawe-Taylor, Cambridge University Press, 2000.
– Gaussian Processes for Machine Learning by C. E. Rasmussen and C. K. I. Williams, MIT Press, 2006.
Acknowledgements
Many people have helped this book along the way either in terms of reading, feedback, general insights, allowing me to present their work, or just plain motivation. Amongst these I would like to thank Dan Cornford, Massimiliano Pontil, Mark Herbster, John Shawe-Taylor, Vladimir Kolmogorov, Yuri Boykov, Tom Minka, Simon Prince, Silvia Chiappa, Bertrand Mesot, Robert Cowell, Ali Taylan Cemgil, David Blei, Jeff Bilmes, David Cohn, David Page, Peter Sollich, Chris Williams, Marc Toussaint, Amos Storkey, Zakria Hussain, Le Chen, Serafín Moral, Milan Studený, Luc De Raedt, Tristan Fletcher, Chris Vryonides, Yannis Haralambous, Tom Furmston, Ed Challis and Chris Bracegirdle. I would also like to thank the many students that have helped improve the material during lectures over the years. I'm particularly grateful to Taylan Cemgil for allowing his GraphLayout package to be bundled with the BRML toolbox.
The staff at Cambridge University Press have been a delight to work with and I would especially like to thank Heather Bergman for her initial endeavors and the wonderful Diana Gillooly for her continued enthusiasm.
A heartfelt thank you to my parents and sister – I hope this small token will make them proud. I'm also fortunate to be able to acknowledge the support and generosity of friends throughout. Finally, I'd like to thank Silvia who made it all worthwhile.
BRML toolbox

The BRML toolbox is a lightweight set of routines that enables the reader to experiment with concepts in graph theory, probability theory and Machine Learning. The code contains basic routines for manipulating discrete variable distributions, along with more limited support for continuous variables. In addition there are many hard-coded standard Machine Learning algorithms. The website also contains a complete list of all the teaching demos and related exercise material.
BRMLTOOLKIT
Graph Theory
ancestors - Return the ancestors of nodes x in DAG A
ancestralorder - Return the ancestral order or the DAG A (oldest first)
descendents - Return the descendents of nodes x in DAG A
children - return the children of variable x given adjacency matrix A
edges - Return edge list from adjacency matrix A
elimtri - Return a variable elimination sequence for a triangulated graph
connectedComponents - Find the connected components of an adjacency matrix
istree - Check if graph is singly-connected
neigh - Find the neighbours of vertex v on a graph with adjacency matrix G
noselfpath - return a path excluding self transitions
parents - return the parents of variable x given adjacency matrix A
spantree - Find a spanning tree from an edge list
triangulate - Triangulate adjacency matrix A
triangulatePorder - Triangulate adjacency matrix A according to a partial ordering
Potential manipulation
condpot - Return a potential conditioned on another variable
changevar - Change variable names in a potential
dag - Return the adjacency matrix (zeros on diagonal) for a Belief Network
deltapot - A delta function potential
disptable - Print the table of a potential
divpots - Divide potential pota by potb
drawFG - Draw the Factor Graph A
drawID - plot an Influence Diagram
drawJTree - plot a Junction Tree
drawNet - plot network
evalpot - Evaluate the table of a potential when variables are set
exppot - exponential of a potential
eyepot - Return a unit potential
grouppot - Form a potential based on grouping variables together
groupstate - Find the state of the group variables corresponding to a given ungrouped state
logpot - logarithm of the potential
markov - Return a symmetric adjacency matrix of Markov Network in pot
maxpot - Maximise a potential over variables
maxsumpot - Maximise or Sum a potential over variables
multpots - Multiply potentials into a single potential
numstates - Number of states of the variables in a potential
orderpot - Return potential with variables reordered according to order
orderpotfields - Order the fields of the potential, creating blank entries where necessary
potsample - Draw sample from a single potential
potscontainingonly - Returns those potential numbers that contain only the required variables
potvariables - Returns information about all variables in a set of potentials
setevpot - Sets variables in a potential into evidential states
setpot - sets potential variables to specified states
setstate - set a potential’s specified joint state to a specified value
squeezepots - Eliminate redundant potentials (those contained wholly within another)
sumpot - Sum potential pot over variables
sumpotID - Return the summed probability and utility tables from an ID
sumpots - Sum a set of potentials
table - Return the potential table
ungrouppot - Form a potential based on ungrouping variables
uniquepots - Eliminate redundant potentials (those contained wholly within another)
whichpot - Returns potentials that contain a set of variables
Routines also extend the toolbox to deal with Gaussian potentials:
multpotsGaussianMoment.m, sumpotGaussianCanonical.m, sumpotGaussianMoment.m, multpotsGaussianCanonical.m See demoSumprodGaussCanon.m, demoSumprodGaussCanonLDS.m, demoSumprodGaussMoment.m
Inference
absorb - Update potentials in absorption message passing on a Junction Tree
absorption - Perform full round of absorption on a Junction Tree
absorptionID - Perform full round of absorption on an Influence Diagram
ancestralsample - Ancestral sampling from a Belief Network
binaryMRFmap - get the MAP assignment for a binary MRF with positive W
bucketelim - Bucket Elimination on a set of potentials
condindep - Conditional Independence check using graph of variable interactions
condindepEmp - Compute the empirical log Bayes Factor and MI for independence/dependence
condindepPot - Numerical conditional independence measure
condMI - conditional mutual information I(x,y|z) of a potential.
FactorConnectingVariable - Factor nodes connecting to a set of variables
FactorGraph - Returns a Factor Graph adjacency matrix based on potentials
IDvars - probability and decision variables from a partial order
jtassignpot - Assign potentials to cliques in a Junction Tree
jtree - Setup a Junction Tree based on a set of potentials
jtreeID - Setup a Junction Tree based on an Influence Diagram
LoopyBP - loopy Belief Propagation using sum-product algorithm
MaxFlow - Ford Fulkerson max flow - min cut algorithm (breadth first search)
maxNpot - Find the N most probable values and states in a potential
maxNprodFG - N-Max-Product algorithm on a Factor Graph (Returns the Nmax most probable States)
maxprodFG - Max-Product algorithm on a Factor Graph
MDPemDeterministicPolicy - Solve MDP using EM with deterministic policy
MDPsolve - Solve a Markov Decision Process
MesstoFact - Returns the message numbers that connect into factor potential
metropolis - Metropolis sample
mostprobablepath - Find the most probable path in a Markov Chain
mostprobablepathmult - Find the all source all sink most probable paths in a Markov Chain
sumprodFG - Sum-Product algorithm on a Factor Graph represented by A
Specific Models
ARlds - Learn AR coefficients using a Linear Dynamical System
ARtrain - Fit autoregressive (AR) coefficients of order L to v.
BayesLinReg - Bayesian Linear Regression training using basis functions phi(x)
BayesLogRegressionRVM - Bayesian Logistic Regression with the Relevance Vector Machine
CanonVar - Canonical Variates (no post rotation of variates)
cca - canonical correlation analysis
covfnGE - Gamma Exponential Covariance Function
EMbeliefnet - train a Belief Network using Expectation Maximisation
EMminimizeKL - MDP deterministic policy solver Finds optimal actions
EMqTranMarginal - EM marginal transition in MDP
EMqUtilMarginal - Returns term proportional to the q marginal for the utility term
EMTotalBetaMessage - backward information needed to solve the MDP process using message passing
EMvalueTable - MDP solver calculates the value function of the MDP with the current policy
FA - Factor Analysis
GMMem - Fit a mixture of Gaussian to the data X using EM
GPclass - Gaussian Process Binary Classification
GPreg - Gaussian Process Regression
HebbML - Learn a sequence for a Hopfield Network
HMMbackward - HMM Backward Pass
HMMbackwardSAR - Backward Pass (beta method) for the Switching Autoregressive HMM
HMMem - EM algorithm for HMM
HMMforward - HMM Forward Pass
HMMforwardSAR - Switching Autoregressive HMM with switches updated only every Tskip timesteps
HMMgamma - HMM Posterior smoothing using the Rauch-Tung-Striebel correction method
HMMsmooth - Smoothing for a Hidden Markov Model (HMM)
HMMsmoothSAR - Switching Autoregressive HMM smoothing
HMMviterbi - Viterbi most likely joint hidden state of a HMM
kernel - A kernel evaluated at two points
Kmeans - K-means clustering algorithm
LDSbackward - Full Backward Pass for a Latent Linear Dynamical System (RTS correction method)
LDSbackwardUpdate - Single Backward update for a Latent Linear Dynamical System (RTS smoothing update)
LDSforward - Full Forward Pass for a Latent Linear Dynamical System (Kalman Filter)
LDSforwardUpdate - Single Forward update for a Latent Linear Dynamical System (Kalman Filter)
LDSsmooth - Linear Dynamical System : Filtering and Smoothing
LDSsubspace - Subspace Method for identifying Linear Dynamical System
LogReg - Learning Logistic Linear Regression Using Gradient Ascent (BATCH VERSION)
MIXprodBern - EM training of a Mixture of a product of Bernoulli distributions
mixMarkov - EM training for a mixture of Markov Models
NaiveBayesDirichletTest - Naive Bayes prediction having used a Dirichlet prior for training
NaiveBayesDirichletTrain - Naive Bayes training using a Dirichlet prior
NaiveBayesTest - Test Naive Bayes Bernoulli Distribution after Max Likelihood training
NaiveBayesTrain - Train Naive Bayes Bernoulli Distribution using Max Likelihood
nearNeigh - Nearest Neighbour classification
pca - Principal Components Analysis
plsa - Probabilistic Latent Semantic Analysis
plsaCond - Conditional PLSA (Probabilistic Latent Semantic Analysis)
rbf - Radial Basis function output
SARlearn - EM training of a Switching AR model
SLDSbackward - Backward pass using a Mixture of Gaussians
SLDSforward - Switching Latent Linear Dynamical System Gaussian Sum forward pass
SLDSmargGauss - compute the single Gaussian from a weighted SLDS mixture
softloss - Soft loss function
svdm - Singular Value Decomposition with missing values
SVMtrain - train a Support vector Machine
General
argmax - performs argmax returning the index and value
assign - Assigns values to variables
betaXbiggerY - p(x>y) for x~Beta(a,b), y~Beta(c,d)
bar3zcolor - Plot a 3D bar plot of the matrix Z
avsigmaGauss - Average of a logistic sigmoid under a Gaussian
cap - Cap x at absolute value c
chi2test - inverse of the chi square cumulative density
count - for a data matrix (each column is a datapoint), return the state counts
condexp - Compute normalised p proportional to exp(logp);
condp - Make a conditional distribution from the matrix
dirrnd - Samples from a Dirichlet distribution
field2cell - Place the field of a structure in a cell
GaussCond - Return the mean and covariance of a conditioned Gaussian
hinton - Plot a Hinton diagram
ind2subv - Subscript vector from linear index
ismember_sorted - True for member of sorted set
lengthcell - Length of each cell entry
logdet - Log determinant of a positive definite matrix computed in a numerically stable manner
logeps - log(x+eps)
logGaussGamma - unnormalised log of the Gauss-Gamma distribution
logsumexp - Compute log(sum(exp(a).*b)) valid for large a
logZdirichlet - Log Normalisation constant of a Dirichlet distribution with parameter u
majority - Return majority values in each column on a matrix
maxarray - Maximise a multi-dimensional array over a set of dimensions
maxNarray - Find the highest values and states of an array over a set of dimensions
mix2mix - Fit a mixture of Gaussians with another mixture of Gaussians
mvrandn - Samples from a multi-variate Normal(Gaussian) distribution
mygamrnd - Gamma random variate generator
mynanmean - mean of values that are not nan
mynansum - sum of values that are not nan
mynchoosek - binomial coefficient v choose k
myones - same as ones(x), but if x is a scalar, interprets as ones([x 1])
myrand - same as rand(x) but if x is a scalar interprets as rand([x 1])
myzeros - same as zeros(x) but if x is a scalar interprets as zeros([x 1])
normp - Make a normalised distribution from an array
randgen - Generates discrete random variables given the pdf
replace - Replace instances of a value with another value
sigma - 1./(1+exp(-x))
sigmoid - 1./(1+exp(-beta*x))
sqdist - Square distance between vectors in x and y
subv2ind - Linear index from subscript vector.
sumlog - sum(log(x)) with a cutoff at 10e-200
Miscellaneous
compat - Compatibility of object F being in position h for image v on grid Gx,Gy
logp - The logarithm of a specific non-Gaussian distribution
placeobject - Place the object F at position h in grid Gx,Gy
plotCov - return points for plotting an ellipse of a covariance
pointsCov - unit variance contours of a 2D Gaussian with mean m and covariance S
setup - run me at initialisation checks for bugs in matlab and initialises path
validgridposition - Returns 1 if point is on a defined grid
Notation List II
Preface II
BRML toolbox VII
Contents XI

I Inference in Probabilistic Models

1 Probabilistic Reasoning 7
1.1 Probability Refresher 7
1.1.1 Interpreting Conditional Probability 9
1.1.2 Probability Tables 12
1.2 Probabilistic Reasoning 12
1.3 Prior, Likelihood and Posterior 18
1.3.1 Two dice : what were the individual scores? 19
1.4 Summary 20
1.5 Code 20
1.5.1 Basic Probability code 20
1.5.2 General utilities 21
1.5.3 An example 22
1.6 Exercises 22
2 Basic Graph Concepts 25
2.1 Graphs 25
2.2 Numerically Encoding Graphs 27
2.2.1 Edge list 27
2.2.2 Adjacency matrix 28
2.2.3 Clique matrix 28
2.3 Summary 29
2.4 Code 29
2.4.1 Utility routines 29
2.5 Exercises 30
3 Belief Networks 31
3.1 The Benefits of Structure 31
3.1.1 Modelling independencies 32
3.1.2 Reducing the burden of specification 34
3.2 Uncertain and Unreliable Evidence 35
3.2.1 Uncertain evidence 35
3.2.2 Unreliable evidence 37
3.3 Belief Networks 38
3.3.1 Conditional independence 39
3.3.2 The impact of collisions 40
3.3.3 Graphical path manipulations for independence 43
3.3.4 d-Separation 43
3.3.5 Graphical and distributional in/dependence 43
3.3.6 Markov equivalence in belief networks 45
3.3.7 Belief networks have limited expressibility 46
3.4 Causality 47
3.4.1 Simpson’s paradox 47
3.4.2 The do-calculus 49
3.4.3 Influence diagrams and the do-calculus 49
3.5 Summary 50
3.6 Code 50
3.6.1 Naive inference demo 50
3.6.2 Conditional independence demo 50
3.6.3 Utility routines 51
3.7 Exercises 51
4 Graphical Models 57
4.1 Graphical Models 57
4.2 Markov Networks 58
4.2.1 Markov properties 59
4.2.2 Markov random fields 60
4.2.3 Hammersley-Clifford Theorem 61
4.2.4 Conditional independence using Markov networks 63
4.2.5 Lattice Models 63
4.3 Chain Graphical Models 65
4.4 Factor Graphs 67
4.4.1 Conditional independence in factor graphs 68
4.5 Expressiveness of Graphical Models 68
4.6 Summary 70
4.7 Code 71
4.8 Exercises 71
5 Efficient Inference in Trees 75
5.1 Marginal Inference 75
5.1.1 Variable elimination in a Markov chain and message passing 75
5.1.2 The sum-product algorithm on factor graphs 78
5.1.3 Dealing with Evidence 81
5.1.4 Computing the marginal likelihood 81
5.1.5 The problem with loops 83
5.2 Other Forms of Inference 83
5.2.1 Max-Product 83
5.2.2 Finding the N most probable states 85
5.2.3 Most probable path and shortest path 87
5.2.4 Mixed inference 89
5.3 Inference in Multiply Connected Graphs 89
5.3.1 Bucket elimination 90
5.3.2 Loop-cut conditioning 91
5.4 Message Passing for Continuous Distributions 92
5.5 Summary 92
5.6 Code 93
5.6.1 Factor graph examples 93
5.6.2 Most probable and shortest path 93
5.6.3 Bucket elimination 94
5.6.4 Message passing on Gaussians 94
5.7 Exercises 94
6 The Junction Tree Algorithm 97
6.1 Clustering Variables 97
6.1.1 Reparameterisation 97
6.2 Clique Graphs 98
6.2.1 Absorption 99
6.2.2 Absorption schedule on clique trees 100
6.3 Junction Trees 101
6.3.1 The running intersection property 102
6.4 Constructing a Junction Tree for Singly-Connected Distributions 104
6.4.1 Moralisation 104
6.4.2 Forming the clique graph 104
6.4.3 Forming a junction tree from a clique graph 104
6.4.4 Assigning potentials to cliques 105
6.5 Junction Trees for Multiply-Connected Distributions 105
6.5.1 Triangulation algorithms 107
6.6 The Junction Tree Algorithm 108
6.6.1 Remarks on the JTA 109
6.6.2 Computing the normalisation constant of a distribution 110
6.6.3 The marginal likelihood 111
6.6.4 Some small JTA examples 111
6.6.5 Shafer-Shenoy propagation 113
6.7 Finding the Most Likely State 113
6.8 Reabsorption : Converting a Junction Tree to a Directed Network 114
6.9 The Need For Approximations 115
6.9.1 Bounded width junction trees 115
6.10 Summary 116
6.11 Code 116
6.11.1 Utility routines 116
6.12 Exercises 117
7 Making Decisions 121
7.1 Expected Utility 121
7.1.1 Utility of money 121
7.2 Decision Trees 122
7.3 Extending Bayesian Networks for Decisions 125
7.3.1 Syntax of influence diagrams 125
7.4 Solving Influence Diagrams 129
7.4.1 Messages on an ID 130
7.4.2 Using a junction tree 130
7.5 Markov Decision Processes 133
7.5.1 Maximising expected utility by message passing 134
7.5.2 Bellman’s equation 135
7.6 Temporally Unbounded MDPs 136
7.6.1 Value iteration 136
7.6.2 Policy iteration 137
7.6.3 A curse of dimensionality 137
7.7 Variational Inference and Planning 138
7.8 Financial Matters 139
7.8.1 Options pricing and expected utility 140
7.8.2 Binomial options pricing model 141
7.8.3 Optimal investment 142
7.9 Further Topics 144
7.9.1 Partially observable MDPs 144
7.9.2 Reinforcement learning 144
7.10 Summary 146
7.11 Code 147
7.11.1 Sum/Max under a partial order 147
7.11.2 Junction trees for influence diagrams 147
7.11.3 Party-Friend example 148
7.11.4 Chest Clinic with Decisions 148
7.11.5 Markov decision processes 148
7.12 Exercises 149
II Learning in Probabilistic Models 153

8 Statistics for Machine Learning 157
8.1 Representing Data 157
8.1.1 Categorical 157
8.1.2 Ordinal 157
8.1.3 Numerical 157
8.2 Distributions 158
8.2.1 The Kullback-Leibler Divergence KL(q|p) 161
8.2.2 Entropy and information 162
8.3 Classical Distributions 163
8.4 Multivariate Gaussian 168
8.4.1 Completing the square 169
8.4.2 Conditioning as system reversal 170
8.4.3 Whitening and centering 171
8.5 Exponential Family 171
8.5.1 Conjugate priors 172
8.6 Learning distributions 172
8.7 Properties of Maximum Likelihood 174
8.7.1 Training assuming the correct model class 175
8.7.2 Training when the assumed model is incorrect 175
8.7.3 Maximum likelihood and the empirical distribution 176
8.8 Learning a Gaussian 176
8.8.1 Maximum likelihood training 176
8.8.2 Bayesian inference of the mean and variance 177
8.8.3 Gauss-Gamma distribution 179
8.9 Summary 179
8.10 Code 180
8.11 Exercises 180
9 Learning as Inference 191
9.1 Learning as Inference 191
9.1.1 Learning the bias of a coin 191
9.1.2 Making decisions 192
9.1.3 A continuum of parameters 193
9.1.4 Decisions based on continuous intervals 194
9.2 Bayesian methods and ML-II 195
9.3 Maximum Likelihood Training of Belief Networks 196
9.4 Bayesian Belief Network Training 199
9.4.1 Global and local parameter independence 199
9.4.2 Learning binary variable tables using a Beta prior 200
9.4.3 Learning multivariate discrete tables using a Dirichlet prior 202
9.5 Structure learning 205
9.5.1 PC algorithm 206
9.5.2 Empirical independence 207
9.5.3 Network scoring 209
9.5.4 Chow-Liu Trees 211
9.6 Maximum Likelihood for Undirected models 213
9.6.1 The likelihood gradient 213
9.6.2 General tabular clique potentials 214
9.6.3 Decomposable Markov networks 215
9.6.4 Exponential form potentials 220
9.6.5 Conditional random fields 221
9.6.6 Pseudo likelihood 224
9.6.7 Learning the structure 224
9.7 Summary 224
9.8 Code 225
9.8.1 PC algorithm using an oracle 225
9.8.2 Demo of empirical conditional independence 225
9.8.3 Bayes Dirichlet structure learning 225
9.9 Exercises 226
10 Naive Bayes 229
10.1 Naive Bayes and Conditional Independence 229
10.2 Estimation using Maximum Likelihood 230
10.2.1 Binary attributes 230
10.2.2 Multi-state variables 233
10.2.3 Text classification 234
10.3 Bayesian Naive Bayes 234
10.4 Tree Augmented Naive Bayes 236
10.4.1 Learning tree augmented Naive Bayes networks 236
10.5 Summary 237
10.6 Code 237
10.7 Exercises 237
11 Learning with Hidden Variables 241
11.1 Hidden Variables and Missing Data 241
11.1.1 Why hidden/missing variables can complicate proceedings 241
11.1.2 The missing at random assumption 242
11.1.3 Maximum likelihood 243
11.1.4 Identifiability issues 244
11.2 Expectation Maximisation 244
11.2.1 Variational EM 244
11.2.2 Classical EM 246
11.2.3 Application to Belief networks 248
11.2.4 General case 250
11.2.5 Convergence 253
11.2.6 Application to Markov networks 253
11.3 Extensions of EM 253
11.3.1 Partial M step 253
11.3.2 Partial E-step 253
11.4 A failure case for EM 255
11.5 Variational Bayes 256
11.5.1 EM is a special case of variational Bayes 258
11.5.2 An example: VB for the Asbestos-Smoking-Cancer network 258
11.6 Optimising the Likelihood by Gradient Methods 261
11.6.1 Undirected models 261
11.7 Summary 262
11.8 Code 262
11.9 Exercises 262
12 Bayesian Model Selection 267
12.1 Comparing Models the Bayesian Way 267
12.2 Illustrations : coin tossing 268
12.2.1 A discrete parameter space 268
12.2.2 A continuous parameter space 269
12.3 Occam’s Razor and Bayesian Complexity Penalisation 270
12.4 A continuous example : curve fitting 273
12.5 Approximating the Model Likelihood 274
12.5.1 Laplace’s method 275
12.5.2 Bayes information criterion (BIC) 275
12.6 Bayesian Hypothesis Testing for Outcome Analysis 276
12.6.1 Outcome analysis 276
12.6.2 Hindep : model likelihood 277
12.6.3 Hsame : model likelihood 278
12.6.4 Dependent outcome analysis 279
12.6.5 Is classifier A better than B? 280
12.7 Summary 281
12.8 Code 282
12.9 Exercises 282
III Machine Learning 287

13 Machine Learning Concepts 291
13.1 Styles of Learning 291
13.1.1 Supervised learning 291
13.1.2 Unsupervised learning 292
13.1.3 Anomaly detection 293
13.1.4 Online (sequential) learning 293
13.1.5 Interacting with the environment 293
13.1.6 Semi-supervised learning 294
13.2 Supervised Learning 294
13.2.1 Utility and Loss 294
13.2.2 Using the empirical distribution 295
13.2.3 Bayesian decision approach 298
13.3 Bayes versus Empirical Decisions 302
13.4 Summary 303
13.5 Exercises 303
14 Nearest Neighbour Classification 305
14.1 Do As Your Neighbour Does 305
14.2 K-Nearest Neighbours 306
14.3 A Probabilistic Interpretation of Nearest Neighbours 308
14.3.1 When your nearest neighbour is far away 309
14.4 Summary 309
14.5 Code 309
14.6 Exercises 309
15 Unsupervised Linear Dimension Reduction 311
15.1 High-Dimensional Spaces – Low Dimensional Manifolds 311
15.2 Principal Components Analysis 311
15.2.1 Deriving the optimal linear reconstruction 312
15.2.2 Maximum variance criterion 314
15.2.3 PCA algorithm 314
15.2.4 PCA and nearest neighbours classification 316
15.2.5 Comments on PCA 316
15.3 High Dimensional Data 317
15.3.1 Eigen-decomposition for N < D 318
15.3.2 PCA via Singular value decomposition 318
15.4 Latent Semantic Analysis 319
15.4.1 Information retrieval 320
15.5 PCA With Missing Data 321
15.5.1 Finding the principal directions 323
15.5.2 Collaborative filtering using PCA with missing data 324
15.6 Matrix Decomposition Methods 324
15.6.1 Probabilistic latent semantic analysis 325
15.6.2 Extensions and variations 328
15.6.3 Applications of PLSA/NMF 329
15.7 Kernel PCA 330
15.8 Canonical Correlation Analysis 332
15.8.1 SVD formulation 333
15.9 Summary 334
15.10 Code 334
15.11 Exercises 334
16 Supervised Linear Dimension Reduction 337
16.1 Supervised Linear Projections 337
16.2 Fisher’s Linear Discriminant 337
16.3 Canonical Variates 339
16.3.1 Dealing with the nullspace 341
16.4 Summary 342
16.5 Code 342
16.6 Exercises 342
17 Linear Models 345
17.1 Introduction: Fitting A Straight Line 345
17.2 Linear Parameter Models for Regression 346
17.2.1 Vector outputs 348
17.2.2 Regularisation 348
17.2.3 Radial basis functions 350
17.3 The Dual Representation and Kernels 351
17.3.1 Regression in the dual-space 352
17.4 Linear Parameter Models for Classification 352
17.4.1 Logistic regression 353
17.4.2 Beyond first order gradient ascent 357
17.4.3 Avoiding overconfident classification 357
17.4.4 Multiple classes 358
17.4.5 The Kernel Trick for Classification 358
17.5 Support Vector Machines 359
17.5.1 Maximum margin linear classifier 359
17.5.2 Using kernels 361
17.5.3 Performing the optimisation 362
17.5.4 Probabilistic interpretation 362
17.6 Soft Zero-One Loss for Outlier Robustness 362
17.7 Summary 363
17.8 Code 364
17.9 Exercises 364
18 Bayesian Linear Models 367
18.1 Regression With Additive Gaussian Noise 367
18.1.1 Bayesian linear parameter models 368
18.1.2 Determining hyperparameters: ML-II 369
18.1.3 Learning the hyperparameters using EM 370
18.1.4 Hyperparameter optimisation : using the gradient 371
18.1.5 Validation likelihood 373
18.1.6 Prediction and model averaging 373
18.1.7 Sparse linear models 374
18.2 Classification 375
18.2.1 Hyperparameter optimisation 376
18.2.2 Laplace approximation 376
18.2.3 Variational Gaussian approximation 379
18.2.4 Local variational approximation 380
18.2.5 Relevance vector machine for classification 381
18.2.6 Multi-class case 381
18.3 Summary 382
18.4 Code 382
18.5 Exercises 383
19 Gaussian Processes 385
19.1 Non-Parametric Prediction 385
19.1.1 From parametric to non-parametric 385
19.1.2 From Bayesian linear models to Gaussian processes 386
19.1.3 A prior on functions 387
19.2 Gaussian Process Prediction 388
19.2.1 Regression with noisy training outputs 388
19.3 Covariance Functions 390
19.3.1 Making new covariance functions from old 391
19.3.2 Stationary covariance functions 391
19.3.3 Non-stationary covariance functions 393
19.4 Analysis of Covariance Functions 393
19.4.1 Smoothness of the functions 393
19.4.2 Mercer kernels 394
19.4.3 Fourier analysis for stationary kernels 395
19.5 Gaussian Processes for Classification 396
19.5.1 Binary classification 396
19.5.2 Laplace’s approximation 397
19.5.3 Hyperparameter optimisation 399
19.5.4 Multiple classes 400
19.6 Summary 400
19.7 Code 400
19.8 Exercises 401
20 Mixture Models 403
20.1 Density Estimation Using Mixtures 403
20.2 Expectation Maximisation for Mixture Models 404
20.2.1 Unconstrained discrete tables 405
20.2.2 Mixture of product of Bernoulli distributions 407
20.3 The Gaussian Mixture Model 409
20.3.1 EM algorithm 409
20.3.2 Practical issues 412
20.3.3 Classification using Gaussian mixture models 413
20.3.4 The Parzen estimator 414
20.3.5 K-Means 415
20.3.6 Bayesian mixture models 415
20.3.7 Semi-supervised learning 416
20.4 Mixture of Experts 416
20.5 Indicator Models 417
20.5.1 Joint indicator approach: factorised prior 417
20.5.2 Polya prior 418
20.6 Mixed Membership Models 419
20.6.1 Latent Dirichlet allocation 419
20.6.2 Graph based representations of data 421
20.6.3 Dyadic data 421
20.6.4 Monadic data 422
20.6.5 Cliques and adjacency matrices for monadic binary data 423
20.7 Summary 426
20.8 Code 426
20.9 Exercises 427
21 Latent Linear Models 429
21.1 Factor Analysis 429
21.1.1 Finding the optimal bias 431
21.2 Factor Analysis : Maximum Likelihood 431
21.2.1 Eigen-approach likelihood optimisation 432
21.2.2 Expectation maximisation 434
21.3 Interlude: Modelling Faces 436
21.4 Probabilistic Principal Components Analysis 438
21.5 Canonical Correlation Analysis and Factor Analysis 439
21.6 Independent Components Analysis 440
21.7 Summary 442
21.8 Code 442
21.9 Exercises 442
22 Latent Ability Models 445
22.1 The Rasch Model 445
22.1.1 Maximum likelihood training 445
22.1.2 Bayesian Rasch models 446
22.2 Competition Models 447
22.2.1 Bradley-Terry-Luce model 447
22.2.2 Elo ranking model 448
22.2.3 Glicko and TrueSkill 448
22.3 Summary 449
22.4 Code 449
22.5 Exercises 449
IV Dynamical Models 451

23 Discrete-State Markov Models 455
23.1 Markov Models 455
23.1.1 Equilibrium and stationary distribution of a Markov chain 456
23.1.2 Fitting Markov models 457
23.1.3 Mixture of Markov models 458
23.2 Hidden Markov Models 460
23.2.1 The classical inference problems 460
23.2.2 Filtering p(ht|v1:t) 461
23.2.3 Parallel smoothing p(ht|v1:T) 462
23.2.4 Correction smoothing 462
23.2.5 Sampling from p(h1:T|v1:T) 464
23.2.6 Most likely joint state 464
23.2.7 Prediction 465
23.2.8 Self localisation and kidnapped robots 466
23.2.9 Natural language models 468
23.3 Learning HMMs 468
23.3.1 EM algorithm 468
23.3.2 Mixture emission 470
23.3.3 The HMM-GMM 470
23.3.4 Discriminative training 471
23.4 Related Models 471
23.4.1 Explicit duration model 471
23.4.2 Input-Output HMM 472
23.4.3 Linear chain CRFs 473
23.4.4 Dynamic Bayesian networks 474
23.5 Applications 474
23.5.1 Object tracking 474
23.5.2 Automatic speech recognition 474
23.5.3 Bioinformatics 475
23.5.4 Part-of-speech tagging 475
23.6 Summary 475
23.7 Code 476
23.8 Exercises 476
24 Continuous-state Markov Models 483
24.1 Observed Linear Dynamical Systems 483
24.1.1 Stationary distribution with noise 484
24.2 Auto-Regressive Models 485
24.2.1 Training an AR model 486
24.2.2 AR model as an OLDS 486
24.2.3 Time-varying AR model 487
24.2.4 Time-varying variance AR models 488
24.3 Latent Linear Dynamical Systems 489
24.4 Inference 490
24.4.1 Filtering 492
24.4.2 Smoothing : Rauch-Tung-Striebel correction method 494
24.4.3 The likelihood 495
24.4.4 Most likely state 496
24.4.5 Time independence and Riccati equations 496
24.5 Learning Linear Dynamical Systems 497
24.5.1 Identifiability issues 497
24.5.2 EM algorithm 498
24.5.3 Subspace Methods 499
24.5.4 Structured LDSs 500
24.5.5 Bayesian LDSs 500
24.6 Switching Auto-Regressive Models 500
24.6.1 Inference 501
24.6.2 Maximum likelihood learning using EM 501
24.7 Summary 502
24.8 Code 503
24.8.1 Autoregressive models 503
24.9 Exercises 504
25 Switching Linear Dynamical Systems 507
25.1 Introduction 507
25.2 The Switching LDS 507
25.2.1 Exact inference is computationally intractable 508
25.3 Gaussian Sum Filtering 508
25.3.1 Continuous filtering 509
25.3.2 Discrete filtering 511
25.3.3 The likelihood p(v1:T) 511
25.3.4 Collapsing Gaussians 511
25.3.5 Relation to other methods 512
25.4 Gaussian Sum Smoothing 512
25.4.1 Continuous smoothing 514
25.4.2 Discrete smoothing 514
25.4.3 Collapsing the mixture 514
25.4.4 Using mixtures in smoothing 515
25.4.5 Relation to other methods 516
25.5 Reset Models 518
25.5.1 A Poisson reset model 520
25.5.2 Reset-HMM-LDS 521
25.6 Summary 522
25.7 Code 522
25.8 Exercises 522
26 Distributed Computation 525
26.1 Introduction 525
26.2 Stochastic Hopfield Networks 525
26.3 Learning Sequences 526
26.3.1 A single sequence 526
26.3.2 Multiple sequences 531
26.3.3 Boolean networks 532
26.3.4 Sequence disambiguation 532
26.4 Tractable Continuous Latent Variable Models 532
26.4.1 Deterministic latent variables 532
26.4.2 An augmented Hopfield network 534
26.5 Neural Models 535
26.5.1 Stochastically spiking neurons 535
26.5.2 Hopfield membrane potential 535
26.5.3 Dynamic synapses 536
26.5.4 Leaky integrate and fire models 537
26.6 Summary 537
26.7 Code 537
26.8 Exercises 538
V Approximate Inference 539

27 Sampling 543
27.1 Introduction 543
27.1.1 Univariate sampling 544
27.1.2 Rejection sampling 545
27.1.3 Multivariate sampling 546
27.2 Ancestral Sampling 548
27.2.1 Dealing with evidence 548
27.2.2 Perfect sampling for a Markov network 549
27.3 Gibbs Sampling 549
27.3.1 Gibbs sampling as a Markov chain 550
27.3.2 Structured Gibbs sampling 551
27.3.3 Remarks 551
27.4 Markov Chain Monte Carlo (MCMC) 552
27.4.1 Markov chains 553
27.4.2 Metropolis-Hastings sampling 553
27.5 Auxiliary Variable Methods 555
27.5.1 Hybrid Monte Carlo 555
27.5.2 Swendson-Wang 557
27.5.3 Slice sampling 559
27.6 Importance Sampling 560
27.6.1 Sequential importance sampling 562
27.6.2 Particle filtering as an approximate forward pass 563
27.7 Summary 565
27.8 Code 565
27.9 Exercises 566
28 Deterministic Approximate Inference 569
28.1 Introduction 569
28.2 The Laplace approximation 569
28.3 Properties of Kullback-Leibler Variational Inference 570
28.3.1 Bounding the normalisation constant 570
28.3.2 Bounding the marginal likelihood 570
28.3.3 Bounding marginal quantities 571
28.3.4 Gaussian approximations using KL divergence 571
28.3.5 Marginal and moment matching properties of minimising KL(p|q) 572
28.4 Variational Bounding Using KL(q|p) 573
28.4.1 Pairwise Markov random field 573
28.4.2 General mean field equations 576
28.4.3 Asynchronous updating guarantees approximation improvement 576
28.4.4 Structured variational approximation 577
28.5 Local and KL Variational Approximations 579
28.5.1 Local approximation 580
28.5.2 KL variational approximation 580
28.6 Mutual Information Maximisation : A KL Variational Approach 581
28.6.1 The information maximisation algorithm 582
28.6.2 Linear Gaussian decoder 583
28.7 Loopy Belief Propagation 584
28.7.1 Classical BP on an undirected graph 584
28.7.2 Loopy BP as a variational procedure 585
28.8 Expectation Propagation 587
28.9 MAP for Markov networks 590
28.9.1 Pairwise Markov networks 592
28.9.2 Attractive binary Markov networks 593
28.9.3 Potts model 595
28.10 Further Reading 596
28.11 Summary 596
28.12 Code 597
28.13 Exercises 597
29 Background Mathematics 603
29.1 Linear Algebra 603
29.1.1 Vector algebra 603
29.1.2 The scalar product as a projection 604
29.1.3 Lines in space 604
29.1.4 Planes and hyperplanes 604
29.1.5 Matrices 605
29.1.6 Linear transformations 606
29.1.7 Determinants 606
29.1.8 Matrix inversion 607
29.1.9 Computing the matrix inverse 608
29.1.10 Eigenvalues and eigenvectors 608
29.1.11 Matrix decompositions 609
29.2 Multivariate Calculus 610
29.2.1 Interpreting the gradient vector 611
29.2.2 Higher derivatives 611
29.2.3 Matrix calculus 612
29.3 Inequalities 612
29.3.1 Convexity 612
29.3.2 Jensen's inequality 613
29.4 Optimisation 613
29.5 Multivariate Optimisation 613
29.5.1 Gradient descent with fixed stepsize 614
29.5.2 Gradient descent with line searches 614
29.5.3 Minimising quadratic functions using line search 615
29.5.4 Gram-Schmidt construction of conjugate vectors 615
29.5.5 The conjugate vectors algorithm 616
29.5.6 The conjugate gradients algorithm 616
29.5.7 Newton's method 617
29.6 Constrained Optimisation using Lagrange Multipliers 619
29.6.1 Lagrange Dual 619
Part I
Inference in Probabilistic Models
Introduction to Part I
Probabilistic models explicitly take into account uncertainty and deal with our imperfect knowledge of the world. Such models are of fundamental significance in Machine Learning since our understanding of the world will always be limited by our observations and understanding. We will focus initially on using probabilistic models as a kind of expert system.

In Part I, we assume that the model is fully specified. That is, given a model of the environment, how can we use it to answer questions of interest? We will relate the complexity of inferring quantities of interest to the structure of the graph describing the model. In addition, we will describe operations in terms of manipulations on the corresponding graphs. As we will see, provided the graphs are simple tree-like structures, most quantities of interest can be computed efficiently.

Part I deals with manipulating mainly discrete variable distributions and forms the background to all the later material in the book.
[Chart: a taxonomy of graphical models – directed factor graphs; Bayesian networks (dynamic Bayes nets, chains, HMM, LDS); latent variable models (discrete mixture models, clustering; continuous dimension reduction, over-complete representations); influence diagrams (strong JT, decision theory); chain graphs; undirected graphs (Markov networks, input dependent CRF, pairwise models such as the Boltzmann machine (discrete) and Gaussian Process (continuous)); clique graphs and junction trees.]

[Chart: graphical models and associated (marginal) inference methods, with specific inference methods highlighted in red. Multiply connected graphs: if decomposable with small cliques, message passing is tractable (JTA with absorption or Shafer-Shenoy); if the cliques are big or the messages intractable, approximations are required (cutset conditioning is possible but inefficient), with tractable special cases including Gaussian models, binary MRF-MAP with attractive pure interactions, and planar binary pure-interaction MRFs. Singly connected graphs: if message updates are tractable, sum/max product applies; if message updates are intractable, approximations such as EP are required (bucket elimination is possible but inefficient).]

Loosely speaking, provided the graph corresponding to the model is singly-connected, most of the standard (marginal) inference methods are tractable. Multiply-connected graphs are generally more problematic, although there are special cases which remain tractable.
1 Probabilistic Reasoning

1.1 Probability Refresher
Variables, States and Notational Shortcuts
Variables will be denoted using either upper case X or lower case x, and a set of variables will typically be denoted by a calligraphic symbol, for example V = {a, B, c}.
The domain of a variable x is written dom(x), and denotes the states x can take. States will typically be represented using sans-serif font. For example, for a coin c, dom(c) = {heads, tails} and p(c = heads) represents the probability that variable c is in state heads. The meaning of p(state) will often be clear, without specific reference to a variable. For example, if we are discussing an experiment about a coin c, the meaning of p(heads) is clear from the context, being shorthand for p(c = heads). When summing over a variable ∑_x f(x), the interpretation is that all states of x are included, i.e. ∑_x f(x) ≡ ∑_{s∈dom(x)} f(x = s). Given a variable, x, its domain dom(x) and a full specification of the probability values for each of the variable states, p(x), we have a distribution for x. Sometimes we will not fully specify the distribution, only certain properties, such as for variables x, y, p(x, y) = p(x)p(y) for some unspecified p(x) and p(y). When clarity on this is required we will say distributions with structure p(x)p(y), or a distribution class p(x)p(y).

For our purposes, events are expressions about random variables, such as Two heads in 6 coin tosses. Two events are mutually exclusive if they cannot both be true. For example the events The coin is heads and The coin is tails are mutually exclusive. One can think of defining a new variable named by the event so, for example, p(The coin is tails) can be interpreted as p(The coin is tails = true). We use the shorthand p(x = tr) for the probability of event/variable x being in the state true and p(x = fa) for the probability of variable x being in the state false.
Definition 1.1 (Rules of Probability for Discrete Variables)

The probability p(x = x) of variable x being in state x is represented by a value between 0 and 1. p(x = x) = 1 means that we are certain x is in state x. Conversely, p(x = x) = 0 means that we are certain x is not in state x. Values between 0 and 1 represent the degree of certainty of state occupancy.

The summation of the probability over all the states is 1:

    ∑_{x∈dom(x)} p(x = x) = 1

We will use the shorthand p(x, y) for p(x and y). Note that p(y, x) = p(x, y) and p(x or y) = p(y or x). For mutually exclusive events x and y, p(x or y) = p(x) + p(y); more generally, p(x or y) = p(x) + p(y) − p(x and y).
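These rules are easily verified numerically. The following plain MATLAB fragment (the joint table values are made up for illustration) checks normalisation and computes the marginals of a joint distribution over two binary variables:

% A minimal numerical check of the rules above.
% Joint distribution p(x,y) for binary x (rows) and binary y (columns):
pxy = [0.3 0.1;
       0.2 0.4];
sum(pxy(:))          % summation over all joint states gives 1
px = sum(pxy, 2)     % marginal p(x): sum over the states of y
py = sum(pxy, 1)     % marginal p(y): sum over the states of x
% p(x,y) = p(y,x): the joint assigns one number per joint state,
% irrespective of the order in which we list the variables.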
Definition 1.2 (Set notation) An alternative notation in terms of set theory is to write

    p(x ∩ y) ≡ p(x, y),    p(x ∪ y) ≡ p(x or y)

Definition 1.3 (Marginals) Given a joint distribution p(x, y), the distribution of a single variable is called a marginal and is obtained by summing over the states of the other variable:

    p(x) = ∑_y p(x, y)

Definition 1.4 (Conditional Probability / Bayes' Rule) The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

    p(x|y) ≡ p(x, y) / p(y)

provided p(y) > 0; if p(y) = 0 then p(x|y) is not defined. Rearranging, and using p(x, y) = p(y|x)p(x), gives Bayes' rule p(x|y) = p(y|x)p(x)/p(y). Since Bayes' rule trivially follows from the definition of conditional probability, we will sometimes be loose in our language and use the terms Bayes' rule and conditional probability as synonymous.

As we shall see throughout this book, Bayes' rule plays a central role in probabilistic reasoning since it helps us 'invert' probabilistic relationships, translating between p(y|x) and p(x|y).
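A small numerical illustration of this inversion, using made-up numbers for a disease/test scenario:

% Bayes' rule: from p(y|x) and p(x) to p(x|y).
px    = [0.01 0.99];        % p(x): x=1 (disease), x=2 (no disease)
py_gx = [0.9 0.1;           % p(y|x): rows index y (test +/-), columns index x
         0.1 0.9];
py    = py_gx * px';        % p(y) = sum_x p(y|x)p(x)
pyx   = py_gx .* repmat(px, 2, 1);     % joint p(y,x) = p(y|x)p(x)
px_gy = pyx ./ repmat(py, 1, 2);       % p(x|y): rows index y, columns x
px_gy(1,1)                  % p(x=disease | y=positive), approximately 0.083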
Definition 1.5 (Probability Density Functions) For a continuous variable x, the probability density f(x) is defined such that

    f(x) ≥ 0,    ∫_{−∞}^{∞} f(x) dx = 1

and the probability that x falls in an interval [a, b] is given by

    p(a ≤ x ≤ b) = ∫_a^b f(x) dx

As shorthand we will sometimes write ∫_x f(x), particularly when we want an expression to be valid for either continuous or discrete variables. The multivariate case is analogous, with integration over all real space, and the probability that x belongs to a region of the space defined accordingly. Unlike probabilities, probability densities can take positive values greater than 1.

Formally speaking, for a continuous variable, one should not speak of the probability that x = 0.2 since the probability of a single value is always zero. However, we shall often write p(x) for continuous variables, thus not distinguishing between probabilities and probability density function values. Whilst this may appear strange, the nervous reader may simply replace our p(x) notation for ∫_{x∈∆} f(x) dx, where ∆ is a small region centred on x. This is well defined in a probabilistic sense and, in the limit of ∆ being very small, this would give approximately ∆f(x). If we consistently use the same ∆ for all occurrences of pdfs, then we will simply have a common prefactor ∆ in all expressions. Our strategy is to simply ignore these values (since in the end only relative probabilities will be relevant) and write p(x). In this way, all the standard rules of probability carry over, including Bayes' Rule.
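The ∆f(x) approximation is easy to verify numerically; here a standard Gaussian density is chosen arbitrarily for illustration:

% Probability of a small interval versus the Delta*f(x0) approximation.
x0 = 0.2; Delta = 1e-3;
f  = @(x) exp(-x.^2/2)/sqrt(2*pi);                % standard Gaussian pdf
exact  = integral(f, x0 - Delta/2, x0 + Delta/2); % p(x in small region)
approx = Delta * f(x0);                           % Delta*f(x0)
[exact approx]                                    % the two agree closely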
Remark 1.1 (Subjective Probability) Probability is a contentious topic and we do not wish to get bogged down by the debate here, apart from pointing out that it is not necessarily the rules of probability that are contentious, rather what interpretation we should place on them. In some cases potential repetitions of an experiment can be envisaged so that the 'long run' (or frequentist) definition of probability, in which probabilities are defined with respect to a potentially infinite repetition of experiments, makes sense. For example, in coin tossing, the probability of heads might be interpreted as 'If I were to repeat the experiment of flipping a coin (at "random"), the limit of the number of heads that occurred over the number of tosses is defined as the probability of a head occurring.'

Here's a problem that is typical of the kind of scenario one might face in a machine learning situation. A film enthusiast joins a new online film service. Based on expressing a few films a user likes and dislikes, the online company tries to estimate the probability that the user will like each of the 10000 films in their database. If we were to define probability as a limiting case of infinite repetitions of the same experiment, this wouldn't make much sense in this case since we can't repeat the experiment. However, if we assume that the user behaves in a manner consistent with other users, we should be able to exploit the large amount of data from other users' ratings to make a reasonable 'guess' as to what this consumer likes. This degree of belief or Bayesian subjective interpretation of probability sidesteps non-repeatability issues – it's just a framework for manipulating real values consistent with our intuition about probability [158].
1.1.1 Interpreting Conditional Probability

Conditional probability matches our intuitive understanding of uncertainty. For example, imagine a circular dart board, split into 20 equal sections, labelled from 1 to 20. Randy, a dart thrower, hits any one of the 20 sections uniformly at random. Hence the probability that a dart thrown by Randy occurs in any one of the 20 regions is p(region i) = 1/20. A friend of Randy tells him that he hasn't hit the 20 region. What is the probability that Randy has hit the 5 region? Conditioned on this information, only regions 1 to 19 remain possible and, since there is no preference for Randy to hit any of these regions, the probability is 1/19. The conditioning is consistent with the definition of conditional probability:

    p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20) = (1/20) / (19/20) = 1/19

giving the intuitive result. An important point to clarify is that p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. In most contexts, no such explicit temporal causality is implied and the correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

The relation between the conditional p(A = a|B = b) and the joint p(A = a, B = b) is just a normalisation constant since p(A = a, B = b) is not a distribution in A – in other words, ∑_a p(A = a, B = b) ≠ 1. To make it a distribution we need to divide: p(A = a, B = b) / ∑_a p(A = a, B = b) which, when summed over a, does sum to 1. Indeed, this is just the definition of p(A = a|B = b).
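The dartboard conditioning can be carried out numerically in exactly this way, by zeroing the excluded state and renormalising:

% The dartboard example: conditioning as renormalisation.
p = ones(1, 20)/20;          % p(region i) = 1/20
p(20) = 0;                   % impose the constraint: region 20 not hit
p = p / sum(p);              % renormalise: the definition of conditioning
p(5)                         % 1/19, as argued above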
Definition 1.6 (Independence)

Variables x and y are independent if knowing the state (or value in the continuous case) of one variable gives no extra information about the other variable. Mathematically, this is expressed by

    p(x, y) = p(x)p(y)

Provided that p(x) ≠ 0 and p(y) ≠ 0, independence of x and y is equivalent to

    p(x|y) = p(x)  ⇔  p(y|x) = p(y)

If p(x|y) = p(x) for all states of x and y, then the variables x and y are said to be independent. If

    p(x, y) = k f(x) g(y)

for some constant k, and positive functions f(·) and g(·), then x and y are independent and we write x ⊥⊥ y.
Example 1.1 (Independence) Let x denote the day of the week in which females are born, and y denote the day in which males are born, with dom(x) = dom(y) = {1, . . . , 7}. It is reasonable to expect that x is independent of y. We randomly select a woman from the phone book, Alice, and find out that she was born on a Tuesday. We also randomly select a male at random, Bob. Before phoning Bob and asking him, what does knowing Alice's birth day add to which day we think Bob is born on? Under the independence assumption, the answer is nothing. Note that this doesn't mean that the distribution of Bob's birthday is necessarily uniform – it just means that knowing when Alice was born doesn't provide any extra information than we already knew about Bob's birthday, p(y|x) = p(y). Indeed, the distributions of birthdays p(y) and p(x) are non-uniform (statistically fewer babies are born on weekends), though there is nothing to suggest that x and y are dependent.
Deterministic Dependencies
Sometimes the concept of independence is perhaps a little strange Consider the following : variables x and
y are both binary (their domains consist of two states) We define the distribution such that x and y arealways both in a certain joint state:
Trang 35Probability Refresher
This may seem strange – we know for sure the relation between x and y, namely that they are always in the same joint state, yet they are independent. Since the distribution is trivially concentrated in a single joint state, knowing the state of x tells you nothing that you didn't anyway know about the state of y, and vice versa. This potential confusion comes from the term 'independent', which may suggest that there is no relation between the objects discussed. The best way to think about statistical independence is to ask whether or not knowing the state of variable y tells you something more than you knew before about variable x, where 'knew before' means working with the joint distribution p(x, y) to figure out what we can know about x, namely p(x).
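The independence test p(x, y) = p(x)p(y) is easily checked numerically. In the minimal Python sketch below (the helper function is our own illustration, not from the text), applying the test to the deterministic distribution above confirms that x and y are indeed independent:

import numpy as np

def is_independent(p_xy, tol=1e-12):
    # Compare the joint table with the outer product of its marginals.
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return np.allclose(p_xy, np.outer(p_x, p_y), atol=tol)

# The deterministic distribution: all mass on the single joint state (x = 1, y = 1).
p_deg = np.array([[1.0, 0.0],
                  [0.0, 0.0]])
print(is_independent(p_deg))  # True: knowing x adds nothing about y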
Definition 1.7 (Conditional Independence)
X ⊥⊥ Y | Z

denotes that the two sets of variables X and Y are independent of each other provided we know the state of the set of variables Z. For conditional independence, X and Y must be independent given all states of Z. Formally, this means that

p(X, Y|Z) = p(X|Z)p(Y|Z)

for all states of X, Y, Z. In case the conditioning set is empty we may also write X⊥⊥Y for X⊥⊥Y|∅, in which case X is (unconditionally) independent of Y. If X and Y are not conditionally independent, they are conditionally dependent, written X>>Y|Z. Similarly X>>Y|∅ can be written as X>>Y.
Intuitively, if x is conditionally independent of y given z, this means that, given z, y contains no additional information about x. Similarly, given z, knowing x does not tell me anything more about y. Note that X⊥⊥Y|Z ⇒ X′⊥⊥Y′|Z for X′ ⊆ X and Y′ ⊆ Y.
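Conditional independence can likewise be verified from a joint table by checking the factorisation for every state of Z. The sketch below (an illustration with invented numbers, not a library routine) builds a distribution of the form p(x|z)p(y|z)p(z), which is conditionally independent by construction:

import numpy as np

def is_cond_independent(p_xyz, tol=1e-12):
    # Check p(x, y|z) = p(x|z)p(y|z) for every state z with p(z) > 0.
    for z in range(p_xyz.shape[2]):
        slice_z = p_xyz[:, :, z]
        pz = slice_z.sum()
        if pz == 0:
            continue  # conditioning on an impossible state is vacuous
        p_xy_given_z = slice_z / pz
        p_x = p_xy_given_z.sum(axis=1)
        p_y = p_xy_given_z.sum(axis=0)
        if not np.allclose(p_xy_given_z, np.outer(p_x, p_y), atol=tol):
            return False
    return True

p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.3, 0.8],
                        [0.7, 0.2]])   # columns indexed by z
p_y_given_z = np.array([[0.5, 0.1],
                        [0.5, 0.9]])
p_xyz = np.einsum('xz,yz,z->xyz', p_x_given_z, p_y_given_z, p_z)
print(is_cond_independent(p_xyz))  # True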
Remark 1.2 (Independence implications). It's tempting to think that if a is independent of b and b is independent of c then a must be independent of c:

{a⊥⊥b, b⊥⊥c} ⇒ a⊥⊥c

However, this does not follow: consider, for example, a distribution of the form p(a, b, c) = p(b)p(a, c), in which b is independent of both a and c, yet a and c may be dependent. Similarly, it's tempting to think that if a and b are dependent, and b and c are dependent, then a and c must be dependent:

{a>>b, b>>c} ⇒ a>>c

However, this also does not follow. We give an explicit numerical example in exercise(3.17).
Finally, note that conditional independence x⊥⊥y|z does not imply marginal independence x⊥⊥y. See also exercise(3.20).
1.1.2 Probability Tables
Based on the populations 60776238, 5116900 and 2980700 of England (E), Scotland (S) and Wales (W), the a priori probability that a randomly selected person from the combined three countries would live in England, Scotland or Wales, is approximately 0.88, 0.08 and 0.04 respectively. We can write this as a vector (or probability table):

(p(Cnt = E), p(Cnt = S), p(Cnt = W)) = (0.88, 0.08, 0.04)   (1.1.21)

whose component values sum to 1. Assuming that a person's mother tongue (MT) is either English (Eng), Scottish (Scot) or Welsh (Wel), and that it depends on their country of residence, we have the conditional probabilities:
p(MT = Eng|Cnt = E) = 0.95    p(MT = Eng|Cnt = S) = 0.7    p(MT = Eng|Cnt = W) = 0.6
p(MT = Scot|Cnt = E) = 0.04   p(MT = Scot|Cnt = S) = 0.3   p(MT = Scot|Cnt = W) = 0.0
p(MT = Wel|Cnt = E) = 0.01    p(MT = Wel|Cnt = S) = 0.0    p(MT = Wel|Cnt = W) = 0.4
(1.1.22)
From this we can form a joint distribution p(Cnt, MT) = p(MT|Cnt)p(Cnt). This could be written as a 3 × 3 matrix with columns indexed by country and rows indexed by Mother Tongue:

( 0.95 × 0.88   0.7 × 0.08   0.6 × 0.04 )     ( 0.8360   0.0560   0.0240 )
( 0.04 × 0.88   0.3 × 0.08   0.0 × 0.04 )  =  ( 0.0352   0.0240   0.0000 )   (1.1.23)
( 0.01 × 0.88   0.0 × 0.08   0.4 × 0.04 )     ( 0.0088   0.0000   0.0160 )

The entries of this joint table sum to 1; summing down a column recovers the marginal p(Cnt), and summing along a row gives the marginal p(MT).
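The same construction is immediate in code. A minimal Python sketch using the numbers above (only the array layout is our own choice):

import numpy as np

p_cnt = np.array([0.88, 0.08, 0.04])            # p(Cnt): E, S, W
p_mt_given_cnt = np.array([[0.95, 0.70, 0.60],  # rows: Eng, Scot, Wel
                           [0.04, 0.30, 0.00],  # columns: E, S, W
                           [0.01, 0.00, 0.40]])

p_joint = p_mt_given_cnt * p_cnt  # p(MT, Cnt) = p(MT|Cnt)p(Cnt), column-wise
print(p_joint.sum())              # 1.0: a valid joint distribution
print(p_joint.sum(axis=0))        # marginalise out MT: recovers p(Cnt)
print(p_joint.sum(axis=1))        # marginalise out Cnt: gives p(MT)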
For joint distributions over a larger number of variables, x_i, i = 1, ..., D, with each variable x_i taking K_i states, the table describing the joint distribution is an array with ∏_{i=1}^{D} K_i entries. Explicitly storing tables therefore requires space exponential in the number of variables, which rapidly becomes impractical for a large number of variables. We discuss how to deal with this issue in chapter(3) and chapter(4).
A probability distribution assigns a value to each of the joint states of the variables. For this reason, p(T, J, R, S) is considered equivalent to p(J, S, R, T) (or any such reordering of the variables), since in each case the joint setting of the variables is simply a different index to the same probability. This situation is clearer in the set theoretic notation p(J ∩ S ∩ T ∩ R); we abbreviate this set theoretic notation by using the commas. However, one should be careful not to confuse this indexing-type notation with functions f(x, y), which are in general dependent on the variable order. Whilst the variables to the left of the conditioning bar may be written in any order, and equally those to the right of the conditioning bar may be written in any order, moving variables across the bar is not generally equivalent, so that p(x1|x2) ≠ p(x2|x1).
1.2 Probabilistic Reasoning
The central paradigm of probabilistic reasoning is to identify all relevant variables x1, ..., xN in the environment, and make a probabilistic model p(x1, ..., xN) of their interaction. Reasoning (inference) is then performed by introducing evidence that sets variables in known states, and subsequently computing probabilities of interest, conditioned on this evidence. The rules of probability, combined with Bayes' rule, make for a complete reasoning system, one which includes traditional deductive logic as a special case[158]. In the examples below, the number of variables in the environment is very small; in chapter(3) we will discuss reasoning in networks containing many variables.

Example 1.2 (Hamburgers). Consider the following fictitious scientific information: doctors find that people with Kreuzfeld-Jacob disease (KJ) almost invariably ate hamburgers, thus p(Hamburger Eater|KJ) = 9/10. The probability of an individual having KJ is currently rather low, about one in 100,000.

1. Assuming eating lots of hamburgers is rather widespread, say p(Hamburger Eater) = 0.5, what is the probability that a hamburger eater will have Kreuzfeld-Jacob disease?
This may be computed as

p(KJ|Hamburger Eater) = p(Hamburger Eater, KJ)/p(Hamburger Eater) = p(Hamburger Eater|KJ)p(KJ)/p(Hamburger Eater)   (1.2.1)

= (9/10 × 1/100000)/(1/2) = 1.8 × 10⁻⁵   (1.2.2)
2. If the fraction of people eating hamburgers was rather small, p(Hamburger Eater) = 0.001, what is the probability that a regular hamburger eater will have Kreuzfeld-Jacob disease? Repeating the above calculation, this is given by

(9/10 × 1/100000)/(1/1000) = 9 × 10⁻³ ≈ 1/100   (1.2.3)
This is much higher than in scenario (1) since here we can be more sure that eating hamburgers is related to the illness.
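Both scenarios are a one-line application of Bayes' rule; a minimal Python sketch (the function name is our own):

def p_kj_given_eater(p_eater, p_eater_given_kj=9/10, p_kj=1/100000):
    # Bayes' rule: p(KJ|Eater) = p(Eater|KJ) p(KJ) / p(Eater)
    return p_eater_given_kj * p_kj / p_eater

print(p_kj_given_eater(0.5))    # scenario 1: 1.8e-05
print(p_kj_given_eater(0.001))  # scenario 2: 0.009, roughly 1/100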
Example 1.3 (Inspector Clouseau). Inspector Clouseau arrives at the scene of a crime. The victim lies dead in the room alongside the possible murder weapon, a knife. The Butler (B) and Maid (M) are the inspector's main suspects, and the inspector has a prior belief of 0.6 that the Butler is the murderer, and a prior belief of 0.2 that the Maid is the murderer. These beliefs are independent in the sense that p(B, M) = p(B)p(M). (It is possible that both the Butler and the Maid murdered the victim, or neither.) The inspector's prior criminal knowledge can be formulated mathematically as follows:
dom(B) = dom(M) = {murderer, not murderer}, dom(K) = {knife used, knife not used}   (1.2.4)

p(B = murderer) = 0.6, p(M = murderer) = 0.2   (1.2.5)
p(knife used|B = not murderer, M = not murderer) = 0.3
p(knife used|B = not murderer, M = murderer) = 0.2
p(knife used|B = murderer, M = not murderer) = 0.6
p(knife used|B = murderer, M = murderer) = 0.1
(1.2.6)
In addition p(K, B, M) = p(K|B, M)p(B)p(M). Assuming that the knife is the murder weapon, what is the probability that the Butler is the murderer? (Remember that it might be that neither is the murderer.) Using b for the two states of B and m for the two states of M,

p(B|K) = ∑_m p(B, m|K) = ∑_m p(K|B, m)p(B, m)/∑_{m,b} p(K|b, m)p(b, m) = p(B)∑_m p(K|B, m)p(m) / (∑_b p(b)∑_m p(K|b, m)p(m))   (1.2.7)

Plugging in the values, we have

p(B = murderer|knife used) = (6/10 × (2/10 × 1/10 + 8/10 × 6/10)) / (6/10 × (2/10 × 1/10 + 8/10 × 6/10) + 4/10 × (2/10 × 2/10 + 8/10 × 3/10)) = 300/412 ≈ 0.73   (1.2.8)

Hence knowing that the knife was the murder weapon strengthens our belief that the Butler did it.
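The computation in equation (1.2.8) amounts to summing the joint over the maid's states; a minimal Python sketch of this enumeration (the dictionary encoding is our own):

# States: 1 = murderer, 0 = not murderer.
p_b = {1: 0.6, 0: 0.4}
p_m = {1: 0.2, 0: 0.8}
p_knife = {(0, 0): 0.3, (0, 1): 0.2,   # p(knife used|B, M), keyed by (B, M)
           (1, 0): 0.6, (1, 1): 0.1}

# p(B, knife used) = sum_m p(knife used|B, m) p(B) p(m)
joint = {b: sum(p_knife[(b, m)] * p_b[b] * p_m[m] for m in (0, 1))
         for b in (0, 1)}
p_knife_used = joint[0] + joint[1]   # the prior p(knife used) = 0.412
print(joint[1] / p_knife_used)       # posterior ≈ 0.728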
Remark 1.3. The role of p(knife used) in the Inspector Clouseau example can cause some confusion. In the above,

p(knife used) = ∑_b p(b) ∑_m p(knife used|b, m)p(m)   (1.2.9)

is computed to be 0.412. But surely, p(knife used) = 1, since this is given in the question! Note that the quantity p(knife used) relates to the prior probability the model assigns to the knife being used (in the absence of any other information). If we know that the knife is used, then the posterior

p(knife used|knife used) = p(knife used, knife used)/p(knife used) = p(knife used)/p(knife used) = 1   (1.2.10)

which, naturally, must be the case.
Example 1.4 (Who's in the bathroom?). Consider a household of three people, Alice, Bob and Cecil. Cecil wants to go to the bathroom but finds it occupied. He then goes to Alice's room and sees she is there. Since Cecil knows that only either Alice or Bob can be in the bathroom, from this he infers that Bob must be in the bathroom.
To arrive at the same conclusion in a mathematical framework, we define the following events
A = Alice is in her bedroom, B = Bob is in his bedroom, O = Bathroom occupied (1.2.11)
We can encode the information that if either Alice or Bob are not in their bedrooms, then they must be in the bathroom (they might both be in the bathroom) as

p(O = tr|A = fa, B) = 1,   p(O = tr|A, B = fa) = 1   (1.2.12)
The first term expresses that the bathroom is occupied if Alice is not in her bedroom, wherever Bob is. Similarly, the second term expresses bathroom occupancy as long as Bob is not in his bedroom. Then

p(B = fa|O = tr, A = tr) = p(B = fa, O = tr, A = tr)/p(O = tr, A = tr) = p(O = tr|A = tr, B = fa)p(A = tr, B = fa)/p(O = tr, A = tr)   (1.2.13)
where
p(O = tr, A = tr) = p(O = tr|A = tr, B = fa)p(A = tr, B = fa) + p(O = tr|A = tr, B = tr)p(A = tr, B = tr)   (1.2.14)

Using the facts p(O = tr|A = tr, B = fa) = 1 and p(O = tr|A = tr, B = tr) = 0, which encode that if Alice is in her room and Bob is not, the bathroom must be occupied, and that if both Alice and Bob are in their rooms, the bathroom cannot be occupied, we obtain
p(B = fa|O = tr, A = tr) = p(A = tr, B = fa)/p(A = tr, B = fa) = 1   (1.2.15)

This example is interesting since we are not required to make a full probabilistic model in this case, thanks to the limiting nature of the probabilities (we don't need to specify p(A, B)). The situation is common when probabilities are either 0 or 1, corresponding to traditional logic systems.
Example 1.5 (Aristotle: Resolution). We can represent the statement 'All apples are fruit' by p(F = tr|A = tr) = 1, and the statement 'All fruits grow on trees' by p(T = tr|F = tr) = 1. Taken together, these lead to the conclusion that 'All apples grow on trees'. To see how this might be deduced using Bayesian reasoning, consider

p(T = tr|A = tr) = p(T = tr|A = tr, F = fa)p(F = fa|A = tr) + p(T = tr|A = tr, F = tr)p(F = tr|A = tr)   (1.2.16)

Since p(F = fa|A = tr) = 0 and p(F = tr|A = tr) = 1, and assuming that growing on trees depends only on being a fruit, p(T|A, F) = p(T|F), this gives p(T = tr|A = tr) = p(T = tr|F = tr) = 1. That is, from A ⇒ F and F ⇒ T we infer A ⇒ T.
Example 1.6 (Aristotle: Inverse Modus Ponens). According to logic, from the statement 'If A is true then B is true', one may deduce that 'if B is false then A is false'. To see how this fits in with a probabilistic reasoning system, we can first express the statement 'If A is true then B is true' as p(B = tr|A = tr) = 1. Then we may infer

p(A = fa|B = fa) = 1 − p(A = tr|B = fa) = 1 − p(B = fa|A = tr)p(A = tr)/p(B = fa) = 1   (1.2.17)

This follows since p(B = fa|A = tr) = 1 − p(B = tr|A = tr) = 0, which annihilates the second term.
Example 1.7 (Soft XOR Gate)
A standard XOR logic gate is given by the truth table below. If we observe that the output of the XOR gate is 0, what can we say about A and B? In this case, either A and B were both 0, or A and B were both 1. This means we don't know which state A was in – it could equally likely have been 1 or 0.

A  B  A xor B
0  0  0
0  1  1
1  0  1
1  1  0
Consider a 'soft' version of the XOR gate, in which the gate stochastically outputs C = 1 depending on its inputs according to the table below, with additionally A⊥⊥B and p(A = 1) = 0.65, p(B = 1) = 0.77:

A  B  p(C = 1|A, B)
0  0  0.1
0  1  0.99
1  0  0.8
1  1  0.25

What is p(A = 1|C = 0)? Using the independence of A and B,

p(A = 1, C = 0) = p(A = 1) (p(C = 0|A = 1, B = 0)p(B = 0) + p(C = 0|A = 1, B = 1)p(B = 1))
= 0.65 × (0.2 × 0.23 + 0.75 × 0.77) = 0.405275   (1.2.18)

p(A = 0, C = 0) = p(A = 0) (p(C = 0|A = 0, B = 0)p(B = 0) + p(C = 0|A = 0, B = 1)p(B = 1))
= 0.35 × (0.9 × 0.23 + 0.01 × 0.77) = 0.075145   (1.2.19)

Then
p(A = 1|C = 0) = p(A = 1, C = 0)/(p(A = 1, C = 0) + p(A = 0, C = 0)) = 0.405275/(0.405275 + 0.075145) = 0.8436   (1.2.20)
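The same enumeration in a minimal Python sketch (the dictionary encoding is illustrative):

p_a = {1: 0.65, 0: 0.35}
p_b = {1: 0.77, 0: 0.23}
p_c1 = {(0, 0): 0.1, (0, 1): 0.99,   # p(C = 1|A, B), keyed by (A, B)
        (1, 0): 0.8, (1, 1): 0.25}

# p(A, C = 0) = p(A) * sum_b p(C = 0|A, b) p(b), using A independent of B
joint_c0 = {a: p_a[a] * sum((1 - p_c1[(a, b)]) * p_b[b] for b in (0, 1))
            for a in (0, 1)}
print(joint_c0[1] / (joint_c0[0] + joint_c0[1]))  # 0.8436...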
Example 1.8 (Larry). Larry is typically late for school. If Larry is late, we denote this with L = late, otherwise L = not late. When his mother asks whether or not he was late for school, he never admits to being late. The response Larry gives, RL, is represented as follows:

p(RL = not late|L = not late) = 1,   p(RL = late|L = late) = 0   (1.2.21)

The remaining two values are determined by normalisation and are

p(RL = late|L = not late) = 0,   p(RL = not late|L = late) = 1   (1.2.22)
Given that RL = not late, what is the probability that Larry was late, i.e. p(L = late|RL = not late)?
Using Bayes' rule we have

p(L = late|RL = not late) = p(L = late, RL = not late)/p(RL = not late)
= p(L = late, RL = not late)/(p(L = late, RL = not late) + p(L = not late, RL = not late))   (1.2.23)

= p(RL = not late|L = late)p(L = late)/(p(RL = not late|L = late)p(L = late) + p(RL = not late|L = not late)p(L = not late))
= 1 × p(L = late)/(1 × p(L = late) + 1 × p(L = not late)) = p(L = late)   (1.2.24)
where we used normalisation in the last step, p(L = late) + p(L = not late) = 1. This result is intuitive – Larry's mother knows that he never admits to being late, so her belief about whether or not he really was late is unchanged, regardless of what Larry actually says.
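To see the cancellation numerically, here is a minimal Python sketch in which the prior p(L = late) is an assumed value purely for illustration; any prior gives the same conclusion:

p_late = 0.3  # assumed prior, for illustration only; the result holds for any value
p_rl_notlate = {'late': 1.0, 'not late': 1.0}  # p(RL = not late|L): Larry never admits lateness

num = p_rl_notlate['late'] * p_late
den = num + p_rl_notlate['not late'] * (1 - p_late)
print(num / den)  # 0.3: the posterior equals the prior; Larry's answer is uninformative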