Bayesian Reasoning and Machine Learning

David Barber
Notation List
V a calligraphic symbol typically denotes a set of random variables 7
dom(x) Domain of a variable 7
x = x The variable x is in the state x 7
p(x = tr) probability of event/variable x being in the state true 7
p(x = fa) probability of event/variable x being in the state false 7
p(x, y) probability of x and y 8
p(x ∩ y) probability of x and y 8
p(x ∪ y) probability of x or y 8
p(x|y) The probability of x conditioned on y 8
X ⊥⊥ Y | Z Variables X are independent of variables Y conditioned on variables Z 11
X >> Y | Z Variables X are dependent on variables Y conditioned on variables Z 11
∫_x f(x) For continuous variables this is shorthand for ∫ f(x) dx and for discrete variables means summation over the states of x, ∑_x f(x) 18
I [S] Indicator : has value 1 if the statement S is true, 0 otherwise 19
pa (x) The parents of node x 26
ch (x) The children of node x 26
ne (x) Neighbours of node x 26
dim (x) For a discrete variable x, this denotes the number of states x can take 34
⟨f(x)⟩_p(x) The average of the function f(x) with respect to the distribution p(x) 158
δ(a, b) Delta function. For discrete a, b, this is the Kronecker delta, δ_{a,b}, and for continuous a, b the Dirac delta function δ(a − b) 160
dim x The dimension of the vector/matrix x 171
♯(x = s, y = t) The number of times x is in state s and y in state t simultaneously 197
♯_x^y The number of times variable x is in state y 278
D Dataset 291
n Data index 291
N Number of dataset training points 291
S Sample Covariance matrix 315
σ(x) The logistic sigmoid 1/(1 + exp(−x)) 353
erf(x) The (Gaussian) error function 353
x_{a:b} x_a, x_{a+1}, . . . , x_b 455
i ∼ j The set of unique neighbouring edges on a graph 585
I_m The m × m identity matrix 605
The data explosion
We live in a world that is rich in data, ever increasing in scale. This data comes from many different sources in science (bioinformatics, astronomy, physics, environmental monitoring) and commerce (customer databases, financial transactions, engine monitoring, speech recognition, surveillance, search). Possessing the knowledge as to how to process and extract value from such data is therefore a key and increasingly important skill. Our society also expects ultimately to be able to engage with computers in a natural manner, so that computers can 'talk' to humans, 'understand' what they say and 'comprehend' the visual world around them. These are difficult large-scale information processing tasks and represent grand challenges for computer science and related fields. Similarly, there is a desire to control increasingly complex systems, possibly containing many interacting parts, such as in robotics and autonomous navigation. Successfully mastering such systems requires an understanding of the processes underlying their behaviour. Processing and making sense of such large amounts of data from complex systems is therefore a pressing modern day concern and will likely remain so for the foreseeable future.
Machine Learning

Inevitably, our limited data and understanding of the problem force us to address uncertainty. In the broadest sense, Machine Learning and related fields aim to 'learn something useful' about the environment within which the agent operates. Machine Learning is also closely allied with Artificial Intelligence, with Machine Learning placing more emphasis on using data to drive and adapt the model.
In the early stages of Machine Learning and related areas, similar techniques were discovered in relatively isolated research communities. This book presents a unified treatment via graphical models, a marriage between graph and probability theory, facilitating the transference of Machine Learning concepts between different branches of the mathematical and computational sciences.
Whom this book is for
The book is designed to appeal to students with only a modest mathematical background in undergraduate calculus and linear algebra. No formal computer science or statistical background is required to follow the book, although a basic familiarity with probability, calculus and linear algebra would be useful. The book should appeal to students from a variety of backgrounds, including Computer Science, Engineering, applied Statistics, Physics, and Bioinformatics, that wish to gain an entry to probabilistic approaches in Machine Learning. In order to engage with students, the book introduces fundamental concepts in inference using only minimal reference to algebra and calculus. More mathematical techniques are postponed until as and when required, always with the concept as primary and the mathematics secondary.
The concepts and algorithms are described with the aid of many worked examples. The exercises and demonstrations, together with an accompanying MATLAB toolbox, enable the reader to experiment and more deeply understand the material. The ultimate aim of the book is to enable the reader to construct novel algorithms. The book therefore places an emphasis on skill learning, rather than being a collection of recipes. This is a key aspect since modern applications are often so specialised as to require novel methods. The approach taken throughout is to describe the problem as a graphical model, which is then translated into a mathematical framework, ultimately leading to an algorithmic implementation in the BRML toolbox.
The book is primarily aimed at final year undergraduates and graduates without significant experience in mathematics. On completion, the reader should have a good understanding of the techniques, practicalities and philosophies of probabilistic aspects of Machine Learning and be well equipped to understand more advanced research level material.
The structure of the book
The book begins with the basic concepts of graphical models and inference. For the independent reader, chapters 1, 2, 3, 4, 5, 9, 10, 13, 14, 15, 16, 17, 21 and 23 would form a good introduction to probabilistic reasoning, modelling and Machine Learning. The material in chapters 19, 24, 25 and 28 is more advanced, with the remaining material being of more specialised interest. Note that in each chapter the level of material is of varying difficulty, typically with the more challenging material placed towards the end of each chapter. As an introduction to the area of probabilistic modelling, a course can be constructed from the material as indicated in the chart.
The material from parts I and II has been successfully used for courses on Graphical Models. I have also taught an introduction to Probabilistic Machine Learning using material largely from part III, as indicated. These two courses can be taught separately and a useful approach would be to teach first the Graphical Models course, followed by a separate Probabilistic Machine Learning course.
A short course on approximate inference can be constructed from introductory material in part I and the more advanced material in part V, as indicated. The exact inference methods in part I can be covered relatively quickly, with the material in part V considered in more depth.
A timeseries course can be made by using primarily the material in part IV, possibly combined with material from part I for students that are unfamiliar with probabilistic modelling approaches. Some of this material, particularly in chapter 25, is more advanced and can be deferred until the end of the course, or considered for a more advanced course.
The references are generally to works at a level consistent with the book material and which are for the most part readily available.
Accompanying code
The BRML toolbox is provided to help readers see how mathematical models translate into actual MATLAB code. There are a large number of demos that a lecturer may wish to use or adapt to help illustrate the material. In addition, many of the exercises make use of the code, helping the reader gain confidence in the concepts and their application. Along with complete routines for many Machine Learning methods, the philosophy is to provide low level routines whose composition intuitively follows the mathematical description of the algorithm. In this way students may easily match the mathematics with the corresponding algorithmic implementation.
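To give a flavour of this composition style, the sketch below builds a small joint distribution from two potentials and then marginalises it. This is illustrative only: the struct fields (.variables, .table) and the exact call signatures are assumptions based on the routine descriptions listed later, not verbatim toolbox code.

% Hypothetical sketch of the low-level compositional style; field names and
% signatures of multpots/sumpot/disptable are assumptions for illustration.
x = 1; y = 2;                       % integer labels for the two variables
pot(1).variables = x;               % potential representing p(x)
pot(1).table     = [0.4 0.6];
pot(2).variables = [y x];           % potential representing p(y|x)
pot(2).table     = [0.3 0.8;
                    0.7 0.2];       % each column sums to 1 over y
jointpot = multpots(pot);           % p(x,y) = p(y|x)p(x)
ypot     = sumpot(jointpot, x);     % sum over x to obtain the marginal p(y)
disptable(ypot);                    % print the resulting table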
[Chart: suggested course structures over the chapters. 1: Probabilistic Reasoning; 2: Basic Graph Concepts; 3: Belief Networks; 4: Graphical Models; 5: Efficient Inference in Trees; 6: The Junction Tree Algorithm; 7: Making Decisions; 8: Statistics for Machine Learning; 9: Learning as Inference; 10: Naive Bayes; 11: Learning with Hidden Variables; 12: Bayesian Model Selection; 13: Machine Learning Concepts; 14: Nearest Neighbour Classification; 15: Unsupervised Linear Dimension Reduction; 16: Supervised Linear Dimension Reduction; 17: Linear Models; 18: Bayesian Linear Models; 19: Gaussian Processes; 20: Mixture Models; 21: Latent Linear Models; 22: Latent Ability Models; 23: Discrete-State Markov Models; 24: Continuous-State Markov Models; 25: Switching Linear Dynamical Systems; 26: Distributed Computation; 27: Sampling; 28: Deterministic Approximate Inference]
Other books in this area
The literature on Machine Learning is vast, with much relevant literature also contained in statistics, engineering and other physical sciences. A small list of more specialised books that may be referred to for deeper treatments of specific topics is:

• Graphical models
– Graphical Models by S. Lauritzen, Oxford University Press, 1996.
– Bayesian Networks and Decision Graphs by F. Jensen and T. D. Nielsen, Springer Verlag, 2007.
– Probabilistic Networks and Expert Systems by R. G. Cowell, A. P. Dawid, S. L. Lauritzen and D. J. Spiegelhalter, Springer Verlag, 1999.
– Probabilistic Reasoning in Intelligent Systems by J. Pearl, Morgan Kaufmann, 1988.
– Graphical Models in Applied Multivariate Statistics by J. Whittaker, Wiley, 1990.
– Probabilistic Graphical Models: Principles and Techniques by D. Koller and N. Friedman, MIT Press, 2009.
• Machine Learning and Information Processing
– Information Theory, Inference and Learning Algorithms by D. J. C. MacKay, Cambridge University Press, 2003.
– Pattern Recognition and Machine Learning by C. M. Bishop, Springer Verlag, 2006.
– An Introduction To Support Vector Machines by N. Cristianini and J. Shawe-Taylor, Cambridge University Press, 2000.
– Gaussian Processes for Machine Learning by C. E. Rasmussen and C. K. I. Williams, MIT Press, 2006.
Acknowledgements
Many people have helped this book along the way either in terms of reading, feedback, general insights, allowing me to present their work, or just plain motivation. Amongst these I would like to thank Dan Cornford, Massimiliano Pontil, Mark Herbster, John Shawe-Taylor, Vladimir Kolmogorov, Yuri Boykov, Tom Minka, Simon Prince, Silvia Chiappa, Bertrand Mesot, Robert Cowell, Ali Taylan Cemgil, David Blei, Jeff Bilmes, David Cohn, David Page, Peter Sollich, Chris Williams, Marc Toussaint, Amos Storkey, Zakria Hussain, Le Chen, Serafín Moral, Milan Studený, Luc De Raedt, Tristan Fletcher, Chris Vryonides, Yannis Haralambous, Tom Furmston, Ed Challis and Chris Bracegirdle. I would also like to thank the many students that have helped improve the material during lectures over the years. I'm particularly grateful to Taylan Cemgil for allowing his GraphLayout package to be bundled with the BRML toolbox.
The staff at Cambridge University Press have been a delight to work with and I would especially like to thank Heather Bergman for her initial endeavors and the wonderful Diana Gillooly for her continued enthusiasm.
A heartfelt thank you to my parents and sister – I hope this small token will make them proud. I'm also fortunate to be able to acknowledge the support and generosity of friends throughout. Finally, I'd like to thank Silvia who made it all worthwhile.
BRML toolbox

The BRML toolbox is a lightweight set of routines that enables the reader to experiment with concepts in graph theory, probability theory and Machine Learning. The code contains basic routines for manipulating discrete variable distributions, along with more limited support for continuous variables. In addition there are many hard-coded standard Machine Learning algorithms. The website also contains a complete list of all the teaching demos and related exercise material.
BRMLTOOLKIT
Graph Theory
ancestors - Return the ancestors of nodes x in DAG A
ancestralorder - Return the ancestral order or the DAG A (oldest first)
descendents - Return the descendents of nodes x in DAG A
children - return the children of variable x given adjacency matrix A
edges - Return edge list from adjacency matrix A
elimtri - Return a variable elimination sequence for a triangulated graph
connectedComponents - Find the connected components of an adjacency matrix
istree - Check if graph is singly-connected
neigh - Find the neighbours of vertex v on a graph with adjacency matrix G
noselfpath - return a path excluding self transitions
parents - return the parents of variable x given adjacency matrix A
spantree - Find a spanning tree from an edge list
triangulate - Triangulate adjacency matrix A
triangulatePorder - Triangulate adjacency matrix A according to a partial ordering
Potential manipulation
condpot - Return a potential conditioned on another variable
changevar - Change variable names in a potential
dag - Return the adjacency matrix (zeros on diagonal) for a Belief Network
deltapot - A delta function potential
disptable - Print the table of a potential
divpots - Divide potential pota by potb
drawFG - Draw the Factor Graph A
drawID - plot an Influence Diagram
drawJTree - plot a Junction Tree
drawNet - plot network
evalpot - Evaluate the table of a potential when variables are set
exppot - exponential of a potential
eyepot - Return a unit potential
grouppot - Form a potential based on grouping variables together
groupstate - Find the state of the group variables corresponding to a given ungrouped state
logpot - logarithm of the potential
markov - Return a symmetric adjacency matrix of Markov Network in pot
maxpot - Maximise a potential over variables
maxsumpot - Maximise or Sum a potential over variables
multpots - Multiply potentials into a single potential
numstates - Number of states of the variables in a potential
orderpot - Return potential with variables reordered according to order
orderpotfields - Order the fields of the potential, creating blank entries where necessary
potsample - Draw sample from a single potential
potscontainingonly - Returns those potential numbers that contain only the required variables
potvariables - Returns information about all variables in a set of potentials
setevpot - Sets variables in a potential into evidential states
setpot - sets potential variables to specified states
setstate - set a potential’s specified joint state to a specified value
squeezepots - Eliminate redundant potentials (those contained wholly within another)
sumpot - Sum potential pot over variables
sumpotID - Return the summed probability and utility tables from an ID
sumpots - Sum a set of potentials
table - Return the potential table
ungrouppot - Form a potential based on ungrouping variables
uniquepots - Eliminate redundant potentials (those contained wholly within another)
whichpot - Returns potentials that contain a set of variables
Routines also extend the toolbox to deal with Gaussian potentials:
multpotsGaussianMoment.m, sumpotGaussianCanonical.m, sumpotGaussianMoment.m, multpotsGaussianCanonical.m See demoSumprodGaussCanon.m, demoSumprodGaussCanonLDS.m, demoSumprodGaussMoment.m
Inference
absorb - Update potentials in absorption message passing on a Junction Tree
absorption - Perform full round of absorption on a Junction Tree
absorptionID - Perform full round of absorption on an Influence Diagram
ancestralsample - Ancestral sampling from a Belief Network
binaryMRFmap - get the MAP assignment for a binary MRF with positive W
bucketelim - Bucket Elimination on a set of potentials
condindep - Conditional Independence check using graph of variable interactions
condindepEmp - Compute the empirical log Bayes Factor and MI for independence/dependence
condindepPot - Numerical conditional independence measure
condMI - conditional mutual information I(x,y|z) of a potential.
FactorConnectingVariable - Factor nodes connecting to a set of variables
FactorGraph - Returns a Factor Graph adjacency matrix based on potentials
IDvars - probability and decision variables from a partial order
jtassignpot - Assign potentials to cliques in a Junction Tree
jtree - Setup a Junction Tree based on a set of potentials
jtreeID - Setup a Junction Tree based on an Influence Diagram
LoopyBP - loopy Belief Propagation using sum-product algorithm
MaxFlow - Ford Fulkerson max flow - min cut algorithm (breadth first search)
maxNpot - Find the N most probable values and states in a potential
maxNprodFG - N-Max-Product algorithm on a Factor Graph (Returns the Nmax most probable States)
maxprodFG - Max-Product algorithm on a Factor Graph
MDPemDeterministicPolicy - Solve MDP using EM with deterministic policy
MDPsolve - Solve a Markov Decision Process
MesstoFact - Returns the message numbers that connect into factor potential
metropolis - Metropolis sample
mostprobablepath - Find the most probable path in a Markov Chain
mostprobablepathmult - Find the all source all sink most probable paths in a Markov Chain
sumprodFG - Sum-Product algorithm on a Factor Graph represented by A
Specific Models
ARlds - Learn AR coefficients using a Linear Dynamical System
ARtrain - Fit autoregressive (AR) coefficients of order L to v.
BayesLinReg - Bayesian Linear Regression training using basis functions phi(x)
BayesLogRegressionRVM - Bayesian Logistic Regression with the Relevance Vector Machine
CanonVar - Canonical Variates (no post rotation of variates)
cca - canonical correlation analysis
covfnGE - Gamma Exponential Covariance Function
EMbeliefnet - train a Belief Network using Expectation Maximisation
EMminimizeKL - MDP deterministic policy solver Finds optimal actions
EMqTranMarginal - EM marginal transition in MDP
EMqUtilMarginal - Returns term proportional to the q marginal for the utility term
EMTotalBetaMessage - backward information needed to solve the MDP process using message passing
EMvalueTable - MDP solver calculates the value function of the MDP with the current policy
FA - Factor Analysis
GMMem - Fit a mixture of Gaussian to the data X using EM
GPclass - Gaussian Process Binary Classification
GPreg - Gaussian Process Regression
HebbML - Learn a sequence for a Hopfield Network
HMMbackward - HMM Backward Pass
HMMbackwardSAR - Backward Pass (beta method) for the Switching Autoregressive HMM
HMMem - EM algorithm for HMM
HMMforward - HMM Forward Pass
HMMforwardSAR - Switching Autoregressive HMM with switches updated only every Tskip timesteps
HMMgamma - HMM Posterior smoothing using the Rauch-Tung-Striebel correction method
HMMsmooth - Smoothing for a Hidden Markov Model (HMM)
HMMsmoothSAR - Switching Autoregressive HMM smoothing
HMMviterbi - Viterbi most likely joint hidden state of a HMM
kernel - A kernel evaluated at two points
Kmeans - K-means clustering algorithm
LDSbackward - Full Backward Pass for a Latent Linear Dynamical System (RTS correction method)
LDSbackwardUpdate - Single Backward update for a Latent Linear Dynamical System (RTS smoothing update)
LDSforward - Full Forward Pass for a Latent Linear Dynamical System (Kalman Filter)
LDSforwardUpdate - Single Forward update for a Latent Linear Dynamical System (Kalman Filter)
LDSsmooth - Linear Dynamical System : Filtering and Smoothing
LDSsubspace - Subspace Method for identifying Linear Dynamical System
LogReg - Learning Logistic Linear Regression Using Gradient Ascent (BATCH VERSION)
MIXprodBern - EM training of a Mixture of a product of Bernoulli distributions
mixMarkov - EM training for a mixture of Markov Models
NaiveBayesDirichletTest - Naive Bayes prediction having used a Dirichlet prior for training
NaiveBayesDirichletTrain - Naive Bayes training using a Dirichlet prior
NaiveBayesTest - Test Naive Bayes Bernoulli Distribution after Max Likelihood training
NaiveBayesTrain - Train Naive Bayes Bernoulli Distribution using Max Likelihood
nearNeigh - Nearest Neighbour classification
pca - Principal Components Analysis
plsa - Probabilistic Latent Semantic Analysis
plsaCond - Conditional PLSA (Probabilistic Latent Semantic Analysis)
rbf - Radial Basis function output
SARlearn - EM training of a Switching AR model
SLDSbackward - Backward pass using a Mixture of Gaussians
SLDSforward - Switching Latent Linear Dynamical System Gaussian Sum forward pass
SLDSmargGauss - compute the single Gaussian from a weighted SLDS mixture
softloss - Soft loss function
svdm - Singular Value Decomposition with missing values
SVMtrain - train a Support vector Machine
General
argmax - performs argmax returning the index and value
assign - Assigns values to variables
betaXbiggerY - p(x>y) for x~Beta(a,b), y~Beta(c,d)
bar3zcolor - Plot a 3D bar plot of the matrix Z
avsigmaGauss - Average of a logistic sigmoid under a Gaussian
cap - Cap x at absolute value c
chi2test - inverse of the chi square cumulative density
count - for a data matrix (each column is a datapoint), return the state counts
condexp - Compute normalised p proportional to exp(logp);
condp - Make a conditional distribution from the matrix
dirrnd - Samples from a Dirichlet distribution
field2cell - Place the field of a structure in a cell
GaussCond - Return the mean and covariance of a conditioned Gaussian
hinton - Plot a Hinton diagram
ind2subv - Subscript vector from linear index
ismember_sorted - True for member of sorted set
lengthcell - Length of each cell entry
logdet - Log determinant of a positive definite matrix computed in a numerically stable manner
logeps - log(x+eps)
logGaussGamma - unnormalised log of the Gauss-Gamma distribution
logsumexp - Compute log(sum(exp(a).*b)) valid for large a
logZdirichlet - Log Normalisation constant of a Dirichlet distribution with parameter u
majority - Return majority values in each column on a matrix
maxarray - Maximise a multi-dimensional array over a set of dimensions
maxNarray - Find the highest values and states of an array over a set of dimensions
mix2mix - Fit a mixture of Gaussians with another mixture of Gaussians
mvrandn - Samples from a multi-variate Normal(Gaussian) distribution
mygamrnd - Gamma random variate generator
mynanmean - mean of values that are not nan
mynansum - sum of values that are not nan
mynchoosek - binomial coefficient v choose k
myones - same as ones(x), but if x is a scalar, interprets as ones([x 1])
myrand - same as rand(x) but if x is a scalar interprets as rand([x 1])
myzeros - same as zeros(x) but if x is a scalar interprets as zeros([x 1])
normp - Make a normalised distribution from an array
randgen - Generates discrete random variables given the pdf
replace - Replace instances of a value with another value
sigma - 1./(1+exp(-x))
sigmoid - 1./(1+exp(-beta*x))
sqdist - Square distance between vectors in x and y
subv2ind - Linear index from subscript vector.
sumlog - sum(log(x)) with a cutoff at 10e-200
Miscellaneous
compat - Compatibility of object F being in position h for image v on grid Gx,Gy
logp - The logarithm of a specific non-Gaussian distribution
placeobject - Place the object F at position h in grid Gx,Gy
plotCov - return points for plotting an ellipse of a covariance
pointsCov - unit variance contours of a 2D Gaussian with mean m and covariance S
setup - run me at initialisation checks for bugs in matlab and initialises path
validgridposition - Returns 1 if point is on a defined grid
Notation List II
Preface II
BRML toolbox VII
Contents XI

I Inference in Probabilistic Models

1 Probabilistic Reasoning 7
1.1 Probability Refresher 7
1.1.1 Interpreting Conditional Probability 9
1.1.2 Probability Tables 12
1.2 Probabilistic Reasoning 12
1.3 Prior, Likelihood and Posterior 18
1.3.1 Two dice : what were the individual scores? 19
1.4 Summary 20
1.5 Code 20
1.5.1 Basic Probability code 20
1.5.2 General utilities 21
1.5.3 An example 22
1.6 Exercises 22
2 Basic Graph Concepts 25
2.1 Graphs 25
2.2 Numerically Encoding Graphs 27
2.2.1 Edge list 27
2.2.2 Adjacency matrix 28
2.2.3 Clique matrix 28
2.3 Summary 29
2.4 Code 29
2.4.1 Utility routines 29
2.5 Exercises 30
3 Belief Networks 31
3.1 The Benefits of Structure 31
3.1.1 Modelling independencies 32
3.1.2 Reducing the burden of specification 34
3.2 Uncertain and Unreliable Evidence 35
3.2.1 Uncertain evidence 35
3.2.2 Unreliable evidence 37
3.3 Belief Networks 38
3.3.1 Conditional independence 39
3.3.2 The impact of collisions 40
3.3.3 Graphical path manipulations for independence 43
3.3.4 d-Separation 43
3.3.5 Graphical and distributional in/dependence 43
3.3.6 Markov equivalence in belief networks 45
3.3.7 Belief networks have limited expressibility 46
3.4 Causality 47
3.4.1 Simpson’s paradox 47
3.4.2 The do-calculus 49
3.4.3 Influence diagrams and the do-calculus 49
3.5 Summary 50
3.6 Code 50
3.6.1 Naive inference demo 50
3.6.2 Conditional independence demo 50
3.6.3 Utility routines 51
3.7 Exercises 51
4 Graphical Models 57
4.1 Graphical Models 57
4.2 Markov Networks 58
4.2.1 Markov properties 59
4.2.2 Markov random fields 60
4.2.3 Hammersley-Clifford Theorem 61
4.2.4 Conditional independence using Markov networks 63
4.2.5 Lattice Models 63
4.3 Chain Graphical Models 65
4.4 Factor Graphs 67
4.4.1 Conditional independence in factor graphs 68
4.5 Expressiveness of Graphical Models 68
4.6 Summary 70
4.7 Code 71
4.8 Exercises 71
5 Efficient Inference in Trees 75
5.1 Marginal Inference 75
5.1.1 Variable elimination in a Markov chain and message passing 75
5.1.2 The sum-product algorithm on factor graphs 78
5.1.3 Dealing with Evidence 81
5.1.4 Computing the marginal likelihood 81
5.1.5 The problem with loops 83
5.2 Other Forms of Inference 83
5.2.1 Max-Product 83
5.2.2 Finding the N most probable states 85
5.2.3 Most probable path and shortest path 87
5.2.4 Mixed inference 89
5.3 Inference in Multiply Connected Graphs 89
5.3.1 Bucket elimination 90
5.3.2 Loop-cut conditioning 91
5.4 Message Passing for Continuous Distributions 92
5.5 Summary 92
5.6 Code 93
5.6.1 Factor graph examples 93
5.6.2 Most probable and shortest path 93
5.6.3 Bucket elimination 94
5.6.4 Message passing on Gaussians 94
5.7 Exercises 94
6 The Junction Tree Algorithm 97
6.1 Clustering Variables 97
6.1.1 Reparameterisation 97
6.2 Clique Graphs 98
6.2.1 Absorption 99
6.2.2 Absorption schedule on clique trees 100
6.3 Junction Trees 101
6.3.1 The running intersection property 102
6.4 Constructing a Junction Tree for Singly-Connected Distributions 104
6.4.1 Moralisation 104
6.4.2 Forming the clique graph 104
6.4.3 Forming a junction tree from a clique graph 104
6.4.4 Assigning potentials to cliques 105
6.5 Junction Trees for Multiply-Connected Distributions 105
6.5.1 Triangulation algorithms 107
6.6 The Junction Tree Algorithm 108
6.6.1 Remarks on the JTA 109
6.6.2 Computing the normalisation constant of a distribution 110
6.6.3 The marginal likelihood 111
6.6.4 Some small JTA examples 111
6.6.5 Shafer-Shenoy propagation 113
6.7 Finding the Most Likely State 113
6.8 Reabsorption : Converting a Junction Tree to a Directed Network 114
6.9 The Need For Approximations 115
6.9.1 Bounded width junction trees 115
6.10 Summary 116
6.11 Code 116
6.11.1 Utility routines 116
6.12 Exercises 117
7 Making Decisions 121
7.1 Expected Utility 121
7.1.1 Utility of money 121
7.2 Decision Trees 122
7.3 Extending Bayesian Networks for Decisions 125
7.3.1 Syntax of influence diagrams 125
7.4 Solving Influence Diagrams 129
7.4.1 Messages on an ID 130
7.4.2 Using a junction tree 130
7.5 Markov Decision Processes 133
7.5.1 Maximising expected utility by message passing 134
7.5.2 Bellman’s equation 135
7.6 Temporally Unbounded MDPs 136
7.6.1 Value iteration 136
7.6.2 Policy iteration 137
7.6.3 A curse of dimensionality 137
7.7 Variational Inference and Planning 138
7.8 Financial Matters 139
7.8.1 Options pricing and expected utility 140
7.8.2 Binomial options pricing model 141
7.8.3 Optimal investment 142
7.9 Further Topics 144
7.9.1 Partially observable MDPs 144
7.9.2 Reinforcement learning 144
7.10 Summary 146
7.11 Code 147
7.11.1 Sum/Max under a partial order 147
7.11.2 Junction trees for influence diagrams 147
7.11.3 Party-Friend example 148
7.11.4 Chest Clinic with Decisions 148
7.11.5 Markov decision processes 148
7.12 Exercises 149
II Learning in Probabilistic Models 153

8 Statistics for Machine Learning 157
8.1 Representing Data 157
8.1.1 Categorical 157
8.1.2 Ordinal 157
8.1.3 Numerical 157
8.2 Distributions 158
8.2.1 The Kullback-Leibler Divergence KL(q|p) 161
8.2.2 Entropy and information 162
8.3 Classical Distributions 163
8.4 Multivariate Gaussian 168
8.4.1 Completing the square 169
8.4.2 Conditioning as system reversal 170
8.4.3 Whitening and centering 171
8.5 Exponential Family 171
8.5.1 Conjugate priors 172
8.6 Learning distributions 172
8.7 Properties of Maximum Likelihood 174
8.7.1 Training assuming the correct model class 175
8.7.2 Training when the assumed model is incorrect 175
8.7.3 Maximum likelihood and the empirical distribution 176
8.8 Learning a Gaussian 176
8.8.1 Maximum likelihood training 176
8.8.2 Bayesian inference of the mean and variance 177
8.8.3 Gauss-Gamma distribution 179
8.9 Summary 179
8.10 Code 180
8.11 Exercises 180
9 Learning as Inference 191
9.1 Learning as Inference 191
9.1.1 Learning the bias of a coin 191
9.1.2 Making decisions 192
9.1.3 A continuum of parameters 193
9.1.4 Decisions based on continuous intervals 194
9.2 Bayesian methods and ML-II 195
9.3 Maximum Likelihood Training of Belief Networks 196
9.4 Bayesian Belief Network Training 199
9.4.1 Global and local parameter independence 199
9.4.2 Learning binary variable tables using a Beta prior 200
9.4.3 Learning multivariate discrete tables using a Dirichlet prior 202
9.5 Structure learning 205
9.5.1 PC algorithm 206
9.5.2 Empirical independence 207
9.5.3 Network scoring 209
9.5.4 Chow-Liu Trees 211
9.6 Maximum Likelihood for Undirected models 213
9.6.1 The likelihood gradient 213
9.6.2 General tabular clique potentials 214
9.6.3 Decomposable Markov networks 215
9.6.4 Exponential form potentials 220
9.6.5 Conditional random fields 221
9.6.6 Pseudo likelihood 224
9.6.7 Learning the structure 224
9.7 Summary 224
9.8 Code 225
9.8.1 PC algorithm using an oracle 225
9.8.2 Demo of empirical conditional independence 225
9.8.3 Bayes Dirichlet structure learning 225
9.9 Exercises 226
10 Naive Bayes 229
10.1 Naive Bayes and Conditional Independence 229
10.2 Estimation using Maximum Likelihood 230
10.2.1 Binary attributes 230
10.2.2 Multi-state variables 233
10.2.3 Text classification 234
10.3 Bayesian Naive Bayes 234
10.4 Tree Augmented Naive Bayes 236
10.4.1 Learning tree augmented Naive Bayes networks 236
10.5 Summary 237
10.6 Code 237
10.7 Exercises 237
11 Learning with Hidden Variables 241
11.1 Hidden Variables and Missing Data 241
11.1.1 Why hidden/missing variables can complicate proceedings 241
11.1.2 The missing at random assumption 242
11.1.3 Maximum likelihood 243
11.1.4 Identifiability issues 244
11.2 Expectation Maximisation 244
11.2.1 Variational EM 244
11.2.2 Classical EM 246
11.2.3 Application to Belief networks 248
11.2.4 General case 250
11.2.5 Convergence 253
11.2.6 Application to Markov networks 253
11.3 Extensions of EM 253
11.3.1 Partial M step 253
11.3.2 Partial E-step 253
11.4 A failure case for EM 255
11.5 Variational Bayes 256
11.5.1 EM is a special case of variational Bayes 258
11.5.2 An example: VB for the Asbestos-Smoking-Cancer network 258
11.6 Optimising the Likelihood by Gradient Methods 261
11.6.1 Undirected models 261
11.7 Summary 262
11.8 Code 262
11.9 Exercises 262
12 Bayesian Model Selection 267
12.1 Comparing Models the Bayesian Way 267
12.2 Illustrations : coin tossing 268
12.2.1 A discrete parameter space 268
12.2.2 A continuous parameter space 269
12.3 Occam’s Razor and Bayesian Complexity Penalisation 270
12.4 A continuous example : curve fitting 273
12.5 Approximating the Model Likelihood 274
12.5.1 Laplace’s method 275
12.5.2 Bayes information criterion (BIC) 275
12.6 Bayesian Hypothesis Testing for Outcome Analysis 276
12.6.1 Outcome analysis 276
12.6.2 Hindep : model likelihood 277
12.6.3 Hsame : model likelihood 278
12.6.4 Dependent outcome analysis 279
12.6.5 Is classifier A better than B? 280
12.7 Summary 281
12.8 Code 282
12.9 Exercises 282
III Machine Learning 287

13 Machine Learning Concepts 291
13.1 Styles of Learning 291
13.1.1 Supervised learning 291
13.1.2 Unsupervised learning 292
13.1.3 Anomaly detection 293
13.1.4 Online (sequential) learning 293
13.1.5 Interacting with the environment 293
13.1.6 Semi-supervised learning 294
13.2 Supervised Learning 294
13.2.1 Utility and Loss 294
13.2.2 Using the empirical distribution 295
13.2.3 Bayesian decision approach 298
13.3 Bayes versus Empirical Decisions 302
13.4 Summary 303
13.5 Exercises 303
14 Nearest Neighbour Classification 305
14.1 Do As Your Neighbour Does 305
14.2 K-Nearest Neighbours 306
14.3 A Probabilistic Interpretation of Nearest Neighbours 308
14.3.1 When your nearest neighbour is far away 309
14.4 Summary 309
14.5 Code 309
14.6 Exercises 309
15 Unsupervised Linear Dimension Reduction 311
15.1 High-Dimensional Spaces – Low Dimensional Manifolds 311
15.2 Principal Components Analysis 311
15.2.1 Deriving the optimal linear reconstruction 312
15.2.2 Maximum variance criterion 314
15.2.3 PCA algorithm 314
15.2.4 PCA and nearest neighbours classification 316
15.2.5 Comments on PCA 316
15.3 High Dimensional Data 317
15.3.1 Eigen-decomposition for N < D 318
15.3.2 PCA via Singular value decomposition 318
15.4 Latent Semantic Analysis 319
15.4.1 Information retrieval 320
15.5 PCA With Missing Data 321
15.5.1 Finding the principal directions 323
15.5.2 Collaborative filtering using PCA with missing data 324
15.6 Matrix Decomposition Methods 324
15.6.1 Probabilistic latent semantic analysis 325
15.6.2 Extensions and variations 328
15.6.3 Applications of PLSA/NMF 329
15.7 Kernel PCA 330
15.8 Canonical Correlation Analysis 332
15.8.1 SVD formulation 333
15.9 Summary 334
15.10 Code 334
15.11 Exercises 334
16 Supervised Linear Dimension Reduction 337
16.1 Supervised Linear Projections 337
16.2 Fisher’s Linear Discriminant 337
16.3 Canonical Variates 339
16.3.1 Dealing with the nullspace 341
16.4 Summary 342
16.5 Code 342
16.6 Exercises 342
17 Linear Models 345
17.1 Introduction: Fitting A Straight Line 345
17.2 Linear Parameter Models for Regression 346
17.2.1 Vector outputs 348
17.2.2 Regularisation 348
17.2.3 Radial basis functions 350
17.3 The Dual Representation and Kernels 351
17.3.1 Regression in the dual-space 352
17.4 Linear Parameter Models for Classification 352
17.4.1 Logistic regression 353
17.4.2 Beyond first order gradient ascent 357
17.4.3 Avoiding overconfident classification 357
17.4.4 Multiple classes 358
17.4.5 The Kernel Trick for Classification 358
17.5 Support Vector Machines 359
17.5.1 Maximum margin linear classifier 359
17.5.2 Using kernels 361
17.5.3 Performing the optimisation 362
17.5.4 Probabilistic interpretation 362
17.6 Soft Zero-One Loss for Outlier Robustness 362
17.7 Summary 363
17.8 Code 364
17.9 Exercises 364
18 Bayesian Linear Models 367
18.1 Regression With Additive Gaussian Noise 367
18.1.1 Bayesian linear parameter models 368
18.1.2 Determining hyperparameters: ML-II 369
18.1.3 Learning the hyperparameters using EM 370
18.1.4 Hyperparameter optimisation : using the gradient 371
18.1.5 Validation likelihood 373
18.1.6 Prediction and model averaging 373
18.1.7 Sparse linear models 374
18.2 Classification 375
18.2.1 Hyperparameter optimisation 376
18.2.2 Laplace approximation 376
18.2.3 Variational Gaussian approximation 379
18.2.4 Local variational approximation 380
18.2.5 Relevance vector machine for classification 381
18.2.6 Multi-class case 381
18.3 Summary 382
18.4 Code 382
18.5 Exercises 383
19 Gaussian Processes 385
19.1 Non-Parametric Prediction 385
19.1.1 From parametric to non-parametric 385
19.1.2 From Bayesian linear models to Gaussian processes 386
19.1.3 A prior on functions 387
19.2 Gaussian Process Prediction 388
19.2.1 Regression with noisy training outputs 388
19.3 Covariance Functions 390
19.3.1 Making new covariance functions from old 391
19.3.2 Stationary covariance functions 391
19.3.3 Non-stationary covariance functions 393
19.4 Analysis of Covariance Functions 393
19.4.1 Smoothness of the functions 393
19.4.2 Mercer kernels 394
19.4.3 Fourier analysis for stationary kernels 395
19.5 Gaussian Processes for Classification 396
19.5.1 Binary classification 396
19.5.2 Laplace’s approximation 397
19.5.3 Hyperparameter optimisation 399
19.5.4 Multiple classes 400
19.6 Summary 400
19.7 Code 400
19.8 Exercises 401
20 Mixture Models 403
20.1 Density Estimation Using Mixtures 403
20.2 Expectation Maximisation for Mixture Models 404
20.2.1 Unconstrained discrete tables 405
20.2.2 Mixture of product of Bernoulli distributions 407
20.3 The Gaussian Mixture Model 409
20.3.1 EM algorithm 409
20.3.2 Practical issues 412
20.3.3 Classification using Gaussian mixture models 413
20.3.4 The Parzen estimator 414
20.3.5 K-Means 415
20.3.6 Bayesian mixture models 415
20.3.7 Semi-supervised learning 416
20.4 Mixture of Experts 416
20.5 Indicator Models 417
20.5.1 Joint indicator approach: factorised prior 417
20.5.2 Polya prior 418
20.6 Mixed Membership Models 419
20.6.1 Latent Dirichlet allocation 419
20.6.2 Graph based representations of data 421
20.6.3 Dyadic data 421
20.6.4 Monadic data 422
20.6.5 Cliques and adjacency matrices for monadic binary data 423
20.7 Summary 426
20.8 Code 426
20.9 Exercises 427
21 Latent Linear Models 429
21.1 Factor Analysis 429
21.1.1 Finding the optimal bias 431
21.2 Factor Analysis : Maximum Likelihood 431
21.2.1 Eigen-approach likelihood optimisation 432
21.2.2 Expectation maximisation 434
21.3 Interlude: Modelling Faces 436
21.4 Probabilistic Principal Components Analysis 438
21.5 Canonical Correlation Analysis and Factor Analysis 439
21.6 Independent Components Analysis 440
21.7 Summary 442
21.8 Code 442
21.9 Exercises 442
22 Latent Ability Models 445
22.1 The Rasch Model 445
22.1.1 Maximum likelihood training 445
22.1.2 Bayesian Rasch models 446
22.2 Competition Models 447
22.2.1 Bradley-Terry-Luce model 447
22.2.2 Elo ranking model 448
22.2.3 Glicko and TrueSkill 448
22.3 Summary 449
22.4 Code 449
22.5 Exercises 449
IV Dynamical Models 451

23 Discrete-State Markov Models 455
23.1 Markov Models 455
23.1.1 Equilibrium and stationary distribution of a Markov chain 456
23.1.2 Fitting Markov models 457
23.1.3 Mixture of Markov models 458
23.2 Hidden Markov Models 460
23.2.1 The classical inference problems 460
23.2.2 Filtering p(ht|v1:t) 461
23.2.3 Parallel smoothing p(ht|v1:T) 462
23.2.4 Correction smoothing 462
23.2.5 Sampling from p(h1:T|v1:T) 464
23.2.6 Most likely joint state 464
23.2.7 Prediction 465
23.2.8 Self localisation and kidnapped robots 466
23.2.9 Natural language models 468
23.3 Learning HMMs 468
23.3.1 EM algorithm 468
23.3.2 Mixture emission 470
23.3.3 The HMM-GMM 470
23.3.4 Discriminative training 471
23.4 Related Models 471
23.4.1 Explicit duration model 471
23.4.2 Input-Output HMM 472
23.4.3 Linear chain CRFs 473
23.4.4 Dynamic Bayesian networks 474
23.5 Applications 474
23.5.1 Object tracking 474
23.5.2 Automatic speech recognition 474
23.5.3 Bioinformatics 475
23.5.4 Part-of-speech tagging 475
23.6 Summary 475
23.7 Code 476
23.8 Exercises 476
24 Continuous-state Markov Models 483
24.1 Observed Linear Dynamical Systems 483
24.1.1 Stationary distribution with noise 484
24.2 Auto-Regressive Models 485
24.2.1 Training an AR model 486
24.2.2 AR model as an OLDS 486
24.2.3 Time-varying AR model 487
24.2.4 Time-varying variance AR models 488
24.3 Latent Linear Dynamical Systems 489
24.4 Inference 490
24.4.1 Filtering 492
24.4.2 Smoothing : Rauch-Tung-Striebel correction method 494
24.4.3 The likelihood 495
24.4.4 Most likely state 496
24.4.5 Time independence and Riccati equations 496
24.5 Learning Linear Dynamical Systems 497
24.5.1 Identifiability issues 497
24.5.2 EM algorithm 498
24.5.3 Subspace Methods 499
24.5.4 Structured LDSs 500
24.5.5 Bayesian LDSs 500
24.6 Switching Auto-Regressive Models 500
24.6.1 Inference 501
24.6.2 Maximum likelihood learning using EM 501
24.7 Summary 502
24.8 Code 503
24.8.1 Autoregressive models 503
24.9 Exercises 504
25 Switching Linear Dynamical Systems 507
25.1 Introduction 507
25.2 The Switching LDS 507
25.2.1 Exact inference is computationally intractable 508
25.3 Gaussian Sum Filtering 508
25.3.1 Continuous filtering 509
25.3.2 Discrete filtering 511
25.3.3 The likelihood p(v1:T) 511
25.3.4 Collapsing Gaussians 511
25.3.5 Relation to other methods 512
25.4 Gaussian Sum Smoothing 512
25.4.1 Continuous smoothing 514
25.4.2 Discrete smoothing 514
25.4.3 Collapsing the mixture 514
25.4.4 Using mixtures in smoothing 515
25.4.5 Relation to other methods 516
25.5 Reset Models 518
25.5.1 A Poisson reset model 520
25.5.2 Reset-HMM-LDS 521
25.6 Summary 522
25.7 Code 522
25.8 Exercises 522
26 Distributed Computation 525
26.1 Introduction 525
26.2 Stochastic Hopfield Networks 525
26.3 Learning Sequences 526
26.3.1 A single sequence 526
26.3.2 Multiple sequences 531
26.3.3 Boolean networks 532
26.3.4 Sequence disambiguation 532
26.4 Tractable Continuous Latent Variable Models 532
26.4.1 Deterministic latent variables 532
26.4.2 An augmented Hopfield network 534
26.5 Neural Models 535
26.5.1 Stochastically spiking neurons 535
26.5.2 Hopfield membrane potential 535
26.5.3 Dynamic synapses 536
26.5.4 Leaky integrate and fire models 537
26.6 Summary 537
26.7 Code 537
26.8 Exercises 538
V Approximate Inference 539

27 Sampling 543
27.1 Introduction 543
27.1.1 Univariate sampling 544
27.1.2 Rejection sampling 545
27.1.3 Multivariate sampling 546
27.2 Ancestral Sampling 548
27.2.1 Dealing with evidence 548
27.2.2 Perfect sampling for a Markov network 549
27.3 Gibbs Sampling 549
27.3.1 Gibbs sampling as a Markov chain 550
27.3.2 Structured Gibbs sampling 551
27.3.3 Remarks 551
27.4 Markov Chain Monte Carlo (MCMC) 552
27.4.1 Markov chains 553
27.4.2 Metropolis-Hastings sampling 553
27.5 Auxiliary Variable Methods 555
27.5.1 Hybrid Monte Carlo 555
27.5.2 Swendson-Wang 557
27.5.3 Slice sampling 559
27.6 Importance Sampling 560
27.6.1 Sequential importance sampling 562
27.6.2 Particle filtering as an approximate forward pass 563
27.7 Summary 565
27.8 Code 565
27.9 Exercises 566
28 Deterministic Approximate Inference 569
28.1 Introduction 569
28.2 The Laplace approximation 569
28.3 Properties of Kullback-Leibler Variational Inference 570
28.3.1 Bounding the normalisation constant 570
28.3.2 Bounding the marginal likelihood 570
28.3.3 Bounding marginal quantities 571
28.3.4 Gaussian approximations using KL divergence 571
28.3.5 Marginal and moment matching properties of minimising KL(p|q) 572
28.4 Variational Bounding Using KL(q|p) 573
28.4.1 Pairwise Markov random field 573
28.4.2 General mean field equations 576
28.4.3 Asynchronous updating guarantees approximation improvement 576
28.4.4 Structured variational approximation 577
28.5 Local and KL Variational Approximations 579
28.5.1 Local approximation 580
28.5.2 KL variational approximation 580
28.6 Mutual Information Maximisation : A KL Variational Approach 581
28.6.1 The information maximisation algorithm 582
28.6.2 Linear Gaussian decoder 583
28.7 Loopy Belief Propagation 584
28.7.1 Classical BP on an undirected graph 584
28.7.2 Loopy BP as a variational procedure 585
28.8 Expectation Propagation 587
28.9 MAP for Markov networks 590
28.9.1 Pairwise Markov networks 592
28.9.2 Attractive binary Markov networks 593
28.9.3 Potts model 595
28.10 Further Reading 596
28.11 Summary 596
28.12 Code 597
28.13 Exercises 597
29 Background Mathematics 603
29.1 Linear Algebra 603
29.1.1 Vector algebra 603
29.1.2 The scalar product as a projection 604
29.1.3 Lines in space 604
29.1.4 Planes and hyperplanes 604
29.1.5 Matrices 605
29.1.6 Linear transformations 606
29.1.7 Determinants 606
29.1.8 Matrix inversion 607
29.1.9 Computing the matrix inverse 608
29.1.10 Eigenvalues and eigenvectors 608
29.1.11 Matrix decompositions 609
29.2 Multivariate Calculus 610
29.2.1 Interpreting the gradient vector 611
29.2.2 Higher derivatives 611
29.2.3 Matrix calculus 612
29.3 Inequalities 612
29.3.1 Convexity 612
29.3.2 Jensen's inequality 613
29.4 Optimisation 613
29.5 Multivariate Optimisation 613
29.5.1 Gradient descent with fixed stepsize 614
29.5.2 Gradient descent with line searches 614
29.5.3 Minimising quadratic functions using line search 615
29.5.4 Gram-Schmidt construction of conjugate vectors 615
29.5.5 The conjugate vectors algorithm 616
29.5.6 The conjugate gradients algorithm 616
29.5.7 Newton's method 617
29.6 Constrained Optimisation using Lagrange Multipliers 619
29.6.1 Lagrange Dual 619
Part I
Inference in Probabilistic Models
Introduction to Part I
Probabilistic models explicitly take into account uncertainty and deal with our imperfect knowledge of the world. Such models are of fundamental significance in Machine Learning since our understanding of the world will always be limited by our observations and understanding. We will focus initially on using probabilistic models as a kind of expert system.

In Part I, we assume that the model is fully specified. That is, given a model of the environment, how can we use it to answer questions of interest? We will relate the complexity of inferring quantities of interest to the structure of the graph describing the model. In addition, we will describe operations in terms of manipulations on the corresponding graphs. As we will see, provided the graphs are simple tree-like structures, most quantities of interest can be computed efficiently.

Part I deals with manipulating mainly discrete variable distributions and forms the background to all the later material in the book.
[Chart: a taxonomy of graphical models – directed factor graphs; Bayesian networks (dynamic Bayes nets, chains, HMM, LDS); latent variable models (discrete mixture models, clustering; continuous dimension reduction, over-complete representations); influence diagrams (strong JT, decision theory); chain graphs; undirected graphs (Markov networks, input dependent CRF, pairwise models such as the Boltzmann machine (discrete) and Gaussian Process (continuous)); clique graphs and junction trees.]

[Chart: graphical models and associated (marginal) inference methods, with specific inference methods highlighted in red. Multiply connected graphs: if decomposable with small cliques, message passing is tractable (JTA with absorption or Shafer-Shenoy); if the cliques are big or the messages intractable, approximations are required (cutset conditioning is possible but inefficient), with tractable special cases including Gaussian models, binary MRF-MAP with attractive pure interactions, and planar binary pure-interaction MRFs. Singly connected graphs: if message updates are tractable, sum/max product applies; if message updates are intractable, approximations such as EP are required (bucket elimination is possible but inefficient).]

Loosely speaking, provided the graph corresponding to the model is singly-connected, most of the standard (marginal) inference methods are tractable. Multiply-connected graphs are generally more problematic, although there are special cases which remain tractable.
1 Probabilistic Reasoning

1.1 Probability Refresher
Variables, States and Notational Shortcuts
Variables will be denoted using either upper case X or lower case x, and a set of variables will typically be denoted by a calligraphic symbol, for example V = {a, B, c}.
The domain of a variable x is written dom(x), and denotes the states x can take. States will typically be represented using sans-serif font. For example, for a coin c, dom(c) = {heads, tails} and p(c = heads) represents the probability that variable c is in state heads. The meaning of p(state) will often be clear, without specific reference to a variable. For example, if we are discussing an experiment about a coin c, the meaning of p(heads) is clear from the context, being shorthand for p(c = heads). When summing over a variable ∑_x f(x), the interpretation is that all states of x are included, i.e. ∑_x f(x) ≡ ∑_{s∈dom(x)} f(x = s). Given a variable, x, its domain dom(x) and a full specification of the probability values for each of the variable states, p(x), we have a distribution for x. Sometimes we will not fully specify the distribution, only certain properties, such as for variables x, y, p(x, y) = p(x)p(y) for some unspecified p(x) and p(y). When clarity on this is required we will say distributions with structure p(x)p(y), or a distribution class p(x)p(y).

For our purposes, events are expressions about random variables, such as Two heads in 6 coin tosses. Two events are mutually exclusive if they cannot both be true. For example the events The coin is heads and The coin is tails are mutually exclusive. One can think of defining a new variable named by the event so, for example, p(The coin is tails) can be interpreted as p(The coin is tails = true). We use the shorthand p(x = tr) for the probability of event/variable x being in the state true and p(x = fa) for the probability of variable x being in the state false.
Definition 1.1 (Rules of Probability for Discrete Variables)

The probability p(x = x) of variable x being in state x is represented by a value between 0 and 1. p(x = x) = 1 means that we are certain x is in state x. Conversely, p(x = x) = 0 means that we are certain x is not in state x. Values between 0 and 1 represent the degree of certainty of state occupancy.

The summation of the probability over all the states is 1:

    ∑_{x∈dom(x)} p(x = x) = 1

We will use the shorthand p(x, y) for p(x and y). Note that p(y, x) = p(x, y) and p(x or y) = p(y or x). For mutually exclusive events x and y, p(x or y) = p(x) + p(y); more generally, p(x or y) = p(x) + p(y) − p(x and y).
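These rules are easily verified numerically. The following plain MATLAB fragment (the joint table values are made up for illustration) checks normalisation and computes the marginals of a joint distribution over two binary variables:

% A minimal numerical check of the rules above.
% Joint distribution p(x,y) for binary x (rows) and binary y (columns):
pxy = [0.3 0.1;
       0.2 0.4];
sum(pxy(:))          % summation over all joint states gives 1
px = sum(pxy, 2)     % marginal p(x): sum over the states of y
py = sum(pxy, 1)     % marginal p(y): sum over the states of x
% p(x,y) = p(y,x): the joint assigns one number per joint state,
% irrespective of the order in which we list the variables.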
Definition 1.2 (Set notation) An alternative notation in terms of set theory is to write

    p(x ∩ y) ≡ p(x, y),    p(x ∪ y) ≡ p(x or y)

Definition 1.3 (Marginals) Given a joint distribution p(x, y), the distribution of a single variable is called a marginal and is obtained by summing over the states of the other variable:

    p(x) = ∑_y p(x, y)

Definition 1.4 (Conditional Probability / Bayes' Rule) The probability of event x conditioned on knowing event y (or, more shortly, the probability of x given y) is defined as

    p(x|y) ≡ p(x, y) / p(y)

provided p(y) > 0; if p(y) = 0 then p(x|y) is not defined. Rearranging, and using p(x, y) = p(y|x)p(x), gives Bayes' rule p(x|y) = p(y|x)p(x)/p(y). Since Bayes' rule trivially follows from the definition of conditional probability, we will sometimes be loose in our language and use the terms Bayes' rule and conditional probability as synonymous.

As we shall see throughout this book, Bayes' rule plays a central role in probabilistic reasoning since it helps us 'invert' probabilistic relationships, translating between p(y|x) and p(x|y).
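A small numerical illustration of this inversion, using made-up numbers for a disease/test scenario:

% Bayes' rule: from p(y|x) and p(x) to p(x|y).
px    = [0.01 0.99];        % p(x): x=1 (disease), x=2 (no disease)
py_gx = [0.9 0.1;           % p(y|x): rows index y (test +/-), columns index x
         0.1 0.9];
py    = py_gx * px';        % p(y) = sum_x p(y|x)p(x)
pyx   = py_gx .* repmat(px, 2, 1);     % joint p(y,x) = p(y|x)p(x)
px_gy = pyx ./ repmat(py, 1, 2);       % p(x|y): rows index y, columns x
px_gy(1,1)                  % p(x=disease | y=positive), approximately 0.083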
Definition 1.5 (Probability Density Functions) For a continuous variable x, the probability density f(x) is defined such that

    f(x) ≥ 0,    ∫_{−∞}^{∞} f(x) dx = 1

and the probability that x falls in an interval [a, b] is given by

    p(a ≤ x ≤ b) = ∫_a^b f(x) dx

As shorthand we will sometimes write ∫_x f(x), particularly when we want an expression to be valid for either continuous or discrete variables. The multivariate case is analogous, with integration over all real space, and the probability that x belongs to a region of the space defined accordingly. Unlike probabilities, probability densities can take positive values greater than 1.

Formally speaking, for a continuous variable, one should not speak of the probability that x = 0.2 since the probability of a single value is always zero. However, we shall often write p(x) for continuous variables, thus not distinguishing between probabilities and probability density function values. Whilst this may appear strange, the nervous reader may simply replace our p(x) notation for ∫_{x∈∆} f(x) dx, where ∆ is a small region centred on x. This is well defined in a probabilistic sense and, in the limit of ∆ being very small, this would give approximately ∆f(x). If we consistently use the same ∆ for all occurrences of pdfs, then we will simply have a common prefactor ∆ in all expressions. Our strategy is to simply ignore these values (since in the end only relative probabilities will be relevant) and write p(x). In this way, all the standard rules of probability carry over, including Bayes' Rule.
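The ∆f(x) approximation is easy to verify numerically; here a standard Gaussian density is chosen arbitrarily for illustration:

% Probability of a small interval versus the Delta*f(x0) approximation.
x0 = 0.2; Delta = 1e-3;
f  = @(x) exp(-x.^2/2)/sqrt(2*pi);                % standard Gaussian pdf
exact  = integral(f, x0 - Delta/2, x0 + Delta/2); % p(x in small region)
approx = Delta * f(x0);                           % Delta*f(x0)
[exact approx]                                    % the two agree closely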
Remark 1.1 (Subjective Probability) Probability is a contentious topic and we do not wish to get bogged down by the debate here, apart from pointing out that it is not necessarily the rules of probability that are contentious, rather what interpretation we should place on them. In some cases potential repetitions of an experiment can be envisaged so that the 'long run' (or frequentist) definition of probability, in which probabilities are defined with respect to a potentially infinite repetition of experiments, makes sense. For example, in coin tossing, the probability of heads might be interpreted as 'If I were to repeat the experiment of flipping a coin (at "random"), the limit of the number of heads that occurred over the number of tosses is defined as the probability of a head occurring.'

Here's a problem that is typical of the kind of scenario one might face in a machine learning situation. A film enthusiast joins a new online film service. Based on expressing a few films a user likes and dislikes, the online company tries to estimate the probability that the user will like each of the 10000 films in their database. If we were to define probability as a limiting case of infinite repetitions of the same experiment, this wouldn't make much sense in this case since we can't repeat the experiment. However, if we assume that the user behaves in a manner consistent with other users, we should be able to exploit the large amount of data from other users' ratings to make a reasonable 'guess' as to what this consumer likes. This degree of belief or Bayesian subjective interpretation of probability sidesteps non-repeatability issues – it's just a framework for manipulating real values consistent with our intuition about probability [158].
1.1.1 Interpreting Conditional Probability

Conditional probability matches our intuitive understanding of uncertainty. For example, imagine a circular dart board, split into 20 equal sections, labelled from 1 to 20. Randy, a dart thrower, hits any one of the 20 sections uniformly at random. Hence the probability that a dart thrown by Randy occurs in any one of the 20 regions is p(region i) = 1/20. A friend of Randy tells him that he hasn't hit the 20 region. What is the probability that Randy has hit the 5 region? Conditioned on this information, only regions 1 to 19 remain possible and, since there is no preference for Randy to hit any of these regions, the probability is 1/19. The conditioning is consistent with the definition of conditional probability:

    p(region 5 | not region 20) = p(region 5, not region 20) / p(not region 20) = (1/20) / (19/20) = 1/19

giving the intuitive result. An important point to clarify is that p(A = a|B = b) should not be interpreted as 'Given the event B = b has occurred, p(A = a|B = b) is the probability of the event A = a occurring'. In most contexts, no such explicit temporal causality is implied and the correct interpretation should be 'p(A = a|B = b) is the probability of A being in state a under the constraint that B is in state b'.

The relation between the conditional p(A = a|B = b) and the joint p(A = a, B = b) is just a normalisation constant since p(A = a, B = b) is not a distribution in A – in other words, ∑_a p(A = a, B = b) ≠ 1. To make it a distribution we need to divide: p(A = a, B = b) / ∑_a p(A = a, B = b) which, when summed over a, does sum to 1. Indeed, this is just the definition of p(A = a|B = b).
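The dartboard conditioning can be carried out numerically in exactly this way, by zeroing the excluded state and renormalising:

% The dartboard example: conditioning as renormalisation.
p = ones(1, 20)/20;          % p(region i) = 1/20
p(20) = 0;                   % impose the constraint: region 20 not hit
p = p / sum(p);              % renormalise: the definition of conditioning
p(5)                         % 1/19, as argued above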
Definition 1.6 (Independence)

Variables x and y are independent if knowing the state (or value in the continuous case) of one variable gives no extra information about the other variable. Mathematically, this is expressed by

    p(x, y) = p(x)p(y)

Provided that p(x) ≠ 0 and p(y) ≠ 0, independence of x and y is equivalent to

    p(x|y) = p(x)  ⇔  p(y|x) = p(y)

If p(x|y) = p(x) for all states of x and y, then the variables x and y are said to be independent. If

    p(x, y) = k f(x) g(y)

for some constant k, and positive functions f(·) and g(·), then x and y are independent and we write x ⊥⊥ y.
Example 1.1 (Independence) Let x denote the day of the week in which females are born, and y denote the day in which males are born, with dom(x) = dom(y) = {1, . . . , 7}. It is reasonable to expect that x is independent of y. We randomly select a woman from the phone book, Alice, and find out that she was born on a Tuesday. We also randomly select a male at random, Bob. Before phoning Bob and asking him, what does knowing Alice's birth day add to which day we think Bob is born on? Under the independence assumption, the answer is nothing. Note that this doesn't mean that the distribution of Bob's birthday is necessarily uniform – it just means that knowing when Alice was born doesn't provide any extra information than we already knew about Bob's birthday, p(y|x) = p(y). Indeed, the distributions of birthdays p(y) and p(x) are non-uniform (statistically fewer babies are born on weekends), though there is nothing to suggest that x and y are dependent.
Deterministic Dependencies
Sometimes the concept of independence is perhaps a little strange Consider the following : variables x and
y are both binary (their domains consist of two states) We define the distribution such that x and y arealways both in a certain joint state:
Trang 35Probability Refresher
This may seem strange – we know for sure the relation between x and y, namely that they are always in the same joint state, yet they are independent. Since the distribution is trivially concentrated in a single joint state, knowing the state of x tells you nothing that you didn't anyway know about the state of y, and vice versa. This potential confusion comes from the term 'independent', which may suggest that there is no relation between the objects discussed. The best way to think about statistical independence is to ask whether or not knowing the state of variable y tells you something more than you knew before about variable x, where 'knew before' means working with the joint distribution p(x, y) to figure out what we can know about x, namely p(x).
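The independence test p(x, y) = p(x)p(y) is easily checked numerically. In the minimal Python sketch below (the helper function is our own illustration, not from the text), applying the test to the deterministic distribution above confirms that x and y are indeed independent:

import numpy as np

def is_independent(p_xy, tol=1e-12):
    # Compare the joint table with the outer product of its marginals.
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return np.allclose(p_xy, np.outer(p_x, p_y), atol=tol)

# The deterministic distribution: all mass on the single joint state (x = 1, y = 1).
p_deg = np.array([[1.0, 0.0],
                  [0.0, 0.0]])
print(is_independent(p_deg))  # True: knowing x adds nothing about y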
Definition 1.7 (Conditional Independence)
X ⊥⊥ Y | Z

denotes that the two sets of variables X and Y are independent of each other provided we know the state of the set of variables Z. For conditional independence, X and Y must be independent given all states of Z. Formally, this means that

p(X, Y|Z) = p(X|Z)p(Y|Z)

for all states of X, Y, Z. In case the conditioning set is empty we may also write X⊥⊥Y for X⊥⊥Y|∅, in which case X is (unconditionally) independent of Y. If X and Y are not conditionally independent, they are conditionally dependent, written X>>Y|Z. Similarly X>>Y|∅ can be written as X>>Y.
Intuitively, if x is conditionally independent of y given z, this means that, given z, y contains no additional information about x. Similarly, given z, knowing x does not tell me anything more about y. Note that X⊥⊥Y|Z ⇒ X′⊥⊥Y′|Z for X′ ⊆ X and Y′ ⊆ Y.
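Conditional independence can likewise be verified from a joint table by checking the factorisation for every state of Z. The sketch below (an illustration with invented numbers, not a library routine) builds a distribution of the form p(x|z)p(y|z)p(z), which is conditionally independent by construction:

import numpy as np

def is_cond_independent(p_xyz, tol=1e-12):
    # Check p(x, y|z) = p(x|z)p(y|z) for every state z with p(z) > 0.
    for z in range(p_xyz.shape[2]):
        slice_z = p_xyz[:, :, z]
        pz = slice_z.sum()
        if pz == 0:
            continue  # conditioning on an impossible state is vacuous
        p_xy_given_z = slice_z / pz
        p_x = p_xy_given_z.sum(axis=1)
        p_y = p_xy_given_z.sum(axis=0)
        if not np.allclose(p_xy_given_z, np.outer(p_x, p_y), atol=tol):
            return False
    return True

p_z = np.array([0.4, 0.6])
p_x_given_z = np.array([[0.3, 0.8],
                        [0.7, 0.2]])   # columns indexed by z
p_y_given_z = np.array([[0.5, 0.1],
                        [0.5, 0.9]])
p_xyz = np.einsum('xz,yz,z->xyz', p_x_given_z, p_y_given_z, p_z)
print(is_cond_independent(p_xyz))  # True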
Remark 1.2 (Independence implications). It's tempting to think that if a is independent of b and b is independent of c then a must be independent of c:

{a⊥⊥b, b⊥⊥c} ⇒ a⊥⊥c

However, this does not follow: consider, for example, a distribution of the form p(a, b, c) = p(b)p(a, c), in which b is independent of both a and c, yet a and c may be dependent. Similarly, it's tempting to think that if a and b are dependent, and b and c are dependent, then a and c must be dependent:

{a>>b, b>>c} ⇒ a>>c

However, this also does not follow. We give an explicit numerical example in exercise(3.17).
Finally, note that conditional independence x⊥⊥y|z does not imply marginal independence x⊥⊥y. See also exercise(3.20).
1.1.2 Probability Tables
Based on the populations 60776238, 5116900 and 2980700 of England (E), Scotland (S) and Wales (W), the a priori probability that a randomly selected person from the combined three countries would live in England, Scotland or Wales, is approximately 0.88, 0.08 and 0.04 respectively. We can write this as a vector (or probability table):

(p(Cnt = E), p(Cnt = S), p(Cnt = W)) = (0.88, 0.08, 0.04)   (1.1.21)

whose component values sum to 1. Assuming that a person's mother tongue (MT) is either English (Eng), Scottish (Scot) or Welsh (Wel), and that it depends on their country of residence, we have the conditional probabilities:
p(MT = Eng|Cnt = E) = 0.95    p(MT = Eng|Cnt = S) = 0.7    p(MT = Eng|Cnt = W) = 0.6
p(MT = Scot|Cnt = E) = 0.04   p(MT = Scot|Cnt = S) = 0.3   p(MT = Scot|Cnt = W) = 0.0
p(MT = Wel|Cnt = E) = 0.01    p(MT = Wel|Cnt = S) = 0.0    p(MT = Wel|Cnt = W) = 0.4
(1.1.22)
From this we can form a joint distribution p(Cnt, MT) = p(MT|Cnt)p(Cnt). This could be written as a 3 × 3 matrix with columns indexed by country and rows indexed by Mother Tongue:

( 0.95 × 0.88   0.7 × 0.08   0.6 × 0.04 )     ( 0.8360   0.0560   0.0240 )
( 0.04 × 0.88   0.3 × 0.08   0.0 × 0.04 )  =  ( 0.0352   0.0240   0.0000 )   (1.1.23)
( 0.01 × 0.88   0.0 × 0.08   0.4 × 0.04 )     ( 0.0088   0.0000   0.0160 )

The entries of this joint table sum to 1; summing down a column recovers the marginal p(Cnt), and summing along a row gives the marginal p(MT).
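The same construction is immediate in code. A minimal Python sketch using the numbers above (only the array layout is our own choice):

import numpy as np

p_cnt = np.array([0.88, 0.08, 0.04])            # p(Cnt): E, S, W
p_mt_given_cnt = np.array([[0.95, 0.70, 0.60],  # rows: Eng, Scot, Wel
                           [0.04, 0.30, 0.00],  # columns: E, S, W
                           [0.01, 0.00, 0.40]])

p_joint = p_mt_given_cnt * p_cnt  # p(MT, Cnt) = p(MT|Cnt)p(Cnt), column-wise
print(p_joint.sum())              # 1.0: a valid joint distribution
print(p_joint.sum(axis=0))        # marginalise out MT: recovers p(Cnt)
print(p_joint.sum(axis=1))        # marginalise out Cnt: gives p(MT)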
For joint distributions over a larger number of variables, x_i, i = 1, ..., D, with each variable x_i taking K_i states, the table describing the joint distribution is an array with ∏_{i=1}^{D} K_i entries. Explicitly storing tables therefore requires space exponential in the number of variables, which rapidly becomes impractical for a large number of variables. We discuss how to deal with this issue in chapter(3) and chapter(4).
A probability distribution assigns a value to each of the joint states of the variables. For this reason, p(T, J, R, S) is considered equivalent to p(J, S, R, T) (or any such reordering of the variables), since in each case the joint setting of the variables is simply a different index to the same probability. This situation is clearer in the set theoretic notation p(J ∩ S ∩ T ∩ R); we abbreviate this set theoretic notation by using the commas. However, one should be careful not to confuse this indexing-type notation with functions f(x, y), which are in general dependent on the variable order. Whilst the variables to the left of the conditioning bar may be written in any order, and equally those to the right of the conditioning bar may be written in any order, moving variables across the bar is not generally equivalent, so that p(x1|x2) ≠ p(x2|x1).
1.2 Probabilistic Reasoning
The central paradigm of probabilistic reasoning is to identify all relevant variables x1, ..., xN in the environment, and make a probabilistic model p(x1, ..., xN) of their interaction. Reasoning (inference) is then performed by introducing evidence that sets variables in known states, and subsequently computing probabilities of interest, conditioned on this evidence. The rules of probability, combined with Bayes' rule, make for a complete reasoning system, one which includes traditional deductive logic as a special case[158]. In the examples below, the number of variables in the environment is very small; in chapter(3) we will discuss reasoning in networks containing many variables.

Example 1.2 (Hamburgers). Consider the following fictitious scientific information: doctors find that people with Kreuzfeld-Jacob disease (KJ) almost invariably ate hamburgers, thus p(Hamburger Eater|KJ) = 9/10. The probability of an individual having KJ is currently rather low, about one in 100,000.

1. Assuming eating lots of hamburgers is rather widespread, say p(Hamburger Eater) = 0.5, what is the probability that a hamburger eater will have Kreuzfeld-Jacob disease?
This may be computed as

p(KJ|Hamburger Eater) = p(Hamburger Eater, KJ)/p(Hamburger Eater) = p(Hamburger Eater|KJ)p(KJ)/p(Hamburger Eater)   (1.2.1)

= (9/10 × 1/100000)/(1/2) = 1.8 × 10⁻⁵   (1.2.2)
2. If the fraction of people eating hamburgers was rather small, p(Hamburger Eater) = 0.001, what is the probability that a regular hamburger eater will have Kreuzfeld-Jacob disease? Repeating the above calculation, this is given by

(9/10 × 1/100000)/(1/1000) = 9 × 10⁻³ ≈ 1/100   (1.2.3)
This is much higher than in scenario (1) since here we can be more sure that eating hamburgers is related to the illness.
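Both scenarios are a one-line application of Bayes' rule; a minimal Python sketch (the function name is our own):

def p_kj_given_eater(p_eater, p_eater_given_kj=9/10, p_kj=1/100000):
    # Bayes' rule: p(KJ|Eater) = p(Eater|KJ) p(KJ) / p(Eater)
    return p_eater_given_kj * p_kj / p_eater

print(p_kj_given_eater(0.5))    # scenario 1: 1.8e-05
print(p_kj_given_eater(0.001))  # scenario 2: 0.009, roughly 1/100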
Example 1.3 (Inspector Clouseau). Inspector Clouseau arrives at the scene of a crime. The victim lies dead in the room alongside the possible murder weapon, a knife. The Butler (B) and Maid (M) are the inspector's main suspects, and the inspector has a prior belief of 0.6 that the Butler is the murderer, and a prior belief of 0.2 that the Maid is the murderer. These beliefs are independent in the sense that p(B, M) = p(B)p(M). (It is possible that both the Butler and the Maid murdered the victim, or neither.) The inspector's prior criminal knowledge can be formulated mathematically as follows:
dom(B) = dom(M) = {murderer, not murderer}, dom(K) = {knife used, knife not used}   (1.2.4)

p(B = murderer) = 0.6, p(M = murderer) = 0.2   (1.2.5)
p(knife used|B = not murderer, M = not murderer) = 0.3
p(knife used|B = not murderer, M = murderer) = 0.2
p(knife used|B = murderer, M = not murderer) = 0.6
p(knife used|B = murderer, M = murderer) = 0.1
(1.2.6)
In addition p(K, B, M) = p(K|B, M)p(B)p(M). Assuming that the knife is the murder weapon, what is the probability that the Butler is the murderer? (Remember that it might be that neither is the murderer.) Using b for the two states of B and m for the two states of M,

p(B|K) = ∑_m p(B, m|K) = ∑_m p(K|B, m)p(B, m)/∑_{m,b} p(K|b, m)p(b, m) = p(B)∑_m p(K|B, m)p(m) / (∑_b p(b)∑_m p(K|b, m)p(m))   (1.2.7)

Plugging in the values, we have

p(B = murderer|knife used) = (6/10 × (2/10 × 1/10 + 8/10 × 6/10)) / (6/10 × (2/10 × 1/10 + 8/10 × 6/10) + 4/10 × (2/10 × 2/10 + 8/10 × 3/10)) = 300/412 ≈ 0.73   (1.2.8)

Hence knowing that the knife was the murder weapon strengthens our belief that the Butler did it.
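The computation in equation (1.2.8) amounts to summing the joint over the maid's states; a minimal Python sketch of this enumeration (the dictionary encoding is our own):

# States: 1 = murderer, 0 = not murderer.
p_b = {1: 0.6, 0: 0.4}
p_m = {1: 0.2, 0: 0.8}
p_knife = {(0, 0): 0.3, (0, 1): 0.2,   # p(knife used|B, M), keyed by (B, M)
           (1, 0): 0.6, (1, 1): 0.1}

# p(B, knife used) = sum_m p(knife used|B, m) p(B) p(m)
joint = {b: sum(p_knife[(b, m)] * p_b[b] * p_m[m] for m in (0, 1))
         for b in (0, 1)}
p_knife_used = joint[0] + joint[1]   # the prior p(knife used) = 0.412
print(joint[1] / p_knife_used)       # posterior ≈ 0.728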
Remark 1.3. The role of p(knife used) in the Inspector Clouseau example can cause some confusion. In the above,

p(knife used) = ∑_b p(b) ∑_m p(knife used|b, m)p(m)   (1.2.9)

is computed to be 0.412. But surely, p(knife used) = 1, since this is given in the question! Note that the quantity p(knife used) relates to the prior probability the model assigns to the knife being used (in the absence of any other information). If we know that the knife is used, then the posterior

p(knife used|knife used) = p(knife used, knife used)/p(knife used) = p(knife used)/p(knife used) = 1   (1.2.10)

which, naturally, must be the case.
Example 1.4 (Who's in the bathroom?). Consider a household of three people, Alice, Bob and Cecil. Cecil wants to go to the bathroom but finds it occupied. He then goes to Alice's room and sees she is there. Since Cecil knows that only either Alice or Bob can be in the bathroom, from this he infers that Bob must be in the bathroom.
To arrive at the same conclusion in a mathematical framework, we define the following events
A = Alice is in her bedroom, B = Bob is in his bedroom, O = Bathroom occupied (1.2.11)
We can encode the information that if either Alice or Bob are not in their bedrooms, then they must be in the bathroom (they might both be in the bathroom) as

p(O = tr|A = fa, B) = 1,   p(O = tr|A, B = fa) = 1   (1.2.12)
The first term expresses that the bathroom is occupied if Alice is not in her bedroom, wherever Bob is. Similarly, the second term expresses bathroom occupancy as long as Bob is not in his bedroom. Then

p(B = fa|O = tr, A = tr) = p(B = fa, O = tr, A = tr)/p(O = tr, A = tr) = p(O = tr|A = tr, B = fa)p(A = tr, B = fa)/p(O = tr, A = tr)   (1.2.13)
where
p(O = tr, A = tr) = p(O = tr|A = tr, B = fa)p(A = tr, B = fa) + p(O = tr|A = tr, B = tr)p(A = tr, B = tr)   (1.2.14)

Using the facts p(O = tr|A = tr, B = fa) = 1 and p(O = tr|A = tr, B = tr) = 0, which encode that if Alice is in her room and Bob is not, the bathroom must be occupied, and that if both Alice and Bob are in their rooms, the bathroom cannot be occupied, we obtain
p(B = fa|O = tr, A = tr) = p(A = tr, B = fa)/p(A = tr, B = fa) = 1   (1.2.15)

This example is interesting since we are not required to make a full probabilistic model in this case, thanks to the limiting nature of the probabilities (we don't need to specify p(A, B)). The situation is common when probabilities are either 0 or 1, corresponding to traditional logic systems.
Example 1.5 (Aristotle: Resolution). We can represent the statement 'All apples are fruit' by p(F = tr|A = tr) = 1, and the statement 'All fruits grow on trees' by p(T = tr|F = tr) = 1. Taken together, these lead to the conclusion that 'All apples grow on trees'. To see how this might be deduced using Bayesian reasoning, consider

p(T = tr|A = tr) = p(T = tr|A = tr, F = fa)p(F = fa|A = tr) + p(T = tr|A = tr, F = tr)p(F = tr|A = tr)   (1.2.16)

Since p(F = fa|A = tr) = 0 and p(F = tr|A = tr) = 1, and assuming that growing on trees depends only on being a fruit, p(T|A, F) = p(T|F), this gives p(T = tr|A = tr) = p(T = tr|F = tr) = 1. That is, from A ⇒ F and F ⇒ T we infer A ⇒ T.
Example 1.6 (Aristotle: Inverse Modus Ponens). According to logic, from the statement 'If A is true then B is true', one may deduce that 'if B is false then A is false'. To see how this fits in with a probabilistic reasoning system, we can first express the statement 'If A is true then B is true' as p(B = tr|A = tr) = 1. Then we may infer

p(A = fa|B = fa) = 1 − p(A = tr|B = fa) = 1 − p(B = fa|A = tr)p(A = tr)/p(B = fa) = 1   (1.2.17)

This follows since p(B = fa|A = tr) = 1 − p(B = tr|A = tr) = 0, which annihilates the second term.
Example 1.7 (Soft XOR Gate)
A standard XOR logic gate is given by the truth table below. If we observe that the output of the XOR gate is 0, what can we say about A and B? In this case, either A and B were both 0, or A and B were both 1. This means we don't know which state A was in – it could equally likely have been 1 or 0.

A  B  A xor B
0  0  0
0  1  1
1  0  1
1  1  0
Consider a 'soft' version of the XOR gate, in which the gate stochastically outputs C = 1 depending on its inputs according to the table below, with additionally A⊥⊥B and p(A = 1) = 0.65, p(B = 1) = 0.77:

A  B  p(C = 1|A, B)
0  0  0.1
0  1  0.99
1  0  0.8
1  1  0.25

What is p(A = 1|C = 0)? Using the independence of A and B,

p(A = 1, C = 0) = p(A = 1) (p(C = 0|A = 1, B = 0)p(B = 0) + p(C = 0|A = 1, B = 1)p(B = 1))
= 0.65 × (0.2 × 0.23 + 0.75 × 0.77) = 0.405275   (1.2.18)

p(A = 0, C = 0) = p(A = 0) (p(C = 0|A = 0, B = 0)p(B = 0) + p(C = 0|A = 0, B = 1)p(B = 1))
= 0.35 × (0.9 × 0.23 + 0.01 × 0.77) = 0.075145   (1.2.19)

Then
p(A = 1|C = 0) = p(A = 1, C = 0)/(p(A = 1, C = 0) + p(A = 0, C = 0)) = 0.405275/(0.405275 + 0.075145) = 0.8436   (1.2.20)
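The same enumeration in a minimal Python sketch (the dictionary encoding is illustrative):

p_a = {1: 0.65, 0: 0.35}
p_b = {1: 0.77, 0: 0.23}
p_c1 = {(0, 0): 0.1, (0, 1): 0.99,   # p(C = 1|A, B), keyed by (A, B)
        (1, 0): 0.8, (1, 1): 0.25}

# p(A, C = 0) = p(A) * sum_b p(C = 0|A, b) p(b), using A independent of B
joint_c0 = {a: p_a[a] * sum((1 - p_c1[(a, b)]) * p_b[b] for b in (0, 1))
            for a in (0, 1)}
print(joint_c0[1] / (joint_c0[0] + joint_c0[1]))  # 0.8436...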
Example 1.8 (Larry). Larry is typically late for school. If Larry is late, we denote this with L = late, otherwise L = not late. When his mother asks whether or not he was late for school, he never admits to being late. The response Larry gives, RL, is represented as follows:

p(RL = not late|L = not late) = 1,   p(RL = late|L = late) = 0   (1.2.21)

The remaining two values are determined by normalisation and are

p(RL = late|L = not late) = 0,   p(RL = not late|L = late) = 1   (1.2.22)
Given that RL = not late, what is the probability that Larry was late, i.e. p(L = late|RL = not late)?
Using Bayes' rule we have

p(L = late|RL = not late) = p(L = late, RL = not late)/p(RL = not late)
= p(L = late, RL = not late)/(p(L = late, RL = not late) + p(L = not late, RL = not late))   (1.2.23)

= p(RL = not late|L = late)p(L = late)/(p(RL = not late|L = late)p(L = late) + p(RL = not late|L = not late)p(L = not late))
= 1 × p(L = late)/(1 × p(L = late) + 1 × p(L = not late)) = p(L = late)   (1.2.24)
where we used normalisation in the last step, p(L = late) + p(L = not late) = 1. This result is intuitive – Larry's mother knows that he never admits to being late, so her belief about whether or not he really was late is unchanged, regardless of what Larry actually says.
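To see the cancellation numerically, here is a minimal Python sketch in which the prior p(L = late) is an assumed value purely for illustration; any prior gives the same conclusion:

p_late = 0.3  # assumed prior, for illustration only; the result holds for any value
p_rl_notlate = {'late': 1.0, 'not late': 1.0}  # p(RL = not late|L): Larry never admits lateness

num = p_rl_notlate['late'] * p_late
den = num + p_rl_notlate['not late'] * (1 - p_late)
print(num / den)  # 0.3: the posterior equals the prior; Larry's answer is uninformative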