CONTENTS
Chapter 1: Introduction
Chapter 2: Probability Distributions
Chapter 3: Linear Models for Regression
Chapter 4: Linear Models for Classification
Chapter 5: Neural Networks
Chapter 6: Kernel Methods
Chapter 7: Sparse Kernel Machines
Chapter 8: Graphical Models
Chapter 9: Mixture Models and EM
Chapter 10: Approximate Inference
Chapter 11: Sampling Methods
Chapter 12: Continuous Latent Variables
Chapter 13: Sequential Data
Chapter 14: Combining Models
Re-arranging terms then gives the required result.
1.2 For the regularized sum-of-squares error function given by (1.4) the corresponding linear equations are again obtained by differentiation, and take the same form as (1.122), but with A_ij replaced by Ã_ij, given by

    Ã_ij = A_ij + λ I_ij.
1.3 Let us denote apples, oranges and limes by a, o and l respectively. The marginal
probability of selecting an apple is given by
p(a) = p(a|r)p(r) + p(a|b)p(b) + p(a|g)p(g)
To find the probability that the box was green, given that the fruit we selected was
an orange, we can use Bayes’ theorem
    p(g|o) = p(o|g)p(g) / p(o)    (4)
The denominator in (4) is given by
p(o) = p(o|r)p(r) + p(o|b)p(b) + p(o|g)p(g)
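As a quick numerical sketch of this calculation (the box contents and priors below are made up for illustration, not necessarily those of the exercise), the marginal and posterior can be computed directly in Python:

# Illustration of Solution 1.3 with hypothetical box contents and priors.
p_fruit_given_box = {
    "r": {"a": 0.3, "o": 0.4, "l": 0.3},
    "b": {"a": 0.5, "o": 0.5, "l": 0.0},
    "g": {"a": 0.3, "o": 0.3, "l": 0.4},
}
p_box = {"r": 0.2, "b": 0.2, "g": 0.6}

# Marginal probability of an apple: p(a) = sum_k p(a|k) p(k).
p_a = sum(p_fruit_given_box[k]["a"] * p_box[k] for k in p_box)

# Posterior probability of the green box given an orange, via Bayes' theorem.
p_o = sum(p_fruit_given_box[k]["o"] * p_box[k] for k in p_box)
p_g_given_o = p_fruit_given_box["g"]["o"] * p_box["g"] / p_o

print(f"p(a) = {p_a:.3f}, p(o) = {p_o:.3f}, p(g|o) = {p_g_given_o:.3f}")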
1.4 We are often interested in finding the most probable value for some quantity. In the case of probability distributions over discrete variables this poses little problem. However, for continuous variables there is a subtlety arising from the nature of probability densities and the way they transform under non-linear changes of variable.
Consider first the way a function f(x) behaves when we change to a new variable y, where the two variables are related by x = g(y). This defines a new function of y given by

    f̃(y) = f(g(y)).    (7)
Suppose f(x) has a mode (i.e. a maximum) at x̂, so that f'(x̂) = 0. The corresponding mode of f̃(y) will occur for a value ŷ obtained by differentiating both sides of (7) with respect to y:

    f̃'(ŷ) = f'(g(ŷ)) g'(ŷ) = 0.    (8)

Assuming g'(ŷ) ≠ 0 at the mode, then f'(g(ŷ)) = 0. However, we know that f'(x̂) = 0, and so we see that the locations of the mode expressed in terms of each of the variables x and y are related by x̂ = g(ŷ), as one would expect. Thus, finding a mode with respect to the variable x is completely equivalent to first transforming to the variable y, then finding a mode with respect to y, and then transforming back to x.
Now consider the behaviour of a probability density p_x(x) under the change of variables x = g(y), where the density with respect to the new variable is p_y(y) and is given by (1.27). Let us write g'(y) = s|g'(y)| where s ∈ {−1, +1}. Then (1.27) can be written

    p_y(y) = p_x(g(y)) s g'(y).

Differentiating both sides with respect to y then gives

    p_y'(y) = s p_x'(g(y)) {g'(y)}² + s p_x(g(y)) g''(y).    (9)
Due to the presence of the second term on the right hand side of (9) the relationship x̂ = g(ŷ) no longer holds. Thus the value of x obtained by maximizing p_x(x) will not be the value obtained by transforming to p_y(y), maximizing with respect to y, and then transforming back to x. This causes modes of densities to be dependent on the choice of variables. In the case of a linear transformation, the second term on the right hand side of (9) vanishes, and so the location of the maximum transforms according to x̂ = g(ŷ).
This effect can be illustrated with a simple example, as shown in Figure 1. We begin by considering a Gaussian distribution p_x(x) over x with mean μ = 6 and standard deviation σ = 1, shown by the red curve in Figure 1. Next we draw a sample of N = 50,000 points from this distribution and plot a histogram of their values, which as expected agrees with the distribution p_x(x).
Now consider a non-linear change of variables from x to y given by

    x = g(y) = ln(y) − ln(1 − y) + 5.    (10)

The inverse of this function is given by
Figure 1: Example of the transformation of the mode of a density under a non-linear change of variables, illustrating the different behaviour compared to a simple function. The figure shows the sigmoid g⁻¹(x), the density p_x(x) over x, and the transformed density p_y(y) over y. See the text for details.
    y = g⁻¹(x) = 1 / (1 + exp(−x + 5)),    (11)

which is a logistic sigmoid function, and is shown in Figure 1 by the blue curve.
If we simply transform p_x(x) as a function of x we obtain the green curve p_x(g(y)) shown in Figure 1, and we see that the mode of the density p_x(x) is transformed via the sigmoid function to the mode of this curve. However, the density over y transforms instead according to (1.27) and is shown by the magenta curve on the left side of the diagram. Note that this has its mode shifted relative to the mode of the green curve.
To confirm this result we take our sample of 50,000 values of x, evaluate the corresponding values of y using (11), and then plot a histogram of their values. We see that this histogram matches the magenta curve in Figure 1 and not the green curve!
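The experiment is easy to reproduce; the Python sketch below assumes the change of variables has the logistic-sigmoid form given in (10) and (11) and simply compares the transformed mode of p_x(x) with the mode of a histogram of the transformed samples:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 6.0, 1.0, 50_000

x = rng.normal(mu, sigma, size=N)           # sample from p_x(x)
y = 1.0 / (1.0 + np.exp(-x + 5.0))          # transform each sample through g^{-1}

# Mode of p_x(x) pushed through g^{-1}, i.e. the mode of the green curve.
mode_of_transformed_function = 1.0 / (1.0 + np.exp(-mu + 5.0))

# Empirical mode of p_y(y), estimated from a histogram of the transformed sample.
counts, edges = np.histogram(y, bins=200)
mode_of_transformed_density = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])

print("g^{-1}(mode of p_x):     ", round(mode_of_transformed_function, 3))
print("mode of histogram over y:", round(mode_of_transformed_density, 3))
# The two values differ: the mode of a density does not transform like a function.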
1.5 Expanding the square we have

    E[(f(x) − E[f(x)])²] = E[f(x)² − 2 f(x) E[f(x)] + E[f(x)]²]
                         = E[f(x)²] − 2 E[f(x)] E[f(x)] + E[f(x)]²
                         = E[f(x)²] − E[f(x)]²

as required.
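A quick numerical sanity check of this identity on an arbitrary empirical sample (a sketch only, not part of the original solution):

import numpy as np

rng = np.random.default_rng(1)
f = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # any sample of f(x) values will do

lhs = np.mean((f - f.mean()) ** 2)                  # var[f] computed directly
rhs = np.mean(f ** 2) - f.mean() ** 2               # E[f^2] - E[f]^2
print(lhs, rhs)                                     # agree up to floating-point rounding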
1.6 The definition of covariance is given by (1.41) as
cov[x, y] = E[xy]− E[x]E[y]
Using (1.33) and the fact that p(x, y) = p(x)p(y) when x and y are independent, we obtain

    E[xy] = Σ_x Σ_y x y p(x) p(y) = { Σ_x x p(x) } { Σ_y y p(y) } = E[x] E[y]
and hence cov[x, y] = 0. The case where x and y are continuous variables is analogous, with (1.33) replaced by (1.34) and the sums replaced by integrals.
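A corresponding sampling check (again only a sketch) that the covariance of independently drawn variables vanishes up to Monte Carlo error:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1_000_000)
y = rng.uniform(size=1_000_000)                 # drawn independently of x

cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov)                                      # close to zero, of order 1/sqrt(N)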
1.7 The transformation from Cartesian to polar coordinates is defined by x = r cos θ and y = r sin θ, so that x² + y² = r². The Jacobian of the change of variables is

    ∂(x, y)/∂(r, θ) = | cos θ   −r sin θ |
                      | sin θ    r cos θ |  = r

where again we have used (2.177). Thus the double integral in (1.125) becomes

    I² = ∫_0^{2π} ∫_0^∞ exp(−r²/(2σ²)) r dr dθ
       = 2π ∫_0^∞ exp(−u/(2σ²)) (1/2) du
       = π [ −2σ² exp(−u/(2σ²)) ]_0^∞ = 2πσ²,

where we have made the change of variables u = r². Thus I = (2πσ²)^{1/2}.
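This normalization can be checked numerically; the sketch below simply sums the one-dimensional integrand on a fine grid and compares it with (2πσ²)^{1/2}:

import numpy as np

sigma = 1.7
dx = 1e-3
x = np.arange(-40.0, 40.0, dx)
I = np.sum(np.exp(-x ** 2 / (2.0 * sigma ** 2))) * dx   # crude Riemann sum of the integral
print(I, np.sqrt(2.0 * np.pi * sigma ** 2))             # the two values agree closely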
1.8 From the definition (1.46) of the univariate Gaussian, and making the change of variable y = x − μ, we can write the expectation of x as

    E[x] = (2πσ²)^{−1/2} ∫_{−∞}^{∞} exp(−y²/(2σ²)) (y + μ) dy.

We now note that in the factor (y + μ) the first term in y corresponds to an odd integrand and so this integral must vanish (to show this explicitly, write the integral as the sum of two integrals, one from −∞ to 0 and the other from 0 to ∞, and then show that these two integrals cancel). In the second term, μ is a constant and pulls outside the integral, leaving a normalized Gaussian distribution which integrates to unity, so that E[x] = μ, which is (1.49).
To obtain (1.50) we differentiate both sides of the Gaussian normalization condition (1.127) with respect to σ², which after rearranging gives

    ∫_{−∞}^{∞} N(x|μ, σ²) (x − μ)² dx = σ².

Now we expand the square on the left-hand side giving

    E[x²] − 2μ E[x] + μ² = σ².
Making use of (1.49) then gives (1.50) as required.
Finally, (1.51) follows directly from (1.49) and (1.50).
1.9 For the univariate case, we simply differentiate (1.46) with respect to x to obtain

    d/dx N(x|μ, σ²) = −N(x|μ, σ²) (x − μ)/σ².

Setting this to zero we obtain x = μ.
Similarly, for the multivariate case we differentiate (1.52) with respect to x to obtain

    ∂/∂x N(x|μ, Σ) = −(1/2) N(x|μ, Σ) ∇_x{(x − μ)ᵀ Σ⁻¹ (x − μ)} = −N(x|μ, Σ) Σ⁻¹ (x − μ),

where we have used (C.19), (C.20) and the fact that Σ⁻¹ is symmetric. Setting this derivative equal to zero, and left-multiplying by Σ, leads to the solution x = μ. (NOTE: In the 1st printing of PRML, there are mistakes in (C.20); all instances of x (vector) in the denominators should be x (scalar).)
Similarly for the variances, we first note that

    (x + z − E[x + z])² = (x − E[x])² + (z − E[z])² + 2(x − E[x])(z − E[z]),    (26)

where the final term will integrate to zero with respect to the factorized distribution p(x)p(z), and hence var[x + z] = var[x] + var[z].
1.11 Differentiating the log likelihood (1.54) with respect to μ gives

    ∂/∂μ ln p(x|μ, σ²) = (1/σ²) Σ_{n=1}^N (x_n − μ).

Setting this equal to zero and moving the terms involving μ to the other side of the equation we obtain (1.55). Similarly, differentiating (1.54) with respect to σ² and setting the result to zero, then multiplying both sides by 2(σ²)²/N and substituting μ_ML for μ, we get (1.56).
1.12 If m = n then x_n x_m = x_n², and using (1.50) we obtain E[x_n²] = μ² + σ², whereas if n ≠ m then the two data points x_n and x_m are independent and hence E[x_n x_m] = E[x_n] E[x_m] = μ², where we have used (1.49). Combining these two results we obtain (1.130).
Substituting these results into

    E[σ²_ML] = E[ (1/N) Σ_{n=1}^N (x_n − μ_ML)² ]

and collecting terms then gives

    E[σ²_ML] = μ² + σ² − 2(μ² + (1/N)σ²) + μ² + (1/N)σ² = ((N − 1)/N) σ²

as required, where we have used E[x_n μ_ML] = E[μ_ML²] = μ² + (1/N)σ², which follow from (1.130).
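A short simulation sketch of this bias result, comparing the average of σ²_ML over many data sets of size N with ((N − 1)/N)σ²:

import numpy as np

rng = np.random.default_rng(3)
N, sigma, trials = 5, 2.0, 200_000

x = rng.normal(loc=1.0, scale=sigma, size=(trials, N))
sigma2_ml = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)   # ML variance of each data set

print(sigma2_ml.mean())            # close to ...
print((N - 1) / N * sigma ** 2)    # ... the predicted value 3.2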
from which we obtain (1.132). The number of independent components in w^S_ij can be found by noting that there are D² parameters in total in this matrix, and that entries off the leading diagonal occur in constrained pairs w_ij = w_ji for j ≠ i. Thus we start with D² parameters in the matrix w^S_ij, subtract D for the number of parameters on the leading diagonal, divide by two, and then add back D for the leading diagonal, and we obtain (D² − D)/2 + D = D(D + 1)/2.
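A counting sketch that confirms D(D + 1)/2 by enumerating the diagonal and upper-triangular entries of a D × D symmetric matrix:

D = 6
independent = [(i, j) for i in range(D) for j in range(i, D)]   # diagonal plus upper triangle
print(len(independent), D * (D + 1) // 2)                       # both equal 21 for D = 6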
1.15 The redundancy in the coefficients in (1.133) arises from interchange symmetries between the indices i_k. Such symmetries can therefore be removed by enforcing an ordering on the indices, as in (1.134), so that only one member in each group of equivalent configurations occurs in the summation.
To derive (1.135) we note that the number of independent parameters n(D, M) which appear at order M can be written as

    n(D, M) = Σ_{i_1=1}^{D} Σ_{i_2=1}^{i_1} · · · Σ_{i_M=1}^{i_{M−1}} 1    (32)

which can itself be written as

    n(D, M) = Σ_{i_1=1}^{D} { Σ_{i_2=1}^{i_1} · · · Σ_{i_M=1}^{i_{M−1}} 1 }    (33)

where the term in braces has M−1 terms which, from (32), must equal n(i_1, M−1). Thus we can write

    n(D, M) = Σ_{i_1=1}^{D} n(i_1, M − 1),

which is equivalent to (1.135).
We now prove (1.136) by induction; it is easily verified that it holds for D = 1. Now we assume that it is true for a specific value of dimensionality D and then show that it must be true for dimensionality D + 1. Thus consider the left-hand side of (1.136) evaluated for D + 1, which gives

    Σ_{i=1}^{D+1} (i + M − 2)! / ((i − 1)! (M − 1)!)
        = (D + M − 1)! / ((D − 1)! M!) + (D + M − 1)! / (D! (M − 1)!)
        = (D + M)! / (D! M!),

which equals the right hand side of (1.136) for dimensionality D + 1. Thus, by induction, (1.136) must hold true for all values of D.
Finally we use induction to prove (1.137). For M = 2 we obtain the standard result n(D, 2) = ½ D(D + 1), which is also proved in Exercise 1.14. Now assume that (1.137) is correct for a specific order M − 1, so that

    n(D, M − 1) = (D + M − 2)! / ((D − 1)! (M − 1)!).

Substituting this into the right hand side of (1.135) and making use of (1.136) then gives

    n(D, M) = Σ_{i=1}^{D} (i + M − 2)! / ((i − 1)! (M − 1)!) = (D + M − 1)! / ((D − 1)! M!),

and hence shows that (1.137) is true for polynomials of order M. Thus by induction (1.137) must be true for all values of M.
1.16 NOTE: In the 1st printing of PRML, this exercise contains two typographical errors. On line 4, M6th should be Mth and on the l.h.s. of (1.139), N(d, M) should be N(D, M).
The result (1.138) follows simply from summing up the coefficients at all orders up to and including order M. To prove (1.139), we first note that when M = 0 the right hand side of (1.139) equals 1, which we know to be correct since this is the number of parameters at zeroth order, which is just the constant offset in the polynomial. Assuming that (1.139) is correct at order M, we obtain the following result at order M + 1:

    N(D, M + 1) = Σ_{m=0}^{M+1} n(D, m)
                = Σ_{m=0}^{M} n(D, m) + n(D, M + 1)
                = (D + M)! / (D! M!) + (D + M)! / ((D − 1)! (M + 1)!)
                = (D + M + 1)! / (D! (M + 1)!),

which is the required result at order M + 1.
Now assume M ≫ D. Using Stirling's formula n! ≃ nⁿ e⁻ⁿ we have

    N(D, M) ≃ (D + M)^{D+M} e^{−D−M} / (D! M^M e^{−M})
            = (D + M)^{D+M} e^{−D} / (D! M^M)
            ≃ M^{D+M} (1 + D/M)^{D+M} e^{−D} / (D! M^M)
            ≃ M^{D+M} e^{D} e^{−D} / (D! M^M) = M^D / D!,

which grows like M^D with M. The case where D ≫ M is identical, with the roles of D and M exchanged. By numerical evaluation we obtain N(10, 3) = 286 and N(100, 3) = 176,851.
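These counting results are easy to check numerically; the sketch below encodes n(D, M) and N(D, M) as binomial coefficients and verifies the recursion and the values quoted above:

from math import comb

def n(D, M):
    # number of independent parameters appearing at order M in D dimensions
    return comb(D + M - 1, M)

def N(D, M):
    # total number of parameters in a polynomial of order M
    return comb(D + M, M)

assert n(4, 3) == sum(n(i, 2) for i in range(1, 5))     # the recursion (1.135)
assert N(3, 2) == sum(n(3, m) for m in range(0, 3))     # N as a sum of n over orders
print(N(10, 3), N(100, 3))                              # 286 176851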
If x is an integer we can apply proof by induction to relate the gamma function to the factorial function. Suppose that Γ(x + 1) = x! holds. Then from the result (39) we have Γ(x + 2) = (x + 1) Γ(x + 1) = (x + 1)!. Finally, Γ(1) = 1 = 0!, which completes the proof by induction.
1.18 On the right-hand side of (1.142) we make the change of variables u = r² to give

    (1/2) S_D ∫_0^∞ e^{−u} u^{D/2−1} du = (1/2) S_D Γ(D/2),

where we have used the definition (1.141) of the Gamma function. The left-hand side of (1.142) is a product of D one-dimensional Gaussian integrals, each equal to π^{1/2}, giving π^{D/2}. Equating the two expressions gives (1.143). The volume of a sphere of radius 1 in D dimensions is obtained by integrating the surface area over the radius:

    V_D = S_D ∫_0^1 r^{D−1} dr = S_D / D.
1.19 The volume of the cube is (2a)^D. Combining this with (1.143) and (1.144) we obtain (1.145). Using Stirling's formula (1.146) in (1.145), the ratio becomes, for large D,
    volume of sphere / volume of cube ≃ (πe/(2D))^{D/2} · 1/√(πD),
which goes to 0 as D → ∞. The distance from the center of the cube to the mid-point of one of the sides is a, since this is where it makes contact with the sphere. Similarly, the distance to one of the corners is a√D, from Pythagoras' theorem. Thus the ratio is √D.
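A short numerical sketch of this behaviour, evaluating the exact ratio (1.145) for increasing dimensionality:

from math import pi, gamma

# ratio of the volume of the unit-radius sphere to that of its circumscribing cube,
# pi^(D/2) / (D 2^(D-1) Gamma(D/2)), as in (1.145)
for D in (1, 2, 5, 10, 20, 50):
    ratio = pi ** (D / 2) / (D * 2 ** (D - 1) * gamma(D / 2))
    print(D, ratio)          # decays rapidly towards zero as D grows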
1.20 Since p(x) is radially symmetric, it will be roughly constant over a shell of radius r and thickness ε. This shell has volume S_D r^{D−1} ε and, since ‖x‖² = r², we have

    ∫_shell p(x) dx ≃ p(x)|_{‖x‖=r} S_D r^{D−1} ε,

from which we obtain (1.148). Differentiating (1.148) with respect to r and setting the derivative to zero gives, up to constant factors,

    [ (D − 1) r^{D−2} − r^D/σ² ] exp(−r²/(2σ²)) = 0.

Solving for r, and using D ≫ 1, we obtain r̂ ≃ √D σ.
Next we note that

    p(r̂ + ε) ∝ (r̂ + ε)^{D−1} exp( −(r̂ + ε)²/(2σ²) )
             = exp( −(r̂ + ε)²/(2σ²) + (D − 1) ln(r̂ + ε) ).

We now expand p(r) around the point r̂. Since this is a stationary point of p(r) we must keep terms up to second order. Making use of the expansion ln(1 + x) = x − x²/2 + O(x³), together with D ≫ 1, we obtain (1.149).
Finally, from (1.147) we see that the probability density at the origin is given by p(x = 0) = (2πσ²)^{−D/2}, whereas the density at ‖x‖ = r̂ is given from (1.147) by

    p(‖x‖ = r̂) = (2πσ²)^{−D/2} exp(−r̂²/(2σ²)) ≃ (2πσ²)^{−D/2} exp(−D/2),

so that the ratio of the density at the origin to the density at radius r̂ is exp(D/2).
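A sampling sketch of this concentration effect for a D-dimensional Gaussian (the particular values of D and σ below are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
D, sigma = 100, 1.0

# radii of samples from N(0, sigma^2 I) concentrate around sqrt(D) * sigma
r = np.linalg.norm(rng.normal(scale=sigma, size=(20_000, D)), axis=1)
print(r.mean(), np.sqrt(D) * sigma)          # both are close to 10.0

# log of the density ratio p(0) / p(r_hat) = r_hat^2 / (2 sigma^2) = D / 2
print((np.sqrt(D) * sigma) ** 2 / (2 * sigma ** 2))   # 50.0, i.e. a ratio of exp(50)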
1.21 Since the square root function is monotonic for non-negative numbers, we can take the square root of the relation a ≤ b to obtain a^{1/2} ≤ b^{1/2}. Then we multiply both sides by the non-negative quantity a^{1/2} to obtain a ≤ (ab)^{1/2}.
The probability of a misclassification is given, from (1.78), by

    p(mistake) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx.

Since the decision regions have been chosen to minimize this probability, in each region the integrand is the smaller of the two joint densities, and so it can be bounded by {p(x, C_1) p(x, C_2)}^{1/2} using the result above, giving the required inequality.
1.22 Substituting L_kj = 1 − I_kj into (1.81), and using the fact that the posterior probabilities sum to one, we find that for each x we should choose the class j for which 1 − p(C_j|x) is a minimum, which is equivalent to choosing the j for which the posterior probability p(C_j|x) is a maximum. This loss matrix assigns a loss of one if the example is misclassified, and a loss of zero if it is correctly classified, and hence minimizing the expected loss will minimize the misclassification rate.
1.23 From (1.81) we see that for a general loss matrix and arbitrary class priors, the expected loss is minimized by assigning an input x to the class j which minimizes

    Σ_k L_kj p(C_k|x) = (1/p(x)) Σ_k L_kj p(x|C_k) p(C_k),

and so there is a direct trade-off between the priors p(C_k) and the loss matrix L_kj.
1.24 A vector x belongs to class C_k with probability p(C_k|x). If we decide to assign x to class C_j we will incur an expected loss of Σ_k L_kj p(C_k|x), whereas if we select the reject option we will incur a loss of λ. Thus, if

    j = arg min_l Σ_k L_kl p(C_k|x),

we minimize the expected loss by choosing class j when Σ_k L_kj p(C_k|x) < λ, and selecting the reject option otherwise. For the loss matrix L_kj = 1 − I_kj we have Σ_k L_kj p(C_k|x) = 1 − p(C_j|x), and so we
reject unless the smallest value of 1 − p(C_l|x) is less than λ, or equivalently if the largest value of p(C_l|x) is less than 1 − λ. In the standard reject criterion we reject if the largest posterior probability is less than θ. Thus these two criteria for rejection are equivalent provided θ = 1 − λ.
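A minimal sketch of the resulting decision rule for the 0–1 loss with a reject option (the function name and the example posteriors are illustrative only):

import numpy as np

def decide(posteriors, lam):
    """Return a class index, or 'reject', for a vector of posterior probabilities."""
    posteriors = np.asarray(posteriors, dtype=float)
    j = int(np.argmax(posteriors))
    expected_loss = 1.0 - posteriors[j]          # smallest value of 1 - p(C_j|x)
    return j if expected_loss < lam else "reject"

print(decide([0.55, 0.30, 0.15], lam=0.5))   # class 0 (expected loss 0.45 < 0.5)
print(decide([0.40, 0.35, 0.25], lam=0.5))   # 'reject' (expected loss 0.60 >= 0.5)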
1.25 The expected squared loss for a vectorial target variable is given by

    E[L] = ∫∫ ‖y(x) − t‖² p(t, x) dx dt.

Our goal is to choose y(x) so as to minimize E[L]. We can do this formally using the calculus of variations to give

    δE[L]/δy(x) = 2 ∫ (y(x) − t) p(t, x) dt = 0.

Solving for y(x), and using the sum and product rules of probability, we obtain

    y(x) = ∫ t p(t|x) dt = E_t[t|x],

which is the extension of (1.89) to multiple target variables.
1.26 We start by expanding the square in the integrand of (1.151), giving

    ‖y(x) − t‖² = ‖y(x) − E[t|x] + E[t|x] − t‖²
                = ‖y(x) − E[t|x]‖² + (y(x) − E[t|x])ᵀ(E[t|x] − t)
                  + (E[t|x] − t)ᵀ(y(x) − E[t|x]) + ‖E[t|x] − t‖².

Following the treatment of the univariate case, we now substitute this into (1.151) and perform the integral over t. Again the cross-term vanishes and we are left with

    E[L] = ∫ ‖y(x) − E[t|x]‖² p(x) dx + ∫∫ ‖E[t|x] − t‖² p(t, x) dx dt,
from which we see directly that the function y(x) that minimizes E[L] is given by E[t|x].
1.27 Since we can choose y(x) independently for each value of x, the minimum of the expected L_q loss can be found by minimizing the integrand given by

    ∫ |y(x) − t|^q p(t|x) dt    (52)

for each value of x. Setting the derivative of (52) with respect to y(x) to zero gives the stationarity condition

    q ∫ |y(x) − t|^{q−1} sign(y(x) − t) p(t|x) dt = 0.

For q = 1 this reduces to

    ∫_{−∞}^{y(x)} p(t|x) dt = ∫_{y(x)}^{∞} p(t|x) dt,
which says that y(x) must be the conditional median of t.
For q → 0 we note that, as a function of t, the quantity |y(x) − t|^q is close to 1 everywhere except in a small neighbourhood around t = y(x) where it falls to zero. The value of (52) will therefore be close to 1, since the density p(t) is normalized, but reduced slightly by the 'notch' close to t = y(x). We obtain the biggest reduction in (52) by choosing the location of the notch to coincide with the largest value of p(t), i.e. with the (conditional) mode.
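These minimizers are easy to see empirically; the sketch below minimizes the empirical L_q loss over a grid of candidate values for a skewed sample of targets (the Gamma sample and the grid are illustrative choices):

import numpy as np

rng = np.random.default_rng(5)
t = rng.gamma(shape=2.0, scale=1.0, size=5_000)     # a skewed stand-in for p(t|x)
grid = np.linspace(0.0, t.max(), 800)               # candidate values of y(x)

def best_y(q):
    # empirical L_q loss of each candidate, averaged over the sample of targets
    losses = np.mean(np.abs(grid[:, None] - t[None, :]) ** q, axis=1)
    return grid[np.argmin(losses)]

print("q=2  :", best_y(2.0), "   mean  :", t.mean())
print("q=1  :", best_y(1.0), "   median:", np.median(t))
print("q=0.1:", best_y(0.1))   # small q favours the high-density region near the mode (about 1.0 here)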
1.28 From the discussion of the introduction of Section 1.6, we have

    h(p²) = h(p) + h(p) = 2 h(p)

and, by induction, h(pⁿ) = n h(p) for any positive integer n. Writing pⁿ = (p^{n/m})^m then gives n h(p) = m h(p^{n/m}), so that h(p^{n/m}) = (n/m) h(p) for any rational exponent,
and so, by continuity, we have that h(p^x) = x h(p) for any real number x.
Now consider the positive real numbers p and q and the real number x such that p = q^x. From the above discussion, we see that

    h(p)/ln(p) = h(q^x)/ln(q^x) = x h(q)/(x ln(q)) = h(q)/ln(q),

and hence h(p) ∝ ln(p).
1.29 The function ln(x) is concave and so we can apply Jensen's inequality in the form (1.115) but with the inequality reversed, so that

    Σ_i λ_i ln(x_i) ≤ ln( Σ_i λ_i x_i ).
1.31 We first make use of the relation I(x; y) = H(y) − H(y|x), which we obtained in (1.121), and note that the mutual information satisfies I(x; y) ≥ 0 since it is a form of Kullback-Leibler divergence. Finally we make use of the relation (1.112) to obtain the desired result (1.152).
To show that statistical independence is a sufficient condition for the equality to be satisfied, we substitute p(x, y) = p(x)p(y) into the definition of the entropy, giving

    H(x, y) = −∫∫ p(x)p(y) { ln p(x) + ln p(y) } dx dy = H(x) + H(y).

To show that independence is also a necessary condition, suppose that equality holds in (1.152); then from (1.112) we have H(y|x) = H(y) and hence I(x; y) = 0. The mutual information is
a form of KL divergence, and this vanishes only if the two distributions are equal, so that p(x, y) = p(x)p(y) as required.
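A small numerical sketch of (1.152) for a discrete joint distribution (the probability table is made up): equality holds exactly when the joint factorizes into its marginals.

import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_joint = np.array([[0.30, 0.10],
                    [0.05, 0.55]])                 # a dependent joint p(x, y)
px, py = p_joint.sum(axis=1), p_joint.sum(axis=0)
print(H(px) + H(py) - H(p_joint))                  # strictly positive: I(x; y) > 0

p_factorized = np.outer(px, py)                    # same marginals, made independent
print(H(px) + H(py) - H(p_factorized))             # zero (up to rounding): I(x; y) = 0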
1.32 When we make a change of variables, the probability density is transformed by the Jacobian of the change of variables. Thus we have

    p(x) = p(y) |∂y_i/∂x_j| = p(y) |A|.