CONTENTS
Chapter 1: Introduction
Chapter 2: Probability Distributions
Chapter 3: Linear Models for Regression
Chapter 4: Linear Models for Classification
Chapter 5: Neural Networks
Chapter 6: Kernel Methods
Chapter 7: Sparse Kernel Machines
Chapter 8: Graphical Models
Chapter 9: Mixture Models and EM
Chapter 10: Approximate Inference
Chapter 11: Sampling Methods
Chapter 12: Continuous Latent Variables
Chapter 13: Sequential Data
Chapter 14: Combining Models
Re-arranging terms then gives the required result.
1.2 For the regularized sum-of-squares error function given by (1.4) the corresponding linear equations are again obtained by differentiation, and take the same form as (1.122), but with A_ij replaced by Ã_ij, given by

    Ã_ij = A_ij + λ I_ij.
1.3 Let us denote apples, oranges and limes by a, o and l respectively. The marginal
probability of selecting an apple is given by
p(a) = p(a|r)p(r) + p(a|b)p(b) + p(a|g)p(g)
To find the probability that the box was green, given that the fruit we selected was
an orange, we can use Bayes’ theorem
    p(g|o) = p(o|g)p(g) / p(o)    (4)
The denominator in (4) is given by
p(o) = p(o|r)p(r) + p(o|b)p(b) + p(o|g)p(g)
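As a quick numerical sketch of this calculation (the box contents and priors below are made up for illustration, not necessarily those of the exercise), the marginal and posterior can be computed directly in Python:

# Illustration of Solution 1.3 with hypothetical box contents and priors.
p_fruit_given_box = {
    "r": {"a": 0.3, "o": 0.4, "l": 0.3},
    "b": {"a": 0.5, "o": 0.5, "l": 0.0},
    "g": {"a": 0.3, "o": 0.3, "l": 0.4},
}
p_box = {"r": 0.2, "b": 0.2, "g": 0.6}

# Marginal probability of an apple: p(a) = sum_k p(a|k) p(k).
p_a = sum(p_fruit_given_box[k]["a"] * p_box[k] for k in p_box)

# Posterior probability of the green box given an orange, via Bayes' theorem.
p_o = sum(p_fruit_given_box[k]["o"] * p_box[k] for k in p_box)
p_g_given_o = p_fruit_given_box["g"]["o"] * p_box["g"] / p_o

print(f"p(a) = {p_a:.3f}, p(o) = {p_o:.3f}, p(g|o) = {p_g_given_o:.3f}")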
1.4 We are often interested in finding the most probable value for some quantity. In the case of probability distributions over discrete variables this poses little problem. However, for continuous variables there is a subtlety arising from the nature of probability densities and the way they transform under non-linear changes of variable.
Consider first the way a function f(x) behaves when we change to a new variable y, where the two variables are related by x = g(y). This defines a new function of y given by

    f̃(y) = f(g(y)).    (7)
Suppose f(x) has a mode (i.e. a maximum) at x̂, so that f'(x̂) = 0. The corresponding mode of f̃(y) will occur for a value ŷ obtained by differentiating both sides of (7) with respect to y:

    f̃'(ŷ) = f'(g(ŷ)) g'(ŷ) = 0.    (8)

Assuming g'(ŷ) ≠ 0 at the mode, then f'(g(ŷ)) = 0. However, we know that f'(x̂) = 0, and so we see that the locations of the mode expressed in terms of each of the variables x and y are related by x̂ = g(ŷ), as one would expect. Thus, finding a mode with respect to the variable x is completely equivalent to first transforming to the variable y, then finding a mode with respect to y, and then transforming back to x.
Now consider the behaviour of a probability density p_x(x) under the change of variables x = g(y), where the density with respect to the new variable is p_y(y) and is given by (1.27). Let us write g'(y) = s|g'(y)| where s ∈ {−1, +1}. Then (1.27) can be written

    p_y(y) = p_x(g(y)) s g'(y).

Differentiating both sides with respect to y then gives

    p_y'(y) = s p_x'(g(y)) {g'(y)}² + s p_x(g(y)) g''(y).    (9)
Due to the presence of the second term on the right hand side of (9) the relationship x̂ = g(ŷ) no longer holds. Thus the value of x obtained by maximizing p_x(x) will not be the value obtained by transforming to p_y(y), maximizing with respect to y, and then transforming back to x. This causes modes of densities to be dependent on the choice of variables. In the case of a linear transformation, the second term on the right hand side of (9) vanishes, and so the location of the maximum transforms according to x̂ = g(ŷ).
This effect can be illustrated with a simple example, as shown in Figure 1. We begin by considering a Gaussian distribution p_x(x) over x with mean μ = 6 and standard deviation σ = 1, shown by the red curve in Figure 1. Next we draw a sample of N = 50,000 points from this distribution and plot a histogram of their values, which as expected agrees with the distribution p_x(x).
Now consider a non-linear change of variables from x to y given by

    x = g(y) = ln(y) − ln(1 − y) + 5.    (10)

The inverse of this function is given by
Figure 1: Example of the transformation of the mode of a density under a non-linear change of variables, illustrating the different behaviour compared to a simple function. The figure shows the sigmoid g⁻¹(x), the density p_x(x) over x, and the transformed density p_y(y) over y. See the text for details.
    y = g⁻¹(x) = 1 / (1 + exp(−x + 5)),    (11)

which is a logistic sigmoid function, and is shown in Figure 1 by the blue curve.
If we simply transform p_x(x) as a function of x we obtain the green curve p_x(g(y)) shown in Figure 1, and we see that the mode of the density p_x(x) is transformed via the sigmoid function to the mode of this curve. However, the density over y transforms instead according to (1.27) and is shown by the magenta curve on the left side of the diagram. Note that this has its mode shifted relative to the mode of the green curve.
To confirm this result we take our sample of 50,000 values of x, evaluate the corresponding values of y using (11), and then plot a histogram of their values. We see that this histogram matches the magenta curve in Figure 1 and not the green curve!
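The experiment is easy to reproduce; the Python sketch below assumes the change of variables has the logistic-sigmoid form given in (10) and (11) and simply compares the transformed mode of p_x(x) with the mode of a histogram of the transformed samples:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N = 6.0, 1.0, 50_000

x = rng.normal(mu, sigma, size=N)           # sample from p_x(x)
y = 1.0 / (1.0 + np.exp(-x + 5.0))          # transform each sample through g^{-1}

# Mode of p_x(x) pushed through g^{-1}, i.e. the mode of the green curve.
mode_of_transformed_function = 1.0 / (1.0 + np.exp(-mu + 5.0))

# Empirical mode of p_y(y), estimated from a histogram of the transformed sample.
counts, edges = np.histogram(y, bins=200)
mode_of_transformed_density = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])

print("g^{-1}(mode of p_x):     ", round(mode_of_transformed_function, 3))
print("mode of histogram over y:", round(mode_of_transformed_density, 3))
# The two values differ: the mode of a density does not transform like a function.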
1.5 Expanding the square we have

    E[(f(x) − E[f(x)])²] = E[f(x)² − 2 f(x) E[f(x)] + E[f(x)]²]
                         = E[f(x)²] − 2 E[f(x)] E[f(x)] + E[f(x)]²
                         = E[f(x)²] − E[f(x)]²

as required.
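A quick numerical sanity check of this identity on an arbitrary empirical sample (a sketch only, not part of the original solution):

import numpy as np

rng = np.random.default_rng(1)
f = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # any sample of f(x) values will do

lhs = np.mean((f - f.mean()) ** 2)                  # var[f] computed directly
rhs = np.mean(f ** 2) - f.mean() ** 2               # E[f^2] - E[f]^2
print(lhs, rhs)                                     # agree up to floating-point rounding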
1.6 The definition of covariance is given by (1.41) as
cov[x, y] = E[xy]− E[x]E[y]
Using (1.33) and the fact that p(x, y) = p(x)p(y) when x and y are independent, we obtain

    E[xy] = Σ_x Σ_y x y p(x) p(y) = { Σ_x x p(x) } { Σ_y y p(y) } = E[x] E[y]
and hence cov[x, y] = 0. The case where x and y are continuous variables is analogous, with (1.33) replaced by (1.34) and the sums replaced by integrals.
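A corresponding sampling check (again only a sketch) that the covariance of independently drawn variables vanishes up to Monte Carlo error:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1_000_000)
y = rng.uniform(size=1_000_000)                 # drawn independently of x

cov = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov)                                      # close to zero, of order 1/sqrt(N)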
1.7 The transformation from Cartesian to polar coordinates is defined by x = r cos θ and y = r sin θ, so that x² + y² = r². The Jacobian of the change of variables is

    ∂(x, y)/∂(r, θ) = | cos θ   −r sin θ |
                      | sin θ    r cos θ |  = r

where again we have used (2.177). Thus the double integral in (1.125) becomes

    I² = ∫_0^{2π} ∫_0^∞ exp(−r²/(2σ²)) r dr dθ
       = 2π ∫_0^∞ exp(−u/(2σ²)) (1/2) du
       = π [ −2σ² exp(−u/(2σ²)) ]_0^∞ = 2πσ²,

where we have made the change of variables u = r². Thus I = (2πσ²)^{1/2}.
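This normalization can be checked numerically; the sketch below simply sums the one-dimensional integrand on a fine grid and compares it with (2πσ²)^{1/2}:

import numpy as np

sigma = 1.7
dx = 1e-3
x = np.arange(-40.0, 40.0, dx)
I = np.sum(np.exp(-x ** 2 / (2.0 * sigma ** 2))) * dx   # crude Riemann sum of the integral
print(I, np.sqrt(2.0 * np.pi * sigma ** 2))             # the two values agree closely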
1.8 From the definition (1.46) of the univariate Gaussian, and making the change of variable y = x − μ, we can write the expectation of x as

    E[x] = (2πσ²)^{−1/2} ∫_{−∞}^{∞} exp(−y²/(2σ²)) (y + μ) dy.

We now note that in the factor (y + μ) the first term in y corresponds to an odd integrand and so this integral must vanish (to show this explicitly, write the integral as the sum of two integrals, one from −∞ to 0 and the other from 0 to ∞, and then show that these two integrals cancel). In the second term, μ is a constant and pulls outside the integral, leaving a normalized Gaussian distribution which integrates to unity, so that E[x] = μ, which is (1.49).
To obtain (1.50) we differentiate both sides of the Gaussian normalization condition (1.127) with respect to σ², which after rearranging gives

    ∫_{−∞}^{∞} N(x|μ, σ²) (x − μ)² dx = σ².

Now we expand the square on the left-hand side giving

    E[x²] − 2μ E[x] + μ² = σ².
Making use of (1.49) then gives (1.50) as required.
Finally, (1.51) follows directly from (1.49) and (1.50).
1.9 For the univariate case, we simply differentiate (1.46) with respect to x to obtain

    d/dx N(x|μ, σ²) = −N(x|μ, σ²) (x − μ)/σ².

Setting this to zero we obtain x = μ.
Similarly, for the multivariate case we differentiate (1.52) with respect to x to obtain

    ∂/∂x N(x|μ, Σ) = −(1/2) N(x|μ, Σ) ∇_x{(x − μ)ᵀ Σ⁻¹ (x − μ)} = −N(x|μ, Σ) Σ⁻¹ (x − μ),

where we have used (C.19), (C.20) and the fact that Σ⁻¹ is symmetric. Setting this derivative equal to zero, and left-multiplying by Σ, leads to the solution x = μ. (NOTE: In the 1st printing of PRML, there are mistakes in (C.20); all instances of x (vector) in the denominators should be x (scalar).)
Similarly for the variances, we first note that

    (x + z − E[x + z])² = (x − E[x])² + (z − E[z])² + 2(x − E[x])(z − E[z]),    (26)

where the final term will integrate to zero with respect to the factorized distribution p(x)p(z), and hence var[x + z] = var[x] + var[z].
1.11 Differentiating the log likelihood (1.54) with respect to μ gives

    ∂/∂μ ln p(x|μ, σ²) = (1/σ²) Σ_{n=1}^N (x_n − μ).

Setting this equal to zero and moving the terms involving μ to the other side of the equation we obtain (1.55). Similarly, differentiating (1.54) with respect to σ² and setting the result to zero, then multiplying both sides by 2(σ²)²/N and substituting μ_ML for μ, we get (1.56).
1.12 If m = n then x_n x_m = x_n², and using (1.50) we obtain E[x_n²] = μ² + σ², whereas if n ≠ m then the two data points x_n and x_m are independent and hence E[x_n x_m] = E[x_n] E[x_m] = μ², where we have used (1.49). Combining these two results we obtain (1.130).
Substituting these results into

    E[σ²_ML] = E[ (1/N) Σ_{n=1}^N (x_n − μ_ML)² ]

and collecting terms then gives

    E[σ²_ML] = μ² + σ² − 2(μ² + (1/N)σ²) + μ² + (1/N)σ² = ((N − 1)/N) σ²

as required, where we have used E[x_n μ_ML] = E[μ_ML²] = μ² + (1/N)σ², which follow from (1.130).
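A short simulation sketch of this bias result, comparing the average of σ²_ML over many data sets of size N with ((N − 1)/N)σ²:

import numpy as np

rng = np.random.default_rng(3)
N, sigma, trials = 5, 2.0, 200_000

x = rng.normal(loc=1.0, scale=sigma, size=(trials, N))
sigma2_ml = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)   # ML variance of each data set

print(sigma2_ml.mean())            # close to ...
print((N - 1) / N * sigma ** 2)    # ... the predicted value 3.2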
from which we obtain (1.132). The number of independent components in w^S_ij can be found by noting that there are D² parameters in total in this matrix, and that entries off the leading diagonal occur in constrained pairs w_ij = w_ji for j ≠ i. Thus we start with D² parameters in the matrix w^S_ij, subtract D for the number of parameters on the leading diagonal, divide by two, and then add back D for the leading diagonal, and we obtain (D² − D)/2 + D = D(D + 1)/2.
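A counting sketch that confirms D(D + 1)/2 by enumerating the diagonal and upper-triangular entries of a D × D symmetric matrix:

D = 6
independent = [(i, j) for i in range(D) for j in range(i, D)]   # diagonal plus upper triangle
print(len(independent), D * (D + 1) // 2)                       # both equal 21 for D = 6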
1.15 The redundancy in the coefficients in (1.133) arises from interchange symmetries between the indices i_k. Such symmetries can therefore be removed by enforcing an ordering on the indices, as in (1.134), so that only one member in each group of equivalent configurations occurs in the summation.
To derive (1.135) we note that the number of independent parameters n(D, M) which appear at order M can be written as

    n(D, M) = Σ_{i_1=1}^{D} Σ_{i_2=1}^{i_1} · · · Σ_{i_M=1}^{i_{M−1}} 1    (32)

which can itself be written as

    n(D, M) = Σ_{i_1=1}^{D} { Σ_{i_2=1}^{i_1} · · · Σ_{i_M=1}^{i_{M−1}} 1 }    (33)

where the term in braces has M−1 terms which, from (32), must equal n(i_1, M−1). Thus we can write

    n(D, M) = Σ_{i_1=1}^{D} n(i_1, M − 1),

which is equivalent to (1.135).
We now prove (1.136) by induction; it is easily verified that it holds for D = 1. Now we assume that it is true for a specific value of dimensionality D and then show that it must be true for dimensionality D + 1. Thus consider the left-hand side of (1.136) evaluated for D + 1, which gives

    Σ_{i=1}^{D+1} (i + M − 2)! / ((i − 1)! (M − 1)!)
        = (D + M − 1)! / ((D − 1)! M!) + (D + M − 1)! / (D! (M − 1)!)
        = (D + M)! / (D! M!),

which equals the right hand side of (1.136) for dimensionality D + 1. Thus, by induction, (1.136) must hold true for all values of D.
Finally we use induction to prove (1.137). For M = 2 we obtain the standard result n(D, 2) = ½ D(D + 1), which is also proved in Exercise 1.14. Now assume that (1.137) is correct for a specific order M − 1, so that

    n(D, M − 1) = (D + M − 2)! / ((D − 1)! (M − 1)!).

Substituting this into the right hand side of (1.135) and making use of (1.136) then gives

    n(D, M) = Σ_{i=1}^{D} (i + M − 2)! / ((i − 1)! (M − 1)!) = (D + M − 1)! / ((D − 1)! M!),

and hence shows that (1.137) is true for polynomials of order M. Thus by induction (1.137) must be true for all values of M.
1.16 NOTE: In the 1st printing of PRML, this exercise contains two typographical errors. On line 4, M6th should be Mth and on the l.h.s. of (1.139), N(d, M) should be N(D, M).
The result (1.138) follows simply from summing up the coefficients at all orders up to and including order M. To prove (1.139), we first note that when M = 0 the right hand side of (1.139) equals 1, which we know to be correct since this is the number of parameters at zeroth order, which is just the constant offset in the polynomial. Assuming that (1.139) is correct at order M, we obtain the following result at order M + 1:

    N(D, M + 1) = Σ_{m=0}^{M+1} n(D, m)
                = Σ_{m=0}^{M} n(D, m) + n(D, M + 1)
                = (D + M)! / (D! M!) + (D + M)! / ((D − 1)! (M + 1)!)
                = (D + M + 1)! / (D! (M + 1)!),

which is the required result at order M + 1.
Now assume M ≫ D. Using Stirling's formula n! ≃ nⁿ e⁻ⁿ we have

    N(D, M) ≃ (D + M)^{D+M} e^{−D−M} / (D! M^M e^{−M})
            = (D + M)^{D+M} e^{−D} / (D! M^M)
            ≃ M^{D+M} (1 + D/M)^{D+M} e^{−D} / (D! M^M)
            ≃ M^{D+M} e^{D} e^{−D} / (D! M^M) = M^D / D!,

which grows like M^D with M. The case where D ≫ M is identical, with the roles of D and M exchanged. By numerical evaluation we obtain N(10, 3) = 286 and N(100, 3) = 176,851.
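These counting results are easy to check numerically; the sketch below encodes n(D, M) and N(D, M) as binomial coefficients and verifies the recursion and the values quoted above:

from math import comb

def n(D, M):
    # number of independent parameters appearing at order M in D dimensions
    return comb(D + M - 1, M)

def N(D, M):
    # total number of parameters in a polynomial of order M
    return comb(D + M, M)

assert n(4, 3) == sum(n(i, 2) for i in range(1, 5))     # the recursion (1.135)
assert N(3, 2) == sum(n(3, m) for m in range(0, 3))     # N as a sum of n over orders
print(N(10, 3), N(100, 3))                              # 286 176851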
If x is an integer we can apply proof by induction to relate the gamma function to the factorial function. Suppose that Γ(x + 1) = x! holds. Then from the result (39) we have Γ(x + 2) = (x + 1) Γ(x + 1) = (x + 1)!. Finally, Γ(1) = 1 = 0!, which completes the proof by induction.
1.18 On the right-hand side of (1.142) we make the change of variables u = r² to give

    (1/2) S_D ∫_0^∞ e^{−u} u^{D/2−1} du = (1/2) S_D Γ(D/2),

where we have used the definition (1.141) of the Gamma function. The left-hand side of (1.142) is a product of D one-dimensional Gaussian integrals, each equal to π^{1/2}, giving π^{D/2}. Equating the two expressions gives (1.143). The volume of a sphere of radius 1 in D dimensions is obtained by integrating the surface area over the radius:

    V_D = S_D ∫_0^1 r^{D−1} dr = S_D / D.
1.19 The volume of the cube is (2a)^D. Combining this with (1.143) and (1.144) we obtain (1.145). Using Stirling's formula (1.146) in (1.145), the ratio becomes, for large D,
    volume of sphere / volume of cube ≃ (πe/(2D))^{D/2} · 1/√(πD),
which goes to 0 as D → ∞. The distance from the center of the cube to the mid-point of one of the sides is a, since this is where it makes contact with the sphere. Similarly, the distance to one of the corners is a√D, from Pythagoras' theorem. Thus the ratio is √D.
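A short numerical sketch of this behaviour, evaluating the exact ratio (1.145) for increasing dimensionality:

from math import pi, gamma

# ratio of the volume of the unit-radius sphere to that of its circumscribing cube,
# pi^(D/2) / (D 2^(D-1) Gamma(D/2)), as in (1.145)
for D in (1, 2, 5, 10, 20, 50):
    ratio = pi ** (D / 2) / (D * 2 ** (D - 1) * gamma(D / 2))
    print(D, ratio)          # decays rapidly towards zero as D grows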
1.20 Since p(x) is radially symmetric, it will be roughly constant over a shell of radius r and thickness ε. This shell has volume S_D r^{D−1} ε and, since ‖x‖² = r², we have

    ∫_shell p(x) dx ≃ p(x)|_{‖x‖=r} S_D r^{D−1} ε,

from which we obtain (1.148). Differentiating (1.148) with respect to r and setting the derivative to zero gives, up to constant factors,

    [ (D − 1) r^{D−2} − r^D/σ² ] exp(−r²/(2σ²)) = 0.

Solving for r, and using D ≫ 1, we obtain r̂ ≃ √D σ.
Next we note that

    p(r̂ + ε) ∝ (r̂ + ε)^{D−1} exp( −(r̂ + ε)²/(2σ²) )
             = exp( −(r̂ + ε)²/(2σ²) + (D − 1) ln(r̂ + ε) ).

We now expand p(r) around the point r̂. Since this is a stationary point of p(r) we must keep terms up to second order. Making use of the expansion ln(1 + x) = x − x²/2 + O(x³), together with D ≫ 1, we obtain (1.149).
Finally, from (1.147) we see that the probability density at the origin is given by p(x = 0) = (2πσ²)^{−D/2}, whereas the density at ‖x‖ = r̂ is given from (1.147) by

    p(‖x‖ = r̂) = (2πσ²)^{−D/2} exp(−r̂²/(2σ²)) ≃ (2πσ²)^{−D/2} exp(−D/2),

so that the ratio of the density at the origin to the density at radius r̂ is exp(D/2).
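A sampling sketch of this concentration effect for a D-dimensional Gaussian (the particular values of D and σ below are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
D, sigma = 100, 1.0

# radii of samples from N(0, sigma^2 I) concentrate around sqrt(D) * sigma
r = np.linalg.norm(rng.normal(scale=sigma, size=(20_000, D)), axis=1)
print(r.mean(), np.sqrt(D) * sigma)          # both are close to 10.0

# log of the density ratio p(0) / p(r_hat) = r_hat^2 / (2 sigma^2) = D / 2
print((np.sqrt(D) * sigma) ** 2 / (2 * sigma ** 2))   # 50.0, i.e. a ratio of exp(50)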
1.21 Since the square root function is monotonic for non-negative numbers, we can take the square root of the relation a ≤ b to obtain a^{1/2} ≤ b^{1/2}. Then we multiply both sides by the non-negative quantity a^{1/2} to obtain a ≤ (ab)^{1/2}.
The probability of a misclassification is given, from (1.78), by

    p(mistake) = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx.

Since the decision regions have been chosen to minimize this probability, in each region the integrand is the smaller of the two joint densities, and so it can be bounded by {p(x, C_1) p(x, C_2)}^{1/2} using the result above, giving the required inequality.
1.22 Substituting L_kj = 1 − I_kj into (1.81), and using the fact that the posterior probabilities sum to one, we find that for each x we should choose the class j for which 1 − p(C_j|x) is a minimum, which is equivalent to choosing the j for which the posterior probability p(C_j|x) is a maximum. This loss matrix assigns a loss of one if the example is misclassified, and a loss of zero if it is correctly classified, and hence minimizing the expected loss will minimize the misclassification rate.
1.23 From (1.81) we see that for a general loss matrix and arbitrary class priors, the expected loss is minimized by assigning an input x to the class j which minimizes

    Σ_k L_kj p(C_k|x) = (1/p(x)) Σ_k L_kj p(x|C_k) p(C_k),

and so there is a direct trade-off between the priors p(C_k) and the loss matrix L_kj.
1.24 A vector x belongs to class C_k with probability p(C_k|x). If we decide to assign x to class C_j we will incur an expected loss of Σ_k L_kj p(C_k|x), whereas if we select the reject option we will incur a loss of λ. Thus, if

    j = arg min_l Σ_k L_kl p(C_k|x),

we minimize the expected loss by choosing class j when Σ_k L_kj p(C_k|x) < λ, and selecting the reject option otherwise. For the loss matrix L_kj = 1 − I_kj we have Σ_k L_kj p(C_k|x) = 1 − p(C_j|x), and so we
reject unless the smallest value of 1 − p(C_l|x) is less than λ, or equivalently if the largest value of p(C_l|x) is less than 1 − λ. In the standard reject criterion we reject if the largest posterior probability is less than θ. Thus these two criteria for rejection are equivalent provided θ = 1 − λ.
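A minimal sketch of the resulting decision rule for the 0–1 loss with a reject option (the function name and the example posteriors are illustrative only):

import numpy as np

def decide(posteriors, lam):
    """Return a class index, or 'reject', for a vector of posterior probabilities."""
    posteriors = np.asarray(posteriors, dtype=float)
    j = int(np.argmax(posteriors))
    expected_loss = 1.0 - posteriors[j]          # smallest value of 1 - p(C_j|x)
    return j if expected_loss < lam else "reject"

print(decide([0.55, 0.30, 0.15], lam=0.5))   # class 0 (expected loss 0.45 < 0.5)
print(decide([0.40, 0.35, 0.25], lam=0.5))   # 'reject' (expected loss 0.60 >= 0.5)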
1.25 The expected squared loss for a vectorial target variable is given by

    E[L] = ∫∫ ‖y(x) − t‖² p(t, x) dx dt.

Our goal is to choose y(x) so as to minimize E[L]. We can do this formally using the calculus of variations to give

    δE[L]/δy(x) = 2 ∫ (y(x) − t) p(t, x) dt = 0.

Solving for y(x), and using the sum and product rules of probability, we obtain

    y(x) = ∫ t p(t|x) dt = E_t[t|x],

which is the extension of (1.89) to multiple target variables.
1.26 We start by expanding the square in the integrand of (1.151), giving

    ‖y(x) − t‖² = ‖y(x) − E[t|x] + E[t|x] − t‖²
                = ‖y(x) − E[t|x]‖² + (y(x) − E[t|x])ᵀ(E[t|x] − t)
                  + (E[t|x] − t)ᵀ(y(x) − E[t|x]) + ‖E[t|x] − t‖².

Following the treatment of the univariate case, we now substitute this into (1.151) and perform the integral over t. Again the cross-term vanishes and we are left with

    E[L] = ∫ ‖y(x) − E[t|x]‖² p(x) dx + ∫∫ ‖E[t|x] − t‖² p(t, x) dx dt,
from which we see directly that the function y(x) that minimizes E[L] is given by E[t|x].
1.27 Since we can choose y(x) independently for each value of x, the minimum of the expected L_q loss can be found by minimizing the integrand given by

    ∫ |y(x) − t|^q p(t|x) dt    (52)

for each value of x. Setting the derivative of (52) with respect to y(x) to zero gives the stationarity condition

    q ∫ |y(x) − t|^{q−1} sign(y(x) − t) p(t|x) dt = 0.

For q = 1 this reduces to

    ∫_{−∞}^{y(x)} p(t|x) dt = ∫_{y(x)}^{∞} p(t|x) dt,
which says that y(x) must be the conditional median of t.
For q → 0 we note that, as a function of t, the quantity |y(x) − t|^q is close to 1 everywhere except in a small neighbourhood around t = y(x) where it falls to zero. The value of (52) will therefore be close to 1, since the density p(t) is normalized, but reduced slightly by the 'notch' close to t = y(x). We obtain the biggest reduction in (52) by choosing the location of the notch to coincide with the largest value of p(t), i.e. with the (conditional) mode.
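These minimizers are easy to see empirically; the sketch below minimizes the empirical L_q loss over a grid of candidate values for a skewed sample of targets (the Gamma sample and the grid are illustrative choices):

import numpy as np

rng = np.random.default_rng(5)
t = rng.gamma(shape=2.0, scale=1.0, size=5_000)     # a skewed stand-in for p(t|x)
grid = np.linspace(0.0, t.max(), 800)               # candidate values of y(x)

def best_y(q):
    # empirical L_q loss of each candidate, averaged over the sample of targets
    losses = np.mean(np.abs(grid[:, None] - t[None, :]) ** q, axis=1)
    return grid[np.argmin(losses)]

print("q=2  :", best_y(2.0), "   mean  :", t.mean())
print("q=1  :", best_y(1.0), "   median:", np.median(t))
print("q=0.1:", best_y(0.1))   # small q favours the high-density region near the mode (about 1.0 here)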
1.28 From the discussion of the introduction of Section 1.6, we have

    h(p²) = h(p) + h(p) = 2 h(p)

and, by induction, h(pⁿ) = n h(p) for any positive integer n. Writing pⁿ = (p^{n/m})^m then gives n h(p) = m h(p^{n/m}), so that h(p^{n/m}) = (n/m) h(p) for any rational exponent,
and so, by continuity, we have that h(p^x) = x h(p) for any real number x.
Now consider the positive real numbers p and q and the real number x such that p = q^x. From the above discussion, we see that

    h(p)/ln(p) = h(q^x)/ln(q^x) = x h(q)/(x ln(q)) = h(q)/ln(q),

and hence h(p) ∝ ln(p).
1.29 The function ln(x) is concave and so we can apply Jensen's inequality in the form (1.115) but with the inequality reversed, so that

    Σ_i λ_i ln(x_i) ≤ ln( Σ_i λ_i x_i ).
1.31 We first make use of the relation I(x; y) = H(y) − H(y|x), which we obtained in (1.121), and note that the mutual information satisfies I(x; y) ≥ 0 since it is a form of Kullback-Leibler divergence. Finally we make use of the relation (1.112) to obtain the desired result (1.152).
To show that statistical independence is a sufficient condition for the equality to be satisfied, we substitute p(x, y) = p(x)p(y) into the definition of the entropy, giving

    H(x, y) = −∫∫ p(x)p(y) { ln p(x) + ln p(y) } dx dy = H(x) + H(y).

To show that independence is also a necessary condition, suppose that equality holds in (1.152); then from (1.112) we have H(y|x) = H(y) and hence I(x; y) = 0. The mutual information is
a form of KL divergence, and this vanishes only if the two distributions are equal, so that p(x, y) = p(x)p(y) as required.
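A small numerical sketch of (1.152) for a discrete joint distribution (the probability table is made up): equality holds exactly when the joint factorizes into its marginals.

import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_joint = np.array([[0.30, 0.10],
                    [0.05, 0.55]])                 # a dependent joint p(x, y)
px, py = p_joint.sum(axis=1), p_joint.sum(axis=0)
print(H(px) + H(py) - H(p_joint))                  # strictly positive: I(x; y) > 0

p_factorized = np.outer(px, py)                    # same marginals, made independent
print(H(px) + H(py) - H(p_factorized))             # zero (up to rounding): I(x; y) = 0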
1.32 When we make a change of variables, the probability density is transformed by the Jacobian of the change of variables. Thus we have

    p(x) = p(y) |∂y_i/∂x_j| = p(y) |A|.