5 Information Theory
Estimation theory gives one approach to characterizing random variables. This was based on building parametric models and describing the data by the parameters.
An alternative approach is given by information theory. Here the emphasis is on coding. We want to code the observations. The observations can then be stored in the memory of a computer, or transmitted by a communications channel, for example. Finding a suitable code depends on the statistical properties of the data.
In independent component analysis (ICA), estimation theory and information theory offer the two principal theoretical approaches.
In this chapter, the basic concepts of information theory are introduced. The latter half of the chapter deals with a more specialized topic: approximation of entropy. These concepts are needed in the ICA methods of Part II.
5.1 ENTROPY
5.1.1 Definition of entropy
Entropy is the basic concept of information theory. Entropy $H$ is defined for a discrete-valued random variable $X$ as
$$H(X) = -\sum_i P(X = a_i) \log P(X = a_i) \qquad (5.1)$$
where the $a_i$ are the possible values of $X$. Depending on the base of the logarithm, different units of entropy are obtained. Usually, the logarithm with base 2 is used, in which case the unit is called a bit. In the following, the base is
not important, since it only changes the measurement scale, so it is not explicitly mentioned.

Fig. 5.1 The function $f$ in (5.2), plotted on the interval $[0, 1]$.
Let us define the function $f$ as
$$f(p) = -p \log p \qquad (5.2)$$
This is a nonnegative function that is zero for $p = 0$ and for $p = 1$, and positive for values in between; it is plotted in Fig. 5.1. Using this function, entropy can be written as
$$H(X) = \sum_i f(P(X = a_i)) \qquad (5.3)$$
Considering the shape of $f$, we see that the entropy is small if the probabilities $P(X = a_i)$ are close to 0 or 1, and large if the probabilities are in between.
In fact, the entropy of a random variable can be interpreted as the degree of information that the observation of the variable gives. The more "random", i.e., unpredictable and unstructured the variable is, the larger its entropy. Assume that the probabilities are all close to 0, except for one that is close to 1 (the probabilities must sum up to one). Then there is little randomness in the variable, since it almost always takes the same value. This is reflected in its small entropy. On the other hand, if all the probabilities are equal, then they are relatively far from 0 and 1, and $f$ takes large values. This means that the entropy is large, which reflects the fact that the variable is really random: we cannot predict which value it takes.
Example 5.1 Let us consider a random variable $X$ that can have only two values, $a$ and $b$. Denote by $p$ the probability that it has the value $a$; then the probability that it is $b$ is equal to $1 - p$. The entropy of this random variable can be computed as
$$H(X) = f(p) + f(1 - p) = -p \log p - (1 - p) \log (1 - p) \qquad (5.4)$$
Thus, entropy is a simple function of $p$. (It does not depend on the values $a$ and $b$.) Clearly, this function has the same properties as $f$: it is a nonnegative function that is zero for $p = 0$ and for $p = 1$, and positive for values in between. In fact, it is maximized for $p = 1/2$ (this is left as an exercise). Thus, the entropy is largest when the two values are obtained with equal probabilities of 50%. In contrast, if one of the values is obtained almost always (say, with a probability of 99.9%), the entropy of $X$ is small, since there is little randomness in the variable.
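This example is easy to check numerically. The following minimal sketch (not part of the original text) uses base-2 logarithms, so that the entropy is measured in bits; the helper name is just illustrative.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a variable with two values of probabilities p and 1 - p."""
    # Treat 0 * log2(0) as 0, so that p = 0 and p = 1 give zero entropy.
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

for p in (0.001, 0.1, 0.5, 0.9, 0.999):
    print(f"p = {p:5.3f}   H = {binary_entropy(p):.4f} bits")
# The maximum, 1 bit, is attained at p = 0.5; for p = 0.999 the entropy is tiny.
```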
5.1.2 Entropy and coding length
The connection between entropy and randomness can be made more rigorous by considering coding length. Assume that we want to find a binary code for a large number of observations of $X$, so that the code uses the minimum number of bits possible. According to the fundamental results of information theory, entropy is very closely related to the length of the code required. Under some simplifying assumptions, the length of the shortest code is bounded below by the entropy, and this bound can be approached arbitrarily closely; see, e.g., [97]. So entropy gives, roughly, the average minimum code length of the random variable.
Since this topic is outside the scope of this book, we will just illustrate it with two examples.
Example 5.2 Consider again the case of a random variable with two possible values, $a$ and $b$. If the variable almost always takes the same value, its entropy is small. This is reflected in the fact that the variable is easy to code. In fact, assume that the value $a$ is almost always obtained. Then one efficient code might be obtained simply by counting how many $a$'s are found between two subsequent observations of $b$, and writing down these numbers. Since we then need to code only a few numbers, we are able to code the data very efficiently.
In the extreme case where the probability of $a$ is 1, there is actually nothing left to code and the coding length is zero. On the other hand, if both values have the same probability, this trick cannot be used to obtain an efficient coding mechanism, and every value must be coded separately by one bit.
Example 5.3 Consider a random variable $X$ that can have eight different values with probabilities $(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)$. The entropy of $X$ is 2 bits (this computation is left as an exercise to the reader). If we just coded the data in the ordinary way, we would need 3 bits for every observation. But a more intelligent way is to code frequent values with short binary strings and infrequent values with longer strings. Here, we could use the following strings for the eight outcomes: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. (Note that the strings can be written one after another with no spaces, since they are designed so that one always knows when a string ends.) With this encoding, the average number of bits needed for each outcome is only 2, which is in fact equal to the entropy. So we have gained a 33% reduction in coding length.
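The numbers in this example can be verified with a short computation; the sketch below (base-2 logarithms) evaluates both the entropy and the average length of the listed code.

```python
import numpy as np

probs = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy = -np.sum(probs * np.log2(probs))
avg_len = np.sum(probs * np.array([len(c) for c in codes]))

print(f"entropy             = {entropy:.4f} bits")          # 2.0
print(f"average code length = {avg_len:.4f} bits")          # 2.0, equal to the entropy
print(f"saving vs. 3 bits   = {(1 - avg_len / 3) * 100:.0f}%")  # about 33%
```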
5.1.3 Differential entropy
The definition of entropy for a discrete-valued random variable can be generalized to continuous-valued random variables and vectors, in which case it is often called differential entropy.
The differential entropy $H$ of a random variable $x$ with density $p_x(\cdot)$ is defined as
$$H(x) = -\int p_x(\xi) \log p_x(\xi)\, d\xi = \int f(p_x(\xi))\, d\xi$$
Differential entropy can be interpreted as a measure of randomness in the same way as entropy. If the random variable is concentrated on certain small intervals, its differential entropy is small.
Note that differential entropy can be negative. Ordinary entropy cannot be negative, because the function $f$ in (5.2) is nonnegative on the interval $[0, 1]$, and discrete probabilities necessarily stay in this interval. But probability densities can be larger than 1, in which case $f$ takes negative values. So, when we speak of a "small differential entropy", it may be negative and have a large absolute value.
It is now easy to see what kind of random variables have small entropies. They are the ones whose probability densities take large values, since these give strong negative contributions to the integral in (5.8). This means that certain intervals are quite probable. Thus we again find that entropy is small when the variable is not very random, that is, when it is contained in some limited intervals with high probabilities.
Example 5.4 Consider a random variable $x$ that has a uniform probability distribution in the interval $[0, a]$. Its density is given by
$$p_x(\xi) = \begin{cases} 1/a & \text{for } 0 \le \xi \le a \\ 0 & \text{otherwise} \end{cases}$$
The differential entropy can be evaluated as
$$H(x) = -\int_0^a \frac{1}{a} \log \frac{1}{a}\, d\xi = \log a$$
Thus we see that the entropy is large if $a$ is large, and small if $a$ is small. This is natural, because the smaller $a$ is, the less randomness there is in $x$. In the limit where $a$ goes to 0, the differential entropy goes to $-\infty$, because in the limit $x$ is no longer random at all: it is always 0.
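As an illustrative check, one can estimate the differential entropy from samples of a uniform variable and compare it with the closed-form value $\log a$. The histogram estimator in the sketch below is only a crude illustration (not a recommended entropy estimator), using natural logarithms so that entropy is in nats.

```python
import numpy as np

def hist_entropy(samples, bins=100):
    """Crude histogram estimate of differential entropy, in nats."""
    counts, edges = np.histogram(samples, bins=bins)
    mass = counts / counts.sum()     # probability mass of each bin
    width = np.diff(edges)           # bin widths
    nz = mass > 0
    # The density in bin i is approximately mass[i] / width[i].
    return -np.sum(mass[nz] * np.log(mass[nz] / width[nz]))

rng = np.random.default_rng(0)
for a in (0.5, 1.0, 4.0):
    x = rng.uniform(0.0, a, size=200_000)
    print(f"a = {a:3.1f}   estimate = {hist_entropy(x):6.3f}   log a = {np.log(a):6.3f}")
# For a < 1 the differential entropy is negative, as noted above.
```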
The interpretation of entropy as coding length is more or less valid for differential entropy as well. The situation is more complicated, however, since the coding length interpretation requires that we discretize (quantize) the values of $x$. In this case, the coding length depends on the discretization, i.e., on the accuracy with which we want to represent the random variable. Thus the actual coding length is given by the sum of the entropy and a function of the accuracy of representation. We will not go into the details here; see [97] for more information.
The definition of differential entropy can be straightforwardly generalized to the multidimensional case. Let $x$ be a random vector with density $p_x(\cdot)$. The differential entropy is then defined as
$$H(x) = -\int p_x(\xi) \log p_x(\xi)\, d\xi = \int f(p_x(\xi))\, d\xi \qquad (5.8)$$
5.1.4 Entropy of a transformation
Consider an invertible transformation of the random vector $x$, say
$$y = f(x)$$
In this section, we show the connection between the entropy of $y$ and that of $x$.
A short, if somewhat sloppy, derivation is as follows. (A more rigorous derivation is given in the Appendix.) Denote by $Jf(\xi)$ the Jacobian matrix of the function $f$, i.e., the matrix of the partial derivatives of $f$ at point $\xi$. The classic relation between the density $p_y$ of $y$ and the density $p_x$ of $x$, as given in Eq. (2.82), can then be formulated as
$$p_y(\xi) = p_x(f^{-1}(\xi))\, |\det Jf(f^{-1}(\xi))|^{-1} \qquad (5.10)$$
Now, expressing the entropy as an expectation,
$$H(y) = -E\{\log p_y(y)\}$$
we get
$$E\{\log p_y(y)\} = E\{\log [\,p_x(f^{-1}(y))\, |\det Jf(f^{-1}(y))|^{-1}\,]\} = E\{\log [\,p_x(x)\, |\det Jf(x)|^{-1}\,]\} = E\{\log p_x(x)\} - E\{\log |\det Jf(x)|\} \qquad (5.12)$$
Thus we obtain the relation between the entropies as
$$H(y) = H(x) + E\{\log |\det Jf(x)|\} \qquad (5.13)$$
In other words, the entropy is increased in the transformation by $E\{\log |\det Jf(x)|\}$.
An important special case is the linear transformation
$$y = Mx \qquad (5.14)$$
in which case we obtain
$$H(y) = H(x) + \log |\det M| \qquad (5.15)$$
This also shows that differential entropy is not scale-invariant. Consider a random variable $x$. If we multiply it by a scalar constant $\alpha$, the differential entropy changes as
$$H(\alpha x) = H(x) + \log |\alpha| \qquad (5.16)$$
Thus, just by changing the scale, we can change the differential entropy. This is why the scale of $x$ is often fixed before measuring its differential entropy.
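The scaling relation can be checked on samples: multiplying a uniform variable on $[0, a]$ by $\alpha$ gives a uniform variable on $[0, \alpha a]$, so the entropy should increase by $\log \alpha$. The sketch below uses the same kind of crude histogram estimator as in the earlier illustration (natural logarithms, illustrative only).

```python
import numpy as np

def hist_entropy(samples, bins=200):
    """Crude histogram estimate of differential entropy, in nats (illustration only)."""
    counts, edges = np.histogram(samples, bins=bins)
    mass = counts / counts.sum()
    width = np.diff(edges)
    nz = mass > 0
    return -np.sum(mass[nz] * np.log(mass[nz] / width[nz]))

rng = np.random.default_rng(1)
a, alpha = 2.0, 5.0
x = rng.uniform(0.0, a, size=500_000)

print(f"H(alpha x) - H(x) = {hist_entropy(alpha * x) - hist_entropy(x):.3f}")
print(f"log|alpha|        = {np.log(abs(alpha)):.3f}")   # the two should be close
```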
5.2 MUTUAL INFORMATION
5.2.1 Definition using entropy
Mutual information is a measure of the information that members of a set of random variables have on the other random variables in the set. Using entropy, we can define the mutual information $I$ between $n$ (scalar) random variables $x_i$, $i = 1, \ldots, n$, as follows:
$$I(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} H(x_i) - H(x) \qquad (5.17)$$
where $x$ is the vector containing all the $x_i$.
Mutual information can be interpreted by using the interpretation of entropy as code length. The terms $H(x_i)$ give the lengths of the codes for the $x_i$ when these are coded separately, and $H(x)$ gives the code length when $x$ is coded as a random vector, i.e., all the components are coded with the same code. Mutual information thus shows what code length reduction is obtained by coding the whole vector instead of the separate components. In general, better codes can be obtained by coding the whole vector. However, if the $x_i$ are independent, they give no information on each other, and one could just as well code the variables separately without increasing the code length.
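For discrete variables, the same definition can be evaluated directly from a joint probability table: for two variables, $I(x_1, x_2) = H(x_1) + H(x_2) - H(x_1, x_2)$. The sketch below does this for a small, made-up joint distribution of two binary variables.

```python
import numpy as np

def entropy_bits(p):
    """Entropy (in bits) of a discrete distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint distribution of two binary variables: rows index x1, columns index x2.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

H_x1 = entropy_bits(joint.sum(axis=1))   # marginal entropy of x1
H_x2 = entropy_bits(joint.sum(axis=0))   # marginal entropy of x2
H_joint = entropy_bits(joint.ravel())    # entropy of the pair coded together

I = H_x1 + H_x2 - H_joint
print(f"I(x1, x2) = {I:.4f} bits")       # zero only if the joint density factorizes
```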
5.2.2 Definition using Kullback-Leibler divergence
Alternatively, mutual information can be interpreted as a distance, using what is called the Kullback-Leibler divergence. This is defined between two $n$-dimensional probability density functions (pdf's) $p_1$ and $p_2$ as
$$\delta(p_1, p_2) = \int p_1(\xi) \log \frac{p_1(\xi)}{p_2(\xi)}\, d\xi \qquad (5.18)$$
The Kullback-Leibler divergence can be considered as a kind of distance between the two probability densities, because it is always nonnegative, and zero if and only if the two distributions are equal. This is a direct consequence of the (strict) convexity of the negative logarithm, and the application of the classic Jensen's inequality. Jensen's inequality (see [97]) says that for any strictly convex function $f$ and any random variable $y$, we have
$$E\{f(y)\} \geq f(E\{y\}) \qquad (5.19)$$
Take $f(y) = -\log(y)$, and assume that $y = p_2(x)/p_1(x)$, where $x$ has the distribution given by $p_1$. Then we have
$$\delta(p_1, p_2) = E\{f(y)\} = E\left\{-\log \frac{p_2(x)}{p_1(x)}\right\} = \int p_1(\xi)\left(-\log \frac{p_2(\xi)}{p_1(\xi)}\right) d\xi \geq -\log \int p_1(\xi)\, \frac{p_2(\xi)}{p_1(\xi)}\, d\xi = -\log \int p_2(\xi)\, d\xi = -\log 1 = 0 \qquad (5.20)$$
Moreover, we have equality in Jensen's inequality if and only if $y$ is constant. In our case, it is constant if and only if the two distributions are equal, so we have proven the announced property of the Kullback-Leibler divergence.
The Kullback-Leibler divergence is not a proper distance measure, though, because it is not symmetric.
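Both properties are easy to see numerically for discrete distributions; in the sketch below, the probability vectors are made up for illustration, and the divergence is computed in nats.

```python
import numpy as np

def kl_divergence(p1, p2):
    """Kullback-Leibler divergence of two discrete distributions, in nats."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    nz = p1 > 0
    return np.sum(p1[nz] * np.log(p1[nz] / p2[nz]))

p = [0.5, 0.4, 0.1]
q = [1/3, 1/3, 1/3]

print(f"KL(p, q) = {kl_divergence(p, q):.4f}")   # nonnegative
print(f"KL(q, p) = {kl_divergence(q, p):.4f}")   # a different value: not symmetric
print(f"KL(p, p) = {kl_divergence(p, p):.4f}")   # zero for identical distributions
```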
To apply the Kullback-Leibler divergence here, let us begin by considering that if the random variables $x_i$ were independent, their joint probability density could be factorized according to the definition of independence. Thus one might measure the independence of the $x_i$ as the Kullback-Leibler divergence between the real density $p_1 = p_x(\cdot)$ and the factorized density $p_2 = p_1(\xi_1)\, p_2(\xi_2) \cdots p_n(\xi_n)$, where the $p_i(\cdot)$ are the marginal densities of the $x_i$. In fact, simple algebraic manipulations show that this quantity equals the mutual information that we defined using entropy in (5.17); this is left as an exercise.
The interpretation as Kullback-Leibler divergence implies the following important property: mutual information is always nonnegative, and it is zero if and only if the variables are independent. This is a direct consequence of the properties of the Kullback-Leibler divergence.
5.3 MAXIMUM ENTROPY
5.3.1 Maximum entropy distributions
An important class of methods that have applications in many domains is given by the maximum entropy methods. These methods apply the concept of entropy to the task of regularization.
Assume that the information available on the density $p_x(\cdot)$ of the scalar random variable $x$ is of the form
$$\int p(\xi) F_i(\xi)\, d\xi = c_i \quad \text{for } i = 1, \ldots, n \qquad (5.21)$$
which means in practice that we have estimated the expectations $E\{F_i(x)\}$ of $n$ different functions $F_i$ of $x$. (Note that $i$ is here an index, not an exponent.) The question is now: What is the probability density function $p_0$ that satisfies the constraints in (5.21) and has maximum entropy among such densities? (Earlier, we defined the entropy of a random variable, but the definition can be used with pdf's as well.) This question can be motivated by noting that a finite number of observations cannot tell us exactly what $p$ is like. So we might use some kind of regularization to obtain the most useful $p$ compatible with these measurements. Entropy can here be considered as a regularization measure that helps us find the least structured density compatible with the measurements. In other words, the maximum entropy density can be interpreted as the density that is compatible with the measurements and makes the minimum number of assumptions on the data. This is because entropy can be interpreted as a measure of randomness, and therefore the maximum entropy density
is the most random of all the pdf's that satisfy the constraints. For further details on why entropy can be used as a measure of regularity, see [97, 353].
The basic result of the maximum entropy method (see, e.g., [97, 353]) tells us that under some regularity conditions, the density $p_0(\xi)$ which satisfies the constraints (5.21) and has maximum entropy among all such densities is of the form
$$p_0(\xi) = A \exp\left(\sum_i a_i F_i(\xi)\right) \qquad (5.22)$$
Here, $A$ and the $a_i$ are constants that are determined from the $c_i$, using the constraints in (5.21) (i.e., by substituting the right-hand side of (5.22) for $p$ in (5.21)), and the constraint $\int p_0(\xi)\, d\xi = 1$. This leads in general to a system of $n + 1$ nonlinear equations that may be difficult to solve, so in general numerical methods must be used.
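As an illustration of such a numerical solution (this is only a sketch, not a method described in the text), the snippet below uses generic quadrature and root-finding routines to determine the constants in (5.22) for the assumed constraint functions $F_1(\xi) = \xi$ and $F_2(\xi) = \xi^2$ with $c_1 = 0$, $c_2 = 1$. With these constraints, the solution should come out close to the standardized gaussian density, as discussed in the next section.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

# Assumed example: constraints E{x} = 0 and E{x^2} = 1.
F = [lambda x: x, lambda x: x**2]
c = np.array([0.0, 1.0])

def unnormalized(x, a):
    # exp(sum_i a_i F_i(x)); a[1] must stay negative for the integrals to converge.
    return np.exp(a[0] * F[0](x) + a[1] * F[1](x))

def moment_equations(a):
    Z = quad(lambda x: unnormalized(x, a), -np.inf, np.inf)[0]
    moments = [quad(lambda x: F[i](x) * unnormalized(x, a), -np.inf, np.inf)[0] / Z
               for i in range(len(F))]
    return np.array(moments) - c

# Start with a clearly negative coefficient for x^2 so the density is integrable.
a = fsolve(moment_equations, x0=np.array([0.0, -0.6]))
A = 1.0 / quad(lambda x: unnormalized(x, a), -np.inf, np.inf)[0]
print("a =", a)   # expected close to (0, -0.5)
print("A =", A)   # expected close to 1/sqrt(2*pi), about 0.399
```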
5.3.2 Maximality property of the gaussian distribution
Now, consider the set of random variables that can take all values on the real line, and have zero mean and a fixed variance, say 1 (thus, we have two constraints). The maximum entropy distribution for such variables is the gaussian distribution. This is because, by (5.22), the distribution has the form
$$p_0(\xi) = A \exp(a_1 \xi^2 + a_2 \xi)$$
and all probability densities of this form are gaussian by definition (see Section 2.5).
Thus we have the fundamental result that a gaussian variable has the largest entropy among all random variables of unit variance. This means that entropy could be used as a measure of nongaussianity. In fact, this shows that the gaussian distribution is the "most random" or the least structured of all distributions. Entropy is small for distributions that are clearly concentrated on certain values, i.e., when the variable is clearly clustered, or has a pdf that is very "spiky". This property can be generalized to arbitrary variances and, what is more important, to multidimensional spaces: the gaussian distribution has maximum entropy among all distributions with a given covariance matrix.
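This maximality can be illustrated by comparing the closed-form differential entropies (in nats) of a few distributions that all have unit variance; the short sketch below just evaluates the standard formulas.

```python
import numpy as np

# Closed-form differential entropies (in nats) of three unit-variance distributions.
H_gauss   = 0.5 * np.log(2 * np.pi * np.e)   # gaussian
H_laplace = 1.0 + np.log(np.sqrt(2.0))       # Laplacian scaled to variance 1
H_uniform = np.log(2 * np.sqrt(3.0))         # uniform on [-sqrt(3), sqrt(3)]

print(f"gaussian : {H_gauss:.4f}")    # ~1.42, the largest
print(f"laplacian: {H_laplace:.4f}")  # ~1.35
print(f"uniform  : {H_uniform:.4f}")  # ~1.24
```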
5.4 NEGENTROPY
The maximality property given in Section 5.3.2 shows that entropy could be used to define a measure of nongaussianity. A measure that is zero for a gaussian variable and always nonnegative can be simply obtained from differential entropy, and is called negentropy. Negentropy $J$ is defined as follows:
$$J(x) = H(x_{gauss}) - H(x) \qquad (5.24)$$
where $x_{gauss}$ is a gaussian random vector with the same covariance matrix $\Sigma$ as $x$. Its entropy can be evaluated as
$$H(x_{gauss}) = \frac{1}{2} \log |\det \Sigma| + \frac{n}{2}\,[1 + \log 2\pi] \qquad (5.25)$$
where $n$ is the dimension of $x$.
Due to the previously mentioned maximality property of the gaussian distribution, negentropy is always nonnegative. Moreover, it is zero if and only if $x$ has a gaussian distribution, since the maximum entropy distribution is unique.
Negentropy has the additional interesting property that it is invariant under invertible linear transformations. This is because for $y = Mx$ we have $E\{yy^T\} = M \Sigma M^T$, and, using the preceding results, the negentropy can be computed as
$$\begin{aligned} J(Mx) &= \tfrac{1}{2} \log |\det(M \Sigma M^T)| + \tfrac{n}{2}\,[1 + \log 2\pi] - \left(H(x) + \log |\det M|\right) \\ &= \tfrac{1}{2} \log |\det \Sigma| + 2 \cdot \tfrac{1}{2} \log |\det M| + \tfrac{n}{2}\,[1 + \log 2\pi] - H(x) - \log |\det M| \\ &= \tfrac{1}{2} \log |\det \Sigma| + \tfrac{n}{2}\,[1 + \log 2\pi] - H(x) \\ &= H(x_{gauss}) - H(x) = J(x) \end{aligned} \qquad (5.26)$$
In particular, negentropy is scale-invariant, i.e., multiplication of a random variable by a constant does not change its negentropy. This was not true for differential entropy, as we saw earlier.
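For instance, for a uniform variable on $[0, a]$, both terms in the negentropy contain $\log a$, so the scale cancels; a short check (natural logarithms, illustrative only):

```python
import numpy as np

def negentropy_uniform(a):
    """Negentropy of a uniform variable on [0, a] (natural logarithms)."""
    var = a**2 / 12.0                                # variance of the uniform density
    H_gauss = 0.5 * np.log(2 * np.pi * np.e * var)   # entropy of a gaussian with that variance
    H_unif = np.log(a)                               # entropy of the uniform density
    return H_gauss - H_unif

print([round(negentropy_uniform(a), 4) for a in (0.1, 1.0, 10.0)])
# All three values are equal (about 0.176): the scale a cancels out.
```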
5.5 APPROXIMATION OF ENTROPY BY CUMULANTS
In the previous section we saw that negentropy is a principled measure of nongaussianity. The problem in using negentropy is, however, that it is computationally very difficult. To use differential entropy or negentropy in practice, we could compute the integral in the definition in (5.8). This is, however, quite difficult since the integral involves the probability density function. The density could be estimated using basic density estimation methods such as kernel estimators. Such a simple approach would be very error prone, however, because the estimator would depend on the correct choice of the kernel parameters. Moreover, it would be computationally rather complicated.
Therefore, differential entropy and negentropy remain mainly theoretical quantities. In practice, some approximations, possibly rather coarse, have to be used. In this section and the next one we discuss different approximations of negentropy that will be used in the ICA methods in Part II of this book.
5.5.1 Polynomial density expansions
The classic method of approximating negentropy uses higher-order cumulants (defined in Section 2.7). These approximations are based on the idea of using an expansion not unlike
a Taylor expansion. This expansion is taken for the pdf of a random variable, say $x$, in the vicinity of the gaussian density. (We only consider the case of scalar random variables here, because it seems to be sufficient in most applications.) For simplicity, let us first make $x$ zero-mean and of unit variance. Then we can make the technical assumption that the density $p_x(\xi)$ of $x$ is near the standardized gaussian density
$$\varphi(\xi) = \exp(-\xi^2/2)/\sqrt{2\pi}$$
Two expansions are usually used in this context: the Gram-Charlier expansion and the Edgeworth expansion. They lead to very similar approximations, so we only consider the Gram-Charlier expansion here. These expansions use the so-called Chebyshev-Hermite polynomials, denoted by $H_i$, where the index $i$ is a nonnegative integer. These polynomials are defined by the derivatives of the standardized gaussian pdf $\varphi(\xi)$, through the equation
$$\frac{\partial^i \varphi(\xi)}{\partial \xi^i} = (-1)^i H_i(\xi)\, \varphi(\xi)$$
Thus, $H_i$ is a polynomial of order $i$. These polynomials have the nice property of forming an orthonormal system in the following sense:
$$\int \varphi(\xi) H_i(\xi) H_j(\xi)\, d\xi = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad (5.29)$$
The Gram-Charlier expansion of the pdf of $x$, truncated to include the first two nonconstant terms, is then given by
$$p_x(\xi) \approx \hat{p}_x(\xi) = \varphi(\xi)\left(1 + \kappa_3(x)\, \frac{H_3(\xi)}{3!} + \kappa_4(x)\, \frac{H_4(\xi)}{4!}\right) \qquad (5.30)$$
This expansion is based on the idea that the pdf of $x$ is very close to a gaussian one, which allows a Taylor-like approximation to be made. Thus, the nongaussian part of the pdf is directly given by the higher-order cumulants, in this case the third- and fourth-order cumulants. Recall that these are called the skewness and kurtosis, and are given by $\kappa_3(x) = E\{x^3\}$ and $\kappa_4(x) = E\{x^4\} - 3$. The expansion has an infinite number of terms, but only those given above are of interest to us. Note that the expansion starts directly from the higher-order cumulants, because we standardized $x$ to have zero mean and unit variance.
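These two cumulants are easily estimated from data as sample averages, after standardizing the data to zero mean and unit variance. The sketch below does this for samples from a few distributions; the exact printed values fluctuate with the random sample.

```python
import numpy as np

def skewness_and_kurtosis(x):
    """Sample estimates of kappa_3 = E{x^3} and kappa_4 = E{x^4} - 3 for standardized data."""
    z = (x - x.mean()) / x.std()       # standardize to zero mean and unit variance
    return np.mean(z**3), np.mean(z**4) - 3.0

rng = np.random.default_rng(2)
n = 100_000
print("gaussian :", skewness_and_kurtosis(rng.standard_normal(n)))  # both near 0
print("laplacian:", skewness_and_kurtosis(rng.laplace(size=n)))     # kurtosis near +3
print("uniform  :", skewness_and_kurtosis(rng.uniform(size=n)))     # kurtosis near -1.2
```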
5.5.2 Using density expansions for entropy approximation
Now we could plug the density in (5.30) into the definition of entropy, to obtain
$$H(x) \approx -\int \hat{p}_x(\xi) \log \hat{p}_x(\xi)\, d\xi$$
This integral is not very simple to evaluate, though. But again using the idea that the pdf is very close to a gaussian one, we see that the cumulants in (5.30) are very