5 Information Theory
Estimation theory gives one approach to characterizing random variables. This was based on building parametric models and describing the data by the parameters.
An alternative approach is given by information theory. Here the emphasis is on coding. We want to code the observations. The observations can then be stored in the memory of a computer, or transmitted by a communications channel, for example. Finding a suitable code depends on the statistical properties of the data.
In independent component analysis (ICA), estimation theory and information theory offer the two principal theoretical approaches.
In this chapter, the basic concepts of information theory are introduced. The latter half of the chapter deals with a more specialized topic: approximation of entropy. These concepts are needed in the ICA methods of Part II.
5.1 ENTROPY
5.1.1 Definition of entropy
Entropy is the basic concept of information theory. Entropy $H$ is defined for a discrete-valued random variable $X$ as
$$H(X) = -\sum_i P(X = a_i) \log P(X = a_i) \qquad (5.1)$$
where the $a_i$ are the possible values of $X$. Depending on the base of the logarithm, different units of entropy are obtained. Usually, the logarithm with base 2 is used, in which case the unit is called a bit. In the following, the base is
not important, since it only changes the measurement scale, so it is not explicitly mentioned.

Fig. 5.1 The function $f$ in (5.2), plotted on the interval $[0, 1]$.
Let us define the function $f$ as
$$f(p) = -p \log p \qquad (5.2)$$
This is a nonnegative function that is zero for $p = 0$ and for $p = 1$, and positive for values in between; it is plotted in Fig. 5.1. Using this function, entropy can be written as
$$H(X) = \sum_i f(P(X = a_i)) \qquad (5.3)$$
Considering the shape of $f$, we see that the entropy is small if the probabilities $P(X = a_i)$ are close to 0 or 1, and large if the probabilities are in between.
In fact, the entropy of a random variable can be interpreted as the degree of information that the observation of the variable gives. The more "random", i.e., unpredictable and unstructured the variable is, the larger its entropy. Assume that the probabilities are all close to 0, except for one that is close to 1 (the probabilities must sum up to one). Then there is little randomness in the variable, since it almost always takes the same value. This is reflected in its small entropy. On the other hand, if all the probabilities are equal, then they are relatively far from 0 and 1, and $f$ takes large values. This means that the entropy is large, which reflects the fact that the variable is really random: we cannot predict which value it takes.
Example 5.1 Let us consider a random variable $X$ that can have only two values, $a$ and $b$. Denote by $p$ the probability that it has the value $a$; then the probability that it is $b$ is equal to $1 - p$. The entropy of this random variable can be computed as
$$H(X) = f(p) + f(1 - p) = -p \log p - (1 - p) \log (1 - p) \qquad (5.4)$$
Thus, entropy is a simple function of $p$. (It does not depend on the values $a$ and $b$.) Clearly, this function has the same properties as $f$: it is a nonnegative function that is zero for $p = 0$ and for $p = 1$, and positive for values in between. In fact, it is maximized for $p = 1/2$ (this is left as an exercise). Thus, the entropy is largest when the two values are obtained with equal probabilities of 50%. In contrast, if one of the values is obtained almost always (say, with a probability of 99.9%), the entropy of $X$ is small, since there is little randomness in the variable.
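This example is easy to check numerically. The following minimal sketch (not part of the original text) uses base-2 logarithms, so that the entropy is measured in bits; the helper name is just illustrative.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a variable with two values of probabilities p and 1 - p."""
    # Treat 0 * log2(0) as 0, so that p = 0 and p = 1 give zero entropy.
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

for p in (0.001, 0.1, 0.5, 0.9, 0.999):
    print(f"p = {p:5.3f}   H = {binary_entropy(p):.4f} bits")
# The maximum, 1 bit, is attained at p = 0.5; for p = 0.999 the entropy is tiny.
```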
5.1.2 Entropy and coding length
The connection between entropy and randomness can be made more rigorous by considering coding length. Assume that we want to find a binary code for a large number of observations of $X$, so that the code uses the minimum number of bits possible. According to the fundamental results of information theory, entropy is very closely related to the length of the code required. Under some simplifying assumptions, the length of the shortest code is bounded below by the entropy, and this bound can be approached arbitrarily closely; see, e.g., [97]. So entropy gives, roughly, the average minimum code length of the random variable.
Since this topic is outside the scope of this book, we will just illustrate it with two examples.
Example 5.2 Consider again the case of a random variable with two possible values, $a$ and $b$. If the variable almost always takes the same value, its entropy is small. This is reflected in the fact that the variable is easy to code. In fact, assume that the value $a$ is almost always obtained. Then one efficient code might be obtained simply by counting how many $a$'s are found between two subsequent observations of $b$, and writing down these numbers. Since we then need to code only a few numbers, we are able to code the data very efficiently.
In the extreme case where the probability of $a$ is 1, there is actually nothing left to code and the coding length is zero. On the other hand, if both values have the same probability, this trick cannot be used to obtain an efficient coding mechanism, and every value must be coded separately by one bit.
Example 5.3 Consider a random variable $X$ that can have eight different values with probabilities $(1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)$. The entropy of $X$ is 2 bits (this computation is left as an exercise to the reader). If we just coded the data in the ordinary way, we would need 3 bits for every observation. But a more intelligent way is to code frequent values with short binary strings and infrequent values with longer strings. Here, we could use the following strings for the eight outcomes: 0, 10, 110, 1110, 111100, 111101, 111110, 111111. (Note that the strings can be written one after another with no spaces, since they are designed so that one always knows when a string ends.) With this encoding, the average number of bits needed for each outcome is only 2, which is in fact equal to the entropy. So we have gained a 33% reduction in coding length.
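The numbers in this example can be verified with a short computation; the sketch below (base-2 logarithms) evaluates both the entropy and the average length of the listed code.

```python
import numpy as np

probs = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

entropy = -np.sum(probs * np.log2(probs))
avg_len = np.sum(probs * np.array([len(c) for c in codes]))

print(f"entropy             = {entropy:.4f} bits")          # 2.0
print(f"average code length = {avg_len:.4f} bits")          # 2.0, equal to the entropy
print(f"saving vs. 3 bits   = {(1 - avg_len / 3) * 100:.0f}%")  # about 33%
```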
5.1.3 Differential entropy
The definition of entropy for a discrete-valued random variable can be generalized to continuous-valued random variables and vectors, in which case it is often called differential entropy.
The differential entropy $H$ of a random variable $x$ with density $p_x(\cdot)$ is defined as
$$H(x) = -\int p_x(\xi) \log p_x(\xi)\, d\xi = \int f(p_x(\xi))\, d\xi$$
Differential entropy can be interpreted as a measure of randomness in the same way as entropy. If the random variable is concentrated on certain small intervals, its differential entropy is small.
Note that differential entropy can be negative. Ordinary entropy cannot be negative, because the function $f$ in (5.2) is nonnegative on the interval $[0, 1]$, and discrete probabilities necessarily stay in this interval. But probability densities can be larger than 1, in which case $f$ takes negative values. So, when we speak of a "small differential entropy", it may be negative and have a large absolute value.
It is now easy to see what kind of random variables have small entropies. They are the ones whose probability densities take large values, since these give strong negative contributions to the integral in (5.8). This means that certain intervals are quite probable. Thus we again find that entropy is small when the variable is not very random, that is, when it is contained in some limited intervals with high probabilities.
Example 5.4 Consider a random variable $x$ that has a uniform probability distribution in the interval $[0, a]$. Its density is given by
$$p_x(\xi) = \begin{cases} 1/a & \text{for } 0 \le \xi \le a \\ 0 & \text{otherwise} \end{cases}$$
The differential entropy can be evaluated as
$$H(x) = -\int_0^a \frac{1}{a} \log \frac{1}{a}\, d\xi = \log a$$
Thus we see that the entropy is large if $a$ is large, and small if $a$ is small. This is natural, because the smaller $a$ is, the less randomness there is in $x$. In the limit where $a$ goes to 0, the differential entropy goes to $-\infty$, because in the limit $x$ is no longer random at all: it is always 0.
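As an illustrative check, one can estimate the differential entropy from samples of a uniform variable and compare it with the closed-form value $\log a$. The histogram estimator in the sketch below is only a crude illustration (not a recommended entropy estimator), using natural logarithms so that entropy is in nats.

```python
import numpy as np

def hist_entropy(samples, bins=100):
    """Crude histogram estimate of differential entropy, in nats."""
    counts, edges = np.histogram(samples, bins=bins)
    mass = counts / counts.sum()     # probability mass of each bin
    width = np.diff(edges)           # bin widths
    nz = mass > 0
    # The density in bin i is approximately mass[i] / width[i].
    return -np.sum(mass[nz] * np.log(mass[nz] / width[nz]))

rng = np.random.default_rng(0)
for a in (0.5, 1.0, 4.0):
    x = rng.uniform(0.0, a, size=200_000)
    print(f"a = {a:3.1f}   estimate = {hist_entropy(x):6.3f}   log a = {np.log(a):6.3f}")
# For a < 1 the differential entropy is negative, as noted above.
```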
The interpretation of entropy as coding length is more or less valid for differential entropy as well. The situation is more complicated, however, since the coding length interpretation requires that we discretize (quantize) the values of $x$. In this case, the coding length depends on the discretization, i.e., on the accuracy with which we want to represent the random variable. Thus the actual coding length is given by the sum of the entropy and a function of the accuracy of representation. We will not go into the details here; see [97] for more information.
The definition of differential entropy can be straightforwardly generalized to the multidimensional case. Let $x$ be a random vector with density $p_x(\cdot)$. The differential entropy is then defined as
$$H(x) = -\int p_x(\xi) \log p_x(\xi)\, d\xi = \int f(p_x(\xi))\, d\xi \qquad (5.8)$$
5.1.4 Entropy of a transformation
Consider an invertible transformation of the random vector $x$, say
$$y = f(x)$$
In this section, we show the connection between the entropy of $y$ and that of $x$.
A short, if somewhat sloppy, derivation is as follows. (A more rigorous derivation is given in the Appendix.) Denote by $Jf(\xi)$ the Jacobian matrix of the function $f$, i.e., the matrix of the partial derivatives of $f$ at point $\xi$. The classic relation between the density $p_y$ of $y$ and the density $p_x$ of $x$, as given in Eq. (2.82), can then be formulated as
$$p_y(\xi) = p_x(f^{-1}(\xi))\, |\det Jf(f^{-1}(\xi))|^{-1} \qquad (5.10)$$
Now, expressing the entropy as an expectation,
$$H(y) = -E\{\log p_y(y)\}$$
we get
$$E\{\log p_y(y)\} = E\{\log [\,p_x(f^{-1}(y))\, |\det Jf(f^{-1}(y))|^{-1}\,]\} = E\{\log [\,p_x(x)\, |\det Jf(x)|^{-1}\,]\} = E\{\log p_x(x)\} - E\{\log |\det Jf(x)|\} \qquad (5.12)$$
Thus we obtain the relation between the entropies as
$$H(y) = H(x) + E\{\log |\det Jf(x)|\} \qquad (5.13)$$
In other words, the entropy is increased in the transformation by $E\{\log |\det Jf(x)|\}$.
An important special case is the linear transformation
$$y = Mx \qquad (5.14)$$
in which case we obtain
$$H(y) = H(x) + \log |\det M| \qquad (5.15)$$
This also shows that differential entropy is not scale-invariant. Consider a random variable $x$. If we multiply it by a scalar constant $\alpha$, the differential entropy changes as
$$H(\alpha x) = H(x) + \log |\alpha| \qquad (5.16)$$
Thus, just by changing the scale, we can change the differential entropy. This is why the scale of $x$ is often fixed before measuring its differential entropy.
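The scaling relation can be checked on samples: multiplying a uniform variable on $[0, a]$ by $\alpha$ gives a uniform variable on $[0, \alpha a]$, so the entropy should increase by $\log \alpha$. The sketch below uses the same kind of crude histogram estimator as in the earlier illustration (natural logarithms, illustrative only).

```python
import numpy as np

def hist_entropy(samples, bins=200):
    """Crude histogram estimate of differential entropy, in nats (illustration only)."""
    counts, edges = np.histogram(samples, bins=bins)
    mass = counts / counts.sum()
    width = np.diff(edges)
    nz = mass > 0
    return -np.sum(mass[nz] * np.log(mass[nz] / width[nz]))

rng = np.random.default_rng(1)
a, alpha = 2.0, 5.0
x = rng.uniform(0.0, a, size=500_000)

print(f"H(alpha x) - H(x) = {hist_entropy(alpha * x) - hist_entropy(x):.3f}")
print(f"log|alpha|        = {np.log(abs(alpha)):.3f}")   # the two should be close
```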
5.2 MUTUAL INFORMATION
5.2.1 Definition using entropy
Mutual information is a measure of the information that members of a set of random variables have on the other random variables in the set. Using entropy, we can define the mutual information $I$ between $n$ (scalar) random variables $x_i$, $i = 1, \ldots, n$, as follows:
$$I(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} H(x_i) - H(x) \qquad (5.17)$$
where $x$ is the vector containing all the $x_i$.
Mutual information can be interpreted by using the interpretation of entropy as code length. The terms $H(x_i)$ give the lengths of the codes for the $x_i$ when these are coded separately, and $H(x)$ gives the code length when $x$ is coded as a random vector, i.e., all the components are coded with the same code. Mutual information thus shows what code length reduction is obtained by coding the whole vector instead of the separate components. In general, better codes can be obtained by coding the whole vector. However, if the $x_i$ are independent, they give no information on each other, and one could just as well code the variables separately without increasing the code length.
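For discrete variables, the same definition can be evaluated directly from a joint probability table: for two variables, $I(x_1, x_2) = H(x_1) + H(x_2) - H(x_1, x_2)$. The sketch below does this for a small, made-up joint distribution of two binary variables.

```python
import numpy as np

def entropy_bits(p):
    """Entropy (in bits) of a discrete distribution given as an array of probabilities."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint distribution of two binary variables: rows index x1, columns index x2.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

H_x1 = entropy_bits(joint.sum(axis=1))   # marginal entropy of x1
H_x2 = entropy_bits(joint.sum(axis=0))   # marginal entropy of x2
H_joint = entropy_bits(joint.ravel())    # entropy of the pair coded together

I = H_x1 + H_x2 - H_joint
print(f"I(x1, x2) = {I:.4f} bits")       # zero only if the joint density factorizes
```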
5.2.2 Definition using Kullback-Leibler divergence
Alternatively, mutual information can be interpreted as a distance, using what is called the Kullback-Leibler divergence. This is defined between two $n$-dimensional probability density functions (pdf's) $p_1$ and $p_2$ as
$$\delta(p_1, p_2) = \int p_1(\xi) \log \frac{p_1(\xi)}{p_2(\xi)}\, d\xi \qquad (5.18)$$
The Kullback-Leibler divergence can be considered as a kind of distance between the two probability densities, because it is always nonnegative, and zero if and only if the two distributions are equal. This is a direct consequence of the (strict) convexity of the negative logarithm, and the application of the classic Jensen's inequality. Jensen's inequality (see [97]) says that for any strictly convex function $f$ and any random variable $y$, we have
$$E\{f(y)\} \geq f(E\{y\}) \qquad (5.19)$$
Take $f(y) = -\log(y)$, and assume that $y = p_2(x)/p_1(x)$, where $x$ has the distribution given by $p_1$. Then we have
$$\delta(p_1, p_2) = E\{f(y)\} = E\left\{-\log \frac{p_2(x)}{p_1(x)}\right\} = \int p_1(\xi)\left(-\log \frac{p_2(\xi)}{p_1(\xi)}\right) d\xi \geq -\log \int p_1(\xi)\, \frac{p_2(\xi)}{p_1(\xi)}\, d\xi = -\log \int p_2(\xi)\, d\xi = -\log 1 = 0 \qquad (5.20)$$
Moreover, we have equality in Jensen's inequality if and only if $y$ is constant. In our case, it is constant if and only if the two distributions are equal, so we have proven the announced property of the Kullback-Leibler divergence.
The Kullback-Leibler divergence is not a proper distance measure, though, because it is not symmetric.
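Both properties are easy to see numerically for discrete distributions; in the sketch below, the probability vectors are made up for illustration, and the divergence is computed in nats.

```python
import numpy as np

def kl_divergence(p1, p2):
    """Kullback-Leibler divergence of two discrete distributions, in nats."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    nz = p1 > 0
    return np.sum(p1[nz] * np.log(p1[nz] / p2[nz]))

p = [0.5, 0.4, 0.1]
q = [1/3, 1/3, 1/3]

print(f"KL(p, q) = {kl_divergence(p, q):.4f}")   # nonnegative
print(f"KL(q, p) = {kl_divergence(q, p):.4f}")   # a different value: not symmetric
print(f"KL(p, p) = {kl_divergence(p, p):.4f}")   # zero for identical distributions
```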
To apply the Kullback-Leibler divergence here, let us begin by considering that if the random variables $x_i$ were independent, their joint probability density could be factorized according to the definition of independence. Thus one might measure the independence of the $x_i$ as the Kullback-Leibler divergence between the real density $p_1 = p_x(\cdot)$ and the factorized density $p_2 = p_1(\xi_1)\, p_2(\xi_2) \cdots p_n(\xi_n)$, where the $p_i(\cdot)$ are the marginal densities of the $x_i$. In fact, simple algebraic manipulations show that this quantity equals the mutual information that we defined using entropy in (5.17); this is left as an exercise.
The interpretation as Kullback-Leibler divergence implies the following important property: mutual information is always nonnegative, and it is zero if and only if the variables are independent. This is a direct consequence of the properties of the Kullback-Leibler divergence.
5.3 MAXIMUM ENTROPY
5.3.1 Maximum entropy distributions
An important class of methods that have applications in many domains is given by the maximum entropy methods. These methods apply the concept of entropy to the task of regularization.
Assume that the information available on the density $p_x(\cdot)$ of the scalar random variable $x$ is of the form
$$\int p(\xi) F_i(\xi)\, d\xi = c_i \quad \text{for } i = 1, \ldots, n \qquad (5.21)$$
which means in practice that we have estimated the expectations $E\{F_i(x)\}$ of $n$ different functions $F_i$ of $x$. (Note that $i$ is here an index, not an exponent.) The question is now: What is the probability density function $p_0$ that satisfies the constraints in (5.21) and has maximum entropy among such densities? (Earlier, we defined the entropy of a random variable, but the definition can be used with pdf's as well.) This question can be motivated by noting that a finite number of observations cannot tell us exactly what $p$ is like. So we might use some kind of regularization to obtain the most useful $p$ compatible with these measurements. Entropy can here be considered as a regularization measure that helps us find the least structured density compatible with the measurements. In other words, the maximum entropy density can be interpreted as the density that is compatible with the measurements and makes the minimum number of assumptions on the data. This is because entropy can be interpreted as a measure of randomness, and therefore the maximum entropy density
is the most random of all the pdf's that satisfy the constraints. For further details on why entropy can be used as a measure of regularity, see [97, 353].
The basic result of the maximum entropy method (see, e.g., [97, 353]) tells us that under some regularity conditions, the density $p_0(\xi)$ which satisfies the constraints (5.21) and has maximum entropy among all such densities is of the form
$$p_0(\xi) = A \exp\left(\sum_i a_i F_i(\xi)\right) \qquad (5.22)$$
Here, $A$ and the $a_i$ are constants that are determined from the $c_i$, using the constraints in (5.21) (i.e., by substituting the right-hand side of (5.22) for $p$ in (5.21)), and the constraint $\int p_0(\xi)\, d\xi = 1$. This leads in general to a system of $n + 1$ nonlinear equations that may be difficult to solve, so in general numerical methods must be used.
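As an illustration of such a numerical solution (this is only a sketch, not a method described in the text), the snippet below uses generic quadrature and root-finding routines to determine the constants in (5.22) for the assumed constraint functions $F_1(\xi) = \xi$ and $F_2(\xi) = \xi^2$ with $c_1 = 0$, $c_2 = 1$. With these constraints, the solution should come out close to the standardized gaussian density, as discussed in the next section.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

# Assumed example: constraints E{x} = 0 and E{x^2} = 1.
F = [lambda x: x, lambda x: x**2]
c = np.array([0.0, 1.0])

def unnormalized(x, a):
    # exp(sum_i a_i F_i(x)); a[1] must stay negative for the integrals to converge.
    return np.exp(a[0] * F[0](x) + a[1] * F[1](x))

def moment_equations(a):
    Z = quad(lambda x: unnormalized(x, a), -np.inf, np.inf)[0]
    moments = [quad(lambda x: F[i](x) * unnormalized(x, a), -np.inf, np.inf)[0] / Z
               for i in range(len(F))]
    return np.array(moments) - c

# Start with a clearly negative coefficient for x^2 so the density is integrable.
a = fsolve(moment_equations, x0=np.array([0.0, -0.6]))
A = 1.0 / quad(lambda x: unnormalized(x, a), -np.inf, np.inf)[0]
print("a =", a)   # expected close to (0, -0.5)
print("A =", A)   # expected close to 1/sqrt(2*pi), about 0.399
```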
5.3.2 Maximality property of the gaussian distribution
Now, consider the set of random variables that can take all values on the real line, and have zero mean and a fixed variance, say 1 (thus, we have two constraints). The maximum entropy distribution for such variables is the gaussian distribution. This is because, by (5.22), the distribution has the form
$$p_0(\xi) = A \exp(a_1 \xi^2 + a_2 \xi)$$
and all probability densities of this form are gaussian by definition (see Section 2.5).
Thus we have the fundamental result that a gaussian variable has the largest entropy among all random variables of unit variance. This means that entropy could be used as a measure of nongaussianity. In fact, this shows that the gaussian distribution is the "most random" or the least structured of all distributions. Entropy is small for distributions that are clearly concentrated on certain values, i.e., when the variable is clearly clustered, or has a pdf that is very "spiky". This property can be generalized to arbitrary variances and, what is more important, to multidimensional spaces: the gaussian distribution has maximum entropy among all distributions with a given covariance matrix.
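This maximality can be illustrated by comparing the closed-form differential entropies (in nats) of a few distributions that all have unit variance; the short sketch below just evaluates the standard formulas.

```python
import numpy as np

# Closed-form differential entropies (in nats) of three unit-variance distributions.
H_gauss   = 0.5 * np.log(2 * np.pi * np.e)   # gaussian
H_laplace = 1.0 + np.log(np.sqrt(2.0))       # Laplacian scaled to variance 1
H_uniform = np.log(2 * np.sqrt(3.0))         # uniform on [-sqrt(3), sqrt(3)]

print(f"gaussian : {H_gauss:.4f}")    # ~1.42, the largest
print(f"laplacian: {H_laplace:.4f}")  # ~1.35
print(f"uniform  : {H_uniform:.4f}")  # ~1.24
```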
5.4 NEGENTROPY
The maximality property given in Section 5.3.2 shows that entropy could be used to define a measure of nongaussianity. A measure that is zero for a gaussian variable and always nonnegative can be simply obtained from differential entropy, and is called negentropy. Negentropy $J$ is defined as follows:
$$J(x) = H(x_{gauss}) - H(x) \qquad (5.24)$$
where $x_{gauss}$ is a gaussian random vector with the same covariance matrix $\Sigma$ as $x$. Its entropy can be evaluated as
$$H(x_{gauss}) = \frac{1}{2} \log |\det \Sigma| + \frac{n}{2}\,[1 + \log 2\pi] \qquad (5.25)$$
where $n$ is the dimension of $x$.
Due to the previously mentioned maximality property of the gaussian distribution, negentropy is always nonnegative. Moreover, it is zero if and only if $x$ has a gaussian distribution, since the maximum entropy distribution is unique.
Negentropy has the additional interesting property that it is invariant under invertible linear transformations. This is because for $y = Mx$ we have $E\{yy^T\} = M \Sigma M^T$, and, using the preceding results, the negentropy can be computed as
$$\begin{aligned} J(Mx) &= \tfrac{1}{2} \log |\det(M \Sigma M^T)| + \tfrac{n}{2}\,[1 + \log 2\pi] - \left(H(x) + \log |\det M|\right) \\ &= \tfrac{1}{2} \log |\det \Sigma| + 2 \cdot \tfrac{1}{2} \log |\det M| + \tfrac{n}{2}\,[1 + \log 2\pi] - H(x) - \log |\det M| \\ &= \tfrac{1}{2} \log |\det \Sigma| + \tfrac{n}{2}\,[1 + \log 2\pi] - H(x) \\ &= H(x_{gauss}) - H(x) = J(x) \end{aligned} \qquad (5.26)$$
In particular, negentropy is scale-invariant, i.e., multiplication of a random variable by a constant does not change its negentropy. This was not true for differential entropy, as we saw earlier.
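For instance, for a uniform variable on $[0, a]$, both terms in the negentropy contain $\log a$, so the scale cancels; a short check (natural logarithms, illustrative only):

```python
import numpy as np

def negentropy_uniform(a):
    """Negentropy of a uniform variable on [0, a] (natural logarithms)."""
    var = a**2 / 12.0                                # variance of the uniform density
    H_gauss = 0.5 * np.log(2 * np.pi * np.e * var)   # entropy of a gaussian with that variance
    H_unif = np.log(a)                               # entropy of the uniform density
    return H_gauss - H_unif

print([round(negentropy_uniform(a), 4) for a in (0.1, 1.0, 10.0)])
# All three values are equal (about 0.176): the scale a cancels out.
```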
5.5 APPROXIMATION OF ENTROPY BY CUMULANTS
In the previous section we saw that negentropy is a principled measure of nongaussianity. The problem in using negentropy is, however, that it is computationally very difficult. To use differential entropy or negentropy in practice, we could compute the integral in the definition in (5.8). This is, however, quite difficult since the integral involves the probability density function. The density could be estimated using basic density estimation methods such as kernel estimators. Such a simple approach would be very error prone, however, because the estimator would depend on the correct choice of the kernel parameters. Moreover, it would be computationally rather complicated.
Therefore, differential entropy and negentropy remain mainly theoretical quantities. In practice, some approximations, possibly rather coarse, have to be used. In this section and the next one we discuss different approximations of negentropy that will be used in the ICA methods in Part II of this book.
5.5.1 Polynomial density expansions
The classic method of approximating negentropy uses higher-order cumulants (defined in Section 2.7). These approximations are based on the idea of using an expansion not unlike
a Taylor expansion. This expansion is taken for the pdf of a random variable, say $x$, in the vicinity of the gaussian density. (We only consider the case of scalar random variables here, because it seems to be sufficient in most applications.) For simplicity, let us first make $x$ zero-mean and of unit variance. Then we can make the technical assumption that the density $p_x(\xi)$ of $x$ is near the standardized gaussian density
$$\varphi(\xi) = \exp(-\xi^2/2)/\sqrt{2\pi}$$
Two expansions are usually used in this context: the Gram-Charlier expansion and the Edgeworth expansion. They lead to very similar approximations, so we only consider the Gram-Charlier expansion here. These expansions use the so-called Chebyshev-Hermite polynomials, denoted by $H_i$, where the index $i$ is a nonnegative integer. These polynomials are defined by the derivatives of the standardized gaussian pdf $\varphi(\xi)$, through the equation
$$\frac{\partial^i \varphi(\xi)}{\partial \xi^i} = (-1)^i H_i(\xi)\, \varphi(\xi)$$
Thus, $H_i$ is a polynomial of order $i$. These polynomials have the nice property of forming an orthonormal system in the following sense:
$$\int \varphi(\xi) H_i(\xi) H_j(\xi)\, d\xi = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad (5.29)$$
The Gram-Charlier expansion of the pdf of $x$, truncated to include the first two nonconstant terms, is then given by
$$p_x(\xi) \approx \hat{p}_x(\xi) = \varphi(\xi)\left(1 + \kappa_3(x)\, \frac{H_3(\xi)}{3!} + \kappa_4(x)\, \frac{H_4(\xi)}{4!}\right) \qquad (5.30)$$
This expansion is based on the idea that the pdf of $x$ is very close to a gaussian one, which allows a Taylor-like approximation to be made. Thus, the nongaussian part of the pdf is directly given by the higher-order cumulants, in this case the third- and fourth-order cumulants. Recall that these are called the skewness and kurtosis, and are given by $\kappa_3(x) = E\{x^3\}$ and $\kappa_4(x) = E\{x^4\} - 3$. The expansion has an infinite number of terms, but only those given above are of interest to us. Note that the expansion starts directly from the higher-order cumulants, because we standardized $x$ to have zero mean and unit variance.
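These two cumulants are easily estimated from data as sample averages, after standardizing the data to zero mean and unit variance. The sketch below does this for samples from a few distributions; the exact printed values fluctuate with the random sample.

```python
import numpy as np

def skewness_and_kurtosis(x):
    """Sample estimates of kappa_3 = E{x^3} and kappa_4 = E{x^4} - 3 for standardized data."""
    z = (x - x.mean()) / x.std()       # standardize to zero mean and unit variance
    return np.mean(z**3), np.mean(z**4) - 3.0

rng = np.random.default_rng(2)
n = 100_000
print("gaussian :", skewness_and_kurtosis(rng.standard_normal(n)))  # both near 0
print("laplacian:", skewness_and_kurtosis(rng.laplace(size=n)))     # kurtosis near +3
print("uniform  :", skewness_and_kurtosis(rng.uniform(size=n)))     # kurtosis near -1.2
```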
5.5.2 Using density expansions for entropy approximation
Now we could plug the density in (5.30) into the definition of entropy, to obtain
$$H(x) \approx -\int \hat{p}_x(\xi) \log \hat{p}_x(\xi)\, d\xi$$
This integral is not very simple to evaluate, though. But again using the idea that the pdf is very close to a gaussian one, we see that the cumulants in (5.30) are very