
Document: Chapter 10: ICA by Minimization of Mutual Information (PDF)


DOCUMENT INFORMATION

Basic information

Title: ICA by minimization of mutual information
Authors: Aapo Hyvärinen, Juha Karhunen, Erkki Oja
Field: Information Theory
Type: Book chapter
Year: 2001
City: New York
Pages: 7
Size: 119.75 KB


Contents



ICA by Minimization of Mutual Information

An important approach for independent component analysis (ICA) estimation, inspired by information theory, is minimization of mutual information.

The motivation of this approach is that it may not be very realistic in many cases to assume that the data follows the ICA model. Therefore, we would like to develop an approach that does not assume anything about the data. What we want to have is a general-purpose measure of the dependence of the components of a random vector. Using such a measure, we could define ICA as a linear decomposition that minimizes that dependence measure. Such an approach can be developed using mutual information, which is a well-motivated information-theoretic measure of statistical dependence.

One of the main utilities of mutual information is that it serves as a unifying framework for many estimation principles, in particular maximum likelihood (ML) estimation and maximization of nongaussianity. In particular, this approach gives a rigorous justification for the heuristic principle of nongaussianity.

10.1 DEFINING ICA BY MUTUAL INFORMATION

10.1.1 Information-theoretic concepts

The information-theoretic concepts needed in this chapter were explained in Chapter 5. Readers not familiar with information theory are advised to read that chapter before this one.


We recall here very briefly the basic definitions of information theory. The differential entropy H of a random vector y with density p(y) is defined as:

H(y) = -\int p(y) \log p(y) \, dy   (10.1)

Entropy is closely related to the code length of the random vector. A normalized version of entropy is given by negentropy J, which is defined as follows:

J(y) = H(y_{gauss}) - H(y)   (10.2)

where y_gauss is a gaussian random vector of the same covariance (or correlation) matrix as y. Negentropy is always nonnegative, and zero only for gaussian random vectors. Mutual information I between m (scalar) random variables y_i, i = 1, ..., m, is defined as follows:

I(y_1, y_2, \ldots, y_m) = \sum_{i=1}^{m} H(y_i) - H(y)   (10.3)
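To make definitions (10.1)-(10.3) concrete, the following is a minimal numerical sketch in Python (not from the original text) of binned estimates of differential entropy, negentropy, and mutual information; the function names, bin counts, and sample sizes are arbitrary illustrative choices.

import numpy as np

def entropy_hist(x, bins=100):
    # Binned estimate of the differential entropy H(x) = -E{log p(x)}  (cf. 10.1)
    counts, edges = np.histogram(x, bins=bins)
    p = counts / counts.sum()                  # probability mass per bin
    widths = np.diff(edges)
    nz = p > 0
    # the density in bin k is approximately p_k / width_k
    return -np.sum(p[nz] * np.log(p[nz] / widths[nz]))

def negentropy_hist(y, bins=100):
    # J(y) = H(y_gauss) - H(y), with H(y_gauss) = 0.5*log(2*pi*e*var(y))  (cf. 10.2)
    return 0.5 * np.log(2 * np.pi * np.e * np.var(y)) - entropy_hist(y, bins)

def mutual_information_hist(Y, bins=30):
    # I(y_1,...,y_m) = sum_i H(y_i) - H(y)  (cf. 10.3), for an (N, m) sample array Y
    marginals = sum(entropy_hist(Y[:, i], bins) for i in range(Y.shape[1]))
    counts, _ = np.histogramdd(Y, bins=bins)
    p = counts / counts.sum()
    vol = np.prod([(Y[:, i].max() - Y[:, i].min()) / bins for i in range(Y.shape[1])])
    nz = p > 0
    joint = -np.sum(p[nz] * np.log(p[nz] / vol))
    return marginals - joint

rng = np.random.default_rng(0)
print(negentropy_hist(rng.standard_normal(100_000)))       # close to 0 for gaussian data
S = rng.uniform(-1, 1, size=(100_000, 2))                  # independent components
print(mutual_information_hist(S))                          # close to 0
print(mutual_information_hist(S @ np.array([[1.0, 0.5], [0.5, 1.0]])))  # mixed: clearly > 0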

10.1.2 Mutual information as measure of dependence

We have seen earlier (Chapter 5) that mutual information is a natural measure of the dependence between random variables. It is always nonnegative, and zero if and only if the variables are statistically independent. Mutual information takes into account the whole dependence structure of the variables, and not just the covariance, like principal component analysis (PCA) and related methods.
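A tiny numerical illustration of this point (not from the text): two variables can be uncorrelated yet strongly dependent, which covariance-based methods cannot detect but mutual information can.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)
y = x ** 2                                 # y is a deterministic function of x ...
print(np.corrcoef(x, y)[0, 1])             # ... yet their correlation is essentially zero
# PCA and other covariance-based methods see no dependence between x and y here,
# while the mutual information I(x, y) is large (infinite in the continuous limit,
# since knowing x determines y exactly).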

Therefore, we can use mutual information as the criterion for finding the ICA representation. This approach is an alternative to the model estimation approach. We define the ICA of a random vector x as an invertible transformation:

s = Bx   (10.4)

where the matrix B is determined so that the mutual information of the transformed components s_i is minimized. If the data follows the ICA model, this allows estimation of the data model. On the other hand, in this definition, we do not need to assume that the data follows the model. In any case, minimization of mutual information can be interpreted as giving the maximally independent components.


10.2 MUTUAL INFORMATION AND NONGAUSSIANITY

Using the formula for the differential entropy of a transformation, as given in (5.13) of Chapter 5, we obtain a corresponding result for mutual information. We have for an invertible linear transformation y = Bx:

I(y_1, y_2, \ldots, y_n) = \sum_i H(y_i) - H(x) - \log|\det B|   (10.5)

Now, let us consider what happens if we constrain the y_i to be uncorrelated and of unit variance. This means E\{yy^T\} = B E\{xx^T\} B^T = I, which implies

\det I = 1 = \det(B E\{xx^T\} B^T) = (\det B)(\det E\{xx^T\})(\det B^T)   (10.6)

and this implies that det B must be constant, since det E\{xx^T\} does not depend on B. Moreover, for y_i of unit variance, entropy and negentropy differ only by a constant and the sign, as can be seen in (10.2). Thus we obtain

I(y_1, y_2, \ldots, y_n) = \text{const.} - \sum_i J(y_i)   (10.7)

where the constant term does not depend on B. This shows the fundamental relation between negentropy and mutual information.
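As a quick numerical illustration of (10.6) (a Python sketch, not from the text, assuming zero-mean data): every matrix B that decorrelates the data and gives unit variances has the same |det B|, namely det(E{xx^T})^{-1/2}.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((50_000, 3)) @ A.T     # zero-mean data with covariance close to A A^T
C = np.cov(X, rowvar=False)                     # sample estimate of E{x x^T}

# Every B with B C B^T = I can be written B = U C^(-1/2) with U orthogonal.
d, E = np.linalg.eigh(C)
C_inv_sqrt = E @ np.diag(d ** -0.5) @ E.T
for _ in range(3):
    U, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random orthogonal matrix
    B = U @ C_inv_sqrt                                  # decorrelating, unit-variance transform
    # |det B| is the same for every such B: det(C)^(-1/2), as (10.6) requires
    print(np.abs(np.linalg.det(B)), np.linalg.det(C) ** -0.5)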

We see in (10.7) that finding an invertible linear transformation B that minimizes the mutual information is roughly equivalent to finding directions in which the negentropy is maximized. We have seen previously that negentropy is a measure of nongaussianity. Thus, (10.7) shows that ICA estimation by minimization of mutual information is equivalent to maximizing the sum of the nongaussianities of the estimates of the independent components, when the estimates are constrained to be uncorrelated.

Thus, we see that the formulation of ICA as minimization of mutual information gives another rigorous justification of our more heuristically introduced idea of finding maximally nongaussian directions, as used in Chapter 8.

In practice, however, there are also some important differences between these two criteria:

1. Negentropy, and other measures of nongaussianity, enable the deflationary, i.e., one-by-one, estimation of the independent components, since we can look for the maxima of nongaussianity of a single projection b^T x. This is not possible with mutual information or most other criteria, like the likelihood.

2. A smaller difference is that in using nongaussianity, we force the estimates of the independent components to be uncorrelated. This is not necessary when using mutual information, because we could use the form in (10.5) directly, as will be seen in the next section. Thus the optimization space is slightly reduced.


10.3 MUTUAL INFORMATION AND LIKELIHOOD

Mutual information and likelihood are intimately connected. To see the connection between likelihood and mutual information, consider the expectation of the log-likelihood in (9.5):

\frac{1}{T} E\{\log L(B)\} = \sum_{i=1}^{n} E\{\log p_i(b_i^T x)\} + \log|\det B|   (10.8)

If the p_i were equal to the actual pdf's of b_i^T x, the first term would be equal to -\sum_i H(b_i^T x). Thus the likelihood would be equal, up to an additive constant given by the total entropy of x, to the negative of mutual information as given in Eq. (10.5).
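Spelling out this step (a short derivation consistent with (10.5) and (10.8), under the stated assumption that each p_i equals the true density of y_i = b_i^T x):

\frac{1}{T} E\{\log L(B)\} = \sum_{i=1}^{n} E\{\log p_i(b_i^T x)\} + \log|\det B|
  = -\sum_{i=1}^{n} H(y_i) + \log|\det B|
  = -\Big[\sum_{i=1}^{n} H(y_i) - H(x) - \log|\det B|\Big] - H(x)
  = -I(y_1, \ldots, y_n) - H(x)

so, under this assumption, maximizing the expected log-likelihood is exactly the same as minimizing the mutual information, up to the additive constant -H(x).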

In practice, the connection may be just as strong, or even stronger. This is because in practice we do not know the distributions of the independent components that are needed in ML estimation. A reasonable approach would be to estimate the density of b_i^T x as part of the ML estimation method, and use this as an approximation of the density of s_i. This is what we did in Chapter 9. Then, the p_i in this approximation of likelihood are indeed equal to the actual pdf's of b_i^T x. Thus, the equivalency would really hold.

Conversely, to approximate mutual information, we could take a fixed approximation of the densities of the y_i, and plug this in the definition of entropy. Denote the log-pdf's by G_i(y_i) = \log p_i(y_i). Then we could approximate (10.5) as

I(y_1, y_2, \ldots, y_n) = -\sum_i E\{G_i(y_i)\} - \log|\det B| - H(x)   (10.9)

Now we see that this approximation is equal to the approximation of the likelihood used in Chapter 9 (except, again, for the global sign and the additive constant given by H(x)). This also gives an alternative method of approximating mutual information that is different from the approximation that uses the negentropy approximations.
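As an illustration of how (10.9) can be evaluated on data, here is a small Python sketch (not from the book). It uses one common fixed choice, G(y) = -log cosh(y), which equals a supergaussian log-density up to an additive constant; the H(x) term is dropped because it does not depend on B, so only differences between two choices of B are meaningful.

import numpy as np

def approx_mutual_information(B, X, G=lambda y: -np.log(np.cosh(y))):
    # Approximation (10.9): I ~ -sum_i E{G_i(y_i)} - log|det B| - H(x),
    # with the B-independent term H(x) omitted.
    Y = X @ B.T                                   # y = Bx for every sample (row) of X
    return -np.sum(np.mean(G(Y), axis=0)) - np.log(np.abs(np.linalg.det(B)))

rng = np.random.default_rng(0)
S = rng.laplace(size=(50_000, 2))                 # two supergaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])
X = S @ A.T                                       # observed mixtures
print(approx_mutual_information(np.linalg.inv(A), X))   # near the separating matrix: smaller value
print(approx_mutual_information(np.eye(2), X))          # no separation: larger value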

10.4 ALGORITHMS FOR MINIMIZATION OF MUTUAL INFORMATION

To use mutual information in practice, we need some method of estimating or approximating it from real data. Earlier, we saw two methods for approximating mutual information. The first one was based on the negentropy approximations introduced in Section 5.6. The second one was based on using more or less fixed approximations for the densities of the ICs, as in Chapter 9.

Thus, using mutual information leads essentially to the same algorithms as used for maximization of nongaussianity in Chapter 8, or for maximum likelihood estimation in Chapter 9. In the case of maximization of nongaussianity, the corresponding algorithms are those that use symmetric orthogonalization, since we are maximizing the sum of nongaussianities, so that no order exists between the components. Thus, we do not present any new algorithms in this chapter; the reader is referred to the two preceding chapters.
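For readers who want a concrete reference point, the following is a minimal Python sketch (not from the original text) of a FastICA-style fixed-point update with symmetric orthogonalization, in the spirit of Chapter 8, using g = tanh. It assumes the data has already been whitened, and all names and parameter values are illustrative.

import numpy as np

def fastica_symmetric(Z, n_iter=20, seed=0):
    # Fixed-point updates with symmetric orthogonalization for whitened data Z of shape (N, n).
    rng = np.random.default_rng(seed)
    n = Z.shape[1]

    def sym_orth(W):
        # W <- (W W^T)^(-1/2) W : orthonormalizes all rows of W at once
        d, E = np.linalg.eigh(W @ W.T)
        return E @ np.diag(d ** -0.5) @ E.T @ W

    W = sym_orth(rng.standard_normal((n, n)))
    for _ in range(n_iter):
        Y = Z @ W.T                               # current component estimates y = Wz
        # w_i <- E{z g(w_i^T z)} - E{g'(w_i^T z)} w_i, done for all rows in parallel
        W = (np.tanh(Y).T @ Z) / len(Z) - np.diag(np.mean(1.0 - np.tanh(Y) ** 2, axis=0)) @ W
        W = sym_orth(W)                           # decorrelate all estimates simultaneously
    return W

# Usage sketch: W = fastica_symmetric(whitened_data); estimated components are whitened_data @ W.T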


Fig. 10.1 The convergence of FastICA for ICs with uniform distributions. The value of mutual information is shown as a function of the iteration count. (Plot: mutual information, on the order of 10^-4, versus iteration count.)

10.5 EXAMPLES

Here we show the results of applying minimization of mutual information to the two mixtures introduced in Chapter 7. We use here the whitened mixtures, and the FastICA algorithm (which is essentially identical whichever approximation of mutual information is used). For illustration purposes, the algorithm was always initialized so that W was the identity matrix. The function G was chosen as G_1 in (8.26).

First, we used the data consisting of two mixtures of two subgaussian (uniformly distributed) independent components. To demonstrate the convergence of the algorithm, the mutual information of the components at each iteration step is plotted in Fig. 10.1. This was obtained by the negentropy-based approximation. At convergence, after two iterations, mutual information was practically equal to zero. The corresponding results for two supergaussian independent components are shown in Fig. 10.2. Convergence was obtained after three iterations, after which mutual information was practically zero.

10.6 CONCLUDING REMARKS AND REFERENCES

A rigorous approach to ICA that is different from the maximum likelihood approach is given by minimization of mutual information. Mutual information is a natural information-theoretic measure of dependence, and therefore it is natural to estimate the independent components by minimizing the mutual information of their estimates. Mutual information gives a rigorous justification of the principle of searching for maximally nongaussian directions, and in the end turns out to be very similar to the likelihood as well.

Mutual information can be approximated by the same methods by which negentropy is approximated. Alternatively, it can be approximated in the same way as the likelihood.


Fig. 10.2 The convergence of FastICA for ICs with supergaussian distributions. The value of mutual information is shown as a function of the iteration count. (Plot: mutual information, on the order of 10^-3, versus iteration count.)

Therefore, we find here very much the same objective functions and algorithms as in maximization of nongaussianity and maximum likelihood. The same gradient and fixed-point algorithms can be used to optimize mutual information.

Estimation of ICA by minimization of mutual information was probably first proposed in [89], where an approximation based on cumulants was derived. The idea has, however, a longer history in the context of neural network research, where it has been proposed as a sensory coding strategy. It was proposed in [26, 28, 30, 18] that decomposing sensory data into features that are maximally independent is useful as a preprocessing step. Our approach follows that of [197] for the negentropy approximations.

A nonparametric algorithm for minimization of mutual information was proposed in [175], and an approach based on order statistics was proposed in [369]. See [322, 468] for a detailed analysis of the connection between mutual information and infomax or maximum likelihood. A more general framework was proposed in [377].

Problems

10.1 Derive the formula in (10.5).

10.2 Compute the constant in (10.7).

10.3 If the variances of the y_i are not constrained to unity, does this constant change?

10.4 Compute the mutual information for a gaussian random vector with covariance matrix C.

Computer assignments

10.1 Create a sample of 2-D gaussian data with the two covariance matrices

\begin{pmatrix} 3 & 0 \\ 0 & 2 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 3 & 1 \\ 1 & 2 \end{pmatrix}   (10.10)

Estimate numerically the mutual information using the definition. (Divide the data into bins, i.e., boxes of fixed size, and estimate the density at each bin by computing the number of data points that belong to that bin and dividing it by the size of the bin. This elementary density approximation can then be used in the definition.)
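One possible way to carry out this assignment is sketched below in Python (the bin count and sample size are arbitrary choices, and the closed-form value used for comparison is the standard expression for a gaussian pair).

import numpy as np

def mi_by_binning(X, bins=40):
    # Binned estimate of I(x1, x2) for a 2-D sample X of shape (N, 2):
    # joint and marginal bin counts serve as elementary density estimates,
    # and the bin sizes cancel in the ratio p(x1, x2) / (p(x1) p(x2)).
    pxy, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

rng = np.random.default_rng(0)
for C in (np.array([[3.0, 0.0], [0.0, 2.0]]), np.array([[3.0, 1.0], [1.0, 2.0]])):
    X = rng.multivariate_normal(np.zeros(2), C, size=200_000)
    d = np.sqrt(np.diag(C))
    exact = -0.5 * np.log(np.linalg.det(C / np.outer(d, d)))   # closed form for a gaussian pair
    print(mi_by_binning(X), exact)   # rough agreement; the binned estimate is biased upward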
