JOINT ESTIMATION OF COVARIANCE MATRIX VIA CHOLESKY DECOMPOSITION
JIANG XIAOJUN
NATIONAL UNIVERSITY OF SINGAPORE
2012
JOINT ESTIMATION OF COVARIANCE MATRIX VIA CHOLESKY DECOMPOSITION

JIANG XIAOJUN
(B.Sc., Peking University, China)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
2012
ACKNOWLEDGEMENTS
I would like to take this opportunity to express my gratitude to my supervisor, Associate Professor Leng Chenlei. He is such a nice mentor, not only because of his brilliant ideas but also because of his kindness to his students. I could not have finished this thesis without his kind guidance, and it is my luck to have him as my supervisor. Special acknowledgement also goes to the faculty and staff of DSAP: any time I encountered difficulties and sought their help, I was always warmly welcomed.
I also want to express my thanks to my colleagues. You made my four years of study in DSAP a pleasant time.
CONTENTS

1.1 Cholesky Decomposition
1.2 Penalized Method
1.3 Penalties with Group Effect

2.1 Direct Thresholding Approaches
2.2 Penalized Approaches
2.3 Methods Based on Ordered Data
2.4 Motivation and Significance

Chapter 3 Model Description
3.1 Penalized Joint Normal Likelihood Function
3.2 IL-JMEC Method
3.3 GL-JMEC Method
3.4 Computation Issue
3.5 Main Results

Chapter 4 Simulation Results
4.1 Simulation Settings
4.2 Simulation with Respect to Different Data Sets
4.3 A Real Data Set Analysis

Chapter 5 Conclusion

Appendix A
A.1 Three Lemmas
A.2 Proof of Theorems
ABSTRACT
Covariance matrix estimation is a very important topic in statistics, and the estimate is needed in many areas of statistics. In this research, we focus on jointly estimating the covariance matrix and the precision matrix for grouped data with a natural order via the Cholesky decomposition. We treat the autoregressive parameters at the same position in different groups as a set and impose penalty functions with a group effect on these parameters together. A sparse l∞ penalty and a sparse group LASSO penalty are used in our methods. Both penalties may produce common zeros in the autoregressive matrices for different groups, which reveal the common relationships of the variables between groups. When the data structures in different groups are close, our approaches can do better than separate estimation approaches by providing more accurate covariance and precision matrix estimates, and the estimates are guaranteed to be positive definite. A coordinate descent algorithm is used in the optimization procedure, and convergence rates have been established in this study. We prove that, under some regularity conditions, our penalized estimators are consistent. In the simulation part, we show their good performance by comparing our methods with separate estimation methods. An application to classifying cattle from two treatment groups based on their weights is also included.
LIST OF NOTATIONS

A ⊗ B       Kronecker product of two matrices A and B
|A|_1       l1 norm of matrix A
Vec(A)      vectorization of matrix A
||A||       operator norm of matrix A, i.e. the largest singular value, which equals the square root of the maximal eigenvalue of AA′
||A||_F     Frobenius norm of matrix A, which equals √(tr(AA′))
U(a, b)     uniform distribution on the interval (a, b)
I(A)        indicator function of event A
<α, β>      inner product of vectors α and β
List of Tables

Table 4.1  Simulation result when the sample size is growing
Table 4.2  Simulation result when the number of groups is growing while the autoregressive matrices are identity matrices
Table 4.3  Simulation result when the number of groups is growing while the autoregressive matrices are randomly generated
Table 4.4  Simulation result when the data have different degrees of similarity
Table 4.5  Simulation result when the autoregressive matrices have many nonzero elements
Table 4.6  Performance of the discrimination study for the cattle weight data
List of Figures

Figure 4.1  Ratio of Frobenius loss and Operator loss in Example 1
Figure 4.2  Ratio of Frobenius loss and Operator loss in Example 3
Figure 4.3  Trend of weights for the two groups of cattle
an optimal portfolio. In Gaussian graphical modeling, a sparse precision matrix corresponds uniquely to an undirected graph that represents the conditional independence relationships of the target variables (see Pearl 2000).
Standard estimators of the covariance matrix and the precision matrix are the sample covariance matrix and its inverse multiplied by a scale parameter. These two estimators are proved to be unbiased and consistent; moreover, they are very easy to calculate. Due to these properties, they are widely used in statistics. In recent years, alternative estimators of the covariance matrix and the precision matrix have been proposed, due to the requirements of high dimensional data and the need for special structures of the variables. These new methods aim to eliminate the disadvantages of the sample covariance matrix when the dimension is large (see Johnstone 2001 and Bai 1993) and to provide structured and interpretable estimators. Penalized estimation methods and thresholding methods (Ledoit and Wolf 2004; Huang et al. 2006; Lam and Fan 2009; Rothman 2008 and so on) have made great contributions to achieving these goals.
Most research so far has focused on estimating a single covariance matrix or precision matrix. However, in some cases, it is much more valuable to estimate them jointly when grouped data are observed from similar categories. For instance, consider gene data that describe different types of the same disease, or observations of patients from different treatment groups. It is reasonable to assume that data from different groups share similar structures, and it is obviously a waste of information if we estimate the covariance matrices separately, because the similarity of the data is simply ignored. Meanwhile, it is not feasible to combine the data all together and estimate a single covariance matrix while treating them as a single group. A possible way to employ the information of similarity between different groups is to jointly estimate the matrices, and we can expect that estimation accuracy may be increased if a joint estimation method is employed. In this research, in order to achieve the joint estimation objective and keep our estimates positive definite, grouped penalization approaches based on the Cholesky decomposition are investigated.

In the subsequent sections, background knowledge about the Cholesky decomposition and penalty approaches will be reviewed; these are the key tools in our new methods. In Chapter 2, the development of matrix estimation approaches will be reviewed.
1.1 Cholesky Decomposition

Using the Cholesky decomposition to estimate the covariance and the precision matrix was first introduced by Pourahmadi (1999). A joint mean-covariance model was proposed in that approach to estimate the autoregressive parameters of the covariance matrix. After that, this decomposition was widely used in longitudinal studies and matrix estimation (see Pourahmadi 2000, Huang 2006, Rothman 2008, Shojaie and Michailidis 2010, Rothman et al. 2010, Leng et al. 2010).
The Cholesky decomposition states that for every positive definite matrix Σ, there exists a unique lower triangular matrix R such that

    Σ = RR′,                                                    (1.1)

where the diagonal entries of R are all nonnegative. The elements r11, r21, · · · , rp1, r22, r32, · · · , rpp of matrix R can be obtained successively. Assume the diagonal entries of matrix R are σ1, σ2, · · · , σp, and let D be the diagonal matrix with diagonal entries σ1², σ2², · · · , σp². Writing T = D^{1/2}R^{-1}, the modified Cholesky decomposition of Σ can be written as

    TΣT′ = D,    equivalently    Σ = T^{-1}DT^{-1}′.            (1.2)
In this modified decomposition, matrix T is a lower triangular matrix with ones on its diagonal, while matrix D is a diagonal matrix. An appealing advantage of the Cholesky decomposition is that the parameters in matrix T are free of constraints, and the only requirement on matrix D is that its diagonal elements are all positive. Moreover, the modified Cholesky decomposition has a natural statistical interpretation (see Pourahmadi 1999).
Following the argument in Pourahmadi (1999), the elements in matrix T can be expressed in terms of the successive regression coefficients of the variables regressed on their predecessors, and the elements in matrix D can be expressed as the corresponding regression error variances. If we further assume the variables have a multivariate normal distribution, then

    F(y_k | Y_(k)) ∼ N( σ_(k)′ Σ_(k)^{-1} Y_(k),  σ_kk − σ_(k)′ Σ_(k)^{-1} σ_(k) ).          (1.3)

Here we denote Y_(k) = (y_1, y_2, · · · , y_{k−1})′, Σ_(k) the (k − 1)-dimensional main submatrix of Σ, σ_(k) the vector containing the first k − 1 elements of the kth column of matrix Σ, and σ_kk the (k, k)th element of matrix Σ.
Denote ỹ_k = E(y_k | Y_(k)) = σ_(k)′ Σ_(k)^{-1} Y_(k), and let ε̃_k be the residual term y_k − ỹ_k. Obviously, ỹ_1 = 0 and ε̃_1 = y_1. Since E(y_k | Y_(k)) can be treated as the projection of y_k onto the σ-field σ(y_1, y_2, · · · , y_{k−1}), it is straightforward to conclude that ε̃_k is independent of ε̃_1, ε̃_2, · · · , ε̃_{k−1}.
Denote the kth row of matrix L by L_k′, in which the first k − 1 elements satisfy (φ_k1, φ_k2, · · · , φ_k(k−1)) = −σ_(k)′ Σ_(k)^{-1}, φ_kk = 1 and φ_kl = 0 for l > k. This implies L_k′ Y = ε̃_k (k = 1, · · · , p). Writing these p equations in matrix form, we have

    LY = ε̃,    where ε̃ = (ε̃_1, · · · , ε̃_p)′.                  (1.4)

Since the residuals ε̃_1, · · · , ε̃_p are independent, Cov(ε̃) is a diagonal matrix, and taking covariances on both sides of (1.4) gives LΣL′ = Cov(ε̃); by the uniqueness of the decomposition (1.2), L coincides with T and Cov(ε̃) = D.
If we relax the assumption of multivariate normality of Y, σ_(k)′ Σ_(k)^{-1} Y_(k) is still the best linear predictor of y_k in the least squares sense. Thus the elements in the kth row of T are the least squares regression coefficients of variable y_k regressed on variables y_1, · · · , y_{k−1}. This interpretation makes the autoregressive parameters meaningful, and it suggests that we may impose special structures on these parameters if we have prior information about the data.
1.2 Penalized Method

In traditional methods, parameters are estimated based on some meaningful loss function L(θ), mostly by minimizing this target loss function. Likelihood functions constitute a widely used family of loss functions; for instance, the popular negative log likelihood function of the multivariate normal distribution is tr(Σ^{-1}S) + log|Σ|, and minimizing it leads to the maximum likelihood estimator of the covariance matrix. In some other applications, the loss function can also be chosen as a norm, for example the l1 or l2 norm. The widely used linear regression is an application of the l2 norm loss: minimizing the squared l2 norm of Y − Xβ leads to a linear model for variable y based on variables x_1, x_2, · · · , x_p with the smallest squared fitting error. Here Y is the vector of observations of variable y and X is the design matrix for the explanatory variables x_1, x_2, · · · , x_p.
In these classical estimation approaches, the parameters or covariates are all included in the model. For example, the standard linear regression model always contains all the explanatory variables that have been observed. However, as we know, including many covariates leads to low bias but high prediction variance, that is, overfitting. This overfitting phenomenon can be explained in linear regression problems as follows: the coefficient of determination R² never decreases when one adds more and more explanatory variables to the model, yet the prediction variance can become very high. In order to reduce the prediction variance, one can sacrifice a little bias, making a tradeoff between the two.
A natural idea is to make some special assumptions about the data, for instance, that there are a lot of small coefficients or a lot of unimportant explanatory variables. Based on this kind of prior information, more adaptive models can be proposed to investigate the data.
Penalization methods were introduced as a simple and straightforward way to achieve this objective. The idea is similar to that of the AIC and BIC methods: a tradeoff is made between the goodness of fit and the prediction accuracy by adding a penalty function pλ(θ) to the loss function L(θ) and minimizing the new objective function

    L(θ) + pλ(θ)                                                (1.5)

instead of the original loss function L(θ). In this new method, the loss function L(θ) controls the fit of the model, while pλ(θ) can be used to constrain the complexity (the number of nonzero parameters included in the model, i.e. the sparsity) or the structure of the model.
Penalization approaches also have a close relationship with Bayesian methods (see Zhao et al. 2009). In particular, if the loss function in (1.5) is a negative log likelihood function and the penalty function pλ(θ) is a negative log prior density of the parameters θ, then the objective function (1.5) can be interpreted, up to an additive constant, as the negative log posterior density of θ conditional on the observations. For example, ridge regression, where we choose pλ(θ) = λ||θ||₂², is the same as the Bayesian approach in which a multivariate normal prior N(0, (1/(2λ)) I_p) is imposed on the parameters θ.
One great advantage of penalized approaches compared with Bayesian approaches is that a more flexible function pλ(θ) can be used to constrain the parameters, whereas a proper prior density function is needed in Bayesian approaches. A carefully chosen penalty function can make a tradeoff between the bias and the prediction accuracy, and it may also introduce desired properties or structures into the model. In order to introduce the penalized methods, we use the squared l2 norm loss function ||Y − Xθ||₂² as an example and denote the ordinary least squares estimator of the coefficients by θ̂_ols. For simplicity, we assume X′X = I_p.
Frank and Friedman (1993) introduced the bridge regression method, in which a penalty term pλ(θ) = λ||θ||_q^q is added to the loss function ||Y − Xθ||₂². Instead of using the ordinary least squares method, they minimize the following function:

    θ̂_bridge = argmin_θ [L(θ) + pλ(θ)] = argmin_θ [ ||Y − Xθ||₂² + λ||θ||_q^q ].

Here q is a positive constant and λ is the threshold parameter. When q > 1, bridge regression shrinks the parameters θ and reduces variability; when q ≤ 1, it provides sparse estimates of the parameter θ. In particular, bridge regression is called the LASSO when the constant q is set to 1 (see Tibshirani 1996), and is known as ridge regression when q is set to 2. Overall, bridge regression provides a more stable model compared to ordinary regression, at the cost of increased bias.
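As a small numerical illustration (our own sketch, not from the thesis), under the orthogonal design assumption the bridge problem separates into one-dimensional problems of the form min_θ (z − θ)² + λ|θ|^q, where z is the OLS coefficient; solving them on a grid shows that q ≤ 1 sets small coefficients exactly to zero, while q = 2 only shrinks them:

    import numpy as np

    # Solve min_theta (z - theta)^2 + lam * |theta|^q on a fine grid.
    def bridge_1d(z, lam, q, grid=np.linspace(-3, 3, 60001)):
        return grid[np.argmin((z - grid) ** 2 + lam * np.abs(grid) ** q)]

    for q in (0.5, 1.0, 2.0):
        print(q, [round(bridge_1d(z, lam=1.0, q=q), 3) for z in (0.3, 2.0)])
    # q = 0.5 and q = 1.0 send z = 0.3 to exactly 0 (sparsity);
    # q = 2.0 (ridge) only shrinks it to 0.15.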
In recent years, especially along with the development of large dimensional data sets, penalty approaches that lead to sparse estimators have gained more and more attention. This is mainly because models from traditional approaches are relatively complicated, since nearly all of the parameters are nonzero. For instance, the sample covariance matrix obtained from the Gaussian likelihood function has no zero elements, which suggests that all variables are marginally correlated; likewise, all the coefficients in ordinary least squares regression being nonzero suggests that all variables are important in predicting the target variable. This is confusing and hard to interpret.
Consequently, investigating simple models that include only the important variables has become a popular research area in statistics. Penalization methods that are capable of providing sparse estimators have been extensively investigated; they can efficiently reduce the complexity of the underlying model. Fan and Li (2001) listed three desired properties of an ideal penalty function:

1. Unbiasedness: the resulting estimator should be nearly unbiased, and the large coefficients should be only slightly shrunk in order to guarantee accuracy.

2. Sparsity: the solution must be sparse, so that it provides a more interpretable model.

3. Continuity: the solution should be continuous with respect to the data in order to avoid instability.
It has to be noted that most penalty functions, especially convex ones, cannot satisfy all three requirements: convex penalty functions always shrink the large coefficients as well as the small ones. Nonconvex penalty functions may meet all three requirements; however, nonconvex penalties often lead to computational difficulties, and it is hard to find the global minimizer.
A very natural shrinkage approach, called the hard thresholding method, was mentioned in Antoniadis (1997). The solution to the least squares linear regression problem with a hard threshold penalty term is

    θ̂_hard = θ̂_ols I(|θ̂_ols| > λ).                             (1.6)

The corresponding penalty function is

    pλ(θ) = λ² − (|θ| − λ)² I(|θ| < λ).                          (1.7)

The threshold parameter λ is a positive constant chosen by carefully balancing sparsity and bias. This method directly shrinks small coefficients to zero and keeps the large coefficients. However, the solution is not continuous with respect to the data, which makes the resulting model sensitive to the observations.
In Tibshirani (1996), the so-called LASSO was proposed. This method gives a simple and straightforward way to achieve sparse models in regression. Under the orthogonal design assumption, the LASSO solution has the explicit soft-thresholding form

    θ̂_lasso = sign(θ̂_ols)(|θ̂_ols| − λ)₊ .                       (1.9)

This formula for the orthogonal design case shows some insight into the LASSO method: it can shrink small coefficients exactly to zero and provide a sparse solution. The solution is continuous with respect to the data and also continuous with respect to the threshold parameter λ. The LASSO performs well when the coefficients are sparse, while ridge regression performs well when there are a lot of small coefficients. That is because ridge regression only shrinks the coefficients towards zero, while the LASSO is a thresholding approach that shrinks some coefficients to exactly zero.
Efron et al. (2004) proposed the LARS algorithm, a very important piece of work that covers both the LASSO and the forward stagewise selection method. The solution path of the LASSO can be obtained efficiently by a simple modification of LARS. Moreover, the LARS algorithm gives a geometrical explanation and provides researchers with a further understanding of the LASSO. Rosset et al. (2008) also proposed the solution path for l1 penalized approaches, but with more general loss functions: the loss functions were extended to the class of differentiable and piecewise quadratic functions with respect to the response variable y and the term x_i′θ. These works made important contributions to the LASSO, since one can efficiently calculate the whole solution path over different values of λ.

However, as a convex penalty function, the LASSO also has a problem. The LASSO solution for the orthogonal design case presented in (1.9) reminds us that the LASSO also shrinks the large coefficients. This effect leads to bias and affects the prediction accuracy.
In order to eliminate this disadvantage of the LASSO and to satisfy the three conditions mentioned in Fan and Li (2001), the authors of that paper proposed the SCAD penalty function, which is a nonconvex function. The SCAD solution is continuous with respect to the data and retains the large coefficients, and it also has an explicit form when the design matrix is orthogonal. The corresponding penalty function is relatively complex, but the first order derivative of the SCAD penalty function has the explicit form

    p′λ(θ) = λ { I(|θ| ≤ λ) + [(aλ − |θ|)₊ / ((a − 1)λ)] I(|θ| > λ) }.          (1.10)

This penalty function can provide sparse estimators of the coefficients by shrinking the small coefficients while leaving the large coefficients unchanged; it can be viewed as a combination of the LASSO and the hard thresholding method. Fan and Li (2001) showed that this penalty function satisfies the three requirements mentioned previously, and also showed that the resulting penalized method has the so-called oracle property, meaning it can perform as well as if the zero coefficients were already known. However, the SCAD penalty function is not convex, which may lead to computational problems.
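The derivative in (1.10) is simple to implement; the following sketch (our own code, with a = 3.7 as recommended by Fan and Li 2001) shows that the implied amount of shrinkage equals λ for small coefficients and decays to zero for large ones:

    import numpy as np

    # First order derivative of the SCAD penalty, equation (1.10).
    def scad_derivative(theta, lam, a=3.7):
        t = np.abs(theta)
        return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

    print(scad_derivative(np.array([0.1, 1.0, 5.0]), lam=0.5))
    # approximately [0.5, 0.315, 0.0]: full LASSO-like shrinkage for the small
    # coefficient, partial shrinkage in the middle, none for the large one.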
Zou (2006) proved that the LASSO does not possess the oracle property, and proposed the adaptive LASSO method as an alternative. Instead of penalizing each coefficient equally, the adaptive method penalizes each coefficient with a particular weight. The penalty function is

    pλ(θ) = λ Σ_j w_j |θ_j|,

where the weights w_j are chosen adaptively, for example w_j = 1/|θ̂_j|^γ for an initial consistent estimator θ̂ and some constant γ > 0, so that large coefficients receive small penalties and small coefficients receive large penalties.
1.3 Penalties with Group Effect

In Section 1.2, the penalty functions penalize the parameters individually. However, in some applications, one may be interested in penalty functions with a group effect, which penalize sets of parameters together. With the group effect of the penalty functions, one may achieve a desired structure of the variables, for instance, making the variables close to each other or shrinking them towards zero together.
Tibshirani et al. (2005) proposed the fused LASSO method. The fused LASSO not only penalizes the coefficients themselves, it also penalizes the successive differences of the coefficients. The fused LASSO penalty function is

    pλ1,λ2(θ) = λ1 Σ_{j=1}^{p} |θ_j| + λ2 Σ_{j=2}^{p} |θ_j − θ_{j−1}|,

which encourages sparsity of the coefficients as well as sparsity of their successive differences.
Bondell and Reich (2008) proposed another penalization method, called OSCAR. The penalty function was chosen as a combination of the l1 norm and a pairwise l∞ norm. The objective function can be presented as

    θ̂_oscar = argmin_θ [ ||Y − Xθ||₂² + λ1 Σ_j |θ_j| + λ2 Σ_{j<k} max(|θ_j|, |θ_k|) ],

so that the pairwise l∞ terms encourage groups of coefficients to take a common absolute value.
The so-called elastic net was proposed by Zou and Hastie (2005). The elastic net estimator β̂_en is defined as

    β̂_en = argmin_β [ ||Y − Xβ||₂² + λ1 ||β||₁ + λ2 ||β||₂² ],

which combines the l1 penalty of the LASSO with the squared l2 penalty of ridge regression, so that sparse solutions can be obtained while strongly correlated variables tend to be kept or dropped together.
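As a usage sketch (assuming scikit-learn is available; note that its ElasticNet is parameterized through alpha and l1_ratio rather than through (λ1, λ2) directly):

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(5)
    X = rng.standard_normal((100, 20))
    beta = np.zeros(20)
    beta[:3] = [2.0, -1.5, 1.0]                  # only three nonzero coefficients
    y = X @ beta + 0.1 * rng.standard_normal(100)
    fit = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
    print(np.round(fit.coef_, 2))                # a sparse coefficient estimate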
Some other penalty functions with group effects have been investigated in order to meet special requirements in multi-factor ANOVA problems. In such problems, factors can be combinations of measures and may have several levels. The main goal of multi-factor ANOVA is often to select the important factors and to identify the level of importance of the variables within each factor. Suppose there are J factors and the jth factor has coefficients θ_j, a p_j dimensional vector. The corresponding design matrix for the jth factor is X_j and the response is Y. In order to estimate the coefficients θ_j (j = 1, 2, · · · , J), one can fit a linear regression model and minimize the objective function

    || Y − Σ_{j=1}^{J} X_j θ_j ||₂² .
It is reasonable to assume that some factors are not important in the model, which means that some of the coefficient vectors θ_j must be 0; meanwhile, for the important factors, the variables in the same group may behave differently. With this in mind, Yuan and Lin (2004) introduced the group LASSO algorithm. They added the penalty term

    λ Σ_{j=1}^{J} ||θ_j||_{K_j}

to the objective function. Here ||θ_j||_{K_j} = (θ_j′ K_j θ_j)^{1/2}, where K_j is a kernel matrix, set to p_j I_{p_j} in their paper. One important feature of the group LASSO is that it can select important factors and set all coefficients in the unimportant factors to zero. A group LARS algorithm is also investigated in their paper. However, unlike the relationship between the LASSO and LARS, group LARS cannot recover the solution path of the group LASSO (the solution path of the group LASSO is not piecewise linear).
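A minimal sketch of the group-wise shrinkage behind the group LASSO is given below (our own illustration, with orthonormal group design matrices and any sqrt(p_j)-type weight absorbed into λ): each block of OLS coefficients is either set to zero as a whole or shrunk towards zero by a block soft-thresholding operator.

    import numpy as np

    # Block soft-thresholding: kills a whole group when its l2 norm is small,
    # otherwise shrinks the group towards zero.
    def block_soft_threshold(theta_j, lam):
        norm = np.linalg.norm(theta_j)
        if norm <= lam:
            return np.zeros_like(theta_j)
        return (1.0 - lam / norm) * theta_j

    groups = [np.array([0.2, -0.1]), np.array([2.0, -1.0, 0.5])]
    print([block_soft_threshold(g, lam=0.8) for g in groups])
    # the first (small) group is removed entirely; the second is only shrunk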
In the group LASSO, the coefficients within a group are either all estimated to be zero or none of them is zero. This is not always reasonable, especially when the variables have different levels within a group. Bondell and Reich (2009) used a weighted fusion penalty method to solve this multi-factor ANOVA problem while taking the levels of the variables within a group into account. The penalty term is built from weighted pairwise differences of the coefficients within each group, and they minimize the resulting penalized objective function. One important feature of this approach is that it can collapse levels within a group by setting the corresponding coefficients to be equal.
Zhao et al. (2008) introduced the so-called composite absolute penalty (CAP). In their method, the parameters are divided into several groups G1, G2, · · · , GK using prior knowledge. For each group, the parameters within the group are penalized with an l_{γk} norm; the resulting K dimensional vector of group norms is then penalized by an overall l_{γ0} norm raised to the power γ0. Their method can be presented as the minimization problem

    min_θ { ||Y − Xθ||₂² + λ Σ_{k=1}^{K} ||θ_{G_k}||_{γ_k}^{γ_0} }.

In their setting, the overall parameter γ0 was set to 1, and the inner parameters γk (k = 1, 2, · · · , K) were chosen according to the requirements of the model. The overall l_{γ0} norm penalizes some group norms to exactly 0, which performs group selection, while the inner l_{γk} norms construct the desired structures of the parameters within each group.
Zhou et al. (2010) proposed a hierarchical penalty function that uses a reparameterization technique to construct common zeros across different groups. In their approach, the parameters θ_kj in group k are reparameterized as d_k α_kj, that is, θ_kj = d_k α_kj. The parameters d_k and α_kj are both penalized by a LASSO type penalty, and the estimates are obtained by minimizing

    || Y − Σ_k Σ_j d_k α_kj x_kj ||₂² + λ1 Σ_k |d_k| + λ2 Σ_k Σ_j |α_kj|

over d and α. The linking parameter d_k can be shrunk to zero, which makes all the coefficients θ_k1, θ_k2, · · · , θ_kp_k in the kth group equal to zero together; this yields a group selection property. Meanwhile, even if the linking parameter d_k is not zero, an individual parameter α_kj may be shrunk to zero, which also makes θ_kj = 0, so that an individual zero is obtained within the kth group. The consistency and sparsity properties of the method are also given in their paper.
2.1 Direct Thresholding Approaches
The sample covariance matrix estimator S is asymptotically unbiased. Nevertheless, according to the research of Yin (1988) and Bai (1993), the eigenvalues of the sample covariance matrix S tend to be more dispersed than the population eigenvalues. This has led to shrinkage estimation methods that shrink the eigenvalues of the sample covariance matrix. Dey and Srinivasan (1985) proposed an orthogonally invariant minimax estimator under Stein's loss function. In their setting, the estimator is of the form Rφ(L)R′, where R is the matrix of eigenvectors of the sample covariance matrix and φ(L) is a diagonal matrix, each entry of which is a function of the eigenvalues of the sample covariance matrix. The eigenvectors of this estimator are the same as those of the sample covariance matrix, but the eigenvalues are shrunk.
Ledoit and Wolf (2003a, 2003b, 2004) developed a series of works focused on combining the sample covariance matrix with a well structured matrix. Let Σ denote the true covariance matrix and S the sample covariance matrix. The idea of their approach is to find an estimator Σ̂ = δF + (1 − δ)S that minimizes the risk function

    min_{δ, F}  E || δF + (1 − δ)S − Σ ||²_F .                  (2.1)

Here δ ranges from 0 to 1 and F is a matrix with a special structure. This method shrinks the sample covariance matrix S towards the structured matrix F and makes a tradeoff between estimation bias and prediction variance.
The first work was Ledoit and Wolf (2003a), where F was computed from a single index model for stock return data. In another work, Ledoit and Wolf (2003b), F was chosen as a matrix with equal off-diagonal elements.

The matrix F was chosen to be υI in Ledoit and Wolf (2004). Under this setting, the resulting estimator is known as the Ledoit-Wolf estimator. Because the minimizer of (2.1) depends on the underlying true covariance matrix Σ, the authors proposed asymptotic estimators of υ and δ based on the sample covariance matrix. This work is considered a benchmark due to its simplicity and convenience of calculation.
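For illustration, a ready-made implementation of this shrinkage-towards-identity estimator is available in scikit-learn (assumed here); the shrunk estimate stays well conditioned even when p is close to n:

    import numpy as np
    from sklearn.covariance import LedoitWolf

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 40))            # n = 50 observations, p = 40 variables
    S = np.cov(X, rowvar=False)                  # sample covariance matrix
    lw = LedoitWolf().fit(X)                     # shrinkage towards a scaled identity
    print(np.linalg.cond(S), np.linalg.cond(lw.covariance_), lw.shrinkage_)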
Besides shrinking the eigenvalues, more and more researchers nowadays focus on estimating sparse covariance matrices, in which the elements of the covariance matrix themselves are shrunk. This is because sparse covariance and precision matrices provide more interpretable structures of the variables: a zero element in the covariance matrix indicates that the corresponding variables are marginally independent, and a zero element in the precision matrix indicates that the corresponding two variables are independent conditional on all the remaining variables. Both kinds of independence relationships simplify the overall structure of the variables. The sparse precision matrix has received special interest because it corresponds uniquely to an undirected graph of the variables when the variables have a multivariate normal distribution.
Bickel and Levina (2008b) proposed a direct hard thresholding method for estimating the covariance matrix. The estimator Σ̂_λ is simply obtained as

    σ̂_kl = s_kl I(|s_kl| > λ)  for k ≠ l,    σ̂_kk = s_kk,

where s_kl denotes the klth element of the sample covariance matrix S. This method shrinks the small elements of the sample covariance matrix to zero and achieves a sparse estimator of the covariance matrix. The convergence rate under the operator norm was given for a large class of matrices. El Karoui (2008) independently proposed a similar direct thresholding approach, and consistency under the operator norm was also established.

This direct thresholding method was further investigated by Rothman et al. (2009), who extended the hard thresholding method to more general methods. Instead of choosing the klth element σ̂_kl as s_kl I(|s_kl| > λ), they chose

    σ̂_kl = pλ(s_kl)   (k ≠ l),

where the threshold function pλ can be any of a class of generalized thresholding operators satisfying several requirements. The convergence rate is also given in their paper. These direct thresholding methods are attractive since there is nearly no computational burden beyond the selection of the threshold parameter by cross validation.
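The following sketch (our own code) applies such a universal thresholding operator to the off-diagonal entries of a sample covariance matrix; positive definiteness of the result is not guaranteed, a point the thesis returns to below:

    import numpy as np

    # Universal thresholding of the sample covariance matrix: p_lambda is
    # applied entrywise off the diagonal, the diagonal is left untouched.
    def threshold_covariance(S, lam, operator="hard"):
        S = np.asarray(S, dtype=float)
        if operator == "hard":
            out = S * (np.abs(S) > lam)
        else:                                    # soft thresholding
            out = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
        np.fill_diagonal(out, np.diag(S))        # keep the diagonal entries
        return out                               # may fail to be positive definite

    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 20))
    S = np.cov(X, rowvar=False)
    print(np.mean(threshold_covariance(S, lam=0.2) == 0.0))   # fraction of zeros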
These two thresholding methods both employ universal threshold functions. An adaptive version of the direct thresholding methods was proposed by Cai and Liu (2011). They argued that the adaptive thresholding estimator Σ̂ of Σ, with klth element σ̂_kl = p_{λ_kl}(s_kl), would outperform the estimators from universal thresholding methods, because the sample covariances have a wide range of variability. Here p_{λ_kl} is a threshold function with an entry-specific parameter λ_kl that is closely related to the sample correlation coefficients. An optimal rate of s0(p)(log p / n)^{(1−q)/2} is achieved by the adaptive estimator.

These thresholding methods have sound convergence properties, which hold when log(p)/n = o(1). Nevertheless, it has to be noted that these methods cannot guarantee the positive definiteness of the estimators, which is a fundamental requirement for covariance matrices.
2.2 Penalized Approaches
Most of the shrinkage methods above are based on the covariance matrix. One explanation is that the sample covariance matrix is always available, so shrinking it is easy and straightforward. Shrinking the precision matrix, however, is not easy. First of all, the inverse of the sample covariance matrix may not exist at all, which occurs when p > n. Even if the dimension p is less than n, it has been shown that the inverse of the sample covariance matrix may not be a good estimator of the precision matrix, because it is ill-conditioned, meaning that the estimation error increases significantly when inverting the sample covariance matrix (see Ledoit and Wolf 2004).

Although directly shrinking the precision matrix may not be a good choice, alternative methods can also achieve the shrinkage objective, for example penalized methods. By carefully choosing the loss function and the penalty function, one can obtain sparse estimates of the covariance matrix and the precision matrix.
The first approach that employed a penalized method in estimating a sparse precision matrix was that of Meinshausen and Buhlmann (2006), where each variable is regressed on all the remaining variables using the LASSO. The regression coefficients can be penalized to zero by the l1 penalty term. The ijth and jith components of the precision matrix are estimated to be zero if the coefficient of variable i regressed on variable j, or the coefficient of variable j regressed on variable i, equals zero, or both of them are zero. It has to be noted that this method only focuses on finding the positions of the zero entries in the precision matrix, which reveals the underlying Gaussian graphical model of the variables, but it does not provide an estimator of the precision matrix itself.
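A rough sketch of this neighbourhood selection idea is given below (our own code, assuming scikit-learn); an off-diagonal entry is declared nonzero only when both of the corresponding lasso regressions give a nonzero coefficient, matching the rule described above:

    import numpy as np
    from sklearn.linear_model import Lasso

    def neighbourhood_selection(X, lam):
        n, p = X.shape
        support = np.zeros((p, p), dtype=bool)
        for j in range(p):
            others = np.delete(np.arange(p), j)
            # regress variable j on all the remaining variables with an l1 penalty
            coef = Lasso(alpha=lam).fit(X[:, others], X[:, j]).coef_
            support[j, others] = coef != 0.0
        return support & support.T               # zero if either coefficient is zero

    rng = np.random.default_rng(2)
    X = rng.standard_normal((200, 10))
    print(neighbourhood_selection(X, lam=0.1).sum())  # number of nonzero off-diagonal positions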
Most of the penalized approaches for estimating matrices are based on the normal likelihood function. In the work of d'Aspremont et al. (2008), a penalized method was suggested that imposes a penalty on the number of nonzero elements of the precision matrix, based on the negative log normal likelihood function, making a tradeoff between the complexity of the target matrix and the estimation bias. This method is similar in spirit to AIC.

Instead of penalizing the number of nonzero elements in the precision matrix, Friedman et al. (2008) and Rothman et al. (2008) both proposed penalized methods that directly penalize the off-diagonal elements of the precision matrix, by adding an l1 penalty to the elements of the precision matrix in the negative log normal likelihood loss function. A very fast computational algorithm called GLASSO was developed in Friedman et al. (2008), building on the work of Friedman et al. (2007). The convergence rate of the estimator under the Frobenius norm was first given in Rothman et al. (2008).
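For illustration, a GLASSO-type estimator is available in scikit-learn (assumed here); the l1 penalty on the precision matrix produces exact zeros, i.e. a sparse Gaussian graphical model:

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(3)
    X = rng.standard_normal((200, 15))
    model = GraphicalLasso(alpha=0.2).fit(X)     # l1-penalized normal likelihood
    Omega = model.precision_
    print(np.mean(np.abs(Omega) < 1e-8))         # proportion of (near) zero entries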
Lam and Fan (2009) extended the penalized methods by replacing the l1 penalty with more general penalties such as SCAD. Besides the estimator of the precision matrix, penalized estimators of the covariance matrix and the correlation matrix were also given in that paper, and explicit convergence rates of these estimators under the Frobenius norm were investigated.
Another interesting approach is that of Cai et al. (2011). In their approach, a sparse precision matrix is obtained by minimizing the elementwise l1 norm of the matrix Ω under the constraint

    ||SΩ − I||∞ < λ.

In their paper, the l1 norm of a matrix A (an n × p matrix) is defined as Σ_{i=1}^{n} Σ_{j=1}^{p} |a_ij|, and the l∞ norm is defined as max_{i,j} |a_ij|. The resulting estimator Ω̂ has elements

    ω̂_ij = ω̂_ji = ω̂¹_ij I(|ω̂¹_ij| ≤ |ω̂¹_ji|) + ω̂¹_ji I(|ω̂¹_ij| > |ω̂¹_ji|),

where ω̂¹_ij is the ijth element of the (possibly asymmetric) solution of the above minimization problem. This work is interesting since it provides a penalized method that does not rely on a likelihood function, and the method can be implemented by linear programming, which is relatively simple to compute.
2.3 Methods Based on Ordered Data
The thresholding methods and direct penalization methods mentioned above are all invariant with respect to the order of the variables. Nevertheless, in some applications, prior information about the order of the variables is available, which has driven researchers to investigate methods that use this prior information.

In some applications, it is reasonable to assume that variables that are far apart may not be correlated with each other, so the corresponding covariances are zero. Based on this assumption, a direct banding method was proposed by Bickel and Levina (2008a). In that paper, the klth element of the sample covariance matrix is shrunk to zero if and only if |k − l| > M_n, where M_n is an integer chosen by cross validation. The convergence rate was given for a large class of covariance matrices.
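A sketch of the banding operator (our own code) is shown below; it simply zeroes out every entry of the sample covariance matrix whose row and column indices are more than M_n apart, relying on the natural order of the variables:

    import numpy as np

    # Banding the sample covariance matrix: keep only entries within M_n of the diagonal.
    def band_covariance(S, M_n):
        p = S.shape[0]
        rows, cols = np.indices((p, p))
        return np.where(np.abs(rows - cols) <= M_n, S, 0.0)

    rng = np.random.default_rng(4)
    X = rng.standard_normal((100, 8))
    S = np.cov(X, rowvar=False)
    print(band_covariance(S, M_n=2))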
In some cases, the variables have a natural order, which means that one can fit them with an autoregressive model. This property suggests that the modified Cholesky decomposition can be used in estimating the covariance matrix. The modified Cholesky decomposition of a given matrix Σ can be written as Σ = L^{-1}DL^{-1}′, and the elements of the lower triangular matrix L can be interpreted as the regression coefficients obtained when each variable is regressed on its predecessors. Wu and