JOINT ESTIMATION OF COVARIANCE MATRIX VIA CHOLESKY DECOMPOSITION
JIANG XIAOJUN
NATIONAL UNIVERSITY OF SINGAPORE
2012
JOINT ESTIMATION OF COVARIANCE MATRIX VIA CHOLESKY DECOMPOSITION

JIANG XIAOJUN
(B.Sc., Peking University, China)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
2012
ACKNOWLEDGEMENTS
I would like to take this opportunity to express my gratitude to my supervisor, Associate Professor Leng Chenlei. He is such a nice mentor, not only because of his brilliant ideas but also because of his kindness to his students. I could not have finished this thesis without his kind guidance, and it is my luck to have him as my supervisor. Special acknowledgement also goes to the faculty and staff of DSAP: any time I encountered difficulties and sought their help, I was always warmly welcomed.
I also want to express my thanks to my colleagues. You made my four years of study in DSAP a pleasant time.
CONTENTS

1.1 Cholesky Decomposition
1.2 Penalized Method
1.3 Penalties with Group Effect

2.1 Direct Thresholding Approaches
2.2 Penalized Approaches
2.3 Methods Based on Ordered Data
2.4 Motivation and Significance

Chapter 3 Model Description
3.1 Penalized Joint Normal Likelihood Function
3.2 IL-JMEC Method
3.3 GL-JMEC Method
3.4 Computation Issue
3.5 Main Results

Chapter 4 Simulation Results
4.1 Simulation Settings
4.2 Simulation with Respect to Different Data Sets
4.3 A Real Data Set Analysis

Chapter 5 Conclusion

Appendix A
A.1 Three Lemmas
A.2 Proof of Theorems
ABSTRACT
Covariance matrix estimation is a very important topic in statistics, and the estimate is needed in many areas of statistics. In this research, we focus on jointly estimating the covariance matrix and the precision matrix for grouped data with a natural order via the Cholesky decomposition. We treat the autoregressive parameters at the same position in different groups as a set and impose penalty functions with a group effect on these parameters together. A sparse l∞ penalty and a sparse group LASSO penalty are used in our methods. Both penalties may produce common zeros in the autoregressive matrices for different groups, which reveal the common relationships of the variables between groups. When the data structures in different groups are close, our approaches can do better than separate estimation approaches by providing more accurate covariance and precision matrix estimates, and the estimates are guaranteed to be positive definite. A coordinate descent algorithm is used in the optimization procedure, and convergence rates have been established in this study. We prove that, under some regularity conditions, our penalized estimators are consistent. In the simulation part, we show their good performance by comparing our methods with separate estimation methods. An application to classifying cattle from two treatment groups based on their weights is also included.
LIST OF NOTATIONS

A ⊗ B       Kronecker product of two matrices A and B
|A|_1       l1 norm of matrix A
Vec(A)      vectorization of matrix A
||A||       operator norm of matrix A, i.e. the largest singular value, which equals the square root of the maximal eigenvalue of AA′
||A||_F     Frobenius norm of matrix A, which equals √(tr(AA′))
U(a, b)     uniform distribution on the interval (a, b)
I(A)        indicator function of event A
<α, β>      inner product of vectors α and β
List of Tables

Table 4.1  Simulation result when the sample size is growing
Table 4.2  Simulation result when the number of groups is growing while the autoregressive matrices are identity matrices
Table 4.3  Simulation result when the number of groups is growing while the autoregressive matrices are randomly generated
Table 4.4  Simulation result when the data have different degrees of similarity
Table 4.5  Simulation result when the autoregressive matrices have many nonzero elements
Table 4.6  Performance of the discrimination study for the cattle weight data
List of Figures

Figure 4.1  Ratio of Frobenius loss and Operator loss in Example 1
Figure 4.2  Ratio of Frobenius loss and Operator loss in Example 3
Figure 4.3  Trend of weights for the two groups of cattle
an optimal portfolio. In Gaussian graphical modeling, a sparse precision matrix corresponds uniquely to an undirected graph that represents the conditional independence relationships of the target variables (see Pearl 2000).
Standard estimators of the covariance matrix and the precision matrix are the sample covariance matrix and its inverse multiplied by a scale parameter. These two estimators are proved to be unbiased and consistent; moreover, they are very easy to calculate. Due to these properties, they are widely used in statistics. In recent years, alternative estimators of the covariance matrix and the precision matrix have been proposed, due to the requirements of high dimensional data and the need for special structures of the variables. These new methods aim to eliminate the disadvantages of the sample covariance matrix when the dimension is large (see Johnstone 2001 and Bai 1993) and to provide structured and interpretable estimators. Penalized estimation methods and thresholding methods (Ledoit and Wolf 2004; Huang et al. 2006; Lam and Fan 2009; Rothman 2008 and so on) have made great contributions to achieving these goals.
Most research so far has focused on estimating a single covariance matrix or precision matrix. However, in some cases, it is much more valuable to estimate them jointly when grouped data are observed from similar categories. For instance, consider gene data that describe different types of the same disease, or observations of patients from different treatment groups. It is reasonable to assume that data from different groups share similar structures, and it is obviously a waste of information if we estimate the covariance matrices separately, because the similarity of the data is simply ignored. Meanwhile, it is not feasible to combine the data all together and estimate a single covariance matrix while treating them as a single group. A possible way to employ the information of similarity between different groups is to jointly estimate the matrices, and we can expect that estimation accuracy may be increased if a joint estimation method is employed. In this research, in order to achieve the joint estimation objective and keep our estimates positive definite, grouped penalization approaches based on the Cholesky decomposition are investigated.

In the subsequent sections, background knowledge about the Cholesky decomposition and penalty approaches will be reviewed; these are the key tools in our new methods. In Chapter 2, the development of matrix estimation approaches will be reviewed.
1.1 Cholesky Decomposition

Using the Cholesky decomposition to estimate the covariance and the precision matrix was first introduced by Pourahmadi (1999). A joint mean-covariance model was proposed in that approach to estimate the autoregressive parameters of the covariance matrix. After that, this decomposition was widely used in longitudinal studies and matrix estimation (see Pourahmadi 2000, Huang 2006, Rothman 2008, Shojaie and Michailidis 2010, Rothman et al. 2010, Leng et al. 2010).
The Cholesky decomposition states that for every positive definite matrix Σ, there exists a unique lower triangular matrix R such that

    Σ = RR′,                                                    (1.1)

where the diagonal entries of R are all nonnegative. The elements r11, r21, · · · , rp1, r22, r32, · · · , rpp of matrix R can be obtained successively. Assume the diagonal entries of matrix R are σ1, σ2, · · · , σp, and let D be the diagonal matrix with diagonal entries σ1², σ2², · · · , σp². Writing T = D^{1/2}R^{-1}, the modified Cholesky decomposition of Σ can be written as

    TΣT′ = D,    equivalently    Σ = T^{-1}DT^{-1}′.            (1.2)
In this modified decomposition, matrix T is a lower triangular matrix with ones on its diagonal, while matrix D is a diagonal matrix. An appealing advantage of the Cholesky decomposition is that the parameters in matrix T are free of constraints, and the only requirement on matrix D is that its diagonal elements are all positive. Moreover, the modified Cholesky decomposition has a natural statistical interpretation (see Pourahmadi 1999).
Following the argument in Pourahmadi (1999), the elements in matrix T can be expressed in terms of the successive regression coefficients of the variables regressed on their predecessors, and the elements in matrix D can be expressed as the corresponding regression error variances. If we further assume the variables have a multivariate normal distribution, then

    F(y_k | Y_(k)) ∼ N( σ_(k)′ Σ_(k)^{-1} Y_(k),  σ_kk − σ_(k)′ Σ_(k)^{-1} σ_(k) ).          (1.3)

Here we denote Y_(k) = (y_1, y_2, · · · , y_{k−1})′, Σ_(k) the (k − 1)-dimensional main submatrix of Σ, σ_(k) the vector containing the first k − 1 elements of the kth column of matrix Σ, and σ_kk the (k, k)th element of matrix Σ.
Denote ỹ_k = E(y_k | Y_(k)) = σ_(k)′ Σ_(k)^{-1} Y_(k), and let ε̃_k be the residual term y_k − ỹ_k. Obviously, ỹ_1 = 0 and ε̃_1 = y_1. Since E(y_k | Y_(k)) can be treated as the projection of y_k onto the σ-field σ(y_1, y_2, · · · , y_{k−1}), it is straightforward to conclude that ε̃_k is independent of ε̃_1, ε̃_2, · · · , ε̃_{k−1}.
Denote the kth row of matrix L by L_k′, in which the first k − 1 elements satisfy (φ_k1, φ_k2, · · · , φ_k(k−1)) = −σ_(k)′ Σ_(k)^{-1}, φ_kk = 1 and φ_kl = 0 for l > k. This implies L_k′ Y = ε̃_k (k = 1, · · · , p). Writing these p equations in matrix form, we have

    LY = ε̃,    where ε̃ = (ε̃_1, · · · , ε̃_p)′.                  (1.4)

Since the residuals ε̃_1, · · · , ε̃_p are independent, Cov(ε̃) is a diagonal matrix, and taking covariances on both sides of (1.4) gives LΣL′ = Cov(ε̃); by the uniqueness of the decomposition (1.2), L coincides with T and Cov(ε̃) = D.
If we relax the assumption of multivariate normality of Y, σ_(k)′ Σ_(k)^{-1} Y_(k) is still the best linear predictor of y_k in the least squares sense. Thus the elements in the kth row of T are the least squares regression coefficients of variable y_k regressed on variables y_1, · · · , y_{k−1}. This interpretation makes the autoregressive parameters meaningful, and it suggests that we may impose special structures on these parameters if we have prior information about the data.
1.2 Penalized Method

In traditional methods, parameters are estimated based on some meaningful loss function L(θ), mostly by minimizing this target loss function. Likelihood functions constitute a widely used family of loss functions; for instance, the popular negative log likelihood function of the multivariate normal distribution is tr(Σ^{-1}S) + log|Σ|, and minimizing it leads to the maximum likelihood estimator of the covariance matrix. In some other applications, the loss function can also be chosen as a norm, for example the l1 or l2 norm. The widely used linear regression is an application of the l2 norm loss: minimizing the squared l2 norm of Y − Xβ leads to a linear model for variable y based on variables x_1, x_2, · · · , x_p with the smallest squared fitting error. Here Y is the vector of observations of variable y and X is the design matrix for the explanatory variables x_1, x_2, · · · , x_p.
In these classical estimation approaches, the parameters or covariates are all included in the model. For example, the standard linear regression model always contains all the explanatory variables that have been observed. However, as we know, including many covariates leads to low bias but high prediction variance, that is, overfitting. This overfitting phenomenon can be explained in linear regression problems as follows: the coefficient of determination R² never decreases when one adds more and more explanatory variables to the model, yet the prediction variance can become very high. In order to reduce the prediction variance, one can sacrifice a little bias, making a tradeoff between the two.
A natural idea is to make some special assumptions about the data, for instance, that there are a lot of small coefficients or a lot of unimportant explanatory variables. Based on this kind of prior information, more adaptive models can be proposed to investigate the data.
Penalization methods were introduced as a simple and straightforward way to achieve this objective. The idea is similar to that of the AIC and BIC methods: a tradeoff is made between the goodness of fit and the prediction accuracy by adding a penalty function pλ(θ) to the loss function L(θ) and minimizing the new objective function

    L(θ) + pλ(θ)                                                (1.5)

instead of the original loss function L(θ). In this new method, the loss function L(θ) controls the fit of the model, while pλ(θ) can be used to constrain the complexity (the number of nonzero parameters included in the model, i.e. the sparsity) or the structure of the model.
Penalization approaches also have a close relationship with Bayesian methods (see Zhao et al. 2009). In particular, if the loss function in (1.5) is a negative log likelihood function and the penalty function pλ(θ) is a negative log prior density of the parameters θ, then the objective function (1.5) can be interpreted, up to an additive constant, as the negative log posterior density of θ conditional on the observations. For example, ridge regression, where we choose pλ(θ) = λ||θ||₂², is the same as the Bayesian approach in which a multivariate normal prior N(0, (1/(2λ)) I_p) is imposed on the parameters θ.
One great advantage of penalized approaches compared with Bayesian approaches is that a more flexible function pλ(θ) can be used to constrain the parameters, whereas a proper prior density function is needed in Bayesian approaches. A carefully chosen penalty function can make a tradeoff between the bias and the prediction accuracy, and it may also introduce desired properties or structures into the model. In order to introduce the penalized methods, we use the squared l2 norm loss function ||Y − Xθ||₂² as an example and denote the ordinary least squares estimator of the coefficients by θ̂_ols. For simplicity, we assume X′X = I_p.
Frank and Friedman (1993) introduced the bridge regression method, in which a penalty term pλ(θ) = λ||θ||_q^q is added to the loss function ||Y − Xθ||₂². Instead of using the ordinary least squares method, they minimize the following function:

    θ̂_bridge = argmin_θ [L(θ) + pλ(θ)] = argmin_θ [ ||Y − Xθ||₂² + λ||θ||_q^q ].

Here q is a positive constant and λ is the threshold parameter. When q > 1, bridge regression shrinks the parameters θ and reduces variability; when q ≤ 1, it provides sparse estimates of the parameter θ. In particular, bridge regression is called the LASSO when the constant q is set to 1 (see Tibshirani 1996), and is known as ridge regression when q is set to 2. Overall, bridge regression provides a more stable model compared to ordinary regression, at the cost of increased bias.
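As a small numerical illustration (our own sketch, not from the thesis), under the orthogonal design assumption the bridge problem separates into one-dimensional problems of the form min_θ (z − θ)² + λ|θ|^q, where z is the OLS coefficient; solving them on a grid shows that q ≤ 1 sets small coefficients exactly to zero, while q = 2 only shrinks them:

    import numpy as np

    # Solve min_theta (z - theta)^2 + lam * |theta|^q on a fine grid.
    def bridge_1d(z, lam, q, grid=np.linspace(-3, 3, 60001)):
        return grid[np.argmin((z - grid) ** 2 + lam * np.abs(grid) ** q)]

    for q in (0.5, 1.0, 2.0):
        print(q, [round(bridge_1d(z, lam=1.0, q=q), 3) for z in (0.3, 2.0)])
    # q = 0.5 and q = 1.0 send z = 0.3 to exactly 0 (sparsity);
    # q = 2.0 (ridge) only shrinks it to 0.15.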
In recent years, especially along with the development of large dimensional data sets, penalty approaches that lead to sparse estimators have gained more and more attention. This is mainly because models from traditional approaches are relatively complicated, since nearly all of the parameters are nonzero. For instance, the sample covariance matrix obtained from the Gaussian likelihood function has no zero elements, which suggests that all variables are marginally correlated; likewise, all the coefficients in ordinary least squares regression being nonzero suggests that all variables are important in predicting the target variable. This is confusing and hard to interpret.
Consequently, investigating simple models that include only the important variables has become a popular research area in statistics. Penalization methods that are capable of providing sparse estimators have been extensively investigated; they can efficiently reduce the complexity of the underlying model. Fan and Li (2001) listed three desired properties of an ideal penalty function:

1. Unbiasedness: the resulting estimator should be nearly unbiased, and the large coefficients should be only slightly shrunk in order to guarantee accuracy.

2. Sparsity: the solution must be sparse, so that it provides a more interpretable model.

3. Continuity: the solution should be continuous with respect to the data in order to avoid instability.
It has to be noted that most penalty functions, especially convex ones, cannot satisfy all three requirements: convex penalty functions always shrink the large coefficients as well as the small ones. Nonconvex penalty functions may meet all three requirements; however, nonconvex penalties often lead to computational difficulties, and it is hard to find the global minimizer.
A very natural shrinkage approach, called the hard thresholding method, was mentioned in Antoniadis (1997). The solution to the least squares linear regression problem with a hard threshold penalty term is

    θ̂_hard = θ̂_ols I(|θ̂_ols| > λ).                             (1.6)

The corresponding penalty function is

    pλ(θ) = λ² − (|θ| − λ)² I(|θ| < λ).                          (1.7)

The threshold parameter λ is a positive constant chosen by carefully balancing sparsity and bias. This method directly shrinks small coefficients to zero and keeps the large coefficients. However, the solution is not continuous with respect to the data, which makes the resulting model sensitive to the observations.
In Tibshirani (1996), the so-called LASSO was proposed. This method gives a simple and straightforward way to achieve sparse models in regression. Under the orthogonal design assumption, the LASSO solution has the explicit soft-thresholding form

    θ̂_lasso = sign(θ̂_ols)(|θ̂_ols| − λ)₊ .                       (1.9)

This formula for the orthogonal design case shows some insight into the LASSO method: it can shrink small coefficients exactly to zero and provide a sparse solution. The solution is continuous with respect to the data and also continuous with respect to the threshold parameter λ. The LASSO performs well when the coefficients are sparse, while ridge regression performs well when there are a lot of small coefficients. That is because ridge regression only shrinks the coefficients towards zero, while the LASSO is a thresholding approach that shrinks some coefficients to exactly zero.
Efron et al. (2004) proposed the LARS algorithm, a very important piece of work that covers both the LASSO and the forward stagewise selection method. The solution path of the LASSO can be obtained efficiently by a simple modification of LARS. Moreover, the LARS algorithm gives a geometrical explanation and provides researchers with a further understanding of the LASSO. Rosset et al. (2008) also proposed the solution path for l1 penalized approaches, but with more general loss functions: the loss functions were extended to the class of differentiable and piecewise quadratic functions with respect to the response variable y and the term x_i′θ. These works made important contributions to the LASSO, since one can efficiently calculate the whole solution path over different values of λ.

However, as a convex penalty function, the LASSO also has a problem. The LASSO solution for the orthogonal design case presented in (1.9) reminds us that the LASSO also shrinks the large coefficients. This effect leads to bias and affects the prediction accuracy.
In order to eliminate this disadvantage of the LASSO and to satisfy the three conditions mentioned in Fan and Li (2001), the authors of that paper proposed the SCAD penalty function, which is a nonconvex function. The SCAD solution is continuous with respect to the data and retains the large coefficients, and it also has an explicit form when the design matrix is orthogonal. The corresponding penalty function is relatively complex, but the first order derivative of the SCAD penalty function has the explicit form

    p′λ(θ) = λ { I(|θ| ≤ λ) + [(aλ − |θ|)₊ / ((a − 1)λ)] I(|θ| > λ) }.          (1.10)

This penalty function can provide sparse estimators of the coefficients by shrinking the small coefficients while leaving the large coefficients unchanged; it can be viewed as a combination of the LASSO and the hard thresholding method. Fan and Li (2001) showed that this penalty function satisfies the three requirements mentioned previously, and also showed that the resulting penalized method has the so-called oracle property, meaning it can perform as well as if the zero coefficients were already known. However, the SCAD penalty function is not convex, which may lead to computational problems.
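The derivative in (1.10) is simple to implement; the following sketch (our own code, with a = 3.7 as recommended by Fan and Li 2001) shows that the implied amount of shrinkage equals λ for small coefficients and decays to zero for large ones:

    import numpy as np

    # First order derivative of the SCAD penalty, equation (1.10).
    def scad_derivative(theta, lam, a=3.7):
        t = np.abs(theta)
        return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

    print(scad_derivative(np.array([0.1, 1.0, 5.0]), lam=0.5))
    # approximately [0.5, 0.315, 0.0]: full LASSO-like shrinkage for the small
    # coefficient, partial shrinkage in the middle, none for the large one.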
Zou (2006) proved that the LASSO does not possess the oracle property, and proposed the adaptive LASSO method as an alternative. Instead of penalizing each coefficient equally, the adaptive method penalizes each coefficient with a particular weight. The penalty function is

    pλ(θ) = λ Σ_j w_j |θ_j|,

where the weights w_j are chosen adaptively, for example w_j = 1/|θ̂_j|^γ for an initial consistent estimator θ̂ and some constant γ > 0, so that large coefficients receive small penalties and small coefficients receive large penalties.
1.3 Penalties with Group Effect

In Section 1.2, the penalty functions penalize the parameters individually. However, in some applications, one may be interested in penalty functions with a group effect, which penalize sets of parameters together. With the group effect of the penalty functions, one may achieve a desired structure of the variables, for instance, making the variables close to each other or shrinking them towards zero together.
Tibshirani et al. (2005) proposed the fused LASSO method. The fused LASSO not only penalizes the coefficients themselves, it also penalizes the successive differences of the coefficients. The fused LASSO penalty function is

    pλ1,λ2(θ) = λ1 Σ_{j=1}^{p} |θ_j| + λ2 Σ_{j=2}^{p} |θ_j − θ_{j−1}|,

which encourages sparsity of the coefficients as well as sparsity of their successive differences.
Bondell and Reich (2008) proposed another penalization method, called OSCAR. The penalty function was chosen as a combination of the l1 norm and a pairwise l∞ norm. The objective function can be presented as

    θ̂_oscar = argmin_θ [ ||Y − Xθ||₂² + λ1 Σ_j |θ_j| + λ2 Σ_{j<k} max(|θ_j|, |θ_k|) ],

so that the pairwise l∞ terms encourage groups of coefficients to take a common absolute value.
The so-called elastic net was proposed by Zou and Hastie (2005). The elastic net estimator β̂_en is defined as

    β̂_en = argmin_β [ ||Y − Xβ||₂² + λ1 ||β||₁ + λ2 ||β||₂² ],

which combines the l1 penalty of the LASSO with the squared l2 penalty of ridge regression, so that sparse solutions can be obtained while strongly correlated variables tend to be kept or dropped together.
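As a usage sketch (assuming scikit-learn is available; note that its ElasticNet is parameterized through alpha and l1_ratio rather than through (λ1, λ2) directly):

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(5)
    X = rng.standard_normal((100, 20))
    beta = np.zeros(20)
    beta[:3] = [2.0, -1.5, 1.0]                  # only three nonzero coefficients
    y = X @ beta + 0.1 * rng.standard_normal(100)
    fit = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
    print(np.round(fit.coef_, 2))                # a sparse coefficient estimate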
Some other penalty functions with group effects have been investigated in order to meet special requirements in multi-factor ANOVA problems. In such problems, factors can be combinations of measures and may have several levels. The main goal of multi-factor ANOVA is often to select the important factors and to identify the level of importance of the variables within each factor. Suppose there are J factors and the jth factor has coefficients θ_j, a p_j dimensional vector. The corresponding design matrix for the jth factor is X_j and the response is Y. In order to estimate the coefficients θ_j (j = 1, 2, · · · , J), one can fit a linear regression model and minimize the objective function

    || Y − Σ_{j=1}^{J} X_j θ_j ||₂² .
It is reasonable to assume that some factors are not important in the model, which means that some of the coefficient vectors θ_j must be 0; meanwhile, for the important factors, the variables in the same group may behave differently. With this in mind, Yuan and Lin (2004) introduced the group LASSO algorithm. They added the penalty term

    λ Σ_{j=1}^{J} ||θ_j||_{K_j}

to the objective function. Here ||θ_j||_{K_j} = (θ_j′ K_j θ_j)^{1/2}, where K_j is a kernel matrix, set to p_j I_{p_j} in their paper. One important feature of the group LASSO is that it can select important factors and set all coefficients in the unimportant factors to zero. A group LARS algorithm is also investigated in their paper. However, unlike the relationship between the LASSO and LARS, group LARS cannot recover the solution path of the group LASSO (the solution path of the group LASSO is not piecewise linear).
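A minimal sketch of the group-wise shrinkage behind the group LASSO is given below (our own illustration, with orthonormal group design matrices and any sqrt(p_j)-type weight absorbed into λ): each block of OLS coefficients is either set to zero as a whole or shrunk towards zero by a block soft-thresholding operator.

    import numpy as np

    # Block soft-thresholding: kills a whole group when its l2 norm is small,
    # otherwise shrinks the group towards zero.
    def block_soft_threshold(theta_j, lam):
        norm = np.linalg.norm(theta_j)
        if norm <= lam:
            return np.zeros_like(theta_j)
        return (1.0 - lam / norm) * theta_j

    groups = [np.array([0.2, -0.1]), np.array([2.0, -1.0, 0.5])]
    print([block_soft_threshold(g, lam=0.8) for g in groups])
    # the first (small) group is removed entirely; the second is only shrunk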
In the group LASSO, the coefficients within a group are either all estimated to be zero or none of them is zero. This is not always reasonable, especially when the variables have different levels within a group. Bondell and Reich (2009) used a weighted fusion penalty method to solve this multi-factor ANOVA problem while taking the levels of the variables within a group into account. The penalty term is built from weighted pairwise differences of the coefficients within each group, and they minimize the resulting penalized objective function. One important feature of this approach is that it can collapse levels within a group by setting the corresponding coefficients to be equal.
Zhao et al. (2008) introduced the so-called composite absolute penalty (CAP). In their method, the parameters are divided into several groups G1, G2, · · · , GK using prior knowledge. For each group, the parameters within the group are penalized with an l_{γk} norm; the resulting K dimensional vector of group norms is then penalized by an overall l_{γ0} norm raised to the power γ0. Their method can be presented as the minimization problem

    min_θ { ||Y − Xθ||₂² + λ Σ_{k=1}^{K} ||θ_{G_k}||_{γ_k}^{γ_0} }.

In their setting, the overall parameter γ0 was set to 1, and the inner parameters γk (k = 1, 2, · · · , K) were chosen according to the requirements of the model. The overall l_{γ0} norm penalizes some group norms to exactly 0, which performs group selection, while the inner l_{γk} norms construct the desired structures of the parameters within each group.
Zhou et al. (2010) proposed a hierarchical penalty function that uses a reparameterization technique to construct common zeros across different groups. In their approach, the parameters θ_kj in group k are reparameterized as d_k α_kj, that is, θ_kj = d_k α_kj. The parameters d_k and α_kj are both penalized by a LASSO type penalty, and the estimates are obtained by minimizing

    || Y − Σ_k Σ_j d_k α_kj x_kj ||₂² + λ1 Σ_k |d_k| + λ2 Σ_k Σ_j |α_kj|

over d and α. The linking parameter d_k can be shrunk to zero, which makes all the coefficients θ_k1, θ_k2, · · · , θ_kp_k in the kth group equal to zero together; this yields a group selection property. Meanwhile, even if the linking parameter d_k is not zero, an individual parameter α_kj may be shrunk to zero, which also makes θ_kj = 0, so that an individual zero is obtained within the kth group. The consistency and sparsity properties of the method are also given in their paper.
2.1 Direct Thresholding Approaches
The sample covariance matrix estimator S is asymptotically unbiased. Nevertheless, according to the research of Yin (1988) and Bai (1993), the eigenvalues of the sample covariance matrix S tend to be more dispersed than the population eigenvalues. This has led to shrinkage estimation methods that shrink the eigenvalues of the sample covariance matrix. Dey and Srinivasan (1985) proposed an orthogonally invariant minimax estimator under Stein's loss function. In their setting, the estimator is of the form Rφ(L)R′, where R is the matrix of eigenvectors of the sample covariance matrix and φ(L) is a diagonal matrix, each entry of which is a function of the eigenvalues of the sample covariance matrix. The eigenvectors of this estimator are the same as those of the sample covariance matrix, but the eigenvalues are shrunk.
Ledoit and Wolf (2003a, 2003b, 2004) developed a series of works focused on combining the sample covariance matrix with a well structured matrix. Let Σ denote the true covariance matrix and S the sample covariance matrix. The idea of their approach is to find an estimator Σ̂ = δF + (1 − δ)S that minimizes the risk function

    min_{δ, F}  E || δF + (1 − δ)S − Σ ||²_F .                  (2.1)

Here δ ranges from 0 to 1 and F is a matrix with a special structure. This method shrinks the sample covariance matrix S towards the structured matrix F and makes a tradeoff between estimation bias and prediction variance.
The first work was Ledoit and Wolf (2003a), where F was computed from a single index model for stock return data. In another work, Ledoit and Wolf (2003b), F was chosen as a matrix with equal off-diagonal elements.

The matrix F was chosen to be υI in Ledoit and Wolf (2004). Under this setting, the resulting estimator is known as the Ledoit-Wolf estimator. Because the minimizer of (2.1) depends on the underlying true covariance matrix Σ, the authors proposed asymptotic estimators of υ and δ based on the sample covariance matrix. This work is considered a benchmark due to its simplicity and convenience of calculation.
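For illustration, a ready-made implementation of this shrinkage-towards-identity estimator is available in scikit-learn (assumed here); the shrunk estimate stays well conditioned even when p is close to n:

    import numpy as np
    from sklearn.covariance import LedoitWolf

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 40))            # n = 50 observations, p = 40 variables
    S = np.cov(X, rowvar=False)                  # sample covariance matrix
    lw = LedoitWolf().fit(X)                     # shrinkage towards a scaled identity
    print(np.linalg.cond(S), np.linalg.cond(lw.covariance_), lw.shrinkage_)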
Besides shrinking the eigenvalues, more and more researchers nowadays focus on estimating sparse covariance matrices, in which the elements of the covariance matrix themselves are shrunk. This is because sparse covariance and precision matrices provide more interpretable structures of the variables: a zero element in the covariance matrix indicates that the corresponding variables are marginally independent, and a zero element in the precision matrix indicates that the corresponding two variables are independent conditional on all the remaining variables. Both kinds of independence relationships simplify the overall structure of the variables. The sparse precision matrix has received special interest because it corresponds uniquely to an undirected graph of the variables when the variables have a multivariate normal distribution.
Bickel and Levina (2008b) proposed a direct hard thresholding method for estimating the covariance matrix. The estimator Σ̂_λ is simply obtained as

    σ̂_kl = s_kl I(|s_kl| > λ)  for k ≠ l,    σ̂_kk = s_kk,

where s_kl denotes the klth element of the sample covariance matrix S. This method shrinks the small elements of the sample covariance matrix to zero and achieves a sparse estimator of the covariance matrix. The convergence rate under the operator norm was given for a large class of matrices. El Karoui (2008) independently proposed a similar direct thresholding approach, and consistency under the operator norm was also established.

This direct thresholding method was further investigated by Rothman et al. (2009), who extended the hard thresholding method to more general methods. Instead of choosing the klth element σ̂_kl as s_kl I(|s_kl| > λ), they chose

    σ̂_kl = pλ(s_kl)   (k ≠ l),

where the threshold function pλ can be any of a class of generalized thresholding operators satisfying several requirements. The convergence rate is also given in their paper. These direct thresholding methods are attractive since there is nearly no computational burden beyond the selection of the threshold parameter by cross validation.
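The following sketch (our own code) applies such a universal thresholding operator to the off-diagonal entries of a sample covariance matrix; positive definiteness of the result is not guaranteed, a point the thesis returns to below:

    import numpy as np

    # Universal thresholding of the sample covariance matrix: p_lambda is
    # applied entrywise off the diagonal, the diagonal is left untouched.
    def threshold_covariance(S, lam, operator="hard"):
        S = np.asarray(S, dtype=float)
        if operator == "hard":
            out = S * (np.abs(S) > lam)
        else:                                    # soft thresholding
            out = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
        np.fill_diagonal(out, np.diag(S))        # keep the diagonal entries
        return out                               # may fail to be positive definite

    rng = np.random.default_rng(1)
    X = rng.standard_normal((100, 20))
    S = np.cov(X, rowvar=False)
    print(np.mean(threshold_covariance(S, lam=0.2) == 0.0))   # fraction of zeros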
These two thresholding methods both employ universal threshold functions. An adaptive version of the direct thresholding methods was proposed by Cai and Liu (2011). They argued that the adaptive thresholding estimator Σ̂ of Σ, with klth element σ̂_kl = p_{λ_kl}(s_kl), would outperform the estimators from universal thresholding methods, because the sample covariances have a wide range of variability. Here p_{λ_kl} is a threshold function with an entry-specific parameter λ_kl that is closely related to the sample correlation coefficients. An optimal rate of s0(p)(log p / n)^{(1−q)/2} is achieved by the adaptive estimator.

These thresholding methods have sound convergence properties, which hold when log(p)/n = o(1). Nevertheless, it has to be noted that these methods cannot guarantee the positive definiteness of the estimators, which is a fundamental requirement for covariance matrices.
2.2 Penalized Approaches
Most of the shrinkage methods above are based on the covariance matrix. One explanation is that the sample covariance matrix is always available, so shrinking it is easy and straightforward. Shrinking the precision matrix, however, is not easy. First of all, the inverse of the sample covariance matrix may not exist at all, which occurs when p > n. Even if the dimension p is less than n, it has been shown that the inverse of the sample covariance matrix may not be a good estimator of the precision matrix, because it is ill-conditioned, meaning that the estimation error increases significantly when inverting the sample covariance matrix (see Ledoit and Wolf 2004).

Although directly shrinking the precision matrix may not be a good choice, alternative methods can also achieve the shrinkage objective, for example penalized methods. By carefully choosing the loss function and the penalty function, one can obtain sparse estimates of the covariance matrix and the precision matrix.
The first approach that employed a penalized method in estimating a sparse precision matrix was that of Meinshausen and Buhlmann (2006), where each variable is regressed on all the remaining variables using the LASSO. The regression coefficients can be penalized to zero by the l1 penalty term. The ijth and jith components of the precision matrix are estimated to be zero if the coefficient of variable i regressed on variable j, or the coefficient of variable j regressed on variable i, equals zero, or both of them are zero. It has to be noted that this method only focuses on finding the positions of the zero entries in the precision matrix, which reveals the underlying Gaussian graphical model of the variables, but it does not provide an estimator of the precision matrix itself.
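A rough sketch of this neighbourhood selection idea is given below (our own code, assuming scikit-learn); an off-diagonal entry is declared nonzero only when both of the corresponding lasso regressions give a nonzero coefficient, matching the rule described above:

    import numpy as np
    from sklearn.linear_model import Lasso

    def neighbourhood_selection(X, lam):
        n, p = X.shape
        support = np.zeros((p, p), dtype=bool)
        for j in range(p):
            others = np.delete(np.arange(p), j)
            # regress variable j on all the remaining variables with an l1 penalty
            coef = Lasso(alpha=lam).fit(X[:, others], X[:, j]).coef_
            support[j, others] = coef != 0.0
        return support & support.T               # zero if either coefficient is zero

    rng = np.random.default_rng(2)
    X = rng.standard_normal((200, 10))
    print(neighbourhood_selection(X, lam=0.1).sum())  # number of nonzero off-diagonal positions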
Most of the penalized approaches for estimating matrices are based on the normal likelihood function. In the work of d'Aspremont et al. (2008), a penalized method was suggested that imposes a penalty on the number of nonzero elements of the precision matrix, based on the negative log normal likelihood function, making a tradeoff between the complexity of the target matrix and the estimation bias. This method is similar in spirit to AIC.

Instead of penalizing the number of nonzero elements in the precision matrix, Friedman et al. (2008) and Rothman et al. (2008) both proposed penalized methods that directly penalize the off-diagonal elements of the precision matrix, by adding an l1 penalty to the elements of the precision matrix in the negative log normal likelihood loss function. A very fast computational algorithm called GLASSO was developed in Friedman et al. (2008), building on the work of Friedman et al. (2007). The convergence rate of the estimator under the Frobenius norm was first given in Rothman et al. (2008).
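For illustration, a GLASSO-type estimator is available in scikit-learn (assumed here); the l1 penalty on the precision matrix produces exact zeros, i.e. a sparse Gaussian graphical model:

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(3)
    X = rng.standard_normal((200, 15))
    model = GraphicalLasso(alpha=0.2).fit(X)     # l1-penalized normal likelihood
    Omega = model.precision_
    print(np.mean(np.abs(Omega) < 1e-8))         # proportion of (near) zero entries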
Lam and Fan (2009) extended the penalized methods by replacing the l1 penalty with more general penalties such as SCAD. Besides the estimator of the precision matrix, penalized estimators of the covariance matrix and the correlation matrix were also given in that paper, and explicit convergence rates of these estimators under the Frobenius norm were investigated.
Another interesting approach is that of Cai et al. (2011). In their approach, a sparse precision matrix is obtained by minimizing the elementwise l1 norm of the matrix Ω under the constraint

    ||SΩ − I||∞ < λ.

In their paper, the l1 norm of a matrix A (an n × p matrix) is defined as Σ_{i=1}^{n} Σ_{j=1}^{p} |a_ij|, and the l∞ norm is defined as max_{i,j} |a_ij|. The resulting estimator Ω̂ has elements

    ω̂_ij = ω̂_ji = ω̂¹_ij I(|ω̂¹_ij| ≤ |ω̂¹_ji|) + ω̂¹_ji I(|ω̂¹_ij| > |ω̂¹_ji|),

where ω̂¹_ij is the ijth element of the (possibly asymmetric) solution of the above minimization problem. This work is interesting since it provides a penalized method that does not rely on a likelihood function, and the method can be implemented by linear programming, which is relatively simple to compute.
2.3 Methods Based on Ordered Data
The thresholding methods and direct penalization methods mentioned above are all invariant with respect to the order of the variables. Nevertheless, in some applications, prior information about the order of the variables is available, which has driven researchers to investigate methods that use this prior information.

In some applications, it is reasonable to assume that variables that are far apart may not be correlated with each other, so the corresponding covariances are zero. Based on this assumption, a direct banding method was proposed by Bickel and Levina (2008a). In that paper, the klth element of the sample covariance matrix is shrunk to zero if and only if |k − l| > M_n, where M_n is an integer chosen by cross validation. The convergence rate was given for a large class of covariance matrices.
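A sketch of the banding operator (our own code) is shown below; it simply zeroes out every entry of the sample covariance matrix whose row and column indices are more than M_n apart, relying on the natural order of the variables:

    import numpy as np

    # Banding the sample covariance matrix: keep only entries within M_n of the diagonal.
    def band_covariance(S, M_n):
        p = S.shape[0]
        rows, cols = np.indices((p, p))
        return np.where(np.abs(rows - cols) <= M_n, S, 0.0)

    rng = np.random.default_rng(4)
    X = rng.standard_normal((100, 8))
    S = np.cov(X, rowvar=False)
    print(band_covariance(S, M_n=2))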
In some cases, the variables have a natural order, which means that one can fit them with an autoregressive model. This property suggests that the modified Cholesky decomposition can be used in estimating the covariance matrix. The modified Cholesky decomposition of a given matrix Σ can be written as Σ = L^{-1}DL^{-1}′, and the elements of the lower triangular matrix L can be interpreted as the regression coefficients obtained when each variable is regressed on its predecessors. Wu and