GRAPHICAL MODEL SELECTION
LIN NAN
(B.Sc National University of Singapore)
A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY
NATIONAL UNIVERSITY OF SINGAPORE
2010
First of all, I would like to show my deepest gratitude to my supervisor A/P Leng Chenlei and co-supervisor Dr Chen Ying, who conscientiously led me into the field of statistical research. This thesis would not have been possible without their patient guidance and continuous support. I really appreciate their efforts in helping me overcome all the problems I encountered in the past four years. It is my honor and luck to have them, two brilliant young professors, as my PhD supervisors.
Special acknowledgement also goes to all the professors and staff in the Department of Statistics and Applied Probability. I have been in this warm family for almost eight years. With their help I have built up the statistical skills that will benefit me for my whole life. I cannot find the exact words to express my gratitude to the department, but I will definitely find a way to reciprocate in the future.
I further express my appreciation to my dear friends Ms Liu Yan, Ms Jiang Qian, Mr Lu Jun, Mr Liang Xuehua, Mr Jiang Binyan and Dr Zhang Rongli, for giving me help, support and encouragement during my PhD study. Thanks to their company, my PhD life has become more colorful and enjoyable.
Finally, I am forever indebted to my family. My dear parents gave me the courage to pursue the PhD at the beginning, and have always been my constant source of support, giving me endless love and understanding. My husband, Meng Chuan, is my joy, my pillar and my guiding light. This thesis is also in memory of my dear grandmothers.
1.1 Background 1
1.2 Literature review 2
1.2.1 Review of penalized approaches 2
1.2.2 Review of graphical model 13
1.2.3 Organization of the thesis 23
2 Methodology 25
2.1 Main result 25
2.2 Theory 34
2.2.1 Proof of lemmas 36
2.2.2 Proof of theorems 40
3 Simulation 49
3.1 Simulation settings 49
3.2 Performance evaluation 51
3.3 Simulation results 53
3.3.1 Simulation results for different models 53
3.3.2 Simulation results for models with different dimensions 60
4 Real data analysis 70
4.1 Introduction 70
4.2 Call center data 71
4.3 Financial stocks vs education stocks 72
5 Conclusion and Further Research 78
5.1 Conclusion and discussion 78
5.2 Future research 80
There has been rising interest in high-dimensional data from many important fields recently. One of the major challenges in modern statistics is to investigate the complex relationships and dependencies existing in data, in order to build parsimonious models for inference. Covariance or correlation matrix estimation, which addresses the relationships among random variables, attracts a lot of attention due to its ubiquity in data analysis. Suppose we have a d-dimensional vector following a multivariate normal distribution with mean zero and a certain covariance matrix that we are interested in estimating. Of particular interest is to identify zero entries in this covariance matrix, since a zero entry corresponds to marginal independence between two variables. This is referred to as covariance graphical model selection, which arises when the interest is to model pairwise correlation. Identifying pairwise independence in this model helps to elucidate the relations between the variables.
We propose a penalized likelihood approach for covariance graphical model selection and a BIC-type criterion for the selection of the tuning parameter. An attractive feature of a likelihood based approach is its improved efficiency compared with banding or thresholding. Another attractive feature of the proposed method is that the positive definiteness of the covariance matrix is explicitly ensured. We show that the penalized likelihood estimator converges to the true covariance matrix under the Frobenius norm with an explicit rate. In addition, we show that the zero entries in the true covariance matrix are estimated as zero with probability tending to 1. We also compare the penalized approach with other methods for the covariance graphical model, such as the sample covariance matrix, the SIN approach proposed by Drton and Perlman (2004), the method developed by Bickel and Levina (2008b) and the shrinkage estimator of Ledoit and Wolf (2003), in terms of both simulations and real data analyses. The results show that the penalized method not only provides sparse estimates of the covariance matrix, but also has competitive estimation accuracy.
List of Tables

3.1 Simulations: Model 1 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 54
3.2 Simulations: Model 2 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 56
3.3 Simulations: Model 3 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 58
3.4 Simulations: Model 4 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 60
3.5 Simulations: Model 5 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 62
3.6 Simulations: Model 6 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 64
3.7 Simulations: Model 3 with d=30 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 66
3.8 Simulations: Model 3 with d=100 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 66
3.9 Simulations: Model 3 with d=100 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 66
4.1 Average (SE) KL, QL, OL, FL, FP and FN for Call Center Data with d=84, n=164. 4-fold CV on the training data minimizing the BIC 74
4.2 Average (SE) KL, QL, OL, FL, FP and FN for Financial stock returns vs Education stock returns with d=10, n=49. 4-fold CV on the training data minimizing the BIC 75
List of Figures

3.1 Simulations: Model 1 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 55
3.2 Simulations: Model 2 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 57
3.3 Simulations: Model 3 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 59
3.4 Simulations: Model 4 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 61
3.5 Simulations: Model 5 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 63
3.6 Simulations: Model 6 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 65
3.7 Simulations: Model 3 with d=30 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 67
3.8 Simulations: Model 3 with d=100 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 68
3.9 Simulations: Model 3 with d=100 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications 69
4.1 Call center data 76
4.2 Financial stock vs Education stock 77
1.1 Background

There has been rising interest in high-dimensional data from many important fields recently. One of the major challenges in modern statistics is to investigate the complex relationships and dependencies existing in data, in order to build a correct model for inference. Covariance or correlation matrix estimation, which addresses these relationships, attracts a lot of attention due to its ubiquity in data analysis. Principal component analysis (PCA), linear and quadratic discriminant analysis (LDA and QDA), and the analysis of independence relations in the context of graphical models all need to estimate the covariance matrix. However, the number of parameters in the covariance matrix grows quickly with the dimensionality, so high-dimensional data lead to a heavy computational burden. As a result, a sparsity assumption on the covariance matrix (i.e., some entries of the covariance matrix are exactly zero) is frequently imposed to achieve a balance between bias and variance. In this thesis, we propose a penalized likelihood approach to estimate the covariance matrix, in order to achieve parsimony in covariance graphical model selection.

1.2 Literature review
1.2.1 Review of penalized approaches
Consider the linear regression model y = Xβ + ε, where y is an n × 1 vector, X is an n × d matrix and ε is an n × 1 vector. Without loss of generality, we assume that the data are centered, the columns of X are orthonormal and the y_i's are conditionally independent given the design matrix X. Throughout this thesis, we assume the ε_i's are independently and identically distributed with mean zero and finite variance σ². A model fitting procedure produces the vector of coefficients β̂ = (β̂_0, …, β̂_d).
We obtain the ordinary least squares (OLS) estimates by minimizing the residual squared error.
Best subset selection is one of the standard techniques for improving OLS: we select or delete one independent variable through hypothesis testing at some level α in each step. Most traditional variable selection methods follow such stepwise subset selection procedures, guided by criteria such as Akaike's information criterion AIC [Akaike (1973)] and the Bayesian information criterion BIC [Schwarz (1978)]. Nevertheless, this common stepwise procedure has long been recognized as extremely variable, since small changes in the data may result in very different models. To remedy this problem, Drton and Perlman (2004) proposed the SIN approach, which produces conservative simultaneous 1 − α confidence intervals and uses these confidence intervals to do model selection in a single step. Best subset selection and the SIN approach improve on OLS estimates by providing interpretable models.

Recently many statisticians have proposed various penalization methods, which usually shrink estimates to make trade-offs between bias and variance, to overcome the limitations of OLS estimates and best subset selection. The penalized estimates are obtained by minimizing the residual squared error plus a penalty function, i.e.,
β̂_penalized = arg min_β { ‖y − Xβ‖² + Σ_{j=1}^d p_λ(|β_j|) },

where the non-negative constant λ is a tuning parameter and p_λ represents a penalty function.
Antoniadis (1997) and Fan (1997) both mentioned the hard thresholding estimator

β̂_HardThre = β̂_ols I(|β̂_ols| > λ),

which is derived by taking the hard thresholding penalty function

p_λ(|β|) = λ² − (|β| − λ)² I(|β| < λ).
Frank and Friedman (1993) introduced bridge regression with the L_q penalty function λ|β|^q, where q is a positive constant. When q > 1, the resulting penalized estimates shrink the solutions to reduce variability but do not enjoy sparsity. On the other hand, when q ≤ 1, the L_q penalty functions lead to sparse solutions but have relatively large biases.
One special case of bridge regression is ridge regression, with the L2-penalty p_λ(|β|) = λ|β|². Under the orthonormal design, the ridge solution is β̂_ridge = β̂_ols/(1 + γ), where γ is a positive number. Ridge regression is a continuous process that shrinks coefficients, so it achieves better prediction performance through a bias-variance trade-off. However, it does not set any coefficients to 0 and hence does not give an easily interpretable model.
The Lasso, proposed by Tibshirani (1996), is the penalized least squares method imposing an L1-penalty

p_λ(|β|) = λ|β|

on the regression coefficients. Under the orthonormal design, the L1-penalty leads to the solution

β̂_Lasso = sgn(β̂_ols)(|β̂_ols| − γ)₊.

Because of the nature of the L1-penalty, the Lasso does continuous shrinkage and automatic variable selection simultaneously. According to simulation results, for a small number of moderate-sized effects the Lasso does better than ridge regression, while for a large number of small effects ridge regression performs better than the Lasso; thus neither of them uniformly dominates the other. However, as variable selection becomes increasingly important in modern data analysis, the Lasso is much more appealing owing to its sparse representation. Given an orthogonal design, the entire Lasso solution path can be computed by the LARS algorithm, proposed by Efron et al. (2004).
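Under the orthonormal design, the Lasso solution above is the familiar soft-thresholding rule; a minimal sketch (the function and argument names are illustrative, not from the thesis):

```python
def soft_threshold(b_ols, gamma):
    """Lasso solution for one coefficient under an orthonormal design:
    sgn(b_ols) * (|b_ols| - gamma)_+ -- continuous shrinkage plus
    exact zeros for small coefficients."""
    sign = 1.0 if b_ols >= 0 else -1.0
    return sign * max(abs(b_ols) - gamma, 0.0)
```

Large coefficients are shrunk toward zero by gamma, while coefficients smaller than gamma in magnitude are set exactly to zero, which is the source of the Lasso's sparsity.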
Although the Lasso enjoys great computational advantages and excellent performance, it has several limitations:

1. The Lasso lacks the oracle property defined in Fan and Li (2001).

2. If there is a group of variables among which the pairwise correlations are very high, the Lasso tends to select only one variable from the group and does not care which one is selected. In short, the Lasso cannot handle collinearity.

3. The Lasso can only select individual input variables, so it is not suitable for general factor selection.
In some situations, such as the multifactor analysis-of-variance (ANOVA) problem, variable selection concentrates on selecting a group of important factors rather than individual variables. As we have stated, the Lasso is only designed for selecting individual input factors and thus does not fit this kind of scenario. Yuan and Lin (2006) proposed the group Lasso to improve over the Lasso in terms of group variable selection. For a vector η ∈ R^d, d ≥ 1, and a symmetric d × d positive definite matrix K, they denoted ‖η‖_K = (ηᵀKη)^{1/2}; the group Lasso minimizes the residual squared error plus the penalty λ Σ_j ‖β_j‖_{K_j}, where λ ≥ 0 is a tuning parameter and the K_j are positive definite matrices with many possible choices, a natural one being K_j = p_j I_{p_j} with p_j the size of the jth group. The entire solution for the group Lasso can be obtained iteratively.
Fan and Li (2001) stated that a good penalty function should result in an estimator with the following three properties:

1. Unbiasedness: the resulting estimator does not over-penalize large parameters, to avoid unnecessary modeling bias.

2. Sparsity: the resulting estimator automatically sets insignificant parameters to 0.

3. Continuity: the resulting estimator is continuous in the data, in order to avoid instability in model prediction.

It has been shown that the L_q and hard thresholding penalty functions do not simultaneously satisfy all three properties. Fan and Li (2001) proposed the smoothly clipped absolute deviation (SCAD) penalty, whose resulting estimator under the orthonormal design is

β̂_SCAD = sgn(β̂_ols)(|β̂_ols| − λ)₊, when |β̂_ols| ≤ 2λ;
β̂_SCAD = {(a − 1)β̂_ols − sgn(β̂_ols)aλ}/(a − 2), when 2λ < |β̂_ols| ≤ aλ;
β̂_SCAD = β̂_ols, when |β̂_ols| > aλ.
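The SCAD thresholding rule can be sketched as follows (a minimal illustration; the default a = 3.7 is the value suggested by Fan and Li (2001)):

```python
def scad_threshold(b_ols, lam, a=3.7):
    """SCAD thresholding rule for one OLS coefficient (orthonormal design).
    Requires a > 2; a = 3.7 is the value suggested by Fan and Li (2001)."""
    z = abs(b_ols)
    s = 1.0 if b_ols >= 0 else -1.0
    if z <= 2 * lam:                  # soft thresholding near zero (sparsity)
        return s * max(z - lam, 0.0)
    if z <= a * lam:                  # linear interpolation region (continuity)
        return ((a - 1) * b_ols - s * a * lam) / (a - 2)
    return b_ols                      # no shrinkage for large coefficients (unbiasedness)
```

The three branches correspond exactly to the three desired properties: small coefficients are set to zero, the transition is continuous, and large coefficients are left unpenalized.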
The two parameters (λ, a) can be chosen by criteria such as BIC, cross-validation and generalized cross-validation. Fan and Li (2001) established an "oracle property" for the finite-parameter case: the procedure selects exactly the variables with nonzero coefficients and estimates the remaining coefficients as zero.
(Oracle Property) Let V_1, …, V_n be independent and identically distributed, each with a density f(V, β) satisfying conditions (A)-(C):

(A) The observations V_i are independent and identically distributed with probability density f(V, β) with respect to some measure μ; f(V, β) has a common support and the model is identifiable. Furthermore, the first and second logarithmic derivatives of f satisfy E_β[∂ log f(V, β)/∂β_j] = 0 and I_{jk}(β) = E_β[(∂ log f/∂β_j)(∂ log f/∂β_k)] = E_β[−∂² log f/∂β_j∂β_k].

(B) The Fisher information matrix I(β) is finite and positive definite at β = β_0.

(C) There exists an open subset ω of Ω that contains the true parameter point β_0 such that for almost all V the density f(V, β) admits all third derivatives ∂³ log f(V, β)/∂β_j∂β_k∂β_l for all β ∈ ω, and these are bounded by functions with finite expectation.

Write β_0 = (β_10ᵀ, β_20ᵀ)ᵀ with β_20 = 0. If λ_n → 0 and √n λ_n → ∞ as n → ∞, then with probability tending to 1, the root-n consistent local maximizer β̂ = (β̂_1ᵀ, β̂_2ᵀ)ᵀ must satisfy:

1. Sparsity: β̂_2 = 0.

2. Asymptotic normality:

√n (I_1(β_10) + Σ){β̂_1 − β_10 + (I_1(β_10) + Σ)⁻¹ b} → N{0, I_1(β_10)}

in distribution, where I_1(β_10) = I_1(β_10, 0) is the Fisher information knowing β_2 = 0, and Σ and b are determined by the first and second derivatives of the penalty at β_10.
SCAD, which enjoys the oracle property, improves upon penalties such as the L1 penalty and the hard thresholding penalty.

Fan and Li (2001) established oracle properties for non-concave penalties, such as SCAD, Lasso and bridge regression, only for the finite-parameter case. Fan and Peng (2004) generalized the results to a diverging number of parameters: they stated a general framework for non-concave penalties, with general conditions under which the oracle property holds, and proved the corresponding results.

Zou and Hastie (2005) introduced a regularization technique called the elastic net. They first obtained the naive elastic net estimator by minimizing ‖y − Xβ‖² + λ_2‖β‖² + λ_1‖β‖_1.
The naive elastic net is a two-stage procedure: for each fixed λ_2, it first finds the ridge regression coefficients and then performs the Lasso. As a result, a double amount of shrinkage occurs, which introduces unnecessary extra bias compared with the pure Lasso or ridge regression. Zou and Hastie therefore rescaled the naive elastic net coefficients by the constant (1 + λ_2) to compensate for the extra shrinkage. Under the orthonormal design, the elastic net solution is

β̂_enet = sgn(β̂_ols)(|β̂_ols| − λ_1/2)₊.

Similar to the Lasso, the elastic net does automatic variable selection and continuous shrinkage simultaneously. In addition, the elastic net can potentially select all d predictors and can select groups of correlated variables, which overcomes two limitations of the Lasso.
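The two-stage construction and the (1 + λ_2) rescaling can be sketched as follows under the orthonormal design (function and argument names are illustrative):

```python
def elastic_net_orthonormal(b_ols, lam1, lam2):
    """Elastic net for one coefficient under an orthonormal design:
    the naive estimator soft-thresholds at lam1/2 and then shrinks by
    1/(1 + lam2); rescaling by (1 + lam2) undoes the double shrinkage
    (Zou and Hastie, 2005)."""
    s = 1.0 if b_ols >= 0 else -1.0
    naive = s * max(abs(b_ols) - lam1 / 2.0, 0.0) / (1.0 + lam2)
    return (1.0 + lam2) * naive
```

Note that after rescaling, the orthonormal-design solution coincides with soft thresholding at lam1/2; the λ_2 term matters in the general (non-orthonormal, correlated) case.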
Usually, an estimate β̂ is considered desirable if it is consistent in terms of both coefficient estimation and variable selection. We call a solution path "path consistent" if it contains at least one such desirable estimate. Although the Lasso and the elastic net perform superiorly in prediction, they are not consistent in variable selection (Leng, Lin and Wahba, 2006; Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Zou, 2006).
Zou (2006) suggested a new version of the Lasso for simultaneous estimation and variable selection, called the adaptive Lasso, which minimizes the residual squared error plus the weighted penalty λ Σ_j ŵ_j|β_j| with weights ŵ_j = 1/|β̂_ini,j|^γ, where γ is a positive constant and β̂_ini is an initial root-n consistent estimate of β. It has been shown that the adaptive Lasso has the oracle property when the adaptively weighted ℓ1 penalty is utilized, and that the adaptive Lasso shrinkage results in a near-minimax-optimal estimator.
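Under the orthonormal design, the adaptive Lasso reduces to soft thresholding with a coefficient-specific threshold; a minimal sketch, assuming the OLS estimate itself serves as the initial root-n consistent estimate:

```python
def adaptive_lasso_orthonormal(b_ols, lam, gamma=1.0):
    """Adaptive Lasso for one coefficient under an orthonormal design:
    soft thresholding with the data-driven threshold lam / |b_ini|^gamma,
    using the OLS estimate as the initial estimate (an assumption here)."""
    if b_ols == 0.0:
        return 0.0
    w = 1.0 / abs(b_ols) ** gamma     # adaptive weight: small for large coefficients
    s = 1.0 if b_ols >= 0 else -1.0
    return s * max(abs(b_ols) - lam * w, 0.0)
```

Large coefficients receive small weights and hence little shrinkage (approaching unbiasedness), while small coefficients receive large weights and are driven exactly to zero, which is how the adaptive weighting recovers the oracle property.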
The non-negative garrotte estimator was studied by Yuan and Lin (2007). It is obtained by rescaling an initial estimate with non-negative, sparsity-inducing scaling factors. The non-negative garrotte consistently identifies the set of important variables and is consistent for the coefficients of the important variables, whereas such a property may not hold for the initial estimator. In general, it has been shown that the non-negative garrotte can turn a non-consistent estimate into an estimate that is consistent in terms of both variable selection and coefficient estimation.

As pointed out in Zou (2009), the adaptive Lasso improves the Lasso by achieving the oracle property but cannot handle collinearity, while the elastic net can deal with collinearity but lacks the oracle property. The two penalties thus improve the Lasso in two different directions. Zou (2009) combined the strengths of the adaptive Lasso and the elastic net and proposed an estimator that improves the Lasso in both directions, called the adaptive elastic net.
1.2.2 Review of graphical model

A graphical model is a modeling technique which uses graphs to represent dependencies between stochastic variables (Lauritzen, 1996).
The most common graphical models are undirected graphs, called concentration graphical models. A concentration graphical model for the random vector X = (X_1, …, X_d)ᵀ ∈ R^d, with unknown mean μ and nonsingular covariance matrix Σ, is represented by an undirected graph G = (V, E), where V = {1, …, d} is the set of all variables and E represents the conditional independence relationships among X_1, …, X_d. The absence of an undirected edge between two vertices encodes conditional independence between the associated variables given all the other variables. As is well known, zero entries in the concentration matrix Σ⁻¹ indicate exactly these conditional independences between the two associated random variables given all other variables. Thus parameter estimation in the concentration graphical model is equivalent to identifying the zero entries in the concentration matrix.
An example of a concentration graphical model is shown in Figure 1. Suppose X = (X_1, …, X_4)ᵀ has the concentration matrix encoded by the graph; then X exhibits the corresponding conditional independence structure.

Figure 1: An example of a concentration graphical model.
A lot of research has been done on model selection in concentration graphical models. Whittaker (1990), Lauritzen (1996) and Edwards (2000) presented commonly used estimation methods and statistical properties of concentration graphical models. Wong et al. (2003) and Dobra et al. (2004) used Bayesian approaches to estimate the concentration matrix. Drton and Perlman (2004) proposed the SIN method, which produces simultaneous confidence intervals to do model selection in a single step. Schäfer and Strimmer (2005) performed the estimation by regularization with bootstrap variance reduction, and selected the network based on the estimated concentration matrix using the false discovery rate (FDR). Meinshausen and Bühlmann (2006) performed neighborhood selection for all variables to estimate the structure of a concentration graphical model, and showed their method is consistent in high-dimensional settings. Huang et al. (2006) used either an L1 (Lasso) or an L2 (ridge) penalty on the off-diagonal elements of the Cholesky factor in order to create zeros in arbitrary locations in the concentration matrix. Li and Gui (2006) introduced a threshold gradient descent (TGD) regularization procedure to obtain the estimator. Yuan and Lin (2007) and d'Aspremont et al. (2008) used a penalized likelihood method with a Lasso penalty to estimate the concentration matrix, resulting in a sparse estimate. Friedman et al. (2008) developed a fast algorithm, called the graphical Lasso, to estimate the sparse concentration matrix. Rothman et al. (2007) proposed SPICE, a permutation-invariant estimator for the precision matrix based on penalized likelihood with a Lasso-type penalty, and established remarkable results on the rate of convergence under the Frobenius norm. Lam and Fan (2009) generalized Rothman's work to other penalties and proved sparsistency for all the estimators presented in their paper.
There has also been considerable interest in bidirected covariance graphical models, where the absence of a bidirected edge between two variables indicates a marginal independence. Covariance matrix estimation is a common statistical problem that arises in many scientific applications, such as financial risk assessment and longitudinal studies. Let x_i = (x_i1, …, x_id)ᵀ ∈ R^d, i = 1, …, n, be d-dimensional vectors following a multivariate normal distribution N_d(0, Σ). We are interested in estimating the covariance matrix Σ = (σ_ij)_{d×d}. Of particular interest is the problem of identifying zero entries in Σ, since σ_ij = 0 corresponds to marginal independence of X_i and X_j. This is referred to as covariance graphical model selection (Cox and Wermuth, 1993, 1996). For example, with the covariance matrix encoded by the chain graph in Figure 2, X exhibits the marginal independence structure

1 ↔ 2 ↔ 3 ↔ 4.

Figure 2: An example of covariance graphical model.
Actually, statistical inference for the covariance graphical model selection problem is not well developed. For model selection, in principle, one can employ backward elimination or forward selection. However, it is now well understood that such a process may suffer from a relative lack of accuracy and from instability. Moreover, an exhaustive procedure such as best subset selection suffers from computational complexity.

In recent years, iterative methods have been used to carry out maximum likelihood estimation in covariance graphical models. For example, Anderson (1969, 1970, 1973) proposed an algorithm for solving covariance graphical models based on his maximum likelihood equations. The graphical modeling software MIM developed in Edwards (2000) fits covariance graphical models by a "dual likelihood method" from Kauermann (1996). Wermuth et al. (2006) also derived asymptotically efficient approximations to the maximum likelihood estimate in such models. Chaudhuri et al. (2007) addressed the problem of estimating the covariance matrix when some of the entries are zero, and presented an iterative conditional fitting algorithm, with guaranteed convergence properties, to compute the maximum likelihood estimate in covariance graphical models. All these approaches are only applicable when the dimension d and the number of observations n are both not large.
When the dimension is high, it has been pointed out many times, from the Marcenko-Pastur law (1967) to Johnstone (2001), that the sample covariance matrix is not a good estimator of the population covariance matrix. Thus alternative estimators have been developed for high-dimensional cases. Most of these estimators rely on a sparsity assumption to simplify the problem. Generally speaking, there are two broad classes of covariance matrix estimators: those that assume the variables are naturally ordered, with variables far apart in the ordering only weakly correlated (e.g., longitudinal data, time series, spatial data or spectroscopy), and those invariant to variable permutations (e.g., genetics and the social sciences).
The first class includes banding or tapering the sample covariance matrix. Bickel and Levina (2008a) proposed a banding technique: either band the sample covariance matrix or estimate a banded version of the population covariance matrix,

B_k(S) = (s_ij 1(|i − j| ≤ k))_{d×d}.

By requiring log d/n → 0, they showed that when the population covariance matrix ranges over certain fairly natural families, their estimator is consistent in the operator norm.
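The banding operator B_k is straightforward to sketch (illustrative code, not from the thesis):

```python
import numpy as np

def band(S, k):
    """Banding operator B_k(S): keep entries within k of the diagonal,
    zero out the rest.  The result is not guaranteed positive definite."""
    d = S.shape[0]
    i, j = np.indices((d, d))
    return np.where(np.abs(i - j) <= k, S, 0.0)
```

Banding costs only O(d²) operations, which is one reason this class of estimators is attractive when the variables carry a natural ordering.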
Cai et al. (2010) proposed a tapering procedure to estimate the covariance matrix: for a given even integer k with 1 ≤ k ≤ d, each entry s_ij of the sample covariance matrix is multiplied by a weight that equals 1 for |i − j| ≤ k/2, decays linearly to 0 for k/2 < |i − j| < k, and equals 0 otherwise. By choosing a proper tapering parameter, the optimal rate of convergence can be achieved by the proposed tapering estimator, although the estimator is not necessarily positive semidefinite.
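A sketch of the tapering weights, assuming the linear-decay weight scheme described above (names are illustrative):

```python
import numpy as np

def taper(S, k):
    """Tapering estimator: entrywise weights equal 1 within k/2 of the
    diagonal, decay linearly to 0 at lag k (k an even integer)."""
    d = S.shape[0]
    i, j = np.indices((d, d))
    lag = np.abs(i - j)
    kh = k // 2
    w = np.clip((k - lag) / float(kh), 0.0, 1.0)  # 1, then linear decay, then 0
    return w * S
```

Compared with banding, tapering replaces the hard cutoff at lag k with a gradual decay, which is what allows the sharper rate of convergence.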
Pourahmadi (1999) suggested using the modified Cholesky factorization to estimate the concentration matrix. Building on Pourahmadi's method, Rothman et al. (2010) proposed a banded covariance matrix estimator obtained by banding the Cholesky factor of the covariance. Regress each variable X_j on X_{j−1}, …, X_1 for 2 ≤ j ≤ d, and let ε_j denote the residual. Let T = (t_jq)_{d×d} be the lower-triangular matrix containing the negated regression coefficients, with ones on the diagonal, and let L = T⁻¹. Since ε = X − X̂ = TX, we have X = Lε, and hence Σ = L D Lᵀ, where D = cov(ε) is diagonal. Applying this decomposition to the data matrix X = (x_1, …, x_d)_{n×d}, define e_1 = x_1 and, for 2 ≤ j ≤ d, compute the regression coefficients and the residual in turn. After the last projection, the estimates L̂ and D̂ are obtained, and the resulting estimator of the covariance matrix is

Σ̂_Cholesky = L̂ D̂ L̂ᵀ.
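The regression construction of L̂ and D̂ can be sketched as follows; without banding the factor, the product L̂ D̂ L̂ᵀ reproduces the sample covariance matrix exactly (a sketch with illustrative names):

```python
import numpy as np

def cholesky_covariance(X):
    """Regression-based modified Cholesky construction: regress each column
    of the centered data matrix X on all preceding columns, store the
    negated regression coefficients in a unit lower-triangular T and the
    residual variances in a diagonal D, and return L D L' with L = T^{-1}."""
    n, d = X.shape
    T = np.eye(d)
    resid_var = np.empty(d)
    resid_var[0] = np.mean(X[:, 0] ** 2)           # e_1 = x_1
    for j in range(1, d):
        coef, *_ = np.linalg.lstsq(X[:, :j], X[:, j], rcond=None)
        resid = X[:, j] - X[:, :j] @ coef          # residual of the jth regression
        T[j, :j] = -coef                           # row j of T gives e_j = (T x)_j
        resid_var[j] = np.mean(resid ** 2)
    L = np.linalg.inv(T)
    return L @ np.diag(resid_var) @ L.T
```

Rothman et al. obtain their banded estimator by zeroing the coefficients in T more than k positions before j; the unbanded version above simply recovers S = n⁻¹XᵀX, which is the modified Cholesky (LDL) identity.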
A positive definite estimator can be guaranteed by regularizing the Cholesky factor via the regression interpretation provided in the paper. Similar to other banding estimators, its low computational cost is very attractive. However, Rothman et al. did not provide a convergence rate for this estimator, due to technical difficulties.
Wu and Pourahmadi (2009) established a banded estimator for the covariance matrix by banding the sample autocovariance matrix, which is attractive in time series analysis. Let X_1, …, X_n be a realization of a mean-zero stationary process X_t; its autocovariance γ_k = cov(X_0, X_k) can be estimated by

γ̂_k = n⁻¹ Σ_{t=1}^{n−k} X_t X_{t+k}.

However, the positive definite estimator Σ̂_n = (γ̂_{i−j})_{1≤i,j≤n} is not a good estimate of Σ_n = (γ_{i−j})_{1≤i,j≤n}, since Σ̂_n − Σ_n does not converge to zero under the operator norm. Wu and Pourahmadi proposed the estimator obtained by truncating Σ̂_n:

Σ̂_{n,l} = (γ̂_{i−j} 1(|i − j| ≤ l))_{1≤i,j≤n},

where l ≥ 0 is an integer. They showed that this estimator, although not necessarily positive definite, converges to the true covariance matrix at an explicit rate γ_n under the operator norm.
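A sketch of this banded autocovariance estimator (illustrative names; assumes a mean-zero series):

```python
import numpy as np

def banded_autocovariance(x, l):
    """Banded estimator for a mean-zero stationary series: estimate
    gamma_k = (1/n) * sum_t x_t x_{t+k}, then keep only lags |i - j| <= l
    in the Toeplitz matrix (gamma_{i-j})."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    gamma = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(n)])
    i, j = np.indices((n, n))
    lag = np.abs(i - j)
    return np.where(lag <= l, gamma[lag], 0.0)
```

The Toeplitz structure means only n distinct values need to be estimated, and the truncation at lag l discards the unreliable long-lag autocovariance estimates.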
There are many situations requiring covariance matrix estimators that are invariant under variable permutations, such as gene expression arrays, where no natural ordering exists among the variables. Thresholding small elements to zero is a popular method for estimating such covariance matrices. In spite of the potential loss of positive definiteness, these approaches are usually quite simple and carry little computational burden.
El Karoui (2008) proposed componentwise hard thresholding of the entries of the sample covariance matrix for "large n, large d" problems. He defined his own notion of sparsity, called β-sparsity, which refines the natural notion of sparsity for dividing the classes of matrices estimable through hard thresholding from those that are not. Compared with the banding method of Bickel and Levina (2008a), β-sparsity is applicable to problems where there is no canonical ordering of the variables, because the method is invariant under permutation of the variables. It has been shown that when β < 0.5, the hard threshold estimator

Σ̂_threshold(s) = (σ̂_ij 1(|σ̂_ij| ≥ s))_{d×d}

is consistent under the operator norm when d/n → l ≠ 0, where l is generally finite as d → ∞. However, when β ≥ 0.5, this strategy may fail to give good estimators. The β-sparsity thus divides sharply the classes of matrices that are estimable through hard thresholding from those that are not.
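Hard thresholding of a covariance matrix can be sketched as follows (keeping the diagonal untouched is an assumption here; papers differ on whether the diagonal is thresholded):

```python
import numpy as np

def hard_threshold_cov(S, s):
    """Componentwise hard thresholding of the sample covariance matrix:
    entries with |sigma_ij| < s are set to zero; the diagonal is kept
    (an assumption here).  Positive definiteness is not guaranteed."""
    out = np.where(np.abs(S) >= s, S, 0.0)
    np.fill_diagonal(out, np.diag(S))
    return out
```

Because the rule acts entrywise, relabeling the variables permutes rows and columns of the output in the same way, which is the permutation invariance emphasized above.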
Bickel and Levina (2008b) simultaneously and independently proposed thresholding of the sample covariance matrix as a permutation-invariant approach to obtain the estimators. They also developed a notion of sparsity, which is more specialized but easier to analyze than El Karoui's β-sparsity, and showed that, by requiring log d/n → 0, the hard threshold estimator is consistent in the operator norm.

Lam and Fan (2009) considered the non-concave penalized likelihood

q(Σ) = tr(SΣ⁻¹) + log|Σ| + Σ_{i,j} p_λ(|σ_ij|),

where S is the sample covariance matrix and p_λ is a non-concave penalty function depending on the parameter λ, such as the L1-penalty p_λ(β) = λ|β|. Lam and Fan investigated both the sparsistency and the rates of convergence of non-concave penalized likelihood estimators for covariance and precision matrices under the Frobenius norm.
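The objective q(Σ) with the L1 penalty can be evaluated directly; a minimal sketch (illustrative function name; penalizing all entries follows the displayed formula, though some formulations penalize only the off-diagonal):

```python
import numpy as np

def q_objective(sigma, S, lam):
    """Evaluate q(Sigma) = tr(S Sigma^{-1}) + log|Sigma| + lam * sum |sigma_ij|,
    the L1 case of the penalized Gaussian likelihood objective."""
    sign, logdet = np.linalg.slogdet(sigma)
    assert sign > 0, "Sigma must be positive definite"
    return (np.trace(S @ np.linalg.inv(sigma))
            + logdet
            + lam * np.abs(sigma).sum())
```

Minimizing q over positive definite Σ trades the Gaussian likelihood fit against the L1 penalty that drives small entries of Σ to zero.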
es-There are no comprehensive theoretical framework for Bayesian inference for variance graphical models until Khare and Rajaratnam (2009) Due to the limitation ofBayesian theory, Khare and Rajaratnam constructed a family of Wishart distributions asthe parameter space for covariance graphical model, instead of the cone of positive def-inite matrices with fixed zeroes corresponding to the missing edges in the graph They
Trang 33co-formed a rich conjugate of priors ,sampled from these distributions using Gibbs pling, and showed the convergence of the estimator Khare and Rajaratnam gave thedefinition of homogeneous graph, which ensures the closed form of normalizing con-stant.
sam-Part of the difficulty in fitting a covariance matrix or its inverse comes from the itive definite constraint of the estimator Bickel and Levia (2008a) proposed the bandingtechnique with a nonnegative definite banding matrix to guarantee this property How-ever, thresholding may give non positive definite matrices We propose a penalizedlikelihood based method in the following section An attractive feature of the likelihoodbased approach is its improved efficiency comparing to banding or thresholding, analo-gous to the difference between Lasso and hard thresholding Another attractive feature
pos-of the proposed method is that the positive definiteness pos-of the covariance matrix is plicitly ensured, thus avoiding the need to make adjustment to a non positive definitematrix after thresholding (El Karoui, 2008)
ex-1.2.3 Organization of the thesis
This thesis consists of five chapters and is organized as follows.

In Chapter 1, we have provided an introduction to the background of this thesis and reviewed penalized approaches and graphical models.

Chapter 2 contains the main results of the thesis: we present the main methods and prove the main results.

In Chapter 3 we conduct simulation analyses to compare our penalized approach with other methods used for the covariance graphical model.

In Chapter 4 we apply the penalized approach to two real-world examples to estimate sparse covariance matrices, and compare it with other methods.

In the last chapter, Chapter 5, we summarize and discuss some applications and possible future research.
variables, to be zero. Suppose that the data (x_i, Y_i) are collected independently. Given x_i, Y_i follows a density function f_i(k(x_iᵀβ), y_i), where k represents a known link function. Let l_i = log f_i denote the conditional log-likelihood function of Y_i. Based on this information we can obtain the penalized log-likelihood function

Σ_{i=1}^n l_i(k(x_iᵀβ), y_i) − n Σ_{j=1}^d p_λ(|β_j|),

and the penalized maximum likelihood estimator β̂ can be derived by maximizing this penalized likelihood function.
Similarly, for covariance graphical model selection, statistical inference is also based on the likelihood function, and a penalized maximum likelihood estimator can also be used to select the significant variables. Let x_i = (x_i1, …, x_id)ᵀ ∈ R^d, i = 1, …, n, be d-dimensional multivariate normal random vectors. Without loss of generality, we assume that E(x_i) = 0 and cov(x_i) = Σ = (σ_{j1 j2})_{d×d} for some positive definite matrix Σ. Then the log-likelihood function of Σ is given by

l(Σ) = −(nd/2) log(2π) − (n/2) log|Σ| − (1/2) Σ_{i=1}^n x_iᵀ Σ⁻¹ x_i.
The unpenalized maximum likelihood estimator can be obtained by maximizing l(Σ), which is equivalent to minimizing −l(Σ). Since (nd/2) log(2π) is a constant, we can directly minimize the following loss function to derive the maximum likelihood estimator:

L(Σ) = n log|Σ| + Σ_{i=1}^n x_iᵀ Σ⁻¹ x_i = n{log|Σ| + tr(SΣ⁻¹)}.

The resulting maximum likelihood estimator (MLE) of Σ is

S = n⁻¹ Σ_{i=1}^n x_i x_iᵀ.
Generally speaking, this MLE is a dense estimator, meaning that nearly all the entries of S are non-zero. As we know, the number of entries in Σ grows very fast with the dimensionality. Thus, when the dimension is high, we would like to obtain sparse estimates, with certain entries estimated as exactly zero, to simplify the situation. In order to obtain sparse solutions for the off-diagonal components of Σ, we propose minimizing the following penalized likelihood objective function:

L_λ(Σ) = n{log|Σ| + tr(SΣ⁻¹)} + nλ Σ_{j1≠j2} |σ_{j1 j2}|,

subject to the constraint that Σ is positive definite.
For convenience, we define several terms:

• α_j = σ_jj,
• β_j = (σ_jj′ : j′ ≠ j)ᵀ ∈ R^{d−1},
• x_{i(−j)} ∈ R^{d−1}: the same vector as x_i but without the jth component,
• Σ_{(−j)} ∈ R^{(d−1)×(d−1)}: the same matrix as Σ but without the jth column and row,
• τ_j = α_j − β_jᵀ Σ_{(−j)}⁻¹ β_j.

Note that (α_j, β_j), j = 1, …, d, completely specify the covariance matrix Σ; hence, finding the penalized estimator for Σ is equivalent to finding the penalized estimators for (α_j, β_j). For this purpose, we propose an algorithm that iteratively optimizes over (α_j, β_j) with Σ_{(−j)} fixed, in order to obtain a sparse penalized estimator for Σ. To achieve this, we need to express L_λ(Σ) in terms of α_j, β_j, Σ_{(−j)} and τ_j.
First of all, we would like to get an expression for the concentration matrix Σ⁻¹ in terms of α_j, β_j, Σ_{(−j)} and τ_j. Let I denote the identity matrix. By simple matrix multiplication, and then taking inverses on both sides, we obtain a block expression for Σ⁻¹ (equation (2.4)).
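The block expression rests on the standard Schur-complement identities (Σ⁻¹)_{jj} = 1/τ_j and |Σ| = τ_j |Σ_{(−j)}|; a quick numerical check (illustrative names):

```python
import numpy as np

def tau_j(Sigma, j):
    """Schur complement tau_j = alpha_j - beta_j' Sigma_(-j)^{-1} beta_j,
    where alpha_j = Sigma[j, j], beta_j is column j with its jth entry
    deleted, and Sigma_(-j) is Sigma with row and column j deleted."""
    keep = [i for i in range(Sigma.shape[0]) if i != j]
    alpha = Sigma[j, j]
    beta = Sigma[keep, j]
    Sigma_mj = Sigma[np.ix_(keep, keep)]
    return alpha - beta @ np.linalg.solve(Sigma_mj, beta)
```

Positive definiteness of Σ is equivalent to τ_j > 0 together with positive definiteness of Σ_{(−j)}, which is what makes the (α_j, β_j) parametrization convenient for the iterative algorithm.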
Based on equation (2.4), we can obtain an expression for x_iᵀ Σ⁻¹ x_i, which is part of the likelihood function L_λ(Σ), in terms of α_j, β_j, Σ_{(−j)} and τ_j:

x_iᵀ Σ⁻¹ x_i = x_{i(−j)}ᵀ Σ_{(−j)}⁻¹ x_{i(−j)} + τ_j⁻¹ (x_ij − β_jᵀ Σ_{(−j)}⁻¹ x_{i(−j)})².

Next we derive an expression for |Σ|, which is also part of the likelihood function of the covariance matrix, in terms of α_j, β_j, Σ_{(−j)} and τ_j. Since |AB| = |A||B| and the determinant of a block triangular matrix is the product of the determinants of its diagonal blocks, it is easy to derive that |Σ| = τ_j |Σ_{(−j)}|.