A penalized likelihood approach in covariance graphical model selection

A PENALIZED LIKELIHOOD APPROACH IN COVARIANCE GRAPHICAL MODEL SELECTION

LIN NAN

(B.Sc National University of Singapore)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF STATISTICS AND APPLIED PROBABILITY

NATIONAL UNIVERSITY OF SINGAPORE

2010


First of all, I would like to show my deepest gratitude to my supervisor A/P Leng Chenlei and co-supervisor Dr Chen Ying, who conscientiously led me into the field of statistical research. This thesis would not have been possible without their patient guidance and continuous support. I really appreciate their efforts in helping me overcome all the problems I encountered in the past four years. It is my honor and luck to have them, two brilliant young professors, as my PhD supervisors.

Special acknowledgement also goes to all the professors and staff in the Department of Statistics and Applied Probability. I have been in this warm family for almost eight years. With their help I have built up statistical skills that will benefit me for my whole life. I cannot find an exact word to express my gratitude to the department, but I will definitely find a way to reciprocate in the future.

I further express my appreciation to my dear friends Ms Liu Yan, Ms Jiang Qian, Mr Lu Jun, Mr Liang Xuehua, Mr Jiang Binyan and Dr Zhang Rongli, for giving me help, support and encouragement during my PhD study. Thanks to their company, my PhD life has become more colorful and enjoyable.

Finally, I am forever indebted to my family. My dear parents gave me the courage to pursue PhD study at the beginning, and have always been my constant source of support through their endless love and understanding. My husband, Meng Chuan, is my joy, my pillar and my guiding light. This thesis is also in memory of my dear grandmothers.


Contents

1 Introduction
1.1 Background
1.2 Literature review
1.2.1 Review of penalized approaches
1.2.2 Review of graphical model
1.2.3 Organization of the thesis

2 Methodology
2.1 Main result
2.2 Theory
2.2.1 Proof of lemmas
2.2.2 Proof of theorems

3 Simulation
3.1 Simulation settings
3.2 Performance evaluation
3.3 Simulation results
3.3.1 Simulation results for different models
3.3.2 Simulation results for models with different dimensions

4 Real data analysis
4.1 Introduction
4.2 Call center data
4.3 Financial stocks vs education stocks

5 Conclusion and Further Research
5.1 Conclusion and discussion
5.2 Future research


There has been a rising interest in high-dimensional data from many important fields recently. One of the major challenges in modern statistics is to investigate the complex relationships and dependencies existing in data, in order to build parsimonious models for inference. Covariance or correlation matrix estimation, which addresses the relationships among random variables, attracts a lot of attention due to its ubiquity in data analysis. Suppose we have a d-dimensional vector following a multivariate normal distribution with mean zero and a certain covariance matrix that we are interested in estimating. Of particular interest is to identify zero entries in this covariance matrix, since a zero entry corresponds to marginal independence between two variables. This is referred to as covariance graphical model selection, which arises when the interest is to model pairwise correlation. Identifying pairwise independence in this model helps to elucidate relations between the variables.

We propose a penalized likelihood approach for covariance graphical model selection and a BIC-type criterion for the selection of the tuning parameter. An attractive feature of a likelihood based approach is its improved efficiency compared to banding or thresholding. Another attractive feature of the proposed method is that the positive definiteness of the covariance matrix is explicitly ensured. We show that the penalized likelihood estimator converges to the true covariance matrix under the Frobenius norm with an explicit rate. In addition, we show that the zero entries in the true covariance matrix are estimated as zero with probability tending to 1. We also compare the penalized approach with other methods for the covariance graphical model, such as the sample covariance matrix, the SIN approach proposed by Drton and Perlman (2004), the method developed by Bickel and Levina (2008b) and the shrinkage estimator of Ledoit and Wolf (2003), in terms of both simulations and real data analyses. The results show that the penalized method not only provides sparse estimates of the covariance matrix, but also has competitive estimation accuracy.


List of Tables

3.1 Simulations: Model 1 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.2 Simulations: Model 2 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.3 Simulations: Model 3 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.4 Simulations: Model 4 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.5 Simulations: Model 5 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.6 Simulations: Model 6 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.7 Simulations: Model 3 with d=30 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.8 Simulations: Model 3 with d=100 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.9 Simulations: Model 3 with d=100 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
4.1 Average (SE) KL, QL, OL, FL, FP and FN for Call Center Data with d=84, n=164. 4-fold CV on the training data minimizing the BIC.
4.2 Average (SE) KL, QL, OL, FL, FP and FN for Financial stock returns vs Education stock returns with d=10, n=49. 4-fold CV on the training data minimizing the BIC.


List of Figures

3.1 Simulations: Model 1 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.2 Simulations: Model 2 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.3 Simulations: Model 3 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.4 Simulations: Model 4 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.5 Simulations: Model 5 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.6 Simulations: Model 6 with d=10 and n=30. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.7 Simulations: Model 3 with d=30 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.8 Simulations: Model 3 with d=100 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
3.9 Simulations: Model 3 with d=100 and n=100. Average (SE) KL, QL, OL, FL, FP and FN over 50 replications.
4.1 Call center data
4.2 Financial stock vs Education stock


a correct model for inference. Covariance or correlation matrix estimation that addresses these relationships attracts a lot of attention due to its ubiquity in data analysis. Principal component analysis (PCA), linear and quadratic discriminant analysis (LDA and QDA) and the analysis of independence relations in the context of graphical models all need to estimate the covariance matrix. However, the number of parameters in the covariance matrix grows quickly with dimensionality, so high-dimensional data lead to a heavy computational burden. As a result, a sparsity assumption on the covariance matrix (i.e., some entries of the covariance matrix are exactly zero) is frequently imposed to achieve a balance between bias and variance. In this thesis, we propose a penalized likelihood approach to estimate the covariance matrix in order to achieve parsimony in covariance graphical model selection.

1.2 Literature review

1.2.1 Review of penalized approaches

Consider the linear regression model y = Xβ + ε, where y is an n × 1 vector, X is an n × d matrix and ε is an n × 1 vector. Without loss of generality, we assume that the data are centered, the columns of X are orthonormal, and the y_i's are conditionally independent given the design matrix X. Throughout this thesis, we assume the ε_i's are independently and identically distributed with mean zero and finite variance σ². A model fitting procedure produces the vector of coefficients β̂ = (β̂_0, ..., β̂_d).

We obtain ordinary least squares (OLS) estimates by minimizing the residual squared error.


Best subset selection is one of the standard techniques for improving OLS. We select or delete one independent variable through hypothesis testing at some level α in each step. Most traditional variable selection methods follow stepwise subset selection procedures to select variables, using criteria such as Akaike's information criterion (AIC) [Akaike (1973)] and the Bayesian information criterion (BIC) [Schwarz (1978)]. Nevertheless, this common stepwise procedure has long been recognized as extremely variable, since changes in the data may result in very different models. To remedy this problem, Drton and Perlman (2004) proposed a SIN approach that produces conservative simultaneous 1−α confidence intervals and uses these confidence intervals to do model selection in a single step. Best subset selection and the SIN approach improve OLS estimates by providing interpretable models.

Recently many statisticians have proposed various penalization methods, which usually shrink estimates to make trade-offs between bias and variance, to overcome the limitations of OLS estimates and best subset selection. The penalized estimates are obtained by minimizing the residual squared error plus a penalty function, i.e.,

\hat{\beta}_{\text{penalized}} = \arg\min_{\beta}\Big\{ \|y - X\beta\|^2 + \sum_{j} p_\lambda(|\beta_j|) \Big\},

where the non-negative constant λ is a tuning parameter and p_λ represents a penalty function.

Antoniadis (1997) and Fan (1997) both mentioned the hard thresholding estimator

\hat{\beta}_{\text{HardThre}} = \hat{\beta}_{\text{ols}} \, 1(|\hat{\beta}_{\text{ols}}| > \lambda),

which is derived by taking the hard thresholding penalty function

p_\lambda(|\beta|) = \lambda^2 - (|\beta| - \lambda)^2 \, 1(|\beta| < \lambda).
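As a quick illustration (my own sketch, not code from the thesis), the hard thresholding rule is a one-line function; `lam` plays the role of λ and the input is the vector of OLS coefficients.

```python
import numpy as np

def hard_threshold(beta_ols, lam):
    """Hard thresholding: keep an OLS coefficient only if its magnitude exceeds lam."""
    beta_ols = np.asarray(beta_ols, dtype=float)
    return beta_ols * (np.abs(beta_ols) > lam)

# Coefficients below the threshold are set exactly to zero, large ones are untouched.
print(hard_threshold([0.3, -1.2, 0.05], lam=0.5))
```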

Frank and Friedman (1993) introduced bridge regression with the L_q penalty function λ|β|^q, where q is a positive constant. When q > 1, the resulting penalized estimates shrink the solutions to reduce variability but do not enjoy sparsity. On the other hand, when q ≤ 1, the L_q penalty functions lead to sparse solutions but have relatively large biases.

One special case of bridge regression is the L_2 penalty, which under the orthonormal design leads to the ridge solution


\hat{\beta}_{\text{Ridge}} = \hat{\beta}_{\text{ols}}/(1+\gamma),

where γ is a positive number.

Ridge regression is a continuous process that shrinks coefficients, so it achieves better prediction performance through a bias-variance trade-off. However, it does not set any coefficients to 0 and hence does not give an easily interpretable model.

The Lasso, proposed by Tibshirani (1996), is the penalized least squares method imposing an L_1 penalty

p_\lambda(|\beta|) = \lambda|\beta|

on the regression coefficients. The L_1 penalty leads to the solution

\hat{\beta}_{\text{Lasso}} = \operatorname{sgn}(\hat{\beta}_{\text{ols}})\,(|\hat{\beta}_{\text{ols}}| - \gamma)_{+}.

Because of the nature of the L_1 penalty, the Lasso does both continuous shrinkage and automatic variable selection simultaneously. According to simulation results, for a small number of moderate-sized effects the Lasso does better than ridge regression, while for a large number of small effects ridge regression performs better than the Lasso, so neither of them uniformly dominates the other. However, as variable selection becomes increasingly important in modern data analysis, the Lasso is much more appealing owing to its sparse representation. Given an orthogonal design, the entire Lasso solution path can be computed by the LARS algorithm, proposed by Efron et al. (2004).
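Under the orthonormal design described above, the Lasso solution is componentwise soft thresholding of the OLS coefficients. A minimal sketch (my illustration, with `gamma` as the threshold):

```python
import numpy as np

def soft_threshold(beta_ols, gamma):
    """Lasso solution under an orthonormal design: soft-threshold each OLS coefficient."""
    beta_ols = np.asarray(beta_ols, dtype=float)
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - gamma, 0.0)

# Small coefficients are zeroed out; large ones are shrunk toward zero by gamma.
print(soft_threshold([0.3, -1.2, 0.05], gamma=0.5))
```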

Although the Lasso enjoys great computational advantages and excellent performance, it has several limitations:

1. The Lasso lacks the oracle property defined in Fan and Li (2001).

2. If there is a group of variables among which the pairwise correlations are very high, the Lasso tends to select only one variable from the group and does not care which one is selected. In short, the Lasso cannot handle collinearity.

3. The Lasso can only select individual input variables, so it is not suitable for general factor selection.

In some situations, such as the multifactor analysis-of-variance (ANOVA) problem, variable selection concentrates on selecting a group of important factors rather than individual variables. As we have stated, the Lasso is only designed for selecting individual input factors and thus is not suited to this kind of scenario. Yuan and Lin (2006) proposed the group Lasso to improve over the Lasso in terms of group variable selection. For a vector η ∈ R^d, d ≥ 1, and a symmetric d × d positive definite matrix K, they denoted ||η||_K = (η^T K η)^{1/2}. In the group Lasso criterion, λ ≥ 0 is a tuning parameter and K_1, ..., K_d are positive definite matrices with many possible choices; a common choice verified by the authors is K_j = p_j I_{p_j}, j = 1, ..., d.


The entire solution for the group Lasso can be obtained iteratively, as in the sketch below.
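To make the grouped penalty concrete, here is a small sketch (mine, not code from Yuan and Lin) that evaluates the group Lasso penalty in the special case K_j = p_j I, so each group contributes its Euclidean norm scaled by the square root of the group size:

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Group Lasso penalty with K_j = p_j * I: lam * sum_j sqrt(p_j) * ||beta_{group j}||_2."""
    beta = np.asarray(beta, dtype=float)
    return lam * sum(np.sqrt(len(idx)) * np.linalg.norm(beta[list(idx)]) for idx in groups)

# Two groups of sizes 2 and 3; a group whose coefficients are all zero contributes nothing.
print(group_lasso_penalty([1.0, 0.5, 0.0, 0.0, 0.0], groups=[(0, 1), (2, 3, 4)], lam=0.1))
```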

Fan and Li (2001) stated that a good penalty function should result in an estimator with the following three properties:

1. Unbiasedness: the resulting estimator does not over-penalize large parameters, to avoid unnecessary modeling bias.

2. Sparsity: the resulting estimator automatically sets insignificant parameters to 0.

3. Continuity: the resulting estimator is continuous in the data, in order to avoid instability in model prediction.

insta-It has been shown that the L q and hard thresholding penalty functions do not neously satisfy and three properties Fan and Li (2001) proposed the smoothly clippedabsolute deviation penalty(SCAD)

sgn( ˆβols)(| ˆβols| − λ)+, when| ˆβols| ≤ λ;

{(a − 1) ˆβols− sgn( ˆβols)aλ }/(a − 2), when 2λ < | ˆβols| < aλ;

ˆβols, when| ˆβols| > aλ.
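A componentwise sketch of the SCAD thresholding rule above (my own illustration; Fan and Li recommend a = 3.7):

```python
import numpy as np

def scad_threshold(beta_ols, lam, a=3.7):
    """SCAD thresholding rule applied componentwise to OLS coefficients (requires a > 2)."""
    b = np.asarray(beta_ols, dtype=float)
    ab = np.abs(b)
    soft = np.sign(b) * np.maximum(ab - lam, 0.0)                  # case |b| <= 2*lam
    middle = ((a - 1.0) * b - np.sign(b) * a * lam) / (a - 2.0)    # case 2*lam < |b| <= a*lam
    return np.where(ab <= 2.0 * lam, soft, np.where(ab <= a * lam, middle, b))

# Small inputs are soft-thresholded, very large ones are left unbiased, intermediate ones interpolated.
print(scad_threshold([0.3, 1.0, 5.0], lam=0.5))
```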

The two parameters (λ, a) can be chosen by criteria such as BIC, cross-validation and generalized cross-validation. Fan and Li (2001) established an "oracle property" for the finite-parameter case, under which the procedure selects exactly the variables with nonzero coefficients


and estimates the remaining coefficients as zero.

(Oracle Property) Let V_1, ..., V_n be independent and identically distributed, each with a density f(V, β) satisfying conditions (A)-(C):

(A) The observations V_i are independent and identically distributed with probability density f(V, β) with respect to some measure µ; f(V, β) has a common support and the model is identifiable. Furthermore, the first and second logarithmic derivatives of f satisfy the standard regularity conditions.

(B) The Fisher information matrix

I(\beta) = E\left[\left(\frac{\partial \log f(V,\beta)}{\partial \beta}\right)\left(\frac{\partial \log f(V,\beta)}{\partial \beta}\right)^{T}\right]

is finite and positive definite at β = β_0.

(C) There exists an open subset ω of Ω containing the true parameter point β_0 such that for almost all V the density f(V, β) admits all third derivatives ∂³f(V, β)/∂β_j ∂β_k ∂β_l for all β ∈ ω.


Suppose the true parameter is partitioned as β_0 = (β_{10}^T, β_{20}^T)^T with β_{20} = 0. If λ_n → 0 and √n λ_n → ∞ as n → ∞, then with probability tending to 1, the root-n consistent local maximizers β̂ = (β̂_1^T, β̂_2^T)^T must satisfy:

1. Sparsity: β̂_2 = 0.

2. Asymptotic normality:

\sqrt{n}\,\{I_1(\beta_{10}) + \Sigma\}\left\{\hat{\beta}_1 - \beta_{10} + \{I_1(\beta_{10}) + \Sigma\}^{-1} b\right\} \to N\{0, I_1(\beta_{10})\}

in distribution, where I_1(β_{10}) = I_1(β_{10}, 0) is the Fisher information knowing β_2 = 0.

SCAD, which enjoys the oracle property, thus improves upon other penalties such as the L_1 penalty and the hard thresholding penalty.

Fan and Li (2001) established oracle properties for non-concave penalties, such as SCAD, Lasso and bridge regression, only for the finite-parameter case. Fan and Peng (2004) generalized the results to a diverging number of parameters: they stated a general framework for non-concave penalties, with general conditions under which the oracle property holds, and proved the corresponding results.

Zou and Hastie (2005) introduced a regularization technique called the elastic net. They first obtained the naive elastic net estimator by minimizing

\|y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1.


The elastic net is a two-stage procedure: for each fixed λ_2, they first found the ridge regression coefficients and then performed the Lasso. As a result, a double amount of shrinkage occurs, which introduces unnecessary extra bias compared with the pure Lasso or ridge regression. Thus, they rescaled the naive elastic net coefficients by a constant (1 + λ_2) to compensate for the extra shrinkage. Under an orthogonal design, the elastic net solution is

\hat{\beta}_{\text{enet}} = \operatorname{sgn}(\hat{\beta}_{\text{ols}})\big(|\hat{\beta}_{\text{ols}}| - \lambda_1/2\big)_{+}.

Similar to the Lasso, the elastic net simultaneously does automatic variable selection and continuous shrinkage. In addition, the elastic net can potentially select all d predictors and can select groups of correlated variables, which overcomes the two limitations of the Lasso.
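A sketch of the two closed forms under an orthogonal design (my illustration of the naive/rescaled distinction described above; `lam1` and `lam2` are the L1 and L2 tuning parameters):

```python
import numpy as np

def elastic_net_orthonormal(beta_ols, lam1, lam2, rescale=True):
    """Elastic net under an orthonormal design: soft-threshold at lam1/2, divide by (1 + lam2)
    for the naive version, then undo the division when the rescaled estimator is requested."""
    b = np.asarray(beta_ols, dtype=float)
    naive = np.sign(b) * np.maximum(np.abs(b) - lam1 / 2.0, 0.0) / (1.0 + lam2)
    return (1.0 + lam2) * naive if rescale else naive
```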

Usually, an estimate β̂ is considered desirable if it is consistent in terms of both coefficient estimation and variable selection. We call a solution path "path consistent" if it contains at least one such desirable estimate. Although the Lasso and the elastic net perform superiorly in prediction, they are not consistent in variable selection (Leng, Lin and Wahba, 2006; Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Zou, 2006).


Zou (2006) suggested a new version of the Lasso for simultaneous estimation and variable selection, called the adaptive Lasso, which weights the L_1 penalty on each coefficient by 1/|β̂_ini,j|^γ, where γ is a positive constant and β̂_ini is an initial root-n consistent estimate of β. It has been shown that the adaptive Lasso has oracle properties when the adaptively weighted l_1 penalty is utilized, and that the adaptive Lasso shrinkage results in a near-minimax-optimal estimator.

The non-negative garrotte estimator was introduced by Yuan and Lin (2007). It scales an initial estimate componentwise by non-negative shrinkage factors that are themselves obtained by penalized least squares. The non-negative garrotte


identifies the set of important variables and is consistent for the coefficients of the important variables, whereas such a property may not be valid for the initial estimators. In general, it has been shown that the non-negative garrotte can turn a non-consistent estimate into an estimate that is consistent in terms of both variable selection and coefficient estimation.

As pointed out in Zou (2009), the adaptive Lasso improves the Lasso by achieving the oracle property but cannot handle collinearity, while the elastic net can deal with collinearity but lacks the oracle property. The two penalties improve the Lasso in two different areas. Thus Zou (2009) combined the strengths of the adaptive Lasso and the elastic net and proposed a better estimator that improves the Lasso in both areas, called the adaptive elastic net.


1.2.2 Review of graphical model

A graphical model is a modeling technique which uses graphs to represent dependencies between stochastic variables (Lauritzen, 1996).

The most common graphical models are undirected graphs, called concentration graphical models. A concentration graphical model for the random vector X = (X_1, ..., X_d)^T ∈ R^d, with unknown mean µ and nonsingular covariance matrix Σ, is represented by an undirected graph G = (V, E), where V = {1, ..., d} is the set of all variables and E represents the conditional independence relationships among X_1, ..., X_d. The absence of an undirected edge between two vertices encodes conditional independence between the associated variables given all the other variables. As we know, zero entries in the concentration matrix Σ^{-1} also indicate conditional independence between the two associated random variables given all other variables. Thus parameter estimation in the concentration graphical model is equivalent to identifying zero entries in the concentration matrix.

An example of a concentration graphical model is shown in Figure 1. Suppose X = (X_1, ..., X_4)^T has a concentration matrix with a specified pattern of zeros;


then X exhibits the conditional independence structure displayed in Figure 1.

Figure 1: An example of a concentration graphical model.

A lot of research work has been done regarding model selection in the concentration graphical model. Whittaker (1990), Lauritzen (1996) and Edwards (2000) presented commonly used estimation methods and statistical properties of concentration graphical models. Wong et al. (2003) and Dobra et al. (2004) used Bayesian approaches to estimate the concentration matrix. Drton and Perlman (2004) proposed a SIN method to produce simultaneous confidence intervals to do model selection in a single step. Schäfer and Strimmer (2005) did the estimation by regularization with bootstrap variance reduction and selected the network based on the estimated concentration matrix using the false discovery rate (FDR). Meinshausen and Bühlmann (2006) performed neighborhood selection for all variables to estimate the structure of a concentration graphical model, and showed their method is consistent in high-dimensional settings. Huang et al. (2006) used either an L_1 (Lasso) or an L_2 (ridge) penalty on the off-diagonal elements of the Cholesky factor in order to create zeros in arbitrary locations in the concentration matrix. Li and Gui (2006) introduced a threshold gradient descent (TGD) regularization procedure to obtain the estimator. Yuan and Lin (2007) and d'Aspremont et al. (2008) used a penalized likelihood method with the Lasso penalty to estimate the concentration matrix, resulting in a sparse estimate. Friedman et al. (2008) developed a fast algorithm, called the graphical Lasso algorithm, to estimate the sparse concentration matrix. Rothman et al. (2007) proposed SPICE, a permutation invariant estimator for the precision matrix based on penalized likelihood with a Lasso-type penalty, and established remarkable results on the rate of convergence under the Frobenius norm. Lam and Fan (2009) generalized Rothman's work to other penalties and proved sparsistency for all the estimators presented in their paper.

There has also been considerable interest in bidirected covariance graphical models, where the lack of a bidirected edge between two variables indicates a marginal independence. Covariance matrix estimation is a common statistical problem that arises in many scientific applications, such as financial risk assessment and longitudinal studies.

Let x_i = (x_{i1}, ..., x_{id})^T ∈ R^d, i = 1, ..., n, be d-dimensional vectors following a multivariate normal distribution N_d(0, Σ). We are interested in estimating the covariance matrix Σ = (σ_{ij})_{d×d}. Of particular interest is the problem of identifying zero entries in Σ, since σ_{ij} = 0 corresponds to marginal independence of X_i and X_j. This is referred to as covariance graphical model selection (Cox and Wermuth, 1993, 1996). For example, suppose X = (X_1, ..., X_4)^T has a covariance matrix with a specified pattern of zero off-diagonal entries; then X exhibits the following marginal independence structure:


1 ↔ 2 ↔ 3 ↔ 4

Figure 2: An example of a covariance graphical model.

Statistical inference regarding the covariance graphical model selection problem is actually not well developed. For model selection, in principle, one can employ backward elimination or forward selection. However, it is now well understood that such a process may suffer from a relative lack of accuracy and instability. Moreover, an exhaustive procedure such as best subset selection suffers from computational complexity.

In recent years, iterative methods have been used to apply maximum likelihood estimation in the covariance graphical model. For example, Anderson (1969, 1970, 1973) proposed an algorithm for solving covariance graphical models based on a set of maximum likelihood equations. The


graphical modeling software MIM, developed in Edwards (2000), fits covariance graphical models by a "dual likelihood method" from Kauermann (1996). Wermuth et al. (2006) also derived asymptotically efficient approximations to the maximum likelihood estimate in such models. Chaudhuri et al. (2007) addressed the problem of estimating the covariance matrix when some of the entries are zero and presented an iterative conditional fitting algorithm, with guaranteed convergence properties, to compute the maximum likelihood estimate in covariance graphical models. All these approaches are only applicable when the dimension d and the number of observations n are both not large.

When the dimension is high, it has been pointed out many times, from the Marcenko-Pastur law (1967) to Johnstone (2001), that the sample covariance matrix is not a good estimator of the population covariance matrix. Thus some alternative estimators have been developed for high-dimensional cases. Most of these estimators rely on a sparsity assumption in order to simplify the scenario. Generally speaking, there are two broad classes of covariance matrix estimators: those that assume the variables are naturally ordered, so that variables far apart in the ordering are only weakly correlated (e.g., longitudinal data, time series, spatial data or spectroscopy), and those invariant to variable permutations (e.g., genetics and social science).

The first class includes banding or tapering the sample covariance matrix. Bickel and Levina (2008a) proposed a banding technique, either banding the sample covariance matrix or estimating a banded version of the population covariance matrix,

B_k(S) = \big( s_{ij}\, 1(|i-j| \le k) \big)_{d \times d}.

By requiring log d/n → 0, they showed that when the population covariance matrix ranges over certain fairly natural families, their estimator is consistent in the operator norm.
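A direct sketch of the banding operator B_k(S) (my illustration of the estimator described above):

```python
import numpy as np

def band_covariance(S, k):
    """Band a sample covariance matrix: entries with |i - j| > k are set to zero."""
    S = np.asarray(S, dtype=float)
    idx = np.arange(S.shape[0])
    mask = np.abs(idx[:, None] - idx[None, :]) <= k
    return S * mask
```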

Cai et al. (2010) proposed a tapering procedure to estimate the covariance matrix: for a given even integer k with 1 ≤ k ≤ d, the sample covariance matrix is multiplied entrywise by tapering weights that equal one near the diagonal and decay linearly to zero as |i − j| grows from k/2 to k. By choosing a proper tapering parameter, the optimal rate of convergence can be achieved by the proposed tapering estimator, although the estimator is not necessarily positive semidefinite.
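A sketch of the linear tapering weights just described (my reading of the Cai-type construction; `k` is the even tapering parameter):

```python
import numpy as np

def taper_covariance(S, k):
    """Taper a sample covariance matrix: weight 1 for |i-j| <= k/2, decaying linearly to 0 at |i-j| >= k."""
    S = np.asarray(S, dtype=float)
    idx = np.arange(S.shape[0])
    dist = np.abs(idx[:, None] - idx[None, :])
    weights = np.clip(2.0 - 2.0 * dist / k, 0.0, 1.0)
    return S * weights
```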


Pourahmadi (1999) suggested using the modified Cholesky factorization to estimate the concentration matrix. Based on Pourahmadi's method, Rothman et al. (2010) proposed a banded covariance matrix estimator obtained by banding the Cholesky factor of the covariance. Regress each variable X_j on X_{j-1}, ..., X_1 for 2 ≤ j ≤ d. Let T = (t_{jq})_{d×d} be the lower-triangular matrix containing the regression coefficients, with ones on the diagonal, and let L = T^{-1}. Since ε = X − X̂ = TX, we have X = Lε, and hence cov(X) = L D L^T, where D is the diagonal covariance matrix of the residuals. Apply the above decomposition to the data matrix X = (x_1, ..., x_d)_{n×d}: define e_1 = x_1, and for 2 ≤ j ≤ d compute the regression coefficients and the residual of x_j on its predecessors.

After the last projection, the estimates L̂ and D̂ can be obtained, and the resulting estimator of the covariance matrix is

\hat{\Sigma}_{\text{Cholesky}} = \hat{L}\hat{D}\hat{L}^{T}.
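For concreteness, a minimal unbanded sketch of the modified Cholesky construction described above (my own code; the banded estimator of Rothman et al. would additionally restrict each regression to the k nearest predecessors):

```python
import numpy as np

def cholesky_covariance(X):
    """Estimate cov(X) via the modified Cholesky decomposition: regress each column on all
    preceding columns, collect coefficients in a unit lower-triangular T and residual
    variances in a diagonal D, and return L D L^T with L = T^{-1}."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    T = np.eye(d)
    resid_var = np.empty(d)
    resid_var[0] = X[:, 0].var()
    for j in range(1, d):
        coef, *_ = np.linalg.lstsq(X[:, :j], X[:, j], rcond=None)
        T[j, :j] = -coef                          # row j of T encodes x_j minus its fitted value
        resid_var[j] = (X[:, j] - X[:, :j] @ coef).var()
    L = np.linalg.inv(T)
    return L @ np.diag(resid_var) @ L.T
```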


A positive definite estimator is guaranteed by regularizing the Cholesky factor via the regression interpretation provided in the paper. Similar to other banding estimators, its low computational cost is very attractive. However, Rothman et al. did not provide a convergence rate to support this estimator, due to technical difficulties.

Wu and Pourahmadi (2009) established a banded estimator for the covariance matrix by banding the sample autocovariance matrix, which is attractive in time series analysis. Let X_1, ..., X_n be a realization of a mean-zero stationary process X_t; its autocovariance γ_k = cov(X_0, X_k) can be estimated by the usual sample autocovariance γ̂_k. However, the positive definite estimator Σ̂_n = (γ̂_{i−j})_{1≤i,j≤n} is not a good estimate of Σ_n = (γ_{i−j})_{1≤i,j≤n}, since Σ̂_n − Σ_n does not converge to zero under the operator norm. Wu and Pourahmadi proposed an estimator obtained by truncating Σ̂_n:

\hat{\Sigma}_{n,l} = \big( \hat{\gamma}_{i-j}\, 1(|i-j| \le l) \big)_{1 \le i,j \le n},

where l ≥ 0 is an integer. They have shown that their estimator, which is not necessarily positive definite, converges to the true covariance matrix with rate γ_n under the operator norm.
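A sketch of the truncated autocovariance estimator above for a mean-zero series (my illustration; γ̂_k is computed with the usual 1/n scaling):

```python
import numpy as np

def banded_autocovariance(x, l):
    """Truncated sample autocovariance matrix: keep gamma_hat(|i-j|) only when |i-j| <= l."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    gamma_hat = np.array([x[: n - k] @ x[k:] / n for k in range(n)])
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return np.where(dist <= l, gamma_hat[dist], 0.0)
```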

There are many situations requiring covariance matrix estimators to be invariant under variable permutations, such as gene expression arrays, where no natural ordering exists among the variables. Thresholding small elements to zero has become a popular method for estimating such covariance matrices. In spite of the potential loss of positive definiteness, this kind of approach is usually quite simple and carries little computational burden.

El Karoui (2008) proposed componentwise hard thresholding of the entries in the sample covariance matrix for "large n, large d" problems. He defined his own notion of sparsity, called β-sparsity, which refines the natural notion of sparsity for dividing the classes of matrices estimable through hard thresholding from those that are not. Compared to the banding method in Bickel and Levina (2008a), β-sparsity is applicable to problems where there is no canonical ordering of the variables, because the method is invariant under permutation of the variables. It has been shown that when β < 0.5, the hard threshold estimator

\hat{\Sigma}_{\text{threshold}}(s) = \big( \sigma_{ij}\, 1(|\sigma_{ij}| \ge s) \big)_{d \times d}

is consistent under the operator norm when d/n → l ≠ 0, where l is generally finite as d → ∞. However, when β ≥ 0.5, this strategy may fail to give good estimators. The β-sparsity sharply divides the classes of matrices that are estimable through hard thresholding from those that are not.
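The componentwise hard thresholding rule above is equally simple to write down (my sketch; keeping the diagonal, so the variances are untouched, is my own choice):

```python
import numpy as np

def threshold_covariance(S, s):
    """Componentwise hard thresholding of a sample covariance matrix at level s, keeping the diagonal."""
    S = np.asarray(S, dtype=float)
    out = S * (np.abs(S) >= s)
    np.fill_diagonal(out, np.diag(S))
    return out
```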

Bickel and Levina (2008b) simultaneously and independently proposed thresholding of the sample covariance matrix as a permutation-invariant approach to obtain the estimators. They also developed a notion of sparsity, which is more specialized but easier to analyze than El Karoui's β-sparsity, and showed that by requiring log d/n → 0, the hard threshold estimator is consistent in the operator norm.

For penalized likelihood estimation of a sparse covariance matrix, one can consider the objective

q(\Sigma) = \operatorname{tr}(S\Sigma^{-1}) + \log|\Sigma| + \sum_{i,j} p_\lambda(|\sigma_{ij}|),

where S is the sample covariance matrix and p_λ is a non-concave penalty function depending on a parameter λ, such as the L_1 penalty p_λ(β) = λ|β|. Lam and Fan (2009) investigated both the sparsistency and the rates of convergence of non-concave penalized likelihood estimators for covariance and precision matrices under the Frobenius norm.
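To make the objective concrete, here is a sketch (my illustration, not the thesis's implementation) that evaluates q(Σ) in the L1 case, penalizing only the off-diagonal entries and returning +∞ outside the positive definite cone:

```python
import numpy as np

def penalized_objective(Sigma, S, lam):
    """Evaluate tr(S Sigma^{-1}) + log|Sigma| + lam * sum_{i != j} |sigma_ij| for positive definite Sigma."""
    Sigma = np.asarray(Sigma, dtype=float)
    S = np.asarray(S, dtype=float)
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        return np.inf                              # outside the positive definite cone
    off_diag_l1 = np.abs(Sigma).sum() - np.abs(np.diag(Sigma)).sum()
    return np.trace(S @ np.linalg.inv(Sigma)) + logdet + lam * off_diag_l1
```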

There was no comprehensive theoretical framework for Bayesian inference in covariance graphical models until Khare and Rajaratnam (2009). Due to the limitations of Bayesian theory, Khare and Rajaratnam constructed a family of Wishart distributions as the parameter space for the covariance graphical model, instead of the cone of positive definite matrices with fixed zeros corresponding to the missing edges in the graph. They formed a rich conjugate family of priors, sampled from these distributions using Gibbs sampling, and showed the convergence of the estimator. Khare and Rajaratnam gave the definition of a homogeneous graph, which ensures a closed form for the normalizing constant.

Part of the difficulty in fitting a covariance matrix or its inverse comes from the positive definite constraint on the estimator. Bickel and Levina (2008a) proposed the banding technique with a nonnegative definite banding matrix to guarantee this property. However, thresholding may give non positive definite matrices. We propose a penalized likelihood based method in the following section. An attractive feature of the likelihood based approach is its improved efficiency compared to banding or thresholding, analogous to the difference between the Lasso and hard thresholding. Another attractive feature of the proposed method is that the positive definiteness of the covariance matrix is explicitly ensured, thus avoiding the need to adjust a non positive definite matrix after thresholding (El Karoui, 2008).

1.2.3 Organization of the thesis

This thesis consists of five chapters and is organized as follows.

In Chapter 1, we have provided an introduction to the background of this thesis and reviewed penalized approaches and graphical models.

Chapter 2 contains the main results of the thesis. We present the main methods and prove the main results.

In Chapter 3 we conduct simulation analyses to compare our penalized approach to other methods that are also used in the covariance graphical model.

In Chapter 4 we apply the penalized approach to two real world examples to estimate sparse covariance matrices and compare with other methods.

In the last chapter, Chapter 5, we summarize and discuss some applications and possible future research.


variables, to be zero. Suppose that the data (x_i, Y_i) are collected independently. Given x_i, Y_i follows a density function f_i(k(x_i^T β), y_i), where k represents a known link function. Let l_i = log f_i denote the conditional log-likelihood function of Y_i. Based on this information we can obtain the penalized log-likelihood function for the Y_i's.


The penalized maximum likelihood estimator β̂ can be derived by maximizing the penalized likelihood function.

Similarly, for covariance graphical model selection, statistical inference is also based on the likelihood function, and a penalized maximum likelihood estimator can likewise be used to select significant variables. Let x_i = (x_{i1}, ..., x_{id})^T ∈ R^d, i = 1, ..., n, be d-dimensional multivariate normal random vectors. Without loss of generality, we assume that E(x_i) = 0 and cov(x_i) = Σ = (σ_{j_1 j_2})_{d×d} for some positive definite matrix Σ. Then the log-likelihood function of Σ is

l(\Sigma) = -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^{n} x_i^{T}\Sigma^{-1}x_i.

The unpenalized maximum likelihood estimator can be obtained by maximizing l(Σ), which is equivalent to minimizing the negative log-likelihood. Since (nd/2) log(2π) is a constant, we can directly minimize the remaining loss function L(Σ) to derive the maximum likelihood estimator.


The resulting maximum likelihood estimator (MLE) of Σ is

S = n^{-1}\sum_{i=1}^{n} x_i x_i^{T}.
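In code, the mean-zero Gaussian MLE is just the scaled cross-product (a trivial sketch, assuming the rows of X are the observations x_i):

```python
import numpy as np

def mle_covariance(X):
    """MLE of the covariance for mean-zero Gaussian data: S = (1/n) * sum_i x_i x_i^T."""
    X = np.asarray(X, dtype=float)
    return X.T @ X / X.shape[0]
```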

Generally speaking, this MLE is a dense estimator, meaning that nearly all the entries of S are non-zero. As we know, the number of entries in Σ grows very fast with the dimensionality. Thus when the dimension is high, we would like to obtain sparse estimates, with certain entries estimated as zero, to simplify the situation. In order to obtain sparse solutions for the off-diagonal components of Σ, we propose a penalized likelihood objective function L_λ(Σ), in which a penalty is imposed on the off-diagonal entries and the estimate is constrained to be positive definite.

For convenience, we define several terms:

• α_j = σ_{jj}, the jth diagonal entry of Σ;

• β_j = (σ_{jj'} : j' ≠ j) = (β_{j1}, ..., β_{j(j-1)}, β_{j(j+1)}, ..., β_{jd})^T ∈ R^{d-1}, the off-diagonal entries of the jth column;

• x_{i(-j)} ∈ R^{d-1}: the same vector as x_i but without the jth component;

• Σ_{(-j)} ∈ R^{(d-1)×(d-1)}: the same matrix as Σ but without the jth column and row;

• τ_j = α_j − β_j^{T} Σ_{(-j)}^{-1} β_j.

Note that (α_j, β_j), j = 1, ..., d, completely specify the covariance matrix Σ; hence, finding the penalized estimator of Σ is equivalent to finding the penalized estimators of (α_j, β_j). For this purpose, we propose an algorithm which iteratively optimizes over (α_j, β_j) with Σ_{(-j)} fixed, in order to obtain a sparse penalized estimator of Σ. To achieve this, we need to express L_λ(Σ) in terms of α_j, β_j, Σ_{(-j)} and τ_j.

First of all, we would like to obtain an expression for the concentration matrix Σ^{-1} in terms of α_j, β_j, Σ_{(-j)} and τ_j. Let I denote the d × d identity matrix. By simple matrix multiplication we can obtain a block representation of Σ in terms of these quantities.


Taking the inverse on both sides, we obtain an expression for Σ^{-1}, given in equation (2.4).
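The inverted matrix itself is not reproduced in this extraction, but with the definitions above the standard block-inversion (Schur complement) identity, after permuting the jth variable to the last position, gives

\Sigma^{-1} =
\begin{pmatrix}
\Sigma_{(-j)}^{-1} + \tau_j^{-1}\,\Sigma_{(-j)}^{-1}\beta_j\beta_j^{T}\Sigma_{(-j)}^{-1} & -\tau_j^{-1}\,\Sigma_{(-j)}^{-1}\beta_j \\
-\tau_j^{-1}\,\beta_j^{T}\Sigma_{(-j)}^{-1} & \tau_j^{-1}
\end{pmatrix},
\qquad \tau_j = \alpha_j - \beta_j^{T}\Sigma_{(-j)}^{-1}\beta_j,

so that 1/τ_j appears as the (j, j) entry of Σ^{-1}.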

Based on equation (2.4), we can also obtain an expression for x_i^T Σ^{-1} x_i, which is a part of the likelihood function L_λ(Σ), in terms of α_j, β_j, Σ_{(-j)} and τ_j.

Next, we would like to derive an expression for |Σ|, which is also a part of the likelihood function of the covariance matrix, in terms of α_j, β_j, Σ_{(-j)} and τ_j.


It is easy to derive this expression, since |AB| = |A||B|.
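The remainder of the derivation is cut off here, but the same Schur complement factorization also yields the standard determinant identity

|\Sigma| = |\Sigma_{(-j)}| \, \tau_j,

which expresses |Σ| through Σ_{(-j)} and τ_j as intended.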
