arXiv:1205.2034v1 [stat.AP] 9 May 2012
γ-Divergence with Application to Cryo-EM Images
Ting-Li Chen^a, Hung Hung^b, I-Ping Tu^{a,*}

^a Institute of Statistical Science, Academia Sinica
^b Institute of Epidemiology & Preventive Medicine, National Taiwan University
^c Institute of Chemistry, Academia Sinica
* Corresponding author, iping@stat.sinica.edu.tw.
Abstract

Cryo-EM images are very noisy 2-D projections of molecules frozen in random orientations, so the images fall into a large number of clusters due to the free orientations. Clustering analysis is a necessary step to group the similar-orientation images for noise reduction. In this article, we propose a clustering algorithm, called γ-SUP, by implementing a minimum γ-divergence on a mixture of the q-Gaussian family through a self-updating process. γ-SUP copes well with cryo-EM images through the following advantages. (a) It resolves the sensitivity issue of choosing the number of clusters and the cluster initials. (b) It sets a hard influence range for each component in the mixture model and hence leads to a robust procedure for learning each of the local clusters. (c) It performs a soft rejection by down-weighting deviant points from cluster centers, which further enhances the robustness. (d) At each iteration, it shrinks the mixture model parameter estimates toward cluster centers, which improves the efficiency of mixture estimation.
Key words and phrases: clustering algorithm, cryo-EM images, γ-divergence, k-means, multilinear principal component analysis, q-Gaussian distribution, robust statistics, self-updating process.
1 Introduction and motivating data example
Cryo-electron microscopy (cryo-EM) has been emerging as a powerful tool for obtaining high-resolution three-dimensional (3-D) structures of biological macro-molecules (Saibil, 2000; van Heel et al., 2000; Frank, 2002; Yu et al., 2008). Traditionally, efficient 3-D structure determination is provided by X-ray crystallography when a large assembly of crystal can be obtained, allowing signals to be recorded from the spots in the diffraction pattern. However, not every molecule can form a crystal, and sometimes it is beneficial to study a molecule as a single particle rather than in a crystal. In contrast to X-ray crystallography, cryo-EM does not need crystals but can view dispersed biological molecules embedded in a thin layer of vitreous ice (Dubochet, 2012). The electron beam transmitted through the specimen generates 2-D projections of these freely oriented molecules, and the 3-D structure can be obtained by back projection, provided the angular relationships among the 2-D projections are determined (DeRosier and Klug, 1968).

As biological molecules are highly sensitive to electron beams, only very limited doses are allowed for imaging. This yields a barely recognizable image for an individual molecule (Henderson, 1995). Nevertheless, when the molecule is assembled from subunits in high order, for example an icosahedral virus, the symmetry facilitates the image processing and allows for attaining near-atomic resolution structures (Liu et al., 2010; Jiang et al., 2008; Grigorieff and Harrison, 2011). However, de-noising for a particle of low or no symmetry is evidently challenging, as it requires image alignment and data clustering of a sufficient number of images in similar orientations for averaging (Chang et al., 2010).
In this article, we focus on the clustering step and assume that the image alignment has been carried through. At present, k-means based algorithms are probably the most popular clustering methods for cryo-EM images (Frank, 2006). Modifications of k-means have been proposed to enforce balanced cluster sizes and to avoid excessively large or small clusters (Sorzano et al., 2010). However, there still exists the issue that an initial cluster assignment is required, which drives the final clustering result to some degree. Furthermore, k-means forces every object to be clustered into a certain class, which may not be proper when a non-negligible portion of image objects become outliers due to misalignment or data contamination. This enforcement may lead to serious bias in estimating the cluster representatives.
On the contrary, the self-updating process (SUP, Chen and Shiu 2007; Shiu and Chen 2012) is a data-driven robust algorithm that allows extremely small-sized clusters, consisting of only a few data points or even a singleton, to accommodate outliers. Allowing outlier clusters is important for robust clustering. SUP starts with each individual data point as a singleton cluster, so that neither random initials nor the cluster number is required. Next, it goes through a self-updating process of data merging and parting according to weighted averages over local neighborhoods defined by a hard influence region, where the weight can be an arbitrary function proportional to the similarity. The data points finally converge and stop at representative points, which serve as cluster centers.
In this article, we modify the original SUP and propose an information-theoretic framework that formulates the clustering algorithm, named γ-SUP, as a minimum γ-divergence estimation (Fujisawa and Eguchi, 2008; Cichocki and Amari, 2010; Eguchi et al., 2011) of a q-Gaussian mixture (Amari and Ohara, 2011; Eguchi et al., 2011) through the SUP implementation. In this framework, we parameterize the weights in SUP through a family of γ-divergences with γ > 0 and also parameterize the hard influence region through a q-Gaussian family with q < 1. As a comparison, the popular k-means is a special case of this framework with γ = 0 and q = 1, with the number of classes k given, and without the SUP implementation. The use of a q-Gaussian mixture model with q < 1 sets a hard influence range for each component in the mixture model and rejects data influence from outside this range; hence it leads to a robust procedure for learning each of the local clusters. The minimum γ-divergence with γ > 0 essentially performs a soft rejection by down-weighting deviant points from cluster centers, which further enhances the robustness. At each iteration, the self-updating process shrinks the mixture model parameter estimates toward cluster centers. This acts as if the effective temperature, that is, the within-cluster variance over the power parameter, is continuously decreasing, so that such a shrinkage updating improves the efficiency of mixture estimation.
For application, we design a simulation to investigate the performance of γ-SUP: 6400 images with 100×100 pixels are generated by simulating the 2-D projected cryo-EM images of a model molecule, RNA polymerase II, in 128 equally spaced (angle-wise) orientations with iid Gaussian noise N(0, 40²). We assume two scenarios in the image alignment step: perfect alignment and 10% misalignment. For perfect image alignment, k-means reaches a correct rate of 83%, while γ-SUP achieves 100%. When allowing 10% misalignment, k-means drops to an accuracy rate of 74%, while γ-SUP still keeps 100%.
The paper is organized as follows. In Section 2, we give a brief review of the γ-divergence and the q-Gaussian mixture relevant for γ-SUP. In Section 3, we formulate the γ-SUP clustering algorithm as a minimum γ-divergence estimation of a q-Gaussian mixture, with k-means as a special case. In Section 4, we show γ-SUP's stability to tuning parameter selection and its efficiency. In Section 5, we apply γ-SUP to the simulated cryo-EM images. In Section 6, we summarize our conclusions.
2 γ-divergence and q-Gaussian distribution

In this section we briefly review the concepts of the γ-divergence and the q-Gaussian distribution, which are the key technical tools for our γ-SUP clustering algorithm.
The most widely used distribution divergence is probably the Kullback-Leibler divergence, due to its connection to maximum likelihood estimation (MLE). The γ-divergence is a generalization of the Kullback-Leibler divergence indexed by a power parameter γ. Let M be the collection of all positive integrable functions defined on X ⊂ R^p.
Definition 1 (Fujisawa and Eguchi 2008; Cichocki and Amari 2010; Eguchi et al. 2011). For f, g ∈ M, define the γ-divergence D_γ(·‖·) and γ-cross entropy C_γ(·‖·) as follows:

$$D_\gamma(f\,\|\,g) = C_\gamma(f\,\|\,g) - C_\gamma(f\,\|\,f), \tag{1}$$

$$C_\gamma(f\,\|\,g) = -\frac{1}{\gamma(\gamma+1)} \int f(x)\,\frac{g^{\gamma}(x)}{\|g\|_{\gamma+1}^{\gamma}}\,dx, \qquad \text{where } \|g\|_{\gamma+1} = \Big(\int g^{\gamma+1}(x)\,dx\Big)^{\frac{1}{\gamma+1}}. \tag{2}$$
Note that in the limiting case, lim_{γ→0} D_γ(·‖·) = D_0(·‖·) reduces to the Kullback-Leibler divergence. The MLE, which corresponds to D_0(·‖·), has been shown to be optimal in parameter estimation in the sense of having minimum asymptotic covariance. This optimality comes with the cost that the MLE relies heavily on the correctness of the model specification and, hence, is not robust against model deviation and outliers. On the other hand, the minimum γ-divergence estimation has been shown to be super robust against data contamination (Fujisawa and Eguchi, 2008).
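As a quick numerical illustration of Definition 1 (using the formulas exactly as reconstructed above), the following sketch evaluates D_γ(f‖g) on a grid for two univariate Gaussians and checks that it approaches the Kullback-Leibler divergence as γ → 0. The grid, the densities, and the function names are our own choices for illustration, not anything from the paper.

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def C_gamma(f, g, gamma):
    # gamma-cross entropy: -1/(gamma*(gamma+1)) * integral of f * (g / ||g||_{gamma+1})^gamma
    g_norm = (np.sum(g ** (gamma + 1.0)) * dx) ** (1.0 / (gamma + 1.0))
    return -np.sum(f * (g / g_norm) ** gamma) * dx / (gamma * (gamma + 1.0))

def D_gamma(f, g, gamma):
    return C_gamma(f, g, gamma) - C_gamma(f, f, gamma)

f = gauss(x, 0.0, 1.0)          # "true" density
g = gauss(x, 1.0, 1.0)          # model density
print("KL(f||g) =", 0.5)        # closed form for N(0,1) vs N(1,1): (1 - 0)^2 / 2
for gamma in (0.5, 0.1, 0.01):
    print(gamma, D_gamma(f, g, gamma))   # approaches 0.5 as gamma decreases
```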
The q-Gaussian distribution is a generalization of the Gaussian distribution obtained by using the q-exponential instead of the usual exponential function, as defined below. Let S_p denote the collection of all strictly positive definite p × p symmetric matrices.

Definition 2 (modified from Amari and Ohara 2011; Eguchi et al. 2011). For a fixed q ∈ (−∞, 1 + 2/p), define the p-variate q-Gaussian distribution with parameters θ = (µ, Σ) ∈ R^p × S_p to have the probability density function given by
$$f_q(x;\theta) = \frac{c_{p,q}}{(\sqrt{2\pi})^{p}\sqrt{|\Sigma|}}\,\exp_q\{u(x;\theta)\},\qquad x\in\mathbb{R}^{p},$$

where u(x; θ) = −(1/2)(x − µ)ᵀΣ⁻¹(x − µ), exp_q(u) = {1 + (1 − q)u}_+^{1/(1−q)} for q ≠ 1 (with exp_1(u) = exp(u)), and {a}_+ = max{a, 0}. Denote the q-Gaussian distribution with parameters (µ, Σ) by G_q(µ, Σ).
The explicit form of the normalizing constant c_{p,q} is given in Eguchi et al. (2011).
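To make the role of q < 1 concrete, here is a small illustration of the q-exponential in Definition 2 (our own sketch, with arbitrary numbers): for q < 1 the factor {1 + (1 − q)u}_+^{1/(1−q)} becomes exactly zero once u is sufficiently negative, which is the hard influence range that γ-SUP exploits, whereas q = 1 gives the ordinary exponential, which never vanishes.

```python
import numpy as np

def exp_q(u, q):
    """q-exponential: {1 + (1-q)*u}_+^{1/(1-q)} for q != 1, and exp(u) for q = 1."""
    if q == 1.0:
        return np.exp(u)
    return np.maximum(1.0 + (1.0 - q) * u, 0.0) ** (1.0 / (1.0 - q))

d = np.linspace(0.0, 4.0, 9)     # distance from the center mu (standardized, 1-D)
u = -0.5 * d ** 2                # u(x; theta) = -(x - mu)^2 / 2 in this 1-D case

print(np.round(exp_q(u, q=0.5), 4))   # q < 1: exactly zero for d >= 2 (hard influence range)
print(np.round(exp_q(u, q=1.0), 4))   # q = 1: the ordinary Gaussian kernel, never zero
```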
The γ-divergence is a discrepancy measure between two functions in M, and its minimization can be used as a criterion to approximate an underlying probability density function f from a certain model class M_Θ parameterized by θ ∈ Θ ⊂ R^m. By (1) and (2), the estimate of f at the population level can be written as f(·; θ_γ) with

$$\theta_\gamma = \operatorname*{arg\,min}_{\theta\in\Theta} D_\gamma\big(f \,\|\, f(\cdot;\theta)\big) = \operatorname*{arg\,min}_{\theta\in\Theta} C_\gamma\big(f \,\|\, f(\cdot;\theta)\big).$$
In this study, we consider M_Θ with θ = (µ, Σ) to be the family of q-Gaussian distributions G_q(µ, Σ) introduced in Definition 2.² For any fixed γ, q and true density function f, the loss function (2) becomes

$$L_{\gamma,f}\{f_q(\cdot;\theta)\} = -\frac{1}{\gamma(\gamma+1)} \int f(x)\, \frac{\{\exp_q(u(x;\theta))\}^{\gamma}} {\Big\{\int \{\exp_q(u(v;\theta))\}^{\gamma+1}\,dv\Big\}^{\frac{\gamma}{\gamma+1}}}\,dx.$$

² For q < 1 + 2/(p + ν), the existence of the νth moment of X is ensured.
Hence, minimizing L_{γ,f}{f_q(·;θ)} over possible values of θ is equivalent to maximizing the ratio ∫ f(x){exp_q(u(x;θ))}^γ dx / {∫ {exp_q(u(v;θ))}^{γ+1} dv}^{γ/(γ+1)}. Setting the derivative with respect to µ to zero, we get the stationarity condition (7) for the maximizer µ∗, which expresses µ∗ as a weighted average.
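To see where the stationarity comes from, here is a sketch in our own notation under the reconstructed loss above; the paper's equation (7) may be displayed differently. For a location parameter µ (with Σ held fixed), the normalizing integral ∫ {exp_q(u(v; θ))}^{γ+1} dv does not depend on µ, so setting the µ-derivative of the remaining integral to zero gives

$$\int f(x)\,\{\exp_q(u(x;\theta))\}^{\gamma+q-1}\,\Sigma^{-1}(x-\mu^{*})\,dx = 0
\;\Longleftrightarrow\;
\mu^{*} = \frac{\int x\,\{\exp_q(u(x;\mu^{*},\Sigma))\}^{\gamma+q-1}\,f(x)\,dx}{\int \{\exp_q(u(x;\mu^{*},\Sigma))\}^{\gamma+q-1}\,f(x)\,dx}.$$

In words, µ∗ is a fixed point of a weighted average whose weights vanish outside the q-Gaussian influence range when q < 1, and whose exponent γ + q − 1 is positive exactly when γ > 1 − q (cf. Remark 1 below).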
3 γ-SUP
We are now in a position to introduce our clustering method, γ-SUP, which minimizes the γ-divergence on the mixture of q-Gaussian distributions.
3.1 The case with q < 1 and γ > 0
Suppose that f is a mixture of k components, i.e.,

$$f(x) = \sum_{h=1}^{k} \pi_h\, f_q(x;\mu_h,\Sigma),$$

where the π_h are mixing proportions. The stationary equation can then be solved iteratively by assigning proper initial points. Turning to the clustering problem, we start from (7) to develop a clustering algorithm, called γ-SUP, by employing a self-updating process. γ-SUP has the following key ingredients:
• It adopts a q-Gaussian mixture model, where the determination of the number of components is data-driven. It starts with each individual data point as a singleton cluster; that is, it starts with a mixture of n components of q-Gaussians.

• The q-Gaussian, with q < 1, sets a hard influence range for each component and completely rejects data outside this range. See the hard influence range reflected in the weights (8).

• It estimates the model parameters by the minimum γ-divergence. The minimum γ-divergence performs a soft rejection by down-weighting the influence of data deviant from the cluster centers, which further enhances the clustering robustness.

• The self-updating process updates model weights, and the update shrinks the fitted mixture model toward cluster centers at each iteration. Such a shrinkage update acts as if the effective temperature is iteratively decreasing (see Figure 2), so that it improves the efficiency of mixture estimation. The effective temperature is defined as $\sigma_\ell^2/s$, where $\sigma_\ell^2 = \frac{1}{n_1^{(\ell)}}\sum_{i=1}^{n_1^{(\ell)}}\big(\hat\mu_i^{(\ell)} - \bar{\hat\mu}^{(\ell)}\big)^2$ and $\bar{\hat\mu}^{(\ell)} = \frac{1}{n_1^{(\ell)}}\sum_{i=1}^{n_1^{(\ell)}}\hat\mu_i^{(\ell)}$. Also refer to Examples 1 and 2 in Section 4 for an efficiency comparison.
Note that γ-SUP aims to simultaneously extract all relevant clusters without the need to specify the number of components and the initials. It allows singleton or extremely small-sized clusters to accommodate potential outliers.
Suppose we have collected data {x_i}_{i=1}^n, and let F̂_n^{(0)} be its empirical cumulative distribution function. With the given data set {x_i}_{i=1}^n, we would like to iteratively group them with group representatives {µ̂_1^{(ℓ)}, ..., µ̂_{k_ℓ}^{(ℓ)}}, where the number of clusters k_ℓ at the ℓth iteration is data-driven and varies through the self-updating process. We start with n clusters (i.e., k_0 = n) with initial representatives {µ̂_i^{(0)} = x_i}_{i=1}^n. Based on the stationary equation (7), for ℓ = 0, 1, 2, ..., consider the following self-updating process:

$$\hat\mu_i^{(1)} = \frac{\sum_{j=1}^{n} w_{ij}^{(0)}\,\hat\mu_j^{(0)}}{\sum_{j=1}^{n} w_{ij}^{(0)}} \;\to\; \hat\mu_i^{(2)} = \frac{\sum_{j=1}^{n} w_{ij}^{(1)}\,\hat\mu_j^{(1)}}{\sum_{j=1}^{n} w_{ij}^{(1)}} \;\to\;\cdots\;\to\;\hat\mu_i^{(\infty)}, \tag{11}$$

where the weights are given by
$$w_{ij}^{(\ell)} = \Big\{\exp_q\!\Big(-\frac{1}{2\sigma^2}\,\big\|\hat\mu_i^{(\ell)} - \hat\mu_j^{(\ell)}\big\|^2\Big)\Big\}^{\rho} = \Big\{1 - \frac{s\rho}{2\sigma^2}\,\big\|\hat\mu_i^{(\ell)} - \hat\mu_j^{(\ell)}\big\|^2\Big\}_{+}^{1/s} = \exp_{1-s}\!\Big(-\frac{\rho}{2\sigma^2}\,\big\|\hat\mu_i^{(\ell)} - \hat\mu_j^{(\ell)}\big\|^2\Big). \tag{12}$$
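A quick numerical sanity check of the reparameterization in (12) as reconstructed above, namely that raising the q-exponential to the power ρ, with s = (1 − q)/ρ, gives the same weights as exp_{1−s}; the particular numbers are arbitrary and the check only validates our reading of the display:

```python
import numpy as np

def exp_q(u, q):
    # q-exponential: {1 + (1-q)*u}_+^{1/(1-q)}  (q != 1)
    return np.maximum(1.0 + (1.0 - q) * u, 0.0) ** (1.0 / (1.0 - q))

s, rho = 0.05, 4.0
q = 1.0 - s * rho                       # the q implied by s = (1 - q) / rho
u = -np.linspace(0.0, 10.0, 11)         # u = -||mu_i - mu_j||^2 / (2*sigma^2), say

lhs = exp_q(u, q) ** rho                # {exp_q(u)}^rho
rhs = exp_q(rho * u, 1.0 - s)           # exp_{1-s}(rho * u)
print(np.allclose(lhs, rhs))            # True: the two forms of (12) agree
```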
The process (11) converges to points {µ̂_i^{(∞)}}_{i=1}^n, which have k distinctive components denoted by {µ̂_h}_{h=1}^k. By (12), the weight w_{ij}^{(ℓ)} is decreasing in ‖µ̂_i^{(ℓ)} − µ̂_j^{(ℓ)}‖, which assures the convergence of γ-SUP (Shiu and Chen, 2012). The principle of down-weighting is important for robust model fitting (Basu et al., 1998; Field and Smith, 1994; Windham, 1995). However, we want to emphasize here that our down-weighting by w_{ij}^{(ℓ)} in (11) is with respect to models instead of to data.³ That is, F̂_n^{(ℓ)}(x) is also updated during the iteration of (10) for the case with respect to model, while it is always fixed at F̂_n^{(0)}(x) for the case with respect to data. Such a weighting scheme with respect to model is more efficient than with respect to data; see Example 2 in Section 4 for simulation studies.

³ For down-weighting with respect to data, one has w_{ij}^{(ℓ)} = exp_{1−s}(−(ρ/(2σ²))‖x_i − µ̂_j^{(ℓ)}‖²) and the update

$$\hat\mu_i^{(1)} = \frac{\sum_{j=1}^{n} w_{ij}^{(0)}\,x_j}{\sum_{j=1}^{n} w_{ij}^{(0)}} \;\to\; \hat\mu_i^{(2)} = \frac{\sum_{j=1}^{n} w_{ij}^{(1)}\,x_j}{\sum_{j=1}^{n} w_{ij}^{(1)}} \;\to\;\cdots\;\to\;\hat\mu_i^{(\infty)}.$$

That is, the weighted model average is replaced by a weighted data average.
The γ-SUP algorithm can be further simplified. Define the scaling parameter τ so that the weights in (12) depend only on the scaled representatives µ̃_i^{(ℓ)} = µ̂_i^{(ℓ)}/τ; the resulting simplified update is referred to as (16). From (16), to implement γ-SUP, we need to determine the values of (s, τ). It is found in our numerical studies in Section 4 that γ-SUP is quite insensitive to the choice of s. We thus suggest choosing a small value of s ∈ (0, 0.1) in practical implementation, which usually gives satisfactory results. In summary, γ-SUP starts with n clusters using each (scaled) individual data point µ̃_i^{(0)} = x_i/τ as a cluster member, which avoids the problem of random initials. Eventually, γ-SUP converges to certain k clusters, where k depends on the tuning parameters (s, τ) but otherwise is completely data-driven. At the end of the updating process, we have the cluster centers {µ̂_h = τ · µ̃_h}_{h=1}^k and the cluster membership assignment {c_i}_{i=1}^n for each data point. The γ-SUP clustering algorithm is summarized in Table 1.
3.2 The case with γ = 0 and q = 1

The case with γ = 0 and q = 1 corresponds to the minimum Kullback-Leibler divergence, or equivalently maximum likelihood, estimation of a mixture of usual Gaussian components.
The minimum KL divergence over mixtures of k Gaussian components, computed with the EM algorithm, leads to k-means clustering (Banerjee et al., 2005), where k is predetermined. It is known that k-means has some drawbacks: it needs the number of classes k to be specified, its clustering result depends on random initials, it is not robust to outliers, and it does not perform well when k is large. We end this section with a remark below, which explains why the usual Gaussian mixture model (q = 1) together with the MLE (γ = 0) is not robust. It also supports the choice of q < 1 and γ > 0 in our γ-SUP method.
Remark 1 (Fujisawa and Eguchi 2008; Eguchi et al. 2011). If the q-Gaussian is used for modeling, we should adopt a γ value with γ > 1 − q, so that the minimum γ-divergence estimation can be robust against deviation from the model assumption.
4 Numerical studies

We will show in this section, by numerical examples, that the performance of γ-SUP is quite stable with respect to the selection of the tuning parameter s. We will also show that γ-SUP, which adopts a down-weighting scheme with respect to model, is more efficient in model parameter estimation than the usual robust model fitting, which adopts a down-weighting scheme with respect to data. More detailed explanations of the difference between down-weighting with respect to data versus with respect to model will be given in Examples 1 and 2.
Example 1 (Stability with respect to s-selection and performance comparison). The data, with sample size 100, are generated from a mixture of two normal distributions with density function (for simplicity, assume σ² = 1)

(1 − π)f(x; µ1, 1) + πf(x; µ2, 1). (17)

Assume that the true location parameter of interest is µ1 = 0 in the first component of (17), but the observable data are contaminated by another normal distribution with mean µ2 = −7, where π represents the proportion of contamination. The aim here is to estimate the mean parameter µ1 of the major component.
The robustifying model fitting (RMF) is a robust estimation method proposed by Windham (1995). When f(x; θ) is the density function of a Gaussian distribution, parameter estimation by RMF is accomplished by a weighted average, where the weights are updated iteratively. Let {f(x; θ) : θ} be the class of model pdfs and θ̂^{(ℓ)} be the parameter estimate of θ at the ℓth iteration. RMF first re-weights the data contribution by the model density through w*(x; θ̂^{(ℓ)}) = {f(x; θ̂^{(ℓ)})}^{1/s}, where s is a positive tuning parameter. The updated estimate θ̂^{(ℓ+1)} is then obtained from an iteration similar to (10), but with the original data and with the weights w*(x; θ̂^{(ℓ)}):

$$\hat\theta^{(\ell+1)} = \frac{\sum_{j=1}^{n} x_j\, w^{*}(x_j;\hat\theta^{(\ell)})}{\sum_{j=1}^{n} w^{*}(x_j;\hat\theta^{(\ell)})}.$$
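To make the contrast concrete, the following sketch generates data from the contaminated mixture (17) and runs the RMF data-weighted iteration for the Gaussian location parameter. The sample, tuning value, starting point, and function names are our own illustrative choices; the paper's actual simulation settings are those reported in Figure 1 and Table 2.

```python
import numpy as np

rng = np.random.default_rng(1)

# Data from (17): 70% from N(mu1 = 0, 1) and 30% contamination from N(mu2 = -7, 1).
n, pi = 100, 0.3
contaminated = rng.random(n) < pi
x = np.where(contaminated, rng.normal(-7.0, 1.0, n), rng.normal(0.0, 1.0, n))

def rmf_location(x, s=0.5, max_iter=200, tol=1e-10):
    """Windham-style RMF for a Gaussian location parameter.

    Weights are w*(x; mu) = f(x; mu, sigma2)^(1/s); the update is a weighted
    average of the *original data*, in contrast to gamma-SUP's model average.
    """
    sigma2 = x.var()        # sample variance of the entire data, as in the example
    mu = np.median(x)       # a simple starting value
    for _ in range(max_iter):
        # density^(1/s) up to a constant factor, which cancels in the ratio below
        w = np.exp(-(x - mu) ** 2 / (2.0 * sigma2 * s))
        mu_next = np.sum(w * x) / np.sum(w)
        if abs(mu_next - mu) < tol:
            return mu_next
        mu = mu_next
    return mu

print(rmf_location(x))   # should land near the major-component mean mu1 = 0
```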
A main difference between γ-SUP and RMF is that γ-SUP performs a weighted model average in its updates, while RMF performs a weighted data average in its updates. In view of this point, our aim here is to compare the performance of γ-SUP with that of RMF in estimating the location parameter µ1 of the major component. Both γ-SUP and RMF are used to fit the data using the mixture of {f(x; µ, σ̂²) : µ ∈ R}, where σ̂² is the sample variance estimate based on the entire data, and the center of the largest cluster is used as the estimate of the major-component mean µ1. Simulation results with 100 replicates under π = 0.3 over different choices of s values are shown in Figure 1. It can be seen that the performance of γ-SUP is rather stable over various values of s, while that of RMF fluctuates more and is sensitive to the choice of s at the left boundary of the range s ∈ [0.2, 0.6].
Simulation results at the optimal choice of s are further provided in Table 2, which gives the means and standard errors of the estimates from the different methods. There are two variants of γ-SUP reported: γ-SUP-I uses the convergence point of the largest cluster as the mean parameter estimate; γ-SUP-II resorts back to the original data and uses the sample average of the original data points that have been assigned to the largest cluster. It can be seen that γ-SUP has smaller standard errors than RMF, especially in the case of large contamination, π = 0.3. The superior performance of γ-SUP comes from the shrinkage strategy built into the self-updating process: as the updating proceeds, it acts as if the effective temperature parameter is continuously decreasing (see Figure 2) and as if the weight function is becoming uniform. In summary, γ-SUP is insensitive to the choice of the tuning parameter s and is more robust against the influence of outliers than RMF.