When the null hypothesis that the spatial correlation parameter ρ is zero is tested in the presence of spatial correlation in the error term, the Wald test has a very good power, although the power is higher with the sparser weighting matrix. The latter characteristic is a general feature of all conducted tests. The power for the spatial error parameter in the presence of a non-zero spatial lag parameter is lower. However, the power of the Wald test in these circumstances is (much) greater than the power achievable by using Lagrange Multiplier tests. In Figure 1c the best performing LM test is plotted, i.e. LM_A. All LM tests relying on OLS residuals fail seriously to detect the true DGP.
Comparable to the performance of the Wald test based on MLE estimates is the Wald test based on GMM estimates, but only in detecting the significant lag parameter in the presence of a significant spatial error parameter. In the reverse case the Wald test using GMM estimates is much worse.
As a further model selection approach, the performance of information criteria is analyzed. The performance of the classical Akaike information criterion and the bias-corrected AICc are almost identical. In Figure 1d the share of cases in which AIC/AICc identifies the correct DGP is plotted on the y-axis. All information criteria fail in more than 15% of the cases to identify the correct, more parsimonious model, i.e. SARAR(1,0) or SARAR(0,1) instead of SARAR(1,1). However, in the remaining experiments (ρ = 0.05, ..., 0.2 or λ = 0.05, ..., 0.2) AIC/AICc is comparable to the performance of the Wald test. BIC performs better than AIC/AICc at detecting SARAR(1,0) or SARAR(0,1) instead of SARAR(1,1), but much worse in the remaining experiments.

In order to be able to propose a general procedure for model selection, the approach must also be suitable if the true DGP is SARAR(1,0) or SARAR(0,1). In this case the Wald test based on the general model again has the appropriate size and a very good power. Further, the sensitivity to different weighting matrices is less severe. However, the power is smallest for the test with the null hypothesis H0: λ = 0 and with distance as weighting scheme W2. The Wald test using GMM estimates is again comparable when testing for the spatial lag parameter but worse when testing for the spatial error parameter.
Not significantly different from the power function of the Wald test based on the general model are the two LM statistics based on OLS residuals. However, in this case LM_A fails to identify the correct DGP.
The Wald test outperforms the information criteria regarding the identification of SARAR(1,0) or SARAR(0,1). If OLS is the DGP, the correct model is chosen only about two thirds of the time by AIC/AICc, but comparably often to the Wald test by BIC. If SARAR(1,0) is the data generating process, all information criteria perform worse than the Wald test, independent of the underlying weighting scheme. If SARAR(0,1) is the data generating process, BIC is worse than the Wald test, and AIC/AICc has a slightly higher performance for small values of the spatial parameter but is outperformed by the Wald test for higher values of the spatial parameters. For the sake of completeness it is noted that no valid model selection can be conducted using likelihood ratio tests.
Fig. 1. a) Power of the Wald test based on the general model and MLE estimates. b) Power of the Wald test based on the general model and GMM estimates. c) Power of the Lagrange Multiplier test using LM_A as test statistic. d) Correct model choice of the better performing information criterion (AIC/AICc).
To conclude, we find that the 'general to specific' approach is the most suitable procedure to identify the correct data generating process (DGP) regarding Cliff-Ord type spatial models. Independent of whether the true DGP is a SARAR(1,1), SARAR(1,0), SARAR(0,1), or just a regression model without any spatial correlation, the general model should be estimated and the Wald tests conducted. The chance to identify the true DGP is then higher compared to the alternative model choice criteria based on the LM tests, LR tests, or on information criteria like AIC, AICc, or BIC.
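As an illustration of the recommended first step, the general SARAR(1,1) model can be fitted by ML in R and both Wald tests read directly off the summary. The sketch below is ours, not the paper's: it assumes the spdep package (newer R versions ship the estimator in spatialreg) and invented data and weight objects.

```r
library(spdep)  # sacsarlm() fits the general SARAR(1,1) model by ML

# 'dat' (data frame) and 'lw' (listw spatial weights object) are assumed.
fit <- sacsarlm(y ~ x1 + x2, data = dat, listw = lw)
summary(fit)  # rho (lag) and lambda (error) with asymptotic z-tests;
              # insignificant parameters point to the nested model to keep
```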
Segmentation and Classification of Hyper-Spectral Skin Data
Hannes Kazianka1, Raimund Leitner2 and Jürgen Pilz1

1 Institute of Statistics, Alpen-Adria-Universität Klagenfurt, Universitätsstraße 65-67, 9020 Klagenfurt, Austria
{hannes.kazianka, juergen.pilz}@uni-klu.ac.at
2 CTR Carinthian Tech Research AG, Europastraße 4/1, 9524 Villach, Austria
Raimund.Leitner@ctr.at
Abstract. Supervised classification methods require reliable and consistent training sets. In image analysis, where class labels are often assigned to the entire image, the manual generation of pixel-accurate class labels is tedious and time consuming. We present an independent component analysis (ICA)-based method to generate these pixel-accurate class labels with minimal user interaction. The algorithm is applied to the detection of skin cancer in hyper-spectral images. Using this approach it is possible to remove artifacts caused by sub-optimal image acquisition. We report on the classification results obtained for the hyper-spectral skin cancer data set with 300 images using support vector machines (SVM) and model-based discriminant analysis (MclustDA, MDA).
1 Introduction
Hyper-spectral images consist of several, up to a hundred, images acquired at different - mostly narrow band and contiguous - wavelengths. Thus, a hyper-spectral image contains pixels represented as multidimensional vectors with elements indicating the reflectivity at a specific wavelength. For a contiguous set of narrow band wavelengths these vectors correspond to spectra in the physical meaning and are equal to spectra measured with, e.g., spectrometers.
Supervised classification of hyper-spectral images requires a reliable and consistent training set. In many applications labels are assigned to the full image instead of to each individual pixel, even if instances of all the classes occur in the image. To obtain a reliable training set it may be necessary to label the images on a pixel by pixel basis. Manually generating pixel-accurate class labels requires a lot of effort; cluster-based automatic segmentation is often sensitive to measurement errors and illumination problems. In the following we present a labelling strategy for hyper-spectral skin cancer data that uses PCA, ICA and K-Means clustering. For the classification of unknown images, we compare support vector machines and model-based discriminant analysis.
Section 2 describes the methods that are used for the labelling approach. The classification algorithms are discussed in Section 3. In Section 4 we present the segmentation and classification results obtained for the skin cancer data set, and Section 5 is devoted to discussions and conclusions.
2 Labelling
Hyper-spectral data are highly correlated and contain noise which adversely affects classification and clustering algorithms. As the dimensionality of the data equals the number of spectral bands, using the full spectral information leads to high computational complexity. To overcome the curse of dimensionality we use PCA to reduce the dimensions of the data, and inherently also unwanted noise. Since different features of the image may have equal score values for the same principal component, an additional feature extraction step is proposed. ICA makes it possible to detect acquisition artifacts like saturated pixels and inhomogeneous illumination. Those effects can be significantly reduced in the spectral information, giving rise to an improved segmentation.
2.1 Principal Component Analysis (PCA)
PCA is a standard method for dimension reduction and can be performed by singular value decomposition. The algorithm gives uncorrelated principal components. We assume that those principal components that correspond to very low eigenvalues contribute only to noise. As a rule of thumb, we chose to retain at least 95% of the variability, which led to selecting 6-12 components.
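As a sketch of this rule of thumb in R (our illustration, not the authors' code; the matrix X and the helper name are assumptions), the number of retained components is the smallest k whose cumulative variance share reaches 95%:

```r
# X is an assumed n x p matrix of pixel spectra (pixels x spectral bands).
pca_reduce <- function(X, var_target = 0.95) {
  pc  <- prcomp(X, center = TRUE)             # PCA via singular value decomposition
  cum <- cumsum(pc$sdev^2) / sum(pc$sdev^2)   # cumulative share of total variance
  k   <- which(cum >= var_target)[1]          # smallest k retaining the target share
  list(scores = pc$x[, 1:k, drop = FALSE], k = k)
}
```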
2.2 Independent Component Analysis (ICA)
ICA is a powerful statistical tool to determine hidden factors of multivariate data. The ICA model assumes that the observed data, x, can be expressed as a linear mixture of statistically independent components, s. The model can be written as

x = As,

where the unknown matrix A is called the mixing matrix. Defining W as the unmixing matrix, we can calculate s as

s = Wx.
As we have already done a dimension reduction, we can assume that noise is negligible and A is square, which implies W = A^{-1}. This significantly simplifies the estimation of A and s. Provided that no more than one independent component has a Gaussian distribution, the model can be uniquely estimated up to scalar multipliers. There exists a variety of different algorithms for fitting the ICA model. In our work we focused on the two most popular implementations, which are based on maximisation of non-Gaussianity and minimisation of mutual information respectively: FastICA and FlexICA.
FastICA

The FastICA algorithm developed by Hyvärinen et al. (2001) uses negentropy, J(y), as a measure of Gaussianity. Since negentropy is zero for Gaussian variables and always nonnegative, one has to maximise negentropy in order to maximise non-Gaussianity. To avoid computational problems the algorithm uses an approximation of negentropy: if G denotes a nonquadratic function and we want to estimate one independent component s, we can approximate

J(y) ≈ [E{G(y)} − E{G(ν)}]²,

where ν is a standardised Gaussian variable and y is an estimate of s. We adopt G(y) = log cosh y since this has been shown to be a good choice. Maximisation directly leads to a fixed-point iteration algorithm that is 20-50 times faster than other ICA implementations. To estimate several independent components a deflationary orthogonalisation method is used.
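In R, a fixed-point FastICA of this kind is available in the fastICA package; the snippet below is a hedged sketch of how the retained PCA scores could be unmixed (the package choice and the variable scores are our assumptions, not the authors' setup):

```r
library(fastICA)  # fixed-point FastICA

# 'scores' is an assumed n x k matrix of retained PCA scores.
ica <- fastICA(scores, n.comp = ncol(scores),
               alg.typ = "deflation",  # deflationary estimation, one IC at a time
               fun     = "logcosh")    # contrast based on G(y) = log cosh y
S <- ica$S  # estimated independent components, one column per IC
```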
FlexICA
Mutual information is a natural measure of the information that members of a set of random variables have on the others. Choi et al. (2000) proposed an ICA algorithm that attempts to minimise this quantity. All independent components are estimated simultaneously using a natural gradient learning rule, with the assumption that the source signals have the generalized Gaussian distribution with density

q_i(y_i) = r_i / (2 σ_i Γ(1/r_i)) · exp( −(1/r_i) |y_i/σ_i|^{r_i} ).

Here r_i denotes the Gaussian exponent, which is chosen in a flexible way depending on the kurtosis of y_i.
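A direct R transcription of this density (our illustration; the function name is hypothetical) makes the role of the Gaussian exponent visible: r = 2 gives a Gaussian-shaped curve, r = 1 a heavier-tailed, Laplacian-like one:

```r
# Generalized Gaussian density as reconstructed above (our transcription).
ggauss <- function(y, r, sigma = 1) {
  r / (2 * sigma * gamma(1 / r)) * exp(-(1 / r) * abs(y / sigma)^r)
}
curve(ggauss(x, r = 2), -4, 4)              # Gaussian exponent r = 2
curve(ggauss(x, r = 1), -4, 4, add = TRUE)  # super-Gaussian source model
```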
2.3 Two-Stage K-Means clustering
From a statistical point of view it may be inappropriate to use K-means clustering, since K-means cannot use all the higher order information that ICA provides. There are several approaches that avoid using K-means; for example, Shah et al. (2004) proposed the ICA mixture model (ICAMM). However, for large images this algorithm fails to converge. We developed a 2-stage K-means clustering strategy that works particularly well with skin data; an R sketch follows the steps below. The choice of 5 resp. 3 clusters for the K-means algorithm has been determined empirically for the skin cancer data set.

1. Drop ICs that contain a high amount of noise or correspond to artifacts.
2. Perform K-means clustering with 5 clusters.
3. Those clusters that correspond to healthy skin are taken together into one cluster. This cluster is labelled as skin.
4. Perform a second run of K-means clustering on the remaining clusters (inflamed skin, lesion, etc.). This time use 3 clusters. Label the clusters that correspond to the mole and melanoma centre as mole and melanoma. The remaining clusters are considered to be 'regions of uncertainty'.
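A minimal R sketch of these steps (our illustration; S, skin_ids and the function name are assumptions, and the mapping of second-stage clusters to labels is done by inspection as in step 4):

```r
# S        : n x k matrix of retained independent components (pixels in rows)
# skin_ids : first-stage cluster numbers identified as healthy skin
two_stage_labels <- function(S, skin_ids, k1 = 5, k2 = 3) {
  stage1  <- kmeans(S, centers = k1, nstart = 10)   # step 2: 5 clusters
  is_skin <- stage1$cluster %in% skin_ids           # step 3: merge into 'skin'
  labels  <- rep("skin", nrow(S))
  stage2  <- kmeans(S[!is_skin, , drop = FALSE],    # step 4: re-cluster the rest
                    centers = k2, nstart = 10)
  labels[!is_skin] <- paste0("cluster_", stage2$cluster)
  labels  # cluster_1..cluster_3 are mapped to mole/melanoma/uncertainty by hand
}
```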
3 Classification
This section describes the classification methods that have been investigated. The preprocessing steps for the training data are the same as in the segmentation task: dimension reduction using PCA and feature extraction performed by ICA. Using the Bayesian Information Criterion (BIC), the data were reduced to 6 dimensions.
3.1 Mixture Discriminant Analysis (MDA)
MDA assumes that each class j can be modelled as a mixture of R_j subclasses. The subclasses have a multivariate Gaussian distribution with mean vector μ_jr, r = 1, ..., R_j, and covariance matrix Σ, which is the same for all classes. Hence, the mixture model for class j has the density

P(X | j) = ∑_{r=1}^{R_j} π_jr φ(X; μ_jr, Σ),

where the π_jr are mixing proportions that sum to one and φ denotes the multivariate Gaussian density. Maximum likelihood estimation of the model parameters can be carried out, as Hastie et al. (2001) suggest, using optimal scoring. It is also possible to use flexible discriminant analysis (FDA) or penalized discriminant analysis (PDA) in combination with MDA. The major drawback of this classification approach is that, similar to LDA, which is also described in Hastie et al. (2001), the covariance matrix is fixed for all classes and the number of subclasses for each class has to be set in advance.
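In R, this model is available through the mda package of Hastie and Tibshirani; the call below is a sketch under the assumption that train is a data frame holding the six ICA features and a class factor (object names are ours):

```r
library(mda)  # mixture discriminant analysis with optimal scoring

# 'train'/'test' are assumed data frames with a 'class' factor
# (skin / mole / melanoma); three subclasses per class as in the text.
fit  <- mda(class ~ ., data = train, subclasses = 3)
pred <- predict(fit, newdata = test)  # predicted class labels
```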
discrim-3.2 Model-based Discriminant Analysis (MclustDA)
MclustDA, proposed by Fraley and Raftery (2002), extends MDA in a way that the covariance in each class is parameterized using the eigenvalue decomposition

Σ_r = λ_r D_r A_r D_r^T, r = 1, ..., R_j.

The volume of the component is controlled by λ_r, A_r defines the shape, and D_r is responsible for the orientation. The model selection is done using the BIC, and the maximum likelihood estimation is performed by an EM-algorithm.
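A corresponding R sketch uses the mclust package of Fraley and Raftery (object names are our assumptions); BIC-based selection of the covariance model and of the number of subclasses happens inside the fitting call:

```r
library(mclust)  # model-based discriminant analysis (MclustDA)

# 'X'/'y' are assumed training features and labels, 'X_new' new pixels.
fit  <- MclustDA(X, class = y)  # EM fit; BIC picks model and subclass number
pred <- predict(fit, newdata = X_new)$classification
summary(fit)  # shows the chosen covariance model (e.g. VVV) per class
```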
3.3 Support Vector Machines (SVM)
The aim of support vector machines is to find a hyperplane that optimally separates two classes in a high-dimensional feature space induced by a Mercer kernel K(x, z). In the L2-norm case the Lagrangian dual problem is to find α* that solves the following convex optimization problem:

max_α ∑_{i=1}^m α_i − (1/2) ∑_{i=1}^m ∑_{j=1}^m α_i α_j y_i y_j ( K(x_i, x_j) + δ_ij/C )

subject to ∑_{i=1}^m α_i y_i = 0, α_i ≥ 0,

where the x_i are training points belonging to classes y_i and δ_ij denotes the Kronecker delta. The cost parameter C and the kernel function have to be chosen to suit the problem. It is also possible to use different cost parameters for unbalanced data, as was suggested by Veropoulos et al. (1999).
Although SVMs were originally designed as binary classifiers, there exists a variety of methods to extend them to k > 2 classes. In our work we focused on one-against-all and one-against-one SVMs. The one-against-all formulation trains each class against all remaining classes, resulting in k binary SVMs. The one-against-one formulation uses k(k−1)/2 SVMs, each separating one class from another.
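Both multiclass schemes and per-class costs are available in R through the e1071 interface to libsvm; the following sketch (package choice and object names are ours, not the authors') uses class weights, which scale the cost C per class, to realise the unbalanced setting of Veropoulos et al. (1999):

```r
library(e1071)  # libsvm interface; multiclass via one-against-one

# 'train' is an assumed data frame of ICA features with a 'class' factor.
fit <- svm(class ~ ., data = train,
           kernel = "polynomial", degree = 3,
           cost = 0.1,
           class.weights = c(melanoma = 5, mole = 1, skin = 1))
pred <- predict(fit, newdata = test)
```

With cost = 0.1, the melanoma weight of 5 corresponds to C_melanoma = 0.5 and C_mole = C_skin = 0.1, the setting reported as best in Section 4.2.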
4 Results
A set of 310 hyper-spectral images (512 × 512 pixels, 300 spectral bands) of malignant and benign lesions was taken in clinical studies at the Medical University Graz, Austria. The lesions were classified as melanoma or mole by human experts on the basis of a histological examination. However, in our survey we distinguish between three classes, melanoma, mole and skin, since all these classes typically occur in the images. The segmentation task is especially difficult in this application: we have to take into account that melanoma typically occurs in combination with mole. To reduce the number of outliers in the training set we define a 'region of uncertainty' as a transition region between the kernels of mole and melanoma and between the lesion and the skin.
4.1 Training
Figures 1(b) and 1(c) display the first step of the K-Means strategy described in Section 2.3. The original image displayed in Figure 1(a) shows a mole that is located in the middle of a hand. For PCA-transformed data, as in Figure 1(b), the algorithm performs poorly and the classes do not correspond to lesion, mole and skin regions (left and bottom). Even the lesion is in the same class together with an illumination problem. If the data is also transformed using ICA, as in Figure 1(c), the lesion is already identified and there exists a second class in the form of a ring around the lesion, which is the desired 'region of uncertainty'. The other classes correspond to wrinkles on the hand.

Figure 1(d) shows the second K-Means step for the PCA-transformed data. Although the second K-Means step makes it possible to separate the lesion from the illumination problem, it can be seen that the class that should correspond to the kernel of the mole is too large. Instances from other classes are present in the kernel. The second K-Means step with the ICA-preprocessed data is shown in Figure 1(e). Not only is the kernel reliably detected, but there also exists a transition region consisting of two classes. One class contains the border of the lesion. The second class separates the kernel from the remaining part of the mole.
Fig. 1. The two iteration steps of the K-Means approach for both PCA ((b) and (d)) and ICA ((c) and (e)) are displayed together with the original image (a). The different gray levels indicate the cluster the pixel has been assigned to.

We believe that the FastICA algorithm is the most appropriate ICA implementation for this segmentation task. The segmentation quality for both methods is very similar; however, the FastICA algorithm is faster and more stable.

To generate a training set of 12,000 pixel spectra per class we labelled 60 mole images and 17 melanoma images using our labelling approach. The pixels in the training set are chosen randomly from the segmented images.
4.2 Classification
In Table 1 we present the classification results obtained for the different classifiers described in Section 3. As a test set we use 57 melanoma and 253 mole images. We use the output of the LDA classifier as a benchmark.
LDA turns out to be the worst classifier for the recognition of moles: nearly one half of the mole images are misclassified as melanoma. On the other hand, LDA yields excellent results for the classification of melanoma, giving rise to the presumption that there is a large bias towards the melanoma class. With MDA we use three subclasses in each class. Although both MDA and LDA keep the covariance fixed, MDA models the data as a mixture of Gaussians, leading to a significantly higher recognition rate compared to LDA. Using FDA or PDA in combination with MDA does not improve the results. MclustDA performs best among these classifiers. Notice, however, that BIC overestimates the number of subclasses in each class, which is between 14 and 21. For all classes the model with varying shape, varying volume and varying orientation of the mixture components is chosen. This extra flexibility makes it possible to outperform MDA even though only half of the training points could be used due to memory limitations. Another significant advantage of MclustDA is its speed, taking around 20 seconds for a full image.

Table 1. Recognition rates obtained for the different classifiers

FlexICA   Melanoma   84.5%   86.5%   56.1%
          Mole       89.4%   89.4%   98.2%
FastICA   Melanoma   84.5%   87.7%   56.1%
          Mole       89.4%   89.4%   98.2%

FlexICA   Melanoma   72.7%   69.9%   87.7%
          Mole       92.9%   94.7%   89.4%
Since misclassification of melanoma into the mole class is less favourable than misclassification of mole into the melanoma class, we clearly have unbalanced data in the skin cancer problem. According to Veropoulos et al. (1999) we can choose C_melanoma > C_mole = C_skin. We obtain the best results using the polynomial kernel of degree three with C_melanoma = 0.5 and C_mole = C_skin = 0.1. This method is clearly superior when compared with the other SVM approaches. For the one-against-all (OAA-SVM) and the one-against-one (OAO-SVM) formulations we use Gaussian kernels with C = 2 and σ = 20. A drawback of all the SVM classifiers, however, is that training takes 20 hours (Centrino Duo 2.17GHz, 2GB RAM) and classification of a full image takes more than 2 minutes.
We discovered that different ICA implementations have no significant impact on the quality of the classification output. FlexICA performs slightly better for the unbalanced SVM and the one-against-all SVM. FastICA gives better results for MclustDA. For all other classifiers the performances are equal.
5 Conclusion
The combination of PCA and ICA makes it possible to detect both artifacts and the lesion in hyper-spectral skin cancer data. The algorithm projects the corresponding features on different independent components; dropping the independent components that correspond to the artifacts and applying a 2-stage K-Means clustering leads to a reliable segmentation of the images. It is interesting to note that for the mole images in our study there is always one single independent component that carries the information about the whole lesion. This suggests a very simple segmentation in the case where the skin is healthy: keep the single independent component that contains the desired information and perform the K-Means steps. For melanoma images the spectral information about the lesion is contained in at least two independent components, leading to reliable separation of the melanoma kernel from the mole kernel.

Unbalanced SVM and MclustDA yield equally good classification results; however, because of its computational performance MclustDA is the best classifier for the skin cancer data in terms of overall accuracy.
The presented segmentation and classification approach does not use any spatial information. In future research Markov random fields and contextual classifiers could be used to take into account the spatial context.
In a possible application, where the physician is assisted by a system which pre-screens patients, we have to take care of high sensitivity, which is typically accompanied by a loss in specificity. Preliminary experiments showed that a sensitivity of 95% is possible at the cost of 20% false-positives.
References
ABE, S. (2005): Support Vector Machines for Pattern Classification. Springer, London.
CHOI, S., CICHOCKI, A. and AMARI, S. (2000): Flexible Independent Component Analysis. Journal of VLSI Signal Processing, 26(1/2), 25-38.
FRALEY, C. and RAFTERY, A. (2002): Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 97, 611-631.
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001): The Elements of Statistical Learning. Springer, New York.
HYVÄRINEN, A., KARHUNEN, J. and OJA, E. (2001): Independent Component Analysis. Wiley, New York.
SHAH, C., ARORA, M. and VARSHNEY, P. (2004): Unsupervised classification of hyper-spectral data: an ICA mixture model based approach. International Journal of Remote Sensing, 25, 481-487.
VEROPOULOS, K., CAMPBELL, C. and CRISTIANINI, N. (1999): Controlling the Sensitivity of Support Vector Machines. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Workshop ML3, 55-60.
Michaela Denk

EC3 – E-Commerce Competence Center, Donau-City-Str. 1, 1220 Vienna, Austria
michaela.denk@ec3.at
Abstract. Entity identification deals with matching records from different datasets or within one dataset that represent the same real-world entity when unique identifiers are not available. Enabling data integration at record level as well as the detection of duplicates, entity identification plays a major role in data preprocessing, especially concerning data quality. This paper presents a framework for statistical entity identification, in particular focusing on probabilistic record linkage and string matching, and its implementation in R. According to the stages of the entity identification process, the framework is structured into seven core components: data preparation, candidate selection, comparison, scoring, classification, decision, and evaluation. Samples of real-world CRM datasets serve as illustrative examples.
1 Introduction
Ensuring data quality is a crucial challenge in statistical data management, aiming at improved usability and reliability of the data. Entity identification deals with matching records from different datasets or within a single dataset that represent the same real-world entity and, thus, enables data integration at record level as well as the detection of duplicates. Both can be regarded as a means of improving data quality, the former by completing datasets through adding supplementary variables, replacing missing or invalid values, and appending records for additional real-world entities, the latter by resolving data inconsistencies. Unless sophisticated methods are applied, data integration is also a potential source of 'dirty' data: duplicate or incomplete records might be introduced. Besides its contribution to data quality, entity identification is regarded as a means of increasing the efficiency of the usage of available data as well. This is of particular interest in official statistics, where the reduction of the responder burden is a prevailing issue. In general, applications necessitating statistical entity identification (SEI) are found in diverse fields such as data mining, customer relationship management (CRM), bioinformatics, criminal investigations, and official statistics. Various frameworks for entity identification have been proposed (see for example Denk (2006) or Neiling (2004) for an overview), most of them concentrating on particular stages of the process, such as the author's