When the null hypothesis that the spatial correlation parameter ρ is zero is tested in the presence of spatial correlation in the error term, the Wald test has a very good power, although the power is higher with the sparser weighting matrix. The latter characteristic is a general feature of all conducted tests. The power for the spatial error parameter in the presence of a non-zero spatial lag parameter is lower. However, the power of the Wald test in these circumstances is (much) greater than the power achievable by using Lagrange Multiplier tests. In Figure 1c the best performing LM test is plotted, i.e. LM_A. All LM tests relying on OLS residuals fail seriously to detect the true DGP.
Comparable to the performance of the Wald test based on MLE estimates is the Wald test based on GMM estimates, but only in detecting the significant lag parameter in the presence of a significant spatial error parameter. In the reverse case the Wald test using GMM estimates is much worse.
As a further model selection approach, the performance of information criteria is analyzed. The performance of the classical Akaike information criterion and the bias-corrected AICc are almost identical. In Figure 1d the share of cases in which AIC/AICc identifies the correct DGP is plotted on the y-axis. All information criteria fail in more than 15% of the cases to identify the correct, more parsimonious model, i.e. SARAR(1,0) or SARAR(0,1) instead of SARAR(1,1). However, in the remaining experiments (ρ = 0.05, ..., 0.2 or λ = 0.05, ..., 0.2) AIC/AICc is comparable to the performance of the Wald test. BIC performs better than AIC/AICc at detecting SARAR(1,0) or SARAR(0,1) instead of SARAR(1,1), but much worse in the remaining experiments.

In order to be able to propose a general procedure for model selection, the approach must also be suitable if the true DGP is SARAR(1,0) or SARAR(0,1). In this case the Wald test based on the general model again has the appropriate size and a very good power. Further, the sensitivity to different weighting matrices is less severe. However, the power is smallest for the test with the null hypothesis H0: λ = 0 and with distance as weighting scheme W2. The Wald test using GMM estimates is again comparable when testing for the spatial lag parameter but worse when testing for the spatial error parameter.
Not significantly different from the power function of the Wald test based on the general model are the two LM statistics based on OLS residuals. However, in this case LM_A fails to identify the correct DGP.
The Wald test outperforms the information criteria regarding the identification of SARAR(1,0) or SARAR(0,1). If OLS is the DGP, the correct model is chosen only about two thirds of the time by AIC/AICc, but comparably often to the Wald test by BIC. If SARAR(1,0) is the data generating process, all information criteria perform worse than the Wald test, independent of the underlying weighting scheme. If SARAR(0,1) is the data generating process, BIC is worse than the Wald test, and AIC/AICc has a slightly higher performance for small values of the spatial parameter but is outperformed by the Wald test for higher values of the spatial parameters. For the sake of completeness it is noted that no valid model selection can be conducted using likelihood ratio tests.
Fig. 1. a) Power of the Wald test based on the general model and MLE estimates. b) Power of the Wald test based on the general model and GMM estimates. c) Power of the Lagrange Multiplier test using LM_A as test statistic. d) Correct model choice of the better performing information criterion (AIC/AICc).
To conclude, we find that the 'general to specific' approach is the most suitable procedure to identify the correct data generating process (DGP) regarding Cliff-Ord type spatial models. Independent of whether the true DGP is a SARAR(1,1), SARAR(1,0), SARAR(0,1), or just a regression model without any spatial correlation, the general model should be estimated and the Wald tests conducted. The chance to identify the true DGP is then higher compared to the alternative model choice criteria based on the LM tests, LR tests, or on information criteria like AIC, AICc, or BIC.
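As an illustration of the recommended first step, the general SARAR(1,1) model can be fitted by ML in R and both Wald tests read directly off the summary. The sketch below is ours, not the paper's: it assumes the spdep package (newer R versions ship the estimator in spatialreg) and invented data and weight objects.

```r
library(spdep)  # sacsarlm() fits the general SARAR(1,1) model by ML

# 'dat' (data frame) and 'lw' (listw spatial weights object) are assumed.
fit <- sacsarlm(y ~ x1 + x2, data = dat, listw = lw)
summary(fit)  # rho (lag) and lambda (error) with asymptotic z-tests;
              # insignificant parameters point to the nested model to keep
```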
Segmentation and Classification of Hyper-Spectral Skin Data
Hannes Kazianka1, Raimund Leitner2 and Jürgen Pilz1

1 Institute of Statistics, Alpen-Adria-Universität Klagenfurt, Universitätsstraße 65-67, 9020 Klagenfurt, Austria
{hannes.kazianka, juergen.pilz}@uni-klu.ac.at
2 CTR Carinthian Tech Research AG, Europastraße 4/1, 9524 Villach, Austria
Raimund.Leitner@ctr.at
Abstract. Supervised classification methods require reliable and consistent training sets. In image analysis, where class labels are often assigned to the entire image, the manual generation of pixel-accurate class labels is tedious and time consuming. We present an independent component analysis (ICA)-based method to generate these pixel-accurate class labels with minimal user interaction. The algorithm is applied to the detection of skin cancer in hyper-spectral images. Using this approach it is possible to remove artifacts caused by sub-optimal image acquisition. We report on the classification results obtained for the hyper-spectral skin cancer data set with 300 images using support vector machines (SVM) and model-based discriminant analysis (MclustDA, MDA).
1 Introduction
Hyper-spectral images consist of several, up to a hundred, images acquired at different - mostly narrow band and contiguous - wavelengths. Thus, a hyper-spectral image contains pixels represented as multidimensional vectors with elements indicating the reflectivity at a specific wavelength. For a contiguous set of narrow band wavelengths these vectors correspond to spectra in the physical meaning and are equal to spectra measured with, e.g., spectrometers.
Supervised classification of hyper-spectral images requires a reliable and consistent training set. In many applications labels are assigned to the full image instead of to each individual pixel, even if instances of all the classes occur in the image. To obtain a reliable training set it may be necessary to label the images on a pixel by pixel basis. Manually generating pixel-accurate class labels requires a lot of effort; cluster-based automatic segmentation is often sensitive to measurement errors and illumination problems. In the following we present a labelling strategy for hyper-spectral skin cancer data that uses PCA, ICA and K-Means clustering. For the classification of unknown images, we compare support vector machines and model-based discriminant analysis.
Section 2 describes the methods that are used for the labelling approach. The classification algorithms are discussed in Section 3. In Section 4 we present the segmentation and classification results obtained for the skin cancer data set, and Section 5 is devoted to discussions and conclusions.
2 Labelling
Hyper-spectral data are highly correlated and contain noise which adversely affects classification and clustering algorithms. As the dimensionality of the data equals the number of spectral bands, using the full spectral information leads to high computational complexity. To overcome the curse of dimensionality we use PCA to reduce the dimensions of the data, and inherently also unwanted noise. Since different features of the image may have equal score values for the same principal component, an additional feature extraction step is proposed. ICA makes it possible to detect acquisition artifacts like saturated pixels and inhomogeneous illumination. Those effects can be significantly reduced in the spectral information, giving rise to an improved segmentation.
2.1 Principal Component Analysis (PCA)
PCA is a standard method for dimension reduction and can be performed by singular value decomposition. The algorithm gives uncorrelated principal components. We assume that those principal components that correspond to very low eigenvalues contribute only to noise. As a rule of thumb, we chose to retain at least 95% of the variability, which led to selecting 6-12 components.
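As a sketch of this rule of thumb in R (our illustration, not the authors' code; the matrix X and the helper name are assumptions), the number of retained components is the smallest k whose cumulative variance share reaches 95%:

```r
# X is an assumed n x p matrix of pixel spectra (pixels x spectral bands).
pca_reduce <- function(X, var_target = 0.95) {
  pc  <- prcomp(X, center = TRUE)             # PCA via singular value decomposition
  cum <- cumsum(pc$sdev^2) / sum(pc$sdev^2)   # cumulative share of total variance
  k   <- which(cum >= var_target)[1]          # smallest k retaining the target share
  list(scores = pc$x[, 1:k, drop = FALSE], k = k)
}
```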
2.2 Independent Component Analysis (ICA)
ICA is a powerful statistical tool to determine hidden factors of multivariate data. The ICA model assumes that the observed data, x, can be expressed as a linear mixture of statistically independent components, s. The model can be written as

x = As,

where the unknown matrix A is called the mixing matrix. Defining W as the unmixing matrix, we can calculate s as

s = Wx.
As we have already done a dimension reduction, we can assume that noise is negligible and A is square, which implies W = A^{-1}. This significantly simplifies the estimation of A and s. Provided that no more than one independent component has a Gaussian distribution, the model can be uniquely estimated up to scalar multipliers. There exists a variety of different algorithms for fitting the ICA model. In our work we focused on the two most popular implementations, which are based on maximisation of non-Gaussianity and minimisation of mutual information respectively: FastICA and FlexICA.
FastICA

The FastICA algorithm developed by Hyvärinen et al. (2001) uses negentropy, J(y), as a measure of Gaussianity. Since negentropy is zero for Gaussian variables and always nonnegative, one has to maximise negentropy in order to maximise non-Gaussianity. To avoid computational problems the algorithm uses an approximation of negentropy: if G denotes a nonquadratic function and we want to estimate one independent component s, we can approximate

J(y) ≈ [E{G(y)} − E{G(ν)}]²,

where ν is a standardised Gaussian variable and y is an estimate of s. We adopt G(y) = log cosh y since this has been shown to be a good choice. Maximisation directly leads to a fixed-point iteration algorithm that is 20-50 times faster than other ICA implementations. To estimate several independent components a deflationary orthogonalisation method is used.
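In R, a fixed-point FastICA of this kind is available in the fastICA package; the snippet below is a hedged sketch of how the retained PCA scores could be unmixed (the package choice and the variable scores are our assumptions, not the authors' setup):

```r
library(fastICA)  # fixed-point FastICA

# 'scores' is an assumed n x k matrix of retained PCA scores.
ica <- fastICA(scores, n.comp = ncol(scores),
               alg.typ = "deflation",  # deflationary estimation, one IC at a time
               fun     = "logcosh")    # contrast based on G(y) = log cosh y
S <- ica$S  # estimated independent components, one column per IC
```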
FlexICA
Mutual information is a natural measure of the information that members of a set of random variables have on the others. Choi et al. (2000) proposed an ICA algorithm that attempts to minimise this quantity. All independent components are estimated simultaneously using a natural gradient learning rule, with the assumption that the source signals have the generalized Gaussian distribution with density

q_i(y_i) = r_i / (2 σ_i Γ(1/r_i)) · exp( −(1/r_i) |y_i/σ_i|^{r_i} ).

Here r_i denotes the Gaussian exponent, which is chosen in a flexible way depending on the kurtosis of y_i.
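A direct R transcription of this density (our illustration; the function name is hypothetical) makes the role of the Gaussian exponent visible: r = 2 gives a Gaussian-shaped curve, r = 1 a heavier-tailed, Laplacian-like one:

```r
# Generalized Gaussian density as reconstructed above (our transcription).
ggauss <- function(y, r, sigma = 1) {
  r / (2 * sigma * gamma(1 / r)) * exp(-(1 / r) * abs(y / sigma)^r)
}
curve(ggauss(x, r = 2), -4, 4)              # Gaussian exponent r = 2
curve(ggauss(x, r = 1), -4, 4, add = TRUE)  # super-Gaussian source model
```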
2.3 Two-Stage K-Means clustering
From a statistical point of view it may be inappropriate to use K-means clustering, since K-means cannot use all the higher order information that ICA provides. There are several approaches that avoid using K-means; for example, Shah et al. (2004) proposed the ICA mixture model (ICAMM). However, for large images this algorithm fails to converge. We developed a 2-stage K-means clustering strategy that works particularly well with skin data; an R sketch follows the steps below. The choice of 5 resp. 3 clusters for the K-means algorithm has been determined empirically for the skin cancer data set.

1. Drop ICs that contain a high amount of noise or correspond to artifacts.
2. Perform K-means clustering with 5 clusters.
3. Those clusters that correspond to healthy skin are taken together into one cluster. This cluster is labelled as skin.
4. Perform a second run of K-means clustering on the remaining clusters (inflamed skin, lesion, etc.). This time use 3 clusters. Label the clusters that correspond to the mole and melanoma centre as mole and melanoma. The remaining clusters are considered to be 'regions of uncertainty'.
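A minimal R sketch of these steps (our illustration; S, skin_ids and the function name are assumptions, and the mapping of second-stage clusters to labels is done by inspection as in step 4):

```r
# S        : n x k matrix of retained independent components (pixels in rows)
# skin_ids : first-stage cluster numbers identified as healthy skin
two_stage_labels <- function(S, skin_ids, k1 = 5, k2 = 3) {
  stage1  <- kmeans(S, centers = k1, nstart = 10)   # step 2: 5 clusters
  is_skin <- stage1$cluster %in% skin_ids           # step 3: merge into 'skin'
  labels  <- rep("skin", nrow(S))
  stage2  <- kmeans(S[!is_skin, , drop = FALSE],    # step 4: re-cluster the rest
                    centers = k2, nstart = 10)
  labels[!is_skin] <- paste0("cluster_", stage2$cluster)
  labels  # cluster_1..cluster_3 are mapped to mole/melanoma/uncertainty by hand
}
```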
3 Classification
This section describes the classification methods that have been investigated. The preprocessing steps for the training data are the same as in the segmentation task: dimension reduction using PCA and feature extraction performed by ICA. Using the Bayesian Information Criterion (BIC), the data were reduced to 6 dimensions.
3.1 Mixture Discriminant Analysis (MDA)
MDA assumes that each class j can be modelled as a mixture of R_j subclasses. The subclasses have a multivariate Gaussian distribution with mean vector μ_jr, r = 1, ..., R_j, and covariance matrix Σ, which is the same for all classes. Hence, the mixture model for class j has the density

P(X | j) = ∑_{r=1}^{R_j} π_jr φ(X; μ_jr, Σ),

where the π_jr are mixing proportions that sum to one and φ denotes the multivariate Gaussian density. Maximum likelihood estimation of the model parameters can be carried out, as Hastie et al. (2001) suggest, using optimal scoring. It is also possible to use flexible discriminant analysis (FDA) or penalized discriminant analysis (PDA) in combination with MDA. The major drawback of this classification approach is that, similar to LDA, which is also described in Hastie et al. (2001), the covariance matrix is fixed for all classes and the number of subclasses for each class has to be set in advance.
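In R, this model is available through the mda package of Hastie and Tibshirani; the call below is a sketch under the assumption that train is a data frame holding the six ICA features and a class factor (object names are ours):

```r
library(mda)  # mixture discriminant analysis with optimal scoring

# 'train'/'test' are assumed data frames with a 'class' factor
# (skin / mole / melanoma); three subclasses per class as in the text.
fit  <- mda(class ~ ., data = train, subclasses = 3)
pred <- predict(fit, newdata = test)  # predicted class labels
```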
discrim-3.2 Model-based Discriminant Analysis (MclustDA)
MclustDA, proposed by Fraley and Raftery (2002), extends MDA in a way that the covariance in each class is parameterized using the eigenvalue decomposition

Σ_r = λ_r D_r A_r D_r^T, r = 1, ..., R_j.

The volume of the component is controlled by λ_r, A_r defines the shape, and D_r is responsible for the orientation. The model selection is done using the BIC, and the maximum likelihood estimation is performed by an EM-algorithm.
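A corresponding R sketch uses the mclust package of Fraley and Raftery (object names are our assumptions); BIC-based selection of the covariance model and of the number of subclasses happens inside the fitting call:

```r
library(mclust)  # model-based discriminant analysis (MclustDA)

# 'X'/'y' are assumed training features and labels, 'X_new' new pixels.
fit  <- MclustDA(X, class = y)  # EM fit; BIC picks model and subclass number
pred <- predict(fit, newdata = X_new)$classification
summary(fit)  # shows the chosen covariance model (e.g. VVV) per class
```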
3.3 Support Vector Machines (SVM)
The aim of support vector machines is to find a hyperplane that optimally separates two classes in a high-dimensional feature space induced by a Mercer kernel K(x, z). In the L2-norm case the Lagrangian dual problem is to find α* that solves the following convex optimization problem:

max_α ∑_{i=1}^m α_i − (1/2) ∑_{i=1}^m ∑_{j=1}^m α_i α_j y_i y_j ( K(x_i, x_j) + δ_ij/C )

subject to ∑_{i=1}^m α_i y_i = 0, α_i ≥ 0,

where the x_i are training points belonging to classes y_i and δ_ij denotes the Kronecker delta. The cost parameter C and the kernel function have to be chosen to suit the problem. It is also possible to use different cost parameters for unbalanced data, as was suggested by Veropoulos et al. (1999).
Although SVMs were originally designed as binary classifiers, there exists a variety of methods to extend them to k > 2 classes. In our work we focused on one-against-all and one-against-one SVMs. The one-against-all formulation trains each class against all remaining classes, resulting in k binary SVMs. The one-against-one formulation uses k(k−1)/2 SVMs, each separating one class from another.
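Both multiclass schemes and per-class costs are available in R through the e1071 interface to libsvm; the following sketch (package choice and object names are ours, not the authors') uses class weights, which scale the cost C per class, to realise the unbalanced setting of Veropoulos et al. (1999):

```r
library(e1071)  # libsvm interface; multiclass via one-against-one

# 'train' is an assumed data frame of ICA features with a 'class' factor.
fit <- svm(class ~ ., data = train,
           kernel = "polynomial", degree = 3,
           cost = 0.1,
           class.weights = c(melanoma = 5, mole = 1, skin = 1))
pred <- predict(fit, newdata = test)
```

With cost = 0.1, the melanoma weight of 5 corresponds to C_melanoma = 0.5 and C_mole = C_skin = 0.1, the setting reported as best in Section 4.2.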
4 Results
A set of 310 hyper-spectral images (512 × 512 pixels, 300 spectral bands) of malignant and benign lesions was taken in clinical studies at the Medical University Graz, Austria. The lesions were classified as melanoma or mole by human experts on the basis of a histological examination. However, in our survey we distinguish between three classes, melanoma, mole and skin, since all these classes typically occur in the images. The segmentation task is especially difficult in this application: we have to take into account that melanoma typically occurs in combination with mole. To reduce the number of outliers in the training set we define a 'region of uncertainty' as a transition region between the kernels of mole and melanoma and between the lesion and the skin.
4.1 Training
Figures 1(b) and 1(c) display the first step of the K-Means strategy described in Section 2.3. The original image displayed in Figure 1(a) shows a mole that is located in the middle of a hand. For PCA-transformed data, as in Figure 1(b), the algorithm performs poorly and the classes do not correspond to lesion, mole and skin regions (left and bottom). Even the lesion is in the same class together with an illumination problem. If the data is also transformed using ICA, as in Figure 1(c), the lesion is already identified and there exists a second class in the form of a ring around the lesion, which is the desired 'region of uncertainty'. The other classes correspond to wrinkles on the hand.

Figure 1(d) shows the second K-Means step for the PCA-transformed data. Although the second K-Means step makes it possible to separate the lesion from the illumination problem, it can be seen that the class that should correspond to the kernel of the mole is too large. Instances from other classes are present in the kernel. The second K-Means step with the ICA-preprocessed data is shown in Figure 1(e). Not only is the kernel reliably detected, but there also exists a transition region consisting of two classes. One class contains the border of the lesion. The second class separates the kernel from the remaining part of the mole.
Fig. 1. The two iteration steps of the K-Means approach for both PCA ((b) and (d)) and ICA ((c) and (e)) are displayed together with the original image (a). The different gray levels indicate the cluster the pixel has been assigned to.

We believe that the FastICA algorithm is the most appropriate ICA implementation for this segmentation task. The segmentation quality for both methods is very similar; however, the FastICA algorithm is faster and more stable.

To generate a training set of 12,000 pixel spectra per class we labelled 60 mole images and 17 melanoma images using our labelling approach. The pixels in the training set are chosen randomly from the segmented images.
4.2 Classification
In Table 1 we present the classification results obtained for the different classifiers described in Section 3. As a test set we use 57 melanoma and 253 mole images. We use the output of the LDA classifier as a benchmark.
LDA turns out to be the worst classifier for the recognition of moles: nearly one half of the mole images are misclassified as melanoma. On the other hand, LDA yields excellent results for the classification of melanoma, giving rise to the presumption that there is a large bias towards the melanoma class. With MDA we use three subclasses in each class. Although both MDA and LDA keep the covariance fixed, MDA models the data as a mixture of Gaussians, leading to a significantly higher recognition rate compared to LDA. Using FDA or PDA in combination with MDA does not improve the results. MclustDA performs best among these classifiers. Notice, however, that BIC overestimates the number of subclasses in each class, which is between 14 and 21. For all classes the model with varying shape, varying volume and varying orientation of the mixture components is chosen. This extra flexibility makes it possible to outperform MDA even though only half of the training points could be used due to memory limitations. Another significant advantage of MclustDA is its speed, taking around 20 seconds for a full image.

Table 1. Recognition rates obtained for the different classifiers

FlexICA   Melanoma   84.5%   86.5%   56.1%
          Mole       89.4%   89.4%   98.2%
FastICA   Melanoma   84.5%   87.7%   56.1%
          Mole       89.4%   89.4%   98.2%

FlexICA   Melanoma   72.7%   69.9%   87.7%
          Mole       92.9%   94.7%   89.4%
Since misclassification of melanoma into the mole class is less favourable than misclassification of mole into the melanoma class, we clearly have unbalanced data in the skin cancer problem. According to Veropoulos et al. (1999) we can choose C_melanoma > C_mole = C_skin. We obtain the best results using the polynomial kernel of degree three with C_melanoma = 0.5 and C_mole = C_skin = 0.1. This method is clearly superior when compared with the other SVM approaches. For the one-against-all (OAA-SVM) and the one-against-one (OAO-SVM) formulations we use Gaussian kernels with C = 2 and σ = 20. A drawback of all the SVM classifiers, however, is that training takes 20 hours (Centrino Duo 2.17GHz, 2GB RAM) and classification of a full image takes more than 2 minutes.
We discovered that different ICA implementations have no significant impact on the quality of the classification output. FlexICA performs slightly better for the unbalanced SVM and the one-against-all SVM. FastICA gives better results for MclustDA. For all other classifiers the performances are equal.
5 Conclusion
The combination of PCA and ICA makes it possible to detect both artifacts and the lesion in hyper-spectral skin cancer data. The algorithm projects the corresponding features on different independent components; dropping the independent components that correspond to the artifacts and applying a 2-stage K-Means clustering leads to a reliable segmentation of the images. It is interesting to note that for the mole images in our study there is always one single independent component that carries the information about the whole lesion. This suggests a very simple segmentation in the case where the skin is healthy: keep the single independent component that contains the desired information and perform the K-Means steps. For melanoma images the spectral information about the lesion is contained in at least two independent components, leading to reliable separation of the melanoma kernel from the mole kernel.

Unbalanced SVM and MclustDA yield equally good classification results; however, because of its computational performance MclustDA is the best classifier for the skin cancer data in terms of overall accuracy.
The presented segmentation and classification approach does not use any spatial information. In future research Markov random fields and contextual classifiers could be used to take into account the spatial context.
In a possible application, where the physician is assisted by a system which pre-screens patients, we have to take care of high sensitivity, which is typically accompanied by a loss in specificity. Preliminary experiments showed that a sensitivity of 95% is possible at the cost of 20% false-positives.
References
ABE, S. (2005): Support Vector Machines for Pattern Classification. Springer, London.
CHOI, S., CICHOCKI, A. and AMARI, S. (2000): Flexible Independent Component Analysis. Journal of VLSI Signal Processing, 26(1/2), 25-38.
FRALEY, C. and RAFTERY, A. (2002): Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 97, 611-631.
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001): The Elements of Statistical Learning. Springer, New York.
HYVÄRINEN, A., KARHUNEN, J. and OJA, E. (2001): Independent Component Analysis. Wiley, New York.
SHAH, C., ARORA, M. and VARSHNEY, P. (2004): Unsupervised classification of hyper-spectral data: an ICA mixture model based approach. International Journal of Remote Sensing, 25, 481-487.
VEROPOULOS, K., CAMPBELL, C. and CRISTIANINI, N. (1999): Controlling the Sensitivity of Support Vector Machines. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Workshop ML3, 55-60.
Michaela Denk

EC3 – E-Commerce Competence Center, Donau-City-Str. 1, 1220 Vienna, Austria
michaela.denk@ec3.at
Abstract. Entity identification deals with matching records from different datasets or within one dataset that represent the same real-world entity when unique identifiers are not available. Enabling data integration at record level as well as the detection of duplicates, entity identification plays a major role in data preprocessing, especially concerning data quality. This paper presents a framework for statistical entity identification, in particular focusing on probabilistic record linkage and string matching, and its implementation in R. According to the stages of the entity identification process, the framework is structured into seven core components: data preparation, candidate selection, comparison, scoring, classification, decision, and evaluation. Samples of real-world CRM datasets serve as illustrative examples.
1 Introduction
Ensuring data quality is a crucial challenge in statistical data management, aiming at improved usability and reliability of the data. Entity identification deals with matching records from different datasets or within a single dataset that represent the same real-world entity and, thus, enables data integration at record level as well as the detection of duplicates. Both can be regarded as a means of improving data quality, the former by completing datasets through adding supplementary variables, replacing missing or invalid values, and appending records for additional real-world entities, the latter by resolving data inconsistencies. Unless sophisticated methods are applied, data integration is also a potential source of 'dirty' data: duplicate or incomplete records might be introduced. Besides its contribution to data quality, entity identification is regarded as a means of increasing the efficiency of the usage of available data as well. This is of particular interest in official statistics, where the reduction of the responder burden is a prevailing issue. In general, applications necessitating statistical entity identification (SEI) are found in diverse fields such as data mining, customer relationship management (CRM), bioinformatics, criminal investigations, and official statistics. Various frameworks for entity identification have been proposed (see for example Denk (2006) or Neiling (2004) for an overview), most of them concentrating on particular stages of the process, such as the author's