Quantitative Methods and Applications in GIS - Chapter 7


Principal Components, Factor, and Cluster Analyses, and Application in Social Area Analysis

This chapter discusses three important multivariate statistical analysis methods: principal components analysis (PCA), factor analysis (FA), and cluster analysis (CA). PCA and FA are often used together for data reduction by structuring many variables into a limited number of components (factors). The techniques are particularly useful for eliminating variable collinearity and uncovering latent variables. Applications of the methods are widely seen in socioeconomic studies (also see case study 8 in Section 8.4). While PCA and FA group variables, CA classifies many observations into categories according to similarity among their attributes. In other words, given a dataset as a table, PCA and FA reduce the number of columns and CA reduces the number of rows.

Social area analysis is used to illustrate the techniques, as it employs all three methods. The interpretation of social area analysis results also leads to a review and comparison of three classic models of urban structure, namely, the concentric zone model, the sector model, and the multinuclei model. The analysis demonstrates how analytical statistical methods synthesize descriptive models into one framework. Beijing, the capital city of China, on the verge of forming its social areas after decades under a socialist regime, is chosen as the study area for a case study. Usage of GIS in this case study is limited to mapping for spatial patterns.

Section 7.1 discusses principal components and factor analysis. Section 7.2 explains cluster analysis. Section 7.3 reviews social area analysis. A case study on the social space in Beijing is presented in Section 7.4 to provide a new perspective on the fast-changing urban structure in China. The chapter concludes with a discussion and brief summary in Section 7.5.

7.1 PRINCIPAL COMPONENTS AND FACTOR ANALYSIS

Principal components and factor analysis are often used together for data reduction. Benefits of this approach include uncovering latent variables for easy interpretation and removing multicollinearity for subsequent regression analysis. In many socioeconomic applications, variables extracted from census data are often correlated with each other, and thus contain duplicated information to some extent. Principal components and factor analysis use fewer factors to represent the original variables, and thus simplify the structure for analysis. Resulting component or factor scores are uncorrelated with each other (if not rotated or orthogonally rotated), and thus can be used as explanatory variables in regression analysis.

Despite the commonalities, principal components and factor analysis are “both conceptually and mathematically very different” (Bailey and Gatrell, 1995, p. 225). Principal components analysis uses the same number of variables (components) to simply transform the original data, and thus is a mathematical transformation (strictly speaking, not a statistical operation). Factor analysis uses fewer variables (factors) to capture most of the variation among the original variables (with error terms), and thus is a statistical analysis process. Principal components analysis attempts to explain the variance of observed variables, whereas factor analysis intends to explain their intercorrelations (Hamilton, 1992, p. 252). In many applications (as in ours), the two methods are used together. In SAS, principal components analysis is offered as an option under the procedure for factor analysis.

7.1.1 Principal Components Factor Model

In formula, principal components analysis (PCA) transforms original data on $K$ observed variables $Z_k$ to data on $K$ principal components $F_k$ that are independent from (uncorrelated with) each other:

$$Z_k = l_{k1}F_1 + l_{k2}F_2 + \cdots + l_{kj}F_j + \cdots + l_{kK}F_K \tag{7.1}$$

Retaining only the first $J$ components ($J < K$) and collecting the discarded components in a residual term $v_k$ gives the principal components factor analysis (PCFA) model:

$$Z_k = l_{k1}F_1 + l_{k2}F_2 + \cdots + l_{kJ}F_J + v_k \tag{7.2}$$

In a true factor analysis (FA), the residual (error) term, denoted as $u_k$ to distinguish it from $v_k$ in a PCFA, is unique to each variable $Z_k$:

$$Z_k = l_{k1}F_1 + l_{k2}F_2 + \cdots + l_{kJ}F_J + u_k \tag{7.3}$$

The $u_k$ are termed unique factors (in contrast to common factors $F_j$). In the PCFA, the residual $v_k$ is a linear combination of the discarded components ($F_{J+1}, \ldots, F_K$) and thus cannot be uncorrelated like the $u_k$ in a true FA (Hamilton, 1992, p. 252).

7.1.2 Factor Loadings, Factor Scores, and Eigenvalues

For convenience, the original data of observed variables $Z_k$ are first standardized prior to the PCA and FA, and the initial values for components (factors) are also standardized. When both $Z_k$ and $F_j$ are standardized, the $l_{kj}$ in Equations 7.1 and 7.2 are standardized coefficients in the regression of variables $Z_k$ on components (factors) $F_j$, also termed factor loadings. For example, $l_{k1}$ is the loading of variable $Z_k$ on standardized component $F_1$. Factor loading reflects the strength of relations between variables and components.

Conversely, the components $F_j$ can be reexpressed as a linear combination of the original variables $Z_k$:

$$F_j = a_{1j}Z_1 + a_{2j}Z_2 + \cdots + a_{Kj}Z_K \tag{7.4}$$

Estimates of these components (factors) are termed factor scores. Estimates of $a_{kj}$ are factor score coefficients, i.e., coefficients in the regression of factors on variables.

The components $F_j$ are constructed to be uncorrelated with each other and are ordered such that the first component $F_1$ has the largest sample variance ($\lambda_1$), $F_2$ the second largest, and so on. The variances $\lambda_j$ corresponding to various components are termed eigenvalues, and $\lambda_1 > \lambda_2 > \cdots$.

Since standardized variables have variances of 1, the total variance of all variables also equals the number of variables, such as

$$\lambda_1 + \lambda_2 + \cdots + \lambda_K = K \tag{7.5}$$

Therefore, the proportion of total variance explained by the jth component is $\lambda_j / K$.

Eigenvalues provide a basis for judging which components (factors) are important and which are not, and thus deciding how many components to retain. One may also follow a rule of thumb that only eigenvalues greater than 1 are important (Griffith and Amrhein, 1997, p. 169). Since the variance of each standardized variable is 1, a component with $\lambda < 1$ accounts for less than an original variable’s variation, and thus does not serve the purpose of data reduction.

The eigenvalue-1 rule is arbitrary. A scree graph plots eigenvalues against component (factor) number and provides more useful guidance (Hamilton, 1992, p. 258). For example, Figure 7.1 shows the scree graph of eigenvalues in a case of 14 components (using the result from case study 7 in Section 7.4). The graph levels off after component 4, indicating that components 5 to 14 account for relatively little additional variance. Therefore, four components may be retained as principal components.

Outputs from statistical analysis software such as SAS include important information, such as factor loadings, eigenvalues, and proportions (of total variance). Factor scores can be saved in a predefined external file. The factor analysis procedure in SAS also outputs a correlation matrix between the observed variables for analysts to examine their relations.
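As a minimal sketch of the eigenvalue relationships in this subsection, the following Python snippet works through the two-variable case, where the eigenvalues of the 2 × 2 correlation matrix have the closed form 1 + |r| and 1 − |r|. It checks that the eigenvalues sum to K, computes the proportions λj/K, and applies the eigenvalue-1 rule. The sample values and all function names are assumptions made for illustration; they are not data or procedures from the case study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def eigenvalues_two_variables(r):
    """Eigenvalues of the correlation matrix [[1, r], [r, 1]], largest first."""
    return [1 + abs(r), 1 - abs(r)]

def explained_proportions(eigenvalues):
    """Proportion of total variance per component: lambda_j / K."""
    k = len(eigenvalues)
    return [lam / k for lam in eigenvalues]

def retained_by_eigenvalue_rule(eigenvalues):
    """Number of components whose eigenvalue is greater than 1."""
    return sum(1 for lam in eigenvalues if lam > 1)

# Hypothetical sample values for two correlated variables (made up for this sketch).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = pearson_r(x, y)
lams = eigenvalues_two_variables(r)
print("correlation:", r)
print("eigenvalues:", lams, "sum:", sum(lams))
print("proportions:", explained_proportions(lams))
print("retained by eigenvalue-1 rule:", retained_by_eigenvalue_rule(lams))
```

With more than two variables the eigenvalues no longer have such a simple closed form, and a numerical eigenvalue routine from a statistics or linear algebra package would normally be used instead.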

7.1.3 Rotation

Initial results from PCFA are often hard to interpret as variables load across factors. While fitting the data equally well, rotation generates simpler structure and more interpretable factors by maximizing the loading (positive or negative) of each variable on one factor and minimizing the loadings on the others. As a result, we can detect which factor (latent variable) captures the information contained in which observed variables, and subsequently label the factors adequately.

Orthogonal rotation generates independent (uncorrelated) factors, an important property for many applications. A widely used orthogonal rotation method is Varimax rotation, which maximizes the variance of the squared loadings for each factor, and thus polarizes loadings (either high or low on factors). Varimax rotation is often the rotation technique used in social area analysis. Oblique rotation (e.g., promax rotation) generates even greater polarization, but allows correlation between factors. In SAS, an option is provided to specify which rotation to use.

As a summary, Figure 7.2 illustrates the process of PCFA:

1. The original dataset of $K$ observed variables with $n$ records is first standardized to a dataset of Z scores with the same number of variables.
2. PCA transforms the $K$ standardized variables into $K$ uncorrelated components, with their component loadings.
3. PCFA retains only the first $J$ components ($J < K$) as factors.
4. A rotation method is used to load each variable strongly on one factor (and near zero on the others) for easier interpretation.

FIGURE 7.1 Scree graph for principal components analysis. (Horizontal axis: component, 1 to 14.)

The SAS procedure for factor analysis (FA) is FACTOR, which also reports the principal components analysis (PCA) results preceding those of FA. The following sample SAS statements implement the factor analysis that uses four factors to capture the structure of 14 variables, x1 through x14, and adopts the Varimax rotation technique:

    proc factor out=FACTSCORE(replace=yes)   /* save factor scores to FACTSCORE */
        nfact=4 rotate=varimax;              /* four factors, Varimax rotation */
        var x1-x14;                          /* the 14 input variables */
    run;

The SAS data set FACTSCORE has the factor scores, which can be saved to an external file. Note that a SAS program is not case sensitive.
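Step 1 of the PCFA process summarized above, standardizing each observed variable to Z scores, can be sketched in Python as follows. This is an illustration under stated assumptions: the records are made-up numbers, the function names are hypothetical, and the sample standard deviation (divisor n − 1) is assumed.

```python
from math import sqrt

def z_scores(values):
    """Standardize one variable to mean 0 and sample variance 1."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (divisor n - 1); an assumption of this sketch.
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def standardize_table(rows):
    """Standardize every column of a table given as a list of equal-length rows."""
    columns = list(zip(*rows))
    z_columns = [z_scores(col) for col in columns]
    return [list(row) for row in zip(*z_columns)]

# Hypothetical records with two variables each (made-up numbers).
records = [[10.0, 200.0], [12.0, 180.0], [14.0, 260.0], [16.0, 240.0]]
z_table = standardize_table(records)
for row in z_table:
    print(row)
```

The standardized table keeps the same number of rows and columns as the input; only the scale of each column changes, so every variable contributes a variance of 1 to the total.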

7.2 CLUSTER ANALYSIS

Cluster analysis (CA) groups observations according to similarity among their attributes. As a result, the observations within a cluster are more similar than observations between clusters, as measured by the clustering criterion. Note the difference between CA and another similar multivariate analysis technique, discriminant function analysis (DFA). Both group observations into categories based on the characteristic variables. Categories are unknown in CA but known in DFA. See Appendix 7A for further discussion on DFA.

Geographers have a long-standing interest in cluster analysis (CA), which has been developed in applications such as regionalization and city classification. In the case of social area analysis, cluster analysis is used to further analyze the results from factor analysis (i.e., factor scores of various components across space) and group areas into different types of social areas.

A key element in deciding assignment of observations to clusters is distance, measured in various ways. The most commonly used distance measure is Euclidean distance:

$$d_{ij} = \left( \sum_{k=1}^{K} (x_{ik} - x_{jk})^2 \right)^{1/2}$$

where $x_{ik}$ and $x_{jk}$ are the kth variable values of the $K$-dimensional observations for individuals i and j. When $K = 2$, Euclidean distance is simply the straight-line distance between observations i and j in a two-dimensional space. Like the various distance measures discussed in Chapter 2, distance measures here also include Manhattan (or city block) distance and others (e.g., Minkowski distance, Canberra distance) (Everitt et al., 2001, p. 40).

[Figure 7.2: diagram of the PCFA process. Recoverable labels: component loadings; K components; (3) PCFA; J factors (J < K); (4) rotation.]

The most widely used clustering methods are the agglomerative hierarchical methods (AHMs). The methods produce a series of groupings: the first consists of single-member clusters, and the last consists of a single cluster of all members. The results of these algorithms can be summarized with a dendrogram, a tree diagram showing the history of the sequential grouping process. See Figure 7.3 for the example illustrated below. In the diagram, the clusters are nested and each cluster is a member of a larger, higher-level cluster.

For illustration, an example is used to explain a simple AHM, the single-linkage method or the nearest-neighbor method. Consider a dataset of four observations with the following distance matrix $D_1$:

[Distance matrix $D_1$ for observations 1 to 4; entries not recoverable from the source.]

The smallest nonzero entry in the above matrix $D_1$ is (2 → 1) = 3, and therefore individuals 1 and 2 are grouped together to form the first cluster C1. Distances between this cluster and the other two individuals are defined according to the nearest-neighbor criterion:

$$d_{(12)3} = \min\{d_{13}, d_{23}\}, \qquad d_{(12)4} = \min\{d_{14}, d_{24}\}$$

A new matrix $D_2$ is now obtained with cells representing distances between cluster C1 and individuals 3 and 4, or between individuals 3 and 4:

[Distance matrix $D_2$ for cluster C1 and observations 3 and 4; entries not recoverable from the source.]

The smallest nonzero entry in $D_2$ is (4 → 3) = 4, and thus individuals 3 and 4 are grouped to form a cluster C2. Finally, clusters C1 and C2 are grouped together, with distance equal to 5, to form one cluster C3 containing all four members. The process is summarized in a dendrogram in Figure 7.3, where the height represents the distance at which each fusion is made.

FIGURE 7.3 Dendrogram for the clustering analysis example. (Vertical axis: distance; horizontal axis: data points; clusters C1, C2, and C3.)
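The single-linkage procedure walked through above can be sketched in a few lines of Python. The one-dimensional coordinates below are hypothetical: they were chosen only so that the three fusions occur at distances 3, 4, and 5, as in the text, and they are not the book's data. The function names are likewise assumptions of this sketch.

```python
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two K-dimensional observations."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Manhattan (city block) distance between two observations."""
    return sum(abs(a - b) for a, b in zip(p, q))

def single_linkage(points, dist=euclidean):
    """Agglomerate clusters by nearest-neighbor distance.

    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    clusters = [(label,) for label in sorted(points)]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Nearest-neighbor criterion: minimum distance over member pairs.
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return history

# Hypothetical coordinates for observations 1 to 4 (not the book's data).
points = {1: (0.0,), 2: (3.0,), 3: (8.0,), 4: (12.0,)}
history = single_linkage(points)
for members_a, members_b, d in history:
    print(members_a, "+", members_b, "at distance", d)
```

Passing `dist=manhattan` swaps in the city block distance without changing the clustering logic, which keeps the distance measure and the linkage rule as separate choices.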

Similarly, the complete linkage (farthest-neighbor) method uses the maximum distance between pairs of objects (one in one cluster and one in the other); the average linkage method uses the average distance between pairs of objects; and the centroid method uses squared Euclidean distance between individuals and cluster means (centroids).
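To make the contrast between the linkage rules concrete, the sketch below computes the distance between two clusters under the single, complete, and average linkage criteria. The pairwise distances and the helper names are hypothetical, chosen for illustration only.

```python
def pairwise(cluster_a, cluster_b, d):
    """All distances between one member of each cluster, looked up in d."""
    return [d[frozenset((a, b))] for a in cluster_a for b in cluster_b]

def single_link(cluster_a, cluster_b, d):
    """Nearest-neighbor distance: the minimum over all member pairs."""
    return min(pairwise(cluster_a, cluster_b, d))

def complete_link(cluster_a, cluster_b, d):
    """Farthest-neighbor distance: the maximum over all member pairs."""
    return max(pairwise(cluster_a, cluster_b, d))

def average_link(cluster_a, cluster_b, d):
    """Average distance over all member pairs."""
    values = pairwise(cluster_a, cluster_b, d)
    return sum(values) / len(values)

# Hypothetical pairwise distances among four observations (made-up values).
d = {
    frozenset((1, 2)): 3.0, frozenset((1, 3)): 8.0, frozenset((1, 4)): 12.0,
    frozenset((2, 3)): 5.0, frozenset((2, 4)): 9.0, frozenset((3, 4)): 4.0,
}
c1, c2 = (1, 2), (3, 4)
print("single:", single_link(c1, c2, d))
print("complete:", complete_link(c1, c2, d))
print("average:", average_link(c1, c2, d))
```

The same pair of clusters is separated by three different distances depending on the rule, which is why the choice of linkage can change the order of fusions.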

Another commonly used AHM is Ward’s method. The objective at each stage is to minimize the increase in the total within-cluster error sum of squares given by

$$E = \sum_{c} E_c$$

where

$$E_c = \sum_{i=1}^{n_c} \sum_{k=1}^{K} (x_{ck,i} - \bar{x}_{ck})^2$$

in which $x_{ck,i}$ is the value for the kth variable for the ith observation in the cth cluster, $\bar{x}_{ck}$ is the mean of the kth variable in the cth cluster, and $n_c$ is the number of observations in the cth cluster.

Each clustering method has its advantages and disadvantages. A desirable clustering should produce clusters of similar size, densely located, compact in shape, and internally homogeneous (Griffith and Amrhein, 1997, p. 217). The single-linkage method tends to produce unbalanced and straggly clusters and should be avoided in most cases. If outliers are a major concern, the centroid method should be used. If compactness of clusters is a primary objective, the complete linkage method should be used. Ward’s method tends to find same-size and spherical clusters and is recommended if no single overriding property is desired (Griffith and Amrhein, 1997, p. 220). The case study in this chapter also uses Ward’s method.
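Ward’s criterion described above can be illustrated with a short Python sketch that computes the within-cluster error sum of squares and picks the merger that increases the total the least. The observations are made-up two-variable points and the function names are assumptions; this shows a single merge decision, not a full clustering routine.

```python
def cluster_ess(cluster):
    """Error sum of squares of one cluster: squared deviations from the
    cluster mean, summed over all observations and all variables."""
    n = len(cluster)
    k = len(cluster[0])
    means = [sum(obs[v] for obs in cluster) / n for v in range(k)]
    return sum((obs[v] - means[v]) ** 2 for obs in cluster for v in range(k))

def total_ess(clusters):
    """Total within-cluster error sum of squares over all clusters."""
    return sum(cluster_ess(c) for c in clusters)

def merge_cost(cluster_a, cluster_b):
    """Increase in the total error sum of squares if the two clusters merge."""
    merged = cluster_a + cluster_b
    return cluster_ess(merged) - cluster_ess(cluster_a) - cluster_ess(cluster_b)

def best_ward_merge(clusters):
    """Return (i, j, cost) for the pair whose merger raises the total least."""
    candidates = [
        (merge_cost(clusters[i], clusters[j]), i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    ]
    cost, i, j = min(candidates)
    return i, j, cost

# Hypothetical two-variable observations already grouped into three clusters.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(3.0, 0.0)], [(10.0, 10.0)]]
i, j, cost = best_ward_merge(clusters)
print("total error sum of squares before merging:", total_ess(clusters))
print("merge clusters", i, "and", j, "with increase", cost)
```

Repeating this decision until one cluster remains gives the full hierarchy; a production analysis would normally rely on a tested statistical package rather than this quadratic search.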

The choice for the number of clusters depends on objectives of specific applications. Similar to the selection of factors based on the eigenvalues in factor analysis, one may also use a scree plot to assist in the decision. In the case of Ward’s method, a graph of $R^2$ vs. the number of clusters helps choose the number, beyond which little more homogeneity is attained by further mergers.

In SAS, the procedure CLUSTER implements the cluster analysis and the procedure TREE generates the dendrogram. The following sample SAS statements use Ward’s method for clustering and cut off the dendrogram at nine clusters:

    proc cluster method=ward outtree=tree;
        id subdist_id;          /* variable for labeling ids */
        var factor1-factor4;    /* variables used */
    run;
    proc tree out=bjcluster ncl=9;
        id subdist_id;
    run;
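The idea of plotting R² against the number of clusters can be sketched as follows. The chapter does not spell out the formula, so this sketch assumes the common definition R² = 1 − (within-cluster sum of squares) / (total sum of squares); the one-variable observations, the nested partitions, and the function names are all hypothetical.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean of a one-variable sample."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def r_squared(clusters):
    """Assumed definition: 1 - within-cluster sum of squares / total sum of squares."""
    all_values = [v for c in clusters for v in c]
    within = sum(sum_sq(c) for c in clusters)
    return 1 - within / sum_sq(all_values)

# Hypothetical nested partitions of six one-variable observations.
partitions = {
    1: [[1.0, 2.0, 3.0, 11.0, 12.0, 13.0]],
    2: [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]],
    3: [[1.0, 2.0], [3.0], [11.0, 12.0, 13.0]],
}
r2_by_count = {n: r_squared(c) for n, c in partitions.items()}
for n_clusters in sorted(r2_by_count):
    print(n_clusters, "clusters: R^2 =", round(r2_by_count[n_clusters], 3))
```

In this made-up example nearly all of the gain in R² comes from moving from one cluster to two, so the curve flattens after two clusters, which is the kind of pattern the scree-style graph is meant to reveal.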

7.3 SOCIAL AREA ANALYSIS

The social area analysis was developed by Shevky and Williams (1949) in a study of Los Angeles and was later elaborated on by Shevky and Bell (1955) in a study of San Francisco. The basic thesis is that the changing social differentiation of society leads to residential differentiation within cities. The studies classified census tracts into types of social areas based on three basic constructs: economic status (social rank), family status (urbanization), and segregation (ethnic status). Originally the three constructs were measured by six variables: economic status was captured by occupation and education; family status by fertility, women’s labor participation, and single-family houses; and ethnic status by percentage of minorities (Cadwallader, 1996, p. 135). In a factor analysis, an idealized factor loadings matrix probably looks like Table 7.1. Subsequent studies using a large number and variety of measures generally confirmed the validity of the three constructs (Berry, 1972, p. 285; Hartshorn, 1992, p. 235).

Geographers made an important advancement in social area analysis by analyzing the spatial patterns associated with these dimensions (e.g., Rees, 1970; Knox, 1987). The socioeconomic status factor tends to exhibit a sector pattern: tracts with high values for variables, such as income and education, form one or more sectors, and low-status tracts form other sectors. The family status factor tends to form concentric zones: inner zones are dominated by tracts with small families with either very young or very old household heads, and tracts in outer zones are mostly occupied by large families with middle-age household heads. The ethnic status factor tends to form clusters, each of which is dominated by a particular ethnic group. Superimposing the three constructs generates a complex urban mosaic, which can be grouped into various social areas by cluster analysis. See Figure 7.4. By studying the spatial patterns from social area analysis, three classic models for urban structure, namely Burgess’s (1925) concentric zone model, Hoyt’s (1939) sector model, and the Ullman–Harris (Harris and Ullman, 1945) multinuclei model, are synthesized into one framework. In other words, each of the three models reflects one specific dimension of urban structure and is complementary to the others.

There are at least three criticisms of the factorial ecological approach to understanding residential differentiation in cities (Cadwallader, 1996, p. 151). First, the analysis results are sensitive to research design, such as variables selected and measured, analysis units, and factor analysis methods. Second, it is still a descriptive form of analysis and fails to explain the underlying process that causes the patterns. Third, the social areas identified by the studies are merely homogeneous, but not necessarily functional regions or cohesive communities. Despite the criticisms, social area analysis helps us understand residential differentiation within cities, and serves as an important instrument for studying intraurban social spatial structure. Applications of social area analysis can be seen for cities in developed countries, with an especially rich literature on cities in North America (see a review by Davies and Herbert, 1993), and also for some cities in developing countries (e.g., Berry and Rees, 1969; Abu-Lughod, 1969).

7.4 CASE STUDY 7: SOCIAL AREA ANALYSIS IN BEIJING

This case study is developed on the basis of a research project reported in Gu et al. (2005). Detailed research design and interpretation of the results can be found in the original paper. This section shows the procedures to implement the study, with emphasis on illustrating the three statistical methods. In addition, the study illustrates how to test the spatial structure of factors by regression models with dummy variables.

Since the 1978 economic reforms in China, and particularly the 1984 urban reforms, including the urban land use reform and the housing reform, the urban landscape in China has changed significantly. Many large cities have been in transition from self-contained work unit neighborhood systems to more differentiated urban space. As the capital city of China, Beijing offers an interesting case to look into this important change in urban structure in China.

The study area is the contiguous urbanized area of Beijing, with 107 subdistricts (jiedao), excluding the 2 remote suburban districts (Mentougou and Fangshan) and 23 subdistricts on the periphery of inner suburbs (which are also rural and lack complete data). See Figure 7.5. The study area had a total population of 5.9 million, and the subdistricts had an average population of 55,200 in 1998. The subdistrict has been the basic administrative unit in Beijing for decades, and also the lowest geographic level reported in government statistical reports accessible by the public. Therefore, it was the analysis unit used in this research. Because of the lack of socioeconomic data in the national census of population, most of the data used in this research were extracted from the 1998 statistical yearbooks of individual districts in Beijing. Some data, such as personal income and individual living space, were obtained through a survey of households conducted in 1998.

TABLE 7.1 Idealized Factor Loadings in Social Area Analysis
Column headings: Economic Status, Family Status, Ethnic Status

FIGURE 7.4 Conceptual model for urban mosaic. (Recoverable labels: large families; ethnic enclaves; panel (c) ethnic status.)
