Principal Components, Factor, and Cluster Analyses, and Application in Social Area Analysis
This chapter discusses three important multivariate statistical analysis methods: principal components analysis (PCA), factor analysis (FA), and cluster analysis (CA). PCA and FA are often used together for data reduction by structuring many variables into a limited number of components (factors). The techniques are particularly useful for eliminating variable collinearity and uncovering latent variables. Applications of the methods are widely seen in socioeconomic studies (also see case study 8 in Section 8.4). While PCA and FA group variables, CA classifies many observations into categories according to similarity among their attributes. In other words, given a dataset as a table, PCA and FA reduce the number of columns and CA reduces the number of rows.
Social area analysis is used to illustrate the techniques, as it employs all three methods. The interpretation of social area analysis results also leads to a review and comparison of three classic models of urban structure, namely, the concentric zone model, the sector model, and the multinuclei model. The analysis demonstrates how analytical statistical methods synthesize descriptive models into one framework. Beijing, the capital city of China, on the verge of forming its social areas after decades under a socialist regime, is chosen as the study area for a case study. Usage of GIS in this case study is limited to mapping spatial patterns.
Section 7.1 discusses principal components and factor analysis. Section 7.2 explains cluster analysis. Section 7.3 reviews social area analysis. A case study on the social space in Beijing is presented in Section 7.4 to provide a new perspective on the fast-changing urban structure in China. The chapter concludes with a discussion and brief summary in Section 7.5.
7.1 PRINCIPAL COMPONENTS AND FACTOR ANALYSIS
Principal components and factor analysis are often used together for data reduction. Benefits of this approach include uncovering latent variables for easy interpretation and removing multicollinearity for subsequent regression analysis. In many socioeconomic applications, variables extracted from census data are often correlated with each other, and thus contain duplicated information to some extent. Principal components and factor analysis use fewer factors to represent the original variables, and thus simplify the structure for analysis. Resulting component or factor scores are uncorrelated with each other (if not rotated or orthogonally rotated), and thus can be used as explanatory variables in regression analysis.
2795_C007.fm Page 127 Friday, February 3, 2006 12:14 PM
Quantitative Methods and Applications in GIS
Despite the commonalities, principal components and factor analysis are “both conceptually and mathematically very different” (Bailey and Gatrell, 1995, p. 225). Principal components analysis uses the same number of variables (components) to simply transform the original data, and thus is a mathematical transformation (strictly speaking, not a statistical operation). Factor analysis uses fewer variables (factors) to capture most of the variation among the original variables (with error terms), and thus is a statistical analysis process. Principal components analysis attempts to explain the variance of observed variables, whereas factor analysis intends to explain their intercorrelations (Hamilton, 1992, p. 252). In many applications (as in ours), the two methods are used together. In SAS, principal components analysis is offered as an option under the procedure for factor analysis.
7.1.1 Principal Components Factor Model
In formula, principal components analysis (PCA) transforms original data on $K$ observed variables $Z_k$ to data on $K$ principal components $F_k$ that are independent from (uncorrelated with) each other:
$$Z_k = l_{k1} F_1 + l_{k2} F_2 + \cdots + l_{kj} F_j + \cdots + l_{kK} F_K \qquad (7.1)$$
Retaining only the first $J$ components ($J < K$) gives the principal components factor analysis (PCFA) model, in which the discarded components are collected in a residual term $v_k$:
$$Z_k = l_{k1} F_1 + l_{k2} F_2 + \cdots + l_{kJ} F_J + v_k \qquad (7.2)$$
In a true factor analysis (FA), the residual (error) term, denoted as $u_k$ to distinguish it from $v_k$ in a PCFA, is unique to each variable $Z_k$:
$$Z_k = l_{k1} F_1 + l_{k2} F_2 + \cdots + l_{kJ} F_J + u_k$$
The $u_k$ are termed unique factors (in contrast to common factors $F_j$). In the PCFA, the residual $v_k$ is a linear combination of the discarded components ($F_{J+1}, \ldots, F_K$) and thus cannot be uncorrelated like the $u_k$ in a true FA (Hamilton, 1992, p. 252).
7.1.2 Factor Loadings, Factor Scores, and Eigenvalues
For convenience, the original data of observed variables $Z_k$ are first standardized¹ prior to the PCA and FA, and the initial values for components (factors) are also standardized. When both $Z_k$ and $F_j$ are standardized, the $l_{kj}$ in Equations 7.1 and 7.2 are standardized coefficients in the regression of variables $Z_k$ on components (factors) $F_j$, also termed factor loadings. For example, $l_{k1}$ is the loading of variable $Z_k$ on standardized component $F_1$. A factor loading reflects the strength of the relation between a variable and a component.
Conversely, the components $F_j$ can be reexpressed as a linear combination of the original variables $Z_k$:
$$F_j = a_{1j} Z_1 + a_{2j} Z_2 + \cdots + a_{Kj} Z_K \qquad (7.4)$$
Estimates of these components (factors) are termed factor scores. Estimates of $a_{kj}$ are factor score coefficients, i.e., coefficients in the regression of factors on variables.
The components $F_j$ are constructed to be uncorrelated with each other and are ordered such that the first component $F_1$ has the largest sample variance ($\lambda_1$), $F_2$ the second largest, and so on. The variances $\lambda_j$ corresponding to the various components are termed eigenvalues, and $\lambda_1 > \lambda_2 > \cdots$.
Since standardized variables have variances of 1, the total variance of all variables equals the number of variables, that is,
$$\lambda_1 + \lambda_2 + \cdots + \lambda_K = K \qquad (7.5)$$
Therefore, the proportion of total variance explained by the jth component is $\lambda_j / K$.
Eigenvalues provide a basis for judging which components (factors) are important and which are not, and thus for deciding how many components to retain. One may also follow a rule of thumb that only eigenvalues greater than 1 are important (Griffith and Amrhein, 1997, p. 169). Since the variance of each standardized variable is 1, a component with $\lambda < 1$ accounts for less than an original variable’s variation, and thus does not serve the purpose of data reduction.
The eigenvalue-1 rule is arbitrary. A scree graph plots eigenvalues against component (factor) number and provides more useful guidance (Hamilton, 1992, p. 258). For example, Figure 7.1 shows the scree graph of eigenvalues in a case of 14 components (using the result from case study 7 in Section 7.4). The graph levels off after component 4, indicating that components 5 to 14 account for relatively little additional variance. Therefore, four components may be retained as principal components.
Outputs from statistical analysis software such as SAS include important information, such as factor loadings, eigenvalues, and proportions (of total variance). Factor scores can be saved in a predefined external file. The factor analysis procedure in SAS also outputs a correlation matrix between the observed variables for analysts to examine their relations.
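The quantities defined in this section can be illustrated with a small numerical sketch. It relies on the standard result that a PCA of standardized variables amounts to an eigen decomposition of their correlation matrix, with the loading of variable k on component j equal to the eigenvector entry times the square root of the eigenvalue. The 3 × 3 correlation matrix `R` is made up for illustration, and `jacobi_eigen` is a hypothetical helper (a plain cyclic Jacobi routine) standing in for whatever eigen solver a statistical package would use.

```python
import math

def jacobi_eigen(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues and eigenvectors of a symmetric matrix by cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]                             # working copy
    v = [[float(i == j) for j in range(n)] for i in range(n)]  # accumulates eigenvectors
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):   # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):   # accumulate the rotation into the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(n), key=lambda j: a[j][j], reverse=True)
    eigenvalues = [a[j][j] for j in order]
    eigenvectors = [[v[k][j] for k in range(n)] for j in order]  # one list per component
    return eigenvalues, eigenvectors

# Hypothetical correlation matrix of K = 3 standardized variables.
R = [[1.0, 0.8, 0.3],
     [0.8, 1.0, 0.2],
     [0.3, 0.2, 1.0]]
K = len(R)

eigenvalues, eigenvectors = jacobi_eigen(R)
proportions = [lam / K for lam in eigenvalues]   # share of total variance per component

# loadings[k][j]: loading of variable k on component j.
loadings = [[eigenvectors[j][k] * math.sqrt(eigenvalues[j]) for j in range(K)]
            for k in range(K)]

retained = [lam for lam in eigenvalues if lam > 1.0]   # eigenvalue-1 rule of thumb

for j, (lam, share) in enumerate(zip(eigenvalues, proportions), start=1):
    print(f"component {j}: eigenvalue {lam:.3f}, proportion {share:.3f}")
```

The eigenvalues sum to K, as in Equation 7.5, and when all K components are kept the squared loadings of each variable sum to 1, its standardized variance.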
7.1.3 Rotation
Initial results from PCFA are often hard to interpret, as variables load across factors. While fitting the data equally well, rotation generates a simpler structure and more interpretable factors by maximizing the loading (positive or negative) of each variable on one factor and minimizing the loadings on the others. As a result, we can detect which factor (latent variable) captures the information contained in which (observed) variables, and subsequently label the factors adequately.
Orthogonal rotation generates independent (uncorrelated) factors, an important property for many applications. A widely used orthogonal rotation method is Varimax rotation, which maximizes the variance of the squared loadings for each factor, and thus polarizes loadings (either high or low on factors). Varimax rotation is often the rotation technique used in social area analysis. Oblique rotation (e.g., promax rotation) generates even greater polarization, but allows correlation between factors. In SAS, an option is provided to specify which rotation to use.
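The Varimax idea can be sketched for the simplest case of two factors. The sketch below is not the algorithm used by SAS or any other package: it searches rotation angles by brute force and keeps the one that maximizes the variance of the squared loadings, summed over the factors. The unrotated loadings are made-up values, and the function names are hypothetical.

```python
import math

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    p = len(loadings)
    total = 0.0
    for j in range(len(loadings[0])):
        squared = [row[j] ** 2 for row in loadings]
        mean = sum(squared) / p
        total += sum((s - mean) ** 2 for s in squared) / p
    return total

def rotate(loadings, angle):
    """Orthogonal rotation of a two-factor loading matrix by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x + s * y, -s * x + c * y] for x, y in loadings]

def varimax_two_factors(loadings, steps=9000):
    """Grid search over [0, pi/2) for the angle that maximizes the criterion."""
    angles = [i * (math.pi / 2) / steps for i in range(steps)]
    best = max(angles, key=lambda a: varimax_criterion(rotate(loadings, a)))
    return best, rotate(loadings, best)

# Hypothetical unrotated loadings: four variables on two factors.
unrotated = [[0.7, 0.5], [0.8, 0.4], [0.6, -0.6], [0.5, -0.7]]
angle, rotated = varimax_two_factors(unrotated)

for before, after in zip(unrotated, rotated):
    print(before, "->", [round(value, 3) for value in after])
```

Because the rotation is orthogonal, the sum of squared loadings of each variable (its communality) is unchanged; only the way that sum is split between the two factors differs, which is the sense in which the rotated solution fits the data equally well.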
As a summary, Figure 7.2 illustrates the process of PCFA:
1. The original dataset of K observed variables with n records is first standardized to a dataset of Z scores with the same number of variables.
2. PCA then transforms the K standardized variables into K uncorrelated components.
3. PCFA retains only the first J (J < K) components as factors.
4. A rotation method is used to load each variable strongly on one factor (and near zero on the others) for easier interpretation.
The SAS procedure for factor analysis (FA) is FACTOR, which also reports the principal components analysis (PCA) results preceding those of FA. The following sample SAS statements implement the factor analysis that uses four factors to capture the structure of 14 variables, x1 through x14, and adopts the Varimax rotation technique:
proc factor out=FACTSCORE (replace=yes)  /* data set receiving the factor scores */
    nfact=4 rotate=varimax;              /* four factors, Varimax rotation */
  var x1-x14;                            /* variables used */
run;
The SAS data set FACTSCORE has the factor scores, which can be saved to an external file. Note that a SAS program is not case sensitive.
FIGURE 7.1 Scree graph for principal components analysis. (Horizontal axis: component, 1 to 14.)
7.2 CLUSTER ANALYSIS
Cluster analysis (CA) groups observations according to similarity among their attributes. As a result, the observations within a cluster are more similar than observations between clusters, as measured by the clustering criterion. Note the difference between CA and another similar multivariate analysis technique, discriminant function analysis (DFA). Both group observations into categories based on the characteristic variables. Categories are unknown in CA but known in DFA. See Appendix 7A for further discussion on DFA.
Geographers have a long-standing interest in cluster analysis (CA), which has been developed in applications such as regionalization and city classification. In the case of social area analysis, cluster analysis is used to further analyze the results from factor analysis (i.e., factor scores of various components across space) and group areas into different types of social areas.
[FIGURE 7.2: process of PCFA. Recovered labels: component loadings; K components; (3) PCFA, J factors (J < K); (4) rotation.]
A key element in deciding assignment of observations to clusters is distance, measured in various ways. The most commonly used distance measure is Euclidean distance:
$$d_{ij} = \left( \sum_{k=1}^{K} (x_{ik} - x_{jk})^2 \right)^{1/2}$$
where $x_{ik}$ and $x_{jk}$ are the kth variable values of the K-dimensional observations for individuals i and j. When K = 2, Euclidean distance is simply the straight-line distance between observations i and j in a two-dimensional space. Like the various distance measures discussed in Chapter 2, distance measures here also include Manhattan (or city block) distance and others (e.g., Minkowski distance, Canberra distance) (Everitt et al., 2001, p. 40).
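As a minimal sketch of these measures, the functions below compute the Euclidean, Manhattan, and Minkowski distances between two K-dimensional observations. The function names and the sample observations are illustrative assumptions; the Minkowski distance with exponent p reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2.

```python
import math

def euclidean(x, y):
    """Square root of the summed squared differences over the K variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute differences (city block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    """General form: p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Two hypothetical observations with K = 2 variables.
obs_i = (0.0, 0.0)
obs_j = (3.0, 4.0)

print(euclidean(obs_i, obs_j))      # straight-line distance
print(manhattan(obs_i, obs_j))      # city block distance
print(minkowski(obs_i, obs_j, 3))   # a Minkowski distance with p = 3
```

In practice the variables are usually standardized (or, as in social area analysis, factor scores are used) before distances are computed, so that no single variable dominates because of its measurement unit.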
The most widely used clustering methods are the agglomerative hierarchical methods (AHMs). The methods produce a series of groupings: the first consists of single-member clusters, and the last consists of a single cluster of all members. The results of these algorithms can be summarized with a dendrogram, a tree diagram showing the history of the sequential grouping process. See Figure 7.3 for the example illustrated below. In the diagram, the clusters are nested and each cluster is a member of a larger, higher-level cluster.
For illustration, an example is used to explain a simple AHM, the single-linkage method or the nearest-neighbor method. Consider a dataset of four observations with the following distance matrix:
[Distance matrix D1 among observations 1, 2, 3, and 4]
FIGURE 7.3 Dendrogram for the clustering analysis example. (Horizontal axis: data points; vertical axis: distance, 1.0 to 5.0; clusters C1, C2, and C3.)
The smallest nonzero entry in the above matrix D1 is (2 → 1) = 3, and therefore individuals 1 and 2 are grouped together to form the first cluster C1. Distances between this cluster and the other two individuals are defined according to the nearest-neighbor criterion:
$$d_{(12)3} = \min\{d_{13}, d_{23}\}, \qquad d_{(12)4} = \min\{d_{14}, d_{24}\}$$
A new matrix D2 is now obtained with cells representing distances between cluster C1 and individuals 3 and 4, or between individuals 3 and 4.
The smallest nonzero entry in D2 is (4 → 3) = 4, and thus individuals 3 and 4 are grouped to form a cluster C2. Finally, clusters C1 and C2 are grouped together, with distance equal to 5, to form one cluster C3 containing all four members. The process is summarized in a dendrogram in Figure 7.3, where the height represents the distance at which each fusion is made.
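The fusion sequence just described can be reproduced with a short sketch. Only three distances are quoted in the text (3 between individuals 1 and 2, 4 between individuals 3 and 4, and 5 for the final fusion); the remaining entries of the matrix below are assumptions chosen to be consistent with those values. The function `agglomerate` is a hypothetical helper, and passing `max` instead of `min` turns the same loop into the farthest-neighbor method.

```python
def agglomerate(pair_distances, linkage=min):
    """Agglomerative clustering on a distance table; returns the fusion history.

    pair_distances maps (i, j) pairs of observation labels to distances.
    linkage=min is single linkage (nearest neighbor); linkage=max is complete linkage.
    """
    dist = {frozenset(pair): d for pair, d in pair_distances.items()}
    labels = sorted({i for pair in dist for i in pair})
    clusters = [frozenset([i]) for i in labels]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters) - 1):
            for b in range(a + 1, len(clusters)):
                # Distance between two clusters from their members' pairwise distances.
                d = linkage(dist[frozenset((i, j))]
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] | clusters[b]
        history.append((sorted(merged), d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return history

# Distances 3, 4, and 5 come from the text; 6, 9, and 7 are assumed values.
D1 = {(1, 2): 3, (1, 3): 6, (1, 4): 9,
      (2, 3): 5, (2, 4): 7, (3, 4): 4}

single = agglomerate(D1, linkage=min)
complete = agglomerate(D1, linkage=max)
print(single)
print(complete)
```

Under these assumed distances both methods form the same clusters, but the final fusion happens at distance 5 with single linkage and at distance 9 with complete linkage, which is why the two methods generally draw different dendrograms.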
Similarly, the complete linkage (farthest-neighbor) method uses the maximum distance between pairs of objects (one in one cluster and one in the other); the average linkage method uses the average distance between pairs of objects; and the centroid method uses squared Euclidean distance between individuals and cluster means (centroids).
Another commonly used AHM is Ward’s method. The objective at each stage is to minimize the increase in the total within-cluster error sum of squares given by
$$E = \sum_{c=1}^{C} E_c$$
where
$$E_c = \sum_{i=1}^{n_c} \sum_{k=1}^{K} (x_{ck,i} - \bar{x}_{ck})^2$$
in which $x_{ck,i}$ is the value for the kth variable for the ith observation in the cth cluster, $\bar{x}_{ck}$ is the mean of the kth variable in the cth cluster, $n_c$ is the number of observations in the cth cluster, and $C$ is the number of clusters.
Each clustering method has its advantages and disadvantages. A desirable clustering should produce clusters of similar size, densely located, compact in shape, and internally homogeneous (Griffith and Amrhein, 1997, p. 217). The single-linkage method tends to produce unbalanced and straggly clusters and should be avoided in most cases. If outliers are a major concern, the centroid method should be used. If compactness of clusters is a primary objective, the complete linkage method should be used. Ward’s method tends to find same-size and spherical clusters and is recommended if no single overriding property is desired (Griffith and Amrhein, 1997, p. 220). The case study in this chapter also uses Ward’s method.
The choice for the number of clusters depends on the objectives of specific applications. Similar to the selection of factors based on the eigenvalues in factor analysis, one may also use a scree plot to assist in the decision. In the case of Ward’s method, a graph of $R^2$ vs. the number of clusters helps choose the number, beyond which little more homogeneity is attained by further mergers.
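Ward’s criterion and the $R^2$ graph can be sketched as follows. At each stage the code merges the pair of clusters whose fusion increases the total within-cluster error sum of squares the least, and it records $R^2$ for each number of clusters, taken here as 1 minus the ratio of the within-cluster sum of squares to the total sum of squares (an assumption about how $R^2$ is defined). The five two-dimensional points and the function names are made up for illustration.

```python
def ess(points):
    """Error sum of squares of one cluster: squared deviations from the cluster means."""
    n, dims = len(points), len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(dims)]
    return sum((p[k] - means[k]) ** 2 for p in points for k in range(dims))

def ward_step(clusters):
    """Merge the two clusters whose fusion adds the least to the total ESS."""
    best = None
    for a in range(len(clusters) - 1):
        for b in range(a + 1, len(clusters)):
            increase = ess(clusters[a] + clusters[b]) - ess(clusters[a]) - ess(clusters[b])
            if best is None or increase < best[0]:
                best = (increase, a, b)
    increase, a, b = best
    rest = [c for i, c in enumerate(clusters) if i not in (a, b)]
    return rest + [clusters[a] + clusters[b]], increase

# Hypothetical observations (e.g., two factor scores per area).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = [[p] for p in points]      # start from single-member clusters
total_ss = ess(points)

r_squared = {}                        # number of clusters -> R-squared
while len(clusters) > 1:
    clusters, _ = ward_step(clusters)
    within = sum(ess(c) for c in clusters)
    r_squared[len(clusters)] = 1.0 - within / total_ss

for n_clusters in sorted(r_squared, reverse=True):
    print(n_clusters, round(r_squared[n_clusters], 3))
```

With these points, $R^2$ stays above 0.99 down to three clusters and then falls sharply at two, so a plot of $R^2$ against the number of clusters would suggest keeping three.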
In SAS, the procedure CLUSTER implements the cluster analysis and the procedure TREE generates the dendrogram. The following sample SAS statements use Ward’s method for clustering and cut off the dendrogram at nine clusters:
proc cluster method=ward outtree=tree;
  id subdist_id;        /* variable for labeling ids */
  var factor1-factor4;  /* variables used */
run;
proc tree out=bjcluster ncl=9;
  id subdist_id;
run;
7.3 SOCIAL AREA ANALYSIS
Social area analysis was developed by Shevky and Williams (1949) in a study of Los Angeles and was later elaborated on by Shevky and Bell (1955) in a study of San Francisco. The basic thesis is that the changing social differentiation of society leads to residential differentiation within cities. The studies classified census tracts into types of social areas based on three basic constructs: economic status (social rank), family status (urbanization), and segregation (ethnic status). Originally the three constructs were measured by six variables: economic status was captured by occupation and education; family status by fertility, women’s labor participation, and single-family houses; and ethnic status by percentage of minorities (Cadwallader, 1996, p. 135). In a factor analysis, an idealized factor loadings matrix probably looks like Table 7.1. Subsequent studies using a large number and variety of measures generally confirmed the validity of the three constructs (Berry, 1972, p. 285; Hartshorn, 1992, p. 235).
Geographers made an important advancement in social area analysis by analyzing the spatial patterns associated with these dimensions (e.g., Rees, 1970; Knox, 1987). The socioeconomic status factor tends to exhibit a sector pattern: tracts with high values for variables, such as income and education, form one or more sectors, and low-status tracts form other sectors. The family status factor tends to form concentric zones: inner zones are dominated by tracts with small families with either very young or very old household heads, and tracts in outer zones are mostly occupied by large families with middle-age household heads. The ethnic status factor tends to form clusters, each of which is dominated by a particular ethnic group. Superimposing the three constructs generates a complex urban mosaic, which can be grouped into various social areas by cluster analysis. See Figure 7.4.
By studying the spatial patterns from social area analysis, three classic models for urban structure, namely Burgess’s (1925) concentric zone model, Hoyt’s (1939) sector model, and the Ullman–Harris (Harris and Ullman, 1945) multinuclei model, are synthesized into one framework. In other words, each of the three models reflects one specific dimension of urban structure and is complementary to the others.
There are at least three criticisms of the factorial ecological approach to understanding residential differentiation in cities (Cadwallader, 1996, p. 151). First, the analysis results are sensitive to research design, such as variables selected and measured, analysis units, and factor analysis methods. Second, it is still a descriptive form of analysis and fails to explain the underlying process that causes the patterns. Third, the social areas identified by the studies are merely homogeneous, but not necessarily functional regions or cohesive communities. Despite the criticisms, social area analysis helps us understand residential differentiation within cities, and serves as an important instrument for studying intraurban social spatial structure. Applications of social area analysis can be seen for cities in developed countries, and are particularly rich for cities in North America (see a review by Davies and Herbert, 1993), and also for some cities in developing countries (e.g., Berry and Rees, 1969; Abu-Lughod, 1969).
7.4 CASE STUDY 7: SOCIAL AREA ANALYSIS IN BEIJING
This case study is developed on the basis of a research project reported in Gu et al. (2005). Detailed research design and interpretation of the results can be found in the original paper. This section shows the procedures to implement the study, with emphasis on illustrating the three statistical methods. In addition, the study illustrates how to test the spatial structure of factors by regression models with dummy variables.
Since the 1978 economic reforms in China, and particularly the 1984 urban reforms, including the urban land use reform and the housing reform, the urban landscape in China has changed significantly. Many large cities have been in transition from self-contained work unit neighborhood systems to more differentiated urban space. As the capital city of China, Beijing offers an interesting case to look into this important change in urban structure in China.
TABLE 7.1
Idealized Factor Loadings in Social Area Analysis
Economic Status Family Status Ethnic Status
The study area is the contiguous urbanized area of Beijing, with 107 subdistricts (jiedao), excluding the 2 remote suburban districts (Mentougou and Fangshan) and 23 subdistricts on the periphery of inner suburbs (which are also rural and lack complete data). See Figure 7.5. The study area had a total population of 5.9 million, and the subdistricts had an average population of 55,200 in 1998. The subdistrict has been the basic administrative unit in Beijing for decades, and is also the lowest geographic level reported in government statistical reports accessible to the public. Therefore, it was the analysis unit used in this research. Because of the lack of socioeconomic data in the national census of population, most of the data used in this research were extracted from the 1998 statistical yearbooks of individual districts in Beijing. Some data, such as personal income and individual living space, were obtained through a survey of households conducted in 1998.
FIGURE 7.4 Conceptual model for urban mosaic. (Recovered panel labels: large families; ethnic enclaves; (c) ethnic status.)