Problem 384. Someone said on an email list about statistics: if you cannot see an effect in the data, then there is no use trying to estimate it. Right or wrong?
Answer. One argument one might give is the curse of dimensionality. Also, higher moments of the distribution, kurtosis etc., cannot be seen very clearly with the naked eye.
33.1 Scatterplot Matrices
One common graphical method to explore a dataset is to make a scatter plot of each data series against each other and arrange these plots in a matrix. In R, the pairs function does this. Scatterplot matrices should be produced in the preliminary stages of the investigation, but the researcher should not think he or she is done after having looked at the scatterplot matrices.
In the construction of scatter plot matrices, it is good practice to change the signs of some of the variables in order to make all correlations positive, if this is possible.
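A minimal R sketch of this sign-flipping step, using the built-in mtcars data as a stand-in (an assumption; it is not one of the datasets discussed here). The helper flip_to_positive is hypothetical: it flips every variable that is negatively correlated with the first column and then reports whether all pairwise correlations came out positive, which is not guaranteed in general.

```r
# Hypothetical helper: flip the sign of every column that is negatively
# correlated with the first column, then check all pairwise correlations.
flip_to_positive <- function(df) {
  s <- sign(cor(df)[, 1])          # +1 / -1 relative to the first variable
  s[s == 0] <- 1                   # leave uncorrelated columns unchanged
  flipped <- sweep(as.matrix(df), 2, s, `*`)
  list(data = as.data.frame(flipped),
       signs = s,
       all_positive = all(cor(flipped) > 0))
}

# Stand-in data: four columns of the built-in mtcars dataset.
cars4 <- mtcars[, c("mpg", "disp", "hp", "wt")]
res <- flip_to_positive(cars4)

res$signs          # which variables were reversed
res$all_positive   # whether the flip made every correlation positive
pairs(res$data)    # scatterplot matrix after the sign changes
```

If all_positive comes out FALSE, no choice of signs relative to the first variable achieves the goal, and the matrix has to be read with mixed orientations.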
[BT99, pp. 17–20] gives a good example of what kinds of things can be seen from looking at scatterplot matrices. The data for this book are available at http://biometrics.ag.uq.edu.au/software.htm
Problem 385. 5 points. Which inferences about the datasets can you draw from looking at the scatterplot matrix in [BT99, Exhibit 3.2, p. 14]?
Answer. The discussion in [BT99, p. 19?] distinguishes three categories. First, the univariate phenomena:
• yield is more concentrated for local genotypes (•) than for imports (◦);
• the converse is true for protein % but not as pronounced;
• oil % and seed size are lower for local genotypes (•); regarding seed size, the heaviest • is lighter than the lightest ◦;
• height and lodging are greater for local genotypes.
Bivariate phenomena are either within-group or between-group phenomena, or both:
• negative relationship of protein % and oil % (both within • and ◦);
• positive relationship of oil % and seed size (both within • and ◦ and also between these groups);
• negative relationship, between groups, of seed size and height;
• positive relationship of height and lodging (within ◦ and between groups);
• negative relationship of oil % and lodging (between groups and possibly within •);
• negative relationship of seed size and lodging (between groups);
• positive relationship of height and lodging (between groups).
The between-group phenomena are, of course, not due to an interaction between the groups; they are the consequence of univariate phenomena.
As a third category, the authors point out unusual individual points:
• 1 high ◦ for yield;
• 1 high • (still lower than all the ◦s) for seed size;
• 1 low ◦ for lodging;
• 1 low • for protein % and oil % in combination.
[Coo98, Figure 2.8 on p. 29] shows a scatterplot matrix of the “horse mussel” data, originally from [Cam89]. This graph is also available at www.stat.umn.edu/RegGraph/graphics/Figure 2.8.gif. Horse mussels (Atrinia) were sampled from the Marlborough Sounds. The five variables are L = Shell length in mm, W = Shell width in mm, H = Shell height in mm, S = Shell mass in g, and M = Muscle mass in g. M is the part of the mussel that is edible.
Problem 386. 3 points. In the mussel data set, M is the “response” (according to [Coo98]). Is it justified to call this variable the “response” and the other variables the explanatory variables, and if so, how would you argue for it?
Answer. This is one of the issues which is not sufficiently discussed in the literature. It would be justified if the dimensions and weight of the shell were exogenous to the weight of the edible part of the mussel. I.e., if the mussel first grows the shell, and then it fills this shell with muscle, and the dimensions of the shell affect how big the muscle can grow, but the muscle itself does not have an influence on the dimensions of the shell. If this is the case, then it makes sense to look at the distribution of M conditionally on the other variables, i.e., to ask the question: given certain weights and dimensions of the shell, what is the nature of the mechanism by which the muscle grows inside this shell? But if muscle and shell grow together, both affected by the same variables (temperature, nutrition, daylight, etc.), then the conditional distribution is not informative. In this case, the joint distribution is of interest.
In order to get this dataset into R, you simply say data(mussels), after having said library(ecmet). Then you need the command pairs(mussels) to get the scatterplot matrix. Also interesting is pairs(log(mussels)), especially since the log transformation is appropriate if one explains volume and weight by length, height, and width.
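For readers without the ecmet package, the same comparison of raw and logged scatterplot matrices can be sketched with the built-in trees dataset (girth, height, and timber volume of 31 trees). This is only a stand-in chosen because it also explains a volume by linear dimensions; it is not the mussel data.

```r
# Stand-in for the mussel data: the built-in 'trees' dataset,
# where Volume is explained by Girth and Height.
data(trees)

# Scatterplot matrix on the original scale.
pairs(trees, main = "trees, original scale")

# Scatterplot matrix after taking logs of every variable.
# All values are strictly positive, so log() is well defined.
log_trees <- log(trees)
pairs(log_trees, main = "trees, log scale")
```

Comparing the two matrices panel by panel shows how far the log transformation straightens the relationships between the volume and the linear dimensions.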
The scatter plot of M versus H shows a clear curvature; but one should not jump to the conclusion that the regression is not linear. Cook brings another example with constructed data, in which the regression function is clearly linear, without error term, and in which nevertheless the scatter plot of the response versus one of the predictors shows a similar curvature as in the mussel data.
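The phenomenon can be reproduced with a few lines of simulated data. The sketch below is not Cook's dataset: the variables x1, x2, and y and the quadratic link between the two predictors are assumptions made for illustration. The response is an exact linear function of both predictors, yet its plot against x1 alone is curved because x2 depends nonlinearly on x1.

```r
set.seed(1)
n  <- 200
x1 <- runif(n, -1, 1)
x2 <- x1^2 + rnorm(n, sd = 0.05)   # predictors are nonlinearly related
y  <- 3 + x1 + x2 / 2              # exactly linear in (x1, x2), no error term

sim <- data.frame(y, x1, x2)
pairs(sim)                         # the y-versus-x1 panel looks curved

# The linear regression on both predictors nevertheless fits perfectly.
fit <- lm(y ~ x1 + x2, data = sim)
coef(fit)                          # recovers intercept 3 and slopes 1 and 1/2
max(abs(resid(fit)))               # numerically zero
```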
Problem 387. Cook’s constructed dataset is available as dataset reggra29 in the ecmet package. Make a scatterplot matrix of the data, then load it into XGobi and convince yourself that y depends linearly on x1 and x2.
Answer. You need the commands data(reggra29) and then pairs(reggra29) to get the scatterplot matrix. Before you can access xgobi from R, you must give the command library(xgobi). Then xgobi(reggra29). The dependency is y = 3 + x1 + x2/2.
Problem 388. 2 points. Why can the scatter plot of the dependent variable against one of the independent variables be so misleading?
Answer. Because the included independent variable becomes a proxy for the excluded variable. The effect of the excluded variable is mistaken to come from the included variable. Now if the included and the excluded variable are independent of each other, then the omission of the excluded variable increases the noise, but does not have a systematic effect. But if there is an empirical relationship between the included and the excluded variable, then this translates into a spurious relationship between included and dependent variables. The mathematics of this is discussed in
Problem 389. Would it be possible in the scatter plot in [Coo98, p. 217] to reverse the signs of some of the variables in such a way that all correlations are positive?
Answer. Yes, you have to reverse the signs of 6Below and AFDC. Here are the instructions for making the scatter plots: in arc, go to the load menu. (Ignore the close and the menu boxes, they don’t seem to work.) Then type the path /usr/share/ecmet/xlispstat into the long box and press return. This gives me only one option, Minneapolis-schools.lsp. I have to press 3 times on this until it jumps to the big box, then I can press enter on the big box to load the data. This gives me a bigger menu. Go to the MPLSchools menu, and to the add variable option. You have to type in 6BelNeg = (- 6Below), then enter, then AFDCNeg = (- AFDC), and finally BthPtsNeg = (- BthPts). Then go to the Graph&Fit menu, and select scatterplot matrix. Then you have to be careful about the order: first select AFDCNeg in the left box and double click so that it jumps over to the right box. Then select HS, then BthPtsNeg, then 6BelNeg, then 6Above. Now the scatterplot matrix will be oriented all in 1 direction.
33.2 Conditional Plots
In order to account for the effect of excluded variables in a scatter plot, the function coplot makes scatter plots in which the excluded variable is conditioned upon. The graphics demo has such a conditioning plot; here is the code (from the file /usr/lib/R/demos/graphics/graphics.R):
data(quakes)
coplot(long ~ lat | depth, data=quakes, pch=21)
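The conditioning intervals that coplot uses can also be set explicitly. The following sketch builds them with co.intervals and passes them through the given.values argument; the choice of four slices with a small overlap is an arbitrary assumption, not part of the demo.

```r
data(quakes)

# Four depth slices with little overlap between neighbouring slices.
# Each row of the result is one interval (lower and upper bound).
depth_slices <- co.intervals(quakes$depth, number = 4, overlap = 0.1)
depth_slices

# One panel per depth slice, arranged in a single row.
coplot(long ~ lat | depth, data = quakes, pch = 21,
       given.values = depth_slices, rows = 1)
```

Fewer and narrower slices make each panel closer to a plot at a fixed depth, at the price of fewer points per panel.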
33.3 Spinning
An obvious method to explore a more than two-dimensional structure graphically is to look at plots of y against various linear combinations of x. Many statistical software packages have the ability to do so, but one of the most powerful ones is XGobi. Documentation about xgobi which is more detailed than the help(xgobi) in R/Splus can be obtained by typing man xgobi while in unix. A nice brief documentation is [Rip96]. The official manual is [SCB91] and [BCS96].
XGobi can be used as a stand-alone program or it can be invoked from inside R or Splus. In R, you must give the command library(xgobi) in order to make the function xgobi accessible.
The search for “interesting” projections of the data into one-, two-, or 3-dimensional spaces has been automated in projection pursuit regression programs. The basic reference is [FS81], but there is also the much older [FT74].
The most obvious graphical regression method consists in slicing or binning the data, and taking the mean of the data in each bin. But if you have too many explanatory variables, this local averaging becomes infeasible, because of the “curse of dimensionality.” Consider a dataset with 1000 observations and 10 variables, all between 0 and 1. In order to see whether the data are uniformly distributed or whether they have some structure, you may consider splitting up the 10-dimensional unit cube into smaller cubes and counting the number of datapoints in each of these subcubes. The problem here is: if one makes those subcubes large enough that they contain more than 0 or 1 observations, then their coordinate lengths are not much smaller than the unit hypercube itself. Even with a side length of 1/2, which would be the largest reasonable side length, one needs 1024 subcubes to fill the hypercube, therefore the average number of data points is a little less than 1. By projecting instead of taking subspaces, projection pursuit regression does not have this problem of data scarcity.
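The arithmetic in this example is easy to check by simulation. The sketch below draws 1000 uniform points in the 10-dimensional unit cube (simulated data, not a dataset from the text), assigns each point to one of the 2^10 subcubes of side length 1/2, and counts how many points fall into each occupied subcube.

```r
set.seed(42)
n <- 1000                      # observations
k <- 10                        # variables, all between 0 and 1
x <- matrix(runif(n * k), nrow = n)

# Label each point by the subcube it falls into: one binary digit per
# coordinate, 0 for the lower half and 1 for the upper half.
cell <- apply(x >= 0.5, 1, function(b) paste(as.integer(b), collapse = ""))
counts <- table(cell)

n_cells <- 2^k                 # 1024 subcubes of side length 1/2
n / n_cells                    # average number of points per subcube
length(counts)                 # number of subcubes that are occupied at all
table(counts)                  # how many subcubes hold 1, 2, 3, ... points
```

Since the average is below 1, a sizeable share of the subcubes must stay empty, which is exactly the data scarcity described above.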
Projection pursuit regression searches for an interesting and informative projection of the data by maximizing a criterion function. A logical candidate would for instance be the variance ratio as defined in (8.6.7), but there are many others.
About grand tours, projection pursuit guided tours, and manual tours see [CBCH97] and [CB97].
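Projection pursuit regression itself can be tried in R with the function ppr from the stats package. The sketch below applies it to simulated data whose regression function is the pure interaction x1*x2; the data, the noise level, and the choice of two terms are assumptions made for illustration, and ppr's own fitting criterion is used rather than the variance ratio mentioned above.

```r
set.seed(7)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
d  <- data.frame(x1, x2, y = x1 * x2 + rnorm(n, sd = 0.1))

# Fit a projection pursuit regression with two ridge terms.
fit <- ppr(y ~ x1 + x2, data = d, nterms = 2)

fit$alpha                      # estimated projection directions, one column per term
plot(fitted(fit), d$y,
     xlab = "fitted values", ylab = "y")   # response against the fit
```

Each column of fit$alpha is one linear combination of the predictors along which ppr has fitted a smooth function of one variable.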
Problem 390. If you run XGobi from the menu in Debian GNU/Linux, it uses prim7, which is a 7-dimensional particle physics data set used as an example in [FT74].
The following is from the help page for this dataset: There are 500 observations taken from a high energy particle physics scattering experiment which yields four particles. The reaction can be described completely by 7 independent measurements. The important features of the data are short-lived intermediate reaction stages which appear as protuberant “arms” in the point cloud.
The projection pursuit guided tour is the tool to use to understand this data set. Using all 7 variables, turn on projection pursuit and optimize with the Holes index until a view is found that has a triangle and two arms crossing each other off one edge (this is very clear once you see it, but the Holes index has a tendency to get stuck in another local maximum which doesn’t have much structure). Brush the arms with separate colours and glyphs. Change to the Central Mass index and optimize. As new arms are revealed, brush them and continue. When you have either run out of colours or time, turn off projection pursuit and watch the data touring. Then it becomes clear that the underlying structure is a triangle with 5 or 6 arms (some appear to be 1-dimensional, some 2-dimensional) extending from the vertices.
A more practically-oriented book, which teaches the software especially developed for this approach, is [CW99].
For any graphical procedure exploring linear combinations of the explanatory variables, the structural dimension d of a regression is relevant: it is the smallest number of distinct linear combinations of the predictors required to characterize the conditional distribution of y|x.
If the data follow a linear regression, then their structural dimension is 1. But even if the regression is nonlinear but can be written in the form
(33.4.1) $y \mid x \sim g(\beta^\top x) + \sigma(\beta^\top x)\,\varepsilon$
with ε independent of x, this is also a population with structural dimension of 1. If t is a monotonic transformation, then
with β1 and β2 linearly independent, then the structural dimension is 2, since one needs 2 different linear combinations of x to characterize the distribution of y. If
then this is a very simple relationship between x and y, but the structural dimension is k, the number of dimensions of x, since the relationship is not intrinsically linear.
Problem 391. [FS81, p. 818] Show that the regression function consisting of the interaction term between x1 and x2 only, $\varphi(x) = x_1 x_2$, has structural dimension 2, i.e., it can be written in the form $\varphi(x) = \sum_{m=1}^{2} s_m(\alpha_m^\top x)$, where the $s_m$ are smooth functions of one variable.
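Answer.
(33.4.4) $\alpha_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \\ o \end{bmatrix} \qquad \alpha_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \\ o \end{bmatrix}$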
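The two directions above can be checked numerically. The ridge functions s1(z) = z^2/2 and s2(z) = -z^2/2 used in the sketch are not stated in the answer as given here; they are one choice that completes the decomposition, since (x1 + x2)^2/4 - (x1 - x2)^2/4 = x1 x2. The sketch works with two predictors only, so the zero block o is empty.

```r
# Directions from the answer, for the case of exactly two predictors.
a1 <- c(1,  1) / sqrt(2)
a2 <- c(1, -1) / sqrt(2)

# Assumed ridge functions (one possible completion of the answer).
s1 <- function(z)  z^2 / 2
s2 <- function(z) -z^2 / 2

set.seed(3)
x <- matrix(rnorm(200), ncol = 2)          # 100 random points (x1, x2)

lhs <- x[, 1] * x[, 2]                     # the interaction term x1 * x2
rhs <- s1(drop(x %*% a1)) + s2(drop(x %*% a2))

all.equal(lhs, rhs)                        # TRUE up to rounding error
```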
Problem 392. [Coo98, p. 62] In the rubber data, mnr is the dependent variable y, and temp and dp form the two explanatory variables x1 and x2. Look at the data using XGobi or some other spin program. What is the structural dimension of the data set?
The rubber data are from [Woo72], and they are also discussed in [Ric, p. 506]. mnr is modulus of natural rubber, temp the temperature in degrees Celsius, and dp Dicumyl Peroxide in percent.
Answer. The data are a plane that has been distorted by twisting and stretching. Since one needs a different view to get the best fit of the points in the upper-right corner than for the points in the lower-left corner, the structural dimension must be 2.
If one looks at the scatter plots of y against all linear combinations of components of x, and none of them show a relationship (either linear or nonlinear), then the structural dimension is zero.
Here are the instructions for doing graphical regression on the mussel data. Select the load menu (take the cursor down a little until it is black, then go back up), then press the Check Data Dir box, then double click ARCG so that it jumps into the big box. Then Update/Open File will give you a long list of selections, where you will find mussels.lsp. Double click on this so that it jumps into the big box, and then press on the Update/Open File box.
Now for the Box-Cox transformation. I first have to go to scatterplot matrices, then click on transformations, then on find normalizing transformations. If you just select the 4 predictors and then press the OK button, there will be an error message; apparently the starting values were not good enough. Try again, using marginal Box-Cox Starting Values. This will succeed, and the LR test for all transformations being logs has a p-value of .14. Therefore choose the log transform for all the predictors. (If we include all 5 variables, the LR test for all transformations to be log transformations has a p-value of 0.000.) Therefore transform the 4 predictor variables only to logs.
There you see the very linear relationship between the predictors, and you see that all the scatter plots with the response are very similar. This is a sign that the structural dimension is 1, according to [CW99, pp. 435/6]. If that is the case, then a plot of the actual against the fitted values is a sufficient summary plot. For this, run the Fit Linear LS menu option, and then plot the dependent variable against the fitted value. Now the next question might be: what transformation will linearize this? A log curve seems to fit well.
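The last two steps, a least-squares fit on logged predictors followed by a plot of the response against the fitted values, can be sketched in plain R instead of arc. The built-in trees data serve as a stand-in for the mussel data (an assumption), with Volume in the role of the response; whether the first plot shows curvature has to be judged by eye.

```r
data(trees)

# Linear least squares of the untransformed response on logged predictors.
fit_lin <- lm(Volume ~ log(Girth) + log(Height), data = trees)

# Summary plot: actual response against fitted values.
plot(fitted(fit_lin), trees$Volume,
     xlab = "fitted values", ylab = "Volume")

# If the summary plot is curved, try a log transformation of the response too.
fit_log <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)
plot(fitted(fit_log), log(trees$Volume),
     xlab = "fitted values", ylab = "log(Volume)")
abline(0, 1)                   # points near this line indicate a good linear fit
```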