Class Notes in Statistics and Econometrics, Part 17


Problem 384. Someone said on an email list about statistics: if you cannot see an effect in the data, then there is no use trying to estimate it. Right or wrong?

Answer. One argument one might give is the curse of dimensionality. Also, higher moments of the distribution, kurtosis etc., cannot be seen very clearly with the plain eye.

33.1 Scatterplot Matrices

One common graphical method to explore a dataset is to make a scatter plot of each data series against each other and arrange these plots in a matrix. In R, the pairs function does this. Scatterplot matrices should be produced in the preliminary stages of the investigation, but the researcher should not think he or she is done after having looked at the scatterplot matrices.

In the construction of scatter plot matrices, it is good practice to change the signs of some of the variables in order to make all correlations positive if this is possible.
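A minimal sketch of this sign-flipping practice in R follows. The helper flip_signs is hypothetical (it is not part of R or of these notes), and the built-in mtcars data merely stand in for a real dataset. The helper negates every variable that is negatively correlated with a chosen reference variable; this guarantees positive correlations with the reference only, so the remaining pairs still have to be checked.

```r
# Hypothetical helper: negate each column that correlates negatively
# with the reference column, and mark the flipped columns by name.
flip_signs <- function(df, ref = 1) {
  r <- cor(df)[, ref]                      # correlations with the reference
  signs <- ifelse(r < 0, -1, 1)
  out <- sweep(as.matrix(df), 2, signs, `*`)
  colnames(out) <- ifelse(signs < 0, paste0("neg_", colnames(df)), colnames(df))
  as.data.frame(out)
}

cars <- mtcars[, c("mpg", "wt", "hp", "qsec")]  # stand-in data (assumption)
flipped <- flip_signs(cars)
all_positive <- all(cor(flipped) > 0)           # TRUE only if every pair is positive
pairs(flipped)                                  # scatterplot matrix after flipping
```

If all_positive is FALSE, no flip relative to this reference works, and a different set of sign changes (or none at all) may be needed.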

[BT99, pp. 17–20] gives a good example of what kinds of things can be seen from looking at scatterplot matrices. The data for this book are available at http://biometrics.ag.uq.edu.au/software.htm

Problem 385. 5 points Which inferences about the datasets can you draw from looking at the scatterplot matrix in [BT99, Exhibit 3.2, p. 14]?

Answer. The discussion on [BT99, p. 19?] distinguishes three categories. First the univariate phenomena:

• yield is more concentrated for local genotypes (•) than for imports (◦);

• the converse is true for protein % but not as pronounced;

• oil % and seed size are lower for local genotypes (•); regarding seed size, the heaviest • is lighter than the lightest ◦;

• height and lodging are greater for local genotypes.

Bivariate phenomena are either within-group or between-group phenomena or both:

• negative relationship of protein % and oil % (both within • and ◦);

• positive relationship of oil % and seed size (both within • and ◦ and also between these groups);

• negative relationship, between groups, of seed size and height;

• positive relationship of height and lodging (within ◦ and between groups);

• negative relationship of oil % and lodging (between groups and possibly within •);

• negative relationship of seed size and lodging (between groups);

• positive relationship of height and lodging (between groups).

The between-group phenomena are, of course, not due to an interaction between the groups, but they are the consequence of univariate phenomena.

As a third category, the authors point out unusual individual points:

• 1 high ◦ for yield;

• 1 high • (still lower than all the ◦s) for seed size;

• 1 low ◦ for lodging;

• 1 low • for protein % and oil % in combination.

[Coo98, Figure 2.8 on p. 29] shows a scatterplot matrix of the “horse mussel” data, originally from [Cam89]. This graph is also available at www.stat.umn.edu/RegGraph/graphics/Figure 2.8.gif. Horse mussels (Atrinia) were sampled from the Marlborough Sounds. The five variables are L = Shell length in mm, W = Shell width in mm, H = Shell height in mm, S = Shell mass in g, and M = Muscle mass in g. M is the part of the mussel that is edible.

Problem 386. 3 points In the mussel data set, M is the “response” (according to [Coo98]). Is it justified to call this variable the “response” and the other variables the explanatory variables, and if so, how would you argue for it?

Answer. This is one of the issues which is not sufficiently discussed in the literature. It would be justified if the dimensions and weight of the shell were exogenous to the weight of the edible part of the mussel. I.e., if the mussel first grows the shell, and then it fills this shell with muscle, and the dimensions of the shell affect how big the muscle can grow, but the muscle itself does not have an influence on the dimensions of the shell. If this is the case, then it makes sense to look at the distribution of M conditionally on the other variables, i.e., ask the question: given certain weights and dimensions of the shell, what is the nature of the mechanism by which the muscle grows inside this shell? But if muscle and shell grow together, both affected by the same variables (temperature, nutrition, daylight, etc.), then the conditional distribution is not informative. In this case, the joint

In order to get this dataset into R, you simply say data(mussels), after having said library(ecmet). Then you need the command pairs(mussels) to get the scatterplot matrix. Also interesting is pairs(log(mussels)), especially since the log transformation is appropriate if one explains volume and weight by length, height, and width.
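Why logs suit this situation can be sketched with simulated data, for readers who do not have the ecmet package at hand. Everything below is an assumption made for illustration (the variable names mimic the mussel data, but the numbers are invented): mass is taken to be proportional to the product of three dimensions, so the relationship is multiplicative on the raw scale and additive on the log scale.

```r
set.seed(1)
n <- 200
# Simulated shell measurements (assumed values, not the mussel data)
L <- exp(rnorm(n, mean = 5, sd = 0.2))
W <- 0.4 * L * exp(rnorm(n, sd = 0.1))
H <- 0.3 * L * exp(rnorm(n, sd = 0.1))
# Mass proportional to the volume L * W * H, with multiplicative noise
M <- 1e-5 * L * W * H * exp(rnorm(n, sd = 0.05))
sim <- data.frame(L, W, H, M)

pairs(sim)        # raw scale: curved relationships with M
pairs(log(sim))   # log scale: the relationships look linear

# On the log scale, the multiplicative law becomes an additive one
fit <- lm(log(M) ~ log(L) + log(W) + log(H), data = sim)
slopes <- coef(fit)[-1]   # each slope should be close to 1
```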

The scatter plot of M versus H shows a clear curvature; but one should not jump to the conclusion that the regression is not linear. Cook brings another example with constructed data, in which the regression function is clearly linear, without error term, and in which nevertheless the scatter plot of the response versus one of the predictors shows a similar curvature as in the mussel data.
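The effect can be reproduced with a small simulation. This is not Cook's dataset; the predictors below are invented, and the only thing assumed in common with his example is that the response is an exact linear function of two predictors which are themselves nonlinearly related.

```r
set.seed(2)
n <- 100
x1 <- runif(n, -1, 1)
x2 <- 4 * x1^2 + rnorm(n, sd = 0.3)   # assumed nonlinear link between predictors
y  <- 3 + x1 + x2 / 2                 # exactly linear in x1 and x2, no error term
d  <- data.frame(y, x1, x2)

pairs(d)                               # the y-versus-x1 panel looks curved

fit_full <- lm(y ~ x1 + x2, data = d)  # recovers 3, 1, 0.5 with zero residuals
fit_x1   <- lm(y ~ x1, data = d)       # x1 alone: the effect of x2 lands in the residuals
max_resid_full <- max(abs(resid(fit_full)))
```

The curvature in the y-versus-x1 panel comes entirely from the relationship between x1 and x2, not from any nonlinearity in the regression function.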

Problem 387. Cook’s constructed dataset is available as dataset reggra29 in the ecmet package. Make a scatterplot matrix of this dataset, then load it into XGobi and convince yourself that y depends linearly on x1 and x2.

Answer. You need the commands data(reggra29) and then pairs(reggra29) to get the scatterplot matrix. Before you can access xgobi from R, you must give the command library(xgobi). Then xgobi(reggra29). The dependency is y = 3 + x1 + x2/2.

Problem 388. 2 points Why can the scatter plot of the dependent variable against one of the independent variables be so misleading?

Answer. Because the included independent variable becomes a proxy for the excluded variable. The effect of the excluded variable is mistaken to come from the included variable. Now if the included and the excluded variable are independent of each other, then the omission of the excluded variable increases the noise, but does not have a systematic effect. But if there is an empirical relationship between the included and the excluded variable, then this translates into a spurious relationship between included and dependent variables. The mathematics of this is discussed in

Problem 389. Would it be possible in the scatter plot in [Coo98, p. 217] to reverse the signs of some of the variables in such a way that all correlations are positive?

Answer. Yes, you have to reverse the signs of 6Below and AFDC. Here are the instructions how to do the scatter plots: in arc, go to the load menu. (Ignore the close and the menu boxes, they don’t seem to work.) Then type the path into the long box, /usr/share/ecmet/xlispstat, and press return. This gives me only one option, Minneapolis-schools.lsp. I have to press 3 times on this until it jumps to the big box, then I can press enter on the big box to load the data. This gives me a bigger menu. Go to the MPLSchools menu, and to the add variable option. You have to type in 6BelNeg = (- 6Below), then enter, then AFDCNeg = (- AFDC), and finally BthPtsNeg = (- BthPts). Then go to the Graph&Fit menu, and select scatterplot matrix. Then you have to be careful about the order: first select AFDCNeg in the left box and double click so that it jumps over to the right box. Then select HS, then BthPtsNeg, then 6BelNeg, then 6Above. Now the scatterplot matrix will be oriented all in 1 direction.

33.2 Conditional Plots

In order to account for the effect of excluded variables in a scatter plot, the function coplot makes scatter plots in which the excluded variable is conditioned upon. The graphics demo has such a conditioning plot; here is the code (from the file /usr/lib/R/demos/graphics/graphics.R):

data(quakes)

coplot(long ~ lat | depth, data=quakes, pch=21)
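To see what the conditioning amounts to, one can look at the depth ranges that define the panels. The sketch below uses co.intervals with number = 6 and overlap = 0.5, which are coplot's documented defaults; the name depth_ranges is just an illustrative choice.

```r
data(quakes)
# Six overlapping depth ranges, each holding roughly the same number of points
depth_ranges <- co.intervals(quakes$depth, number = 6, overlap = 0.5)
depth_ranges

# Passing the ranges explicitly gives the same conditioning as the default call
coplot(long ~ lat | depth, data = quakes, pch = 21,
       given.values = depth_ranges)
```

Each panel of the conditioning plot shows long against lat only for the earthquakes whose depth falls into one of these ranges.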

33.3 Spinning

An obvious method to explore a more than two-dimensional structure graphically is to look at plots of y against various linear combinations of x. Many statistical software packages have the ability to do so, but one of the most powerful ones is XGobi. Documentation about xgobi, which is more detailed than the help(xgobi) in R/Splus, can be obtained by typing man xgobi while in unix. A nice brief documentation is [Rip96]. The official manual is [SCB91] and [BCS96].

XGobi can be used as a stand-alone program or it can be invoked from inside R or Splus. In R, you must give the command library(xgobi) in order to make the function xgobi accessible.

The search for “interesting” projections of the data into one-, two-, or 3-dimensional spaces has been automated in projection pursuit regression programs. The basic reference is [FS81], but there is also the much older [FT74].

The most obvious graphical regression method consists in slicing or binning the data, and taking the mean of the data in each bin. But if you have too many explanatory variables, this local averaging becomes infeasible, because of the “curse of dimensionality.” Consider a dataset with 1000 observations and 10 variables, all between 0 and 1. In order to see whether the data are uniformly distributed or whether they have some structure, you may consider splitting up the 10-dimensional unit cube into smaller cubes and counting the number of datapoints in each of these subcubes. The problem here is: if one makes those subcubes large enough that they contain more than 0 or 1 observations, then their coordinate lengths are not much smaller than the unit hypercube itself. Even with a side length of 1/2, which would be the largest reasonable side length, one needs 1024 subcubes to fill the hypercube, therefore the average number of data points is a little less than 1. By projecting instead of taking subspaces, projection pursuit regression does not have this problem of data scarcity.
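The arithmetic of this example can be checked directly. The following sketch simulates 1000 uniform points in the 10-dimensional unit cube (simulated data, chosen only to match the numbers in the text) and counts how many of the 2^10 subcubes of side length 1/2 are occupied.

```r
set.seed(3)
n <- 1000; k <- 10
x <- matrix(runif(n * k), nrow = n)        # uniform points in the unit hypercube

n_cells <- 2^k                              # 1024 subcubes of side length 1/2
avg_per_cell <- n / n_cells                 # 1000 / 1024, a little less than 1

# Label each point by the subcube it falls into (one 0/1 digit per coordinate)
cell <- apply(x >= 0.5, 1, function(b) paste(as.integer(b), collapse = ""))
counts <- table(cell)
n_occupied <- length(counts)                # occupied subcubes, out of 1024
n_empty <- n_cells - n_occupied             # subcubes without any observation
```

Since there are fewer observations than subcubes, some subcubes are necessarily empty, and most of the occupied ones hold only one or two points.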

Projection pursuit regression searches for an interesting and informative projection of the data by maximizing a criterion function. A logical candidate would for instance be the variance ratio as defined in (8.6.7), but there are many others.

About grand tours, projection pursuit guided tours, and manual tours see [CBCH97] and [CB97].
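For readers who want to try projection pursuit regression without XGobi, R's stats package provides the function ppr. The sketch below applies it to simulated data (an assumption made for illustration, not a dataset from these notes) whose regression function is a pure interaction term, and asks for two ridge terms.

```r
set.seed(4)
n <- 500
d <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
d$y <- d$x1 * d$x2 + rnorm(n, sd = 0.05)   # simulated pure-interaction surface

# Projection pursuit regression with two ridge terms s_m(alpha_m' x)
fit <- ppr(y ~ x1 + x2, data = d, nterms = 2)
fit$alpha                                    # estimated projection directions
r2 <- cor(predict(fit, newdata = d), d$y)^2  # share of variance reproduced

op <- par(mfrow = c(1, 2))
plot(fit)                                    # the two fitted ridge functions
par(op)
```

The estimated directions in fit$alpha are found numerically, so they need not coincide exactly with any particular analytical choice.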

Problem 390. If you run XGobi from the menu in Debian GNU/Linux, it uses prim7, which is a 7-dimensional particle physics data set used as an example in [FT74].

The following is from the help page for this dataset: There are 500 observations taken from a high energy particle physics scattering experiment which yields four particles. The reaction can be described completely by 7 independent measurements. The important features of the data are short-lived intermediate reaction stages which appear as protuberant “arms” in the point cloud.

The projection pursuit guided tour is the tool to use to understand this data set. Using all 7 variables turn on projection pursuit and optimize with the Holes index until a view is found that has a triangle and two arms crossing each other off one edge (this is very clear once you see it but the Holes index has a tendency to get stuck in another local maximum which doesn’t have much structure). Brush the arms with separate colours and glyphs. Change to the Central Mass index and optimize. As new arms are revealed brush them and continue. When you have either run out of colours or time turn off projection pursuit and watch the data touring. Then it becomes clear that the underlying structure is a triangle with 5 or 6 arms (some appear to be 1-dimensional, some 2-dimensional) extending from the vertices.

A more practically oriented book, which teaches the software especially developed for this approach, is [CW99].

For any graphical procedure exploring linear combinations of the explanatory variables, the structural dimension d of a regression is relevant: it is the smallest number of distinct linear combinations of the predictors required to characterize the conditional distribution of y|x.

If the data follow a linear regression then their structural dimension is 1. But even if the regression is nonlinear but can be written in the form

(33.4.1)  $y \mid x \sim g(\beta^\top x) + \sigma(\beta^\top x)\,\varepsilon$

with ε independent of x, this is also a population with structural dimension of 1. If t is a monotonic transformation, then

with β1 and β2 linearly independent, then the structural dimension is 2, since one needs 2 different linear combinations of x to characterize the distribution of y. If

then this is a very simple relationship between x and y, but the structural dimension is k, the number of dimensions of x, since the relationship is not intrinsically linear.

Problem 391. [FS81, p. 818] Show that the regression function consisting of the interaction term between x1 and x2 only, $\phi(x) = x_1 x_2$, has structural dimension 2, i.e., it can be written in the form $\phi(x) = \sum_{m=1}^{2} s_m(\alpha_m^\top x)$ where the $s_m$ are smooth functions of one variable.

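Answer.

(33.4.4)  $\alpha_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \\ o \end{bmatrix}, \qquad \alpha_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \\ o \end{bmatrix}$

where o is the zero vector for the remaining components of x. Then $\alpha_1^\top x = (x_1 + x_2)/\sqrt{2}$ and $\alpha_2^\top x = (x_1 - x_2)/\sqrt{2}$, so that the choice $s_1(z) = z^2/2$ and $s_2(z) = -z^2/2$ gives $s_1(\alpha_1^\top x) + s_2(\alpha_2^\top x) = \frac{(x_1 + x_2)^2 - (x_1 - x_2)^2}{4} = x_1 x_2$.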
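This decomposition is easy to verify numerically. The sketch below restricts x to its first two components (the others receive zero weight) and uses the ridge functions s1(z) = z²/2 and s2(z) = −z²/2, which are the natural choice implied by the identity x1 x2 = ((x1 + x2)² − (x1 − x2)²)/4; the random test points are arbitrary.

```r
set.seed(5)
x <- matrix(rnorm(200), ncol = 2)            # 100 random points (x1, x2)
a1 <- c(1,  1) / sqrt(2)                     # first projection direction
a2 <- c(1, -1) / sqrt(2)                     # second projection direction
s1 <- function(z)  z^2 / 2
s2 <- function(z) -z^2 / 2

phi   <- x[, 1] * x[, 2]                           # the interaction term
ridge <- s1(drop(x %*% a1)) + s2(drop(x %*% a2))   # sum of two ridge functions
max_gap <- max(abs(phi - ridge))                   # zero up to rounding error
```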

Problem 392. [Coo98, p. 62] In the rubber data, mnr is the dependent variable y, and temp and dp form the two explanatory variables x1 and x2. Look at the data using XGobi or some other spin program. What is the structural dimension of the data set?

The rubber data are from [Woo72], and they are also discussed in [Ric, p. 506]. mnr is modulus of natural rubber, temp the temperature in degrees Celsius, and dp Dicumyl Peroxide in percent.

Answer. The data are a plane that has been distorted by twisting and stretching. Since one needs a different view to get the best fit of the points in the upper-right corner than for the points in the lower-left corner, the structural dimension must be 2.

If one looks at the scatter plots of y against all linear combinations of components of x, and none of them show a relationship (either linear or nonlinear), then the structural dimension is zero.

Here are the instructions how to do graphical regression on the mussel data. Select the load menu (take the cursor down a little until it is black, then go back up), then press the Check Data Dir box, then double click ARCG so that it jumps into the big box. Then Update/Open File will give you a long list of selections, where you will find mussels.lsp. Double click on this so that it jumps into the big box, and then press on the Update/Open File box.

Now for the Box-Cox transformation. I first have to go to scatterplot matrices, then click on transformations, then to find normalizing transformations. If you just select the 4 predictors and then press the OK button, there will be an error message; apparently the starting values were not good enough. Try again, using marginal Box-Cox Starting Values. This will succeed, and the LR test for all transformations being logs has a p-value of 0.14. Therefore choose the log transform for all the predictors. (If we include all 5 variables, the LR test for all transformations to be log transformations has a p-value of 0.000.) Therefore transform the 4 predictor variables only to logs. There you see the very linear relationship between the predictors, and you see that all the scatter plots with the response are very similar. This is a sign that the structural dimension is 1 according to [CW99, pp. 435/6]. If that is the case, then a plot of the actual against the fitted values is a sufficient summary plot. For this, run the Fit Linear LS menu option, and then plot the dependent variable against the fitted value. Now the next question might be: what transformation will linearize this, and a log curve seems to fit well.
