Linear regression techniques are designed to help us predict expected values, as in E(Y) = m + bX. But what if our real interest is in predicting extreme values, if, for example, we would like to characterize the observations of Y that are likely to lie in the upper and lower tails of Y's distribution?
Even when expected values or medians lie along a straight line, other quantiles may follow a curved path. Koenker and Hallock applied the method of quantile regression to data taken from Ernst Engel's study in 1857 of the dependence of households' food expenditure on household income. As Fig. 7.8 reveals, not only was an increase in food expenditures observed as expected when household income was increased, but the dispersion of the expenditures increased also.
In estimating the τth quantile,¹ we try to find that value of b for which the sum Σ ρτ(yi − xi b) over all observations is a minimum, where ρτ weights positive residuals by τ and negative residuals by τ − 1; that is, ρτ(z) = τz if z ≥ 0, and ρτ(z) = (τ − 1)z if z < 0.

¹ τ is pronounced tau.
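Because the check-function loss above is just a weighted sum of absolute residuals, it can be minimized directly with a general-purpose optimizer. The following is a minimal Python sketch, not the method used in the text: the data are synthetic stand-ins for Engel's food-expenditure study, and all names are illustrative.

import numpy as np
from scipy.optimize import minimize

def check_loss(params, x, y, tau):
    a, b = params                               # intercept and slope
    r = y - (a + b * x)                         # residuals
    # rho_tau weights positive residuals by tau, negative ones by tau - 1
    return np.sum(np.where(r >= 0, tau * r, (tau - 1) * r))

rng = np.random.default_rng(1)
income = rng.uniform(400, 5000, 235)            # stand-in for Engel's data
food = 100 + 0.5 * income + rng.normal(0, 0.1 * income)  # dispersion grows

for tau in (0.1, 0.5, 0.9):
    fit = minimize(check_loss, x0=[100.0, 0.5], args=(income, food, tau),
                   method="Nelder-Mead")
    print(f"tau={tau}: intercept={fit.x[0]:.1f}, slope={fit.x[1]:.3f}")

As with the Engel data, the fitted slopes for the upper and lower quantiles differ when the dispersion of Y grows with X.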
7.6 VALIDATION
As noted in the preceding sections, more than one model can provide a satisfactory fit to a given set of observations; even then, goodness of fit is no guarantee of predictive success. Before putting the models we develop to practical use, we need to validate them. There are three main approaches to validation: independent verification, splitting the sample, and bootstrap resampling.

7.6.1 Independent Verification

Independent verification of a model's form and the choice of variables is obtained by attempting to fit the same model in a similar but distinct context.
For example, having successfully predicted an epidemic at one army base, one would then wish to see whether a similar model might be applied at a second and a third almost-but-not-quite identical base.
Independent verification can help discriminate among several models that appear to provide equally good fits to the data. Independent verification can be used in conjunction with either of the two other validation methods. For example, an automobile manufacturer was trying to forecast parts sales. After correcting for seasonal effects and long-term growth within each region, ARIMA techniques were used.² A series of best-fitting ARIMA models was derived: one model for each of the nine sales regions into which the sales territory had been divided. The nine models were quite different in nature. As the regional seasonal effects and long-term growth trends had been removed, a single ARIMA model applicable to all regions, albeit with coefficients that depended on the region, was more plausible. The model selected for this purpose was the one that gave the best fit when applied to all regions.
Independent verification also can be obtained through the use of surrogate or proxy variables. For example, we may want to investigate past climates and test a model of the evolution of a regional or worldwide climate over time. We cannot go back directly to a period before direct measurements on temperature and rainfall were made, but we can observe the width of growth rings in long-lived trees or measure the amount of carbon dioxide in ice cores.

7.6.2 Splitting the Sample
For validating time series, an obvious extension of the methods described in the preceding section is to hold back the most recent data points, fit the model to the balance of the data, and then attempt to "predict" the values held in reserve.

When time is not a factor, we still would want to split the sample into two parts, one for estimating the model parameters and the other for verification. The split should be made at random. The downside is that when we use only a portion of the sample, the resulting estimates are less precise.
In Exercises 7.24–7.26, we want you to adopt a compromise proposed by Moiser. Begin by splitting the original sample in half; choose your regression variables and coefficients independently for each of the subsamples. If the results are more or less in agreement, then combine the two samples and recalculate the coefficients with greater precision.

² For examples and discussion of AutoRegressive Integrated Moving Average processes, used to analyze data whose values change with time.
There are several different ways to arrange for the division. Here is one way, sketched in code after the list:

• Suppose we have 100 triples of observations in columns 1 through 3. We start a 4th column as we did in Chapter 1 for an audit, insert the formula =Rand() in the top cell, and copy it down the column. Wherever a value greater than 0.500 appears, the observation will be included in the training set.
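For readers working outside the spreadsheet, here is a minimal Python sketch of the same random split, followed by Moiser's compromise of fitting each half independently and comparing. The data are synthetic stand-ins for the 100 triples, and all names are ours.

import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
y = 1.0 + 2.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(size=n)

u = rng.random(n)                     # plays the role of the =Rand() column
half_a, half_b = u > 0.5, u <= 0.5

# Fit the model separately to each half (Moiser's compromise)
coef_a, *_ = np.linalg.lstsq(X[half_a], y[half_a], rcond=None)
coef_b, *_ = np.linalg.lstsq(X[half_b], y[half_b], rcond=None)
print("half A:", coef_a.round(2), " half B:", coef_b.round(2))

# If the two fits roughly agree, recombine and refit for greater precision
coef_all, *_ = np.linalg.lstsq(X, y, rcond=None)
print("combined:", coef_all.round(2))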
Exercise 7.24. Apply Moiser’s method to the Milazzo data of Exercise7.12 Can total coliform levels be predicted on the basis of month, oxygenlevel, and temperature?
Note: As conditions and relationships do change over time, any method of prediction should be revalidated frequently. For example, suppose we had used observations from January 2000 to January 2004 to construct our original model and held back more recent data from January to June 2004 to validate it. When we reach January 2005, we might refit the model, using the data from 1/2000 to 6/2004 to select the variables and determine the values of the coefficients, then use the data from 6/2004 to 1/2005 to validate the revised model.
Exercise 7.26. Some authorities would suggest discarding the earliest observations before refitting the model. In the present example, this would mean discarding all the data from the first half of the year 2000. Discuss the possible advantages and disadvantages of discarding these data.

7.6.3 Cross-Validation with the Bootstrap
Recall that the purpose of bootstrapping is to simulate the taking of repeated samples from the original population (and to save money and time by not having to repeat the entire sampling procedure from scratch). By bootstrapping, we are able to judge to a limited extent whether the models we derive will be useful for predictive purposes or whether they will fail to carry over from sample to sample. As Exercise 7.27 demonstrates, some variables may prove more reliable as predictors than others.
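To make the idea concrete, here is a hedged Python sketch of bootstrap refitting (not the XLSTAT procedure used in Exercise 7.27): rows are resampled with replacement, a linear model is refit each time, and the spread of each coefficient across bootstrap samples indicates how reliably that variable predicts. The data and names are synthetic and illustrative.

import numpy as np

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = 2.0 + 1.5 * X[:, 1] + rng.normal(size=n)   # only predictor x1 matters

coefs = []
for _ in range(1000):
    idx = rng.integers(0, n, n)                # bootstrap sample of rows
    b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    coefs.append(b)
coefs = np.array(coefs)

# Predictors whose coefficients hop around zero are unreliable predictors
for j, name in enumerate(["intercept", "x1", "x2", "x3"]):
    lo, hi = np.percentile(coefs[:, j], [2.5, 97.5])
    print(f"{name}: 95% of bootstrap estimates lie in [{lo:.2f}, {hi:.2f}]")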
Exercise 7.27. Bootstrap repeatedly from the data provided in Exercises 7.22 and 7.23 and use the XLSTAT stepwise function to select the variables to be incorporated in the model each time. Are some variables common to all the models?
7.7 CLASSIFICATION AND REGRESSION TREES
As the number of potential predictors increases, the method of linear regression becomes less and less practical. With three potential predictors, we can have as many as eight coefficients to be estimated: one for the intercept, three for first-order terms in the predictors Pi, three for second-order terms of the form PiPj, and one for the third-order term P1P2P3. With k variables, we have k first-order terms, k(k − 1)/2 second-order terms, and so forth. Should all these terms be included in our model? Which ones should be neglected? With so many possible combinations, will a single equation be sufficient?
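The count is quick to verify: each subset of distinct predictors contributes one product term, with the empty subset giving the intercept, for 2 to the power k terms in all. The snippet below is ours, not the book's.

from math import comb

# With k predictors, comb(k, order) counts the product terms of each order;
# the empty product is the intercept, so the total is 2**k.
k = 3
terms = [comb(k, order) for order in range(k + 1)]
print(terms, sum(terms))   # [1, 3, 3, 1] 8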
We need to consider alternate approaches. If you're a mycologist, a botanist, a herpetologist, or simply a nature lover, you may have made use of some sort of a key. For example:

1. Leaves simple?
   3. Leaves needle-shaped?
      a. Leaves in clusters of 2 to many?
         i. Leaves in clusters of 2 to 5, sheathed, persistent for several years?
To derive the decision tree depicted in Fig. 7.9, we began by grouping our prospects' attitudes into categories using the data from Exercise 7.22. Purchase attitudes of 1, 2, or 3 indicate low interest; 4, 5, and 6 indicate medium interest; and 7, 8, and 9 indicate high interest. For example, if the original purchase data were in column L, we might categorize the first entry in an adjacent column via the command =IF(L3 < 4, 1, IF(L3 < 7, 2, 3)), which we then would copy down the column.
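For readers working outside the spreadsheet, the same three-way categorization can be done in a line of Python; the scores below are made up for illustration.

import pandas as pd

# (0, 3] -> 1 (low), (3, 6] -> 2 (medium), (6, 9] -> 3 (high),
# matching the nested IF formula for integer scores
purchase = pd.Series([1, 4, 9, 6, 3, 7])        # example attitude scores
interest = pd.cut(purchase, bins=[0, 3, 6, 9], labels=[1, 2, 3])
print(interest.tolist())                         # [1, 2, 3, 2, 1, 3]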
As in Exercise 7.22, the object was to express Purchase as a function of Fashion, Gamble, and Ozone. The computer considered each of the variables in turn, looking to find both the variable and the associated value that would be most effective in subdividing the data. Eventually, it settled on "Is Gamble < 5.5?" as the most effective splitter. This question divides the training data set into two groups, one containing all the most likely prospects.

The computer then proceeded to look for a second splitter that would separate the "lo" prospects from the medium. Again, "Gamble" proved to be the most useful, and so on.
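The search the computer performs at each node can be sketched in a few lines of Python. This toy implementation (an illustration, not the Ctree algorithm itself) scans every variable and every cutpoint and keeps the split whose two subgroups are purest, here measured by the Gini index.

import numpy as np

def gini(labels):
    # Gini impurity: 0 for a pure group, larger for mixed groups
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    best = (None, None, np.inf)               # (variable, cutpoint, impurity)
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j]):
            left, right = y[X[:, j] < cut], y[X[:, j] >= cut]
            if len(left) == 0 or len(right) == 0:
                continue
            # size-weighted impurity of the two subgroups
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, cut, score)
    return best

# Tiny made-up example: column 1 separates the two classes cleanly
X = np.array([[1.0, 2.0], [2.0, 8.0], [3.0, 9.0], [4.0, 1.0]])
y = np.array([0, 1, 1, 0])
print(best_split(X, y))                        # expect a cut on column 1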
Obviously, building a decision tree is not something you would want to attempt in the absence of a computer and the appropriate software. Fortunately, you can download Ctree, a macro-filled Excel spreadsheet, from http://www.geocities.com/adotsaha/CTree/CtreeinExcel.html
The first of the seven worksheets in the Ctree package, labeled "ReadMe," contains detailed instructions for the use of the remaining worksheets. Initially, the Ctree "Data" worksheet contains the sepal length, sepal width, petal length, and petal width of 150 irises. The attempt at classification of the iris into three separate species on the basis of these measurements dates back to 1935. Our own first clues to the number of subpopulations or categories of iris, as well as to the general shape of the underlying frequency distribution, come from consideration of the histogram in Fig. 7.10. A glance suggests the presence of at least two species, although because of the overlap of the various subpopulations it is difficult to be sure. Three species actually are present, as shown in Fig. 7.11.
FIGURE 7.9 Decision tree for classifying prospects (splits on Gamble < 5.5, Gamble < 4.5, Fashion < 5.5, Gamble < 3.5, and Gamble < 6.5; terminal nodes labeled lo and med).

FIGURE 7.10 Histogram of the 150 petal widths (PETALWID).

FIGURE 7.11 Representing three variables in two dimensions: iris species (Virginica, Versicolor, Setosa) plotted by petal length (Petallen, mm), sepal length (Sepallen, mm), and petal width (Petalwid, mm); sepal width not shown. Source: Fisher (1936) iris data. Derived with the help of SAS/Graph®.

In constructing the decision tree depicted in part in Fig. 7.12, we made two modifications to the default settings in the Ctree spreadsheet. First, on the Data sheet, we included sepal length and sepal width as explanatory variables, changing the settings in row 21 from "omit" to "cont." Surprisingly, this change did not affect the resulting decision tree, which still made use of only petal length and petal width as classifiers. Second, on the UserInput sheet, we selected option 1 for partitioning into training and test sets; option 2 is appropriate only with time series data.
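As a cross-check outside the spreadsheet, the same analysis can be run with scikit-learn (our substitution; the book itself works in Ctree). With the canonical Fisher iris data and all four measurements offered as predictors, a shallow tree typically ends up splitting only on the petal dimensions, echoing the Ctree result.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)
# Print the fitted splits; petal length and petal width dominate
print(export_text(tree, feature_names=list(iris.feature_names)))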
Note in Fig. 7.12 that the setosa species is classified on the basis of a single value, while distinguishing the versicolor and virginica subspecies is far more difficult.

FIGURE 7.12 Part of decision tree for iris classification.

TABLE 7.5 Classification of Training Data: predicted class versus true class (Setosa, Virginica, Versicolor).

As noted earlier, we set aside 20% of the flowers at random on which to test our classification scheme. The test results in Table 7.6, abstracted from the Results worksheet, are reassuring.
Exercise 7.28. Show that the decision tree method only makes use of variables it considers important by constructing a tree for classifying prospective purchasers into hi, med, and lo categories using the model

Purchase ~ Fashion + Ozone + Pollution + Coupons + Gamble + Today + Work + UPM
Exercise 7.29. Apply the CART method to the Milazzo data of Exercise 7.12 to develop a prediction scheme for coliform levels in bathing water based on the month, oxygen level, and temperature.
7.8 DATA MINING
When data sets are very large, with hundreds of rows and dozens of columns, different algorithms come into play. In Section 2.2.1, we considered the possibility of a market basket analysis, when a retail outlet would wish to analyze the pattern of its sales to see what items might be profitably grouped and marketed together.
Table 7.7 depicts part of just such a data set. Each column corresponds to a different type of book and each row corresponds to a single transaction. The complete data set contains 2000 transactions.
Trang 14After downloading and installing Xlminer, an Excel add-in, from
http://www.resample.com/xlminer, select “Affinity” from theXlMiner menu, and then “Association rules.” The Association Rules dialogbox will appear as shown in Fig 7.13
Completing the dialog as shown, click on the OK button to see the results displayed in Fig. 7.14. Note that each rule is presented along with an estimated confidence level, support, and lift ratio.

FIGURE 7.14 Results of a market basket analysis of the bookstore data.
Rule #1 says that if an Italian cookbook and a Youthbook are bought, a cookbook will also be bought. This particular rule has confidence of 100%, meaning that, of the people who bought an Italian cookbook and a Youthbook, all (100%) bought cookbooks as well. "Support (a)" indicates that it has the support of 118 transactions, meaning that 118 people bought an Italian cookbook and a Youthbook, total. "Support (c)" indicates the number of transactions involving the purchase of cookbooks, total. (This is a piece of side information; it is not involved in calculating the confidence or support for the rule itself.) "Support (a ∪ c)" is the number of transactions where an Italian cookbook and a Youthbook as well as a cookbook were bought.
The lift ratio indicates how much more likely one is to encounter a cookbook transaction if just those transactions where an Italian cookbook and a Youthbook were purchased are considered, as compared with the entire population of transactions: it is the confidence divided by support (c), where the latter is expressed as a percentage. For Rule #1, the confidence is 100% and support (c) as a percentage is (862/2000) × 100 = 43.1, so the lift ratio = 100/43.1 = 2.32.
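The arithmetic is easy to verify. The short Python sketch below recomputes the confidence and lift for Rule #1 from the counts quoted above; the variable names are ours.

# Rule #1: if A (Italian cookbook) and B (Youthbook) are bought,
# then C (cookbook) is bought. Counts are those quoted in the text.
n_transactions = 2000
support_a = 118          # transactions containing both A and B
support_c = 862          # transactions containing C
support_a_and_c = 118    # transactions containing A, B, and C

confidence = support_a_and_c / support_a          # 1.00, i.e., 100%
lift = confidence / (support_c / n_transactions)  # 1.00 / 0.431
print(f"confidence = {confidence:.0%}, lift = {lift:.2f}")  # lift = 2.32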
7.9 SUMMARY AND REVIEW
In this chapter, you were introduced to two techniques for classifying and predicting outcomes: linear regression and classification trees. Three methods for estimating linear regression coefficients were described, along with guidelines for choosing among them. You were provided with a stepwise technique for choosing variables to be included in a regression model. The assumptions underlying the regression technique were discussed, along with the resultant limitations. Overall guidelines for model development were provided.

You learned the importance of validating and revalidating your models before placing any reliance upon them. You were introduced to one of the simplest of pattern recognition methods, the classification tree, to be used whenever there are large numbers of potential predictors or when classification rather than quantitative prediction is your primary goal.
Exercise 7.30. Make a list of all the italicized terms in this chapter. Provide a definition for each one, along with an example.
Exercise 7.31. It is almost self-evident that levels of toluene, a commonly used solvent, would be observed in the blood after working in a room where the solvent was present in the air. Do the observations recorded below also suggest that blood levels are a function of age and body weight? Construct a model for predicting blood levels of toluene using this data.
Reporting Your Findings

One of the most common misrepresentations in scientific work is the scientific paper itself. It presents a mythical reconstruction of what actually happened. All of what are in retrospect mistaken ideas, badly designed experiments, and incorrect calculations are omitted. The paper presents the research as if it had been carefully thought out, planned, and executed according to a neat, rigorous process, for example involving testing of a hypothesis. The misrepresentation of the scientific paper is the most formal aspect of the misrepresentation of science as an orderly process based on a clearly defined method.

IN THIS CHAPTER, we assume you have just completed an analysis of your own or someone else's research and now wish to issue a report on the overall findings. You'll learn what to report and how to go about reporting it, with particular emphasis on the statistical aspects of data collection and analysis:

• Power and sample size calculations
• Data collection methods