INTRODUCTION TO STATISTICS THROUGH RESAMPLING METHODS AND MICROSOFT OFFICE EXCEL (Part 9)



Linear regression techniques are designed to help us predict expected values, as in E(Y) = μ + βX. But what if our real interest is in predicting extreme values, if, for example, we would like to characterize the observations of Y that are likely to lie in the upper and lower tails of Y's distribution?

Even when expected values or medians lie along a straight line, other quantiles may follow a curved path. Koenker and Hallock applied the method of quantile regression to data taken from Ernst Engel's study in 1857 of the dependence of households' food expenditure on household income. As Fig. 7.8 reveals, not only was an increase in food expenditures observed as expected when household income was increased, but the dispersion of the expenditures increased also.

In estimating the τth quantile,1 we try to find that value of β that minimizes the sum Σ ρτ(yi − xiβ) over all observations, where the weighting function ρτ(z) equals τz when z > 0 and (τ − 1)z when z ≤ 0.

1 τ is pronounced tau.
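The book carries out its computations in Excel; purely as an illustration of the same minimization, here is a rough Python sketch that fits quantile-regression lines by minimizing the check function numerically. The data are simulated to mimic the growing dispersion seen in the Engel expenditures, and scipy's general-purpose optimizer stands in for a dedicated quantile-regression routine.

    import numpy as np
    from scipy.optimize import minimize

    def check_loss(beta, x, y, tau):
        """Sum of the tau-weighted absolute residuals (the 'check' function)."""
        resid = y - (beta[0] + beta[1] * x)
        return np.sum(np.where(resid > 0, tau * resid, (tau - 1) * resid))

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 200)                      # stand-in for household income
    y = 2 + 0.5 * x + rng.normal(0, 0.3 + 0.2 * x)   # dispersion grows with x, as in Fig. 7.8

    for tau in (0.25, 0.50, 0.75):
        fit = minimize(check_loss, x0=[0.0, 0.0], args=(x, y, tau), method="Nelder-Mead")
        print(tau, fit.x)   # intercept and slope of the tau-th quantile line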


7.6 VALIDATION

As noted in the preceding sections, more than one model can provide a satisfactory fit to a given set of observations; even then, goodness of fit is no guarantee of predictive success. Before putting the models we develop to practical use, we need to validate them. There are three main approaches to validation: independent verification, splitting the sample, and cross-validation with the bootstrap.

7.6.1 Independent Verification

Independent verification of a model's form and the choice of variables is obtained by attempting to fit the same model in a similar but distinct context.


For example, having successfully predicted an epidemic at one army base, one would then wish to see whether a similar model might be applied at a second and a third almost-but-not-quite identical base.

Independent verification can help discriminate among several models that appear to provide equally good fits to the data. Independent verification can be used in conjunction with either of the two other validation methods. For example, an automobile manufacturer was trying to forecast parts sales. After correcting for seasonal effects and long-term growth within each region, ARIMA techniques were used.2 A series of best-fitting ARIMA models was derived: one model for each of the nine sales regions into which the sales territory had been divided. The nine models were quite different in nature. As the regional seasonal effects and long-term growth trends had been removed, a single ARIMA model applicable to all regions, albeit with coefficients that depended on the region, was more plausible. The model selected for this purpose was the one that gave the best fit when applied to all regions.
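The book does not say which software the manufacturer used; as a hypothetical illustration of fitting one ARIMA form to several regions and comparing the fitted coefficients, here is a short sketch using the statsmodels library with simulated, deseasonalized regional series (the region names are invented).

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(4)

    def regional_series(phi, n=120):
        """Simulate a deseasonalized, detrended monthly sales series for one region."""
        y = np.zeros(n)
        for t in range(1, n):
            y[t] = phi * y[t - 1] + rng.normal()
        return y

    # Fit the same ARIMA(1, 0, 0) form to every region; if the fitted coefficients
    # are similar, a single common model becomes plausible.
    for region, phi in {"North": 0.60, "South": 0.55, "West": 0.65}.items():
        result = ARIMA(regional_series(phi), order=(1, 0, 0)).fit()
        print(region, result.params.round(2))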

Independent verification also can be obtained through the use of surrogate or proxy variables. For example, we may want to investigate past climates and test a model of the evolution of a regional or worldwide climate over time. We cannot go back directly to a period before direct measurements on temperature and rainfall were made, but we can observe the width of growth rings in long-lived trees or measure the amount of carbon dioxide in ice cores.

7.6.2 Splitting the Sample

For validating time series, an obvious extension of the methods described in the preceding section is to hold back the most recent data points, fit the model to the balance of the data, and then attempt to "predict" the values held in reserve.

When time is not a factor, we still would want to split the sample into two parts, one for estimating the model parameters and the other for verification. The split should be made at random. The downside is that when we use only a portion of the sample, the resulting estimates are less precise.

In Exercises 7.24–7.26, we want you to adopt a compromise proposed by Moiser. Begin by splitting the original sample in half; choose your regression variables and coefficients independently for each of the subsamples.

2 For examples and discussion of AutoRegressive Integrated Moving Average (ARIMA) processes used to analyze data whose values change with time.


If the results are more or less in agreement, then combine the two samples and recalculate the coefficients with greater precision.

There are several different ways to arrange for the division. Here is one way:

• Suppose we have 100 triples of observations in columns 1 through 3. We start a 4th column as we did in Chapter 1 for an audit, insert the formula = Rand() in the top cell, and copy it down the column. Wherever a value greater than 0.500 appears, the observation will be included in the training set (see the sketch below).
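For readers working outside Excel, here is a minimal Python sketch of Moiser's compromise built on the same idea: split at random, fit each half separately, and compare the coefficients before refitting on the combined sample. The data below are simulated, not the exercise data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated stand-in: 100 observations of a response y and three predictors.
    X = rng.normal(size=(100, 3))
    y = 1.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.5, size=100)

    # Random split, mirroring the Rand() column: about half the rows in each subsample.
    in_first = rng.random(100) > 0.5
    X1, y1 = X[in_first], y[in_first]
    X2, y2 = X[~in_first], y[~in_first]

    def ols(X, y):
        """Least-squares coefficients, with an intercept column prepended."""
        A = np.column_stack([np.ones(len(X)), X])
        return np.linalg.lstsq(A, y, rcond=None)[0]

    print("first half :", ols(X1, y1).round(2))
    print("second half:", ols(X2, y2).round(2))
    # If the two sets of coefficients roughly agree, refit on the combined sample:
    print("combined   :", ols(X, y).round(2))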

Exercise 7.24. Apply Moiser's method to the Milazzo data of Exercise 7.12. Can total coliform levels be predicted on the basis of month, oxygen level, and temperature?

Note: As conditions and relationships do change over time, any method of prediction should be revalidated frequently. For example, suppose we had used observations from January 2000 to January 2004 to construct our original model and held back more recent data from January to June 2004 to validate it. When we reach January 2005, we might refit the model, using the data from 1/2000 to 6/2004 to select the variables and determine the values of the coefficients, then use the data from 6/2004 to 1/2005 to validate the revised model.
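A bare-bones numeric version of this rolling hold-out, with simulated monthly figures and a simple trend line standing in for whichever model is actually being revalidated:

    import numpy as np

    rng = np.random.default_rng(3)
    # Hypothetical monthly observations, January 2000 through January 2005 (61 months).
    sales = 100 + 0.8 * np.arange(61) + rng.normal(0, 4, 61)

    train, test = sales[:54], sales[54:]          # 1/2000-6/2004 versus 7/2004-1/2005
    slope, intercept = np.polyfit(np.arange(54), train, 1)
    pred = intercept + slope * np.arange(54, 61)
    print("hold-out RMSE:", np.sqrt(np.mean((test - pred) ** 2)).round(2))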

Exercise 7.26. Some authorities would suggest discarding the earliest observations before refitting the model. In the present example, this would mean discarding all the data from the first half of the year 2000. Discuss the possible advantages and disadvantages of discarding these data.

7.6.3 Cross-Validation with the Bootstrap

Recall that the purpose of bootstrapping is to simulate the taking of repeated samples from the original population (and to save money and time by not having to repeat the entire sampling procedure from scratch). By bootstrapping, we are able to judge to a limited extent whether the models we derive will be useful for predictive purposes or whether they will fail to carry over from sample to sample. As Exercise 7.27 demonstrates, some variables may prove more reliable as predictors than others.
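Here is a rough Python sketch of this kind of bootstrap check. Stepwise selection in XLSTAT is replaced, purely for illustration, by a crude rule that keeps predictors with large coefficients, and the data are simulated rather than taken from Exercise 7.22.

    import numpy as np

    rng = np.random.default_rng(2)
    n, names = 60, ["Fashion", "Gamble", "Ozone"]

    # Simulated stand-in for the survey data: only "Gamble" truly drives the response.
    X = rng.normal(size=(n, 3))
    y = 2.0 + 1.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

    counts = dict.fromkeys(names, 0)
    for _ in range(1000):                          # resample rows with replacement
        rows = rng.integers(0, n, n)
        A = np.column_stack([np.ones(n), X[rows]])
        coef = np.linalg.lstsq(A, y[rows], rcond=None)[0][1:]
        for name, c in zip(names, coef):
            if abs(c) > 0.5:                       # crude stand-in for "selected by stepwise"
                counts[name] += 1

    print(counts)   # predictors selected in most bootstrap samples are the reliable ones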

Exercise 7.27. Bootstrap repeatedly from the data provided in Exercises 7.22 and 7.23 and use the XLSTAT stepwise function to select the variables to be incorporated in the model each time. Are some variables common to all the models?

7.7 CLASSIFICATION AND REGRESSION TREES

As the number of potential predictors increases, the method of linear regression becomes less and less practical. With three potential predictors, we can have as many as eight coefficients to be estimated: one for the intercept, three for first-order terms in the predictors Pi, three for second-order terms of the form PiPj, and one third-order term P1P2P3. With k variables, we have k first-order terms, k(k − 1)/2 second-order terms, and so forth. Should all these terms be included in our model? Which ones should be neglected? With so many possible combinations, will a single equation be sufficient?
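The bookkeeping is easy to verify directly; here is a throwaway Python check of how many product terms are possible for a given number of predictors:

    from itertools import combinations
    from math import comb

    k = 3
    predictors = [f"P{i}" for i in range(1, k + 1)]
    for order in range(1, k + 1):
        terms = list(combinations(predictors, order))
        print(order, len(terms), terms)
    # Including the intercept, a k = 3 model with all products has 1 + 3 + 3 + 1 = 8
    # coefficients; in general there are comb(k, 2) = k*(k-1)/2 second-order terms.
    print(comb(k, 2))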

We need to consider alternate approaches. If you're a mycologist, a botanist, a herpetologist, or simply a nature lover, you may have made use of some sort of a key. For example:

1 Leaves simple?

3 Leaves needle-shaped?

a Leaves in clusters of 2 to many?

i Leaves in clusters of 2 to 5, sheathed, persistent for several years?

To derive the decision tree depicted in Fig. 7.9, we began by grouping our prospects' attitudes into categories using the data from Exercise 7.22. Purchase attitudes of 1, 2, or 3 indicate low interest; 4, 5, and 6 indicate medium interest; and 7, 8, and 9 indicate high interest. For example, if the original purchase data were in column L, we might categorize the first entry in an adjacent column via the command = IF(L3 < 4, 1, IF(L3 < 7, 2, 3)), which we then would copy down the column.
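The same three-way recoding could be done outside Excel as well; for instance, a short pandas version (a substitution of my own, with made-up scores):

    import pandas as pd

    purchase = pd.Series([1, 4, 7, 3, 9, 5])                  # hypothetical attitude scores, 1-9
    interest = pd.cut(purchase, bins=[0, 3, 6, 9], labels=[1, 2, 3])
    print(interest.tolist())                                  # 1 = low, 2 = medium, 3 = high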

As in Exercise 7.22, the object was to express Purchase as a function of Fashion, Gamble, and Ozone. The computer considered each of the variables in turn, looking to find both the variable and the associated value that would be most effective in subdividing the data. Eventually, it settled on "Is Gamble < 5.5?" as the most effective splitter. This question divides the training data set into two groups, one containing all the most likely prospects.

The computer then proceeded to look for a second splitter that would separate the "lo" prospects from the medium. Again, "Gamble" proved to be the most useful, and so on.
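What "most effective" means depends on the software; assuming a Gini-impurity criterion (the Ctree macro may use a different one), a minimal sketch of how such a splitter could be chosen looks like this, with a tiny made-up data set:

    import numpy as np

    def gini(labels):
        """Gini impurity of a set of class labels."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_split(X, y, names):
        """Return the (variable, threshold, gain) giving the largest impurity reduction."""
        best = (None, None, -np.inf)
        parent = gini(y)
        for j, name in enumerate(names):
            for t in np.unique(X[:, j]):
                left, right = y[X[:, j] < t], y[X[:, j] >= t]
                if len(left) == 0 or len(right) == 0:
                    continue
                gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
                if gain > best[2]:
                    best = (name, t, gain)
        return best

    # Tiny hypothetical example in which Gamble alone separates the classes.
    X = np.array([[2, 1], [3, 8], [6, 2], [7, 9]], dtype=float)
    y = np.array(["lo", "lo", "med", "med"])
    print(best_split(X, y, ["Gamble", "Fashion"]))   # the chosen splitter should be Gamble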

Obviously, building a decision tree is not something you would want to attempt in the absence of a computer and the appropriate software. Fortunately, you can download Ctree, a macro-filled Excel spreadsheet, from http://www.geocities.com/adotsaha/CTree/CtreeinExcel.html.

The first of the seven worksheets in the CTree package, labeled "ReadMe," contains detailed instructions for the use of the remaining worksheets. Initially, the Ctree "Data" worksheet contains the sepal length, sepal width, petal length, and petal width of 150 irises. The attempt at classification of the iris into three separate species on the basis of these measurements dates back to 1935. Our own first clues to the number of subpopulations or categories of iris, as well as to the general shape of the underlying frequency distribution, come from consideration of the histogram in Fig. 7.10. A glance suggests the presence of at least two species, although because of the overlap of the various subpopulations it is difficult to be sure. Three species actually are present, as shown in Fig. 7.11.

FIGURE 7.9 Decision tree for classifying prospective purchasers; the splits shown are on Gamble (< 5.5, < 4.5, < 3.5, < 6.5) and Fashion (< 5.5), with terminal nodes labeled "lo" and "med".

FIGURE 7.11 Representing three variables in two dimensions: iris species classification by physical measurement (petal width, petal length in mm, and sepal length in mm; sepal width not shown). Species: Virginica, Versicolor, Setosa. Source: Fisher (1936) iris data. Derived with the help of SAS/Graph®.

In constructing the decision tree depicted in part in Fig. 7.12, we made two modifications to the default settings in the Ctree spreadsheet. First, on the Data sheet, we included sepal length and sepal width as explanatory variables, changing the settings in row 21 from "omit" to "cont." Surprisingly, this change did not affect the resulting decision tree, which still made use of only petal length and petal width as classifiers. Second, on the UserInput sheet, we selected option 1 for partitioning into training and test sets. Option 2 is appropriate only with time series data.
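For readers without the Ctree spreadsheet, roughly the same analysis can be sketched with scikit-learn (a substitution of mine, not used in the book); the 80/20 split mirrors the hold-out described in the text:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(export_text(tree, feature_names=list(iris.feature_names)))
    print("accuracy on the 20% held back:", tree.score(X_test, y_test))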

Note in Fig. 7.12 that the setosa species is classified on the basis of a single value, while distinguishing the versicolor and virginica subspecies is far more difficult.

FIGURE 7.12 Part of decision tree for iris classification.


As noted earlier, we set aside 20% of the flowers at random to test our classification scheme on. The test results in Table 7.6, abstracted from the Results worksheet, are reassuring.

TABLE 7.5 Classification of Training Data: predicted class by true class (Setosa, Virginica, Versicolor).

Exercise 7.28. Show that the decision tree method only makes use of variables it considers important by constructing a tree for classifying prospective purchasers into hi, med, and lo categories using the model

Purchase ~ Fashion + Ozone + Pollution + Coupons + Gamble + Today + Work + UPM

Exercise 7.29. Apply the CART method to the Milazzo data of Exercise 7.12 to develop a prediction scheme for coliform levels in bathing water based on the month, oxygen level, and temperature.

7.8 DATA MINING

When data sets are very large, with hundreds of rows and dozens of columns, different algorithms come into play. In Section 2.2.1, we considered the possibility of a market basket analysis, in which a retail outlet would wish to analyze the pattern of its sales to see what items might be profitably grouped and marketed together.

Table 7.7 depicts part of just such a data set. Each column corresponds to a different type of book and each row corresponds to a single transaction. The complete data set contains 2000 transactions.


After downloading and installing XLMiner, an Excel add-in, from http://www.resample.com/xlminer, select "Affinity" from the XLMiner menu, and then "Association rules." The Association Rules dialog box will appear as shown in Fig. 7.13. Completing the dialog as shown, click on the OK button to see the results displayed in Fig. 7.14. Note that each rule is presented along with an estimated confidence level, support, and lift ratio.

Rule #1 says that if an Italian cookbook and a Youthbook are bought, a cookbook will also be bought. This particular rule has a confidence of 100%, meaning that, of the people who bought an Italian cookbook and a Youthbook, all (100%) bought cookbooks as well. "Support (a)" indicates that it has the support of 118 transactions, meaning that 118 people bought an Italian cookbook and a Youthbook in total. "Support (c)" indicates the number of transactions involving the purchase of cookbooks in total. (This is a piece of side information; it is not involved in calculating the confidence or support for the rule itself.) "Support (a U c)" is the number of transactions in which an Italian cookbook and a Youthbook as well as a cookbook were bought.

FIGURE 7.13 Preparing to do a market basket analysis.

FIGURE 7.14 Results of a market basket analysis of the bookstore data.

The lift ratio indicates how much more likely one is to encounter a cookbook transaction if just those transactions where an Italian cookbook and a Youthbook are purchased are considered, as compared to the entire population of transactions; it is the confidence divided by support (c), where the latter is expressed as a percentage. For Rule #1, the confidence is 100% and support (c) in percentage terms = (862/2000)*100 = 43.1, so the lift ratio = 100/43.1 = 2.32.
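The same arithmetic is easy to reproduce from a raw list of transactions. A small sketch follows; the transactions and item labels are invented, not the 2000-transaction bookstore data.

    # Each transaction is the set of item types bought together.
    transactions = [
        {"ItalCook", "YouthBk", "CookBk"},
        {"ItalCook", "YouthBk", "CookBk"},
        {"CookBk"},
        {"ChildBks", "DoItYBks"},
    ]

    antecedent, consequent = {"ItalCook", "YouthBk"}, {"CookBk"}
    n = len(transactions)

    support_a = sum(antecedent <= t for t in transactions)            # transactions containing a
    support_c = sum(consequent <= t for t in transactions)            # transactions containing c
    support_ac = sum((antecedent | consequent) <= t for t in transactions)

    confidence = support_ac / support_a                               # estimated P(c | a)
    lift = confidence / (support_c / n)                               # confidence over support(c) as a fraction
    print(confidence, lift)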

7.9 SUMMARY AND REVIEW

In this chapter, you were introduced to two techniques for classifying and predicting outcomes: linear regression and classification trees. Three methods for estimating linear regression coefficients were described, along with guidelines for choosing among them. You were provided with a stepwise technique for choosing variables to be included in a regression model. The assumptions underlying the regression technique were discussed, along with the resultant limitations. Overall guidelines for model development were provided.

You learned the importance of validating and revalidating your models before placing any reliance upon them. You were introduced to one of the simplest of pattern recognition methods, the classification tree, to be used whenever there are large numbers of potential predictors or when classification rather than quantitative prediction is your primary goal.

Exercise 7.30. Make a list of all the italicized terms in this chapter. Provide a definition for each one, along with an example.



Exercise 7.31. It is almost self-evident that levels of toluene, a commonly used solvent, would be observed in the blood after working in a room where the solvent was present in the air. Do the observations recorded below also suggest that blood levels are a function of age and body weight? Construct a model for predicting blood levels of toluene using these data.


Reporting Your Findings

IN THIS CHAPTER, we assume you have just completed an analysis of your own or someone else's research and now wish to issue a report on the overall findings. You'll learn what to report and how to go about reporting it, with particular emphasis on the statistical aspects of data collection and analysis, including:

• Power and sample size calculations
• Data collection methods

One of the most common misrepresentations in scientific work is the scientific paper itself. It presents a mythical reconstruction of what actually happened. All of what are in retrospect mistaken ideas, badly designed experiments, and incorrect calculations are omitted. The paper presents the research as if it had been carefully thought out, planned, and executed according to a neat, rigorous process, for example involving testing of a hypothesis. The misrepresentation of the scientific paper is the most formal aspect of the misrepresentation of science as an orderly process based on a clearly defined method.

Introduction to Statistics Through Resampling Methods & Microsoft Office Excel®, by Phillip I. Good. Copyright © 2005 John Wiley & Sons, Inc.
