Quantitative Methods for Business – Chapter 17



17 High performance – statistical inference for comparing population means and bivariate data

Chapter objectives

This chapter will help you to:

■ test hypotheses on the difference between two population means using independent samples and draw appropriate conclusions

■ carry out tests of hypotheses about the difference between two population means using paired data and draw appropriate conclusions

■ test differences between population means using analysis of variance (ANOVA) and draw appropriate conclusions


■ conduct hypothesis tests about population correlation coefficients and draw appropriate conclusions

■ produce interval estimates using simple linear regression models

■ perform contingency analysis and interpret the results

■ use the technology: test differences between sample means, apply correlation and regression inference, and contingency analysis in EXCEL, MINITAB and SPSS

■ become acquainted with the business use of contingency analysis

In the previous chapter we looked at statistical inference in relation to univariate data, estimating and testing single population parameters like the mean using single sample results. In this chapter we will consider statistical inference methods that enable us to compare means of two or more populations, to test population correlation coefficients, to make predictions from simple linear regression models and to test for association in qualitative data.

17.1 Testing hypotheses about two population means

In section 16.3 of the previous chapter we looked at tests of the population mean based on a single sample mean. In this section we will consider tests designed to assess the difference between two population means. In businesses these tests are used to investigate whether, for instance, the introduction of a new logo improves sales.

To use these tests you need to have a sample from each of the two populations. For the tests to be valid the samples must be random, but they can be independent or dependent.

Independent samples are selected from each population separately. Suppose a domestic gas supplier wanted to assess the impact of a new charging system on customers’ bills. The company could take a random sample of customers and record the size of their bills under the existing charging system then, after the new system is introduced, take another random sample of customers and record the size of their bills. These samples would be independent.

Dependent samples consist of matched or paired values. If the gas supplier took a random sample of customers and recorded the size of their bills both before and after the introduction of the new charging system they would be using a paired or dependent sample.


The choice of independent or dependent samples depends on the context of the test. Unless there is a good reason for using paired data it is better to use independent samples. We will begin by looking at tests for use with independent samples and deal with paired samples later in this section.

As with single sample tests, the size of the samples is important because it determines the nature of the sampling distribution. In this section we will assume that the population standard deviations are not known.

17.1.1 Large independent samples

The null hypothesis we use in comparing population means is based on the difference between the means of the two populations, μ1 − μ2. The possible combinations of null and alternative hypotheses are shown in Table 17.1. The sampling distribution of the difference between sample means, X̄1 − X̄2, has a mean of μ1 − μ2 and a standard error of:

√(σ1²/n1 + σ2²/n2)

where σ1 and σ2 are the standard deviations of the first and second populations, and n1 and n2 are the sizes of the samples from the first and second populations.

Table 17.1 Types of hypotheses for comparing population means

Null hypothesis        Alternative hypothesis      Type of test
H0: μ1 − μ2 = 0        H1: μ1 − μ2 ≠ 0             Two-tail
H0: μ1 − μ2 ≤ 0        H1: μ1 − μ2 > 0             One-tail
H0: μ1 − μ2 ≥ 0        H1: μ1 − μ2 < 0             One-tail


We will assume that the population standard deviations are not known, in which case the estimated standard error of the sampling distribution is:

√(s1²/n1 + s2²/n2)

The test statistic is:

z = ((x̄1 − x̄2) − (μ1 − μ2)) / √(s1²/n1 + s2²/n2)

If the null hypothesis suggests that the difference between the population means is zero, we can simplify this to:

z = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

Once we have calculated the test statistic we need to compare it to the appropriate critical value from the Standard Normal Distribution.

Example 17.1

A national breakdown recovery service has depots at Oxford and Portsmouth. The mean and standard deviation of the times that it took for the staff at the Oxford depot to assist each of a random sample of 47 motorists were 51 minutes and 7 minutes respectively. The mean and standard deviation of the response times recorded by the staff at the Portsmouth depot in assisting a random sample of 39 customers were 49 minutes and 5 minutes respectively. Test the hypothesis that there is no difference between the mean response times of the two depots. Use a 5% level of significance.

H0: μ1 − μ2 = 0   H1: μ1 − μ2 ≠ 0

This is a two-tail test using a 5% level of significance, so the critical values are ±z0.025, that is ±1.96. Unless the test statistic is below −1.96 or above 1.96 the null hypothesis cannot be rejected. The test statistic, 1.541, is within ±1.96 so we cannot reject H0. The population mean response times of the two depots could be equal.


Notice that in Example 17.1 we have not said anything about the distributions of response times. The Central Limit Theorem allows us to use the same two-sample z test whatever the shape of the populations from which the samples were drawn, as long as the size of both samples is 30 or more.
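The arithmetic of Example 17.1 is easy to reproduce; this short Python sketch uses only the summary figures quoted above:

```python
from math import sqrt

# Two-sample z test with the summary figures from Example 17.1:
# Oxford: n1 = 47, mean 51, sd 7; Portsmouth: n2 = 39, mean 49, sd 5
n1, xbar1, s1 = 47, 51.0, 7.0
n2, xbar2, s2 = 39, 49.0, 5.0

# Estimated standard error of the difference between the sample means
se = sqrt(s1**2 / n1 + s2**2 / n2)

# Test statistic for H0: the population means are equal
z = (xbar1 - xbar2) / se
print(round(z, 3))  # 1.541, within ±1.96, so H0 cannot be rejected
```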

At this point you may find it useful to try Review Questions 17.1 to 17.3 at the end of the chapter.

17.1.2 Small independent samples

If the size of the samples you want to use to compare population means is small, less than 30, you can only follow the procedure outlined in the previous section if both populations are normal and both population standard deviations are known. In the absence of the latter it is possible to test the difference between two population means using small independent samples, but only under certain circumstances.

If both populations are normal and their standard deviations can be assumed to be the same, that is σ1 = σ2, we can conduct a two-sample t test. We use the sample standard deviations to produce a pooled estimate of the standard error of the sampling distribution of X̄1 − X̄2, sp:

sp = √(((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2))

The test statistic is:

t = (x̄1 − x̄2) / (sp √(1/n1 + 1/n2))

We then compare the test statistic to the appropriate critical value from the t distribution. The number of degrees of freedom for this test is n1 + n2 − 2; one degree of freedom is lost for each of the sample means.
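A minimal Python sketch of the pooled procedure; the two small samples here are invented purely to show the calculation, they are not data from the text:

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    # sample variance, divisor n - 1
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Two small independent samples (invented for illustration)
sample1 = [12.1, 11.4, 13.0, 12.6, 11.9]
sample2 = [10.8, 11.2, 10.5, 11.9]
n1, n2 = len(sample1), len(sample2)

# Pooled estimate of the common standard deviation
sp = sqrt(((n1 - 1) * var(sample1) + (n2 - 1) * var(sample2)) / (n1 + n2 - 2))

# Test statistic, to be compared with t on n1 + n2 - 2 = 7 degrees of freedom
t = (mean(sample1) - mean(sample2)) / (sp * sqrt(1 / n1 + 1 / n2))
print(round(t, 2))  # 2.67
```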


At this point you may find it useful to try Review Questions 17.4 to 17.6 at the end of the chapter.

17.1.3 Paired samples

If you want to test the difference between population means using dependent or paired samples, the nature of the data enables you to test the mean of the differences between all the paired values in the population, μd. This approach contrasts with the methods described in the earlier parts of this section, where we have tested the difference between population means, μ1 − μ2.

The procedure involved in testing hypotheses using paired samples is very similar to the one-sample hypothesis testing we discussed in section 16.3 of Chapter 16. We have to assume that the differences between the paired values are normally distributed with a mean of μd and a standard deviation of σd. The sampling distribution of the sample mean

…brand packets are 33.4% and 1.1% respectively. Test the hypothesis that the mean oat content of the premium brand is no greater than the mean oat content of the ‘own-brand’ muesli using a 1% level of significance.

We will define μ1 as the population mean of the ‘own-brand’ and μ2 as the population mean of the premium product.

H0: μ1 − μ2 = 0   H1: μ1 − μ2 < 0

First we need the pooled estimate of the standard error, then we can calculate the test statistic:

t = (x̄2 − x̄1) / (1.243 √(1/14 + 1/17)) = 3.344

This is a one-tail test, so the null hypothesis will only be rejected if the test statistic exceeds the critical value. From Table 6 on page 623 in Appendix 1, t0.01,29 is 2.462. Since the test statistic is greater than the critical value we can reject the null hypothesis at the 1% level. The difference between the sample means is very significant.


differences will also be normally distributed with a mean of μd and a standard error of σd/√n, where n is the number of differences in the sample. Since we assume that σd is unknown we have to use the estimated standard error sd/√n, where sd is the standard deviation of the sample differences.

Typically samples of paired data tend to be small, so the benchmark distribution for the test is the t distribution. The test is therefore called the paired t test. Table 17.2 lists the three possible combinations of hypotheses.

The test statistic is:

t = (x̄d − μd0) / (sd/√n)

where x̄d is the mean of the sample differences.

We then compare the test statistic to the appropriate critical value from the t distribution with n − 1 degrees of freedom.

Table 17.2 Types of hypotheses for the mean of the population of differences

Null hypothesis        Alternative hypothesis      Type of test
H0: μd = μd0           H1: μd ≠ μd0                Two-tail
H0: μd ≤ μd0           H1: μd > μd0                One-tail
H0: μd ≥ μd0           H1: μd < μd0                One-tail

In this table μd0 represents the value of the population mean that is to be tested.


The mean and standard deviation of the sample differences are 7.75 and 3.05, to 2 decimal places. The test statistic is:

t = (7.75 − 8.00) / (3.05/√12) = −0.25/0.88 = −0.284

From Table 6 on page 623, t0.10,11 is 1.363. The alternative hypothesis is that the population mean salary difference is less than £8000, so the critical value is −1.363. A sample mean that produces a test statistic this low or lower would lead us to reject the null hypothesis.

In this case, although the sample mean is less than £8000, the test statistic, −0.28, is not less than the critical value and the null hypothesis cannot be rejected. The population mean of the salary differences could well be £8000.
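The paired t calculation above can be checked with a few lines of Python:

```python
from math import sqrt

# Paired t test from the summary figures quoted above: 12 salary
# differences (in £000s) with mean 7.75 and standard deviation 3.05,
# testing H0: mu_d = 8.00 against H1: mu_d < 8.00
n, dbar, sd = 12, 7.75, 3.05

t = (dbar - 8.00) / (sd / sqrt(n))
print(round(t, 3))  # -0.284; not below -1.363, so H0 cannot be rejected
```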

At this point you may find it useful to try Review Questions 17.7 to 17.9 at the end of the chapter.

17.2 Testing hypotheses about more than two population means – one-way ANOVA

In some investigations it is important to establish whether two random samples come from a single population or from two populations with different means. The techniques we looked at in the previous section enable us to do just that. But what if we have three or more random samples and we need to establish whether they come from populations with different means?

You might think that the obvious answer is to run t tests using each pair of random samples to establish whether the first sample came from the same population as the second, the first sample came from the same population as the third, the second sample came from the same population as the third, and so on. In doing this you would be testing the hypotheses:

H0: μ1 = μ2   H0: μ1 = μ3   H0: μ2 = μ3   etc.

Although feasible, this is not the best way to approach the investigation. For one thing, the more random samples that are involved the greater the chance that you miss out one or more possible pairings. For another, each test you conduct carries a risk of making a type 1 error, wrongly rejecting a null hypothesis because you happen to have a sample result from an extreme end of its sampling distribution. The chance of this occurring is the level of significance you use in conducting the test. The problem when you conduct a series of related tests is that the probability of making a type 1 error increases; if you use a 5% level of significance then the probability of not making a type 1 error in a sequence of three tests is, using the multiplication rule of probability, 0.95 * 0.95 * 0.95 or 0.857. This means the effective level of significance is 14.3%, considerably greater than you might have assumed.
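The inflation of the effective significance level is a one-line calculation:

```python
# Probability of making no type 1 error in three tests at the 5% level
p_no_error = 0.95 ** 3            # 0.857375

# Effective significance level across the sequence of tests
effective_alpha = 1 - p_no_error
print(round(effective_alpha, 3))  # 0.143, i.e. 14.3%
```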

To establish whether more than two samples come from populations with different means we use an alternative approach, analysis of variance, usually abbreviated to ANOVA. At first sight it seems rather odd to be using a technique based on variance, a measure of spread, to assess hypotheses about means, which are measures of location. The reason for doing this is that it enables us to focus on the spread of the sample means; after all, the greater the differences between the sample means the greater the chance that they come from populations with different means. However, we have to be careful to put these differences into context because, after all, we can get different samples from the same population. Using ANOVA involves looking at the balance between the variance of the sample means and the variance in the sample data overall. Example 17.4 illustrates why this is important.

Example 17.4

The Kranilisha Bank operates cash dispensing machines in Gloucester, Huddersfield and Ipswich. The amounts of cash dispensed (in £000s) at a random sample of machines during a specific period were recorded.

These are independent samples, and so the fact that one sample (Huddersfield) contains fewer values does not matter. The sample data are shown in the form of boxplots in Figure 17.1.

The distributions in Figure 17.1 suggest that there are differences between the amounts of cash dispensed at the machines, with those in Ipswich having the largest turnover and those in Huddersfield having the smallest. The sample means, which are represented by the dots, bear out this impression: 34 for Gloucester, 25 for Huddersfield and 41 for Ipswich.


In Example 17.4 the sample means are diverse enough and the distributions shown in Figure 17.1 distinct enough to indicate differences between the locations, but is it enough to merely compare the sample means?

Figure 17.1 Cash dispensed at machines in Gloucester, Huddersfield and Ipswich


In Figure 17.2 the considerable overlaps between the data from the three locations suggest that despite the contrasts in the means it is more likely that the three samples come from the same population. Concentrating on the means alone in this case would have led us to the wrong conclusion.

So how can we test whether the three samples all come from the same population, in other words that there is no difference between the population mean amounts of cash dispensed per period in the three towns? For convenience we will use μG, μH and μI to represent the population means for Gloucester, Huddersfield and Ipswich respectively. The hypothesis we need to test is:

H0: μG = μH = μI   H1: not all the population means are equal

We begin with the mean of the values in all three samples, the overall mean. Since we already know the three sample means we can work this out by taking the mean of the sample means, being careful to weight each mean by the number of observations in the sample. In the first instance we will use the original data from Example 17.4:

x̄ = (5 × 34 + 4 × 25 + 5 × 41)/14 = 475/14 = 33.929

The test statistic we will use is based on comparing the variation between the sample means with the variation within the samples. One of the measures of variation or spread that we looked at in Chapter 6


was the sample variance, the square of the sample standard deviation:

s² = Σ(x − x̄)² / (n − 1)

The basis of the variance is the sum of the squared deviations between each observation and the sample mean. This amount, which is usually abbreviated to the sum of squares, is used to measure variation in analysis of variance. The sum of squares for the Gloucester sample, which we will denote as SSG, comes to 226. We can work out the equivalent figures for the Huddersfield and Ipswich samples:

SSH = (17 − 25)² + (25 − 25)² + (27 − 25)² + (31 − 25)² = 104
SSI = (29 − 41)² + (34 − 41)² + (44 − 41)² + (47 − 41)² + (51 − 41)² = 338

The sum of these three sums of squares is the sum of the squares within the samples, SSW:

SSW = SSG + SSH + SSI = 226 + 104 + 338 = 668

The measure we need for the variation between the sample means is the sum of the squared deviations between the sample means and the mean of all the observations in the three samples. This is the sum of the squares between the samples, SSB. In calculating it we have to weight each squared deviation by the sample size:

SSB = n1(x̄1 − x̄)² + n2(x̄2 − x̄)² + n3(x̄3 − x̄)²
    = 5 × (34 − 33.929)² + 4 × (25 − 33.929)² + 5 × (41 − 33.929)² = 568.929

If we add the sum of squares within the samples, SSW, to the sum of squares between the samples, SSB, the result is the total sum of squares in the data, denoted by SST. The total sum of squares is also the sum of


the squared deviations between each observation in the set of three samples and the mean of the combined data.

When you calculate a sample variance you have to divide the sum of squared deviations by the sample size less one, n − 1. This is the number of degrees of freedom left in the data; we lose one degree of freedom because we use the mean in working out the deviations from it. Before we can use the sums of squares we have determined above to test the hypothesis of no difference between the population means, we need to incorporate the degrees of freedom associated with each sum of squares by working out the mean sum of squares. This makes the variation within the samples directly comparable to the variation between samples.

The mean sum of squares within the samples, MSW, is the sum of squares within the samples divided by the number of observations in all three samples, in this case 14, less the number of samples we have, 3. You may like to think of subtracting three as reflecting our using the three sample means in working out the sum of squares within the samples. If we use k to represent the number of samples we have:

MSW = SSW/(n − k) = 668/(14 − 3) = 60.727

The mean sum of squares between the samples, MSB, is the sum of squares between the samples divided by the number of samples, k, minus one. We lose one degree of freedom because we have used the overall mean to find the sum of squares between the samples:

MSB = SSB/(k − 1) = 568.929/2 = 284.465

The test statistic used to decide the validity of the null hypothesis isthe ratio of the mean sum of squares between samples to the mean sum

SST = SSW + SSB = 668 + 568.929 = 1236.929


of squares within the samples. Because the benchmark distribution we shall use to assess it is the F distribution, named after its inventor, R. A. Fisher, the test statistic is represented by the letter F:

F = MSB/MSW

Before comparing this to the F distribution it is worth pausing to consider the meaning of the test statistic. If the three samples came from a single population, in other words the null hypothesis is true, both the MSB above the line, the numerator, and the MSW below the line, the denominator, would be unbiased estimators of the variance of that population. If this were the case, the test statistic would be close to one.

If on the other hand the samples do come from populations with different means, in other words the null hypothesis is not true, we would expect the MSB, the numerator, to be much larger than the denominator. Under these circumstances the test statistic would be greater than one.

In order to gauge how large the test statistic would have to be to lead us to reject the null hypothesis we have to look at it in the context of the F distribution. This distribution portrays the variety of test statistics we would get if we compared all conceivable sets of samples from a single population and worked out the ratio of MSB to MSW. Since neither the MSB nor the MSW can be negative, as they are derived from squared deviations, the F distribution consists of entirely positive values. The version of the F distribution you use depends on the numbers of degrees of freedom you use to work out the MSB and the MSW, respectively the numerator and denominator of the test statistic. The F distribution with 2 degrees of freedom in the numerator and 11 degrees of freedom in the denominator, which is the appropriate version for the bank data from Example 17.4, is shown in Figure 17.3.

We can assess the value of the test statistic for the data from Example 17.4 by comparing it with a benchmark figure or critical value from the distribution shown in Figure 17.3. The critical value you use depends on the level of significance you require. Typically this is 5% or 0.05. The shaded area in Figure 17.3 is the 5% of the distribution beyond 3.98.

If the null hypothesis really were true we would only expect to have test statistics greater than 3.98 in 5% of cases. In the case of Example 17.4 the test statistic is rather higher, 4.684, so the null hypothesis should be rejected at the 5% level. At least two of the means are different.

F = MSB/MSW = 284.465/60.727 = 4.684

In general, reject the null hypothesis if the test statistic is greater than F(k−1, n−k, α), where k is the number of samples, n is the number of values in the samples overall and α is the level of significance. Note that this is a one-tail test; it is only possible to reject the hypothesis if the test statistic is larger than the critical value you use. Values of the test statistic from the left-hand side of the distribution are consistent with the null hypothesis.

Table 7 on page 624 in Appendix 1 contains details of the F distribution. You may like to check it to locate F2,11,0.01, which is the value of F with 2 numerator degrees of freedom and 11 denominator degrees of freedom that cuts off a right-hand tail area of 1%: 7.21. This value is greater than the test statistic for the data from Example 17.4, so we cannot reject the null hypothesis at the 1% level of significance.
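The whole ANOVA calculation for Example 17.4 can be assembled in Python from the summary figures quoted above (sample sizes 5, 4 and 5, sample means 34, 25 and 41, within-sample sums of squares 226, 104 and 338):

```python
ns, means = [5, 4, 5], [34.0, 25.0, 41.0]
ssw = 226 + 104 + 338                 # sum of squares within the samples
n, k = sum(ns), len(ns)               # 14 observations, 3 samples

# Overall mean, weighting each sample mean by its sample size
grand_mean = sum(ni * mi for ni, mi in zip(ns, means)) / n   # 33.929

# Sum of squares between the samples
ssb = sum(ni * (mi - grand_mean) ** 2 for ni, mi in zip(ns, means))

msb = ssb / (k - 1)                   # about 284.46
msw = ssw / (n - k)                   # about 60.73
F = msb / msw
print(round(F, 3))                    # 4.684, above the 5% critical value 3.98
```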


The overall mean and the three sample means are exactly the same as those derived from the data in Example 17.4, so the mean sum of squares between the samples is unchanged, 284.465.

We need to calculate the sum of squares within the samples for the amended data:

SSW = SSG + SSH + SSI = 1198 + 378 + 1230 = 2806
MSW = 2806/11 = 255.091

The test statistic, F = 284.465/255.091 = 1.115.

In this case the test statistic is much lower than the critical value for a 5% level of significance, 3.98, and the null hypothesis should not be rejected; it is a reasonable assumption that these three samples come from a single population, confirming the impression from Figure 17.2.

In this section we have used one-way analysis of variance; one-way because we have only considered one factor, geographical location, in our investigation. There may be other factors pertinent to our analysis, such as the type of location of the cash dispensers in Example 17.4, i.e. town centre, supermarket or garage. ANOVA is a flexible technique that can be used to take more than one factor into account. For more on its capabilities and applications see Roberts and Russo (1999).

At this point you may find it useful to try Review Questions 17.10 to 17.12 at the end of the chapter.

17.3 Testing hypotheses and producing interval estimates for quantitative bivariate data

In this section we will look at statistical inference techniques that enable you to estimate and test relationships between variables in populations


based on sample data. The sample data we use to do this is called bivariate data because it consists of observed values of two variables. This sort of data is usually collected in order to establish whether there is a relationship between the two variables, and if so, what sort of relationship it is.

Many organizations use this type of analysis to study consumer behaviour, patterns of costs and revenues, and other aspects of their operations. Sometimes the results of such analysis have far-reaching consequences. For example, if you look at a tobacco product you will see a health warning prominently displayed. It is there because some years ago researchers used these types of statistical methods to establish that there was a relationship between tobacco consumption and certain medical conditions.

The quantitative bivariate analysis that we considered in Chapter 7 consisted of two related techniques: correlation and regression. Correlation analysis, which is about calculating and evaluating the correlation coefficient, enables you to tell whether there is a relationship between the observed values of two variables and how strong it is. Regression analysis, which is about finding lines of best fit, enables you to find the equation of the line that is most appropriate for the data, the regression model.

Here we will address how the results from applying correlation and regression to sample data can be used to test hypotheses and make estimates for the populations the sets of sample data belong to.

17.3.1 Testing the population correlation coefficient

The sample correlation coefficient, represented by the letter r, measures the extent of the linear association between a sample of observations of two variables, X and Y. You can find the sample correlation coefficient of a set of bivariate data using the formula:

r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² Σ(y − ȳ)²)
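The formula translates directly into Python; the five data pairs here are invented purely to show the calculation:

```python
from math import sqrt

# Invented bivariate sample for illustration
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [10.0, 14.0, 13.0, 19.0, 22.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

# Sample correlation coefficient; always between -1 and +1
r = sxy / sqrt(sxx * syy)
print(round(r, 3))  # 0.974, strong positive correlation
```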


If we select a random sample from populations of X and Y that are both normal in shape, the sample correlation coefficient will be an unbiased estimate of the population coefficient, represented by the Greek equivalent of r, the letter rho, ρ. In fact the main reason for calculating the sample correlation coefficient is to assess the linear association, if any, between the X and Y populations.

The value of the sample correlation coefficient alone is some help in assessing correlation between the populations, but a more thorough approach is to test the null hypothesis that the population correlation coefficient is zero:

H0: ρ = 0

The alternative hypothesis you use depends on what you would like to show. If you are interested in demonstrating that there is significant correlation in the population, then use:

H1: ρ ≠ 0

Alternatively, to demonstrate positive or negative correlation, use H1: ρ > 0 or H1: ρ < 0 respectively. If we adopt the first of these, H1: ρ ≠ 0, we will need to use a two-tail test; if we use one of the other forms we will conduct a one-tail test. In practice it is more usual to test for either positive or negative correlation rather than for both.

Once we have established the nature of our alternative hypothesis we need to calculate the test statistic from our sample data. The test statistic is:

t = r√(n − 2) / √(1 − r²)

Here r is the sample correlation coefficient and n is the number of pairs of observations in the sample.

As long as the populations of X and Y are normal, the test statistic will belong to a t distribution with n − 2 degrees of freedom and a mean of zero, if the null hypothesis is true and there is no linear association between the populations of X and Y.


At this point you may find it useful to try Review Questions 17.13 to 17.15 at the end of the chapter.

17.3.2 Testing regression models

The second bivariate quantitative technique we looked at in Chapter 7 was simple linear regression analysis. This allows you to find the equation of the line of best fit between two variables, X and Y. Such a line has two distinguishing features, its intercept and its slope. In the standard

Example 17.7

A shopkeeper wants to investigate the relationship between the temperature and the number of cans of soft drinks she sells. The maximum daytime temperature (in degrees Celsius) and the soft drinks sales on 10 working days chosen at random were recorded.

The sample correlation coefficient is 0.871. Test the hypothesis of no correlation between temperature and sales against the alternative that there is positive correlation, using a 5% level of significance.

The test statistic:

t = r√(n − 2) / √(1 − r²) = 0.871 × √8 / √(1 − 0.871²) = 5.02

We need to compare this test statistic to the t distribution with n − 2, in this case 8, degrees of freedom. According to Table 6 on page 623 in Appendix 1 the value of t with 8 degrees of freedom that cuts off a 5% tail on the right-hand side of the distribution, t0.05,8, is 1.860. Since the test statistic is larger than 1.860 we can reject the null hypothesis at the 5% level of significance. The sample evidence strongly suggests positive correlation between temperature and sales in the population.

(Table of temperature and cans sold for the 10 days)
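The test statistic for Example 17.7 follows directly from r = 0.871 and n = 10:

```python
from math import sqrt

# Testing H0: rho = 0 against H1: rho > 0 with r = 0.871 from 10 pairs
r, n = 0.871, 10

t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
print(round(t, 2))  # about 5.01 (the text rounds to 5.02); well above 1.860
```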


formula we used, the intercept is represented by the letter a and the slope by the letter b:

Y = a + bX

The line that this equation describes is the best way of representing the relationship between the dependent variable, Y, and the independent variable, X. In practice it is almost always the result of a sample investigation that is intended to shed light on the relationship between the populations of X and Y. That is why we have used ordinary rather than Greek letters in the equation.

The results of a sample investigation can provide you with an understanding of the relationship between the populations. The intercept and slope of the line of best fit for the sample are point estimates for the intercept and slope of the line of best fit for the populations, which are represented by the Greek equivalents of a and b, α and β:

Y = α + βX

The intercept and slope from the sample regression line can be used to test hypotheses about the equivalent figures for the populations. Typically we use null hypotheses that suggest that the population values are zero:

H0: α = 0 for the intercept

and

H0: β = 0 for the slope.

If the population intercept is zero, the population line of best fit will be represented by the equation Y = 0 + βX, and the line will begin at the origin of the graph. You can see this type of line in Figure 17.4.

If we wanted to see whether the population intercept is likely to be zero, we would test the null hypothesis H0: α = 0 against the alternative


When you use regression analysis you will find that investigating the value of the intercept is rarely important. Occasionally it is of interest; for instance, if we are looking at the relationship between an organization’s levels of operational activity and its total costs at different periods of time, then the intercept of the line of best fit represents the organization’s fixed costs.

Typically we are much more interested in evaluating the slope of the regression line. The slope is pivotal; it tells us how the dependent variable responds to changes in the independent variable. For this reason the slope is also known as the coefficient of the independent variable.

If the population slope turns out to be zero, it tells you that the dependent variable does not respond to the independent variable. The implication of this is that your independent variable is of no use in explaining how your dependent variable behaves, and there would be no point in using it to make predictions of the dependent variable.

If the slope of the line of best fit is zero, the equation of the line would be Y = α + 0X, and the line would be perfectly horizontal. You can see this illustrated in Figure 17.5.

The line in Figure 17.5 shows that whatever the value of X, whether it is small and to the left of the horizontal axis or large and to the right of it, the value of Y remains the same. The size of the X value has no impact whatsoever on Y, and the regression model is useless.

We usually want to use regression analysis to find useful rather than useless models – regression models that help us understand and anticipate the behaviour of dependent variables. In order to demonstrate that a model is valid, it is important that you test the null hypothesis that the slope is zero. Hopefully the sample evidence will enable you to reject the null hypothesis in favour of the alternative, that the slope is not zero. The test statistic used to test the hypothesis is:

t = (b − 0)/sb


Where b is the sample slope, 0 is the zero population slope that the null hypothesis suggests, and s bis the estimated standard error of the samplingdistribution of the sample slopes.

To calculate the estimated standard error, s_b, divide s, the standard deviation of the sample residuals (the parts of the y values that the line of best fit does not explain), by the square root of the sum of the squared deviations between the x values and their mean, x̄:

s_b = s/√Σ(x − x̄)²

Once we have the test statistic we can assess it by comparing it to the t distribution with n − 2 degrees of freedom, two fewer than the number of pairs of x and y values in our sample data.

H0: β = 0    H1: β ≠ 0

To find the test statistic we first need to calculate the standard deviation of the residuals. We can identify the residuals by taking each x value, putting it into the equation of the line of best fit and then working out what Y 'should' be, according to the model. The difference between the y value that the equation says should be associated with the x value and the y value that is actually associated with the x value is the residual.

To illustrate this, we will look at the first pair of values in our sample data, a day when the temperature was 14° and 19 cans of soft drink were sold. If we insert the temperature into the equation of the line of best fit we can use the equation to estimate the number of cans that 'should' have been sold on that day:

Sales = 0.74 + (2.38 × 14) = 34.06

The residual is the difference between the actual sales level, 19, and this estimate:

residual = 19 − 34.06 = −15.06

The standard deviation of the residuals is based on the squared residuals. The residuals and their squares are:

Temperature    Sales    Residuals    Squared residuals
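The residual arithmetic for the first day can be checked with a short Python snippet. The intercept 0.74 and slope 2.38 are the line-of-best-fit coefficients quoted in the example; note that the exact arithmetic gives a residual of −15.06, so any small difference from a quoted figure comes down to rounding of the coefficients.

```python
# Line of best fit from the example: Sales = 0.74 + 2.38 * Temperature
intercept, slope = 0.74, 2.38

temperature, actual_sales = 14, 19        # first pair in the sample
fitted = intercept + slope * temperature  # sales the model says 'should' occur
residual = actual_sales - fitted          # actual minus fitted

print(round(fitted, 2))    # 34.06
print(round(residual, 2))  # -15.06
```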


We find the standard deviation of the residuals by taking the square root of the sum of the squared residuals divided by n, the number of residuals, minus 2:

s = √(Σ(residual)²/(n − 2))

(We have to subtract two because we have 'lost' 2 degrees of freedom in using the intercept and slope to calculate the residuals.)

To get the estimated standard error we divide this by the square root of the sum of squared differences between the temperature figures and their mean.

The estimated standard error is:

s_b = 9.343/√388.90 = 0.4738

and the test statistic t = (b − 0)/s_b = 2.38/0.4738 = 5.02

From Table 6, the t value with 8 degrees of freedom that cuts off a tail area of 2.5%, t8,0.025, is 2.306. If the null hypothesis is true and the population slope is zero, only 2.5% of test statistics will be more than 2.306 and only 2.5% will be less than −2.306. The level of significance is 5% so our decision rule is therefore to reject the null hypothesis if the test statistic is outside ±2.306. Since the test statistic in this case is 5.02 we should reject H0 and conclude that the evidence suggests that the population slope is not zero.
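The whole procedure can be scripted. The data below are hypothetical (the chapter's ten temperature/sales pairs are not reproduced here), so the point of this sketch is the sequence of calculations, not the particular numbers:

```python
from math import sqrt

def slope_t_test(xs, ys):
    """Fit y = a + b*x by least squares and return (b, s_b, t) for H0: beta = 0."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                       # sample slope
    a = y_bar - b * x_bar               # sample intercept
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # SD of residuals
    s_b = s / sqrt(sxx)                 # estimated standard error of the slope
    return b, s_b, b / s_b              # test statistic t = (b - 0)/s_b

# Hypothetical sample of four (x, y) pairs
b, s_b, t = slope_t_test([1, 2, 3, 4], [1, 3, 2, 5])
print(round(b, 3), round(t, 3))  # 1.1 2.117
# Compare t with the critical value t(n-2, 0.025) from tables to decide on H0
```

With only four pairs the critical value t2,0.025 is large (4.303), so here the null hypothesis of a zero slope would not be rejected; with the chapter's sample of ten the same calculation gave t = 5.02 against a critical value of 2.306.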


The implication of the sort of result we arrived at in Example 17.8 is that the model, represented by the equation, is sufficiently sound to enable the temperature variable to be used to predict sales.

If you compare the test statistic for the sample slope in Example 17.8 with the test statistic for the sample correlation coefficient in Example 17.7, you will see that they are both 5.02. This is no coincidence; the two tests are equivalent. The slope represents the form of the association between the variables whereas the correlation coefficient measures its strength. We use the same data in the same sort of way to test both of them.

17.3.3 Constructing interval predictions

When you use a regression model to make a prediction, as we did in Example 17.8 to obtain the residuals, you get a single figure that is the value of Y that the model suggests is associated with the value of X that you specify.

Example 17.9

Use the regression model in Example 17.8 to predict the sales that will be achieved on a day when the temperature is 22° Celsius.

If temperature = 22, according to the regression equation:

Sales = 0.74 + 2.38(22) = 53.1

Since the number of cans sold is discrete, we can round this to 53 cans.

The problem with single-figure predictions is that we do not know how likely they are to be accurate. It is far better to have an interval that we know, with a given level of confidence, will be accurate.

Before looking at how to produce such intervals, we need to clarify exactly what we want to find. The figure we produced in Example 17.9 we described as a prediction of sales on a day when the temperature is 22°. In fact, it can also be used as an estimate of the mean level of sales that occur on days when the temperature is 22°. Because it is a single figure it is a point estimate of the mean sales levels on such days.

We can construct an interval estimate, or confidence interval, of the mean level of sales on days when the temperature is at a particular level by taking the point estimate and adding and subtracting an error. The error is the product of the standard error of the sampling distribution of the point estimates and a figure from the t distribution. The t distribution should have n − 2 degrees of freedom, n being the number of pairs of data in our sample, and the t value we select from it is based on the level of confidence we want to have that our estimate will be accurate.

We can express this procedure using the formula:

Confidence interval = ŷ ± t_{n−2, α/2} × s × √(1/n + (x₀ − x̄)²/Σ(x − x̄)²)

From Example 17.9 the point estimate for the mean, ŷ, is 53.1 cans. We can use the precise figure because the mean, unlike sales on a particular day, does not have to be discrete. We also know from Example 17.9 that s, the standard deviation of the sample residuals, is 9.343, x̄ is 13.1 and Σ(x − x̄)² is 388.90.

The t value we need is 2.306, the value that cuts off a tail of 2.5% in the t distribution with 10 − 2 = 8 degrees of freedom. The value of x₀, the temperature on the days whose mean sales figure we want to estimate, is 22.

Confidence interval = 53.1 ± 2.306 × 9.343 × √(1/10 + (22 − 13.1)²/388.90)
                    = 53.1 ± 11.873 = 41.227 to 64.973
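The interval can be reproduced with a few lines of Python using the summary figures from the example (the point estimate 53.1, t value 2.306, residual standard deviation 9.343, sample size 10, mean temperature 13.1 and sum of squared deviations 388.90):

```python
from math import sqrt

y_hat, t_val, s = 53.1, 2.306, 9.343  # point estimate, t(8, 0.025), SD of residuals
n, x_bar, sxx = 10, 13.1, 388.90      # sample size, mean x, sum of squared deviations
x0 = 22                               # temperature of interest

half_width = t_val * s * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
lower, upper = y_hat - half_width, y_hat + half_width
print(round(lower, 3), round(upper, 3))  # 41.227 64.973
```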

If we produce a confidence interval for the mean of the y values associated with an x value outside the range of x values in the sample it will be both wide and unreliable.


The confidence interval we produced in Example 17.11 is of no real use because the temperature on which it is based, 35°, is well beyond the range of temperatures in our sample. Confidence intervals produced from regression lines will be wider when they are based on x values further away from the mean of the x values. This is shown in Figure 17.6.

If you want to produce a prediction of an individual y value associated with a particular x value, rather than an estimate of the mean y value associated with the x value, with a given level of confidence, you can produce what is called a prediction interval. This is to distinguish this type of forecast from a confidence interval, which is a term reserved for estimates of population measures like means.

[Figure 17.6: Regression line for sales against temperature, with 95% confidence interval bands widening away from the mean temperature]

The procedure used to produce prediction intervals is very similar to the one we used to produce confidence intervals for means of values of dependent variables. It is represented by the formula:

Prediction interval = ŷ ± t_{n−2, α/2} × s × √(1 + 1/n + (x₀ − x̄)²/Σ(x − x̄)²)

If you look carefully you can see that the difference between this and the formula for a confidence interval is that we have added one to the expression beneath the square root sign. The effect of this will be to widen the interval considerably. This is to reflect the fact that individual values vary more than statistical measures like means, which are based on many observations.

Just like confidence intervals produced using regression models, prediction intervals are more dependable if they are based on x values nearer the mean of the x values. Prediction intervals based on x values that are well outside the range of the sample data are of very little value.

The usefulness of the estimates that you produce from a regression model depends to a large extent on the size of the sample that you have. The larger the sample on which your regression model is based, the more precise and confident your predictions and estimates will be.
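Using the same summary figures as in Example 17.10, a short sketch shows how much the extra 1 beneath the square root widens the interval for an individual day at 22° compared with the interval for the mean:

```python
from math import sqrt

y_hat, t_val, s = 53.1, 2.306, 9.343  # figures from Example 17.10
n, x_bar, sxx = 10, 13.1, 388.90
x0 = 22

# Half-widths of the confidence interval (mean sales) and the
# prediction interval (sales on an individual day) at 22 degrees
ci_half = t_val * s * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
pi_half = t_val * s * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
print(round(ci_half, 2), round(pi_half, 2))  # 11.87 24.6
```

The prediction interval here is roughly twice as wide as the confidence interval, reflecting the extra variation of individual days around the mean.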

As we have seen, the width of the intervals increases the further the x value is away from the mean of the x values, and estimates and predictions based on x values outside the range of x values in our sample are useless. So, if you know that you want to construct intervals based on specific values of x, try to ensure that these values are within the range of x values in your sample.

At this point you may find it useful to try Review Questions 17.16 to 17.20 at the end of the chapter.

17.3.4 When simple linear models won’t do the job

So far we have concentrated on the simple linear regression model. Although it is used extensively and is the appropriate model in many cases, some sets of quantitative bivariate data show patterns that cannot be represented adequately by a simple linear model. If you use such data to test hypotheses about the slopes of linear models you will probably find that the slopes are not significant. This may be because the relationship between the variables is non-linear and therefore the best-fit model will be some form of curve.

Example 17.13

A retail analyst wants to investigate the efficiency of the operations of a large supermarket chain. She takes a random sample of 22 of their stores and produces the following plot of the weekly sales per square foot of selling area (in £s) against the sales area of the store (in 000s sq ft).

[Figure 17.7: Scatter plot of weekly sales per square foot of selling area against store sales area]

It is clear from Figure 17.7 that the relationship between the two variables does not seem to be linear. In this case, to find an appropriate model you would have to look to the methods of non-linear regression. The variety of non-linear models is wide and it is not possible to discuss them here, but they are covered in Bates and Watts (1988) and Seber and Wild (1989). Non-linear models themselves can look daunting but in essence using them may involve merely transforming the data so that the simple linear regression technique can be applied; in effect you 'straighten' the data to use the straight-line model.
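As an illustration of 'straightening', suppose (hypothetically) that y grows exponentially with x, say y = 2e^(0.5x). Taking logarithms of the y values makes the relationship exactly linear in x, so the ordinary least squares slope recovers the growth rate 0.5:

```python
from math import exp, log

xs = [1, 2, 3, 4, 5]
ys = [2 * exp(0.5 * x) for x in xs]  # hypothetical exponential data

# 'Straighten' the data: ln(y) = ln(2) + 0.5x is a straight line in x
log_ys = [log(y) for y in ys]

# Ordinary least squares slope of ln(y) on x
n = len(xs)
x_bar, ly_bar = sum(xs) / n, sum(log_ys) / n
slope = sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, log_ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
print(round(slope, 6))  # 0.5
```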

The simple linear regression model you produce for a set of data may not be effective because the true relationship is non-linear. But it might be that the two variables that you are trying to relate are not the whole story; perhaps there are other factors that should be taken into account. One way of looking into this is to use residual plots. You can obtain them from MINITAB; you will find guidance on doing this in section 17.5.2 below.

One type of residual plot is a plot of the residuals against the fits (the values of the Y variable that should, according to the line, have occurred). It can be particularly useful because it can show you whether there is some systematic variation in your data that is not explained by the model.

[Figure: Residuals versus the fitted values (response is sales)]
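Even without a plotting package, the idea behind the residuals-versus-fits plot can be seen numerically. In this sketch a straight line is fitted to deliberately curved (quadratic) hypothetical data; the residuals swing from positive to negative and back, a systematic pattern that a well-specified linear model should not produce:

```python
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]  # curved data: y = x squared

# Fit a straight line y = a + b*x by least squares
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar

fits = [a + b * x for x in xs]
residuals = [y - f for y, f in zip(ys, fits)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0] -- a U-shape, not random scatter
```

A residuals-versus-fits plot of this model would show the same U-shape, signalling that the straight-line model has missed the curvature in the data.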
