17
High performance – statistical inference for comparing population means and bivariate data
Chapter objectives
This chapter will help you to:
■ test hypotheses on the difference between two population means using independent samples and draw appropriate conclusions
■ carry out tests of hypotheses about the difference between two population means using paired data and draw appropriate conclusions
■ test differences between population means using analysis of variance (ANOVA) and draw appropriate conclusions
■ conduct hypothesis tests about population correlation coefficients and draw appropriate conclusions
■ produce interval estimates using simple linear regression models
■ perform contingency analysis and interpret the results
■ use the technology: test differences between sample means, apply correlation and regression inference, and carry out contingency analysis in EXCEL, MINITAB and SPSS
■ become acquainted with the business use of contingency analysis
In the previous chapter we looked at statistical inference in relation to univariate data, estimating and testing single population parameters like the mean using single sample results. In this chapter we will consider statistical inference methods that enable us to compare means of two or more populations, to test population correlation coefficients, to make predictions from simple linear regression models and to test for association in qualitative data.
17.1 Testing hypotheses about two population means
In section 16.3 of the previous chapter we looked at tests of the population mean based on a single sample mean. In this section we will consider tests designed to assess the difference between two population means. In business these tests are used to investigate whether, for instance, the introduction of a new logo improves sales.

To use these tests you need to have a sample from each of the two populations. For the tests to be valid the samples must be random, but they can be independent or dependent.
Independent samples are selected from each population separately. Suppose a domestic gas supplier wanted to assess the impact of a new charging system on customers' bills. The company could take a random sample of customers and record the size of their bills under the existing charging system then, after the new system is introduced, take another random sample of customers and record the size of their bills. These samples would be independent.
Dependent samples consist of matched or paired values. If the gas supplier took a random sample of customers and recorded the size of their bills both before and after the introduction of the new charging system they would be using a paired or dependent sample.
The choice of independent or dependent samples depends on the context of the test. Unless there is a good reason for using paired data it is better to use independent samples. We will begin by looking at tests for use with independent samples and deal with paired samples later in this section.
As with single sample tests, the size of the samples is important because it determines the nature of the sampling distribution. In this section we will assume that the population standard deviations are not known.
17.1.1 Large independent samples
The null hypothesis we use in comparing population means is based on the difference between the means of the two populations, μ1 - μ2. The possible combinations of null and alternative hypotheses are shown in Table 17.1.

If both samples are large, the sampling distribution of the difference between the sample means, X̄1 - X̄2, is approximately normal with a mean of μ1 - μ2, and a standard error of:

√(σ1²/n1 + σ2²/n2)

where σ1 and σ2 are the standard deviations of the first and second populations, and n1 and n2 are the sizes of the samples from the first and second populations.
Table 17.1
Types of hypotheses for comparing population means

Null hypothesis        Alternative hypothesis    Type of test
H0: μ1 - μ2 = 0        H1: μ1 - μ2 ≠ 0           Two-tail
H0: μ1 - μ2 ≤ 0        H1: μ1 - μ2 > 0           One-tail
H0: μ1 - μ2 ≥ 0        H1: μ1 - μ2 < 0           One-tail
We will assume that the population standard deviations are not known, in which case the estimated standard error of the sampling distribution is:

√(s1²/n1 + s2²/n2)

The test statistic is:

z = ((x̄1 - x̄2) - (μ1 - μ2)) / √(s1²/n1 + s2²/n2)

If the null hypothesis suggests that the difference between the population means is zero, we can simplify this to:

z = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)

Once we have calculated the test statistic we need to compare it to the appropriate critical value from the Standard Normal Distribution.
Example 17.1
A national breakdown recovery service has depots at Oxford and Portsmouth. The mean and standard deviation of the times that it took for the staff at the Oxford depot to assist each of a random sample of 47 motorists were 51 minutes and 7 minutes respectively. The mean and standard deviation of the response times recorded by the staff at the Portsmouth depot in assisting a random sample of 39 customers were 49 minutes and 5 minutes respectively. Test the hypothesis that there is no difference between the mean response times of the two depots. Use a 5% level of significance.

H0: μ1 - μ2 = 0    H1: μ1 - μ2 ≠ 0

The test statistic is:

z = (51 - 49) / √(7²/47 + 5²/39) = 2/√(1.043 + 0.641) = 1.541

This is a two-tail test using a 5% level of significance so the critical values are ±z0.025, that is ±1.96. Unless the test statistic is below -1.96 or above +1.96 the null hypothesis cannot be rejected. The test statistic, 1.541, is within ±1.96 so we cannot reject H0. The population mean response times of the two depots could be equal.
Notice that in Example 17.1 we have not said anything about the distributions of response times. The Central Limit Theorem allows us to use the same two-sample z test whatever the shape of the populations from which the samples were drawn, as long as the size of both samples is 30 or more.
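The calculation in Example 17.1 is easy to check with a few lines of code. Here is a minimal sketch in Python (the function name and layout are illustrative, not from the text) that reproduces the two-sample z statistic from the summary figures:

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for H0: mu1 - mu2 = 0 with large independent samples."""
    # Estimated standard error of the difference between the sample means
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / se

# Oxford: mean 51, sd 7, n = 47; Portsmouth: mean 49, sd 5, n = 39
z = two_sample_z(51, 7, 47, 49, 5, 39)
print(round(z, 3))  # 1.541, inside ±1.96, so H0 is not rejected at the 5% level
```

Because both samples contain 30 or more observations, the Standard Normal Distribution is the appropriate benchmark, as the passage above explains.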
At this point you may find it useful to try Review Questions 17.1 to 17.3 at the end of the chapter.
17.1.2 Small independent samples
If the size of the samples you want to use to compare population means is small, less than 30, you can only follow the procedure outlined in the previous section if both populations are normal and both population standard deviations are known. In the absence of the latter it is possible to test the difference between two population means using small independent samples, but only under certain circumstances.
If both populations are normal and their standard deviations can be assumed to be the same, that is σ1 = σ2, we can conduct a two-sample t test. We use the sample standard deviations to produce a pooled estimate, sp, from which we obtain the estimated standard error of the sampling distribution of X̄1 - X̄2:

sp = √(((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2))

The test statistic is:

t = ((x̄1 - x̄2) - (μ1 - μ2)) / (sp × √(1/n1 + 1/n2))
We then compare the test statistic to the appropriate critical value from the t distribution. The number of degrees of freedom for this test is n1 + n2 - 2; one degree of freedom is lost for each of the sample means.
At this point you may find it useful to try Review Questions 17.4 to 17.6 at the end of the chapter.
17.1.3 Paired samples
If you want to test the difference between population means using dependent or paired samples, the nature of the data enables you to test the mean of the differences between all the paired values in the population, μd. This approach contrasts with the methods described in the earlier parts of this section, where we have tested the difference between population means, μ1 - μ2.

The procedure involved in testing hypotheses using paired samples is very similar to the one-sample hypothesis testing we discussed in section 16.3 of Chapter 16. We have to assume that the differences between the paired values are normally distributed with a mean of μd and a standard deviation of σd. The sampling distribution of sample mean
Example 17.2

… brand packets are 33.4% and 1.1% respectively. Test the hypothesis that the mean oat content of the premium brand is no greater than the mean oat content of the 'own-brand' muesli using a 1% level of significance.

We will define μ1 as the population mean of the 'own-brand' and μ2 as the population mean of the premium product.

H0: μ2 - μ1 ≤ 0    H1: μ2 - μ1 > 0

First we need the pooled estimate of the standard error:

1.243 × √(1/14 + 1/17) = 0.449

Now we can calculate the test statistic, which works out as 3.344.

This is a one-tail test so the null hypothesis will only be rejected if the test statistic exceeds the critical value. From Table 6 on page 623 in Appendix 1, t0.01,29 is 2.462. Since the test statistic is greater than the critical value we can reject the null hypothesis at the 1% level. The difference between the sample means is very significant.
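The pooled calculations in the muesli example above lend themselves to a short sketch. The helpers below are illustrative (the names are not from the text); the final lines reproduce the estimated standard error using the pooled estimate 1.243 and the sample sizes of 14 and 17 implied by the 29 degrees of freedom:

```python
import math

def pooled_sd(s1, n1, s2, n2):
    """Pooled estimate of the common standard deviation for a two-sample t test."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def pooled_se(sp, n1, n2):
    """Estimated standard error of the difference between the sample means."""
    return sp * math.sqrt(1 / n1 + 1 / n2)

# Sanity check: two samples with identical standard deviations pool to that value
assert pooled_sd(2.0, 10, 2.0, 10) == 2.0

# Muesli example: pooled estimate 1.243, samples of 14 and 17 packets
se = pooled_se(1.243, 14, 17)
print(round(se, 3))  # 0.449
```

Dividing the difference between the sample means by this standard error gives the test statistic, which the example reports as 3.344.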
differences will also be normally distributed with a mean of μd and a standard error of σd/√n, where n is the number of differences in the sample. Since we assume that σd is unknown we have to use the estimated standard error sd/√n, where sd is the standard deviation of the sample differences.
Typically samples of paired data tend to be small, so the benchmark distribution for the test is the t distribution. The test is therefore called the paired t test. Table 17.2 lists the three possible combinations of hypotheses.
The test statistic is:

t = (x̄d - μd0) / (sd/√n)

where x̄d is the mean of the sample differences.

We then compare the test statistic to the appropriate critical value from the t distribution with n - 1 degrees of freedom.
Table 17.2
Types of hypotheses for the mean of the population of differences

Null hypothesis      Alternative hypothesis    Type of test
H0: μd = μd0         H1: μd ≠ μd0              Two-tail
H0: μd ≤ μd0         H1: μd > μd0              One-tail
H0: μd ≥ μd0         H1: μd < μd0              One-tail

In this table μd0 represents the value of the population mean that is to be tested.
The mean and standard deviation of the sample differences are 7.75 and 3.05, to 2 decimal places. The test statistic is:

t = (7.75 - 8.00) / (3.05/√12) = -0.25/0.88 = -0.284

From Table 6 on page 623, t0.10,11 is 1.363. The alternative hypothesis is that the population mean salary difference is less than £8000, so the critical value is -1.363. A sample mean that produces a test statistic this low or lower would lead us to reject the null hypothesis. In this case, although the sample mean is less than £8000, the test statistic, -0.284, is not less than the critical value and the null hypothesis cannot be rejected. The population mean of the salary differences could well be £8000.
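The paired t statistic in the salary example above can be checked with a short Python sketch (the function name is illustrative, not from the text):

```python
import math

def paired_t(d_bar, s_d, n, d0):
    """t statistic for the mean of the paired differences against a hypothesised value d0."""
    # Estimated standard error of the mean difference is s_d / sqrt(n)
    return (d_bar - d0) / (s_d / math.sqrt(n))

# Salary example: mean difference 7.75, sd 3.05, 12 pairs, hypothesised value 8.00
t = paired_t(7.75, 3.05, 12, 8.00)
print(round(t, 3))  # -0.284, not below the critical value -1.363, so H0 stands
```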
At this point you may find it useful to try Review Questions 17.7 to 17.9 at the end of the chapter.
17.2 Testing hypotheses about more than two population means – one-way ANOVA
In some investigations it is important to establish whether two random samples come from a single population or from two populations with different means. The techniques we looked at in the previous section enable us to do just that. But what if we have three or more random samples and we need to establish whether they come from populations with different means?
You might think that the obvious answer is to run t tests using each pair of random samples to establish whether the first sample came from the same population as the second, the first sample came from the same population as the third, the second sample came from the same population as the third, and so on. In doing this you would be testing the hypotheses:

H0: μ1 = μ2    H0: μ1 = μ3    H0: μ2 = μ3    etc.
Although feasible, this is not the best way to approach the investigation. For one thing, the more random samples that are involved the greater the chance that you miss out one or more possible pairings. For another, each test you conduct carries a risk of making a type 1 error, wrongly rejecting a null hypothesis because you happen to have a sample result from an extreme end of its sampling distribution. The chance of this occurring is the level of significance you use in conducting the test. The problem when you conduct a series of related tests is that the probability of making a type 1 error increases; if you use a 5% level of significance then the probability of not making a type 1 error in a sequence of three tests is, using the multiplication rule of probability, 0.95 × 0.95 × 0.95 or 0.857. This means the effective level of significance is 14.3%, considerably greater than you might have assumed.
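The inflation of the type 1 error rate is easy to tabulate. This illustrative calculation (not from the text) applies the multiplication rule for a few sequence lengths:

```python
# Probability of at least one type 1 error across a series of independent
# tests, each conducted at the 5% level of significance
alpha = 0.05
for k in (1, 2, 3, 6):
    effective = 1 - (1 - alpha) ** k
    print(k, round(effective, 3))
# With three tests the effective level is 1 - 0.95**3 = 0.143, i.e. about 14.3%
```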
To establish whether more than two samples come from populations with different means we use an alternative approach, analysis of variance, usually abbreviated to ANOVA. At first sight it seems rather odd to be using a technique based on variance, a measure of spread, to assess hypotheses about means, which are measures of location. The reason for doing this is that it enables us to focus on the spread of the sample means; after all, the greater the differences between the sample means the greater the chance that they come from populations with different means. However, we have to be careful to put these differences into context because, after all, we can get different samples from the same population. Using ANOVA involves looking at the balance between the variance of the sample means and the variance in the sample data overall. Example 17.4 illustrates why this is important.
Example 17.4
The Kranilisha Bank operates cash dispensing machines in Gloucester, Huddersfield and Ipswich. The amounts of cash dispensed (in £000s) at a random sample of machines during a specific period were:

These are independent samples and so the fact that one sample (Huddersfield) contains fewer values does not matter. The sample data are shown in the form of boxplots in Figure 17.1.

The distributions in Figure 17.1 suggest that there are differences between the amounts of cash dispensed at the machines, with those in Ipswich having the largest turnover and those in Huddersfield having the smallest. The sample means, which are represented by the dots, bear out this impression: 34 for Gloucester, 25 for Huddersfield and 41 for Ipswich.
In Example 17.4 the sample means are diverse enough and the distributions shown in Figure 17.1 distinct enough to indicate differences between the locations, but is it enough to merely compare the sample means?

Figure 17.1 Cash dispensed at machines in Gloucester, Huddersfield and Ipswich
In Figure 17.2 the considerable overlaps between the data from the three locations suggest that, despite the contrasts in the means, it is more likely that the three samples come from the same population. Concentrating on the means alone in this case would have led us to the wrong conclusion.
So how can we test whether the three samples all come from the same population, in other words that there is no difference between the population mean amounts of cash dispensed per period in the three towns? For convenience we will use μG, μH and μI to represent the population means for Gloucester, Huddersfield and Ipswich respectively. The hypothesis we need to test is:

H0: μG = μH = μI

We begin by finding the mean of the values in all three samples, the overall mean. Since we already know the three sample means we can work this out by taking the mean of the sample means, being careful to weight each mean by the number of observations in the sample. In the first instance we will use the original data from Example 17.4:

(5 × 34 + 4 × 25 + 5 × 41)/14 = 475/14 = 33.929
The test statistic we will use is based on comparing the variation between the sample means with the variation within the samples. One of the measures of variation or spread that we looked at in Chapter 6 was the sample variance, the square of the sample standard deviation:

s² = Σ(xi - x̄)² / (n - 1)

The basis of the variance is the sum of the squared deviations between each observation and the sample mean. This amount, which is usually abbreviated to the sum of squares, is used to measure variation in analysis of variance. The sum of squares for the Gloucester sample, which we will denote as SSG, comes to 226. We can work out the equivalent figures for the Huddersfield and Ipswich samples:

SSH = (17 - 25)² + (25 - 25)² + (27 - 25)² + (31 - 25)² = 104

SSI = (29 - 41)² + (34 - 41)² + (44 - 41)² + (47 - 41)² + (51 - 41)² = 338

The sum of these three sums of squares is the sum of the squares within the samples, SSW:

SSW = SSG + SSH + SSI = 226 + 104 + 338 = 668

The measure we need for the variation between the sample means is the sum of the squared deviations between the sample means and the mean of all the observations in the three samples. This is the sum of the squares between the samples, SSB. In calculating it we have to weight each squared deviation by the sample size:

SSB = 5 × (34 - 33.929)² + 4 × (25 - 33.929)² + 5 × (41 - 33.929)² = 568.929

If we add the sum of squares within the samples, SSW, to the sum of squares between the samples, SSB, the result is the total sum of squares in the data, denoted by SST. The total sum of squares is also the sum of
the squared deviations between each observation in the set of three samples and the mean of the combined data:

SST = SSW + SSB = 668 + 568.929 = 1236.929

When you calculate a sample variance you have to divide the sum of squared deviations by the sample size less one, n - 1. This is the number of degrees of freedom left in the data; we lose one degree of freedom because we use the mean in working out the deviations from it. Before we can use the sums of squares we have determined above to test the hypothesis of no difference between the population means, we need to incorporate the degrees of freedom associated with each sum of squares by working out the mean sum of squares. This makes the variation within the samples directly comparable to the variation between samples.

The mean sum of squares within the samples, MSW, is the sum of squares within the samples divided by the number of observations in all three samples, in this case 14, less the number of samples we have, 3. You may like to think of subtracting three as reflecting our using the three sample means in working out the sum of squares within the samples. If we use k to represent the number of samples we have:

MSW = SSW/(n - k) = 668/(14 - 3) = 60.727

The mean sum of squares between the samples, MSB, is the sum of squares between the samples divided by the number of samples, k, minus one. We lose one degree of freedom because we have used the overall mean to find the sum of squares between the samples:

MSB = SSB/(k - 1) = 568.929/(3 - 1) = 284.465

The test statistic used to decide the validity of the null hypothesis is the ratio of the mean sum of squares between samples to the mean sum
of squares within the samples. Because the benchmark distribution we shall use to assess it is the F distribution, named after its inventor, R. A. Fisher, the test statistic is represented by the letter F:

F = MSB/MSW

Before comparing this to the F distribution it is worth pausing to consider the meaning of the test statistic. If the three samples came from a single population, in other words if the null hypothesis is true, both the MSB above the line, the numerator, and the MSW below the line, the denominator, would be unbiased estimators of the variance of that population. If this were the case, the test statistic would be close to one. If on the other hand the samples do come from populations with different means, in other words the null hypothesis is not true, we would expect the MSB, the numerator, to be much larger than the denominator. Under these circumstances the test statistic would be greater than one.

In order to gauge how large the test statistic would have to be to lead us to reject the null hypothesis we have to look at it in the context of the F distribution. This distribution portrays the variety of test statistics we would get if we compared all conceivable sets of samples from a single population and worked out the ratio of MSB to MSW. Since neither the MSB nor the MSW can be negative, as they are derived from squared deviations, the F distribution consists of entirely positive values. The version of the F distribution you use depends on the numbers of degrees of freedom used to work out the MSB and the MSW, respectively the numerator and denominator of the test statistic. The F distribution with 2 degrees of freedom in the numerator and 11 degrees of freedom in the denominator, which is the appropriate version for the bank data from Example 17.4, is shown in Figure 17.3.
We can assess the value of the test statistic for the data from Example 17.4 by comparing it with a benchmark figure or critical value from the distribution shown in Figure 17.3. The critical value you use depends on the level of significance you require; typically this is 5% or 0.05. The shaded area in Figure 17.3 is the 5% of the distribution beyond 3.98. If the null hypothesis really were true we would only expect to have test statistics greater than 3.98 in 5% of cases. In the case of Example 17.4 the test statistic is rather higher:

F = MSB/MSW = 284.465/60.727 = 4.684

so the null hypothesis should be rejected at the 5% level. At least two of the means are different.

In general, reject the null hypothesis if the test statistic is greater than F(k-1, n-k, α), where k is the number of samples, n is the number of values in the samples overall and α is the level of significance. Note that this is a
Trang 15one-tail test; it is only possible to reject the hypothesis if the test statistic islarger than the critical value you use Values of the test statistic from theleft-hand side of the distribution are consistent with the null hypothesis.
Table 7 on page 624 in Appendix 1 contains details of the F tion You may like to check it to locate F2, 11, 0.01which is the value of F
distribu-with 2 numerator degrees of freedom and 11 denominator degrees offreedom that cuts off a right-hand tail area of 1%, 7.21 This value isgreater than the test statistic for the data from Example 17.4 so we can-not reject the null hypothesis at the 1% level of significance
The overall mean and the three sample means are exactly the same as those derived from the data in Example 17.4, so the mean sum of squares between the samples is unchanged, 284.465. We need to calculate the sum of squares within the samples for the amended data:

SSW = SSG + SSH + SSI = 1198 + 378 + 1230 = 2806

MSW = SSW/(n - k) = 2806/(14 - 3) = 255.091

The test statistic is:

F = 284.465/255.091 = 1.115

In this case the test statistic is much lower than the critical value for a 5% level of significance, 3.98, and the null hypothesis should not be rejected; it is a reasonable assumption that these three samples come from a single population, confirming the impression from Figure 17.2.
In this section we have used one-way analysis of variance; one-way because we have only considered one factor, geographical location, in our investigation. There may be other factors pertinent to our analysis, such as the type of location of the cash dispensers in Example 17.4, i.e. town centre, supermarket or garage. ANOVA is a flexible technique that can be used to take more than one factor into account. For more on its capabilities and applications see Roberts and Russo (1999).
At this point you may find it useful to try Review Questions 17.10 to 17.12 at the end of the chapter.
17.3 Testing hypotheses and producing interval estimates for quantitative bivariate data
In this section we will look at statistical inference techniques that enable you to estimate and test relationships between variables in populations based on sample data. The sample data we use to do this is called bivariate data because it consists of observed values of two variables. This sort of data is usually collected in order to establish whether there is a relationship between the two variables, and if so, what sort of relationship it is.
Many organizations use this type of analysis to study consumer behaviour, patterns of costs and revenues, and other aspects of their operations. Sometimes the results of such analysis have far-reaching consequences. For example, if you look at a tobacco product you will see a health warning prominently displayed. It is there because some years ago researchers used these types of statistical methods to establish that there was a relationship between tobacco consumption and certain medical conditions.

The quantitative bivariate analysis that we considered in Chapter 7 consisted of two related techniques: correlation and regression. Correlation analysis, which is about calculating and evaluating the correlation coefficient, enables you to tell whether there is a relationship between the observed values of two variables and how strong it is. Regression analysis, which is about finding lines of best fit, enables you to find the equation of the line that is most appropriate for the data, the regression model.

Here we will address how the results from applying correlation and regression to sample data can be used to test hypotheses and make estimates for the populations the sets of sample data belong to.
17.3.1 Testing the population correlation coefficient
The sample correlation coefficient, represented by the letter r, measures the extent of the linear association between a sample of observations of two variables, X and Y. You can find the sample correlation coefficient of a set of bivariate data using the formula:

r = Σ(x - x̄)(y - ȳ) / √(Σ(x - x̄)² × Σ(y - ȳ)²)
If we select a random sample from populations of X and Y that are both normal in shape, the sample correlation coefficient will be an unbiased estimate of the population coefficient, represented by the Greek letter rho, ρ. In fact the main reason for calculating the sample correlation coefficient is to assess the linear association, if any, between the X and Y populations.
The value of the sample correlation coefficient alone is some help in assessing correlation between the populations, but a more thorough approach is to test the null hypothesis that the population correlation coefficient is zero:

H0: ρ = 0

The alternative hypothesis you use depends on what you would like to show. If you are interested in demonstrating that there is significant correlation in the population, then use:

H1: ρ ≠ 0

If you want to demonstrate significant positive or negative correlation, use:

H1: ρ > 0    or    H1: ρ < 0

If we adopt the first of these, H1: ρ ≠ 0, we will need to use a two-tail test; if we use one of the other forms we will conduct a one-tail test. In practice it is more usual to test for either positive or negative correlation rather than for both.
Once we have established the nature of our alternative hypothesis we need to calculate the test statistic from our sample data. The test statistic is:

t = r√(n - 2) / √(1 - r²)

Here r is the sample correlation coefficient and n is the number of pairs of observations in the sample.

As long as the populations of X and Y are normal, the test statistic will belong to a t distribution with n - 2 degrees of freedom and a mean of zero, if the null hypothesis is true and there is no linear association between the populations of X and Y.
At this point you may find it useful to try Review Questions 17.13 to 17.15 at the end of the chapter.
17.3.2 Testing regression models
The second bivariate quantitative technique we looked at in Chapter 7 was simple linear regression analysis. This allows you to find the equation of the line of best fit between two variables, X and Y. Such a line has two distinguishing features, its intercept and its slope. In the standard
Example 17.7

A shopkeeper wants to investigate the relationship between the temperature and the number of cans of soft drinks she sells. The maximum daytime temperature (in degrees Celsius) and the soft drinks sales on 10 working days chosen at random are:

Temperature    Cans sold

The sample correlation coefficient is 0.871. Test the hypothesis of no correlation between temperature and sales against the alternative that there is positive correlation, using a 5% level of significance.

H0: ρ = 0    H1: ρ > 0

The test statistic is:

t = 0.871 × √(10 - 2) / √(1 - 0.871²) = 5.02

We need to compare this test statistic to the t distribution with n - 2, in this case 8, degrees of freedom. According to Table 6 on page 623 in Appendix 1, the value of t with 8 degrees of freedom that cuts off a 5% tail on the right-hand side of the distribution, t0.05,8, is 1.860. Since the test statistic is larger than 1.860 we can reject the null hypothesis at the 5% level of significance. The sample evidence strongly suggests positive correlation between temperature and sales in the population.
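The correlation test in Example 17.7 reduces to a one-line formula, sketched below in Python (the function name is illustrative, not from the text):

```python
import math

def corr_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# Example 17.7: r = 0.871 from 10 pairs of observations
t = corr_t(0.871, 10)
print(round(t, 2))  # about 5.01; the text's 5.02 reflects rounding of intermediate steps
```

Either way the statistic comfortably exceeds the critical value of 1.860, so the null hypothesis of no correlation is rejected.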
formula we used the intercept is represented by the letter a and the slope by the letter b:

Y = a + bX

The line that this equation describes is the best way of representing the relationship between the dependent variable, Y, and the independent variable, X. In practice it is almost always the result of a sample investigation that is intended to shed light on the relationship between the populations of X and Y. That is why we have used ordinary rather than Greek letters in the equation.

The results of a sample investigation can provide you with an understanding of the relationship between the populations. The intercept and slope of the line of best fit for the sample are point estimates for the intercept and slope of the line of best fit for the populations, which are represented by the Greek equivalents of a and b, α and β:

Y = α + βX
The intercept and slope from the sample regression line can be used to test hypotheses about the equivalent figures for the populations. Typically we use null hypotheses that suggest that the population values are zero:

H0: α = 0 for the intercept

and

H0: β = 0 for the slope.
If the population intercept is zero, the population line of best fit will be represented by the equation Y = 0 + βX, and the line will begin at the origin of the graph. You can see this type of line in Figure 17.4. If we wanted to see whether the population intercept is likely to be zero, we would test the null hypothesis H0: α = 0 against the alternative H1: α ≠ 0.

When you use regression analysis you will find that investigating the value of the intercept is rarely important. Occasionally it is of interest; for instance, if we are looking at the relationship between an organization's levels of operational activity and its total costs at different periods of time, then the intercept of the line of best fit represents the organization's fixed costs.
Typically we are much more interested in evaluating the slope of the regression line. The slope is pivotal; it tells us how the dependent variable responds to changes in the independent variable. For this reason the slope is also known as the coefficient of the independent variable.

If the population slope turns out to be zero, it tells you that the dependent variable does not respond to the independent variable. The implication of this is that your independent variable is of no use in explaining how your dependent variable behaves, and there would be no point in using it to make predictions of the dependent variable.
If the slope of the line of best fit is zero, the equation of the line would be Y = α + 0X, and the line would be perfectly horizontal. You can see this illustrated in Figure 17.5.

The line in Figure 17.5 shows that whatever the value of X, whether it is small and to the left of the horizontal axis or large and to the right of it, the value of Y remains the same. The size of the x value has no impact whatsoever on Y, and the regression model is useless.

We usually want to use regression analysis to find useful rather than useless models – regression models that help us understand and anticipate the behaviour of dependent variables. In order to demonstrate that a model is valid, it is important that you test the null hypothesis that the slope is zero. Hopefully the sample evidence will enable you to reject the null hypothesis in favour of the alternative, that the slope is not zero. The test statistic used to test the hypothesis is:

t = (b - 0)/sb
where b is the sample slope, 0 is the zero population slope that the null hypothesis suggests, and sb is the estimated standard error of the sampling distribution of the sample slopes.

To calculate the estimated standard error, sb, divide s, the standard deviation of the sample residuals, the parts of the y values that the line of best fit does not explain, by the square root of the sum of the squared deviations between the x values and their mean, x̄:

sb = s / √(Σ(x - x̄)²)

Once we have the test statistic we can assess it by comparing it to the t distribution with n - 2 degrees of freedom, two fewer than the number of pairs of x and y values in our sample data.
H0: 0 H1: ⬆ 0
To find the test statistic we first need to calculate the standard deviation of the
residuals We can identify the residuals by taking each x value, putting it into the equation
of the line of best fit and then working out what Y ‘should’ be, according to the model The difference between the y value that the equation says should be associated with the
x value and the y value that is actually associated with the x value is the residual.
To illustrate this, we will look at the first pair of values in our sample data, a day whenthe temperature was 14° and 19 cans of soft drink were sold If we insert the tempera-ture into the equation of the line of best fit we can use the equation to estimate thenumber of cans that ‘should’ have been sold on that day:
Sales = 0.74 + (2.38 × 14) = 34.06

The residual is the difference between the actual sales level, 19, and this estimate:

residual = 19 − 34.06 = −15.06

The standard deviation of the residuals is based on the squared residuals. The residuals and their squares are:
Temperature Sales Residuals Squared residuals
We find the standard deviation of the residuals by taking the square root of the sum of the squared residuals divided by n − 2, where n is the number of residuals. (We have to subtract two because we have ‘lost’ 2 degrees of freedom in using the intercept and slope to calculate the residuals.) For our data this gives:

s = √(Σresiduals²/(n − 2)) = 9.343
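The two steps just described, finding each residual and then pooling them into a standard deviation, can be sketched as follows (our own illustration; the intercept 0.74 and slope 2.38 are the line-of-best-fit coefficients used in this example):

```python
import math

def residual(x, y, intercept=0.74, slope=2.38):
    """Actual y minus the y the line of best fit says 'should' occur at x."""
    return y - (intercept + slope * x)

def residual_std_dev(residuals):
    """Square root of the sum of squared residuals over n - 2."""
    n = len(residuals)
    return math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# The first pair in the sample: temperature 14, sales 19
print(round(residual(14, 19), 2))  # -15.06
```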
To get the estimated standard error we divide this by the square root of the sum of squared differences between the temperature figures and their mean. The estimated standard error is:

s_b = 9.343/√388.90 = 0.4738

and the test statistic t = (b − 0)/s_b = 2.38/0.4738 = 5.02
From Table 6, the t value with 8 degrees of freedom that cuts off a tail area of 2.5%, t₈,₀.₀₂₅, is 2.306. If the null hypothesis is true and the population slope is zero, only 2.5% of test statistics will be more than 2.306 and only 2.5% will be less than −2.306. The level of significance is 5%, so our decision rule is therefore to reject the null hypothesis if the test statistic is outside ±2.306. Since the test statistic in this case is 5.02 we should reject H0 and conclude that the evidence suggests that the population slope is not zero.
The implication of the sort of result we arrived at in Example 17.8 is that the model, represented by the equation, is sufficiently sound to enable the temperature variable to be used to predict sales.
If you compare the test statistic for the sample slope in Example 17.8 with the test statistic for the sample correlation coefficient in Example 17.7, you will see that they are both 5.02. This is no coincidence; the two tests are equivalent. The slope represents the form of the association between the variables whereas the correlation coefficient measures its strength. We use the same data in the same sort of way to test both of them.
17.3.3 Constructing interval predictions
When you use a regression model to make a prediction, as we did in Example 17.8 to obtain the residuals, you get a single figure that is the value of Y that the model suggests is associated with the value of X that you specify.
Example 17.9
Use the regression model in Example 17.8 to predict the sales that will be achieved on a day when the temperature is 22° Celsius.
If temperature = 22, according to the regression equation:

Sales = 0.74 + 2.38 × 22 = 53.1

Since the number of cans sold is discrete, we can round this to 53 cans.
The problem with single-figure predictions is that we do not know how likely they are to be accurate. It is far better to have an interval that we know, with a given level of confidence, will be accurate.
Before looking at how to produce such intervals, we need to clarify exactly what we want to find. The figure we produced in Example 17.9 we described as a prediction of sales on a day when the temperature is 22°. In fact, it can also be used as an estimate of the mean level of sales that occur on days when the temperature is 22°. Because it is a single figure it is a point estimate of the mean sales levels on such days.

We can construct an interval estimate, or confidence interval, of the mean level of sales on days when the temperature is at a particular level by taking the point estimate and adding and subtracting an error. The error is the product of the standard error of the sampling distribution of the point estimates and a figure from the t distribution. The t distribution should have n − 2 degrees of freedom, n being the number of pairs of data in our sample, and the t value we select from it is based on the level of confidence we want to have that our estimate will be accurate.
We can express this procedure using the formula:

Confidence interval = ŷ ± t_(α/2, n−2) × s × √(1/n + (x₀ − x̄)²/Σ(x − x̄)²)

where ŷ is the point estimate of the mean of the y values associated with x₀, s is the standard deviation of the sample residuals and x₀ is the value of x we are interested in.
Example 17.10

From Example 17.9 the point estimate for the mean, ŷ, is 53.1 cans. We can use the precise figure because the mean, unlike sales on a particular day, does not have to be discrete. We also know from the calculations in Example 17.8 that s, the standard deviation of the sample residuals, is 9.343, x̄ is 13.1 and Σ(x − x̄)² is 388.90.

The t value we need is 2.306, the value that cuts off a tail of 2.5% in the t distribution with 10 − 2 = 8 degrees of freedom. The value of x₀, the temperature on the days whose mean sales figure we want to estimate, is 22.
Confidence interval = 53.1 ± 2.306 × 9.343 × √(1/10 + (22 − 13.1)²/388.90)
                    = 53.1 ± 11.873
                    = 41.227 to 64.973
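The same interval can be reproduced in code. This is an illustrative sketch (the function and argument names are our own), applying the confidence-interval formula to the figures used in the calculation above:

```python
import math

def mean_response_ci(x0, n, intercept, slope, s, x_bar, ssx, t_crit):
    """Confidence interval for the mean of the y values at x = x0.

    s      -- standard deviation of the sample residuals
    x_bar  -- mean of the sample x values
    ssx    -- sum of squared deviations of the x values from x_bar
    t_crit -- t value for the chosen confidence level with n - 2 df
    """
    y_hat = intercept + slope * x0                # point estimate of the mean
    half_width = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / ssx)
    return y_hat - half_width, y_hat + half_width

lower, upper = mean_response_ci(22, 10, 0.74, 2.38, 9.343, 13.1, 388.90, 2.306)
print(round(lower, 3), round(upper, 3))  # 41.227 64.973
```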
If we produce a confidence interval for the mean of the y values associated with an x value outside the range of x values in the sample it will be both wide and unreliable.

The confidence interval we produced in Example 17.11 is of no real use because the temperature on which it is based, 35°, is well beyond the range of temperatures in our sample. Confidence intervals produced from regression lines will be wider when they are based on x values further away from the mean of the x values. This is shown in Figure 17.6.

If you want to produce a prediction of an individual y value associated with a particular x value rather than an estimate of the mean y value associated with the x value, with a given level of confidence, you can produce what is called a prediction interval. This is to distinguish this type of forecast from a confidence interval, which is a term reserved for estimates of population measures like means.
[Figure 17.6: the regression line with 95% confidence intervals, plotted against temperature]
The procedure used to produce prediction intervals is very similar to the one we used to produce confidence intervals for means of values of dependent variables. It is represented by the formula:

Prediction interval = ŷ ± t_(α/2, n−2) × s × √(1 + 1/n + (x₀ − x̄)²/Σ(x − x̄)²)

If you look carefully you can see that the difference between this and the formula for a confidence interval is that we have added one to the expression beneath the square root sign. The effect of this will be to widen the interval considerably. This is to reflect the fact that individual values vary more than statistical measures like means, which are based on sets of values.

Just like confidence intervals produced using regression models,
prediction intervals are more dependable if they are based on x values nearer the mean of the x values. Prediction intervals based on x values that are well outside the range of the sample data are of very little value.

The usefulness of the estimates that you produce from a regression model depends to a large extent on the size of the sample that you have. The larger the sample on which your regression model is based, the more precise and confident your predictions and estimates will be. As we have seen, the width of the intervals increases the further the x value is away from the mean of the x values, and estimates and predictions based on x values outside the range of x values in our sample are useless. So, if you know that you want to construct intervals based on
specific values of x, try to ensure that these values are within the range of x values in your sample.
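The prediction-interval formula can be sketched the same way as the confidence interval (again our own illustration, run here with the soft-drinks figures from this section); note the extra 1 under the square root:

```python
import math

def prediction_interval(x0, n, intercept, slope, s, x_bar, ssx, t_crit):
    """Prediction interval for an individual y value at x = x0."""
    y_hat = intercept + slope * x0
    # The extra 1 reflects the variation of individual values about the mean
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / ssx)
    return y_hat - half_width, y_hat + half_width

low, high = prediction_interval(22, 10, 0.74, 2.38, 9.343, 13.1, 388.90, 2.306)
print(round(low, 1), round(high, 1))  # 28.5 77.7
```

This interval is much wider than the 41.227 to 64.973 confidence interval for the mean computed earlier, exactly as the discussion above leads us to expect.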
At this point you may find it useful to try Review Questions 17.16 to 17.20 at the end of the chapter.
17.3.4 When simple linear models won’t do the job
So far we have concentrated on the simple linear regression model. Although it is used extensively and is the appropriate model in many cases, some sets of quantitative bivariate data show patterns that cannot be represented adequately by a simple linear model. If you use such data to test hypotheses about the slopes of linear models you will probably find that the slopes are not significant. This may be because the relationship between the variables is non-linear and therefore the best-fit model will be some form of curve.
Example 17.13
A retail analyst wants to investigate the efficiency of the operations of a large supermarket chain. She takes a random sample of 22 of their stores and produces the following plot of the weekly sales per square foot of selling area (in £s) against the sales area of the store (in 000s sq ft).
[Figure 17.7: scatter plot of weekly sales per square foot of selling area against sales area of store]
It is clear from Figure 17.7 that the relationship between the two variables does not seem to be linear. In this case to find an appropriate model you would have to look to the methods of non-linear regression. The variety of non-linear models is wide and it is not possible to discuss them here, but they are covered in Bates and Watts (1988) and Seber and Wild (1989). Non-linear models themselves can look daunting, but in essence using them may involve merely transforming the data so that the simple linear regression technique can be applied; in effect you ‘straighten’ the data to use the straight-line model.
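As a minimal sketch of this ‘straightening’ idea (our own illustration with made-up data, not the analyst's figures): if y is roughly proportional to a power of x, taking logs of both variables turns the curve into a straight line that the ordinary simple linear regression fit can handle:

```python
import math

def least_squares(x, y):
    """Slope and intercept of the simple linear regression line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

# Curved data following y = 3 * x ** 0.5 (illustrative only)
x = [1, 2, 4, 8, 16]
y = [3 * xi ** 0.5 for xi in x]

# 'Straighten' by taking logs of both variables, then fit the straight line
intercept, slope = least_squares([math.log(v) for v in x],
                                 [math.log(v) for v in y])
print(round(slope, 3), round(math.exp(intercept), 3))  # 0.5 3.0
```

The fitted slope recovers the power and the exponentiated intercept recovers the constant, so the straight-line machinery of this chapter carries over to the transformed data.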
The simple linear regression model you produce for a set of data may not be effective because the true relationship is non-linear. But it might be that the two variables that you are trying to relate are not the whole story; perhaps there are other factors that should be taken into account. One way of looking into this is to use residual plots. You can obtain them from MINITAB; you will find guidance on doing this in section 17.5.2 below.
One type of residual plot is a plot of the residuals against the fits (the values of the Y variable that should, according to the line, have occurred). It can be particularly useful because it can show you whether there is some systematic variation in your data that is not explained by the model.
[Figure: residuals versus the fitted values (response is sales)]
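The raw material for such a plot is straightforward to assemble outside MINITAB too. This sketch (our own, with made-up figures) pairs each fitted value with its residual, ready for any plotting tool; a curved or fanning pattern in these pairs would point to systematic variation the straight-line model has missed:

```python
def residuals_vs_fits(x, y, intercept, slope):
    """Pair each fitted value with its residual for a residual plot."""
    fits = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fits)]
    return list(zip(fits, residuals))

# Illustrative figures only
points = residuals_vs_fits([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8], 0.0, 2.0)
print([(f, round(r, 2)) for f, r in points])
# [(2.0, 0.1), (4.0, -0.1), (6.0, 0.2), (8.0, -0.2)]
```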