16 Test driving – sampling theory, estimation and hypothesis testing
Chapter objectives
This chapter will help you to:
■ understand the theory behind the use of sample results for prediction
■ make use of the t distribution and appreciate its importance
■ construct and interpret interval estimates of population means and population proportions
■ work out necessary sample sizes for interval estimation
■ carry out tests of hypotheses about population means, proportions and medians, and draw appropriate conclusions from them
■ use the technology: the t distribution, estimation and hypothesis testing in EXCEL, MINITAB and SPSS
■ become acquainted with the business origins of the t distribution
In the previous chapter we reviewed the methods that can be used to select samples from populations in order to gain some understanding of those populations. In this chapter we will consider how sample results can be used to provide estimates of key features, or parameters, of the populations from which they were selected. It is important to note that the techniques described in this chapter, and the theory on which they are based, should only be used with results of samples selected using probabilistic or random sampling methods. The techniques are based on knowing, or at least having a reliable estimate of, the sampling error and this is not possible with non-random sampling methods.
In Chapter 13 we looked at the normal distribution, an important statistical distribution that enables you to investigate the very many continuous variables that occur in business and many other fields, whose values are distributed in a normal pattern. What makes the normal distribution especially important is that it enables us to anticipate how sample results vary. This is because many sampling distributions have a normal pattern.
16.1 Sampling distributions
Sampling distributions are distributions that show how sample results vary. They depict the ‘populations’ of sample results. Such distributions play a crucial role in quantitative work because they enable us to use data from a sample to make statistically valid predictions or judgements about a population. There are considerable advantages in using sample results in this way, especially when the population is too large to be accessible, or if investigating the population is too expensive or time-consuming.
A sample is a subset of a population, that is, it consists of some observations taken from the population. A random sample is a sample that consists of values taken at random from the population.
You can take many different random samples from the same population, even samples that consist of the same number of observations. Unless the population is very small the number of samples that you could take from it is to all intents and purposes infinite. A ‘parent’ population can produce an effectively infinite number of ‘offspring’ samples.
These samples will have different means, standard deviations and so on. So if we want to use, say, a sample mean to predict the value of the population mean we will be using something that varies from sample to sample, the sample mean (x̄), to predict something that is fixed, the population mean (μ).
To do this successfully we need to know how the sample means vary from one sample to another. We need to think of sample means as observations, x̄s, of a variable, X̄, and consider how they are distributed. What is more we need to relate the distribution of sample means to the parameters of the population the samples come from so that we can use sample statistics to predict population measures. The distribution of X̄, the sample means, is a sampling distribution.
We will begin by considering the simplest case, in which we assume that the parent population is normally distributed. If this is the case, what will the sampling distributions of means of samples taken from the population be like?
If you were to take all possible random samples consisting of n observations from a population that is normal, with a mean μ and a standard deviation σ, and analyse them you would find that the sample means of all these samples will themselves be normally distributed.
You would find that the mean of the distribution of all these different sample means is exactly the same as the population mean, μ. You would also find that the standard deviation of all these sample means is the population standard deviation divided by the square root of the size of the samples, σ/√n.
So the sample means of all the samples of size n that can be taken from a normal population with a mean μ and a standard deviation σ are distributed normally with a mean of μ and a standard deviation of σ/√n. In other words, the sample means are distributed around the same mean as the population itself but with a smaller spread.
We know that the sample means will be less spread out than the population because n will be more than one, so σ/√n will be less than σ. For instance, if there are four observations in each sample, σ/√n will be σ/2, that is the sampling distribution of means of samples which have four observations in them will have half the spread of the population distribution.
The larger the size of the samples, the less the spread in the values of their means. For instance, if each sample consists of 100 observations the standard deviation of the sampling distribution will be σ/10, a tenth of the standard deviation of the population distribution.
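To see how quickly the spread of the sample means shrinks as the sample size grows, you may find a short calculation helpful. The sketch below is only an illustration: the function name standard_error and the population standard deviation of 10 are assumptions made for the example, not figures from the chapter.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the sampling distribution of the mean: sigma / sqrt(n)."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return sigma / math.sqrt(n)


# Hypothetical population standard deviation, chosen only for illustration.
sigma = 10.0

for n in (1, 4, 25, 100):
    # With n = 4 the spread halves; with n = 100 it falls to a tenth.
    print(f"n = {n:>3}: standard error = {standard_error(sigma, n):.2f}")
```

Because the sample size sits under a square root, quadrupling the sample size only halves the standard error.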
This is an important logical point. In taking samples we are ‘averaging out’ the differences between the individual values in the population. The larger the samples, the more this happens. For this reason it is better to use larger samples to make predictions about a population.
Next time you see an opinion poll look for the number of people that the pollsters have canvassed. It will probably be at least one thousand. The results of an opinion poll are a product that the polling organization wants to sell to media companies. In order to do this they have to persuade them that their poll results are likely to be reliable. They won’t be able to do this if they only ask a very few people for their opinions!
The standard deviation of a sampling distribution, σ/√n, is also known as the standard error of the sampling distribution because it helps us anticipate the error we will have to deal with if we use a sample mean to predict the value of the population mean.
If we know the mean and standard deviation of the parent population distribution we can find the probabilities of ranges of different sample means, as we can do for any other normal distribution, by using the Standard Normal Distribution.
Example 16.1
Reebar Frozen Foods produces packs of four fish portions. On the packaging they claim that the average weight of the portions is 120 g. If the mean weight of the fish portions they buy is 124 g with a standard deviation of 4 g, what is the probability that the mean weight of a pack of four portions will be less than 120 g?
We will assume that the selection of the four portions to put in a pack is random. Imagine we took every possible sample of four portions from the population of fish portions purchased by Reebar (which we will assume for practical purposes to be infinite) and calculated the mean weight of each sample. We would find that the sampling distribution of all these means has a mean of 124 g and a standard error of 4/√4, which is 2.
The probability that a sample of four portions has a mean of less than 120 g is the probability that a normal variable with a mean of 124 and a standard deviation of 2 is less than 120.
The z-equivalent of the value 120 in the sampling distribution is
z = (x̄ − μ) / (σ/√n) = (120 − 124) / (4/√4) = −4/2 = −2.00
From Table 5 on pages 621–622 in Appendix 1 you will find that the probability that z is less than −2.00 is 0.0228 or 2.28%.
We can conclude that there is a less than one in forty chance that four portions in a pack chosen at random have a mean weight of less than 120 g. You might like to compare this with the probability of one fish portion selected at random weighing less than 120 g:
z = (x − μ) / σ = (120 − 124) / 4 = −1.00
Using Table 5 you will find that the probability that Z is less than −1.00 is 0.1587 or 15.87%, approximately a one in six chance. This is rather greater than the chance of getting a pack of four whose mean weight is less than 120 g (2.28%); in general there is less variation among sample means than there is among single points of data.
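If you would like to check the two probabilities in Example 16.1 without printed tables, the following sketch uses the NormalDist class from Python's standard statistics module. The variable names are illustrative choices; the figures are those given in the example.

```python
from math import sqrt
from statistics import NormalDist

# Figures from Example 16.1: population mean, standard deviation, pack size.
mu, sigma, n = 124.0, 4.0, 4

# Sampling distribution of the mean weight of packs of four portions.
pack_means = NormalDist(mu, sigma / sqrt(n))
# Distribution of the weights of single portions.
single_portions = NormalDist(mu, sigma)

p_pack = pack_means.cdf(120)         # P(mean of four portions < 120 g)
p_single = single_portions.cdf(120)  # P(one portion < 120 g)

print(f"P(pack mean < 120 g)      = {p_pack:.4f}")
print(f"P(single portion < 120 g) = {p_single:.4f}")
```

The two results should agree with the Table 5 figures quoted in the example to four decimal places.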
The procedure we used in Example 16.1 can be applied whether we are dealing with small samples or with very much larger samples. As long as the population the samples come from is normal we can be sure that the sampling distribution will be distributed normally with a mean of μ and a standard deviation of σ/√n.
But what if the population is not normal? There are many distributions that are not normal, such as distributions of wealth of individuals or distributions of waiting times.
Fortunately, according to a mathematical finding known as the Central Limit Theorem, as long as n is large (which is usually interpreted to mean 30 or more) the sampling distribution of sample means will be approximately normal in shape and have a mean of μ and a standard deviation of σ/√n. This is true whatever the shape of the population distribution.
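A small simulation can make the Central Limit Theorem more concrete. The sketch below is an illustration under stated assumptions, not part of the chapter's method: it assumes a skewed exponential population with mean 1 and standard deviation 1, a sample size of 30, and 5000 repeated samples, and the function name simulate_sample_means is hypothetical.

```python
import math
import random
import statistics


def simulate_sample_means(n: int, trials: int, seed: int = 1) -> list:
    """Return the means of `trials` random samples of size n drawn from a
    skewed (exponential) population whose mean and standard deviation are both 1."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]


means = simulate_sample_means(n=30, trials=5000)
expected_se = 1.0 / math.sqrt(30)  # sigma / sqrt(n) with sigma = 1

print(f"mean of the sample means:      {statistics.fmean(means):.3f} (population mean 1.000)")
print(f"std. dev. of the sample means: {statistics.stdev(means):.3f} (sigma/sqrt(n) = {expected_se:.3f})")
```

Although the parent population is far from normal, the simulated sample means should cluster around the population mean with a spread close to σ/√n.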
Example 16.2
The times that passengers at a busy railway station have to wait to buy tickets during the rush hour follow a skewed distribution with a mean of 2 minutes 46 seconds and a standard deviation of 1 minute 20 seconds. What is the probability that a random sample of 100 passengers will, on average, have to wait more than 3 minutes?
The sample size, 100, is larger than 30 so the sampling distribution of the sample means will have a normal shape. It will have a mean of 2 minutes 46 seconds, or 166 seconds, and a standard error of 80/√100 = 8 seconds. The z-equivalent of 3 minutes, or 180 seconds, is:
z = (x̄ − μ) / (σ/√n) = (180 − 166) / (80/√100) = 14/8 = 1.75
From Table 5 the probability that Z is more than 1.75 is 0.0401. So the probability that a random sample of 100 passengers will have to wait on average more than 3 minutes is 4.01%, or a little less than a one in twenty-five chance.
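The arithmetic in Example 16.2 can be reproduced step by step. The sketch below uses only the figures given in the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu = 2 * 60 + 46      # population mean waiting time: 166 seconds
sigma = 1 * 60 + 20   # population standard deviation: 80 seconds
n = 100               # sample size

standard_error = sigma / sqrt(n)     # 80 / 10 = 8 seconds
z = (3 * 60 - mu) / standard_error   # z-equivalent of 3 minutes (180 seconds)

# Upper-tail area of the Standard Normal Distribution beyond z.
p_more_than_3_minutes = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}")
print(f"P(sample mean > 3 minutes) = {p_more_than_3_minutes:.4f}")
```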
At this point you may find it useful to try Review Questions 16.1 to 16.6 at the end of the chapter.
16.1.1 Estimating the standard error
The main reason for being interested in sampling distributions is to help us use samples to assess populations because studying the whole population is not possible or practicable. Typically we will be using a sample, which we do know about, to investigate a population, which we don’t know about. We will have a sample mean and we will want to use it to assess the likely value of the population mean.
So far we have measured sampling distributions using the mean and the standard deviation of the population, μ and σ. But if we need to find out about the population using a sample, how can we possibly know the values of μ and σ?
The answer is that in practice we don’t. In the case of the population mean, μ, this doesn’t matter because typically it is something we are trying to assess. But without the population standard deviation, σ, we do need an alternative approach to measuring the spread of a sampling distribution.
Because we will have a sample, the obvious answer is to use the standard deviation of the sample, s, in place of the population standard deviation, σ. So instead of using the real standard error, σ/√n, we estimate the standard error of the sampling distribution with s/√n.
Using the estimated standard error, s/√n, is fine as long as the sample concerned is large (in practice, that n, the sample size, is at least 30). If we are dealing with a large sample we can use s/√n as an approximation of σ/√n. The means of samples consisting of n observations will be normally distributed with a mean of μ and an estimated standard error of s/√n. The Central Limit Theorem allows us to do this even if the population the sample comes from is not itself normal in shape.
Example 16.3
The population mean, μ, in this case is 0.538 and the sample standard deviation, s, is 0.042. We want the probability that x̄ is less than 0.568, P(X̄ < 0.568). The z-equivalent of 0.568 is:
z = (x̄ − μ) / (s/√n) = 2.73
If you look at Table 5 you will find that the probability that Z is less than 2.73 is 0.9968, so the probability that the sample mean is less than a pint (0.568 litres) is 0.9968 or 99.68%.
It is important to remember that s/√n is not the real standard error, it is the estimated standard error, but because the standard deviation of a large sample will be reasonably close to the population standard deviation the estimated standard error will be close to the actual standard error.
At this point you may find it useful to try Review Question 16.7 at the end of the chapter.
16.1.2 The t distribution
In section 16.1.1 we looked at how you can analyse sampling distributions using the sample standard deviation, s, when you do not know the population standard deviation, σ. As long as the sample size, n, is 30 or more the estimated standard error will be a sufficiently consistent measure of the spread of the sampling distribution, whatever the shape of the parent population.
If, however, the sample size, n, is less than 30 the estimated standard error, s/√n, is generally not so close to the actual standard error, σ/√n, and the smaller the sample size, the greater will be the difference between the two. In this situation it is possible to model the sampling distribution using the estimated standard error, as long as the population the sample comes from is normal, but we have to use a modified normal distribution in order to do it.
This modified normal distribution is known as the t distribution. The development of the distribution was a real breakthrough because it made it possible to investigate populations using small sample results. Small samples are generally much cheaper and quicker to gather than large samples so the t distribution broadened the scope for analysis based on sample data.
The t distribution is a more spread out version of the normal distribution. The difference between the two is illustrated in Figure 16.1. The greater spread is to compensate for the greater variation in sample standard deviations between small samples than between large samples. The smaller the sample size, the more compensation is needed, so there are a number of versions of the t distribution. The one that should be used in a particular context depends on the number of degrees of freedom, represented by the symbol ν (nu, the Greek letter n), which is the sample size minus one, n − 1.
To work out the probability that the mean of a small sample taken from a normal population is more, or less, than a certain amount we first need to find its t-equivalent, or t value. The procedure is very similar to the way we work out a z-equivalent.
Example 16.4
The population mean, μ, is 0.538 and the sample standard deviation, s, is 0.048. We want the probability that X̄ is less than 0.568, P(X̄ < 0.568). The t value equivalent to 0.568 is:
t = (x̄ − μ) / (s/√n) = (0.568 − 0.538) / (0.048/√9) = 0.030/0.016 = 1.875
You will find some details of the t distribution in Table 6 on page 623 in Appendix 1. Look down the column headed ν on the left hand side until you see the figure 8, the number of degrees of freedom in this case (the sample size is 9). Look across the row to the right and you will see five figures that relate to the t distribution with eight degrees of freedom. The nearest of these figures to 1.875 is the 1.86 that is in the column headed 0.05. This means that 5% of the t distribution with eight degrees of freedom is above 1.86. In other words, the probability that t is more than 1.86 is 0.05. This means that the probability that the mean volume of nine pints will be less than 0.568 litres will be approximately 0.95 or 95%.
The t value that we used in Example 16.4 could be written as t0.05,8 because it is the value of t that cuts off a tail area of 5% in the t distribution that has 8 degrees of freedom. In the same way, t0.01,15 represents the t value that cuts off a tail area of 1% in the t distribution that has 15 degrees of freedom.
You will find that the way the t distribution is used in further work depends on tail areas. For this reason, and also because the t distribution varies depending on the number of degrees of freedom, printed tables do not provide full details of the t distribution in the same way that Standard Normal Distribution tables give full details of the Standard Normal Distribution. Table 6 on page 623 gives selected values of t from the t distribution with different degrees of freedom for the most commonly used tail areas. If you need t distribution values that are not in Table 6 you can obtain them using computer software, as shown in section 16.4 later in this chapter.
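Printed tables only give t values for a few tail areas, so it can be useful to work out a tail area directly. Python's standard library has no t distribution, so the sketch below integrates the t density numerically with Simpson's rule. This is a swapped-in technique for illustration, not the method used in the chapter, and the function names t_pdf and t_upper_tail are hypothetical. It is applied to the figures from Example 16.4.

```python
import math


def t_pdf(t: float, df: int) -> float:
    """Density of the t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """Tail area P(T > t) for t >= 0, found by applying Simpson's rule
    to the density between 0 and t (steps must be even)."""
    if t < 0:
        raise ValueError("this sketch only handles t >= 0")
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half the distribution lies above 0; subtract the area between 0 and t.
    return 0.5 - total * h / 3


# Example 16.4: mu = 0.538, s = 0.048, n = 9, so 8 degrees of freedom.
t_value = (0.568 - 0.538) / (0.048 / math.sqrt(9))
tail = t_upper_tail(t_value, df=8)

print(f"t value = {t_value:.3f}")
print(f"P(sample mean < 0.568 litres) = {1 - tail:.3f}")
```

The result should be close to the 95% read from Table 6, differing slightly because the table lookup uses the nearest tabulated value, 1.86, rather than 1.875.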
Example 16.5
Use Table 6 to find:
(a) t with 4 degrees of freedom that cuts off a tail area of 0.10, t0.10,4
(b) t with 10 degrees of freedom that cuts off a tail area of 0.01, t0.01,10
(c) t with 17 degrees of freedom that cuts off a tail area of 0.025, t0.025,17
(d) t with 100 degrees of freedom that cuts off a tail area of 0.005, t0.005,100
From Table 6:
(a) t0.10,4 is in the row for 4 degrees of freedom and the column headed 0.10, 1.533. This means that the probability that t, with 4 degrees of freedom, is greater than 1.533 is 0.10 or 10%.
(b) t0.01,10 is the figure in the row for 10 degrees of freedom and the column headed 0.01, 2.764.
(c) t0.025,17 is in the row for 17 degrees of freedom and the 0.025 column, 2.110.
(d) t0.005,100 is in the row for 100 degrees of freedom and the 0.005 column, 2.626.
At this point you may find it useful to try Review Questions 16.8 and 16.9 at the end of the chapter.
16.1.3 Choosing the right model for a sampling distribution
The normal distribution and the t distribution are both models that you can use to model sampling distributions, but how can you be sure that you use the correct one? This section is intended to provide a brief guide to making the choice.
The first question to ask is, are the samples whose results make up the sampling distribution drawn from a population that is distributed normally? In other words, is the parent population normal? If the answer is yes then it is always possible to model the sampling distribution. If the answer is no then it is only possible to model the sampling distribution if the sample size, n, is 30 or more.
The second question is whether the population standard deviation, σ, is known. If the answer to this is yes then as long as the parent population is normal the sampling distribution can be modelled using the normal distribution whatever the sample size. If the answer is no the sampling distribution can be modelled using the normal distribution only if the sample size is 30 or more. In the absence of the population standard deviation, you have to use the sample standard deviation to approximate the standard error.
Finally, what if the parent population is normal, the population standard deviation is not known and the sample size is less than 30? In these circumstances you should use the t distribution and approximate the standard error using the sample standard deviation. Note that if the parent population is not normal and the sample size is less than 30 neither the normal distribution nor the t distribution can be used to model the sampling distribution, and this is true whether or not the population standard deviation is known.
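The guidance in this section can be summarized as a small decision function. The sketch below is one possible way of writing the rules down; the function name choose_model and its return values are assumptions made for illustration.

```python
from typing import Optional


def choose_model(population_normal: bool, sigma_known: bool, n: int) -> Optional[str]:
    """Return 'normal', 't', or None (no suitable model) for a sampling
    distribution of means, following the rules of thumb in section 16.1.3."""
    if n >= 30:
        # Large sample: the normal distribution applies whatever the shape of
        # the population. Use sigma if it is known, otherwise approximate the
        # standard error with the sample standard deviation s.
        return "normal"
    if not population_normal:
        # Small sample from a non-normal population: neither model can be used.
        return None
    # Small sample from a normal population.
    return "normal" if sigma_known else "t"


print(choose_model(population_normal=True, sigma_known=False, n=9))    # t
print(choose_model(population_normal=False, sigma_known=True, n=12))   # None
print(choose_model(population_normal=False, sigma_known=False, n=100)) # normal
```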
16.2 Statistical inference: estimation
Businesses use statistical analysis to help them study and solve problems. In many cases the data they use in their analysis will be sample data. Usually it is too expensive, or too time-consuming or simply impossible to obtain population data.
So if there is a problem of customer dissatisfaction they will study data from a sample of customers, not all customers. If there is a problem with product quality they will study a sample of the products, not all of them. If a large organization has a problem with staff training they will study the experiences of a sample of their staff rather than all their staff.
Of course, they will want to analyse the sample data in order to draw general conclusions about the population. As long as the samples they use are random samples, in other words they consist of observed values chosen at random from the population, it is quite possible to do this.
The use of sample data in drawing conclusions, or making deductions, about populations is known as statistical inference, from the word infer which means to deduce or conclude. Statistical inference that involves testing claims about population parameters is known as statistical decision-making because it can be used to help organizations and individuals take decisions.
In the last section we looked at sampling distributions. These distributions are the theoretical foundations on which statistical inference is built because they connect the behaviour of sample results to the distribution of the population the samples came from. Now we will consider the procedures involved in statistical inference.
There are two types of statistical inference technique that you will encounter in this chapter. The one we shall look at in this section is estimation, the use of sample data to predict population measures like means and proportions. The other is hypothesis testing, using sample data to verify or refute claims made about population measures, the subject of Section 16.3.
Collecting sample data can be time-consuming and expensive so in practice organizations don’t like to gather more data than they need, but on the other hand they don’t want to end up with too little in case they haven’t enough for the sort of conclusive results they want. You will find a discussion of this aspect of planning statistical research in this section.
16.2.1 Statistical estimation
Statistical estimation is the use of sample measures such as means or proportions to estimate the values of their population counterparts. The easiest way of doing this is to simply take the sample measure and use it as it stands as a prediction of the population equivalent. So, we could take the mean of a sample and use it as our estimate of the population mean. This type of prediction is called point estimation. It is used to get a ‘feel’ for the population value and is a perfectly valid use of the sample result.
The main shortcoming of point estimation is given away by its name; it is a single point, a single shot at estimating one number using another. It is a crude way of estimating a population measure because not only is it uncertain whether it will be a good estimate, in other words close to the measure we want it to estimate, but we have no idea of the probability that it is a good estimate.
The best way of using sample information to predict population measures is to use what is known as interval estimation, which involves constructing a range or interval as the estimate. The aim is to be able to say how likely it is that the interval we construct is accurate, in other words, how confident we are that the interval does include within it the population measure. Because the probability that the interval includes the population measure, or the confidence we should have in the interval estimate, is an important issue, interval estimates are often called confidence intervals.
Before we look at how interval estimates are constructed and why they work, it will be helpful if we reiterate some key points about sampling distributions. For convenience we will concentrate on sample means for the time being.
■ A sampling distribution of sample means shows how all the means of the different samples of a particular size, n, are distributed.
■ Sampling distributions that describe the behaviour of means of samples of 30 or more are always approximately normal.
■ The mean of the sampling distribution of sample means is the population mean, μ.
■ The standard deviation of the sampling distribution of sample means, called the standard error, is the population standard deviation divided by the square root of the sample size, σ/√n.
The sampling distributions that are normal in shape, the ones that show how sample means of big samples vary, have the features of the normal distribution. One of these features is that if we take a point two standard deviations to the left of the mean and another point two standard deviations to the right of the mean, the area between the two points is roughly 95% of the distribution.
To be more precise, if these points were 1.96 standard deviations below and above the mean of the distribution the area would be exactly 95% of the distribution. In other words, 95% of the observations in the distribution are within 1.96 standard deviations from the mean.
This is also true for normal sampling distributions; 95% of the sample means in a sampling distribution that is normal will be between 1.96 standard errors below and 1.96 standard errors above the mean. You can see this illustrated in Figure 16.2.
The limits of this range or interval are:
μ − 1.96 σ/√n on the left-hand side
and μ + 1.96 σ/√n on the right-hand side.
The greatest difference between any of the middle 95% of sample means and the population mean, μ, is 1.96 standard errors, 1.96 σ/√n. The probability that any one sample mean is within 1.96 standard errors of the mean is:
P(μ − 1.96 σ/√n < X̄ < μ + 1.96 σ/√n) = 0.95
The sampling distribution allows us to predict values of sample means using the population mean. But in practice we wouldn’t be interested in doing this because we don’t know the population mean. Indeed, typically the population mean is the thing we want to find out about using a sample mean rather than the other way round. What makes sampling distributions so important is that we can use them to do this.
As we have seen, adding and subtracting 1.96 standard errors to and from the population mean creates an interval that contains 95% of the sample means in the distribution. But what if, instead of adding this amount to and subtracting it from the population mean, we add it to and subtract it from every sample mean in the distribution?
We would create an interval around every sample mean. In 95% of cases, the intervals based on the 95% of sample means closest to the population mean in the middle of the distribution, the interval would contain the population mean itself. In the other 5% of cases, those means furthest away from the population mean, the interval would not contain the population mean.
So, suppose we take the mean of a large sample and create a range around it by adding 1.96 standard errors to get an upper figure, and subtracting 1.96 standard errors to get a lower figure. There is a 95% chance that the range between the upper and lower figures will encompass the mean of the population. Such a range is called a 95% interval estimate or a 95% confidence interval because it is an interval that we are 95% confident, or certain, contains the population mean.
Example 16.6
The total bill sizes of shoppers at a supermarket have a mean of £50 and a standard deviation of £12.75. A group of researchers, who do not know that the population mean bill size is £50, finds the bill size of a random sample of 100 shoppers.
The sampling distribution that the mean of their sample belongs to is shown in Figure 16.3. The standard error of this distribution is 12.75/√100 = 1.275.
Ninety-five per cent of the sample means in this distribution will be between 1.96 standard errors below the mean, which is:
50 − (1.96 * 1.275) = 47.50
and 1.96 standard errors above the mean, which is:
50 + (1.96 * 1.275) = 52.50
This is shown in Figure 16.4.
Suppose the researchers calculate the mean of their sample and it is £49.25, a figure inside the interval 47.50 to 52.50 that contains the 95% of sample means within 1.96 standard errors of the population mean. If they add and subtract the same 1.96 standard errors to and from their sample mean:
49.25 ± (1.96 * 1.275) = 49.25 ± 2.499 = £46.751 to £51.749
The interval they create does contain the population mean, £50.
Notice the symbol ‘±’ in the expression we have used. It represents the carrying out of two operations: both adding and subtracting the amount after it. The addition produces the higher figure, in this case 51.749, and the subtraction produces the lower figure, 46.751.
Imagine they take another random sample of 100 shoppers and find that the mean expenditure of this second sample is a little higher, but still within the central 95% of the sampling distribution, say £51.87. If they add and subtract 1.96 standard errors to and from this second mean:
51.87 ± (1.96 * 1.275) = 51.87 ± 2.499 = £49.371 to £54.369
This interval also includes the population mean.
If the researchers in Example 16.6 took many samples and created an interval based on each one by adding and subtracting 1.96 standard errors they would find that only occasionally would the interval not include the population mean.
Example 16.7
The researchers in Example 16.6 take a random sample that yields a mean of £47.13. Calculate a 95% confidence interval using this sample mean.
47.13 ± (1.96 * 1.275) = 47.13 ± 2.499 = £44.631 to £49.629
This interval does not include the population mean of £50.
How often will the researchers in Example 16.6 produce an interval that does not include the population mean? The answer is every time they have a sample mean that is among the lowest 2½% or the highest 2½% of sample means. If the sample mean is among the lowest 2½% the interval they produce will be too low, as in Example 16.7. If the sample mean is among the highest 2½% the interval will be too high.
As long as the sample mean is among the 95% of the distribution between the lowest 2½% and the highest 2½%, the interval they produce will include the population mean, in other words it will be an accurate estimate of the population mean.
Of course, usually when we carry out this sort of research we don’t actually know what the population mean is, so we don’t know whether the sample mean we have is among the 95% that will give us accurate interval estimates or whether it is among the 5% that will give us inaccurate interval estimates. The important point is that if we have a sample mean and we create an interval in this way there is a 95% chance that the interval will be accurate. To put it another way, on average 19 out of every 20 samples will produce an accurate estimate, and 1 out of 20 will not. That is why the interval is called a 95% interval estimate or a 95% confidence interval.
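You can check the ‘19 out of every 20’ claim by simulation. The sketch below repeatedly draws samples of 100 bill sizes and counts how often the interval built around the sample mean contains the population mean. It uses the mean and standard deviation from Example 16.6, but it assumes the bill sizes are normally distributed, which the example does not state; the function name coverage, the 4000 repetitions and the random seed are also illustrative choices.

```python
import random
from math import sqrt
from statistics import fmean


def coverage(mu: float, sigma: float, n: int, trials: int,
             z: float = 1.96, seed: int = 42) -> float:
    """Proportion of simulated samples whose interval
    (sample mean +/- z standard errors) contains the population mean."""
    rng = random.Random(seed)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Assumption: individual values are drawn from a normal population.
        sample_mean = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if sample_mean - margin <= mu <= sample_mean + margin:
            hits += 1
    return hits / trials


proportion = coverage(mu=50.0, sigma=12.75, n=100, trials=4000)
print(f"Intervals containing the population mean: {proportion:.1%}")
```

The proportion will not be exactly 95% in any one simulation run, but it should be close to it.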
We can express the procedure for finding an interval estimate for a population measure as taking a sample result and adding and subtracting an error. This error reflects the uncertainties involved in using sample information to predict population measures.
Population measure estimate = sample result ± error
The error is made up of two parts, the standard error and the number of standard errors. The number of standard errors depends on how confident we want to be in our estimation.
Suppose you want to estimate the population mean. If you know the population standard deviation, σ, and you want to be (100 − α)% confident that your interval is accurate, then you can obtain your estimate of μ using:
x̄ ± (zα/2 * σ/√n)
The letter ‘z’ appears because we are dealing with sampling distributions that are normal, so we can use the Standard Normal Distribution, the z distribution, to model them. You have to choose which z value to use on the basis of how sure you want or need to be that your estimate is accurate.
If you want to be 95% confident in your estimate, that is (100 − α)% = 95%, then α is 5% and α/2 is 2½% or 0.025. To produce your estimate you would use z0.025, 1.96, the z value that cuts off a 2½% tail in the Standard Normal Distribution. This means that a point 1.96 standard errors away from the mean of a normal sampling distribution, the population mean, will cut off a tail area of 2½% of the distribution. So:
95% interval estimate of μ = x̄ ± (1.96 * σ/√n)
This is the procedure we used in Example 16.6.
The most commonly used level of confidence interval is probably 95%, but what if you wanted to construct an interval based on a higher level of confidence, say 99%? A 99% level of confidence means we want 99% of the sample means in the sampling distribution to provide accurate interval estimates.
To obtain a 99% confidence interval the only adjustment we make is the z value that we use. If (100 − α)% = 99%, then α is 1% and α/2 is ½% or 0.005. To produce your estimate use z0.005, 2.576, the z value that cuts off a ½% tail in the Standard Normal Distribution:
99% interval estimate of μ = x̄ ± (2.576 * σ/√n)
The most commonly used confidence levels and the z values you need to construct them are given in Table 16.1.
Notice that the confidence interval in Example 16.8 includes the population mean, £50, unlike the 95% interval estimate produced in Example 16.7 using the same sample mean, £47.13. This is because this sample mean, £47.13, is not amongst the 95% closest to the population mean, but it is amongst the 99% closest to the population mean. Changing the level of confidence to 99% has meant the interval is accurate, but it is also wider. The 95% interval estimate was £44.631 to £49.629, a width of £4.998. The 99% interval estimate is £43.846 to £50.414, a width of £6.568.
You can obtain the z values necessary for other levels of confidence by looking for the appropriate values of $\alpha/2$ in the body of Table 5 on pages 621–622 in Appendix 1 and finding the z values associated with them.
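If you do not have the tables to hand, the same z values can be approximated in software. The sketch below is only an illustration: it assumes Python's standard `statistics.NormalDist` class, and the function name `z_for_confidence` is our own choice, not something from this chapter.

```python
from statistics import NormalDist


def z_for_confidence(confidence):
    """Return the z value that cuts off a tail of alpha/2 in the
    Standard Normal Distribution, where confidence = 1 - alpha
    (for example 0.95 for a 95% confidence level)."""
    alpha = 1.0 - confidence
    # inv_cdf gives the point below which the stated proportion
    # of the distribution lies, so we ask for 1 - alpha/2.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


if __name__ == "__main__":
    for level in (0.90, 0.95, 0.98, 0.99):
        print(f"{level:.0%} confidence: z = {z_for_confidence(level):.3f}")
```

The values agree with the 1.645, 1.960 and 2.576 quoted in this section once they are rounded to three decimal places.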
From Table 16.1 the z value that cuts off a 0.005 tail area is 2.576, so the 99% confidence interval is:
47.13 ± (2.576 * 1.275) = 47.13 ± 3.284 = £43.846 to £50.414
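The arithmetic in Examples 16.7 and 16.8 can be checked with a short routine. This is a minimal sketch, assuming the standard error (1.275) is already known; `z_interval` is a hypothetical helper name, and the z value is taken from Python's `statistics.NormalDist` rather than from Table 16.1, so results may differ from the printed figures in the last decimal place.

```python
from statistics import NormalDist


def z_interval(sample_mean, standard_error, confidence):
    """Interval estimate of a population mean when the standard
    error of the sampling distribution is known.

    Returns the lower and upper limits: sample mean -/+ z * standard error.
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    error = z * standard_error
    return sample_mean - error, sample_mean + error


if __name__ == "__main__":
    for level in (0.95, 0.99):
        low, high = z_interval(47.13, 1.275, level)
        print(f"{level:.0%} interval: {low:.3f} to {high:.3f}")
```

Raising the confidence level changes only the z value, which is why the 99% interval is wider than the 95% interval built around the same sample mean.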
Example 16.9
Use the sample result in Example 16.7, £47.13, to produce a 98% confidence interval for the population mean.
At this point you may find it useful to try Review Questions 16.10 and 16.11 at the end of the chapter.
16.2.2 Determining the sample size for estimating a population mean
All other things being equal, if we want to be more confident that our interval is accurate we have to accept that the interval will be wider, in other words less precise. If we want to be more confident and retain the same degree of precision, the only thing we can do is to take a larger sample.
In the examples we have looked at so far the size of the sample was already decided. But what if, before starting a sample investigation, you wanted to ensure that you had a big enough sample to enable you to produce a precise enough estimate with a certain level of confidence?
To see how, we need to start with the expression we have used for the error of a confidence interval:
$\text{error} = z_{\alpha/2} \times \sigma/\sqrt{n}$
Until now we have assumed that we know these three elements so we can work out the error. But what if we wanted to set the error and find the necessary sample size, n? We can change the expression for the error around so that it provides a definition of the sample size:
$n = (z_{\alpha/2} \times \sigma/\text{error})^2$
This means that as long as you know the degree of precision you need (the error), the level of confidence (to find $z_{\alpha/2}$), and the population standard deviation ($\sigma$), you can find out what sample size you need.
At this point you may find it useful to try Review Questions 16.12 and 16.13 at the end of the chapter.
Practical interval estimation is based on sample results alone, but it is very similar to the procedure we explored in Example 16.6. The main difference is that we have to use a sample standard deviation, s, to produce an estimate for the standard error of the sampling distribution the sample belongs to. Apart from this, as long as the sample we have is quite large, which we can define as containing 30 or more observations, we can follow exactly the same procedure as before.
That is, instead of
estimate of $\mu = \bar{x} \pm (z_{\alpha/2} \times \sigma/\sqrt{n})$
we use
estimate of $\mu = \bar{x} \pm (z_{\alpha/2} \times s/\sqrt{n})$.
Example 16.10
If the researchers in Example 16.6 want to construct 99% confidence intervals that are £5 wide, what sample size should they use?
If the estimates are to be £5 wide that means they will have to be produced by adding and subtracting £2.50 to and from the sample mean. In other words the error will be 2.50. If the level of confidence is to be 99% then the error will be 2.576 standard errors.
We know that the population standard deviation, $\sigma$, is 12.75, so:
$n = (2.576 \times 12.75/2.50)^2$
$n = (13.1376)^2 = 172.6$ to one decimal place
Since the size of a sample must be a whole number we should round this up to 173. When you are working out the necessary sample size you must round the calculated sample size up to the next whole number to achieve the specified confidence level. Here we would round up to 173 even if the result of the calculation was 172.01.
We should conclude that if the researchers want 99% interval estimates that are £5 wide they would have to take a sample of 173 shoppers.
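The sample size calculation in Example 16.10, including the rule that you always round up, can be sketched as follows. The function name `sample_size_for_mean` is hypothetical, and the z value comes from Python's `statistics.NormalDist` instead of Table 16.1, so the unrounded figure differs very slightly from the hand calculation.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, error, confidence):
    """Smallest whole-number sample size that gives an interval
    estimate of a population mean with the stated error (half the
    interval width) at the stated level of confidence.

    sigma is the known population standard deviation.
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n = (z * sigma / error) ** 2
    # Always round up: rounding down would give a slightly wider
    # interval than the one asked for.
    return math.ceil(n)


if __name__ == "__main__":
    # Figures from Example 16.10: sigma = 12.75, error = 2.50, 99% confidence.
    print(sample_size_for_mean(12.75, 2.50, 0.99))
```

Using `math.ceil` rather than ordinary rounding is what guarantees the round-up, whatever the decimal part of the calculated value.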
In Example 16.11 we are not told whether the population that the sample comes from is normal or not. This doesn’t matter because the sample size is over 30. In fact, given that airlines tend to restrict cabin baggage to 5 kg per passenger the population distribution in this case would probably be skewed.
16.2.4 Estimating with small samples
If we want to produce an interval estimate based on a smaller sample, one with fewer than 30 observations in it, we have to be much more careful. First, for the procedures we will consider in this section to be valid, the population that the sample comes from must be normal. Second, because the sample standard deviation of a small sample is not a reliable enough estimate of the population standard deviation to enable us to use the z distribution, we must use the t distribution to find how many standard errors are to be added and subtracted to produce an interval estimate with a given level of confidence.
Instead of
estimate of $\mu = \bar{x} \pm (z_{\alpha/2} \times \sigma/\sqrt{n})$
we use
estimate of $\mu = \bar{x} \pm (t_{\alpha/2,\nu} \times s/\sqrt{n})$.
The form of the t distribution you use depends on $\nu$, the number of degrees of freedom, which is the sample size less one ($n - 1$). You can find the values you need to produce interval estimates in Table 6 on page 623 of Appendix 1.
Example 16.11
The mean weight of the cabin baggage checked in by a random sample of 40 passengers at an international airport departure terminal was 3.47 kg. The sample standard deviation was 0.82 kg. Construct a 90% confidence interval for the mean weight of cabin baggage checked in by passengers at the terminal.
In this case $\alpha$ is 10%, so $\alpha/2$ is 5% or 0.05 and according to Table 16.1 $z_{0.05}$ is 1.645.
90% interval estimate of $\mu = \bar{x} \pm (1.645 \times s/\sqrt{n})$
$= 3.47 \pm (1.645 \times 0.82/\sqrt{40}) = 3.47 \pm 0.213$ = 3.257 kg to 3.683 kg
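The large-sample procedure used in Example 16.11 can be written as a short routine. This is a sketch under stated assumptions: `large_sample_interval` is a hypothetical name, the z value comes from Python's `statistics.NormalDist`, and the check on the sample size simply applies this chapter's working definition of a large sample (30 or more observations).

```python
import math
from statistics import NormalDist


def large_sample_interval(sample_mean, sample_sd, n, confidence):
    """Interval estimate of a population mean from a large sample,
    using the sample standard deviation s in place of sigma."""
    if n < 30:
        raise ValueError("use the t distribution for samples of fewer than 30")
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    # Estimated standard error of the sampling distribution: s / sqrt(n).
    error = z * sample_sd / math.sqrt(n)
    return sample_mean - error, sample_mean + error


if __name__ == "__main__":
    # Figures from Example 16.11: mean 3.47 kg, s = 0.82 kg, n = 40, 90% confidence.
    low, high = large_sample_interval(3.47, 0.82, 40, 0.90)
    print(f"{low:.3f} kg to {high:.3f} kg")
```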
You may recall from section 16.1.2 that the t distribution is a modified form of the z distribution. If you compare the figures in the bottom row of the 0.05, 0.025 and 0.005 columns of Table 6 with the z values in Table 16.1, that is 1.645, 1.960 and 2.576, you can see that they are the same. If, however, you compare these z values with the equivalent t values in the top row of Table 6, the ones for the t distribution with just one degree of freedom, which we would have to use for samples of only 2, you can see that the differences are substantial.
At this point you may find it useful to try Review Questions 16.14 and 16.15 at the end of the chapter.
16.2.5 Estimating population proportions
Although so far we have concentrated on how to estimate population means, these are not the only population parameters that can be estimated. You will also come across estimates of population proportions, indeed almost certainly you already have.
If you have seen an opinion poll of voting intentions, you have seen an estimate of a population proportion. To produce the opinion poll result that you read in a newspaper pollsters have interviewed a sample of people and used the sample results to predict the voting intentions of the entire population.
Here $\alpha$ is 5% so $\alpha/2$ is 2.5% or 0.025 and the number of degrees of freedom, $\nu$, is $n - 1$, 14.
95% estimate of $\mu = \bar{x} \pm (t_{0.025,14} \times s/\sqrt{n})$
From Table 6, $t_{0.025,14}$ is 2.145, so:
95% estimate of $\mu = 56.3 \pm (2.145 \times 7.1/\sqrt{15})$
= 56.3 ± 3.932 = 52.368% to 60.232%
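The small-sample calculation above can be sketched in the same way. Python's standard library has no t distribution, so in this illustration the t value is looked up in Table 6 and passed in by hand; `t_interval` is a hypothetical helper name.

```python
import math


def t_interval(sample_mean, sample_sd, n, t_value):
    """Interval estimate of a population mean from a small sample.

    t_value is the figure from the t table for alpha/2 and
    n - 1 degrees of freedom; it is supplied by the caller.
    """
    # Estimated standard error of the sampling distribution: s / sqrt(n).
    error = t_value * sample_sd / math.sqrt(n)
    return sample_mean - error, sample_mean + error


if __name__ == "__main__":
    # Figures used above: mean 56.3, s = 7.1, n = 15, t for 0.025 and 14 df = 2.145.
    low, high = t_interval(56.3, 7.1, 15, 2.145)
    print(f"{low:.3f}% to {high:.3f}%")
```

Remember that the result is only valid if the population the sample comes from is normal; the routine cannot check that for you.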
In many ways estimating a population proportion is very similar to the estimation we have already considered. To start with you need a sample: you calculate a sample result around which your estimate will be constructed; you add and subtract an error based on the standard error of the relevant sampling distribution, and how confident you want to be that your estimate is accurate.
We have to adjust our approach because of the different nature of the data. When we estimate proportions we are usually dealing with qualitative variables. The values of these variables are characteristics, for instance people voting for party A or party B. If there are only two possible characteristics, or we decide to use only two categories in our analysis, the variable will have a binomial distribution.
As you will see, this is convenient as it means we only have to deal with one sample result, the sample proportion, but it also means that we cannot produce reliable estimates from small samples, those consisting of fewer than 30 observations. This is because the distribution of the population that the sample comes from must be normal if we are to use the t distribution, the device we have previously used to overcome the extra uncertainty involved in small sample estimation.
The sampling distribution of sample proportions is approximately normal in shape if the samples involved are large, that is, more than 30, as long as the sample proportion is not too small or too large. In practice, because we do not know the sample proportion before taking the sample it is best to use a sample size of over 100. If the samples are small, the sampling distribution of sample proportions is not normal.
Provided that you have a large sample, you can construct an interval estimate for the population proportion, $\pi$ (pi, the Greek letter p), by taking the sample proportion, p, and adding and subtracting an error. The sample proportion is the number of items in the sample that possess the characteristic of interest, x, divided by the total number of items in the sample, n.
Sample proportion, $p = x/n$
The error that you add to and subtract from the sample proportion is the z value appropriate for the level of confidence you want to use multiplied by the estimated standard error of the sampling distribution of the sample proportion. This estimated standard error is based on the sample proportion:
estimated standard error $= \sqrt{p(1 - p)/n}$
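To see how the pieces fit together, here is a minimal sketch of an interval estimate for a population proportion. The function name `proportion_interval` and the figures in the usage lines (45 items out of a sample of 150) are invented for illustration and do not come from any example in this chapter; the z value is taken from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def proportion_interval(x, n, confidence):
    """Interval estimate of a population proportion from a large sample.

    x is the number of items in the sample with the characteristic
    of interest and n is the sample size.
    """
    if n <= 30:
        raise ValueError("the method needs a large sample (more than 30)")
    p = x / n                                   # sample proportion
    standard_error = math.sqrt(p * (1 - p) / n)  # estimated standard error
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    error = z * standard_error
    return p - error, p + error


if __name__ == "__main__":
    # Hypothetical data: 45 out of 150 sampled items have the characteristic.
    low, high = proportion_interval(45, 150, 0.95)
    print(f"{low:.1%} to {high:.1%}")
```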
These results suggest that we can be 95% confident that the proportion of supermarkets with suitable trolleys for shoppers with limited mobility will be between 19.6% and 36.4%.
The precision of the test depends on the estimated standard error of the sample proportions, $\sqrt{p(1 - p)/n}$. The value of this depends on p, the sample proportion. Clearly you won’t know this until the sample data have been collected, but you can’t collect the sample data until you have decided what sample size to use. You therefore need to make a prior assumption about the value of the sample proportion.
To be on the safe side we will assume the worst-case scenario, which is that the value of p will be the one that produces the highest value of $p(1 - p)$. The higher the value of $p(1 - p)$ the wider the interval will be, for a given sample size. We need to avoid the situation where $p(1 - p)$ turns out to be larger than we have assumed it is.
What is the largest value of $p(1 - p)$? If you work out $p(1 - p)$ when p is 0.1, you will get 0.09. If p is 0.2, $p(1 - p)$ rises to 0.16. As you increase the value of p you will find that it keeps going up until p is 0.5, when $p(1 - p)$ is 0.25, then it goes down again.
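You can confirm the pattern described above by tabulating $p(1 - p)$ for a range of values of p. The short sketch below does this for p from 0.1 to 0.9 in steps of 0.1; the names used are our own.

```python
def p_times_one_minus_p(p):
    """The quantity p(1 - p) that drives the size of the standard error."""
    return p * (1 - p)


# Tabulate p(1 - p) for p = 0.1, 0.2, ..., 0.9, rounded to two places.
values = {p / 10: round(p_times_one_minus_p(p / 10), 2) for p in range(1, 10)}

# The value of p that gives the largest p(1 - p), i.e. the worst case.
worst_case_p = max(values, key=values.get)

if __name__ == "__main__":
    for p, product in values.items():
        print(f"p = {p:.1f}  p(1 - p) = {product:.2f}")
    print("largest when p =", worst_case_p)
```

The table rises to 0.25 at p = 0.5 and is symmetrical either side of it, which is why 0.5 is the safe assumption when the sample proportion is not yet known.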
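The error in an interval estimate of a population proportion is:
$\text{error} = z_{\alpha/2} \times \sqrt{p(1 - p)/n}$
If p is 0.5, in other words we assume the largest value of $p(1 - p)$:
$\text{error} = z_{\alpha/2} \times \sqrt{0.5(1 - 0.5)/n} = z_{\alpha/2} \times 0.5/\sqrt{n} = z_{\alpha/2}/(2\sqrt{n})$
This last expression can be re-arranged to obtain an expression for n:
$n = (z_{\alpha/2}/(2 \times \text{error}))^2$
For the error to be 5%:
$n = (1.96/(2 \times 0.05))^2 = (19.6)^2 = 384.16$
This has to be rounded up to 385 to meet the confidence requirement so a random sample of 385 supermarkets would have to be used.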
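The sample size of 385 can be reproduced with a few lines of code. This sketch assumes a 95% level of confidence, which is the level that is consistent with the figure of 385 but is our assumption here; `sample_size_for_proportion` is a hypothetical name, and the worst-case value p = 0.5 is built into the formula.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(error, confidence):
    """Smallest whole-number sample size for estimating a population
    proportion to within the stated error, assuming the worst case
    p = 0.5 so that n = (z / (2 * error)) squared."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n = (z / (2 * error)) ** 2
    # Round up so that the confidence requirement is met.
    return math.ceil(n)


if __name__ == "__main__":
    # An error of 5 percentage points at an assumed 95% level of confidence.
    print(sample_size_for_proportion(0.05, 0.95))
```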
At this point you may find it useful to try Review Questions 16.18 and 16.19 at the end of the chapter.
16.3 Statistical inference: hypothesis testing
Usually when we construct interval estimates of population parameters we have no idea of the actual value of the parameter we are trying to estimate. Indeed the purpose of estimation using sample results is to tell us what the actual value is likely to be.
Sometimes we use sample results to deal with a different situation. This is where the population parameter is claimed to take a particular value and we want to see whether the claim is correct. Such a claim is known as a hypothesis, and the use of sample results to investigate whether it is true is called hypothesis testing. To begin with we will concentrate on testing hypotheses about population means using a single sample. Later in this section you will find hypothesis testing of population proportions and a way of testing hypotheses about population medians.
Hypothesis testing begins with a formal statement of the claim being made for the population parameter. This is known as the null hypothesis because it is the starting point in the investigation, and is represented by the symbol $H_0$, ‘aitch-nought’.
We could find that a null hypothesis appears to be wrong, in which case we should reject it in favour of an alternative hypothesis, represented by the symbol $H_1$, ‘aitch-one’. The alternative hypothesis is the collection of explanations that contradict the claim in the null hypothesis.
A null hypothesis may specify a single value for the population measure, in which case we would expect the alternative hypothesis to consist of other values both below and above it. Because of this ‘dual’ nature of the alternative hypothesis, the procedure to investigate such a null hypothesis is known as a two-sided test.
In other cases the null hypothesis might specify a minimum or a maximum value, in which case the alternative hypothesis consists of values below, or values above respectively. The procedure we use in these cases is called a one-sided test. Table 16.2 lists the three types of hypotheses.
After establishing the form of the hypotheses we can test them using the sample evidence we have. We need to decide if the sample evidence is compatible with the null hypothesis, in which case we cannot reject it. If the sample evidence contradicts the null hypothesis we reject it in favour of the alternative hypothesis.
To help us make this decision we need a decision rule to apply to our sample evidence. The decision rule is based on the assumption that the null hypothesis is true.
If the population mean really does take the value specified in the null hypothesis, then as long as we know the value of the population standard deviation, $\sigma$, and the size of our sample, n, we can identify the sampling distribution that our sample belongs to.
Example 16.15
A bus company promotes a ‘one-hour’ tour of a city. Suggest suitable null and alternative hypotheses for an investigation by:
(a) a passenger who wants to know how long the journey will take
(b) a journalist from a consumer magazine who wants to see whether passengers are being cheated
In the first case we might assume that the passenger is as concerned about the tour taking too much time as too little time, so appropriate hypotheses would be that the population mean of the times of the tours is either equal to one hour or it is not.
$H_0$: $\mu = 60$ minutes    $H_1$: $\mu \neq 60$ minutes
In the second case we can assume that the investigation is more focused. The journalist is more concerned that the trips might not take the full hour rather than taking longer than an hour, so appropriate hypotheses would be that the population mean tour time is either one hour or more, or it is less than an hour.
$H_0$: $\mu \geq 60$ minutes    $H_1$: $\mu < 60$ minutes
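Table 16.2
Types of hypotheses
Null hypothesis            Alternative hypothesis                  Type of test
$H_0$: $\mu = \mu_0$       $H_1$: $\mu \neq \mu_0$ (not equal)     Two-sided
$H_0$: $\mu \leq \mu_0$    $H_1$: $\mu > \mu_0$ (greater than)     One-sided
$H_0$: $\mu \geq \mu_0$    $H_1$: $\mu < \mu_0$ (less than)        One-sided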
The next stage is to compare our sample mean to the sampling distribution it is supposed to come from if $H_0$ is true. If it seems to belong to the sampling distribution, in other words it is not too far away from the centre of the distribution, we consider the null hypothesis plausible. If, on the other hand, the sample mean is located on one of the extremes of the sampling distribution we consider the null hypothesis suspect.
We can make this comparison by working out the z-equivalent of the sample mean and using it to find out the probability that a sample mean of the order of the one we have comes from the sampling distribution that is based on the null hypothesis being true. Because we are using a z-equivalent, this type of hypothesis test is called a z test.
Example 16.16
The standard deviation of the bus tours in Example 16.15 is known to be 6 minutes. If the duration of a random sample of 50 tours is to be recorded in order to investigate the operation, what can we deduce about the sampling distribution the mean of the sample belongs to?
The null hypotheses in both sections of Example 16.15 specified a population mean, $\mu$, of 60 minutes. If this is true the mean of the sampling distribution, the distribution of means of all samples consisting of 50 observations, is 60. The population standard deviation, $\sigma$, is 6 so the standard error of the sampling distribution, $\sigma/\sqrt{n}$, is $6/\sqrt{50}$, 0.849 minutes.
We can conclude that the sample mean of our sample will belong to a sampling distribution which is normal with a mean of 60 and a standard error of 0.849, if $H_0$ is true.
Example 16.17
The mean of the random sample in Example 16.16 is 61.87 minutes. What is the z-equivalent of this sample mean, assuming it belongs to a sampling distribution with a mean of 60 and a standard error of 0.849? Use the z-equivalent to find the probability that a sample mean of this magnitude comes from such a sampling distribution.
$z = (\bar{x} - \mu)/(\sigma/\sqrt{n}) = (61.87 - 60)/0.849 = 2.20$ to two decimal places
Using Table 5 on pages 621–622 of Appendix 1:
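As an alternative to the tables, the z-equivalent and the associated tail probability for Example 16.17 can be approximated in software. The sketch below is illustrative only: the function names are our own, and the probability comes from Python's `statistics.NormalDist` rather than from Table 5, so it may differ slightly from a table figure based on z rounded to two decimal places.

```python
import math
from statistics import NormalDist


def z_equivalent(sample_mean, hypothesised_mean, sigma, n):
    """z-equivalent of a sample mean, assuming the null hypothesis is
    true and the population standard deviation sigma is known."""
    standard_error = sigma / math.sqrt(n)
    return (sample_mean - hypothesised_mean) / standard_error


def upper_tail_probability(z):
    """Probability of a z value at least as large as the one given."""
    return 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    # Figures from Examples 16.16 and 16.17: mean 61.87, H0 mean 60, sigma 6, n 50.
    z = z_equivalent(61.87, 60, 6, 50)
    print(f"z = {z:.2f}, upper tail probability = {upper_tail_probability(z):.4f}")
```

A small tail probability indicates that a sample mean this far above 60 would be unusual if the null hypothesis were true.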