Taking short cuts –
sampling methods
15
Chapter objectives
This chapter will help you to:
■ appreciate the reasons for sampling
■ understand sampling bias and how to avoid it
■ employ probabilistic sampling methods and be aware of their limitations
■ use the technology: simple random sampling in MINITAB and SPSS
■ become acquainted with business uses of sampling methods
A population is the entire set of items or people that form the subjects
of study in an investigation and a sample is a subset of a population.
Companies need to know about the populations they deal with: populations of customers, employees, suppliers, products and so on. Typically these populations are very large, so large that they are to all intents and purposes infinite.
Gathering data about such large populations is likely to be very expensive, time-consuming and to a certain extent impractical. The scale of expense can be immense; even governments of large countries only commit resources to survey their entire populations, that is, to conduct a census, about every ten years.
The amount of time involved in surveying the whole population means that it may be so long before the results are available that they are completely out of date. There may be some elements within the population that simply cannot be included in a survey of it; for instance, a car manufacturer may want to conduct a survey of all customers buying a certain model three years before in order to gauge customer satisfaction. Inevitably a number of those customers will have died in the period since buying their car and thus cannot be included in the survey.
To satisfy their need for data about the populations that matter to them without having to incur great expense or wait a long time for results, companies turn to sampling, the process of taking a sample from a population in order to use the sample data to gain insight into the entire population. Although not as accurate as the results of a population survey, sample results can be precise enough to serve the purposes of the investigation.
The downside of sampling is that many different samples can be taken from the same population, even if the samples are the same size. You can work out the number of samples of n items that could be selected from a population of N items:
\[ \text{Number of samples of size } n = \frac{N!}{n!\,(N-n)!} \]
We can use this to work out the number of samples of size 6 that could be selected from a very small population of just 20 items:
\[ \text{Number of samples of size } 6 = \frac{20!}{6!\,14!} = 38{,}760 \]
You can imagine that the number of samples that could be selected from a much larger population will be so very large as to border on the infinite. Each of the samples you could select from a population inevitably excludes much of the population, so sample results will not be precisely the same as those from the entire population. There will be differences known as sampling errors between sample results and the results of a population survey, and furthermore different samples will yield different results and hence different sampling errors.
In this chapter you will find details of a variety of sampling methods, but before we look at them we need to consider what companies might look for in sampling, and what they would prefer to avoid.
15.1 Bias in sampling
The point of selecting a sample from a population is to study it and use the results to understand the population. To be effective a sample should therefore reflect the population as a whole. However, there is no guarantee that the elements of the population that are chosen for the sample will collectively reflect the population. Even if the population is quite small there will be an enormous number of combinations of elements that you could select in a sample. Inevitably some of these samples will represent the entire population better than others.
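The count of 38,760 possible samples of size 6 from a population of 20 can be checked directly. The following is a minimal Python sketch, not part of the original text; the function name `number_of_samples` is an illustrative choice, and the standard-library `math.comb` is used only as a cross-check.

```python
from math import comb, factorial


def number_of_samples(population_size: int, sample_size: int) -> int:
    """Number of distinct samples of sample_size items from population_size items.

    Implements N! / (n! (N - n)!) with integer arithmetic.
    """
    return factorial(population_size) // (
        factorial(sample_size) * factorial(population_size - sample_size)
    )


# The small population used in the text: 20 items, samples of size 6.
print(number_of_samples(20, 6))               # 38760
print(number_of_samples(20, 6) == comb(20, 6))  # True

# For a larger population the count grows very quickly; here we only
# report how many digits the answer has for N = 2000 and n = 400.
print(len(str(number_of_samples(2000, 400))))
```

Even for a population of only 2000 the number of possible samples has hundreds of digits, which is the sense in which the text describes it as bordering on the infinite.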
Although it is impossible to avoid the possibility of getting an unrepresentative sample, it is important to avoid using a sampling method that will almost invariably lead to your getting an unrepresentative sample. This means avoiding bias in selecting your sample.
Effective methods of sampling are those that minimize the chances of getting unrepresentative samples and allow you to anticipate the degree of sampling error using appropriate probability distributions. Such methods should give every element of the population the same chance of being selected in a sample as any other element of the population, and consequently every possible sample of a certain size the same chance of selection as every other sample of the same size.
If some elements of the population have a greater chance of being selected in a sample than others, then we have bias in our sampling method. Bias has to be avoided as the samples that can result will be extremely unlikely to reflect the population as a whole and such misleading results may have disastrous consequences.
Example 15.1
Packaged potato crisps are sold by the million every day of the week; it is a huge market. You might think that the company that pioneered the product would by now be a very large and successful one, but you would be wrong; after their initial success they ran into problems that eventually led to their being taken over. Occasionally the company that now owns the brand re-launches it as a retro product, with the distinctive small blue paper twist of salt in the crisp packet.
A key factor in the decline of the potato crisp pioneers was product quality. The company received a consistent stream of complaints from customers about the number of charred and green-tinged crisps. The company directors knew of these complaints but were baffled by them; they knew their product was good because they tasted a sample taken from the production line every day with their morning coffee.
The problem for the directors was the method used to take the samples from the production line. The samples were selected by the shopfloor staff, who knew they were destined for the boardroom and quite understandably ensured that only the best crisps were selected. The samples provided for the directors were therefore biased; the charred and green crisps that their customers wrote about had no chance of being selected in the samples taken for the directors.
The most effective way of avoiding bias in sample selection is to use probabilistic methods, which ensure that every element in the population has the same chance of being included in the sample. In the next section we will look at sampling methods that yield samples from which you can produce unbiased estimators of population measures, or parameters, such as a mean or a proportion.
In Example 15.1 the company directors were completely misled by the bias in the selection of their samples of potato crisps. Biased samples will mislead, no matter how large the samples are; in fact, the larger such samples are, the greater the danger of misrepresentation since it is always tempting to attach more credibility to a large sample.
The directors were reluctant to take action to deal with a problem they were convinced did not exist. This made it easier for competitors to enter the market and the initial advantage the pioneers enjoyed was lost.
Example 15.2
In the 1936 presidential election in the USA the incumbent Democrat, Franklin Roosevelt, faced the Republican governor of Kansas, Alfred Landon. Roosevelt was associated with the New Deal programme of large-scale public expenditure to alleviate the high level of unemployment in the depression of the time. Landon on the other hand wanted to end what he considered government profligacy.
The prominent US weekly magazine of the time, The Literary Digest, conducted one of the largest polls ever undertaken to predict the result of the election. After analysing the returns from over 2 million respondents, the Digest confidently predicted that Landon would win by a large margin, 56% to 44%. The actual result was that Roosevelt won by a large margin, obtaining 60% of the vote. How could the Digest poll have been so wrong?
The answer lay in the sampling method they used. They sent postcards to millions of people listed in telephone directories, car registration files and magazine subscription lists. The trouble was that in the USA of 1936 those who had telephones and cars and subscribed to magazines were the better-off citizens. By restricting the poll to such people, who largely supported Landon, the Digest biased it against the poor and unemployed, who largely voted for Roosevelt.
Example 15.3
Strani Systems have 2000 employees in the UK. The HR director of the company wants to select a sample of 400 employees to answer questions about their experience of working for the company. How should she go about using simple random sampling?
The population in this case consists of all the Strani employees in the UK. The sampling frame would be a list of employees, perhaps the company payroll, with each employee
15.2 Probabilistic sampling methods
Perhaps the obvious way of giving every element in a population the same chance of being selected in a sample is to use a random process such as those used to select winning numbers in lottery competitions. Lotteries are usually regarded as fair because every number in the population of lottery numbers has an equal chance of being picked as a winning number.
15.2.1 Simple random sampling
Selecting a set of winning numbers in a lottery is an example of simple random sampling, whether the process involves elaborate machines or simply picking the numbers from the proverbial hat. You can use the same approach in drawing samples from a population.
Before you can undertake simple random sampling you need to establish a clear definition of the population and compile a list of the elements in it. In the same way as all the numbers in a lottery must be included if the draw is to be fair, all the items in the population must be included for the sample we take to be random. The population list is the basis or framework of our sample selection so it is known as the sampling frame.
Once you have the sampling frame you need to number each element in it and then you can use random numbers to select your sample. If you have 100 elements in the population and you need a sample of 15 from it you can take a sequence of 15 two-digit random numbers from Table 4 on page 620 in Appendix 1 and select the elements for the sample accordingly; for instance, if the first random number is 71 you take the 71st element on the sampling frame, if the second random number is 09 you take the ninth element, and so on. If the random number 00 occurs in the sequence you take the 100th element.
Simple random sampling has several advantages: it is straightforward and inexpensive. Because the probability of selection is known it is possible to assess the sampling error involved and ensure that estimates of population parameters based on the sample are unbiased.
A potential disadvantage of simple random sampling is that in a case such as Example 15.3 the sample may consist of elements all over the country, which will make data collection expensive. Another is that whilst it is an appropriate method for largely homogeneous populations, if a population is subdivided by, for instance, gender, and gender is an important aspect of the analysis, using simple random sampling will not ensure suitable representation of both genders.
15.2.2 Systematic random sampling
A faster alternative to simple random sampling is systematic sampling. This involves selecting a proportion of elements from the sampling frame by choosing elements at regular intervals through the list. The first element is selected using a random number.
numbered from 1 to 2000. The HR director should then take a sequence of four-digit random numbers such as those listed along row 7 of Table 4:
1426 7156 7651 0042 9537 2573 and so on
She does face a problem in that only two of the random numbers, 1426 and 0042, will enable her to select an employee from the list as the others are well above 2000. To get round this she could simply ignore the ones that are too high and continue until she has 400 random numbers that are in the appropriate range. This may take considerable time and she may prefer to replace the first digit in each number so that in every case it is either 0 or 1, making all the four-digit numbers fall in the range 0000 to 1999 (0000 would be used for the 2000th employee):
Change 0, 2, 4, 6, 8 to 0
Change 1, 3, 5, 7, 9 to 1
By applying this to the figures from row 7 of Table 4 she would get:
1426 1156 1651 0042 1537 0573
Now she can use every number in the sequence to select for the sample.
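The first-digit recoding described for Example 15.3 can be sketched in Python as follows. This is an illustration rather than the book's own procedure: the function names are hypothetical, and the second half swaps the printed random number table for Python's pseudo-random generator, which is an assumption on our part (the chapter itself uses MINITAB and SPSS). `random.sample` draws without repeats, whereas a printed table can repeat a number, in which case the repeat would presumably be skipped.

```python
import random


def recode_first_digit(four_digit: str) -> str:
    """Replace the first digit: 0, 2, 4, 6, 8 become 0 and 1, 3, 5, 7, 9 become 1."""
    first = "0" if int(four_digit[0]) % 2 == 0 else "1"
    return first + four_digit[1:]


def to_employee_number(four_digit: str, frame_size: int = 2000) -> int:
    """Turn a four-digit random number into a position on the sampling frame.

    0000 stands for the last employee on the list (the 2000th).
    """
    value = int(recode_first_digit(four_digit))
    return frame_size if value == 0 else value


# The six numbers quoted from row 7 of Table 4.
row_7 = ["1426", "7156", "7651", "0042", "9537", "2573"]
print([recode_first_digit(n) for n in row_7])
# ['1426', '1156', '1651', '0042', '1537', '0573']
print([to_employee_number(n) for n in row_7])
# [1426, 1156, 1651, 42, 1537, 573]

# Alternative (assumption): let a pseudo-random generator pick 400 distinct
# employee numbers from 1 to 2000. The seed is arbitrary and only makes the
# draw repeatable.
rng = random.Random(15)
sample = rng.sample(range(1, 2001), k=400)
print(len(sample), min(sample) >= 1, max(sample) <= 2000)
```

Because half of the ten possible first digits map to 0 and half to 1, the recoding keeps every employee number equally likely, which is what simple random sampling requires.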
Example 15.4
How can the HR director in Example 15.3 use systematic sampling to select her sample
of 400 employees?
As well as being cheap and simple, systematic sampling does yield samples with a definable sampling error and is therefore able to produce unbiased estimates. This is true as long as the population list used to select the sample is not drawn up in such a way as to give rise to bias. In Example 15.4 a list of employees in alphabetical order should not result in bias, but if most employees worked in teams of five, one of whom was the team leader, and the list of employees was set out by teams rather than surnames, then the systematic sampling of every fifth employee would generate a sample with either all or none of the employees selected being team leaders.
Systematic sampling has the same disadvantages as simple random sampling: expensive data collection if the sample members are widely dispersed, and the possibility of sub-sections of the population being under-represented.
15.2.3 Stratified random sampling
One problem with both sampling methods we have looked at so far is that the samples they produce may not adequately reflect the balance of different constituencies within the population. In the long run this unevenness will be balanced out by other samples selected using the same methods, but this is little comfort if you only have the time or resources to take one sample.
To avoid sections of a population being under-represented or, for that matter, over-represented, you can use stratified random sampling. As the name implies, the sample selection is random, but it is structured using the sections or strata in the population. The starting point is to define the size of the sample and then decide what proportion of each section of the population needs to be selected for the sample. Once you have decided how many elements you need from each section, then use simple random sampling to choose them. This ensures that all the sections of the population are represented in the sample yet preserves the random
Since there are 2000 employees she needs to select every fifth employee in the list that constitutes the sampling frame. To decide whether she should start with the first, second, third, fourth or fifth employee on the list she could take a two-digit random number and if it is between 00 and 19 start with the first employee, between 20 and 39 the second, between 40 and 59 the third, between 60 and 79 the fourth, and between 80 and 99 the fifth. The first two-digit number at the top of column 9 of Table 4 is 47, so using this she should start with the third employee and proceed to take every fifth name after that.
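The systematic selection of every fifth employee from a random start can be sketched as below. This is a minimal illustration with hypothetical function names; it assumes the frame size is an exact multiple of the sample size and that the sampling interval divides 100 exactly, as both do in this example (2000 / 400 = 5, and bands of 20 two-digit numbers each).

```python
def start_position(two_digit: int, interval: int) -> int:
    """Map a two-digit random number (0 to 99) to a start position 1..interval.

    With interval 5 the bands are 00-19, 20-39, 40-59, 60-79 and 80-99.
    Assumes interval divides 100 exactly.
    """
    band_width = 100 // interval
    return two_digit // band_width + 1


def systematic_sample(frame_size: int, sample_size: int, two_digit: int) -> list[int]:
    """Positions on the sampling frame chosen at a regular interval."""
    interval = frame_size // sample_size  # 2000 // 400 = 5
    start = start_position(two_digit, interval)
    return list(range(start, frame_size + 1, interval))


# Random number 47 falls in the band 40-59, so the start is the third employee.
chosen = systematic_sample(frame_size=2000, sample_size=400, two_digit=47)
print(chosen[:5])   # [3, 8, 13, 18, 23]
print(chosen[-1])   # 1998
print(len(chosen))  # 400
```

Only the starting point is random here; once it is fixed, every other member of the sample is determined by the order of the list, which is why the ordering of the sampling frame matters so much for this method.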
The advantage of stratified random sampling is that it produces samples that yield unbiased estimators of population parameters whilst ensuring that the different sectors of the population are represented. The disadvantage in a case like Example 15.5 is that the sample consists of widely dispersed members and collecting data from them may be expensive, especially if face-to-face interviews are involved.
15.2.4 Cluster sampling
If the investigation for which you require a sample is based on a population that is widely scattered you may prefer to use cluster sampling. This method is appropriate if the population you wish to sample is composed of geographically distinct units or clusters. You simply take a complete list of the clusters that make up your population and take a random sample of clusters from it. The elements in your sample are all the individuals in each selected cluster.
Example 15.5
The 2000 UK employees of Strani Systems are based at six locations; 400 work in Leeds, 800 in Manchester, 200 in Norwich, 300 in Oxford, 100 in Plymouth, and 200 in Reading. How can the HR director in Example 15.3 use stratified random sampling to choose her sample of 400 employees?
A sample of 400 constitutes 20% of the workforce of 2000. To stratify the sample in the same way as the population she should select 20% of the employees from each site; 80 from Leeds, 160 from Manchester, 40 from Norwich, 60 from Oxford, 20 from Plymouth and 40 from Reading. She should then use simple random sampling to choose the sample members from each site. For this she would need a sampling frame for each location.
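The proportional allocation in Example 15.5, followed by a simple random draw within each site, might be sketched like this. The function names are hypothetical, and the use of Python's pseudo-random generator with an arbitrary seed is an assumption made for illustration. The integer division is exact here because every site's share of 400 is a whole number; with other figures some rounding rule would be needed.

```python
import random

# Site sizes as given in Example 15.5.
site_sizes = {
    "Leeds": 400,
    "Manchester": 800,
    "Norwich": 200,
    "Oxford": 300,
    "Plymouth": 100,
    "Reading": 200,
}


def proportional_allocation(strata: dict[str, int], sample_size: int) -> dict[str, int]:
    """Share the sample between strata in proportion to their sizes."""
    total = sum(strata.values())
    return {name: size * sample_size // total for name, size in strata.items()}


def stratified_sample(strata: dict[str, int], sample_size: int,
                      rng: random.Random) -> dict[str, list[int]]:
    """Simple random sample within each stratum, using a frame numbered from 1."""
    quotas = proportional_allocation(strata, sample_size)
    return {
        name: sorted(rng.sample(range(1, strata[name] + 1), k=quota))
        for name, quota in quotas.items()
    }


quotas = proportional_allocation(site_sizes, 400)
print(quotas)
# {'Leeds': 80, 'Manchester': 160, 'Norwich': 40, 'Oxford': 60, 'Plymouth': 20, 'Reading': 40}

sample = stratified_sample(site_sizes, 400, random.Random(15))  # arbitrary seed
print(sum(len(members) for members in sample.values()))  # 400
```

Each site has its own sampling frame in this sketch, numbered from 1 up to the number of employees at that site, which mirrors the requirement in the example for a separate frame per location.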
Example 15.6
How can the HR director from Example 15.3 use cluster sampling to select a sample of employees?
She can make a random selection of two or maybe three locations by simply putting the names of the locations in a hat and drawing two out. All the employees at these locations constitute her sample.
nature of the selection and thus your ability to produce unbiased estimators of the population parameters from your sample data.
The advantages of cluster sampling are that it is cheap, especially if the investigation involves face-to-face interviews, because the number of locations to visit is small and you only need sampling frames for the selected clusters rather than the entire population.
The disadvantages are that you may well end up with a larger sample than you need and there is a risk that some sections of the population may be under-represented. If Leeds and Manchester were the chosen clusters in Example 15.5, the sample size would be 1200 (the 400 employees at Leeds and the 800 at Manchester), a far larger sample than the HR director requires. If the overall gender balance of the company employees in Example 15.5 is 40% male and 60% female, yet this balance was 90% male and 10% female at the Norwich and Reading sites, there would be a serious imbalance in the sample if it consisted of employees at those two sites.
15.2.5 Multi-stage sampling
Multi-stage sampling is a generic term for any combination of probabilistic sampling methods. It can be particularly useful for selecting samples from populations that are divided or layered in more than one way.
A rather more sophisticated approach would be to make the probability that a location is selected proportionate to its size by putting one ticket in the hat for every 100 employees at a location – four tickets for Leeds, eight for Manchester and so on.
As an alternative to drawing tickets from a hat, she could follow the approach we used in section 12.4 of Chapter 12 to simulate business processes and employ random numbers to make the selections in accordance with the following allocations:
Location      Random number allocation
Leeds         00–19
Manchester    20–59
Norwich       60–69
Oxford        70–84
Plymouth      85–89
Reading       90–99
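Selecting clusters with probability proportionate to size from two-digit random numbers can be sketched as follows. The names are hypothetical, and two points are assumptions rather than statements from the text: the Leeds range of 00 to 19 is inferred from its four tickets out of twenty, and a random number that points to a location already chosen is simply skipped.

```python
# Random number allocations: one "ticket" per 100 employees, so each ticket
# is worth five of the hundred two-digit numbers. The Leeds range is inferred.
ALLOCATION = [
    ("Leeds", 0, 19),
    ("Manchester", 20, 59),
    ("Norwich", 60, 69),
    ("Oxford", 70, 84),
    ("Plymouth", 85, 89),
    ("Reading", 90, 99),
]


def location_for(two_digit: int) -> str:
    """Look up the location whose range contains the two-digit random number."""
    for name, low, high in ALLOCATION:
        if low <= two_digit <= high:
            return name
    raise ValueError(f"{two_digit} is not a two-digit random number")


def choose_clusters(random_numbers: list[int], how_many: int = 2) -> list[str]:
    """Work through the random numbers until enough distinct locations are chosen."""
    chosen: list[str] = []
    for number in random_numbers:
        name = location_for(number)
        if name not in chosen:  # assumption: repeats are skipped
            chosen.append(name)
        if len(chosen) == how_many:
            break
    return chosen


# Illustrative random numbers, not taken from the book's table.
print(choose_clusters([47, 52, 91]))  # ['Manchester', 'Reading']
```

The width of each range is proportional to the number of employees at the location, so Manchester, with 40 of the 100 numbers, is eight times as likely as Plymouth to be picked on any one draw.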
Example 15.6
The HR director from Example 15.3 likes the idea of cluster sampling as it will result in cost savings for her investigation, but she wants to avoid having a sample of more than 400 employees. How can she use multi-stage sampling to achieve this?
She can use cluster sampling to select her locations and then, rather than contact all the employees at each site, she could use stratified sampling to ensure that the sample size is 400. For instance, if Leeds and Manchester were selected, the 1200 employees at those sites constitute three times as many as the HR director requires in her sample, so she should select one-third of the employees at each site; 133 at Leeds and 267 at Manchester. She could use either systematic or simple random sampling to choose the sample members.
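The second stage of this multi-stage approach, sharing the target of 400 between the selected sites and then sampling within each, might look like the sketch below. The function names and the arbitrary seed are hypothetical. Rounding each site's share to the nearest whole number happens to give exactly 400 here; that is not guaranteed for other figures, so the total should be checked.

```python
import random


def site_quotas(selected_sites: dict[str, int], target: int) -> dict[str, int]:
    """Share the target sample between the selected sites in proportion to size."""
    total = sum(selected_sites.values())
    return {name: round(target * size / total) for name, size in selected_sites.items()}


def second_stage_sample(selected_sites: dict[str, int], target: int,
                        rng: random.Random) -> dict[str, list[int]]:
    """Simple random sample of employees within each selected site."""
    quotas = site_quotas(selected_sites, target)
    return {
        name: sorted(rng.sample(range(1, selected_sites[name] + 1), k=quota))
        for name, quota in quotas.items()
    }


# First stage (taken from the example): Leeds and Manchester were selected.
selected = {"Leeds": 400, "Manchester": 800}

quotas = site_quotas(selected, target=400)
print(quotas)                # {'Leeds': 133, 'Manchester': 267}
print(sum(quotas.values()))  # 400

sample = second_stage_sample(selected, 400, random.Random(15))  # arbitrary seed
print({name: len(members) for name, members in sample.items()})
# {'Leeds': 133, 'Manchester': 267}
```

Simple random sampling is used for the second stage in this sketch; the example notes that systematic sampling within each site would serve equally well.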
The advantage of multi-stage sampling is that you can customize your approach to selecting your sample; it enables you to benefit from the advantages of a particular method and use others alongside it to overcome its disadvantages. In Example 15.6 the HR director is able to preserve the cost advantage of cluster sampling and use the other methods to keep to her target sample size. Like other probabilistic methods it produces results that can be used as unbiased estimators of population parameters.
15.3 Other sampling methods
Wherever possible you should use probabilistic sampling methods, not because they are more likely to produce a representative sample (which is not always true) but because they allow you to make a statistical evaluation of the sampling error and hence you can use the results to make predictions about the population the sample comes from that are statistically valid. Doing this with samples obtained by other methods does not have the same validity.
Why then is it worth looking at other methods at all? There are several reasons: firstly, some populations that you might wish to investigate simply cannot be listed, such as the potential customers of a business, so it is impossible to draw up a sampling frame; secondly, some of these methods are attractive because they are convenient; and thirdly, they are used by companies and therefore it is a good idea to be aware of them and their limitations.