Exposure Analysis, Chapter 3


Human Exposure Assessment Studies

Roy Whitmore

RTI International

CONTENTS

3.1 Synopsis
3.2 What Is a Sample?
3.3 Why Use a Sample?
3.4 Does It Matter How the Sample Is Selected?
3.5 How Is a Representative Sample Selected?
3.6 Example of a Probability-Based Sample
3.7 When Should You Use a Probability-Based Sample?
3.8 When Is a Probability-Based Sample Not Necessary?
3.9 Is Probability-Based Sampling Sufficient to Support Robust Inferences?
3.10 Where Do You Begin?
3.11 Why Does One Stratify the Sampling Frame?
3.12 Are Special Statistical Analysis Techniques Needed?
3.13 Additional Examples of Probability-Based Sampling for Exposure Assessment Studies
3.14 Conclusion
3.15 Questions for Review
References

3.1 SYNOPSIS

Whenever one studies a sample of people to determine the effects of environmental exposures on them, one usually wants to extrapolate the findings beyond the individuals actually studied and measured. This extrapolation from study subjects to the population represented by them requires application of inferential statistics. A firm foundation for inferential statistics is established by using the scientific method to design and implement the study. The scientific method requires one to explicitly define the population about which one wants to make inferences and to select a sample from that population in such a manner that the probability of being selected into the sample is known for every person selected into the sample. Sampling procedures that result in known probabilities of selection from a specified finite population are referred to as probability-based sampling methods. The purpose of this chapter is to give you a basic understanding of what probability-based sampling methods are, when they should be used, how they are applied, when it may be satisfactory to not use them, and what else is required to support defensible inferences from a sample to the population it represents. For example, it is important that the size of the sample be large enough and that valid measurements be obtained for a high proportion of the people selected into the sample.

3.2 WHAT IS A SAMPLE?

A sample is a subset of the individuals, or units, belonging to a larger group that one would like to characterize or about which one would like to make inferences. The larger group about which one would like to make inferences is usually referred to as the universe, or target population, of interest.

For example, suppose we want to characterize personal exposures to airborne allergens for the residents of Virginia during the upcoming summer. Any subset of the people living in Virginia during that summer would be a sample from the population of interest, but some samples would be better than others. One would need a sample that was representative of various geographic areas of Virginia and representative of all months of the summer. However, any subset of the summer residents of Virginia would be a sample from that target population.

3.3 WHY USE A SAMPLE?

A sample is used to learn about the universe from which the sample was selected without observing and measuring all units in the universe. However, not all samples provide the same amount of information about the universe from which they were selected.

3.4 DOES IT MATTER HOW THE SAMPLE IS SELECTED?

Probability-based samples are samples for which every unit in the universe has a known, positive probability of being included in the sample. These types of samples ensure representativeness of the universe from which they were selected. Moreover, they allow one to characterize the uncertainty of inferences from the sample. For example, one can calculate the probability that an inference from the sample may be incorrect. If that probability is too high for comfort, one can take a larger sample to reduce the degree of uncertainty.
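The uncertainty calculation described above can be sketched for the common case of estimating a population proportion from a simple random sample. This is an illustrative sketch using the standard normal-approximation interval, not a procedure taken from this chapter; the 12% exceedance rate and the sample sizes are hypothetical.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Approximate 95% confidence interval for a population proportion
    estimated from a simple random sample of size n (normal approximation)."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Suppose 12% of sampled persons exceed an exposure threshold.
lo1, hi1 = proportion_ci(0.12, 100)    # n = 100
lo2, hi2 = proportion_ci(0.12, 1000)   # n = 1000 -> narrower interval
print(round(hi1 - lo1, 3), round(hi2 - lo2, 3))
```

Taking the larger sample shrinks the interval width by a factor of roughly the square root of the sample-size ratio, which is the sense in which more data reduces the degree of uncertainty.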

The inadequacy of sampling methods that might at first seem adequate has been dramatically illustrated in two U.S. presidential election polls: the 1936 and 1948 presidential election polls.

Prior to the 1936 election, the Literary Digest magazine sent postcards to its subscribers asking whom they would vote for. Based on over 2,000,000 returned postcards, the Literary Digest predicted that Alfred M. Landon would defeat Franklin D. Roosevelt by a 57% to 43% margin. However, George Gallup correctly predicted that Roosevelt would win based on a random sample of 300,000 likely voters. The primary lesson learned from this experience was that all members of the population of interest must have a chance of being included in the sample. The readers of the Literary Digest were more affluent than the majority of the population, which was still struggling through the Great Depression (1929–1939).

Prior to the 1948 presidential election, Gallup conducted a final poll of likely voters on October 20, about 2 weeks before the November 2 election, and predicted that Thomas Dewey would defeat Harry Truman by a 49.5% to 44.5% margin (with the remainder of the votes going to Henry Wallace and Strom Thurmond). Instead, Truman defeated Dewey by a 49.9% to 45.5% margin. Apparently, voter preferences changed sufficiently during the final 2 weeks of campaigning that the October 20 poll was not valid for making inferences regarding the election outcome. This reinforced the earlier lesson about sampling from the population of interest, but with a different twist. As is true in many environmental studies, the outcome of interest was time-dependent. For election polls and for many environmental studies, the population of interest must be carefully defined in terms of both geographic and temporal scope, and the sample selection methods must ensure that the sample is representative in both dimensions because the measurements obtained for study subjects may vary temporally (e.g., by season, by day, or by time of day).

3.5 HOW IS A REPRESENTATIVE SAMPLE SELECTED?

As mentioned above, probability-based sampling is necessary to ensure that a sample is representative of the universe from which it was selected. Probability-based samples use randomization (like flipping a coin or rolling a die) to ensure that a sample is representative of the units on the list from which the sample was selected. The list from which the sample is selected is referred to as the sampling frame. The sampling frame is sometimes a simple list of the units in the population. For example, many European countries maintain population registries that can be used as sampling frames. More often, however, a registry of the population of interest does not exist, and a multistage listing process is used that effectively includes all members of the population of interest. For example, probability-based samples of the residents of the United States are usually selected by first sampling from all counties in the United States, then sampling from all census blocks in the sample counties, and finally sampling from all households in the sample blocks.

The simplest example of a probability-based sample is a simple random sample. A simple random sample is one in which all possible samples of the same size have the same probability of being selected, such as when all units in the sample are determined by independent random draws from the population units. However, simple random samples are rarely used in practice. Instead, the sampling frame is usually partitioned into subsets, called strata, and an independent sample is selected from each subset, or stratum. One reason to partition the population into strata is to guarantee that some of the sample will come from each of the different portions of the population. For example, if we were sampling the summer residents of Virginia, we might want to stratify spatially into four or five geographic regions and temporally by month. That would guarantee that the sample would include persons from all geographic regions of the state and all months of the summer.
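A stratified selection like the Virginia example can be sketched in a few lines. The frame, strata, and per-stratum sample sizes below are hypothetical, and `random.sample` stands in for whatever randomization device (coin, die, or random number table) a real study would document.

```python
import random

def stratified_sample(frame, stratum_of, n_per_stratum, seed=0):
    """Select an independent simple random sample from each stratum.
    `frame` is a list of units; `stratum_of` maps a unit to its stratum key."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], n_per_stratum))
    return sample

# Hypothetical frame: (person_id, region, month) records for summer residents.
frame = [(i, region, month)
         for i in range(3)
         for region in ("North", "South", "East", "West")
         for month in ("Jun", "Jul", "Aug")]
sample = stratified_sample(frame, lambda u: (u[1], u[2]), 2)
# 4 regions x 3 months = 12 strata, 2 units each -> 24 sample units.
print(len(sample))
```

Because a fixed number of units is drawn from every region-month cell, no part of the state and no month of the summer can be missing from the sample, which is exactly the guarantee stratification provides.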

Probability-based samples are representative of the units on the sampling frames from which the samples were selected. However, for reasons of practical or cost expediency, the sampling frame from which the sample is selected may not include all units of the population of interest. For example, some probability-based surveys of the U.S. population use probability-based samples of telephone numbers. These samples are only representative of households with telephone service. Moreover, some telephone households may be excluded, such as those on banks of telephone numbers that have only recently been released. The population represented by the sampling frame is often referred to as the survey population, whereas the population about which inferences are desired is the target population. When the survey and target populations are not identical, one needs to carefully consider the potential for bias when making inferences regarding the target population.

Kalton (1983) provides a basic introduction to the concepts of probability-based sampling. Gilbert (1987) and Thompson (1992) provide more in-depth discussions of probability-based sampling methods for environmental studies.

3.6 EXAMPLE OF A PROBABILITY-BASED SAMPLE

In the mid-1990s, the U.S. Environmental Protection Agency (USEPA) conducted a multimedia, multipathway human exposure assessment study, the National Human Exposure Assessment Survey (NHEXAS), in USEPA Region 5¹ as a field test of procedures that could be used in a national study of human exposures to environmental toxicants. A probability-based sampling design was used for the field test because that would be necessary for a defensible national study. The statistical sampling design for the field study is described in detail by Whitmore et al. (1999) and is summarized in Figure 3.1.²

¹ The six states of Minnesota, Wisconsin, Michigan, Illinois, Indiana, and Ohio.

The target population for the NHEXAS field test consisted of the noninstitutionalized residents of USEPA Region 5 during the period of field data collection, from July 1995 through May 1997. Members of the sample were selected using a four-stage, probability-based sampling design. The stages of sampling were:

1. First-stage sample of counties
2. Second-stage sample of area segments defined by census blocks within sample counties
3. Third-stage sample of households within sample segments
4. Fourth-stage sample of persons within sample households

FIGURE 3.1 Flow chart for the four-stage sampling design for the National Human Exposure Assessment Survey (NHEXAS) field study in USEPA Region 5.

² The study objectives, hypotheses, and exposure study design are described by Pellizzari et al. (1995).

TARGET POPULATION. Geographic scope: noninstitutionalized residents of the 6 states in EPA Region 5.* Temporal scope: July 1995 through May 1997.

FIRST-STAGE SAMPLE OF COUNTIES. Frame: list of all counties in the 6 states. Strata: the 6 states and 2 time periods. Sample: 16 counties selected for each of the 2 temporal strata (32 sample counties).

SECOND-STAGE SAMPLE OF AREA SEGMENTS. Frame: list of all census blocks in each of the 32 sample counties. Strata: 4 strata defined for each sample county in terms of (a) percent urban population, (b) percent Black population, and (c) average dwelling unit value. Sample: one area segment randomly selected from each stratum (128 sample segments).

THIRD-STAGE SAMPLE OF DWELLINGS. Frame: list of all dwellings located in each sample segment at the time of the field data collection. Sample: 8 dwellings randomly selected from each sample segment. Result: 884 sample dwellings, of which 805 were occupied housing units and 555 completed the household screening questionnaire.

FOURTH-STAGE SAMPLE OF PERSONS. Frame: all current members of each of the 555 screened households. Sample: a randomly selected half of the one-person households and one participant from each of the other households. Result: 453 sample persons, of whom 326 completed the baseline questionnaire and 249 completed core personal exposure monitoring.

* Minnesota, Wisconsin, Michigan, Illinois, Indiana, and Ohio.

Temporal stratification was implemented for the NHEXAS first-stage sample of counties by selecting two independent samples of 18 counties, one to be used for the first 9 months of data collection and the other to be used for the last 9 months. Four counties were selected into both samples. For reasons of practical expediency and cost control, the survey population covered by the field test sampling design differed in some known ways from the target population. In particular, the following subpopulations were not included in the field test:

1. People not living in households (e.g., homeless persons and those in prisons, nursing homes, dormitories, etc.)
2. People living on military bases
3. People who were not mentally or physically capable of participating

3.7 WHEN SHOULD YOU USE A PROBABILITY-BASED SAMPLE?

You should use a probability-based sample when you can afford to collect data for more than a minimal number of units in the universe of interest, say more than 20 or 30 persons, and you want to:

1. Make defensible inferences regarding the universe, such as the proportion of people whose personal exposures exceed a specified threshold
2. Quantify the uncertainty of those inferences
3. Do so while making few assumptions about the universe, such as the shape of the distribution of all exposures in the universe

Other types of inference that are less reliant on assumptions when supported by a probability-based sample include the following:

1. Testing hypotheses, such as whether the mean personal exposure of people with low socioeconomic status exceeds that of the rest of the population
2. Estimating relationships between exposures, health effects, and demographic characteristics (e.g., correlations or regression models)

Estimating relationships between variables is often done with data that do not come from a probability-based sample. However, if the relationship between the variables of interest is not the same for all members of the universe, a model based on a probability-based sample protects against getting a biased estimate of the relationships.

3.8 WHEN IS A PROBABILITY-BASED SAMPLE NOT NECESSARY?

When one can only afford to collect data for a small sample, say 20 or fewer persons, or when a small sample is considered sufficient for the purposes of the investigation, a purposively selected sample is usually superior to a probability-based sample. When there will be only a small number of observations, it is important to carefully pick the units to be observed so that they are as representative of the universe of interest as can be achieved with a small sample. One needs to use expert judgment and knowledge of the universe to ensure that the sample units are not unusual (e.g., not mostly large or mostly small units, and not just the units most convenient or least expensive to observe and measure).


USEPA (2002) provides some examples where judgmental samples are appropriate. One example is characterization of groundwater contamination beneath a Brownfield site, a site suspected of being contaminated with industrial waste that is being redeveloped. In this case, the high cost of collecting groundwater samples may preclude probability-based sampling.

Another example is determining whether or not the concentration of a contaminant in surface soils exceeds a specified threshold anywhere on a Brownfield site. In this case, samples can be collected from the areas where industrial spills are known to have occurred (e.g., based on visual inspection of the soil). If none of the samples exceeds the threshold, the investigation may be finished. However, if any of the samples exceeds the threshold, then a probability-based sampling design may be necessary to characterize the distribution of soil concentrations of the contaminant throughout the site.

3.9 IS PROBABILITY-BASED SAMPLING SUFFICIENT TO SUPPORT ROBUST INFERENCES?

Unfortunately, selecting a probability-based sample from the population of interest is not sufficient to support defensible inferences regarding the population. You must actually collect data from most, if not all, members of the sample. When the sample units are not people or commercial establishments (e.g., trees, soil, water, or air), data usually can be obtained for most of the sample members. However, when you must get people to agree to let you collect information from them, considerable effort must be devoted to obtaining a good response rate. USEPA (1983) recommends setting the target response rate to be at least 75% for mail and telephone surveys. A reasonably high response rate (at least 50% or better) is necessary to limit the uncertainty due to the possibility that the outcomes being measured may be systematically different for respondents and nonrespondents.

Moreover, not just any data are sufficient. You must use measurement methods that have been carefully developed and tested to ensure that the measurements are accurate and not biased. This principle is equally important for analytical instruments that measure concentrations of contaminants in the environment and for questionnaires that attempt to obtain data regarding the attributes and activities of people and businesses.

The size of the sample also is important for limiting uncertainty regarding the inferences made from a probability-based sample. A small probability-based sample could result in inferences with a high level of uncertainty because of the great deal of variability in the outcomes that could be obtained with small samples. As the sample size increases, the variability between the results from different samples using the same probability-based sampling design decreases, and the precision of the statistical inferences increases. USEPA (2002) provides sample size guidance for simple random sampling designs, which serves as a useful starting point for other sampling designs.
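The relationship between sample size and precision can be illustrated with the standard first-cut formula for the mean from a simple random sample. This is a generic textbook approximation, not the USEPA (2002) guidance itself, and the standard deviation and margin-of-error values are hypothetical.

```python
import math

def srs_sample_size(std_dev, margin, population_size=None, z=1.96):
    """Sample size needed so the mean from a simple random sample has
    roughly the desired margin of error at 95% confidence. The optional
    finite population correction matters when n is a large share of N."""
    n0 = (z * std_dev / margin) ** 2
    if population_size:                      # finite population correction
        n0 = n0 / (1.0 + n0 / population_size)
    return math.ceil(n0)

# Halving the tolerable margin of error roughly quadruples the sample size.
print(srs_sample_size(10.0, 2.0), srs_sample_size(10.0, 1.0))
```

The inverse-square relationship is why precision improvements become progressively more expensive as the margin-of-error requirement tightens.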

3.10 WHERE DO YOU BEGIN?

The scientific method is used to proceed from research objectives to the study design that best fulfills the requirements of the research. The scientific method for developing a probability-based sampling design is explained as a seven-step process, called the Data Quality Objectives (DQO) process, in USEPA (2000) and in Chapter 2 of Millard and Neerchal (2001). Although these documents describe the process in terms of testing a hypothesis to make a decision, the process is equally applicable to studies for which the objective is to characterize the status and trends of exposures in a population.

The scientific process for developing a probability-based sampling design begins with explicit specification of the goals of the study. The population of interest must be specified, including both the spatial and temporal extent of the population. In addition to overarching study objectives, one must specify specific population parameters to be estimated and, if applicable, specific hypotheses to be tested (e.g., that the mean concentration of cotinine in saliva is higher for smokers than for nonsmoking adults). After one has identified the key estimates and hypothesis tests that will drive the inferential needs of the study, one must specify the level of uncertainty that can be tolerated for the estimates or inferences. At this point, it is necessary to consult with a survey statistician to mathematically formulate the precision requirements and determine appropriate classes of statistical sampling designs. Having identified appropriate classes of statistical sampling designs, the cost of data collection must be estimated in terms of cost per sampling unit (e.g., the differential cost of going to one more county, going to one more area in a sample county, or collecting and analyzing data from one more participant in a sample area). A sampling statistician can use the precision requirements, cost estimates, and constraints on study resources (e.g., the research budget) to determine the optimum sampling design (e.g., sampling frames and strata, stages of sampling, and sample sizes) that will either achieve the precision requirements for the least cost or achieve maximum precision within specified cost constraints.
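The precision-versus-cost trade-off that the sampling statistician works out can be illustrated with the classical cost-constrained allocation rule for stratified sampling: sample stratum h in proportion to N_h × S_h / sqrt(c_h), where c_h is the per-unit data collection cost. This is a standard textbook result, not a procedure from this chapter, and the stratum sizes, standard deviations, unit costs, and budget below are hypothetical.

```python
import math

def optimum_allocation(sizes, sds, unit_costs, budget):
    """Cost-constrained optimum allocation: sample stratum h in proportion
    to N_h * S_h / sqrt(c_h), scaled so the total data collection cost
    stays within the budget."""
    k = [n * s / math.sqrt(c) for n, s, c in zip(sizes, sds, unit_costs)]
    # Scale factor chosen so that sum over strata of n_h * c_h ~= budget.
    scale = budget / sum(kh * c for kh, c in zip(k, unit_costs))
    return [max(1, round(scale * kh)) for kh in k]

# Hypothetical strata: cheap-to-survey urban units vs. costly rural units.
sizes, sds, costs = [9000, 1000], [1.0, 1.0], [50.0, 200.0]
print(optimum_allocation(sizes, sds, costs, 20000.0))
```

With equal variability, the expensive stratum receives a smaller sampling fraction than proportionate allocation would give it, which is how the design buys maximum precision for a fixed budget.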

3.11 WHY DOES ONE STRATIFY THE SAMPLING FRAME?

As mentioned earlier, the sampling design for a research study sample is seldom a simple random sample, even if there is a list of the members of the population of interest that can be used as a sampling frame. One usually at least stratifies the sampling frame. A stratified sample is one in which the sampling frame is partitioned into disjoint subsets, called strata, and a separate, statistically independent sample is selected from each stratum.

There are several reasons why sampling frames usually are stratified before sample selection. The reasons include: (a) to improve the representativeness of the sample, (b) to control the precision of estimates, (c) to control costs, and (d) to enable use of different sampling designs for different portions of the population.

The representativeness of a sample can be improved by stratifying the sampling frame because the sampling design then guarantees that a specified number of units will be selected from each stratum. Hence, if there are population subsets (e.g., spatial or temporal domains) that are so important that the sample would be considered deficient if it did not include any units from them, they should be sampling strata.

There are two quite different ways that strata often are used to improve the precision of study estimates. If separate estimates are required for various population domains (e.g., defined by age, race, gender, or socioeconomic status), adequate precision for these estimates may require that certain domains receive more than a proportionate sample. In this case, one often defines the domains that need to be oversampled (i.e., that need more than a proportionate sample) to be sampling strata so that the sample sizes needed for adequate precision can be guaranteed (see, e.g., Ezzati-Rice and Murphy 1995). In a similar manner, if hypotheses will test for differences between two groups (e.g., exposed and unexposed individuals), then the two groups may be defined to be strata so that they receive equal sample sizes, thereby maximizing the likelihood that the hypothesis test will detect any differences between the two groups.

Alternatively, stratification can be used to maximize the precision of estimates of overall population parameters (e.g., overall population means or proportions). In this case, one forms strata that are as homogeneous as possible with respect to the key outcomes of interest in the study. Lacking any firm knowledge of how the outcomes will be distributed across strata, the precision of estimates is then maximized by proportionate allocation of the sample to the strata. However, if one knows ahead of time approximately what the variability of outcome measures will be within strata, precision can be improved for overall population estimates by using higher sampling fractions in the strata with higher variability. Hence, efficient stratification requires knowledge of spatial and temporal variability. For human exposure assessment surveys, one often is tempted to define strata that contain population members who are expected to experience higher-than-average exposures to the environmental pollutants of interest and to oversample these strata. This can produce improved precision for estimates of overall population mean exposures if the stratum sampling rates are proportional to the variability (standard deviations) of the stratum measures of exposure concentrations. However, this practice can lead to loss of precision if the higher exposures are not sufficiently concentrated in the oversampled strata (see, e.g., USEPA 1990a,b; Whitmore et al. 1994; Sexton et al. 2003). High exposure strata can be oversampled with little loss of precision for overall population estimates only if the following two conditions are simultaneously satisfied: (a) the percentage of persons with high exposures is much higher in the oversampled strata and (b) the oversampled strata contain a high proportion (say, 75% or more) of the highly exposed population (see Callahan et al. 1995).
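The rule of sampling strata in proportion to their variability is the classical Neyman allocation. A sketch, with hypothetical stratum sizes and standard deviations, shows how a small but highly variable "suspected high exposure" stratum ends up with a much higher sampling fraction:

```python
def neyman_allocation(stratum_sizes, stratum_sds, total_n):
    """Allocate a fixed total sample across strata in proportion to
    N_h * S_h (stratum size times stratum standard deviation), which
    minimizes the variance of the overall mean estimate."""
    weights = [n * s for n, s in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [round(total_n * w / total) for w in weights]

# Hypothetical strata: a large low-variability stratum and a small
# high-variability "suspected high exposure" stratum.
sizes = [8000, 1500, 500]   # stratum population sizes (hypothetical)
sds   = [1.0,  2.0,  6.0]   # stratum standard deviations (hypothetical)
print(neyman_allocation(sizes, sds, 300))
```

Here the smallest stratum is sampled at roughly 13% while the largest is sampled at about 2%, which is precisely the "higher sampling fractions in the strata with higher variability" behavior described above; the caveat in the text applies when the assumed standard deviations turn out to be wrong.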

Stratification also can be used to control study costs if the population can be partitioned into strata that have different costs per unit for data collection and analysis. For example, units of the population that are difficult to access and measure could be defined as a separate sampling stratum. In this case, study costs can be reduced by assigning lower sampling fractions to the higher cost strata.

Finally, if different sampling designs are most efficient for different parts of the population, use of different sampling designs is facilitated by stratifying the sample because a statistically independent sample is selected from each stratum. For example, a survey of industrial establishments may require use of two lists that contain quite different types of information for different industrial sectors. Defining each list, or industrial sector, to be a stratum allows one to use entirely different sampling designs for the two strata within the overall probability-based sampling design for the industry.

3.12 ARE SPECIAL STATISTICAL ANALYSIS TECHNIQUES NEEDED?

Use of a probability-based sampling design allows one to compute robust estimates of precision (e.g., standard errors) that usually are larger than those one would get from the usual statistical analysis procedures, which typically assume that all observations come from a simple random sample and are independent and identically distributed. The probability-based estimates of standard errors are based on the known sampling design: the stages of sampling, the strata, and the probabilities of selection. Ignoring the sampling design is likely to result in underestimation of the sampling variance and, hence, erroneous claims of statistically significant results.
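The understatement of sampling variance can be demonstrated with a one-stage cluster sample, where observations within a cluster (e.g., members of the same household) tend to be similar. This is a minimal sketch of one simple design-based estimator with hypothetical data, not a substitute for a full survey-analysis package.

```python
import statistics

def cluster_se(clusters):
    """Design-based standard error of the mean for a one-stage cluster
    sample with equal-size clusters: the cluster means are the
    independent sampling units, not the individual observations."""
    means = [statistics.mean(c) for c in clusters]
    return (statistics.variance(means) / len(means)) ** 0.5

# Hypothetical clusters with strong within-cluster similarity in exposure.
clusters = [[1.0, 1.2, 1.1], [5.0, 5.2, 5.1], [9.0, 9.1, 9.2]]
se_design = cluster_se(clusters)

# Naive analysis: pretend the 9 observations are a simple random sample.
flat = [x for c in clusters for x in c]
se_naive = (statistics.variance(flat) / len(flat)) ** 0.5
print(se_design > se_naive)   # ignoring clustering understates uncertainty
```

Only three effectively independent pieces of information were collected, so the naive standard error, which pretends there are nine, is far too optimistic and would inflate claims of statistical significance.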

If the population units do not all have the same probability of selection, the responses from each unit must be weighted inversely to the unit's probability of selection to avoid design bias. For example, if minorities were selected at twice as high a rate as other strata, then each minority person in the sample represents half as many population members as the other sample members. In this case, minority sample members would be assigned a statistical analysis weight that was half as large as that of the other sample members. Ignoring the sampling weights would result in biased estimates.
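The weighting rule in this example can be sketched directly; the exposure values and selection probabilities below are hypothetical.

```python
def weighted_mean(values, selection_probs):
    """Design-based estimate of the population mean: each response is
    weighted by the inverse of its unit's probability of selection."""
    weights = [1.0 / p for p in selection_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical sample: one stratum sampled at twice the rate (p = 0.02
# vs. 0.01), so each of its members gets half the analysis weight.
values = [4.0, 4.0, 2.0, 2.0]          # exposures: 2 oversampled, 2 other
probs  = [0.02, 0.02, 0.01, 0.01]
print(weighted_mean(values, probs))
```

The design-based estimate is about 2.67, whereas the unweighted mean of 3.0 would overstate the population mean because the oversampled, higher-exposure units are counted at twice their population share.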

Several statistical analysis software packages are available to enable proper analysis of data from probability-based samples. An overview of the variance estimation methods is provided by Wolter (1985). A summary of currently available software can be obtained at http://www.hcp.med.harvard.edu/statistics/survey-soft/.

3.13 ADDITIONAL EXAMPLES OF PROBABILITY-BASED SAMPLING FOR EXPOSURE ASSESSMENT STUDIES

Because of the cost of chemical and physical analysis of environmental samples, most human exposure assessment studies have smaller sample sizes than the NHEXAS and are usually confined to a local geographic area (e.g., a city or county). The carbon monoxide exposure assessment study conducted in Washington, DC, during the winter of 1982–1983 is a good example. The statistical sampling design is described in detail in USEPA (1984) and Akland et al. (1985) and is summarized in Figure 3.2.

The target population for the CO study consisted of the nonsmoking, noninstitutionalized residents of the Washington, DC, metropolitan area during the winter of 1982–1983 who were 18–70 years of age. Members of the sample were selected using a three-stage, probability-based sampling design. The stages of sampling were:

1. First-stage sample of area segments defined by census block groups
2. Second-stage sample of residential addresses obtained from a commercial vendor
3. Third-stage sample of persons

The address lists obtained from the vendor did not include some members of the target population. An incomplete sampling frame was used to reduce study costs. In this case, the survey population excluded people with no telephone service and most people whose address was not in the latest published telephone directory.

FIGURE 3.2 Flow chart for the three-stage sampling design for the Washington, DC, carbon monoxide personal exposure monitoring field study.

TARGET POPULATION. Geographic scope: noninstitutionalized, nonsmoking residents of the Washington, DC, metropolitan area aged 18 to 70. Temporal scope: winter 1982–83.

FIRST-STAGE SAMPLE OF AREA SEGMENTS. Frame: list of all census block groups in the Washington metropolitan area. Strata: states and counties. Sample: stratified random sample of 250 area segments.

SECOND-STAGE SAMPLE OF ADDRESSES. Frame: commercial list of residential addresses in each sample segment. Sample: simple random sample of 40 addresses from each sample segment. Result: 4,401 screened households with 5,209 eligible individuals.

THIRD-STAGE SAMPLE OF PERSONS. Frame: 5,209 eligible members of screened households. Strata: 9 strata based on (a) short or long commute, (b) presence of a gas stove or space heater, and (c) attached garage. Sample: stratified random sample of 1,987 individuals. Result: 1,161 persons interviewed; 712 persons monitored for 1 day for personal CO exposures.

A bibliography of human exposure assessment studies that have utilized probability-based sampling techniques to make inferences to specified target populations is provided in Table 3.1.

3.14 CONCLUSION

In conclusion, human exposure assessment studies usually are designed to make inferences from a sample of persons to a finite population with specific spatial and temporal bounds. In order to make defensible inferences from the sample to the target population, probability-based sampling methods must be used to select the study subjects. Randomization in the probability-based sampling method protects against biased selection of subjects and facilitates computation of robust, design-based measures of precision. It also makes use of expert knowledge to design an efficient sample. For example, overall sample precision can be maximized by defining sampling strata that have relatively homogeneous values of the outcome measurements.

Use of tested and validated measurement methods ensures that reliable measurements will be obtained from all sample persons. Using survey procedures that result in valid measurements for a high proportion of the sample members protects against nonresponse bias. Finally, having a sufficiently large sample ensures that survey statistics will have adequate precision.

The scientific method, as outlined by the USEPA's Data Quality Objectives process, is recommended for designing and implementing human exposure assessment studies. This process ensures statistically valid inferences with sufficient precision to satisfy the study objectives.

3.15 QUESTIONS FOR REVIEW

1. What is a sampling frame?
2. What is a probability-based sample?
3. How do you determine if probability-based sampling is necessary for a given research study?
4. What is necessary in addition to probability-based sampling to support defensible inferences to the population from which the sample was selected?
5. What are sampling strata?
6. Why would you stratify a sampling frame?
7. What are the basic steps of the Data Quality Objectives process?
