How do we know this text’s exercises are perfectly adapted
for online learning with Enhanced WebAssign?
The text author wrote them.
Enhanced WebAssign for Elementary Statistics: Looking at the
Big Picture is an easy-to-use online teaching and learning system
that provides assignable homework, automatic grading, and
interactive assistance for students. With more than 1,000 exercises
pulled directly from the text (written and customized by Nancy
Pfenning to be ideal for the online environment), students get
problem-solving practice that clarifies statistics, builds skills, and
boosts conceptual understanding. And when you choose Enhanced
WebAssign, students also get access to a Multimedia eBook, a
complete interactive version of the text.
Students Get Interactive Practice
As students work problems, they can link directly to:
Watch It: Videos of worked exercises and examples from the text
Read It: Relevant eBook selections from the text
You Save Time on Homework Management, Including Automatic Grading
Enhanced WebAssign’s simple, user-friendly interface lets you quickly master the essential functions, and help is always available if you need it. Create a course in two easy steps, enroll students quickly (or let them enroll themselves), and select problems for an assignment in fewer than five minutes. Enhanced WebAssign automatically grades the assignments and sends results to your gradebook. It’s that easy!
Find out more and see a sample assignment at www.webassign.net/brookscole
Screenshots shown here are for illustrative purposes only.
Statistics
Looking at the Big Picture
Nancy Pfenning
Publisher: Richard Stratton
Senior Sponsoring Editor: Molly Taylor
Associate Editor: Daniel Seibert
Editorial Assistant: Shaylin Walsh
Senior Marketing Manager: Greta Kleinert
Marketing Coordinator: Erica O’Connell
Marketing Communications Manager: Mary Anne Payumo
Content Project Manager: Susan Miscio
Art Director: Linda Helcher
Senior Print Buyer: Diane Gibbons
Senior Rights Acquisition Account Manager, Text: Katie Huha
Production Service: S4Carlisle Publishing Services
Rights Acquisition Account Manager, Images: Don Schlotman
Photo Researcher: Jennifer Lim
Interior and Cover Designer: KeDesign
Cover Image: © Veer Incorporated
Compositor: S4Carlisle Publishing Services
ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced, transmitted, stored, or used in any form or by any means graphic, electronic, or mechanical, including but not limited to photocopying, recording, scanning, digitizing, taping, Web distribution, information networks, or information storage and retrieval systems, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without the prior written permission of the publisher.
Library of Congress Control Number: 2009935400 ISBN-13: 978-0-495-01652-6
ISBN-10: 0-495-01652-7
Brooks/Cole
20 Channel Center Street Boston, MA 02210 USA
Cengage Learning is a leading provider of customized learning solutions with office locations around the globe, including Singapore, the United Kingdom, Australia, Mexico, Brazil, and Japan. Locate your local office at:
Purchase any of our products at your local college store or at our preferred
online store www.ichapters.com.
Printed in the United States of America
1 2 3 4 5 6 7 12 11 10 09
For product information and technology assistance, contact us at
Cengage Learning Customer & Sales Support, 1-800-354-9706
For permission to use material from this text or product, submit all
requests online at www.cengage.com/permissions.
Further permissions questions can be emailed to
permissionrequest@cengage.com.
To Frank, Andreas & Mary, Marina, and Nils
Contents
Preface
1 Introduction: Variables and Processes in Statistics
Types of Variables: Categorical or Quantitative
Students Talk Stats: Identifying Types of Variables
Handling Data for Two Types of Variables
Roles of Variables: Explanatory or Response
Statistics as a Four-Stage Process
Summary / Exercises
PART I Data Production
2 Sampling: Which Individuals Are Studied
Sources of Bias in Sampling: When Selected Individuals Are Not Representative
Probability Sampling Plans: Relying on Randomness
The Role of Sample Size: Bigger Is Better If the Sample Is Representative
From Sample to Population: To What Extent Can We Generalize?
Students Talk Stats: Seeking a Representative Sample
Summary / Exercises
3 Design: How Individuals Are Studied
3.1 Various Designs for Studying Variables
Identifying Study Design
Observational Studies versus Experiments: Who Controls the Variables?
Errors in Studies’ Conclusions: The Imperfect Nature of Statistical Studies
3.2 Sample Surveys: When Individuals Report Their Own Values
Sources of Bias in Sample Surveys
3.3 Observational Studies: When Nature Takes Its Course
Confounding Variables and Causation
Paired or Two-Sample Studies
Prospective or Retrospective Studies: Forward or Backward in Time
3.4 Experiments: When Researchers Take Control
Randomized Controlled Experiments
Double-Blind Experiments
“Blind” Subjects
“Blind” Experimenters
Pitfalls in Experimentation
Modifications to Randomization
Students Talk Stats: Does Watching TV Cause ADHD? Considering Study Design
Summary / Exercises
PART II Displaying and Summarizing Data
4 Displaying and Summarizing Data for a Single Variable
4.1 Single Categorical Variable
Summaries and Pie Charts
The Role of Sample Size: Why Some Proportions Tell Us More Than Others Do
Bar Graphs: Another Way to Visualize Categorical Data
Mode and Majority: The Value That Dominates
Revisiting Two Types of Bias
Students Talk Stats: Biased Sample, Biased Assessment
4.2 Single Quantitative Variables and the Shape of a Distribution
Thinking about Quantitative Data
Stemplots: A Detailed Picture of Number Values
Histograms: A More General Picture of Number Values
4.3 Center and Spread: What’s Typical for Quantitative Values, and How They Vary
Five-Number Summary: Landmark Values for Center and Spread
Boxplots: Depicting the Key Number Values
Mean and Standard Deviation: Center and Spread in a Nutshell
4.4 Normal Distributions: The Shape of Things to Come
The 68-95-99.7 Rule for Samples: What’s “Normal” for a Data Set
From a Histogram to a Smooth Curve
Standardizing Values of Normal Variables: Storing Information in the Letter z
Students Talk Stats: When the 68-95-99.7 Rule Does Not Apply
“Unstandardizing” z-Scores: Back to Original Units
The Normal Table: A Precursor to Software
Summary / Exercises
5 Displaying and Summarizing Relationships
5.1 Relationships between One Categorical and One Quantitative Variable
Different Approaches for Different Study Designs
Displays
Summaries
Notation
Data from a Two-Sample Design
Data from a Several-Sample Design
Data from a Paired Design
Students Talk Stats: Displaying and Summarizing Paired Data
Generalizing from Samples to Populations: The Role of Spreads
The Role of Sample Size: When Differences Have More Impact
5.2 Relationships between Two Categorical Variables
Summaries and Displays: Two-Way Tables, Conditional Percentages, and Bar Graphs
The Role of Sample Size: Larger Samples Let Us Rule Out Chance
Comparing Observed and Expected Counts
Confounding Variables and Simpson’s Paradox: Is the Relationship Really There?
5.3 Relationships between Two Quantitative Variables
Displays and Summaries: Scatterplots, Form, Direction, and Strength
Correlation: One Number for Direction and Strength
When the Correlation Is 0, +1, or −1
Correlation as a Measure of Direction and Strength
A Closer Look at Correlation
Correlation Is Unaffected by the Roles of Explanatory and Response Variables
Correlation Is Unaffected by Units of Measurement
Least Squares Regression Line: What We See in a Linear Plot
A Closer Look at Least Squares Regression
Residuals: Prediction Errors in a Regression
Spread s about the Line versus Spread s_y about the Mean Response
The Effect of Explanatory and Response Roles on the Regression Line
Influential Observations and Outliers
Students Talk Stats: How Outliers and Influential Observations Affect a Relationship
Sample versus Population: Thinking Beyond the Data at Hand
The Role of Sample Size: Larger Samples Get Us Closer to the Truth
Time Series: When Time Explains a Response
Additional Variables: Confounding Variables, Multiple Regression
Students Talk Stats: Confounding in a Relationship between Two Quantitative Variables
Summary / Exercises
PART III Probability
6 Finding Probabilities
6.1 The Meaning of “Probability” and Basic Rules
Permissible Probabilities
Probabilities Summing to One
Probability of “Not” Happening
Probability of One “Or” the Other for Non-overlapping Events
Probability of One “And” the Other for Two Independent Events
6.2 More General Probability Rules and Conditional Probability
Probability of One “Or” the Other for Any Two Events
Probability of Both One “And” the Other Event Occurring
Students Talk Stats: Probability as a Weighted Average of Conditional Probabilities
Conditional Probability in Terms of Ordinary Probabilities
Checking for Independence
Counts Expected If Two Variables Are Independent
Summary / Exercises
7 Random Variables
7.1 Discrete Random Variables
Probability Distributions of Discrete Random Variables
The Mean of a Random Variable
The Standard Deviation of a Random Variable
Rules for the Mean and Standard Deviation of a Random Variable
7.2 Binomial Random Variables
What Makes a Random Variable “Binomial”?
The Mean and Standard Deviation of Sample Proportions
Students Talk Stats: Calculating and Interpreting the Mean and Standard Deviation of Count or Proportion
The Shape of the Distribution of Counts or Proportions: The Central Limit Theorem
7.3 Continuous Random Variables and the Normal Distribution
Discrete versus Continuous Distributions
When a Random Variable Is Normal
The 68-95-99.7 Rule for Normal Random Variables
Standardizing and Unstandardizing: From Original Values to z or Vice Versa
Estimating z Probabilities with a Sketch of the 68-95-99.7 Rule
Nonstandard Normal Probabilities
Tails of the Normal Curve: The 90-95-98-99 Rule
Students Talk Stats: Means, Standard Deviations, and Below-Average Heights
Summary / Exercises
8 Sampling Distributions
Categorical Variables: The Behavior of Sample Proportions
Quantitative Variables: The Behavior of Sample Means
8.1 The Behavior of Sample Proportion in Repeated Random Samples
Thinking about Proportions from Samples or Populations
Center, Spread, and Shape of the Distribution of Sample Proportion
8.2 The Behavior of Sample Mean in Repeated Random Samples
Thinking about Means from Samples or Populations
The Mean of the Distribution of Sample Mean
The Standard Deviation of the Distribution of Sample Mean
The Shape of the Distribution of Sample Mean: The Central Limit Theorem
Center, Spread, and Shape of the Distribution of Sample Mean
Normal Probabilities for Sample Means
Students Talk Stats: When Normal Approximations Are Appropriate
Summary / Exercises
PART IV Statistical Inference
9 Inference for a Single Categorical Variable
9.1 Point Estimate and Confidence Interval: A Best Guess and a Range of Plausible Values for Population Proportion
Probability versus Confidence: Talking about Random Variables or Parameters
95% Confidence Intervals: Building around Our Point Estimate
The Role of Sample Size: Closing In on the Truth
Confidence at Other Levels
Deciding If a Particular Value Is Plausible: An Informal Approach
The Meaning of a Confidence Interval: What Exactly Have We Found?
Students Talk Stats: Interpreting a Confidence Interval
9.2 Hypothesis Test: Is a Proposed Population Proportion Plausible?
Three Forms of Alternative Hypothesis: Different Ways to Disagree
One-Sided or Two-Sided Alternative Hypothesis
How Small Is a “Small” P-Value?
The Role of Sample Size in Conclusions for Hypothesis Tests
When to Reject the Null Hypothesis: Three Contributing Factors
Students Talk Stats: Interpreting a P-Value
Type I or II Error: What Kind of Mistakes Can We Make?
Students Talk Stats: What Type of Error Was Made?
Relating Results of Test with Confidence Interval: Two Sides of the Same Coin
The Language of Hypothesis Tests: What Exactly Do We Conclude?
Students Talk Stats: The Correct Interpretation of a Small P-Value
Students Talk Stats: The Correct Interpretation When a P-Value Is Not Small
The “Critical Value” Approach: Focusing on the Standard Score
Summary / Exercises
10 Inference for a Single Quantitative Variable
10.1 Inference for a Mean When Population Standard Deviation Is Known or Sample Size Is Large
A Confidence Interval for the Population Mean Based on z
95% Confidence Intervals with z
Students Talk Stats: Confidence Interval for a Mean: Width, Margin of Error, Standard Deviation, and Standard Error
Role of Sample Size: Larger Samples, Narrower Intervals
Intervals at Other Levels of Confidence with z
Interpreting a Confidence Interval for the Mean
Students Talk Stats: Correctly Interpreting a Confidence Interval for the Mean
A z Hypothesis Test about the Population Mean
10.2 Inference for a Mean When the Population Standard Deviation Is Unknown and the Sample Size Is Small
A t Confidence Interval for the Population Mean
95% Confidence Intervals with t
Intervals at Other Levels of Confidence with t
A t Hypothesis Test about the Population Mean
Students Talk Stats: Practical Application of a t Test
10.3 A Closer Look at Inference for Means
A One-Sided or Two-Sided Alternative Hypothesis about a Mean
The Role of Sample Size and Spread: What Leads to Small P-Values?
Type I and II Errors: Mistakes in Conclusions about Means
Relating Tests and Confidence Intervals for Means
Correct Language in Hypothesis Test Conclusions about a Mean
Robustness of Procedures
Summary / Exercises
11 Inference for Relationships between Categorical and Quantitative Variables
11.1 Inference for a Paired Design with t
Hypothesis Test in a Paired Design
Confidence Interval in a Paired Design
11.2 Inference for a Two-Sample Design with t
The Two-Sample t Distribution and Test Statistic
Hypothesis Test in a Two-Sample Design
Confidence Interval in a Two-Sample Design
The Pooled Two-Sample t Procedure
Students Talk Stats: Ordinary versus Pooled Two-Sample t
11.3 Inference for a Several-Sample Design with F: Analysis of Variance
The F Statistic
The F Distribution
Solving Several-Sample Problems
The ANOVA Table: Organizing What We Know about F
The ANOVA Alternative Hypothesis
Assumptions of ANOVA
Summary
Students Talk Stats: Reviewing Relationships between Categorical Explanatory and Quantitative Response Variables
Exercises
12 Inference for Relationships between Two Categorical Variables
12.1 Comparing Proportions with a z Test
12.2 Comparing Counts with a Chi-Square Test
Relating Chi-Square to z
The Table of Expected Counts
Comparing Observed to Expected Counts
The Chi-Square Distribution
The Chi-Square Test
Sample Size and Chi-Square Assumptions
Summary / Exercises
13 Inference for Relationships between Two Quantitative Variables
13.1 Inference for Regression: Focus on the Slope of the Regression Line
Setting the Stage: Summarizing a Relationship for Sampled Points
Distinguishing between Sample and Population Relationships
A Model for the Relationship between Two Quantitative Variables in a Population
The Distribution of Sample Slope b1
The Distribution of Standardized Sample Slope t
Hypothesis Test about the Population Slope with t: A Clue about the Relationship
Students Talk Stats: No Evidence of a Relationship
Confidence Interval for the Slope of the Population Regression Line
13.2 Interval Estimates for an Individual or Mean Response
Summary / Exercises
14 How Statistics Problems Fit into the Big Picture
14.1 The Big Picture in Problem Solving
Students Talk Stats: Choosing the Appropriate Statistical Tools: Question 1
Students Talk Stats: Choosing the Appropriate Statistical Tools: Question 2
Students Talk Stats: Choosing the Appropriate Statistical Tools: Question 3
Exercises
15.1 The Sign Test as an Alternative to the Paired t Test
15.2 The Rank-Sum Test as an Alternative to the Two-Sample t Test
Wilcoxon rank-sum test
15.3 Summary of Non-parametrics
Exercises
16 Two-Way ANOVA (available online)
Part I
Data Production
An Overview
In this part of the book, we focus on the two stages of data production:
1 Obtaining a sample
2 Designing a study to discover what we want to know about the variables of interest for the individuals in the sample
The principles of good data production play a vital role in what we aim to accomplish throughout the book. It is of the utmost importance at this stage to avoid any form of bias.
Bias Due to Sampling
In an interview, Larry Flynt (controversial publisher of Hustler and similar magazines) was asked, “How would you like women to remember you, as someone who helped or hurt their position?”[1] His reply was “... of the thousands of girls who have posed for my magazines, I’ve never had one who felt she had been exploited. I think it’s actually helped the women’s movement ...” Obviously, the [...] in general, so we cannot infer anything about the attitude of the larger population of women based on his sample of models.
[...] statistics, the most common quantities to be estimated are means and proportions.
Bias is the tendency of an estimate to deviate in one direction from a true value.
A biased sample results in over- or underestimates because the sample is not representative of the population of interest.
The design of a study is the plan for gathering information about the variables of interest. A biased study design results in over- or underestimates because of flaws in the way information about sampled individuals is gathered.
[...] Part IV. A study design that assesses sampled values without bias is a [...]
Thus, it is extremely important that the very first step in data production—
sampling—be carried out in such a way that the sample really does represent the
population of interest. Also, we must remember that our summaries of variables
and their relationships reflect the true nature of the variables and relationships in
the sample only if the design for gathering the information is sound.
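The effect of a non-representative sample can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the population of 1,000 numbered values, the sample size of 100, and the rule that the biased sample is drawn only from the upper half are all hypothetical choices made for illustration, not data from any study discussed here.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 individuals with values 1..1000.
population = list(range(1, 1001))
true_mean = mean(population)  # the value we would like to estimate

# Representative sample: every individual has the same chance of selection.
random_sample = random.sample(population, 100)

# Biased sample: only individuals from the upper half can be selected.
upper_half = [x for x in population if x > 500]
biased_sample = random.sample(upper_half, 100)

random_estimate = mean(random_sample)
biased_estimate = mean(biased_sample)

print(f"true mean:          {true_mean}")
print(f"random-sample mean: {random_estimate}")
print(f"biased-sample mean: {biased_estimate}")
```

The random sample's mean may land somewhat above or below the true mean, whereas the biased sample's mean must overestimate it, because every selected value comes from the upper half. That systematic deviation in one direction is what the definition of bias describes.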
Bias Due to Study Design
According to an article entitled “Exercise Does Good Things for Teens’ Moods,” “Boys who reported less than an hour of vigorous physical activity a week were more likely to be depressed and withdrawn than those who exercised regularly.” The design for assessing the boys’ physical activity and mood was to simply observe the values for these variables as they naturally occurred. For this reason, we can’t rule out a very different explanation for what the researchers observed in their sample of boys: Perhaps being in a good mood makes a teenager more likely to exercise.
Good data production is an essential part of the “big picture” of statistics. We must keep its principles in mind as we progress later on in the book to summarizing data, understanding probability, and performing statistical inference.
Throughout this part of the book, we will establish guidelines for ideal production of data. It is important for us to strive to achieve these standards. Realistically, however, it is rarely possible to carry out a study that is completely free of flaws. Therefore, we must frequently apply common sense to decide which imperfections we can “live with,” and which ones could completely undermine a study’s results.
1 Data Production: Take sample data from the population, with sampling and study designs that avoid bias
2 Displaying and Summarizing: Use appropriate displays and summaries of the sample data, according to variable types and roles
3 Probability: Assume we know what’s true for the population; how should random samples behave?
4 Statistical Inference: Assume we only know what’s true about sampled values of a single variable or relationship; what can we infer about the larger population?
Part II
Displaying and Summarizing Data
Displaying and Summarizing Data: An Overview
Before going into detail about the two steps in data production (sampling and design), we discussed the fact that the way we handle statistical problems depends on the number and type of variables involved. We either have a single categorical variable, a single quantitative variable, or a relationship between, respectively, a categorical and a quantitative variable, two categorical variables, or two quantitative variables. Categorical variables are summarized by telling counts, proportions, or percents in the category of interest, whereas quantitative variables are often summarized by reporting the mean. Whenever we are interested in the relationship between two variables, it is important to establish which (if any) plays the role of explanatory variable and which is the response. The roles played by the variables will determine which displays and summaries are appropriate.
Once we establish what is true about a variable or relationship in a random sample, we will be in a position to say something about what is true for the larger population. Throughout this book, we must take care to distinguish between samples and populations.
Definitions
A number that summarizes a sample is called a statistic.
A number that summarizes the population is called a parameter.
The most common statistics of interest are the sample proportion p̂ (called “p-hat”) and the sample mean x̄ (called “x-bar”), corresponding to the parameters population proportion p and population mean μ (called “mu”). These will be formally defined as we encounter them in Chapter 4.
Identifying Statistics and Parameters
Here are some situations featuring either statistics or parameters.
■ 19% of 2,366 surveyed Americans said they believed money can buy happiness.
[...] proportion of all Americans who believe money can buy happiness is a parameter p.
■ A New York Times article entitled “The DNA 200” reports that the first 200 inmates to be cleared through DNA evidence, from January 1989 to April 2007, averaged 12 years in prison.[1]
Here the number 12 is a parameter μ because it is talking about the mean years for the population of all 200 inmates exonerated thus far.
Results of a survey taken by several hundred students in introductory statistics classes at a particular university provide a good source of real-life examples corresponding to each of the 5 variable situations, from one categorical variable to two quantitative variables. These students reported their age, whether or not they’d eaten breakfast that day, how many minutes they spent on the computer the day before, and so on. To gain experience in working with real data, we will often produce displays and summaries, and later perform statistical inference, using this data set. Because our summaries of the survey data correspond to a sample, we will treat those summaries as statistics, not parameters.
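To make the two most common statistics concrete, here is a minimal sketch that computes a sample proportion and a sample mean. The eight breakfast answers and eight computer-time values are invented for illustration (they are not the actual student survey data), and the variable names are assumptions of this sketch. The last lines use the figures quoted above, 19% of 2,366 surveyed Americans, to recover the approximate count behind that percentage.

```python
from statistics import mean

# Categorical variable: whether each sampled student ate breakfast
# (hypothetical values, invented for this sketch).
ate_breakfast = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"]

# Quantitative variable: minutes spent on the computer the day before
# (hypothetical values, invented for this sketch).
computer_minutes = [30, 120, 45, 60, 0, 90, 15, 40]

# Sample proportion (p-hat): count in the category of interest / sample size.
p_hat = ate_breakfast.count("yes") / len(ate_breakfast)

# Sample mean (x-bar): sum of the values / sample size.
x_bar = mean(computer_minutes)

# From the happiness survey quoted in the text: 19% of 2,366 respondents.
approx_count = round(0.19 * 2366)

print(f"p-hat = {p_hat}, x-bar = {x_bar}, approximate count = {approx_count}")
```

Both p_hat and x_bar summarize only the sampled individuals, so they are statistics. The corresponding parameters p and μ would describe the entire population and are usually unknown.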
Keeping the Big Picture in Perspective
In Part I, we learned about good sampling technique, to ensure that the sample truly represents the larger population about which we want to draw conclusions. We also learned how to design good studies so that the information obtained about the variables or relationships accurately reflects the truth about the sampled individuals. Adhering to good principles of sampling and design is vital for the theory developed in Part III, when we assume a population parameter is known, and learn how the corresponding sample statistic behaves. The behavior is predictable only if the statistic summarizes data values that are unbiased. The same principles continue to be essential for the more practical techniques learned in Part IV, when we use sample statistics to draw conclusions about unknown population parameters. Again, those conclusions will be correct only if the statistic is unbiased.
Keeping in mind that the sampling technique and study design could have an impact on the data that are produced, we undertake in Part II to summarize data gathered about single variables and about relationships. In other words, we will now learn how to find relevant sample statistics for the data at hand. The following diagram shows how summarizing data fits into the “big picture” of statistics.
1 Data Production: Take sample data from the population, with sampling and study designs that avoid bias
2 Displaying and Summarizing: Use appropriate displays and summaries of the sample data, according to variable types and roles
3 Probability: Assume we know what’s true for the population; how should random samples behave?
4 Statistical Inference: Assume we only know what’s true about sampled values of a single variable or relationship; what can we infer about the larger population?
Part III
Probability
Introduction to Probability
Our ultimate goal in this book is to perform statistical inference: Use a sample statistic (such as sample mean or sample proportion) to draw conclusions about an unknown population parameter (like population mean or population proportion).
1 Data Production: Take sample data from the population, with sampling and study designs that avoid bias
2 Displaying and Summarizing: Use appropriate displays and summaries of the sample data, according to variable types and roles
3 Probability: Assume we know what’s true for the population; how should random samples behave?
4 Statistical Inference: Assume we only know what’s true about sampled values of a single variable or relationship; what can we infer about the larger population?
Political polls provide a straightforward example of the kind of reasoning involved in performing statistical inference. First, keeping in mind principles established in Part I, researchers would design and implement a survey to poll people about their views before a presidential election. Methods of Part II would indicate that the results (categorical) could be summarized with a percentage. Suppose that 54% in the sample of 1,000 voters intend to vote for a particular candidate, and we would like to decide whether or not the majority (more than 50%) of all voters intend to vote for that candidate.
[...] 50% (no more) of all voters favor that candidate. Then we would determine how probable or improbable it would be to find as many as 54%, in a random sample of 1,000 voters, intending to vote for that candidate. If it turns out to be extremely unlikely to get a sample percentage as high as 54% when the population percentage is only 50%, then we’d conclude that the population percentage is not so low as 50%. It is almost certainly more.
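The probability needed in the election example can be computed directly, as a sketch, by counting outcomes. The code below assumes each of the 1,000 sampled voters independently favors the candidate with probability 0.5 (a simple binomial model, which is an assumption of this illustration rather than a method developed at this point in the text) and finds the chance of seeing 540 or more voters in favor.

```python
from math import comb

n = 1000          # sample size from the election example
p = 0.5           # assumed population proportion ("50%, no more")
observed = 540    # 54% of 1,000 voters

# With p = 0.5, every sequence of 1,000 answers is equally likely, so the
# probability is (number of sequences with at least 540 in favor) / 2**n.
favorable = sum(comb(n, k) for k in range(observed, n + 1))
prob_at_least_54_percent = favorable / 2**n

print(f"P(sample percentage >= 54% | population 50%) = {prob_at_least_54_percent:.4f}")
```

Under these assumptions the probability comes out well below 1%, which matches the reasoning in the text: a sample percentage as high as 54% would be very unlikely if only 50% of all voters favored the candidate.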
The key to making a decision in our election example is finding the likelihood (or unlikelihood) of obtaining a certain sample percentage, given a claimed population percentage. Thus, it is a probability that brings about our final decision. Referring to our sketch of the “big picture,” we are ready now to tackle the third major step in the four-step process of learning to perform statistical inference.
By the end of Part III, we will have established the necessary theory to evaluate probabilities like the one needed to solve the election example above. This theory is by no means simple, and must be developed gradually. We will begin by learning basic and more general rules of probability (the science), which is the formal study of random behavior. Next, we learn about the behavior of random variables, which are a particular kind of quantitative variable whose values are a result of some random process (such as random sampling). This leads to the chapter on sampling distributions, which tell the behavior of two random variables of particular interest: sample proportion and sample mean. By this time we will be able to determine, for a given population parameter, how the corresponding statistic behaves in the long run for random samples. This sets the stage for inference in Part IV, when we turn this knowledge around, and for a given statistic (such as sample proportion), determine what should be true about the corresponding parameter (such as unknown population proportion).
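The long-run behavior of a statistic over many random samples can be previewed with a simulation. This is a rough sketch, not the theory itself: the population proportion of 0.5 and sample size of 1,000 come from the election example, while the number of repetitions (1,000), the seed, and the helper name sample_proportion are arbitrary choices made for this illustration.

```python
import random
from statistics import mean, stdev

random.seed(2)  # fixed seed so the sketch is repeatable

population_proportion = 0.5   # assumed known, as in Part III
sample_size = 1000
repetitions = 1000            # arbitrary number of simulated samples

def sample_proportion(p, n):
    """Proportion of 'yes' answers in one simulated random sample of size n."""
    yes_count = sum(1 for _ in range(n) if random.random() < p)
    return yes_count / n

proportions = [sample_proportion(population_proportion, sample_size)
               for _ in range(repetitions)]

center = mean(proportions)    # where the sample proportions are centered
spread = stdev(proportions)   # how much they vary from sample to sample
share_at_least_54 = sum(1 for ph in proportions if ph >= 0.54) / repetitions

print(f"center = {center:.4f}, spread = {spread:.4f}")
print(f"share of samples with p-hat >= 0.54: {share_at_least_54:.3f}")
```

In a run like this the simulated sample proportions cluster closely around the population proportion, and only a small share of them reach 54%. Sampling distributions, developed later in Part III, describe this pattern without the need for simulation.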
Now that we are about to begin our formal study of random behavior (the science of probability), it is a good time to remind ourselves of the importance of techniques learned in Part I, on data production. Randomization was the key to producing unbiased samples for observational studies, and the key to establishing causation in experiments. Now we should take note of the fact that the entire theory of probability developed in Part III, on which the applications in Part IV depend, requires that selections or assignments have been made at random.
In Part II, when we learned various display and summary techniques, we compartmentalized the topics according to number and type of variables involved. There were five basic situations, as illustrated in the diagram below: one categorical variable, one quantitative variable, one each categorical and quantitative, two categorical variables, and two quantitative variables.
In Part IV, when we learn to draw conclusions about the larger population, based on sample data, we will again handle one situation at a time, depending on number and type of variables. Now, in Part III, there will occasionally be subtle shifts from one to two categorical variables, or from quantitative to categorical variables and vice versa. Instead of focusing on number and type of variables, we concentrate, especially in Chapter 6, on the general rules that govern random behavior in any of these five situations.
[Diagram: the five variable situations: one categorical variable (C); one quantitative variable (Q); one categorical and one quantitative variable (C → Q); two categorical variables (C → C); two quantitative variables (Q → Q)]
Statistical Inference: An Overview
Whether or not we state it explicitly, whenever information is gathered about a group of individuals, we almost always want to generalize to a larger group. A poll finds what proportion of surveyed voters favor a particular candidate, to get an idea of what proportion of all voters favor that candidate. An experimenter determines how much more weight is lost by some dieters who exercise, compared to some dieters who don’t, to draw conclusions about weight loss by all dieters who do or don’t exercise.
Most people, even if they have never taken a statistics course, are not so naive as to believe that what is true for a sample must also be exactly true for the larger population. But unless they have a knowledge of statistical principles, people are unable to judge to what degree information about a sample can be extended to the general population. This book teaches you to be an educated consumer of statistical information, so that by the time you have finished this final (and most important) part, you will have the skills to make such generalizations carefully and correctly. These skills will enable you to decide, given poll results, whether or not a majority of all voters favor a candidate. They will let you estimate, given results of an experiment, how many more pounds any dieter stands to lose if he or she exercises regularly.
Inference in the Big Picture
Our diagram of the four processes should help remind you of how this fourth and final process fits into the “big picture” of statistics.
By now we have considered how to produce an unbiased sample, and how to display and summarize the sample data, depending on what types of variables are involved. We have established important principles of probability theory, and are ready to make practical use of these results: Now that we know how samples tend to behave relative to populations, we turn this knowledge around and discover what is likely to be true about a population, given what we have observed in a sample. Our knowledge about the population, based on the sample, will not be perfect, but methods about to be presented will enable us to quantify the uncertainty of our conclusions. This final step, inference, is highlighted in our diagram because it is the task at hand. The five variable situations are also shown because each situation calls for a different approach to inference.
Two Major Forms of Inference
No matter which of the five situations applies, our inference about the larger population, based on the sample, may take one of two forms: confidence intervals or hypothesis tests.
• Setting up a confidence interval is a way of presenting a range of plausible values for the unknown population parameter. The interval tells us what values are, in a sense, believable.
• Carrying out a hypothesis test is a way of deciding whether or not a particular proposed value for the unknown parameter is plausible. In the case of relationships between two variables, a hypothesis test is especially important because it helps us decide whether or not there is convincing evidence that those variables are related in the larger population, not just in the sample.
In the next five chapters we will systematically consider both forms of inference (confidence intervals and hypothesis tests) for each of the five variable situations. As you advance through these chapters, you may want to refer back to this overview occasionally, to help keep the “big picture” in perspective throughout.
[Diagram: the four processes of statistics, with Statistical Inference highlighted.
1. Data Production: Take sample data from the population, with sampling and study designs that avoid bias.
2. Displaying and Summarizing: Use appropriate displays and summaries of the sample data, according to variable types and roles.
3. Probability: Assume we know what’s true for the population; how should random samples behave?
4. Statistical Inference: Assume we only know what’s true about sampled values of a single variable or relationship; what can we infer about the larger population?]
Introduction: Variables and Processes in Statistics
What can you accomplish with this book, and how?
Before the semester starts, a statistics teacher wants to organize a box of hundreds of newspaper clippings and Internet reports collected in the past couple of years:
• “Dark Chocolate Might Reduce Blood Pressure”
• “Almost Half of U.S. Internet Users ‘Google’ Themselves”
• “Vampire Bat Saliva Researched for Stroke”
• “Environmental Mercury, Autism Linked by New Research”
There are several reports on smoking and on obesity, but for most of the topics (such as bat saliva) there is only one article. How can the teacher sort all of those articles in a way that will make them easy to access for future reference?
At the end of the semester, a group of statistics students are studying together, trying to solve practice final exam problems such as these:
• Suppose systolic blood pressures for 7 patients who ate dark chocolate daily for two weeks dropped an average of 5 points, whereas those of a control group of 6 patients who ate white chocolate remained unchanged. If the standardized difference between blood pressure decreases was 2.1, do we have convincing evidence that dark chocolate is beneficial?
• According to a 2007 report, 47% of 1,623 U.S. Internet users surveyed by the Pew Internet & American Life Project had searched for information about themselves online. Give a 95% confidence interval for the percentage of all U.S. Internet users who searched online for information about themselves.
• Researchers found that 9 out of 15 stroke patients receiving vampire bat saliva had an excellent recovery, compared with 4 out of 17 who were untreated. Does this provide evidence that bat saliva is effective in treating stroke patients?
• Research in a large sample of Texas school districts found that for every 1,000 pounds of environmentally released mercury, there was a 17% increase in autism rates. If one district has 300 additional pounds of environmental mercury compared to another, how much higher do we predict its autism rate to be?
The students may feel overwhelmed in trying to find the right approach to each of the problems, after having learned a whole semester’s worth of various statistical procedures. How can the students figure out which procedure is the right one for each problem?
The answer for both teacher and students is a simple one, and it will also be the key for you to understand what this book is all about, from beginning to end. The way we handle statistical problems depends on the number and types of variables involved.
A variable, as the name suggests, is something that varies for different individuals: Blood pressure is a variable because it takes different values for different people; recovery from a stroke is a variable because some patients have an excellent recovery and others do not. The individuals with variable traits in many cases are people, but individuals can be anything that we are interested in, from penguins to school districts to planets.
Types of Variables: Categorical or Quantitative
Virtually all of the situations encountered in this book will involve either a single variable or the relationship between two variables. A variable’s type is categorical if it takes qualitative values such as sex, race, or the response to a yes-or-no question. The type is quantitative if the variable takes number values for which arithmetic makes sense, such as age, number of siblings, or rating something on a scale of 1 to 10.
Definitions
A variable is a characteristic that differs for different individuals.
A categorical variable takes qualitative values that are not subject to the laws of arithmetic.
A quantitative variable takes number values for which arithmetic makes sense.
A relationship (also known as an association) exists between two variables if certain values of one tend to occur with certain values of the other.
The statistics teacher can divide the clippings into just five piles:
1. One categorical variable
2. One quantitative variable
3. One categorical variable and one quantitative variable
4. Two categorical variables
5. Two quantitative variables
Likewise, the statistics students just need to identify the number and type of variables involved in each problem, and this will suggest what statistical procedure should be applied.
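The sorting into five piles can be sketched in a few lines of Python. This is only an illustration, not a procedure from the text: the function name `variable_situation` and the string labels are assumptions chosen for the sketch, and the variable types passed in are the ones this chapter assigns to the four headlines.

```python
def variable_situation(types):
    """Name the pile (one of the five situations) for one or two variable types.

    `types` is a list holding one or two strings, each either
    "categorical" or "quantitative" (labels assumed for this sketch).
    """
    allowed = {"categorical", "quantitative"}
    if not 1 <= len(types) <= 2 or any(t not in allowed for t in types):
        raise ValueError("expected one or two of: categorical, quantitative")
    if len(types) == 1:
        return f"one {types[0]} variable"
    if types[0] == types[1]:
        return f"two {types[0]} variables"
    return "one categorical variable and one quantitative variable"


# Chocolate (eaten or not) and blood pressure
print(variable_situation(["categorical", "quantitative"]))
# Whether or not an Internet user searched for himself or herself
print(variable_situation(["categorical"]))
# Bat saliva (given or not) and stroke recovery (excellent or not)
print(variable_situation(["categorical", "categorical"]))
# Pounds of mercury and autism rate, by school district
print(variable_situation(["quantitative", "quantitative"]))
```

The order of the two types does not matter to the function, which reflects the fact that the piles depend only on how many variables there are and what types they are.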
This book features “Students Talk Stats” examples and exercises that are discussions by four prototypical students, highlighting many of the most important ideas in statistics. As you gradually rise to higher levels of understanding of statistical concepts and procedures, you may find you can relate to their struggles and discoveries. Our first such discussion will help you begin to develop the skill of identifying what types of variables are involved when you are presented with any report containing statistical information.
A CLOSER LOOK
Variables with number values, like ZIP codes, are categorical if the numbers are labels, not signifying an amount that can be quantified. For example, if half of a group of students have a ZIP code 15217 and the other half 15213, we can’t say that the typical ZIP code is the average of these, 15215.
Identifying Types of Variables
Four students who have recently enrolled in a statistics class are browsing through news articles on the Internet, thinking about what kind of variables are involved.
Adam: “I’m in the mood for chocolate, so I’m looking at this article that says ‘Dark Chocolate Might Reduce Blood Pressure.’ I’m pretty sure blood pressure is quantitative but couldn’t chocolate go either way?”
Brittany: “Realistically, I think they’d just compare people who do and don’t eat dark chocolate, which would make it categorical. Here’s one that says ‘Almost Half of U.S. Internet Users “Google” Themselves.’ Half is a number so it’s quantitative.”
Carlos: “Half is talking about the overall fraction, but for each person, they just recorded whether or not they Googled themselves, so it’s categorical. What about ‘Vampire Bat Saliva Researched for Stroke’? I picture they handled bat saliva like Brittany said they’d handle chocolate: some people get it and others don’t. I don’t think it would be easy to put a number on recovery from a stroke, so that variable’s probably categorical, too.”
Dominique: “I’m confused about this one: ‘Environmental Mercury, Autism Linked by New Research.’ Mercury would be quantitative, and I think of autism as being categorical, but the report says they looked at autism rates in different school districts depending on how much mercury was in the area. Would that make autism quantitative?”
Adam is correct that blood pressure is quantitative, and Brittany rightly guesses that chocolate consumption in this case would be categorical. Carlos has correctly identified Googling one’s self as a categorical variable in the second article, and is on the right track that both bat saliva and stroke recovery would be categorical. Finally, although autism for individual people would be categorical, if a study considers autism rates for a sample of school districts, then the variable is quantitative. Dominique is right about both mercury and autism rate being quantitative variables in this study.
Practice: Try Exercise 1.2 on page 11.
Although variable type is usually fairly straightforward to identify, some “crossover” from one type to the other may take place, such as in the autism/mercury study discussed above by the four students, as well as in the following pair of examples.
EXAMPLE 1.1 When a Categorical Variable Gives Rise to a Quantitative Variable
Background: Individual teenagers were surveyed as to whether they have used marijuana, and whether they have used harder drugs.
Researchers then looked at the percentage of teenagers using marijuana and the percentage using harder drugs in various countries around the world to see if those two variables are related.
Questions: What kinds of variables are involved in the first situation? What kinds of variables are involved in the second situation?
Responses: The first situation explores the relationship between two categorical variables. The second explores the relationship between two quantitative variables.
Practice: Try Exercise 1.6(a,b) on page 12.
[Table: Country, % Marijuana, % Harder Drugs; data rows not recovered]
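The crossover in Example 1.1 can be sketched in Python: yes-or-no answers recorded for individual teenagers (categorical) become one pair of percentages per country (quantitative) once the individuals studied are countries. The records below are hypothetical, with made-up country labels "A" and "B"; they are not the survey's data.

```python
from collections import defaultdict

# Hypothetical records: (country, used_marijuana, used_harder_drugs).
records = [
    ("A", True, False), ("A", True, True), ("A", False, False), ("A", False, False),
    ("B", True, True), ("B", True, True), ("B", True, False), ("B", False, False),
]


def percentages_by_country(records):
    """Turn per-person yes/no answers into two percentages per country."""
    # For each country: [people surveyed, marijuana users, harder-drug users]
    totals = defaultdict(lambda: [0, 0, 0])
    for country, marijuana, harder in records:
        counts = totals[country]
        counts[0] += 1
        counts[1] += marijuana  # True counts as 1, False as 0
        counts[2] += harder
    return {c: (100 * m / n, 100 * h / n) for c, (n, m, h) in totals.items()}


print(percentages_by_country(records))
```

Each teenager contributes a category; each country ends up with two numbers for which arithmetic makes sense.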
to analyze the data, they simply classified the babies as being below 6 pounds (considered below normal) or not, along with information about whether the mothers had been X-rayed while pregnant.
Questions: What kind of variables were involved in the first situation? In the second situation?
Responses: The first situation involves one categorical variable (mother had dental X-rays or not) and a quantitative variable (baby’s weight). The second situation involves two categorical variables because babies’ weights are now categorized into two groups.
Practice: Try Exercise 1.8 on page 12.
Original variable (birth weight) is quantitative:  5.5  5.8  5.8  6.0  6.2  6.3
Categorized as below normal (b) or normal (n):     b    b    b    n    n    n
Handling Data for Two Types of Variables
We refer to recorded values of categorical or quantitative variables as data. The science of statistics is all about handling data.
Definitions
Data are pieces of information about the values taken by variables for a set of individuals.
The science of statistics concerns itself with gathering data about a group of individuals, displaying and summarizing the data, and using the information provided by the data to draw conclusions about a larger group of individuals.
Before we go into detail about the process of gathering data, it helps to have an idea of how we will handle the data when the time comes. Categorical variables are summarized by telling count or proportion or percentage in the category of interest. The most common way of summarizing quantitative variables is with their mean (same thing as average), although we will discuss other useful summaries a bit later in this book.
Definitions
The count in a category of interest is simply the number of individuals in that category.
The proportion in a category of interest is the number of individuals in that category, divided by the total number of individuals considered.
The percentage in a category of interest is the proportion (as a decimal) multiplied by 100%.
The mean of a set of values is their sum divided by the total number of values.
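The four definitions above translate directly into short Python functions. The function names and the two small data sets are assumptions made for this sketch; the data are hypothetical.

```python
def count_in_category(values, category):
    """Number of individuals whose recorded value is the category of interest."""
    return sum(1 for v in values if v == category)


def proportion_in_category(values, category):
    """Count in the category divided by the total number of individuals."""
    return count_in_category(values, category) / len(values)


def percentage_in_category(values, category):
    """Proportion (as a decimal) multiplied by 100."""
    return 100 * proportion_in_category(values, category)


def mean(values):
    """Sum of the values divided by how many values there are."""
    return sum(values) / len(values)


# Hypothetical categorical data: one yes/no answer per individual.
answers = ["yes", "no", "no", "no"]
# Hypothetical quantitative data: ages of four individuals.
ages = [19, 20, 22, 23]

print(count_in_category(answers, "yes"))       # count
print(proportion_in_category(answers, "yes"))  # proportion
print(percentage_in_category(answers, "yes"))  # percentage
print(mean(ages))                              # mean
```

Note that the first three summaries only ask whether each value matches a category, so they work for categorical data, whereas the mean adds the values and so requires number values for which arithmetic makes sense.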
Students may be misled to think that the variable of interest in a situation is quantitative because they see a number attached to it. In fact, that number may be a count or a proportion or a percentage summarizing values of a categorical variable. It may help to think about how data values are being recorded for each individual in a sample in order to decide whether the variable of interest is categorical or quantitative, as Carlos did in the four students’ discussion on page 3.
LOOKING AHEAD
Many real-life studies, including many discussed in this book, convert quantitative variables to categorical in order to simplify matters.
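Such a conversion can be sketched in Python using the six birth weights listed with the X-ray example and its 6-pound cutoff. The function name `categorize` and the labels "b" and "n" (below normal, normal) follow that example's table; everything else is an assumption of the sketch.

```python
# Birth weights in pounds, as listed with the X-ray example.
birth_weights = [5.5, 5.8, 5.8, 6.0, 6.2, 6.3]


def categorize(weight, cutoff=6.0):
    """Label a weight 'b' (below normal) if under the cutoff, else 'n' (normal)."""
    return "b" if weight < cutoff else "n"


categories = [categorize(w) for w in birth_weights]
print(categories)  # ['b', 'b', 'b', 'n', 'n', 'n']

# Once converted, the variable is summarized with a proportion, not a mean.
proportion_below = categories.count("b") / len(categories)
print(proportion_below)  # 0.5
```

The conversion simplifies the analysis at a cost: a baby weighing 5.5 pounds and one weighing 5.8 pounds are no longer distinguished.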
Most of the data that statisticians handle, and most of the data that we encounter in our everyday lives, come from some subgroup, called a sample, as opposed to the entire group of interest, called the population. Occasionally, we have access to information about the entire population, gathered via a census. This was the case in Example 1.4 about earnings of various demographic groups in the United States.
EXAMPLE 1.3 Summarizing Categorical Variables
Background: An article entitled “New Test-Taking Skill: Working the System” reports: “Indeed, although only a tiny fraction (1.9%) of students nationwide got special accommodations for the SAT, the percentage jumps fivefold for students at New England prep schools. At 20 prominent Northeastern private schools, nearly one in 10 students received special treatment.”1
Question: What type of variable is featured here, and how is it summarized?
Response: For each student in the entire nation or in the private schools examined, it is recorded whether or not the student was granted special accommodations in taking the SAT test. This is a categorical variable, summarized by telling the percentage or proportion in the category of interest (receiving special accommodations).
Practice: Try Exercise 1.10 on page 12.
EXAMPLE 1.4 Summarizing Quantitative Variables
Background: An article entitled “Racial Gaps in Education Cause Income Tiers” reports: “On average, a white man with a college diploma earned about $65,000 in 2001. Similarly educated white women made about 40% less, while black and Hispanic men earned 30% less ...”2
Question: How would earnings for each group (such as white women or Hispanic men) be summarized: with a mean or with a proportion?
Response: Earnings are a quantitative variable and could be summarized for each group with a mean, namely $65,000 for college-educated white men, and a mean that is less by 40% of $65,000 for college-educated white women, that is, $39,000.
Practice: Try Exercise 1.11 on page 12.
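The arithmetic behind the $39,000 figure in Example 1.4 can be checked with a short Python sketch. The helper name `reduced_by_percent` is an assumption of the sketch, and the second figure it prints is only what the quoted “30% less” would imply; the article as quoted does not state that dollar amount.

```python
def reduced_by_percent(amount, percent_less):
    """Whole-dollar amount that is `percent_less` percent lower than `amount`."""
    return amount * (100 - percent_less) // 100


white_men_mean = 65_000  # approximate mean quoted in the article

# 40% less than $65,000, as computed in the Response.
print(reduced_by_percent(white_men_mean, 40))  # 39000

# Implied by the quoted "30% less"; not a figure given in the article.
print(reduced_by_percent(white_men_mean, 30))  # 45500
```

Integer arithmetic is used so the results are exact dollar amounts; since the article's figures are themselves approximate ("about $65,000"), the outputs should be read as approximate means as well.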
Definitions
A sample is a subset taken from a larger group, and the larger group of interest is the population.
A census, according to Webster’s dictionary, is a “usually complete enumeration of the population,” and we think of a census in general as a survey intended to include all citizens in a given area. When we talk about “the Census,” we are referring to the U.S. Census, conducted regularly since 1790, and designed to gather more and more detailed information about America’s population.
A CLOSER LOOK
In cases like this, where values of a quantitative variable are being compared for two or more categorical groups, a summary occasionally quantifies the differences by reporting what percent higher or lower another mean is from the original mean.
Once census results are summarized, as in Example 1.4, there are no further statistical procedures needed to draw conclusions about the “larger population.”
EXAMPLE 1.5 When Information Is Provided for an Entire Population
Background: “Are Feeding Tubes Over-Prescribed?” describes a Harvard Medical School study that “involved 1999 data from all 15,135 licensed U.S. nursing homes at the time.”3 The study found that “one-third of U.S. nursing home patients in the final stages of Alzheimer’s and other forms of dementia are given feeding tubes, despite evidence that the practice serves no benefit and may even cause harm.” The variable of interest here is whether or not nursing home patients in the final stages of Alzheimer’s or other forms of dementia are given feeding tubes, a categorical variable that is summarized with the proportion 1/3.
Question: Why would it not be appropriate to generalize the study’s results to a larger population?
Response: It is not possible to generalize this result to a larger group because it already refers to patients in all nursing homes at the time, rather than to a sample comprising a subset of those patients.
Practice: Try Exercise 1.14 on page 13.
Roles of Variables: Explanatory or Response
By far the most interesting and useful statistical studies involve relationships between variables. How we approach the data will depend on what roles the variables play in their relationship. There are occasionally situations where two variables have “equal footing” in the relationship, such as in a study of the relationship between football teams’ rankings in offense and in defense. For the most part, however, one variable is thought to cause changes in, or at least to explain, values of the other: It is called the explanatory variable. The other variable is impacted by, or responds to, the first: It is called the response variable. A more complicated relationship can involve more than one explanatory or response variable.
Definitions
Causation exists between two variables if changes in values of the first are actually responsible for changes in values of the second.
The explanatory variable in a relationship between two variables is the one that is presumed to impact the other variable, called the response variable.
In the following diagram of the five possible situations introduced on page 2, the last three involve a relationship. The direction of the arrow goes from explanatory to response variable. Because relatively few actual situations of interest involve a quantitative explanatory and categorical response variable, and because the analysis is fairly advanced compared to the others, we will not analyze such situations in this book.
[Diagram: the five variable situations: one categorical variable (C); one quantitative variable (Q); one categorical and one quantitative variable (C → Q); two categorical variables (C → C); two quantitative variables (Q → Q)]
Example 1.6 illustrates the five situations in a variety of contexts.
EXAMPLE 1.6 Identifying Variable Types and Roles
Background: Consider these headlines:
• “Men Are Twice as Likely as Women to Be Hit by Lightning”
• “35% of Returning Troops Seek Mental Health Aid”
• “Do Oscar Winners Live Longer Than Less Successful Peers?”
• “Smaller, Hungrier Mice”
• “County’s Average Weekly Wages at $811, Better Than U.S. Average”
Questions: What type of variables are involved in each of these situations? If the relationship between two variables is of interest, which plays the role of explanatory variable and which is the response?
Responses:
• “Men Are Twice as Likely as Women to Be Hit by Lightning” (C → C): We consider two categorical variables: gender and whether or not a person is hit by lightning. Gender would be the explanatory variable and being hit by lightning or not is the response. The other way around wouldn’t make sense because being hit by lightning could not have an impact on a person’s gender.
• “35% of Returning Troops Seek Mental Health Aid” (C): Whether or not a returning soldier seeks mental health aid is a single categorical variable.
• “Do Oscar Winners Live Longer Than Less Successful Peers?” (C → Q): This involves a categorical explanatory variable (being an Oscar winner or not) and a quantitative response variable (length of life).
• “Smaller, Hungrier Mice” (Q → Q): This brief headline suggests a relationship between two quantitative variables: the size of a mouse and its appetite. Size apparently plays the role of explanatory variable, so that as size goes down, the amount of food desired goes up.
• “County’s Average Weekly Wages at $811, Better Than U.S. Average” (Q): This involves just one quantitative variable, weekly wages. If wages for one county had been compared to those of another county, then there would have been an additional categorical explanatory variable. Comparing this county’s wages to those of the United States in general is a different kind of comparison, where the county residents may be thought of as a single sample, coming from the larger population of U.S. residents.
Practice: Try Exercise 1.17 on page 13.
Statistics as a Four-Stage Process
Before we begin to learn about the first stage in the process of statistical analysis, we should consider how all the stages fit together to accomplish our overall goal. On page 5, we stated that, as a science, statistics is used to produce information from a sample, summarize it, and then draw conclusions about the larger population from which the sample came. Those conclusions, known as statistical inference, can be reached only if we have some knowledge of the workings of random behavior, which comes under the realm of the science of probability.
Definitions
A random occurrence is one that happens by chance alone, and not according to a preference or an attempted influence.
Probability is the formal study of the likelihood or chance of something occurring in a random situation. In the context of statistics, probability explores the behavior of random samples taken from a larger population.
Statistical inference is the scientific process of drawing conclusions about a population based on information from a sample.
Thus, our goal can be reached in four stages, which will be addressed one at a time in the book’s four parts.
1. Data production: How to select a representative sample, and how to properly assess values of variables for that sample.
2. Displaying and summarizing data: Depicting and describing single quantitative or categorical variables of interest, or relationships between variables if there are two variables involved.
3. Probability: The scientific process wherein we assume we actually know what is true for the entire population, and conclude what is likely to be true for a sample drawn at random from that population.
4. Statistical inference: Using what we have discovered about the variables of interest in a random sample to draw conclusions about those variables for the larger population.
It is easy for a student to lose sight of these long-term goals, as he or she concentrates on learning particular concepts and techniques. Throughout the book, the following diagram will help remind you of how each new topic fits into the “big picture.” A reminder of variable types and roles is included because awareness of the variables involved is always an important part of the statistical picture.
[Diagram: the four processes of statistics.
1. Data Production: Take sample data from the population, with sampling and study designs that avoid bias.
2. Displaying and Summarizing: Use appropriate displays and summaries of the sample data, according to variable types and roles.
3. Probability: Assume we know what’s true for the population; how should random samples behave?
4. Statistical Inference: Assume we only know what’s true about sampled values of a single variable or relationship; what can we infer about the larger population?]
EXAMPLE 1.7 Identifying the Four Processes
Background: Consider the following situations:
• A retail manager is asked to present some graphs and a brief report on her group’s sales over the past several months, broken down into various types of merchandise.
• Before a bookstore’s owners make plans for extensive renovations, they want to find out what customers already like about the store and what aspects are in need of change.
• A pharmaceutical company has carried out a study and determined proportions of patients experiencing nausea for those who take a certain medication and those who take a “dummy pill.” The company wants to know what claims it can make about proportions of patients experiencing nausea in the general population for those who take the medication compared to those who don’t.
• The proportion of all Americans who are of Hispanic origin is 0.13. We’d like to know how unlikely it would be to take a random sample of 1,000 Americans and find only 0.06 to be Hispanic.
Question: Which of the four processes is involved in each situation?
• The final one is a probability problem because we seek the likelihood of obtaining a certain proportion in our sample who are Hispanic.
Practice: Try Exercise 1.23 on page 14.
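To give a feel for what the probability problem in Example 1.7 asks, here is a rough Python sketch. It assumes a binomial model (each of the 1,000 sampled people independently has a 0.13 chance of being Hispanic), which is an assumption of this sketch and not a method this chapter has introduced. The function name `binomial_cdf` is likewise made up for the illustration.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) when X counts successes in n independent trials with chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n = 1000        # sample size
p = 0.13        # population proportion, assumed known (the probability setting)
observed = 60   # a sample proportion of 0.06 corresponds to 60 of 1,000

# Chance of seeing 60 or fewer Hispanic people in such a sample.
print(binomial_cdf(observed, n, p))
```

Under these assumptions the printed probability is extremely small. The direction of reasoning is what makes this a probability problem: the population proportion is taken as known, and we ask how a random sample is likely to behave.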
Chapter 1 Summary
Variables and Statistics
Characteristics that can differ from one individual to another are called variables. Variables can be either categorical or quantitative. In statistics, we study single variables or relationships between variables. At times we merely focus on variables’ values for a specific set of individuals, called a sample. More often, our goal is to generalize to a larger group, called the population.
• Data are pieces of information about the values taken by variables for a set of individuals.
• The five variable situations to be covered in this book are:
1. Single categorical variable
2. Single quantitative variable
3. Categorical explanatory and quantitative response variable
4. Two categorical variables
5. Two quantitative variables
• Categorical variables can be summarized with counts, proportions, or percentages.
• Quantitative variables can be summarized with means.
• If individuals studied are entire groups, the percentage in a particular category for each group can be treated as a quantitative variable.
• A quantitative variable can be converted into a categorical variable by grouping into ranges of values.
• The science of statistics is concerned with gathering data, summarizing it, and using that information to draw conclusions about a larger population. The latter process is known as statistical inference.
• A census gathers information about an entire population rather than just a sample.
• When the relationship between two variables is of interest, it should be determined which (if any) plays the role of explanatory variable and which is the response variable.
• A random occurrence is one that happens by chance alone, and probability is the formal study of the likelihood or chance of something occurring in a random situation.
Chapter 1 Exercises
Note: Asterisked numbers indicate exercises whose answers are provided in the Solutions to Selected Exercises section, on page 689.
*1.1 Students were asked to rate their instructor’s preparation for class as being excellent, good, or needs improvement. Response to this question is what type of variable: quantitative or categorical?
*1.2 Suppose researchers want to investigate how weight can affect blood pressure. Tell what types of variables each of these situations involves.
a. Individuals’ weights and blood pressures are recorded.
b. Individuals are classified as being normal or overweight, and their blood pressures are recorded.
c. Individuals are classified as having high or low blood pressure, and their weights are recorded in kilograms.
d. Individuals are assessed as having high or low blood pressure, and as being normal or overweight.
1.3 Prospective subjects for a study had their blood pressures recorded.
a. Is the variable of interest quantitative or categorical?
b. Would results best be summarized with a mean or with a proportion?
1.4 Before the 2004 presidential election in the United States, there was a great deal of interest concerning public opinion of the war in Iraq. For each of the following situations, tell what individuals are being studied, what variable is of interest, and whether the variable is categorical or quantitative.
a. People around the world were surveyed as to whether they approved or disapproved of the Iraq war.
b. People in various countries were surveyed as to whether they approved or disapproved of the Iraq war. For each country, it was determined what percentage of its people disapproved of the war.
c. The Guardian, a British newspaper, reported that 8 of 10 countries surveyed by leading newspapers (such as the Guardian, Canada’s La Presse, and Japan’s Asahi Shimbun) disapproved of the Iraq war.
1.5 Based on a survey of a few thousand people, a newspaper reporter wants to draw conclusions about how a country’s citizens in general feel about the war in Iraq. At this point, is the reporter mainly concerned with data production, displaying and summarizing data, probability, or performing statistical inference?
*1.6 For parts (a) and (b), tell who or what individuals are being studied, identify the variable of interest, and tell whether it is categorical or quantitative; then answer the question in part (c).
a. Adults were surveyed as to whether they were married, single, or divorced.
b. The New York Times reported, state by state, the divorce rate per 1,000 married adults in 2003. The lowest rate was in Massachusetts, with 5.7 divorces per 1,000 married people, and the highest was in Nevada, with 14.6 per 1,000.
c. Assume we have Census data on marital status of people in the United States. Are those people considered to be a sample or a population?
1.7 A New York Times reporter decides to convey information about American divorce rates by including a map of the United States. Each state is shaded from light to dark depending on how high its divorce rate is. At this point, is the reporter mainly concerned with data production, displaying and summarizing data, probability, or performing statistical inference?
*1.8 “Can Mom’s Drinking Lower Kids’ IQ?” examined the relationship between mothers’ consumption of alcohol during pregnancy and their children’s IQs. The mothers were classified as being abstainers (0 alcoholic drinks per day), light drinkers (up to 0.5 per day), moderate drinkers (0.5 to 1 per day), or heavy drinkers (more than 1 per day). Is alcohol consumption being treated as a categorical or a quantitative variable?
1.9 An article reported costs of ski-lift tickets in various resorts in a region as being less than $20, $20 to $40, $40 to $50, or more than $50. Is ticket price being treated as a categorical or a quantitative variable?
*1.10 A British survey reported in 2006 states: “Nearly 40 percent of 106 students who answered questionnaires about their attitudes said they couldn’t cope without their cell phone.”4
a. What type of variable is being considered?
b. How is the variable summarized?
*1.11 “In a study of 87 French and Swiss college students, researchers gave half of them sunscreen with a protection factor of 10 and the other half with a factor of 30. The students, who weren’t told which lotion they received, went on summer vacations and recorded the amount of time they spent in the sun. Users of the stronger sunscreen spent 25% more time in the sun, mostly sunbathing, the study found ... students in the study often waited until their skin turned red before rushing to the shade.”5
a. Is time spent in the sun being treated as a quantitative or a categorical variable?
b. How would researchers summarize time spent in the sun for each group (those with the stronger and those with the weaker sunscreen)?
1.12 A newspaper article entitled “Teens Most Likely to Have Sex at Home” notes that of the sexually active teens surveyed in the year 2000, “56% said they first had sex at their family’s home or at the home of their partner’s family.”6
a. What is the variable of interest?
b. Is the variable of interest quantitative or categorical?
c. How is the variable being summarized?
1.13 Based on results of a survey of sexually
active teenagers, sociologists would like to
be able to say whether or not a majority of
all sexually active teenagers first had sex at
their or their partner’s home. At this point,
are the sociologists mainly concerned with
data production, displaying and
summarizing data, probability, or
performing statistical inference?
*1.14 The New York Times reports: “Three out of
four workers drove to their jobs by
themselves in 2006, according to another
finding by the Census Bureau.”7 Should we
consider the workers studied to be a sample
or a population?
1.15 Mortality rates in the United States during
the 1980s and 1990s were studied by county,
race, gender, and income, with the following
results: “Asian-Americans, average per-capita
income of $21,566, have a life expectancy of
84.9 years. Western American Indians,
$10,029, 72.7 years.”8 Are these numbers
referring to samples or populations?
1.16 The American Association of Retired People
(AARP) conducted a survey in which it was
discovered that 63% of adult Americans
don’t want to live to be at least 100. On
average, those polled wanted to live to the
age of 91.
a Should we consider the Americans
polled to be a sample or a population?
b There is a categorical variable of interest
in the survey; tell roughly how the
survey question was phrased to obtain
those responses.
c There is a quantitative variable of interest
in the survey; tell roughly how the survey question was phrased to obtain those responses.
*1.17 The New York Times reported on a study of gadgets and appliances in American homes. For each of the following results, tell which of the five variable situations is involved, choosing from the following:
■ C: single categorical variable
■ Q: single quantitative variable
■ C → Q: categorical explanatory variable and quantitative response variable
■ C → C: categorical explanatory variable and categorical response variable
■ Q → Q: quantitative explanatory variable and quantitative response variable
a For each of the 17 appliances studied, the Times reported the percentage of American homes in 2001 that had the appliance. For example, microwave ovens were in 96% of the homes and answering machines were in 78% of the homes. (1) C (2) Q (3) C → Q (4) C → C (5) Q → Q
b The study made a comparison of percentage owning each appliance in 2001 to the percentage owning the appliance in 1987. For example, microwave ovens were in 66% of the homes in 1987 as opposed to 96% in 2001. Answering machines were in 10% of the homes in 1987 as opposed to 78% in 2001. (1) C (2) Q (3) C → Q (4) C → C (5) Q → Q
c The study reported 2.5 television sets owned per household in 2001. (1) C (2) Q (3) C → Q (4) C → C (5) Q → Q
1.18 The New York Times reported on a study of gadgets and appliances in American homes. For each of the 17 appliances studied, it told the percentage of American homes in 2001 that had the appliance. For example, microwave ovens were in 96% of the homes and answering machines were in 78% of the homes.
a Who or what are the individuals being studied?
b What is the variable of interest?
c Is the variable of interest quantitative or categorical?
1.19 The study that looked at prevalence of various appliances in homes in 2001, as described in Exercises 1.17 and 1.18, made a comparison to the percentages for each appliance in 1987. For example, microwave ovens were in 66% of the homes in 1987 as opposed to 96% in 2001. Answering machines were in 10% of the homes in 1987 as opposed to 78% in 2001.
a There are two variables involved; what is the explanatory variable?
b Tell whether the explanatory variable is quantitative or categorical.
c What is the response variable?
d Tell whether the response variable is quantitative or categorical.
e In which year would you expect percentages to be higher overall: 1987 or 2001, or both the same?
1.20 The New York Times study of appliances reported 2.5 television sets per household in
1.21 Suppose television advertisers want to know if age plays a role in people’s response to a rather unconventional ad that might be aired during the next Super Bowl. The ad is shown to a variety of viewers. Keeping in mind that the explanatory variable is not necessarily the first one mentioned, classify each of the following possible approaches as involving one of these relationships:
■ C → C: categorical explanatory variable and categorical response variable
■ C → Q: categorical explanatory variable and quantitative response variable
■ Q → C: quantitative explanatory variable and categorical response variable
■ Q → Q: quantitative explanatory variable and quantitative response variable
a They ask whether or not a viewer likes the ad, and record his or her age.
b They classify a viewer as being youth, young adult, middle-aged, or senior citizen, and whether or not he or she likes the ad.
c Viewers’ ages are recorded, along with their rating of the ad on a scale of 1 (most unfavorable) to 10 (most favorable).
d Viewers’ ratings of the ad on a scale of 1 to 10 are recorded, along with the viewers’ age group as being youth, young adult, middle-aged, or senior citizen.
1.22 Television advertisers are trying to decide which of the approaches outlined in Exercise 1.21 to use in an upcoming study of age and response to an advertisement. At this point, are they mainly concerned with data production, displaying and summarizing data, probability, or performing statistical inference?
*1.23 A department head wants to investigate the quality of teaching of a professor who is coming up for tenure. Tell which of the four processes (data production, displaying and summarizing, probability, or statistical inference) is involved in each of these stages:
a The department head considers whether to simply ask students to rate various aspects of the professor’s performance on a 5-point scale, or whether to also ask them to write a paragraph describing their experience in that professor’s class.
b A sample of students is surveyed, and scores on a 5-point scale are averaged for each aspect of the professor’s
d Based on the responses of sampled students, the department head concludes that the mean preparedness rating for all of the professor’s students is higher than 4.0.
1.24 Men’s Health magazine used data on body mass index, back-surgery rates, usage of gyms, etc. to grade the quality of men’s “abs” (abdominal muscles) in 60 cities across the country. If each city was given a rating between 0 and 4, such as 2.75 for Pittsburgh, then how is the variable of interest being treated, as quantitative or categorical?
1.25 Suppose Men’s Health magazine wants to present the results of the survey described in Exercise 1.24 in a way that is both appealing and informative. Is the magazine mainly concerned with data production, displaying and summarizing data, probability, or performing statistical inference?
1.26 Anthropologists studied gender differences
in public restroom graffiti, noting whether
the graffiti occurred in a men’s or women’s
room, and classifying writings as being
competitive and derogatory or advisory and
sympathetic.
a There are two variables mentioned here;
what is the explanatory variable?
b Tell whether the explanatory variable is
quantitative or categorical.
c What is the response variable?
d Tell whether the response variable is
quantitative or categorical.
e Would type of writings for each gender
be summarized with means or
proportions?
Typical graffiti for women’s room?
1.27 If researchers report that alcoholics are three times as likely to smoke compared to nonalcoholics, do they consider smoking to be the explanatory variable or the response?
1.28 If researchers report that smokers are 10 times as likely to be alcoholics compared to nonsmokers, do they consider smoking to be the explanatory variable or the response?
1.29 The Centers for Disease Control and Prevention noted that “the price of a pack of cigarettes went up 90% between 1997 and 2003.”9 Suppose students in an introductory statistics course have been asked to identify the two variables of interest here, then tell which is explanatory and which is response, and whether each is quantitative or categorical. Which student has the correct answer?
Adam: The explanatory variable is price of cigarettes, and it’s categorical because it was summarized with a percentage. The response is year, and it’s quantitative because it takes number values.
Brittany: The roles are reversed: Year is the quantitative explanatory variable and price is the categorical response.
Carlos: Year is the explanatory variable, and because just two values are possible, it’s categorical. Price is the response and it’s quantitative; 90% just tells how much the price has changed from the year 1997 to the year 2003.
Dominique: Both variables are quantitative because they both take number values; year is explanatory because it affects the price.
1.30 One-third of all nursing home patients with Alzheimer’s and other forms of dementia are given feeding tubes. Researchers want to know how unlikely it would be to find more than half in a random sample of 100 such patients to have been given feeding tubes. Are the researchers mainly concerned with data production, displaying and summarizing data, probability, or performing statistical inference?
Discovering Research: Variable Types and Roles
1.31 Hand in an article or report about a statistical study; tell what variable or variables are involved and whether they are quantitative or categorical. If there are two variables, tell which is explanatory and which is response. If summaries are mentioned, tell whether they are reporting means or proportions or something else.
Reporting on Research: Variable Types and Roles
1.32 Use the results of Exercise 1.6 and relevant findings from the Internet to make a report on divorce in the United States that relies on statistical information.
The process of data production consists of two steps: (1) obtain the sample, and (2) carry out a properly designed study to assess the variables or relationships of interest. In this chapter we will concentrate on the first step, stressing that the sample must be taken in such a way as to ensure that it represents the larger population of interest without bias.
Pick a number at random from 1 to 20. This may sound easy, but unless you get outside help from something like a computer or a table of random digits or a 20-sided die, the task is impossible. Our brains are designed to recognize and create patterns, not randomness.
Just as our brains are not equipped to guide us in selecting a number between 1 and 20 truly at random, we cannot pick a truly random sample of participants for a study “off the top of our head” without the aid of some random number generator. Random as a household word often is used to describe a selection that a statistician would call “haphazard.” Technically, a random sample must make planned use of chance so that the laws of probability apply.
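As a minimal sketch of what “outside help from a computer” might look like, the following Python lines use the standard random module to pick a number from 1 to 20 and then to draw a simple random sample. The frame of 500 numbered individuals, the sample size of 10, and the seed value are assumptions chosen only for illustration; they do not come from any study described here.

```python
import random

# Assumption: a fixed seed, used only so the sketch gives the same result each run.
rng = random.Random(2024)

# Pick one number at random from 1 to 20 (both endpoints are possible).
number = rng.randint(1, 20)

# Hypothetical sampling frame: 500 individuals labeled 1 through 500.
sampling_frame = list(range(1, 501))

# Draw 10 individuals without replacement; every individual in the frame
# has the same chance of being selected.
sample = rng.sample(sampling_frame, k=10)

print("Random number from 1 to 20:", number)
print("Selected individuals:", sorted(sample))
```

Because the selection is made by the generator rather than by a person, no individual in the frame is favored, which is the planned use of chance that a random sample requires.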
Sources of Bias in Sampling: When Selected Individuals Are Not Representative
Bias, the tendency for an estimate to deviate in one direction from the true value,
can enter into the selection process in a variety of ways. After we define some of the most common sources of bias in sampling, we will examine how they can arise
in the context of an example.
How good is the food?
Who should be asked?
How should we take a sample of individuals to
gain information about the larger group?
We defined a random
occurrence on page 9 to
be one that happens by
chance alone, and not
Definitions
Selection bias occurs in general when the sample is
nonrepresentative of the larger population of interest.
The sampling frame is the collection of all the individuals who have
the potential to be selected. It should (but does not necessarily) match
the population of interest.
A self-selected sample (also known as a volunteer sample) includes
only individuals who have taken the initiative to participate, as opposed
to having been recruited by researchers.
A haphazard sample is selected without a scientific plan, according to
the whim of whoever is drawing the sample.
The main criterion for selection in a convenience sample is that the
sampled individuals are found at a time or in a place that is handy for
researchers.
Nonresponse occurs when individuals selected by researchers decline
to be part of the sample. A sample is described as suffering from
nonresponse bias when too many individuals decline, to the extent that
there is a substantial impact on the composition of the sample.
A CLOSER LOOK
Call-in or Internet polls are practically guaranteed to be biased, often quite heavily, because they result in volunteer samples.
EXAMPLE 2.1 How Various Types of Bias Occur in Sampling
Background: A professor wants to survey a sample of six from 80 class
members to get their opinion about the course textbook.
Questions: Are these sampling methods unbiased? If not, what type of
bias enters in?
1 Ask for students to raise their hands if they would like to give their
opinion of the textbook.
2 Sample the next six students who come in to office hours.
3 Look at a class roster and, without the aid of a random number
generator, attempt to take a “random” sample of six names.
4 Assign each student in the classroom a number from 1 on up, then use
software or a table of random digits to select six at random.
5 Take a random sample from the roster of students enrolled and mail
them a questionnaire.
Responses:
1 Asking students to raise their hands yields a volunteer sample, which
would be likely to favor people with strong positive or negative
feelings about the book.
2 Asking students who come in to office hours would yield a
convenience sample, and would result in bias because students who
need help may tend to find the book difficult to understand.