Business Statistics in Practice
Using Data, Modeling, and Analytics
BUSINESS STATISTICS IN PRACTICE: USING DATA, MODELING, AND ANALYTICS, EIGHTH EDITION
Published by McGraw-Hill Education, 2 Penn Plaza, New York, NY 10121. Copyright © 2017 by McGraw-Hill Education. All rights reserved. Printed in the United States of America. Previous editions © 2014, 2011, and 2009. No part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written consent of McGraw-Hill Education, including, but not limited to, in any network or other electronic storage or transmission, or broadcast for distance learning.
Some ancillaries, including electronic and print components, may not be available to customers outside the United States.
Senior Vice President, Products & Markets: Kurt L. Strand
Vice President, General Manager, Products & Markets: Marty Lange
Vice President, Content Design & Delivery: Kimberly Meriwether David
Managing Director: James Heine
Senior Brand Manager: Dolly Womack
Director, Product Development: Rose Koos
Product Developer: Camille Corum
Marketing Manager: Britney Hermsen
Director of Digital Content: Doug Ruby
Digital Product Developer: Tobi Philips
Director, Content Design & Delivery: Linda Avenarius
Program Manager: Mark Christianson
Content Project Managers: Harvey Yep (Core) / Bruce Gin (Digital)
Buyer: Laura M. Fuller
Design: Srdjan Savanovic
Content Licensing Specialists: Ann Marie Jannette (Image) / Beth Thole (Text)
Cover Image: ©Sergei Popov, Getty Images and ©teekid, Getty Images
Compositor: MPS Limited
Printer: R. R. Donnelley
All credits appearing on page or at the end of the book are considered to be an extension of the copyright page.
Library of Congress Control Number: 2015956482
The Internet addresses listed in the text were accurate at the time of publication. The inclusion of a website does
not indicate an endorsement by the authors or McGraw-Hill Education, and McGraw-Hill Education does not
guarantee the accuracy of the information presented at these sites.
www.mhhe.com
ABOUT THE AUTHORS
Bruce L. Bowerman  Bruce L. Bowerman is emeritus professor of information systems and analytics at Miami University in Oxford, Ohio. He received his Ph.D. degree in statistics from Iowa State University in 1974, and he has over 40 years of experience teaching basic statistics, regression analysis, time series forecasting, survey sampling, and design of experiments to both undergraduate and graduate students. In 1987 Professor Bowerman received an Outstanding Teaching award from the Miami University senior class, and in 1992 he received an Effective Educator award from the Richard T. Farmer School of Business Administration. Together with Richard T. O'Connell, Professor Bowerman has written 23 textbooks. These include Forecasting, Time Series, and Regression: An Applied Approach (also coauthored with Anne B. Koehler); Linear Statistical Models: An Applied Approach; Regression Analysis: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree); and Experimental Design: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree). The first edition of Forecasting and Time Series earned an Outstanding Academic Book award from Choice magazine. Professor Bowerman has also published a number of articles in applied stochastic processes, time series forecasting, and statistical education. In his spare time, Professor Bowerman enjoys watching movies and sports, playing tennis, and designing houses.
Richard T. O'Connell  Richard T. O'Connell is emeritus professor of information systems and analytics at Miami University in Oxford, Ohio. He has more than 35 years of experience teaching basic statistics, statistical quality control and process improvement, regression analysis, time series forecasting, and design of experiments to both undergraduate and graduate students, and he has consulted for companies in the Midwest. In 2000 Professor O'Connell received an Effective Educator award from the Richard T. Farmer School of Business Administration. Together with Bruce L. Bowerman, he has written 23 textbooks. These include Forecasting, Time Series, and Regression: An Applied Approach (also coauthored with Anne B. Koehler); Linear Statistical Models: An Applied Approach; Regression Analysis: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree); and Experimental Design: Unified Concepts, Practical Applications, and Computer Implementation (also coauthored with Emily S. Murphree). Professor O'Connell has published a number of articles in the area of innovative statistical education. He is one of the first college instructors in the United States to integrate statistical process control and process improvement methodology into his basic business statistics course. He (with Professor Bowerman) has written several articles advocating this approach. He has also given presentations on this subject at meetings such as the Joint Statistical Meetings of the American Statistical Association and the Workshop on Total Quality Management: Developing Curricula and Research Agendas (sponsored by the Production and Operations Management Society). Professor O'Connell received an M.S. degree in decision sciences from Northwestern University in 1973. In his spare time, Professor O'Connell enjoys fishing, collecting 1950s and 1960s rock music, and following the Green Bay Packers and Purdue University sports.
Emily S. Murphree  Emily S. Murphree is emerita professor of statistics at Miami University in Oxford, Ohio. She received her Ph.D. degree in statistics from the University of North Carolina and does research in applied probability. Professor Murphree received Miami's College of Arts and Science Distinguished Educator Award in 1998. In 1996, she was named one of Oxford's Citizens of the Year for her work with Habitat for Humanity.
AUTHORS' PREVIEW
Business Statistics in Practice: Using Data, Modeling,
and Analytics, Eighth Edition, provides a unique and
flexible framework for teaching the introductory course
in business statistics. This framework features:
• A new theme of statistical modeling introduced in
Chapter 1 and used throughout the text
• A substantial and innovative presentation of
business analytics and data mining that provides
instructors with a choice of different teaching
options
• Improved and easier to understand discussions
of probability, probability modeling, traditional
statistical inference, and regression and time series
modeling
• Continuing case studies that facilitate student
learning by presenting new concepts in the
context of familiar situations
• Business improvement conclusions— highlighted
in yellow and designated by icons BI in the
page margins—that explicitly show how
statistical analysis leads to practical business
decisions
• Many new exercises, with increased emphasis on
students doing complete statistical analyses on
their own
• Use of Excel (including the Excel add-in MegaStat) and Minitab to carry out traditional statistical analysis and descriptive analytics; use of JMP and the Excel add-in XLMiner to carry out predictive analytics
We now discuss how these features are implemented in
the book's 18 chapters.
Chapters 1, 2, and 3: Introductory concepts and statistical modeling. Graphical and numerical descriptive methods.  In Chapter 1 we discuss data, variables, populations, and how to select random and other types of samples (a topic formerly discussed in Chapter 7). A new section introduces statistical modeling by defining what a statistical model is and by using The Car Mileage Case to preview specifying a normal probability model describing the mileages obtained by a new midsize car model (see the excerpt below):

The bell-shaped probability curve is a graph of what is called the normal probability distribution (or normal probability model), which is discussed in Chapter 6. Therefore, we might conclude that the statistical model describing the sample of 50 mileages in Table 1.7 states that this sample has been (approximately) randomly selected from a population of car mileages that is described by a normal probability distribution. We will see in Chapters 7 and 8 that this statistical model and probability theory allow us to conclude that we are "95 percent" confident that the true population mean EPA combined mileage for the new midsize model is between 31.56 − .23 = 31.33 mpg and 31.56 + .23 = 31.79 mpg.10 (As shown in Example 1.4, the mean of the sample of n = 50 mileages is 31.56 mpg.) Because we are 95 percent confident that the population mean EPA combined mileage is at least 31.33 mpg, we have strong statistical evidence that this not only meets, but slightly exceeds, the tax credit standard of 31 mpg and thus that the new midsize model deserves the tax credit.
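The exact reasoning behind this interval is given in Chapter 8; as a rough preview, the following minimal Python sketch reproduces the quoted endpoints, assuming a t-based 95 percent confidence interval computed from the sample statistics reported later in this preview (x̄ = 31.56 and s = .7977 for n = 50). The variable names are ours, not the book's.

```python
import math
from scipy.stats import t

n, xbar, s = 50, 31.56, 0.7977   # sample statistics quoted in the text

# 95% confidence interval for the mean: xbar +/- t(.025, n-1) * s / sqrt(n)
t_crit = t.ppf(0.975, n - 1)          # about 2.0096 for 49 degrees of freedom
margin = t_crit * s / math.sqrt(n)    # about .23 mpg

print(f"[{xbar - margin:.2f}, {xbar + margin:.2f}]")   # [31.33, 31.79]
```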
Throughout this book we will encounter many situations where we wish to make a statistical inference about one or more populations by using sample data. Whenever we make assumptions about how the sample data are selected and about the population(s) from which the sample data are selected, we are specifying a statistical model that will lead to what we hope are valid statistical inferences. In Chapters 13, 14, and 15 these models become more complex and not only specify the probability distributions describing the sampled populations but also specify how the means of the sampled populations are related to each other. For example, we might relate the response variable sales of a product to the predictor variables advertising expenditure and price. In order to relate a response variable such as sales to one or more predictor variables so that we can explain and predict values of the response variable, we sometimes use a statistical technique called regression analysis and specify a regression model.

The idea of building a model to help explain and predict is not new. Sir Isaac Newton's equations describing motion and gravitational attraction help us understand bodies in motion and are used today by scientists plotting the trajectories of spacecraft. Despite their successful use, however, these equations are only approximations to the exact nature of motion. Seventeenth-century Newtonian physics has been superseded by the more sophisticated twentieth-century physics of Einstein and Bohr. But even with the refinements of modern physics, mathematical models of nature remain approximations to reality.
10 The exact reasoning behind and meaning of this statement is given in Chapter 8, which discusses confidence intervals.
In Chapter 2 (graphical descriptive methods) we construct a histogram of the sample of car mileages shown in Chapter 1, and in Chapter 3 (numerical descriptive methods) we use this histogram to help explain the Empirical Rule. As illustrated in Figure 3.15, this rule gives tolerance intervals providing estimates of the "lowest" and "highest" mileages that the new midsize car model should be expected to get in combined city and highway driving:
Figure 3.15 depicts these estimated tolerance intervals, which are shown below the histogram. Because the difference between the upper and lower limits of each estimated tolerance interval is fairly small, we might conclude that the variability of the individual car mileages is fairly small. For example, the estimated tolerance interval [x̄ ± 3s] = [29.2, 34.0] implies that almost any individual car that a customer might purchase this year will obtain a mileage between 29.2 mpg and 34.0 mpg.

Before continuing, recall that we have rounded x̄ and s to one decimal point accuracy in order to simplify our initial example of the Empirical Rule. If, instead, we calculate the Empirical Rule intervals by using x̄ = 31.56 and s = .7977 and then round the interval endpoints to one decimal place accuracy at the end of the calculations, we obtain the same intervals as obtained above. In general, however, rounding intermediate calculated results can lead to inaccurate final results, so it is best to avoid rounding intermediate results.

We next note that if we actually count the number of the 50 mileages in Table 3.1 that are contained in each of the intervals [x̄ ± s] = [30.8, 32.4], [x̄ ± 2s] = [30.0, 33.2], and [x̄ ± 3s] = [29.2, 34.0], we find that these intervals contain, respectively, 34, 48, and 50 of the 50 mileages. The corresponding sample percentages—68 percent, 96 percent, and 100 percent—are close to the theoretical percentages—68.26 percent, 95.44 percent, and 99.73 percent—that apply to a normally distributed population. This is further evidence that the population of all mileages is (approximately) normally distributed and thus that the Empirical Rule holds for this population.

To conclude this example, we note that the automaker has studied the combined city and highway mileages of the new model because the federal tax credit is based on these combined mileages. When reporting fuel economy estimates for a particular car model to the public, the EPA realizes that the proportions of city and highway driving vary from purchaser to purchaser. Therefore, the EPA reports both a combined mileage estimate and separate city and highway mileage estimates to the public (see Table 3.1(b) on page 137).
Figure 3.15  Estimated Tolerance Intervals in the Car Mileage Case
[Histogram of the 50 mileages (Mpg scale from 29.5 to 33.5), with three estimated tolerance intervals shown beneath it: [30.8, 32.4] for the mileages of 68.26 percent of all individual cars; [30.0, 33.2] for the mileages of 95.44 percent of all individual cars; and [29.2, 34.0] for the mileages of 99.73 percent of all individual cars.]
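The interval counts above are easy to verify programmatically. Here is a minimal Python sketch of the Empirical Rule check, assuming the 50 mileages of Table 3.1 are available in a list named mileages (the table itself is not reproduced in this preview); with the book's data it would report the intervals [30.8, 32.4], [30.0, 33.2], and [29.2, 34.0] containing 34, 48, and 50 of the 50 values.

```python
import statistics

def empirical_rule_check(data):
    """Count how many observations fall within xbar +/- k*s for k = 1, 2, 3."""
    xbar = statistics.mean(data)
    s = statistics.stdev(data)          # sample standard deviation (divisor n - 1)
    for k in (1, 2, 3):
        lo, hi = xbar - k * s, xbar + k * s
        inside = sum(lo <= x <= hi for x in data)
        print(f"[xbar +/- {k}s] = [{lo:.1f}, {hi:.1f}] contains "
              f"{inside} of {len(data)} values ({100 * inside / len(data):.0f}%)")

# mileages = [...]  # the 50 mileages of Table 3.1 (not reproduced here)
# empirical_rule_check(mileages)
```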
Chapters 1, 2, and 3: Six optional sections discussing business analytics and data mining.  The Disney Parks Case is used in an optional section of Chapter 1 to introduce how business analytics and data mining are used to analyze big data. This case considers how Walt Disney World in Orlando, Florida, uses MagicBands worn by many of its visitors to collect massive amounts of real-time location, riding pattern, and purchase history data. These data help Disney improve visitor experiences and tailor its marketing messages to different types of visitors. At its Epcot park, Disney helps visitors choose their next ride by continuously summarizing predicted waiting times for seven popular rides on large screens in the park. Disney management also uses the riding pattern data it collects to make planning decisions, as is shown by the following business improvement conclusion from Chapter 1:

…As a matter of fact, Channel 13 News in Orlando reported on March 6, 2015—during the writing of this case—that Disney had announced plans to add a third "theatre" for Soarin' (a virtual ride) in order to shorten long visitor waiting times.
The Disney Parks Case is also used in an optional section of Chapter 2 to help discuss descriptive analytics. Specifically, Figure 2.36 shows a bullet graph summarizing predicted waiting times for seven Epcot rides posted by Disney at 3 p.m. on February 21, 2015, and Figure 2.37 shows a treemap illustrating fictitious visitor ratings of the seven Epcot rides. Other graphics discussed in the optional section on descriptive analytics include gauges, sparklines, data drill-down graphics, and dashboards combining graphics illustrating a business's key performance indicators. For example, Figure 2.35 is a dashboard showing eight "flight on time" bullet graphs and three "flight utilization" gauges for an airline.
Chapter 3 contains four optional sections that discuss six methods of predictive analytics. The methods discussed are explained in an applied and practical way by using the numerical descriptive statistics previously discussed in Chapter 3. These methods are:

• Classification tree modeling and regression tree modeling (see Section 3.7 and the following figures):
Most bullet graphs compare the single primary measure to a target, or objective. In Figure 2.36, each ride's predicted waiting time extends into a scale of colors ranging from dark green to red and signifying short (0 to 20 minutes) to very long (80 to 100 minutes) predicted waiting times. This bullet graph does not compare the predicted waiting times to an objective. However, the bullet graphs located in the upper left of the dashboard in Figure 2.35 (the dashboard of key performance indicators for the airline) do display objectives represented by short vertical black lines. For example, consider the bullet graphs representing the percentages of on-time arrivals and departures in the Midwest, which are shown below.
Figure 2.35  A Dashboard of the Key Performance Indicators for an Airline
[Flights on Time bullet graphs, including Midwest arrival and departure percentages on a 50-to-100 scale; Fleet Utilization gauges for Regional, Short-Haul, and International fleets; monthly (Jan–Dec) Costs graphics for Fuel Costs and Total Costs; and Average Load Factor versus Breakeven Load Factor graphics.]

Figure 2.36  Excel Output of a Bullet Graph of Disney's Predicted Waiting Times (in minutes) for the Seven Epcot Rides Posted at 3 p.m. on February 21, 2015  DS DisneyTimes
[Rides shown: Soarin', Test Track, Spaceship Earth, Living With The Land, Mission: Space orange, Mission: Space green, and Nemo & Friends, on a 0-to-100-minute color scale.]
The airline's objective was to have 80 percent of midwestern arrivals be on time. The approximately 75 percent of actual midwestern arrivals that were on time is in the airline's light brown "satisfactory" region of the bullet graph, but this 75 percent does not reach the 80 percent objective.

Treemaps  We next discuss treemaps, which help visualize two variables. Treemaps display information in a series of clustered rectangles, which represent a whole. The sizes of the rectangles represent a first variable, and treemaps use color to characterize the various rectangles within the treemap according to a second variable. For example, suppose (as a purely hypothetical example) that Disney gave visitors at Epcot the voluntary opportunity to use their personal computers or smartphones to rate as many of the seven Epcot rides as desired on a scale from 0 to 5. Here, 0 represents "poor," 1 represents "fair," 2 represents "good," 3 represents "very good," 4 represents "excellent," and 5 represents "superb." Figure 2.37(a) gives the number of ratings and the mean rating for each ride on a particular day. (These data are completely fictitious.) Figure 2.37(b) shows the Excel output of a treemap, where the size and color of the rectangle for a particular ride represent, respectively, the total number of ratings and the mean rating for the ride. The colors range from dark green (signifying a mean rating near the "superb," or 5, level) to white (signifying a mean rating near the "fair," or 1, level), as shown by the color scale on the treemap. Note that six of the seven rides are rated to be at least "good," four of the seven rides are rated to be at least "very good," and one ride is rated as "fair." Many treemaps use a larger range of colors (ranging, say, from dark green to red), but the Excel app we used to obtain Figure 2.37(b) gave the range of colors shown in that figure. Also, note that treemaps are frequently used to display hierarchical information (information that could be displayed as a tree, where different branchings would be used to show the hierarchical information). For example, Disney could have visitors voluntarily rate the rides in each of its four Orlando parks—Disney's Magic Kingdom, Epcot, Disney's Animal Kingdom, and Disney's Hollywood Studios. A treemap would be constructed by breaking a large rectangle into rectangles representing the parks and then breaking each park's rectangle into rectangles representing its rides.

Figure 2.37  The Number of Ratings and the Mean Rating for Each of Seven Rides at Epcot (0 = Poor, 1 = Fair, 2 = Good, 3 = Very Good, 4 = Excellent, 5 = Superb) and an Excel Output of a Treemap of the Numbers of Ratings and the Mean Ratings

(a) The number of ratings and the mean ratings  DS DisneyRatings

Ride                               Number of Ratings   Mean Rating
Test Track presented by Chevrolet  2045                4.247
Spaceship Earth                    697                 1.319
Living With The Land               725                 2.186
Mission: Space orange              1589                3.408
Mission: Space green               467                 3.116
The Seas with Nemo & Friends       1157                2.712

(b) Excel output of the treemap
[Clustered rectangles for Soarin', Test Track presented by Chevrolet, The Seas With Nemo & Friends, Mission: Space orange, Mission: Space green, Living With The Land, and Spaceship Earth, with a color scale running from 4.8 down to 1.3.]
Figure 3.26  JMP Output of a Classification Tree for the Card Upgrade Data  DS CardUpgrade
[Partition plot and tree for the response Upgrade: RSquare 0.640, N 40, Number of Splits 4. The tree splits on Purchases (at 26.185, 32.45, and 39.925) and on PlatProfile (1 versus 0), with node-by-node counts, G² values, probabilities, and upgrade rates.]
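The book fits this tree with JMP. As a rough parallel, here is a minimal scikit-learn sketch, assuming a hypothetical data set with the same structure as the card upgrade data (columns Purchases and PlatProfile and the 0/1 response Upgrade); the split points found will of course depend on the data supplied.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in for the CardUpgrade data set (the book uses 40 card holders).
df = pd.DataFrame({
    "Purchases":   [21.1, 25.9, 28.3, 30.1, 33.2, 36.7, 39.4, 42.6, 45.0, 51.8],
    "PlatProfile": [0, 0, 1, 0, 1, 1, 0, 1, 1, 1],
    "Upgrade":     [0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
})

# Limit the depth, much as JMP's interactive splitting and pruning limit tree size.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(df[["Purchases", "PlatProfile"]], df["Upgrade"])

print(export_text(tree, feature_names=["Purchases", "PlatProfile"]))
```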
[Figure 3.28 (excerpts): XLMiner classification and regression tree output for Exercise 3.56, including (h) pruning of the tree in (e), predicted class probabilities for two customers (Prob for 0, Prob for 1, Purchases, Card), and (i) the XLMiner training data and best pruned Fresh demand regression tree.]
3.9 Factor Analysis (Optional and Requires Section 3.4)

Factor analysis starts with a large number of correlated variables and attempts to find fewer underlying uncorrelated factors that describe the "essential aspects" of the large number of correlated variables. To illustrate factor analysis, suppose that a personnel officer has interviewed and rated 48 job applicants for sales positions on the following 15 variables:

1 Form of application letter    6 Lucidity       11 Ambition
2 Appearance                    7 Honesty        12 Grasp
3 Academic ability              8 Salesmanship   13 Potential
4 Likability                    9 Experience     14 Keenness to join
5 Self-confidence              10 Drive          15 Suitability

LO3-10  Interpret the information provided by a factor analysis (Optional).

Exercise 3.61 (excerpt)  The XLMiner output gives the centroids of each cluster (that is, the six mean values on the six perception scales of the cluster's members), the average distance of each cluster's members from the cluster centroid, and the distances between the clusters.
a  Use the output to summarize the members of each cluster.
b  By using the members of each cluster and the cluster centroids, discuss the basic differences between the clusters. Also, discuss how this k-means cluster analysis leads to the same practical conclusions about how to improve the popularities of baseball and tennis that have been obtained using the previously discussed hierarchical clustering.

XLMiner Output for Exercise 3.61
[Cluster sizes and average distances from the centroid: Cluster-2: 2 members, 0.960547; Cluster-3: 5, 1.319782; Cluster-4: 3, 0.983933; Cluster-5: 2, 2.382945; Overall: 13, 1.249053. A 5 × 5 matrix of distances between clusters (for example, Cluster-1 to Cluster-2 is 4.573068 and Cluster-3 to Cluster-5 is 2.622167); the row for Football (cluster 4) shows distances 4.387004, 7.911413, 6.027006, 1.04255, and 5.712401 to the five centroids. Centroids (six perception-scale means per cluster): Cluster-1: 4.78, 4.18, 2.16, 3.33, 3.6, 2.67; Cluster-2: 5.6, 4.825, 5.99, 3.475, 1.71, 3.92; Cluster-3: 2.858, 4.796, 5.078, 3.638, 2.418, 3.022; Cluster-4: 1.99, 3.253333, 1.606667, 4.62, 5.773333, 2.363333; Cluster-5: 2.6, 4.61, 6.29, 5, 4.265, 3.22.]
Cluster Analysis and Multidimensional Scaling (Optional)

We will illustrate k-means clustering by using a real data mining project. For confidentiality purposes, we will consider a fictional grocery chain, Just Right, with 2.3 million store loyalty card holders. Store managers are interested in clustering their customers into market segments. For example, they might find that certain customers tend to buy many cooking basics like oil, flour, eggs, and rice, while others shop mainly the prepared and frozen food aisle. Perhaps there are other important categories like calorie-conscious, vegetarian, or premium-quality shoppers.

The executives don't know what the clusters are and hope the data will enlighten them. They choose to concentrate on 100 important products offered in their stores. Suppose that product 1 is fresh strawberries, product 2 is olive oil, product 3 is hamburger buns, and product 4 is potato chips. For each customer having a Just Right loyalty card, they will know the customer's purchases of each of the 100 products.

[Dendrogram: complete linkage]
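The segmentation just described can be prototyped in a few lines. Below is a minimal k-means sketch in Python (scikit-learn), assuming a hypothetical matrix in which each row is a loyalty card holder and each column is the fraction of that customer's spending on one of the 100 tracked products; the simulated data stand in for Just Right's real records.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=1)
# Hypothetical data: 1,000 card holders x 100 products; each row sums to 1
# (spending fractions), which Dirichlet draws give us by construction.
spend = rng.dirichlet(alpha=np.ones(100), size=1000)

km = KMeans(n_clusters=5, n_init=10, random_state=1).fit(spend)

# The centroids play the same role as the centroid rows in the XLMiner output:
# a centroid's largest entries suggest what that cluster's shoppers buy most.
for label, center in enumerate(km.cluster_centers_):
    top = np.argsort(center)[-3:][::-1]
    print(f"Cluster {label}: heaviest products {top.tolist()}")
```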
Exercises for Section 3.10

CONCEPTS

3.66 What is the purpose of association rules?
3.67 Discuss the meanings of the terms support percentage, confidence percentage, and lift ratio.

METHODS AND APPLICATIONS

3.68 In the previous XLMiner output, show how the lift ratio of 1.1111 (rounded) for the recommendation of C to renters of B has been calculated. Interpret this lift ratio.
3.69 The XLMiner output of an association rule analysis of the DVD renters data using a specified support percentage of 40 percent and a specified confidence percentage of 70 percent is shown below. DS DVDRent
a Summarize the recommendations based on a lift ratio greater than 1.
b Consider the recommendation of DVD B based on having rented C & E. (1) Identify and interpret the support for C & E. Do the same for the support for C & E & B. (2) Show how the Confidence% of 80 has been calculated. (3) Show how the Lift Ratio of 1.1429 (rounded) has been calculated.

Rule: If all Antecedent items are purchased, then with Confidence percentage Consequent items will also be purchased.
[XLMiner output columns: Row ID, Confidence%, Antecedent (x), Consequent (y), Support for x, Support for y, Support for x & y, Lift Ratio]
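The lift ratio arithmetic these exercises ask for can be checked with a few lines of Python, using the relationship lift = confidence of the rule divided by the overall support of the consequent (both expressed as proportions). The 70 percent consequent support below is an illustrative assumption consistent with the rounded 1.1429 in Exercise 3.69, not a value taken from the (unreproduced) output.

```python
def lift_ratio(confidence_pct, consequent_support_pct):
    """Lift = (confidence of the rule) / (overall support of the consequent)."""
    return (confidence_pct / 100.0) / (consequent_support_pct / 100.0)

# Exercise 3.69(b)(3): a Confidence% of 80 with a consequent support of 70 percent.
print(round(lift_ratio(80, 70), 4))   # 1.1429

# A lift ratio above 1 means renters of the antecedent are more likely than
# customers in general to rent the consequent, so the recommendation adds value.
```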
…or thrillers) and hierarchies (for example, a hierarchy related to how new the product is).

Chapter Summary

We began this chapter by presenting and comparing several measures of central tendency. We defined the population mean and we saw how to estimate the population mean by using a sample mean. We also defined the median and mode, and we compared the mean, median, and mode for symmetrical distributions and for distributions that are skewed to the right or left. We then studied measures of variation (or spread). We defined the range, variance, and standard deviation, and we saw how to estimate a population variance and standard deviation by using a sample. We learned that a good way to interpret the standard deviation when a population is (approximately) normally distributed is to use the Empirical Rule, and we studied Chebyshev's Theorem, which gives us intervals containing reasonably large fractions of the population units no matter what the population's shape might be. We also saw that, when a data set is highly skewed, it is best to use percentiles and quartiles to measure variation, and we learned how to construct a box-and-whiskers plot by using the quartiles.

After learning how to measure and depict central tendency and variability, we presented various optional topics. First, we discussed several numerical measures of the relationship between two variables, including the covariance, the correlation coefficient, and the least squares line. We then introduced the concept of a weighted mean and also explained how to compute descriptive statistics for grouped data. In addition, we showed how to calculate the geometric mean and demonstrated its interpretation. Finally, we used the numerical methods of this chapter to give an introduction to four important techniques of predictive analytics: decision trees, cluster analysis, factor analysis, and association rules.
Figure 3.35  Minitab Output of a Factor Analysis of the Applicant Data (4 Factors Used)
[Principal component factor analysis of the correlation matrix: unrotated factor loadings and communalities.]

We might interpret the four factors as follows: Factor 1, "extroverted personality"; Factor 2, "experience"; Factor 3, "agreeable personality"; Factor 4, "academic ability." Variable 2 (appearance) does not load heavily on any factor and thus is its own factor, as Factor 6 on the Minitab output in Figure 3.34 indicated is true. Variable 1 (form of application letter) loads heavily on Factor 2 ("experience"). In summary, there is not much difference between the 7-factor and 4-factor solutions. We might therefore conclude that the 15 variables can be reduced to the following five uncorrelated factors: "extroverted personality," "experience," "agreeable personality," "academic ability," and "appearance." This conclusion helps the personnel officer focus on the "essential characteristics" of a job applicant. Moreover, if a company analyst wishes at a later date to use a tree diagram or regression analysis to predict sales performance on the basis of the characteristics of salespeople, the analyst can simplify the prediction modeling procedure by using the five uncorrelated factors instead of the original 15 correlated variables as potential predictor variables.

In general, in a data mining project where we wish to predict a response variable and in which there are an extremely large number of potential correlated predictor variables, it can be useful to first employ factor analysis to reduce the large number of potential correlated predictor variables to fewer uncorrelated factors that we can use as potential predictor variables.
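The book's factor analyses are run in Minitab. For readers who want to experiment, here is a rough scikit-learn parallel, assuming a hypothetical 48 × 15 matrix of applicant ratings (random numbers below, so no real factor structure will emerge); loadings with large absolute values identify which variables define each factor.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(seed=2)
ratings = rng.normal(size=(48, 15))   # hypothetical stand-in for the applicant data

fa = FactorAnalysis(n_components=4).fit(ratings)
loadings = fa.components_.T           # 15 variables x 4 factors

for v, row in enumerate(loadings, start=1):
    heavy = [f + 1 for f, w in enumerate(row) if abs(w) > 0.5]
    print(f"variable {v:2d} loads heavily on factors {heavy or 'none'}")
```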
• Hierarchical clustering and k-means clustering (see Section 3.8 and the following figures):
• Factor analysis and association rule mining (see Sections 3.9 and 3.10 and the following figures):
We believe that an early introduction to predictive analytics (in Chapter 3) will make statistics seem more useful and relevant from the beginning and thus motivate students to be more interested in the entire course. However, our presentation gives instructors various choices. This is because, after covering the optional introduction to business analytics in Chapter 1, the five optional sections on descriptive and predictive analytics in Chapters 2 and 3 can be covered in any order without loss of continuity. Therefore, the instructor can choose which of the six optional business analytics sections to cover early, as part of the main flow of Chapters 1–3, and which to discuss later. We recommend that sections chosen to be discussed later be covered after Chapter 14, which presents the further predictive analytics topics of multiple linear regression, logistic regression, and neural networks.
Chapters 4–8: Probability and probability modeling. Discrete and continuous probability distributions. Sampling distributions and confidence intervals.  Chapter 4 discusses probability by featuring a new discussion of probability modeling and using motivating examples—The Crystal Cable Case and a real-world example of gender discrimination at a pharmaceutical company—to illustrate the probability rules. Chapters 5 and 6 give more concise discussions of discrete and continuous probability distributions (models) and feature practical examples illustrating the "rare event approach" to making a statistical inference. In Chapter 7, The Car Mileage Case is used to introduce sampling distributions and motivate the Central Limit Theorem (see Figures 7.1, 7.3, and 7.5). In Chapter 8, the automaker in The Car Mileage Case uses a confidence interval procedure specified by the Environmental Protection Agency (EPA) to find the EPA estimate of a new midsize model's true mean mileage and determine if the new midsize model deserves a federal tax credit (see Figure 8.2).
Chapters 9–12: Hypothesis testing. Two-sample procedures. Experimental design and analysis of variance. Chi-square tests.  Chapter 9 discusses hypothesis testing and begins with a new section on formulating statistical hypotheses. Three cases—The Trash Bag Case, The e-billing Case, and The Valentine's Day Chocolate Case—are used to illustrate how hypotheses are formulated. A summary box comparing the critical value and p-value approaches is presented in the middle of this section (rather than at the end, as in previous editions) so that more of the section can be devoted to developing the summary box and showing how to use it. In addition, a five-step hypothesis testing procedure emphasizes that successfully using any of the book's hypothesis testing summary boxes involves the same basic steps.

In order to obtain a preliminary estimate—to be reported at the auto shows—of the midsize model's combined city and highway driving mileage, the automaker subjected the two cars selected for testing to the EPA mileage test. When this was done, the cars obtained mileages of 30 mpg and 32 mpg. The mean of this sample of mileages is

x̄ = (30 + 32)/2 = 31 mpg

This sample mean is the point estimate of the mean mileage μ for the population of six preproduction cars and is the preliminary mileage estimate for the new midsize model that was reported at the auto shows.

When the auto shows were over, the automaker decided to further study the new midsize model by subjecting the four auto show cars to various tests. When the EPA mileage test was performed, the four cars obtained mileages of 29 mpg, 31 mpg, 33 mpg, and 34 mpg. Thus, the mileages obtained by the six preproduction cars were 29 mpg, 30 mpg, 31 mpg, 32 mpg, 33 mpg, and 34 mpg. The probability distribution of this population of six individual car mileages is given in Table 7.1 and graphed in Figure 7.1(a). The mean of the population of six individual car mileages is μ = 31.5 mpg.
Table 7.1  A Probability Distribution Describing the Population of Six Individual Car Mileages

Individual Car Mileage   29    30    31    32    33    34
Probability              1/6   1/6   1/6   1/6   1/6   1/6

Figure 7.1  A Comparison of Individual Car Mileages and Sample Means
[(a) A graph of the probability distribution describing the population of six individual car mileages. (b) A graph of the probability distribution describing the population of 15 sample means: means 29.5, 30, 30.5, 31, 31.5, 32, 32.5, 33, and 33.5 with probabilities 1/15, 1/15, 2/15, 2/15, 3/15, 2/15, 2/15, 1/15, and 1/15.]

Table 7.2  The Population of Sample Means
[(a) The population of the 15 samples of n = 2 car mileages and corresponding sample means. (b) A probability distribution describing the population of sample means: each sample mean with its frequency and probability.]
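Table 7.2 and Figure 7.1(b) can be reproduced directly by enumerating all 15 equally likely samples of n = 2 mileages; the short Python sketch below recovers the probabilities 1/15, 2/15, and 3/15 shown above.

```python
from itertools import combinations
from collections import Counter
from fractions import Fraction

population = [29, 30, 31, 32, 33, 34]        # the six individual car mileages

# All 15 equally likely samples of n = 2, and each sample's mean.
means = [(a + b) / 2 for a, b in combinations(population, 2)]
dist = Counter(means)

for mean in sorted(dist):
    # Fractions reduce automatically, e.g., 3/15 prints as 1/5.
    print(mean, Fraction(dist[mean], len(means)))
```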
How large must the sample size be for the sampling distribution of x̄ to be approximately normal? In general, the more skewed the probability distribution of the sampled population, the larger the sample size must be for the population of all possible sample means to be approximately normally distributed. For some sampled populations, particularly those described by symmetric distributions, the population of all possible sample means is approximately normally distributed for a fairly small sample size. In addition, studies indicate that, if the sample size is at least 30, then for most sampled populations the population of all possible sample means is approximately normally distributed. In this book, whenever the sample size n is at least 30, we will assume that the sampling distribution of x̄ is approximately a normal distribution. Of course, if the sampled population is exactly normally distributed, the sampling distribution of x̄ is exactly normal for any sample size.
Figure 7.5  The Central Limit Theorem Says That the Larger the Sample Size Is, the More Nearly Normally Distributed Is the Population of All Possible Sample Means
[(b) shows the corresponding populations of all possible sample means for different sample sizes.]
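A quick simulation illustrates this "larger n, more nearly normal" behavior. The sketch below uses made-up parameters: it draws repeated samples from a strongly right-skewed (exponential) population and shows that the skewness of the resulting sample means shrinks toward zero as n grows.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def skew_of_sample_means(n, reps=20_000):
    # Each row is one sample of size n from a right-skewed population.
    samples = rng.exponential(scale=1.0, size=(reps, n))
    means = samples.mean(axis=1)
    z = (means - means.mean()) / means.std()
    return (z ** 3).mean()               # sample skewness of the means

for n in (2, 10, 30, 100):
    print(f"n = {n:3d}: skewness of sample means = {skew_of_sample_means(n):.3f}")
```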
Because the automaker has been working to improve gas mileages, we cannot assume that we know the true value of the population mean mileage μ for the new midsize model. However, engineering data indicate that the spread of individual car mileages for the automaker's midsize cars is the same from model to model and year to year. Therefore, if the mileages for previous models had a standard deviation equal to .8 mpg, it might be reasonable to assume that the standard deviation of the mileages for the new model will also equal .8 mpg. Such an assumption would, of course, be questionable, and in most real-world situations there would probably not be an actual basis for knowing σ. However, assuming that σ is known will help us to illustrate sampling distributions, and in later chapters we will see what to do when σ is unknown.

EXAMPLE 7.2  The Car Mileage Case: Estimating Mean Mileage

Part 1: Basic Concepts  Consider the infinite population of the mileages of all of the new midsize cars that could potentially be produced by this year's manufacturing process. If we assume that this population is normally distributed with mean μ and standard deviation σ = .8, then:
Figure 7.3  A Comparison of (1) the Population of All Individual Car Mileages, (2) the Sampling Distribution of the Sample Mean x̄ When n = 5, and (3) the Sampling Distribution of the Sample Mean x̄ When n = 50
[(a) The population of individual mileages: a normal distribution with mean μ and standard deviation σ = .8. (b) The sampling distribution of x̄ when n = 5: normal with mean μ and standard deviation σ/√5 = .358. (c) The sampling distribution of x̄ when n = 50: normal with mean μ and standard deviation σ/√50 = .113.]
8.1 z-Based Confidence Intervals for a Population Mean: σ Known
3  In statement 1 we showed that the probability is .95 that the sample mean x̄ will be within plus or minus 1.96σx̄ = .22 of the population mean μ. In statement 2 we showed that x̄ being within plus or minus .22 of μ is the same as the interval [x̄ ± .22] containing μ. Combining these results, we see that the probability is .95 that the sample mean x̄ will be such that the interval

[x̄ ± 1.96σx̄] = [x̄ ± .22]

contains the population mean μ.

A 95 percent confidence interval for μ  Statement 3 says that, before we randomly select the sample, there is a .95 probability that we will obtain an interval [x̄ ± .22] that contains the population mean μ. In other words, 95 percent of all intervals that we might obtain contain μ, and 5 percent of these intervals do not contain μ. For this reason, we call the interval [x̄ ± .22] a 95 percent confidence interval for μ.
Figure 8.2  Three 95 Percent Confidence Intervals for μ
[The population of all individual car mileages, with mean μ, and samples of n = 50 car mileages; the probability is .95 that x̄ will be within plus or minus 1.96σx̄ = .22 of μ. Intervals centered at the sample means 31.6 and 31.56 are shown.]
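The .22 margin of error above is easy to verify in code, using the σ = .8 that this chapter assumes known and n = 50:

```python
import math

sigma, n = 0.8, 50
margin = 1.96 * sigma / math.sqrt(n)   # 1.96 * sigma_xbar
print(round(margin, 2))                # 0.22
```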
Hypothesis testing summary boxes are featured throughout Chapter 9, Chapter 10 (two-sample procedures), Chapter 11 (one-way, randomized block, and two-way analysis of variance), Chapter 12 (chi-square tests of goodness of fit and independence), and the remainder of the book. In addition, emphasis is placed throughout on estimating practical importance after testing for statistical significance.
Chapters 13–18: Simple and multiple regression analysis. Model building. Logistic regression and neural networks. Time series forecasting. Control charts. Nonparametric statistics. Decision theory.  Chapters 13–15 present predictive analytics methods that are based on parametric regression and time series models. Specifically, Chapter 13 and the first seven sections of Chapter 14 discuss simple and basic multiple regression analysis by using a more streamlined organization and The Tasty Sub Shop (revenue prediction) Case (see Figure 14.4). The next five sections of Chapter 14 present five advanced modeling topics that can be covered in any order without loss of continuity: dummy variables (including a discussion of interaction); quadratic variables and quantitative interaction variables; model building and the effects of multicollinearity; residual analysis and diagnosing
The Five Steps of Hypothesis Testing

1 State the null hypothesis H0 and the alternative hypothesis Ha.
2 Specify the level of significance α.
3 Plan the sampling procedure and select the test statistic.

Using a critical value rule:
4 Use the summary box to find the critical value rule corresponding to the alternative hypothesis.
5 Collect the sample data, compute the value of the test statistic, and decide whether to reject H0 by using the critical value rule. Interpret the statistical results.

Using a p-value rule:
4 Use the summary box to find the p-value corresponding to the alternative hypothesis. Collect the sample data, compute the value of the test statistic, and compute the p-value.
5 Reject H0 at level of significance α if the p-value is less than α. Interpret the statistical results.

We can (formally or informally) use the five steps above to implement the critical value and p-value approaches to hypothesis testing.
Testing a "less than" alternative hypothesis

We have seen in the e-billing case that to study whether the new electronic billing system reduces the mean bill payment time by more than 50 percent, the management consulting firm will test H0: μ = 19.5 versus Ha: μ < 19.5. Rejecting H0 would demonstrate the benefits of the new billing system, both to the company in which it has been installed and to other companies that are considering installing such a system. To perform the hypothesis test, we will randomly select a sample of n = 65 invoices paid using the new billing system and calculate the mean payment time of these invoices. Then, because the sample size is large, we will utilize the test statistic in the summary box:

z = (x̄ − 19.5) / (s/√n)

A value of the test statistic less than zero results when x̄ is less than 19.5, which indicates that μ might be less than 19.5. To decide how much less than zero the value of the test statistic must be to reject H0 in favor of Ha at level of significance α, the summary box gives the following critical value rule:

Place the probability of a Type I error, α, in the left-hand tail of the standard normal curve and use the normal table to find the critical value −zα. Here −zα is the negative of the normal point zα. That is, −zα is the point on the horizontal axis under the standard normal curve that gives a left-hand tail area equal to α.

Because α equals .01, the critical value −zα is −z.01 = −2.33 [see Table A.3 and Figure 9.3(a)].
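A minimal sketch of this left-tailed z test in Python, with made-up sample results (the e-billing case's actual sample mean and standard deviation are not reproduced in this preview):

```python
import math
from scipy.stats import norm

# Hypothetical sample results for the e-billing case: n = 65 invoices.
n, xbar, s = 65, 18.1, 3.9          # made-up mean and standard deviation
mu0, alpha = 19.5, 0.01

z = (xbar - mu0) / (s / math.sqrt(n))    # the test statistic from the text
p_value = norm.cdf(z)                    # left-hand tail area

print(f"z = {z:.2f}, critical value = {norm.ppf(alpha):.2f}, p = {p_value:.4f}")
# Reject H0 if z < -2.33 (equivalently, if the p-value is less than .01).
```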
DS DebtEq
1.0 | 5
1.1 | 1 9
1.2 | 1 2 9
1.3 | 1 2 3 7
1.4 | 1 5 6
1.5 |
1.6 | 5
1.7 | 8
One measure of a company's financial health is its debt-to-equity ratio. This quantity is defined to be the ratio of the company's corporate debt to the company's equity. If this ratio is too high, it is one indication of financial instability. For obvious reasons, banks often monitor the financial health of companies to which they have extended commercial loans. Suppose that, in order to reduce risk, a large bank has decided to initiate a policy limiting the mean debt-to-equity ratio for its portfolio of commercial loans to being less than 1.5. In order to assess whether the mean debt-to-equity ratio μ of its (current) commercial loan portfolio is less than 1.5, the bank will test H0: μ = 1.5 versus Ha: μ < 1.5. Here, a Type I error would be concluding that the mean debt-to-equity ratio of its commercial loan portfolio is less than 1.5 when it is not. Because the bank wishes to be very sure that it does not commit this Type I error, it will test H0 at the .01 level of significance. To perform the test, the bank randomly selects a sample of 15 of its commercial loan accounts. Audits of these companies result in the following debt-to-equity ratios (arranged in increasing order): 1.05, 1.11, 1.19, 1.21, 1.22, 1.29, 1.31, 1.32, 1.33, 1.37, 1.41, 1.45, 1.46, 1.65, and 1.78. The mound-shaped stem-and-leaf display of these ratios is given in the page margin and indicates that the population of all debt-to-equity ratios is (approximately) normally distributed. It follows that it is appropriate to calculate the value of the test statistic t in the summary box below.
If the sampled population is normally distributed (or if the sample size is large—at least 30), then this sampling distribution is exactly (or approximately) a t distribution having n − 1 degrees of freedom. This leads to the following results:

A t Test about a Population Mean: σ Unknown
[Summary box: the test statistic is t = (x̄ − μ0)/(s/√n). For a "less than" alternative, the p-value is the area under the t curve to the left of t; for a "greater than" alternative, the area to the right of t; for a "not equal" alternative, twice the tail area. Reject H0 if the p-value is less than α.]
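Because the 15 debt-to-equity ratios are listed above, this t test can be carried out directly. A minimal sketch (scipy's one-sample t test with a "less" alternative returns the one-sided p-value):

```python
from scipy import stats

ratios = [1.05, 1.11, 1.19, 1.21, 1.22, 1.29, 1.31, 1.32,
          1.33, 1.37, 1.41, 1.45, 1.46, 1.65, 1.78]

# H0: mu = 1.5 versus Ha: mu < 1.5, at alpha = .01
t_stat, p_value = stats.ttest_1samp(ratios, popmean=1.5, alternative="less")
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")   # t is about -3.16

# With 14 degrees of freedom, the .01 critical value is about -2.624,
# so H0 is rejected: strong evidence the portfolio's mean ratio is below 1.5.
print(round(stats.t.ppf(0.01, df=14), 3))
```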
In order to see how to test this kind of hypothesis, remember that when n is large, the sampling distribution of

(p̂ − p) / √(p(1 − p)/n)

is approximately a standard normal distribution. Let p0 denote a specified value between 0 and 1 (its exact value will depend on the problem), and consider testing the null hypothesis H0: p = p0.

A Large Sample Test about a Population Proportion
[Standard normal curves illustrating the critical value and p-value rules; for the example below, the critical value is −z.01 = −2.33 and the test statistic z = −3.90 gives a p-value of .00005.]
where p is the proportion of all current purchasers who would stop buying the cheese spread if the new spout were used. To perform the test, we will randomly select n = 1,000 current purchasers of the cheese spread, find the proportion (p̂) of these purchasers who would stop buying the cheese spread if the new spout were used, and calculate the value of the test statistic z in the summary box. Then, because the alternative hypothesis says that p is less than .10, we will reject H0 at an α of .01 if the value of z is less than −z.01 = −2.33. (This test is valid because np0 = 1,000(.10) = 100 and n(1 − p0) = 1,000(.90) = 900 are both at least 5.) Suppose that when the sample is randomly selected, we find that 63 of the 1,000 current purchasers say they would stop buying the cheese spread if the new spout were used. Because p̂ = 63/1,000 = .063, the value of the test statistic is

z = (.063 − .10) / √((.10)(.90)/1,000) = −3.90

Because z = −3.90 is less than −2.33, we reject H0. That is, we conclude (at an α of .01) that the proportion of all current purchasers who would stop buying the cheese spread if the new spout were used is less than .10. It follows that the company can feel confident about using the new spout.
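The arithmetic of this test is easy to reproduce; a minimal Python sketch using the numbers given in the text:

```python
import math
from scipy.stats import norm

n, successes, p0 = 1000, 63, 0.10
p_hat = successes / n                                 # .063

z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)       # large-sample test statistic
p_value = norm.cdf(z)                                 # left-tailed alternative

print(f"z = {z:.2f}, p-value = {p_value:.5f}")        # z = -3.90, p-value = .00005
```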
Figure 14.4  Excel and Minitab Outputs of a Regression Analysis of the Tasty Sub Shop Revenue Data in Table 14.1 Using the Model y = β0 + β1x1 + β2x2 + ε

(a) The Excel output
[Excel regression output omitted from this excerpt.]

…each residual—the difference between the restaurant's observed and predicted yearly revenues—is fairly small (in magnitude). We define the least squares point estimates to be the values of b0, b1, and b2 that minimize SSE, the sum of squared residuals for the 10 restaurants.

The formula for the least squares point estimates of the parameters in a multiple regression model is expressed using a branch of mathematics called matrix algebra. In this book, however, we will rely on Excel and Minitab to compute the needed estimates. For example, the Minitab output in Figure 14.4(b) tells us that the least squares point estimates of β0, β1, and β2 in the Tasty Sub Shop revenue model are b0 = 125.289, b1 = 14.1996, and b2 = 22.8107 (see (1), (2), and (3)). The point estimate b1 = 14.1996 of β1 says we estimate that mean yearly revenue increases by $14,199.60 when the population size increases by 1,000 residents and the business rating does not change.
(b) The Minitab output

Analysis of Variance
Source      DF   Adj SS        Adj MS    F-Value       P-Value
Regression   2   486356 (10)   243178    180.69 (13)   0.000 (14)
Population   1   327678        327678    243.48        0.000

Coefficients
Term        Coef          SE Coef (4)   T-Value (5)   P-Value (6)   VIF
Constant    125.3 (1)     40.9          3.06          0.018
Population  14.200 (2)    0.910         15.60         0.000         1.18
Bus_Rating  22.81 (3)     5.77          3.95          0.006         1.18

Regression Equation
Revenue = 125.3 + 14.200 Population + 22.81 Bus_Rating

Variable    Setting
Population  47.3
Bus_Rating  7

Fit (15)    SE Fit (16)   95% CI (17)            95% PI (18)
956.606     15.0476       (921.024, 992.188)     (862.844, 1050.37)

(1) b0  (2) b1  (3) b2  (4) sbj = standard error of the estimate bj  (5) t statistics  (6) p-values for t statistics  (7) s = standard error  (8) R²  (9) Adjusted R²  (10) Explained variation  (11) SSE = Unexplained variation  (12) Total variation  (13) F(model) statistic  (14) p-value for F(model)  (15) ŷ = point prediction when x1 = 47.3 and x2 = 7  (16) sŷ = standard error of the estimate ŷ  (17) 95% confidence interval when x1 = 47.3 and x2 = 7  (18) 95% prediction interval when x1 = 47.3 and x2 = 7  (19) 95% confidence interval for βj
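The point prediction (15) on the output can be reproduced from the least squares estimates; a quick check in Python (the tiny difference from 956.606 is rounding in the printed coefficients):

```python
b0, b1, b2 = 125.289, 14.1996, 22.8107   # least squares estimates from Figure 14.4
population, bus_rating = 47.3, 7.0       # the settings used for the prediction

y_hat = b0 + b1 * population + b2 * bus_rating
print(round(y_hat, 3))   # 956.605, matching the Fit of 956.606 up to rounding
```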
The idea behind neural network modeling is to represent the response variable as a nonlinear function of linear combinations of the predictor variables. The simplest but most widely used neural network model is the single-hidden-layer, feedforward neural network. This model, which is also sometimes called the single-layer perceptron, is motivated (like all neural network models) by the connections of the neurons in the human brain. As illustrated in Figure 14.37, this model involves:

1  An input layer consisting of the predictor variables x1, x2, ..., xk under consideration.

2  A single hidden layer consisting of m hidden nodes. At the vth hidden node, for v = 1, 2, ..., m, we form a linear combination ℓv of the k predictor variables:

ℓv = hv0 + hv1x1 + hv2x2 + ⋯ + hvkxk
[Figure 14.37: the input layer, hidden layer, and output of the single-hidden-layer model. At each hidden node, ℓv = hv0 + hv1x1 + hv2x2 + ⋯ + hvkxk is transformed by the activation Hv(ℓv) = (e^ℓv − 1)/(e^ℓv + 1). The output combines the hidden nodes as L = θ0 + θ1H1(ℓ1) + θ2H2(ℓ2) + ⋯ + θmHm(ℓm), and the output-layer function is g(L) = 1/(1 + e^−L) if the response variable is qualitative, or g(L) = L if the response variable is quantitative.]
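A minimal Python sketch of this forward computation follows. All weights below are made-up illustrations rather than estimates from the text, and the activation H(ℓ) = (e^ℓ − 1)/(e^ℓ + 1) and output function g(L) follow the definitions just given.

import numpy as np

def forward(x, h, theta, qualitative=True):
    # h[v] holds (h_v0, h_v1, ..., h_vk) for hidden node v; theta holds the output weights.
    l = h[:, 0] + h[:, 1:] @ x                     # l_v = h_v0 + h_v1 x1 + ... + h_vk xk
    H = (np.exp(l) - 1) / (np.exp(l) + 1)          # hidden-node activations
    L = theta[0] + theta[1:] @ H                   # L = theta_0 + sum of theta_v H_v
    return 1 / (1 + np.exp(-L)) if qualitative else L  # g(L) for a 0/1 response, else L

x = np.array([1.5, 0.0])                           # two hypothetical predictor values
h = np.array([[0.2, 0.7, -0.4], [-0.1, 0.3, 0.9], [0.5, -0.6, 0.2]])
theta = np.array([0.1, 1.2, -0.8, 0.4])
print(forward(x, h, theta))                        # an estimated probability between 0 and 1

Fitting the weights to data is what software such as JMP does; this sketch only shows how a fitted model turns predictor values into a prediction.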
[Minitab logistic regression output for the credit card upgrade model. Odds ratio for the continuous predictor Purchases: 1.25, 95% CI (1.0469, 1.5024). Odds ratio for the categorical predictor PlatProfile, level 1 relative to level 0: 46.7564, 95% CI (1.9693, 1110.1076). Coefficient z-values of −2.55 and 2.38 with p-values 0.011 and 0.017; VIF 1.59. Goodness-of-fit tests (deviance, Pearson, Hosmer–Lemeshow) with p-values 0.993 and 0.919; deviance table chi-squares 35.84 and 10.37 with p-values 0.000 and 0.001. Fitted upgrade probability 0.943012 (SE 0.0587319, 95% CI (0.660211, 0.992954)) at the setting Purchases = 42.571, PlatProfile = 1; a second setting shown is Purchases = 51.835, PlatProfile = 0.]
The odds ratio estimate of 1.25 for Purchases says that for each increase of $1,000 in last year's purchases, we estimate that a Silver card holder's odds of upgrading increase by 25 percent. The odds ratio estimate of 46.76 for PlatProfile says that we estimate that the odds of upgrading for a Silver card holder who conforms to the bank's Platinum profile are 46.76 times larger than the odds of upgrading for a Silver card holder who does not conform to the Platinum profile but had the same purchases last year. Finally, the bottom of the Minitab output says that we estimate that

• The upgrade probability for a Silver card holder who had purchases of $42,571 last year and conforms to the bank's Platinum profile is .9430.
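The connection between a logistic regression coefficient and its odds ratio is simply exponentiation: the odds ratio estimate equals e raised to the coefficient estimate. A tiny hedged sketch in Python (the coefficient value is hypothetical, chosen so that e^b is about 1.25):

import math

b_purchases = 0.223                   # hypothetical logistic regression coefficient
odds_ratio = math.exp(b_purchases)    # odds ratio estimate, about 1.25
print(round(odds_ratio, 2))           # each $1,000 of purchases multiplies the odds by ~1.25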
Neural Networks (Optional)
Parameter               Estimate
H1_1:Purchases           0.113579
H1_1:PlatProfile:0       0.495872
H1_1:Intercept          −4.34324
H1_2:Purchases           0.062612
H1_2:PlatProfile:0       0.172119
H1_2:Intercept          −2.28505
H1_3:Purchases           0.023852
H1_3:PlatProfile:0       0.93322
H1_3:Intercept          −1.1118
Upgrade(0):H1_1         −201.382
Upgrade(0):H1_2         −36.2743
Upgrade(0):H1_3          81.97204
Upgrade(0):Intercept    −7.26818

[Data table columns: Upgrade, Purchases, PlatProfile, Probability(Upgrade=0), Probability(Upgrade=1), H1_1, H1_2, H1_3, Most Likely Upgrade]
Consider Silver card holders who have not yet been sent an upgrade offer and for whom we wish to estimate the probability of upgrading. Silver card holder 42 had purchases last year of $51,835 (Purchases = 51.835) and did not conform to the bank's Platinum profile (PlatProfile = 0). Because PlatProfile = 0, the dummy variable PlatProfile:0 equals 1. Figure 14.38 shows the parameter estimates for the neural network model based on the training data set and how they are used. Because the response variable Upgrade is qualitative, the output layer function is g(L) = 1/(1 + e^−L). The final result obtained in the calculations, g(L̂) = .1877344817, is an estimate of the probability that Silver card holder 42 would not upgrade (Upgrade = 0). This implies that the estimated probability that this card holder would upgrade is 1 − .1877344817 = .8122655183. If we predict that a Silver card holder would upgrade if and only if his or her estimated upgrade probability is at least .5 (as for Silver card holder 41), JMP uses the model fit to the training data set to calculate an upgrade
outlying and influential observations; and logistic regression (see Figure 14.36). The last section of Chapter 14 discusses neural networks and has logistic regression as a prerequisite. This section shows why neural network modeling is particularly useful when analyzing big data and how neural network models are used to make predictions (see Figures 14.37 and 14.38). Chapter 15 discusses time series forecasting, including Holt–Winters' exponential smoothing models, and refers readers to Appendix B (at the end of the book), which succinctly discusses the Box–Jenkins methodology. The book concludes with Chapter 16 (a clear discussion of control charts and process capability), Chapter 17 (nonparametric statistics), and Chapter 18 (decision theory, another useful predictive analytics topic).
WHAT SOFTWARE IS AVAILABLE
MEGASTAT® FOR EXCEL 2003, 2007, AND 2010 (AND EXCEL: MAC 2011)
MegaStat is a full-featured Excel add-in by J. B. Orris of Butler University that is available with this text. It performs statistical analyses within an Excel workbook. It does basic functions such as descriptive statistics, frequency distributions, and probability calculations, as well as hypothesis testing, ANOVA, and regression.
MegaStat output is carefully formatted. Ease-of-use features include AutoExpand for quick data selection and Auto Label detect. Since MegaStat is easy to use, students can focus on learning statistics without being distracted by the software. MegaStat is always available from Excel's main menu. Selecting a menu item pops up a dialog box. MegaStat works with all recent versions of Excel.
Minitab® Student Version 17 is available to help students solve the business statistics exercises in the text. This software is available in the student version and can be packaged with any McGraw-Hill business statistics text.
TEGRITY CAMPUS: LECTURES 24/7
Tegrity Campus is a service that makes class time available 24/7. With Tegrity Campus, you can automatically capture every lecture in a searchable format for students to review when they study and complete assignments. With a simple one-click start-and-stop process, you capture all computer screens and corresponding audio. Students can replay any part of any class with easy-to-use browser-based viewing on a PC or Mac.
Educators know that the more students can see, hear, and experience class resources, the better they learn. In fact, studies prove it. With Tegrity Campus, students quickly recall key moments by using Tegrity Campus's unique search feature. This search helps students efficiently find what they need, when they need it, across an entire semester of class recordings. Help turn all your students' study time into learning moments immediately supported by your lecture. To learn more about Tegrity, watch a two-minute Flash demo at http://tegritycampus.mhhe.com.
We wish to thank the many people who have helped to make this book a reality. As indicated on the title page, we thank Professor Steven C. Huchendorf, University of Minnesota; Dawn C. Porter, University of Southern California; and Patrick J. Schur, Miami University, for major contributions to this book. We also thank Susan Cramer of Miami University for very helpful advice on writing this new edition.
We also wish to thank the people at McGraw-Hill for their dedication to this book. These people include senior brand manager Dolly Womack, who is extremely helpful to the authors; senior development editor Camille Corum, who has shown great dedication to the improvement of this book; content project manager Harvey Yep, who has very capably and diligently guided this book through its production and who has been a tremendous help to the authors; and our former executive editor Steve Scheutz, who always greatly supported our books. We also thank executive editor Michelle Janicek for her tremendous help in developing this new edition; our former executive editor Scott Isenberg for the tremendous help he has given us in developing all of our McGraw-Hill business statistics books; and our former executive editor Dick Hercher, who persuaded us to publish with McGraw-Hill.
We also wish to thank Sylvia Taylor and Nicoleta Maghear, Hampton University, for accuracy checking Connect content; Patrick Schur, Miami University, for developing learning resources; Ronny Richardson, Kennesaw State University, for revising the instructor PowerPoints and developing new guided examples and learning resources; Denise Krallman, Miami University, for updating the Test Bank; and James Miller, Dominican University, and Anne Drougas, Dominican University, for developing learning resources for the new business analytics content. Most importantly, we wish to thank our families for their acceptance, unconditional love, and support.
Susan Barney, Fiona, and Radeesa
Daphne, Chloe, and Edgar
Gwyneth and Tony
Callie, Bobby, Marmalade, Randy, and Penney
Clarence, Quincy, Teddy, Julius, Charlie, Sally, Milo, Zeke, Bunch, Big Mo, Ozzie, Harriet, Sammy, Louise, Pat, Taylor, and Jamie
Richard T. O'Connell
To my children and grandchildren: Christopher, Bradley, Sam, and Joshua
Emily S. Murphree
To Kevin and the Math Ladies
REVISIONS FOR 8TH EDITION
Chapter 1
• Initial example made clearer
• Two new graphical examples added to better introduce quantitative and qualitative variables
• How to select random (and other types of) samples
moved from Chapter 7 to Chapter 1 and combined with examples introducing statistical inference
• New subsection on statistical modeling added
• More on surveys and errors in surveys moved from
Chapter 7 to Chapter 1
• New optional section introducing business analytics
and data mining added
• Sixteen new exercises added
Chapter 2
• Thirteen new data sets added for this chapter on
graphical descriptive methods
• Fourteen new exercises added
• New optional section on descriptive analytics
added
Chapter 3
• Twelve new data sets added for this chapter on
numerical descriptive methods
• Twenty-three new exercises added
• Four new optional sections on predictive analytics added: one section on decision trees; one section on cluster analysis and multidimensional scaling; one section on factor analysis; one section on association rule mining
Chapter 4
• New subsection on probability modeling added
• Exercises updated in this and all subsequent chapters
Chapter 7
• Material on how to select samples and errors in surveys has been moved to Chapter 1
Chapter 9
• Discussion of using critical value rules and p-values to test a population mean completely rewritten; development of and instructions for using hypothesis testing summary boxes improved
• Short presentation of the logic behind finding the probability of a Type II error when testing a two-sided alternative hypothesis now accompanies the general formula for calculating this probability
Chapter 10
• Statistical inference for a single population variance and comparing two population variances moved from its own chapter (the former Chapter 11) to Chapter 10
• More explicit examples of using hypothesis testing summary boxes when comparing means, proportions, and variances
Chapter 13
• Discussion of basic simple linear regression analysis streamlined, with discussion of r2 moved up and discussions of t and F tests combined into one section
• Section on residual analysis significantly shortened
and improved
• New exercises, with emphasis on students doing
complete statistical analyses on their own
Chapter 14
• Discussion of R2 moved up
• Discussion of backward elimination added
• Section on logistic regression expanded
• New section on neural networks added
• New exercises, with emphasis on students doing complete statistical analyses on their own
Chapter 15
• Discussion of the Box–Jenkins methodology slightly expanded and moved to Appendix B (at the end of the book)
• New time series exercises, with emphasis on students doing complete statistical analyses on their own
Chapters 16, 17, and 18
• No significant changes (These were the former Chapters 17, 18, and 19 on control charts, nonparametrics, and decision theory.)
Chapter 1  An Introduction to Business Statistics and Analytics
Chapter 2  Descriptive Statistics: Tabular and Graphical Methods and Descriptive Analytics
Chapter 3  Descriptive Statistics: Numerical Methods and Some Predictive Analytics
Appendix B  An Introduction to Box–Jenkins Models
Answers to Most Odd-Numbered Exercises
1.3 ■ Populations, Samples, and Traditional Statistics
1.4 ■ Random Sampling, Three Case Studies That Illustrate Statistical Inference, and Statistical Modeling
1.5 ■ Business Analytics and Data Mining (Optional)
1.6 ■ Ratio, Interval, Ordinal, and Nominative Scales of Measurement (Optional)
1.7 ■ Stratified Random, Cluster, and Systematic Sampling (Optional)
1.8 ■ More about Surveys and Errors in Survey Sampling (Optional)
Appendix 1.1 ■ Getting Started with Excel
Appendix 1.2 ■ Getting Started with MegaStat
Appendix 1.3 ■ Getting Started with Minitab
Descriptive Statistics: Tabular and Graphical Methods and Descriptive Analytics
2.1 ■ Graphically Summarizing Qualitative Data
2.2 ■ Graphically Summarizing Quantitative Data
2.3 ■ Dot Plots
2.4 ■ Stem-and-Leaf Displays
2.5 ■ Contingency Tables (Optional)
2.6 ■ Scatter Plots (Optional)
2.7 ■ Misleading Graphs and Charts (Optional)
2.8 ■ Descriptive Analytics (Optional)
Appendix 2.1 ■ Tabular and Graphical Methods Using Excel
Appendix 2.2 ■ Tabular and Graphical Methods Using MegaStat
Part 1 ■ Numerical Methods of Descriptive Statistics
3.1 ■ Describing Central Tendency
3.2 ■ Measures of Variation
3.3 ■ Percentiles, Quartiles, and Box-and-Whiskers Displays
3.4 ■ Covariance, Correlation, and the Least Squares Line (Optional)
3.5 ■ Weighted Means and Grouped Data (Optional)
3.6 ■ The Geometric Mean (Optional)
Part 2 ■ Some Predictive Analytics (Optional)
3.7 ■ Decision Trees: Classification Trees and Regression Trees (Optional)
3.8 ■ Cluster Analysis and Multidimensional Scaling (Optional)
3.9 ■ Factor Analysis (Optional and Requires Section 3.4)
3.10 ■ Association Rules (Optional)
Appendix 3.1 ■ Numerical Descriptive Statistics Using Excel
Appendix 3.2 ■ Numerical Descriptive Statistics Using MegaStat
Appendix 3.3 ■ Numerical Descriptive Statistics Using Minitab
Appendix 3.4 ■ Analytics Using JMP
Probability and Probability Models
4.1 ■ Probability, Sample Spaces, and Probability Models
4.2 ■ Probability and Events
4.3 ■ Some Elementary Probability Rules
4.4 ■ Conditional Probability and Independence
4.5 ■ Bayes' Theorem (Optional)
4.6 ■ Counting Rules (Optional)
Discrete Random Variables
5.1 ■ Two Types of Random Variables
5.2 ■ Discrete Probability Distributions
5.3 ■ The Binomial Distribution
5.4 ■ The Poisson Distribution (Optional)
5.5 ■ The Hypergeometric Distribution (Optional)
5.6 ■ Joint Distributions and the Covariance (Optional)
Appendix 5.1 ■ Binomial, Poisson, and Hypergeometric Probabilities Using Excel
Appendix 5.2 ■ Binomial, Poisson, and Hypergeometric Probabilities Using MegaStat
Appendix 5.3 ■ Binomial, Poisson, and Hypergeometric Probabilities Using Minitab
Continuous Random Variables
6.1 ■ Continuous Probability Distributions
6.2 ■ The Uniform Distribution
6.3 ■ The Normal Probability Distribution
6.4 ■ Approximating the Binomial Distribution by Using the Normal Distribution (Optional)
6.5 ■ The Exponential Distribution (Optional)
6.6 ■ The Normal Probability Plot (Optional)
Appendix 6.1 ■ Normal Distribution Using Excel
Appendix 6.2 ■ Normal Distribution Using MegaStat
Appendix 6.3 ■ Normal Distribution Using Minitab
Confidence Intervals
8.1 ■ z-Based Confidence Intervals for a Population Mean: σ Known
8.2 ■ t-Based Confidence Intervals for a Population Mean: σ Unknown
8.3 ■ Sample Size Determination
8.4 ■ Confidence Intervals for a Population Proportion
8.5 ■ Confidence Intervals for Parameters of Finite Populations (Optional)
Appendix 8.1 ■ Confidence Intervals Using Excel
Appendix 8.2 ■ Confidence Intervals Using MegaStat
Appendix 8.3 ■ Confidence Intervals Using Minitab
9.2 ■ z Tests about a Population Mean: σ Known
9.3 ■ t Tests about a Population Mean: σ Unknown
9.4 ■ z Tests about a Population Proportion
9.5 ■ Type II Error Probabilities and Sample Size Determination (Optional)
9.6 ■ The Chi-Square Distribution
9.7 ■ Statistical Inference for a Population Variance (Optional)
Appendix 9.1 ■ One-Sample Hypothesis Testing Using Excel
Appendix 9.2 ■ One-Sample Hypothesis Testing Using MegaStat
Appendix 9.3 ■ One-Sample Hypothesis Testing Using Minitab
Statistical Inferences Based on Two Samples
10.1 ■ Comparing Two Population Means by Using Independent Samples
10.2 ■ Paired Difference Experiments
10.5 ■ Comparing Two Population Variances by Using Independent Samples
Appendix 10.1 ■ Two-Sample Hypothesis Testing Using Excel
Appendix 10.2 ■ Two-Sample Hypothesis Testing Using MegaStat
Appendix 10.3 ■ Two-Sample Hypothesis Testing Using Minitab
Experimental Design and Analysis of Variance
11.1 ■ Basic Concepts of Experimental Design
11.2 ■ One-Way Analysis of Variance
11.3 ■ The Randomized Block Design
11.4 ■ Two-Way Analysis of Variance
Appendix 11.1 ■ Experimental Design and Analysis of Variance Using Excel
Appendix 11.2 ■ Experimental Design and Analysis of Variance Using MegaStat
Appendix 11.3 ■ Experimental Design and Analysis of Variance Using Minitab
Chi-Square Tests
12.1 ■ Chi-Square Goodness-of-Fit Tests
12.2 ■ A Chi-Square Test for Independence
Appendix 12.1 ■ Chi-Square Tests Using Excel
Appendix 12.2 ■ Chi-Square Tests Using MegaStat
Appendix 12.3 ■ Chi-Square Tests Using Minitab
Simple Linear Regression Analysis
13.1 ■ The Simple Linear Regression Model and the Least Squares Point Estimates
13.2 ■ Simple Coefficients of Determination and Correlation
13.3 ■ Model Assumptions and the Standard Error
13.4 ■ Testing the Significance of the Slope and y-Intercept
13.5 ■ Confidence and Prediction Intervals
13.6 ■ Testing the Significance of the Population Correlation Coefficient (Optional)
13.7 ■ Residual Analysis
Appendix 13.1 ■ Simple Linear Regression Analysis Using Excel
Appendix 13.2 ■ Simple Linear Regression Analysis Using MegaStat
Appendix 13.3 ■ Simple Linear Regression Analysis Using Minitab
Multiple Regression and Model Building
14.1 ■ The Multiple Regression Model and the Least Squares Point Estimates
14.2 ■ R2 and Adjusted R2
14.3 ■ Model Assumptions and the Standard Error
14.4 ■ The Overall F Test
14.5 ■ Testing the Significance of an Independent Variable
14.6 ■ Confidence and Prediction Intervals
14.7 ■ The Sales Representative Case: Evaluating Employee Performance
14.8 ■ Using Dummy Variables to Model Qualitative Independent Variables (Optional)
14.9 ■ Using Squared and Interaction Variables (Optional)
14.10 ■ Multicollinearity, Model Building, and Model Validation (Optional)
14.11 ■ Residual Analysis and Outlier Detection in Multiple Regression (Optional)
14.12 ■ Logistic Regression (Optional)
14.13 ■ Neural Networks (Optional)
Appendix 14.1 ■ Multiple Regression Analysis Using Excel
Appendix 14.2 ■ Multiple Regression Analysis Using MegaStat
Appendix 14.3 ■ Multiple Regression Analysis Using Minitab
Appendix 14.4 ■ Neural Network Analysis in JMP
15.6 ■ Forecast Error Comparisons
15.7 ■ Index Numbers
Appendix 15.1 ■ Time Series Analysis Using Excel
Appendix 15.2 ■ Time Series Analysis Using MegaStat
Appendix 15.3 ■ Time Series Analysis Using Minitab
Process Improvement Using Control Charts
16.1 ■ Quality: Its Meaning and a Historical Perspective
16.2 ■ Statistical Process Control and Causes of Process Variation
16.3 ■ Sampling a Process, Rational Subgrouping, and Control Charts
16.4 ■ x̄ and R Charts
16.5 ■ Comparison of a Process with Specifications: Capability Studies
16.6 ■ Charts for Fraction Nonconforming
16.7 ■ Cause-and-Effect and Defect Concentration Diagrams (Optional)
Appendix 16.1 ■ Control Charts Using MegaStat
Appendix 16.2 ■ Control Charts Using Minitab
Nonparametric Methods
17.1 ■ The Sign Test: A Hypothesis Test about the Median
17.2 ■ The Wilcoxon Rank Sum Test
17.3 ■ The Wilcoxon Signed Ranks Test
17.4 ■ Comparing Several Populations Using the Kruskal–Wallis H Test
17.5 ■ Spearman's Rank Correlation Coefficient
Appendix 17.1 ■ Nonparametric Methods Using MegaStat
Appendix 17.2 ■ Nonparametric Methods Using Minitab
18.3 ■ Introduction to Utility Theory
Answers to Most Odd-Numbered Exercises
References
Photo Credits
Index
1.1 Data
1.2 Data Sources, Data Warehousing, and Big Data
1.3 Populations, Samples, and Traditional Statistics
1.4 Random Sampling, Three Case Studies
That Illustrate Statistical Inference, and Statistical Modeling
1.5 Business Analytics and Data Mining (Optional)
1.6 Ratio, Interval, Ordinal, and Nominative
Scales of Measurement (Optional)
1.7 Stratified Random, Cluster, and Systematic
Sampling (Optional)
1.8 More about Surveys and Errors in Survey
Sampling (Optional)
LO1-1 Define a variable
LO1-2 Describe the difference between a quantitative
variable and a qualitative variable
LO1-3 Describe the difference between
cross-sectional data and time series data
LO1-4 Construct and interpret a time series (runs) plot
LO1-5 Identify the different types of data sources:
existing data sources, experimental studies, and observational studies
LO1-6 Explain the basic ideas of data
warehousing and big data
LO1-7 Describe the difference between a
population and a sample
1.1 Data
Data sets, elements, and variables
We have said that data are facts and figures from which conclusions can be drawn. Together, the data that are collected for a particular study are referred to as a data set. For example, Table 1.1 is a data set that gives information about the new homes sold in a Florida luxury home development over a recent three-month period. Potential home buyers could choose either the "Diamond" or the "Ruby" home model design and could have the home built on either a lake lot or a treed lot (with no water access).
In order to understand the data in Table 1.1, note that any data set provides information about some group of individual elements, which may be people, objects, events, or other entities. The information that a data set provides about its elements usually describes one or more characteristics of these elements.
Any characteristic of an element is called a variable.
LO1-1
Define a variable.
Table 1.1  A Data Set Describing Five Home Sales  (DS HomeSales)
The subject of statistics involves the study of how to collect, analyze, and interpret data. Data are facts and figures from which conclusions can be drawn. Such conclusions are important to the decision making of many professions and organizations. For example, economists use conclusions drawn from the latest data on unemployment and inflation to help the government make policy decisions. Financial planners use recent trends in stock market prices and economic conditions to make investment decisions. Accountants use sample data concerning a company's actual sales revenues to assess whether the company's claimed sales revenues are valid. Marketing professionals and data miners help businesses decide which products to develop and market and which consumers to target in marketing campaigns by using data that reveal consumer preferences. Production supervisors use manufacturing data to evaluate, control, and improve product quality. Politicians rely on data from public opinion polls to formulate legislation and to devise campaign strategies. Physicians and hospitals use data on the effectiveness of drugs and surgical procedures to provide patients with the best possible treatment.
In this chapter we begin to see how we collect and analyze data. As we proceed through the chapter, we introduce several case studies. These case studies (and others to be introduced later) are revisited throughout later chapters as we learn the statistical methods needed to analyze them. Briefly, we will begin to study four cases:
The Cell Phone Case: A bank estimates its cellular phone costs and decides whether to outsource management of its wireless resources by studying the calling patterns of its employees.
The Marketing Research Case: A beverage company investigates consumer reaction to a new bottle design for one of its popular soft drinks.
The Car Mileage Case: To determine if it qualifies for a federal tax credit based on fuel economy, an automaker studies the gas mileage of its new midsize model.
The Disney Parks Case: Walt Disney World Parks and Resorts in Orlando, Florida, manages Disney parks worldwide and uses data gathered from its guests to give these guests a more "magical" experience and increase Disney revenues and profits.
For the data set in Table 1.1, each sold home is an element, and four variables are used to describe the homes. These variables are (1) the home model design, (2) the type of lot on which the home was built, (3) the list (asking) price, and (4) the (actual) selling price. Moreover, each home model design came with "everything included"—specifically, a complete, luxury interior package and a choice (at no price difference) of one of three different architectural exteriors. The builder made the list price of each home solely dependent on the model design. However, the builder gave various price reductions for homes built on treed lots.
The data in Table 1.1 are real (with some minor changes to protect privacy) and were provided by a business executive—a friend of the authors—who recently received a promotion and needed to move to central Florida. While searching for a new home, the executive and his family visited the luxury home community and decided they wanted to purchase a Diamond model on a treed lot. The list price of this home was $494,000, but the developer offered to sell it for an "incentive" price of $469,000. Intuitively, the incentive price's $25,000 savings off list price seemed like a good deal. However, the executive resisted making an immediate decision. Instead, he decided to collect data on the selling prices of new homes recently sold in the community and use the data to assess whether the developer might accept a lower offer. In order to collect "relevant data," the executive talked to local real estate professionals and learned that new homes sold in the community during the previous three months were a good indicator of current home value. Using real estate sales records, the executive also learned that five of the community's new homes had sold in the previous three months. The data given in Table 1.1 are the data that the executive collected about these five homes.
When the business executive examined Table 1.1, he noted that homes on lake lots had sold at their list price, but homes on treed lots had not. Because the executive and his family wished to purchase a Diamond model on a treed lot, the executive also noted that two Diamond models on treed lots had sold in the previous three months. One of these Diamond models had sold for the incentive price of $469,000, but the other had sold for a lower price of $440,000. Hoping to pay the lower price for his family's new home, the executive offered $440,000 for the Diamond model on the treed lot. Initially, the home builder turned down this offer, but two days later the builder called back and accepted the offer. The executive had used data to buy the new home for $54,000 less than the list price and $29,000 less than the incentive price!
Quantitative and qualitative variables
For any variable describing an element in a data set, we carry out a measurement to assign a value of the variable to the element. For example, in the real estate example, real estate sales records gave the actual selling price of each home to the nearest dollar. As another example, a credit card company might measure the time it takes for a cardholder's bill to be paid to the nearest day. Or, as a third example, an automaker might measure the gasoline mileage obtained by a car in city driving to the nearest one-tenth of a mile per gallon by conducting a mileage test on a driving course prescribed by the Environmental Protection Agency (EPA). If the possible values of a variable are numbers that represent quantities (that is, "how much" or "how many"), then the variable is said to be quantitative. For example, (1) the actual selling price of a home, (2) the payment time of a bill, (3) the gasoline mileage of a car, and (4) the 2014 payroll of a Major League Baseball team are all quantitative variables. Considering the last example, Table 1.2 in the page margin gives the 2014 payroll (in millions of dollars) for each of the 30 Major League Baseball (MLB) teams. Moreover, Figure 1.1 portrays the team payrolls as a dot plot. In this plot, each team payroll is shown
Table 1.2 (excerpt)  2014 MLB Team Payrolls (in millions of dollars)
Los Angeles Dodgers 235; New York Yankees 204; Philadelphia Phillies 180; Boston Red Sox 163; Detroit Tigers 162; Los Angeles Angels 156; San Francisco Giants 154; Kansas City Royals 92; Chicago White Sox 91; San Diego Padres 90; New York Mets 89
as a dot located on the real number line—for example, the leftmost dot represents the payroll for the Houston Astros. In general, the values of a quantitative variable are numbers on the real line. In contrast, if we simply record into which of several categories an element falls, then the variable is said to be qualitative or categorical. Examples of categorical variables include (1) a person's gender, (2) whether a person who purchases a product is satisfied with the product, (3) the type of lot on which a home is built, and (4) the color of a car.¹ Figure 1.2 illustrates the categories we might use for the qualitative variable "car color." This figure is a bar chart showing the 10 most popular (worldwide) car colors for 2012 and the percentages of cars having these colors.
Cross-sectional and time series data
Some statistical techniques are used to analyze cross-sectional data, while others are used to analyze time series data. Cross-sectional data are data collected at the same or approximately the same point in time. For example, suppose that a bank wishes to analyze last month's cell phone bills for its employees. Then, because the cell phone costs given by these bills are for different employees in the same month, the cell phone costs are cross-sectional data. Time series data are data collected over different time periods. For example, Table 1.3 presents the average basic cable television rate in the United States for each of the years 1999 to 2009. Figure 1.3 is a time series plot—also called a runs plot—of these data. Here we plot each cable rate on the vertical scale versus its corresponding time index (year) on the horizontal scale. For instance, the first cable rate ($28.92) is plotted versus 1999, the second cable rate ($30.37) is plotted versus 2000, and so forth. Examining the time series plot, we see that the cable rates increased steadily over the years 1999 to 2009.
LO1-3
Describe the difference between cross-sectional data and time series data.
LO1-4
Construct and interpret
a time series (runs) plot.
Table 1.3  The Average Basic Cable Rates in the U.S. from 1999 to 2009  (DS BasicCable)
Year:            1999   2000   2001   2002   2003   2004   2005   2006   2007   2008   2009
Cable Rate ($):  28.92  30.37  32.87  34.71  36.59  38.14  39.63  41.17  42.72  44.28  46.13
Source: U.S. Energy Information Administration, http://www.eia.gov/
[Figure 1.3  Time Series Plot of the Average Basic Cable Rates in the U.S. from 1999 to 2009  (DS BasicCable)]
[Figure 1.2  The Ten Most Popular Car Colors in the World for 2012 (Car Color Is a Qualitative Variable): White/White Pearl, Black/Black Effect, Silver, Gray, Red, Blue, Brown/Beige, Green, Yellow/Gold. Source: http://www.autoweek.com/article/20121206/carnews01/121209911 (accessed September 12, 2013).]
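Readers who prefer software to hand plotting can reproduce the runs plot of Figure 1.3 from the Table 1.3 data. This sketch uses Python's matplotlib, one reasonable choice among many (the text itself uses Excel, MegaStat, and Minitab).

import matplotlib.pyplot as plt

years = list(range(1999, 2010))
rates = [28.92, 30.37, 32.87, 34.71, 36.59, 38.14,
         39.63, 41.17, 42.72, 44.28, 46.13]

plt.plot(years, rates, marker="o")       # plot each rate versus its year
plt.xlabel("Year")
plt.ylabel("Average basic cable rate ($)")
plt.title("Time Series Plot of Average Basic Cable Rates, 1999-2009")
plt.show()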
1.2 Data Sources, Data Warehousing, and Big Data
Primary data are data collected by an individual or business directly through planned experimentation or observation. Secondary data are data taken from an existing source.
Existing sources
Sometimes we can use data already gathered by public or private sources. The Internet is an obvious place to search for electronic versions of government publications, company reports, and business journals, but there is also a wealth of information available in the reference section of a good library or in county courthouse records.
If a business wishes to find demographic data about regions of the United States, a natural source is the U.S. Census Bureau's website at http://www.census.gov. Other useful websites for economic and financial data include the Federal Reserve at http://research.stlouisfed.org/fred2/ and the Bureau of Labor Statistics at http://stats.bls.gov/.
However, given the ease with which anyone can post documents, pictures, blogs, and videos on the Internet, not all sites are equally reliable. Some of the sources will be more useful, exhaustive, and error-free than others. Fortunately, search engines prioritize the lists and provide the most relevant and highly used sites first.
Obviously, performing such web searches costs next to nothing and takes relatively little time, but the tradeoff is that we are also limited in terms of the type of information we are able to find. Another option may be to use a private data source. Most companies keep and use employee records and information about their customers, products, processes (inventory, payroll, manufacturing, and accounting), and advertising results. If we have no affiliation with these companies, however, these data may be difficult to obtain.
Another alternative would be to contact a data collection agency, which typically incurs some kind of cost. You can either buy subscriptions or purchase individual company financial reports from agencies like Bloomberg and Dow Jones & Company. If you need to collect specific information, some companies, such as ACNielsen and Information Resources, Inc., can be hired to collect the information for a fee. Moreover, no matter what existing source you take data from, it is important to assess how reliable the data are by determining how, when, and where the data were collected.
Experimental and observational studies
There are many instances when the data we need are not readily available from a public or private source. In cases like these, we need to collect the data ourselves. Suppose we work for a beverage company and want to assess consumer reactions to a new bottled water. Because the water has not been marketed yet, we may choose to conduct taste tests, focus groups, or some other market research. As another example, when projecting political election results, telephone surveys and exit polls are commonly used to obtain the information needed to predict voting trends. New drugs for fighting disease are tested by collecting data under carefully controlled and monitored experimental conditions. In many marketing, political, and medical situations of these sorts, companies sometimes hire outside consultants or statisticians to help them obtain appropriate data. Regardless of whether newly minted data are gathered in-house or by paid outsiders, this type of data collection requires much more time, effort, and expense than are needed when data can be found from public or private sources.
When initiating a study, we first define our variable of interest, or response variable. Other variables, typically called factors, that may be related to the response variable of interest will also be measured. When we are able to set or manipulate the values of these factors, we have an experimental study. For example, a pharmaceutical company might wish to determine the most appropriate daily dose of a cholesterol-lowering drug for patients having cholesterol levels that are too high. The company can perform an experiment in which one
LO1-5
Identify the different
types of data sources:
existing data sources,
experimental studies,
and observational
studies.
sample of patients receives a placebo; a second sample receives some low dose; a third a
higher dose; and so forth. This is an experiment because the company controls the amount of drug each group receives. The optimal daily dose can be determined by analyzing the patients' responses to the different dosage levels given.
When analysts are unable to control the factors of interest, the study is observational. In studies of diet and cholesterol, patients' diets are not under the analyst's control. Patients are often unwilling or unable to follow prescribed diets; doctors might simply ask patients what they eat and then look for associations between the factor diet and the response variable cholesterol level.
Asking people what they eat is an example of performing a survey. In general, people in a survey are asked questions about their behaviors, opinions, beliefs, and other characteristics. For instance, shoppers at a mall might be asked to fill out a short questionnaire which seeks their opinions about a new bottled water. In other observational studies, we might simply observe the behavior of people. For example, we might observe the behavior of shoppers as they look at a store display, or we might observe the interactions between students and teachers.
Transactional data, data warehousing, and big data
With the increased use of online purchasing and with increased competition, businesses have become more aggressive about collecting information concerning customer transactions. Every time a customer makes an online purchase, more information is obtained than just the details of the purchase itself. For example, the web pages searched before making the purchase and the times that the customer spent looking at the different web pages are recorded. Similarly, when a customer makes an in-store purchase, store clerks often ask for the customer's address, zip code, e-mail address, and telephone number. By studying past customer behavior and pertinent demographic information, businesses hope to accurately predict customer response to different marketing approaches and leverage these predictions into increased revenues and profits. Dramatic advances in data capture, data transmission, and data storage capabilities are enabling organizations to integrate various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval and has as its ideal objective the creation and maintenance of a central repository for all of an organization's data. The huge capacity of data warehouses has given rise to the term big data, which refers to massive amounts of data, often collected at very fast rates in real time and in different forms and sometimes needing quick preliminary analysis for effective business decision making.
EXAMPLE 1.1  The Disney Parks Case: Improving Visitor Experiences
Annually, approximately 100 million visitors spend time in Walt Disney parks around the world. These visitors could generate a lot of data, and in 2013, Walt Disney World Parks and Resorts introduced the wireless-tracking wristband MagicBand in Walt Disney World in Orlando, Florida.
The MagicBands are linked to a credit card and serve as a park entry pass and hotel room key. They are part of the MyMagic+ system, and wearing a band is completely voluntary. In addition to expediting sales transactions and hotel room access in the Disney theme parks, MagicBands provide visitors with easier access to FastPass lines for Disney rides and attractions. Each visitor to a Disney theme park may choose a FastPass for three rides or attractions per day. A FastPass allows a visitor to enter a line where there is virtually no waiting time. The MyMagic+ system automatically programs a visitor's FastPass selections into his or her MagicBand. As shown by the photo on the page margin, a visitor simply places the
LO1-6
Explain the basic ideas
of data warehousing and big data.
segmentation data. For example, the data tell Disney the types and ages of people who like specific attractions. To store, process, analyze, and visualize all the data, Disney has constructed a gigantic data warehouse and a big data analysis platform. The data analysis allows Disney to improve daily park operations (by having the right numbers of staff on hand for the number of visitors currently in the park); to improve visitor experiences when choosing their "next" ride (by having large displays showing the waiting times for the park's rides); to improve its attraction offerings; and to tailor its marketing messages to different types of visitors.
Finally, although it collects massive amounts of data, Disney is very ethical in protecting the privacy of its visitors. First, as previously stated, visitors can choose not to wear a MagicBand. Moreover, visitors who do decide to wear one have control over the quantities of data collected, stored, and shared. Visitors can use a menu to specify whether Disney can send them personalized offers during or after their park visit. Parents also have to opt in before the characters in the park can address their children by name or use other personal information stored in the MagicBands.
CONCEPTS
1.1 Define what we mean by a variable, and explain the difference between a quantitative variable and a qualitative (categorical) variable.
1.2 Below we list several variables Which of these variables
are quantitative and which are qualitative? Explain.
a The dollar amount on an accounts receivable invoice.
b The net profit for a company in 2015.
c The stock exchange on which a company’s stock is
traded.
d The national debt of the United States in 2015.
e The advertising medium (radio, television, or print)
used to promote a product.
1.3 (1) Discuss the difference between cross-sectional data and time series data. (2) If we record the total number of cars sold in 2015 by each of 10 car salespeople, are the data cross-sectional or time series data? (3) If we record the total number of cars sold by a particular car salesperson in each of the years 2011, 2012, 2013, 2014, and 2015, are the data cross-sectional or time series data?
1.4 Consider a medical study that is being performed to test the effect of smoking on lung cancer. Two groups of subjects are identified; one group has lung cancer and the other one doesn't. Both are asked to fill out a questionnaire containing questions about their age, sex, occupation, and number of cigarettes smoked per day. (1) What is the response variable? (2) Which are the factors? (3) What type of study is this (experimental or observational)?
1.5 What is a data warehouse? What does the term big data mean?
METHODS AND APPLICATIONS
1.7 Consider the five homes in Table 1.1 (page 3). What do you think you would have to pay for a Diamond model on a lake lot? For a Ruby model on a lake lot?
1.8 The numbers of Bismark X-12 electronic calculators sold at Smith's Department Stores over the past 24 months have been: 197, 211, 203, 247, 239, 269, 308, 262, 258, 256, 261, 288, 296, 276, 305, 308, 356, 393, 363, 386, 443, 308, 358, and 384. Make a time series plot of these data. That is, plot 197 versus month 1, 211 versus month 2, and so forth. What does the time series plot tell you?  (DS CalcSale)
1.3 Populations, Samples, and Traditional Statistics
We often collect data in order to study a population.
A population is the set of all elements about which we wish to draw conclusions.
Examples of populations include (1) all of last year's graduates of Dartmouth College's Master of Business Administration program, (2) all current MasterCard cardholders, and (3) all Buick LaCrosses that have been or will be produced this year.
LO1-7
Describe the difference
between a population
and a sample.
We usually focus on studying one or more variables describing the population elements. If we carry out a measurement to assign a value of a variable to each and every population element, we have a population of measurements (sometimes called observations). If the population is small, it is reasonable to do this. For instance, if 150 students graduated last year from the Dartmouth College MBA program, it might be feasible to survey the graduates and to record all of their starting salaries. In general:
If we examine all of the population measurements, we say that we are conducting a census of the population.
Often the population that we wish to study is very large, and it is too time-consuming or costly to conduct a census. In such a situation, we select and analyze a subset (or portion) of the population elements.
A sample is a subset of the elements of a population.
For example, suppose that 8,742 students graduated last year from a large state university. It would probably be too time-consuming to take a census of the population of all of their starting salaries. Therefore, we would select a sample of graduates, and we would obtain and record their starting salaries. When we measure a characteristic of the elements in a sample, we have a sample of measurements.
We often wish to describe a population or sample.
Descriptive statistics is the science of describing the important aspects of a set of measurements.
As an example, if we are studying a set of starting salaries, we might wish to describe (1) what a typical salary might be and (2) how much the salaries vary from each other.
When the population of interest is small and we can conduct a census of the population, we will be able to directly describe the important aspects of the population measurements. However, if the population is large and we need to select a sample from it, then we use what we call statistical inference.
Statistical inference is the science of using a sample of measurements to make generalizations about the important aspects of a population of measurements.
For instance, we might use the starting salaries recorded for a sample of the 8,742 students who graduated last year from a large state university to estimate the typical starting salary and the variation of the starting salaries for the entire population of the 8,742 graduates. Or General Motors might use a sample of Buick LaCrosses produced this year to estimate the typical EPA combined city and highway driving mileage and the variation of these mileages for all LaCrosses that have been or will be produced this year.
What we might call traditional statistics consists of a set of concepts and techniques that are used to describe populations and samples and to make statistical inferences about populations by using samples. Much of this book is devoted to traditional statistics, and in the next section we will discuss random sampling (or approximately random sampling). We will also introduce using traditional statistical modeling to make statistical inferences. However, traditional statistics is sometimes not sufficient to analyze big data, which (we recall) refers to massive amounts of data often collected at very fast rates in real time and sometimes needing quick preliminary analysis for effective business decision making. For this reason, two related extensions of traditional statistics—business analytics and data mining—have been developed to help analyze big data. In optional Section 1.5 we will begin to discuss business analytics and data mining.
busi-LO1-8
Distinguish between descriptive statistics and statistical inference.
1.4 Random Sampling, Three Case Studies That Illustrate Statistical Inference, and Statistical Modeling
Random sampling and three case studies that illustrate statistical inference
If the information contained in a sample is to accurately reflect the population under study, the sample should be randomly selected from the population. To intuitively illustrate random sampling, suppose that a small company employs 15 people and wishes to randomly select two of them to attend a convention. To make the random selections, we number the employees from 1 to 15, and we place in a hat 15 identical slips of paper numbered from 1 to 15. We thoroughly mix the slips of paper in the hat and, blindfolded, choose one. The number on the chosen slip of paper identifies the first randomly selected employee. Then, still blindfolded, we choose another slip of paper from the hat. The number on the second slip identifies the second randomly selected employee.
Of course, when the population is large, it is not practical to randomly select slips of paper from a hat. For instance, experience has shown that thoroughly mixing slips of paper (or the like) can be difficult. Further, dealing with many identical slips of paper would be cumbersome and time-consuming. For these reasons, statisticians have developed more efficient and accurate methods for selecting a random sample. To discuss these methods we let n denote the number of elements in a sample. We call n the sample size. We now define a random sample of n elements and explain how to select such a sample.²
1  If we select n elements from a population in such a way that every set of n elements in the population has the same chance of being selected, then the n elements we select are said to be a random sample.
2  In order to select a random sample of n elements from a population, we make n random selections—one at a time—from the population. On each random selection, we give every element remaining in the population for that selection the same chance of being chosen.
In making random selections from a population, we can sample with or without replacement. If we sample with replacement, we place the element chosen on any particular selection back into the population. Thus, we give this element a chance to be chosen on any succeeding selection. If we sample without replacement, we do not place the element chosen on a particular selection back into the population. Thus, we do not give this element a chance to be chosen on any succeeding selection. It is best to sample without replacement. Intuitively, this is because choosing the sample without replacement guarantees that all of the elements in the sample will be different, and thus we will have the fullest possible look at the population.
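The two sampling schemes can be illustrated with Python's random module for the 15-employee example: random.sample never repeats an element (without replacement), while random.choices can repeat one (with replacement).

import random

population = list(range(1, 16))              # the 15 employees, numbered 1 to 15
without = random.sample(population, 2)       # without replacement: 2 distinct employees
with_repl = random.choices(population, k=2)  # with replacement: repeats are possible

print("without replacement:", without)
print("with replacement:   ", with_repl)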
We now introduce three case studies that illustrate (1) the need for a random (or approximately random) sample, (2) how to select the needed sample, and (3) the use of the sample in making statistical inferences.
EXAMPLE 1.2  The Cell Phone Case: Reducing Cellular Phone Costs
Part 1: The Cost of Company Cell Phone Use  Rising cell phone costs have forced companies having large numbers of cellular users to hire services to manage their cellular and other wireless resources. These cellular management services use sophisticated software and mathematical models to choose cost-efficient cell phone plans for their clients. One such firm, mindWireless of Austin, Texas, specializes in automated wireless cost management.
LO1-9
Explain the concept of
random sampling and
select a random sample.
²Actually, there are several different kinds of random samples. The type we will define is sometimes called a simple random sample. For brevity's sake, however, we will use the term random sample.
According to Kevin Whitehurst, co-founder of mindWireless, cell phone carriers count on overage—using more minutes than one's plan allows—and underage—using fewer minutes than those already paid for—to deliver almost half of their revenues.³ As a result, a company's typical cost of cell phone use can be excessive—18 cents per minute or more. However, Mr. Whitehurst explains that by using mindWireless automated cost management to select calling plans, this cost can be reduced to 12 cents per minute or less.
In this case we consider a bank that wishes to decide whether to hire a cellular management service to choose its employees' calling plans. While the bank has over 10,000 employees on many different types of calling plans, a cellular management service suggests that by studying the calling patterns of cellular users on 500-minute-per-month plans, the bank can accurately assess whether its cell phone costs can be substantially reduced. The bank has 2,136 employees on a variety of 500-minute-per-month plans with different basic monthly rates, different overage charges, and different additional charges for long distance and roaming. It would be extremely time-consuming to analyze in detail the cell phone bills of all 2,136 employees. Therefore, the bank will estimate its cellular costs for the 500-minute plans by analyzing last month's cell phone bills for a random sample of 100 employees on these plans.⁴
Part 2: Selecting a Random Sample  The first step in selecting a random sample is to obtain a numbered list of the population elements. This list is called a frame. Then we can use a random number table or computer-generated random numbers to make random selections from the numbered list. Therefore, in order to select a random sample of 100 employees from the population of 2,136 employees on 500-minute-per-month cell phone plans, the bank will make a numbered list of the 2,136 employees on 500-minute plans. The bank can then use a random number table, such as Table 1.4(a) on the next page, to select the random sample. To see how this is done, note that any single-digit number in the table has been chosen in such a way that any of the single-digit numbers between 0 and 9 had the same chance of being chosen. For this reason, we say that any single-digit number in the table is a random number between 0 and 9. Similarly, any two-digit number in the table is a random number between 00 and 99, any three-digit number in the table is a random number between 000 and 999, and so forth. Note that the table entries are segmented into groups of five to make the table easier to read. Because the total number of employees on 500-minute cell phone plans (2,136) is a four-digit number, we arbitrarily select any set of four digits in the table (we have circled these digits). This number, which is 0511, identifies the first randomly selected employee. Then, moving in any direction from the 0511 (up, down, right, or left—it does not matter which), we select additional sets of four digits. These succeeding sets of digits identify additional randomly selected employees. Here we arbitrarily move down from 0511 in the table. The first seven sets of four digits we obtain are

0511  7156  0285  4461  3990  4919  1915

(See Table 1.4(a)—these numbers are enclosed in a rectangle.) Because there are no employees numbered 7156, 4461, 3990, or 4919 (remember only 2,136 employees are on 500-minute plans), we ignore these numbers. This implies that the first three randomly selected employees are those numbered 0511, 0285, and 1915. Continuing this procedure, we can obtain the entire random sample of 100 employees. Notice that, because we are sampling without replacement, we should ignore any set of four digits previously selected from the random number table.
While using a random number table is one way to select a random sample, this approach has a disadvantage that is illustrated by the current situation. Specifically, because most four-digit random numbers are not between 0001 and 2136, obtaining 100 different, four-digit random numbers between 0001 and 2136 will require ignoring a large number of random numbers in the random number table, and we will in fact need to use a random number table that is larger than Table 1.4(a). Although larger random number tables are readily available in books of mathematical and statistical tables, a good alternative is to use a computer software package, which can generate random numbers that are between whatever values we specify. For example, Table 1.4(b) gives the Minitab output of 100 different, four-digit random numbers that are between 0001 and 2136 (note that the "leading 0's" are not included in these four-digit numbers). If used, the random numbers in Table 1.4(b) would identify the 100 employees that form the random sample. For example, the first three randomly selected employees would be employees 705, 1990, and 1007.
Finally, note that computer software packages sometimes generate the same random number twice and thus are sampling with replacement. Because we wished to randomly select 100 employees without replacement, we had Minitab generate more than 100 (actually, 110) random numbers. We then ignored the repeated random numbers to obtain the 100 different random numbers in Table 1.4(b).
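What Minitab does in Table 1.4(b) can be sketched in a few lines of Python: draw 100 distinct employee numbers between 1 and 2,136. Python's random.sample draws without replacement, so no duplicate-discarding step is needed. (The seed only makes the example rerun identically; it does not reproduce the Minitab numbers.)

import random

random.seed(42)                                    # fix the seed so the example is repeatable
employee_ids = random.sample(range(1, 2137), 100)  # 100 distinct IDs in 1..2136
print(employee_ids[:3])                            # the first three sampled employees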
Part 3: A Random Sample and Inference  When the random sample of 100 employees is chosen, the number of cellular minutes used by each sampled employee during last month (the employee's cellular usage) is found and recorded. The 100 cellular-usage figures are given in Table 1.5. Looking at this table, we can see that there is substantial overage and underage—many employees used far more than 500 minutes, while many others failed to use all of the 500 minutes allowed by their plan. In Chapter 3 we will use these 100 usage figures to estimate the bank's cellular costs and decide whether the bank should hire a cellular management service.
(b) Minitab output of 100 different, four-digit random numbers between 1 and 2136
EXAMPLE 1.3 The Marketing Research Case: Rating a Bottle Design
Part 1: Rating a Bottle Design The design of a package or bottle can have an important effect on a company's bottom line. In this case a brand group wishes to research consumer reaction to a new bottle design for a popular soft drink. Because it is impossible to show the new bottle design to "all consumers," the brand group will use the mall intercept method to select a sample of 60 consumers. On a particular Saturday, the brand group will choose a shopping mall and a sampling time so that shoppers at the mall during the sampling time are a representative cross-section of all consumers. Then, shoppers will be intercepted as they walk past a designated location, will be shown the new bottle, and will be asked to rate the bottle image. For each consumer interviewed, a bottle image composite score will be found by adding the consumer's numerical responses to the five questions shown in Figure 1.4. It follows that the minimum possible bottle image composite score is 5 (resulting from a response of 1 on all five questions) and the maximum possible bottle image composite score is 35 (resulting from a response of 7 on all five questions). Furthermore, experience has shown that the smallest acceptable bottle image composite score for a successful bottle design is 25.
Part 2: Selecting an Approximately Random Sample Because it is not possible to list and number all of the shoppers who will be at the mall on this Saturday, we cannot select a random sample of these shoppers. However, we can select an approximately random sample of these shoppers. To see one way to do this, note that there are 6 ten-minute intervals during each hour, and thus there are 60 ten-minute intervals during the 10-hour period from 10 a.m. to 8 p.m.—the time when the shopping mall is open. Therefore, one way to select an approximately random sample is to choose a particular location at the mall that most shoppers will walk by and then randomly select—at the beginning of each ten-minute period—one of the first shoppers who walk by the location. Here, although we could randomly select one person from any reasonable number of shoppers who walk by, we will (arbitrarily) randomly select one of the first five shoppers who walk by. For example, starting in the upper left-hand corner of Table 1.4(a) and proceeding down the first column, note that the first three random numbers between 1 and 5 are 3, 5, and 1. This implies that (1) at 10 a.m. we would select the 3rd shopper who walks by; (2) at 10:10 a.m. we would select the 5th shopper who walks by; (3) at 10:20 a.m. we would select the 1st shopper who walks by; and so forth. Furthermore, assume that the composite score ratings of the new bottle design that would be given by all shoppers at the mall on the Saturday are representative of the composite score ratings that would be given by all possible consumers. It then follows that the composite score ratings given by the 60 sampled shoppers can be regarded as an approximately random sample that can be used to make statistical inferences about the population of all possible consumer composite score ratings.
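In software, the interval-by-interval selections could be generated as in the minimal Python sketch below (the seed is an arbitrary assumption, so the draws will not reproduce the 3, 5, and 1 read from Table 1.4(a)):

    import random

    random.seed(1)  # arbitrary, hypothetical seed

    # One draw per ten-minute interval over the 10-hour day: which of the
    # first five passing shoppers (1 through 5) to intercept.
    picks = [random.randint(1, 5) for _ in range(60)]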
Figure 1.4 The Bottle Design Survey Instrument. Instructions: "Please circle the response that most accurately describes whether you agree or disagree with each statement about the bottle you have examined." Each of the five statements (for example, "The size of this bottle is convenient") is rated from 1 (Strongly Disagree) to 7 (Strongly Agree).

Part 3: The Approximately Random Sample and Inference When the brand group uses the mall intercept method to interview a sample of 60 shoppers at a mall on a particular Saturday, the 60 bottle image composite scores in Table 1.6 are obtained.

Table 1.6 A Sample of Bottle Design Ratings (Composite Scores for a Sample of 60 Shoppers)

Because these scores
Trang 35vary from a minimum of 20 to a maximum of 35, we might infer that most consumers would
rate the new bottle design between 20 and 35 Furthermore, 57 of the 60 composite scores are
at least 25 Therefore, we might estimate that a proportion of 57/60 5 95 (that is, 95 percent)
of all consumers would give the bottle design a composite score of at least 25 In future ters we will further analyze the composite scores
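Once the 60 scores are recorded, the proportion estimate above is a one-line computation. The following minimal Python sketch assumes a list named scores that holds the 60 ratings of Table 1.6; the individual ratings are not reproduced here:

    def proportion_at_least(scores, cutoff=25):
        """Fraction of ratings that meet the minimum acceptable composite score."""
        return sum(s >= cutoff for s in scores) / len(scores)

    # With 57 of the 60 scores at least 25, this returns 57/60 = 0.95.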
Processes
Sometimes we are interested in studying the population of all of the elements that will be or
could potentially be produced by a process.
A process is a sequence of operations that takes inputs (labor, materials, methods, machines,
and so on) and turns them into outputs (products, services, and the like).
Processes produce output over time. For example, this year's Buick LaCrosse manufacturing process produces LaCrosses over time. Early in the model year, General Motors might wish to study the population of the city driving mileages of all Buick LaCrosses that will be produced during the model year. Or, even more hypothetically, General Motors might wish to study the population of the city driving mileages of all LaCrosses that could potentially be produced by this model year's manufacturing process. The first population is called a finite population because only a finite number of cars will be produced during the year. The second population is called an infinite population because the manufacturing process that produces this year's model could in theory always be used to build "one more car." That is, theoretically there is no limit to the number of cars that could be produced by this year's process. There are a multitude of other examples of finite or infinite hypothetical populations. For instance, we might study the population of all waiting times that will or could potentially be experienced by patients of a hospital emergency room. Or we might study the population of all the amounts of grape jelly that will be or could potentially be dispensed into 16-ounce jars by an automated filling machine. To study a population of potential process observations, we sample the process—often at equally spaced time points—over time.
EXAMPLE 1.4 The Car Mileage Case: Estimating Mileage
Part 1: Auto Fuel Economy Personal budgets, national energy security, and the global environment are all affected by our gasoline consumption. Hybrid and electric cars are a vital part of a long-term strategy to reduce our nation's gasoline consumption. However, until use of these cars is more widespread and affordable, the most effective way to conserve gasoline is to design gasoline-powered cars that are more fuel efficient.5 In the short term, "that will give you the biggest bang for your buck," says David Friedman, research director of the Union of Concerned Scientists' Clean Vehicle Program.6
In this case study we consider a tax credit offered by the federal government to automakers for improving the fuel economy of gasoline-powered midsize cars. According to The Fuel Economy Guide—2015 Model Year, virtually every gasoline-powered midsize car equipped with an automatic transmission and a six-cylinder engine has an EPA combined city and highway mileage estimate of 26 miles per gallon (mpg) or less.7 As a matter of fact, when this book was written, the mileage leader in this category was the Honda Accord, which registered a combined city and highway mileage of 26 mpg. While fuel economy has seen improvement in almost all car categories, the EPA has concluded that an additional 5 mpg increase in fuel economy is significant and feasible.8 Therefore, suppose that the government has decided to offer the tax credit to any automaker selling a midsize model with an automatic transmission and a six-cylinder engine that achieves an EPA combined city and highway mileage estimate of at least 31 mpg.
Part 2: Sampling a Process Consider an automaker that has recently introduced a new midsize model with an automatic transmission and a six-cylinder engine and wishes to demonstrate that this new model qualifies for the tax credit. In order to study the population of all cars of this type that will or could potentially be produced, the automaker will choose a sample of 50 of these cars. The manufacturer's production operation runs 8-hour shifts, with 100 midsize cars produced on each shift. When the production process has been fine-tuned and all start-up problems have been identified and corrected, the automaker will select one car at random from each of 50 consecutive production shifts. Once selected, each car is to be subjected to an EPA test that determines the EPA combined city and highway mileage of the car.
To randomly select a car from a particular production shift, we number the 100 cars produced on the shift from 00 to 99 and use a random number table or a computer software package to obtain a random number between 00 and 99. For example, starting in the upper left-hand corner of Table 1.4(a) and proceeding down the two leftmost columns, we see that the first three random numbers between 00 and 99 are 33, 03, and 92. This implies that we would select car 33 from the first production shift, car 03 from the second production shift, car 92 from the third production shift, and so forth. Moreover, because a new group of 100 cars is produced on each production shift, repeated random numbers would not be discarded. For example, if the 15th and 29th random numbers are both 07, we would select the 7th car from the 15th production shift and the 7th car from the 29th production shift.
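A software version of this shift-by-shift selection is sketched below in Python (the seed is an arbitrary assumption). Note the contrast with the employee sample earlier: here repeated numbers are kept, because each shift produces a fresh group of 100 cars:

    import random

    random.seed(7)  # arbitrary, hypothetical seed

    # Select one car (numbered 00 to 99) from each of 50 consecutive
    # shifts. A repeated number simply refers to a different shift's car,
    # so it is not discarded.
    cars_by_shift = [random.randint(0, 99) for _ in range(50)]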
Part 3: The Sample and Inference Suppose that when the 50 cars are selected and tested, the sample of 50 EPA combined mileages shown in Table 1.7 is obtained. A time series plot of the mileages is given in Figure 1.5. Examining this plot, we see that, although the mileages vary over time, they do not seem to vary in any unusual way. For example, the mileages do not tend to either decrease or increase (as did the basic cable rates in Figure 1.3) over time. This intuitively verifies that the midsize car manufacturing process is producing consistent car mileages over time, and thus we can regard the 50 mileages as an approximately random sample that can be used to make statistical inferences about the population of all possible midsize car mileages.9 Therefore, because the 50 mileages vary from a minimum of 29.8 mpg to a maximum of 33.3 mpg, we might conclude that most midsize cars produced by the manufacturing process will obtain between 29.8 mpg and 33.3 mpg.

Table 1.7 A Sample of 50 Mileages (DS GasMiles)
30.8  30.8  32.1  32.3  32.7
31.7  30.4  31.4  32.7  31.4
30.1  32.5  30.8  31.2  31.8
31.6  30.3  32.8  30.7  31.9
32.1  31.3  31.9  31.7  33.0
33.3  32.1  31.4  31.4  31.5
31.3  32.5  32.4  32.2  31.6
31.0  31.8  31.0  31.5  30.6
32.0  30.5  29.8  31.7  32.3
32.4  30.5  31.1  30.7  31.4
Note: Time order is given by reading down the columns from left to right.

Figure 1.5 A Time Series Plot of the 50 Mileages (mileage, in mpg, plotted against production shift)
We next suppose that in order to offer its tax credit, the federal government has decided to define the "typical" EPA combined city and highway mileage for a car model as the mean of the population of EPA combined mileages that would be obtained by all cars of this type. Therefore, the government will offer its tax credit to any automaker selling a midsize model equipped with an automatic transmission and a six-cylinder engine that achieves a mean EPA combined mileage of at least 31 mpg. As we will see in Chapter 3, the mean of a population of measurements is the average of the population of measurements. More precisely, the population mean is calculated by adding together the population measurements and then dividing the resulting sum by the number of population measurements. Because it is not feasible to test every new midsize car that will or could potentially be produced, we cannot obtain an EPA combined mileage for every car and thus we cannot calculate the population mean mileage. However, we can estimate the population mean mileage by using the sample mean mileage. To calculate the mean of the sample of 50 EPA combined mileages in Table 1.7, we add together the 50 mileages in Table 1.7 and divide the resulting sum by 50. The sum of the 50 mileages can be calculated to be

30.8 + 31.7 + ⋯ + 31.4 = 1578

and thus the sample mean mileage is 1578/50 = 31.56. This sample mean mileage says that we estimate that the mean mileage that would be obtained by all of the new midsize cars that will or could potentially be produced this year is 31.56 mpg. Unless we are extremely lucky, however, there will be sampling error. That is, the point estimate of 31.56 mpg, which is the average of the sample of 50 randomly selected mileages, will probably not exactly equal the population mean, which is the average mileage that would be obtained by all cars. Therefore, although the estimate 31.56 provides some evidence that the population mean is at least 31 and thus that the automaker should get the tax credit, it does not provide definitive evidence. To obtain more definitive evidence, we employ what is called statistical modeling. We introduce this concept in the next subsection.
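The sample mean calculation can be verified directly from Table 1.7. The minimal Python sketch below uses only the 50 mileages given above:

    # The 50 EPA combined mileages of Table 1.7, listed row by row
    # (for the mean, the order does not matter).
    mileages = [
        30.8, 30.8, 32.1, 32.3, 32.7,
        31.7, 30.4, 31.4, 32.7, 31.4,
        30.1, 32.5, 30.8, 31.2, 31.8,
        31.6, 30.3, 32.8, 30.7, 31.9,
        32.1, 31.3, 31.9, 31.7, 33.0,
        33.3, 32.1, 31.4, 31.4, 31.5,
        31.3, 32.5, 32.4, 32.2, 31.6,
        31.0, 31.8, 31.0, 31.5, 30.6,
        32.0, 30.5, 29.8, 31.7, 32.3,
        32.4, 30.5, 31.1, 30.7, 31.4,
    ]

    total = sum(mileages)                # 1578.0 (up to floating-point rounding)
    sample_mean = total / len(mileages)  # 1578 / 50 = 31.56 mpg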
Statistical modeling
We begin by defining a statistical model.
A statistical model is a set of assumptions about how sample data are selected and about the population (or populations) from which the sample data are selected. The assumptions concerning the sampled population(s) often specify the probability distribution(s) describing the sampled population(s).
We will not formally discuss probability and probability distributions (also called probability models) until Chapters 4, 5, and 6. For now we can say that a probability distribution is a theoretical equation, graph, or curve that can be used to find the probability, or likelihood, that a measurement (or observation) randomly selected from a population will equal a particular numerical value or fall into a particular range of numerical values.

To intuitively illustrate a probability distribution, note that Figure 1.6(a) shows a histogram of the 50 EPA combined city and highway mileages in Table 1.7. Histograms are formally discussed in Chapter 3, but we can note for now that the histogram in Figure 1.6(a) arranges the 50 mileages into classes and tells us what percentage of mileages are in each class. Specifically, the histogram tells us that the bulk of the mileages are between 30.5 and 32.5 miles per gallon. Also, the two middle categories in the graph, capturing the mileages that are (1) at least 31.0 but less than 31.5 mpg and (2) at least 31.5 but less than 32.0 mpg, each contain 22 percent of the data. Mileages become less frequent as we move either farther below the first category or farther above the second. The shape of this histogram suggests that if we had access to
all mileages achieved by the new midsize cars, the population histogram would look "bell-shaped." This leads us to "smooth out" the sample histogram and represent the population of all mileages by the bell-shaped probability curve in Figure 1.6(b). One type of bell-shaped probability curve is a graph of what is called the normal probability distribution (or normal probability model), which is discussed in Chapter 6. Therefore, we might conclude that the statistical model describing the sample of 50 mileages in Table 1.7 states that this sample has been (approximately) randomly selected from a population of car mileages that is described by a normal probability distribution. We will see in Chapters 7 and 8 that this statistical model and probability theory allow us to conclude that we are "95 percent" confident that the sampling error in estimating the population mean mileage by the sample mean mileage is no more than .23 mpg. Because we have seen in Example 1.4 that the mean of the sample of n = 50 mileages in Table 1.7 is 31.56 mpg, this implies that we are 95 percent confident that the true population mean EPA combined mileage for the new midsize model is between 31.56 − .23 = 31.33 mpg and 31.56 + .23 = 31.79 mpg.10 Because we are 95 percent confident that the population mean EPA combined mileage is at least 31.33 mpg, we have strong statistical evidence that this not only meets, but slightly exceeds, the tax credit standard of 31 mpg and thus that the new midsize model deserves the tax credit.
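Although the formal development waits until Chapters 7 and 8, the interval quoted here can be sketched in Python using the t distribution. The sketch reuses the mileages list from the previous sketch and requires SciPy; with these data the margin of error works out to roughly the .23 mpg given above:

    import math
    from statistics import mean, stdev
    from scipy import stats

    # 95% confidence interval for the population mean mileage:
    # margin = t* x s / sqrt(n), where 'mileages' is the 50-value list
    # from the previous sketch.
    n = len(mileages)
    xbar = mean(mileages)                  # 31.56
    s = stdev(mileages)                    # sample standard deviation
    t_star = stats.t.ppf(0.975, df=n - 1)  # 97.5th percentile of t with 49 df
    margin = t_star * s / math.sqrt(n)     # approximately 0.23
    interval = (xbar - margin, xbar + margin)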
Throughout this book we will encounter many situations where we wish to make a statistical inference about one or more populations by using sample data. Whenever we make assumptions about how the sample data are selected and about the population(s) from which the sample data are selected, we are specifying a statistical model that will lead to making what we hope are valid statistical inferences. In Chapters 13, 14, and 15 these models become complex and not only specify the probability distributions describing the sampled populations but also specify how the means of the sampled populations are related to each other through one or more predictor variables. For example, we might relate mean, or expected, sales of a product to the predictor variables advertising expenditure and price. In order to relate a response variable such as sales to one or more predictor variables so that we can explain and predict values of the response variable, we sometimes use a statistical technique called regression analysis and specify a regression model.
Figure 1.6 A Histogram of the 50 Mileages and the Normal Probability Curve ((a) the percent frequency histogram; (b) the bell-shaped normal curve)

The idea of building a model to help explain and predict is not new. Sir Isaac Newton's equations describing motion and gravitational attraction help us understand bodies in motion and are used today by scientists plotting the trajectories of spacecraft. Despite their successful use, however, these equations are only approximations to the exact nature of motion, and although physics has continued to advance over the last century, our physical models are still incomplete and are created by replacing complex situations by simplified stand-ins which we then represent by tractable equations. In a similar way, the statistical models we will propose and use in this book will not capture all the nuances of a business situation. But like Newtonian physics, if our models capture the most important aspects of a business situation, they can be powerful tools for improving efficiency, sales, and product quality.
Probability sampling
Random (or approximately random) sampling—as well as the more advanced kinds of sampling discussed in optional Section 1.7—is a type of probability sampling. In general, probability sampling is sampling where we know the chance (or probability) that each element in the population will be included in the sample. If we employ probability sampling, the sample obtained can be used to make valid statistical inferences about the sampled population. However, if we do not employ probability sampling, we cannot make valid statistical inferences.
One type of sampling that is not probability sampling is convenience sampling, where we select elements because they are easy or convenient to sample. For example, if we select people to interview because they look "nice" or "pleasant," we are using convenience sampling. Another example of convenience sampling is the use of voluntary response samples, which are frequently employed by television and radio stations and newspaper columnists. In such samples, participants self-select—that is, whoever wishes to participate does so (usually expressing some opinion). These samples overrepresent people with strong (usually negative) opinions. For example, the advice columnist Ann Landers once asked her readers, "If you had it to do over again, would you have children?" Of the nearly 10,000 parents who voluntarily responded, 70 percent said that they would not. A probability sample taken a few months later found that 91 percent of parents would have children again.
Another type of sampling that is not probability sampling is judgment sampling, where a person who is extremely knowledgeable about the population under consideration selects population elements that he or she feels are most representative of the population. Because the quality of the sample depends upon the judgment of the person selecting the sample, it is dangerous to use the sample to make statistical inferences about the population.
To conclude this section, we consider a classic example where two types of sampling errors doomed a sample's ability to make valid statistical inferences. This example occurred prior to the presidential election of 1936, when the Literary Digest predicted that Alf Landon would defeat Franklin D. Roosevelt by a margin of 57 percent to 43 percent. Instead, Roosevelt won the election in a landslide. The Literary Digest's first error was to send out sample ballots (actually, 10 million ballots) to people who were mainly selected from the Digest's subscription list and from telephone directories. In 1936 the country had not yet recovered from the Great Depression, and many unemployed and low-income people did not have phones or subscribe to the Digest. The Digest's sampling procedure excluded these people, who overwhelmingly voted for Roosevelt. Second, only 2.3 million ballots were returned, resulting in the sample being a voluntary response survey. At the same time, George Gallup, founder of the Gallup Poll, was beginning to establish his survey business. He used a probability sample to correctly predict Roosevelt's victory. In optional Section 1.8 we discuss various issues related to designing surveys and more about the errors that can occur in survey samples.
Ethical guidelines for statistical practice
The American Statistical Association, the leading U.S. professional statistical association, has developed the report "Ethical Guidelines for Statistical Practice."11 This report provides information that helps statistical practitioners to consistently use ethical statistical practices and that helps users of statistical information avoid being misled by unethical statistical practices.

11 American Statistical Association, "Ethical Guidelines for Statistical Practice," 1999.

Unethical statistical practices can take a variety of forms, including:
• Improper sampling. Purposely selecting a biased sample—for example, using a nonrandom sampling procedure that overrepresents population elements supporting a desired conclusion or that underrepresents population elements not supporting the desired conclusion—is unethical. In addition, discarding already sampled population elements that do not support the desired conclusion is unethical. More will be said about proper and improper sampling later in this chapter.

• Misleading charts, graphs, and descriptive measures. In Section 2.7, we will present an example of how misleading charts and graphs can distort the perception of changes in salaries over time. Using misleading charts or graphs to make the salary changes seem much larger or much smaller than they really are is unethical. In Section 3.1, we will present an example illustrating that many populations of individual or household incomes contain a small percentage of very high incomes. These very high incomes make the population mean income substantially larger than the population median income. In this situation we will see that the population median income is a better measure of the typical income in the population. Using the population mean income to give an inflated perception of the typical income in the population is unethical.
• Inappropriate statistical analysis or inappropriate interpretation of statistical results. The American Statistical Association report emphasizes that selecting many different samples and running many different tests can eventually (by random chance alone) produce a result that makes a desired conclusion seem to be true, when the conclusion really isn't true (a short simulation after this list illustrates the pitfall). Therefore, continuing to sample and run tests until a desired conclusion is obtained and not reporting previously obtained results that do not support the desired conclusion is unethical. Furthermore, we should always report our sampling procedure and sample size and give an estimate of the reliability of our statistical results. Estimating this reliability will be discussed in Chapter 7 and beyond.
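The simulation promised in the third guideline above is sketched below in Python. The population values and the number of repeated samples are hypothetical choices, made only to show how reporting the best of many samples manufactures apparent evidence by chance alone:

    import random
    from statistics import mean

    random.seed(0)  # arbitrary, hypothetical seed

    # Draw 200 samples of 50 mileages from a hypothetical population
    # whose true mean is exactly 31.0 mpg. Any one sample is a fair
    # estimate; reporting only the most favorable one is not.
    samples = [[random.gauss(31.0, 0.8) for _ in range(50)]
               for _ in range(200)]

    honest_first = mean(samples[0])                # close to 31, on average
    cherry_picked = max(mean(s) for s in samples)  # inflated by selection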
The above examples are just an introduction to the important topic of unethical statistical practices. The American Statistical Association report contains 67 guidelines organized into eight areas involving general professionalism and ethical responsibilities. These include responsibilities to clients, to research team colleagues, to research subjects, and to other statisticians, as well as responsibilities in publications and testimony and responsibilities of those who employ statistical practitioners.
CONCEPTS
1.9 (1) Define a population. (2) Give an example of a population that you might study when you start your career after graduating from college. (3) Explain the difference between a census and a sample.
1.10 Explain each of the following terms:
a. Descriptive statistics.
b. Statistical inference.
c. Random sample.
d. Process.
e. Statistical model.

1.11 Explain why sampling without replacement is preferred to sampling with replacement.
METHODS AND APPLICATIONS
Table 1.4(a) and moving down the two leftmost columns, we see that the first three two-digit numbers obtained are: 33, 03, and 92. Starting with these three random numbers, and moving down the two leftmost columns of Table 1.4(a) to find more two-digit random numbers, use Table 1.4(a) to randomly select five of these companies to be in-