Association between Categorical Variables
Chapter 5
5.1 Contingency Tables
Which hosts send more buyers to
Amazon.com?
• Two categorical variables: Host and Purchase
• Host identifies the originating site: MSN, RecipeSource, or Yahoo; Purchase indicates whether the visit resulted in a purchase
Consider Two Categorical Variables
Simultaneously
• A contingency table shows counts of cases of one categorical variable contingent on the value of another (for every combination of both variables)
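Counting cases for every combination of two variables can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the `visits` list and the `contingency_table` helper are hypothetical and do not reproduce the web-shopping counts.

```python
from collections import Counter

# Hypothetical (host, purchase) observations, one pair per visit.
# These are illustrative values, not the counts from the web-shopping data.
visits = [
    ("MSN", "No"), ("MSN", "Yes"), ("MSN", "No"),
    ("RecipeSource", "No"), ("RecipeSource", "No"),
    ("Yahoo", "No"), ("Yahoo", "Yes"), ("Yahoo", "No"), ("Yahoo", "No"),
]

def contingency_table(pairs):
    """Count the cases for every combination of the two variables.

    Returns a dict of rows (second variable) mapping to dicts of
    columns (first variable); empty combinations get a count of 0.
    """
    counts = Counter(pairs)
    cols = sorted({first for first, _ in pairs})
    rows = sorted({second for _, second in pairs})
    return {r: {c: counts[(c, r)] for c in cols} for r in rows}

table = contingency_table(visits)
for purchase, row in table.items():
    print(purchase, row)
```

Each cell holds the number of visits with that particular combination of Host and Purchase, so the cell counts add up to the number of visits.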
Contingency Table for Web Shopping
Marginal and Conditional Distributions
• Marginal distributions appear in the “margins” of a contingency table and represent the totals (frequencies) for each categorical variable separately
• Conditional distributions refer to counts within a row or column of the table
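The two kinds of distribution can be computed directly from the cell counts. The sketch below assumes a small table of made-up counts (rows are Purchase, columns are Host); the helper names `marginals` and `conditional_on_column` are hypothetical.

```python
# Hypothetical counts (rows: Purchase, columns: Host); not the slide's data.
table = {
    "No":  {"MSN": 90, "RecipeSource": 98, "Yahoo": 180},
    "Yes": {"MSN": 10, "RecipeSource": 2,  "Yahoo": 20},
}

def marginals(table):
    """Return (row totals, column totals): the marginal distributions."""
    row_totals = {r: sum(cells.values()) for r, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for c, n in cells.items():
            col_totals[c] = col_totals.get(c, 0) + n
    return row_totals, col_totals

def conditional_on_column(table, column):
    """Distribution of the row variable restricted to a single column."""
    total = sum(cells[column] for cells in table.values())
    return {r: cells[column] / total for r, cells in table.items()}

row_totals, col_totals = marginals(table)
print("Purchase totals:", row_totals)
print("Host totals:", col_totals)
for host in col_totals:
    dist = conditional_on_column(table, host)
    print(host, {r: f"{100 * p:.1f}%" for r, p in dist.items()})
```

If the conditional percentages differ noticeably from one column to the next, as they do for the RecipeSource column in these made-up counts, the two variables are associated.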
Conditional Distribution of Purchase for each
Host (Column Counts and Percentages)
Conditional Distribution
• Reveals the percentage of purchases
among visitors from RecipeSource to be
much less than for MSN and Yahoo
• Host and Purchase are associated
Segmented Bar Charts
• Used to display conditional distributions
• Divides the bars in a bar chart into
segments that are proportional to the
percentage in each category of a second
variable
Contingency Table of Purchase by Region
Segmented Bar Chart Shows Association
Mosaic Plots
Alternative to segmented bar chart
A plot in which the size of each “tile” is
proportional to the count in a cell of a
contingency table
Contingency Table of Shirt Size by Style
Mosaic Plot Shows Association
4M Example 5.1: CAR THEFT
Motivation
Should insurance companies vary the
premiums for different car models (are
some cars more likely to be stolen than
others)?
Method
Data obtained from the National Highway Traffic
Safety Administration (NHTSA) on car theft for
seven popular models (two categorical variables: type of car and whether the car was stolen).
Mechanics
Message
The Dodge Intrepid is more likely to be stolen than other popular models. The data suggest that
higher premiums for theft insurance should be
charged for models that are more likely to be
stolen.
5.2 Lurking Variables and Simpson’s Paradox
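Association Not Necessarily Causation
• Lurking variable: a concealed variable that affects the apparent relationship between two other variables
• Simpson’s paradox: a change in the association between two variables when data are separated into groups defined by a third variable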
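A small numerical sketch shows how such a reversal can arise. The counts below are hypothetical, chosen only to produce the effect, and the labels "Airline A", "Airline B", "Easy", and "Hard" are made up; they are not the Bureau of Transportation Statistics figures.

```python
# Hypothetical (on-time flights, total flights) per destination,
# chosen so that the direction of association reverses when aggregated.
flights = {
    "Airline A": {"Easy": (19, 20), "Hard": (60, 80)},
    "Airline B": {"Easy": (72, 80), "Hard": (14, 20)},
}

def rate_by_group(counts):
    """On-time rate within each destination."""
    return {dest: on_time / total for dest, (on_time, total) in counts.items()}

def overall_rate(counts):
    """On-time rate after pooling all destinations together."""
    on_time = sum(o for o, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return on_time / total

for airline, counts in flights.items():
    print(airline, rate_by_group(counts), f"overall {overall_rate(counts):.2f}")
```

In this made-up table Airline A has the higher on-time rate at each destination, yet Airline B has the higher pooled rate, because Airline A flies mostly to the destination where delays are common. Destination acts as the lurking variable.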
4M Example 5.2: AIRLINE ARRIVALS
Motivation
Does it matter which of two airlines a
corporate CEO chooses when flying to
meetings if he wants to avoid delays?
Method
Data obtained from US Bureau of
Transportation Statistics on flight delays for two airlines (two categorical variables:
airline and whether the flight arrived on
time)
Mechanics
Mechanics –
Is destination a lurking variable?
Mechanics –
This is Simpson’s Paradox
Message
The CEO should book on US Airways, as it is more likely to arrive on time regardless of
destination.
5.3 Strength of Association
Chi-Squared Statistic
A measure of association in a contingency
table
Calculated by comparing the
observed contingency table to an artificial
table with the same marginal totals but no
association
Contingency Table
Calculating the Chi-Squared Statistic
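The calculation can be sketched as follows. For each cell, the expected count in the artificial no-association table is (row total × column total) / n, and the statistic sums (observed − expected)² / expected over all cells. The function name `chi_squared` and the 2×2 counts are hypothetical illustrations, not the slide's table.

```python
def chi_squared(observed):
    """Chi-squared statistic for a contingency table given as a list of rows.

    Each observed count is compared with the count expected in a table
    that has the same marginal totals but no association.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: every expected count is 25.
observed = [[30, 20], [20, 30]]
print(chi_squared(observed))
```

A table whose rows are exactly proportional to one another matches its expected counts cell by cell, so the statistic is 0. Larger values indicate larger departures from the no-association table.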
Cramer’s V
Derived from the Chi-Squared Statistic
Ranges in value from 0 (variables are not
associated) to 1 (variables are perfectly
associated)
Calculating Cramer’s V
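V = \sqrt{\frac{\chi^2}{n \, \min(r - 1,\; c - 1)}}
where n is the total count and r and c are the numbers of rows and columns in the table
V = 0.20 for our example
There is a weak association between group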
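As a rough sketch, Cramer's V can be computed by rescaling the chi-squared statistic by the total count and by the smaller of (rows − 1) and (columns − 1). The function names and the example tables below are hypothetical and are not the data behind the V = 0.20 result.

```python
import math

def chi_squared(observed):
    """Chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(observed)
        for j, obs in enumerate(row)
    )

def cramers_v(observed):
    """Cramer's V: sqrt(chi-squared / (n * min(rows - 1, cols - 1)))."""
    n = sum(sum(row) for row in observed)
    k = min(len(observed) - 1, len(observed[0]) - 1)
    return math.sqrt(chi_squared(observed) / (n * k))

# Hypothetical tables illustrating the range of V.
print(round(cramers_v([[25, 25], [25, 25]]), 2))  # no association
print(round(cramers_v([[40, 10], [10, 40]]), 2))  # moderate association
print(round(cramers_v([[50, 0], [0, 50]]), 2))    # perfect association
```

Dividing by n and by min(r − 1, c − 1) is what keeps V between 0 and 1, so tables of different sizes can be compared on the same scale.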
Checklist: Chi-Squared and Cramer’s V
Verify that variables are categorical
Verify that there are no obvious lurking
variables
4M Example 5.3: REAL ESTATE
Motivation
Do people who heat their homes with gas
prefer to cook with gas as well? What
heating systems and appliances should a
developer select for newly built homes?
Method
The developer contacts homeowners to
obtain the data. Two categorical variables: type of fuel used for home heating (gas or electric) and type of fuel used for cooking
(gas or electric).
Mechanics
Message
Homeowners prefer gas to electric heat by
about 2 to 1. The developer should build
about two-thirds of new homes with gas
heat. Put electric appliances in all homes
with electric heat and in half of the homes
with gas heat (assuming that buyers for
new homes have the same preferences).
Best Practices
Use contingency tables to find and summarize association between two categorical variables.
Pitfalls