Categorical Explanatory Variables
Chapter 25
25.1 Two-Sample Comparisons
Does Wal-Mart discriminate against female
employees? Are they paid less than men?
Use a categorical explanatory variable representing gender to
analyze pay data.
Adjust the comparison between men and women to account for other
variables that may affect pay.
Example: Mid-Level Managers’ Salaries
The average salary for women is $140,000 and the average salary for men is $144,700, a difference of $4,700.
The 95% confidence interval for the difference in mean
salaries is $740 to $8,591. Since 0 is not in this
interval, the difference is statistically significant.
Assume the conditions for inference are satisfied.
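The zero-in-interval check can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the computation behind the slide: the helper names are hypothetical, and `large_sample_ci` uses a normal approximation, whereas the slides report only the two means and the final interval.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_ci(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Normal-approximation confidence interval for mean1 - mean2."""
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # standard error of the difference
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    diff = mean1 - mean2
    return diff - z * se, diff + z * se

def excludes_zero(interval):
    """True when the interval lies entirely on one side of zero."""
    low, high = interval
    return low > 0 or high < 0

# Interval reported on the slide: $740 to $8,591.
print(excludes_zero((740, 8591)))  # True, so the difference is significant
```

To use `large_sample_ci`, supply the group standard deviations and sample sizes, which are not given on the slides.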
Confounding Variables
Be careful about lurking variables that could account for the significant difference between average
salaries (e.g., experience).
Experience confounds the comparison if it is
correlated with salary and the two groups (men
and women) differ with regard to experience.
Subsets and Confounding
Restrict the analysis to a subset of cases with matching levels of the confounding variable (e.g., compare men and women with 5 years of experience).
The 95% confidence interval for the difference in average salaries between men and women within the subset of managers with 5 years of experience
includes 0, so the difference is not statistically significant.
However, the standard error of the difference is
much larger than in the full sample; the few cases in the subset do not
produce a precise estimate.
25.2 Analysis of Covariance
Regression on Subsets
What about the difference between average
salaries for managers with 2, 10 or 15 years
experience?
Analysis of covariance: regression that combines categorical and numerical explanatory variables; adjusts the comparison of means for the effects of confounding variables
Combining Regressions
Combining the regressions for men and
women requires a dummy variable identifying
whether a manager is male or female (Group = 1
for men; Group = 0 for women).
An interaction term is the product of two
explanatory variables in a regression model (here, Group × Years).
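As a minimal sketch of this coding, the function below builds the explanatory values for one manager. The function name `design_row` is hypothetical; the coding (Group = 1 for men, 0 for women) follows the slide.

```python
def design_row(years, is_male):
    """Return (Years, Group, Group x Years) for one manager."""
    group = 1 if is_male else 0     # dummy variable: 1 for men, 0 for women
    interaction = group * years     # product of the two explanatory variables
    return years, group, interaction

print(design_row(5, True))   # (5, 1, 5)
print(design_row(5, False))  # (5, 0, 0)
```

For women the interaction column is always 0, so only the baseline intercept and the slope of Years affect their fitted values.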
Interpreting Coefficients
The group coded 0 by the
dummy variable (women) forms the baseline for comparison.
The slope of the dummy variable is the difference
between the estimated intercepts in the simple
regressions. The slope of the interaction is the
difference between the estimated slopes in the simple regressions.
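The relationship between the separate fits and the combined model can be illustrated as follows. This is a sketch under assumptions: the salary values are synthetic (invented for illustration, in $000), and `simple_ols` is a hypothetical helper implementing ordinary least squares for one explanatory variable.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for a single explanatory variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic data, not from the slides.
years = [1, 2, 3, 4]
salary_women = [130, 132, 134, 136]   # Group = 0 (baseline)
salary_men = [133, 136, 139, 142]     # Group = 1

a_women, b_women = simple_ols(years, salary_women)
a_men, b_men = simple_ols(years, salary_men)

# Coefficients of the combined model with a dummy variable and an interaction.
b0 = a_women            # baseline intercept
b1 = b_women            # baseline slope for Years
b2 = a_men - a_women    # coefficient of Group: difference between intercepts
b3 = b_men - b_women    # coefficient of Group x Years: difference between slopes

def combined_fit(x, group):
    """Fitted salary from the combined equation."""
    return b0 + b1 * x + b2 * group + b3 * group * x
```

With Group = 0 the combined equation reduces to the fit for women; with Group = 1 it reproduces the fit for men.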
25.3 Checking Conditions
The scatterplot reveals a linear (weak)
association between Salary and Years.
Some caution is necessary regarding lurking
variables (e.g., educational background or
business aptitude)
Checking for Similar Variances
Plot the residuals against the fitted values.
Compare side-by-side boxplots of the residuals
for each group. The similar variance condition is violated if the IQR in one boxplot is more than
twice the IQR of the other.
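The boxplot rule of thumb can be expressed as a small check. This is a sketch, assuming the residuals for each group are available as lists; the function names are hypothetical, and `statistics.quantiles` uses its default quartile method, which can differ slightly from the quartiles drawn by other software.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range from the default quartiles of statistics.quantiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

def similar_variance_violated(residuals_a, residuals_b):
    """Rule of thumb: violated if one IQR is more than twice the other."""
    iqr_a, iqr_b = iqr(residuals_a), iqr(residuals_b)
    return max(iqr_a, iqr_b) > 2 * min(iqr_a, iqr_b)

# Made-up residuals for illustration only.
group_a = [-2, -1, 0, 1, 2]
group_b = [-8, -4, 0, 4, 8]   # four times the spread of group_a
print(similar_variance_violated(group_a, group_b))  # True
```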
The similar variance condition is satisfied
Examining the normal quantile plot confirms that the residuals are nearly normal
25.4 Interactions and Inference
Principle of marginality: if the interaction is
statistically significant, retain it as well as both of its components regardless of their level of
significance
If the interaction is not statistically significant,
remove it from the regression and re-estimate the equation. A model without an interaction term is simpler to interpret since the lines fit to the groups are parallel.
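The decision rule can be summarized in code. This is a sketch only: the function name is hypothetical, the term names follow the salary example, and the 0.05 cutoff is an assumed significance level rather than one stated on the slides.

```python
def terms_to_keep(p_interaction, alpha=0.05):
    """Choose model terms by the principle of marginality (alpha is assumed)."""
    if p_interaction < alpha:
        # Significant interaction: keep it and both of its components,
        # regardless of their individual p-values.
        return ["Years", "Group", "Group x Years"]
    # Otherwise drop the interaction and re-estimate with parallel lines.
    return ["Years", "Group"]

print(terms_to_keep(0.01))  # ['Years', 'Group', 'Group x Years']
print(terms_to_keep(0.40))  # ['Years', 'Group']
```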
Interactions and Collinearity
An interaction in a multiple regression introduces
collinearity (see the large VIF for Group × Years).
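A small synthetic example shows why the interaction is collinear with its components: the product Group × Years is zero for every woman and equals Years for every man, so it moves closely with Group. The data below are invented, `correlation` is a hypothetical helper, and the variance inflation factor shown is the two-predictor form 1/(1 − r²); in the three-predictor salary model the VIF would instead use the R² from regressing the interaction on both Years and Group.

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Synthetic design: four women (Group = 0) and four men (Group = 1).
years = [1, 2, 3, 4, 1, 2, 3, 4]
group = [0, 0, 0, 0, 1, 1, 1, 1]
interaction = [g * y for g, y in zip(group, years)]

r = correlation(group, interaction)
vif_two_predictors = 1 / (1 - r ** 2)
print(round(r, 3), round(vif_two_predictors, 2))  # 0.845 3.5
```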
Since the interaction in this example is not
statistically significant, remove it and re-estimate the multiple regression model (MRM).
Parallel Fits
The coefficient of the dummy variable Group is the difference
between the intercepts for male and female
managers.
Its estimate
means that the line for men is shifted up from the
line for women by $1,024 at all levels of
experience.
The test of
the slope of Group indicates that it is not
statistically significant.
There is no statistically significant
difference between the average salaries of male
and female managers when comparing managers with equal years of experience.
4M Example 25.1:
PRIMING IN ADVERTISING
Motivation
FedEx introduced the Courier Pak using two waves
of promotion: an ad to raise awareness (i.e.,
priming) and a visit to existing clients by a sales
rep. Management has two questions: (1) How
many shipments were generated by a typical one-hour contact by the sales rep? and (2) Was the
promotion more effective for clients who were
already aware of the Courier Pak?
Method
Based on data from 125 customers, fit a multiple
regression with a categorical variable. The
response is the number of shipments using Courier
Pak. The explanatory variables are the amount of time spent with the client by a sales rep and a
dummy variable indicating whether or not the
client was aware of the Courier Pak. The
interaction between the explanatory variables is
included.
Scatterplot with lines fit separately for each group
(clients aware of Courier Pak shown in green).
The interaction indicates whether prior awareness of Courier
Paks affects how the sales rep visit influenced the client.
Mechanics – Estimate Model
Mechanics – Check Conditions
Nothing in the plots suggests dependence. The similar
variance condition is satisfied.
Similar variances are confirmed.
The nearly normal condition is satisfied.
Mechanics
Based on the F-statistic, we can conclude that the
model explains statistically significant variation in the response.
The interaction between awareness and hours of contact is statistically significant. Following the
principle of marginality, we retain Aware in the
model.
The interaction implies that the gap between the
lines gets wider as the number of contact hours
increases.
Message
Priming produces a statistically significant increase
in the subsequent use of Courier Paks when
followed by a visit from a sales rep. Each
additional hour of contact with a sales rep
produces about 4.3 more uses of the Courier
Pak with priming than without priming.
25.5 Regression with Several Groups
Example: Estimating Store Sales
Explanatory variables are median household
income in surrounding community, size of the
local population, and market (urban, suburban,
rural)
The response is sales in dollars per square foot
Scatterplot Matrix
Groups are color-coded: rural (red), suburban (green), urban (blue).
The association within each group appears linear.
In general, distinguishing J groups requires J - 1
dummy variables.
For this example, use two dummy variables:
Suburban Dummy = 1 if suburban, 0 otherwise
Urban Dummy = 1 if urban, 0 otherwise
Rural locations are coded (0, 0) and serve as the baseline.
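The coding of three markets with two dummy variables can be sketched as below. The function name `market_dummies` is hypothetical; the coding itself follows the slide, with rural as the baseline.

```python
def market_dummies(market):
    """Return (Suburban Dummy, Urban Dummy); rural is the baseline (0, 0)."""
    market = market.strip().lower()
    if market not in {"rural", "suburban", "urban"}:
        raise ValueError(f"unknown market: {market!r}")
    return int(market == "suburban"), int(market == "urban")

print(market_dummies("Rural"))     # (0, 0)
print(market_dummies("Suburban"))  # (1, 0)
print(market_dummies("Urban"))     # (0, 1)
```

A third dummy for rural stores would be redundant: it is determined by the other two, so including it alongside an intercept would make the explanatory variables perfectly collinear.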
The interpretation of the estimates is similar to
the interpretation of models with two groups.
Coefficients associated with the dummy variables
reflect differences between stores in other locations
and rural stores, the baseline group.
Estimating Sales for Rural Stores
The estimated equation for the baseline group
(stores in a rural location) is
Estimated Sales ($/SqFt) =
-388.6992 + 0.0097 Income + 0.2401 Population (000)
Estimating Sales for Urban Stores
Consider stores in an urban location. The estimated
sales are given by
Estimated Sales ($/SqFt) =
(-388.6992 + 468.8654) + (0.0097 - 0.0053) Income + 0.2401 Population (000)
= 80.1662 + 0.0044 Income + 0.2401 Population (000)
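The rural and urban equations can be evaluated together in one function. This is a sketch that uses only the estimates quoted on the slides; the constant and function names are hypothetical, and suburban stores are left out because their estimates do not appear in the text.

```python
# Estimates quoted on the slides (suburban estimates are not shown there).
INTERCEPT = -388.6992        # rural (baseline) intercept
INCOME = 0.0097              # rural slope for Income
POPULATION = 0.2401          # slope for Population, in thousands
URBAN = 468.8654             # shift in the intercept for urban stores
URBAN_INCOME = -0.0053       # change in the Income slope for urban stores

def estimated_sales(income, population_thousands, urban):
    """Estimated sales ($/SqFt) for a rural (urban=False) or urban store."""
    d = 1 if urban else 0    # Urban Dummy
    return (INTERCEPT + URBAN * d
            + (INCOME + URBAN_INCOME * d) * income
            + POPULATION * population_thousands)
```

Setting `urban=True` reproduces the intercept of 80.1662 and the Income slope of 0.0044. Because the urban line starts higher but rises more slowly, the two estimates cross where 468.8654 = 0.0053 × Income, at an Income of about 88,500.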
Interpretation of Results
Estimated sales for urban stores start from a higher intercept
compared to rural stores, but do not grow as fast
with increases in income.
The slope for Population is the same in both equations
because the model does not include an interaction term between Population and the dummy variables for location.
Best Practices
Be thorough in your search for confounding
variables
Consider interactions
Choose an appropriate baseline group
Write out the fits for separate groups
Best Practices (Continued)
Be careful interpreting the coefficient of the
dummy variable
Check for comparable variances in the groups
Use color-coding or different plot symbols to
identify subsets of observations in plots
Pitfalls
Don’t think that you have adjusted for all of the
confounding factors
Don’t confuse the different types of slopes
Don’t forget to check the conditions of the MRM