Regression Diagnostics
Chapter 22
22.1 Problem 1: Changing Variation
Regression analysis lets us use the prices of homes of different sizes to estimate the price of a home of a specific size, but prices tend to be
more variable for larger homes. How does
this affect the SRM?
Consider how to recognize and fix three potential
problems affecting regression models: changing
variation in the data, outliers, and dependence
among observations.
Price ($000) vs Home Size (Sq Ft.)
Both the average and standard deviation in price increase with home size.
SRM Results: Home Price Example
Fixed Costs, Marginal Costs, and Variable Costs
The estimated intercept (50.598687, in thousands of dollars) can be
interpreted as the fixed cost of a home.
The 95% confidence interval for the intercept (after rounding) is -$4,000 to $105,000.
Since it includes zero, this interval is not a precise estimate of fixed costs.
The slope (0.1594259) estimates the marginal cost
of an additional square foot of space.
The 95% confidence interval for the slope (after
rounding) is $135,000 to $183,500 per 1,000 square feet.
It can be interpreted as the average difference in
home price associated with a difference of 1,000 square feet in size.
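The unit conversion behind these interpretations can be written out as a short worked calculation. It assumes, as the scatterplot axis label "Price ($000)" suggests, that the response is measured in thousands of dollars; the rounded dollar figures are computed here and are not taken from the slides.

```latex
\begin{align*}
\text{Fixed cost:}\quad & 50.598687 \times \$1{,}000 \approx \$50{,}599 \\
\text{Marginal cost per square foot:}\quad & 0.1594259 \times \$1{,}000 \approx \$159.43 \\
\text{Marginal cost per 1,000 square feet:}\quad & \$159.43 \times 1{,}000 \approx \$159{,}426
\end{align*}
```

The last figure falls inside the reported interval of $135,000 to $183,500, which is stated on the per-1,000-square-feet scale.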
Detecting Differences in Variation
Based on the scatterplot, the association between home price and size appears linear
Little concern about lurking variables since the
sample of homes is from the same neighborhood
Similar variances condition is not satisfied
Fan-shaped appearance of residual plot indicates
changing variances.
Side-by-side boxplots confirm that variances increase
Heteroscedastic: errors that have different
amounts of variation
Homoscedastic: errors having equal amounts of variation
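A numerical version of the boxplot check can be sketched as follows. The data are simulated, not the textbook's home sample: the coefficients, the error model (error SD proportional to size), and all variable names are assumptions chosen only to produce a fan-shaped residual pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated homes (hypothetical data): the error SD grows with home size,
# so the errors are heteroscedastic by construction.
sqft = rng.uniform(800, 3500, size=200)
price = 50 + 0.16 * sqft + rng.normal(0, 0.03 * sqft)  # price in $000

# Least squares fit of price on size, then residuals.
slope, intercept = np.polyfit(sqft, price, 1)
residuals = price - (intercept + slope * sqft)

# Compare the residual spread for the smaller and larger halves of the sample,
# a numeric analogue of side-by-side boxplots of the residuals.
is_large = sqft > np.median(sqft)
sd_small = residuals[~is_large].std(ddof=1)
sd_large = residuals[is_large].std(ddof=1)

print(f"residual SD, smaller homes: {sd_small:.1f}")
print(f"residual SD, larger homes:  {sd_large:.1f}")
```

A clearly larger spread in one group than the other is the same signal as the fan shape in the residual plot.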
Consequences of Different Variation
Prediction intervals are too narrow or too wide
Confidence intervals for the slope and intercept
are not reliable
Hypothesis tests regarding β0 and β1 are not
reliable
The 95% prediction intervals are too wide for small
homes and too narrow for large homes
Fixing the Problem: Revise the Model
If F represents fixed cost and M marginal costs,
the equation of the SRM becomes
$$\text{Price} = F + M \cdot \text{SqFt} + \varepsilon$$
Divide both sides of the equation by the number
of square feet and simplify:
$$\frac{\text{Price}}{\text{SqFt}} = M + F \cdot \frac{1}{\text{SqFt}} + \frac{\varepsilon}{\text{SqFt}}$$
The response variable becomes price per square foot and the explanatory variable becomes the
reciprocal of the number of square feet
The marginal cost M is the intercept and the
slope is F, the fixed cost.
The residuals have similar variances
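The effect of the revised model can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are simulated with hypothetical values `F_true` and `M_true` and with error SD proportional to size, which is the case in which dividing by square feet exactly equalizes the variances. Real data need not behave this neatly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical fixed cost F ($000) and marginal cost M ($000 per sq ft).
F_true, M_true = 50.0, 0.16
sqft = rng.uniform(800, 3500, size=200)
price = F_true + M_true * sqft + rng.normal(0, 0.03 * sqft)

# Revised model: price per square foot regressed on the reciprocal of size.
y = price / sqft
x = 1.0 / sqft
slope, intercept = np.polyfit(x, y, 1)

# In the revised model the intercept estimates M and the slope estimates F.
print(f"estimated marginal cost M (intercept): {intercept:.3f}")
print(f"estimated fixed cost F (slope):        {slope:.1f}")

# Residual spread should now be similar for smaller and larger homes.
residuals = y - (intercept + slope * x)
is_large = sqft > np.median(sqft)
ratio = residuals[is_large].std(ddof=1) / residuals[~is_large].std(ddof=1)
print(f"residual SD ratio (large / small): {ratio:.2f}")
```

A ratio near 1 corresponds to the similar-variances condition being satisfied after the transformation.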
Boxplots confirm homoscedastic errors
prices into fixed and variable costs to better prepare for negotiations with realtors.
4M Example 22.1:
ESTIMATING HOME PRICES
Method
The data consist of a sample of 94 homes for
sale in Seattle. The explanatory variable is the reciprocal of home size and the
response is price per square foot. The
scatterplot shows a linear association and there are no obvious lurking variables.
Mechanics
The evidently independent, similar variances, and
nearly normal conditions are met.
The SRM results.
The 95% confidence interval for the intercept is
136.8182 to 178.6878 and the 95% confidence interval for the slope is 18,592.36 to 89,181.64.
Message
Prices for homes in this Seattle
neighborhood run about $140 to $180 per
square foot, on average. Average fixed
costs associated with the purchase are in
the range $19,000 to $89,000, with 95%
confidence.
Comparing Models with Different Responses
Even though the revised model has a smaller $r^2$,
it provides more reliable and narrower confidence intervals for fixed and variable costs; and
it provides more sensible prediction intervals.
22.2 Problem 2: Leveraged Outliers
Consider a Contractor’s Bid on a Project
A contractor is bidding on a project to construct an
875 square-foot addition to a home
If he bids too low, he loses money on the project
If he bids too high, he does not get the job
Contractor Data for n=30 Similar Projects
Note that all but one of his previous projects are
Contractor Example
His one project at 900 square feet is an outlier
It is also a leveraged observation as it pulls the
regression line in its direction
Leveraged: an observation in regression that has
an unusually small or large value of the explanatory variable
Consequences of an Outlier
To see the consequences of an outlier, fit the
least squares regression line both with and
without it
Use the standard errors obtained without
including the outlier to compare estimates
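The comparison described above can be sketched in code. The project data are simulated (hypothetical sizes and costs, with one large project placed at 900 square feet), and `fit_line` is an illustrative helper using the standard simple-regression standard error formulas; none of this reproduces the contractor's actual data.

```python
import numpy as np

def fit_line(x, y):
    """Least squares fit; returns (intercept, slope, se_intercept, se_slope)."""
    n = len(x)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    s2 = resid @ resid / (n - 2)              # residual variance estimate
    sxx = np.sum((x - x.mean()) ** 2)
    se_slope = np.sqrt(s2 / sxx)
    se_intercept = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / sxx))
    return intercept, slope, se_intercept, se_slope

rng = np.random.default_rng(2)

# Hypothetical data: 29 small projects plus one leveraged project at 900 sq ft.
size = np.append(rng.uniform(100, 400, 29), 900.0)
cost = 5 + 0.05 * size + rng.normal(0, 2, 30)   # cost in $000
cost[-1] -= 15   # the large project cost less than the trend suggests

with_outlier = fit_line(size, cost)
without_outlier = fit_line(size[:-1], cost[:-1])

# Express each shift in units of the standard error from the fit WITHOUT the outlier.
shift_intercept = (with_outlier[0] - without_outlier[0]) / without_outlier[2]
shift_slope = (with_outlier[1] - without_outlier[1]) / without_outlier[3]

print(f"intercept shift: {shift_intercept:+.2f} standard errors")
print(f"slope shift:     {shift_slope:+.2f} standard errors")
```

In this simulated setup the outlier lies below the trend at a large size, so it pulls the slope down and the intercept up, the same direction of change as in the contractor example.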
Consequences for the Contractor Example
Including the outlier shifts the estimated fixed cost
up by about 1.5 standard errors
Including the outlier shifts the estimated marginal cost down by about 1.56 standard errors
Prediction intervals when the outlier is included
Prediction intervals when the outlier is not included
Fixing the Problem: More Information
If the outlier describes what is expected the next time under the same conditions, then it should be included
In the contractor example, more information is
needed to decide whether to include or exclude
the outlier
22.3 Problem 3: Dependent Errors and Time Series
Detecting Dependence
With time series data, plot residuals versus time
to look for a pattern indicating dependence in the errors
Use the Durbin-Watson statistic to test for
correlation between adjacent residuals (known as autocorrelation)
The Durbin-Watson Statistic
Tests the null hypothesis $H_0: \rho_\varepsilon = 0$
Is calculated as follows:
$$D = \frac{(e_2 - e_1)^2 + (e_3 - e_2)^2 + \cdots + (e_n - e_{n-1})^2}{e_1^2 + e_2^2 + \cdots + e_n^2}$$
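The statistic is the sum of squared differences between adjacent residuals divided by the sum of squared residuals. A minimal sketch of that calculation follows; the function name `durbin_watson` and the two simulated residual series (one independent, one with strong positive autocorrelation) are illustrative assumptions, not part of the chapter's examples.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared adjacent differences over the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)

# Independent residuals: D should be near 2.
independent = rng.normal(size=500)

# Positively autocorrelated residuals (hypothetical AR(1) series with
# coefficient 0.9): D should be well below 2.
noise = rng.normal(size=500)
dependent = np.empty(500)
dependent[0] = noise[0]
for t in range(1, 500):
    dependent[t] = 0.9 * dependent[t - 1] + noise[t]

d_independent = durbin_watson(independent)
d_dependent = durbin_watson(dependent)
print(f"D, independent residuals:    {d_independent:.2f}")
print(f"D, autocorrelated residuals: {d_dependent:.2f}")
```

Values of D far below 2 point to positive autocorrelation. The formal conclusion still comes from a p-value or table.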
Use p-value provided by software or table
(portion shown below) to draw a conclusion
Consequences of Dependence
If there is positive autocorrelation in the errors, the
estimated standard errors are too small.
The estimated slope and intercept are less precise than suggested by the output.
Best remedy is to incorporate the dependence into the regression model.
4M Example 22.2:
CELL PHONE SUBSCRIBERS
Motivation
The rate of growth is captured by taking the
¼ power of the number of subscribers.
Method
Use simple regression to predict the future
number of subscribers. The quarter power
of the number of subscribers, in millions, is the response. The explanatory variable is time. The scatterplot shows a linear
association. Other lurking variables may
be present, however, such as technology
and marketing.
Mechanics
The least squares equation is
$$\text{Estimated Subscribers}^{1/4} = -317.4 + 0.16 \cdot \text{Date}$$
The timeplot of residuals and D = 0.11 indicate that the
independence condition is not satisfied. Also
Message
Using a novel transformation, the historical
trend can be summarized as
$$\text{Estimated Subscribers}^{1/4} = -317.4 + 0.16 \cdot \text{Date}$$
However, since the conditions for SRM are
not satisfied, we cannot quantify the
uncertainty for predictions.
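Because the response is the quarter power of subscribers, a fitted value has to be raised to the fourth power to return to the original scale (millions of subscribers). The sketch below shows only that back-transformation. The function name is hypothetical, and the default coefficients are the rounded values shown on the slide, so any number it returns is illustrative rather than a reproduction of the textbook's forecasts.

```python
def predict_subscribers(date, intercept=-317.4, slope=0.16):
    """Back-transform a fitted quarter-power value to millions of subscribers.

    The default coefficients are the rounded slide values (an assumption);
    unrounded software output would give different predictions.
    """
    fitted_quarter_power = intercept + slope * date
    if fitted_quarter_power < 0:
        # A negative fitted value has no meaningful fourth-power back-transform.
        raise ValueError("fitted value is negative; date is outside the usable range")
    return fitted_quarter_power ** 4

# Point estimate only: as noted above, the SRM conditions fail here,
# so no prediction interval is attached.
print(predict_subscribers(2000.0))
```

Small rounding differences in the slope are multiplied by a date near 2,000 and then raised to the fourth power, which is why the rounded coefficients should not be used for real forecasts.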
Best Practices
Make sure that your model makes sense
Plan to change your model if it does not match
the data
Report the presence of and how you handle any outliers
Do not rely on summary statistics like $r^2$ to pick
the best model
Don’t compare $r^2$ between regression models
unless the response is the same
Do not check for normality until you get the right equation
Pitfalls (Continued)
Don’t think that your data are independent if the
Durbin-Watson statistic is close to 2
Never forget to look at plots of the data and
model