DECISION MAKING
Regression Analysis: Statistical Inference
Chapter 11
Two basic problems are discussed in this chapter:
Population regression model
Inferring its characteristics—that is, its intercept and slope term(s)—from the corresponding terms estimated by least squares
Determining which explanatory variables belong in the equation
Inferring whether there is any population regression equation worth pursuing
The Statistical Model (slide 1 of 7)
To perform statistical inference in a regression context, a statistical model is required—that is, we must first make several assumptions about the population.
These assumptions represent an idealization of reality and are never likely to be entirely satisfied for the population in any real study.
From a practical point of view, all we can ask is that they represent a close approximation to reality.
If the assumptions are grossly violated, statistical inferences that are based on these assumptions should be viewed with suspicion.
The Statistical Model (slide 2 of 7)
Regression assumptions:
There is a population regression line.
It joins the means of the dependent variable for all values of the explanatory variables; for any fixed values of the explanatory variables, the mean of the errors is zero.
For any values of the explanatory variables, the variance (or standard deviation) of the dependent variable is a constant, the same for all such values.
The Statistical Model (slide 3 of 7)
The first assumption is probably the most important.
It implies that for some set of explanatory variables, there is an exact linear relationship in the population between the means of the dependent variable and the values of the explanatory variables.
Equation for the population regression line joining means:
Mean of Y = α + β1X1 + β2X2 + ... + βkXk
α is the intercept term, and the βs are the slope terms. (Greek letters are used to denote that they are unobservable population parameters.)
Most individual Ys do not lie on the population regression line.
The vertical distance from any point to the line is an error.
Equation for the population regression line with error:
Y = α + β1X1 + β2X2 + ... + βkXk + ε
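As an illustrative aside (not part of the original slides), the following Python sketch simulates data from a hypothetical population regression line with two explanatory variables and estimates the intercept and slopes by least squares. All parameter values are invented for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Hypothetical population parameters (assumptions for illustration only)
alpha, beta1, beta2, sigma = 10.0, 2.5, -1.0, 3.0

# Explanatory variables and errors with mean zero
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
eps = rng.normal(0, sigma, n)

# Population regression model with error term
y = alpha + beta1 * x1 + beta2 * x2 + eps

# Least-squares estimates of alpha and the betas
X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()
print(results.params)   # estimated intercept and slopes
print(results.bse)      # their standard errors
```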
The Statistical Model (slide 4 of 7)
Assumption 2 concerns variation around the population regression line.
It states that the variation of the Ys about the regression line is the same, regardless of the values of the Xs.
The technical term for this property is homoscedasticity.
A simpler term is constant error variance.
This assumption is often questionable—the variation in Y often increases as X increases.
Heteroscedasticity means that the variability of Y values is larger for some X values than for others.
A simpler term for this is nonconstant error variance.
The easiest way to detect nonconstant error variance is a scatterplot of the residuals versus the fitted values, as sketched below.
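A minimal sketch of such a residual plot, assuming a fitted statsmodels OLS results object (for example, `results` from the earlier simulation sketch):

```python
import matplotlib.pyplot as plt

def plot_residuals_vs_fitted(results):
    """Scatterplot of residuals versus fitted values for a fitted
    statsmodels OLS results object. A fan shape (spread growing with
    the fitted values) suggests nonconstant error variance."""
    plt.scatter(results.fittedvalues, results.resid, s=10)
    plt.axhline(0, linewidth=1)
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.title("Residuals versus fitted values")
    plt.show()
```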
The Statistical Model (slide 5 of 7)
Assumption 3 is equivalent to stating that the errors are normally distributed.
You can check this by forming a histogram (or a Q-Q plot) of the residuals.
If assumption 3 holds, the histogram should be approximately symmetric and bell-shaped, and the points of a Q-Q plot should be close to a 45-degree line.
If there is obvious skewness or some other nonnormal property, this indicates a violation of assumption 3.
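An illustrative sketch of both checks, again assuming a fitted statsmodels OLS results object named `results`:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

def check_normality(results):
    """Histogram and Q-Q plot of the residuals. Under assumption 3 the
    histogram should look roughly symmetric and bell-shaped, and the
    Q-Q points should lie close to the 45-degree line."""
    plt.hist(results.resid, bins=20)
    plt.title("Histogram of residuals")
    plt.show()

    sm.qqplot(results.resid, line="45", fit=True)
    plt.show()
```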
The Statistical Model (slide 6 of 7)
Assumption 4 concerns the independence of the errors.
This assumption means that information on some of the errors provides no information on the values of the other errors.
For cross-sectional data, this assumption is usually taken for granted.
For time-series data, this assumption is often violated because of a property called autocorrelation.
The Durbin-Watson statistic is one measure of autocorrelation and thus measures the extent to which this assumption is violated.
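The statistic is easy to compute from the residuals; a sketch using statsmodels (illustrative only, assuming a fitted results object as before):

```python
from statsmodels.stats.stattools import durbin_watson

def report_durbin_watson(results):
    """Durbin-Watson statistic for the residuals of a fitted OLS model.
    Values near 2 suggest little autocorrelation; values well below 2
    suggest positive autocorrelation (common with time-series data)."""
    dw = durbin_watson(results.resid)
    print(f"Durbin-Watson statistic: {dw:.2f}")
```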
The Statistical Model (slide 7 of 7)
One other assumption is important for numerical calculations: no explanatory variable can be an exact linear combination of any other explanatory variables; that is, no explanatory variable can be written as a weighted sum of several of the others.
This is called exact multicollinearity.
If it exists, there is redundancy in the data.
A more common problem in practice is (near) multicollinearity, where explanatory variables are highly, but not exactly, correlated.
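A small numerical sketch (illustrative only) shows why exact multicollinearity breaks the calculations: when one column of the design matrix is an exact weighted sum of others, the matrix loses full column rank and the usual least-squares formulas cannot be applied.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
x3 = x1 + x2                      # exact linear combination of x1 and x2

X = np.column_stack([np.ones(50), x1, x2, x3])

# The design matrix is not of full column rank, so X'X cannot be inverted.
print("Columns:", X.shape[1], "  rank:", np.linalg.matrix_rank(X))
print("Condition number:", np.linalg.cond(X))
```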
Inferences about the Regression Coefficients
In the equation for the population regression line, α and the βs are called the regression coefficients.
There is one other unknown constant in the model: the variance of the errors, labeled σ2.
The choice of relevant explanatory variables is almost never obvious.
Two guiding principles are relevance and data availability.
One overriding principle is parsimony—to explain the most with the least.
It favors a model with fewer explanatory variables, assuming that this model explains the dependent variable almost as well as a model with additional explanatory variables.
Sampling Distribution of the Regression Coefficients
The sampling distribution of any estimate derived from sample data is the distribution of this estimate over all possible samples.
Sampling distribution of a regression coefficient: if the regression assumptions are valid, the standardized value
(b − β) / sb
has a t distribution with n − k − 1 degrees of freedom, where n is the number of observations and k is the number of explanatory variables.
This result has three important implications:
The estimate b is unbiased in the sense that its mean is β, the true but unknown value of the slope.
The estimated standard deviation of b is labeled sb.
It is usually called the standard error of a regression coefficient, or the standard error of b.
It measures how much the bs would vary from sample to sample.
The shape of the distribution of b is symmetric and bell-shaped.
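The idea of a sampling distribution can be illustrated by simulation (this sketch is not from the original slides, and the population values are arbitrary): draw many samples from the same population, re-estimate the slope each time, and look at the distribution of those estimates.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
alpha, beta, sigma, n = 5.0, 2.0, 4.0, 40   # hypothetical population values
x = rng.uniform(0, 10, n)                   # keep the X values fixed

# Re-estimate the slope over many samples drawn from the same population
slopes = []
for _ in range(2000):
    y = alpha + beta * x + rng.normal(0, sigma, n)
    b = sm.OLS(y, sm.add_constant(x)).fit().params[1]
    slopes.append(b)

slopes = np.array(slopes)
print("Mean of the slope estimates:", slopes.mean())    # close to beta (unbiased)
print("Std dev of the slope estimates:", slopes.std())  # close to the standard error
```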
Example 11.1: Overhead Costs.xlsx (slide 1 of 2)
Objective: To use standard regression output to make inferences about the regression coefficients of machine hours and production runs in the equation for overhead costs.
Solution: The dependent variable is Overhead, and the explanatory variables are Machine Hours and Production Runs.
The output from StatTools's Regression procedure is shown below.
The estimates of the regression coefficients appear under the label Coefficient.
The column labeled Standard Error shows the sb values.
Each b represents a point estimate of the corresponding β; the corresponding sb indicates the accuracy of this point estimate.
Example 11.1: Overhead Costs.xlsx (slide 2 of 2)
The sample data can be used to obtain a confidence interval for a regression coefficient.
A confidence interval for any β is of the form
b ± t-multiple × sb
where the t-multiple depends on the confidence level and the degrees of freedom.
StatTools always provides these 95% confidence intervals for the regression coefficients automatically, as shown at the bottom right of the figure on the previous slide.
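For illustration (the numbers below are made up, not taken from the Overhead Costs output), a 95% confidence interval can be computed from a coefficient estimate, its standard error, and the degrees of freedom:

```python
from scipy import stats

b, s_b = 43.5, 6.2        # hypothetical coefficient estimate and standard error
n, k = 36, 2              # hypothetical sample size and number of explanatory variables

t_mult = stats.t.ppf(0.975, df=n - k - 1)   # t-multiple for a 95% interval
print("95% CI:", (b - t_mult * s_b, b + t_mult * s_b))
```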
Hypothesis Tests for the Regression Coefficients and p-Values
There is another important piece of information in regression outputs: the t-values for the individual regression coefficients.
Each t-value is the ratio of the estimated coefficient to its standard error; it indicates how many standard errors the coefficient is from zero.
The associated p-value is the probability of a t-value at least this extreme in magnitude if the true coefficient were zero, so a small p-value is evidence that the coefficient is not zero.
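A small sketch (again with invented numbers) shows how a t-value and its two-tailed p-value are computed:

```python
from scipy import stats

b, s_b = 43.5, 6.2          # hypothetical coefficient and standard error
n, k = 36, 2                # hypothetical sample size and number of explanatory variables
df = n - k - 1

t_value = b / s_b                               # coefficient / standard error
p_value = 2 * stats.t.sf(abs(t_value), df=df)   # two-tailed p-value
print(f"t = {t_value:.2f}, p = {p_value:.4f}")
```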
A Test for the Overall Fit: The ANOVA Table (slide 1 of 3)
It is possible that none of the explanatory variables in the regression equation explains the dependent variable.
An indication of this problem is a very small R2 value.
An equation has no explanatory power if the same value of Y will be predicted regardless of the values of the Xs.
The null hypothesis for the test of the overall fit is that all coefficients of the explanatory variables are zero; the alternative is that at least one of these coefficients is not zero.
A Test for the Overall Fit: The ANOVA Table (slide 2 of 3)
To test the null hypothesis, use an F test, a formal procedure for testing whether the explained variation is large compared to the unexplained variation.
This is also called the ANOVA (analysis of variance) test because the elements for calculating the required F-value are shown in an ANOVA table for regression.
The ANOVA table splits the total variation of the Y values into two parts: the part explained by the regression equation and the part left unexplained.
A Test for the Overall Fit: The ANOVA Table (slide 3 of 3)
The required F-ratio for the test is:
F-ratio = MS(Explained) / MS(Unexplained)
where MS(Explained) = (explained variation) / k and MS(Unexplained) = (unexplained variation) / (n − k − 1).
If the F-ratio is small, the explained variation is small relative to the unexplained variation, and there is evidence that the regression equation provides little explanatory value.
The F-ratio has an associated p-value that allows you to run the test easily; it is reported in most regression outputs.
Reject the null hypothesis—and conclude that the X variables have at least some explanatory value—if the F-value in the ANOVA table is large and the corresponding p-value is small.
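As an illustration (the sums of squares below are invented, not from a real ANOVA table), the F-ratio and its p-value can be computed as follows:

```python
from scipy import stats

# Hypothetical ANOVA quantities: explained and unexplained sums of squares
ss_explained, ss_unexplained = 5000.0, 2000.0
n, k = 50, 3

ms_explained = ss_explained / k                    # mean square, explained
ms_unexplained = ss_unexplained / (n - k - 1)      # mean square, unexplained
F = ms_explained / ms_unexplained
p_value = stats.f.sf(F, k, n - k - 1)
print(f"F = {F:.2f}, p = {p_value:.4g}")
```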
Multicollinearity
Multicollinearity occurs when there is a fairly strong linear relationship among a set of explanatory variables.
In this case, the relationship between the explanatory variable X and the dependent variable Y is not always accurately reflected in the coefficient of X; it depends on which other Xs are included or not included in the equation.
There are various degrees of multicollinearity, but in each of them, there is a linear relationship between two or more explanatory variables.
The symptoms of multicollinearity can be "wrong" signs of the coefficients, smaller-than-expected t-values, and larger-than-expected (insignificant) p-values.
Example 11.2: Heights Simulation.xlsx (slide 1 of 2)
Objective: To illustrate the problem of multicollinearity when both foot length variables are used in a regression for height.
Solution: The dependent variable is Height, and the explanatory variables are Right and Left, the length of the right foot and the left foot, respectively.
Simulation is used to generate a hypothetical data set of heights and left and right foot lengths.
Height is approximately 31.8 plus 3.2 times foot length (all expressed in inches).
Example 11.2: Heights Simulation.xlsx (slide 2 of 2)
The regression output when both Right and Left are entered in the equation for Height appears at the bottom right of the figure below.
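Since the figure itself is not reproduced here, the following sketch imitates the example: it simulates heights from the rule above (the foot-length distribution and the error standard deviation are assumptions) and shows how the individual coefficients become unreliable when both highly correlated foot lengths are included.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100

# Simulate foot lengths: right and left feet are nearly identical
right = rng.normal(11, 1, n)
left = right + rng.normal(0, 0.1, n)

# Height follows the rule from the slide, plus random error (error sd is an assumption)
height = 31.8 + 3.2 * right + rng.normal(0, 2, n)

# Regression with both (highly correlated) foot lengths: coefficients are unstable
both = sm.OLS(height, sm.add_constant(np.column_stack([right, left]))).fit()
print(both.params)

# Regression with only one foot length recovers a slope close to 3.2
one = sm.OLS(height, sm.add_constant(right)).fit()
print(one.params)
```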
INCLUDE/EXCLUDE DECISIONS
The t-values of regression coefficients can be used to make include/exclude decisions for explanatory variables in a regression equation.
Finding the best Xs to include in a regression equation is the most difficult part of any real regression analysis.
You are always trying to get the best fit possible, but the principle of parsimony suggests using the fewest variables.
This presents a trade-off, where there are not always easy answers.
To help with this decision, several guidelines are presented on the next slide.
Guidelines for Including/Excluding Variables in a Regression Equation
Look at a variable's t-value and its associated p-value. If the p-value is above some accepted significance level, such as 0.05, this variable is a candidate for exclusion.
Check whether a variable's t-value is less than 1 or greater than 1 in magnitude. If it is less than 1, then it is a mathematical fact that se will decrease (and adjusted R2 will increase) if this variable is excluded from the equation.
Look at t-values and p-values, rather than correlations, when making include/exclude decisions. An explanatory variable can have a fairly high correlation with the dependent variable, but because of other variables included in the equation, it might not be needed.
When there is a group of variables that are in some sense logically related, it is sometimes a good idea to include all of them or exclude all of them.
Use economic and/or physical theory to decide whether to include or exclude variables, and put less reliance on t-values and/or p-values.
Example 11.3: Catalog Marketing.xlsx (slide 1 of 2)
Objective: To see which potential explanatory variables are useful for explaining current year spending amounts at HyTex with multiple regression.
Solution: The data file contains data on 1000 customers who purchased mail-order products from the HyTex Company.
For each customer, data on several variables are included.
Base the regression on the first 750 observations and use the other 250 for validation.
Enter all of the potential explanatory variables.
Then exclude unnecessary variables based on their t-values and p-values.
Four variables, Age, Gender, Own Home, and Married, have p-values well above 0.05 and are obvious candidates for exclusion.
Exclude variables one at a time, starting with the variable that has the highest p-value, and rerun the regression after each exclusion (see the sketch below).
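A hedged sketch of this one-at-a-time exclusion process: it assumes the data are in a pandas DataFrame, and the file name and any column names beyond those mentioned on the slide are placeholders, not the actual HyTex variable list.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df, y_col, x_cols, threshold=0.05):
    """Drop the explanatory variable with the highest p-value, one at a
    time, refitting after each exclusion, until all remaining p-values
    are below the threshold. Returns the final model and the kept columns."""
    kept = list(x_cols)
    while kept:
        X = sm.add_constant(df[kept])
        results = sm.OLS(df[y_col], X).fit()
        pvals = results.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= threshold:
            return results, kept
        kept.remove(worst)
    return None, kept

# Hypothetical usage (file and full column list are assumptions):
# data = pd.read_excel("Catalog Marketing.xlsx")
# model, kept = backward_eliminate(data.iloc[:750], "Amount Spent",
#                                  ["Age", "Gender", "Own Home", "Married",
#                                   "other explanatory columns"])
```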
Example 11.3: Catalog Marketing.xlsx (slide 2 of 2)
The resulting output appears below.
Stepwise Regression
Many statistical packages provide some assistance in include/exclude decisions by including automatic equation-building options.
These options estimate a series of regression equations by successively adding (or deleting) variables according to prescribed rules.
Generically, these methods are referred to as stepwise regression.
There are three types of equation-building procedures:
Forward—begins with no explanatory variables in the equation and successively adds one at a time until no remaining variables make a significant contribution.
Backward—begins with all potential explanatory variables in the equation and deletes them one at a time until further deletion would do more harm than good.
Stepwise—is much like a forward procedure, except that it also considers possible deletions along the way.
All of these procedures have the same basic objective—to find an equation with a small se and a large R2 (or adjusted R2). A sketch of the forward procedure appears below.
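A minimal sketch of the forward procedure (not StatTools's exact algorithm, which may use different entry criteria): at each step it adds the candidate variable with the smallest p-value, stopping when no remaining candidate is significant.

```python
import statsmodels.api as sm

def forward_select(df, y_col, candidates, threshold=0.05):
    """Forward procedure: start with no explanatory variables and add, one
    at a time, the candidate with the smallest p-value, stopping when no
    remaining candidate is significant at the threshold."""
    chosen = []
    remaining = list(candidates)
    while remaining:
        best_var, best_p = None, 1.0
        for var in remaining:
            X = sm.add_constant(df[chosen + [var]])
            p = sm.OLS(df[y_col], X).fit().pvalues[var]
            if p < best_p:
                best_var, best_p = var, p
        if best_p > threshold:
            break
        chosen.append(best_var)
        remaining.remove(best_var)
    return chosen

# Hypothetical usage with the catalog data from Example 11.3:
# chosen = forward_select(data.iloc[:750], "Amount Spent", candidate_columns)
```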
To run a stepwise regression in StatTools, select the stepwise procedure from the Regression Type dropdown list in the Regression dialog box.
Specify Amount Spent as the dependent variable and select all of the other variables (besides Customer) as potential explanatory variables.
A sample of the stepwise output appears to the right.
The variables that enter or exit the equation are listed at the bottom of the output.
Outliers (slide 1 of 2)
An observation can be considered an outlier for one or more of the following reasons:
It has an extreme value for at least one variable.
Its value of the dependent variable is much larger or smaller than predicted by the regression line, and its residual is abnormally large in magnitude.
An example of this type of outlier is shown below.
Outliers (slide 2 of 2)
Its residual is not only large in magnitude, but this point "tilts" the regression line toward it.
This type of outlier is called an influential point.
An example of this type of outlier is shown below, on the left.
Its values of individual explanatory variables are not extreme, but they fall outside the general pattern of the other observations.
An example of this type of outlier is shown below, on the right.
In most cases, the regression output will look "nicer" if you delete the outliers, but this does not mean they should automatically be deleted; an outlier should be removed only if there is a good reason to do so.
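An illustrative screening sketch for these kinds of points, assuming a fitted statsmodels OLS results object (the cutoffs used are common rules of thumb, not rules from the slides):

```python
import numpy as np

def flag_unusual_points(results):
    """Rule-of-thumb screen for outliers and influential points, given a
    fitted statsmodels OLS results object (e.g., from the earlier sketches)."""
    influence = results.get_influence()
    student_resid = influence.resid_studentized_external  # unusual Y given the Xs
    leverage = influence.hat_matrix_diag                   # unusual X values
    cooks_d = influence.cooks_distance[0]                  # overall influence

    n = int(results.nobs)
    big_resid = np.where(np.abs(student_resid) > 2)[0]
    high_leverage = np.where(leverage > 2 * (results.df_model + 1) / n)[0]
    influential = np.where(cooks_d > 4 / n)[0]

    print("Large residuals (rows):", big_resid)
    print("High leverage (rows):", high_leverage)
    print("Influential by Cook's distance (rows):", influential)
```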
Example 11.4: Bank Salaries.xlsx (slide 1 of 2)
Objective: To identify any outliers in the bank salary data and to see to what extent they affect the regression model.
Solution: Look for outliers in scatterplots of individual variables or scatterplots of the residuals versus the fitted values.
Example 11.4: Bank Salaries.xlsx (slide 2 of 2)
Then run the regression with and without the outlier. The output with the outlier included is shown on the top right; the output with the outlier excluded is shown on the bottom right.
Violations of Regression Assumptions
Two issues arise in connection with violations of regression assumptions:
How to detect violations of the assumptions.
This is usually relatively easy, using scatterplots, histograms, time series graphs, and numerical measures.
What goes wrong if the violations are ignored.
This depends on the type of violation and its severity.