DECISION MAKING
Regression Analysis: Statistical Inference
Chapter 11
Two basic problems are discussed in this chapter:
Population regression model
Inferring its characteristics—that is, its intercept and slope term(s)—from the corresponding terms estimated by least squares
Determining which explanatory variables belong in the equation
Inferring whether there is any population regression equation worth pursuing
The Statistical Model (slide 1 of 7)
To perform statistical inference in a regression context, a statistical model is required—that is, we must first make several assumptions about the population.
These assumptions represent an idealization of reality and are never likely to be entirely satisfied for the population in any real study.
From a practical point of view, all we can ask is that they represent a close approximation to reality.
If the assumptions are grossly violated, statistical inferences that are based on these assumptions should be viewed with suspicion.
The Statistical Model (slide 2 of 7)
Regression assumptions:
There is a population regression line.
It joins the means of the dependent variable for all values of the explanatory variables; for any fixed values of the explanatory variables, the mean of the errors is zero.
For any values of the explanatory variables, the variance (or standard deviation) of the dependent variable is a constant, the same for all such values.
The Statistical Model (slide 3 of 7)
The first assumption is probably the most important.
It implies that for some set of explanatory variables, there is an exact linear relationship in the population between the means of the dependent variable and the values of the explanatory variables.
Equation for the population regression line joining means:
Mean of Y = α + β1X1 + β2X2 + ... + βkXk
α is the intercept term, and the βs are the slope terms. (Greek letters are used to denote that they are unobservable population parameters.)
Most individual Ys do not lie on the population regression line.
The vertical distance from any point to the line is an error.
Equation for the population regression line with error:
Y = α + β1X1 + β2X2 + ... + βkXk + ε
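As an illustrative aside (not part of the original slides), the following Python sketch simulates data from a hypothetical population regression line with two explanatory variables and estimates the intercept and slopes by least squares. All parameter values are invented for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Hypothetical population parameters (assumptions for illustration only)
alpha, beta1, beta2, sigma = 10.0, 2.5, -1.0, 3.0

# Explanatory variables and errors with mean zero
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
eps = rng.normal(0, sigma, n)

# Population regression model with error term
y = alpha + beta1 * x1 + beta2 * x2 + eps

# Least-squares estimates of alpha and the betas
X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()
print(results.params)   # estimated intercept and slopes
print(results.bse)      # their standard errors
```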
The Statistical Model (slide 4 of 7)
Assumption 2 concerns variation around the population regression line.
It states that the variation of the Ys about the regression line is the same, regardless of the values of the Xs.
The technical term for this property is homoscedasticity.
A simpler term is constant error variance.
This assumption is often questionable—the variation in Y often increases as X increases.
Heteroscedasticity means that the variability of Y values is larger for some X values than for others.
A simpler term for this is nonconstant error variance.
The easiest way to detect nonconstant error variance is a scatterplot of the residuals versus the fitted values, as sketched below.
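A minimal sketch of such a residual plot, assuming a fitted statsmodels OLS results object (for example, `results` from the earlier simulation sketch):

```python
import matplotlib.pyplot as plt

def plot_residuals_vs_fitted(results):
    """Scatterplot of residuals versus fitted values for a fitted
    statsmodels OLS results object. A fan shape (spread growing with
    the fitted values) suggests nonconstant error variance."""
    plt.scatter(results.fittedvalues, results.resid, s=10)
    plt.axhline(0, linewidth=1)
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.title("Residuals versus fitted values")
    plt.show()
```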
The Statistical Model (slide 5 of 7)
Assumption 3 is equivalent to stating that the errors are normally distributed.
You can check this by forming a histogram (or a Q-Q plot) of the residuals.
If assumption 3 holds, the histogram should be approximately symmetric and bell-shaped, and the points of a Q-Q plot should be close to a 45-degree line.
If there is obvious skewness or some other nonnormal property, this indicates a violation of assumption 3.
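An illustrative sketch of both checks, again assuming a fitted statsmodels OLS results object named `results`:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

def check_normality(results):
    """Histogram and Q-Q plot of the residuals. Under assumption 3 the
    histogram should look roughly symmetric and bell-shaped, and the
    Q-Q points should lie close to the 45-degree line."""
    plt.hist(results.resid, bins=20)
    plt.title("Histogram of residuals")
    plt.show()

    sm.qqplot(results.resid, line="45", fit=True)
    plt.show()
```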
The Statistical Model (slide 6 of 7)
Assumption 4 concerns the independence of the errors.
This assumption means that information on some of the errors provides no information on the values of the other errors.
For cross-sectional data, this assumption is usually taken for granted.
For time-series data, this assumption is often violated because of a property called autocorrelation.
The Durbin-Watson statistic is one measure of autocorrelation and thus measures the extent to which this assumption is violated.
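The statistic is easy to compute from the residuals; a sketch using statsmodels (illustrative only, assuming a fitted results object as before):

```python
from statsmodels.stats.stattools import durbin_watson

def report_durbin_watson(results):
    """Durbin-Watson statistic for the residuals of a fitted OLS model.
    Values near 2 suggest little autocorrelation; values well below 2
    suggest positive autocorrelation (common with time-series data)."""
    dw = durbin_watson(results.resid)
    print(f"Durbin-Watson statistic: {dw:.2f}")
```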
The Statistical Model (slide 7 of 7)
One other assumption is important for numerical calculations: no explanatory variable can be an exact linear combination of any other explanatory variables; that is, no explanatory variable can be written as a weighted sum of several of the others.
This is called exact multicollinearity.
If it exists, there is redundancy in the data.
A more common problem in practice is (near) multicollinearity, where explanatory variables are highly, but not exactly, correlated.
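A small numerical sketch (illustrative only) shows why exact multicollinearity breaks the calculations: when one column of the design matrix is an exact weighted sum of others, the matrix loses full column rank and the usual least-squares formulas cannot be applied.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
x3 = x1 + x2                      # exact linear combination of x1 and x2

X = np.column_stack([np.ones(50), x1, x2, x3])

# The design matrix is not of full column rank, so X'X cannot be inverted.
print("Columns:", X.shape[1], "  rank:", np.linalg.matrix_rank(X))
print("Condition number:", np.linalg.cond(X))
```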
Inferences about the Regression Coefficients
In the equation for the population regression line, α and the βs are called the regression coefficients.
There is one other unknown constant in the model: the variance of the errors, labeled σ2.
The choice of relevant explanatory variables is almost never obvious.
Two guiding principles are relevance and data availability.
One overriding principle is parsimony—to explain the most with the least.
It favors a model with fewer explanatory variables, assuming that this model explains the dependent variable almost as well as a model with additional explanatory variables.
Sampling Distribution of the Regression Coefficients
The sampling distribution of any estimate derived from sample data is the distribution of this estimate over all possible samples.
Sampling distribution of a regression coefficient: if the regression assumptions are valid, the standardized value
(b − β) / sb
has a t distribution with n − k − 1 degrees of freedom, where n is the number of observations and k is the number of explanatory variables.
This result has three important implications:
The estimate b is unbiased in the sense that its mean is β, the true but unknown value of the slope.
The estimated standard deviation of b is labeled sb.
It is usually called the standard error of a regression coefficient, or the standard error of b.
It measures how much the bs would vary from sample to sample.
The shape of the distribution of b is symmetric and bell-shaped.
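The idea of a sampling distribution can be illustrated by simulation (this sketch is not from the original slides, and the population values are arbitrary): draw many samples from the same population, re-estimate the slope each time, and look at the distribution of those estimates.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
alpha, beta, sigma, n = 5.0, 2.0, 4.0, 40   # hypothetical population values
x = rng.uniform(0, 10, n)                   # keep the X values fixed

# Re-estimate the slope over many samples drawn from the same population
slopes = []
for _ in range(2000):
    y = alpha + beta * x + rng.normal(0, sigma, n)
    b = sm.OLS(y, sm.add_constant(x)).fit().params[1]
    slopes.append(b)

slopes = np.array(slopes)
print("Mean of the slope estimates:", slopes.mean())    # close to beta (unbiased)
print("Std dev of the slope estimates:", slopes.std())  # close to the standard error
```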
Example 11.1: Overhead Costs.xlsx (slide 1 of 2)
Objective: To use standard regression output to make inferences about the regression coefficients of machine hours and production runs in the equation for overhead costs.
Solution: The dependent variable is Overhead, and the explanatory variables are Machine Hours and Production Runs.
The output from StatTools's Regression procedure is shown below.
The estimates of the regression coefficients appear under the label Coefficient.
The column labeled Standard Error shows the sb values.
Each b represents a point estimate of the corresponding β; the corresponding sb indicates the accuracy of this point estimate.
Example 11.1: Overhead Costs.xlsx (slide 2 of 2)
The sample data can be used to obtain a confidence interval for a regression coefficient.
A confidence interval for any β is of the form
b ± t-multiple × sb
where the t-multiple depends on the confidence level and the degrees of freedom.
StatTools always provides these 95% confidence intervals for the regression coefficients automatically, as shown at the bottom right of the figure on the previous slide.
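For illustration (the numbers below are made up, not taken from the Overhead Costs output), a 95% confidence interval can be computed from a coefficient estimate, its standard error, and the degrees of freedom:

```python
from scipy import stats

b, s_b = 43.5, 6.2        # hypothetical coefficient estimate and standard error
n, k = 36, 2              # hypothetical sample size and number of explanatory variables

t_mult = stats.t.ppf(0.975, df=n - k - 1)   # t-multiple for a 95% interval
print("95% CI:", (b - t_mult * s_b, b + t_mult * s_b))
```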
Hypothesis Tests for the Regression Coefficients and p-Values
There is another important piece of information in regression outputs: the t-values for the individual regression coefficients.
Each t-value is the ratio of the estimated coefficient to its standard error; it indicates how many standard errors the coefficient is from zero.
The associated p-value is the probability of a t-value at least this extreme in magnitude if the true coefficient were zero, so a small p-value is evidence that the coefficient is not zero.
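A small sketch (again with invented numbers) shows how a t-value and its two-tailed p-value are computed:

```python
from scipy import stats

b, s_b = 43.5, 6.2          # hypothetical coefficient and standard error
n, k = 36, 2                # hypothetical sample size and number of explanatory variables
df = n - k - 1

t_value = b / s_b                               # coefficient / standard error
p_value = 2 * stats.t.sf(abs(t_value), df=df)   # two-tailed p-value
print(f"t = {t_value:.2f}, p = {p_value:.4f}")
```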
A Test for the Overall Fit: The ANOVA Table (slide 1 of 3)
It is possible that none of the explanatory variables in the regression equation explains the dependent variable.
An indication of this problem is a very small R2 value.
An equation has no explanatory power if the same value of Y will be predicted regardless of the values of the Xs.
The null hypothesis for the test of the overall fit is that all coefficients of the explanatory variables are zero; the alternative is that at least one of these coefficients is not zero.
A Test for the Overall Fit: The ANOVA Table (slide 2 of 3)
To test the null hypothesis, use an F test, a formal procedure for testing whether the explained variation is large compared to the unexplained variation.
This is also called the ANOVA (analysis of variance) test because the elements for calculating the required F-value are shown in an ANOVA table for regression.
The ANOVA table splits the total variation of the Y values into two parts: the part explained by the regression equation and the part left unexplained.
A Test for the Overall Fit: The ANOVA Table (slide 3 of 3)
The required F-ratio for the test is:
F-ratio = MS(Explained) / MS(Unexplained)
where MS(Explained) = (explained variation) / k and MS(Unexplained) = (unexplained variation) / (n − k − 1).
If the F-ratio is small, the explained variation is small relative to the unexplained variation, and there is evidence that the regression equation provides little explanatory value.
The F-ratio has an associated p-value that allows you to run the test easily; it is reported in most regression outputs.
Reject the null hypothesis—and conclude that the X variables have at least some explanatory value—if the F-value in the ANOVA table is large and the corresponding p-value is small.
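As an illustration (the sums of squares below are invented, not from a real ANOVA table), the F-ratio and its p-value can be computed as follows:

```python
from scipy import stats

# Hypothetical ANOVA quantities: explained and unexplained sums of squares
ss_explained, ss_unexplained = 5000.0, 2000.0
n, k = 50, 3

ms_explained = ss_explained / k                    # mean square, explained
ms_unexplained = ss_unexplained / (n - k - 1)      # mean square, unexplained
F = ms_explained / ms_unexplained
p_value = stats.f.sf(F, k, n - k - 1)
print(f"F = {F:.2f}, p = {p_value:.4g}")
```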
Multicollinearity
Multicollinearity occurs when there is a fairly strong linear relationship among a set of explanatory variables.
In this case, the relationship between the explanatory variable X and the dependent variable Y is not always accurately reflected in the coefficient of X; it depends on which other Xs are included or not included in the equation.
There are various degrees of multicollinearity, but in each of them, there is a linear relationship between two or more explanatory variables.
The symptoms of multicollinearity can be "wrong" signs of the coefficients, smaller-than-expected t-values, and larger-than-expected (insignificant) p-values.
Example 11.2: Heights Simulation.xlsx (slide 1 of 2)
Objective: To illustrate the problem of multicollinearity when both foot length variables are used in a regression for height.
Solution: The dependent variable is Height, and the explanatory variables are Right and Left, the length of the right foot and the left foot, respectively.
Simulation is used to generate a hypothetical data set of heights and left and right foot lengths.
Height is approximately 31.8 plus 3.2 times foot length (all expressed in inches).
Example 11.2: Heights Simulation.xlsx (slide 2 of 2)
The regression output when both Right and Left are entered in the equation for Height appears at the bottom right of the figure below.
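Since the figure itself is not reproduced here, the following sketch imitates the example: it simulates heights from the rule above (the foot-length distribution and the error standard deviation are assumptions) and shows how the individual coefficients become unreliable when both highly correlated foot lengths are included.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100

# Simulate foot lengths: right and left feet are nearly identical
right = rng.normal(11, 1, n)
left = right + rng.normal(0, 0.1, n)

# Height follows the rule from the slide, plus random error (error sd is an assumption)
height = 31.8 + 3.2 * right + rng.normal(0, 2, n)

# Regression with both (highly correlated) foot lengths: coefficients are unstable
both = sm.OLS(height, sm.add_constant(np.column_stack([right, left]))).fit()
print(both.params)

# Regression with only one foot length recovers a slope close to 3.2
one = sm.OLS(height, sm.add_constant(right)).fit()
print(one.params)
```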
INCLUDE/EXCLUDE DECISIONS
The t-values of regression coefficients can be used to make include/exclude decisions for explanatory variables in a regression equation.
Finding the best Xs to include in a regression equation is the most difficult part of any real regression analysis.
You are always trying to get the best fit possible, but the principle of parsimony suggests using the fewest variables.
This presents a trade-off, where there are not always easy answers.
To help with this decision, several guidelines are presented on the next slide.
Guidelines for Including/Excluding Variables in a Regression Equation
Look at a variable's t-value and its associated p-value. If the p-value is above some accepted significance level, such as 0.05, this variable is a candidate for exclusion.
Check whether a variable's t-value is less than 1 or greater than 1 in magnitude. If it is less than 1, then it is a mathematical fact that se will decrease (and adjusted R2 will increase) if this variable is excluded from the equation.
Look at t-values and p-values, rather than correlations, when making include/exclude decisions. An explanatory variable can have a fairly high correlation with the dependent variable, but because of other variables included in the equation, it might not be needed.
When there is a group of variables that are in some sense logically related, it is sometimes a good idea to include all of them or exclude all of them.
Use economic and/or physical theory to decide whether to include or exclude variables, and put less reliance on t-values and/or p-values.
Example 11.3: Catalog Marketing.xlsx (slide 1 of 2)
Objective: To see which potential explanatory variables are useful for explaining current year spending amounts at HyTex with multiple regression.
Solution: The data file contains data on 1000 customers who purchased mail-order products from the HyTex Company.
For each customer, data on several variables are included.
Base the regression on the first 750 observations and use the other 250 for validation.
Enter all of the potential explanatory variables.
Then exclude unnecessary variables based on their t-values and p-values.
Four variables, Age, Gender, Own Home, and Married, have p-values well above 0.05 and are obvious candidates for exclusion.
Exclude variables one at a time, starting with the variable that has the highest p-value, and rerun the regression after each exclusion (see the sketch below).
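A hedged sketch of this one-at-a-time exclusion process: it assumes the data are in a pandas DataFrame, and the file name and any column names beyond those mentioned on the slide are placeholders, not the actual HyTex variable list.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df, y_col, x_cols, threshold=0.05):
    """Drop the explanatory variable with the highest p-value, one at a
    time, refitting after each exclusion, until all remaining p-values
    are below the threshold. Returns the final model and the kept columns."""
    kept = list(x_cols)
    while kept:
        X = sm.add_constant(df[kept])
        results = sm.OLS(df[y_col], X).fit()
        pvals = results.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= threshold:
            return results, kept
        kept.remove(worst)
    return None, kept

# Hypothetical usage (file and full column list are assumptions):
# data = pd.read_excel("Catalog Marketing.xlsx")
# model, kept = backward_eliminate(data.iloc[:750], "Amount Spent",
#                                  ["Age", "Gender", "Own Home", "Married",
#                                   "other explanatory columns"])
```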
Example 11.3: Catalog Marketing.xlsx (slide 2 of 2)
The resulting output appears below.
Stepwise Regression
Many statistical packages provide some assistance in include/exclude decisions by including automatic equation-building options.
These options estimate a series of regression equations by successively adding (or deleting) variables according to prescribed rules.
Generically, these methods are referred to as stepwise regression.
There are three types of equation-building procedures:
Forward—begins with no explanatory variables in the equation and successively adds one at a time until no remaining variables make a significant contribution.
Backward—begins with all potential explanatory variables in the equation and deletes them one at a time until further deletion would do more harm than good.
Stepwise—is much like a forward procedure, except that it also considers possible deletions along the way.
All of these procedures have the same basic objective—to find an equation with a small se and a large R2 (or adjusted R2). A sketch of the forward procedure appears below.
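A minimal sketch of the forward procedure (not StatTools's exact algorithm, which may use different entry criteria): at each step it adds the candidate variable with the smallest p-value, stopping when no remaining candidate is significant.

```python
import statsmodels.api as sm

def forward_select(df, y_col, candidates, threshold=0.05):
    """Forward procedure: start with no explanatory variables and add, one
    at a time, the candidate with the smallest p-value, stopping when no
    remaining candidate is significant at the threshold."""
    chosen = []
    remaining = list(candidates)
    while remaining:
        best_var, best_p = None, 1.0
        for var in remaining:
            X = sm.add_constant(df[chosen + [var]])
            p = sm.OLS(df[y_col], X).fit().pvalues[var]
            if p < best_p:
                best_var, best_p = var, p
        if best_p > threshold:
            break
        chosen.append(best_var)
        remaining.remove(best_var)
    return chosen

# Hypothetical usage with the catalog data from Example 11.3:
# chosen = forward_select(data.iloc[:750], "Amount Spent", candidate_columns)
```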
To run a stepwise regression in StatTools, select the stepwise procedure from the Regression Type dropdown list in the Regression dialog box.
Specify Amount Spent as the dependent variable and select all of the other variables (besides Customer) as potential explanatory variables.
A sample of the stepwise output appears to the right.
The variables that enter or exit the equation are listed at the bottom of the output.
Outliers (slide 1 of 2)
An observation can be considered an outlier for one or more of the following reasons:
It has an extreme value for at least one variable.
Its value of the dependent variable is much larger or smaller than predicted by the regression line, and its residual is abnormally large in magnitude.
An example of this type of outlier is shown below.
Outliers (slide 2 of 2)
Its residual is not only large in magnitude, but this point "tilts" the regression line toward it.
This type of outlier is called an influential point.
An example of this type of outlier is shown below, on the left.
Its values of individual explanatory variables are not extreme, but they fall outside the general pattern of the other observations.
An example of this type of outlier is shown below, on the right.
In most cases, the regression output will look "nicer" if you delete the outliers, but this does not mean they should automatically be deleted; an outlier should be removed only if there is a good reason to do so.
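An illustrative screening sketch for these kinds of points, assuming a fitted statsmodels OLS results object (the cutoffs used are common rules of thumb, not rules from the slides):

```python
import numpy as np

def flag_unusual_points(results):
    """Rule-of-thumb screen for outliers and influential points, given a
    fitted statsmodels OLS results object (e.g., from the earlier sketches)."""
    influence = results.get_influence()
    student_resid = influence.resid_studentized_external  # unusual Y given the Xs
    leverage = influence.hat_matrix_diag                   # unusual X values
    cooks_d = influence.cooks_distance[0]                  # overall influence

    n = int(results.nobs)
    big_resid = np.where(np.abs(student_resid) > 2)[0]
    high_leverage = np.where(leverage > 2 * (results.df_model + 1) / n)[0]
    influential = np.where(cooks_d > 4 / n)[0]

    print("Large residuals (rows):", big_resid)
    print("High leverage (rows):", high_leverage)
    print("Influential by Cook's distance (rows):", influential)
```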
Example 11.4: Bank Salaries.xlsx (slide 1 of 2)
Objective: To identify any outliers in the bank salary data and to see to what extent they affect the regression model.
Solution: Look for outliers in scatterplots of individual variables or scatterplots of the residuals versus the fitted values.
Example 11.4: Bank Salaries.xlsx (slide 2 of 2)
Then run the regression with and without the outlier. The output with the outlier included is shown on the top right; the output with the outlier excluded is shown on the bottom right.
Violations of Regression Assumptions
Two issues arise in connection with violations of regression assumptions:
How to detect violations of the assumptions.
This is usually relatively easy, using scatterplots, histograms, time series graphs, and numerical measures.
What goes wrong if the violations are ignored.
This depends on the type of violation and its severity.