ADDITIONAL TOPICS IN MULTIPLE REGRESSION AND CORRELATION
3. When an independent variable is added or removed, the partial regression
One way to deal with multicollinearity is to avoid having a set of independent variables in which two or more variables are highly correlated with each other. In the fast-food example used throughout this chapter, the inclusion of a third independent variable representing the number of vehicles using the drive-through would likely cause multicollinearity problems, because x2 = drive-through sales would probably be highly correlated with x3 = drive-through traffic count; each variable is essentially describing the “bigness” of the drive-through aspect of the restaurant.
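One simple screen for this situation is to compute the correlation between each pair of independent variables before fitting the equation. The sketch below does this in plain Python for the check and diners values listed in Exercise 16.52. The function name and the 0.8 cutoff are assumptions made for illustration, not values given in this chapter.

```python
from math import sqrt


def correlation(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


# Independent variables from Exercise 16.52 (x1 = check, x2 = diners).
check = [40, 15, 30, 25, 50, 20, 35, 10]
diners = [2, 1, 3, 4, 4, 5, 5, 2]

CUTOFF = 0.8  # hypothetical threshold for "highly correlated" (an assumption)

r = correlation(check, diners)
flagged = abs(r) > CUTOFF
print(f"r(check, diners) = {r:.3f}; possible multicollinearity: {flagged}")
```

A pairwise check of this kind can miss multicollinearity that involves three or more variables at once, so it is best treated as a first look only.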
In Chapter 17, we will discuss stepwise regression, a regression model in which the independent variables enter the regression analysis one at a time. The first x variable to enter is the one that explains the greatest amount of variation in y. The second x variable to enter will be the one that explains the greatest amount of the remaining variation in y, and so on. At each “step,” the computer decides which, if any, of the remaining x variables should be brought in. Every step results in a new regression equation and an updated printout of this equation and its analysis. The general idea of stepwise regression is the balancing act of trying to (1) explain the most possible variation in y, while (2) using the fewest possible x variables.
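The order of entry described above can be sketched in plain Python. This is a simplified illustration, not the procedure presented in Chapter 17: it keeps adding variables until none remain, whereas a statistical package would also apply a significance criterion before admitting a variable. The helper names are hypothetical, and the data are the tip, check, and diners values from Exercise 16.52.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def error_sum_of_squares(columns, y):
    """SSE of the least-squares fit of y on the given x columns plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    b = solve(xtx, xty)  # normal equations (X'X) b = X'y
    return sum((yi - sum(bj * rj for bj, rj in zip(b, r))) ** 2
               for r, yi in zip(rows, y))


def forward_selection(candidates, y):
    """Return [(variable name, R^2 after it enters), ...] in order of entry."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)
    chosen, remaining, steps = [], list(candidates), []
    while remaining:
        # Enter the variable that leaves the least unexplained variation in y.
        best = min(remaining, key=lambda name: error_sum_of_squares(
            [candidates[c] for c in chosen + [name]], y))
        chosen.append(best)
        remaining.remove(best)
        sse = error_sum_of_squares([candidates[c] for c in chosen], y)
        steps.append((best, 1 - sse / sst))
    return steps


# Data from Exercise 16.52: y = tip, candidate x variables = check and diners.
tip = [7.5, 0.5, 2.0, 3.5, 9.5, 2.5, 3.5, 1.0]
candidates = {
    "check": [40, 15, 30, 25, 50, 20, 35, 10],
    "diners": [2, 1, 3, 4, 4, 5, 5, 2],
}
steps = forward_selection(candidates, tip)
for name, r2 in steps:
    print(f"entered {name}: R^2 = {r2:.3f}")
```

Each pass refits the equation with every remaining candidate and keeps the one that reduces the unexplained variation the most, which mirrors the "greatest amount of the remaining variation" rule in the text.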
16.46 What is a dummy variable, and how is it useful to multiple regression? Give an example of three dummy variables that could be used in describing your home town.
16.47 A multiple regression equation has been developed for y = daily attendance at a community swimming pool, x1 = temperature (degrees Fahrenheit), and x2 = weekend versus weekday (x2 = 1 for Saturday and Sunday, and 0 for the other days of the week). For the regression equation shown below, interpret each partial regression coefficient.
$\hat{y} = 100 + 8x_1 + 150x_2$
16.48 For the regression equation in Exercise 16.47, the estimated number of persons swimming on a zero-degree weekday would be 100 persons. Since this level of pool attendance is unlikely on such a day, does this mean that an error was made in constructing the regression equation? Explain.
16.49 What is multicollinearity, and how can it adversely affect multiple regression analysis? How can we tell whether multicollinearity is present?
( DATA SET ) Note: Exercises 16.50 and 16.51 require a computer and statistical software.
16.50 For 12 recent clients, a weight-loss clinic has collected the following data. (Session is coded as 1 for day, 0 for evening. Gender is coded as 1 for male, 0 for female.) Using a computer statistical package, carry out a multiple regression analysis, interpret the partial regression coefficients, and discuss the results of each significance test.
Client   Pounds Lost   Months as a Client   Session   Gender
1        31            5                    1         1
2        49            8                    1         1
3        12            3                    1         0
4        26            9                    0         0
5        34            8                    0         1
6        11            2                    0         0
7        4             1                    0         1
8        27            8                    0         1
9        12            6                    1         1
10       28            9                    1         0
11       41            6                    0         0
12       16            6                    0         0
16.51 A safety researcher has measured the speed of automobiles passing his vantage point near an interstate highway. He has also recorded the number of occupants and whether the driver was wearing a seat belt. The data are shown in the following table with seat belt users coded as 1, nonusers as 0. Using a computer statistical package, carry out a multiple regression analysis on these data, interpret the partial regression coefficients, and discuss the results of each significance test.
y = speed (mph):    61  63  55  59  59  52  61  51  57  55  54  49  61  73
x1 = occupants:      2   1   2   1   3   1   2   2   2   2   3   1   1   1
x2 = belt usage:     1   1   1   0   0   1   0   1   1   1   0   1   1   0
EXERCISES
SUMMARY
• Multiple regression model
Unlike its counterpart in Chapter 15, multiple regression and correlation analysis considers two or more independent (x) variables. The model is of the form $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \epsilon_i$ and, for a given set of x values, the expected value of y is given by the regression equation, $E(y) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki}$. For each x term in the estimation equation, the corresponding $\beta$ is referred to as the partial regression coefficient. Each $\beta_i$ (i = 1, 2, ..., k) is a slope relating changes in E(y) to changes in one x variable when all other x’s are held constant. The multiple standard error of estimate expresses the amount of dispersion of data points about the regression equation.
• Estimation and computer assistance
Calculations involving multiple regression and correlation analysis are tedious and best left to a computer package such as Excel or Minitab. The chapter includes formulas for determining the approximate confidence interval for the conditional mean of y as well as the approximate prediction interval for an individual y value, given a set of x values. The exact intervals can be obtained using Excel or Minitab.
• Coefficient of multiple determination
Multiple correlation measures the strength of the relationship between the dependent variable and the set of independent variables. The coefficient of multiple determination ($R^2$) is the proportion of the variation in y that is explained by the regression equation.
• Significance testing and residual analysis
The overall significance of the regression equation can be tested through the use of analysis of variance, while a t-test is used in testing the significance of each partial regression coefficient. The results of these tests are typically included when data have been analyzed using a computer statistical package. Most computer packages can also provide a summary of the residuals, which may be analyzed further in testing the regression model and identifying data points especially distant from the regression equation.
• Dummy variables and multicollinearity
A dummy variable has a value of either one or zero, depending on whether a given characteristic is present, and it allows qualitative data to be included in the analysis. If two or more independent variables are highly correlated with each other, multicollinearity is present, and the partial regression coefficients may be unreliable. Multicollinearity is not a problem when the only purpose of the equation is to predict the value of y.
• Stepwise regression
In stepwise regression, independent variables enter the equation one at a time, with the first one entered being the x variable that explains the greatest amount of the variation in y. At each step, the variable introduced is the one that explains the greatest amount of the remaining variation in y. Stepwise regression will be discussed more completely in Chapter 17.
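To see how a dummy variable behaves inside an estimation equation, the pool-attendance equation from Exercise 16.47 can be evaluated directly. The short sketch below is only an illustration; the function name and the 80-degree day are assumptions.

```python
def estimated_attendance(temperature_f, is_weekend):
    """Evaluate y-hat = 100 + 8*x1 + 150*x2 from Exercise 16.47."""
    x2 = 1 if is_weekend else 0  # dummy variable: 1 = Saturday or Sunday, 0 = other days
    return 100 + 8 * temperature_f + 150 * x2


weekday = estimated_attendance(80, is_weekend=False)
weekend = estimated_attendance(80, is_weekend=True)
print(weekday, weekend, weekend - weekday)  # 740 890 150
```

Because x2 can only be 0 or 1, its coefficient acts as a fixed shift in the estimate at any given temperature, while the temperature slope is unchanged.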
EQUATIONS
The Multiple Regression Model
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \epsilon_i$$
where $y_i$ = a value of the dependent variable, y
$\beta_0$ = a constant
$x_{1i}, x_{2i}, \dots, x_{ki}$ = values of the independent variables, $x_1, x_2, \dots, x_k$
$\beta_1, \beta_2, \dots, \beta_k$ = partial regression coefficients for the independent variables, $x_1, x_2, \dots, x_k$
$\epsilon_i$ = random error, or residual
The Sample Multiple Regression Equation
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$
where $\hat{y}$ = the estimated value of the dependent variable, y, for a given set of values for $x_1, x_2, \dots, x_k$
k = the number of independent variables
$b_0$ = the y-intercept; this is the estimated value of y when all of the independent variables are equal to 0
$b_i$ = the partial regression coefficient for $x_i$
The Multiple Standard Error of Estimate
$$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - k - 1}}$$
where $y_i$ = each observed value of y in the data
$\hat{y}_i$ = the value of y that would have been estimated from the regression equation
n = the number of data points
k = the number of independent (x) variables
Approximate Confidence Interval for the Conditional Mean of y
$$\hat{y} \pm t \frac{s_e}{\sqrt{n}}$$
where $\hat{y}$ = the estimated value of y based on the set of x values provided
t = t value from the t distribution table for the desired confidence level and df = $n - k - 1$ (n = number of data points, k = number of x variables)
$s_e$ = the multiple standard error of estimate
Approximate Prediction Interval for an Individual y Observation
$$\hat{y} \pm t s_e$$
where $\hat{y}$ = the estimated value of y based on the set of x values provided
t = t value from the t distribution table for the desired confidence level and df = $n - k - 1$ (n = number of data points, k = number of x variables)
$s_e$ = the multiple standard error of estimate
Coefficient of Multiple Determination
$$R^2 = 1 - \frac{\text{Error variation, which is not explained by the regression equation}}{\text{Total variation in the y values}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \quad \text{or} \quad 1 - \frac{SSE}{SST}$$
As in Chapter 15, the total variation in y is the sum of the explained variation plus the unexplained (error) variation, or
Total variation in y values (SST) = Variation explained by the regression equation (SSR) + Variation not explained by the regression equation (SSE)
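The quantities defined above can be computed directly once a sample regression equation has been fitted. The following plain-Python sketch uses the tip data from Exercise 16.52. With k = 2 independent variables, the least-squares coefficients can be found from a 2 x 2 system of normal equations written in terms of deviations from the means. The function names and the illustrative party (a $30 check and 2 diners) are assumptions, and 2.571 is the t table value for 95% confidence with df = 5.

```python
from math import sqrt

# Data from Exercise 16.52: y = tip, x1 = check, x2 = diners.
tip = [7.5, 0.5, 2.0, 3.5, 9.5, 2.5, 3.5, 1.0]
check = [40, 15, 30, 25, 50, 20, 35, 10]
diners = [2, 1, 3, 4, 4, 5, 5, 2]


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def fit_two_predictors(y, x1, x2):
    """Least-squares b0, b1, b2 for exactly two independent variables."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2


n, k = len(tip), 2
b0, b1, b2 = fit_two_predictors(tip, check, diners)
fitted = [b0 + b1 * c + b2 * d for c, d in zip(check, diners)]

sse = sum((y - f) ** 2 for y, f in zip(tip, fitted))   # unexplained variation
ssr = sum((f - mean(tip)) ** 2 for f in fitted)        # explained variation
sst = cross(tip, tip)                                  # total variation
s_e = sqrt(sse / (n - k - 1))                          # multiple standard error of estimate
r_squared = 1 - sse / sst

t = 2.571  # t table value, 95% confidence, df = n - k - 1 = 5
y_hat = b0 + b1 * 30 + b2 * 2  # hypothetical party: $30 check, 2 diners
conf_interval = (y_hat - t * s_e / sqrt(n), y_hat + t * s_e / sqrt(n))
pred_interval = (y_hat - t * s_e, y_hat + t * s_e)

print(f"y-hat = {b0:.3f} + {b1:.4f} x1 + {b2:.4f} x2")
print(f"s_e = {s_e:.3f}, R^2 = {r_squared:.3f}")
print("approximate confidence interval:", conf_interval)
print("approximate prediction interval:", pred_interval)
```

The prediction interval is always the wider of the two, because it describes a single observation rather than a conditional mean. Both are the approximate versions given in this chapter; a statistical package reports the exact intervals.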
ANOVA Test for the Overall Significance of the Multiple Regression Equation
• Null hypothesis
$H_0$: $\beta_1 = \beta_2 = \dots = \beta_k = 0$ The regression equation is not significant.
• Alternative hypothesis
$H_1$: One or more of the $\beta_i$ (i = 1, 2, ..., k) $\neq 0$ The regression equation is significant.
• Test statistic
$$F = \frac{SSR/k}{SSE/(n - k - 1)}$$
where SSR = regression sum of squares
SSE = error sum of squares
n = number of data points
k = number of independent (x) variables
• Critical value of the F statistic
F for significance level specified and df (numerator) = k, and df (denominator) = $n - k - 1$
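As a small numerical illustration of the test statistic, the sketch below computes F from the two sums of squares. The values of SSR, SSE, n, and k are hypothetical numbers chosen for easy arithmetic, and the function name is an assumption. The critical value 4.10 is the F table entry for the 0.05 level with 2 and 10 degrees of freedom.

```python
def f_statistic(ssr, sse, n, k):
    """F = (SSR / k) / (SSE / (n - k - 1))."""
    return (ssr / k) / (sse / (n - k - 1))


# Hypothetical values, not taken from any data set in this chapter.
ssr, sse, n, k = 80.0, 20.0, 13, 2

f_value = f_statistic(ssr, sse, n, k)  # (80 / 2) / (20 / 10) = 20.0
f_critical = 4.10                      # F table: alpha = 0.05, df = (2, 10)
reject_h0 = f_value > f_critical
print(f_value, reject_h0)
```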
t-Test for the Significance of a Partial Regression Coefficient, $b_i$
• Null hypothesis
$H_0$: $\beta_i = 0$ The population coefficient is 0.
• Alternative hypothesis
$H_1$: $\beta_i \neq 0$ The population coefficient is not 0.
• Test statistic
$$t = \frac{b_i - 0}{s_{b_i}}$$
where $b_i$ = the observed value of the regression coefficient
$s_{b_i}$ = the estimated standard deviation of $b_i$
Notes
1. We can also test $H_0$: $\beta_i = \beta_{i0}$ versus $H_1$: $\beta_i \neq \beta_{i0}$, where $\beta_{i0}$ is any value of interest. With the exception of $\beta_{i0}$ replacing zero, the test is the same as described here.
2. Critical values of t are $\pm t$ for the level of significance desired and df = $n - k - 1$ (n = number of data points, k = number of independent variables).
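The test can be sketched in a few lines of Python. The coefficient, its standard deviation, and the sample size below are hypothetical values used only for illustration, and the function name is an assumption. The critical value 2.228 is the t table entry for a two-tail test at the 0.05 level with df = 10.

```python
def t_statistic(b_i, s_b_i, beta_i0=0.0):
    """t = (b_i - beta_i0) / s_b_i; beta_i0 = 0 gives the usual test."""
    return (b_i - beta_i0) / s_b_i


# Hypothetical values: n = 13 data points and k = 2 independent variables, so df = 10.
b_i, s_b_i = 8.0, 2.5

t_value = t_statistic(b_i, s_b_i)   # 8.0 / 2.5 = 3.2
t_critical = 2.228                  # t table: two-tail, alpha = 0.05, df = 10
reject_h0 = abs(t_value) > t_critical
print(t_value, reject_h0)
```

Passing a nonzero `beta_i0` covers the case in Note 1, where the hypothesized value is something other than zero.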
Confidence Interval for a Partial Regression Coefficient, $\beta_i$
The interval is $b_i \pm t(s_{b_i})$
where $b_i$ = partial regression coefficient in the sample regression equation
t = t value for the confidence level desired, and df = $n - k - 1$
$s_{b_i}$ = estimated standard deviation of $b_i$
Note: Excel provides these confidence intervals as part of the regression analysis printout.
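For readers working without a printout, the interval is simple to compute by hand or in code. The sketch below uses hypothetical values for the coefficient and its standard deviation (they do not come from any data set in this chapter), and the function name is an assumption. The value 2.228 is the t table entry for 95% confidence with df = 10.

```python
def coefficient_interval(b_i, s_b_i, t):
    """Return (lower, upper) limits of b_i +/- t * s_b_i."""
    margin = t * s_b_i
    return b_i - margin, b_i + margin


# Hypothetical values: b_i = 8.0, s_b_i = 2.5, df = n - k - 1 = 10.
lower, upper = coefficient_interval(8.0, 2.5, t=2.228)
print(f"95% interval: {lower:.2f} to {upper:.2f}")
```

If the interval does not include zero, the corresponding two-tail t-test at the matching significance level would reject the hypothesis that the population coefficient is 0.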
( DATA SET ) Note: Exercises 16.52–16.58 require a computer and statistical software.
16.52 Interested in the possible relationship between the size of his tip versus the size of the check and the number of diners in the party, a food server has recorded the following for a sample of 8 checks:
Observation Number   y = Tip   x1 = Check   x2 = Diners
1                    $7.5      $40          2
2                    0.5       15           1
3                    2.0       30           3
4                    3.5       25           4
5                    9.5       50           4
6                    2.5       20           5
7                    3.5       35           5
8                    1.0       10           2
a. Determine the multiple regression equation and interpret the partial regression coefficients.
b. What is the estimated tip amount for 3 diners who have a $40 check?
c. Determine the 95% prediction interval for the tip left by a dining party like the one in part (b).
d. Determine the 95% confidence interval for the mean tip left by all dining parties like the one in part (b).
e. Determine the 95% confidence interval for the partial regression coefficients, $\beta_1$ and $\beta_2$.
f. Interpret the significance tests in the computer printout.
g. Analyze the residuals. Does the analysis support the applicability of the multiple regression model to these data?
16.53 Annual per-capita consumption of all fresh fruits versus that of apples and grapes from 1998 through 2003 was as shown in the table. Source: Bureau of the Census, Statistical Abstract of the United States 2006, p. 137.
Year   y = All Fresh Fruits   x1 = Apples   x2 = Grapes
       (lb/person)            (lb/person)   (lb/person)
1998   128.9                  19.0          7.1
1999   129.8                  18.7          8.1
2000   128.0                  17.6          7.4
2001   125.7                  15.8          7.7
2002   126.9                  16.2          8.7
2003   126.7                  16.7          7.5
a. Determine the multiple regression equation and interpret the partial regression coefficients.
b. What is the estimated per-capita consumption of all fresh fruits during a year when 17 pounds of apples and 6 pounds of grapes are consumed per person?
c. Determine the 95% prediction interval for per-capita consumption of fresh fruits during a year like the one described in part (b).
d. Determine the 95% confidence interval for mean per-capita consumption of fresh fruits during years like the one described in part (b).
e. Determine the 95% confidence interval for the population partial regression coefficients, $\beta_1$ and $\beta_2$.
f. Interpret the significance tests in the computer printout.
g. Analyze the residuals. Does the analysis support the applicability of the multiple regression model to these data?
16.54 A university placement director is interested in the effect that grade point average (GPA) and the number of university activities listed on the résumé might have on the starting salaries of this year’s graduating class. He has collected these data for a sample of 10 graduates:
Graduate   y = Starting Salary (Thousands)   x1 = Grade Point Average   x2 = Number of Activities
1          $40                               3.2                        2
2          46                                3.6                        5
3          38                                2.8                        3
4          39                                2.4                        4
5          37                                2.5                        2
6          38                                2.1                        3
7          42                                2.7                        3
8          37                                2.6                        2
9          44                                3.0                        4
10         41                                2.9                        3
a. Determine the multiple regression equation and interpret the partial regression coefficients.
b. Dave has a 3.6 grade point average and 3 university activities listed on his résumé. What would be his estimated starting salary?
c. Determine the 95% prediction interval for the starting salary of the student described in part (b).
d. Determine the 95% confidence interval for the mean starting salary for all students like the one described in part (b).
e. Determine the 95% confidence interval for the population partial regression coefficients, $\beta_1$ and $\beta_2$.
f. Interpret the significance tests in the computer printout.
g. Analyze the residuals. Does the analysis support the applicability of the multiple regression model to these data?
CHAPTER EXERCISES
16.55 An admissions counselor, examining the usefulness of x1 = total SAT score and x2 = high school class rank in predicting y = freshman grade point average (GPA) at her university, has collected the sample data shown here and listed in file XR16055. Rank in high school class is expressed as a cumulative percentile, i.e., 100.0% reflects the top of the class.
Freshman GPA   SAT, Total   High School Rank      Freshman GPA   SAT, Total   High School Rank
2.66           1153         61.5                  2.45           1136         45.5
2.10           1086         84.5                  2.50           966          60.2
3.33           1141         92.0                  2.29           1023         74.0
3.85           1237         94.4                  2.24           976          86.4
2.51           1205         89.5                  1.81           1066         73.0
3.22           1205         97.0                  2.99           1076         55.2
2.92           1163         95.9                  3.14           1152         72.1
1.95           1121         64.1                  1.86           955          51.0
a. Determine the multiple regression equation and interpret the partial regression coefficients.
b. What is the estimated freshman GPA for a student who scored 1100 on the SAT exam and had a cumulative class rank of 80%?
c. Determine the 95% prediction interval for the GPA of the student described in part (b).
d. Determine the 95% confidence interval for the mean GPA of all students similar to the one described in part (b).
e. Determine the 95% confidence interval for the population partial regression coefficients, $\beta_1$ and $\beta_2$.
f. Interpret the significance tests in the computer printout.
g. Analyze the residuals. Does the analysis support the applicability of the multiple regression model to these data?
16.56 Data file XR16056 lists the following information for a sample of local homes sold recently: selling price (dollars), lot size (acres), living area (square feet), and whether the home is air conditioned (1 = A/C, 0 = no A/C).
a. Determine the multiple regression equation and interpret the partial regression coefficients.
b. What is the estimated selling price for a house sitting on a 0.9-acre lot, with 1800 square feet of living area and central air conditioning?
c. Determine the 95% prediction interval for the selling price of the house described in part (b).
d. Determine the 95% confidence interval for the mean selling price of all houses like the one described in part (b).
e. Determine the 95% confidence interval for the population partial regression coefficients, $\beta_1$, $\beta_2$, and $\beta_3$.
f. Interpret the significance tests in the computer printout.
g. Analyze the residuals. Does the analysis support the applicability of the multiple regression model to these data?
16.57 In Exercise 16.56, what would be the estimated selling price of a house occupying a 0.01-acre lot, with 100 square feet of living area and no central air conditioning? Considering the nature of this “house,” does this selling price seem reasonable? If not, might the computer have made an error in obtaining the regression equation? Explain.
16.58 Data file XR16058 lists the following data for a sample of automatic teller machine (ATM) customers: time (seconds) to complete their transaction, estimated age, and gender (1 = male, 0 = female).
a. Determine the multiple regression equation and interpret the partial regression coefficients.
b. What is the estimated time required by a female customer who is 45 years of age?
c. Determine the 95% prediction interval for the time required by the customer described in part (b).
d. Determine the 95% confidence interval for the mean time required by customers similar to the one described in part (b).
e. Determine the 95% confidence interval for the population partial regression coefficients, $\beta_1$ and $\beta_2$.
f. Interpret the significance tests in the computer printout.
g. Analyze the residuals. Does the analysis support the applicability of the multiple regression model to these data?
Thorndike Sports Equipment
For several years, Thorndike Sports Equipment has been a minority stockholder in the Snow Kingdom Ski Resort and Conference Center. The Thorndikes visit Snow Kingdom several times each winter to meet with management and find out how business is going. In addition, the visits give them a chance for informal discussions with customers and potential customers of Thorndike ski clothing and equipment. Luke claims that many good product ideas have been inspired by a warm drink in the Snow Kingdom lodge.
On their current visit, Ted and Luke are asked by Snow Kingdom management to lend a hand in analyzing some data that might help in predicting how many customers to expect on any given day. Overall business has been rather steady over the past several years, but the daily customer count seems to have ups and downs. The people at Snow Kingdom are curious as to what factors might be causing the seemingly random levels of patronage from one day to the next.
In response to Ted’s request, management supplies data for a random sample of 30 days over the past two seasons.
The information includes the number of skiers, the high temperature (degrees Fahrenheit), the number of inches of snow on the ground at noon, and whether the day fell on a weekend (1 = weekend, 0 = weekday). Data are shown here and are also in file THORN16.
Using multiple regression and correlation analysis, do you think this information appears to be helpful in explaining the level of daily patronage at Snow Kingdom?
Help Ted in interpreting the associated computer printout for an upcoming presentation to the Snow Kingdom management.
Springdale Shopping Survey
This case involves the variables shown here and described in data file SHOPPING. Information like that gained from this case could provide management with useful insights into how two or more variables (including dummy variables) can help describe consumers’ overall attitude toward a shopping area.
• Variables 7–9 Overall attitude toward the shopping area. The highest possible rating is 5. In each analysis, one of these will be the dependent variable.
• Variables 18–25 The importance (highest rating = 7) placed on each of 8 attributes that a shopping area might possess. In each analysis, these will be among the independent variables.
• Variables 26 and 28 The respondent’s gender and marital status. These independent variables will be dummy variables. Recode them so that (1 = male, 0 = female) and (1 = married, 0 = single or other).
The necessary commands will vary, depending on your computer statistical package. Be careful not to save the recoded database, because the original values will be overwritten if you do.
1. With variable 7 (attitude toward Springdale Mall) as the dependent variable, perform a multiple regression analysis using variables 21 (good variety of sizes/styles), 22 (sales staff helpful/friendly), 26 (gender), and 28 (marital status) as the four independent variables. If possible, have the residuals and the predicted y values retained for later analysis.
a. Interpret the partial regression coefficient for variable 26 (gender). At the 0.05 level, is it significantly different from zero? If so, what does this say about the respective attitudes of males versus females toward Springdale Mall? Interpret the other partial regression coefficients and the results of the significance test for each.
b. At the 0.05 level, is the overall regression equation significant? At exactly what p-value is the equation significant?
c. What percentage of the variation in y is explained by the regression equation? Explain this percentage in terms of the analysis of variance table that accompanies the printout.
Snow Kingdom data (file THORN16)
Day   Skiers   Weekend   Snow (Inches)   Temperature (Degrees)      Day   Skiers   Weekend   Snow (Inches)   Temperature (Degrees)
1     402      0         22              24                         16    648      0         14              8
2     337      0         17              25                         17    540      0         17              11
3     471      0         17              28                         18    614      1         36              6
4     610      0         39              21                         19    796      1         25              29
5     620      0         11              18                         20    477      0         13              5
6     545      1         24              17                         21    532      0         35              30
7     523      0         25              29                         22    732      0         36              15
8     563      0         34              17                         23    618      0         18              11
9     873      1         18              11                         24    728      1         19              28
10    358      0         28              6                          25    620      0         29              8
11    568      0         14              9                          26    551      0         18              6
12    453      0         12              18                         27    816      0         31              13
13    485      0         27              27                         28    765      0         24              19
14    767      1         37              16                         29    650      1         22              27
15    735      1         12              6                          30    732      0         11              24
Sam Easton started out as a real estate agent in Atlanta ten years ago. After working two years for a national real estate firm, he transferred to Dallas, Texas, and worked for another realty agency. His friends and relatives convinced him that with his experience and knowledge of the real estate business, he should open his own agency. He eventually acquired his broker’s license and before long started his own company, Easton Realty Company, in Fort Worth, Texas.
Two salespeople at the previous company agreed to follow him to the new company. Easton currently has eight real estate agents working for him. Before the real estate slump, the combined residential sales for Easton Realty amounted to approximately $15 million annually.
Recently, the Dallas–Fort Worth (DFW) metroplex and the state of Texas have suffered economic problems from several sources. Much of the wealth in Texas was generated by the oil industry, but the oil industry has fallen on hard times in recent years. Many savings and loan (S&L) institutions loaned large amounts of money to the oil industry and to commercial and residential construction. As the oil industry fell off and the economy weakened, many S&Ls found themselves in difficulty as a result of poor real estate investments and the soft real estate market that was getting worse with each passing month. With the lessening of the Cold War, the federal government closed several military bases across the country, including two in the DFW area. Large government contractors, such as General Dynamics, had to trim down their operations and lay off many workers. This added more pressure to the real estate market by putting more houses on an already saturated market. Real estate agencies found it increasingly difficult to sell houses.
Two days ago, Sam Easton received a special delivery letter from the president of the local Board of Realtors. The board had received complaints from two people who had listed and sold their homes through Easton Realty in the past month. The president of the Board of Realtors was informing Sam of these complaints and giving him the opportunity to respond. Both complaints were triggered by a recent article on home sales that appeared in one of the local newspapers. The article contained the table shown below.
Typical Home Sale, DFW Metroplex
Average Sales Price   $104,250
Average Size          1860 sq. ft.
Note: Includes all homes sold in the Dallas–Fort Worth metroplex over the past 12 months.
The two sellers charged that Easton Realty Company had underpriced their homes in order to accelerate the sales.
The first house is located outside of the DFW area, is four years old, has 2190 square feet, and sold for $88,500. The second house is located in Fort Worth, is nine years old, has 1848 square feet, and sold for $79,500. Both houses in question are three-bedroom houses. Both sellers believe that they would have received more money for their houses if Easton Realty had priced them at their true market value.
Sam knew from experience that people selling their homes invariably overestimate the value. Most sellers believe they could have gotten more money from the sale of their homes. But Sam also knew that his agents would not intentionally underprice houses. However, in these bad economic times, many real estate companies, including Easton Realty, had large inventories of houses for sale and needed to make sales. One quick way to unload these houses is to underprice them. On a residential sale, an agent working under a real estate broker typically makes about 3% of the sales price if he originally listed the property. Dropping the
Easton Realty Company (A)
d. If possible with your computer package, generate a plot of the residuals (vertical axis) versus each of the independent variables (horizontal axis). Evaluate each plot in terms of whether patterns exist that could weaken the validity of the regression model.
e. If possible with your computer statistical package, use the normal probability plot to examine the residuals.
2. Repeat question 1, but with variable 8 (attitude toward Downtown) as the dependent variable.
3. Repeat question 1, but with variable 9 (attitude toward West Mall) as the dependent variable.
4. Compare the regression equations obtained in questions 1, 2, and 3. For which one of the shopping areas does this set of independent variables seem to do the best job of predicting shopper attitude? Explain your reasoning.
BUSINESS CASE