To solidify your understanding of the six basic steps of applied regression analysis, let’s work through a complete regression example. Suppose that you’ve been hired to determine the best location for the next Woody’s restaurant, where Woody’s is a moderately priced, 24-hour, family restaurant chain.⁶ You decide to build a regression model to explain the gross sales volume at each of the restaurants in the chain as a function of various descriptors of the location of that branch. If you can come up with a sound equation to explain gross sales as a function of location, then you can use this equation to help Woody’s decide where to build their newest eatery. Given data on
4. The standard error of the coefficient is discussed in more detail in Section 4.2; the t-value is developed in Section 5.2.
5. For example, the Journal of Money, Credit, and Banking and the American Economic Review have requested authors to submit their actual data sets so that regression results can be verified. See W. G. Dewald et al., “Replication in Empirical Economics,” American Economic Review, Vol. 76, No. 4, pp. 587–603 and Daniel S. Hamermesh, “Replication in Economics,” NBER Working Paper 13026, April 2007.
6. The data in this example are real (they’re from a sample of 33 Denny’s restaurants in Southern California), but the number of independent variables considered is much smaller than was used in the actual research. Datafile = WOODY3.
M03_STUD2742_07_SE_C03.indd 73 1/4/16 6:10 PM
74 ChAPtER 3 Learning to Use regression anaLysis
land costs, building costs, and local building and restaurant municipal codes, the owners of Woody’s will be able to make an informed decision.
1. Review the literature and develop the theoretical model. You do some reading about the restaurant industry, but your review of the literature consists mainly of talking to various experts within the firm. They give you some good ideas about the attributes of a successful Woody’s location. The experts tell you that all of the chain’s restaurants are identical (indeed, this is sometimes a criticism of the chain) and that all the locations are in what might be called “suburban, retail, or residential” environments (as distinguished from central cities or rural areas, for example). Because of this, you realize that many of the reasons that might help explain differences in sales volume in other chains do not apply in this case because all the Woody’s locations are similar. (If you were comparing Woody’s to another chain, such variables might be appropriate.)
In addition, discussions with the people in the Woody’s strategic planning department convince you that price differentials and consumption differences between locations are not as important as the number of customers a particular location attracts. This causes you concern for a while because the variable you had planned to study originally, gross sales volume, would vary as prices changed between locations. Since your company controls these prices, you feel that you would rather have an estimate of the “potential” for such sales. As a result, you decide to specify your dependent variable as the number of customers served (measured by the number of checks or bills that the servers handed out) in a given location in the most recent year for which complete data are available.
2. Specify the model: Select the independent variables and the functional form.
Your discussions lead to a number of suggested variables. After a while, you realize that there are three major determinants of sales (customers) on which virtually everyone agrees. These are the number of people who live near the location, the general income level of the location, and the number of direct competitors close to the location. In addition, there are two other good suggestions for potential explanatory variables. These are the number of cars passing the location per day and the number of months that the particular restaurant has been open.
After some serious consideration of your alternatives, you decide not to include the last two possibilities. All the locations have been open long enough to have achieved a stable clientele. In addition, it would be very expensive to collect data on the number of passing cars for all the locations. Should population prove to be a poor measure of the available customers in a location, you’ll have to decide whether to ask your boss for the money to collect complete traffic data.
75 Using regression anaLysis to pick restaUrant Locations
The exact definitions of the independent variables you decide to include are:
N = Competition: the number of direct market competitors within a two-mile radius of the Woody’s location
P = Population: the number of people living within a three-mile radius of the Woody’s location
I = Income: the average household income of the population measured in variable P
Since we have yet to develop any functional forms other than a linear functional form and a typical stochastic error term, that’s what you decide to use.
3. Hypothesize the expected signs of the coefficients. After thinking about which variables to include, you expect hypothesizing signs will be easy.
For two of the variables, you’re right. Everyone expects that the more competition there is, the fewer customers (holding constant the population and income of an area) there will be, and also that the more people there are who live near a particular restaurant, the more customers (holding constant the competition and income) the restaurant will have. You expect that the greater the income is in a particular area, the more people will choose to eat in a family restaurant. However, people in especially high-income areas might want to eat in a restaurant that has more “atmosphere” than a family restaurant like Woody’s.
As a result, you worry that the income variable might be only weakly positive in its impact. To sum up, you expect:
$$
Y_i = \beta_0 + \overset{-}{\beta_N} N_i + \overset{+}{\beta_P} P_i + \overset{+?}{\beta_I} I_i + \epsilon_i \tag{3.3}
$$

where the signs above the coefficients indicate the expected impact of that particular independent variable on the dependent variable, holding constant the other two explanatory variables, and $\epsilon_i$ is a typical stochastic error term.
4. Collect the data. Inspect and clean the data. You want to include every local restaurant in the Woody’s chain in your study, and, after some effort, you come up with data for your dependent variable and your independent variables for all 33 locations. You inspect the data, and you’re confident that the quality of your data is excellent for three reasons: each manager measured each variable identically, you’ve included each restaurant in the sample, and all the information is from the same year. [The data set is included in this section, along with a sample computer output for the regression estimated by Stata (Tables 3.1 and 3.2).]
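The kind of inspection described in this step can be sketched in a few lines of Python. This is only an illustration and not part of the original study: it uses the first five observations from Table 3.1, and the helper name `summarize` is hypothetical. Looking at the minimum, mean, and maximum of each variable is a quick way to spot typos or implausible values before estimating anything.

```python
# First five observations from Table 3.1, in the order (y, n, p, i).
# A real check would load all 33 rows; this subset is for illustration only.
rows = [
    (107919, 3, 65044, 13240),
    (118866, 5, 101376, 22554),
    (98579, 7, 124989, 16916),
    (122015, 2, 55249, 20967),
    (152827, 3, 73775, 19576),
]
names = ("y", "n", "p", "i")


def summarize(rows, names):
    """Return {name: (minimum, mean, maximum)} for each column (hypothetical helper)."""
    summary = {}
    for col, name in enumerate(names):
        values = [row[col] for row in rows]
        summary[name] = (min(values), sum(values) / len(values), max(values))
    return summary


# Flag rows with a missing (None) or negative entry; none are expected here.
bad_rows = [k for k, row in enumerate(rows, start=1)
            if any(v is None or v < 0 for v in row)]

summary = summarize(rows, names)
for name in names:
    low, mean, high = summary[name]
    print(f"{name}: min={low} mean={mean:.1f} max={high}")
print("rows needing attention:", bad_rows)
```

A value far outside the range of the others (for example, an income figure with an extra digit) would show up immediately in the minimum or maximum.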
Table 3.1 Data for the Woody’s Restaurant Example (Using the Stata Program)
y n p i
1. 107919 3 65044 13240
2. 118866 5 101376 22554
3. 98579 7 124989 16916
4. 122015 2 55249 20967
5. 152827 3 73775 19576
6. 91259 5 48484 15039
7. 123550 8 138809 21857
8. 160931 2 50244 26435
9. 98496 6 104300 24024
10. 108052 2 37852 14987
11. 144788 3 66921 30902
12. 164571 4 166332 31573
13. 105564 3 61951 19001
14. 102568 5 100441 20058
15. 103342 2 39462 16194
16. 127030 5 139900 21384
17. 166755 6 171740 18800
18. 125343 6 149894 15289
19. 121886 3 57386 16702
20. 134594 6 185105 19093
21. 152937 3 114520 26502
22. 109622 3 52933 18760
23. 149884 5 203500 33242
24. 98388 4 39334 14988
25. 140791 3 95120 18505
26. 101260 3 49200 16839
27. 139517 4 113566 28915
28. 115236 9 194125 19033
29. 136749 7 233844 19200
30. 105067 7 83416 22833
31. 136872 6 183953 14409
32. 117146 3 60457 20307
33. 163538 2 65065 20111
(obs=33)
y n p i
y 1.0000
n -0.1442 1.0000
p 0.3926 0.7263 1.0000
i 0.5370 -0.0315 0.2452 1.0000
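The simple correlation coefficients printed beneath the data can be checked independently. The sketch below assumes the 33 observations of Table 3.1 have been typed in as tuples (y, n, p, i); the function name `corr` is an illustrative choice, not something taken from Stata.

```python
import math

# Table 3.1 observations as (y, n, p, i).
data = [
    (107919, 3, 65044, 13240), (118866, 5, 101376, 22554), (98579, 7, 124989, 16916),
    (122015, 2, 55249, 20967), (152827, 3, 73775, 19576), (91259, 5, 48484, 15039),
    (123550, 8, 138809, 21857), (160931, 2, 50244, 26435), (98496, 6, 104300, 24024),
    (108052, 2, 37852, 14987), (144788, 3, 66921, 30902), (164571, 4, 166332, 31573),
    (105564, 3, 61951, 19001), (102568, 5, 100441, 20058), (103342, 2, 39462, 16194),
    (127030, 5, 139900, 21384), (166755, 6, 171740, 18800), (125343, 6, 149894, 15289),
    (121886, 3, 57386, 16702), (134594, 6, 185105, 19093), (152937, 3, 114520, 26502),
    (109622, 3, 52933, 18760), (149884, 5, 203500, 33242), (98388, 4, 39334, 14988),
    (140791, 3, 95120, 18505), (101260, 3, 49200, 16839), (139517, 4, 113566, 28915),
    (115236, 9, 194125, 19033), (136749, 7, 233844, 19200), (105067, 7, 83416, 22833),
    (136872, 6, 183953, 14409), (117146, 3, 60457, 20307), (163538, 2, 65065, 20111),
]


def corr(xs, ys):
    """Simple (Pearson) correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


names = ("y", "n", "p", "i")
columns = {name: [row[k] for row in data] for k, name in enumerate(names)}

# Print the lower triangle, mirroring the layout of the matrix above.
for a_idx, a in enumerate(names):
    cells = [f"{corr(columns[a], columns[b]):8.4f}" for b in names[:a_idx + 1]]
    print(a, " ".join(cells))
```

Note the fairly high correlation between n and p: locations with more people nearby also tend to have more competitors, which is worth keeping in mind when interpreting the individual coefficients.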
Table 3.2 Actual Computer Output (Using the Stata Program)
Number of obs = 33
F(3, 29) = 15.65
Prob > F = 0.0000
R-squared = 0.6182
Adj R-squared = 0.5787
Root MSE = 14543

y Coef. Std. Err. t P>|t| [95% Conf. Interval]
n -9074.674 2052.674 -4.42 0.000 -13272.86 -4876.485
p .3546684 .0726808 4.88 0.000 .2060195 .5033172
i 1.287923 .5432938 2.37 0.025 .1767628 2.399084
_cons 102192.4 12799.83 7.98 0.000 76013.84 128371

Source SS df MS
Model 9.9289e+09 3 3.3096e+09
Residual 6.1333e+09 29 211492485
Total 1.6062e+10 32 501943246

y yhat residuals
1. 107919 115089.6 -7170.56
2. 118866 121821.7 -2955.74
3. 98579 104785.9 -6206.864
4. 122015 130642 -8627.041
5. 152827 126346.5 26480.55
6. 91259 93383.88 -2124.877
7. 123550 106976.3 16573.66
8. 160931 135909.3 25021.71
9. 98496 115677.4 -17181.36
10. 108052 116770.1 -8718.094
11. 144788 138502.6 6285.425
12. 164571 165550 -979.0342
13. 105564 121412.3 -15848.3
14. 102568 118275.5 -15707.47
15. 103342 118895.6 -15553.63
16. 127030 133978.1 -6948.114
17. 166755 132868.1 33886.91
18. 125343 120598.1 4744.898
19. 121886 116832.3 5053.7
20. 134594 137985.6 -3391.591
21. 152937 149717.6 3219.428
22. 109622 117903.5 -8281.508
23. 149884 171807.2 -21923.22
24. 98388 99147.65 -759.6514
25. 140791 132537.5 8253.518
26. 101260 114105.4 -12845.43
27. 139517 143412.3 -3895.303
28. 115236 113883.4 1352.599
29. 136749 146334.9 -9585.905
30. 105067 97661.88 7405.122
31. 136872 131544.4 5327.621
32. 117146 122564.5 -5418.45
33. 163538 133021 30517
5. Estimate and evaluate the equation. You take the data set and enter it into the computer. You then run an OLS regression on the data, but you do so only after thinking through your model once again to see if there are hints that you’ve made theoretical mistakes. You end up admitting that although you cannot be sure you are right, you’ve done the best you can, so you estimate the equation, obtaining:
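$$
\begin{array}{ccccc}
\hat{Y}_i = 102{,}192 & -\ 9075N_i & +\ 0.355P_i & +\ 1.288I_i & \qquad (3.4) \\
 & (2053) & (0.073) & (0.543) & \\
 & t = -4.42 & 4.88 & 2.37 &
\end{array}
$$

$$
N = 33 \qquad \bar{R}^2 = .579
$$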
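For readers who want to reproduce these estimates outside Stata, the following is a minimal sketch that assumes NumPy is available and that the 33 observations from Table 3.1 are entered as rows (y, n, p, i). It uses NumPy's general least-squares solver rather than any particular econometrics package, and it also compares the estimated signs with the expectations stated in Equation 3.3.

```python
import numpy as np

# Table 3.1 observations as (y, n, p, i).
data = np.array([
    (107919, 3, 65044, 13240), (118866, 5, 101376, 22554), (98579, 7, 124989, 16916),
    (122015, 2, 55249, 20967), (152827, 3, 73775, 19576), (91259, 5, 48484, 15039),
    (123550, 8, 138809, 21857), (160931, 2, 50244, 26435), (98496, 6, 104300, 24024),
    (108052, 2, 37852, 14987), (144788, 3, 66921, 30902), (164571, 4, 166332, 31573),
    (105564, 3, 61951, 19001), (102568, 5, 100441, 20058), (103342, 2, 39462, 16194),
    (127030, 5, 139900, 21384), (166755, 6, 171740, 18800), (125343, 6, 149894, 15289),
    (121886, 3, 57386, 16702), (134594, 6, 185105, 19093), (152937, 3, 114520, 26502),
    (109622, 3, 52933, 18760), (149884, 5, 203500, 33242), (98388, 4, 39334, 14988),
    (140791, 3, 95120, 18505), (101260, 3, 49200, 16839), (139517, 4, 113566, 28915),
    (115236, 9, 194125, 19033), (136749, 7, 233844, 19200), (105067, 7, 83416, 22833),
    (136872, 6, 183953, 14409), (117146, 3, 60457, 20307), (163538, 2, 65065, 20111),
], dtype=float)

y = data[:, 0]
# Design matrix: a column of ones for the intercept, then N, P, and I.
X = np.column_stack([np.ones(len(y)), data[:, 1:]])

# OLS chooses the coefficients that minimize the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

rss = float(residuals @ residuals)          # residual sum of squares
tss = float(((y - y.mean()) ** 2).sum())    # total sum of squares
n_obs, k = X.shape[0], X.shape[1] - 1       # 33 observations, 3 slopes
r2 = 1 - rss / tss
adj_r2 = 1 - (rss / (n_obs - k - 1)) / (tss / (n_obs - 1))

# Compare the estimated signs with those hypothesized in Equation 3.3.
expected_sign = {"N": -1, "P": +1, "I": +1}
signs_ok = all(np.sign(b) == expected_sign[name]
               for name, b in zip(("N", "P", "I"), coef[1:]))

print("intercept, N, P, I:", np.round(coef, 4))
print("R2 =", round(r2, 4), " adjusted R2 =", round(adj_r2, 4))
print("signs as expected:", signs_ok)
```

The adjusted value is the one reported beneath Equation 3.4; the unadjusted R² appears only in the full computer output.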
This equation satisfies your needs in the short run. In particular, the estimated coefficients in the equation have the signs you expected. The overall fit, although not outstanding, seems reasonable for such a diverse group of locations. To predict Y, you obtain the values of N, P, and I for each potential new location and then plug them into Equation 3.4.
Other things being equal, the higher the predicted Y, the better the loca- tion from Woody’s point of view.
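Plugging values into Equation 3.4 can be sketched as follows. The coefficients are the unrounded ones from Table 3.2; the function name `predict_customers` and the two candidate sites are hypothetical, invented purely to illustrate the comparison.

```python
# Coefficients from Table 3.2 (Equation 3.4 before rounding).
B0, BN, BP, BI = 102192.4, -9074.674, 0.3546684, 1.287923


def predict_customers(n, p, i):
    """Predicted number of customers (Y-hat) for a location with N=n, P=p, I=i."""
    return B0 + BN * n + BP * p + BI * i


# Sanity check against observation 1 of Table 3.1 (N=3, P=65044, I=13240):
# the result should match the yhat and residual listed in Table 3.2.
fitted_1 = predict_customers(3, 65044, 13240)
residual_1 = 107919 - fitted_1

# Hypothetical candidate sites as (N, P, I); these values are made up.
candidates = {
    "Site A": (2, 90000, 22000),
    "Site B": (6, 150000, 18000),
}
predictions = {name: predict_customers(*x) for name, x in candidates.items()}
best = max(predictions, key=predictions.get)

print(f"observation 1: yhat={fitted_1:.1f} residual={residual_1:.1f}")
for name, value in predictions.items():
    print(f"{name}: predicted customers = {value:.0f}")
print("highest predicted Y:", best)
```

In this made-up comparison, Site B has the larger population, but Site A's lighter competition and higher income give it the higher predicted Y.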
6. Document the results. The results summarized in Equation 3.4 meet our documentation requirements. (Note that we include the standard errors of the estimated coefficients and t-values⁷ for completeness, even though we won’t make use of them until Chapter 5.) However, it’s not easy for a beginning researcher to wade through a computer’s regression output to find all the numbers required for documentation. You’ll probably have an easier time reading your own computer system’s printout if you take the time to “walk through” the sample computer output for the Woody’s model in Tables 3.1–3.2. This sample output was produced by the Stata computer program, but it’s similar to those produced by EViews, SAS, SHAZAM, TSP, and others.
7. Throughout the text, the number in parentheses below a coefficient estimate typically will be the standard error of that estimated coefficient. Some authors put the t-value in parentheses, though, so be alert when reading journal articles or other books.
The first items listed are the actual data. These are followed by the simple correlation coefficients between all pairs of variables in the data set. Next comes a listing of the estimated coefficients, their estimated standard errors, and the associated t-values, followed by R², R̄² (the adjusted R²), RSS, the F-ratio, and other items that we will explain in later chapters. Finally, we have a listing of the observed Ys, the predicted Ys, and
79 dUMMy VariabLes
the residuals for each observation. Numbers followed by “e+06” or “e-01” are expressed in a scientific notation indicating that the printed decimal point should be moved six places to the right or one place to the left, respectively.
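As a rough cross-check on how the pieces of the printout fit together, the summary statistics can be recomputed from the sums of squares in Table 3.2. The sketch below is an illustration with variable names chosen here, and it relies on standard formulas that the text develops in later chapters. Because the printed sums of squares are rounded to five significant digits, the results agree with the printout only approximately.

```python
import math

# Sums of squares copied from Table 3.2. float() reads the scientific notation
# directly: "9.9289e+09" is 9.9289 with the decimal point moved 9 places right.
ess = float("9.9289e+09")   # Model (explained) sum of squares
rss = float("6.1333e+09")   # Residual sum of squares
tss = float("1.6062e+10")   # Total sum of squares
n_obs, k = 33, 3            # observations and slope coefficients

r2 = ess / tss
adj_r2 = 1 - (rss / (n_obs - k - 1)) / (tss / (n_obs - 1))
f_ratio = (ess / k) / (rss / (n_obs - k - 1))
root_mse = math.sqrt(rss / (n_obs - k - 1))

# Each t-value in the printout is the estimated coefficient divided by its
# standard error (the test of the hypothesis that the true coefficient is zero).
estimates = {
    "n": (-9074.674, 2052.674),
    "p": (0.3546684, 0.0726808),
    "i": (1.287923, 0.5432938),
}
t_values = {name: coef / se for name, (coef, se) in estimates.items()}

print(f"R2={r2:.4f} adjR2={adj_r2:.4f} F={f_ratio:.2f} RootMSE={root_mse:.0f}")
for name, t in t_values.items():
    print(f"t({name}) = {t:.2f}")
```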
In future sections, we’ll return to this example in order to apply various tests and ideas as we learn them.