It’s time to get some experience choosing independent variables. After all, every equation so far in the text has come with the specification already determined, but once you’ve finished this course you’ll have to make all such specification decisions on your own. In future chapters, we’ll use a technique called “interactive regression learning exercises” to allow you to make your own actual specification choices and get feedback on your choices. To start, though, let’s work through a specification together.
To keep things as simple as possible, we’ll begin with a topic near and dear to your heart—your GPA! Suppose a friend who attends a small liberal arts college surveys all 25 members of her econometrics class, obtains data on the variables listed here, and asks for your help in choosing a specification:
GPAi = the cumulative college grade point average of the ith student on a four-point scale
HGPAi = the cumulative high school grade point average of the ith student on a four-point scale
MSATi = the highest score earned by the ith student on the math section of the SAT test (800 maximum)
VSATi = the highest score earned by the ith student on the verbal section of the SAT test (800 maximum)
SATi = MSATi + VSATi
GREKi = a dummy variable equal to 1 if the ith student is a member of a fraternity or sorority, 0 otherwise
HRSi = the ith student’s estimate of the average number of hours spent studying per course per week in college
PRIVi = a dummy variable equal to 1 if the ith student graduated from a private high school, 0 otherwise
JOCKi = a dummy variable equal to 1 if the ith student is or was a member of a varsity intercollegiate athletic team for at least one season, 0 otherwise
lnEXi = the natural log of the number of full courses that the ith student has completed in college
Assuming that GPAi is the dependent variable, which independent variables would you choose? Before you answer, think through the possibilities carefully. What does the literature tell us on this subject? (Is there literature?) What are the expected signs of each of the coefficients? How strong is the theory behind each variable? Which variables seem obviously important? Which variables seem potentially irrelevant or redundant? Are there any other variables that you wish your friend had collected?
To get the most out of this example, you should take the time to write down the exact specification that you would run:
$$
GPA_i = f(?,\ ?,\ ?,\ ?,\ ?) + \epsilon
$$
It’s hard for most beginning econometricians to avoid the temptation of including all of these variables in a GPA equation and then dropping any variables that have insignificant t-scores. Even though we mentioned in the previous section that such a specification search procedure will result in biased coefficient estimates, most beginners don’t trust their own judgment and tend to include too many variables. With this warning in mind, do you want to make any changes in your proposed specification?
No? OK, let’s compare notes. We believe that grades are a function of a student’s ability, how hard the student works, and the student’s experience taking college courses. Consequently, our specification would be:
$$
GPA_i = \beta_0 + \overset{+}{\beta_1}\,HGPA_i + \overset{+}{\beta_2}\,HRS_i + \overset{+}{\beta_3}\,lnEX_i + \epsilon_i
$$
We can already hear you complaining! What about SATs, you say? Everyone knows they’re important. How about jocks and Greeks? Don’t they have lower GPAs? Don’t prep schools grade harder and prepare students better than public high schools?
Before we answer, it’s important to note that we think of specification choice as choosing which variables to include, not which variables to exclude. That is, we don’t assume automatically that a given variable should be included in an equation simply because we can’t think of a good reason for dropping it.
Given that, however, why did we choose the variables we did? First, we think that the best predictor of a student’s college GPA is his or her high school GPA. We have a hunch that once you know HGPA, SATs are redundant, at least at a liberal arts college¹⁰ where there are few multiple choice tests. In addition, we’re concerned that possible racial and gender bias in the SAT test makes it a questionable measure of academic potential, but we recognize that we could be wrong on this issue.
As for the other variables, we’re more confident. For example, we feel that once we know how many hours a week a student spends studying, we couldn’t care less what that student does with the rest of his or her time, so JOCK and GREK are superfluous once HRS is included. In addition, the higher lnEX is, the better a student’s study habits are likely to be and the more likely the student is to be taking courses in his or her major. Finally, while we recognize that some private schools are superb and that some public schools are not, we’d guess that PRIV is irrelevant; it probably has only a minor effect.
If we estimate this specification on the 25 students, we obtain:
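$$
\widehat{GPA}_i = -0.26
+ \underset{\substack{(0.21)\\ t\,=\,2.33}}{0.49\,HGPA_i}
+ \underset{\substack{(0.02)\\ t\,=\,3.00}}{0.06\,HRS_i}
+ \underset{\substack{(0.14)\\ t\,=\,3.00}}{0.42\,lnEX_i}
\tag{6.18}
$$
$$
N = 25 \qquad \bar{R}^2 = .585
$$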
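Each t-score reported under Equation 6.18 is the estimated coefficient divided by its standard error, for a null hypothesis of zero. The short Python sketch below reproduces that arithmetic from the published figures. The variable names are ours, and the critical value of 1.721 is an assumption: it is the standard t-table entry for a 5-percent one-sided test with 25 − 3 − 1 = 21 degrees of freedom, which seems the natural test given that all three expected signs are positive.

```python
# Coefficients and standard errors as reported in Equation 6.18.
eq_6_18 = {
    "HGPA": (0.49, 0.21),
    "HRS": (0.06, 0.02),
    "lnEX": (0.42, 0.14),
}

N = 25          # observations
K = 3           # independent variables (excluding the constant)
df = N - K - 1  # degrees of freedom

# Assumption: 5% one-sided critical value for 21 degrees of freedom,
# read from a standard t-table (not stated in the text).
T_CRITICAL = 1.721


def t_score(coef, se):
    """t-score for H0: beta = 0, i.e. coefficient / standard error."""
    return coef / se


t_scores = {name: t_score(b, se) for name, (b, se) in eq_6_18.items()}

for name, t in t_scores.items():
    verdict = "significant" if t > T_CRITICAL else "not significant"
    print(f"{name}: t = {t:.2f} ({verdict} at 5%, one-sided, {df} df)")
```

Under these assumptions all three slope coefficients are significantly positive, which is what the text means when it says the coefficients meet expectations in terms of significance.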
10. In contrast, SATs tend to have a statistically significant effect on GPAs at large research universities. For example, see Andrew Barkley and Jerry Forst, “The Determinants of First-Year Academic Performance in the College of Agriculture at Kansas State University, 1990–1999,” Journal of Agricultural and Applied Economics, Vol. 36, No. 2, pp. 437–448.
Since we prefer this specification on theoretical grounds, since the overall fit seems reasonable, and since each coefficient meets our expectations in terms of sign, size, and significance, we consider this an acceptable equation. The only circumstance under which we’d consider estimating a second specification would be if we had theoretical reasons to believe that we had omitted a relevant variable. The only variable that might meet this description is SATi (which we prefer to the individual MSAT and VSAT):
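$$
\widehat{GPA}_i = -0.92
+ \underset{\substack{(0.22)\\ t\,=\,2.12}}{0.47\,HGPA_i}
+ \underset{\substack{(0.02)\\ t\,=\,2.50}}{0.05\,HRS_i}
+ \underset{\substack{(0.14)\\ t\,=\,3.12}}{0.44\,lnEX_i}
+ \underset{\substack{(0.00064)\\ t\,=\,0.93}}{0.00060\,SAT_i}
\tag{6.19}
$$
$$
N = 25 \qquad \bar{R}^2 = .583
$$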
Let’s use our four specification criteria to compare Equations 6.18 and 6.19:
1. Theory: As discussed previously, the theoretical validity of SAT tests is a matter of some academic controversy, but they still are one of the most-cited measures of academic potential in this country.
2. t-Test: The coefficient of SAT is positive, as we’d expect, but it’s not significantly different from zero.
3. R̄²: As you’d expect (since SAT’s t-score is under 1), R̄² falls slightly when SAT is added.
4. Bias: None of the estimated slope coefficients changes substantially when SAT is added, though some of the t-scores do change because of the increase in the SE(β̂)s caused by the addition of SAT.
Thus, the statistical criteria do not convincingly contradict our theoretical contention that SAT is irrelevant.
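The comparison in the four criteria can be checked with a little arithmetic on the published figures. The Python sketch below is only illustrative: it assumes the fit statistic printed under each equation is the adjusted R̄² (which is consistent with its falling when a variable is added), uses the standard formula R̄² = 1 − (1 − R²)(N − 1)/(N − K − 1) to back out the implied unadjusted R², and measures each coefficient change in units of its Equation 6.18 standard error. Because the inputs are rounded, the implied R² values are approximate, and all names are ours.

```python
N = 25  # observations in both equations

# Reported estimates. Assumption: the printed fit statistic is adjusted R-squared.
eq_6_18 = {
    "K": 3,
    "adj_r2": 0.585,
    "coefs": {"HGPA": 0.49, "HRS": 0.06, "lnEX": 0.42},
    "ses": {"HGPA": 0.21, "HRS": 0.02, "lnEX": 0.14},
}
eq_6_19 = {
    "K": 4,
    "adj_r2": 0.583,
    "coefs": {"HGPA": 0.47, "HRS": 0.05, "lnEX": 0.44, "SAT": 0.00060},
    "ses": {"HGPA": 0.22, "HRS": 0.02, "lnEX": 0.14, "SAT": 0.00064},
}


def implied_r2(adj_r2, n, k):
    """Invert adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1) to recover R2."""
    return 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)


r2_18 = implied_r2(eq_6_18["adj_r2"], N, eq_6_18["K"])
r2_19 = implied_r2(eq_6_19["adj_r2"], N, eq_6_19["K"])

# t-Test criterion: the t-score of the added variable.
sat_t = eq_6_19["coefs"]["SAT"] / eq_6_19["ses"]["SAT"]

# Bias criterion: change in each slope coefficient when SAT is added,
# expressed in units of the Equation 6.18 standard error.
coef_changes = {
    name: eq_6_19["coefs"][name] - b for name, b in eq_6_18["coefs"].items()
}
changes_in_se = {
    name: coef_changes[name] / eq_6_18["ses"][name] for name in coef_changes
}

print(f"SAT t-score: {sat_t:.2f}")
print(f"Implied R2: {r2_18:.3f} (6.18) vs {r2_19:.3f} (6.19)")
print(f"Adjusted R2: {eq_6_18['adj_r2']} (6.18) vs {eq_6_19['adj_r2']} (6.19)")
for name, change in coef_changes.items():
    print(f"{name}: change = {change:+.2f} ({changes_in_se[name]:+.2f} SE)")
```

Under these assumptions the unadjusted R² rises slightly when SAT is added, as it must, while the degrees-of-freedom adjustment makes R̄² fall; that pattern goes hand in hand with an added variable whose t-score is below 1 in absolute value. Every slope coefficient moves by less than one standard error, which is the sense in which none changes substantially.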
Finally, it’s important to recognize that different researchers could come up with different final equations on this topic. A researcher whose prior expectation was that SAT unambiguously belonged in the equation would have estimated Equation 6.19 and accepted that equation without bothering to estimate Equation 6.18. Other researchers, in the spirit of sensitivity analysis, would report both equations.