Once we have rejected the null hypothesis (3.12) in favor of the alternative hypothesis (3.13), it is natural to want to quantify the extent to which the model fits the data. The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the $R^2$ statistic.
Table 3.2 displays the RSE, the $R^2$ statistic, and the F-statistic (to be described in Section 3.2.2) for the linear regression of number of units sold on TV advertising budget.
Residual Standard Error
Recall from the model (3.5) that associated with each observation is an error term. Due to the presence of these error terms, even if we knew the true regression line (i.e. even if $\beta_0$ and $\beta_1$ were known), we would not be able to perfectly predict $Y$ from $X$.
4 In Table 3.1, a small p-value for the intercept indicates that we can reject the null hypothesis that $\beta_0 = 0$, and a small p-value for TV indicates that we can reject the null hypothesis that $\beta_1 = 0$. Rejecting the latter null hypothesis allows us to conclude that there is a relationship between TV and sales. Rejecting the former allows us to conclude that in the absence of TV expenditure, sales are non-zero.
Quantity                    Value
Residual standard error     3.26
$R^2$                       0.612
F-statistic                 312.1

TABLE 3.2. For the Advertising data, more information about the least squares model for the regression of number of units sold on TV advertising budget.
The RSE is an estimate of the standard deviation of $\epsilon$. Roughly speaking, it is the average amount that the response will deviate from the true regression line. It is computed using the formula
$$\mathrm{RSE} = \sqrt{\frac{1}{n-2}\,\mathrm{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}. \qquad (3.15)$$

Note that RSS was defined in Section 3.1.1, and is given by the formula
$$\mathrm{RSS} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2. \qquad (3.16)$$

In the case of the advertising data, we see from the linear regression output in Table 3.2 that the RSE is 3.26. In other words, actual sales in each market deviate from the true regression line by approximately 3,260 units, on average. Another way to think about this is that even if the model were correct and the true values of the unknown coefficients $\beta_0$ and $\beta_1$ were known exactly, any prediction of sales on the basis of TV advertising would still be off by about 3,260 units on average. Of course, whether or not 3,260 units is an acceptable prediction error depends on the problem context. In the advertising data set, the mean value of sales over all markets is approximately 14,000 units, and so the percentage error is 3,260/14,000 = 23%.
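This calculation is easy to reproduce numerically. The sketch below uses NumPy with simulated TV budgets and sales figures (placeholders, not the actual Advertising data) to fit a least squares line and then compute the RSE of (3.15) and the corresponding percentage error.

```python
import numpy as np

# A minimal sketch of the RSE calculation in (3.15)-(3.16). The budgets,
# sales figures, and coefficients below are simulated placeholders,
# not the actual Advertising data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)                  # hypothetical TV budgets
y = 7.0 + 0.05 * x + rng.normal(0, 3.3, size=200)  # hypothetical sales (thousands of units)

# Least squares estimates of the intercept and slope (as in Section 3.1.1).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

n = len(y)
rss = np.sum((y - y_hat) ** 2)   # residual sum of squares, (3.16)
rse = np.sqrt(rss / (n - 2))     # residual standard error, (3.15)

# Percentage error relative to the mean response, as in the text
# (e.g. 3.26 / 14 is roughly 23% for the Advertising data).
print(rse, rse / y.mean())
```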
The RSE is considered a measure of the lack of fit of the model (3.5) to the data. If the predictions obtained using the model are very close to the true outcome values (that is, if $\hat{y}_i \approx y_i$ for $i = 1, \ldots, n$), then (3.15) will be small, and we can conclude that the model fits the data very well. On the other hand, if $\hat{y}_i$ is very far from $y_i$ for one or more observations, then the RSE may be quite large, indicating that the model doesn't fit the data well.
$R^2$ Statistic
The RSE provides an absolute measure of lack of fit of the model (3.5) to the data. But since it is measured in the units of $Y$, it is not always clear what constitutes a good RSE. The $R^2$ statistic provides an alternative measure of fit. It takes the form of a proportion (the proportion of variance explained), and so it always takes on a value between 0 and 1, and is independent of the scale of $Y$.
To calculate $R^2$, we use the formula

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \qquad (3.17)$$
where $\mathrm{TSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares, and RSS is defined
in (3.16). TSS measures the total variance in the response $Y$, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. Hence, TSS $-$ RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and $R^2$ measures the proportion of variability in $Y$ that can be explained using $X$. An $R^2$ statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. A number near 0 indicates that the regression did not explain much of the variability in the response; this might occur because the linear model is wrong, or the inherent error $\sigma^2$ is high, or both. In Table 3.2, the $R^2$ was 0.61, and so just under two-thirds of the variability in sales is explained by a linear regression on TV.
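As a small numerical illustration of (3.17), the sketch below computes TSS, RSS, and $R^2$ from a set of observed and fitted values; the numbers are made up purely to show the arithmetic, and the fitted values could come from any regression procedure.

```python
import numpy as np

# A minimal sketch of the R^2 computation in (3.17): R^2 = 1 - RSS/TSS.
# `y` holds observed responses and `y_hat` the fitted values from some
# regression fit; the numbers here are placeholders for illustration only.
y = np.array([22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2])
y_hat = np.array([17.0, 9.1, 9.4, 14.8, 13.0, 8.4, 12.1, 12.6])

tss = np.sum((y - y.mean()) ** 2)  # total variability in the response
rss = np.sum((y - y_hat) ** 2)     # variability left unexplained by the fit
r2 = 1 - rss / tss                 # proportion of variance explained
print(r2)
```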
The $R^2$ statistic (3.17) has an interpretational advantage over the RSE (3.15), since unlike the RSE, it always lies between 0 and 1. However, it can still be challenging to determine what is a good $R^2$ value, and in general, this will depend on the application. For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an $R^2$ value that is extremely close to 1, and a substantially smaller $R^2$ value might indicate a serious problem with the experiment in which the data were generated. On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model (3.5) is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an $R^2$ value well below 0.1 might be more realistic!
The $R^2$ statistic is a measure of the linear relationship between $X$ and $Y$. Recall that correlation, defined as

$$\mathrm{Cor}(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}, \qquad (3.18)$$

is also a measure of the linear relationship between $X$ and $Y$.5 This suggests that we might be able to use $r = \mathrm{Cor}(X, Y)$ instead of $R^2$ in order to assess the fit of the linear model. In fact, it can be shown that in the simple linear regression setting, $R^2 = r^2$. In other words, the squared correlation
5 We note that in fact, the right-hand side of (3.18) is the sample correlation; thus, it would be more correct to write $\widehat{\mathrm{Cor}}(X, Y)$; however, we omit the "hat" for ease of notation.
and the $R^2$ statistic are identical. However, in the next section we will discuss the multiple linear regression problem, in which we use several predictors simultaneously to predict the response. The concept of correlation between the predictors and the response does not extend automatically to this setting, since correlation quantifies the association between a single pair of variables rather than between a larger number of variables. We will see that $R^2$ fills this role.
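The identity $R^2 = r^2$ in the simple linear regression setting can be checked directly. The sketch below, using simulated data rather than any data set from the text, fits a simple linear regression and compares $R^2$ with the squared sample correlation of (3.18).

```python
import numpy as np

# A sketch illustrating that in simple linear regression the R^2 statistic
# equals the squared sample correlation r^2 = Cor(X, Y)^2 from (3.18).
# The data are simulated purely for illustration.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# Least squares fit of y on x.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]     # sample correlation Cor(X, Y)

print(np.isclose(r2, r ** 2))   # True: the two agree in simple linear regression
```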