The 3.11.1.Framingham.log file continues as follows and illustrates how to perform the analyses discussed in Sections 3.13, 3.14 and 3.15.
89 3.16. Multiple linear regression with Stata
. *
. * Use multiple regression models to analyze the effects of log(bmi),
. * age and log(scl) on log(sbp)
. *
. generate woman = sex -1
. generate wo_lbmi = woman * logbmi
(9 missing values generated)
. generate wo_age = woman * age
. generate wo_lscl = woman * logscl
(33 missing values generated)
. regress logsbp logbmi age logscl woman wo_lbmi wo_age wo_lscl {1}
{Output omitted. See Table 3.2}
. regress logsbp logbmi age logscl woman wo_lbmi wo_age                       {2}
{Output omitted. See Table 3.2}
. regress logsbp logbmi age logscl woman wo_age                               {3}

  Source |       SS       df       MS              Number of obs =    4658
---------+------------------------------           F(  5,  4652) =  318.33
   Model |  30.8663845     5   6.1732769           Prob > F      =  0.0000
Residual |  90.2160593  4652  .019392962           R-squared     =  0.2549    {4}
---------+------------------------------           Adj R-squared =  0.2541
   Total |  121.082444  4657  .026000095           Root MSE      =  .13926

------------------------------------------------------------------------------
  logsbp |      Coef.   Std. Err.       t     P>|t|      [95% Conf. Interval]  {5}
---------+--------------------------------------------------------------------
  logbmi |    .262647   .0137549     19.095   0.000      .2356808    .2896131
     age |   .0035167   .0003644      9.650   0.000      .0028023    .0042311  {6}
  logscl |   .0595923   .0114423      5.208   0.000      .0371599    .0820247
   woman |  -.2165261   .0233469     -9.274   0.000     -.2622971   -.1707551
  wo_age |   .0048624   .0004988      9.749   0.000      .0038846    .0058403
   _cons |   3.537356   .0740649     47.760   0.000      3.392153    3.682558  {7}
------------------------------------------------------------------------------
. *
. * Calculate 95% confidence and prediction intervals for a 60 year-old woman
. * with a SCL of 400 and a BMI of 40.
. *
. edit                                                                        {8}
- preserve
- set obs 4700
- replace scl = 400 in 4700
- replace age = 60 in 4700
- replace bmi = 40 in 4700
- replace woman = 1 in 4700
- replace id = 9999 in 4700
. replace logbmi = log(bmi) if id == 9999 {9}
(1 real change made)
. replace logscl = log(scl) if id == 9999
(1 real change made)
. replace wo_age = woman*age if id == 9999
(1 real change made)
. predict yhat,xb {10}
(41 missing values generated)
. predict h, leverage {11}
(41 missing values generated)
. predict std_yhat, stdp {12}
(41 missing values generated)
. predict std_f, stdf {13}
(41 missing values generated)
. generate cil_yhat = yhat - invt(4658-5-1,.95)*std_yhat {14}
(41 missing values generated)
. generate ciu_yhat = yhat + invt(4658-5-1,.95)*std_yhat
(41 missing values generated)
. generate cil_f = yhat - invt(4658-5-1,.95)*std_f                            {15}
(41 missing values generated)
. generate ciu_f = yhat + invt(4658-5-1,.95)*std_f
(41 missing values generated)
. generate cil_sbpf = exp(cil_f)                                              {16}
(41 missing values generated)
. generate ciu_sbpf = exp(ciu_f)
(41 missing values generated)
. list bmi age scl woman logbmi logscl yhat h std_yhat std_f cil_yhat {17}
> ciu_yhat cil_f ciu_f cil_sbpf ciu_sbpf if id==9999
Observation 4700

     bmi         40        age         60        scl        400
   woman          1     logbmi   3.688879     logscl   5.991465
    yhat   5.149496          h    .003901   std_yhat   .0086978
   std_f     .13953   cil_yhat   5.132444   ciu_yhat   5.166547
   cil_f   4.875951      ciu_f    5.42304   cil_sbpf   131.0987
ciu_sbpf   226.5669

. display invt(4652,.95)
1.960474
Comments
1 This command regresses logsbp against the other covariates given in the command line. It evaluates model (3.17).
2 This command evaluates model (3.18).
3 This command evaluates model (3.19).
4 The output from the regress command for multiple linear regression is similar to that for simple linear regression that was discussed in Section 2.12. The $R^2$ statistic = MSS/TSS = 30.866/121.08 = 0.2549.
The mean squared error (MSE) is $s^2$ = 0.019 392 962, which we defined in equation (3.5). Taking the square root of this variance estimate gives the Root MSE = $s$ = 0.139 26.
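As a quick arithmetic check on comment 4, the following Python sketch recomputes the summary statistics from the sums of squares and degrees of freedom printed in the regress output. The variable names are illustrative and are not part of the Stata log.

```python
import math

# Values copied from the analysis of variance table in the regress output.
mss = 30.8663845     # model sum of squares (MSS)
rss = 90.2160593     # residual sum of squares
tss = 121.082444     # total sum of squares (TSS)
df_model = 5         # number of covariates in model (3.19)
df_resid = 4652      # 4658 - 5 - 1

r_squared = mss / tss                 # R-squared = MSS/TSS
mse = rss / df_resid                  # s^2, the mean squared error
root_mse = math.sqrt(mse)             # s, the Root MSE
f_stat = (mss / df_model) / mse       # overall F statistic

print(f"R-squared = {r_squared:.4f}")
print(f"MSE       = {mse:.9f}")
print(f"Root MSE  = {root_mse:.5f}")
print(f"F(5,4652) = {f_stat:.2f}")
```

Each printed value should agree with the corresponding entry in the Stata output to the number of digits shown there.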
5 For each covariate in the model, this table gives the estimate of the associated regression coefficient, the standard error of this estimate, the t statistic for testing the null hypothesis that the true value of the parameter equals zero, the P value that corresponds to this t statistic, and the 95% confidence interval for the coefficient estimate. The coefficient estimates in the second column of this table are also given in Table 3.2 in the second column on the right.
6 Note that although the age parameter estimate is small, it is almost ten times larger than its associated standard error. Hence this estimate differs from zero with high statistical significance. The large range of the age of study subjects means that the influence of age on logsbp will be appreciable even though this coefficient is small.
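To make comment 6 concrete, the sketch below recomputes the t ratio for age and converts the age slopes into a multiplicative change in SBP. The 30-year age gap is a hypothetical value chosen only for illustration. The slope for women is taken as the age coefficient plus the wo_age coefficient, which assumes the coding woman = 1 for women used in this log; other covariates are held fixed.

```python
import math

# Coefficient estimates copied from the regress output for model (3.19).
b_age = 0.0035167       # age coefficient
se_age = 0.0003644      # its standard error
b_wo_age = 0.0048624    # woman x age interaction coefficient

# The t statistic is the estimate divided by its standard error.
t_age = b_age / se_age

# Hypothetical comparison: two patients whose ages differ by 30 years
# but whose other covariates are identical.
age_gap = 30
slope_men = b_age                 # woman = 0, so wo_age drops out
slope_women = b_age + b_wo_age    # woman = 1

# A difference d in logsbp corresponds to a ratio exp(d) in SBP.
ratio_men = math.exp(slope_men * age_gap)
ratio_women = math.exp(slope_women * age_gap)

print(f"t statistic for age        = {t_age:.3f}")
print(f"SBP ratio, men, 30 years   = {ratio_men:.3f}")
print(f"SBP ratio, women, 30 years = {ratio_women:.3f}")
```

Under these assumptions, a per-year slope that looks negligible accumulates to a difference in SBP of roughly 11% for men and 29% for women over three decades.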
7 The estimate of the constant coefficient α is 3.537 356.
8 Use the Stata editor to create a new record with covariates scl, age, bmi and woman equal to 400, 60, 40 and 1, respectively. For subsequent manipulation, set id equal to 9999 (or any other identification number that has not already been assigned).
9 The replace command redefines those values of an existing variable for which the if command qualifier is true. In this command, logbmi is only calculated for the new patient with id = 9999. This and the following two statements define the covariates logbmi, logscl and wo_age for this patient.
10 The variable yhat is set equal to $\hat{y}_i$ for each record in memory. That is, yhat equals the estimated expected value of logsbp for each patient.
This includes the new record that we have just created. Note that the regression parameter estimates are unaffected by this new record since it was created after the regress command was given.
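The value of yhat for the new record can be reproduced from the coefficient table. The Python sketch below is an illustrative check, not part of the log; it plugs the covariates of the 60 year-old woman into the fitted model (3.19) using the rounded coefficients that Stata prints, so the result agrees with the listed yhat only up to rounding.

```python
import math

# Coefficient estimates copied from the regress output for model (3.19).
coef = {
    "logbmi": 0.262647,
    "age": 0.0035167,
    "logscl": 0.0595923,
    "woman": -0.2165261,
    "wo_age": 0.0048624,
    "_cons": 3.537356,
}

# Covariates of the new patient entered with the Stata editor.
bmi, age, scl, woman = 40, 60, 400, 1

x = {
    "logbmi": math.log(bmi),   # 3.688879 in the listing
    "age": age,
    "logscl": math.log(scl),   # 5.991465 in the listing
    "woman": woman,
    "wo_age": woman * age,
    "_cons": 1,
}

# Linear predictor: the sum of coefficient times covariate value.
yhat = sum(coef[name] * x[name] for name in coef)

print(f"logbmi = {x['logbmi']:.6f}")
print(f"logscl = {x['logscl']:.6f}")
print(f"yhat   = {yhat:.5f}")
```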
11 The leverage option of the predict command creates a new variable called h that equals the leverage for each patient. Note that h is defined for our new patient even though no value of logsbp is given. This is because the leverage is a function of the covariates and does not involve the response variable.
12 The stdp option sets std_yhat equal to the standard error of yhat, which equals $s\sqrt{h_i}$.
13 The stdf option sets std_f equal to the standard deviation of logsbp given the patient’s covariates. That is, std_f $= s\sqrt{h_i + 1}$.
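The two standard errors in comments 12 and 13 can be checked from the mean squared error and the leverage of the new patient. The following Python sketch is illustrative only; it uses the rounded values of $s^2$ and h shown in the log, so agreement with std_yhat and std_f is limited by that rounding.

```python
import math

s_squared = 0.019392962   # MSE from the regress output
h = 0.003901              # leverage of the new patient, from the listing

s = math.sqrt(s_squared)  # Root MSE

# Standard error of the estimated expected value (stdp option).
std_yhat = s * math.sqrt(h)

# Standard deviation of a new observation about yhat (stdf option).
std_f = s * math.sqrt(h + 1)

print(f"std_yhat = {std_yhat:.7f}")
print(f"std_f    = {std_f:.5f}")
```

Because the leverage is so small here, std_f is barely larger than s itself: almost all of the forecast uncertainty comes from the variation of individual patients about the regression surface rather than from uncertainty in the fitted value.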
14 This command and the next define cil_yhat and ciu_yhat to be the lower and upper bounds of the 95% confidence interval for yhat, respectively. This interval is given by equation (3.10). Note that there are 4658 patients in our regression and there are 5 covariates in our model.
Hence the number of degrees of freedom equals 4658 − 5 − 1 = 4652.
15 This command and the next define cil_f and ciu_f to be the lower and upper bounds of the 95% prediction interval for logsbp given the patient’s covariates. This interval is given by equation (3.12).
16 This command and the next define cil_sbpf and ciu_sbpf to be the lower and upper bounds of the 95% prediction interval for the SBP of a new patient having the specified covariates. We exponentiate the prediction interval given by equation (3.12) to obtain the interval for SBP as opposed to log[SBP].
17 This command lists the covariates and calculated values for the new patient only (that is, for records for which id == 9999 is true). The highlighted values in the output were also calculated by hand in Section 3.15.
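The interval calculations in comments 14 through 16 can be retraced from the listed values. The Python sketch below is an illustrative check rather than part of the log. It takes the critical value 1.960474 directly from the display command at the end of the log instead of recomputing it, and it uses the rounded yhat, std_yhat and std_f from the listing.

```python
import math

# Values copied from the listing for the new patient (id == 9999).
yhat = 5.149496
std_yhat = 0.0086978
std_f = 0.13953

# Degrees of freedom and critical value, as given in the log.
df = 4658 - 5 - 1
t_crit = 1.960474   # output of: display invt(4652,.95)

# 95% confidence interval for the expected value of logsbp.
cil_yhat = yhat - t_crit * std_yhat
ciu_yhat = yhat + t_crit * std_yhat

# 95% prediction interval for logsbp of a new patient.
cil_f = yhat - t_crit * std_f
ciu_f = yhat + t_crit * std_f

# Exponentiate to obtain the prediction interval for SBP itself.
cil_sbpf = math.exp(cil_f)
ciu_sbpf = math.exp(ciu_f)

print(f"df = {df}")
print(f"confidence interval for yhat:   ({cil_yhat:.6f}, {ciu_yhat:.6f})")
print(f"prediction interval for logsbp: ({cil_f:.6f}, {ciu_f:.6f})")
print(f"prediction interval for SBP:    ({cil_sbpf:.2f}, {ciu_sbpf:.2f})")
```

The prediction interval is far wider than the confidence interval because std_f is about sixteen times larger than std_yhat for this patient.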