Simple Linear Regression with Stata

The following log file and comments illustrate how to use Stata to perform the calculations discussed in the previous sections.

. * 2.12.Poison.log
. *
. * Calculate the mean plasma glycolate and arterial pH levels for the
. * ethylene glycol poisoning data of Brent et al. (1999). Regress glycolate

46 2. Simple linear regression

. * levels against pH. Draw a scatter plot of glycolate against pH. Plot the
. * linear regression line on this scatter plot together with the 95%
. * confidence limits for this line and the 95% prediction intervals for new
. * patients.
. *

. use C:\WDDtext\2.12.Poison.dta, clear {1}

. summarize ph glyco

Variable |     Obs        Mean   Std. Dev.       Min        Max
---------+-----------------------------------------------------
      ph |      18    7.210556   .1730512       6.88       7.47
   glyco |      18       90.44   80.58488          0     265.24

. format ph %9.1g {2}

. format glyco %9.0g

. graph glyco ph, gap(4) xlabel(6.8,6.9 to 7.5) ylabel(0, 50 to 300) {3}

>xline(7.21) yline(90.4)

{Graph omitted. See Figure 2.1}

. regress glyco ph {4}

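  Source |       SS       df       MS               Number of obs =      18  {5}
---------+------------------------------            F(  1,    16) =   61.70
   Model |  87664.6947     1  87664.6947            Prob > F      =  0.0000  {7}
Residual |  22731.9877    16  1420.74923            R-squared     =  0.7941  {8}
---------+------------------------------            Adj R-squared =  0.7812
   Total |  110396.682    17   6493.9225            Root MSE      =  37.693  {6}

------------------------------------------------------------------------------
   glyco |      Coef.   Std. Err.       t     P>|t|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
      ph |  -414.9666   52.82744    -7.855    0.000     -526.9558   -302.9775  {9}
   _cons |    3082.58   381.0188     8.090    0.000      2274.856    3890.304  {10}
------------------------------------------------------------------------------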

. predict yhat, xb {11}

. graph glyco yhat ph, gap(4) xlabel(6.9,7.0 to 7.5) ylabel(0, 50 to 300) {12}

>connect(.l) symbol(Oi)

{Graph omitted. See Figure 2.4}

. predict std_p, stdp {13}

. display _N {14}

18


. display invttail(_N-2,0.025) {15}

2.1199053

. generate ci_u = yhat + invttail(_N-2,0.025)*std_p {16}

. generate ci_l = yhat - invttail(_N-2,0.025)*std_p {17}

. sort ph {18}

. graph glyco yhat ci_u ci_l ph, gap(4) xlabel(6.9,7.0 to 7.5) {19}

>ylabel(0, 50 to 300) connect(.lll) symbol(Oiii)

{Graph omitted. See Figure 2.7.}

. predict std_f, stdf {20}

. generate ci_uf = yhat + invttail(_N-2,0.025)*std_f {21}

. generate ci_lf = yhat - invttail(_N-2,0.025)*std_f

. graph glyco yhat ci_u ci_l ci_lf ci_uf ph, gap(4) xlabel(6.9,7.0 to 7.5) {22}

>ylabel(0, 50 to 300) connect(.lllll) symbol(Oiiiii)

{Graph omitted. See Figure 2.8}
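The summary statistics printed by the regress command above are linked by simple arithmetic, and it can be instructive to recompute them by hand. The following is a minimal sketch in Python (not part of the original log; the variable names are illustrative assumptions) that takes the sums of squares, the slope estimate, its standard error and the critical value 2.1199053 exactly as they appear in the log and rederives the remaining quantities.

```python
import math

# Values copied from the regress output and the display command in the log.
mss = 87664.6947        # model sum of squares
rss = 22731.9877        # residual sum of squares
n = 18                  # number of observations
b = -414.9666           # slope estimate for ph
se_b = 52.82744         # standard error of the slope
t_crit = 2.1199053      # invttail(16, 0.025) as displayed in the log

tss = mss + rss                     # total sum of squares
df_resid = n - 2                    # residual degrees of freedom
r_squared = mss / tss               # proportion of variation explained
mse = rss / df_resid                # residual mean square
root_mse = math.sqrt(mse)           # s, the root MSE
f_stat = mss / mse                  # model MS (1 df) over residual MS
t_stat = b / se_b                   # t statistic for the null hypothesis beta = 0
ci_slope = (b - t_crit * se_b, b + t_crit * se_b)  # 95% interval for beta

print(f"TSS       = {tss:.3f}")
print(f"R-squared = {r_squared:.4f}")
print(f"Root MSE  = {root_mse:.3f}")
print(f"F         = {f_stat:.2f}")
print(f"t         = {t_stat:.3f}")
print(f"95% CI    = ({ci_slope[0]:.4f}, {ci_slope[1]:.4f})")
```

With a single covariate the F statistic equals the square of the t statistic for the slope, which gives a further check on the output.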

Comments

1 The 2.12.Poison.dta data set contains the plasma glycolate and arterial pH levels of 18 patients admitted for ethylene glycol poisoning. These levels are stored in variables called glyco and ph, respectively. The clear option of the use command deletes any data that may have been in memory when this command was given.

2 Stata variables are associated with formats that control how they are displayed in the Stata Editor and in graphs and data output. This command assigns ph a general numeric format with up to nine digits and one digit after the decimal point. The next command assigns glyco a similar format with no digits after the decimal point. These commands will affect the appearance of the axis labels in the subsequent graph commands. They do not affect the numeric values of these variables. In the 2.12.Poison data set both of these formats are set to %9.2g.

3 The command graph glyco ph draws a scatter plot of glyco against ph. The options on this command improve the visual appearance of the scatter plot; gap(4) places the title of the y-axis four spaces to the left of the y-axis. The xlabel option labels the x-axis from 6.8 to 7.5 in even increments 0.1 units apart. The ylabel option labels the y-axis from 0 to 300 in increments of 50. The xline and yline options draw vertical and horizontal lines at x = 7.21 and y = 90.4, respectively. The default titles of the x- and y-axes are labels assigned to the ph and glyco variables in the 2.12.Poison.dta data set. The resulting graph is similar to Figure 2.1. (In this latter figure I used a graphics editor to annotate the mean glycolate and pH values and to indicate the residuals for three data points.)

4 This command performs a linear regression of glyco against ph. That is, we fit the model E[glyco | ph] = α + β × ph (see equation 2.5). The most important output from this command has been highlighted and is defined below.

5 The number of patients n = 18.

6 The root MSE is s = 37.693 (see equation 2.9). The total sum of squares is TSS = 110 396.682.

7 The model sum of squares is MSS = 87 664.6947.

8 $R^2$ = MSS/TSS = 0.7941. Hence 79% of the variation in glycolate levels is explained by this linear regression.

9 The slope estimate of β for this linear regression is b = −414.9666 (see equation 2.6). The estimated standard error of b is se[b] = 52.827 44 (see equation 2.13). The t statistic to test the null hypothesis that β = 0 is t = b/se[b] = −7.855 (see equation 2.14). The P value associated with this statistic is < 0.0005. The 95% confidence interval for β is (−526.9558, −302.9775) (see equation 2.15).

10 The y-intercept estimate of α for this linear regression is a = 3082.58 (see equation 2.7).

11 The predict command can estimate a variety of statistics after a regression or other estimation command. (Stata refers to such commands as post-estimation commands.) The xb option causes a new variable (in this example yhat) to be set equal to each patient’s expected plasma glycolate level $\hat{y}[x] = a + bx$; in this equation, x is the patient’s arterial pH and a and b are the parameter estimates of the linear regression (see also equation 2.8).

12 This command graphs glyco and yhat against ph. The connect and symbol options specify how this is to be done. There must be one character between the parentheses following the connect and symbol options for each plotted variable. The first character affects the first variable (glyco), the second affects the second variable (yhat), et cetera; connect(.l) specifies that the glyco values are not connected and the yhat values are connected by a straight line; symbol(Oi) specifies that each glyco value is indicated by a large circle but that no symbol is used for yhat. The net effect of this command is to produce a scatter plot of glycolate against pH together with a straight line indicating the expected glycolate levels as a function of pH. The resulting graph is similar to Figure 2.4.

Stata commands can often be too long to fit on a single line of a log file. When this happens the command wraps onto the next line. A “>” symbol at the beginning of a line indicates the continuation of the preceding command rather than the start of a new one.

13 With the stdp option, predict defines a new variable (in this example std_p) to be the standard error of yhat. That is, std_p = $\sqrt{\mathrm{var}[\hat{y}[x] \mid x]}$ (see equation 2.17).

14 The display command calculates and displays a numeric expression or constant. _N denotes the number of observations in the data set, which in this example is 18.

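15 The Stata function invttail(n, α) calculates a critical value of size α for a t distribution with n degrees of freedom; that is, it returns the value that a t statistic with n degrees of freedom exceeds with probability α. Thus, invttail(_N−2, 0.025) = invttail(16, 0.025) = $t_{16,0.025}$ = 2.119 9053.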

16 The generate command defines a new variable in terms of old ones. Here ci_u is set equal to $\hat{y}[x] + t_{n-2,0.025}\sqrt{\mathrm{var}[\hat{y}[x] \mid x]}$, which is the upper bound for the 95% confidence interval for $\hat{y}[x]$ (see equation 2.18).

17 Similarly, ci_l = $\hat{y}[x] - t_{n-2,0.025}\sqrt{\mathrm{var}[\hat{y}[x] \mid x]}$.

18 This command sorts the data by ph. This is needed to ensure that the following graph command draws the confidence bounds correctly. The data should be sorted by the x-axis variable whenever a non-linear curve is to be plotted.

19 We next add the graphs of the 95% confidence intervals for $\hat{y}[x]$ to the preceding graph, which generates Figure 2.7.

20 The stdf option of the predict command defines a new variable (std_f) that equals the standard deviation of the response for a new patient. That is, std_f = $\sqrt{\mathrm{var}[\hat{y}[x] \mid x] + s^2}$.

21 The next two commands define ci_uf and ci_lf to be the bounds of the 95% prediction intervals for new patients (see equation 2.19).

22 This final graph command adds the 95% prediction intervals to the preceding graph. The resulting graph is similar to Figure 2.8.
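The calculations described in comments 11, 13, 16, 17, 20 and 21 can be traced for a single patient. The sketch below is written in Python purely for illustration and is not part of the original log. It assumes the standard simple linear regression identity var[ŷ[x] | x] = s²/n + (x − x̄)² se[b]², which should agree with equation 2.17, and it uses a hypothetical patient with an arterial pH of 7.0. All other numbers are copied from the Stata output; the function name `intervals` is an assumption introduced here.

```python
import math

# Estimates copied from the Stata log.
a, b = 3082.58, -414.9666   # intercept and slope
se_b = 52.82744             # standard error of the slope
s = 37.693                  # root MSE
n = 18                      # number of patients
x_bar = 7.210556            # mean arterial pH
t_crit = 2.1199053          # invttail(16, 0.025)


def intervals(x):
    """Return the fitted value, std_p, std_f, the 95% confidence
    interval and the 95% prediction interval at pH x."""
    y_hat = a + b * x
    # Standard error of the fitted value (assumed form of equation 2.17).
    std_p = math.sqrt(s ** 2 / n + (x - x_bar) ** 2 * se_b ** 2)
    # Standard error of the forecast for a new patient.
    std_f = math.sqrt(std_p ** 2 + s ** 2)
    ci = (y_hat - t_crit * std_p, y_hat + t_crit * std_p)
    pi = (y_hat - t_crit * std_f, y_hat + t_crit * std_f)
    return y_hat, std_p, std_f, ci, pi


# Hypothetical patient with pH = 7.0 (not a value taken from the data set).
y_hat, std_p, std_f, ci, pi = intervals(7.0)
print(f"yhat  = {y_hat:.2f}")
print(f"std_p = {std_p:.2f}, std_f = {std_f:.2f}")
print(f"95% confidence interval = ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% prediction interval = ({pi[0]:.1f}, {pi[1]:.1f})")
```

Because std_f adds s² to the variance of the fitted value, the prediction interval is always wider than the confidence interval, and both are narrowest at the mean pH.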
