Collect the Data. Inspect and Clean the Data

Part of the document Using Econometrics: A Practical Guide (7th edition) (pages 89 - 92)

Obtaining an original data set and properly preparing it for regression is a surprisingly difficult task. This step entails more than a mechanical recording of data, because the type and size of the sample also must be chosen.

A general rule regarding sample size is “the more observations the better,” as long as the observations are from the same general population. Ordinarily, researchers take all the roughly comparable observations that are readily available. In regression analysis, all the variables must have the same number of observations. They also should have the same frequency (monthly, quarterly, annual, etc.) and time period. Often, the frequency selected is determined by the availability of data.

The reason there should be as many observations as possible concerns the statistical concept of degrees of freedom, first mentioned in Section 2.4.

Consider fitting a straight line to two points on an X, Y coordinate system as in Figure 3.1. Such an exercise can be done mathematically without error.

Both points lie on the line, so there is no estimation of the coefficients involved. The two points determine the two parameters, the intercept and the slope, precisely. Estimation takes place only when a straight line is fitted to three or more points that were generated by some process that is not exact. The excess of the number of observations (three) over the number of coefficients to be estimated (in this case two, the intercept and slope) is the degrees of freedom (see footnote 3). All that is necessary for estimation is a single degree of freedom, as in Figure 3.2, but the more degrees of freedom there are, the better. This is because when the number of degrees of freedom is large, every positive error is likely to be balanced by a negative error. When degrees of freedom are low, the random element is likely to fail to provide such offsetting observations.

2. Note that while we hypothesize signs for the slope coefficients, we don’t hypothesize an expected sign for the intercept. We’ll explain why in Section 7.1.
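The line-fitting argument above can be illustrated with a minimal sketch. The data points and the helper names `fit_line` and `degrees_of_freedom` are made up for illustration and do not come from the text; the fit uses the standard least-squares formulas for one independent variable.

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = b0 + b1*X; returns (b0, b1, residuals)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return b0, b1, residuals


def degrees_of_freedom(n_obs, k_regressors):
    """d.f. = N - K - 1, where K counts the independent variables."""
    return n_obs - k_regressors - 1


# Two points (hypothetical values): the line passes through both exactly.
b0_two, b1_two, res_two = fit_line([1.0, 2.0], [3.0, 5.0])
df_two = degrees_of_freedom(2, 1)

# Three points (hypothetical values): the line can no longer hit every point.
b0_three, b1_three, res_three = fit_line([1.0, 2.0, 3.0], [3.0, 5.0, 6.0])
df_three = degrees_of_freedom(3, 1)

print("two points:  ", b0_two, b1_two, res_two, "d.f. =", df_two)
print("three points:", b0_three, b1_three, res_three, "d.f. =", df_three)
```

With two points the residuals are zero and no degrees of freedom are left over; with three points the residuals are nonzero and there is one degree of freedom, so the coefficients are estimated rather than determined.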

For example, the more a coin is flipped, the more likely it is that the observed proportion of heads will reflect the true probability of 0.5.
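The coin-flip intuition can be tried out with a small simulation. This is only a sketch: the function name `proportion_heads`, the sample sizes, and the seed are arbitrary choices, and any single run can still land far from 0.5 when the number of flips is small.

```python
import random


def proportion_heads(n_flips, rng):
    """Simulate n_flips fair coin flips and return the share that came up heads."""
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips


rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
for n in (10, 100, 10_000, 100_000):
    share = proportion_heads(n, rng)
    print(f"{n:>7} flips: proportion of heads = {share:.4f}")
```

Larger samples tend to give proportions closer to 0.5 because the random excess of heads or tails in one part of the sample is more likely to be offset elsewhere.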

3. Throughout the text, we will calculate the number of degrees of freedom (d.f.) in a regression equation as d.f. = N - K - 1, where K is the number of independent variables in the equation. Equivalently, some authors will set K′ = K + 1 and define d.f. = N - K′. Since K′ equals the number of independent variables plus 1 (for the constant), it equals the number of coefficients to be estimated in the regression.

Figure 3.1 Mathematical Fit of a Line to Two Points (plot of Y against X)

If there are only two points in a data set, as in Figure 3.1, a straight line can be fitted to those points mathematically without error, because two points completely determine a straight line.

Another area of concern has to do with the units of measurement of the variables. Does it matter if a variable is measured in dollars or thousands of dollars? Does it matter if the measured variable differs consistently from the true variable by 10 units? Interestingly, such changes don’t matter in terms of regression analysis except in interpreting the scale of the coefficients. All conclusions about signs, significance, and economic theory are independent of units of measurement. For example, it makes little difference whether an independent variable is measured in dollars or thousands of dollars. The constant term and measures of overall fit remain unchanged. Such a multiplicative factor does change the slope coefficient, but only by the exact amount necessary to compensate for the change in the units of measurement of the independent variable. Similarly, a constant factor added to a variable alters only the intercept term without changing the slope coefficient itself.
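The effect of changing units can be checked numerically. The sketch below uses invented data and a hypothetical helper `ols`; it fits the same relationship with the independent variable in dollars, in thousands of dollars, and shifted by a constant 10 units, using R-squared as the measure of overall fit.

```python
import math


def ols(xs, ys):
    """Least-squares fit of Y = b0 + b1*X; returns (intercept, slope, r_squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ssr = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return intercept, slope, 1 - ssr / sst


# Invented sample: X in dollars, Y in arbitrary units.
income_dollars = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
spending = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1, r2 = ols(income_dollars, spending)

# Same X measured in thousands of dollars: only the slope is rescaled.
income_thousands = [x / 1000 for x in income_dollars]
b0_k, b1_k, r2_k = ols(income_thousands, spending)

# Same X mismeasured by a constant 10 units: only the intercept moves.
income_shifted = [x + 10 for x in income_dollars]
b0_s, b1_s, r2_s = ols(income_shifted, spending)

print("dollars:   ", b0, b1, r2)
print("thousands: ", b0_k, b1_k, r2_k)
print("shifted:   ", b0_s, b1_s, r2_s)
print("slope rescaled by 1000:", math.isclose(b1_k, 1000 * b1, rel_tol=1e-9))
```

Dividing X by 1,000 multiplies the slope by 1,000 and leaves the intercept and fit alone, while adding 10 to X leaves the slope alone and lowers the intercept by 10 times the slope.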

The final step before estimating your equation is to inspect and clean the data. You should make it a point always to look over your data set to see if you can find any errors. The reason is obvious: why bother using sophisticated regression analysis if your data are incorrect?

To inspect the data, obtain a plot (graph) of the data and look for outliers. An outlier is an observation that lies outside the range of the rest of the observations, and looking for outliers is an easy way to find data entry errors. In addition, it’s a good habit to look at the mean, maximum, and minimum of each variable and then think about possible inconsistencies in the data. Are any observations impossible or unrealistic? Did GDP double in one year? Does a student have a 7.0 GPA on a 4.0 scale? Is consumption negative?
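The checks described above can be sketched in a few lines. Everything here is illustrative: the data values are invented, and the helper names (`summarize`, `flag_out_of_range`, `flag_jumps`) and thresholds are assumptions chosen to mirror the questions in the text, not a prescribed procedure.

```python
def summarize(values):
    """Return the mean, minimum, and maximum of one variable."""
    return {"mean": sum(values) / len(values), "min": min(values), "max": max(values)}


def flag_out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside the plausible range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


def flag_jumps(values, factor):
    """Return indices where a value is at least `factor` times the previous one.

    Assumes every value is positive (true of the GDP series used here).
    """
    return [i for i in range(1, len(values)) if values[i] / values[i - 1] >= factor]


# Invented data containing one suspicious entry per variable.
gpa = [3.2, 2.8, 7.0, 3.9, 3.5]                      # 4.0 scale
consumption = [410.0, 395.5, -20.0, 430.2, 402.1]    # should not be negative
gdp = [100.0, 103.0, 210.0, 215.0]                   # annual levels

for name, series in (("gpa", gpa), ("consumption", consumption), ("gdp", gdp)):
    print(name, summarize(series))

print("GPA outside 0-4:     ", flag_out_of_range(gpa, 0.0, 4.0))
print("negative consumption:", flag_out_of_range(consumption, 0.0, float("inf")))
print("GDP doubling in year:", flag_jumps(gdp, 2.0))
```

A flagged value is a prompt to go back to the source and check the number, not a verdict that the observation is wrong.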

Typically, the data can be cleaned of these errors by replacing an incorrect number with the correct one. In extremely rare circumstances, an observation can be dropped from the sample, but only if the correct number can’t be found or if that particular observation clearly isn’t from the same population as the rest of the sample. Be careful! The mere existence of an outlier is not a justification for dropping that observation from the sample. A regression needs to be able to explain all the observations in a sample, not just the well-behaved ones. For more on the details of data collection, see Sections 11.2 and 11.3. For more on generating your own data through an economic experiment, see Section 16.1.

Figure 3.2 Statistical Fit of a Line to Three Points (plot of Y against X)

If there are three (or more) points in a data set, as in Figure 3.2, then the line must almost always be fitted to the points statistically, using the estimation procedures of Section 2.1.
