
Title: Outliers, Leverage and Influence
Author: Nguyen Hoang Bao
School: Applied Econometrics
Type: Lecture
Year of publication: 2004

Applied Econometrics

Lecture 3: Outliers, Leverage and Influence

‘Life is the art of drawing sufficient conclusions from insufficient premises’

SAMUEL BUTLER

1) Introduction

The estimates of the regression parameters can be strongly influenced by a few extreme observations. A residual plot lets us pick out the individual data points whose residuals are unusually high or low. We may therefore use the residual plot to find outliers, that is, observations that are inadequately captured by the regression model itself.

2) Identification of outliers

• The percentiles that cut the data into four quarters have special names: the 25th and 75th percentiles are called the lower and upper quartiles (QL and QU).

• The lower quartile is the value at depth [integer((n+1)/2)+1]/2 from the bottom of the ordered list; the upper quartile is the value at the same depth from the top. When the depth ends in .5, the quartile is the average of the two adjacent ordered values.

• A data point Y0 is considered to be an outlier if

Y0 < QL − 1.5 IQR or Y0 > QU + 1.5 IQR

where IQR is the inter-quartile range (IQR = QU − QL) (Source: Hoaglin, 1983).
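The quartile rule and the 1.5 IQR fences above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are made up here, and a depth ending in .5 is handled by averaging the two neighbouring ordered values.

```python
import math


def quartiles(values):
    """Return (QL, QU) using the depth rule [integer((n+1)/2)+1]/2."""
    ordered = sorted(values)
    n = len(ordered)
    depth = (math.floor((n + 1) / 2) + 1) / 2  # may end in .5
    lo, hi = math.floor(depth), math.ceil(depth)  # 1-based positions
    # A fractional depth means averaging the two neighbouring ordered values.
    ql = (ordered[lo - 1] + ordered[hi - 1]) / 2  # counted from the bottom
    qu = (ordered[n - lo] + ordered[n - hi]) / 2  # counted from the top
    return ql, qu


def iqr_outliers(values):
    """Return the values outside [QL - 1.5 IQR, QU + 1.5 IQR]."""
    ql, qu = quartiles(values)
    iqr = qu - ql
    lower, upper = ql - 1.5 * iqr, qu + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample, made up for illustration only.
sample = [2, 4, 5, 6, 7, 8, 9, 10, 30]
print(quartiles(sample))
print(iqr_outliers(sample))
```

With nine observations the depth is (5 + 1)/2 = 3, so the quartiles are the third values from the bottom and from the top of the ordered list.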

3) Outliers

An outlier is a point that is far removed from its fitted value (i.e., it has a large residual). Large in this context does not refer to the absolute size of a residual but to its size relative to most of the other residuals in the regression.

In univariate analysis, an outlier is defined with reference to the mean of its own variable. In bivariate analysis, an outlier is a point with a large residual (i.e., its Y value is far removed from its fitted value).

Apart from graphical methods, we can also rely on special statistics to detect outliers. To compare a large residual with the other residuals, we may calculate the standardized residual, which is simply the residual divided by the standard error of the estimate (ei/s). However, an outlier in the data set inflates the standard error of the regression, which makes its own standardized residual look smaller than it should. Hence we use the studentized residual:


$$ t_i = \frac{e_i}{s(i)\sqrt{1 - h_i}} $$

where

ei is the residual (ei = yi − ŷi),

s(i) is the standard error of the estimate obtained after dropping the ith observation from the sample, and

hi is the hat statistic for the ith observation, which is defined as

$$ h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} $$

The additional term in the denominator, 1 − hi, is necessary because the variance of the residuals is not constant: even when the errors have constant variance σ², the variance of the ith residual is σ²(1 − hi). With this adjustment factor we get a t-statistic, which tests whether the ith residual is significantly different from zero and hence signals an outlier, that is, a point that does not really fit the overall pattern.

Alternatively, the same statistic can be obtained as the t-statistic of the coefficient of a dummy variable that picks out a single observation from the sample (equal to one for observation i and zero otherwise).
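As a rough check on the two routes just described, the following Python sketch computes the studentized residuals directly from ei, s(i) and hi, and then compares one of them with the t-statistic on an observation dummy. The function names and the small data set are hypothetical and chosen only for illustration; NumPy is assumed to be available.

```python
import numpy as np


def studentized_residuals(x, y):
    """t_i = e_i / (s(i) * sqrt(1 - h_i)) for the regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta  # residuals from the full sample
    h = 1.0 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # drop observation i
        b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        resid_i = y[keep] - X[keep] @ b_i
        # n - 1 observations and 2 estimated coefficients
        s_i = np.sqrt(np.sum(resid_i ** 2) / (n - 3))
        t[i] = e[i] / (s_i * np.sqrt(1.0 - h[i]))
    return t


def dummy_t_statistic(x, y, i):
    """t-statistic of a dummy that equals 1 for observation i only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    d = np.zeros(n)
    d[i] = 1.0
    X = np.column_stack([np.ones(n), x, d])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - 3)  # 3 estimated coefficients
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta[2] / np.sqrt(cov[2, 2])


# Hypothetical data: the sixth observation lies well above the others.
x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
y_demo = [1.1, 2.0, 2.9, 4.2, 5.1, 9.0, 7.0, 8.1]
t_demo = studentized_residuals(x_demo, y_demo)
print(np.round(t_demo, 2))
print(round(float(dummy_t_statistic(x_demo, y_demo, 5)), 2))
```

The explicit leave-one-out loop is slow for large samples but mirrors the definition of s(i) in the text.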

4) Leverage

A data point has high leverage if it is far removed in the X-direction (i.e., it is a disproportionate distance away from the middle range of the X values) (Myers, 1990).

Points of high leverage can exert undue influence on the outcome of a least squares regression. That is, points with high leverage are capable of exerting a strong pull on the slope of the regression line.

In univariate analysis, the definitions of an outlier and a point of leverage are the same: a point that is an outlier also has high leverage with respect to the mean. In bivariate analysis, a point of high leverage (with respect to the slope coefficient) is one that is far removed in the X-direction (as opposed to an outlier, which is far removed in the Y-direction).

A test statistic for the leverage is the hat statistic:

$$ h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} $$


which serves as a measure of the leverage of the ith data point. It measures leverage because the numerator of the second term is the squared distance of the ith data point from the mean in the X-direction, while its denominator is a measure of the overall variability of the data points along the X-axis. Therefore, the greater the distance of Xi from its mean, the higher the value of hi and the higher the leverage of the ith data point.

hi can vary from 1/n (i.e., close to zero in a large sample) for a point with no leverage, and tends to one for a point with very high leverage. The following guidelines, based on the maximum observed value max(hi), have been suggested (Huber, 1981):

max(hi) < 0.2 safe

0.2 < max(hi) < 0.5 risky

max(hi) > 0.5 excessive leverage
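A minimal Python sketch of the hat statistic and the max(hi) guideline is given below. The X values are those of Regression 4 in the workshop data at the end of this lecture; the function names are hypothetical, and the label "safe" for values below 0.2 is this sketch's own wording.

```python
import numpy as np


def hat_statistics(x):
    """h_i = 1/n + (X_i - mean)^2 / sum_j (X_j - mean)^2."""
    x = np.asarray(x, dtype=float)
    dev2 = (x - x.mean()) ** 2
    return 1.0 / len(x) + dev2 / dev2.sum()


def leverage_verdict(h):
    """Classify a sample by its largest hat statistic."""
    h_max = float(np.max(h))
    if h_max > 0.5:
        return "excessive"
    if h_max > 0.2:
        return "risky"
    return "safe"  # label assumed here for max(h_i) below 0.2


# X values of Regression 4 in the workshop table.
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
h4 = hat_statistics(x4)
print(np.round(h4, 3))
print(leverage_verdict(h4))
```

In this data set the single point at X = 19 accounts for almost all of the variation in X, so its hat statistic sits at the upper bound of one, while the remaining points are close to 1/n.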

5) Influence

A data point is influential if removing it from the sample would markedly change the position of the least squares regression line (Moore and McCabe, 1989). Hence, influential data points pull the regression line in their direction.

Influential data points do not necessarily produce large residuals. That is, they are not always outliers as well, although they can be. Conversely, an outlier is not necessarily an influential point, particularly when it is a point with little leverage.

In univariate analysis, an outlier has high leverage and will be influential. In bivariate analysis, high leverage is a necessary condition for influence on the slope, but not a sufficient one. Similarly, an outlier may not be influential if it has low leverage, and a point of high leverage may not show up as an outlier if its leverage is strong enough to pull the fitted line close to itself.
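The statement that leverage is necessary but not sufficient for influence on the slope can be made concrete with a standard leave-one-out identity for the simple regression of Y on X. It is stated here without proof; S_XX is shorthand introduced only for this sketch, and the hat over β1 marks the least squares estimate, with (i) indicating that observation i has been omitted.

```latex
\hat{\beta}_1 - \hat{\beta}_1(i)
  = \frac{(X_i - \bar{X})\, e_i}{(1 - h_i)\, S_{XX}},
\qquad
S_{XX} = \sum_{j=1}^{n} (X_j - \bar{X})^2 .
```

If Xi equals the mean of X, so that hi takes its minimum value 1/n, the right-hand side is zero however large the residual ei is: a point with no leverage cannot move the slope. Equally, a point far from the mean of X leaves the slope unchanged when its residual is zero, which is why high leverage alone is not sufficient.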

A test for influence is the DFBETA statistic, which is defined as¹:

$$ \mathrm{DFBETA}_i = \frac{\hat{\beta}_1 - \hat{\beta}_1(i)}{SE(\hat{\beta}_1(i))} $$

where the bracket (i) refers to the value of the statistic when the ith observation is excluded from the regression. The DFBETAs measure the sensitivity of the slope coefficient to the deletion of the ith data point.

¹ We suppose that the regression model can be specified as Y = β0 + β1X.


if |DFBETA| < 2/√n, the point has no influence;

if 2/√n < |DFBETA| < 3/√n, the result is inconclusive;

if |DFBETA| > 3/√n, the point is influential.
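The DFBETA definition and cutoffs can be sketched in Python as follows. The data are those of Regression 3 in the workshop table at the end of this lecture; the function names are hypothetical, and the cutoffs are taken to be 2/√n and 3/√n applied to the absolute value of the statistic.

```python
import numpy as np


def slope_and_se(x, y):
    """Least squares slope of y on x and its standard error."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - b0 - b1 * x
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))
    return b1, s / np.sqrt(sxx)


def dfbeta(x, y):
    """DFBETA_i = (b1 - b1(i)) / SE(b1(i)) for every observation i."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    b1, _ = slope_and_se(x, y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # re-estimate without observation i
        b1_i, se_i = slope_and_se(x[keep], y[keep])
        out[i] = (b1 - b1_i) / se_i
    return out


def classify(d, n):
    """Apply the 2/sqrt(n) and 3/sqrt(n) cutoffs to |DFBETA|."""
    a = abs(d)
    if a < 2 / np.sqrt(n):
        return "no influence"
    if a > 3 / np.sqrt(n):
        return "influential"
    return "inconclusive"


# Regression 3 from the workshop table.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
d3 = dfbeta(x3, y3)
for i, d in enumerate(d3, start=1):
    print(i, round(float(d), 2), classify(d, len(x3)))
```

Because the remaining points of this data set lie almost exactly on a line, the standard error after deleting the third observation is very small and its DFBETA is correspondingly large.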

Regression analysis should capture the general pattern in the data, and an influential point can prevent it from doing so. Hence, influential points are often best dropped from the regression.

DFBETAs should always be used in conjunction with diagnostic regression graphics. It is always possible that a cluster of points, rather than a single data point, is exerting influence.

Table 5: Summary measures of outliers, leverage and influence

| Statistic | Definition | Detects | Guidelines |
|---|---|---|---|
| Studentized residual (t_i) | $t_i = \frac{e_i}{s(i)\sqrt{1 - h_i}}$ | Outliers | Critical values are available (higher than for the usual t-test), but t_i is best used as an exploratory tool |
| Hat statistic (h_i) | $h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}$ | Leverage | Bounded by 1/n (no leverage) and 1 (extreme leverage); values above 0.5 indicate excessive leverage and values over 0.2 indicate that the observation may cause problems |
| DFBETA | $\mathrm{DFBETA}_i = \frac{\hat{\beta}_1 - \hat{\beta}_1(i)}{SE(\hat{\beta}_1(i))}$ | Influence | Under 2/√n, the point has no influence; over 3/√n, the point is influential, and strongly so if DFBETA exceeds 2 |

Note: n is the sample size; k is the number of regressors; the subscript (i) (i.e., in parentheses) indicates an estimate from the sample omitting observation i. In each case you should use the absolute value of the calculated statistic.

Source: Mukherjee, C., White, H. and Wuyts, M. (1998), Econometrics and Data Analysis for Developing Countries, London: Routledge.


References

Bao, Nguyen Hoang (1995), Applied Econometrics, Lecture Notes and Readings, Vietnam-Netherlands Project for MA Program in Economics of Development.

Hoaglin, D.C., Mosteller, F. and Tukey, J. (1983), Understanding Robust and Exploratory Data Analysis, New York: John Wiley.

Huber, P.J. (1981), Robust Statistics, New York: John Wiley.

Maddala, G.S. (1992), Introduction to Econometrics, New York: Macmillan.

Moore, D.S. and McCabe, G.P. (1989), Introduction to the Practice of Statistics, New York: Freeman.

Mukherjee, C., White, H. and Wuyts, M. (1998), Econometrics and Data Analysis for Developing Countries, London: Routledge.

Myers, R.H. (1990), Classical and Modern Regression with Applications, Second Edition, Boston, MA: PWS-Kent.


Workshop 3: Outliers, Leverage and Influence

1) Look carefully at the four plots in the attached figure. For each plot, write down whether any of the points is an outlier, a point of high leverage, an influential point, or some combination of these. Briefly comment on your findings.

Hint: [Figure: four scatter plots, labelled Plot 1, Plot 2, Plot 3 and Plot 4; not reproduced in this text]

2) An examination of residuals provides a diagnostic check on the model. When the regression model is inadequately specified, the residuals are not just pure noise. Instead, they contain a message that can help us to specify a better model.

Consider the four different relations between Y and X given below (Anscombe, 1973), a simplified version of some common phenomena.

2.1) Calculate the regression line (Y against X) for each data set, and graph it in panels 1, 2, 3 and 4 together with the data points.

2.2) State which graph corresponds to each of the following situations:

(i) The relation is really curved, rather than linear.

(ii) The positive relation is entirely the result of just one data point.

(iii) The residual variance is entirely the result of just one data point, which may very well have been recorded in error.

(iv) It makes good sense to use the regression line for prediction.

2.3) Briefly, what lesson does this show?


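| Regression 1: Y | Regression 1: X | Regression 2: Y | Regression 2: X | Regression 3: Y | Regression 3: X | Regression 4: Y | Regression 4: X |
|---|---|---|---|---|---|---|---|
| 8.04 | 10 | 9.14 | 10 | 7.46 | 10 | 6.58 | 8 |
| 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 | 8 |
| 7.58 | 13 | 8.74 | 13 | 12.74 | 13 | 7.71 | 8 |
| 8.81 | 9 | 8.77 | 9 | 7.11 | 9 | 8 | 8 |
| 8.33 | 11 | 9.26 | 11 | 7.81 | 11 | 8.47 | 8 |
| 9.96 | 14 | 8.1 | 14 | 8.84 | 14 | 7.04 | 8 |
| 7.24 | 6 | 6.13 | 6 | 6.08 | 6 | 5.25 | 8 |
| 4.26 | 4 | 3.1 | 4 | 5.39 | 4 | 12.5 | 19 |
| 10.84 | 12 | 9.13 | 12 | 8.15 | 12 | 5.56 | 8 |
| 4.82 | 7 | 7.26 | 7 | 6.42 | 7 | 7.91 | 8 |
| 5.68 | 5 | 4.74 | 5 | 5.73 | 5 | 6.89 | 8 |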
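For question 2.1, the four regression lines can be computed with a short Python sketch such as the one below. It assumes NumPy is available and uses the workshop data exactly as they appear in this handout (decimal values are Y, integer values are X); the variable and function names are hypothetical.

```python
import numpy as np

# X values shared by Regressions 1 to 3, and the X values of Regression 4.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

# (X, Y) pairs for each regression, as listed in the handout.
data = {
    "Regression 1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                            7.24, 4.26, 10.84, 4.82, 5.68]),
    "Regression 2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.1,
                            6.13, 3.1, 9.13, 7.26, 4.74]),
    "Regression 3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                            6.08, 5.39, 8.15, 6.42, 5.73]),
    "Regression 4": (x4, [6.58, 5.76, 7.71, 8, 8.47, 7.04,
                          5.25, 12.5, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 polynomial fit
    return float(intercept), float(slope)


fits = {name: fit_line(x, y) for name, (x, y) in data.items()}
for name, (b0, b1) in fits.items():
    print(f"{name}: Y = {b0:.2f} + {b1:.2f} X")
```

Comparing the printed lines with the scatter plots is the point of the exercise: the fitted coefficients alone say little about how well each line describes its data.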

3) The identification of outliers in univariate analysis

Using the data file LEACCESS.WK1, identify whether there are any outliers in each of the following data sets:

4) The identification of outliers in bivariate analysis

4.1) Using the data file AIDSAV, test whether observation 26 (Lesotho) is:

a) an outlier

b) a point of high leverage

c) an influential point

4.2) Draw the scatter plot of S/Y against A/Y, showing the regression lines with and without point 26 in the same graph.

4.3) What happens to R² when observation 26 is dropped from the data set? Explain.

4.4) Are there any other problematic points in the sample?


4.5) Show algebraically that a point with no leverage cannot have any influence on the slope coefficient.

5) Outliers in bivariate analysis

5.1) Using the data file HOLMQ, which contains the data for EDUEXP and EAID, examine the figure and test whether any of the points is:

a) an outlier

b) a point of high leverage

c) an influential point

5.2) Draw the scatter plot, showing the fitted line. Briefly comment on your findings.

Date posted: 27/01/2014, 11:20
