A guide to time series analysis in Stata


TIME SERIES THEORY

A time series (TS) is viewed as a sequence of random variables x1, x2, x3, ..., where x1 is the value of the series at the first time point, x2 the value at the second time point, and so on.

A TS is treated as a stochastic process, meaning that the values in the series arise randomly.

What is white noise? White noise is a TS used to model noise in engineering. It consists of noise values at many time points that have a constant mean and variance and are uncorrelated with one another. Stronger special cases are independent white noise, also called iid (independent and identically distributed) noise, and Gaussian white noise, in which the values are also normally distributed.

If every TS were white noise, traditional statistical methods could be applied to them, because those methods assume that the values of a variable are not correlated with one another. In practice, however, the values of almost every TS are correlated with one another.
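As a minimal sketch of what "uncorrelated" means in practice, the do-file fragment below simulates Gaussian white noise and inspects its autocorrelations. The seed, the sample size of 200, and the variable names t and e are assumptions made for illustration only.

```stata
* Simulate Gaussian white noise and check it for autocorrelation
clear
set seed 12345
set obs 200
gen t = _n
tsset t

gen e = rnormal()          // iid N(0,1) draws, one example of white noise

corrgram e, lags(10)       // autocorrelations and Q statistics up to lag 10
wntestq e, lags(10)        // portmanteau Q test; H0: the series is white noise
```

A large p-value from wntestq means the white-noise null is not rejected; a real series with correlated values would usually give a small one.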

The purpose of time series analysis (TSA) is to identify the pattern in the data and, from it, to forecast future values.

TSA assumes that time series data follow a systematic pattern (the time series pattern, made up of 3 components) plus a random noise term, also called white noise or error, which makes the pattern hard to identify. A TS is therefore said to follow an additive model. Most TSA techniques try to remove the random noise so that the pattern becomes easier to identify.

The systematic pattern of a TS has 3 components, each of which is a function:

- Trend component. The trend is a function whose coefficients are estimated by ordinary least squares (OLS). Common functional forms for the trend include the logistic function, the Mitscherlich function, the Gompertz curve, and the allometric function. To check how well a trend model fits, we can use R2, the squared multiple correlation coefficient (SMCC).

- Seasonal component (seasonality)

- Cyclical component (cycle)
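The additive model and the OLS trend fit can be sketched as follows. The data are simulated, and the linear trend, the 12-period sine wave used as the seasonal component, and all variable names are assumptions chosen for illustration; a real trend may need one of the non-linear forms listed above.

```stata
* Additive model: y = trend + season + noise, with the trend fitted by OLS
clear
set seed 2024
set obs 120
gen t = _n
tsset t

gen trend  = 10 + 0.5*t               // assumed linear trend
gen season = 3*sin(2*_pi*t/12)        // assumed 12-period seasonal pattern
gen y      = trend + season + rnormal()

regress y t                           // OLS estimate of the trend
display "R2 of the trend fit: " e(r2)

predict trend_hat, xb                 // fitted trend
gen detrended = y - trend_hat         // what remains: season + noise
```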

PROPERTIES OF TIME SERIES

A TS is described by its joint distribution function. The following features of this function are the ones needed to describe a time series properly:

- Mean function: the expected value of the series at each time point

- Autocovariance function: the covariance between the values of the series at two time points

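For a TS to be analysed accurately, it has to be stationary. What does stationary mean? Stationarity is the stability of the form of a TS over time.

A TS is (weakly) stationary when its mean function is constant over time and its autocovariance function depends only on the lag between two time points, not on the time points themselves.

A strictly stationary TS is stable in a stronger sense: the probability distribution of the value at any time point is the same as the distribution at any other time point. For example, the probability that x equals 10 at time 1 is the same as the probability that x equals 10 at time 100.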

In practice most TS are not stationary, because they are affected by trend and seasonality:

- A TS that has a trend but no seasonality, and is stationary around that trend, is called a trend-stationary TS

- A TS that has both trend and seasonality, with seasonal swings that scale with the level of the trend, is said to have multiplicative seasonality. Most TS have this form
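One common way to check stationarity in Stata is the augmented Dickey-Fuller test (dfuller). It is not mentioned in the notes above, so treat the sketch below as one possible check rather than the method the author had in mind. The random walk is simulated and the names are hypothetical.

```stata
* Unit-root test on a non-stationary series and on its first difference
clear
set seed 99
set obs 300
gen t = _n
tsset t

gen rw = sum(rnormal())    // random walk: running sum of white noise

dfuller rw                 // H0: unit root (series is non-stationary)
dfuller D.rw               // same test on the first difference
```

For a random walk the first test usually fails to reject the unit-root null, while the differenced series usually rejects it.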

FILTERING TIME SERIES

The purpose of filtering a TS is to smooth out the white noise so that the trend or seasonal component of the TS can be detected.

There are many different filters, including:

- Linear filter, also called moving average (MA): a special filter in this group is the simple moving average of order 2s + 1 or 2s. After MA filtering, the TS appears in the form Y = T + S + white noise. If a difference filter is applied as well, only Y = S + white noise remains.

- The Census X-11 program: filters a seasonal TS down to its seasonal component. The steps are: 1) compute an MA to isolate the trend, 2) subtract it to leave seasonality plus white noise, 3) compute a 5-term MA for each month to leave the seasonal component only. After the program has run, the TS appears as Y = S alone.

- Best local polynomial fit: used when the original TS is curved rather than close to a straight line. This filter first brings the TS to a straight-line form, after which MA can be applied.

- Difference filter: after differencing, the TS reveals its seasonal component, Y = S. Differencing twice is needed when the trend T is polynomial (not a straight line).

- Exponential smoother: used mainly for forecasting.
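Three of the filters above have direct counterparts in official Stata, sketched here on simulated data. The window of 6 observations on each side (s = 6, order 2s + 1 = 13), the seasonal lag of 12, and the smoothing parameter 0.3 are assumptions chosen for the example, not values taken from the notes.

```stata
* Moving average, difference, and exponential filters on a simulated series
clear
set seed 7
set obs 120
gen t = _n
tsset t
gen y = 10 + 0.5*t + 3*sin(2*_pi*t/12) + rnormal()

tssmooth ma y_ma = y, window(6 1 6)        // simple MA: 6 lags, current value, 6 leads
gen y_d1  = D.y                            // first difference, removes a linear trend
gen y_s12 = S12.y                          // seasonal difference, y minus y 12 periods earlier
tssmooth exponential y_es = y, parms(0.3)  // single exponential smoother, alpha = 0.3
```

Plotting y together with y_ma (for example with tsline y y_ma) shows how the moving average suppresses the noise.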

REGRESSION FOR TIME SERIES

In the traditional linear regression model the noise (error) is white, meaning that the errors are uncorrelated. In a regression model for a TS, the errors are correlated.
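The difference can be seen in a small simulation. In this sketch the errors follow an AR(1) process with coefficient 0.7 (an assumed value), and the Durbin-Watson and Breusch-Godfrey tests are used to detect the correlation. Newey-West standard errors are shown as one possible remedy; the notes do not prescribe it.

```stata
* OLS with autocorrelated errors, and two tests that detect them
clear
set seed 42
set obs 200
gen t = _n
tsset t

gen x = rnormal()
gen u = rnormal()
gen e = u
replace e = 0.7*e[_n-1] + u in 2/l   // AR(1) errors, built up observation by observation
gen y = 1 + 2*x + e

regress y x
estat dwatson                        // Durbin-Watson statistic
estat bgodfrey, lags(1)              // Breusch-Godfrey LM test; H0: no serial correlation

newey y x, lag(4)                    // same coefficients, autocorrelation-robust SEs
```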

DISTRIBUTED-LAG MODEL

This methodology allows the effect of a single exposure event to be distributed over a specific period of time, using several parameters to explain the contributions at different lags, thus providing a comprehensive picture of the time-course of the exposure-response relationship.

Conventional DLMs rely on the assumption of a linear effect between the exposure and the outcome.

More recently, a general approach has been proposed to further relax the linearity assumption and to describe non-linear and delayed effects flexibly and simultaneously. This step has led to the new modeling framework of distributed lag non-linear models (DLNMs).

The relationship between the effect of x and the effect on y has to be determined first. There are 4 forms:

- A permanent effect of x produces a permanent effect on y

- A permanent effect of x produces a temporary effect on y

- A temporary (short-term) effect of x produces a temporary effect on y

- A temporary effect of x produces a permanent effect on y: this relationship cannot be handled by a distributed-lag model, so either the temporary effect of x has to be converted into a permanent effect of x, or the permanent effect on y into a temporary effect on y



The finite distributed lag model is most suitable for estimating dynamic relationships when lag weights decline to zero relatively quickly, when the regressor is not highly autocorrelated, and when the sample is long relative to the length of the lag distribution.

The simplest model is the Koyck lag, which has one lag of y on the right-hand side with only the current value of x.

Given the autoregressive lag relationship in equation (3.13), a logical extension is to allow lags of x on the right-hand side as well, which gives the general autoregressive distributed lag (ARDL) model.
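Both models can be estimated with regress and Stata's lag operators. The sketch below assumes the Koyck form y(t) = a + lambda*y(t-1) + b*x(t) + u(t); the simulated coefficients (0.5, 0.6, 0.8) and the choice of two lags of x in the second regression are illustrative assumptions, not the equations of the original text.

```stata
* Koyck lag and a general ARDL estimated by OLS on simulated data
clear
set seed 1
set obs 200
gen t = _n
tsset t

gen x = rnormal()
gen y = rnormal()
replace y = 0.5 + 0.6*y[_n-1] + 0.8*x + rnormal() in 2/l

regress y L.y x                    // Koyck: one lag of y, current x only
display "implied long-run effect of x: " _b[x]/(1 - _b[L.y])

regress y L.y L(0/2).x             // ARDL with lags 0 to 2 of x added
```

In the Koyck model the long-run effect of x is b/(1 - lambda), which is what the display line computes from the estimates.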

In all of the models we have studied, we must specify the length of the lag prior to estimation.

An obvious way to choose the length of a lag is to start with a long lag and test the significance of the coefficient at the longest lag, the "trailing lag". We shorten the lag by one period if we cannot reject the null hypothesis that the effect at the longest lag is zero. We continue shortening the lag until the trailing lag coefficient is statistically significant.
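The shortening procedure can be sketched as a loop that starts at an assumed maximum of 4 lags and reports the p-value of the trailing lag at each length. The data are simulated so that x truly enters with lags 0 to 2; the maximum lag and all names are hypothetical.

```stata
* Test the trailing lag coefficient for lag lengths 4, 3, 2, 1
clear
set seed 11
set obs 200
gen t = _n
tsset t

gen x = rnormal()
gen y = 1 + 0.8*x + 0.5*L.x + 0.3*L2.x + rnormal()

forvalues q = 4(-1)1 {
    quietly regress y L(0/`q').x
    quietly test L`q'.x               // H0: coefficient on the longest lag is zero
    display "max lag `q': p-value of trailing lag = " %6.4f r(p)
}
```

Reading the output from the top, one would stop at the first lag length whose trailing coefficient is significant.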

Information criteria are designed to measure the amount of information about the dependent variable contained in a set of regressors. They are goodness-of-fit measures of the same type as R2 or adjusted R2, but without the convenient interpretation as share of variance explained that we give to R2 in an OLS regression with an intercept term.

The two most commonly used criteria are the Akaike information criterion (AIC) and the Schwarz/Bayesian information criterion (SBIC).
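A minimal sketch of lag selection by information criteria: each candidate model is fitted on the same observations (t > 4, so that four lags are always available) and estat ic reports AIC and BIC. The simulated model, the maximum of 4 lags, and the choice to use the same lag length for y and x are assumptions for illustration.

```stata
* Compare AIC and BIC across lag lengths on a common sample
clear
set seed 21
set obs 200
gen t = _n
tsset t

gen x = rnormal()
gen y = rnormal()
replace y = 0.5*y[_n-1] + 0.8*x + 0.4*x[_n-1] + rnormal() in 2/l

forvalues p = 1/4 {
    quietly regress y L(1/`p').y L(0/`p').x if t > 4
    quietly estat ic
    matrix S = r(S)                   // columns: N, ll(null), ll(model), df, AIC, BIC
    display "lags = `p'   AIC = " %9.3f S[1,5] "   BIC = " %9.3f S[1,6]
}
```

The preferred lag length is the one with the smallest criterion value; AIC and BIC do not always agree, since BIC penalises extra parameters more heavily.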

As discussed above, lags of x and/or y can be added to the right-hand side of a regression. When using residual autocorrelation to determine lag length, one adds lags until the residuals appear to be white noise. After running the distributed-lag regression, one computes the residuals and uses a Breusch-Godfrey LM test or a Box-Ljung Q test to test the null hypothesis that the residuals are white noise. Rejecting the white-noise null implies that more lags should be added to the regression according to this criterion.
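The residual-based criterion might look like this in Stata. The data are simulated with one true lag of y, so the static regression should leave autocorrelated residuals and the model with L.y should not. Variable names and coefficients are assumptions.

```stata
* Add lags until the residuals look like white noise
clear
set seed 31
set obs 200
gen t = _n
tsset t

gen x = rnormal()
gen y = rnormal()
replace y = 0.6*y[_n-1] + 0.5*x + rnormal() in 2/l

regress y x                        // no lags
predict r0, residuals
wntestq r0                         // Ljung-Box Q test on the residuals

regress y L.y x                    // one lag of y added
estat bgodfrey, lags(1 2)          // Breusch-Godfrey LM test at lags 1 and 2
predict r1, residuals
wntestq r1
```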

ARDL MODEL

Before running ARDL, a few preconditions need to be checked.

First of all we need the ARDL module for Stata. Type “findit ardl” in the Stata command window; it will show a link to the ARDL module. Click the link and install the module.

The command is “ardl depvar indepvar1 indepvar2 … , aic”. Here aic requests automatic lag selection by the Akaike information criterion.

I have compared the results with the ARDL output of EViews; they are about 90% similar. The slight difference arises because the two packages use different methods to calculate standard errors. The next command is “ardl, noctable btest”, which shows the ARDL bounds test and its critical values. As expected, the critical values are the same as in EViews, but the bounds test statistic is slightly larger in Stata (5.62, against 5.43 in EViews), so Stata is marginally more likely to indicate cointegration.

Next you need the long-run and short-run coefficients, which can be estimated with “ardl depvar indepvar1 indepvar2 … , aic ec regstore(ecreg)”.

Here ec produces the error correction version of the model, with aic as the criterion for the lag order. The important part is the regstore(name) option, which is explained later.

In the output, LR holds the long-run estimates, SR the short-run estimates, and ADJ the adjustment coefficient, that is, the error correction coefficient. To run post-estimation diagnostics, the ardl results first have to be converted to regress format.

For this, type “estimates restore ecreg”, which brings the stored results of the ardl ECM model back into memory. Typing “regress” then displays the ECM results as regress output.

At this point the following commands can be used:

“estat archlm” for the ARCH LM test for autoregressive conditional heteroskedasticity

“estat bgodfrey” for the Breusch-Godfrey LM test for higher-order autocorrelation

“estat hettest” for the Breusch-Pagan heteroskedasticity test

“estat ovtest” for the Ramsey RESET test

“estat vif” for the VIF test of multicollinearity

For all these tests the decision rule is stated in terms of the null and alternative hypotheses. I have not yet found how to run the coefficient stability (CUSUM) test in Stata.
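The same diagnostics can be tried without the ardl module, because they are post-estimation commands of regress. The sketch below uses simulated data and an explicitly generated lag variable (y_lag1, a hypothetical name) in place of the stored ecreg results.

```stata
* Post-estimation diagnostics after an ordinary regress on time series data
clear
set seed 51
set obs 200
gen t = _n
tsset t

gen x = rnormal()
gen y = rnormal()
replace y = 0.6*y[_n-1] + 0.5*x + rnormal() in 2/l
gen y_lag1 = L.y                   // explicit lag variable

regress y y_lag1 x
estat archlm, lags(1)              // ARCH LM test
estat bgodfrey, lags(1)            // Breusch-Godfrey LM test
estat hettest                      // Breusch-Pagan test
estat ovtest                       // Ramsey RESET test
estat vif                          // variance inflation factors
```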


NON-LINEAR AUTOREGRESSIVE DISTRIBUTED LAG (NARDL) MODEL

This post illustrates the non-linear ARDL (NARDL) cointegrating bounds approach (Shin, Yu and Greenwood-Nimmo, 2014). The idea behind the model is to question the standard assumption of symmetric estimates, under which the effect of an increase in a variable is equal and opposite to the effect of a decrease in the same variable. The study mentions a few cases where this fails, such as the creation and destruction of jobs in booms and recessions.

The example uses unemployment (dependent variable) and the industrial production index (independent variable). You can import this data into Stata.

Once the data are imported, Stata has to be told that they are time series. The following command is used:

tsset time

This makes all the time series commands available. To estimate the model, the nardl ado-file has to be in the Stata/ado/base/n folder, wherever Stata is installed; it will then work in Stata.

In the command below, p() and q() are the numbers of lags of the dependent and independent variables. The optimal lag can be identified with ‘varsoc’.

nardl un ip, p(2) q(2)


The table above is the standard one-step ECM. The first coefficient is the convergence coefficient; x1 is the first independent variable, with x1p the increasing portion of x1 and x1n the decreasing portion of x1.
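In the NARDL literature the increasing and decreasing portions are partial sums of the positive and negative changes in x. The sketch below builds them by hand on a simulated series; the names x_p and x_n are hypothetical, and it is an assumption that nardl constructs x1p and x1n in exactly this way.

```stata
* Split a series into partial sums of its positive and negative changes
clear
set seed 3
set obs 100
gen t = _n
tsset t

gen double x  = sum(rnormal())     // simulated regressor
gen double dx = D.x                // period-to-period change

gen double dx_pos = max(dx, 0)     // increases only (max() ignores the missing first change)
gen double dx_neg = min(dx, 0)     // decreases only

gen double x_p = sum(dx_pos)       // running sum of increases
gen double x_n = sum(dx_neg)       // running sum of decreases

* The two partial sums add back up to the original series
assert abs(x - (x[1] + x_p + x_n)) < 1e-8
```

Entering x_p and x_n as separate regressors is what allows increases and decreases to have different coefficients.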

Below is the F bounds test. Here the statistic is 2.22; its critical values are the same as those of the simple ARDL model, and the statistic is smaller than the critical values.

The table below shows the long-run effect of increases and decreases in the independent variable on the dependent variable. When the independent variable increases, unemployment falls by 14.71%, but when the independent variable decreases, unemployment rises by 48.69%.

Long-run and short-run asymmetry are tested with F tests. Since only the long-run F test is significant, there is only long-run asymmetry.

After estimation, four types of diagnostics are reported. Since all of them are insignificant, there is no sign of autocorrelation, heteroscedasticity, or the other problems tested.

We can also generate the graph by adding the ‘plot’ option to the command, and confidence intervals by using the bootstrap and level options. The horizon option sets how many time periods the graph covers.

nardl un ip, p(2) q(4) plot horizon(40) bootstrap(100) level(95)

In the figure above, a decrease in IP (industrial production) has a positive effect on UN (unemployment), shown by the red line, while an increase in IP has a temporary negative effect on UN, shown by the green line. The blue line shows the asymmetry growing over time.

DLNM MODEL

Run in R.

Example: fit a model for cardiovascular mortality with pm10 and temperature.

- Create a cross-basis for pm10 and for temperature

The arguments in the command:

argvar: builds the basis for the variable (var)

arglag: builds the basis for the lag

fun: the function used for var or lag. The following types of fun are available:

o lin: linear function

o poly: polynomial function

o strata: stratified (step) function

o ns: natural cubic spline, a degree-3 function, here with df=5

internal knots: the knot positions inside the range of the variable; if none are listed, the defaults are used

boundary knots: the knots at the boundaries, which here are also the limits of the temperature range

lag = 15: 15 lags for the pm10 variable

lag = 3: 3 lags for the temperature variable

lag strata: 0 and 1-3; breaks = 1 means that 1 is the lower bound of the 1-3 stratum

Build the model: mortality = cross-basis of pm10 + cross-basis of temperature + cubic smooth function of time + day of the week (dow).

Estimate the association between a given pm10 level and mortality (in other words, estimate the parameter of the pm10 variable, expressed as RR, the relative risk).

at = 0:20: computes the prediction for each pm10 value from 0 to 20 µg/m3

bylag = 0.2: the predictions are computed along the lag space in increments of 0.2

cumul (defaults to FALSE): indicates that incremental cumulative associations along lags must also be included

The function call includes the pred1.pm object with the stored results, and the argument "slices" specifies that we want to graph the relationships corresponding to specific values of predictor and lag in the related dimensions. With var=10 I display the lag-response relationship for a specific value of PM10, i.e. 10 µg/m3. This association is defined using the reference value of 0 µg/m3.

CREATING DATE VARIABLES FOR TIME SERIES

gen date = tm(2000m1) + _n - 1

format %tm date

Creates date as a monthly variable starting in January 2000. tq (quarterly), tm (monthly), tw (weekly), and ty (yearly) can be used in the same way. The format command changes the display from an integer to a date (ty: year, tq: quarter-year, tm: month-year, tw: week-year); the format has to match the function used to create the variable.

gen date = mdy(month, day, year)

Creates date from existing month, day, and year variables.

gen day = day(date)

gen week = week(date)

gen month = month(date)

gen quarter = quarter(date)

gen half = halfyear(date)

gen year = year(date)

Creates day, week, month, quarter, half-year, and year variables from a date variable that holds a full day/month/year date.

Day-of-week (dow) and day-of-year (doy) variables can likewise be created from a date variable that holds a full day/month/year date.
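The commands for the last item did not survive in the text; Stata's dow() and doy() functions do this, as sketched below. The ten daily observations starting on 1 January 2000 and the variable names are assumptions for illustration.

```stata
* Build a daily date and derive day of week, day of year, and month-year
clear
set obs 10
gen month = 1
gen day   = _n
gen year  = 2000

gen date = mdy(month, day, year)   // daily date from its three parts
format %td date

gen dow = dow(date)                // 0 = Sunday, ..., 6 = Saturday
gen doy = doy(date)                // 1 = 1 January

gen mdate = mofd(date)             // monthly date containing each daily date
format %tm mdate

list date dow doy mdate in 1/3
```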

SETTING THE TIME VARIABLE FOR TIME SERIES ANALYSIS

Declares date as the time variable with a delta of 1 quarter. The quarterly option is not needed when date has already been given the %tq format.
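The command itself is missing from the text; tsset is the usual way to declare the time variable, so the sketch below assumes that is what was meant. It shows both cases: a variable that already has the %tq format, and an unformatted one (qnum, a hypothetical name) for which the quarterly option supplies the unit.

```stata
* Declare a quarterly time variable with tsset
clear
set obs 8

gen date = tq(2000q1) + _n - 1
format %tq date
tsset date                         // unit and delta are taken from the %tq format

gen qnum = tq(2000q1) + _n - 1     // same values, no date format
tsset qnum, quarterly              // the option tells Stata the unit is quarters
```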

Declares date as the time variable with …
