Chapter 23: The SIMILARITY Procedure
Example 23.3: Sliding Similarity Analysis
This example illustrates how to use sliding similarity analysis to compare two time sequences. The SASHELP.WORKERS data set contains two similar time series variables (ELECTRIC and MASONRY), which represent employment over time. The following statements create an example data set that contains two time series of differing lengths, where the variable MASONRY has the first 12 and last 7 observations set to missing to simulate the lack of data associated with the target series:
data workers;
   set sashelp.workers;
   /* keep MASONRY only from JAN1978 through DEC1981; set it to missing elsewhere */
   if '01JAN1978'D <= date < '01JAN1982'D then masonry = masonry;
   else masonry = .;
run;
The goal of sliding similarity measures analysis is to find the slide index that corresponds to the most similar subsequence of the input series when compared to the target sequence. The following statements perform sliding similarity analysis on the example data set:
proc similarity data=workers out=_NULL_ print=(slides summary);
id date interval=month;
input electric;
target masonry / slide=index measure=msqrdev
expand=(localabs=3 globalabs=3) compress=(localabs=3 globalabs=3);
run;
The DATA=WORKERS option specifies that the input data set WORK.WORKERS is to be used in the analysis. The OUT=_NULL_ option specifies that no output time series data set is to be created. The PRINT=(SLIDES SUMMARY) option specifies that the ODS tables related to the sliding similarity measures and their summary be produced. The INPUT statement specifies that the input variable is ELECTRIC. The TARGET statement specifies that the target variable is MASONRY and that the similarity measure be computed using mean squared deviation (MEASURE=MSQRDEV). The SLIDE=INDEX option specifies observation index sliding. The COMPRESS=(LOCALABS=3 GLOBALABS=3) option limits local and global absolute compression to 3. The EXPAND=(LOCALABS=3 GLOBALABS=3) option limits local and global absolute expansion to 3.
Output 23.3.1 Summary of the Slide Measures
The SIMILARITY Procedure
Slide Measures Summary for Input=ELECTRIC and Target=MASONRY
Columns: Slide Index, DATE, Slide Target Sequence Length, Slide Input Sequence Length, Slide Warping Amount, Slide Minimum Measure
Output 23.3.2 Minimum Measure
Minimum Measure Summary
Input Variable MASONRY
ELECTRIC 322.5460
This analysis results in 23 slides based on the observation index. The minimum measure (322.5460) occurs at slide index 13, which corresponds to the time value FEB1978. Note that the original data set SASHELP.WORKERS was modified beginning at the time value JAN1978. This similarity analysis justifies the belief that ELECTRIC lags MASONRY by one month based on the time series cross-correlation analysis despite the lack of target data (MASONRY).
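The mapping from a slide index to a time value can be checked with simple month arithmetic. The following Python sketch is illustrative only and is not part of PROC SIMILARITY. It assumes that the input series starts in JAN1977, which follows from MASONRY having its first 12 monthly observations set to missing before JAN1978, and that slide index 0 aligns the target with the first input observation. The function name slide_start is hypothetical.

```python
from datetime import date

def slide_start(first_obs: date, slide_index: int) -> date:
    """Return the first day of the month that lies slide_index months after first_obs."""
    months = first_obs.year * 12 + (first_obs.month - 1) + slide_index
    return date(months // 12, months % 12 + 1, 1)

# Assumption: the ELECTRIC series begins in JAN1977 (12 months before JAN1978).
first_obs = date(1977, 1, 1)

# Slide index 13 lands on February 1978, matching the FEB1978 time value.
print(slide_start(first_obs, 13))  # 1978-02-01
```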
The goal of seasonal sliding similarity measures is to find the seasonal slide index that corresponds to the most similar seasonal subsequence of the input series when compared to the target sequence. The following statements repeat the preceding similarity analysis on the example data set with seasonal sliding:
proc similarity data=workers out=_NULL_ print=(slides summary);
id date interval=month;
input electric;
target masonry / slide=season measure=msqrdev;
run;
Output 23.3.3 Summary of the Seasonal Slide Measures
The SIMILARITY Procedure
Slide Measures Summary for Input=ELECTRIC and Target=MASONRY
Columns: Slide Index, DATE, Slide Target Sequence Length, Slide Input Sequence Length, Slide Warping Amount, Slide Minimum Measure
Output 23.3.4 Seasonal Minimum Measure
Minimum Measure Summary
Input Variable MASONRY
ELECTRIC 641.9273
The analysis differs from the previous analysis in that the slides are performed based on the seasonal index (SLIDE=SEASON) with no warping. With a seasonality of 12, two seasonal slides are considered, at slide indices 0 and 12, with the minimum measure (641.9273) occurring at slide index 12, which corresponds to the time value JAN1978. Note that the original data set SASHELP.WORKERS was modified beginning at the time value JAN1978. This similarity analysis justifies the belief that ELECTRIC and MASONRY have similar seasonal properties based on seasonal decomposition analysis despite the lack of target data (MASONRY).
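To make the idea of sliding without warping concrete, the following Python sketch computes a mean squared deviation for each admissible slide and picks the smallest. It is a simplified illustration under stated assumptions, not the procedure's implementation: the function name sliding_msqrdev is hypothetical, the data are synthetic stand-ins for a 67-observation input and a 48-observation target (12 + 48 + 7 = 67), and the seasonal case is assumed to advance in steps of one full season starting from slide index 0.

```python
def sliding_msqrdev(input_seq, target_seq, step=1):
    """Mean squared deviation between target_seq and each equal-length window
    of input_seq, moving the window step observations at a time (no warping)."""
    n = len(target_seq)
    results = []
    for slide in range(0, len(input_seq) - n + 1, step):
        window = input_seq[slide:slide + n]
        msd = sum((x - y) ** 2 for x, y in zip(window, target_seq)) / n
        results.append((slide, msd))
    return results

# Synthetic data (not SASHELP.WORKERS): a trend plus a 12-period pattern.
input_seq = [float(i % 12) + 0.1 * i for i in range(67)]
target_seq = input_seq[12:60]  # 48 values copied from offset 12

# step=12 mimics seasonal sliding with a seasonality of 12.
seasonal = sliding_msqrdev(input_seq, target_seq, step=12)
print([slide for slide, _ in seasonal])   # [0, 12]
print(min(seasonal, key=lambda r: r[1]))  # (12, 0.0)
```

With step=1 the same function enumerates the 20 observation-index slides that fit without warping; the additional slides reported by the procedure when COMPRESS= and EXPAND= are specified are not modeled here.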
Example 23.4: Searching for Historical Analogies
This example illustrates how to search for historical analogies by using seasonal sliding similarity analysis of transactional time-stamped data. The SASHELP.TIMEDATA data set contains the variable (VOLUME), which represents activity over time. The following statements create an example data set that contains two time series of differing lengths, where the variable HISTORY represents the historical activity and RECENT represents the more recent activity:
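data timedata;
   set sashelp.timedata;
   drop volume;
   /* by default, VOLUME belongs to the historical series */
   recent = .;
   history = volume;
   /* from 20AUG2000 onward, VOLUME belongs to the recent series */
   if datetime >= '20AUG2000:00:00:00'DT then do;
      recent = volume;
      history = .;
   end;
run;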
The goal of seasonal sliding similarity measures is to find the seasonal slide index that corresponds to the most similar seasonal subsequence of the input series when compared to the target sequence. The following statements perform similarity analysis on the example data set with seasonal sliding:
proc similarity data=timedata out=_NULL_ outsequence=sequences
outsum=summary;
id datetime interval=dtday accumulate=total
start='27JUL1997:00:00:00'dt end='21OCT2000:11:59:59'DT;
input history / normalize=absolute;
target recent / slide=season normalize=absolute measure=mabsdev;
run;
The DATA=TIMEDATA option specifies that the input data set WORK.TIMEDATA be used in the analysis. The OUT=_NULL_ option specifies that no output time series data set is to be created. The OUTSEQUENCE=SEQUENCES and OUTSUM=SUMMARY options specify the output sequences and summary data sets, respectively. The ID statement specifies that the time ID variable is DATETIME, which is to be accumulated on a daily basis (INTERVAL=DTDAY) by summing the transactions (ACCUMULATE=TOTAL). The ID statement also specifies that the data is accumulated on the weekly boundaries starting on the week of 27JUL1997 and ending on the week of 15OCT2000 (START='27JUL1997:00:00:00'DT END='21OCT2000:11:59:59'DT). The INPUT statement specifies that the input variable is HISTORY, which is to be normalized by using absolute normalization (NORMALIZE=ABSOLUTE). The TARGET statement specifies that the target variable is RECENT, which is to be normalized by using absolute normalization (NORMALIZE=ABSOLUTE) and that the similarity measure be computed by using mean absolute deviation (MEASURE=MABSDEV). The SLIDE=SEASON option specifies season index sliding.
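The accumulation step described above (INTERVAL=DTDAY with ACCUMULATE=TOTAL) amounts to summing all transactions that fall on the same day. The following Python sketch illustrates only that idea. The function name accumulate_daily_totals and the sample transactions are hypothetical, and the sketch does not reproduce how the procedure treats days without transactions or the START= and END= boundaries.

```python
from collections import defaultdict
from datetime import datetime

def accumulate_daily_totals(transactions):
    """Sum (timestamp, value) pairs into one total per calendar day."""
    totals = defaultdict(float)
    for stamp, value in transactions:
        totals[stamp.date()] += value
    return dict(sorted(totals.items()))

# Hypothetical time-stamped transactions (not taken from SASHELP.TIMEDATA).
transactions = [
    (datetime(2000, 8, 20, 9, 15), 3.0),
    (datetime(2000, 8, 20, 17, 40), 2.0),
    (datetime(2000, 8, 21, 8, 5), 4.0),
]

daily = accumulate_daily_totals(transactions)
for day, total in daily.items():
    print(day, total)
```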
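To illustrate the results of the similarity analysis, the output sequence data set must be subset by using the output summary data set. The first DATA step stores the minimum measure in the macro variable MEASURE, and the second DATA step keeps only the observations of the slide whose measure matches that value: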
data _NULL_; set summary;
call symput('MEASURE', left(trim(putn(recent,'BEST20.'))));
run;
data result; set sequences;
by _SLIDE_;
retain flag 0;
if first._SLIDE_ then do;
if (&measure - 0.00001 < _SIM_ < &measure + 0.00001)
then flag = 1;
end;
if flag then output;
if last._SLIDE_ then flag = 0;
run;
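The subsetting logic of the preceding DATA steps can be sketched outside SAS as follows. This Python fragment is a hedged illustration, not a translation of the SAS semantics in every detail: the dictionary keys slide and sim are hypothetical stand-ins for the _SLIDE_ and _SIM_ variables, the rows are invented, and the rows are assumed to be sorted by slide, as the BY statement requires.

```python
from itertools import groupby

def rows_of_best_slide(rows, minimum, tol=1e-5):
    """Keep every row of each slide whose first row has a similarity
    measure within tol of the given minimum measure."""
    kept = []
    for _, group in groupby(rows, key=lambda r: r["slide"]):
        group = list(group)
        if abs(group[0]["sim"] - minimum) < tol:
            kept.extend(group)
    return kept

# Invented rows: two slides with two observations each.
rows = [
    {"slide": 0, "sim": 2.5, "value": 10.0},
    {"slide": 0, "sim": 2.5, "value": 11.0},
    {"slide": 7, "sim": 1.2, "value": 20.0},
    {"slide": 7, "sim": 1.2, "value": 21.0},
]

minimum = min(r["sim"] for r in rows)
result = rows_of_best_slide(rows, minimum)
print([r["value"] for r in result])  # [20.0, 21.0]
```

The tolerance plays the same role as the 0.00001 band in the DATA step: the minimum measure passes through a formatted macro variable, so an exact equality test could fail.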
The following statements generate a cross series plot of the results:
proc timeseries data=result out=_NULL_ crossplot=series;
id datetime interval=dtday;
var _TARSEQ_;
crossvar _INPSEQ_;
run;
The cross series plot illustrates that the historical time series analogy most similar to the most recent time series data that started on 20AUG2000 occurred on 02AUG1998.
Output 23.4.1 Cross Series Plot of the Historical Time Series
Chapter 24: The SIMLIN Procedure
Contents
Overview: SIMLIN Procedure
Getting Started: SIMLIN Procedure
   Prediction and Simulation
Syntax: SIMLIN Procedure
   Functional Summary
   PROC SIMLIN Statement
   BY Statement
   ENDOGENOUS Statement
   EXOGENOUS Statement
   ID Statement
   LAGGED Statement
   OUTPUT Statement
Details: SIMLIN Procedure
   Defining the Structural Form
   Computing the Reduced Form
   Dynamic Multipliers
   Multipliers for Higher Order Lags
   EST= Data Set
   DATA= Data Set
   OUTEST= Data Set
   OUT= Data Set
   Printed Output
   ODS Table Names
Examples: SIMLIN Procedure
   Example 24.1: Simulating Klein’s Model I
   Example 24.2: Multipliers for a Third-Order System
References
Overview: SIMLIN Procedure
The SIMLIN procedure reads the coefficients for a set of linear structural equations, which are usually produced by the SYSLIN procedure. PROC SIMLIN then computes the reduced form and, if input data are given, uses the reduced form equations to generate predicted values. PROC SIMLIN is especially useful when dealing with sets of structural difference equations. The SIMLIN procedure can perform simulation or forecasting of the endogenous variables.
The SIMLIN procedure can be applied only to models that are:
linear with respect to the parameters
linear with respect to the variables
square (as many equations as endogenous variables)
nonsingular (the coefficients of the endogenous variables form an invertible matrix)
Getting Started: SIMLIN Procedure
The SIMLIN procedure processes the coefficients in a data set created by the SYSLIN procedure using the OUTEST= option or by another regression procedure such as PROC REG. To use PROC SIMLIN, you must first produce the coefficient data set and then specify this data set on the EST= option of the PROC SIMLIN statement. You must also tell PROC SIMLIN which variables are endogenous and which variables are exogenous. List the endogenous variables in an ENDOGENOUS statement, and list the exogenous variables in an EXOGENOUS statement.
The following example illustrates the creation of an OUTEST= data set with PROC SYSLIN and the computation and printing of the reduced form coefficients for the model with PROC SIMLIN:
proc syslin data=in outest=e;
model y1 = y2 x1;
model y2 = y1 x2;
run;
proc simlin est=e;
endogenous y1 y2;
exogenous x1 x2;
run;
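To show what computing the reduced form means for a two-equation model of this shape, the following Python sketch solves the structural equations for the endogenous variables. It is a minimal illustration, not the procedure's algorithm. The notation G y = B x, the function name reduced_form_2x2, and all coefficient values are assumptions introduced here, and intercepts are omitted for brevity. The nonsingularity requirement appears as the check on the determinant of G.

```python
def reduced_form_2x2(G, B):
    """Return Pi = inverse(G) * B for 2x2 matrices given as nested lists,
    so that the structural form G y = B x becomes y = Pi x."""
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    if det == 0:
        raise ValueError("G is singular: the model has no reduced form")
    G_inv = [[G[1][1] / det, -G[0][1] / det],
             [-G[1][0] / det, G[0][0] / det]]
    return [[sum(G_inv[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Hypothetical structural coefficients:
#   y1 = 0.5*y2 + 3.0*x1
#   y2 = 0.5*y1 + 1.5*x2
G = [[1.0, -0.5], [-0.5, 1.0]]   # endogenous terms moved to the left side
B = [[3.0, 0.0], [0.0, 1.5]]     # exogenous coefficients

Pi = reduced_form_2x2(G, B)
print([[round(v, 6) for v in row] for row in Pi])  # [[4.0, 1.0], [2.0, 2.0]]
```

Each row of Pi expresses one endogenous variable in terms of the exogenous variables alone; for example, the first row reads y1 = 4*x1 + 1*x2 under the assumed coefficients.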
If the model contains lagged endogenous variables, you must also use a LAGGED statement to tell PROC SIMLIN which variables contain lagged values, which endogenous variables they are lags of, and the number of periods of lagging. For dynamic models, the TOTAL and INTERIM= options can be used on the PROC SIMLIN statement to compute and print total and impact multipliers. (See "Dynamic Multipliers" later in this section for an explanation of multipliers.)
In the following example, the variables Y1LAG1, Y2LAG1, and Y2LAG2 contain lagged values of the endogenous variables Y1 and Y2. Y1LAG1 and Y2LAG1 contain values of Y1 and Y2 for the previous observation, while Y2LAG2 contains 2-period lags of Y2. The LAGGED statement specifies the lagged relationships, and the TOTAL and INTERIM= options request multiplier analysis. The INTERIM=2 option prints matrices showing the impact that changes to the exogenous variables have on the endogenous variables after 1 and 2 periods.
data in; set in;
y1lag1 = lag(y1);
y2lag1 = lag(y2);
y2lag2 = lag2(y2);
run;
proc syslin data=in outest=e;
model y1 = y2 y1lag1 y2lag2 x1;
model y2 = y1 y2lag1 x2;
run;
proc simlin est=e total interim=2;
endogenous y1 y2;
exogenous x1 x2;
lagged y1lag1 y1 1 y2lag1 y2 1 y2lag2 y2 2;
run;
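The LAGGED statement above lists triples of the form lagged variable, endogenous variable, number of periods. The following Python sketch mirrors that structure to show what each lagged column contains. It is an illustration only: the function lag, the name lagged_spec, and the data values are hypothetical, and None stands in for a SAS missing value.

```python
def lag(series, k=1):
    """Shift series down by k observations, padding the start with None."""
    keep = max(len(series) - k, 0)
    return [None] * min(k, len(series)) + list(series[:keep])

# Same triples as the LAGGED statement: (lagged variable, source variable, periods).
lagged_spec = [("y1lag1", "y1", 1), ("y2lag1", "y2", 1), ("y2lag2", "y2", 2)]

# Invented data values for the endogenous variables.
data = {"y1": [1.0, 2.0, 3.0, 4.0], "y2": [10.0, 20.0, 30.0, 40.0]}
for name, source, periods in lagged_spec:
    data[name] = lag(data[source], periods)

print(data["y1lag1"])  # [None, 1.0, 2.0, 3.0]
print(data["y2lag2"])  # [None, None, 10.0, 20.0]
```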
After the reduced form of the model is computed, the model can be simulated by specifying an input data set on the PROC SIMLIN statement and using an OUTPUT statement to write the simulation results to an output data set. The following example modifies the PROC SIMLIN step from the preceding example to simulate the model and stores the results in an output data set:
proc simlin est=e total interim=2 data=in;
endogenous y1 y2;
exogenous x1 x2;
lagged y1lag1 y1 1 y2lag1 y2 1 y2lag2 y2 2;
output out=sim predicted=y1hat y2hat
residual=y1resid y2resid;
run;
Prediction and Simulation
If an input data set is specified with the DATA= option in the PROC SIMLIN statement, the procedure reads the data and uses the reduced form equations to compute predicted and residual values for each of the endogenous variables. (If no data set is specified with the DATA= option, no simulation of the system is performed, and only the reduced form and multipliers are computed.)
The character of the prediction is based on the START= value. Until PROC SIMLIN encounters the START= observation, actual endogenous values are found and fed into the lagged endogenous terms. Once the START= observation is reached, dynamic simulation begins: predicted values are fed into the lagged endogenous terms until the end of the data set is reached.
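The switch from actual to predicted lagged values can be illustrated with a single reduced form equation. The following Python sketch is a simplified illustration under stated assumptions, not the procedure's implementation: the equation y[t] = a*y[t-1] + b*x[t], the function name simulate, the zero-based start index, and all data values are hypothetical. It also assumes that the prediction at the start observation itself still uses the actual lagged value, with predicted lags taking over from the next observation.

```python
def simulate(actual_y, x, a, b, start):
    """Predict y[t] = a * y[t-1] + b * x[t].

    Up to and including index start, the lagged term is the actual value.
    After start, the lagged term is the previous prediction (dynamic simulation).
    """
    predicted = [None]  # no lagged value exists for the first observation
    for t in range(1, len(x)):
        use_actual = t <= start or predicted[t - 1] is None
        lagged = actual_y[t - 1] if use_actual else predicted[t - 1]
        predicted.append(a * lagged + b * x[t])
    residual = [None if p is None else y - p
                for y, p in zip(actual_y, predicted)]
    return predicted, residual

# Invented data and coefficients.
actual_y = [4.0, 2.0, 4.0, 2.0, 4.0]
x = [0.0, 1.0, 1.0, 1.0, 1.0]
predicted, residual = simulate(actual_y, x, a=0.5, b=1.0, start=2)
print(predicted)  # [None, 3.0, 2.0, 2.0, 2.0]
print(residual)   # [None, -1.0, 2.0, 0.0, 2.0]
```

In this invented data set the predictions at indices 3 and 4 no longer react to the swings in the actual series, because each one is built on the previous prediction rather than on the previous actual value.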
The predicted and residual values generated here are different from those produced by the SYSLIN procedure because PROC SYSLIN uses the structural form with actual endogenous values. The