VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY
* * *
CHU THAI HOA
METHODOLOGY OF RELATIONAL DATA MINING FOR
STOCK MARKET PREDICTION
Major: Information Technology
Code: 1.01.10
MASTER'S THESIS
Instructor: Prof. Dr. HO TU BAO
Hanoi, June 2007
ABSTRACT
This thesis presents the methodology of relational data mining for stock market prediction by clarifying each problem related to the keywords: methodology, relational, data mining, stock market, and prediction, and then coming to the methodology of relational data mining, with emphasis on Machine Methods for Discovering Regularities (MMDR), for stock market prediction.
Stock market prediction has been widely studied as a time-series prediction problem. Deriving relationships that allow one to predict future values of a time series is challenging. One approach to prediction is to spot patterns in the past, when we already know what followed them, and to test them on more recent data. If a pattern is followed by the same outcome frequently enough, we can gain confidence that it is a genuine relationship.
The purpose of relational data mining (RDM) is to overcome the limitations of attribute-based learning methods (commonly used in finance) in representing background knowledge and complex relations. RDM approaches look for patterns that involve multiple tables (relations) from a relational database. This approach will play a key role in future advances in data mining methodology and practice.
The MMDR method is one of the few Hybrid Probabilistic Relational Data Mining methods developed and applied to stock market data. The method has an advantage in handling numerical data. It expresses patterns in First-order Logic (FOL) and assigns probabilities to rules generated by composing patterns. This will be made clear through an application of MMDR with a computational experiment on price index data of Standard and Poor's 500.
The thesis consists of three chapters concentrating on relational data mining methodology for stock market prediction.
ACKNOWLEDGEMENTS
This thesis would not have been completed without the help and support of many people. I would like to take this opportunity to express my gratitude to the many people who helped me during the period of work leading to this thesis.
In particular, I would like to thank my instructor, Prof. Dr. HO Tu Bao, for his courage in accepting me as a Master's student, for his enthusiasm, his knowledge and his encouragement in the work throughout. I would never have been able to finish this thesis without his encouragement as well as his strict requirement for quality of the research.
I also enjoyed and appreciated the fruitful exchange of ideas with Dr. NGUYEN Trong Dung, to whom I am also grateful for comments on the thesis. In the early days of my research, Dr. HA Quang Thuy, Dr. PHAM Tran Nhu and Dr. DO Van Thanh stimulated my interest in data mining for financial forecasting. I am thankful for that and for the many discussions I had with them.
I am indebted to CFO LE The Anh and CFO NGUYEN Minh Quang for their patience with my questions on financial and stock market forecasting. I am also grateful to Dr. PHAM Ngoc Khoi, Dr. NGUYEN Phu Chien, MSc. DAO Van Thanh, and Mrs. LE Thi Hoang My for words of encouragement during the months of thesis work and for their style-improving suggestions. My thanks also go to everyone who has provided support or advice to me on data mining, the stock market, forecasting and so on in one way or another.
My family has given me every favorable condition to complete the thesis. I dedicate the thesis to my father, my mother and my younger brother, whose love and support are always with me.
Hanoi, June 2007,
CHU Thai Hoa
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES AND FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
Problem definition
Motivations of the Thesis
Objectives of the Thesis
Method of the Thesis study
Structure of the Thesis
CHAPTER I: OVERVIEW OF STOCK MARKET PREDICTION IN DATA MINING
1.1 Introduction to stock market prediction
1.1.1 Basic concepts of forecast
1.1.2 Prediction tasks in stock market
1.1.3 Stock market time series properties
1.1.4 Stock market prediction with the efficient market theory
1.1.5 Questions in stock market prediction
1.1.6 Challenges and Possibilities on Developing a Stock Market Prediction System
1.2 Data mining methodology for stock market prediction
1.2.1 Prediction in data mining
1.2.2 Parameters
1.2.3 Approaches to stock market prediction
1.2.4 Data mining methods in stock market
CHAPTER II: RELATIONAL DATA MINING FOR STOCK MARKET PREDICTION
II.1 Introduction
II.2 Basic problems
II.2.1 First-order logic and rules
II.2.2 Representative measurement theory
II.2.3 Breadth-first search
II.2.4 Occam's razor principle
II.3 Theory of RDM
II.3.1 Data types in RDM
II.3.2 Relational representation of examples
II.3.3 Background knowledge and problems of search for regularities
II.4 An algorithm for RDM: MMDR
II.4.1 Motivations of choice for MMDR
II.4.2 Some concepts
II.4.3 Algorithm MMDR
CHAPTER III: AN APPLICATION OF MMDR TO STOCK PRICE PREDICTION
III.1 MMDR model for prediction
III.2 Experiment preparation
III.2.1 Data description and representation
III.2.2 Demo program
III.3 Application of MMDR model
III.3.1 Step 1: Generating logical rules
III.3.2 Step 2: Learning logical rules
III.3.3 Step 3: Creating intervals
III.4 Results and evaluations
III.4.1 Stability of discovered rules on test data
III.4.2 Evaluations of forecast performance
CONCLUSIONS
Contributions of the thesis
Limitations of the thesis
Future work
Summary
APPENDICES
Source code
REFERENCES
In English
In Vietnamese
Website
LIST OF TABLES AND FIGURES
Comparison of AVL-based methods and first-order logic methods
UpDown predicate
Predicates Up and Down
Examples of terms
Attribute-based data example
Partial background knowledge for stock market
Figure III.1: Flow diagram for MMDR model: steps and techniques
Training set and Test set
Examples of rule consistent with hypotheses H1-H4
Table A.1: Stability checking table
Table A.2: Performance metrics for a set of 125 regularities
Figure A.1: Performance of 125 found regularities on test data
Table A.3: Performance metrics for a set of 292 regularities
Figure A.2: Performance of 125 found regularities on test data
Table A.5: Performance for regularity with conditional probability of 0.49
Figure A.3: Performance of an individual regularity with conditional probability of 0.49 on test data
Table A.6: Performance for regularity with conditional probability of 0.84
Figure A.4: Performance of an individual regularity with conditional probability of 0.84 on test data
Table A.7: Forecast result for the day December 1st, 2006 (the regularity with conditional probability of 0.84)
Table A.8: Forecast result for the day December 1st, 2006 (the set of 292 regularities with conditional probability not less than 0.65)
LIST OF ABBREVIATIONS
AI : Artificial Intelligence
AVL(s) : Attribute-value language(s)
DM : Data Mining
FOL : First-order Logic
ILP : Inductive Logic Programming
ML : Machine Learning
MMDR : Machine Methods for Discovering Regularities
MRDM : Multi-Relational Data Mining
RDM : Relational Data Mining
RMT : Representative Measurement Theory
INTRODUCTION
Problem definition
There are four major technological reasons stimulating data mining development, applications and public interest: the emergence of very large databases; advances in computer technology; fast access to vast amounts of data; and the ability to apply computationally intensive statistical methodology to these data.
Data mining is the process of discovering hidden patterns in data. Due to the large size of databases, the importance of the information stored, and the value of the information obtained, finding hidden patterns in data has become increasingly significant. The stock market provides an area in which large volumes of data are created and stored on a daily basis.
Financial forecasting has been widely studied as a case of the time-series prediction problem. Time series such as the stock market are often seen as non-stationary, which presents challenges in predicting future values. The efficient market theory states that it is practically impossible to predict financial markets long-term. However, there is good evidence that short-term trends do exist and programs can be written to find them. The data miners' challenge is to find the trends quickly while they are valid, as well as to recognize the time when the trends are no longer effective. Data mining methods provide the framework for stock market predictions to discover hidden trends and patterns.
Well-known and commonly used data mining methods in the stock market are attribute-based learning methods, but they have some serious drawbacks: a limited ability to represent background knowledge and a lack of complex relations. The purpose of RDM is to overcome these limitations. RDM is a learning method that is better suited for stock market mining, with a better ability to explain discovered rules than other symbolic approaches.
However, current relational methods are relatively inefficient and have rather limited facilities for handling numerical data. RDM as a hybrid learning method combines the strength of FOL and probabilistic inference to meet these challenges. One of the few Hybrid Probabilistic Relational Data Mining methods, MMDR, which handles numerical data efficiently, has been developed and applied to stock market data.
It is believed that now is the time for RDM methods; in particular, MMDR applied to stock market prediction has advantages in discovering regularities in stock market time series.
Motivations of the Thesis
In the past few years, Vietnam's stock market was still in an early stage of development and thus did not catch the attention of investors and researchers. In particular, for interested learners, mastering professional methods of stock market analysis and forecasting requires time and wide background knowledge to study all the fields covered. Moreover, according to the efficient market theory, it is practically impossible to infer a fixed long-term global forecasting model from historical stock market information. Therefore, there have been few Vietnamese interested in and performing research on stock market prediction.
The past two years have witnessed the surprising development of the Vietnamese stock market, with a host of notable events. Especially after Vietnam became a World Trade Organization (WTO) member, the Vietnamese economy has had many opportunities to develop, leading to the development of many companies and markets, including the financial and stock markets. It is said that Vietnam's stock market will grow rapidly in the next years, and that it will rank second in the region, just after China, in terms of growth rate.
Under the rapid development of Vietnam's financial market, professional activities such as analysis and prediction of the financial market should be paid more attention. In particular, these activities play a significant role in the task of macroeconomic forecasting at the National Center for Socio-economic Information and Forecast (under the Ministry of Planning and Investment), which helps make sound policies related to socio-economic management and regulation at the macro level. Data mining provides some methods and techniques that are able to help approach stock market prediction quite effectively.
In fact, there have already been some studies and successful applications of data mining techniques to stock market forecasting. However, the capture of knowledge and the application techniques of each approach are quite challenging and time-consuming. I read some papers and especially paid attention to research on relational data mining in finance by two researchers, Prof. Dr. Boris Kovalerchuk and Dr. Evgenii Vityaev. They reported that, "Mining stock market data presents special challenges. For one, the rewards for finding successful patterns are potentially enormous, but so are the difficulties and sources of confusion. The efficient market theory states that it is practically impossible to predict financial markets long-term. However, there is good evidence that short-term trends do exist and programs can be written to find them. The data miners' challenge is to find the trends quickly while they are valid, to deal effectively with time series and calendar effects, as well as to recognize the time when the trends are no longer effective."
The learning method RDM is able to learn more expressive rules, make better use of underlying domain knowledge, and explain discovered rules better than other symbolic approaches. It is thus better suited for stock market mining. This approach will play a key role in future advances in data mining methodology and practice. The earlier algorithms for RDM suffer from relative computational inefficiency and have rather limited tools for processing numerical data. This problem especially needs to be considered in stock market analysis, where data are commonly numerical time series. Therefore, RDM as a hybrid learning method that combines the strength of FOL and probabilistic inference has been developed to meet these challenges. One of the few Hybrid Probabilistic Relational Data Mining methods, MMDR, which handles numerical data efficiently, has been developed and applied to stock market forecasting.
The common question "Can stock market prediction be profitable?" is often put to any research on methods of stock market prediction. In fact, there are few people doing research on RDM for stock market forecasting, because it requires interested learners to have wide background knowledge to understand all the fields covered. Much less has been reported publicly on the success of data mining in real trading by financial institutions. If real success is reported, then competitors can apply the same methods and the leverage will disappear, because in essence all fundamental data mining methods are not proprietary. I used to concentrate my study in an attempt to end up with a Master's degree and as a millionaire (kidding), but that is too high a risk to take.
Basing my intention on practical suggestions and requirements, as well as my personal interest, I came to a decision to do research on stock market forecasting. Through some school lessons and extra self-learning efforts, I accessed some data mining techniques to seek a solution to the task. The points above motivate the aim of the thesis: to carry out research and experiments on the methodology of RDM for stock market prediction.
Objectives of the Thesis
- Systematic organization of RDM methodology for stock market prediction
Most of the existing studies on RDM for stock market prediction are reported in a short, overview way, which causes difficulties for many readers. The thesis is primarily based on the book "Data Mining in Finance: Advances in Relational and Hybrid Methods" and some papers by the two researchers Dr. Kovalerchuk and Dr. Vityaev. However, after having a thorough grasp of the RDM methodology, I systematically organize the methodology, especially the algorithm MMDR, in my own view and supplement the thesis with further extensions of knowledge in data mining and stock market forecasting. Hopefully, it plays an important role in helping newcomers approach the problem more favorably.
- Experimental performance of the MMDR method for stock market prediction
The centre of the thesis is the issue of discovering regularities in stock price series, addressed and illustrated through MMDR. The thesis also carries out an RDM application to stock market prediction through an experiment with a small self-developed program on a set of Standard and Poor's data. The experiment helps one understand and gain more trust in the feasibility and efficiency of the RDM methodology and the MMDR algorithm presented in the thesis.
Method of the Thesis study
The study behind the thesis has been mostly goal driven. As problems appeared on the way to realizing stock market prediction, they were tackled by various means, as listed below:
• Investigation of some existing machine learning and data mining methods through related documents such as doctoral theses, Master's theses, online papers, books, etc.
• Reading of financial and stock market literature for properties, forecast techniques and hints of regularities in stock market data able to be exploited.
• Learning about some existing stock market prediction software for deeper understanding of the regularities discovered.
• Some theoretical considerations on mechanisms behind the generation of stock data, and on general predictability demands and limits.
• Practical insights into the realm of trading in the stock market.
• Contacts with experts on data mining and data mining software development, and with stock market investors and chief financial officers.
• Courses on economic forecasting and the stock market, mostly organized by the National Center for Socio-economic Information and Forecast.
• Collection of related documents and systematization of the Master's thesis.
• Programming in PHP and carrying out experiments to illustrate and to prove the main idea and algorithm presented in the thesis.
Structure of the Thesis
The thesis is structured in the following way. The first part introduces the problem definition, method of study, objectives and structure of the thesis.
Chapter 1 provides an overview of stock market prediction in data mining through the two following parts. "Introduction to stock market prediction" includes basic concepts of stock market forecasting, stock market prediction with the Efficient Market Theory, stock market time series properties, and challenges and possibilities in developing a stock market prediction system, etc. The last part, "Data mining methodology for stock market prediction", presents some major types of data mining prediction, approaches to stock market prediction, and comparisons of representation languages and data mining methods used in the stock market.
Chapter 2 discusses some basic problems, the theory of RDM, and the algorithm MMDR. In comparison with other data mining methods, the RDM approach is considered from the point of view of its data types, representation languages (to manipulate and interpret data) and class of hypotheses (to be tested on data). One of the few Hybrid Probabilistic Relational Data Mining methods, MMDR, which is equipped with a probabilistic mechanism that is necessary for time series with a high level of noise, is mainly introduced.
In Chapter 3, an MMDR application to stock market price prediction makes the methodology clear through three steps: rule generating, rule learning and interval creating. This chapter also brings out some statistical results and evaluations for the experiment conducted to demonstrate the application.
Finally, the contributions, limitations and future work of my research are given as the conclusion part of the thesis. In the appendix part, the thesis also provides some table structures and source code developed by myself that are used for the experiment.
CHAPTER I: OVERVIEW OF STOCK MARKET PREDICTION IN DATA MINING
1.1 Introduction to stock market prediction
1.1.1 Basic concepts of forecast
This section provides brief basic concepts of forecasting. An introductory discussion of the topic can be found in [46] (Michael Leonard, Large-Scale Automatic Forecasting: Millions of Forecasts, International Symposium of Forecasting, 2002).
Forecasts are time series predictions made for future periods in time. They are random variables and therefore have an associated probability distribution. The mean or median of each forecast is called the prediction. The variance of each forecast is called the prediction error variance, and the square root of the variance is called the prediction standard error. The variance is computed from the forecast model parameter estimates and the model residual variance.
The forecast for the next future period is called the one-step ahead forecast. The forecast for h periods in the future is called the h-step ahead forecast. The forecast horizon or forecast lead is the number of periods into the future for which predictions are made (one-step, two-step, ..., h-step). The larger the forecast horizon, the larger the prediction error variance at the end of the horizon.
The confidence limits are based on the prediction standard errors and a chosen confidence limit size. A confidence limit size of 0.05 results in 95% confidence limits. The confidence limits are often computed assuming a normal distribution, but others could be used. As with the prediction standard errors, the width of the confidence limits increases with the forecast horizon.
The prediction error is the difference between the predicted value and the actual value when the actual value is known. For transformed models, it is important to understand the difference between the model errors (or residuals) and the prediction errors. The residuals measure the departure from the model in the transformed metric. The prediction errors measure the departure from the original series.
Taken together, the predictions, prediction standard errors, and confidence limits at each period in the forecast horizon are the forecasts. Although many people use the word "forecast" to imply only the prediction, a forecast is not one number for each future time period.
Using a transformed forecasting model requires the following steps:
• The time series data are transformed.
• The transformed time series data are fit using the forecasting model.
• The forecasts are computed using the parameter estimates and the transformed time series data.
• The forecasts (predictions, prediction standard errors, and confidence limits) are inverse transformed.
The naive inverse transformation results in median forecasts. To obtain mean forecasts requires that the prediction and the prediction error variance both be adjusted based on the transformation. Additionally, the model residuals will be different from the prediction errors due to this inverse transformation. If no transformation is used, the model residual and the prediction error will be the same, and likewise the mean and median forecast will be the same (assuming a symmetric disturbance distribution).
The statistics of fit evaluate how well a forecasting model performs by comparing the actual data to the predictions. For a given forecast model that has been fitted to the time series data, the model should be checked or evaluated to see how well it fits or forecasts the data. The statistics of fit can be computed from the model residuals or the prediction errors.
When a particular statistic of fit is used for forecast model selection, it is referred to as the model selection criterion. When using model selection criteria to rank forecasting models, it is important to compare the errors on the same metric; that is, you should not compare transformed model residuals with non-transformed model residuals. You should first inverse transform the forecasts from the transformed model prior to computing the prediction errors, and then compute the model selection criterion based on the prediction errors.
1.1.2 Prediction tasks in stock market
Boris Kovalerchuk and Evgenii Vityaev, "Data Mining for Financial Applications" (in: O. Maimon, L. Rokach (Eds.), The Data Mining and Knowledge Discovery Handbook), distinguish two prediction tasks in the stock market:
• Straight prediction of a stock market numeric characteristic, e.g., stock return or exchange rate.
• Prediction of whether the stock market characteristic will increase or decrease.
Note that in the first case it is necessary to take into account the trading cost and the significance of the trading return, and in the second case it is necessary to forecast whether the stock market characteristic will increase or decrease by no less than some threshold. Thus, the difference between data mining methods for the first and second cases can be less obvious, because the second case may require some kind of numeric forecast.
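A small sketch of how a numeric forecast (the first task) can be reduced to an up/down decision (the second task) once trading cost and a significance threshold are taken into account; the cost and threshold values below are hypothetical:

```python
def trading_signal(predicted_return, trading_cost=0.001, threshold=0.005):
    """Map a numeric forecast onto an up/down decision: trade only when
    the predicted move clears the trading cost by at least a threshold."""
    if predicted_return - trading_cost >= threshold:
        return "up"        # predicted increase worth trading
    if predicted_return + trading_cost <= -threshold:
        return "down"      # predicted decrease worth trading
    return "no trade"      # move too small to be significant

print(trading_signal(0.012))   # 'up'
print(trading_signal(0.002))   # 'no trade'
```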
Financial institutions produce huge datasets that build a foundation for approaching these enormously complex and dynamic problems with data mining tools. The potential significant benefits of solving these problems have motivated extensive research for years.
1.1.3 Stock market time series properties
One may wonder if there are universal characteristics of the many series coming from markets different in size, location, sophistication, etc. The surprising fact is that there are. Moreover, interacting systems in other fields, such as statistical mechanics, suggest that the properties of stock market time series depend only loosely on the market microstructure and are common to a range of interacting systems. Such observations have stimulated new models of markets based on analogies with particle systems and brought in new analysis techniques, opening the era of econophysics. A more detailed discussion of stock market time series properties can be found in [66] (Stefan Zemke, On Developing a Financial Prediction System: Pitfalls and Possibilities, First International Workshop on Data Mining Lessons Learned at ICML'02, 2002). This section briefly introduces stock market time series properties, including:
- Distribution
The distribution of stock market series tends to be non-normal, sharp peaked and heavy-tailed, these properties being more pronounced for intraday values. Interestingly, such observations were pioneered around the time the EMH was formulated. Extreme values appear more frequently in a stock market series than in a normally-distributed series of the same variance. This is important to the practitioner, since often such values cannot be disregarded as erroneous outliers but must be actively anticipated, because their magnitude can influence trading performance.
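A quick way to check the sharp-peaked, heavy-tailed shape is excess kurtosis, which is 0 for a normal distribution; a minimal sketch on hypothetical daily returns:

```python
def excess_kurtosis(returns):
    """Excess kurtosis: 0 for a normal distribution; markedly positive
    values indicate the heavy tails typical of stock market returns."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    return m4 / var ** 2 - 3.0

# Hypothetical daily returns: a single extreme value dominates.
returns = [0.001, -0.002, 0.0005, 0.003, -0.001, -0.08, 0.002, 0.001]
print(excess_kurtosis(returns))  # well above 0
```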
- Scaling property
The scaling property of a time series indicates that the series is self-similar at different time scales. This is common in stock market time series: given a plot of returns without the axes labeled, it is next to impossible to say whether it represents hourly, daily or monthly changes, since all the plots look similar, with differences appearing only at minute resolution. Thus prediction methods developed for one resolution could, in principle, be applied to others.
- Data frequency
Data frequency refers to how often series values are collected: hourly, daily, weekly, etc. Usually, if a stock market series provides values on a daily or longer basis, it is low frequency data; otherwise, when many intraday quotes are included, it is high frequency data. Tick-by-tick data includes all individual transactions, and as such, the event-driven time between data points varies, creating a challenge even for such a simple calculation as correlation.
1.1.4 Stock market prediction with the efficient market theory
The Efficient Market Theory/Hypothesis (EMH) initially got wide acceptance in the financial community. It asserts, in weak form, that the current price of an asset already reflects all information obtainable from past prices, and assumes that news is promptly incorporated into prices. Since news is assumed unpredictable, so are prices. In other words, according to the EMH, the evolution of the price of each economic variable is a random walk. The variations in prices are completely independent from one time step to the next in the long run. The EMH states that it is practically impossible to infer a fixed long-term global forecasting model from historical stock market information. This idea is based on the observation that if the market presents some kind of regularity, then someone will take advantage of it and the regularity disappears.
However, real markets do not obey all the consequences of the hypothesis; e.g., a price random walk implies a normal distribution, which is not the observed case, and there is a delay while the price stabilizes to a new level after news. This, among other things, leads to a more modern view: "Overall, the best evidence points to the following conclusion. The market isn't efficient with respect to any of the so-called levels of efficiency. The value investing phenomenon is inconsistent with semi-strong form efficiency, and the January effect is inconsistent even with weak form efficiency. Overall, the evidence indicates that a great deal of information available at all levels is, at any given time, reflected in stock prices. The market may not be easily beaten, but it appears to be beatable, at least if you are willing to work at it."
The market efficiency theory does not exclude that hidden short-term local conditional regularities may exist. These regularities cannot work "forever"; they should be corrected frequently. It has been shown that stock market data are not random and that the efficient market hypothesis is merely a subset of a larger chaotic market hypothesis. This hypothesis does not exclude successful short-term forecasting models for prediction of chaotic time series.
Data mining does not try to accept or reject the efficient market theory. Data mining creates tools which can be useful for discovering subtle short-term conditional patterns and trends in a wide range of stock market data. This means that retraining should be a permanent part of data mining in the stock market, and any claim that a silver-bullet trading method has been found should be treated similarly to claims that a perpetual motion machine has been discovered.
1.1.5 Questions in stock market prediction
Following are some questions of scientific and practical interest concerning stock market prediction:
• Prediction possibility: Is statistically significant prediction of stock market data possible? Is profitable prediction of such data possible? What does an answer to the former question imply, adjusted by the constraints imposed by real markets?
• Methods: If prediction is possible, what methods are best at performing it?
What methods are best-suited for what data characteristics - could it be said
in advance?
• Meta-methods: What are the ways to improve the methods? Can
metaheuristics successful in other domains, such as ensembles or pruning,
improve stock market prediction?
• Data: Can the amount and type of data needed for prediction be characterized?
• Data preprocessing: Can data transformations that facilitate prediction be
identified? In particular, what transformation formulae enhance input data?
• Evaluation: What are the features of a sound evaluation procedure, respecting the properties of stock market data and the expectations of stock market prediction? What are the common evaluation drawbacks?
• Predictor development: Are there any common features of successful prediction systems? If so, what are they, and how could they be advanced? Can common reasons for failure of stock market prediction be identified? Are they intrinsic and non-reparable, or is there a way to amend them?
• Transfer to other domains: Can the methods developed for stock market
prediction benefit other domains?
• Predictability estimation: Can stock market data be reasonably quickly estimated to be predictable or not, without the investment to build a custom system? What are the methods, what do they actually say, and what are their limits?
• Consequences of predictability: What are the theoretical and practical consequences of demonstrated predictability of stock market data, or of the impossibility of it? How does a successful prediction method translate into economic models? What could be the social consequences of stock market prediction?
1.1.6 Challenges and Possibilities on Developing a Stock Market Prediction System
A successful stock market prediction system presents many challenges. Some are encountered over and over again, and though an individual solution might be system-specific, general principles still apply. Using them as a guideline might save time and effort and boost results, thus promoting the project's success.
The idea of stock market prediction (and the resulting riches) is appealing, initiating countless attempts. In this competitive environment, if one wants above-average results, one needs above-average insight and sophistication. Reported successful systems are hybrid and custom made, whereas straightforward approaches, e.g., a neural network plugged into relatively unprocessed data, usually fail. The individuality of a hybrid system offers chances and dangers. One can bring together the best of many approaches; however, the interaction complexity hinders judging where the performance dis/advantage is coming from.
Stock market prediction has been widely studied as a case of the time-series prediction problem. The difficulty of this problem is due to the following factors: low signal-to-noise ratio, non-Gaussian noise distribution, nonstationarity, and nonlinearity. Deriving relationships that allow one to predict future values of a time series is a challenging task when the underlying system is highly non-linear. Usually, the history of the time series is provided and the goal is to extract from that data a dynamic system. The dynamic system models the relationship between a window of past values and a value T time steps ahead. Discovering such a model is difficult in practice, since the processes are typically corrupted by noise and can only be partially modeled due to missing information and the overall complexity of the problem. In addition, stock market time series are inherently non-stationary, so adaptive forecasting techniques are required.
- Data Preprocessing
Before data is fed into an algorithm, it must be collected, inspected, cleaned and selected. Since even the best predictor will fail on bad data, data quality and preparation are crucial. Also, since a predictor can exploit only certain data features, it is important to detect which data preprocessing/presentation works best.
• Visual inspection is invaluable. At first, one can look for: trend (whether it needs removing), histogram (whether to redistribute), missing values and outliers, and any regularities.
• Missing values are dealt with by data mining methods.
• Series to instances conversion is required by most learning algorithms, which expect as input a fixed-length vector (see the sketch after this list).
• Indicators are series derived from others, enhancing some features of interest, such as trend reversal.
• Feature selection can make learning feasible, since, because of the curse of dimensionality, long instances demand (exponentially) more data.
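A minimal sketch of the series-to-instances conversion: each instance is a fixed-length window of past values, labeled with a future value; the window, horizon and prices below are arbitrary:

```python
def series_to_instances(series, window=5, horizon=1):
    """Convert a time series into fixed-length instances: each instance
    is a window of past values, labeled with the value 'horizon' steps
    ahead. Most learning algorithms expect such fixed-length vectors."""
    instances = []
    for t in range(window, len(series) - horizon + 1):
        x = series[t - window:t]     # the input window
        y = series[t + horizon - 1]  # the target value
        instances.append((x, y))
    return instances

prices = [10.0, 10.2, 10.1, 10.4, 10.5, 10.3, 10.6, 10.8]
for x, y in series_to_instances(prices, window=3):
    print(x, "->", y)
```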
- Prediction Algorithms
Common learning algorithms and their features important to stock market prediction:
• Linear methods are widely used in stock market prediction.
• Neural networks seem the method of choice for stock market prediction.
• C4.5 and ILP generate decision trees/if-then rules, which are human understandable if small.
• Nearest Neighbor does not create a general model; to predict, it looks back for the most similar case(s). Irrelevant/noisy features disrupt the similarity measure, so pre-processing is worthwhile.
• A Bayesian classifier/predictor first learns probabilities of how evidence supports outcomes, which are then used to predict a new case's outcome.
• Support Vector Machines (SVM) are a relatively new and powerful learner, having attractive characteristics for time series prediction.
- System Evaluation
Proper evaluation is critical to prediction system development. First, it has to measure exactly the interesting effect, as opposed to prediction accuracy. Second, it has to be sensitive enough to distinguish often minor gains. Third, it has to convince that the gains are not merely a coincidence.
• Evaluation bias, resulting from the evaluation scheme and time series data, needs to be recognized.
• Evaluation data should include different regimes, markets, even data errors, and be plentiful. Dividing test data into segments helps to spot performance irregularities (for different regimes).
• Sanity checks involve common sense. Prediction errors along the series should not reveal any structure, unless the predictor missed something.
1.2 Data mining methodology for stock market prediction
1.2.1 Prediction in data mining
a. Introduction
The goal of data mining is to produce new knowledge that the user can act upon. It does this by building a model of the real world based on data collected from a variety of sources. The result of the model building is a description of patterns and relationships in the data that can be confidently used for prediction.
Prediction is one of the most important problems in data mining. It involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. The goal of prediction is to forecast or deduce the value of an attribute based on the values of other attributes.
b. Major types of prediction
- In the view of construction and use of a model
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value ranges of an attribute that a given sample is likely to have. In this view, classification and regression are the two major types of prediction problems:
• Classification: used for discrete or nominal values. It predicts into what category or class a case falls. In other words, classification problems aim to identify the characteristics that indicate the group to which each case belongs. Data mining creates classification models by examining already classified data (cases) and inductively finding a predictive pattern.
• Regression: used to predict continuous or ordered values. It predicts what numeric value a variable will have. In other words, regression uses existing values to forecast what other values will be. The prediction of continuous values can be modeled by statistical techniques of regression.
- In the view of the use of prediction
This view is commonly accepted in data mining. Prediction refers to the use of prediction to predict class labels (classification) and to predict continuous values (prediction):
• Classification: used to extract models describing important data classes. Classification predicts categorical class labels. It classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model in classifying new data.
• Prediction: used to predict future data trends, i.e., to predict unknown or missing values. It models continuous-valued functions. Any of the methods and techniques used for classification may also be used for prediction.
1.2.2 Parameters
There are several parameters that characterize data mining methodologies for stock market forecasting:
1.2.2.1 Data types
There are two major groups of data types (illustrated in the sketch after this list):
• Attribute data type: an object is represented by attributes; that is, each object x is given by a set of values A1(x), A2(x), ..., An(x).
• Relational data type: objects are represented by their relations with other objects, for instance x>y, y<z, x>z. In this example we may not know that x=3, y=1 and z=2. Thus the attributes of objects are not known, but their relations are known. Objects may have different attributes (e.g., x=5, y=2, and z=4), but still have the same relations.
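A small sketch of the contrast, using the hypothetical objects x, y, z above: the relational representation can be derived from attribute values, and different attribute values can yield exactly the same relations:

```python
def relations_from_attributes(objects):
    """Derive the relational representation (x>y, y<z, x>z, ...) from
    attribute values; afterwards only the relations are kept."""
    names = list(objects)
    relations = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if objects[a] > objects[b]:
                relations[(a, b)] = ">"
            elif objects[a] < objects[b]:
                relations[(a, b)] = "<"
            else:
                relations[(a, b)] = "="
    return relations

# Different attribute values, same relations between the objects.
print(relations_from_attributes({"x": 3, "y": 1, "z": 2}))
print(relations_from_attributes({"x": 5, "y": 2, "z": 4}))
```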
1.2.2.2 Data set and techniques
Fundamental and technical analyses are two widely used techniques in stock market forecasting.
- Fundamental analysis
Fundamental analysis tries to determine all the econometric variables that may influence the dynamics of a given stock price or exchange rate. Often it is hard to establish which of these variables are relevant and how to evaluate their effect.
- Technical analysis
Technical analysis assumes that when the sampling rate of a given economic variable is high, all the information necessary to predict the future values is contained in the time series itself. There are several difficulties in technical analysis for accurate prediction: successive ticks correspond to bids from different sources, the correlation between price variations may be low, time series are not stationary, good statistical indicators may not be known, different realizations of the random process may not be available, and the number of training examples may not be enough to accurately infer rules. Therefore, technical analysis can fit short-term predictions for stock market time series without great changes in the economic environment between successive ticks. Actually, technical analysis has been more successful in identifying market trends, which is much easier than forecasting future stock prices. Currently, different data mining techniques try to incorporate some of the most common technical analysis strategies in the pre-processing of data and in the construction of appropriate attributes.
Two major options exist: use the time series itself, or use all variables that may influence the evolution of the time series. Data mining methods do not restrict themselves to a particular option. They follow a fundamental analysis approach, incorporating all available attributes and their values, but they also do not exclude a technical analysis approach based only on a time series such as the stock price and parameters derived from it. The most popular time series are index value at open, index value at close, highest index value, lowest index value, trading volume, and lagged returns from the time series of interest. Fundamental factors include the price of gold, the retail sales index, industrial production indices, and foreign currency exchange rates. Technical factors include variables that are derived from time series, such as moving averages.
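A minimal sketch of deriving such technical factors from a price series; the moving-average window and the closing prices are hypothetical:

```python
def moving_average(series, window=5):
    """Simple moving average: a typical technical factor derived from a
    price series (undefined for the first window-1 points)."""
    return [sum(series[t - window + 1:t + 1]) / window
            for t in range(window - 1, len(series))]

def lagged_returns(series, lag=1):
    """Lagged returns: relative price changes over 'lag' periods."""
    return [(series[t] - series[t - lag]) / series[t - lag]
            for t in range(lag, len(series))]

close = [100.0, 101.5, 101.0, 102.3, 103.1, 102.8]  # hypothetical closes
print(moving_average(close, window=3))
print(lagged_returns(close))
```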
1.2.2.3 Mathematical algorithm (method, model)
A variety of statistical, neural network and logical methods have been developed. For example, there are many neural network models based on different mathematical algorithms, theories and methodologies. Combinations of different models may provide better performance than individual models. Many data mining methods assume a functional form of the relationship being modeled.
1.2.2.4 Form of relationships between objects
The next characteristic of a specific data mining methodology is the form of the relationship between objects. Many data mining methods assume a functional form of the relationship being modeled. For instance, linear discriminant analysis assumes linearity of the border that discriminates between two classes in the space of attributes. Often it is hard to justify such a functional form in advance. The RDM methodology in the stock market does not assume a functional form for the relationship. In addition, RDM algorithms do not assume the existence of derivatives. They can automatically learn symbolic relations on numerical data of stock market time series.
1.2.3 Approaches to stock market prediction
a. Physics approach and data mining approach
The impact of market players on market regularities stimulated a surge of attempts to use ideas of statistical physics in finance. If an observer is a large marketplace player, then such an observer can potentially change the regularities of the marketplace dynamically. Attempts to forecast in such a dynamic environment with thousands of active agents lead to much more complex models than traditional data mining models are designed for. This is one of the major reasons that such interactions are modeled using ideas from statistical physics rather than from statistical data mining. The physics approach in finance is also known as "econophysics" and the "physics of finance". The major difference from the data mining approach comes from the fact that, in essence, the data mining approach is not about developing specific methods for financial tasks, but the physics approach is.
b. Deterministic dynamic system approach
Stock market data are often represented as a time series of a variety of attributes such as stock prices and indexes. Time series prediction has been one of the ultimate challenges in mathematical modeling for many years. Currently, data mining methods try to enhance this study with new approaches. The dynamic system approach has been developed and applied successfully to many difficult problems in physics. Recently, several studies have applied this technique to the stock market. Usually, the history of the time series is provided and the goal is to extract from that data a dynamic system. The dynamic system models the relationship between a window of past values and a value T time steps ahead. The major steps of this approach are:
• Step 1: Development of a state space for the dynamic system, i.e., selecting and/or inventing attributes characterizing the system behavior.
• Step 2: Discovering the laws that govern the phenomenon, i.e., discovering relations between attributes of current and previous states (state vectors) in the form of differential equations.
• Step 3: Solving the differential equations to identify the transition function (rules).
• Step 4: Use of the transition function as a predictor of the next state of the dynamic system, e.g., the next day's stock value.
Inferring a set of rules for a dynamic system assumes that:
• There is enough information in the available data to sufficiently characterize the dynamics of the system with high accuracy.
• All of the variables that influence the time series are available, or they vary slowly enough that the system can be modeled adaptively.
• The system has reached some kind of stationary evolution.
• The system is a deterministic system.
• The evolution of the system can be described by means of a surface in the space of delayed values.
There are several applications of these methods to stock time series. However, the literature contains claims both for and against the existence of a chaotic deterministic system underlying the stock market. Recent research has focused on methods to distinguish stochastic noise from deterministic chaotic dynamics and, more generally, on constructing systems combining deterministic and probabilistic techniques.
1.2.4 Data mining methods in stock market
Almost every computational method has been explored and used for financial modeling. New developments augment the traditional technical analysis of stock market curves that has been used extensively by financial institutions. Such stock charting helps to identify buy/sell signals (timing "flags") using graphical patterns. Data mining, as a process of discovering useful patterns and correlations, has its own place in stock market modeling.
Similarly to other computational methods, almost every data mining method and technique has been used in financial modeling. An incomplete list includes a variety of linear and non-linear models, multi-layer neural networks, k-means and hierarchical clustering, k-nearest neighbors, decision tree analysis, regression (logistic regression, general multiple regression), ARIMA, principal component analysis, and Bayesian learning. Less traditional methods used include rough sets, RDM methods (deterministic inductive logic programming) and newer probabilistic methods, support vector machines, independent component analysis, Markov models and hidden Markov models.
1.2.4.1 Representation languages
a. Propositional Logic language
A proposition is a statement that can be true or false. Propositional logic uses true statements to form or prove other true statements. In other words, propositional logics are concerned with propositional (or sentential) operators, which may be applied to one or more propositions, giving new propositions.
Propositional logic has very limited expressive power. It is not adequate for formalizing valid arguments that rely on the internal structure of the propositions involved.
b. First-order logic language
First-order logic (FOL) is a system of deduction extending propositional logic by the ability to express relations between individuals. FOL languages support variables, relations, and complex expressions.
The FOL language differs from a propositional logic language mainly by the presence of variables. Therefore, a language of monadic functions and predicates is a FOL language, but a very restricted one.
c. Attribute-value languages
An attribute-value language is a propositional language in which propositions are attribute-value pairs that can be considered as predicates. In other words, in an attribute-value language, objects are described by tuples of attribute-value pairs, where each attribute represents some characteristic of the object.
Attribute-value languages are languages of monadic functions (functions of one variable) and monadic predicates (Boolean functions with only one argument). Such a language was not designed to represent relations that involve two, three or more objects.
d. Comparison of these languages
Many well-known rule learners are propositional, but propositional representations offer no general way to describe the essential relations among the values of the attributes. In contrast with propositional rules, first-order rules have an advantage in discovering relational assertions because they capture relations directly. Several types of hypotheses/rules presented in FOL are simple relational assertions with variables. Relational assertions can be conveniently expressed using first-order representations, while they are very difficult to describe using propositional representations.
Also, first-order rules allow one to express naturally other, more general hypotheses, not only the relation between pairs of attributes. These more general rules can serve both for classification problems and for an interval forecast of a continuous variable. Moreover, these rules are able to capture the Markov chain type of models used for stock market time series forecasting. That algorithms can be designed to learn sets of first-order rules that contain variables is significant, because first-order rules are much more expressive than propositional rules.
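A small sketch of such a more general first-order rule, here a Markov-chain-style pattern with variables over time points and states; the rule and the state names are illustrative, not taken from the thesis experiment:

```python
# A first-order rule with variables t and q:
#   Class(t-1, q) & Class(t, q) -> Class(t+1, q)
# reads: if the market was in state q both yesterday and today, predict
# state q for tomorrow. A propositional rule could only express this by
# enumerating every concrete state and date separately.

def markov_rule(states):
    """Apply the rule to a sequence of market states ('up'/'down')."""
    if len(states) >= 2 and states[-1] == states[-2]:
        return states[-1]  # the variable q binds to this state
    return None            # the rule does not fire

print(markov_rule(["up", "down", "down"]))  # 'down'
print(markov_rule(["up", "down", "up"]))    # None
```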
1.2.4.2 AVL-based methods
The common data mining methodology assuming the attribute data type is known as an attribute-based or attribute-value methodology. It covers a wide range of statistical and connectionist (neural network) methods. There are two types of attribute-value methods: the first is based on numerical expressions, and the second is based on logical expressions and operations.
Historically, methods based on AVLs, such as neural networks, the nearest neighbors method, and decision trees, dominate in financial applications of data mining. The reasons are: (1) in many areas, including the stock market, training data are naturally described by attributes of individual entities such as price, amount and so on; (2) relations between entities can be very useful for data mining; and (3) these methods are well-known, relatively simple, efficient, and can handle noisy data. However, these methods have two strong limitations: (1) a limited ability to represent background knowledge, and (2) the lack of complex relations.
1.2.4.3 Relational Data Mining methods
The less traditional relational methodology is based on the relational data type. Many RDM approaches are based on ILP, which refers to the collection of machine learning algorithms that use FOL as their language. The term relational data mining (RDM) is used in parallel with the terms Inductive Logic Programming (ILP) and First-Order Logic (FOL) methods to emphasize the goal: discovering relations. The term ILP reflects the techniques for discovering relations: logic programming. In particular, discovering relational regularities can be done without logical inference and in languages of higher order. Therefore, RDM is defined as discovering hidden relations (general first-order logic relations) in numerical and symbolic data using background knowledge (domain theory).
Relational Data Mining (RDM) technology is a data modeling approach that does not assume the functional form of the relationship being modeled a priori. It can automatically consider a large number of inputs (e.g., time series characterization parameters) and learn how to combine these to produce estimates for future values of a specific output variable. RDM combines recent advances in such areas as FOL, probabilistic inference and Representative Measurement Theory (RMT). This approach will play a key role in future advances in data mining methodology and practice.
The typical claim about RDM is that it cannot handle large data sets. This statement is based on the assumption that the initial data are provided in the form of relations.
a. Deterministic ILP methods
The purpose of ILP is to overcome the limitations of AVL-based methods. ILP systems naturally incorporate background knowledge and relations between objects into the learning process. They have a mechanism to represent background stock market knowledge in human-readable and understandable form. Obviously, understandable rules have advantages over a stock market forecast without explanations.
Traditionally, ILP methods were pure deterministic techniques, which originated in logic programming. There are well-known problems with deterministic methods, which are especially important for stock market applications with numerical data and a high level of noise:
• Limited facility for handling numerical data: historically, ILP methods solve only classification tasks without direct operations on numerical data.
• Relative inefficiency: statistically significant rules have an advantage in comparison with rules tested only for their performance on training and test data. Training and testing data can be too limited and/or not representative. If rules rely only on them, then there are more chances that these rules will not deliver a right forecast on other data. This is a hard problem for any data mining method, and especially for deterministic methods, including deterministic ILP.
b. Hybrid Probabilistic Relational methods
The purpose of RDM is to overcome the limitations of current relational methods (deterministic ILP methods). In the real world, RDM should handle imperfect (noisy) data and, in particular, imperfect numerical data.
Recent research has focused on methods to distinguish stochastic noise from deterministic chaotic dynamics and, more generally, on constructing systems combining deterministic and probabilistic techniques. RDM follows the same direction, moving from classical deterministic FOL rules to probabilistic first-order rules to avoid the limitations of deterministic systems. The combination benefits from noise-robust probabilistic inference and the highly expressive and understandable FOL rules employed in ILP.
Relational methods in finance such as Machine Methods for Discovering Regularities (MMDR) are equipped with a probabilistic mechanism that is necessary for time series with a high level of noise. The MMDR method handles an interval forecast of numeric variables with continuous values, like prices, along with solving classification tasks.
MMDR is one of the few Hybrid Probabilistic Relational Data Mining methods developed and applied to stock market data. In computational experiments, trading strategies developed based on MMDR consistently outperform trading strategies developed based on other data mining methods and the buy-and-hold strategy.
1.2.4.4 Comparison of AVL-based methods and relational methods
The table below summarizes the advantages and disadvantages of AVL-based methods and FOL-based methods.
Comparison of AVL-based methods and first-order logic methods
AVL-based methods
  Advantages for the learning process:
  - Simple, efficient, and handle noisy data
  - Appropriate learning time with a large number of training examples
  Disadvantages for the learning process:
  - Limited form of background knowledge
  - Lack of relations in the concept
FOL-based methods
  Advantages for the learning process:
  - Solid theoretical basis (first-order logic, logic programming)
  - Flexible form of background knowledge, problem representation, and problem-specific constraints
  - Understandable representation of background knowledge, and relations between examples
  Disadvantages for the learning process:
  - Inappropriate learning time with a large number of arguments in the relations
  - Weak facilities for processing numerical data
Data mining approaches that find patterns in a given single table are referred to as attribute-value or propositional learning approaches, as the patterns they find can be expressed in propositional logic. RDM approaches are also referred to as first-order learning approaches, or relational learning approaches, as the patterns they find are expressed in the relational formalism of first-order logic.
Selection of a method for discovering regularities in stock market time series is a very complex task. Uncertainty of problem descriptions and of methods' capabilities are among the most obvious difficulties in the process of selection. It is pointed out that attribute-based learners typically only accept available (background) knowledge in rather limited form. In contrast, relational learners support a general representation for background knowledge.
One of the main advantages of RDM over attribute-based learning is ILP's generality of representation for background knowledge. This enables the user to provide, in a more natural way, domain-specific background knowledge to be used in learning. The use of background knowledge enables the user both to develop a suitable problem representation and to introduce problem-specific constraints into the learning process. By contrast, attribute-based learners can typically accept background knowledge in rather limited form only.
CHAPTER II: RELATIONAL DATA MINING FOR
STOCK MARKET PREDICTION
II.1 Introduction
Data mining methods are developed to search for patterns in data, which can be discovered from unstructured data, semi-structured documents, or structured sources like relational databases. While most real-world databases store information in multiple tables, most of the data mining methods proposed so far operate on unstructured data (in the form of a single table). This "attribute-value" representation requires the data to be preprocessed and aggregated into a single table, which risks loss of meaning and/or information. More complex patterns are simply not expressible in attribute-value format and thus cannot be discovered. One way to enlarge the expressiveness is to generalize from one-table mining to multiple-table mining, i.e., to support mining on full relational databases. Relational data mining (RDM) approaches look for patterns that involve multiple tables (relations) from a relational database. To emphasize this fact, RDM is often referred to as multi-relational data mining (MRDM); the terms RDM and MRDM are used interchangeably.
RDM methods seem to be gaining momentum in different fields. Data mining in the stock market follows this trend and leads the application of RDM to multidimensional stock market time series. Examples and arguments for applications of RDM to the stock market raise expectations of great advances in the near future. Several publications have also stressed that the RDM area is moving toward probabilistic first-order rules to avoid the limitations of deterministic systems. Relational methods such as MMDR are equipped with a probabilistic mechanism that is necessary for time series with a high level of noise. Often, stock market data are represented as a time series of a variety of attributes such as stock prices and indexes. The MMDR method expresses patterns in first-order logic and assigns probabilities to rules generated by composing patterns. MMDR is then applied to discover regularities, which are used to predict the stock market.
This section introduces the RDM methodology and presents the MMDR algorithm as applied to stock market data.
II.2 Basic problems
II.2.1 First-order logic and rules
a. Basic concepts
A predicate is defined as a Boolean (two-valued) function or, equivalently, a subset of a set D = D1 × D2 × ... × Dn, where D1 can be a set of stock prices at moment t=1, D2 a set of stock prices at moment t=2, and so on. Predicates can be defined extensionally, as a list of tuples for which the predicate is true, or intensionally, as a set of (Horn) clauses for computing whether the predicate is true.
Let stock(t) be a stock price at time t, and consider the predicate UpDown(stock(t), stock(t+1), stock(t+2)), which is true if the stock goes up from date t to date t+1 and goes down from date t+1 to date t+2.
This predicate is presented extensionally in Table II.1 and intensionally using two other predicates, Up and Down:

Up(stock(t), stock(t+1)) & Down(stock(t+1), stock(t+2)) → UpDown(stock(t), stock(t+1), stock(t+2))

[Table II.1: extensional definitions of the predicates Up, Down, and UpDown as lists of truth values]
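To make the distinction concrete, here is a minimal Python sketch (not from the thesis; the price data and function names are illustrative) that defines Up and Down over a small price series and composes them into UpDown, mirroring the clause above; the list comprehension at the end recovers the extensional view.

```python
# Minimal sketch: intensional definition of UpDown from Up and Down.
# The price series and function names are illustrative, not from the thesis.

prices = {1: 100.0, 2: 103.5, 3: 101.2, 4: 104.0}  # stock(t) by date t

def up(t):
    """Up(stock(t), stock(t+1)): the price rises from t to t+1."""
    return prices[t] < prices[t + 1]

def down(t):
    """Down(stock(t), stock(t+1)): the price falls from t to t+1."""
    return prices[t] > prices[t + 1]

def up_down(t):
    """UpDown(stock(t), stock(t+1), stock(t+2)) = Up & Down."""
    return up(t) and down(t + 1)

# Extensional view: list the tuples (here, dates t) for which it holds.
print([t for t in (1, 2) if up_down(t)])   # -> [1]
```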
A literal is a predicate A or its negation (¬A); the latter is called a negative literal, and an unnegated predicate is called a positive literal. A clause body is a conjunction A1 & A2 & ... & At of literals A1, A2, ..., At. Often we omit the & operator and write A1 & A2 & ... & At as A1A2...At. A Horn clause consists of two components: a clause head (A0) and a clause body (A1A2...At). A clause head, A0, is defined as a single predicate. A Horn clause is written in two equivalent forms: A0 ← A1A2...At, or A1A2...At → A0, where each Ai is a literal. The second form is traditional in mathematical logic, while the first form is more common in applications.
A collection of Horn clauses with the same head A0 is called a rule. The collection can consist of a single Horn clause; therefore, a single Horn clause is also called a rule. Mathematically, the term collection is equivalent to the OR operator (∨); therefore, a rule with two bodies A1A2...At and B1B2...Bt can be written as A0 ← (A1A2...At ∨ B1B2...Bt).
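As an illustration, the following minimal Python sketch (all predicate names are hypothetical placeholders) evaluates a rule A0 ← (A1A2 ∨ B1B2) as a disjunction of conjunctions of its clause bodies:

```python
# Sketch: a rule A0 <- (A1 & A2) v (B1 & B2) evaluated on an element x.
# Predicate names and definitions are illustrative placeholders.

def rule_a0(x, predicates):
    """x satisfies the rule if it satisfies at least one clause body."""
    bodies = [("A1", "A2"), ("B1", "B2")]
    return any(all(predicates[p](x) for p in body) for body in bodies)

preds = {
    "A1": lambda x: x > 0,
    "A2": lambda x: x < 10,
    "B1": lambda x: x % 2 == 0,
    "B2": lambda x: x > 100,
}
print(rule_a0(5, preds))    # True: satisfies the body A1 & A2
print(rule_a0(50, preds))   # False: fails both bodies
```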
A k-tuple, a functional expression, and a term are the next concepts used in the relational approach. A finite sequence of k constants, denoted <a1, ..., ak>, is called a k-tuple of constants. A function applied to k-tuples is called a functional expression. A term is a constant, a variable, or a functional expression. Examples of terms are given in Table II.3.
[Table II.3: examples of expressions, marking constants, variables, and functional expressions as correct terms, and predicates and literals (e.g., "Stock x is traded on NASDAQ", Predicate(x,y)) as incorrect terms]
A k-tuple of terms can be constructed as a sequence of k terms. These concepts are used to define the concept of an atom. An atom is a predicate symbol applied to a k-tuple of terms. For example, a predicate symbol P can be applied to a 2-tuple of terms (v, w), producing an atom P(v, w) of arity 2. If P is the predicate ">" (greater than) and v = StockPrice(x) and w = StockPrice(y) are two terms, then they produce the atom StockPrice(x) > StockPrice(y), that is, the price of stock x is greater than the price of stock y. Predicate P uses the two terms v and w as its arguments; the number two is the arity of this predicate. If a predicate or function has k arguments, the number k is called the arity of the predicate or function symbol.
By convention, function and predicate symbols are denoted by Name/Arity. Functions may take a variety of values, but predicates may take only the Boolean values true and false. The meaning of the rule for a k-arity predicate is the set of k-tuples that satisfy the predicate. A tuple satisfies a rule if it satisfies one of the Horn clauses that define the rule. A unary (monadic) predicate is a predicate with arity 1; for example, NASDAQ(x) is a unary predicate.
b. Quantifiers
∀ means "for all". For example, ∀x StockPrice(x) ≥ 0 means that for all stocks, stock prices are non-negative.
∃ means "there exists". It is a way of stating the existence of some object in the world without explicitly identifying it. For example, ∃x StockPrice(SP500) > StockPrice(x) means that there is a stock x less expensive than SP500 at a given time. Using the introduced notation, the following clause can be written:

∃x (StockPrice(x) < $100 ← TradeVolume(x) < 100,000)

This is notation typical of the Prolog language. As already mentioned, more traditional logic notation uses the opposite order of expression:

∃x (TradeVolume(x) < 100,000 → StockPrice(x) < $100)

Both clauses are equivalent to the statement: "There is a stock such that if its trade volume is less than 100,000 per day, then its price is less than $100 per share." Combining two quantifiers, a more complex clause can be written:

∀x ∃y (StockPrice(y) < $100 ← TradeVolume(x) < 100,000)
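Over a finite universe of stocks, these quantifiers reduce to exhaustive checks with Python's all and any; a minimal sketch (prices and volumes are illustrative, not real data):

```python
# Sketch: checking quantified statements on a finite set of stocks.
# All prices and volumes below are illustrative.

price = {"SP500": 1400.0, "ABC": 35.0, "XYZ": 80.0}
volume = {"SP500": 2_000_000, "ABC": 50_000, "XYZ": 90_000}

# Forall x: StockPrice(x) >= 0
print(all(p >= 0 for p in price.values()))               # True

# Exists x: StockPrice(x) < StockPrice(SP500)
print(any(price[x] < price["SP500"] for x in price))     # True

# Exists x: TradeVolume(x) < 100,000 -> StockPrice(x) < $100
# (the implication is encoded as a material conditional: not A or B)
print(any((not volume[x] < 100_000) or price[x] < 100 for x in price))  # True
```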
Predicates defined by a collection of examples are called extensionally defined predicates, and predicates defined by a rule are called intensionally defined predicates. If predicates are defined by rules, then inference based on these predicates can be explained in terms of those rules. Similarly, the extensionally defined predicates correspond to the observable facts (or the operational predicates). A collection of intensionally defined predicates is also called domain knowledge or a domain theory.
II.2.2 Representative measurement theory
a. Problem definition
Rapid growth of databases is accompanied by a growing variety of types of numerical data. RDM has unique capabilities for discovering a wide range of human-readable, understandable regularities in databases with various data types. However, the use of symbolic relational data for numerical forecasting and the discovery of regularities in numerical data require the solution of two problems:
• Problem 1 is the development of a mechanism for transformation between numerical and relational data presentations. In particular, numerical stock market time series should be transformed into symbolic predicate form for discovering relational regularities.
• Problem 2 is the discovery of rules that are computationally tractable in a relational form.
Representative measurement theory (RMT) is a powerful mechanism for approaching both problems. RMT was motivated by the intention to formalize measurement in psychology to a level comparable with physics. Further study showed that measurement in physics itself should be formalized. Finally, a formal concept of measurement scale was developed. This concept expresses what we call a data type in data mining.
b. Some definitions
A relational structure A consists of a set A and relations S1, ..., Sn defined on A:

A = <A, S1, ..., Sn>

Each relation Si is a Boolean function (predicate) with ni arguments from A. The relational structure A = <A, S1, ..., Sn> is considered along with a relational structure of the same type:

R = <R, T1, ..., Tn>

Usually the set R is a subset of Re^m, m ≥ 1, where Re^m is the set of m-tuples of real numbers, and each relation Ti has the same arity ni as the corresponding relation Si; Ti and Si are called k-ary relations on R and A, respectively. Theoretically, it is not a formal requirement that R be numerical.
Next, the relational system A is interpreted as an empirical real-world system, and R is interpreted as a numerical system designed as a numerical representation of A. To formalize the idea of numeric representation, we define a homomorphism φ as a mapping from A to R. A mapping φ: A → R is called a homomorphism if for all i (i = 1, ..., n),

(a1, ..., ak(i)) ∈ Si ⇔ (φ(a1), ..., φ(ak(i))) ∈ Ti

or, in other notation, Si(a1, ..., ak(i)) ⇔ Ti(φ(a1), ..., φ(ak(i))).
Let Φ(A,R) be the set of all homomorphisms from A to R. It is possible that Φ(A,R) is empty or contains a variety of representations. Several theorems are proved in RMT about the contents of Φ(A,R). Theorems concerning whether Φ(A,R) is empty are called representation theorems; theorems concerning the size of Φ(A,R) are called uniqueness theorems.
Using the set of homomorphisms Φ(A,R), we can define the notion of permissible transformations and of data types (scale types). The most natural concept of a permissible transformation is a mapping of the numerical set R into itself that yields another "good" representation. More precisely, γ is permissible for Φ(A,R) if γ maps R into itself and, for every φ in Φ(A,R), γφ is also in Φ(A,R). For instance, the permissible transformations could be transformations x → rx or monotone transformations x → γ(x).
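A small Python sketch (a hypothetical three-element empirical system, not from the thesis) can make these definitions concrete: it checks S(a, b) ⇔ φ(a) > φ(b) for all pairs, and then verifies that a monotone transformation of φ is again a homomorphism.

```python
from itertools import product

# Empirical system: elements and one binary relation S ("is preferred to").
A = ["low", "mid", "high"]
S = {("mid", "low"), ("high", "low"), ("high", "mid")}

# Candidate numeric representation phi: A -> R, with T taken as ">" on reals.
phi = {"low": 1.0, "mid": 2.0, "high": 5.0}

def is_homomorphism(A, S, phi):
    """Check S(a, b) <=> phi(a) > phi(b) for every pair (a, b)."""
    return all(((a, b) in S) == (phi[a] > phi[b])
               for a, b in product(A, repeat=2))

print(is_homomorphism(A, S, phi))          # True

# A monotone transformation of phi yields another "good" representation:
gamma = {k: 10 * v + 3 for k, v in phi.items()}
print(is_homomorphism(A, S, gamma))        # still True
```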
c. Results from measurement theory for learning algorithms
- Regularities discovered with a numeric data presentation can also be discovered using a relational data representation
RMT yields the following result: any numerical type of data can be transformed into a relational form with complete preservation of the relevant properties of the numeric data type. This means that all regularities that can be discovered with a numeric data presentation can also be discovered using a relational data representation.
Many theorems support this property mathematically. These theorems are called homomorphism theorems, because they match (via set homomorphisms) numerical properties and predicates. This is, in fact, a theoretical basis for logical data presentation without losing empirical information and without generating meaningless predicates and numeric relations. Moreover, RMT changes the paradigm of data types: RMT argues that a numerical presentation is itself a relational data form, not the primary numerical form; it is a derivative presentation of an empirical relational data type. The theorems mentioned support this idea.
- The ordering relation is the most important relation for the transformation
The next critical result from measurement theory for learning algorithms is that the ordering relation is the most important relation for this transformation. Most practical regularities can be written in this form and discovered using an ordering relation. This relation is central to the transformation between numerical and relational data presentations.
- Speeding up the search for regularities with the help of the data type hierarchy developed in RMT
The hierarchy of data types developed in RMT can help to speed up the search for regularities. The search begins with testing rules based on properties of weaker scales and finishes with properties of stronger scales as defined in RMT. The number of search computations for weaker scales is smaller than for stronger scales. This idea is actually implemented in the MMDR method. MMDR begins by discovering regularities with monadic (unary) predicates, e.g., "x is larger than the constant 5" (x > 5), and only then discovers regularities with the ordering relation x > y on two variables; in other words, regularities based on ordering relations, which are more complex first-order logic relations, come later. A sketch of this staged search is given below.
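The following minimal Python sketch (thresholds and data are illustrative; this is a loose interpretation of the staged idea, not the actual MMDR implementation) evaluates cheap monadic threshold predicates before the binary ordering predicate.

```python
# Sketch: search ordered from weaker (monadic) to stronger (ordering) predicates.
# The series and thresholds below are illustrative.

series = [4.0, 6.5, 6.0, 7.2, 8.1]

# Stage 1: monadic predicates, e.g. "x > c" for a few candidate thresholds.
monadic = [(f"x > {c}", lambda x, c=c: x > c) for c in (5, 7)]

# Stage 2: a binary ordering predicate on consecutive values.
ordering = [("x(t) > x(t-1)", lambda a, b: b > a)]

for name, p in monadic:                      # cheap hypotheses tested first
    support = sum(p(x) for x in series) / len(series)
    print(f"monadic  {name}: support {support:.2f}")

for name, p in ordering:                     # stronger scale, tested later
    pairs = list(zip(series, series[1:]))
    support = sum(p(a, b) for a, b in pairs) / len(pairs)
    print(f"ordering {name}: support {support:.2f}")
```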
- Speeding up the search for regularities by reducing the space of hypotheses
Measurement theory suggests speeding up the search by searching in a smaller space of hypotheses. The reason is that there is no need to assume any particular class of numerical functions and then select a forecasting function in that class. In contrast, numerical interpolation approaches assume a class of numerical functions. RMT has several theorems that match classes of numeric functions with relations.
- Avoidance of incorrect data preprocessing
RMT helps to avoid incorrect data preprocessing, which may violate data type properties. This is an especially sensitive issue if preprocessing involves combinations of several features. Usually a discovered regularity changes significantly if uncoordinated preprocessing of attributes is applied; in this way, patterns can be corrupted. In particular, measurement theory can help to select appropriate preprocessing mechanisms for stock market data, and a better preprocessing mechanism speeds up the search for regularities.
d. Transformation of numerical data into relational form
There are two steps in the transformation of numerical data into relational form:
• Extracting, generating, and discovering essential predicates (relations)
• Matching these essential predicates with numerical data
Sometimes this is straightforward; sometimes it is a much more complex task, especially considering that the computational complexity of the problem grows exponentially with the number of new predicates. RDM can also be viewed as a tool for discovering predicates, which will then be used for solving a target problem by some other methods. In this way, the whole data mining area can be considered a predicate discovery technology.
The RMT theorems determine the numerical representations of the attributes and laws for the corresponding sets of axioms. RMT treats the numerical representations of the values and laws as merely numerical codes of the algebraic structures representing the operational properties of those values and laws. Thus, the algebraic structures are the primary representations of values and laws. The main statements and results of measurement theory are the following:
• Numerical representations of quantities and laws of nature are determined by the set of axioms for the corresponding empirical systems, and by algebraic systems with certain sets of relations and operations;
• The numerical representations are unique up to certain sets of permissible transformations (such as a change of measurement units);
• All physical attributes may be embedded into the structure of physical quantities;
• Physical laws are simple because all attributes involved in a law are simultaneously scaled by a special scaling process;
• The axiomatic approach is applicable not only to physical attributes and laws, but also to many attributes and laws of other domains (such as psychology).
e. Empirical axiomatic theory
Several studies have shown that actual success in discovering understandable regularities in databases depends significantly on the use of data type information. Data type information is the first source of predicates for relational data presentation.
RMT is intended for generating operations and predicates as a description of a data type. This RMT mechanism is called an empirical axiomatic theory. A variety of data types, such as matrices of pairwise comparisons, multiple comparisons, attribute-based matrices, matrices of ordered data, and matrices of closeness, can be represented in this way. The relational representation of data and data types is then used for discovering understandable regularities. We argue that current learning methods utilize only part of the data type information actually present in the data: they either lose part of the data type information or add some non-interpretable information.
The language of empirical axiomatic theories is an adequate lossless language for this goal. There is an important difference between the language of empirical axiomatic theories and the FOL language. The FOL language does not say anything about real-world entities. The language of empirical axiomatic theories, by contrast, uses the FOL language as a mathematical mechanism but also incorporates additional concepts to meet the strong requirement of empirical interpretation of relations and operations.
II.2.3 Breadth-first search
In graph theory, breadth-first search (BFS) is a graph search algorithm that begins at the root node and explores all the neighboring nodes. Then, for each of those nearest nodes, it explores their unexplored neighbor nodes, and so on, until it finds the goal.
- Components of search algorithms
Search algorithms share a number of common components based on linear data structures; additionally, certain mechanisms need to be built into the graph data structures in order to recover the routes discovered by the search algorithm. These components are as follows:
• An open list: a list of nodes under consideration. Implementing an open list is a straightforward matter, and usually a linear data structure is used (typically a queue or an ordered queue).
• A closed list: a list of nodes already considered. The closed list can also be represented by a queue, or a flag can be added to each node, set to zero while the node is unvisited and to one once it has been visited.
• A parentage mechanism: a mechanism for recovering the routes discovered by the search algorithm. The parentage mechanism works by recording which nodes result in others being placed on the open list.
- The breadth-first search algorithm
The breadth-first search algorithm requires that the identity of the starting node of the search and the destination node be known. The algorithm then searches through the whole graph and attempts to find a route from the start node to the destination node. Here is a broad outline of the breadth-first search algorithm:

put the starting node onto the open list
while the open list is not empty:
    n = node removed from the open list
    if n is the destination, stop
    otherwise, for each unvisited node s connected to n:
        add s to the open list
        mark n as the parent of s

This algorithm searches through all of the nodes added to the open list until the destination node appears in the queue, at which point the algorithm stops. Having performed this search, we recover the route by starting at the destination node and following the parentage of each node until we return to the start node.
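A runnable Python version of this outline (the graph and node labels are illustrative), with the open list as a queue, the parent map doubling as the closed list, and route recovery by walking the parentage back from the destination:

```python
from collections import deque

def bfs_route(graph, start, goal):
    """Breadth-first search returning a route from start to goal, or None."""
    open_list = deque([start])      # open list: nodes awaiting consideration
    parent = {start: None}          # parentage mechanism; also marks visited

    while open_list:
        n = open_list.popleft()
        if n == goal:               # destination reached: recover the route
            route = []
            while n is not None:
                route.append(n)
                n = parent[n]
            return route[::-1]
        for s in graph.get(n, ()):
            if s not in parent:     # unvisited node
                parent[s] = n
                open_list.append(s)
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(bfs_route(graph, "A", "E"))   # ['A', 'B', 'D', 'E']
```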
II.2.4 Occam's razor principle
The MMDR method can use Occam's razor, a principle commonly attributed to William of Occam (early 14th century), which states: "Entities should not be multiplied beyond necessity." This principle is generally interpreted as: "Among the theories that are consistent with the observed phenomena, one should select the simplest theory," or "When you have two competing theories which make exactly the same predictions, the simpler one is the better." Occam's razor suggests that among all the hypotheses that are correct for all (or for most of) the training examples, one should select the simplest; it can be expected that this hypothesis is most likely to capture the structure inherent in the problem and to achieve high prediction accuracy on objects outside the training set.
This principle is frequently used by noise-handling algorithms (e.g., rule truncation and tree pruning), as noise handling aims at simplifying the generated rules or decision trees in order to avoid overfitting a noisy training set.
Despite the successful use of Occam's razor as the basis for hypothesis construction, several problems arise in practice. First is the problem of defining the most appropriate complexity measure for identifying the simplest hypothesis, since different measures can select different simplest hypotheses for the same training set. Second, recent experimental work has clearly shown that application of Occam's razor may not always lead to the best prediction accuracy. Further empirical evidence against the use of Occam's razor is provided by boosting and bagging approaches, in which an ensemble of classifiers (a more complex hypothesis) typically achieves better accuracy than any single classifier. In addition to experimental evidence, much disorientation was caused by the so-called "conservation law of generalization performance". Although it is rather clear that real-world learning tasks differ from the set of all theoretically possible learning tasks, there remains the so-called "selective superiority problem": each algorithm performs best in some but not all domains.
II.3 Theory of RDM
In comparison with other data mining methods, the relational data mining approach is considered from the point of view of its data types, representation languages (to manipulate and interpret data), and class of hypotheses (to be tested on data). The RDM approach overcomes the limitations that particular data mining methods inherit from their data types, languages, and hypothesis classes by unlimited extension of the data type notion and hypothesis classes using FOL.
II.3.1 Data types in RDM
a. Problems of data types
The design of a particular data mining system implies the selection of the set of data types supported by that system. In object-oriented programming (OOP), this is part of the software design: data types are declared in the process of software development. If the data types of a particular learning problem are out of the range of the data mining system, users have two options: to redesign the system or to corrupt the types of the input training data to fit the system. The first option often does not exist at all, and the second produces inappropriate results.
There are two solutions to the problem:
• The first is to develop data type conversion mechanisms that work correctly within a data mining tool with a limited number of data types. For example, if the input data are of a cyclical data type and only linear data types are supported by the data mining tool, one may develop a coding of the cyclical data such that a linear mechanism will process the data correctly.
• Another solution is to develop universal data mining tools that handle any data type. In this case, the member functions of a data type should be input information along with the data (MMDR implements this approach). This problem is more general than the problem of a specific data mining tool.
In financial applications, the data are usually presented as numeric attributes, but relations are often not presented explicitly. More precisely, these attributes are coded with numbers, but the applicability of number relations and operations must be confirmed. Let the relative difference for the stock price,

Δ(t) = [S(t) − S(t−1)] / S(t),

be of a "float" data type. This is correct for computer memory allocation, but it does not help to decide whether all operations on float numbers are applicable to Δ(t). For instance, what does it mean to add one relative difference Δ(x) to another Δ(y)? There is no empirical procedure matching this sum operation. However, the comparison operation makes sense: e.g., Δ(x) < Δ(y) means faster growth of the stock price on date y than on date x. This relation also helps interpret the relation "Δ(w) is between Δ(x) and Δ(y)" as

Δ(x) < Δ(w) and Δ(w) < Δ(y), or Δ(y) < Δ(w) and Δ(w) < Δ(x).

Both of these relations are already interpreted empirically.
Therefore, Δ values can be compared, but one should probably avoid the addition operation (+) if the goal is to produce an interpretable learned rule. If one ignores these observations and applies operations that are formally proper for float numbers in programming languages, a learned rule will be difficult to interpret. As already mentioned, these difficulties arise from the uncertainty of the set of interpretable operations and predicates for the data, i.e., the uncertainty of the empirical content of the data.
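A short Python sketch of this point (the prices are illustrative): Δ(t) is computed numerically, but only the order relations on Δ values are treated as empirically meaningful.

```python
# Sketch: relative differences support comparison, but their sum has no
# empirical meaning. The price series below is illustrative.

S = {1: 50.0, 2: 52.0, 3: 51.0, 4: 54.0}

def delta(t):
    """Relative difference: (S(t) - S(t-1)) / S(t)."""
    return (S[t] - S[t - 1]) / S[t]

# Interpretable: ordering of growth rates between two dates.
print(delta(2) > delta(3))    # True: faster growth on date 2 than on date 3

# Interpretable: "delta(w) is between delta(x) and delta(y)".
x, w, y = 3, 2, 4
print(min(delta(x), delta(y)) < delta(w) < max(delta(x), delta(y)))  # True

# Avoided on purpose: delta(2) + delta(3) has no matching empirical procedure.
```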
b. Levels of data type
- Single-level data type
A data type (type for short) in modern object-oriented programming (OOP) languages is a rich data structure <A, P, F>. It consists of elements A = {a1, a2, ..., an}, relations between elements (predicates) P = {P1, P2, ..., Pm}, and meaningful operations on elements F = {F1, F2, ..., Fk}. Operations may involve two, three, or more elements, e.g., c = a # b, where # is an operation on elements a and b producing element c. This definition formalizes the concept of a single-level data type.
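A single-level data type in this sense can be sketched directly in an OOP style; the following Python fragment (the class name and the weekday example are hypothetical) bundles the elements A, the predicates P, and the operations F into one structure.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DataType:
    """A data type as a structure <A, P, F>."""
    elements: set                                                   # A
    predicates: dict[str, Callable] = field(default_factory=dict)   # P
    operations: dict[str, Callable] = field(default_factory=dict)   # F

# Example: weekdays with an ordering predicate but no addition operation.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
weekday = DataType(
    elements=set(DAYS),
    predicates={"earlier": lambda a, b: DAYS.index(a) < DAYS.index(b)},
    operations={},   # no meaningful '+' exists for weekdays
)
print(weekday.predicates["earlier"]("Mon", "Wed"))   # True
```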
Traditional AVLs operate with much simpler single-level data types. Implicitly, each attribute in an AVL reflects a type, which can take a number of possible values; these values are the elements of A. It is common in AVLs for a data type to be given only implicitly: usually, the relations P and operations F are not expressed explicitly. However, such data types can be embedded explicitly into AVLs.
- Multilevel data type
A multilevel data type can be defined by considering each element ai from A as a composite data structure (data type) instead of as an atom.
- Complex data types and selector functions
Each data type is associated with a relational system, which includes: (1) cardinality, (2) permissible operations on data type elements, and (3) permissible relations between data type elements. In turn, each data type element may consist of its own subelements with their own types. Selector functions serve for extracting subterms from terms; without selector functions, the internal structure of the type could not be accessed.
II.3.2 Relational representation of examples
Knowledge representation is an important and informal initial step in RDM. There are many ways to represent knowledge in the FOL language: one of them can omit important information; another can hide it. As a result, data mining algorithms may work too long to "dig out" the relevant information, or may even produce inappropriate rules. Introducing data types and the concepts of representative measurement theory (RMT) into the knowledge representation process helps to address this representation problem. In fact, measurement theory has developed a wide set of data types.
Relational representation of examples is the key to relational data mining. If examples are already given in relational form, relational methods can be applied directly. For attribute-based examples, this is not the case: it requires expressing attribute-based examples and their data types in relational form. There are two major ways to express attribute-based examples using predicates:
- Generating predicates for each value
To express the price $909.0 from Table II.4 in predicate form, we may generate a predicate P9090(x) such that P9090(x) = true if and only if the price is equal to $909.0. In this way, we would be forced to generate about 10,000 predicates if prices range from $1 to $1000 in $0.10 steps. In this case, the price data type has not yet been fully presented by the P9090(x) predicate; therefore, additional relations expressing this data type should be introduced. For example, it can be a