VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY
* * *
CHU THAI HOA
METHODOLOGY OF RELATIONAL DATA MINING FOR
STOCK MARKET PREDICTION
Major: Information Technology
Code: 1.01.10
MASTER'S THESIS
Instructor: Prof. Dr. HO TU BAO
Hanoi, June 2007
ABSTRACT
This thesis presents the methodology of relational data mining for stock market prediction by clarifying each problem related to the keywords: methodology, relational, data mining, stock market, and prediction, and then coming to the methodology of relational data mining, with emphasis on Machine Methods for Discovering Regularities (MMDR), for stock market prediction.
Stock market prediction has been widely studied as a time-series prediction problem. Deriving relationships that allow one to predict future values of a time series is challenging. One approach to prediction is to spot patterns in the past, when we already know what followed them, and to test them on more recent data. If a pattern is followed by the same outcome frequently enough, we can gain confidence that it is a genuine relationship.
The purpose of relational data mining (RDM) is to overcome the limitations of attribute-based learning methods (commonly used in finance) in representing background knowledge and complex relations. RDM approaches look for patterns that involve multiple tables (relations) from a relational database. This approach will play a key role in future advances in data mining methodology and practice.
The MMDR method is one of the few Hybrid Probabilistic Relational Data Mining methods developed and applied to stock market data. The method has an advantage in handling numerical data. It expresses patterns in First-order Logic (FOL) and assigns probabilities to rules generated by composing patterns. This will be made clear through an application of MMDR with a computational experiment on price index data of Standard and Poor's 500.
The thesis consists of three chapters concentrating on relational data mining methodology for stock market prediction.
ACKNOWLEDGEMENTS
This thesis would not have been completed without the help and support of many people. I would like to take this opportunity to express my gratitude to the many people who helped me during the period of work leading to this thesis.
In particular, I would like to thank my instructor, Prof. Dr. HO Tu Bao, for his courage in accepting me as a Master's student, for his enthusiasm, his knowledge and his encouragement in the work throughout. I would never have been able to finish this thesis without his encouragement as well as his strict requirement for quality of the research.
I also enjoyed and appreciated the fruitful exchange of ideas with Dr. NGUYEN Trong Dung, to whom I am also grateful for comments on the thesis. In the early days of my research, Dr. HA Quang Thuy, Dr. PHAM Tran Nhu and Dr. DO Van Thanh stimulated my interest in data mining for financial forecasting. I am thankful for that and for the many discussions I had with them.
I am indebted to CFO LE The Anh and CFO NGUYEN Minh Quang for their patience with my questions on financial and stock market forecasting. I am also grateful to Dr. PHAM Ngoc Khoi, Dr. NGUYEN Phu Chien, MSc. DAO Van Thanh, and Mrs. LE Thi Hoang My for words of encouragement during the months of thesis work and for their style-improving suggestions. My thanks also go to everyone who has provided support or advice to me on data mining, the stock market, forecasting and so on in one way or another.
My family has given me every favorable condition to complete the thesis. I dedicate the thesis to my father, my mother and my younger brother, whose love and support are always with me.
Hanoi, June 2007,
CHU Thai Hoa
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES AND FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
Problem definition
Motivations of the Thesis
Objectives of the Thesis
Method of the Thesis study
Structure of the Thesis
CHAPTER I: OVERVIEW OF STOCK MARKET PREDICTION IN DATA MINING
1.1 Introduction to stock market prediction
1.1.1 Basic concepts of forecast
1.1.2 Prediction tasks in stock market
1.1.3 Stock market time series properties
1.1.4 Stock market prediction with the efficient market theory
1.1.5 Questions in stock market prediction
1.1.6 Challenges and Possibilities on Developing a Stock Market Prediction System
1.2 Data mining methodology for stock market prediction
1.2.1 Prediction in data mining
1.2.2 Parameters
1.2.3 Approaches to stock market prediction
1.2.4 Data mining methods in stock market
CHAPTER II: RELATIONAL DATA MINING FOR STOCK MARKET PREDICTION
II.1 Introduction
II.2 Basic problems
II.2.1 First-order logic and rules
II.2.2 Representative measurement theory
II.2.3 Breadth-first search
II.2.4 Occam's razor principle
II.3 Theory of RDM
II.3.1 Data types in RDM
II.3.2 Relational representation of examples
II.3.3 Background knowledge and problems of search for regularities
II.4 An algorithm for RDM: MMDR
II.4.1 Motivations of choice for MMDR
II.4.2 Some concepts
II.4.3 Algorithm MMDR
CHAPTER III: AN APPLICATION OF MMDR TO STOCK PRICE PREDICTION
III.1 MMDR model for prediction
III.2 Experiment preparation
III.2.1 Data description and representation
III.2.2 Demo program
III.3 Application of MMDR model
III.3.1 Step 1: Generating logical rules
III.3.2 Step 2: Learning logical rules
III.3.3 Step 3: Creating intervals
III.4 Results and evaluations
III.4.1 Stability of discovered rules on test data
III.4.2 Evaluations of forecast performance
CONCLUSIONS
Contributions of the thesis
Limitations of the thesis
Future work
Summary
APPENDICES
Source code
REFERENCES
In English
In Vietnamese
Website
LIST OF TABLES AND FIGURES
Comparison of AVL-based methods and first-order logic methods
UpDown predicate
Predicates Up and Down
Examples of terms
Attribute-based data example
Partial background knowledge for stock market
Figure III.1: Flow diagram for MMDR model: steps and techniques
Training set and Test set
Examples of rule consistent with hypotheses H1-H4
Table A.1: Stability checking table
Table A.2: Performance metrics for a set of 125 regularities
Figure A.1: Performance of 125 found regularities on test data
Table A.3: Performance metrics for a set of 292 regularities
Figure A.2: Performance of 125 found regularities on test data
Table A.5: Performance for regularity with conditional probability of 0.49
Figure A.3: Performance of an individual regularity with conditional probability of 0.49 on test data
Table A.6: Performance for regularity with conditional probability of 0.84
Figure A.4: Performance of an individual regularity with conditional probability of 0.84 on test data
Table A.7: Forecast result for the day December 1st, 2006 (the regularity with conditional probability of 0.84)
Table A.8: Forecast result for the day December 1st, 2006 (the set of 292 regularities with conditional probability not less than 0.65)
LIST OF ABBREVIATIONS
AI : Artificial Intelligence
AVL(s) : Attribute-value language(s)
DM : Data Mining
FOL : First-order Logic
ILP : Inductive Logic Programming
ML : Machine Learning
MMDR : Machine Methods for Discovering Regularities
MRDM : Multi-Relational Data Mining
RDM : Relational Data Mining
RMT : Representative Measurement Theory
INTRODUCTION
Problem definition
There are four major technological reasons stimulating data mining development, applications and public interest: the emergence of very large databases; advances in computer technology; fast access to vast amounts of data; and the ability to apply computationally intensive statistical methodology to these data.
Data mining is the process of discovering hidden patterns in data. Due to the large size of databases, the importance of the information stored, and the value of the information obtained, finding hidden patterns in data has become increasingly significant. The stock market provides an area in which large volumes of data are created and stored on a daily basis.
Financial forecasting has been widely studied as a case of the time-series prediction problem. Time series such as the stock market are often seen as non-stationary, which presents challenges in predicting future values. The efficient market theory states that it is practically impossible to predict financial markets long-term. However, there is good evidence that short-term trends do exist and programs can be written to find them. The data miners' challenge is to find the trends quickly while they are valid, as well as to recognize the time when the trends are no longer effective. Data mining methods provide the framework for stock market predictions to discover hidden trends and patterns.
Well-known and commonly used data mining methods in the stock market are attribute-based learning methods, but they have some serious drawbacks: a limited ability to represent background knowledge and a lack of complex relations. The purpose of RDM is to overcome these limitations. RDM is a learning method that is better suited for stock market mining, with a better ability to explain discovered rules than other symbolic approaches.
However, current relational methods are relatively inefficient and have rather limited facilities for handling numerical data. RDM as a hybrid learning method combines the strength of FOL and probabilistic inference to meet these challenges. One of the few Hybrid Probabilistic Relational Data Mining methods, MMDR, which handles numerical data efficiently, has been developed and applied to stock market data.
It is believed that now is the time for RDM methods; in particular, MMDR applied to stock market prediction has advantages in discovering regularities in stock market time series.
Motivations of the Thesis
In the past few years, Vietnam's stock market was still in an early stage of development and thus did not catch the attention of investors and researchers. In particular, for interested learners, mastering professional methods of stock market analysis and forecasting requires time and wide background knowledge to study all the fields covered. Moreover, according to the efficient market theory, it is practically impossible to infer a fixed long-term global forecasting model from historical stock market information. Therefore, there have been few Vietnamese interested in and performing research on stock market prediction.
The past two years have witnessed the surprising development of the Vietnamese stock market, with a host of notable events. Especially after Vietnam became a World Trade Organization (WTO) member, the Vietnamese economy has had many opportunities to develop, leading to the development of many companies and markets, including the financial and stock markets. It is said that Vietnam's stock market will grow rapidly in the next years, and that it will rank second in the region, just after China, in terms of growth rate.
Under the rapid development of Vietnam's financial market, professional activities such as analysis and prediction of the financial market should be paid more attention. In particular, these activities play a significant role in the task of macroeconomic forecasting at the National Center for Socio-economic Information and Forecast (under the Ministry of Planning and Investment), which helps make sound policies related to socio-economic management and regulation at the macro level. Data mining provides some methods and techniques that are able to help approach stock market prediction quite effectively.
In fact, there have already been some studies and successful applications of data mining techniques to stock market forecasting. However, the capture of knowledge and the application techniques of each approach are quite challenging and time-consuming. I read some papers and especially paid attention to research on relational data mining in finance by two researchers, Prof. Dr. Boris Kovalerchuk and Dr. Evgenii Vityaev. They reported that, "Mining stock market data presents special challenges. For one, the rewards for finding successful patterns are potentially enormous, but so are the difficulties and sources of confusion. The efficient market theory states that it is practically impossible to predict financial markets long-term. However, there is good evidence that short-term trends do exist and programs can be written to find them. The data miners' challenge is to find the trends quickly while they are valid, to deal effectively with time series and calendar effects, as well as to recognize the time when the trends are no longer effective."
The learning method RDM is able to learn more expressive rules, make better use of underlying domain knowledge, and explain discovered rules better than other symbolic approaches. It is thus better suited for stock market mining. This approach will play a key role in future advances in data mining methodology and practice. The earlier algorithms for RDM suffer from relative computational inefficiency and have rather limited tools for processing numerical data. This problem especially needs to be considered in stock market analysis, where data are commonly numerical time series. Therefore, RDM as a hybrid learning method that combines the strength of FOL and probabilistic inference has been developed to meet these challenges. One of the few Hybrid Probabilistic Relational Data Mining methods, MMDR, which handles numerical data efficiently, has been developed and applied to stock market forecasting.
The common question "Can stock market prediction be profitable?" is often put to any research on methods of stock market prediction. In fact, there are few people doing research on RDM for stock market forecasting, because it requires interested learners to have wide background knowledge to understand all the fields covered. Much less has been reported publicly on the success of data mining in real trading by financial institutions. If real success is reported, then competitors can apply the same methods and the leverage will disappear, because in essence all fundamental data mining methods are not proprietary. I used to concentrate my study in an attempt to end up with a Master's degree and as a millionaire (kidding), but that is too high a risk to take.
Basing my intention on practical suggestions and requirements, as well as my personal interest, I came to a decision to do research on stock market forecasting. Through some school lessons and extra self-learning efforts, I accessed some data mining techniques to seek a solution to the task. The points above motivate the aim of the thesis: to carry out research and experiments on the methodology of RDM for stock market prediction.
Objectives of the Thesis
- Systematic organization of RDM methodology for stock market prediction
Most of the existing studies on RDM for stock market prediction are reported in a short, overview way, which causes difficulties for many readers. The thesis is primarily based on the book "Data Mining in Finance: Advances in Relational and Hybrid Methods" and some papers by the two researchers Dr. Kovalerchuk and Dr. Vityaev. However, after having a thorough grasp of the RDM methodology, I systematically organize the methodology, especially the algorithm MMDR, in my own view and supplement the thesis with further extensions of knowledge in data mining and stock market forecasting. Hopefully, it plays an important role in helping newcomers approach the problem more favorably.
- Experimental performance of the MMDR method for stock market prediction
The centre of the thesis is the issue of discovering regularities in stock price series, addressed and illustrated through MMDR. The thesis also carries out an RDM application to stock market prediction through an experiment with a small self-developed program on a set of Standard and Poor's data. The experiment helps one understand and gain more trust in the feasibility and efficiency of the RDM methodology and the MMDR algorithm presented in the thesis.
Method of the Thesis study
The study behind the thesis has been mostly goal driven. As problems appeared on the way to realizing stock market prediction, they were tackled by various means, as listed below:
• Investigation of some existing machine learning and data mining methods through related documents such as doctoral theses, Master's theses, online papers, books, etc.
• Reading of financial and stock market literature for properties, forecast techniques and hints of regularities in stock market data able to be exploited.
• Learning about some existing stock market prediction software for deeper understanding of the regularities discovered.
• Some theoretical considerations on mechanisms behind the generation of stock data, and on general predictability demands and limits.
• Practical insights into the realm of trading in the stock market.
• Contacts with experts on data mining and data mining software development, and with stock market investors and chief financial officers.
• Courses on economic forecasting and the stock market, mostly organized by the National Center for Socio-economic Information and Forecast.
• Collection of related documents and systematization of the Master's thesis.
• Programming in PHP and carrying out experiments to illustrate and to prove the main idea and algorithm presented in the thesis.
Structure of the Thesis
The thesis is structured in the following way. The first part introduces the problem definition, method of study, objectives and structure of the thesis.
Chapter 1 provides an overview of stock market prediction in data mining through the two following parts. "Introduction to stock market prediction" includes basic concepts of stock market forecasting, stock market prediction with the Efficient Market Theory, stock market time series properties, and challenges and possibilities in developing a stock market prediction system, etc. The last part, "Data mining methodology for stock market prediction", presents some major types of data mining prediction, approaches to stock market prediction, and comparisons of representation languages and data mining methods used in the stock market.
Chapter 2 discusses some basic problems, the theory of RDM, and the algorithm MMDR. In comparison with other data mining methods, the RDM approach is considered from the point of view of its data types, representation languages (to manipulate and interpret data) and class of hypotheses (to be tested on data). One of the few Hybrid Probabilistic Relational Data Mining methods, MMDR, which is equipped with a probabilistic mechanism that is necessary for time series with a high level of noise, is mainly introduced.
In Chapter 3, an MMDR application to stock market price prediction makes the methodology clear through three steps: rule generating, rule learning and interval creating. This chapter also brings out some statistical results and evaluations for the experiment conducted to demonstrate the application.
Finally, the contributions, limitations and future work of my research are given as the conclusion part of the thesis. In the appendix part, the thesis also provides some table structures and source code developed by myself that are used for the experiment.
CHAPTER I: OVERVIEW OF STOCK MARKET PREDICTION IN DATA MINING
1.1 Introduction to stock market prediction
1.1.1 Basic concepts of forecast
This section provides brief basic concepts of forecasting. An introductory discussion of the topic can be found in [46] (Michael Leonard, Large-Scale Automatic Forecasting: Millions of Forecasts, International Symposium of Forecasting, 2002).
Forecasts are time series predictions made for future periods in time. They are random variables and therefore have an associated probability distribution. The mean or median of each forecast is called the prediction. The variance of each forecast is called the prediction error variance, and the square root of the variance is called the prediction standard error. The variance is computed from the forecast model parameter estimates and the model residual variance.
The forecast for the next future period is called the one-step ahead forecast. The forecast for h periods in the future is called the h-step ahead forecast. The forecast horizon or forecast lead is the number of periods into the future for which predictions are made (one-step, two-step, ..., h-step). The larger the forecast horizon, the larger the prediction error variance at the end of the horizon.
The confidence limits are based on the prediction standard errors and a chosen confidence limit size. A confidence limit size of 0.05 results in 95% confidence limits. The confidence limits are often computed assuming a normal distribution, but others could be used. As with the prediction standard errors, the width of the confidence limits increases with the forecast horizon.
The prediction error is the difference between the predicted value and the actual value when the actual value is known. For transformed models, it is important to understand the difference between the model errors (or residuals) and the prediction errors. The residuals measure the departure from the model in the transformed metric. The prediction errors measure the departure from the original series.
Taken together, the predictions, prediction standard errors, and confidence limits at each period in the forecast horizon are the forecasts. Although many people use the word "forecast" to imply only the prediction, a forecast is not one number for each future time period.
Using a transformed forecasting model requires the following steps:
• The time series data are transformed.
• The transformed time series data are fit using the forecasting model.
• The forecasts are computed using the parameter estimates and the transformed time series data.
• The forecasts (predictions, prediction standard errors, and confidence limits) are inverse transformed.
The naive inverse transformation results in median forecasts. To obtain mean forecasts requires that the prediction and the prediction error variance both be adjusted based on the transformation. Additionally, the model residuals will be different from the prediction errors due to this inverse transformation. If no transformation is used, the model residual and the prediction error will be the same, and likewise the mean and median forecast will be the same (assuming a symmetric disturbance distribution).
The statistics of fit evaluate how well a forecasting model performs by comparing the actual data to the predictions. For a given forecast model that has been fitted to the time series data, the model should be checked or evaluated to see how well it fits or forecasts the data. The statistics of fit can be computed from the model residuals or the prediction errors.
When a particular statistic of fit is used for forecast model selection, it is referred to as the model selection criterion. When using model selection criteria to rank forecasting models, it is important to compare the errors on the same metric; that is, you should not compare transformed model residuals with non-transformed model residuals. You should first inverse transform the forecasts from the transformed model prior to computing the prediction errors, and then compute the model selection criterion based on the prediction errors.
1.1.2 Prediction tasks in stock market
Boris Kovalerchuk and Evgenii Vityaev, "Data Mining for Financial Applications" (in: O. Maimon, L. Rokach (Eds.), The Data Mining and Knowledge Discovery Handbook), distinguish two prediction tasks in the stock market:
• Straight prediction of a stock market numeric characteristic, e.g., stock return or exchange rate.
• Prediction of whether the stock market characteristic will increase or decrease.
Note that in the first case it is necessary to take into account the trading cost and the significance of the trading return, and in the second case it is necessary to forecast whether the stock market characteristic will increase or decrease by no less than some threshold. Thus, the difference between data mining methods for the first and second cases can be less obvious, because the second case may require some kind of numeric forecast.
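A small sketch of how a numeric forecast (the first task) can be reduced to an up/down decision (the second task) once trading cost and a significance threshold are taken into account; the cost and threshold values below are hypothetical:

```python
def trading_signal(predicted_return, trading_cost=0.001, threshold=0.005):
    """Map a numeric forecast onto an up/down decision: trade only when
    the predicted move clears the trading cost by at least a threshold."""
    if predicted_return - trading_cost >= threshold:
        return "up"        # predicted increase worth trading
    if predicted_return + trading_cost <= -threshold:
        return "down"      # predicted decrease worth trading
    return "no trade"      # move too small to be significant

print(trading_signal(0.012))   # 'up'
print(trading_signal(0.002))   # 'no trade'
```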
Financial institutions produce huge datasets that build a foundation for approaching these enormously complex and dynamic problems with data mining tools. The potential significant benefits of solving these problems have motivated extensive research for years.
1.1.3 Stock market time series properties
One may wonder if there are universal characteristics of the many series coming from markets different in size, location, sophistication, etc. The surprising fact is that there are. Moreover, interacting systems in other fields, such as statistical mechanics, suggest that the properties of stock market time series depend only loosely on the market microstructure and are common to a range of interacting systems. Such observations have stimulated new models of markets based on analogies with particle systems and brought in new analysis techniques, opening the era of econophysics. A more detailed discussion of stock market time series properties can be found in [66] (Stefan Zemke, On Developing a Financial Prediction System: Pitfalls and Possibilities, First International Workshop on Data Mining Lessons Learned at ICML'02, 2002). This section briefly introduces stock market time series properties, including:
- Distribution
The distribution of stock market series tends to be non-normal, sharp peaked and heavy-tailed, these properties being more pronounced for intraday values. Interestingly, such observations were pioneered around the time the EMH was formulated. Extreme values appear more frequently in a stock market series than in a normally-distributed series of the same variance. This is important to the practitioner, since often such values cannot be disregarded as erroneous outliers but must be actively anticipated, because their magnitude can influence trading performance.
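A quick way to check the sharp-peaked, heavy-tailed shape is excess kurtosis, which is 0 for a normal distribution; a minimal sketch on hypothetical daily returns:

```python
def excess_kurtosis(returns):
    """Excess kurtosis: 0 for a normal distribution; markedly positive
    values indicate the heavy tails typical of stock market returns."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    m4 = sum((r - mean) ** 4 for r in returns) / n
    return m4 / var ** 2 - 3.0

# Hypothetical daily returns: a single extreme value dominates.
returns = [0.001, -0.002, 0.0005, 0.003, -0.001, -0.08, 0.002, 0.001]
print(excess_kurtosis(returns))  # well above 0
```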
- Scaling property
The scaling property of a time series indicates that the series is self-similar at different time scales. This is common in stock market time series: given a plot of returns without the axes labeled, it is next to impossible to say whether it represents hourly, daily or monthly changes, since all the plots look similar, with differences appearing only at minute resolution. Thus prediction methods developed for one resolution could, in principle, be applied to others.
- Data frequency
Data frequency refers to how often series values are collected: hourly, daily, weekly, etc. Usually, if a stock market series provides values on a daily or longer basis, it is low frequency data; otherwise, when many intraday quotes are included, it is high frequency data. Tick-by-tick data includes all individual transactions, and as such, the event-driven time between data points varies, creating a challenge even for such a simple calculation as correlation.
1.1.4 Stock market prediction with the efficient market theory
The Efficient Market Theory/Hypothesis (EMH) initially got wide acceptance in the financial community. It asserts, in weak form, that the current price of an asset already reflects all information obtainable from past prices, and assumes that news is promptly incorporated into prices. Since news is assumed unpredictable, so are prices. In other words, according to the EMH, the evolution of the price of each economic variable is a random walk. The variations in prices are completely independent from one time step to the next in the long run. The EMH states that it is practically impossible to infer a fixed long-term global forecasting model from historical stock market information. This idea is based on the observation that if the market presents some kind of regularity, then someone will take advantage of it and the regularity disappears.
However, real markets do not obey all the consequences of the hypothesis; e.g., a price random walk implies a normal distribution, which is not the observed case, and there is a delay while the price stabilizes to a new level after news. This, among other things, leads to a more modern view: "Overall, the best evidence points to the following conclusion. The market isn't efficient with respect to any of the so-called levels of efficiency. The value investing phenomenon is inconsistent with semi-strong form efficiency, and the January effect is inconsistent even with weak form efficiency. Overall, the evidence indicates that a great deal of information available at all levels is, at any given time, reflected in stock prices. The market may not be easily beaten, but it appears to be beatable, at least if you are willing to work at it."
The market efficiency theory does not exclude that hidden short-term local conditional regularities may exist. These regularities cannot work "forever"; they should be corrected frequently. It has been shown that stock market data are not random and that the efficient market hypothesis is merely a subset of a larger chaotic market hypothesis. This hypothesis does not exclude successful short-term forecasting models for prediction of chaotic time series.
Data mining does not try to accept or reject the efficient market theory. Data mining creates tools which can be useful for discovering subtle short-term conditional patterns and trends in a wide range of stock market data. This means that retraining should be a permanent part of data mining in the stock market, and any claim that a silver-bullet trading method has been found should be treated similarly to claims that a perpetual motion machine has been discovered.
1.1.5 Questions in stock market prediction
Following are some questions of scientific and practical interest concerning stock market prediction:
• Prediction possibility: Is statistically significant prediction of stock market data possible? Is profitable prediction of such data possible? What does an answer to the former question imply, adjusted by the constraints imposed by real markets?
• Methods: If prediction is possible, what methods are best at performing it?
What methods are best-suited for what data characteristics - could it be said
in advance?
• Meta-methods: What are the ways to improve the methods? Can
metaheuristics successful in other domains, such as ensembles or pruning,
improve stock market prediction?
• Data: Can the amount and type of data needed for prediction be characterized?
• Data preprocessing: Can data transformations that facilitate prediction be
identified? In particular, what transformation formulae enhance input data?
• Evaluation: What are the features of a sound evaluation procedure, respecting the properties of stock market data and the expectations of stock market prediction? What are the common evaluation drawbacks?
• Predictor development: Are there any common features of successful prediction systems? If so, what are they, and how could they be advanced? Can common reasons for failure of stock market prediction be identified? Are they intrinsic and non-reparable, or is there a way to amend them?
• Transfer to other domains: Can the methods developed for stock market
prediction benefit other domains?
• Predictability estimation: Can stock market data be reasonably quickly estimated to be predictable or not, without the investment to build a custom system? What are the methods, what do they actually say, and what are their limits?
• Consequences of predictability: What are the theoretical and practical consequences of demonstrated predictability of stock market data, or of the impossibility of it? How does a successful prediction method translate into economic models? What could be the social consequences of stock market prediction?
1.1.6 Challenges and Possibilities on Developing a Stock Market Prediction System
A successful stock market prediction system presents many challenges. Some are encountered over and over again, and though an individual solution might be system-specific, general principles still apply. Using them as a guideline might save time and effort and boost results, thus promoting the project's success.
The idea of stock market prediction (and the resulting riches) is appealing, initiating countless attempts. In this competitive environment, if one wants above-average results, one needs above-average insight and sophistication. Reported successful systems are hybrid and custom made, whereas straightforward approaches, e.g., a neural network plugged into relatively unprocessed data, usually fail. The individuality of a hybrid system offers chances and dangers. One can bring together the best of many approaches; however, the interaction complexity hinders judging where the performance dis/advantage is coming from.
Stock market prediction has been widely studied as a case of the time-series prediction problem. The difficulty of this problem is due to the following factors: low signal-to-noise ratio, non-Gaussian noise distribution, nonstationarity, and nonlinearity. Deriving relationships that allow one to predict future values of a time series is a challenging task when the underlying system is highly non-linear. Usually, the history of the time series is provided and the goal is to extract from that data a dynamic system. The dynamic system models the relationship between a window of past values and a value T time steps ahead. Discovering such a model is difficult in practice, since the processes are typically corrupted by noise and can only be partially modeled due to missing information and the overall complexity of the problem. In addition, stock market time series are inherently non-stationary, so adaptive forecasting techniques are required.
- Data Preprocessing
Before data is fed into an algorithm, it must be collected, inspected, cleaned and selected. Since even the best predictor will fail on bad data, data quality and preparation are crucial. Also, since a predictor can exploit only certain data features, it is important to detect which data preprocessing/presentation works best.
• Visual inspection is invaluable. At first, one can look for: trend (whether it needs removing), histogram (whether to redistribute), missing values and outliers, and any regularities.
• Missing values are dealt with by data mining methods.
• Series to instances conversion is required by most learning algorithms, which expect as input a fixed-length vector (see the sketch after this list).
• Indicators are series derived from others, enhancing some features of interest, such as trend reversal.
• Feature selection can make learning feasible, since, because of the curse of dimensionality, long instances demand (exponentially) more data.
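A minimal sketch of the series-to-instances conversion: each instance is a fixed-length window of past values, labeled with a future value; the window, horizon and prices below are arbitrary:

```python
def series_to_instances(series, window=5, horizon=1):
    """Convert a time series into fixed-length instances: each instance
    is a window of past values, labeled with the value 'horizon' steps
    ahead. Most learning algorithms expect such fixed-length vectors."""
    instances = []
    for t in range(window, len(series) - horizon + 1):
        x = series[t - window:t]     # the input window
        y = series[t + horizon - 1]  # the target value
        instances.append((x, y))
    return instances

prices = [10.0, 10.2, 10.1, 10.4, 10.5, 10.3, 10.6, 10.8]
for x, y in series_to_instances(prices, window=3):
    print(x, "->", y)
```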
- Prediction Algorithms
Common learning algorithms and their features important to stock market prediction:
• Linear methods are widely used in stock market prediction.
• Neural networks seem the method of choice for stock market prediction.
• C4.5 and ILP generate decision trees/if-then rules, which are human understandable if small.
• Nearest Neighbor does not create a general model; to predict, it looks back for the most similar case(s). Irrelevant/noisy features disrupt the similarity measure, so pre-processing is worthwhile.
• A Bayesian classifier/predictor first learns probabilities of how evidence supports outcomes, which are then used to predict a new case's outcome.
• Support Vector Machines (SVM) are a relatively new and powerful learner, having attractive characteristics for time series prediction.
- System Evaluation
Proper evaluation is critical to prediction system development. First, it has to measure exactly the interesting effect, as opposed to prediction accuracy. Second, it has to be sensitive enough to distinguish often minor gains. Third, it has to convince that the gains are not merely a coincidence.
• Evaluation bias, resulting from the evaluation scheme and time series data, needs to be recognized.
• Evaluation data should include different regimes, markets, even data errors, and be plentiful. Dividing test data into segments helps to spot performance irregularities (for different regimes).
• Sanity checks involve common sense. Prediction errors along the series should not reveal any structure, unless the predictor missed something.
1.2 Data mining methodology for stock market prediction
1.2.1 Prediction in data mining
a. Introduction
The goal of data mining is to produce new knowledge that the user can act upon. It does this by building a model of the real world based on data collected from a variety of sources. The result of the model building is a description of patterns and relationships in the data that can be confidently used for prediction.
Prediction is one of the most important problems in data mining. It involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. The goal of prediction is to forecast or deduce the value of an attribute based on the values of other attributes.
b. Major types of prediction
- In the view of construction and use of a model
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value ranges of an attribute that a given sample is likely to have. In this view, classification and regression are the two major types of prediction problems:
• Classification: used for discrete or nominal values. It predicts into what category or class a case falls. In other words, classification problems aim to identify the characteristics that indicate the group to which each case belongs. Data mining creates classification models by examining already classified data (cases) and inductively finding a predictive pattern.
• Regression: used to predict continuous or ordered values. It predicts what numeric value a variable will have. In other words, regression uses existing values to forecast what other values will be. The prediction of continuous values can be modeled by statistical techniques of regression.
- In the view of the use of prediction
This view is commonly accepted in data mining. Prediction refers to the use of prediction to predict class labels (classification) and to predict continuous values (prediction):
• Classification: used to extract models describing important data classes. Classification predicts categorical class labels. It classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model in classifying new data.
• Prediction: used to predict future data trends, i.e., to predict unknown or missing values. It models continuous-valued functions. Any of the methods and techniques used for classification may also be used for prediction.
1.2.2 Parameters
There are several parameters that characterize data mining methodologies for stock market forecasting:
1.2.2.1 Data types
There are two major groups of data types (illustrated in the sketch after this list):
• Attribute data type: an object is represented by attributes; that is, each object x is given by a set of values A1(x), A2(x), ..., An(x).
• Relational data type: objects are represented by their relations with other objects, for instance x>y, y<z, x>z. In this example we may not know that x=3, y=1 and z=2. Thus the attributes of objects are not known, but their relations are known. Objects may have different attributes (e.g., x=5, y=2, and z=4), but still have the same relations.
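A small sketch of the contrast, using the hypothetical objects x, y, z above: the relational representation can be derived from attribute values, and different attribute values can yield exactly the same relations:

```python
def relations_from_attributes(objects):
    """Derive the relational representation (x>y, y<z, x>z, ...) from
    attribute values; afterwards only the relations are kept."""
    names = list(objects)
    relations = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if objects[a] > objects[b]:
                relations[(a, b)] = ">"
            elif objects[a] < objects[b]:
                relations[(a, b)] = "<"
            else:
                relations[(a, b)] = "="
    return relations

# Different attribute values, same relations between the objects.
print(relations_from_attributes({"x": 3, "y": 1, "z": 2}))
print(relations_from_attributes({"x": 5, "y": 2, "z": 4}))
```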
1.2.2.2 Data set and techniques
Fundamental and technical analyses are two widely used techniques in stock market forecasting.
- Fundamental analysis
Fundamental analysis tries to determine all the econometric variables that may influence the dynamics of a given stock price or exchange rate. Often it is hard to establish which of these variables are relevant and how to evaluate their effect.
- Technical analysis
Technical analysis assumes that when the sampling rate of a given economic variable is high, all the information necessary to predict the future values is contained in the time series itself. There are several difficulties in technical analysis for accurate prediction: successive ticks correspond to bids from different sources, the correlation between price variations may be low, time series are not stationary, good statistical indicators may not be known, different realizations of the random process may not be available, and the number of training examples may not be enough to accurately infer rules. Therefore, technical analysis can fit short-term predictions for stock market time series without great changes in the economic environment between successive ticks. Actually, technical analysis has been more successful in identifying market trends, which is much easier than forecasting future stock prices. Currently, different data mining techniques try to incorporate some of the most common technical analysis strategies in the pre-processing of data and in the construction of appropriate attributes.
Two major options exist: use the time series itself, or use all variables that may influence the evolution of the time series. Data mining methods do not restrict themselves to a particular option. They follow a fundamental analysis approach, incorporating all available attributes and their values, but they also do not exclude a technical analysis approach based only on a time series such as the stock price and parameters derived from it. The most popular time series are index value at open, index value at close, highest index value, lowest index value, trading volume, and lagged returns from the time series of interest. Fundamental factors include the price of gold, the retail sales index, industrial production indices, and foreign currency exchange rates. Technical factors include variables that are derived from time series, such as moving averages.
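A minimal sketch of deriving such technical factors from a price series; the moving-average window and the closing prices are hypothetical:

```python
def moving_average(series, window=5):
    """Simple moving average: a typical technical factor derived from a
    price series (undefined for the first window-1 points)."""
    return [sum(series[t - window + 1:t + 1]) / window
            for t in range(window - 1, len(series))]

def lagged_returns(series, lag=1):
    """Lagged returns: relative price changes over 'lag' periods."""
    return [(series[t] - series[t - lag]) / series[t - lag]
            for t in range(lag, len(series))]

close = [100.0, 101.5, 101.0, 102.3, 103.1, 102.8]  # hypothetical closes
print(moving_average(close, window=3))
print(lagged_returns(close))
```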
1.2.2.3 Mathematical algorithm (method, model)
A variety of statistical, neural network and logical methods have been developed. For example, there are many neural network models based on different mathematical algorithms, theories and methodologies. Combinations of different models may provide better performance than individual models. Many data mining methods assume a functional form of the relationship being modeled.
1.2.2.4 Form of relationships between objects
The next characteristic of a specific data mining methodology is the form of the relationship between objects. Many data mining methods assume a functional form of the relationship being modeled. For instance, linear discriminant analysis assumes linearity of the border that discriminates between two classes in the space of attributes. Often it is hard to justify such a functional form in advance. The RDM methodology in the stock market does not assume a functional form for the relationship. In addition, RDM algorithms do not assume the existence of derivatives. They can automatically learn symbolic relations on numerical data of stock market time series.
1.2.3 Approaches to stock market prediction
a. Physics approach and data mining approach
The impact of market players on market regularities stimulated a surge of attempts to use ideas of statistical physics in finance. If an observer is a large marketplace player, then such an observer can potentially change the regularities of the marketplace dynamically. Attempts to forecast in such a dynamic environment with thousands of active agents lead to much more complex models than traditional data mining models are designed for. This is one of the major reasons that such interactions are modeled using ideas from statistical physics rather than from statistical data mining. The physics approach in finance is also known as "econophysics" and the "physics of finance". The major difference from the data mining approach comes from the fact that, in essence, the data mining approach is not about developing specific methods for financial tasks, but the physics approach is.
b. Deterministic dynamic system approach
Stock market data are often represented as a time series of a variety of attributes such as stock prices and indexes. Time series prediction has been one of the ultimate challenges in mathematical modeling for many years. Currently, data mining methods try to enhance this study with new approaches. The dynamic system approach has been developed and applied successfully to many difficult problems in physics. Recently, several studies have applied this technique to the stock market. Usually, the history of the time series is provided and the goal is to extract from that data a dynamic system. The dynamic system models the relationship between a window of past values and a value T time steps ahead. The major steps of this approach are:
• Step 1: Development of a state space for the dynamic system, i.e., selecting and/or inventing attributes characterizing the system behavior.
• Step 2: Discovering the laws that govern the phenomenon, i.e., discovering relations between attributes of current and previous states (state vectors) in the form of differential equations.
• Step 3: Solving the differential equations to identify the transition function (rules).
• Step 4: Use of the transition function as a predictor of the next state of the dynamic system, e.g., the next day's stock value.
Inferring a set of rules for a dynamic system assumes that:
• There is enough information in the available data to sufficiently characterize the dynamics of the system with high accuracy.
• All of the variables that influence the time series are available, or they vary slowly enough that the system can be modeled adaptively.
• The system has reached some kind of stationary evolution.
• The system is a deterministic system.
• The evolution of the system can be described by means of a surface in the space of delayed values.
There are several applications of these methods to stock time series. However, the literature contains claims both for and against the existence of a chaotic deterministic system underlying the stock market. Recent research has focused on methods to distinguish stochastic noise from deterministic chaotic dynamics and, more generally, on constructing systems combining deterministic and probabilistic techniques.
1.2.4 Data mining methods in stock market
Almost every computational method has been explored and used for financial modeling. New developments augment the traditional technical analysis of stock market curves that has been used extensively by financial institutions. Such stock charting helps to identify buy/sell signals (timing "flags") using graphical patterns. Data mining, as a process of discovering useful patterns and correlations, has its own place in stock market modeling.
Similarly to other computational methods, almost every data mining method and technique has been used in financial modeling. An incomplete list includes a variety of linear and non-linear models, multi-layer neural networks, k-means and hierarchical clustering, k-nearest neighbors, decision tree analysis, regression (logistic regression, general multiple regression), ARIMA, principal component analysis, and Bayesian learning. Less traditional methods used include rough sets, RDM methods (deterministic inductive logic programming) and newer probabilistic methods, support vector machines, independent component analysis, Markov models and hidden Markov models.
1.2.4.1 Representation languages
a. Propositional Logic language
A proposition is a statement that can be true or false. Propositional logic uses true statements to form or prove other true statements. In other words, propositional logics are concerned with propositional (or sentential) operators, which may be applied to one or more propositions, giving new propositions.
Propositional logic has very limited expressive power. It is not adequate for formalizing valid arguments that rely on the internal structure of the propositions involved.
b. First-order logic language
First-order logic (FOL) is a system of deduction extending propositional logic by the ability to express relations between individuals. FOL languages support variables, relations, and complex expressions.
The FOL language differs from a propositional logic language mainly by the presence of variables. Therefore, a language of monadic functions and predicates is a FOL language, but a very restricted one.
c. Attribute-value languages
An attribute-value language is a propositional language in which propositions are attribute-value pairs that can be considered as predicates. In other words, in an attribute-value language, objects are described by tuples of attribute-value pairs, where each attribute represents some characteristic of the object.
Attribute-value languages are languages of monadic functions (functions of one variable) and monadic predicates (Boolean functions with only one argument). Such a language was not designed to represent relations that involve two, three or more objects.
d. Comparison of these languages
Many well-known rule learners are propositional, but propositional representations offer no general way to describe the essential relations among the values of the attributes. In contrast with propositional rules, first-order rules have an advantage in discovering relational assertions because they capture relations directly. Several types of hypotheses/rules presented in FOL are simple relational assertions with variables. Relational assertions can be conveniently expressed using first-order representations, while they are very difficult to describe using propositional representations.
Also, first-order rules allow one to express naturally other, more general hypotheses, not only the relation between pairs of attributes. These more general rules can serve both for classification problems and for an interval forecast of a continuous variable. Moreover, these rules are able to capture the Markov chain type of models used for stock market time series forecasting. That algorithms can be designed to learn sets of first-order rules that contain variables is significant, because first-order rules are much more expressive than propositional rules.
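A small sketch of such a more general first-order rule, here a Markov-chain-style pattern with variables over time points and states; the rule and the state names are illustrative, not taken from the thesis experiment:

```python
# A first-order rule with variables t and q:
#   Class(t-1, q) & Class(t, q) -> Class(t+1, q)
# reads: if the market was in state q both yesterday and today, predict
# state q for tomorrow. A propositional rule could only express this by
# enumerating every concrete state and date separately.

def markov_rule(states):
    """Apply the rule to a sequence of market states ('up'/'down')."""
    if len(states) >= 2 and states[-1] == states[-2]:
        return states[-1]  # the variable q binds to this state
    return None            # the rule does not fire

print(markov_rule(["up", "down", "down"]))  # 'down'
print(markov_rule(["up", "down", "up"]))    # None
```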
1.2.4.2 AVL-based methods
The common data mining methodology assuming the attribute data type is known as an attribute-based or attribute-value methodology. It covers a wide range of statistical and connectionist (neural network) methods. There are two types of attribute-value methods: the first is based on numerical expressions, and the second is based on logical expressions and operations.
Historically, methods based on AVLs, such as neural networks, the nearest neighbors method, and decision trees, dominate in financial applications of data mining. The reasons are: (1) in many areas, including the stock market, training data are naturally described by attributes of individual entities such as price, amount and so on; (2) relations between entities can be very useful for data mining; and (3) these methods are well-known, relatively simple, efficient, and can handle noisy data. However, these methods have two strong limitations: (1) a limited ability to represent background knowledge, and (2) the lack of complex relations.
1.2.4.3 Relational Data Mining methods
The less traditional relational methodology is based on the relational data type. Many RDM approaches are based on ILP, which refers to the collection of machine learning algorithms that use FOL as their language. The term relational data mining (RDM) is used in parallel with the terms Inductive Logic Programming (ILP) and First-Order Logic (FOL) methods to emphasize the goal: discovering relations. The term ILP reflects the techniques for discovering relations: logic programming. In particular, discovering relational regularities can be done without logical inference and in languages of higher order. Therefore, RDM is defined as discovering hidden relations (general first-order logic relations) in numerical and symbolic data using background knowledge (domain theory).
Relational Data Mining (RDM) technology is a data modeling approach that does not assume the functional form of the relationship being modeled a priori. It can automatically consider a large number of inputs (e.g., time series characterization parameters) and learn how to combine these to produce estimates for future values of a specific output variable. RDM combines recent advances in such areas as FOL, probabilistic inference and Representative Measurement Theory (RMT). This approach will play a key role in future advances in data mining methodology and practice.
The typical claim about RDM is that it cannot handle large data sets. This statement is based on the assumption that the initial data are provided in the form of relations.
a. Deterministic ILP methods
The purpose of ILP is to overcome the limitations of AVL-based methods. ILP systems naturally incorporate background knowledge and relations between objects into the learning process. They have a mechanism to represent background stock market knowledge in human-readable and understandable form. Obviously, understandable rules have advantages over a stock market forecast without explanations.
Traditionally, ILP methods were pure deterministic techniques, which originated in logic programming. There are well-known problems with deterministic methods, which are especially important for stock market applications with numerical data and a high level of noise:
• Limited facility for handling numerical data: historically, ILP methods solve only classification tasks without direct operations on numerical data.
• Relative inefficiency: statistically significant rules have an advantage in comparison with rules tested only for their performance on training and test data. Training and testing data can be too limited and/or not representative. If rules rely only on them, then there are more chances that these rules will not deliver a right forecast on other data. This is a hard problem for any data mining method, and especially for deterministic methods, including deterministic ILP.
b. Hybrid Probabilistic Relational methods
The purpose of RDM is to overcome the limitations of current relational methods (deterministic ILP methods). In the real world, RDM should handle imperfect (noisy) data and, in particular, imperfect numerical data.
Recent research has focused on methods to distinguish stochastic noise from deterministic chaotic dynamics and, more generally, on constructing systems combining deterministic and probabilistic techniques. RDM follows the same direction, moving from classical deterministic FOL rules to probabilistic first-order rules to avoid the limitations of deterministic systems. The combination benefits from noise-robust probabilistic inference and the highly expressive and understandable FOL rules employed in ILP.
Relational methods in finance such as Machine Methods for Discovering Regularities (MMDR) are equipped with a probabilistic mechanism that is necessary for time series with a high level of noise. The MMDR method handles an interval forecast of numeric variables with continuous values, like prices, along with solving classification tasks.
MMDR is one of the few Hybrid Probabilistic Relational Data Mining methods developed and applied to stock market data. In computational experiments, trading strategies developed based on MMDR consistently outperform trading strategies developed based on other data mining methods and the buy-and-hold strategy.
1.2.4.4 Comparison of AVL-based methods and relational methods
The table below summarizes the advantages and disadvantages of AVL-based methods and FOL-based methods.
Comparison of AVL-based methods and first-order logic methods
AVL-based methods
  Advantages for the learning process:
  - Simple, efficient, and handle noisy data
  - Appropriate learning time with a large number of training examples
  Disadvantages for the learning process:
  - Limited form of background knowledge
  - Lack of relations in the concept
FOL-based methods
  Advantages for the learning process:
  - Solid theoretical basis (first-order logic, logic programming)
  - Flexible form of background knowledge, problem representation, and problem-specific constraints
  - Understandable representation of background knowledge, and relations between examples
  Disadvantages for the learning process:
  - Inappropriate learning time with a large number of arguments in the relations
  - Weak facilities for processing numerical data
Data mining approaches that find patterns in a given single table are referred to as attribute-value or propositional learning approaches, as the patterns they find can be expressed in propositional logic. RDM approaches are also referred to as first-order learning approaches, or relational learning approaches, as the patterns they find are expressed in the relational formalism of first-order logic.
Selection of a method for discovering regularities in stock market time series is a very complex task. Uncertainty of problem descriptions and of methods' capabilities are among the most obvious difficulties in the process of selection. It is pointed out that attribute-based learners typically only accept available (background) knowledge in rather limited form. In contrast, relational learners support a general representation for background knowledge.
One of the main advantages of RDM over attribute-based learning is ILP's generality of representation for background knowledge. This enables the user to provide, in a more natural way, domain-specific background knowledge to be used in learning. The use of background knowledge enables the user both to develop a suitable problem representation and to introduce problem-specific constraints into the learning process. By contrast, attribute-based learners can typically accept background knowledge in rather limited form only.
CHAPTER II: RELATIONAL DATA MINING FOR
STOCK MARKET PREDICTION
II.1 Introduction
Data mining methods are developed to search for patterns in data, which can be discovered from unstructured data, semi-structured documents, or structured sources like relational databases. While most real-world databases store information in multiple tables, most of the data mining methods proposed so far operate on unstructured data (in the form of a single table). This "attribute-value" representation requires the data to be preprocessed and aggregated into a single table, which risks loss of meaning and/or information. More complex patterns are simply not expressible in attribute-value format and thus cannot be discovered. One way to enlarge the expressiveness is to generalize from one-table mining to multiple-table mining, i.e., to support mining on full relational databases. Relational data mining (RDM) approaches look for patterns that involve multiple tables (relations) from a relational database. To emphasize this fact, RDM is often referred to as multi-relational data mining (MRDM); the terms RDM and MRDM are used interchangeably.
RDM methods seem to be gaining momentum in different fields. Data mining in the stock market follows this trend and leads the application of RDM to multidimensional stock market time series. Examples and arguments for applications of RDM to the stock market raise expectations of great advances in the near future. Several publications have also stressed that the RDM area is moving toward probabilistic first-order rules to avoid the limitations of deterministic systems. Relational methods such as MMDR are equipped with a probabilistic mechanism that is necessary for time series with a high level of noise. Often, stock market data are represented as a time series of a variety of attributes such as stock prices and indexes. The MMDR method expresses patterns in first-order logic and assigns probabilities to rules generated by composing patterns. MMDR is then applied to discover regularities, which are used to predict the stock market.
This section introduces the RDM methodology and presents the MMDR algorithm as applied to stock market data.
II.2 Basic problems
II.2.1 First-order logic and rules
a. Basic concepts
A predicate is defined as a Boolean (two-valued) function or, equivalently, a subset of a set D = D1 × D2 × ... × Dn, where D1 can be a set of stock prices at moment t=1, D2 a set of stock prices at moment t=2, and so on. Predicates can be defined extensionally, as a list of tuples for which the predicate is true, or intensionally, as a set of (Horn) clauses for computing whether the predicate is true.
Let stock(t) be a stock price at time t, and consider the predicate UpDown(stock(t), stock(t+1), stock(t+2)), which is true if the stock goes up from date t to date t+1 and goes down from date t+1 to date t+2.
This predicate is presented extensionally in Table II.1 and intensionally using two other predicates, Up and Down:

Up(stock(t), stock(t+1)) & Down(stock(t+1), stock(t+2)) → UpDown(stock(t), stock(t+1), stock(t+2))

[Table II.1: extensional definitions of the predicates Up, Down, and UpDown as lists of truth values]
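To make the distinction concrete, here is a minimal Python sketch (not from the thesis; the price data and function names are illustrative) that defines Up and Down over a small price series and composes them into UpDown, mirroring the clause above; the list comprehension at the end recovers the extensional view.

```python
# Minimal sketch: intensional definition of UpDown from Up and Down.
# The price series and function names are illustrative, not from the thesis.

prices = {1: 100.0, 2: 103.5, 3: 101.2, 4: 104.0}  # stock(t) by date t

def up(t):
    """Up(stock(t), stock(t+1)): the price rises from t to t+1."""
    return prices[t] < prices[t + 1]

def down(t):
    """Down(stock(t), stock(t+1)): the price falls from t to t+1."""
    return prices[t] > prices[t + 1]

def up_down(t):
    """UpDown(stock(t), stock(t+1), stock(t+2)) = Up & Down."""
    return up(t) and down(t + 1)

# Extensional view: list the tuples (here, dates t) for which it holds.
print([t for t in (1, 2) if up_down(t)])   # -> [1]
```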
A literal is a predicate A or its negation (¬A); the latter is called a negative literal, and an unnegated predicate is called a positive literal. A clause body is a conjunction A1 & A2 & ... & At of literals A1, A2, ..., At. Often we omit the & operator and write A1 & A2 & ... & At as A1A2...At. A Horn clause consists of two components: a clause head (A0) and a clause body (A1A2...At). A clause head, A0, is defined as a single predicate. A Horn clause is written in two equivalent forms: A0 ← A1A2...At, or A1A2...At → A0, where each Ai is a literal. The second form is traditional in mathematical logic, while the first form is more common in applications.
A collection of Horn clauses with the same head A0 is called a rule. The collection can consist of a single Horn clause; therefore, a single Horn clause is also called a rule. Mathematically, the term collection is equivalent to the OR operator (∨); therefore, a rule with two bodies A1A2...At and B1B2...Bt can be written as A0 ← (A1A2...At ∨ B1B2...Bt).
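As an illustration, the following minimal Python sketch (all predicate names are hypothetical placeholders) evaluates a rule A0 ← (A1A2 ∨ B1B2) as a disjunction of conjunctions of its clause bodies:

```python
# Sketch: a rule A0 <- (A1 & A2) v (B1 & B2) evaluated on an element x.
# Predicate names and definitions are illustrative placeholders.

def rule_a0(x, predicates):
    """x satisfies the rule if it satisfies at least one clause body."""
    bodies = [("A1", "A2"), ("B1", "B2")]
    return any(all(predicates[p](x) for p in body) for body in bodies)

preds = {
    "A1": lambda x: x > 0,
    "A2": lambda x: x < 10,
    "B1": lambda x: x % 2 == 0,
    "B2": lambda x: x > 100,
}
print(rule_a0(5, preds))    # True: satisfies the body A1 & A2
print(rule_a0(50, preds))   # False: fails both bodies
```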
A k-tuple, a functional expression, and a term are the next concepts used in the relational approach. A finite sequence of k constants, denoted <a1, ..., ak>, is called a k-tuple of constants. A function applied to k-tuples is called a functional expression. A term is a constant, a variable, or a functional expression. Examples of terms are given in Table II.3.
[Table II.3: examples of expressions, marking constants, variables, and functional expressions as correct terms, and predicates and literals (e.g., "Stock x is traded on NASDAQ", Predicate(x,y)) as incorrect terms]
A k-tuple of terms can be constructed as a sequence of k terms. These concepts are used to define the concept of an atom. An atom is a predicate symbol applied to a k-tuple of terms. For example, a predicate symbol P can be applied to a 2-tuple of terms (v, w), producing an atom P(v, w) of arity 2. If P is the predicate ">" (greater than) and v = StockPrice(x) and w = StockPrice(y) are two terms, then they produce the atom StockPrice(x) > StockPrice(y), that is, the price of stock x is greater than the price of stock y. Predicate P uses the two terms v and w as its arguments; the number two is the arity of this predicate. If a predicate or function has k arguments, the number k is called the arity of the predicate or function symbol.
By convention, function and predicate symbols are denoted by Name/Arity. Functions may take a variety of values, but predicates may take only the Boolean values true and false. The meaning of the rule for a k-arity predicate is the set of k-tuples that satisfy the predicate. A tuple satisfies a rule if it satisfies one of the Horn clauses that define the rule. A unary (monadic) predicate is a predicate with arity 1; for example, NASDAQ(x) is a unary predicate.
b. Quantifiers
∀ means "for all". For example, ∀x StockPrice(x) ≥ 0 means that for all stocks, stock prices are non-negative.
∃ means "there exists". It is a way of stating the existence of some object in the world without explicitly identifying it. For example, ∃x StockPrice(SP500) > StockPrice(x) means that there is a stock x less expensive than SP500 at a given time. Using the introduced notation, the following clause can be written:

∃x (StockPrice(x) < $100 ← TradeVolume(x) < 100,000)

This is notation typical of the Prolog language. As already mentioned, more traditional logic notation uses the opposite order of expression:

∃x (TradeVolume(x) < 100,000 → StockPrice(x) < $100)

Both clauses are equivalent to the statement: "There is a stock such that if its trade volume is less than 100,000 per day, then its price is less than $100 per share." Combining two quantifiers, a more complex clause can be written:

∀x ∃y (StockPrice(y) < $100 ← TradeVolume(x) < 100,000)
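Over a finite universe of stocks, these quantifiers reduce to exhaustive checks with Python's all and any; a minimal sketch (prices and volumes are illustrative, not real data):

```python
# Sketch: checking quantified statements on a finite set of stocks.
# All prices and volumes below are illustrative.

price = {"SP500": 1400.0, "ABC": 35.0, "XYZ": 80.0}
volume = {"SP500": 2_000_000, "ABC": 50_000, "XYZ": 90_000}

# Forall x: StockPrice(x) >= 0
print(all(p >= 0 for p in price.values()))               # True

# Exists x: StockPrice(x) < StockPrice(SP500)
print(any(price[x] < price["SP500"] for x in price))     # True

# Exists x: TradeVolume(x) < 100,000 -> StockPrice(x) < $100
# (the implication is encoded as a material conditional: not A or B)
print(any((not volume[x] < 100_000) or price[x] < 100 for x in price))  # True
```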
Predicates defined by a collection of examples are called extensionally defined predicates, and predicates defined by a rule are called intensionally defined predicates. If predicates are defined by rules, then inference based on these predicates can be explained in terms of those rules. Similarly, the extensionally defined predicates correspond to the observable facts (or the operational predicates). A collection of intensionally defined predicates is also called domain knowledge or a domain theory.
II.2.2 Representative measurement theory
a. Problem definition
Rapid growth of databases is accompanied by a growing variety of types of numerical data. RDM has unique capabilities for discovering a wide range of human-readable, understandable regularities in databases with various data types. However, the use of symbolic relational data for numerical forecasting and the discovery of regularities in numerical data require the solution of two problems:
• Problem 1 is the development of a mechanism for transformation between numerical and relational data presentations. In particular, numerical stock market time series should be transformed into symbolic predicate form for discovering relational regularities.
• Problem 2 is the discovery of rules that are computationally tractable in a relational form.
Representative measurement theory (RMT) is a powerful mechanism for approaching both problems. RMT was motivated by the intention to formalize measurement in psychology to a level comparable with physics. Further study showed that measurement in physics itself should be formalized. Finally, a formal concept of measurement scale was developed. This concept expresses what we call a data type in data mining.
b. Some definitions
A relational structure A consists of a set A and relations S1, ..., Sn defined on A:

A = <A, S1, ..., Sn>

Each relation Si is a Boolean function (predicate) with ni arguments from A. The relational structure A = <A, S1, ..., Sn> is considered along with a relational structure of the same type:

R = <R, T1, ..., Tn>

Usually the set R is a subset of Re^m, m ≥ 1, where Re^m is the set of m-tuples of real numbers, and each relation Ti has the same arity ni as the corresponding relation Si; Ti and Si are called k-ary relations on R and A, respectively. Theoretically, it is not a formal requirement that R be numerical.
Next, the relational system A is interpreted as an empirical real-world system, and R is interpreted as a numerical system designed as a numerical representation of A. To formalize the idea of numeric representation, we define a homomorphism φ as a mapping from A to R. A mapping φ: A → R is called a homomorphism if for all i (i = 1, ..., n),

(a1, ..., ak(i)) ∈ Si ⇔ (φ(a1), ..., φ(ak(i))) ∈ Ti

or, in other notation, Si(a1, ..., ak(i)) ⇔ Ti(φ(a1), ..., φ(ak(i))).
Let Φ(A,R) be the set of all homomorphisms from A to R. It is possible that Φ(A,R) is empty or contains a variety of representations. Several theorems are proved in RMT about the contents of Φ(A,R). Theorems concerning whether Φ(A,R) is empty are called representation theorems; theorems concerning the size of Φ(A,R) are called uniqueness theorems.
Using the set of homomorphisms Φ(A,R), we can define the notion of permissible transformations and of data types (scale types). The most natural concept of a permissible transformation is a mapping of the numerical set R into itself that yields another "good" representation. More precisely, γ is permissible for Φ(A,R) if γ maps R into itself and, for every φ in Φ(A,R), γφ is also in Φ(A,R). For instance, the permissible transformations could be transformations x → rx or monotone transformations x → γ(x).
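A small Python sketch (a hypothetical three-element empirical system, not from the thesis) can make these definitions concrete: it checks S(a, b) ⇔ φ(a) > φ(b) for all pairs, and then verifies that a monotone transformation of φ is again a homomorphism.

```python
from itertools import product

# Empirical system: elements and one binary relation S ("is preferred to").
A = ["low", "mid", "high"]
S = {("mid", "low"), ("high", "low"), ("high", "mid")}

# Candidate numeric representation phi: A -> R, with T taken as ">" on reals.
phi = {"low": 1.0, "mid": 2.0, "high": 5.0}

def is_homomorphism(A, S, phi):
    """Check S(a, b) <=> phi(a) > phi(b) for every pair (a, b)."""
    return all(((a, b) in S) == (phi[a] > phi[b])
               for a, b in product(A, repeat=2))

print(is_homomorphism(A, S, phi))          # True

# A monotone transformation of phi yields another "good" representation:
gamma = {k: 10 * v + 3 for k, v in phi.items()}
print(is_homomorphism(A, S, gamma))        # still True
```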
c. Results from measurement theory for learning algorithms
- Regularities discovered with a numeric data presentation can also be discovered using a relational data representation
RMT yields the following result: any numerical type of data can be transformed into a relational form with complete preservation of the relevant properties of the numeric data type. This means that all regularities that can be discovered with a numeric data presentation can also be discovered using a relational data representation.
Many theorems support this property mathematically. These theorems are called homomorphism theorems, because they match (via set homomorphisms) numerical properties and predicates. This is, in fact, a theoretical basis for logical data presentation without losing empirical information and without generating meaningless predicates and numeric relations. Moreover, RMT changes the paradigm of data types: RMT argues that a numerical presentation is itself a relational data form, not the primary numerical form; it is a derivative presentation of an empirical relational data type. The theorems mentioned support this idea.
- The ordering relation is the most important relation for the transformation
The next critical result from measurement theory for learning algorithms is that the ordering relation is the most important relation for this transformation. Most practical regularities can be written in this form and discovered using an ordering relation. This relation is central to the transformation between numerical and relational data presentations.
- Speeding up the search for regularities with the help of the data type hierarchy developed in RMT
The hierarchy of data types developed in RMT can help to speed up the search for regularities. The search begins with testing rules based on properties of weaker scales and finishes with properties of stronger scales as defined in RMT. The number of search computations for weaker scales is smaller than for stronger scales. This idea is actually implemented in the MMDR method. MMDR begins by discovering regularities with monadic (unary) predicates, e.g., "x is larger than the constant 5" (x > 5), and only then discovers regularities with the ordering relation x > y on two variables; in other words, regularities based on ordering relations, which are more complex first-order logic relations, come later. A sketch of this staged search is given below.
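The following minimal Python sketch (thresholds and data are illustrative; this is a loose interpretation of the staged idea, not the actual MMDR implementation) evaluates cheap monadic threshold predicates before the binary ordering predicate.

```python
# Sketch: search ordered from weaker (monadic) to stronger (ordering) predicates.
# The series and thresholds below are illustrative.

series = [4.0, 6.5, 6.0, 7.2, 8.1]

# Stage 1: monadic predicates, e.g. "x > c" for a few candidate thresholds.
monadic = [(f"x > {c}", lambda x, c=c: x > c) for c in (5, 7)]

# Stage 2: a binary ordering predicate on consecutive values.
ordering = [("x(t) > x(t-1)", lambda a, b: b > a)]

for name, p in monadic:                      # cheap hypotheses tested first
    support = sum(p(x) for x in series) / len(series)
    print(f"monadic  {name}: support {support:.2f}")

for name, p in ordering:                     # stronger scale, tested later
    pairs = list(zip(series, series[1:]))
    support = sum(p(a, b) for a, b in pairs) / len(pairs)
    print(f"ordering {name}: support {support:.2f}")
```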
- Speeding up the search for regularities by reducing the space of hypotheses
Measurement theory suggests speeding up the search by searching in a smaller space of hypotheses. The reason is that there is no need to assume any particular class of numerical functions and then select a forecasting function in that class. In contrast, numerical interpolation approaches assume a class of numerical functions. RMT has several theorems that match classes of numeric functions with relations.
- Avoidance of incorrect data preprocessing
RMT helps to avoid incorrect data preprocessing, which may violate data type properties. This is an especially sensitive issue if preprocessing involves combinations of several features. Usually a discovered regularity changes significantly if uncoordinated preprocessing of attributes is applied; in this way, patterns can be corrupted. In particular, measurement theory can help to select appropriate preprocessing mechanisms for stock market data, and a better preprocessing mechanism speeds up the search for regularities.
d. Transformation of numerical data into relational form
There are two steps in the transformation of numerical data into relational form:
• Extracting, generating, and discovering essential predicates (relations)
• Matching these essential predicates with numerical data
Sometimes this is straightforward; sometimes it is a much more complex task, especially considering that the computational complexity of the problem grows exponentially with the number of new predicates. RDM can also be viewed as a tool for discovering predicates, which will then be used for solving a target problem by some other methods. In this way, the whole data mining area can be considered a predicate discovery technology.
The RMT theorems determine the numerical representations of the attributes and laws for the corresponding sets of axioms. RMT treats the numerical representations of the values and laws as merely numerical codes of the algebraic structures representing the operational properties of those values and laws. Thus, the algebraic structures are the primary representations of values and laws. The main statements and results of measurement theory are the following:
• Numerical representations of quantities and laws of nature are determined by the set of axioms for the corresponding empirical systems, and by algebraic systems with certain sets of relations and operations;
• The numerical representations are unique up to certain sets of permissible transformations (such as a change of measurement units);
• All physical attributes may be embedded into the structure of physical quantities;
• Physical laws are simple because all attributes involved in a law are simultaneously scaled by a special scaling process;
• The axiomatic approach is applicable not only to physical attributes and laws, but also to many attributes and laws of other domains (such as psychology).
e. Empirical axiomatic theory
Several studies have shown that actual success in discovering understandable regularities in databases depends significantly on the use of data type information. Data type information is the first source of predicates for relational data presentation.
RMT is intended for generating operations and predicates as a description of a data type. This RMT mechanism is called an empirical axiomatic theory. A variety of data types, such as matrices of pairwise comparisons, multiple comparisons, attribute-based matrices, matrices of ordered data, and matrices of closeness, can be represented in this way. The relational representation of data and data types is then used for discovering understandable regularities. We argue that current learning methods utilize only part of the data type information actually present in the data: they either lose part of the data type information or add some non-interpretable information.
The language of empirical axiomatic theories is an adequate lossless language for this goal. There is an important difference between the language of empirical axiomatic theories and the FOL language. The FOL language does not say anything about real-world entities. The language of empirical axiomatic theories, by contrast, uses the FOL language as a mathematical mechanism but also incorporates additional concepts to meet the strong requirement of empirical interpretation of relations and operations.
II.2.3 Breadth-first search
In graph theory, breadth-first search (BFS) is a graph search algorithm that begins at the root node and explores all the neighboring nodes. Then, for each of those nearest nodes, it explores their unexplored neighbor nodes, and so on, until it finds the goal.
- Components of search algorithms
Search algorithms share a number of common components based on linear data structures; additionally, certain mechanisms need to be built into the graph data structures in order to recover the routes discovered by the search algorithm. These components are as follows:
• An open list: a list of nodes under consideration. Implementing an open list is a straightforward matter, and usually a linear data structure is used (typically a queue or an ordered queue).
• A closed list: a list of nodes already considered. The closed list can also be represented by a queue, or a flag can be added to each node, set to zero while the node is unvisited and to one once it has been visited.
• A parentage mechanism: a mechanism for recovering the routes discovered by the search algorithm. The parentage mechanism works by recording which nodes result in others being placed on the open list.
- The breadth-first search algorithm
The breadth-first search algorithm requires that the identity of the starting node of the search and the destination node be known. The algorithm then searches through the whole graph and attempts to find a route from the start node to the destination node. Here is a broad outline of the breadth-first search algorithm:

put the starting node onto the open list
while the open list is not empty:
    n = node removed from the open list
    if n is the destination, stop
    otherwise, for each unvisited node s connected to n:
        add s to the open list
        mark n as the parent of s

This algorithm searches through all of the nodes added to the open list until the destination node appears in the queue, at which point the algorithm stops. Having performed this search, we recover the route by starting at the destination node and following the parentage of each node until we return to the start node.
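A runnable Python version of this outline (the graph and node labels are illustrative), with the open list as a queue, the parent map doubling as the closed list, and route recovery by walking the parentage back from the destination:

```python
from collections import deque

def bfs_route(graph, start, goal):
    """Breadth-first search returning a route from start to goal, or None."""
    open_list = deque([start])      # open list: nodes awaiting consideration
    parent = {start: None}          # parentage mechanism; also marks visited

    while open_list:
        n = open_list.popleft()
        if n == goal:               # destination reached: recover the route
            route = []
            while n is not None:
                route.append(n)
                n = parent[n]
            return route[::-1]
        for s in graph.get(n, ()):
            if s not in parent:     # unvisited node
                parent[s] = n
                open_list.append(s)
    return None

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(bfs_route(graph, "A", "E"))   # ['A', 'B', 'D', 'E']
```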
II.2.4 Occam's razor principle
The MMDR method can use Occam's razor, a principle commonly attributed to William of Occam (early 14th century), which states: "Entities should not be multiplied beyond necessity." This principle is generally interpreted as: "Among the theories that are consistent with the observed phenomena, one should select the simplest theory," or "When you have two competing theories which make exactly the same predictions, the simpler one is the better." Occam's razor suggests that among all the hypotheses that are correct for all (or for most of) the training examples, one should select the simplest; it can be expected that this hypothesis is most likely to capture the structure inherent in the problem and to achieve high prediction accuracy on objects outside the training set.
This principle is frequently used by noise-handling algorithms (e.g., rule truncation and tree pruning), as noise handling aims at simplifying the generated rules or decision trees in order to avoid overfitting a noisy training set.
Despite the successful use of Occam's razor as the basis for hypothesis construction, several problems arise in practice. First is the problem of defining the most appropriate complexity measure for identifying the simplest hypothesis, since different measures can select different simplest hypotheses for the same training set. Second, recent experimental work has clearly shown that application of Occam's razor may not always lead to the best prediction accuracy. Further empirical evidence against the use of Occam's razor is provided by boosting and bagging approaches, in which an ensemble of classifiers (a more complex hypothesis) typically achieves better accuracy than any single classifier. In addition to experimental evidence, much disorientation was caused by the so-called "conservation law of generalization performance". Although it is rather clear that real-world learning tasks differ from the set of all theoretically possible learning tasks, there remains the so-called "selective superiority problem": each algorithm performs best in some but not all domains.
II.3 Theory of RDM
In comparison with other data mining methods, the relational data mining approach is considered from the point of view of its data types, representation languages (to manipulate and interpret data), and class of hypotheses (to be tested on data). The RDM approach overcomes the limitations that particular data mining methods inherit from their data types, languages, and hypothesis classes by unlimited extension of the data type notion and hypothesis classes using FOL.
II.3.1 Data types in RDM
a. Problems of data types
The design of a particular data mining system implies the selection of the set of data types supported by that system. In object-oriented programming (OOP), this is part of the software design: data types are declared in the process of software development. If the data types of a particular learning problem are out of the range of the data mining system, users have two options: to redesign the system or to corrupt the types of the input training data to fit the system. The first option often does not exist at all, and the second produces inappropriate results.
There are two solutions to the problem:
• The first is to develop data type conversion mechanisms that work correctly within a data mining tool with a limited number of data types. For example, if the input data are of a cyclical data type and only linear data types are supported by the data mining tool, one may develop a coding of the cyclical data such that a linear mechanism will process the data correctly.
• Another solution is to develop universal data mining tools that handle any data type. In this case, the member functions of a data type should be input information along with the data (MMDR implements this approach). This problem is more general than the problem of a specific data mining tool.
In financial applications, the data are usually presented as numeric attributes, but relations are often not presented explicitly. More precisely, these attributes are coded with numbers, but the applicability of number relations and operations must be confirmed. Let the relative difference for the stock price,

Δ(t) = [S(t) − S(t−1)] / S(t),

be of a "float" data type. This is correct for computer memory allocation, but it does not help to decide whether all operations on float numbers are applicable to Δ(t). For instance, what does it mean to add one relative difference Δ(x) to another Δ(y)? There is no empirical procedure matching this sum operation. However, the comparison operation makes sense: e.g., Δ(x) < Δ(y) means faster growth of the stock price on date y than on date x. This relation also helps interpret the relation "Δ(w) is between Δ(x) and Δ(y)" as

Δ(x) < Δ(w) and Δ(w) < Δ(y), or Δ(y) < Δ(w) and Δ(w) < Δ(x).

Both of these relations are already interpreted empirically.
Therefore, Δ values can be compared, but one should probably avoid the addition operation (+) if the goal is to produce an interpretable learned rule. If one ignores these observations and applies operations that are formally proper for float numbers in programming languages, a learned rule will be difficult to interpret. As already mentioned, these difficulties arise from the uncertainty of the set of interpretable operations and predicates for the data, i.e., the uncertainty of the empirical content of the data.
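A short Python sketch of this point (the prices are illustrative): Δ(t) is computed numerically, but only the order relations on Δ values are treated as empirically meaningful.

```python
# Sketch: relative differences support comparison, but their sum has no
# empirical meaning. The price series below is illustrative.

S = {1: 50.0, 2: 52.0, 3: 51.0, 4: 54.0}

def delta(t):
    """Relative difference: (S(t) - S(t-1)) / S(t)."""
    return (S[t] - S[t - 1]) / S[t]

# Interpretable: ordering of growth rates between two dates.
print(delta(2) > delta(3))    # True: faster growth on date 2 than on date 3

# Interpretable: "delta(w) is between delta(x) and delta(y)".
x, w, y = 3, 2, 4
print(min(delta(x), delta(y)) < delta(w) < max(delta(x), delta(y)))  # True

# Avoided on purpose: delta(2) + delta(3) has no matching empirical procedure.
```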
b. Levels of data type
- Single-level data type
A data type (type for short) in modern object-oriented programming (OOP) languages is a rich data structure <A, P, F>. It consists of elements A = {a1, a2, ..., an}, relations between elements (predicates) P = {P1, P2, ..., Pm}, and meaningful operations on elements F = {F1, F2, ..., Fk}. Operations may involve two, three, or more elements, e.g., c = a # b, where # is an operation on elements a and b producing element c. This definition formalizes the concept of a single-level data type.
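A single-level data type in this sense can be sketched directly in an OOP style; the following Python fragment (the class name and the weekday example are hypothetical) bundles the elements A, the predicates P, and the operations F into one structure.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DataType:
    """A data type as a structure <A, P, F>."""
    elements: set                                                   # A
    predicates: dict[str, Callable] = field(default_factory=dict)   # P
    operations: dict[str, Callable] = field(default_factory=dict)   # F

# Example: weekdays with an ordering predicate but no addition operation.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
weekday = DataType(
    elements=set(DAYS),
    predicates={"earlier": lambda a, b: DAYS.index(a) < DAYS.index(b)},
    operations={},   # no meaningful '+' exists for weekdays
)
print(weekday.predicates["earlier"]("Mon", "Wed"))   # True
```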
Traditional AVLs operate with much simpler single-level data types. Implicitly, each attribute in an AVL reflects a type, which can take a number of possible values; these values are the elements of A. It is common in AVLs for a data type to be given only implicitly: usually, the relations P and operations F are not expressed explicitly. However, such data types can be embedded explicitly into AVLs.
- Multilevel data type
A multilevel data type can be defined by considering each element ai from A as a composite data structure (data type) instead of as an atom.
- Complex data types and selector functions
Each data type is associated with a relational system, which includes: (1) cardinality, (2) permissible operations on data type elements, and (3) permissible relations between data type elements. In turn, each data type element may consist of its own subelements with their own types. Selector functions serve for extracting subterms from terms; without selector functions, the internal structure of the type could not be accessed.
II.3.2 Relational representation of examples
Knowledge representation is an important and informal initial step in RDM. There are many ways to represent knowledge in the FOL language: one of them can omit important information; another can hide it. As a result, data mining algorithms may work too long to "dig out" the relevant information, or may even produce inappropriate rules. Introducing data types and the concepts of representative measurement theory (RMT) into the knowledge representation process helps to address this representation problem. In fact, measurement theory has developed a wide set of data types.
Relational representation of examples is the key to relational data mining. If examples are already given in relational form, relational methods can be applied directly. For attribute-based examples, this is not the case: it requires expressing attribute-based examples and their data types in relational form. There are two major ways to express attribute-based examples using predicates:
- Generating predicates for each value
To express the price $909.0 from Table II.4 in predicate form, we may generate a predicate P9090(x) such that P9090(x) = true if and only if the price is equal to $909.0. In this way, we would be forced to generate about 10,000 predicates if prices range from $1 to $1000 in $0.10 steps. In this case, the price data type has not yet been fully presented by the P9090(x) predicate; therefore, additional relations expressing this data type should be introduced. For example, it can be a